How to get the CPU cycle count in x86_64 from C++?
Solution 1
As of GCC 4.5, the __rdtsc() intrinsic is supported by both MSVC and GCC, but the header you need to include differs:
#ifdef _WIN32
#include <intrin.h>
#else
#include <x86intrin.h>
#endif
Here's the original answer before GCC 4.5.
Pulled directly out of one of my projects:
#include <stdint.h>
// Windows
#ifdef _WIN32
#include <intrin.h>
uint64_t rdtsc(){
return __rdtsc();
}
// Linux/GCC
#else
uint64_t rdtsc(){
unsigned int lo,hi;
__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
return ((uint64_t)hi << 32) | lo;
}
#endif
This GNU C Extended asm tells the compiler:
- volatile: the outputs aren't a pure function of the inputs (so it has to re-run every time, not reuse an old result).
- "=a"(lo) and "=d"(hi): the output operands are fixed registers: EAX and EDX (x86 machine constraints). The x86 rdtsc instruction puts its 64-bit result in EDX:EAX, so letting the compiler pick an output with "=r" wouldn't work: there's no way to ask the CPU for the result to go anywhere else.
- ((uint64_t)hi << 32) | lo: zero-extend both 32-bit halves to 64-bit (because lo and hi are unsigned), and logically shift + OR them together into a single 64-bit C variable. In 32-bit code, this is just a reinterpretation; the values still just stay in a pair of 32-bit registers. In 64-bit code you typically get actual shift + OR asm instructions, unless the high half optimizes away.
(editor's note: this could probably be more efficient if you used unsigned long instead of unsigned int. Then the compiler would know that lo was already zero-extended into RAX. It wouldn't know that the upper half was zero, so | and + are equivalent if it wanted to merge a different way. The intrinsic should in theory give you the best of both worlds as far as letting the optimizer do a good job.)
https://gcc.gnu.org/wiki/DontUseInlineAsm if you can avoid it. But hopefully this section is useful if you need to understand old code that uses inline asm so you can rewrite it with intrinsics. See also https://stackoverflow.com/tags/inline-assembly/info
Solution 2
Your inline asm is broken for x86-64. "=A" in 64-bit mode lets the compiler pick either RAX or RDX, not EDX:EAX. See this Q&A for more.
You don't need inline asm for this. There's no benefit; compilers have built-ins for rdtsc and rdtscp, and (at least these days) all define a __rdtsc intrinsic if you include the right headers. But unlike almost all other cases (https://gcc.gnu.org/wiki/DontUseInlineAsm), there's no serious downside to asm, as long as you're using a good and safe implementation like @Mysticial's.
(One minor advantage to asm is if you want to time a small interval that's certainly going to be less than 2^32 counts, you can ignore the high half of the result. Compilers could do that optimization for you with a uint32_t time_low = __rdtsc() intrinsic, but in practice they sometimes still waste instructions doing shift / OR.)
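A minimal sketch of the low-half idea, assuming an x86 target and one of gcc/clang/MSVC. The helper names read_tsc_low and elapsed_low are made up for illustration. Unsigned 32-bit subtraction wraps modulo 2^32, so the difference stays correct even if the low half rolls over once, as long as the real interval is shorter than 2^32 ticks.

```cpp
#include <cstdint>
#ifdef _MSC_VER
#  include <intrin.h>
#else
#  include <x86intrin.h>
#endif

// Hypothetical helper: keep only the low 32 bits of the TSC.
inline std::uint32_t read_tsc_low() {
    return static_cast<std::uint32_t>(__rdtsc());
}

// Hypothetical helper: elapsed ticks between two low-half readings.
// Unsigned arithmetic wraps modulo 2^32, so a single roll-over of the
// low half between the two readings does not corrupt the result.
inline std::uint32_t elapsed_low(std::uint32_t start, std::uint32_t end) {
    return end - start;
}
```

Whether the compiler actually drops the shift / OR for the cast is up to the optimizer, so check the generated asm if that one uop matters to you.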
Unfortunately MSVC disagrees with everyone else about which header to use for non-SIMD intrinsics.
Intel's intrinsics guide says _rdtsc (with one underscore) is in <immintrin.h>, but that doesn't work on gcc and clang. They only define SIMD intrinsics in <immintrin.h>, so we're stuck with <intrin.h> (MSVC) vs. <x86intrin.h> (everything else, including recent ICC). For compat with MSVC, and Intel's documentation, gcc and clang define both the one-underscore and two-underscore versions of the function.
Fun fact: the double-underscore version returns an unsigned 64-bit integer, while Intel documents _rdtsc() as returning (signed) __int64.
// valid C99 and C++
#include <stdint.h> // <cstdint> is preferred in C++, but stdint.h works.
#ifdef _MSC_VER
# include <intrin.h>
#else
# include <x86intrin.h>
#endif
// optional wrapper if you don't want to just use __rdtsc() everywhere
inline
uint64_t readTSC() {
// _mm_lfence(); // optionally wait for earlier insns to retire before reading the clock
uint64_t tsc = __rdtsc();
// _mm_lfence(); // optionally block later instructions until rdtsc retires
return tsc;
}
// requires a Nehalem or newer CPU. Not Core2 or earlier. IDK when AMD added it.
inline
uint64_t readTSCp() {
unsigned dummy;
return __rdtscp(&dummy); // waits for earlier insns to retire, but allows later to start
}
Compiles with all 4 of the major compilers: gcc/clang/ICC/MSVC, for 32 or 64-bit. See the results on the Godbolt compiler explorer, including a couple test callers.
These intrinsics were new in gcc4.5 (from 2010) and clang3.5 (from 2014). gcc4.4 and clang 3.4 on Godbolt don't compile this, but gcc4.5.3 (April 2011) does. You might see inline asm in old code, but you can and should replace it with __rdtsc(). Compilers over a decade old usually make slower code than gcc6, gcc7, or gcc8, and have less useful error messages.
The MSVC intrinsic has (I think) existed far longer, because MSVC never supported inline asm for x86-64. ICC13 has __rdtsc in immintrin.h, but doesn't have an x86intrin.h at all. More recent ICC have x86intrin.h, at least the way Godbolt installs them for Linux they do.
You might want to define them as signed long long, especially if you want to subtract them and convert to float. int64_t -> float/double is more efficient than uint64_t on x86 without AVX512. Also, small negative results could be possible because of CPU migrations if TSCs aren't perfectly synced, and that probably makes more sense than huge unsigned numbers.
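A small sketch of the signed-difference idea, using made-up helper names tsc_delta and tsc_delta_as_double. It assumes the usual two's-complement wrap when converting an out-of-range uint64_t to int64_t (guaranteed since C++20, and what gcc/clang/MSVC do in practice before that).

```cpp
#include <cstdint>

// Hypothetical helper: signed difference between two raw TSC readings.
// If 'end' is slightly behind 'start' (e.g. after a migration between
// imperfectly synced TSCs), the result is a small negative number
// instead of a huge unsigned one.
inline std::int64_t tsc_delta(std::uint64_t start, std::uint64_t end) {
    return static_cast<std::int64_t>(end - start);
}

// Hypothetical helper: the same difference as a double. The conversion
// starts from a signed 64-bit integer, which is the cheaper case on x86
// without AVX512.
inline double tsc_delta_as_double(std::uint64_t start, std::uint64_t end) {
    return static_cast<double>(tsc_delta(start, end));
}
```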
BTW, clang also has a portable __builtin_readcyclecounter() which works on any architecture. (Always returns zero on architectures without a cycle counter.) See the clang/LLVM language-extension docs.
For more about using lfence (or cpuid) to improve repeatability of rdtsc and control exactly which instructions are / aren't in the timed interval by blocking out-of-order execution, see @HadiBrais' answer on clflush to invalidate cache line via C function and the comments for an example of the difference it makes.
See also Is LFENCE serializing on AMD processors? (TL:DR yes with Spectre mitigation enabled, otherwise kernels leave the relevant MSR unset so you should use cpuid to serialize.) It's always been defined as partially-serializing on Intel.
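A rough sketch of fencing a timed region with lfence, assuming SSE2 is enabled (always true for x86-64) and that lfence actually blocks out-of-order execution on your CPU (see the AMD caveat above). The name timed_ticks is made up for illustration; this is one possible arrangement of the fences, not the only reasonable one.

```cpp
#include <cstdint>
#ifdef _MSC_VER
#  include <intrin.h>
#else
#  include <x86intrin.h>
#endif

// Hypothetical helper: run 'work' once and return the TSC ticks it took.
template <class F>
std::uint64_t timed_ticks(F&& work) {
    _mm_lfence();                            // let earlier instructions finish first
    const std::uint64_t start = __rdtsc();
    _mm_lfence();                            // don't let the timed work start before rdtsc has executed
    work();
    _mm_lfence();                            // wait for the timed work to finish
    const std::uint64_t end = __rdtsc();
    _mm_lfence();                            // keep later instructions out of the interval
    return end - start;
}
```

Usage would look something like timed_ticks([&]{ do_something(); }), where do_something stands in for your own code. The result is in reference cycles and includes the overhead of the fences and of rdtsc itself.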
How to Benchmark Code Execution Times on Intel® IA-32 and IA-64 Instruction Set Architectures, an Intel white-paper from 2010.
rdtsc counts reference cycles, not CPU core clock cycles
It counts at a fixed frequency regardless of turbo / power-saving, so if you want uops-per-clock analysis, use performance counters. rdtsc is exactly correlated with wall-clock time (not counting system clock adjustments), so it's a perfect time source for steady_clock.
The TSC frequency used to always be equal to the CPU's rated frequency, i.e. the advertised sticker frequency. In some CPUs it's merely close, e.g. 2592 MHz on an i7-6700HQ 2.6 GHz Skylake, or 4008MHz on a 4000MHz i7-6700k. On even newer CPUs like i5-1035 Ice Lake, TSC = 1.5 GHz, base = 1.1 GHz, so disabling turbo won't even approximately work for TSC = core cycles on those CPUs.
If you use it for microbenchmarking, include a warm-up period first to make sure your CPU is already at max clock speed before you start timing. (And optionally disable turbo and tell your OS to prefer max clock speed to avoid CPU frequency shifts during your microbenchmark).
Microbenchmarking is hard: see Idiomatic way of performance evaluation? for other pitfalls.
Instead of TSC at all, you can use a library that gives you access to hardware performance counters. The complicated but low-overhead way is to program perf counters and use rdpmc in user-space (rdmsr itself is privileged), or simpler ways include tricks like perf stat for part of program if your timed region is long enough that you can attach a perf stat -p PID.
You usually will still want to keep the CPU clock fixed for microbenchmarks, though, unless you want to see how different loads will get Skylake to clock down when memory-bound or whatever. (Note that memory bandwidth / latency is mostly fixed, using a different clock than the cores. At idle clock speed, an L2 or L3 cache miss takes many fewer core clock cycles.)
- Negative clock cycle measurements with back-to-back rdtsc? the history of RDTSC: originally CPUs didn't do power-saving, so the TSC was both real-time and core clocks. Then it evolved through various barely-useful steps into its current form of a useful low-overhead timesource decoupled from core clock cycles (constant_tsc), which doesn't stop when the clock halts (nonstop_tsc). Also some tips, e.g. don't take the mean time, take the median (there will be very high outliers).
- std::chrono::clock, hardware clock and cycle count
- Getting cpu cycles using RDTSC - why does the value of RDTSC always increase?
- Lost Cycles on Intel? An inconsistency between rdtsc and CPU_CLK_UNHALTED.REF_TSC
- measuring code execution times in C using RDTSC instruction lists some gotchas, including SMI (system-management interrupts) which you can't avoid even in kernel mode with cli, and virtualization of rdtsc under a VM. And of course basic stuff like regular interrupts being possible, so repeat your timing many times and throw away outliers.
- Determine TSC frequency on Linux. Programmatically querying the TSC frequency is hard and maybe not possible, especially in user-space, or may give a worse result than calibrating it. Calibrating it using another known time-source takes time. See that question for more about how hard it is to convert TSC to nanoseconds (and that it would be nice if you could ask the OS what the conversion ratio is, because the OS already did it at bootup).
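To illustrate the "calibrate against another time source" option from the last item, here is a rough sketch that compares the TSC against std::chrono::steady_clock over a short busy-wait. The function name estimate_tsc_hz and the 50 ms default window are assumptions for illustration; a longer window gives a better estimate, and the result is only meaningful with a constant-rate TSC and no migration to a core with an unsynced TSC during the window.

```cpp
#include <chrono>
#include <cstdint>
#ifdef _MSC_VER
#  include <intrin.h>
#else
#  include <x86intrin.h>
#endif

// Hypothetical helper: estimate TSC ticks per second.
inline double estimate_tsc_hz(
    std::chrono::milliseconds window = std::chrono::milliseconds(50)) {
    using clock = std::chrono::steady_clock;
    const auto t0 = clock::now();
    const std::uint64_t c0 = __rdtsc();
    while (clock::now() - t0 < window) {
        // busy-wait so the core stays awake for the whole window
    }
    const std::uint64_t c1 = __rdtsc();
    const auto t1 = clock::now();
    const double seconds = std::chrono::duration<double>(t1 - t0).count();
    return static_cast<double>(c1 - c0) / seconds;
}
```

This is the slow and approximate route the linked question talks about; if you only need relative timings, staying in raw ticks avoids the problem entirely.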
If you're microbenchmarking with RDTSC for tuning purposes, your best bet is to just use ticks and skip even trying to convert to nanoseconds. Otherwise, use a high-resolution library time function like std::chrono or clock_gettime. See faster equivalent of gettimeofday for some discussion / comparison of timestamp functions, or reading a shared timestamp from memory to avoid rdtsc entirely if your precision requirement is low enough for a timer interrupt or thread to update it.
See also Calculate system time using rdtsc about finding the crystal frequency and multiplier.
CPU TSC fetch operation especially in multicore-multi-processor environment says that Nehalem and newer have the TSC synced and locked together for all cores in a package (along with the invariant = constant and nonstop TSC feature). See @amdn's answer there for some good info about multi-socket sync.
(And apparently usually reliable even for modern multi-socket systems as long as they have that feature, see @amdn's answer on the linked question, and more details below.)
CPUID features relevant to the TSC
Using the names that Linux /proc/cpuinfo uses for the CPU features, and other aliases for the same feature that you'll also find.
- tsc - the TSC exists and rdtsc is supported. Baseline for x86-64.
- rdtscp - rdtscp is supported.
- tsc_deadline_timer CPUID.01H:ECX.TSC_Deadline[bit 24] = 1 - local APIC can be programmed to fire an interrupt when the TSC reaches a value you put in IA32_TSC_DEADLINE. Enables "tickless" kernels, I think, sleeping until the next thing that's supposed to happen.
- constant_tsc: Support for the constant TSC feature is determined by checking the CPU family and model numbers. The TSC ticks at constant frequency regardless of changes in core clock speed. Without this, RDTSC does count core clock cycles.
- nonstop_tsc: This feature is called the invariant TSC in the Intel SDM manual and is supported on processors with CPUID.80000007H:EDX[8]. The TSC keeps ticking even in deep sleep C-states. On all x86 processors, nonstop_tsc implies constant_tsc, but constant_tsc doesn't necessarily imply nonstop_tsc. No separate CPUID feature bit; on Intel and AMD the same invariant TSC CPUID bit implies both constant_tsc and nonstop_tsc features. See Linux's x86/kernel/cpu/intel.c detection code, and amd.c was similar.
Some of the processors (but not all) that are based on the Saltwell/Silvermont/Airmont even keep TSC ticking in ACPI S3 full-system sleep: nonstop_tsc_s3. This is called always-on TSC. (Although it seems the ones based on Airmont were never released.)
For more details on constant and invariant TSC, see: Can constant non-invariant tsc change frequency across cpu states?.
- tsc_adjust: CPUID.(EAX=07H, ECX=0H):EBX.TSC_ADJUST (bit 1). The IA32_TSC_ADJUST MSR is available, allowing OSes to set an offset that's added to the TSC when rdtsc or rdtscp reads it. This allows effectively changing the TSC on some/all cores without desyncing it across logical cores. (Which would happen if software set the TSC to a new absolute value on each core; it's very hard to get the relevant WRMSR instruction executed at the same cycle on every core.)
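The CPUID bits named in the list above can be queried from user-space. Here's a minimal sketch for gcc/clang using <cpuid.h> (MSVC would need __cpuid / __cpuidex from <intrin.h> instead). The struct and function names are made up for illustration. Note that constant_tsc has no bit to read here: as described above, the kernel derives it from family/model or from the invariant-TSC bit.

```cpp
#include <cpuid.h>  // gcc/clang only

// Hypothetical struct: TSC-related feature bits read straight from CPUID.
struct TscFeatures {
    bool tsc = false;                 // CPUID.01H:EDX[4]
    bool tsc_deadline_timer = false;  // CPUID.01H:ECX[24]
    bool tsc_adjust = false;          // CPUID.(EAX=07H,ECX=0H):EBX[1]
    bool rdtscp = false;              // CPUID.80000001H:EDX[27]
    bool invariant_tsc = false;       // CPUID.80000007H:EDX[8]
};

// Hypothetical helper: fill in the struct. The __get_cpuid helpers
// return 0 when the requested leaf isn't supported, in which case the
// corresponding flags stay false.
inline TscFeatures query_tsc_features() {
    TscFeatures f;
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        f.tsc = ((edx >> 4) & 1u) != 0;
        f.tsc_deadline_timer = ((ecx >> 24) & 1u) != 0;
    }
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        f.tsc_adjust = ((ebx >> 1) & 1u) != 0;
    }
    if (__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx)) {
        f.rdtscp = ((edx >> 27) & 1u) != 0;
    }
    if (__get_cpuid(0x80000007u, &eax, &ebx, &ecx, &edx)) {
        f.invariant_tsc = ((edx >> 8) & 1u) != 0;
    }
    return f;
}
```

Under a hypervisor, these bits reflect what the VM chooses to expose, which may differ from the physical CPU.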
constant_tsc and nonstop_tsc together make the TSC usable as a timesource for things like clock_gettime in user-space. (But OSes like Linux only use RDTSC to interpolate between ticks of a slower clock maintained with NTP, updating the scale / offset factors in timer interrupts. See On a cpu with constant_tsc and nonstop_tsc, why does my time drift?) On even older CPUs that don't support deep sleep states or frequency scaling, TSC as a timesource may still be usable.
The comments in the Linux source code also indicate that the constant_tsc / nonstop_tsc features (on Intel) imply "It is also reliable across cores and sockets. (but not across cabinets - we turn it off in that case explicitly.)"
The "across sockets" part is not accurate. In general, an invariant TSC only guarantees that the TSC is synchronized between cores within the same socket. On an Intel forum thread, Martin Dixon (Intel) points out that TSC invariance does not imply cross-socket synchronization. That requires the platform vendor to distribute RESET synchronously to all sockets. Apparently platform vendors do in practice do that, given the above Linux kernel comment. Answers on CPU TSC fetch operation especially in multicore-multi-processor environment also agree that all sockets on a single motherboard should start out in sync.
On a multi-socket shared memory system, there is no direct way to check whether the TSCs in all the cores are synced. The Linux kernel by default performs boot-time and run-time checks to make sure that TSC can be used as a clock source. These checks involve determining whether the TSC is synced. The output of the command dmesg | grep 'clocksource' would tell you whether the kernel is using TSC as the clock source, which would only happen if the checks have passed. But even then, this would not be definitive proof that the TSC is synced across all sockets of the system. The kernel parameter tsc=reliable can be used to tell the kernel that it can blindly use the TSC as the clock source without doing any checks.
There are cases where cross-socket TSCs may NOT be in sync: (1) hotplugging a CPU, (2) when the sockets are spread out across different boards connected by extended node controllers, (3) a TSC may not be resynced after waking up from a C-state in which the TSC is powered down in some processors, and (4) different sockets have different CPU models installed.
An OS or hypervisor that changes the TSC directly instead of using the TSC_ADJUST offset can de-sync them, so in user-space it might not always be safe to assume that CPU migrations won't leave you reading a different clock. (This is why rdtscp produces a core-ID as an extra output, so you can detect when start/end times come from different clocks. It might have been introduced before the invariant TSC feature, or maybe they just wanted to account for every possibility.)
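A sketch of using that extra output, with made-up names TscSample, read_tscp and elapsed_same_cpu. The value rdtscp writes is whatever the OS put in the IA32_TSC_AUX MSR (Linux encodes the CPU number there), so comparing it between the start and end samples catches most migrations. It can't catch a thread that migrated away and came back, so treat it as a sanity check rather than proof.

```cpp
#include <cstdint>
#ifdef _MSC_VER
#  include <intrin.h>
#else
#  include <x86intrin.h>
#endif

// Hypothetical struct: a TSC reading plus the IA32_TSC_AUX value that
// rdtscp returned along with it.
struct TscSample {
    std::uint64_t ticks;
    unsigned aux;
};

// Hypothetical helper: requires a CPU with the rdtscp feature.
inline TscSample read_tscp() {
    unsigned aux = 0;
    const std::uint64_t ticks = __rdtscp(&aux);
    return TscSample{ticks, aux};
}

// Hypothetical helper: returns false (leaving 'out' untouched) if the
// two samples were taken on different CPUs according to their aux values.
inline bool elapsed_same_cpu(const TscSample& start, const TscSample& end,
                             std::uint64_t& out) {
    if (start.aux != end.aux) return false;
    out = end.ticks - start.ticks;
    return true;
}
```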
If you're using rdtsc directly, you may want to pin your program or thread to a core, e.g. with taskset -c 0 ./myprogram on Linux. Whether you need it for the TSC or not, CPU migration will normally lead to a lot of cache misses and mess up your test anyway, as well as taking extra time. (Although so will an interrupt).
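If you'd rather pin from inside the program than with taskset, something like the following should work on Linux with glibc. This is a sketch: the function name is made up, and it picks the first CPU the thread is currently allowed to run on rather than hard-coding CPU 0, since containers and cpusets may not include core 0.

```cpp
#include <sched.h>  // Linux/glibc specific; g++ defines _GNU_SOURCE by default

// Hypothetical helper: pin the calling thread to the first CPU in its
// current affinity mask. Returns the chosen CPU number, or -1 on failure.
inline int pin_to_first_allowed_cpu() {
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &allowed)) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            // pid 0 means "the calling thread"
            return sched_setaffinity(0, sizeof(one), &one) == 0 ? cpu : -1;
        }
    }
    return -1;
}
```

On Windows the rough equivalent would be SetThreadAffinityMask, which isn't shown here.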
How efficient is the asm from using the intrinsic?
It's about as good as you'd get from @Mysticial's GNU C inline asm, or better because it knows the upper bits of RAX are zeroed. The main reason you'd want to keep inline asm is for compat with crusty old compilers.
A non-inline version of the readTSC function itself compiles with MSVC for x86-64 like this:
unsigned __int64 readTSC(void) PROC ; readTSC
rdtsc
shl rdx, 32 ; 00000020H
or rax, rdx
ret 0
; return in RAX
For 32-bit calling conventions that return 64-bit integers in edx:eax, it's just rdtsc/ret. Not that it matters, you always want this to inline.
In a test caller that uses it twice and subtracts to time an interval:
uint64_t time_something() {
uint64_t start = readTSC();
// even when empty, back-to-back __rdtsc() don't optimize away
return readTSC() - start;
}
All 4 compilers make pretty similar code. This is GCC's 32-bit output:
# gcc8.2 -O3 -m32
time_something():
push ebx # save a call-preserved reg: 32-bit only has 3 scratch regs
rdtsc
mov ecx, eax
mov ebx, edx # start in ebx:ecx
# timed region (empty)
rdtsc
sub eax, ecx
sbb edx, ebx # edx:eax -= ebx:ecx
pop ebx
ret # return value in edx:eax
This is MSVC's x86-64 output (with name-demangling applied). gcc/clang/ICC all emit identical code.
# MSVC 19 2017 -Ox
unsigned __int64 time_something(void) PROC ; time_something
rdtsc
shl rdx, 32 ; high <<= 32
or rax, rdx
mov rcx, rax ; missed optimization: lea rcx, [rdx+rax]
; rcx = start
;; timed region (empty)
rdtsc
shl rdx, 32
or rax, rdx ; rax = end
sub rax, rcx ; end -= start
ret 0
unsigned __int64 time_something(void) ENDP ; time_something
All 4 compilers use or+mov instead of lea to combine the low and high halves into a different register. I guess it's kind of a canned sequence that they fail to optimize.
But writing a shift/lea in inline asm yourself is hardly better. You'd deprive the compiler of the opportunity to ignore the high 32 bits of the result in EDX, if you're timing such a short interval that you only keep a 32-bit result. Or if the compiler decides to store the start time to memory, it could just use two 32-bit stores instead of shift/or / mov. If 1 extra uop as part of your timing bothers you, you'd better write your whole microbenchmark in pure asm.
However, we can maybe get the best of both worlds with a modified version of @Mysticial's code:
// More efficient than __rdtsc() in some case, but maybe worse in others
uint64_t rdtsc(){
// long and uintptr_t are 32-bit on the x32 ABI (32-bit pointers in 64-bit mode), so #ifdef would be better if we care about this trick there.
unsigned long lo,hi; // let the compiler know that zero-extension to 64 bits isn't required
__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
return ((uint64_t)hi << 32) + lo;
// + allows LEA or ADD instead of OR
}
On Godbolt, this does sometimes give better asm than __rdtsc() for gcc/clang/ICC, but other times it tricks compilers into using an extra register to save lo and hi separately, so clang can optimize into ((end_hi-start_hi)<<32) + (end_lo-start_lo). Hopefully if there's real register pressure, compilers will combine earlier. (gcc and ICC still save lo/hi separately, but don't optimize as well.)
But 32-bit gcc8 makes a mess of it, compiling even just the rdtsc() function itself with an actual add/adc with zeros instead of just returning the result in edx:eax like clang does. (gcc6 and earlier do ok with | instead of +, but definitely prefer the __rdtsc() intrinsic if you care about 32-bit code-gen from gcc).
Solution 3
VC++ uses an entirely different syntax for inline assembly -- but only in the 32-bit versions. The 64-bit compiler doesn't support inline assembly at all.
In this case, that's probably just as well -- rdtsc has (at least) two major problems when it comes to timing code sequences. First (like most instructions) it can be executed out of order, so if you're trying to time a short sequence of code, the rdtsc before and after that code might both be executed before it, or both after it, or what have you (I am fairly sure the two will always execute in order with respect to each other though, so at least the difference will never be negative).
Second, on a multi-core (or multiprocessor) system, one rdtsc might execute on one core/processor and the other on a different core/processor. In such a case, a negative result is entirely possible.
Generally speaking, if you want a precise timer under Windows, you're going to be better off using QueryPerformanceCounter.
If you really insist on using rdtsc, I believe you'll have to do it in a separate module written entirely in assembly language (or use a compiler intrinsic), then linked with your C or C++. I've never written that code for 64-bit mode, but in 32-bit mode it looks something like this:
xor eax, eax
cpuid
xor eax, eax
cpuid
xor eax, eax
cpuid
rdtsc
; save eax, edx
; code you're going to time goes here
xor eax, eax
cpuid
rdtsc
I know this looks strange, but it's actually right. You execute CPUID because it's a serializing instruction (can't be executed out of order) and is available in user mode. You execute it three times before you start timing because Intel documents the fact that the first execution can/will run at a different speed than the second (and what they recommend is three, so three it is).
Then you execute your code under test, another cpuid to force serialization, and the final rdtsc to get the time after the code finished.
Along with that, you want to use whatever means your OS supplies to force this all to run on one processor/core. In most cases, you also want to force the code alignment -- changes in alignment can lead to fairly substantial differences in execution speed.
Finally you want to execute it a number of times -- and it's always possible it'll get interrupted in the middle of things (e.g., a task switch), so you need to be prepared for the possibility of an execution taking quite a bit longer than the rest -- e.g., 5 runs that take ~40-43 clock cycles apiece, and a sixth that takes 10000+ clock cycles. Clearly, in the latter case, you just throw out the outlier -- it's not from your code.
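One simple way to keep an outlier like that from skewing the result is to report the median of the runs instead of the mean. A minimal sketch (the function name median_ticks is made up for illustration):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper: median of a set of tick counts. The vector is
// taken by value so the caller's samples are not reordered. For an
// even number of samples this returns the upper of the two middle
// values; an empty input returns 0.
inline std::uint64_t median_ticks(std::vector<std::uint64_t> samples) {
    if (samples.empty()) return 0;
    auto mid = samples.begin() + samples.size() / 2;
    std::nth_element(samples.begin(), mid, samples.end());
    return *mid;
}
```

With five runs of 40, 43, 41, 10000 and 42 cycles, the median is 42, while the mean would be over 2000.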
Summary: managing to execute the rdtsc instruction itself is (almost) the least of your worries. There's quite a bit more you need to do before you can get results from rdtsc that will actually mean anything.
Solution 4
Linux perf_event_open system call with config = PERF_COUNT_HW_CPU_CYCLES
This Linux system call appears to be a cross architecture wrapper for performance events.
This answer is similar to Quick way to count number of instructions executed in a C program, but with PERF_COUNT_HW_CPU_CYCLES instead of PERF_COUNT_HW_INSTRUCTIONS. This answer will focus on PERF_COUNT_HW_CPU_CYCLES specifics; see that other answer for more generic information.
Here is an example based on the one provided at the end of the man page.
perf_event_open.c
#define _GNU_SOURCE
#include <asm/unistd.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <inttypes.h>
#include <sys/types.h>
static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
int cpu, int group_fd, unsigned long flags)
{
int ret;
ret = syscall(__NR_perf_event_open, hw_event, pid, cpu,
group_fd, flags);
return ret;
}
int
main(int argc, char **argv)
{
struct perf_event_attr pe;
long long count;
int fd;
uint64_t n;
if (argc > 1) {
n = strtoll(argv[1], NULL, 0);
} else {
n = 10000;
}
memset(&pe, 0, sizeof(struct perf_event_attr));
pe.type = PERF_TYPE_HARDWARE;
pe.size = sizeof(struct perf_event_attr);
pe.config = PERF_COUNT_HW_CPU_CYCLES;
pe.disabled = 1;
pe.exclude_kernel = 1;
// Don't count hypervisor events.
pe.exclude_hv = 1;
fd = perf_event_open(&pe, 0, -1, -1, 0);
if (fd == -1) {
fprintf(stderr, "Error opening leader %llx\n", pe.config);
exit(EXIT_FAILURE);
}
ioctl(fd, PERF_EVENT_IOC_RESET, 0);
ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
/* Loop n times, should be good enough for -O0. */
__asm__ (
"1:;\n"
"sub $1, %[n];\n"
"jne 1b;\n"
: [n] "+r" (n)
:
:
);
ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
read(fd, &count, sizeof(long long));
printf("%lld\n", count);
close(fd);
}
The results seem reasonable, e.g. if I print cycles then recompile for instruction counts, we get about 1 cycle per iteration (2 instructions done in a single cycle) possibly due to effects such as superscalar execution, with slightly different results for each run presumably due to random memory access latencies.
You might also be interested in PERF_COUNT_HW_REF_CPU_CYCLES, which the man page documents as:
Total cycles; not affected by CPU frequency scaling.
so this will give something closer to the real wall time if your frequency scaling is on. These were 2/3x larger than PERF_COUNT_HW_INSTRUCTIONS on my quick experiments, presumably because my non-stressed machine is frequency scaled now.
Solution 5
For Windows, Visual Studio provides a convenient "compiler intrinsic" (i.e. a special function, which the compiler understands) that executes the RDTSC instruction for you and gives you back the result:
unsigned __int64 __rdtsc(void);
user997112
Updated on October 13, 2021
Comments
-
user997112 over 2 years
I saw this post on SO which contains C code to get the latest CPU Cycle count:
CPU Cycle count based profiling in C/C++ Linux x86_64
Is there a way I can use this code in C++ (windows and linux solutions welcome)? Although written in C (and C being a subset of C++) I am not too certain if this code would work in a C++ project and if not, how to translate it?
I am using x86-64
EDIT2:
Found this function but cannot get VS2010 to recognise the assembler. Do I need to include anything? (I believe I have to swap uint64_t to long long for windows....?)
static inline uint64_t get_cycles()
{
uint64_t t;
__asm volatile ("rdtsc" : "=A"(t));
return t;
}
EDIT3:
From above code I get the error:
"error C2400: inline assembler syntax error in 'opcode'; found 'data type'"
Could someone please help?
-
Mark Ransom over 11 years
Visual Studio does not support assembly on x86-64.
-
user997112 over 11 years
@MarkRansom I presume you mean MSVC? I think I have the ICC compiler installed too and just to be sure I am just installing MinGW
-
Nikos C. over 11 years
To get uint64_t you should #include <stdint.h> (actually <cstdint> but your compiler is probably too old to have that one.)
-
Mark Ransom over 11 years
@user997112, yes I meant MSVC. I completely forgot that you can substitute compilers in it since I've never tried it.
-
user997112 over 11 years
Guys, I now get the error in the edit3. I have included <stdint.h> and this is on Windows 7
-
Nik Bougalis over 11 years
@MarkRansom Also, Visual Studio doesn't support gcc-style assembly ;)
-
brian beuning over 11 years
You need to be careful with this. With a multi-core chip, the clock counts are different on the different cores. If the scheduler moves your thread between cores, the count can jump. Some OS have fixed this. Some chips put cores to sleep to save power, then that cores clock does not advance.
-
rcgldr over 5 years
@MarkRansom - to clarify for others reading this, VS doesn't support inline assembly for 64 bit builds, but it does support separate assembly source files and uses ML64.EXE for 64 bit assembly. I use custom build step to run ML64.EXE rather than use the default, command line, using x64.asm as example: "ml64 /c /Zi /Fo$(OutDir)\x64.obj x64.asm" (/Zi for debug build, no /Zi for release build), output file: "$(OutDir)\x64.obj"
-
Nik Bougalis over 11 years
That's a nice way to package it.
-
phonetagger over 11 years
I'm pretty sure when I was researching it, I found documentation that QueryPerformanceCounter (which is a thin veil over rdtsc) suffers from the same problem you identified on multicore/multiprocessor systems. But I think I also found documentation that this problem was a real problem on early systems because most BIOSes didn't even attempt to synchronize the counters on the different cores, but most newer BIOSes (perhaps not counting cheap junk machine BIOSes) do make that effort, so they may be off by only a few counts now.
-
phonetagger over 11 years
.... But to avoid that possibility entirely, you can set a thread's processor affinity mask so that it will run on only a single core, eliminating this issue entirely. (which I see you also mentioned)
-
Jerry Coffin over 11 years
QPC can be, but isn't necessarily, a thin veil over rdtsc. At least at one time, the single-processor kernel used rdtsc, but the multiprocessor kernel used the motherboard's 1.024 MHz clock chip instead (for exactly the cited reasons).
-
jstine over 11 years
FWIW, gcc 4.5 and newer include __rdtsc() -- #include <x86intrin.h> get it. Header also includes many other intel intrinsics found in Microsoft's <intrin.h>, and it gets included by default these days when you include most any of the SIMD headers -- emmintrin.h, xmmintrin.h, etc.
-
Tomilov Anatoliy about 6 years
std::uint64_t x; asm volatile ("rdtsc" : "=A"(x)); is another way to read EAX and EDX together.
-
Peter Cordes over 5 years
@Orient: only in 32-bit mode. In 64-bit mode, "=A" will pick either RAX or RDX.
-
Peter Cordes over 5 years
Any reason you prefer inline asm for GNU compilers? <x86intrin.h> defines __rdtsc() for compilers other than MSVC, so you can just #ifdef _MSC_VER. I added an answer on this question, since it looks like a good place for a canonical about rdtsc intrinsics, and gotchas on how to use rdtsc.
BeeOnRope over 5 years
The tsc doesn't necessarily tick at the "sticker frequency", but rather at the tsc frequency. On some machines these are the same, but on many recent machines (like Skylake client and derived uarchs) they are often not. For example, my i7-6700HQ sticker frequency is 2600 MHz, but the tsc frequency is 2592 MHz. They are probably not the same in cases the different clocks they are based on can't be made to line up to exactly the same frequency when scaling the frequency by an integer. Many tools don't account for this difference leading to small errors.
-
Peter Cordes over 5 years
@BeeOnRope: Thanks, I hadn't realized that. That probably explains some not-quite-4GHz results I've seen from RDTSC stuff on my machine, like 4008 MHz vs. the sticker frequency of 4.0 GHz.
-
BeeOnRope over 5 years
On recent enough kernels you can do a dmesg | grep tsc to see both values. I get tsc: Detected 2600.000 MHz processor ... tsc: Detected 2592.000 MHz TSC. You can also use turbostat to show this.
-
Peter Cordes over 5 years
Yup, 4000.000 MHz processor and 4008.000 MHz TSC on i7-6700k. Nifty.
-
Mysticial over 5 years
@PeterCordes See jstine's comment. The rdtsc intrinsic didn't exist at the time.
-
Peter Cordes over 5 years
Yeah, I discovered that while doing more digging for my answer. I think it's safe to say that __rdtsc() should be recommended over inline asm these days, though, so your answer could use an update. (Or hopefully the OP will accept my attempt at a canonical answer; I already closed a bunch of similar questions as duplicates of this one.) Also, I played around with making lo and hi unsigned long or uintptr_t, so the compiler wouldn't have to zero-extend eax into rax, which helps, and changing | to + which leads to weird optimizations...
Mysticial over 5 years@PeterCordes I'll update it tomorrow when I get the time during or between events. (my company sent me to the Hot Chips conference)
-
Peter Cordes over 5 yearsCool. There's nothing actually wrong with your current answer (except maybe that missed-optimization); it should continue to be future-proof.
-
BeeOnRope about 4 yearsJust to add to this the sticker base and turbo frequency and tsc frequencies have now diverged wildly. An i5-1035 has a tsc frequency of 1.5 GHz, but a base frequency of 1.1 GHz, and a turbo frequency (not really relevant) of 3.7 GHz.
-
Some Name almost 4 yearsAs per Vol.3 unhalted reference cycles counting is defined to run with the core crystal clock, TSC or the bus clock. so tsc rate might not be the same as reference cycles rate.
-
Peter Cordes over 3 years: You should probably point out that core clock cycles are different from RDTSC reference cycles. It's actual CPU cycles, not cycles of some fixed frequency, so in some cases it more accurately reflects what you want. (But it doesn't tick while the core is halted, e.g. for frequency transitions, or while asleep, so it's very much not a measure of real time, especially for a program involving I/O.)
-
Peter Cordes over 3 years: You measure more cycles than instructions with this program? Probably mostly measurement overhead, because the loop itself should run at 1 iteration / cycle = 2 instructions / cycle. Your default n=10000 (clock cycles) is pretty tiny, compared to system-call overheads on Linux with Spectre and Meltdown mitigations enabled. If you asked perf / PAPI to make rdpmc usable in user-space, you could use that to measure with less overhead than rdtsc (and still in CPU cycles, not ref cycles). -
Peter Cordes over 3 years: Fun fact, you can get the PMU to count reference cycles for you, but that doesn't keep ticking when the clock is halted. See: Lost Cycles on Intel? An inconsistency between rdtsc and CPU_CLK_UNHALTED.REF_TSC
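Editor's note: to make the core-cycles versus reference-cycles distinction concrete, here is a Linux-only sketch that counts one hardware event for user-space code via the perf_event_open system call. It is not the code from the answer being discussed; the names spin and count_user_event are hypothetical, and the call can fail (for example inside a VM without a virtualized PMU, or when /proc/sys/kernel/perf_event_paranoid forbids it), in which case the function simply returns false.

```cpp
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>

// Hypothetical workload: a loop the compiler cannot optimize away.
inline void spin(std::uint64_t iterations) {
    volatile std::uint64_t sink = 0;
    for (std::uint64_t i = 0; i < iterations; ++i) sink = sink + i;
}

// Counts `config` (e.g. PERF_COUNT_HW_CPU_CYCLES for core cycles or
// PERF_COUNT_HW_REF_CPU_CYCLES for reference cycles) while spin() runs.
// Returns false if the counter could not be opened or read.
inline bool count_user_event(std::uint64_t config, std::uint64_t iterations,
                             std::uint64_t& count_out) {
    perf_event_attr pe;
    std::memset(&pe, 0, sizeof(pe));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(pe);
    pe.config = config;
    pe.disabled = 1;        // start stopped; enabled explicitly below
    pe.exclude_kernel = 1;  // do not count kernel-mode execution
    pe.exclude_hv = 1;      // do not count hypervisor execution

    // pid = 0, cpu = -1: this thread on whichever CPU it runs; no group.
    const long fd = syscall(SYS_perf_event_open, &pe, 0, -1, -1, 0UL);
    if (fd < 0) return false;

    ioctl(static_cast<int>(fd), PERF_EVENT_IOC_RESET, 0);
    ioctl(static_cast<int>(fd), PERF_EVENT_IOC_ENABLE, 0);
    spin(iterations);
    ioctl(static_cast<int>(fd), PERF_EVENT_IOC_DISABLE, 0);

    std::uint64_t count = 0;
    const ssize_t got = read(static_cast<int>(fd), &count, sizeof(count));
    close(static_cast<int>(fd));
    if (got != static_cast<ssize_t>(sizeof(count))) return false;

    count_out = count;
    return true;
}
```

Running it once with PERF_COUNT_HW_CPU_CYCLES and once with PERF_COUNT_HW_REF_CPU_CYCLES on the same workload shows the two counts diverging whenever the core is not running at the reference frequency.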
-
Ciro Santilli OurBigBook.com over 3 years: @PeterCordes thanks for those pointers. Maybe PERF_COUNT_HW_REF_CPU_CYCLES does something more similar to RDTSC ("Total cycles; not affected by CPU frequency scaling.") Note that kernelland instructions should be removed by pe.exclude_kernel = 1; and experimentally, 10k already seems to give representative results that vary more or less linearly with size. I would also guess that RDTSC and RDPMC don't distinguish between different processes running at the same time, though they are lower overhead than the syscall. -
Peter Cordes over 3 years: Yes, the PERF_COUNT_HW_REF_CPU_CYCLES counter ticks at the same frequency as the TSC that RDTSC reads. (Except when it doesn't tick at all during frequency transitions). My point was that PERF_COUNT_HW_CPU_CYCLES, which you used, will similarly pause during frequency transitions, so it's not just wall-clock time in different units. The throughput cost of turbo frequency shifts while executing in user-space won't be counted by that counter, in case that ever matters. -
Peter Cordes over 3 years: I didn't notice or look for the pe.exclude_kernel = 1; line. Yeah, that should cancel my objection about syscall overhead. Is it possible your loop ended up split across a cache-line boundary or something? You'd expect any modern Intel or AMD CPU to run it at 1 cycle / iter, 2 IPC unless there was some weird effect. Try putting .p2align 4 before the label; the loop body is less than 16 bytes so that will make sure it's not crossing any boundary and the jnz isn't touching the end of a 32-byte boundary. (JCC erratum mitigation penalty.) -
Ciro Santilli OurBigBook.com over 3 years: @PeterCordes oops, you are right, I hadn't done it carefully/mixed with other examples. Thanks!
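Editor's note: the alignment suggestion above can be sketched as GNU C inline asm (GCC or Clang, x86-64, AT&T syntax). This is not the loop from the answer under discussion, only an assumed equivalent two-instruction sub/jnz countdown; the names countdown_loop and tsc_ticks_for_loop are hypothetical. Note that timing it with __rdtsc() yields reference cycles, not core cycles, and rdtsc is not a serializing instruction, so small iteration counts give noisy numbers.

```cpp
#include <cstdint>
#include <x86intrin.h>

// Runs a sub/jnz loop n times with the loop top aligned to 16 bytes.
// Returns the final counter value (0 once the loop has finished).
inline std::uint64_t countdown_loop(std::uint64_t n) {
    if (n == 0) return 0;  // the asm loop would wrap around for n == 0
    __asm__ __volatile__(
        ".p2align 4\n"      // pad so the label below is 16-byte aligned
        "1:\n\t"
        "sub $1, %0\n\t"    // decrement the counter, setting ZF at zero
        "jnz 1b"            // loop back while the counter is non-zero
        : "+r"(n)           // read-write operand in any general register
        :
        : "cc");            // the flags are clobbered
    return n;
}

// TSC (reference-cycle) ticks spent around one run of the loop.
inline std::uint64_t tsc_ticks_for_loop(std::uint64_t n) {
    const std::uint64_t start = __rdtsc();
    countdown_loop(n);
    const std::uint64_t stop = __rdtsc();
    return stop - start;
}
```

If the loop really sustains one iteration per core cycle, tsc_ticks_for_loop(n) divided by n approximates the ratio of TSC frequency to the current core frequency, which is one way to see the two clocks differ.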
-
Peter Cordes over 3 years: "due to superscalar execution" - technical nitpick: on Intel Sandybridge-family CPUs, it's actually due to macro-fusion in the decoders turning sub/jnz into a single dec-and-branch uop. So the back end is only executing 1 uop / cycle. And this uop comes from the uop cache, so other than initial decode, there's actually nothing superscalar going on :P (Except probably issuing groups of 4 of those uops into the back end, then idling for 3 cycles.) But if you have an AMD CPU, it will only fuse cmp or test, so that would be real superscalar execution. -
Arty about 3 years: @PeterCordes Just for reference: my laptop's Intel Pentium T4300 (2 cores) increments RDTSC values at different frequencies, sometimes at a rate of 0.48 ns/cycle (most of the time), sometimes 0.96 ns/cycle (rarely, when overheated). I measured this on Windows with Clang and the __rdtsc() intrinsic. I bought the laptop around 2008. It would be great if it were possible to test whether a given CPU has a steady rdtsc (always the same frequency) or a varying one (changing across cores and over time). Do you know if that is possible? -
Peter Cordes about 3 years: @Arty: Apparently the changeover (from core cycles to reference cycles) didn't get a CPUID feature bit, so OSes determine constant_tsc by looking at CPU model numbers, as mentioned in this answer. But the later nonstop_tsc (not halting in sleep states) implies constant_tsc, so you can at least check that. If you were writing an OS (or using more system calls), you could use power-management to control the CPU's frequency while running your calibration loop. -
Peter Cordes about 3 years: @Arty: Your T4300 is a Penryn CPU, 2nd-gen Core2, so it should have constant_tsc if not nonstop_tsc. I guess thermal throttling involves deep-sleep, pausing the clock for some average duty-cycle, but low-speed idle doesn't have an effect. -
jonadv almost 3 years: I just got a bachelor degree reading this comment
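Editor's note: on the question a few comments up about testing for a steady TSC, one thing user-space code can check directly is the "Invariant TSC" bit, which CPUID leaf 0x80000007 reports in EDX bit 8. As far as I know this is the bit Linux's nonstop_tsc flag is derived from, but treat that mapping as an assumption. The sketch below assumes GCC or Clang on x86 (it uses <cpuid.h>); the function names are hypothetical. A clear bit does not by itself prove the TSC varies, since, as noted above, constant_tsc has no feature bit of its own.

```cpp
#include <cpuid.h>

// Pure helper: extract the Invariant TSC flag (bit 8) from the EDX value
// returned by CPUID leaf 0x80000007.
constexpr bool invariant_tsc_bit(unsigned edx) {
    return ((edx >> 8) & 1u) != 0;
}

// Returns true if the CPU advertises an invariant TSC. Returns false if
// the bit is clear or if the CPU does not support leaf 0x80000007.
inline bool has_invariant_tsc() {
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(0x80000007u, &eax, &ebx, &ecx, &edx)) return false;
    return invariant_tsc_bit(edx);
}
```

On Linux the same information is also visible without any code, as the constant_tsc and nonstop_tsc entries in the flags line of /proc/cpuinfo.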