How to get the CPU cycle count in x86_64 from C++?

Solution 1

As of GCC 4.5, the __rdtsc() intrinsic is supported by both MSVC and GCC.

But the include that's needed is different:

#ifdef _WIN32
#include <intrin.h>
#else
#include <x86intrin.h>
#endif

Here's the original answer, from before GCC 4.5.

Pulled directly out of one of my projects:

#include <stdint.h>

//  Windows
#ifdef _WIN32

#include <intrin.h>
uint64_t rdtsc(){
    return __rdtsc();
}

//  Linux/GCC
#else

uint64_t rdtsc(){
    unsigned int lo,hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}

#endif

This GNU C Extended asm tells the compiler:

  • volatile: the outputs aren't a pure function of the inputs (so it has to re-run every time, not reuse an old result).
  • "=a"(lo) and "=d"(hi) : the output operands are fixed registers: EAX and EDX. (x86 machine constraints). The x86 rdtsc instruction puts its 64-bit result in EDX:EAX, so letting the compiler pick an output with "=r" wouldn't work: there's no way to ask the CPU for the result to go anywhere else.
  • ((uint64_t)hi << 32) | lo - zero-extend both 32-bit halves to 64-bit (because lo and hi are unsigned), and logically shift + OR them together into a single 64-bit C variable. In 32-bit code, this is just a reinterpretation; the values simply stay in a pair of 32-bit registers. In 64-bit code you typically get actual shift + OR instructions, unless the high half optimizes away.

(editor's note: this could probably be more efficient if you used unsigned long instead of unsigned int. Then the compiler would know that lo was already zero-extended into RAX. It wouldn't know that the upper half was zero, so | and + are equivalent if it wanted to merge a different way. The intrinsic should in theory give you the best of both worlds as far as letting the optimizer do a good job.)

Don't use inline asm if you can avoid it (https://gcc.gnu.org/wiki/DontUseInlineAsm). But hopefully this section is useful if you need to understand old code that uses inline asm, so you can rewrite it with intrinsics. See also https://stackoverflow.com/tags/inline-assembly/info

Solution 2

Your inline asm is broken for x86-64. "=A" in 64-bit mode lets the compiler pick either RAX or RDX, not EDX:EAX. See this Q&A for more.


You don't need inline asm for this. There's no benefit; compilers have built-ins for rdtsc and rdtscp, and (at least these days) all define a __rdtsc intrinsic if you include the right headers. But unlike almost all other cases (https://gcc.gnu.org/wiki/DontUseInlineAsm), there's no serious downside to asm, as long as you're using a good and safe implementation like @Mysticial's.

(One minor advantage of asm is that if you want to time a small interval that's certainly going to be less than 2^32 counts, you can ignore the high half of the result. Compilers could do that optimization for you when you write uint32_t time_low = __rdtsc();, but in practice they sometimes still waste instructions doing the shift / OR.)
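
For example, a sketch of the low-32-bit idea (whether the compiler actually drops the shift/OR varies by compiler and version, as noted):

#include <stdint.h>
#ifdef _MSC_VER
# include <intrin.h>
#else
# include <x86intrin.h>
#endif

// For intervals you know are well under 2^32 reference cycles, the high
// halves cancel in the subtraction, so 32-bit math is enough.
static inline uint32_t tsc_low() {
    return (uint32_t)__rdtsc();   // ideally compiles to just rdtsc, result in eax
}

uint32_t time_short_interval() {
    uint32_t start = tsc_low();
    // ... short timed region ...
    return tsc_low() - start;     // modulo-2^32 subtraction handles wraparound
}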


Unfortunately MSVC disagrees with everyone else about which header to use for non-SIMD intrinsics.

Intel's intrinsics guide says _rdtsc (with one underscore) is in <immintrin.h>, but that doesn't work on gcc and clang. They only define SIMD intrinsics in <immintrin.h>, so we're stuck with <intrin.h> (MSVC) vs. <x86intrin.h> (everything else, including recent ICC). For compat with MSVC and Intel's documentation, gcc and clang define both the one-underscore and two-underscore versions of the function.

Fun fact: the double-underscore version returns an unsigned 64-bit integer, while Intel documents _rdtsc() as returning (signed) __int64.

// valid C99 and C++

#include <stdint.h>  // <cstdint> is preferred in C++, but stdint.h works.

#ifdef _MSC_VER
# include <intrin.h>
#else
# include <x86intrin.h>
#endif

// optional wrapper if you don't want to just use __rdtsc() everywhere
inline
uint64_t readTSC() {
    // _mm_lfence();  // optionally wait for earlier insns to retire before reading the clock
    uint64_t tsc = __rdtsc();
    // _mm_lfence();  // optionally block later instructions until rdtsc retires
    return tsc;
}

// requires a Nehalem or newer CPU.  Not Core2 or earlier.  IDK when AMD added it.
inline
uint64_t readTSCp() {
    unsigned dummy;
    return __rdtscp(&dummy);  // waits for earlier insns to retire, but allows later to start
}

Compiles with all 4 of the major compilers: gcc/clang/ICC/MSVC, for 32 or 64-bit. See the results on the Godbolt compiler explorer, including a couple test callers.

These intrinsics were new in gcc4.5 (from 2010) and clang3.5 (from 2014). gcc4.4 and clang 3.4 on Godbolt don't compile this, but gcc4.5.3 (April 2011) does. You might see inline asm in old code, but you can and should replace it with __rdtsc(). Compilers over a decade old usually make slower code than gcc6, gcc7, or gcc8, and have less useful error messages.

The MSVC intrinsic has (I think) existed far longer, because MSVC never supported inline asm for x86-64. ICC13 has __rdtsc in immintrin.h, but doesn't have an x86intrin.h at all. More recent ICC versions have x86intrin.h, at least the way Godbolt installs them for Linux.

You might want your wrapper to return signed long long, especially if you want to subtract timestamps and convert to float: int64_t -> float/double is more efficient than uint64_t on x86 without AVX512. Also, small negative results are possible because of CPU migrations if TSCs aren't perfectly synced, and that probably makes more sense than huge unsigned numbers.
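
For example, a signed wrapper along these lines (a sketch: tsc_hz is an assumed input you'd have to get from calibration or CPUID leaf 0x15, not something this code computes):

#include <stdint.h>
#ifdef _MSC_VER
# include <intrin.h>
#else
# include <x86intrin.h>
#endif

// Signed result: interval -> double is one cvtsi2sd, and a small negative
// delta from imperfectly-synced TSCs stays small instead of wrapping huge.
static inline int64_t readTSC_signed() {
    return (int64_t)__rdtsc();
}

double ticks_to_seconds(int64_t start, int64_t end, double tsc_hz) {
    return (double)(end - start) / tsc_hz;   // tsc_hz: assumed known
}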


BTW, clang also has a portable __builtin_readcyclecounter() which works on any architecture. (Always returns zero on architectures without a cycle counter.) See the clang/LLVM language-extension docs
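
For example (clang-only; a trivial sketch):

#include <stdio.h>

int main() {
    // clang extension: cycle counter on any architecture that has one,
    // always 0 on architectures that don't.
    unsigned long long c0 = __builtin_readcyclecounter();
    unsigned long long c1 = __builtin_readcyclecounter();
    printf("delta: %llu\n", c1 - c0);
}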


For more about using lfence (or cpuid) to improve repeatability of rdtsc and control exactly which instructions are / aren't in the timed interval by blocking out-of-order execution, see @HadiBrais' answer on clflush to invalidate cache line via C function and the comments for an example of the difference it makes.

See also Is LFENCE serializing on AMD processors? (TL:DR yes with Spectre mitigation enabled, otherwise kernels leave the relevant MSR unset so you should use cpuid to serialize.) It's always been defined as partially-serializing on Intel.
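
Putting that together, a fenced variant of the wrapper above (just the earlier readTSC with the _mm_lfence() calls uncommented; remember the AMD caveat about whether lfence actually serializes):

#include <stdint.h>
#ifdef _MSC_VER
# include <intrin.h>       // __rdtsc, _mm_lfence
#else
# include <x86intrin.h>
#endif

// Earlier instructions must retire before we read the clock, and rdtsc
// must finish before anything later starts.
static inline uint64_t readTSC_fenced() {
    _mm_lfence();
    uint64_t tsc = __rdtsc();
    _mm_lfence();
    return tsc;
}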

How to Benchmark Code Execution Times on Intel® IA-32 and IA-64 Instruction Set Architectures, an Intel white-paper from 2010.


rdtsc counts reference cycles, not CPU core clock cycles

It counts at a fixed frequency regardless of turbo / power-saving, so if you want uops-per-clock analysis, use performance counters. rdtsc is exactly correlated with wall-clock time (not counting system clock adjustments, so it's a perfect time source for steady_clock).

The TSC frequency used to always be equal to the CPU's rated frequency, i.e. the advertised sticker frequency. In some CPUs it's merely close, e.g. 2592 MHz on an i7-6700HQ 2.6 GHz Skylake, or 4008MHz on a 4000MHz i7-6700k. On even newer CPUs like i5-1035 Ice Lake, TSC = 1.5 GHz, base = 1.1 GHz, so disabling turbo won't even approximately work for TSC = core cycles on those CPUs.

If you use it for microbenchmarking, include a warm-up period first to make sure your CPU is already at max clock speed before you start timing. (And optionally disable turbo and tell your OS to prefer max clock speed to avoid CPU frequency shifts during your microbenchmark).
Microbenchmarking is hard: see Idiomatic way of performance evaluation? for other pitfalls.
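
For example, a crude warm-up spin (400M ticks is a made-up default, roughly 0.1 s at a 4 GHz TSC; tune to taste):

#include <stdint.h>
#ifdef _MSC_VER
# include <intrin.h>
#else
# include <x86intrin.h>
#endif

// Busy-spin until some TSC time has passed, so turbo / P-state ramp-up
// happens before the timed region starts.
static void warm_up(uint64_t warmup_ticks = 400000000) {
    volatile uint64_t sink = 0;
    uint64_t start = __rdtsc();
    while (__rdtsc() - start < warmup_ticks)
        sink = sink + 1;          // trivial work to keep the core busy
}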

Instead of the TSC at all, you can use a library that gives you access to hardware performance counters. The complicated but low-overhead way is to program perf counters and read them with rdpmc in user-space, or simpler ways include tricks like perf stat for part of a program, if your timed region is long enough that you can attach perf stat -p PID.

You usually will still want to keep the CPU clock fixed for microbenchmarks, though, unless you want to see how different loads will get Skylake to clock down when memory-bound or whatever. (Note that memory bandwidth / latency is mostly fixed, using a different clock than the cores. At idle clock speed, an L2 or L3 cache miss takes many fewer core clock cycles.)

If you're microbenchmarking with RDTSC for tuning purposes, your best bet is to just use ticks and skip even trying to convert to nanoseconds. Otherwise, use a high-resolution library time function like std::chrono or clock_gettime. See faster equivalent of gettimeofday for some discussion / comparison of timestamp functions, or reading a shared timestamp from memory to avoid rdtsc entirely if your precision requirement is low enough for a timer interrupt or thread to update it.
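
For example, with std::chrono (on Linux, steady_clock typically bottoms out in clock_gettime(CLOCK_MONOTONIC), which the VDSO implements with rdtsc plus scale/offset factors when the TSC is usable):

#include <chrono>
#include <cstdio>

// Portable nanosecond timing instead of raw TSC ticks.
template <class F>
long long time_ns(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}

int main() {
    std::printf("%lld ns\n", time_ns([]{ /* code under test */ }));
}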

See also Calculate system time using rdtsc about finding the crystal frequency and multiplier.

CPU TSC fetch operation especially in multicore-multi-processor environment says that Nehalem and newer have the TSC synced and locked together for all cores in a package (along with the invariant = constant and nonstop TSC feature). See @amdn's answer there for some good info about multi-socket sync.

(And apparently usually reliable even for modern multi-socket systems as long as they have that feature, see @amdn's answer on the linked question, and more details below.)


CPUID features relevant to the TSC

These use the names that Linux /proc/cpuinfo uses for the CPU features, along with other aliases for the same features that you'll also find.

  • tsc - the TSC exists and rdtsc is supported. Baseline for x86-64.
  • rdtscp - rdtscp is supported.
  • tsc_deadline_timer CPUID.01H:ECX.TSC_Deadline[bit 24] = 1 - local APIC can be programmed to fire an interrupt when the TSC reaches a value you put in IA32_TSC_DEADLINE. Enables "tickless" kernels, I think, sleeping until the next thing that's supposed to happen.
  • constant_tsc: Support for the constant TSC feature is determined by checking the CPU family and model numbers. The TSC ticks at constant frequency regardless of changes in core clock speed. Without this, RDTSC does count core clock cycles.
  • nonstop_tsc: This feature is called the invariant TSC in Intel's SDM and is indicated by CPUID.80000007H:EDX[8]. The TSC keeps ticking even in deep sleep C-states. On all x86 processors, nonstop_tsc implies constant_tsc, but constant_tsc doesn't necessarily imply nonstop_tsc. There is no separate CPUID feature bit for the two; on Intel and AMD, the same invariant-TSC CPUID bit implies both the constant_tsc and nonstop_tsc features (a check for that bit is sketched after this list). See Linux's x86/kernel/cpu/intel.c detection code; amd.c was similar.

Some (but not all) of the processors based on Saltwell/Silvermont/Airmont even keep the TSC ticking in ACPI S3 full-system sleep: nonstop_tsc_s3. This is called always-on TSC. (Although it seems the ones based on Airmont were never released.)

For more details on constant and invariant TSC, see: Can constant non-invariant tsc change frequency across cpu states?.

  • tsc_adjust: CPUID.(EAX=07H, ECX=0H):EBX.TSC_ADJUST (bit 1) The IA32_TSC_ADJUST MSR is available, allowing OSes to set an offset that's added to the TSC when rdtsc or rdtscp reads it. This allows effectively changing the TSC on some/all cores without desyncing it across logical cores. (Which would happen if software set the TSC to a new absolute value on each core; it's very hard to get the relevant WRMSR instruction executed at the same cycle on every core.)
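
A sketch of checking the invariant-TSC bit mentioned above (CPUID.80000007H:EDX[8]); __get_cpuid is GCC/clang's helper from <cpuid.h>, and MSVC spells it __cpuid in <intrin.h>:

#ifdef _MSC_VER
# include <intrin.h>       // __cpuid
#else
# include <cpuid.h>        // __get_cpuid
#endif

// Returns 1 if the invariant TSC bit is set (constant_tsc + nonstop_tsc).
static int has_invariant_tsc(void) {
#ifdef _MSC_VER
    int regs[4];
    __cpuid(regs, 0x80000000);                     // max extended leaf
    if ((unsigned)regs[0] < 0x80000007u) return 0;
    __cpuid(regs, 0x80000007);
    return (regs[3] >> 8) & 1;                     // EDX bit 8
#else
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx)) return 0;
    return (edx >> 8) & 1;                         // EDX bit 8
#endif
}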

constant_tsc and nonstop_tsc together make the TSC usable as a timesource for things like clock_gettime in user-space. (But OSes like Linux only use RDTSC to interpolate between ticks of a slower clock maintained with NTP, updating the scale / offset factors in timer interrupts. See On a cpu with constant_tsc and nonstop_tsc, why does my time drift?) On even older CPUs that don't support deep sleep states or frequency scaling, TSC as a timesource may still be usable.

The comments in the Linux source code also indicate that the constant_tsc / nonstop_tsc features (on Intel) imply "It is also reliable across cores and sockets. (but not across cabinets - we turn it off in that case explicitly.)"

The "across sockets" part is not accurate. In general, an invariant TSC only guarantees that the TSC is synchronized between cores within the same socket. On an Intel forum thread, Martin Dixon (Intel) points out that TSC invariance does not imply cross-socket synchronization. That requires the platform vendor to distribute RESET synchronously to all sockets. Apparently platform vendors do in practice do that, given the above Linux kernel comment. Answers on CPU TSC fetch operation especially in multicore-multi-processor environment also agree that all sockets on a single motherboard should start out in sync.

On a multi-socket shared memory system, there is no direct way to check whether the TSCs in all the cores are synced. The Linux kernel, by default, performs boot-time and run-time checks to make sure that the TSC can be used as a clock source. These checks involve determining whether the TSC is synced. The output of the command dmesg | grep 'clocksource' would tell you whether the kernel is using the TSC as the clock source, which would only happen if the checks have passed. But even then, this would not be definitive proof that the TSC is synced across all sockets of the system. The kernel parameter tsc=reliable can be used to tell the kernel that it can blindly use the TSC as the clock source without doing any checks.

There are cases where cross-socket TSCs may NOT be in sync: (1) hotplugging a CPU, (2) when the sockets are spread out across different boards connected by extended node controllers, (3) a TSC may not be resynced after waking up from a C-state in which the TSC is powered down in some processors, and (4) different sockets have different CPU models installed.

An OS or hypervisor that changes the TSC directly instead of using the TSC_ADJUST offset can de-sync them, so in user-space it might not always be safe to assume that CPU migrations won't leave you reading a different clock. (This is why rdtscp produces a core-ID as an extra output, so you can detect when start/end times come from different clocks. It might have been introduced before the invariant TSC feature, or maybe they just wanted to account for every possibility.)

If you're using rdtsc directly, you may want to pin your program or thread to a core, e.g. with taskset -c 0 ./myprogram on Linux. Whether you need it for the TSC or not, CPU migration will normally lead to a lot of cache misses and mess up your test anyway, as well as taking extra time. (Although so will an interrupt).
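
To do the pinning from inside the program instead, a Linux-only sketch with sched_setaffinity (Windows would use SetThreadAffinityMask):

#define _GNU_SOURCE
#include <sched.h>

// Pin the calling thread to one core, like `taskset -c 0 ./myprogram`.
static int pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set);  // pid 0 = this thread
}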


How efficient is the asm from using the intrinsic?

It's about as good as you'd get from @Mysticial's GNU C inline asm, or better because it knows the upper bits of RAX are zeroed. The main reason you'd want to keep inline asm is for compat with crusty old compilers.

A non-inline version of the readTSC function itself compiles with MSVC for x86-64 like this:

unsigned __int64 readTSC(void) PROC                             ; readTSC
    rdtsc
    shl     rdx, 32                             ; 00000020H
    or      rax, rdx
    ret     0
  ; return in RAX

For 32-bit calling conventions that return 64-bit integers in edx:eax, it's just rdtsc/ret. Not that it matters, you always want this to inline.

In a test caller that uses it twice and subtracts to time an interval:

uint64_t time_something() {
    uint64_t start = readTSC();
    // even when empty, back-to-back __rdtsc() don't optimize away
    return readTSC() - start;
}

All 4 compilers make pretty similar code. This is GCC's 32-bit output:

# gcc8.2 -O3 -m32
time_something():
    push    ebx               # save a call-preserved reg: 32-bit only has 3 scratch regs
    rdtsc
    mov     ecx, eax
    mov     ebx, edx          # start in ebx:ecx
      # timed region (empty)

    rdtsc
    sub     eax, ecx
    sbb     edx, ebx          # edx:eax -= ebx:ecx

    pop     ebx
    ret                       # return value in edx:eax

This is MSVC's x86-64 output (with name-demangling applied). gcc/clang/ICC all emit identical code.

# MSVC 19  2017  -Ox
unsigned __int64 time_something(void) PROC                            ; time_something
    rdtsc
    shl     rdx, 32                  ; high <<= 32
    or      rax, rdx
    mov     rcx, rax                 ; missed optimization: lea rcx, [rdx+rax]
                                     ; rcx = start
     ;; timed region (empty)

    rdtsc
    shl     rdx, 32
    or      rax, rdx                 ; rax = end

    sub     rax, rcx                 ; end -= start
    ret     0
unsigned __int64 time_something(void) ENDP                            ; time_something

All 4 compilers use or+mov instead of lea to combine the low and high halves into a different register. I guess it's kind of a canned sequence that they fail to optimize.

But writing a shift/lea in inline asm yourself is hardly better. You'd deprive the compiler of the opportunity to ignore the high 32 bits of the result in EDX, if you're timing such a short interval that you only keep a 32-bit result. Or if the compiler decides to store the start time to memory, it could just use two 32-bit stores instead of shift/or / mov. If 1 extra uop as part of your timing bothers you, you'd better write your whole microbenchmark in pure asm.

However, we can maybe get the best of both worlds with a modified version of @Mysticial's code:

// More efficient than __rdtsc() in some case, but maybe worse in others
uint64_t rdtsc(){
    // long and uintptr_t are 32-bit on the x32 ABI (32-bit pointers in 64-bit mode), so #ifdef would be better if we care about this trick there.

    unsigned long lo,hi;  // let the compiler know that zero-extension to 64 bits isn't required
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) + lo;
    // + allows LEA or ADD instead of OR
}

On Godbolt, this does sometimes give better asm than __rdtsc() for gcc/clang/ICC, but other times it tricks compilers into using an extra register to save lo and hi separately, so clang can optimize into ((end_hi-start_hi)<<32) + (end_lo-start_lo). Hopefully if there's real register pressure, compilers will combine earlier. (gcc and ICC still save lo/hi separately, but don't optimize as well.)

But 32-bit gcc8 makes a mess of it, compiling even just the rdtsc() function itself with an actual add/adc with zeros instead of just returning the result in edx:eax like clang does. (gcc6 and earlier do ok with | instead of +, but definitely prefer the __rdtsc() intrinsic if you care about 32-bit code-gen from gcc).

Solution 3

VC++ uses an entirely different syntax for inline assembly -- but only in the 32-bit versions. The 64-bit compiler doesn't support inline assembly at all.

In this case, that's probably just as well -- rdtsc has (at least) two major problems when it comes to timing code sequences. First (like most instructions) it can be executed out of order, so if you're trying to time a short sequence of code, the rdtsc before and after that code might both be executed before it, or both after it, or what have you (I am fairly sure the two will always execute in order with respect to each other though, so at least the difference will never be negative).

Second, on a multi-core (or multiprocessor) system, one rdtsc might execute on one core/processor and the other on a different core/processor. In such a case, a negative result is entirely possible.

Generally speaking, if you want a precise timer under Windows, you're going to be better off using QueryPerformanceCounter.

If you really insist on using rdtsc, I believe you'll have to do it in a separate module written entirely in assembly language (or use a compiler intrinsic), then linked with your C or C++. I've never written that code for 64-bit mode, but in 32-bit mode it looks something like this:

   xor eax, eax
   cpuid
   xor eax, eax
   cpuid
   xor eax, eax
   cpuid
   rdtsc
   ; save eax, edx

   ; code you're going to time goes here

   xor eax, eax
   cpuid
   rdtsc

I know this looks strange, but it's actually right. You execute CPUID because it's a serializing instruction (can't be executed out of order) and is available in user mode. You execute it three times before you start timing because Intel documents the fact that the first execution can/will run at a different speed than the second (and what they recommend is three, so three it is).

Then you execute your code under test, another cpuid to force serialization, and the final rdtsc to get the time after the code finished.

Along with that, you want to use whatever means your OS supplies to force this all to run on one processor/core. In most cases, you also want to force the code alignment -- changes in alignment can lead to fairly substantial differences in execution speed.

Finally you want to execute it a number of times -- and it's always possible it'll get interrupted in the middle of things (e.g., a task switch), so you need to be prepared for the possibility of an execution taking quite a bit longer than the rest -- e.g., 5 runs that take ~40-43 clock cycles apiece, and a sixth that takes 10000+ clock cycles. Clearly, in the latter case, you just throw out the outlier -- it's not from your code.
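
For example, a sketch of the keep-the-minimum idea (time_one_run is a placeholder for your cpuid/rdtsc measurement of the code under test):

#include <stdint.h>
#include <stddef.h>

// Run the measurement many times and keep the fastest run; interrupts and
// task switches only ever inflate a run, so the minimum is the cleanest.
static uint64_t best_of(size_t trials, uint64_t (*time_one_run)(void)) {
    uint64_t best = ~(uint64_t)0;
    for (size_t i = 0; i < trials; i++) {
        uint64_t t = time_one_run();
        if (t < best) best = t;
    }
    return best;
}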

Summary: managing to execute the rdtsc instruction itself is (almost) the least of your worries. There's quite a bit more you need to do before you can get results from rdtsc that will actually mean anything.

Solution 4

Linux perf_event_open system call with config = PERF_COUNT_HW_CPU_CYCLES

This Linux system call appears to be a cross-architecture wrapper for performance events.

This answer is similar to Quick way to count number of instructions executed in a C program, but with PERF_COUNT_HW_CPU_CYCLES instead of PERF_COUNT_HW_INSTRUCTIONS. This answer focuses on PERF_COUNT_HW_CPU_CYCLES specifics; see that other answer for more generic information.

Here is an example based on the one provided at the end of the man page.

perf_event_open.c

#define _GNU_SOURCE
#include <asm/unistd.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

#include <inttypes.h>
#include <sys/types.h>

static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                int cpu, int group_fd, unsigned long flags)
{
    int ret;

    ret = syscall(__NR_perf_event_open, hw_event, pid, cpu,
                    group_fd, flags);
    return ret;
}

int
main(int argc, char **argv)
{
    struct perf_event_attr pe;
    long long count;
    int fd;

    uint64_t n;
    if (argc > 1) {
        n = strtoll(argv[1], NULL, 0);
    } else {
        n = 10000;
    }

    memset(&pe, 0, sizeof(struct perf_event_attr));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(struct perf_event_attr);
    pe.config = PERF_COUNT_HW_CPU_CYCLES;
    pe.disabled = 1;
    pe.exclude_kernel = 1;
    // Don't count hypervisor events.
    pe.exclude_hv = 1;

    fd = perf_event_open(&pe, 0, -1, -1, 0);
    if (fd == -1) {
        fprintf(stderr, "Error opening leader %llx\n", pe.config);
        exit(EXIT_FAILURE);
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* Loop n times, should be good enough for -O0. */
    __asm__ (
        "1:;\n"
        "sub $1, %[n];\n"
        "jne 1b;\n"
        : [n] "+r" (n)
        :
        :
    );

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    read(fd, &count, sizeof(long long));

    printf("%lld\n", count);

    close(fd);
}

The results seem reasonable. For example, if I print cycles and then recompile to count instructions, we get about 1 cycle per iteration (2 instructions completed in a single cycle), possibly due to effects such as superscalar execution, with slightly different results for each run, presumably due to random memory access latencies.

You might also be interested in PERF_COUNT_HW_REF_CPU_CYCLES, which as the manpage documents:

Total cycles; not affected by CPU frequency scaling.

so this will give something closer to the real wall time if your frequency scaling is on. These were 2-3x larger than PERF_COUNT_HW_INSTRUCTIONS in my quick experiments, presumably because my non-stressed machine is frequency-scaled right now.
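
Only the pe.config line of the program above needs to change; for instance (same headers and perf_event_open wrapper as above):

#include <linux/perf_event.h>
#include <string.h>

/* Same setup as the program above, but counting reference cycles, which
   tick at a fixed TSC-related frequency regardless of turbo / powersave. */
static void init_ref_cycles_attr(struct perf_event_attr *pe) {
    memset(pe, 0, sizeof(*pe));
    pe->type = PERF_TYPE_HARDWARE;
    pe->size = sizeof(*pe);
    pe->config = PERF_COUNT_HW_REF_CPU_CYCLES;
    pe->disabled = 1;
    pe->exclude_kernel = 1;
    pe->exclude_hv = 1;
}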

Solution 5

For Windows, Visual Studio provides a convenient "compiler intrinsic" (i.e. a special function, which the compiler understands) that executes the RDTSC instruction for you and gives you back the result:

unsigned __int64 __rdtsc(void);
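
For example (a minimal MSVC-only sketch, since unsigned __int64 and <intrin.h> are Microsoft-specific):

#include <intrin.h>
#include <stdio.h>

int main(void) {
    unsigned __int64 start = __rdtsc();
    // ... code under test ...
    unsigned __int64 end = __rdtsc();
    printf("elapsed TSC ticks: %llu\n", end - start);
    return 0;
}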

Comments

  • user997112
    user997112 over 2 years

    I saw this post on SO which contains C code to get the latest CPU Cycle count:

    CPU Cycle count based profiling in C/C++ Linux x86_64

    Is there a way I can use this code in C++ (windows and linux solutions welcome)? Although written in C (and C being a subset of C++) I am not too certain if this code would work in a C++ project and if not, how to translate it?

    I am using x86-64

    EDIT2:

    Found this function but cannot get VS2010 to recognise the assembler. Do I need to include anything? (I believe I have to swap uint64_t to long long for windows....?)

    static inline uint64_t get_cycles()
    {
      uint64_t t;
      __asm volatile ("rdtsc" : "=A"(t));
      return t;
    }
    

    EDIT3:

    From above code I get the error:

    "error C2400: inline assembler syntax error in 'opcode'; found 'data type'"

    Could someone please help?

    • Mark Ransom
      Mark Ransom over 11 years
      Visual Studio does not support assembly on x86-64.
    • user997112
      user997112 over 11 years
      @MarkRansom I presume you mean MSVC? I think I have the ICC compiler installed too and just to be sure I am just installing MinGW
    • Nikos C.
      Nikos C. over 11 years
      To get uint64_t you should #include <stdint.h> (actually <cstdint> but your compiler is probably too old to have that one.)
    • Mark Ransom
      Mark Ransom over 11 years
      @user997112, yes I meant MSVC. I completely forgot that you can substitute compilers in it since I've never tried it.
    • user997112
      user997112 over 11 years
      Guys, I now get the error in the edit3. I have included <stdint.h> and this is on Windows 7
    • Nik Bougalis
      Nik Bougalis over 11 years
      @MarkRansom Also, Visual Studio doesn't support gcc-style assembly ;)
    • brian beuning
      brian beuning over 11 years
You need to be careful with this. With a multi-core chip, the clock counts are different on the different cores. If the scheduler moves your thread between cores, the count can jump. Some OSes have fixed this. Some chips put cores to sleep to save power, and then that core's clock does not advance.
    • rcgldr
      rcgldr over 5 years
      @MarkRansom - to clarify for others reading this, VS doesn't support inline assembly for 64 bit builds, but it does support separate assembly source files and uses ML64.EXE for 64 bit assembly. I use custom build step to run ML64.EXE rather than use the default, command line, using x64.asm as example: "ml64 /c /Zi /Fo$(OutDir)\x64.obj x64.asm" (/Zi for debug build, no /Zi for release build), output file: "$(OutDir)\x64.obj
  • Nik Bougalis
    Nik Bougalis over 11 years
    That's a nice way to package it.
  • phonetagger
    phonetagger over 11 years
    I'm pretty sure when I was researching it, I found documentation that QueryPerformanceCounter (which is a thin veil over rdtsc) suffers from the same problem you identified on multicore/multiprocessor systems. But I think I also found documentation that this problem was a real problem on early systems because most BIOSes didn't even attempt to synchronize the counters on the different cores, but most newer BIOSes (perhaps not counting cheap junk machine BIOSes) do make that effort, so they may be off by only a few counts now.
  • phonetagger
    phonetagger over 11 years
    .... But to avoid that possibility entirely, you can set a thread's processor affinity mask so that it will run on only a single core, eliminating this issue entirely. (which I see you also mentioned)
  • Jerry Coffin
    Jerry Coffin over 11 years
    QPC can be, but isn't necessarily, a thin veil over rdtsc. At least at one time, the single-processor kernel used rdtsc, but the multiprocessor kernel used the motherboard's 1.024 MHz clock chip instead (for exactly the cited reasons).
  • jstine
    jstine over 11 years
    FWIW, gcc 4.5 and newer include __rdtsc() -- #include <x86intrin.h> get it. Header also includes many other intel intrinsics found in Microsoft's <intrin.h>, and it gets included by default these days when you include most any of the SIMD headers -- emmintrin.h, xmmintrin.h, etc.
  • Tomilov Anatoliy
    Tomilov Anatoliy about 6 years
    std::uint64_t x; asm volatile ("rdtsc" : "=A"(x)); is another way to read EAX and EDX together.
  • Peter Cordes
    Peter Cordes over 5 years
    @Orient: only in 32-bit mode. In 64-bit mode, "=A" will pick either RAX or RDX.
  • Peter Cordes
    Peter Cordes over 5 years
    Any reason you prefer inline asm for GNU compilers? <x86intrin.h> defines __rdtsc() for compilers other than MSVC, so you can just #ifdef _MSC_VER. I added an answer on this question, since it looks like a good place for a canonical about rdtsc intrinsics, and gotchas on how to use rdtsc.
  • BeeOnRope
    BeeOnRope over 5 years
    The tsc doesn't necessarily tick at the "sticker frequency", but rather at the tsc frequency. On some machines these are the same, but on many recent machines (like Skylake client and derived uarchs) they are often not. For example, my i7-6700HQ sticker frequency is 2600 MHz, but the tsc frequency is 2592 MHz. They are probably not the same in cases the different clocks they are based on can't be made to line up to exactly the same frequency when scaling the frequency by an integer. Many tools don't account for this difference leading to small errors.
  • Peter Cordes
    Peter Cordes over 5 years
    @BeeOnRope: Thanks, I hadn't realized that. That probably explains some not-quite-4GHz results I've seen from RDTSC stuff on my machine, like 4008 MHz vs. the sticker frequency of 4.0 GHz.
  • BeeOnRope
    BeeOnRope over 5 years
    On recent enough kernels you can do a dmesg | grep tsc to see both values. I get tsc: Detected 2600.000 MHz processor ... tsc: Detected 2592.000 MHz TSC. You can also use turbostat to show this.
  • Peter Cordes
    Peter Cordes over 5 years
    Yup, 4000.000 MHz processor and 4008.000 MHz TSC on i7-6700k. Nifty.
  • Mysticial
    Mysticial over 5 years
    @PeterCordes See jstine's comment. The rdtsc intrinsic didn't exist at the time.
  • Peter Cordes
    Peter Cordes over 5 years
    Yeah, I discovered that while doing more digging for my answer. I think it's safe to say that __rdtsc() should be recommended over inline asm these days, though, so your answer could use an update. (Or hopefully the OP will accept my attempt at a canonical answer; I already closed a bunch of similar questions as duplicates of this one.) Also, I played around with making lo and hi unsigned long or uintptr_t, so the compiler wouldn't have to zero-extend eax into rax, which helps, and changing | to + which leads to weird optimizations...
  • Mysticial
    Mysticial over 5 years
    @PeterCordes I'll update it tomorrow when I get the time during or between events. (my company sent me to the Hot Chips conference)
  • Peter Cordes
    Peter Cordes over 5 years
    Cool. There's nothing actually wrong with your current answer (except maybe that missed-optimization); it should continue to be future-proof.
  • BeeOnRope
    BeeOnRope about 4 years
    Just to add to this the sticker base and turbo frequency and tsc frequencies have now diverged wildly. An i5-1035 has a tsc frequency of 1.5 GHz, but a base frequency of 1.1 GHz, and a turbo frequency (not really relevant) of 3.7 GHz.
  • Some Name
    Some Name almost 4 years
As per Vol. 3, unhalted reference-cycles counting is defined to run with the core crystal clock, TSC, or the bus clock, so the TSC rate might not be the same as the reference-cycles rate.
  • Peter Cordes
    Peter Cordes over 3 years
You should probably point out that core clock cycles are different from RDTSC reference cycles. It's actual CPU cycles, not cycles of some fixed frequency, so in some cases it more accurately reflects what you want. (But it doesn't tick while the core is halted, e.g. for frequency transitions, or while asleep, so it's very much not a measure of real time, especially for a program involving I/O.)
  • Peter Cordes
    Peter Cordes over 3 years
    You measure more cycles than instructions with this program? Probably mostly measurement overhead, because the loop itself should run at 1 iteration / cycle = 2 instructions / cycle. Your default n=10000 (clock cycles) is pretty tiny, compared to system-call overheads on Linux with Spectre and Meltdown mitigations enabled. If you asked perf / PAPI to make rdpmc usable in user-space, you could use that to measure with less overhead than rdtsc (and still in CPU cycles, not ref cycles).
  • Peter Cordes
    Peter Cordes over 3 years
    Fun fact, you can get the PMU to count reference cycles for you, but that doesn't keep ticking when the clock is halted. Lost Cycles on Intel? An inconsistency between rdtsc and CPU_CLK_UNHALTED.REF_TSC
  • Ciro Santilli OurBigBook.com
    Ciro Santilli OurBigBook.com over 3 years
    @PeterCordes thanks for those pointers. Maybe PERF_COUNT_HW_REF_CPU_CYCLES does something more similar to RDTSC ("Total cycles; not affected by CPU frequency scaling.") Note that kernelland instructions should be removed by pe.exclude_kernel = 1;, 10k already seems to give representative results that vary more or less linearly with size experimentally. I would also guess that RDTSC and RDPMC don't distinguish between different processes running at the same time, though they are lower overhead than the syscall.
  • Peter Cordes
    Peter Cordes over 3 years
    Yes, the PERF_COUNT_HW_REF_CPU_CYCLES counter ticks at the same frequency as the TSC that RDTSC reads. (Except when it doesn't tick at all during frequency transitions). My point was that PERF_COUNT_HW_CPU_CYCLES which you used will similarly pause during frequency transitions, so it's not just wall-clock time in different units. The throughput cost of turbo frequency shifts while executing in user-space won't be counted by that counter, in case that ever matters.
  • Peter Cordes
    Peter Cordes over 3 years
    I didn't notice or look for pe.exclude_kernel = 1;. Yeah, that should cancel my objection about syscall overhead. Is it possible your loop ended up split across a cache-line boundary or something? You'd expect any modern Intel or AMD CPU to run it at 1 cycle / iter, 2 IPC unless there was some weird effect. Try putting .p2align 4 before the label; the loop body is less than 16 bytes so that will make sure it's not crossing any boundary and the jnz isn't touching the end of a 32-byte boundary. (JCC erratum mitigation penalty.)
  • Ciro Santilli OurBigBook.com
    Ciro Santilli OurBigBook.com over 3 years
    @PeterCordes oops, you are right, I hadn't done it carefully/mixed with other examples. Thanks!
  • Peter Cordes
    Peter Cordes over 3 years
    due to superscalar execution - technical nitpick: on Intel Sandybridge-family CPUs, it's actually due to macro-fusion in the decoders turning sub/jnz into a single dec-and-branch uop. So the back end is only executing 1 uop / cycle. And this uop comes from the uop cache, so other than initial decode, there's actually nothing superscalar going on :P (Except probably issuing groups of 4 of those uops into the back end, then idling for 3 cycles.) But if you have an AMD CPU, it will only fuse cmp or test, so that would be real superscalar execution.
  • Arty
    Arty about 3 years
    @PeterCordes Just for reference saying that my laptop's Intel Pentium T4300 with 2 cores increments RDTSC values with different frequencies, sometimes with rate of 0.48 ns/cycle (most of time), sometimes 0.96 ns/cycle (rarely when overheated). Measured this on Windows with CLang and __rdtsc() intrinsic. I bought my laptop around year 2008. Would be great if it was possible somehow to test on any CPU if this CPU supports steady rdtsc (same frequency always) or varying (changing on different cores and in time), do you know if it is possible?
  • Peter Cordes
    Peter Cordes about 3 years
    @Arty: Apparently the changeover (from core cycles to reference cycles) didn't get a CPUID feature bit, so OSes determine constant_tsc by looking at CPU model numbers, as mentioned in this answer. But the later nonstop_tsc (not halting in sleep states) implies constant_tsc, so you can at least check that. If you were writing an OS (or using more system calls), you could use power-management to control the CPU's frequency while running your calibration loop.
  • Peter Cordes
    Peter Cordes about 3 years
    @Arty: Your T4300 is a Penryn CPU, 2nd-gen Core2, so it should have constant_tsc if not nonstop_tsc. I guess thermal throttling involves deep-sleep, pausing the clock for some average duty-cycle, but low-speed idle doesn't have an effect.
  • jonadv
    jonadv almost 3 years
    I just got a bachelor degree reading this comment