Getting cpu cycles using RDTSC - why does the value of RDTSC always increase?


Solution 1

As long as your thread stays on the same CPU core, the RDTSC instruction will keep returning an increasing number until it wraps around. For a 2GHz CPU, this happens after 292 years, so it is not a real issue. You probably won't see it happen. If you expect to live that long, make sure your computer reboots, say, every 50 years.

The problem with RDTSC is that you have no guarantee that it starts at the same point in time on all cores of an elderly multicore CPU, and no guarantee that it starts at the same point in time on all CPUs of an elderly multi-CPU board.
Modern systems usually do not have such problems, but the problem can also be worked around on older systems by setting a thread's affinity so it only runs on one CPU. This is not good for application performance, so one should not generally do it, but for measuring ticks, it's just fine.

(Another "problem" is that many people use RDTSC for measuring time, which is not what it does, but you wrote that you want CPU cycles, so that is fine. If you do use RDTSC to measure time, you may have surprises when power saving or hyperboost or whatever the multitude of frequency-changing techniques are called kicks in. For actual time, the clock_gettime syscall is surprisingly good under Linux.)

I would just write rdtsc inside the asm statement, which works just fine for me and is more readable than some obscure hex code. Assuming it is the correct hex code (and since it does not crash and returns an ever-increasing number, it seems to be), your code is good.

If you want to measure the number of ticks a piece of code takes, you want a tick difference; you just need to subtract two values of the ever-increasing counter. Something like uint64_t t0 = rdtsc(); ... uint64_t t1 = rdtsc() - t0;
Note that if very accurate measurements isolated from the surrounding code are necessary, you need to serialize, that is, stall the pipeline, prior to calling rdtsc (or use rdtscp, which is only supported on newer processors). The one serializing instruction that can be used at every privilege level is cpuid.
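
A minimal sketch of such a measurement, assuming GCC-style inline assembly on x86/x86-64 (the cpuid serialization and the manual EDX:EAX combination are the usual idiom, not code taken from the question):

    #include <stdint.h>
    #include <stdio.h>

    /* Serialize with CPUID, then read the TSC via the rdtsc mnemonic
     * rather than raw opcode bytes. EDX:EAX are combined by hand, so
     * this is correct in both 32-bit and 64-bit builds. */
    static inline uint64_t rdtsc_serialized(void)
    {
        uint32_t lo, hi;
        __asm__ volatile ("xor %%eax, %%eax\n\t"
                          "cpuid\n\t"        /* serialize: drain the pipeline */
                          "rdtsc"            /* EDX:EAX = time-stamp counter  */
                          : "=a" (lo), "=d" (hi)
                          : /* no inputs */
                          : "ebx", "ecx");   /* cpuid clobbers these */
        return ((uint64_t)hi << 32) | lo;
    }

    int main(void)
    {
        uint64_t t0 = rdtsc_serialized();
        /* ... code under measurement ... */
        uint64_t ticks = rdtsc_serialized() - t0;
        printf("elapsed ticks: %llu\n", (unsigned long long)ticks);
        return 0;
    }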

In reply to the further question in the comment:

The TSC starts at zero when you turn on the computer (and the BIOS resets all counters on all CPUs to the same value, though some BIOSes a few years ago did not do so reliably).

Thus, from your program's point of view, the counter started "some unknown time in the past", and it always increases with every clock tick the CPU sees. Therefore, if you execute the instruction that returns this counter now, and then again any time later (even in a different process), the later call will return a greater value (unless the CPU was suspended or turned off in between). Different runs of the same program get bigger numbers, because the counter keeps growing. Always.

Now, clock_gettime(CLOCK_PROCESS_CPUTIME_ID) is a different matter. This is the CPU time that the OS has given to the process. It starts at zero when your process starts, and a new process starts at zero, too. Thus, two processes running one after the other will get very similar or identical numbers, not ever-growing ones.

clock_gettime(CLOCK_MONOTONIC_RAW) is closer to how RDTSC works (and on some older systems it is implemented with it). It returns a value that always increases; nowadays this is typically backed by an HPET. However, this is really time, not ticks: if your computer goes into a low-power state (e.g. running at 1/2 its normal frequency), it will still advance at the same pace.
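
To see the difference between the two clocks, here is a small Linux-only sketch (an illustration, not from the answer; run it twice and note that the first value starts near zero on each run while the second keeps growing):

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        struct timespec cpu, mono;

        /* CPU time granted to this process: starts near zero each run. */
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu);
        /* Raw monotonic time: counts from some point in the past. */
        clock_gettime(CLOCK_MONOTONIC_RAW, &mono);

        printf("process CPU time: %ld.%09ld s\n", (long)cpu.tv_sec, cpu.tv_nsec);
        printf("monotonic raw:    %ld.%09ld s\n", (long)mono.tv_sec, mono.tv_nsec);
        return 0;
    }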

Solution 2

There's lots of confusing and/or wrong information about the TSC out there, so I thought I'd try to clear some of it up.

When Intel first introduced the TSC (in the original Pentium CPUs), it was clearly documented to count cycles (and not time). However, back then CPUs mostly ran at a fixed frequency, so some people ignored the documented behaviour and used it to measure time instead (most notably, Linux kernel developers). Their code broke on later CPUs that don't run at a fixed frequency (due to power management, etc.). Around that time, other CPU manufacturers (AMD, Cyrix, Transmeta, etc.) were confused: some implemented the TSC so it measured cycles, some so it measured time, and some made it configurable (via an MSR).

Then "multi-chip" systems became more common for servers; and even later multi-core was introduced. This led to minor differences between TSC values on different cores (due to different startup times); but more importantly it also led to major differences between TSC values on different CPUs caused by CPUs running at different speeds (due to power management and/or other factors).

People who had been trying to use it wrong from the start (people who used it to measure time and not cycles) complained a lot, and eventually convinced CPU manufacturers to standardise on making the TSC measure time and not cycles.

Of course this was a mess - e.g. it takes a lot of code just to determine what the TSC actually measures if you support all 80x86 CPUs, and different power management technologies (including things like SpeedStep, but also things like sleep states) may affect the TSC in different ways on different CPUs; so AMD introduced a "TSC invariant" flag in CPUID to tell the OS that the TSC can be used to measure time correctly.

All recent Intel and AMD CPUs have been like this for a while now - the TSC counts time and doesn't measure cycles at all. This means that if you want to measure cycles, you have to use (model-specific) performance monitoring counters. Unfortunately, the performance monitoring counters are an even worse mess (due to their model-specific nature and convoluted configuration).

Solution 3

Good answers already, and Damon already mentioned this in a way in his answer, but I'll add this from the actual x86 manual (Volume 2, 4-301) entry for RDTSC:

Loads the current value of the processor's time-stamp counter (a 64-bit MSR) into the EDX:EAX registers. The EDX register is loaded with the high-order 32 bits of the MSR and the EAX register is loaded with the low-order 32 bits. (On processors that support the Intel 64 architecture, the high-order 32 bits of each of RAX and RDX are cleared.)

The processor monotonically increments the time-stamp counter MSR every clock cycle and resets it to 0 whenever the processor is reset. See "Time Stamp Counter" in Chapter 17 of the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B, for specific details of the time stamp counter behavior.


Comments

  • user1106106
    user1106106 almost 2 years

    I want to get the CPU cycles at a specific point. I use this function at that point:

    static __inline__ unsigned long long rdtsc(void)
    {
        unsigned long long int x;
        __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
        // broken for 64-bit builds; don't copy this code
        return x;
    }
    

    (editor's note: "=A" is wrong for x86-64; it picks either RDX or RAX. Only in 32-bit mode will it pick the EDX:EAX output you want. See How to get the CPU cycle count in x86_64 from C++?, and the corrected sketch after this question.)

    The problem is that it always returns an increasing number (in every run). It is as if it were referring to absolute time.

    Am I using the function incorrectly?
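
    For completeness, here is a 64-bit-safe variant (an editor's sketch under the note above, not part of the original question), which reads EDX and EAX separately and combines them:

    #include <stdint.h>

    static __inline__ uint64_t rdtsc64(void)
    {
        uint32_t lo, hi;
        /* rdtsc leaves the low 32 bits in EAX and the high 32 bits in EDX;
           combining them by hand avoids the "=A" pitfall on x86-64. */
        __asm__ volatile ("rdtsc" : "=a" (lo), "=d" (hi));
        return ((uint64_t)hi << 32) | lo;
    }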

  • user1106106
    user1106106 over 12 years
    thank you for the quick reply. I don't see why I would get increasing numbers. Let's say some program calls my function (which measures CPU ticks up to that point) always at the same place (say, the 5th line in the main function). So every time he runs his program, my function should give the same number (more or less), and not an increasing number...
  • Basile Starynkevitch
    Basile Starynkevitch over 12 years
    It is an increasing number, since it is from a counter probably started at power-on or reboot time.
  • user1106106
    user1106106 over 12 years
    and another thing - if I use clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts); return ts.tv_nsec; I don't get an increasing number, but almost the same number every run
  • Necrolis
    Necrolis over 12 years
    @user1106106: that's because RDTSC is CPU-wide, while clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts) works only at the process level; i.e. RDTSC starts counting from power-up, clock_gettime(..) from the process start.
  • user1106106
    user1106106 over 12 years
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts) gets me time with microsecond resolution, and not in cycles. How do I convert it to cycles?
  • Damon
    Damon over 12 years
    struct timespec has nanosecond, not microsecond, resolution (a 1:1000 difference). Though of course, for a variety of reasons, a timer might not run at the full resolution available in the value; clock_getres tells you that (see the sketch after these comments). For example, HPET is required by the spec to provide at least 0.1µs resolution (or better). Some implementations provide nanoseconds, others don't, and they don't have to. The number you get is still in nanoseconds, however. To get clock cycles from time, you need to multiply by the clock speed. But if it is really clocks you want, use RDTSC in the first place.
  • Johan
    Johan about 8 years
    You can use it to measure cycles. Just make sure you run the CPU at 100% by loading it with work.
  • Peter Cordes
    Peter Cordes over 7 years
    The funny thing is that it counts time in "reference cycles", and runs at the CPU's rated clock speed (i.e. if it's sold as 2.4GHz CPU, that's the RDTSC count frequency). To measure core clock cycles, use performance counters to measure unhalted_core_cycles or something.
  • Peter Cordes
    Peter Cordes over 7 years
    On modern CPUs, RDTSC does measure time, in reference cycles. On CPUs where the CPUID includes tsc_invariant and nonstop_tsc, the gettimeofday system call is implemented in user-space (VDSO page) in terms of RDTSC. (and so is clock_gettime for some clk_id values, I assume). CPU manufacturers decided that having a very-low-overhead timesource was more valuable than having RDTSC as a benchmarking tool, so they changed it, and you will have problems on CPUs from ~2005(?) and later if you want to measure cycles with it. But you can use performance counters for that.
  • Damon
    Damon over 7 years
    @PeterCordes: That's all nice and well, however... my Skylake-gen CPU (which I would consider "modern") definitely does not measure time with RDTSC, nor are its cores synchronized. The same code returns the same number of "ticks" regardless of performance level (once with, and once without, "warming up" the CPU), that is, at higher performance levels the ticks must be shorter. Also, I have experienced "time travel" artefacts (allegedly fixed in ~2005, too, but definitely present now).
  • Peter Cordes
    Peter Cordes over 7 years
    Hmm, that's surprising. I always just use perf counters, not RDTSC. Is the thing you're testing bottlenecked on RAM? That would explain taking the same amount of real time regardless of CPU frequency, since only L2 and L1 caches scale with core clock speed, not L3 or RAM. Otherwise IDK. This recent article mentions SKL without mentioning any differences for it, and goes into detail about using the TSC as a timesource and working out the conversion from ticks to nanosecs.
  • Peter Cordes
    Peter Cordes over 7 years
    Skew between cores is probably from Linux adjusting the TSC (maybe to keep the local clock in sync with an NTP server)?
  • Robin F.
    Robin F. almost 6 years
    this helped me to clear up the confusion from the comments above: stackoverflow.com/a/11060619/5242207
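
As a footnote to Damon's comment about clock_getres above, here is a minimal Linux sketch (illustrative only; the reported resolution varies by system):

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        struct timespec res;
        /* Ask the kernel how fine-grained this clock actually is. */
        clock_getres(CLOCK_PROCESS_CPUTIME_ID, &res);
        printf("CLOCK_PROCESS_CPUTIME_ID resolution: %ld ns\n", res.tv_nsec);
        return 0;
    }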