Difference between rdtscp, rdtsc : memory and cpuid / rdtsc?
Solution 1
As mentioned in a comment, there's a difference between a compiler barrier and a processor barrier. `volatile` and `memory` in the asm statement act as a compiler barrier, but the processor is still free to reorder instructions.

Processor barriers are special instructions that must be given explicitly, e.g. `rdtscp`, `cpuid`, memory fence instructions (`mfence`, `lfence`, ...), etc.
As an aside, while using `cpuid` as a barrier before `rdtsc` is common, it can also be very bad from a performance perspective, since virtual machine platforms often trap and emulate the `cpuid` instruction in order to impose a common set of CPU features across multiple machines in a cluster (to ensure that live migration works). Thus it's better to use one of the memory fence instructions.
The Linux kernel uses `mfence; rdtsc` on AMD platforms and `lfence; rdtsc` on Intel. If you don't want to bother with distinguishing between these, `mfence; rdtsc` works on both, although it's slightly slower, as `mfence` is a stronger barrier than `lfence`.
Edit 2019-11-25: As of Linux kernel 5.4, lfence is used to serialize rdtsc on both Intel and AMD. See this commit "x86: Remove X86_FEATURE_MFENCE_RDTSC": https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be261ffce6f13229dad50f59c5e491f933d3167f
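As a sketch of what this kernel-style sequence looks like in user code (the wrapper name is mine; GCC/Clang inline asm on x86-64 assumed):

```c
#include <stdint.h>

/* Read the TSC with an lfence in front, as the Linux kernel does on both
 * Intel and AMD since 5.4. lfence keeps rdtsc from executing until all
 * prior instructions have completed locally; the "memory" clobber also
 * makes the asm a compiler barrier. */
static inline uint64_t rdtsc_lfence(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("lfence; rdtsc"
                         : "=a"(lo), "=d"(hi)   /* rdtsc returns EDX:EAX */
                         :
                         : "memory");
    return ((uint64_t)hi << 32) | lo;
}
```

Successive calls on one core with an invariant TSC give non-decreasing values, so the difference between two calls is a cycle count.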
Solution 2
You can use it as shown below:
```c
asm volatile (
    "CPUID\n\t"            /* serialize */
    "RDTSC\n\t"            /* read the clock */
    "mov %%edx, %0\n\t"
    "mov %%eax, %1\n\t"
    : "=r" (cycles_high), "=r" (cycles_low)
    :
    : "%rax", "%rbx", "%rcx", "%rdx");

/* Call the function to benchmark */

asm volatile (
    "RDTSCP\n\t"           /* read the clock */
    "mov %%edx, %0\n\t"
    "mov %%eax, %1\n\t"
    "CPUID\n\t"            /* serialize */
    : "=r" (cycles_high1), "=r" (cycles_low1)
    :
    : "%rax", "%rbx", "%rcx", "%rdx");
```
In the code above, the first CPUID call implements a barrier to avoid out-of-order execution of the instructions above and below the RDTSC instruction. With this method we avoid calling a CPUID instruction in between the reads of the time-stamp counter.

The first RDTSC then reads the time-stamp counter and the value is stored in memory. Then the code that we want to measure is executed. The RDTSCP instruction reads the time-stamp counter for the second time and guarantees that the execution of all the code we wanted to measure is completed. The two `mov` instructions coming afterwards store the edx and eax register values into memory. Finally, a CPUID call guarantees that a barrier is implemented again, so that no instruction coming afterwards can execute before CPUID itself.
Steve Lorimer

Updated on July 05, 2022

Comments
- Steve Lorimer almost 2 years: Assume we're trying to use the tsc for performance monitoring and we want to prevent instruction reordering.
These are our options:
1: `rdtscp` is a serializing call. It prevents reordering around the call to `rdtscp`.

```c
__asm__ __volatile__("rdtscp; "         // serializing read of tsc
                     "shl $32,%%rdx; "  // shift higher 32 bits stored in rdx up
                     "or %%rdx,%%rax"   // and or onto rax
                     : "=a"(tsc)        // output to tsc variable
                     :
                     : "%rcx", "%rdx"); // rcx and rdx are clobbered
```
However, `rdtscp` is only available on newer CPUs. So in this case we have to use `rdtsc`. But `rdtsc` is non-serializing, so using it alone will not prevent the CPU from reordering it. So we can use either of these two options to prevent reordering:
2: This is a call to `cpuid` and then `rdtsc`. `cpuid` is a serializing call.

```c
volatile int dont_remove __attribute__((unused)); // volatile to stop optimizing
unsigned tmp;
__cpuid(0, tmp, tmp, tmp, tmp);         // cpuid is a serialising call
dont_remove = tmp;                      // prevent optimizing out cpuid

__asm__ __volatile__("rdtsc; "          // read of tsc
                     "shl $32,%%rdx; "  // shift higher 32 bits stored in rdx up
                     "or %%rdx,%%rax"   // and or onto rax
                     : "=a"(tsc)        // output to tsc
                     :
                     : "%rcx", "%rdx"); // rcx and rdx are clobbered
```
3: This is a call to `rdtsc` with `memory` in the clobber list, which prevents reordering.

```c
__asm__ __volatile__("rdtsc; "          // read of tsc
                     "shl $32,%%rdx; "  // shift higher 32 bits stored in rdx up
                     "or %%rdx,%%rax"   // and or onto rax
                     : "=a"(tsc)        // output to tsc
                     :
                     : "%rcx", "%rdx", "memory"); // rcx and rdx are clobbered
                                                  // memory to prevent reordering
```
My understanding for the 3rd option is as follows:

Making the call `__volatile__` prevents the optimizer from removing the asm or moving it across any instructions that could need the results (or change the inputs) of the asm. However, it could still move it with respect to unrelated operations. So `__volatile__` is not enough.

Tell the compiler memory is being clobbered: `: "memory"`. The `"memory"` clobber means that GCC cannot make any assumptions about memory contents remaining the same across the asm, and thus will not reorder around it.

So my questions are:
- 1: Is my understanding of `__volatile__` and `"memory"` correct?
- 2: Do the second two calls do the same thing?
- 3: Using `"memory"` looks much simpler than using another serializing instruction. Why would anyone use the 3rd option over the 2nd option?
- Gunther Piez over 11 years: The `cpuid; rdtsc` is not about memory fences, it's about serializing the instruction stream. Usually it is used for benchmarking purposes to make sure no "old" instructions remain in the reorder buffer/reservation station. The execution time of `cpuid` (which is quite long, I remember >200 cycles) is then subtracted. Whether the result is more "exact" this way is not quite clear to me; I experimented with and without, and the differences seem smaller than the natural error of measurement, even in single-user mode with nothing else running at all.
- Gunther Piez over 11 years: I am not sure, but possibly the fence instructions used this way in the kernel are not useful at all ^^
- janneb over 11 years: @hirschhornsalz: According to the git commit logs, AMD and Intel confirmed that the m/lfence will serialize rdtsc on currently available CPUs. I suppose Andi Kleen can provide more details on what exactly was said, if you're interested and ask him.
- janneb over 11 years: @hirschhornsalz: ... IIRC the argument basically goes that while the fence instructions only serialize wrt. instructions that read/write memory, in practice there's no point in reordering non-memory instructions wrt rdtsc and thus it's not done, although per the architecture manual it's in principle allowed.
- Gunther Piez over 11 years: That's exactly what I think: in practice (= non-benchmarking code) there is no point in avoiding the reordering of instructions. I would even go one step further and argue that there isn't even a point in avoiding the reordering of memory instructions, since rdtsc is only used as a non-memory-dependent timer source here, and so the fences could be dropped. But I should really ask Andi :-)
- Jonatan Lindén about 8 years: Hi, it appears that you copied this answer from Gabriele Paoloni's white paper "How to Benchmark Code Execution Times on Intel® IA-32 and IA-64 Instruction Set Architectures" (you missed a line break though). You're using someone else's work without giving the author credit. Why not add an attribution?
- Edd Barrett over 7 years: Yes, indeed, it is copied. I'm also wondering if the two movs in reading the start time are necessary: stackoverflow.com/questions/38994549/…
- ExOfDe over 7 years: Is there a specific reason to have two variables, high and low?
- Joseph Garvin about 7 years: Is the memory clobber part of the asm still necessary? I notice the code in Intel's white paper makes no mention of it: intel.com/content/dam/www/public/us/en/documents/white-papers/…
- Cody Gray about 7 years: Yes, @ExOfDe, there is a reason. The `RDTSC[P]` instruction returns a 64-bit value, but it returns it in two 32-bit halves: the upper half in the `EDX` register and the lower half in the `EAX` register (as is the common convention for returning 64-bit values on 32-bit x86 systems). You can, of course, combine those two 32-bit halves into a single 64-bit value if you want, but that requires either (A) a 64-bit processor (and the `RDTSC[P]` instruction was introduced to the ISA long before 64-bit integers were natively supported), or (B) compiler/library support for 64-bit ints.
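The combination Cody Gray describes is just a shift and an or, given 64-bit integer support; a minimal sketch (helper name is mine):

```c
#include <stdint.h>

/* Combine the two 32-bit halves returned by RDTSC[P]
 * (EDX = high half, EAX = low half) into one 64-bit timestamp. */
static inline uint64_t combine_tsc_halves(uint32_t edx_hi, uint32_t eax_lo)
{
    return ((uint64_t)edx_hi << 32) | eax_lo;
}
```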
- Peter Cordes over 6 years: Are you sure that `mfence; rdtsc` on Intel really serializes the instruction stream? `lfence` is now officially / more-clearly documented as serializing (so it can be used to mitigate Spectre mis-speculation of bounds-check branches). But I'm not sure `mfence` serializes the instruction stream on Intel. (Maybe it does, but it's not clearly documented.) Fun fact: on Core2, `mfence` has better throughput than `lfence` (when that's all the machine is running, no other instructions mixed in; source: Agner Fog's tests).
- Peter Cordes over 6 years: It's probably important to use `lfence` on Intel and `mfence` on AMD; any argument about "stronger barrier" is totally inapplicable because we're talking about the instruction stream and additional micro-architectural effects, not the well-documented memory-ordering effects. For example, LFENCE isn't fully serializing on AMD: it has 4-per-clock throughput on Bulldozer-family / Ryzen! Maybe it does serialize `rdtsc` but not itself or some other instructions? Or more likely it's very cheap on AMD because their memory-ordering implementation works differently.
- Peter Cordes over 6 years: If you're going to use your own inline asm instead of a builtin/intrinsic, at least write efficient inline asm that uses constraints to tell the compiler which registers to look at, instead of using `mov` instructions.
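What "use constraints" means in practice can be sketched like this (my own minimal version; with GCC/Clang, the `__rdtsc()` intrinsic from `x86intrin.h` achieves the same without any inline asm):

```c
#include <stdint.h>

/* Let the compiler read EDX:EAX directly via the "=a"/"=d" constraints,
 * instead of copying the halves out with explicit mov instructions. */
static inline uint64_t rdtsc_via_constraints(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}
```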
- supercat about 3 years: @JosephGarvin: A "memory clobber" is an explicit notice to a compiler that a piece of code may be dependent upon memory ordering in ways the compiler should not expect to understand. Some compilers are prone to assume that memory order only matters in situations where they can see explicit reasons why it might; others assume it may matter in cases where they can't prove it doesn't. Such considerations are orthogonal to anything a processor might do with memory ordering.
- Joseph Garvin almost 3 years: @supercat: I understand that, but confusingly the kernel code linked does not use the `memory` constraint. Maybe because GCC understands it is implied by `lfence`?
- supercat almost 3 years: @JosephGarvin: If the kernel code uses lfence and mfence without memory clobbers, it likely does so because the authors thought it obvious that any quality compiler should recognize them as including an implied memory clobber; whether a gratuitously clever compiler would regard them likewise, or instead exploit the fact that more "optimizations" would be possible without a memory clobber, is a separate issue.