How can I do a CPU cache flush in x86 Windows?


Solution 1

Fortunately, there is more than one way to explicitly flush the caches.

The instruction "wbinvd" writes back modified cache content and marks the caches empty. It executes a bus cycle to make external caches flush their data. Unfortunately, it is a privileged instruction. But if it is possible to run the test program under something like DOS, this is the way to go. This has the advantage of keeping the cache footprint of the "OS" very small.

Additionally, there is the "invd" instruction, which invalidates caches without flushing them back to main memory. This violates the coherency of main memory and cache, so you have to take care of that by yourself. Not really recommended.

For benchmarking purposes, the simplest solution is probably copying a large memory block to a region marked with WC (write combining) instead of WB. The memory mapped region of the graphics card is a good candidate, or you can mark a region as WC by yourself via the MTRR registers.

You can find some resources about benchmarking short routines at Test programs for measuring clock cycles and performance monitoring.

Solution 2

There are x86 instructions that force the CPU to flush specific cache lines (such as CLFLUSH), but they are fairly obscure. CLFLUSH flushes only the cache line containing a chosen address, but it removes that line from all levels of cache (L1, L2, L3).
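As a rough illustration of the CLFLUSH approach, the sketch below flushes every cache line that overlaps a buffer, using the `_mm_clflush` and `_mm_mfence` compiler intrinsics. The helper name `flush_range` and the 64-byte line size are assumptions made for this example; the line size a given CPU really uses is reported by CPUID.

```c
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* _mm_clflush, _mm_mfence (SSE2) */

/* Assumed cache line size. 64 bytes is common on current x86 CPUs,
   but CPUID reports the real CLFLUSH line size. */
#define CACHE_LINE 64

/* Hypothetical helper: flush every cache line overlapping [p, p + len).
   Modified lines are written back to memory, so the data is preserved. */
void flush_range(const void *p, size_t len)
{
    if (len == 0)
        return;

    /* Round the start address down to a line boundary. */
    uintptr_t addr = (uintptr_t)p & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)p + len;

    _mm_mfence();  /* order earlier stores before the flushes */
    for (; addr < end; addr += CACHE_LINE)
        _mm_clflush((const void *)addr);
    _mm_mfence();  /* wait for the flushes before continuing */
}
```

Calling `flush_range(data, size)` on the benchmark's input right before the timed region should make the first accesses miss in cache. It only affects the addresses you pass in, so code, stack, and any other data stay cached.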

"something as sneaky as doing say a large memcopy?"

Yes, this is the simplest approach, and will make sure that the CPU flushes all levels of cache. Just exclude the cache flushing time from your benchmarks and you should get a good idea of how your program performs under cache pressure.
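A minimal sketch of the large-memory-operation approach could look like the following. The helper name `evict_caches`, the 64-byte stride, and the idea of making more than one pass are assumptions for illustration. The buffer should be several times larger than the last-level cache, and because cache replacement policies are not strictly LRU, this makes eviction likely rather than guaranteed.

```c
#include <stddef.h>
#include <stdlib.h>

/* Assumed stride: one write per 64-byte cache line. */
#define LINE_STRIDE 64

/* Hypothetical helper: write to one byte in every cache line of a large
   temporary buffer so that previously cached data is pushed out.
   Returns 0 on success, -1 if the buffer could not be allocated. */
int evict_caches(size_t bytes, int passes)
{
    if (bytes == 0 || passes <= 0)
        return 0;

    /* volatile keeps the compiler from optimizing the writes away. */
    volatile unsigned char *buf = malloc(bytes);
    if (buf == NULL)
        return -1;

    for (int p = 0; p < passes; p++)
        for (size_t i = 0; i < bytes; i += LINE_STRIDE)
            buf[i] = (unsigned char)(i + (size_t)p);

    free((void *)buf);
    return 0;
}
```

For example, `evict_caches((size_t)64 * 1024 * 1024, 2)` touches 64 MiB twice. Call it before starting the timer so that the eviction itself is not part of the measurement.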

Solution 3

There is unfortunately no way to explicitly flush the cache. A few of your options are:

1.) Thrash the cache by doing some very large memory operations between iterations of the code you're benchmarking.

2.) Set the Cache Disable (CD) bit in the x86 control register CR0 and benchmark with caching turned off. This will probably disable the instruction cache as well, which may not be what you want.

3.) Implement the portion of your code you're benchmarking (if possible) using non-temporal instructions. These are only hints to the processor about cache usage, though; it is still free to do what it wants.

1 is probably the easiest and sufficient for your purposes.

Edit: Oops, I stand corrected: there is an instruction to invalidate the x86 cache; see drhirsch's answer.

Solution 4

The x86 instruction WBINVD writes back and invalidates all caches. It is described as:

Writes back all modified cache lines in the processor’s internal cache to main memory and invalidates (flushes) the internal caches. The instruction then issues a special-function bus cycle that directs external caches to also write back modified data and another bus cycle to indicate that the external caches should be invalidated.

Importantly, the instruction can only be executed in ring 0, i.e. by the operating system kernel, so your userland programs can't simply use it. On Linux, you can write a kernel module that executes the instruction on demand. In fact, someone has already written such a kernel module: https://github.com/batmac/wbinvd

Luckily, the kernel module's code is really tiny, so you can actually check it before loading code from strangers on the internet into your kernel. You can use that module (and trigger executing the WBINVD instruction) by reading /proc/wbinvd, for example via cat /proc/wbinvd.
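If you would rather trigger the flush from inside a benchmark program than from a shell, the same read can be done in C. This is only a sketch: the helper name `trigger_wbinvd` is made up for this example, and it assumes, as described above, that the module is loaded and runs WBINVD whenever its proc file is read.

```c
#include <stdio.h>

/* Path from the text above; it exists only while the module is loaded. */
#define WBINVD_PROC_PATH "/proc/wbinvd"

/* Hypothetical helper: read the proc file once, like `cat` would.
   Returns 0 if the file could be opened and read, -1 otherwise
   (for example when the module is not loaded). */
int trigger_wbinvd(const char *path)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    char buf[64];
    /* The content is irrelevant; the read itself is the trigger. */
    (void)fread(buf, 1, sizeof buf, f);

    fclose(f);
    return 0;
}
```

A benchmark would call `trigger_wbinvd(WBINVD_PROC_PATH)` before each timed run and treat a return value of -1 as "module not available".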

However, I found that this instruction (or at least this kernel module) is really slow. On my i7-6700HQ I measured it to take 750µs! This number seems really high to me, so I might have made a mistake measuring this; please keep that in mind! The documentation of the instruction just says:

The amount of time or cycles for WBINVD to complete will vary due to size and other factors of different cache hierarchies.

Author by

user183135

Updated on November 22, 2021

Comments

  • user183135
    user183135 over 2 years

    I am interested in forcing a CPU cache flush in Windows (for benchmarking reasons, I want to emulate starting with no data in CPU cache), preferably a basic C implementation or Win32 call.

    Is there a known way to do this with a system call or even something as sneaky as doing say a large memcpy?

    Intel i686 platform (P4 and up is okay as well).

  • marr75
    marr75 over 14 years
    "will make sure that the CPU flushes all levels of cache." Not true, as I stated, modern commercial cpus, especially when abstracted by an operating system, can (and probably do) have very complicated caching strategies.
  • intgr
    intgr over 14 years
    I believe you are confusing the CPU cache with other OS-level caches. The OS has basically no say in what the CPU will cache or not cache, because these decisions need to happen so quickly, there is no time for kernel interrupts or anything of the like. CPU cache is implemented purely in silicon.
  • intgr
    intgr over 14 years
    A context switch will indeed let other processes run and thereby pollute the cache. But this is normal part of OS behavior -- it will take place with or without the benchmark, so it makes sense to include this in your timings anyway.
  • Falaina
    Falaina over 14 years
    Ohh, I stand corrected. Neat, I didn't know about this instruction.
  • Gunther Piez
    Gunther Piez over 14 years
    Your claim that there is no instruction for cache flushing is wrong. And rewriting a routine using non temporal instructions for benchmarking is nonsense. If the data the routine is using fits in the caches, it would run way slower during the benchmarking, making the measurements worthless.
  • marr75
    marr75 over 14 years
    There is no way to explicitly flush the cache from windows. You are denied direct access to the hardware... there are non-portable assembly instructions that can do it.
  • Gunther Piez
    Gunther Piez over 14 years
    You can easily do it in Windows 95,98, ME. And even for the modern windows variants you can implement it in ring 0 using a driver.
  • Falaina
    Falaina over 14 years
    @drhirsch While I do stand corrected on the instruction for flushing the cache (thanks!), I disagree with your assessment of the use of non-temporal instructions. If he did the initial data loads for his benchmark using non-temporal instructions it isn't that much different from running with an empty cache and would be a sufficient way to simulate cold cache misses (though, I admit not nearly as correct as using the flush instruction!)
  • Gunther Piez
    Gunther Piez over 14 years
    I apologize, I was a bit harsh. But you can't modify a program using non-temporal instructions to simulate cold cache behavior for benchmarking. 1) You would need to unroll exactly one loop iteration and make it non-temporal, thus changing the control flow and the usage of the instruction cache. 2) If the data resides in cache before the start, even non-temporal instructions will load the data from the cache, and you will get a warm cache result. 3) If not, the second iteration will need to fetch the data from memory again, and you will get a result with doubled memory latencies.
  • unixman83
    unixman83 over 12 years
    The wbinvd instruction takes on the order of 2000-5000 clock cycles to complete! Most instructions take 2-5, on average.
  • Michael Boyer
    Michael Boyer almost 10 years
    The CLFLUSH instruction does not flush only the L1 cache. From the Intel x86-64 reference manual: "The CLFLUSH (flush cache line) instruction writes and invalidates the cache line associated with a specified linear address. The invalidation is for all levels of the processor’s cache hierarchy, and it is broadcast throughout the cache coherency domain."
  • Lukas Kalbertodt
    Lukas Kalbertodt almost 5 years
    Note: I know that this question is asking about Windows. However, it is linked from many places that are not talking about a specific OS, so I thought mentioning the kernel module makes sense.
  • Peter Cordes
    Peter Cordes almost 5 years
    Does wbinvd inside virtual8086 mode (e.g. a DOS program under 32-bit Windows) actually affect the host CPU? cli gets virtualized like other privileged instructions. (And BTW, invd is more than just "not really recommended", unless that's understatement for humour. You must not use invd except for cases like leaving cache-as-RAM mode; an interrupt handler could have just dirtied cache a couple cycles before you execute it on this or another core, causing it to corrupt the OS's state by discarding that store.)
  • Peter Cordes
    Peter Cordes almost 5 years
    x86 doesn't have general-purpose non-temporal loads. SSE4 movntdqa loads are only special when reading from WC memory, not normal write-back (WB) memory regions. (The manual says the NT hint may be ignored; that is the case on all current implementations except for reading from WC memory, e.g. for copying from video RAM to main memory.)
  • Ana Khorguani
    Ana Khorguani about 4 years
    Hi, I was wondering if you have checked as well if this kernel module invalidates L1 and L2 cache of all the cores? As Intel documentation says, non-shared caches may not be written back nor invalidated. Basically that figure shows that only private L1, L2 of the core and shared L3 will be written back and invalidated, but other cores L1 and L2 won't. However, when I tested this kernel module, I observed that it invalidates L1 and L2 of other cores as well.
  • Ana Khorguani
    Ana Khorguani about 4 years
    I was wondering if there is a loop calling the wbinvd instruction for each core? I'm not sure how to check that. Otherwise I am confused about how this module's wbinvd does something that is more or less not provided by the instruction itself.
  • Lukas Kalbertodt
    Lukas Kalbertodt about 4 years
    @AnaKhorguani I don't know which caches are flushed exactly, sorry. I assumed all caches (including L1 and L2) are flushed, but I am not sure. And no idea about your core question either, sorry!
  • Ana Khorguani
    Ana Khorguani about 4 years
    ok, thanks anyway. In the code there is a function call wbinvd_on_all_cpus. I was not able to find the implementation itself, but I assume it calls wbinvd for all the cores, though I might check with the module author himself then :)