Is there a way to flush the entire CPU cache related to a program?

c++ assembly memory optimization cpu-cache

12,966

Solution 1

For links to related questions about clearing caches (especially on x86), see the first answer on WBINVD instruction usage.

No, you cannot do this reliably or efficiently with pure ISO C++17. It doesn't know or care about CPU caches. The best you could do is touch a lot of memory so everything else ends up getting evicted¹, but this is not what you're really asking for. (Of course, flushing all cache is by definition inefficient...)

CPU cache management functions / intrinsics / asm instructions are implementation-specific extensions to the C++ language. But other than inline asm, no C or C++ implementations that I'm aware of provide a way to flush all cache, rather than a range of addresses. That's because it's not a normal thing to do.

On x86, for example, the asm instruction you're looking for is wbinvd. It writes back any dirty lines before evicting, unlike invd (which drops cache without write-back, useful when leaving cache-as-RAM mode). So in theory wbinvd has no architectural effect, only microarchitectural, but it's so slow that it's a privileged instruction. As Intel's insn ref manual entry for wbinvd points out, it will increase interrupt latency, because it is not itself interruptible and may have to wait for 8 MiB or more of dirty L3 cache to be flushed. i.e. delaying interrupts for that long can be considered an architectural effect, unlike most timing effects. It's also complicated on a multi-core system because it has to flush caches for all cores.

I don't think there's any way to use it in user-space (ring 3) on x86. Unlike cli / sti and in/out, it's not enabled by the IO-privilege level (which you can set on Linux with an iopl() system call). So wbinvd only works when actually running in ring 0 (i.e. in kernel code). See Privileged Instructions and CPU Ring Levels.

But if you're writing a kernel (or freestanding program that runs in ring0) in GNU C or C++, you could use asm("wbinvd" ::: "memory");. On a computer running actual DOS, normal programs run in real mode (which doesn't have any lower-privilege levels; everything is effectively kernel). That would be another way to run a microbenchmark that needs to run privileged instructions to avoid kernel<->userspace transition overhead for wbinvd, and also has the convenience of running under an OS so you can use a filesystem. Putting your microbenchmark into a Linux kernel module might be easier than booting FreeDOS from a USB stick or something, though. Especially if you want control of turbo frequency stuff.

The only reason I can think of that you might want this is for some kind of experiment to figure out how the internals of a specific CPU are designed. So the details of exactly how it's done are critical. It doesn't make sense to me to even want a portable / generic way to do this.

Or maybe in a kernel before reconfiguring physical memory layout, e.g. so there's now an MMIO region for an ethernet card where there used to be normal DRAM. But in that case your code is already totally arch-specific.

Normally when you want / need to flush caches for correctness reasons, you know which address range needs flushing. e.g. when writing drivers on architectures with DMA that isn't cache coherent, so write-back happens before a DMA read, and doesn't step on a DMA write. (And the eviction part is important for DMA reads, too: you don't want the old cached value). But x86 has cache-coherent DMA these days, because modern designs build the memory controller into the CPU die so system traffic can snoop L3 on the way from PCIe to memory.

The major case outside of drivers where you need to worry about caches is with JIT code-generation on non-x86 architectures with non-coherent instruction caches. If you (or a JIT library) write some machine code into a char[] buffer and cast it to a function pointer, architectures like ARM don't guarantee that code-fetch will "see" that newly-written data.

This is why gcc provides __builtin___clear_cache. It doesn't necessarily flush anything, only makes sure it's safe to execute that memory as code. x86 has instruction caches that are coherent with data caches and supports self-modifying code without any special syncing instructions. See godbolt for x86 and AArch64, and note that __builtin___clear_cache compiles to zero instructions for x86, but has an effect on surrounding code: without it, gcc can optimize away stores to a buffer before casting to a function pointer and calling. (It doesn't realize that data is being used as code, so it thinks they're dead stores and eliminates them.)
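As a rough illustration of the JIT case described above, here is a minimal sketch. It assumes GNU C++ (GCC or Clang), a POSIX system with mmap, and a 4 KiB page size. The helper name run_jitted_constant and the six machine-code bytes (x86-64 only) are made up for this example; on other targets, or if the OS refuses the mapping, the helper simply returns an empty optional.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <optional>
#include <sys/mman.h>   // POSIX: mmap, mprotect, munmap

// Hypothetical helper: copies a tiny function into a fresh page and calls it.
// Returns std::nullopt on non-x86-64 builds or if the OS refuses the mapping.
std::optional<int> run_jitted_constant() {
#if defined(__x86_64__)
    // x86-64 machine code for:  mov eax, 42 ; ret
    static const unsigned char code[] = {0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3};
    const std::size_t page = 4096;          // assumed page size
    assert(sizeof code <= page);

    void* mem = mmap(nullptr, page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return std::nullopt;

    char* buf = static_cast<char*>(mem);
    std::memcpy(buf, code, sizeof code);

    // Emits no instructions on x86, but tells the compiler these stores are
    // live. On targets with non-coherent I-caches (e.g. ARM) this is where
    // the actual synchronization would be emitted.
    __builtin___clear_cache(buf, buf + sizeof code);

    // Flip the page from writable to executable before calling into it.
    if (mprotect(mem, page, PROT_READ | PROT_EXEC) != 0) {
        munmap(mem, page);
        return std::nullopt;
    }
    auto fn = reinterpret_cast<int (*)()>(mem);
    int result = fn();
    munmap(mem, page);
    return result;
#else
    return std::nullopt;                    // byte sequence above is x86-64 only
#endif
}
```

Note that the builtin takes a begin/end address pair, which is exactly why it cannot serve as a "flush everything" primitive.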

Despite the name, __builtin___clear_cache is totally unrelated to wbinvd. It needs an address-range as args so it's not going to flush and invalidate the entire cache. It also doesn't use clflush, clflushopt, or clwb to actually write-back (and optionally evict) data from cache.

When you need to flush some cache for correctness, you only want to flush a range of addresses, not slow the system down by flushing all the caches.

It rarely if ever makes sense to intentionally flush caches for performance reasons, at least on x86. Sometimes you can use pollution-minimizing prefetch to read data without as much cache pollution, or use NT stores to write around cache. But doing "normal" stuff and then clflushopt after touching some memory for the last time is generally not worth it in normal cases. Like a store, it has to go all the way through the memory hierarchy to make sure it finds and flushes any copy of that line anywhere.

There isn't a light-weight instruction designed as a performance hint, like the opposite of _mm_prefetch.

The only cache-flushing you can do in user-space on x86 is with clflush / clflushopt. (Or with NT stores, which also evict the cache line if it was hot beforehand). Or of course creating conflict evictions for known L1d size and associativity, like writing to multiple lines at multiples of 4 KiB which all map to the same set in a 32 KiB / 8-way L1d.
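The conflict-eviction idea can be sketched roughly as follows. This assumes the geometry named above (32 KiB, 8-way L1d) plus 64-byte lines, which gives 64 sets and a 4 KiB stride between addresses that share a set. The helper name touch_conflicting_lines and the choice to touch twice as many lines as there are ways are assumptions made for illustration. Replacement is typically pseudo-LRU, so eviction is likely but not guaranteed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed geometry: 32 KiB, 8-way L1d with 64-byte lines, so addresses
// 4 KiB apart index the same set.
constexpr std::size_t kStride = 4096;
constexpr std::size_t kWays   = 8;

// Hypothetical helper: write to 2 * kWays lines that alias the L1d set
// holding `target`. Returns the number of conflicting lines touched.
std::size_t touch_conflicting_lines(const void* target,
                                    std::vector<unsigned char>& scratch) {
    assert(target != nullptr);
    const std::size_t lines = 2 * kWays;        // margin for pseudo-LRU
    scratch.assign((lines + 1) * kStride, 0);   // one extra page of slack

    // Offset into scratch whose low 12 address bits match target's.
    const auto t    = reinterpret_cast<std::uintptr_t>(target);
    const auto base = reinterpret_cast<std::uintptr_t>(scratch.data());
    const std::size_t skew = (t - base) % kStride;

    // volatile so the compiler cannot drop or merge the stores.
    volatile unsigned char* p = scratch.data() + skew;
    for (std::size_t i = 0; i < lines; ++i) {
        p[i * kStride] = static_cast<unsigned char>(i);
    }
    return lines;
}
```

Calling touch_conflicting_lines(&x, scratch) leaves x itself untouched; it only competes for the cache set that x's line lives in. For L2 or L3 the same trick needs physical-address control (e.g. hugepages), which this sketch does not attempt.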

There's an Intel intrinsic _mm_clflush(void const *p) wrapper for clflush (and another for clflushopt), but these can only flush cache lines by (virtual) address. You could loop over all the cache lines in all the pages your process has mapped... (But that can only flush your own memory, not cache lines that are caching kernel data, like the kernel stack for your process or its task_struct, so the first system-call will still be faster than if you had flushed everything).
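A minimal sketch of flushing a range by address with that intrinsic might look like this. The 64-byte line size, the helper name flush_range, and the surrounding _mm_mfence calls (a conservative ordering choice) are assumptions for this example. On builds without SSE2 the helper does nothing and returns 0.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>  // _mm_clflush, _mm_mfence
#endif

constexpr std::size_t kCacheLine = 64;  // assumed cache-line size

// Hypothetical helper: flush every cache line overlapping [p, p + n).
// Returns the number of lines flushed (0 when built without SSE2).
std::size_t flush_range(const void* p, std::size_t n) {
    assert(p != nullptr || n == 0);
    std::size_t flushed = 0;
#if defined(__SSE2__)
    if (n == 0) return 0;
    auto addr = reinterpret_cast<std::uintptr_t>(p);
    const std::uintptr_t end = addr + n;
    addr &= ~static_cast<std::uintptr_t>(kCacheLine - 1);  // round down to a line

    _mm_mfence();  // order earlier stores before the flushes
    for (; addr < end; addr += kCacheLine) {
        _mm_clflush(reinterpret_cast<const void*>(addr));
        ++flushed;
    }
    _mm_mfence();  // wait for the flushes before later accesses
#else
    (void)p;
    (void)n;
#endif
    return flushed;
}
```

The data is unchanged afterwards: clflush writes dirty lines back before invalidating them, so only the timing of the next access differs.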

There's a Linux system call wrapper to portably evict a range of addresses: cacheflush(char *addr, int nbytes, int flags). Presumably the implementation on x86 uses clflush or clflushopt in a loop, if it's supported on x86 at all. The man page says it first appeared in MIPS Linux "but nowadays, Linux provides a cacheflush() system call on some other architectures, but with different arguments."

I don't think there's a Linux system call that exposes wbinvd, but you could write a kernel module that adds one.

Recent x86 extensions introduced more cache-control instructions, but still only by address to control specific cache lines. The use-case is for non-volatile memory attached directly to the CPU, such as Intel Optane DC Persistent Memory. If you want to commit to persistent storage without making the next read slow, you can use clwb. But note that clwb is not guaranteed to avoid eviction, it's merely allowed to. It might run the same as clflushopt, as may be the case on Skylake-X (SKX).

See https://danluu.com/clwb-pcommit/, but note that pcommit isn't required: Intel decided to simplify the ISA before releasing any chips that need it, so clwb or clflushopt + sfence are sufficient. See https://software.intel.com/en-us/blogs/2016/09/12/deprecate-pcommit-instruction.

Anyway, this is the kind of cache-control that's relevant for modern CPUs. Whatever experiment you're doing requires ring0 and assembly on x86.

Footnote 1: Touching a lot of memory: pure ISO C++17

You could maybe allocate a very large buffer and then memset it (so those writes will pollute all the (data) caches with that data), then unmap it. If delete or free actually returns the memory to the OS right away, then it will no longer be part of your process's address space, so only a few cache lines of other data will still be hot: probably a line or two of stack (assuming you're on a C++ implementation that uses a stack, as well as running programs under an OS...). And of course this only pollutes data caches, not instruction caches, and as Basile points out, some levels of cache are private per-core, and OSes can migrate processes between CPUs.

Also, beware that using an actual memset or std::fill function call, or a loop that optimizes to that, could be optimized to use cache-bypassing or pollution-reducing stores. And I also implicitly assumed that your code is running on a CPU with write-allocate caches, instead of write-through on store misses (because all modern CPUs are designed this way). x86 supports WT memory regions on a per-page basis, but mainstream OSes use WB pages for all "normal" memory.

Doing something that can't optimize away and touches a lot of memory (e.g. a prime sieve with a long array instead of a bitmap) would be more reliable, but of course still dependent on cache pollution to evict other data. Just reading large amounts of data isn't reliable either; some CPUs implement adaptive replacement policies that reduce pollution from sequential accesses, so looping over a big array hopefully doesn't evict lots of useful data. E.g. the L3 cache in Intel IvyBridge and later does this.
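For what it's worth, a best-effort version of this footnote in portable C++17 could look like the sketch below. It is not a real flush. It uses volatile accesses instead of a prime sieve so the compiler cannot remove the loops or turn them into a memset call. The helper name pollute_data_caches and the 64-byte line size are assumptions, and the caller has to pick a size larger than the biggest cache because ISO C++ has no way to query it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: write, then read, one byte per (assumed) cache line of
// a freshly allocated buffer. Returns a checksum so the reads have a use.
std::uint64_t pollute_data_caches(std::size_t bytes) {
    constexpr std::size_t kLine = 64;        // assumed cache-line size
    std::vector<unsigned char> buf(bytes);   // zero-initialized

    // volatile accesses must be performed exactly as written.
    volatile unsigned char* p = buf.data();
    for (std::size_t i = 0; i < bytes; i += kLine) {
        p[i] = static_cast<unsigned char>(i / kLine);
    }

    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < bytes; i += kLine) {
        sum += p[i];
    }
    return sum;                              // buf is freed on return
}
```

A call such as pollute_data_caches(std::size_t{256} << 20) would touch 256 MiB. All the caveats above still apply: instruction caches are untouched, other cores' private caches are untouched, and adaptive replacement policies may protect some lines.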

Solution 2

The answer is no, there is no standard C++ way to do this (even with some compiler intrinsics). GCC has __builtin___clear_cache and __builtin_prefetch and Clang probably has them also.

As Johan commented, x86-64 has a privileged instruction for doing what you want, but __builtin___clear_cache doesn't use it (and is a no-op on x86-64, because instruction caches are coherent with data caches on that architecture so hardware takes care of syncing recently-stored data before executing it as code).

On Linux, you might (perhaps) use the Linux-specific cacheflush(2) system call. I have never used it, and I don't know if it is implemented on x86-64.

BTW, you should reason about processes, not programs. Each process has its own virtual address space.

Your question lacks some motivation. If you care about micro-benchmarking, be aware that the kernel scheduler is allowed to reschedule your thread or process and move it to some other core at any machine code instruction (however, processor affinity lets you constrain that).
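On Linux, processor affinity can be set from inside the process. The sketch below assumes Linux with glibc-style sched.h and g++ (which defines _GNU_SOURCE by default, exposing the CPU_* macros). The helper name pin_to_first_allowed_cpu is made up for this example.

```cpp
#include <cassert>
#include <sched.h>   // Linux-specific: sched_getaffinity, sched_setaffinity

// Hypothetical helper: pin the calling thread to the first CPU it is already
// allowed to run on. Returns that CPU's index, or -1 on failure.
int pin_to_first_allowed_cpu() {
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof allowed, &allowed) != 0) return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &allowed)) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            // pid 0 means "the calling thread".
            if (sched_setaffinity(0, sizeof one, &one) != 0) return -1;
            assert(CPU_COUNT(&one) == 1);
            return cpu;
        }
    }
    return -1;
}
```

Pinning keeps the thread on one core's private caches, but other processes and interrupt handlers can still run on that core and evict lines.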

(the function should work regardless of compiler optimizations)?

No, optimizing compilers reorder and reschedule machine code instructions and often interleave computations that belong to different C++ statements. They are also allowed to do some computations at compile-time. Read more about the as-if rule. See CppCon 2017 talk: Matt Godbolt “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid”.


Author by

Vincent

Researcher, astrophysicist, computer scientist, programming language expert, software architect and C++ standardization committee member. LinkedIn: https://www.linkedin.com/in/vincent-reverdy

Updated on June 22, 2022

Comments

Vincent almost 2 years
On x86-64 platforms, the CLFLUSH assembly instruction flushes the cache line corresponding to a given address. Instead of flushing the cache related to a specific address, would there be a way to flush the entire cache (either the cache related to the program being executed, or the entire cache), for example by filling it with dummy contents (or any other approach I am not aware of):
- using only standard C++17?
- using standard C++17 and compiler intrinsics if necessary?
What would be the contents of the following function: (the function should work regardless of compiler optimizations)?
```
void flush_cache() 
{
    // Contents
}
```
Peter Cordes about 6 years

__builtin___clear_cache and cacheflush(2) both take a range of virtual addresses, so they don't help at all vs. _mm_clflush(void*). Also, __builtin___clear_cache is a no-op on x86, because its semantic meaning is to make a buffer of JITed machine-code safe to execute. I edited to make the answer not actually wrong, but you should probably just remove the first section and leave the good 2nd section. (My answer covers the details that your first section tried to answer.)
Patrick almost 4 years

Can you describe how the stated cache eviction would be performed? For example, if I want to evict the cache line where a given (virtual) memory address A is mapped to, I would just access random addresses at A+i*64KiB where i=1,..,N?
Peter Cordes almost 4 years

@Patrick: Yes. For L2 cache, use a hugepage (which guarantees contiguous physical memory) so virtual stride = physical stride. For L1d cache, the necessary stride to alias the same set is typically 4 KiB, so the same offset in any page is fine. Cache replacement is typically pseudo-LRU not true-LRU, so touching 8 other lines isn't guaranteed to evict the oldest line, but it's probably fine in practice. (If this strategy is worth even considering for any given application where the producer can waste a huge amount of extra time to help the next reader a bit).
Peter Cordes almost 4 years

@Patrick: It doesn't have to be random addresses, sequential is fine. It's actually good if HW prefetch notices the strided read access pattern, although that won't really matter if you do them all at once. The next address calculation doesn't depend on the previous load so memory-level parallelism in the CPU core can have 8 to 12 demand loads outstanding. (I assume loads are as good as stores for forcing eviction, unless there's some bias in the cache towards evicting clean lines over dirty lines.) If you have multiple such producers, they can all share a read set that can stay hot in L3.
Patrick almost 4 years

Thanks for the explanations! I am trying to evict a certain address from the cache that refers to an element X of an array. So if I want to evict it from L1D I would just compute the offset of this element X with respect to the page and then use this to access many other pages with same offset until the set is full and the oldest entry is evicted, right?
Peter Cordes almost 4 years

@Patrick: Yes, but usually the set is always full with some data. Also eviction is normally pseudo-LRU not true-LRU. (algorithm LRU, how many bits needed for implement this algorithm?)
Bogdan Mart over 3 years

It's also possible to use this kernel module on Linux: github.com/batmac/wbinvd