Aligning to cache line and knowing the cache line size

c linux caching computer-architecture memory-alignment

73,420

Solution 1

To find the line size, look it up in the documentation for your processor; AFAIK there is no portable programmatic way to query it. On the plus side, most cache lines have a standard size based on Intel's conventions: on x86, cache lines are 64 bytes. To prevent false sharing, however, follow the guidelines for the processor you are targeting (Intel has some special notes on its NetBurst-based processors). Generally you need to align to 64 bytes for this (Intel also states that you should avoid crossing 16-byte boundaries).

To do this in C or C++, use the standard aligned_alloc function or one of the compiler-specific specifiers such as __attribute__((aligned(64))) (GCC) or __declspec(align(64)) (MSVC). To pad between members in a struct so that they land on different cache lines, insert a member big enough to push the next one to the next 64-byte boundary.
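As a minimal sketch of both techniques for a gcc target, assuming a 64-byte line: the names `CACHE_LINE`, `padded_counter` and `alloc_counters` below are made up for illustration and are not part of any library.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>
#include <stdlib.h>   /* aligned_alloc, free */

#define CACHE_LINE 64 /* assumption: 64-byte lines; check your target CPU */

/* The GCC attribute raises the alignment of the type to 64 bytes, and the
   compiler pads sizeof up to a multiple of that alignment, so each array
   element occupies exactly one cache line. */
struct padded_counter {
    uint64_t value;
} __attribute__((aligned(CACHE_LINE)));

static_assert(sizeof(struct padded_counter) == CACHE_LINE,
              "each element should fill one cache line");
static_assert(_Alignof(struct padded_counter) == CACHE_LINE,
              "elements should start on a line boundary");

/* Allocate n counters whose first element starts on a line boundary.
   aligned_alloc wants the size to be a multiple of the alignment, which
   holds here because sizeof(struct padded_counter) == CACHE_LINE.
   The caller releases the memory with free(). */
struct padded_counter *alloc_counters(size_t n)
{
    return aligned_alloc(CACHE_LINE, n * sizeof(struct padded_counter));
}
```

With this layout, element[0] covers bytes 0-63 of the allocation, element[1] covers bytes 64-127, and so on.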

Solution 2

I am using Linux and an 8-core x86 platform. First, how do I find the cache line size?

$ getconf LEVEL1_DCACHE_LINESIZE
64

Pass the value as a macro definition to the compiler.

$ gcc -DLEVEL1_DCACHE_LINESIZE=`getconf LEVEL1_DCACHE_LINESIZE` ...
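On the C side, the macro can then drive the alignment. The following is only a sketch: the fallback value of 64 and the names `slot`, `slots` and `slot_stride` are assumptions for illustration. The static assertion is there because `getconf` may report 0 or nothing on some systems, in which case the build should fail loudly rather than misalign silently.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>
#include <stdint.h>

/* Normally supplied by the build: -DLEVEL1_DCACHE_LINESIZE=... */
#ifndef LEVEL1_DCACHE_LINESIZE
#define LEVEL1_DCACHE_LINESIZE 64 /* assumed fallback when not passed in */
#endif

/* Reject 0 and non-power-of-two values at compile time. */
static_assert(LEVEL1_DCACHE_LINESIZE > 0 &&
              (LEVEL1_DCACHE_LINESIZE & (LEVEL1_DCACHE_LINESIZE - 1)) == 0,
              "cache line size must be a non-zero power of two");

/* One slot per cache line (GCC attribute syntax). */
struct slot {
    long value;
} __attribute__((aligned(LEVEL1_DCACHE_LINESIZE)));

/* A statically allocated array inherits the element alignment, so
   slots[0] starts on a line boundary. */
struct slot slots[8];

/* Distance in bytes between consecutive elements. */
size_t slot_stride(void)
{
    return sizeof(struct slot);
}
```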

At run time, sysconf(_SC_LEVEL1_DCACHE_LINESIZE) returns the L1 data cache line size.
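A hedged sketch of the run-time query follows. `_SC_LEVEL1_DCACHE_LINESIZE` is a glibc extension rather than something POSIX requires, so the code guards it with `#ifdef`; the helper name `cache_line_size` and the fallback of 64 are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>   /* sysconf */

/* Returns the L1 data cache line size in bytes. Falls back to 64 (an
   assumed default) when the query is unavailable or reports 0 or -1. */
size_t cache_line_size(void)
{
    long n = -1;
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    n = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
#endif
    return n > 0 ? (size_t)n : 64;
}
```

Because the value is only known at run time, it cannot be used in `__attribute__((aligned(...)))`; it is suited to sizing and aligning heap allocations.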

Solution 3

Another simple way is to read /proc/cpuinfo:

grep cache_alignment /proc/cpuinfo
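The same lookup can be done from C. This is a rough sketch that assumes the x86 Linux layout of /proc/cpuinfo, where the line looks like `cache_alignment : 64`; other architectures may not expose this field at all. The function name `cpuinfo_cache_alignment` is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Returns the first cache_alignment value listed in /proc/cpuinfo,
   or 0 if the file or the field is not available. */
unsigned cpuinfo_cache_alignment(void)
{
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (f == NULL)
        return 0;

    char line[1024];
    unsigned value = 0;
    while (fgets(line, sizeof line, f) != NULL) {
        /* Whitespace in the format matches the tab before the colon. */
        if (sscanf(line, "cache_alignment : %u", &value) == 1)
            break;
    }
    fclose(f);
    return value;
}
```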

Solution 4

There's no completely portable way to get the cache line size. But if you're on x86 or x86-64, you can use the cpuid instruction to get everything you need to know about the cache, including its size, the cache line size, and the number of levels.

http://softpixel.com/~cwright/programming/simd/cpuid.php

(Scroll down a little: the page is about SIMD, but it has a section on getting the cache line size.)
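One way to sketch the cpuid approach with gcc or clang is through the `<cpuid.h>` helper `__get_cpuid`. The example below reads the CLFLUSH line size from leaf 1 (EBX bits 15:8, in units of 8 bytes), which on x86 CPUs is generally taken to equal the cache line size; it is not the method from the linked page, and the function name is made up for illustration. On non-x86 targets it simply returns 0.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>    /* __get_cpuid, provided by gcc and clang */
#endif

/* Returns the CLFLUSH line size in bytes, or 0 if it cannot be read. */
size_t cpuid_clflush_line_size(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    /* Leaf 1: EDX bit 19 (CLFSH) says whether the EBX field is valid. */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && (edx & (1u << 19)))
        return (size_t)((ebx >> 8) & 0xffu) * 8;
#endif
    return 0;
}
```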

As for aligning your data structures, there's also no completely portable way to do it. GCC and VS10 have different ways to specify alignment of a struct. One way to "hack" it is to pad your struct with unused variables until it matches the alignment you want.

To align your malloc() calls, all the mainstream compilers also provide aligned allocation functions for that purpose.

Solution 5

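posix_memalign or valloc can be used to align allocated memory to a cache line. Note that valloc is obsolete and always aligns to the page size, so posix_memalign is the better fit.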
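A short sketch of the posix_memalign route; the wrapper name `alloc_line_aligned` is an assumption for illustration. posix_memalign requires the alignment to be a power of two and a multiple of `sizeof(void *)`, which 64 satisfies.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L /* expose posix_memalign under strict -std= */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>   /* posix_memalign, free */

/* Returns `bytes` of memory aligned to `line` bytes, or NULL on failure.
   The result is released with free(). */
void *alloc_line_aligned(size_t bytes, size_t line)
{
    void *p = NULL;
    /* posix_memalign returns an error code instead of setting errno. */
    if (posix_memalign(&p, line, bytes) != 0)
        return NULL;
    return p;
}
```

For the array in the question, a call such as `alloc_line_aligned(n * 64, 64)` would give a block whose start is line-aligned, provided each element is padded to 64 bytes.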


Author by

MetallicPriest

Updated on July 05, 2022

Comments

MetallicPriest almost 2 years
To prevent false sharing, I want to align each element of an array to a cache line. So first I need to know the size of a cache line, so I assign each element that amount of bytes. Secondly I want the start of the array to be aligned to a cache line.

I am using Linux and an 8-core x86 platform. First, how do I find the cache line size? Secondly, how do I align to a cache line in C? I am using the gcc compiler.

So the structure would be following for example, assuming a cache line size of 64.
```
element[0] occupies bytes 0-63
element[1] occupies bytes 64-127
element[2] occupies bytes 128-191
```
and so on, assuming of course that 0-63 is aligned to a cache line.
MetallicPriest over 12 years

But how do I align to a cache line in c?
Necrolis over 12 years

@MetallicPriest: updated my post a bit (note: there was an error in cache line size, align to 64 bytes, not 16, 16 bytes is to prevent splitting)
Sebastian Mach over 12 years

@MetallicPriest: gcc and g++ both support __attributes__
Steve Jessop over 12 years

I know this is your own question, but for future readers you could answer both parts of it :-)
MetallicPriest over 12 years

Is memory mapped by mmap, aligned too?
MetallicPriest over 12 years

Steve, do you know if memory mapped by mmap is aligned to a cache line.
Necrolis over 12 years

@MetallicPriest: mmap & VirtualAlloc allocate page-aligned memory. Generally the allocation granularity is 64 KB (under Windows), and since 64 KB is a multiple of 64, it will be aligned properly.
Steve Jessop over 12 years

I don't think it's guaranteed by Posix, but I also wouldn't be in the least surprised if linux always selects addresses that are page-aligned, never mind just cache-line aligned. Posix says that if the caller specifies the first parameter (address hint), that has to be page-aligned, and the mapping itself is always a whole number of pages. That's strongly suggestive without actually guaranteeing anything.
tothphu almost 12 years

You can get the cache line size programmatically. Check here. Also, you cannot generalize to 64-byte cache lines on x86. That is only true for recent ones.
Necrolis almost 12 years

@tothphu: a more portable way to get it is via CPUID, and as of many revisions of the Intel guides, cache lines have been 64 bytes, IIRC even the P4 (which is now ancient) had 64 byte cachelines (in fact, it did, see: osronline.com/article.cfm?article=273). also there is no need to spam the link, rather just edit your comment.
tothphu almost 12 years

@Necrolis I seem to remember reading 32 bytes somewhere in the Core Duo timeframe, but my memory is probably deceiving me. As for the link, I couldn't edit the comment because I had crossed the 5-minute edit limit.
jww over 8 years

@James - alignas is C++11. It's not available in C++03. And it won't work on a number of Apple platforms. On some of their OSes, Apple provides an ancient C++ Standard Library that pretends to be C++11 but lacks unique_ptr, alignas, etc.
Nick Strupat about 8 years

@James also, the standard only requires alignas to support up to 16 bytes, so any higher value won't be portable. And since virtually all modern processors have a cache line size of 64 bytes, alignas isn't useful unless you know your compiler supports alignas(64).
Brian Cain almost 7 years

Where are these sysconf()s specified? POSIX / IEEE Std 1003.1-20xx ?
Maxim Egorushkin almost 7 years

@BrianCain pubs.opengroup.org/onlinepubs/9699919799/functions/sysconf.html
Maxim Egorushkin almost 7 years

@BrianCain I use Linux, so I just did man sysconf. Linux is not exactly POSIX compliant, so the Linux-specific documentation is often more useful. Sometimes it is out of date, so you just egrep -nH -r /usr/include -e '\b_SC'.
Dení over 6 years

In case of Mac, use sysctl hw.cachelinesize.
Peter Cordes about 6 years

Yes, mmap only works in terms of pages, and pages are always larger than cache lines. Even in some theoretical weird architecture, there are extremely good reasons why cache lines won't be larger than pages (caches are normally physically tagged, so one line can't be split across 2 virtual pages without extreme pain for the CPU designers).
NoSenseEtAl over 5 years

C++11 adds alignas, which is a portable way of specifying alignment.
Alnitak over 5 years

alignas is also in C11, not just C++11.
Carlo Wood almost 5 years

@NoSenseEtAl alignas officially only supports alignment up till the size of the type std::max_align_t, which is typically the alignment requirement of a long double, aka 8 or 16 bytes - not 64 unfortunately. See for example stackoverflow.com/questions/49373287/…
Carlo Wood almost 5 years

alignas officially only supports alignment up till the size of the type std::max_align_t, which is typically the alignment requirement of a long double, aka 8 or 16 bytes - not 64 unfortunately.
Carlo Wood almost 5 years

@NickStrupat It seems that support for alignment to cache line sizes has finally been added in C++17. My last comment also seems to be no longer correct for C++17 (the problem was merely that operator new was not guaranteed to return memory aligned more strictly than std::max_align_t). I just found this: en.cppreference.com/w/cpp/thread/…
Nick Strupat almost 5 years

@CarloWood You're right about the C++17 addition. The only advantage remaining for my library and its underlying get_cachline_size function is that it can retrieve that information at run-time. The downside is that you lose possible compiler optimizations if the cache line size is known at compile time.
Carlo Wood almost 5 years

@NickStrupat After posting this comment, I tried it out and discovered that neither gcc nor clang support it... Apparently they went for option 3 in lists.llvm.org/pipermail/cfe-dev/2018-May/058138.html (I read the whole thread; it's long but to summarize -- they have no clue how to implement it and were thinking about filing a Defect Report). Nevertheless, your library will of course have the exact same ABI/ODR issues. I'm starting to feel that simply using 64 bytes everywhere for now is my best option :/.
maxschlepzig over 4 years

Perhaps you want to remove a useless use of cat.
Peter Cordes over 4 years

@CarloWood: Compilers are allowed to support over-aligned types, and in practice they do. (all of gcc, clang, MSVC, ICC support alignas(64)). True that ISO C++ only requires alignas up to alignof(max_align_t), but it also doesn't specify __declspec or __attribute__. I'd call alignas portable because in real life compilers can and do support it because it's useful. Not in the same sense that behaviour required by ISO C++ is portable, sure.
Peter Cordes over 4 years

@Necrolis: re: earlier comments: x86 (and x86-64) page size is 4kiB. x86-64 hugepages are 2MiB or 1GiB. Yes, everything uses 64-byte cache lines since Core 2 at least, so all x86-64. Pentium II/III did use 32-byte lines, maybe even Pentium M / Core solo/duo. Over-aligning might waste a bit of space on those ancient CPUs, but it's not a big deal. On modern CPUs, L2 spatial prefetch tries to complete an aligned pair of cache lines (128 bytes) so it can sometimes make sense to align by 128.
Peter Cordes over 4 years

Usually it's so much better to have a compile-time-constant line size that I'd rather hard-code 64 than call sysconf. The compiler won't even know it's a power of 2, so you'll have to manually do stuff like offset = ptr & (linesize-1) for remainder or bit-scan + right-shift to implement division. You can't just use / in code that's performance-sensitive.
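To illustrate the masking mentioned in the comment above: a small sketch for a line size that is only known at run time, assuming it is a power of two. The helper names `line_offset` and `round_up_to_line` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte offset of p inside its cache line. Equivalent to
   (uintptr_t)p % linesize, but only when linesize is a power of two. */
static inline size_t line_offset(const void *p, size_t linesize)
{
    return (size_t)((uintptr_t)p & (linesize - 1));
}

/* Rounds n up to the next multiple of linesize (power of two assumed). */
static inline size_t round_up_to_line(size_t n, size_t linesize)
{
    return (n + linesize - 1) & ~(linesize - 1);
}
```

With a hard-coded constant such as 64, a plain `%` or `/` compiles to the same mask or shift, which is part of why a compile-time line size is attractive.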
Peter Cordes over 4 years

Compilers are reluctant to implement hardware_destructive_interference_size because you really want it to be a compile-time-constant, but it can't always be if you're compiling for a "generic" target that could run on multiple CPUs of the same ISA. A conservative choice would be possible but not guaranteed future-proof. (Like 128 bytes to account for current x86 CPU with 64-byte lines and an L2 spatial prefetch that likes to complete an aligned pair of lines. (mainstream Intel))
ilstam almost 4 years

But if you used a cross compiler, that wouldn't work, right? Because it would get the cache line size of your current architecture and not that of your target architecture.
Maxim Egorushkin almost 4 years

@ilstam When cross-compiling you would need to obtain that getconf LEVEL1_DCACHE_LINESIZE from your target architecture, sure. Your build system might provide it, or you'd have to hardcode it as a system-specific value into your build system.
Maxim Egorushkin almost 4 years

@ilstam Another method is to have arch-specific implementations in different shared libraries and load the right one at run-time. Or, more advanced users, could have their own mechanisms of using arch-specific functions, but one would need to be an expert with all the details involved (which isn't rocket science, but requires a bit of thorough reading and appreciation).