What is a "cache-friendly" code?

Solution 1

Preliminaries

On modern computers, only the lowest level memory structures (the registers) can move data around in a single clock cycle. However, registers are very expensive and most computer cores have fewer than a few dozen registers. At the other end of the memory spectrum (DRAM), the memory is very cheap (i.e. literally millions of times cheaper) but takes hundreds of cycles after a request to receive the data. Bridging this gap between super fast and expensive and super slow and cheap are the cache memories, named L1, L2, L3 in decreasing speed and cost. The idea is that most of the executing code will be hitting a small set of variables often, and the rest (a much larger set of variables) infrequently. If the processor can't find the data in L1 cache, then it looks in L2 cache. If not there, then L3 cache, and if not there, main memory. Each of these "misses" is expensive in time.

(The analogy: cache memory is to system memory as system memory is to hard disk storage. Hard disk storage is super cheap but very slow.)

Caching is one of the main methods to reduce the impact of latency. To paraphrase Herb Sutter: increasing bandwidth is easy, but we can't buy our way out of latency.

Data is always retrieved through the memory hierarchy (from smallest/fastest to largest/slowest). A cache hit/miss usually refers to a hit/miss in the highest level of cache in the CPU -- by highest level I mean the largest and slowest. The cache hit rate is crucial for performance, since every cache miss results in fetching data from RAM (or worse ...), which takes a lot of time (hundreds of cycles for RAM, tens of millions of cycles for HDD). In comparison, reading data from the (highest level) cache typically takes only a handful of cycles.

In modern computer architectures, the performance bottleneck is leaving the CPU die (e.g. accessing RAM or beyond), and this will only get worse over time. Increasing processor frequency is no longer an effective way to raise performance; the problem is memory access. Hardware design efforts in CPUs therefore currently focus heavily on optimizing caches, prefetching, pipelines and concurrency. For instance, modern CPUs spend around 85% of their die area on caches and up to 99% on storing/moving data!

There is quite a lot to be said on the subject of caches, memory hierarchies and proper programming.

Main concepts for cache-friendly code

A very important aspect of cache-friendly code is all about the principle of locality, the goal of which is to place related data close in memory to allow efficient caching. In terms of the CPU cache, it's important to be aware of cache lines to understand how this works: How do cache lines work?

The following particular aspects are of high importance to optimize caching:

  1. Temporal locality: when a given memory location is accessed, it is likely that the same location will be accessed again in the near future. Ideally, this information will still be cached at that point.
  2. Spatial locality: this refers to placing related data close to each other. Caching happens on many levels, not just in the CPU. For example, when you read from RAM, typically a larger chunk of memory is fetched than what was specifically asked for because very often the program will require that data soon. HDD caches follow the same line of thought. Specifically for CPU caches, the notion of cache lines is important.

Use appropriate containers

A simple example of cache-friendly versus cache-unfriendly is C++'s std::vector versus std::list. Elements of a std::vector are stored in contiguous memory, and as such accessing them is much more cache-friendly than accessing elements in a std::list, which stores its content all over the place. This is due to spatial locality.

A very nice illustration of this is given by Bjarne Stroustrup in this youtube clip (thanks to @Mohammad Ali Baydoun for the link!).

Don't neglect the cache in data structure and algorithm design

Whenever possible, try to adapt your data structures and order of computations in a way that allows maximum use of the cache. A common technique in this regard is cache blocking (Archive.org version), which is of extreme importance in high-performance computing (cf. for example ATLAS).

Know and exploit the implicit structure of data

Another simple example, which many people in the field sometimes forget, is column-major (used e.g. by Fortran and MATLAB) vs. row-major ordering (used e.g. by C and C++) for storing two-dimensional arrays. For example, consider the following matrix:

1 2
3 4

In row-major ordering, this is stored in memory as 1 2 3 4; in column-major ordering, this would be stored as 1 3 2 4. It is easy to see that implementations which do not exploit this ordering will quickly run into (easily avoidable!) cache issues. Unfortunately, I see stuff like this very often in my domain (machine learning). @MatteoItalia showed this example in more detail in his answer.

When fetching a certain element of a matrix from memory, elements near it will be fetched as well and stored in a cache line. If the ordering is exploited, this will result in fewer memory accesses (because the next few values which are needed for subsequent computations are already in a cache line).

For simplicity, assume the cache comprises a single cache line which can contain 2 matrix elements, and that when a given element is fetched from memory, the next one is fetched too. Say we want to take the sum over all elements in the example 2x2 matrix above (let's call it M):

Exploiting the ordering (e.g. changing the column index first, which matches row-major storage):

M[0][0] (memory) + M[0][1] (cached) + M[1][0] (memory) + M[1][1] (cached)
= 1 + 2 + 3 + 4
--> 2 cache hits, 2 memory accesses

Not exploiting the ordering (e.g. changing the row index first, which fights row-major storage):

M[0][0] (memory) + M[1][0] (memory) + M[0][1] (memory) + M[1][1] (memory)
= 1 + 3 + 2 + 4
--> 0 cache hits, 4 memory accesses

In this simple example, exploiting the ordering approximately doubles execution speed (since memory access requires many more cycles than computing the sums). In practice, the performance difference can be much larger.

Avoid unpredictable branches

Modern architectures feature pipelines and compilers are becoming very good at reordering code to minimize delays due to memory access. When your critical code contains (unpredictable) branches, it is hard or impossible to prefetch data. This will indirectly lead to more cache misses.

This is explained very well here (thanks to @0x90 for the link): Why is processing a sorted array faster than processing an unsorted array?

Avoid virtual functions

In the context of C++, virtual methods represent a controversial issue with regard to cache misses (a general consensus exists that they should be avoided when possible in terms of performance). Virtual functions can induce cache misses during lookup, but this only happens if the specific function is not called often (otherwise it would likely be cached), so this is regarded as a non-issue by some. For reference about this issue, check out: What is the performance cost of having a virtual method in a C++ class?

Common problems

A common problem in modern architectures with multiprocessor caches is called false sharing. This occurs when different processors use data in different memory regions that happen to fall in the same cache line. Every write then causes the cache line -- which also holds data the other processor needs -- to be invalidated and re-fetched again and again. Effectively, different threads make each other wait by inducing cache misses in this situation. See also (thanks to @Matt for the link): How and when to align to cache line size?

An extreme symptom of poor caching in RAM memory (which is probably not what you mean in this context) is so-called thrashing. This occurs when the process continuously generates page faults (e.g. accesses memory which is not in the current page) which require disk access.

Solution 2

In addition to @Marc Claesen's answer, I think that an instructive classic example of cache-unfriendly code is code that scans a two-dimensional C array (e.g. a bitmap image) column-wise instead of row-wise.

Elements that are adjacent in a row are also adjacent in memory, thus accessing them in sequence means accessing them in ascending memory order; this is cache-friendly, since the cache tends to prefetch contiguous blocks of memory.

Instead, accessing such elements column-wise is cache-unfriendly, since elements on the same column are distant in memory from each other (in particular, their distance is equal to the size of the row), so when you use this access pattern you are jumping around in memory, potentially wasting the effort of the cache of retrieving the elements nearby in memory.

And all that it takes to ruin the performance is to go from

// Cache-friendly version - processes pixels which are adjacent in memory
for(unsigned int y=0; y<height; ++y)
{
    for(unsigned int x=0; x<width; ++x)
    {
        ... image[y][x] ...
    }
}

to

// Cache-unfriendly version - jumps around in memory for no good reason
for(unsigned int x=0; x<width; ++x)
{
    for(unsigned int y=0; y<height; ++y)
    {
        ... image[y][x] ...
    }
}

This effect can be quite dramatic (several orders of magnitude in speed) on systems with small caches and/or working with big arrays (e.g. 10+ megapixel 24 bpp images on current machines); for this reason, if you have to do many vertical scans, it's often better to rotate the image by 90 degrees first and perform the various analyses later, limiting the cache-unfriendly code to just the rotation.

Solution 3

Optimizing cache usage largely comes down to two factors.

Locality of Reference

The first factor (to which others have already alluded) is locality of reference. Locality of reference really has two dimensions though: space and time.

  • Spatial

The spatial dimension also comes down to two things: first, we want to pack our information densely, so more information will fit in that limited memory. This means (for example) that you need a major improvement in computational complexity to justify data structures based on small nodes joined by pointers.

Second, we want information that will be processed together also located together. A typical cache works in "lines", which means when you access some information, other information at nearby addresses will be loaded into the cache with the part we touched. For example, when I touch one byte, the cache might load 128 or 256 bytes near that one. To take advantage of that, you generally want the data arranged to maximize the likelihood that you'll also use that other data that was loaded at the same time.

For just a really trivial example, this can mean that a linear search can be much more competitive with a binary search than you'd expect. Once you've loaded one item from a cache line, using the rest of the data in that cache line is almost free. A binary search becomes noticeably faster only when the data is large enough that the binary search reduces the number of cache lines you access.

  • Time

The time dimension means that when you do some operations on some data, you want (as much as possible) to do all the operations on that data at once.

Since you've tagged this as C++, I'll point to a classic example of a relatively cache-unfriendly design: std::valarray. valarray overloads most arithmetic operators, so I can (for example) say a = b + c + d; (where a, b, c and d are all valarrays) to do element-wise addition of those arrays.

The problem with this is that it walks through one pair of inputs, puts results in a temporary, walks through another pair of inputs, and so on. With a lot of data, the result from one computation may disappear from the cache before it's used in the next computation, so we end up reading (and writing) the data repeatedly before we get our final result. If each element of the final result will be something like (a[n] + b[n]) * (c[n] + d[n]);, we'd generally prefer to read each a[n], b[n], c[n] and d[n] once, do the computation, write the result, increment n and repeat 'til we're done.[2]

Line Sharing

The second major factor is avoiding line sharing. To understand this, we probably need to back up and look a little at how caches are organized. The simplest form of cache is direct mapped. This means one address in main memory can only be stored in one specific spot in the cache. If we're using two data items that map to the same spot in the cache, it works badly -- each time we use one data item, the other has to be flushed from the cache to make room for it. The rest of the cache might be empty, but those items won't use other parts of the cache.

To prevent this, most caches are what are called "set associative". For example, in a 4-way set-associative cache, any item from main memory can be stored at any of 4 different places in the cache. So, when the cache is going to load an item, it looks for the least recently used[3] item among those four, flushes it to main memory, and loads the new item in its place.

The problem is probably fairly obvious: for a direct-mapped cache, two operands that happen to map to the same cache location can lead to bad behavior. An N-way set-associative cache increases the number from 2 to N+1. Organizing a cache into more "ways" takes extra circuitry and generally runs slower, so (for example) an 8192-way set associative cache is rarely a good solution either.

Ultimately, this factor is more difficult to control in portable code though. Your control over where your data is placed is usually fairly limited. Worse, the exact mapping from address to cache varies between otherwise similar processors. In some cases, however, it can be worth doing things like allocating a large buffer, and then using only parts of what you allocated to ensure against data sharing the same cache lines (even though you'll probably need to detect the exact processor and act accordingly to do this).

  • False Sharing

There's another, related item called "false sharing". This arises in a multiprocessor or multicore system, where two (or more) processors/cores have data that's separate, but falls in the same cache line. This forces the two processors/cores to coordinate their access to the data, even though each has its own, separate data item. Especially if the two modify the data in alternation, this can lead to a massive slowdown as the data has to be constantly shuttled between the processors. This can't easily be cured by organizing the cache into more "ways" or anything like that either. The primary way to prevent it is to ensure that two threads rarely (preferably never) modify data that could possibly be in the same cache line (with the same caveats about difficulty of controlling the addresses at which data is allocated).


  1. Those who know C++ well might wonder if this is open to optimization via something like expression templates. I'm pretty sure the answer is that yes, it could be done and if it was, it would probably be a pretty substantial win. I'm not aware of anybody having done so, however, and given how little valarray gets used, I'd be at least a little surprised to see anybody do so either.

  2. In case anybody wonders how valarray (designed specifically for performance) could be this badly wrong, it comes down to one thing: it was really designed for machines like the older Crays, that used fast main memory and no cache. For them, this really was a nearly ideal design.

  3. Yes, I'm simplifying: most caches don't really measure the least recently used item precisely, but they use some heuristic that's intended to be close to that without having to keep a full time-stamp for each access.

Solution 4

Welcome to the world of Data Oriented Design. The basic mantra is to Sort, Eliminate Branches, Batch, Eliminate virtual calls - all steps towards better locality.

Since you tagged the question with C++, here's the obligatory Typical C++ Bullshit. Tony Albrecht's Pitfalls of Object Oriented Programming is also a great introduction into the subject.

Solution 5

Just piling on: the classic example of cache-unfriendly versus cache-friendly code is the "cache blocking" of matrix multiply.

Naive matrix multiply looks like:

for(i=0;i<N;i++) {
   for(j=0;j<N;j++) {
      dest[i][j] = 0;
      for( k=0;k<N;k++) {
         dest[i][j] += src1[i][k] * src2[k][j];
      }
   }
}

If N is large, e.g. if N * sizeof(elemType) is greater than the cache size, then every single access to src2[k][j] will be a cache miss.

There are many different ways of optimizing this for a cache. Here's a very simple example: instead of reading one item per cache line in the inner loop, use all of the items:

int itemsPerCacheLine = CacheLineSize / sizeof(elemType);

for(i=0;i<N;i++) {
   for(j=0;j<N;j += itemsPerCacheLine ) {
      for(jj=0;jj<itemsPerCacheLine; jj++) {
         dest[i][j+jj] = 0;
      }
      for( k=0;k<N;k++) {
         for(jj=0;jj<itemsPerCacheLine; jj++) {
            dest[i][j+jj] += src1[i][k] * src2[k][j+jj];
         }
      }
   }
}

If the cache line size is 64 bytes, and we are operating on 32 bit (4 byte) floats, then there are 16 items per cache line. And the number of cache misses via just this simple transformation is reduced approximately 16-fold.

Fancier transformations operate on 2D tiles, optimize for multiple caches (L1, L2, TLB), and so on.

Some results of googling "cache blocking":

http://stumptown.cc.gt.atl.ga.us/cse6230-hpcta-fa11/slides/11a-matmul-goto.pdf

http://software.intel.com/en-us/articles/cache-blocking-techniques

A nice video animation of an optimized cache blocking algorithm.

http://www.youtube.com/watch?v=IFWgwGMMrh0

Loop tiling is very closely related:

http://en.wikipedia.org/wiki/Loop_tiling

Author: Noah Roth

Updated on October 14, 2021

Comments

  • Noah Roth
    Noah Roth over 2 years

    What is the difference between "cache unfriendly code" and the "cache friendly" code?

    How can I make sure I write cache-efficient code?

  • TemplateRex
    TemplateRex almost 11 years
    perhaps you could expand the answer a bit by also explaining that, in multithreaded code, data can also be too local (e.g. false sharing)
  • huseyin tugrul buyukisik
    huseyin tugrul buyukisik almost 11 years
    Each level of access needs another level of cache? L1 l2 l3 l4 is there a l5 ?
  • Rafael Baptista
    Rafael Baptista almost 11 years
    There can be as many levels of cache as the chip designers think is useful. Generally they are balancing speed vs. size. If you could make your L1 cache as big as L5, and just as fast, you would only need L1.
  • Jack Aidley
    Jack Aidley almost 11 years
    I realise empty posts of agreement are disapproved of on StackOverflow but this is honestly the clearest, best, answer I've seen so far. Excellent work, Marc.
  • Marc Claesen
    Marc Claesen almost 11 years
    @JackAidley thanks for your praise! When I saw the amount of attention this question received, I figured many people may be interested in a somewhat extensive explanation. I'm glad it's useful.
  • hookenz
    hookenz almost 11 years
    What you didn't mention is that cache friendly data structures are designed to fit within a cache line and are aligned to memory to make optimal use of cache lines. Great answer though! Awesome.
  • mowwwalker
    mowwwalker almost 11 years
    Err, should that be x<width ?
  • rossipedia
    rossipedia almost 11 years
    I'd love some elaboration on the section about matrix ordering. It's a little inspecific (for example: "It is easy to see that implementations which do not exploit this ordering" which ordering?)
  • maxy
    maxy almost 11 years
    Modern image editors use tiles as internal storage, e.g. blocks of 64x64 pixels. This is much more cache-friendly for local operations (placing a dab, running a blur filter) because neighbouring pixels are close in memory in both directions, most of the time.
  • 0x90
    0x90 almost 11 years
    Do you have a clue what software is used in the end of the video to test the cache sizes and tweaking it?
  • Marc Claesen
    Marc Claesen almost 11 years
    @0x90 unfortunately not, I'll see if I can find out.
  • 0x90
    0x90 almost 11 years
    What do you mean by batch? One might not understand.
  • arul
    arul almost 11 years
    Batching: instead of performing the unit of work on a single object, perform it on a batch of objects.
  • 0x90
    0x90 almost 11 years
    AKA blocking, blocking registers, blocking caches.
  • arul
    arul almost 11 years
    Blocking/Non-blocking usually refers to how objects behave in a concurrent environment.
  • Martin Thoma
    Martin Thoma almost 11 years
    People who read this might also be interested in my article about matrix multiplication where I tested the "cache-friendly" ikj-algorithm and the unfriendly ijk-algorithm by multiplying two 2000x2000 matrices.
  • gsgx
    gsgx almost 11 years
    I tried timing a similar example on my machine, and I found that the times were the same. Has anyone else tried timing it?
  • Marc Claesen
    Marc Claesen almost 11 years
    I like the extra pieces of information in your answer, particularly the valarray example.
  • Amro
    Amro over 10 years
    batching == vectorization
  • Olof Forshell
    Olof Forshell about 10 years
    The rules that apply to making data cache-friendly also apply to code. Compact, well thought-through code will always execute faster than code which performs unnecessary calls or jumps. In addition, the code can hint to the cache system which memory areas it will soon need so that they (hopefully) will have been fetched when the code gets to them. And although bus-locking isn't strictly a cache problem, code and data organized to minimize bus-locking will execute faster: a bus-lock is many times slower than a RAM access.
  • Engineer
    Engineer about 10 years
    +1 At last: a plain description of set associativity! EDIT further: This is one of the most informative answers on SO. Thank you.
  • Matteo Italia
    Matteo Italia almost 10 years
    @I3arnon: nope, the first is cache-friendly, since normally in C arrays are stored in row-major order (of course if your image for some reason is stored in column-major order the reverse is true).
  • Renra
    Renra almost 10 years
    This whole thing seems like a very obscure matter if you want to get "into" it and actually see what's happening. Do you use some kind of tools that help you analyze your programs? For example can valgrind tell you somehow?
  • Renra
    Renra almost 10 years
    M[0][0] (memory) + M[1][0] (memory) + M[0][1] (memory) + M[1][1] (memory) = 1 + 3 + 2 + 4 --> 0 cache hits, 4 memory accesses -- Something like this output that's mentioned in the answer - does any tool actually tell you this or do you rely on your knowledge of what's going on behind the scenes?
  • Blaisorblade
    Blaisorblade over 8 years
    @Renra Good profilers can read Hardware Performance Counters to gather this info; on Linux you want to use oprofile (or frontends for it).
  • Gauthier
    Gauthier over 7 years
    Wait a minute... image[0][1] is adjacent to image[0][2]. Therefore the inner loop should loop through the second index, right? for each y in rows: (fetch a whole row, hopefully) for each x in cols: ...image[y][x].... That makes your first code snippet the good one, not the second. Am I missing something?
  • Matteo Italia
    Matteo Italia over 7 years
    @Gauthier: yes, the first snippet is the good one; I think that when I wrote this I was thinking along the lines of "All it takes [to ruin the performance of a working application] is to go from... to..."
  • Chaine
    Chaine almost 7 years
    @MarcClaesen I'm sorry for reaching to you here but I really want to use Optunity but I find the guide difficult to follow. Do you have a tutorial on how to use it?
  • kmiklas
    kmiklas almost 7 years
    This could be the best answer on SO.
  • Sitesh
    Sitesh over 6 years
    This is an excellent talk by Scott Myers on CPU cache : youtu.be/WDIkqP4JbkE
  • Paul A. Clayton
    Paul A. Clayton about 6 years
    Limited associativity with simple modulo power of two indexing also means that certain allocation, layout, and access patterns can introduce excessive conflict misses. Pointer-chasing issues are closely related to caching (in that data dependencies make latencies additive). Non-temporal stores explicitly bypass cache to avoid pollution with data that has extremely low temporal locality. Mention of typical read-for-ownership overhead (which motivates cache block zeroing instructions) might be useful.
  • RonJohn
    RonJohn almost 5 years
    "Sort, Eliminate Branches, Batch" just what mainframes were doing 60 years ago!!! LOL Even now, sorting input files by a table's primary key engenders a marked speedup because of buffering.
  • TrebledJ
    TrebledJ almost 5 years
    k==; I'm hoping this is a typo?
  • ed9w2in6
    ed9w2in6 over 3 years
    Just trying to understand why a virtual function is worse than a normal function in terms of cache friendliness. Both need to reference a function implementation, so they have the same chance of a cache miss there, while a virtual function also needs to query the vtable, so it has one more chance of a cache miss. Is that why we think virtual functions are less cache-friendly?
  • IC_
    IC_ almost 3 years
    “ It is easy to see that implementations which do not exploit this ordering will quickly run into (easily avoidable!) cache issues.” What’s the difference between them? They’re both tightly packed in memory, why row-major ordering will run into cache misses?
  • Ieshaan Saxena
    Ieshaan Saxena about 2 years
    Thank you so much for this answer! Incredibly helpful!