C++ How to force prefetch data to cache? (array loop)


Solution 1

First, I suppose that tab is a large 2D array, such as a static array (e.g., int tab[1024*1024][1024*1024]) or a dynamically allocated one (e.g., int** tab followed by mallocs for each row). Here, you want to prefetch some data from tab into the cache to reduce the execution time.

Simply put, I don't think you need to manually insert any prefetching into your code, where a simple reduction over a 2D array is performed. Modern CPUs will do automatic prefetching if it is necessary and profitable.


Two facts you should know for this problem:

(1) You already exploit the spatial locality of tab inside the innermost loop. Once tab[i][0] is read (after a cache miss, or a page fault), the data from tab[i][0] to tab[i][15] will be in your CPU caches, assuming that the cache line size is 64 bytes and int is 4 bytes (see the snippet after these two points for one way to check that assumption).

(2) However, when the code crosses a row boundary, i.e., goes from tab[i][M-1] to tab[i+1][0], a cold cache miss is highly likely, especially when tab is a dynamically allocated array whose rows may be scattered around memory. If the array is statically allocated, however, each row is located contiguously in memory.
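If you want to verify the 64-byte assumption rather than hard-code it, here is a minimal, Linux/glibc-only sketch; on other platforms this sysconf constant may not be available:

#include <unistd.h>
#include <cstdio>

int main()
{
    // L1 data cache line size as reported by glibc; 64 bytes on most x86 CPUs.
    long line_size = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    printf("L1 data cache line size: %ld bytes\n", line_size);
    return 0;
}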


So, prefetching makes sense only when you fetch ahead of time (1) the first item of the next row or (2) the element at index j + CACHE_LINE_SIZE/sizeof(tab[0][0]).

You may do so by inserting a prefetch operation (e.g., __builtin_prefetch) in the outer loop. However, modern compilers may not always emit such prefetch instructions, so if you really want to do this, you should check the generated binary code.
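For illustration only, here is a minimal sketch of what that could look like for the reduction loop from the question (tab, N, M, and count come from the question; the placement of __builtin_prefetch and the locality hint are assumptions you would need to tune and verify against the generated code):

unsigned long long count = 0;
for (int i = 0; i < N; i++)
{
    // Ask for the first cache line of the next row while the current row is summed.
    if (i + 1 < N)
        __builtin_prefetch(&tab[i + 1][0], 0 /* read */, 3 /* high temporal locality */);

    for (int j = 0; j < M; j++)
        count += tab[i][j];
}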

However, as I said, I do not recommend doing that, because modern CPUs will mostly do prefetching automatically, and that automatic prefetching will mostly outperform your manual code. For instance, on an Intel CPU such as an Ivy Bridge processor, there are multiple hardware data prefetchers that prefetch into the L1, L2, or L3 cache. (I don't think mobile processors have such fancy data prefetchers, though.) Some prefetchers will also load adjacent cache lines.


If you do more expensive computations on large 2D arrays, there are many alternative algorithms that are more cache friendly. A notable example is blocked (tiled) matrix multiplication. A naive matrix multiplication suffers a lot of cache misses, but a blocked algorithm reduces them significantly by operating on small sub-blocks that fit in the caches. See some references like this.
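For illustration (this is separate from the question's reduction loop), here is a minimal sketch of blocked multiplication of two n-by-n row-major matrices; the function name, the block size BS, and the assumption that n is a multiple of BS are made up for the example:

// C += A * B for n x n row-major matrices; n is assumed to be a multiple of BS.
void blocked_matmul(const double* A, const double* B, double* C, int n)
{
    const int BS = 64;  // tile size: three BS x BS double tiles should fit in cache
    for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                // Work on one BS x BS tile of C at a time so it stays cache-resident.
                for (int i = ii; i < ii + BS; i++)
                    for (int k = kk; k < kk + BS; k++)
                    {
                        double a = A[i * n + k];
                        for (int j = jj; j < jj + BS; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}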

Solution 2

For GCC only:

__builtin_prefetch((const void*)(prefetch_address),0,0);

prefetch_address can be invalid; there will be no segfault. If the difference between prefetch_address and the current location is too small, there might be no effect or even a slowdown. Try to set it at least 1 KB ahead.
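Applied to the loop from the question, that might look like the sketch below; the distance of 1024 bytes (256 ints) ahead is only a starting point to tune, and it is worth measuring, since the hardware prefetcher may already handle this access pattern:

unsigned long long count = 0;
const size_t AHEAD = 1024 / sizeof(tab[0][0]);   // prefetch roughly 1 KB ahead

for (int i = 0; i < N; i++)
    for (int j = 0; j < M; j++)
    {
        // Running past the end of the row is harmless: the prefetch never faults.
        __builtin_prefetch((const void*)&tab[i][j + AHEAD], 0, 0);
        count += tab[i][j];
    }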

Solution 3

The easiest/most portable method is to simply read some data at intervals of one cache line (64 bytes here). Assuming tab is a proper two-dimensional array, you could:

char *tptr = (char *)&tab[0][0];
tptr += 64;                      // start one cache line ahead of the data being summed
char temp = 0;
volatile char keep_temp_alive;
for(int i = 0; i < N; i++)
{
    temp += *tptr;               // touch the next cache line so it is fetched early
    tptr += 64;                  // advance by one cache line
    for(int j = 0; j < M; j++)
        count += tab[i][j];
}
keep_temp_alive = temp;          // keep the dummy reads from being optimized away

Something like that. However, it does depend on: 1. you don't end up reading outside the allocated memory [by too much]; 2. the j loop doesn't cover much more than 64 bytes of data. If it does, you may want to add more steps of temp += *tptr; tptr += 64; at the beginning of the loop.

The keep_temp_alive after the loop is essential to prevent the compiler from removing the reads into temp as unnecessary loads.

Unfortunately, I'm too slow at writing generic code to suggest the builtin instructions; the points for that go to Leonid.


Comments

  • lizaczek, almost 2 years

    I have a loop like this:

    start = __rdtsc();
    unsigned long long count = 0;
    for(int i = 0; i < N; i++)
        for(int j = 0; j < M; j++)
            count += tab[i][j];
    stop = __rdtsc();
    time = (stop - start) * 1/3;   // TSC ticks to time, assuming a ~3 GHz clock
    

    I need to check how prefetching data influences efficiency. How can I force some values to be prefetched from memory into the cache before they are summed?