Cycles/cost for L1 Cache hit vs. Register on x86?

Solution 1

Here's a great article on the subject:

http://arstechnica.com/gadgets/reviews/2002/07/caching.ars/1

To answer your question - yes, a cache hit has approximately the same cost as a register access. And of course a cache miss is quite costly ;)

PS:

The specifics will vary, but this link has some good ballpark figures:

Approximate cost to access various caches and main memory?

Core i7 Xeon 5500 Series Data Source Latency (approximate)
L1 CACHE hit, ~4 cycles
L2 CACHE hit, ~10 cycles
L3 CACHE hit, line unshared ~40 cycles
L3 CACHE hit, shared line in another core ~65 cycles
L3 CACHE hit, modified in another core ~75 cycles
remote L3 CACHE ~100-300 cycles
Local DRAM ~30 ns (~120 cycles)
Remote DRAM ~100 ns 
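
To relate the ns and cycle figures in a table like this, the conversion is just cycles ≈ nanoseconds × clock frequency in GHz. A minimal sketch of that arithmetic (the 4 GHz clock is only an assumed example, consistent with the ~30 ns / ~120 cycles row above):

/* ns <-> core-cycle conversion for latency tables like the one above.
 * The clock frequency here is an assumed example, not taken from the table. */
#include <stdio.h>

int main(void) {
    double clock_ghz = 4.0;                 /* assumed core clock */
    double dram_ns   = 30.0;                /* "Local DRAM ~30 ns" row */
    double cycles    = dram_ns * clock_ghz; /* GHz is numerically cycles per ns */
    printf("%.0f ns at %.1f GHz is about %.0f cycles\n", dram_ns, clock_ghz, cycles);
    return 0;
}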

PPS:

These figures represent much older, slower CPUs, but the ratios basically hold:

http://arstechnica.com/gadgets/reviews/2002/07/caching.ars/2

Level                    Access Time  Typical Size  Technology    Managed By
-----                    -----------  ------------  ---------     -----------
Registers                1-3 ns       ≤1 KB          Custom CMOS  Compiler
Level 1 Cache (on-chip)  2-8 ns       8 KB-128 KB    SRAM         Hardware
Level 2 Cache (off-chip) 5-12 ns      0.5 MB - 8 MB  SRAM         Hardware
Main Memory              10-60 ns     64 MB - 1 GB   DRAM         Operating System
Hard Disk                3M - 10M ns  20 - 100 GB    Magnetic     Operating System/User

Solution 2

Throughput and latency are different things; you can't just add up cycle costs. For throughput, see Load/stores per cycle for recent CPU architecture generations - 2 loads per clock throughput for most modern microarchitectures. And see How can cache be that fast? for microarchitectural details of load/store execution units, including the load / store buffers that limit how much memory-level parallelism they can track. The rest of this answer will focus only on latency, which is relevant for workloads that involve pointer-chasing (like linked lists and trees), and on how much latency out-of-order exec needs to hide. (L3 cache misses are usually too long to fully hide.)

Single-cycle cache latency used to be a thing on simple in-order pipelines at lower clock speeds (so each cycle lasted more nanoseconds), especially with simpler caches (smaller, less associative, and with a smaller TLB for caches that weren't purely virtually addressed). e.g. the classic 5-stage RISC pipeline like MIPS I assumes 1 cycle for memory access on a cache hit, with address calculation in EX and memory access in a single MEM pipeline stage, before WB.

Modern high-performance CPUs divide the pipeline up into more stages, allowing each cycle to be shorter. This lets simple instructions like add / or / and run really fast, still 1 cycle latency but at high clock speed.


For more details about cycle-counting and out-of-order execution, see Agner Fog's microarch pdf, and other links in the x86 tag wiki.


Intel Haswell's L1 load-use latency is 4 cycles for pointer-chasing, which is typical of modern x86 CPUs. i.e. how fast mov eax, [eax] can run in a loop, with a pointer that points to itself. (Or for a linked list that hits in cache, easy to microbench with a closed loop). See also Is there a penalty when base+offset is in a different page than the base? That 4-cycle latency special case only applies if the pointer comes directly from another load, otherwise it's 5 cycles.
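
Here is a minimal sketch of that kind of closed-loop pointer-chasing microbenchmark in C (GNU C on x86-64; the iteration count and the use of __rdtsc() are my assumptions, and rdtsc counts reference cycles, which only match core cycles when turbo / frequency scaling is out of the picture):

/* Sketch: pointer-chasing latency.  A pointer that points to itself makes
 * every load's address depend on the previous load's result, so the time per
 * iteration approximates the L1d load-use latency. */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>              /* __rdtsc() */

int main(void) {
    void *p = &p;                   /* a pointer that points to itself */
    const uint64_t iters = 100000000ULL;

    asm volatile("" : "+r"(p));     /* hide p's value so the chain can't be constant-folded */

    uint64_t t0 = __rdtsc();
    for (uint64_t i = 0; i < iters; i++)
        p = *(void **)p;            /* should compile to a dependent mov rax, [rax] */
    uint64_t t1 = __rdtsc();

    printf("%p : ~%.2f reference cycles per load\n", p, (double)(t1 - t0) / iters);
    return 0;
}

The loop overhead (increment and branch) runs in parallel with the load chain on an out-of-order core, so the per-iteration figure should land near the ~4-5 cycle load-use latency on a Haswell-like CPU.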

Load-use latency is 1 cycle higher for SSE/AVX vectors in Intel CPUs.


Store-reload latency is 5 cycles, and is unrelated to cache hit or miss (it's store-forwarding, reading from the store buffer for store data that hasn't yet committed to L1d cache).

As harold commented, register access is 0 cycles. So, for example:

  • inc eax has 1 cycle latency (just the ALU operation)
  • add dword [mem], 1 has 6 cycle latency until a load from dword [mem] will be ready (ALU + store-forwarding), e.g. keeping a loop counter in memory limits a loop to one iteration per 6 cycles (see the sketch after this list).
  • mov rax, [rsi] has 4 cycle latency from rsi being ready to rax being ready on an L1 hit (L1 load-use latency.)
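
As a rough illustration of the add dword [mem], 1 case, here is a minimal sketch (the names, the iteration count, and the volatile trick are my assumptions) comparing a loop counter kept in a register with one forced through memory; on a Haswell-like core the second loop should be limited by the store-forwarding round trip to roughly one iteration per ~6 cycles:

/* Sketch: loop counter in a register vs. in memory (store-forwarding bound). */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>              /* __rdtsc() */

#define ITERS 100000000ULL          /* arbitrary iteration count */

static double cycles_per_iter_register(void) {
    uint64_t t0 = __rdtsc();
    for (uint64_t i = 0; i < ITERS; i++) {
        asm volatile("");           /* keep the otherwise-empty loop from being deleted */
    }
    return (double)(__rdtsc() - t0) / ITERS;   /* expect ~1 cycle: register inc + branch */
}

static double cycles_per_iter_memory(void) {
    volatile uint64_t i = 0;        /* volatile forces load + add + store every iteration */
    uint64_t t0 = __rdtsc();
    while (i < ITERS) {
        i = i + 1;                  /* bound by the store-forwarding round trip (~6 cycles) */
    }
    return (double)(__rdtsc() - t0) / ITERS;
}

int main(void) {
    printf("counter in a register: ~%.2f cycles/iteration\n", cycles_per_iter_register());
    printf("counter in memory:     ~%.2f cycles/iteration\n", cycles_per_iter_memory());
    return 0;
}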

http://www.7-cpu.com/cpu/Haswell.html has a table of latency per cache (which I'll copy here), and some other experimental numbers, including L2-TLB hit latency (on an L1DTLB miss).

Intel i7-4770 (Haswell), 3.4 GHz (Turbo Boost off), 22 nm. RAM: 32 GB (PC3-12800 cl11 cr2).

  • L1 Data cache = 32 KB, 64 B/line, 8-WAY.

  • L1 Instruction cache = 32 KB, 64 B/line, 8-WAY.

  • L2 cache = 256 KB, 64 B/line, 8-WAY

  • L3 cache = 8 MB, 64 B/line

  • L1 Data Cache Latency = 4 cycles for simple access via pointer (mov rax, [rax])

  • L1 Data Cache Latency = 5 cycles for access with complex address calculation (mov rax, [rsi + rax*8]).

  • L2 Cache Latency = 12 cycles

  • L3 Cache Latency = 36 cycles

  • RAM Latency = 36 cycles + 57 ns

The top-level benchmark page is http://www.7-cpu.com/utils.html. It still doesn't really explain what the different test sizes mean, but the code is available. The test results include Skylake, which is nearly the same as Haswell in this test.

@paulsm4's answer has a table for a multi-socket Nehalem Xeon, including some remote (other-socket) memory / L3 numbers.

Solution 3

If I remember correctly, it's about 1-2 clock cycles, but this is an estimate and newer caches may be faster. This comes from a computer architecture book I have, and the figures are for AMD, so Intel may be slightly different, but I would bound it between 5 and 15 clock cycles, which seems like a reasonable estimate to me.

EDIT: Whoops, L2 is 10 cycles including TAG access; L1 takes 1 to 2 cycles. My mistake :\

Solution 4

Actually the cost of an L1 cache hit is almost the same as the cost of a register access. It was surprising to me, but this is true, at least for my processor (Athlon 64). Some time ago I wrote a simple test application to benchmark the efficiency of access to shared data in a multiprocessor system. The application body is simply a memory variable being incremented for a predefined period of time. To make a comparison, I benchmarked a non-shared variable first. During that activity I captured the result, but then while disassembling the application I found that the compiler had defeated my expectations and applied an unwanted optimisation to my code: it just put the variable in a CPU register and incremented it iteratively in the register, without any memory access. The real surprise came after I forced the compiler to use an in-memory variable instead of a register variable. With the updated application I got almost the same benchmarking results. The performance degradation was really negligible (~1-2%) and looks like it was related to some side effect.
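
A rough sketch of the shape of that test (all names, the time window, and the use of clock() and volatile are my assumptions, not the original code; actual numbers will depend heavily on the compiler and CPU):

/* Sketch: count how many increments finish in a fixed wall-clock window,
 * once with the counter in a register and once forced into memory. */
#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define CHUNK 10000000ULL               /* increments between clock() checks */

static uint64_t count_register(double seconds) {
    uint64_t counter = 0;               /* the compiler keeps this in a register */
    clock_t end = clock() + (clock_t)(seconds * CLOCKS_PER_SEC);
    while (clock() < end) {
        for (uint64_t i = 0; i < CHUNK; i++) {
            counter++;
            asm volatile("" : "+r"(counter));   /* prevent folding the inner loop */
        }
    }
    return counter;
}

static uint64_t count_memory(double seconds) {
    volatile uint64_t counter = 0;      /* volatile forces a real memory increment */
    clock_t end = clock() + (clock_t)(seconds * CLOCKS_PER_SEC);
    while (clock() < end) {
        for (uint64_t i = 0; i < CHUNK; i++)
            counter = counter + 1;
    }
    return counter;
}

int main(void) {
    printf("register: %llu increments in 1 s\n", (unsigned long long)count_register(1.0));
    printf("memory:   %llu increments in 1 s\n", (unsigned long long)count_memory(1.0));
    return 0;
}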

As a result:

1) I think you can consider the L1 cache as an unmanaged pool of processor registers.

2) There is not much sense in applying brutal assembly optimisation by forcing the compiler to keep frequently accessed data in processor registers. If the data is really frequently accessed, it will live in the L1 cache, and because of that will have almost the same access cost as a processor register.

Comments

  • user541686 about 3 years

    I remember assuming that an L1 cache hit is 1 cycle (i.e. identical to register access time) in my architecture class, but is that actually true on modern x86 processors?

    How many cycles does an L1 cache hit take? How does it compare to register access?