Is physical or virtual addressing used in x86/x86_64 processors for the L1, L2 and L3 caches?


The answer to your question is: it depends. It's strictly a CPU design decision, balancing the tradeoff between performance and complexity.

Take for example recent Intel Core processors - they're physically tagged and virtually indexed (at least according to http://www.realworldtech.com/sandy-bridge/7/). This means that the caches can only complete a lookup in pure physical address space, in order to determine whether the line is there or not. However, since the L1 is 32k and 8-way associative, it uses 64 sets, so you need only address bits 6 to 11 to find the correct set. As it happens, virtual and physical addresses are the same in this range, so you can look up the DTLB in parallel with reading a cache set - a well-known trick (see http://en.wikipedia.org/wiki/CPU_cache for a good explanation).
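As a concrete sketch of the arithmetic above (using the 32 KiB / 8-way / 64-byte-line parameters from the answer; the address values are made up for illustration), the set-index bits fall entirely inside the 4 KiB page offset, so translation can never change them:

```python
LINE_SIZE = 64           # bytes per cache line
NUM_WAYS = 8
CACHE_SIZE = 32 * 1024   # 32 KiB L1
PAGE_SIZE = 4096         # 4 KiB pages

num_sets = CACHE_SIZE // (LINE_SIZE * NUM_WAYS)  # -> 64 sets

def set_index(addr):
    # Bits 6..11 of the address select the set.
    return (addr // LINE_SIZE) % num_sets

# Two arbitrary addresses sharing a page offset (as a virtual address
# and its physical translation always do):
virt = 0x7f0012345678
phys = 0x00039abc5678    # same page offset, different page frame

assert virt % PAGE_SIZE == phys % PAGE_SIZE
# Since the index bits lie below bit 12, the set can be chosen from the
# virtual address while the TLB translates the upper bits for the tag check:
assert set_index(virt) == set_index(phys)
print(num_sets, set_index(virt))
```

That is exactly why the set read and the DTLB lookup can proceed in parallel: the cache index never depends on the translated part of the address.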

In theory one could build a virtually indexed + virtually tagged cache, which would remove the requirement to go through address translation (TLB lookup, and also page walks in case of TLB misses). However, that would cause numerous problems, especially with memory aliasing - the case where two virtual addresses map to the same physical one.

Say core1 has virtual addr A cached in such a fully-virtual cache (it maps to phys addr C, but we haven't done that translation yet). core2 writes to virtual addr B, which maps to the same phys addr C - this means we need some mechanism (usually a "snoop", a term coined by Jim Goodman) that goes and invalidates that line in core1, handling the data merge and coherency management if needed. However, core1 can't answer that snoop, since it doesn't know about virtual addr B and doesn't store physical addr C in the virtual cache. So you can see we have an issue, although this is mostly relevant for strict x86 systems; other architectures may be more lax and allow simpler management of such caches.
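The aliasing problem can be shown with a toy model (entirely hypothetical, just to illustrate the scenario above): a virtually indexed + virtually tagged cache stores only virtual tags, so a snoop that arrives carrying a physical address has nothing to match against and needs an external reverse translation to find the victim line:

```python
class ToyVIVTCache:
    """Toy virtually-indexed, virtually-tagged cache: maps virtual line
    addresses to data. Physical addresses are never stored - which is
    exactly the problem for snooping."""

    def __init__(self):
        self.lines = {}  # virtual addr -> cached data

    def fill(self, vaddr, data):
        self.lines[vaddr] = data

    def lookup_virtual(self, vaddr):
        return self.lines.get(vaddr)

    def snoop_physical(self, paddr, translate):
        # A real snoop carries only the physical address. To invalidate,
        # the cache would need a reverse (phys -> virt) mapping it doesn't
        # have; here we cheat with an externally supplied translation.
        victims = [v for v in self.lines if translate(v) == paddr]
        for v in victims:
            del self.lines[v]
        return victims

# Virtual addrs A and B alias to the same physical addr C (made-up values).
A, B, C = 0x1000, 0x8000, 0x42000
page_table = {A: C, B: C}

cache = ToyVIVTCache()
cache.fill(A, "old value")            # core1 caches the line via addr A
# core2 writes via addr B -> phys addr C; the resulting snoop for C must
# somehow find the copy cached under A:
invalidated = cache.snoop_physical(C, page_table.get)
assert invalidated == [A]
assert cache.lookup_virtual(A) is None
```

In real hardware there is no such convenient external map on the snoop path, which is why physically tagged caches sidestep the whole problem.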

Regarding the other questions - there's no real connection with PAT that I can think of; the cache is already designed and can't change for different memory types. Same answer for the other question - the HW is mostly below the distinction between user/kernel mode (except for the mechanisms it provides for security checking, mostly the various rings).

Author: Alex

Updated on June 03, 2022
Comments

  • Alex
    Alex about 2 years

Which addressing is used in x86/x86_64 processors for caching in the L1, L2 and L3 (LLC) - physical or virtual (using PT/PTE and the TLB) - and does PAT (page attribute table) somehow affect it?

And is there a difference between drivers (kernel-space) and applications (user-space) in this case?


    Short answer - Intel uses virtually indexed, physically tagged (VIPT) L1 caches: What will be used for data exchange between threads are executing on one Core with HT?

    • L1 - Virtual addressing (in an 8-way cache, determining the set requires only the low 12 bits, which are the same in virt & phys)
    • L2 - Physical addressing (requires access to TLB for Virt-2-Phys)
    • L3 - Physical addressing (requires access to TLB for Virt-2-Phys)
  • Alex
    Alex over 10 years
    Big thanks! And in your opinion, is there any benefit in knowing this mechanism on x86 - as a developer, can I somehow use this knowledge to optimize the performance of my program?
  • Leeor
    Leeor over 10 years
    Absolutely, a SW developer that doesn't know the HW he runs on would do a poor job optimizing it (if needed to), or debugging it (when needed to :). The cache mapping address type is a little low level indeed, although it does open a hatch to some important optimizations such as SW prefetch intrinsics and cache-aware design. See this great post for examples - stackoverflow.com/questions/16699247/… . Also there's the question of out-of-order execution that might give some hints, and of course the variety of compiler optimizations (not HW, but important too)
  • Alex
    Alex over 10 years
    I mean - benefit from the knowledge that on x86 "they're physically tagged and virtually indexed"
  • Leeor
    Leeor over 10 years
    It's not x86-specific, it's just a common design point that occurs on many CPUs. I'm pretty sure most ARM-based designs also use it. To benefit from that, you need to make sure your addresses don't align too much on the tag bits physically (or at least have a good spread) - no easy task, as you usually don't decide where the OS assigns your pages.
  • Alex
    Alex over 10 years
    Thanks! But if I can't affect "where the OS assigns your pages", what benefit can I take from this?
  • Leeor
    Leeor over 10 years
    Not much, probably nothing. Caches are designed to achieve the best spread for addresses in order to minimize cache thrashing. It would have a far greater benefit to design your code to be cache friendly in general - by tiling large structures to fit in the cache, or avoiding false sharing - than to worry about physical pages being scattered. I would pay attention to how the lower bits match (e.g. when working with A and B arrays, try to have them at different page offsets), but that applies to virtual addresses and is not specifically related to VIPT caches
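    Leeor's "try to have them at different page offsets" advice can be illustrated with the same set-index arithmetic as in the answer (a hypothetical sketch using the 32 KiB / 8-way / 64-byte-line L1 parameters; the base addresses are made up):

    ```python
    LINE_SIZE, NUM_SETS = 64, 64   # 32 KiB / 8 ways / 64 B lines -> 64 sets

    def set_index(addr):
        return (addr // LINE_SIZE) % NUM_SETS

    # Arrays A and B starting at the same page offset: element i of each
    # always maps to the same set, so streaming over both makes every pair
    # of lines compete for the same set (the ways absorb some of it, but
    # the pressure is systematic).
    base_a, base_b = 0x10000, 0x30000
    collisions = sum(set_index(base_a + i) == set_index(base_b + i)
                     for i in range(0, 4096, LINE_SIZE))
    assert collisions == 64        # every line pair collides

    # Offsetting B by a few cache lines removes the systematic collisions.
    base_b2 = base_b + 3 * LINE_SIZE
    collisions2 = sum(set_index(base_a + i) == set_index(base_b2 + i)
                      for i in range(0, 4096, LINE_SIZE))
    assert collisions2 == 0
    ```

    As the comment notes, this reasoning uses only virtual addresses and page offsets, so it applies regardless of whether the cache is VIPT or PIPT.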
  • Peter Cordes
    Peter Cordes over 7 years
    re: the VIPT L1 speed hack that allows fetching tags (and data) from a set in parallel with the TLB access. It's really more like a PIPT cache, with the index translation happening for free (because the index bits are all below the page offset). I took a stab at writing a detailed explanation of why it works a while ago. You might want to link that as well as Wikipedia.