Caching

When we write programs for computers, we often abstract memory away as a large, fast, contiguous, byte-addressable array. But we run into a problem when we try to implement this abstraction in hardware - it turns out to be impossible to satisfy all of these conditions at once.

Memory that is both fast and large is extremely expensive to build - however, it's possible to cheaply make a small amount of fast memory, and it's also possible to cheaply make a large amount of slow memory.

To provide the illusion of a large amount of fast memory, we can construct a memory hierarchy - a series of caches, with each level larger and slower than the one above it.

[Figure: the memory hierarchy]

Caching

Caching exploits temporal locality as well as spatial locality. The distribution of memory accesses is not uniform - if we access a particular block of memory, it's very likely that we will access that block again soon (temporal locality), and it's also very likely that we will access the memory right next to it (spatial locality). By keeping recently accessed items in the cache, we can speed up memory access.
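
For a concrete feel for spatial locality, here is a minimal C sketch (the matrix size and function names are mine, purely for illustration): summing a matrix row by row touches adjacent bytes on every step, while summing it column by column jumps a full row ahead each time and wastes the cache line it just fetched.

```c
#include <stdio.h>

#define N 1024

static double matrix[N][N];

/* Row-major traversal: consecutive iterations touch adjacent memory,
 * so most accesses hit in the cache line brought in by the first miss. */
double sum_row_major(void) {
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += matrix[i][j];
    return sum;
}

/* Column-major traversal: each access is N * sizeof(double) bytes away
 * from the previous one, so spatial locality is lost and many more
 * cache misses occur, even though the arithmetic is identical. */
double sum_col_major(void) {
    double sum = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += matrix[i][j];
    return sum;
}

int main(void) {
    printf("%f %f\n", sum_row_major(), sum_col_major());
    return 0;
}
```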

Associativity

If a cache block can go into any location inside the cache, the cache is said to be fully associative. A fully associative cache may experience fewer cache misses (it eliminates conflict misses entirely), but lookups become more expensive - we have to check every entry in the cache to determine whether we have a hit or a miss.

Usually we split the cache into sets and assign each memory block to a specific set. If each set can hold \(n\) entries, we refer to the cache as \(n\)-way associative.

On the other end of the spectrum, a direct-mapped cache is a cache where each set can only hold one block.

A fully associative cache is really an \(m\)-way associative cache, where \(m\) is the total number of cache entries. Similarly, a direct-mapped cache is really a 1-way associative cache.
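
To make the mapping concrete, here is a small C sketch (the block size and set count are made-up example parameters): the block number of an address, taken modulo the number of sets, picks the set, and the two extremes of the spectrum fall out of the same formula.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical cache: 64-byte blocks, 64 sets, 4 ways per set
 * (i.e. a 16 KiB, 4-way set associative cache). */
#define BLOCK_SIZE 64u
#define NUM_SETS   64u

/* Which set a given address maps to: take the block number modulo
 * the number of sets. */
uint32_t set_index(uint32_t addr) {
    uint32_t block_number = addr / BLOCK_SIZE;
    return block_number % NUM_SETS;
}

int main(void) {
    /* With NUM_SETS == 1, every block maps to the single set and can sit
     * in any of its entries: fully associative. With one entry per set,
     * each block has exactly one legal slot: direct-mapped. */
    printf("address 0x1234ABCD maps to set %u\n", set_index(0x1234ABCDu));
    return 0;
}
```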

Determining a cache hit

A memory address is split into the tag, the index, and the block offset.

[Figure: a memory address split into tag, index, and block offset]

When we look up an address in the cache, we use the index to select a cache entry and compare the tag stored in that entry against the tag bits of our address. If they match, we have a hit, and we use the offset to pick out the byte we want from the entry's data block.

Since we use the index to determine which set of the cache to look in, the index should be \(\lceil \log_2(s)\rceil\) bits for a cache with \(s\) sets (for a direct-mapped cache, \(s\) is simply the number of entries). Similarly, we use the offset to identify which byte to fetch inside the data block, so for a data block of \(b\) bytes, our offset should be \(\lceil \log_2(b)\rceil\) bits long.

The tag bits are simply the bits that are left over.
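
Putting the pieces together, here is a sketch in C of the decomposition and the tag comparison, assuming example parameters of 64-byte blocks and 64 sets (so 6 offset bits and 6 index bits); the helper names are mine.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Example geometry: 64-byte blocks and 64 sets, so 6 offset bits,
 * 6 index bits, and the rest of the address is the tag. */
#define OFFSET_BITS 6u
#define INDEX_BITS  6u

uint32_t offset_of(uint32_t addr) { return addr & ((1u << OFFSET_BITS) - 1); }
uint32_t index_of(uint32_t addr)  { return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); }
uint32_t tag_of(uint32_t addr)    { return addr >> (OFFSET_BITS + INDEX_BITS); }

/* A hit means the entry selected by the index is valid and its stored
 * tag matches the tag bits of the address we're looking up. */
bool is_hit(uint32_t addr, uint32_t stored_tag, bool entry_valid) {
    return entry_valid && stored_tag == tag_of(addr);
}

int main(void) {
    uint32_t addr = 0xDEADBEEFu;
    printf("tag=0x%x index=%u offset=%u\n",
           tag_of(addr), index_of(addr), offset_of(addr));
    return 0;
}
```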

Virtual Memory

Programs assume that they have access to all the memory available to the system. To them, memory looks like an infinitely large, contiguous, byte-addressable array.

The OS is responsible for providing every program running on the computer with this abstraction. To do so, it uses a page table to translate virtual addresses into physical ones.
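
As a toy model (not how a real OS lays it out - real page tables are multi-level and managed by the kernel), a single-level page table can be sketched as an array indexed by the virtual page number:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define PAGE_BITS 12u            /* 4 KiB pages                      */
#define NUM_PAGES (1u << 8)      /* toy 20-bit virtual address space */

/* One page-table entry: where (and whether) the page is in physical memory. */
typedef struct {
    bool     present;   /* is the page currently in physical memory? */
    uint32_t frame;     /* physical frame number                     */
} pte_t;

static pte_t page_table[NUM_PAGES];

/* Translate a virtual address by splitting it into a virtual page number
 * and a page offset, then looking the page number up in the table. */
bool translate(uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpn    = vaddr >> PAGE_BITS;
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);
    if (vpn >= NUM_PAGES || !page_table[vpn].present)
        return false;   /* page fault - the OS has to step in */
    *paddr = (page_table[vpn].frame << PAGE_BITS) | offset;
    return true;
}

int main(void) {
    page_table[3] = (pte_t){ .present = true, .frame = 42 };   /* toy mapping */
    uint32_t paddr;
    if (translate(0x3ABCu, &paddr))
        printf("virtual 0x3ABC -> physical 0x%x\n", paddr);
    return 0;
}
```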

The MMU (memory management unit) is the hardware responsible for taking in virtual addresses and outputting physical addresses. It maintains a cache called the TLB, or translation lookaside buffer.

Every virtual-to-physical translation is stored in the page table. In early virtual memory systems the TLB didn't exist, so each translation required an access to the page table held in main memory. This is problematic because it breaks the memory hierarchy we set up earlier: the page table is too big to fit in the cache, so it resides in main memory, and since every memory access has to go through the page table, we're bottlenecked by translation speed. In the worst case the page table may not even be in memory, and we must pull it in from disk, incurring an even larger time penalty.

The TLB is a cache of common memory translations.
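
Here is a toy sketch in C of where the TLB sits in the translation path (the sizes and helper names are made up, and the page-table walk is stubbed out): the TLB is checked first, and only a miss falls back to the much slower walk of the page table in main memory.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define PAGE_BITS 12u   /* 4 KiB pages                            */
#define TLB_SIZE  16u   /* toy direct-mapped TLB with 16 entries  */

typedef struct {
    bool     valid;
    uint32_t vpn;   /* virtual page number   */
    uint32_t pfn;   /* physical frame number */
} tlb_entry_t;

static tlb_entry_t tlb[TLB_SIZE];

/* Stand-in for the full page-table walk in main memory - in reality this
 * is orders of magnitude slower than a TLB hit. Here it's an identity map
 * purely so the sketch compiles and runs. */
static uint32_t walk_page_table(uint32_t vpn) {
    return vpn;
}

uint32_t translate_with_tlb(uint32_t vaddr) {
    uint32_t vpn    = vaddr >> PAGE_BITS;
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);
    uint32_t slot   = vpn % TLB_SIZE;

    /* TLB hit: the translation is already cached, no page table access needed. */
    if (tlb[slot].valid && tlb[slot].vpn == vpn)
        return (tlb[slot].pfn << PAGE_BITS) | offset;

    /* TLB miss: walk the page table, then cache the translation. */
    uint32_t pfn = walk_page_table(vpn);
    tlb[slot] = (tlb_entry_t){ .valid = true, .vpn = vpn, .pfn = pfn };
    return (pfn << PAGE_BITS) | offset;
}

int main(void) {
    printf("0x%x\n", translate_with_tlb(0x12345u));   /* misses, then fills the TLB */
    printf("0x%x\n", translate_with_tlb(0x12345u));   /* hits the entry just filled */
    return 0;
}
```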

Caching and Virtual Memory

Here we have a choice - we can either cache on the virtual address or on the physical address. Each of these choices comes with different tradeoffs:

  1. Physically indexed, physically tagged (PIPT)
    • In this kind of cache, the physical address is used for both the index and the tag. This is great, because we have no problems with coherency or aliasing.
    • On the other hand, this means we need a TLB lookup every time we access memory. Furthermore, we pay an even greater penalty if the translation is not in the TLB - we must then go to the page table in main memory, which takes even more time.
  2. Virtually indexed, virtually tagged (VIVT)
    • Uses the virtual address for both the index and the tag
    • Cache lookup can happen in parallel with TLB lookup
    • However, there is the homonym problem - Process A uses a virtual address, and Process B later uses the same virtual address to refer to completely different physical memory. In this situation, our cache will return the wrong data for the lookup.
    • Different virtual addresses mapping to the same physical address (aliasing) are also difficult to deal with in this setup.
  3. Virtually indexed, physically tagged (VIPT)
    • This lets the TLB lookup proceed in parallel with fetching the data and tags from the cache RAM, just like a VIVT cache.
      • However, the tag comparison must wait until the MMU produces the physical address.
    • Because the tag is physical, homonyms are no longer a problem, and we avoid most of the aliasing headaches of VIVT caches.
    • Regardless, we can still have the synonym problem, where the same physical block is cached in two different slots because it was reached through two different virtual indexes (see the sizing example after this list).
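
As a back-of-the-envelope example of sizing a VIPT cache to avoid synonyms (typical numbers, not taken from anything above): with 4 KiB pages the page offset is \(\log_2(4096) = 12\) bits. A 32 KiB, 8-way set associative cache with 64-byte lines has \(32768 / (8 \cdot 64) = 64\) sets, so it needs \(\log_2(64) = 6\) index bits plus \(\log_2(64) = 6\) offset bits - exactly 12 bits, all of which lie inside the page offset and are therefore identical in the virtual and physical address. Sized this way, a VIPT cache is effectively indexed by physical bits, so two synonyms can never land in different sets.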

In general, if the TLB can translate the address without delaying the cache lookup, then PIPT works best, since it avoids the homonym and synonym problems entirely.

Written on April 3, 2018