Why is there no possible performance improvement with cache upsizing?

Usually, we expect cache upsizing to improve system performance. However, this is not always the case, and there are several possible reasons:

  1. “Compulsory” misses, rather than “capacity” misses, prevent the performance improvement from cache upsizing. This means the temporal and spatial locality offered by the cache are not being exploited. For example, the program keeps accessing new data with no reuse, which can happen in streaming applications; or, if context switches happen frequently, the cache may be flushed often and more compulsory misses will occur.

  2. In a cache-coherent system, two caches may compete for one copy of data, causing “coherence” misses. This can happen when two CPUs try to acquire the same lock or semaphore simultaneously. Increasing the cache size will not help performance in this case.
  3. If the cache upsizing is achieved by enlarging the cache line, the time to load a cache line increases. This in turn increases the cache miss penalty and the average memory access time (see the AMAT sketch after this list).
  4. If the cache upsizing is achieved by increasing associativity, the hit latency, and with it the average memory access time, may increase, because highly associative caches are hard to implement physically.
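
To make the trade-off in points 3 and 4 concrete, here is a minimal sketch using the standard formula AMAT = hit time + miss rate × miss penalty. All cycle counts and miss rates below are made-up, illustrative numbers, not measurements.

/* Rough sketch: average memory access time (AMAT) with illustrative numbers.
 * AMAT = hit time + miss rate * miss penalty.
 * The cycle counts and miss rates are invented for demonstration only. */
#include <stdio.h>

static double amat(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}

int main(void)
{
    /* Baseline cache: small lines, modest miss rate. */
    double base = amat(1.0, 0.05, 50.0);

    /* Larger lines: assume the miss rate drops a little, but each miss
     * now has to fill a bigger line, so the miss penalty grows more. */
    double bigger_lines = amat(1.0, 0.04, 80.0);

    printf("baseline AMAT    : %.2f cycles\n", base);          /* 3.50 */
    printf("larger-line AMAT : %.2f cycles\n", bigger_lines);  /* 4.20 */
    return 0;
}

In this made-up example, the lower miss rate of the larger line is more than cancelled by the higher miss penalty, so the average memory access time gets worse.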

Continue reading → Why is there no possible performance improvement with cache upsizing?

How to implement true LRU? (II)

We covered two true LRU implementations, the square matrix implementation and the counter-based implementation, in the previous post. We will discuss two more true LRU implementations in this post.

In the following discussion, we assume the number of ways in a cache set is N.

Linked List Implementation

A linked list is probably the most straightforward LRU implementation. Nodes are linked from head to tail: the head points to the most recently used item, while the tail points to the least recently used item.

Upon a cache access, the corresponding node “jumps” to the head, and the previous head becomes the next node of the new head.
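
As a rough software sketch of this idea (not the hardware structure from the paper), the code below keeps one node per way of an N-way set; touch() moves the accessed way to the head, and the tail is always the replacement victim. All names are illustrative.

/* Minimal software sketch of a linked-list LRU for one N-way cache set.
 * The head is the MRU way, the tail is the LRU victim. */
#include <stdio.h>

#define N 4  /* number of ways in the set */

struct node {
    int way;               /* which way this node represents */
    struct node *prev, *next;
};

static struct node nodes[N];
static struct node *head, *tail;

static void init(void)
{
    for (int i = 0; i < N; i++) {
        nodes[i].way  = i;
        nodes[i].prev = (i > 0)     ? &nodes[i - 1] : NULL;
        nodes[i].next = (i < N - 1) ? &nodes[i + 1] : NULL;
    }
    head = &nodes[0];
    tail = &nodes[N - 1];
}

/* On an access to a way: unlink its node and "jump" it to the head. */
static void touch(int way)
{
    struct node *n = &nodes[way];
    if (n == head)
        return;
    /* Unlink. */
    n->prev->next = n->next;
    if (n->next)
        n->next->prev = n->prev;
    else
        tail = n->prev;
    /* Relink at the head; the previous head becomes the next node. */
    n->prev = NULL;
    n->next = head;
    head->prev = n;
    head = n;
}

int main(void)
{
    init();
    touch(2);
    touch(1);
    printf("LRU victim is way %d\n", tail->way);  /* way 3 */
    return 0;
}

In hardware, the prev/next pointers would be small registers holding way indices rather than C pointers, but the update on each access is the same.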

Continue reading → How to implement true LRU? (II)

How to implement true LRU? (I)

Least Recently Used, or LRU, is one of the most widely used cache replacement policies, and it also appears in various other applications where arbitration is required. In this post, we will discuss different true LRU implementations, based on the paper “Highly Efficient LRU Implementations for High Associativity Cache Memory”.

In the following discussion, we assume the number of ways in a cache set is N.

Continue reading → How to implement true LRU? (I)

What are MESI / MEOSI / MEOFSI protocols?

We discussed how the MSI protocol works in the previous post. However, in practice, other variants of the MSI protocol are more commonly used. These include:

MESI
MEOSI
MEOFSI

MESI Protocol

MESI adds the Exclusive state to the basic MSI protocol, indicating a “clean exclusive” state. In comparison, the M state indicates a “dirty exclusive” or “modified exclusive” state.
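
As a toy sketch of how one cache’s copy of a line moves between these states, consider the transition functions below. The function names and simplifications are my own; a real controller also issues the corresponding bus transactions and write-backs.

/* Toy sketch of MESI state transitions for a single cache line, as seen by
 * one cache. Only the state changes are modeled here. */
#include <stdio.h>
#include <stdbool.h>

enum mesi { INVALID, SHARED, EXCLUSIVE, MODIFIED };

/* Local processor reads the line; other_sharers says whether any other
 * cache currently holds a copy (learned from the snoop response). */
static enum mesi on_local_read(enum mesi s, bool other_sharers)
{
    if (s == INVALID)
        return other_sharers ? SHARED : EXCLUSIVE;  /* clean exclusive */
    return s;  /* S, E, M stay as they are on a local read */
}

/* Local processor writes the line: it must end up as the only, dirty copy. */
static enum mesi on_local_write(enum mesi s)
{
    (void)s;
    return MODIFIED;  /* I/S first invalidate other copies, E upgrades silently */
}

/* Another cache reads the line (observed by snooping). */
static enum mesi on_remote_read(enum mesi s)
{
    if (s == MODIFIED || s == EXCLUSIVE)
        return SHARED;  /* M also supplies or writes back the dirty data */
    return s;
}

/* Another cache writes the line: our copy becomes stale. */
static enum mesi on_remote_write(enum mesi s)
{
    (void)s;
    return INVALID;
}

int main(void)
{
    enum mesi s = INVALID;
    s = on_local_read(s, false);  /* -> EXCLUSIVE, clean exclusive */
    s = on_local_write(s);        /* -> MODIFIED, dirty exclusive  */
    s = on_remote_read(s);        /* -> SHARED                     */
    printf("final state: %d (SHARED)\n", s);
    return 0;
}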

Continue reading → What are MESI / MEOSI / MEOFSI protocols?

What is cache coherency? How to enforce cache coherency?

In the previous post, we discussed the UMA / NUMA architectures. However, caching shared data in different processors introduces a new problem, because each processor views memory through its own cache. Cache coherency plays an important role in preventing different processors from seeing different values for the same data.
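
To illustrate the problem, here is a deliberately simplified model in which each core keeps a private cached copy of the same word; without a coherence protocol, a write by one core is simply not seen by the other. The struct and values are purely illustrative.

/* Toy model: two cores each keep a private cached copy of the same word x.
 * Without coherence, a write by core 0 leaves core 1 with a stale value. */
#include <stdio.h>

struct core {
    int cached_x;   /* this core's private cached copy of variable x */
};

int main(void)
{
    int memory_x = 0;                           /* value of x in main memory */
    struct core c0 = { .cached_x = memory_x };  /* both cores read x ...     */
    struct core c1 = { .cached_x = memory_x };  /* ... and cache a copy      */

    /* Core 0 writes x only in its own cache (write-back, no invalidation). */
    c0.cached_x = 42;

    /* Core 1 still reads its stale copy: the two cores now disagree on x. */
    printf("core 0 sees x = %d\n", c0.cached_x);  /* 42 */
    printf("core 1 sees x = %d\n", c1.cached_x);  /* 0  */
    printf("memory holds x = %d\n", memory_x);    /* 0  */
    return 0;
}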

Continue reading → What is cache coherency? How to enforce cache coherency?

What are NUMA / UMA architectures?

We have discussed CPU pipelining and out-of-order scheduling in previous posts; these techniques increase instruction-level parallelism. The next level of parallelism is thread-level parallelism, or TLP. One way to exploit TLP is a multiprocessor, i.e., a computer system consisting of multiple processors, typically controlled by one operating system and sharing the same memory address space.

Generally speaking, there are two multiprocessor architectures:

Uniform memory access / UMA
Non-uniform memory access / NUMA

Continue reading → What are NUMA / UMA architectures?

What are the types of caches based on index and tag bits?

Based on whether the index and tag bits are taken from the physical address or the virtual address (see the address-split sketch after the list below), caches can be categorized as:

Physically Indexed, Physically Tagged Cache / PIPT Cache
Physically Indexed, Virtually Tagged Cache / PIVT Cache
Virtually Indexed, Physically Tagged Cache / VIPT Cache
Virtually Indexed, Virtually Tagged Cache / VIVT Cache
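
The sketch below shows how a cache splits an address into tag, set index, and byte offset; whether addr is a virtual or a physical address is exactly what distinguishes the four variants above. The cache geometry used here is an illustrative assumption.

/* Sketch of splitting an address into tag, set index, and byte offset.
 * Whether 'addr' is virtual or physical depends on the cache type
 * (PIPT / PIVT / VIPT / VIVT). Geometry values are illustrative. */
#include <stdio.h>
#include <stdint.h>

#define LINE_BYTES   64u    /* cache line size -> 6 offset bits  */
#define NUM_SETS     128u   /* sets in the cache -> 7 index bits */
#define OFFSET_BITS  6u
#define INDEX_BITS   7u

int main(void)
{
    uint32_t addr = 0x12345678u;  /* virtual or physical, depending on design */

    uint32_t offset = addr & (LINE_BYTES - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    printf("tag=0x%x index=%u offset=%u\n",
           (unsigned)tag, (unsigned)index, (unsigned)offset);
    return 0;
}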

Continue reading → What are the types of caches based on index and tag bits?