Why is there no possible performance improvement with cache upsizing?

Usually, we expect cache upsizing to improve system performance. However, this is not always the case. There are several possible reasons:

  1. Compulsory misses, rather than capacity misses, dominate, so a larger cache cannot help. In other words, the temporal and spatial locality that a cache exploits is not present: the program keeps accessing new data with no reuse, which can happen in streaming applications; and if context switches happen often, the cache may be flushed often, causing even more compulsory misses

  2. In a cache-coherent system, 2 caches may compete for the same copy of data, i.e., “coherence” misses. This can happen when 2 CPUs try to grab the same lock or semaphore simultaneously. Increasing the cache size will not help performance in this case
  3. If the upsizing is achieved by enlarging the cache line, then the time to load a cache line increases. This in turn increases the cache miss penalty and the average memory access time (AMAT); see the worked example after this list
  4. If the upsizing is achieved by increasing associativity, then the hit latency, and with it the average memory access time, may increase, because a highly associative cache is hard to implement physically
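
To make item 3 concrete, here is a worked AMAT comparison. The numbers are purely illustrative assumptions, not measurements from any real system:

    AMAT = hit time + miss rate × miss penalty

    original line size: AMAT = 1 + 0.05 × 20 = 2.0 cycles
    doubled line size:  AMAT = 1 + 0.04 × 40 = 2.6 cycles

Even though the larger line cuts the miss rate from 5% to 4%, the doubled miss penalty makes the average access slower overall.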

How to implement w = 3/2 x + 1/4 y + z?

Obviously, w = 3/2 x + 1/4 y + z = x + (x >> 1) + (y >> 2) + z. But this is not the end of the story.

All variables here need to be interpreted as fixed-point numbers, with the lower 2 bits representing 0.5 and 0.25; otherwise the right shifts would throw away the fractional part.

Let’s say x, y and z are within the range 0 to 3, inclusive. Then w is within the range 0 to 8.25, inclusive. w’s integer part needs 4 bits, so w has 6 bits in total.

‘d8.25 can be represented as ‘b1000.01
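
Here is a minimal C sketch of this datapath (the post itself gives no code; the function name w_fixed and the Q4.2 scaling, i.e., holding every value in units of 0.25, are my assumptions):

    #include <stdio.h>
    #include <assert.h>

    /* Values are held in Q4.2 fixed point (units of 0.25), so the two
     * LSBs are the 0.5 and 0.25 bits. */
    unsigned w_fixed(unsigned x, unsigned y, unsigned z)
    {
        assert(x <= 3 && y <= 3 && z <= 3); /* integer inputs in [0, 3] */
        unsigned xq = x << 2;               /* zero-extend with 2 fraction bits */
        unsigned yq = y << 2;
        unsigned zq = z << 2;
        /* w = x + (x >> 1) + (y >> 2) + z; the shifts now keep the fraction */
        return xq + (xq >> 1) + (yq >> 2) + zq; /* 6-bit Q4.2 result */
    }

    int main(void)
    {
        unsigned w = w_fixed(3, 3, 3);      /* expect 8.25, i.e., 'b1000.01 */
        printf("w = %u.%02u\n", w >> 2, (w & 3) * 25);
        return 0;
    }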

What are universal gates?

A universal gate is a gate or a group of gates which can implement any Boolean function without the need to use any other types of gates.

Universal gates satisfy 2 conditions:

    1. They should be able to block input propagation. For example, if one input of an AND gate is 0, then the output of the AND gate stays at 0, i.e., the other input is blocked
    2. They should be able to form an inverter. For example, if both inputs of a NAND gate are tied to the same value, the output of the NAND gate is the inversion of that value, i.e., a NAND gate can form an inverter; the sketch after this list builds NOT, AND and OR from NAND alone
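
As a sanity check, here is a small C sketch showing how NAND alone yields NOT, AND and OR (the function names and the 0/1 encoding are illustrative choices, not from the original post):

    #include <assert.h>

    static int nand(int a, int b) { return !(a && b); }

    static int not_(int a)        { return nand(a, a); }             /* tie inputs together */
    static int and_(int a, int b) { return not_(nand(a, b)); }       /* NAND then invert */
    static int or_(int a, int b)  { return nand(not_(a), not_(b)); } /* De Morgan */

    int main(void)
    {
        for (int a = 0; a <= 1; a++)
            for (int b = 0; b <= 1; b++) {
                assert(not_(a)    == !a);
                assert(and_(a, b) == (a && b));
                assert(or_(a, b)  == (a || b));
            }
        return 0;
    }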

Design a circuit that detects if one input is a delayed version of the other

Assume 1-bit input A is generated by a random sequence, and 1-bit input B is a delayed version of A. The delay value varies, and can be any value in [1, 10] (inclusive).

Design a circuit that takes A and B as inputs and produces a single output Y. If B is guaranteed to contain 1s, then Y should be 1.
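
One way to approach it, before reading the full answer: keep the last 10 values of A in a shift register, track which delays are still consistent with B, and assert Y once some delay survives and B has carried real 1s. The following behavioral C model sketches that idea; it is not RTL, and MAX_DELAY, the candidate-elimination scheme and all names are my assumptions:

    #include <string.h>

    #define MAX_DELAY 10

    typedef struct {
        int hist[MAX_DELAY + 1]; /* hist[k] = value of A, k cycles ago */
        int cand[MAX_DELAY + 1]; /* cand[k] = delay k still consistent */
        int seen_one;            /* a 1 has appeared on B */
        int cycles;
    } detector;

    void det_init(detector *d)
    {
        memset(d, 0, sizeof *d);
        for (int k = 1; k <= MAX_DELAY; k++) d->cand[k] = 1;
    }

    /* Feed one cycle of A and B; returns Y. */
    int det_step(detector *d, int a, int b)
    {
        for (int k = MAX_DELAY; k >= 1; k--) /* shift A's history */
            d->hist[k] = d->hist[k - 1];
        d->hist[0] = a;
        d->cycles++;
        if (b) d->seen_one = 1;

        int any = 0;
        for (int k = 1; k <= MAX_DELAY; k++) {
            if (d->cycles > k && b != d->hist[k]) d->cand[k] = 0; /* mismatch kills delay k */
            any |= d->cand[k];
        }
        return any && d->seen_one; /* some delay consistent, and B has shown 1s */
    }

In hardware this would map to a 10-deep shift register on A, ten XNOR comparators against B, per-delay sticky mismatch flags, and an OR reduction, though the post's own solution may differ.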

How to implement pseudo LRU?

We discussed true LRU implementations, but they are usually costly. In this post, we will explain how to implement pseudo LRU.

An example is shown below. The data structure is similar to a binary tree: for N ways in a cache set, we need to keep (N – 1) bits, one per tree node. If there are 8 ways in a set, then we will need 7 bits.

Upon a cache access, all tree nodes on the path to the accessed way will be flipped. For example, if way 3 is accessed, then h0, h1 and h4 will be flipped in the next cycle.

To find the LRU way, we can perform a depth-first search starting from h0 and traverse down the tree: if a node is 0, we traverse the left subtree; otherwise, we traverse the right subtree. In the diagram above, the LRU is way 3.

Once we understand the basic idea of pseudo LRU, it is easy to write code for it; a sketch follows.
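
Here is a C sketch of the flip-on-access scheme described above, for 8 ways (the array h[] mirrors the node names h0 through h6; the function names are my choices). h[0] is the root, and the children of h[i] are h[2i+1] and h[2i+2]:

    #define WAYS   8
    #define LEVELS 3            /* log2(WAYS) */

    static int h[WAYS - 1];     /* the 7 tree bits */

    /* On an access to `way`, flip every node on the path to that way.
     * Accessing way 3 flips h[0], h[1] and h[4], as in the example. */
    void plru_access(int way)
    {
        int node = 0;
        for (int lvl = LEVELS - 1; lvl >= 0; lvl--) {
            int right = (way >> lvl) & 1; /* branch taken at this level */
            h[node] ^= 1;                 /* flip the node we pass */
            node = 2 * node + 1 + right;
        }
    }

    /* Find the pseudo-LRU way: a 0 bit means go left, a 1 bit means go right. */
    int plru_victim(void)
    {
        int node = 0, way = 0;
        for (int lvl = 0; lvl < LEVELS; lvl++) {
            int right = h[node];
            way = (way << 1) | right;
            node = 2 * node + 1 + right;
        }
        return way;
    }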

How to implement true LRU? (II)

We covered 2 true LRU implementations in the previous post: the square matrix implementation and the counter-based implementation. We will discuss 2 more true LRU implementations in this post.

In the following discussion, we assume the number of ways in a cache set is N.

Linked List Implementation

Linked list is probably the most straightforward LRU implementation. Nodes are linked from head to tail. Head points to the most recently used item, while tail points to the least recently used item.

Upon a cache access, the corresponding node “jumps” to the head, and the previous head becomes the next node of the new head.
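
A minimal C sketch of this move-to-head operation (the struct and field names are my choices, not the paper's):

    #include <stddef.h>

    typedef struct node {
        int way;
        struct node *prev, *next;
    } node;

    typedef struct {
        node *head; /* most recently used */
        node *tail; /* least recently used */
    } lru_list;

    /* Move an accessed node to the head of the list. */
    void lru_touch(lru_list *l, node *n)
    {
        if (l->head == n) return;            /* already the MRU */
        /* unlink n from its current position */
        if (n->prev) n->prev->next = n->next;
        if (n->next) n->next->prev = n->prev;
        if (l->tail == n) l->tail = n->prev; /* n was the tail */
        /* relink n at the head; the old head becomes n's next node */
        n->prev = NULL;
        n->next = l->head;
        if (l->head) l->head->prev = n;
        l->head = n;
    }

On a miss, the victim is simply whatever the tail points to.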

How to implement true LRU? (I)

Least Recently Used, or LRU, is one of the most effective practical cache replacement policies, and it is used in various applications where arbitration is required. In this post, we will discuss different true LRU implementations, based on the paper “Highly Efficient LRU Implementations for High Associativity Cache Memory”.

In the following discussion, we assume the number of ways in a cache set is N.
