Why is there no possible performance improvement with cache upsizing?

Usually, with cache upsizing, we expect to see system performance improvement. However, this is not always the case. There are several possible reasons:

  1. “Compulsory” misses, rather than “capacity” misses, dominate and prevent cache upsizing from helping. This means the temporal and spatial locality a cache offers are not being exploited. For example, the program keeps accessing new data with no reuse, which can happen in streaming applications; also, if context switches happen often, the cache may be flushed frequently and more “compulsory” misses will occur.

  2. In a cache-coherent system, two caches may compete for the same copy of data, causing “coherence” misses. This can happen when two CPUs try to acquire a lock or semaphore simultaneously. Increasing the cache size will not help performance in this case.
  3. If the cache upsizing is achieved by enlarging the cache line, then the time to load a cache line increases. This in turn increases the cache miss penalty and the average memory access time.
  4. If the cache upsizing is achieved by increasing associativity, then the hit latency, and hence the average memory access time, may increase, because a highly associative cache is hard to implement physically (more tag comparators and wider way-select multiplexers).
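To see the effect of items 3 and 4 numerically, the standard average memory access time (AMAT) model can be sketched in a few lines of Python. The cycle counts and miss rates below are hypothetical, chosen only to illustrate the trade-offs:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Hypothetical numbers (in cycles). A compulsory-dominated workload barely
# benefits from a bigger cache, because the miss rate hardly drops:
small_cache = amat(hit_time=1, miss_rate=0.05, miss_penalty=100)   # ~6.0
big_cache   = amat(hit_time=1, miss_rate=0.048, miss_penalty=100)  # ~5.8

# Enlarging the cache line may cut the miss rate but raise the miss
# penalty (longer line fill), which can make AMAT worse overall:
big_lines   = amat(hit_time=1, miss_rate=0.04, miss_penalty=150)   # ~7.0
```

The same model also captures item 4: raising associativity lowers the miss rate but can lengthen the hit time, and whether AMAT improves depends on which effect wins.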


How to implement true LRU? (II)

We covered 2 true LRU implementations, the square-matrix implementation and the counter-based implementation, in the previous post. We will discuss 2 more true LRU implementations in this post.

In the following discussion, we assume the number of ways in a cache set is N.

Linked List Implementation

A linked list is probably the most straightforward LRU implementation. Nodes are linked from head to tail: the head points to the most recently used item, while the tail points to the least recently used item.

Upon a cache access, the corresponding node “jumps” to the head, and the previous head becomes the next node of the new head.
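The “jump to head” update above can be sketched as a small Python model of a doubly linked list over the N ways of a set (the class and method names are illustrative, not from the post):

```python
class Node:
    def __init__(self, way):
        self.way = way        # which cache way this node tracks
        self.prev = None
        self.next = None

class LinkedListLRU:
    """Doubly linked list over N ways: head = MRU, tail = LRU (a sketch)."""

    def __init__(self, n_ways):
        self.nodes = [Node(w) for w in range(n_ways)]
        for a, b in zip(self.nodes, self.nodes[1:]):
            a.next, b.prev = b, a
        self.head, self.tail = self.nodes[0], self.nodes[-1]

    def access(self, way):
        """On an access to `way`, move its node to the head (MRU position)."""
        node = self.nodes[way]
        if node is self.head:
            return
        # Unlink the node from its current position.
        node.prev.next = node.next
        if node.next:
            node.next.prev = node.prev
        else:
            self.tail = node.prev     # node was the tail
        # Insert at head; the previous head becomes the next node.
        node.prev, node.next = None, self.head
        self.head.prev = node
        self.head = node

    def victim(self):
        """The tail is the least recently used way, i.e. the eviction candidate."""
        return self.tail.way
```

A hardware version would keep the prev/next pointers in small per-set register arrays rather than heap objects; the model above only illustrates the pointer updates on each access.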


How to set single-clock design constraints in Post-CTS run?

In the previous post, we introduced how to manipulate objects in SDC. In this post, we will look at how to constrain a single-clock design in physical design, i.e., the Post-CTS run. Note that single-clock constraints are set somewhat differently in synthesis, i.e., the Pre-CTS run; we will cover that topic in another post. Interviewees should not mix up Pre-CTS and Post-CTS clock constraints.

First, we define the clock and its associated attributes, including the clock period, waveform, name, and source clock port. If the clock duty cycle is not 50% and both the negedge and posedge are used in the design, then defining the clock waveform is critical. This step is the same for the Pre-CTS and Post-CTS runs.
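As a sketch, such a clock definition is written with the SDC `create_clock` command. The port name `clk`, the 10 ns period, and the 60/40 duty cycle below are illustrative values, not taken from the post:

```tcl
# Define a 100 MHz clock on port "clk" (port name and period are illustrative).
# -waveform {rise_edge fall_edge}: here the clock rises at 0 ns and falls at
# 6 ns, i.e., a 60/40 duty cycle. Specifying the waveform explicitly matters
# when the duty cycle is not 50% and both clock edges are used in the design.
create_clock -name clk -period 10.0 -waveform {0 6} [get_ports clk]
```

With a 50% duty cycle, the `-waveform` option can be omitted and the tool assumes a rise at 0 and a fall at half the period.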
