In the previous post, we briefly discussed some variants of cache coherence protocols. We also pointed out that, in the widely adopted NUMA architecture, a snoop-based cache coherence scheme is typically combined with home directories associated with each distributed home memory. Although this sounds intuitive, there are a couple of home directory implementation issues we need to consider, including:
Status tracked by home directory
Non-atomic operations
Status Tracked by Home Directory
The home directory can receive a read request, a write request, or an invalidate from a local cache. For a read or write request, it needs to decide whether the data resides in the home memory or in a remote cache; for a write or invalidate request, it also needs to invalidate the copies in remote caches. This means the home directory has to keep track of the cache block state, as well as which caches hold a copy of the block.
The most straightforward way is for the home directory to track both which caches hold the block and the state of the block in each cache. Upon receiving a read/write request or an invalidate from a local cache, the home directory can selectively send "Fetch" and "Invalidate" messages only to the remote caches that hold the copy in the relevant shared or modified state. This saves fabric bandwidth, but it is expensive to implement.
A less expensive way is for the home directory to track only which caches hold the block. Upon receiving a read/write request or an invalidate, the home directory sends "Fetch" and "Invalidate" messages to all remote caches that hold a copy, regardless of its state. This consumes somewhat more fabric bandwidth, but it is more affordable to implement.
The cheapest way is for the home directory to keep only a single bit tracking whether any remote cache holds the block in shared or modified state. This significantly increases fabric traffic, since messages must be broadcast to all remote caches whenever the home directory receives a request from a local cache.
The three approaches discussed above present a design trade-off among implementation complexity, power consumption, and fabric bandwidth consumption.
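To make the trade-off concrete, here is a minimal sketch of the three directory-entry representations, assuming a hypothetical system with four remote caches. The class and method names are illustrative only, not from any real implementation; the point is how precisely each scheme can pick invalidation targets.

```python
from enum import Enum

NUM_CACHES = 4  # hypothetical remote cache count

class State(Enum):
    SHARED = 1
    MODIFIED = 2

# Approach 1: full map with per-cache state.
# Invalidates go only to caches holding a copy; a fetch goes only
# to the single cache holding the block in MODIFIED state.
class FullMapDirectoryEntry:
    def __init__(self):
        self.copies = {}  # cache_id -> State

    def invalidate_targets(self):
        return sorted(self.copies)

    def fetch_target(self):
        return next((c for c, s in self.copies.items()
                     if s == State.MODIFIED), None)

# Approach 2: presence bits only.
# Targets are still exact, but the directory no longer knows the
# state of each copy; the remote cache must check that itself.
class PresenceBitsDirectoryEntry:
    def __init__(self):
        self.present = [False] * NUM_CACHES

    def invalidate_targets(self):
        return [c for c, p in enumerate(self.present) if p]

# Approach 3: a single bit.
# Any request must be broadcast to every remote cache.
class SingleBitDirectoryEntry:
    def __init__(self):
        self.cached_remotely = False

    def invalidate_targets(self):
        return list(range(NUM_CACHES)) if self.cached_remotely else []
```

Per cache block, approach 1 stores roughly N presence bits plus per-cache state bits, approach 2 stores N bits, and approach 3 stores one bit; the message counts move in the opposite direction.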
Non-Atomic Operations
The cache coherence protocols we discussed earlier omit a few complications that can make the implementation trickier. One assumption we made is that all operations are atomic, i.e., an operation finishes instantaneously. For example, if the home directory is serving a write request to address A, this write completes in no time. In the real world, however, this assumption does not hold. Two caches in the system may want to write the same cache block at the same time, so the home directory may see another write request to address A while it is still serving an earlier write to the same address. A simple solution is to serve only the earlier request to a given cache block and block the later one until the earlier request completes.
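The "block the later request" policy can be sketched as a per-block busy flag plus a pending queue at the home directory. This is a simplified model under assumed semantics, not any specific protocol's mechanism:

```python
from collections import defaultdict, deque

class HomeDirectory:
    def __init__(self):
        self.busy = set()                 # block addresses with a write in flight
        self.pending = defaultdict(deque)  # address -> queue of blocked requesters

    def write_request(self, addr, requester):
        if addr in self.busy:
            # An earlier write to this block is still being served;
            # queue the later request instead of serving it concurrently.
            self.pending[addr].append(requester)
            return "blocked"
        self.busy.add(addr)
        return "serving"

    def write_complete(self, addr):
        # The earlier write finished; wake the next queued requester, if any.
        if self.pending[addr]:
            return ("serving", self.pending[addr].popleft())
        self.busy.discard(addr)
        return ("idle", None)
```

The key property is that writes to the same block are serialized by the home directory, while writes to different blocks proceed independently.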
Let’s take one step further and look at the cache side: if a local cache wants to modify a shared cache block, it needs to send an invalidate to the home directory and wait for an acknowledgement before actually updating the block. Since operations are no longer atomic, multiple caches may try to modify the same cache block at the same time, so the local cache may receive an invalidate for that block while it is still waiting for the home directory’s response. Should the local cache keep waiting for the response, or serve the incoming invalidate first? The local cache needs a way to decide whether its own invalidate reached the home directory first, or whether its invalidate is currently blocked because the home directory is serving a remote cache. In the former case, the local cache should continue to wait for the home directory’s response; in the latter case, it has to invalidate its local copy first.
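One way to sketch this race is a small cache-side state machine: while the block is in a pending-upgrade state, an incoming remote invalidate means this cache lost the race, so it drops its copy and must retry the write as a full miss. This is a simplified model; the state names are made up, and real protocols typically signal the losing case with an explicit negative acknowledgement or rely on ordered responses.

```python
class CacheBlock:
    def __init__(self):
        self.state = "SHARED"

    def start_upgrade(self):
        # Send an invalidate to the home directory and wait for its ack.
        assert self.state == "SHARED"
        self.state = "PENDING_UPGRADE"

    def on_home_ack(self):
        # Our invalidate reached the home directory first: safe to write.
        if self.state == "PENDING_UPGRADE":
            self.state = "MODIFIED"

    def on_remote_invalidate(self):
        if self.state == "PENDING_UPGRADE":
            # A remote cache's invalidate was serialized ahead of ours:
            # drop the local copy; our upgrade must be retried as a write miss.
            self.state = "INVALID_RETRY_AS_MISS"
        else:
            self.state = "INVALID"
```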
Another interesting scenario: while a local cache is writing back the data of address A to the home directory, it receives an invalidate for the same cache block. This means that while the eviction of the block is in progress, the home directory is already serving a remote cache. This case is straightforward, since both the data write-back and the invalidate have the same outcome: the local copy ends up invalid, and the home directory receives the up-to-date data.
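Because the in-flight write-back already carries the latest data to the home directory, the cache can simply finish the eviction and treat its copy as invalid, and the home directory can use the written-back data to serve the remote cache. A minimal sketch of the cache side, with hypothetical state names:

```python
class EvictingCacheBlock:
    def __init__(self, data):
        # A write-back of this block's data to the home directory is in flight.
        self.state = "WRITEBACK_IN_PROGRESS"
        self.data = data

    def on_invalidate_during_writeback(self):
        # The write-back and the invalidate converge to the same outcome:
        # this copy becomes invalid, and the data already heading to the
        # home directory is up to date, so it can satisfy the remote fetch.
        self.state = "INVALID"
        return self.data
```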
The topics covered in this post relate to common computer architecture interview questions. We highly recommend that interviewees read Chapter 5 of J. Hennessy and D. Patterson’s book, Computer Architecture: A Quantitative Approach, for more details.