We covered different types of data dependencies through memory in previous post. To handle data memory dependencies in CPU out-of-order scheduling, there are usually two steps involved:

  1. Calculate effective address for memory access
  2. Check addresses of all active load / store and ensure conflicting load / store cannot execute out of order

J. P. Shen and M. H. Lipasti have excellent elaboration on this topic in their book, Modern Processor Design. In this post, we will briefly discuss their idea about load / store processing.

A Load / Store Processing Model

J. P. Shen and M. H. Lipasti proposed a load / store processing model in their book Modern Processor Design, which is shown in the diagram below.

load store processing

Load and store instructions are issued to reservation stations first, and then dispatched to load unit or store unit.

In the store unit, store instructions first go through effective address calculation and address translation, and then reside in “Finished” store buffer. Note store instructions must not modify data memory in “Finished” store buffer. Store instructions in “Completed” store buffer are eventually committed to data memory.

Similar to store, load instructions first go through address generation and translation, and eventually read data cache to get the data from memory.

One assumption we can make is, store instructions are required to finish in program order, thus WAW data dependencies are implicitly enforced. Essentially, handling data dependencies can be simplified as handling load / store dependencies.

Handling Data Dependencies with In-order Load / Store Dispatch

The most straightforward solution is to issue load / store instructions to common reservation stations in program order, and to dispatch them in FIFO order from reservation station. Load can only be dispatched when store buffer is empty. However, load instructions begin chains of dependencies and the latency of loads is long and unpredictable. Executing loads as early as possible is critical.

An improved scheme is to support load bypassing with in-order load / store dispatch, and load instruction stalls if there’s address matching in store buffer, see diagram below. Therefore, unrelated loads can make forward progress.

load bypass

To further speed up load, we can even support load forwarding with in-order load / store dispatch, see diagram below. Load stalls if there exists address matching but no data stores; data is forwarded to load if there exits address matching and data available stores. Since load receives data from store directly, thus load instructions can be executed early, and data cache accesses are avoided.

load forwarding

Handling Data Dependencies with Out-of-order Load / Store Dispatch

If we relax the restriction of FIFO dispatch from reservation station, load can be issued before preceding stores. Since there is no way to check for address matching, potential RAW dependency exist. A scheme called “speculative loads” can help to resolve this issue, see diagram below. If a store dispatched from reservation station finds a matching load in load buffer, the aliased load and all trailing instructions flush.

speculative loads

The relaxation introduces possible WAR data dependency as well. Load address may match address of succeeding store, thus incorrect data forwarding is triggered. A simple solution is to stall the loads on “finished” stores with matching address, and forward data to loads on “completed” stores with matching address only.

Conclusion

Data dependencies in CPU out-of-order scheduling are barely mentioned in computer architecture classes, but they are hot topics in hardware interview questions. It is worthwhile remembering the basic concepts and solutions, in order to tackle the interview. For more details, please refer to J.P. Shen and M.H. Lipasti’s book, Modern Processor Design.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.