A fundamental restriction in MIPS 5-stage pipeline is, if one instruction stalls, all instructions following it stall as well, even if they have no dependencies at all. If the stalled instruction take a long time to finish, the throughput of the CPU pipeline will suffer.
In modern CPU technology, out-of-order scheduling is an important technique to address the restriction above. Tomasulo’s algorithm is a typical solution to CPU out-of-order scheduling.
Tomasulo’s Algorithm Overview
A diagram of Tomasulo’s algorithm from J. Hennessy and D. Patterson is shown below.
All instructions are dispatched from instruction queue to corresponding execution unit in order. There exists multiple execution units in the pipeline, allowing multiple instructions to be executed in parallel. In the diagram above, execution units include memory unit, FP adders and FP multipliers.
Note each execution unit is coupled with a couple of reservation stages where instructions are buffered and waiting to be executed. The fields in each reservation stage include:
Qp – operation to perform on operands
Qj, Qk – reservation stations that will produce the operand. A value of 0 means the source operand is already available in Vj or Vk field, or is unnecessary
Vj, Vk – values of the source operands. Only one of the V field or the Q field is valid for each operand
A – information for the memory address calculation for a load or store in load/store buffers. Initially it holds the immediate field of the instruction; after address calculation, it holds the effective address
Busy – this reservation station and its accompanying execution unit is occupied
Each register in register file will have a tag Qi, indicating the reservation station ID producing the result. If Qi is 0, no currently active instruction is computing a result destined for this register, and register holds a valid value.
There is a common data bus or CBD, used for broadcasting results to wherever it is needed.
Instruction Execution in Tomasulo’s Algorithm
In issuing phase, even if the execution unit is busy, as long as there is available reservation station, instruction can be issued. If the source operands are available, they are copied to Vj and Vk field of the reservation station, otherwise the reservation station IDs producing the result are copied to Qj and Qk field. The Qi tag of the destination register is updated with the reservation stage ID.
In execution phase, if all operands and the execution unit are available, the operation is performed.
In result writing phase, when the result is available and the CDB is free, the result is broadcasted to register file and other reservation stations.
J. Hennessy and D. Patterson have summarized a detailed algorithmic description of Tomasulo’s algorithm:
Resolving Data Hazards in Tomasulo’s Algorithm
When instructions are dispatched to reservation stations, any available operands are copied and buffered. Operand buffering during instruction issuing resolves WAR hazards.
Results are only written to register if its Qi tag matches the reservation station ID. This result buffering and associative access for register file write resolve WAW hazards.
Via CDB, results are broadcasted to wherever they are needed. This forwarding mechanism resolves the RAW hazards.
In summary, Tomasulo’s algorithm resolve all possible data hazards, which is pretty amazing.
How Tomasulo’s algorithm resolves data hazards is a common interview question. Interviewees should be able to walk through instruction execution flow and explain how data hazards are resolved. We highly recommend readers to refer to J. Hennessy and D. Patterson’s book, Computer Architecture for more details.