13. Overview of PPro, PII and PIII pipeline
The architecture of the PPro, PII and PIII microprocessors is well explained and illustrated in various manuals and tutorials from Intel. It is recommended that you study this material in order to get an understanding of how these microprocessors work. I will describe the structure briefly here with particular focus on those elements that are important for optimizing code.
Instruction codes are fetched from the code cache in aligned 16-byte chunks into a double buffer that can hold two 16-byte chunks. The code is passed on from the double buffer to the decoders in blocks which I will call ifetch blocks (instruction fetch blocks). The ifetch blocks are usually 16 bytes long, but not aligned. The purpose of the double-buffer is to make it possible to decode an instruction that crosses a 16-byte boundary (i.e. an address divisible by 16).
The ifetch block goes to the instruction length decoder, which determines where each instruction begins and ends, and next to the instruction decoders. There are three decoders so that you can decode up to three instructions in each clock cycle. A group of up to three instructions that are decoded in the same clock cycle is called a decode group.
The decoders translate instructions into micro-operations, abbreviated uops. Simple instructions generate only one uop, while more complex instructions may generate several uops. For example, the instruction ADD EAX,[MEM] is decoded into two uops: one for reading the source operand from memory, and one for doing the addition. The purpose of splitting instructions into uops is to make the handling later in the system more effective.
The three decoders are called D0, D1, and D2. D0 can handle all instructions, while D1 and D2 can handle only simple instructions that generate one uop.
The uops from the decoders go via a short queue to the register allocation table (RAT). The execution of uops work on temporary registers which are later written to the permanent registers EAX, EBX, etc. The purpose of the RAT is to tell the uops which temporary registers to use, and to allow register renaming (see later).
After the RAT, the uops to go the reorder buffer (ROB). The purpose of the ROB is to enable out-of-order execution. A uop stays in the reservation station until the operands it needs are available. If an operand for one uop is delayed because a previous uop that generates the operand is not finished yet, then the ROB may find another uop later in the queue that can be executed in the meantime in order to save time.
The uops that are ready for execution are sent to the execution units, which are clustered around five ports: Port 0 and 1 can handle arithmetic operations, jumps, etc. Port 2 takes care of all reads from memory, port 3 calculates addresses for memory writes, and port 4 does memory writes.
When an instruction has been executed then it is marked in the ROB as ready to retire. It then goes to the retirement station. Here the contents of the temporary registers used by the uops are written to the permanent registers. While uops can be executed out of order, they must be retired in order.
In the following chapters, I will describe in detail how to optimize the throughput of each step in the pipeline.