17. Out of order execution (PPro, PII and PIII)
The reorder buffer (ROB) can hold 40 uops. Each uop waits in the ROB until all its operands are ready and there is a vacant execution unit for it. This makes out-of-order execution possible. If one part of the code is delayed because of a cache miss then it won't delay later parts of the code if they are independent of the delayed operations.
Writes to memory cannot execute out of order relative to other writes. There are four write buffers, so if you expect many cache misses on writes, or you are writing to uncached memory, it is recommended that you schedule four writes at a time and make sure the processor has something else to do before you give it the next four writes. Memory reads and other instructions can execute out of order, except IN, OUT and serializing instructions.
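The following is a minimal sketch of the four-writes-at-a-time pattern. The register assignments, the assumption that ESI points to source data and EDI to a destination in uncached memory, and the doubling used as filler work are illustrative only and not taken from the text above.

```asm
; Assumptions (illustrative): ESI -> source data, EDI -> destination in
; uncached memory, EAX/EBX/ECX/EDX already hold the first four values.
    MOV  [EDI],EAX        ; batch 1: four writes, one per write buffer
    MOV  [EDI+4],EBX
    MOV  [EDI+8],ECX
    MOV  [EDI+12],EDX
    MOV  EAX,[ESI+16]     ; independent work while the writes drain:
    MOV  EBX,[ESI+20]     ; load and process the next four values
    MOV  ECX,[ESI+24]
    MOV  EDX,[ESI+28]
    ADD  EAX,EAX          ; stand-in for any computation that
    ADD  EBX,EBX          ; does not write to memory
    ADD  ECX,ECX
    ADD  EDX,EDX
    MOV  [EDI+16],EAX     ; batch 2: the next four writes
    MOV  [EDI+20],EBX
    MOV  [EDI+24],ECX
    MOV  [EDI+28],EDX
```

How much work is needed between the batches depends on how long the writes take to complete, which the text does not quantify, so the amount shown here is a guess to be tuned by measurement.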
If your code writes to a memory address and soon after reads from the same address, then the read may by mistake be executed before the write because the ROB doesn't know the memory addresses at the time of reordering. This error is detected when the write address is calculated, and then the read operation (which was executed speculatively) has to be re-done. The penalty for this is approximately 3 clocks. The only way to avoid this penalty is to make sure the execution unit has other things to do between a write and a subsequent read from the same memory address.
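As a hedged illustration of the point above, the sketch below places a few independent instructions between a write and a later read of the same address. The registers and the particular filler instructions are assumptions chosen for the example; the only requirement is that the filler does not depend on the value being written.

```asm
; Assumption (illustrative): ESI holds the address that is written
; and then read back.
    MOV  [ESI],EAX        ; write to [ESI]
    ADD  EBX,ECX          ; independent work: gives the write address time
    SUB  EDX,EBP          ; to be calculated before the read is attempted
    XOR  EDI,EBX
    MOV  ECX,[ESI]        ; read from the same address
```

The number of intervening instructions needed is not stated in the text; with a penalty of approximately 3 clocks, a handful of independent uops is a reasonable starting point to test.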
There are several execution units clustered around five ports. Ports 0 and 1 are for arithmetic operations etc. Simple move, arithmetic and logic operations can go to either port 0 or 1, whichever is vacant first. Port 0 also handles multiplication, division, integer shifts and rotates, and floating point operations. Port 1 also handles jumps and some MMX and XMM operations. Port 2 handles all reads from memory and a few string and XMM operations, port 3 calculates addresses for memory writes, and port 4 executes all memory write operations. In chapter 29 you'll find a complete list of the uops generated by code instructions with an indication of which ports they go to. Note that all memory write operations require two uops, one for port 3 and one for port 4, while memory read operations use only one uop (port 2).
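To make the port assignments concrete, here is a short annotated sequence. The instructions and registers are chosen for illustration; the port annotations follow the rules in the paragraph above, and the table in chapter 29 remains the authority for exact uop counts.

```asm
; Assumptions (illustrative): ESI -> source DWORD, EDI -> destination DWORD.
    MOV  EAX,[ESI]        ; 1 uop:  port 2 (memory read)
    ADD  EAX,EBX          ; 1 uop:  port 0 or 1 (simple arithmetic)
    SHL  EAX,2            ; 1 uop:  port 0 (integer shift)
    MOV  [EDI],EAX        ; 2 uops: port 3 (write address) + port 4 (write)
```

This sequence totals five uops: one for port 2, one each for ports 3 and 4, and two shared between ports 0 and 1, of which the shift is tied to port 0.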
In most cases each port can receive one new uop per clock cycle. This means that you can execute up to 5 uops in the same clock cycle if they go to five different ports, but since there is a limit of 3 uops per clock earlier in the pipeline you will never execute more than 3 uops per clock on average.
You must make sure that no execution port receives more than one third of the uops if you want to maintain a throughput of 3 uops per clock. Use the table of uops in chapter 29 and count how many uops go to each port. If ports 0 and 1 are saturated while port 2 is free, then you can improve your code by replacing some MOV register,register or MOV register,immediate instructions with MOV register,memory in order to move some of the load from ports 0 and 1 to port 2.
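A minimal sketch of the replacement described above, assuming a constant stored in the data segment. The name ONE and its value are hypothetical.

```asm
.DATA
ONE  DD   1               ; hypothetical constant kept in memory

.CODE
; Before: the immediate move competes for port 0 or 1
;   MOV  EAX,1
; After: the same value is loaded through port 2 instead
    MOV  EAX,[ONE]        ; 1 uop: port 2 (memory read)
```

This only pays off when ports 0 and 1 really are the bottleneck and the constant is likely to be in the cache; a load that misses the cache would cost far more than the uop it saves.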
Most uops take only one clock cycle to execute, but multiplications, divisions, and many floating point operations take more:
Floating point addition and subtraction take 3 clocks, but the execution unit is fully pipelined so that it can receive a new FADD or FSUB in every clock cycle before the preceding ones are finished (provided, of course, that they are independent).
Integer multiplication takes 4 clocks, floating point multiplication 5, and MMX multiplication 3 clocks. Integer and MMX multiplications are pipelined so that the unit can receive a new instruction every clock cycle. Floating point multiplication is partially pipelined: the execution unit can receive a new FMUL instruction two clocks after the preceding one, so that the maximum throughput is one FMUL per two clock cycles. The holes between the FMULs cannot be filled by integer multiplications because they use the same circuitry. XMM additions and multiplications take 3 and 4 clocks respectively, and are fully pipelined. But since each logical XMM register is implemented as two physical 64-bit registers, you need two uops for a packed XMM operation, and the throughput will then be one arithmetic XMM instruction every two clock cycles. XMM add and multiply instructions can execute in parallel because they don't use the same execution port.
Integer and floating point division take up to 39 clocks and are not pipelined. This means that the execution unit cannot begin a new division until the previous division is finished. The same applies to square root and transcendental functions.
Jump instructions, calls, and returns are also not fully pipelined. You cannot execute a new jump in the first clock cycle after a preceding jump, so the maximum throughput for jumps, calls, and returns is one every two clocks.
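The difference between latency and pipelined throughput for floating point addition, described earlier in this chapter, can be sketched as follows. The register layout is an assumption made for the example, and the clock counts are rough estimates derived only from the 3-clock latency and one-per-clock throughput stated above, ignoring all other pipeline effects.

```asm
; Assumption (illustrative): ST(0) holds a value x, and ST(1), ST(2), ST(3)
; hold three separate accumulators.

; Dependent chain: each FADD needs the result of the previous one,
; so the three additions take roughly 3 + 3 + 3 = 9 clocks.
    FADD ST(1),ST(0)
    FADD ST(1),ST(0)
    FADD ST(1),ST(0)

; Independent additions: different destinations, same unmodified source.
; They can enter the pipeline in consecutive clocks, so the last result
; is ready after roughly 2 + 3 = 5 clocks.
    FADD ST(1),ST(0)
    FADD ST(2),ST(0)
    FADD ST(3),ST(0)
```

Splitting a long sum into several accumulators in this way is a common method of keeping a pipelined unit busy, at the cost of adding the accumulators together at the end.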
You should, of course, avoid instructions that generate many uops. The LOOP XX instruction, for example, should be replaced by DEC ECX / JNZ XX.
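A minimal sketch of the LOOP replacement, using a simple array sum as the loop body. The array pointer in ESI and the element count of 100 are hypothetical.

```asm
; Assumptions (illustrative): ESI -> array of 100 DWORDs.
    MOV  ECX,100          ; loop counter
    XOR  EAX,EAX          ; running sum
L1: ADD  EAX,[ESI+4*ECX-4] ; add elements from last to first
    DEC  ECX
    JNZ  L1               ; DEC / JNZ takes the place of LOOP L1
```

One difference to keep in mind is that DEC modifies the flags (all arithmetic flags except the carry flag), whereas LOOP leaves the flags unchanged, so the substitution is only safe when the loop body does not rely on flags surviving across iterations.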
If you have consecutive POP instructions then you may break them up to reduce the number of uops:
    POP ECX / POP EBX / POP EAX    ; can be changed to:
    MOV ECX,[ESP] / MOV EBX,[ESP+4] / MOV EAX,[ESP+8] / ADD ESP,12

The former code generates 6 uops, the latter generates only 4 and decodes faster. Doing the same with PUSH instructions is less advantageous because the split-up code is likely to generate register read stalls unless you have other instructions to put in between or the registers have been renamed recently. Doing it with CALL and RET instructions will interfere with prediction in the return stack buffer. Note also that the ADD ESP instruction can cause an AGI stall in earlier processors.