12. Prefixes (PPlain and PMMX)
An instruction with one or more prefixes may not be able to execute in the V-pipe (see chapter 10, sect. 7), and it may take more than one clock cycle to decode.
On the PPlain, the decoding delay is one clock cycle for each prefix except for the 0FH prefix of conditional near jumps.
The PMMX has no decoding delay for 0FH prefix. Segment and repeat prefixes take one clock extra to decode. Address and operand size prefixes take two clocks extra to decode. The PMMX can decode two instructions per clock cycle if the first instruction has a segment or repeat prefix or no prefix, and the second instruction has no prefix. Instructions with address or operand size prefixes can only decode alone on the PMMX. Instructions with more than one prefix take one clock extra for each prefix.
Address size prefixes can be avoided by using 32 bit mode. Segment prefixes can be avoided in 32 bit mode by using a flat memory model. Operand size prefixes can be avoided in 32 bit mode by using only 8 bit and 32 bit integers.
Where prefixes are unavoidable, the decoding delay may be masked if a preceding instruction takes more than one clock cycle to execute. The rule for the PPlain is that any instruction which takes N clock cycles to execute (not to decode) can 'overshadow' the decoding delay of N-1 prefixes in the next two (sometimes three) instructions or instruction pairs. In other words, each extra clock cycle that an instruction takes to execute can be used to decode one prefix in a later instruction. This shadowing effect even extends across a predicted branch. Any instruction which takes more than one clock cycle to execute, and any instruction which is delayed because of an AGI stall, cache miss, misalignment, or any other reason except decoding delay and branch misprediction, has a shadowing effect.
The PMMX has a similar shadowing effect, but the mechanism is different. Decoded instructions are stored in a transparent first-in-first-out (FIFO) buffer, which can hold up to four instructions. As long as there are instructions in the FIFO buffer you get no delay. When the buffer is empty then instructions are executed as soon as they are decoded. The buffer is filled when instructions are decoded faster than they are executed, i.e. when you have unpaired or multi-cycle instructions. The FIFO buffer is emptied when instructions execute faster than they are decoded, i.e. when you have decoding delays due to prefixes. The FIFO buffer is empty after a mispredicted branch. The FIFO buffer can receive two instructions per clock cycle provided that the second instruction is without prefixes and none of the instructions are longer than 7 bytes. The two execution pipelines (U and V) can each receive one instruction per clock cycle from the FIFO buffer.
CLD / REP MOVSD
The CLD instruction takes two clock cycles and can therefore overshadow the decoding delay of the REP prefix. The code would take one clock cycle more if the CLD instruction was placed far from the REP MOVSD.
CMP DWORD PTR [EBX],0 / MOV EAX,0 / SETNZ AL
The CMP instruction takes two clock cycles here because it is a read/modify instruction. The 0FH prefix of the SETNZ instruction is decoded during the second clock cycle of the CMP instruction, so that the decoding delay is hidden on the PPlain (The PMMX has no decoding delay for the 0FH).
Prefix penalties in PPro, PII and PIII are described in chapter 14.