19. Partial stalls (PPro, PII and PIII)
A partial register stall is a problem that occurs when you write to part of a 32 bit register and later read from the whole register or a bigger part of it. Example:
        MOV AL, BYTE PTR [MEM8]
        MOV EBX, EAX           ; partial register stall
This gives a delay of 5-6 clocks. The reason is that a temporary register has been assigned to AL (to make it independent of AH). The execution unit has to wait until the write to AL has retired before it is possible to combine the value from AL with the value of the rest of EAX. The stall can be avoided by changing the code to:
        MOVZX EBX, BYTE PTR [MEM8]
        AND EAX, 0FFFFFF00h
        OR EBX, EAX
Of course you can also avoid the partial stalls by putting in other instructions after the write to the partial register so that it has time to retire before you read from the full register.
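A minimal sketch of this alternative follows. The filler instructions, their number, and the use of ESI, EDI, ECX and EDX are assumptions chosen only for illustration; the only requirement is that they do useful work without reading EAX. Whether the stall disappears completely depends on how long the filler takes to execute:

```asm
        MOV AL, BYTE PTR [MEM8]  ; partial write to EAX
        MOV ECX, [ESI]           ; unrelated work that does not read EAX
        ADD ECX, EDX
        MOV [EDI], ECX
        ADD ESI, 4
        ADD EDI, 4
        MOV EBX, EAX             ; full read; no stall if the write to AL has retired by now
```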
You should be aware of partial stalls whenever you mix different data sizes (8, 16, and 32 bits):
        MOV BH, 0
        ADD BX, AX             ; stall
        INC EBX                ; stall
You don't get a stall when reading a partial register after writing to the full register, or a bigger part of it:
        MOV EAX, [MEM32]
        ADD BL, AL             ; no stall
        ADD BH, AH             ; no stall
        MOV CX, AX             ; no stall
        MOV DX, BX             ; stall: BX is read after the writes to BL and BH
The easiest way to avoid partial register stalls is to always use full registers and use MOVZX or MOVSX when reading from smaller memory operands. These instructions are fast on the PPro, PII and PIII, but slow on earlier processors. Therefore, a compromise is offered when you want your code to perform reasonably well on all processors. The replacement for MOVZX EAX,BYTE PTR [M8] looks like this:
        XOR EAX, EAX
        MOV AL, BYTE PTR [M8]
The PPro, PII and PIII processors make a special case out of this combination to avoid a partial register stall when later reading from EAX. The trick is that a register is tagged as empty when it is XOR'ed with itself. The processor remembers that the upper 24 bits of EAX are zero, so that a partial stall can be avoided. This mechanism works only on certain combinations:
        XOR EAX, EAX
        MOV AL, 3
        MOV EBX, EAX           ; no stall

        XOR AH, AH
        MOV AL, 3
        MOV BX, AX             ; no stall

        XOR EAX, EAX
        MOV AH, 3
        MOV EBX, EAX           ; stall

        SUB EBX, EBX
        MOV BL, DL
        MOV ECX, EBX           ; no stall

        MOV EBX, 0
        MOV BL, DL
        MOV ECX, EBX           ; stall

        MOV BL, DL
        XOR EBX, EBX           ; no stall
Setting a register to zero by subtracting it from itself works the same as the XOR, but setting it to zero with the MOV instruction doesn't prevent the stall.
You can place the XOR outside a loop:
        XOR EAX, EAX
        MOV ECX, 100
LL:     MOV AL, [ESI]
        MOV [EDI], EAX         ; no stall
        INC ESI
        ADD EDI, 4
        DEC ECX
        JNZ LL
The processor remembers that the upper 24 bits of EAX are zero as long as you don't get an interrupt, misprediction, or other serializing event.
You should remember to neutralize any partial register you have used before calling a subroutine that might push the full register:
        ADD BL, AL
        MOV [MEM8], BL
        XOR EBX, EBX           ; neutralize BL
        CALL _HighLevelFunction
Most high level language procedures push EBX at the start of the procedure, which would generate a partial register stall in the example above if you hadn't neutralized BL.
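To show where the full register is read, here is a sketch of what the start and end of such a procedure might look like. The exact prologue is an assumption; real compilers differ in which registers they save and in what order:

```asm
_HighLevelFunction:
        PUSH EBP
        MOV  EBP, ESP
        PUSH EBX               ; reads all 32 bits of EBX; this is where a pending write to BL would stall
        ; ... procedure body (hypothetical) ...
        POP  EBX
        POP  EBP
        RET
```

The PUSH EBX comes only a few instructions after the CALL, so a write to BL made just before the call is unlikely to have retired by then.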
Setting a register to zero with the XOR method doesn't break its dependency on earlier instructions:
        DIV EBX
        MOV [MEM], EAX
        MOV EAX, 0             ; break dependency
        XOR EAX, EAX           ; prevent partial register stall
        MOV AL, CL
        ADD EBX, EAX
Setting EAX to zero twice here seems redundant, but without the MOV EAX,0 the last instructions would have to wait for the slow DIV to finish, and without XOR EAX,EAX you would have a partial register stall.
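If the code only has to run well on the PPro, PII and PIII, a possible alternative is to use MOVZX, which writes the whole of EAX in one instruction. This is a sketch based on the recommendation above to use full registers with MOVZX, not a replacement that is suitable for earlier processors, where MOVZX is slow:

```asm
        DIV EBX
        MOV [MEM], EAX
        MOVZX EAX, CL          ; writes all 32 bits of EAX, so no partial register is left pending
        ADD EBX, EAX
```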
The FNSTSW AX instruction is special: in 32 bit mode it behaves as if writing to the entire EAX. In fact, it does something like this in 32 bit mode:
AND EAX,0FFFF0000h / FNSTSW TEMP / OR EAX,TEMP
Hence, you don't get a partial register stall when reading EAX after this instruction in 32 bit mode:
FNSTSW AX / MOV EBX,EAX ; stall only if 16 bit mode
MOV AX,0 / FNSTSW AX ; stall only if 32 bit mode
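As an illustration, a floating point comparison in 32 bit mode can test the condition flags by reading the full EAX after FNSTSW AX. This is only a sketch: X and Y are hypothetical 64 bit floating point variables and IsEqual is a hypothetical label. The mask 4500h selects the condition bits C3, C2 and C0 (bits 14, 10 and 8 of the status word):

```asm
        FLD   QWORD PTR [X]    ; X and Y are hypothetical variables
        FCOMP QWORD PTR [Y]    ; compare X with Y and pop
        FNSTSW AX              ; behaves as a write to all of EAX in 32 bit mode
        AND   EAX, 4500h       ; full read of EAX; keeps only C3, C2, C0
        CMP   EAX, 4000h       ; C3 = 1, C2 = C0 = 0 means X = Y
        JE    IsEqual
```

The AND also clears the upper 16 bits of EAX, which FNSTSW AX leaves unchanged.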