All data in RAM should be aligned to addresses divisible by 2, 4, 8, or 16 according to this scheme:
|operand size||PPlain and PMMX||PPro, PII and PIII|
On PPlain and PMMX, misaligned data takes at least 3 extra clock cycles to access if a 4 byte boundary is crossed. The penalty is higher when a cache line boundary is crossed.
On PPro, PII and PIII, misaligned data costs you 6-12 extra clock cycles when a cache line boundary is crossed. Misaligned operands smaller than 16 bytes that do not cross a 32 byte boundary give no penalty.
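The boundary rules above come down to simple address arithmetic. The following C sketch is only an illustration, not part of the original text: the helper names `is_aligned` and `crosses_boundary` are hypothetical, and the code merely tests whether the first and last byte of an operand fall in different 4 byte or 32 byte blocks. It says nothing about the actual clock counts, and the 4 byte test is meant for operands that are misaligned in the sense used above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if addr is divisible by align (align must be a power of two). */
bool is_aligned(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr & (align - 1)) == 0;
}

/* True if an operand of `size` bytes starting at `addr` has its first and
   last byte in different blocks of `boundary` bytes (a power of two). */
bool crosses_boundary(uintptr_t addr, size_t size, uintptr_t boundary)
{
    assert(size > 0);
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
    return (addr / boundary) != ((addr + size - 1) / boundary);
}
```

For example, a dword at address 2 occupies bytes 2 to 5, so `crosses_boundary(2, 4, 4)` is true, while a misaligned qword at address 17 stays inside one 32 byte block, so `crosses_boundary(17, 8, 32)` is false.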
Aligning data by 8 or 16 on a dword size stack may be a problem, because the stack pointer is only guaranteed to be divisible by 4. A common method is to set up an aligned frame pointer. A function with aligned local data may look like this:
_FuncWithAlign PROC NEAR
        PUSH    EBP                          ; prolog code
        MOV     EBP, ESP
        AND     EBP, -8                      ; align frame pointer by 8
        FLD     DWORD PTR [ESP+8]            ; function parameter
        SUB     ESP, LocalSpace + 4          ; allocate local space
        FSTP    QWORD PTR [EBP-LocalSpace]   ; store something in aligned space
        ...
        ADD     ESP, LocalSpace + 4          ; epilog code. restore ESP
        POP     EBP                          ; (AGI stall on PPlain/PMMX)
        RET
_FuncWithAlign ENDP
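The reason for allocating LocalSpace + 4 rather than LocalSpace can be checked with a small model of the prolog arithmetic. The C sketch below is an illustration under stated assumptions, not code from the original: the function names `align_down` and `aligned_locals_base` are hypothetical, ESP is assumed to be dword aligned after PUSH EBP, and LocalSpace is assumed to be a multiple of 8. Since AND EBP, -8 moves the frame pointer down by either 0 or 4 bytes, the 4 extra bytes guarantee that the area from EBP-LocalSpace up to EBP lies inside the allocated block.

```c
#include <assert.h>
#include <stdint.h>

/* Round addr down to a multiple of align (a power of two).
   With align = 8 this applies the same mask as AND EBP, -8. */
uint32_t align_down(uint32_t addr, uint32_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return addr & ~(align - 1);
}

/* Model of the prolog. esp_after_push is the value of ESP right after
   PUSH EBP (assumed divisible by 4); local_space is assumed to be a
   multiple of 8. Returns the address EBP - LocalSpace. */
uint32_t aligned_locals_base(uint32_t esp_after_push, uint32_t local_space)
{
    assert(esp_after_push % 4 == 0 && local_space % 8 == 0);
    uint32_t ebp     = align_down(esp_after_push, 8);       /* AND EBP, -8 */
    uint32_t new_esp = esp_after_push - (local_space + 4);  /* SUB ESP, LocalSpace + 4 */
    uint32_t base    = ebp - local_space;                   /* [EBP-LocalSpace] */
    assert(base >= new_esp);        /* area starts inside the allocated block */
    assert(ebp <= esp_after_push);  /* and ends at or below the saved EBP */
    return base;
}
```

With a hypothetical ESP of 0x1000 or 0x1004 after the push and LocalSpace = 16, both cases give an aligned base of 0x0FF0; in the second case this coincides with the new ESP, which is where the extra 4 bytes are used up.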
While aligning data is always important, aligning code is not necessary on the PPlain and PMMX. Principles for aligning code on PPro, PII and PIII are explained in chapter 15.