» 首页 > 程序资料 > MMX 汇编优化 > MMX 优化: How to optimize for the Pentium family of microprocessors

20. Dependency chains (PPro, PII and PIII)

日期: 2000-04-01 14:00 | 联系我
关注我: Telegram, Twitter

20. Dependency chains (PPro, PII and PIII)

A series of instructions where each instruction depends on the result of the preceding one is called a dependency chain. Long dependency chains should be avoided, if possible, because they prevent out-of-order and parallel execution.

Example:

MOV EAX, [MEM1] ADD EAX, [MEM2] ADD EAX, [MEM3] ADD EAX, [MEM4] MOV [MEM5], EAX

In this eaxmple, the ADD instructions generate 2 uops each, one for reading from memory (port 2), and one for adding (port 0 or 1). The read uops can execute out or order, while the add uops must wait for the previous uops to finish. This dependency chain does not take very long to execute, because each addition adds only 1 clock to the execution time. But if you have slow instructions like multiplications, or even worse: divisions, then you should definitely do something to break the dependency chain. The way to do this is to use multiple accumulators:

MOV EAX, [MEM1] ; start first chain MOV EBX, [MEM2] ; start other chain in different accumulator IMUL EAX, [MEM3] IMUL EBX, [MEM4] IMUL EAX, EBX ; join chains in the end MOV [MEM5], EAX

Here, the second IMUL instruction can start before the first one is finished. Since the IMUL instruction has a delay of 4 clocks and is fully pipelined, you may have up to 4 accumulators.

Division is not pipelined so you cannot do the same with chained divisions, but you can of course multiply all the divisors and do only one division in the end.

Floating point instructions have a longer delay than integer instructions, so you should definitely break up long dependency chains with floating point instructions:

FLD [MEM1] ; start first chain FLD [MEM2] ; start second chain in different accumulator FADD [MEM3] FXCH FADD [MEM4] FXCH FADD [MEM5] FADD ; join chains in the end FSTP [MEM6]

You need a lot of FXCH instructions for this, but don't worry: they are cheap. FXCH instructions are resolved in the RAT by register renaming so they don't put any load on the execution ports. An FXCH does count as 1 uop in the RAT, ROB, and retirement station, though.

If the dependency chain is long you may need three accumulators:

FLD [MEM1] ; start first chain FLD [MEM2] ; start second chain FLD [MEM3] ; start third chain FADD [MEM4] ; third chain FXCH ST(1) FADD [MEM5] ; second chain FXCH ST(2) FADD [MEM6] ; first chain FXCH ST(1) FADD [MEM7] ; third chain FXCH ST(2) FADD [MEM8] ; second chain FXCH ST(1) FADD ; join first and third chain FADD ; join with second chain FSTP [MEM9]

Avoid storing intermediate data in memory and read them immediately afterwards:

MOV [TEMP], EAX MOV EBX, [TEMP]

There is a penalty for attempting to read from a memory address before a previous write to that address is finished. In the example above, change the last instruction to MOV EBX,EAX or put some other instructions in between.

There is one situation where you cannot avoid storing intermediate data in memory, and that is when transferring data from an integer register to a floating point register, or vice versa. For example:

MOV EAX, [MEM1] ADD EAX, [MEM2] MOV [TEMP], EAX FILD [TEMP]

If you don't have anything to put in between the write to TEMP and the read from TEMP, then you may consider using a floating point register instead of EAX:

FILD [MEM1] FIADD [MEM2]

Consecutive jumps, calls, or returns may also be considered dependency chains. The throughput for these instructions is one jump per two clock cycles. It is therefore recommended that you give the microprocessor something else to do between the jumps.

前一篇：15. Instruction fetch (PPro, PII and PIII)
下一篇：2. Literature

标签: MMX 优化