» 首页 > 程序资料 > MMX 汇编优化 > MMX 优化: How to optimize for the Pentium family of microprocessors

21. Searching for bottlenecks (PPro, PII and PIII)

日期: 2000-04-01 14:00 | 联系我
关注我: Telegram, Twitter

21. Searching for bottlenecks (PPro, PII and PIII)

When optimizing code for these processors, it is important to analyze where the bottlenecks are. Spending time on optimizing away one bottleneck doesn't make sense if there is another bottleneck which is narrower.

If you expect code cache misses then you should restructure your code to keep the most used parts of code together.

If you expect many data cache misses then forget about everything else and concentrate on how to restructure your data to reduce the number of cache misses (chapter 7), and avoid long dependency chains after a data read cache miss (chapter 20).

If you have many divisions then try to reduce them (chapter 27.2) and make sure the processor has something else to do during the divisions.

Dependency chains tend to hamper out-of-order execution (chapter 20). Try to break long dependency chains, especially if they contain slow instructions such as multiplication, division, and floating point instructions.

If you have many jumps, calls, or returns, and especially if the jumps are poorly predictable, then try if some of them can be avoided. Replace conditional jumps with conditional moves if possible, and replace small procedures with macros (chapter 22.3).

If you are mixing different data sizes (8, 16, and 32 bit integers) then look out for partial stalls. If you use PUSHF or LAHF instructions then look out for partial flags stalls. Avoid testing flags after shifts or rotates by more than 1 (chapter 19).

If you aim at a throughput of 3 uops per clock cycle then be aware of possible delays in instruction fetch and decoding (chapter and 14 and 15), especially in small loops.

The limit of two permanent register reads per clock cycle may reduce your throughput to less than 3 uops per clock cycle (chapter 16.2). This is likely to happen if you often read registers more than 4 clock cycles after they last were modified. This may, for example, happen if you often use pointers for addressing your data but seldom modify the pointers.

A throughput of 3 uops per clock requires that no execution port gets more than one third of the uops (chapter 17).

The retirement station can handle 3 uops per clock, but may be slightly less effective for taken jumps (chapter 18).

前一篇：16.1 Eliminating dependencies
下一篇：6. Alignmen

标签: MMX 优化