MMX optimization: How to optimize for the Pentium family of microprocessors

21. Searching for bottlenecks (PPro, PII and PIII)


http://www.XiaoHui.com Date: 2000-04-01 13:00

When optimizing code for these processors, it is important to analyze where the bottlenecks are. Spending time on optimizing away one bottleneck doesn't make sense if there is another bottleneck which is narrower.

If you expect code cache misses then you should restructure your code to keep the most used parts of code together.

If you expect many data cache misses then forget about everything else and concentrate on how to restructure your data to reduce the number of cache misses (chapter 7), and avoid long dependency chains after a data read cache miss (chapter 20).
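As a rough illustration of restructuring data to reduce cache misses, the sketch below sums one field from many records in two layouts. The record type, its field names, and both function names are hypothetical and not taken from the manual. It assumes the 32-byte cache lines of these processors, so each 64-byte record spans two lines while the loop uses only 8 bytes of it.

```c
#include <stddef.h>

/* Hypothetical record: the hot loop only needs `weight`. */
struct particle_aos {
    double position[3];
    double velocity[3];
    double weight;
    double reserved;     /* brings the record to 64 bytes */
};

/* Array-of-structures: every iteration touches a different cache line
   in order to read 8 useful bytes. */
double total_weight_aos(const struct particle_aos *p, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += p[i].weight;
    return sum;
}

/* Structure-of-arrays: the weights are stored contiguously, so each
   32-byte line that is fetched supplies four consecutive values. */
double total_weight_soa(const double *weight, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += weight[i];
    return sum;
}
```

Both functions return the same sum. Only the number of cache lines touched per useful byte differs, and whether the change pays off depends on how the other fields are accessed elsewhere in the program.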

If you have many divisions then try to reduce them (chapter 27.2) and make sure the processor has something else to do during the divisions.
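One common way to reduce divisions, sketched here under the assumption that several values are divided by the same divisor, is to compute the reciprocal once and multiply by it. The function names are hypothetical. Note that multiplying by a reciprocal can round differently from a true division, so this is only appropriate where the last bit of precision does not matter.

```c
#include <stddef.h>

/* One division per element. */
void scale_by_division(float *x, size_t n, float d)
{
    for (size_t i = 0; i < n; i++)
        x[i] /= d;
}

/* One division in total, followed by n multiplications.
   The results may differ from true division in the last bit. */
void scale_by_reciprocal(float *x, size_t n, float d)
{
    float r = 1.0f / d;
    for (size_t i = 0; i < n; i++)
        x[i] *= r;
}
```

Where a division cannot be removed, independent work placed after it in the code gives the out-of-order core something to execute while the division is in progress.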

Dependency chains tend to hamper out-of-order execution (chapter 20). Try to break long dependency chains, especially if they contain slow instructions such as multiplication, division, and floating point instructions.
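A minimal sketch of breaking a dependency chain: summing an array with two independent accumulators, so that consecutive floating point additions do not have to wait for each other. The function names are hypothetical, and a compiler may or may not keep the two chains separate in the generated code. Because floating point addition is not associative, the two versions can differ in rounding.

```c
#include <stddef.h>

/* Single chain: every addition depends on the previous one. */
double sum_single_chain(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Two chains: s0 and s1 are independent, so their additions can overlap. */
double sum_two_chains(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    if (i < n)           /* odd element left over */
        s0 += a[i];
    return s0 + s1;
}
```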

If you have many jumps, calls, or returns, and especially if the jumps are poorly predictable, then check whether some of them can be avoided. Replace conditional jumps with conditional moves if possible, and replace small procedures with macros (chapter 22.3).
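The idea of replacing a conditional jump with a data-dependent selection can be sketched in C as follows. The function names are hypothetical. The branch-free version uses a mask instead of a jump. Whether a compiler turns either form into a CMOVcc instruction depends on the compiler and its settings, so check the generated code rather than assuming it.

```c
#include <stdint.h>

/* Straightforward version: typically compiled to a compare and a
   conditional jump, which costs a misprediction penalty when the
   outcome is hard to predict. */
int32_t max_branch(int32_t a, int32_t b)
{
    if (a > b)
        return a;
    return b;
}

/* Branch-free version: mask is all ones when a > b, otherwise zero,
   so exactly one of the two operands passes through. */
int32_t max_branchless(int32_t a, int32_t b)
{
    int32_t mask = -(int32_t)(a > b);
    return (a & mask) | (b & ~mask);
}
```

The branch-free form always executes the same instructions regardless of the data, which is an advantage only when the jump it replaces is poorly predictable.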

If you are mixing different data sizes (8, 16, and 32 bit integers) then look out for partial stalls. If you use PUSHF or LAHF instructions then look out for partial flags stalls. Avoid testing flags after shifts or rotates by more than 1 (chapter 19).

If you aim at a throughput of 3 uops per clock cycle then be aware of possible delays in instruction fetch and decoding (chapters 14 and 15), especially in small loops.

The limit of two permanent register reads per clock cycle may reduce your throughput to less than 3 uops per clock cycle (chapter 16.2). This is likely to happen if you often read registers more than 4 clock cycles after they were last modified. This may happen, for example, if you often use pointers for addressing your data but seldom modify the pointers.

A throughput of 3 uops per clock requires that no execution port gets more than one third of the uops (chapter 17).
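The port condition can be turned into a rough lower bound on the clock cycles per loop iteration. The sketch below is a simplification with hypothetical names. It assumes five execution ports (0 to 4) that each accept one uop per clock, it expects the caller to have already assigned each uop to a single port, and it ignores every other bottleneck discussed in this chapter.

```c
#include <stddef.h>

#define NUM_PORTS 5   /* assumption: ports 0..4, one uop per clock each */

/* Lower bound on clocks per iteration: limited either by the overall
   rate of 3 uops per clock or by the busiest single port. */
double min_clocks_per_iteration(const unsigned uops_per_port[NUM_PORTS])
{
    unsigned total = 0;
    unsigned busiest = 0;
    for (size_t i = 0; i < NUM_PORTS; i++) {
        total += uops_per_port[i];
        if (uops_per_port[i] > busiest)
            busiest = uops_per_port[i];
    }
    double overall_bound = total / 3.0;
    return overall_bound > busiest ? overall_bound : (double)busiest;
}
```

For example, a loop with 6 uops spread as 2, 2, 1, 1, 0 is bounded by 2 clocks, while the same 6 uops spread as 1, 1, 4, 0, 0 are bounded by 4 clocks because one port receives more than one third of the uops.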

The retirement station can handle 3 uops per clock, but may be slightly less effective for taken jumps (chapter 18).




