XiaoHui's Road as a Programmer, since 1996 http://www.xiaohui.com
Happy travels: work and be happy, be professional and stay relaxed
MMX optimization: How to optimize for the Pentium family of microprocessors

31. Comparison of the different microprocessors


http://www.XiaoHui.com Date: 2000-04-03 13:00


The following table summarizes some important differences between the microprocessors in the Pentium family:

                                      PPlain   PMMX     PPro     PII      PIII
code cache, KB                        8        16       8        16       16
data cache, KB                        8        16       8        16       16
built-in level 2 cache, KB            0        0        256      512 *)   512 *)
MMX instructions                      no       yes      no       yes      yes
XMM instructions                      no       no       no       no       yes
conditional move instructions         no       no       yes      yes      yes
out-of-order execution                no       no       yes      yes      yes
branch prediction                     poor     good     good     good     good
branch target buffer entries          256      256      512      512      512
return stack buffer size              0        4        16       16       16
branch misprediction penalty, clocks  3-4      4-5      10-20    10-20    10-20
partial register stall, clocks        0        0        5        5        5
FMUL latency, clocks                  3        3        5        5        5
FMUL throughput                       1/2      1/2      1/2      1/2      1/2
IMUL latency, clocks                  9        9        4        4        4
IMUL throughput                       1/9      1/9      1/1      1/1      1/1

*) Celeron: 0-128 KB, Xeon: 512 KB or more; many other variants are available. On some versions the level 2 cache runs at half speed.

Comments to the table:

Code cache size is important if the critical part of your program is not limited to a small memory space.

Data cache size is important for all programs that handle more than small amounts of data in the critical part.

MMX and XMM instructions are useful for programs that handle massively parallel data, such as sound and image processing. In other applications it may not be possible to take advantage of the MMX and XMM instructions.

Conditional move instructions are useful for avoiding poorly predictable conditional jumps.
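As an illustration of the idea, here is a minimal C sketch of replacing a conditional jump with straight-line code. The function names max_branch and max_branchless are hypothetical and do not come from the original text. Whether a compiler turns the first function into a conditional jump or a conditional move depends on the compiler and the target processor it is told to assume; the second function avoids the jump with a bit mask and therefore also works on processors without conditional move instructions.

```c
#include <stdint.h>

/* Branching version. A compiler may emit a conditional jump here,
   or a conditional move if it targets a PPro or later processor. */
int32_t max_branch(int32_t a, int32_t b)
{
    if (a > b)
        return a;
    return b;
}

/* Jump-free version (hypothetical helper for illustration).
   (a > b) evaluates to 1 or 0, so the negation gives a mask of
   all ones when a > b and all zeros otherwise. */
int32_t max_branchless(int32_t a, int32_t b)
{
    int32_t mask = -(int32_t)(a > b);
    return (a & mask) | (b & ~mask);
}
```

Both functions return the larger of the two arguments. The jump-free form only pays off when the comparison result is hard to predict, since it always executes all of its instructions.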

Out of order execution improves performance, especially on non-optimized code. It includes automatic instruction reordering and register renaming.

Processors with a good branch prediction method can predict simple repetitive patterns. Good branch prediction matters most when the branch misprediction penalty is high.

A return stack buffer improves prediction of return instructions when a subroutine is called alternately from different locations.

Partial register stalls make handling of mixed data sizes (8, 16, 32 bit) more difficult.

The latency of a multiplication instruction is the number of clock cycles before its result can be used by a dependent instruction, i.e. the time it takes in a dependency chain. A throughput of 1/2 means that the execution can be pipelined so that a new multiplication can begin every second clock cycle. This defines the speed for handling parallel data.
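The difference between latency and throughput can be made concrete with a small calculation. The sketch below is an idealized model, not a measurement: it assumes the multiplier is the only bottleneck, and the function names chain_cycles and pipelined_cycles are hypothetical. The parameter interval is the reciprocal of the throughput, i.e. the number of clocks between the starts of two consecutive multiplications.

```c
/* Clock cycles for n multiplications in a dependency chain:
   each one must wait for the result of the previous one. */
unsigned long chain_cycles(unsigned long n, unsigned long latency)
{
    return n * latency;
}

/* Clock cycles for n independent multiplications: a new one can
   start every `interval` clocks, and the last one needs `latency`
   clocks to deliver its result. */
unsigned long pipelined_cycles(unsigned long n, unsigned long latency,
                               unsigned long interval)
{
    if (n == 0)
        return 0;
    return (n - 1) * interval + latency;
}
```

With the FMUL figures for the PPro, PII and PIII from the table (latency 5, throughput 1/2), 100 dependent multiplications take 100 * 5 = 500 clocks in this model, while 100 independent ones take 99 * 2 + 5 = 203 clocks.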

Most of the optimizations described in this document have little or no negative effects on other microprocessors, including non-Intel processors, but there are some problems to be aware of.

Scheduling floating point code for the PPlain and PMMX often requires a lot of extra FXCH instructions. This will slow down execution on older microprocessors, but not on the Pentium family and advanced non-Intel processors.

Taking advantage of the MMX instructions in the PMMX, PII and PIII processors or the conditional moves in the PPro, PII and PIII will create problems if you want your code to be compatible with earlier microprocessors. The solution may be to write several versions of your code, each optimized for a particular processor. Your program should detect which processor it is running on and select the appropriate version of code (chapter 27.10).
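The selection step can be sketched in C as follows. This is only an assumption about how one might organize it, not the method of chapter 27.10. The sketch takes the feature flags that the CPUID instruction returns in EDX for function 1 (bit 15 = conditional move, bit 23 = MMX, bit 25 = XMM) as a plain argument; how that value is obtained is left out, and the names code_version and select_version are hypothetical.

```c
#include <stdint.h>

/* Feature bits returned in EDX by CPUID function 1. */
#define FEATURE_CMOV (UINT32_C(1) << 15)
#define FEATURE_MMX  (UINT32_C(1) << 23)
#define FEATURE_XMM  (UINT32_C(1) << 25)

/* One entry per code version the program ships (hypothetical). */
enum code_version {
    VERSION_GENERIC,   /* no MMX, no conditional moves: PPlain and older */
    VERSION_MMX,       /* MMX only: PMMX                                 */
    VERSION_CMOV,      /* conditional moves only: PPro                   */
    VERSION_MMX_CMOV,  /* MMX and conditional moves: PII                 */
    VERSION_XMM        /* MMX, conditional moves and XMM: PIII           */
};

/* Pick the most capable version whose required features are all present. */
enum code_version select_version(uint32_t edx_features)
{
    int cmov = (edx_features & FEATURE_CMOV) != 0;
    int mmx  = (edx_features & FEATURE_MMX) != 0;
    int xmm  = (edx_features & FEATURE_XMM) != 0;

    if (xmm && mmx && cmov)
        return VERSION_XMM;
    if (mmx && cmov)
        return VERSION_MMX_CMOV;
    if (cmov)
        return VERSION_CMOV;
    if (mmx)
        return VERSION_MMX;
    return VERSION_GENERIC;
}
```

Testing feature bits rather than the processor name means that a non-Intel processor supporting the same instructions would get the same version. A program would typically call select_version once at startup and store a function pointer to the chosen routine.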

Tags: MMX optimization




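XiaoHui's Road as a Programmer, founded in 1997 ◇ To be the best developer I can be is my unchanging ideal...
Copyright(C) 1997-2009 XiaoHui.com   All rights reserved
Notice: all original text on this site may be reprinted and copied without prior permission.
Reprints must credit the author and the original source with a link.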