31. Comparison of the different microprocessors


日期: 2000-04-03 14:00 | 联系我
关注我: Telegram, Twitter

31. Comparison of the different microprocessors

The following table summarizes some important differences between the microprocessors in the Pentium family:

PPlain PMMX PPro PII PIII
code cache, kb81681616
data cache, kb81681616
built in level 2 cache, kb00256512 *)512 *)
MMX instructionsnoyesnoyesyes
XMM instructionsnonononoyes
conditional move instructruct.nonoyesyesyes
out of order executionnonoyesyesyes
branch predictionpoorgoodgoodgoodgood
branch target buffer entries256256512512512
return stack buffer size04161616
branch misprediction penalty3-44-510-2010-2010-20
partial register stall00555
FMUL latency33555
FMUL throughput1/21/21/21/21/2
IMUL latency99444
IMUL throughput1/91/91/11/11/1

*) Celeron: 0-128, Xeon: 512 or more, many other variants available. On some versions the level 2 cache runs at half speed.

Comments to the table:

Code cache size is important if the critical part of your program is not limited to a small memory space.

Data cache size is important for all programs that handle more than small amounts of data in the critical part.

MMX and XMM instructions are useful for programs that handle massively parallel data, such as sound and image processing. In other applications it may not be possible to take advantage of the MMX and XMM instructions.

Conditional move instructructions are useful for avoiding poorly predictable conditional jumps.

Out of order execution improves performance, especially on non-optimized code. It includes automatic instruction reordering and register renaming.

Processors with a good branch prediction method can predict simple repetitive patterns. A good branch prediction is most important if the branch misprediction penalty is high.

A return stack buffer improves prediction of return instructions when a subroutine is called alternatingly from different locations.

Partial register stalls make handling of mixed data sizes (8, 16, 32 bit) more difficult.

The latency of a multiplication instruction is the time it takes in a dependency chain. A throughput of 1/2 means that the execution can be pipelined so that a new multiplication can begin every second clock cycle. This defines the speed for handling parallel data.

Most of the optimizations described in this document have little or no negative effects on other microprocessors, including non-Intel processors, but there are some problems to be aware of.

Scheduling floating point code for the PPlain and PMMX often requires a lot of extra FXCH instructions. This will slow down execution on older microprocessors, but not on the Pentium family and advanced non-Intel processors.

Taking advantage of the MMX instructions in the PMMX, PII and PIII processors or the conditional moves in the PPro, PII and PIII will create problems if you want your code to be compatible with earlier microprocessors. The solution may be to write several versions of your code, each optimized for a particular processor. Your program should detect which processor it is running on and select the appropriate version of code (chapter 27.10).

标签: MMX 优化

 文章评论
目前没有任何评论.

↓ 快抢占第1楼,发表你的评论和意见 ↓

当前页面是本站的 Google AMP 版本。
欲查看完整版本和发表评论请点击:完整版 »

 

程序员小辉 建站于 1997
Copyright © XiaoHui.com; 保留所有权利。