首页 随笔 乐走天涯 程序资料 评论中心 Tag 论坛 其他资源 搜索 消息中心 联系我 关于 RSS

31. Comparison of the different microprocessors


日期: 2000-04-03 14:00 | 联系我 | 关注我: Google+ | Twitter | 新浪微博

31. Comparison of the different microprocessors

The following table summarizes some important differences between the microprocessors in the Pentium family:

PPlain PMMX PPro PII PIII
code cache, kb81681616
data cache, kb81681616
built in level 2 cache, kb00256512 *)512 *)
MMX instructionsnoyesnoyesyes
XMM instructionsnonononoyes
conditional move instructruct.nonoyesyesyes
out of order executionnonoyesyesyes
branch predictionpoorgoodgoodgoodgood
branch target buffer entries256256512512512
return stack buffer size04161616
branch misprediction penalty3-44-510-2010-2010-20
partial register stall00555
FMUL latency33555
FMUL throughput1/21/21/21/21/2
IMUL latency99444
IMUL throughput1/91/91/11/11/1

*) Celeron: 0-128, Xeon: 512 or more, many other variants available. On some versions the level 2 cache runs at half speed.

Comments to the table:

Code cache size is important if the critical part of your program is not limited to a small memory space.

Data cache size is important for all programs that handle more than small amounts of data in the critical part.

MMX and XMM instructions are useful for programs that handle massively parallel data, such as sound and image processing. In other applications it may not be possible to take advantage of the MMX and XMM instructions.

Conditional move instructructions are useful for avoiding poorly predictable conditional jumps.

Out of order execution improves performance, especially on non-optimized code. It includes automatic instruction reordering and register renaming.

Processors with a good branch prediction method can predict simple repetitive patterns. A good branch prediction is most important if the branch misprediction penalty is high.

A return stack buffer improves prediction of return instructions when a subroutine is called alternatingly from different locations.

Partial register stalls make handling of mixed data sizes (8, 16, 32 bit) more difficult.

The latency of a multiplication instruction is the time it takes in a dependency chain. A throughput of 1/2 means that the execution can be pipelined so that a new multiplication can begin every second clock cycle. This defines the speed for handling parallel data.

Most of the optimizations described in this document have little or no negative effects on other microprocessors, including non-Intel processors, but there are some problems to be aware of.

Scheduling floating point code for the PPlain and PMMX often requires a lot of extra FXCH instructions. This will slow down execution on older microprocessors, but not on the Pentium family and advanced non-Intel processors.

Taking advantage of the MMX instructions in the PMMX, PII and PIII processors or the conditional moves in the PPro, PII and PIII will create problems if you want your code to be compatible with earlier microprocessors. The solution may be to write several versions of your code, each optimized for a particular processor. Your program should detect which processor it is running on and select the appropriate version of code (chapter 27.10).

标签: MMX 优化

 文章评论
目前没有任何评论.

↓ 快抢占第1楼,发表你的评论和意见 ↓

发表你的评论
如果你想针对此文发表评论, 请填写下列表单:
姓名: * 必填 (Twitter 用户可输入以 @ 开头的用户名, Steemit 用户可输入 @@ 开头的用户名)
E-mail: 可选 (不会被公开。如果我回复了你的评论,你将会收到邮件通知)
网站 / Blog: 可选
反垃圾广告: 为了防止广告机器人自动发贴, 请计算下列表达式的值:
10 x 3 + 4 = * 必填
评论内容:
* 必填
你可以使用下列标签修饰文字:
[b] 文字 [/b]: 加粗文字
[quote] 文字 [/quote]: 引用文字

 
首页 随笔 乐走天涯 猎户星 Google Earth 程序资料 程序生活 评论 Tag 论坛 资源 搜索 联系 关于 隐私声明 版权声明 订阅邮件

程序员小辉 建站于 1997 ◇ 做一名最好的开发者是我不变的理想。
Copyright © XiaoHui.com; 保留所有权利。