26.3 String instructions (all processors)


Date: 2000-04-02 15:00

String instructions without a repeat prefix are too slow and should be replaced by simpler instructions. The same applies to LOOP on all processors and to JECXZ on PPlain and PMMX.

REP MOVSD and REP STOSD are quite fast if the repeat count is not too small. Always use the DWORD version if possible, and make sure that both source and destination are aligned by 8.

Some other methods of moving data are faster under certain conditions. See chapter 27.8 for details.

Note that while the REP MOVS instruction writes a word to the destination, it reads the next word from the source in the same clock cycle. You can get a cache bank conflict if bits 2-4 are the same in these two addresses. In other words, you pay a penalty of one extra clock per iteration if ESI+(wordsize)-EDI is divisible by 32. The easiest way to avoid cache bank conflicts is to use the DWORD version and align both source and destination by 8. Never use MOVSB or MOVSW in optimized code, not even in 16 bit mode.
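The bank conflict rule above can be checked with a little address arithmetic. The following is a minimal sketch, not part of the original text: the function name `rep_movs_bank_conflict` and the constant `BANK_MASK` are made up for illustration, and the addresses are plain integers standing in for ESI and EDI.

```python
# Hypothetical helper, for illustration only: models the rule that a
# REP MOVS iteration conflicts when bits 2-4 of the next read address
# (ESI + word size) equal bits 2-4 of the current write address (EDI).

BANK_MASK = 0b11100  # address bits 2-4


def rep_movs_bank_conflict(esi: int, edi: int, word_size: int = 4) -> bool:
    """Return True if the read of the next word and the write of the
    current word have identical address bits 2-4."""
    next_read = esi + word_size
    # For dword-aligned addresses this is the same as testing whether
    # ESI + word_size - EDI is divisible by 32.
    return ((next_read ^ edi) & BANK_MASK) == 0


# Source and destination both aligned by 8, DWORD version: no conflict,
# because bit 2 of ESI+4 is set while bit 2 of EDI is clear.
print(rep_movs_bank_conflict(0x1000, 0x2000))

# Destination offset by 4 relative to the source: ESI+4-EDI is a
# multiple of 32, so every iteration would pay the penalty.
print(rep_movs_bank_conflict(0x1000, 0x2004))
```

With the DWORD version and both pointers aligned by 8, ESI+4 always has bit 2 set while EDI has it clear, which is why that alignment rule is enough to rule out the conflict.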

REP MOVS and REP STOS can perform very fast by moving an entire cache line at a time on PPro, PII and PIII. This happens only when the following conditions are met:

  • both source and destination must be aligned by 8
  • direction must be forward (direction flag cleared)
  • the count (ECX) must be greater than or equal to 64
  • the difference between EDI and ESI must be numerically greater than or equal to 32
  • the memory type for both source and destination must be either writeback or write-combining (you can normally assume this).

Under these conditions the number of uops issued is approximately 215+2*ECX for REP MOVSD and 185+1.5*ECX for REP STOSD, giving a speed of approximately 5 bytes per clock cycle for both instructions, which is almost 3 times as fast as when the above conditions are not met.
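To see how the constant term in these formulas affects short copies, the uop counts can be tabulated for a few repeat counts. This is only a sketch of the arithmetic quoted above; the function names are invented for illustration, and the code says nothing about actual clock counts, which depend on more than the number of uops.

```python
# Illustrative arithmetic only: the formulas are the approximate uop
# counts quoted in the text for the fast mode on PPro, PII and PIII.
# All function names here are hypothetical.


def rep_movsd_uops(ecx: int) -> float:
    """Approximate uops issued by REP MOVSD in fast mode."""
    return 215 + 2 * ecx


def rep_stosd_uops(ecx: int) -> float:
    """Approximate uops issued by REP STOSD in fast mode."""
    return 185 + 1.5 * ecx


def uops_per_byte(uop_formula, ecx: int) -> float:
    """Uops per byte moved; each iteration handles one dword (4 bytes)."""
    return uop_formula(ecx) / (4 * ecx)


for count in (64, 256, 1024, 4096):
    print(count,
          rep_movsd_uops(count), round(uops_per_byte(rep_movsd_uops, count), 3),
          rep_stosd_uops(count), round(uops_per_byte(rep_stosd_uops, count), 3))
```

At the minimum count of 64 the formula gives 343 uops for REP MOVSD, of which 215 do not depend on the count, so the cost per byte drops noticeably as the repeat count grows.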

The byte and word versions also benefit from this fast mode, but they are less effective than the dword versions.

REP STOSD is optimal under the same conditions as REP MOVSD.
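The fast-mode conditions listed above can be collected into a single predicate, which may be handy as a debugging aid when deciding whether a given copy qualifies. This is a minimal sketch under stated assumptions: the function name, the parameter names and the "WB"/"WC" memory type labels are hypothetical, and the model follows the REP MOVSD case (for REP STOSD there is no source pointer to check).

```python
# Hypothetical checker for the fast-mode conditions listed in the text.
# Names and the "WB"/"WC" labels are assumptions made for illustration.

FAST_MEMORY_TYPES = {"WB", "WC"}  # writeback, write-combining


def fast_string_mode(esi: int, edi: int, ecx: int,
                     direction_flag: bool = False,
                     src_type: str = "WB",
                     dst_type: str = "WB") -> bool:
    """Return True if every listed condition for the fast mode holds."""
    aligned = esi % 8 == 0 and edi % 8 == 0        # both aligned by 8
    forward = not direction_flag                   # direction flag cleared
    long_enough = ecx >= 64                        # count at least 64
    far_apart = abs(edi - esi) >= 32               # numeric difference >= 32
    cacheable = (src_type in FAST_MEMORY_TYPES
                 and dst_type in FAST_MEMORY_TYPES)
    return aligned and forward and long_enough and far_apart and cacheable


print(fast_string_mode(0x1000, 0x2000, 64))   # all conditions met
print(fast_string_mode(0x1000, 0x2000, 63))   # count too small
print(fast_string_mode(0x1004, 0x2000, 64))   # source not aligned by 8
```

Writing the conditions as separate named tests makes it easy to see which one fails for a particular pair of pointers.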

REP LODS, REP SCAS, and REP CMPS are not optimal, and may be replaced by loops. See examples 1.10, 2.8 and 2.9 for alternatives to REPNE SCASB. REP CMPS may suffer cache bank conflicts if bits 2-4 are the same in ESI and EDI.
