MMX optimization: How to optimize for the Pentium family of microprocessors

27.8 Moving blocks of data (all processors)


http://www.XiaoHui.com Date: 2000-04-03 13:00

There are several ways of moving blocks of data. The most common method is REP MOVSD, but under certain conditions other methods are faster.

On PPlain and PMMX it is faster to move 8 bytes at a time using floating point registers if the destination is not in the cache:

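TOP:    FILD    QWORD PTR [ESI]        ; load first 8 bytes as a 64-bit integer
        FILD    QWORD PTR [ESI+8]      ; load next 8 bytes
        FXCH                           ; bring the first value back to ST(0)
        FISTP   QWORD PTR [EDI]        ; store first 8 bytes
        FISTP   QWORD PTR [EDI+8]      ; store next 8 bytes
        ADD     ESI, 16
        ADD     EDI, 16
        DEC     ECX                    ; ECX = number of 16-byte blocks left
        JNZ     TOP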

The source and destination should of course be aligned by 8. The extra time used by the slow FILD and FISTP instructions is compensated for by the fact that you only have to do half as many write operations. Note that this method is only advantageous on the PPlain and PMMX and only if the destination is not in the level 1 cache. You cannot use FLD and FSTP (without I) on arbitrary bit patterns because denormal numbers are handled slowly and certain bit patterns are not preserved unchanged.
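For readers who think in C, the shape of the loop above can be sketched as follows. This only illustrates the pattern of two 8-byte moves per 16-byte block; it is not a replacement for the assembly. The name `move_qwords` is hypothetical, and whether a compiler actually emits 8-byte loads and stores for it depends on the target and compiler settings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy `blocks` 16-byte blocks from src to dst, two 8-byte moves per
 * iteration. Hypothetical helper: assumes src and dst are aligned by 8
 * and do not overlap. Unlike the DEC ECX / JNZ loop, a count of zero
 * simply copies nothing. */
static void move_qwords(void *dst, const void *src, size_t blocks)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert((uintptr_t)d % 8 == 0 && (uintptr_t)s % 8 == 0);

    while (blocks-- > 0) {
        uint64_t first, second;
        memcpy(&first, s, 8);        /* load first quadword   */
        memcpy(&second, s + 8, 8);   /* load second quadword  */
        memcpy(d, &first, 8);        /* store first quadword  */
        memcpy(d + 8, &second, 8);   /* store second quadword */
        s += 16;
        d += 16;
    }
}
```

Because the data travels through integer variables, every bit pattern is preserved, which is the same reason the assembly uses FILD and FISTP rather than FLD and FSTP.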

On the PMMX processor it is faster to use MMX instructions to move eight bytes at a time if the destination is not in the cache:

TOP:    MOVQ    MM0, [ESI]             ; load 8 bytes
        MOVQ    [EDI], MM0             ; store 8 bytes
        ADD     ESI, 8
        ADD     EDI, 8
        DEC     ECX                    ; ECX = number of 8-byte blocks left
        JNZ     TOP

There is no need to unroll this loop or optimize it further if cache misses are expected, because memory access is the bottleneck here, not instruction execution.

On PPro, PII and PIII processors the REP MOVSD instruction is particularly fast when the following conditions are met (see chapter 26.3):

  • both source and destination must be aligned by 8
  • direction must be forward (direction flag cleared)
  • the count (ECX) must be greater than or equal to 64
  • the difference between EDI and ESI must be numerically greater than or equal to 32
  • the memory type for both source and destination must be either writeback or write-combining (you can normally assume this).
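The conditions in this list that software can check are easy to test before choosing a copy routine. The following C sketch is one way to do it, under stated assumptions: the function name `rep_movsd_fast_path` and its parameters are hypothetical, "numerically greater than or equal to 32" is read as the absolute difference between the two addresses, and the memory-type condition is simply assumed to hold because ordinary user code cannot query it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true when the checkable conditions for the fast REP MOVSD
 * path hold. dword_count corresponds to ECX, src to ESI, dst to EDI.
 * Assumption: the memory type is writeback or write-combining. */
static bool rep_movsd_fast_path(const void *src, const void *dst,
                                size_t dword_count, bool direction_forward)
{
    assert(src != NULL && dst != NULL);

    uintptr_t s = (uintptr_t)src;
    uintptr_t d = (uintptr_t)dst;
    uintptr_t distance = (d > s) ? d - s : s - d;   /* |EDI - ESI| */

    return s % 8 == 0 && d % 8 == 0     /* both aligned by 8        */
        && direction_forward            /* direction flag cleared   */
        && dword_count >= 64            /* ECX >= 64                */
        && distance >= 32;              /* |EDI - ESI| >= 32        */
}
```

A caller would fall back to one of the other methods in this section whenever the function returns false.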

On the PII it is faster to use MMX registers if the above conditions are not met and the destination is likely to be in the level 1 cache. The loop may be unrolled by two, and the source and destination should of course be aligned by 8.

On the PIII the fastest way of moving data is to use the MOVAPS instruction if the above conditions are not met or if the destination is in the level 1 or level 2 cache:

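        SUB     EDI, ESI               ; EDI = destination - source
TOP:    MOVAPS  XMM0, [ESI]            ; load 16 bytes
        MOVAPS  [ESI+EDI], XMM0        ; store at source + offset = destination
        ADD     ESI, 16
        DEC     ECX                    ; ECX = number of 16-byte blocks left
        JNZ     TOP

Unlike FLD, MOVAPS can handle any bit pattern without problems. Remember that source and destination must be aligned by 16.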

If the number of bytes to move is not divisible by 16 then you may round up to the nearest number divisible by 16 and put some extra space at the end of the destination buffer to receive the superfluous bytes. If this is not possible then you have to move the remaining bytes by other methods.
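The arithmetic behind both options is small enough to show directly. This is a minimal C sketch with hypothetical helper names: `round_up_16` gives the padded size to allocate when the destination buffer can be enlarged, and `split_by_16` gives the number of full 16-byte blocks plus the leftover bytes when it cannot.

```c
#include <assert.h>
#include <stddef.h>

/* Round a byte count up to the nearest multiple of 16.
 * Assumes n is at most SIZE_MAX - 15, so the addition cannot wrap. */
static size_t round_up_16(size_t n)
{
    return (n + 15) & ~(size_t)15;
}

/* Split a byte count into full 16-byte blocks (the loop count for the
 * MOVAPS loop) and a tail of 0..15 bytes to be moved by other means. */
static void split_by_16(size_t n, size_t *blocks, size_t *tail)
{
    assert(blocks != NULL && tail != NULL);
    *blocks = n / 16;
    *tail = n % 16;
}
```

For example, a 100-byte move either needs a 112-byte destination buffer and 7 loop iterations, or 6 loop iterations followed by 4 bytes moved separately.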

On the PIII you also have the option of writing directly to RAM, bypassing the cache, by using the MOVNTQ or MOVNTPS instruction. This can be useful if you don't want the destination to go into a cache. MOVNTPS is only slightly faster than MOVNTQ.
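The recommendations in this section can be condensed into a small selection routine. The C sketch below is one possible reading, not a definitive rule set: all type and function names are hypothetical, REP MOVSD is used as the fallback wherever the text gives no specific advice, MOVQ is chosen over FILD/FISTP on the PMMX, and the non-temporal MOVNTQ/MOVNTPS option is left out because it depends on whether the caller wants the destination cached.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { CPU_PPLAIN, CPU_PMMX, CPU_PPRO, CPU_PII, CPU_PIII } cpu_t;
typedef enum { DEST_UNCACHED, DEST_IN_L1, DEST_IN_L2 } dest_t;
typedef enum { MOVE_REP_MOVSD, MOVE_FILD_FISTP, MOVE_MOVQ, MOVE_MOVAPS } method_t;

/* Pick a block-move method. rep_conditions_met says whether the
 * REP MOVSD fast-path conditions hold; dest is the caller's guess
 * about where the destination currently resides. */
static method_t choose_move_method(cpu_t cpu, bool rep_conditions_met,
                                   dest_t dest)
{
    assert(cpu >= CPU_PPLAIN && cpu <= CPU_PIII);

    switch (cpu) {
    case CPU_PPLAIN:   /* FILD/FISTP pays off only outside the L1 cache */
        return dest != DEST_IN_L1 ? MOVE_FILD_FISTP : MOVE_REP_MOVSD;
    case CPU_PMMX:     /* assumption: prefer MOVQ over FILD/FISTP here  */
        return dest != DEST_IN_L1 ? MOVE_MOVQ : MOVE_REP_MOVSD;
    case CPU_PII:      /* MMX when the fast path fails and dest is in L1 */
        return (!rep_conditions_met && dest == DEST_IN_L1)
                   ? MOVE_MOVQ : MOVE_REP_MOVSD;
    case CPU_PIII:     /* MOVAPS when fast path fails or dest is cached  */
        return (!rep_conditions_met || dest != DEST_UNCACHED)
                   ? MOVE_MOVAPS : MOVE_REP_MOVSD;
    case CPU_PPRO:
    default:           /* assumption: fall back to the common method     */
        return MOVE_REP_MOVSD;
    }
}
```

The alignment requirements of each method (8 bytes for FILD/FISTP and MOVQ, 16 bytes for MOVAPS) still have to be met by the caller; this sketch does not check them.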
