Tag: MMX Optimization

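» [MMX Assembly Optimization] - MMX Technology for 3D Processing: Rasterization (Part 1)
Figure 2 shows a typical 3D processing data flow. As input, it receives transformed, lit vertices connected into structures such as discrete triangles (three vertices each) or triangle meshes. In a mesh, each new vertex defines a new triangle; the other two vertices are the two preceding vertices in the mesh list. Finally, the pipeline draws to the CRT screen as its output. In the figure, a circled M marks the places where MMX technology can improve quality or performance. Posted on 2000-06-12
Tags: 3D | MMX Optimization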
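
The mesh rule in the excerpt above (each new vertex forms a triangle with the two vertices before it) can be sketched in C as follows. This is only an illustration: the names `Triangle`, `strip_triangle_count`, and `strip_triangle` are hypothetical and do not come from the article, and the sketch ignores the alternating winding order that real strip renderers usually have to handle.

```c
#include <stddef.h>

/* Hypothetical type: the three vertex indices that make up one triangle. */
typedef struct {
    size_t a, b, c;
} Triangle;

/* A mesh (strip) of n vertices yields n - 2 triangles, or none if n < 3. */
size_t strip_triangle_count(size_t vertex_count)
{
    return vertex_count < 3 ? 0 : vertex_count - 2;
}

/* Triangle t is formed by new vertex t + 2 and the two vertices before it. */
Triangle strip_triangle(size_t t)
{
    Triangle tri = { t, t + 1, t + 2 };
    return tri;
}
```

For example, a five-vertex mesh yields three triangles, with vertex indices (0, 1, 2), (1, 2, 3), and (2, 3, 4).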

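» [MMX Assembly Optimization] - MMX Technology Developer's Manual, Appendix B: MMX Instruction and Operand Restrictions
Special thanks go to 易春华, a good friend from 震源工作室 (known for 战国风云), who kindly contributed the following section on MMX™ instruction and operand restrictions and made this manual more complete. Posted on 2000-06-12
Tags: MMX Optimization | Developers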

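» [MMX Assembly Optimization] - MMX Technology Developer's Manual, Chapter 3: MMX™ Code Development Rules (Part 5)
Intel's MMX™ technology is an extension to the Intel Architecture (IA) instruction set. It uses single-instruction, multiple-data (SIMD) techniques to process multiple data elements in parallel, which speeds up multimedia and communications software. The MMX™ instruction set adds 57 new opcodes and a new 64-bit quadword data type. The new 64-bit data is kept available to MMX ... Posted on 2000-06-12
Tags: MMX Optimization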

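» [MMX Assembly Optimization] - Downloads Related to MMX Assembly Optimization
Intel's WWW site hosts a great deal of literature and tutorials on MMX optimization (such as Intel's latest C/C++ compiler, Intel's latest instruction tutorials, the VTune code optimization tool, and numerous software developer's manuals). I recommend studying these documents to gain a deeper understanding of microprocessor architecture and instructions. I cannot give exact download addresses, however, because file locations change frequently. You can search for the documents or tools you need at these two sites. Posted on 2000-06-12
Tags: Intel | Downloads | MMX Optimization | Developers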

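» [MMX Assembly Optimization] - MMX Technology Developer's Manual, Chapter 3: MMX™ Code Development Rules (Part 6), Memory Optimization
Intel's MMX™ technology is an extension to the Intel Architecture (IA) instruction set. It uses single-instruction, multiple-data (SIMD) techniques to process multiple data elements in parallel, which speeds up multimedia and communications software. The MMX™ instruction set adds 57 new opcodes and a new 64-bit quadword data type. The new 64-bit data is kept available to MMX ... Posted on 2000-06-12
Tags: MMX Optimization | Memory Optimization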

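» [MMX Assembly Optimization] - MMX Technology for 3D Processing: Rasterization (Part 2)
For a more complete description of vertex information and state in a typical 3D library, see Microsoft's Direct3D documentation. In fact, the Direct3D processing pipeline fits very closely with MMX technology in the steps discussed next. We also hope readers will put this into practice with Direct3D. Posted on 2000-06-12
Tags: 3D | MMX Optimization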

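» [MMX Assembly Optimization] - MMX Technology Developer's Manual, Appendix A: MMX Instruction Set
The table below gives an overview of the MMX instruction set. The mnemonics listed are the base set; most instructions have several forms (such as packed-byte, -word, and -dword forms). Complete information on the MMX™ instructions can be found in the "Intel Architecture MMX™ Technology Programmer's Reference Manual" (order number 243007). Posted on 2000-06-12
Tags: MMX Optimization | Developers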

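» [MMX Assembly Optimization] - MMX Technology Developer's Manual, Chapter 3: MMX™ Code Development Rules (Part 1)
The following rules will help you quickly develop efficient MMX™ code that runs on all processors with MMX™ technology. Posted on 2000-06-12
Tags: MMX Optimization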

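» [MMX Assembly Optimization] - MMX Technology Developer's Manual: Introduction
MMX Technology Developer's Manual: Introduction. Posted on 2000-06-12
Tags: MMX Optimization | Developers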

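» [MMX Assembly Optimization] - MMX Technology Developer's Manual, Chapter 4: Code Development Strategy
Intel's MMX™ technology is an extension to the Intel Architecture (IA) instruction set. It uses single-instruction, multiple-data (SIMD) techniques to process multiple data elements in parallel, which speeds up multimedia and communications software. The MMX™ instruction set adds 57 new opcodes and a new 64-bit quadword data type. The new 64-bit data is kept available to MMX ... Posted on 2000-06-12
Tags: MMX Optimization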

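» [MMX Assembly Optimization] - MMX Technology Developer's Manual, Chapter 5: MMX™ Coding Techniques (Part 1)
MMX Technology Developer's Manual, Chapter 5: MMX™ Coding Techniques (Part 1). Posted on 2000-06-12
Tags: MMX Optimization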

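» [MMX Assembly Optimization] - MMX Technology for 3D Processing: Rasterization (Part 3)
Although the setup calculations above can benefit from the parallel processing of MMX technology, the number of such operations is limited. Moreover, they are often done in floating point. Greater benefits show up in the steps below, which draw each point on a scan line. Posted on 2000-06-12
Tags: 3D | MMX Optimization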

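» [MMX Assembly Optimization] - MMX Technology Developer's Manual, Chapter 2: Introduction to Processor Architecture and Pipelines (Part 1)
Chapter 2: Introduction to Processor Architecture and Pipelines (Part 1). Posted on 2000-06-12
Tags: MMX Optimization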

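» [MMX Assembly Optimization] - MMX Technology Developer's Manual, Chapter 3: MMX™ Code Development Rules (Part 2)
This section summarizes important general optimization techniques for the Intel Architecture. Posted on 2000-06-12
Tags: MMX Optimization | Intel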

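» [MMX Assembly Optimization] - MMX Technology Developer's Manual, Chapter 5: MMX™ Coding Techniques (Part 2)
MMX Technology Developer's Manual, Chapter 5: MMX™ Coding Techniques (Part 2). Posted on 2000-06-12
Tags: MMX Optimization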

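» [MMX Assembly Optimization] - MMX Technology for 3D Processing: Intended Audience
This article is written for programmers and technical managers. It also serves as an overview of MMX technology for 3D graphics processing, and offers pointers to the most effective strategies and implementation trade-offs. Posted on 2000-06-12
Tags: 3D | MMX Optimization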

» [MMX Assembly Optimization] - Optimization: How to optimize for the Pentium family of microprocessors (English)
English-language MMX optimization document: How to optimize for the Pentium family of microprocessors. Posted on 2000-06-12
Tags: MMX Optimization | Pentium

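» [MMX Assembly Optimization] - MMX Technology Developer's Manual, Chapter 6: MMX™ Performance Monitoring Extensions (Part 1)
MMX Technology Developer's Manual, Chapter 6: MMX™ Performance Monitoring Extensions (Part 1). Posted on 2000-06-12
Tags: MMX Optimization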

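» [MMX Assembly Optimization] - MMX Technology for 3D Processing: Processing Pipeline
As the illustration shows, the 3D graphics pipeline is controlled by an application, usually through calls to API functions. Scene management may be handled entirely within the application, or it may be part of the 3D pipeline beneath the API. Scene management feeds the 3D pipeline, which processes the geometry database of points or vertices. Posted on 2000-06-12
Tags: 3D | MMX Optimization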

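» [MMX Assembly Optimization] - MMX Technology for 3D Processing: Pixel Color Depth Conversion
MMX technology is not at its best with 16-bit high-color and 24-bit true-color data. In practice, the packed add, multiply, and logical operations make 24-bit the most attractive format to compute in. It consists of three 8-bit components (red, green, blue) or ?-bit fixed-point values (48 bits for RGB). Once the operations (Gouraud shading, alpha, and so on) are complete, the algorithm converts to RGB16 (555 or 565) to update the buffer. MMX technology has no built-in operation for this conversion and no packed type for 5-bit fields. Posted on 2000-06-12
Tags: 3D | MMX Optimization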
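
The RGB16 conversion mentioned in the excerpt above can be illustrated with a minimal scalar C sketch. It is not the article's MMX routine: the function names `pack_rgb565` and `pack_rgb555` are hypothetical, and the sketch simply truncates the low bits of each 8-bit component, assuming the usual layout with red in the high bits.

```c
#include <stdint.h>

/* Pack 8-bit R, G, B into RGB565: 5 bits red, 6 bits green, 5 bits blue.
   The low bits of each component are discarded (plain truncation). */
uint16_t pack_rgb565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* Pack 8-bit R, G, B into RGB555: 5 bits per component, top bit unused. */
uint16_t pack_rgb555(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r >> 3) << 10) | ((g >> 3) << 5) | (b >> 3));
}
```

Because each field is 5 or 6 bits wide and not aligned to a byte, a packed implementation has to build the result from shifts, masks, and ORs like these, which matches the excerpt's remark that there is no packed 5-bit type.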

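» [MMX Assembly Optimization] - MMX Technology Developer's Manual, Chapter 2: Introduction to Processor Architecture and Pipelines (Part 2)
The on-chip cache subsystem of processors with MMX™ technology consists of two 16 KB, four-way set-associative caches with a 32-byte line size. The caches use a write-back mechanism and a pseudo-LRU replacement algorithm. The data cache consists of eight banks interleaved on four-byte boundaries. On Pentium processors with MMX™ technology, a load instruction and a store instruction can access the data cache simultaneously, as long as the referenced data are not in the same cache bank. Posted on 2000-06-12
Tags: MMX Optimization | Cache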
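
The cache geometry in the excerpt above implies some simple address arithmetic, sketched here in C. The function names are hypothetical, and the bit positions are derived only from the stated figures: 16 KB / (4 ways × 32 bytes) gives 128 sets, and eight banks interleaved on four-byte boundaries means address bits 2 to 4 select the bank.

```c
#include <stdint.h>

/* Bank holding the 4-byte unit at addr: eight banks, 4-byte interleave. */
unsigned data_cache_bank(uint32_t addr)
{
    return (addr >> 2) & 7u;
}

/* Set index: 32-byte lines, 128 sets (16 KB / (4 ways * 32 bytes)). */
unsigned cache_set_index(uint32_t addr)
{
    return (addr >> 5) & 127u;
}

/* Nonzero if two addresses map to the same bank, the case the excerpt
   says prevents a simultaneous load and store. */
int same_bank(uint32_t load_addr, uint32_t store_addr)
{
    return data_cache_bank(load_addr) == data_cache_bank(store_addr);
}
```

Under these assumptions, addresses that differ by a multiple of 32 bytes always fall in the same bank, so a load and a store one cache line apart at the same offset would conflict.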

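» [MMX Assembly Optimization] - MMX Technology Developer's Manual, Chapter 3: MMX™ Code Development Rules (Part 3)
Intel's MMX™ technology is an extension to the Intel Architecture (IA) instruction set. It uses single-instruction, multiple-data (SIMD) techniques to process multiple data elements in parallel, which speeds up multimedia and communications software. The MMX™ instruction set adds 57 new opcodes and a new 64-bit quadword data type. The new 64-bit data is kept available to MMX ... Posted on 2000-06-12
Tags: MMX Optimization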

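» [MMX Assembly Optimization] - MMX Technology for 3D Processing: Executive Summary
Intel's MMX processors multiply the amount of integer data that can be processed in parallel, handling 64 bits at a time. Compared with plain Intel Architecture code, which handles at most 32 bits at a time, this speeds up the processing of pixels in 3D images. MMX technology can therefore deliver higher speed and higher image quality. Posted on 2000-06-12
Tags: 3D | MMX Optimization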

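» [MMX Assembly Optimization] - MMX Technology Developer's Manual, Chapter 3: MMX™ Code Development Rules (Part 4)
Intel's MMX™ technology is an extension to the Intel Architecture (IA) instruction set. It uses single-instruction, multiple-data (SIMD) techniques to process multiple data elements in parallel, which speeds up multimedia and communications software. The MMX™ instruction set adds 57 new opcodes and a new 64-bit quadword data type. The new 64-bit data is kept available to MMX ... Posted on 2000-06-12
Tags: MMX Optimization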

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 28.1 Integer instructions
28. List of instruction timings for PPlain and PMMX 28.1 Integer instructions Explanations: Operands: r = register, m = memory, i = immediate data, sr = segment register, m32 = 32 bit memory operand,... Posted on 2000-04-03
Tags: MMX Optimization | Integer

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 28.2 Floating point instructions
28.2 Floating point instructions Explanations: Operands: r = register, m = memory, m32 = 32 bit memory operand, etc. Clock cycles: The numbers are minimum values. Cache misses, misalignment, denormal... Posted on 2000-04-03
Tags: MMX Optimization | Floating-Point Arithmetic | Floating Point

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 27.6 Using integer instructions to do floating point operations (all processors)
27.6 Using integer instructions to do floating point operations (all processors) Integer instructions are generally faster than floating point instructions, so it is often advantageous to use integer... Posted on 2000-04-03
Tags: MMX Optimization | Integer | Floating Point

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 27.10 Detecting processor type (All processors)
27.10 Detecting processor type (All processors) I think it is fairly obvious by now that what is optimal for one microprocessor may not be optimal for another. You may make the most critical parts of... Posted on 2000-04-03
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 27.7 Using floating point instructions to do integer operations (PPlain and PMMX)
27.7 Using floating point instructions to do integer operations (PPlain and PMMX) Integer multiplication (PPlain and PMMX) Floating point multiplication is faster than integer multiplication on the P... Posted on 2000-04-03
Tags: MMX Optimization | Floating Point

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 31. Comparison of the different microprocessors
31. Comparison of the different microprocessors The following table summarizes some important differences between the microprocessors in the Pentium family: PPlain PMMX PPro PII PIII code cache, kb ... Posted on 2000-04-03
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 28.3 MMX instructions (PMMX)
28.3 MMX instructions (PMMX) A list of MMX instruction timings is not needed because they all take one clock cycle, except the MMX multiply instructions which take 3. MMX multiply instructions can be... Posted on 2000-04-03
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 27.3 Freeing floating point registers (all processors)
27.3 Freeing floating point registers (all processors) You have to free all used floating point registers before exiting a subroutine, except for any register used for the result. The fastest way of ... Posted on 2000-04-03
Tags: MMX Optimization | Floating Point

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 27.4 Transitions between floating point and MMX instructions (PMMX, PII and PIII)
27.4 Transitions between floating point and MMX instructions (PMMX, PII and PIII) You must issue an EMMS instruction after your last MMX instruction if there is a possibility that floating point code... Posted on 2000-04-03
Tags: Floating Point | MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 30. Testing speed
30. Testing speed The Pentium family of processors have an internal 64 bit clock counter which can be read into EDX:EAX using the instruction RDTSC (read time stamp counter). This is very useful for... Posted on 2000-04-03
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 29.1 Integer instructions
29. List of instruction timings and micro-op breakdown for PPro, PII and PIII Explanations: Operands: r = register, m = memory, i = immediate data, sr = segment register, m32 = 32 bit memory operand... Posted on 2000-04-03
Tags: MMX Optimization | Integer

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 27.8 Moving blocks of data (all processors)
27.8 Moving blocks of data (all processors) There are several ways of moving blocks of data. The most common method is REP MOVSD, but under certain conditions other methods are faster. On PPlain and... Posted on 2000-04-03
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 29.2 Floating point instructions
29.2 Floating point instructions Instruction Operands micro-ops delay throughput p0 p1 p01 p2 p3 p4 FLD r 1 FLD m32/64 1 1 FLD m80 2 2 FBLD m80 38 2 FST(P) r 1 FST(P) m32/m64 1 1 1 FSTP m80 2 2 2 FBS... Posted on 2000-04-03
Tags: MMX Optimization | Floating-Point Arithmetic | Floating Point

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 29.4 XMM instructions (PIII)
29.4 XMM instructions (PIII) Instruction Operands micro-ops delay throughput p0 p1 p01 p2 p3 p4 MOVAPS r128,r128 2 1 1/1 MOVAPS r128,m128 2 2 1/2 MOVAPS m128,r128 2 2 3 1/2 MOVUPS r128,m128 4 2 1/4 M... Posted on 2000-04-03
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 27.9 Self-modifying code (All processors)
27.9 Self-modifying code (All processors) The penalty for executing a piece of code immediately after modifying it is approximately 19 clocks for PPlain, 31 for PMMX, and 150-300 for PPro, PII and PI... Posted on 2000-04-03
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 27.5 Converting from floating point to integer (All processors)
27.5 Converting from floating point to integer (All processors) All conversions from floating point to integer, and vice versa, must go via a memory location: FISTP DWORD PTR [TEMP] MOV EAX, [TEMP] O... Posted on 2000-04-03
Tags: MMX Optimization | Integer | Floating Point

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 29.3 MMX instructions (PII and PIII)
29.3 MMX instructions (PII and PIII) Instruction Operands micro-ops delay throughput p0 p1 p01 p2 p3 p4 MOVD MOVQ r,r 1 2/1 MOVD MOVQ r64,m32/64 1 1/1 MOVD MOVQ m32/64,r64 1 1 1/1 PADD PSUB PCMP r64,... Posted on 2000-04-03
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 26.9 FRNDINT (all processors)
26.9 FRNDINT (all processors) This instruction is slow on all processors. Replace it by: FISTP QWORD PTR [TEMP] FILD QWORD PTR [TEMP] This code is faster despite a possible penalty for attempting to ... Posted on 2000-04-02
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 26.2 Rotates through carry (all processors)
26.2 Rotates through carry (all processors) RCR and RCL with a count different from one are slow and should be avoided. ... Posted on 2000-04-02
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 25.1. Loops in PPlain and PMMX
25. Loop optimization (all processors) When analyzing a program you often find that most of the time consumption lies in the innermost loop. The way to improve the speed is to carefully optimize the... Posted on 2000-04-02
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 26.3 String instructions (all processors)
26.3 String instructions (all processors) String instructions without a repeat prefix are too slow and should be replaced by simpler instructions. The same applies to LOOP on all processors and to JE... Posted on 2000-04-02
Tags: MMX Optimization | String

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 26.8 FPREM (all processors)
26.8 FPREM (all processors) The FPREM and FPREM1 instructions are slow on all processors. You may replace them by the following algorithm: Multiply by the reciprocal divisor, get the fractional part by... Posted on 2000-04-02
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 25.2 Loops in PPro, PII and PIII
25.2 Loops in PPro, PII and PIII In the previous chapter (25.1) I explained how to use convolution and loop unrolling in order to improve pairing in PPlain and PMMX. On the PPro, PII and PIII there... Posted on 2000-04-02
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 26.4 Bit test (all processors)
26.4 Bit test (all processors) BT, BTC, BTR, and BTS instructions should preferably be replaced by instructions like TEST, AND, OR, XOR, or shifts on PPlain and PMMX. On PPro, PII and PIII, bit tes... Posted on 2000-04-02
Tags: MMX Optimization | Bit

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 26.7 FCOM + FSTSW AX (all processors)
26.7 FCOM + FSTSW AX (all processors) The FNSTSW instruction is very slow on all processors. The PPro, PII and PIII processors have FCOMI instructions to avoid the slow FNSTSW. Using FCOMI instead o... Posted on 2000-04-02
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 26.5 Integer multiplication (all processors)
26.5 Integer multiplication (all processors) An integer multiplication takes approximately 9 clock cycles on PPlain and PMMX and 4 on PPro, PII and PIII. It is therefore often advantageous to replace... Posted on 2000-04-02
Tags: MMX Optimization | Integer

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 26.1 XCHG (all processors)
26. Problematic Instructions 26.1 XCHG (all processors) The XCHG register,[memory] instruction is dangerous. By default this instruction has an implicit LOCK prefix which prevents it from using the ... Posted on 2000-04-02
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 26.10 FSCALE and exponential function (all processors)
26.10 FSCALE and exponential function (all processors) FSCALE is slow on all processors. Computing integer powers of 2 can be done much faster by inserting the desired power in the exponent field of ... Posted on 2000-04-02
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 26.11 FPTAN (all processors)
26.11 FPTAN (all processors) According to the manuals, FPTAN returns two values X and Y and leaves it to the programmer to divide Y by X to get the result, but in fact it always returns 1 in X so y... Posted on 2000-04-02
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 26.12 FSQRT (PIII)
26.12 FSQRT (PIII) A fast way of calculating an approximate square root on the PIII is to multiply the reciprocal square root of x by x: SQRT(x) = x * RSQRT(x) The instruction RSQRTSS or RSQRTPS gives ... Posted on 2000-04-02
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 26.6 WAIT instruction (all processors)
26.6 WAIT instruction (all processors) You can often increase speed by omitting the WAIT instruction. The WAIT instruction has three functions: a. The old 8087 processor requires a WAIT before every ... Posted on 2000-04-02
Tags: MMX Optimization | WAIT

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 26.13 MOV [MEM], ACCUM (PPlain and PMMX)
26.13 MOV [MEM], ACCUM (PPlain and PMMX) The instructions MOV [mem],AL MOV [mem],AX MOV [mem],EAX are treated by the pairing circuitry as if they were writing to the accumulator. Thus the following i... Posted on 2000-04-02
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 22.5. Replacing conditional jumps by conditional moves (PPro, PII and PIII)
22.5. Replacing conditional jumps by conditional moves (PPro, PII and PIII) The PPro, PII and PIII processors have conditional move instructions intended specifically for avoiding branches because br... Posted on 2000-04-02
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 27.2 Division (all processors)
27.2 Division (all processors) Division is quite time consuming. On PPro, PII and PIII an integer division takes 19, 23, or 39 clocks for byte, word, and dword divisors respectively. On PPlain and PM... Posted on 2000-04-02
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 26.14 TEST instruction (PPlain and PMMX)
26.14 TEST instruction (PPlain and PMMX) The TEST instruction with an immediate operand is only pairable if the destination is AL, AX, or EAX. TEST register,register and TEST register,memory is al... Posted on 2000-04-02
Tags: MMX Optimization | TEST

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 23. Reducing code size (all processors)
23. Reducing code size (all processors) As explained in chapter 7, the code cache is 8 or 16 kb. If you have problems keeping the critical parts of your code within the code cache, then you may con... Posted on 2000-04-02
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 26.15 Bit scan (PPlain and PMMX)
26.15 Bit scan (PPlain and PMMX) BSF and BSR are the poorest optimized instructions on the PPlain and PMMX, taking approximately 11 + 2*n clock cycles, where n is the number of zeros skipped. The fol... Posted on 2000-04-02
Tags: MMX Optimization | Bit

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 24. Scheduling floating point code (PPlain and PMMX)
24. Scheduling floating point code (PPlain and PMMX) Floating point instructions cannot pair the way integer instructions can, except for one special case, defined by the following rules: the first ... Posted on 2000-04-02
Tags: MMX Optimization | Floating Point

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 27.1 LEA instruction (all processors)
27. Special topics 27.1 LEA instruction (all processors) The LEA instruction is useful for many purposes because it can do a shift, two additions, and a move in just one instruction taking one clock... Posted on 2000-04-02
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 26.16 FLDCW (PPro, PII and PIII)
26.16 FLDCW (PPro, PII and PIII) The PPro, PII and PIII have a serious stall after the FLDCW instruction if followed by any floating point instruction which reads the control word (which almost all f... Posted on 2000-04-02
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 9. Address generation interlock (PPlain and PMMX)
9. Address generation interlock (PPlain and PMMX) It takes one clock cycle to calculate the address needed by an instruction which accesses memory. Normally, this calculation is done at a separate ... Posted on 2000-04-01
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 8. First time versus repeated execution
8. First time versus repeated execution A piece of code usually takes much more time the first time it is executed than when it is repeated. The reasons are the following: Loading the code from RAM ... Posted on 2000-04-01
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 2. Literature
2. Literature A lot of useful literature and tutorials can be downloaded for free from Intel's www site or acquired in print or on CD-ROM. It is recommended that you study this literature in order t... Posted on 2000-04-01
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 20. Dependency chains (PPro, PII and PIII)
20. Dependency chains (PPro, PII and PIII) A series of instructions where each instruction depends on the result of the preceding one is called a dependency chain. Long dependency chains should be a... Posted on 2000-04-01
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 15. Instruction fetch (PPro, PII and PIII)
15. Instruction fetch (PPro, PII and PIII) The code is fetched in aligned 16-byte chunks from the code cache and placed in the double buffer, which is called so because it can contain two such chun... Posted on 2000-04-01
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 14. Instruction decoding (PPro, PII and PIII)
14. Instruction decoding (PPro, PII and PIII) I am describing instruction decoding before instruction fetching here because you need to know how the decoders work in order to understand the possible... Posted on 2000-04-01
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 6. Alignment
6. Alignment All data in RAM should be aligned to addresses divisible by 2, 4, 8, or 16 according to this scheme: alignment operand size PPlain and PMMX PPro, PII and PIII 1 (byte) 1 1 2 (word) 2 2 ... Posted on 2000-04-01
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 21. Searching for bottlenecks (PPro, PII and PIII)
21. Searching for bottlenecks (PPro, PII and PIII) When optimizing code for these processors, it is important to analyze where the bottlenecks are. Spending time on optimizing away one bottleneck do... Posted on 2000-04-01
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 16.1 Eliminating dependencies
16. Register renaming (PPro, PII and PIII) 16.1 Eliminating dependencies Register renaming is an advanced technique used by these microprocessors to remove dependencies between different parts of th... Posted on 2000-04-01
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 4. Debugging and verifying
4. Debugging and verifying Debugging assembly code can be quite hard and frustrating, as you probably already have discovered. I would recommend that you start with writing the piece of code you wan... Posted on 2000-04-01
Tags: MMX Optimization | Debug

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 22.1 Branch prediction in PPlain
22. Jumps and branches (all processors) The Pentium family of processors attempt to predict where a jump will go to, and whether a conditional jump will be taken or fall through. If the prediction i... Posted on 2000-04-01
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 22.2 Branch prediction in PMMX, PPro, PII and PIII
22.2 Branch prediction in PMMX, PPro, PII and PIII 22.2.1 BTB organization (PMMX, PPro, PII and PIII) The branch target buffer (BTB) of the PMMX has 256 entries organized as 16 ways * 16 sets. Each e... Posted on 2000-04-01
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - How to optimize for the Pentium family of microprocessors
How to optimize for the Pentium family of microprocessors Copyright © 1996, 2000 by Agner Fog. Last modified 2000-03-31. Contents Introduction Literature Calling assembly functions from high lev... Posted on 2000-04-01
Tags: Pentium | CPU | MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 16.2 Register read stalls
16.2 Register read stalls But there is another limitation which may be quite serious, and that is that you can only read two different permanent register names per clock cycle. This limitation applie... Posted on 2000-04-01
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 5. Memory model
5. Memory model The Pentiums are designed primarily for 32 bit code, and the performance is inferior on 16 bit code. Segmenting your code and data also degrades performance significantly, so you sho... Posted on 2000-04-01
Tags: MMX Optimization | Memory

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 22.3. Avoiding jumps (all processors)
22.3. Avoiding jumps (all processors) There can be many reasons why you may want to reduce the number of jumps, calls and returns: jump mispredictions are very expensive, there are various penalties for... Posted on 2000-04-01
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 17. Out of order execution (PPro, PII and PIII)
17. Out of order execution (PPro, PII and PIII) The reorder buffer (ROB) can hold 40 uops. Each uop waits in the ROB until all its operands are ready and there is a vacant execution unit for it. Thi... Posted on 2000-04-01
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 22.4. Avoiding conditional jumps by using flags (all processors)
22.4. Avoiding conditional jumps by using flags (all processors) The most important jumps to eliminate are conditional jumps, especially if they are poorly predictable. Sometimes it is possible to ob... Posted on 2000-04-01
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 1. Introduction
1. Introduction This manual describes in detail how to write optimized assembly language code, with particular focus on the Pentium® family of microprocessors. Most of the information herein is ... Posted on 2000-04-01
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 18. Retirement (PPro, PII and PIII)
18. Retirement (PPro, PII and PIII) Retirement is a process where the temporary registers used by the uops are copied into the permanent registers EAX, EBX, etc. When a uop has been executed it is ... Posted on 2000-04-01
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 3. Calling assembly functions from high level language
3. Calling assembly functions from high level language You can either use inline assembly or code a subroutine entirely in assembly language and link it into your project. If you choose the latter o... Posted on 2000-04-01
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 10.1 Pairing integer instructions (PPlain and PMMX): Perfect pairing
10. Pairing integer instructions (PPlain and PMMX) 10.1 Perfect pairing The PPlain and PMMX have two pipelines for executing instructions, called the U-pipe and the V-pipe. Under certain conditions ... Posted on 2000-04-01
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 19.1 Partial register stalls
19. Partial stalls (PPro, PII and PIII) 19.1 Partial register stalls Partial register stall is a problem that occurs when you write to part of a 32 bit register and later read from the whole registe... Posted on 2000-04-01
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 10.2 Imperfect pairing
10.2 Imperfect pairing There are situations where the two instructions in a pair will not execute simultaneously, or only partially overlap in time. They should still be considered a pair, though, be... Posted on 2000-04-01
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 19.2 Partial flags stalls
19.2 Partial flags stalls The flags register can also cause partial register stalls: CMP EAX, EBX INC ECX JBE XX ; partial flags stall The JBE instruction reads both the carry flag and the zero flag.... Posted on 2000-04-01
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 11. Splitting complex instructions into simpler ones (PPlain and PMMX)
11. Splitting complex instructions into simpler ones (PPlain and PMMX) You may split up read/modify and read/modify/write instructions to improve pairing. Example: ADD [mem1],EAX / ADD [mem2],EBX ; ... Posted on 2000-04-01
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 7. Cache
7. Cache The PPlain and PPro have 8 kb of on-chip cache (level one cache) for code, and 8 kb for data. The PMMX, PII and PIII have 16 kb for code and 16 kb for data. Data in the level 1 cache can be... Posted on 2000-04-01
Tags: MMX Optimization | Cache

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 19.3 Flags stalls after shifts and rotates
19.3 Flags stalls after shifts and rotates You can get a stall resembling the partial flags stall when reading any flag bit after a shift or rotate, except for shifts and rotates by one (short form):... Posted on 2000-04-01
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 12. Prefixes (PPlain and PMMX)
12. Prefixes (PPlain and PMMX) An instruction with one or more prefixes may not be able to execute in the V-pipe (see chapter 10, sect. 7), and it may take more than one clock cycle to decode. On th... Posted on 2000-04-01
Tags: MMX Optimization

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 19.4 Partial memory stalls
19.4 Partial memory stalls A partial memory stall is somewhat analogous to a partial register stall. It occurs when you mix data sizes for the same memory address: MOV BYTE PTR [ESI], AL MOV EBX, DWO... Posted on 2000-04-01
Tags: MMX Optimization | Memory

» [MMX Optimization: How to optimize for the Pentium family of microprocessors] - 13. Overview of PPro, PII and PIII pipeline
13. Overview of PPro, PII and PIII pipeline The architecture of the PPro, PII and PIII microprocessors is well explained and illustrated in various manuals and tutorials from Intel. It is recommende... Posted on 2000-04-01
Tags: MMX Optimization | PII | PIII