12. Prefixes (PPlain and PMMX)

Date: 2000-04-01 14:00

An instruction with one or more prefixes may not be able to execute in the V-pipe (see chapter 10, sect. 7), and it may take more than one clock cycle to decode.

On the PPlain, the decoding delay is one clock cycle for each prefix except for the 0FH prefix of conditional near jumps.

The PMMX has no decoding delay for the 0FH prefix. Segment and repeat prefixes take one extra clock cycle to decode. Address and operand size prefixes take two extra clock cycles to decode. The PMMX can decode two instructions per clock cycle if the first instruction has a segment or repeat prefix or no prefix, and the second instruction has no prefix. Instructions with address or operand size prefixes can only be decoded alone on the PMMX. Instructions with more than one prefix take one extra clock cycle for each prefix.
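The decoding rules above can be tabulated with a small Python model. This is only a sketch under stated assumptions, not a cycle-accurate simulator: the function names are hypothetical, the LOCK prefix is not modelled because the text gives no PMMX figure for it, and the multi-prefix rule for the PMMX is read literally as "one clock per prefix", which is an interpretation rather than something the text spells out.

```python
# Prefix byte values from the IA-32 instruction encoding.
SEGMENT = {0x26, 0x2E, 0x36, 0x3E, 0x64, 0x65}  # ES CS SS DS FS GS overrides
REPEAT = {0xF2, 0xF3}                           # REPNE, REP/REPE
OPERAND_SIZE = 0x66
ADDRESS_SIZE = 0x67


def scan_prefixes(code):
    """Return the kinds of prefixes at the start of one encoded instruction."""
    kinds = []
    for i, b in enumerate(code):
        if b in SEGMENT:
            kinds.append("segment")
        elif b in REPEAT:
            kinds.append("repeat")
        elif b == OPERAND_SIZE:
            kinds.append("operand")
        elif b == ADDRESS_SIZE:
            kinds.append("address")
        elif b == 0x0F:
            # 0FH escape byte; conditional near jumps are 0F 80 .. 0F 8F.
            nxt = code[i + 1] if i + 1 < len(code) else None
            is_jcc = nxt is not None and 0x80 <= nxt <= 0x8F
            kinds.append("0F-jcc" if is_jcc else "0F")
            break
        else:
            break  # first opcode byte reached
    return kinds


def pplain_delay(kinds):
    """PPlain: one clock per prefix, except 0FH of conditional near jumps."""
    return sum(1 for k in kinds if k != "0F-jcc")


def pmmx_delay(kinds):
    """PMMX: 0FH is free; segment/repeat cost 1, address/operand size cost 2."""
    real = [k for k in kinds if not k.startswith("0F")]
    if len(real) > 1:
        # Assumption: literal reading of "one clock extra for each prefix".
        return len(real)
    return sum(2 if k in ("operand", "address") else 1 for k in real)


if __name__ == "__main__":
    samples = {
        "MOV EAX,[EBX]": "8B03",
        "MOV AX,[EBX]": "668B03",   # operand size prefix in 32-bit mode
        "REP MOVSD": "F3A5",
        "SETNZ AL": "0F95C0",
    }
    for name, hexcode in samples.items():
        kinds = scan_prefixes(bytes.fromhex(hexcode))
        print(name, kinds, pplain_delay(kinds), pmmx_delay(kinds))
```

The sample encodings assume a 32-bit code segment, where the 16-bit form of MOV needs the 66H prefix and the 8-bit and 32-bit forms do not.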

Address size prefixes can be avoided by using 32-bit mode. Segment prefixes can be avoided in 32-bit mode by using a flat memory model. Operand size prefixes can be avoided in 32-bit mode by using only 8-bit and 32-bit integers.

Where prefixes are unavoidable, the decoding delay may be masked if a preceding instruction takes more than one clock cycle to execute. The rule for the PPlain is that any instruction which takes N clock cycles to execute (not to decode) can 'overshadow' the decoding delay of N-1 prefixes in the next two (sometimes three) instructions or instruction pairs. In other words, each extra clock cycle that an instruction takes to execute can be used to decode one prefix in a later instruction. This shadowing effect even extends across a predicted branch. Any instruction which takes more than one clock cycle to execute, and any instruction which is delayed because of an AGI stall, cache miss, misalignment, or any other reason except decoding delay and branch misprediction, has a shadowing effect.
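The PPlain shadowing rule reduces to simple arithmetic, which the following Python sketch spells out. The function name and parameters are hypothetical, and the model assumes that all the prefixes counted fall within the two (sometimes three) following instructions or pairs that the shadow can reach.

```python
def visible_prefix_delay(exec_clocks, prefix_count):
    """Clocks of prefix decoding delay that remain visible on the PPlain.

    exec_clocks  -- N, the execution time of the preceding instruction
    prefix_count -- prefixes with a decoding delay in the instructions
                    that follow within the shadow's reach
    """
    shadow = max(exec_clocks - 1, 0)       # N clocks hide N-1 prefixes
    return max(prefix_count - shadow, 0)


if __name__ == "__main__":
    print(visible_prefix_delay(2, 1))  # 2-clock instruction, one prefix after it
    print(visible_prefix_delay(1, 1))  # 1-clock instruction hides nothing
    print(visible_prefix_delay(2, 3))  # only one of three prefixes is hidden
```

A two-clock instruction followed by a single prefixed instruction gives a visible delay of zero, while a one-clock instruction leaves the full one-clock delay exposed.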

The PMMX has a similar shadowing effect, but the mechanism is different. Decoded instructions are stored in a transparent first-in-first-out (FIFO) buffer, which can hold up to four instructions. As long as there are instructions in the FIFO buffer you get no delay. When the buffer is empty, instructions are executed as soon as they are decoded. The buffer is filled when instructions are decoded faster than they are executed, i.e. when you have unpaired or multi-cycle instructions. The FIFO buffer is emptied when instructions execute faster than they are decoded, i.e. when you have decoding delays due to prefixes. The FIFO buffer is empty after a mispredicted branch. The FIFO buffer can receive two instructions per clock cycle provided that the second instruction is without prefixes and neither instruction is longer than 7 bytes. The two execution pipelines (U and V) can each receive one instruction per clock cycle from the FIFO buffer.
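The way a buffer between decoder and execution units hides decoding delays can be illustrated with a deliberately simplified queue model in Python. This is a sketch only: it is single-issue (no U/V pairing, no two-per-clock decoding), the function name is hypothetical, and the exact clock accounting is an assumption rather than a description of the real PMMX hardware. Only the buffer capacity of four comes from the text.

```python
def total_clocks(decode, execute, capacity=4):
    """Clocks until the last instruction finishes in a single-issue model.

    decode[i]  -- clocks the decoder spends on instruction i (1 + prefix delay)
    execute[i] -- clocks instruction i spends executing
    capacity   -- decoded instructions the buffer can hold
    """
    decoded = []   # clock at which each instruction finishes decoding
    started = []   # clock at which each instruction starts executing
    finished = []  # clock at which each instruction finishes executing
    for i, (d, e) in enumerate(zip(decode, execute)):
        begin = decoded[i - 1] if i else 0
        if i >= capacity:
            # Buffer full: wait until an older instruction has left it.
            begin = max(begin, started[i - capacity])
        decoded.append(begin + d)
        start = max(decoded[i], finished[i - 1] if i else 0)
        started.append(start)
        finished.append(start + e)
    return finished[-1] if finished else 0


if __name__ == "__main__":
    prefixed = [1, 1, 2, 2]                        # last two carry a 1-clock prefix delay
    print(total_clocks(prefixed, [4, 1, 1, 1]))    # 8: slow first instruction hides the delays
    print(total_clocks(prefixed, [1, 1, 1, 1]))    # 7: two clocks lost to decoding
    print(total_clocks([1, 1, 1, 1], [1, 1, 1, 1]))  # 5: no prefixes, no loss
```

In the first run the four-clock instruction lets the decoder run ahead, so the total is just one decode clock plus the sum of the execution times. In the second run nothing gives the decoder a head start, and each prefix delay shows up as an idle execution clock.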

Examples:

        CLD
        REP MOVSD

The CLD instruction takes two clock cycles and can therefore overshadow the decoding delay of the REP prefix. The code would take one clock cycle more if the CLD instruction were placed far from the REP MOVSD.

        CMP DWORD PTR [EBX],0
        MOV EAX,0
        SETNZ AL

The CMP instruction takes two clock cycles here because it is a read/modify instruction. The 0FH prefix of the SETNZ instruction is decoded during the second clock cycle of the CMP instruction, so the decoding delay is hidden on the PPlain. (The PMMX has no decoding delay for the 0FH prefix.)

Prefix penalties in PPro, PII and PIII are described in chapter 14.
