14. Instruction decoding (PPro, PII and PIII)

Date: 2000-04-01 14:00

I am describing instruction decoding before instruction fetching here because you need to know how the decoders work in order to understand the possible delays in instruction fetching.

The decoders can handle three instructions per clock cycle, but only when certain conditions are met. Decoder D0 can handle any instruction that generates up to 4 uops in a single clock cycle. Decoders D1 and D2 can handle only instructions that generate 1 uop and these instructions can be no more than 8 bytes long.

To summarize the rules for decoding two or three instructions in the same clock cycle:

The first instruction (D0) generates no more than 4 uops,
The second and third instructions generate no more than 1 uop each,
The second and third instructions are no more than 8 bytes long each,
The instructions must be contained within the same 16-byte ifetch block (see next chapter).

There is no limit to the length of the instruction in D0 (despite Intel manuals saying something else), as long as the three instructions fit into one 16-byte ifetch block.

An instruction that generates more than 4 uops takes two or more clock cycles to decode, and no other instructions can decode in parallel.

It follows from the rules above that the decoders can produce a maximum of 6 uops per clock cycle if the first instruction in each decode group generates 4 uops and the next two generate 1 uop each. The minimum production is 2 uops per clock cycle, which you get when all instructions generate 2 uops each, so that D1 and D2 are never used.
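The rules above can be checked with a small model. The following Python sketch is only an illustration, not a cycle-accurate simulator: the names `decode_cycles` and `uops_per_clock` are hypothetical, the 16-byte ifetch block limit and prefix penalties are ignored, and an instruction with more than 4 uops is assumed to take one clock per 4 uops (the text only says "two or more").

```python
from math import ceil

def decode_cycles(instrs):
    """Count decode clocks for a list of (uops, length_in_bytes) tuples.

    Simplified model: ignores ifetch block boundaries and prefix penalties.
    """
    cycles = 0
    i = 0
    while i < len(instrs):
        uops, _length = instrs[i]
        if uops > 4:
            # Assumption: 4 uops per clock, nothing else decodes in parallel.
            cycles += ceil(uops / 4)
            i += 1
            continue
        # D0 takes the first instruction of the group (up to 4 uops, any length).
        i += 1
        # D1 and D2 each take one more instruction if it is 1 uop and <= 8 bytes.
        for _ in range(2):
            if i < len(instrs) and instrs[i][0] == 1 and instrs[i][1] <= 8:
                i += 1
            else:
                break
        cycles += 1
    return cycles

def uops_per_clock(instrs):
    """Average decoder output for the whole sequence."""
    return sum(uops for uops, _ in instrs) / decode_cycles(instrs)

# Ten repetitions of a 4-1-1 group versus ten 2-uop instructions.
# The byte lengths (3 and 2) are made-up values under the 8-byte limit.
best = uops_per_clock([(4, 3), (1, 2), (1, 2)] * 10)
worst = uops_per_clock([(2, 3)] * 10)
print(best, worst)
```

Under these assumptions the two sequences reproduce the bounds of 6 and 2 uops per clock stated above.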

For maximum throughput, it is recommended that you order your instructions according to the 4-1-1 pattern: instructions that generate 2 to 4 uops can be interspersed with two simple 1-uop instructions for free, in the sense that they do not add to the decoding time. Example:

    MOV EBX, [MEM1]   ; 1 uop (D0)
    INC EBX           ; 1 uop (D1)
    ADD EAX, [MEM2]   ; 2 uops (D0)
    ADD [MEM3], EAX   ; 4 uops (D0)

This takes 3 clock cycles to decode. You can save one clock cycle by reordering the instructions into two decode groups:

    ADD EAX, [MEM2]   ; 2 uops (D0)
    MOV EBX, [MEM1]   ; 1 uop (D1)
    INC EBX           ; 1 uop (D2)
    ADD [MEM3], EAX   ; 4 uops (D0)

The decoders now generate 8 uops in two clock cycles, which is probably satisfactory. Later stages in the pipeline can handle only 3 uops per clock cycle so with a decoding rate higher than this you can assume that decoding is not a bottleneck. However, complications in the fetch mechanism can delay decoding as described in the next chapter, so to be safe you may want to aim at a decoding rate higher than 3 uops per clock cycle.
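To trace how the two orderings fall into decode groups, here is a minimal Python sketch. The function `decode_groups` is a hypothetical helper, not part of any tool; it assumes every instruction is at most 8 bytes long and that the whole sequence sits in one ifetch block, which may not hold in real code.

```python
def decode_groups(instrs):
    """Split (text, uops) pairs into decode groups, one group per clock.

    Assumes all instructions are <= 8 bytes and share one ifetch block.
    """
    groups = []
    for text, uops in instrs:
        if uops > 4:
            raise ValueError("instructions over 4 uops are outside this sketch")
        # A 1-uop instruction can use D1 or D2 if the current group has room;
        # anything else must start a new group in D0.
        if groups and uops == 1 and len(groups[-1]) < 3:
            groups[-1].append(text)
        else:
            groups.append([text])
    return groups

original = [
    ("MOV EBX, [MEM1]", 1),
    ("INC EBX", 1),
    ("ADD EAX, [MEM2]", 2),
    ("ADD [MEM3], EAX", 4),
]
reordered = [
    ("ADD EAX, [MEM2]", 2),
    ("MOV EBX, [MEM1]", 1),
    ("INC EBX", 1),
    ("ADD [MEM3], EAX", 4),
]

for name, seq in (("original", original), ("reordered", reordered)):
    groups = decode_groups(seq)
    total_uops = sum(uops for _, uops in seq)
    print(name, len(groups), "clocks,", total_uops / len(groups), "uops per clock")
```

With these inputs the sketch gives three groups for the first ordering and two for the second, so the reordered version decodes 4 uops per clock, above the 3 uops per clock that the later pipeline stages can handle.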

You can see how many uops each instruction generates in the tables in chapter 29.

Instruction prefixes can also incur penalties in the decoders. Instructions can have several kinds of prefixes:

An operand size prefix is needed when you have a 16-bit operand in a 32-bit environment or vice versa. (Except for instructions that can only have one operand size, such as FNSTSW AX). An operand size prefix gives a penalty of a few clocks if the instruction has an immediate operand of 16 or 32 bits because the length of the operand is changed by the prefix. Examples:
    ADD BX, 9                  ; no penalty because immediate operand is 8 bits
    MOV WORD PTR [MEM16], 9    ; penalty because operand is 16 bits
The last instruction should be changed to:
    MOV EAX, 9
    MOV WORD PTR [MEM16], AX   ; no penalty because no immediate
An address size prefix is used when you use 32-bit addressing in 16-bit mode or vice versa. This is seldom needed and should generally be avoided. The address size prefix gives a penalty whenever you have an explicit memory operand (even when there is no displacement) because the interpretation of the r/m bits in the instruction code is changed by the prefix. Instructions with only implicit memory operands, such as string instructions, have no penalty with address size prefix.
Segment prefixes are used when you address data in a non-default data segment. Segment prefixes give no penalty on the PPro, PII and PIII.
Repeat prefixes and lock prefixes give no penalty in the decoders.
There is always a penalty if you have more than one prefix. This penalty is usually one clock per prefix.
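The prefix rules above can be collected into a small checker. This Python sketch is only a restatement of the list as code: `decoder_prefix_penalty` and its prefix labels (`"opsize"`, `"addrsize"`, `"segment"`, `"rep"`, `"lock"`) are hypothetical names chosen for illustration, and the function reports why a penalty applies rather than how many clocks it costs, since the text gives only approximate figures.

```python
def decoder_prefix_penalty(prefixes, imm_bits=0, explicit_mem=False):
    """Return the reasons an instruction is penalized in the decoders.

    prefixes:     list of labels among "opsize", "addrsize", "segment",
                  "rep", "lock" (hypothetical names for this sketch)
    imm_bits:     size of the encoded immediate operand, 0 if none
    explicit_mem: True if the instruction has an explicit memory operand
    """
    reasons = []
    if "opsize" in prefixes and imm_bits in (16, 32):
        reasons.append("operand size prefix changes immediate length")
    if "addrsize" in prefixes and explicit_mem:
        reasons.append("address size prefix with explicit memory operand")
    if len(prefixes) > 1:
        reasons.append("more than one prefix")
    # Segment, repeat and lock prefixes alone add nothing on PPro, PII, PIII.
    return reasons

# ADD BX, 9 in 32-bit code: operand size prefix, 8-bit immediate.
print(decoder_prefix_penalty(["opsize"], imm_bits=8))
# MOV WORD PTR [MEM16], 9 in 32-bit code: operand size prefix, 16-bit immediate.
print(decoder_prefix_penalty(["opsize"], imm_bits=16, explicit_mem=True))
```

The first call returns an empty list and the second returns one reason, matching the two examples given for the operand size prefix.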
