19.1 Partial register stalls


Date: 2000-04-01 14:00

19. Partial stalls (PPro, PII and PIII)

19.1 Partial register stalls

A partial register stall occurs when you write to part of a 32 bit register and later read from the whole register or a bigger part of it. Example:

    MOV AL, BYTE PTR [MEM8]
    MOV EBX, EAX                ; partial register stall

This gives a delay of 5-6 clocks. The reason is that a temporary register has been assigned to AL (to make it independent of AH). The execution unit has to wait until the write to AL has retired before it can combine the value from AL with the value of the rest of EAX. The stall can be avoided by changing the code to:

    MOVZX EBX, BYTE PTR [MEM8]
    AND EAX, 0FFFFFF00h
    OR EBX, EAX

You can also avoid the partial stall by placing other instructions after the write to the partial register, so that the write has time to retire before you read from the full register.
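As a sanity check on the MOVZX / AND / OR replacement, the register values can be modeled in a few lines of Python. This is only a sketch of the arithmetic, not of the timing, and the function names `merge_stalling` and `merge_replacement` are made up for illustration. It shows that EBX ends up the same in both versions, while EAX does not: the replacement leaves AL cleared instead of loaded.

```python
def merge_stalling(eax, mem8):
    """MOV AL,[MEM8] / MOV EBX,EAX: return (eax, ebx) afterwards."""
    eax = (eax & 0xFFFFFF00) | (mem8 & 0xFF)   # write AL only
    ebx = eax                                  # read all of EAX (the stalling read)
    return eax, ebx


def merge_replacement(eax, mem8):
    """MOVZX EBX,[MEM8] / AND EAX,0FFFFFF00h / OR EBX,EAX: return (eax, ebx)."""
    ebx = mem8 & 0xFF          # MOVZX: zero-extended byte, full-register write
    eax = eax & 0xFFFFFF00     # AND: clear the low byte of EAX
    ebx = ebx | eax            # OR: combine, using full registers only
    return eax, ebx


eax_a, ebx_a = merge_stalling(0x12345678, 0xAB)
eax_b, ebx_b = merge_replacement(0x12345678, 0xAB)
print(hex(ebx_a), hex(ebx_b))   # 0x123456ab 0x123456ab
print(hex(eax_a), hex(eax_b))   # 0x123456ab 0x12345600
```

If later code relies on AL holding the loaded byte, the replacement is not a drop-in substitute and needs adjusting.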

You should be aware of partial stalls whenever you mix different data sizes (8, 16, and 32 bits):

    MOV BH, 0
    ADD BX, AX                  ; stall
    INC EBX                     ; stall

You don't get a stall when reading a partial register after writing to the full register, or a bigger part of it:

    MOV EAX, [MEM32]
    ADD BL, AL                  ; no stall
    ADD BH, AH                  ; no stall
    MOV CX, AX                  ; no stall
    MOV DX, BX                  ; stall

The easiest way to avoid partial register stalls is to always use full registers and use MOVZX or MOVSX when reading from smaller memory operands. These instructions are fast on the PPro, PII and PIII, but slow on earlier processors. A compromise is therefore useful when you want your code to perform reasonably well on all processors. The replacement for MOVZX EAX,BYTE PTR [M8] looks like this:

    XOR EAX, EAX
    MOV AL, BYTE PTR [M8]

The PPro, PII and PIII processors make a special case out of this combination to avoid a partial register stall when later reading from EAX. The trick is that a register is tagged as empty when it is XOR'ed with itself. The processor remembers that the upper 24 bits of EAX are zero, so that a partial stall can be avoided. This mechanism works only on certain combinations:

    XOR EAX, EAX
    MOV AL, 3
    MOV EBX, EAX                ; no stall

    XOR AH, AH
    MOV AL, 3
    MOV BX, AX                  ; no stall

    XOR EAX, EAX
    MOV AH, 3
    MOV EBX, EAX                ; stall

    SUB EBX, EBX
    MOV BL, DL
    MOV ECX, EBX                ; no stall

    MOV EBX, 0
    MOV BL, DL
    MOV ECX, EBX                ; stall

    MOV BL, DL
    XOR EBX, EBX                ; no stall

Setting a register to zero by subtracting it from itself works the same as the XOR, but setting it to zero with the MOV instruction doesn't prevent the stall.
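The rules given so far can be collected into a small toy checker. The sketch below is only an illustration of the cases listed in this section, not a description of the real hardware: the class `StallModel` and the helper `stalls` are hypothetical names, timing is ignored (a partial write never retires), the zero tag is only honored for a low-byte write after XOR or SUB of a full register, and the XOR AH,AH case is not modeled. Anything the text does not demonstrate is treated conservatively as a stall.

```python
# Bit ranges (family, low bit, high bit) for EAX..EDX and their parts.
REGS = {}
for fam in "ABCD":
    REGS["E" + fam + "X"] = (fam, 0, 32)
    REGS[fam + "X"] = (fam, 0, 16)
    REGS[fam + "L"] = (fam, 0, 8)
    REGS[fam + "H"] = (fam, 8, 16)


class StallModel:
    """Toy model of the stall rules; ignores timing and retirement."""

    def __init__(self):
        self.pending = {}    # family -> list of (lo, hi) partial writes
        self.zero_tag = {}   # family -> True if last full write was XOR/SUB with itself

    def write(self, reg, zero_idiom=False):
        fam, lo, hi = REGS[reg]
        if (lo, hi) == (0, 32):
            # A full-register write leaves no partial write outstanding.
            self.pending[fam] = []
            self.zero_tag[fam] = zero_idiom
        else:
            # A partial write supersedes earlier partial writes it fully covers.
            kept = [p for p in self.pending.get(fam, [])
                    if not (lo <= p[0] and p[1] <= hi)]
            self.pending[fam] = kept + [(lo, hi)]

    def read(self, reg):
        """Return True if this read is predicted to stall."""
        fam, lo, hi = REGS[reg]
        for plo, phi in self.pending.get(fam, []):
            overlaps = plo < hi and lo < phi
            contained = plo <= lo and hi <= phi
            if overlaps and not contained:
                # Exception shown in the text: low byte written after
                # XOR/SUB of the full register with itself.
                if (plo, phi) == (0, 8) and self.zero_tag.get(fam, False):
                    continue
                return True
        return False


def stalls(ops):
    """Run ('r' read, 'w' write, 'z' zero idiom) steps; return each read's verdict."""
    model = StallModel()
    verdicts = []
    for kind, reg in ops:
        if kind == "r":
            verdicts.append(model.read(reg))
        else:
            model.write(reg, zero_idiom=(kind == "z"))
    return verdicts


# XOR EAX,EAX / MOV AL,3 / MOV EBX,EAX
print(stalls([("z", "EAX"), ("w", "AL"), ("r", "EAX")]))   # [False]
# MOV EBX,0 / MOV BL,DL / MOV ECX,EBX
print(stalls([("w", "EBX"), ("w", "BL"), ("r", "EBX")]))   # [True]
# XOR EAX,EAX / MOV AH,3 / MOV EBX,EAX
print(stalls([("z", "EAX"), ("w", "AH"), ("r", "EAX")]))   # [True]
```

Each instruction is reduced to the registers it reads and writes, so ADD BX,AX becomes a read of BX, a read of AX and a write of BX. The zero idiom is entered as a write only, which is why MOV BL,DL followed by XOR EBX,EBX produces no stalling read.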

You can place the XOR outside a loop:

        XOR EAX, EAX
        MOV ECX, 100
    LL: MOV AL, [ESI]
        MOV [EDI], EAX          ; no stall
        INC ESI
        ADD EDI, 4
        DEC ECX
        JNZ LL

The processor remembers that the upper 24 bits of EAX are zero as long as you don't get an interrupt, misprediction, or other serializing event.
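To make clear what the loop above computes, here is a minimal Python sketch of its data flow. It assumes the loop is meant to expand each source byte into a zero-extended 32 bit little-endian value; the function name `widen_bytes` is hypothetical. The point is that EAX is cleared once and only its low byte changes afterwards, so every stored dword has zero upper bits.

```python
import struct


def widen_bytes(src):
    """Zero-extend each byte of src into a 32-bit little-endian dword."""
    eax = 0                                  # XOR EAX,EAX, done once before the loop
    out = bytearray()
    for byte in src:                         # one iteration per source byte
        eax = (eax & 0xFFFFFF00) | byte      # MOV AL,[ESI]: only the low byte changes
        out += struct.pack("<I", eax)        # MOV [EDI],EAX: store all four bytes
    return bytes(out)


print(widen_bytes(b"\x01\xff").hex())   # 01000000ff000000
```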

You should remember to neutralize any partial register you have used before calling a subroutine that might push the full register:

    ADD BL, AL
    MOV [MEM8], BL
    XOR EBX, EBX                ; neutralize BL
    CALL _HighLevelFunction

Most high level language procedures push EBX at the start of the procedure, which would generate a partial register stall in the example above if you hadn't neutralized BL.

Setting a register to zero with the XOR method doesn't break its dependency on earlier instructions:

    DIV EBX
    MOV [MEM], EAX
    MOV EAX, 0                  ; break dependency
    XOR EAX, EAX                ; prevent partial register stall
    MOV AL, CL
    ADD EBX, EAX

Setting EAX to zero twice here seems redundant, but without the MOV EAX,0 the last instructions would have to wait for the slow DIV to finish, and without XOR EAX,EAX you would have a partial register stall.

The FNSTSW AX instruction is special: in 32 bit mode it behaves as if writing to the entire EAX. In fact, it does something like this in 32 bit mode:

AND EAX,0FFFF0000h / FNSTSW TEMP / OR EAX,TEMP

Hence, you don't get a partial register stall when reading EAX after this instruction in 32 bit mode:

    FNSTSW AX / MOV EBX,EAX     ; stall only if 16 bit mode
    MOV AX,0 / FNSTSW AX        ; stall only if 32 bit mode
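The AND / FNSTSW / OR description of FNSTSW AX in 32 bit mode can be written out as a short Python sketch of the resulting register value. The function name `fnstsw_ax_32bit` is made up for illustration, and the sketch models only the value of EAX, not the timing.

```python
def fnstsw_ax_32bit(eax, status_word):
    """Model of AND EAX,0FFFF0000h / FNSTSW TEMP / OR EAX,TEMP."""
    temp = status_word & 0xFFFF          # FNSTSW TEMP: the 16-bit FPU status word
    return (eax & 0xFFFF0000) | temp     # upper half of EAX kept, AX replaced


print(hex(fnstsw_ax_32bit(0xDEADBEEF, 0x4100)))   # 0xdead4100
```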

Tags: MMX, optimization
