
27.2 Division (all processors)

Date: 2000-04-02 15:00

Division is quite time consuming. On PPro, PII and PIII an integer division takes 19, 23, or 39 clocks for byte, word, and dword divisors respectively. On PPlain and PMMX an unsigned integer division takes approximately the same, while a signed integer division takes somewhat more. It is therefore preferable to use the smallest operand size possible that won't generate an overflow, even if it costs an operand size prefix, and use unsigned division if possible.

Integer division by a constant (all processors)

Integer division by a power of two can be done by shifting right. Dividing an unsigned integer by 2^N:

    SHR EAX, N

Dividing a signed integer by 2^N:

    CDQ
    AND EDX, (1 SHL N) - 1   ; or SHR EDX, 32-N
    ADD EAX, EDX
    SAR EAX, N

The SHR alternative is shorter than the AND if N > 7, but can only go to execution port 0 (or u-pipe), whereas AND can go to either port 0 or 1 (u or v-pipe).
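The signed sequence adds 2^N - 1 to negative dividends before shifting, so that the result is rounded toward zero as IDIV would do. A minimal C sketch of the same arithmetic follows; the function name sdiv_pow2 is made up for this illustration, and the code assumes that >> on a negative int32_t is an arithmetic shift, which ISO C leaves implementation-defined.

```c
#include <stdint.h>

/* Signed division by 2^n (1 <= n <= 31), rounding toward zero like IDIV.
   Assumption: >> on a negative int32_t is an arithmetic shift (SAR). */
static int32_t sdiv_pow2(int32_t x, unsigned n)
{
    /* CDQ + AND: the bias is 2^n - 1 for negative x and 0 otherwise */
    int32_t bias = (x < 0) ? (int32_t)(((uint32_t)1 << n) - 1) : 0;
    return (x + bias) >> n;                 /* ADD + SAR */
}
```

Without the bias, a plain arithmetic shift would round -7 / 2 down to -4 rather than toward zero to -3.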

Dividing by a constant can be done by multiplying with the reciprocal. To calculate the unsigned integer division q = x / d, first calculate the reciprocal of the divisor, f = 2^r / d, where r defines the position of the binary point (radix point). Then multiply x by f and shift the product right by r positions. The maximum value of r is 32+b, where b is the number of binary digits in d minus 1 (that is, b is the highest integer for which 2^b <= d). Use r = 32+b to cover the maximum range for the value of the dividend x.

This method needs some refinement in order to compensate for rounding errors. The following algorithm will give you the correct result for unsigned integer division with truncation, i.e. the same result as the DIV instruction gives (Thanks to Terje Mathisen who invented this method):

    b = (the number of significant bits in d) - 1
    r = 32 + b
    f = 2^r / d
    If f is an integer then d is a power of 2: goto case A.
    If f is not an integer, then check if the fractional part of f is < 0.5
    If the fractional part of f < 0.5: goto case B.
    If the fractional part of f > 0.5: goto case C.

    case A: (d = 2^b)
        result = x SHR b
    case B: (fractional part of f < 0.5)
        round f down to nearest integer
        result = ((x+1) * f) SHR r
    case C: (fractional part of f > 0.5)
        round f up to nearest integer
        result = (x * f) SHR r
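The case selection can be sketched in C roughly as follows. The names recip_case, classify_divisor and divide_by_case are invented for this illustration, and a 64-bit intermediate stands in for the EDX:EAX register pair; this is a sketch of the rules above, not the author's code.

```c
#include <stdint.h>

typedef struct {
    char     kind;  /* 'A', 'B' or 'C' */
    unsigned b;     /* number of significant bits in d, minus 1 */
    unsigned r;     /* 32 + b */
    uint32_t f;     /* rounded reciprocal (unused in case A) */
} recip_case;

/* Classify a nonzero 32-bit divisor d according to the rules above. */
static recip_case classify_divisor(uint32_t d)
{
    recip_case c = { 'A', 0, 32, 0 };
    while ((d >> c.b) > 1) c.b++;             /* b = position of highest set bit */
    c.r = 32 + c.b;
    if ((d & (d - 1)) == 0) return c;         /* power of 2: case A */

    uint64_t two_r = (uint64_t)1 << c.r;      /* r <= 63, so 2^r fits in 64 bits */
    uint64_t q = two_r / d, rem = two_r % d;
    if (2 * rem < d) { c.kind = 'B'; c.f = (uint32_t)q; }        /* fraction < 0.5: round down */
    else             { c.kind = 'C'; c.f = (uint32_t)(q + 1); }  /* fraction > 0.5: round up */
    return c;
}

/* Evaluate x / d from the classification. */
static uint32_t divide_by_case(uint32_t x, recip_case c)
{
    if (c.kind == 'A') return x >> c.b;
    uint64_t xx = (uint64_t)x + (c.kind == 'B');   /* x+1 in case B, x in case C */
    return (uint32_t)((xx * c.f) >> c.r);
}
```

Comparing 2*rem with d avoids floating point when testing whether the fractional part of f is below one half.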

Example:

Assume that you want to divide by 5.

5 = 00000101b.

b = (number of significant binary digits) - 1 = 2

r = 32+2 = 34

f = 2³⁴ / 5 = 3435973836.8 = 0CCCCCCCC.CCC...(hexadecimal)

The fractional part is greater than a half: use case C.

Round f up to 0CCCCCCCDh.

The following code divides EAX by 5 and returns the result in EDX:

    MOV EDX,0CCCCCCCDh
    MUL EDX
    SHR EDX,2

After the multiplication, EDX contains the product shifted right 32 places. Since r = 34 you have to shift 2 more places to get the result. To divide by 10 you just change the last line to SHR EDX,3.
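For readers who want to check the constant outside an assembler, the same computation can be modelled in C. This is only a sketch: div5 and div10 are names chosen here, and the 64-bit product plays the role of EDX:EAX.

```c
#include <stdint.h>

/* MOV EDX,0CCCCCCCDh / MUL EDX / SHR EDX,2 */
static uint32_t div5(uint32_t x)
{
    uint64_t product = (uint64_t)x * 0xCCCCCCCDu;  /* MUL: 64-bit result in EDX:EAX */
    uint32_t high = (uint32_t)(product >> 32);     /* EDX = product shifted right 32 */
    return high >> 2;                              /* two more places, since r = 34 */
}

/* Same multiplier, one more shift: 2^35 / 10 equals 2^34 / 5 */
static uint32_t div10(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 32) >> 3;
}
```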

In case B you would have:

    INC EAX
    MOV EDX,f
    MUL EDX
    SHR EDX,b

This code works for all values of x except 0FFFFFFFFH which gives zero because of overflow in the INC instruction. If x = 0FFFFFFFFH is possible, then change the code to:

    MOV EDX,f
    ADD EAX,1
    JC  DOVERFL
    MUL EDX
DOVERFL:SHR EDX,b
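To make the overflow concrete, here is a C model of both case B sequences. The divisor 7 is picked purely as an illustration (2^34 / 7 = 2454267026.29, so f = 92492492h and b = 2); the names F7, B7, div7_inc and div7_safe are assumptions of this sketch.

```c
#include <stdint.h>

#define F7 0x92492492u   /* 2^34 / 7 rounded down; fractional part 0.29, so case B */
#define B7 2             /* 7 = 111b has 3 significant bits */

/* INC EAX / MOV EDX,f / MUL EDX / SHR EDX,b */
static uint32_t div7_inc(uint32_t x)
{
    uint32_t x1 = x + 1;                           /* wraps to 0 for x = 0FFFFFFFFH */
    return (uint32_t)(((uint64_t)x1 * F7) >> 32) >> B7;
}

/* MOV EDX,f / ADD EAX,1 / JC DOVERFL / MUL EDX / DOVERFL: SHR EDX,b */
static uint32_t div7_safe(uint32_t x)
{
    uint32_t x1 = x + 1;
    uint32_t edx = F7;                             /* MOV EDX,f */
    if (x1 != 0)                                   /* no carry: do the MUL */
        edx = (uint32_t)(((uint64_t)x1 * F7) >> 32);
    return edx >> B7;                              /* on carry, EDX still holds f */
}
```

Skipping the MUL on carry works because x+1 is then 2^32, and the high dword of 2^32 * f is f itself.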

If the value of x is limited, then you may use a lower value of r, i.e. fewer digits. There can be several reasons to use a lower value of r:

you may set r = 32 to avoid the SHR EDX,b in the end.
you may set r = 16+b and use a multiplication instruction that gives a 32 bit result rather than 64 bits. This will free the EDX register:
    IMUL EAX,0CCCDh
    SHR EAX,18
you may choose a value of r that gives case C rather than case B in order to avoid the INC EAX instruction

The maximum value for x in these cases is at least 2^(r-b), sometimes higher. You have to do a systematic test if you want to know the exact maximum value of x for which your code works correctly.
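A systematic test of this kind is easy to write in a high-level language. The sketch below models the 32-bit IMUL EAX,0CCCDh / SHR EAX,18 sequence for d = 5 and searches for the first dividend that gives a wrong quotient; div5_r18 and first_failure are hypothetical helper names, and unsigned int is assumed to be 32 bits wide.

```c
#include <stdint.h>

/* IMUL EAX,0CCCDh / SHR EAX,18: only the low 32 bits of the product survive. */
static uint32_t div5_r18(uint32_t x)
{
    return (uint32_t)(x * 0xCCCDu) >> 18;
}

/* Return the smallest x in [1, limit) for which div(x) != x / d,
   or 0 if no failure is found below limit. */
static uint32_t first_failure(uint32_t (*div)(uint32_t), uint32_t d, uint32_t limit)
{
    for (uint32_t x = 1; x < limit; x++)
        if (div(x) != x / d)
            return x;
    return 0;
}
```

Under these assumptions the search reports 81920 (14000H) as the first failing dividend, which is above 2^16. The failure at that point comes from the product no longer fitting in 32 bits.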

You may want to replace the slow multiplication instruction with faster instructions as explained in chapter 26.5.

The following example divides EAX by 10 and returns the result in EAX. I have chosen r = 17 rather than 19 because it happens to give code that is easier to optimize, and it covers the same range for x. f = 2^17 / 10 = 3333h, case B: q = ((x+1) * 3333h) SHR 17:

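    LEA EBX,[EAX+2*EAX+3]   ; EBX = 3h * (x+1)
    LEA ECX,[EAX+2*EAX+3]   ; ECX = 3h * (x+1)
    SHL EBX,4               ; EBX = 30h * (x+1)
    MOV EAX,ECX             ; EAX = 3h * (x+1)
    SHL ECX,8               ; ECX = 300h * (x+1)
    ADD EAX,EBX             ; EAX = 33h * (x+1)
    SHL EBX,8               ; EBX = 3000h * (x+1)
    ADD EAX,ECX             ; EAX = 333h * (x+1)
    ADD EAX,EBX             ; EAX = 3333h * (x+1)
    SHR EAX,17              ; q = ((x+1) * 3333h) SHR 17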

A systematic test shows that this code works correctly for all x < 10004H.
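The shift-and-add sequence can be replayed register by register in C to confirm the stated limit. This is an illustrative model only; the local variable names simply echo the registers, and div10_shift_add is a name chosen here.

```c
#include <stdint.h>

/* Model of the LEA/SHL/ADD sequence: computes ((x+1) * 3333h) SHR 17. */
static uint32_t div10_shift_add(uint32_t x)
{
    uint32_t ebx = 3 * x + 3;   /* LEA EBX,[EAX+2*EAX+3] : 3h * (x+1) */
    uint32_t ecx = ebx;         /* LEA ECX,[EAX+2*EAX+3] : same value */
    uint32_t eax;
    ebx <<= 4;                  /* 30h * (x+1)   */
    eax = ecx;                  /* 3h * (x+1)    */
    ecx <<= 8;                  /* 300h * (x+1)  */
    eax += ebx;                 /* 33h * (x+1)   */
    ebx <<= 8;                  /* 3000h * (x+1) */
    eax += ecx;                 /* 333h * (x+1)  */
    eax += ebx;                 /* 3333h * (x+1) */
    return eax >> 17;           /* SHR EAX,17    */
}
```

Looping x from 0 upward and comparing with x / 10 shows the first mismatch at x = 10004H, in agreement with the limit quoted above.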

Repeated integer division by the same value (all processors)

If the divisor is not known at assembly time, but you are dividing repeatedly with the same divisor, then you may use the same method as above. The code has to distinguish between case A, B and C and calculate f before doing the divisions.

The code that follows shows how to do multiple divisions with the same divisor (unsigned division with truncation). First call SET_DIVISOR to specify the divisor and calculate the reciprocal, then call DIVIDE_FIXED for each value to divide by the same divisor.

.data

RECIPROCAL_DIVISOR DD ?                 ; rounded reciprocal divisor
CORRECTION         DD ?                 ; case A: -1, case B: 1, case C: 0
BSHIFT             DD ?                 ; number of bits in divisor - 1

.code

SET_DIVISOR PROC NEAR                   ; divisor in EAX
        PUSH    EBX
        MOV     EBX,EAX
        BSR     ECX,EAX                 ; b = number of bits in divisor - 1
        MOV     EDX,1
        JZ      ERROR                   ; error: divisor is zero
        SHL     EDX,CL                  ; 2^b
        MOV     [BSHIFT],ECX            ; save b
        CMP     EAX,EDX
        MOV     EAX,0
        JE      SHORT CASE_A            ; divisor is a power of 2
        DIV     EBX                     ; 2^(32+b) / d
        SHR     EBX,1                   ; divisor / 2
        XOR     ECX,ECX
        CMP     EDX,EBX                 ; compare remainder with divisor/2
        SETBE   CL                      ; 1 if case B
        MOV     [CORRECTION],ECX        ; correction for rounding errors
        XOR     ECX,1
        ADD     EAX,ECX                 ; add 1 if case C
        MOV     [RECIPROCAL_DIVISOR],EAX ; rounded reciprocal divisor
        POP     EBX
        RET
CASE_A: MOV     [CORRECTION],-1         ; remember that we have case A
        POP     EBX
        RET
SET_DIVISOR ENDP

DIVIDE_FIXED PROC NEAR                  ; dividend in EAX, result in EAX
        MOV     EDX,[CORRECTION]
        MOV     ECX,[BSHIFT]
        TEST    EDX,EDX
        JS      SHORT DSHIFT            ; divisor is power of 2
        ADD     EAX,EDX                 ; correct for rounding error
        JC      SHORT DOVERFL           ; correct for overflow
        MUL     [RECIPROCAL_DIVISOR]    ; multiply with reciprocal divisor
        MOV     EAX,EDX
DSHIFT: SHR     EAX,CL                  ; adjust for number of bits
        RET
DOVERFL:MOV     EAX,[RECIPROCAL_DIVISOR] ; dividend = 0FFFFFFFFH
        SHR     EAX,CL                  ; do division by shifting
        RET
DIVIDE_FIXED ENDP

This code gives the same result as the DIV instruction for 0 <= x < 2^32, 0 < d < 2^32.

Note: The line JC DOVERFL and its target are not needed if you are certain that x < 0FFFFFFFFH.
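As a cross-check, the two procedures can be mirrored in C with 32-bit variables in place of the registers. The struct fixed_divisor and the function names are assumptions of this sketch, and the error path for a zero divisor is reduced to a return code because the ERROR handler is not shown in the listing.

```c
#include <stdint.h>

typedef struct {
    uint32_t reciprocal;  /* RECIPROCAL_DIVISOR */
    int32_t  correction;  /* CORRECTION: case A: -1, case B: 1, case C: 0 */
    uint32_t bshift;      /* BSHIFT: number of bits in divisor - 1 */
} fixed_divisor;

/* Mirrors SET_DIVISOR. Returns 0 on success, -1 if d == 0. */
static int set_divisor(fixed_divisor *fd, uint32_t d)
{
    if (d == 0) return -1;
    uint32_t b = 0;
    while ((d >> b) > 1) b++;                     /* BSR ECX,EAX */
    fd->bshift = b;
    if (d == ((uint32_t)1 << b)) {                /* CASE_A: power of 2 */
        fd->correction = -1;
        fd->reciprocal = 0;                       /* not used in case A */
        return 0;
    }
    uint64_t two_r = (uint64_t)1 << (32 + b);     /* EDX:EAX = 2^(32+b) */
    uint32_t q   = (uint32_t)(two_r / d);         /* DIV EBX: quotient */
    uint32_t rem = (uint32_t)(two_r % d);         /* DIV EBX: remainder */
    fd->correction = (rem <= (d >> 1));           /* SETBE CL: 1 if case B */
    fd->reciprocal = q + (uint32_t)(fd->correction ^ 1);  /* add 1 if case C */
    return 0;
}

/* Mirrors DIVIDE_FIXED. */
static uint32_t divide_fixed(const fixed_divisor *fd, uint32_t x)
{
    if (fd->correction < 0) return x >> fd->bshift;       /* JS DSHIFT */
    uint32_t x1 = x + (uint32_t)fd->correction;           /* ADD EAX,EDX */
    if (x1 < x)                                           /* JC DOVERFL */
        return fd->reciprocal >> fd->bshift;
    return (uint32_t)(((uint64_t)x1 * fd->reciprocal) >> 32) >> fd->bshift;
}
```

A typical use is to call set_divisor once and then divide_fixed inside the loop that would otherwise execute DIV.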

If powers of 2 occur so seldom that it is not worth optimizing for them, then you may leave out the jump to DSHIFT and instead do a multiplication with CORRECTION = 0 for case A.

If the divisor is changed so often that the procedure SET_DIVISOR needs optimizing, then you may replace the BSR instruction with the code given in chapter 26.15 for the PPlain and PMMX processors.

Floating point division (all processors)

Floating point division takes 38 or 39 clock cycles for the highest precision. You can save time by specifying a lower precision in the floating point control word (On PPlain and PMMX, only FDIV and FIDIV are faster at low precision; on PPro, PII and PIII, this also applies to FSQRT. No other instructions can be speeded up this way).

Parallel division (PPlain and PMMX)

On PPlain and PMMX, it is possible to do a floating point division and an integer division in parallel to save time. On PPro, PII and PIII this is not possible, because integer division and floating point division use the same circuitry.

Example: A = A1 / A2; B = B1 / B2

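    FILD  [B1]
    FILD  [B2]
    MOV   EAX, [A1]
    MOV   EBX, [A2]
    CDQ                 ; extend EAX into EDX (valid for DIV only while A1 < 2^31)
    FDIV                ; starts B1 / B2 in the floating point unit
    DIV   EBX           ; A1 / A2 runs in parallel in the integer unit
    FISTP [B]
    MOV   [A], EAX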

(make sure you set the floating point control word to the desired rounding method)

Using reciprocal instruction for fast division (PIII)

On PIII you can use the fast reciprocal instruction RCPSS or RCPPS on the divisor and then multiply with the dividend. However, the precision is only 12 bits. You can increase the precision to 23 bits by using the Newton-Raphson method described in Intel's application note AP-803:

x₀ = RCPSS(d)

x₁ = x₀ * (2 - d * x₀) = 2*x₀ - d * x₀ * x₀

where x₀ is the first approximation to the reciprocal of the divisor, d, and x₁ is a better approximation. You must use this formula before multiplying with the dividend:

    MOVAPS XMM1, [DIVISORS]    ; load divisors
    RCPPS  XMM0, XMM1          ; approximate reciprocal
    MULPS  XMM1, XMM0          ; Newton-Raphson formula
    MULPS  XMM1, XMM0
    ADDPS  XMM0, XMM0
    SUBPS  XMM0, XMM1
    MULPS  XMM0, [DIVIDENDS]   ; results in XMM0

This makes four divisions in 18 clock cycles with a precision of 23 bits. Increasing the precision further by repeating the Newton-Raphson formula in the floating point registers is possible, but not very advantageous.
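The effect of one Newton-Raphson step can be observed in portable C without SSE hardware. In the sketch below, rough_reciprocal is only a stand-in for RCPSS: it takes the exact reciprocal and clears the low 12 mantissa bits, which leaves about 12 bits of precision but is not the instruction's real result. All function names are invented for this illustration, and IEEE 754 single precision floats are assumed.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for RCPSS (an assumption, not the real instruction):
   exact reciprocal with the low 12 mantissa bits cleared. */
static float rough_reciprocal(float d)
{
    float x = 1.0f / d;
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits &= 0xFFFFF000u;
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* One Newton-Raphson step: x1 = 2*x0 - d*x0*x0 */
static float refine_reciprocal(float d, float x0)
{
    float t = d * x0;        /* MULPS XMM1, XMM0 */
    t = t * x0;              /* MULPS XMM1, XMM0 */
    return (x0 + x0) - t;    /* ADDPS XMM0, XMM0 / SUBPS XMM0, XMM1 */
}

/* Relative error of x as an approximation to 1/d, evaluated in double. */
static double reciprocal_error(float d, float x)
{
    double e = (double)d * (double)x - 1.0;
    return e < 0 ? -e : e;
}
```

If x0 has relative error e, then x1 has relative error of about e squared, which is why a single step roughly doubles the number of correct bits.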

If you want to use this method for integer divisions then you have to check for rounding errors. The following code makes four divisions with truncation on packed word size integers in approximately 42 clock cycles. It gives exact results for 0 <= dividend < 7FFFH and 0 < divisor <= 7FFFH:

    MOVQ      MM1, [DIVISORS]    ; load four divisors
    MOVQ      MM2, [DIVIDENDS]   ; load four dividends
    PUNPCKHWD MM4, MM1           ; unpack divisors to DWORDs
    PSRAD     MM4, 16
    PUNPCKLWD MM3, MM1
    PSRAD     MM3, 16
    CVTPI2PS  XMM1, MM4          ; convert divisors to float, upper two operands
    MOVLHPS   XMM1, XMM1
    CVTPI2PS  XMM1, MM3          ; convert lower two operands
    PUNPCKHWD MM4, MM2           ; unpack dividends to DWORDs
    PSRAD     MM4, 16
    PUNPCKLWD MM3, MM2
    PSRAD     MM3, 16
    CVTPI2PS  XMM2, MM4          ; convert dividends to float, upper two operands
    MOVLHPS   XMM2, XMM2
    CVTPI2PS  XMM2, MM3          ; convert lower two operands
    RCPPS     XMM0, XMM1         ; approximate reciprocal of divisors
    MULPS     XMM1, XMM0         ; improve precision with Newton-Raphson method
    PCMPEQW   MM4, MM4           ; make four integer 1's in the meantime
    PSRLW     MM4, 15
    MULPS     XMM1, XMM0
    ADDPS     XMM0, XMM0
    SUBPS     XMM0, XMM1         ; reciprocal divisors with 23 bit precision
    MULPS     XMM0, XMM2         ; multiply with dividends
    CVTTPS2PI MM0, XMM0          ; truncate lower two results
    MOVHLPS   XMM0, XMM0
    CVTTPS2PI MM3, XMM0          ; truncate upper two results
    PACKSSDW  MM0, MM3           ; pack the four results into MM0
    MOVQ      MM3, MM1           ; multiply results with divisors...
    PMULLW    MM3, MM0           ; to check for rounding errors
    PADDSW    MM0, MM4           ; add 1 to compensate for later subtraction
    PADDSW    MM3, MM1           ; add divisor. this should be > dividend
    PCMPGTW   MM3, MM2           ; check if too small
    PADDSW    MM0, MM3           ; subtract 1 if not too small
    MOVQ      [QUOTIENTS], MM0   ; save the four results

This code checks if the result is too small and makes the appropriate correction. It is not necessary to check if the result is too big.
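The correction at the end of the listing may be easier to follow in scalar form. The C sketch below applies the same steps to a single word: q is the truncated quotient from the reciprocal multiplication (possibly one too small), d the divisor and x the dividend. The names sat_add16 and fix_quotient are hypothetical, and the inputs are assumed to lie in the ranges stated for the assembly code.

```c
#include <stdint.h>

/* Saturating 16-bit signed addition, as performed per word by PADDSW. */
static int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t s = (int32_t)a + b;
    if (s > INT16_MAX) s = INT16_MAX;
    if (s < INT16_MIN) s = INT16_MIN;
    return (int16_t)s;
}

/* Scalar model of the final correction for one of the four words. */
static int16_t fix_quotient(int16_t q, int16_t d, int16_t x)
{
    int16_t prod  = (int16_t)(q * d);        /* PMULLW: q*d, fits since q*d <= x */
    int16_t q1    = sat_add16(q, 1);         /* PADDSW MM0, MM4: q + 1 */
    int16_t upper = sat_add16(prod, d);      /* PADDSW MM3, MM1: q*d + d */
    int16_t mask  = (upper > x) ? -1 : 0;    /* PCMPGTW: -1 if q is not too small */
    return sat_add16(q1, mask);              /* PADDSW MM0, MM3: q, or q + 1 */
}
```

For example, with x = 6 and d = 3 a reciprocal that is slightly too small yields q = 1 after truncation, and the check q*d + d > x fails, so the result is raised to 2.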

Avoiding divisions (all processors)

Obviously, you should always try to minimize the number of divisions. Floating point division with a constant or repeated division with the same value should of course be done by multiplying with the reciprocal. But there are many other situations where you can reduce the number of divisions. For example: if (A/B > C)... can be rewritten as if (A > B*C)... when B is positive, and the opposite when B is negative.

A/B + C/D can be rewritten as (A*D + C*B) / (B*D)

If you are using integer division, then you should be aware that the rounding errors may be different when you rewrite the formulas.
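A small C illustration of that caveat, using the comparison rewrite with integers (the function names are chosen here, b is assumed positive, and overflow of b * c is ignored):

```c
#include <stdbool.h>

/* Original form: one integer division, truncating toward zero. */
static bool cmp_with_division(int a, int b, int c)
{
    return a / b > c;
}

/* Rewritten form without the division; as written it requires b > 0. */
static bool cmp_without_division(int a, int b, int c)
{
    return a > b * c;
}
```

With a = 7, b = 2 and c = 3 the first form is false, because 7 / 2 truncates to 3, while the second is true, because 7 > 6. Which of the two answers is wanted depends on whether the truncation was intended.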
