**25. Loop optimization (all processors)**

When analyzing a program you often find that most of the execution time is spent in the innermost loop. The way to improve the speed is to carefully optimize the most time-consuming loop using assembly language. The rest of the program may be left in high-level language.

In all the following examples it is assumed that all data are in the level 1 cache. If the speed is limited by cache misses then there is no reason to optimize the instructions. Rather, you should concentrate on organizing your data in a way that minimizes cache misses (see chapter 7).

**25.1. Loops in PPlain and PMMX**

A loop generally contains a counter controlling how many times to iterate, and often an array access that reads or writes one array element per iteration. As an example I have chosen a procedure which reads integers from an array, changes the sign of each integer, and stores the results in another array.

A C language code for this procedure would be:

```c
void ChangeSign (int * A, int * B, int N) {
    int i;
    for (i=0; i<N; i++) B[i] = -A[i];
}
```

Translating to assembly, we might write the procedure like this:

**Example 1.1:**

```asm
_ChangeSign PROC NEAR
        PUSH    ESI
        PUSH    EDI
A       EQU     DWORD PTR [ESP+12]
B       EQU     DWORD PTR [ESP+16]
N       EQU     DWORD PTR [ESP+20]
        MOV     ECX, [N]
        JECXZ   L2
        MOV     ESI, [A]
        MOV     EDI, [B]
        CLD
L1:     LODSD
        NEG     EAX
        STOSD
        LOOP    L1
L2:     POP     EDI
        POP     ESI
        RET                     ; (no extra pop if _cdecl calling convention)
_ChangeSign ENDP
```

This looks like a nice solution, but it is not optimal because it uses slow non-pairable instructions. It takes 11 clock cycles per iteration if all data are in the level one cache.

**Using pairable instructions only (PPlain and PMMX)**

**Example 1.2:**

```asm
        MOV     ECX, [N]
        MOV     ESI, [A]
        TEST    ECX, ECX
        JZ      SHORT L2
        MOV     EDI, [B]
L1:     MOV     EAX, [ESI]      ; u
        XOR     EBX, EBX        ; v (pairs)
        ADD     ESI, 4          ; u
        SUB     EBX, EAX        ; v (pairs)
        MOV     [EDI], EBX      ; u
        ADD     EDI, 4          ; v (pairs)
        DEC     ECX             ; u
        JNZ     L1              ; v (pairs)
L2:
```

Here we have used pairable instructions only, and scheduled the instructions so that everything pairs. It now takes only 4 clock cycles per iteration. We could have obtained the same speed without splitting the `NEG` instruction, but the other unpairable instructions should be split up.

**Using the same register for counter and index**

**Example 1.3:**

```asm
        MOV     ESI, [A]
        MOV     EDI, [B]
        MOV     ECX, [N]
        XOR     EDX, EDX
        TEST    ECX, ECX
        JZ      SHORT L2
L1:     MOV     EAX, [ESI+4*EDX]        ; u
        NEG     EAX                     ; u
        MOV     [EDI+4*EDX], EAX        ; u
        INC     EDX                     ; v (pairs)
        CMP     EDX, ECX                ; u
        JB      L1                      ; v (pairs)
L2:
```

Using the same register for counter and index gives us fewer instructions in the body of the loop, but it still takes 4 clocks because we have two unpaired instructions.

**Letting the counter end at zero (PPlain and PMMX)**

We want to get rid of the `CMP` instruction in example 1.3 by letting the counter end at zero and use the zero flag for detecting when we are finished as we did in example 1.2. One way to do this would be to execute the loop backwards taking the last array elements first. However, data caches are optimized for accessing data forwards, not backwards, so if cache misses are likely, then you should rather start the counter at -N and count through negative values up to zero. The base registers should then point to the end of the arrays rather than the beginning:
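The same scheme — base pointers at the end of the arrays and an index counting from -N up to zero — can be sketched in C. This is an illustration only; the names are mine, not from the original procedure:

```c
/* Negate N integers from A into B, indexing from -N up to 0,
   so the loop-termination test reduces to "index reached zero".
   Aend and Bend play the role of ESI and EDI pointing past the
   last element, as in example 1.4 below. */
void change_sign(const int *A, int *B, int N) {
    const int *Aend = A + N;    /* base pointer at end of array A */
    int *Bend = B + N;          /* base pointer at end of array B */
    int i;
    for (i = -N; i != 0; i++)   /* negative index counts up to zero */
        Bend[i] = -Aend[i];
}
```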

**Example 1.4:**

```asm
        MOV     ESI, [A]
        MOV     EAX, [N]
        MOV     EDI, [B]
        XOR     ECX, ECX
        LEA     ESI, [ESI+4*EAX]        ; point to end of array A
        SUB     ECX, EAX                ; -N
        LEA     EDI, [EDI+4*EAX]        ; point to end of array B
        JZ      SHORT L2
L1:     MOV     EAX, [ESI+4*ECX]        ; u
        NEG     EAX                     ; u
        MOV     [EDI+4*ECX], EAX        ; u
        INC     ECX                     ; v (pairs)
        JNZ     L1                      ; u
L2:
```

We are now down to five instructions in the loop body, but it still takes 4 clocks because of poor pairing. (If the addresses and sizes of the arrays are constants, we may save two registers by substituting `A+SIZE A` for `ESI` and `B+SIZE B` for `EDI`.) Now let's see how we can improve the pairing.

**Pairing calculations with loop overhead (PPlain and PMMX)**

We may want to improve pairing by intermingling calculations with the loop control instructions. If we want to put something in between `INC ECX` and `JNZ L1`, it has to be something that doesn't affect the zero flag. The `MOV [EDI+4*ECX],EBX` instruction after `INC ECX` would generate an AGI delay, so we have to be more ingenious:

**Example 1.5:**

```asm
        MOV     EAX, [N]
        XOR     ECX, ECX
        SHL     EAX, 2                  ; 4 * N
        JZ      SHORT L3
        MOV     ESI, [A]
        MOV     EDI, [B]
        SUB     ECX, EAX                ; - 4 * N
        ADD     ESI, EAX                ; point to end of array A
        ADD     EDI, EAX                ; point to end of array B
        JMP     SHORT L2
L1:     MOV     [EDI+ECX-4], EAX        ; u
L2:     MOV     EAX, [ESI+ECX]          ; v (pairs)
        XOR     EAX, -1                 ; u
        ADD     ECX, 4                  ; v (pairs)
        INC     EAX                     ; u
        JNC     L1                      ; v (pairs)
        MOV     [EDI+ECX-4], EAX
L3:
```

I have used a different way to calculate the negative of `EAX` here: inverting all bits and adding one. The reason for using this method is that I can use a dirty trick with the `INC` instruction: `INC` doesn't change the carry flag, whereas `ADD` does. So I use `ADD` rather than `INC` to increment the loop counter, and test the carry flag rather than the zero flag. It is then possible to put the `INC EAX` in between without affecting the carry flag. You may think that we could have used `LEA EAX,[EAX+1]` instead of `INC EAX`, since it doesn't change any flags, but the `LEA` instruction would have an AGI stall, so that's not the best solution. Note that the trick of `INC` not changing the carry flag is useful only on PPlain and PMMX; it will cause a partial flags stall on PPro, PII and PIII.
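The two's complement identity behind the `XOR EAX,-1` / `INC EAX` pair — inverting all bits and adding one yields the negative — can be verified in a few lines of C (the function name is mine, for illustration):

```c
#include <stdint.h>

/* Two's complement negation: -x equals (~x) + 1.
   XOR with -1 inverts all bits; the +1 is the INC EAX step. */
int32_t neg_by_not_inc(int32_t x) {
    return (int32_t)(~(uint32_t)x + 1u);
}
```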

I have obtained perfect pairing here, and the loop now takes only 3 clock cycles. Whether you increment the loop counter by 1 (as in example 1.4) or by 4 (as in example 1.5) is a matter of taste; it makes no difference in loop timing.

**Overlapping the end of one operation with the beginning of the next (PPlain and PMMX)**

The method used in example 1.5 is not very generally applicable so we may look for other methods of improving pairing opportunities. One way is to reorganize the loop so that the end of one operation overlaps with the beginning of the next. I will call this convoluting the loop. A convoluted loop has an unfinished operation at the end of each loop iteration which will be finished in the next run. Actually, example 1.5 did pair the last `MOV` of one iteration with the first `MOV` of the next, but we want to explore this method further:

**Example 1.6:**

```asm
        MOV     ESI, [A]
        MOV     EAX, [N]
        MOV     EDI, [B]
        XOR     ECX, ECX
        LEA     ESI, [ESI+4*EAX]        ; point to end of array A
        SUB     ECX, EAX                ; -N
        LEA     EDI, [EDI+4*EAX]        ; point to end of array B
        JZ      SHORT L3
        XOR     EBX, EBX
        MOV     EAX, [ESI+4*ECX]
        INC     ECX
        JZ      SHORT L2
L1:     SUB     EBX, EAX                ; u
        MOV     EAX, [ESI+4*ECX]        ; v (pairs)
        MOV     [EDI+4*ECX-4], EBX      ; u
        INC     ECX                     ; v (pairs)
        MOV     EBX, 0                  ; u
        JNZ     L1                      ; v (pairs)
L2:     SUB     EBX, EAX
        MOV     [EDI+4*ECX-4], EBX
L3:
```

Here we begin reading the second value before we have stored the first, and this of course improves pairing opportunities. The `MOV EBX,0` instruction has been put in between `INC ECX` and `JNZ L1` not to improve pairing but to avoid an AGI stall.

**Rolling out a loop (PPlain and PMMX)**

The most generally applicable way to improve pairing opportunities is to do two operations for each run and do half as many runs. This is called rolling out a loop:

**Example 1.7:**

```asm
        MOV     ESI, [A]
        MOV     EAX, [N]
        MOV     EDI, [B]
        XOR     ECX, ECX
        LEA     ESI, [ESI+4*EAX]        ; point to end of array A
        SUB     ECX, EAX                ; -N
        LEA     EDI, [EDI+4*EAX]        ; point to end of array B
        JZ      SHORT L2
        TEST    AL, 1                   ; test if N is odd
        JZ      SHORT L1
        MOV     EAX, [ESI+4*ECX]        ; N is odd. do the odd one
        NEG     EAX
        MOV     [EDI+4*ECX], EAX
        INC     ECX                     ; make counter even
        JZ      SHORT L2                ; N = 1
L1:     MOV     EAX, [ESI+4*ECX]        ; u
        MOV     EBX, [ESI+4*ECX+4]      ; v (pairs)
        NEG     EAX                     ; u
        NEG     EBX                     ; u
        MOV     [EDI+4*ECX], EAX        ; u
        MOV     [EDI+4*ECX+4], EBX      ; v (pairs)
        ADD     ECX, 2                  ; u
        JNZ     L1                      ; v (pairs)
L2:
```

Now we are doing two operations in parallel which gives the best pairing opportunities. We have to test if `N` is odd and if so do one operation outside the loop because the loop can only do an even number of operations.

The loop has an AGI stall at the first `MOV` instruction because `ECX` has been incremented in the preceding clock cycle. The loop therefore takes 6 clock cycles for two operations.
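The structure of example 1.7 — peel off one odd element, then process two elements per iteration — looks like this in C (a sketch; the function name is mine):

```c
/* Negate N integers from A into B, two elements per iteration.
   An odd leftover element is handled before the main loop, because
   the unrolled loop can only do an even number of operations,
   mirroring the structure of example 1.7. */
void change_sign_unroll2(const int *A, int *B, int N) {
    int i = 0;
    if (N & 1) {                /* N is odd: do the odd one first */
        B[0] = -A[0];
        i = 1;                  /* remaining count is now even */
    }
    for (; i < N; i += 2) {     /* two operations per iteration */
        B[i]   = -A[i];
        B[i+1] = -A[i+1];
    }
}
```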

**Reorganizing a loop to remove AGI stall (PPlain and PMMX)**

**Example 1.8:**

```asm
        MOV     ESI, [A]
        MOV     EAX, [N]
        MOV     EDI, [B]
        XOR     ECX, ECX
        LEA     ESI, [ESI+4*EAX]        ; point to end of array A
        SUB     ECX, EAX                ; -N
        LEA     EDI, [EDI+4*EAX]        ; point to end of array B
        JZ      SHORT L3
        TEST    AL, 1                   ; test if N is odd
        JZ      SHORT L2
        MOV     EAX, [ESI+4*ECX]        ; N is odd. do the odd one
        NEG     EAX                     ; no pairing opportunity
        MOV     [EDI+4*ECX], EAX
        INC     ECX                     ; make counter even
        JNZ     SHORT L2
        NOP                             ; add NOP's if JNZ L2 not predictable
        NOP
        JMP     SHORT L3                ; N = 1
L1:     NEG     EAX                     ; u
        NEG     EBX                     ; u
        MOV     [EDI+4*ECX-8], EAX      ; u
        MOV     [EDI+4*ECX-4], EBX      ; v (pairs)
L2:     MOV     EAX, [ESI+4*ECX]        ; u
        MOV     EBX, [ESI+4*ECX+4]      ; v (pairs)
        ADD     ECX, 2                  ; u
        JNZ     L1                      ; v (pairs)
        NEG     EAX
        NEG     EBX
        MOV     [EDI+4*ECX-8], EAX
        MOV     [EDI+4*ECX-4], EBX
L3:
```

The trick is to find a pair of instructions that do not use the loop counter as index and reorganize the loop so that the counter is incremented in the preceding clock cycle. We are now down at 5 clock cycles for two operations which is close to the best possible.

If data caching is critical, then you may improve the speed further by interleaving the `A` and `B` arrays into one structured array so that each `B[i]` comes immediately after the corresponding `A[i]`. If the structured array is aligned by at least 8 then `B[i]` will always be in the same cache line as `A[i]`, so you will never have a cache miss when writing `B[i]`. This may of course have a tradeoff in other parts of the program so you have to weigh the costs against the benefits.
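The interleaved layout might look like this in C (a hypothetical layout sketch; the names are mine):

```c
/* Interleave A[i] and B[i] so that each pair shares a cache line.
   The 8-byte pair should be aligned by 8 so that writing .b never
   touches a different cache line than the one just read for .a. */
struct pair { int a; int b; };  /* 8 bytes per element pair */

void change_sign_interleaved(struct pair *p, int N) {
    int i;
    for (i = 0; i < N; i++)
        p[i].b = -p[i].a;       /* B[i] written next to A[i] */
}
```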

**Rolling out by more than 2 (PPlain and PMMX)**

You may think of doing more than two operations per iteration in order to reduce the loop overhead per operation. But since the loop overhead in most cases can be reduced to only one clock cycle per iteration, then rolling out the loop by 4 rather than by 2 would only save 1/4 clock cycle per operation, which is hardly worth the effort. Only if the loop overhead cannot be reduced to one clock cycle and if N is very big, should you think of unrolling by 4.

The drawbacks of excessive loop unrolling are:

- You need to calculate N MODULO R, where R is the unrolling factor, and do N MODULO R operations before or after the main loop in order to make the remaining number of operations divisible by R. This requires a lot of extra code and poorly predictable branches. And the loop body of course also becomes bigger.
- A piece of code usually takes much more time the first time it executes, and the first-time penalty grows with the amount of code, especially if N is small.
- Excessive code size makes the utilization of the code cache less effective.

**Handling multiple 8 or 16 bit operands simultaneously in 32 bit registers (PPlain and PMMX)**

If you need to manipulate arrays of 8 or 16 bit operands, then there is a problem with unrolled loops because you may not be able to pair two memory access operations. For example `MOV AL,[ESI] / MOV BL,[ESI+1]` will not pair if the two operands are within the same dword of memory. But there may be a much smarter method, namely to handle four bytes at a time in the same 32 bit register.

The following example adds 2 to all elements of an array of bytes:

**Example 1.9:**

```asm
        MOV     ESI, [A]        ; address of byte array
        MOV     ECX, [N]        ; number of elements in byte array
        TEST    ECX, ECX        ; test if N is 0
        JZ      SHORT L2
        MOV     EAX, [ESI]      ; read first four bytes
L1:     MOV     EBX, EAX        ; copy into EBX
        AND     EAX, 7F7F7F7FH  ; get lower 7 bits of each byte in EAX
        XOR     EBX, EAX        ; get the highest bit of each byte
        ADD     EAX, 02020202H  ; add desired value to all four bytes
        XOR     EBX, EAX        ; combine bits again
        MOV     EAX, [ESI+4]    ; read next four bytes
        MOV     [ESI], EBX      ; store result
        ADD     ESI, 4          ; increment pointer
        SUB     ECX, 4          ; decrement loop counter
        JA      L1              ; loop
L2:
```

This loop takes 5 clock cycles for every 4 bytes. The array should of course be aligned by 4. If the number of elements in the array is not divisible by four, then you may pad it in the end with a few extra bytes to make the length divisible by four. This loop will always read past the end of the array, so you should make sure the array is not placed at the end of a segment to avoid a general protection error.

Note that I have masked out the highest bit of each byte to avoid a possible carry from each byte into the next when adding. I am using `XOR` rather than `ADD` when putting in the high bit again to avoid carry.
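The masking trick of example 1.9 can be expressed in C on one 32-bit word (a sketch; the function name is mine):

```c
#include <stdint.h>

/* Add 2 to each of the four bytes packed in a 32-bit word, without
   letting a carry propagate from one byte into the next.
   The high bit of each byte is masked out before the add and put
   back afterwards with XOR, exactly as in example 1.9. */
uint32_t add2_to_each_byte(uint32_t x) {
    uint32_t low7 = x & 0x7F7F7F7Fu;    /* lower 7 bits of each byte */
    uint32_t high = x ^ low7;           /* highest bit of each byte */
    uint32_t sum  = low7 + 0x02020202u; /* add 2 to all four bytes */
    return sum ^ high;                  /* combine high bits again */
}
```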

The `ADD ESI,4` instruction could have been avoided by using the loop counter as index as in example 1.4. However, this would give an odd number of instructions in the loop body, so there would be one unpaired instruction and the loop would still take 5 clocks. Making the branch instruction unpaired would save one clock after the last operation when the branch is mispredicted, but we would have to spend an extra clock cycle in the prolog code to set up a pointer to the end of the array and calculate -N, so the two methods are exactly equally fast. The method presented here is the simplest and shortest.

The next example finds the length of a zero-terminated string by searching for the first byte of zero. It is faster than using `REP SCASB`:

**Example 1.10:**

```asm
STRLEN  PROC    NEAR
        MOV     EAX, [ESP+4]            ; get pointer
        MOV     EDX, 7
        ADD     EDX, EAX                ; pointer+7 used in the end
        PUSH    EBX
        MOV     EBX, [EAX]              ; read first 4 bytes
        ADD     EAX, 4                  ; increment pointer
L1:     LEA     ECX, [EBX-01010101H]    ; subtract 1 from each byte
        XOR     EBX, -1                 ; invert all bytes
        AND     ECX, EBX                ; and these two
        MOV     EBX, [EAX]              ; read next 4 bytes
        ADD     EAX, 4                  ; increment pointer
        AND     ECX, 80808080H          ; test all sign bits
        JZ      L1                      ; no zero bytes, continue loop
        TEST    ECX, 00008080H          ; test first two bytes
        JNZ     SHORT L2
        SHR     ECX, 16                 ; not in the first 2 bytes
        ADD     EAX, 2
L2:     SHL     CL, 1                   ; use carry flag to avoid a branch
        POP     EBX
        SBB     EAX, EDX                ; compute length
        RET
STRLEN  ENDP
```

Again we have used the method of overlapping the end of one operation with the beginning of the next to improve pairing. I have not unrolled the loop because it is likely to repeat relatively few times. The string should of course be aligned by 4. The code will always read past the end of the string, so the string should not be placed at the end of a segment.

The loop body has an odd number of instructions so there is one unpaired. Making the branch instruction unpaired rather than one of the other instructions has the advantage that it saves 1 clock cycle when the branch is mispredicted.

The `TEST ECX,00008080H` instruction is non-pairable. You could use the pairable instruction `OR CH,CL` here instead, but then you would have to put in a `NOP` or something to avoid the penalties of consecutive branches. Another problem with `OR CH,CL` is that it would cause a partial register stall on a PPro, PII and PIII. So I have chosen to keep the unpairable `TEST` instruction.

Handling 4 bytes simultaneously can be quite difficult. The code uses a formula which generates a nonzero value for a byte if, and only if, the byte is zero. This makes it possible to test all four bytes in one operation. This algorithm involves the subtraction of 1 from all bytes (in the `LEA` instruction). I have not masked out the highest bit of each byte before subtracting, as I did in the previous example, so the subtraction may generate a borrow to the next byte, but only if it is zero, and this is exactly the situation where we don't care what the next byte is, because we are searching forwards for the first zero. If we were searching backwards then we would have to re-read the dword after detecting a zero, and then test all four bytes to find the last zero, or use `BSWAP` to reverse the order of the bytes.
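The zero-byte test used in the STRLEN loop can be written in C (a sketch; the function name is mine):

```c
#include <stdint.h>

/* Returns nonzero iff at least one of the four bytes in x is zero.
   Subtracting 1 from each byte sets a byte's sign bit when that byte
   was zero (or received a borrow); ANDing with ~x rejects bytes whose
   own high bit was already set, making the test exact. */
uint32_t has_zero_byte(uint32_t x) {
    return (x - 0x01010101u) & ~x & 0x80808080u;
}
```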

If you want to search for a byte value other than zero, then you may `XOR` all four bytes with the value you are searching for, and then use the method above to search for zero.
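That reduction — XOR with the replicated search value, then test for a zero byte — looks like this in C (a sketch; the name is mine):

```c
#include <stdint.h>

/* Returns nonzero iff one of the four bytes of x equals c.
   XOR turns each matching byte into zero, then the zero-byte
   test from the STRLEN loop is applied to the result. */
uint32_t has_byte(uint32_t x, uint8_t c) {
    uint32_t m = c * 0x01010101u;   /* replicate c into all four bytes */
    uint32_t v = x ^ m;             /* matching bytes become zero */
    return (v - 0x01010101u) & ~v & 0x80808080u;
}
```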

**Loops with MMX operations (PMMX)**

Handling multiple operands in the same register is easier on the MMX processors because they have special instructions and special 64 bit registers for exactly this purpose.

Returning to the problem of adding two to all bytes in an array, we may take advantage of the MMX instructions:

**Example 1.11:**

```asm
.data
ALIGN   8
ADDENTS DQ      0202020202020202h       ; specify byte to add eight times
A       DD      ?                       ; address of byte array
N       DD      ?                       ; number of iterations

.code
        MOV     ESI, [A]
        MOV     ECX, [N]
        MOVQ    MM2, [ADDENTS]
        JMP     SHORT L2
; top of loop
L1:     MOVQ    [ESI-8], MM0            ; store result
L2:     MOVQ    MM0, MM2                ; load addents
        PADDB   MM0, [ESI]              ; add eight bytes in one operation
        ADD     ESI, 8
        DEC     ECX
        JNZ     L1
        MOVQ    [ESI-8], MM0            ; store last result
        EMMS
```

The store instruction is moved to after the loop control instructions in order to avoid a store stall.

This loop takes 4 clocks because the `PADDB` instruction doesn't pair with `ADD ESI,8`. (An MMX instruction with memory access cannot pair with a non-MMX instruction or with another MMX instruction with memory access.) We could get rid of `ADD ESI,8` by using `ECX` as index, but that would give an AGI stall.

Since the loop overhead is considerable we might want to unroll the loop:

**Example 1.12:**

```asm
.data
ALIGN   8
ADDENTS DQ      0202020202020202h       ; specify byte to add eight times
A       DD      ?                       ; address of byte array
N       DD      ?                       ; number of iterations

.code
        MOVQ    MM2, [ADDENTS]
        MOV     ESI, [A]
        MOV     ECX, [N]
        MOVQ    MM0, MM2
        MOVQ    MM1, MM2
L3:     PADDB   MM0, [ESI]
        PADDB   MM1, [ESI+8]
        MOVQ    [ESI], MM0
        MOVQ    MM0, MM2
        MOVQ    [ESI+8], MM1
        MOVQ    MM1, MM2
        ADD     ESI, 16
        DEC     ECX
        JNZ     L3
        EMMS
```

This unrolled loop takes 6 clocks per iteration for adding 16 bytes. The `PADDB` instructions are not paired. The two threads are interleaved to avoid a store stall.

Using the MMX instructions has a high penalty if you are using floating point instructions shortly afterwards, so there may still be situations where you want to use 32 bit registers as in example 1.9.

**Loops with floating point operations (PPlain and PMMX)**

The methods of optimizing floating point loops are basically the same as for integer loops, although the floating point instructions are overlapping rather than pairing.

Consider the C language code:

```c
int i, n;
double * X;
double * Y;
double DA;
for (i=0; i<n; i++)
    Y[i] = Y[i] - DA * X[i];
```

This piece of code (called DAXPY) has been studied extensively because it is the key to solving linear equations.

**Example 1.13:**

```asm
DSIZE = 8                       ; data size
        MOV     EAX, [N]        ; number of elements
        MOV     ESI, [X]        ; pointer to X
        MOV     EDI, [Y]        ; pointer to Y
        XOR     ECX, ECX
        LEA     ESI, [ESI+DSIZE*EAX]    ; point to end of X
        SUB     ECX, EAX        ; -N
        LEA     EDI, [EDI+DSIZE*EAX]    ; point to end of Y
        JZ      SHORT L3        ; test for N = 0
        FLD     DSIZE PTR [DA]
        FMUL    DSIZE PTR [ESI+DSIZE*ECX]       ; DA * X[0]
        JMP     SHORT L2        ; jump into loop
L1:     FLD     DSIZE PTR [DA]
        FMUL    DSIZE PTR [ESI+DSIZE*ECX]       ; DA * X[i]
        FXCH                    ; get old result
        FSTP    DSIZE PTR [EDI+DSIZE*ECX-DSIZE] ; store Y[i]
L2:     FSUBR   DSIZE PTR [EDI+DSIZE*ECX]       ; subtract from Y[i]
        INC     ECX             ; increment index
        JNZ     L1              ; loop
        FSTP    DSIZE PTR [EDI+DSIZE*ECX-DSIZE] ; store last result
L3:
```

Here we are using the same methods as in example 1.6: using the loop counter as index register and counting through negative values up to zero. The end of one operation overlaps with the beginning of the next.

The interleaving of floating point operations works perfectly here: the 2 clock stall between `FMUL` and `FSUBR` is filled with the `FSTP` of the previous result. The 3 clock stall between `FSUBR` and `FSTP` is filled with the loop overhead and the first two instructions of the next operation. An AGI stall has been avoided by reading the only parameter that doesn't depend on the index in the first clock cycle after the index has been incremented.

This solution takes 6 clock cycles per operation, which is better than the unrolled solution published by Intel!

**Unrolling floating point loops (PPlain and PMMX)**

The DAXPY loop unrolled by 3 is quite complicated:

**Example 1.14:**

```asm
DSIZE = 8                       ; data size
IF DSIZE EQ 4
SHIFTCOUNT = 2
ELSE
SHIFTCOUNT = 3
ENDIF
        MOV     EAX, [N]        ; number of elements
        MOV     ECX, 3*DSIZE    ; counter bias
        SHL     EAX, SHIFTCOUNT ; DSIZE*N
        JZ      L4              ; N = 0
        MOV     ESI, [X]        ; pointer to X
        SUB     ECX, EAX        ; (3-N)*DSIZE
        MOV     EDI, [Y]        ; pointer to Y
        SUB     ESI, ECX        ; end of pointer - bias
        SUB     EDI, ECX
        TEST    ECX, ECX
        FLD     DSIZE PTR [ESI+ECX]     ; first X
        JNS     SHORT L2        ; less than 4 operations
L1:     ; main loop
        FMUL    DSIZE PTR [DA]
        FLD     DSIZE PTR [ESI+ECX+DSIZE]
        FMUL    DSIZE PTR [DA]
        FXCH
        FSUBR   DSIZE PTR [EDI+ECX]
        FXCH
        FLD     DSIZE PTR [ESI+ECX+2*DSIZE]
        FMUL    DSIZE PTR [DA]
        FXCH
        FSUBR   DSIZE PTR [EDI+ECX+DSIZE]
        FXCH    ST(2)
        FSTP    DSIZE PTR [EDI+ECX]
        FSUBR   DSIZE PTR [EDI+ECX+2*DSIZE]
        FXCH
        FSTP    DSIZE PTR [EDI+ECX+DSIZE]
        FLD     DSIZE PTR [ESI+ECX+3*DSIZE]
        FXCH
        FSTP    DSIZE PTR [EDI+ECX+2*DSIZE]
        ADD     ECX, 3*DSIZE
        JS      L1              ; loop
L2:     FMUL    DSIZE PTR [DA]  ; finish leftover operation
        FSUBR   DSIZE PTR [EDI+ECX]
        SUB     ECX, 2*DSIZE    ; change pointer bias
        JZ      SHORT L3        ; finished
        FLD     DSIZE PTR [DA]  ; start next operation
        FMUL    DSIZE PTR [ESI+ECX+3*DSIZE]
        FXCH
        FSTP    DSIZE PTR [EDI+ECX+2*DSIZE]
        FSUBR   DSIZE PTR [EDI+ECX+3*DSIZE]
        ADD     ECX, 1*DSIZE
        JZ      SHORT L3        ; finished
        FLD     DSIZE PTR [DA]
        FMUL    DSIZE PTR [ESI+ECX+3*DSIZE]
        FXCH
        FSTP    DSIZE PTR [EDI+ECX+2*DSIZE]
        FSUBR   DSIZE PTR [EDI+ECX+3*DSIZE]
        ADD     ECX, 1*DSIZE
L3:     FSTP    DSIZE PTR [EDI+ECX+2*DSIZE]
L4:
```

The reason why I am showing you how to unroll a loop by 3 is not to recommend it, but to warn you how difficult it is! Be prepared to spend a considerable amount of time debugging and verifying your code when doing something like this. There are several problems to take care of: in most cases, you cannot remove all stalls from a floating point loop unrolled by less than 4 unless you convolute it (i.e. there are unfinished operations at the end of each run which are finished in the next run). The last `FLD` in the main loop above is the beginning of the first operation in the next run.

It would be tempting to make a solution which reads past the end of the array and then discards the extra value in the end, as in examples 1.9 and 1.10, but that is not recommended in floating point loops because reading the extra value might generate a denormal operand exception in case the memory position after the array doesn't contain a valid floating point number. To avoid this, we have to do at least one more operation after the main loop.

The number of operations to do outside an unrolled loop would normally be N MODULO R, where N is the number of operations, and R is the unrolling factor. But in the case of a convoluted loop, we have to do one more, i.e. (N-1) MODULO R + 1, for the abovementioned reason.
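The leftover count for a convoluted loop can be checked with a one-line formula (a sketch; the function name is mine):

```c
/* Number of operations done outside a convoluted loop unrolled by R:
   (N-1) MOD R + 1, which is always between 1 and R for N >= 1,
   so there is always at least one operation after the main loop. */
int leftover_ops(int N, int R) {
    return (N - 1) % R + 1;
}
```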

Normally, we would prefer to do the extra operations before the main loop, but here we have to do them afterwards for two reasons: One reason is to take care of the leftover operand from the convolution. The other reason is that calculating the number of extra operations requires a division if R is not a power of 2, and a division is time consuming. Doing the extra operations after the loop saves the division.

The next problem is to calculate how to bias the loop counter so that it will change sign at the right time, and adjust the base pointers so as to compensate for this bias. Finally, you have to make sure the leftover operand from the convolution is handled correctly for all values of N.

The epilog code doing 1-3 operations could have been implemented as a separate loop, but that would cost an extra branch misprediction, so the solution above is faster.

Now that I have scared you by demonstrating how difficult it is to unroll by 3, I will show you that it is much easier to unroll by 4:

**Example 1.15:**

```asm
DSIZE = 8                       ; data size
        MOV     EAX, [N]        ; number of elements
        MOV     ESI, [X]        ; pointer to X
        MOV     EDI, [Y]        ; pointer to Y
        XOR     ECX, ECX
        LEA     ESI, [ESI+DSIZE*EAX]    ; point to end of X
        SUB     ECX, EAX        ; -N
        LEA     EDI, [EDI+DSIZE*EAX]    ; point to end of Y
        TEST    AL, 1           ; test if N is odd
        JZ      SHORT L1
        FLD     DSIZE PTR [DA]  ; do the odd operation
        FMUL    DSIZE PTR [ESI+DSIZE*ECX]
        FSUBR   DSIZE PTR [EDI+DSIZE*ECX]
        INC     ECX             ; adjust counter
        FSTP    DSIZE PTR [EDI+DSIZE*ECX-DSIZE]
L1:     TEST    AL, 2           ; test for possibly 2 more operations
        JZ      L2
        FLD     DSIZE PTR [DA]  ; N MOD 4 = 2 or 3. Do two more
        FMUL    DSIZE PTR [ESI+DSIZE*ECX]
        FLD     DSIZE PTR [DA]
        FMUL    DSIZE PTR [ESI+DSIZE*ECX+DSIZE]
        FXCH
        FSUBR   DSIZE PTR [EDI+DSIZE*ECX]
        FXCH
        FSUBR   DSIZE PTR [EDI+DSIZE*ECX+DSIZE]
        FXCH
        FSTP    DSIZE PTR [EDI+DSIZE*ECX]
        FSTP    DSIZE PTR [EDI+DSIZE*ECX+DSIZE]
        ADD     ECX, 2          ; counter is now divisible by 4
L2:     TEST    ECX, ECX
        JZ      L4              ; no more operations
L3:     ; main loop:
        FLD     DSIZE PTR [DA]
        FLD     DSIZE PTR [ESI+DSIZE*ECX]
        FMUL    ST, ST(1)
        FLD     DSIZE PTR [ESI+DSIZE*ECX+DSIZE]
        FMUL    ST, ST(2)
        FLD     DSIZE PTR [ESI+DSIZE*ECX+2*DSIZE]
        FMUL    ST, ST(3)
        FXCH    ST(2)
        FSUBR   DSIZE PTR [EDI+DSIZE*ECX]
        FXCH    ST(3)
        FMUL    DSIZE PTR [ESI+DSIZE*ECX+3*DSIZE]
        FXCH
        FSUBR   DSIZE PTR [EDI+DSIZE*ECX+DSIZE]
        FXCH    ST(2)
        FSUBR   DSIZE PTR [EDI+DSIZE*ECX+2*DSIZE]
        FXCH
        FSUBR   DSIZE PTR [EDI+DSIZE*ECX+3*DSIZE]
        FXCH    ST(3)
        FSTP    DSIZE PTR [EDI+DSIZE*ECX]
        FSTP    DSIZE PTR [EDI+DSIZE*ECX+2*DSIZE]
        FSTP    DSIZE PTR [EDI+DSIZE*ECX+DSIZE]
        FSTP    DSIZE PTR [EDI+DSIZE*ECX+3*DSIZE]
        ADD     ECX, 4          ; increment index by 4
        JNZ     L3              ; loop
L4:
```

It is usually quite easy to find a stall-free solution when unrolling by 4, and there is no need for convolution. The number of extra operations to do outside the main loop is N MODULO 4, which can be calculated easily without division, simply by testing the two lowest bits in N. The extra operations are done before the main loop rather than after, to make the handling of the loop counter simpler.
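The division-free remainder computation can be sketched in C (the function name is mine; `TEST AL,1` and `TEST AL,2` in example 1.15 examine exactly these bits):

```c
/* N MOD 4 obtained from the two lowest bits of N, with no division.
   Works for any non-negative element count N. */
int mod4(int N) {
    return N & 3;       /* low two bits = remainder of N / 4 */
}
```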

The tradeoff of loop unrolling is that the extra operations outside the loop are slower due to incomplete overlapping and possible branch mispredictions, and the first time penalty is higher because of increased code size.

As a general recommendation, I would say that if N is big or if convoluting the loop without unrolling cannot remove enough stalls, then you should unroll critical integer loops by 2 and floating point loops by 4.
