22. Jumps and branches (all processors)
The Pentium family of processors attempt to predict where a jump will go to, and whether a conditional jump will be taken or fall through. If the prediction is correct, then it can save a considerable amount of time by loading the subsequent instructions into the pipeline and start decoding them before the jump is executed. If the prediction turns out to be wrong, then the pipeline has to be flushed, which will cost a penalty depending on the length of the pipeline.
The predictions are based on a Branch Target Buffer (BTB) which stores the history for each branch or jump instruction and makes predictions based on the prior history of executions of each instruction. The BTB is organized like a set-associative cache where new entries are allocated according to a pseudo-random replacement method.
When optimizing code, it is important to minimize the number of misprediction penalties. This requires a good understanding of how the jump prediction works.
The branch prediction mechanisms are not described adequately in Intel manuals or anywhere else. I am therefore giving a very detailed description here. This information is based on my own research (with the help of Karki Jitendra Bahadur for the PPlain).
In the following, I will use the term 'control transfer instruction' for any instruction which can change the instruction pointer, including conditional and unconditional, direct and indirect, near and far, jumps, calls, and returns. All these instructions use prediction.
The branch prediction mechanism for the PPlain is very different from the other three processors. Information found in Intel documents and elsewhere on this subject is directly misleading, and following the advises given is such documents is likely to lead to sub-optimal code.
The PPlain has a branch target buffer (BTB), which can hold information for up to 256 jump instructions. The BTB is organized like a 4-way set-associative cache with 64 entries per way. This means that the BTB can hold no more than 4 entries with the same set value. Unlike the data cache, the BTB uses a pseudo random replacement algorithm, which means that a new entry will not necessarily displace the least recently used entry of the same set-value. How the set-value is calculated will be explained later. Each BTB entry stores the address of the jump target and a prediction state, which can have four different values:
state 0: "strongly not taken"
state 1: "weakly not taken"
state 2: "weakly taken"
state 3: "strongly taken"
A branch instruction is predicted to jump when in state 2 or 3, and to fall through when in state 0 or 1. The state transition works like a two-bit counter, so that the state is incremented when the branch is taken, and decremented when it falls through. The counter saturates, rather than wrap around, so that it does not decrement beyond 0 or increment beyond 3. Ideally, this would provide a reasonably good prediction, because a branch instruction would have to deviate twice from what it does most of the time, before the prediction changes.
However, this mechanism has been compromised by the fact that state 0 also means 'unused BTB entry'. So a BTB entry in state 0 is the same as no BTB entry. This makes sense, because a branch instruction is predicted to fall through if it has no BTB entry. This improves the utilization of the BTB, because a branch instruction which is seldom taken will most of the time not take up any BTB entry.
Now, if a jumping instruction has no BTB entry, then a new BTB entry will be generated, and this new entry will always be set to state 3. This means that it is impossible to go from state 0 to state 1 (except for a very special case discussed later). From state 0 you can only go to state 3, if the branch is taken. If the branch falls through, then it will stay out of the BTB.
This is a serious design flaw. By throwing state 0 entries out of the BTB and always setting new entries to state 3, the designers apparently have given priority to minimizing the first time penalty for unconditional jumps and branches often taken, and ignored that this seriously compromises the basic idea behind the mechanism and reduces the performance in small innermost loops. The consequence of this flaw is, that a branch instruction which falls through most of the time will have up to three times as many mispredictions as a branch instruction which is taken most of the time. (Apparently, Intel engineers have been unaware of this flaw until I published my findings).
You may take this asymmetry into account by organizing your branches so that they are taken more often than not. Consider for example this if-then-else construction:
TEST EAX,EAX JZ A <branch 1> JMP E A: <branch 2> E:
If branch 1 is executed more often than branch 2, and branch 2 is seldom executed twice in succession, then you can reduce the number of branch mispredictions by up to a factor 3 by swapping the two branches so that the branch instruction will jump more often than fall through:
TEST EAX,EAX JNZ A <branch 2> JMP E A: <branch 1> E:
(This is contrary to the recommendations in Intel's manuals and tutorials).
There may be reasons to put the most often executed branch first, however:
These considerations have little weight, however, for small critical loops, so I would still recommend organizing branches with a skewed distribution so that the branch instruction is taken more often than not, unless branch 2 is executed so seldom, that misprediction doesn't matter.
Likewise, you should preferably organize loops with the testing branch instruction at the bottom, as in this example:
MOV ECX, [N] L: MOV [EDI],EAX ADD EDI,4 DEC ECX JNZ L
If N is high, then the JNZ instruction here will be taken more often than not, and never fall through twice in succession.
Consider the situation where a branch is taken every second time. The first time it jumps the BTB entry will go into state 3, and will then alternate between state 2 and 3. It is predicted to jump all the time, which gives 50% mispredictions. Assume now that it deviates from this regular pattern and falls through an extra time. The jump pattern is:
01010100101010101010101, where 0 means nojump, and 1 means jump. ^The extra nojump is indicated with a ^ above. After this incident, the BTB entry will alternate between state 1 and 2, which gives 100% mispredictions. It will continue in this unfortunate mode until there is another deviation from the 0101 pattern. This is the worst case for this branch prediction mechanism.
22.1.2 BTB is looking ahead (PPlain)
The BTB mechanism is counting instruction pairs, rather than single instructions, so you have to know how instructions are pairing in order to analyze where a BTB entry is stored. The BTB entry for any control instruction is attached to the address of the U-pipe instruction in the preceding instruction pair. (An unpaired instruction counts as one pair). Example:
SHR EAX,1 MOV EBX,[ESI] CMP EAX,EBX JB L
Here SHR pairs with MOV, and CMP pairs with JB. The BTB entry for JB L is thus attached to the address of the SHR EAX,1 instruction. When this BTB entry is met, and if it is in state 2 or 3, then the Pentium will read the target address from the BTB entry, and load the instructions following L into the pipeline. This happens before the branch instruction has been decoded, so the Pentium relies solely on the information in the BTB when doing this.
You may remember, that instructions are seldom pairing the first time they are executed (see chapter 8). If the instructions above are not pairing, then the BTB entry should be attached to the address of the CMP instruction, and this entry would be wrong on the next execution, when instructions are pairing. However, in most cases the PPlain is smart enough to not make a BTB entry when there is an unused pairing opportunity, so you don't get a BTB entry until the second execution, and hence you won't get a prediction until the third execution. (In the rare case, where every second instruction is a single-byte instruction, you may get a BTB entry on the first execution which becomes invalid in the second execution, but since the instruction it is attached to will then go to the V-pipe, it is ignored and gives no penalty. A BTB entry is only read if it is attached to the address of a U-pipe instruction).
A BTB entry is identified by its set-value which is equal to bits 0-5 of the address it is attached to. Bits 6-31 are then stored in the BTB as a tag. Addresses which are spaced a multiple of 64 bytes apart will have the same set-value. You can have no more than four BTB entries with the same set-value. If you want to check whether your jump instructions contend for the same BTB entries, then you have to compare bits 0-5 of the addresses of the U-pipe instructions in the preceding instruction pairs. This is very tedious, and I have never heard of anybody doing so. There are no tools available to do this job for you.
22.1.3 Consecutive branches (PPlain)
When a jump is mispredicted, then the pipeline gets flushed. If the next instruction pair executed also contains a control transfer instruction, then the PPlain won't load its target because it cannot load a new target while the pipeline is being flushed. The result is that the second jump instruction is predicted to fall through regardless of the state of its BTB entry. Therefore, if the second jump is also taken, then you will get another penalty. The state of the BTB entry for the second jump instruction does get correctly updated, though. If you have a long chain of control transfer instructions, and the first jump in the chain is mispredicted, then the pipeline will get flushed all the time, and you will get nothing but mispredictions until you meet an instruction pair which does not jump. The most extreme case of this is a loop which jumps to itself: It will get a misprediction penalty for each iteration.
This is not the only problem with consecutive control transfer instructions. Another problem is that you can have another branch instruction between a BTB entry and the control transfer instruction it belongs to. If the first branch instruction jumps to somewhere else, then strange things may happen. Consider this example:
SHR EAX,1 MOV EBX,[ESI] CMP EAX,EBX JB L1 JMP L2 L1: MOV EAX,EBX INC EBX
When JB L1 falls through, then you will get a BTB entry for JMP L2 attached to the address of CMP EAX,EBX. But what will happen when JB L1 later is taken? At the time when the BTB entry for JMP L2 is read, the processor doesn't know that the next instruction pair does not contain a jump instruction, so it will actually predict the instruction pair MOV EAX,EBX / INC EBX to jump to L2. The penalty for predicting non-jump instructions to jump is 3 clock cycles. The BTB entry for JMP L2 will get its state decremented, because it is applied to something which doesn't jump. If we keep going to L1, then the BTB entry for JMP L2 will be decremented to state 1 and 0, so that the problem will disappear until next time JMP L2 is executed.
The penalty for predicting the non-jumping instructions to jump only occurs when the jump to L1 is predicted. In the case that JB L1 is mispredictedly jumping, then the pipeline gets flushed and we won't get the false L2 target loaded, so in this case we will not see the penalty of predicting the non-jumping instructions to jump, but we do get the BTB entry for JMP L2 decremented.
Suppose, now, that we replace the INC EBX instruction above with another jump instruction. This third jump instruction will then use the same BTB entry as JMP L2 with the possible penalty of predicting a wrong target, (unless it happens to also have L2 as target).
To summarize, consecutive jumps can lead to the following problems:
All this mess may give you a lot of penalties, so you should definitely avoid having an instruction pair containing a jump immediately after another poorly predictable control transfer instruction or its target.
It is time for another illustrative example:
CALL P TEST EAX,EAX JZ L2 L1: MOV [EDI],EBX ADD EDI,4 DEC EAX JNZ L1 L2: CALL P
This looks like a quite nice and normal piece of code: A function call, a loop which is bypassed when the count is zero, and another function call. How many problems can you spot in this program?
First, we may note that the function P is called alternatingly from two different locations. This means that the target for the return from P will be changing all the time. Consequently, the return from P will always be mispredicted.
Assume, now, that EAX is zero. The jump to L2 will not have its target loaded because the mispredicted return caused a pipeline flush. Next, the second CALL P will also fail to have its target loaded because JZ L2 caused a pipeline flush. Here we have the situation where a chain of consecutive jumps makes the pipeline flush repeatedly because the first jump was mispredicted. The BTB entry for JZ L2 is stored at the address of P's return instruction. This BTB entry will now be mis-applied to whatever comes after the second CALL P, but that doesn't give a penalty because the pipeline is flushed by the mispredicted second return.
Now, let's see what happens if EAX has a nonzero value the next time: JZ L2 is always predicted to fall through because of the flush. The second CALL P has a BTB entry at the address of TEST EAX,EAX. This entry will be mis-applied to the MOV/ADD pair, predicting it to jump to P. This causes a flush which prevents JNZ L1 from loading its target. If we have been here before, then the second CALL P will have another BTB entry at the address of DEC EAX. On the second and third iteration of the loop, this entry will also be mis-applied to the MOV/ADD pair, until it has had its state decremented to 1 or 0. This will not cause a penalty on the second iteration because the flush from JNZ L1 prevents it from loading its false target, but on the third iteration it will. The subsequent iterations of the loop have no penalties, but when it exits, JNZ L1 is mispredicted. The flush would now prevent CALL P from loading its target, were it not for the fact that the BTB entry for CALL P has already been destroyed by being mis-applied several times.
We can improve this code by putting in some NOP's to separate all consecutive jumps:
CALL P TEST EAX,EAX NOP JZ L2 L1: MOV [EDI],EBX ADD EDI,4 DEC EAX JNZ L1 L2: NOP NOP CALL P
The extra NOP's cost 2 clock cycles, but they save much more. Furthermore, JZ L2 is now moved to the U-pipe which reduces its penalty from 4 to 3 when mispredicted. The only problem that remains is that the returns from P are always mispredicted. This problem can only be solved by replacing the call to P by an inline macro (if you have enough code cache).
The lesson to learn from this example is that you should always look carefully for consecutive jumps and see if you can save time by inserting some NOP's. You should be particularly aware of those situations where misprediction is unavoidable, such as loop exits and returns from procedures which are called from varying locations. If you have something useful to put in, instead of the NOP's, then you should of course do so.
Multiway branches (case statements) may be implemented either as a tree of branch instructions or as a list of jump addresses. If you choose to use a tree of branch instructions, then you have to include some NOP's or other instructions to separate the consecutive branches. A list of jump addresses may therefore be a better solution on the PPlain. The list of jump addresses should be placed in the data segment. Never put data in the code segment!
22.1.4 Tight loops (PPlain)
In a small loop you will often access the same BTB entry repeatedly with small intervals. This never causes a stall. Rather than waiting for a BTB entry to be updated, the PPlain somehow bypasses the pipeline and gets the resulting state from the last jump before it has been written to the BTB. This mechanism is almost transparent to the user, but it does in some cases have funny effects: You can see a branch prediction going from state 0 to state 1, rather than to state 3, if the zero has not yet been written to the BTB. This happens if the loop has no more than four instruction pairs. In loops with only two instruction pairs you may sometimes have state 0 for two consecutive iterations without going out of the BTB. In such small loops it also happens in rare cases that the prediction uses the state resulting from two iterations ago, rather than from the last iteration. These funny effects will usually not have any negative effects on performance.