Function performed by each stage as a function of time.
In this example, instruction I1 is fetched from the cache in cycle 1 and its execution proceeds normally.
The fetch operation for instruction I2, which starts in cycle 2, results in a cache miss.
The instruction fetch unit must now suspend any further fetch requests and wait for I2 to arrive.
We assume that instruction I2 is received and loaded into buffer B1 at the end of cycle 5. This suggests that the cache memory used here is four times faster than the main memory.
The pipeline resumes its normal operation at that point, and it will remain in normal operation mode for some time, because a cache miss generally transfers a whole block from main memory to the cache.
From the figure, it is clear that the Decode, Operate, and Write units remain idle for three clock cycles.
Such idle periods are sometimes referred to as bubbles in the pipeline.
Once created as a result of a delay in one of the pipeline stages, a bubble moves downstream until it reaches the last unit. A pipeline will not stall as long as the instructions and data being accessed reside in the cache. This is facilitated by providing separate on-chip instruction and data caches.
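The stall behaviour above can be sketched in a few lines of Python. This is a hypothetical, illustrative model (the function name and parameters are assumptions, not from the source): a 4-stage pipeline (Fetch, Decode, Operate, Write) in which one fetch misses in the cache and stalls for three extra cycles, creating a 3-cycle bubble that moves downstream unchanged.

```python
def pipeline_schedule(n_instr, miss_instr, miss_delay):
    """Return the completion cycle (Write stage) of each instruction.

    Without stalls, instruction i (0-based) completes in cycle i + 4.
    A cache miss on `miss_instr` delays its fetch by `miss_delay`
    cycles; the delay propagates to every later instruction.
    """
    completion = []
    delay = 0
    for i in range(n_instr):
        if i == miss_instr:
            delay += miss_delay          # fetch stalls; a bubble is created
        completion.append(i + 4 + delay) # bubble moves downstream unchanged
    return completion

# Instruction I2 (index 1) misses; its fetch takes 4 cycles instead of 1.
print(pipeline_schedule(4, 1, 3))  # [4, 8, 9, 10]
```

This reproduces the example in the text: I1 completes in cycle 4, while I2's fetch occupies cycles 2 to 5, so I2 completes in cycle 8.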
Consider the following program that contains two instructions, I1 followed by I2:
I1: A ← A + 5
I2: B ← 3 * A
When this program is executed in a pipeline, the execution of I2 can begin before the execution of I1 completes. The pipeline execution is shown below.
Clock cycle   1    2    3    4    5    6
I1            F1   D1   O1   W1
I2                 F2   D2   O2   W2
(F: Fetch, D: Decode, O: Operate, W: Write)
In clock cycle 3, the actual operation of instruction I1, i.e. the addition, takes place, and only at that point is the new, updated value of A available. But in the same clock cycle 3, instruction I2 is fetching the operand required for its operation. Since the operation of I1 takes place in clock cycle 3 itself, instruction I2 will get the old value of A, not the updated value, and will produce a wrong result. Consider that the initial value of A is 4. The proper execution will produce the result
A = 9, B = 27
But due to the pipeline action, we will get the result
A = 9, B = 12
Due to this data dependency, these two instructions cannot be performed in parallel.
Therefore, no two operations that depend on each other can be performed in parallel. For correct execution, it is required to satisfy the following:
• The operation of the fetch stage must not depend on the operation performed during the same clock cycle by the execution stage.
• The operation of fetching an instruction must be independent of the execution results of the previous instruction.
• The dependency of data arises when the destination of one instruction is used as a source in a subsequent instruction.
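The read-after-write hazard described above can be made concrete with a small Python sketch (hypothetical and illustrative; the function names are assumptions). The "hazard" version models what happens when I2 reads its operand in the cycle where I1 is still computing A + 5, so it sees the stale value.

```python
def run_correct(a):
    a = a + 5          # I1: A <- A + 5
    b = 3 * a          # I2: B <- 3 * A, reads the updated A
    return a, b

def run_hazard(a):
    old_a = a          # I2 fetches its operand while I1 still executes
    a = a + 5          # I1 writes the new value, but too late
    b = 3 * old_a      # I2 used the stale operand
    return a, b

print(run_correct(4))  # (9, 27)
print(run_hazard(4))   # (9, 12)
```

With the initial value A = 4, the hazard version reproduces the wrong result B = 12 from the example above.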
In general, when we are executing a program, the next instruction to be executed is brought from the next memory location. Therefore, in a pipeline organization, we fetch instructions one after another.
But in the case of a conditional branch instruction, the address of the next instruction to be fetched depends on the result of the execution of the branch instruction.
Since the execution of the next instruction depends on the outcome of the branch instruction, it may sometimes be necessary to invalidate several instruction fetches. Consider the following instruction execution sequence:
I1, I2, I3, I4, ..., I10
In this instruction sequence, consider that I3 is a conditional branch instruction.
The result of instruction I3 will be available at clock cycle 5. But by that time the fetch unit has already fetched instructions I4 and I5.
If the branch condition is false, the branch won't take place and the next instruction to be executed is I4, which has already been fetched and is available for execution.
Now consider the case where the condition is true, so that instruction I10 has to be executed. Only after clock cycle 5 is it known that the branch condition is true and that instruction I10 must be executed next.
But the processor has already fetched instructions I4 and I5. These two fetched instructions must be invalidated, and the pipeline must be loaded with the new destination instruction I10.
Due to this, the pipeline will stall for some time. The time lost due to a branch instruction is often referred to as the branch penalty.
Instruction   1    2    3    4    5    6    7    8    9    10
I1            F1   D1   O1   W1
I2                 F2   D2   O2   W2
I3                      F3   D3   O3   W3
I4                           F4   D4
I10                               F10  D10  O10  W10
I11                                    F11  D11  O11  W11
The effect of the branch being taken is shown in the table above. Instructions I4 and I5, which had already been fetched, are not executed, and the new instruction I10 is fetched at clock cycle 6.
There is no useful output in clock cycles 7 and 8, and so the branch penalty is 2.
The branch penalty depends on the number of stages in the pipeline. More stages result in a larger branch penalty.
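The cost of the branch penalty can be estimated with a back-of-the-envelope model (this is an assumed simplification, not from the source): each instruction takes one cycle in a full pipeline, plus a fixed number of penalty cycles whenever a conditional branch actually changes the flow of control.

```python
def effective_cpi(branch_fraction, taken_fraction, penalty):
    """Average cycles per instruction with a fixed taken-branch penalty."""
    return 1 + branch_fraction * taken_fraction * penalty

# Assumed workload: 20% branches, 60% of them taken,
# and the 2-cycle penalty from the example above.
print(round(effective_cpi(0.20, 0.60, 2), 2))  # 1.24
```

Under these assumed numbers, a 2-cycle penalty already costs about 24% of throughput; a deeper pipeline with a larger penalty loses proportionally more.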
Dealing with Branches:
One of the major problems in designing an instruction pipeline is ensuring a steady flow of instructions to the initial stages of the pipeline.
The primary problem is the conditional branch instruction: until the instruction is actually executed, it is impossible to determine whether the branch will be taken or not.
A variety of approaches have been taken for dealing with conditional branches:
• Multiple streams
• Prefetch branch target
• Loop buffer
• Branch prediction
• Delayed branch
A single pipeline suffers a penalty for a branch instruction because it must choose one of two instructions to fetch next, and it may sometimes make the wrong choice.
A brute-force approach is to replicate the initial portions of the pipeline and allow the pipeline to fetch both instructions, making use of two streams.
There are two problems with this approach.
• With multiple pipelines there are contention delays for access to the registers and to memory.
• Additional branch instructions may enter the pipeline (in either stream) before the original branch decision is resolved. Each such instruction would need an additional stream.
Prefetch Branch Target
When a conditional branch is recognized, the target of the branch is prefetched, in addition to the instruction following the branch. The target is then saved until the branch instruction is executed. If the branch is taken, the target has already been prefetched.
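The idea can be sketched in Python (an illustrative model, not a real microarchitecture; all names here are assumptions): when a branch is decoded, fetch both the fall-through instruction and the target, and keep the target until the branch resolves.

```python
def fetch(memory, pc):
    """Model an instruction fetch as a dictionary lookup."""
    return memory[pc]

def execute_branch(memory, pc, target, taken):
    fall_through = fetch(memory, pc + 1)       # normal sequential prefetch
    prefetched_target = fetch(memory, target)  # saved branch target
    # When the branch resolves, the right instruction is already on hand:
    return prefetched_target if taken else fall_through

program = {0: "branch 3", 1: "add", 2: "sub", 3: "mul"}
print(execute_branch(program, 0, 3, taken=True))   # mul
print(execute_branch(program, 0, 3, taken=False))  # add
```

Either way the branch resolves, the next instruction has already been fetched, at the cost of one extra fetch per branch.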
Loop Buffer
A loop buffer is a small, very high-speed memory maintained by the instruction fetch stage of the pipeline, containing the most recently fetched instructions in sequence.
If a branch is to be taken, the hardware first checks whether the branch target is within the buffer. If so, the next instruction is fetched from the buffer.
The loop buffer has three benefits:
1. With the use of prefetching, the loop buffer will contain some instructions sequentially ahead of the current instruction fetch address. Thus, instructions fetched in sequence will be available without the usual memory access time.
2. If a branch occurs to a target just a few locations ahead of the address of the branch instruction, the target will already be in the buffer. This is useful for the common occurrence of IF-THEN and IF-THEN-ELSE sequences.
3. This strategy is particularly well suited for dealing with loops, or iterations; hence the name loop buffer. If the loop buffer is large enough to contain all the instructions in a loop, then those instructions need to be fetched from memory only once, for the first iteration. For subsequent iterations, all the needed instructions are already in the buffer.
The loop buffer is similar in principle to a cache dedicated to instructions. The differences are that the loop buffer only retains instructions in sequence and is much smaller in size and hence lower in cost.
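A loop buffer can be modelled as a small, address-indexed window over the most recently fetched instructions. The Python sketch below is hypothetical and illustrative (class and method names are assumptions): a taken branch first checks the buffer, and only goes to memory on a miss.

```python
class LoopBuffer:
    def __init__(self, size):
        self.size = size
        self.buf = {}                    # address -> instruction

    def fill(self, addr, instr):
        self.buf[addr] = instr
        if len(self.buf) > self.size:    # retain only the most recently
            self.buf.pop(min(self.buf))  # fetched instructions, in sequence

    def fetch(self, addr, memory):
        if addr in self.buf:             # branch target already buffered:
            return self.buf[addr], True  # no memory access needed
        return memory[addr], False

memory = {i: f"instr{i}" for i in range(100)}
lb = LoopBuffer(size=4)
for a in range(10, 14):                  # fetch a 4-instruction loop body
    lb.fill(a, memory[a])
print(lb.fetch(10, memory))  # ('instr10', True)  -- backward branch hits
print(lb.fetch(50, memory))  # ('instr50', False) -- distant target misses
```

Once the whole loop body fits in the buffer, every iteration after the first is served without touching memory, which is exactly the third benefit listed above.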
Branch Prediction:
Various techniques can be used to predict whether a branch will be taken or not. The most common techniques are:
• Predict never taken
• Predict always taken
• Predict by opcode
• Taken/not taken switch
• Branch history table.
The first three approaches are static; they do not depend on the execution history up to the time of the conditional branch instruction.
The latter two approaches are dynamic: they depend on the execution history.
Predict never taken always assumes that the branch will not be taken and continues to fetch instructions in sequence.
Predict always taken assumes that the branch will be taken and always fetches the branch target.
In these two approaches it is also possible to minimize the effect of a wrong decision.
If the fetch of an instruction after the branch will cause a page fault or protection violation, the processor halts its prefetching until it is sure that the instruction should be fetched.
Studies analyzing program behaviour have shown that conditional branches are taken more than 50% of the time, and so if the cost of prefetching from either path is the same, then always prefetching from the branch target address should give better performance than always prefetching from the sequential path.
However, in a paged machine, prefetching the branch target is more likely to cause a page fault than prefetching the next instruction in the sequence and so this performance penalty should be taken into account.
The predict-by-opcode approach makes the decision based on the opcode of the branch instruction. The processor assumes that the branch will be taken for certain branch opcodes and not for others. Studies reported in
Lilja, D., "Reducing the Branch Penalty in Pipelined Processors," Computer, July 1988, showed that the success rate is greater than 75% with this strategy.
Dynamic branch strategies attempt to improve the accuracy of prediction by recording the history of conditional branch instructions in a program. Schemes used to maintain the history information:
• One or more bits can be associated with each conditional branch instruction that reflect the recent history of the instruction.
• These bits are referred to as a taken/not taken switch that directs the processor to make a particular decision the next time the instruction is encountered.
• Generally these history bits are not associated with the instruction in main memory, as that would unnecessarily increase the size of the instruction. With a single bit we can record whether the last execution of the instruction resulted in a branch or not.
• With only one bit of history, an error in prediction will occur twice for each use of the loop: once on entering the loop, and once on exiting.
If two bits are used, they can record the results of the last two executions of the associated instruction.
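A common realization of the two-history-bit idea is a two-bit saturating counter; the Python sketch below is illustrative (the class name and state encoding are assumptions). States 0 and 1 predict "not taken", states 2 and 3 predict "taken", so a single mispredict does not immediately flip a strongly-biased prediction.

```python
class TwoBitPredictor:
    def __init__(self):
        self.state = 3                      # start in "strongly taken"

    def predict(self):
        return self.state >= 2              # True means "predict taken"

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

# A loop branch: taken 4 times, then not taken once (loop exit), twice over.
p = TwoBitPredictor()
outcomes = [True, True, True, True, False] * 2
mispredicts = 0
for taken in outcomes:
    if p.predict() != taken:
        mispredicts += 1
    p.update(taken)
print(mispredicts)  # 2
```

The predictor mispredicts only once per loop exit; on re-entry it still predicts "taken", avoiding the second mispredict that a one-bit scheme would suffer.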
The history information is not kept in main memory; it can be kept in a small, temporary high-speed memory.
One possibility is to associate these bits with any conditional branch instruction that is in a cache. When the instruction is replaced in the cache, its history is lost.
Another possibility is to maintain a small table for recently executed branch instructions with one or more bits in each entry.
The branch history table is a small cache memory associated with the instruction fetch stage of the pipeline. Each entry in the table consists of three elements:
• The address of the branch instruction.
• Some number of history bits that record the state of use of that instruction.
• Information about the target instruction: it may be the address of the target instruction, or the target instruction itself.
Clock cycle   1     2     3     4     5
Ij            Fj    Ej
Ij+1                Fj+1
Ik                        Fk    Ek
Ik+1                            Fk+1  Ek+1
(Two-stage pipeline: Ij is a taken branch, Ij+1 is fetched and discarded, and execution continues at the branch target Ik.)
For example, consider the following code segment:
I1  LOOP  Shift_left  R1
I2        Decrement   R2
I3        Branch_if≠0 LOOP
I4  NEXT  Add         R1,R3
Original program loop.
Here register R2 is used as a counter to determine the number of times the contents of register R1 are shifted left.
Consider a processor with a two-stage pipeline and one delay slot. During the execution phase of instruction I3 (the branch), the fetch unit will fetch instruction I4. Only after evaluating the branch condition will it be clear whether instruction I1 or instruction I4 is to be executed next.
The nature of the code segment is that execution will remain in the loop for a number of iterations depending on the initial value of R2, and when R2 becomes zero, it will come out of the loop and execute instruction I4. During the loop execution, every taken branch causes a wrong fetch of instruction I4. The code segment can be reorganized without disturbing the original meaning of the program:
LOOP  Decrement   R2
      Branch_if≠0 LOOP
      Shift_left  R1
NEXT  Add         R1,R3
Reordered instructions for the program loop. The Shift_left instruction now occupies the delay slot, so it is executed after the branch on every iteration and does useful work instead of being discarded.
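The gain from filling the delay slot can be estimated with a simple cycle count (an assumed model, with hypothetical function names): in the original loop, one fetched instruction is discarded on every taken branch, while in the reordered loop the delay slot does useful work.

```python
def loop_cycles(n_iterations, useful_per_iter, wasted_per_taken):
    """Total cycles for a loop in a two-stage pipeline (assumed model).

    The last iteration falls through, so there are n_iterations - 1
    taken branches, each costing `wasted_per_taken` bubble cycles.
    """
    return n_iterations * useful_per_iter + (n_iterations - 1) * wasted_per_taken

original  = loop_cycles(10, 3, 1)  # Shift, Decrement, Branch + 1 bubble
reordered = loop_cycles(10, 3, 0)  # Decrement, Branch, Shift in delay slot
print(original, reordered)  # 39 30
```

Under these assumptions, reordering saves one cycle per taken branch: roughly a 23% reduction for a 10-iteration loop, and more for longer-running loops.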