Pipelining concept in computer architecture
In this Module, we have three lectures,
1. Introduction to Pipeline Processor
2. Performance Issues
It is observed that organization enhancements to the CPU can improve performance. We have already seen that use of multiple registers rather than a single a accumulator, and use of cache memory improves the performance considerably. Another organizational approach, which is quite common, is instruction pipelining.
Pipelining is a particularly effective way of organizing parallel activity in a computer system. The basic idea is very simple. It is frequently encountered in manufacturing plants, where pipelining is commonly known as an assembly line operation.
By laying the production process out in an assembly line, product at various stages can be worked on simultaneously. This process is also referred to as pipelining, because, as in a pipeline, new inputs are accepted at one end before previously accepted inputs appear as outputs at the other end.
To apply the concept of instruction execution in pipeline, it is required to break the instruction in different task. Each task will be executed in different processing elements of the CPU.
As we know that there are two distinct phases of instruction execution: one is instruction fetch and the other one is instruction execution. Therefore, the processor executes a program by fetching and executing instructions, one after another.
Let and refer to the fetch and execute steps for instruction . Execution of a program consists of a sequence of fetch and execute steps is shown in the figure on the next slide.
Now consider a CPU that has two separate hardware units, one for fetching instructions and another for executing them.
The instruction fetch by the fetch unit is stored in an intermediate storage buffer
. The results of execution are stored in the destination location specified by the instruction.
For simplicity it is assumed that fetch and execute steps of any instruction can be completed in one clock cycle.
The operation of the computer proceeds as follows:
• In the first clock cycle, the fetch unit fetches an instruction (instruction , step ) and stored it in buffer at the end of the clock cycle.
• In the second clock cycle, the instruction fetch unit proceeds with the fetch operation for instruction (step ).
• Meanwhile, the execution unit performs the operation specified by instruction which is already fetched and available in the buffer (step ).
• By the end of the second clock cycle, the execution of the instruction is completed and instruction is available.
• Instruction is stored in buffer replacing , which is no longer needed.
• Step is performed by the execution unit during the third clock cycle, while instruction is being fetched by the fetch unit.
• Both the fetch and execute units are kept busy all the time and one instruction is completed after each clock cycle except the first clock cycle.
• If a long sequence of instructions is executed, the completion rate of instruction execution will be twice that achievable by the sequential operation with only one unit that performs both fetch and execute.
Basic idea of instruction pipelining with hardware organization is shown in the figure on the next slide.Another picture is shown on the next slide for instruction piplining.
The processing of an instruction need not be divided into only two steps. To gain further speed up, the pipeline must have more stages.
Let us consider the following decomposition of the instruction execution:
• Fetch Instruction (FI): Read the next expected instruction into a buffer.
• Decode Instruction ((DI): Determine the opcode and the operand specifiers.
• Calculate Operand (CO): calculate the effective address of each source operand.
• Fetch Operands(FO): Fetch each operand from memory.
• Execute Instruction (EI): Perform the indicated operation.
• Write Operand(WO): Store the result in memory.
There will be six different stages for these six subtasks. For the sake of simplicity, let us assume the equal duration to perform all the subtasks. It the six stages are not of equal duration, there will be some waiting involved at various pipeline stages.
The timing diagram for the execution of instruction in pipeline fashion is shown in the figure on the next slide.
From this timing diagram it is clear that the total execution time of 8 instructions in this 6 stages pipeline is 13-time unit. The first instruction gets completed after 6 time unit, and there after in each time unit it completes one instruction.
Without pipeline, the total time required to complete 8 instructions would have been 48 (6 X
time unit. Therefore, there is a speed up in pipeline processing and the speed up is related to the number of stages.
The cycle time of an instruction pipeline is the time needed to advance a set of instructions one stage through the pipeline. The cycle time can be determined as
maximum stage delay (delay through stage which experience the largest delay)
number of stages in the instruction pipeline.
time delay of a latch, needed to advance signals and data from one stage to the next.
In general, the time delay is equivalent to a clock pulse and
Now suppose that n instructions are processed and these instructions are executed one after another. The total time required to execute all n instructions is
A total of k cycles are required to complete the execution of the first instruction, and the remaining (n-1) instructions require (n-1) cycles.
The time required to execute n instructions without pipeline is
because to execute one instruction it will take cycle.
The speed up factor for the instruction pipeline compared to execution without the pipeline is defined as:
In general, the number of instruction executed is much more higher than the number of stages in the pipeline So, the n tends to , we have
i.e. We have a k fold speed up, the speed up factor is a function of the number of stages in the instruction pipeline.
Though, it has been seen that the speed up is proportional to number of stages in the pipeline, but in practice the speed up is less due to some practical reason. The factors that affect the pipeline performance is discussed next.Effect of Intermediate storage buffer:
Consider a pipeline processor, which process each instruction in four steps;
F: Fetch, Read the instruction from the memory
D: Decode, decode the instruction and fetch the source operand (S)
O: Operate, perform the operation
W: Write, store the result in the destination location.
The hardware organization of this four-stage pipeline processor is shown in the figure on the next slide.
In the preceding section we have seen that the speed up of pipeline processor is related to number of stages in the pipeline, i.e, the greater the number of stages in the pipeline, the faster the execution rate.
But the organization of the stages of a pipeline is a complex task and if affects the performance of the pipeline.The problem related to more number of stages:
At each stage of the pipeline, there is some overhead involved in moving data from buffer to buffer and in performing various preparation and delivery functions. This overhead can appreciably lengthen the total execution time of a single instruction.
The amount of control logic required to handle memory and register dependencies and to optimize the use of the pipeline increases enormously with the number of stages.
Apart from hardware organization, there are some other reasons which may effect the performance of the pipeline.
(A) Unequal time requirement to complete a subtask:
Consider the four-stage pipeline with processing step Fetch, Decode, Operand and write.
The stage-3 of the pipeline is responsible for arithmetic and logic operation, and in general one clock cycle is assigned for this task
Although this may be sufficient for most operations, but some operations like divide may require more time to complete.
Following figure shows the effect of an operation that takes more than one clock cycle to complete an operation in operate stage.
Instructions 1 2 3 4 5 6 7 8 9
The operate stage for instruction takes 3 clock cycle to perform the specified operation. Clock cycle 4 to 6 required to perform this operation and so write stage is doing nothing during the clock cycle 5 and 6, because no data is available to write.
Meanwhile, the information in buffer B2 must remain intake until the operate stage has completed its operation.
This means that stage 2 and stage 1 are blocked from accepting new instructions because the information in B1 cannot be overwritten by a new fetch instruction.
The contents of B1, B2 and B3 must always change at the same clock edge.
Due to that reason, pipeline operation is said to have been stalled for two clock cycle. Normal pipeline operation resumes in clock cycle 7.
Whenever the pipeline stalled, some degradation in performance occurs.
Role of cache memory:
The use of cache memory solves the memory access problem
Occasionally, a memory request results in a cache miss. This causes the pipeline stage that issued the memory request to take much longer time to complete its task and in this case the pipeline stalls. The effect of cache miss in pipeline processing is shown in the figure.
Instructions 1 2 3 4 5 6 7 8 9
Stages 1 2 3 4 5 6 7 8 9 10
F: Fetch F1 F2 F2 F2 F2 F4 F3
D: Decode D1 idle idle idle D2 D3
O: Operate O1 idle idle idle O2 O3
W: Write W1 idle idle idle W2 W3
to be continued.......