Using value prediction as a complexity-effective solution to improve performance
Rafael A. Moreno
Dept. Arquitectura de Computadores y Automática, Universidad Complutense de Madrid. Fac. de CC. Físicas. 28040 Madrid - SPAIN
[email protected]
Informe Técnico - Technical Report DACYA-UCM 5/98
1 Introduction
In recent years, much of the research effort on processor microarchitecture has been focused on increasing the ILP (Instruction-Level Parallelism) in order to improve performance. However, even though processor performance grows with every new generation, the average ILP actually achieved falls ever further below the maximum achievable ILP. In a superscalar processor with a specific width (number of instructions issued per cycle), there are many factors that can affect the ILP [Wall93]. Some of the most important ones are the instruction window size, the number of functional units, the control flow, and the data flow. Several techniques for eliminating control and data dependencies have been proposed, mainly based on branch and data prediction and speculative execution. While branch prediction [Smit81], [Yeh92], [McFa93] can be considered today a classical technique, deeply studied, widely accepted, and broadly used in all modern superscalar processors, value prediction is still a very novel technique whose actual efficacy remains to be proved. Recent works on value prediction have shown that data are predictable [Lipa96a], [Saze97a], [Wang97], and several techniques for value prediction have been proposed [Lipa96a], [Lipa97], [Saze97a], [Gabb97], [Nakr99]. The results obtained are very promising: using sophisticated value predictors, about 60% of the output values in a program can be predicted correctly. However, only a few works have shown the actual potential of value prediction to increase ILP under realistic conditions [Rych98]. Unfortunately, the results are not very optimistic: using these sophisticated value predictors, we obtain, on average, a speedup of no more than 10-15%. On the other hand, a common technique used in current processors to overcome the problem of data dependencies is the use of a large instruction window.
If we increase the window size, the possibility of finding independent instructions ready to be executed grows, and so performance improves. However, increasing the window size can have negative repercussions on the cycle time. A recent study of the complexity of superscalar processors [Pala97] shows that the delay of some critical pieces of pipeline logic (such as the register rename logic, the wakeup logic, the instruction selection logic, and the data bypass logic) increases when we enlarge the window size or increase the issue width. This dependence can be logarithmic, linear, or even quadratic, depending on the piece of logic under consideration. In this paper we study the effect of the window size on value prediction. We will show that the maximum potential of value prediction to increase ILP is achieved with small window sizes, and that it diminishes as the instruction window grows. Thus, using value prediction and small windows, we can obtain the same or even higher performance than using larger windows without value prediction. In addition, we will also study the effects of other factors (fetch width, realistic branch prediction, etc.) on value prediction. The rest of the paper is organized as follows. Section 2 summarizes the previous work on data value prediction. Section 3 presents the baseline architecture used for the experiments. Section 4 describes the low-cost predictor that we propose. Section 5 presents the results, and Section 6 concludes the paper.
2 Previous work
Early works on value prediction [Lipa96a], [Lipa97] showed that instructions exhibit a new kind of locality, called value locality, which means that the values generated by a given static instruction tend to be repeated for a large fraction of the time. This property makes data predictable. Another important conclusion of this work is that, with small differences, the predictability rate is very similar across different
processors. This suggests that value locality stems from program construction itself, rather than from particular compilers or architectures. In a later work, Sazeides et al. [Saze97a] state that the predictability of a value sequence is a function of both the sequence itself and the predictor used to predict it. Accordingly, there are predictable sequences, such as stride sequences, that do not exhibit value locality. That work gives a first classification of the most common predictable value sequences found in programs (constant, stride, non-stride, repeated stride, and repeated non-stride). They also propose and analyze two types of prediction models: computational predictors (last-value, stride, and others) and context-based predictors. Most of the value predictors proposed in the literature fit one of the following types:
- Last-value predictors (LVP), which make a prediction based on the last outcome of the same static instruction, and can correctly predict constant sequences of data [Lipa96a], [Gabb97], [Wang97].
- Stride predictors (SP), which make a prediction based on the last outcome plus a constant stride, and can correctly predict arithmetic sequences of data (including constant sequences, whose stride is 0) [Gabb97], [Nakr99], [Wang97].
- Context-based predictors (CBP), which learn the values that follow a particular context and make a prediction based on the last values generated by the same instruction. They can correctly predict repetitive sequences of data (including repeated stride sequences) [Saze97a], [Saze97b], [Wang97].
- Hybrid predictors (HP), which combine some of the previous predictors and include a selection mechanism, either hardware [Wang97], [Rych98] or software [Gabb97], [Gabb98b].
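The first three predictor classes can be sketched as simple table-based structures indexed by the static instruction's PC. This is an illustrative sketch of the concepts, not any of the cited implementations: all class and field names are our own, and real predictors use finite, tagged hardware tables rather than unbounded dictionaries.

```python
# Illustrative sketches of the value-predictor classes, indexed by
# the static instruction's PC.

class LastValuePredictor:
    def __init__(self):
        self.table = {}                    # PC -> last outcome

    def predict(self, pc):
        return self.table.get(pc)          # None if no history yet

    def update(self, pc, value):
        self.table[pc] = value

class StridePredictor:
    def __init__(self):
        self.table = {}                    # PC -> (last value, stride)

    def predict(self, pc):
        if pc not in self.table:
            return None
        last, stride = self.table[pc]
        return last + stride               # constant sequences have stride 0

    def update(self, pc, value):
        last, _ = self.table.get(pc, (value, 0))
        self.table[pc] = (value, value - last)

class ContextBasedPredictor:
    """Order-2 context predictor: learns the value that followed each
    pair of recent outcomes of the same static instruction."""
    def __init__(self):
        self.history = {}                  # PC -> last two values
        self.patterns = {}                 # (PC, context) -> next value

    def predict(self, pc):
        ctx = self.history.get(pc)
        return self.patterns.get((pc, ctx))

    def update(self, pc, value):
        ctx = self.history.get(pc)
        if ctx is not None:
            self.patterns[(pc, ctx)] = value
        self.history[pc] = (ctx[1] if ctx else None, value)
```

A hybrid predictor would wrap several of these components and add a per-entry selection mechanism that chooses which component's prediction to use.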
Several different implementations of each kind of predictor have been proposed. If we compare the different proposals and their respective results, we can draw two important conclusions: 1) The more complex the predictor, the higher the prediction accuracy (percentage of correct predictions), but also the more expensive the hardware needed for its implementation, as shown in Fig. 1.
[Figure 1 orders the predictors by increasing prediction accuracy and hardware cost: last-value, stride, hybrid (last + stride), context based, hybrid (last + stride + context).]
Fig. 1. Value prediction classification
2) All the implementations include some confidence mechanism, in order to reduce the number of prediction misses. High miss-prediction rates can affect performance negatively, because of the need to re-execute the miss-predicted instructions, and also because these instructions take resources that could be used by other instructions. There are a number of other techniques to exploit value locality. Some of the most remarkable ones are address prediction, which predicts the addresses of load instructions and prefetches the data from the cache [Gonz97], and instruction reuse [Soda97], which uses a table to look up results computed with the same inputs in the past.
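A common confidence mechanism is a small saturating counter per predictor entry, incremented on correct predictions and reset on misses; a prediction is used only when the counter is at or above a threshold. The sketch below is a generic illustration; the counter width and threshold are our assumptions, not taken from the cited works.

```python
# Sketch of a 2-bit saturating confidence counter per predictor
# entry: predict only above a threshold, so cold or unstable entries
# do not trigger costly miss-speculation. Parameters are illustrative.

class ConfidenceCounter:
    def __init__(self, max_count=3, threshold=2):
        self.count = 0
        self.max_count = max_count
        self.threshold = threshold

    def confident(self):
        return self.count >= self.threshold

    def record(self, prediction_correct):
        if prediction_correct:
            self.count = min(self.count + 1, self.max_count)
        else:
            self.count = 0          # reset on a miss-prediction
```

Resetting (rather than decrementing) on a miss is a deliberately conservative choice: it trades a few missed prediction opportunities for fewer costly re-executions.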
3 Baseline architecture
First, we study the original baseline architecture, without support for value prediction, used as the starting point for our experiments. Next, we state the main design decisions we have made to support value prediction. Finally, we detail the modified baseline architecture that supports value prediction according to those design decisions.
3.1 The original baseline model
The simulator used in this work is derived from the SimpleScalar 3.0 tool set [Burg97], so the baseline architecture is that of the SimpleScalar out-of-order simulator, which is based on the Register Update Unit (RUU) scheme [Sohi90]. The RUU unifies under the same structure the instruction window, the rename logic, and the reorder buffer. In this way, the RUU is a central reservation station, which is also used to automatically rename registers and hold the results of pending instructions. Furthermore, each cycle the RUU retires completed instructions to the architected register file. A scheme of this architecture is shown in Fig. 2. In the fetch stage, instructions are read from the instruction cache and stored in the Instruction Fetch Queue (IFQ), where they wait until they can be transferred to the instruction window (i.e., the RUU). The fetch width and the IFQ size are two configurable parameters, and they are usually given the same value. The number of instructions fetched in a particular cycle is the minimum of the fetch width and the number of free slots in the IFQ. Branch prediction is performed in this stage; the type of branch predictor is also selectable by the user.
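The fetch-count rule just described is simply the following (function and parameter names are ours):

```python
# Instructions fetched in one cycle: the minimum of the fetch width
# and the number of free slots in the IFQ.

def instructions_fetched(fetch_width, ifq_size, ifq_occupancy):
    free_slots = ifq_size - ifq_occupancy
    return min(fetch_width, free_slots)
```

For example, with a fetch width of 4 and only 2 free IFQ slots, just 2 instructions are fetched that cycle.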
[Figure 2 depicts the pipeline: fetch (fetch mechanism, Instruction Fetch Queue); decode/dispatch (decode & rename, Register Update Unit, Load/Store Queue); execute (scheduling & bypass logic feeding the Int ALUs, FP ALUs, Int Mul/Div and FP Mul/Div units, and the L1/L2 D-caches and virtual memory); write-back (RUU, LSQ); and commit (commit logic).]
Fig. 2. Original baseline model
In the dispatch stage the instructions are decoded and assigned to the next free reservation station of the RUU, if any. The RUU also acts as a circular reorder buffer, so instructions are placed into the RUU in strict program order. Fig. 3 shows the detail of a reservation station (RS). The RUU also implements the rename mechanism, so that each RS is assigned a fixed and unique destination register tag. In addition, for each source operand, the RS contains three fields: the tag of the register assigned to this input (Srci Tag), a field to store the operand's value itself (Srci Value), and a valid bit (V) to indicate whether the value is already available. Furthermore, the RS also stores the opcode of the instruction, the identifier of the destination register (Dst Tag), and an execution bit (E) that is set when the instruction completes execution.
[Figure 3 shows the reservation-station fields: Opcode, E, Dst Tag, Src0 Tag, Src0 Value, V, Src1 Tag, Src1 Value, V.]
Fig. 3. Detail of a reservation station (a RUU entry)
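As an illustration, the RS fields of Fig. 3 can be modeled as a record. This sketch is ours (field names mirror the figure, types are arbitrary), and the `ready` helper simply expresses the operand-ready condition used by the issue logic.

```python
# Sketch of the reservation-station fields from Fig. 3 as a record.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SourceOperand:
    tag: int                      # Srci Tag: renamed register tag
    value: Optional[int] = None   # Srci Value
    valid: bool = False           # V: value already available?

@dataclass
class ReservationStation:
    opcode: str
    dst_tag: int                  # fixed, unique destination tag
    sources: List[SourceOperand] = field(default_factory=list)
    executed: bool = False        # E: set when execution completes

    def ready(self):
        # operand-ready: all source values available
        return all(s.valid for s in self.sources)
```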
In addition, the Load/Store Queue (LSQ) is used to achieve memory synchronization and communication. Loads and stores are split into two operations: the address computation, which is translated into an ADD instruction and placed into the RUU; and the load or store itself, which is kept in the LSQ in strict program order. In this original baseline model a strong memory dependence model is assumed, so that an unresolved store prevents any subsequent load from executing. The maximum number of instructions dispatched per cycle, the RUU size, and the LSQ size are configurable parameters. Usually, the dispatch width is given the same value as the fetch width. In the execute stage, operand-ready instructions are issued to the functional units for execution. The scheduling mechanism is simple: older instructions are issued first, whenever an appropriate functional unit or memory port is available. The number of functional units and memory ports is also configurable. Bypass logic allows forwarding operand values from instructions that have been executed but have not yet written their results into the RUU, in order to allow the earlier execution of dependent instructions. In the write-back stage the output values generated in the execution are copied into the corresponding source registers of the RUU, and the valid bit is set. To do that, the values generated in the execute stage are transmitted to the reservation stations along with their respective tags. All the active reservation stations compare every transmitted tag with the tags associated with their source operands; if a comparison matches, the value is copied into the reservation station. Finally, in the commit stage, all the non-speculative completed instructions are retired in strict program order to preserve the sequential consistency of the program. The commit width is also configurable, and is usually given the same value as the fetch and issue widths.

3.2 Design decisions
A number of design alternatives must be considered when we deal with value prediction. The main design decisions we have made are the following:
a) The value predictor
Many different predictors have been proposed in the literature. In this paper we work with two types: a perfect value predictor, which will allow us to study the maximum potential of value prediction and analyze its limits; and a low-cost hybrid predictor, an inexpensive implementation of a last-value + stride predictor. The details of these two predictors are discussed in the next section.
b) The updating policy
The predictor must be updated whenever a new value is generated. The mechanism for updating the predictor depends on the kind of predictor itself. However, we can distinguish three main updating policies: perfect updating, late updating, and speculative updating. With perfect updating the predictor is updated just after being accessed, assuming that the actual value is available at that moment. This is the most effective policy, but it is unrealistic and can only be achieved in a simulation environment. With late updating the predictor is updated after the instruction executes, when the actual output value is available. This policy has an important drawback: if there are short loops in the program, it is possible to fetch several instances of the same static instruction before the first instance has been executed. In this case, the second and subsequent instances read a non-updated value from the predictor table, which is very likely to be incorrect. This problem grows as the instruction fetch bandwidth increases. With speculative updating the predictor is updated just after being accessed, as in perfect updating, but in this case we use the predicted value itself to update the predictor. When the instruction executes, in case of value miss-prediction, the predictor must be updated again using the actual value.
The main problem of this policy arises when a prediction is incorrect, since all the subsequent instances of the same static instruction will read an incorrect value from the predictor. To minimize this effect, it is important to have a highly accurate confidence mechanism.
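The difference between late and speculative updating in a short loop can be illustrated with a stride table entry (the numbers are invented for the example): with late updating, two in-flight instances of the same instruction read the same stale entry, so the second prediction is wrong even though the sequence is a perfect stride; with speculative updating the entry is advanced with the predicted value itself, and the second instance predicts correctly.

```python
# Toy illustration of late vs. speculative updating for a stride
# predictor over the outcome sequence 8, 9, 10, 11, ... when two
# instances of the same static instruction are in flight (the second
# is fetched before the first has executed).

def predict(entry):
    last, stride = entry
    return last + stride

# Table entry trained on outcomes 8, 9 -> (last=9, stride=1).
entry = (9, 1)

# Late updating: both in-flight instances read the same stale entry.
late_p1 = predict(entry)    # instance 1 predicts 10 (correct)
late_p2 = predict(entry)    # instance 2 also predicts 10 (actual is 11)

# Speculative updating: the predicted value updates the entry at once.
entry = (9, 1)
spec_p1 = predict(entry)                 # instance 1 predicts 10
entry = (spec_p1, spec_p1 - entry[0])    # entry speculatively -> (10, 1)
spec_p2 = predict(entry)                 # instance 2 predicts 11 (correct)
```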
In this work we have used a perfect updating mechanism, which removes the negative effect of miss-predictions on predictor updating. This allows us to study the maximum potential of value prediction -isolated from other side effects- which is the main goal of this work.
c) Prediction of input operands vs. prediction of output results
Both approaches are conceptually identical: if the output of an instruction is predictable, the input of the instruction that consumes this result must also be predictable. Predicting operands allows an earlier dispatch, because instructions with predictable inputs can be dispatched in parallel with dependency checking and renaming. However, predicting operands enforces the serialization of the validations which, as we will see later, results in lower performance than parallel validation. On the other hand, predicting results requires checking all the cross dependencies in order to transmit the predicted outputs to the consumer inputs. However, the renaming logic already implements this crossed checking, so no additional logic is needed to support value prediction. Moreover, predicting results allows parallel validation. In our approach we predict output results. We restrict the prediction to those operations that write their result to just one general-purpose register, either integer or floating point. This condition excludes multiplication and division instructions -which use two output registers- as well as double-precision instructions.
d) Speculative forwarding vs. non-speculative forwarding
When we deal with value prediction, we can find three kinds of values: predicted values, speculative values, and actual values. Predicted values are those generated by the value predictor. Speculative values are those results generated by instructions when some of the inputs are either predicted or speculative.
Actual values are those results generated by instructions with actual inputs, i.e., with non-predicted and non-speculative inputs. Instructions executed with predicted or speculative operands are considered speculative, and hence their (speculative) results cannot modify the architected state of the processor because, eventually, they may turn out to be incorrect. Predicted or speculative values are only used when the actual value is not available. With non-speculative forwarding, when an instruction is executed with predicted operands, the speculative result is not sent back to the dependent instructions. So, input values can only be either predicted or actual, but not speculative. On the other hand, in the speculative forwarding approach, speculative results are sent back and used by the dependent instructions. So, input values can be predicted, speculative, or actual. Comparing both approaches, speculative forwarding might look more effective, because if the inputs of an instruction are not predictable but are speculative, this approach allows the instruction to be executed, whereas the non-speculative forwarding approach does not. However, a recent study on program predictability modeling [Saze98] shows that the fraction of instructions with predictable inputs whose output is not also predictable is very small. In other words, most of the speculative outputs are also predictable. So, the effectiveness of both approaches should be very similar.

           Available values            Value stored in the
Predicted   Speculative   Actual      reservation station
   No           No          No            None
   Yes          No          No            Predicted
  Yes/No        Yes         No            Speculative
  Yes/No       Yes/No       Yes           Actual

Table 2. Values stored in the reservation station
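The override rule summarized in Table 2 amounts to a simple priority order, none < predicted < speculative < actual, so a single value slot per input suffices. A sketch of that rule (function and state names are ours):

```python
# Sketch of the operand-override rule of Table 2: a speculative value
# voids a predicted one, and an actual value voids both, so each RS
# input only needs storage for one value.

PRIORITY = {"none": 0, "predicted": 1, "speculative": 2, "actual": 3}

def stored_value(slot_state, slot_value, new_state, new_value):
    """Return the (state, value) kept in the reservation station
    after a new value of state `new_state` arrives."""
    if PRIORITY[new_state] >= PRIORITY[slot_state]:
        return new_state, new_value
    return slot_state, slot_value
```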
Nevertheless, in this work we have used speculative forwarding. With this scheme, it is possible for both the predicted and the speculative values to be available at the input of an instruction. In this case we
always use the speculative value. So, whenever a speculative value is available, it voids any previous predicted value. Likewise, if an actual value is available, it nullifies any previous predicted or speculative value. With this choice, we only need to store one value per input in the reservation station, as shown in Table 2.
e) Parallel validation vs. serial validation
All the instructions executed with predicted or speculative values must be validated or invalidated when the actual values of their inputs become available.
[Fig. 4.a shows a dependence chain A-B-C-D. Fig. 4.b shows serial validation: the completion bits of instructions B, C, and D are set one per cycle (cycles 1-3), as each comparison of predicted against produced value propagates down the chain. Fig. 4.c shows parallel validation: the chained validation signals allow the completion bits of B, C, and D to be set in a single cycle.]
Fig. 4. (a) Dependence chain; (b) serial validation; (c) parallel validation
When we have a chain of dependent instructions executed speculatively, the validation process can be serial or parallel. In serial validation, only one instruction of the chain is validated in each cycle. On the other hand, using parallel validation, the complete chain can be validated in a single cycle. This solution requires more complicated validation hardware, but it is also much more effective than serial validation. Fig. 4 shows a simplified diagram of both types of validation. In this figure, for simplicity, all the operations are assumed to have only one operand. Moreover, it is also assumed that each instruction has a completion bit (C), in addition to the execution bit (E). The E bit is set when the instruction is executed, either with valid operands or with predicted/speculative operands. The C bit is activated when the instruction has been executed and all its operands either have actual values or have been validated. The
instruction can commit only when both bits are activated. The detailed parallel validation logic is defined in the next subsection. As we stated before, parallel validation is only possible with output value prediction, not with input value prediction. If we predict inputs, when an instruction generates a result it always has to pass the result to the dependent instruction, which makes the comparison with the predicted input. If both inputs of this second instruction are valid, then the actual result is passed to the next dependent instruction, and so on. So, in a dependence chain, the actual results must be passed from one instruction to another, one per cycle. With output value prediction, however, the same instruction that generates the value makes the comparison. If this comparison is correct, it is not necessary to pass the actual result to the dependent instruction, but just a validation signal. This validation signal can be chained, as shown in Fig. 4.c, and several dependent instructions can be validated in the same cycle.
f) Full recovery vs. selective invalidation
When an instruction is executed with predicted or speculative operands and some of the inputs turn out to be incorrect (because of a value miss-prediction), the instruction must be re-executed. Full recovery is a solution similar to that used in branch miss-prediction recovery: when an instruction suffers a value miss-prediction, this instruction and all subsequently issued instructions are squashed from the pipeline. This solution is easy to implement, but it is not very effective if the number of value miss-predictions is high, because we squash many instructions from the pipeline, many of which could be correct. With selective invalidation, only the instructions with incorrect operands are invalidated.
Whenever a value miss-prediction is detected, it is necessary to invalidate all the instructions that consumed this value, as well as all the dependent instructions that consumed the speculative values. This solution is much more complicated to implement in hardware, but it is also more performance-effective, because the number of instructions squashed on each miss-prediction is much lower. In this work we have used selective invalidation, in order to study the maximum potential performance improvement achievable with value prediction. However, from the point of view of the hardware implementation, using a good confidence mechanism (in order to minimize the number of value miss-predictions) together with full recovery could be a more appropriate solution.

3.3 The modified baseline model
The modified baseline model includes some changes with respect to the original one, as well as some additional hardware elements. Figure 4 shows a diagram of this architecture. In addition to the value predictor, whose details will be discussed in the next section, the most remarkable differences between this baseline architecture and the original one are the following: A perfect fetch mechanism is assumed, in order to feed the processor with a constant flow of instructions and avoid problems of instruction cache misses, non-contiguous instruction alignment [Rote96], etc. Recently, several mechanisms have been proposed that provide near-perfect fetch [Yeh93], [Rote97]. The reservation station (RS) has been modified in order to support the different types of source operands, as shown in Figure 5. The old valid bit (V) has been replaced by a two-bit state (Srci State) that indicates, as shown in Table 4, whether the source value placed in the RS is predicted, speculative, or actual, or whether, on the contrary, no value is available. In addition, the state of the destination register is also stored in the RS (Dst State). Furthermore, the C (Completion) bit is also included. Table 5 summarizes the meaning of the C and E bits. A perfect disambiguation mechanism is assumed, in order to relax memory dependencies and avoid any interference with value prediction. The perfect disambiguation mechanism allows executing all those load instructions that do not depend on any previous store, even when the address of that previous store is still unresolved. In the literature, we can find several mechanisms for memory dependence prediction [Mosh97] that guarantee high prediction accuracy, close to 100%.
[Figure 4 depicts the modified pipeline: the same stages as Fig. 2, with a value predictor and its updating mechanism attached to the fetch stage, and validation logic added to the write-back stage.]
Figure 4. Modified baseline model
[Figure 5 shows the modified reservation-station fields: Opcode, E, C, Dst Tag, Dst State, Src0 Tag, Src0 Value, Src0 State, Src1 Tag, Src1 Value, Src1 State.]
Figure 5. Modified reservation station
State     Code   Meaning
NO VALID  00     Value not available (producer instruction not executed, and value not predictable)
PRED      01     Predicted value available (producer instruction not executed, but value predictable)
SPEC      10     Speculative value available (producer instruction executed speculatively)
ACTUAL    11     Actual value available (producer instruction executed with actual values)

Table 4. States of source and destination operands
Bit E (Execution)   Bit C (Completion)   Meaning
0                   0                    Instruction not executed
0                   1                    Impossible
1                   0                    Instruction executed speculatively. Instruction cannot commit
1                   1                    Instruction executed and operands validated. Instruction can commit

Table 5. Meaning of bits E and C
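Tables 4 and 5 together define when an instruction may retire. A direct transcription (names are ours):

```python
# Two-bit source/destination states of Table 4.
NO_VALID, PRED, SPEC, ACTUAL = 0b00, 0b01, 0b10, 0b11

# Commit condition of Table 5: an instruction retires only when
# executed (E=1) and fully validated (C=1); E=0, C=1 cannot occur.
def can_commit(e_bit, c_bit):
    assert not (e_bit == 0 and c_bit == 1), "impossible E/C combination"
    return e_bit == 1 and c_bit == 1
```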
The scheduling policy has also been slightly modified, in such a way that instructions with actual operands are issued first, and instructions with predicted or speculative operands are issued later. Within each group, an older-instruction-first policy is used. Using this policy, speculative instructions are not issued while
there are enough non-speculative instructions ready to execute, even if the non-speculative instructions are newer than the speculative ones. Finally, the write-back stage has been adapted to support parallel validation of speculative instructions. Parallel validation implies the need to transmit a validation signal in a chained way; namely, this signal has to travel through all the speculatively executed instructions belonging to the same dependence chain. To implement such a chained mechanism, it is difficult to make use of the same tag-based logic used for write-back. So we propose a new write-back & validation logic based on a set of state and validation lines per physical register. Each physical register has an associated set of three lines: two state lines for transmitting the output register state and one validation line (VAL/INVAL*) for validation/invalidation purposes. Every time an instruction produces a value, it updates the state of the output register (to SPEC or ACTUAL) and transmits the new state on the state lines to the dependent instructions. If the new value validates a previous one (PRED or SPEC), the validation line (VAL/INVAL*) is set to 1, and the value need not be transmitted. Otherwise, if the new value invalidates a previous one, the validation line (VAL/INVAL*) is set to 0 and the new value is transmitted, if it is available. The validation/invalidation signal is propagated through the whole dependency chain, along with the new output register states, in order to perform parallel validation/invalidation. Fig. 6 shows a diagram of the validation logic.
[Figure 6 shows three reservation stations (RS 0, RS 1, RS 3), each holding a predicted value, a comparator between the predicted and produced values, a block of combinational logic, and multiplexers driving, for each physical register (Reg. 0, Reg. 1, Reg. 3), a two-bit OUT-STATE bus and a VAL/INVAL* line.]
Fig 6. Parallel validation logic
The piece of combinational logic attached to each RS is used to generate the new destination register state and the output validation signal, as a function of the states of the source registers, the state of the destination register, the input state lines, the input validation line, and the comparison between the predicted value and the produced value, if any. It is a generalization of the logic shown in Fig. 4.c.
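The behavior of this combinational logic can be sketched in software. The sketch below is our own simplification, not the full logic of Fig. 6: it assumes a single source operand per instruction, and the function name and encoding are illustrative. Given the incoming VAL/INVAL* line and the instruction's own comparison of predicted against produced value, it derives the new output-register state and the outgoing validation line, matching the cases in the examples that follow.

```python
# Sketch of one link of the chained validation logic (one source
# operand assumed; names are illustrative).

def validate(in_val, operand_was_actual, comparison_correct):
    """Return (OUT-STATE, VAL/INVAL*) for one reservation station.

    in_val: VAL/INVAL* received on the source operand's line
            (irrelevant when the operand was already actual).
    """
    operand_ok = operand_was_actual or in_val
    if operand_ok and comparison_correct:
        return "ACTUAL", True    # validate: pass only a signal downstream
    if operand_ok:
        return "ACTUAL", False   # output is correct, but prediction missed
    return "NO VALID", False     # wrong input: must re-execute
```

Chaining the function reproduces the case analysis: when a prediction fails at the head of a chain, the invalidation propagates and every dependent instruction ends up NO VALID, pending re-execution.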
The validation logic can be time-critical, as the delay of the longest combinational path could exceed the cycle time, so the implementation of a more complexity-effective validation logic is an open research topic for future work. In order to explain the function of this logic, let us use some examples that summarize the different cases we can find.
Case 1. Validation of predicted operands
This example is shown in Fig. 7. Instructions A - B - C form a dependence chain, and the outputs of the three instructions are predictable (the rest of the operands are assumed to be actual). Instruction A is allocated to RS0 (output register = R0), instruction B is allocated to RS1 (output register = R1), and instruction C is allocated to RS2 (output register = R2). The three instructions are executed in parallel (instructions B and C use predicted input operands, and so they are speculative). After execution, the three instructions compare their respective predicted output values with the generated values, in order to perform the validation or invalidation. In this case, let us assume that all the comparisons are correct, so each instruction has to send a validation signal along with the new state of its output value. Each instruction transmits the following group of signals:
[Figure 7 shows the pipeline timing (fetch, dispatch, execute, write-back, and commit of A, B, and C in cycles T1-T5) and the reservation-station contents before and after the write-back cycle T4: before T4, B and C hold predicted source operands with E=0 and C=0; after T4, all operands of A, B, and C are Actual, with E=1 and C=1.]
Fig. 7. Example of validation of predicted operands
Instruction A: has actual operands, and the comparison is correct
- Transmits the new state of the output value: OUT-STATE = ACTUAL
- Transmits a validation signal: VAL/INVAL* = 1
Instruction B: receives a validation of the predicted operand, and the comparison is correct
- Transmits the new state of the output value: OUT-STATE = ACTUAL
- Transmits a validation signal: VAL/INVAL* = 1
Instruction C: receives a validation of the predicted operand, and the comparison is correct
- Transmits the new state of the output value: OUT-STATE = ACTUAL
- Transmits a validation signal: VAL/INVAL* = 1
Case 2. Invalidation of predicted operands
This example is shown in Fig. 8. In this case all the comparisons are incorrect, so the instructions have to transmit an invalidation signal along with the new state of their output operands. These signals are as follows:
Instruction A: has actual operands, but the comparison is incorrect
- Transmits the new state of the output value: OUT-STATE = ACTUAL
- Transmits an invalidation signal: VAL/INVAL* = 0
Instruction B: Receives an invalidation of the predicted operand. Furthermore the comparison is incorrect. (Notice that both operands of B are ACTUAL, but the instruction must be re-executed) x
Transmit the new state of the output value: OUT-STATE = NO VALID
x
Transmit an invalidation signal: VAL/INVAL* = 0 Instruction C: Receives an invalidation of the predicted operand. Furthermore the comparison is incorrect. x
Transmit the new state of the output value: OUT-STATE = NO VALID
x
Transmit an invalidation signal: VAL/INVAL* = 0
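The propagation rule behind Cases 1 and 2 can be summarized compactly. The following C sketch is only an illustrative software model of the combinational validation logic (the types, field names, and `validate_chain` helper are hypothetical, not part of the design): it walks a dependence chain in program order and derives the OUT-STATE and VAL/INVAL* signals each instruction transmits.

```c
#include <assert.h>

/* Possible states of an output value (hypothetical encoding). */
typedef enum { PRED, SPEC, ACTUAL, NO_VALID } val_state_t;

typedef struct {
    int cmp_ok;            /* predicted output == generated output?  */
    val_state_t out_state; /* OUT-STATE transmitted with the result  */
    int val;               /* VAL/INVAL* signal (1 = validate)       */
} rs_entry_t;

/* Propagate validation along a dependence chain A -> B -> C ... */
static void validate_chain(rs_entry_t *chain, int n) {
    int upstream_valid = 1;  /* A's real source operands are actual */
    for (int i = 0; i < n; i++) {
        if (!upstream_valid) {
            /* An invalidation arrived: the result was computed from a
             * wrong input, so it must be discarded and re-executed.  */
            chain[i].out_state = NO_VALID;
            chain[i].val = 0;
        } else if (chain[i].cmp_ok) {
            chain[i].out_state = ACTUAL;  /* prediction confirmed */
            chain[i].val = 1;
        } else {
            /* The generated value is the real one, but it differs from
             * the prediction that consumers used: invalidate them.    */
            chain[i].out_state = ACTUAL;
            chain[i].val = 0;
        }
        upstream_valid = chain[i].val;
    }
}
```

With all comparisons correct the chain reproduces Case 1 (every output ends ACTUAL and validated); failing A's comparison reproduces Case 2 (A's output is still ACTUAL, but the invalidation propagates, leaving B and C as NO VALID and forcing their re-execution).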
[Figure: pipeline timing and reservation-station states (RS0-RS2) before and after T4]
Fig. 8 Example of invalidation of predicted operands
Case 3. Validation of predicted and speculative operands

This example is shown in Fig. 9. In this case, let us assume that instruction A is delayed (for example, due to resource limitations). Instructions B and C are executed speculatively, so that instruction C uses the speculative value generated by B. Furthermore, we assume that the comparison of the predicted value and the actual value of instruction A is correct. In this case the signals are as follows:
[Figure: pipeline timing and reservation-station states (RS0-RS2) before and after T5]
Fig. 9 Example of validation of predicted and speculative operands
Instruction A: has actual operands, and the comparison is correct.
- Transmit the new state of the output value: OUT-STATE = ACTUAL
- Transmit a validation signal: VAL/INVAL* = 1

Instruction B: receives a validation of the predicted operand. (It does not need to make any comparison, as its output is speculative, not predicted.)
- Transmit the new state of the output value: OUT-STATE = ACTUAL
- Transmit a validation signal: VAL/INVAL* = 1

Instruction C: receives a validation of the speculative operand. (It does not need to make any comparison, as its output is speculative, not predicted.)
- Transmit the new state of the output value: OUT-STATE = ACTUAL
- Transmit a validation signal: VAL/INVAL* = 1
Case 4. Invalidation of predicted and speculative operands

This example is shown in Fig. 10. Let us assume, as in the previous case, that instruction A is delayed and that instructions B and C are executed speculatively. But now we assume that the comparison of the predicted value and the actual value of instruction A is incorrect. In this case the signals are as follows:
[Figure: pipeline timing and reservation-station states (RS0-RS2) before and after T5]
Fig. 10 Example of invalidation of predicted and speculative operands
Instruction A: has actual operands, but the comparison is incorrect.
- Transmit the new state of the output value: OUT-STATE = ACTUAL
- Transmit an invalidation signal: VAL/INVAL* = 0

Instruction B: receives an invalidation of the predicted operand. (Notice that both operands of B are ACTUAL, but the instruction must be re-executed.)
- Transmit the new state of the output value: OUT-STATE = NO VALID
- Transmit an invalidation signal: VAL/INVAL* = 0

Instruction C: receives an invalidation of the speculative operand.
- Transmit the new state of the output value: OUT-STATE = NO VALID
- Transmit an invalidation signal: VAL/INVAL* = 0

4

The value predictor
In our experiments we have used two predictors: a perfect value predictor, and a low-cost hybrid predictor (last-value + stride). The perfect predictor is implemented by assuming that the actual input values of all the instructions are always available when the instruction is issued. It is thus an ideal predictor with 100% accuracy and a 0% misprediction rate. Although such a predictor is not implementable in practice, it lets us study the maximum potential of value prediction, so that we can determine the upper bound on the performance improvement we can expect to reach with any realistic value predictor.
The second predictor is a low-cost implementation of a hybrid predictor, which combines a last-value predictor and a stride predictor. This predictor lets us study the performance improvement achievable with an inexpensive real predictor, one that obtains a reasonable prediction rate without increasing the hardware complexity too much. Last-value and stride are the two simplest value predictors we can implement, and they provide reasonable prediction accuracy at low hardware cost. The stride predictor achieves, on average, higher prediction accuracy than last-value prediction. The reason is obvious: last-value prediction is a particular case of stride prediction in which the stride is zero, so stride predictors cover many more predictable situations than last-value predictors do. On the other hand, from the point of view of cost, a stride predictor is more expensive than a last-value predictor, because it has to store not only the last outcome of the instruction - like the last-value predictor - but also the stride. However, recent studies [Gabb98a] have shown that value-predictable instructions can be classified into two groups: a small group that exhibits stride behavior, and a large group that presents last-value behavior. If we use a single stride predictor to predict all these instructions, the stride field will not be used efficiently, because most of the time the prediction is done with stride zero. For example, these studies show that the average percentage of non-zero strides in the SPECint95 benchmarks is only 16%. For this reason, several hybrid last-value + stride predictors have been proposed [Gabb97], [Wang97]. One of the most remarkable is the hybrid predictor of Gabbay, which uses a small stride prediction table to predict the instructions that exhibit non-zero-stride behavior, and a large last-value table to predict the zero-stride instructions.
In addition, a software profile-based instruction classification mechanism is used to decide which predictor is to be used with each instruction. We propose a new hybrid last-value + stride predictor that presents two main advantages with respect to Gabbay's. First, the hardware saving is greater because, instead of using two different tables for the last-value and stride predictors, we use two overlapped tables: a large table to store the last-outcome field shared by the last-value and stride predictors, and a small table to store the stride field of the stride predictor. Second, the classification mechanism is implemented in hardware, by means of a state machine, so it is not necessary to perform a previous profiling and marking of the instructions. Fig. 10 shows a diagram of the predictor we propose, and Fig. 11 shows its state machine.
[Figure: predictor organization - the instruction address is hashed to index a large table (tag, last value, state) and a small stride table; an FSM decides whether the entry is valid and whether to predict, and the predicted data value is formed from the last value and the stride]
Fig. 10. Hybrid last-value + stride predictor
[Figure: predictor state machine with Init, Transient (don't predict), and Stride (predict) states; transitions fire on tag misses, last-value matches/mismatches, and same/different strides, updating the last-value and stride fields]
Fig. 11. Predictor state machine
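The behavior of this state machine can also be expressed in software. The C sketch below is a simplified model built from the figure (the state and field names are assumptions, and the exact transition taken on a stride mismatch is only partially recoverable from the figure): an entry predicts only in the Stride state, and a zero stride makes the predictor behave exactly like a last-value predictor.

```c
#include <assert.h>

/* FSM states, named after Fig. 11 (assumed semantics). */
typedef enum { INIT, TRANSIENT, STRIDE } vp_state_t;

typedef struct {
    long last;        /* last outcome (large last-value table) */
    long stride;      /* stride field (small stride table)     */
    vp_state_t st;
} vp_entry_t;

/* Returns 1 and writes a prediction only when the FSM is confident. */
static int vp_predict(const vp_entry_t *e, long *pred) {
    if (e->st == STRIDE) {
        *pred = e->last + e->stride;
        return 1;
    }
    return 0;
}

/* Update the entry with the actual outcome of the instruction. */
static void vp_update(vp_entry_t *e, long actual) {
    long new_stride = actual - e->last;
    switch (e->st) {
    case INIT:       /* first sighting: no stride history yet      */
        e->st = TRANSIENT;
        break;
    case TRANSIENT:  /* promote only when the stride repeats       */
        if (new_stride == e->stride) e->st = STRIDE;
        break;
    case STRIDE:     /* demote when the stride pattern is broken   */
        if (new_stride != e->stride) e->st = TRANSIENT;
        break;
    }
    e->stride = new_stride;
    e->last = actual;
}
```

After observing 10, 12, 14 the entry reaches the Stride state and predicts 16; after observing 5, 5, 5 it also reaches Stride with a zero stride and predicts 5, which is the last-value case.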
Fig. 12 shows the average prediction accuracy of the hybrid predictor - with different sizes of the stride table - compared with last-value and stride predictors at different hardware costs, using the SPECint95 benchmarks. We can observe that, for a given hardware cost, the prediction accuracy of the hybrid predictor is always higher than that of the last-value or stride predictor. However, the misprediction rate is also slightly higher. Reducing the misprediction rate is one of the most important goals for future work. A more detailed study of the hybrid predictor, and its comparison with other last-value, stride, and hybrid predictors, can be found in [Pinu98].

[Figure: percentage of predictable instructions (hits and misses) vs. predictor size in Kbits, for last-value, stride, and hybrid (128/256/512/1K-entry stride table) predictors on SPECint95]
Fig. 12. Comparative results of the hybrid predictor
5
Results
5.1
Simulation environment
The simulator used in this work is derived from the SimpleScalar 3.0 tool set [Burg97]. This tool set includes an out-of-order processor timing simulator that supports out-of-order issue and execution, and allows the user to tune many parameters of the processor core, the memory hierarchy, and the branch predictor. The SimpleScalar architecture is derived from the MIPS-IV ISA, with some modifications and additions, and it supports both big-endian and little-endian executables to guarantee portability. However, our experiments have been run on a single host architecture (SUN UltraSPARC), so we have used only the big-endian version. The main modifications introduced into the simulator to support value prediction have been the design of a new value prediction module, and the logic needed to validate predictions and recover the pipeline from value mispredictions. These elements are described in detail in the next sections. To study the influence of the issue width and the instruction window size, we have run several simulations with different combinations of these two parameters. Furthermore, to study the influence of the memory hierarchy, branch prediction, and limited resources, we have run our simulations under two different conditions: ideal conditions (perfect branch prediction, perfect memory, and unlimited resources) and realistic conditions (hybrid branch predictor, two-level cache memory hierarchy, and limited resources). As workload, we have used the SPEC95 integer benchmarks compiled with the GNU GCC compiler using the -O3 compilation option (maximum level of optimization). Table 6 shows the programs and the data inputs we have used in our experiments. To limit the simulation time, we have restricted our simulations to 100 million committed instructions.

Benchmark    Description        Input data     # Instr.
Compress95   Data compression   30000 e 2231   95 M
Cc1          Compiler           gcc.i          203 M
Go           Game               99             132 M
Ijpeg        Jpeg encoder       specmun.ppm    553 M
M88ksim      M88000 simulator   ctl.raw        120 M
Perl         PERL interpreter   scrabbl.in     40 M
Li           LISP emulator      train.in       183 M
Vortex       Data base          train.in       2520 M

Table 6. SPEC95 integer benchmark suite and input data
5.2

Simulation results

We have run several experiments with different configurations: different window sizes, different fetch widths (4, 8, and 16), and different conditions:
- Ideal conditions: perfect branch prediction, perfect memory, infinite hardware resources, and one-cycle latency for all operations.
- Realistic conditions: realistic branch prediction (gshare + bimodal hybrid predictor), realistic memory (two-level cache), limited hardware resources, and realistic operation latencies.
Table 7 shows the basic parameters used in the experiments.
Ideal conditions:
- Memory: I-cache (L1): unlimited; D-cache (L1): unlimited
- Branch predictor: perfect
- FUs (fetch width 4/8/16): Int-ALU: unlimited; Int-Mult/Div: unlimited; Mem. ports: unlimited
- Hybrid value predictor: last-value table: 32 K; stride table: 1024

Realistic conditions:
- Memory: I-cache (L1): unlimited; D-cache (L1): 128 K (4-way assoc.); D-cache (L2): 4 M (4-way assoc.)
- Branch predictor: hybrid (gshare + bimodal); gshare history: 12 bits; gshare table: 16 K; bimodal table: 16 K; selection table: 16 K
- FUs (fetch width 4/8/16): Int-ALU: 4/8/16; Int-Mult/Div: 1/2/4; Mem. ports: 2/4/8
- Hybrid value predictor: last-value table: 32 K; stride table: 1024

Table 7. Main configuration parameters
Fig. 13 and Fig. 14 show the results obtained under ideal and realistic conditions, respectively, using a fetch width of 4 (Figs. 13.a and 14.a), 8 (Figs. 13.b and 14.b), and 16 (Figs. 13.c and 14.c). The graphs on the left show the average IPC obtained for the benchmarks in Table 6, under the different conditions, using a perfect value predictor, the hybrid predictor, and no prediction. The graphs on the right show the speedup (in percentage) obtained when using the perfect predictor or the hybrid predictor, relative to the IPC obtained with no value prediction. The average IPC has been computed by calculating the IPC of each individual benchmark and then averaging over all the IPCs. The average speedup has been obtained by dividing the average IPC with value prediction (either with the perfect or the hybrid predictor) by the average IPC without value prediction.
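As a restatement of this averaging method, the following C helpers (hypothetical names, not part of the simulator) compute the average IPC and the speedup percentage reported in the figures.

```c
#include <stddef.h>

/* Arithmetic mean of per-benchmark IPCs. */
static double avg_ipc(const double *ipc, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += ipc[i];
    return sum / (double)n;
}

/* Speedup (%) of value prediction over the no-prediction baseline:
 * the ratio of the two average IPCs, expressed as a percentage gain. */
static double speedup_pct(const double *vp_ipc, const double *base_ipc,
                          size_t n) {
    return (avg_ipc(vp_ipc, n) / avg_ipc(base_ipc, n) - 1.0) * 100.0;
}
```

For example, per-benchmark IPCs of {2.0, 4.0} with prediction against a {2.0, 2.0} baseline give average IPCs of 3.0 and 2.0, i.e., a 50% speedup.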
[Figure: average IPC (left) and speedup in % (right) vs. window size, for perfect, hybrid, and no prediction]
Fig. 13.a. Ideal Conditions. Fetch & issue width = 4
Fig. 13.b. Ideal Conditions. Fetch & issue width = 8
Fig. 13.c. Ideal Conditions. Fetch & issue width = 16
[Figure: average IPC (left) and speedup in % (right) vs. window size, for perfect, hybrid, and no prediction]
Fig. 14.a. Realistic Conditions. Fetch & issue width = 4
Fig. 14.b. Realistic Conditions. Fetch & issue width = 8
Fig. 14.c. Realistic Conditions. Fetch & issue width = 16
5.3
The effect of instruction fetch width on value prediction
The first conclusion we can draw is that the larger the fetch width, the higher the speedup achieved with value prediction. This effect can be observed more clearly in Fig. 15, which plots the maximum and the average speedup (over all window sizes) achieved in the different experiments.

[Figure: maximum and average speedup (%) for fetch widths 4, 8, and 16, with perfect and hybrid predictors under ideal and realistic conditions]
Fig. 15. Maximum and average speedups
Gabbay et al. reported this phenomenon in [Gabb98a]. They observed that, in conventional processors with low fetch bandwidth, most data-dependent instructions are fetched sequentially in consecutive cycles, and consequently they are also executed in a serialized manner. In this case value prediction becomes useless, because the actual input values are ready almost every time an instruction is issued. As the fetch width grows, the number of dependent instructions fetched simultaneously in the same cycle increases, and hence value prediction becomes more and more useful.

5.4

The effect of window size on value prediction
The second observation we can make is that, in all cases, the speedup initially increases as the window grows in size. However, the maximum is quickly reached with a relatively small window, and then the speedup starts to decline as the window keeps growing. The explanation of this effect is simple. At the beginning, with a very small window, the number of dependent instructions held in the window is small, and so the usefulness of value prediction is rather limited. As the window size increases, the number of dependent instructions stored in the window rises. In such a situation, value prediction can be efficiently exploited, because it allows issuing dependent instructions in parallel - even if their operands are not ready - so that the IPC we can reach is considerably higher than without value prediction. However, as the window gets bigger and bigger, the number of independent instructions kept in the window also increases, and hence value prediction becomes useless again, because there are enough independent instructions in the window to cover the available issue bandwidth. The most interesting aspect of this phenomenon is the fact that the maximum IPC improvement is obtained with a small window, not with a large one. This is a very important effect, because it allows us to use value prediction as a complexity-effective technique to improve performance: using value prediction and a small window, we can obtain the same or even higher performance than using a larger window without value prediction.
[Figure: per-benchmark and average IPC for perfect, hybrid, and no prediction under realistic conditions, comparing a small and a large window at each processor width]
Fig. 16.a. IPC for realistic conditions, processor width = 4, and window sizes = 16 and 48
Fig. 16.b. IPC for realistic conditions, processor width = 8, and window sizes = 32 and 128
Fig. 16.c. IPC for realistic conditions, processor width = 16, and window sizes = 64 and 256
Fig. 16 shows this effect for the different benchmarks. In this figure we have extracted the IPC reached under realistic conditions and different fetch widths (4, 8, and 16) using two different window sizes: the largest window considered in our experiments (48, 128, and 256, respectively) and a significantly smaller window (16, 32, and 64, respectively). We can observe that the IPC reached with perfect prediction and the small window is very similar to, or even higher than, the IPC obtained using the large window and no value prediction, for all the benchmarks as well as for the average. In some cases, the IPC obtained with perfect value prediction is even higher with a small window than with a large window. This is because of the scheduling policy, which issues the non-speculative instructions first instead of issuing the oldest instructions first. With a small window both issue policies behave very similarly, because the number of ready instructions in the window is low, and hence almost all the ready instructions (speculative and non-speculative) are issued every cycle. However, with a large window we can find many ready instructions in the window, and with the non-speculative-instructions-first policy we postpone the issuing of the oldest instructions. If we take into account that with perfect prediction all the ready (speculative and non-speculative) instructions are always correctly executed, this policy results in lower performance. On the other hand, if we use a realistic value predictor, the appropriateness of one issue policy or the other will depend on the accuracy of the confidence mechanism. In our case, as the hybrid predictor has a considerable misprediction rate, we have decided to use the non-speculative-instructions-first policy in order to reduce the number of invalidations.

5.5

The effect of realistic conditions on value prediction
The third observation concerns the comparison between the results obtained under ideal conditions and under realistic conditions. We can observe that, although the IPC is always higher under ideal conditions - which is logical and expected - the speedup is similar or even higher, on average, under realistic conditions than under ideal conditions. In other words, value prediction exhibits a greater potential in a real processor than in an ideal environment. This effect is due to the reduction in the effective window size that takes place when realistic conditions are assumed. This reduction has two main causes:

a) The effective fetch bandwidth is reduced when realistic conditions are assumed. This problem was reported in [Rote96], where the different factors that can negatively affect the effective fetch width - branch mispredictions, cache misses, branch throughput, fetch unit latency, etc. - are analyzed. This reduction directly affects the effective window size, because a limited fetch width makes it difficult to fill the window.

b) Branch mispredictions drastically reduce the effective window size, because they force the squashing of many instructions from the window, so every time a branch misprediction occurs the window is almost emptied.

Fig. 17 shows the average occupancy of the instruction window without value prediction, for different fetch widths. We can observe that under ideal conditions the window occupancy is 100% in all cases, i.e., the effective window size is equal to the real window size. On the other hand, under realistic conditions, the ratio between the window occupancy and the window size decreases as the window grows, i.e., the effective window suffers a reduction with respect to the real window size. Furthermore, the reduction rate increases as we enlarge the window. As a consequence of the reduced effective window size, value prediction becomes more important because, as we concluded before, value prediction is much more useful with small windows. Furthermore, if we extrapolate these curves we can conclude that, under realistic conditions, the effective window size tends to a constant as the window size increases. This explains why the speedups obtained under realistic conditions also tend to a constant as the window grows, and do not tend to zero as under ideal conditions.
[Figure: average window occupancy vs. window size for processor widths 4, 8, and 16, under realistic and perfect conditions]
Fig. 17 Average instruction window occupancy
5.6
The effect of the prediction accuracy
The fourth observation comes from the comparison between the results obtained using the hybrid predictor and those obtained using the perfect predictor. We can observe that the potential IPC improvement of value prediction is much higher than the improvement obtained with the hybrid predictor. This is due both to the limited prediction accuracy and to the relatively high misprediction rate of the proposed predictor. Nevertheless, the results obtained by the perfect predictor are really promising, so one of the most important research aims in this field will be to approach the results of the perfect predictor with a realistic predictor that has high prediction accuracy and a low misprediction rate. Of course, we can find predictors in the literature with higher accuracy than the proposed one [Wang97], [Rych98]. However, the difference in performance improvement is not very significant in most cases, so it does not compensate for the use of such a complex predictor.
6
Conclusions and future work
In this paper we have shown that value prediction can be used as a technique to obtain high performance with relatively small window sizes. We have shown that using a small window and value prediction we can achieve similar or even higher IPC than using a large window and no value prediction. Since using small window sizes reduces the complexity of some critical pipeline structures, this can also bring about a cycle-time reduction.
In addition, we have shown that value prediction exhibits a higher potential as the fetch width increases, and also when we work under realistic conditions (realistic branch predictor, realistic caches, limited resources, etc.), due to the reduction in effective window size that realistic conditions produce. We have shown that perfect value prediction achieves a much higher performance improvement than a realistic low-cost hybrid value predictor. Thus, one of the main lines of future work in this area will be to improve the accuracy of value predictors in order to approach perfect prediction. Combining different prediction techniques in a hybrid predictor, so as to cover a good number of different predictable sequences, emerges as a promising solution. The combination of value prediction with address prediction and data prefetching could also lead to higher prediction accuracy. However, misprediction-rate reduction is an even more important topic, as misprediction recovery is one of the most critical aspects of value prediction, so developing accurate confidence mechanisms is also crucial. Other important open research lines on value prediction are developing techniques for speculative predictor updating, implementing more complexity-effective mechanisms for parallel validation, developing more sophisticated scheduling techniques that deal with speculative and non-speculative instructions, and studying the interaction of value prediction with branch prediction and with memory disambiguation.
7
References
[Burg97] D. Burger and T.M. Austin, "The SimpleScalar Tool Set, Version 2.0", Technical Report CS#1342, University of Wisconsin-Madison, 1997.
[Gabb97] F. Gabbay and A. Mendelson, "Can Program Profiling Support Value Prediction?", Proc. of the 30th Int. Symp. on Microarchitecture (MICRO-30), pp. 270-280, Dec. 1997.
[Gabb98a] F. Gabbay and A. Mendelson, "The Effect of Instruction Fetch Bandwidth on Value Prediction", Proc. of the 25th Int. Symp. on Computer Architecture (ISCA-25), pp. 272-281, 1998.
[Gabb98b] F. Gabbay and A. Mendelson, "Improving Achievable ILP Through Value Prediction and Program Profiling", Microprocessors and Microsystems, Vol. 22, No. 3, Sept. 1998.
[Gonz97] J. Gonzalez and A. Gonzalez, "Speculative Execution via Address Prediction and Data Prefetching", Proc. of the 11th ACM Int. Conf. on Supercomputing (ICS'97), pp. 196-203, July 1997.
[Lipa96a] M.H. Lipasti and J.P. Shen, "Exceeding the Dataflow Limit via Value Prediction", Proc. of the 29th Int. Symp. on Microarchitecture (MICRO-29), pp. 226-237, Dec. 1996.
[Lipa96b] M.H. Lipasti, C.B. Wilkerson, and J.P. Shen, "Value Locality and Load Value Prediction", Proc. of the 7th Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII), pp. 138-147, Oct. 1996.
[Lipa97] M.H. Lipasti, Value Locality and Speculative Execution, Ph.D. thesis, Carnegie Mellon University, April 1997.
[McFa93] S. McFarling, "Combining Branch Predictors", Technical Report TN-36, Digital Equipment Corp., June 1993.
[Mosh97] A. Moshovos, S.E. Breach, T.N. Vijaykumar, and G. Sohi, "Dynamic Speculation and Synchronization of Data Dependences", Proc. of the 24th Int. Symp. on Computer Architecture (ISCA-24), 1997.
[Nakr99] T. Nakra, R. Gupta, and M.L. Soffa, "Global Context-Based Value Prediction", Proc. of the 5th Int. Symp. on High Performance Computer Architecture (HPCA-5), Jan. 1999.
[Pala97] S. Palacharla, N.P. Jouppi, and J.E. Smith, "Complexity-Effective Superscalar Processors", Proc. of the 24th Int. Symp. on Computer Architecture (ISCA-24), 1997.
[Pinu98] L. Pinuel, "Implementation of a Low Cost Hybrid (Last-Value + Stride) Value Predictor", Technical Report, DACYA-??, Univ. Complutense de Madrid, 1998.
[Rote96] E. Rotenberg, S. Bennett, and J.E. Smith, "Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching", Proc. of the 29th Int. Symp. on Microarchitecture (MICRO-29), pp. 24-34, Dec. 1996.
[Rote97] E. Rotenberg, Q. Jacobson, Y. Sazeides, and J. Smith, "Trace Processors", Proc. of the 30th Int. Symp. on Microarchitecture (MICRO-30), pp. 138-148, Dec. 1997.
[Rych98] B. Rychlik, J. Faisty, B. Krug, and J.P. Shen, "Efficacy and Performance Impact of Value Prediction", Proc. of PACT-98, 1998.
[Saze97a] Y. Sazeides and J.E. Smith, "The Predictability of Data Values", Proc. of the 30th Int. Symp. on Microarchitecture (MICRO-30), pp. 248-258, Dec. 1997.
[Saze97b] Y. Sazeides and J.E. Smith, "Implementations of Context Based Value Predictors", Technical Report #ECE-TR-97-8, University of Wisconsin-Madison, 1997.
[Saze98] Y. Sazeides and J.E. Smith, "Modeling Program Predictability", Proc. of the 25th Int. Symp. on Computer Architecture (ISCA-25), pp. 73-84, 1998.
[Smit81] J.E. Smith, "A Study of Branch Prediction Strategies", Proc. of the 8th Int. Symp. on Computer Architecture (ISCA-8), pp. 135-148, 1981.
[Soda97] A. Sodani and G.S. Sohi, "Dynamic Instruction Reuse", Proc. of the 24th Int. Symp. on Computer Architecture (ISCA-24), pp. 194-205, 1997.
[Sohi90] G.S. Sohi, "Instruction Issue Logic for High-Performance, Interruptible, Multiple Functional Unit, Pipelined Computers", IEEE Trans. on Computers, 39(3), pp. 349-359, 1990.
[Wall93] D.W. Wall, "Limits of Instruction-Level Parallelism", Technical Report WRL 93/6, Digital Western Research Laboratory, 1993.
[Wang97] K. Wang and M. Franklin, "Highly Accurate Data Value Prediction Using Hybrid Predictors", Proc. of the 30th Int. Symp. on Microarchitecture (MICRO-30), pp. 281-290, Dec. 1997.
[Yeh92] T.-Y. Yeh and Y. Patt, "Alternative Implementations of Two-Level Adaptive Branch Prediction", Proc. of the 19th Int. Symp. on Computer Architecture (ISCA-19), pp. 124-134, 1992.
[Yeh93] T.-Y. Yeh, D. Marr, and Y. Patt, "Increasing the Instruction Fetch Rate via Multiple Branch Prediction and a Branch Address Cache", Proc. of the 7th ACM Int. Conf. on Supercomputing (ICS'93), pp. 67-76, July 1993.