Increasing the Instruction-Level Parallelism through Data-Flow Manipulation

Stephan Jourdan†, Adi Yoaz‡, and Mattan Erez‡

† Performance Microprocessor Division, Intel Corporation
[email protected]
Abstract

In recent years, several techniques for exceeding the program's data-flow constraints have been proposed. These include value prediction, instruction reuse, and dependency redirection. In this paper we show how they can be regarded as one universal method of performing what we term result reuse. We study in depth the dependency redirection derivative and propose several hardware schemes to accomplish it. First, we summarize the necessary changes to the execution core, and propose a novel mechanism that allows us to treat value-identical instructions symmetrically (disregarding their original ordering). To perform value-identity detection, two options are explored: a new cost-effective value-identity predictor that is highly accurate, and a safe method that determines value-identity through expression tracking. In addition, we provide two specific applications of the techniques. The first eliminates unnecessary load operations – redundant load elimination. The second translates data dependencies based on control-flow speculation, effectively using the branch predictor to perform value speculation.
1. Introduction

One of the major factors limiting the performance of modern super-scalar processors is the lack of instruction-level parallelism (ILP). The exploitation of ILP, in turn, is restricted by several constraints, which are traditionally classified as data dependencies, name dependencies (false dependencies), and control dependencies¹. Current processors employ techniques to eliminate (or tolerate) the effects of name and control dependencies. Name dependencies can be removed by dynamically renaming both register and memory locations, and control dependencies can be overcome through speculative execution. Out-of-order execution is a related paradigm, which attempts to tolerate the effects of all three dependency types by allowing

¹ We deliberately omit resource dependencies, which are implementation dependent.
‡ Microprocessor Research Lab, Intel Corporation
{adi.yoaz, mattan.erez}@intel.com
independent dependency chains to execute independently of each other's constraints.
Recently, several techniques that manipulate data dependencies and exceed the data-dependency limit have been introduced: value prediction, instruction reuse, dependency collapsing, and dependency redirection. We propose a universal view that unifies these techniques through their operations on the data-flow graph and the micro-architectural state. The data-flow graph (DFG) is a graph representing the data dependencies: the nodes represent instructions, while the arcs denote the dependencies between them. All techniques that aim to surpass the data-dependency limit must, in some way, modify the DFG while maintaining correctness. There are four possible operations on the DFG that may increase performance:
• Arc Elimination - removing the dependencies between instructions. By providing the required values directly to the dependent instructions, their execution is advanced and parallelism is increased.
• Node Elimination - removing instructions and their related arcs when the results are known. This saves execution resources, reduces power consumption, increases the ILP, and promotes the execution of the dependent instructions.
• Node Merging - dynamically merging several instructions into an equivalent single instruction. This effectively shortens the dependency chain and has other advantages similar to node elimination.
• Arc Redirection - redirecting a dependency arc from the original instruction to one that is computed earlier. Thus, dependent instructions execute earlier.
Value prediction falls under the arc-elimination category: the outcome of selected instructions is predicted and supplied to the dependent instructions. Instruction reuse is a node-elimination technique: the result of an instruction is safely determined and may be used by subsequent instructions. Dependency collapsing can eliminate nodes, merge several instructions into a single node, and redirect arcs by finding and transforming short chains of dependent instructions into more compact representations.
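The four DFG operations above can be made concrete with a small sketch. This is a toy software model for illustration only — the class, method names, and instruction labels are our own; real hardware manipulates renamed physical registers rather than an explicit graph structure.

```python
# Toy data-flow graph: each node is an instruction, each arc (u, v) means
# "v depends on the result of u". Illustrative only.

class DFG:
    def __init__(self):
        self.arcs = set()  # (producer, consumer) dependency arcs

    def add_arc(self, producer, consumer):
        self.arcs.add((producer, consumer))

    def eliminate_arc(self, producer, consumer):
        """Arc Elimination: the value is supplied directly (e.g. predicted)."""
        self.arcs.discard((producer, consumer))

    def eliminate_node(self, node):
        """Node Elimination: the result is known, so the instruction and
        all of its arcs are removed (e.g. instruction reuse)."""
        self.arcs = {(u, v) for (u, v) in self.arcs if u != node and v != node}

    def redirect_arcs(self, old_producer, new_producer):
        """Arc Redirection: consumers of old_producer now wait on
        new_producer, which computes the identical value earlier."""
        self.arcs = {((new_producer if u == old_producer else u), v)
                     for (u, v) in self.arcs}

g = DFG()
g.add_arc("load", "add")        # add depends on the load
g.add_arc("add", "store")
g.redirect_arcs("load", "mov")  # load detected value-identical to an older mov
assert ("mov", "add") in g.arcs and ("load", "add") not in g.arcs
```

With the arc redirected, the add no longer waits for the load and can issue as soon as the older mov completes.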
Dependency redirection was introduced in the context of improving memory communication, essentially redirecting a dependency on a load instruction to the instruction producing the originally stored value. More recently, we generalized this technique to include all value-identity relations; that is, a dependency may be redirected towards any older instruction that is detected to produce the same value. In this paper we propose a hardware mechanism to accomplish this concept.
Another possible categorization is through the micro-architectural perspective. Here we classify the different techniques according to the changes they make to the internal micro-architectural state. Value prediction and instruction reuse do not change the internal (physical) DFG representation. Instead, selected values are forwarded to the registers, or reservation stations, of dependent instructions from dedicated hardware structures. Dependency collapsing, on the other hand, physically modifies the DFG, while dependency redirection may do both.
In this paper we examine more closely the value-identity derivative. We introduce two ways of performing the identity detection: a speculative prediction, and a safe method through expression tracking. We also give two specific applications of these techniques. The first eliminates unnecessary load operations – redundant load elimination – and the second translates data dependencies based on control-flow speculation. This allows us to speculate on a result with very few resources, by tying it to a control speculation.
The paper is organized as follows: the rest of Section 1 presents related work and the simulation methodology. Section 2 unifies the different result reuse techniques through their DFG manipulations. Section 3 discusses the execution core aspects regarding the dependency redirection derivative. Section 4 details two ways of performing value-identity detection.
Section 5 describes two applications of these techniques, and concluding remarks are provided in Section 6.
1.1 Related Work
Related work has been conducted in four different directions. A more detailed description of specific mechanisms is given in Section 2.
• Value Prediction: Lipasti and Shen introduced value locality and data-value prediction. They first studied value prediction for loads, and later generalized it to include all computed values [Lipa96]. These studies used a last-value predictor only. Stride-based prediction, as well as an independent study of value prediction, was done by Gabbay and Mendelson and summarized in [Gabb98]. Further studies include the investigation of the potential of context-based value predictors [Saze97] and the exploration of various confidence predictors to limit misprediction effects [Cald99].
• Instruction/Value Reuse: Harbison proposed dynamic value reuse. He used a value cache to eliminate redundant computations by storing the results of recurring arithmetic computations [Harb82]. Richardson followed up on this concept, introducing the result cache, which also eliminates trivial computations [Rich92]. Instruction reuse [Soda97] is based on tracking changes in the input values of a given static instruction, in order to reuse the most recent result. The tracking is based on dependencies and/or operand values.
• Dependency Redirection: [Mosh97] and [Tyso97] first introduced this paradigm in the particular case of memory dependencies. [Jour98] generalized the idea, both to eliminate redundancy in the register file and to redirect data dependencies. All these techniques rely on value-identity detectors.
• Dependency Collapsing: Static dependency collapsing is an integral part of many instruction set architectures (ISAs), for example the multiply-accumulate instruction. Dynamic collapsing of pairs of ALU instructions was proposed in [Vass93]. This scheme was later extended to identify short sequences of dependent instructions in order to manipulate the DFG [Saze96]; this latter scheme used special ALU units that can perform more sophisticated operations to shorten the critical path and increase parallelism. More recently, [Frie98] used the Trace-Cache fill unit to perform some collapsing.
In this paper we concentrate on the result reuse techniques, which include value prediction, instruction reuse, and dependency redirection.
1.2 Simulation Methodology
Results provided in this paper were collected from an IA-32 trace-driven simulator. Each trace consists of about 30 million consecutive IA-32 instructions translated into micro-operations, on which the simulator works. Traces are representative of the entire execution, and record both user and kernel activities. Results are reported for 42 traces grouped into 8 suites:
• SpecFp95 (FP - 10 traces),
• SpecInt95 (INT - 8),
• Multimedia using MMX™ instructions (MM - 8),
• programs written in JAVA (JAV - 5),
• TPC benchmarks (TPC - 3),
• Common programs running on NT (NT - 8).
In this paper, we focus on the ideas, and back them up with statistical results that emphasize the motivations for the various suggestions. IPC performance results are deliberately omitted. We are convinced that backing an idea with statistical results and common sense is the right way to proceed. The effectiveness of an idea is highly dependent on both the context it is used in (e.g. a high-end out-of-order processor or an embedded processor) and the performance vectors targeted (e.g. IPC, power, die area). Therefore, general ideas and paradigms should be presented in such a way that their validity is proven, and not as a bottom-line performance number.
2. A General Framework for Result Reuse
Result Reuse is the general term we give to all techniques that provide values to dependent instructions ahead of normal program execution. Such values are determined, either safely or speculatively, based on committed or in-progress results. The differences between the various micro-architectural techniques that fall under result reuse (value prediction, instruction reuse, and dependency redirection) lie mainly in the way they manipulate the DFG.
Result reuse uses structures to record past results, or tags identifying in-flight results. The structures are accessed using a context, which depends on the implemented variation of the scheme. If the current context matches a recorded one, then the retrieved result (or tag) is used for the current instruction. [Soda98] gave a qualitative description of the opportunities to perform reuse. Dependency collapsing does not fit this definition of result reuse, since it is not based on previous results; we therefore will not discuss this technique in the following sections.
As explained in the introduction, there are four ways to change the dependency graph in order to increase performance: Arc-Elimination, Node-Elimination, Node-Merging, and Arc-Redirection. Following is a more detailed description of the known mechanisms for performing these manipulations, including the type of result reuse they implement.
Value Prediction eliminates dependency arcs by directly supplying the needed value to dependent instructions. It proposes to record previously generated results and reuse them as a prediction for future instructions. Results are arranged based on a context as seen while processing the instructions that generated them. The idea is, for a given instruction, to access the prediction structures with the current context and, on a context match, to speculate on the result.
For value prediction, the information used to form the context can be of any kind, but two major methods have been devised:
• Last-value predictor and stride predictor – make the prediction based on the most recently seen value plus a calculated stride (the delta between two recent consecutive values). The context is the instruction address and the stride (the stride is assumed to be 0 for all last-value predictions). Note that a stride predictor potentially reuses the result of virtual instructions: the predicted result is the last result added to the stride value.
• Context predictor – attempts to learn the values that follow a particular sequence of previous values, and predicts one of these values when the same pattern repeats. Here the context is a combination of several previous results and the instruction address.
Recently, several predictors were presented that are hybrids of these two basic mechanisms; their context is a combination of the value context, the instruction address, and the stride. Other information vectors include control-flow indication (e.g. past branches or an adjacent op-code), or any combination thereof. A confidence predictor often backs the value predictor, since the cost of mispredictions may be too high.
Instruction Reuse operates on the nodes of the DFG, eliminating instructions (their execution and dependencies) based on the safe determination of their result value. The idea is that an instruction always produces the same result when its operand values are the same. Since this technique is completely safe, the instruction need not execute and may be removed from the DFG. As in value prediction, instruction reuse uses the current context to look up previous results in dedicated structures. Two sorts of contexts have been proposed. The first is based on the notion of memoization borrowed from compilers: the context is composed of the source operand values, and either the instruction address or the function op-code. The second scheme relies on the fact that if the source operands have not been updated since the last time the instruction was executed, then the instruction will produce the same result; here the context is based on dependency tracking. This scheme is quite effective, particularly in retrieving work performed while pursuing a wrong control-flow path.
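As a concrete illustration of the context-based lookup shared by these schemes, here is a minimal sketch of the last-value/stride variant described above. It is a software model for exposition, not any particular hardware design; table sizing, tagging, and confidence are omitted.

```python
# Minimal last-value + stride predictor, indexed by instruction address.
# A stride of 0 degenerates to a pure last-value predictor.

class StridePredictor:
    def __init__(self):
        self.table = {}  # pc -> (last_value, stride)

    def predict(self, pc):
        """Return the predicted result, or None on a table miss."""
        if pc not in self.table:
            return None
        last, stride = self.table[pc]
        return last + stride

    def update(self, pc, actual):
        """Train with the committed result of the instruction at pc."""
        if pc in self.table:
            last, _ = self.table[pc]
            self.table[pc] = (actual, actual - last)
        else:
            self.table[pc] = (actual, 0)

p = StridePredictor()
for v in (100, 104, 108):       # e.g. an address advancing by 4 per iteration
    p.update(0x400, v)
assert p.predict(0x400) == 112  # last value (108) plus learned stride (4)
```

A context predictor would differ only in what indexes the table: a hash of the several most recent results rather than the instruction address alone.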
Dependency Redirection is an integral part of result reuse and works by arc-redirection. It is based on detecting that an instruction produces the same result as another instruction. If the result has already been produced, then this technique is no different from the other techniques (it just uses another sort of context). The main novelty is that it works even when the result has not been produced yet: the instruction is linked to the detected identical result, which may still be unknown. In this way, the dependency arcs pointing to this instruction are redirected to the earlier instruction. Section 3 describes how to perform these links, and Section 4 discusses detection mechanisms. The detection of value-identity relations between instructions can be either safe or speculative, depending on the context used. Two examples are move elimination and memory bypassing.
Pipeline issues. The different techniques react differently to the increasing pipeline depth of modern processors. Context-based speculative techniques, such as value and address prediction, suffer most from deep pipeline effects, since the current context may be speculative. [Beke99] describes these issues and proposes schemes to mitigate some of their effects. For example, if the results of the last few previous dynamic instructions (global correlation) are used to form the context, the predictor predicts either all instructions correctly or none; the sequence of mispredictions may last until the next full re-start, i.e. a thread switch or a full pipeline flush.
To sum up, result reuse records in some structures past results, or tags identifying in-flight results, to reuse them for following instructions. Recorded values are arranged based on a context as seen while processing the instructions that generated them. If the current context matches a recorded one, then the retrieved value is used for the current instruction. Value prediction, instruction reuse, and dependency redirection are all variations of result reuse. Note that several variations can co-exist (hybrid result reuse schemes), especially when some focus on long-term reuse while others focus on the short term.
3. Execution Core

In this section, we explore several modifications to the execution core for supporting result reuse. Sophisticated recovery schemes are beyond the scope of this paper (though extremely inefficient, flushing the pipeline is always an option).

3.1 Equivalent Data-Flow Graph

In the introduction we stated that result reuse mechanisms can be categorized according to the way they change the micro-architectural state. Techniques such as value prediction or instruction reuse do not physically modify the internal representation of the DFG. Instead, the determined result is written back and marked as ready prior to instruction scheduling. Dependency arcs are not modified but are already resolved; in practice, this is as if the dependency arcs were removed. This applies to any result reuse variation where the predicted result value is known. The DFG remains unchanged to facilitate the recovery from a data misprediction: only the instructions that are dependent on the mispredicted value need to be re-executed. One drawback of such schemes is the extra write ports required in the register file.
Dependency redirection modifies dependency arcs. When a result is detected to be identical to the result of an earlier instruction, all dependency arcs to the first instruction may be redirected towards the second (earlier) instruction. The solution introduced in [Jour98], and summarized below, physically updates the DFG. There are two problems in maintaining a modified representation of the DFG. First, it impacts the allocation/reclaiming policies (described in Sub-section 3.2). Second, the DFG needs to be repaired on a data misprediction; this, most probably, means flushing the machine of all instructions following the offending instruction and reverting to a check-pointed DFG.

3.2 Multiple Mappings

The most advanced register-renaming scheme involves three structures: a Free List (FL), a Register Alias Table (RAT), and an Active List (AL). In [Jour98], we showed that such a scheme handles memory renaming as well, by translating all memory dependencies into register dependencies. Below, we detail the basics of this renaming scheme and describe modifications to the allocation and reclaiming policies for accommodating multiple mappings between several logical registers and the same physical register.

[Figure 1. Register Renaming — the renaming cycle: Register Allocation takes physical registers from the FREE LIST; Instruction Renaming maps logical to physical registers through the REGISTER ALIAS TABLE; Instruction Retirement records evicted mappings in the ACTIVE LIST; Register Reclaiming returns physical registers to the FREE LIST.]
Renaming Techniques. The RAT is indexed by the source logical registers, and provides the mappings to the corresponding physical registers. This structure maintains the latest mapping² for each logical register. For each logical destination register specified by the renamed instructions, the allocator provides an unused physical register from the FL, and the RAT is updated with the new mappings. A physical register is reclaimed once the instruction that caused its eviction from the RAT retires. The AL records these instruction-evicted mapping pairs to achieve the proper reclaiming. Figure 1 depicts this cycle.

² Throughout the paper, we refer to a mapping as the pairing of a logical register and its corresponding physical register.

Physical Register Allocation (Reuse). In the regular scheme, physical registers are allocated prior to instruction scheduling from the FL, and the RAT is updated accordingly. An alternate option is to perform the allocation during execution, as proposed in [Gonz98] (virtual renaming). Redirecting dependencies is achieved by re-allocating the physical register holding the detected identical result instead of allocating an idle physical register from the FL. The RAT is updated with the re-allocated registers; thus, all dependencies are redirected towards the first instruction generating the result. In addition to performing dependency redirection, reusing physical registers reduces the number of physical registers needed.
Physical Register Reclaiming (Multiple Mappings). Re-allocating still-in-use physical registers distorts the basics of the regular register reclaiming scheme depicted above, which assumes that once a logical register is renamed, all subsequent instructions can no longer access the physical register it was associated with. An elegant technique to overcome this problem is to provide a counter for each physical register. The counters are incremented when new mappings are established (first allocation or any re-allocation), and decremented when mappings are killed.
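The counter-based reclaiming policy can be sketched as follows. This is an illustrative software model with names of our choosing; a real free list and register file are hardware structures, and the interaction with the RAT and AL is elided.

```python
# Sketch of counter-based reclaiming for multiple mappings: each physical
# register carries a count of live mappings, and returns to the free list
# only when that count reaches zero.

class PhysRegFile:
    def __init__(self, n):
        self.free_list = list(range(n))
        self.count = [0] * n

    def allocate(self):
        """Regular allocation: take an idle register from the free list."""
        preg = self.free_list.pop()
        self.count[preg] = 1
        return preg

    def reallocate(self, preg):
        """Dependency redirection: map another logical register onto the
        same physical register, so bump its mapping counter."""
        self.count[preg] += 1
        return preg

    def kill_mapping(self, preg):
        """Called when a mapping is evicted and its evictor retires."""
        self.count[preg] -= 1
        if self.count[preg] == 0:
            self.free_list.append(preg)

rf = PhysRegFile(4)
p = rf.allocate()
rf.reallocate(p)        # a value-identical instruction reuses the register
rf.kill_mapping(p)      # first mapping dies: register is still in use
assert p not in rf.free_list
rf.kill_mapping(p)      # last mapping dies: register is reclaimed
assert p in rf.free_list
```

The key invariant is that a register is never freed while any logical register still maps to it, which is exactly what the regular single-mapping scheme cannot guarantee under re-allocation.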
3.3 Instruction and Execution Streams
Dependencies are redirected away from parent instructions, towards older instructions. Older in the execution stream does not necessarily mean older in the instruction stream, due to out-of-order execution. For instance, an instruction may be detected to produce the same value as another instruction soon to be fetched, but the latter may be executed first. To resolve this issue, we propose to alter the scheme proposed in Sub-section 3.2. When two instructions are detected to produce identical results, the order of the computations should not impair performance. Allocating the same physical register to both instructions overcomes the ordering issue. Thus, in any case, the instructions dependent on the younger instruction in the execution stream are scheduled earlier than with the unaltered DFG. Actually, there is no longer a distinction between an earlier and an older instruction: both instructions are treated symmetrically, and the one producing the value last is responsible for the verification. The only question is when to enable the data verification. We add a valid bit to each physical register. This bit is cleared when the register is first allocated, and set on the first update. Any update with this bit already set leads to a comparison between the recorded value and the incoming result. On a mismatch, the identity relation was mispredicted and the DFG needs to be repaired. In summary, identity detection is performed in program order, regardless of the order of the computations.
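The valid-bit protocol above can be sketched as a tiny model. The class and method names are illustrative; in hardware the valid bit lives alongside the physical register and the comparison is done at writeback.

```python
# Sketch of the valid-bit scheme: the first writer of a shared physical
# register records the value; any later writer compares against it, and a
# mismatch signals a value-identity misprediction requiring DFG repair.

class SharedPhysReg:
    def __init__(self):
        self.valid = False   # cleared when the register is (re-)allocated
        self.value = None

    def writeback(self, result):
        """Return True if the write is consistent, False on a
        value-identity misprediction."""
        if not self.valid:
            self.valid = True
            self.value = result          # first update: just record it
            return True
        return self.value == result      # later update: verify the identity

r = SharedPhysReg()
assert r.writeback(42)   # whichever of the two instructions finishes first
assert r.writeback(42)   # the other verifies: the identity held
```

Because either instruction may write first, the scheme is symmetric: neither needs to know whether it is the "leader" or the "follower" at execution time.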
4. Value-Identity Detectors

Value-identity detectors speculate on the equality between results produced by two instructions. In other words, such structures predict, either safely or speculatively, that a given instruction produces the same value as another instruction that may be completely unrelated. Speculative value-identity detectors differ from other result reuse variations since only relations are predicted.
This means that they may perform well even when the results themselves are highly unpredictable. A good example is memory spill bypassing where colliding loads are predicted to produce the same values as the instructions whose results were stored to memory (memory dependencies are translated into register dependencies). On the down side, the individual gain may be lower than with value predictors since the identical values may not be computed yet. This section describes a highly efficient prediction scheme and provides insights into building safe detectors.
4.1 Value-Identity Prediction
For any incoming instruction, the value-identity prediction structures may predict the location, in the instruction window or anywhere in the physical register space, of another instruction that produces the same result. In this case, the physical register allocated to this earlier instruction is retrieved and re-allocated to the incoming instruction. In concept, in addition to the obvious verification/recovery phase, value-identity prediction involves three steps:
1. Establishing a potential identity relation between a pair of instructions. That is, identifying the possibility that a certain instruction (the follower) produces the same result as a preceding instruction (the leader) in the dynamic instruction stream.
2. Recording this relation into an identity prediction structure. Note that a follower may have several prior leaders that potentially produce identical results, because several dynamic paths lead to this follower.
3. Predicting an identity relation. Once a potential identity relation has been established and recorded, it should be retrieved. That is, when a follower enters the pipeline, prior to renaming, an appropriate leader, if it exists, should be identified.
In this section, we describe prediction structures to speculate on memory collisions. These structures can be easily extended to predict any kind of value-identity. In the case of memory collisions, the right prediction is to point to the instruction that generated the result that was later stored to memory. However, we prefer to predict the location of the colliding store instead, in order to bypass the memory access (value verification can be circumvented by comparing addresses). In this particular case, this means that the source physical register is retrieved from the instruction window to be re-allocated, in place of the destination physical register.
To relate this to the description above, we use a special-purpose mechanism to establish an identity relation between a leader (store) and a follower (load). The recording and retrieval mechanisms, however, are appropriate for predicting any value-identity relation.
Value-Identity Relation. The value-identity relation is transitive: if instructions A and B are predicted to produce the same result (to be value-identical), and instructions B and C are as well, then instructions A and C are also predicted to be value-identical. As a result, prediction structures should focus on short-range value-identity relations because of their higher accuracy; long-range relations should be achieved through the transitive property of the relation.
The structure of this value-identity predictor is similar to the predictors described in [Mosh97] and [Tyso97]. Confidence is of the utmost importance since mispredictions are costly. We describe a new cost-effective scheme that outperforms the previously published predictors. The predictor features four novel techniques:
• Use path information to perform accurate prediction
• Use distance information to allow multiple instances of the same static value-identity relation
• Use additional information to record only identity relations
• Use partial matching to reduce redundancy in the tables
Path Information. Intuitively, load-store pairings are highly dependent on the control-flow path between the store and the load. The path information helps to distinguish whether a load collides within the instruction window and, for a collision, with which store it collides. Figure 2 gives an example with four predecessor blocks A, B, C, and D, where the highlighted load:
• collides with the store ST1 when the path to the load is either from A or from B,
• collides with the store ST2 when the path is from D,
• collides with no store when the path is from C.
[Figure 2. Path Distinction — four basic blocks A, B, C, and D lead to the block containing the stores ST1 and ST2 and the highlighted LOAD.]

The dependency table records the collision information and is accessed for each incoming load (follower). The index is based on the load IP and the path information, and is formed by any hash function (e.g. xor). The path information can be encoded in many ways; a simple way is to record the direction of past branches, as done in branch prediction. An alternative is to base the path information on the starting addresses of past basic blocks. The path information is recorded with a shift(n)-xor scheme, and is updated on any new basic block of instructions by first shifting the old content n bits to the left, xor-ing it with the IP of the first instruction of the block, and then truncating the result to fit the size of the recorded path information. The path information is shifted left before it is hashed with the load IP to access the prediction table. The prediction table delivers information about the predicted colliding store (leader), if any. The table may be set-associative; in this case, it is better to tag entries with information provided only by the load IP (to avoid redundancy).
Distance Information. The prediction table is accessed only for loads. Updates are performed based on the outcome of the memory disambiguation process (this is step 1 above, establishing identity). The table records the distance, in the instruction window, between the follower and the leader: that is, between the current load and the predicted colliding store. The physical register associated with the store is retrieved from the instruction window. Recording the distance and indexing the prediction table with path information enables the overlapping of multiple dynamic instances of the same static load-store pair. This is not the case when physical registers are recorded directly in the prediction table (although no distance value is then required). This is an important feature since overlapping is a frequent behavior, particularly in tight loops. Along with the distance information, a confidence bit can be used to record the outcome of the last prediction performed while accessing this entry. This confidence scheme can be extended to several bits if required to achieve higher performance, and pollution-free bits [Beke99] can be added to reduce aliasing in the prediction table.

[Figure 3. Value-Identity Predictor (Memory Dependency) — the load IP is hashed with the path history register (PHR) to index the DEPENDENCY TABLE (Tag, Conf, Dist. fields); the recorded distance, subtracted from the current pointer into the INSTRUCTION WINDOW (IP, Location, Store? fields), selects the predicted colliding store.]

Additional Information. Entries in the prediction table are costly. Based on Figure 2, an entry in the table should not be provided for the path coming from C. This requires tagging schemes where no collision is predicted on a miss. The prediction scheme described so far features an implicit tagging scheme when combined with the instruction window: if the entry selected by the distance value is not a store, there is a "tag" mismatch and no collision is predicted. A more general tagging scheme is to record in each entry of
the prediction table a few bits from the predicted leader (colliding store) IP. Figure 3 illustrates the prediction structures (tags are optional).
Partial Matching. Long path information leads to a high level of redundancy in the table. In the example above, two entries may be allocated for the collision between the load and ST1 (one from path A and one from path B). We propose, as an option, to record only one path by allowing partial matching in a set-associative implementation of the dependency table. Instead of performing a full tag comparison, the way selected is the one that most closely matches the current path. This selection is performed by xor-ing the current path history with the tag recorded in each way; the outcomes hold the discrepancy between the paths, and are used to select the way. Note that the confidence now relies only on the additional information described above. For instance, the path from C in Figure 2 leads to ST2 (a partial match with the path from D), but the store entry selected with its recorded distance will probably not match the IP recorded in the table.
Figure 4 reports results of such a predictor using a 2-way set-associative 4K-entry table, and an instruction window of 128 entries. Within this window, 28% of the loads collide with stores, and the predictor processes nearly 80% of them. The strength of this predictor is in its near-perfect accuracy.
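The shift(n)-xor path encoding and the table indexing described above can be sketched as follows. The bit widths, the shift amount n, and the hash function are illustrative parameter choices of ours, not the paper's exact configuration.

```python
# Sketch of the shift(n)-xor path history and dependency-table indexing.

PATH_BITS = 12
N_SHIFT = 2
TABLE_SIZE = 4096

def update_path(path, block_start_ip):
    """On each new basic block: shift the old history n bits left, xor in
    the IP of the block's first instruction, and truncate to PATH_BITS."""
    return ((path << N_SHIFT) ^ block_start_ip) & ((1 << PATH_BITS) - 1)

def table_index(load_ip, path):
    """Shift the path left, then hash it with the load IP to index the
    dependency table."""
    return ((path << 1) ^ load_ip) % TABLE_SIZE

path = 0
for block_ip in (0x1000, 0x1040, 0x10c0):  # blocks taken to reach the load
    path = update_path(path, block_ip)
idx = table_index(0x1100, path)
assert 0 <= idx < TABLE_SIZE
```

Because recent blocks occupy the low-order bits and older blocks are shifted out, two paths that diverge only far in the past naturally map to the same entry, which is consistent with the short-range focus argued for above.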
[Figure 4. Predictor Performance — per-suite results (FP, INT, JAV, MM, NT, TPC, and the average), showing the number of collisions, number of predictions, predicted colliding loads, and non-colliding loads not predicted, as a percentage of all loads (left axis), together with the prediction accuracy (right axis, 99.3–100%).]

Dependency-Table Update. An entry is allocated in the prediction table for a load that collided with a store within the instruction window. The memory-ordering hardware tracks these events. As stated before, this mechanism is specific to establishing identity relations among load and store instructions. No attempt is made here to generalize it, but one may think of similar structures to track any type of identity relation. [Tull99] proposed a simple, and more limited, general tracking mechanism. This scheme assumes that result reuse is performed only with the result previously stored in the same architectural register. This means that a result only needs to be checked against the previous value in the specific architectural register: a binary prediction is made on whether to reuse a value or not. The RAT provides the physical register recording this previous value. Note that no extra reads of the RAT need to be performed, since the physical register had to be obtained anyway in order to be inserted into the active list. This scheme relies on compiler algorithms to assign the same architectural register to instructions producing the same value. It is not efficient in solving spill code, since spills are inserted mainly because of a lack of architectural registers, especially in the IA-32 ISA; this means that the architectural register was reused one or more times before the colliding load is renamed.

4.2 Expression Tracking

Value-identity relations can be safely detected based on expression tracking. A typical example is move instructions, where the result value is identical to the input value. We offer a generalization of this scheme to detect value identities between all instructions. The expression-tracking operation is conducted in parallel with or prior to register renaming to perform arc-redirection, node-elimination, and node-merging. An alternative is to position the tracking unit on the backside of the trace cache, as done in [Frie98] for dependency collapsing. The following simple example illustrates the task of the expression-tracking unit:

label:
    mov eax,ebx
    add eax,4
    …
    jnz label    // eax
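A minimal software sketch of such an expression-tracking table follows. The symbolic encoding is an assumption of ours for illustration: each architectural register maps to a canonical expression, a move copies the expression, and two registers holding the same expression are safely value-identical.

```python
# Sketch of an expression-tracking table for the example above.
# Expressions are tuples built from symbolic initial register values.

exprs = {"ebx": ("ebx0",)}      # symbolic initial value of ebx

def track_mov(dst, src):
    """A move makes dst's expression identical to src's: a safe identity."""
    exprs[dst] = exprs[src]

def track_add_imm(dst, src, imm):
    """An ALU op builds a new expression from its source's expression."""
    exprs[dst] = exprs[src] + ("+", imm)

# The example: mov eax,ebx ; add eax,4
track_mov("eax", "ebx")
identical_after_mov = exprs["eax"] == exprs["ebx"]  # identity detected
track_add_imm("eax", "eax", 4)

assert identical_after_mov
assert exprs["eax"] == ("ebx0", "+", 4)  # eax tracked as ebx0 + 4
```

After the mov, the tracker can redirect any dependency on eax towards the producer of ebx (arc-redirection) and eliminate the mov itself (node-elimination), exactly the safe detection this sub-section describes.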