CARNEGIE MELLON UNIVERSITY

VALUE LOCALITY AND SPECULATIVE EXECUTION

A DISSERTATION SUBMITTED TO THE GRADUATE SCHOOL IN PARTIAL FULFILLMENT OF THE REQUIREMENTS for the degree of

DOCTOR OF PHILOSOPHY in ELECTRICAL AND COMPUTER ENGINEERING

by

Mikko Herman Lipasti

Pittsburgh, PA 15213 April 1997


Abstract

This thesis introduces a program attribute called value locality and proposes speculative execution under the weak dependence model. The weak dependence model lays a theoretical foundation for exploiting value locality and other program attributes by speculatively relaxing and deferring the detection and enforcement of control- and data-flow dependences between instructions, exposing more instruction-level parallelism without violating program correctness. Value locality is a program attribute that describes the likelihood of the recurrence of a previously-seen value within a storage location inside a computer system. Most modern processors already exploit value locality through the use of control speculation (i.e. branch prediction), which seeks to predict the future values of condition code bits and branch-target addresses based on previously-seen values. Experimental results indicate that value locality exists not only for condition codes and branch target addresses, but for general-purpose and floating-point registers as well. Furthermore, value locality exists not only in the data-flow portion of a processor, but also in the control logic, where both register and memory dependences between instructions tend to remain static and are relatively easy to predict.

To exploit value locality, several dynamic prediction mechanisms are proposed. These mechanisms increase instruction-level parallelism by speculatively collapsing explicit and implicit dependences between instructions and folding away execution pipeline stages under the pipeline contraction framework. Detailed evaluation of value prediction, both for load instructions alone and for all instructions that write general-purpose or floating-point registers, shows significant potential for performance improvements. Further experimental results indicate that both register and memory dependence relationships between instructions are easily predictable. These discoveries yield significant potential for further performance improvements, particularly in conjunction with wide-issue and deeply-pipelined superscalar processors that employ aggressive techniques like accurate dynamic branch prediction and instruction fetching via trace cache to overcome control- and data-flow restrictions on the parallel execution of serial programs. Finally, this thesis introduces a new microarchitectural paradigm called Superflow that supersedes historical limits on instruction flow, register dataflow, and memory dataflow, and demonstrates potential performance improvements of 2-4x over current state-of-the-art microprocessors.


Acknowledgments

First of all, I want to express my appreciation to my advisor, John Paul Shen, for his support, advice, and leadership throughout this project. John's ability to condense wild ideas and conjectures down to key concepts, as well as his willingness to entertain and pursue such notions, has been instrumental in evolving this thesis from vague ideas about redundancy in memory traffic to effective and powerful microarchitectural techniques for increasing instruction-level parallelism.

I would also like to thank professors Daniel P. Siewiorek and Rob Rutenbar of CMU, and James Smith of the University of Wisconsin, for serving on my thesis committee. I benefited greatly from their wealth of knowledge and experience during the transition from Ph.D. proposal to completed Ph.D. thesis. I also want to express my gratitude to my outside committee members, Dr. Greg Pfister of IBM Austin and Dr. Steve Kunkel of IBM Rochester. Their insights and feedback helped my research gravitate toward real-world implementation and performance issues.

I also benefited greatly from numerous discussions with other members of our research group, including Bryan Black, Yuan Chou, Andrew Huang, Derek Noonburg, and Chris Newburn. Special thanks go to Chris Wilkerson for his key insights in the early stages of this thesis, as well as for coining the term Superflow to describe the microarchitectural paradigm advocated in this thesis. I also want to express my appreciation to my management at IBM, whose professional and financial support greatly eased the transition from salaried employee to indentured servant (i.e. graduate student).

Finally, thanks to my loving wife Erica Ann Lipasti for her patience, forbearance, and willingness to make tremendous sacrifices in not just financial security, standard of living, and quality of life, but also personal friendships, proximity to family, and psychological and emotional support systems in order to let me perform this work. Her enduring support during this time is the ultimate sign of her love, devotion, and respect. Furthermore, her selfless devotion to caring and providing for our dear daughter, Emma Kristiina—who has evolved from a toddler barely out of helpless infancy to a happy, active, intelligent, and very outgoing young lady during our time here at CMU—has been instrumental in allowing me to focus my energies on my research without having to sacrifice those precious and irreplaceable times together as a family.


Contents

Abstract
Acknowledgments

CHAPTER 1    Introduction
    Historical Background and Motivation
        Taxonomy of Speculative Execution
    Theoretical Contributions
        Value Locality
        The Weak Dependence Model
        Pipeline Contraction Framework
    Microarchitectural Contributions
        Load and Register Value Prediction
        Dependence Prediction
        Alias Prediction
        Putting it all Together: The Superflow Paradigm
    Thesis Overview

CHAPTER 2    Machine Models and Workloads
    Machine Models
        PowerPC 620
        PowerPC 620+
        Infinite PowerPC Model
        Alpha AXP 21164
        Execution-Driven Idealized PowerPC Model
    Misprediction Recovery Mechanisms
        Instruction Refetch
        Instruction Reissue
        Selective Instruction Reissue
    Workloads
        SPEC92 Integer Suite
        Miscellaneous Integer Programs
        SPEC92 Floating Point Suite
        SPEC95 Integer Suite
        SPEC95 Floating Point Suite

CHAPTER 3    Load Value Prediction
    Introduction and Related Work
    Value Locality
    Exploiting Value Locality
        Load Value Prediction Table
        Dynamic Load Classification
        Constant Verification Unit
        The Load Value Prediction Unit
        LVP Unit Implementation Notes
    Microarchitectural Models
        PowerPC 620/620+ LVP Unit Operation
        Alpha AXP 21164 LVP Unit Operation
    Experimental Framework
    Experimental Results
        Base Machine Model Speedups with Realistic LVP
        Enhanced Machine and LVP Model Speedups
        Distribution of Load Verification Latencies
        Data Dependency Resolution Latencies
        Bank Conflicts
    Conclusions and Future Work

CHAPTER 4    Register Value Prediction
    Motivation
    Value Locality
    Exploiting Value Locality
    The Value Prediction Unit
    Verifying Predictions
    Microarchitectural Models
        VP Unit Operation
        Misprediction Penalty
    Experimental Framework
    Experimental Results
        PowerPC 620 Machine Model Speedups
        PowerPC 620+ Machine Model Speedups
        Infinite Machine Model Speedups
    VP Unit Implementation
    Conclusions and Future Work

CHAPTER 5    Dependence Prediction
    Motivation and Related Work
    Detecting Control and Data Dependences
    Experimental Framework
    Pipelined Dispatch Structure
    Dependence Prediction and Recovery
    Source Operand Value Prediction and Recovery
    Conclusions and Future Work

CHAPTER 6    The Superflow Paradigm
    Background
    The Superflow Paradigm
        The Weak Dependence Model
        Generalized Speculation and Pipeline Contraction
        Overview of the Superflow Paradigm
    Instruction Flow Techniques
        Conditional Branch Throughput
        Taken Branch Throughput
        Misprediction Latency
    Register Data Flow Techniques
        Dependence Detection and Prediction
        Eliminating Dependences
    Memory Data Flow Techniques
        Memory Latency
        Memory Bandwidth
    Summary and Conclusions

CHAPTER 7    Summary and Conclusions
    Thesis Summary
    Key Contributions
        Theoretical Contributions
        Microarchitectural Contributions
    Future Work

APPENDIX A    Additional Data on Load Value Prediction
    Miscellaneous PowerPC Data
    PowerPC Load Value Prediction Data
    Miscellaneous Alpha AXP 21164 Data

APPENDIX B    Additional Data on Register Value Prediction
    Miscellaneous PowerPC Data
    PowerPC Register Value Prediction Data

APPENDIX C    Additional Data on Superflow Machine Models
    Additional Results For Instruction Flow
    Additional Results For Register Data Flow
    Additional Results For Memory Data Flow

List of Figures

Figure 1-1. Taxonomy of Speculative Execution
Figure 1-2. Load Value Locality
Figure 1-3. Register Value Locality
Figure 1-4. Pipeline Contraction
Figure 1-5. Block Diagram of Value Prediction Unit
Figure 1-6. Example use of Value Prediction Mechanism
Figure 1-7. Branch Misprediction Penalty
Figure 1-8. Dependence Prediction Mechanism
Figure 1-9. Superflow Overview
Figure 2-1. PPC 620 and 620+ Block Diagram
Figure 2-2. Alpha AXP 21164 Block Diagram
Figure 3-1. Load Value Locality
Figure 3-2. PowerPC Value Locality by Data Type
Figure 3-3. Block Diagram of the LVP Mechanism
Figure 3-4. Base Machine Model Speedups
Figure 3-5. Load Verification Latency Distribution
Figure 3-6. Data Dependency Resolution Latencies
Figure 3-7. Percentage of Cycles with Bank Conflicts
Figure 4-1. Register Value Locality
Figure 4-2. Register Value Locality by Instruction Type
Figure 4-3. Value Prediction Unit
Figure 4-4. VPT Hit Rate Sensitivity to Size
Figure 4-5. CT Hit Rates
Figure 4-6. Example use of Value Prediction Mechanism
Figure 4-7. 620 Speedups
Figure 4-8. 620+ Speedups
Figure 4-9. Infinite Machine Model Speedups
Figure 4-10. Doubling Data Cache vs. VP
Figure 5-1. Branch Misprediction Penalty
Figure 5-2. Pipelined Dispatch Structure
Figure 5-3. Dependence Prediction Mechanism
Figure 5-4. Source Operand Value Prediction Mechanism
Figure 5-5. Effect of Dependence and Value Prediction
Figure 5-6. Reduced Branch Misprediction Penalty
Figure 6-1. Pipeline Contraction
Figure 6-2. Superflow Overview
Figure 6-3. Superflow Instruction Fetch Unit
Figure 6-4. Instruction Fetch Unit Performance
Figure 6-5. Superflow Instruction Fetch Unit Performance
Figure 6-6. Source Operand Value Predictability
Figure 6-7. Dependence Predictability
Figure 6-8. Effect of Deep Pipelining
Figure 6-9. Effect of Finite Reorder Buffer
Figure 6-10. Alias Predictability
Figure 6-11. Load Stream Partitioning
Figure 6-12. Load Value Predictability
Figure 6-13. Effect of Constrained Memory Bandwidth

List of Tables

Table 2-1. PowerPC Machine Model Specifications
Table 2-2. Alpha AXP 21164 Instruction Latencies
Table 2-3. Idealized PowerPC Model Instruction Latencies
Table 2-4. SPEC92 Integer Benchmark Descriptions
Table 2-5. Miscellaneous Integer Benchmark Descriptions
Table 2-6. SPEC92 Floating Point Benchmark Descriptions
Table 2-7. SPEC95 Integer Benchmark Set
Table 2-8. SPEC95 Floating Point Benchmark Set
Table 3-1. LVP Unit Configurations
Table 3-2. LCT Hit Rates
Table 3-3. Successful Constant Identification Rates
Table 3-4. PowerPC 620+ Speedups
Table 4-1. Instruction Types
Table 4-2. Classification Table Configurations
Table 4-3. Baseline Performance (IPC)
Table 4-4. VP Unit Configurations
Table 5-1. Machine Model Parameters
Table 5-2. Benchmark Characteristics
Table 5-3. Dependence Prediction Results
Table 5-4. Source Operand Value Prediction Results
Table 6-1. Evolution of Microprocessors
Table 6-2. Benchmark Characteristics
Table A-1. PowerPC 620 Model Miscellaneous Data
Table A-2. PowerPC 620+ Model Miscellaneous Data
Table A-3. PowerPC 620 LVP Data
Table A-4. PowerPC 620+ LVP Data
Table A-5. Alpha AXP 21164 Data
Table B-1. PowerPC 620 Model Miscellaneous Data
Table B-2. PowerPC 620+ Model Miscellaneous Data
Table B-3. PowerPC 620 VP Data
Table B-4. PowerPC 620+ VP Data
Table B-5. Infinite PowerPC VP Data
Table C-1. SPECInt95 Results for Instruction Flow
Table C-2. SPECFP95 Results for Instruction Flow
Table C-3. SPECInt95 Results for Register Data Flow
Table C-4. SPECFP95 Results for Register Data Flow
Table C-5. SPECInt95 Results for ROB Size 128 and Fetch Width 16
Table C-6. SPECFP95 Results for ROB Size 128 and Fetch Width 16
Table C-7. SPECFP95 Results for ROB Size 256 and Fetch Width 16
Table C-8. SPECInt95 Results for Memory Data Flow for ROB Size 128
Table C-9. SPECFP95 Results for Memory Data Flow for ROB Size 128
Table C-10. SPECFP95 Results for Memory Data Flow for ROB Size 256

CHAPTER 1

Introduction

This thesis introduces a ubiquitous program attribute called value locality and proposes speculative execution under the weak dependence model. The weak dependence model lays a theoretical foundation for exploiting value locality and other program attributes by speculatively relaxing and deferring the detection and enforcement of control- and data-flow dependences between instructions to expose more instruction-level parallelism. To exploit value locality, several dynamic prediction mechanisms are proposed. These mechanisms increase instruction-level parallelism (ILP, commonly measured in instructions per cycle, or IPC) by speculatively collapsing explicit and implicit dependences between instructions and folding away execution pipeline stages under the pipeline contraction framework.

1.1 Historical Background and Motivation

There are two fundamental restrictions that limit the amount of instruction-level parallelism (ILP) that can be extracted from sequential programs: control flow and data flow. Control flow limits ILP by imposing serialization constraints at forks and joins in a program's control flow graph [1]. Data flow limits ILP by imposing serialization constraints on pairs of instructions that are data dependent (i.e. one needs the result of another to compute its own result, and hence must wait for the other to complete before beginning to execute). Examining the extent and effect of these limits has been a popular and important area of research, particularly in the case of control flow [2,3,4,5]. Continuing advances in the development of accurate branch predictors (e.g. [6]) have led to increasingly aggressive control-speculative microarchitectures (e.g. the Intel Pentium Pro [7]), which use branch prediction and speculative execution to bypass control dependences and expose additional instruction-level parallelism to the microarchitecture. Meanwhile, numerous mechanisms have been proposed and implemented to eliminate false data dependences and tolerate the latencies induced by true data dependences by allowing instructions to execute out of program order (see [8] for an overview).

Surprisingly, in light of the extensive energies focused on eliminating control-flow restrictions on parallel instruction issue, less attention has been paid to eliminating data-flow restrictions on parallel issue. Recent work has focused primarily on reducing the latency of specific types of instructions (usually loads from memory) by rearranging pipeline stages [9, 10], initiating memory accesses earlier [11], or speculating that dependences to earlier stores do not exist [12, 13, 14, 15]. The most relevant prior work in the area of eliminating data-flow dependences is the Tree Machine [16,17], which uses a value cache to store and look up the results of recurring arithmetic expressions to eliminate redundant computation (the value cache, in effect, performs common subexpression elimination [1] in hardware). Richardson follows up on this concept in [18] by introducing the concepts of trivial computation, which is defined as the trivialization of potentially complex operations by the occurrence of simple operands, and redundant computation, where an operation repeatedly performs the same computation because it sees the same operands. He proposes a hardware mechanism (the result cache) which reduces the latency of such trivial or redundant complex arithmetic operations by storing and looking up their results in the result cache.

In this thesis, we introduce the concept of value locality, which is similar to redundant computation, along with a proposed technique--Value Prediction, or VP--for predicting the results of instructions at dispatch by exploiting the affinity between instruction addresses and the values these instructions produce. VP differs from Harbison's value cache and Richardson's result cache in two important ways: first, the VP table is indexed by instruction address, and hence value lookups can occur very early in the pipeline; second, it is speculative in nature, and relies on a verification mechanism to guarantee correctness. In contrast, both Harbison and Richardson use table indices that are only available later in the pipeline (Harbison uses data addresses, while Richardson uses actual operand values), and both require their predictions to be correct, hence requiring mechanisms for keeping their tables coherent with all other computation.


[Figure 1-1. Taxonomy of Speculative Execution. Speculative execution divides into control speculation--branch direction (binary) and branch target (multi-valued)--and data speculation--data location, either aliased (binary) or address (multi-valued), and data value (multi-valued).]

1.1.1 Taxonomy of Speculative Execution

In order to place our work on prediction-based speculative execution into a meaningful historical context, we introduce a taxonomy of speculative execution. This taxonomy, summarized in Figure 1-1, categorizes our work as well as previously introduced techniques based on which types of dependences are being bypassed (control vs. data), whether the speculation relates to storage location or value, and what type of decision must be made to enable the speculation (binary vs. multi-valued).

Control Speculation

There are essentially two types of control speculation: speculating on the direction of a branch, which requires a binary decision (taken vs. not-taken); and speculating on the target of a branch, which requires a multi-valued decision (the target can potentially be anywhere in the program's address space). Examples of the former are any of the many branch prediction schemes explored in the literature (e.g. [19,6,20]), while examples of the latter are the Branch Target Buffer (BTB) or Branch Target Address Cache (BTAC) units included on most modern microprocessors (e.g. the PowerPC 620 [15] or the Intel Pentium Pro [7]). A novel mechanism for performing both branch direction and branch target prediction is proposed as part of the Superflow microarchitecture paradigm in Chapter 6.


Data Speculation

Data speculation techniques break down logically into two categories: those that speculate on the storage location of the data, and those that speculate on the actual value. Furthermore, techniques that speculate on the location come in two fundamentally different flavors: those that speculate on a specific attribute of the storage location (e.g. whether it is aliased with an earlier definition), and those that speculate on the address of the storage location. An example of the former is speculative disambiguation, which optimistically assumes that an earlier definition does not alias with a current use, and provides a mechanism for checking the accuracy of that assumption. Speculative disambiguation has been implemented both in software [13] and in hardware [12, 14, 15]. Another example of this type of speculation occurs implicitly in most control-speculative processors, whenever execution proceeds speculatively past a join in the control-flow graph where multiple reaching definitions for a storage location are live [1]. By speculating past that join, the processor hardware is implicitly speculating that the definition on the predicted path to the join in question is in fact the correct one (as opposed to the definition on an alternate path).

There are a large number of techniques that speculate on data address. Most prefetching techniques, for example, are speculative in nature and rely on some heuristic for generating addresses of future memory references (e.g. [21, 22, 23, 24, 25]). Of course, since prefetching has no architected side effects, no mechanism is needed for verifying the accuracy of the prediction or for recovering from mispredictions. Another example of a technique that speculates on data address is fast address calculation [26, 11], which enables early initiation of memory loads by speculatively generating addresses early in the pipeline. Dependence prediction, proposed in Chapter 5, and alias prediction, proposed in Chapter 6, are speculative techniques that predict the current storage location of register input operands (i.e. rename buffer number) and memory operands (e.g. store queue entry), respectively.

The final category in our taxonomy, techniques that speculate on data value, has received little attention in the literature. The only work we are aware of is that proposed in this thesis (preliminary results have been published in [27] and [28]). Note that neither the Tree Machine [16,17] nor Richardson's work [18] qualifies, since they are not speculative.


1.2 Theoretical Contributions

1.2.1 Value Locality

In this thesis, we introduce the concept of value locality, which we define as the likelihood of a previously-seen value recurring repeatedly within a storage location. Although the concept is general and can be applied to any storage location within a computer system, we have limited our study to examine only the value locality of general-purpose or floating point registers immediately following instructions that write to those registers, as well as the value locality exhibited in dependence relationships between instructions. A plethora of previous work on static and dynamic branch prediction (e.g. [19,6,20]) has focused on an even more restricted application of value locality, namely the prediction of a single condition bit based on its past behavior.

Intuitively, it seems that it would be a very difficult task to discover any useful amount of value locality in a general purpose register. After all, a 32-bit register can contain any one of over four billion values--how could one possibly predict which of those is even somewhat likely to occur next? As it turns out, if we narrow the scope of our prediction mechanism by considering each static instruction individually, the task becomes much easier and we are able to accurately predict a significant fraction of register values being written by machine instructions. We examine the phenomenon of value locality more closely in Section 3.2 on page 37 and Section 4.2 on page 58.

The initial benchmark set that we use to explore value locality and quantify its performance impact consists of the SPEC92 integer suite (described in Section 2.3.1 on page 31) and miscellaneous integer benchmarks (described in Section 2.3.2 on page 31). In later experiments, we augment this initial benchmark set with integer benchmarks from the more recent SPEC95 suite and floating point benchmarks from both SPEC92 and SPEC95 (all of these benchmarks are described in Section 2.3 on page 31).

Load Value Locality

Figure 1-2 shows the average value locality for load instructions in each of the benchmarks. The value locality of each static load is measured by counting the number of times that load instruction retrieves a value from memory that matches a previously seen value for that static load and dividing by the total number of dynamic occurrences of that load. The average load value locality of a benchmark is the dynamically-weighted average of the value localities of all the static loads in that benchmark. Two sets of numbers are shown, one (light bars) for a history depth of one (i.e. we check for matches against only the most recently retrieved value), while the second set (dark bars) has a history depth of sixteen (i.e. we check against the last sixteen unique values). We see that even with a history depth of one, most of the integer programs exhibit load value locality in the 50% range, while extending the history depth to sixteen can improve that to better than 80%. What that means is that the vast majority of static loads exhibit very little variation in the values that they load during the course of a program's execution. Unfortunately, one of our benchmarks--cjpeg--demonstrates poor load value locality.

[Figure 1-2. Load Value Locality. The light bars show value locality for a history depth of one, while the dark bars show it for a history depth of sixteen. Benchmarks: cc1-271, cjpeg, compress, eqntott, gawk, gperf, grep, mpeg, perl, quick, sc, xlisp.]

Register Value Locality

Figure 1-3 shows the average value locality for all instructions that write an integer or floating point register in each of the benchmarks. The value locality of each static instruction is measured by counting the number of times that instruction writes a value that matches a previously seen value for that static instruction and dividing by the total number of dynamic occurrences of that instruction. The average value locality of a benchmark is the dynamically-weighted average of the value localities of all the static instructions in that benchmark. Two sets of numbers are shown, one (light bars) for a history depth of one (i.e. we check for matches against only the most-recently-written value), while the second set (dark bars) has a history depth of four (i.e. we check against the last four unique values). We see that even with a history depth of one, most of the programs exhibit value locality in the 40-50% range (average 51%), while extending the history depth to four (along with a perfect mechanism for choosing the right one of the four values) can improve that to the 60-70% range (average 66%). What that means is that a majority of static instructions exhibit very little variation in the values that they write into registers during the course of a program's execution. Unfortunately, three of our benchmarks--cjpeg, compress, and quick--demonstrate poor register value locality.

[Figure 1-3. Register Value Locality. The light bars show value locality for a history depth of one, while the dark bars show it for a history depth of four. Benchmarks: cc1-271, cjpeg, compress, eqntott, gawk, gperf, grep, mpeg, perl, quick, sc, xlisp.]
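To make the measurement above concrete, the following minimal sketch computes value locality at a given history depth from a trace of (static instruction address, produced value) pairs. The trace format and function name are illustrative assumptions, not part of the thesis's simulation infrastructure; a hit is counted whenever a new value matches any of the last depth unique values produced by the same static instruction, mirroring the light (depth 1) and dark (depth 4 or 16) bars described above.

    from collections import defaultdict

    def value_locality(trace, depth=1):
        """Fraction of dynamic results matching one of the last `depth`
        unique values produced by the same static instruction."""
        history = defaultdict(list)        # pc -> up to `depth` unique values
        hits, total = 0, 0
        for pc, value in trace:
            values = history[pc]
            total += 1
            if value in values:
                hits += 1
                values.remove(value)       # refresh to most-recently-used
            elif len(values) == depth:
                values.pop(0)              # evict the least-recently-used value
            values.append(value)
        return hits / total if total else 0.0

    # Toy trace: a static load at pc 0x100 alternating between two values
    # is 0% value-local at depth 1, but about 67% value-local at depth 2.
    trace = [(0x100, v) for v in (1, 2, 1, 2, 1, 2)]
    print(value_locality(trace, depth=1), value_locality(trace, depth=2))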

1.2.2 The Weak Dependence Model

The implied inter-instruction precedences of a sequential program are an overspecification and need not be rigorously enforced to meet the requirements of the sequential execution model. The actual program semantics and inter-instruction dependences are specified by the control-flow graph (CFG) and the data-flow graph (DFG). As long as the serialization constraints imposed by the CFG and the DFG are not violated, the execution of instructions can be overlapped and reordered (e.g. via out-of-order execution) to achieve better performance by avoiding the enforcement of implied but unnecessary precedences. However, true inter-instruction dependences must still be enforced. To date, all machines enforce such dependences in a rigorous fashion that involves the following two requirements:

• Dependences are determined in an absolute and exact way, i.e. two instructions are identified as either dependent or independent, and when in doubt dependences are pessimistically assumed to exist.
• Dependences are enforced throughout instruction execution, i.e. the dependences are never allowed to be violated, and are enforced continuously while the instructions are in flight.

We classify such a traditional and conservative approach as adhering to the strong dependence model for program execution. We believe that the traditional strong dependence model is overly rigorous and unnecessarily restricts available parallelism. This thesis proposes the weak dependence model, which specifies that:

• Dependences need not be determined exactly or assumed pessimistically, but can instead be optimistically approximated or even temporarily ignored.
• Dependences can be temporarily violated during instruction execution as long as recovery can be performed prior to affecting the permanent machine state.

The advantage of adopting the weak dependence model is that the program semantics as specified by the CFG and DFG need not be completely determined before the machine can process instructions. Furthermore, the machine can now speculate aggressively and temporarily violate the dependences as long as corrective measures are in place to recover from misspeculation. If a significant percentage of the speculations are correct, the machine can effectively exceed the performance limit imposed by the traditional strong dependence model.

Conceptually speaking, a machine that exploits the weak dependence model has two interacting engines. The front-end engine assumes the weak dependence model and is highly speculative. It tries to make predictions about instructions in order to aggressively process instructions. When the predictions are correct, these speculative instructions will effectively have skipped over or folded out certain pipeline stages. The back-end engine still uses the strong dependence model to validate the speculations, to recover from misspeculation, and to provide history and guidance information to the speculative engine. By combining these two interacting engines, an unprecedented level of instruction-level parallelism can be harvested without violating the program semantics. The edges in the DFG that represent inter-instruction dependences are now enforced in the critical path only when misspeculations occur. Essentially, these dependence edges have become probabilistic, and the serialization penalties incurred due to enforcing these dependences are eliminated or masked whenever correct speculations occur. Hence, the traditional data-flow limit based on the length of the critical path in the DFG is no longer a hard limit that cannot be exceeded [28].
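As a toy illustration of why probabilistic dependence edges can beat the data-flow limit, the sketch below charges one cycle to a dependence edge when a simple last-value guess by the front-end engine turns out to be correct, and the full result latency plus a one-cycle recovery penalty when it does not. The predictor, latencies, and cost model are illustrative assumptions, not a mechanism evaluated in this thesis.

    def weak_dependence_cycles(values, latency=3, penalty=1):
        """Cycles charged to a chain of dependent instructions when the
        front end guesses each result as the last value seen (weak
        dependence model) and the back end verifies it (strong model)."""
        cycles, last = 0, None
        for v in values:
            if v == last:
                cycles += 1                  # edge folded away: latency hidden
            else:
                cycles += latency + penalty  # edge enforced, plus recovery
            last = v
        return cycles

    # A value-local result stream costs 17 cycles here, versus 8 * 3 = 24
    # under the strong dependence model, which always pays the full latency.
    print(weak_dependence_cycles([7, 7, 7, 7, 0, 7, 7, 7]))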


[Figure 1-4. Pipeline Contraction. Branch prediction is used to fold the dispatch and execute pipeline stages into the fetch stage, and value prediction is used to fold the execute stage into the dispatch stage.]

1.2.3 Pipeline Contraction Framework

In this section, we introduce a generalized framework called pipeline contraction that captures all forms of speculation, both in the control-flow and data-flow domains. Control-flow speculation, already ubiquitous in high-performance processors, consists of speculating on both the direction (taken vs. not taken) and the target (if taken) of branch instructions. Data-flow speculation, which is less common, consists of speculating on the specific attributes or even values of instruction inputs and outputs. Both types of speculation can be described as attempts to contract the instruction execution pipeline by probabilistically obtaining the semantic outcome of an instruction as early as possible. For example, in Figure 1-4, we see the semantics of a branch instruction, which without speculation would require three pipeline stages, contracted down to one stage whenever both the target and the direction of the branch can be correctly predicted during the fetch stage. Similarly, data speculation techniques such as value prediction [28] can be used to contract execution pipelines and allow dependent instructions to execute in parallel.

The pipeline contraction framework is a useful tool for assessing the potential benefit of speculative techniques by considering the following metrics: the degree of contraction that can be obtained with the proposed technique (i.e. how many pipeline stages can be folded away), the relative frequency and accuracy of these contractions, and the delays incurred while recovering from incorrect contractions. For example, branch prediction is a very powerful technique because it measures up well against all three factors: it folds away a large number of pipeline stages, branches occur frequently and are very predictable, and recovery from mispredictions costs little or no additional delay relative to not predicting the branches.

Within this framework, value prediction can be generalized to include: 1) predicting direction and target of a branch instruction; 2) predicting source and/or destination operands of an ALU instruction; and 3) predicting the memory address and/or operand of a load/store instruction. The number of stages folded away is determined by the distance (in pipe stages) between where the prediction is made and where the value is normally produced.
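These three metrics can be combined into a rough expected-benefit estimate. The sketch below, with hypothetical numbers rather than figures from this thesis, weighs stages folded on correct contractions against the recovery delay on incorrect ones:

    def expected_contraction_benefit(stages_folded, coverage, accuracy, recovery_penalty):
        """Expected pipeline stages saved per instruction for a speculative
        technique, given the three pipeline contraction metrics above."""
        win = coverage * accuracy * stages_folded
        loss = coverage * (1.0 - accuracy) * recovery_penalty
        return win - loss

    # Hypothetical technique: folds 2 stages, attempts 60% of instructions,
    # is right 90% of the time, and pays 1 extra stage per misprediction.
    print(expected_contraction_benefit(2, 0.60, 0.90, 1))   # ~1.02 stages saved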

1.3 Microarchitectural Contributions

1.3.1 Load and Register Value Prediction

The fact that the register writes in many programs demonstrate a significant degree of value locality opens up exciting new possibilities for the microarchitect. Since the results of many instructions can be accurately predicted before they are issued or executed, dependent instructions are no longer bound by the serialization constraints imposed by operand data flow. Instructions can now be scheduled speculatively with additional degrees of freedom to better utilize existing functional units and hardware buffers, and are frequently able to complete execution sooner since the critical paths through the data dependence graph have been collapsed.

We propose two approaches to exploiting value locality: Load Value Prediction and the more general Register Value Prediction. Both of these share two basic mechanisms: one for accurately predicting values--the VP (value prediction) unit--and one for verifying these predictions.

The Value Prediction Unit

Value prediction is useful only if it can be done accurately, since incorrect predictions can lead to increased structural hazards and longer latency (the misprediction penalty is described in greater detail on page 14). Hence, we propose a two-level prediction structure for the VP Unit: the first level is used to generate the prediction values, and the second level is used to decide whether or not the predictions are likely to be accurate. The internal structure of the VP Unit is illustrated in Figure 1-5. The VP Unit consists of two tables: the Classification Table (CT) and the Value Prediction Table (VPT), both of which are direct-mapped and indexed by the instruction address (PC) of the instruction being predicted.

[Figure 1-5. Block Diagram of Value Prediction Unit. The PC of the instruction being predicted is used to index into the Value Prediction Table to find a value to predict. At the same time, the Classification Table is also indexed with the PC to determine whether or not a prediction should be made. When the instruction completes, both the prediction history and value history are updated.]

Entries in the CT contain two fields: the valid field, which consists of either a single bit that indicates a valid entry or a partial or complete tag field that is matched against the upper bits of the PC to indicate a valid entry; and the prediction history field, which is a saturating counter of 1 or more bits that tracks the correctness of recent predictions. The prediction history is incremented or decremented whenever a prediction is correct or incorrect, respectively, and is used to classify instructions as either predictable or unpredictable. This classification is used to decide whether or not the result of a particular instruction should be predicted. Increasing the number of bits in the saturating counter adds hysteresis to the classification process and can help avoid erroneous classifications by ignoring anomalous values and/or destructive interference caused by multiple static instructions mapping to the same CT entry. The relatively simple CT configurations described in Chapters 2-4 (as well as [27] and [28]) achieved classification hit rates between 70% and 95%.

The VPT entries also consist of two fields: a valid field, which, again, can consist of a single valid bit or a full or partial tag; and a value history field, which contains one or more 32- or 64-bit values that are maintained with an LRU policy. The value history fields are written when an instruction is first encountered (by its result) or whenever a prediction is incorrect (by the actual result). The VPT replacement policy is also governed by the CT prediction history to introduce hysteresis and avoid replacing useful values with less useful ones.
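A behavioral sketch of this two-table structure follows, assuming a direct-mapped organization, a value history of depth one, and a 2-bit saturating counter. The class name, table size, and threshold are illustrative, not the configurations evaluated in later chapters.

    class ValuePredictionUnit:
        """Sketch of the VP Unit: a Value Prediction Table (VPT) supplies
        values, and a Classification Table (CT) of saturating counters
        decides whether to predict at all."""

        def __init__(self, entries=4096, threshold=2):
            self.entries = entries
            self.threshold = threshold     # counter value needed to predict
            self.vpt = [None] * entries    # value history (depth one here)
            self.ct = [0] * entries        # 2-bit saturating counters

        def _index(self, pc):
            return (pc >> 2) % self.entries    # drop instruction offset bits

        def predict(self, pc):
            """Fetch/dispatch: return a value, or None if unpredictable."""
            i = self._index(pc)
            if self.ct[i] >= self.threshold and self.vpt[i] is not None:
                return self.vpt[i]
            return None

        def update(self, pc, actual):
            """Completion: train the counter and refresh the value history."""
            i = self._index(pc)
            if self.vpt[i] == actual:
                self.ct[i] = min(self.ct[i] + 1, 3)
            else:
                self.ct[i] = max(self.ct[i] - 1, 0)
                self.vpt[i] = actual       # replace value on a misprediction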

[Figure 1-6. Example use of Value Prediction Mechanism. The dependent instruction shown on the right uses the predicted result of the instruction on the left, and is able to issue and execute in the same cycle.]

Verifying Predictions

Since value prediction is by nature speculative, we need a mechanism for verifying the correctness of the predictions and efficiently recovering from mispredictions. This mechanism is summarized in the example of Figure 1-6, which shows the parallel execution of two data-dependent instructions. The producer instruction, shown on the left, has its value predicted and written to its rename buffer during the fetch and dispatch cycles. The consumer instruction, shown on the right, reads the predicted value from the rename buffer at the beginning of the execute cycle, and is able to issue and execute normally, but is forced to retain its reservation station. Meanwhile, the predicted instruction also executes, and its computed result is compared with the predicted result during its completion stage. If the values match, the consumer instruction releases its reservation station. If not, completion of the first instance of the consumer instruction is invalidated, and a second instance reissues with the correct value.


Verifying Constant Loads

In our experiments with Load Value Prediction, we discovered that certain loads exhibit constant behavior; that is, they load the same constant value repeatedly. To exploit this behavior and avoid accessing the conventional memory hierarchy for these loads, we propose the constant verification unit (CVU), which is described in further detail in Chapter 3 (and [27]). To verify predictable loads, we simply retrieve the value from the conventional memory hierarchy and compare the predicted value to the actual value, just as we do in the more generalized value prediction scheme (see Figure 1-6). However, for highly-predictable or constant loads, we use the CVU, which allows us to avoid accessing the conventional memory system completely by forcing the VPT entries that correspond to constant loads to remain coherent with main memory (loads are classified as constant if the saturating counter at their VPT entry has reached its maximum value).

For the VPT entries that are classified as constants by the CT, the data address and the index of the VPT entry are placed in a separate, fully-associative table inside the CVU. This table is kept coherent with main memory by invalidating any entries where the data address matches a subsequent store instruction. Meanwhile, when the constant load executes, its data address is concatenated with the VPT index (the lower bits of the instruction address) and the CVU's content-addressable memory (CAM) is searched for a matching entry. If a matching entry exists, we are guaranteed that the value at that VPT entry is coherent with main memory, since any updates (stores) since the last retrieval would have invalidated the CVU entry. If one does not exist, the constant load is demoted from constant to just predictable status, and the predicted value is now verified by retrieving the actual value from the conventional memory hierarchy. We find that an average of 6% (and up to 33% for some benchmarks) of loads from memory can be verified with the CVU, resulting in a proportional reduction of L1 cache bandwidth requirement.
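The sketch below models the CVU behavior just described: entries pair a data address with a VPT index, stores invalidate matching entries, and a hit at load time proves the cached value is still coherent, so the cache access can be skipped. The class name, capacity, and eviction policy are illustrative assumptions.

    class ConstantVerificationUnit:
        """Sketch of the CVU: a small fully-associative table mapping a
        constant load's data address to its VPT index."""

        def __init__(self, capacity=32):
            self.capacity = capacity
            self.entries = {}              # data_address -> vpt_index

        def insert(self, data_address, vpt_index):
            """Register a load that the CT has classified as constant."""
            if len(self.entries) >= self.capacity:
                self.entries.pop(next(iter(self.entries)))   # evict oldest
            self.entries[data_address] = vpt_index

        def store(self, data_address):
            """A store invalidates any matching entry, keeping the table
            coherent with main memory."""
            self.entries.pop(data_address, None)

        def verify(self, data_address, vpt_index):
            """A match proves no store touched this address since the last
            retrieval, so the predicted value needs no cache access."""
            return self.entries.get(data_address) == vpt_index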

VP Unit Operation

The VP Unit predicts the values during fetch and dispatch, then forwards them speculatively to subsequent dependent instructions via the processor's standard result forwarding mechanism. Dependent instructions are able to issue and execute immediately, but are prevented from completing architecturally and are forced to retain possession of their reservation stations until their inputs are no longer speculative. Speculatively forwarded values are tagged with a bit vector representing the uncommitted register writes they depend on, and these tags are propagated to the results of any subsequent dependent instructions. Meanwhile, uncommitted instructions execute in their respective functional units, and the predicted values are verified either by a comparison against the actual values computed by the instructions or, in the case of constant loads, by an address match in the CVU. Once a prediction is verified, all the dependent instructions can either release their reservation stations and proceed into the completion unit (in the case of a correct prediction), or restart execution with the correct register values (if the prediction was incorrect). Since a large number of instructions can be in flight at the same time, the time between predicting and verifying a value can be dozens of cycles or more, allowing the processor to speculate multiple levels down the dependence chain beyond the write, executing instructions and resolving branches that would otherwise be blocked by data-flow dependences.

Misprediction Penalty

The worst-case penalty for an incorrect value prediction in this scheme, as compared to not predicting the value in question, is one additional cycle of latency along with structural hazards that might not have occurred otherwise. The penalty occurs only when a dependent instruction has already executed speculatively, but is waiting in its reservation station for one of its predicted inputs to be verified. Since the value comparison takes an extra cycle beyond the pipeline result latency, the dependent instruction will reissue and execute with the correct value one cycle later than it would have had there been no prediction. In addition, the earlier incorrect speculative issue may have caused a structural hazard that prevented other useful instructions from dispatching or executing. In those cases where the dependent instruction has not yet executed (due to structural hazards or other unresolved data dependences), there is no penalty, since the dependent instruction can issue as soon as the actual computed value is available, in parallel with the value comparison that verifies the prediction. In any case, since the CT accurately prevents most incorrect predictions from occurring, the misprediction penalty does not significantly affect performance.

There can also be a structural hazard penalty even in the case of a correct prediction. Since speculative values are not verified until one cycle after the actual values become available, speculatively issued dependent instructions end up occupying their reservation stations for one cycle longer than they would have had there been no prediction.

[Figure 1-7. Branch Misprediction Penalty. The approximate contribution of RAS, BTB, and BHT mispredictions to overall CPI (cycles per instruction) is shown for single-cycle dispatch (left bar), 2-cycle (middle bar), and 3-cycle (right bar) pipelined dispatch, for the benchmarks go, m88ksim, gcc, compress, li, ijpeg, perl, and vortex.]

1.3.2 Dependence Prediction

Detecting data dependences among multiple instructions in flight is an inherently sequential task that becomes combinatorially very expensive as the number of concurrent in-flight instructions increases. Olukotun et al. argue convincingly against wide-dispatch superscalars because of this very fact [30]. Wide (i.e. greater than four instructions per cycle) dispatch is difficult to implement and has an adverse impact on cycle time because all instructions in a dispatch group must be simultaneously cross-checked. Even current microprocessor implementations with dispatch windows of four or less (e.g. Alpha AXP 21164 and Pentium Pro) require multiple instruction decode and dependence-checking pipeline stages.

One obvious solution to the complexity of dependence detection is to pipeline it into two or more stages to minimize the impact on cycle time. In Section 5.4 of Chapter 5 we propose a pipelined approach to dependence detection that facilitates the implementation of wide-dispatch microarchitectures. However, pipelined dependence checking aggravates the cost of branch mispredictions by delaying the resolution of mispredicted branches. In Figure 1-7, we see the IPC impact of pipelining dependence checking on a 16-dispatch machine with an advanced branch predictor and no other structural resource limitations (refer to Section 2.3.4 on page 32 and Section 2.1.5 on page 24 in Chapter 2 for further details on the benchmarks and machine model). We see that lengthening dispatch to two or three pipeline stages (vs. the baseline case of one) severely increases the number of cycles during which no useful instructions are dispatched and increases CPI (decreases IPC) dramatically, to the point where sustaining even 2-3 IPC becomes very difficult.

We alleviate these problems in two ways: by introducing a scalable, pipelined, and speculative approach to dependence detection called dependence prediction, and by exploiting a modified approach to value prediction called source operand value prediction [28]. Fundamental to both is the notion that maintaining semantic correctness does not require that we rigorously enforce source-to-sink data-flow relationships, or even that we exactly detect these relationships before we start executing. Rather, we use dynamically adaptive techniques for predicting values as well as dependences, and speculatively issue instructions early, before their dependences are resolved or even known.

As shown in Figure 1-8, dependence prediction is implemented with a dependence prediction table (DPT) of 8K entries, which is direct-mapped and indexed by hashing together the instruction address bits, the gshare branch predictor's branch history register (BHR), and the relative position of the operand (i.e. first, second, or third) being looked up. Each DPT entry contains a numeric value which reflects the relative index of that input operand's location in the rename buffers. This relative index is used to check the value silo to see if the operand is already available. If all of the instruction's predicted input operands are available, the instruction is permitted to dispatch early, after the first dispatch cycle. In the second (or third, in the three-cycle dispatch pipeline) dispatch cycle, exact dependence information becomes available, and the earlier prediction is verified against the actual information. In case of a mismatch, the DPT entry is replaced with the correct relative position, and the early dispatch is cancelled.
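A behavioral sketch of the DPT follows. The 8K-entry size and the predict/verify split across dispatch stages come from the description above; the class name and the hash function itself are illustrative assumptions.

    class DependencePredictionTable:
        """Sketch of dependence prediction: a direct-mapped table indexed
        by a hash of the PC, the branch history register (BHR), and the
        operand position, predicting the rename-buffer (value silo) slot
        that holds a source operand."""

        SIZE = 8192

        def __init__(self):
            self.table = [0] * self.SIZE

        def _index(self, pc, bhr, operand_pos):
            return ((pc >> 2) ^ bhr ^ operand_pos) % self.SIZE

        def predict(self, pc, bhr, operand_pos):
            """First dispatch stage: guess the operand's value silo entry."""
            return self.table[self._index(pc, bhr, operand_pos)]

        def verify(self, pc, bhr, operand_pos, actual_slot):
            """Second (or third) dispatch stage: exact dependence information
            is now known; on a mismatch, correct the entry and cancel the
            early dispatch (returns False)."""
            i = self._index(pc, bhr, operand_pos)
            if self.table[i] == actual_slot:
                return True
            self.table[i] = actual_slot
            return False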

[Figure 1-8. Dependence Prediction Mechanism. During the first dispatch stage, the source operand position, PC, and branch history register (BHR) are hashed together to index the dependence prediction table (DPT), which predicts the value silo entry that contains the source operand.]

1.3.3 Alias Prediction

As described in the previous section, detecting and enforcing dependences between multiple instructions in flight presents a serious scalability bottleneck for wide-issue superscalar processors. To a lesser extent, the detection and enforcement of dependences that occur through aliased memory locations also causes difficulties. In this case, however, the problems are caused by the latency involved in computing and comparing the addresses of all loads with all previous unretired stores. Data shown in Chapter 6 indicates that a significant portion of all loads are aliased to earlier stores (15% on average for integer benchmarks, and 6% on average for floating point benchmarks--see Figure 6-10 on page 111). In order to resolve these dependences as early as possible, before the addresses of the aliased stores and loads are even computed, we propose alias prediction, a simple mechanism that associates a dependence distance (measured in number of intervening stores) with each static load. This distance is used to index into the store queue to speculatively resolve store-to-load dependences at dispatch, and correctly resolves roughly 80% of all aliased loads.
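A minimal sketch of the store-queue lookup follows, assuming the store queue is ordered oldest-to-newest and the predicted dependence distance counts back from the most recent store (distance 1 being the youngest). The function name and the queue representation are illustrative assumptions.

    def alias_predict(store_queue, dependence_distance):
        """Speculatively name the store a load will forward from, using the
        load's predicted dependence distance (number of intervening stores),
        before either address has been computed."""
        if 0 < dependence_distance <= len(store_queue):
            return store_queue[-dependence_distance]   # predicted producer
        return None                                    # predict no alias

    # Hypothetical queue of in-flight stores, oldest first.
    sq = ["store@0x1a0", "store@0x1b4", "store@0x1c8"]
    print(alias_predict(sq, 2))   # the load is predicted to alias store@0x1b4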


1.3.4 Putting it all Together: The Superflow Paradigm

This section introduces the Superflow microarchitecture paradigm, which employs a broad spectrum of speculative microarchitectural techniques designed to increase instruction throughput while alleviating the detrimental effects of increasing interconnect latencies and deeper instruction execution pipelines. These techniques are based on the weak dependence model, which relaxes the serialization constraints induced by the detection and enforcement of true data dependences by allowing them to be temporarily and speculatively violated to expose additional instruction-level parallelism. The Superflow paradigm has the potential of sustaining close to 10 IPC for non-numerical programs without requiring any advanced compilation support. Superflow aggressively continues the statistical approach to machine design which emerged during the 1980's--optimizing for the statistical common cases instead of the less likely worst case--and generalizes current control speculation in the form of branch prediction to include data speculation in the form of value prediction [28]. With this generalization, both the control and data flow constraints of a program can be circumvented via aggressive speculation.

[Figure 1-9. Superflow Overview. The Superflow paradigm seeks to maximize instruction throughput by optimizing instruction flow, register data flow, and memory data flow.]

As shown in Figure 1-9, instruction execution can be divided into four logical stages--fetch, dispatch, execute, and commit--each taking one or more cycles. During the fetch stage, instructions are retrieved from cache or main memory. During the dispatch stage, the instructions are decoded, their operands are renamed, and inter-instruction dependences are detected. During the execute stage, the instructions first check to see if their source operands are available. If they are, the instructions execute and forward their own results back to subsequent instructions that might be waiting for them. Finally, in the commit stage, instructions are allowed to write back their results into the architected registers in program order.

Figure 1-9 also identifies the three key parameters that the Superflow paradigm seeks to maximize: instruction flow, which is the rate at which useful instructions are fetched and dispatched to the execution core; register data flow, which is the rate at which register values become available (i.e. inter-instruction data dependences are resolved); and memory data flow, which is the rate at which data values are stored to and retrieved from data memory. Chapter 6 will present techniques for increasing all of these rates of flow beyond traditional limits. Hence the name Superflow.

1.4 Thesis Overview

This dissertation documents, through an approximately chronological narrative, discoveries and experiments related to value locality and its various applications to speculative execution. Chapter 2 provides an overview of the workloads and machine models used to quantitatively evaluate the techniques introduced in this thesis. Chapter 3 describes and evaluates load value prediction, which was first described in [27], and subsequently and independently confirmed in [31] and [32]. Chapter 4 proposes generalized value prediction, which largely subsumes load value prediction by value-predicting all general-purpose and floating-point register writes. This approach was first described in [28], and later independently confirmed in [32]. Chapter 5 introduces dependence prediction, as well as a new twist on value prediction that decouples instruction execution from dependence checking by predicting source operands rather than destination operands [33,34]. Chapter 6 introduces the Superflow paradigm, which combines these speculative techniques with novel forms of branch prediction, instruction fetching, and memory latency and bandwidth reduction to deliver performance close to 10 IPC for integer programs [35]. Finally, Chapter 7 provides a summary of thesis contributions as well as commentary on future directions and opportunities.


CHAPTER 2

Machine Models and Workloads

This chapter presents an overview of the machine models and workloads that are used throughout this thesis.

2.1 Machine Models

In order to validate and quantify the performance impact of value locality and the speculative techniques based on it, we implemented three realistic, cycle-accurate simulation models, two of them based on the PowerPC 620 [46, 15], an aggressively out-of-order current-generation microprocessor, and one based on the DEC Alpha AXP 21164 [47], a current-generation in-order design with a very high clock frequency. We chose to use two different instruction set architectures in order to alleviate our concern that the value locality behavior we observed is perhaps only an artifact of certain idioms in the instruction set, compiler, run-time environment, and/or operating system we were running on, rather than a more universal attribute of general-purpose programs. We chose the PowerPC 620 and the AXP 21164 since they represent two extremes of the microarchitectural spectrum, from complex “brainiac” CPUs that aggressively and dynamically reorder instructions to achieve a high IPC metric, to the clean, straightforward, and deeply-pipelined “speed demon” CPUs that rely primarily on clock rate for high performance [48]. In addition to these realistic models, we examined a number of less realistic PowerPC-based simulation models in order to explore the limits of performance obtainable with these new techniques. These models are described in the following sections.

2.1.1 PowerPC 620

The microarchitecture of the PowerPC 620 is summarized in Figure 2-1. The PowerPC 620 is a four-wide superscalar processor with relatively short execution pipelines, six functional units, and support for out-of-order execution via distributed reservation stations [8]. Our model is based on published reports on the PowerPC 620 [46, 15], and accurately models all aspects of the microarchitecture, including branch prediction, fetching, dispatching, register renaming, out-of-order issue and execution, result forwarding, the non-blocking cache hierarchy, store-to-load alias detection, and in-order completion. In addition, we added a Load Value Prediction Unit (see Chapter 3) or Value Prediction Unit (see Chapter 4) that predicts register writes by keeping a value history indexed by instruction addresses. Key machine model parameters are summarized in Table 2-1.

Table 2-1. PowerPC Machine Model Specifications. Number of functional units (FU) and reservation stations (RS) as well as issue and result latencies for each instruction class are shown.

                     # FU/RS (620)  # FU/RS (620+)  Issue Lat  Result Lat  Infinite I&R Lat
Instruction Class
Simple Int           2/4            2/8             1          1           1,1
Complex Int          1/2            1/4             1-35       1-35        1,1
Load/Store           1/3            2/6             1          2           1,1
--L1 miss penalty    8 cycles       8 cycles                               0 (perfect)
Simple FP            1/2            1/4             1          3           1,1
Complex FP           shared         shared          18         18          1,1
Br (pr/mispr)        1/4            1/8             1          0/1         1,0/1

2.1.2 PowerPC 620+

To alleviate some of the bottlenecks we found in the 620 design, we also modeled an aggressive “next-generation” version of the 620, which we termed the 620+. The 620+ differs from the 620 by doubling the number of reservation stations, FPR and GPR rename buffers, and completion buffer entries; adding an additional load/store unit (LSU) without an additional cache port (the base 620 already has a dual-banked data cache); and relaxing dispatching requirements to allow up to two loads or stores to dispatch and issue per cycle. In all other respects, our PowerPC 620+ model is identical to the PowerPC 620 model described in the previous section. Key machine model parameters are summarized in Table 2-1.

2.1.3 Infinite PowerPC Model

In order to examine the limits of performance obtainable with the speculative techniques presented in this thesis, we also constructed an idealized infinite simulation model. It implements the following assumptions:
• Perfect caches
• Perfect alias detection and store-to-load forwarding


• Perfect instruction fetching (limited to one taken branch per cycle).
• Unit latency for mispredicted branches with no fetch bubble (instructions following a mispredicted branch are able to execute in the cycle following resolution of the mispredicted branch).

It is our intent that the infinite model match the SP machine model presented in [4], except for the branch prediction mechanism, which is a 2048-entry BHT with a 2-bit saturating counter per entry, copied from our 620 model. Key machine model parameters are summarized in Table 2-1.

Figure 2-1. PPC 620 and 620+ Block Diagram. Buffer sizes shown as (620/620+). (The diagram shows the fetch/dispatch unit and the LVP or VP unit feeding reservation stations for two SCFX units, an MCFX, an FPU, a BRU, and an LSU--plus a second LSU in the 620+--along with the completion unit (16/32) and the GPR (8/16), FPR (8/16), and CR (16/16) rename buffers.)

2.1.4 Alpha AXP 21164

Our in-order Alpha AXP 21164 processor model is summarized in Figure 2-2. The 21164 is a four-wide superscalar processor with relatively deep execution pipelines and no support for out-of-order execution [8]. Instruction latencies for the 21164 are shown in Table 2-2. Our model differs from the actual AXP 21164 [47] in three ways. First, we do not fully model the MAF (miss address file), which enables nonblocking L1 cache misses on the 21164. Rather, we assume an abstract pipelined memory subsystem which allows any number of nonblocking misses (this is more aggressive than the actual 21164 and will tend to understate the benefits of value prediction-based speculation). Second, in order to allow value prediction-based speculation to occur, we must verify predictions by comparing the predicted value to the actual value computed by the ALU. This comparison requires an extra stage before writeback. The third modification, the addition of the reissue buffer, allows us to buffer instruction dispatch groups that contain value-predicted instructions. With this feature, we are able to redispatch instructions when a misprediction occurs with only a single-cycle penalty.
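The reissue buffer's role can be sketched as a small FIFO of dispatch groups held until their predictions verify: on a correct prediction the oldest group simply retires from the buffer, and on a misprediction the buffered group is redispatched, costing one cycle. The following is a minimal C sketch under those assumptions; the names, sizes, and interfaces are illustrative, not the simulator's actual bookkeeping.

#include <stdint.h>

#define GROUP_SIZE 4    /* dispatch width of the 21164            */
#define RB_DEPTH   8    /* groups buffered awaiting verification  */

struct dispatch_group {
    uint32_t insn[GROUP_SIZE];
    int      has_predicted_load;   /* group must be retained if set */
};

static struct dispatch_group reissue_buf[RB_DEPTH];
static int rb_head, rb_tail;

/* Dispatch: only groups containing value-predicted loads are kept. */
void rb_dispatch(const struct dispatch_group *g)
{
    if (g->has_predicted_load) {
        reissue_buf[rb_tail] = *g;
        rb_tail = (rb_tail + 1) % RB_DEPTH;
    }
}

/* Called when the oldest buffered group's prediction verifies.
   Returns nonzero if the pipeline must squash in-flight work and
   replay the group (a single-cycle penalty in the model). */
int rb_verify(int prediction_correct, struct dispatch_group *replay)
{
    *replay = reissue_buf[rb_head];
    rb_head = (rb_head + 1) % RB_DEPTH;
    return !prediction_correct;
}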


Figure 2-2. Alpha AXP 21164 Block Diagram. (The diagram shows pipeline stages S0 fetch, S1 decode, S2 slot, and S3 dispatch, execution in Int Pipe 0 and Int Pipe 1--with the attached multiplier and divider--and the FP add and FP multiply pipes, a compare stage, the reissue buffer, and writeback in S7 and S8.)

Table 2-2. Alpha AXP 21164 Instruction Latencies.

Instruction Class       Issue   Result
Simple Integer          1       1
Complex Integer         16      16
Load/Store              1       2
Simple FP               1       4
Complex FP              1       36-65
Branch (pred/mispr)     1       0/4

2.1.5 Execution-Driven Idealized PowerPC Model

In order to further examine the limits of performance obtainable with the speculative techniques presented in this thesis, we implemented an execution-driven idealized PowerPC model. We opted to implement yet another model primarily for reasons of execution efficiency (the trace-driven models described in the preceding sections are all implemented within the VMW environment [50], which, though flexible and easy-to-use, imposes severe simulation-time overhead). Using an execution-driven model, as opposed to a trace-driven model, also makes it possible for us to examine the effects of mispredicted-path instructions, though none of the results included in this thesis reflect such effects.

Table 2-3. Idealized PowerPC Model Instruction Latencies.

Instruction Class                   Issue Latency   Result Latency
Integer Arithmetic and Logical      1               1
Integer Multiply                    1               3
Integer Divide                      1               10
Load/Store                          1               2
FP Add/Subtract/Normalize/Negate    1               3
FP Multiply/Multiply-Add            1               3
FP Divide                           1               11
Branch (pred/mispred)               1               0/1

Our idealized model simulation environment is built around the PSIM PowerPC functional emulator that is distributed as part of the Free Software Foundation’s GDB debugger [52]. This functional emulator gives us complete access to the processor’s internal register state at each instruction boundary. This is a requirement for us, since we use value-based dynamic prediction mechanisms to improve performance, and we need access to actual values, not just addresses, to accurately simulate these mechanisms.

The idealized machine model has a canonical four-stage pipeline: fetch, dispatch, execute, and complete. The width of the fetch, dispatch, and completion stages and the total instruction window size can be varied arbitrarily, while the execute stage has unlimited width. The latency of the dispatch stage can also be varied arbitrarily, while the latency of the execute stage is instruction-dependent and is summarized in Table 2-3. All functional units are fully pipelined, all architected registers are dynamically renamed, and instructions are allowed to execute out-of-order subject to register and memory data dependences and the total instruction window size. If a load instruction is aliased to an earlier store, a forwarding mechanism exists that delays the load until the store’s data becomes available, and then forwards that value directly to the load (in effect, the processor dynamically renames memory).

Fetch and Branch Prediction

Our idealized model supports various forms of fetch and branch prediction. For the results presented in Chapter 5, the following description is accurate. Several variations of advanced forms of instruction fetching and branch prediction are explored in Section 6.3 on page 98 and are described in detail there. Our model uses a very aggressive gshare branch predictor [51] with a 256K-entry branch history table (BHT) with 2 bits per entry that is indexed by the exclusive-or of an 18-bit global conditional branch history register and the branch instruction address (the term gshare is commonly used to describe this type of table-indexing). The branch target buffer (BTB) is direct-mapped, is not tagged, and has 1024 (for simulations in Chapter 5) or 8192 (for simulations in Chapter 6) entries, while the return address stack (RAS) also has 1024 entries (this size was chosen to be large enough to ensure that the stack would practically never overflow). Fetch and branch prediction both occur during the fetch stage of the pipeline, while instructions are fetched from a dual-banked instruction cache with line size equal to the specified fetch width (this configuration is described as interleaved sequential in [53]). Up to three conditional branches can be predicted per cycle, with a gshare extension to the multiple branch prediction scheme described in [54]. Since gshare requires branch address information to perform branch history table lookups, we store the branch addresses for each fetch group in the BTB, along with the fetch address. In addition, there is a BTB-like table that is used to store branch address information for subroutine return fetch groups. This table is accessed whenever a return address is pushed on the RAS, and the branch addresses are pushed on the RAS as well. As fetch addresses are retrieved from the BTB or popped from the RAS, the branch addresses within the return fetch group are immediately available for our gshare predictor.
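To make the indexing scheme concrete, the following minimal C sketch computes the gshare BHT index and performs the usual 2-bit saturating counter lookup and update for a single conditional branch, using the 256K-entry table and 18-bit global history described above. The names are illustrative; the simulator's own code is not shown in this thesis.

#include <stdint.h>

#define BHT_BITS    18                  /* 2^18 = 256K entries          */
#define BHT_ENTRIES (1 << BHT_BITS)

static uint8_t  bht[BHT_ENTRIES];       /* 2-bit saturating counters    */
static uint32_t ghr;                    /* 18-bit global branch history */

static uint32_t gshare_index(uint64_t branch_pc)
{
    /* XOR the global history with the (word-aligned) branch address,
       keeping the low 18 bits. */
    return (uint32_t)((branch_pc >> 2) ^ ghr) & (BHT_ENTRIES - 1);
}

int gshare_predict(uint64_t branch_pc)
{
    return bht[gshare_index(branch_pc)] >= 2;   /* taken if counter is 2 or 3 */
}

void gshare_update(uint64_t branch_pc, int taken)
{
    uint8_t *ctr = &bht[gshare_index(branch_pc)];

    if (taken  && *ctr < 3) (*ctr)++;
    if (!taken && *ctr > 0) (*ctr)--;
    ghr = ((ghr << 1) | (taken ? 1u : 0u)) & (BHT_ENTRIES - 1);
}

Note that lookup and update must use the same index; a real implementation records the index (or the history register contents) at prediction time, since the global history shifts as subsequent branches are predicted.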

Some of the key implementation decisions for very aggressive superscalar processors are discussed briefly in the following sections. Various configurations were used in the experiments described in subsequent chapters.

In-order vs. Out-of-order Branching

Resolving predicted conditional branches as quickly as possible is a key ingredient in alleviating the performance penalty of branch mispredictions. Clearly, branches will resolve earlier if they are allowed to execute out of order. However, the additional complexity involved in supporting out-of-order branch resolution can be considerable. Recovering from a branch misprediction involves cancelling all instructions subsequent to the mispredicted branch, recovering rename register mappings back to the state following that instruction, and restarting instruction fetch and dispatch from the branch destination (target or fall-through path). If branches execute in-order, the hardware need only access the structures that track all this information at a single location (i.e. at the oldest branch). If branches execute out-of-order, however, random access is needed into such structures. Hence, current-generation processors (e.g. the PowerPC 620) only allow branches to execute in order. To explore the importance of this feature, we collected results for both in-order branching (Chapter 5) as well as out-of-order branching (Chapter 6), and found that, for very wide- and deeply-pipelined superscalar processors, out-of-order branching is increasingly important.

In-order vs. Out-of-order Execution of Loads and Stores

Allowing loads and stores to execute out of order with respect to each other also increases performance by allowing loads to execute earlier. However, it also makes it possible for store-to-load aliases to be violated. This can happen when a load issues before an older store that is writing to the same memory location, and accesses the cache before the store has a chance to write its value to that location. Several schemes exist for avoiding these violations. The simplest is to force loads to issue in order with respect to stores. Another scheme forces loads to wait until the addresses of all preceding stores are known, hence enabling an exact check against these store addresses (usually via an associative search). Finally, loads can be allowed to speculatively issue ahead of older stores, as long as an exact mechanism exists for checking for and recovering from violations. The latter is the approach employed by the PowerPC 620 and Hewlett-Packard PA-8000 (see Section 2.2.1). To explore the importance of this issue, we collected results for two of the schemes described here. In Chapter 5, we force loads to wait until the addresses of previous stores are known, and assume a forwarding mechanism between aliased store/load pairs. In Chapter 6, we allow loads to issue out of order, and assume a perfect recovery mechanism whenever an alias is detected (however, recovery is delayed until both the load address and the store address are known and the store’s data is available).
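As a concrete illustration of the second scheme--holding a load until all older store addresses are known, then checking them exactly--the following C sketch returns whether a load may issue and, if an aliased older store is found, forwards that store's data. The store queue, its maintenance on dispatch and commit, and all names are illustrative assumptions, not the simulator's actual structures.

#include <stdint.h>

#define SQ_SIZE 32

struct store_entry {
    int      addr_valid;   /* address computed yet?     */
    int      data_valid;   /* store data available yet? */
    uint64_t addr;
    uint64_t data;
};

/* Older stores in program order: index 0 is the oldest. */
static struct store_entry store_queue[SQ_SIZE];
static int num_older_stores;

/* Returns 1 if the load may issue; *forwarded is set if an aliased
   older store can forward its data directly to the load. */
int load_may_issue(uint64_t load_addr, uint64_t *fwd, int *forwarded)
{
    int i;

    *forwarded = 0;
    for (i = 0; i < num_older_stores; i++)
        if (!store_queue[i].addr_valid)
            return 0;       /* an older store address is unknown: wait */

    /* Exact check; the youngest older store to the address wins. */
    for (i = num_older_stores - 1; i >= 0; i--) {
        if (store_queue[i].addr == load_addr) {
            if (!store_queue[i].data_valid)
                return 0;   /* alias found, but data not ready yet */
            *fwd = store_queue[i].data;
            *forwarded = 1;
            return 1;       /* forward store data; skip the cache */
        }
    }
    return 1;               /* no alias: access the cache normally */
}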

Fetch, Dispatch, Issue, and Completion Bandwidth

Fetch, dispatch, issue, and completion bandwidth are four separate quantities that are frequently assumed to be identical within a processor implementation. There is no fundamental reason for this, and it is in fact not true for several current microprocessors. Fetch bandwidth describes the maximum number of instructions that can be fetched in a single cycle. Likewise, dispatch bandwidth describes how many instructions can enter the active instruction window in a single cycle. Issue bandwidth describes how many instructions can begin execution in each cycle. Finally, completion bandwidth describes how many instructions can write back their results into the architected register file in each cycle. We vary these parameters in several ways. In Chapter 5, we simulate fetch/dispatch widths of 4, 8, and 16, while assuming unlimited issue and completion bandwidth. In Chapter 6, we simulate fetch/dispatch widths of 4, 8, 16, and 32, while assuming unlimited issue bandwidth and both unlimited and 16-wide completion bandwidth.

Instruction Window Size

Current-generation microprocessors that support out-of-order execution have instruction window sizes between 16 and 56. DEC’s next-generation processor (AXP 21264) pushes the envelope to 64 instructions. Increasing the size of the instruction window exposes more instruction-level parallelism by bringing more potentially independent instructions into the processor, but creates severe headaches for logic designers (see [55] for an excellent discussion of this topic). We explore the effect of instruction window size by limiting it to 128 in some of the experiments presented in Chapter 6. Additional results for instruction window sizes 64 and 256 are presented in a separate technical report [33].

Dispatch Latency

Detecting control and data dependences among multiple instructions within a dispatch group is an inherently sequential task that becomes very expensive combinatorially as the size of the dispatch group increases. Olukotun et al. argue convincingly against wide-dispatch superscalars because of this very fact [30]. Wide (i.e. greater than four) dispatch is difficult to implement and has an adverse impact on cycle time because all instructions in a dispatch group must be simultaneously cross-checked. Even current microprocessor implementations with dispatch windows of four or less (e.g. Alpha AXP 21164 and Pentium Pro) require multiple instruction decode and dependence-checking pipeline stages. One obvious solution to the problem of the complexity of dependence detection is to pipeline it into two or more stages to minimize impact on cycle time. In Section 5.4 on page 80 we propose a pipelined approach to dependence detection that facilitates the implementation of wide-dispatch microarchitectures. We simulate pipelined dispatch latencies of one, two, and three cycles in Chapter 5 and Chapter 6.
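The quadratic cost of intra-group dependence detection is easy to see in a sketch: every instruction in the group must compare its source registers against the destinations of all earlier instructions in the same group. The C fragment below illustrates that cross-check under the simplifying assumption of two source registers and one destination register per instruction; it is not the simulator's actual dispatch logic.

struct insn {
    int src[2];   /* source register numbers           */
    int dest;     /* destination register, -1 if none  */
};

/* For each source operand, record the index of the youngest earlier
   instruction in the group that writes it (-1 if none). For a group
   of size n this performs O(n^2) comparisons, all within a single
   cycle if dispatch is not pipelined. */
void crosscheck_group(const struct insn *group, int n, int depends[][2])
{
    for (int i = 0; i < n; i++) {
        for (int s = 0; s < 2; s++) {
            depends[i][s] = -1;
            for (int j = i - 1; j >= 0; j--) {
                if (group[j].dest >= 0 &&
                    group[j].dest == group[i].src[s]) {
                    depends[i][s] = j;   /* youngest earlier writer wins */
                    break;
                }
            }
        }
    }
}

Pipelining this check over two or three stages, as simulated in Chapters 5 and 6, trades dispatch latency for cycle time.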


Memory Bandwidth and Latency

Memory latency is a severe bottleneck to processor performance, and is caused by three factors: address-generation interlocks, which delay the initiation of a fetch from memory because the address is unknown; the latency of accessing the storage device itself (on- or off-chip cache or main memory); and queueing delays caused by contention for shared resources in the memory subsystem (i.e. limited memory bandwidth). We model all of these effects in Section 6.5 on page 110 by imposing realistic latency and cache port bandwidth constraints.

2.2 Misprediction Recovery Mechanisms

As briefly outlined in Section 1.3.1 on page 10, adding value-speculative techniques to a processor microarchitecture by necessity requires a mechanism for recovering from incorrect value predictions. This is probably the most significant incremental change to the control logic complexity of existing superscalar designs. Once such a mechanism exists, all of the value-speculative techniques outlined in this thesis are reduced primarily to table lookups and comparators. Naturally, there are several mechanisms of varying complexity that can be used to implement misprediction recovery. Conceptually, they are all similar to existing mechanisms for recovering from branch mispredictions. In fact, the simplest mechanism described below (Instruction Refetch) is virtually identical to branch misprediction recovery, and hence would add minimal complexity to an existing design. Two other options (Instruction Reissue and Selective Instruction Reissue) of increasing complexity (and correspondingly decreasing recovery cost) are also described below. Misprediction recovery mechanisms should always be evaluated within the Pipeline Contraction Framework (see Section 1.2.3 on page 9) to assess the importance of recovery cost as a function of misprediction frequency.

2.2.1 Instruction Refetch

Instruction refetch is a crude but easy-to-implement mechanism for recovering from value mispredictions, and simply involves squashing all instructions that follow the misprediction, resetting register rename mappings, and restarting instruction fetch from the instruction following the mispredicted instruction. Such a mechanism is actually already in place in at least two current-generation processors: the PowerPC 620 and the Hewlett-Packard PA-8000. Both of these processors allow loads to execute ahead of potentially aliased stores (see “In-order vs. Out-of-order Execution of Loads and Stores” on page 27), and perform exact alias checking each time a store instruction completes. If a violated alias is detected at this point, an instruction refetch occurs, beginning at the next load instruction. While the recovery cost of this mechanism seems exorbitant (potentially up to dozens of cycles), its seeming importance pales when adjusted by the misprediction frequency, which is extremely low (see Table A-1, “PowerPC 620 Model Miscellaneous Data,” on page 123 for refetch frequencies for the PowerPC 620). Since value mispredictions are relatively much more frequent than violated store-to-load aliases (see Table 3-2, “LCT Hit Rates,” on page 43, and Figure 4-5, “CT Hit Rates,” on page 64), and the high recovery cost would eliminate any performance gains from correct value predictions, we did not explore this recovery mechanism in our simulations.

2.2.2 Instruction Reissue

Instruction reissue is simply a less extreme version of instruction refetch. Rather than discarding all pipeline state and refetching all instructions following a mispredicted one, we add some buffering hardware that simplifies and significantly speeds up the process of reissuing instructions. The reissue mechanism we describe for the Alpha 21164 in Section 3.4.2 on page 48 imposes only a single-cycle recovery penalty for mispredicted instructions. That is, the execution of all subsequent instructions is delayed by a single cycle relative to when they would have executed had there been no value prediction at all.

2.2.3 Selective Instruction Reissue

Finally, the most advanced recovery mechanism involves reissuing only those instructions that are data-dependent on incorrect value predictions. Since the renaming and forwarding logic within an out-of-order processor core already tracks dependences between instructions, the only additional logic required is that used to propagate the speculative state of operands and to commit or squash the predicted operands when they are verified to be right or wrong. The speculative state of an operand is defined as the logical sum (or) of the speculative states of the input operands of the instruction that is generating the operand. Speculative state is either committed to non-speculative status or corrected via global commit and squash broadcast buses. Every cycle, the speculative state of each operand is checked against these buses, and bits matching the commit bus are cleared while bits matching the squash bus trigger instruction reissue.


In addition, to enable speedy reissue of instructions that received incorrect operands, any instructions with speculative inputs must remain in the window of active instructions (within the PowerPC 620, this entails continuing to occupy a reservation station entry). We use the selective instruction reissue mechanism in all of our PowerPC models (PowerPC 620, PowerPC 620+, Infinite PowerPC, and Idealized PowerPC).
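The speculative-state bookkeeping of Section 2.2.3 can be sketched with bit vectors: each in-flight value prediction is assigned one bit position, an operand's speculative state is the OR of its inputs' states, and the commit and squash buses are themselves bit vectors broadcast each cycle. The C fragment below is a minimal sketch under those assumptions (up to 64 outstanding predictions, illustrative names); in hardware this logic lives in the rename and forwarding paths.

#include <stdint.h>

/* One bit per outstanding value prediction. */
typedef uint64_t spec_mask;

/* An operand is speculative with respect to every prediction that any
   of its inputs depends on; 'self' is nonzero if this instruction
   consumed a predicted value directly. */
spec_mask propagate(spec_mask src1, spec_mask src2, spec_mask self)
{
    return src1 | src2 | self;
}

/* Each cycle, the commit bus clears verified-correct bits, while any
   overlap with the squash bus means the operand was derived from a
   wrong prediction and its instruction must reissue. */
int check_buses(spec_mask *state, spec_mask commit_bus, spec_mask squash_bus)
{
    *state &= ~commit_bus;       /* now non-speculative w.r.t. these bits */
    if (*state & squash_bus) {
        *state &= ~squash_bus;   /* state is re-acquired upon reissue     */
        return 1;                /* trigger instruction reissue           */
    }
    return 0;
}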

2.3 Workloads

Value locality and its performance impact are quantified in the context of five workload suites, which are described in the following five sections. We are aware that these benchmarks may not be representative of commercial workloads [50].

2.3.1 SPEC92 Integer Suite

The SPEC92 integer benchmarks are an industry-standard CPU benchmark suite consisting of real unix application programs like compilers and interpreters, file compression and decompression programs, and other utilities. We used five of the programs in the suite, and these are described in Table 2-4. All benchmarks are compiled at full optimization with the manufacturers’ reference compilers (xlc under IBM AIX, and cc under DEC OSF/1). All benchmarks are run to completion with the input sets described, but do not include supervisor-state instructions, which our tracing tools are unable to capture.

Table 2-4. SPEC92 Integer Benchmark Descriptions.

                                                                     Instr. Count
Benchmark  Description                Input Set                      PPC      Alpha
cc1        GCC 1.35 from SPEC92       insn-recog.i from SPEC92       146M     N/A
compress   SPEC92 file compression    1 iter. with 1/2 of SPEC92     38.8M    50.2M
eqntott    SPEC92 Eqn to truth table  Mod. input from SPEC92         25.5M    44.0M
sc         Spreadsheet from SPEC92    Short input from SPEC92        78.5M    107M
xlisp      SPEC92 LISP interpreter    6 queens                       52.1M    60.0M

2.3.2 Miscellaneous Integer Programs

In order to increase coverage and include more modern programs in our simulations, we augment the SPEC92 integer programs described above with a set of miscellaneous unix programs, described in Table 2-5. This set includes two image-processing applications (cjpeg and mpeg), three commonly-used language interpreters (perl, gawk, and grep), GNU’s perfect hash function generator (gperf), a more recent version of GCC (cc1-271), and a recursive quicksort. Once again, all benchmarks are compiled at full optimization with the manufacturers’ reference compilers and are run to completion with the input sets described, but do not include supervisor-state instructions, which our tracing tools are unable to capture.

Table 2-5. Miscellaneous Integer Benchmark Descriptions.

                                                                        Instr. Count
Benchmark  Description              Input Set                           PPC      Alpha
cc1-271    GCC 2.7.1; SPEC95 flags  genoutput.i from SPEC95             102M     117M
cjpeg      JPEG encoder             128x128 BW image                    2.8M     10.7M
gawk       GNU awk; result parser   1.7M simulator output file          25.0M    53.0M
gperf      GNU hash fn generator    gperf -a -k 1-13 -D -o dict         7.8M     10.8M
grep       gnu-grep -c “st*mo”      Same as compress                    2.3M     2.9M
mpeg       Berkeley MPEG decoder    4 frames w/ fast dithering          8.8M     15.1M
perl       SPEC95 Anagram search    find “admits” in 1/8 of input       105M     114M
quick      Quick sort               5,000 random elements               688K     1.1M

2.3.3 SPEC92 Floating Point Suite

In order to evaluate value locality in the context of numerical/scientific programs, we also used four of the SPEC92 floating point benchmarks in many of our simulations. These are summarized in Table 2-6. Once again, all benchmarks are compiled at full optimization with the manufacturers’ reference compilers, and are run to completion with the input sets described, but do not include supervisor-state instructions, which our tracing tools are unable to capture.

Table 2-6. SPEC92 Floating Point Benchmark Descriptions.

                                                                      Instr. Count
Benchmark  Description                   Input Set                    PPC      Alpha
doduc      Nuclear reactor simulator     Tiny input from SPEC92       35.8M    38.5M
hydro2d    Computation of galactic jets  Short input from SPEC92      4.3M     5.3M
swm256     Shallow water model           5 iterations (vs. 1,200)     43.7M    54.8M
tomcatv    Mesh generation program       4 iterations (vs. 100)       30.0M    36.9M

2.3.4 SPEC95 Integer Suite

We also used the SPEC95 integer benchmark suite for our study, since it is easily available, widely used, and well-understood. Table 2-7 summarizes the benchmarks and input sets used, and also shows run length. The benchmarks are compiled for the PowerPC instruction set with GCC version 2.7.2 at full optimization. Our simulation environment captures the behavior of all user-state instructions, including those in NetBSD run-time support libraries, but does not account for supervisor-state execution. All of the benchmarks are run to completion, albeit with reduced input sets and/or fewer iterations than in the SPEC95 reference runs.

Table 2-7. SPEC95 Integer Benchmark set.

Benchmark  Description               Input Set                 Length (PPC)
go         SPEC95 game               2stone9.in, 9x9, lev. 5   79.6M
m88ksim    SPEC95 88K simulator      100 iter. of dhrystone    107.0M
gcc        SPEC95 gcc compiler       genoutput.i               181.8M
compress   SPEC95 data compression   small input (10K)         39.7M
li         SPEC95 lisp emulator      six queens problem        56.8M
ijpeg      SPEC95 jpeg encoder       tinyrose.ppm              92.1M
perl       SPEC95 perl interpreter   train scrabbl.pl          50.1M
vortex     SPEC95 database program   reduced version of train  153.1M

2.3.5 SPEC95 Floating Point Suite

We also used the SPEC95 floating-point benchmark suite for our study, since it is easily available, widely used, and well-understood. Table 2-8 summarizes the benchmarks and input sets used, and also shows run length. The floating-point benchmarks are compiled for the PowerPC instruction set with G77 version 0.5.18, also at full optimization (we only use six out of ten SPECFP95 benchmarks due to time and space constraints). Again, our simulation environment captures the behavior of all user-state instructions, including those in NetBSD runtime support libraries, but does not account for supervisor-state execution. All of the benchmarks are run to completion, albeit with reduced input sets and/or fewer iterations than in the SPEC95 reference runs.

Table 2-8. SPEC95 Floating Point Benchmark set.

Benchmark  Description                      Input Set                    Length (PPC)
applu      SPEC95 PDE Solver                5 iter, 12x12x12             38.5M
apsi       SPEC95 Atmospheric model         10 steps, 128x1x32           159.0M
fpppp      SPEC95 Quantum chemistry model   3 atoms (ref uses 30)        50.0M
mgrid      SPEC95 Multigrid solver          10 iterations (test has 40)  111.0M
swim       SPEC95 Shallow water model       10 iter 128x128              38.8M
tomcatv    SPEC95 Mesh Generation           3 iter w/ array size 43      47.2M


CHAPTER 3

Load Value Prediction

Since the introduction of virtual memory demand-paging and cache memories, computer systems have been exploiting spatial and temporal locality to reduce the average latency of a memory reference. This chapter describes the notion of value locality, a third facet of locality that is frequently present in real-world programs, and describes how to effectively capture and exploit it in order to perform load value prediction. Temporal and spatial locality are attributes of storage locations, and describe the future likelihood of references to those locations or their close neighbors. In a similar vein, value locality describes the likelihood of the recurrence of a previously-seen value within a storage location. Modern processors already exploit value locality in a very restricted sense through the use of control speculation (i.e. branch prediction), which seeks to predict the future value of a single condition bit based on previously-seen values. Our work extends this to predict entire 32- and 64-bit register values based on previously-seen values. We find that, just as condition bits are fairly predictable on a per-static-branch basis, full register values being loaded from memory are frequently predictable as well. Furthermore, we show that simple microarchitectural enhancements to two modern microprocessor implementations (based on the PowerPC 620 and Alpha 21164) that enable load value prediction can effectively exploit value locality to collapse true dependencies, reduce average memory latency and bandwidth requirements, and provide measurable performance gains.

3.1 Introduction and Related Work

The gap between main memory and processor clock speeds is growing at an alarming rate [36]. As a result, computer system performance is increasingly dominated by the latency of servicing memory accesses, particularly those accesses which are not easily predicted by the temporal and spatial locality captured by conventional cache memory organizations [37]. Conventional cache memories rely on a program’s temporal and spatial locality to reduce the average memory access latency. Temporal locality describes the likelihood that a recently-referenced address will be referenced again soon, while spatial locality describes the likelihood that a close neighbor of a recently-referenced address will be referenced soon. Designing the physical attributes (e.g. size, line size, associativity, etc.) of a cache memory to best match the temporal and spatial locality of programs has been an ongoing challenge for researchers and designers alike. Some have proposed adding additional features such as non-blocking fetches [38], victim caches [39], and sophisticated hardware prefetching [22] to alleviate the access penalties for those references that have locality characteristics that are not captured by more conventional designs. Others have proposed altering the behavior of programs to improve their data locality so that it better matches the capabilities of the cache hardware. Such improvements have primarily been limited to scientific code with predictable control flow and regular memory access patterns, due to the ease with which rudimentary loop transformations can dramatically improve temporal and spatial locality [40,41]. Explicit prefetching in advance of memory references with poor or no locality has also been examined extensively in this context, both with [21,22] and without additional hardware support [23,24]. Dynamic hardware techniques for controlling cache memory allocation that significantly reduce memory bandwidth requirements have also been proposed [42]. In addition, alternative pipeline configurations that reduce average memory access latency via early execution of loads have been examined [9,11].

The most relevant prior work related to ours is the Tree Machine [16,17], which uses a value cache to store and look up the results of recurring arithmetic expressions to eliminate redundant computation (the value cache, in effect, performs common subexpression elimination [1] in hardware). Richardson follows up on this concept in [18] by introducing the concepts of trivial computation, which is defined as the trivialization of potentially-complex operations by the occurrence of simple operands; and redundant computation, where an operation repeatedly performs the same computation because it sees the same operands. Richardson proposes a hardware mechanism (the result cache) which reduces the latency of such trivial or redundant complex arithmetic operations by storing and looking up their results in the result cache.

In this chapter, we further examine value locality, a concept related to redundant computation, and demonstrate a technique--Load Value Prediction, or LVP--for predicting the results of load instructions at dispatch by exploiting the affinity between load instruction addresses and the values the loads produce. LVP differs from Harbison’s value cache and Richardson’s result cache in two important ways: first, the LVP table is indexed by instruction address, and hence value lookups can occur very early in the pipeline; second, it is speculative in nature, and relies on a verification mechanism to guarantee correctness. In contrast, both Harbison and Richardson use table indices that are only available later in the pipeline (Harbison uses data addresses, while Richardson uses actual operand values); and require their predictions to be correct, hence requiring mechanisms for keeping their tables coherent with all other computation.

3.2 Value Locality

In this chapter, we further explore the concept of value locality, which we define as the likelihood of a previously-seen value recurring repeatedly within a storage location. Although the concept is general and can be applied to any storage location within a computer system, this chapter examines only the value locality of general-purpose or floating-point registers immediately following memory loads that target those registers. A plethora of previous work on dynamic branch prediction (e.g. [19,6]) has focused on an even more restricted application of value locality, namely the prediction of a single condition bit based on its past behavior. This chapter can be viewed as a logical continuation of that body of work, extending the prediction of a single bit to the prediction of an entire 32- or 64-bit register.

Intuitively, it seems that it would be a very difficult task to discover any useful amount of value locality in a register. After all, a 32-bit register can contain any one of over four billion values--how could one possibly predict which of those is even somewhat likely to occur next? As it turns out, if we narrow the scope of our prediction mechanism by considering each static load individually, the task becomes much easier, and we are able to accurately predict a significant fraction of register values being loaded from memory.

What is it that makes these values predictable? After examining a number of real-world programs, we assert that value locality exists primarily for the same reason that partial evaluation [29] is such an effective compile-time optimization; namely, that real-world programs, run-time environments, and operating systems incur severe performance penalties because they are general by design. That is, they are implemented to handle not only contingencies, exceptional conditions, and erroneous inputs, all of which occur relatively rarely in real life, but they are also often designed with future expansion and code reuse in mind. Even code that is aggressively optimized by modern, state-of-the-art compilers exhibits these tendencies. We have made the following empirical observations about the programs we examined for this study, and feel that they are helpful in understanding why value locality exists:

• Data redundancy: Frequently, the input sets for real-world programs contain data that has little variation. Examples of this are sparse matrices, text files with white space, and empty cells in spreadsheets.
• Error-checking: Checks for infrequently-occurring conditions often compile into loads of what are effectively run-time constants.
• Program constants: It is often more efficient to generate code to load program constants from memory than code to construct them with immediate operands.
• Computed branches: To compute a branch destination, e.g. for a switch statement, the compiler must generate code to load a register with the base address for the branch, which is a run-time constant.
• Virtual function calls: To call a virtual function, the compiler must generate code to load a function pointer, which is a run-time constant.
• Glue code: Due to addressability concerns and linkage conventions, the compiler must often generate glue code for calling from one compilation unit to another. This code frequently contains loads of instruction and data addresses that remain constant throughout the execution of a program.
• Addressability: To gain addressability to non-automatic storage, the compiler must load pointers from a table that is not initialized until the program is loaded, and thereafter remains constant.
• Call-subgraph identities: Functions or procedures tend to be called by a fixed, often small, set of functions, and likewise tend to call a fixed, often small, set of functions. As a result, loads that restore the link register as well as other callee-saved registers can have high value locality.
• Memory alias resolution: The compiler must be conservative about stores aliasing loads, and will frequently generate what appear to be redundant loads to resolve those aliases.
• Register spill code: When a compiler runs out of registers, variables that may remain constant are spilled to memory and reloaded repeatedly.

Naturally, many of the above are subject to the particulars of the instruction set, compiler, and run-time environment being employed, and it could be argued that some of them could be eliminated with changes in the ISA, compiler, or run-time environment, or by applying link-time or run-time code optimizations (e.g. [43, 44]). However, such changes and improvements have been slow to appear; the aggregate effect of the above (and other) factors on value locality is measurable and significant today on the two modern RISC instruction sets that we examined, both of which provide state-of-the-art compilers and run-time systems. It is worth pointing out, however, that the value locality of particular static loads in a program can be significantly affected by compiler optimizations such as loop unrolling, loop peeling, tail replication, etc., since these types of transformations tend to create multiple instances of a load that may now exclusively target memory locations with high or low value locality. A similar effect on load latencies (i.e. per-static-load cache miss rates) has been reported by Abraham et al. in [45].

The benchmark set used to explore value locality and quantify its performance impact is the SPEC92 integer suite (described in Section 2.3.1 on page 31), miscellaneous integer programs (described in Section 2.3.2 on page 31), and the SPEC92 floating-point suite (described in Section 2.3.3 on page 32). Figure 3-1 shows the value locality for load instructions in each of the benchmarks. The value locality for each benchmark is measured by counting the number of times each static load instruction retrieves a value from memory that matches a previously-seen value for that static load and dividing by the total number of dynamic loads in the benchmark. Two sets of numbers are shown, one (light bars) for a history depth of one (i.e. we check for matches against only the most-recently-retrieved value), while the second set (dark bars) has a history depth of sixteen (i.e. we check against the last sixteen unique values)1. We see that even with a history depth of one, most of the integer programs exhibit load value locality in the 50% range, while extending the history depth to sixteen (along with a hypothetical perfect mechanism for choosing the right one of the sixteen values) can improve that to better than 80%. What this means is that the vast majority of static loads exhibit very little variation in the values that they load during the course of a program’s execution. Unfortunately, three of our benchmarks (cjpeg, swm256, and tomcatv) demonstrate poor load value locality.

1. The history values are stored in a direct-mapped table with 1K entries indexed but not tagged by instruction address, and the values (one or sixteen) stored at each entry are replaced with an LRU policy. Hence, both constructive and destructive interference can occur between instructions that map to the same entry.

Figure 3-1. Load Value Locality. The light bars show value locality for a history depth of one, while the dark bars show it for a history depth of sixteen. (Two bar charts, one for the Alpha AXP and one for the PowerPC, plot per-benchmark load value locality.)
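The measurement just described is straightforward to express in code. The following C sketch implements the 1K-entry, untagged, direct-mapped history table with LRU replacement from the footnote; with DEPTH set to one it produces the light-bar measurement, and with sixteen (plus the hypothetical perfect selector) the dark-bar upper bound. The names and the omitted trace-processing harness are illustrative assumptions.

#include <stdint.h>
#include <string.h>

#define ENTRIES 1024
#define DEPTH   16   /* set to 1 for the light-bar measurement */

static uint64_t history[ENTRIES][DEPTH]; /* MRU value first            */
static uint64_t matches, loads;          /* value locality = matches/loads */

void observe_load(uint64_t pc, uint64_t value)
{
    /* Untagged, direct-mapped: aliasing loads share an entry, so both
       constructive and destructive interference can occur. (Entries
       start at zero; a real harness would also keep valid bits.) */
    uint64_t *vals = history[(pc >> 2) & (ENTRIES - 1)];
    int i, hit = -1;

    for (i = 0; i < DEPTH; i++)
        if (vals[i] == value) { hit = i; break; }
    if (hit >= 0)
        matches++;              /* a match at any depth counts */
    loads++;

    /* LRU update: move (or insert) the value to the MRU position. */
    if (hit < 0) hit = DEPTH - 1;
    memmove(&vals[1], &vals[0], hit * sizeof(vals[0]));
    vals[0] = value;
}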

To further explore the notion of value locality, we collected data that classifies loads based on the type of data being loaded: floating-point data, non-floating-point data, instruction addresses, and data addresses (pointers). These results are summarized in Figure 3-2 (the results shown are for the PowerPC architecture only). Once again, two sets of numbers are shown for each benchmark, one for a history depth of one (light bars), and the other for a depth of sixteen (dark bars). We omit results for categories in which an insignificant number of loads occurred for a given benchmark. In general, we see that address loads tend to have better locality than data loads, with instruction addresses holding a slight edge over data addresses, and integer data loads holding an edge over floating-point loads.

Figure 3-2. PowerPC Value Locality by Data Type. The light bars show value locality for a history depth of one, while the dark bars show it for a history depth of sixteen. (The figure contains four panels: integer data, FP data, data addresses, and instruction addresses.)

Table 3-1. LVP Unit Configurations. For history depth greater than one, a hypothetical perfect selection mechanism is assumed.

LVP Unit         LVPT                      LCT                     CVU
Configuration    Entries  History Depth    Entries  Bits/Entry     Bits/Entry
Simple           1024     1                256      2              32
Constant         1024     1                256      1              128
Limit            4096     16/Perf          1024     2              128
Perfect          ∞        ∞/Perf           ∞        Perfect        0

3.3 Exploiting Value Locality

The fact that memory loads in many programs demonstrate a significant degree of value locality opens up exciting new possibilities for the microarchitect. In this chapter, we describe and evaluate the Load Value Prediction Unit, a hardware mechanism which addresses both the memory latency and memory bandwidth problems in a novel fashion. First, by exploiting the affinity between load instruction addresses and the values being loaded, we are able to reduce load latency by two or more cycles. Second, we can reduce memory bandwidth requirements by identifying highly-predictable loads and completely bypassing the conventional memory hierarchy for these loads. The LVP Unit consists of a load value prediction table or LVPT (Section 3.3.1) for generating value predictions, a load classification table or LCT (Section 3.3.2 and Section 3.3.3) for deciding which predictions are likely to be correct, and a constant verification unit or CVU (Section 3.3.3) that eliminates the need to access the conventional memory hierarchy when verifying highly-predictable loads.


3.3.1 Load Value Prediction Table

The LVPT is used to predict the value being loaded from memory by associating the load instruction with the value previously loaded by that instruction. The LVPT is indexed by the load instruction address and is not tagged, so both constructive and destructive interference can occur between loads that map to the same entry (the LVPT is direct-mapped). Table 3-1 shows the number of entries (column 2) as well as the history depth per entry (column 3) for the four LVPT configurations used in our study. Configurations with a history depth greater than one assume a hypothetical perfect mechanism for selecting the correct value to predict, and are included to explore the limits of history-based load value prediction.

3.3.2 Dynamic Load Classification

Load value prediction is useful only if it can be done accurately, since incorrect predictions can lead to increased structural hazards and longer load latency (the misprediction penalty is discussed further in Section 3.4). In our experimental framework we classify static loads into three categories based on their dynamic behavior: loads whose values are unpredictable with the LVPT, those that are predictable, and those that are almost always predictable. By classifying these separately we are able to take full advantage of each case. We can avoid the cost of a misprediction by identifying the unpredictable loads, and we can avoid the cost of a memory access if we can identify and verify loads that are highly-predictable.

In order to determine the predictability of a static load instruction, it is associated with a set of history bits. Based on whether or not previous predictions for a given load instruction were correct, we are able to classify the loads into three general groups: unpredictable, predictable, and constant loads. The load classification table or LCT consists of a direct-mapped table of n-bit saturating counters indexed by the low-order bits of the instruction address. Table 3-1 shows the number of entries (column 4) as well as the size of each saturating counter (column 5) for the LCT configurations used in our study. The 2-bit saturating counter assigns the four available states 0-3 as “don’t predict”, “don’t predict”, “predict” and “constant,” while the 1-bit counter assigns the two states as “don’t predict” and “constant.” The counter is incremented when the predicted value is correct and decremented otherwise.
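To make the interaction between the two tables concrete, the following C sketch (using the Simple configuration sizes from Table 3-1) shows the prediction-time lookup and the training performed once the actual value is known. The names are illustrative, and the CVU and pipeline control are omitted.

#include <stdint.h>

#define LVPT_ENTRIES 1024
#define LCT_ENTRIES  256

static uint64_t lvpt[LVPT_ENTRIES];  /* last value loaded, per static load */
static uint8_t  lct[LCT_ENTRIES];    /* 2-bit counters: 0,1 = don't predict,
                                        2 = predict, 3 = constant */

/* At fetch/dispatch: should this load be predicted, and with what value? */
int lvp_predict(uint64_t pc, uint64_t *value, int *constant)
{
    uint8_t state = lct[(pc >> 2) & (LCT_ENTRIES - 1)];

    *value    = lvpt[(pc >> 2) & (LVPT_ENTRIES - 1)];
    *constant = (state == 3);
    return state >= 2;               /* predict or constant */
}

/* After the load completes: train both tables with the actual value. */
void lvp_update(uint64_t pc, uint64_t actual)
{
    uint64_t *slot = &lvpt[(pc >> 2) & (LVPT_ENTRIES - 1)];
    uint8_t  *ctr  = &lct[(pc >> 2) & (LCT_ENTRIES - 1)];

    if (*slot == actual) {           /* prediction (would have been) correct */
        if (*ctr < 3) (*ctr)++;      /* saturate at "constant"               */
    } else {
        if (*ctr > 0) (*ctr)--;
        *slot = actual;              /* history depth of one: replace value  */
    }
}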

In Table 3-2, we show the percentage of all unpredictable loads the LCT is able to classify as unpredictable (columns 2, 4, 6, and 8) and the percentage of predictable loads the LCT is able to correctly classify as predictable (columns 3, 5, 7, and 9) for the Simple and Limit configurations.

Table 3-2. LCT Hit Rates. Percentages shown are fractions of unpredictable and predictable loads identified as such by the LCT.

                     PowerPC                    Alpha AXP
               Simple        Limit        Simple        Limit
Benchmark    Unpr   Pred   Unpr   Pred   Unpr   Pred   Unpr   Pred
cc1-271      86%    64%    58%    90%    86%    57%    64%    86%
cjpeg        97%    61%    92%    61%    93%    75%    93%    82%
compress     99%    94%    97%    90%    98%    56%    97%    94%
doduc        83%    75%    82%    92%    84%    68%    78%    92%
eqntott      91%    85%    88%    99%    68%    80%    83%    97%
gawk         85%    92%    44%    95%    74%    86%    59%    93%
gperf        93%    75%    76%    97%    77%    79%    77%    91%
grep         93%    88%    67%    81%    85%    82%    92%    92%
hydro2d      82%    85%    63%    91%    86%    80%    60%    89%
mpeg         86%    90%    78%    93%    84%    88%    85%    93%
perl         84%    71%    65%    93%    83%    66%    74%    93%
quick        98%    84%    93%    89%    98%    95%    96%    95%
sc           77%    90%    59%    97%    86%    85%    78%    95%
swm256       99%    89%    99%    93%    99%    86%    99%    90%
tomcatv      100%   89%    100%   98%    99%    68%    99%    70%
xlisp        88%    83%    77%    93%    90%    74%    76%    93%
GM           90%    81%    75%    90%    86%    78%    81%    90%

3.3.3 Constant Verification Unit

Although the LCT mechanism can accurately identify loads that retrieve predictable values, we still have to verify the correctness of the LVPT’s predictions. For predictable loads, we simply retrieve the value from the conventional memory hierarchy and compare the predicted value to the actual value (see Figure 3-3). However, for highly-predictable or constant loads, we use the constant verification unit, or CVU, which allows us to avoid accessing the conventional memory system completely by forcing the LVPT entries that correspond to constant loads to remain coherent with main memory.

For the LVPT entries that are classified as constants by the LCT, the data address and the index of the LVPT are placed in a separate, fully-associative table inside the CVU. This table is kept coherent with main memory by invalidating any entries whose data address matches a subsequent store instruction. Meanwhile, when the constant load executes, its data address is concatenated with the LVPT index (the lower bits of the instruction address) and the CVU’s content-addressable memory (CAM) is searched for a matching entry.


If a matching entry exists, we are guaranteed that the value at that LVPT entry is coherent with main memory, since any updates (stores) since the last retrieval would have invalidated the CVU entry. If one does not exist, the constant load is demoted from constant to just predictable status, and the predicted value is now verified by retrieving the actual value from the conventional memory hierarchy. Table 3-3 shows the percentage of all dynamic loads that are successfully identified and treated as constants. This can also be thought of as the percentage decrease in required bandwidth to the L1 data cache. As a second-order effect, we also observe a decrease in the L2 cache bandwidth for our 21164 machine model (see Section 3.6.1). Although we were disappointed that we are unable to obtain a more significant reduction, we are pleased to note that load value prediction, unlike other speculative techniques like prefetching and branch prediction, reduces, rather than increases, memory bandwidth requirements.

Table 3-3. Successful Constant Identification Rates. Percentages shown are ratio of constant loads to all dynamic loads.

                 PowerPC           Alpha AXP
Benchmark    Simple   Limit     Simple   Limit
cc1-271      13%      23%       10%      14%
cjpeg        4%       7%        17%      17%
compress     33%      34%       36%      42%
doduc        5%       20%       5%       15%
eqntott      19%      44%       21%      35%
gawk         10%      28%       31%      31%
gperf        21%      39%       38%      56%
grep         16%      24%       18%      22%
hydro2d      2%       8%        3%       10%
mpeg         12%      25%       10%      28%
perl         8%       19%       7%       8%
quick        0%       0%        31%      31%
sc           32%      46%       26%      31%
swm256       8%       17%       12%      12%
tomcatv      0%       0%        1%       1%
xlisp        14%      45%       8%       30%
GM           6%       11%       12%      18%
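The CVU's behavior on loads and stores can be sketched as follows, in C, with an illustrative entry count and an (address, index) pair-match standing in for the concatenated CAM key; insertion of entries when the LCT promotes a load to constant status is omitted for brevity.

#include <stdint.h>

#define CVU_ENTRIES 32

struct cvu_entry {
    int      valid;
    uint64_t data_addr;    /* address the constant load reads   */
    uint32_t lvpt_index;   /* which LVPT entry it guards        */
};

static struct cvu_entry cvu[CVU_ENTRIES];

/* Can this constant load be verified without a cache access?
   (Models the fully-associative CAM search.) */
int cvu_lookup(uint64_t data_addr, uint32_t lvpt_index)
{
    for (int i = 0; i < CVU_ENTRIES; i++)
        if (cvu[i].valid && cvu[i].data_addr == data_addr &&
            cvu[i].lvpt_index == lvpt_index)
            return 1;      /* LVPT entry is coherent with memory      */
    return 0;              /* demote the load to merely predictable   */
}

/* Every store searches all entries and invalidates matching
   addresses, keeping the CVU coherent with main memory. */
void cvu_store(uint64_t data_addr)
{
    for (int i = 0; i < CVU_ENTRIES; i++)
        if (cvu[i].valid && cvu[i].data_addr == data_addr)
            cvu[i].valid = 0;
}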


Figure 3-3. Block Diagram of the LVP Mechanism. The Load PC is used to index into the LVPT and LCT to find a value to predict and to determine whether or not a prediction should be made. Constant loads that find a match in the CVU needn’t access the cache, while stores cancel all matching CVU entries. When the load completes, the predicted and actual values are compared, the LVPT and LCT are updated, and dependent instructions are reissued if necessary.

3.3.4 The Load Value Prediction Unit

The interactions between the LVPT, LCT, and CVU are described in Figure 3-3 for both loads and stores. When a load instruction is fetched, the low-order bits of the load instruction address are used to index the LVPT and LCT in parallel. The LCT (analogous to a branch history table) determines whether or not a prediction should be made, and the LVPT (analogous to a branch target buffer) forwards the value to the load’s dependent instructions. Once the address is generated, in stage EX1 of the sample pipeline, the cache access and CVU access progress in parallel. When the actual value returns from the L1 data cache, it is compared with the predicted value, and the dependent speculative instructions are consequently either written back or reissued. Since the search on the CVU cannot be performed in time to prevent initiating the memory access, the only time the CVU is able to prevent the memory access is when a bank conflict or cache miss occurs.


In either case, a CVU match will cancel the subsequent retry or cache miss. During the execution of a store, a fully-associative lookup is performed on the store’s address and all matching entries are removed from the CVU.

3.3.5 LVP Unit Implementation Notes

An exhaustive investigation of LVP Unit design parameters and implementation details is beyond the scope of this thesis. However, to demonstrate the validity of the concept, we analyzed sensitivity to a few key parameters, and then selected several design points to use with our microarchitectural studies (Section 3.6). We realize that the designs we have selected are by no means optimal, minimal, or very efficient, and could be improved significantly. For example, we reserve a full 64 bits per value entry in the LVP Table, while most instructions generate only 32 or fewer bits, and space in the table could certainly be shared between such entries with some clever engineering. The intent of this thesis is not to present the details of such a design; rather, our intent is to explore the larger issue of the impact of load value prediction on microarchitecture and instruction-level parallelism, and to leave such details to future work. However, we note that the LVP Unit has several characteristics that make it attractive to a CPU designer. First of all, since the LVPT and LCT lookup index is available very early, at the beginning of the instruction fetch stage, access to these tables can be superpipelined over two or more stages. Hence, given the necessary chip space, even relatively large tables could be built without impacting cycle time. Second, the design adds little or no complexity to critical delay paths in the microarchitecture. Rather, table lookups and verifications are done in parallel with existing activities or are serialized with a separate pipeline stage (value comparison). Finally, we reiterate that the LVP Unit, though speculative in nature, actually reduces memory bandwidth requirements, rather than aggravating them.

3.4 Microarchitectural Models

In order to validate and quantify the performance impact of load value prediction and constant identification, we implemented trace-driven timing models for two significantly different modern microprocessor implementations--the PowerPC 620 [46, 15] and the Alpha AXP 21164 [47]; one aggressively out of order, the other clean and in order. We chose to use two different architectures in order to alleviate our concern that the value locality behavior we observed is perhaps only an artifact of certain idioms in the instruction set, compiler, run-time environment, and/or operating system we were running on, rather than a more universal attribute of general-purpose programs. We chose the PowerPC 620 and the AXP 21164 since they represent two extremes of the microarchitectural spectrum, from complex “brainiac” CPUs that aggressively and dynamically reorder instructions to achieve a high IPC metric, to the clean, straightforward, and deeply-pipelined “speed demon” CPUs that rely primarily on clock rate for high performance [48]. These machine models are described in detail in Section 2.1 on page 21.
system we were running on, rather than a more universal attribute of general-purpose programs. We chose the PowerPC 620 and the AXP 21164 since they represent two extremes of the microarchitectural spectrum, from complex “brainiac” CPUs that aggressively and dynamically reorder instructions to achieve a high IPC metric, to the clean, straightforward, and deeply-pipelined “speed demon” CPUs that rely primarily on clock rate for high performance [48]. These machine models are described in detail in Section 2.1 on page 21. 3.4.1 PowerPC 620/620+ LVP Unit Operation In the PowerPC 620/620+ machine models, the LVP Unit predicts load values during dispatch, then forwards them speculatively to subsequent operations via the 620’s rename busses. Dependent instructions are able to issue and execute immediately, but are prevented from completing architecturally and are forced to retain possession of their reservation stations. Speculatively-forwarded values are tagged with the uncommitted loads they depend on, and these tags are propagated to the results of any subsequent dependent instructions. Meanwhile, uncommitted loads execute in the load/store pipe, and the predicted values are verified by either a CVU address match or a comparison against the actual values retrieved by the loads. Once a load is verified, all the dependent operations are either ready for in-order completion and can release their reservation stations (in the case of a correct prediction), or restart execution with the correct load values (if the prediction is incorrect). Since the load/store unit supports multiple non-blocking loads on cache misses, verifying a predicted value can take up to dozens of cycles, allowing the processor to speculate several levels down the dependency chain beyond the load, executing instructions and resolving branches that would otherwise be blocked by true dependencies. The worst-case penalty for an incorrect load value prediction in this scheme, as compared to not predicting the value in question, is one additional cycle of latency, along with structural hazards that might not have occurred otherwise. The penalty occurs only when a dependent instruction has already executed speculatively, but is waiting in its reservation station for the load value to be committed or corrected. Since the load value comparison takes an extra cycle beyond the standard two-cycle load latency, the dependent instruction will reissue and execute with the correct load value one cycle later than it would have had there been no prediction. In addition, the earlier incorrect speculative issue may cause a structural hazard that prevents other useful instructions from dispatching or executing. In those cases where the dependent instruction has not yet executed (due

In those cases where the dependent instruction has not yet executed (due to structural or other unresolved data dependencies), there is no penalty, since the dependent instruction can issue as soon as the loaded value is available, in parallel with the value comparison in the load/store pipeline. In any case, because the LCT accurately prevents incorrect predictions, the misprediction penalty does not significantly affect performance. There can also be a structural hazard penalty even in the case of a correct prediction. Since speculative values are not verified until one cycle after the actual values become available, speculatively-issued dependent instructions may end up occupying their reservation stations for one cycle longer than they would have had there been no prediction.

3.4.2 Alpha AXP 21164 LVP Unit Operation

In order to support load value prediction, our Alpha AXP 21164 processor model differs from the actual AXP 21164 [47] in two ways. First, in order to allow speculation to occur in our LVP configurations, we must compare the actual value returned by the data cache with the predicted value. Since the path from the data cache to writeback is likely critical in hardware, the comparison requires an extra stage before writeback. The second modification, the addition of the reissue buffer, allows us to buffer instruction dispatch groups that contain predicted loads. With this feature, we are able to redispatch instructions after a misprediction with only a single-cycle penalty. In order to keep the AXP 21164 model as simple as possible, when either of the two dispatched loads is mispredicted, all of the (up to eight) instructions in flight are squashed and reissued from the reissue buffer, regardless of whether or not they are dependent on the predicted data. Since the 21164 is unable to stall anywhere past the dispatch stage, we are unable to predict loads that miss the L1 data cache. However, when an L1 miss occurs, we are able to return to the nonspeculative state before the miss is serviced; hence, there is no penalty for making the prediction. The inability of our LVP Unit to speculate beyond an L1 cache miss in most cases means that the LVP Unit's primary benefit is the provision of a zero-cycle load [11]. Typically, we envision the CVU as a mechanism for reducing bandwidth to the cache hierarchy (evidence of this is discussed in Section 3.6.1 and Section 3.6.5). However, since the 21164 is equipped with a true dual-ported cache and two load/store units, it is largely unaffected by a reduction in bandwidth requirements to the L1 cache. Beyond reducing L2 bandwidth, the primary benefit of the CVU in the 21164 model is that it enables those predictions identified as constants to proceed regardless of whether or not they miss the L1 data cache. Hence, the only LVP predictions to proceed in spite of an L1 cache miss are those that are verified by the CVU.

3.5 Experimental Framework

Our experimental framework consists of three main phases: trace generation, LVP Unit simulation, and microarchitectural simulation. All three phases are performed for both operating environments (IBM AIX and DEC OSF/1). For the PowerPC 620, traces are collected and generated with the TRIP6000 instruction tracing tool. TRIP6000 is an early version of a software tool developed for the IBM RS/6000 that captures all instruction, value, and address references made by the CPU while in user state. Supervisor-state references between the initiating system call and the corresponding return to user state are lost. For the Alpha AXP 21164, traces are generated with the ATOM tool [49], which likewise captures only user-state instruction, value, and address references. The instruction, address, and value traces are fed to a model of the LVP Unit described earlier, which annotates each load in the trace with one of four value prediction states: no prediction, incorrect prediction, correct prediction, or constant load. The annotated trace is then fed to a cycle-accurate microarchitectural simulator that correctly accounts for the behavior of each type of load. All of our microarchitectural models are implemented using the VMW framework [50], which enables significant productivity gains by allowing us to reuse and retarget existing models. The LVP Unit model is separated from the microarchitectural models for two reasons: to shift complexity out of the microarchitectural models and thus better distribute our simulations across multiple CPUs, and to conserve trace bandwidth by passing only two bits of state per load to the microarchitectural simulator, rather than the full 32/64-bit values being loaded.
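The interface between the second and third phases is compact; a sketch of what the per-load annotation might look like follows. The struct layout and names are hypothetical, but the four outcome states are exactly the ones listed above.

```c
/* Per-load annotation handed from the LVP Unit model to the
 * microarchitectural simulator: only the two-bit outcome is needed
 * downstream, not the loaded value itself. Names are illustrative. */
enum lvp_outcome {
    LVP_NO_PREDICTION,  /* LCT classified the load as unpredictable */
    LVP_INCORRECT,      /* predicted, but the value comparison failed */
    LVP_CORRECT,        /* predicted, and the value comparison succeeded */
    LVP_CONSTANT        /* predicted and verified by a CVU address match */
};

struct annotated_load {
    unsigned long long pc;     /* load instruction address */
    unsigned long long addr;   /* effective data address */
    enum lvp_outcome outcome;  /* one of the four states above */
};
```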

3.6 Experimental Results

We collected four types of results from our microarchitectural models: cycle-accurate performance results for various combinations of LVP Unit configurations and microarchitectural models for both the 620 and 21164; the distribution of load latencies for the 620; average data dependency resolution latencies for the 620; and reductions in bank conflicts for the 620. Additional detailed performance data are included in Appendix A.


[Figure 3-4 appears here: per-benchmark speedup bar charts. Alpha AXP 21164 panel: HM=1.06 Simple, HM=1.09 Limit, HM=1.15 Perfect. PowerPC 620 panel (one bar clipped at 1.63): HM=1.03 Simple, HM=1.03 Constant, HM=1.06 Limit, HM=1.09 Perfect.]

Figure 3-4. Base Machine Model Speedups.

3.6.1 Base Machine Model Speedups with Realistic LVP

In Figure 3-4, we show speedup numbers relative to the baseline 620 for two LVP Unit configurations that we consider realistic (i.e. buildable within one or two processor generations), as well as for two idealized LVP Unit configurations. The two realistic configurations, Simple and Constant, are described in Table 3-1. To explore the limits of load value prediction, we also include results for the Limit and Perfect LVP Unit configurations (also described in Table 3-1). The former is similar to the Simple configuration, only much larger, but it is not realistic, since it assumes a hypothetical perfect mechanism for selecting which of the sixteen values associated with each load instruction address is the correct one to predict. The latter configuration, Perfect, is able to correctly predict all load values, but does not classify any of them as constants. Neither of these configurations is buildable, but they are nevertheless interesting, since they give us a sense of how much additional performance we can expect from more aggressive and accurate LVP implementations.

Figure 3-4 also shows three of these four LVP configurations for the Alpha AXP 21164. We omit the Constant configuration from our 21164 simulations because it does not differ significantly from the Simple configuration on the 620, and because we have limited access to native Alpha CPU cycles for collecting traces. In general, the 21164 derives roughly twice as much performance benefit from LVP as does the 620. We attribute this to two factors: its small first-level data cache (8K direct-mapped, vs. the 620's 32K 8-way associative cache) benefits more from the CVU, and its in-order issue policy makes it more sensitive to load latency, since it must rely solely on the compiler to overlap load latency with other useful computation. The 620, on the other hand, is able to find other useful computation dynamically, thanks to its out-of-order core.

Two benchmarks (grep and gawk) stand out for the dramatic performance increases they achieve on both models. This gain results from the fact that both benchmarks are data-dependence bound, i.e. they have important but relatively short dependency chains in which load latencies make up a significant share of the critical path. Thus, according to Amdahl's Law, collapsing the load latencies results in significant speedups. Conversely, benchmarks that we would expect to perform better based on their high load value locality (e.g. hydro2d and mpeg on both models) fail to do so because load latencies make up a lesser share of their critical dependency paths.

The bandwidth-reducing effects of the CVU manifest themselves as lower first-level data cache miss rates for several of the benchmarks running on the 21164. For example, the miss rate for compress drops from 4.3% to 3.4% per instruction, a 20% reduction. Likewise, eqntott and gperf experience ~10% reductions in their miss rates, which translate into the significant speedups shown in Figure 3-4. Even cjpeg and mpeg, which gain almost nothing from LVP on the 620, eke out measurable gains on the 21164 due to the 10% reduction in primary data cache miss rate brought about by the CVU.

3.6.2 Enhanced Machine and LVP Model Speedups

To further explore the interaction between load value prediction and the PowerPC 620 microarchitecture, we collected results for the 620+ enhanced machine model described earlier in conjunction with four LVP configurations. The results of these simulations are summarized in Table 3-4, where the third column shows the 620+'s average speedup of 6.0% over the base 620 with no LVP, and columns 4 through 7 show average additional speedups of 4.2%, 4.0%, 7.1%, and 10.4% for the Simple, Constant, Limit, and Perfect LVP configurations, respectively.


Table 3-4. PowerPC 620+ Speedups. Column 3 shows 620+ speedup relative to 620 with no LVP; columns 4-7 show additional LVP speedups relative to baseline 620+ with no LVP.

Bench      Base Cyc      620+    Simple  Constant  Limit   Perfect
cc1-271    93,371,808    1.057   1.006   1.003     1.045   1.071
cc1        117,571,998   1.112   1.012   1.006     1.021   1.041
cjpeg      2,818,987     1.126   1.001   1.011     1.000   1.021
compress   33,436,604    1.092   1.006   1.006     1.019   1.175
doduc      43,796,620    1.030   1.007   1.008     1.016   1.039
eqntott    18,823,362    1.049   1.029   1.037     1.083   1.082
gawk       28,741,147    1.009   1.293   1.240     1.327   1.272
gperf      4,893,966     1.108   1.026   1.019     1.045   1.034
grep       2,169,697     1.018   1.329   1.310     1.531   1.789
hydro2d    5,398,363     1.024   1.018   1.019     1.028   1.041
mpeg       5,394,984     1.192   1.012   1.023     1.036   1.031
perl       102,965,698   1.050   1.046   1.007     1.099   1.116
quick      704,262       1.019   1.000   0.999     1.051   1.170
sc         62,227,728    1.061   1.035   1.056     1.061   1.088
swm256     51,327,965    1.044   1.000   1.000     1.000   1.025
tomcatv    32,838,452    1.018   1.003   1.003     1.004   1.050
xlisp      44,844,605    1.052   1.022   1.026     1.059   1.058
HM                       1.060   1.042   1.040     1.071   1.104

In general, we see that the increased machine parallelism of the 620+ more closely matches the parallelism exposed by load value prediction, since the relative gains for the realistic LVP configurations are nearly 50% higher than they are for the baseline 620. The most dramatic examples of this trend are grep and gawk, which show very little speedup from the increased machine parallelism without LVP, but nearly double their relative speedups with LVP (with the Simple LVP configuration, grep increases from 20% to 33%, while gawk increases from 15% to 30%).

3.6.3 Distribution of Load Verification Latencies

In Figure 3-5 we show the distribution of load verification latencies for each of the four LVP configurations (Simple, Constant, Limit, and Perfect) on the 620 and 620+ machine models. That is, we show the percentage of correctly-predicted loads that are verified a given number of cycles after they are dispatched. The numbers shown are the sum over all the benchmarks. These results provide an intuitive feel for the number of cycles of load latency being eliminated by load value prediction. Clearly, if a larger percentage of loads have longer latency, LVP will prove more beneficial.

[Figure 3-5 appears here: two panels (PPC 620, PPC 620+) plotting % Loads Verified against verification latency in cycles for the Simple, Constant, Limit, and Perfect configurations.]

Figure 3-5. Load Verification Latency Distribution. Numbers shown are the percentage of correctly-predicted loads that are verified a given number of cycles after they are dispatched.

Interestingly enough, the distributions for all four LVP configurations look virtually identical, which indicates that more aggressive LVP implementations (like Limit and Perfect) are uniformly effective, regardless of load latency. One would expect that a wider microarchitecture like the 620+ would reduce average load latency, since many of the structural dependencies are eliminated. These results counter that expectation, however, since there is a clear shift to the right in the distribution shown for the 620+. This shift is caused by the time dilation brought about by the improved performance of the 620+, which in turn is caused by its microarchitectural improvements as well as the relative improvement in LVP performance noted in Section 3.6.2.

3.6.4 Data Dependency Resolution Latencies

The intent of load value prediction is to collapse true dependencies by reducing memory latency to zero cycles. To confirm that this is actually happening, and to quantify the dependencies being collapsed, we measured the average amount of time an instruction spends in a reservation station waiting for its true dependencies to be resolved. The results are summarized in Figure 3-6, which categorizes the waiting-time reductions by functional unit type. The numbers shown are the average over all the benchmarks, normalized to the waiting times without LVP. We see that instructions in the branch (BRU) and multi-cycle integer (MCFX) units experience the smallest reductions in true dependency resolution time.

[Figure 3-6 appears here: two panels (PPC 620, PPC 620+) plotting Relative Resolution Latency (%) by FU type (BRU, MCFX, FPU, SCFX, LSU) for the Simple, Constant, Limit, and Perfect configurations.]

Figure 3-6. Data Dependency Resolution Latencies. Cycles spent waiting for operands are normalized to the baseline 620/620+ models.

This makes sense, since both branches and move-from-special-purpose-register (mfspr) instructions are waiting for operand types (link register, count register, and condition code registers) that the LVP mechanism does not predict. Conversely, the dramatic reductions seen for floating-point (FPU), single-cycle fixed-point (SCFX), and load/store (LSU) instructions correspond to the fact that operands for these units are predicted. Furthermore, the relatively higher value locality of address loads shown in Figure 3-2 corresponds well with the dramatic reductions shown for load/store instructions in Figure 3-6. Even with just the Simple or Constant LVP configurations, the average dependency resolution latency for load/store instructions is reduced by about 50%.

3.6.5 Bank Conflicts

The purpose of the CVU is to reduce memory bandwidth by eliminating the need for constant loads to access the conventional memory hierarchy. In our 620 and 620+ models, this benefit manifests itself as a reduction in the number of bank conflicts at the two banks of the first-level data cache. On the 620, in any given cycle, both a load and a store can attempt to access a data cache port. If both accesses are to the same bank, a conflict occurs, and the store must wait to try again the next cycle. On our 620+ model this problem is aggravated, since up to two loads and a store can attempt to access the two available banks in each cycle. In Figure 3-7, we show the fraction of cycles in which a bank conflict occurs for each of our benchmarks running on the 620 and 620+ models.

[Figure 3-7 appears here: per-benchmark Bank Conflict (%) bar charts for the PowerPC 620 (No LVP, Simple, Constant; y-axis to 6.0%) and the PowerPC 620+ (y-axis to 15.0%).]

Figure 3-7. Percentage of Cycles with Bank Conflicts.

Overall, bank conflicts occur in 2.6% of all 620 simulation cycles for our benchmark set, and in 6.9% of all 620+ cycles. Our Simple LVP Unit configuration is able to reduce those numbers by 8.5% and 5.1% for the 620 and 620+, respectively, while our Constant configuration manages to reduce them by 14.0% and 14.2% (we are pleased to note that these reductions are relatively higher than those shown in Table 3-3, which means the CVU tends to target loads that are, on average, more likely to cause bank conflicts). Interestingly enough, a handful of benchmarks (gawk, grep, hydro2d) experience a slight increase in the relative number of cycles with bank conflicts, as shown in Figure 3-7. This is actually brought about by the time dilation caused by the increased performance of the LVP configurations, rather than by an increase in the absolute number of bank conflicts. One benchmark--tomcatv--did experience a very slight increase in the absolute number of bank conflicts on the 620+ model. We view this as a second-order effect of the perturbations in instruction-level parallelism caused by LVP, and are relieved to note that it is overshadowed by other factors that result in a slight net performance gain for tomcatv (see Table 3-4).


3.7 Conclusions and Future Work

We make three major contributions in this chapter. First, we explore the concept of value locality in computer system storage locations. Second, we demonstrate that load instructions, when examined on a per-instruction-address basis, exhibit significant amounts of value locality. Third, we describe load value prediction, a microarchitectural technique for capturing and exploiting load value locality to reduce effective memory latency as well as bandwidth requirements. We are very encouraged by our results. We have shown that measurable (3% on average for the 620, 6% on average for the 21164) and in some cases dramatic (up to 21% on the 620 and 17% on the 21164) performance gains are achievable with simple microarchitectural extensions to two current microprocessor implementations that represent the two extremes of superscalar design philosophy.

We envision future work proceeding on several different fronts. First of all, we believe that the relatively simple techniques we employed for capturing value locality could be refined and extended to effectively predict a larger share of load values. Those refinements and extensions might include allowing multiple values per static load in the prediction table, by including branch history bits or other readily available processor state in the lookup index; or moving beyond history-based prediction to computed predictions through techniques like value stride detection, which is examined in a somewhat different context in Chapter 6. Second, our load classification mechanism could be refined to correctly classify more loads and extended to control pollution in the value table (e.g. removing loads that are not latency-critical from the table). Third, the microarchitectural design space should be explored more extensively, since load value prediction can dramatically alter the available program parallelism in ways that may not match current levels of machine parallelism very well. Fourth, feedback-directed compiler support for rescheduling loads for different memory latencies based on their value locality may also prove beneficial. Finally, more aggressive approaches to value prediction could be investigated. These might include speculating down multiple paths in the value space, or speculating on values generated by instructions other than loads (see Chapter 4).


CHAPTER 4

Register Value Prediction

For decades, the serialization constraints imposed by true data dependences have been regarded as an absolute limit--the dataflow limit--on the parallel execution of serial programs. This chapter extends the technique introduced in Chapter 3 into a new technique--register value prediction--that exceeds this limit by allowing data-dependent instructions to issue and execute in parallel without violating program semantics. The technique is built on the concept of value locality, which describes the likelihood of the recurrence of a previously-seen value within a storage location inside a computer system. Value prediction consists of predicting entire 32- and 64-bit register values based on previously-seen values. We find that the register values being written by machine instructions are frequently predictable. Furthermore, we show that simple microarchitectural enhancements to a modern microprocessor implementation based on the PowerPC 620 can effectively exploit value locality to collapse true dependences, reduce average result latency, and provide performance gains of 4.3%-18% (depending on machine model) by exceeding the dataflow limit.

4.1 Motivation

As described in Chapter 1, Section 1.1, there are two fundamental restrictions that limit the amount of instruction-level parallelism (ILP) that can be extracted from sequential programs: control flow and data flow. Control flow limits ILP by imposing serialization constraints at forks and joins in a program's control flow graph [1]. Data flow limits ILP by imposing serialization constraints on pairs of instructions that are data dependent (i.e. one needs the result of the other to compute its own result, and hence must wait for the other to complete before beginning to execute). In this chapter, we extend the LVP approach of Chapter 3 for predicting the results of load instructions to apply to all instructions that write an integer or floating-point register; show that a significant proportion of such writes are trivially predictable; describe a value prediction hardware mechanism that allows dependent instructions to execute in parallel; and present results that demonstrate significant performance increases over our baseline machine models.

4.2 Value Locality

In this chapter, we revisit the concept of value locality, which we first introduced in [27] as the likelihood of a previously-seen value recurring repeatedly within a storage location. Although the concept is general and can be applied to any storage location within a computer system, we have limited our current study to examine only the value locality of general-purpose or floating-point registers immediately following instructions that write to those registers. A plethora of previous work on static and dynamic branch prediction (e.g. [19,6]) has focused on an even more restricted application of value locality, namely the prediction of a single condition bit based on its past behavior. In Chapter 3 and [27], we examined the value locality of registers being targeted by loads from memory. This chapter is a logical continuation of that work, extending the prediction of load values to the prediction of all integer and floating-point register values.

Intuitively, it would seem a difficult task to discover useful amounts of value locality in the register writes following all instructions. Despite the load value locality described in Chapter 3, which indicates redundancy in the inputs to computation, one would expect the results of ALU computation to exhibit significantly less redundancy, due to the increased entropy brought about by the computation. As it turns out, however, if we once again narrow the scope of our prediction mechanism by considering each static instruction individually, the task becomes almost trivial, and we are able to accurately predict a significant fraction of the register values being written by machine instructions.

The benchmark set used to explore value locality and quantify its performance impact is the SPEC92 integer suite (described in Section 2.3.1 on page 31), miscellaneous integer programs (described in Section 2.3.2 on page 31), and the SPEC92 floating-point suite (described in Section 2.3.3 on page 32). Figure 4-1 shows the register value locality for all instructions that write an integer or floating-point register in each of the benchmarks.

[Figure 4-1 appears here: per-benchmark Value Locality (%) bar chart with light (history depth one) and dark (history depth four) bars.]

Figure 4-1. Register Value Locality. The light bars show value locality for a history depth of one, and the dark bars show it for a history depth of four.

The register value locality for each benchmark is measured by counting the number of times each static instruction writes a register value that matches a previously-seen value for that static instruction, and dividing by the total number of dynamic register writes in the benchmark. Two sets of numbers are shown: one (light bars) for a history depth of one (i.e. we check for matches against only the most-recently-written value), and a second (dark bars) for a history depth of four (i.e. we check against the last four unique values).1 We see that even with a history depth of one, most of the programs exhibit value locality in the 40-50% range (average 49%), while extending the history depth to four (along with a perfect mechanism for choosing the right one of the four values) can improve that to the 60-70% range (average 61%). This means that a majority of static instructions exhibit very little variation in the values that they write during the course of a program's execution.

To further explore the notion of value locality, we collected value predictability data that classifies register writes based on instruction type (the types are summarized in Table 4-1); these results are summarized in Figure 4-2. Once again, two sets of numbers are shown: one for a history depth of one, and another for a history depth of four. Integer and floating-point double loads (I_LD and FPD_LD) are the most predictable frequently-occurring instructions. FP_OTH, FP_MV, and MC_MV are also very predictable, but make up an insignificant portion of the dynamic instruction mix.

1. The history values are stored in a direct-mapped table with 16K entries indexed but not tagged by instruction address, and the values (up to four) stored at each entry are replaced with an LRU policy. Hence, the potential exists for both constructive and destructive interference between instructions that map to the same entry.


Table 4-1. Instruction Types.

Instr Type  Description                                   Freq (%)
SC_A        Single-cycle arithmetic, 2 reg. operands      5.45
SC_A_I      Single-cycle arithmetic, 1 reg. operand       23.55
SC_L        Single-cycle logical, 2 reg. operands         1.86
SC_L_I      Single-cycle logical, 1 reg. operand          9.89
MC_A        Multi-cycle arithmetic, 2 reg. operands       0.14
MC_A_I      Multi-cycle arithmetic, 1 reg. operand        0.06
MC_MV       Multi-cycle register move                     1.86
I_LD        Integer load instructions                     33.00
ST_U        Store with base reg. update                   5.14
FP_LD       FP load single                                3.16
FPD_LD      FP load double                                4.76
FP_A        FP instructions other than multiply           3.52
FP_M        FP multiply instructions                      2.11
FP_MA       FP multiply-add instructions                  3.65
FP_OTH      FP div, abs, neg, round to single precision   1.61
FP_MV       FP register move instructions                 0.26

For the single-cycle instructions, fewer input operands (one vs. two) correlate with higher value locality. For the multi-cycle instructions, however, the opposite is true. The worst value locality is exhibited by the floating-point-load-single (FP_LD) instructions. We attribute this to the fact that the floating-point benchmarks we used initialize input arrays with pseudo-random numbers, resulting in poor value locality for loads from these arrays. The store-with-update (ST_U) instruction type also has poor value locality. This makes sense, since ST_U is used to step through an array at a fixed stride (hence the base address register is updated with a different value every time the instruction executes, and history-based value prediction will fail). On the other hand, ST_U is also used in function prologues to update the stack frame pointer, where, given the same call depth, the value is predictable from one call to the next. Hence, some of our call-intensive benchmarks report higher value locality for ST_U. However, the former effect dominates and lowers the overall value locality for ST_U.
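As an illustration of the measurement described above, the following C sketch models the per-static-instruction history table from the footnote to Figure 4-1: 16K entries, direct-mapped and untagged, holding up to four values replaced LRU. The function names are ours, 4-byte instruction alignment is assumed, and sentinel initialization is elided.

```c
/* Sketch of the value-locality measurement table (per the footnote to
 * Figure 4-1): direct-mapped, untagged, 16K entries, up to four values
 * per entry, LRU-replaced. Names are ours; initialization is elided. */
#include <stdint.h>
#include <stdbool.h>

#define ENTRIES 16384
#define DEPTH   4

static uint64_t history[ENTRIES][DEPTH];   /* slot 0 = most recent */

/* Returns true if 'value' matches one of the last DEPTH values written
 * by the static instruction at 'pc' (a depth-1 check would test only
 * slot 0), then promotes the value to the MRU slot. */
bool match_and_update(uint64_t pc, uint64_t value)
{
    uint64_t *h = history[(pc >> 2) % ENTRIES];  /* 4-byte instructions */
    int hit = -1;
    for (int i = 0; i < DEPTH; i++)
        if (h[i] == value) { hit = i; break; }
    int last = (hit < 0) ? DEPTH - 1 : hit;      /* a miss evicts the LRU slot */
    for (int i = last; i > 0; i--)
        h[i] = h[i - 1];
    h[0] = value;
    return hit >= 0;
}
```

Value locality at a given history depth is then simply the fraction of dynamic register writes for which this check succeeds.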

[Figure 4-2 appears here: Value Locality (%) bar chart by instruction type, with bars for history depths of one and four.]

Figure 4-2. Register Value Locality by Instruction Type.

4.3 Exploiting Value Locality

The fact that the register writes in many programs demonstrate a significant degree of value locality opens up exciting new possibilities for the microarchitect.


Since the results of many instructions can be accurately predicted before they are issued or executed, dependent instructions are no longer bound by the serialization constraints imposed by operand data flow. Instructions can now be scheduled speculatively with additional degrees of freedom to better utilize existing functional units and hardware buffers, and are frequently able to complete execution sooner, since the critical paths through dependence graphs have been collapsed. However, in order to exploit value locality and bring about all of these benefits, two mechanisms must be implemented: one for accurately predicting values--the VP (value prediction) unit--and one for verifying these predictions.

4.3.1 The Value Prediction Unit

Value prediction is useful only if it can be done accurately, since incorrect predictions can lead to increased structural hazards and longer latency (the misprediction penalty is described in greater detail in Section 4.4.2). Hence, we propose a two-level prediction structure for the VP Unit: the first level is used to generate the prediction values, and the second level is used to decide whether or not the predictions are likely to be accurate. The internal structure of the VP Unit, which is very similar to that of the LVP Unit described in Chapter 3, is summarized in Figure 4-3. The VP Unit consists of two tables: the Classification Table (CT) and the Value Prediction Table (VPT), both of which are direct-mapped and indexed by the instruction address (PC) of the instruction being predicted. Entries in the CT contain two fields: a valid field, which consists of either a single bit that indicates a valid entry or a partial or complete tag that is matched against the upper bits of the PC to validate the entry; and a prediction history, which is a saturating counter of one or more bits.

[Figure 4-3 appears here: block diagram of the VP Unit. The PC of the predicted instruction indexes the Classification Table (CT), producing the prediction result, and the Value Prediction Table (VPT), producing the predicted value; the updated value is written back into the VPT.]

Figure 4-3. Value Prediction Unit. The PC of the instruction being predicted is used to index into the VPT to find a value to predict. At the same time, the CT is also indexed with the PC to determine whether or not a prediction should be made. When the instruction completes, both the prediction history and value history are updated.

The prediction history is incremented or decremented whenever a prediction is correct or incorrect, respectively, and is used to classify instructions as either predictable or unpredictable. This classification is used to decide whether or not the result of a particular instruction should be predicted. Increasing the number of bits in the saturating counter adds hysteresis to the classification process and can help avoid erroneous classifications by ignoring anomalous values and/or destructive interference. The VPT entries also consist of two fields: a valid field, which, again, can consist of a single valid bit or a full or partial tag; and a value history field, which contains one or more 32- or 64-bit values that are maintained with an LRU policy. The value history fields are replaced when an instruction is first encountered (by its result) or whenever a prediction is incorrect (by the actual result). The VPT replacement policy is also governed by the CT prediction history, to introduce hysteresis and avoid replacing useful values with less useful ones.

As a preliminary exploration of the VP Unit design space, we analyzed sensitivity to a few key parameters, and then selected a specific design point to use with our microarchitectural studies (see Section 4.6). However, the intent of this thesis is not to explore the details of such a design; rather, our intent is to explore the larger issue of the impact of value prediction on microarchitecture and instruction-level parallelism, and to leave such details to future work.
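As a concrete, much-simplified rendering of this two-table structure, the sketch below implements the prediction and training paths using the design point arrived at later in this chapter (a 4096-entry VPT and a 1024-entry CT with 2-bit counters), under the simplifying assumptions of a depth-1 value history and untagged entries; all names are ours.

```c
/* Simplified VP Unit: a CT of 2-bit saturating counters gates
 * predictions out of a VPT of last-seen values. Depth-1 history and
 * untagged, direct-mapped entries are simplifying assumptions. */
#include <stdint.h>
#include <stdbool.h>

#define VPT_ENTRIES 4096
#define CT_ENTRIES  1024

static uint64_t vpt[VPT_ENTRIES];   /* value history (depth 1) */
static uint8_t  ct[CT_ENTRIES];     /* 2-bit prediction history */

/* Fetch/dispatch: predict only if the CT classifies the instruction
 * as predictable (counter states 2 and 3; see Table 4-2 below). */
bool vp_predict(uint64_t pc, uint64_t *pred)
{
    if (ct[(pc >> 2) % CT_ENTRIES] >= 2) {
        *pred = vpt[(pc >> 2) % VPT_ENTRIES];
        return true;
    }
    return false;
}

/* Completion: train the counter and update the value history. The
 * saturated state (3) inhibits replacement, giving the hysteresis
 * described above; exactly which state is checked is a design detail. */
void vp_train(uint64_t pc, uint64_t actual)
{
    uint8_t  *c = &ct[(pc >> 2) % CT_ENTRIES];
    uint64_t *v = &vpt[(pc >> 2) % VPT_ENTRIES];
    if (*v == actual) {
        if (*c < 3) (*c)++;         /* correct: strengthen */
    } else {
        if (*c != 3) *v = actual;   /* replace unless strongly predictable */
        if (*c > 0) (*c)--;         /* incorrect: weaken */
    }
}
```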


[Figure 4-4 appears here: VPT hit rate (%) versus number of VPT entries (256-16384), one curve per benchmark.]

Figure 4-4. VPT Hit Rate Sensitivity to Size.

In Figure 4-4, we show the sensitivity of the VPT hit rate to size for each of our benchmarks. We see that for most benchmarks, the hit rate levels off at or around 4096 entries, though in several cases significant improvements are possible beyond that size. Nevertheless, we chose 4096 entries as our design point, since going beyond that size (i.e. 4096 entries x 8 bytes/entry = 32KB) seemed unreasonable without severely impacting processor cycle time.

The purpose of the CT is to partition instructions into two classes: those that are predictable by the VPT, and those that are not. To measure its effectiveness at accomplishing this purpose, we simulated six different CT configurations, which are summarized in Table 4-2. The state descriptions specify the effect of each state on both value prediction and the replacement of values in the VPT when new values are encountered. The results for each configuration are summarized in Figure 4-5. From the results, we conclude that the best choice for maximizing both the predictable and unpredictable hit rates is the 1024/3-bit configuration (this is not surprising, since it has the highest hardware cost). However, since the 1024/2-bit configuration is only slightly worse at identifying predictable instructions, is actually better at identifying unpredictable ones (hence minimizing the misprediction penalty), and is significantly cheaper to implement (it uses 1/3 fewer bits), we decided to use the latter in our microarchitectural simulation studies.1

[Figure 4-5 appears here: two panels (Predictable, Unpredictable) plotting hit rate (60-100%) versus number of VPT entries (256-16K) for the six CT configurations of Table 4-2.]

Figure 4-5. CT Hit Rates. The Predictable Hit Rate is the number of correct value predictions that were identified as such by the CT divided by the total number of correct predictions, while the Unpredictable Hit Rate is the number of incorrect predictions that were identified as such by the CT divided by the number of incorrect predictions.

Table 4-2. Classification Table Configurations.

Configuration (entries/bits)   State Descriptions
256/1-bit                      {0=no pred, 1=pred & no repl}
256/2-bit                      {0,1=no pred, 2,3=pred, 3=no repl}
256/3-bit                      {0,1=no pred, 2-7=pred, 5-7=no repl}
1024/1-bit                     {0=no pred, 1=pred & no repl}
1024/2-bit                     {0,1=no pred, 2,3=pred, 3=no repl}
1024/3-bit                     {0,1=no pred, 2-7=pred, 5-7=no repl}

We note that the unpredictable hit rates of the 3-bit configurations are worse (relative to the 1-bit and 2-bit configurations) than their predictable hit rates, and conclude that this must be because the 3-bit state assignments heavily favor prediction (see Table 4-2). Changing the state assignments might improve these hit rates.

1. Note that we do not claim that the hit rates shown in Figure 4-5 are a reliable predictor of system performance. Just as in branch prediction, higher hit rates may not necessarily translate into fewer execution cycles. Rather, detailed cycle-by-cycle simulation of the entire microarchitecture is needed to verify performance improvements.
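Read as code, the state encodings in Table 4-2 reduce to two predicates per configuration--may we predict, and may the VPT value be replaced; a sketch with an unsigned counter c (function names ours):

```c
#include <stdbool.h>

/* Table 4-2 state encodings: does counter value c permit a
 * prediction, and does it permit replacing the value in the VPT? */
bool pred_1bit(unsigned c) { return c == 1; }  /* 0=no pred, 1=pred */
bool repl_1bit(unsigned c) { return c != 1; }  /* 1 = no repl */

bool pred_2bit(unsigned c) { return c >= 2; }  /* 2,3 = pred */
bool repl_2bit(unsigned c) { return c != 3; }  /* 3 = no repl */

bool pred_3bit(unsigned c) { return c >= 2; }  /* 2-7 = pred */
bool repl_3bit(unsigned c) { return c < 5; }   /* 5-7 = no repl */
```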


[Figure 4-6 appears here: pipeline diagram of a predicted instruction and its dependent, showing the VPT and CT lookups at fetch, the predicted value written to the rename buffer, speculative issue from the dependent's reservation station, and the compare/verify stage that either releases or invalidates and reissues the dependent.]

Figure 4-6. Example use of the Value Prediction Mechanism. The dependent instruction shown on the right uses the predicted result of the instruction on the left, and is able to issue and execute in the same cycle.

4.3.2 Verifying Predictions

Since value prediction is by nature speculative, we need a mechanism for verifying the correctness of the predictions and efficiently recovering from mispredictions. This mechanism is summarized in the example of Figure 4-6, which shows the parallel execution of two data-dependent instructions. The producer instruction, shown on the left, has its value predicted and written to its rename buffer during the fetch and dispatch cycles. The consumer instruction, shown on the right, reads the predicted value from the rename buffer at the beginning of the execute cycle, and is able to issue and execute normally, but is forced to retain its reservation station. Meanwhile, the predicted instruction also executes, and its computed result is compared with the predicted result during its completion stage. If the values match, the consumer instruction releases its reservation station. If not, completion of the first instance of the consumer instruction is invalidated, and a second instance reissues with the correct value.
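A rough software rendering of this verify-or-reissue step, with all structures and names ours (real hardware does this with tag broadcasts over the result busses rather than a loop):

```c
/* Sketch of prediction verification at the producer's completion
 * stage: dependents either release their reservation stations or are
 * invalidated and reissued with the actual value. Names are ours. */
#include <stdint.h>
#include <stdbool.h>

struct dependent {
    bool executed_speculatively;   /* issued early using the prediction */
    bool holds_reservation;        /* retained until the input verifies */
};

void verify_at_completion(uint64_t predicted, uint64_t actual,
                          struct dependent *dep, int ndeps)
{
    bool correct = (predicted == actual);
    for (int i = 0; i < ndeps; i++) {
        if (correct) {
            dep[i].holds_reservation = false;  /* result now safe to complete */
        } else if (dep[i].executed_speculatively) {
            /* First instance is invalidated; a second instance reissues
             * with 'actual', one cycle later than an unpredicted issue. */
            dep[i].executed_speculatively = false;
        }
    }
}
```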


Table 4-3. Baseline Performance (IPC).

Benchmark  620      620+     Infinite
cc1-271    1.05540  1.07260  6.40244
cc1        1.20880  1.30892  6.81969
cjpeg      0.99308  1.10503  10.11820
compress   1.15508  1.22739  5.66520
eqntott    1.35984  1.41655  5.58084
gawk       1.22254  1.23106  4.05087
gperf      1.61187  1.82027  7.00588
grep       1.07909  1.06635  2.02673
mpeg       1.62410  1.86998  7.99286
perl       1.00018  1.05241  8.03310
quick      0.97000  0.99904  4.91123
sc         1.24365  1.31691  6.75335
xlisp      1.15722  1.21509  8.30155
doduc      0.81249  0.83851  5.80629
hydro2d    0.80267  0.82059  5.53410
swm256     0.85172  0.88852  4.15299
tomcatv    0.91337  0.93081  5.77235
HM         1.07501  1.12578  5.43294

4.4 Microarchitectural Models

In order to validate and quantify the performance impact of the Value Prediction Unit, we implemented three cycle-accurate simulation models: two based on the PowerPC 620 [46, 15]--one that matches the current 620 closely, and one, termed the 620+, that alleviates some of its known bottlenecks--and an additional idealized model that removes all structural dependences. These models are described in detail in Section 2.1 on page 21. Table 4-3 summarizes the performance of each of our benchmarks on each of the three baseline machine models without value prediction.

4.4.1 VP Unit Operation

The VP Unit predicts values during fetch and dispatch, then forwards them speculatively to subsequent dependent instructions via the 620's rename buffers. Up to four predictions can be made per cycle on our 620/620+ models, while the infinite model can make up to 4096 predictions per cycle. Dependent instructions are able to issue and execute immediately, but are prevented from completing architecturally and are forced to retain possession of their reservation stations until their inputs are no longer speculative. Speculatively-forwarded values are tagged with the uncommitted register writes they depend on, and these tags are propagated to the results of any subsequent dependent instructions.


Meanwhile, uncommitted instructions execute in their respective functional units, and the predicted values are verified by a comparison against the actual values computed by the instructions. Once a prediction is verified, its tag is broadcast to all active instructions, and all of the dependent instructions can either release their reservation stations and proceed into the completion unit (in the case of a correct prediction) or restart execution with the correct register values (if the prediction was incorrect). Since a large number of instructions can be in flight at the same time (16 on the base 620, 32 on the 620+, and up to 4096 in our infinite model), the overall latency for verifying a predicted value can be dozens of cycles or more, allowing the processor to speculate multiple levels down the dependence chain beyond the write, executing instructions and resolving branches that would otherwise be blocked by data-flow dependences.

4.4.2 Misprediction Penalty

The worst-case penalty for an incorrect value prediction in this scheme, as compared to not predicting the value in question, is one additional cycle of latency, along with structural hazards that might not have occurred otherwise. The penalty occurs only when a dependent instruction has already executed speculatively, but is waiting in its reservation station for one of its predicted inputs to be verified. Since the value comparison takes an extra cycle beyond the pipeline result latency, the dependent instruction will reissue and execute with the correct value one cycle later than it would have had there been no prediction. In addition, the earlier incorrect speculative issue may have caused a structural hazard that prevented other useful instructions from dispatching or executing. In those cases where the dependent instruction has not yet executed (due to structural or other unresolved data dependences), there is no penalty, since the dependent instruction can issue as soon as the actual computed value is available, in parallel with the value comparison that verifies the prediction. In any case, because the CT accurately prevents incorrect predictions (see Figure 4-5), the misprediction penalty does not significantly affect performance.

There can also be a structural hazard penalty even in the case of a correct prediction. Since speculative values are not verified until one cycle after the actual values become available, speculatively-issued dependent instructions end up occupying their reservation stations for one cycle longer than they would have had there been no prediction.


Table 4-4. VP Unit Configurations.

Configuration  VPT Entries  VPT History Depth  CT Entries  CT Bits/Entry
Simple         4096         1                  1024        2
1PerfCT        4096         1                  ∞           Perfect
4PerfCT        4096         4/Perfect          ∞           Perfect
8PerfCT        4096         8/Perfect          ∞           Perfect
Perfect        ∞            Perfect            ∞           Perfect

4.5 Experimental Framework

Our experimental framework is very similar to the one used in Chapter 3, and consists of three main phases: trace generation, VP Unit simulation, and microarchitectural simulation. Traces are collected and generated with the TRIP6000 instruction tracing tool, which is an early version of a software tool developed for the IBM RS/6000 that captures all instruction, value, and address references made by the CPU while in user state. Supervisor-state references between the initiating system call and the corresponding return to user state are lost. The instruction, address, and value traces are fed to a model of the VP Unit described earlier, which annotates each instruction in the trace with one of three value prediction states: no prediction, incorrect prediction, or correct prediction. The annotated trace is then fed to a cycle-accurate microarchitectural simulator that correctly accounts for the behavior of each type of instruction.

One of the well-known shortcomings of trace-driven simulation is that the non-architected side effects of speculative instructions that never complete are not accurately modeled. For our machine models, these side effects include instruction and data cache perturbation due to speculative fetches and loads, as well as perturbation of the branch history table, return address stack, and branch target address cache by speculative branch instructions. Fortunately, the VPT and CT structures are modeled accurately, since they are never updated until completion. Our model also properly accounts for all other structural resource contention caused by speculative execution.

4.6 Experimental Results

We collected performance results for each of the three machine models described in Section 2.1 on page 21 (base 620, enhanced 620+, and infinite) in conjunction with five different VP Unit configurations, which are summarized in Table 4-4.


Attributes marked perfect in Table 4-4 indicate behavior that is analogous to perfect caches; that is, a mechanism that always produces the right result is assumed. More specifically, in the 1PerfCT, 4PerfCT, and 8PerfCT configurations, we assume an oracle CT that is able to correctly identify all predictable and unpredictable register writes. Furthermore, in the 4PerfCT and 8PerfCT configurations, we assume a perfect mechanism for choosing which of the 4 (or 8) values stored in the value history is the correct one. Moreover, we assume that the Perfect configuration can always correctly predict a value for every register write. We point out that the only VP Unit configuration that we know how to build today is the Simple one; the other four are merely included to measure the potential contribution of improvements to both VPT and CT prediction accuracy. Additional detailed performance data are included in Appendix B.

4.6.1 PowerPC 620 Machine Model Speedups

In Figure 4-7 we show the speedups that the VP Unit configurations of Table 4-4 obtain over the base PowerPC 620 machine model. The Simple configuration achieves an average speedup of 4.3% (harmonic mean), the 1PerfCT configuration improves that to 5.4%, 4PerfCT to 6.4%, 8PerfCT to 6.8%, and Perfect all the way to 11.0%. Two benchmarks, gawk and grep, demonstrate outstanding performance gains, even with the imperfect configurations, while the gains for cjpeg and compress are nonexistent, even with perfect CTs. We attribute the poor showing of cjpeg and compress to their lack of register value locality (see Figure 4-1). Detailed profiling of grep and gawk revealed that both spend a significant portion of their time in the bmexec() and dfaexec() routines, which implement string searches in loops with long dependence chains. For both benchmarks, value prediction is frequently able to break these dependence chains, resulting in significant additional parallelism. The speedups for several benchmarks (cc1-271, grep, perl, doduc, and hydro2d) are quite sensitive to CT accuracy (i.e. a perfect CT produces significantly more speedup), indicating a need for a more accurate classification mechanism. In general, however, we are pleased with our results, which show that value prediction is able to produce measurable speedups on a current-generation microprocessor design.


[Figure 4-7 appears here: per-benchmark 620 speedup bar chart; HM=1.043 Simple, HM=1.054 1PerfCT, HM=1.064 4PerfCT, HM=1.068 8PerfCT, HM=1.110 Perfect.]

Figure 4-7. 620 Speedups.

4.6.2 PowerPC 620+ Machine Model Speedups

In Figure 4-8 we show the value prediction speedups over the baseline 620+ machine model. The Simple configuration achieves an average speedup of 6.4% (harmonic mean), the 1PerfCT configuration improves that to 7.8%, 4PerfCT to 9.1%, 8PerfCT to 9.6%, and Perfect all the way to 14.2%. While the trends are similar to the speedups for the base 620 model, the speedups are higher across the board. We attribute this to the fact that the increased machine parallelism and additional hardware resources provided by this model better match the additional instruction-level parallelism exposed by value prediction. Furthermore, the hardware is better able to tolerate the increase in structural hazards caused by value prediction.

Perhaps the most interesting observation about Figure 4-8 (which applies to Figure 4-7 as well) is the lack of any obvious correlation to Figure 4-1, which shows the value locality for each benchmark. This underscores our earlier point that a high hit rate (i.e. high value locality) does not necessarily translate into a proportional reduction in execution cycles. This follows from the fact that benchmarks with high value locality may not necessarily be sensitive to result latency (i.e. they are not data-flow-limited), whereas benchmarks with lower value locality may be very sensitive, and hence may derive significant performance benefits even if only a small fraction of register writes are predictable. For example, eqntott has significantly better value locality than grep, yet grep obtains significantly more speedup from value prediction.


[Figure 4-8 appears here: per-benchmark 620+ speedup bar chart; HM=1.064 Simple, HM=1.078 1PerfCT, HM=1.091 4PerfCT, HM=1.096 8PerfCT, HM=1.142 Perfect.]

Figure 4-8. 620+ Speedups.

4.6.3 Infinite Machine Model Speedups

In Figure 4-9 we show the value prediction speedups over the infinite machine model. The Simple configuration achieves an average speedup of 18.4% (harmonic mean), the 1PerfCT configuration improves that to 27.4%, 4PerfCT to 29.4%, 8PerfCT to 30.4%, and Perfect all the way to 47.7%. These numbers are very encouraging to us, since they demonstrate that the ultimate performance potential of value prediction remains largely untapped by current and even reasonably-extrapolated next-generation processors, and that much work remains to be done to find more effective ways to apply it to realistic microarchitectures. Several benchmarks that displayed measurable speedups with the finite models show negligible speedup with the infinite model (e.g. mpeg, perl, sc, xlisp), which leads us to believe that they are not dataflow-limited by nature.


[Figure 4-9 appears here: per-benchmark infinite-model speedup bar chart (axis clipped at 3.5; two bars reach 6.02 and 6.29); HM=1.184 Simple, HM=1.274 1PerfCT, HM=1.294 4PerfCT, HM=1.304 8PerfCT, HM=1.477 Perfect.]

Figure 4-9. Infinite Machine Model Speedups.

However, the fact that these benchmarks do show speedups with the finite models highlights the fact that value prediction, by removing serialization constraints, allows a processor to more efficiently utilize a limited number of execution resources.

We included the infinite model results to support our assertion that value prediction can be used to exceed the dataflow limit. Our infinite machine model measures a dataflow limit, since, for all practical purposes (ignoring our limit of 4096 active instructions), parallel issue in the infinite model is restricted only by the following three factors:

• Branch prediction accuracy
• Fetch bandwidth (single taken branch per cycle)
• Data-flow dependences

Value prediction directly impacts only the last of these, and yet we are able to demonstrate average and peak speedups of 18.4% and 198% (a 2.98x speedup for gawk) using our Simple VP Unit configuration. Hence, we lay claim to exceeding the dataflow limit.

4.7 VP Unit Implementation

An exhaustive design study of VP Unit parameters and implementation details is beyond the scope of this thesis. As stated earlier, some preliminary exploration of the design space was conducted by analyzing sensitivity to a few key parameters. We realize that the design selected is by no means optimal, minimal, or even reasonably efficient, and could be improved significantly with some effort.


[Figure 4-10 appears here: per-benchmark speedup bar chart comparing a doubled data cache against value prediction (two bars clipped at 1.33 and 1.54); HM=1.013 620 2x, HM=1.043 620 VP, HM=1.014 620+ 2x, HM=1.064 620+ VP.]

Figure 4-10. Doubling Data Cache vs. VP.

For example, we reserve a full 64 bits per value entry in the VPT, while most instructions generate only 32 or fewer bits; with some clever engineering, space in the table could certainly be shared between such entries. However, to evaluate the feasibility of implementing a VP Unit in a real-world processor, we compare it against one alternative approach that consumes roughly the same amount of chip space: doubling the first-level data cache to 64K by increasing the line size from 64 bytes to 128 bytes. The results of this comparison, shown in Figure 4-10, make clear that, at least for this benchmark set, value prediction delivers three to four times more speedup than doubling the data cache for both the 620 and 620+ machine models.

Furthermore, the VP Unit has several characteristics that make it attractive to a CPU designer. First of all, since the VPT and CT lookup indices are available very early, at the beginning of the instruction fetch stage, access to these tables can be superpipelined over two or more stages. Hence, given the necessary chip space, even relatively large tables could be built without impacting cycle time. Second, the design adds little or no complexity to critical delay paths in the microarchitecture; rather, table lookups and verifications are done in parallel with existing activities or are serialized with a separate pipeline stage (value comparison). Hence, it is unlikely that VP would have an adverse effect on processor cycle time, whereas doubling the data cache would quite likely do just that.


4.8 Conclusions and Future Work

We make three major contributions in this chapter. First, we demonstrate that many instructions that write general purpose or floating point registers, when examined on a per-instruction-address basis, exhibit significant amounts of value locality. Second, we describe value prediction, a data-speculative microarchitectural technique for capturing and exploiting value locality to reduce dataflow restrictions on parallel instruction issue. Third, we demonstrate that value prediction can be used to exceed the dataflow limit by 18% (harmonic mean), as measured on a processor model with no structural hazards.

We are very encouraged by our results. We have shown that measurable (5% on average for the 620, 7% on average for the 620+) and in some cases dramatic (up to 33% on the 620 and 54% on the 620+) performance gains are achievable with simple microarchitectural extensions to current-generation and reasonably-extrapolated next-generation microprocessor implementations.

We envision future work proceeding on several different fronts. First of all, we believe that the relatively simple techniques we employed for capturing value locality could be refined and extended to effectively predict a larger share of register values. Those refinements and extensions might include allowing multiple values per static instruction in the prediction table by including branch history bits or other readily available processor state in the lookup index, or moving beyond history-based prediction to computed predictions through techniques like value stride detection (in Chapter 6 we incorporate stride detection into source operand value prediction). Second, our classification mechanism could be refined to correctly classify more instructions and extended to control pollution in the value table (e.g. removing instructions that are not latency-critical from the table). Results presented in Chapter 5 indicate that incorporating branch history into the classification table lookup index can improve accuracy considerably. Third, significant engineering work is needed to optimize our VP Unit design and reduce its implementation cost and potential impact on processor cycle time. Fourth, the microarchitectural design space should be explored more extensively, since value prediction appears to dramatically alter the available program parallelism in ways that may not match current levels of machine parallelism very well. Fifth, feedback-directed compiler support for rescheduling instructions for different latencies based on their value locality may also prove beneficial. Finally, more aggressive approaches to value prediction could be investigated (e.g. speculating down multiple paths in the value space, or predicting writes to condition code and other special purpose registers). In short, there is a great deal of interesting future work related to value prediction and the exploitation of value locality.


CHAPTER 5

Dependence Prediction

The serialization constraints induced by the detection and enforcement of true data dependences have always been regarded as requirements for correct execution. In this chapter, we propose two data-speculative techniques--source operand value prediction and dependence prediction--that can be used to relax these constraints and allow instructions to execute before their data dependences are resolved or even detected. We find that inter-instruction dependences and source operand values are easily predictable. Exploiting this predictability minimizes the IPC penalty of deeper pipelining of instruction dispatch and results in average integer program speedups ranging from 22% to 106%, depending on machine issue width and pipeline depth.

5.1 Motivation and Related Work

As described in Chapter 1, Section 1.1, there are two fundamental restrictions that limit the amount of instruction level parallelism (ILP) that can be extracted from sequential programs: control flow and data flow. Control flow limits ILP by imposing serialization constraints at forks and joins in a program’s control flow graph [1]. Data flow limits ILP by imposing serialization constraints on pairs of instructions that are data dependent (i.e. one needs the result of another to compute its own result, and hence must wait for the other to complete before beginning to execute). In Chapter 3, we explored the notion of value locality--defined as the recurrence of previously-seen values--and demonstrated a technique--Load Value Prediction, or LVP--for predicting the results of load instructions at dispatch by exploiting the affinity between load instruction addresses and the values the loads produce. In Chapter 4, we extended the LVP approach from predicting the results of load instructions to generalized value prediction for all instructions that write an integer or floating-point register, and showed that a significant proportion of such writes are trivially predictable. In the same vein, in this chapter we propose exploiting value locality not only in the dataflow portion of a microarchitecture, but also in the control logic. We find that the dependence relationships between dynamic instructions contain a great deal of value locality, and propose a mechanism--dependence prediction--for capturing and exploiting this value locality to allow early dispatch of instructions in wide-dispatch machines. We find that combining value and dependence prediction leads to significant performance increases, with harmonic mean speedups ranging, depending on machine model, from 22% to 106% for integer programs.

5.2 Detecting Control and Data Dependences

Detecting control and data dependences among multiple instructions in flight is an inherently sequential task that becomes combinatorially expensive as the number of concurrent in-flight instructions increases. Olukotun et al. argue convincingly against wide-dispatch superscalars because of this very fact [30]. Wide (i.e. greater than four) dispatch is difficult to implement and has an adverse impact on cycle time because all instructions in a dispatch group must be simultaneously cross-checked against one another. Even current microprocessor implementations with dispatch windows of four or fewer (e.g. Alpha AXP 21164 and Pentium Pro) require multiple instruction decode and dependence-checking pipeline stages.

One obvious solution to the complexity of dependence detection is to pipeline it into two or more stages to minimize the impact on cycle time. In Section 5.4 we propose a pipelined approach to dependence detection that facilitates the implementation of wide-dispatch microarchitectures. However, pipelined dependence checking aggravates the cost of branch mispredictions by delaying the resolution of mispredicted branches. In Figure 5-1, we see the IPC impact of pipelining dependence checking on a 16-dispatch machine with an advanced branch predictor and no other structural resource limitations (refer to Section 2.3.4 on page 32 and Section 2.1.5 on page 24 for further details on the benchmarks and machine model). We see that lengthening dispatch to two or three pipeline stages (vs. the baseline case of one) severely increases the number of cycles during which no useful instructions are dispatched and increases CPI (decreases IPC) dramatically, to the point where sustaining even 2-3 IPC becomes very difficult.

Figure 5-1. Branch Misprediction Penalty. The approximate contribution of RAS, BTB, and BHT mispredictions to overall CPI is shown for single-cycle dispatch (left bar), 2-cycle (middle bar), and 3-cycle (right bar) pipelined dispatch.

We propose to alleviate these problems in two ways: by introducing a scalable, pipelined, and speculative approach to dependence detection called dependence prediction, and by exploiting a modified approach to value prediction called source operand value prediction [28]. Fundamental to both is the notion that maintaining semantic correctness does not require that we rigorously enforce source-to-sink data-flow relationships, or even that we exactly detect these relationships, before we start executing. Rather, we use dynamically adaptive techniques for predicting values as well as dependences and speculatively issue instructions early, before their dependences are resolved or even known.

Table 5-1. Machine Model Parameters.

Parameter                  Value
Branch Predictor           gshare(16)
Fetch and Dispatch Width   {4,8,16}
Completion Width           Unrestricted
Instruction Window         Unlimited
Instruction Cache          Perfect
I-cache Miss Latency       N/A
Data Cache                 Perfect
D-cache Miss Latency       N/A

5.3 Experimental Framework

To evaluate the performance potential of dependence prediction and source operand value prediction, we implement a flexible emulation-based simulation framework for the PowerPC instruction set. The simulation framework accurately models branch and fetch prediction, dispatch and completion width constraints, instruction window size, latency in the memory hierarchy, and all branch misprediction and data dependence delays for realistic instruction latencies. The machine model and simulation environment are described in detail in Section 2.1.5 on page 24. Key parameters for the machine model are summarized in Table 5-1.

Table 5-2. Benchmark Characteristics.

Benchmark   Run Length   BHT Mispred   BTB Mispred   RAS Mispred
go          79.6M        12.0%         8.5%          0.0%
m88ksim     107.0M       2.7%          4.3%          0.0%
gcc         181.8M       5.1%          8.7%          3.9%
compress    39.7M        5.9%          0.0%          0.0%
li          56.8M        3.0%          5.3%          12.2%
ijpeg       92.1M        2.7%          0.6%          18.7%
perl        50.1M        2.4%          11.1%         4.1%
vortex      153.1M       0.6%          1.6%          11.4%

We selected the SPEC95 integer benchmark suite (described in Section 2.3.4 on page 32) for our study, since it is readily available, widely used, and well-understood. Table 5-2 shows run length (dynamic instruction count) and BHT (branch history table), BTB (branch target buffer), and RAS (return address stack) branch misprediction rates. The BHT misprediction rate is the number of mispredicted conditional branches divided by the total number of conditional branches; the BTB misprediction rate is the number of taken branches with mispredicted targets divided by the total number of taken branches; and the RAS misprediction rate is the number of subroutine returns with mispredicted targets divided by the total number of subroutine returns. Additional simulation results for a variety of machine models as well as floating point benchmarks are omitted here, but are available in a separate technical report [33].
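As a concrete restatement of these three definitions, the sketch below computes the rates from raw event counters. The counter names and the example counts are hypothetical, chosen only for this illustration.

def branch_misprediction_rates(cond_branches, cond_mispredicts,
                               taken_branches, taken_target_mispredicts,
                               returns, return_target_mispredicts):
    """Compute BHT, BTB, and RAS misprediction rates as defined above."""
    bht_rate = cond_mispredicts / cond_branches            # direction mispredictions
    btb_rate = taken_target_mispredicts / taken_branches   # taken-branch target misses
    ras_rate = return_target_mispredicts / returns         # return-address misses
    return bht_rate, btb_rate, ras_rate

# Example with made-up counts:
print(branch_misprediction_rates(1_000_000, 51_000,
                                 600_000, 52_200,
                                 40_000, 1_560))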

5.4 Pipelined Dispatch Structure

In this section, we describe a pipelined dispatch structure that facilitates the implementation of wide-dispatch microarchitectures by reducing the circuit complexity and cycle-time demands imposed by simultaneous cross-checking of data dependences within a large dispatch group. In this scheme, dependence checking is divided into two pipeline stages. During the first stage, all destination registers within a dispatch group are identified and renamed, and the rename mappings are written into the dependence-resolution buffer (DRB) and the mapping file (MF). During the second stage, all source registers are identified and their rename mappings are looked up in the DRB and the shadow mapping file (MF’). In our microarchitecture, all register writes are renamed to slots in a value silo. The value silo is used to scoreboard, hold, and forward the results of instructions until they are ready to complete and write back into the architected register file. Conceptually, the value silo is a monolithic structure, but in an actual hardware implementation it can be partitioned by register type and even register number.

Figure 5-2. Pipelined Dispatch Structure. During stage P1, all instructions in a fetch group write destination register rename mappings into the DRB and the MF. During stage P2, the instructions search the DRB and MF’ for source register rename mappings.

As shown in Figure 5-2, during the first pipeline stage P1 of pipelined dispatch, all instructions in a fetch group allocate value silo slots for their destination operands, and then write the mapping tuples into their dependence-resolution buffer (DRB) entries. At the same time, the value silo slot numbers are written into the mapping file (MF), a table indexed by register number. If a dispatch group contains more than one write to the same architected register, arbitration logic selects the last write before a taken or predicted-taken branch. During the second pipeline stage P2, all instructions in a fetch group search ahead in the DRB for a register number matching each of their input registers (the DRB is multi-ported and content-addressable). If multiple matching entries are found, the closest one (i.e. the most recent definition) is selected. If no matching entry is found, the shadow mapping file (MF’) entry for the register is used instead. MF’ summarizes the register-to-value-silo mappings for all previous fetch groups, and is a one-cycle-delayed copy of the mapping file MF. If no register-to-value-silo mapping exists, the appropriate MF’ entry will instead point to the architected register file. At the end of P2, all the instructions in the fetch group know where in the value silo they can find their input operands, and can check the scoreboarded valid bits to see if they are available.

Whenever a predicted branch occurs within a dispatch group, a snapshot of the mapping file MF that includes all register writes through the branch is pushed onto a branch recovery stack (BRS). Any instruction following a taken or predicted-taken branch within a fetch group is discarded and prevented from writing into either the DRB or the MF. When a branch misprediction is resolved, any instructions that are newer than the branch are discarded along with their value silo slots, and fetching starts over from the actual destination of the mispredicted branch, while the MF snapshot corresponding to that branch is retrieved from the branch recovery stack.

As described here, instruction dispatch is pipelined into two stages. However, it is easy to envision even deeper pipelining of this process. Hence, we simulate the performance effects of, and present results for, one-, two-, and three-stage dispatch pipelines.
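The following Python sketch models the P1/P2 flow described above at a behavioral level. It is a simplification under stated assumptions--registers are plain integers, the value silo is an append-only list, and branch snapshots (BRS) are omitted--and is not the hardware design itself.

# Behavioral sketch of two-stage pipelined dispatch (simplified; no branches).
class TwoStageDispatch:
    def __init__(self):
        self.value_silo = []            # one slot per renamed destination
        self.mf = {}                    # MF: register -> value silo slot
        self.mf_shadow = {}             # MF': one-cycle-delayed copy of MF

    def p1(self, group):
        """Stage P1: allocate silo slots and record destination mappings."""
        drb = []                        # dependence-resolution buffer entries
        for insn in group:
            slot = len(self.value_silo)
            self.value_silo.append(None)        # result not yet produced
            drb.append((insn["dst"], slot))     # mapping tuple for this insn
            self.mf[insn["dst"]] = slot         # last writer in group wins
        return drb

    def p2(self, group, drb):
        """Stage P2: locate each source operand in the silo (or via MF')."""
        sources = []
        for i, insn in enumerate(group):
            locs = []
            for src in insn["srcs"]:
                # Search earlier DRB entries, most recent definition first.
                match = next((slot for dst, slot in reversed(drb[:i])
                              if dst == src), None)
                if match is None:
                    # Fall back to MF' (previous groups), else architected file.
                    match = self.mf_shadow.get(src, ("archRF", src))
                locs.append(match)
            sources.append(locs)
        self.mf_shadow = dict(self.mf)  # MF' trails MF by one cycle
        return sources

# Usage: two instructions, the second consuming the first's destination r3.
d = TwoStageDispatch()
grp = [{"dst": 3, "srcs": [1, 2]}, {"dst": 4, "srcs": [3]}]
print(d.p2(grp, d.p1(grp)))   # r3 resolves to silo slot 0 via the DRB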

5.5 Dependence Prediction and Recovery

Figure 5-1 illustrates the detrimental performance effects of a pipelined dispatch structure. In short, a pipelined dispatch structure increases the number of cycles between a branch misprediction and the detection of that misprediction, hence aggravating the misprediction penalty and severely limiting performance. To alleviate these effects, we propose a mechanism called dependence prediction (DP) that can frequently short-circuit multi-cycle dispatch by predicting the dependence relationships between instructions in flight and speculatively allowing instructions that are predicted to be data-ready to execute in parallel with exact dependence checking.

As shown in Figure 5-3, dependence prediction is implemented with a dependence prediction table (DPT) with 8K entries, which is direct-mapped and indexed by hashing together the instruction address bits, the gshare branch predictor’s branch history register (BHR), and the relative position of the operand (i.e. first, second, or third) being looked up. Each DPT entry contains a numeric value which reflects the relative index of that input operand’s location in the value silo. This relative index is used to check the value silo to see if the operand is already available. If all of the instruction’s predicted input operands are available, the instruction is permitted to dispatch early, after the first dispatch cycle. In the second (or third, in the three-cycle dispatch pipeline) dispatch cycle, exact dependence information becomes available, and the earlier prediction is verified against the actual information. In case of a mismatch, the DPT entry is replaced with the correct relative position, and the early dispatch is cancelled.

Figure 5-3. Dependence Prediction Mechanism. During stage P1, the source operand position, PC, and branch history register (BHR) are hashed together to index the dependence prediction table (DPT), which predicts the value silo entry that contains the source operand. During P2, the prediction is verified.

The total number of operands predicted, the average number of predictions per instruction, and the percentage of correct predictions are shown in Table 5-3. We find that for most benchmarks, the DPT achieves a respectable hit rate. For two benchmarks--go and perl--the dependence prediction hit rates were rather low. This behavior can be attributed to the unpredictable branch behavior of these two benchmarks, since unpredictable branches can lead to unpredictable dependence distances when there are multiple definitions reaching a use. As seen in Figure 5-1 and Table 5-2, both go and perl have high BTB misprediction rates, while go also has a high BHT misprediction rate.

Table 5-3. Dependence Prediction Results.

Benchmark   Operands Predicted   Per Instruction   Correct Predictions
go          89.6M                1.126             38.0%
m88ksim     113.4M               1.060             77.4%
gcc         174.2M               0.958             59.7%
compress    40.6M                1.023             87.3%
li          49.8M                0.878             72.4%
ijpeg       92.3M                1.001             71.9%
perl        47.3M                0.944             48.2%
vortex      120.7M               0.788             71.3%
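To make the DPT indexing scheme described above concrete, here is a minimal sketch of a lookup and update. The exact hash function is an assumption chosen for illustration; the thesis specifies only that the PC, BHR, and operand position are hashed together.

# Minimal dependence prediction table (DPT) sketch; hash choice is assumed.
DPT_ENTRIES = 8192                      # 8K entries, direct-mapped

class DPT:
    def __init__(self):
        self.table = [0] * DPT_ENTRIES  # relative value-silo index per entry

    def _index(self, pc, bhr, operand_pos):
        # One plausible hash: fold PC with branch history and operand slot.
        return (pc ^ (bhr << 2) ^ operand_pos) % DPT_ENTRIES

    def predict(self, pc, bhr, operand_pos):
        """Predict the relative silo index holding this source operand."""
        return self.table[self._index(pc, bhr, operand_pos)]

    def verify(self, pc, bhr, operand_pos, actual_rel_index):
        """On mismatch, the early dispatch is cancelled and the entry trained."""
        i = self._index(pc, bhr, operand_pos)
        correct = (self.table[i] == actual_rel_index)
        if not correct:
            self.table[i] = actual_rel_index
        return correct

dpt = DPT()
print(dpt.predict(pc=0x1000, bhr=0b1011, operand_pos=0))   # initially 0
dpt.verify(pc=0x1000, bhr=0b1011, operand_pos=0, actual_rel_index=5)
print(dpt.predict(pc=0x1000, bhr=0b1011, operand_pos=0))   # now 5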

In Figure 5-5 on page 87 we show the effect of dependence prediction on IPC for dispatch widths of four, eight, and sixteen, and dispatch latencies of one, two, and three cycles. Without dependence prediction, the best performance, obviously, is obtained with single-cycle dispatch, which sustains about 2.8 IPC in the worst case (go) and 4.8 IPC in the best case (m88ksim) with 16-wide dispatch. Lengthening dispatch to two and three cycles degrades go to asymptotic IPC of 1.7 and 1.3, respectively, while reducing ijpeg (which is now the best performer) to 3.7 and 2.5 IPC. Furthermore, wider dispatch width provides rapidly diminishing returns, eroding the incentive for building processors with dispatch widths exceeding four, which is the width that many current-generation microprocessors implement. Fortunately, dependence prediction is able to alleviate these depressing trends by reducing the average dispatch latency. For both two- and three-cycle dispatch, dependence prediction significantly elevates the sustainable IPC and brings it much closer to the single-cycle case. Furthermore, wider dispatch again harvests greater IPC, restoring the incentive for building wider superscalar processors. Three benchmarks--compress, li, and ijpeg--behave particularly well, eliminating nearly all of the performance penalty induced by two- and three-cycle dispatch.

5.6 Source Operand Value Prediction and Recovery

A complementary approach for reducing the adverse performance impact of pipelined dispatch involves a variation on previous work on value prediction (Chapter 4 as well as [28]). In the earlier work, the destination operands (i.e. results) of instructions were predicted via table lookup at fetch/dispatch, and then forwarded directly to dependent instructions. The shortcoming of this approach is that dependence relationships must be detected before values can be forwarded to dependent instructions. To overcome this problem, we propose predicting the values of source operands, rather than destination operands, hence decoupling value-speculative instruction dispatch entirely from dependence detection. As in the earlier work, we predict only floating-point and general-purpose register operands, and not condition registers or special-purpose registers.

Source operand value prediction (VP) is illustrated in Figure 5-4. As in Chapter 4, we use a value prediction table (VPT) to keep track of past operand values, and exploit the value locality [27] of operands to predict future values. In our experiments, the VPT is direct-mapped, 32KB in size, and is indexed by hashing together the instruction address bits and the relative position of the operand (i.e. first, second, or third) being looked up. Source operand value prediction also uses a direct-mapped classification table (CT) similar to the one proposed in Chapter 4 for classifying the predictability of source operands and deciding whether or not the operands should be predicted. In our experiments, the CT has 8K entries with a 2-bit saturating counter at each entry, and is indexed by hashing together the instruction address bits and the relative position of the operand being looked up.

Figure 5-4. Source Operand Value Prediction Mechanism. The source operand position and PC are hashed together to index the VPT and CT. The prediction and value histories are updated at completion.

When all of the input operands of an instruction are classified as predictable, the instruction is permitted to dispatch early, after the first dispatch cycle (instructions with unpredictable source operands may still end up executing sooner than without value prediction, in cases where an operand that is predicted is on a critical path). Once dispatch finishes and exact dependence information becomes available, the instruction waits for its verified operands to become available in the value silo (operands in the value silo become verified when the instructions that generate them have validated all of their input operands) and then compares them against its predicted operands. If they match, the result operands of the instruction are marked verified, and the instruction is allowed to complete in program order. If they do not match, the instruction re-executes with the correct operands. Just as in Chapter 4 (and [28]), this results in a one-cycle misprediction penalty, since the instruction in question, as well as all of its dependents, does not execute with its correct inputs until one cycle later than if there had been no prediction.
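A minimal sketch of the CT-gated source operand lookup follows. The hash function, the VPT entry count, and the training policy shown are illustrative assumptions layered on the description above; multi-stage table access, the value silo comparison, and recovery are omitted.

# Sketch of source operand value prediction gated by a classification table.
VPT_ENTRIES = 4096   # assumed entry count for a 32KB table of 64-bit values
CT_ENTRIES = 8192    # 8K entries of 2-bit saturating counters

vpt = [0] * VPT_ENTRIES          # last value seen per (assumed) index
ct = [0] * CT_ENTRIES            # 0-1: don't predict; 2-3: predict

def _index(pc, operand_pos, size):
    # Assumed hash: instruction address folded with operand position.
    return (pc ^ operand_pos) % size

def predict_source(pc, operand_pos):
    """Return a predicted operand value, or None if classified unpredictable."""
    if ct[_index(pc, operand_pos, CT_ENTRIES)] >= 2:
        return vpt[_index(pc, operand_pos, VPT_ENTRIES)]
    return None

def train(pc, operand_pos, actual_value):
    """At completion: update the value history and the saturating counter."""
    vi = _index(pc, operand_pos, VPT_ENTRIES)
    ci = _index(pc, operand_pos, CT_ENTRIES)
    if vpt[vi] == actual_value:
        ct[ci] = min(ct[ci] + 1, 3)   # correct: strengthen confidence
    else:
        ct[ci] = max(ct[ci] - 1, 0)   # incorrect: weaken and retrain value
        vpt[vi] = actual_value

for v in (7, 7, 7):                   # a value-local operand
    train(pc=0x2000, operand_pos=1, actual_value=v)
print(predict_source(pc=0x2000, operand_pos=1))   # 7, once classified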


Table 5-4. Source Operand Value Prediction Results.

Benchmark   Value Locality   CT Predictable   CT Unpredictable   Dependence Prediction
                             Hit Rate         Hit Rate           Hit Rate
go          45.3%            77.0%            83.7%              42.1%
m88ksim     56.1%            92.8%            89.6%              89.6%
gcc         40.9%            78.0%            89.6%              63.0%
compress    42.4%            97.5%            98.8%              94.6%
li          33.7%            76.9%            92.9%              75.4%
ijpeg       35.2%            91.6%            95.9%              81.2%
perl        44.5%            76.4%            84.3%              54.4%
vortex      32.9%            83.3%            93.9%              82.5%

Table 5-4 summarizes the value locality, classification hit rates, and dependence prediction hit rates for each of our benchmarks. Value locality (column two), as defined in [27], is the ratio of the dynamic count of source operands that are predictable with the VPT mechanism to the dynamic count of all source operands. The predictable hit rate (column three) is the ratio of the number of predictable source operands that were identified as such by the CT to the total number of predictable source operands. Similarly, the unpredictable hit rate (column four) is the ratio of the number of unpredictable source operands that were identified as such by the CT to the total number of unpredictable source operands. The dependence prediction hit rate (column five) is included to show the interaction between value prediction and dependence prediction: when both types of prediction are used, operands that are deemed unpredictable by the CT are relegated to dependence prediction. We see that the dependence prediction hit rates are better across the board than the ones shown in Table 5-3, indicating that the techniques are mutually synergistic. We also note that the value locality numbers are similar to those reported earlier [28], while the CT hit rates are somewhat better. The former is not surprising, since source operands should be no more or less predictable than destination operands; the latter is likely due to the CT in these experiments being significantly larger (8K vs. 1K entries).

In Figure 5-5, we show the effect of dependence prediction and value prediction on IPC for dispatch widths of four, eight, and sixteen, and dispatch latencies of one, two, and three cycles. The best performance, obviously, is obtained with value prediction and single-cycle dispatch, which sustains 3.6 IPC in the worst case (go) and 5.9 IPC in the best case (vortex) with 16-wide dispatch. Lengthening dispatch latency to two and three cycles degrades go to 3.0 and 2.8 IPC, respectively, while reducing vortex to 5.8 and 5.6 IPC. Value and dependence prediction improve performance significantly over the baseline in all cases, and wider dispatch harvests even greater additional IPC, restoring the incentive for building wide-dispatch processors. We see that with dependence and value prediction, virtually all of the performance penalty associated with pipelined dispatch has been eliminated, allowing even three-cycle dispatch to nearly match the performance of single-cycle dispatch. Even the worst-case benchmark (go) degrades by only 17% from single-cycle to two-cycle dispatch, while the best case (vortex) degrades by only 2% for two-cycle dispatch and 4% for three-cycle dispatch. Furthermore, three-cycle dispatch with value and dependence prediction can usually at least match, and frequently clearly outperform (compress, vortex, m88ksim, li), single-cycle dispatch without value or dependence prediction.

Figure 5-5. Effect of Dependence and Value Prediction. The sustained IPC for dispatch widths of 4, 8, and 16 is shown for single-cycle dispatch (left bar), two-cycle dispatch (middle bar), and three-cycle dispatch (right bar). Each stacked bar shows the cumulative IPC attainable with dependence prediction (+DP) and value prediction (+DP+VP) over the baseline.

Figure 5-6. Reduced Branch Misprediction Penalty. The approximate contribution of RAS, BTB, and BHT mispredictions to overall CPI is shown for single-cycle dispatch (left bar), 2-cycle (middle bar), and 3-cycle (right bar) pipelined dispatch for a 16-wide model employing dependence prediction and source operand value prediction.

Comparing Figure 5-6 to Figure 5-1 highlights the benefits of dependence and value prediction by showing the significant reductions in branch misprediction penalty that can be obtained with these techniques.

5.7 Conclusions and Future Work

We make three major contributions in this chapter. First of all, we propose a pipelined dispatch structure that eases the implementation of wide-dispatch microarchitectures. Second, we propose dependence prediction, a speculative technique for alleviating the performance penalty of pipelined dispatch. Third, we propose source operand value prediction, a modified approach to value prediction that decouples instruction execution from dependence checking by predicting source operands rather than destination operands. We show that these techniques can speculate beyond data-flow and dependence detection bottlenecks to deliver unprecedented levels of uniprocessor performance, particularly for machines with wide and deeply pipelined instruction dispatch.

This chapter takes only an initial stab at evaluating dependence prediction. Clearly, more effective and accurate ways of storing and retrieving the dependence relationships between instructions could be devised. In particular, aggressive instruction fetch mechanisms like the trace cache [54] may lend themselves quite naturally to storing inter-instruction dependence information, since the instruction sequence within traces stored in such a cache remains static. Future exploration of such mechanisms seems a quite promising avenue of research.


CHAPTER 6

The Superflow Paradigm

The imminent availability of an effectively unlimited number of on-chip transistors is ushering in a new era of microarchitectural innovation. Future processor implementation trade-offs will continue to be governed by cycle-time demands--now dominated by interconnect latency rather than gate delay--and by the maximization of instruction throughput, but with decreasing concern for transistor count or chip area consumed. This chapter introduces the Superflow microarchitecture paradigm, which employs a broad spectrum of speculative microarchitectural techniques designed to increase instruction throughput while alleviating the detrimental effects of increasing interconnect latencies and deeper instruction execution pipelines. These techniques are based on the weak dependence model, which relaxes the serialization constraints induced by the detection and enforcement of true data dependences by allowing them to be temporarily and speculatively violated to expose additional instruction-level parallelism. Quantitative evaluation of the Superflow paradigm indicates potential IPC (sustained instructions per cycle) of 9.0 and 15.2 and realizable IPC of 6.7 and 7.2 for the SPEC95 integer and floating point programs, respectively, without recompilation or changes to the instruction set architecture.

6.1 Background

In its brief lifetime of 25 years, the microprocessor has achieved a total performance growth of about 9,000X due to both technology improvements and microarchitecture innovations. As can be seen in Table 6-1, both transistor count and clock frequency increased by an order of magnitude in each of the decades of the 1970’s and 1980’s. During the 1980’s, IPC also increased by almost an order of magnitude. So far in the 1990’s, both transistor count and clock frequency have already achieved another order of magnitude increase. However, IPC improvement is struggling and may not even reach a 3X increase by the end of this decade. While the contribution to performance by technology improvements seems to be accelerating, the contribution from microarchi-

Table 6-1. Evolution of Microprocessors.

                     1970-1979   1980-1989   1990-1999   2000+
Transistor Count     10K-100K    100K-1M     1M-30M      100M
Clock Frequency      0.2-2MHz    2-20MHz     20-500MHz   1GHz
Instructions/cycle