Processor Modeling and Evaluation Techniques for Early Design Stage Performance Comparison

by

John-David Wellman

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Computer Science and Engineering) in the University of Michigan 1996

Doctoral Committee:
Professor Edward S. Davidson, Co-Chairman
Pradip Bose, IBM T. J. Watson Research Lab, Co-Chairman
Professor Trevor N. Mudge
Professor Yale N. Patt
Associate Professor Stéphane Lafortune

The challenge is to design computers that make the best use of available technology; in doing so we may be assured that every increase in processing speed can be used to advantage in current problems or will make previously impractical problems tractable. -- Leonard Shustek, Analysis and Performance of Computer Instruction Sets (1978)

Remember that time is money. -- Ben Franklin, Advice to a Young Tradesman

© 1996 John-David Wellman
All Rights Reserved

This work is dedicated to those who have given so much of themselves to me... my family.


ACKNOWLEDGEMENTS

It is difficult, if not impossible, to understand all of the complex forces that have acted upon me to help determine and focus my research. I owe a great debt to many people both within the University of Michigan and outside it. This dissertation simply would not exist but for the support and encouragement of these people, many of whom can never be adequately thanked for their help.

I owe the greatest debt to my advisor, Edward Davidson, who, as a friend and a mentor, has provided great advice and encouragement through both the exciting and difficult times of my research. His patience, wisdom and humor have meant a lot through the years, and are greatly appreciated, but perhaps most appreciated is the faith he showed in me early on, as this research was just being born. I owe a similar debt to Pradip Bose of the IBM T. J. Watson Research Lab, with whom I spent two summers as an intern. It was Pradip who introduced me to the field of early design-stage processor performance analysis, and who rekindled my interest in the architecture and performance of processors. I also owe the IBM corporation as a whole, for providing financial support in the form of Graduate Fellowships for two years.

I am also grateful to each of the other members of my dissertation committee: Trevor Mudge, who is responsible in no small part for my entering graduate school, Yale Patt, who always challenges one to think carefully about matters, and Stéphane Lafortune. My appreciation must also be extended to the many other people whose ideas, opinions, critiques and support of me have proven to be quite valuable. Thanks go out first to the CAD timing group, who introduced me to the nuts and bolts of academic research. I also owe a debt to the many members of my research group, both current and past.


Particularly relevant were William Mangione-Smith, whose bounds analysis techniques fascinated me early on, and Eric Boyd, who extended the bounds analysis to an even more interesting level. Thanks also to Waqar Azeem, Perry Wang and Hsien-Hsin Lee who helped define the limits of bounds analysis in my mind, and thus allowed me to move on into what became my dissertation research. I must also thank Tien-Pao Shih, whose work inspired me, and whose insights and ideas have always driven me forward. For my sanity, I owe two friends in particular: Adam Greenspan and my officemate Jude Rivers. The staff at the University and in ACAL must also be commended for providing an environment that is so conducive to research. Finally, my family deserves my most appreciative thanks for supporting my decisions, lending their moral support and encouraging my graduate studies. I must further thank my in-laws, who blessed the union between myself and their wonderful daughter. And to my wife Suzanne, no mere words can ever suffice to express the debt of gratitude I feel for all she is adding to my life; I can only thank God in my heart that we are together, and strive daily to share the love I feel.


TABLE OF CONTENTS DEDICATION.................................................................................................................................ii ACKNOWLEDGEMENTS...........................................................................................................iii TABLE OF CONTENTS ...............................................................................................................v LIST OF TABLES .......................................................................................................................viii LIST OF FIGURES........................................................................................................................x CHAPTER I INTRODUCTION.................................................................................................................1 1.1 The Focus and Goals of Our Research...............................................................................2 1.2 Computer Performance Evaluation Methods .....................................................................3 1.2.1 Performance Measurement.......................................................................................4 1.2.2 Analytic and Trace-Driven Simulation Models........................................................5 1.2.2.1 Analytic Models..............................................................................................5 1.2.2.2 Trace-Driven Simulation ................................................................................7 1.3 Related Work in Early Stage and Design Space Analysis .................................................7 1.4 Investigating Changes to the ISA and Processor Models ................................................10 1.4.1 Comparing ISA Features ........................................................................................12 1.4.2 VMW: Providing ISA and Processor Models ........................................................13 1.5 Contributions of this Dissertation ....................................................................................14 1.5.1 Resource Conflict Methodology.............................................................................15 1.5.2 Reduced Trace Analysis .........................................................................................16 1.5.3 Hierarchical Design Space Evaluation ...................................................................18 CHAPTER II THE RESOURCE CONFLICT METHODOLOGY.......................................................20 2.1 Introduction......................................................................................................................20 2.2 The Resource Conflict Methodology ...............................................................................22 2.2.1 Availability Times...................................................................................................23 2.2.2 Types of Resources in an RCM Model...................................................................24 2.2.3 Instruction Resource Use Information....................................................................26 2.2.4 RCM Model Simulation .........................................................................................29 2.2.5 Capturing the Pipeline Hazards ..............................................................................30 2.3 REAP: An RCM Modeling and Simulation Tool ............................................................33 2.3.1 The Superscalar Processor Family Modeled by REAP 
..........................................34 2.3.2 The REAP Instruction Set Architecture: BRISC....................................................38


2.3.3 Implementing the REAP Processor Simulator .......................................................40 2.3.3.1 The Instruction Templates ............................................................................40 2.3.3.2 The Trace Description...................................................................................42 2.3.3.3 The Processor Description File.....................................................................44 2.3.3.4 Overview of the REAP Simulation Routines................................................52 2.3.4 Comparing REAP to Cycle-by-Cycle Timing Simulators .....................................58 2.4 Extending the Basic REAP Processor Models.................................................................62 2.4.1 Controlling the Processor Execution Model...........................................................62 2.4.2 Adding Branch Prediction Models .........................................................................67 2.4.3 Adding a Reorder Buffer Model.............................................................................74 2.4.4 Adding a Pending Store Queue Model ...................................................................77 2.4.5 Adding Finite Cache Effect Models .......................................................................79 2.4.6 Some Limitations of RCM Processor Models........................................................83 2.4.6.1 REAP Single Availability Time Limitation..................................................84 2.4.6.2 RCM Execution Ordering Limitation ...........................................................86 2.5 Adding Detail to the REAP Execution Information ........................................................89 2.6 RCM Modeling: Conclusions ........................................................................................107 CHAPTER III REDUCED TRACE ANALYSIS .....................................................................................110 3.1 Introduction....................................................................................................................110 3.1.1 Analytic Models: Using a Simplified Processor Model .......................................110 3.1.2 Reduced Trace Analysis: Reducing the Analysis Redundancy ............................113 3.2 Overview of Reduced Trace Analysis............................................................................114 3.2.1 The Reduced Trace Description ...........................................................................116 3.2.2 Forming the Initial Reduced Trace Description ...................................................118 3.2.3 Reduced Trace Description Analysis....................................................................119 3.2.4 Information Content of a Reduced Trace Description..........................................126 3.3 Implementing Reduced Trace Analysis .........................................................................127 3.3.1 TRED: A Trace Reduction Tool ...........................................................................128 3.3.2 RETANE: A Reduced Trace Description Simulator ............................................129 3.3.3 Experiments Using RETANE ...............................................................................132 3.3.4 Identifying the Source of Simulation Inaccuracy .................................................134 3.4 Reduced Trace Description Optimization......................................................................138 3.4.1 TROPT: 
Operation and Graph Transformations...................................................140 3.4.1.1 The Transformations...................................................................................140 3.4.1.2 Applying the Transformations ....................................................................149 3.4.2 Example Optimization of a Reduced Trace Description ......................................155 3.4.3 The Path Substitution Problem .............................................................................162 3.4.3.1 Example Path Substitutions ........................................................................163 3.4.3.2 Potential for False Path Substitutions .........................................................164 3.4.3.3 Controlling Substitution: TROPT’s Application of the Transformations...167 3.4.4 Optimized Trace Descriptions: Experimental Results..........................................168 3.4.5 Controlling the Optimization Growth in Instruction Evaluations ........................174


3.4.6 Alternatives to Reduced Trace Description Optimization....187 3.4.6.1 Using Higher-Order Connections Analysis....188 3.4.6.2 Generating Larger Code Blocks in the Initial Reduced Trace Description....193 3.5 Extending the Basic RETANE Processor Models....203 3.5.1 Local Execution-Order-Dependent Hardware Models....204 3.5.2 Global Execution-Order-Dependent Hardware Models....210 3.5.2.1 Modeling Finite Caches in RETANE....214 3.5.2.2 Comparing the Descriptions With and Without Cache Effects....217 3.6 Adding Detail to the RETANE Execution Information....221 3.6.1 The Population Statistics Reports....225 3.6.2 The Short-Memory Statistics Reports....227 3.6.3 The Medium-Memory Statistics Reports....232 3.6.4 The Long-Memory Statistics Reports....234 3.6.5 The Multiple-Memory-Level Statistics Reports....237 3.6.6 RTA Statistics Gathering: Conclusions....241 3.7 Reduced Trace Analysis: Conclusions....243 CHAPTER IV EXAMPLE DESIGN SPACE EXPLORATION....246

4.1 Introduction....246 4.2 A Hierarchical Search Methodology....249 4.3 The Bounds Analysis: Background and Method....251 4.4 An Example Design Optimization....254 4.4.1 Applying the Physical Design Constraints....256 4.4.2 Applying Bounds Analysis....258 4.4.3 Applying Reduced Trace Analysis....265 4.4.4 Applying Full Trace Analysis (RCM)....271 4.4.5 Selecting the Processor for Detailed Design....278 4.5 Conclusions and Implications....279 CHAPTER V CONCLUSIONS AND FUTURE WORK....282 5.1 Contributions of this Work....282 5.2 Future Research Directions....287 5.2.1 The Resource Conflict Methodology (RCM)....288 5.2.2 Reduced Trace Analysis (RTA)....290 5.2.3 Bounds Analysis....292 APPENDIX....294 BIBLIOGRAPHY....307


LIST OF TABLES

Table 1. The instruction execution class identifiers for the BRISC ISA .... 42
Table 2. Initial processor configurations for REAP vs. timer comparisons .... 59
Table 3. Processor models used in the REAP vs. timer comparisons .... 60
Table 4. Processor function unit configurations for RETANE vs. REAP comparisons .... 132
Table 5. Composition of the optimized blocks of figure 79(b) .... 161
Table 6. Execution times for full execution traces and reduced trace simulation (for the xp processor executing the LFK workloads) .... 170
Table 7. The femc simulation run-times and speedups for the xp processor .... 173
Table 8. Instructions evaluated in analyzing the block pairs of lfk3 .... 190
Table 9. Instructions evaluated in analyzing the block triplets of lfk3 .... 191
Table 10. Fifty-instruction predecessor paths for block F of lfk3 (figure 94) .... 192
Table 11. Comparison of the number of instruction evaluations required to analyze the various SPEC test case reduced trace descriptions of figure 99 and figure 100 .... 201
Table 12. Description of the workload execution reports of section 2.5 .... 222
Table 13. Relative design area required by various processor units (example estimates) .... 257
Table 14. Function unit configurations considered in the example design space search .... 259
Table 15. Parameters of the processor configurations and their modeled ranges .... 260
Table 16. Range of parameter values that yield processor designs with the best potential performance (i.e. lowest total cycles bound) .... 261
Table 17. Percent improvement of the bound relative to the 6k processor's performance bound .... 263
Table 18. Estimated design area for processors with good performance potential .... 265
Table 19. The additional parameters investigated during RTA evaluation phase .... 266
Table 20. Range of processor parameter values for RTA evaluation phase .... 266
Table 21. Processor parameters for best performance for each design .... 267
Table 22. Percent change in execution cycles for a variation in single processor parameter values (versus the best performing processor parameters of table 21) .... 270
Table 23. Comparing femc test case execution information for the 6k_xmp processor .... 272
Table 24. Comparing SPEC test case execution information for the 6k_xmp processor .... 274
Table 25. Comparing execution information with an infinite number of register ports .... 275
Table 26. Execution information with different function unit latencies .... 276
Table 27. Instruction template operands and their interpretation .... 295


Table 28. The full set of BRISC instruction templates .... 296
Table 29. BRISC instructions in instruction classes 1, 6, 7 and 8 .... 299
Table 30. BRISC instructions in instruction classes 2 and 9 .... 301
Table 31. BRISC instructions in instruction class 3 .... 303
Table 32. BRISC instructions in instruction class 4 .... 304
Table 33. BRISC instructions in instruction class 5 .... 304

LIST OF FIGURES

Figure 1. Example instruction template for a 3-register floating-point ADD instruction .... 27
Figure 2. General execution algorithm for the RCM model .... 30
Figure 3. Basic superscalar processor design: the 6k processor .... 34
Figure 4. The REAP register file model .... 36
Figure 5. Excerpt from the BRISC ISA file .... 41
Figure 6. An excerpt from a REAP input trace file .... 43
Figure 7. General format of the processor description file .... 44
Figure 8. Register files description for a 6k processor .... 45
Figure 9. The ports and buses section of a 6k processor description file .... 46
Figure 10. REAP function-unit model .... 47
Figure 11. The function unit description (for a 6k processor) .... 50
Figure 12. A comparison of shared and not-shared writeback executions .... 52
Figure 13. The core REAP simulation routine .... 53
Figure 14. Overview of the get_instruction_start_time routine of REAP simulation .... 54
Figure 15. Overview of the set_new_avail_times routine of REAP .... 57
Figure 16. Command line options which control the processor behavior .... 63
Figure 17. Speedup obtained with different levels of renaming (relative to the processor with no renaming hardware) .... 64
Figure 18. Speedup when address computation is forwarded from the memory access instructions with register update .... 66
Figure 19. The speedup obtained by using pipelined memory port access relative to the use of nonpipelined memory access .... 67
Figure 20. The command line options to define the branch prediction scheme .... 68
Figure 21. Speedup obtained when perfect branch prediction is used .... 69
Figure 22. Speedup obtained from applying static branch prediction to the LFK test cases .... 70
Figure 23. Speedups obtained using different techniques to handle resolvable branches .... 71
Figure 24. Speedup obtained from the addition of branch folding to the LFK test cases .... 73
Figure 25. The command line option to specify a reorder (completion) buffer to REAP .... 74
Figure 26. Relative performance for different size reorder buffers .... 75
Figure 27. Relative performance for different numbers of maximum completions per cycle .... 76
Figure 28. The command line options to describe a store queue to REAP .... 77


Figure 29. Speedups obtained with the addition of different sized pending store queues .... 78
Figure 30. The command line options for specifying a finite cache model to REAP .... 79
Figure 31. Some important DineroIII command-line switches .... 81
Figure 32. Relative execution performance of three finite cache models versus the performance for the processor with a "perfect" (infinite, all-hit) cache model .... 83
Figure 33. Sequentializing effect of the REAP finite register port model .... 84
Figure 34. Execution with in-order function units .... 87
Figure 35. Execution with out-of-order function units .... 88
Figure 36. Workload execution characterization statistics .... 92
Figure 37. Issue width execution statistics (a p2 processor) .... 93
Figure 38. Statistics describing the latest available resource for each instruction executed in lfk21 on the buffered p2 processor configuration .... 94
Figure 39. Statistics describing the breakdown of the Issue Unit latest available resource category for different causes of issue unit unavailability .... 96
Figure 40. Statistics describing the breakdown of the Multiple latest available resource category for cases where only two resources were involved .... 97
Figure 41. The function unit utilization statistics report .... 98
Figure 42. The function unit decode stage utilization statistics .... 99
Figure 43. Execution pipe utilization statistics .... 100
Figure 44. Function unit input buffer statistics .... 101
Figure 45. Register utilization information (no renaming used) .... 102
Figure 46. Register file read and write port utilization statistics .... 103
Figure 47. Register read and write port usage histograms .... 104
Figure 48. Register value (lifetime) information .... 105
Figure 49. Diagram of reduced trace analysis methodology .... 115
Figure 50. Example reduced trace description .... 117
Figure 51. Algorithm to generate initial reduced trace description from an execution trace .... 118
Figure 52. Example basic blocks (block B1 precedes block B2) .... 122
Figure 53. Trace-driven simulation results for block B1 of figure 52 .... 123
Figure 54. Trace-driven simulation results for block B2 of figure 52 .... 123
Figure 55. Trace-driven simulation results for the block pair B1 B2 .... 124
Figure 56. The loss of global sequencing information in a reduced trace description .... 126
Figure 57. Basic algorithm for reduced trace analysis .... 130
Figure 58. Percent error in RETANE estimates versus full trace REAP estimates for four unbuffered processor models .... 133
Figure 59. Percent error in RETANE estimates versus full trace REAP estimates for four buffered processor models .... 134
Figure 60. Change in estimation error as a function of the input buffer size (lfk15) .... 136
Figure 61. Change in estimation error as a function of maximum issue width (lfk15) .... 137


Figure 62. The path collapsing transformation (single and multiple application) .... 141
Figure 63. The self-loop extraction transformation .... 142
Figure 64. Self-loop extraction exposing a path for path collapsing .... 142
Figure 65. The one-off self-loop extraction transformation (simple and complex graphs where p > k) .... 143
Figure 66. The block splitting transformation (multiple exits or entrances) .... 145
Figure 67. The path splitting transformation .... 146
Figure 68. The path peeling transformation (where m > p) .... 148
Figure 69. Example of reduced optimization work required when blocks are split before path collapsing is applied .... 150
Figure 70. Example of reduced optimization work required when path collapsing is applied before block splitting .... 150
Figure 71. The main TROPT optimization algorithm that controls the application of the graph transformations to a reduced trace description .... 152
Figure 72. The TROPT optimize_self_loops routine, which controls the application of the self-loop extraction and one-off self-loop extraction optimization transformations .... 153
Figure 73. The TROPT do_splitting_transformations routine, which controls the applications of the block splitting and path splitting optimization transformations .... 154
Figure 74. Original reduced trace description .... 156
Figure 75. Example optimization: self-loop and one-off self-loop extractions .... 156
Figure 76. Example optimization: path splitting of blocks B and F .... 157
Figure 77. Example graph optimization: simple splitting transformations .... 158
Figure 78. Example graph optimization: simple splitting transformations .... 159
Figure 79. Example graph optimization: final stages to completion .... 160
Figure 80. Determining the minimum unrolling factor for a self-loop body .... 162
Figure 81. Example graph optimization: initial and final graphs .... 164
Figure 82. Example workload loop nest and its reduced trace description .... 165
Figure 83. Percent error in optimized RETANE estimates versus full trace estimates .... 169
Figure 84. Percent error in the RETANE estimates for the femc trace descriptions .... 172
Figure 85. Increase in the instruction evaluations required to analyze the fully optimized LFK reduced trace descriptions with 50 instruction self-loop bodies .... 175
Figure 86. Increase in the instruction evaluations required to analyze the fully optimized LFK reduced trace descriptions with and without self-loop unrolling .... 176
Figure 87. Increase in the instruction evaluations to analyze the fully optimized femc reduced trace descriptions with and without self-loop unrolling .... 177
Figure 88. Increase in instruction evaluations to analyze the SPEC reduced trace descriptions when full optimization is applied .... 178


Figure 89. Increase in the instruction evaluations to analyze the fully optimized femc reduced trace descriptions with and without self-loop unrolling .... 180
Figure 90. Graph example showing inability of TROPT to enlarge block B3 .... 182
Figure 91. Final block upsizing transformation (i.e. forced path peeling) .... 183
Figure 92. Increase in the number of instruction evaluations required to analyze the optimized SPEC trace descriptions .... 185
Figure 93. Percent error in SPEC RETANE estimates at various optimization levels (p2b processor) .... 186
Figure 94. Reduced trace description of the lfk3 test case .... 189
Figure 95. Algorithm to generate initial reduced trace description from an execution trace .... 195
Figure 96. Increase in dynamic instruction evaluations required for a larger initial reduced trace description versus an optimized basic blocks trace description for 25-instruction blocks .... 196
Figure 97. Increase in dynamic instruction evaluations required for a larger initial reduced trace description versus an optimized basic blocks trace description for 50-instruction blocks .... 196
Figure 98. Relative error in the TROPT-generated optimized reduced trace descriptions versus that in a TRED-generated larger code blocks trace description (with 50 instruction blocks) .... 198
Figure 99. SPEC test case increase in instruction evaluations required for a larger initial reduced trace description versus an optimized basic blocks trace description for 25-instruction blocks .... 199
Figure 100. SPEC test case increase in instruction evaluations required for a larger initial reduced trace description versus an optimized basic blocks trace description for 50-instruction blocks .... 200
Figure 101. Percent error for RETANE with and without a reorder buffer model .... 207
Figure 102. The daxpy test case (from the linpack benchmark) .... 208
Figure 103. Real and predicted speedups from increasing reorder buffer size .... 209
Figure 104. Example code to illustrate increased finite cache simulation effort .... 213
Figure 105. Percent error of RETANE estimates for three finite cache models when using reduced trace descriptions optimized to a minimum code block size of 50 instructions (on a p2 processor) .... 217
Figure 106. Number of instruction evaluations required to analyze the optimized LFK reduced trace descriptions with the finite direct-mapped cache model .... 218
Figure 107. Number of instruction evaluations required to analyze the optimized LFK reduced trace descriptions with the finite 2-way associative cache model .... 219
Figure 108. Workload description reports (for a p2b processor) .... 226
Figure 109. Statistics describing which resources determine instruction start time .... 227
Figure 110. Function unit utilization statistics reports .... 228


Figure 111. Execution pipeline utilization statistics reports .... 229
Figure 112. Register file read and write port utilization reports .... 230
Figure 113. Register read and write port usage histograms .... 231
Figure 114. Issue width statistics reports (for a p2b processor) .... 232
Figure 115. Example program segment .... 236
Figure 116. Register utilization information (no renaming used in this example) .... 237
Figure 117. Function unit decode utilization statistics reports .... 239
Figure 118. Function unit decode utilization statistics reports modified to consider function unit types (rather than individual function units) .... 239
Figure 119. Example program segment .... 240
Figure 120. Stages in the proposed early-stage design exploration hierarchy .... 250


CHAPTER I INTRODUCTION

In the design and development of general-purpose workstations, companies have always sought ways to achieve higher computer performance with reasonable cost, and in recent years quantitative evaluations have had a growing influence on the design and development of workstations. When new features are proposed for a machine design, they must be evaluated to determine the performance benefit they can provide before they are realistically considered for inclusion in the machine being developed.

This quantitative analysis is generally used in the context of an iterative design process. First a set of goals is defined, and a target workload is identified. An initial machine organization is then selected that should be capable of meeting the design goals, and development of that organization begins. As development progresses, changes to this initial organization are proposed and evaluated, and if these changes are quantitatively shown to improve the design, they are incorporated. This results in a new organization which is then used as the new basis for further development. This process continues until the organization is frozen and the final machine hardware is then fully developed and produced.

Historically, this approach worked very well when the machine designs were relatively simple sequential von Neumann machines, but today's designs, which incorporate a number of concurrent execution modes, are much more complex. Where simple analytic (or back-of-the-envelope) calculations might have been sufficient historically, much more precise models of the system's execution are required to accurately analyze the performance (or relative performance) of current computer systems.



1.1 The Focus and Goals of Our Research

The focus of this dissertation is the development of processor performance analysis techniques that can be applied very early in the design cycle. Thus we do not focus on the analysis of processor designs throughout the full design cycle or in the latter stages of design when the hardware is fully designed in detail. In the early stages of the design cycle, designers work to develop an initial processor specification, and thus they need to consider new processor designs and possibly even new instruction set architectures. In these early design stages, we assume that the designers will have a set of target applications which fairly represent the workload that the finished processor is intended to execute. The designer will be interested in exploring a space of machine designs, and determining the trade-offs and relative performance among design points within that space. Any methodology provided for a designer's use must be able to consider a large design space, to provide a large number of performance estimates in a reasonable amount of time, and to give an accurate representation of the performance trade-offs for the different design alternatives.

To illustrate the importance of performance modeling for comparing design alternatives, one can consider published research projects that use quantitative analyses for proposed changes in a processor. The literature is full of such studies, from the determination of the optimal granularity of pipelining [44] to the comparison of the utility of multiple register sets (i.e. register windows). [20] [31] Other studies have used simulation to determine the function unit mix for multiple function unit processors [40] [63] and even to explore alternative processor execution models. [72] From these and many similar studies, some trends become apparent. Because of the difficulty in developing and modifying these detailed performance models, these studies generally use a detailed model of a single processor design, to which a new design feature is added. The performance benefit achieved on the modeled processor with the added feature is then measured, and this value is presented as an example of the performance benefit available by adding this design feature to a processor.

The performance of a current processor, however, is the result of complex interaction between the execution hardware provided and the workload's utilization of that hardware. It is impossible to know a priori how much performance improvement a proposed change in a processor design will provide, even when a similar change was examined for another (different) processor. The performance bottlenecks for the new processor might differ from those of the previously examined processor, resulting in a different performance benefit being obtained.

Since the performance of a processor depends upon both the hardware resources and the way that the workload uses those resources, it is important that the machine evaluations used during the design process are driven by realistic representations of the workloads. With the growing power of current processors, the workloads that run on these machines have also grown, resulting in workloads that require billions of instruction executions to complete (see, for example, the instruction counts of the SPEC benchmarks [25]). In forming an accurate performance estimate, the performance analysis must consider the information presented by these billions of instructions. Because the overall design process is iterative and many changes in a system design might be considered, a large number of different processor evaluations may be required. The evaluations must be as fast as possible, so that the overall analysis time remains manageable, and the processor development time is not adversely impacted. Still, the estimates must also provide real insights into the performance differences among the design alternatives being compared, so these methods must provide accurate relative performance estimates to indicate the performance trade-offs in the designs.

1.2 Computer Performance Evaluation Methods

In their survey of computer performance evaluation methods, Heidelberger and Lavenberg divide the evaluation techniques into three categories: performance measurement, analytic models, and simulation models. [29] Each of these methods has particular strengths and weaknesses. Each will be examined briefly in the following sections, and we will show that none of these methods is completely sufficient for early design stage processor comparison.


1.2.1 Performance Measurement

Performance measurement is only possible when there is hardware available for measurement, i.e. when the computer has been realized in some hardware form. Thus it is impossible to measure the performance of a processor or system during the early stages of that system's development. The need for a physical implementation of the computer design is a serious limitation; designers cannot wait until after a computer has been designed before considering different design alternatives, and yet the time and cost required to develop a large number of prototype circuits is also prohibitive.

Still, performance measurement data can be useful when developing a new system that is based on an existing system, since measurements from the existing system can be used to determine where performance is being lost and to indicate ways in which the system might be improved. Several studies have been done, for example, using measurement to characterize both the instruction set usage by workloads (e.g. [3], [17], [45], [76] and [79]) and the execution breakdown of workload programs (e.g. [14], [21] and [65]). One example of the far-reaching impact such studies can have is the genesis of the reduced instruction set computing (RISC) concept. By looking at the complex instruction sets provided for many processors and the subset of those instructions that compilers used in the code, it was noted that a large number of the instructions were not frequently used in compiled code. Similarly, from the analysis of the performance measurement (and circuit design) data, it was noted that the variable-length instructions with complex fields required a large, slow, and difficult decode process. It was theorized that the use of a smaller set of simpler instructions, permitting a fixed instruction length and a simplified decode, would necessarily provide a performance benefit. [62]

While the measurement of existing processors can be useful, as development continues on a new processor design it will increasingly diverge from the measured system, and this measurement data will be less and less directly applicable to the new processor. Furthermore, measurements may be inapplicable when considering the implementation of a radically new computer design, for which there is no sufficiently similar design to measure. We will drop performance measurement from further consideration here, since our focus is on performance estimation methods that can provide estimates for novel (proposed) designs, and allow the user to generate trade-off analyses. The two general domains of performance prediction that remain are therefore analytic models and simulation models.

1.2.2 Analytic and Trace-Driven Simulation Models

Heidelberger and Lavenberg assert that computer systems can be characterized by a set of hardware and software resources and a set of jobs that compete for access to these resources. They further state that any system that includes a set of resources and a set of tasks that compete for the resources can be modeled using queueing networks. [29] These assertions allow them to differentiate the analytic and simulation models on the basis of the underlying queueing network. The performance of a system is predicted by estimating characteristics of the resource utilizations, the queue lengths and the queueing delays. Queueing network models result in analytic performance models when these values can be determined mathematically. Simulation models are models where the characteristic values cannot be (or simply are not) determined mathematically, but instead are determined by simulating the system's execution behavior. We will consider analytic and simulation models not necessarily in terms of an underlying queueing network, but rather in a more general context.

1.2.2.1 Analytic Models

Analytic models can be generally described as modeling the computer system via a set of formulas (or timing rules) to which a set of metrics, taken from the workload code, are applied. The formulas are generally derived from the processor hardware model and indicate the execution restrictions between sets of instructions. In the MA bounds method of [53], for example, a set of formulas is derived to describe the peak throughput of the processor function units when executing a steady stream of instructions. The metrics that are input to this model are primarily counts of the number of each type of instruction within an inner loop block body. Similarly, in [22] Emma and Davidson present a model of a pipeline's performance that uses an inverse bandwidth formula based on the length of the execution pipeline path. The metrics in this case are histograms of the dependence chain lengths in the executed code.
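
To make the flavor of such analytic models concrete, the following sketch computes a simple steady-state throughput bound of the kind just described. It is an illustration only, not the MA bounds model of [53] or the inverse bandwidth formula of [22]; the operation mix, unit counts, and issue width are hypothetical values chosen for the example.

```python
# Illustrative throughput bound: steady-state cycles per loop iteration are
# limited by the most heavily used resource, assuming fully pipelined units
# and an optimal code schedule.  All inputs are hypothetical example values.
import math

def throughput_bound(op_counts, unit_counts, issue_width):
    """Lower bound on steady-state cycles per loop iteration."""
    # A fully pipelined unit retires at most one operation per cycle, so each
    # resource class needs at least ceil(ops / units) cycles per iteration.
    per_resource = [math.ceil(ops / unit_counts[kind]) for kind, ops in op_counts.items()]
    # Instruction issue is itself a resource: at most issue_width per cycle.
    issue_cycles = math.ceil(sum(op_counts.values()) / issue_width)
    return max(per_resource + [issue_cycles])

# Example loop body: 6 floating-point ops, 4 memory ops and 2 branches per
# iteration, on a machine with 2 FP units, 1 memory port, 1 branch unit and
# 4-wide issue.  The single memory port dominates, giving a bound of 4 cycles.
ops = {"fp": 6, "mem": 4, "branch": 2}
units = {"fp": 2, "mem": 1, "branch": 1}
print(throughput_bound(ops, units, issue_width=4))
```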

These examples are typical of analytic performance models, which are generally driven by a set of metrics that characterize the workload either as counts of event occurrences or as distributions over them. These metrics substantially reduce the full execution trace information of the workload, and the set of formulas used to evaluate the performance of a system can then be evaluated very rapidly. Thus, the strength of the analytic methods is in their high computation speed and the flexibility they obtain by representing different processors through changes in the formulas.

In order to form an analytic model of a system's performance, however, a number of simplifying assumptions must generally be made. [29] In the MA bounds analysis of [53], for example, the workload code is assumed to be a single basic block inner loop that is repeatedly executed, and the performance bound indicates the upper bound on the steady state performance that the processor might achieve. Similarly, the code schedule is assumed to be optimal, the performance is assumed to be bounded by the throughputs of the units in the processor, and the other elements of the processor are assumed to have no impact on the bound. In Emma's pipeline analysis, the method only considers a single pipeline executing all instructions, where all pipeline segments are assumed to have a unit service time, and all instructions are assumed to pass through all the pipeline segments in order. [22]

If a designer is investigating a processor that conforms to all of the assumptions, then an analytic model can provide good estimation accuracy. In fact, if enough is known about the processor design when the analytic model is being formed, then the analytic model can even be tuned (or calibrated) to produce very accurate results. Analytic models require that the model designer have a good understanding of the underlying design's execution behavior, however, since the analytic model is generated by abstracting away "unnecessary" information and retaining only the important sources of performance impact. Unfortunately, when investigating a large space of novel processor designs, it is unlikely that one can know enough about the processor designs in order to generate accurate analytic performance models. Thus analytic computer system modeling will not generally provide enough accuracy for our early design stage processor comparison.

1.2.2.2 Trace-Driven Simulation

Trace-driven simulation models are software models of the machine hardware that track the movement of instructions and data through the computer system during execution. An execution trace of the workload (i.e. the sequence of instructions that are executed during a run of the workload) is used as the input to the simulator, so the simulator input will accurately model the actual workload. The simulators we consider here are trace-driven performance timing simulators, also called timers. Timers operate by modeling the flow of instructions through the hardware model; the values of the data being transferred or used by instructions are ignored in these simulators in an effort to reduce the overall work required in the performance analysis. Traditional timers are cycle-by-cycle models of the processor execution; in each clock cycle the instructions are advanced through the hardware, and a cycle-by-cycle picture of the execution (i.e. a workload execution profile) is developed. [9] [42]

Timers can produce extremely accurate performance estimates, provided that they implement a detailed model of the processor hardware, and they include the full workload characteristics by using the full execution trace as an input. Unfortunately, because these detailed timing simulations do some amount of work simulating each instruction's execution, and they analyze every instruction in the full execution trace (which often contains billions of instructions), timers require a long time to produce performance estimates. Furthermore, these detailed timer models are highly machine specific; the inclusion of a lot of detail in the processor model tends to limit the timer's flexibility (i.e. the number of variable design parameters available to the user). Often, in order to include a significant design change in a timer model, it is necessary to rewrite major portions of the timer code. [9]
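
The following fragment sketches the control structure of such a timer. It is a minimal, idealized in-order model written only to show the style of computation, not REAP or any particular simulator discussed here; the trace format, latency table, and issue width are hypothetical.

```python
# Minimal cycle-by-cycle trace-driven timer sketch (hypothetical machine).
# Each trace record is (opcode, source registers, destination register).

LATENCY = {"load": 2, "store": 1, "add": 1, "mul": 3, "branch": 1}

def simulate(trace, issue_width=2):
    """Return the cycle count for the trace on an idealized in-order machine."""
    cycle = 0
    reg_ready = {}              # register -> first cycle its value is available
    pos = 0                     # index of the next trace record to issue
    last_finish = 0
    while pos < len(trace):
        issued = 0
        # Issue in order, up to issue_width instructions whose operands are ready.
        while issued < issue_width and pos < len(trace):
            op, srcs, dst = trace[pos]
            if any(reg_ready.get(r, 0) > cycle for r in srcs):
                break           # operand not yet ready: stall until the next cycle
            finish = cycle + LATENCY[op]
            if dst is not None:
                reg_ready[dst] = finish
            last_finish = max(last_finish, finish)
            pos += 1
            issued += 1
        cycle += 1              # advance the clock by one cycle
    return max(cycle, last_finish)

trace = [("load", ["r1"], "r2"), ("mul", ["r2", "r3"], "r4"), ("add", ["r4", "r5"], "r6")]
print(simulate(trace))          # data dependences stretch these 3 instructions to 6 cycles
```

Even this toy model touches every trace record at least once, which is why a detailed timer run over a trace of billions of instructions is so expensive.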

1.3 Related Work in Early Stage and Design Space Analysis Some other research has addressed the development of tools and techniques to allow users to analyze processors in an early design stage. In [43], Kumar and Davidson propose a systematic, hierarchical approach for the analysis of different processor designs. Their methodology provides a number of different processor models, where the higher-

8 level models are very easy to evaluate (thus requiring little time), but provide less accurate performance estimates; lower-level models provide very accurate performance estimates, but at the cost of longer estimation run times. While analytic models cannot provide the same level of detail or estimation accuracy as simulation models over a large set of parameters, Kumar and Davidson argue that they can be accurate within a small area around a point at which they have been calibrated. Furthermore, this calibration can be achieved by using a low-level simulation that provides an accurate performance estimate, to which the high-level model can be tuned. By providing an objective function for optimization, the designer can then initiate an automated analysis of the design space in order to maximize or minimize the objective function. The process begins with a reference processor design, and the high-level model is calibrated around this reference design by using the low-level model’s performance estimates and tuning the parameters of the high-level model to fit that performance. The calibrated high-level model is then used to explore the design space and to discover the interesting areas of the design space that may hold an optimal design. In [43], an example is given where the high-level model is calibrated along several axes of the design space, and an optimum design area is identified by extrapolating in the direction that provides the best performance improvement. In [43], the distance that the analysis can move along the line is controlled so that the early search iterations have more freedom to explore a large design area, and later searches refine within an interesting space. Once the high-level analysis returns an interesting area to explore, a new reference design is selected in that region, and the low-level model is used to again calibrate the high-level model around this design. Once the high-level method selects a design from within a space for which it is already calibrated, the high-level and low-level methods will agree on the relative performance of the processor design, and that design is returned as the optimal design (though it is only guaranteed to be a locally optimal design). This general approach seems to be well founded, and provides a good, general method for comparing processor designs. In [43], however, Kumar and Davidson only concentrate on the development of the hierarchy itself, and do not present detailed discussion of the performance estimation tools and models that could be used in such a

9 hierarchy. They do provide a case-study analysis of the CPU-Memory subsystem of the IBM System 360 Model 91 using a detailed simulation model (that provided a limited set of parameters) for the low level analysis and a linear analytic model for the high-level analysis. In this dissertation, the focus is on the early stages of design where fewer design details will be known about the processor, and the analysis must include a large number of variable design parameters in order to investigate the large space of available processor designs. In [18], Conte describes a method for systematically comparing processor designs in an effort to determine a near-optimal processor design from which detailed processor design and development can proceed. This work recognizes the iterative nature of the processor design process, and tries to start this process with a very good initial processor design (i.e. one that is close to what will be the final design), thereby reducing the number of later design iterations. Two processes are developed in [18]; the first considers each of the workload programs separately, and develops a characterization of the resources required by each program. These resource requirements are determined by considering an infinite resource processor design, and determining from the execution of that processor the number of each resource actually required. The second method described in [18] formulates the design selection problem as an optimization problem that starts with a reference design, and determines the lowest cost processor that retains at least ninety percent of the reference processor’s performance. In this method, a timer is used to provide a performance estimate at each evaluated design point in the design space, and a simulated annealing method is used to search the space and optimize the processor design (though other methods could also be used). The critical element in this second method is that the timing simulations can each require a significant time to produce a performance estimate. As the simulated annealing progresses through a series of iterations, refining the processor design and evaluating the performance, a large number of designs may be evaluated. While Conte does not directly indicate the number of iterations required for each workload optimization, he does provide some data that indicates that over 2000 simulated annealing iterations were executed for some of the benchmarks. Because each processor simulation can be fairly expensive, the analysis of 2000 processor design refinements could take a very long time to complete,

10 limiting the effectiveness of this approach. To reduce the simulation time required in each iteration, Conte develops a trace sampling approach. Still, neither Kumar and Davidson in [43] nor Conte in [18] address the problem of specifying a highly parameterizable timing simulator, so that large design spaces can be explored. The IBM 360 model 91 simulator used by Kumar and Davidson in [43] provides only six variable parameters: the memory cycle time, the number of memory banks, the instruction buffer size, the type of fixed-point unit architecture, the type of floating-point unit architecture, and whether the loop mode feature is enabled. Of these parameters, the fixed-point and floating-point architecture types are limited to three choices each, and the loop mode feature may be either enabled or disabled. Thus, the design space that this performance simulator explores is clearly limited, though this was acceptable for the study they presented. The hierarchical search methodology itself would be very useful in searching a design space in the early design stages, but a more flexible form of processor performance simulator must be developed. In [18], Conte provides an example use of his approach that investigates the processor hardware included in a superscalar processor design. His study defines a set of function unit types that correspond to instruction types in the ISA and uses the optimization to determine the number of function units of each type and their latencies. Again, the purpose of this optimization is to produce a processor with the least cost that retains 90 percent of the original design performance; the only variables in the simulation, however, are the number of function units of each type and their latencies, so the design space that this optimization explored was again rather limited. It is likely that the cost of the simulations was too large to explore a much larger space, even using the trace sampling approach that he developed to reduce the cost per processor design evaluation.
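To make the shape and cost of such a search concrete, the loop below sketches the general cost-minimization-under-a-performance-constraint formulation in Python. It is not code from [43] or [18]; it is a greatly simplified, greedy stand-in for the simulated annealing search, and the helper functions (perturb, simulate_cycles, and cost) are hypothetical placeholders for a move generator, a timing simulation, and a hardware cost model. The point of the sketch is simply that one full timing simulation is paid on every iteration, which is why the per-design simulation cost dominates the search time.

    def search_design_space(reference, trace, simulate_cycles, perturb, cost,
                            iterations=2000):
        """Keep the cheapest design whose estimated performance stays within
        90 percent of the reference design (i.e. cycles <= reference cycles / 0.9)."""
        ref_cycles = simulate_cycles(reference, trace)    # one expensive simulation
        best, best_cost = reference, cost(reference)
        for _ in range(iterations):                       # e.g. 2000+ search steps
            candidate = perturb(best)                     # change a design parameter
            cycles = simulate_cycles(candidate, trace)    # the dominant expense
            if cycles <= ref_cycles / 0.9 and cost(candidate) < best_cost:
                best, best_cost = candidate, cost(candidate)
        return best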

1.4 Investigating Changes to the ISA and Processor Models Processor performance has been increasing at a high rate for the past ten to fifteen years. This advance is the result of several factors: better process technology has permitted an increase in device speeds and densities, instruction set changes can improve the number of simultaneously executable operations in a program, and new architectural features have

11 permitted better exploitation of the workload’s instruction-level parallelism. In this dissertation, the process technology will not be directly considered, so computer performance will be described at the architectural level in terms of execution cycles. Changes in the instruction set architecture (ISA) of recent processors have first increased the number of instructions in the dynamic workload and then reduced it over time: the move from complex instruction set computers (CISC) to early reduced instruction set computers (RISC) that used simpler instruction sets [62] resulted in an increase in the number of dynamic instructions required to describe a workload. [8] Similarly, the move from these early RISC instruction sets to a more complex RISC instruction set, such as that of the IBM POWER based processors [34] [56] resulted in a decrease again. The IBM POWER instruction set architecture retained the ideals of the early RISC instruction sets, such as a load-store memory model and fixed-length instructions, but increased the work specified by some of the instructions. Clearly, instruction set architectures have changed over time, so a complete early design stage performance estimation method should include the ability to consider different instruction set architectures and features. Current processors also include many enhancements that are intended to increase the instruction-level parallelism that they can and do exploit. Pipelining the function units allows the clock rate to be increased but retains the peak (one-instruction per cycle) throughput of the function unit. [41] Similarly, including multiple pipelined function units (and multiple instruction issue) increases the number of instructions that can be simultaneously executing in the processor. [63] More aggressive architectural features, such as branch prediction [67] speculative execution and dynamic instruction scheduling [68] can further increase the ability of the processor to exploit the instruction level parallelism of the workload. Again, an early design stage processor performance model should allow the user to consider many different parameters within the space of these different organizational elements. The timing simulators used by Kumar and Davidson in [43] and Conte in [18] both limit the early design stage modeling that is available to the user. Conte’s simulator is arguably the more flexible of the two, allowing the processor design to change in many different ways. The basic form is held fixed as a centralized-window out-of-order issue

12 processor with out-of-order instruction finish, though completion (i.e. committing the results to the processor state) is in-order. The issue width, window size, function unit latencies and other parameters can be altered, as can the number of function units in the processor, but the composition of the function units (i.e. the set of instructions that each function unit can execute) is fixed. Furthermore, the instruction set architecture used in [18] is a general ISA derived from the instruction execution classes of the gcc intermediate language (RTL) representation [74], but is still a fixed ISA. With the simulators of [18], there is no way a user can explore different instruction sets or even instruction execution classes (i.e. instructions that share execution hardware in the processor) without rewriting portions of the simulator itself. Thus the search methodology that Conte uses cannot discover a lower cost processor design that requires grouping the instructions into different execution classes than are predefined by the simulator.

1.4.1 Comparing ISA Features The investigation of different ISA features is not new; many authors and projects have developed methods for comparing different features of instruction sets and determining estimates of their impact on the performance. In [24] and [55], Mitchell, Flynn and Mulder describe their computer architect’s workbench. This workbench is an environment that includes an optimizing compiler with front-ends for several different languages and can produce code for a set of different instruction set architectures. Different instruction sets will necessarily result in different workload representations, and these different representations can then be evaluated using an architectural simulator module. The architectural simulator implements a very simple high-level model of a processor design, and most of the comparisons shown are based on the impact of the different instruction sets on the required instruction cache size and bandwidth. Thus the user is able to compare different instruction set trade-offs to some degree, but the underlying processor performance model is very simplified, and the performance estimates are consequently of questionable accuracy. In [4], Alpert, Averbuch and Daneli compare a load/store architecture and a symmetric instruction set architecture (which can specify operands from memory and operate on them in the same instruction) to determine whether either approach holds an

13 inherent advantage. Clearly, the ability to specify memory accesses and functional operation in the same instruction rather than requiring a separate instruction to load each memory value to a register could significantly reduce the number of instructions required to dynamically describe the workload (i.e. the number of executed instructions). In this study, the two architectures were compared using detailed processor simulators that implemented identical pipeline structures based on the MIPS-X pipeline. [16] Their results show that the symmetric architecture has a slight performance advantage over the load-store architecture, gaining roughly a four percent speedup. The MIPS-X processor is a single pipeline processor implementing a very basic five stage pipeline, and the inter-instructional latencies are relatively small, particularly with the bypasses added in this study. Furthermore, the effects of a finite cache are not considered, so the additional benefit of the load/store architecture in separating the initial memory access and subsequent use of the data does not show up in their results. Clearly, this exposes some of the dangers in considering any ISA question on a very limited set of processors. Thus it could be important, in considering ISA changes, to consider the effects of these changes for a range of processor parameters.

1.4.2 VMW: Providing ISA and Processor Models In his dissertation, Trung Diep develops his visualization-based microarchitecture workbench (VMW) [19] which is intended to provide users with a highly retargetable framework for developing machine simulators. VMW was specifically designed to provide a framework that supports the specification of both an instruction set architecture and the processor microarchitecture (organization). Thus, VMW can be used to explore processors developed by different vendors, or can be used to explore the trade-offs between different ISA features in the context of a full timing simulation. Unfortunately, Diep concentrates his efforts on the description and analysis of current, fully designed processors in order to analyze the execution of workloads on those processor designs. In [19] for example, VMW is used to examine the PowerPC 620 processor, and particular mention is made of the fact that the VMW model was refined during the design of the 620 processor to provide a highly detailed and accurate model of the completed 620 processor. The performance of the 620 was then measured across seven

14 traces taken from the SPEC benchmark suite [73] and a description of the utilizations of the various processor hardware elements was developed. There is no discussion in [19] as to how a user might compare different processor design alternatives. VMW processor specifications are made using five input files. Two of these specification files describe the instruction set architecture: one for the instruction syntax, and another for the semantics. The remaining three specification files describe the processor microarchitecture: one file for the machine organization, another for the machine behavior, and the third for the instruction timing. Once these files have been developed, VMW uses them to create a simulator for that specific instruction set architecture and processor microarchitecture. By using three specification files to describe a processor implementation, a great deal of flexibility is provided for the specification of different processor designs. In order to compare many different processor designs, however, the user would have to generate the three processor specification files for each processor model (all five files if ISA changes are considered) and then compile each set of files to produce a simulator for that processor. Clearly, automating this process could prove difficult, as there is no apparent parameterization of the generated simulator. Furthermore, the level of detail illustrated in [19] is appropriate to the modeling of a fully designed processor, and no discussion is given regarding the applicability of the VMW approach for modeling processors at different levels of detail.

1.5 Contributions of this Dissertation This dissertation presents two processor performance estimation techniques we have developed for analyzing processor designs early in the design cycle, and shows an example application of these techniques as part of a hierarchical design space search methodology. The techniques themselves were developed to enable a user to easily alter both the processor hardware model and the instruction set model, so that a large design space can be explored.


1.5.1 Resource Conflict Methodology Chapter II describes the resource conflict methodology (RCM), a method designed to allow a user to easily develop early design stage processor models that allow the specification of both an instruction set architecture and a processor microarchitecture through the use of simple parameter files (and command-line switches). RCM uses an abstraction of the hardware elements as basic resources used by an instruction during execution. By tracking the times when the various resources are available for use, an RCM simulation can determine when a given instruction is fetched, issued to a function unit, begins and finishes execution. Thus, the impact of earlier instruction executions on the following instructions is determined through the alteration of these resource availability times in processor model. In order to develop an accurate early design stage performance estimate, RCM uses a workload execution trace to drive its simulation. The basic RCM simulation procedure reads instructions from the trace and analyzes each instruction to determine the instruction’s execution profile (i.e. the execution map for that instruction’s use of resources in the processor), and the way that instruction alters the availability of the processor resources for following instructions. With RCM the user is able to develop flexible (i.e. highly parameterized) processor modeling tools that allow the exploration of a large design space. The processor description is provided in a processor description file that indicates the set of hardware resources that should be modeled in the processor. In our implemented RCM modeling tool, a set of command line switches have also been provided that allow the user to selectively add processor hardware elements without modifying the basic processor description file. The use of command-line switches is a stylistic choice; there is no reason that the parameter files could not include all of the parameters. RCM also provides a neat and clean separation of the instruction set architecture from the processor hardware specification, so the impact of different instruction set features can be examined. The instruction set architecture model is described by a set of instruction templates (one per instruction in the instruction set architecture model) that indicate the kinds of resources used by each instruction of the instruction set. Because RCM allows the user to define both an ISA model and a processor hardware configuration,

the user can easily explore not only a varying number of function units of fixed types, as in Conte’s work, but also variations in the types of function units the processors include. An RCM simulation identifies the timing constraints and data dependences of the workload through the availability of the resources required by each instruction. The simplified view of hardware as basic resources provides a uniform way for the simulator to view these many different hardware elements, resulting in a fairly simple simulation algorithm and a simplified interface for the addition of new hardware element models to the simulator. Furthermore, for early design cycle processor models, the RCM instruction-based simulation can provide performance estimates that are as accurate as a more traditional cycle-by-cycle estimation that uses the same processor modeling detail. There are, however, some restrictions in the kinds of processors that RCM can model; the instruction-based evaluation method employed here evaluates the instruction’s fetch, issue and execution behaviors simultaneously, effectively analyzing the full lifetime of the instruction within the processor all at the same time. Because of this approach, it is important that the instructions execute through a function unit in the same order that they are issued to that function unit, i.e. that any function unit buffer be a first-in first-out (FIFO) queue. If the function unit input buffers are allowed to reorder the instruction executions, then an instruction issued to the function unit could alter the execution of an earlier issued instruction, which is difficult to model in RCM. The restriction of the function unit buffers to FIFO queues does not, however, preclude some forms of out-of-order issue. Since the issue order and function-unit execution order must match, the issue unit could fetch a window of instructions from memory and issue those instructions to the function units in an order that is different from the order in which they were fetched (i.e. a form of centralized window out-of-order issue).
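As a purely hypothetical illustration of this separation (the sketch below is not the REAP file syntax, and all of the names are invented), both the grouping of opcodes into execution classes and the set of classes each function unit accepts can be held as plain data, so that either can be varied between simulation runs without touching the simulator code:

    # Hypothetical ISA-side data: opcodes grouped into execution classes.
    exec_class = {"FADD": "float", "FMUL": "float", "ADD": "fixed",
                  "LOAD": "memory", "BC": "branch"}

    # Hypothetical processor-side data: which classes each function unit executes.
    function_units = [
        {"name": "FXU", "classes": {"fixed", "memory"}, "pipe_depth": 4},
        {"name": "FPU", "classes": {"float"},           "pipe_depth": 6},
        {"name": "BRU", "classes": {"branch"},          "pipe_depth": 2},
    ]

    def candidate_units(opcode):
        """Function units to which an instruction of this opcode could be issued."""
        return [fu for fu in function_units if exec_class[opcode] in fu["classes"]]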

1.5.2 Reduced Trace Analysis Note that RCM is a full execution trace-driven method, and full workload execution traces can include billions of instructions. Furthermore, recent trends indicate that the size of workload execution traces will continue to grow. In order to explore a large design space, evaluation methods must provide performance estimates very quickly. Kumar and Davidson used multiple performance estimators, with the faster, less accurate

17 estimators used to focus the use of the slower, more accurate models on a small space of only a few processor designs. [43] Alternatively, Conte used one level of modeling, but employed trace sampling to reduce the total number of instructions that had to be simulated for each design. [18] Chapter III describes our second methodology, called reduced trace analysis (RTA) that was developed to reduce the long simulation times of the resource conflict methodology (or similar full execution trace-driven methods) by reducing the amount of redundant computation done in the performance analysis. The RTA method begins by analyzing the full execution trace and reducing it to a representative weighted control flow graph. The nodes of the graph correspond to blocks of code from the original execution trace, and the links in the graph indicate the transfer of control from one block to another; weights are used to indicate the total number of times that a control transfer link is taken during the complete workload execution. A reduced trace analysis is performed on this reduced trace description by analyzing the code blocks and interface links. Many of the basic blocks in a program are executed numerous times during the workload execution, and thus many copies of the block would appear in the workload’s full execution trace. A full trace-driven simulation methodology would analyze each copy of the basic block as it was encountered in the trace. The performance improvement of the RTA method is achieved by developing a block pairs sufficiency condition under which each evaluation of the code block will result in the same performance estimate, so each single code block of the graph needs to be analyzed only once and the block pair associated with each interface link needs to be analyzed only once. The interface cost for each of the links is then calculated from the single block and block pair costs. Thus, RTA formulates its performance estimate by using a trace-driven simulation to evaluate the performance of each single block and each connected block pair only once, accumulating a weighted sum of the resulting block and link execution times to form the total execution cycles estimate. By analyzing the reduced trace description graph rather than a full execution trace, a large amount of the redundant computation done in traditional full trace-driven timers can be eliminated. Questions arise, however, as to whether the estimates derived by a reduced trace analysis will be able to produce a reasonably accurate performance estimate,

18 and whether RTA is able to analyze interesting processor designs. In chapter III, we show that the reduced trace analysis tool we implemented can model most of the processors that are available in the full resource conflict methodology. Processor hardware elements that utilize information from the global ordering of execution, such as caches and dynamic branch predictors are difficult for an RTA simulation to evaluate, because the reduced trace description graph does not retain this global execution ordering information. We also show that reduced trace analysis can provide accurate timing estimates for a wide variety of processor models and workloads. In order for the block pairs sufficiency condition to hold, the two code blocks of a connected block pair must fully determine the interface between those blocks, i.e. the first block of a block pair (the predecessor block) must include enough work (i.e. instructions) to warm the simulator state for the second (successor) block. In order to help assure that the reduced trace descriptions satisfy block pairs sufficiency, we apply a set of transformations to optimize the descriptions so that the code blocks are increased in size to better assure that the block pairs are sufficient. Our experiments show that the reduced trace analysis performance estimates of these optimized reduced trace descriptions are nearly as accurate as the full trace simulation results, suffering less than two percent relative error. Equally important, the reduced trace analyses still enjoy excellent speedups over the full RCM trace simulation.
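One plausible way to assemble the final estimate from the per-block and per-pair simulations described above is sketched below; the precise formulation is developed in Chapter III, and this sketch simply assumes that the block pairs sufficiency condition holds and that each link's interface cost can be obtained by differencing the pair and single-block costs.

    def rta_estimate(block_counts, link_counts, block_cycles, pair_cycles):
        """block_counts: {block: times executed}; link_counts: {(pred, succ): times taken};
        block_cycles[b]: cycles for one evaluation of block b in isolation;
        pair_cycles[(a, b)]: cycles for one evaluation of block a followed by block b."""
        total = sum(n * block_cycles[b] for b, n in block_counts.items())
        for (a, b), n in link_counts.items():
            interface = pair_cycles[(a, b)] - block_cycles[a] - block_cycles[b]
            total += n * interface       # cycles the a->b seam adds (or overlaps away)
        return total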

1.5.3 Hierarchical Design Space Evaluation Chapter IV presents an illustrative example of the application of both the resource conflict methodology and reduced trace analysis to search through a space of processor designs. A hierarchical search methodology is defined that defines a five stage approach for evaluating a processor design space. The hierarchy begins with a large space of possible processor designs, and at each level of the hierarchy it reduces the space to include only a set of interesting designs which are forwarded to the next level of the hierarchy for further consideration. The first step in our proposed hierarchy reduces the initial design space by eliminating those designs that cannot be produced within the specified area and power limitations, effectively removing from further consideration those designs that are impossible to build. The next level of the hierarchy takes advantage of previous work in

19 analytic performance bounding (e.g. [53]) to consider the potential performance of the remaining processor designs, and eliminating those designs that do not provide a large enough potential to remain interesting. The following level employs reduced trace analysis to determine the performance for the processor designs that remain in the interesting design space, and the subset of those that remain interesting are then forwarded to a detailed resource conflict methodology analysis, which can produce detailed information about the workload execution and processor performance trade-offs. The final stage of the hierarchy selects a processor design to be implemented, and is followed by detailed design, during which phase a detailed timing performance simulator can be developed to capture the implementation details of the emerging hardware design.
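Schematically, the hierarchy behaves like a sequence of increasingly expensive filters applied to a shrinking candidate list. The sketch below is only illustrative; the names, pruning rules, and shortlist size are assumptions, not prescriptions from Chapter IV.

    def hierarchical_search(designs, area_power_ok, perf_bound, bound_threshold,
                            rta_cycles, rcm_cycles, shortlist_size=10):
        feasible  = [d for d in designs if area_power_ok(d)]           # stage 1: buildable designs
        promising = [d for d in feasible
                     if perf_bound(d) >= bound_threshold]              # stage 2: bounds analysis
        ranked    = sorted(promising, key=rta_cycles)                  # stage 3: RTA estimates
        shortlist = ranked[:shortlist_size]
        detailed  = sorted(shortlist, key=rcm_cycles)                  # stage 4: full RCM analysis
        return detailed[0]                                             # stage 5: selected design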

CHAPTER II THE RESOURCE CONFLICT METHODOLOGY

2.1 Introduction The importance of early design stage analysis has begun to appear in published literature. Consider, for example, the development of the Pentium Pro, which used early design cycle (trace-driven) timing simulation to explore a number of different processor design issues before committing to a processor organization. [61] The most common methodology used to develop accurate processor performance estimates is a detailed trace-driven timing simulator; such timing simulators (or timers) are software models of the processor hardware, where the cycle-by-cycle advance of instructions and data through the processor is simulated. [29] These timer models are capable of producing very accurate processor performance simulations, and thus can generate very accurate performance estimates, but they generally allow only limited flexibility in specifying new processor design elements. Once a timer has been implemented, it usually allows some parameters to be entered controlling such design elements as buffer lengths or even pipeline latencies, but to consider more significant changes in the design requires reprogramming portions of the timer. [10] To enable a user to examine a machine design space and to compare the different processor organizations within that space, the processor evaluation methodology must allow the specification and evaluation of many different processor designs. In order to address this problem, Diep proposed and developed a timer-generation framework in his dissertation. [19] Using this framework, a timer writer develops a set of description files to specify the processor and instruction set models, and these files are then automatically compiled to generate a detailed simulator for that processor model.

21 While this approach may ease the process of writing a detailed timing simulator, it does not provide a means for comparing many processor designs early in the design cycle, as the parameter values are all specified in the description files. An evaluation tool to explore a design space must provide the user with the ability to alter a large set of processor design parameters so that the processor model can be quickly and easily modified, allowing the search through the design space to be automated. In his dissertation [18], Conte presents an example of such a tool for analyzing the function unit mix required to achieve good performance for a given workload. He develops an event-driven simulator for a family of superscalar processors that include parameters for the number of function units, their pipeline lengths, and the issue buffer window size (which also indicates the maximum number of instructions issued per cycle). While this is a reasonable simulation tool for exploring the performance of processors with different numbers of function units taken from a fixed set of function unit types, there are a number of interesting trade-offs it cannot consider, including different instruction execution classes, different function unit types, and the analysis of processors containing differences in other hardware elements. In order to provide users with a processor simulation design methodology that allows the exploration of a very large space of processor designs early in the design cycle, the modeling and simulation methodology should allow the user to vary the processor hardware, instruction set model, and potentially even the execution model, all through the specification of simulator input parameters. We have therefore developed a processor performance evaluation methodology which focuses on the flexible description of the processor design, allowing a simulator to be developed that can examine a large design space simply by specifying the value of organizational parameters that define aspects of the processor design. This methodology, called the resource conflict methodology (or RCM), takes advantage of the fact that the simulator will target early design stage processor performance estimates. In the early design stages, the processors are not fully defined or developed, i.e. there is no detailed processor hardware description, and thus no performance impact from the implementation details, power limitations, heat dissipation and so forth that will arise in the later development stages. A somewhat abstracted view of the hard-

22 ware elements can therefore be used: rather than considering all of the detail that would be present in the later design stages, RCM views the hardware elements using an abstract resource model. Each instruction in the instruction set model will require a set of resources for its execution, and will affect a set of resources during execution. By charting the use of hardware resources, an RCM model is capable of determining when an instruction would be forced to stall. All stalls in RCM are characterized through such resource conflict stalls, where the instruction must wait for the execution resources it needs to become available. In this way, the entire execution trace for a given workload can be examined, and an estimate of the total number of cycles required to execute the workload on a given set of processor hardware elements can be determined. An RCM model requires that the user specify the set of hardware resources that are available in the processor, and a description of the types of resources that each instruction uses when it executes. The processor resources are described and specified via a processor description file, and the resources used by each instruction are described by an instruction template (with one template per instruction in the modeled instruction set). The RCM model is then simulated for a specific workload execution trace, which provides a dynamic workload description to the RCM simulation. This chapter first describes the overall resource conflict methodology, explaining the general concepts involved in modeling a processor and simulating the workload execution. These concepts are then refined through the use of an RCM based superscalar RISC processor modeling tool, called REAP. Section 2.3 describes the basic implementation of REAP (clarifying RCM modeling with a concrete example) and discusses the accuracy that this implementation of RCM analysis can obtain. Later sections then describe features added to REAP, including additional processor hardware resource models in section 2.4, and more informative execution reports in section 2.5.

2.2 The Resource Conflict Methodology The primary mechanism used in RCM to model the processor hardware is the resource; all interactions between instructions are mapped through these abstract

23 resources. Thus RCM resources have one main attribute: at a given time, when an instruction might require a given resource, that resource may be unavailable. Within an RCM model, therefore, the resource availability times are the primary data that must be recorded (i.e. they constitute the simulator state). Furthermore, it is the ways in which the execution of an instruction uses and alters these availability times that model the processor’s execution of that instruction, and by extension the entire workload.

2.2.1 Availability Times Each resource in an RCM model will have an associated availability time that indicates when the resource is available. Whenever an instruction is executed, it requires that a set of resources be available to begin execution, and it alters the availability times of some of the resources to indicate when they will be available to later instructions. If an instruction requires a given resource for execution and the resource is available, then the instruction reserves the resource for the time it requires, but if the resource is unavailable, the instruction must wait for the resource to become available before it can reserve (and use) the resource. The simplest implementation of these availability times employs a single value to indicate the time at which the resource is available for use, where the resource is assumed to be unavailable at all earlier times. For some processors, however, it may be advantageous to use a more complex implementation of the availability times using a list of availability time windows within which the resource is available. Consider, for example, a bus resource that is used in cycle 10, and is used again in cycle 20. If a single availability time is used, then no later instruction could use the bus between cycles 10 and 20 because the single availability time would indicate that the bus was unavailable until cycle 21, even though the bus is not in use between cycles 10 and 20. For processors that would allow the later instructions to use the bus between cycles 10 and 20, a set of availability time windows might be needed to properly model the processor execution. While a set of availability times could be implemented for each resource, the REAP implementation we describe in section 2.3 uses a single availability time, and we have not implemented a version of RCM that does use such availability time windows.
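A minimal sketch of the bookkeeping this implies for a functional resource with a single availability time is shown below (illustrative Python, not the REAP implementation); the windowed variant discussed above would keep a list of free intervals instead of the single available field.

    class FunctionalResource:
        """A hardware element modeled only by the cycle at which it next becomes free."""
        def __init__(self, name):
            self.name = name
            self.available = 0                 # free from cycle 0 onward

        def earliest_use(self, request_cycle):
            # An instruction that wants the resource at request_cycle may have to wait.
            return max(request_cycle, self.available)

        def reserve(self, start_cycle, busy_cycles=1):
            # Mark the resource busy; it becomes available again only after this use.
            self.available = max(self.available, start_cycle + busy_cycles)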

The execution of the processor is simulated using an instruction-by-instruction analysis of the workload input trace, where each instruction is fully analyzed before simulation continues to the next instruction. The full analysis of an instruction includes the determination of the times at which the instruction is fetched into the issue unit and issued to a function unit input buffer, and the times that it starts and finishes execution in the function unit pipeline. This analysis methodology thus restricts the types of processors that RCM can effectively simulate; the instructions issued to a particular function unit must execute through that unit in the same order that they are issued, i.e. the function units must contain only first-in first-out queues as function unit input buffers. This does not restrict the order in which the instructions are issued to the function units, and thus an out-of-order issue (but in-issue-order function unit execution) processor can be modeled. This abstracted view of hardware elements as resources provides a much more uniform interface between the instructions and the processor hardware elements. This more uniform interface in turn promotes a simplified model for instruction and hardware element interactions, resulting in a simple algorithm for the core processor simulation. It also provides a simple standard interface for the addition of new hardware element models into the processor model and execution simulation algorithm.

2.2.2 Types of Resources in an RCM Model In RCM, there are two kinds of resources: functional resources and value resources. These different kinds of resources are used to model different kinds of processor hardware elements. Functional Resources. A functional resource models a hardware element whose use implies an exclusive reservation. Thus the use of a functional resource will necessarily impact the availability of that resource for later instructions. Functional resources are therefore used to model hardware elements such as execution pipelines, buses, register ports, and buffer stages. Value Resources. Value resources differ from functional resources in that the effect of using a value resource depends upon how the resource was used. A value resource is used to model a hardware element whose availability depends on the value of

25 data stored within the element, such as a specific register or the program counter. The availability actually being modeled is not so much the hardware element availability, but rather the availability of the appropriate value within the hardware element. Value Resource Sub-Types. There are two kinds of use for a value resource, a source use and a destination use. Consider, for example, a given general-purpose register. If the register is used by an instruction to provide the source operand value for an operation, then the value in the register remains unchanged. When a register is used as the destination of an operation, however, the value in the register is overwritten. Thus there are four possible ways in which two instructions could use the same value operand. If two instructions both require the same register as a source operand (i.e. read after read), then the data should be simultaneously available to both of them. If both instructions use the register as a destination operand (i.e. write after write), then the later instruction would have to wait for the earlier instruction to finish writing to the register before it would be available for the later instruction’s use (i.e. for the later instruction to write to it). If the earlier instruction uses the register as a source operand, and the later uses it as a destination operand (i.e. write after read), then the value generated by the later instruction cannot be written until the data has been read by the earlier instruction (i.e. the write of the data must not occur before the read). Finally, if the earlier instruction uses the register as a destination operand, and the later uses it as a source operand (i.e. read after write), then the later instruction cannot read its source value data, which is generated by the earlier instruction, until that earlier instruction has generated the value. In order to properly model these kinds of interaction through value resources, a value resource has two resource availability times associated with it: one to govern its use as a source and the other as a destination. During the simulation of a workload, using a value resource as a source does not alter the source use availability time of the resource, but it does alter the destination use availability time (because of the potential for writeafter-read conflicts). Using a value resource as a destination will alter both the source and destination use availability times of the resource. The use of two availability times for value resources also facilitates the modeling of bypass paths in the function units: a bypass path can allow a destination use to produce an earlier availability time for a later source

26 use than for a later destination use, since the bypass path allows the value to be forwarded before the register itself would be written.
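The update rules just described can be summarized in a small sketch (again illustrative rather than the REAP code, and the exact cycle conventions are an assumption): a source use leaves the source-use availability time untouched but pushes back the destination-use availability time, while a destination use pushes back both, and a bypass path lets the new value become readable before the register write itself completes.

    class ValueResource:
        """Models when the value held in a register (or the program counter) is usable."""
        def __init__(self):
            self.src_avail = 0    # earliest cycle the value may next be read
            self.dst_avail = 0    # earliest cycle the value may next be overwritten

        def use_as_source(self, read_cycle):
            # Read-after-read: later readers are unaffected, but a later writer
            # must not overwrite the value before this read (write-after-read).
            self.dst_avail = max(self.dst_avail, read_cycle + 1)

        def use_as_destination(self, bypass_cycle, writeback_cycle):
            # Later readers may take the value off the bypass path (read-after-write);
            # later writers must wait for the register write itself (write-after-write).
            self.src_avail = max(self.src_avail, bypass_cycle)
            self.dst_avail = max(self.dst_avail, writeback_cycle)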

2.2.3 Instruction Resource Use Information To analyze the execution of a workload on a given processor organization, the RCM model must consider the effect of each instruction’s execution on the availability times of the processor resources. Each instruction requires a set of resources which it will use during its execution, and the instruction’s execution will alter the times at which some of these resources are available to following instructions. In most timing simulators, the specification of the instruction resource requirements are taken from two sources: the input execution trace and the instruction set architecture definition. An execution trace provides information about each instruction executed during the workload program run. This information usually includes the same information that would be provided to the processor (were it implemented) on an instruction by instruction basis, i.e. the instruction’s opcode, the register numbers for register operands, and possibly some immediate data. Many execution traces also include information about value data that could affect run time, such as load and store addresses or the program counter value, since this value data would otherwise be unavailable in a data-value independent timing simulation. The execution trace, however, does not indicate the semantic interpretation of the instruction information. This semantic interpretation of the instruction information in an implemented processor is determined by using the instruction opcode to reference the instruction set architecture (ISA) definition. Thus, if the processor encounters an instruction with opcode 8 and operands numbered 1, 3, and 7, the processor determines how to execute this instruction by referring to the (built-in) ISA definition. Because RCM is intended to provide as much flexibility as possible, no single instruction set architecture is encoded within the method itself or in the tools. Instead, an instruction template is used to describe the execution semantics and ISA information for each instruction. Consider again the situation where the trace indicates that an instruction with opcode 8 and operands numbered 1, 3, and 7 is the next instruction in the trace. The trace might identify this as: 8 1, 3, 7. Assume that the instruction with opcode 8 is a three-

Opcode:      8
Instruction: FADD
IClass:      floating-point
Operands:    3
  Op 1:      FP_Reg, source
  Op 2:      FP_Reg, source
  Op 3:      FP_Reg, destination

Figure 1. Example instruction template for a 3-register floating-point ADD instruction.

register floating-point addition (FADD) instruction, whose first two operands are the floating-point source registers, and that the last operand is the floating-point destination register. Thus, this trace line corresponds to the operation: FADD F1 + F3 => F7. In order for the RCM model to understand this trace instruction, the simulator needs to recognize that it is a floating-point instruction with three floating-point register operands, and that the first two operands are accessed as sources while the last is accessed as a destination. RCM will thus require an instruction template that identifies this information for the instruction with opcode 8. An example template for this three-register floatingpoint add instruction is shown in figure 1. The Instruction Execution Class. Looking at the instruction template of figure 1, note that the template uses an instruction execution class (the IClass entry) to indicate the kind of function-unit processing the instruction requires. In RCM, instructions can be grouped into instruction execution classes in any way desired by the user; RCM does not place any restriction on the composition or number of instruction execution classes used in the instruction set model. Consequently, an instruction set can be very coarsely divided into a few large execution classes, or many instructions can be assigned their own unique execution classes. When RCM is used to simulate a processor, the instruction execution class is used to determine which function units will receive the issue of each instruction. Each function unit in an RCM model identifies the set of instruction execution classes that it can execute. During the simulation of the instruction, all of the function units are examined to determine which function units can execute this instruction by comparing the instruction’s execution class to the set of classes that each function unit can execute. Once the set of function units is determined, if there is more than one possible function unit, then the sim-

28 ulation would select from among the set in some manner, e.g. the first available. The instruction will then have to wait until the selected function unit’s resources are available before the instruction can begin execution. Note that the function unit resources include such elements as the function unit input buffer, which must be available for the instruction to be issued to the function unit, the function unit’s decode stage (i.e. the first stage of the execution pipeline), which must be available for the instruction to begin execution, and the writeback stage (i.e. the last stage of the execution pipeline), which must be available for the instruction to finish execution. Because the instruction execution class is the basic mechanism that determines instruction issue and execution, execution classes should be used to differentiate instructions that require different execution hardware in the processor or that have different instruction execution profiles in the RCM model (e.g. instructions that execute through the processor datapath differently). Fixed-point instructions and floating-point instructions, for example, would most likely be assigned to different instruction execution classes because they generally use separate hardware datapaths. Similarly, some of the fixed-point instructions could also be assigned to unique execution classes because their execution differs from the rest of the fixed-point instructions. Consider, for example, the fixed-point multiply and divide instructions, which generally have longer instruction execution profiles than other fixed-point instructions (i.e. the effective pipeline length for these instructions is different, implying a different use of the pipeline resources). Assigning the fixed-point multiply and divide instructions to unique instruction execution classes allows these instructions to be issued to different hardware in the processor (or at least different models of the hardware in the RCM simulator), and different instruction execution profiles can be assigned to them. Thus, the combination of the instruction templates (which group the instructions into execution classes) and the function unit parameters (which indicate which execution classes each function unit can execute) allows the user to model several different ISAs and to model a given ISA at different levels of detail. The Instruction Operands. The instruction template also identifies the different kinds of operands in the instruction, and whether each is a source or destination operand. For register operands, the template also identifies the associated register file for each regis-

29 ter number. Thus, whenever an instruction is read from the trace, the RCM model consults the instruction template to determine the instruction’s execution class, the number of source and destination operands, each operand type, and for each register operand the type of register referred to (i.e. general purpose, floating-point, etc.). The operand resources indicate another set of resources (besides the function unit resources) that the instruction uses during execution. For the floating-point add instruction of figure 1, for example, the template indicates that the instruction uses three register operands, two as sources and one as a destination, and that all three are registers in the floatingpoint register file (FPR). Once this operand information is combined with the trace information, RCM is able to determine that this particular floating-point add instruction will use FPR registers 1 and 3 as sources, and FPR register 7 as a destination. Thus, when the instruction is simulated, it will have to wait for those resources to become available before it can begin execution. If finite register port resources are being modeled, each register accessed would also have to acquire a register port resource. Finally, there are instructions that include operand types not shown here, such as memory accessing instructions and branch instructions. The memory accessing instructions (e.g. load and store instructions) access the memory unit through a memory port. Thus, the instruction template for a memory accessing instruction will indicate that a memory port is used, and the execution of a memory accessing instruction would require that a memory port be available. Similarly, the branching instructions use the processor’s program counter as a destination resource, and thus before they can execute, the program counter must be available for a destination use.
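Pulling the template and trace information together, the lookup the simulator performs for each trace line can be sketched as follows; the in-memory encoding and field names are hypothetical, and simply mirror the Figure 1 template for opcode 8.

    # Hypothetical in-memory form of the Figure 1 template for opcode 8 (FADD).
    templates = {
        8: {"name": "FADD", "iclass": "floating-point",
            "operands": [("FPR", "source"), ("FPR", "source"), ("FPR", "destination")]},
    }

    def decode_trace_line(opcode, operand_numbers):
        """Combine the trace fields with the template to name the resources used."""
        t = templates[opcode]
        uses = [(regfile, number, kind)
                for (regfile, kind), number in zip(t["operands"], operand_numbers)]
        return t["iclass"], uses

    # decode_trace_line(8, [1, 3, 7]) yields
    #   ("floating-point", [("FPR", 1, "source"), ("FPR", 3, "source"),
    #                       ("FPR", 7, "destination")]),
    # i.e. FADD F1 + F3 => F7 as in the example above.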

2.2.4 RCM Model Simulation Once an RCM model has been developed, including the ISA model and the instruction templates, then the user can simulate a workload execution trace to generate a performance estimate. To simulate the processor, the instructions are read from the execution trace in sequential order; each instruction is fully evaluated before the analysis continues with the next instruction. The general simulation algorithm is shown in figure 2. For each instruction of the execution trace, the instruction’s issue time is determined by examining the issue unit resources. The execution start time is then determined


while (get_next_instruction() != none)
    determine the instruction issue time
    for each resource required for the instruction execution
        get resource availability time
    endfor
    determine earliest time all resources available
    determine instruction start time
    determine instruction finish time
    for each resource used by the instruction
        set new resource availability times
    endfor
endwhile

Figure 2. General execution algorithm for the RCM model.

by finding the earliest time at which all of the resources required by the instruction are available. Once the execution start time is known, the length of the instruction’s execution pipeline (in the function unit to which it was issued) is used to determine when the instruction finishes execution. The simulator thus has formed a complete instruction execution profile: the simulator knows when the instruction is fetched, when it is issued, which function unit receives the instruction, which execution pipe executes it, and the starting and finishing times for that execution. Thus the simulator can update the resource availability times for the affected resources and continue simulation with the next instruction. Note that the simulation algorithm of figure 2 makes the instruction the atomic unit of work in the RCM model, rather than the cycle as in more traditional cycle-by-cycle timer simulation models. [9] Questions obviously arise as to whether this instruction-by-instruction simulation can be used to describe a wide range of interesting processor models, and yet still provide accuracy similar to that of a cycle-by-cycle timer model. As we will show in section 2.3, the REAP superscalar processor simulation tool can model a large number of processors (further extended in section 2.4) with an accuracy equivalent to a traditional cycle-by-cycle timer using the same early design stage processor model.
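For concreteness, the figure 2 algorithm might be fleshed out roughly as below, reusing the resource sketches from sections 2.2.1 and 2.2.2. This is only an illustration under simplified assumptions (a single issue-bus reservation per instruction, an externally supplied fetch time, a single latency value per instruction, and operand value resources wrapped so that they expose the same earliest_use/reserve interface, dispatching internally on source versus destination use); it is not the actual REAP code.

    def simulate(trace, issue_bus, resources_for, latency_of):
        """Instruction-by-instruction RCM simulation sketch; returns total cycles."""
        last_finish = 0
        for instr in trace:
            # 1. Determine the issue time from the issue unit resources.
            issue = issue_bus.earliest_use(instr.fetch_time)
            issue_bus.reserve(issue)
            # 2. Find the earliest time all required resources (function unit
            #    stages, ports, operand registers) are available.
            required = resources_for(instr)
            start = max([issue + 1] + [r.earliest_use(issue + 1) for r in required])
            # 3. The finish time follows from the execution pipeline length.
            finish = start + latency_of(instr)
            # 4. Record the effect of this instruction on later instructions.
            for r in required:
                r.reserve(start, finish - start)
            last_finish = max(last_finish, finish)
        return last_finish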

2.2.5 Capturing the Pipeline Hazards An important issue in the simulation of a pipelined processor is the correct detection of the pipeline hazards. In order for a simulator to produce the correct execution behavior, it must be able to detect all of the pipeline hazards: the structural, control and

31 data hazards. [41] The RCM model and simulator are able to capture all of these pipeline hazards through the resource availability times. Structural Hazards. The structural hazards are captured directly by the resource availability times. If two instructions use the same hardware element and that element is modeled as a resource, then the resource availability time and sequential instruction enforce the proper access semantics. Generally, for the functional resources, the earlier instruction will use the resource first, and the later instruction will have to wait until the resource is once again available (i.e until after the earlier instruction’s use). The value resources do not model physical hardware elements so much as the movement of data into physical hardware elements, and thus there are no structural hazards. All of the value resource hazards are captured as data hazards (described below). Control Hazards. The control hazards are captured within the issue unit model. The issue unit consists of two resources: the issue bus, which determines when instructions can be issued to the function units, and the program counter, which determines when new instructions can be fetched into the issue unit’s internal buffers. The program counter is a value resource, where branch instructions use the program counter as a destination, and the fetch process uses the program counter as a source (to get the target fetch address). The issue bus is a functional resource that must be available for instructions to be able to issue to the function units. The issue unit in an RCM model is assumed to package instructions into groups that can be simultaneously issued (i.e. that issue to the function units on the same cycle). The issue unit fetch process loads instructions into an internal issue unit buffer, and each cycle some number of instructions are grouped together and issued to their respective function units. Each time an issue group is issued (i.e. the instructions are sent to the function units) the instructions must pass across the issue bus, and thus the issue bus resource is marked unavailable through the issue cycle. Later issue groups, therefore, cannot be issued before (or even simultaneously with) earlier groups because the (functional) resource availability time will indicate that the issue bus is in use. For control-flow altering instructions such as branches, the RCM simulation will always execute the branches in the workload execution order (i.e. later branches cannot be executed before earlier branches). Each branch instruction uses the program counter as a

32 destination value resource. Because the destination use of a value resource prevents following instructions from using that resource (as either a source or destination) until the current destination use finishes execution, a later branch cannot alter the program counter (another destination use) until the earlier branch has finished. Furthermore, because the fetch process requires the source use of the program counter, it too will be blocked from continuing until the branch target has been placed in the program counter. The issue unit will therefore be unable to issue an instruction from a branch target until after the branch is evaluated, thereby capturing the proper branching control-flow operation. Thus, as long as the branches are evaluated in the program specified order by the RCM model simulation, the control hazards will be correctly captured. The inclusion of branch prediction resource models should not, therefore, alter the ability of an RCM model to capture the control hazards in the workload in any way, since the branches are still resolved in program order. Of course, there remains a general difficulty in modeling mispredicted branch behavior using an execution trace since the code executed after branch misprediction is generally not included in the traces, and thus the wrong-path speculative execution effect on the processor resource availability can be difficult to ascertain. Data Hazards. Data hazards in the RCM model are captured through the value resources. There are three types of data hazards: the read-after-write (RAW), write-afterwrite (WAW) and write-after-read (WAR) hazards. In this discussion, we will refer to the issue order of instructions, rather than the trace or program order of the instructions. If the processor implements an in-order issue, then the issue order is identical to the program or trace order. In the limited form of out-of-order issue that RCM can support, the issue unit has to guarantee that data dependencies are not violated when two instructions are reordered. Thus, the issue unit would have to check the data dependencies of instructions as part of the issue criterion, so the processor’s issue unit would have to keep track of the dependence tree for the instructions in the issue window, and capture these hazards. The RAW data hazards occur when an earlier issued instruction sets a value (writes) and that value is then used as an operand by a following instruction (read). Because the RCM simulator issues the instructions in the proper data-dependence order, the data dependence will naturally be discovered through the value resource availability times. Because each instruction is completely evaluated before a following instruction, the

33 earlier issued instruction’s destination use of the value resource will alter the source use availability time, blocking the later instruction’s source use until the write is finished. For the WAW data dependencies, the earlier instruction’s destination use of the value resource will also alter the destination use availability time of that resource, so the following (later issued) instruction is blocked from writing to that resource until the preceding write has finished. Finally, the WAR data dependence requires that the value in a register not be changed until all previously issued reads are finished. The WAR hazards are also captured in RCM because a source use of a value resource alters the destination availability time and the resource is not available to the later issued instruction until after the last previously issued read has used the value. Note that this scheme satisfies the WAR dependencies. The ability of multiple instructions to use the value as a source operand is unaffected, however, because the source use of a value resource does not alter the source use availability time. When a sequentially following instruction accesses a resource as an operand, it will necessarily have to wait until the resource becomes available. Because of the two availability times of the value resources and the semantics of their use, the value resource will not become available until it holds the proper data value. Thus, the use of value resources, which logically model the availability of the data value in a resource, does satisfy the data dependencies.
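As a small worked example of how the two availability times capture all three data hazards (the cycle numbers are illustrative only), consider three instructions that touch general-purpose register R1:

    # Assume R1 begins with src_avail = dst_avail = 0.
    #   I1:  ADD  R1 <- R2 + R3    destination use of R1
    #        I1 writes R1 at cycle 10, so src_avail = 10 and dst_avail = 11.
    #   I2:  SUB  R4 <- R1 - R5    source use of R1 (read-after-write)
    #        I2 cannot read R1 before cycle 10; its read pushes dst_avail forward.
    #   I3:  MUL  R1 <- R6 * R7    destination use of R1
    #        I3 must wait on dst_avail, i.e. for both I1's write (write-after-write)
    #        and I2's read (write-after-read), before it may overwrite R1.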

2.3 REAP: An RCM Modeling and Simulation Tool Given the Resource Conflict Methodology described above, we have developed and implemented a processor modeling and simulation tool called REAP (for RCM Earlystage Analysis Program). REAP is not intended to show RCM modeling in its full breadth and generality, but rather to illustrate RCM modeling and to serve as a simple implementation of an early design cycle processor modeling tool that we can use for experimentation with and verification of the RCM concept. In developing REAP, we had two goals in mind. First, we wanted to develop a tool which demonstrated the real-world applicability of RCM modeling. Second, we wished to show the ability of RCM models to accurately estimate the early design cycle performance of the modeled processors for interesting workloads. In order to show some real-world

[Figure 3 block diagram; labeled elements: Instruction Cache, Issue Unit, Input Buffers, Decode, Function Units (FXU, FPU, BRU), Data Cache, and the register files GPR, FPR, CDR, LKR and CNT.]

Figure 3. Basic superscalar processor design: the 6k processor.

applicability, we will apply RCM modeling to a family of superscalar processors, based loosely on the IBM RS/6000 processor [26] [28], but we will limit the level of detail in our processor models to that which would be available in the early design stages.

2.3.1 The Superscalar Processor Family Modeled by REAP

In this chapter, we demonstrate the implementation of an RCM model for a family of superscalar processors based on the IBM RS/6000 processor design. In this section, we focus on one member of this basic superscalar processor family that is functionally similar to the RS/6000 (i.e. has a similar hardware configuration), and describe the resources used to model it in the REAP RCM simulator. This processor configuration, which we refer to as a 6k processor, has the design elements shown in figure 3. The 6k processor of figure 3 is a superscalar processor design supporting the simultaneous issue of up to three instructions per cycle. The processor has separate instruction and data caches (allowing simultaneous access to both) and three function units (FUs). Each of these function units services a disjoint subset of the instruction set:

35 the FXU executes the fixed-point and memory access instructions, the FPU executes the floating-point instructions and the BRU executes the branch and condition-register instructions. In Figure 3, the FXU is expanded in detail to show some of the internal configuration of a function unit. Each instruction issued to a function unit will enter a dedicated input buffer, where the instruction waits for the decode pipeline stage to become available. When the decode stage becomes available, the instruction at the head of the function unit input buffer enters the decode stage where it is analyzed and all data dependencies are checked. Once all the data dependencies are resolved and all the resources that the instruction needs are available, the instruction proceeds into the execution portion of the pipeline. Because all dependencies must be satisfied before the instruction enters the execution pipeline, the instruction cannot experience any data-dependence stalls within the execution pipeline. Figure 3 also shows a path from the FXU to the data cache; this path represents a data cache memory port. Those units that can access memory do not necessarily require a unique, dedicated memory port, and thus the memory ports (unlike the buses) can be a source of resource conflict. There are five register files shown in figure 3, corresponding to the general purpose (GPR), floating-point (FPR), condition (CDR), link (LKR), and count (CNT) registers. In the 6k processor, there are 32 registers in GPR, 32 in FPR, 8 in CDR, and one each in LKR and CNT. In the IBM RS/6000 processor, register renaming hardware is implemented for the floating-point registers; for the 6k processor, each of the register files can contain register renaming hardware. Each register file may also include a limited number of read and write ports through which the registers may be accessed. The internal structure of a register file is modeled in REAP to have the elements shown in figure 4. The register file shown in figure 4 contains a set of registers, a number of read ports, a number of write ports, and some register renaming hardware. The registers shown are the actual physical storage elements that constitute the memory of the register file. The read and write ports represent the access ports through which data can be read from (i.e. a copy taken from) or written to (i.e. a new value set in) a register. The register renaming hardware includes some register name mapping hardware and a free list. The register mapping indicates the physical register that is holding the data for each register name (i.e. each

[Figure 4 block diagram; labeled elements: Registers, Read Ports, Write Ports, Register Mapping, Free List, R/W Flag, Write Address (map register), Read Address (no remap).]

Figure 4. The REAP register file model.

architected register number). The free list contains a set of flags that indicate which of the physical registers are not currently mapped to an architected register, and thus are candidates for register remappings. In REAP, the only elements of the register file of figure 4 that are modeled with availability times are the physical registers and the register read and write ports; the register renaming hardware is used primarily for the bookkeeping of the renaming behavior, and is assumed never to cause a resource conflict. Note, however, that the mapping of registers from the free list to architected registers does utilize the physical register availability times (to ensure that a register is not remapped until it is write available). When an instruction in the processor accesses a register from this register file, access to either read or write a register number is requested. This number indicates an architected register, and thus the register file must first determine which physical register holds the data for the register number requested, by consulting the register mapping table. The register mapping indicates the physical location currently assigned to each architected register number. If the access is a read access, the requested data will reside in the identified physical register when the preceding definition of the register value finishes execution; because the physical register has a source use availability time associated with it, REAP can determine when that value is available to be read. The transmission of the data will also require the use of one of the register read ports, and thus the (functional) availability times of

37 those ports are also consulted. Because RCM simulation has already evaluated all previous instructions before this read access, the resource availability times will include the effects of those prior instructions’ executions. If the access is a write access and the instruction type supports register renaming (i.e. this instruction is remappable) then the register renaming hardware will assign a new physical register for the written data. A new register mapping is determined by selecting the first physical register from the free list that is available when the architected register value is to be written. That physical register number is entered into the mapping for the architected register number, therefore making that the current mapping. The previously mapped physical register is marked as free and added to the free list, indicating that the physical register can be used for future register mappings that occur after its availability time. Note that all preceding instructions have been fully analyzed, so the register can immediately be marked free (i.e. all prior reads to the register have been fully analyzed). Furthermore, the availability times of the physical registers reflect all prior register accesses. The transmission of the data into the physical register files will also require a register write port, and thus the write port availability times are checked to see when the destination access can occur. Not shown in figure 3 are the connections between the function units and the register files. Each function unit has dedicated operand acquisition and result buses, so no bus conflicts will occur in accessing the operand registers or writing back the results. The function units can also contain data forwarding paths which are used to send data from the result buses directly to a decode stage, whether it is in the same function unit or another, without having to pass through the register files. These forwarding paths effectively reduced the pipeline length by one stage. With the three function units, the superscalar 6k processor can sustain the execution of up to three instructions per cycle. Instructions are always issued to each function unit in order, and maintain their ordering through the individual function units. Instructions issued to different function units, however, may execute out of relative order: there is no communication between function units to maintain a global instruction ordering. This allows the processor to exploit more of the instruction-level parallelism available in the

38 code stream, particularly when function unit input buffers are large enough to hold stalled instructions so the issue unit can continue to feed other function units. In a 6k processor the issue unit fetches instructions to an internal buffer, and in each cycle determines the set of instructions that can be issued simultaneously (i.e. the issue group), and the instructions are issued to the function units. This issue group is determined by the issue bus width, the availability of space in the function unit input buffers, and the occurrence of branches in the instruction stream. The inclusion of a control-flow altering instruction within the instruction stream interrupts the instruction issue process until the branch target is known. The specific conditions that limit the size of an issue group are explained in more detail in section 2.3.3.4 where the operation of the REAP simulator is described. Note that this description of the family of superscalar processors that REAP can model has focused on the basic processor elements and on a single processor configuration (a 6k processor). REAP provides a large number of parameters that allow the user to specify other processor configurations (e.g. different buffer sizings, different function unit types, different numbers of function units, etc.). Several other superscalar processor models have been evaluated, as discussed in section 2.3.4, and several REAP extensions have been implemented to provide a richer set of hardware elements, as described in section 2.4.
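Returning to the register renaming operation described earlier in this section, the following sketch outlines the free-list bookkeeping in a simplified form. It is only a sketch of the idea: the class layout, member names, and the initial identity mapping are assumptions, not the REAP implementation.

    #include <cstddef>
    #include <deque>
    #include <stdexcept>
    #include <vector>

    struct PhysReg {
        int avail = 0;                      // cycle at which this physical register may be reused
    };

    struct RenameFile {
        std::vector<PhysReg> phys;          // physical registers
        std::vector<int>     map;           // architected register -> physical register
        std::deque<int>      free_list;     // physical registers with no current mapping

        RenameFile(int n_arch, int n_phys) : phys(n_phys), map(n_arch) {
            for (int a = 0; a < n_arch; a++) map[a] = a;              // assumed initial mapping
            for (int p = n_arch; p < n_phys; p++) free_list.push_back(p);
        }

        // Remap architected register 'a' for a write occurring at 'write_cycle':
        // take the first free physical register that is available by then, and
        // return the old physical register to the free list. Because all earlier
        // instructions have already been analyzed, the old register's availability
        // time already reflects every prior read of its value.
        int rename_for_write(int a, int write_cycle) {
            for (std::size_t i = 0; i < free_list.size(); i++) {
                int p = free_list[i];
                if (phys[p].avail <= write_cycle) {
                    free_list.erase(free_list.begin() + i);
                    free_list.push_back(map[a]);
                    map[a] = p;
                    return p;
                }
            }
            throw std::runtime_error("more live values than physical registers");
        }
    };

With the FPR file of figure 8, for example, such a structure would manage 40 physical registers behind 32 architected names, so that several older values can remain live while new mappings are created.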

2.3.2 The REAP Instruction Set Architecture: BRISC

Because an RCM model requires a set of instruction templates that define the instruction set model, we have developed a simplified instruction set architecture called BRISC (for Basic RISC) that we use as the base instruction set for our REAP processor investigations. The BRISC architecture was derived from the IBM POWER architecture, which was designed to operate in a scalar or superscalar processor implementation, and for which we had acquired the tools necessary for generating execution traces. The POWER architecture itself is not used in this dissertation primarily because the creation of instruction templates for the many instructions in the POWER ISA (over 200) would have required a large amount of effort, both to initially implement and verify the templates and later to update them as REAP was further developed. Note, however, that there is no reason why the full POWER instruction set could not be modeled with instruction templates for use in RCM. The BRISC ISA was intended to be a small though relatively complete instruction set, so that all of the POWER instructions in an execution trace could be mapped into corresponding instructions in the BRISC ISA. The actual process we employed was to coalesce many similar POWER instructions into a single representative BRISC instruction. For example, the IBM POWER architecture provides several different kinds of shift instructions, including sl (shift left), sle (shift left extended), sleq (shift left extended with MQ register), sllq (shift left long with MQ register), slq (shift left with MQ register), sr (shift right), sra (shift right arithmetic), and so forth. In the BRISC instruction set we have used a single BRISC SHFT instruction to represent all of these different POWER instructions. This process reduced the full POWER instruction set from over 200 instructions to 63 instructions. In coalescing the POWER ISA into the BRISC instructions, some amount of detail was abstracted away from the instruction set. Consider again the example given above for the POWER shift instructions. Each of the POWER ISA shift instructions indicates a slightly different form of execution for the RS/6000 processor, while the single BRISC SHFT instruction cannot distinguish these details. Our work, however, does not focus on detailed processor models, but rather on early stage models, and thus the loss of this detailed information is not critical. Still, the BRISC ISA retains many of the features of the POWER instruction set, including specialized link and count registers which reduce register pressure and avoid some stalls. The link registers are used to hold return addresses whenever a jump and link instruction is executed; thus, rather than using a general purpose register, the dedicated link register concept reduces register file pollution and localizes link register operations within the BRU. The count register is used to contain the loop-controlling index value; specialized end-of-loop instructions which branch based on the count register value allow the branch condition to remain constantly ready for evaluation, again reducing register file pollution and localizing the needed information within the BRU. The full BRISC instruction set descriptions are given in the appendix.
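As a small illustration of this coalescing step, a trace translator could use a many-to-one mnemonic table of the following form. Only the shift example discussed above is filled in; the remaining entries, and the function itself, are hypothetical.

    #include <string>
    #include <unordered_map>

    // Many-to-one table mapping POWER mnemonics to their BRISC representatives.
    static const std::unordered_map<std::string, std::string> power_to_brisc = {
        {"sl", "SHFT"}, {"sle", "SHFT"}, {"sleq", "SHFT"}, {"sllq", "SHFT"},
        {"slq", "SHFT"}, {"sr", "SHFT"}, {"sra", "SHFT"},
        // ... one entry per POWER mnemonic, coalescing similar instructions ...
    };

    // Return the BRISC mnemonic for a POWER mnemonic (identity if unlisted).
    std::string to_brisc(const std::string &power_mnemonic) {
        auto it = power_to_brisc.find(power_mnemonic);
        return it != power_to_brisc.end() ? it->second : power_mnemonic;
    }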


2.3.3 Implementing the REAP Processor Simulator

REAP is an implementation of the resource conflict methodology intended to produce execution cycle performance estimates for early design stage superscalar processor models taken from the family described in section 2.3.1. In order to simulate a given processor design, the user must provide three inputs to REAP: the instruction templates, the processor description file, and an input trace. The instruction templates describe the instruction set information for each of the instructions in the ISA model. Thus, an instruction template is used for each instruction of the modeled ISA to indicate the kinds of operands used, which operands are sources and destinations, the kinds of function units that can execute the instruction, and so forth. The processor description file is used to indicate the set of hardware resources implemented in the processor. The description file for REAP includes information about the processor's register files, the number of memory ports, the issue unit fetch pipeline length and maximum issue width, and a description of the function units implemented in the processor. The function unit description identifies the composition of the function unit, including the size of the function unit input buffer, the number of execution pipelines it contains, and their lengths. Finally, the input trace is a full execution trace, which provides the instruction execution sequence for a given workload program. Each of these inputs is described in the sections that follow. Once the three inputs have all been described, section 2.3.3.4 discusses the process of simulating an execution trace.

2.3.3.1 The Instruction Templates

The REAP instruction templates are described in an instruction set description file, with one template used to describe each instruction of the simulation instruction set. This section describes the REAP instruction templates with particular reference to the BRISC instruction set. As described in section 2.2, the instruction templates provide REAP with the types of resources used by each instruction, and thus include the information necessary for the simulator to determine the resources used by the instructions as they are read from the input trace file. Note, however, that the templates indicate only the types of resources used, not the actual resources or the times when each resource is used. Determining the actual resources used requires the processor hardware resource model and the trace input

;; This file contains the definition of the BRISC instruction set.
63      ; The total number of instructions in the ISA file
;---------------------------------------------------------
0       ; Opcode of instruction 0: add
1       ; Instruction execution class
“ADD”   ; Mnemonic for instruction
2       ; number of source/functional resources
R1      ; Uses a register resource in reg_file 1
R1      ; Uses a register resource in reg_file 1
1       ; number of destination resources
R1      ; Sets the value to a register in reg_file 1
;---------------------------------------------------------
1       ; Opcode for instruction 1: add immediate
...

Figure 5. Excerpt from the BRISC ISA file.

data as well, and determining the times when the resources are used requires knowledge of the processor resource availability times when the instruction is encountered. Each template identifies the instruction's opcode, its execution class, the mnemonic representation for the instruction, and the operand resources used by the instruction. Figure 5 shows an excerpt from the BRISC ISA file, giving the instruction template information for the fixed-point add instruction. For each instruction, the opcode value is used in uniquely identifying the instruction and provides a means for accessing the proper template information when an instruction is read from the trace. The instruction execution class identifies the type of processing required by the instruction. A set of instructions is assigned to an instruction execution class if they are modeled to use the same function units, execution pipelines, and so forth when they execute in the processor. There are no restrictions on the number of instruction execution classes that can be used in REAP or the identifier assigned to any particular instruction execution class. In the BRISC ISA that we use for our analyses in this dissertation, the instruction execution classes are assigned the identifiers shown in table 1. The operand resources are described in the templates using two distinct sections, one for the source (input) operands and another for the destination (output) operands. The operands are described by an operand type specifier; e.g., the operands in the template for the ADD instruction of figure 5 all come from general purpose registers, and thus are identified by the R1 notation. The R indicates that the associated operand is a register reference, and the 1 indicates that the register is from the register file with identifier 1. Each

Table 1. The instruction execution class identifiers for the BRISC ISA.

Class Identifier    Description
1                   fixed-point execution except multiply and divide
2                   floating-point execution except divide
3                   memory access instructions (loads and stores)
4                   condition register execution instructions
5                   control-flow altering instructions (jump, branch)
6                   fixed-point multiply
7                   fixed-point multiply with immediate data
8                   fixed-point divide
9                   floating-point divide

register file is assigned a unique register file identifier, and the instruction template information is used to identify the register file to which a register number from the trace refers. There are other operand identifiers which indicate the use of other hardware resources by the instruction. An M identifier indicates that the instruction uses a memory port during execution either as a source operand (to load a value) or as a destination operand (to store a value). Similarly, the B identifier as a destination operand indicates that the instruction may alter the flow of control (i.e. it is some form of branching instruction) and thus uses the program counter as an output operand. Additional operand identifiers are defined to identify special forms of the register operands, such as the update register in a load with update operation, or the register containing the value to be stored by a store operation. These special register operands are indicated in the instruction templates so that certain extensions to the basic REAP simulator can identify them (see section 2.4). The update register must be indicated, for example, when fast update register forwarding is implemented, and the register holding the value to be stored must be indicated in a store instruction for the pending store queue to be able to identify that operand. 2.3.3.2 The Trace Description. The second input to REAP is a BRISC execution trace. The execution trace is derived from an IBM RS/6000 POWER execution trace by translating the POWER trace to the required REAP trace format. The POWER traces are generated using either the

13   FMAD   6, 7, 1    => 1
1    ADDI   0, 25712   => 4
13   FMAD   6, 5, 3    => 3
0    ADD    0, 1       => 6
8    FADD   2, 3       => 1
12   FABS   0          => 0
43   FSTX   1, 28, 7
53   JMP    16344
27   LD     3, 32768   => 5
...

Figure 6. An excerpt from a REAP input trace file.

atrace [59] or xtrace [58] tracing package, each of which produces an execution trace of the user code for an RS/6000 executable file. In order to generate the BRISC traces, we developed a simple trace translation program that would accept the IBM POWER traces from atrace or xtrace, map the POWER instructions to their associated BRISC instructions, and output the REAP execution trace format. The REAP trace format requires that each trace line begin with the instruction opcode, followed by the instruction mnemonic (which is included only to increase readability), the list of source operands, and then the list of destination operands. The ordering of the operands must correspond to the ordering given in the instruction template file, because the operand position is used to correlate an operand value read from the trace with the operand descriptor from the template. Figure 6 shows a portion of a REAP trace. Looking at the ADD instruction of figure 6, the opcode (0) is the first element of the line, then the mnemonic (“ADD”), followed by the two source operands (0, 1), a special marker (“=>”) included for readability, and the destination operand (6). Recall that the instruction template given in figure 5 stated that the operands of the ADD instruction are all registers from register file 1, which is the general purpose register file. Thus, in this case the values in the two GPR registers 0 and 1 are added, and the result is placed in GPR register 6. Similarly, the FSTX instruction (which is a floating-point store using an index register) has three register operands: the FPR register to store (register 1), and the two GPR registers whose values are used to determine the effective address (registers 28 and 7). The JMP instruction has a single operand, which is an immediate value indicating the offset of the jump (i.e. the amount that is added to the program counter value to determine the new address). Because the actual data values in the registers are not important in


Figure 7. General format of the processor description file.

a timing performance simulator, this immediate data is ignored by REAP. Finally, the LD instruction is a fixed-point load instruction that uses a base register and an immediate data offset that is added to determine the effective address. In this example, the LD instruction uses GPR register 3 as the base register, has an immediate data offset value of 32768 (which is ignored by REAP), and loads the data into GPR register 5. Note that without the template information to indicate which register files contain each of the registers, the register numbers provided in the trace would be insufficient to identify unique register resources.

2.3.3.3 The Processor Description File

The final input to REAP is a description of the hardware resources available in the processor. This description is provided in the processor description file, which contains information about the processor's register files, ports and buses, and the function units. The overall format for this file is shown in figure 7. Register Files Description. The register file information in the processor description file describes the number of register files in the processor and the composition of each of the files. Recall that the register files in REAP have the form shown in figure 4, and thus provide an opportunity to investigate many different features. Figure 8 gives an excerpt from the register file section of a 6k processor description file, showing the GPR and FPR register files. Recall that a 6k processor has five register files, corresponding to the five register files shown in figure 3. In the processor description file, each register file is given a unique register file identifier: in this case a 1 indicates the GPR, 2 the FPR, 3 the CDR, 4 the LKR and 5 the CNT. These register file identifiers correspond to the i identifiers used in the Ri notation of the instruction templates (see section 2.3.3.1), and thus identify the register file to which each register operand of an instruction refers.

5       ; Number of Register Files

1       ; GPR reg file id
0       ; rename flag
32      ; 32 physical registers in the file
32      ; 32 architected registers for the file
3       ; 3 read ports
2       ; 2 write ports

2       ; FPR reg file id
1       ; rename flag
40      ; 40 physical registers in the file
32      ; 32 architected registers for the file
3       ; 3 read ports
1       ; 1 write port
...

Figure 8. Register files description for a 6k processor.

The information following the register file identifier indicates whether the register file contains renaming hardware, the number of physical registers in the register file, the number of architected registers for the register file, and the number of read and write ports for accessing the register file. Because the simulator is driven by an execution trace, the register file must provide the same number of architected registers in each register file as are represented in the trace. Thus, if the trace instructions use 32 general purpose registers, the processor description file must provide 32 architected GPR registers as well. With register renaming hardware, the register file could contain a different number of physical registers than there are architected registers. From figure 8, we see that the GPR register file of this 6k processor does not include register renaming hardware, but that the FPR does include register renaming hardware, providing 40 physical registers to hold the current values of the 32 architected registers (plus 8 older values). REAP does not require that the number of physical registers exceed the number of architected registers, so some experiments can be made using fewer physical registers than architected registers; however, if the number of live values in the simulated processor ever exceeds the number of physical registers, the simulator will abort. The minimum number of read and write ports in a processor is also restricted by REAP since the processor description file must always provide enough ports so that each instruction can access all its register resources in a single cycle. In the case of the BRISC

1       ; Number of machine memory ports
3       ; The machine issue width
1       ; The fetch pipeline length

Figure 9. The ports and buses section of a 6k processor description file.

instruction set, the store with index (stx) instruction includes three input GPR registers (the value to be stored, the base address register and the index register), while the load with update (ldu) instruction includes two output GPR registers (the destination for the loaded data, and the updated base register). Thus, the GPR register file must provide at least three read ports and two write ports. The minimum number of read and write ports for each of the other register files can be determined in a similar manner. Ports and Buses Description. After the register file description, the next portion of the processor description file describes the number of memory access ports, the issue bus maximum width and the depth of the fetch pipeline. Figure 9 shows this portion of the file for this 6k processor. Looking at Figure 9, note that the 6k processor contains only a single memory port, which restricts the number of data accesses that can be simultaneously initiated to the data cache and memory hierarchy. In REAP, memory ports are not specifically assigned to function units, but rather are dynamically accessed by those instructions that use a memory port as an operand. Note also that this 6k processor has a maximum issue bus width of three instructions, indicating that up to three instructions can be issued in the same cycle. REAP does not restrict the mix of instructions that can be issued together, except in the case of control-flow altering instructions (where the target address of the instruction would need to be known before the following instructions could be fetched). Finally, this 6k processor has a fetch pipeline length of one stage. The fetch pipeline length measure here is used to indicate the number of cycles required from the time that a new fetch target address is known until the instructions first become available in the issue unit (assuming an instruction cache hit). If an instruction cache access miss were to occur, then the added access time penalty would be added to this fetch pipeline length. Finite cache models are

[Figure 10 block diagram: input buffers feed a decode stage, which feeds execution pipes 1 through n; each pipe ends in a writeback stage that may be individual (per pipe) or shared by all pipes.]

Figure 10. REAP function-unit model.

discussed in more detail in section 2.4.5, though the discussion there concentrates on the data cache model. Function Units Description. The last section of the processor description file is devoted to the function units description, and provides information about the number and composition of the function units in the processor. The function unit model implemented in REAP is more general than the one shown in the FXU of figure 3; a REAP function unit can contain more than one execution pipeline. The general REAP function unit model is shown diagrammatically in figure 10. Each function unit has four main components: a dedicated FIFO input buffer, a decode stage, a set of execution pipelines, and one or more writeback stages. The input buffer receives instructions from the issue unit and holds them until the function unit’s decode stage is available (i.e. empty). When the decode stage empties, the input buffer shifts the next instruction into the decode stage (advancing all other instructions one stage toward decode). The decode stage then decodes the instruction, determines which execution pipeline the instruction will use, and accesses the required instruction operands. Each execution pipe contains the execution datapath for some subset of the instructions that the full function unit executes, and this subset is indicated in the execution pipeline by a set of instruction execution classes that the pipeline can execute. The sets of execution classes corresponding to the pipelines in a given function unit must be disjoint (i.e. no two pipe-

lines in a function unit can execute the same instruction classes) so that the decode stage of the function unit can determine, for a given instruction, the one execution pipeline that can execute that instruction. All resource dependence is checked in the decode stage, and the instruction is assumed to stall there until the dependencies are satisfied. The decode stage will not allow the instruction to actually start execution until it can advance through the entire execution pipeline without incurring any resource conflict stalls. Thus, all of the required source operand resources must be available in the cycle in which the instruction is to start execution, and the destination operand resources must be available by the time they will be required (i.e. at the end of the execution pipeline, in the writeback stage). Note that RCM could easily model other decode stage semantics, but this is what is currently implemented in REAP. All execution pipelines are assumed to be fully pipelined, and only one instruction can be in decode at a time (so in each function unit only one instruction can enter execution per cycle regardless of the number of pipelines in the unit). Note again that REAP allows only a single instruction to decode per function unit per cycle, but other implementations of RCM could allow more instructions to be decoded and begin execution per cycle. No stalls can arise within the execution pipeline, and thus there are never any resource conflicts in execution once the operand and resource dependences have been satisfied in the decode stage. Looking at figure 10, note that the function unit writeback stage can be modeled in two different ways in REAP: each pipeline can have a distinct writeback stage, or all pipelines can share a single writeback stage. This option is provided in REAP to allow the user to model different kinds of function unit pipeline organizations. The execution pipelines themselves are fully pipelined and cannot stall; a processor that does not use fully pipelined function units, however, can still be modeled. Consider, for example, a processor that issues all fixed-point instructions to a single fixed-point unit. Fixed-point multiplication and division involve more work than fixed-point addition or subtraction. First, consider the fixed-point function unit that includes fast multiply and divide hardware, and thus provides a fully pipelined fixed-point unit that requires the same number of stages to add, subtract, multiply or divide fixed-point num-

49 bers. This function unit can be modeled by a single execution pipeline that includes the number of pipeline stages required for these operations. Consider now a processor that does not include the fast multiplication and division hardware, but still allows the decode stage to begin executing one fixed-point instruction per clock cycle. Such a function unit would have to provide the fast fixed-point execution path (for addition and subtraction) and slower execution paths (for multiply and divide instructions). Furthermore, these paths would all have to be fully pipelined, and each execution path would have to provide for its own writeback so that the finished instructions do not stall the pipeline while they are waiting to write their results (so the instructions may finish execution out of relative order). This function unit would be modeled with more than one execution pipeline (i.e. one for add/subtract, and one for multiply/divide, or one each for multiply and divide) where each pipeline has its own writeback stage. Finally, consider a processor that physically includes a single execution datapath, but that does not include fast fixed-point multiply and divide hardware. In this case, the fixed-point unit is fully pipelined for the most common fixed-point instructions (i.e. fixedpoint addition and subtraction, the logical operations and so forth) but is not fully pipelined for fixed-point multiplication or division, where the same fixed-point datapath pipeline is used, but the stages are used repeatedly until the multiplication or division operation is completed. To model this physical processor hardware with REAP, the processor is given multiple execution pipelines of different lengths but all of the pipelines share a single writeback stage. This single writeback stage is thus a functional resource that all the instructions executed in this function unit must acquire (in order) as they finish execution. A divide instruction that uses a longer pipeline, and thus uses a different logical execution pipeline than a following instruction, will block the following instruction from starting execution until that following instruction would be able to begin execution, flow through its shorter pipeline, and acquire the writeback stage after the divide instruction releases it. Figure 11 shows the layout of the function-unit description section for a 6k processor’s FXU. Looking at figure 11, the number of function units is identified first, and then each function unit is described. In the function unit description, the first item is the unique function unit identifier, in this case 1, which is used primarily to identify the function units

3       ; Number of Function Units

; Function Unit 1
1       ; unique function unit id
1       ; shared writeback flag (0=dont share, 1=do share)
10      ; input buffer has 10 stages
1       ; 1 decode stage
3       ; 3 execution pipes
1       ; Unique execution pipe identifier
2       ; Number of classes handled by pipe 1
1       ; Class of instructions handled by pipe 1
3       ; Class of instructions handled by pipe 1
2       ; pipeline length for pipe 1
2       ; Unique execution pipe identifier
2       ; Number of classes handled by pipe 2
6       ; Class of instructions handled by pipe 2
7       ; Class of instructions handled by pipe 2
9       ; pipeline length for pipe 2
3       ; Unique execution pipe identifier
1       ; Number of classes handled by pipe 3
8       ; Class of instructions handled by pipe 3
19      ; pipeline length for pipe 3

; Function Unit 2
2       ; unique function unit id
...

Figure 11. The function unit description (for a 6k processor).

in the more detailed REAP output (see section 2.5). The next entry indicates whether the writeback stage is to be shared or whether each execution pipeline should have a unique writeback stage. At this time, there is no way within REAP to specify a function unit where some of the execution pipelines share a writeback stage and others have unique writeback stages. The next line indicates the number of stages in the function unit input buffer; the buffers can contain any number of stages, from 0 (indicating no buffer) to a large finite value. The next entry indicates the number of decode stages in the function unit. In this dissertation, we consider only processors that have a single decode stage, so this parameter should be considered available for expansion of the REAP simulator. The next line of the function unit description indicates the number of execution pipes in the function unit, which is followed by a description of each of the execution pipes.
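Because the class sets of the pipes within a function unit must be disjoint, the decode stage can select an instruction's execution pipeline with a simple lookup over those sets. The sketch below illustrates the idea; the structure and names are invented for the example and are not REAP's internal representation.

    #include <set>
    #include <vector>

    struct ExecPipe {
        int id;                  // unique pipe identifier within the function unit
        std::set<int> classes;   // instruction execution classes handled by this pipe
        int length;              // pipeline length in stages (through writeback)
    };

    // Return the one pipe that executes 'instr_class', or nullptr if the
    // function unit cannot execute that class at all.
    const ExecPipe *select_pipe(const std::vector<ExecPipe> &pipes, int instr_class) {
        for (const ExecPipe &p : pipes)
            if (p.classes.count(instr_class))
                return &p;       // disjoint class sets make this pipe unique
        return nullptr;
    }

For the FXU of figure 11, for example, the pipes would be {id 1, classes {1, 3}, length 2}, {id 2, classes {6, 7}, length 9} and {id 3, classes {8}, length 19}, so a memory access instruction (class 3) selects pipe 1 while a fixed-point divide (class 8) selects the 19-stage pipe.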

51 Each execution pipe is given its own identifier (which should be unique at least within the function unit) which is again used primarily to describe the flow of the instructions in the detailed simulator output. The number of execution classes executed by the pipeline is then given, and each execution class is listed. Finally, the physical execution pipe length is given (in pipeline stages) indicating the total number of cycles required for an instruction to move through the function unit using that execution pipeline (i.e. the number of stages in the execution pipeline including the writeback stage, plus the function unit’s decode stage). Note that the FXU described in figure 11 has three execution pipelines. The first of these pipelines executes two instruction execution classes: class 1 and class 3. Recall from table 1 that instruction execution class 1 corresponds to the fixed-point instructions (except the multiply and divide instructions) and class 3 corresponds to the memory access instructions. The second execution pipeline executes another two instruction execution classes: class 6, corresponding to the fixed-point multiply instructions and class 7 corresponding to the fixed-point multiply with immediate data instructions. Both of these instructions have been removed from instruction execution class 1 in order to provide them with a longer execution pipeline. Finally, the last execution pipeline executes instruction execution class 8, the fixed-point divide instructions, which have also been removed from instruction execution class 1 and given a very long execution pipeline. Note also that the FXU of figure 11 uses a single, shared writeback stage for all of the execution pipelines. Thus, this 6k processor’s FXU is fully pipelined for most of the fixed-point instructions and memory instructions, but the fixed-point multiply and divide instructions are not fully pipelined. Because the REAP 6k processor model requires following fixed-point and memory access instructions to acquire the writeback stage in order as they finish, the 6k processor model of figure 11 will result in the desired behavior. Figure 12 gives a comparison of this 6k processor’s FXU for a short sequence of instructions (see figure 12 (a)) when the function unit either includes a single shared writeback stage (b) or separate writeback stages (c) for each execution pipeline. Looking at figure 12 (b) we see that the divide instruction (with a pipeline length of 19 cycles) blocks the execution of the following two add instructions because the writeback stage would be unavailable to those instructions until the divide instruction has fin-

SUB   R1, R2  =>  R3
DIV   R7, R8  =>  R9
ADD   R1, R4  =>  R5
ADD   R3, R6  =>  R12

(a) the code sequence

[Cycle-by-cycle pipeline charts, not reproduced here:
 (b) FXU with shared writeback — the two ADD instructions are held behind the DIV until it releases the shared writeback stage;
 (c) FXU with distinct writeback stages — the ADD instructions flow through the short add/subtract pipeline and finish well before the DIV.]

Figure 12. A comparison of shared and not-shared writeback executions.

ished and released it. Conversely, figure 12 (c) shows that the function unit with multiple distinct writeback stages (i.e. each execution pipeline has its own writeback stage) allows the add instructions that follow the divide to execute without worrying about whether the divide has released its writeback stage. Thus, these instructions can execute in the physically distinct add and subtract execution pipeline (which has a pipeline length of only two cycles) and finish execution long before the divide instruction does. 2.3.3.4 Overview of the REAP Simulation Routines REAP simulation follows the general RCM simulation algorithm described above in figure 2. In the main loop, REAP fetches each instruction, identifies the resources that it

void consume_trace()
{
    Instruction * instr;
    int issue_time, start_time, finish_time, completion_time;

    // obtain the next instruction (from the issue unit)
    while ((instr = the_issue_unit->get_next_instruction()) != NULL) {
        instr_counter++;
        instr_class_counts[instr->get_class()]++;
        int ilat;
        int op_code = instr->get_opcode();
        int br_instr = machine_isa[op_code].alters_issue();

        // Check the availability times for the resources
        start_time = get_instruction_start_time(instr);
        ilat = instr->get_execution_pipeline_length();
        if (ilat > 0) {
            finish_time = start_time + ilat;
        } else {
            finish_time = start_time;
        }
        instr->set_start_time(start_time);
        instr->set_finish_time(finish_time);
        set_new_avail_times(instr);
    }
}

Figure 13. The core REAP simulation routine.

will use, determines the instruction’s issue start of execution and finish of execution times, and then updates the processor resource availabilities to indicate the execution of that instruction. The core of this simulation algorithm is the analysis of the instruction issue and execution times, which is done in the consume_trace routine illustrated in figure 13. Note that REAP is implemented in C++ and thus the figures in this section that include code fragments to illustrate the actual REAP simulator code use C++ notation. As figure 13 shows, the consume_trace routine works by examining each instruction in turn. The instructions are taken from the instruction trace using the issue unit’s get_next_instruction routine (or method), which reads in a line from the trace. The template information is used to analyze the line and determine which of the values in the trace line are register numbers, which are immediate values and so forth. This information is then combined into an internal instruction data structure that includes all of the instruction information, and a pointer to the data structure is returned by the get_next_instruction routine (where instr in figure 13 is set to point to this next instruction’s data structure).
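The sketch below suggests how a routine in the spirit of get_next_instruction might split a trace line into its fields before the template information is applied; it is a simplified stand-in rather than the actual REAP parser, and the structure and helper names are assumptions.

    #include <sstream>
    #include <string>
    #include <vector>

    // Fields of one REAP trace line: opcode, mnemonic, sources, "=>", destinations.
    struct TraceLine {
        int opcode;
        std::string mnemonic;
        std::vector<long> sources;        // register numbers or immediates, in template order
        std::vector<long> destinations;
    };

    TraceLine parse_trace_line(const std::string &line) {
        TraceLine t;
        std::istringstream in(line);
        in >> t.opcode >> t.mnemonic;
        std::string tok;
        bool after_arrow = false;
        while (in >> tok) {
            if (tok == "=>") { after_arrow = true; continue; }
            if (!tok.empty() && tok.back() == ',') tok.pop_back();   // strip "6," style commas
            (after_arrow ? t.destinations : t.sources).push_back(std::stol(tok));
        }
        // The instruction template looked up by t.opcode is what tells the simulator
        // which of these values are register numbers (and in which register file)
        // and which are immediates to be ignored for timing purposes.
        return t;
    }

Applied to the ADD line of figure 6, such a routine would yield opcode 0, sources {0, 1} and destination {6}, which the template for opcode 0 then identifies as registers in register file 1.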

int get_instruction_start_time(Instruction * iptr)
{
    // Start the available time (avail_time) at the issue time
    int issue_time = the_issue_unit->add_to_issue_group(iptr);
    int avail_time = issue_time;
    int ilat = iptr->get_execution_pipeline_length();

    // Go through the source registers to determine latest available
    int num_src_regs = iptr->get_num_src_regs();
    for (i = 0; i < num_src_regs; i++) {
        Register * sreg = iptr->get_source_reg(i);
        int atime = sreg->get_srcfunc_avail_time();
        if (atime > avail_time) {
            avail_time = atime;
        }
    }

    // Go through the destination registers
    int num_dest_regs = iptr->get_num_dest_regs();
    for (i = 0; i < num_dest_regs; i++) {
        // If register renaming, free up old mapping and remap
        if ((do_register_renaming) || (do_load_register_renaming)) {
            iptr->free_reg_mapping(i);
            iptr->map_register(i);
        }
        // Get the availability time and check if it is latest
        Register * dreg = iptr->get_destination_reg(i);
        int atime = dreg->get_dest_avail_time();
        if (relax_decode_requirements) {
            atime = atime - ilat;
        }
        if (atime > avail_time) {
            avail_time = atime;
        }
    }
    ...

Figure 14. Overview of the get_instruction_start_time routine of REAP simulation.

Determining the Instruction’s Start Time. In the general RCM simulation algorithm, the next step is to check the resource availability times to determine the times at which the instruction starts and finishes execution. In REAP, this is done in the get_instruction_start_time routine, which first determines the instruction issue time, and then the time at which the instruction starts execution. Figure 14 shows the basic form of the get_instruction_start_time routine. Issue Constraints. To determine the issue time for an instruction, the subroutine get_instruction_start_time calls the add_to_issue_group routine, which tries to add the instruction to the current issue group. Recall that an issue group is a group of instructions that can be (and are) issued to the function units in the same cycle. Because REAP analyzes each instruction in turn, the full issue groups are not generated before the component

55 instructions are analyzed, but rather as each instruction is analyzed the instruction issue time determination must consider whether the instruction can fit in the current issue group or must start a new issue group. There are no restrictions on the composition of an issue group in REAP, except that an issue group must contain instructions that are issued together, and that the group cannot be larger than the maximum issue width of the processor. There are actually three conditions that cause an instruction to start a new issue group: 1. the issue group is full (i.e. the maximum number of instructions are in the group) 2. the function unit input buffers are not available in time for this instruction to issue with the rest of the issue group, and 3. the last instruction in the issue group alters the flow of control. If any of these three conditions hold, then the instruction must start a new issue group. Note that condition 3 above assumes that there is no branch target prediction hardware in the processor, and thus that the branch instruction must issue, execute and finish before the target address (from which the current instruction was fetched) is known. In this case, the current instruction cannot be issued with the issue group that includes that branch instruction. When the instruction must start a new issue group, the new issue group’s issue time will be the time at which this instruction issues. The instruction issue time can be calculated as the maximum of three components: 1. one cycle after the issue time of the previous issue group, 2. the cycle at which the function unit input buffer is available, and 3. the cycle after the instruction is fetched into the issue buffer. The first of these times is simply the earliest time at which the issue group could possibly follow the previous issue group. The second time is the earliest time at which the instruction can be issued to the function unit that will execute the instruction, which might not be available until well after the previous issue group has been issued. Finally, the third time is the time at which the instruction is available from the issue buffer. If the last instruction of the preceding issue group is a control-flow altering instruction, then even once the target address is available the fetch pipeline must move that target address

56 instruction from the memory system into the issue unit’s input buffer, and this could determine the issue time. Starting Time Impact of Other Resources. Once the instruction issue time has been determined, the instruction’s execution times must be calculated. The execution times of an instruction are affected not only by the issue time, but also by the many other resources that the instruction requires for execution. In REAP, the calculation of the instruction execution times is split into the determination of the instruction’s starting time (i.e. the last cycle the instruction spends in decode) and the calculation of the finishing time, which is simply the starting time plus the length of the pipeline in which the instruction executes. The get_instruction_start_time subroutine completes the calculation of the instruction start time by determining which of the resources that the instruction requires results in the latest instruction start time. In REAP, some resources must be available before the instruction can start execution, while other resources do not need to be available for the instruction to start execution, but must be available by the time the instruction finishes execution. The get_instruction_start_time routine will take these considerations into account; the source operand registers, for example, must be available before the instruction can leave the decode stage, but the destination register need not be available until the instruction enters the writeback stage. Thus, the get_instruction_start_time routine will return the earliest time at which the instruction in question could start execution, Determining the Instruction’s Finish Time. Once the instruction’s start time is determined, the length of the instruction’s execution pipeline is added to the start time to determine the instruction’s finish time, which is the time at which the instruction will be in the writeback stage. Thus, the full execution profile for the instruction has now been determined; REAP has calculated the instruction issue, start and finish times Setting New Resource Availability Times. Given the instruction’s execution profile, the effect of the instruction’s execution on the processor resource state can now be calculated, and this is done by the set_new_avail_times routine. Figure 15 shows part of the set_new_avail_times routine. The set_new_avail_times routine determines the new resource availability times for each resource affected by the instruction execution, and sets those resource availabili-

void set_new_avail_times(Instruction * iptr)
{
    int start_time = iptr->start_time();
    int src_func_avail_time = iptr->start_time() + 1;
    int dest_avail_time = iptr->finish_time();

    // Set the function unit availability times
    Function_Unit * fu_used = the_issue_unit->get_fu_used();
    int buffer_avail_time = fu_used->get_earliest_decode_time();
    fu_used->set_buffer_avail_time(buffer_avail_time);
    fu_used->set_avail_time(src_func_avail_time);
    if (fu_used->shared_writeback()) {
        // Ensure sole ownership of the writeback stage.
        fu_used->set_wb_avail_time(dest_avail_time + 1);
    }

    // Set the execution pipeline availability times
    Execution_Pipe * epipe_used = iptr->get_execution_pipe();
    epipe_used->set_used(start_time);

    // Go through destination registers and set new available times
    for (i = 0; i < iptr->get_num_dest_regs(); i++) {
        Register * dreg = iptr->get_destination_reg(i);
        if (no_bypassing) {
            dreg->set_srcfunc_avail_time(dest_avail_time + 1);
            dreg->set_dest_avail_time(dest_avail_time + 1);
        } else {
            dreg->set_srcfunc_avail_time(dest_avail_time);
            dreg->set_dest_avail_time(dest_avail_time);
        }
    ...

Figure 15. Overview of the set_new_avail_times routine of REAP.

ties. In setting the new availability times for the resources, the different resources (functional, source value and destination value) are all affected differently. The value resources that are accessed as source operands will not have their availability times altered by the instruction execution, since the value itself is in no way altered. The destination value resources will be marked unavailable until the instruction finishes execution, at which time the new value will be available to following instructions. Finally, the functional resources will be marked unavailable until the cycle after the instruction uses them. Further distinctions are introduced when the processor execution model is extended to add other parameters, as can be seen in figure 15 where a special check must be made to handle the situation where function unit bypass paths have been enabled (see section 2.4.1). The no_bypassing check is made to determine whether the destination

58 value resources are available in the cycle where they enter the writeback stage, or the cycle after that (when they have been written to the register file). Once the new resource availability times have been determined and set, simulation of the current instruction is complete, and processing continues to the next instruction of the trace. Thus, processing continues again to the top of the consume_trace routine, where the next instruction of the trace is read in (using the get_next_instruction routine) and the same routines are used for this new instruction. Processing continues until all instructions of the trace have been evaluated. It is important to note that the REAP implementation of an RCM processor model that we have described here is not the only implementation that could be developed, and is in fact a relatively simple implementation. The REAP implementation itself will be extended later, in section 2.4, to add new types of resources and different controls to the processor execution model. Even those extensions, however, will not incorporate the full power of the resource conflict methodology, since REAP could be extended in ways that we have not implemented. It is important to realize that the limitations apparent in the REAP tool are not necessarily limitations of the overall resource conflict methodology. We discuss some of the limitations of RCM in section 2.4.6.
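To summarize the issue rules of this section in executable form, the sketch below restates the three-way maximum used to place an instruction that must begin a new issue group, together with the conditions that force a new group. The names are invented for the illustration; this is not the add_to_issue_group code itself.

    #include <algorithm>

    // An instruction must start a new issue group if the current group is full,
    // if its function unit input buffer is not available in time to issue with
    // the group, or if the previous instruction in the group alters control flow.
    bool needs_new_group(int group_size, int max_issue_width,
                         bool buffer_unavailable_in_time, bool prev_alters_flow) {
        return group_size >= max_issue_width || buffer_unavailable_in_time || prev_alters_flow;
    }

    // A new group issues at the latest of: one cycle after the previous group,
    // the cycle at which the function unit input buffer becomes available, and
    // the cycle at which the instruction is available from the issue buffer.
    int new_group_issue_time(int prev_group_issue_time,
                             int fu_buffer_avail_time,
                             int issue_buffer_ready_time) {
        return std::max({prev_group_issue_time + 1,
                         fu_buffer_avail_time,
                         issue_buffer_ready_time});
    }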

2.3.4 Comparing REAP to Cycle-by-Cycle Timing Simulators Given REAP, we have a means for determining the timing performance of processor organizations (taken from the family of superscalar processors of section 2.3.1) early in the design cycle. We now consider whether the instruction-by-instruction approach which RCM advocates has any inherent accuracy limitations versus a cycle-by-cycle timer model of the same processor. To investigate this question, we conducted a set of experiments comparing REAP to cycle-by-cycle timers. In order to compare the REAP and cycle-by-cycle timer models, we needed to decide what processor organizations we should use for the comparison. In order not to unduly favor one of the methods, we decided to compare the results of these two methodologies using identical processor models with identical workloads so that the only difference between the cycle-by-cycle simulation and the instruction-by-instruction simulation was in the atomic unit of work. The cycle-by-cycle timer models we developed were based

59 on the TRISC timer originally developed by Pradip Bose of IBM T. J. Watson Research Lab. [11] TRISC was initially developed to model a simplified RS/6000 style processor, and thus it does not support nondisjoint function units, where two function units can execute the same subset of instruction execution classes. Because of the difficulty in modifying this base timer model to accurately handle such organizations, we decided not to compare these kinds of processors, and instead focused on disjoint function unit processor configurations. The TRISC timer also did not support register renaming or limited register ports, so we did not include those features in this comparison. This left us with a limited set of processor parameters we could use in the comparison, and we defined four basic processor configurations (defined by their function unit organizations) that we would use for the comparisons. These four configurations are described in table 2. Table 2. Initial processor configurations for REAP vs. timer comparisons.

Processor Configuration    Function Units
xp                         1 (fixed-point, memory-access, branch and conditional) and 1 floating-point unit
6k                         1 (fixed-point and memory-access), 1 floating-point, and 1 (branch and conditional)
4u                         1 fixed-point, 1 memory-access, 1 floating-point and 1 (branch and conditional)
5u                         1 fixed-point, 1 memory-access, 1 floating-point, 1 branch and 1 conditional unit

The actual versions of these processor configurations that we compared were developed by setting various other parameter values, such as the number of function unit input buffers, the maximum issue width and so forth. Note that no model was developed for a processor with a single function unit since our work is targeted to superscalar machines, and if the more complicated superscalar processor models are accurate, then we can reasonably assume that the scalar machine model would be as well. Once we had selected the set of processor configurations and parameter sets (i.e. the processor versions we were going to compare), a REAP processor description file was developed for each one. We also developed and implemented a cycle-by-cycle timer model corresponding to each processor version that had the same level of detail as the REAP processor description file. This required the generation of a separate timer model (i.e. program) for each processor configuration, but the versions of the configurations could be modeled by changing parameters in the timer models. Thus, we wrote four

parameterized early design-stage timer programs that allowed the user to define the length of the execution pipelines, the number of input buffers, the issue width, and so forth. We then selected a set of parameter values to use in comparing the TRISC-based timers and REAP, resulting in 24 different processor versions that we tested as shown in table 3.

Table 3. Processor models used in the REAP vs. timer comparisons.

Processor   Processor     Execution Pipeline   Max Issue   Input Buffer
Class       Name          Lengths              Width       Stages
xp          xp            2/3                  2           0
            xp-b1         2/3                  2           1
            xp-b10        2/3                  2           10
            xp-lfp        2/6                  2           0
            xp-lfp-b10    2/6                  2           10
            xp-i1         2/3                  1           0
6k          6k            2/3/2                3           0
            6k-b1         2/3/2                3           1
            6k-b10        2/3/2                3           10
            6k-lfp        2/6/2                3           0
            6k-lfp-b10    2/6/2                3           10
            6k-i2         2/3/2                2           0
4u          4u            2/2/3/2              4           0
            4u-b1         2/2/3/2              4           1
            4u-b10        2/2/3/2              4           10
            4u-lfp        2/2/6/2              4           0
            4u-lfp-b10    2/2/6/2              4           10
            4u-i3         2/2/3/2              3           0
5u          5u            2/2/3/2/2            5           0
            5u-b1         2/2/3/2/2            5           1
            5u-b10        2/2/3/2/2            5           10
            5u-lfp        2/2/6/2/2            5           0
            5u-lfp-b10    2/2/6/2/2            5           10
            5u-i4         2/2/3/2/2            4           0

We then applied a set of simple test case loop codes to the REAP processor description files and cycle-by-cycle timer models and compared the cycle-by-cycle timer results to those of the REAP model. Comparing the processors of table 3 for cycle-by-cycle compatibility, we determined that REAP was able to generate a cycle-by-cycle execution profile that exactly matched that of the corresponding cycle-by-cycle timer model.

We then compared the same set of processors using a set of test cases taken from the Livermore Fortran Kernels (LFKs). [54] The LFK test cases were derived from the Livermore Fortran Kernel test set by extracting just the execution of the kernel codes themselves. This resulted in 22 test cases for the 24 kernels (kernels one and two were not traced) that contain between 32,023 and 385,211 instructions. With the longer executions from these LFK test programs, it was too time-consuming to do a cycle-by-cycle comparison for every cycle of each processor model. The shortest LFK test case required between 42,020 cycles on the 5u-b10 processor organization (the most parallel processor tested) and 53,034 cycles on the xp processor organization (the least parallel processor organization). Similarly, the longest test case required between 446,313 cycles for the 5u-b10 processor and 642,168 cycles for the xp processor organization. Because there was no automated way to compare the outputs directly, we compromised by spot-checking the full outputs at various locations, and by comparing the total execution cycles of the two performance estimates. The results of these comparisons showed that REAP and the TRISC-based timers were still generating the same total execution cycles, and the portions of the trace execution profiles that were compared did match exactly.

We then moved on to some much longer traces we obtained by sampling the execution of the femc program, which is an application program written by the University of Michigan Radiation Laboratory that uses a finite element mesh analysis to simulate electromagnetic backscatter from irregular objects. [15] Using our test input files, the full run of this program required more than three billion instructions to complete, so we took a set of 1-million instruction long trace samples (125 million instructions apart, spanning the first half of a full femc run). This resulted in twelve test case traces sampled from various parts of the femc execution, with execution time estimates between 692,806 and 777,912 cycles for the 5u-b10 processor organization and between 862,086 and 1,680,789 cycles for the xp processor organization. Rather than try to compare the millions of cycles that result from the full set of femc test case simulations, or even to store this output so that it could be spot-checked, we instead simply compared the total execution cycles from REAP and the TRISC-based timers. Once again, these execution cycle counts were found to be identical for the tested processor organizations, and thus we concluded that the RCM instruction-by-instruction simulation can provide the same accuracy as a cycle-by-cycle timer simulation model. There appears to be no inherent accuracy disadvantage in the use of instruction-based simulation rather than cycle-by-cycle based simulation.

2.4 Extending the Basic REAP Processor Models

The REAP tool we have described so far is capable of modeling a number of interesting processors, but these processor models are admittedly basic. There are a number of additional processor organizational elements whose impact we might like to consider, such as branch predictors, pending store queues, reorder buffers and finite caches. To help consider the effect of these kinds of resources on the overall processor performance, and to show the flexibility RCM has in its ability to incorporate these more complex processor elements, we have added command-line switches to REAP that allow the user to include some of these processor elements in the basic processor models. In addition to these new resources, we have also added switches that allow the user to specify some aspects of the processor execution model. The sections below describe the command-line switches that we have added and the way they are modeled in REAP.

2.4.1 Controlling the Processor Execution Model

There are a number of switches which the user can specify in order to alter the processor execution model. These switches, shown in figure 16, illustrate how an RCM model can incorporate options that allow it to compare different execution models. The first two switches in figure 16 refer to the register renaming model implemented for the processor. Recall that the REAP register file model allows the processor register files to include hard-

+noWAWorWAR   : eliminate WAW and WAR checking
+load_rename  : enable only load renaming
+fast_update  : forward update addresses ASAP
+non_pipe_mem : do not use a pipelined memory access model
+no_bypass    : include no result bypassing

Figure 16. Command line options which control the processor behavior.

ware for register renaming (see figure 4). The inclusion of this hardware is indicated by the renaming value set for each register file in the processor description file (see figure 8). The first switch of figure 16 is the noWAWorWAR switch, which allows the user to turn off the checking for write-after-write and write-after-read dependence conflicts. Turning off the WAW and WAR dependence conflicts provides an estimate of the performance the processor could achieve if perfect register renaming were implemented, and thus gives an optimistic performance benefit from renaming. In REAP, this is implemented simply by not checking the destination use availability time of the destination registers, so that an earlier write or read to that register, which alters the destination use availability time of the register, will not delay the current instruction. The default register renaming model implemented in REAP supports full renaming, where a register file with renaming hardware will rename any register that is written during execution (i.e. all destination registers from that file are remapped). The second switch of figure 16 selects a more conservative form of register renaming. The load_rename switch indicates that REAP should only rename registers that are the loaded data destination registers of load instructions, which is the renaming policy used on the original IBM RS/6000 processor. [32] Renaming is restricted to the load renaming policy in REAP by identifying the load instructions by their use of a memory port as a source operand in the instruction template. The load instruction templates have all been designed so that the loaded register is the first destination operand (e.g. in a load with update instruction, the destination register for the loaded value is always the first destination register listed), so the loaded register is easily identified, and only the loaded registers are renamed. We experimented with different switches for the LFK test cases on a p2 processor configuration in order to see what kind of benefit register renaming could provide. The p2

[Figure: speedup (y-axis) for each LFK test case (x-axis), comparing register load renaming with full register renaming.]

Figure 17. Speedup obtained with different levels of renaming (relative to the processor with no renaming hardware).

configuration includes two identical FXUs that execute the fixed-point and memory-access instructions (with a pipeline length of 3 cycles in this version), two fast FPUs that execute all floating-point instructions (with a pipeline length of 2 cycles in this version), and two BRUs that execute both the branch and condition-register instructions (with a pipeline length of 2 cycles in this version). The comparisons were made for a processor that included no register renaming hardware, and a processor that included 40 physical fixed-point and 40 physical floating-point registers (each architected with 32 registers) where the register renaming is done using either a full renaming or a load renaming policy. The results of these tests are given in figure 17, which indicates the speedup obtained when the LFK test case is executed on a processor with either load or full renaming versus the processor with no renaming.

Looking at these results, we see that load renaming can lead to a significant performance improvement and that the full renaming does not present a very large improvement over the load renaming. Interestingly, the full renaming results and the perfect renaming (noWAWorWAR) results were very nearly identical, and showed no difference at the resolution of the graph. Of course, it is important to remember that the compiler used to generate the executable files from which the traces were gathered is an optimizing compiler for the IBM RS/6000, and thus the code that the compiler generated would have been optimized for the load renaming of the FPR registers. Because the LFK test case kernels are primarily floating-point calculation loops, the compiler's code schedule may be skewing these results to favor the load renaming scheme, making load renaming appear sufficient; a different compiler could produce much better performance with full renaming.

The fast_update switch of figure 16 is used to indicate that the processor has been implemented so that the update values (from load and store with update instructions) will be generated and forwarded to dependent instructions as soon as possible. In REAP, this fast forwarding of the address register value is modeled to occur at the end of the effective address calculation, which is the second stage of the memory access execution pipeline. Thus, if a load with update instruction begins execution in cycle i, then a following instruction that requires the updated address register value could begin execution in cycle i+1 (because the address register value is not used in decode, but rather in the address generation stage that follows decode). This forwarding of the update value can impact performance particularly when an inner loop body ends with a store with update, and the start of the next iteration requires that updated register value. We compared the execution times for the LFK test cases again on a base p2 processor configuration without any register renaming hardware. The FXU execution pipelines were increased to six cycles in order to highlight the impact that the fast forwarding can have. Figure 18 shows the speedup obtained by using the fast forwarding of update values over the same processor without fast forwarding for the LFK test cases. Clearly, fast updating can significantly improve the overall workload performance, at least for the LFK test cases. Note, however, that the benefit achieved depends upon the processor organization, e.g. a processor with a short memory access pipeline would show less benefit than one with a longer pipeline.

The next switch of figure 16 is the non_pipe_mem switch, which is used to indicate that the memory accesses cannot be pipelined through the memory ports. REAP generally assumes that a memory port can service a new access in every cycle, with the results of these accesses returning some time later (one cycle apart) across another portion of the memory access port, i.e. the memory port in REAP can sustain an access from the processor to memory and the return of data from memory to the processor every cycle. The

[Figure: speedup (y-axis) for each LFK test case (x-axis) when update values are forwarded early.]

Figure 18. Speedup when address computation is forwarded from the memory access instructions with register update.

non_pipe_mem switch is used to indicate that each memory access must finish before any other memory access can be initiated on that port. Thus, each memory access blocks further memory accesses through that port for some number of cycles, blocking following memory access instructions (but not the pipeline since the memory access instructions will advance through the pipeline while they wait for their data to be returned, and other instructions that would use the pipeline but not use the memory access ports can similarly advance through the pipeline). We compared the performance of the LFK test cases on a 6k processor configuration that had an FXU pipeline of six stages, and shorter FPU and BRU pipelines of two stages each. The FXU pipeline handles both memory and fixed-point instructions, and thus the longer pipeline length is used again to help highlight the impact that the memory pipelining can have on the overall execution. Each function unit was also provided with a 10-stage input buffer. Figure 19 shows the speedup obtained by using pipelined memory ports over the same processor without pipelined memory ports for the LFK test cases. Finally, the no_bypass switch of figure 16 turns off all result bypassing in the processor. REAP generally assumes that the writeback stage can forward results to waiting

[Figure: speedup (y-axis) for each LFK test case (x-axis) when the memory ports are pipelined.]

Figure 19. The speedup obtained by using pipelined memory port access relative to the use of nonpipelined memory access.

instructions without having to send them to the registers and then have the waiting instructions read them from the registers. The no_bypass disallows those bypass paths, effectively increasing each pipeline length by one stage, which would clearly result in an overall reduction in the processor performance. If we implemented a larger set of processor bypasses, or wished to selectively turn on or off certain function unit bypasses, controls could be provided in REAP to do so, though this is not considered further here. A similar effect to the no_bypass switch could also be achieved within REAP by changing the pipeline latencies specified in the function unit description section of the processor description file, though this might also alter the conflict pattern for shared writeback stages and register write ports in some unintended manner.
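Taken together, these switches simply change which availability times REAP consults or assigns when an instruction is analyzed. The following fragment is a hypothetical illustration of that idea for the +noWAWorWAR and +fast_update switches; the function and variable names are invented for this sketch and are not the actual REAP identifiers.

    /* Hypothetical sketch of two execution-model switches.  +noWAWorWAR drops
     * the destination-availability check when an instruction is scheduled,
     * and +fast_update makes the address register of a load/store-with-update
     * available one cycle after the instruction starts rather than when it
     * finishes.                                                              */
    #include <stdio.h>

    static long max2(long a, long b) { return a > b ? a : b; }

    /* Scheduling side: earliest start given source readiness, unit availability
     * and (unless +noWAWorWAR is set) the destination-use availability.       */
    static long earliest_start(long src_ready, long unit_free, long dest_free,
                               int no_waw_or_war)
    {
        long start = max2(src_ready, unit_free);
        return no_waw_or_war ? start : max2(start, dest_free);
    }

    /* Producer side: cycle at which the updated address register becomes
     * available to later instructions.                                        */
    static long update_reg_ready(long start_cycle, long finish_cycle,
                                 int fast_update)
    {
        return fast_update ? start_cycle + 1 : finish_cycle;
    }

    int main(void)
    {
        /* A store-with-update starting at cycle 4 and finishing at cycle 9.   */
        printf("updated address register ready: cycle %ld (normal), cycle %ld (+fast_update)\n",
               update_reg_ready(4, 9, 0), update_reg_ready(4, 9, 1));

        /* A following instruction whose destination register is busy until 12. */
        printf("next instruction starts: cycle %ld (default), cycle %ld (+noWAWorWAR)\n",
               earliest_start(5, 2, 12, 0), earliest_start(5, 2, 12, 1));
        return 0;
    }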

2.4.2 Adding Branch Prediction Models

The effect of branches on the execution of a workload can become a significant factor in the overall performance achieved. When a branch is encountered by a processor, the processor must determine the branch target before the next set of instructions can be fetched from memory. In order to acquire a (likely) branch target as soon as possible,

+perfect_brpr   : use perfect branch prediction
+static_brpr n  : use static br pred (n = mispredict penalty)
+perf_resolved  : perfectly predict resolvable branches
+exec_resolved  : execute resolvable branches
+branch_fold    : include branch folding

Figure 20. The command line options to define the branch prediction scheme.

many processors have implemented some form of branch prediction. The branch prediction and execution switches we have added to REAP are shown in figure 20. The perfect_brpr switch of figure 20 indicates to REAP that it should model a processor that includes branch prediction hardware that correctly predicts any branch encountered. By allowing the user to model a processor with perfect branch prediction, the potential benefits from including branch prediction hardware can be ascertained. In REAP, the perfect branch prediction is modeled by removing the restriction that requires an issue group to end whenever a branch is added to the group, and by removing the fetch restrictions of the processor. Clearly, this results in an optimistic model for the branch prediction benefit that might be obtained. To examine the perfect branch prediction model, we compared the performance of a 6k processor configuration with and without the perfect branch prediction switch enabled. For this study, the FXU pipeline length was set to 3 stages, the FPU pipeline length to 2 stages, and the BRU pipeline length to 6 stages to highlight the effect of the branches; all of the pipelines were also given 10-stage input buffers. The fetch pipeline length was increased from one cycle to three cycles to further emphasize the impact of the branches, and no register renaming scheme was used. The speedup obtained by adding perfect branch prediction for the LFK test cases to this processor over the same processor without any branch prediction is shown in figure 21. Note that not all of the LFK test cases of figure 21 show an improvement for this processor model even when perfect branch prediction is utilized, which indicates that branch execution is not impacting the performance of some of the LFK loops. Perfect branch prediction is an ideal which cannot realistically be achieved, and thus it is important to consider the benefits of less optimal forms of branch prediction. In order to implement such a nonideal branch prediction scheme within REAP, we elected to

[Figure: speedup (y-axis) for each LFK test case (x-axis) when perfect branch prediction is added.]

Figure 21. Speedup obtained when perfect branch prediction is used.

implement a static branch prediction scheme based on the scheme used in the IBM PowerPC 601 processor. [57] The static_brpr switch of figure 20 is used to indicate that the processor includes a static branch prediction scheme. Each branch instruction that is read from the trace will include a prediction bit, and the branch will be predicted according to the value of that bit. If the branch is mispredicted, then a mispredict penalty (in cycles) is added to the branch execution pipeline length, indicating additional cycles needed to recover from the mispredict. The REAP static branch prediction implementation does not place a restriction on the way the static prediction bit is set; the static bit is read from the trace file along with each branch instruction. In order to compare the effect of different levels of static branch prediction, we implemented the PowerPC 601 processor’s static prediction algorithm to set the prediction bit for each branch in a trace, and then used the static branch prediction switch (rather than the perfect branch prediction switch) for the same buffered 6k processor used to generate the data of figure 21. We then simulated the execution of the full set of LFK test cases on this processor (with a misprediction penalty of one cycle) for three different prediction schemes. The results of this analysis are shown in figure 22.
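Before turning to those results, note that in timing terms the static prediction model reduces to a small adjustment of the branch's effective execution length. The sketch below illustrates that computation under the assumption that the trace supplies both the static prediction bit and the resolved branch direction; all names are hypothetical rather than taken from REAP.

    /* Minimal sketch of the static branch prediction timing model: a branch
     * that is predicted correctly uses the normal branch pipeline length,
     * while a mispredicted branch pays an additional recovery penalty.      */
    #include <stdio.h>

    struct branch_rec {
        int predict_taken;   /* static prediction bit carried in the trace */
        int actually_taken;  /* resolved outcome of the branch             */
    };

    static long branch_latency(const struct branch_rec *br,
                               long branch_pipe_len,    /* BRU pipeline stages     */
                               long mispredict_penalty) /* the 'n' of +static_brpr */
    {
        int correct = (br->predict_taken == br->actually_taken);
        return correct ? branch_pipe_len
                       : branch_pipe_len + mispredict_penalty;
    }

    int main(void)
    {
        struct branch_rec hit  = { 1, 1 };
        struct branch_rec miss = { 1, 0 };
        printf("predicted: %ld cycles, mispredicted: %ld cycles\n",
               branch_latency(&hit, 6, 1), branch_latency(&miss, 6, 1));
        return 0;
    }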

[Figure: speedup (y-axis) for each LFK test case (x-axis) for three cases: predict all branches, predict cnd/ctr branches, and predict bcnd branches only.]

Figure 22. Speedup obtained from applying static branch prediction to the LFK test cases.

In figure 22, the speedup obtained using static branch prediction is plotted relative to the same processor with no prediction for three different cases. The first case includes the prediction of all branch types in the workload. The second case does not predict the unconditional branches but all other branch types are predicted (i.e. the condition-register (bcnd) and count register (bctr) branches). The third case predicts only the bcnd conditional branch instructions. All branches that are not predicted in the second and third cases are executed normally (i.e. they follow the standard execution path for the branches, and must be resolved before their target address is known). From the graph, note that branch prediction has increased the performance of most of the LFK test cases, and in some cases dramatically so. There are also cases, however, where the penalties of static branch mispredictions resulted in a performance degradation (e.g. lfk13, lfk16, lfk20, lfk22 and lfk24 when only the bcnd are predicted). Note also that the prediction of all branches produced better performance than predicting only a subset of the branches, which makes sense in light of the long branch execution path in this processor. There are also two modifying switches implemented for the static branch prediction: the perf_resolved and the exec_resolved switches. Some of the branches are resolvable when they are encountered by the issue unit; branches such as the unconditional

[Figure: speedup (y-axis) for each LFK test case (x-axis) for three cases: statically predict resolved branches, execute resolved branches, and perfectly predict resolved branches.]

Figure 23. Speedups obtained using different techniques to handle resolvable branches.

branches and count-register based branches include their data, and could be resolved immediately, avoiding either a wait for execution or a branch misprediction penalty. The perf_resolved switch indicates that such resolvable branches should be modeled by being perfectly predicted, while the exec_resolved switch indicates that resolvable branches should be executed rather than predicted; the default case is to statically predict even the resolved branches. Of course, the resolvable branches are often the more easily predicted branches, such as the loop ending branches, and thus the difference in performance between the perfect prediction and static prediction of such branches may not be particularly large. Using the same 6k processor model from the study in figure 22, we ran the LFK test cases for the static prediction of all branch types, and compared the performance for the case where the resolvable branches are statically predicted, where they are executed, and where they are perfectly predicted. Figure 23 shows the speedup of each method over a processor with no branch prediction. Looking at figure 23, we note that the execution of the resolved branches for this processor results in a considerably lower performance. The reason for this is that many of the branches in the LFK test case codes are loop-ending branches based on the count reg-

ister value. While these branches are resolvable at the time they are issued, the execution path for branches is so long in this processor that these branches still benefit from prediction because of the effective reduction in the execution path that a prediction provides. Comparing the static prediction to the perfect prediction of resolvable branches, we note that the static prediction does quite well for most of the LFK test cases. For some of the LFK tests (e.g. lfk13, lfk14, lfk15 and lfk16) the static prediction of the resolvable branches performs much worse than the perfect prediction of those branches. This results from the misprediction of inner-loop, loop-ending branches by the static predictor in those codes, where the penalty cycles accumulate to reduce the overall performance benefit of the branch prediction. Recall from figure 22 that these same test cases showed little if any benefit from branch prediction at all.

The final branch control switch of figure 20 is not really a branch prediction switch so much as another execution model modifier. The branch_fold switch indicates that the branches, when encountered in the fetch stream, should be pulled out and sent to the branch resolution hardware without moving through the issue unit. Thus, the branch_fold switch simulates the IBM RS/6000 processor where the intelligent cache unit both fetches instructions from memory and resolves the branch instructions before sending the instruction stream to the issue unit. [26] [27] [60] When branch folding is enabled, therefore, the branch instructions bypass the issue unit and can be executed earlier since they effectively issue (out of order) as soon as they are fetched. Furthermore, the issue unit will not have to issue the branch instructions, requiring that fewer instructions be issued overall, and the issue unit will never encounter any branch instructions. Because the issue groups will never be broken by branch instructions, they can become larger, promoting more processor execution parallelism.

We compared the performance for a p2 processor configuration where the FXUs have three stage pipelines, the FPUs have two stage pipelines, the BRUs six stage pipelines, and each function unit has a 10 entry input buffer. The fetch pipeline length was set to only one stage, and no register renaming or branch prediction was enabled. The speedup obtained by the processor with branch folding over the same processor without branch folding for the LFK test cases is shown in figure 24.
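The effect of branch folding on issue-group formation can be illustrated with a small sketch. The code below is hypothetical (the names and the simplified issue model are invented for this example); it shows only how a folded branch stops occupying an issue slot and stops terminating the issue group.

    /* Sketch of how branch folding might change issue-group formation: with
     * folding enabled, branches are diverted to the branch unit as they are
     * fetched and no longer occupy issue slots or end the issue group.      */
    #include <stdio.h>

    enum iclass { IC_FX, IC_FP, IC_MEM, IC_BR };

    /* Returns the number of instructions from insts[0..n) placed in the next
     * issue group, given the maximum issue width and the folding switch.    */
    static int form_issue_group(const enum iclass *insts, int n,
                                int max_issue, int branch_fold)
    {
        int issued = 0;
        for (int i = 0; i < n && issued < max_issue; i++) {
            if (insts[i] == IC_BR) {
                if (branch_fold)
                    continue;        /* folded: sent straight to the BRU     */
                issued++;            /* branch issues normally ...           */
                break;               /* ... and ends the issue group         */
            }
            issued++;
        }
        return issued;
    }

    int main(void)
    {
        enum iclass loop_body[] = { IC_MEM, IC_FX, IC_BR, IC_FP, IC_MEM };
        printf("group size without folding: %d\n",
               form_issue_group(loop_body, 5, 6, 0));  /* 3 (ends at branch) */
        printf("group size with folding:    %d\n",
               form_issue_group(loop_body, 5, 6, 1));  /* 4 (branch skipped) */
        return 0;
    }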

[Figure: speedup (y-axis) for each LFK test case (x-axis) when branch folding is added.]

Figure 24. Speedup obtained from the addition of branch folding to the LFK test cases.

Note that in figure 24 the branch folding only improved the performance of a few of the LFK test cases, but again this result is highly dependent upon both the processor and the workload. For branch folding to produce a performance improvement, the test case must incur some number of issue unit stalls either because the branch targets are determined too late (creating fetch stalls) or the issuing of branches breaks up the issue groups and starves some function units. The addition of branch folding will remove the branches from the issue stream, and can effectively issue them earlier to the branch execution units, but does not shorten the execution time for the branches (as does a correct branch prediction). Many of the LFK test cases, therefore, do not show much benefit from just the branch folding, but instead would require more aggressive means for handling branches if significant performance improvements are to be realized. Furthermore, recall that some of the LFK test cases did not show significant speedups even when perfect branch prediction was employed (see figure 21) which implies that branch folding will not improve their performance much either.

+reorder_buffer n c : add n entry (complete c/cycle) reorder buffer

Figure 25. The command line option to specify a reorder (completion) buffer to REAP.

2.4.3 Adding a Reorder Buffer Model

Reorder buffers provide a means for generating precise interrupts in processors that utilize out-of-order issue, access-execute slip, or out-of-order execution. In such processors, instructions are allowed to execute and finish in any relative order, but are required to commit their changes to the processor state in sequential (program) order. [70] A reorder buffer is essentially a FIFO queue of instruction tags that are used to control the completion order of the instructions. For the issue unit to issue an instruction, a slot must be available in the reorder buffer to hold a tag for that instruction. When the instruction has finished execution and is ready to commit its results to the processor state, the instruction will send its result data to temporary storage space. This data will not be committed to the actual processor state until all instructions that were issued before this instruction (i.e. those with tags that appear before this instruction's tag in the reorder buffer) have finished execution and committed their changes. The reorder buffer can accept up to N instructions per cycle (where N is generally the maximum issue width) and can complete C instructions per cycle (where C is also commonly the maximum issue width). When an interrupt or exception occurs during execution, any instructions which were already completed have been committed, changing the global processor state. Any finished instructions that have not yet completed can be completed at the time of the interrupt if all the previous instructions in the buffer are complete. Other instructions, even if they have finished execution (but are not currently allowed to complete) are flushed from the queue without changing the global processor state. The address of the first uncompleted instruction will then indicate the instruction address used to restart execution after the interrupt is handled.

In REAP, we have implemented a command-line switch to indicate whether the processor includes a reorder buffer. This reorder buffer switch has the form shown in figure 25, and indicates the number of slots in the buffer (n) and the number of instructions that can be completed each cycle (c).
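The timing constraint that the reorder buffer imposes can be captured with a very small FIFO model. The following sketch is illustrative only (the names and the simplified retirement rule are not taken from REAP); it shows how a full buffer forces the issue of the next instruction to wait until the oldest entry has retired.

    /* A small sketch of the reorder (completion) buffer timing constraint:
     * an instruction may issue only when a slot is free, and entries retire
     * in program order.  Hypothetical names; not the REAP implementation.   */
    #include <stdio.h>

    #define MAX_ROB 64

    struct rob {
        long finish_cycle[MAX_ROB];  /* circular FIFO of finish times */
        int  head, count, size, per_cycle;
    };

    /* Returns the cycle at which an instruction issued at 'issue_cycle' and
     * finishing at 'finish_cycle' can be given a slot; updates the buffer.  */
    static long rob_issue(struct rob *rb, long issue_cycle, long finish_cycle)
    {
        if (rb->count == rb->size) {
            /* Buffer full: wait until the oldest entry has retired.  For
             * simplicity this sketch retires one entry per call; a fuller
             * model would retire up to rb->per_cycle entries per cycle.     */
            long oldest = rb->finish_cycle[rb->head];
            if (issue_cycle <= oldest)
                issue_cycle = oldest + 1;
            rb->head = (rb->head + 1) % rb->size;
            rb->count--;
        }
        int tail = (rb->head + rb->count) % rb->size;
        rb->finish_cycle[tail] = finish_cycle;
        rb->count++;
        return issue_cycle;
    }

    int main(void)
    {
        struct rob rb = { {0}, 0, 0, 4, 4 };  /* 4 entries, 4 completions/cycle */
        long cycle = 0;
        for (int i = 0; i < 6; i++)           /* six back-to-back instructions  */
            cycle = rob_issue(&rb, i, i + 5);
        printf("last instruction issues at cycle %ld\n", cycle);
        return 0;
    }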

[Figure: execution time relative to the processor with no reorder buffer (y-axis) for each LFK test case (x-axis), for 2-, 4-, 8- and 16-entry completion buffers.]

Figure 26. Relative performance for different size reorder buffers.

Because the reorder buffer can cause the issue unit to stall whenever the reorder buffer is full, a reorder buffer that is too small can introduce stall cycles to a workload execution. Using REAP, we compared the execution times of the LFK test loop traces executing on a buffered p2 processor configuration both with no reorder buffer and for a set of reorder buffers of different sizes. The base buffered p2 processor included two FXUs with three stage pipeline lengths, two FPUs with two stage pipeline lengths, and two BRUs with two stage pipeline lengths. No register renaming was used, nor were any other command-line switches enabled (other than the reorder_buffer switch of figure 25). The reorder buffer models that were used all allowed the full maximum issue width (of six instructions) to be issued and completed per cycle, i.e. the values of N and C were set to six by setting the c parameter of the reorder_buffer switch of figure 25 to six. Figure 26 shows the relative execution time for each LFK test case for the processor with a reorder buffer relative to the same processor with no reorder buffer. Looking at the results, note that this processor requires somewhere between an eight and sixteen entry reorder buffer in order to ensure that the buffer does not introduce many stalls; even sixteen instructions, however, is not quite sufficient for the lfk8 and lfk15 test cases which still encounter a small degradation in performance.

[Figure: cycles of execution relative to the processor with no reorder buffer (y-axis) for each LFK test case (x-axis), for maximum completion rates of 1, 2 and 3 instructions per cycle.]

Figure 27. Relative performance for different numbers of maximum completions per cycle.

Another question that arises with reorder buffers is how many instructions must be completed per cycle in order to avoid a performance loss relative to the no-buffer performance. If the number of instructions that can be completed from the reorder buffer each cycle is too small, then the reorder buffer will quickly fill and the execution will be bound by the number of instructions that can be completed per cycle. If the reorder buffer is to complete a large number of instructions per cycle, however, then the reorder buffer hardware becomes more complex, potentially requiring a large amount of area in the processor and impacting the processor’s cycle time. There is no need for the reorder buffer to be able to complete more instructions per cycle than the workloads will effectively finish in one cycle, and thus there is a potential to save some design space (and power) by reducing the maximum instructions completed per cycle below the maximum issue width. Figure 27 shows the performance of the same p2 processor relative to the no reorder buffer processor, where the plotted processors each have a reorder buffer with 16 entries and a differing number of instructions that it can be complete each cycle. From figure 27, we see that the execution of these workloads required only two instruction completions per cycle in order to achieve the same level of performance as a reorder buffer that can complete six instructions per cycle (i.e. compare figure 27 to the 16-entry reorder buffer performance of figure 26). Recall, however, that the p2 processor

+store_queue n  : add n slot store queue
+no_bypass_sq   : do not bypass to the store queue

Figure 28. The command line options to describe a store queue to REAP.

can issue and execute up to six instructions per cycle, so it clearly is experiencing other sources of delay (such as data and control dependencies in the code or instruction issue restrictions) that make the ability to complete more than two instructions in a cycle unnecessary.

2.4.4 Adding a Pending Store Queue Model

Another potential source of delay occurs when a store instruction enters the execution pipeline, but the data which it is to store is still being generated. In the basic REAP processor model, the store instruction stalls the function unit decode stage until the data to be stored is generated, potentially blocking other ready instructions from advancing into the pipeline. Furthermore, if the store includes an update of a register value, and following instructions depend upon that updated value, then the store could potentially cause several function units to stall. A pending store queue helps avoid these stalls by allowing the store instruction to move through the execution pipeline, using and updating whatever registers are required for the store to generate the effective address. The store's effective address is then placed in the pending store queue, which holds this information until the store data becomes available. When the store data becomes available, it is forwarded to the store queue, where it is attached to the proper effective address and the data is sent to the memory system. Pending store queues, such as the floating-point pending store queues of the IBM RS/6000 and PowerPC processors [26], are featured on many different processors.

In REAP, we provide some command-line switches which indicate whether the processor should include a pending store queue, its size, and whether results can be bypassed to it. In figure 28, note that the user can specify the inclusion of a pending store queue of some size by using the store_queue switch. The no_bypass_sq switch indicates whether data can be bypassed from a function unit's writeback stage directly to the store queue or

[Figure: speedup (y-axis) for each LFK test case (x-axis) for 1-, 2-, 3- and 4-entry pending store queues.]

Figure 29. Speedups obtained with the addition of different sized pending store queues.

must wait the additional cycle (to cross a result bus). Note that the REAP store queue is not specific to a certain kind of data, but instead is used for all store instructions, so the pressure on the store queue could conceivably be greater than in the RS/6000 or PowerPC where only the floating point stores are queued. We ran a set of experiments comparing the speedup of a processor design with different sizes of pending store queues versus no store queue on an unbuffered 6k processor configuration with a three stage FXU pipeline, a two stage BRU pipeline, and a five stage FPU pipeline. This 6k processor also included full register renaming (where 40 physical registers are provided for the fixed-point and floating-point register files) and branch folding. Looking at figure 29, note that the addition of a store queue provided a speedup for most of the LFK test cases. In some cases, the addition of even a single entry store queue resulted in dramatic performance increases (e.g. lfk12 and lfk21 experience approximately a 1.7 times speedup even when a single entry queue was added). Note also that the addition of a pending store queue never caused a performance loss. Thus, the addition of pending store queues can have a positive effect on performance. These performance

+dcache dp          : add data cache model
+icache ip          : add instruction cache model
+allcache dp ip     : add data and instruction cache models
      where dp is the data cache miss penalty (in cycles)
      where ip is the instruction cache miss penalty (in cycles)

+dcache2 dr dw      : add data cache model
+allcache2 dr dw ip : add data and instruction cache models
      where dr is the data cache read miss penalty (in cycles)
      and dw is the data cache write miss penalty (in cycles)
      and ip is the instruction cache miss penalty (in cycles)

The cache option must be followed by the required DineroIII cache options in order to describe the cache model.

Figure 30. The command line options for specifying a finite cache model to REAP.

improvement results are, however, highly dependent on the processor organization and the workload being executed.
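The essential behavior of the pending store queue model described above can be sketched as follows. This is a hypothetical illustration, not the REAP code: a store with a free queue slot proceeds as soon as it reaches decode, while a store with no queue (or a full queue) must wait.

    /* Illustrative sketch of the pending store queue idea: a store leaves the
     * execution pipeline as soon as its effective address is generated, and
     * the address waits in the queue until the store data arrives.  A store
     * stalls in decode only when every queue slot is occupied.               */
    #include <stdio.h>

    #define SQ_MAX 16

    struct store_queue {
        long data_ready[SQ_MAX];   /* cycle at which each queued store's data arrives */
        int  head, count, size;
    };

    /* Returns the cycle at which a store whose data is ready at 'data_ready'
     * can enter the pipeline, given that it reaches decode at 'decode_cycle'. */
    static long store_enter(struct store_queue *sq, long decode_cycle,
                            long data_ready)
    {
        if (sq->size == 0)                      /* no queue: wait for the data */
            return decode_cycle > data_ready ? decode_cycle : data_ready;

        if (sq->count == sq->size) {            /* queue full: wait for a slot */
            long oldest = sq->data_ready[sq->head];
            if (decode_cycle < oldest)
                decode_cycle = oldest;          /* slot frees when data is sent */
            sq->head = (sq->head + 1) % sq->size;
            sq->count--;
        }
        sq->data_ready[(sq->head + sq->count) % sq->size] = data_ready;
        sq->count++;
        return decode_cycle;                    /* store proceeds immediately   */
    }

    int main(void)
    {
        struct store_queue none = { {0}, 0, 0, 0 };
        struct store_queue one  = { {0}, 0, 0, 1 };
        printf("no queue:      store enters at cycle %ld\n",
               store_enter(&none, 4, 12));      /* stalls until the data exists */
        printf("1-entry queue: store enters at cycle %ld\n",
               store_enter(&one, 4, 12));       /* proceeds at once             */
        return 0;
    }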

2.4.5 Adding Finite Cache Effect Models

So far, REAP has assumed that all memory accesses require the same access time, an assumption consistent with the processor having a “perfect” infinite-size, all-hit cache or a constant access time, no-page-fault memory with no cache at all. With the increasing impact of the memory system performance on overall processor performance, however, processors generally include a finite-sized cache (or caches) where the cache miss behavior can introduce uncertainty in the memory access time and stalls during execution. As the difference between the processor and memory speeds increases, studies have shown that misses, particularly in the first level cache, significantly reduce overall processor performance. [64]

Within REAP we provide a set of command-line switches that allow the user to include finite cache models in the processor design. These switches allow the user to specify a finite-size data cache, an instruction cache, or both, and to define a cache miss penalty that should be applied whenever a miss occurs. The actual command-line parameter switches are shown in figure 30.

80 The actual REAP cache simulation uses the DineroIII cache simulator written by Mark Hill. [30] DineroIII operates by keeping a set of linked lists, one per cache set, which indicate the last N cache blocks accessed within each cache set. For each memory access, DineroIII determines whether the access hits (i.e. the referenced block is in the list corresponding to its cache set at a depth less than the associativity of the cache) or misses, and REAP then uses this information to determine whether the memory access pipeline length should include a cache miss penalty. Note that the command line switches of figure 30 require that the user provide not only the REAP finite cache switches, but also a set of DineroIII switches to identify the finite cache model that should be used. DineroIII supports a large number of switches that describe the physical cache size and structure, the replacement strategy, and other behaviors. When a finite cache command line switch is set for REAP, REAP extracts the information that it needs, and then forwards the remaining list of switches to the embedded copy of DineroIII. Figure 31 describes the DineroIII command-line switches used to describe the cache and its behavior. The difference between the dcache and the dcache2 switches (or the allcache and allcache2 switches) in figure 30 is that the dcache switch allows the user to specify only one cache miss penalty, while the dcache2 switch allows the user to specify separate penalties, one for a read miss and the other for a write miss. Often, the memory access penalties that the processor experiences for read misses and write misses will differ, and the dcache2 and allcache2 switches are provided to allow the user to specify these different access penalties for read and write accesses to the data cache. The access penalty for the instruction cache cannot have two values in REAP, because the instruction cache is assumed to be a read-only cache. Note that the allcache and allcache2 switches require that the user provide a penalty for the data cache and a penalty for the instruction cache accesses. The fact that these two penalties are specified separately in these switches is not intended to imply that the underlying cache model will necessarily be a split cache. The cache structure is determined completely by the DineroIII parameters, and thus the allcache or allcache2 switch could be used with a unified cache where the user could specify the same instruction miss and data read miss penalties, since they would result in the same action within the cache.
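However the hit/miss outcome is obtained, folding it into the timing model is straightforward. The sketch below illustrates the idea; the cache lookup is abstracted behind a flag so that no particular DineroIII interface is assumed, and all names are hypothetical.

    /* Sketch of how a per-reference hit/miss outcome (as produced by a cache
     * simulator such as DineroIII) could be folded into the memory access
     * latency.  Names and structure are illustrative only.                  */
    #include <stdio.h>

    enum access_kind { READ_ACCESS, WRITE_ACCESS };

    static long memory_latency(long base_pipe_len,    /* memory access pipeline       */
                               enum access_kind kind,
                               int is_miss,           /* reported by the cache model  */
                               long read_miss_penalty,   /* dr of +dcache2            */
                               long write_miss_penalty)  /* dw of +dcache2            */
    {
        if (!is_miss)
            return base_pipe_len;                     /* hit: nominal latency         */
        return base_pipe_len + (kind == READ_ACCESS ? read_miss_penalty
                                                    : write_miss_penalty);
    }

    int main(void)
    {
        printf("load hit:   %ld cycles\n", memory_latency(3, READ_ACCESS, 0, 8, 4));
        printf("load miss:  %ld cycles\n", memory_latency(3, READ_ACCESS, 1, 8, 4));
        printf("store miss: %ld cycles\n", memory_latency(3, WRITE_ACCESS, 1, 8, 4));
        return 0;
    }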

Parameter   Description                      OPTIONS (* ==> default)
-b          blocksize in bytes               I+ (no default)
-S          sub-block size in bytes          I (default: 0 (no sub-blocks))
-u          unified cache size in bytes      I * blocksize (default: 0)
-d          data cache size in bytes         I * blocksize (default: 0)
-i          instruction cache size           I * blocksize (default: 0)
            Note: either -u or both -i and -d must be positive.
-a          associativity                    I+ (default: 1 (direct-mapped))
-r          replacement                      LRU* ('l'), FIFO ('f'), RANDOM ('r')
-f          fetch                            DEMAND* ('d'), MISSPREFETCH ('m'),
                                             ALWAYSPREFETCH ('a'), TAGGEDPREFETCH ('t'),
                                             SUBBLOCKPREFETCH ('S'), LOADFORWARDPREFETCH ('l')
-p          prefetch dist. in sub-blocks     J+ (default: 1 (sub)-block)
-P          abort prefetch percent
-w          write
-A          writeallocate

Figure 31. The DineroIII command-line switches used to describe the cache and its behavior.

[Figure: (a) a two-instruction code sequence (a floating-point divide followed by a floating-point load); (b) its cycle-by-cycle execution with infinite register ports; (c) its execution with finite register ports.]

Figure 33. Sequentializing effect of the REAP finite register port model.

These three principles lead to some difficulties when REAP (and resource conflict modeling in general) is used to model certain kinds of processors and hardware elements. Section 2.4.6.1 will describe one problem that arises because REAP uses a single availability time rather than availability time windows, and section 2.4.6.2 presents some limitations that result from RCM's full analysis of each instruction before considering following instructions. These two examples will illustrate the kinds of limitations in both the general RCM approach and in our specific REAP implementation.

2.4.6.1 REAP Single Availability Time Limitation

One problem arises with REAP's assumption that a single “moving wall” of availability is sufficient to capture the resource availability. Consider the register port resources for a segment of the trace such as in figure 33 (a), where a floating-point divide (instruction A, the FDIV) is followed by a floating-point load (instruction B, the FLD) instruction. Assume that this code is executed on a buffered 6k processor configuration, where the floating-point divide instruction requires 10 execution pipeline stages and the floating-point load instruction requires 5 execution pipeline stages. Suppose that all of the instructions are independent, and assume that there are an infinite number of register ports. This results in the execution shown in figure 33 (b) for these two instructions. If, however, the number of floating-point read ports is restricted to three and the number of floating-point write ports is restricted to one, then the execution on the otherwise identical processor would be as shown in figure 33 (c).

Looking at figure 33 (b), note that instruction B (the floating-point load instruction) starts execution in cycle one, and finishes execution in cycle five. In figure 33 (c), however, instruction B does not start execution until cycle 7, and finishes in cycle 11 after instruction A has finished (in cycle 10). The reason that the floating-point load instruction cannot start execution until cycle 7 in figure 33 (c) is not because of data dependence, but because of a resource conflict over the single floating-point register file write port.

When REAP encounters the sequence of instructions in figure 33 (a), it first analyzes the floating-point divide instruction, analyzing the full execution of that instruction and applying its effects to the global resource state. Because the floating-point divide instruction has two input operands and one output operand, all from the floating-point register file, it is assigned two of the floating-point read ports and the one write port. The write port is actually used on cycle 10, and thus it is made available to following instructions on cycle 11. Because of the single availability time value used in REAP, the floating-point register file write port is thus assumed to be unavailable to other instructions at all cycles before cycle 11. REAP next analyzes the floating-point load instruction, which is a memory-access instruction and thus executes in the FXU. The load instruction requires a general purpose register, which is available at time 0, the memory port, which is available at time 0, and the FXU, which is available at time 0. However, the floating-point load instruction also requires a floating-point write port to set the loaded value into the floating-point register file. While no instruction is actually using the write port on cycles one through nine, the floating-point divide has already assigned the port a new availability time (cycle 11). Because REAP does not keep track of windows of time within which such a resource might be available, the FLD instruction must wait in decode until such time as the write port will be available when the FLD instruction enters the writeback stage (i.e. cycle 7).
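The difference between the single availability time and a window-based model can be made concrete with a small sketch. The code below is purely illustrative (the names and the simplified reservation scheme are invented); it uses the cycle numbers of the FDIV/FLD example above.

    /* Sketch contrasting the single "moving wall" availability time used in
     * REAP with a model that keeps explicit busy cycles for a resource.  The
     * divide occupies the floating-point write port only in cycle 10.       */
    #include <stdio.h>

    #define MAX_BUSY 8

    /* Single availability time: the port is simply unusable before 'available'. */
    static long wall_earliest_use(long available, long wanted)
    {
        return wanted >= available ? wanted : available;
    }

    /* Availability windows: the port may be used in any cycle that does not
     * collide with an already reserved busy cycle.                           */
    static long window_earliest_use(const long *busy, int nbusy, long wanted)
    {
        for (;;) {
            int clash = 0;
            for (int i = 0; i < nbusy; i++)
                if (busy[i] == wanted) { clash = 1; break; }
            if (!clash)
                return wanted;
            wanted++;
        }
    }

    int main(void)
    {
        long divide_uses_port = 10;             /* FDIV writes back in cycle 10 */
        long busy[MAX_BUSY] = { divide_uses_port };

        /* The FLD would like the write port in cycle 5 (its writeback stage). */
        printf("single availability time: port usable at cycle %ld\n",
               wall_earliest_use(divide_uses_port + 1, 5));   /* forced to 11 */
        printf("availability windows:     port usable at cycle %ld\n",
               window_earliest_use(busy, 1, 5));              /* still 5      */
        return 0;
    }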

If windows of resource availability were kept, the FLD instruction could use the write port before the floating-point divide instruction, since the FDIV does not need the port until cycle 10, and thus the behavior of figure 33 (b) could be recovered. As REAP is currently implemented, however, there are no windows of resource availability, and thus there is no way to model a processor that has a limited number of register ports, and yet does not assign them to the instructions the way that REAP does.

2.4.6.2 RCM Execution Ordering Limitation

This register port modeling problem is a limitation of REAP, not of RCM in general. If an RCM modeling tool were implemented that included the windows of resource availability, then that RCM tool could correctly capture the behavior of a processor that does not fit the current REAP finite register port model. There are, however, limitations to what RCM in general can model, based on the two primary principles of the resource conflict methodology, i.e. that the execution of a later instruction cannot alter the execution of an earlier instruction, and that all aspects of timing can be modeled through a nominal time plus the effects of resource conflicts. Any processor that violates one of these principles will necessarily be difficult to model using the resource conflict methodology. One example of such a processor is a distributed out-of-order execution processor, where the issue unit sends instructions to the function unit buffers, and the function units reorder the execution of the instructions within their input buffers (i.e. so that function unit input buffers would actually be reservation stations rather than FIFO buffers).

The basic RCM approach will determine the instruction's fetch time, issue time, start of execution time and the time when the instruction finishes execution all before considering the next instruction. Furthermore, this analysis is essentially done at the issue time of the instruction. It is therefore difficult to model the effects of hardware elements that allow the instructions to alter their relative execution order within a function unit, because a function unit that allows dynamic reordering of the instructions allows an instruction issued to the function unit at a later time to delay the start of execution of an earlier-issued instruction. Consider, for example, the code stream of figure 34 (a) executing on the buffered 6k processor, where the FXU pipeline length is six stages, except for fixed-point divide

[Figure: (a) the code stream (a load, an add that depends on the loaded value, and an independent divide); (b) the execution results when the function units execute the instructions from their FIFO input buffers in issue order.]

Figure 34. Execution with in-order function units.

instructions which have a pipeline length of fourteen stages. Also assume that the register values for R5, R27 and R21 are all available when this code begins execution. Since RCM assumes the instructions issue in-order and the execution order within a function unit is the same as the issue order (to that function unit) the LD instruction will start on cycle 1 and finish on cycle 6, the ADD instruction will begin on cycle 6 and finish on cycle 11, and the DIV will begin execution on cycle 7 and finish on cycle 20, as shown in the detailed output of figure 34 (b). As figure 34 (b) shows, the execution of the add instruction is held up waiting for the data from the load on which it depends. Note that the divide instruction, however, is not blocked by a data dependence, and could begin execution as soon as it is issued to the function unit. Thus, one might be tempted to consider processor designs that would allow the processor to reorder the issue of instructions, so that the divide instruction is issued to the FIFO function unit buffer before the add instruction. If the issue unit reorders the instructions in the issue buffer each cycle so that they are issued in the order that the earliest fetched instruction that is ready to execute is issued first, followed by later fetched

[Figure: the execution results for the same code stream when the divide is allowed to begin execution ahead of the stalled add.]

Figure 35. Execution with out-of-order function units.

instructions that are ready to execute, with instructions that are not ready to execute (because of data dependence) being issued later, when they do become ready, then the divide instruction (instruction C) would be issued before the add (instruction B) and the execution profile would be that shown in figure 35. By allowing the issue unit to select the earliest-fetched ready-to-execute instruction from a pool of instructions, the RCM model would be a form of centralized-window out-of-order issue machine, while still retaining the FIFO function unit input buffers. In this fashion, RCM can actually model out-of-order issue machines. Consider, however, what happens when the processor to be modeled is not an out-of-order issue machine, but rather an in-order issue and out-of-order execution machine using function unit reservation stations instead of FIFO queues. For the code stream of figure 34 (a), the execution that such an out-of-order execution processor would follow is identical to that of figure 35. The processor’s issue unit would issue the three instructions in order to the function unit input buffers, and the processor’s FXU would begin executing instruction A, the LD instruction, since it was the first to enter the input buffer and is ready to execute. On the next cycle, the processor’s FXU would discover that instruction C (the DIV instruction) is ready to execute, and thus would

begin executing it. Finally, once the LD instruction finishes, instruction B (the dependent ADD instruction) will be ready to execute, and the processor's FXU would begin executing it.

Developing an RCM model of this processor, however, would be very difficult. In RCM, the order in which the instructions are analyzed by the simulator is the same as the order in which they are issued (because the issue and execution analysis is done together). Thus, the code stream of figure 34 (a) would have been issued and analyzed in the sequential order, i.e. the load, then the add, and then the divide. The RCM analysis would therefore first assign the LD instruction to the FXU, and determine that it starts execution on cycle 1 and completes on cycle 6. The ADD instruction would be analyzed next, and it would be found to issue on cycle 1, start execution on cycle 6, and complete on cycle 11. The DIV instruction is analyzed next, and the simulator would discover that the divide instruction could execute before the ADD instruction because it is not held up by a data dependence, and thus the FXU would reorder these two instruction executions when it took them from its reservation station input buffer. The simulator would therefore have to start the DIV instruction's execution in cycle 2, completing on cycle 15. However, this also means that the starting time for the ADD instruction, which had been marked as cycle 6, must now be delayed to cycle 11, similarly affecting the finish time. Thus, when the RCM simulator analyzes the DIV instruction, it would have to be able to allow it to alter the execution of an earlier simulated instruction. While this certainly seems possible in this case, we must also consider cases where a chain of dependent instructions is analyzed, and then a later instruction is found that can execute before all of them, resulting in all of their execution times needing to be adjusted.

Clearly, this kind of distributed out-of-order execution poses a difficulty for the RCM analysis as it has been described here. While there may be ways to extend the general resource conflict methodology to include such processors, we have not implemented any such mechanism to date.

2.5 Adding Detail to the REAP Execution Information

So far, REAP has been described as generating only a count indicating the total cycles of execution for the workload.

This single value is useful for the direct comparison and ranking of the performance of different processors, but it does not provide particular insights into why different processors achieve different performances on the same workload (except for what might be inferred indirectly by varying switch settings and parameters and comparing the performance results). If a designer were interested in exploring a space of processor designs to select from among them, more information about the causes of performance degradation would help to guide and focus the search through the space. In an effort to provide more information about the effects that are controlling and determining the processor performance, we have added a number of statistical reports to REAP. These reports describe the performance of the application when running on the processor, providing the users with a better understanding of the performance obtained, and highlighting design elements that tend to restrict the performance. These reports can thus indicate portions of the processor that should be targeted to increase the performance. This section will describe the detailed reports that have been added to REAP, with examples taken from the analysis of one of the LFK test cases (lfk21) executing on a buffered p2 processor configuration. The p2 processor contains double the function unit resources of a 6k configuration; thus it is a processor with a maximum 6-wide issue, two FXUs, two FPUs and two BRUs. The two FXUs each execute fixed-point and memory instructions, the FPUs execute the floating-point instructions, and the BRUs execute the branch and condition register instructions. In the version of the p2 configuration we used to generate the reports here, most of the FXU instructions are executed in a four stage pipeline, but the fixed-point multiply and divide instructions execute through a longer (nine stage) pipeline. Most of the FPU instructions execute through a five stage pipeline, but the floating-point divide instructions require a nine stage pipeline. Finally, all of the BRU instructions execute through a two stage pipeline. The FXUs and FPUs each have a ten stage input buffer, while each BRU has a five stage buffer. The register files are defined to have the appropriate number of architected and physical registers, and no register renaming hardware is used on this processor. The register files each have a minimum number of read and write ports and there is a single port to the data memory system (a separate port to the instruction memory is assumed and not modeled).

None of the extended hardware elements of section 2.4 is implemented in this processor. Clearly, this processor includes a relatively large amount of functional parallelism in the six buffered execution units, but as we shall see the other resources, combined with the relatively long pipelines, tend to limit the performance quite severely. Still, these parameter values were selected not so much to provide a fast processor design, but rather to succinctly illustrate the descriptions that the REAP detailed execution reports can produce. REAP currently provides eleven reports which respectively detail the Workload Information, Issue Widths, Latest Available Resource, FU Utilization, Decode Utilization, Pipeline Utilization, Input Buffer Use, Register Utilization, Register Port Utilization, Register Histogram, and Register Value Data. Each of these reports will be described in this section, with an example taken from the execution of the lfk21 test case on the buffered p2 processor configuration described above.

The Workload Information Report. The first statistical report that REAP provides is the most basic: the total execution time, the total number of instructions, and the instructions per cycle (IPC) for the workload. This is, in effect, the same information REAP has been providing throughout the earlier discussions. To this basic performance information, we added a breakdown of the executed instructions by their BRISC instruction execution classes. The report obtained for the execution of lfk21 on our buffered p2 processor configuration is shown in figure 36. The instruction execution classes shown in figure 36 for this processor correspond to the BRISC instruction execution classes of table 1: the fixed-point operations except fixed-point multiply and divide (class 1), the floating-point operations except floating-point divide (class 2), the memory access instructions (class 3), the condition register instructions (class 4), the branch instructions (class 5), and the fixed-point multiply (class 6), fixed-point multiply immediate (class 7), fixed-point divide (class 8) and floating-point divide instructions (class 9). Looking at figure 36, we note that the workload that was executed here (lfk21) contains neither fixed-point multiply/divide instructions, nor floating-point divide instructions.


Execution required 951265 cycles
Total instructions: 385211 so IPC = 0.40494
Instruction breakdown by class for the workload is:
  I Class    # in Workload
  -------    -------------
     1             3904
     2           126250
     3           190003
     4              652
     5             5051
     6                0
     7                0
     8                0
     9                0

Figure 36. Workload execution characterization statistics.

The breakdown of the workload into the dynamic instruction counts for instructions of each execution class provides a first-order characterization of the workload; such counts could also be provided for each instruction (i.e. each BRISC opcode), though we have not done so at this time. Still, even the instruction execution class counts allow the user to calculate information such as the IPC for each instruction execution class. Because the REAP execution model is based on the instruction class, this breakdown could also help guide decisions regarding the number of function units of each type that might prove useful in the processor design. Looking at the total execution cycles and IPC results for this workload execution, we note that the performance is quite poor. While the buffered p2 processor includes six function units, a maximum six-wide issue, and can sustain two fixed-point or memory, two floating-point and two branch or condition-register instruction executions per cycle, this version achieves only 0.4 instructions per cycle. Clearly, one wonders why the performance for this workload was so poor, and it is precisely this kind of question that the remaining REAP reports are expected to help answer.

The Issue Widths Report. The second report gives information about the widths of the issue groups, indicating the number of groups of each width that were issued during execution. During the processor simulation, each time a group of instructions is issued, the number of instructions is recorded. This results in a table of widths like the one shown in figure 37.


The issue width statistics show:
  width    # of issues
  -----    -----------
    1          59400
    2          59351
    3          58101
    4            625
    5              0
    6           5051

There are 951256 total execution cycles, and 182528 issue cycles,
so there are 768728 cycles when no instructions are issued.
There are 385211 issued instrs and 182528 total issues
for an average issue group size of 2.11042 instr

Figure 37. Issue width execution statistics (a p2 processor).

The issue width statistics of figure 37 are provided to give the user an idea of the efficiency that the workload and processor exhibit in utilizing the issue unit's parallelism. In figure 37, the processor has a maximum issue width of six instructions per cycle; the statistics indicate, however, that the machine actually achieved an average instruction issue width of just 2.11 instructions per issue group. Note that the cycles where no instructions were issued are not included in the statistics because these cycles are not considered to have included an issue (i.e. a zero-instruction issue is not an issue group). The number of such cycles can be easily determined, however, by subtracting the number of issue cycles reported in figure 37 from the total number of execution cycles (951,265 cycles, reported in figure 36), resulting in 768,728 cycles where no instructions were issued. From these statistics, we can see that this processor was not able to utilize its full issue parallelism very often on this workload. Of course, from this limited information, it is difficult to understand why the issue group size is being limited. Clearly, a designer might now wonder why the processor could not issue more instructions per cycle, or take greater advantage of the available parallelism. It would be useful to know which resources of the processor are determining the instructions' execution start times. We can thus examine the Latest Available Resource report to see which resource determined each instruction's execution start time so that we can explore designs which address the limitations presented by these resources.
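
As a concrete check on these derived quantities, the totals and averages reported in figure 37 follow directly from the recorded issue-width histogram. The following minimal sketch (in C, with hypothetical variable names and the counts copied from figure 37) reproduces that arithmetic; it is an illustration of the bookkeeping, not REAP code.

    #include <stdio.h>

    int main(void) {
        /* Issue-width histogram from figure 37 (widths 1 through 6). */
        long width_count[7] = {0, 59400, 59351, 58101, 625, 0, 5051};
        long total_cycles = 951256;               /* total execution cycles */

        long issue_cycles = 0, issued_instrs = 0;
        for (int w = 1; w <= 6; w++) {
            issue_cycles  += width_count[w];      /* cycles with an issue group */
            issued_instrs += (long)w * width_count[w];
        }

        long idle_cycles = total_cycles - issue_cycles;     /* no-issue cycles */
        double avg_width = (double)issued_instrs / (double)issue_cycles;

        /* Prints 182528 issue cycles, 768728 idle cycles, average width 2.11042. */
        printf("%ld %ld %.5f\n", issue_cycles, idle_cycles, avg_width);
        return 0;
    }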


Breakdown of the cause of stalls during execution shows:
  Resource Cause          # Instructions    Percent of Instr
  --------------------    --------------    ----------------
  Issue Unit                       63753              16.550
  Decode Available                  1303               0.338
  Writeback Stage                      0               0.000
  Source Register                 189431              49.176
  Destination Register              1248               0.324
  Register Read Port                 650               0.169
  Register Write Port                  0               0.000
  Memory Port                          0               0.000
  Multiple                        128826              33.443

Figure 38. Statistics describing the latest available resource for each instruction executed in lfk21 on the buffered p2 processor configuration.

The Latest Available Resource Report. REAP can report for each instruction the latest available resource that determined the start time for the instruction’s execution. A table is generated by REAP that indicates for each resource in the modeled processor the number of instructions for which that resource was the latest available. Figure 38 shows this table for the p2 processor when executing the lfk21 test case. The entries in figure 38 indicate the number of instructions whose latest available resource was one of the resource categories listed for the p2 processor. The Issue Unit category indicates that the instruction was able to execute as soon as it was issued, and thus that the issue unit itself was the latest available resource. The Decode Available category indicates that the instruction was able to be issued to the function unit input buffers, but the function unit’s decode stage was tied up with some earlier instruction. Once the decode stage became available, the instruction was able to start execution immediately, and thus the decode stage was the latest available resource. The Writeback Stage category indicates that the instruction was able to issue and enter the decode stage, but then it had to wait in decode for a previous instruction to move far enough through its pipeline that the (shared) writeback stage would be available when this instruction would need it. The Source Register and Destination Register categories indicate that the instruction had to wait in the decode stage for a source or destination operand (respectively) to become available when the instruction would need to use it. These two categories are generally the result of dependencies in the workload.

The Register Read Port and Register Write Port categories indicate that the instruction had to wait in decode because, even after the register value resources were available, there were not enough available register ports for the instruction to begin execution. Similarly, the Memory Port category indicates that the instruction was ready to execute, but that the instruction requires a memory port resource to execute and no memory port was available, so the instruction had to wait for a memory port to become available. Finally, the Multiple entry indicates instructions whose execution was equally delayed by two or more different resources of the processor, i.e. instructions that had more than one resource that became available at the same (latest) time. It is expected that the analysis of which modeled resources are often the latest available will help the user to understand the performance of the workload, and perhaps even point out ways in which the processor design could be improved. Consider, for example, the Decode Available category, which indicates instructions that must wait for the function unit's decode stage to become available. A large number of instructions in this category indicates that the processor would achieve better performance if more function units (of the appropriate type) were added so that instructions that previously stalled (waiting for the decode stage to clear) could be sent to these new units, which presumably would have an available decode stage. Once data is gathered indicating that a specific resource was very often the latest available resource (i.e. caused a large number of instructions to wait), the user could explore designs which provide more of that constraining resource, or shorten the effective time that it is in use. Looking at figure 38, for example, we note that the main cause of stalls when executing this workload on this processor was the Source Register category, indicating that nearly 50 percent of the instructions were stalled in decode waiting for a source register. Source register unavailability is often attributable to true data dependence in the code, and thus is difficult to reduce; in this processor, the relatively long execution pipelines (four stage FXU pipelines and five stage FPU pipelines) result in a large number of stalls waiting on data dependences. If a large number of instructions must wait for source registers, then this could indicate that effort should be devoted to reducing these data dependence effects, e.g. by rescheduling, better bypassing, or shortened pipelines.
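
The attribution behind figure 38 can be pictured as a small piece of bookkeeping performed for each analyzed instruction: the start time is the maximum of the availability times of the resources the instruction needs, and whichever resource (or resources, in the Multiple case) achieves that maximum is credited as the latest available. The sketch below illustrates this idea; the resource list, names, and counters are hypothetical and simplified, not the actual REAP implementation.

    #define NRES 4   /* hypothetical resources: issue slot, decode stage,
                        source register, memory port                        */

    static long latest_count[NRES];   /* instructions attributed to each resource  */
    static long multiple_count;       /* instructions delayed equally by 2+ resources */

    /* avail[r] is the cycle at which resource r becomes free for this
       instruction; the returned value is the cycle at which execution starts. */
    long attribute_start_time(const long avail[NRES])
    {
        long start = avail[0];
        int  latest = 0, ties = 1;

        for (int r = 1; r < NRES; r++) {
            if (avail[r] > start)       { start = avail[r]; latest = r; ties = 1; }
            else if (avail[r] == start) { ties++; }
        }
        if (ties > 1) multiple_count++;        /* the Multiple category              */
        else          latest_count[latest]++;  /* a single latest available resource */
        return start;
    }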


Breakdown of Issue Unit stalls during execution shows:
  Resource Cause          # Instructions    Percent of Instr
  --------------------    --------------    ----------------
  Issue Unit                       63753              16.550
    IU - Max Width                    24               0.006
    IU - PC Available                  1               0.000
    IU - Other                     63728              16.544

Figure 39. Statistics describing the breakdown of the Issue Unit latest available resource category for different causes of issue unit unavailability.

The issue unit category in figure 38 also includes a large number of instructions for which it was the latest available resource. Every instruction must be attributed a latest available resource, and if no other resource restricts the start of execution time to be later than the issue time, then the instruction's latest available resource is the issue unit. If the issue unit is the latest available resource for a large proportion of the instructions, then this indicates that much of the workload is issue-bound, and thus could indicate that the hardware is being used very efficiently; it could, however, also indicate that the code schedule might profit from some reordering, or that the branch handling paths are too slow. To help differentiate these cases, the Issue Unit category is further subdivided into cases where the issue unit was the latest available resource and the issue group was maximum size, the program counter was unavailable, or some other reason, as shown in figure 39. Looking at figure 39, we see the breakdown of the Issue Unit category into the reasons that the current instruction did not issue earlier. The results show three categories, Max Width, PC Available, and Other. The Max Width category indicates that the current instruction follows a full-width issue group, and thus that the instruction could not be included because it would violate the maximum issue width. The PC Available category indicates that the program counter is unavailable, and thus that the instruction must follow a control-flow altering instruction, and that the branch resolution delay is affecting the issue time of the current instruction. The Other category captures all other cases. In this workload, most of the entries in the issue unit category actually correspond to branch instruction issues, for which the branch execution units (BRUs) and required resources are available, but the in-order issue semantics result in the branch instruction's execution being delayed until it can be issued (i.e. until the sequentially preceding instructions have issued).


Breakdown of Multiple stalls during execution shows:
  Resource Cause               # Instructions    Percent of Instr
  -------------------------    --------------    ----------------
  Multiple                             128826              33.443
  Two Resource Subcases:               127576              33.118
    Decode and Read Port                   75               0.019
    Decode and Write Port                 625               0.162
    Source Reg and Dest Reg             63750              16.549
    Source Reg and Write Port             625               0.162
    Dest Reg and Write Port                 1               0.000
    Write Port and Mem Port             62500              16.225

Figure 40. Statistics describing the breakdown of the Multiple latest available resource category for cases where only two resources were involved.

In general, decreasing the number of instructions that must wait for the issue unit could be accomplished by a number of different means, such as increasing the fetch and issue rates of the processor, and/or resolving branches more quickly. Finally, from figure 38 we also see that the Multiple category includes a large number of the executed instructions (roughly one third). The multiple category is used to indicate that two or more resources were unavailable to a given instruction until the exact same (latest) time. We have added code to differentiate the multiple category for those situations where exactly two resources delayed the instruction by the same amount, as shown in figure 40. Looking at figure 40, we see that the breakdown of the multiple category into resource pairs describes nearly all of the multiple category stalls for this workload (i.e. 127,576 out of 128,826 multiple category cases are caused by two resources, which is 99.03 percent of the multiple category stalls). It would certainly be possible to add code that could determine the breakdown of the multiple category into even further subcases, but because of the potentially large amount of bookkeeping overhead and the fact that the current breakdown captures most of the multiple category adequately, we simply have not implemented such code. Looking at the pairs of resources that cause significant stalls in figure 40, we note that roughly half the multiple category stalls correspond to source and destination registers, and another half correspond to a register write port and the memory port.


The function unit utilization statistics show:
  FU    active cycles    empty cycles    pct util
  --    -------------    ------------    --------
   1           951262               3     100.000
   2           951252              13      99.999
   3           951195              70      99.993
   4           951247              18      99.998
   5           118869          832396      12.496
   6           118840          832425      12.493

Figure 41. The function unit utilization statistics report.

The cases where the source and destination registers are both the latest available resources are again largely due to dependence in the code. The register write port resource conflict stalls are a result of too few register write ports in the register files, and the memory port resource conflict stalls are a result of the fact that this processor includes only a single-ported data cache, even though two memory accesses can be simultaneously executing. Thus, these stalls can be addressed by increasing the number of register and data memory ports; note, however, that increasing either the number of register write ports or the number of memory ports alone will not suffice, since both resources stall the instructions in question by the same amount.

The FU Utilization Report. To better indicate to the user which function units are heavily utilized or experience significant conflict, REAP provides information about the utilization of each function unit, as shown in figure 41. The function units in figure 41 are the FXUs (FU 1 and 2), the FPUs (FU 3 and 4) and the BRUs (FU 5 and 6). Looking at figure 41, note that REAP reports the overall function unit utilization in terms of the number of active and empty cycles. The active cycles correspond to the number of execution cycles in which the function unit had some instruction in at least one of its pipeline stages (including the time spent stalled in the decode stage waiting for resources). The number of empty cycles corresponds to those cycles in which the function unit had no instructions anywhere in its pipelines.

The Decode Utilization Report. The decode utilization report is similar to the FU Utilization report, except that the data reported here is the utilization of the function unit decode stage. The decode stage execution statistics indicate the number of active, stall and empty cycles for each function unit's decode stage.


The function unit decode utilization statistics show:
  FU    active cy    stall cy    empty cy    pct util    act util    stall util
  --    ---------    --------    --------    --------    --------    ----------
   1        96880      854376           9      99.999      10.184        89.815
   2        97027      854222          16      99.998      10.200        89.799
   3        63125      887970         170      99.982       6.636        93.346
   4        63125      888070          70      99.993       6.636        93.357
   5        32528       53814      864923       9.077       3.419         5.657
   6        32526       53788      864951       9.074       3.419         5.654

Figure 42. The function unit decode stage utilization statistics.

The active cycles are cycles in which the decode stage was actively analyzing an instruction, while the stall cycles are cycles where an instruction is stalled waiting in the decode stage for a dependence or conflict to clear, and the empty cycles are cycles when there is no instruction in the decode stage. For each instruction, the function unit decode stage will only credit one active cycle for the instruction; any further cycles that the instruction spends in the decode stage are considered stall cycles. Figure 42 shows the Decode Utilization report for the lfk21 test case on the p2 processor version we have been considering here. The tables of figure 41 and figure 42 indicate the efficiency with which the processor is using the function unit resources. The overall function unit utilization statistics indicate the number of cycles in which a function unit is being used in any way, thus indicating a measure of total usage for the function unit. The decode utilization shows the amount of time that instructions are in decode, breaking down the time the decode stage spends actively decoding instructions versus inactive (either stalled waiting for an instruction's resources or empty waiting for an instruction to decode). Thus, the user can use these tables to form an overall understanding of the usage that the workload code makes of the function unit resources.

The Pipeline Utilization Reports. Beyond the overall function unit information, each REAP function unit can contain several different execution pipelines. Because these different pipelines will execute different instruction execution classes, information about the way each pipeline is used during the workload execution may provide additional insights to the user. Thus, REAP provides utilization information for each execution pipeline, as shown in figure 43 for the lfk21 test case on this buffered p2 processor.
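
The decode-stage accounting rule just described (one active cycle credited per instruction, any additional decode cycles counted as stalls, and the remaining cycles empty) can be sketched as follows. This is an illustrative reconstruction with hypothetical names, not the REAP source.

    /* Per-function-unit decode-stage accounting, as described in the text. */
    typedef struct {
        long active_cycles;   /* one cycle per instruction that is decoded      */
        long stall_cycles;    /* extra cycles an instruction waits in decode    */
        long empty_cycles;    /* cycles with no instruction in the decode stage */
    } decode_stats;

    /* Record an instruction that occupied the decode stage from cycle `enter`
       up to (but not including) cycle `leave`.                                 */
    void decode_account(decode_stats *d, long enter, long leave)
    {
        d->active_cycles += 1;                    /* exactly one active cycle   */
        d->stall_cycles  += (leave - enter) - 1;  /* the rest are stall cycles  */
    }

    /* Whatever remains of the run is empty time for this decode stage. */
    void decode_finish(decode_stats *d, long total_cycles)
    {
        d->empty_cycles = total_cycles - d->active_cycles - d->stall_cycles;
    }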


The utilization statistics for the full execution pipes show:
  F Unit    E Pipe    cycles active    cycles empty    pct util
  ------    ------    -------------    ------------    --------
     1         1             288138          663127      30.290
     1         2                  0          951265       0.000
     2         1             288303          662962      30.307
     2         2                  0          951265       0.000
     3         1             252500          698765      26.544
     3         2                  0          951265       0.000
     4         1             252500          698765      26.544
     4         2                  0          951265       0.000
     5         1              32528          918737       3.419
     6         1              32526          918739       3.419

Figure 43. Execution pipe utilization statistics.

The execution pipeline utilization of figure 43 indicates for each execution pipeline the number of cycles during which an instruction was in at least one execution stage of the pipeline. Similar utilizations could also be given for an individual stage (such as the first pipeline execution stage) if the user wished to distinguish that information. Because REAP issues different instruction execution classes to different execution pipes, this breakdown of the pipeline utilizations of each function unit provides a finer granularity of the utilization statistics. Figure 43 indicates that only one of the function unit execution pipelines was used for each of the function units, and thus this workload does not include instructions from the instruction execution classes that are served by the second pipelines (i.e. the test case trace included no fixed-point multiply or divide instructions for function units 1 and 2, and no floating-point divide instructions for function units 3 and 4).

The Input Buffer Use Report. Another important consideration in a processor design is whether the processor provides ample function unit input buffering. If there is enough input buffering, then the issue process can slip ahead of the execution process, allowing the processor to hide some or all of the load dependence latencies. REAP provides a report of the utilization of the input buffer stages, as shown in figure 44. For each function unit, a distribution is given showing the number of cycles during which each number of buffer stages was in use. The first column indicates the number of stages in use. For each number of stages, N, the second column indicates the number of cycles of execution that exactly N buffer stages were in use (i.e. held an instruction).


Function unit input buffer data:
Function Unit with id 1 :
  # Stages    # Cycles    Cumulative
   in use        Used     Cycles Used
  --------    --------    -----------
      0            768            768
      1           4199         950497
      2            727         946298
      3           2804         945571
      4           5369         942767
      5           7973         937398
      6          11825         929425
      7           4625         917600
      8           9175         912975
      9          11100         903800
     10         892700         892700
Function Unit with id 2 :
...

Figure 44. Function unit input buffer statistics.

The third column indicates, for all but the 0-stages row, the number of cycles where N or more buffer stages were in use. Looking at figure 44, note that function unit one (one of the FXUs) uses all ten of its input buffers fairly often during the execution of this workload. This indicates that the issue unit is able to send work to the function unit faster than the function unit can consume it. Note, however, that even though figure 44 shows that all the buffers are often full, this does not necessarily indicate that the number of buffer stages should be increased to improve performance. It is entirely possible that the issue process has already slipped far enough ahead of the execute process so that the execution units are not often starved for work.

The Register Utilization Report. Another potentially interesting area for consideration is the way in which registers are used during execution; the user may be interested in the distribution of register accesses, the number of registers used by the workload, the effectiveness of register renaming, and so forth. REAP can produce several different reports about the use of registers, beginning with a description of the utilization of each register of each register file, as shown in figure 45 for the fixed-point general purpose register file (register file number one).


Register resource usage information (for used registers):
Register File with register file id 1 :
  Rnum    active cy    wait cy    empty cy    pct util    act util    wait util
  ----    ---------    -------    --------    --------    --------    ---------
    0            27     913182       38056      95.999       0.003       95.997
    1             0          0      951265       0.000       0.000        0.000
    2             1          7      951257       0.001       0.000        0.001
    3        127500     694375      129390      86.398      13.403       72.995
    4             0          0      951265       0.000       0.000        0.000
    5             0          0      951265       0.000       0.000        0.000
    6             0          0      951265       0.000       0.000        0.000
    7        126875     687500      136891      85.610      13.337       72.272
    8           626     949110        1529      99.839       0.066       99.773
    9          2524     909696       39046      95.895       0.265       95.630
   10          1900     947946        1420      99.851       0.200       99.651
  ...

Figure 45. Register utilization information (no renaming used).

In figure 45, REAP reports the number of cycles in which each register was active (i.e. a data value being read from or written into the register), the number of cycles it was empty (i.e. cycles before the first definition, or cycles between a last use and a new definition), and the number of cycles the register was waiting (i.e. the register was holding a value waiting for its next use). In this example, only the first ten registers of the fixed-point register file are shown. No register renaming was used in this processor, so the utilizations of figure 45 indicate the register usage as defined by the compiler's register allocation. In cases where register renaming is employed in the processor, the usage should be much more uniform, and the utilizations would indicate the efficiency of the register renaming.

The Register Port Utilization Report. The use of the various register ports in the different register files can also provide interesting data about the pressure that such resources experience. Recall that the register ports model was somewhat restrictive in REAP, because the register ports are allocated at issue time, and thus can restrict the instruction execution order. Figure 46 shows a REAP report of the register port utilization. In figure 46, REAP reports the number of cycles during which each port was active (i.e. a register value was being moved through that port on that cycle) and the number of cycles it was idle (i.e. no register value was moved through the port on that cycle). Each read and write port of each register file is listed, with the read ports identified by an "RP-" prefix, and the write ports by a "WP-" prefix.


Register port usage information:
Register File with register file id 1 :
  Port    cycles used    cycles idle    pct util
  ----    -----------    -----------    --------
  RP-0          66353         884912       6.975
  RP-1          65103         886162       6.844
  RP-2          65003         886262       6.833
  WP-0          65153         886112       6.849
  WP-1          64379         886886       6.768

Register File with register file id 2 :
  Port    cycles used    cycles idle    pct util
  ----    -----------    -----------    --------
  RP-0         126250         825015      13.272
  RP-1         126250         825015      13.272
  RP-2          63125         888140       6.636
  WP-0         253125         698140      26.609

Register File with register file id 3 :
...

Figure 46. Register file read and write port utilization statistics.

Note the relatively high utilization of WP-0 of register file 2, which is the single write port of the floating-point register file. In a case where a register port shows a high utilization, this could indicate that the number of register ports of that type should be increased, since the ports may constitute a performance bottleneck. Note that the port utilization is well balanced in most cases. The reason that the port utilization is balanced is that the register ports are assigned (as are most resources in REAP) on a "first-available is preferred" basis, i.e. if there are two register ports available at time t and time t+i, then REAP would use the port available at time t. The reason that port RP-2 of register file 2 (which is a read port to the floating-point register file) does not have a percent utilization similar to that of ports RP-1 and RP-0 from that register file is that the workload includes an fma instruction, which uses all three of the floating-point register read ports on a given cycle. After that instruction executes, all three read ports will show the same availability time. There are two following floating-point instructions in the inner loop that use one read port each, and REAP selects the read ports in order (since they all have the same availability time), resulting in ports RP-0 and RP-1 being utilized twice for every use of RP-2.
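
The "first-available is preferred" policy can be illustrated with a short sketch: among the ports of the required kind, the one with the smallest availability time is chosen, and the scan order means that ties go to the lowest-numbered port. The code below is an illustrative simplification (hypothetical names, and a port is assumed to be busy for a single cycle per access), not the REAP implementation. Under this policy, after an fma has left all three read ports with the same availability time, two subsequent single-operand reads will fall on ports RP-0 and RP-1, which is consistent with the imbalance seen in figure 46.

    /* Choose a register port for an access that is ready at cycle `ready`:
       scan the ports in order and take the one that becomes free earliest,
       so ties are resolved in favor of the lowest-numbered port.
       avail[i] is the cycle at which port i next becomes free.              */
    long allocate_port(long avail[], int nports, long ready, int *chosen)
    {
        int best = 0;
        for (int i = 1; i < nports; i++)
            if (avail[i] < avail[best])
                best = i;                   /* earliest-available port wins  */

        long use_cycle = (avail[best] > ready) ? avail[best] : ready;
        avail[best] = use_cycle + 1;        /* assume the port is busy for one cycle */
        *chosen = best;
        return use_cycle;
    }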


Register port usage information:
Register File with register file id 1 :
  Read Ports:                       Write Ports:
  # Ports    Cycles Used            # Ports    Cycles Used
  -------    -----------            -------    -----------
     0           757984                0           822361
     1           190729                1           128276
     2             1926                2              628
     3              626

Register File with register file id 2 :
  Read Ports:                       Write Ports:
  # Ports    Cycles Used            # Ports    Cycles Used
  -------    -----------            -------    -----------
     0           761890                0           698140
     1           126250                1           253125
     2                0
     3            63125

Figure 47. Register read and write port usage histograms.

The Register Port Histograms Report. To better determine whether a register port is bottlenecking the processor performance, REAP can report some further information regarding the register port use, as shown for register file 1 (which is the fixed-point register file) and register file 2 (which is the floating-point register file) in figure 47. This information is a histogram of the number of cycles during execution that a given number of register ports were in use. Looking at the data in figure 47, we see that there were very few cycles (only 626) where all three fixed-point register read ports were simultaneously being used, and two ports were being used on only 1926 cycles. The reason for this is that most of these fixed-point register file accesses were for the base registers that were used in floating-point memory accesses, and there is relatively little fixed-point calculation in the loop kernel trace. Similarly, the fixed-point register file write port data shows that the two fixed-point write ports are sufficient for this workload, because the two ports are both simultaneously in use on only 628 of the total cycles of execution. We can also see that the floating-point register port use of figure 47 is consistent with that in figure 46. Recall that one of the floating-point read ports in figure 46 is used only half as often as the other two read ports, and that this was explained by the fact that the fma instruction in the inner loop code uses all three read ports and causes their availability times to be set to the same time.


...
Statistics for values in register 7:

Values that were loaded then overwritten: none

Values that were assigned then overwritten:
  life    # Vals  |  # Reads    # Values    # Stores    # Values
  ----    ------  |  -------    --------    --------    --------
    0        624  |  none            624    none           63749
    1        625  |  one           63125    one                0
   12      62500  |  two               0    many               0
                  |  many              0

Values that preexisted and then were overwritten: none
Values that were loaded and persisted out of trace: none

Values that were assigned and persisted out of trace:
  life    # Vals  |  # Reads    # Values    # Stores    # Values
  ----    ------  |  -------    --------    --------    --------
   11          1  |  none              1    none               1
                  |  one               0    one                0
                  |  two               0    many               0
                  |  many              0

Values that preexisted into and persisted out of trace: none

Statistics for values in register 8:
...

Figure 48. Register value (lifetime) information.

The remaining two floating-point instructions in the inner loop code each read a single floating-point register at different times. Looking at figure 47, we see that a single read port of the floating-point register file (the register file with id 2) is in use on 126,250 cycles, and all three floating-point read ports are used on 63,125 cycles. Thus, the 63,125 cycles indicate the executions of the fma instruction, while the 126,250 cycles indicate the execution of the other two floating-point instructions.

The Register Value Data Report. Users might also be interested in a characterization of the values used in the workload; if the workload has many values that are initially defined and then held for long periods of time before their last use, then the workload will present much more stress on the register files than another application where a value's last use tends to follow soon after its initial definition. Similarly, if a lot of values are defined and used only once in the workload, the performance might be improved by the use of a different register file implementation than is used for a workload program that reuses its values many times. REAP can provide statistics about the values that move through the registers as shown for fixed-point register 7 in figure 48.

Looking at figure 48, note that the report for a given register is broken into six categories (each marked in bold), indicating which values were defined by being loaded from memory, which were assigned to the register as the result of an operation, and which were defined on entry to the trace (i.e. preexisted in a register at the beginning of the trace). The categories also indicate how the values were killed: the value was either overwritten (i.e. another value was loaded from memory or defined by an operation and assigned into this register), or persisted in a register at the end of the workload trace. For each of these six classifications, three tables are given. The first lists the total number of values with each particular lifetime (in processor cycles) that existed in the register under that category over the full workload execution. Lifetimes of fewer than twenty cycles are broken down explicitly, and lifetimes greater than or equal to twenty cycles are lumped together. Lifetimes that did not occur are not shown. The second table then lists the number of values that were never read, read only once, read only twice, or read many times during their lifetime. Similarly, the third table reports the number of values that were never stored, stored once, or stored more than once. For this workload, general purpose register 7 is being used as an address register for accessing the elements of an array, and the primary use of register 7 occurs in a single floating-point load with update instruction. Even though this is a load, the value is not being loaded, but rather is computed by the load instruction and assigned to the address register. Looking at the category of values that were assigned and then overwritten in figure 48, we see that there are 63,749 total uses of register seven in that category, and that 62,500 of them have a lifetime of 12 cycles, and are read once before they are overwritten. These are the inner-loop uses of register seven by the floating-point load with update command. The remaining uses of register seven (i.e. those with a lifetime of zero or one cycles) result from the execution of the interface code between the innermost loop and the next outer level of the loop nest. Thus, the reports shown here provide the user with a much more detailed description of the workload execution on the modeled processor. While the set of reports described above constitutes the full set of reports that have been implemented to date in REAP, they by no means exhaust the set of reports that could be implemented.

Instead, these reports are presented primarily to indicate the kinds of detail that can be recovered from an RCM simulation of a processor executing a workload.

2.6 RCM Modeling: Conclusions

This chapter has described the resource conflict methodology, which was developed to allow users to consider the performance of many different processor organizations early in the machine design cycle. The focus of RCM is to provide users with a flexible means for specifying processor models so that many different processors can be easily compared. A uniform, abstracted hardware element model is used to provide a simplified means for describing the interaction between the instructions and their resources, which allows the development of a highly parameterizable RCM-based simulator. The workload execution performance is determined by considering the stalls created when an instruction's required resources for execution are unavailable. Each instruction execution requires a set of resources to begin execution and affects a set of resources during execution; affected resources will be unavailable to following instructions until the current instruction has released them. When these resource conflicts are used to delay the start of execution of following instructions, a realistic portrait of the workload execution is formed, showing the overall execution performance. The resource conflict methodology is based on two primary principles:

1. The execution of a later-issued instruction cannot alter the execution of an earlier-issued instruction.

2. All aspects of the processor timing can be modeled as a nominal time plus the effects of resource conflicts.

These principles define a set of restrictions that must hold in a processor in order to produce a reasonable processor model for it using RCM. In this work, an RCM-based modeling tool called REAP has been developed and implemented to model a family of superscalar processors. REAP adds a third principle to this list:

3. The availability of each resource can be adequately modeled by keeping only the time at which the resource becomes available to later instructions, i.e. by assuming that the resource was busy at all previous times and remains available at all later times until some later instruction uses it.

This third principle arises from the manner in which REAP was implemented; REAP uses a single value to represent each resource availability time, indicating the time at which the resource becomes available for any use by a later instruction. The resource is assumed to be busy for all times prior to this availability time, and thus the REAP availability is modeled as a moving barrier before which the resource is unavailable. REAP was implemented with a set of processor hardware elements for a class of basic superscalar processors. This REAP model includes resources for the processor issue unit, function units, execution pipelines, function unit input buffers, registers, and register ports. A number of tests were conducted to compare the REAP simulations to cycle-by-cycle timers of the same processor model, and these studies showed that REAP can provide performance estimates and cycle-by-cycle execution results that match those of early design-stage timers. There is, therefore, no inherent loss of accuracy due to the use of an instruction-based simulation methodology, at least for these basic processor elements. REAP has also been extended to provide a broader set of processor hardware elements that can be included in the processor models. These extensions include finite caches, reorder buffers, pending store queues, and simple branch predictors. REAP can also generate a number of reports, including descriptions of the workload composition broken down by instruction class, the size of the issue groups, the latest available resource for each instruction, the function unit utilizations, the function unit decode stage utilizations, the execution pipeline utilizations for each function unit, the input buffer utilization, the register utilizations and the distribution of register read and write port uses for each register file, a histogram of register read and write port uses for each register file, and the uses and lifetimes of values in the registers for each register file. The major limitation of resource conflict methodology modeling and simulation, however, is the requirement that an execution trace be used to drive the simulation. The resource conflict methodology must analyze every instruction in an input trace, determining the availability of the resources used by that instruction, and recording the changes that occur from the execution of the instructions. Because full execution traces are generally very long, often containing hundreds of millions or billions of instructions, each simulation run can take a considerable time.

Even if the RCM model contains relatively few resources, and thus does not need to do much work per instruction, the multiplication of this work by billions of instructions will necessarily lead to a significant aggregate time. In order for a user to be able to compare a large number of machines in a reasonable period of time, the time required to simulate each processor must be relatively small. Chapter III presents a methodology that was developed to reduce the time required to simulate each workload by reducing the trace to a reduced trace description that eliminates much of the redundant information in the original full execution trace. This reduced trace description can then be analyzed much more quickly than the full trace, at the cost of a (small) loss in the accuracy of the estimate, as well as some additional difficulties in modeling some kinds of processor elements and in gathering some of the detailed execution statistics for which the required information is no longer available.

CHAPTER III

REDUCED TRACE ANALYSIS

3.1 Introduction

The focus of this dissertation is the development of methods that allow designers to explore a large space of processor organizations in order to compare the architectural and organizational trade-offs between the organizations. Chapter II presented the resource conflict methodology (RCM), which provides a flexible means for specifying and comparing different processor organizations. Unfortunately, RCM is also a full execution trace-driven simulation. Since current execution traces often include billions of instructions, the evaluation of each processor could take a large amount of simulation time. Because product development time is critical given the typically short market windows of the computer industry, methods for comparing processors should be devised that yield useful results without requiring a long time per processor evaluation. This would provide designers with a greater opportunity for design space exploration within a limited amount of time.

3.1.1 Analytic Models: Using a Simplified Processor Model

Most often, the move to faster processor performance simulation involves a simplification of the processor model; simplifying assumptions are made or higher-level models of the processor are devised. While the application of simplifying assumptions can certainly reduce the time spent in generating a performance estimate, more serious issues arise regarding the accuracy of the performance estimate returned.


Furthermore, the use of higher-level models of the system and simplifying assumptions regarding its operation can lead to a general reduction in the number of processor organizations that can be specified, distinguished and examined using these methods, because the assumptions and high-level models hide the effect of the finer details available in a fully-detailed simulation. One of the best known methods for quickly analyzing processor designs is to use analytic models of the system. Because an analytic model describes the system behavior in terms of a set of (system and workload) parameters and a set of equations relating them, the system performance can be determined very quickly. Unfortunately, these methods also require the use of many simplifying assumptions that reduce the overall accuracy of the model, and reduce its ability to distinguish fine-grained differences in execution between two processors, much less their effect on performance [29]. Consider, for example, the simple set of performance bounds models proposed by Mangione-Smith et al. in [51] and [49] that determine an upper bound on the steady-state performance of a processor's execution of a single basic-block inner loop code. In that work, the models are shown to provide performance bounds for the vectorizable Livermore Fortran Kernel codes [54] that are close to the performance that can be achieved by the real processor (with some code tuning). Though this bounds analysis methodology was intended primarily for the performance tuning of scientific (loop-based) codes on an existing machine, the speed of analysis and encouraging results for the LFK codes might prompt one to consider using this method to compare proposed (new) processor organizations. Note, however, that a large number of assumptions are used in the model to help reduce the execution of a superscalar processor to a mathematically tractable level. First, the model is restricted to the analysis of single basic block inner loop bodies. Although this was later extended by Lee in [48] to a weighted sum, reflecting the analysis of many loop bodies, there is still no analysis of non-loop code. Similarly, the model considers only the throughputs of the issue unit and selected functional units, plus the loop-carried dependence effects, which are deemed a priori to be the most likely performance bottlenecks. If other factors are important in the performance of the processor, the need for this extension would first have to be recognized, and the analytic model would have to be extended to capture those effects. Some work has been done to include more effects in the bounds analysis, such as the work by Boyd to include more of the compiler effects [13], the work by Wang to include the Alpha 21064 first level cache effects [77], and similar work for parallel processors in [1], [2] and [12].

When analytic evaluations are used to guide design decisions, a simplistic analysis often proves to be far too optimistic in assessing the effect of a desired feature. Often, the actual impact delivered by such a design decision is far less than the analysis suggested because of factors that were not considered. Conversely, some features are incorrectly deemed to have little impact, because they were not considered within the context of an appropriate set of design features that enabled them to deliver their full performance potential. Analytic models are often incapable of identifying and evaluating such complex interactions between design alternatives. Furthermore, while the insights from analytic methods can be interesting, the number of processor designs that they can distinguish is somewhat limited; in the bounds model, the only way to distinguish the effect of changing function unit latencies is through loop-carried dependence. While this may be appropriate for a performance bound that seeks to set up an ideal performance target by recognizing only fundamental limits to performance, other effects must be considered for accurate performance estimation. Similarly, there is no way to distinguish processor design features such as in-order versus out-of-order instruction issue, since the bounds model assumes a perfect instruction scheduling order (though the MACS bound of [13] does incorporate the effect of a given instruction schedule, there is no automated or fully analytic method for calculating the MACS bound). The most effective way to employ such an analytic model might be as part of a hierarchical search of the design space, as proposed by Kumar and Davidson in [43], where a high-level analytic model that provides an upper bound on performance can be used to eliminate large portions of the design space from further, more detailed investigation. Although we do not consider them further here, we will consider the application of analytic models as part of such a hierarchical design space analysis methodology in chapter IV.
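
For reference, the flavor of such a bounds model can be conveyed by a small sketch: a steady-state cycles-per-iteration lower bound for a single-basic-block inner loop is the largest of the per-resource throughput demands and the loop-carried recurrence latency. The numbers and resource list below are hypothetical, and the sketch is a generic illustration of this style of model rather than a reproduction of the bounds of [49], [51], or [13].

    #include <stdio.h>

    static double max2(double a, double b) { return a > b ? a : b; }

    int main(void)
    {
        /* Hypothetical per-iteration workload and machine parameters. */
        double instrs_per_iter = 10.0, issue_width = 4.0;   /* issue unit        */
        double fx_ops_per_iter = 4.0,  n_fxu = 2.0;         /* fixed-point units */
        double fp_ops_per_iter = 3.0,  n_fpu = 2.0;         /* floating-point    */
        double recurrence_latency = 5.0;                    /* loop-carried dep. */

        double bound = instrs_per_iter / issue_width;       /* issue throughput  */
        bound = max2(bound, fx_ops_per_iter / n_fxu);
        bound = max2(bound, fp_ops_per_iter / n_fpu);
        bound = max2(bound, recurrence_latency);

        printf("cycles per iteration >= %.2f\n", bound);    /* 5.00 for these values */
        return 0;
    }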


3.1.2 Reduced Trace Analysis: Reducing the Analysis Redundancy

Even after reducing the search space, it may still be necessary to consider a large number of processor designs. A means for quickly generating performance estimates for a large number of (related) processor organizations is still needed in order to select from those designs that provide the most promising performance (or performance/cost ratio). It is for this purpose that we developed reduced trace analysis (RTA). Reduced trace analysis provides a method for quickly analyzing the performance of a given processor organization model for a given workload execution trace with an accuracy that is nearly as good as full execution trace-driven simulation. The reduced trace analysis methodology has been developed to address two main problems of full execution trace simulation: the size of the input trace and the number of instructions that must be evaluated by the simulator. Both of these problems are addressed through the same mechanism: trace reduction. Workload execution traces are typically very large, including hundreds of millions or even billions of instruction executions. The workload codes themselves, however, are generally much smaller than the traces because the execution traces include a new copy of the instruction information for every execution of that instruction. RTA reduces the execution trace, removing a large number of these redundant copies of the instructions by representing the execution trace as a control-flow graph of code blocks with weighted flow links between the blocks. By reducing the initial full execution trace to the much more compact control flow graph which constitutes a reduced trace description, the size of the input to a reduced trace analyzer is greatly reduced. In order to reduce the number of instruction evaluations done to analyze the reduced trace description, the block pairs sufficiency assumption is employed. The block pairs sufficiency assumption states that the effect of a prior sequence of block executions on the execution of a block that immediately follows that sequence is dependent only on the last block of the sequence. Given this assumption, reduced trace analysis needs only to consider pairs of blocks (rather than triples, quadruples or more), and can evaluate each code block and each connected block pair only once, potentially saving a great deal of instruction evaluation.

This chapter will introduce the general concepts of the reduced trace analysis (RTA) methodology in section 3.2, and then describe the RETANE reduced trace analysis tool that we developed (both to illustrate and test the RTA approach) in section 3.3. Our initial tests with RETANE indicated that many of the initial reduced trace descriptions we generated did not satisfy the block pairs sufficiency assumption for some of the processor models we analyzed, so section 3.4 introduces the concept of reduced trace optimization, which is used to modify the trace descriptions to help satisfy the block pairs sufficiency assumption. Section 3.5 considers the extended processor element models (added to REAP in section 2.4) and discusses how these models can be included in RETANE to model more interesting and complex processors. Section 3.6 then describes the generation of more detailed execution information by determining the subset of the workload execution reports generated by REAP in section 2.5 that can be accurately generated using a RETANE analysis of a reduced trace description.
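
To make the role of the block pairs sufficiency assumption concrete: if the cycles contributed by a block depend only on which block immediately preceded it, then each connected (predecessor, successor) pair needs to be evaluated only once, and the whole-trace estimate becomes a flow-weighted sum over the pairs. The sketch below illustrates that accumulation with hypothetical data structures; it is not the RETANE implementation, and the per-pair cost is assumed to have been measured by analyzing the successor block once in the context of its predecessor.

    /* One entry per connected block pair in the reduced trace description. */
    typedef struct {
        int    pred, succ;      /* block identifiers                             */
        long   weight;          /* times control flowed along this link          */
        double pair_cycles;     /* cycles for succ when it directly follows pred */
    } block_pair;

    /* Weighted sum over the pairs, evaluated once per pair. */
    double estimate_total_cycles(const block_pair pairs[], int npairs)
    {
        double total = 0.0;
        for (int i = 0; i < npairs; i++)
            total += (double)pairs[i].weight * pairs[i].pair_cycles;
        return total;
    }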

3.2 Overview of Reduced Trace Analysis

A full execution trace includes all the executed instructions of a given workload program, so it represents the dynamically executed workload instructions as a sequential list. In most programs, the number of dynamically executed instructions is much larger than the number of static instructions in the executable program because portions of the static workload code are executed very many times during the full workload run. A workload basic block is a section of the workload's static code that has a single entry point (at the start of the basic block) and a single exit point (at the end of the basic block). Thus, the instructions in a basic block are always executed in sequential order. When the workload code is executed, the execution can therefore be considered as a set of basic-block executions with an ordered set of transfers of control between the basic blocks. Reduced trace analysis takes advantage of this view of the workload execution to reduce the execution trace even further. Rather than retain an ordered list of workload basic block executions, the execution trace is reduced to a weighted control flow graph form, called a reduced trace description, where the transfers of control from a given basic block to another given basic block are summarized using a weighted control flow arc. Reduced trace analysis then analyzes these reduced trace descriptions to generate a performance estimate for the entire workload execution.


Figure 49. Diagram of reduced trace analysis methodology. [The diagram shows the program compiled into an executable; a tracer generates one execution trace; the Trace Description Extractor reduces that trace to a reduced trace description; and the Reduced Trace Simulator, driven by the reduced trace description and a machine description, is run many times (once per machine) to produce performance estimates.]

The overall approach used here in reduced trace analysis thus employs a three stage process: the full workload execution trace is generated, a reduced trace description is formed to represent this full execution trace, and then this reduced trace description is analyzed for a number of different machine organizations. This complete process is shown in figure 49. The first step shown in figure 49 is to compile the high-level workload program into an executable file on an existing machine, and then to instrument the program and produce the full execution trace. This full execution trace is then fed to the trace reduction program which generates a reduced trace description of that trace. Once this reduced trace description has been generated, the user can proceed using only the reduced trace description (i.e. the trace description need only be generated once, and the full trace is no longer needed in the following analyses). The trace description and a set of processor organization parameters are then fed to a reduced trace simulator which analyzes the reduced trace description and generates a performance estimate for the execution of the workload on that processor much more quickly than for a full execution trace analysis.

Note that the reduced trace analysis described in this chapter will only consider the generation of reduced trace descriptions from a full workload execution trace. There is no reason why the initial reduced trace description could not be formed without the generation of a full execution trace by examining the workload code in concert with profile information. When the workload source code is used to develop the reduced description, we will call it a reduced workload description and the whole process is called reduced workload analysis. To date, we have only implemented reduced trace analysis (i.e. using an execution trace to develop the reduced description) and the move from the workload source code to a reduced workload description is left as open research. The analysis of a reduced workload description can simply use the same general analysis methods that we have developed for reduced trace descriptions, though the reduced workload description does not contain compiler-generated register allocations and code scheduling.

3.2.1 The Reduced Trace Description

The reduced trace description is a weighted control-flow graph describing the code blocks executed during program execution and the total flow between those blocks. A sample reduced trace description is shown in figure 50, which will be used here to illustrate the reduced trace description concepts.

There are two basic elements in a reduced trace description: code blocks, which form the nodes in the graph, and flow links, which form the edges. In the initial formulation of a reduced trace description, the code blocks each correspond to an executed basic block from the workload execution trace, and the flow links represent transfers between blocks, weighted by the number of times that control flowed along that flow link from one block to another during the full workload execution. Each code block contains a unique block identifier, a stream of sequentially executed instructions, and a list of flow links identifying the blocks to which control may be transferred on exit from this block. Thus, code block B2 in figure 50 would contain the code stream associated with that block, and pointers to code blocks B3 and B5.

[Figure 50. Example reduced trace description: a weighted control-flow graph over code blocks B1 through B7, with flow links B1->B2 (weight w1,2), B2->B3 (w2,3), B2->B5 (w2,5), B3->B4 (w3,4), B3->B6 (w3,6), B4->B3 (w4,3), B4->B6 (w4,6), B5->B5 (w5,5), B5->B6 (w5,6), B6->B2 (w6,2), B6->B7 (w6,7), and B7->exit (w7,x).]

The flow links contain both a pointer to the next block and the weight indicating the number of times that link was traversed during execution.

When discussing the blocks and links in a reduced trace description control flow graph, some further terms will prove useful. The analysis of a reduced trace description is based on the analysis of blocks and connected pairs of blocks. When two blocks are connected by a flow link, the block from which the link emanates is called the predecessor block, and the block to which the link points is the successor block. Looking at the reduced trace description of figure 50, blocks B1 and B2 form a block pair, where block B1 is the predecessor block and block B2 is the successor block (this pair is linked by the flow link with weight w1,2). Similarly, block B5 has a link (with weight w5,5) that connects back to itself, so the graph of figure 50 also includes a block pair from block B5 to block B5.

When looking at a given block, there will generally be links that are initiated from the block and links that terminate at (i.e. point to) the block. The set of links that are initiated from the block are called block exit links (or just exit links), and they point to the set of successor blocks whose execution followed the execution of this block at least once in the trace. The set of links that terminate at a block are the block entry links (or just entry links), and each of these links is an exit link from some predecessor block of this block.


while (the trace has not been completely scanned) {
    set curr_instr to the next instruction from the trace
    add the curr_instr to the current basic block (curr_BB)
    if (curr_instr ends the basic block) {
        /* Look for previous definition of curr_BB in blocks list */
        if (matching block found in basic blocks list) {
            set matching_BB to the block from the basic blocks list
            if (last_BB has a link to matching_BB) {
                increment weight on the link from last_BB to matching_BB
            } else {
                /* New link defined from last_BB to matching_BB */
                add a weight 1 link from last_BB to matching_BB
            }
            set last_BB to be matching_BB
        } else {
            /* No match found: new block definition */
            add curr_BB to the basic blocks list
            add a weight 1 link from last_BB to curr_BB
            set last_BB to be curr_BB
        }
    }
}

Figure 51. Algorithm to generate initial reduced trace description from an execution trace.

Thus, looking at the reduced trace description graph of figure 50, for block B3 there are two entry links (with weights w2,3 and w4,3) and two exit links (with weights w3,4 and w3,6).
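To make this structure concrete, the following C sketch shows one possible in-memory representation of the code blocks and flow links just described. The type and field names are illustrative only; they are not taken from the actual RTA implementation.

struct CodeBlock;                       /* forward declaration for FlowLink        */

/* One instruction of a block's sequential stream (fields illustrative).           */
typedef struct Instruction {
    unsigned long address;              /* static address of the instruction       */
    unsigned int  opcode;               /* decoded operation                       */
} Instruction;

/* A weighted control-flow arc: an exit link of its predecessor block and
   an entry link of its successor block.                                            */
typedef struct FlowLink {
    struct CodeBlock *successor;        /* block to which control may transfer     */
    unsigned long     weight;           /* number of traversals seen in the trace  */
    struct FlowLink  *next;             /* next exit link of the same block        */
} FlowLink;

/* A node of the reduced trace description graph.                                  */
typedef struct CodeBlock {
    unsigned int      id;               /* unique block identifier                 */
    Instruction      *code;             /* sequentially executed instruction stream */
    unsigned int      length;           /* number of instructions in the block     */
    FlowLink         *exit_links;       /* weighted links to the successor blocks  */
    struct CodeBlock *next;             /* next entry in the basic blocks list     */
} CodeBlock;

Entry links are not stored explicitly in this sketch; as in the description above, each entry link of a block is simply the exit link of one of its predecessors.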

3.2.2 Forming the Initial Reduced Trace Description

The initial reduced trace description can be formulated directly from the execution trace by determining which basic blocks are executed in the trace and recording the flow of control between them. Accomplishing this task is relatively straightforward, requiring only a forward scan through the full execution trace and some bookkeeping. The general algorithm we developed to extract trace descriptions is described in figure 51. This algorithm simply reads in each instruction from the execution trace and assigns it to a current working basic block. If this instruction can alter the flow of program control (i.e. it is a branch instruction), then it ends the current basic block.

The basic block data structure of this algorithm includes both the basic block code and a list of flow links from each basic block to the set of blocks that immediately follow its execution. The set of basic blocks so far discovered from the execution trace is kept as a linked list of basic blocks. Once the end of the current basic block is found, this list is searched to determine whether the current basic block is a new basic block or another execution of an already discovered basic block. If a matching basic block is found, then the weight of the flow link from the previously executed basic block (pointed to by the variable last_BB) to the matching basic block is incremented to indicate another transfer of control between those blocks. If there was no previous flow link from last_BB to the matching basic block (matching_BB), then a new link is added with a weight of one. If no matching basic block is found in the basic blocks list, then the current basic block is a new basic block of the workload; it is added to the basic blocks list, and a weight-one control flow link is added pointing from last_BB to this new block (curr_BB).

By continuing through the entire execution trace, the complete graph of executed basic blocks is discovered and the weighted control flow links are determined. Once the full execution trace has been analyzed, the reduced trace description is stored for future use. Note that because the set of execution basic blocks is taken from the execution trace, there may be basic blocks in the static program code that are never executed and thus do not appear in the reduced trace description.
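A minimal C sketch of this extraction loop is given below. For simplicity it assumes the trace is already available in memory as an array of records, each carrying the instruction's address and a flag marking the end of a basic block, and it matches basic blocks by the address of their first instruction rather than by comparing instruction streams; both are simplifications of the actual Trace Description Extractor, and all names are illustrative.

#include <stdlib.h>

/* Simplified trace record and graph structures (a cut-down version of the
   CodeBlock/FlowLink sketch above, repeated so this fragment stands alone).   */
typedef struct { unsigned long addr; int ends_block; } TraceEntry;

struct CodeBlock;

typedef struct FlowLink {
    struct CodeBlock *succ;
    unsigned long     weight;
    struct FlowLink  *next;
} FlowLink;

typedef struct CodeBlock {
    unsigned long     start_addr;       /* blocks matched by their first address */
    FlowLink         *exit_links;
    struct CodeBlock *next;             /* basic blocks list                     */
} CodeBlock;

static CodeBlock *find_or_add_block(CodeBlock **list, unsigned long start_addr)
{
    CodeBlock *b;
    for (b = *list; b != NULL; b = b->next)     /* search for a matching block   */
        if (b->start_addr == start_addr)
            return b;
    b = calloc(1, sizeof *b);                   /* no match: new block definition */
    b->start_addr = start_addr;
    b->next = *list;
    *list = b;
    return b;
}

static void add_or_bump_link(CodeBlock *pred, CodeBlock *succ)
{
    FlowLink *l;
    if (pred == NULL)                           /* first block: no predecessor    */
        return;
    for (l = pred->exit_links; l != NULL; l = l->next)
        if (l->succ == succ) { l->weight++; return; }
    l = calloc(1, sizeof *l);                   /* new link, weight one           */
    l->succ = succ;
    l->weight = 1;
    l->next = pred->exit_links;
    pred->exit_links = l;
}

/* Build the weighted control-flow graph from a full execution trace. */
CodeBlock *reduce_trace(const TraceEntry *trace, size_t n)
{
    CodeBlock *blocks = NULL, *last = NULL;
    size_t i, block_start = 0;

    for (i = 0; i < n; i++) {
        if (trace[i].ends_block) {              /* branch: current block complete */
            CodeBlock *curr =
                find_or_add_block(&blocks, trace[block_start].addr);
            add_or_bump_link(last, curr);
            last = curr;
            block_start = i + 1;
        }
    }
    return blocks;
}

A trailing partial block (one not terminated by a branch before the trace ends) is simply dropped by this sketch; handling it requires only one more block lookup after the loop.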

3.2.3 Reduced Trace Description Analysis

Once the trace description has been derived, it is used with a set of processor organization parameters to drive a reduced trace description performance analysis. Consider a reduced trace description such as that in figure 50. The basic approach in reduced trace description analysis is to evaluate the cost (in execution cycles) of each code block and flow link in the reduced trace description graph. These costs are then weighted by the appropriate link weights and accumulated to produce the complete workload performance estimate.

This approach can be formalized as follows. Let Bi represent block i, Li,j the link from block i to block j, and wi,j the number of traversals of link Li,j; no link is created for a block pair that never appears in the execution trace, indicating that control never passed between those blocks. If C(Bi) is the cost for executing block i and C(Li,j) the cost for traversing the link from block i to block j, then the total execution cost is:

    Ctotal = Σi Σj wi,j [ C(Bi) + C(Li,j) ]                    (1)

which is the weighted sum of the costs for traversing each block's outward links, where the cost of all executions of block Bi (i.e. the C(Bi) cost) is apportioned among its outbound arcs, and links that do not actually appear in the graph are assumed to have weight zero. The reduced trace description itself contains the values for the physical quantities in (1): the number of blocks, the number of links out of each block, and the number of traversals of each link (i.e. the weight wi,j). Thus, to apply (1), the simulation analysis must determine only C(Bi), the cost for executing each individual block, and C(Li,j), the cost to traverse each link.

Because each block contains the instructions in the block, this instruction stream can be sent as a short execution trace to an evaluation subroutine containing a trace-driven performance simulator. This simulator determines the number of elapsed cycles between the starting time of the earliest starting instruction and that of the latest starting instruction, returning the inclusive difference as the cost for the given block, C(Bi). Since the stream of instructions in a given block never changes, all evaluations of a single block cost return the same value; the block cost can therefore be recorded after the first evaluation, and this recorded value can be used in all subsequent block evaluations during the analysis. This process does not use any contextual information about the processor state when the code block begins execution, i.e. the RTA simulation assumes that the code block begins execution on an idle processor with no work in process and no unavailable resources, which we refer to as a cold-start processor.
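Expressed as code, the accumulation of (1) together with the recording of block costs after their first evaluation might look like the following C sketch. The two callback functions stand in for the trace-driven evaluation subroutine and for the link-cost calculation described below; they, and the array layout, are assumptions of this sketch rather than details of the RTA implementation.

#define MAX_BLOCKS 4096                    /* illustrative bound on graph size      */

/* One exit link of a block: successor block id and traversal count wi,j.           */
typedef struct { int succ; unsigned long weight; } ExitLink;

/* Weighted sum of (1) over every block and each of its exit links.                 */
double total_cost(int n_blocks, ExitLink **links, int *n_links,
                  double (*block_cost_fn)(int),        /* returns C(Bi), cold start */
                  double (*link_cost_fn)(int, int))    /* returns C(Li,j)           */
{
    double cached_cost[MAX_BLOCKS];
    double total = 0.0;
    int i, k;

    for (i = 0; i < n_blocks; i++)
        cached_cost[i] = block_cost_fn(i);   /* record C(Bi) after a single evaluation */

    for (i = 0; i < n_blocks; i++)
        for (k = 0; k < n_links[i]; k++)     /* accumulate wi,j [ C(Bi) + C(Li,j) ]    */
            total += (double)links[i][k].weight *
                     (cached_cost[i] + link_cost_fn(i, links[i][k].succ));

    return total;
}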

The determination of the link costs is more involved: where the single block cost C(Bi) represents the number of cycles required to execute the code in a single block, the C(Li,j) link cost represents the number of execution cycles required to traverse the link between blocks Bi and Bj. This flow link cost can have a positive or negative value, according to the degree of conflict or overlap, respectively, between the pair of blocks. In order to determine the flow link cost, reduced trace analysis employs what we have termed the block pairs sufficiency assumption, which assumes that the flow link cost between any two connected blocks is completely determined by the two blocks of the pair. Thus, when this assumption holds, the analysis of the reduced trace description need only consider pairs of blocks, rather than block triples, quadruples, or even longer sequences.

Under the block pairs sufficiency assumption, a simple procedure can be formulated for calculating the link costs. Consider two connected blocks from a reduced trace, where block Bi contains a link of weight wi,j to block Bj, and assume that the cost for executing each block, i.e. the values of C(Bi) and C(Bj), are known. Since only the interaction of these two blocks determines the interface between them, the cost of this interface is the link cost, and it can be calculated by comparing the individual block costs to the cost for executing the combined block pair, C(Bi,Bj). Thus, the link cost C(Li,j) is:

    C(Li,j) = C(Bi,Bj) - [ C(Bi) + C(Bj) ]                    (2)
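Given a routine that returns the cold-start execution cost of an arbitrary instruction stream, (2) reduces to three such evaluations, one of them on the concatenation of the pair's instruction streams. The C sketch below assumes such a cost routine is supplied by the caller; the stream representation is illustrative.

#include <stdlib.h>
#include <string.h>

/* An instruction stream: one block's code, or the concatenation of a pair's.  */
typedef struct { const unsigned int *ops; size_t len; } Stream;

/* Cold-start execution cost of a stream, as returned by the trace-driven
   evaluation subroutine (assumed to be supplied by the caller).               */
typedef double (*cost_fn)(Stream s);

/* C(Li,j) = C(Bi,Bj) - [ C(Bi) + C(Bj) ]                                 (2)  */
double flow_link_cost(Stream bi, Stream bj, cost_fn cost)
{
    unsigned int *joined = malloc((bi.len + bj.len) * sizeof *joined);
    Stream pair;
    double c_pair;

    memcpy(joined,          bi.ops, bi.len * sizeof *joined);
    memcpy(joined + bi.len, bj.ops, bj.len * sizeof *joined);
    pair.ops = joined;
    pair.len = bi.len + bj.len;

    c_pair = cost(pair);                     /* C(Bi,Bj): the pair, cold start  */
    free(joined);

    return c_pair - (cost(bi) + cost(bj));   /* negative when the blocks overlap */
}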

When considering a pair of code blocks executed on a cold-start processor (one that does not include any hardware elements that make use of global execution sequencing information), we can see the different kinds of conflict or overlap that different processor function units, other resources and configurations provide. For scalar, non-pipelined machines the flow link cost will always be zero cycles, since the executions of the blocks can neither overlap nor interfere. As the organization is changed to exploit more instruction parallelism, however, the interface between blocks becomes more complicated. For an in-order scalar single-pipeline machine, the interface between the two blocks cannot have a negative link cost, but it can involve instruction dependence and other resource conflict stalls. For example, if block Bj follows block Bi and an instruction in Bi generates a value which an instruction of Bj uses, then the instruction in Bj may have to wait for the data to be generated. Similarly, the execution of block Bi may tie up some resources that instructions in block Bj need to execute, and thus delay the starting time for those instructions.

A>  FADD  F1, F2 -> F3          F>  FSUB  F5, F7 -> F6
B>  ADD   R1, R2 -> R3          G>  FMUL  F7, F4 -> F8
C>  ADD   R3, R4 -> R5          H>  FSUB  F2, F31 -> F2
D>  MUL   R3, R6 -> R7          I>  FMAD  F3, F6, F8 -> F9
E>  SUB   R7, R5 -> R9          J>  ADD   R3, R9 -> R14

      (a) Block B1                    (b) Block B2

Figure 52. Example basic blocks (block B1 precedes block B2).

For a superscalar multi-pipeline machine, such stalls can still occur, but a further consideration arises: the block pair may allow some instructions of the successor block to begin execution before some instructions of the predecessor block, and this may result in a negative link cost. For example, consider the two code blocks shown in figure 52, where block B1 is executed immediately before block B2, and RTA is analyzing the flow link cost for the code running on a dual-issue processor with two buffered function units: a floating-point unit and an integer unit. The code of block B1 consists of a single floating-point instruction followed by four dependent integer instructions, so its execution will tie up the integer unit for some time, leaving the floating-point unit idle for much of the block's execution. Similarly, block B2 consists of four floating-point instructions followed by a single integer instruction, so it will heavily utilize the floating-point unit and leave the integer unit idle for much of its execution.

Figure 53 shows the execution of block B1 on a dual-issue machine where the integer unit execution pipeline length is 4 cycles and the floating-point unit execution pipeline length is 5 cycles. The first row of figure 53 indicates the processor resources: the issue unit, the integer unit, and the floating-point unit. The second row of the headers indicates the correspondence of the columns to the issue buffer slots (I) and the input buffer slots (B), a decode stage (D), execution stages (E), and a writeback stage (W) for the integer and floating-point units. Each instruction of figure 52 has been assigned a unique alphabetic identifier, which is used to indicate the instruction in the trace output (shown to the left of the instruction mnemonic in figure 52).

The execution cost for a block is defined in RTA analysis to be the difference between the time at which the first instruction of the block starts execution (i.e. the last cycle that the instruction spends in decode) and the time at which the last instruction of the block starts execution.

[Figure 53. Trace-driven simulation results for block B1 of figure 52: cycle-by-cycle (cycles 0-12) occupancy of the issue slots and of the buffer (B), decode (D), execute (E) and writeback (W) stages of the integer and floating-point units.]

Looking at figure 53, note that the execution cost of block B1 is C(B1) = (8-1) + 1 = 8 cycles, since instruction A starts in cycle 1 and instruction E starts in cycle 8. Figure 54 shows the execution of the instructions for the single block B2, where the cost for executing block B2 is seen to be C(B2) = (6-1) + 1 = 6 cycles.

Figure 55 shows the execution of the block pair, i.e. the execution of the combined instruction stream for block B1 followed by block B2. Looking at figure 55, note that the executions of blocks B1 and B2 overlap significantly, i.e. the instructions of block B2 begin executing before some of the instructions from block B1 have begun execution. In effect, the starting times of instructions F, G, H and I from block B2 have moved up into the execution time of block B1.

[Figure 54. Trace-driven simulation results for block B2 of figure 52: cycle-by-cycle (cycles 0-11) occupancy of the issue slots and of the buffer, decode, execute and writeback stages of the integer and floating-point units.]

[Figure 55. Trace-driven simulation results for the block pair B1, B2 of figure 52: cycle-by-cycle (cycles 0-15) occupancy of the issue slots and of the buffer, decode, execute and writeback stages of the integer and floating-point units for the combined instruction stream.]

In this case, the flow link cost will be negative, indicating that the blocks overlap. The combined block pair cost C(B1,B2) can be determined from figure 55: the earliest starting instruction for the block pair starts in cycle 1 and the latest starting instruction starts in cycle 11. This results in a combined block pair execution time of C(B1,B2) = (11-1) + 1 = 11 cycles. Applying the values for C(B1), C(B2) and C(B1,B2) to formula (2), the link cost is calculated to be -3 cycles, which indicates that the executions of blocks B1 and B2 overlap by three cycles. Thus the contribution of this block pair execution to the total program execution cost becomes:

    w1,2 [ C(B1) + C(L1,2) ] = w1,2 [ 8 + (-3) ] = 5 w1,2                    (3)

and the total number of execution cycles for the workload is increased by five cycles per execution of the block pair (even though the execution of block B1 alone requires eight cycles).
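The arithmetic of this example can be checked directly from the costs read off figures 53 through 55; the following few lines of C do so.

#include <stdio.h>

int main(void)
{
    /* Costs taken from the trace-driven simulations of figures 53-55. */
    int c_b1   = 8;                              /* C(B1)                    */
    int c_b2   = 6;                              /* C(B2)                    */
    int c_pair = 11;                             /* C(B1,B2)                 */

    int link = c_pair - (c_b1 + c_b2);           /* (2): link cost, here -3  */
    int per_traversal = c_b1 + link;             /* (3): cycles added per traversal of L1,2 */

    printf("C(L1,2) = %d cycles\n", link);
    printf("cycles added per traversal of the pair = %d\n", per_traversal);
    return 0;
}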

We now consider whether this analysis procedure can be logically justified. It is sufficient to consider a reduced trace description which consists of just three blocks, B1, B2 and B3, executed in sequence (i.e. the workload starts execution with block B1, which connects to block B2 with weight one; block B2 connects to block B3 with weight one; and block B3 is the last block executed in the workload and thus has no outward links). The single block costs C(B1), C(B2) and C(B3) and the block pair costs C(B1,B2) and C(B2,B3) are determined as above, i.e. they are each evaluated on a cold-start processor using the instruction streams from the blocks.

From (2) the link costs are determined to be:

    C(L1,2) = C(B1,B2) - [ C(B1) + C(B2) ]
    C(L2,3) = C(B2,B3) - [ C(B2) + C(B3) ]

Thus, the total cycles contributed by block B1 and its interface to block B2 would be:

    C(B1) + C(L1,2) = C(B1) + C(B1,B2) - [ C(B1) + C(B2) ],

the cost for block B2 and its interface to block B3 is:

    C(B2) + C(L2,3) = C(B2) + C(B2,B3) - [ C(B2) + C(B3) ],

and the cost for block B3, which has no outward links, is simply C(B3). Adding these three contributions results in the expression:

    C(B1) + C(B1,B2) - [ C(B1) + C(B2) ] + C(B2) + C(B2,B3) - [ C(B2) + C(B3) ] + C(B3)

which reduces to:

    C(B1,B2) + C(B2,B3) - C(B2)                    (4)

Interpreting (4): the formula calculates the cost of the sequence B1 to B2 to B3 as the cost of a cold start of block B1 followed by the execution of block B2 on a processor warmed by the execution of block B1, plus the cost of a cold start of block B2 followed by the execution of block B3 on a processor warmed by an execution of block B2, minus the cost of a cold start of block B2. Because the block pairs sufficiency assumption states that the flow link cost between blocks B2 and B3 is completely determined by blocks B2 and B3 (i.e. that the execution of the predecessor block B2 is sufficient to warm the processor state for the successor block B3), equation (4) effectively reduces to a cold start of block B1 followed by a warmed block B2 and then a warmed block B3, which is exactly the sequence of executions that a full trace simulation would include. The reduced trace analysis of more complicated graphs can be justified in exactly the same manner. Thus, when the block pairs sufficiency assumption holds, the RTA analysis is accurate.
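Since the reduction to (4) is purely algebraic, it can also be checked mechanically for any set of costs; the values in the short C program below are hypothetical and serve only to exercise the identity.

#include <assert.h>
#include <stdio.h>

int main(void)
{
    /* Hypothetical single-block and block-pair costs, in cycles. */
    double c1 = 8.0, c2 = 6.0, c3 = 9.0;        /* C(B1), C(B2), C(B3)       */
    double c12 = 11.0, c23 = 12.0;              /* C(B1,B2), C(B2,B3)        */

    double l12 = c12 - (c1 + c2);               /* link costs from (2)       */
    double l23 = c23 - (c2 + c3);

    /* Per-block-plus-link accumulation for the sequence B1, B2, B3 ...      */
    double total = (c1 + l12) + (c2 + l23) + c3;

    /* ... and the telescoped form (4).                                      */
    double telescoped = c12 + c23 - c2;

    assert(total == telescoped);
    printf("total = %.0f cycles, telescoped = %.0f cycles\n", total, telescoped);
    return 0;
}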

