EXPANSION CACHES FOR SUPERSCALAR MACHINES
A Dissertation Submitted to the Department of Electrical Engineering and the Committee on Graduate Studies of Stanford University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
By John D. Johnson March 1996
© Copyright 1996 by John D. Johnson. All Rights Reserved.
I certify that I have read this dissertation and that in my opinion it is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy. Michael J. Flynn (Principal Adviser)
I certify that I have read this dissertation and that in my opinion it is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy. Kunle Olukotun I certify that I have read this dissertation and that in my opinion it is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy. Bruce A. Wooley (Electrical Engineering)
Approved for the University Committee on Graduate Studies:
Abstract Superscalar implementations improve performance by executing multiple instructions in parallel. This increases the demands on instruction caches as well as on instruction decoding and issuing mechanisms, leading to very complex hardware. This work examines expanded instruction cache organizations that reduce the complexity or improve the performance of superscalar machines. Traditional instruction caches improve performance simply by reducing average memory latency. They function by providing rapid access to exact copies of some of the instructions found in main memory. Additional opportunities are presented by an instruction cache when the information stored in the cache is not an exact copy of the main memory data. Ordinarily the cached representation is larger than the main memory representation, therefore the term expansion caches is used to describe this general caching technique. Expansion caches with in-order issue superscalar machines are simple to implement and are shown to provide performance superior to that of much more complicated out-of-order issue superscalar machines with traditional caches. Alternatively, an expansion cache can be used to replace the traditional cache in an out-of-order issue superscalar machine and improve performance by an average of 43% because of increased effective instruction cache bandwidth.
Acknowledgments Many people deserve thanks for helping me complete this dissertation. Completing a Ph.D. requires support on many different levels. I would like to acknowledge all the people who helped me make this dissertation possible. Unfortunately, I'm restricted to naming only a few very important individuals in this section. On an academic level I would like to thank my principal adviser, Dr. Michael Flynn, for his insights in guiding me toward this contribution and his patience and understanding with my occasionally slow progress toward completing this research. I would also like to thank my second and third readers, Dr. Kunle Olukotun and Dr. Bruce Wooley, for suggestions for improving the presentation of this work. For me, the ideas for this dissertation were actually the easy part of earning a Ph.D. Presenting the ideas in a consistent and understandable format requires much more effort and much more help from other people. This help came as feedback from my Stanford office mates. Over the years I have had many very helpful conversations with all members of the Stanford Architecture and Arithmetic Group. I would especially like to thank Brian Flachs for his insight and helpful comments in guiding me towards my result. Financial support is another area deserving acknowledgment. I have received support from the Army Research Office, Durham, under contract DAAG-29-82-K0109, from NASA under contracts NAG2-248 and NAG2-842, and educational assistance support from Hewlett-Packard Company. This work was conducted on facilities supplied under NASA grant NAGW 419 and an equipment donation from Hewlett-Packard Company. I must also acknowledge the financial support provided by my wife over the many years it took to complete this program.
One's parents obviously play a very important role in one's life so I would like to thank them for imparting in me the drive and ambitions to pursue this degree. I am especially proud that my parents were able to instill in all their children a love of learning that allows them to excel in their chosen fields. Parents like mine are few and far between. Finally, and most importantly, I would like to thank my wife, Irene Bunner, and children, Paul and Lynn. They provided the emotional support and loving environment allowing me to continue to pursue this research over the extended period of time I needed to reach this final result.
Contents

Abstract
Acknowledgments
1 Introduction
  1.1 Limitations to Instruction Parallelism
    1.1.1 True Data Dependencies
    1.1.2 Control Dependencies
    1.1.3 Resource Conflicts
    1.1.4 Data Output Dependencies and Anti-dependencies
  1.2 Instruction Issue Policies
    1.2.1 In-Order Issue with In-Order Completion
    1.2.2 In-Order Issue with Out-of-Order Completion
    1.2.3 Out-of-Order Issue with Out-of-Order Completion
  1.3 Expanded Instruction Caches
  1.4 Previous Work
    1.4.1 Multiple Instruction Issue
    1.4.2 Very-Long-Instruction-Word Processors
    1.4.3 Superscalar Processors
    1.4.4 High Performance Instruction Fetching
    1.4.5 Decoded Instruction Caches
  1.5 Summary
2 Expanded Cache Models
  2.1 Introduction
  2.2 Epic Superscalar Model
    2.2.1 Epic Expansion Unit
    2.2.2 Epic Branch Prediction
    2.2.3 Epic Execution Units
  2.3 Reorder Buffer Model
    2.3.1 Out-of-Order Issue in the Reorder Buffer Model
    2.3.2 Register Renaming
    2.3.3 Reorder Buffer Model Branch Prediction
    2.3.4 Reorder Buffer Model Pipeline and Execution Units
    2.3.5 Reorder Buffer Model Decode Process
  2.4 Expanded Cache Reorder Buffer Model
    2.4.1 Expro Model Instruction Issue
    2.4.2 Expro Model Branch Prediction
  2.5 Model summary
  2.6 Methodology, Tools and Benchmarks
    2.6.1 Trace Driven Methodology
    2.6.2 Benchmarks
  2.7 Simple 8-Wide Superscalar Machine Reference Model
  2.8 Reorder Buffer Model Performance
  2.9 Summary
3 Scheduling
  3.1 Code Scheduling and Interlocks
  3.2 List Scheduling Algorithm
    3.2.1 List scheduling
    3.2.2 Scheduling Loads and Align Hints
    3.2.3 Scheduling Example
    3.2.4 Register Allocation
  3.3 Performance Effects of Scheduling
  3.4 Summary
4 Cache Alignment
  4.1 Aligning to Decoder Slots
  4.2 In-Order Issue Aligned Expanded Cache Performance
  4.3 Duplication Ratio and Line Utilization
  4.4 Out-of-Order Aligned Expanded Cache Performance
  4.5 Instruction Duplication Issues
  4.6 Summary
5 Routing and Multiple Cycle Packing
  5.1 Cycle Tagged Instructions
    5.1.1 Cost of Cycle Tagging
    5.1.2 Terminating Instruction Packing
  5.2 Performance of Cycle Tagging
  5.3 Expro Machine Routing and Multiple Cycle Packing
  5.4 Summary
6 Branch Prediction
  6.1 Epic Branch Prediction Performance
  6.2 Cost of a Miss Predicted Branch
  6.3 Summary
7 Instruction Run Merging
  7.1 Branching and Instruction Run Merging
  7.2 Instruction Merging in the Epic Machine
  7.3 Instruction Run Merging In Expro
    7.3.1 Expro Machine Performance
    7.3.2 Cost of Removing Instruction Run Merging
  7.4 Summary
8 Machine Configuration Variations
  8.1 Variations in Epic's Configuration
    8.1.1 Varying the number of Load/Store Units
    8.1.2 Varying the Number of Functional Units
    8.1.3 Varying the Width of Epic
    8.1.4 Varying the Width of the Decoder and Cycle Packing
  8.2 Variations in Expro's Configuration
    8.2.1 Varying the Size of the Reorder Buffer
    8.2.2 Varying the Out-of-Order Decode Width
    8.2.3 Varying Decode Cycles on Out-of-Order Model
  8.3 Summary
9 Summary and Conclusions
  9.1 Expansion Cache Summary
    9.1.1 Cost of an In-Order Issue Expansion Cache
    9.1.2 Cost of an Out-of-Order Issue Expansion Cache
  9.2 Future Directions
  9.3 Conclusions
Bibliography

List of Tables

1.1 Studies Reporting Speedup Using Multiple Instruction Issue
2.1 Model Summary
2.2 Description of the Benchmarks
2.3 Benchmark Sizes and Instruction Counts
2.4 Percent of Cycles when Decoding Emptied the Instruction Buffer
4.1 Average Duplication Ratios for Aligned Expansion Cache
4.2 Average Line Utilization for Aligned Expansion Cache
5.1 Line Utilization for Cycle Tagged Configurations
5.2 Duplication Ratio for Cycle Tagged Configurations
5.3 Line Utilization for Cycle Tagged Configurations
5.4 Duplication Ratio for Cycle Tagged Configurations
6.1 Average Branch Prediction Rate
6.2 Branch Cost Function Parameters for the Epic Machine
6.3 Branch Cost Function Parameters for the Reorder Buffer Machine
6.4 Branch Cost Function Parameters for the Expro Machine
7.1 16 Entry Reorder Buffer Model Instruction Issue Limiters
7.2 16 Entry Expro Model Instruction Issue Limiters
7.3 32 Entry Reorder Buffer Model Instruction Issue Limits
7.4 32 Entry Expro Model Instruction Issue Limits
7.5 Non-merging Expro Model Instruction Issue Limiters
8.1 In-order Issue Execution Unit utilization During Execution Cycles
8.2 Performance of Individual Benchmarks for 2 Epic Configurations
8.3 Expro Performance Decoding 8 and 4 Instructions per Cycle
9.1 Overhead Bits for In-order Issue Expansion Cache Features

List of Figures

1.1 Expansion Cache Block Diagram
2.1 Block Diagram of Epic Model
2.2 Dynamic Superscalar Decode Complexity Grows as O(n²)
2.3 Epic Decode Complexity Grows as O(n)
2.4 Expanded Instruction Cache Entry
2.5 Block Diagram of Epic Expansion Unit
2.6 Expansion Unit Pipeline
2.7 Tag and Branch Fields in an Expanded Instruction Cache Entry
2.8 Execution Unit Pipeline
2.9 Load Pipeline
2.10 Block Diagram of Reorder Buffer Model
2.11 Reorder Buffer Model Branch Prediction Fields
2.12 Reorder Buffer Model Pipeline
2.13 Reorder Buffer Model Decode Timing
2.14 Epic Model Decode Timing
2.15 Block Diagram of Expanded Cache Reorder Buffer Model
2.16 Successor Index Example Code
2.17 Indirect Jump Example Code
2.18 Experimental Setup
2.19 Simple 8-Wide In-Order Issue Superscalar Machine
2.20 Instructions Per Cycle for Single-Issue and Simple 8-Wide In-order-Issue Machines
2.21 Instructions Per Cycle for Reorder Buffer Model Superscalar Machine
3.1 Epic Instruction Scheduling Example
3.2 FFT Inner Loop Scheduling Example
3.3 Epic Performance With and Without Scheduling
3.4 Reorder Buffer Machine Performance With and Without Scheduling
4.1 Decoder Slot Alignment Inefficiency
4.2 Alignment in an Expanded Instruction Cache
4.3 PC Bit Allocation for Direct Mapped Cache
4.4 PC Bit Allocation for an Expanded Mapped Cache
4.5 Aligned Expanded Instruction Cache Performance
4.6 Instruction Duplication Example
4.7 Line Utilization Example
4.8 Out-of-Order Non-speculative Aligned Expanded Instruction Cache Model and Reorder Buffer Model Performance
5.1 Routing Network for Cycle Tagged Expanded Instructions
5.2 Excessive Duplication When Filling Every Slot
5.3 Performance for Cycle Tagged Configurations
5.4 Performance for Cycle Tagged Configurations
5.5 Expro machine with Terminate on Hit and Full
6.1 Epic Successor Address Branch Prediction
6.2 Performance of Adding Branch Prediction to Epic
6.3 Branch Cost Function for the Epic Machine
6.4 Branch Cost Function for the Reorder Buffer Machine
6.5 Branch Cost Function for the Expro Machine
7.1 Sequence of Two Instruction Runs
7.2 Instruction Run Merging Example
7.3 Performance of Adding Speculative Merging to Epic
7.4 Performance of the 16 Entry Expro Model and Other Models
7.5 Performance of 16 and 32 Entry Expro Model and Reorder Buffer Model
7.6 Expro Machine With and Without Instruction Run Merging
8.1 Varying the Number of Load/Store Units
8.2 Varying the Number of Functional Units
8.3 Varying the Width of the Decoder
8.4 Varying the Width of the Expanded Cache
8.5 Varying the Size of the Reorder Buffer
8.6 Varying the Size of Expro's Reorder Buffer
8.7 Varying the Out-of-Order Decode Width
8.8 Varying Decode Cycles for the Reorder Buffer Machine
9.1 Performance of Expansion Cache Features
9.2 Overhead Bits for Out-of-Order Issue Expansion Cache Features
Chapter 1 Introduction Computers need to be faster. Existing applications are growing, demanding increasingly more computing power, and new applications are arising as more computing power becomes available. Meeting these demands for faster machines requires new and innovative ideas about machine architecture and machine organization. Increasing computer performance with new and better architectures and machine organizations is important because the traditional path to faster machines, using faster circuit technology, is becoming more difficult. Fortunately, experts expect the number of gates available for building a computer to continue to increase as VLSI technology continues to improve. This directs high speed machine design towards doing more operations concurrently, instead of doing each operation faster. However, just building a machine capable of executing instructions in parallel is insufficient. The complexity and communication requirements of concurrent instruction execution often necessitate a longer cycle time for the machine. The amount of instruction level parallelism found in many applications is small and thus it is still very important for the machine to have very short cycle times. A complex machine capable of many concurrent operations may be slower than a simpler machine if the complex machine has a longer cycle time and there is little parallelism in the application being executed. The term superscalar [AC87] describes a computer implementation capable of concurrently executing several scalar instructions. This work investigates the instruction
fetch and instruction issue mechanisms for superscalar machines. An important component of the instruction fetch and issue mechanism is the instruction cache. Traditional instruction caches improve performance simply by reducing average memory latency. They function by providing rapid access to exact copies of some of the instructions found in main memory. This work investigates additional opportunities presented by an instruction cache when the information stored in the cache is not an exact copy of the main memory data. Ordinarily the cached representation is larger than the main memory representation, therefore the term expansion caches is used to describe this general caching technique. Superscalar architecture increases the demands on the instruction cache and instruction decoding procedures, and this results in very complex hardware requirements. This work studies expansion caches to reduce the complexity and quantity of hardware required to implement decoding in superscalar machines. Because the cycle time of the simplified hardware can be shorter, the overall performance of the simple expansion cache superscalar implementation can be superior to a dynamically scheduled superscalar implementation, even though it requires a small increase in the number of cycles needed to execute a given program. Alternatively, an expansion cache can replace the traditional cache in a complex dynamic issue superscalar machine. This provides a larger instruction bandwidth to the execution units and permits exploiting more parallelism. Better performance results when the implementation technology can support large and complex machine structures.
1.1 Limitations to Instruction Parallelism Superscalar machines attempt to reduce program execution time by executing instructions in parallel. However, instructions often depend on preceding instructions for source operands and control flow information. The independence in the instruction stream is called instruction level parallelism. Exploiting this independence is the method used by superscalar processors for improving performance. Dependence constraints within an application define the amount of exploitable instruction level
parallelism available. This section presents a summary of these dependency constraints, and describes a number of techniques for reducing them.
1.1.1 True Data Dependencies When an instruction uses a value produced by a previous instruction, this instruction has a true data dependency upon the previous instruction. This is also called a Read-After-Write hazard. A superscalar processor cannot execute these two instructions at the same time. It must delay the second instruction until the first instruction is able to deliver its results to the second. The processor can implement bypass paths allowing the first instruction to deliver its result to the second before the first fully completes its register write back. Still, the second instruction cannot start processing until it receives data from the first. This dependency fundamentally limits the number of instructions available for parallel execution. Superscalar processors attempt to overcome the performance limits imposed by true data dependencies by overlapping their execution with other independent instructions. The more execution cycles an instruction requires, the less likely it is the processor will find other instructions to overlap and hide its latency. When instructions are unavailable for issue then some of the processor's resources become idle and performance decreases.
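To make the hazard concrete, the short sketch below (not taken from the dissertation; the instruction format and register names are invented) flags the read-after-write dependencies that prevent two instructions from issuing together.

# Minimal sketch (illustrative only): detecting read-after-write (true data)
# dependencies between instructions in a candidate issue group.
from collections import namedtuple

Instr = namedtuple("Instr", ["op", "dest", "srcs"])

def raw_hazards(group):
    """Return (producer_index, consumer_index) pairs where a later
    instruction reads a register written by an earlier one."""
    hazards = []
    for i, producer in enumerate(group):
        for j in range(i + 1, len(group)):
            if producer.dest in group[j].srcs:
                hazards.append((i, j))
    return hazards

# Hypothetical three-instruction group: the add consumes r1 loaded by the load.
group = [Instr("load", "r1", ["r2"]),        # r1 <- mem[r2]
         Instr("add",  "r3", ["r1", "r4"]),  # r3 <- r1 + r4 (RAW on r1)
         Instr("sub",  "r5", ["r6", "r7"])]  # independent

print(raw_hazards(group))   # [(0, 1)] -> the load and the add cannot issue together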
1.1.2 Control Dependencies A number of studies on the limits of available instruction level parallelism [RF72] [NF84] have shown that when branches are perfectly predicted then very high levels of parallelism are available. Unfortunately, branches cannot be perfectly predicted and, as shown in chapter 6, they have a substantial impact on the amount of exploitable parallelism. Instructions following a branch are said to have a control dependency upon the branch. The group of instructions leading up to a branch that are always executed with the branch is called a basic block. Studies by Wall [Wal91] and Lam and Wilson [LW92] have shown instruction level parallelism within a basic block is quite small. To achieve
high performance it is necessary to mitigate the effects of control dependencies. Fortunately this is possible through the use of branch prediction and compiler support. Speculative execution is the execution of instructions before resolving some control dependencies. It occurs when using branch prediction to reduce the control dependencies in the code stream. Implementing speculative execution is difficult because mispredicted branches, interrupts, and exceptions require undoing all state changes caused by speculatively issued instructions before servicing the unexpected condition. Attempting a large amount of speculative execution requires considerable and complex buffering hardware.
1.1.3 Resource Conflicts Executing an instruction uses resources in the machine. Resources include data memory access ports, register file ports, execution units and result buses. A resource conflict arises when two instructions must use the same resource at the same time. Resource conflicts occur because a machine does not have enough hardware resources. It is possible to remove resource conflicts by adding additional hardware to the processor. Duplicating the resource involved in the conflict removes the problem. However, adding hardware may not be a cost-effective method of resolving resource conflicts. Delaying one of the instructions until the resource becomes available can be a cost-effective method of resolving these conflicts. Understanding the resource utilization patterns of the processor is an important part of the design. A highly utilized resource shows an area where additional hardware can offer improved performance. If the hardware for a resource is expensive to implement then high utilization of this resource is important when creating a cost-effective machine.
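As a rough illustration of resolving conflicts by delaying instructions, the sketch below splits an in-order instruction stream into cycles whenever a unit class is exhausted; the unit counts and class names are assumptions made for the example, not the configurations studied later.

# Illustrative sketch: resolving resource conflicts by stalling.
from collections import Counter

UNITS = {"load_store": 2, "functional": 4, "branch": 1}   # assumed units per cycle

def schedule_by_resources(instr_classes):
    """Greedily split an in-order instruction stream into cycles, delaying an
    instruction whenever its unit class is already fully used this cycle."""
    cycles, used = [[]], Counter()
    for cls in instr_classes:
        if used[cls] == UNITS[cls]:          # conflict: no free unit of this class
            cycles.append([])                # start a new cycle (stall)
            used = Counter()
        used[cls] += 1
        cycles[-1].append(cls)
    return cycles

stream = ["functional", "load_store", "load_store", "load_store", "branch"]
print(schedule_by_resources(stream))
# [['functional', 'load_store', 'load_store'], ['load_store', 'branch']]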
1.1.4 Data Output Dependencies and Anti-dependencies Two additional types of data dependencies are output dependencies and anti-dependencies. An output dependency is also called a Write-After-Write hazard and an anti-dependency is also called a Write-After-Read hazard. These data dependencies are due to the reuse
of storage resources in the machine. Sometimes they are called storage conflicts and often occur because of reuse of machine registers. A processor can remove most storage conflicts by providing additional registers. When registers are allocated dynamically by the hardware and the additional registers cannot be seen by the program then this is called register renaming [TF70]. A processor implements register renaming by allocating a new register whenever an instruction writes a new value. It also maintains a name mapping between the register address in the instruction and the new register. Any other instruction reading the original register address receives the value from the new register. Another write occurring to this register address allocates another new register. When all reads of the first allocated register are complete it can be returned to the free list. Register renaming is expensive in terms of hardware but is necessary for the highest level of performance. Another method of reducing the performance impact of storage conflicts is rearranging the program code to spread the reuse of registers. A third method is reducing the time an instruction keeps a register busy with a value to be written. Selecting the best technique for storage conflict reduction is an important aspect of designing a cost-effective superscalar machine.
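A minimal sketch of the map-table and free-list mechanism just described follows; the physical register pool size and instruction format are hypothetical, and returning registers to the free list after their last read is omitted for brevity.

# Hedged sketch of register renaming as described above: each write allocates
# a new physical register and updates a map table; readers use the mapping.
class Renamer:
    def __init__(self, num_physical=8):
        self.free = [f"p{i}" for i in range(num_physical)]
        self.map = {}                      # architectural name -> physical name

    def rename(self, dest, srcs):
        srcs = [self.map.get(r, r) for r in srcs]   # read current mappings
        phys = self.free.pop(0)                     # allocate a new register
        self.map[dest] = phys                       # later reads of dest see phys
        return phys, srcs

r = Renamer()
# Two writes to r1 (a write-after-write hazard) get distinct physical registers,
# so the second write no longer has to wait behind readers of the first.
print(r.rename("r1", ["r2", "r3"]))   # ('p0', ['r2', 'r3'])
print(r.rename("r4", ["r1"]))         # ('p1', ['p0'])  -- reads the renamed r1
print(r.rename("r1", ["r5"]))         # ('p2', ['r5'])  -- WAW hazard removed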
1.2 Instruction Issue Policies The term instruction issue refers to the process of initiating execution of an instruction on an execution unit. The term instruction issue policy refers to the protocol used to issue instructions. The instruction issue policy determines the processor's ability to examine instructions ahead of the current point of execution. The next three subsections describe issue policies and discuss their performance benefits and implementation costs.
1.2.1 In-Order Issue with In-Order Completion The simplest instruction issue policy is issuing instructions in their original program order and allowing them to complete in the same order. This is called in-order issue
and in-order completion. This issuing protocol matches the semantics expected by the assembly language. Each instruction is viewed as completing before the next instruction begins. In-order issue and in-order completion organizations simplify the hardware necessary to implement a machine. When an exception or interrupt occurs, the machine must be able to present a consistent state matching the in-order issue and completion semantics to the interrupt handler. Because an in-order issue machine's state matches the required semantics, it is unnecessary to buffer machine state in case an unexpected condition occurs. Implementation is simplified because the machine's actual state matches the required architectural state when an exception or interrupt occurs. In-order issue does not preclude the parallel issuing of multiple instructions. If two instructions do not have any dependencies between them then they can be issued in parallel. If execution of these two instructions requires the same number of cycles then these two instructions also have in-order completion.
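The sketch below illustrates this pairing rule for a hypothetical two-wide in-order machine: adjacent instructions issue together only when the second neither reads nor rewrites the first one's destination. The encoding is invented for the example.

# Toy sketch of in-order dual issue (not a machine from this dissertation).
def can_pair(first, second):
    dest1, _ = first
    dest2, srcs2 = second
    return dest1 not in srcs2 and dest1 != dest2   # no RAW or WAW between the pair

program = [("r1", ["r2"]),        # r1 <- f(r2)
           ("r3", ["r1"]),        # r3 <- f(r1): depends on the previous instruction
           ("r4", ["r5"]),        # independent pair candidate
           ("r6", ["r7"])]        # independent pair candidate

i, cycles = 0, 0
while i < len(program):
    if i + 1 < len(program) and can_pair(program[i], program[i + 1]):
        i += 2                    # both issue this cycle, in original order
    else:
        i += 1                    # issue only the first; the next one waits
    cycles += 1
print("cycles:", cycles)          # 3 cycles for 4 instructions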
1.2.2 In-Order Issue with Out-of-Order Completion With out-of-order completion, long latency operations such as loads and divides do not prevent the issuing of additional instructions while a long latency instruction propagates through the pipeline. An out-of-order completion machine with in-order issue is more complex than an in-order completion machine. This technique is often used by pipelined scalar machines to improve their performance. When an instruction depends upon an incomplete instruction in the pipeline, it must be stalled until the instruction it depends on is able to deliver its results. The in-order issue policy also stalls all following instructions even if they are ready to execute. The following instructions are stalled not because they have a dependency on the instruction in the pipeline; their stalling is caused by the machine's inability to maintain enough state to expunge their results if an exception occurs and the in-order completion state must be presented. Out-of-order completion produces better performance than in-order completion but it adds to the complexity of the machine. The dependency checking logic is more elaborate as it must check each issued instruction against all instructions in all
pipeline stages. An in-order completion machine requires dependency checking only during the issue stage. Out-of-order completion also requires additional result buses and register write ports, as well as an arbitration method if there are insufficient ports to allow the peak completion rate. Out-of-order completion complicates exception handling. Address checking for branches and load/store instructions must present in-order completion behavior. When a long latency instruction causes an exception after an instruction following it has already completed, the machine must be able to undo any state changes caused by the second instruction. Providing this form of support for exceptions is called precise exceptions. Precise exceptions allow a program to be restarted at any instruction address. Implementing them increases the amount of hardware necessary to build the machine.
1.2.3 Out-of-Order Issue with Out-of-Order Completion With an out-of-order issue policy, the processor decodes instructions without regard to dependencies with other instructions currently being processed by the machine. The decoder is buffered from the execution units so it can continue decoding instructions regardless of whether they can be immediately executed. This buffer is called an instruction window. The execution hardware selects an instruction from the instruction window as soon as all of its dependencies are resolved and execution resources are available. Out-of-order issue naturally results in out-of-order completion. Some method of recording the in-order decoding sequence must also be implemented to recover from exceptions and mispredicted branches. An out-of-order processor decodes instructions and places them into the instruction window as long as there is room in the buffer and it receives instructions from the cache. The issue portion of the processor analyzes the dependencies between the instructions. It also examines all instructions in the window and selects the ones ready for execution. An instruction from the window is issued as soon as all of its operands are available and there is an execution unit available. Little regard is paid to the original program order once the dependencies have been analyzed. Ordering is forced by the program dependencies and not by the order of instructions in memory.
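The toy model below sketches window-based issue under stated assumptions (invented window contents, single-cycle results, no structural limits): each cycle, any buffered instruction whose operands are ready issues, regardless of program order.

# Toy sketch of out-of-order issue from an instruction window.
window = [                                    # (text, dest, srcs)
    ("load  r1, [r2]",   "r1", ["r2"]),
    ("add   r3, r1, r4", "r3", ["r1", "r4"]),  # waits on the load
    ("sub   r5, r6, r7", "r5", ["r6", "r7"]),  # independent: issues early
]
ready_regs = {"r2", "r4", "r6", "r7"}

cycle = 0
while window:
    cycle += 1
    issued = [ins for ins in window if all(s in ready_regs for s in ins[2])]
    window = [ins for ins in window if ins not in issued]
    for text, dest, _ in issued:
        ready_regs.add(dest)                  # result becomes available afterwards
    print(f"cycle {cycle}: issued {[text for text, _, _ in issued] or 'nothing'}")
# cycle 1 issues the load and the sub together; cycle 2 issues the dependent add.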
Out-of-order machines are much more complicated than in-order machines. Finding ready instructions in the instruction window requires a hardware search over all of them. Posting results typically requires an associative lookup over the buffered instructions to determine if any are waiting for the result. All of these operations are usually in the time-critical path of the machine and must be implemented with minimum time delay. An out-of-order machine still has the requirement of presenting an in-order state when an exception or interrupt occurs. This adds greatly to the hardware complexity as the machine must buffer both instructions whose issue has been delayed and instructions whose completion is out-of-order. All machine state updates performed by out-of-order issue and out-of-order completion instructions must be undone before servicing the interrupt or exception. This demands complex hardware state buffering. The advantage of out-of-order issue machines is they are able to exploit all available instruction level parallelism within their instruction window. This allows these machines to achieve high levels of performance, especially when the instruction window is kept nearly full by the fetch hardware. The decision as to which issue policy a machine should employ depends upon many factors. Usually the goal is to achieve the best possible performance for a given die area. If a complex machine does not fit onto the die area then a simpler machine is usually a better choice than implementing a complex machine on multiple dies. When the complex machine can fit within the die area then the tradeoff is still not straightforward. One issue is whether the instruction window can be kept nearly full, allowing high utilization of the complex issue mechanisms. Another issue is that a complex machine may have a longer cycle time than a simpler machine. This may result in better performance for the simpler machine even if it requires additional cycles to complete a program. Moreover, a simple machine permits additional area for the cache. How expanded instructions affect these tradeoffs is the focus of this work.
1.3 Expanded Instruction Caches Instruction interpretation is a sequence of steps. During each step, the host machine performs computation. The exact nature of this computation depends on the functionality required of each step, but performing it always requires time. If this computation time is not overlapped with other computations it contributes directly to the total execution time of the program. The first step in interpreting an instruction is fetching it from memory. The time required for this step depends upon the host implementation. High performance machines typically use a cache to reduce the instruction fetch time. The next instruction interpretation steps are the decoding of the instruction and the generation of the operand addresses. Finally, possibly after many cycles of preparation, the machine is ready to perform the actual functional transformation specified by the instruction. This is called execution. Execution performs state changes on the data being manipulated by the program. Interpreting an instruction usually takes much longer than the time required to perform the data transformation it specifies. This is because computation is required before the specified data transformation can take place. Computation requirements before actual data transformation are due to the machine's architecture and machine's organization. The original source code does not specify these computations. It is the machine's architecture and organization that specify the computations needed for fetching and preparing instructions for execution. The original source code only specifies computations needed for transforming the data. Thus, there are two types of computations performed during instruction interpretation. They are:
- Computations required for instruction fetch and preparation for execution.
- Computations transforming the data being manipulated by the program.
This work defines instruction preparation to be the computations required for instruction fetching and preparation for execution, and the term instruction execution to be the computations manipulating the program's data. The two types of
computations performed during instruction interpretation are relatively independent. Computations performed during instruction preparation are mainly dependent on the structure of the program being executed. Computations performing data transformations are mainly dependent upon its data. A program's code, and thus its structure, is usually unmodified during its execution. Because the code is almost static during execution, extreme variations in the results of architecturally dependent computations do not occur. It is likely that subsequent interpretation of an instruction produces the same results for the architecturally dependent computations as previous interpretations. The expanded instruction caching organizations studied here capitalize on the nature of architectural and machine dependent computations. These computations usually produce the same result for each execution of a given instruction. Combining this with the frequent re-execution of instructions during the run of a program creates the fundamental idea for expansion caches. This idea is to use memory to save the machine dependent computation results for reuse during subsequent re-executions. This is called expanded instruction caching. Assuming the number of cycles required to read saved results from the cache memory is less than the number of cycles required to recompute the results, this machine organization can execute programs in fewer cycles provided the cache hit rate is sufficiently large. The above considerations lead to a machine organization having a two-phase interpretation process. The first phase is called instruction expansion, and the second phase is called instruction execution. During instruction expansion, architecturally dependent computations are performed and the results are entered into an expanded instruction cache. During instruction execution, the data dependent computations are performed. Figure 1.1 is a simplified block diagram of such a machine organization.
[Figure 1.1: Instruction Memory → Expansion Unit → Expanded Parallel Instruction Cache → Execution Units]
Figure 1.1: Expansion Cache Block Diagram
It shows instructions are first fetched from main memory by the instruction expansion unit. The instruction expansion unit expands the instructions, performing the architecturally dependent computations. This unit then places the resulting expanded instructions into the expanded instruction cache. The execution unit reads instructions from the expanded instruction cache and, if they are still valid, executes them. If the expanded instruction is invalid then the execution unit must wait while the expansion unit re-expands the required instruction. The time required for the execution unit to execute an expanded instruction is very short because it only performs data dependent computations. Execution normally occurs in one cycle. Some complex computations, such as floating point divide, may require multiple execution cycles. However, most of the results presented are for models requiring only one cycle for all types of functional operations. For the benchmarks used in this analysis the frequency of the complex operations is low enough that the number of complex instruction execution cycles does not have a significant impact on the resulting performance.
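A hedged sketch of this two-phase idea, with invented function names and cycle costs, is shown below: preparation is performed once on a miss and its result is saved keyed by instruction address, so repeated executions pay only the execution cost.

# Sketch of the expansion/execution split described above.  EXPAND_CYCLES and
# EXECUTE_CYCLES are illustrative assumptions, not measured machine parameters.
EXPAND_CYCLES = 4       # cost of instruction preparation on a miss (assumed)
EXECUTE_CYCLES = 1      # cost of executing an already-expanded entry (assumed)

expanded_cache = {}     # pc -> expanded entry

def expand(pc):
    """Stand-in for the expansion unit: pretend to decode and pack."""
    return {"pc": pc, "packed_ops": [f"op@{pc:#x}"]}

def interpret(pc):
    """Return the cycles spent interpreting the instruction at pc."""
    if pc not in expanded_cache:                 # miss: run the expansion phase
        expanded_cache[pc] = expand(pc)
        return EXPAND_CYCLES + EXECUTE_CYCLES
    return EXECUTE_CYCLES                        # hit: execution phase only

# A loop body executed 100 times pays the expansion cost only once per instruction.
trace = [0x100, 0x104, 0x108] * 100
print(sum(interpret(pc) for pc in trace), "cycles")   # 312 rather than 1500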
1.4 Previous Work 1.4.1 Multiple Instruction Issue There is a large amount of literature devoted to the many approaches to multiple instruction issue, concurrent execution and other methods of exploiting instruction level parallelism. Table 1.1 compares some speedups reported by various studies. The reported speedups are not directly comparable as the types of benchmarks change, the reference machine varies widely, and many of the models are not realizable. In general a speedup of around 2 is achieved for integer scalar code. Achieving higher speedup is possible when branch prediction is extremely accurate or the benchmarks have a large amount of instruction level parallelism. Out-of-order issue machines have been manufactured for many years. Two of the earliest manufactured out-of-order issue machines were the CDC 6600 [Tho70] and the IBM 360/91 [Tom67]. New machines employing multiple instruction issue
are currently being announced or delivered. Examples are the AMD K5 [Sla94], the HP PA-8000 [Hun95], the IBM PowerPC 620 [LTT95], and the Intel Pentium Pro [Gwe95]. Increasingly aggressive machines continue to appear on the drawing boards (these days, they are appearing on CAD workstations).

Study                                      Speedup
Simulation studies:
  Weiss/Smith [WS84]                       1.6
  Tjaden/Flynn [TF70]                      1.8
  Johnson [Joh91]                          1.9
  Jouppi [JW89]                            2.0
Simulation studies with compiler assist:
  Wedig [Wed82]                            2.0–3.0
  Kuck [KMC72]                             8
Limits studies:
  Lam/Wilson [LW92]                        2.4, 6.8, 39.6
  Wall [Wal91]                             5
  Riseman/Foster [RF72]                    51
  Nicolau/Fisher [NF84]                    90
Table 1.1: Studies Reporting Speedup Using Multiple Instruction Issue
1.4.2 Very-Long-Instruction-Word Processors A Very-Long-Instruction-Word (VLIW) processor is a machine capable of specifying many concurrent operations in each instruction. The simplicity of an in-order multiple issue machine is the motivation behind a VLIW processor. Early work on this type of machine is reported by Fisher [Fis83]. Some early implementations of VLIW machines are by Multiflow [CNO+88] and Cydrome [RYYT89]. The early VLIW machines required compiler assistance to achieve high performance. The trace scheduling technology reported by Fisher [Fis81] predicts the most likely path of execution and fills the VLIW instruction with instructions from along this trace. Taking an unexpected branch off the predicted trace through the code executes additional code to undo the effects of executing along the predicted path. The early VLIW machines do not support binary compatibility between different
versions of the machine. If a machine becomes wider or the memory latency becomes shorter, then applications must be recompiled. Rau has reported work on reducing this limitation [Rau93]. VLIW machines are experiencing renewed interest as VLSI technology allows constructing wider and wider machines. Research into compiler support for VLIW is presented by Holm [Hol92] and Hwu et al. [HMC+93]. Intel and HP have announced plans to deploy a VLIW microprocessor [Gwe94]. VLIW machines are pertinent to in-order issue expanded instruction cache machines because an expanded instruction is very similar to a VLIW instruction. Both of these machines build wide words of independent instructions to exploit instruction level parallelism and both need compiler support to achieve high levels of performance.
1.4.3 Superscalar Processors Superscalar processors are multiple issue machines usually performing dynamic concurrency detection. The term was first used by Agerwala and Cocke [AC87], in describing the research spawning the IBM RIOS architecture and the IBM RS/6000. It was popularized by Jouppi [JW89] and by Johnson [Joh89], [Joh91]. One of the earliest attempted superscalar machine hardware implementations was the Metaflow machine [PSS+91]. Since 1988 there have been over 400 published articles purporting to be relevant to superscalar machines. All the multiple issue machines mentioned in section 1.4.1 are classified as superscalar machines by their manufacturers. Flynn [Fly95] presents a comparison of various commercial superscalar processors. Most commercial superscalar machines are relatively limited in the amount of instruction level parallelism they attempt to exploit. The instruction window varies from 2 to 16 instructions and the maximum issue rate is 2 to 4 instructions per cycle. These machines implement some form of branch prediction, ranging from static to 2-bit adaptive branch prediction. Recent superscalar machines are implementing larger instruction windows, such as the 32 entry window of the MIPS R10000 and the 56 entry window of the HP PA8000.
1.4.4 High Performance Instruction Fetching Instruction cache design and instruction fetching mechanisms have received much attention. Researchers have studied many methods for delivering instructions at high rates in the presence of frequent branches and instruction alignment constraints. Conte et al. [CMMP95] present several techniques for improving an instruction cache's ability to deliver multiple, non-sequential instructions. These instruction caches feed a highly parallel dynamic out-of-order issue execution core. Conte's hardware techniques are focused on creating multiple banked caches and using branch prediction to fetch non-sequential instructions at high rates. Merging hardware or collapsing buffers are optionally used to achieve additional instruction fetch bandwidth. Expansion caches for out-of-order machines address the same instruction fetch issues as Conte's caches. The difference is that expansion caches do not require multiple row decoders to support the multiple cache banks and do not use merging or switching hardware between the instruction cache and instruction decoders. Conte's methods also require a separate branch target buffer. However, expansion caches make less efficient use of the cache's storage area. The merging or switching hardware between the instruction cache and instruction decoders increases the mispredicted branch penalty and may be difficult to fit into the allotted cycle time. Expansion caches eliminate the need for this hardware by merging and aligning instructions as they are entered into the cache. The most effective technique depends on the exact details of the technology used to implement a machine.
1.4.5 Decoded Instruction Caches One of the earliest processors employing a different program representation between the cache and main memory is the Bell Labs CRISP microprocessor [DMB87] [BCD+87]. CRISP is a machine providing compact encoding for instructions in main memory while also providing efficient-to-execute RISC style instructions in the cache memory. CRISP delivered a 192-bit wide canonical instruction from its decoded instruction cache to its execution unit. It implements branch folding, a technique to combine
branches into the cache line, but it did not attempt any other form of multiple instruction issue. Johnson [Joh89] enhances the branch folding idea into successor address branch prediction. He also suggests pre-decode bits can be added to the instruction cache to ease the cycle time pressure during the decode cycle. This pre-decoding idea is extended in [Joh91] into a proposed design for a superscalar 386. The difficulty in implementing a superscalar 386 is its variable length instruction encoding. Adding additional bits into cache lines and pre-decoding the instructions can simplify the parallel decode of instructions as they are fetched from the cache. Johnson implemented this idea in the AMD K5 microprocessor [Sla94]. The K5 machine delivers 128 bits of instructions and 80 bits of pre-decode information from its instruction cache. Vassiliadis et al. [VBE94] at IBM propose the SCISM machine and a compound instruction cache to support the improved parallelism offered by an interlock collapsing ALU [MEV92]. The instruction compounding unit incorporates preprocessing information for parallel issue of compound instructions from the instruction cache. Franklin and Smotherman [FS94] propose a fill-unit approach to instruction issue. This approach attempts to provide code compatibility and still facilitate a simple VLIW style execution engine. It uses a shadow cache to store VLIW style instructions assembled by the fill unit. The main enhancement this work considers over these previous versions of decoded or extended instruction caches is the ability to duplicate instructions within the expanded instruction cache. The ability to duplicate instructions leads to improved instruction cache bandwidth, reduced complexity in the routing network, and the ability to support in-order issue speculative instructions.
1.5 Summary An expanded instruction cache is a technique for delivering instructions to the decoder at a higher rate and in an easier-to-decode form. It achieves these improvements
by processing instructions during cache miss servicing and reorganizing them to improve instruction issue efficiency. A cache effectively stores these results because the computations carried out during the reorganizations are mostly dependent upon the structure of the machine and not dependent upon data being manipulated by the program. A cost of an expansion cache is the increased width of cache lines. Another cost is the less efficient use of storage inside the cache. Instruction duplication and reduced line utilization cause the inefficient storage use. These costs are reduced by paying careful attention to them and implementing methods to mitigate their effects. Expanded instruction caching can be applied to both in-order issue machines and out-of-order issue superscalar machines. It offers improved cache bandwidth through the use of cache line alignment, efficient branch prediction, and support for speculative execution. When effectively applied, expansion caching improves the performance of superscalar machines. The improved instruction bandwidth allows a simple in-order issue machine to achieve performance similar to that of a much more complicated out-of-order issue superscalar machine with a traditional cache. Alternatively, an expansion cache can be used to replace the traditional cache in an out-of-order issue superscalar machine to provide an average of 43% performance improvement for the benchmarks used in this study. The following chapters cover these issues. Chapter 2 introduces the in-order issue and out-of-order issue superscalar machine models studied. It also presents both an in-order issue reference machine and an out-of-order issue reference machine as well as the benchmarks and methodology for evaluating the models' performance. Chapter 3 is concerned with the compiler scheduling requirements for these machines. Covered in this chapter is the list scheduling method used to meet the needs of cache line assembly and resource allocation. An expanded instruction cache improves the effective instruction bandwidth by providing improved alignment for the required instruction packets. Instruction bandwidth and cache line alignment issues are addressed in chapter 4. Chapter 5 presents the impacts of instruction routing in superscalar machines. An
expansion cache enables static routing from the parallel instruction register to the execution units. Static routing simplifies the required instruction issuing hardware and reduces the time required to decode and issue instructions. However, static routing uncovers a smaller amount of instruction level parallelism than dynamic routing. A major limiter of superscalar performance is branching, and branch prediction alleviates some of this performance penalty. Chapter 6 studies the effectiveness of using an expanded instruction cache to implement branch prediction. Branch prediction makes speculative execution possible. Using an expansion cache to support speculative execution is the subject of chapter 7. Chapter 8 presents the performance of some variations in machine configuration. This information informs a designer about the sensitivity to some configuration variations and aids in determining the best approach for constructing a superscalar machine. Conclusions and future work are presented in chapter 9.
Chapter 2 Expanded Cache Models 2.1 Introduction Superscalar execution requires an instruction fetch unit capable of supplying instructions at a high rate to keep the multiple execution units busy. Ideally instructions would be fetched at a rate sufficient to achieve near 100% utilization of the execution hardware. Achieving this goal is difficult, not because it is difficult to provide sufficient sequential instruction stream bandwidth, but because there are very few sections of code having long sequential runs of instructions. Presented in this chapter are three approaches for supplying instructions to execution units at high rates. The first approach is built around using in-order parallel issue of independent instructions to reduce the complexity of the design. This statically scheduled method is guided by the principle that simple hardware enables short cycle times. The second instruction issue approach is a dynamically scheduled out-of-order execution machine, modeled after the work of Johnson [Joh91]. The out-of-order method uses additional hardware such as reservation stations and a reorder buffer to find additional instruction level parallelism. The third approach combines the best features of the first two approaches. The purpose of the models is to compare the performance of these different approaches to superscalar implementation. The design objective of the instruction fetch and issue portions of these models is that they represent the best performance that can
realistically be achieved. The philosophy of the execution portions of these models is that execution should not limit the performance of the machine, even if it is somewhat expensive to implement this performance level. The first section of this chapter describes the details of using an expanded instruction cache with an in-order issue machine. This model is called the Epic superscalar model. Epic is an acronym for Expanded Parallel Instruction Cache. Later sections in this chapter describe an expanded instruction cache applied to an out-of-order machine. This chapter describes the various features and organizations associated with an expanded instruction cache, but it does not analyze the performance contribution of each of these features. The performance analysis is left to the later chapters in this dissertation.
2.2 Epic Superscalar Model This section explains Epic's framework. Epic is an in-order issue Expanded Parallel Instruction Cache machine. This superscalar machine organization uses an expansion cache as the mechanism for delivering instructions to the execution units. The expansion cache contains the scalar instructions along with additional fields specifying information needed for superscalar execution. It is a parallel organization because multiple instructions are simultaneously delivered to the execution units. Figure 2.1 is a block diagram of the Epic machine model. Instructions are fetched from the main instruction memory by the Expansion Unit and stored into the Expanded Instruction Cache. Output from the expanded instruction cache controls the multiple execution units in parallel. There are several independent and specialized execution units, partitioned into three instruction classes: branch instructions, load/store instructions and functional instructions. Each cycle in the Epic model is either an expansion cycle or an execution cycle. If a valid instruction is read from the expanded instruction cache then the next cycle is an execution cycle. Otherwise, there is a cache miss and the next and several following cycles are expansion cycles. During expansion cycles, instructions are read from main instruction memory, partially decoded and packed into the expanded parallel instruction cache. During execution cycles, data is manipulated by the execution units.
[Figure 2.1: Instruction Memory → Expansion Unit → Expanded Instruction Cache → Expanded Parallel Instruction Register → Branch Units, Load/Store Units, and Functional Units, connected by a Register Interconnect and Bypass network]
Figure 2.1: Block Diagram of Epic Model
Figure 2.2: Dynamic Superscalar Decode Complexity Grows as O(n²)
Figure 2.3: Epic Decode Complexity Grows as O(n)

As shown in Figure 2.2, a dynamic superscalar machine decoding n instructions in parallel must compare each instruction being decoded to the other n - 1 instructions it decodes during the same cycle. Each of the n decoders must process all previous instructions being issued during the cycle. This results in order n^2 complexity for the dependency checking hardware. The resulting hardware complexity can increase the machine's cycle time. For the Epic machine only one instruction is decoded per cycle. As shown in Figure 2.3, this results in only one decoder processing the instructions to be issued during a cycle. Epic superscalar dependency checking hardware has complexity of only order n. The expander can serially decode packets of instructions to be executed in parallel, reducing the hardware complexity. The expansion unit is pipelined, but each stage of the expansion pipeline handles only one instruction per cycle. An Epic expansion unit does not have the complexity normally associated with the decode unit of a superscalar processor.
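To make the complexity comparison concrete, the following sketch (hypothetical Python, illustrative only and not taken from this dissertation) counts the dependency checks implied by each discipline: a parallel decoder must compare every instruction in an issue group against all earlier instructions during the same cycle, while the Epic expander examines one new instruction per cycle against the packet assembled so far.

    # Illustrative count of dependency checks for the two decode disciplines.

    def parallel_decode_checks(n):
        """Dynamic superscalar decode: each of the n instructions in an issue
        group is compared against every earlier one in the same cycle, so the
        checking hardware needs n(n-1)/2 comparators -- O(n^2)."""
        return n * (n - 1) // 2

    def epic_decode_checks(n):
        """Epic expansion: one instruction is decoded per cycle and compared
        against the packet assembled so far, so a single comparison stage
        (against at most n-1 packed slots) suffices -- O(n)."""
        return [i for i in range(n)]   # comparisons performed in cycle i

    for width in (2, 4, 8):
        print(width, parallel_decode_checks(width), epic_decode_checks(width))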
Figure 2.4: Expanded Instruction Cache Entry (Tag, Successor Index, Branch Instructions, Load/Store Instructions, Functional Instructions)

During execution cycles an Epic machine behaves like a VLIW (Very Long Instruction Word) machine. VLIW machines are attractive because of their ability to accommodate large amounts of instruction-level parallelism with relatively simple and inexpensive control hardware [CNO+88] [BYA93]. This simple control also leads to a shorter cycle time and thus higher performance. During each execution cycle a wide expanded instruction cache entry is clocked into the Expanded Parallel Instruction register. This wide entry contains several independent instructions and controls the parallel execution units. The independence of the individual instructions was ensured by the expansion process that initially assembled the expanded instruction. The Expanded Parallel Instruction register is also a pipeline register, as it holds the current instructions while the next instructions are being fetched from the expanded instruction cache. Bits in this register directly control the operation of each of the execution units, as well as selecting the next expanded parallel instruction to fetch. The execution units are also pipelined, although this is not explicitly shown in Figure 2.1. Figure 2.4 shows a sample organization for the expanded cache entries in the Epic model. The tag field determines whether the cache entry is valid. It is the full address of the first scalar instruction processed during instruction expansion. The successor index field provides instruction sequencing and branch prediction, as explained in section 2.2.2. Each instruction type has a fixed field position within the cache entry. If not all the fields in the expanded instruction are filled, as is the usual case, the unused fields are set to no-operation codes (NOPs). The motivation for this simple machine arrangement is the desire for very short cycle times in hardware implementations. There is no need for complicated and communication-intensive resource allocation logic during the interpretation of the wide expanded parallel instruction.
The bits from the expanded instruction directly drive the execution units, thereby shortening the cycle time and requiring no additional gate delays to select a highest-priority instruction. This direct control technique is called static routing and is discussed further in chapter 5.

Figure 2.5: Block Diagram of Epic Expansion Unit (Address Latch, Address Bus, Main Instruction Memory, Instruction Bus, Pipeline Latch, Expand Logic, Expansion Register, Expanded Instruction Cache)
2.2.1 Epic Expansion Unit

Figure 2.5 presents a block diagram of the expansion unit and associated memories. Central to the expansion process is the Expand Logic block. It controls both the reading of instructions from main memory and the packing of the expanded instructions.
Figure 2.6: Expansion Unit Pipeline (Fetch, Decode, Write stages overlapped across successive instructions)

The expansion process is divided into three pipeline stages, as shown by Figure 2.6. During stage 1, the Fetch stage, the next instruction to be expanded is read from main memory. If main memory is not able to deliver the next instruction then the expansion process stalls until the next instruction is available. During stage 2, the Decode stage, a single instruction is decoded and examined for dependencies against all instructions already residing in the Expansion Register. The feedback path from the output of the Expansion Register to the Expand Logic block in Figure 2.5 is the data path for the information needed to perform the dependency checking. Execution unit use counts are also maintained by the expand logic. These counts allow the expansion unit to determine whether or not the instruction being decoded can be executed in parallel with the instructions already packed into the expansion register. If there are no data dependencies and there is an execution unit available then the expansion process can continue. When the expansion can continue, the instruction, along with information specifying its sequence number within the expanded parallel instruction, is routed to the appropriate slot in the Expansion Register. The trapezoid-shaped routing block below the Expand Logic block in Figure 2.5 delivers the instruction to the appropriate slot. During each expansion cycle, other than the initial expansion cycle, only one of the paths out of the routing block writes an instruction and associated information into the Expansion Register. During the initial expansion cycle all slots, except for the slot selected for instruction delivery, deliver a NOP code into the Expansion Register. During stage 3, the Expanded Instruction Cache Write stage, the current content
of the Expansion Register is written into the expanded instruction cache. The write occurs every expansion cycle, even though the expansion process may build another instruction into the expansion register during the next cycle. If the decode stage determines the expansion process cannot continue then the next stage 3 write is canceled. This is shown on the right side of Figure 2.6. During the cycle in which stage 2 determines that no more instructions can be packed into the Expanded Instruction register, the writing of the now complete Expanded Instruction register into the expanded instruction cache is occurring. The completed expanded instruction is also delivered to the execution units for execution the next cycle. The expansion cache is single ported, permitting only a single read or write access each cycle. While determining whether another instruction can be packed into the expanded instruction register, an extra instruction is read from main memory. This extra instruction will probably be the first instruction examined during the next expansion process, so a buffering mechanism is needed for main memory data. Main memory has a longer latency, assumed to be 10 cycles for most of the performance simulations reported here. The main memory model for the Epic machine is a high bandwidth streaming design incurring several cycles of delay for the first access to a new address, but providing single cycle access for each sequential word. A 10 cycle latency main memory is an aggressive goal for real hardware implementations. A large second level cache may be required to actually achieve an average latency of 10 cycles. If the main memory has a longer latency then the performance degradations at smaller cache sizes are larger, and vice versa for shorter main memory latency. Control of the main memory read address is performed by the Expand Logic, as shown in Figure 2.5. When a new sequential run of instructions is needed the Expand Logic sends a new address to the address register. Then it must wait the main memory latency time (10 cycles) for the first instruction. After that, a sequential instruction is available each cycle. If the Expand Logic requires a non-sequential read it sends a new address to the main memory and ignores the sequential data streaming in while the latency for the new address passes.
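The packing decision made each Decode cycle can be summarized with a short behavioral sketch. This is hypothetical Python (names such as SLOTS and depends are assumptions, not the dissertation's code); it assumes one instruction is examined per cycle, fixed slot counts per unit class, and termination on the first dependency or resource conflict. Branch handling and the NOP fill of unused fields are omitted.

    # Behavioral sketch of the Epic expansion packing loop (illustrative only).
    # Assumed instruction format: (unit_class, dest_reg, [source_regs]).

    SLOTS = {"branch": 2, "loadstore": 2, "functional": 4}

    def depends(instr, packed):
        """Data-dependency check of a candidate against the packed instructions."""
        _, dest, srcs = instr
        for _, pdest, psrcs in packed:
            if pdest in srcs or dest == pdest or dest in psrcs:
                return True
        return False

    def expand(instr_stream):
        """Pack sequential instructions into one expanded entry until a data
        dependency or execution-unit shortage stops the process."""
        packed = []
        used = {unit: 0 for unit in SLOTS}
        for instr in instr_stream:
            unit = instr[0]
            if used[unit] >= SLOTS[unit] or depends(instr, packed):
                break          # expansion stops; the entry is written to the cache
            packed.append(instr)
            used[unit] += 1
        return packed

    stream = [("functional", 3, [1, 2]),
              ("loadstore",  4, [3, 0]),    # reads r3, written above -> stop packing
              ("functional", 5, [1, 1])]
    print(expand(stream))                   # only the first instruction is packed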
In an actual implementation the memory system may also have a several-cycle delay when crossing a page boundary. However, the performance degradation caused by this behavior is small, as it does not occur very often. Page sizes are about one thousand instruction words, so this degradation occurs on the order of only 0.1% of the instruction word fetches. This page-crossing behavior is not modeled for the performance simulations reported here. The Epic machine requires only one instruction word per cycle and it starts processing the first instruction as soon as it arrives. Other superscalar designs typically require several instruction words in parallel before processing can begin on an instruction packet. This reduction in instruction memory bus width is another advantage the Epic model has over more complicated superscalar designs.
2.2.2 Epic Branch Prediction

A unique method of branch prediction is enabled by the Epic split expand and execute paradigm. During expansion it is possible to predict a branch and continue packing instructions from the target of the branch into the expanded instruction register. During execution the instructions packed after the branch are executed in parallel with the branch. If the branch is a conditional branch then these instructions must be flagged as being speculatively executed, but no other special handling is needed during execution of these instructions. The special handling for predicting the branch direction and the non-sequential fetch from main memory occurs only at expansion time, when the demand for high speed alignment and decoding is less stringent. This method of including an explicit next address within the instruction is similar to some microcoded machines using an unencoded next address field within each microinstruction. It is also similar to the branch folding technique proposed by Berenbaum et al. for the Crisp microprocessor [BCD+87], and the successor line branch prediction used by Johnson for his superscalar machines [Joh91]. Epic's instruction expansion process enhances this branch prediction technique to include not only the address to execute after the branch but also several additional instructions. Figure 2.7 shows an expanded view of the tag, successor index and branch instruction fields in an expanded instruction. Each branch field is repeated for each branch unit, which is twice for the two branch units in the model presented in section 2.2.
Figure 2.7: Tag and Branch Fields in an Expanded Instruction Cache Entry (Tag: starting address of first instruction; Successor Index: next starting address if all branches are correctly predicted; each Branch field: opcode and predicted target address)
Each cycle a comparison is performed between the tag and the required execution address generated during the previous cycle. If the tag matches the address then the entire expanded parallel instruction packet is valid and all execution units continue with the execution they initiated at the start of the cycle. If the tag does not match then there is a miss and the required instruction packet was not fetched. The expanded instruction packet being executed is invalid, so writes by the execution units must be inhibited for this packet of instructions. An expansion process is scheduled to begin the next cycle. Branch prediction is accomplished by having the expansion unit predict the direction of a conditional branch while it is expanding the branch. Unless stated otherwise, static branch prediction is used for the simulation results reported here. Normally, static branch prediction uses information provided by the compiler to determine the direction of the branch. However, instead of using compiler analysis this study uses the dynamic instruction trace to implement optimal static branch prediction. Optimal static branch prediction statically encodes in the opcode the most frequent direction a branch takes during execution. Using optimal static branch prediction ensures the results are not limited by ineffective compiler branch prediction. Optimal static branch prediction is simple to implement under the trace-driven methodology used here. Expansion of a conditional branch consists of predicting the target address and
storing it in the predicted target address field of the expanded instruction. Furthermore, if the taken direction is predicted then a new non-sequential fetch address is sent to the main memory and the expansion process stalls until the target instruction is returned. The branch instruction is placed into the opcode field of the expanded branch instruction. This communicates to the execution unit the operation to perform when determining whether the branch is correctly predicted. Specifying which branch direction is predicted in the expanded instruction allows the branch execution unit to correctly restart execution after executing a mispredicted branch. The successor index for the entire expanded instruction is written into the predicted target address field in case this branch is the last instruction that can be packed into the expanded instruction register. If the branch prediction is completely static then no overhead bits are needed in the expanded instruction to support branches. The static opcode communicates sufficient information to the branch execution unit for it to determine the predicted direction and thereby determine whether the branch was correctly predicted. If the expander uses any information other than the opcode to predict the branch direction, such as a branch history table [LS84], then the predicted direction must be saved in the expanded instruction. This requires one bit per branch in the expanded instruction to specify whether the expansion process predicted taken or not taken. Executing a conditional branch consists of comparing the registers or condition codes specified by the opcode and determining whether the branch is correctly predicted. If it is correctly predicted then the execution of the branch instruction is complete. The target instruction is either packed in the expanded instruction for parallel execution with this branch, or it is at the successor index of this expanded instruction and is being fetched this cycle. If the branch is incorrectly predicted then the branch execution unit must cancel any writes initiated this cycle by other speculative instructions within the expanded parallel instruction. It must also prevent the expanded instruction currently being fetched this cycle from being executed next cycle. The branch unit sends the correct target execution address to the expanded instruction cache; this expanded instruction is fetched the next cycle and is ready for execution the following cycle. Thus a
mispredicted branch has a one cycle penalty if the correct target is already in the expanded instruction cache. No special reorder buffers or other hardware is needed to recover from a mispredicted branch because all speculative instructions are issued during the same cycle as the branch they depend upon. Thus speculative instructions can be canceled before they reach the write results stage of the pipeline. This eliminates the need for special hardware result buffering for speculative instructions and simplifies the machine. Processing of unconditional jumps at expansion time deserves additional consideration. Since a jump's only effect is to change the flow of control, it could be folded out during expansion time. No action is performed by an execution unit for processing an unconditional jump. However, there is still a need to pack an unconditional jump instruction into a branch unit field. The machine is required to know the addresses of each instruction between the target of the unconditional jump and the end of the expanded instruction. For example, the address of an instruction causing a data page fault is needed to recover from the fault. If unconditional jumps are folded out then an instruction's address cannot be determined from the starting address of the expanded instruction and original program order information. If unconditional jumps are folded out and no additional address information is added then it is not possible to determine an individual instruction's address. The tag and successor address information cannot be used because unconditional jumps cause these addresses to become unrelated to the addresses of the instructions in the expanded instruction. Folding out unconditional jump instructions would therefore require additional overhead bits in the expanded instruction, such as storing the address of each instruction within the expanded instruction. This overhead would almost double the width of the expanded instruction and is prohibitively expensive. Keeping unconditional jumps in the expanded instruction solves the problem. Processing of subroutine calls is similar to processing unconditional jumps in that all the flow control effects of the call instruction are contained within the sequencing and successor index of the expanded instruction. Additionally, a call instruction must be packed into a branch unit field because the subroutine's return address must be saved when the call is executed.
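Putting the execution-time checks of this subsection together, the following hypothetical Python sketch (field names are assumptions; this is not the simulator used in this work) shows the per-cycle decision: compare the entry tag with the required address, execute on a hit, and on a mispredicted branch squash the speculative slots issued with it and redirect the fetch.

    # Per-cycle control sketch for Epic execution (illustrative, simplified).

    def epic_cycle(cache, fetch_addr):
        """Return (next_fetch_addr, action) for one cycle."""
        entry = cache.get(fetch_addr)
        if entry is None or entry["tag"] != fetch_addr:
            return fetch_addr, "miss: begin expansion cycles"
        # Hit: all units execute their slots; branch units verify predictions.
        for br in entry["branches"]:
            if br["actual_taken"] != br["predicted_taken"]:
                # Cancel writes of the speculative slots and refetch the target.
                return br["actual_target"], "mispredict: one cycle lost if target is cached"
        return entry["successor"], "execute"

    cache = {100: {"tag": 100, "successor": 200,
                   "branches": [{"predicted_taken": True, "actual_taken": False,
                                 "actual_target": 104}]}}
    print(epic_cycle(cache, 100))   # (104, 'mispredict: one cycle lost if target is cached')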
Indirect jumps, such as subroutine return instructions and case statement instructions, are also predicted during expansion time. Their target address is computed using the register data at expansion time, even though it may be incorrect. This leads to low prediction accuracy for some indirect jumps, but these jumps are low in frequency. The dynamic frequency of indirect jumps ranges from near zero to 2.2% of executed instructions for the TeX benchmark. The average prediction rate for all indirect jumps is about 51%. The low frequency of indirect jumps results in only a small performance degradation from using this simple approach for predicting indirect jump target addresses. As stated before, an expanded instruction's tag is the full address of the first instruction assembled into the expanded instruction packet, and this tag determines the validity of the entire packet. This single-tag-per-packet scheme allows an individual scalar instruction to be resident in the expanded instruction cache at more than one location. For example, suppose expansion starts at address n and this instruction along with several following instructions can be assembled into a packet. This packet has tag n. Now suppose elsewhere there is a branch to instruction n and the instruction at n can be packed into that expanded instruction. The instruction at n will reside in the expanded instruction cache both in the entry having tag n and in the entry containing the branch. Additional branches or different expansion starting addresses may cause this instruction to be entered into the expansion cache three or more times. Duplicating instructions in the expanded instruction cache is necessary to remove alignment hardware between the instruction cache and execution units. It has the drawback of reducing the effective size of the cache, but has the advantage of allowing a shorter cycle time. The effect of the Epic structure on effective cache size is studied in detail in later chapters.
2.2.3 Epic Execution Units

The overlapped boxes in the lower part of Figure 2.1 represent the execution units for the Epic model. There are 8 execution units in the configuration presented so far. Two of the units are specialized to execute branch instructions, two are specialized to execute load/store instructions, and the remaining four execute all other instruction types.
Figure 2.8: Execution Unit Pipeline (Fetch, Reg Read, Execute, Delay, Write)

The lines coming from the expanded instruction register and going into each execution unit carry the simple scalar instruction each unit executes during each cycle. This simple control is one of the central features of the Epic organization; that is, the wide parallel paths are kept as short and simple as possible. If the expansion process is not able to pack an instruction into one of the fields, a NOP is substituted. The main emphasis of this study is on instruction issuing mechanisms, so simplified high speed models are used for the execution units. Basically, all functional operations can be done in 1 cycle. No distinction is made between floating point and integer operations. Loads have a single load delay slot. Figure 2.8 shows the pipeline stages for the functional execution units. They each use a simple 5 stage pipeline. The first stage, labeled Fetch in the figure, is the read of the expanded instruction cache. During the second stage, labeled Reg Read, data is read from the register file and delivered to the execution unit. Any additional decoding needed to execute the instruction is also done during this stage. The third stage is when the data is modified by the ALU. The Delay stage is added so the number of stages in a functional execution unit matches the number of stages in the load/store units. Data is written back into the register file during the fifth stage. Full forwarding is implemented. If one unit produces a result during a given cycle, this result is available for use by any unit during the next cycle. For this analysis, there are no restrictions placed on the number of read and write ports for the register files feeding the execution units. The branch units have an even shorter pipeline than the functional units. The branch unit's pipeline is only two stages when a branch is correctly predicted. All branches compare two registers, or test a single bit of a register.
Figure 2.9: Load Pipeline (load: Fetch, Reg Read, Address, Data, Reg Write; dependent use: Fetch, Reg Read, pipeline stall, Execute, Write)
As in the MIPS R3000, these operations are simple enough that they can be completed during the Reg Read stage. There is no need for the execute stage, as branch instructions do not manipulate data, nor is there a need for the write back stage, as no result is produced. Figure 2.9 shows the load unit pipeline and an example of a data interlock. Load units have a 1 cycle delay slot. That is, if an instruction uses data being retrieved by a load instruction, this dependent instruction will be delayed one cycle if it is issued the cycle after the load is issued. The Epic model has hardware interlocks and the machine automatically inserts a pipeline stall when needed. Moreover, all instructions issued in parallel from the expanded instruction register are stalled, not just the single instruction having the dependency. This parallel stalling of all the pipelines simplifies the machine. Stalling all instructions in parallel eliminates the need for special hardware to determine whether some instructions are able to race ahead of a stalled instruction. This method of stalling all instructions can have a performance impact, and makes the scheduling of instructions by the compiler an important issue. Scheduling for the load delay is discussed in detail in chapter 3. An interlock mechanism maintains the order of completion of the instructions in the pipeline. Epic uses a simple register busy bit scoreboard. This scoreboard is simpler than the scoreboard used in the CDC 6600 [Tho70]. The CDC machine uses a tag indicating the unit that will produce the value in the future, while this Epic machine
uses just a single bit to indicate that the register will be updated in the future. Epic's mechanism is very similar to the scoreboard mechanism used in the Motorola MC88100 microprocessor [Mot88]. Epic's register scoreboard functions as follows. There is one bit for each register in the machine. Decoding an instruction having more than one cycle of latency sets the scoreboard bit corresponding to the destination register for this instruction. This bit is cleared at the end of the cycle before the result for the instruction is ready. Because of the single cycle execution units used by Epic, only load instructions set register busy bits in the scoreboard. Decoding an instruction also reads the scoreboard bits for all of its source operands. If any bit is set then some of this instruction's source operands will not be ready next cycle. A signal is generated to stall the entire set of scalar instructions being decoded this cycle. The next cycle the scoreboard bits are read again, and if they are all clear then all instructions in the expanded instruction register are issued in parallel. Store instructions are executed by the load/store unit, but have a different pipeline behavior than loads. The model uses a store buffer to eliminate all stalls that could be associated with store instructions. A load from an address in the store buffer is assumed to return the data from the store buffer without any additional delay. This may be difficult to implement in hardware, but the frequency of this case is low. An actual implementation may use an additional cycle, with only a small decrease in performance. Overall, the execution section of the Epic machine presented here has very fast execution units and very fast busing and forwarding between the units. This may be difficult to implement in hardware, but it is an appropriate model for this study, as slow register data flow or slow execution units could hide the effects of different instruction issuing mechanisms. The main focus of this study is instruction issuing mechanisms, so it is desirable for the results to not be greatly influenced by limitations in the execution units. Effective execution unit selection can proceed after the results of this study determine effective instruction issuing mechanisms.
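The busy-bit scoreboard just described is simple enough to capture in a few lines. The sketch below is hypothetical Python (illustrative, not the author's simulator); it assumes only loads set busy bits and that a set bit on any source operand stalls the entire expanded instruction packet.

    # Busy-bit scoreboard sketch for the Epic model (illustrative only).

    class Scoreboard:
        def __init__(self, num_regs=32):
            self.busy = [False] * num_regs

        def issue(self, packet):
            """Stall the whole packet (return False) if any source register of
            any instruction is still busy; otherwise mark load results busy."""
            for op, dest, srcs in packet:
                if any(self.busy[s] for s in srcs):
                    return False
            for op, dest, srcs in packet:
                if op == "load":              # only loads have multi-cycle latency
                    self.busy[dest] = True
            return True

        def writeback(self, reg):
            """Cleared at the end of the cycle before the result is ready."""
            self.busy[reg] = False

    sb = Scoreboard()
    print(sb.issue([("load", 5, [1]), ("add", 6, [2, 3])]))   # True, r5 marked busy
    print(sb.issue([("add", 7, [5, 2])]))                     # False -> whole packet stalls
    sb.writeback(5)
    print(sb.issue([("add", 7, [5, 2])]))                     # True after the load delay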
2.3 Reorder Buffer Model

This section explains the Reorder Buffer superscalar model used by this work as the reference model for out-of-order issue machines. This superscalar machine organization is modeled after the work of Johnson [Joh91], and the instruction fetch and issue portions of this model are similar to those in his Standard Processor. The philosophy of this Reorder Buffer model is that finding as much instruction-level parallelism as possible is more important than reducing the amount of communication and resource allocation logic. Whether this additional logic increases the cycle time is dependent upon many implementation factors. For the performance comparisons in this study it is assumed the Epic model and the Reorder Buffer model both have the same cycle time, even though this may be difficult to achieve. Figure 2.10 shows a block diagram of the Reorder Buffer superscalar processor. The processor is divided into two major sections: the fetch and issue section on the top half of the figure and the execute section on the bottom half. The reorder buffer in the upper left is a major component of this superscalar model and operates in parallel with the register file. The stacked rectangles on top of each execution unit represent instruction issue reservation stations for each unit. In the figure, the connections between components usually comprise multiple buses. The following describes the high level operation of this superscalar machine. The instruction cache supplies multiple instructions per cycle. These instructions are decoded and their source operands are read from either the register file or the reorder buffer. If a data dependency exists and the source operand cannot be supplied, a tag is used to identify the required operand. The instructions and source operands are then moved to the reservation stations for the appropriate execution unit. Each execution unit selects a ready instruction from its reservation station and issues it for execution. The results are first stored in the reorder buffer and then retired in order to the register file.
Figure 2.10: Block Diagram of Reorder Buffer Model (Instruction Memory, Instruction Cache, Decode, Register File, Reorder Buffer, Bypass, Branch Units, Load/Store Units, Functional Units, Result Buses)
2.3.1 Out-of-Order Issue in the Reorder Buffer Model

The Reorder Buffer model attempts to yield the best superscalar performance by using out-of-order issue. Reservation stations are used between the instruction decoder and execution units to buffer some instructions while others are issued. The register file contains the in-order state while the reorder buffer contains the lookahead state. The reorder buffer is a first-in, first-out (FIFO) queue. As instructions are decoded they are allocated a tag and entered into this queue. A tag is a unique identifier maintained by the machine's hardware. There is one tag value for each active instruction in the machine, and there are enough tags to uniquely identify each possible active instruction. When fewer than the maximum number of instructions are currently active, the unused tags are kept on a free list. Also kept in the reorder buffer queue is the result register address, or a flag specifying that this instruction does not write a result to a register. Entries in the reorder buffer are kept in sequential execution order. When an instruction's tag entry reaches the output of the reorder buffer and its result value has been posted into the reorder buffer, the result value is written into the register file. When several instructions at the output of the queue have their results ready, all of these instructions can complete during a single cycle. When an entry in the reorder buffer reaches the output of the queue but its corresponding instruction has not yet completed its path through the execution units, this entry and all following entries are held in the reorder buffer. Besides allocating an entry in the reorder buffer, the decode process delivers instructions to the appropriate reservation stations. The decoder determines the source operands and type of the instruction. It uses the source operand register addresses to access both the register file and reorder buffer in parallel. The register file access is a simple linear address decode and access. The reorder buffer access is a much more complicated associative lookup and prioritized access. Each source register address is compared to every entry in the reorder buffer. If this register address matches a result register address in the reorder buffer then the tag or data from the reorder buffer is used as the source operand instead of the register file data. If a source register address matches more than one result register address in the reorder buffer, then the entry
residing in the FIFO queue for the shortest time is used to supply the tag associated with the source operand's data value. For each source operand, the decode process delivers to a reservation station either the data value identified by the source's register address or a tag identifying a value that will be transmitted on one of the result buses. Each execution unit's reservation station contains a storage field for the operation specified by an instruction and storage fields for two source data items or tags. Each cycle an execution unit examines all the entries in its reservation station. If an entry does not have any tags in the source operand fields then this entry is ready for execution. Each cycle, each tag transmitted on the result buses is compared to all the tags in all the source operand fields. For each match, the data on the corresponding result bus is copied into the matching reservation station's source operand data field. When an execution unit finds a reservation station entry having all of its source operands ready, it selects it for execution and removes it from the reservation station. Entries in the reservation stations are kept in FIFO order. When more than one instruction is ready during a given cycle the oldest instruction residing in the reservation station is selected for execution. However, instructions need not be removed in FIFO order. When an instruction becomes ready before instructions that entered the reservation station ahead of it, it is selected for execution and removed ahead of them. This accomplishes out-of-order issue for this superscalar model.
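The reservation-station behavior described above amounts to a broadcast-and-match loop. The following hypothetical Python sketch (class names such as ResStation are assumptions, not the dissertation's code) shows result-tag broadcast, operand capture, and oldest-ready selection.

    # Reservation station sketch: each operand is a value or a pending tag.

    class Entry:
        def __init__(self, op, operands):
            self.op = op
            self.operands = operands          # items: ("val", x) or ("tag", t)

        def ready(self):
            return all(kind == "val" for kind, _ in self.operands)

    class ResStation:
        def __init__(self):
            self.entries = []                 # kept in age (FIFO) order

        def dispatch(self, entry):
            self.entries.append(entry)

        def broadcast(self, tag, value):
            """Result-bus broadcast: every matching tag captures the value."""
            for e in self.entries:
                e.operands = [("val", value) if (k == "tag" and t == tag) else (k, t)
                              for k, t in e.operands]

        def select(self):
            """Issue the oldest ready entry, bypassing older stalled entries."""
            for i, e in enumerate(self.entries):
                if e.ready():
                    return self.entries.pop(i)
            return None

    rs = ResStation()
    rs.dispatch(Entry("add", [("tag", 7), ("val", 4)]))   # waits on result tag 7
    rs.dispatch(Entry("sub", [("val", 1), ("val", 2)]))   # ready immediately
    print(rs.select().op)      # 'sub' issues ahead of the older, stalled 'add'
    rs.broadcast(7, 10)        # result 10 arrives with tag 7
    print(rs.select().op)      # 'add' is now ready and issues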
2.3.2 Register Renaming

Significant register storage conflicts can arise in superscalar processors due to the frequent reuse of registers. Because of register allocation by compilers, it often happens that different and independent computations interfere with each other because they use the same temporary registers. Scalar machine instruction issue is staggered in time, so register reuse is not usually a problem; the staggered issue permits scalar machines to use forwarding and bypassing to effectively reduce the loss of performance caused by register reuse. However, in a superscalar machine these storage conflicts can prevent the parallel issue of otherwise independent instructions. Register renaming is the process of using more physical storage elements than
addressable registers to eliminate storage conflicts within registers [Kel75]. It has been shown [JW88, Joh91] that register renaming is an important component of achieving good performance in an out-of-order issue machine. The organization of the reorder buffer automatically provides register renaming in this model. When an execution unit produces a result, it is written into the reorder buffer and into any reservation station entries containing the tag for this result. Since there may be several entries in the reorder buffer specifying the same destination register address, there are more physical storage elements than addressable elements. In the Reorder Buffer superscalar model presented here the reorder buffer has 16 entries. This number proves to provide sufficient register renaming so that register storage conflicts do not cause significant performance degradation.
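A short sketch makes the renaming lookup concrete: a source register read consults the reorder buffer first and, if several entries target that register, uses the most recently allocated one; otherwise the architectural register file supplies the value. Hypothetical Python, for illustration only.

    # Renaming lookup sketch: the newest reorder-buffer entry wins.

    def read_source(reg, reorder_buffer, reg_file):
        """reorder_buffer: list of dicts in allocation order (oldest first),
        each {'dest': r, 'tag': t, 'value': v or None}."""
        for entry in reversed(reorder_buffer):       # newest entry first
            if entry["dest"] == reg:
                if entry["value"] is not None:
                    return ("val", entry["value"])   # result already posted
                return ("tag", entry["tag"])         # wait for this result bus tag
        return ("val", reg_file[reg])                # in-order state

    rob = [{"dest": 3, "tag": 11, "value": 42},
           {"dest": 3, "tag": 12, "value": None}]    # r3 renamed twice
    print(read_source(3, rob, {3: 7}))               # ('tag', 12): newest rename
    print(read_source(4, rob, {3: 7, 4: 9}))         # ('val', 9): register file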
2.3.3 Reorder Buffer Model Branch Prediction

The branch prediction method used by the Reorder Buffer model is to include branch prediction information in each instruction cache line. This is also the method chosen by Johnson for his superscalar model. Each instruction cache line includes additional fields indicating the address of the next line to be executed, the offset of the branch within the current line, and the offset of the target in the next line. Figure 2.11 shows the fields used for branch prediction in the Reorder Buffer model machine. The fields are:
- The successor index field predicts the next cache line index to fetch. In the Reorder Buffer machine model the successor index also includes enough bits to specify a full instruction address. These bits are needed the first cycle after the fetch to determine whether the branch prediction is correct.
- The target offset field is the low bits of the predicted target address. It specifies the offset to the first instruction within the target line to execute.
- The branch offset field indicates the location of the branch instruction within the cache line. Instructions after the branch offset are not scheduled for execution.

Figure 2.11: Reorder Buffer Model Branch Prediction Fields (Tag, Successor Index, Branch Offset, Target Offset)
When a cache line is fetched and the program counter specifies an instruction located at or before the instruction identified by the branch offset field, all instructions between the program counter address and the branch offset are scheduled for execution. The successor address is the predicted next address to execute. If the program counter specifies an instruction after the branch offset, instructions starting at the program counter address and continuing to the end of the cache line are scheduled for execution. Sequential execution is predicted in this case and the next address to be fetched is computed by adding one to the address of the last instruction in the cache line. Although this method of branch prediction is similar to the Epic machine's branch prediction, there are major differences between them. Both have a single field for branch prediction information in each cache line. The difference is that in Epic's case only one predicted target address is needed for each cache line, whereas in the Reorder Buffer model machine's case there may be two or more branches requiring prediction within a cache line, but only a single field to store the prediction. The Epic model achieves a better branch prediction accuracy. The difference in prediction accuracy is discussed in detail in chapter 6.
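The fetch rule of the last two paragraphs can be written compactly. The sketch below is hypothetical Python; for simplicity it treats the successor field as the base address of the predicted next line, which is an assumption of the sketch rather than a statement about the hardware encoding.

    # Fetch-scheduling sketch for the successor-index prediction method.

    LINE_SIZE = 8

    def fetch(line, pc_offset):
        """line: {'base': address of slot 0, 'successor': predicted next line,
        'branch_offset': slot of the predicted-taken branch, 'target_offset':
        entry slot in the next line}.  pc_offset: slot where execution enters."""
        if pc_offset <= line["branch_offset"]:
            scheduled = range(pc_offset, line["branch_offset"] + 1)
            next_addr = line["successor"] + line["target_offset"]
        else:
            scheduled = range(pc_offset, LINE_SIZE)
            next_addr = line["base"] + LINE_SIZE     # sequential fall-through
        return list(scheduled), next_addr

    line = {"base": 64, "successor": 128, "branch_offset": 5, "target_offset": 2}
    print(fetch(line, 2))   # slots 2..5 scheduled, next fetch predicted at 130
    print(fetch(line, 6))   # slots 6..7 scheduled, sequential fetch at 72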
2.3.4 Reorder Buffer Model Pipeline and Execution Units

There are 8 execution units in the Reorder Buffer model superscalar machine, the same as in the Epic machine model. Two of the units are specialized to execute branch instructions, two are specialized to execute load/store instructions and the remaining four execute all other instruction types. Because the main emphasis of this study is on instruction issuing mechanisms, high speed models are used for the execution units. All functional operations can be done in 1 cycle. No distinction is made between floating point and integer operations.
Figure 2.12: Reorder Buffer Model Pipeline (Fetch, Allocate, Reg Read, Execute, Write, Commit)

Loads have a single load delay slot. Loads require an additional execution cycle because they must first generate the address and then access the data cache. Figure 2.12 shows the Reorder Buffer model's pipeline stages for the execution units. The load/store units have an additional pipeline stage after the execute stage, which is the load delay slot. This pipeline differs from the Epic pipeline in that the decode process is two stages instead of one and there is an additional commit stage added to the end of the pipeline.
2.3.5 Reorder Buffer Model Decode Process

The Reorder Buffer model uses two pipeline stages for its decode process. Figure 2.12 labels these stages as allocate and reg read. Figure 2.13 shows the justification for this two stage decode process. The total at the bottom shows at least 28 gate delays are required to route an instruction from an 8 instruction cache line to a reservation station. This limit is determined as follows. Winograd [Win65, Win67] and Spira [Spi73] have shown that for Boolean logic the number of gate delays required to compute a function of n arguments is lower bounded by t ≥ ⌈log_r n⌉, where t is the number of gate delays and r is the maximum fan-in of each gate. Using this bound one can compute the minimum number of gate delays required for each step in decoding an instruction and delivering it to a reservation station. By adding these delays along the critical path, the minimum number of gate delays required for the decode process is obtained. For this analysis 2-input gates (r = 2) are used. The first step in decoding an instruction, shown in Figure 2.13, is the align block. The leftmost align block has as its inputs any 1 of 8 instructions from the instruction
Figure 2.13: Reorder Buffer Model Decode Timing (from the I-cache register, 256 bits, to the reservation stations: Align 4, Enable 1, Allocate via a 136-input PLA 8, Operand Select 4, Register File / Reorder Buffer Access 7, Select and Bypass 3, Drive of 48 reservation station loads 3 gate delays; figure total 30, followed by the Execute stage)
cache output, and has 1 output. Not shown in the figure are the 3 control lines needed to select which of the 8 instructions is to be selected. There are a total of 8 of these align blocks and they run in parallel. Only the leftmost align block, which is required to select any one of the 8 instructions, is in the critical path. There are 8 data and 3 control inputs and one binary output, so the minimum delay for this block is ⌈log_2(8 + 3)⌉, or 4 gate delays. The next block is the enable block. This block has only one data input and one control input, not shown. This block enables an instruction to the allocate PLA (programmed logic array) or replaces it with a code that does not require any resources. Provided the control is not on the critical path, the minimum delay for this block is ⌈log_2(1 + 1)⌉, or 1 gate delay. The next block in the decoding process's critical path is the allocate block of Figure 2.13. This block takes as input the resources each instruction is requesting and generates as output the control signals selecting which instructions can use the register file. There are 8 instructions competing for register file access, and each of these instructions has up to 2 source operands. It is assumed 7 bits of the instruction are needed for determining instruction and operand usage. The source operands select one of 32 registers from the register file, requiring 5 bits for each source operand. Thus each of the 8 instructions feeds 7 + 5 + 5 = 17 bits into the resource allocation logic. The number of input bits to this logic function is 8 × 17 = 136. The minimum time for this function, which is probably very difficult to achieve, is ⌈log_2(136)⌉ = 8 gate delays. Source operand selection is the next block in the decoding process's critical path. This multiplexor has as its data inputs the source operand field of each possible instruction. Its control inputs are the output bits of the allocate PLA specifying which source operands are to be used for accessing the register file and reorder buffer. There is one of these multiplexors for each register read port. Each bit of the selection address can come from any of 8 possible instructions, so the minimum delay for this block is ⌈log_2(8 + 3)⌉, or 4 gate delays. Next in the decoding process's critical path is register access. There are 32 integer registers and 32 floating point registers. The 6 address inputs and 64 internal storage
cells result in a minimum delay of ⌈log_2(6 + 64)⌉ = 7 gate delays to select a source register. A similar analysis applies to the reorder buffer, which runs in parallel with the register file. Actual hardware implementations of register files use sense amplifiers and other circuit techniques instead of simple 2-input logic gates, so there may be some inaccuracies in this analysis. However, 7 gate delays is a good estimate of the minimum time to access a 64 entry register file. The select block is next in Figure 2.13. The machine must select whether a source operand is read from the register file or from the reorder buffer. The machine also needs to bypass data from a result bus instead of using register file or reorder buffer data. The final source selection is performed by the bypass multiplexor in the figure. Two bits of control and 3 bits of data input result in a minimum delay of ⌈log_2(2 + 3)⌉, or 3 gate delays. Finally, at the bottom of Figure 2.13 is the number of gate delays required to drive the reservation stations. For this block's analysis a fan-out limit of 4 is used, so driving 48 loads requires a fan-out tree of ⌈log_4(48)⌉ = 3 levels of gate delay. Totaling up all the gate delays in the critical path shown in Figure 2.13 results in 28 gate delays for the Reorder Buffer model's decode process. The execute stage of the pipeline consists of an adder and a small amount of additional logic for control, latching and driving the result buses. Waser and Flynn [WF82] use the Winograd bound to compute the gate delay for a canonic adder and report the formula 2⌈log_r(n − 1)⌉ + 2, less a small correction term. For a fan-in of 2 (r = 2) and a data path width of 64 (n = 64), ignoring the correction term, this results in a minimum of 14 gate delays for an adder. So a lower bound on an execute stage is approximately 18 gate delays. The execute stage's minimum of 18 gate delays is much shorter than the minimum 28 gate delays of the decode process. To prevent the decode process from requiring more cycle time than the execute stage it is necessary to divide it into two pipeline stages. This is done in the Reorder Buffer machine model by inserting a set of pipeline registers after the allocate block. This completes the justification for the two stage decode process. For comparison, Figure 2.14 shows the minimum gate delays for the Epic model. In this model the only blocks in the critical path of the decode process are the register file, the bypass multiplexor, and a pipeline register before the execution unit.
Figure 2.14: Epic Model Decode Timing (from the I-cache register, 256 bits, to a pipeline register: Register File Access 7, Bypass 2, Drive of 1 load 1 gate delay; total 10, followed by the Execute stage)
Using the same analysis as used above for the Reorder Buffer model, the Epic model has a minimum timing of 10 gate delays for its decode process. This easily fits into the minimum cycle time of 18 gate delays required by the execute stage.
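The gate-delay arithmetic of this section is easy to reproduce. The sketch below (hypothetical Python, a reconstruction for illustration rather than the analysis tooling used for the dissertation) applies the ⌈log_r n⌉ bound to the input counts given in the text and reproduces the per-block numbers and totals shown in Figures 2.13 and 2.14.

    # Winograd/Spira fan-in bound: combining n inputs with fan-in r gates
    # needs at least ceil(log_r(n)) gate levels.
    from math import ceil, log

    def levels(n, r=2):
        return ceil(log(n, r))

    reorder_decode = {                            # Reorder Buffer model decode path
        "align (8 data + 3 select)":       levels(8 + 3),
        "enable":                          levels(1 + 1),
        "allocate PLA (136 inputs)":       levels(136),
        "operand select (8 data + 3 sel)": levels(8 + 3),
        "register/ROB access (6 + 64)":    levels(6 + 64),
        "select and bypass (2 + 3)":       levels(2 + 3),
        "drive 48 loads (fan-out 4)":      levels(48, r=4),
    }
    print(reorder_decode, sum(reorder_decode.values()))

    epic_decode = {                               # Epic model decode path
        "register access (6 + 64)": levels(6 + 64),
        "bypass":                   2,            # as given in Figure 2.14
        "drive 1 load":             1,
    }
    print(epic_decode, sum(epic_decode.values()))

    # Canonic adder bound quoted from Waser and Flynn: 2*ceil(log2(n-1)) + 2.
    print(2 * levels(64 - 1) + 2)                 # 14 gate delays for 64 bits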
2.4 Expanded Cache Reorder Buffer Model

This section explains a superscalar machine combining both the expanded instruction cache and the out-of-order issue reorder buffer decoder. This superscalar machine organization couples the improved instruction cache bandwidth offered by the expanded instruction cache with the higher instruction-level parallelism detection provided by the reorder buffer and reservation station issuing mechanism. Figure 2.15 shows a block diagram of the expanded instruction cache with a reorder buffer out-of-order issue superscalar processor. This model is given the name Expro. The figure is quite similar to the Reorder Buffer model of Figure 2.10, except the instruction cache and decode blocks have been replaced by an expansion block, an expanded instruction cache, and an expanded instruction register.
2.4.1 Expro Model Instruction Issue

The Expro machine organization provides better performance by delivering additional useful instructions per cycle to the reorder buffer. With a traditional cache, instructions residing in a cache line after a taken branch are delivered to the reorder buffer for execution, but these are not useful instructions. Another inefficiency occurs for instructions residing in the cache line before the branch target; these are delivered to the reorder buffer, but they too are not useful. By replacing the traditional cache with an expansion cache, most instructions delivered to the reorder buffer are predicted to be executed. The expansion unit of the Expro model functions similarly to the Epic model's expansion unit, which is shown in Figure 2.5. Instructions are fetched from main memory, analyzed and packed into an expanded instruction register.
Figure 2.15: Block Diagram of Expanded Cache Reorder Buffer Model (Expansion, Expanded Instruction Cache, Expanded Instruction Register, Register File, Reorder Buffer, Bypass, Reservation Stations, Branch Units, Load/Store Units, Functional Units, Result Buses)
The difference is that the Expro model does not stop packing as soon as a data dependence or resource limitation is encountered. The Expro model continues packing until the cache line is nearly full. What is meant by "nearly full" is discussed in chapter 4. The decode process is very similar to the decode process of the Reorder Buffer model explained in the last section. The decode process takes instructions from the expanded cache line and prepares them for execution. The Expro model uses two pipeline stages to implement the decode process, the same as the Reorder Buffer model. Decoding instructions is done by analyzing the instructions, reading the register source operands and sending the instructions to the reservation stations. Decoding also allocates tags for each instruction and creates corresponding entries in the reorder buffer. There are no position restrictions placed on instructions filling the expanded cache line. The decode process is able to route any instruction in the line to any reservation station. There are restrictions on how many of each type of instruction can be routed during each cycle. An instruction can be routed only if there is a reservation station and execution unit of the appropriate type available. For example, if there are two load/store units then the third load/store instruction in the cache line will require an additional decode cycle.
2.4.2 Expro Model Branch Prediction

Branch prediction by the Expro model is performed during the expansion process. When the expander encounters a branch instruction, it predicts it using static prediction information supplied by the compiler. The target address is computed and expansion continues at the point of the target. This can fill the expanded instruction cache line with instructions from several different areas of main memory. Thus, during each execution cycle the expanded instruction cache can provide nearly a full cache line's worth of instructions predicted to be executed. The successor index method of branch prediction is used at execution time. Successor addresses are computed during expansion time. Once the expansion process stops filling the cache line, it computes the address of the next instruction to execute and stores this address in the successor field.
              BRN 10
              ...
    10:       xxx
              BRN 1000
              ...
    1000:     yyy
    1001:     zzz
    1002:
Figure 2.16: Successor Index Example Code

For example, consider Figure 2.16. The expansion process packs into a cache line the instruction for a predicted branch to address 10, then a predicted branch to address 1000, and then two more sequential instructions. The successor address is address 1002. Indirect jumps (branches whose target addresses are specified by register values) are predicted at expansion time based on the information available at the time of expansion. There are two main classes of indirect jumps. The first class is generated by the return statements of subroutines. The second class is generated by case or switch statements. For the return class, the expansion process triggered by the initial cache miss usually accurately computes the target address of the return jump. This is because the return address register usually contains the correct return address value when the instruction cache miss occurs. The return address computed during expansion is used only for branch prediction, so an incorrectly computed address will be treated as a mispredicted branch during execution. Subsequent uses of the expanded instruction cache line containing the return's indirect jump may have a different return address. This results in an incorrect prediction. For the machines modeled here, no attempt is made to adjust the predicted return address when the call chain changes. This results in a low prediction accuracy for returns. However, the percentage of return instructions encountered during execution is small, so the low prediction accuracy results in only a small performance degradation. Measured prediction rates and performance impacts are presented in chapter 6.
              JMP @R1
              ...
    L1:       xxx
              JMP L3
              ...
    L2:       yyy
              JMP L3
              ...
    L3:       zzz
Figure 2.17: Indirect Jump Example Code

The other class of indirect jumps occurs when the compiler emits code to implement case statements. In this case the expansion process often incorrectly predicts the target of the indirect jump. Case statement code usually computes a target address in a register and then executes the indirect branch. Since the expansion process is building instructions for several cycles' worth of execution, it is likely the register value used to compute the predicted target will be modified. However, the dynamic frequency of case statement indirect jumps is very low, at most 0.4% for the benchmarks used here. Thus this incorrect prediction has very little performance impact. Predicting indirect jumps requires special attention during the expansion. If care is not taken, an indirect jump may permit incorrect program flow after the jump. Suppose the code is as in Figure 2.17. The at-sign indicates an indirect jump. Register R1 can contain the address of either L1 or L2. The xxx, yyy and zzz are instructions that can be packed into the expanded cache line. The expander could predict the flow is to jump to label L1 and then to L3 and pack all of these instructions into one cache line. This skips the instruction at L2. It would then compute the next address to execute as L3+1 and place this address in the successor field. The flow during execution could be from the indirect jump to L2 and then L3. The branch execution unit must be able to detect how the indirect
jump was predicted. If the only information the branch execution unit has about the prediction is the initial address and the predicted successor address of the line, then it is not possible to determine which of the above two cases was predicted. For branch statements, which can branch to only one of two locations, the expander includes information in the expanded instruction line specifying the predicted direction. However, specifying the predicted target address of an indirect jump requires a full 32 bit address. This is very expensive in terms of expanded cache line width for such a low frequency operation. This problem can be solved by restricting which types of instruction can be packed into an expanded cache line after an indirect jump. After an indirect jump is placed into an expanded line the expander limits packing to just instructions that do not branch. This makes it possible to compute at execution time the predicted target address. Since the only way to reach the successor address is through non-branching instructions, the branch execution unit can determine the address used during the expansion. The branch execution unit subtracts the number of packed instructions after the indirect jump from the successor address in the expanded instruction line. This address is then compared to the actual indirect jump address to determine whether the prediction is correct.
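The check performed by the branch execution unit can be expressed directly: subtract the number of instructions packed after the indirect jump from the line's successor address and compare the result with the actual jump target. A minimal sketch in Python:

    # Indirect-jump prediction check (illustrative sketch).

    def indirect_jump_correct(successor_addr, packed_after_jump, actual_target):
        """Only non-branching instructions may follow an indirect jump in a
        line, so the predicted target is recoverable from the successor."""
        predicted_target = successor_addr - packed_after_jump
        return predicted_target == actual_target

    # Expansion predicted a jump to 1000 and packed 2 more instructions,
    # so the successor address stored in the line is 1002.
    print(indirect_jump_correct(1002, 2, 1000))   # True  -> prediction correct
    print(indirect_jump_correct(1002, 2, 2000))   # False -> handled as a mispredicted branch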
2.5 Model Summary

Table 2.1 presents a summary of the overall processor configurations. Each line of this table is explained in this section. The first line of Table 2.1 is Issue Policy, and is one of the instruction issue methods described in Section 1.2. The Epic model uses in-order issue; that is, the machine issues instructions in the same order as they are emitted by the compiler. The in-order definition is extended for the case of simultaneously issuing two or more independent instructions during the same cycle. The other two models use out-of-order issue. Interlock Mechanism is the next entry in Table 2.1. This describes the method for determining the order of completion of the instructions in the pipelines. Epic uses a simple register scoreboard, with one busy bit for each register. It stalls if a source operand is still being computed.
                              Epic         Reorder Buf.    Expro
Issue Policy                  in-order     out-of-order    out-of-order
Interlock Mechanism           scoreboard   result tags     result tags
Register Renaming             none         reorder buf.    reorder buf.
Cache Line Size (inst.)       8            8               8
Expansion Pipeline Stages     3            n/a             3
Reorder Buffer Entries        n/a          16              16
Reservation Stations          n/a          4, 8 for L/S    4, 8 for L/S
Execution Pipeline Stages     4            6               6
Decode Stages                 1            2               2
Mispred. Branch Penalty       1            3               3
Branch Units                  2            2               2
Functional Units              4            4               2
Load/Store Units              2            2               2
Load Delay (cycles)           1            1               1
Main Memory Latency           16           16              16
Data Cache                    infinite     infinite        infinite

Table 2.1: Model Summary
The other two machines use result tags in the reservation stations and a reorder buffer to match results with reservation station entries waiting for operands. Register Renaming is the next line in summary Table 2.1. The Epic model does not use any hardware for register renaming. This is an important consideration, as renaming hardware can increase performance by removing output and anti-dependencies. Eliminating register renaming hardware is consistent with the Epic model's philosophy of simple hardware. The other two machines achieve register renaming by associative lookup in the reorder buffer. The Cache Line Size entry in Table 2.1 reports the maximum number of scalar instructions that can be read from main instruction memory and written into the instruction cache during an expansion process or cache miss. For the Epic and Expro models, the term line size is used here to describe the maximum number of instructions that are cached as a result of a miss. This is somewhat different from the usual use of the term line size because each fetch by the expansion process may read less than
the full number of instructions permitted by the line size width. Traditional caches always read exactly the number of words specified by the line size width. The Expansion Pipeline Stages entry in the summary table specifies the number of pipeline stages in the expansion unit, as explained in section 2.2.1. Since the expansion unit is unique to the expansion cache models, this entry does not apply to the reorder buffer model. The Reservation Stations line in the summary table specifies the number of reservation station entries serving each execution unit. The Epic model does not have reservation stations, so this entry does not apply to that model. The other two models have a 4 entry reservation station for each of the branch and functional units, and an 8 entry reservation station for each load/store unit. The Execution Pipeline Stages entry in the summary table specifies the number of pipeline stages in the execution units, as explained in section 2.2.3. Because of this study's emphasis on instruction issuing and not data transformation, this parameter is not varied in the studied machines. The Decode Stages entry in the summary table specifies the number of execution pipeline stages used by the decode process. The Epic machine uses 1 stage and the machines with reservation stations and reorder buffers use 2 stages. The Mispredicted Branch Penalty entry in the summary table specifies the number of cycles the machine requires to recover from a mispredicted branch, assuming there are cache hits for all referenced instructions. This parameter has a major impact on performance. In a superscalar machine, several instructions are executed in a cycle, so each cycle spent in mispredicted branch recovery results in several instructions being flushed from execution. The Epic model uses a one cycle mispredicted branch penalty because of the simplicity of its mispredicted branch recovery. The reorder buffer models require 3 cycles to recover from a mispredicted branch. Three cycles are needed because of the 2 cycle decode and an additional cycle for instruction
flushing from the reorder buffer. The Branch Units, Functional Units and Load/Store Units entries in the machine summary table specify the number of each type of execution unit. The basic configurations studied in this report have 2/4/2 counts for the respective units. This mix
of execution units is chosen because, with these counts, program execution is generally limited by instruction fetching considerations and data dependencies, not by limited execution resources. This is consistent with this work's strategy of studying the instruction fetch and issue concerns, not the data flow and data manipulation concerns. There are some programs where the number of loads that can be issued in parallel does become a limiting factor, and this is discussed in section 3.2.2. The Load Delay entry in the machine summary table specifies the number of cycles that must elapse between issuing a load and issuing an instruction using the data retrieved by the load. As stated before, hardware mechanisms will stall the machine if a dependent instruction is issued too soon. This parameter is set to one cycle, which is the typical number for most current RISC style machines. The Main Memory Latency entry describes the number of machine cycles required to retrieve an instruction after changing the main memory address. This parameter is held constant at 16 cycles for the machines covered by this report. The final entry in the summary table, Data Cache size, provides additional evidence that the emphasis of this study is on instruction fetch and issue mechanisms, not data manipulation requirements. All the studied machines are assumed to have an infinite data cache. That is, all load instructions will return the required data after the specified load delay cycles. No data cache misses are simulated.
2.6 Methodology, Tools and Benchmarks

2.6.1 Trace Driven Methodology

Trace driven simulation produces the results of this study. Figure 2.18 presents the overall methodology. A benchmark is first compiled and optimized using the standard DEC Ultrix 4.2 C compiler to emit Ucode intermediate language [Nye82]. This Ucode is then delivered to an instrumentation program that inserts additional Ucode at basic block boundaries. The instrumented benchmark's Ucode is assembled, linked and executed. Executing the benchmark emits a basic block trace while the benchmark runs, which is written to a file.
Figure 2.18: Experimental Setup (block diagram with the stages Input, cc -O, Instrument ucode, Execute, Cgen, the .ins machine code file, the .bt basic block trace file, Simulate, and Report)
Using the commercial compiler as the front end for the code generation provides high quality machine independent code. The optimization control flag passed to the compiler is -O2, which invokes the global Ucode optimizer. This optimizer does register allocation, except for the temporary registers needed during expression evaluation. The final register allocation is left to the code generation phase. The Cgen block in figure 2.18 is the code generator constructed to support these studies. The machine configurations studied here differ from the MIPS architecture [KH92] for which the commercial code generator is targeted. These superscalar machines require additional code scheduling to uncover more instruction level parallelism. Chapter 3 covers the issues associated with code scheduling. The code generation process emits an ".ins" file, a machine code file with instruction formats very similar to the MIPS instruction set. The dotted line in figure 2.18 from the trace file to the Cgen block is the information path used by the code generator for static branch prediction. Trace driven methodology allows the code generator to implement optimal static branch prediction. It does so by first scanning the dynamic trace and computing the number of times each branch is taken and is not taken. If, over the entire trace, a branch is taken more times than it is not taken, then it is predicted taken. Static branch prediction restricts the compiler to encoding a single branch direction in each branch instruction. Scanning the dynamic trace achieves optimal static branch prediction because it enables the compiler to encode the direction that is taken most often during the trace driven execution. The Simulate block in figure 2.18 is the trace driven simulator. It reads the machine code file into the simulated machine's memory and pulls basic block records from the trace file one block at a time. This simulation is conducted on a machine cycle by cycle basis. This simulator was constructed to support this study and is written in Pascal and C. The simulator's main loop is traversed once per machine cycle. Each cycle it examines each pipeline stage and computes the next outputs for the stage. At the end of the cycle the pipeline registers are "clocked", that is, the outputs of the current cycle are moved to the inputs of the next cycle.
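The profile-based static branch prediction described above amounts to a majority vote per branch over the trace. The following is a minimal sketch of that bookkeeping, not the actual Cgen implementation; the table size, branch identifiers, and function names are assumptions.

    /* Sketch: choose a static prediction for each branch by majority
     * vote over the dynamic trace.  The real Cgen tool reads basic
     * block records from the .bt file; the layout here is assumed.   */
    #define MAX_BRANCHES 65536

    static long taken_count[MAX_BRANCHES];
    static long not_taken_count[MAX_BRANCHES];

    /* Called once per executed branch while scanning the trace. */
    void record_branch(unsigned branch_id, int was_taken)
    {
        if (was_taken)
            taken_count[branch_id]++;
        else
            not_taken_count[branch_id]++;
    }

    /* After the scan, the code generator encodes one direction
     * per branch: predicted taken only if taken more often.      */
    int predict_taken(unsigned branch_id)
    {
        return taken_count[branch_id] > not_taken_count[branch_id];
    }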
Trace file information is accessed only at the end of each basic block. Normally, the simulator decodes an instruction and computes the address of the next instruction to execute by adding one to the program counter. However, if it is the last instruction in a basic block, then the trace file is accessed and used to determine the next address to execute. If the instruction is a mispredicted branch, the simulator tags this instruction with the correct target address and turns off reading from the trace file. Simulation continues along the mispredicted path, using the predicted targets of any branches encountered while following this path. When the branch execution unit discovers the mispredicted branch, it uses the correct target address to restart the fetch sequence and re-enables reading from the trace file. The simulator is able to compute the data memory addresses for load and store instructions that are not indirect through a register. When there is a register indirect memory reference, any later data memory references are delayed until the address is known.
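A minimal sketch of this next-address logic follows. It only illustrates the end-of-basic-block trace access and the wrong-path mode described above; the record format, types, and helper names are assumptions, not the actual Pascal/C simulator code.

    /* Sketch: choosing the next fetch address.  A trace record is read
     * only at the end of a basic block; while on a mispredicted path the
     * trace is ignored until the branch unit resolves the branch.       */
    typedef struct {
        int last_in_block;          /* last instruction of a basic block   */
        int is_branch;
        unsigned predicted_target;  /* statically predicted target         */
        unsigned correct_target;    /* filled in on a detected mispredict  */
    } sim_inst_t;

    typedef struct { unsigned next_addr; } block_rec_t;

    static int trace_off;                      /* on a mispredicted path   */
    extern block_rec_t read_trace_record(void);   /* pulls one .bt record  */

    unsigned next_fetch_addr(unsigned pc, sim_inst_t *inst)
    {
        if (!inst->last_in_block)
            return pc + 1;                     /* sequential within block  */

        if (trace_off)                         /* wrong path: keep using   */
            return inst->predicted_target;     /* the static predictions   */

        block_rec_t rec = read_trace_record();
        if (inst->is_branch && rec.next_addr != inst->predicted_target) {
            inst->correct_target = rec.next_addr;  /* tag the mispredicted */
            trace_off = 1;                         /* branch, stop reading */
            return inst->predicted_target;         /* follow the wrong path */
        }
        return rec.next_addr;
    }

    /* When the branch unit resolves the tagged branch it restarts the
     * fetch sequence at correct_target and clears trace_off.            */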
2.6.2 Benchmarks

Table 2.2 describes the six benchmarks included in this study. These benchmarks mimic an execution profile similar to a typical workstation environment. Five of the benchmark programs are also used in the SPEC benchmark suite, although the versions of these programs and the input data sets used here are not the same as those used in the SPEC [Dix92] suite. All the benchmarks are written in the C language. Table 2.3 presents both the static and dynamic instruction counts for the benchmarks. The tracing methodology does not support tracing run time library code or system code, so the numbers presented are just for the compiled C code. The fft benchmark is somewhat different from the other five in that it has a much lower percentage of conditional branches. This is due to its structure being numerical or matrix style code; that is, almost all control flow is looping over array data structures. This type of control leads to very good branch prediction, and this benchmark is included to demonstrate this behavior.
    Benchmark   Description
    compress    Lempel-Ziv compression on a 150KB tar file.
    espresso    Boolean expression minimizer reducing a 14-bit input, 8-bit output PLA.
    fft         Fast Fourier transform: 1024 point 1D FFT.
    gcc1        Gnu C version 1.36 compiling (and optimizing) to assembly 1500 lines of C.
    spice3      Circuit simulation of a Schottky TTL edge-triggered register.
    tex         Document preparation system formatting a 14 page technical report.

    Table 2.2: Description of the Benchmarks
    Benchmark   Static Size (bytes)   Dynamic Instructions
    compress            6,760               13,252,875
    espresso           96,732              153,611,785
    fft                 1,456                6,692,459
    gcc1              540,276               42,774,378
    spice3            469,356              151,890,201
    tex               161,012               72,706,435

    Table 2.3: Benchmark Sizes and Instruction Counts
2.7 Simple 8-Wide Superscalar Machine Reference Model

This section introduces the Simple in-order issue reference model superscalar machine used for performance comparison. This simple superscalar machine uses the cache line as the multiple instruction fetch mechanism. Its block diagram is presented in figure 2.19. This reference machine is selected for its simplicity. It does not have complicated buffers or associative memories attempting to extract additional parallelism that may be found in the instruction stream when out-of-order execution is supported. This simple model allows performance comparisons to be made by adding features to this reference model. The reference model machine is an 8-wide superscalar machine. Each cache line is 8 instructions wide, which is 32 bytes. The 8-wide configuration is chosen for several reasons. The first reason is the desire to select a model for which the execution portion of the machine is not the performance limiter. The expected average instruction level parallelism of the benchmarks is less than 3 instructions per cycle, so a peak of 8 instructions per cycle will not be limiting. The second reason for this 8-wide configuration is that it is within reach of next generation technology. Current machines, such as the DEC Alpha 21164 [BK95] and Power-PC 620 [LTT95], have 6 execution units operating in parallel. The historical doubling every few years of the number of transistors that can be integrated on a single chip should easily allow 8 execution units in the next generation technology. Figure 2.19 shows the overall flow of instructions in this reference machine. Instruction execution is divided into 5 pipeline stages. During the first stage a cache line's worth of instructions is fetched from the instruction cache. This cache line contains the instruction pointed to by the program counter. Instructions starting at the one pointed to by the PC and through the end of the cache line are selected for execution. Instructions before the PC are thrown away. The cache has a one cycle access time; when the cache hits, it delivers 8 instructions every cycle. When the cache misses, an 8 cycle miss penalty is incurred.
Figure 2.19: Simple 8-Wide In-Order Issue Superscalar Machine (pipeline stages: Fetch from the instruction cache; Decode and Issue with the 8-wide align unit, the unit and dependency analysis block, and the route network; Execute with branch units, load/store units, and function units plus the load delay; Write Back through the register interconnect and bypass)
The second pipeline stage, labeled decode and issue in figure 2.19, is where the instruction selection process occurs. The first step in this process is performed by the align block. This is a shifter controlled by the low bits of the program counter, which are the bits labeled the offset bits in figure 4.3. The align block takes the instruction selected by the program counter and aligns it to the first instruction decoder slot. The instruction following the one pointed to by the PC is aligned to the second decoder slot, and so on. The next function of the decode and issue pipeline stage is the block labeled Unit and Dependency Analysis in the figure. This block inspects the aligned instructions and determines which of them can execute in parallel. It examines the inputs and outputs of the instructions and determines if any dependencies exist between an instruction and all previous instructions being decoded during this cycle. If there is a dependency, then the dependent instruction and all following instructions are not selected for execution during the current cycle (a sketch of this selection logic appears after the feature list below). This block also decodes the instructions and allocates execution units to each one. If there are insufficient execution units to execute a given instruction, then this instruction and all instructions following it are not selected for execution during the current cycle. The third function performed during the decode and issue stage is the delivery of instructions to the execution units. This is implemented by the route block in figure 2.19. This block can route any of the up to 8 input instructions to any of the 8 execution units. Full crossbar functionality is implemented by this routing network; it has no routing or bandwidth restrictions. A large quantity of logic is required to implement the above three decode and issue functions. It may be difficult to meet the machine's cycle time constraint in a single pipeline stage. However, because this machine model's purpose is a base model for comparison and not a proposal for implementation, the simulations will complete the align, allocate, and route functions during a single pipeline stage. The third, fourth, and fifth stages of the reference model machine are the execute, load delay, and write back stages. These stages are the same as the corresponding stages in the Epic model, described in section 2.2.3, Epic Execution Units. In summary, the main features of the execution units are:
- All instructions except loads execute in one cycle.
- Load instructions have a one cycle delay, which is a 2 cycle latency.
- There is no difference between integer and floating point timing and unit usage.
- The data cache is modeled as an infinite cache.
- If any source operand is unavailable at the start of the execute stage, then the pipelines of all execution units are stalled until all source operands are ready.
- In-order issue is enforced.
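The dependency analysis mentioned above selects the leading run of mutually independent instructions from the aligned cache line, stopping at the first dependence or at the first instruction for which no execution unit remains. The following is a hedged sketch of that selection; the instruction encoding, the -1 sentinel, and the 2/4/2 unit counts are illustrative assumptions, not the simulator's actual data structures.

    /* Sketch: decide how many of the 8 aligned instructions issue this
     * cycle.  Issue stops at the first instruction that depends on an
     * earlier one in the group or that cannot get an execution unit.  */
    #include <stdbool.h>

    typedef enum { UNIT_BRANCH, UNIT_FUNC, UNIT_LOADSTORE } unit_t;

    typedef struct {
        int    dest;         /* destination register, -1 if none        */
        int    src1, src2;   /* source registers, -1 if unused          */
        unit_t unit;         /* execution unit class required           */
    } slot_t;

    /* Returns the number of leading instructions selected for issue.  */
    int select_issue_group(const slot_t line[8], int count)
    {
        int free_units[3] = { 2, 4, 2 };   /* 2 branch, 4 func, 2 L/S  */
        int n;

        for (n = 0; n < count; n++) {
            /* Check dependencies against all earlier selected insts.  */
            for (int i = 0; i < n; i++) {
                bool raw = (line[i].dest != -1) &&
                           (line[i].dest == line[n].src1 ||
                            line[i].dest == line[n].src2);
                bool waw = (line[i].dest != -1) &&
                           (line[i].dest == line[n].dest);
                if (raw || waw)
                    return n;         /* dependent: stop selecting here */
            }
            /* Check that a unit of the right class is still free.      */
            if (free_units[line[n].unit] == 0)
                return n;
            free_units[line[n].unit]--;
        }
        return n;
    }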
Figure 2.20 presents the performance of the simple 8-wide in-order-issue superscalar machine along with the performance of a single-issue pipelined machine. The performance is reported as average instructions executed per cycle (IPC) versus instruction cache size, as reported by the simulator. The average IPC plotted on the y-axis is an unweighted arithmetic average of the IPC for each benchmark. The x-axis is the instruction cache size in bytes used to store the instruction data and does not include any of the cache overhead such as tag storage. The infinite cache size point was simulated by increasing the cache size to be the same as the main memory size. The same benchmark code is executed by the single-issue machine and the simple 8-wide superscalar machine. The single-issue machine has an average IPC rate of 0.77 at the 4 Kbyte instruction cache size and 0.91 at the infinite cache size. The simple in-order-issue machine has an IPC rate of 1.06 at the 4K size and 1.26 at the infinite size. The single-issue machine does not have any branch prediction, but does use a delayed branch. For these performance numbers the delayed branch slot is filled 70% of the time. The simple 8-wide superscalar machine always predicts in-line execution. This simple approach to implementing a superscalar machine achieves an average speedup of 1.38 over the single issue machine. The peak speedup is 8 times, much larger than the 1.38 average realized speedup. The maximum rate of 8 instructions per cycle occurs only when there are no dependencies between the instructions in a cache line and the instruction type distribution is 2 branches, 4 functional instructions and 2 load/stores.
Figure 2.20: Instructions Per Cycle for Single-Issue and Simple 8-Wide In-Order-Issue Machines (instructions per cycle versus cache size in bytes for instructions, from 4K through 256K to infinite, for the single-issue and simple machines; the simple machine reaches 1.26 IPC at the infinite cache size)
Clearly this distribution cannot be met for every cycle of program execution. Still, it is desirable to improve the average speedup to be closer to the peak of 8 instead of the observed 1.38. The following sections and chapters discuss how an expanded instruction cache can be used to improve the speedup.
2.8 Reorder Buffer Model Performance

This section discusses the performance of the Reorder Buffer superscalar machine model presented in section 2.3. This model is used as the reference model for out-of-order issue machines. While there are many possible variations on the configuration of the Reorder Buffer superscalar machine, only one configuration for out-of-order superscalar machines is studied by this work. For variations on this model see the work of Johnson [Joh91]. Figure 2.21 presents the performance of the reorder buffer model, along with the single issue machine and the simple line cache superscalar machine described in the previous section. A bar graph is used for the in-order execution machines and a line graph is used for the out-of-order execution machines. Performance is again measured in instructions executed per cycle and is a function of instruction cache size. At large cache sizes the Reorder Buffer model achieves an average execution rate of 1.6 instructions per cycle. This is a 28% improvement over the simple line cache superscalar machine. The out-of-order issue mechanism is able to find additional instruction level parallelism and achieves a faster execution rate. This execution rate is lower than one would hope for, as the peak rate is 8 instructions per cycle. Examining the simulation results presented in table 2.4 shows the main reason for the low performance of this Reorder Buffer machine. This table reports the percentage of cycles the Reorder Buffer machine's decoder was limited by an empty instruction buffer. The instruction buffer is the register at the output of the instruction cache that holds the 8 instructions fetched during the previous cycle. The other reasons the decoder becomes limited are insufficient execution units or insufficient space in the reorder buffer. On average, 56% of the time the decoder is limited by insufficient instructions from the instruction cache.
Figure 2.21: Instructions Per Cycle for Reorder Buffer Model Superscalar Machine (instructions per cycle versus cache size in bytes for instructions, from 4K through 256K to infinite, for the single-issue, simple, and Reorder Buffer machines; the Reorder Buffer model reaches 1.62 IPC at the infinite cache size)

    Benchmark   Inst. buffer emptied cycles
    compress              59.2%
    espresso              56.6%
    fft                   48.9%
    gcc1                  57.3%
    spice3                58.8%
    tex                   57.1%
    Average               56.3%

    Table 2.4: Percent of Cycles when Decoding Emptied the Instruction Buffer
Insufficient instruction cache bandwidth is the main performance limiter in this machine. Instruction bandwidth is limited by the frequent branching found in the benchmark code. A branch causes the instructions in the cache line after the branch to be flushed before they are delivered to the decoder. A branch also usually limits the number of instructions loaded into the instruction buffer on the next cycle, as it probably does not branch to the start of a cache line. This limited instruction cache bandwidth can be addressed by expansion caches. The following chapters discuss each problem in detail and report the effectiveness of each feature.
2.9 Summary

This chapter describes two expanded instruction machine models and two reference machines. The first machine is Epic, an expanded instruction cache machine using in-order instruction issue. The second expanded instruction cache machine is Expro, a machine using both an expanded instruction cache and out-of-order instruction issue. The first reference machine is a simple in-order issue superscalar machine using the cache line as the multiple instruction fetch mechanism. The second reference machine is a reorder buffer machine using a traditional instruction cache and out-of-order instruction issue. Also covered by this chapter are the tools and methodology used to evaluate the performance of these models. In the Epic model, the expanded instruction cache structure reduces the complexity of the hardware required to implement a superscalar machine. The expanded instruction cache is able to improve decoder efficiency in the three areas of cache alignment, branch prediction and instruction run merging. The expanded instruction cache is also able to align instructions with the required execution units without requiring a time consuming routing network. In the Reorder Buffer model a traditional cache is used. This machine is organized to take advantage of almost all the instruction-level parallelism within the limits of its instruction window. As soon as an instruction has its operands available it is issued out of a reservation station, so no execution unit is ever idle while there are
instructions ready to execute. However, the complexity of the hardware required to implement this machine is larger than that of the Epic machine. The Expro model combines the best features of both the expanded instruction cache and the reorder buffer machine. This model capitalizes on the improved instruction cache bandwidth offered by the expansion cache. It also has the ability to find instruction-level parallelism within the more effective instruction fetch window. However, it does bear both the costs of a larger instruction cache and more complex hardware. The following several chapters analyze the various issues and effects on performance presented by the features of instruction expansion machines. Chapter 3 covers instruction scheduling considerations and presents the basic block list scheduling algorithm used to schedule code for these machines. Chapter 4 covers the instruction cache alignment issues and shows the costs and benefits of various alignment techniques. Chapter 5 investigates the issues associated with either positioning the instructions in the cache line to directly control the execution units or routing them through a routing network. Chapter 6 analyzes branch prediction and the effects it has on performance.
Chapter 3

Scheduling

Code scheduling issues arise when the target machines are superscalar processors. For example, Epic's simple hardware places increased demands upon the compiler in order to achieve the high performance offered by the parallel execution units. This chapter presents the scheduling methods employed by the code generator to achieve high performance.
3.1 Code Scheduling and Interlocks

The design objective of the Epic machine is the simplest possible concurrent issue hardware in the instruction issue unit. Epic's architecture is designed to use only single instruction look ahead in the instruction stream for finding additional instructions for parallel execution. When the expansion process finds an instruction that is dependent upon an instruction already entered into the expanded instruction register, the expansion process does not attempt to search forward beyond the dependent instruction for any additional instructions to pack into the expansion register. This premise requires the compiler to create a code schedule exposing instruction level parallelism with only one instruction of look ahead. Another area spawning hardware complexity is the instruction interlock mechanism. Ideally, there would not be any interlock hardware at all, as is the case for some VLIW machines [Fis83]. However, since the code compiled for Epic is not cognizant
of the exact packing and cache misses occurring during execution, it is not possible to schedule code that is free of multi-cycle data hazards. An interlock mechanism is required. A simple register scoreboard interlock mechanism is used in the Epic machine. A busy bit for each register in the machine causes a stall if the register is accessed while the bit is set. This resolution method of stalling all instructions within the expanded instruction ensures in-order issue is maintained. This in-order issue policy greatly simplifies the hardware required to implement the parallel execution units. However, this stalling does have implications for code scheduling. These implications are discussed in section 3.2.2. Constructing an optimal schedule under the constraints of limited resources is an NP-complete problem [Gro83]. Practical methods of scheduling therefore require heuristic algorithms. An effective heuristic is assigning each instruction within a basic block a priority value indicating its relative importance during scheduling. Smotherman et al. [SKAH91] describe 26 different priority functions. Davidson et al. [DLSM81] compared a number of heuristic scheduling algorithms, and they recommend list scheduling as the best compromise. List scheduling is used to generate the code for this study, and the algorithm used is detailed in the next section.
3.2 List Scheduling Algorithm

The scheduling algorithm schedules instructions within a basic block. It uses a data structure called a Directed Acyclic Graph, or a DAG. An overview of the three pass scheduling within a basic block follows:

1. Build the DAG and label each instruction with its depth in cycles from the root.

2. Set priorities. Set the priority of each instruction as the depth of its deepest child.

3. Schedule using list scheduling. Pick the highest priority instruction from the ready list and schedule it.
3.2.1 List scheduling

The first pass of the scheduling algorithm starts at the top of each basic block and builds a DAG for the basic block. Each instruction is a node in the graph. A hash table keeps track of the last instruction to read or write each register and each labeled memory address. For each node, the last instruction to write an operand read by the current node becomes its parent. The last instruction to read or write an operand written by the current node also becomes a parent. If the reference is before the start of the basic block, then the dependent instruction is in a previous basic block, so no edge is entered into the DAG for this reference. The DAG nodes are built using double links. Each node has down pointers to all children and up pointers to the parents. All sibling nodes are in a circular list allowing efficient access. While the DAG is being built, each node is labeled with the minimum number of cycles needed to reach this node from the start of the basic block. This is simply the maximum number of cycles needed to reach any parent node plus the required execution cycles for the parent. This field is called the depth. Building the DAG and labeling the nodes is the first, top down, pass over the instructions in the basic block. After the first pass for building the DAG, the second, bottom up, priority labeling pass begins. This pass starts at each node and labels every node on all paths to the root with the maximum number of cycles required to reach the deepest leaf. The longest path is found by looking at the depth field of each parent and selecting the largest. Each node on this path is labeled with the depth of the leaf node when its current label is less than the depth of the leaf node. This label is called the priority label. A queue of not yet visited parents is maintained while searching for paths to the root. The worst case running time for this priority setting algorithm is O(n²), where n is the number of nodes in the graph. However, the typical time is much less. When a node is found already labeled with a number larger than the starting label, the traversal can be stopped. This reduces the running time when there are many dependencies in the scheduled code, which is the typical case. After all nodes in the DAG have been visited, the priority labeling second pass is complete.
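A compact sketch of these first two passes follows. It is a simplified illustration of the depth and priority labeling described above, under the assumption of a hypothetical node structure; it is not the dissertation's actual code generator and uses recursion rather than the parent queue mentioned in the text.

    /* Sketch of passes 1 and 2: label each node with its depth from the
     * block start, then propagate the depth of each node up to its
     * ancestors as a priority.  The node layout is assumed.            */
    typedef struct node {
        struct node **parents;   /* instructions this one depends on    */
        int n_parents;
        int latency;             /* execution cycles of this instruction */
        int depth;               /* earliest cycle this node can execute */
        int priority;            /* depth of its deepest descendant      */
    } node_t;

    /* Pass 1, called in program order:
     * depth = max over parents of (parent depth + parent latency).     */
    void label_depth(node_t *n)
    {
        n->depth = 1;
        for (int i = 0; i < n->n_parents; i++) {
            int d = n->parents[i]->depth + n->parents[i]->latency;
            if (d > n->depth)
                n->depth = d;
        }
    }

    /* Pass 2, invoked as propagate_priority(n, n->depth) for every node:
     * push n's depth toward the roots, stopping early when an ancestor
     * already carries a larger priority.                                */
    void propagate_priority(node_t *n, int depth)
    {
        if (depth <= n->priority)
            return;              /* ancestor already on a longer path    */
        n->priority = depth;
        for (int i = 0; i < n->n_parents; i++)
            propagate_priority(n->parents[i], depth);
    }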
Scheduling is the third pass over the instructions in the basic block. This pass uses list scheduling. The algorithm starts by placing root nodes on a ready list. The scheduling heuristic is to select the node on the ready list with the highest priority for which there is an available execution unit. The node with the highest priority in the ready list is scheduled and removed. Each child of this scheduled node is visited. The child's parent count is decremented, and if the count goes to zero, the child is put on the waiting list, with a delay equal to the parent's execution time. When no instructions can be scheduled from the ready list, either because it is empty or because there are no available units, it is time to move nodes from the waiting list to the ready list. The execution unit counters are reset, and the instructions on the waiting list that are ready this cycle are put on the ready list. Other instructions on the waiting list that will be ready in later cycles stay on the waiting list. There is a special case when the last instruction in a basic block is a branch instruction. The scheduling algorithm must ensure this branch instruction remains the last instruction in the basic block. If it were moved, then the machine might not execute the instructions scheduled after it. The expander uses the branch instruction for determining control flow because it does not have access to true basic block information. When the last instruction within a basic block is not a branch instruction, the scheduler is free to reposition this instruction to any position resulting in a good schedule.
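The third pass can be sketched as the loop below. It only illustrates the ready list and waiting list discipline described above; the node record, the list helpers, and the unit accounting are assumptions, and details such as the specially tagged final branch are omitted.

    /* Sketch of the list scheduling pass.  The node record and the
     * ready/waiting list helpers are hypothetical.                    */
    typedef struct sched_node {
        struct sched_node **children;
        int n_children;
        int unscheduled_parents;   /* parents not yet emitted           */
        int latency;               /* execution cycles                  */
        int priority;              /* from the priority labeling pass   */
    } sched_node_t;

    extern int           ready_list_empty(void);
    extern int           waiting_list_empty(void);
    extern sched_node_t *ready_pick_highest_priority_with_free_unit(void);
    extern void          ready_list_init(sched_node_t *roots[], int n);
    extern void          waiting_list_init(void);
    extern void          waiting_list_add(sched_node_t *n, int ready_cycle);
    extern void          waiting_list_release_ready(int cycle);
    extern void          reserve_unit(sched_node_t *n);
    extern void          reset_unit_counters(void);
    extern void          emit(sched_node_t *n, int cycle);

    void list_schedule(sched_node_t *roots[], int n_roots)
    {
        ready_list_init(roots, n_roots);    /* roots are ready at cycle 1 */
        waiting_list_init();
        int cycle = 1;

        while (!ready_list_empty() || !waiting_list_empty()) {
            sched_node_t *n = ready_pick_highest_priority_with_free_unit();
            if (n != NULL) {
                emit(n, cycle);             /* schedule in this cycle     */
                reserve_unit(n);
                for (int i = 0; i < n->n_children; i++) {
                    sched_node_t *c = n->children[i];
                    if (--c->unscheduled_parents == 0)
                        waiting_list_add(c, cycle + n->latency);
                }
            } else {
                cycle++;                    /* nothing can issue: advance */
                reset_unit_counters();
                waiting_list_release_ready(cycle);
            }
        }
    }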
3.2.2 Scheduling Loads and Align Hints

Epic's in-order issue expanded instructions and interlock mechanism require special alignment information to achieve high performance. This section describes the problem and presents a solution of emitting additional instructions that convey information to the expander. There is one busy bit for each machine register in Epic's scoreboard. Decoding a multiple cycle instruction sets the scoreboard busy bit corresponding to the destination register. This busy bit is cleared at the end of the cycle just before the instruction's result is ready. For the one cycle load delay instructions used in Epic, this bit is cleared the next cycle.
The decoder inspects the scoreboard busy bits corresponding to all source operands for every instruction within the expanded instruction. If any bit is set, then not all the instruction's source operands are available. A signal is generated stalling all components of the expanded instruction being decoded. The next cycle the scoreboard bits are read again, and if they are all clear then all instructions in the expanded instruction register are issued in parallel. Stalling all individual instructions within an expanded instruction impacts the code scheduling for Epic. There are cases when it is desirable not to pack an independent instruction into the current expanded instruction. If such an instruction is dependent on a multiple cycle load that was started in the previous cycle, then all instructions in the expanded instruction are stalled, not just the one with the load dependency. It is better to place the load dependent instruction into the next line, thereby allowing the other independent instructions to execute this cycle. An example of this effect is included in the scheduling example of the next section and in figure 3.1. In the figure, if the instruction at node "8:" were allowed to be packed into the second expanded instruction, then execution of this second expanded instruction would be delayed until cycle 3 because of the dependency on the 2 cycle load in node "1:". Forcing the dependent node "8:" into the third expanded instruction allows the second expanded instruction to execute without a delay. To accommodate this form of scheduling an align instruction is used. The align instruction informs the expansion process to stop filling the current expanded instruction. The expanded instruction will then be executed. When needed, the expansion process restarts after the align instruction and fills the next line. The align instruction is information used only by the expander and is never entered into the expanded instruction cache. During the third pass of code scheduling, the code emitter maintains the cycle count relative to the start of the basic block for each emitted instruction. If it determines that packing an instruction into the current expanded instruction would cause an extra stall, then the code generator instead emits an align instruction. Emitting align instructions is done only if the basic block is large enough. The minimum basic block size before align instructions are emitted is called the trigger
size. A lower bound on basic block size is used as a qualifier on emitting align instructions to limit code growth. Also, an align instruction is not always effective for small basic blocks because the compiler may not be accurate when determining cycle boundaries. The compiler assumes the first instruction in each basic block is the first instruction packed into an expanded instruction. This may not be the case because the expansion process does not have basic block information when it is expanding. It may pack several instructions from the previous basic block into the expanded instruction, thereby throwing off the cycle boundaries computed by the compiler. For larger basic blocks the influence of instructions from previous basic blocks quickly diminishes and the align instruction is effective. Simulation shows that emitting align instructions when the basic block size is larger than 16 instructions is a good compromise. Performance is not very sensitive to changes in this size. A basic block size trigger of 16 for align instructions results in 5% to 10% of the expansions being terminated by an align instruction.
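The emitter's align decision during the third scheduling pass can be summarized by a sketch like the following. The 16 instruction trigger comes from the text; the cycle bookkeeping, parameter names, and return convention are assumptions.

    /* Sketch: decide whether to emit an align hint before an instruction.
     * The emitter tracks the cycle (relative to the block start) in which
     * the current expanded instruction will issue; packing an instruction
     * whose operands are not ready yet would stall the whole group, so a
     * new line is started instead.                                       */
    #define ALIGN_TRIGGER 16   /* minimum basic block size for align hints */

    int should_emit_align(int block_size,          /* instructions in block */
                          int current_cycle,       /* cycle being packed    */
                          int operand_ready_cycle) /* when this instruction's
                                                      sources are available */
    {
        if (block_size <= ALIGN_TRIGGER)
            return 0;   /* small blocks: cycle boundaries are too uncertain */

        return operand_ready_cycle > current_cycle;
    }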
3.2.3 Scheduling Example

Figure 3.1 shows an example of scheduled code for an Epic machine. The number of execution units for this machine is reduced to 1 branch unit, 1 load/store unit and 3 functional units to allow the example to fit on one page. On the top of the figure is the DAG created by the compiler. The bottom of the figure shows how the expander will build the instructions into the expanded instruction cache. The original order of the instructions within the basic block is unimportant and is not shown. There are three lines of information shown in each node of the DAG. The top number followed by a colon is the address of the instruction after scheduling is complete. This address is used as the name of the node during this discussion. The center line is the opcode for the instruction. Only the opcode's type is shown in the figure. The two numbers separated by a slash on the bottom line are the depth and priority labeling of the node. The first scheduling pass builds the DAG and labels each node with its depth in cycles from the root. For example, the depth of each node on the left branch is 1, 3, and 4. The load instruction at node "1:" can execute at cycle 1.
Figure 3.1: Epic Instruction Scheduling Example (top: the compiler's DAG, each node labeled with its address, opcode type, and depth/priority, for example 1: Lod 1/4 and 11: Brn 3/last; bottom: the resulting expanded instruction cache lines with Tag, Ld/St, Op, Brn, and Next fields, including the 7:align hint and the x:spec, y:spec, and z:spec speculative slots)
The instruction at "8:" cannot execute until cycle 3 because it is dependent upon the 2 cycle load instruction. The node at "10:" can execute at cycle 4, one cycle after the node above it. The second pass sets the node priorities using a bottom up algorithm. The algorithm starts at each node and traverses the DAG to the roots. An example is visiting the leftmost leaf node "10:". When node "10:" is selected for priority labeling it does not have any children, so it is given a priority the same as its depth in cycles, which is 4 in this case. This priority of 4 is then propagated up to the root along every node in the branch. When node "9:" is selected it is assigned a priority of 3 because this is the earliest cycle it can execute. As soon as the algorithm starts to propagate this priority up the branch to the root it can stop processing this node, as the first parent node already has a higher priority of 4 assigned to it by the leftmost branch. Because this basic block ends with a branch instruction, node "11:" is given the special priority last. This node must remain the last instruction in the schedule. After the second pass of assigning priorities, the third pass for emitting instructions begins. The three nodes "1:", "3:", and "2:" are put on the ready list. Node "1:" has the highest priority so it is emitted first. Node "2:" is second in priority and is emitted second. Node "3:" is third. After node "3:" is emitted there are no more instructions on the ready list, so the compiler knows to advance to the next cycle. When the cycle changes, nodes "4:", "5:", and "6:" are moved from the waiting list to the ready list. Instructions "8:" and "9:" are not moved because they are not ready until the following cycle. The scheduler now schedules instruction "4:" first because it has the highest priority of 3. Instructions "5:" and "6:" are then scheduled. Now the scheduler determines the need for an align instruction, assuming the align trigger value is small enough to qualify this basic block for align instructions. If instruction "8:" were emitted next, the expander would detect it as independent of instructions "4:", "5:", and "6:" and pack "8:" into the expanded instruction. However, if this happened, the instructions able to execute during cycle two would be delayed until cycle three. To avoid this the compiler emits an align instruction at address "7:". This scheduling of instructions on the ready list and moving instructions from
the waiting list to the ready list at cycle boundaries continues until all nodes have been processed. The specially tagged branch is emitted last. On the bottom of the figure, the instructions labeled x:spec are instructions the expander packed into this expanded instruction cache line for speculative execution.
3.2.4 Register Allocation

The in-order issue Epic machine does not provide register renaming, so this is another issue the compiler must address. Because the Epic machine has a shorter pipeline than the Reorder Buffer machine, the need for register renaming is reduced. With proper register allocation, the performance degradation caused by excessive register reuse can be controlled so that it is not significant. Most register allocation is performed by the machine independent front end of the compiler, the DEC Ultrix C compiler used to produce the Ucode. The only register allocation remaining at code generation time is the allocation of temporary registers for expression evaluation. Spreading the time between reuses of a temporary register is the method used by the compiler for reducing false sharing. Ten registers are allocated as temporary registers. When these registers are not in use they are maintained on a free list. The list is maintained as a FIFO queue, thereby spreading the distance between reuses of a temporary. This is in contrast to the easier to implement LIFO stack often used to maintain a free list. Using LIFO allocation for temporary registers is likely to cause false sharing because as soon as a register is returned to the free list it is plucked off again for the next expression evaluation. With FIFO allocation reuse is delayed because all other temporary registers are reallocated before a deallocated register is reused. The compiler uses a simple approach of register allocation followed by code scheduling. There are cases where some false sharing could be avoided if the register allocation and the scheduling were combined. However, examining the code emitted for the benchmarks shows these cases are rare, so the simpler approach of independent register allocation and code scheduling is used.
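The FIFO free list for the ten temporary registers can be sketched as a small circular queue; the register numbers and function names below are assumptions for illustration. Allocating from the head and freeing to the tail maximizes the distance before a register number is reused, which is exactly the false-sharing argument made above (a LIFO stack would hand back the most recently freed register immediately).

    /* Sketch: FIFO free list for temporary registers. */
    #define N_TEMPS 10

    static int fifo[N_TEMPS];         /* circular buffer of free registers */
    static int head, tail, count;

    void temp_free_list_init(void)
    {
        for (int i = 0; i < N_TEMPS; i++)
            fifo[i] = i;              /* hypothetical register numbers 0..9 */
        head = 0;
        tail = 0;
        count = N_TEMPS;
    }

    int temp_alloc(void)              /* returns a register, or -1 if none  */
    {
        if (count == 0)
            return -1;
        int r = fifo[head];
        head = (head + 1) % N_TEMPS;
        count--;
        return r;
    }

    void temp_release(int r)          /* freed registers go to the tail     */
    {
        fifo[tail] = r;
        tail = (tail + 1) % N_TEMPS;
        count++;
    }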
3.3 Performance Effects of Scheduling

This section discusses the effectiveness of the scheduling algorithm for both the in-order issue Epic machine and the out-of-order issue Reorder Buffer machine. It begins with an example from the FFT benchmark and concludes with the simulated performance of the machines with and without instruction scheduling.

    Total instructions in loop body                   43

    Schedule                                          cycles
    Infinite resources data flow                      17
    Breadth first schedule on Epic                    23
    Epic scheduled without align instructions         21
    Epic scheduled with align instructions            19
    Reorder Buffer machine total latency              23
    Reorder Buffer machine average loop time          15

    Figure 3.2: FFT Inner Loop Scheduling Example

Figure 3.2 shows statistics for one of the inner loops of the FFT benchmark. In this loop there are a total of 43 instructions to be scheduled. If infinite machine resources are available, the data dependences limit the execution time for one pass through the loop to 17 cycles. The next line in the figure is called a breadth first schedule. This is the schedule resulting when the algorithm described in the last section is not used. Instead of full list scheduling, the DAG is built and all root nodes are scheduled at the beginning of the basic block. Next all level 2 instructions are scheduled, and so on. The idea is to present the expander with as many independent instructions as possible. Unfortunately this schedule does not work very well. It requires 23 cycles while the goal is 17 cycles. The problem is that some shorter independent DAG branches become scheduled at the beginning of the basic block and interfere with the longest paths. There is time to schedule these independent DAG branches later in the loop, but if they are scheduled at the start of the loop they increase the total cycle count.
The next line in figure 3.2 shows the results of using the priority driven list scheduling algorithm without emitting any align instructions. This scheduling achieves a 21 cycle execution time for the loop. Emitting the align instructions shortens the loop time by two cycles, to 19 cycles total. This is close to the optimal 17 cycles for the case of infinite resources. Another load unit would be needed to achieve the 17 cycle time. The 19 cycle time appears to be the best possible schedule under the limited hardware constraints. The next two lines of the figure describe the behavior of the same loop on the Reorder Buffer machine. The same scheduling is used for the Epic machine and the Reorder Buffer machine. On the Reorder Buffer machine, one pass of the loop requires 23 cycles to execute. However, its out-of-order issue and register renaming allow the Reorder Buffer machine to overlap one execution of the loop with the next execution. After startup overhead, there is enough overlap between two passes of the loop to reduce the average time to just 15 cycles. The list scheduling algorithm performs well under the microscopic view of one loop inside one benchmark. A macroscopic view is presented in figure 3.3. It shows the average performance improvement scheduling achieves for each benchmark on the Epic machine. The configuration of the Epic machine presented in this figure uses an infinite instruction cache and the full cache alignment, multiple cycle packing, branch prediction and speculative execution features described in the following chapters. For the Epic machine the average instructions per cycle without scheduling is 1.19 and with scheduling is 1.69. This is an improvement of 42%. One striking feature is the variation in the performance changes for the various benchmarks. The improvement achieved by the scheduling ranges from a low of 24% for Compress to a high of 98% for FFT. The reason FFT is different is that it has a larger average basic block size and more inherent instruction level parallelism. The larger basic block size presents the scheduling algorithm with more opportunities for effective scheduling. Without scheduling, the parallelism cannot be discovered by Epic's single instruction look ahead during expansion. For FFT, parallelism is available but is not discovered without effective code scheduling. The other benchmarks do not achieve as drastic an improvement as FFT.
Figure 3.3: Epic Performance With and Without Scheduling (bar chart of instructions per cycle for compress, espresso, fft, gcc1, spice3, and tex; Full Epic without scheduling, avg 1.19, versus Full Epic with scheduling, avg 1.69)
Figure 3.4: Reorder Buffer Machine Performance With and Without Scheduling (bar chart of instructions per cycle for compress, espresso, fft, gcc1, spice3, and tex; Reorder Buffer machine without scheduling, avg 1.71, versus with scheduling, avg 1.83)

Their performance improvement ranges from 24% for Compress to 33% for TeX. This is still a respectable improvement, showing that code scheduling is needed to allow the instruction expander to discover the instruction level parallelism available in these benchmarks. Figure 3.4 shows the performance improvement code scheduling accomplishes on a Reorder Buffer machine. The configuration for this machine is an infinite instruction cache and a 16 entry reorder buffer. Code scheduling improves the performance of the Reorder Buffer machine, but not nearly as much as it improves the performance of the Epic machine. Code scheduling increases performance from an average of 1.7 instructions per cycle to an average of 1.8 instructions per cycle, an improvement of 7%.
The improvement in individual benchmarks ranges from 1.5% for TeX to 19% for FFT. Scheduling does not have nearly as dramatic an effect on performance for the out-of-order issue Reorder Buffer machine as it does for the Epic machine because of the Reorder Buffer machine's more aggressive look ahead in the instruction stream. Two independent statements in the original program are likely to each generate several words of code. Without scheduling, the Epic machine does not discover the parallelism between the two statements because they are separated by more than one instruction. The Reorder Buffer machine is able to find the parallelism because it is able to buffer up to 16 instructions and is likely to discover the independent instructions. Scheduling with out-of-order machines is able to achieve performance improvements because the compiler is able to search over many more instructions than is possible in the hardware. When a basic block is large, the compiler is still able to examine the complete basic block for available instruction level parallelism, whereas the hardware is limited to its small window of 16 to 32 instructions. Because the available instruction level parallelism is often spread over large areas, code scheduling is important for both in-order issue and out-of-order issue machines.
3.4 Summary

Compilers perform static instruction scheduling with the goal of minimizing program execution time. The instruction scheduler attempts to overlap independent instructions to prevent resource conflict stalls and data dependence stalls. For parallel in-order issue machines, such as Epic, the effectiveness of the schedule is especially important for exploiting the available instruction level parallelism. Scheduling is also important for out-of-order issue machines because a compiler can search for available instruction level parallelism over a much larger area than is possible with the fetch and decode hardware. The in-order issue Epic requires very careful scheduling because of the large effects a single dependence may cause. If just one source operand in an expanded instruction is not available because of a load latency, then all the operations within the expanded
instruction are delayed until this operand is available. This leads to the need for the three pass priority based list scheduling algorithm described in section 3.2. The intolerant nature of in-order issue expanded instructions also creates the requirement for an align instruction to control exactly which instructions are packed into each expanded instruction. While the basic block list scheduling used here achieves sizable gains in performance, much more sophisticated algorithms are possible. As discussed in section 3.2.4, the compiler used here has independent register allocation and scheduling passes. Bradlee et al. [BEH91] present methods of integrating register allocation and instruction scheduling. Expanding the scope of scheduling beyond just basic blocks is also feasible, as presented by Bernstein and Rodeh [BR91]. When the architecture is enhanced to support additional methods of exploiting instruction level parallelism, such as those described by Michael Smith [Smi92], then even more complicated scheduling becomes necessary to reach the highest level of performance. This chapter addresses the code scheduling issues arising when the target machines are superscalar processors using expanded instruction caches. It reports results for the case of all instruction expansion features combined to create the 8-wide Epic machine. The effectiveness of individual instruction expansion features is somewhat buried by presenting just overall performance results. To fully understand the individual features, the next and following chapters go into detail on the costs and efficiency of the cache alignment, multiple cycle packing, branch prediction and speculative execution features.
Chapter 4

Cache Alignment

This chapter describes using an expanded instruction cache to improve the effectiveness of the instruction cache's limited bandwidth at its output port. It also discusses the benefits to processor performance of this mechanism and its cost in terms of decreased utilization of the cache's storage area.
4.1 Aligning to Decoder Slots

Instruction fetch efficiency is improved if the instructions supplied by the instruction cache are aligned to the slots of the decoder. This aligning results in fewer wasted decoder slots. Figure 4.1 demonstrates the effect of instruction alignment on fetch efficiency. In this example the instructions labeled S1 through S6 can be executed in parallel.
Figure 4.1: Decoder Slot Alignment Inefficiency (the parallel packet S1 through S6 starts partway into a cache line pointed to by the PC, so it spills across a cache line boundary)
Figure 4.2: Alignment in an Expanded Instruction Cache (the same packet S1 through S6 repositioned to start at the beginning of a cache line)

Without alignment, this packet of instructions can start at any location within the cache line. However, if the start of the packet is not in the first two locations within the cache line, then two cycles are required to fetch these six instructions. An expanded instruction cache can avoid the inefficiency caused by packets of instructions falling across cache line boundaries by positioning each packet at the start of a cache line. This is shown in figure 4.2. This figure shows the same six instructions as figure 4.1, all of which can execute in parallel, but now the instructions are positioned at the start of the cache line. All six instructions are fetched in parallel and can be issued in just one cycle. The expanded instruction cache achieves the desired alignment by employing extra bits in the cache tags and duplicating instructions within the cache. The extra tag bits allow the expander to align any instruction to the first decoder slot. The following instructions are packed into the following slots within the line, thereby achieving the desired alignment. The number of extra tag bits required depends on the line size and associativity of the cache. Assume there are 2^n cache lines, so n bits are needed to index the cache data array. Figure 4.3 shows the cache indexing for n = 10, a 29 bit program counter, and an 8 instruction wide direct mapped cache. With the least significant bit of the PC numbered 0, a conventional direct mapped cache uses PC bits n+2 to 3 to index the data array, while bits 2 to 0 are used to control the shifter that delivers the requisite instructions to the decoder slots. These shifter control bits are labeled offset in the figure. Bits 28 to 13 are used as the tag, which is 16 bits wide.
Figure 4.3: PC Bit Allocation for Direct Mapped Cache (16 bit cache tag, 10 bit index, 3 bit offset)

Figure 4.4: PC Bit Allocation for an Expanded Mapped Cache (19 bit cache tag, 10 bit index)
In conventional caches, the index field of the PC is shifted by 3 bits because only one address in 8 is allowed to be loaded into the first instruction slot. In an expanded cache any instruction address can be loaded into the first slot, thus the index field cannot be shifted by 3 bits. As shown by figure 4.4, an expanded instruction cache data array would be indexed by PC bits n-1 to 0. The instruction addressed by the PC is aligned to the first decoder slot and the following 7 instructions are aligned to the following 7 decoder slots. The upper 19 bits of the PC are used as the tag. Thus an 8 instruction wide direct mapped expanded instruction cache requires adding log2(8), which is 3, additional bits in the tag field of each cache line. Instruction alignment improves the amount of parallelism available to the instruction issue logic. To quantify the extent to which this improvement in parallelism translates into improved processor performance requires a reference model and a simulation of both models. The reference model was discussed in section 2.7 and the next section presents the performance comparison.
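The two address decompositions above can be illustrated with a few lines of bit manipulation. The 29 bit PC, n = 10, and 8-instruction lines follow the text; the function names, the example address, and the exact field extraction are illustrative assumptions.

    /* Sketch: extracting tag/index/offset from a 29-bit instruction
     * address.  Conventional direct mapped cache: 16-bit tag, 10-bit
     * index, 3-bit offset.  Expanded cache: any address may start a
     * line, so the low 10 bits index the data array and the upper 19
     * bits form the tag (3 more tag bits than the conventional cache). */
    #include <stdio.h>

    typedef struct { unsigned tag, index, offset; } fields_t;

    fields_t conventional_fields(unsigned pc)   /* 29-bit word address */
    {
        fields_t f;
        f.offset = pc & 0x7;            /* bits 2..0: slot in the line */
        f.index  = (pc >> 3) & 0x3FF;   /* bits 12..3: 2^10 lines      */
        f.tag    = pc >> 13;            /* bits 28..13: 16 tag bits    */
        return f;
    }

    fields_t expanded_fields(unsigned pc)
    {
        fields_t f;
        f.offset = 0;                   /* PC's instruction always goes */
        f.index  = pc & 0x3FF;          /* into the first decoder slot  */
        f.tag    = pc >> 10;            /* bits 28..10: 19 tag bits     */
        return f;
    }

    int main(void)
    {
        unsigned pc = 0x123456;         /* arbitrary example address    */
        fields_t c = conventional_fields(pc), e = expanded_fields(pc);
        printf("conventional: tag=%x index=%x offset=%x\n",
               c.tag, c.index, c.offset);
        printf("expanded:     tag=%x index=%x\n", e.tag, e.index);
        return 0;
    }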
Figure 4.5: Aligned Expanded Instruction Cache Performance (instructions per cycle versus cache size in bytes for instructions, from 4K through 256K to infinite, for the simple and aligned machines; the aligned machine reaches 1.38 IPC at the infinite cache size)
4.2 In-Order Issue Aligned Expanded Cache Performance

Figure 4.5 presents the performance of an expanded instruction cache machine performing only instruction packet alignment. This figure also plots for comparison the performance of the simple superscalar reference machine described in section 2.7. The performance is measured in instructions executed per cycle and is an unweighted average of the six benchmarks. With an infinite expanded instruction cache, the average performance of the 8-wide machine is 1.38 instructions per cycle. The reference machine achieves only 1.26 instructions per cycle. This is an improvement of 10% at the infinite cache size. At
the 4 Kbyte cache size the aligning expanded instruction cache machine achieves a performance of only 0.80 instructions per cycle, while the reference machine achieves a rate of 1.06 instructions per cycle. That is, the more complex expanded instruction cache machine has only 75% of the performance of the reference machine with a 4 Kbyte instruction cache. This is not a good design tradeoff, and later sections discuss how to compensate for this problem. To understand why the aligning expanded instruction cache machine has lower performance for small cache sizes and higher performance for large cache sizes, one needs to consider how efficiently this organization utilizes the instruction cache. The expanded instruction cache machine also has a longer cache miss service time than the reference machine, but this performance penalty is small compared to the lower utilization of the instruction cache storage area. Two factors contribute to the less efficient use of the instruction cache: instruction duplication and line utilization. Instruction duplication occurs when the same instruction is present in the expanded instruction cache in more than one location. Line utilization is the percentage of the cache line that is filled with instructions from main memory. Data dependencies cause some instruction fields in the expanded instruction to be filled with codes specifying no operation (NOPs).
4.3 Duplication Ratio and Line Utilization

Alignment causes instructions to be duplicated in the expanded cache. Figure 4.6 shows an example of why this occurs. In this example the instructions at addresses 1 through 4 are all independent and can be executed during the same cycle. When the expander starts at address 1 it packs the cache line with all four instructions from address 1 to address 4. After this cache line is executed there will be a cache miss for address 2. The expander now packs the cache line with 3 instructions starting at address 2. The instructions at addresses 2, 3, and 4 are duplicated in the expanded instruction cache. When branch prediction is included in the expansion process, there is no limit besides cache size on how many times an instruction can be duplicated in the cache.
Figure 4.6: Instruction Duplication Example (the sequence 1: op, 2: lod, 3: op, 4: brn 2 expanded into two cache lines: one tagged with address 1 holding 1:op, 2:lod, 3:op, 4:brn 2, and one tagged with address 2 holding 2:lod, 3:op, 4:brn 2)

It is possible for each cache line to contain a branch to a single address. The instruction at this address could be duplicated in each cache line. To have a quantitative measure of the amount of duplication occurring in an expanded instruction cache, a duplication ratio is defined.

Definition (duplication ratio): The ratio of the total number of instructions in the cache to the number of unique instructions in the cache. The average duplication ratio is the average over the run of the program of each execution cycle's duplication ratio.

During simulation it is straightforward to compute the duplication ratio by maintaining a counter with each main memory instruction. The simulation keeps two additional counters: the number of instructions in the cache and the number of unique instructions in the cache. When an instruction is loaded into the instruction cache, the number of instructions is always incremented. If the main memory counter associated with this instruction is zero, then the unique instruction counter is also incremented. The main memory counter associated with this instruction is then incremented. When an instruction is flushed from the cache, the main memory counter tracking the number of copies of that instruction in the cache is decremented. If this count becomes zero, then the number of unique instructions is also decremented.
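A sketch of this bookkeeping follows; the array size and the names of the update hooks are assumptions, but the counter logic mirrors the description above.

    /* Sketch: maintaining the duplication ratio during simulation.
     * copies_in_cache[] holds, for each main memory instruction, how
     * many copies of it currently sit in the expanded instruction cache. */
    #define MAX_STATIC_INSTS (1 << 20)   /* assumed program size limit    */

    static long copies_in_cache[MAX_STATIC_INSTS];
    static long total_in_cache;      /* all instructions in the cache     */
    static long unique_in_cache;     /* distinct main memory instructions */

    void on_cache_fill(unsigned mem_addr)   /* instruction written to cache */
    {
        total_in_cache++;
        if (copies_in_cache[mem_addr]++ == 0)
            unique_in_cache++;              /* first copy of this instruction */
    }

    void on_cache_flush(unsigned mem_addr)  /* instruction evicted from cache */
    {
        total_in_cache--;
        if (--copies_in_cache[mem_addr] == 0)
            unique_in_cache--;              /* last copy just left the cache  */
    }

    double duplication_ratio(void)          /* sampled each cycle and averaged */
    {
        return (double)total_in_cache / (double)unique_in_cache;
    }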
                          Cache Size
Benchmark     4K     16K    64K    256K    inf.
compress     1.07   1.10   1.11   1.11    1.11
espresso     1.07   1.08   1.13   1.18    1.19
fft          1.13   1.14   1.14   1.14    1.14
gcc1         1.05   1.06   1.07   1.08    1.14
spice3       1.06   1.06   1.08   1.11    1.14
tex          1.04   1.07   1.10   1.13    1.17
Average      1.07   1.09   1.10   1.13    1.15

Table 4.1: Average Duplication Ratios for Aligned Expansion Cache
[Figure 4.7: Line Utilization Example. Three statically routed expanded cache lines with slots for the branch, load/store, and function units: the packed instructions are 1:lod, 3:lod, and 2:op in the first line, 4:str and 5:op in the second, and 6:str and 7:brn in the third; the remaining slots hold NOPs.]
Table 4.1 reports the average duplication ratio for the six benchmarks used by this study. Small cache sizes have about 7% of the instructions duplicated and larger cache sizes have about 15% duplicated. The reason the smaller cache sizes have less duplicated code is that sequences of code that are seldom used, maybe even executed only once, get flushed from the small cache but remain present in a large cache. The duplication ratio is relatively small and is not the major reason for the inefficient use of the expanded instruction cache. The major inefficiency is due to the lines being unfilled because of data dependencies or resource limitations. Line utilization is another quantitative measure defined to describe this inefficiency.

Definition (line utilization): The percentage of non-NOP instructions in each executed expanded instruction cache line. The average line utilization is the average over all execution cycles of each cycle's line utilization.
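Line utilization follows the same pattern with a single counter per executed line. A minimal sketch, assuming an 8-slot expanded line and a list-of-slots representation; both are illustrative choices, not the simulator's data structures.

    LINE_WIDTH = 8  # instruction slots per expanded cache line (assumed)

    def average_line_utilization(executed_lines):
        """executed_lines: one entry per execution cycle, each a list of LINE_WIDTH
        slots holding either an instruction or the marker 'NOP'."""
        samples = []
        for line in executed_lines:
            non_nops = sum(1 for slot in line if slot != "NOP")
            samples.append(non_nops / LINE_WIDTH)
        return sum(samples) / len(samples) if samples else 0.0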
Benchmark   Line Utilization
compress         25.87%
espresso         25.93%
fft              35.54%
gcc1             26.66%
spice3           27.29%
tex              28.18%
Average          28.24%

Table 4.2: Average Line Utilization for Aligned Expansion Cache
Figure 4.7 shows an example of how instructions and NOPs pack into expanded cache lines, including the static routing used to eliminate the routing network between the parallel instruction register and the execution units. In the first line there are three independent instructions, two loads and one functional. The fourth instruction is a store, so it can not be packed in the first expanded instruction cache line because only two load/store resources are available. The first line has 3 instructions and 5 NOPs, for a line utilization of 3/8 or 0.375. The second line has the store from address 4 and one additional instruction from address 5. The instruction at address 6 could not be packed into the second line because it has a data dependency with the instruction at address 4. The second line utilization is 2/8 or 0.25. Only 2 instructions are in the third line because this expanded cache machine model terminates filling a line as soon as a branch instruction is packed into the line. Table 4.2 shows the average line utilization for the benchmarks used in this study. The average line utilization does not vary with cache size as the same packets of instructions are executed regardless of the cache size. The table shows that for this aligned expanded instruction cache machine only about 28% of each cache line is filled with executed instructions. The other 72% of each cache line is filled with NOPs. This is a severe inefficiency, and methods to improve line utilization are discussed in chapter 5. When analyzed in another manner, a line utilization of about 25% is expected. Other research has shown that for this type of superscalar machine and these benchmarks, under 2 instructions per cycle is the expected execution rate.
This aligned expanded instruction cache machine delivers 8 instructions per cycle, so one expects only about 25% of these instructions to be used.
4.4 Out-of-Order Aligned Expanded Cache Performance

Figure 4.8 presents the performance of a machine using an expanded instruction cache performing instruction packet alignment with an out-of-order execution engine. The block diagram of this machine is shown in figure 2.15. Also presented on the graph is the performance of the reference out-of-order issue machine, the Reorder Buffer model described in section 2.3. The purpose of investigating this configuration for an expanded instruction cache machine is to understand the effects of cache line alignment in an out-of-order issue machine. The expansion process of this machine fills cache lines with groups of parallel instructions without breaking any group across a cache line. When the expander comes across a cycle boundary in this out-of-order aligned expanded instruction cache machine it does not stop filling the expanded instruction. Instead it keeps track of the cycle's boundary and continues expanding instructions in-line. This in-line filling predicts all branches to be not taken. After the expanded instruction is completely filled it is truncated back to its last cycle boundary. Besides building complete groups of parallel instructions, the truncation also controls the amount of instruction duplication occurring in the expansion cache. This duplication issue is discussed in detail in section 5.1.1. Figure 4.8 shows the Reorder Buffer reference model machine always performs substantially better than the out-of-order aligned expansion cache machine, even though the reference machine has a simpler cache. This is because the Reorder Buffer machine implements branch prediction while the out-of-order aligned expansion cache machine does not. This configuration is not a good design tradeoff. It is presented here for completeness as this organization is the analogous configuration for the in-order issue aligned expansion cache machine of the last section.
[Figure 4.8: Out-of-Order Non-speculative Aligned Expanded Instruction Cache Model and Reorder Buffer Model Performance. Instructions per cycle versus instruction cache size (4K, 16K, 64K, 256K, inf.) for the Reorder Buffer machine and the expanded aligned out-of-order machine; the highest plotted value is 1.62 instructions per cycle.]
This out-of-order aligned expansion cache machine also has poorer performance than the in-order aligned expansion cache machine. This is mainly due to the longer decode time required by the out-of-order issue. The in-order issue machine has a 1 cycle miss predicted branch penalty, while this out-of-order issue machine has a 3 cycle penalty. With all branches predicted in-line the prediction rate is very poor. This results in a large performance penalty for the longer decode time. This out-of-order aligned expanded instruction organization points out the need for branch prediction and speculative execution to achieve high performance. Without speculation there is not enough instruction level parallelism in these benchmarks to execute an average of more than one instruction per cycle. Having the expanded instruction cache deliver multiple instructions for speculative execution is possible and is described in chapter 7.
4.5 Instruction Duplication Issues

Instruction duplication raises another issue when used in a system supporting self modifying code. This issue is the flushing or updating of all cached copies of an instruction when there is a write to its address. Guaranteeing that none of the instructions remain unmodified in the expanded instruction cache after a write is difficult. A similar issue arises when the virtual memory system flushes a page. Guaranteeing that all copies of an instruction are flushed is difficult when the tag structure used by the expanded instruction cache only identifies the first instruction in each packet. One method of addressing this issue is to augment the virtual memory system with a bit per page, called the expand bit. This bit is similar to a dirty bit; it is used to keep additional state about each page to improve the performance of the system. Each time an instruction is loaded into the expanded instruction cache the expand bit for the page from which the instruction was fetched is set. Whenever there is a write to a page with the expand bit set, or if the page is flushed, then the entire expanded cache is flushed. Flushing the entire expanded cache seems drastic, but this is by far the cheapest method that can guarantee all copies of every instruction on the page are updated.
As stated before, in the worst case every line in the expanded cache can contain a copy of one specific instruction. Any scheme keeping a directory or tag for each cached address needs to be able to handle the flushing of the complete expanded instruction cache. Expanded instruction cache flushing on virtual page removal is not required in the special case of an architecture not allowing self modifying code. Since the code is read only, it is not an error for a copy of an instruction to remain in the expanded cache even though the page it came from is no longer in main memory. The value of the instruction remains the same whether it is in memory or on disk. When a new program overlays one that may have instructions in the cache the system implementation must have a method for flushing cache lines having valid tags. The main concern is the instructions that are duplicated in the cache but are not identified by cache tags. However, the only path to reach any of these instructions is by fetching a run of instructions starting at a valid tag. It is impossible to reach a stale copy of an instruction. Thus, for an architecture forbidding self modifying code the duplication of instructions in the expanded instruction cache does not add complexity to the virtual memory system.
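The expand bit scheme above reduces to a small amount of per-page state and one conservative rule. The sketch below assumes hypothetical page identifiers and an expanded cache object with a flush_all() operation; it is an illustration of the policy, not a description of any particular implementation.

    class ExpandBitDirectory:
        """Per-page expand bits guarding an expanded instruction cache."""

        def __init__(self, expanded_cache):
            self.expanded_cache = expanded_cache  # must provide flush_all()
            self.expand_bits = set()              # pages that have fed the cache

        def on_instruction_expanded(self, page):
            # Set the expand bit for the page an instruction was fetched from.
            self.expand_bits.add(page)

        def on_write(self, page):
            # A store to a page with the expand bit set may hit cached copies
            # that are not individually tagged, so flush the whole cache.
            if page in self.expand_bits:
                self.expanded_cache.flush_all()
                self.expand_bits.clear()

        def on_page_flush(self, page):
            # The same conservative rule applies when the virtual memory
            # system removes the page.
            self.on_write(page)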
4.6 Summary

This chapter shows an expanded instruction cache machine implementing just cache alignment can provide a 10% performance improvement over the simple reference superscalar machine with an infinite size cache. However, an expanded instruction cache is less efficient than a conventional cache when the area for the cache is limited. To quantify the efficiency of cache area use, the measures duplication ratio and line utilization are defined. The low utilization of the cache's storage is the main reason the expanded instruction cache model presented so far requires much larger cache storage than a conventional cache machine. To be practicable, more effective use of the cache's storage area must be achieved. This is possible by adding additional features to the expansion process and the execution process.
One method of utilizing more of the cache's storage area is to pack more than one cycle's worth of execution into each cache line. This method is called multiple cycle packing and is discussed in chapter 5. However, packing multiple cycles into one expanded instruction is more complex than one may expect. If care is not taken the duplication ratio escalates and the efficiency of the expanded cache's storage area is again reduced. Another reason the expanded cache line utilization is so low is the frequent occurrence of branches in the benchmarks and the policy of not expanding once a branch is encountered. Section 2.2.2 explains how an expanded instruction cache can be used to implement branch prediction. The performance of branch prediction is presented in chapter 6. With this additional functionality in the expander it is possible to improve this situation, thereby improving performance. This idea is presented in chapter 7.
Chapter 5
Routing and Multiple Cycle Packing

In many superscalar implementations the routing network for delivering instructions from the instruction cache to the proper execution units requires significant time. It may be a component of the critical path cycle time. An in-order issue expanded instruction cache machine addresses this problem by using static routing. Static routing, introduced in section 2.2, is the technique of wiring each field of an expanded instruction directly to one and only one execution unit, thereby eliminating the need for an instruction routing network. Static routing with wide expanded instructions leads to very low cache line utilization. This chapter describes methods for improving line utilization by packing more than one cycle's worth of execution into an expanded instruction while still maintaining the cycle time advantages of static routing.
5.1 Cycle Tagged Instructions

Allowing a single expanded instruction to control more than one cycle's worth of execution achieves improved cache utilization. This is implemented using a method called Cycle Tagged Instructions. In this method each instruction within an expanded instruction is tagged with an additional field describing when it should execute.
[Figure 5.1: Routing Network for Cycle Tagged Expanded Instructions. The expanded parallel instruction register feeds a routing network that positions instructions for the branch units, load/store units, and function units.]

Briefly, this method functions as follows:
- Tag each instruction with its execution cycle relative to the start of the expanded instruction.
- The first cycle's instructions are aligned within the expanded instruction to slots for their execution units.
- The next cycle's instructions are aligned with a routing network during the current cycle.
Having the first cycle's instructions already aligned within the expanded instruction meets the goal of direct execution unit control for the first cycle. During the first cycle there is time to route the second cycle's instructions into position for direct control of the appropriate execution units. Figure 5.1 shows the routing network needed to position instructions for the second and following execution cycles. Instructions for the third cycle are positioned during the second cycle, and so on. Aligned first cycle instructions, combined with routed second and later cycle instructions, accomplish the simple direct execution unit control required to minimize cycle time.
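The resulting issue sequence can be pictured with a small model: the slots tagged for cycle 1 are already sitting in front of their units, and while they execute, the slots tagged for cycle 2 are routed into place, and so on. The sketch below assumes each slot is a (unit, cycle tag, instruction) triple and that units are simple callables; both are illustrative simplifications.

    def execute_expanded_instruction(slots, units):
        """slots: list of (unit_name, cycle_tag, instruction) from one expanded line,
        with cycle tags starting at 1.  units: dict of unit_name -> callable.
        Cycle 1 slots are statically aligned; later cycles are routed one cycle ahead."""
        last_cycle = max(cycle for _, cycle, _ in slots)
        # The first cycle's group needs no routing network.
        routed = [(unit, instr) for unit, cycle, instr in slots if cycle == 1]
        for cycle in range(1, last_cycle + 1):
            for unit, instr in routed:       # issue the pre-positioned group
                units[unit](instr)
            # While this group executes, route the next cycle's instructions.
            routed = [(unit, instr) for unit, c, instr in slots if c == cycle + 1]

In hardware the routing step corresponds to the network of figure 5.1 operating one cycle ahead of issue.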
5.1.1 Cost of Cycle Tagging

Cycle tagging increases the overhead in the expanded instruction. Cycle tagging requires a field describing the issue cycle for each instruction. The width of this field must be sufficient to accommodate the largest number of cycles occurring during execution of one expanded instruction. For an 8 instruction wide machine it is possible for data dependencies to limit execution to only one instruction per cycle, so one expanded instruction can control up to 8 cycles of execution. Thus the cycle tag field needs to be 3 bits per instruction, for a total of 24 bits. In general, this straightforward encoding for an n instruction wide machine requires n⌈log2 n⌉ bits. This encoding simplifies the decoding for the routing network as each field explicitly states in which cycle the instruction must be delivered to the appropriate execution unit. Not supporting the full range of multi-cycle encoding is a possible method of reducing the expanded instruction width overhead of cycle tagging. Only a limited number of the cycle tagged instructions in an 8 wide machine will execute in cycles 5 through 8. It is possible to truncate each cycle tag field to only 2 bits without much loss of line utilization. This would reduce the overhead from 24 bits to 16 bits. Another encoding is possible. It is possible to flag just the cycle boundaries between the instructions because the original program order is stored in the expanded instruction. This requires only one bit per packed instruction, reducing the total extra bits for cycle tagging from n⌈log2 n⌉ to just n. However, this encoding complicates the decoding for the routing network. If the cycle is long enough to accomplish the decoding and routing then this is a better encoding because the extra encoding bits are duplicated many times in the cache whereas there is only one copy of the decoder. Finally, it is possible to determine the cycle boundaries at execution time without using any additional cycle boundary bits. This method re-decodes the instructions within the expanded instruction and uses the original program order information to rediscover the dependencies and resource limitations and recreate the cycle boundaries. A full cycle is available for recreating the cycle boundaries and routing. During the first cycle all execution units begin execution of the instructions delivered to them. Write backs for instructions not meant for first cycle execution are inhibited later in the pipeline.
[Figure 5.2: Excessive Duplication When Filling Every Slot. A seven-instruction program (1: X, 2: Y, 3: A, 4: B, 5: C, 6: D, 7: E, where E branches back to A) is expanded into fully packed 8-wide lines; each line repeats copies of the five-instruction loop body, so the same instructions appear in many lines.]
However, for most technologies this is not a practicable method of encoding the cycle boundaries. It is likely the full parallel decode of all instructions in the expanded instruction and the routing will require more than one cycle. Another consideration when using this method of no additional cycle boundary bits is the align instructions described in chapter 3. Processing the align requires either terminating expanded instruction packing when one of these instructions is encountered, or entering these instructions into the expanded instruction cache. The other methods of identifying cycle boundaries by adding additional bits to the expanded instruction allow the align instructions to be folded out during the expansion process.
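For concreteness, the three encodings discussed above cost 24, 16, and 8 bits of tag overhead on an 8-wide expanded instruction. A small sketch of the arithmetic, for illustration only:

    import math

    def cycle_tag_overhead_bits(n, encoding):
        """Extra bits per expanded instruction for an n-instruction-wide machine."""
        if encoding == "full":       # explicit cycle number, ceil(log2 n) bits each
            return n * math.ceil(math.log2(n))
        if encoding == "truncated":  # cycle number capped at 2 bits per instruction
            return n * 2
        if encoding == "boundary":   # one cycle-boundary flag bit per instruction
            return n
        raise ValueError(encoding)

    for enc in ("full", "truncated", "boundary"):
        print(enc, cycle_tag_overhead_bits(8, enc))  # prints 24, 16, 8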
5.1.2 Terminating Instruction Packing

Implementing cycle tagging causes several intriguing situations to arise during the expansion process. These situations concern termination of filling an expanded instruction. It would be possible for the expander to always completely fill each expanded instruction, achieving 100% utilization of the cache's storage. However, this
leads to instruction duplication inefficiencies and performance limitations. Figure 5.2 shows what can occur when every expanded instruction is filled until it is completely full. The top of the figure shows the code as it resides in main memory. There is a loop implemented by the branch instruction labeled "E" at address 7. The bottom of the figure shows an expanded instruction cache after completely filling each expanded instruction. For simplicity, the instructions are shown in program order within the expanded cache lines and the cycle tag information is not exhibited. The figure shows how each instruction of a 5 instruction loop becomes packed into an 8 wide expanded instruction at 9 different locations when each expanded instruction is completely filled. There is only one tag for each cache line. Because the size of the loop and the size of the line are relatively prime, there is a complete cache line filled by each address in the loop. The result is an exceedingly high duplication ratio. A 100% utilization does not necessarily result in better performance. Another problem arising with completely filling each expanded instruction is loss of parallelism and performance. This problem is fundamentally the same as the alignment problem described in chapter 4. Full packing results in some groups of independent instructions that could be executed in parallel becoming spread across two sequential cycles because of cache line boundaries. Again, forcing 100% utilization may reduce performance. A method to preserve the performance gain achieved by not separating a group of parallel instructions across line boundaries is to fill the cache line and then truncate back to the last cycle boundary. The partially completed instruction group is replaced by NOPs. This method is called terminate on full. This method somewhat reduces the utilization but results in better performance. The terminate on full method slightly increases the expansion time as work is thrown away when the expander looks ahead in the instruction stream and then truncates back to the last cycle boundary. Terminating packing only on cycle boundaries treats the alignment problem but does not fully handle the duplication problem. It helps the duplication problem because there are fewer starting addresses required, but simulations show duplication is still a problem. A technique called terminate on hit treats the duplication problem. This technique is implemented by configuring the expander to terminate the multiple
cycle packing at any cycle boundary when the next cycle's instructions are already found in the expanded instruction cache. Using terminate on hit has no performance penalty during execution time as it only switches to the next expanded instruction at cycle boundaries. There is sufficient time to fetch the next expanded instruction while executing the last instruction group in the current line. It reduces duplication because expansion stops whenever an instruction on a cycle boundary is found in the cache. This instruction and the following instructions are not duplicated. It does cause the utilization to decrease as fewer slots are filled, but this is not at the expense of performance. Terminate on hit is implemented during expansion time by reading the expanded instruction cache while decoding each instruction. If the decode processing determines the instruction being expanded causes a cycle boundary then the hit or miss result of the cache access is interrogated. When there is a hit for this instruction then the expansion process is terminated and the next cycle is used to write the now complete expanded instruction into the expanded instruction cache. Observe that reading and writing of the expansion cache with terminate on hit is slightly different from the pipeline described in section 2.2.1. The simple expansion process described there writes the expansion cache every cycle and thereby saves one cycle when the expanded instruction is complete. With terminate on hit the expansion cache must be read every cycle, making it unavailable for writing. Terminate on hit requires one additional cycle during each expansion to write the completed expanded instruction into the cache. This is not a performance limitation because the expansion process is infrequent. Terminate on hit provides a very small performance improvement over cycle tagging with terminate on full because the lower duplication requires fewer expansion cycles. One more method of controlling duplication is used by Epic machines. As described in chapter 3, an align instruction is emitted by the compilers when performance loss would occur if two independent instructions are dispatched during the same cycle. These align instructions are also used by the expander to terminate multiple cycle packing into the expanded instruction. This again reduces the average utilization of the cache lines but has no impact on execution performance as it occurs only on cycle boundaries.
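The difference between the two policies is only the test that stops the packing loop. The sketch below assumes the dependency and resource checks have already grouped the fetched instructions into per-cycle groups, and that cache_hit() answers whether an expanded line tagged with a given address already exists; the structure is illustrative, not the actual expander.

    LINE_SLOTS = 8  # instruction slots per expanded line (assumed)

    def expand_line(cycle_groups, cache_hit, terminate_on_hit=True):
        """cycle_groups: iterable of instruction groups, one group per execution cycle.
        cache_hit(addr): True if an expanded line tagged addr is already cached."""
        packed = []
        for group in cycle_groups:
            group_start = group[0].address
            # Terminate on hit: stop at a cycle boundary whose instructions are
            # already cached, so they are not duplicated in this line as well.
            if terminate_on_hit and packed and cache_hit(group_start):
                break
            if len(packed) + len(group) > LINE_SLOTS:
                # Terminate on full: never split a parallel group across lines,
                # which amounts to truncating back to the last cycle boundary.
                break
            packed.extend(group)
        # The remaining slots are filled with NOPs.
        return packed + ["NOP"] * (LINE_SLOTS - len(packed))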
Benchmark   No Cycle Tag   Terminate on Full   Terminate on Hit
compress       25.44%            79.58%             65.45%
espresso       25.04%            81.73%             72.13%
fft            37.04%            82.63%             81.48%
gcc1           26.14%            79.53%             73.30%
spice3         26.55%            78.08%             71.08%
tex            28.12%            71.88%             67.46%
Average        28.05%            78.91%             71.82%

Table 5.1: Line Utilization for Cycle Tagged Configurations (4 Kbyte Instruction Cache, Branch Prediction, and Speculative Execution)

Benchmark   No Cycle Tag   Terminate on Full   Terminate on Hit
compress        1.10              1.50               1.20
espresso        1.04              1.28               1.11
fft             1.24              1.29               1.21
gcc1            1.04              1.18               1.08
spice3          1.04              1.16               1.07
tex             1.04              1.20               1.10
Average         1.08              1.27               1.13

Table 5.2: Duplication Ratio for Cycle Tagged Configurations (4 Kbyte Instruction Cache, Branch Prediction, and Speculative Execution)

Table 5.1, table 5.2, and figure 5.3 present the line utilization, duplication ratio, and machine performance, respectively, for no cycle packing and the two forms of termination for multiple cycle packing. Branch prediction and speculative execution, described in chapters 6 and 7, are used by the Epic machine configurations reported in these figures. The performance of adding multiple cycle packing without adding branch prediction is presented in the next section. Table 5.1 shows terminate on full cycle packing achieves an average line utilization of almost 79%, much better than the 28% achieved without multiple cycle packing. As expected, terminate on hit has a lower utilization, averaging about 72%.
[Figure 5.3: Performance for Cycle Tagged Configurations. Instructions per cycle versus instruction cache size (4K, 16K, 64K, 256K, inf.) for the No Cycle Tag, Terminate on Full, and Terminate on Hit configurations; the highest plotted value is 1.72 instructions per cycle.]
The tables report line utilization and duplication ratios only for the 4 Kbyte cache size; for other cache sizes the data is very similar. Figure 5.3 shows performance does not always correlate directly with line utilization. Multiple cycle packing is not expected to improve performance in the infinite cache configuration. This is because multiple cycle packing addresses the problem of low utilization of the cache's storage, but an infinite cache does not have a storage size limitation. A careful examination reveals there is a slight performance decrease for both of the multiple cycle packing methods. This is caused by the initial cache miss service times being longer for the multiple cycle packing cases. An infinite cache has only initial misses, and a longer service time for these misses results in lower performance. Normally, the reduction in miss ratio achieved by the multiple cycle packing offsets the longer service time for each miss. At small cache sizes the improved utilization has the most impact on performance. At a 4 Kbyte instruction cache size terminate on full cycle packing achieves a 12.5% performance improvement over single cycle expanded instructions. Terminate on hit cycle packing achieves a 19.8% performance improvement over single cycle expanded instructions, even though it has a lower line utilization. Table 5.2 points out the reason for this better performance. The terminate on hit packing achieves a lower duplication ratio. The 11% reduction in duplication ratio improves performance more than the 9% reduction in line utilization reduces performance. Terminate on hit multiple cycle packing achieves better performance than terminate on full for all the cache sizes and benchmarks studied. Therefore, terminate on hit is used for all the performance reports presented in the next and following sections.
5.2 Performance of Cycle Tagging

This section analyzes the performance gains achieved by multiple cycle packing. Table 5.3, table 5.4, and figure 5.4 present the line utilization, duplication ratio, and machine performance, respectively, for three machines. The machines are the reference machine, an aligned expanded instruction cache machine, and the same aligned expanded instruction cache machine with multiple cycle packing.
Benchmark   Line Cache   No Cycle Packing   Terminate on Hit
compress      86.58%          25.87%             59.08%
espresso      89.28%          25.93%             67.13%
fft           85.88%          35.54%             64.61%
gcc1          86.49%          26.66%             65.70%
spice3        87.41%          27.29%             67.65%
tex           84.52%          28.18%             59.78%
Average       86.69%          28.24%             63.99%

Table 5.3: Line Utilization for Cycle Tagged Configurations (4 Kbyte Instruction Cache)

Benchmark   Line Cache   No Cycle Packing   Terminate on Hit
compress       1.00            1.07               1.19
espresso       1.00            1.07               1.12
fft            1.00            1.13               1.17
gcc1           1.00            1.05               1.09
spice3         1.00            1.06               1.10
tex            1.00            1.04               1.12
Average        1.00            1.07               1.13

Table 5.4: Duplication Ratio for Cycle Tagged Configurations (4 Kbyte Instruction Cache)
[Figure 5.4: Performance for Cycle Tagged Configurations. Instructions per cycle versus instruction cache size (4K, 16K, 64K, 256K, inf.) for the Line Cache, No Cycle Packing, and Terminate on Hit machines; the highest plotted value is 1.37 instructions per cycle.]
The table reports data for an instruction cache storage size of 4 Kbytes. The line cache column in table 5.3 presents the cache line utilization for a non-expanded instruction cache machine. This is defined as the number of instructions executed divided by the number of instructions delivered by the cache. The number of instructions delivered by the cache is the cache line width (in instructions) times the number of changes in addresses fetched from the cache. If the same cache line is used for several execution cycles then this cache line counts only once when computing the utilization. The utilization of the line cache configuration is not 100% because of the branches encountered during execution. A branch causes some of the fetched instructions to be discarded. Table 5.3 shows cycle tagging achieves about 64% utilization of the cache lines. This is lower utilization than the configurations of the previous section because this Epic configuration does not include branch prediction and speculative execution. The configuration used in this section is chosen to show the performance gains of adding only multiple cycle packing to the aligned expansion cache of the last chapter. When the expander encounters a branch instruction it stops packing the expanded instruction and fills the rest of it with NOPs, which reduces utilization. The 64% line utilization achieved by multiple cycle packing is a large improvement over the 28% line utilization of the non-multiple cycle expanded instruction machine. As shown in figure 5.4 this results in about a 15% performance improvement at the 4 Kbyte cache size. However, at the 4 Kbyte cache size line utilization is very important and the line cache machine still outperforms the multiple cycle packing expanded instruction machine. Low cache storage efficiency is also affected by the increased duplication ratio occurring with multiple cycle packing. The duplication ratios in table 5.4 for the line cache machine are exactly 1.00 because this configuration does not support duplication in the cache. The duplication overhead for a non-cycle tagged expanded instruction machine is 7%, and packing multiple cycles results in almost double the overhead at 13%.
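For reference, the line cache utilization in the first column is computed from the simulator counts exactly as defined above; a one-function sketch, with illustrative parameter names:

    def line_cache_utilization(instructions_executed, fetch_address_changes, line_width):
        """Instructions executed divided by instructions delivered, where a line is
        counted once per change in the address fetched from the cache."""
        return instructions_executed / (line_width * fetch_address_changes)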
In summary, cycle tagging and multiple cycle packing into an expanded instruction increase performance over single cycle expanded instructions for small caches. However, the performance improvement achieved by alignment alone is insufficient to offset the performance penalty of reduced cache storage efficiency for the in-order issue expanded instruction cache machine.
5.3 Expro Machine Routing and Multiple Cycle Packing

Static routing is not effective in the Expro machine. To evenly spread instructions across the available execution units, the dynamic decoder in the Expro machine routes each instruction to the reservation station with the smallest number of buffered instructions. It is not possible for the expander to determine at expansion time which reservation station will have the fewest pending instructions at execution time. The presence of dynamic scheduling reduces the effectiveness of static routing. The Expro machine, like the Reorder Buffer machine, is configured with 2 pipeline stages for the decoding and routing of instructions. The complex decode process requires more than one stage. With two stages available the time required for routing is not much of a concern. This makes the ineffectiveness of static routing unimportant. Removal of the static routing requirement reduces the number of overhead bits in the expanded instruction. All machines require the original program order of the instructions to be available in case an exception interrupts program flow. The Expro machine's expansion process enters instructions into the expanded instruction in original program order. This allows determination of original program order by using the relative position of instructions within the expanded instruction, requiring no additional overhead bits. The Epic machine pays an overhead of 16 to 24 bits in the expanded instruction because of its need for static routing whereas the Expro machine requires none. Multiple cycle packing is fundamental to the out-of-order issue Expro machine.
Delivering more than one cycle's worth of instructions during most cycles is how this machine maintains an average queue of available instructions to execute. Expanded instruction caching supports the high instruction fetch bandwidth requirements of the Expro machine. There is an issue of when the Expro machine's expansion process should terminate the filling of an expanded instruction. As in the Epic machine, the options are: 1) completely pack each expanded instruction, 2) terminate on the last full cycle boundary, and 3) terminate on hit. Completely packing each expanded instruction leads to very high duplication ratios and is not an effective method even for moderately large cache sizes. The decision whether to use terminate on hit versus terminate on last full cycle is not obvious. As seen in section 5.1.2, terminate on last full cycle boundary has a higher utilization ratio than terminate on hit and thus will deliver more instructions to the decoder. This may be important for the Expro machine's higher rate of instruction issue. On the other hand, terminate on full has a higher duplication ratio and reduces the effectiveness of the cache, increasing the number of misses. The decision as to which to use depends on which method achieves better performance. Figure 5.5 shows the performance of the Expro machine when configured with expansion processes that terminate on hit and terminate on full. Both configurations use the branch prediction and instruction run merging described in chapters 6 and 7, and both have a 32 entry reorder buffer. For all cache sizes except infinite, terminating packing on hit has better performance than terminating on full. At an infinite cache size, the slightly larger instruction bandwidth of terminate on full achieves an average performance improvement of only 0.5%. Because this improvement is so small, and it is negative for small caches, terminate on hit is used as the expansion termination algorithm for the Expro machine.
[Figure 5.5: Expro machine with Expansion Terminate on Hit and Terminate on Full (32 entry reorder buffer). Instructions per cycle versus instruction cache size (4K, 16K, 64K, 256K, inf.); the highest plotted value is 2.32 instructions per cycle.]
5.4 Summary

For the in-order issue Epic machine the cycle tagged structure and multiple cycle packing achieve the cycle time benefits of static routing and improve cache utilization. The expander aligns the first cycle's instructions to the appropriate execution units. This achieves the required alignment for the first cycle after a fetch from the expanded cache. The expander then places instructions tagged for the second and following execution cycles into the first available slot in the expanded instruction. This packing achieves high utilization of the cache line. Performance considerations limit utilization. If the lines were filled to 100% of capacity then many instruction packets would be split between cache lines and parallelism would be lost. When the cache lines are filled to capacity then instruction duplication also increases, negating the efficiency advantage one is attempting to gain. These considerations require an expansion termination policy. The simulations show terminating expansion upon a hit achieves the highest performance. The cost of cycle tagging is 1 bit per instruction to flag the cycle boundaries. In its most effective configuration multiple cycle tagging achieves an average performance increase of 20% for the 4 Kbyte instruction cache size and an increase of 5% for the 64 Kbyte size. Multiple cycle packing is fundamental to the out-of-order issue Expro machine. Static routing is ineffective for this machine because of the dynamic nature of execution unit selection at decode time. There are no overhead bits required for multiple cycle packing in the Expro machine because of the lack of static routing. The terminate on hit policy is the best expansion process termination policy for the Expro machine. Multiple cycle packing is more effective when combined with branch prediction and instruction run merging. How these features are implemented and their impact on performance is the subject of the next two chapters.
Chapter 6
Branch Prediction

Accurate, high speed branch prediction is an important component of high performance machines. Typical workstation code contains about one branch in every 5 to 6 instructions. With branches being this frequent it is imperative that superscalar machines predict the outcomes of branches during instruction fetching, without waiting for the execution units to generate the correct target addresses. The machines presented in chapter 2 use successor address static branch prediction. Static branch prediction is performed at compile time and inserts branch prediction hints into the emitted opcodes. Static branch prediction does not use runtime branch information to improve prediction accuracy. This chapter presents the prediction accuracy of these static branch prediction methods and a model for estimating the performance effects of different branch prediction rates.
6.1 Epic Branch Prediction Performance

Figure 6.1 shows the successor address branch prediction method used by the Epic machine. The implementation aspects of this method were described in section 2.2.2 and the performance aspects are discussed in this section. Branch prediction is implemented by adding a successor address field to the expanded cache line. This field both predicts branches and allows the next cache line to be pre-fetched before decoding any instructions in the current cache line.
[Figure 6.1: Epic Successor Address Branch Prediction. Each expanded instruction cache line holds a tag field, a successor address field, and the packed instructions; the original figure labels field widths of roughly 10 and 30 bits.]

[Figure 6.2: Performance of Adding Branch Prediction to Epic. Instructions per cycle versus instruction cache size (4K, 16K, 64K, 256K, inf.) for the align & pack and branch predict configurations; the highest plotted value is 1.49 instructions per cycle.]
Benchmark     Epic      Reorder     Expro
compress     85.33%     74.82%     85.33%
espresso     84.35%     78.55%     84.35%
fft          93.73%     90.55%     93.73%
gcc1         85.32%     75.57%     85.36%
spice3       84.57%     77.74%     84.57%
tex          78.00%     75.39%     77.46%
Average      85.22%     78.77%     85.13%

Table 6.1: Average Branch Prediction Rate
Figure 6.2 shows the performance gained by adding successor address branch prediction to the aligned expanded instruction configuration presented in section 4.2. With cache line alignment and successor address branch prediction the expanded instruction cache machine achieves an average of 1.49 instructions per cycle with an infinite instruction cache. This is an improvement of 8.3% over an expanded instruction cache machine using only cache line alignment. Table 6.1 reports branch prediction rates for the three machine configurations. The configurations reported are the full speculative execution configurations for each machine, and the Expro machine has a 32 entry reorder buffer. The Epic and Expro machines achieve about 85% branch prediction accuracy and the Reorder Buffer machine achieves about 79% accuracy. The Epic and Expro machines have almost exactly the same branch prediction rate because they both use the same expanded instruction cache configuration. The Reorder Buffer machine has a lower branch prediction accuracy because of conflicts for the successor address field. Conflicts do not occur in the expansion cache machines because there is only one predicted next address for each cache line. Even if two branches are packed into one expanded instruction cache line there is a single predicted next address after the execution of both of these branches. The single successor address field is sufficient to store the branch prediction's best guess about the next expanded instruction to execute. In the case of the Reorder Buffer machine there can be a conflict for the successor address field. Here a single cache line can contain more than one branch instruction.
Each of these branches is in a different basic block so each branch requires its own branch target address. However, there is only one successor address field in the cache line, so this leads to a conflict. This conflict is resolved at run time by entering the correct target into the single successor address field each time there is a miss predicted branch. Entering a new address is the correct prediction for the branch that caused the miss, but this removes prediction information for other branches in the cache line. Removing valid prediction information causes a lower branch prediction accuracy. This data demonstrates successor address branch prediction is not the most effective method of branch prediction when used with wide traditional caches. The branch prediction is too intertwined with the cache structure. To be effective the branch prediction mechanism must be able to predict one target address for most basic blocks, not one target address per cache line. It is better to use an independent branch prediction mechanism such as a branch target buffer [LS84]. The successor address is effective when used with an expansion cache because each expanded instruction does not contain more than one expected program execution trace. Successor address branch prediction has an advantage over an independent branch target buffer because only one set of address tags is needed for both branch prediction and cache implementation. Static branch prediction is used by the machines presented here. However, it is possible to improve branch prediction rates with more sophisticated hardware methods that record run time behavior of the programs [Los82] [LS84] [YP91] [PSR92] [YP92] [YP93] [YS94] [YGS95]. The effect of improved branch prediction accuracy on machine performance is the subject of the next section.
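The mechanics of the successor address field can be summarized in a few lines. The sketch below is illustrative: the successor of a line is read in the same access that delivers the line, and the run-time correction shown in on_branch_resolved models the update described above for the Reorder Buffer machine (the expansion cache machines fill the field during expansion).

    class ExpandedLine:
        def __init__(self, tag, instructions, successor):
            self.tag = tag                # address of the first instruction in the line
            self.instructions = instructions
            self.successor = successor    # predicted address of the next line to fetch

    class SuccessorPredictor:
        """Next-fetch prediction using the successor address field of each line."""

        def __init__(self, cache):
            self.cache = cache            # dict: tag -> ExpandedLine

        def next_fetch(self, current_tag):
            # The predicted next line is known as soon as the current line is read,
            # before any instruction in it has been decoded.
            return self.cache[current_tag].successor

        def on_branch_resolved(self, line_tag, predicted_target, actual_target):
            # On a miss predicted branch, write the correct target into the single
            # successor field; with one field per line this can evict the prediction
            # belonging to another branch packed into the same traditional cache line.
            if actual_target != predicted_target:
                self.cache[line_tag].successor = actual_target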
6.2 Cost of a Miss Predicted Branch

This section investigates the performance costs of miss predicted branches. In a machine without speculative execution and a statically configured pipeline, the cost of a miss predicted branch in cycles is simply the number of pipeline stages needed to determine the miss prediction and restart the correct instruction fetch. In a machine supporting speculative execution the cost of a miss predicted branch in terms of
performance is somewhat more complicated. This is because the amount of work lost depends on the number of instructions issued in parallel with the branch. To understand the effect of the branch prediction rate on performance, an approximation for instructions per cycle as a function of the branch prediction rate is derived. Let the total cycles required to execute a program be divided into three groups: cycles for instruction execution, cycles for miss predicted branch servicing, and cycles for instruction cache servicing.

cycles = cycles_{execution} + cycles_{Icache\ miss} + cycles_{branch\ miss}

In reality there is some interaction among these three cycle types. Some instructions may be able to complete while a cache miss is being serviced or a missed branch recovery is occurring. However, to simplify the analysis assume these three types of cycles are independent. Cycles per instruction (CPI) is computed by dividing by the number of instructions required to execute the program:

CPI = \frac{cycles_{execution}}{instructions} + \frac{cycles_{Icache\ miss}}{instructions} + \frac{cycles_{branch\ miss}}{instructions}

The first term, cycles_{execution}/instructions, can be computed from the average issue rate for the machine. All data dependencies and resource limits are included in this term. Thus this term is a function of program characteristics, data dependencies, data cache behavior, and machine execution resources. For a given simulation run on a given machine this term can be computed by having the simulator count the cycles used for execution and the total number of instructions executed. As stated above, this average issue rate term is assumed to be independent of the branch prediction rate. In actuality, the average issue rate does depend upon the branch prediction rate, but it is only a weak dependence for the small instruction window machine configurations studied by this work. A very high branch prediction rate with large instruction fetch windows and abundant machine resources may allow additional exploitation of instruction level parallelism. This increases the average instructions per cycle. This variation is assumed to be small for the machines studied.
The second term, cycles_{Icache\ miss}/instructions, is the instruction cache miss penalty term. For a given machine and program this can also be easily computed during simulation. The simulator simply counts the number of stall cycles due to instruction cache misses and divides by the number of instructions executed. The branch prediction rate influences this term because a miss predicted branch may influence the lines fetched into the instruction cache. However, any influence of the branch prediction rate on this term is ignored for this analysis. The third term, cycles_{branch\ miss}/instructions, is the term that varies as the branch prediction rate changes. For each miss predicted branch there are two components contributing to the cost in cycles for the miss prediction. The first cost is the processing of the speculatively issued instructions that now must be expunged. Let cycles_{discard_b} be the number of cycles needed to discard the incorrectly issued instructions for branch b. The second cost is the number of cycles required to restart the correct instruction stream after it is known a branch was miss predicted. This term is the number of pipeline stages required to remove any speculative instructions in the pipeline and fetch the correct branch target instructions. For the machines studied by this analysis this penalty term does not depend on the number of instructions expunged. Let this term be called cycles_{brn\ recover}. Let N be the number of miss predicted branches in the program. Then the number of cycles consumed by all miss predicted branches is:

cycles_{branch\ miss} = \sum_{b=1}^{N} \left( cycles_{discard_b} + cycles_{brn\ recover} \right)

The cycles_{brn\ recover} term is independent of branch number b, so it can be factored out of the summation. The branch prediction rate is the number of correctly predicted branches divided by the total number of branches. Let the branch prediction rate be called \rho. The number of miss predicted branches, N, is (1 - \rho) \cdot branches. The branch miss cycles per instruction then becomes:
\frac{cycles_{branch\ miss}}{insts} = (1 - \rho)\,\frac{branches}{insts}\; cycles_{brn\ recover} + \frac{1}{insts} \sum_{b=1}^{N} cycles_{discard_b}
The simulator counts the number of instructions discarded for all miss predicted branches; call this number instructions_{discard}. Let the average dispatch rate, including all instructions issued and completed as well as instructions issued and flushed, be DPC_{avg} (Dispatches Per Cycle). This is the total number of instructions dispatched divided by the number of dispatch cycles (non-cache miss cycles and branch miss stall cycles). The cycles lost because of discarded instructions for all branch prediction misses is approximately the instructions discarded divided by the instruction dispatch rate.
\sum_{b=1}^{N} cycles_{discard_b} \approx \frac{instructions_{discard}}{DPC_{avg}}
Substituting for the summation gives:

\frac{cycles_{branch\ miss}}{insts} = (1 - \rho)\,\frac{branches}{insts} \left( cycles_{brn\ recover} + \frac{instructions_{discard}}{DPC_{avg}} \right)

Let the term (cycles_{brn\ recover} + instructions_{discard}/DPC_{avg}) inside the parentheses be called cycles_{BMP}, the average number of branch miss penalty cycles. Substituting into the CPI equation gives the CPI as an approximate function of the branch prediction rate \rho.
CPI = \frac{cycles_{execution}}{insts} + \frac{cycles_{Icache\ miss}}{insts} + (1 - \rho)\,\frac{branches}{insts}\; cycles_{BMP}

Let a = cycles_{execution}/insts + cycles_{Icache\ miss}/insts and b = (branches/insts) \cdot cycles_{BMP}. The IPC (Instructions Per Cycle) is just IPC = 1/CPI. Substituting in a and b gives the desired function:
IPC = \frac{1}{a + (1 - \rho)\, b}

Benchmark   insts/execution cycles   cycles_{I miss}/insts   branches/insts   cycles_{BMP}
compress            1.65                     0.00                 0.16            1.92
espresso            1.60                     0.00                 0.16            2.14
fft                 2.92                     0.00                 0.05            1.39
gcc1                1.57                     0.01                 0.16            2.08
spice3              1.54                     0.00                 0.18            2.00
tex                 1.71                     0.00                 0.13            1.99
Average             1.83                     0.00                 0.14            1.92

Table 6.2: Branch Cost Function Parameters for the Epic Machine with an Infinite Cache
As \rho \to 1 the effect of the b term disappears. As \rho decreases it becomes more and more a limiter of performance. How much a low branch prediction rate can limit performance depends on the relative size of a and b. The following graphs and tables report the effects for the various machine configurations studied. Table 6.2 and figure 6.3 present the simulated values of the branch cost function and a plot of this function for the prediction rate ranging from 0.5 to 1.0 for the Epic machine. In table 6.2 the first column, labeled insts/execution cycles, is 1 over the first term of a. This column is the average number of instructions per cycle executed by the given machine and benchmark when instruction cache penalties and missed branch penalties are removed. The second column, cycles_{I miss}/insts, is the second term of a and is the average number of penalty cycles per instruction for a cache miss. Because the infinite instruction cache model is selected for the numbers presented in this table, only the initial misses contribute to values in this column. The product of the last two columns of table 6.2 creates the b term. Without instruction cache and branch penalties the Epic machine is able to achieve about 1.8 instructions per cycle for the selected benchmarks. This is relatively low compared to the out-of-order machines. The branch miss penalty cycles for the Epic machine average to about 1.9 cycles per miss predicted branch, which is also low compared to the out-of-order machines.
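Plugging the Epic averages from table 6.2 into the cost function gives a feel for how flat the curve is. The snippet below only evaluates the published formula with the published averages; it is not additional simulation data.

    def ipc(rho, a, b):
        """Approximate instructions per cycle as a function of prediction rate rho."""
        return 1.0 / (a + (1.0 - rho) * b)

    # Average Epic parameters with an infinite cache (table 6.2).
    a = 1.0 / 1.83 + 0.00   # execution CPI plus instruction cache miss CPI
    b = 0.14 * 1.92         # branches per instruction times branch miss penalty

    for rho in (0.70, 0.85, 1.00):
        print(rho, round(ipc(rho, a, b), 2))
    # -> about 1.59 at 0.70, 1.70 at 0.85, and 1.83 at 1.00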
[Figure 6.3: Branch Cost Function for the Epic Machine with an Infinite Cache. Predicted instructions per cycle versus branch prediction rate (0.5 to 1.0) for compress, espresso, fft, gcc1, spice3, and tex.]
Benchmark   insts/execution cycles   cycles_{I miss}/insts   branches/insts   cycles_{BMP}
compress            3.19                     0.00                 0.16            8.76
espresso            2.74                     0.00                 0.16            8.95
fft                 2.82                     0.00                 0.05            6.03
gcc1                2.90                     0.00                 0.16            8.53
spice3              2.87                     0.00                 0.18            8.78
tex                 2.66                     0.00                 0.13            9.36
Average             2.86                     0.00                 0.14            8.40

Table 6.3: Branch Cost Function Parameters for the Reorder Buffer Machine with an Infinite Cache

The in-order Epic machine does not find as much instruction level parallelism as the out-of-order machines, but it does not pay as much penalty for a miss predicted branch. The 1.9 cycles per miss predicted branch comes from a fixed one cycle pipeline penalty and an average of 0.9 cycles for expunging speculatively issued instructions after a predicted branch miss. The only speculative instructions in the in-order issue Epic machine are the ones packed into the expanded line after a packed branch. Because there are only a few slots for speculative instructions, and they must be independent of the instructions already packed into the expanded instruction, normally only a small number of instructions must be expunged after a miss predicted branch. Figure 6.3 plots the predicted instructions per cycle as a function of the branch prediction rate for the infinite instruction cache Epic machine. On the graph the plotted symbol on each curve is the actual simulated IPC and branch prediction rate; the rest of the curve is plotted using the missed branch cost formula. Performance is almost independent of \rho because the b term is small compared to the a term. Table 6.3 and figure 6.4 present the branch cost results for the Reorder Buffer machine. For this out-of-order machine the average number of instructions per cycle is about 2.9. This is higher than the Epic machine's 1.8. The out-of-order issue achieves about a one instruction per cycle increase over the Epic machine. However, the average number of cycles lost to a miss predicted branch is much higher at about 8.4 cycles.
[Figure 6.4: Branch Cost Function for the Reorder Buffer Machine with an Infinite Cache. Predicted instructions per cycle versus branch prediction rate (0.5 to 1.0) for compress, espresso, fft, gcc1, spice3, and tex.]
Benchmark   insts/execution cycles   cycles_{I miss}/insts   branches/insts   cycles_{BMP}
compress            4.10                     0.00                 0.16            9.58
espresso            3.92                     0.00                 0.16            9.33
fft                 4.02                     0.00                 0.05            6.30
gcc1                3.86                     0.00                 0.16            8.45
spice3              3.96                     0.00                 0.18           10.45
tex                 4.00                     0.00                 0.13            7.98
Average             3.98                     0.00                 0.14            8.68

Table 6.4: Branch Cost Function Parameters for the Expro Machine with an Infinite Cache and 32 Entry Reorder Buffer

The increased cycle count for a miss predicted branch occurs because both of the terms summed to form the b factor are larger. The fixed cost term for a miss predicted branch in the Reorder Buffer machine is 3 cycles. Two of these 3 cycles are required because of the 2 cycle instruction decode, and 1 cycle is required for flushing instructions out of the reservation stations and reorder buffer. The remaining 5.4 cycles of penalty per miss predicted branch come from the much larger number of speculative instructions being processed when a miss predicted branch occurs. The larger branch miss penalty value causes the branch cost function plots to vary more with \rho. This function is plotted in figure 6.4 for the Reorder Buffer machine. The figure shows better branch prediction accuracy is beneficial to this out-of-order issue machine. Improving the branch prediction rate causes a more than linear improvement in the IPC. Table 6.4 and figure 6.5 present the branch cost results for the Expro machine. This is for the 32 entry reorder buffer and infinite expansion cache configuration. This machine achieves an even higher rate of instruction issue because of the increased instruction cache bandwidth and larger reorder buffer. Its issue rate without cache and missed branch penalties is about 4.0 instructions per cycle. The miss predicted branch penalty cycles are correspondingly larger, and average to about 8.7 cycles per missed branch. As in the Reorder Buffer machine, this is the sum of a fixed 3 cycle pipeline penalty plus a term that grows with the average instruction issue rate. Figure 6.5 shows the miss predicted branch cost function is even more curved as the average issue rate increases.
[Figure 6.5: Branch Cost Function for the Expro Machine with an Infinite Cache and 32 Entry Reorder Buffer. Predicted instructions per cycle versus branch prediction rate (0.5 to 1.0) for compress, espresso, fft, gcc1, spice3, and tex.]
Comparing figures 6.4 and 6.5 shows some of the performance improvement gained by the Expro machine over the Reorder Buffer machine is because of the improved branch prediction rate attained by the expansion cache. This section illustrates that accurate branch prediction is important for high speculative issue rate machines. With branch prediction accuracy in the 70% range the out-of-order issue machines are only able to achieve about one instruction per cycle for 5 of the six benchmarks. The importance of accurate branch prediction is reduced when lower levels of speculative issue are attempted. This is demonstrated by the in-order issue Epic machine's low variation in performance with variations in the branch prediction rate.
6.3 Summary

This chapter presents the performance of successor address branch prediction for both expansion cache machines and traditional cache machines. It also derives a function to estimate the performance of a machine in terms of the branch prediction rate \rho. Successor address branch prediction is a static prediction method that computes predicted addresses during a cache miss. Successor address branch prediction is efficient for expanded instruction cache machines but has limitations for traditional cache machines. Its limitation is that it predicts only one next address per cache line, which reduces prediction accuracy if multiple branches occur within the line. Performance as a function of branch prediction accuracy reveals accurate branch prediction is important for machines with out-of-order issue and large numbers of speculatively issued instructions. A more than linear increase in performance is obtained for increases of \rho in the 90% range. For machines like Epic with few speculatively issued instructions the performance increase with increasing \rho is near linear. Even when branches are accurately predicted, high performance requires a high instruction cache bandwidth. Branching generates a need to deliver multiple cache lines to the instruction decode unit. This problem is addressed by the next chapter, which investigates methods of using an expanded instruction cache to efficiently deliver multiple instructions predicted to be executed.
Chapter 7

Instruction Run Merging

Branches disrupt the flow of instructions from an instruction cache to the decoders. They break the sequentiality of instruction addressing, rendering a wide instruction fetch mechanism ineffective. High performance superscalar processors require sufficient instruction bandwidth to keep the execution units busy, but frequent branches make this a difficult task even when the branches are accurately predicted. This chapter describes the problem of fetching sufficient instructions to keep the decode hardware busy in the presence of branches and proposes solutions using expansion caches. The solutions revolve around merging instructions from multiple runs into a single expanded instruction. This builds on top of the branch prediction techniques of chapter 6 and the multiple cycle packing techniques of chapter 5. Merging is a mechanism for fetching instructions at sufficient rates to achieve high utilization of the decode and execution hardware existing in the machine models. These techniques achieve performance advantages in both the in-order Epic machine and the out-of-order Expro machine.
7.1 Branching and Instruction Run Merging

Branches cause misalignment of instructions with respect to the cache lines delivering them. Branching causes some instructions in a traditional cache line to be invalid and reduces the number of instructions per cycle delivered to the decode hardware.
[Figure 7.1: Sequence of Two Instruction Runs. Instructions S1 through S5 form a run ending in a branch; instructions T1 through T3 begin the run at the branch target.]

An instruction run is defined as the set of sequentially fetched instructions between branches, and the number of instructions in the set is called the run length. Figure 7.1 shows a sequence of two instruction runs. As explained in chapter 4, the fetch bandwidth is reduced because the runs do not correspond to a cache line. The figure also shows a branch disrupting the sequential flow of instructions. As explained in chapter 6, it is possible to predict the outcome of the branch and thus, with high probability, know which cache line to fetch for the instructions after the branch. The problem for a single ported cache is that the instructions before and after a predicted branch are in different cache lines. Reading more than one line per cycle from a single ported cache is impossible.

Instruction run merging is the process of simultaneously providing the decoder with instructions from both before and after a branch instruction. A conditional branch causes the instructions provided after it to become speculatively executed. Merging is accomplished in an expanded instruction machine by organizing the expander to continue instruction expansion after it encounters and predicts a branch. The expander fetches instructions from the predicted target of the branch and packs these instructions into the expanded instruction register until this register is filled or a dependency is found. This is shown in figure 7.2. The figure shows a packed expanded instruction cache line where instructions S1 through S4 are before the conditional branch. Instruction S5 is the conditional branch, which is predicted taken.
[Figure 7.2: Instruction Run Merging Example. Instructions S1 through S5 and the target instructions T1 through T3 are packed into a single expanded line; the successor field holds the predicted next address, and T1 through T3 are marked for speculative execution.]

Instructions T1 through T3 are from the target of the branch. Note that the execution of instructions T1 through T3 is speculative and their results must be flushed if the branch is mispredicted. Merging increases decoder efficiency because it fills decoder slots after a branch with instructions predicted to be executed. It also improves cache line utilization as it reduces the number of cases when the expander must stop packing an expanded instruction cache line.

Instruction run merging is possible in the expanded instruction cache because the cache tag structure allows duplication of instructions within the cache. This method of using instruction duplication to achieve instruction run merging has the advantage of requiring only a single access port into the instruction cache. Other methods of supplying instructions from both before and after a predicted branch require parallel accesses from two different cache addresses. Fetching two addresses during the same cycle greatly increases the cache hardware as it requires dual address decoders and sense amplifiers. Also, merging by instruction duplication does not require any shifters or multiplexors in the critical path from the instruction cache to the decoder. Merging after a dual ported cache access requires a shifter and a multiplexor in this critical path.

Instruction run merging needs no overhead bits in the expanded instruction. The execution hardware can determine which instructions are speculative because it has access to the original program order as well as to which instructions are branch instructions. It identifies all instructions after a conditional branch as speculatively issued instructions.
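The expansion-time packing loop can be sketched as follows. This is an illustrative reconstruction rather than the dissertation's expander; the helper callbacks (fetch, is_branch, predict_target, depends_on) and the 4-byte instruction size are assumptions.

    def expand_line(start_addr, fetch, is_branch, predict_target, depends_on, width=8):
        """Pack one expanded instruction line starting at start_addr.

        fetch(a) returns the instruction at address a, predict_target(i) returns the
        predicted successor of branch i, and depends_on(i, packed) reports whether i
        conflicts with any instruction already packed (data or resource dependence).
        All four are assumed helpers supplied by the surrounding simulator.
        """
        packed = []
        addr = start_addr
        speculative = False               # true once a predicted conditional branch is packed
        while len(packed) < width:
            inst = fetch(addr)
            if depends_on(inst, packed):  # dependency found: stop packing this line
                break
            packed.append((inst, speculative))
            if is_branch(inst):
                # run merging: continue packing from the predicted target; everything
                # packed after a conditional branch is marked speculative
                addr = predict_target(inst)
                speculative = True
            else:
                addr += 4                 # next sequential instruction (4-byte words assumed)
        return packed

Because this loop runs only at expansion time, it can take multiple cycles per line without adding anything to the execution-time critical path.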
7.2 Instruction Merging in the Epic Machine

For the in-order issue Epic machine instruction run merging provides a special type of speculative execution, called in-order issue speculative execution. In-order issue speculative execution requires much less support hardware than out-of-order issue speculative execution. During expansion time the expander packs instructions from the branch target into the expanded instruction. These instructions must be data and resource independent of all instructions already assembled into the expanded instruction. During execution time instructions packed after a branch are issued in parallel with the branch. If the branch is a conditional branch then these instructions are identified as being speculatively executed, but no other special handling or buffering is needed for execution of these instructions. Special handling for predicting the branch direction and the non-sequential fetching from main memory occurs only during expansion time, when the demand for high speed alignment and decoding is less stringent.

If a branch is incorrectly predicted then the branch execution unit must arrange to cancel any result writes by speculative instructions initiated during the same cycle. It must also inhibit the expanded instruction currently being fetched from being executed next cycle. The branch unit sends the correct target execution address to the expanded instruction cache and this expanded instruction is fetched during the next cycle. Thus a mispredicted branch has a one cycle penalty if the correct target is already in the expanded instruction cache.

Recovering from a mispredicted branch requires no special reorder buffers or other hardware because all speculative instructions are issued during the same cycle as the branch they depend upon. Thus speculative instructions can be canceled before they reach the result write stage. This eliminates the need for special result buffering of speculative instructions and simplifies the machine. Even without predicting branches, support for speculative execution is required. For example, consider a load that page faults and is followed by an add instruction that is issued in the same cycle. Recovery from the page fault requires that all results of the add be nullified.
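A minimal sketch of this recovery rule is shown below; the issue-group representation and field names are hypothetical, chosen only to make the one-cycle recovery explicit.

    def resolve_conditional_branch(group, branch_slot, taken, correct_target):
        """In-order issue recovery: kill same-cycle speculative results on a mispredict.

        group.slots holds the instructions issued together; the expander marked the
        slots packed after the conditional branch in branch_slot as speculative
        (assumed representation).  Returns the expanded-cache address to fetch next.
        """
        if taken == group.slots[branch_slot].predicted_taken:
            return group.predicted_successor        # prediction correct, nothing to undo
        # Prediction wrong: speculative slots have not reached write-back yet,
        # so a kill signal on their write-back stage is all that is required.
        for slot in group.slots[branch_slot + 1:]:
            if slot.speculative:
                slot.kill_writeback = True
        group.squash_fetched_line = True            # discard the line fetched this cycle
        return correct_target                       # one cycle penalty if this line is cached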
[Figure 7.3: Performance of Adding Speculative Merging to Epic. The plot shows instructions per cycle versus instruction cache size (4K, 16K, 64K, 256K bytes, and infinite) for the simple, align & pack, brn predict, and Full Epic Model configurations; the Full Epic Model reaches about 1.72 instructions per cycle at the infinite cache size.]
This problem is handled by making the pipelines long enough to allow all operations to complete fault checking before the write-back stages of all pipelines are reached. The DEC Alpha [BK95] uses this technique.

Figure 7.3 shows the performance of the Epic machine with instruction run merging and speculative execution, along with the other machine configurations described in previous chapters. The bar labeled Full Epic Model is the configuration including instruction run merging. This configuration includes cache alignment and branch prediction, as both of these features are required to support merging and speculative execution in an expanded instruction. The Full Epic configuration also includes the multiple cycle packing of chapter 5 because of the performance improvement it provides at small cache sizes.
The figure shows merging and in-order issue speculative execution achieve a considerable performance improvement at a very low additional hardware cost. Adding merging and speculative execution to the Epic branch prediction configuration of chapter 6 results in a 15.8% performance improvement for the infinite cache size. The small additional hardware costs for adding these features are in the expander and the branch units. Additional control is required in the expander to continue filling the expanded instruction after a branch is predicted. The branch units require additional hardware to identify the speculatively issued instructions and send the kill signals to their write-back stages. The write-back stage must already support a kill signal, as this is required to implement exception handling.

Another result presented by figure 7.3 is that this full Epic configuration achieves better performance than the simple reference superscalar machine for all instruction cache sizes. At the 4 Kbyte instruction cache size the Epic machine achieves an 8% performance improvement and at the infinite cache size it achieves a 37% performance improvement. These performance improvements assume the two machines have the same cycle time, but this may be difficult for the reference machine to achieve. The reference machine requires instructions for parallel issue to be aligned, decoded and routed in a single cycle, while the Epic machine has the very fast decode stage described in section 2.2. If the reference machine requires a longer cycle time to issue the multiple instructions then its performance will be proportionately less.

The main cost for the performance improvements achieved by Epic is the increased width of the expanded instruction. When comparing the machines at a given cache size, the cache size is measured in terms of bytes used to hold instructions. The overhead bits created during instruction expansion are not included in the cache sizing. For the full 8-wide speculative execution Epic configuration presented in this section the overhead bits total 57 bits, or about a 22% increase in the width of the instruction cache. A small additional cost is paid in the hardware used to implement the expander. This may be offset by reduced hardware for aligning, routing, and decoding between the instruction cache and execution units. Overall, using an expanded instruction cache provides an 8% to 37% performance increase for roughly a 22% increase in cache size. Larger cache sizes achieve the larger performance increases.
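As a rough check on the 22% figure, measured against the instruction bits alone (the tag is excluded here; chapter 9 repeats the calculation including an assumed 20-bit tag and arrives at about 21%):

\[
\frac{57\ \text{overhead bits}}{8 \times 32\ \text{instruction bits}} = \frac{57}{256} \approx 22\% .
\]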
A consideration when evaluating the cost of the 22% increase in cache size is that the increase is in the width of the cache, not in the number of different lines in the cache. This is important because increases in the width of the cache have only minimal impact on the access time of the cache data array. This means the wider cache will not have an impact on the cycle time of the machine. Its main cost is area, not timing delay. A full implementation of an expanded instruction cache is a cost-effective method of improving performance in an in-order issue machine.
7.3 Instruction Run Merging in Expro

The Expro machine described in section 2.15 includes instruction run merging. This section first presents the Expro machine's performance relative to the reference Reorder Buffer machine. It then discusses the importance of the instruction run merging used in the Expro machine by presenting the performance of a version of this machine with instruction run merging removed.
7.3.1 Expro Machine Performance

Figure 7.4 presents the performance of the Expro model using a 16 entry reorder buffer, expanded instruction alignment, branch prediction, and instruction run merging. This machine is the top line graph and achieves an average performance of 2.0 instructions per cycle. The other line graph in the figure is for the out-of-order issue reference Reorder Buffer machine. For comparison, the figure also plots with bar graphs the in-order issue reference machine, labeled simple, and the full in-order issue Epic machine.

The figure shows the complexity of out-of-order issue can improve performance over the simpler in-order issue Epic machine and the Simple in-order reference model at all cache sizes. The performance improvement of the out-of-order issue Expro model over the in-order issue Epic model is 4% at the 4 Kbyte cache size and 16% at the infinite cache size.
[Figure 7.4: Performance of the 16 Entry Expro Model and Other Models. The plot shows instructions per cycle versus instruction cache size for the in-order Simple and in-order Epic models (bars) and the Expro and Reorder Buffer models (lines).]
As explained before, the performance gain is small because the in-order issue machines have a 1 cycle mispredicted branch penalty while the out-of-order issue machines have a 3 cycle mispredicted branch penalty.

The figure also shows that for most cache sizes the Expro machine achieves better performance than the out-of-order issue reference Reorder Buffer machine. The performance improvement of the Expro model over the out-of-order issue reference model varies from -3% at the 4 Kbyte instruction cache size to 23% at the infinite cache size. The performance loss at the small cache size occurs because there the efficient use of the cache's storage area is more important than increasing the rate of instruction delivery. The expanded instruction of the Expro model leads to lower utilization of the cache's storage area because of the duplication of instructions and the lower line utilization. For the large cache sizes the cache miss rate decreases and the expanded instruction cache's ability to supply a higher rate of instruction delivery results in better performance.

Some hardware cost is paid for the performance improvements achieved by the Expro machine over the Reorder Buffer machine in the increased width of the expanded instruction. However, the increase in width of the Expro machine over the Reorder Buffer machine is much smaller than the increase in width of the Epic machine over the Simple line cache reference machine. The out-of-order decode used by the Expro machine removes the need for static routing and for flagging cycle boundaries. Expro does not require overhead to store the original program order: the expander can enter the individual instructions linearly into the expanded instruction, and the original program order is implied by their position within the expanded instruction. The Expro model and Reorder Buffer model both use successor address branch prediction, so the overhead bits for the successor address fields are the same. In fact, as explained in section 2.3.3, the Reorder Buffer machine uses 3 additional bits per cache line to store a branch offset. The Expro model uses 3 extra bits in the cache tags to support the duplication of instructions within the expanded cache. The change in cache width of the Expro machine over the Reorder Buffer machine is thus a 3 bit larger tag and a 3 bit smaller data line. This results in the cache widths of the two machines being almost equal.
As in the Epic machine, a small additional hardware cost is required by the Expro machine to implement the instruction expander. For out-of-order issue machines the main hardware cost of instruction expansion is the less efficient use of the cache storage area. This results in requiring a larger instruction cache to achieve low cache miss rates.

It is informative to investigate the limits of performance in the out-of-order issue machines. Other than cache misses and mispredicted branch recovery, the main performance limiter for out-of-order issue machines is the limit on the number of instructions issued each cycle. There are three reasons additional instructions cannot be issued in a given cycle: 1) no more instructions were delivered by the instruction cache, 2) no more execution units of the appropriate class are available, and 3) the reorder buffer is full.

Table 7.1 presents simulation results for the instruction issue limiters for the reference model Reorder Buffer machine. The table displays data for a 16 entry reorder buffer and an infinite instruction cache. In this table a decode cycle is any cycle in which the instruction cache delivers instructions for issue. Decode cycles do not include cache miss service cycles or mispredicted branch recovery cycles.

Resources Limiting Additional Instruction Issue (percentage of decode cycles)

Benchmark   instruction    insufficient   insufficient       insufficient       reorder
            buffer empty   branch units   load/store units   functional units   buffer full
compress       57.25%          3.96%           6.37%              6.08%            26.34%
espresso       56.06%          1.60%           8.71%              4.41%            29.22%
fft            48.87%          0.00%          14.88%              9.42%            26.84%
gcc1           56.77%          1.96%          19.71%              1.05%            20.51%
spice3         59.02%          1.04%          16.79%              0.57%            22.58%
tex            57.65%          2.98%          19.68%              0.53%            19.16%
Average:       55.94%          1.92%          14.35%              3.67%            24.11%

Table 7.1: 16 Entry Reorder Buffer Model Instruction Issue Limiters (infinite expansion cache, 16 entry reorder buffer, 8-wide machine)

The column labeled instruction buffer empty is the percentage of decode cycles when additional instructions cannot be issued because all valid instructions delivered by the cache have already been issued.
When decoding is limited by the instruction buffer becoming empty, performance can be improved by delivering more instructions per cycle. The goal of the expansion cache is to reduce this percentage to as low as possible.

The columns labeled insufficient units report the percentage of decode cycles when additional instructions cannot be issued because every available unit of the column's class has already been issued an instruction during this decode cycle. When the reservation station for an execution unit is completely filled, this reduces the number of units available during a decode cycle; each reservation station can accept only one instruction per cycle. If, for example, there are 3 loads among the instructions delivered by the cache and there are only 2 load/store units, then this decode cycle becomes limited by insufficient load/store units. When decoding is limited by insufficient execution units, performance can be improved by adding additional units.

The column labeled reorder buffer full is the percentage of decode cycles when additional instructions cannot be issued because the reorder buffer is full. The reorder buffer is full when all its locations are filled with either instructions waiting for results or instructions waiting to be retired back into the register file. The total number of reservation station entries is larger than the number of entries in the reorder buffer, so the reorder buffer usually fills before any reservation station fills. When decoding is limited by the reorder buffer becoming full, performance can be improved by increasing the size of the reorder buffer and reservation stations.

Table 7.1 shows the reference Reorder Buffer machine has sufficient execution units: additional instruction issue is limited by execution units for only about 20% of the decode cycles. This machine's main limitation is that the instruction cache does not deliver enough instructions to fully utilize the available execution units or fill up the reorder buffer. For about 56% of the decode cycles additional instructions could be issued if the instruction cache delivered more instructions per cycle. An expanded instruction cache improves the rate of instruction delivery to the decoder, and this is the improvement offered by the Expro machine over the reference model Reorder Buffer machine. The performance improvement realized by adding an expanded instruction cache was presented in figure 7.4.
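The bookkeeping behind tables 7.1 through 7.5 can be sketched as a per-decode-cycle classifier. This is an illustrative reconstruction, not the simulator's code; the counter names and the priority order among simultaneous limiters are assumptions.

    from collections import Counter

    def classify_decode_cycle(buffer_empty, free_units, wanted_units, reorder_slots_free):
        """Return the resource that stopped further issue in one decode cycle.

        buffer_empty       -- all valid instructions delivered by the cache were issued
        free_units         -- dict: execution-unit class -> units still able to accept an instruction
        wanted_units       -- dict: class -> instructions of that class still waiting to issue
        reorder_slots_free -- remaining reorder buffer entries
        """
        if reorder_slots_free == 0:
            return "reorder buffer full"
        for cls in ("branch", "load/store", "functional"):
            if wanted_units.get(cls, 0) > 0 and free_units.get(cls, 0) == 0:
                return "insufficient %s units" % cls
        if buffer_empty:
            return "instruction buffer empty"
        return "other"

    limits = Counter()
    # limits[classify_decode_cycle(...)] += 1 for every decode cycle of a run; each
    # percentage reported in the tables is limits[c] / sum(limits.values()).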
Table 7.2 presents simulation results for the instruction issue limiters for the Expro machine with a 16 entry reorder buffer.

Resources Limiting Additional Instruction Issue (percentage of decode cycles)

Benchmark   instruction    insufficient   insufficient       insufficient       reorder
            buffer empty   branch units   load/store units   functional units   buffer full
compress       41.68%          2.68%           6.08%              1.33%            48.23%
espresso       41.00%          2.85%           5.49%              3.48%            47.17%
fft            48.50%          0.00%          13.58%              0.08%            37.84%
gcc1           41.10%          1.80%          17.77%              0.54%            38.79%
spice3         43.29%          0.90%          13.71%              0.35%            41.76%
tex            42.40%          3.04%          19.13%              0.56%            34.87%
Average:       42.99%          1.88%          12.63%              1.06%            41.44%

Table 7.2: 16 Entry Expro Model Instruction Issue Limiters (infinite expansion cache, 16 entry reorder buffer, 8-wide machine)

Insufficient instructions from the instruction cache and limited reorder buffer size account for approximately equal shares of the limiting factors; each limits about 42% of the decode cycles. Even though instruction expansion improves instruction bandwidth, an even larger instruction cache bandwidth would improve performance further. However, the point of diminishing returns is being reached. A larger instruction bandwidth could be achieved by increasing the width of the expanded instruction and filling each line with more instructions. Increasing the expanded cache width increases the instruction duplication ratio and renders the cache less efficient. Also, the complexity of the parallel decoding of the larger number of instructions grows at an n^2 rate, rapidly increasing hardware cost. Adding width to the expanded instruction cache is thus very expensive, so improving instruction bandwidth this way is considered impractical.

Increasing the size of the reorder buffer and reservation stations is considered more practical. Figure 7.5 presents the performance results when the size of the reorder buffer is doubled to 32 entries. The number of entries in every reservation station is also doubled. The graph shows the Reorder Buffer machine receives a small performance gain of about 11% from doubling the reorder buffer size. The gain varies very little with cache size.
[Figure 7.5: Performance of the 16 and 32 Entry Expro Models and Reorder Buffer Models. The plot shows instructions per cycle versus instruction cache size for the 32 entry Expro, 16 entry Expro, 32 entry Reorder Buffer reference, and 16 entry Reorder Buffer reference configurations; the 32 entry Expro model reaches about 2.32 instructions per cycle at the infinite cache size.]
Resources Limiting Additional Instruction Issue (percentage of decode cycles)

Benchmark   instruction    insufficient   insufficient       insufficient       reorder
            buffer empty   branch units   load/store units   functional units   buffer full
compress       70.46%          6.08%           8.67%             12.00%             2.78%
espresso       70.46%          2.22%          13.81%              8.10%             5.41%
fft            58.52%          0.00%          24.95%             16.51%             0.02%
gcc1           66.30%          3.13%          26.78%              2.42%             1.37%
spice3         70.22%          1.83%          22.55%              1.24%             4.16%
tex            66.30%          3.95%          24.73%              1.29%             3.73%
Average:       67.05%          2.87%          20.25%              6.93%             2.91%

Table 7.3: 32 Entry Reorder Buffer Model Instruction Issue Limits (infinite expansion cache, 32 entry reorder buffer, 8-wide machine)

Doubling the reorder buffer allows the Expro machine's performance to improve by about 15% at the 4 Kbyte cache size and about 16% at the infinite cache size. The large instruction bandwidth provided by the expansion cache is able to make more efficient use of the larger reorder buffer. The larger reorder buffer permits more instructions to be pending, thereby increasing the probability that there are instructions ready for the execution units to execute.

Tables 7.3 and 7.4 present the instruction issue limits using a 32 entry reorder buffer for the reference Reorder Buffer machine and the Expro machine. This reorder buffer size is sufficient for the reference machine, as the buffer limits performance for only about 3% of the decode cycles. Inadequate instruction bandwidth is the main limit on this machine's decode rate: additional instructions could be issued in about 67% of the cycles if the instruction cache could deliver them.

The expanded instruction cache of the Expro machine delivers additional instructions during most decode cycles. Table 7.4 shows the additional instruction delivery reduces the percentage of decode cycles where the instruction cache is limiting performance to about 58%. The need for an additional load/store unit becomes more important, with insufficient load/store units limiting performance for about 21% of decode cycles. Adding additional load/store units is very expensive in terms of hardware, as they require additional ports on the data cache, and the complexity of the store buffers increases rapidly as their number increases.
Resources Limiting Additional Instruction Issue (percentage of decode cycles)

Benchmark   instruction    insufficient   insufficient       insufficient       reorder
            buffer empty   branch units   load/store units   functional units   buffer full
compress       57.90%         12.68%          12.01%              3.38%            14.04%
espresso       55.69%          5.61%          11.62%              7.12%            19.96%
fft            64.82%          0.00%          23.34%             11.84%             0.00%
gcc1           55.06%          4.39%          30.87%              1.68%             8.00%
spice3         58.34%          3.73%          20.23%              0.71%            16.99%
tex            55.60%          6.03%          29.36%              1.20%             7.80%
Average:       57.90%          5.41%          21.24%              4.32%            11.13%

Table 7.4: 32 Entry Expro Model Instruction Issue Limits (infinite expansion cache, 32 entry reorder buffer, 8-wide machine)

As explained above, increasing the instruction bandwidth is also expensive. Improving performance beyond the 2.3 instructions per cycle achieved by the 32 entry reorder buffer Expro machine is therefore very expensive in terms of hardware cost.
7.3.2 Cost of Removing Instruction Run Merging

Figure 7.6 shows the importance of instruction run merging for the Expro machine. This figure presents the out-of-order issue Expro machine with two different configurations of the instruction expander. The top curve corresponds to the full Expro model as described in the previous section. Its expander uses instruction run merging and packs the expanded instruction with predicted-to-be-executed instructions until the expanded instruction is nearly full. As explained in chapter 5, packing must be terminated before the expanded instruction is completely full to avoid excessive instruction duplication. The bottom curve in the figure presents an Expro model with a simple instruction expander performing just cache line alignment, as described in chapter 4, and branch prediction. This configuration does not use multiple cycle packing or instruction run merging. Both machines use successor address branch prediction as described in section 2.11.
[Figure 7.6: Expro Machine With and Without Instruction Run Merging (8-wide issue, 32 entry reorder buffer). The plot shows instructions per cycle versus instruction cache size for the run merging and no merging configurations.]
Each cache line has an additional field predicting the cache line to fetch next. For the Reorder Buffer model, the instructions after a branch are in the cache line and are delivered to the decoder for speculative execution. For the aligned expansion cache the expansion process terminates filling the cache line when the branch is encountered; any instructions following the branch are not packed into the same cache line for this model.

A 36% to 30% performance decrease is paid at the 4 Kbyte and infinite cache sizes, respectively, for not filling the expanded instruction until it is nearly full. This shows the importance of delivering to the decoder enough instructions to span multiple cycles worth of execution. The reservation stations and reorder buffer existing in the Expro model are effective only if they can be supplied with sufficient instructions to keep their utilization high. An Expro model not performing instruction run merging is not a good design tradeoff and is presented here only for completeness.

Reducing the average number of instructions delivered per cycle by removing instruction run merging from the expansion process has a large negative effect on performance. How the reduced instruction delivery rate affects unit utilization is shown in table 7.5.

Resources Limiting Additional Instruction Issue (percentage of decode cycles)

Benchmark   instruction    insufficient   insufficient       insufficient       reorder
            buffer empty   branch units   load/store units   functional units   buffer full
compress      100.00%          0.00%           0.00%              0.00%             0.00%
espresso      100.00%          0.00%           0.00%              0.00%             0.00%
fft           100.00%          0.00%           0.00%              0.00%             0.00%
gcc1           99.99%          0.00%           0.00%              0.00%             0.01%
spice3         99.09%          0.00%           0.00%              0.00%             0.91%
tex           100.00%          0.00%           0.00%              0.00%             0.00%
Average:       99.85%          0.00%           0.00%              0.00%             0.15%

Table 7.5: Non-merging Expro Model Instruction Issue Limiters (infinite expansion cache, 32 entry reorder buffer, 8-wide machine)

The reduced instruction bandwidth becomes the limiter for nearly 100% of the decode cycles. Clearly this is not a cost-effective design. Instruction run merging is an important component of the Expro design.
As shown in the previous subsection, expanded instruction caching with out-of-order issue machines is effective at improving performance when branch prediction and instruction run merging are implemented in the expander.
7.4 Summary

This chapter describes using instruction run merging for both the in-order issue Epic machine and the out-of-order issue Expro machine. Instruction run merging is the process used by the instruction expander to fill an expanded instruction with additional instructions from the target of a predicted branch. It both increases the expanded instruction cache's effective bandwidth and improves its line utilization. Instruction run merging implementation costs are low, as it does not require any additional overhead bits in the expanded instruction.

Instruction run merging extends the in-order issue Epic machine with a limited form of speculative execution. This is called in-order issue speculative execution and is inexpensive to implement in terms of hardware. The only buffering required to implement in-order issue speculative execution is the normal pipeline stages already present in the execution units. Adding instruction run merging and speculative execution to the Epic configuration results in up to a 15.8% performance improvement over a machine without instruction run merging. The Epic machine incorporating all instruction expansion features achieves a 37% performance improvement over the Simple in-order machine reference model.

Instruction run merging provides the out-of-order issue Expro machine with sufficient instruction bandwidth to achieve high performance. With instruction run merging the Expro machine is able to make effective use of a 32 entry reorder buffer and achieve an average performance of 2.3 instructions per cycle, a 43% performance improvement over the 16 entry Reorder Buffer machine reference model.

This chapter completes the series of analyses of the individual features of instruction expansion. Using cache alignment, multiple cycle packing, branch prediction, and instruction run merging together results in improved performance for both in-order issue machines and out-of-order issue machines.
Many variations are possible in the configuration of an expanded instruction cache machine. Varying the width of the expanded instructions, the number of execution units, or the number of out-of-order issue decode cycles may be dictated or allowed by the hardware technology used to implement a machine. The next chapter presents the performance impacts of varying these parameters.
Chapter 8

Machine Configuration Variations

This chapter studies some variations in configuration for the Epic and Expro machines. The previous chapters used a fixed execution hardware configuration having a maximum issue rate of 8 instructions per cycle and a fixed distribution of execution units: 2 load/store units, 4 functional units, and 2 branch units. With the available average instruction level parallelism for the selected benchmarks being low, probably around 3 instructions per cycle, an 8-wide machine may not be a cost-effective design. The motivation in the previous chapters for studying this abundant execution resource configuration is to understand the effects of the expanded instruction cache fetch mechanisms. This chapter presents performance variations of expanded instruction machines resulting from varying execution resources, allowing a more informed choice when deciding upon a cost-effective design. The first section of this chapter concentrates on the in-order issue Epic machine. The second section concentrates on the out-of-order Expro machine.
8.1 Variations in Epic's Configuration

To understand what variations in machine resource configuration are appropriate it is important to understand the utilization of the various machine units.
This section first presents the utilization of the execution resources for the 8-wide in-order issue Epic machine and then discusses changes in performance when varying the configuration.

Unit                  Utilization (8-Wide Epic)
Load/Store Unit 1            54%
Load/Store Unit 2            30%
Functional Unit 1            67%
Functional Unit 2            19%
Functional Unit 3             8%
Functional Unit 4             4%
Branch Unit 1                29%
Branch Unit 2                 2%

Table 8.1: In-order Issue Execution Unit Utilization During Execution Cycles

Table 8.1 presents the utilization of each execution unit during execution cycles for the 8-issue Epic machine. An execution cycle is any cycle when any execution unit is issued an instruction. Execution cycles do not include cache miss cycles, expansion cycles, or mispredicted branch recovery cycles. The Epic configuration is the full configuration including branch prediction and in-order issue speculative execution. The table shows that for an average of 54% of the execution cycles an instruction is issued to at least one load/store unit, and for 30% of the execution cycles an instruction is issued to both load/store units. Functional unit 1 is the most utilized, receiving an instruction during 67% of the execution cycles. The second branch unit is the least utilized, receiving an instruction for only 2% of the execution cycles.

The low utilization of the second branch unit is surprising. When examining a static code listing output by a compiler one finds many instruction execution paths containing more than one branch. For example, a subroutine return followed by a conditional branch based on the returned value can usually be packed into one expanded instruction. However, the static distribution of these execution paths is quite different from the dynamic distribution. Most of the dynamic execution time is spent in inner loops of the code and not in subroutine calls and returns.
The low number of cases when two branches are close to each other in the dynamic trace leads to the low utilization of the second branch unit.

Performance is reduced when there are insufficient resources to execute a given type of instruction. Table 8.1 shows that of the three execution unit types the load/store type is most frequently fully utilized: both load/store units are in use for 30% of the cycles. Varying the number of load/store units therefore has the largest impact on machine performance. The next subsection discusses this variation.
8.1.1 Varying the Number of Load/Store Units

Figure 8.1 displays the performance of the Epic machine while varying the number of load/store units. The graphs present the performance for removing a load/store unit from the original Epic configuration and for adding an additional unit. The data is for the full in-order issue Epic machine, including speculatively issued instructions. Removing a load/store unit from the machine creates a 7-issue machine and adding a load/store unit increases the instruction issue width to 9. The cache size axis is the instruction storage space for the original 8-wide Epic machine; the 7-wide and 9-wide machines have a 1/8 smaller and 1/8 larger instruction storage area, respectively.

Modifying the number of load/store units requires that the code scheduling also be modified to achieve the best performance. Rescheduling is not strictly required, because the instruction expander and the hardware interlocks applied during execution cycles ensure binary compatibility across the three different configurations. Nonetheless, performance suffers when scheduling assumes a different configuration than is used during execution. The performance reported in the graphs incorporates code scheduled for the number of load/store units available in the machine; for this study the benchmarks were recompiled before each simulation run.

The figure reports that removing a load/store unit results in a 23% to 24% performance penalty depending on cache size. This is somewhat less than might be expected using the data from table 8.1, which reports that both load/store units are busy for 30% of the cycles. There is not a 30% decrease in performance when one load/store unit is removed because the code was rescheduled to adjust for the reduced machine resources.
[Figure 8.1: Varying the Number of Load/Store Units (in-order model, cycle packing, 4 functional units, 2 branch units). The plot shows instructions per cycle versus instruction cache size for configurations with 1, 2, and 3 load/store units.]
The performance gained by adding a third load/store unit is also presented in figure 8.1. It shows adding a third unit attains only a 0.3% to 3% performance increase. This is much smaller than expected when basing the gain estimate on the performance lost when removing a unit. With the in-order issue of the Epic model there is little additional instruction level parallelism available for use by a third load/store unit. Two load/store units are needed to achieve good performance for the selected benchmarks on the in-order issue Epic machine; three units provide excess capacity.

This study assumes full functionality and independence of the two load/store units, as well as an infinite data cache. There are many possible variations on the load/store units that are beyond the scope of this study. Dependencies between load/store units, such as a dual banked cache, and finite size caches may influence the performance results. The best configuration for the two load/store units is left as a subject for further study.
8.1.2 Varying the Number of Functional Units

Varying the number of functional units available is another dimension of machine configuration. The hardware cost or payoff of adding or removing a functional unit is not as great as that of a load/store unit. Modifying the number of load/store units changes the number of ports on the data cache, and the data cache occupies a large area of a processor. Changing the number of functional units changes the number of ports on the register file. Adding the read and write ports to a register file needed to support an additional functional unit is much easier than adding a port to a data cache. Under these considerations it may be cost effective to retain a functional unit even when it is not highly utilized, because of its relatively low cost.

Figure 8.2 shows the performance of the in-order issue Epic machine with 2, 3, 4, and 5 functional units. The configuration of the Epic machine is the full in-order issue speculative execution version of Epic with 2 load/store units and 2 branch units. The benchmarks are recompiled for each machine configuration to achieve the best schedule for the available functional units. Little variation in average performance occurs over the range of 3 to 5 functional units.
[Figure 8.2: Varying the Number of Functional Units (in-order Epic model, cycle packing, 2 load/store units, 2 branch units). The plot shows instructions per cycle versus instruction cache size for configurations with 2, 3, 4, and 5 functional units.]
Benchmark    Instructions Per Cycle
             2 Functional Units    4 Functional Units
compress           1.53                  1.53
espresso           1.44                  1.48
fft                2.43                  2.88
gcc1               1.45                  1.44
spice3             1.41                  1.42
tex                1.57                  1.55
Average            1.64                  1.72

Table 8.2: Performance of Individual Benchmarks for Two Epic Configurations
Even when the number of functional units is reduced by half to only 2 units, the performance decrease is only 5% to 9%, depending on cache size. There is an interesting anomaly in the performance graph at the 4 Kbyte cache size: a configuration with 3 functional units performs slightly better than one with 4 functional units. A better cache hit rate, caused by slightly different instruction packing, is the reason for this anomaly.

Another consideration when judging the effectiveness of a configuration is the performance of the individual benchmarks and not just the average performance. Table 8.2 shows the performance of the individual benchmarks at the infinite cache size. The only benchmark having a sizable reduction in performance when reducing the number of functional units from 4 to 2 is the fft benchmark. A large amount of instruction level parallelism in this benchmark makes it more sensitive than the other benchmarks to a reduction in functional units. It is likely that some applications executed on a general purpose machine behave similarly to fft, so it is probably a good design choice to capitalize on a high level of instruction level parallelism when it is present.
8.1.3 Varying the Width of Epic

The previous sections show that underutilization of the execution units of an 8-wide machine is common. This raises the question of what the performance of a narrower Epic machine is. Figure 8.3 presents the performance of a 4-wide Epic machine, a 7-wide Epic machine with only 1 load/store unit, and the 8-wide Epic machine.
[Figure 8.3: Varying the Width of the Decoder (in-order Epic model, with cycle packing). The plot shows instructions per cycle versus instruction cache size; the key gives branch units/functional units/load-store units for the 1/2/1, 2/4/1, and 2/4/2 configurations.]
The Epic configuration includes full branch prediction and in-order speculative execution. The 4-wide machine can issue 1 branch instruction, 2 functional instructions, and 1 load/store instruction during each cycle. Comparing the left and right bars of each cache size group of the figure shows that changing from an 8-wide machine to a 4-wide machine results in about a 22% performance decrease, which is relatively independent of cache size. A change in the cache configuration that occurs when reducing the machine from 8-wide to 4-wide is that there are twice as many cache lines in the narrower machine. This does increase the area for cache tags, but that is not included in the cache size axis of the graph.

The center bar in each group is for a 7-wide machine supporting only 1 load/store instruction issue each cycle. This machine can additionally issue 2 branch instructions and 4 functional instructions each cycle. This bar demonstrates that most of the performance lost by reducing the issue width in half is due to removing one load/store unit. Another piece of information conveyed by this bar is that doubling the number of cache lines and halving their width does not have a major effect on the expanded cache's storage space efficiency when multiple cycle packing is used. The ratio of performance between the 4-wide machine and the 7-wide machine is about the same independent of cache size, even though the line width is about half the size.
8.1.4 Varying the Width of the Decoder and Cycle Packing

The previous section shows there is an advantage to having 8 execution units, with 2 load/store units being especially important. However, chapter 4 shows that using an 8-wide expanded cache causes low expanded instruction cache line utilization, and performance suffers at small cache sizes. Chapter 5 presents multiple cycle packing as a solution to this problem, but there are other approaches. One approach is using a narrower instruction cache than the number of execution units. This approach requires a routing network in the time sensitive path from the instruction cache to the execution units. Nevertheless, it is simple enough that it may be possible to implement it without extending the cycle time. The intention of the narrow cache is to improve the line utilization by adjusting its size to be closer to the average number of instructions issued in parallel.
[Figure 8.4: Varying the Width of the Expanded Cache (in-order model, 2 load/store units, 4 functional units, 2 branch units). The plot shows instructions per cycle versus instruction cache size for the single cycle 8-issue, 4-issue 8-wide, and 8-wide with cycle packing configurations.]
Figure 8.4 presents the performance of a 4-issue machine with 8 execution units, along with an 8-issue Epic machine without multiple cycle packing on the left and an 8-issue machine with multiple cycle packing on the right. The figure shows the 4-issue machine achieves better performance than the single cycle 8-issue Epic machine at small cache sizes. At large cache sizes the 4-issue machine has a slight performance disadvantage because of its lower peak issue rate. Comparing the figure's bar for the 4-issue machine to the bar for the multiple cycle packing Epic machine shows that multiple cycle packing always outperforms the 4-issue machine.

The main difference between these two configurations is the routing network delivering instructions from the cache to the execution units. The 4-issue machine has a 4 input to 8 output routing network, while the 8-issue multiple cycle Epic machine has an 8 to 8 routing network. The multiple cycle packed Epic machine's routing network requires more hardware to implement but has less cycle time pressure. The 4-issue machine's routing network has large cycle time pressure because it is in the critical path. The first cycle's static routing of the multiple cycle packed Epic machine reduces cycle time pressure by permitting a full cycle for routing the second cycle's instructions. The performance advantage with only a small increase in routing hardware makes the multiple cycle packing Epic machine a better design choice than limiting the issue width.

This completes the variations on the in-order issue Epic machine presented here. These variations show that a second load/store unit and multiple cycle packing are important for performance. The next section presents variations on the out-of-order issue machines.
8.2 Variations in Expro's Configuration

This section presents the performance of the out-of-order issue machines under various configuration variations. It covers variations in the size of the reorder buffer, the width of instruction issue, and the number of cycles required for instruction decode.
[Figure 8.5: Varying the Size of the Reorder Buffer and Reservation Stations (Reorder Buffer machine, 8-wide issue, 2 cycle decode). The plot shows instructions per cycle versus instruction cache size for 16, 32, and 64 entry reorder buffers.]
8.2.1 Varying the Size of the Reorder Buffer

Figure 8.5 presents the performance of the Reorder Buffer machine when varying the size of the reorder buffer and reservation stations over a wider range than was presented in section 7.3. This is the out-of-order issue machine not using an expansion cache. The figure shows that while there is about a 9% gain for doubling the reorder buffer size from 16 to 32 entries, there is almost no gain in doubling the size again to 64 entries.

When the reorder buffer size is increased, the sizes of the reservation stations in front of each execution unit are also increased. The sizes used for these simulations are as follows: the number of entries in each load/store reservation station is 1/2 the number of entries in the reorder buffer, and the number of entries in each functional unit and branch unit reservation station is 1/4 the number of entries in the reorder buffer. So in total there are 2 x 1/2 + (4 + 2) x 1/4 = 2 1/2 times as many reservation station entries as reorder buffer entries. This ratio of reservation station entries to reorder buffer entries is sufficient to almost always allow an instruction to be entered into a reservation station if there is an available slot in the reorder buffer. For other variations on the reorder buffer configuration see Johnson [Joh91].

Figure 8.6 shows the effects of increasing the reorder buffer size for the out-of-order issue Expro expansion cache machine. The ratios of reservation station sizes to reorder buffer size are the same as those used for the Reorder Buffer machine in the first part of this section. The graph shows 32 entries are also sufficient to capture almost all the instruction level parallelism the expanded instruction cache is able to provide. There is a very small increase in doubling the size to 64 entries and no additional increase in quadrupling that size to 256 entries. A reorder buffer larger than 32 entries is not a cost-effective design for an instruction cache delivering a maximum of 8 instructions per cycle.
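The sizing rule just described can be written out directly; the function name below is ours, but the ratios are the ones stated in the text.

    def reservation_station_entries(rob_entries,
                                    load_store_units=2, functional_units=4, branch_units=2):
        """Total reservation station entries for a given reorder buffer size.

        Each load/store station holds 1/2 of the reorder buffer entries; each
        functional and branch unit station holds 1/4, giving 2.5x the buffer size.
        """
        per_ls = rob_entries // 2
        per_other = rob_entries // 4
        return load_store_units * per_ls + (functional_units + branch_units) * per_other

    # e.g. a 32 entry reorder buffer is backed by 2*16 + 6*8 = 80 reservation station entries
    print(reservation_station_entries(32))   # -> 80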
8.2.2 Varying the Out-of-Order Decode Width

A wide decode width is difficult to implement because the dependency checking grows as order n^2 as the decode width n increases.
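One way to see the quadratic growth, under the simplifying assumption that each newly decoded instruction must be checked against every other instruction in the same decode group, is to count the instruction pairs:

\[
\binom{4}{2} = 6 \qquad \text{versus} \qquad \binom{8}{2} = 28 ,
\]

so doubling the decode width from 4 to 8 increases the number of cross-instruction dependency checks by roughly a factor of 4.7, before counting the duplication needed for multiple source operands.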
[Figure 8.6: Varying the Size of Expro's Reorder Buffer and Reservation Stations (Expro model, 8-wide issue, 2 cycle decode). The plot shows instructions per cycle versus instruction cache size for 16, 32, 64, and 256 entry reorder buffers.]
[Figure 8.7: Varying the Out-of-Order Decode Width (Expro out-of-order model, 32 entry reorder buffer, 2 load/store units, 4 functional units, 2 branch units). The plot shows instructions per cycle versus instruction cache size for 8-wide and 4-wide decode; the 8-wide configuration reaches about 2.32 instructions per cycle at the infinite cache size.]
Benchmark    Instructions Per Cycle          Change
             8-decode      4-decode
compress       2.12          1.96             -7.4%
espresso       2.06          1.90             -8.0%
fft            3.71          3.13            -15.6%
gcc1           2.14          1.89            -11.4%
spice3         1.86          1.77             -4.8%
tex            2.04          1.87             -8.1%
Average        2.32          2.09             -9.2%

Table 8.3: Expro Performance Decoding 8 and 4 Instructions per Cycle
It is useful to know the amount of additional performance a wider decoder is able to achieve. Figure 8.7 shows the variation in performance of the out-of-order issue Expro machine when the decoder width is 4 instructions and 8 instructions. There is a 32 entry reorder buffer and 2 load/store, 4 functional, and 2 branch units. Each cycle the 4-wide machine delivers up to 4 instructions to the decoders. The machine can still execute 8 instructions in one cycle if the reservation stations contain enough pending instructions of the required distribution.

The figure shows an 8% to 9% performance loss, depending on cache size, from reducing the peak number of instructions decoded each cycle from 8 to 4. However, the average may be misleading. Table 8.3 shows, for the infinite cache, the performance lost by each benchmark. The fft benchmark loses the most performance at about 16% and the spice3 benchmark loses the least at about 5%. When evaluating the required decoder width, careful consideration must be paid to which applications are expected to be executed on the machine.
8.2.3 Varying Decode Cycles on the Out-of-Order Model

All configurations of the out-of-order machine presented up to this point use 2 pipeline stages to decode and enter instructions into the reservation stations and reorder buffer. Section 2.3.5 justified the need for 2 stages based upon the Winograd bound and two-input gates. It may be possible to use high speed gates for the decoding hardware, or to approach the logic block diagram differently, and achieve a 1 cycle decode.
[Figure 8.8: Varying Decode Cycles for the Reorder Buffer Machine (16 entry reorder buffer, 8-wide decode). The plot shows instructions per cycle versus instruction cache size for 1 cycle and 2 cycle decode.]
It is informative to know the amount of performance that can be gained if enough technology and power are used to achieve a 1 cycle decode. Figure 8.8 presents the performance of the Reorder Buffer machine using 1 and 2 stages for the decoding of the instructions. There is a 7% to 9% performance improvement from removing one stage of the decode process. The performance improvement occurs because the mispredicted branch penalty is reduced from 3 cycles to 2 cycles. The performance increase is not as large as might be expected because the only time the extra decode cycle causes a performance penalty is during a mispredicted branch. The frequency of branches is in the range of 15% of instructions and the branch prediction rate is in the 85% range, so the extra cycle rarely causes a pipeline stall. It is likely to be very difficult to implement a single cycle decode with the complex dependency checking and resource allocation required by an out-of-order machine. The less than 10% performance increase offered by a one cycle decode makes it an unwise design choice.
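Taking the 15% branch frequency and 85% prediction rate quoted above as representative values, the cost of the extra decode stage per instruction is roughly

\[
\Delta \mathrm{CPI} \approx 0.15 \times (1 - 0.85) \times 1\ \text{cycle} \approx 0.023\ \text{cycles per instruction},
\]

only a few hundredths of a cycle per instruction, which is why shortening the decode pipeline buys so little.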
8.3 Summary

This chapter covers variations in configuration for both the in-order issue Epic machine and the out-of-order issue Expro and Reorder Buffer machines. These variations give insight into the appropriate design tradeoffs when the restrictions of the hardware technology are forced upon an implementation of a machine.

For the in-order issue 8-wide Epic machine, reducing the number of load/store units has the largest impact on performance: removing one load/store unit results in a 23% to 24% performance decrease. Adding a third load/store unit, or adding or removing a functional unit, has little impact on performance. Multiple cycle packing in the Epic machine is preferable to reducing its decode width from 8 to 4 instructions.

For the out-of-order issue 8-wide Expro machine and Reorder Buffer machine, a 32 entry reorder buffer and appropriately sized reservation stations are sufficient to avoid limiting the performance of the machines; reorder buffers larger than 32 entries achieve almost no increase in performance. Reducing the decode width from 8 to 4 instructions pays an average performance penalty of about 9%, but some programs with abundant instruction level parallelism pay more.
Reducing the decode pipeline from 2 cycles to 1 is a difficult task and offers less than a 10% performance gain, making it an unwise design tradeoff.
Chapter 9

Summary and Conclusions

This work demonstrates that an expanded instruction cache can reduce the complexity of the implementation hardware of an in-order issue superscalar machine. An in-order issue expansion cache machine achieves approximately the same performance as the complex out-of-order issue Reorder Buffer superscalar machine, assuming both machines can be implemented with the same cycle time. Comparing an in-order issue expansion cache machine to the Simple in-order issue reference model superscalar machine shows an average performance improvement of 37% with infinite sized instruction caches.

Alternatively, an expansion cache can replace the traditional cache in an out-of-order issue superscalar machine. In this case the expansion cache provides improved instruction bandwidth to the out-of-order issue decoder and permits exploiting additional instruction level parallelism. Comparing an out-of-order issue expansion cache machine to the reference out-of-order issue Reorder Buffer superscalar machine shows an average performance improvement of 43% with infinite sized instruction caches.

For an expanded instruction cache to be effective the performance gain must outweigh the performance lost due to decreased cache hit rates. This implies an expanded instruction cache should be large enough that its operation is in the flat part of the miss rate curve. Care must also be taken to control the duplication ratio and the line utilization to prevent inefficient use of the cache's storage area.
[Figure 9.1: Performance of Expansion Cache Features. The plot shows instructions per cycle versus instruction cache size for the in-order simple, align & pack, brn predict, and speculative configurations (bars) and the out-of-order RO16-line, RO16-expro, and RO32-expro configurations (lines).]
9.1 Expansion Cache Summary

An expanded instruction cache is able to improve decoder efficiency in the three areas of cache alignment, branch prediction, and instruction run merging. The expanded instruction cache is also able to align instructions with the required execution units and eliminate the need for a time consuming instruction routing network.

Figure 9.1 presents the performance of machines configured with each major feature supported by expansion caches, all together on a single graph. The in-order issue machines use bar graphs and the out-of-order issue machines use line graphs. The in-order issue reference machine is labeled simple, and the out-of-order issue Reorder Buffer reference machine is labeled RO16-line. All the other graphs show expansion cache machines.
cache feature. The two lines for the two out-of-order Expro models are for 16 and 32 entry reorder buffer sizes. All the machines have a peak issue rate of 8 instructions per cycle and an execution unit distribution of 2 load/store units, 2 branch units, and 4 functional units. The align & pack bar presents the in-order issue expansion cache machine equipped with just cache line alignment and multiple cycle packing as described in chapters 4 and 5. Because of less efficient use of the cache's storage this machine performs worse than the Simple reference model at small cache sizes. At large cache sizes the improved instruction bandwidth offered by the cache line alignment improves performance. The brn predict bar is for an in-order expansion cache machine equipped with alignment, cycle packing, and successor address branch prediction as described in chapter 6. Augmenting the expansion process with branch prediction delivers an average of 8% performance improvement. The speculative bar equips the machine with in-order issue speculative execution and improves average performance by an additional 16%. Chapter 7 discusses this simple-to-implement in-order issue speculative execution of instructions.
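As an illustration only, the following Python sketch shows what the per-line bookkeeping for these features might look like for an 8-instruction line. The field names and the data structure layout are assumptions introduced here; the set of fields mirrors the features quantified in the next subsection (Table 9.1).

    from dataclasses import dataclass
    from typing import List

    # Hypothetical layout of one expanded cache line for the in-order issue
    # machine.  Field names are illustrative; the features they stand for
    # (alignment, static routing order, cycle packing, successor-address branch
    # prediction, run merging) are the ones summarized in Table 9.1.

    @dataclass
    class ExpandedLine:
        tag: int                    # address tag, slightly widened for line alignment
        instructions: List[int]     # up to 8 expanded instructions, already placed
                                    #   in the slots of their execution units
        original_order: List[int]   # original program position of each slot, so the
                                    #   statically routed instructions retire in order
        cycle_boundary: List[bool]  # one bit per slot marking where a packed issue
                                    #   cycle ends (multiple cycle packing)
        successor_address: int      # predicted address of the next line to fetch
        # Run merging adds no bits: instructions beyond a predicted-taken branch
        # are simply expanded into the remaining slots for speculative issue.

In the Simple machine's traditional cache only the tag and the raw instruction words would be present; everything else in this sketch corresponds to the extra line width discussed next.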
9.1.1 Cost of an In-Order Issue Expansion Cache

One cost of the performance improvements offered by an expanded instruction cache is the increased width of the cache line. Table 9.1 summarizes the cost of each expansion cache feature. This table presents an in-order issue machine with an 8 instruction wide cache line. The cost column is the number of bits added to the line width by an expansion cache to support the feature. The gain column is the percentage of performance gain attained over the previous row. No entry exists for the first row because there is no previous row. No entry exists in the second, original order, row because this feature reduces required cycle time and does not have an impact on performance when assuming equal cycle times for all machines. No entry is present in the third, cycle packing, row because this feature improves cache line utilization but does not have a performance impact for the infinite cache reported by the table.
Feature                                            Cost (bits)   Feature's gain   Cumulative gain over Simple (infinite cache)
Increased cache tag width for alignment                 3             --                9%
Original order encoding to allow static routing        16             --                --
Cycle boundary bit for multiple cycle packing            8             --                --
Successor address for branch prediction                 30             8%               18%
Run merging for speculative execution                    0            16%               37%
Total                                                   57

Table 9.1: Overhead Bits for In-order Issue Expansion Cache Features
Multiple cycle packing reduces the cache size required to capture the working set by a factor of about 2 to 3 and should be used for all expansion cache implementations. The cumulative gain column shows the performance gain achieved when equipping a machine with all expansion cache features in all the above rows. This performance is relative to the Simple in-order issue reference machine. Again, there is no entry for the original order row because this feature addresses cycle time considerations. There is no entry for the cycle packing row because this feature addresses cache size considerations. The total overhead width for all in-order issue expansion cache machine features is 57 bits. The width of the cache line for the Simple machine is 32 bits times 8 instructions plus approximately 20 bits for the tag, for a total of 276 bits. The overhead added by instruction expansion is approximately a 21% increase in cache line width. Another cost of an expansion cache is the decreased utilization of the cache's storage area. Decreased utilization occurs because of instruction duplication and reduced line utilization. Chapter 5 reports an expansion cache contains about 13% duplicate instructions and has a decrease of 15% in line utilization. Combining these two storage inefficiencies results in requiring about a 30% larger expansion cache to capture the same working set as a traditional cache.
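A quick back-of-the-envelope check of these numbers is sketched below in Python. The bit counts come from Table 9.1 and the line width from the text above; treating the duplication and line-utilization losses as simply additive is an assumption made here only to show how a combined figure of roughly 30% can arise, not the dissertation's own derivation.

    # Back-of-the-envelope check of the in-order expansion cache costs quoted above.
    INSTRUCTION_BITS = 32        # instruction word size
    INSTRUCTIONS_PER_LINE = 8    # instructions per cache line
    TAG_BITS = 20                # approximate tag width for the Simple machine

    simple_line_bits = INSTRUCTION_BITS * INSTRUCTIONS_PER_LINE + TAG_BITS  # 276 bits
    overhead_bits = 3 + 16 + 8 + 30 + 0                                     # 57 bits (Table 9.1)

    print(f"Simple line width: {simple_line_bits} bits")
    print(f"expansion overhead: {overhead_bits} bits "
          f"({overhead_bits / simple_line_bits:.0%} wider)")                # about 21%

    # Storage-area inefficiency: roughly 13% duplicated instructions plus a 15%
    # drop in line utilization.  Adding the two small losses is a rough
    # first-order approximation, landing near the ~30% extra capacity quoted.
    duplication = 0.13
    utilization_loss = 0.15
    print(f"extra cache capacity needed: about {duplication + utilization_loss:.0%}")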
The final cost of an expanded instruction cache machine is the hardware for the expander itself. Because the expander processes only one instruction at a time it is small compared to the caches and execution units. Some of the expander's area cost is offset by reduced area for the instruction routing requirements.

Feature                                     Cost (bits)
Increased cache tag width for alignment          3
Successor address for branch prediction         30
Run merging for speculative execution            0
Total                                           33

Table 9.2: Overhead Bits for Out-of-Order Issue Expansion Cache Features
9.1.2 Cost of an Out-of-Order Issue Expansion Cache

The out-of-order issue Expro expansion cache machine has a complex instruction decoder requiring two pipeline stages for decoding. This extra time during decoding and the greater flexibility required by a dynamic issue decoder remove the need for static routing and the associated overhead bits. The expander simply packs the instructions into the expanded cache line in the original order, and an instruction's position in the cache line implies program order. Table 9.2 presents the overhead bit requirements for the out-of-order issue Expro machine. For out-of-order issue machines the expander performs cache line alignment, multiple cycle packing, branch prediction, and run merging. Only the alignment and branch prediction require overhead bits. This totals 33 bits of overhead for an 8 instruction wide expansion cache. However, any cost-effective out-of-order issue machine also requires some form of branch prediction. The Reorder Buffer reference machine uses successor address branch prediction requiring 33 bits in each cache line. Thus the expanded cache width overhead is about the same as the width requirements for branch prediction in the Reorder Buffer machine.
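To contrast this with the in-order line sketched earlier, the following is an illustrative Python view, assumed here rather than taken from the dissertation's implementation, of an Expro expanded line: the expander still aligns, packs, predicts, and merges, but because slot position already implies program order, only the alignment and successor-address fields add overhead bits.

    from dataclasses import dataclass
    from typing import List

    # Hypothetical expanded line for the out-of-order issue Expro machine.
    # No original-order or cycle-boundary fields are stored: instructions are
    # packed in fetch order, so a slot's index already gives program order and
    # the dynamic issue decoder re-derives any scheduling it needs.

    @dataclass
    class ExproLine:
        tag: int                  # address tag, slightly widened for alignment (~3 bits)
        instructions: List[int]   # up to 8 instructions packed in original fetch order
        successor_address: int    # predicted next fetch address (~30 bits)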
The main cost for an out-of-order issue expansion cache machine is the less efficient use of cache storage area. The duplication ratio and line utilization for in-order and out-of-order issue expansion caches are the same. An expanded instruction cache needs to be about 30% larger than a traditional cache to support the same working set. The out-of-order issue Expro machine gains a large part of its performance by using a larger reorder buffer and reservation stations. This is an increased hardware cost required to effectively use the increased instruction bandwidth delivered by the expanded instruction cache. As in the in-order issue case, the final cost of the expanded instruction cache machine is the hardware for the expander itself. Because the expander processes only one instruction at a time it is small compared to the rest of the machine.
9.2 Future Directions

Expansion caches introduced by this work are promising methods of improving superscalar machine performance, but there are still areas for future research. Some of these areas are introduced in this section. The first area of research is determining whether there are other effective methods for performing expansion caching. Can some of the overhead be reduced? Can the duplication ratio be better controlled? Can line utilization be improved? Are there better machine configurations than the ones presented here? The focus of this work is simplifying instruction decoding, or making decoding more efficient in the case of out-of-order issue machines. Little effort was invested in researching possible expansion cache methods for reducing the complexity or controlling flow in the data portion of the machine. For example, the register file may have a limited number of ports. During instruction expansion it may be possible to analyze the register addresses used by several instructions and combine some of the reads into a single access. It may be possible to bypass results directly back into execution units and eliminate certain result write backs. For out-of-order issue machines it may help decoding if some of the resource requirements are pre-decoded.
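As one hedged illustration of the kind of data-side analysis suggested above, and not something implemented or evaluated in this work, an expander could check whether the instructions packed into one issue cycle can share register file read ports by coalescing duplicate source reads. A minimal Python sketch:

    # Illustrative only: coalescing duplicate register reads within one packed
    # issue cycle so that the group fits a limited number of read ports.

    def reads_fit_ports(cycle_group, num_read_ports):
        """cycle_group: one packed issue cycle, given as a list of instructions,
        each instruction being a list of its source register numbers.
        Returns (fits, unique_reads): whether the distinct source registers can
        be serviced with `num_read_ports` reads when duplicates share one read."""
        unique_reads = set()
        for sources in cycle_group:
            unique_reads.update(sources)
        return len(unique_reads) <= num_read_ports, sorted(unique_reads)

    # Example: three instructions reading (r1, r2), (r2, r3), and (r1, r3) name
    # six source operands but only three distinct registers, so they fit a
    # 4-port register file once duplicate reads are combined.
    print(reads_fit_ports([[1, 2], [2, 3], [1, 3]], num_read_ports=4))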
Another area of research, especially for in-order issue machines, is improving the software scheduling algorithms beyond just rearranging instructions within basic blocks. The compiler used for this research is limited to basic block scheduling and uncovers only a limited amount of the instruction level parallelism. Sophisticated compilers are able to analyze across multiple basic blocks and may uncover additional instruction level parallelism. For example, the Epic machine does not take advantage of any parallelism between iterations of a loop. Since Epic does not support register renaming it encounters dependencies between loop iterations. If the compiler performs additional loop unrolling and code rearrangement it is possible to more efficiently utilize the execution units of the Epic machine. Finally, implementing an expansion cache machine in a VLSI technology gives definitive answers to some of the timing assumptions used by this analysis. Does the wider expansion cache have an impact on the cycle time? Can an out-of-order decoder be built in one or two pipeline stages? Is the register data flow timing requirement so large that simplifying the decode process is unnecessary? These and other important questions would be answered when an expanded instruction cache machine is implemented in hardware.
9.3 Conclusions

Expanded instruction caching is an approach for improving performance in superscalar machines. It capitalizes on opportunities presented when the information in the instruction cache is not an exact copy of the main memory data. Ordinarily the cached representation is larger than the main memory representation; the term expansion cache is therefore used to describe this general caching technique. An expansion cache with an in-order issue superscalar machine provides up to a 37% improvement over the Simple reference machine. This comes at the cost of a 22% wider cache line and about 30% less effective use of the cache's storage area. This expansion cache machine configuration, called Epic, performs approximately the same as a much more complicated out-of-order issue Reorder Buffer superscalar machine using a traditional cache. An alternative method of employing an expansion cache is replacing the traditional
cache of an out-of-order issue superscalar machine. The more complex dynamic issue decoding allows reducing the expansion overhead bits to about the same amount of hardware as used for branch prediction. An expansion cache improves the performance of an out-of-order issue superscalar machine with a 16 entry reorder buffer by 23% because of increased effective instruction cache bandwidth. Doubling the size of the reorder buffer and reservation stations increases the performance improvement to up to 43%. Expansion cache implementations must control the duplication ratio and line utilization of the expanded instructions. Duplication ratios must be kept small; otherwise the decreased cache efficiency will overwhelm the performance gains. Compiler instruction scheduling is needed for good expansion cache performance, especially for in-order issue machines. Expanded instruction caching is a promising organization for future superscalar machines. Studies should continue into the features presented here along with other refinements and performance improvements.