ALEXANDRIA UNIVERSITY FACULTY OF ENGINEERING
PROCESSOR PERFORMANCE MODELS
A thesis submitted to Computer Science and Automatic Control Department in partial fulfilment of the requirements for the degree of
Master of Science (Computer Science)
By Ahmed Hazem El-Mahdy
Registered: September, 1995 Submitted: May, 1998
C. V.
Name: Ahmed Hazem El-Mahdy
Date of Birth: 1/8/1972
Place of Birth: Alexandria, Egypt
Nationality: Egyptian
Email: [email protected]
Home Address: Villa Hanno, Ahmed Farouk Street, Smouha, Alexandria, Egypt.
Office Address: Computer Science and Automatic Control Department, Faculty of Engineering, Alexandria University, Alexandria, Egypt.
Current Profession: Teaching Assistant at: Computer Science and Automatic Control Department, Faculty of Engineering, Alexandria University.
Educational Record:
• General Certificate of Secondary School Education: 1990, El-Nasr Boys' School, Alexandria
• B.Sc. (hon.): 1995, Computer Science and Automatic Control Department, Faculty of Engineering, Alexandria University
Professional Memberships:
• Student member of the ACM
• Student member of the IEEE Computer Society
SUPERVISORS
Prof. Dr. Mohamed Salah Selim Associate Professor at: Computer Science and Automatic Control Department, Faculty of Engineering, Alexandria University, Alexandria, Egypt.
Prof. Dr. Ahmed Abdou El-Nahass Associate Professor at: Computer Science and Automatic Control Department, Faculty of Engineering, Alexandria University, Alexandria, Egypt.
ABSTRACT
In this thesis, we introduce two new analytical models for predicting the performance of existing processor architecture types under a Multimedia workload. The purpose of these models is to determine which architecture features have major effects on processor performance. The majority of existing processor architecture types fall into two classes: simple-issue and complex-issue processors. In this respect, we focus on the Pentium and Pentium II processors as typical architecture types for these two classes respectively. Our models lie between simple models, which assume uniform instruction-level parallelism, and trace-oriented models, which assume the availability of a detailed characterisation of the behaviour of a running program on a certain architecture. The former are not very realistic, while the latter demand large processing times and do not permit parametric studies. We present the models and show how the required parameters are evaluated. We have conducted experiments on both the Pentium and the Pentium II processors with the aid of their built-in performance monitors to assess the accuracy of our models on Multimedia applications. For these applications, the detailed models predicted performance within a 10 percent margin of the actual results. As part of this study, we characterise the effects of new Multimedia applications on processor architecture.
ACKNOWLEDGMENTS
I am deeply indebted to Professor Salah Selim for the constant guidance he provided me in the course of this research. He shared with me many of his ideas, and gave extremely insightful advice aimed at identifying significant ideas and suppressing insignificant ones. I am deeply indebted to Professor Ahmed El-Nahass for his help, encouragement and advice throughout the course of this research as well as my graduate studies, and for the opportunity to do research with him. I am grateful to all staff members of the computer department for their continuous encouragement and unfailing advice, starting from my early undergraduate studies, for which I will always be indebted. The support of many colleagues will be long remembered; I thank them all for sharing with me their spirit of discovery. Most of all, thanks to my family for all their encouragement throughout the work. Without their support, this work would not have been possible.
TABLE OF CONTENTS
C. V. ......... II
SUPERVISORS ......... III
ABSTRACT ......... IV
ACKNOWLEDGMENTS ......... V
TABLE OF CONTENTS ......... VI
LIST OF FIGURES ......... IX
LIST OF TABLES ......... X
GLOSSARY ......... XI
LIST OF SYMBOLS USED ......... XII
CHAPTER 1: INTRODUCTION ......... 1
1.1 THE NEED FOR ANALYTICAL MODELS ......... 1
1.2 SCOPE OF THE WORK ......... 2
1.3 THESIS ORGANISATION ......... 2
CHAPTER 2: BACKGROUND ......... 4
2.1 INSTRUCTION PROCESSING MODEL ......... 4
2.2 PROGRAM DEPENDENCIES ......... 5
2.2.1 Data Dependencies ......... 5
2.2.1.1 True Data Dependency ......... 5
2.2.1.2 Storage Dependencies (Data anti-dependencies) ......... 5
2.2.1.3 Resource Dependencies (Structural hazards) ......... 6
2.2.2 Control Dependencies ......... 6
2.3 PROCESSING PHASES ......... 6
2.3.1 Instruction Fetch Phase ......... 7
2.3.2 Instruction Dispatch Phase ......... 7
2.3.3 Instruction Issue and Execute Phase ......... 7
2.3.4 Instruction Commit Phase ......... 8
CHAPTER 3: LITERATURE SURVEY ......... 9
3.1 SUPERSCALAR TECHNIQUES ......... 9
3.1.1 Historical Perspective ......... 9
3.1.2 Recent Superscalar Processors ......... 10
3.2 PERFORMANCE MODELS ......... 12
3.2.1 Performance Analysis Fundamentals ......... 13
3.2.2 The Piecewise Linear Model ......... 15
3.2.3 An Instruction-Level Parallelism Model ......... 15
CHAPTER 4: SIMPLE-ISSUE SUPERSCALAR MODEL (PENTIUM MODEL) ......... 17
4.1 PENTIUM ARCHITECTURE ......... 17
4.1.1 Execution Pipelines ......... 17
4.1.2 Memory Interface ......... 20
4.1.3 Branch Prediction ......... 20
4.1.4 Data Cache Consistency Protocol (MESI Protocol) ......... 22
4.2 THE PENTIUM PROCESSOR MODEL ......... 24
4.2.1 Model Assumptions and Definitions ......... 24
4.2.1.1 Instruction Element ......... 25
4.2.1.2 Cache Model ......... 25
4.2.1.3 Write-back Buffer Model ......... 25
4.2.1.4 Write-buffer Effects on Memory Operations ......... 27
4.2.1.5 Execution Element ......... 28
4.2.2 CPI Components ......... 29
4.2.2.1 Execution Busy (EBusy) ......... 29
4.2.2.2 Execution Idle Due to Limited ILP (EIdle_ILP) ......... 30
4.2.2.3 Execution Idle Due to Address Generation Interlock (EIdle_AGI) ......... 30
4.2.2.4 Execution Idle Due to Branch Miss-prediction (EIdle_Branch) ......... 30
4.3 EVALUATING MODEL PARAMETERS ......... 31
4.3.1 Model Parameters ......... 31
4.3.2 Pentium Performance Counters Overview ......... 32
4.3.3 CPI Evaluation Procedure ......... 34
CHAPTER 5: CHARACTERISING MULTIMEDIA APPLICATIONS ......... 39
5.1 WHY CHOOSING MULTIMEDIA APPLICATIONS ......... 39
5.2 MULTIMEDIA BENCHMARK PROGRAMS ......... 39
5.3 BENCHMARKING METHODOLOGY ......... 40
5.4 WORKLOAD PARAMETER VALUES ......... 41
5.5 CPI BREAKDOWN ......... 42
5.6 SUMMARY OF MULTIMEDIA CHARACTERISTICS ......... 45
CHAPTER 6: PERFORMANCE PREDICTION FOR THE PENTIUM MMX ......... 47
6.1 PENTIUM MMX ARCHITECTURE ENHANCEMENTS ......... 47
6.2 MODEL MODIFICATIONS ......... 47
6.3 PREDICTING THE CPI ......... 48
6.3.1 Target System Configuration ......... 48
6.3.2 Parameters ......... 48
6.3.3 Calculating CPI ......... 49
6.3.4 Results ......... 49
CHAPTER 7: COMPLEX-ISSUE SUPERSCALAR MODEL (PENTIUM II MODEL) ......... 51
7.1 PENTIUM II ARCHITECTURE ......... 51
7.1.1 Fetch/Decode Unit ......... 52
7.1.2 Dispatch/Issue/Execution Unit ......... 52
7.1.3 Commit Unit (Retire Unit) ......... 53
7.1.4 Memory Interface ......... 53
7.2 PROPOSED COMPLEX-ISSUE MODEL (PENTIUM II) ......... 55
7.2.1 Instruction Fetch/Decode Unit ......... 55
7.2.1.1 Assumptions and Definitions ......... 55
7.2.1.2 The Model ......... 56
7.2.2 The Dispatch/Issue/Execution Unit ......... 58
7.2.2.1 Calculating Execution Rate ......... 64
7.2.3 Commit Unit ......... 65
CHAPTER 8: PREDICTING PENTIUM II PERFORMANCE ......... 66
8.1 EVALUATING MODEL PARAMETERS ......... 66
8.1.1 Workload Dependent Parameters ......... 67
8.1.2 Architecture Dependent Parameters ......... 68
8.2 PERFORMANCE PREDICTION ......... 71
CHAPTER 9: COMPLEX-ISSUE REDUCTION TO SIMPLE-ISSUE MODEL ......... 73
9.1 COMPLEX-ISSUE MODEL REDUCTIONS ......... 73
9.2 PENTIUM PERFORMANCE PREDICTION ......... 75
9.2.1 Model Parameters ......... 76
9.2.1.1 Workload Dependent Parameters ......... 76
9.2.1.2 Architecture Dependent Parameters ......... 76
9.2.2 Performance Prediction Results ......... 78
CHAPTER 10: CONCLUSIONS AND FUTURE WORK ......... 80
10.1 CONCLUSIONS ......... 80
10.2 FUTURE WORK ......... 81
APPENDIX A: IDENTIFYING SOME PENTIUM PERFORMANCE COUNTERS ......... 82
A-1 INTRODUCTION ......... 82
A-2 THE EXPERIMENT ......... 82
A-3 CONCLUSION ......... 83
APPENDIX B: PREDICTING PENTIUM MMX PERFORMANCE WITH L2 CACHE ......... 84
B-1 THE EFFECT OF L2 CACHE ......... 84
B-2 PENTIUM MMX PERFORMANCE PREDICTION USING SIMPLE-ISSUE MODEL ......... 85
B-3 PENTIUM MMX PERFORMANCE PREDICTION USING REDUCED COMPLEX-ISSUE MODEL ......... 86
BIBLIOGRAPHY ......... 87
LIST OF FIGURES
Number    Page
FIGURE 3-1: PROCESSOR ARCHITECTURE OF A TYPICAL SUPERSCALAR PROCESSOR ......... 10
FIGURE 3-2: GENERIC PROCESSOR MODEL ......... 13
FIGURE 3-3: JOUPPI'S PIECEWISE LINEAR MODEL ......... 15
FIGURE 4-1: PENTIUM PROCESSOR ARCHITECTURE ......... 19
FIGURE 4-2: BRANCH PREDICTION FLOWCHART ......... 21
FIGURE 4-3: BRANCH PREDICTION MECHANISM ......... 22
FIGURE 4-4: PENTIUM PERFORMANCE MODEL ......... 24
FIGURE 4-5: WRITE-BACK BUFFER MODEL ......... 25
FIGURE 4-6: EFFECTIVE MEMORY PENALTY VS. WRITE RATE ......... 26
FIGURE 5-1: CPI BREAKDOWN FOR P100 ......... 43
FIGURE 5-2: CPI BREAKDOWN FOR P133 ......... 44
FIGURE 6-1: PERFORMANCE PREDICTION FOR P166 ......... 50
FIGURE 7-1: PENTIUM II PROCESSOR ARCHITECTURE ......... 54
FIGURE 7-2: STATE TRANSITION DIAGRAM FOR FETCH UNIT ......... 57
FIGURE 7-3: STATE TRANSITION RATE DIAGRAM OF THE INSTRUCTION QUEUE OF DISPATCH/ISSUE/EXECUTE UNIT ......... 59
FIGURE 7-4: INSTRUCTION DATA-DEPENDENCY ......... 61
FIGURE 7-5: INSTRUCTION RATE VS. WINDOW SIZE WHILE CHANGING THE ISSUE PROBABILITY ......... 64
FIGURE 7-6: COMMIT UNIT MODEL ......... 65
FIGURE 8-1: PENTIUM II PREDICTED VS. ACTUAL CPI ......... 72
FIGURE 9-1: MAPPING OF COMPLEX-ISSUE MODEL TO PENTIUM ARCHITECTURE ......... 74
FIGURE 9-2: PREDICTED VS. ACTUAL CPI ......... 79
FIGURE B-1: EFFECT OF L2 CACHE ON CPI ......... 84
FIGURE B-2: PERFORMANCE PREDICTION FOR P166 ......... 85
FIGURE B-3: PREDICTED VS. ACTUAL CPI FOR P166 ......... 86
LIST OF TABLES
Number    Page
TABLE 3-1: CHARACTERISTICS OF RECENT HIGH-PERFORMANCE PROCESSORS ......... 11
TABLE 4-1: THE MESI STATES ......... 23
TABLE 4-2: WORKLOAD DEPENDENT PARAMETERS ......... 31
TABLE 4-3: ARCHITECTURE DEPENDENT PARAMETERS ......... 32
TABLE 4-4: CPI COMPONENTS ......... 32
TABLE 4-5: PENTIUM PERFORMANCE MONITORS EVENTS ......... 33
TABLE 5-1: BENCHMARK PROGRAMS ......... 40
TABLE 5-2: SYSTEMS USED IN MULTIMEDIA PERFORMANCE ANALYSIS ......... 41
TABLE 5-3: WORKLOAD PARAMETERS DESCRIPTION ......... 42
TABLE 5-4: WORKLOAD PARAMETERS ......... 42
TABLE 5-5: RESULTS FOR P100 ......... 43
TABLE 5-6: RESULTS FOR P133 ......... 44
TABLE 6-1: INPUT PARAMETERS (WORKLOAD DEPENDENT) ......... 48
TABLE 6-2: INPUT PARAMETERS (ARCHITECTURE DEPENDENT) ......... 49
TABLE 6-3: PERFORMANCE PREDICTION RESULTS FOR P166 ......... 49
TABLE 8-1: PENTIUM II TEST SYSTEM ......... 66
TABLE 8-2: COMPLEX-ISSUE PARAMETERS DESCRIPTION ......... 67
TABLE 8-3: BASIC BLOCK PARAMETER VALUES ......... 68
TABLE 8-4: MODEL PARAMETERS ......... 71
TABLE 8-5: PREDICTED VS. ACTUAL CPI ......... 72
TABLE 9-1: MODEL PARAMETERS FOR THE MULTIMEDIA WORKLOAD ......... 78
TABLE 9-2: PREDICTED VS. ACTUAL CPI ......... 79
TABLE B-1: PERFORMANCE PREDICTION RESULTS FOR P166 ......... 85
TABLE B-2: PREDICTED VS. ACTUAL CPI ......... 86
GLOSSARY
Basic block: a contiguous block of instructions, with a single entry point and a single exit point.
Dynamic instruction stream: the sequence of executed instructions.
Instruction control dependency: the situation where the order of execution of instructions cannot be determined before run-time.
Instruction data dependency: the ordering relationship between instructions.
Instruction dispatch: the dissemination of instructions into shelving buffers for later issuing and execution.
Instruction issue: the dissemination of dependency-free instructions from shelving buffers to execution units.
Instruction-level parallelism (ILP): the maximum number of instructions that can be simultaneously executed in the pipeline.
Micro operations (µops): RISC-like operations obtained by the conversion of CISC instructions in the Pentium II core.
Processor architecture: the hardware organisation of a microprocessor, also cited as microarchitecture in the literature.
Processor's precise state: the architecture-visible registers and memory.
Shelving buffers: buffers, associated with hardware functional units, that keep track of instruction data availability and are used to issue instructions.
Speculative execution: fetching and executing instructions from the predicted path of a conditional branch.
Superpipelining: a technique used to increase the clock rate of a pipeline by decomposing its stages into multiple stages.
Superscalar processing: the ability to initiate multiple instructions during the same clock cycle.
Window of execution: the full set of instructions that may be simultaneously considered for parallel execution.
LIST OF SYMBOLS USED
pr              The probability that a read miss occurs
pw              The probability that a write miss occurs
pb              The probability that the write buffer is full
ps              The probability that a write hits a Shared line (in the MESI protocol)
pw_m/e          The probability that a write hits a Modified/Exclusive line (in the MESI protocol)
pIO             The probability that an IO read or write cycle occurs
pi, i=1,2,3,4   The probability that the write buffer queue size is i
pu, pv          The probability that an instruction is executed in the u-pipe or v-pipe respectively
pB              The probability that a branch prediction is correct
twb             The time, in cycles, per instruction in which the processor is stalled due to full write buffers
tmw             The average memory write penalty rate (cycles per instruction)
CPI             Performance metric (Cycles Per Instruction)
λ               Request arrival rate (requests per cycle)
ILPaverage      Average number of instructions issued in the Pentium
pILP            The probability that two instructions are independent
nb              Number of basic blocks fetched in a cycle
nf              Number of cycles required to fetch a block assuming a perfect cache
pI              Instruction cache hit ratio
Cm              Instruction cache miss penalty
Cb              Branch miss-prediction penalty
nILP            Average ILP assuming an infinite issue width
pFBusy,i        Probability that function unit i is busy
pAi,j           Probability that the age of the instruction at function unit i is j
pDj             Probability that two instructions with age difference j are independent
pi,k            Probability that function unit i is executing an instruction with latency ti,k
ti,j            Execution latency no. j of function unit i
pissue          Probability that an instruction is issued
CHAPTER 1: INTRODUCTION Modern processors have become very complex at both the technology and architecture levels. This complexity stems from sustaining the exponential rate of processor performance improvement (~60% per year). Although technology is an important driving force, it is no longer the leading factor. Starting in the mid-1980s, the increase in processor performance has been attributed more to architectural innovations [30]. If today's highest-performance processors relied solely upon technology improvements, they would be five times slower [35]. As noted by Flynn [16], who classified computers with respect to their data and instruction streams in 1966 (SISD, MISD, SIMD, MIMD), SISD processors have shown the greatest performance improvements. Furthermore, the cost of computation has decreased by a factor of one million since 1966. Improvements in process technology, starting in 1971, have followed Moore's law, doubling the number of transistors per processor every 18 months [58]. This has presented a flexible design environment for the architect. Consequently, the processor architecture field has matured from primitive scalar processors (serial execution structures, simple pipelining) to more sophisticated superscalar architecture types. These architectures analyse the dynamic instruction stream and construct dynamic instruction execution schedules that exploit instruction-level parallelism (ILP).
1.1 The Need for Analytical Models Evaluating the performance of current processors is a complex task. Performance models fall into three categories: analytical models, simulation models, and physical models, arranged in increasing order of accuracy and cost. These three alternatives complement each other in the development process. At the early processor design stages, detailed simulation experiments are very time-consuming, and experiments on physical models are impossible. Instead, analytical models provide approximate (and quick) calculations that answer the basic performance trade-offs sought at this initial stage (parametric studies).
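To illustrate the kind of quick parametric calculation an analytical model affords, consider the following minimal sketch. The decomposition (an execution CPI plus a memory-stall term built from a reference rate, miss ratio, and miss penalty) and all numeric values are illustrative assumptions only, not the models developed in this thesis:

```python
# Illustrative sketch: CPI as an execution component plus a memory-stall
# component, used for a quick parametric study. All parameter names and
# values are hypothetical.

def cpi(cpi_exec, miss_ratio, mem_refs_per_instr, miss_penalty):
    """CPI = execution CPI + memory stall cycles per instruction."""
    return cpi_exec + mem_refs_per_instr * miss_ratio * miss_penalty

# Parametric study: how does CPI respond to a growing cache miss penalty?
for penalty in (10, 20, 40, 80):
    print(penalty, cpi(cpi_exec=1.2, miss_ratio=0.05,
                       mem_refs_per_instr=0.4, miss_penalty=penalty))
```

A sweep like this takes microseconds, whereas obtaining the same curve from detailed simulation would require one long run per design point.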
Many performance models have been proposed in the literature. However, most of them are relatively old and do not reflect the demands of new applications (such as Multimedia)1. Moreover, the recent ones have not tackled Multimedia applications [31].
1.2 Scope of the Work Major existing processor architecture types are classified as simple-issue or complex-issue. Processors in the simple-issue class are characterised by using a small set of superscalar techniques (discussed in chapter 4). Processors in this class include the Intel Pentium, Sun UltraSPARC, Cyrix 5x86, and IBM/Motorola PowerPC 601. Processors in the complex-issue class are characterised by using aggressive superscalar techniques (discussed in chapter 7). Processors in this class include the Intel Pentium II, MIPS R10000, IBM/Motorola PowerPC, and AMD K5. In this thesis, we introduce two new analytical models for predicting the performance of two major processors in the aforementioned classes. We have chosen the Pentium and the Pentium II2 from the simple-issue and complex-issue classes respectively. The reasons for this choice are the high popularity of these processors, their availability to us in the labs, and the fact that most existing superscalar techniques are incorporated in them. The performance prediction is done in the context of Multimedia applications; in this respect, we have selected a set of Multimedia programs as our benchmark.
1.3 Thesis Organisation This thesis is organised as 10 chapters and 2 appendices.
• Chapter 1: The introduction.
• Chapter 2: The background of the thesis.
• Chapter 3: The work related to the thesis.
• Chapter 4: Proposed Pentium simple-issue model.
• Chapter 5: Characterisation of Multimedia workload.
1 The memory component is usually ignored, due to the high cache hit ratios present in other benchmark programs. In Multimedia applications, however, the memory component is a very significant part of the CPI (about 50%).
2 From here on, we will refer to the Intel Pentium as Pentium, and the Intel Pentium II as Pentium II.
• Chapter 6: Accuracy assessment of the simple-issue model by comparing actual versus predicted performance results.
• Chapter 7: Proposed Pentium II complex-issue model.
• Chapter 8: Accuracy assessment of the complex-issue model by predicting the performance of the Pentium II processor and comparing the results with experimental results.
• Chapter 9: Accuracy assessment of the complex-issue model by reducing it to the simple-issue model and predicting the performance of the Pentium processor.
• Chapter 10: The conclusions and future extensions to the work.
• Appendix A: Some details concerning the identification of some Pentium performance monitoring counters.
• Appendix B: Accounting for the L2 cache in predicting the performance of the Pentium processor.
CHAPTER 2: BACKGROUND Existing processor architecture types are influenced by a set of instruction-level parallelism methods (discussed later in this chapter) including dynamic scheduling, speculative execution, out-of-order instruction issue, and register renaming [22, 28, 39, 8, 13, 9, 57, 51, 52, 41, 42]. Different subsets of this set are integrated in what is collectively known as the superscalar processor architecture [46]. The basic goal of this architecture is to exploit instruction-level parallelism (ILP). In this chapter, we briefly describe the background of the superscalar architecture approach. We point out the constraints imposed by preserving the binary compatibility of legacy instruction sets, and then discuss how the superscalar approach meets these constraints while exploiting as much ILP as possible. More details can be found elsewhere [46, 35, 56, 19].
2.1 Instruction Processing Model One of the advantages of the superscalar processor approach is that it preserves the binary compatibility of legacy instruction sets, that is, the ability to execute a machine-language program written for an earlier generation of the same processor. However, the sequencing model inherent in these old instruction sets necessitates maintaining a sequential execution model. In this model, an instruction counter is used to fetch a single instruction from memory. After execution, the counter is either incremented to the next instruction or, in the case of a branching instruction, updated with the branch target address. The superscalar processor gives the appearance of a sequential processor while taking every possible opportunity to execute instructions concurrently. Should the program be interrupted, the precise state must be captured: the state that would be present had the processor followed a sequential execution model. At the end of the interrupt, the interrupted instruction sequence is resumed.
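The sequential execution model described above can be sketched as a fetch-execute loop driven by an instruction counter. The instruction format and register names below are invented purely for illustration and do not correspond to any real instruction set:

```python
# Toy sequential execution model: a single instruction counter fetches one
# instruction at a time; after execution it is incremented, or updated with
# a branch target address for a branching instruction.

def run(program, regs):
    counter = 0
    while counter < len(program):
        op, *args = program[counter]
        if op == "add":                       # regs[dst] = regs[a] + regs[b]
            dst, a, b = args
            regs[dst] = regs[a] + regs[b]
            counter += 1                      # fall through to next instruction
        elif op == "jnz":                     # branch if regs[reg] != 0
            reg, target = args
            counter = target if regs[reg] != 0 else counter + 1
        else:
            raise ValueError(op)
    return regs

# r1 counts down from 2; the loop body executes twice before the branch
# falls through.
program = [
    ("add", "r1", "r1", "rm1"),   # r1 = r1 + (-1)
    ("jnz", "r1", 0),             # loop back while r1 != 0
]
print(run(program, {"r1": 2, "rm1": -1}))   # {'r1': 0, 'rm1': -1}
```

Every architectural update here happens strictly in program order, which is exactly the appearance a superscalar processor must preserve while executing concurrently underneath.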
2.2 Program Dependencies The problem the superscalar processor faces is converting a sequential program into a parallel one while maintaining the appearance of the sequential execution model. Sequentially executing a program imposes many dependencies; the goal of the superscalar processor is to remove as many of these dependencies as possible. Some dependencies are inherent in the execution model and cannot be removed. Others can be removed by incorporating complex techniques whose benefits depend on the degree of ILP: if the degree of ILP is quite small, these techniques give diminishing returns, and vice versa. Program dependencies are classified into data dependencies and control dependencies. 2.2.1 Data Dependencies 2.2.1.1 True Data Dependency A true data dependency is present if an instruction depends on the result of another instruction. This dependency is called read-after-write (RAW). There is no way to remove this type of data dependency, and consequently a strict sequential execution3 of these two instructions must be followed. 2.2.1.2 Storage Dependencies (Data anti-dependencies) This type of data dependency occurs due to limited physical storage. In such a case, an instruction is delayed until the other instruction using the same storage location has completed execution. This dependency is further classified into write-after-read (WAR) and write-after-write (WAW). Strict sequential execution of instructions naturally removes this dependency, even when pipelining is used. This type of data dependency can be removed using the register renaming technique, in which a logical register is mapped to a physical register number. There are more physical registers than logical registers; for example, the R10000 maintains 32 logical registers and 64 physical registers.
3 By strict sequential order, we mean that the actual execution order must be sequential, not merely its appearance.
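The register renaming technique described above can be sketched in a few lines. The following is an illustrative toy, not the R10000's actual mechanism: instructions are (destination, source, source) triples of logical register numbers, and each write is given a fresh physical register, which removes WAR and WAW dependencies.

```python
# Illustrative register renaming: every destination gets a fresh physical
# register, so only true (RAW) dependencies remain between instructions.
def rename(instructions, num_logical=32):
    """instructions: list of (dest, src1, src2) logical register numbers.
    Returns the same instructions rewritten with physical registers."""
    mapping = {r: r for r in range(num_logical)}  # current logical->physical map
    next_free = num_logical                       # next unused physical register
    renamed = []
    for dest, src1, src2 in instructions:
        s1, s2 = mapping[src1], mapping[src2]     # read sources via current map
        mapping[dest] = next_free                 # allocate a fresh register
        renamed.append((next_free, s1, s2))
        next_free += 1
    return renamed

# r1 = r2+r3 ; r2 = r4+r5 (WAR on r2) ; r1 = r6+r7 (WAW on r1)
prog = [(1, 2, 3), (2, 4, 5), (1, 6, 7)]
print(rename(prog))  # [(32, 2, 3), (33, 4, 5), (34, 6, 7)]
```

After renaming, the three instructions write three distinct physical registers, so the WAR and WAW conflicts on r2 and r1 disappear and the instructions could execute in any order.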
2.2.1.3 Resource Dependencies (Structural Hazards)
Resource dependencies involve the limited number of physical resources. Instructions are delayed until the required physical resource is available. Physical resources can be functional units, shelving buffer entries (discussed in section 2.3.3), and even registers if register renaming is used.
2.2.2 Control Dependencies
An instruction is said to be control dependent on another instruction if it is not known whether it will be executed before the other instruction executes. The conditional branching instruction causes this type of program dependency, as the outcome of the branch is unknown until its execution. Using branch prediction techniques, high prediction accuracy rates are achieved; consequently, control dependencies can be decreased. If speculative execution is used, a large window of instructions can be formed. However, the relative branch misprediction penalty will increase.
2.3 Processing Phases
In order to exploit ILP, superscalar processors incorporate many techniques to remove false program dependencies so that more instructions can be executed concurrently. The underlying architecture of a superscalar processor is a pipeline architecture. In some architectures, the pipeline stages are decomposed into further stages, resulting in a superpipelined architecture. A superscalar processor should use:
• A control dependency resolution method
• A data dependency resolution method
• A method to implement the precise state
• A commit method to maintain the appearance of the sequential execution model
• A dynamic scheduling method
• Multiple resources to support parallel execution
In this section, we give a brief description of the above-mentioned items in the context of a complete execution process. For the sake of clarity, we divide the execution process into separate phases and discuss each one separately. In a typical superscalar processor, though, the phases are almost seamless. The processing phases are instruction fetch, instruction dispatch, instruction issue and execute, and instruction commit.
2.3.1 Instruction Fetch Phase
In this phase, instructions are prefetched with the aid of branch prediction. Instructions proceed to execute along the predicted path of a conditional branch; this is called speculative execution. An instruction cache (L1 cache) is used in almost all current processors. This hides the latency and increases the bandwidth of the instruction fetch process. Many architectures use the Harvard organisation at the L1 level: the cache is split into an instruction cache and a data cache. An instruction buffer is usually used to smooth out the instruction fetch irregularities caused by cache misses and branch mispredictions.
2.3.2 Instruction Dispatch Phase
During this phase, instructions are removed from the instruction fetch buffers and false data dependencies are removed by register renaming. In this technique, the logical register set is mapped into a larger physical register set via a mapping table. A window of instructions is thus established in which instructions are control-dependency free; only true data dependencies should remain.
2.3.3 Instruction Issue and Execute Phase
In this phase, instructions are checked for the availability of data operands and issued for parallel execution. This forms a dynamic instruction schedule. Instruction issue can be in-order or out-of-order. For in-order issue, instructions are issued in the sequential order of the program. This limits the exploitation of ILP but results in a simple design. In out-of-order issue, the order of instructions is not considered: instructions may be issued in any order subject to data availability.
(For example, the Pentium processor implements the in-order issue approach, resulting in a simple design: the maximum instruction window size is 2 and no register renaming is used. The Pentium II processor implements the out-of-order issue approach, resulting in a complex design: the maximum instruction window size is 30, with register renaming and extensive speculative execution.) Shelving buffers, or reservation units, are usually used to keep track of data availability for out-of-order instruction issue; this scheduling algorithm is called Tomasulo's algorithm. If instructions in a predicted execution path are allowed to execute, this is called speculative execution. The degree of speculation is the maximum number of unresolved branches allowed during execution.
2.3.4 Instruction Commit Phase
The final phase of superscalar processing is the commit, or retire, phase. The purpose of this phase is to keep the appearance of the sequential execution model. Instructions are only allowed to change the logical state (logical registers and memory) of the system in this phase.
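The data-availability-driven, out-of-order issue described in section 2.3.3 can be sketched as a greedy dataflow schedule. This is a deliberate simplification, not Tomasulo's full algorithm: it assumes each destination register is written only once (as after renaming) and that every result is ready one cycle after issue.

```python
# Greedy dataflow schedule: in each cycle, issue every waiting instruction
# whose source operands are ready; results become available the next cycle.
def schedule(instructions):
    """instructions: list of (name, dest, sources). Assumes each dest is
    written once (as after renaming). Returns a list of per-cycle issues."""
    written = {dest for _, dest, _ in instructions}  # registers produced here
    ready = set()                                    # results available so far
    waiting = list(instructions)
    cycles = []
    while waiting:
        issued = [ins for ins in waiting
                  if all(s in ready or s not in written for s in ins[2])]
        if not issued:
            break  # circular dependency; cannot occur in a real trace
        cycles.append([name for name, _, _ in issued])
        ready |= {dest for _, dest, _ in issued}
        waiting = [ins for ins in waiting if ins not in issued]
    return cycles

# i1: r1 = r2+r3 ; i2: r4 = r1+r5 (waits on i1) ; i3: r6 = r7+r8 (independent)
prog = [("i1", 1, (2, 3)), ("i2", 4, (1, 5)), ("i3", 6, (7, 8))]
print(schedule(prog))  # [['i1', 'i3'], ['i2']]
```

Note that i3 issues ahead of i2 even though it comes later in program order; only the true dependency of i2 on i1 constrains the schedule.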
CHAPTER 3: LITERATURE SURVEY
The objective of this chapter is to present a brief survey of major existing processor architecture types and performance modelling techniques. We will focus on the post-1980 developments.
3.1 Superscalar Techniques
3.1.1 Historical Perspective
The idea of superscalar processing is based on the exploitation of ILP. The idea was formulated as early as 1970 [50], but it was concluded that only a small amount of ILP (no more than 2 or 3) could be extracted without investing an enormous amount of hardware. The significance of ILP gained attention only at the beginning of the 1980s. It has been shown [33] that a large amount of ILP (about 90) is found when ILP is sought beyond basic block boundaries. Since then, there have been many studies of the availability of ILP [26, 54, 53]. IBM was first to implement superscalar issue [7]. The term superscalar was first coined by John Cocke at IBM, and the first experimental version of the design, called AMERICA, was implemented. From the beginning of the 1980s many superscalar techniques were proposed, and they were successfully implemented in the late 1980s. A series of papers examined the instruction issue problem [1, 55]. The decoupled approach, in which the memory access and instruction execution phases are decoupled, was proposed at Wisconsin [35]. The Astronautics ZS-1 embodies this approach by using architectural queues for communication with main memory [47]; the Power-2 design uses queues in a similar fashion [35]. The form of speculative execution in recent processors has its roots in the original IBM 360/91. The approach combines the dynamic scheduling techniques of the IBM 360/91 (based on Tomasulo's algorithm) with an in-order commit facility. The concept of the reorder buffer [45] is used to allow for in-order commit and precise interrupts. The speculation mechanism was further enhanced by the addition of renaming and dynamic scheduling [49].
3.1.2 Recent Superscalar Processors
[Figure: block diagram of a typical superscalar processor — a predecoder and instruction cache feed an instruction buffer; a decode, rename and dispatch stage fills floating-point and integer/address instruction buffers; floating-point and integer register files supply the functional units, data cache and memory interface; a re-order and commit stage completes the pipeline.]
Figure 3-1: Processor Architecture of a Typical Superscalar Processor
Existing superscalar processors can be classified according to their target application fields, performance levels, and architectures. We limit our survey to models for general use. These processors are typically intended for the high-performance desktop and workstation domain. Starting from 1994, wide-issue superscalar processors were announced by every major processor vendor: Intel Pentium II, Intel Pentium, AMD K5, Sun UltraSPARC, MIPS R10000, IBM/Motorola PowerPC 604, and HP PA-8000. Figure 3-1 shows the architecture of a typical superscalar processor incorporating most of the superscalar techniques. All these processors share the same evolution path concerning the instruction issue policy [44]. There are two main steps in this path. The first is characterised by a straightforward issue policy with no renaming, in-order issue, and limited speculative execution. The second is characterised by the inclusion of advanced issue techniques: register renaming, speculative execution, and shelved issue (e.g. by the use of reservation units). It has also been concluded that, for given memory and execution bandwidths, the branch prediction accuracy and instruction window size are the main constraints on processor performance [44].
Table 3-1 shows some of the recent superscalar processors and their characteristics. It is worth noting that the trend is toward increasing the issue width (the maximum number of instructions that can be issued in a cycle).
Table 3-1: Characteristics of Recent High-Performance Processors

Processor        Year introduced   Initial clock rate (MHz)   Max. issue rate   Issue order    No. of functional units   Pipeline stages
Pentium II       1997              233                        3                 Out-of-order   5                         12
AMD K5           1995              100                        4                 Out-of-order   4                         5
Sun UltraSPARC   1995              167                        4                 In-order       9                         9
MIPS R10000      1995              200                        4                 Out-of-order   5                         7
PowerPC 604      1994              100                        4                 Out-of-order   6                         6
HP PA-8000       1996              200                        4                 Out-of-order   7                         9
Pentium          1994              66                         2                 In-order       3                         5
Cyrix 6x86       1995              150                        2                 In-order       4                         7
Some of the listed processors are complex-instruction-set computers (CISC) while others are reduced-instruction-set computers (RISC). However, all the CISC implementations, with the exception of the Pentium, convert CISC instructions into RISC-like operations; the core of the processor is implemented as a RISC core.
3.2 Performance Models
Many papers in the literature have proposed methods to predict the maximum program parallelism [54, 53, 26, 14, 3]. These models are generally based on idealised processor architectures, and only performance bounds were obtained. Furthermore, most of these are experimental studies, usually by simulation. Other studies are based on performance evaluation of existing processors [6, 5, 29]. A few other papers are concerned with predicting the actual parallelism in realistic programs [34, 43], at the expense of ignoring the effects of the memory hierarchy. Memory has become a serious bottleneck, failing to cope with processor improvements; it has been estimated that the memory/processor performance gap will grow by a factor of 1.5 per year [36]. Thus, this increasingly important resource should be included in the performance model. A more detailed performance model based on a statistical approach is proposed by Noonburg [32]. In spite of the high accuracy achieved using this approach, a serious limitation is the size of the state space. We present two performance models. One is based on the model proposed by Emma [14]. The other is a novel model with the goal of predicting the performance of the Pentium II processor. In what follows, we briefly present the models that are relevant to our work.
3.2.1 Performance Analysis Fundamentals
[Figure: a generic processor model — main memory backs a cache, which serves an instruction element and an execution element.]
Figure 3-2: Generic Processor Model
Emma has presented the essentials of performance analysis as they relate to making the fundamental choices and trade-offs in processor design. In pursuing this goal, the author advocates the cycles-per-instruction4 (CPI) metric instead of its reciprocal, instructions-per-cycle (IPC). The CPI can be written as:

CPI = Σ (cycles/event) · (events/instruction)

This formula decomposes the performance into two separate components. The first, (events/instruction), is the frequency of occurrence of some abstract hardware event, and is mostly application dependent. The second, (cycles/event), is the associated hardware penalty of the abstract event, and is thus dependent on the architecture of the processor. In order to apply this technique, the author presented the abstract scalar processor shown in Figure 3-2. The model is derived directly from the basic von Neumann architecture. The instruction element fetches instructions from the cache. The execution element is a scalar execution unit. All memory access is cached. Thus, every aspect of the von Neumann architecture is modelled except for the input/output unit.
4 Total execution time is related to CPI by: Total execution time = Number of instructions × CPI × Cycle time.
The CPI is decomposed into two independent components: the finite cache effect and the infinite cache effect. The finite cache effect (FCE) metric assesses the effect of the memory hierarchy; it is the contribution of cache penalties to the overall performance. The infinite cache effect is the CPI achievable assuming a perfect cache. The infinite cache effect is further decomposed into execution busy (EBusy) and execution idle (EIdle) components. EBusy represents the intrinsic work; this would be the time taken if the execution element were perfect. EIdle accounts for the case of the execution element being idle due to its own hazards. In a pipelined design, typical hazard causes include branch misprediction, address generation interlocks, and data dependencies. Performance bounds were given: for a given workload, EBusy and EIdle are constant, while FCE is directly proportional to the cache miss ratio. These bounds are shown to be upper bounds for the superscalar case. The author gives an analysis of some architectural techniques, including decoupled instruction fetch/execute and out-of-order execution. Decoupled instruction fetch/execute was argued to have several disadvantages: high reliance on branch prediction, high memory bandwidth requirements, and high address-generation-interlock (AGI) rates. Out-of-order execution was argued to be insignificant in the case that execution time is one cycle.
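Emma's CPI decomposition lends itself to a small numeric sketch. The event frequencies and penalties below are invented for illustration; only the structure of the formula comes from the text.

```python
# Sketch of the CPI decomposition: CPI = EBusy + sum over hazard/cache events
# of (events/instruction) * (cycles/event). All numbers here are invented.
def cpi(ebusy, events):
    """events: list of (events_per_instruction, cycles_per_event) pairs."""
    return ebusy + sum(freq * penalty for freq, penalty in events)

ebusy = 0.5                # intrinsic work per instruction (cycles)
eidle = [(0.05, 4)]        # e.g. branch mispredictions: rate, penalty
fce   = [(0.02, 20)]       # cache misses: misses/instruction, miss penalty
total = cpi(ebusy, eidle + fce)
print(round(total, 3))  # 1.1  (= 0.5 + 0.05*4 + 0.02*20)
```

The split makes parametric studies easy: halving the miss rate changes only one (events/instruction) term while the architecture-dependent penalties stay fixed.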
3.2.2 The Piecewise Linear Model
[Figure: modelled vs. actual ILP plotted against machine parallelism — the modelled curve rises linearly while machine parallelism is below program parallelism and is flat once machine parallelism exceeds program parallelism; the actual curve rounds off at the knee.]
Figure 3-3: Jouppi's Piecewise Linear Model
Jouppi [27] describes a first-order performance model which uses the concepts of machine parallelism and program parallelism. The overall performance is approximated by the use of these two components, as shown in Figure 3-3. Jouppi presented a machine taxonomy which includes superscalar, superpipelined, and superscalar-superpipelined machines. Machine parallelism is calculated as the product of the degree of superpipelining and the degree of superscalarity. The performance, as shown in Figure 3-3, is bounded by program parallelism when machine parallelism is larger; otherwise, the performance is bounded by machine parallelism. Jouppi then describes several factors that account for the differences between modelled and actual performance. These include the non-uniform distribution of instruction-level parallelism (including variation of parallelism within different instruction classes), variation in instruction latency, and variation in per-cycle available parallelism. A justification taking the above-mentioned factors into account is presented; still, the justification does not explain the rounding of the knee of the actual performance curve. In addition, memory effects were not taken into consideration.
3.2.3 An Instruction-Level Parallelism Model
An ILP model is proposed by Noonburg that attempts to model the interaction of program parallelism and machine parallelism, including the knee of the actual performance curve. The metric sought is ILP, which is calculated by
the product of the machine parallelism and program parallelism distributions, as shown in the following equation:

ILP = M̂_I × P̂_D × M̂_F × P̂_C × M_B

where P̂_C and P̂_D are the program control and data parallelism distributions respectively, and M̂_I, M̂_F, and M_B are the issue parallelism, fetch parallelism, and branch parallelism respectively. The idea is to form a parallelism function that specifies the parallelism that can be achieved by a particular stage as a function of the parallelism from the previous stage. This is achieved by a matrix-vector multiplication. A variable denoted by a hat accent is a matrix; otherwise it is a vector. The ij element of a matrix represents the probability that the output parallelism (the result of applying the parallelism function) is i, given that the input parallelism is j (the function argument). Thus, the ILP distribution is calculated and averaged to obtain the IPC (instructions per cycle) metric. The values for the distributions are obtained from program traces. This model has some limitations: no handling of multi-cycle result latencies, no handling of inter-instruction dependency lengths, and the assumption that execution phases are independent.
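The matrix-vector formulation can be illustrated with a toy single-stage example. The distributions below are invented; only the mechanism (a stochastic matrix applied to a parallelism distribution, then averaged to obtain IPC) follows the description above.

```python
# Toy illustration of the parallelism-function idea: a stage is a matrix M
# with M[i][j] = P(output parallelism = i | input parallelism = j); applying
# the stage to a parallelism distribution is a matrix-vector product.
def apply_stage(matrix, dist):
    n = len(matrix)
    return [sum(matrix[i][j] * dist[j] for j in range(len(dist)))
            for i in range(n)]

def mean(dist):
    """Average the parallelism distribution to obtain an IPC-like number."""
    return sum(i * p for i, p in enumerate(dist))

# Distribution over parallelism 0..2; the program always offers parallelism 2.
program = [0.0, 0.0, 1.0]
# An invented fetch stage that degrades parallelism 2 to 1 half of the time.
fetch = [[1.0, 0.0, 0.0],
         [0.0, 1.0, 0.5],
         [0.0, 0.0, 0.5]]
out = apply_stage(fetch, program)
print(out, mean(out))  # [0.0, 0.5, 0.5] 1.5
```

Chaining several such matrices, one per pipeline stage, gives the composed ILP distribution of the equation above; the serious practical limitation mentioned in the text is that the state space (the matrix dimensions) grows quickly.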
CHAPTER 4: SIMPLE-ISSUE SUPERSCALAR MODEL (PENTIUM MODEL)
The objective of this chapter is to present our proposed model for the simple-issue class of superscalar processors. The essential characteristic of these processors is in-order instruction issue. No shelving buffers are used; consequently, instruction issue stalls when data operands are unavailable. No register renaming is used, and the relative branch prediction penalty is not critical. Typical processors in this class are the Pentium, PowerPC 601 and Alpha 21164. Due to its high popularity and availability, the Pentium processor is chosen as representative of this class. This chapter begins by giving a brief description of the Pentium processor. Then, we present our proposed performance model.
4.1 Pentium Architecture
The objective of this section is to briefly present the Pentium architecture; a detailed description can be found in the Intel documentation [39, 40]. The Pentium processor, the successor of the Intel486 processor, is the first superscalar processor in the x86 family. It contains a dual pipeline permitting the parallel execution of two instructions per clock. The on-chip L1 cache is organised as two separate caches: an 8 KB 2-way set-associative data cache and an 8 KB 2-way set-associative code cache. The data cache implements the MESI write-back protocol (discussed in section 4.1.4). Dynamic branch prediction based on history bits is used. The external data bus is 64 bits wide and supports burst read and write-back operations. The data path is 32 bits wide, including the register sizes and the ALUs. Write-back buffers are also used to speed up back-to-back memory write operations.
4.1.1 Execution Pipelines
The Pentium processor has two 5-stage pipelines, called the "u" and "v" pipes. While the u-pipe can execute all instructions in the instruction set, the v-pipe can execute only a small set of instructions, called "simple" instructions. The two pipelines have the same stages,
which are Prefetch (PF), Instruction Decode (D1), Address Generate (D2), Execute – ALU and Cache Access (EX), and Writeback (WB).
• Instruction prefetch: Instruction prefetch is done in the PF stage, which is implemented using the Branch Target Buffer (BTB) and two independent prefetch buffers, each capable of holding 32 bytes (the code cache line size). At any given time, one prefetch buffer is active, requesting sequential instruction prefetches from the on-chip code cache. When a branch instruction is prefetched, the dynamic branch predictor predicts the outcome of the branch instruction. If it is predicted not taken, sequential prefetching is resumed. If it is predicted taken, the other prefetch buffer is activated, requesting instruction prefetches from the branch target address. If, in the WB stage, the branch is found to be mispredicted, the pipelines are flushed and the former prefetch buffer is reactivated, requesting sequential instruction prefetches.
• Instruction decode and issue: The D1 and D2 stages serve to decode and issue up to two instructions per clock; that is, the issue window is of size 2. This issue scheme is characterised by in-order issue, no register renaming, and limited5 speculative branch processing. The scheme checks the decoded instructions for register dependencies and other conditions (hardware limitations, mostly due to the asymmetrical pipelines). Independent instructions are issued to the EX stage; otherwise, instructions are issued to the u-pipe sequentially.
5 No instruction is executed in the predicted conditional branch path. Speculation here is limited to performing the PF, D1, and D2 pipeline stages.
[Figure: Pentium block diagram — the instruction cache feeds the prefetch buffers and a decode-and-issue stage, which issues to the integer units "u" and "v"; these access the integer register file, the data cache and the memory interface, and complete in the commit (WB) stage. The integer pipeline stages are Prefetch, Decode1, Decode2, Execute, and Writeback.]
Figure 4-1: Pentium Processor Architecture
• Instruction execute: The EX stage serves to execute ALU and memory access instructions. Instructions requiring both ALU and memory access operations require one additional clock. Concurrent data cache access is possible if there is no cache bank conflict (each data cache line is divided into 8 banks). Since memory dependencies are not yet checked, it is possible that both pipes will try to access the same cache bank; in this case, the v-pipe stalls until the u-pipe access is completed.
• Instruction commit: The WB stage enables the instructions to modify the processor's state and finish execution. A conditional branch instruction residing in the v-pipe is verified for correct branch prediction. If the prediction was not successful, the pipelines are flushed, incurring a misprediction penalty of 4 clocks (u-pipe conditional branches are verified in the EX stage, incurring a branch penalty of 3 clocks).
4.1.2 Memory Interface
The Pentium processor has a write-back buffer for each pipeline. The write buffers are one quadword (64 bits) wide. They can be filled simultaneously by the u-pipe and v-pipe in the same clock. Memory reads are not allowed to be reordered around previously generated writes sitting in the write buffers; the write-back buffers have to be flushed before any such bus cycle takes place. The Pentium processor supports strong write ordering only: writes generated by the processor will appear on the bus, or update the L1 data cache, in the order they occur. Writes to M- or E-state lines in the L1 data cache will not proceed until the write-back buffers are flushed, external write buffers (if any) are flushed, and any current write cycle is finished (the M and E states are part of the MESI protocol discussed in section 4.1.4). Reads that hit in the data cache are allowed to be reordered around previously generated writes. IO operations are never reordered.
4.1.3 Branch Prediction
The Pentium processor uses a Branch Target Buffer (BTB) to predict the outcome of conditional branch instructions. The BTB is implemented as a four-way set-associative cache with 256 entries [2]. Each entry contains the branch target address and 2 history bits. The history bits represent four possible states for a branch instruction residing in the BTB: Strongly taken, Weakly taken, Weakly not taken, and Strongly not taken. A branch instruction will be predicted taken if its state is Strongly taken or Weakly taken; it will be predicted not taken otherwise. Figure 4-2 presents a flowchart of the operation of the branch prediction mechanism. Upon execution of a branch instruction, the outcome of the branch is fed back to the branch prediction logic, updating the history bits. Figure 4-3 shows the state transitions when a branch instruction is executed. A branch entry changing state from Weakly not taken to Strongly not taken is deallocated from the BTB.
When the same branch instruction is encountered again, it will miss the BTB and will be predicted not taken (details are provided in Appendix A).
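The two-bit history scheme described above can be sketched as a small simulator. This is an abstraction of the mechanism as described in the text (a BTB miss is predicted not taken; an entry is allocated Strongly taken on a taken branch; a not-taken outcome in the Weakly not taken state deallocates the entry); the real BTB's four-way set-associative organisation and capacity are not modelled.

```python
# Two-bit branch history sketch. History bits: 3 = strongly taken,
# 2 = weakly taken, 1 = weakly not taken, 0 = strongly not taken.
class BTB:
    def __init__(self):
        self.entries = {}                       # branch address -> history bits

    def predict(self, addr):
        """A BTB miss is predicted not taken; a hit follows the history bits."""
        return self.entries.get(addr, 0) >= 2

    def update(self, addr, taken):
        if addr not in self.entries:
            if taken:                           # allocate as strongly taken
                self.entries[addr] = 3
            return
        state = self.entries[addr]
        if taken:
            self.entries[addr] = min(state + 1, 3)
        elif state == 1:
            del self.entries[addr]              # weakly not taken -> deallocated
        else:
            self.entries[addr] = state - 1

btb = BTB()
predictions = []
for taken in [True, True, False, False, False]:   # outcomes: T T N N N
    predictions.append(btb.predict(0x100))
    btb.update(0x100, taken)
print(predictions)  # [False, True, True, True, False]
```

Note how the two-bit hysteresis keeps predicting taken through the first not-taken outcome, and how the final not-taken outcome removes the entry, so the branch reverts to the default not-taken prediction.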
[Flowchart: a branch that misses the BTB is predicted not taken and, if actually taken, the pipelines are flushed and an entry is made in the BTB in the Strongly taken state. A branch that hits the BTB is predicted according to its history bits; on a taken prediction the prefetcher switches to the other queue and fetches from the target address. A correct prediction upgrades (taken) or downgrades (not taken) the entry's history bits and normal operation resumes; a misprediction flushes the pipelines, adjusts the history bits, and switches the prefetcher back to sequential code fetches.]
Figure 4-2: Branch Prediction Flowchart
[State diagram: taken outcomes move an entry upwards, from Weakly not taken to Weakly taken to Strongly taken; not-taken outcomes move it downwards, from Strongly taken to Weakly taken to Weakly not taken. A further not-taken outcome in the Weakly not taken state removes the entry from the BTB, where the branch is predicted not taken; a subsequent taken outcome re-enters it as Strongly taken.]
Figure 4-3: Branch prediction mechanism
4.1.4 Data Cache Consistency Protocol (MESI Protocol)
The Pentium processor supports the MESI protocol. This protocol is used to maintain cache consistency when several cache subsystems or bus masters exist. It is applied only to memory read/write cycles that run through the data cache. Every line in the data cache is assigned one of four possible states: Modified, Exclusive, Shared, or Invalid. Table 4-1 gives a brief description of the MESI states.
Table 4-1: The MESI States

State      Description
Modified   The cache line contains modified data (different from main memory). This state is entered on a write hit to an Exclusive line. All subsequent accesses (reads/writes) to this line cause no memory cycles, and the line state does not change.
Exclusive  The line contains no modified data (the same as main memory). All subsequent accesses (reads/writes) to this line cause no memory cycles. A write to this line changes its state to Modified. This state is entered on a write hit to a Shared line.
Shared     The line contains the same information shared by the L2 cache. Only a write to this line generates write-through bus activity (and changes the line state to Exclusive). This state is entered on a read miss that loads the line into the cache.
Invalid    The line is not present in the cache.
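The local read/write transitions summarised in Table 4-1 can be captured in a small transition function. This sketch follows the table's chain Invalid → Shared → Exclusive → Modified for successive local accesses; snooping by other bus masters, which also changes line states, is deliberately not modelled, and the write-miss behaviour (no allocation) reflects the write-once policy described later in section 4.2.1.2.

```python
# Local MESI transitions per Table 4-1 (no snooping modelled).
def next_state(state, op):
    """state in {'M','E','S','I'}, op in {'read','write'}.
    Returns (new_state, bus_activity)."""
    if state == 'I' and op == 'read':
        return 'S', 'line fill'        # read miss loads the line as Shared
    if state == 'I' and op == 'write':
        return 'I', 'memory write'     # write miss: no allocation (write-once)
    if state == 'S' and op == 'write':
        return 'E', 'write-through'    # first write hit is written through
    if state == 'E' and op == 'write':
        return 'M', None               # second write hit stays on-chip
    return state, None                 # all other accesses hit with no bus use

# A read miss followed by successive writes walks the line I -> S -> E -> M.
state = 'I'
for op in ['read', 'write', 'write', 'write']:
    state, bus = next_state(state, op)
print(state)  # M
```

Once the line reaches Modified, further local reads and writes generate no bus activity at all, which is the point of the write-back design.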
4.2 The Pentium Processor Model
We present our proposed performance model in the framework of the Pentium architecture. There is no loss of generality: similar models can be derived for other processors in the class of simple-issue superscalar processors. The performance model is based on the set of basic performance limits proposed by Emma [14]. These limits were defined for the scalar case and, because of their intuitive nature, act as bounds for the superscalar case. By adding more detail to the framework proposed by Emma, we were able to construct a model that predicts the performance within a limited error margin.
4.2.1 Model Assumptions and Definitions
As shown in Figure 4-4, the model consists of a set of components corresponding to the processing elements: the memory, the caches (L1 code and data), the instruction element, and the execution element. Each of these contributes a CPI component.
[Figure: main memory backs the L1 code cache and L1 data cache; the instruction element fetches from the code cache, and the execution element accesses the data cache through the write-back buffer.]
Figure 4-4: Pentium Performance Model
4.2.1.1 Instruction Element
The instruction element prefetches instructions and presents them to the execution element. In the experiments we have conducted, we observed that the code cache hit ratio is high for most of the benchmarks (more than 96%), and thus the code cache effect is not significant. For one of the benchmarks, the code cache hit ratio (hits per code cache read) is about 92%, but the code miss rate (misses per instruction) is less than 0.03. Moreover, the prefetch mechanism, with a branch prediction accuracy of more than 82%, can effectively decrease the code miss rate to 0.005 (0.18 × 0.03). Therefore, we ignore the effect of the code cache entirely.
4.2.1.2 Cache
There is one level of cache, the L1 cache. L1 is split into code and data caches, each 8 KB 2-way set associative. The write policy is write-once with no write-miss allocation. Write buffers are used to allow a read hit to proceed behind a memory write operation.
4.2.1.3 Write-back Buffer
There are two write buffers in the Pentium processor, one associated with the u-pipe and the other with the v-pipe. When a write misses the data cache, the write is buffered into its corresponding write buffer and execution continues. We have found experimentally that simultaneous access to both buffers is a rare situation (probability less than 0.01). Moreover, when an entry is written to either write buffer, the corresponding entry in the other buffer is also reserved, and subsequent writes to the buffer skip the empty location.
[Figure: an M/M/1/2 queue — arrivals at rate λ enter a finite storage of two entries and are served at rate µ.]
Figure 4-5: Write-back Buffer Model
Therefore, we model the write buffers as a single write buffer of size two. Figure 4-5 shows the write-buffer model, which is modelled as an M/M/1/2 queueing model [21]. The arrival rate λ denotes the rate at which writes are issued to the write buffer assuming infinite buffer size. The service rate µ denotes the rate at which the writes are written back to memory. The queue has a finite storage capacity of two requests only (the depth of the write-back buffer in the Pentium processor). When the queue fills up, the input is effectively "turned off", i.e.

λ_k = λ for k < 2,  λ_k = 0 for k ≥ 2

Thus, the effective write-back penalty is:

t_mw_eff = p[queueing] · t_mw    (4-1)

[Figure 4-6 (Effective Memory Penalty vs. Write Rate) plots the average write penalty, in cycles, against the write rate, in write requests per cycle, for a raw write penalty of 7 cycles.]
where t_mw_eff is the effective write penalty (in cycles) per write operation and t_mw is the raw write penalty (in cycles) per memory write operation. For a general M/M/1/N queue model we have

p_k = [(1 − λ/µ) / (1 − (λ/µ)^(N+1))] · (λ/µ)^k,  where 0 ≤ k ≤ N

Substituting N = 2 and k = 2, we get

p_b = p[queueing] = p_2 = (λ/µ)² / (1 + λ/µ + (λ/µ)²)    (4-2)

and

t_mw = 1/µ    (4-3)

thus,

t_mw_eff = [(λ/µ)² / (1 + λ/µ + (λ/µ)²)] · (1/µ)    (4-4)
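Equation (4-4) can be checked numerically. The sketch below uses the text's µ = 0.14 (a raw penalty of roughly 7 cycles per write); the chosen λ values are arbitrary illustration points.

```python
# Effective write penalty per write from equation (4-4):
# t_mw_eff = (rho^2 / (1 + rho + rho^2)) * (1/mu), where rho = lam/mu.
def t_mw_eff(lam, mu):
    rho = lam / mu
    return (rho ** 2) / (1.0 + rho + rho ** 2) / mu

mu = 0.14                         # raw penalty 1/mu is about 7.14 cycles
for lam in (0.01, 0.05, 0.14, 0.28, 1.0):
    print(round(lam, 2), round(t_mw_eff(lam, mu), 3))
```

The printed values grow roughly linearly in λ at first and then flatten: as the write rate rises the buffer saturates and the effective penalty approaches the raw penalty 1/µ, which is exactly the behaviour Figure 4-6 shows.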
Figure 4-6 plots equation (4-4), choosing µ = 0.14 (t_mw = 7 cycles/req.)6. It can be seen that the relation is almost linear up to λ = 2µ; the curve then flattens, diminishing the effect of the write buffer.
4.2.1.4 Write-buffer Effects on Memory Operations
The write buffer attempts to hide the latency of write operations by buffering write requests. As a consequence of the strong memory consistency model of the Pentium processor, the write buffer affects other memory operations. The write buffer has to be written back in the following cases:
• Data read misses
• Write hits to modified/exclusive data cache lines
• Write hits to shared cache lines
• IO operations
It is to be noted that the modified/exclusive/shared cache line status is part of the MESI cache protocol, which involves writing back the first write hit to a cache line. Thus, not only does the write buffer affect the memory write operation, it also affects other operations that need to access the memory. We account for these effects by defining a set of probability measures, namely:
p_w = probability that an instruction causes a write miss
p_r = probability that an instruction causes a read miss
p_s = probability that an instruction writes to a data cache line with status "shared"
p_w_m/e = probability that an instruction writes to a data cache line with status "modified/exclusive"
p_io = probability that an instruction is an IO instruction
The sample space of these probability measures is the set of instructions executed in a given workload. That is, we define:
p_w = (number of write misses) / (total number of instructions executed)
p_r = (number of read misses) / (total number of instructions executed)
p_w_m/e = (number of write hits to "modified/exclusive" data-cache lines) / (total number of instructions executed)
p_s = (number of write hits to "shared" data-cache lines) / (total number of instructions executed)
p_io = (number of IO instructions) / (total number of instructions executed)
6 This corresponds to a 70 ns memory access penalty in a Pentium 100 MHz test system.
We decompose the net finite cache effect into:
FCE_read: finite cache effect due to read misses
FCE_write_miss: finite cache effect due to write misses
FCE_write_RW: finite cache effect due to the presence of both read and write misses
FCE_write_other: finite cache effect due to the presence of IO operations, writes to shared lines, and writes to modified/exclusive lines
FCE_write_EX: finite cache effect due to external write buffers7
Define tmr to be the number of cycles needed to read a quadword from memory. Then,

FCE_Read = pr · tmr    (4-5)

FCE_Write_miss = pw · p2 · tmw    (4-6)

FCE_Write_RW = pr · (p1 + 2p2) · tmw    (4-7)

FCE_Write_other = (pIO + pw_m/e + ps) · (p1 + 2p2) · tmw    (4-8)
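The four finite-cache-effect components above can be sketched in code. This is a minimal illustration, not part of the thesis: the function name and the example values are ours, and we read p1 and p2 as the write-buffer occupancy probabilities from the queueing model (with p2 the buffer-full probability in equation 4-6).

```python
def fce_components(pr, pw, ps, pio, pwme, p1, p2, tmr, tmw):
    """Finite cache effect CPI components (equations 4-5 to 4-8).
    Probabilities are per executed instruction; penalties are in cycles."""
    fce_read = pr * tmr                                        # (4-5)
    fce_write_miss = pw * p2 * tmw                             # (4-6): stall only when buffer full
    fce_write_rw = pr * (p1 + 2 * p2) * tmw                    # (4-7): reads drain the buffer
    fce_write_other = (pio + pwme + ps) * (p1 + 2 * p2) * tmw  # (4-8)
    return fce_read, fce_write_miss, fce_write_rw, fce_write_other

# Illustrative (invented) values
print(fce_components(0.02, 0.01, 0.005, 0.0, 0.01, 0.1, 0.01, 10, 7))
```

The components sum directly into the finite-cache part of the CPI.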
4.2.1.5 Execution Element
Two pipelines are used with separate functional units. We model this unit as a single unit with double execution speed. Referring to Emma's model, the execution unit contribution to overall performance is represented by the EBusy metric. To account for the superscalarity of the Pentium processor, we decompose it into EIdle_ILP and EBusy. EIdle_ILP is the CPI component representing the execution idle time due to limited instruction-level parallelism. EBusy, from now on, represents the amount of time taken if both execution units are totally overlapping.
EIdle_ILP depends upon two types of parallelism: program parallelism and machine parallelism. The machine parallelism is 2, but the achieved parallelism is usually less than 2 in spite of the presence of dual pipelines, because the pairing rules restrict the set of instructions that can be paired together.

7 This component is zero for the P100 and P133 systems; otherwise (for the P166), refer to Equation (6-5) in chapter 6.
4.2.2 CPI Components
The motivation behind using the CPI metric is to be able to assess the roles of the workload and the machine architecture separately:

CPI = Σ (events/instruction) · (cycles/event)

where (events/instruction) is workload dependent and (cycles/event) is architecture dependent.
From the model, the CPI is decomposed into two main components:
• Finite cache effect
• Infinite cache effect

Finite cache effect
This is the effect of the finite nature of the cache. This part is given in the previous section.

Infinite cache effect
This is the CPI obtained given that there is no memory penalty, i.e. the cache appears to be infinite. This component is decomposed into:
• Execution Busy (EBusy)
• Execution Idle due to limited instruction-level parallelism (EIdle_ILP)
• Execution Idle due to address generation interlock (EIdle_AGI)
• Execution Idle due to wrong branch prediction (EIdle_Branch)

4.2.2.1 Execution Busy (EBusy)
This CPI component depends on the instruction mix and the instruction timing for a given instruction set and processor architecture. This information can be gathered from the Pentium built-in performance counters (it can also be obtained by profiling the given workload). Thus,

EBusy = (1/2) · Σi fi · ti

where fi is the dynamic frequency of executing instruction i, and ti is the number of cycles required to execute instruction i assuming a perfect cache. Since the Pentium processor has a dual-pipeline architecture, we consider the case in which both pipelines are fully overlapping (the ideal case), and account for the limited parallelism through the EIdle_ILP metric.
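The EBusy formula above can be computed from any instruction mix. A minimal sketch, with an invented mix for illustration:

```python
def ebusy(mix):
    """EBusy = (1/2) * sum_i f_i * t_i, assuming fully overlapped dual
    pipes; mix is a list of (frequency, perfect-cache cycles) pairs."""
    return 0.5 * sum(f * t for f, t in mix)

# Example mix: 60% 1-cycle ALU ops, 30% 1-cycle loads, 10% 3-cycle ops
print(ebusy([(0.6, 1), (0.3, 1), (0.1, 3)]))  # ≈ 0.6 CPI
```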
4.2.2.2 Execution Idle Due to Limited ILP (EIdle_ILP)
The average instruction-level parallelism is bounded by min(machine parallelism, program parallelism). Since the multimedia benchmark programs inherently have a large amount of program parallelism, one may expect the average instruction-level parallelism to be almost 2. But due to the instruction issue constraints, the amount of parallelism is largely dependent on program optimisation, and we have found that it lies in the range 1.5 to 2.

EIdle_ILP = (Average Parallelism − 1) · EBusy
4.2.2.3 Execution Idle Due to Address Generation Interlock (EIdle_AGI)
This metric represents the pipeline idle time due to the presence of address generation interlocks. The AGI penalty in the Pentium processor is 1 cycle, and across the entire benchmark suite we found that the AGI rate per instruction is always less than 0.09.

EIdle_AGI = (AGIs per instruction) · 1
Due to the rather limited instruction-level parallelism (machine parallelism limited to 2) and the relatively small rate of AGIs per instruction, we may safely ignore the effect of parallelism here. In the general case, though, the amount of parallelism amplifies the effect of AGIs.
4.2.2.4 Execution Idle Due to Branch Miss-prediction (EIdle_Branch)
This metric represents the branch miss-prediction penalty. If the instruction is in the U pipeline, a wrong branch prediction incurs 3 cycles; if it is in the V pipeline, it incurs 4 cycles. Let ILPaverage be the average degree of parallelism; then it can be shown that

p[branch is in U] = 2 − ILPaverage

Thus:

EIdle_Branch = (rate of branch miss-prediction) · (3 · p[branch is in U] + 4 · (1 − p[branch is in U]))
And finally:

CPI = FCE_Read + FCE_Write_miss + FCE_Write_RW + FCE_Write_other + FCE_Write_EX + EIdle_Branch + EIdle_AGI + EIdle_ILP + EBusy
4.3 Evaluating Model Parameters
The objective of this section is to describe the procedure used to evaluate our Pentium performance model parameters. The values are obtained with the aid of the Pentium processor's performance counters.
4.3.1 Model Parameters
Model parameters are classified into workload-dependent and architecture-dependent parameters. Table 4-2 describes the workload parameters, and Table 4-3 describes the architecture-dependent parameters.
Table 4-2: Workload Dependent Parameters

Parameter    Description
pr           Probability that an instruction causes a read miss
pw           Probability that an instruction causes a write miss
pw_m/e       Probability that an instruction writes to a modified/exclusive line
ps           Probability that an instruction writes to a shared line
pbranch      Probability that an instruction is a branch
pB           Probability that a branch is correctly predicted
ILPaverage   Average instruction-level parallelism
EBusy        Intrinsic execution time per instruction divided by 2
pAGI         Probability that an instruction causes an AGI
Table 4-3: Architecture Dependent Parameters

Parameter              Description
tmr                    L1 read miss penalty in cycles
tmw                    L1 write miss penalty in cycles
tAGI                   AGI penalty in cycles
tBranch_u, tBranch_v   Branch penalties for the u- and v-pipes respectively, in cycles
CPI is the performance metric. The breakdown of the CPI into components is given in Table 4-4.

Table 4-4: CPI Components

Component         Description
CPI               Average number of cycles per instruction
FCE_Read          Finite cache effect due to read misses
FCE_Write_miss    Finite cache effect due to write misses
FCE_Write_RW      Finite cache effect due to the presence of both read and write misses
FCE_Write_other   Finite cache effect due to the presence of IO operations and writes to shared/modified/exclusive lines
FCE_Write_EX      Finite cache effect due to external write buffers
EIdle_AGI         Execution idle due to address generation interlocks
EIdle_Branch      Execution idle due to wrong branch prediction
EIdle_ILP         Execution idle due to limited ILP
EBusy             Intrinsic average execution time per instruction divided by 2
4.3.2 Pentium Performance Counters Overview
The Pentium processor incorporates two built-in performance counters. Each counter can be assigned to count a specific event. Table 4-5 lists all possible events; the events used in this work are marked with '*'.
Table 4-5: Pentium Performance Monitor Events

Event#   Description
00h*     data reads
01h*     data writes
02h      data TLB misses
03h*     data read misses
04h*     data write misses
05h*     writes (hits) to M or E state lines
06h      data cache lines written back
07h      data cache snoops
08h      data cache snoop hits
09h      memory accesses in both pipes
0Ah      bank conflicts
0Bh      misaligned data memory references
0Ch*     code reads
0Dh      code TLB misses
0Eh      code cache misses
0Fh      any segment register loads
10h      segment descriptor cache accesses
11h      segment descriptor cache hits
12h*     branches
13h*     BTB hits
14h*     taken branches or BTB hits
15h*     pipe flushes
16h*     instructions executed
17h*     instructions executed in the v-pipe
18h      clocks while bus cycle in progress
19h*     pipe stalled by backup writes
1Ah*     pipe stalled by data memory reads
1Bh*     pipe stalled by writes to M or E line
1Ch      locked bus cycles
1Dh*     I/O read or write cycles
1Eh      non-cacheable memory references
1Fh*     pipe stalled by AGIs
22h*     floating-point operations
23h      breakpoint #0 matches
24h      breakpoint #1 matches
25h      breakpoint #2 matches
26h      breakpoint #3 matches
27h      hardware interrupts
28h*     data reads or data writes
29h*     data read misses or data write misses
4.3.3 CPI Evaluation Procedure
In this section, we describe how the model parameters are obtained from the hardware monitor readings and present, in detail, how the CPI metric is evaluated. Some of these parameters are directly available from specific hardware counters, while others are obtained by constructing a set of simultaneous equations that is solved for the unknown parameters. The following four parameters are readily available from the hardware counters:
pw = data write misses (04h) / instructions executed (16h)    (4-1)

pr = data read misses (03h) / instructions executed (16h)    (4-2)

ps = (data writes (01h) − data write misses (04h) − write hits to M/E (05h)) / instructions executed (16h)    (4-3)

pw_m/e = write hits to M/E (05h) / instructions executed (16h)    (4-4)

Also, we can directly calculate:

CPI = total cycles (counter counts clocks) / instructions executed (16h)    (4-5)

pIO = I/O read or write cycles (1Dh) / instructions executed (16h)    (4-6)

FCE_Write_EX = pipe stalled by writes to M/E lines (1Bh) / instructions executed (16h)    (4-7)
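The counter-to-parameter step is a set of simple ratios. A sketch of this post-processing, using hypothetical event names and invented counts (the real readings come from the events in Table 4-5):

```python
def workload_probs(c):
    """Derive model probabilities from raw Pentium counter readings
    (equations 4-1 to 4-6); c maps descriptive event names to counts."""
    n = c["instructions_executed"]            # event 16h
    pw = c["data_write_misses"] / n           # 04h, eq. (4-1)
    pr = c["data_read_misses"] / n            # 03h, eq. (4-2)
    ps = (c["data_writes"] - c["data_write_misses"]
          - c["write_hits_me"]) / n           # 01h - 04h - 05h, eq. (4-3)
    pwme = c["write_hits_me"] / n             # 05h, eq. (4-4)
    cpi = c["cycles"] / n                     # eq. (4-5)
    pio = c["io_cycles"] / n                  # 1Dh, eq. (4-6)
    return dict(pw=pw, pr=pr, ps=ps, pwme=pwme, cpi=cpi, pio=pio)

# Invented example readings
sample = dict(instructions_executed=1000, data_write_misses=20,
              data_read_misses=30, data_writes=200, write_hits_me=50,
              cycles=1800, io_cycles=2)
print(workload_probs(sample))
```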
The write-back requests are generated by write misses and write hits to a shared line. The arrival rate λ is the number of write-back requests per cycle; thus

λ = (pw + ps) / (CPI · (1 − p2))    (4-8)

Without the term (1 − p2), λ would be only the effective arrival rate, i.e. the rate after discounting queue-full events. Counter no. 19h counts the duration in which the pipeline is stalled due to full write buffers; this affects read misses, write misses, and write hits to shared-state data-cache lines. Defining
twb = CPI_stall_wb = pipe stalled by backup writes (19h) / instructions executed (16h)    (4-9)

where twb is the CPI component in which the pipeline is stalled due to the write buffer. This situation occurs when the write buffer is occupied and there is an outstanding read miss, write miss, write hit to a shared line, IO operation, or write hit to a modified/exclusive line. Thus, we may write

twb = p1 · (pr + ps + pIO + pw_m/e) · tmw + 2p2 · (pr + pw + ps + pIO + pw_m/e) · tmw    (4-10)
Let us define8

pi = p[queue size is i], where 0 ≤ i ≤ 2

pi = (λ/µ)^i / (1 + (λ/µ) + (λ/µ)²)    (4-11)
The write-buffer service rate µ is the reciprocal of the memory write penalty tmw:

µ = 1/tmw    (4-12)

This parameter depends on the memory hierarchy performance and is not readily available from the performance monitor. However, it can be obtained by solving equations (4-8), (4-10), (4-11), and (4-12); doing so leads to a third-degree polynomial, which we solve numerically. Counter no. 1Ah counts the duration in which the pipeline is stalled because a memory read is not bypassed while a line fill is in progress. Thus, defining

tr = CPI_stall_read = pipe stalled by data memory reads (1Ah) / instructions executed (16h)
The read stall is composed of two CPI components: one accounts for the raw read-miss penalty, and the other accounts for the case in which the write buffer is full. When the write buffer is full, the read is stalled waiting for the buffer to empty.
8 For the case of the Pentium MMX, p3 and p4 are added, as the queue size is 4.
tr = pr · tmr + pr · (p1 + 2p2) · tmw    (4-13)

Then the memory read penalty is

tmr = (tr − pr · (p1 + 2p2) · tmw) / pr    (4-14)
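The third-degree polynomial need not be solved in closed form; a simple numerical search recovers tmw (and hence µ) from the measured stall component. This is an illustrative sketch only: the input values are invented, and the fixed-point/bisection scheme is our choice of numerical method, not the thesis's.

```python
def queue_probs(rho):
    # M/M/1/2 steady-state probabilities p0, p1, p2 (equation 4-11)
    z = 1 + rho + rho * rho
    return 1 / z, rho / z, rho * rho / z

def solve_tmw(pw, ps, pr, pio, pwme, cpi, twb_measured, lo=1.0, hi=50.0):
    """Recover the memory write penalty t_mw from the measured
    write-buffer stall CPI (counter 19h), per equations (4-8)-(4-12)."""
    def predicted_twb(tmw):
        mu = 1.0 / tmw                        # equation (4-12)
        lam = (pw + ps) / cpi                 # first guess, ignoring (1 - p2)
        for _ in range(100):                  # fixed point of equation (4-8)
            p0, p1, p2 = queue_probs(lam / mu)
            lam = (pw + ps) / (cpi * (1 - p2))
        p0, p1, p2 = queue_probs(lam / mu)
        # equation (4-10): one entry to drain at p1, two at p2
        return (p1 * (pr + ps + pio + pwme)
                + 2 * p2 * (pr + pw + ps + pio + pwme)) * tmw

    for _ in range(60):                       # predicted_twb grows with t_mw
        mid = 0.5 * (lo + hi)
        if predicted_twb(mid) < twb_measured:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Invented counter-derived inputs
tmw = solve_tmw(pw=0.02, ps=0.01, pr=0.03, pio=0.0, pwme=0.05,
                cpi=2.0, twb_measured=0.05)
```

Bisection applies because the predicted stall grows monotonically with tmw (a slower memory both lengthens each drain and raises the buffer occupancy).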
And thus, we evaluate the CPI components as follows:

FCE_Read = pr · tmr    (4-15)

FCE_Write_miss = pw · p2 · tmw    (4-16)

FCE_Write_RW = pr · (p1 + 2p2) · tmw    (4-17)

FCE_Write_other = (pIO + pw_m/e + ps) · (p1 + 2p2) · tmw    (4-18)
In order to calculate the branch prediction accuracy, we first calculate the accuracy of the prediction when the instruction is inside the BTB:

Acc. in BTB = 1 − (pipeline flushes (15h) − wrong predictions outside BTB) / BTB hits (13h)

where

wrong predictions outside BTB = taken branches or BTB hits (14h) − BTB hits (13h)

Then we calculate the accuracy of prediction when the instruction is outside the BTB:

Acc. outside BTB = 1 − wrong predictions outside BTB / (branches (12h) − BTB hits (13h))

Finally, we may approximate the accuracy of the branch prediction scheme by:

pc1 = p[prediction is correct] = BTB hit ratio · Acc. in BTB + BTB miss ratio · Acc. outside BTB

where

BTB hit ratio = BTB hits (13h) / branches (12h), and BTB miss ratio = 1 − BTB hit ratio.

We may also approximate the branch prediction accuracy by:

pc2 = p[prediction is correct] = 1 − pipeline flushes (15h) / branches (12h)

To enhance the approximation, we take the average of the two previous estimates:

pB = pc = (pc1 + pc2) / 2
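The two-estimate averaging can be sketched directly from the four branch-related counters. A hedged sketch: the counter values below are invented, and normalising the in-BTB accuracy by BTB hits (13h) is our reading of the formulas.

```python
def branch_accuracy(branches, btb_hits, taken_or_btb_hits, flushes):
    """Estimate p_B from counters 12h, 13h, 14h, 15h by averaging
    the two approximations pc1 and pc2 described in the text."""
    wrong_outside = taken_or_btb_hits - btb_hits          # 14h - 13h
    acc_in = 1 - (flushes - wrong_outside) / btb_hits
    acc_out = 1 - wrong_outside / (branches - btb_hits)
    hit_ratio = btb_hits / branches
    pc1 = hit_ratio * acc_in + (1 - hit_ratio) * acc_out  # weighted accuracy
    pc2 = 1 - flushes / branches                          # flush-based estimate
    return 0.5 * (pc1 + pc2)

# Invented readings: 1000 branches, 800 BTB hits, 100 pipeline flushes
print(branch_accuracy(1000, 800, 850, 100))
```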
A branch instruction can be executed in either the u- or the v-pipe. Thus we define:

pu = p[instruction is executed in u-pipe] = (instructions executed (16h) − instructions executed in v-pipe (17h)) / instructions executed (16h)

And since an instruction executes in either the u- or the v-pipe,

pv = p[instruction is executed in v-pipe] = 1 − pu

The branch miss-prediction penalties are 3 or 4 clocks if the branch is executed in the u- or v-pipe respectively. Thus, EIdle_Branch is calculated by:

EIdle_Branch = (branches (12h) / instructions executed (16h)) · (1 − pB) · (pu · 3 + pv · 4)    (4-19)
The IO component is easily calculated by:

CPI_IO = I/O read or write cycles (1Dh) / instructions executed (16h)    (4-20)

The EIdle_AGI component is easily calculated by:

EIdle_AGI = pipe stalled by AGIs (1Fh) / instructions executed (16h)    (4-21)
The remaining components are EIdle_ILP and EBusy. EIdle_ILP represents the portion of time in which one execution unit is busy (an instruction has been issued to the u-pipe) while the other is idle (no instruction has been issued to the v-pipe). EBusy represents the portion of time in which both execution units are busy.
Let E_not_idle = EIdle_ILP + EBusy. We may approximate:

EIdle_ILP = pu · EBusy    (4-22)

Thus,

EBusy = E_not_idle / (1 + pu)    (4-23)

where

E_not_idle = CPI − (FCE_Read + FCE_Write_miss + FCE_Write_RW + FCE_Write_other + FCE_Write_EX + CPI_IO + EIdle_Branch + EIdle_AGI)    (4-24)
Finally, the CPI can be calculated by:

CPI = FCE_Read + FCE_Write_miss + FCE_Write_RW + FCE_Write_other + FCE_Write_EX + CPI_IO + EIdle_Branch + EIdle_AGI + EIdle_ILP + EBusy
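The final split of the non-stall CPI into EBusy and EIdle_ILP follows mechanically from equations (4-22) to (4-24). A minimal sketch with invented component values:

```python
def split_execution_cpi(cpi, stall_components, pu):
    """Split the non-stall CPI into EBusy and EIdle_ILP using
    equations (4-22)-(4-24); stall_components lists the FCE, IO,
    branch and AGI CPI contributions."""
    e_not_idle = cpi - sum(stall_components)  # (4-24)
    ebusy = e_not_idle / (1 + pu)             # (4-23)
    eidle_ilp = pu * ebusy                    # (4-22)
    return ebusy, eidle_ilp

# Invented values: total CPI 2.0, stalls 0.5 + 0.1, pu = 0.6
print(split_execution_cpi(2.0, [0.5, 0.1], 0.6))
```

By construction the two returned values sum back to E_not_idle, so the full CPI identity above is preserved.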
CHAPTER 5: CHARACTERISING MULTIMEDIA APPLICATIONS In this chapter, we characterise the Multimedia applications. We start by indicating the motivation behind selecting this type of workload. Then, we present a set of Multimedia programs serving as our benchmark. This benchmark is then used to characterise the Multimedia workload in isolation (as much as possible) from the architecture. Workload parameters are evaluated using Pentium built-in counters9. Using the obtained workload parameters values, we analyse the results and present a set of attributes shared by Multimedia applications.
5.1 Why Choose Multimedia Applications
Multimedia has become an increasingly important field in both the research and commercial worlds [10]. Multimedia imposes new demands that affect all aspects of a computer system: the processor, the memory system, and the I/O subsystem. Since this is a relatively new field, there have been few quantitative and comparative studies [4]. Furthermore, the demands of Multimedia will force profound changes in current microprocessors [24, 11]. In spite of that, there is a shortage of published evaluations of architectural alternatives suitable for those applications [31]. Almost all major processor vendors have incorporated a set of Multimedia extensions into their processors: Intel MMX [37, 38], HP MAX-2 [23], Sun VIS [51], and MicroUnity [18].
5.2 Multimedia Benchmark Programs
There are many classes of Multimedia applications: video conferencing, image compression, video authoring, image processing, visualisation, 3D graphics, animation, and speech recognition, to name a few. Real-time digital video processing, which is a very important component in video conferencing, video on demand and hypermedia, imposes demanding requirements on the processor. For example, displaying MPEG standard compressed video requires the processor to
9 This is not a necessity; workload parameters may also be obtained from the Pentium II, but the Pentium hardware counters provide more workload detail than the Pentium II counters.
decode and render 30 video frames per second, each frame containing on the order of 10^5 pixels. The processing complexity is on the order of 10^2 operations per pixel. Therefore, on the order of 10^7 operations per second is required. Moreover, Dubey [12] pointed out that graphics, recognition and video are currently among the most demanding Multimedia applications (along with data mining). In this respect, we have chosen the MPEG (Motion Picture Experts Group) video compression standard [17] (with decoding and encoding operations), voice recognition and 3D rendering, as shown in Table 5-1.
Table 5-1: Benchmark Programs

Class               Program Name   Description
Video Processing    Decode         Xing player: XingMPEG Player, Version 3.12, Xing Technology Corporation, 810 Fiero Lane, San Luis Obispo, CA 93401. http://www.xing.com
                    Encode         LSX MPEG Encoder: Version 1.0 Demo, Ligos Corporation. http://www.ligos.com
3D Rendering        Quake          The famous 3D game: Version 1.06 shareware. Id Software. http://www.idsoftware.com
Voice Recognition   Voice          IBM viatype system: IBM Corporation. http://www.ibm.com
5.3 Benchmarking Methodology
We have used our proposed Pentium performance model to study the performance of our benchmark programs. By the use of the Pentium's built-in counters, we were able to measure the model parameters as previously described. The systems we used to analyse the workload are summarised in Table 5-2.
Table 5-2: Systems used in Multimedia Performance Analysis

Machine Name   Processor     Cache/Memory Configuration
P100           Pentium 100   Memory: 16MB; L2: 256KB
P133           Pentium 133   Memory: 16MB EDO; L2: 256KB
5.4 Workload Parameter Values
In this section, we characterise the workload in isolation from the architecture. A set of workload-dependent parameters is obtained with the aid of the Pentium hardware counters. Table 5-3 summarises the workload parameters, and Table 5-4 shows the parameter values for each workload. The parameters pr, pw and pI depend on the cache geometry, which for the P100 and P133 is 8KB, and pB depends on the BTB geometry, which is 256 entries. In the next chapter, we update these values to reflect the changes in the cache (16KB) and the BTB (512 entries). All the parameters except the basic block size are obtained as described in detail in the previous chapter. The basic block size is estimated by assuming a geometric distribution of branches [14]. By measuring the branch fraction pBranch, we calculate the average block size from:

Basic Block Size = 1 / pBranch
Table 5-3: Workload Parameters

Parameter    Description
fBranch      Fraction of branches
fRead        Fraction of memory reads
fWrite       Fraction of memory writes
fIO          Fraction of IO operations
fFloat       Fraction of floating point operations
fOther       Fraction of integer operations
Pr           L1 data cache read hit ratio
Pw           L1 data cache write hit ratio
Pw_m/e       L1 data cache write hit ratio to modified/exclusive line
Ps           L1 data cache write hit ratio to shared line
PI           L1 code cache hit ratio
pB           Branch prediction success ratio
Block size   Average basic block size
pILP         Fraction of time that two successive instructions are independent
2EBusy       Intrinsic execution time
pAGI         Address generation interlock rate
Table 5-4: Workload Parameter Values

Parameter    Encode   Decode   Voice    Quake
fBranch      0.108    0.050    0.174    0.091
fRead        0.419    0.349    0.456    0.356
fWrite       0.145    0.154    0.186    0.160
fIO          0.000    0.001    0.001    0.000
fFloat       0.001    0.002    0.055    0.087
fOther       0.327    0.444    0.127    0.305
Pr           0.985    0.961    0.950    0.964
Pw           0.883    0.692    0.838    0.657
PI           0.975    0.955    0.939    0.985
PB           0.86     0.719    0.829    0.837
Block size   9.48     20.761   6.082    10.946
PILP         0.792    0.806    0.556    0.707
2EBusy       2.012    1.096    2.095    1.378
PAGI         0.08     0.014    0.089    0.038
5.5 CPI Breakdown
In this section we present the experimental results obtained on the different configurations of Pentium-based systems.
Table 5-5: Results for P100 (CPI breakdown percentages)

Component         Encode   Decode   Voice    Quake
CPI               1.487    2.154    2.520    1.915
FCE_Read          15.4%    50.3%    28.0%    32.9%
FCE_Write_miss    2.3%     1.2%     0.0%     1.0%
FCE_Write_RW      0.0%     1.8%     0.0%     1.4%
FCE_Write_EX      0.0%     0.0%     0.0%     0.0%
FCE_Write_Other   0.0%     18.5%    3.7%     12.0%
EIdle_Branch      3.4%     2.0%     3.7%     2.6%
EIdle_AGI         5.8%     0.7%     3.3%     1.8%
EIdle_ILP         5.1%     3.1%     19.2%    10.9%
EBusy             68.1%    22.5%    42.0%    37.3%
Figure 5-1: CPI Breakdown for P100
Table 5-6: Results for P133 (CPI breakdown percentages)

Component         Encode   Decode   Voice    Quake
CPI               1.582    1.956    2.124    1.960
FCE_Read          11.6%    29.3%    26.7%    33.5%
FCE_Write_miss    2.5%     2.5%     2.9%     1.1%
FCE_Write_RW      0.0%     3.7%     0.0%     1.6%
FCE_Write_EX      0.0%     0.0%     0.0%     0.0%
FCE_Write_Other   0.0%     26.5%    3.3%     12.4%
EIdle_Branch      2.6%     3.0%     3.3%     2.6%
EIdle_AGI         4.6%     0.7%     4.5%     2.1%
EIdle_ILP         15.4%    8.5%     17.3%    10.2%
EBusy             63.2%    25.9%    42.1%    36.6%
Figure 5-2: CPI Breakdown for P133
It can be seen from the CPI breakdown that the memory component is very significant, exceeding 50% in some programs. For instance, for the P133 the memory component is 14% in Encode, 62% in Decode, 33% in Voice and 48% in Quake; for the P100 it is 18% in Encode, 72% in Decode, 32% in Voice and 47% in Quake. The relatively large memory component in Decode and Quake is attributed to poor PCI bus drivers in the P100 system under the high write-miss rate and, in the Decode case, to the high rate of IO cycles, which generates DMA cycles that further increase the memory penalty10. Observing the memory CPI components, we conclude that the use of a more advanced memory technology (EDO in the P133) has successfully retained the relative effect of memory while the processor frequency increased. Therefore, memory presents a bottleneck resource, with the potential of speeding up execution by a factor of 2 if the processor-memory gap is resolved.
5.6 Summary of Multimedia Characteristics
In this section, we summarise the characteristics of the four Multimedia programs. They share the following characteristics:
• High degree of ILP: The average number of issuable instructions, assuming a maximum in-order issue rate of 2 instructions per cycle, ranges from 1.5 to 1.9. Moreover, AGI11 rates are less than 0.09.
• Large basic blocks: The average basic block size is about 10 instructions, which decreases the effect of branch prediction.
• High memory requirements: The fraction of CPI contributed by memory is very significant (more than 50% in some programs). (Moreover, as described in appendix B, the L2 cache has a
10 Similar high memory penalties have been reported by Bekerman [5]. Using real-mode test programs including MPEG decoding, we were able to verify this observation.
11 An AGI (Address Generation Interlock) occurs when the address generation of an instruction operand depends on a value calculated by another instruction that has not yet finished execution.
small effect on reducing the memory penalty; totally removing the L2 cache worsens overall performance by less than 20%.)
• Tight instruction loops: Instruction fetch is not an issue. The instruction cache hit ratio is more than 97% (99% for some applications).
CHAPTER 6: PERFORMANCE PREDICTION FOR THE PENTIUM MMX
In the previous chapter, the Pentium model parameters were obtained by conducting experiments on existing machines. In order to assess the accuracy of our model, we consider the case of predicting the performance of the Pentium MMX processor architecture. We use the experimental data obtained from the Pentium processor to predict the performance of a Pentium MMX processor, and then compare the predictions with actual experimental results on the Pentium MMX. (The maximum error is 8%.)
6.1 Pentium MMX Architecture Enhancements
There are two classes of enhancement in this architecture: one in the control store and the other in the datapath. The control-store enhancement is the Multimedia extension to the instruction set. We will not consider these extensions, because our benchmark does not use them. The datapath enhancements are:
1. The data and code cache sizes are doubled
2. The write-back buffer size is doubled, and the write-back buffers are no longer dedicated to a specific pipeline
6.2 Model Modifications
Minor modifications are made to the model to accommodate the increased write-buffer size. Recalling that the write buffer was modelled as an M/M/1/2 queue, we now model it as an M/M/1/4 queue. Thus, equation (4-11) is modified to be

pi = p[queue size is i], where 0 ≤ i ≤ 4

pi = (1 − (λ/µ)) · (λ/µ)^i / (1 − (λ/µ)^5)    (6-1)
Equation (4-8) is modified to be

λ = (pw + ps) / (CPI · (1 − p4))    (6-2)

In addition, equation (4-10) is modified to be

twb = (pw · p4 + (pr + pIO + pw_m/e + ps) · (1p1 + 2p2 + 3p3 + 4p4)) · tmw    (6-3)
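The only structural change is the queue depth, so the occupancy probabilities generalise to an M/M/1/N queue. A sketch, assuming the standard truncated-geometric form of equation (6-1); the example rates are invented:

```python
def mm1n_probs(lam, mu, n):
    """Steady-state probabilities p0..pn of an M/M/1/n queue
    (equation 6-1 for n = 4; the plain Pentium corresponds to n = 2)."""
    rho = lam / mu
    if abs(rho - 1.0) < 1e-12:
        return [1.0 / (n + 1)] * (n + 1)      # degenerate case rho = 1
    norm = (1 - rho) / (1 - rho ** (n + 1))   # normalisation constant
    return [norm * rho ** i for i in range(n + 1)]

# Invented rates: one request every 20 cycles, service time 6 cycles
print(mm1n_probs(0.05, 1.0 / 6.0, 4))
```

With these pi, equation (6-3) simply weights the drain time by the expected buffer occupancy (1p1 + 2p2 + 3p3 + 4p4).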
6.3 Predicting the CPI
6.3.1 Target System Configuration
The target system configuration is:
• Processor: Pentium MMX 166MHz
• Cache/Memory: 512KB L2 cache, 16MB EDO RAM
6.3.2 Parameters
The workload-dependent parameter values are obtained from the P100 test system, as shown in Table 6-1, except that pr and pw are updated using P166 cache hit ratios (since the cache size is different). ps was found to be nearly equal to zero, so pw_m/e is not required. The values of these parameters were obtained by testing the P166 with the Multimedia workload.

Table 6-1: Input Parameters (Workload Dependent)

Parameter    Encode   Decode   Voice    Quake
pr           0.006    0.009    0.015    0.008
pw           0.016    0.034    0.016    0.055
pbranch      0.090    0.0444   0.165    0.091
ILPaverage   1.925    1.862    1.542    1.708
EBusy        1.013    0.485    1.058    0.715
The architecture parameters are obtained by testing the P166 system with the L2 cache disabled while running a test program that generates a sequence of reads and writes. The memory penalties tmr and tmw_ex are found to be 17 cycles. The use of an external write buffer, which cannot be disabled, is the reason for setting tmw to 6 cycles (two normal bus cycles). The obtained parameters are given in Table 6-2.
Table 6-2: Input Parameters (Architecture Dependent)

Parameter   Value
tmr         68 cycles
tmw         6 cycles
tmw_ex      68 cycles
6.3.3 Calculating CPI
It can easily be shown that pr, pw, pIO, ps, pw_m/e, tmr, and tmw can be obtained from the Pentium built-in performance monitors. The CPI can be calculated by solving the following equations:

p4 = (1 − (λ/µ)) · (λ/µ)^4 / (1 − (λ/µ)^5)    (6-1)

CPI = (pw + ps) / (λ · (1 − p4))    (6-2)

FCE_Write = (pw · p4 + (pr + pIO + pw_m/e + ps) · (1p1 + 2p2 + 3p3 + 4p4)) · tmw    (6-3)

where

FCE_Write = FCE_Write_miss + FCE_Write_RW + FCE_Write_other    (6-4)

FCE_Write_EX = pw · (1 − p0) · tmw_ex    (6-5)

where tmw_ex is the memory write penalty seen by the external write buffer. Finally,

CPI = FCE_Read + FCE_Write + FCE_Write_EX + CPI_IO + EIdle_AGI + EIdle_Branch + EIdle_ILP + EBusy    (6-6)
6.3.4 Results

Table 6-3: Performance Prediction Results for P166

                Encode   Decode   Voice    Quake
Actual CPI      1.908    1.582    2.873    1.804
Predicted CPI   1.802    1.482    2.820    1.890
Error           -6%      -6%      -2%      5%
Figure 6-1: Performance Prediction for P166
CHAPTER 7: COMPLEX-ISSUE SUPERSCALAR MODEL (PENTIUM II MODEL)
The objective of this chapter is to present our proposed model for the complex-issue class of superscalar processors. This class of processors is characterised by out-of-order issue through the use of shelving buffers; register renaming, speculative execution, and out-of-order issue and execution are used extensively. Processors in this class include the Pentium II and Pentium Pro, MIPS R10000, IBM/Motorola PowerPC 604, and AMD K5. At the beginning of this chapter, we briefly describe the Pentium II architecture, which is chosen as a representative of this class; the Pentium II is very popular and readily available. Next, we present our proposed model.
7.1 PENTIUM II ARCHITECTURE
The objective of this section is to briefly present the Pentium II architecture; a detailed description can be found in Intel documents [41]. The Pentium II architecture differs significantly from its predecessor (the Pentium). It features out-of-order instruction issue, register renaming, speculative execution, a superpipelined core (12 stages), and a micro-flow mechanism in which CISC operations are dynamically transformed into sequences of RISC-like operations. The L2 cache is tightly coupled to the processor, delivering a transfer rate (between the processor core and the L2 cache) of one-half the processor core frequency. The on-chip L1 caches are organised as two separate caches: a 16KB 4-way set-associative data cache and a 16KB 4-way set-associative instruction cache. The data cache employs a write-back mechanism (versus the write-once policy in Pentium processors) and consists of 8 banks with dual ports. The processor has a write buffer of depth four, which can be accessed by the two execution units (versus dedicating two entries to the u-pipe and two entries to the v-pipe in the Pentium MMX).
There are three decoupled units in the Pentium II processor:
1. Fetch/Decode unit
2. Dispatch/Issue/Execute unit
3. Commit (Retire) unit
7.1.1 Fetch/Decode Unit
Instruction fetch and decode: Each cycle, 16 aligned bytes are fetched from the L1 instruction cache. This stream of instructions is presented to three parallel decoders, which decode and convert each instruction into a set of RISC-like operations known as "µops". All µops have the same format: two logical sources and one logical destination. Most CISC instructions are converted into a single µop; some instructions may require up to 4 µops.
Register renaming: The sequence of generated µops is queued for register renaming, which is done using a register alias table. The µops are then dispatched to the instruction window, which is implemented as a reservation station (window size: 30 instructions). A reorder buffer is used by the commit unit.
7.1.2 Dispatch/Issue/Execution Unit
The job of this unit is to schedule and issue instructions (out of order) to the appropriate functional units for execution. Scheduling is done by scanning the instruction window for independent µops that have their operands and the required functional unit available; a centralised reservation station performs this function. If successful, the reservation station issues these µops to the corresponding functional units. Up to 5 instructions can be executed in a cycle, depending on the required resources. The execution units are:
1. Integer/MMX unit
2. Integer/jump/MMX unit
3. Load unit
4. Store unit (Address)
5. Store unit (Data)
Up to 5 branches can be predicted and speculatively executed.
7.1.3 Commit Unit (Retire Unit)
This unit is responsible for preserving the virtual sequential instruction execution order. Committing a µop involves updating the machine state. The order in which µops are retired is the strict sequential execution order; up to three µops can be committed per cycle.
7.1.4 Memory Interface
A memory-read CISC instruction is converted into one µop during decode. A memory reorder buffer (in the bus interface unit) allows out-of-order execution of loads. A memory-write CISC instruction is transformed into two µops: one for address generation and the other for data generation and store. Stores are executed in strict sequential order; no speculative execution of stores is done.
Figure 7-1: Pentium II Processor Architecture
7.2 Proposed Complex-issue Model (Pentium II)

This section describes our proposed complex-issue performance model for the Pentium II processor. The model is partitioned into three parts:
1. Instruction fetch/decode unit
2. Instruction dispatch/issue/execute unit
3. Commit unit

7.2.1 Instruction Fetch/Decode Unit

The main function of this unit is to construct what is known as the "instruction window". We consider the following factors that affect the performance of this unit:
1. Branch prediction accuracy
2. Basic block size
3. Instruction cache hit ratio
4. Memory bandwidth
7.2.1.1 Assumptions and Definitions

We define a running program to be a dynamic sequence of blocks of instructions. All blocks have the same number of instructions of equal size. Each block has one entry point and one exit point (a basic block). Thus, program control flow is defined as the dynamic sequence of basic blocks executed. We define nb to be the number of blocks fetched in a cycle; thus

    nb = min(Fetch rate / Basic block size, 1)

where Fetch rate is the maximum number of instructions fetched in a cycle.
We assume that at most one block is fetched at any time, because fetching multiple basic blocks per cycle is not typical in current processor architecture types. Further, we define nf to be the number of cycles required to fetch a block, assuming no instruction cache misses. Thus,

    nf = ⌈1 / nb⌉

The ceiling guarantees that fetch cycles are always confined to one block. Because the instruction cache is not perfect, we define pI to be the probability that a fetch hits the instruction cache:

    pI = p[instruction cache hit]

so that qI = 1 − pI is the miss probability. The fetch unit resolves the program control dependencies by using branch prediction. We define pB to be the probability that a branch prediction is correct.

7.2.1.2 The Model

We construct a discrete-time, discrete-state Markov chain model as shown in Figure 7-2. The states f1, f2, …, fnf are the fetch stages; fi is the ith fetch stage (a stage represents one cycle) fetching the same block. The states m1, m2, …, mnf are the fetch idle states due to instruction cache misses. The state b is the state in which the fetch unit is idle due to branch misprediction.
Figure 7-2: State Transition Diagram for Fetch Unit

Solving for the stationary probabilities, we find that

    p_f1 = p_f2 = … = p_fnf = 1 / (nf + nf·(qI/pI) + qB)

    p_m1 = p_m2 = … = p_mnf = (qI/pI) / (nf + nf·(qI/pI) + qB)

    p_b = qB / (nf + nf·(qI/pI) + qB)

where qI = 1 − pI is the instruction cache miss probability and qB = 1 − pB is the branch misprediction probability.
Expanding each instruction-miss state (m1, m2, …, mnf) into Cm states (one per cycle), and likewise expanding the branch recovery state b into Cb states, we get

    p_f1 = p_f2 = … = p_fnf = p_f = 1 / (nf + nf·(qI/pI)·Cm + qB·Cb)

    p_m1 = p_m2 = … = p_mnf = p_m = (qI/pI)·Cm / (nf + nf·(qI/pI)·Cm + qB·Cb)

    p_b = qB·Cb / (nf + nf·(qI/pI)·Cm + qB·Cb)
Thus the effective (useful) fetch rate, λfetch, is given by

    Effective fetch rate: λ_fetch = nf · p_f · nb · Block size    (7-1)
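The fetch-unit expressions can be evaluated numerically. The sketch below is our own illustration (function name and parameter values are ours, not the thesis'); it computes the stationary fetch-state probability and the effective fetch rate of Equation (7-1):

```python
from math import ceil

def fetch_unit(block_size, fetch_rate, p_hit, q_branch, c_miss, c_branch):
    """Effective fetch rate from the fetch-unit Markov chain.

    block_size : average basic block size (instructions)
    fetch_rate : maximum instructions fetched per cycle
    p_hit      : instruction cache hit probability (pI)
    q_branch   : branch misprediction probability (qB = 1 - pB)
    c_miss     : instruction cache miss penalty in cycles (Cm)
    c_branch   : branch recovery penalty in cycles (Cb)
    """
    nb = min(fetch_rate / block_size, 1.0)   # blocks fetched per cycle
    nf = ceil(1.0 / nb)                      # cycles to fetch one block
    q_i = 1.0 - p_hit                        # instruction cache miss probability
    denom = nf + nf * (q_i / p_hit) * c_miss + q_branch * c_branch
    p_f = 1.0 / denom                        # stationary prob. of each fetch state
    return nf * p_f * nb * block_size        # effective fetch rate, Eq. (7-1)
```

With illustrative values (12-instruction blocks, fetch rate 3, 99% instruction cache hit ratio, 10% misprediction rate, Cm = 96, Cb = 10) the effective fetch rate drops from 3 to roughly 1.35 instructions per cycle, showing how strongly the miss and misprediction penalties weigh on the fetch unit.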
The wasted fetch rate depends on the fraction of cycles spent in state b. Thus,

    Wasted fetch rate: λ_w_fetch = p_b · nb · Block size

7.2.2 The Dispatch/Issue/Execution Unit

Instruction dispatch uses register renaming to remove false data dependencies (write-after-read and write-after-write dependencies) between instructions in the instruction window. Instructions are then issued to functional (execution) units for execution, constrained by structural limitations (limited hardware parallelism) and instruction dependencies. A structural limitation means that an instruction cannot be issued because no idle functional unit is capable of executing it. We consider the following factors that affect the performance of this unit:

• Instruction window size

• Functional unit availability

• Data operand availability (degree of ILP)
Figure 7-3: State Transition Rate Diagram of the Instruction Queue of the Dispatch/Issue/Execute Unit¹²

The instruction dispatch/issue process is modelled as a finite-storage Markovian queue with constant arrival rate λfetch, variable service rate µi, and capacity nI. Figure 7-3 shows the state transition diagram for the instruction queue¹³ (instruction window). A state is defined as the number of instructions in the instruction window (0, 1, …, nI). We assume that all false data dependencies have already been removed, so only true data dependencies remain. As the instruction issue rate depends on the current instruction window size, we define µi to be the instruction issue rate when there are i instructions in the instruction window. As i increases, there is a better chance of finding more independent instructions; thus µi increases. At any given time we classify instructions into two classes: independent and dependent. Only instructions from the independent group have all their data values available. Define pILP to be the probability of selecting an instruction from the independent group. Moreover, define a sequence of instruction selections such that all selections are taken from the independent class. This sequence defines the number of instructions issuable in a cycle, and is geometrically distributed. Thus nILP, the average number of instructions issuable in a cycle, is given by

    n_ILP = 1 / (1 − p_ILP)

¹² We emphasise that the labels on the ordered links refer to rates, not probabilities. The transition probabilities can be found by multiplying each rate by dt; in that case, a self-loop has to be added to each state, with rate equal to the sum of the output rates, accounting for the probability that in the next interval of time dt the system remains in the given state.
Since all instructions in the instruction window are checked in the same cycle for issuing possibility, we define the probability that j instructions can be issued at any time, given that there are i instructions in the instruction window, by the binomial distribution:

    p[issue j instructions | window size = i] = C(i, j) · (p_issue)^j · (1 − p_issue)^(i−j),    j ≤ max issue width
where max issue width is the maximum number of instructions that can be issued in a cycle, and pissue is the probability that an instruction can be successfully issued. This probability depends on the availability of data operands, which we account for by considering the age of instructions in active execution units. By age, we mean the number of cycles an instruction has resided in an execution unit. The larger the age of an instruction, the larger the probability that an issue-candidate instruction will be independent of it.
¹³ We use "queue" and "window" interchangeably, as the serving order has no significance: we are interested only in the system throughput.
Figure 7-4: Instruction Data Dependency

In order to assess the degree of independence between two instructions, we assume that on average there are nilp instructions available for execution, and that every nilp instructions are grouped into an instruction level having the same age, as shown in Figure 7-4. Define pDd to be the probability that two instructions at distance d are independent, where distance d means that the number of instruction levels between a currently executing instruction and an issue candidate is d. Each instruction in an instruction level is equally likely to be dependent on any instruction in the previous level. Thus, we define
    p_D1 = p_ILP = (n_ilp − 1) / n_ilp

Thus,

    p_Dn = (p_ILP)^n = ((n_ilp − 1) / n_ilp)^n
The latency of functional units is accounted for by assigning each functional unit a set of execution times (architecture dependent) with associated probabilities (workload dependent). For example, in the case of the load unit, a data access takes a short time if the data resides in the cache, while a cache miss incurs a larger time penalty. Let f be the number of functional units. Functional unit i is assigned execution time ti,1 with probability pi,1, ti,2 with probability pi,2, and so on. We define the probability that a functional unit is busy by

    p_FBusy,i = p[functional unit i is busy]
              = f_i · (average execution rate) · (average instruction latency in unit i)
              = f_i · λ_fetch · p[instruction queue is not full] · Σ_j (p_i,j · t_i,j)    (7-1)
where fi is the fraction of instructions executable at functional unit i. If the multiplicity of a functional unit is more than one, we average this rate uniformly among the copies. And,

    p[instruction queue is not full] = 1 − (Π_{i=1..nI} λ/µ_i) / (1 + Σ_{n=1..nI} Π_{i=1..n} λ/µ_i)
Thus, we may define

    p[an instruction is independent of the others]
      = Σ_{i=1}^{f} f_i · Π_{j≠i} [ (Σ_d p_{D,d} · p_{A,j,d}) · p_FBusy,j + 1 · (1 − p_FBusy,j) ]

where
    p_{A,i,j} = p[instruction age in functional unit i is j | functional unit i is busy]
              = (Σ_{k: t_{i,k} ≥ j} p_{i,k}) / (Σ_k p_{i,k} · t_{i,k})
To account for the availability of an appropriate functional unit, we define the maximum issue rate λmax_issue, assuming infinite ILP, as the reciprocal of the slowest functional unit's effective time (maximum average latency):

    λ_max_issue = 1 / max_i ( f_i · Σ_j p_{i,j} · t_{i,j} )
We define the probability of issuing an instruction by

    p_issue = p[an instruction is independent of other instructions]
            = Σ_{i=1}^{f} f_i · Π_{j≠i} [ (Σ_d p_{D,d} · p_{A,j,d}) · p_FBusy,j + 1 · (1 − p_FBusy,j) ]    (7-2)
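The nested sum-product of Equation (7-2) is easier to see in code. The sketch below is our own illustration (names and structure are ours): each functional unit is described by its instruction fraction f_i and a latency distribution, and for each candidate unit we multiply the age-weighted independence terms over the other units:

```python
def p_independent(units, p_busy, n_ilp, max_age):
    """Approximate p_issue per Eq. (7-2).

    units   : list of (f_i, [(prob, latency), ...]) per functional unit
    p_busy  : list of p_FBusy,i values, one per unit
    n_ilp   : average number of instructions per ILP level
    max_age : largest instruction age considered
    """
    p_d = lambda d: ((n_ilp - 1.0) / n_ilp) ** d   # independence vs. distance d
    total = 0.0
    for i, (f_i, _) in enumerate(units):
        prod = 1.0
        for j, (_, dist_j) in enumerate(units):
            if j == i:
                continue
            mean_lat = sum(p * t for p, t in dist_j)
            # p_A,j,a: fraction of the unit's busy time with age >= a
            dep = sum(p_d(a) * sum(p for p, t in dist_j if t >= a) / mean_lat
                      for a in range(1, max_age + 1))
            prod *= dep * p_busy[j] + (1.0 - p_busy[j])
        total += f_i * prod
    return total
```

A sanity check: when no other unit is busy, every product term collapses to 1 and the candidate is independent with probability 1; busy units with long latencies pull p_issue down through the age-weighted terms.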
Finally, we may now calculate µi, the expected number of instructions issuable per cycle, as follows. First define µ′i to be the issue rate assuming infinite functional unit availability:

    µ′_i = Σ_{j=1}^{min(i,m)} j · p[issue j instructions | window size is i]
         + Σ_{j=m+1}^{i} m · p[issue j instructions | window size is i]

         = Σ_{j=1}^{min(i,m)} j · C(i,j) · p_issue^j · q_issue^(i−j)
         + Σ_{j=m+1}^{i} m · C(i,j) · p_issue^j · q_issue^(i−j)    (7-3)
where m is the maximum issue width and q_issue = 1 − p_issue. Since µi should be bounded by λmax_issue, we may write

    µ_i = (λ_max_issue / m) · µ′_i    (7-4)
To show the effect of these parameters, we plot the issue rate versus instruction window size for several issue probabilities, setting λmax_issue to m.¹⁴ This is shown in Figure 7-5.
Figure 7-5: Instruction rate vs. window size while changing the issue probability

7.2.2.1 Calculating Execution Rate

The instruction execution rate is calculated from:
¹⁴ Changing this parameter merely scales the issue rate proportionally.
    Execution rate: λ_exe = Σ_{i=1}^{max queue size} p[queue size = i] · µ_i    (7-5)
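The birth-death chain of Figure 7-3, combined with Equation (7-5), can be solved numerically. The sketch below is ours: it normalises the product-form stationary probabilities of the finite queue and sums the per-occupancy issue rates:

```python
def execution_rate(lam, mus):
    """Throughput of the finite instruction queue, per Eq. (7-5).

    lam : constant arrival (fetch) rate
    mus : [mu_1, ..., mu_nI], the issue rate at each window occupancy
    """
    # unnormalised stationary probabilities: r_0 = 1, r_n = prod_{i<=n}(lam/mu_i)
    r = [1.0]
    for mu in mus:
        r.append(r[-1] * lam / mu)
    z = sum(r)                                   # normalising constant
    # expected service completions per cycle, summed over occupied states
    return sum((r[n] / z) * mus[n - 1] for n in range(1, len(r)))
```

In steady state the result equals λ · p[queue is not full], so the returned execution rate never exceeds the arrival rate, matching the queue-not-full factor used in Equation (7-1).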
7.2.3 Commit Unit

The effective commit rate depends on:

• The instruction supply rate λexe from the execution units

• The maximum commit rate: 3 µops per cycle in the Pentium II architecture

We model the commit unit as an M/M/1 queue. The infinite-length queue is not a limitation, as the execution rate is generally much less than the commit rate.

Figure 7-6: Commit Unit Model

Figure 7-6 represents the queueing model. Instructions are written to the commit unit when they finish execution. The inter-arrival time is assumed to have an exponential distribution with mean rate λexe, and the service time an exponential distribution with mean rate µc. The throughput can be readily calculated from

    Throughput = p[queueing] · µ_c = λ_exe
CHAPTER 8: PREDICTING PENTIUM II PERFORMANCE

In this chapter we assess the accuracy of our Pentium II complex-issue model. Model parameters were obtained from both Pentium and Pentium II experiments. Those obtained from the Pentium are workload oriented, while those obtained from the Pentium II are architecture oriented (namely, the memory penalties). Substituting these parameters into the complex-issue model, we obtained predicted values of the CPI performance metric for the three Multimedia applications: MPEG Encode, MPEG Decode, and 3D rendering. Comparing these with the experimental CPI values, we found a maximum error of 10%.
8.1 Evaluating Model Parameters

Table 8-1 presents the configuration of the test system. This system is used to obtain experimental results for the three Multimedia programs:
1. MPEG Encode
2. MPEG Decode
3. Real-time 3D rendering

Table 8-1: Pentium II Test System

Processor: Pentium II, 266 MHz
Memory:    Main memory: 32 MB EDO RAM
           L2: 512 KB cache on the same processor card, running at half core frequency
           L1: two (split) 16 KB data and code caches (4-way set associative)

Table 8-2 presents the model parameters.
Table 8-2: Complex-issue Parameters

Workload dependent:
  Block size            — Average number of instructions in a basic block.
  PI                    — Instruction cache hit ratio (probability that the fetched instruction resides in the instruction cache).
  PB                    — Probability that a branch is correctly predicted.
  fload, fstore, fint   — Instruction mix (load, store, and integer fractions respectively).
  Pload_hit             — Data cache read hit ratio (probability that a load hits the data cache).
  Pstore_hit            — Data cache write hit ratio (probability that a store hits the data cache).
  PILP                  — Probability that two instructions are independent.
  2EBusy                — Double the EBusy parameter of the simple-issue model.

Architecture dependent:
  Fetch rate            — Maximum number of instructions fetched in a cycle.
  Cm                    — Instruction cache miss penalty (cycles per instruction cache miss).
  Cb                    — Branch misprediction penalty (cycles per branch misprediction).
  tload_hit, tload_miss — Load latencies in the case of a data cache hit and miss respectively.
  tstore_hit, tstore_miss — Store data latencies in the case of a data cache hit and miss respectively.
  tstore_address        — Store address unit latency (cycles).
  tint                  — Integer latency (cycles).
8.1.1 Workload Dependent Parameters

• Basic Block Size

This parameter is set to the average block size of a given workload. The value is estimated by assuming a geometric distribution of branches [14]. Measuring the branch mix pBranch, we calculate the average block size from:

    Basic block size = 1 / p_Branch
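As a quick arithmetic check of this relation, using the measured branch fractions reported in Table 8-3 (this calculation is our own):

```python
# Measured branch fractions (Table 8-3) -> average basic block size = 1 / p_branch
p_branch = {"Encode": 0.0902, "Decode": 0.0426}
block_size = {name: 1.0 / p for name, p in p_branch.items()}
# Encode -> ~11.09 instructions, Decode -> ~23.47 instructions
```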
Table 8-3 summarises the block size values for the three Multimedia programs.
Table 8-3: Basic Block Parameter Values

Benchmark   PBranch   Basic block size
Encode      0.0902    11.09
Decode      0.0426    23.47
Quake       0.09      11.21

• PI, PB, fload, fstore, fint

Taken from chapter 6.

• Pload, Pstore_address, Pstore_data, Pint, 2EBusy

These parameters form the µop instruction mix. Since most executed instructions are generally simple CISC operations, on average a load maps into a load µop plus an integer µop, a store maps into a store-address µop plus a store-data µop, and an integer instruction maps into one µop.
• PILP (probability that two instructions are independent)

The ILPaverage parameter of the simple-issue Pentium model represents the average number of independent instructions issued in parallel. Since the probability that two instructions are independent is pILP, we may write

    ILP_average = 1·(1 − p_ILP) + 2·p_ILP

thus

    p_ILP = ILP_average − 1

where ILPaverage is obtained directly from the Pentium architecture as described in chapter 4.

8.1.2 Architecture Dependent Parameters
• Fetch Rate

Since the maximum fetch rate of the Pentium II is 3, this parameter is set to 3 regardless of the workload.
• Cb

The value of this parameter is 10 regardless of the workload, as conditional branch instructions are resolved in the 10th pipeline stage.

• tload_miss, tstore_miss, Cm

From experiments on the Pentium II system, the memory read penalty was found¹⁵ to be 96 cycles on an L2 cache miss. The L2 cache read hit ratio is about 0.75, so we take the memory read penalty to be 30 cycles. Since write buffers are used, we approximate the write penalty as half the read access penalty and set it to 12 cycles.
• tload_hit, tstore_hit

The latency of the load unit is 3 cycles, but its issue throughput is 1 per cycle; the full latency is incurred only if an instruction depends on the load. We take the average and set tload_hit to 1.5. The same reasoning applies to tstore_address, which is accordingly set to 1.5. tstore_data is set to 1, since its latency is one cycle.

• tint

The value of this parameter is set to one, as all integer operations have a one-cycle latency (except multiply, which has a latency of 4 but a throughput of 1). However, we account for the intrinsic workload properties and µop generation by defining fµops as follows:

    f_µops = (2·f_load + 2·f_store + f_int) · 2EBusy
¹⁵ The experiment used a test program that generated a read pattern such that a read miss occurs on each memory reference; the obtained penalties are therefore worst-case penalties.
where fload, fstore, and fint are the CISC instruction mix obtained directly from the Pentium hardware counters. The fetch rate λfetch is increased by the factor fµops, and λexe is then decreased by the same factor. With the aid of the Pentium II performance monitors, we obtained the actual CPI metric.
Table 8-4 summarises the model parameters.
Table 8-4: Model Parameters

Parameter       Encode   Decode   Quake
-- Workload dependent --
Block size      11.09    22       11.2
PI              0.99     0.99     0.9925
PB              0.91     0.84     0.895
fload           0.45     0.4      0.35
fstore          0.15     0.16     0.16
fint            0.38     0.4      0.35
Pload_hit       0.98     0.98     0.98
Pwrite_hit      0.91     0.79     0.64
PILP            0.71     0.86     0.71
2EBusy          2.03     1.3      1.27
-- Architecture dependent --
Fetch rate      3        3        3
Cm              96       96       96
Cb              10       10       10
tload_hit       3        1.1      1.5
tload_miss      30       30       30
twrite_hit      1.5      1.5      1.5
twrite_miss     12       12       12
tint            1        1        1
8.2 Performance Prediction

Table 8-5 and Figure 8-1 show the predicted vs. actual CPI.
Table 8-5: Predicted vs. Actual CPI

            Encode   Decode   Quake
Predicted   1.89     1.54     1.30
Actual      1.98     1.58     1.29
Error       -4%      -2%      1%
Figure 8-1: Pentium II Predicted vs. Actual CPI
CHAPTER 9: COMPLEX-ISSUE REDUCTION TO THE SIMPLE-ISSUE MODEL

The objective of this chapter is to assess the accuracy of the complex-issue model (Pentium II) by reducing it to the simple-issue case (Pentium). We first describe the modifications required to reduce the complex-issue model to the simple-issue one. We then use the reduced model to predict the performance of our Multimedia benchmark programs on a Pentium (P166) based system. Some of the model parameters are taken directly from chapter 6; the others are obtained from the previous chapter. Comparing predicted performance with measured performance, we found an error margin of less than 10%.
9.1 Complex-issue Model Reductions

The Pentium architecture may be partitioned into two parts, as shown in Figure 9-1:
1. Instruction fetch/decode
2. Instruction dispatch/issue/execute

The instruction fetch/decode part of the complex-issue model captures instruction prefetch using branch prediction and supplies control-dependence-free instructions to the instruction window. It therefore maps naturally onto the code cache, instruction buffer, and branch prediction logic of the Pentium processor.

The simple-issue nature of the Pentium induces some differences in the instruction dispatch/issue/execute and commit parts of the complex-issue model. Instructions are issued in order, and issue stalls when any functional unit stalls. In-order instruction completion is therefore guaranteed, and the commit part is no longer relevant to this architecture: the instruction completion rate λexe alone specifies the execution rate.

Due to the relative symmetry between the u-pipe and v-pipe of the Pentium processor, we may assume an equal probability for an instruction to be issued to either pipe. Thus

    f_1 = f_2 = 0.5    (9-1)
Figure 9-1: Mapping of the Complex-issue Model onto the Pentium Architecture

Thus, substituting into Equation (7-1), we get

    p_FBusy,1 = p_FBusy,2 = (λ_fetch/2) · (1 − p_2) · Σ_k (p_1,k · t_1,k)    (9-2)

where p_2 is the probability that the two-entry instruction window is full.
Since instruction issue stalls when any functional unit experiences more than one cycle of latency, this translates to

    p_D,j = { p_ilp,  j = 1
            { 0,      otherwise    (9-3)
Since each functional unit may perform more than one function (load/store/integer), we define

    p̂_1,k = p̂_2,k = p_type · p_type,i,   type ∈ {Load, Store, Int},   1 ≤ k ≤ Σ_type max(i_type)    (9-4)
Substituting into Equation (7-2), we get

    p_issue = Σ_{i=1}^{2} f_i · (p_FBusy,i · p_ilp + 1 − p_FBusy,i)
            = p_FBusy · p_ilp + 1 − p_FBusy    (9-5)

where the second line follows from the symmetry of the two pipes and f_1 = f_2 = 0.5.
Given that no instruction is in the execute phase, one or two instructions may be issued; this directly maps to defining the window size to be two. Moreover, since issue is done in order, the combination term is replaced by the value 1. Substituting into (7-3), we get

    µ′_1 = p_issue
    µ′_2 = p_issue·(1 − p_issue) + 2·(p_issue)²    (9-6)
Calculating the probability that the instruction window is full and substituting into (7-5), we get

    λ_exe = Execution rate = λ_fetch · (1 + λ_fetch/µ_1) / (1 + λ_fetch/µ_1 + (λ_fetch/µ_1)(λ_fetch/µ_2))    (9-7)
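Equations (9-2) through (9-7) are mutually dependent: p_FBusy depends on the window-full probability, which in turn depends on µ1 and µ2. One way to evaluate them — our own sketch, not a procedure stated in the thesis — is fixed-point iteration:

```python
def reduced_exec_rate(lam_fetch, p_ilp, lat_dist, iters=200):
    """Execution rate of the reduced (simple-issue) model, Eqs. (9-2)-(9-7).

    lam_fetch : fetch rate (instructions/cycle)
    p_ilp     : probability that two adjacent instructions are independent
    lat_dist  : [(prob, latency), ...] shared latency distribution per pipe
    """
    e_busy = sum(p * t for p, t in lat_dist)     # mean service time per pipe
    p_full = 0.0
    for _ in range(iters):                       # fixed-point iteration
        # (9-2): each pipe serves half of the surviving fetch stream
        p_busy = min(1.0, 0.5 * lam_fetch * (1.0 - p_full) * e_busy)
        p_issue = p_busy * p_ilp + (1.0 - p_busy)            # (9-5)
        mu1 = p_issue                                        # (9-6)
        mu2 = p_issue * (1.0 - p_issue) + 2.0 * p_issue**2
        a, b = lam_fetch / mu1, lam_fetch / mu2
        p_full = (a * b) / (1.0 + a + a * b)     # two-entry window is full
    return lam_fetch * (1.0 + a) / (1.0 + a + a * b)         # (9-7)
```

With fully independent single-cycle instructions (p_ilp = 1, unit latency 1) and a fetch rate of 2, the iteration converges immediately and the execution rate settles below the fetch rate, as the window-full probability throttles issue.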
9.2 Pentium Performance Prediction

We have chosen the P166 MMX processor for performance prediction.
9.2.1 Model Parameters

Model parameters are obtained as follows.

9.2.1.1 Workload Dependent Parameters

• Block size, PI, PB, PILP

All these parameters are set to exactly the same values as obtained in the previous chapter.

• fload, fstore, fint (instruction mix)

We ignore floating-point and I/O instructions. Pint (the fraction of non-memory-accessing instructions), fload, and fstore are obtained directly from the Pentium's hardware statistical counters.

• Pload_hit, Pstore_hit, PI (data cache and code cache hit ratios)

These parameters are obtained directly from the Pentium's hardware statistical counters.
9.2.1.2 Architecture Dependent Parameters

• Fetch rate (instructions/cycle)

As the maximum fetch rate of the Pentium is 2, this parameter is set to 2 regardless of the workload.

• tload_miss, tstore_miss (memory access penalties in cycles)

Taken directly from chapter 6, the load miss penalty is 68 cycles. Since this penalty is an overhead incurred only in the cache miss case, we define

    t_load_miss = t_load_hit + 68

Due to the external write buffers in the test system, which reduce the write penalty, and since a write miss with full write buffers incurs 6 cycles, we set the write miss penalty to 6 cycles. Thus, the total write miss penalty is given by

    t_store_miss = t_store_hit + 6
• Cm (instruction cache miss penalty in cycles)

Since the memory access penalty is 68 cycles and the fetch rate is 2, we set this value to 68/2 = 34.

• Cb (branch misprediction penalty in cycles)

This parameter is calculated as indicated in chapter 4.

• tload_hit, tstore_hit, tint (operation latencies in cycles)
The functional latencies assuming a perfect cache are related to EBusy by the following equation:

    EBusy = p_load·t_load_hit + p_store·t_store_hit + p_int·t_int    (9-1)

The Pentium instruction set is CISC; there are no load/store-specific operations that solely access memory and the register file with no integer work. Consequently, instruction latencies (with a perfect cache) are roughly uniform across the instruction set, and we approximate all operation latencies as equal. Thus,

    t_load_hit = t_store_hit = t_int = 2EBusy    (9-2)

Benchmark   2EBusy   Op. latencies
Encode      2.03     2.03
Decode      1.31     1.31
Voice       2.16     2.16
Quake       1.27     1.27
Table 9-1: Model Parameters for the Multimedia Workload

Parameter      Encode   Decode   Voice    Quake
-- Workload dependent --
Block size     11.09    22.00    4.34     11.20
PI             0.99     0.99     0.98     0.99
PB             0.91     0.84     0.87     0.90
Pload          0.45     0.40     0.47     0.35
Pstore         0.15     0.16     0.22     0.16
Pint           0.41     0.44     0.31     0.49
Pload_hit      0.98     0.98     0.97     0.98
Pwrite_hit     0.91     0.79     0.93     0.64
PILP           0.93     0.89     0.54     0.69
-- Architecture dependent --
Fetch rate     2.00     2.00     2.00     2.00
Cm             34.00    34.00    34.00    34.00
Cb             3.35     3.44     3.28     3.35
tload_hit      2.03     1.30     2.12     1.30
tload_miss     70.03    69.30    70.12    69.30
twrite_hit     2.03     1.30     2.12     1.30
twrite_miss    8.03     7.30     8.12     7.30
tint           2.03     1.30     2.12     1.30
9.2.2 Performance Prediction Results

After specifying the model parameters, we calculate the predicted performance by substituting into the model equations. Table 9-2 presents the predicted and actual CPI for the four Multimedia applications; Figure 9-2 plots the results. All predicted values lie within a 10% error margin.
Table 9-2: Predicted vs. Actual CPI

            Encode   Decode   Voice    Quake
Predicted   1.88     1.56     2.95     1.87
Actual      1.91     1.58     2.87     1.80
Error       1.6%     1.1%     -2.7%    -3.4%
Figure 9-2: Predicted vs. Actual CPI
CHAPTER 10: CONCLUSIONS AND FUTURE WORK

10.1 Conclusions

In this thesis, we demonstrated that analytical models of superscalar performance can achieve reasonably accurate results. This facilitates exploring the design space far more quickly than is possible with trace-driven simulation. Due to the diversity of existing processor architecture types, we presented two models: a simple-issue and a complex-issue model, choosing the Pentium and Pentium II as typical examples.

The thesis also presents a characterisation of four Multimedia applications: MPEG decoding, MPEG encoding, voice recognition, and real-time 3D rendering. All these application programs share the following characteristics:

• High degree of ILP: the average number of issuable instructions, assuming a maximum in-order issue rate of 2 instructions per cycle, ranges from 1.5 to 1.9.

• Large basic blocks: the average basic block size is about 10 instructions, which reduces the effect of branch misprediction. Moreover, AGI rates are less than 0.09.

• High memory requirements: the fraction of CPI contributed by memory is very significant (more than 50% for some programs). Moreover, the L2 cache has only a small effect on reducing the memory penalty (about 20%).

• Tight instruction loops: instruction fetch is not an issue; the instruction cache hit ratio is more than 97% (99% for some applications).
10.2 Future Work

Further work is needed to:

• Apply the models to the PowerPC processor: the models were applied to the Pentium and Pentium II processors, which share the same instruction set and relatively similar hardware system penalties. Further work is needed to apply the models to PowerPC processors, which present two major changes: a RISC instruction set and a different system architecture.

• Model proposed processor architecture types: Multiscalar [48], simultaneous multithreading [25], processor coupling [20], and multithreaded vector [15] architectures are some of many novel approaches that aggressively exploit ILP. Further research is needed to model these complex architecture types and study the effect of Multimedia workloads on them.
APPENDIX A: IDENTIFYING SOME PENTIUM PERFORMANCE COUNTERS

A-1 Introduction

The objective of this experiment is to find out what the branch-related counters in the Pentium count, particularly the "Branch Taken or BTB hits" counter. It is not clear whether these counters count mutually exclusive events or not. To resolve this problem, we experimented with the counters to find out precisely what they count. We obtained important conclusions about the real working of the branch prediction, and found some differences from what is documented in [2].

A-2 The Experiment

1. The following test program pseudo-code is used:

    RDMSR
    if (taken[i] == taken) then goto skip
        ... execute some code with no branches ...
    skip:
        ... execute some code with no branches ...
    RDMSR and calculate the difference in the counters
    increment i
    repeat the previous steps n times

2. The array taken is used to control the branch behaviour. Its elements are set up to either taken or not-taken states.

3. We applied several branch behaviours and obtained the results shown in the next section.
A-3 Conclusion

From the observations of the experiment, we may state that the selected counters count as follows:

BTB hits: counts all hits to the BTB (including hits to entries in the weakly-taken state).

Branch Taken or BTB hits: counts all hits to the BTB, plus taken branches that have no allocated BTB entry. In other words, it counts taken branches that were predicted not taken, plus not-taken branches that are inside the BTB.

Pipeline Flushes: counts all hits to the BTB (including hits to entries in the weakly-taken state).

Also, a branch entry changing state from weakly not-taken to strongly not-taken will be deallocated from the BTB. When the same branch instruction is encountered again, it will miss the BTB and be predicted not taken.
APPENDIX B: PREDICTING PENTIUM MMX PERFORMANCE WITH L2 CACHE

The objective of this appendix is to account for the L2 cache when predicting the performance of the Pentium MMX. The appendix begins by describing how the memory penalty parameters are approximated. Prediction results are then compared to actual results, and a maximum 10% error is found.
B-1 The Effect of L2 Cache
Figure B-1: Effect of L2 Cache on CPI

Figure B-1 shows the effect of the L2 cache on the P166 system (the figure plots CPI and FCE with the L2 cache enabled and disabled). As observed, the effect is less than 23% for all applications. The FCE increases on average by 1.5 (the memory access time is about 65 ns).
B-2 Pentium MMX Performance Prediction Using the Simple-issue Model

Taking the approximate effect of the L2 cache into account, we use our simple-issue model to predict the performance of the Pentium MMX. Table B-1 and Figure B-2 present the prediction results.

Table B-1: Performance Prediction Results for P166

                Encode   Decode   Voice    Quake
Predicted CPI   1.525    1.292    2.577    1.543
Actual CPI      1.556    1.190    2.464    1.622
Error           2%       -8%      -4%      5%
Figure B-2: Performance Prediction for P166
B-3 Pentium MMX Performance Prediction Using the Reduced Complex-issue Model

Taking the approximate effect of the L2 cache into account, we use our reduced complex-issue model to predict the performance of the Pentium MMX. Table B-2 and Figure B-3 present the prediction results.

Table B-2: Predicted vs. Actual CPI

            Encode   Decode   Voice    Quake
Predicted   1.66     1.38     2.83     1.63
Actual      1.54     1.33     2.66     1.55
Error       -7.9%    -3.4%    -6.3%    -5.0%
Figure B-3: Predicted vs. Actual CPI for P166
BIBLIOGRAPHY
[1]
R. D. Acosta, J. Kjelstrup, and H. C. Trong, “An Instruction Issuing Approach to Enhancing Performance in Multiple Functional Unit Processors,” IEEE Trans. Computers, vol. C-35, no. 9, September 1986, pp. 815-825.
[2]
D. Anderson and T. Shanley, Pentium Processor System Architecture 2nd ed. Reading, Massachusetts: MindShare, Inc (Addison-Wesley), 1995.
[3]
T. M. Austin and G. S. Sohi, “Dynamic Dependency Analysis of Ordinary Programs,” Proc. of 19th ISCA, May 1992, pp. 342-351.
[4]
H. Balakrishnan, R.Garg, “Multimedia SPECmarks: A Performance Comparison of Multimedia Programs on Different Architectures,” Technical Report, Computer Science Division, Univ. of California at Berkeley, 1994.
[5]
M. Bekermann and A. Mendelson, “A Performance Analysis of Pentium Processor Systems,” IEEE Micro, vol. 15, no. 3, October 1995, pp. 72-83.
[6]
D. Bhandarkar and J. Ding, “Performance Characterisation of the Pentium Pro Processor,” Proc. of third High Performance Computer Architecture, February 1997.
[7]
J. Cocke and V. Markstein, “The Evolution of RISC Technology at IBM,” IBM J. Res. Develop., vol. 34, no. 1, January 1990, pp. 4-11.
[8]
R. Colwell and R. Steck, “A 0.6-µm BiCMOS Microprocessor with Dynamic Execution,” Proc. Int’l Solid-State Circuits Conf., IEEE, Piscataway, N.J., 1995, pp. 176-177.
[9]
D. Christie, “Developing the AMD-K5 Architecture,” IEEE Micro, vol. 16, no. 2, April 1996, pp. 16-26.
[10]
Communications of the ACM, Special Issue on Digital Multimedia Systems, vol. 34, no. 4, April 1991.
[11]
K. Diefendroff and P. K. Dubey, “How Multimedia Workloads Will Change Processor Design,” IEEE Computer, vol. 30, no. 9, September 1997, pp. 43-45.
[12]
P. K. Dubey, “Architectural and Design Implications of Mediaprocessing,” Tutorial presented at Hotchips XI, August 1997.
[13]
“Overview: The Cyrix M1 Architecture”, Cyrix Corp, http://www.cyrix.com
[14]
P. G. Emma, “Understanding Some Simple Processor-Performance Limits,” IBM J. Res. Develop., vol. 41, no. 3, May 1997, pp. 215-232.
[15]
R. Espasa and M. Valero, “Multithreaded Vector Architectures,” Proc. of third High Performance Computer Architecture, February 1997.
[16]
M. J. Flynn, “Parallel Processors Were the Future… and May Yet Be,” IEEE Computer, vol. 29, no. 12, December 1996, pp. 151-152.
[17]
D. Le Gall, “MPEG: A Video Compression Standard for Multimedia Applications,” Commu. of the ACM, vol. 34, no. 4, April 1991, pp. 46-58.
[18]
C. Hansen, “MicroUnity’s Media Processor Architecture,” IEEE Micro, vol. 16, no. 4, August 1996, pp. 34-41.
[19]
K. Hwang, Advanced Computer Architecture: Parallelism, Scalability, Programmability. Singapore: McGraw-Hill, Inc., 1993.
[20]
S. W. Keckler and W. J. Dally, “Processor Coupling: Integrating Compile Time and Runtime Scheduling for Parallelism,” Proc. of 19th Int. Symp. on Computer Architecture, 1992, pp. 201-213.
[21]
L. Kleinrock, Queueing Systems (volume I: Theory). New York: John Wiley & Sons, 1974.
[22]
A. Kumar, “The HP PA-8000 RISC CPU,” IEEE Micro, vol. 17, no. 2, April 1997, pp. 27-32.
[23]
R. B. Lee, “Subword Parallelism with MAX-2,” IEEE Micro, vol. 16, no. 4, August 1996, pp. 51-59.
[24]
R. B. Lee and M. D. Smith, “Media Processing: A New Design Target,” IEEE Micro, vol. 16, no. 4, August 1996, pp. 6-9.
[25]
J. L. Lo, S. J. Eggers, J. S. Emer, H. M. Levy, R. L. Stamm, and D. M. Tullsen, “Converting Thread-Level Parallelism to Instruction-Level Parallelism via Simultaneous Multithreading,” ACM Trans. Computer Systems, vol. 15, no. 3, August 1997, pp. 322-354.
[26]
N. P. Jouppi and D. W. Wall, “Available Instruction-Level Parallelism for Superscalar and Superpipelined Machines,” WRL Research Report 89/7, Western Research Laboratory, July 1989.
[27]
N. P. Jouppi, “The Non-uniform Distribution of Instruction-Level and Machine Parallelism and Its Effect on Performance,” IEEE Trans. Computers, vol. 38, no. 12, December 1989, pp. 1645-1658.
[28]
S. C. McMahan, M. Bluhm, and R. A. Garibay, “6x86: The Cyrix Solution to Executing x86 Binaries on a High Performance Microprocessor,” Proc. of the IEEE, vol. 83, no. 12, December 1995, pp. 1664-1672.
[29]
“MDR Labs: Performance Analysis of the 6x86,” version 3, MicroDesign Resources, Sebastopol, CA, June 1996.
[30]
V. Milutinovic, Surviving The Design of a 200MHz RISC Microprocessor: Lessons Learned. Washington: IEEE Computer Society Press, 1997.
[31]
T. Mudge, “Strategic Directions in Computer Architecture,” ACM Computing Surveys, vol. 28, no. 4, December 1996, pp. 671-678.
[32]
D. B. Noonburg and J. P. Shen, “A Framework for Statistical Modeling of Superscalar Processor Performance,” Proc. of third High Performance Computer Architecture, February 1997.
[33]
A. Nicolau and J. A. Fisher, “Measuring the Parallelism Available for Very Long Instruction Word Architectures,” IEEE Trans. Computers, vol. C-33, no. 11, November 1984, pp. 968-976.
[34]
D. B. Noonburg and J. P. Shen, “Theoretical Modeling of Superscalar Processor Performance,” Proc. of MICRO-27, November 1994, pp. 52-62.
[35]
D. A. Patterson and J. L. Hennessy, Computer Architecture A Quantitative Approach. San Francisco, California: Morgan Kaufmann, 1996.
[36]
D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick, “A Case for Intelligent RAM,” IEEE Micro, vol. 17, no. 2, April 1997, pp. 34-44.
[37]
A. Peleg, S. Wilkie, and U. Weiser, “Intel MMX for Multimedia PCs,” Commu. of the ACM, vol. 40, no. 1, January 1997, pp. 24-38.
[38]
A. Peleg and U. Weiser, “MMX Technology Extension to the Intel Architecture,” IEEE Micro, vol. 16, no. 4, August 1996, pp. 42-50.
[39]
Pentium® Processor Family Developer’s Manual, Volume 1: Pentium® Processors. Mt. Prospect, IL: Intel Corp, 1995.
[40]
Pentium® Processor Family Developer’s Manual, Volume 3: Architecture and Programming Manual. Mt. Prospect, IL: Intel Corp, 1995.
[41]
Pentium® II Processor Developer’s Manual, Intel Corp, October 1997.
[42]
PowerPC 604e™ RISC Microprocessor Technical Summary, IBM Corp, 1996.
[43]
R. H. Saavedra and A. J. Smith, “Analysis of Benchmark Characteristics and Benchmark Performance Prediction,” ACM Trans. Computer Systems, vol. 14, no. 4, November 1996, pp. 344-384.
[44]
D. Sima, “Superscalar Instruction Issue,” IEEE Micro, vol. 17, no. 5, September 1997, pp. 28-39.
[45]
J. E. Smith and A. R. Pleszkun, “Implementing Precise Interrupts in Pipelined Processors,” IEEE Trans. Computers, vol. 37, no. 5, May 1988, pp. 562-573.
[46]
J. E. Smith and G. S. Sohi, “The Microarchitecture of Superscalar Processors,” Proc. of the IEEE, vol. 83, no. 12, December 1995, pp. 1609-1624.
[47]
J. E. Smith, “Dynamic Instruction Scheduling and the Astronautics ZS-1,” IEEE Computer, vol. 22, no. 7, July 1989, pp. 21-35.
[48]
G. S. Sohi, S. E. Breach, T. N. Vijaykumar, “Multiscalar Processors,” Proc. 22nd ISCA, June 1995, pp. 414-425.
[49]
G. S. Sohi, “Instruction Issue Logic for High-Performance, Interruptible, Multiple Functional Unit, Pipelined Computers,” IEEE Trans. Computers, vol. 39, no. 3, March 1990, pp. 349-359.
[50]
G. S. Tjaden and M. J. Flynn, “Detection and Parallel Execution of Independent Instructions,” IEEE Trans. Computers, vol. C-19, no. 10, October 1970, pp. 889-895.
[51]
M. Tremblay, J. M. O’Connor, V. Narayanan, and L. He, “VIS Speeds New Media Processing,” IEEE Micro, vol. 16, no. 4, August 1996, pp. 10-20.
[52]
M. Tremblay and J. M. O’Connor, “UltraSPARC I: A Four-Issue Processor Supporting Multimedia,” IEEE Micro, vol. 16, no. 2, April 1996, pp. 42-50.
[53]
D. W. Wall, “Limits of Instruction-Level Parallelism,” DEC Research Report, Western Research Laboratory, November 1993.
[54]
D. W. Wall, “Limits of Instruction-Level Parallelism,” Proc. Fourth Conf. ASPLOS. Santa Clara, California, April 1991, pp. 248-259.
[55]
S. Weiss and J. E. Smith, “Instruction Issue Logic in Pipelined Supercomputers,” IEEE Trans. Computers, vol. C-33, no. 11, November 1984, pp. 110-118.
[56]
B. Wilkinson, Computer: Architecture Design and Performance. London: Prentice Hall, 1996.
[57]
K. C. Yeager, “The MIPS R10000 Superscalar Microprocessor,” IEEE Micro, vol. 16, no. 2, April 1996, pp. 28-40.
[58]
A. Yu, “The Future of Microprocessors,” IEEE Micro, vol. 16, no. 6, December 1996, pp. 46-53.
ARABIC SUMMARY (TRANSLATED TO ENGLISH)

This thesis presents two new analytical models for estimating the performance of modern processors, with the aim of identifying the architectural features that have the greatest effect on processor performance. The great majority of modern processor architectures fall into two main classes: simple-issue and complex-issue processors; accordingly, the Pentium and Pentium II processors were chosen as widely used representatives of the two classes. The proposed models lie between simpler models, which assume a uniform distribution of instruction-level parallelism and therefore do not express reality accurately, and trace-driven models, which depend on detailed traces of program behaviour on a particular architecture and therefore require a long time to obtain results and do not permit parametric studies. The thesis shows how the values of the model parameters are obtained. Experiments were conducted on the Pentium and Pentium II processors, using their built-in performance monitors, to assess the accuracy of the models for multimedia applications; with all details taken into account, the models estimated performance to within roughly 10 percent of the experimental results. As part of this research, the demands that multimedia programs place on processor architecture were also characterised.

The thesis consists of ten chapters and two appendices, as follows:
Chapter 1: Introduction to the thesis.
Chapter 2: Background.
Chapter 3: Survey of previous work in processor architecture.
Chapter 4: The simple-issue model, with focus on the Pentium processor.
Chapter 5: Characterisation of the requirements of multimedia applications.
Chapter 6: Performance estimation of the Pentium processor with the multimedia extensions.
Chapter 7: The complex-issue model, with focus on the Pentium II processor.
Chapter 8: Performance estimation of the Pentium II processor.
Chapter 9: Reduction of the complex-issue model to the simple-issue model.
Chapter 10: Conclusions and newly raised research directions.
Appendix A: Functions of some of the performance monitors.
Appendix B: Performance estimation of the Pentium processor with the multimedia extensions, taking the second-level cache into account.

FACULTY OF ENGINEERING, ALEXANDRIA UNIVERSITY
PROCESSOR PERFORMANCE MODELS
A thesis submitted to the Computer Science and Automatic Control Department in partial fulfilment of the requirements for the degree of Master of Science in Computer Science.
Submitted by: Ahmed Hazem El-Mahdy
Registered: September 1995    Submitted: May 1998