
Computer Systems

S. Fuller, Editor

Performance Evaluation of Highly Concurrent Computers by Deterministic Simulation

B. Kumar and E.S. Davidson
University of Illinois, Urbana

Simulation is presented as a practical technique for performance evaluation of alternative configurations of highly concurrent computers. A technique is described for constructing a detailed deterministic simulation model of a system. In the model a control stream replaces the instruction and data streams of the real system. Simulation of the system model yields the timing and resource usage statistics needed for performance evaluation, without the necessity of emulating the system. As a case study, the implementation of a simulator of a model of the CPU-memory subsystem of the IBM 360/91 is described. The results of evaluating some alternative system designs are discussed. The experiments reveal that, for the case study, the major bottlenecks in the system are the memory unit and the fixed point unit. Further, it appears that many of the sophisticated pipelining and buffering techniques implemented in the architecture of the IBM 360/91 are of little value when high-speed (cache) memory is used, as in the IBM 360/195.

Key Words and Phrases: performance evaluation, deterministic simulation, control stream, concurrent computers
CR Categories: 6.20, 8.1

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
This work was supported in part by the Joint Services Electronics Program (U.S. Army, U.S. Navy, and U.S. Air Force) under Contract DAAB-07-72-C-0259, and in part by the National Science Foundation under Grant MCS 73-03488A01.
Authors' present addresses: B. Kumar, Digital Systems Laboratory, Stanford, CA 94305; E.S. Davidson, Coordinated Science Laboratory, Urbana, IL 61801.
© 1978 ACM 0001-0782/78/1100-0904 $00.75.


1. Introduction

Many significant scientific problems, such as the global weather modeling problem, require the use of large amounts of computing time. The last decade has seen the introduction of a number of high-speed computing systems aimed at solving problems of this class in a reasonable time. All of these systems, such as the Illiac IV, CDC Star, TI ASC, CDC 7600, and IBM 360/91, use concurrent architectures to increase the instruction execution rate of a single instruction stream.

In the design of such highly concurrent systems it is particularly important to understand how the system design parameters and job load variations interact in the performance function of the system. This is especially true when radically new architectures are proposed in an attempt to utilize the capabilities of new technology. Analytic modeling of the system is not possible even for a moderate level of system complexity, given the "state of the art" of available analytic tools. Building an easily modifiable prototype of the system and running benchmark jobs on it is prohibitively expensive for a complex system.

In this paper simulation is presented as a viable alternative technique for computer system design optimization. The simulation proposed differs markedly from most simulation done today in that it uses a detailed, deterministic system model rather than simplifying and inaccurate probabilistic assumptions. Variation of system parameters is made possible by incorporating these parameters as variables in the implementation of the simulator. Job load parameters can be varied by using synthetic, controllable job loads.

The use of deterministic simulation models for system design has been investigated in the past and described in the literature, but to a surprisingly small extent. Ballance et al. [3] describe a simulator that was used in the design of the look-ahead unit of the IBM STRETCH system. Boland et al. [4] discuss simulations used in the design of the memory unit of the IBM System 360/Model 91. On a more theoretical level, Tjaden and Flynn [11] used a simulator based on the IBM 7094 to examine the issue of the instruction decode stack size needed to exploit parallelism in programs. However, the above studies were restricted to studying a small portion of a system. Murphey and Wade [8] describe a simulator used to estimate the performance of the IBM System 360/Model 195 after it was designed; however, there is no evidence that this simulator was used in the design process to examine any questions relating to the design of the system. In fact, we know of no previous research in the field involving a detailed, deterministic simulation of a complex, many-parameter computer system in which each of the parameters may assume any of a range of values.

The paper is organized as follows. Section 2 describes the basic concepts of the modeling philosophy. Section 3 describes techniques for the generation of control streams needed to exercise the model simulator. Section 4 is a detailed description of a modeling case study and the results of some experiments conducted using the simulator of the model. Section 5 concludes the paper and points out areas for further research.

2. Modeling Philosophy

As stated in the introduction, our model of a computer system is oriented towards simulation of the system, with the goal of evaluating the performance of various system configurations. The two most important features of this approach are the concept of a control stream and the technique of selecting the best combination of resources to make up the model. Further, the simulation is totally deterministic. These aspects are discussed below.

2.1 The Control Stream
The model is the basis of construction of a simulator that simulates those aspects of the system needed to evaluate its performance. Simulation of the model should therefore provide timing and resource usage statistics for typical system usage. The system need not be emulated for this information to be gathered. Recognition of this fact enables a significant reduction in model complexity, by the introduction of the concept of a control stream.

In the real system an instruction, while it is being fetched from memory and processed by the CPU, traverses a flow path in the system. This flow path through the system is different for each distinct type of instruction. Moreover, in concurrent CPU-memory systems, the data that is needed by an instruction will have its own independent flow path through the system. Typically both the instruction and its data traverse their flow paths simultaneously.

The model of the system consists of resources which correspond in some fashion to the resources comprising the real system. In the model a unit of traffic is generated corresponding to the starting of an instruction along its flow path in the real system. However, no distinction is made in the model between the instruction flow and the data flow caused by that instruction in the real system. The two taken together form the control flow of that instruction and are reflected in the flow path of the corresponding unit of traffic in the model. Thus the instruction and data streams in the real system are replaced by a control stream in the model.

The traffic for a simulator of the model is derived from a program execution trace by one of the methods to be discussed in Section 3. During simulation, however, no attention is paid to the actual data used or produced by the program. Thus the model will be concerned only with the data flow path (inasmuch as it is a portion of the control flow path) and not with the data itself. Since only timing statistics are important, the processing of a traffic unit by a resource in the model consists solely of occupancy of the resource by the traffic unit for the characteristic period of time for that resource in the real system.


2.2 Level of Simulation and Resources in the Model
There is a wide range of choices possible for the level at which the model simulates the system. The criteria for choosing the appropriate level are that it should not be so low as to include system detail that is irrelevant to performance, nor should it be so high as to obscure any factor interactions that possibly play some part in determining system performance. Thus the logic gate level is in most cases too low a level to simulate for performance evaluation; constructing the model and ensuring that it tracks system design changes become cumbersome. We believe that, in the absence of any other information, the most appropriate level of simulation is the functional level, with the system clock cycle chosen as the basic time quantum of the simulator.

The model associates a logical resource with combinations of various steps in the execution sequence of an instruction. This division of the execution sequence and assignment to associated resources is made only as fine as needed to describe the degree of concurrency possible in the system. Thus if there are two distinct consecutive steps in the execution sequence which can never be simultaneously in progress for two different instructions, and if the output of the first step is the only input to the second step and to no other step, then a single resource in the model is assigned to both steps. The processing time of each resource is fixed by the combination of execution steps that it represents. Each of these resources can process only one unit of traffic at a time. A consequence of this technique is that the model will have no more resources than required to reflect system timing and dependency accurately. This assignment reduces the model complexity significantly over one which assigns a resource to each logical execution step. For example, a system with no concurrency, i.e. no instruction look-ahead and no execution unit pipelining or parallelism, is modeled as a single resource with variable but deterministic processing time.

2.3 Deterministic Simulation
The simulator of the model is entirely deterministic. This is in contrast with Monte Carlo simulation, in which certain phenomena in a model are treated as randomly varying quantities. In a deterministic simulator the flow of a unit of the control stream is completely determined by the process in the real system that the unit models, as is the time that it spends in each resource along its flow path. However, it should be noted that the term "deterministic" applies only to the behavior of the simulator for a given control stream, and not to any of the other aspects, such as the method of generation of control streams.
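To make the modeling philosophy concrete, the following is a minimal sketch of a deterministic, cycle-driven simulator in which each traffic unit simply occupies the resources on its flow path, one at a time, for each resource's characteristic number of cycles. The sketch is ours and is written in Python for brevity (the simulator described in Section 4.3 was written in GPSS-10); all class names, resource names, and times are illustrative assumptions, not the authors'.

```python
# A minimal sketch (ours, not the authors') of the deterministic, cycle-driven
# simulation described in Section 2: traffic units occupy model resources, one
# at a time, for each resource's characteristic number of cycles.

class Resource:
    def __init__(self, name, service_cycles):
        self.name = name
        self.service_cycles = service_cycles   # characteristic processing time
        self.busy_until = 0                    # cycle at which the resource frees up

class TrafficUnit:
    def __init__(self, flow_path):
        self.flow_path = list(flow_path)       # resources needed, in order
        self.step = 0                          # next resource on the flow path
        self.ready_at = 0                      # earliest cycle it may proceed
        self.done = False

def simulate(units):
    """Advance the model one clock cycle at a time; everything is deterministic,
    so the same control stream always yields the same cycle count."""
    cycle = 0
    while not all(u.done for u in units):
        for u in units:                        # fixed order = deterministic priority
            if u.done or u.ready_at > cycle:
                continue
            res = u.flow_path[u.step]
            if res.busy_until <= cycle:        # resource free: occupy it
                res.busy_until = cycle + res.service_cycles
                u.ready_at = res.busy_until
                u.step += 1
                u.done = (u.step == len(u.flow_path))
        cycle += 1
    return max(u.ready_at for u in units)      # cycle at which the last unit finishes

# Example: two traffic units contending for a one-cycle decoder and a
# three-cycle execution unit (hypothetical resources and times).
decoder, exec_unit = Resource("IDEC", 1), Resource("EXEC", 3)
print(simulate([TrafficUnit([decoder, exec_unit]),
                TrafficUnit([decoder, exec_unit])]))   # prints 7
```

Throughput in the sense used later in Section 4.4 would then simply be the number of traffic units divided by the returned cycle count.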

3. Control Stream Generation

To exercise the simulator of the model, a control stream to be processed by the simulator must be generated. Since the simulator does not perform the computation accomplished by the original instruction and data streams, each stream instruction need only be described in sufficient detail to enable the simulator to determine its dynamic flow. This information would minimally consist of:
(1) The static control flow path of the instruction, i.e. the resources needed to process the instruction in the order that they are needed. For example, in a concurrent system, some of the resources needed by an instruction may be the instruction decoder, a particular execution unit, a memory location from which an operand is to be fetched, buses to transmit the operands to the execution unit, etc.
(2) The dependency of this instruction on instructions preceding it in the stream. This information is necessary for the simulator to set up the interlocks to ensure correct sequencing of the stream. Two types of dependency are found in programs: data dependency, caused by two instructions which reference the same operand location, and procedural dependency, caused by instructions which follow conditional branch instructions.

Thus, for a control stream to be executed by the simulator of the model of a concurrent system, the data dependency information for an instruction would point to the most recent instructions that read from or wrote into the operand locations referenced by this instruction. The procedural dependency information would point to the most recent conditional branch instruction that must be executed before this instruction is executed. Note that the control stream represents a single execution of a single program. Thus all activity following branches is known a priori; execution in the simulator is merely delayed until such time as the branch would have been completed in the real system. (A sketch of such a per-instruction record is given after Section 3.1.)

3.1 Design Assumptions
A major assumption made in the following discussion of control stream generation is that the gross architecture of the system has already been decided. Thus the instruction set has been defined, as have the values for some of the system parameters, while only the possible ranges of values for some other parameters have been decided. The aim of the design procedure at this stage is to produce a fully specified architecture in which all system parameter values are defined. This includes the process of designing new processors in a compatible family of systems, such as the IBM System 360. It should be noted that this assumption is not strictly necessary; the performance simulator may be just one tool in an overall design process that includes an instruction set design process and a mechanism for generating programs coded in a target instruction set.
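As an illustration of the per-instruction description in items (1) and (2) above, a control stream instruction might be recorded as follows. This is a sketch in Python; the record layout and field names are ours, not the authors'.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ControlStreamInstruction:
    """One unit of the control stream: enough to determine its dynamic flow
    through the model, with no reference to the data it operates on."""
    index: int                        # position in the single execution trace
    flow_path: List[str]              # resources needed, in order, e.g.
                                      # ["IDEC", "ADGEN", "MEMORY", "FXEU"]
    data_deps: List[int] = field(default_factory=list)
                                      # most recent instructions that read or wrote
                                      # this instruction's operand locations
    branch_dep: Optional[int] = None  # most recent conditional branch that must
                                      # complete before this instruction executes
```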


3.2 Control Stream Generation
Two methods of control stream generation are now described.

3.2.1 Control stream generation from program traces. The instruction execution trace of a real program is gathered while it executes on a system with an identical instruction set. Each instruction in the trace is then mapped into a control stream instruction, specified by the set of parameters needed to describe it to the simulator. The static control flow path of each instruction is entirely determined by the operands that it uses and the operations that it performs on them. Data dependency information is gathered from a simple forward scan of the trace by maintaining a list of operands used and the most recent instructions that used them. This list need only keep track of dependencies on a certain number of most recent instructions, on the grounds that instructions further back would have completed execution and will not delay instructions far ahead in the stream. For every instruction, the data dependency information is then derived by scanning the list for the operands used by this instruction and specifying the most recent instructions to use those operands. The list is then updated. Dependency interlocks built into the simulator use this information to prevent improper out-of-sequence usage of operands.
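The forward scan just described might look as follows. This is our Python sketch (the authors' trace-to-stream converter was written in SAIL, as noted in Section 4.2); the window size and trace format are assumptions made for illustration.

```python
from collections import deque

WINDOW = 64   # assumed bound: dependencies further back are taken to have completed

def extract_data_deps(trace):
    """trace: one (reads, writes) pair of operand-location sets per instruction.
    Returns, for each instruction, the indices of the most recent earlier
    instructions that used (read or wrote) its operand locations."""
    last_writer = {}                  # location -> index of most recent writer
    last_readers = {}                 # location -> readers since that write
    recent = deque(maxlen=WINDOW)     # sliding window of instruction indices
    deps = []
    for i, (reads, writes) in enumerate(trace):
        d = set()
        for loc in reads | writes:                    # dependency on the last writer
            if loc in last_writer and last_writer[loc] in recent:
                d.add(last_writer[loc])
        for loc in writes:                            # dependency on intervening readers
            d.update(r for r in last_readers.get(loc, []) if r in recent)
        deps.append(sorted(d))
        for loc in reads:                             # update the list for later instructions
            last_readers.setdefault(loc, []).append(i)
        for loc in writes:
            last_writer[loc] = i
            last_readers[loc] = []
        recent.append(i)
    return deps

# Instruction 2 reads R1, which instruction 0 wrote, so it depends on instruction 0.
print(extract_data_deps([({"R2"}, {"R1"}), ({"R3"}, {"R4"}), ({"R1"}, {"R5"})]))
# -> [[], [], [0]]
```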

3.2.2 Synthetic stream generation. The code executed in a real or hypothetical class of programs can be characterized by a number of statistical distributions. The information contained in these distributions must be enough to derive the main attributes of control streams described earlier. For example, resource demands of the control stream can be derived from an instruction frequency distribution. Data dependency information for instructions in the control stream can be derived from a distribution of the number of intervening instructions between two instructions accessing the same operands. Procedural dependency arises from the occurrence of algorithmic control constructs in programs. Almost all the branch instructions in programs can be attributed to one of the following high level language features: conditional constructs (if-then-else and case statements), iterative constructs (for and while statements), and procedure calls and returns. Thus we feel that the procedural dependency information for a control stream is best derived from distributions describing the occurrence of high level language features in that class of programs. For example, iterative constructs can be described by distributions of the iteration count and the length of the iteration (in instructions).

We now outline a procedure for stream generation using these statistical distributions. We will assume that the statistical distributions characterizing a class of programs have been gathered from execution of programs on a system with an identical instruction set. The instruction frequency distribution is sampled to decide the resource usage pattern of the next instruction in the stream. If it is not a branch instruction, data dependency information is generated for it by sampling the data dependency distributions. If it is a branch instruction, a high level language construct is generated, depending on the type of branch at hand. For example, a branch-on-counter-condition instruction, such as BXLE or BXH on the IBM 360, is most often used to implement for-loops, and will trigger the generation of a for-loop construct in this scheme. This includes generation of an iteration count and the length of the iteration (in instructions) from the corresponding distributions. A procedure similar to the above is then followed for generating the instructions in the construct. When the entire construct has been generated, the outer procedure for generating the main stream is continued. The stream length is chosen by sampling the program length distribution (in instructions), and the generation procedure is stopped when this length has been reached.

The above procedure is a first order approximation, since it assumes that there is no correlation between the occurrence of successive instructions in programs. More refined procedures would replace the instruction frequency distribution by higher order distributions that describe the occurrence of instruction pairs, triplets, etc. It should be emphasized again that the stochastic generation of control streams, as described above, in no way alters the deterministic nature of the simulator itself. In fact, the main reason for experimenting with synthetic control stream generation is to attempt to parameterize classes of programs. One can then investigate the performance of such concurrent computers not only as a function of the system architectural parameters but also as a function of the workload parameters. A sketch of such a generator is given below.
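The outline above might be realized as in the following sketch (ours, in Python). The distributions, instruction classes, and loop handling shown here are assumed for illustration only; in practice they would be measured from the program class of interest.

```python
import random

# Assumed, illustrative distributions; real ones would be measured from traces.
INSTR_FREQ   = {"FLOAT": 0.45, "FIXED": 0.45, "BRANCH": 0.10}
DEP_DISTANCE = [1, 2, 3, 5, 8]     # intervening instructions between uses of an operand
LOOP_LENGTH  = [4, 8, 16]          # for-loop body lengths, in instructions
LOOP_COUNT   = [2, 5, 10]          # for-loop iteration counts

def sample(dist):
    if isinstance(dist, dict):                     # frequency distribution
        return random.choices(list(dist), weights=list(dist.values()))[0]
    return random.choice(dist)                     # empirical list of observed values

def generate_stream(length):
    """First-order synthetic control stream: each instruction is drawn
    independently; a branch triggers generation of a whole loop construct."""
    stream = []
    while len(stream) < length:
        kind = sample(INSTR_FREQ)
        if kind != "BRANCH":
            dep = len(stream) - sample(DEP_DISTANCE)           # data dependency pointer
            stream.append((kind, dep if dep >= 0 else None))
        else:
            body_len, count = sample(LOOP_LENGTH), sample(LOOP_COUNT)
            body = [(sample({"FLOAT": 0.5, "FIXED": 0.5}), None) for _ in range(body_len)]
            for _ in range(count):                             # one pass per iteration
                stream.extend(body)
                stream.append(("BRANCH", len(stream) - body_len))   # back to body start
    return stream[:length]

print(len(generate_stream(100)))   # a 100-instruction synthetic control stream
```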

4. Case Study--Modeling the System 360 Model 91 CPU-Memory System

The System 360 Model 91 was chosen as a case study for two reasons: the high degree of concurrency in its operations and the availability of fairly detailed descriptions of it in the literature. Most of the documentation on which the model is based is from [1, 2, 4, 10]. For a detailed description of the model and the experiments conducted, see [6]. The level of simulation chosen was such that the basic time quantum of the simulator was the CPU clock cycle.

4.1 Model of the System
The purpose of this simulation was to evaluate a number of system configurations based on the Model 91. The model was therefore constructed to handle any of a number of such configurations. Thus the grouping of execution stages into resources does not follow Section 2.2 with respect to the Model 91 itself, but does so with respect to a class of augmented versions of the Model 91. In the following subsections, execution times of resources are given in CPU cycles, each cycle lasting 60 nsec in the Model 91.


In the description of the model, a unit of traffic and the type of entity in the real system that it models have been used interchangeably. For example, "memory reference" has been used in place of "the traffic unit representing a memory reference." Further, in place of the pseudo-processing that a model resource does, the function accomplished by the corresponding resource in the real system is quoted for descriptive purposes. For example, the model resource IDEC, which represents the instruction decoder of the real system, is described as "decoding the instruction in one cycle," whereas all that occurs in the simulation of the model is that the unit of traffic representing the instruction occupies the resource IDEC for one cycle.

4.1.1 The memory unit. The resources and possible control flow paths in this unit are shown in Figure 1. SAB (Storage Address Bus) is a resource that transfers a memory reference address in one cycle to the memory unit. If the reference is a fetch request and the memory bank referenced by it is busy, the reference is put in the REQST (Request Stack) buffer. A store reference is held in the SAR (Store Address Registers) buffer until the datum to be stored by that reference is put in the SDB (Store Data Buffer) by the CPU. When a reference is ready to be initiated, i.e. the bank referenced is free, the reference is transferred from the SAR or REQST to the referenced bank in one cycle by the SAB. It occupies the bank for a period equal to the memory cycle time. At this stage, if it is a store reference, its function is completed and it is terminated. If it is a fetch, it returns either to the instruction unit or to the appropriate execution unit as the contents of the location addressed by that reference. Not shown in Figure 1 are the handling of the multi-access feature and the priority scheme for initiating references [4].

4.1.2 The instruction unit. Figure 2 shows the resources and control flow paths possible in this unit. Instruction-fetch memory references are sent to the memory unit at a maximum rate of one per cycle. These transfers are subject to the availability of space in IBUF, the FIFO buffer in which prefetched instructions are held, and the clearing of conditional branch interlocks. A prefetched instruction is held in IBUF until it is its turn to be decoded. IEX (Instruction Extractor) is a resource that extracts the next instruction from IBUF in one cycle, while IDEC (Instruction Decoder) decodes the instruction in another cycle. The instruction decoding rate is limited by the availability of space in the operation buffers in the execution units and the clearing of conditional branch interlocks. Decoded arithmetic instructions are sent to the execution units, while branch instructions are executed in the instruction unit itself. If the instruction requires an operand from memory, the control flow splits into two paths, with one flow path representing the instruction and the other the memory reference. This reference is processed by the resource ADGEN (Operand Address Generator) in one cycle, corresponding to the computation of the operand address, following which it goes to the memory unit.

Fig. 1. The memory unit. [Figure: resources and control flow paths of the memory unit. Key to abbreviations: SAB: Storage Address Bus; BANKS: Memory Banks; REQST: Request Stack; SAR: Store Address Registers; SDB: Store Data Buffer.]
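The timing rules of the memory unit described in Section 4.1.1 can be summarized, for a single reference, roughly as follows. This is our sketch; the cycle count is the normal-system value quoted later in Section 4.4.6, and contention for the SAB itself is ignored.

```python
MEMORY_CYCLE = 12   # bank occupancy in CPU cycles (normal-system value, Section 4.4.6)

def memory_reference(kind, arrival, bank_free_at, store_datum_ready=0):
    """Rough timing of one reference. kind is 'fetch' or 'store'; arrival is the
    cycle the address enters the SAB; bank_free_at is when the addressed bank is
    next free; store_datum_ready is when the CPU puts the datum in the SDB."""
    ready = arrival + 1                          # one cycle on the SAB to the memory unit
    if kind == "store":
        ready = max(ready, store_datum_ready)    # wait in the SAR for the datum (SDB)
    start = max(ready, bank_free_at) + 1         # one more SAB cycle, from REQST/SAR to the bank
    bank_busy_until = start + MEMORY_CYCLE       # bank occupied for the memory cycle time
    return bank_busy_until                       # store terminates / fetch data returns here

# A fetch arriving at cycle 5 whose bank is busy until cycle 20 waits in REQST
# and completes at cycle 33.
print(memory_reference("fetch", arrival=5, bank_free_at=20))
```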

Fig. 2. The instruction unit. [Figure: resources and control flow paths of the instruction unit. Key to abbreviations: IBUF: Instruction Buffer; IEX: Instruction Extractor; IDEC: Instruction Decoder; ADGEN: Operand Address Generator.]

The details of branch instruction execution, the loop mode feature, and IBUF handling [1] are not shown.

4.1.3 The fixed point execution unit. Figure 3 shows the resources and control flow paths possible in this unit.


Fig. 3. The fixed point unit. [Figure: resources and control flow paths of the fixed point unit. Key to abbreviations: FXOS: Fixed Point Operation Stack; FXIU: Fixed Point Instruction Unit; FXB: Fixed Point Operand Buffer; FXBB: Fixed Point Operand Buffer Bus; FXEU: Fixed Point Execution Unit; FXRB: Fixed Point Register Bus.]

Fig. 4. The floating point unit. [Figure: resources and control flow paths of the floating point unit. Key to abbreviations: FLOS: Floating Point Operation Stack; DECODE & SELECT: Two-stage Floating Point Instruction Decoder; FLB: Floating Point Operand Buffer; FLBB: Floating Point Operand Buffer Bus; FLRB: Floating Point Register Bus; ADD: Add Unit Reservation Stations; AD1 & AD2: Two-stage Floating Point Add Unit; MUL: Multiply/Divide Unit Reservation Stations; MULDIV: Floating Point Multiply/Divide Unit; CDB: Common Data Bus.]

FXOS (Fixed Point Operation Stack) is a FIFO buffer that holds the instruction sent from the instruction unit until it is to be decoded. Decoding is done in one cycle by the resource FXIU (Fixed Point Instruction Unit), after which the instruction, if it is neither a load nor a store, goes to FXEU (Fixed Point Execution Unit). FXB (Fixed Point Operand Buffer) is a buffer that holds operands fetched from memory until the instruction that will use them has control of FXEU, whereupon they are transmitted to FXEU in one cycle by the resource FXBB (Fixed Point Operand Buffer Bus). Register operands, when they are available, are transmitted to FXEU in one cycle by the resource FXRB (Fixed Point Register Bus). When it has all its operands, the instruction is processed by FXEU for a period equal to the instruction execution time. The result of the instruction is then sent either to a register or to the SDB in one cycle by FXRB. Loads and stores are executed by FXIU itself using FXBB and FXRB. Not described above is the handling of the data dependency interlocks.

4.1.4 The floating point unit. Figure 4 depicts the resources and control flow paths possible in this unit. FLOS (Floating Point Operation Stack) is a FIFO buffer that holds the instructions sent from the instruction unit, until they are decoded. Decoding is done by the two-stage pipeline consisting of the resources DECODE and SELECT, each of which takes one cycle. ADD and MUL are buffers that model the add and multiply unit reservation stations respectively, and hold decoded instructions until their operands arrive and the execution units are free. The decoding rate of DECODE and SELECT is limited by the availability of space in these buffers. FLB (Floating Point Operand Buffer) is a buffer that holds operands fetched from memory until the instructions that need them reach ADD or MUL, whereupon they are transmitted to ADD or MUL by the one-cycle resource FLBB (Floating Point Operand Buffer Bus). FLRB (Floating Point Register Bus) is a similar one-cycle resource that transmits register operands. AD1 and AD2 are one-cycle resources modeling the two-stage pipelined add unit, while MULDIV models the multiply/divide unit [2]. The CDB models the Common Data Bus which is the heart of the Tomasulo algorithm, details of which are indicated in [10]. Load and store instructions do not need reservation stations and are executed by the DECODE-SELECT combination using FLBB, FLRB, and CDB.

4.2 Control Streams Used in the Experiments
Synthetic control streams were not used in this phase of the research. Two programs written in Fortran were traced while executing on the IBM System 360/75 at the Computing Services Office of the University of Illinois. These traces were transformed into the control streams used in the experiments.


One of these programs, called ERROR, is a scaled down version of a program used as a benchmark at the Computing Services Office. This program has a large amount of double precision floating point computation done in predominantly straight-line code, i.e. there are very few branches. The proportion of instruction types in ERROR is 94 percent floating point instructions, 4 percent fixed point instructions, and 2 percent branches. The second program, called EIGEN, is a portion of a program used to find the eigenvalues of a 14 × 14 floating point matrix chosen from [5]. It uses the subroutines TRED1 and TQL1 from the EISPACK library [9]. Being a matrix manipulation program, EIGEN has a large number of loops of varying size. The proportion of instruction types in the program is 41 percent floating point, 51 percent fixed point, and 8 percent branches. The programs were traced using a modified version of the TRACE-360 package obtained from the University of Waterloo. The conversion from the traces to control streams was done by a program written in SAIL on the DEC 10 system at the Coordinated Science Laboratory of the University of Illinois.

4.3 Simulator Implementation
The simulator of the model described in Section 4.1 was implemented in GPSS-10 on the DEC 10 system at the Coordinated Science Laboratory. To facilitate experimentation, a number of model features were parameterized. Thus by specifying alternative sets of parameter values, various system configurations based on the Model 91 can be realized. Some of the specifiable parameters are given below. All buffer sizes can be varied. The memory cycle time can be changed, as can the number of banks of memory. The number of bytes fetched per memory reference can be varied. The priority scheme for memory reference initiation can be changed. The Tomasulo algorithm for operand forwarding in the floating point unit can be removed, as can the loop mode feature in the instruction unit and the multi-access feature in the memory unit. The conditional branch handling policy of fetching instructions from the branch target address can be stopped. The fixed point unit can be pipelined to a limited degree, while the floating point units can be "unpipelined." A number of these simulator features were used in the experiments, as described below.

4.4 Discussion of Experiments Performed and Results
In this section, we present a summary of some of the interesting experiments conducted using the simulator. Table I lists the performance of each experimental system on each of the programs. The system performance is measured by the average instruction throughput sit of the system. This is defined as the average number of instructions completed per CPU cycle over the run of the program, and is computed by dividing the total number of instructions executed in the program by the number of cycles that the simulator took to execute the program. sit is thus the MIPS rate of the system normalized to a CPU cycle time of one second. Also shown is the system performance expressed as a fraction of the "normal" system performance; the normal system is that in which the parameter values are set to match the actual Model 91 system.
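For example (our arithmetic, using the 60 nsec cycle time given in Section 4.1), the normal system's throughput of sit = 0.313 on ERROR corresponds to 0.313 instructions per 60 nsec, or roughly 5.2 million instructions per second.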

Table I. Performance of Model 91 based systems.

System                                     ERROR Throughput sit     EIGEN Throughput sit
No.  Description                           Actual    Normalized     Actual    Normalized
0    normal                                0.313     1.00           0.277     1.00
1    serial                                0.034     0.11           0.040     0.14
2    no Tomasulo algorithm                 0.237     0.76           0.272     0.98
3    pipelined fixed point unit            0.314     1.00           0.327     1.18
4    no index register data dependency     0.318     1.02           0.357     1.29
5    no loop mode                          0.313     1.00           0.246     0.89

4.4.1 A strictly serial Model 91. System 1 has the same resources as the Model 91, but no concurrency is allowed between different instructions. As shown, the normal system achieved a performance increase factor of 9.1 on ERROR and 7.1 on EIGEN over a strictly serial machine. Thus the "order of magnitude" goal for performance improvement due to architectural changes alone, a stated objective of the 360/91 designers [1], is nearly met on these traces.

4.4.2 Effect of removing the Tomasulo algorithm on performance. System 2 is obtained by removing the Tomasulo algorithm for operand forwarding in the floating point unit. Thus the floating point decoder FLIU does not send an instruction to a reservation station until the operands required by it are available, and consequently the instructions following it are delayed. Further, the CDB is now used solely as a result bus, transporting the results of execution only to registers or to the SDB, from which they must be fetched by instructions that need them. It is seen that the performance of the system on ERROR fell to 0.76. Statistics gathered during the normal system execution showed that 80 percent of the floating point instructions needed to wait for their operands to become available. Thus the large performance drop of system 2 on ERROR is to be expected. Statistics collected showed that resource utilization in system 2 was very unbalanced. The mean utilization of the reservation stations fell by 67 percent below the normal system, while the mean processing times in IDEC and DECODE rose by 34 and 53 percent respectively, only because instructions waited longer in them for buffers to be freed. On the other hand, the performance of system 2 on EIGEN was almost the same: 0.98.


Statistics for EIGEN showed that in the normal system, the fixed point unit needed as much as 89 percent of the total run time to execute its instructions. Thus decreasing the bandwidth of the floating point unit did not materially affect the performance of this fixed point unit bound program.

4.4.3 Effect of pipelining the fixed point unit on performance. System 3 attempts to increase the bandwidth of the fixed point unit by introducing a moderate amount of pipelining. In the normal system the instruction at the head of the FXOS queue is decoded by FXIU and executed by FXEU (except for loads and stores). The next instruction in the FXOS queue is not started until this one has completed execution. We conjectured that this delay is a major reason for the low fixed point unit bandwidth. Thus, in system 3, once an instruction has acquired its operands, the decoder FXIU dispatches it to an infinitely large buffer of reservation stations from which FXEU will execute it, or executes the instruction itself if it is a load or a store. As soon as it has done this, FXIU is free to decode the next instruction in FXOS. However, no attempt is made to optimize operand forwarding, as the Tomasulo algorithm does in the floating point unit. As expected, the performance increase of this system on ERROR was negligible, since ERROR has only about 4 percent fixed point instructions. However, the performance on EIGEN improved to 1.18. The mean waiting time of an instruction in FXIU decreased by 60 percent, and in FXOS by 57 percent, thus freeing up space for later instructions in the stream. Further, the queue of the fixed point unit reservation stations held a maximum of three decoded instructions at any time over the run of the program. This suggests that the performance of the fixed point unit of the 360/91 can be improved with a fairly small increase in complexity.

4.4.4 Effect of removing index register data dependency on performance. EIGEN, being a matrix manipulation program, uses index registers to a large extent. The operations on these registers, which are also general purpose registers, are done in the fixed point unit. In fact, the only programmable means of communication between the two execution units is through the address computations associated with memory references of floating point instructions. Thus there is a covert dependency of the floating point unit on the fixed point unit. To examine the effect of this dependency on performance, a fictitious system 4 was simulated in which it was assumed that address computations never had to wait for their index registers to be available. The performance on ERROR increased to only 1.02, as it does a much smaller amount of indexing than EIGEN. By contrast, the performance on EIGEN increased to 1.29. The average time that a memory reference spent in ADGEN for address computation dropped by 65 percent. This shows clearly that the fixed point unit in the normal system does not have the bandwidth to match the operand-address supplying rate of the instruction unit, and is thus holding up the floating point unit as well.

4.4.5 Effect of removing the loop mode feature on performance. The loop mode feature of the Model 91 enables iterative loops that are small enough to be held in the instruction buffer, so that for iterations after the second one, those instructions need no longer be refetched from main memory. In system 5, this feature is disabled. As expected, the change in performance on ERROR was negligible, since it has no loops small enough to fit in the 64-byte-long IBUF. Even the performance on EIGEN fell to only 0.89. Statistics gathered from the normal system show that, while executing EIGEN, the system spent 46 percent of the run time in loop mode. Thus the fairly small drop in performance can only be explained by the observation that it was the fixed point unit, and not the instruction unit, that was the bottleneck in the normal system.

The experiments on systems 3, 4, and 5 bring to light an imbalance in the design of the Model 91. It is reasonable to expect that programs with a number of short iterative loops will also have a considerable amount of index register manipulation, since most loops are used to index through some data structure. However, while the loop mode feature optimizes the handling of loops, its effect is severely limited by the time it takes for the low bandwidth fixed point unit to perform the index register manipulations.

4.4.6 Effect of the memory unit on performance. To evaluate the effect of the memory unit on performance, several systems were constructed by varying either the memory cycle time or the number of memory banks (the normal system has 16 memory banks, each with a cycle time of 12 CPU cycles). Table II lists the performance of these systems on ERROR and EIGEN. It is seen that the system performance is a very sensitive function of the memory cycle time. Figure 5 describes this phenomenon graphically and shows that the normalized performance on both programs decreases approximately linearly with increases in cycle time over a fairly wide range of cycle times. This is true not only for ERROR, but also for EIGEN, for which the serial execution of the fixed point unit just aggravates the situation. Increasing the number of memory banks, while it does affect the number of memory conflicts, is seen to have a smaller effect on performance than decreasing the cycle time. Decreasing the number of memory banks, however, does degrade performance considerably.

A number of the CPU architectural features, such as loop mode, extensive buffering, multi-access, and prefetching instructions from the target address of a conditional branch, were necessitated by the disparity between the CPU and memory bandwidths. The experiments described in this subsection show that the memory cycle time is still the dominant factor in determining performance despite the above-mentioned CPU features.


Table II. Performance of Model 91 based systems with varying memory parameters.

Memory cycle time    Memory        ERROR Throughput sit     EIGEN Throughput sit
(CPU cycles)         banks         Actual    Normalized     Actual    Normalized
12                   16 (normal)   0.313     1.00           0.277     1.00
12                   4             0.186     0.59           0.202     0.73
12                   8             0.262     0.84           0.255     0.92
12                   32            0.354     1.13           0.287     1.03
12                   64            0.370     1.18           0.295     1.07
1                    16            0.488     1.56           0.388     1.40
3                    16            0.461     1.47           0.375     1.35
7                    16            0.403     1.29           0.323     1.17
16                   16            0.256     0.82           0.240     0.87
20                   16            0.214     0.68           0.210     0.76

Fig. 5. System performance vs. memory cycle time. [Figure: normalized throughput plotted against memory cycle time (CPU cycles) for ERROR and EIGEN.]

4.4.7 An experiment with the 360/195. A later model in the System 360 line, the Model 195, has a CPU architecture very similar to that of the Model 91, except for some minor fixed point and decimal instruction optimization [7, 8]. However, it has a cache memory, with a cycle time of one CPU cycle. Thus the system in Table II with a cycle time of 1 (repeated as system 6 of Table III) is a good approximation to the Model 195, assuming a cache hit ratio of 100 percent. We conjectured that the CPU features detailed towards the end of Section 4.4.6 would be of little use if the disparity between memory and CPU cycle times were removed. To verify this conjecture, we constructed system 7 of Table III with a memory cycle time of 1 (to simulate the 195).

Table III. Performance of Model 195 based systems.

System                                          ERROR Throughput sit     EIGEN Throughput sit
No.  Description                                Actual    Normalized     Actual    Normalized
6    Normal (with memory cycle time of 1)       0.488     1.00           0.388     1.00
7    Depleted (as in Section 4.4.7)             0.479     0.98           0.382     0.99

Further, the loop mode and multi-access features were removed. No prefetching was done from the target addresses of conditional branches. Finally, the sizes of all execution unit buffers were cut by 50 percent and the size of the instruction buffer by 75 percent. The results in Table III show that the performance of the depleted Model 195 suffered only a marginal degradation: to 0.98 on ERROR and 0.99 on EIGEN. This small decrease in performance suggests that a considerable simplification of the Model 195 CPU can be made with very little effect on performance. This result appears to contradict some of the assertions in [8], including the one that loop mode accounts for some 20 percent of the speed advantage of the Model 195 over the Model 65 on some partial differential equation programs. These results and those of the preceding section lead us to the observation that all the sophisticated features of the Model 91 CPU could not overcome the bottleneck caused by the large memory cycle time, while many of these same features have marginal value in the Model 195 with the introduction of the cache.

4.5 Limitations of the Case Study
The conclusions drawn about the architecture of the Model 91 in this study suffer from the limitation that only two benchmark programs were used. Further, while EIGEN is a fairly typical linear systems program, ERROR is not so typical. It is, nevertheless, an interesting extreme point in the program space. It should be noted, however, that the aim of the case study was to illustrate the application of the method proposed. Any conclusions drawn about the architecture of the Model 91 are secondary results, and should be viewed with the above limitations in mind. The primary reason for running only two benchmark programs in the case study was the considerable amount of computing entailed in the simulation: simulation runs needed between 60 and 90 minutes of CPU time on a KI-10 CPU. The simulator itself was developed and debugged in less than 6 man-months. In some cases, such computing costs may represent an insignificant fraction of the total design costs for a system. In other cases, however, the simulator may have to be augmented by other performance tools, as discussed in Section 5.


5. Conclusion and Suggestions for Further Research

We have described a method for modeling systems to evaluate their performance by simulation, and two alternative techniques for generating control streams to exercise such simulators. The case study and the experiments conducted show that the technique is a viable one and that a number of issues of performance evaluation for design optimization can be examined by this technique. The limited number of experiments conducted in the case study indicates that the major bottlenecks in the Model 91 are the memory unit and the fixed point unit, with the latter causing considerable delays in the operand address computation for instructions that use indexing. The loop mode feature seems to have a smaller impact on performance than even a minor change in the architecture of the fixed point unit. Further, it appears that many of the sophisticated pipelining and buffering techniques implemented in the architecture of the Model 91 are of little value when high-speed memory is used, as in the Model 195.

The method described in this paper is completely general and has a wide range of applicability; none of the aspects of the method, as described in Sections 2 and 3, were driven by the particular system being examined. There is, however, a wide range of choices for the level of detail at which the model simulates the system. Certainly the level chosen in this study is higher than the logic gate level, where much detail is irrelevant to performance evaluation. However, the complexity of the simulator at even this level suggests that still higher levels may have to be examined. Further research should investigate the possibility of combining models of different levels of accuracy and complexity to yield useful and accurate performance information in a cost-effective manner. Typically this research would involve the collection of appropriate statistics from the simulator, which would be used to derive analytic functions describing parts of the system and their interactions with programs. Parameterization of these lumped models and of the systems and programs being modeled, and derivation of the laws of interaction of these parameters, would be the next step. By extrapolation of the analytic functions, new systems and new types of programs could be evaluated and conclusions drawn. Their validity can be assured by returning to deterministic simulation for selected cases to verify interesting conclusions and to adjust the fit of the analytic model in important regions.

In our case study, we did not use the method of synthetic control stream generation. This is a very interesting area for further work, related to the question of parametric description of programs. The best simple set of parameters needed to describe programs and generate test programs has to be identified. Once this is done, a test environment in which both system and program parameters can be varied will be available to simulate an enormous range of possibilities.

Acknowledgments. We are grateful to one of the referees, whose careful reading and helpful suggestions led to an improved version of the original manuscript.

Received November 1976; revised April 1978

References

1. Anderson, D.W., Sparacio, F.J., and Tomasulo, R.M. The IBM System 360/Model 91: Machine philosophy and instruction handling. IBM J. Res. and Develop. 11 (Jan. 1967), 8-24.
2. Anderson, S.F., Earle, J.G., Goldschmidt, R.E., and Powers, D.M. The IBM System 360/Model 91: Floating-point execution unit. IBM J. Res. and Develop. 11 (Jan. 1967), 34-53.
3. Ballance, R.S., Cocke, J.A., and Kolsky, H.G. The lookahead unit. In Planning a Computer System, McGraw-Hill, New York, 1962.
4. Boland, L.T., Granito, G.D., Marcotte, A.V., Messina, B.V., and Smith, J.W. The IBM System 360/Model 91: Storage system. IBM J. Res. and Develop. 11 (Jan. 1967), 54-79.
5. Gregory, R.T., and Karney, D.L. A Collection of Matrices for Testing Computational Algorithms. Wiley-Interscience, New York, 1969.
6. Kumar, B. Performance evaluation of a highly concurrent computer by deterministic simulation. M.S. Th., Rep. R-717, Coordinated Sci. Lab., University of Illinois, Urbana, Ill., Feb. 1976.
7. McLaughlin, R.A. The IBM 360/195. Datamation 15, 10 (Oct. 1969), 889-895.
8. Murphey, J.O., and Wade, R.M. The IBM 360/195. Datamation 16, 4 (April 1970), 72-79.
9. Smith, B.T., et al. Matrix Eigensystem Routines--EISPACK Guide. Springer-Verlag, 1976.
10. Tomasulo, R.M. An efficient algorithm for exploiting multiple arithmetic units. IBM J. Res. and Develop. 11 (Jan. 1967), 25-33.
11. Tjaden, G.S., and Flynn, M.J. Detection and parallel execution of independent instructions. IEEE Trans. Comptrs. C-19 (Oct. 1970), 889-895.
