An Instruction Fetch Unit for a Graph Reduction Machine


William E. Hostmann
Hewlett Packard, Wilsonville, OR 97070

Shreekant S. Thakkar
Oregon Graduate Center, Beaverton, OR 97006

Abstract

The G-machine provides architectural support for the evaluation of functional programming languages by graph reduction. This paper describes an instruction fetch unit for such an architecture that provides a high throughput of instructions, low latency and adequate elasticity in the instruction pipeline. This performance is achieved by a hybrid instruction set and a decoupled RISC architecture. The hybrid instruction set consists of complex instructions that reflect the abstract architecture and simple instructions that reflect the hardware implementation. The instruction fetch unit performs translation from complex instructions to sequences of simple instructions which can be executed rapidly. A suitable mix of techniques, including a cache, buffers and the translation scheme, provides the memory bandwidth required to feed a RISC execution unit. The simulation results identify the performance gains, maximum throughput and minimum latency achieved by the various techniques. The results achieved here are in general applicable to von Neumann architectures.

1. Introduction

Conventional computer languages and conventional computer architectures have grown up side-by-side, and are well suited to one another. There has been no similar development of computer architectures for evaluating functional language programs. Attempts to bridge this gap with software have produced functional language programs that use a computer's resources inefficiently, and fail to realize the level of performance that should be possible for an application.

A research project at Oregon Graduate Center is investigating a new computer architecture, one based on a well thought-out abstract model for evaluating functional language programs. The abstract model was defined by Johnsson and Augustsson [2, 14] as an evaluation model for a compiler for a dialect of ML (a functional programming language) called Lazy ML (LML). The abstract model represents an expression as a graph, and through successive transformations, or reductions, modifies the graph until its form is that of a fully evaluated result. Hence the model is called graph reduction, and the architecture is called the G-machine (Graph-reduction machine). The G-machine is a pipelined von Neumann architecture that provides support for evaluation of functional language programs by programmed graph reduction, which is specified by a sequence of instructions derived by compiling an applicative expression. This is different from combinator reduction [34], where control is derived dynamically from the expression graph.

This paper describes a proposed instruction fetch unit in the instruction pipeline of the G-machine. The novelty of the instruction fetch unit is that it uses a combination of several different techniques from conventional architectures to increase the throughput of instructions fetched and executed. Hardware translation of G-machine (complex) instructions to simple instructions is done to provide fast execution by a RISC†-type execution unit. In this way, the architecture supports a hybrid instruction set which is a mixture of both complex and simple instructions. Explicit prefetching of alternate instructions for a multi-way control transfer instruction is investigated on the premise that such instructions occur frequently in programs written in LML. This is intended to reduce latency in restarting the pipeline after a control transfer has occurred. Control transfer instructions are thus eliminated from the instruction pipeline by early partial decoding of the instruction stream. Sequential prefetching (i.e., prefetch always) and implicit prefetching (i.e., widening bandwidth) techniques are examined to study their effect on the throughput of G-code instructions. Buffering techniques are used to maintain regularity between the number of instructions produced and instructions consumed. First, pipeline registers are used to buffer simple instructions and literals to the execution unit. Second, an instruction cache with explicit prefetching is used for buffering complex instructions from the instruction store. Finally, implicit prefetching is used for fetching words from the instruction store, which can contain more than one instruction.

A simulation of such an instruction fetch unit was performed using a statistical model of instruction type usage based on observations in an earlier functional simulation of the G-machine. The simulation studied the throughput, latency and elasticity across the various stages of the instruction pipeline. Different instruction fetch and prefetch strategies were simulated to investigate instruction throughput and pipeline latency. The sizes of the inter-stage instruction buffers were varied to examine which configuration provides the best elasticity between different stages in the instruction pipeline.

† Reduced Instruction Set Computer.


2. The G-machine Architecture

The abstract model and the G-machine architecture are described in detail by Johnsson [13] and Kieburtz [19] respectively. Kieburtz describes how the G-machine supports the functional programming language LML. A brief description of the current architecture is presented here.

The G-machine processor is realized as a loosely-coupled coprocessor which is attached to a conventional microprocessor (figure 1). It will be treated by the host processor as an asynchronous I/O device. The host processor will handle the memory management function for the G-machine, as well as running the operating system and other utilities (editors, compiler, etc.). The G-machine processor has a 32-bit architecture with a 34-bit multiplexed address and data bus.

Figure 1: The G-machine and its host processor

2.1. The G-machine Processor

The G-machine is a programmable graph-reduction engine that can perform either applicative-order (innermost) or normal-order (outermost) evaluations. It employs a tagged, dynamically allocatable, list-structured graph store (G) to store an expression graph undergoing reduction, and a byte-addressable instruction store (IS). The processor consists of an instruction fetch unit (IFU) that fetches G-code from the instruction store, translates it to simple instructions, and delivers a sequence of simple instructions to the execution unit (E). A separate data and address bus (G-bus) connects the graph store with a pointer stack (P) that holds graph pointers, an arithmetic-logical unit (ALU) which also has an associated register (value) stack (V), and some special registers. The top segment of the pointer stack is supported by hardware registers in the G-machine processor, and some fixed number of cells adjacent to the stack top can be directly accessed by stack operation instructions. The remainder of the pointer stack overflows into a fast memory.

G-code [14] is a predominantly zero-address machine code. When instructions do have operands they are either control addresses in control-transfer instructions, indices for bit-shift instructions, literal data, pointers to constants stored in graph store, or indices relative to the top of the pointer or value stack. Operation codes are represented in one byte. The mean instruction length observed from compiled code is about 2 bytes. Although this is a simplistic measure, it suggests that the code efficiency of the G-machine is probably somewhat better than that of most current generation, general-purpose von Neumann architectures.

The choice of instruction format (variable or fixed) affects instruction bandwidth, program size and decoding logic complexity. A variable format helps to pack more instructions per word, increasing instruction fetch bandwidth and making programs compact. The latter is less important now that memory costs are less significant. A variable instruction format has the disadvantage of making the instruction decoding logic complex. Therefore a compromise was made and the variation was limited.

2.1.1. Organization

The G-machine processor will operate as a slave to a host processor which shares access to its instruction store and to the graph store (figure 1). The host processor will be responsible for sending the G-processor an initialization signal, for loading the instruction and graph stores, and for signaling the G-processor to begin evaluation. During an evaluation, the G-processor can signal its host, interrupting it to request service. To make a service request the G-processor deposits the address of a node in graph store onto the G-bus, from which it will be read by the host processor. Details of the service request are communicated as a graph in graph store, whose root is pointed to by the address deposited on the G-bus when the request is signaled. This arrangement allows the G-processor to run free of interrupts. On initialization and following a signal of a service request, the G-processor enters a wait state, awaiting a continuation signal from its host. Access to the G-bus is arbitrated by fixed priorities that give the G-processor preference over the host.

The G-machine is designed to provide hardware support for the following aspects of evaluation [19]: graph traversal; instruction fetch; context switching; and dynamic list-structured memory.

2.1.2. Processor Design

The G-processor (figure 2) internal data/address bus (G-bus) connects the major functional units of the processor with one another and to the graph store. The ALU implements integer addition, add-with-carry, and complementation on signed 32-bit data. It also provides shift operations, byte insertion, and constant zero. There are condition codes for zero, negative, carry generation and overflow generation. The execution unit is responsible for dispatch and distribution of control signals extracted from fields of simple instructions. Its decoding function is not much more complicated than that of the Berkeley RISC architecture [25]. Unlike the Berkeley RISC, however, the G-processor is not completely synchronous, and simple instruction dispatch is subject to the availability of the resources required by the instruction. In particular, this allows the processor to have a much shorter instruction cycle time, not tied to the speed of the memory. Instruction dispatch (issue) may await the availability of graph store, of the next simple instruction or literal from the G-code stream, or of completion of an ALU operation.

Figure 2: The G-processor

Thus, the use of an asynchronous pipeline accommodates alternative hardware implementation strategies, and allows a greater degree of overlap in simple instruction execution. The D-register is used for diagnostic purposes; the stack can be explicitly dumped and its contents examined using this register. The A-register holds a graph store address for a READ or WRITE operation, and the T-register holds current values of the relevant data tags (not shown).
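The resource-gated dispatch rule can be pictured as a guard over availability flags. The following is a minimal sketch in C, assuming hypothetical flag and structure names; it is an illustration of the issue policy described above, not the actual G-processor control logic.

    /* Sketch of resource-gated dispatch: issue only when every resource
     * the instruction needs is available; otherwise stall this cycle.
     * The cycle time is thus set by the execution unit, not by memory. */
    #include <stdbool.h>

    typedef struct {
        bool needs_graph_store;  /* instruction touches graph store (G) */
        bool needs_alu;          /* instruction uses the ALU            */
    } SimpleInstr;

    typedef struct {
        bool graph_store_free;   /* asynchronous memory not busy         */
        bool alu_done;           /* previous ALU operation has completed */
        bool instr_available;    /* next simple instruction/literal ready */
    } Resources;

    bool try_dispatch(const SimpleInstr *i, const Resources *r)
    {
        if (!r->instr_available) return false;
        if (i->needs_graph_store && !r->graph_store_free) return false;
        if (i->needs_alu && !r->alu_done) return false;
        return true;             /* issue the instruction this cycle */
    }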

3. The Instruction Fetch Unit

3.1. Design Rationale

To maintain a steady flow of instructions in a pipelined architecture, two problems have to be overcome in the instruction fetch and decode stage of the pipeline. First, the memory access time may be so long that a request by the instruction fetch unit for another instruction will not be satisfied soon enough to maintain the flow through the pipeline. Second, a change in the expected instruction sequence, caused by a control transfer instruction, may invalidate the contents of the pipeline. The second problem is related to the first one, since the penalty for a control transfer instruction will depend on the time to fetch the target instruction from memory.

The first problem, of instruction accessing, is helped by the fact that most instructions are obeyed sequentially and the memory word size is normally such that a word fetched from the store can contain several instructions. Thus, store requests can be made before the corresponding instructions are required, and the replies buffered until they are needed for execution. This implicit prefetching technique is used in many high performance processors to reduce the severity of the von Neumann bottleneck† [3]. Hence, such a scheme is used in this design.

In this architecture, the first problem of instruction accessing is alleviated significantly by having the IFU translate (i.e., expand) complex G-code instructions into a sequence of simple instructions, which do not require access to the instruction store. The translation takes advantage of the fact that sequences of commonly occurring simple instructions can be identified at compile time and can be substituted by a complex instruction, though this may not be an easy thing to do in general. Thus, advantage can be taken of the reduction in the required bandwidth for the instruction store and cache. This also allows the prefetch of another complex instruction while the current one is being translated. Hence, by widening (i.e., by implicitly prefetching) and reducing (i.e., by translating) the required instruction store bandwidth, the severity of the von Neumann bottleneck has been reduced. As will be seen later, the required bandwidth is further reduced by the inclusion of an instruction cache in the pipeline. The other major limiting factor in a von Neumann architecture is the Flynn bottleneck‡ [8]. In our design it exists in the translation stage of the pipeline instead of the fetch stage. This is not really a bottleneck, since the execution unit can execute at most one instruction per cycle.

The second problem, of performance degradation caused by control transfer instructions in the instruction stream, can be reduced in several ways. Some of the existing approaches are: loop buffers [12], multiple instruction streams [1], delayed branch [25, 28], taken/not-taken switch [9, 26] and branch target buffers [23, 24, 31]. A variation of the multiple instruction stream and delayed branch strategies was adopted for this design.

† The von Neumann bottleneck refers to the contention on the communication path between the processor and the memory.
‡ Flynn's observation was that there is always some point in the instruction fetch/decode path through which the instructions pass at the maximum rate of one per clock cycle.
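To make the implicit prefetching concrete, the sketch below models an instruction buffer that is topped up with the next sequential word before its contents are needed. The word size, buffer layout and length decoder are illustrative assumptions, not the G-machine's actual encoding.

    /* Sketch of implicit prefetching: fetch whole memory words ahead of
     * need and carve variable-length instructions out of a small buffer. */
    #include <stdint.h>
    #include <string.h>

    #define WORD_BYTES 4                 /* assumed instruction-store word size */

    typedef struct {
        uint8_t  bytes[2 * WORD_BYTES];  /* current word + prefetched word */
        int      fill;                   /* valid bytes in the buffer      */
        uint32_t fetch_addr;             /* next word-aligned fetch address */
    } InstrBuffer;

    extern void fetch_word(uint32_t addr, uint8_t out[WORD_BYTES]);
    extern int  instr_length(uint8_t opcode);  /* from the 4-bit type field */

    /* Return the next instruction's opcode byte, keeping at least one
     * word buffered ahead of need (assumes lengths <= WORD_BYTES here). */
    uint8_t next_instruction(InstrBuffer *b)
    {
        while (b->fill < WORD_BYTES) {
            fetch_word(b->fetch_addr, b->bytes + b->fill);
            b->fetch_addr += WORD_BYTES;
            b->fill += WORD_BYTES;
        }
        uint8_t op  = b->bytes[0];
        int     len = instr_length(op);  /* variable-length G-code */
        memmove(b->bytes, b->bytes + len, (size_t)(b->fill - len));
        b->fill -= len;
        return op;
    }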


3.1.1. Translation Scheme

A hardware translation scheme, similar to the one on Manchester University's MU6-G [6, 36], is used in this design to translate complex G-machine instructions into a sequence of simple RISC-like instructions. A similar scheme was first used on Manchester University's Atlas [22], where the complex instructions, called extracodes, were translated into optimized subroutines of simple instructions held in the main memory. On the MU6-G, the subroutines were held in a high speed memory within the instruction fetch unit. The MU6-G scheme has the advantages of reducing instruction fetch bandwidth and of generating the translated sequences rapidly, while the Atlas scheme has more flexibility. For example, on the MU6-G, instructions such as Jump to Subroutine are treated as complex instructions which are translated into a sequence of simple instructions, called sub-sequences, by the instruction fetch unit. Similar schemes have also been used recently in Fairchild's CLIPPER microprocessor [29] and Hewlett-Packard's Spectrum processor [37]. CLIPPER's scheme is similar to the one on the MU6-G, and Spectrum's is similar to Atlas's.

The advantages of this translation scheme, besides reducing the instruction fetch bandwidth, are:

(i) Simple instructions can be implemented to execute rapidly. The expense of executing a sub-sequence is only incurred when the corresponding complex instruction is used.

(ii) Minimum control circuitry is required. Hence the area saved can be used for larger on-chip or on-board processor state. Much smaller memory space is required to hold these sub-sequences when compared with that required for microcode.

(iii) Existing complex instructions can be altered or improved without changing the physical implementation.

(iv) New complex instructions can be defined, as sequences of simple instructions, and added to the instruction repertoire.

(v) The sub-sequences are highly optimized.

A further advantage is that the programmer's model of the architecture can remain intact when changing the hardware implementation.

Advantages similar to (i) and (ii) have recently been claimed for new RISC architectures such as the RISC I and RISC II microprocessors [25], the IBM 801 minicomputer [28], and the MIPS microprocessor [10]. The difference is that these processors rely on a high-bandwidth instruction fetch unit and an instruction cache to provide a stream of simple instructions. The translation from complex to simple instructions is a part of the compilation process for a RISC architecture. Some supercomputers (e.g. CRAY) and high performance mainframes (e.g. the Manchester machines) have also used the RISC philosophy for the past two decades and claim the same advantages. Microprogrammed logic was always slower than hardwired logic, and no regular devices such as fast programmable logic arrays existed to implement the large amount of random circuitry. Hence the only option was to implement a simple instruction set using hardwired logic which could be executed rapidly. A recent study [5] concludes that the performance gains of RISC processors [27] are not owing to a reduced instruction set per se, but also to implementation features and compiler technology.

Our design takes advantage of both RISC and CISC philosophies, namely the fast execution of RISC instructions with lower control logic overhead, and CISC's support of an abstraction of the architecture (e.g., support for operating system, byte string manipulation, floating point and vector computation). In the G-machine, the IFU is responsible for translating complex instructions into a sequence of simple instructions. In so doing, it also linearises the code sequence; that is, it removes unconditional jumps and substitutes the correct sequence. Hence, the IFU partially decodes the instruction stream and interprets control transfer instructions. Similarly, Wilkes [35] and Katevenis [15] indicate that a significant performance increase can be achieved for RISC architectures by keeping the control transfer instructions out of the RISC instruction pipeline. The execution unit receives instructions from the IFU and decodes these into control signals for the ALU, the pointer stack and the value stack. If a conditional control transfer instruction occurs, then the execution unit will notify the IFU of the control transfer.

3.1.2. Buffering

The high incidence of rapidly executed stack instructions in the G-machine imposes a severe stress on the instruction fetch unit. Hence, two forms of inter-stage buffering are provided to buffer instructions in this pipelined architecture. This is intended to provide high throughput and extremely short latency. The inter-stage buffers consist of a cache and an instruction queue.

The cache is used for buffering complex G-code instructions between the instruction store and the instruction fetch unit. The purpose of a cache in the instruction pipeline is to increase bandwidth through the locality of references (temporal locality) and instruction reuse characteristics (spatial locality). Thus the cache will give good performance for a rapid burst of stack instructions, which can occur in compiled LML code because of the locality of instruction references. For performance evaluation a fully associative cache organization with a modified FIFO replacement strategy was used. This organization, when used with a random replacement strategy, has been shown to give the best performance when used for caching instructions [33]. In our study, the replacement strategy does not play a critical role since we are not specifically modeling loops, and this was borne out by our simulation. However, for implementation, a direct-mapped cache of a suitable size may be easier to build than a fully associative cache, without significant loss of performance.

Instructions are produced by the instruction fetch unit at a rate of one per cycle. Ideally, the simple instructions should be executed one per cycle, but this may not be possible. For example, memory-bound instructions such as LOAD/STORE may take more than one cycle since memory access is asynchronous [25]. If instructions can be executed one per cycle, then the sub-sequence control store acts as the queue and only one register is required to buffer instructions between the IFU and the execution unit. Otherwise further buffering is needed between the IFU and the execution unit. In this design the buffering is provided as pipelined registers structured as an instruction queue. The instructions are consumed by the execution unit directly from the instruction queue, while the literals have to be moved from the literals queue via the G-bus. A custom designed queue [20] can shift-in and shift-out a value every clock cycle; it can also be flushed and a new value shifted to the head of the queue in one clock cycle [17]. This feature is intended to reduce latency after a control transfer instruction occurs.
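The queue's behavior reduces to three operations. Below is a minimal C sketch, a software ring-buffer model of the custom hardware queue rather than a circuit description; the depth of eight follows the simulation results discussed in section 4.4.

    /* Sketch of the inter-stage instruction queue: shift-in and shift-out
     * every cycle, plus a single-cycle flush-and-restart on a taken branch. */
    #define QLEN 8

    typedef struct {
        unsigned instr[QLEN];
        int head, count;
    } InstrQueue;

    int enqueue(InstrQueue *q, unsigned i)      /* IFU side: shift-in */
    {
        if (q->count == QLEN) return 0;         /* queue full, IFU stalls */
        q->instr[(q->head + q->count++) % QLEN] = i;
        return 1;
    }

    int dequeue(InstrQueue *q, unsigned *i)     /* E-unit side: shift-out */
    {
        if (q->count == 0) return 0;            /* queue empty, E stalls */
        *i = q->instr[q->head];
        q->head = (q->head + 1) % QLEN;
        q->count--;
        return 1;
    }

    /* On a taken branch: discard all translated-but-unexecuted
     * instructions and place the new stream's first instruction
     * at the head, all in one modeled cycle. */
    void flush_and_restart(InstrQueue *q, unsigned first)
    {
        q->head = 0;
        q->count = 1;
        q->instr[0] = first;
    }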

3.1.3. Branch Prediction

A normal pipeline suffers a branch penalty because for a conditional control transfer instruction it must make a choice; that is, the instruction fetch unit must fetch either the next sequential instruction or the branch target.


A brute force approach to this problem is to replicate the initial stages of the pipeline (i.e., multiple instruction streams) so that both the succeeding instruction and the potential branch target can be fetched, decoded and processed. This approach causes the following problems:

(i) The branch target cannot be fetched until the address is determined, which may require a computation.

(ii) Successive or additional conditional control transfer instructions may enter the initial stage of the pipe before the branch target for the first instruction has been resolved.

(iii) The cost of replicating significant parts of the pipeline can be substantial.

(iv) The control logic of such a scheme could be rather complex.

A decision was made to use multiple instruction streams because some of the above problems do not apply to this architecture. First, the G-machine does not support addressing modes which require address computation at execution time. The address of the alternate stream is determined by partial decoding of control transfer instructions in the IFU. Second, a VLSI implementation is envisioned, so the cost of replication is not as significant as it can be with off-the-shelf components. Third, the control logic can be partitioned in such a way that it can be implemented as a small finite state machine which is incorporated into the instruction fetch stage of the pipeline. Finally, and most important, the inclusion of an instruction translation stage in the pipeline makes extra bandwidth available between the instruction cache and fetch stages. This allows fetching of alternate instruction stream(s) when a control transfer instruction is encountered. Two different proposals for multiple instruction streams have been evaluated.

In the first proposal [18], the instruction fetch unit used the brute force approach of multiple instruction streams. In that design, up to four different instruction streams could be prefetched. This was designed to support up to four-way case-switch instructions. Each instruction stream was envisioned as a byte-wide queue. Fetches were handled in round-robin sequence for each instruction stream; only one stream was decoded. The instructions were fetched a byte per cycle from the instruction store. A translation scheme similar to the one proposed here was also used in that alternative design of the IFU. The design relied on the premise that a large percentage of complex instructions are multi-way conditional control transfer instructions. The simulation results [7] showed poor performance. That design only tried to reduce latency delays. The bandwidth across the different stages in the pipeline was mismatched, which affected the latency and, more significantly, the throughput. The cost and complexity of the control logic also resurfaced.

In the second proposal, presented here, a simpler technique is used. Only enough logic is replicated to prefetch the branch target instruction into the cache. That is, when a branch is detected, the address of the possible target instruction is loaded into a special register and the instruction at that address is fetched into the cache. Since an instruction word is fetched into the cache, succeeding instructions may also be prefetched. Thus, if the branch is taken, the target is loaded immediately from the cache into the instruction fetch stage of the pipeline without any additional delay for instruction fetch. In the present design, four such prefetches can be accumulated along the main sequence; but since the secondary (prefetched) sequences are not decoded, no additional prefetches are generated, unlike the alternative IFU design. The only extra hardware required for this design is four registers for holding the addresses of possible target instructions. Once a stream is identified, the appropriate register is loaded into the program counter and the next instruction is fetched from that address.

The delayed branch mechanism has been used effectively in the RISC microprocessor [25] and the IBM 801 [28]. The instruction sequences for those architectures are scheduled by the compilers and therefore can take advantage of the delay in the resolution of a branch instruction. Instructions which are not dependent on the outcome of the branch are inserted in the sequence after the branch instruction. The potential speedup is not realized if such instructions cannot be scheduled. Advantage is taken of the delayed branch in this design in the translated sub-sequences of simple instructions which are held in the sub-sequence control store. These sub-sequences are hand optimized, and only once, unlike the RISC microprocessor and IBM 801. Further advantage can also be taken in scheduling the execution of CISC instructions, though this is perhaps less straightforward than scheduling RISC instructions, since the hardware is less visible to the compiler.

3.2. Functional Organization

The IFU fetches G-code from the instruction store and translates instructions into sequences of simple instructions. Several G-code instructions, such as the stack manipulation operations, translate directly into simple instructions, whereas others, such as EVAL (evaluate a graph), translate into rather complex sequences which themselves contain control flow instructions. The current organization of the IFU is shown in figure 3.

The program counter (Control) generates addresses to the instruction store (IS). These requests are intercepted by the instruction cache and only passed through to the instruction store if there is a cache miss. If there is a cache hit then the required instruction is sent to the IFU. Otherwise the instruction store is accessed. The fetched instructions are then translated into a sequence of simple instructions by a look-up in the sub-sequence control store (SSCS). Simultaneously, the four most significant bits of the opcode are decoded to determine the type and length of the operand. G-code can contain literal data, G-memory addresses, and jump instructions specifying addresses in instruction store. These data are all removed by the IFU. Literals and G-memory address constants are routed to a literals queue (LQ), from whence they can subsequently be moved to the G-bus. Jump addresses are interpreted directly by the IFU to initialize its target address registers (TARs). Thus, unconditional jumps are never translated into simple instructions at all; succeeding G-code instructions are fetched from the jump address. Conditional jump and case-switch instructions are also partially interpreted by the IFU.
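The translation step itself is essentially a table look-up. The following C sketch models one fetched G-code instruction being expanded through the SSCS; the table contents, opcode width and function names are illustrative, not the real G-code encoding.

    /* Sketch of the IFU translation step: a complex G-code opcode indexes
     * the sub-sequence control store (SSCS) and the stored sub-sequence
     * of simple instructions is streamed to the execution unit. */
    #include <stdint.h>

    #define MAX_SUBSEQ 25        /* max simple instructions per complex one */

    typedef struct {
        uint32_t simple[MAX_SUBSEQ];
        int      len;
    } SubSeq;

    static SubSeq sscs[256];     /* indexed by the one-byte G-code opcode */

    extern void emit_simple(uint32_t s);     /* to the instruction queue */
    extern void emit_literal(uint32_t lit);  /* to the literals queue    */

    /* Translate one fetched G-code instruction. Literal operands are
     * stripped and routed to the literals queue; the opcode's stored
     * sub-sequence is emitted one simple instruction per cycle. */
    void translate(uint8_t opcode, int has_literal, uint32_t literal)
    {
        if (has_literal)
            emit_literal(literal);           /* moved later via the G-bus */

        const SubSeq *ss = &sscs[opcode];
        for (int i = 0; i < ss->len; i++)
            emit_simple(ss->simple[i]);
    }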


Figure 3: The Instruction Fetch Unit (dashed elements indicate future enhancements)

A conditional jump specifies a possible alternative instruction stream to the one currently being fetched. The IFU maintains multiple branch target address buffers (four in the current design), and a new TAR is loaded with the target address of the conditional jump from which to begin fetching a code stream. The IFU uses nondeterministic prediction of the path to be taken at the jump, and fetches the alternate instruction into the instruction cache. Case-switch instructions may specify wider than two-way branch alternatives. The multiple target address buffers scheme will accommodate up to four-way case switches.

Each G-code instruction expands into a sequence of simple instructions. The expansion ratio is highly variable over different G-code sequences, but seems to have a dynamic average between five and seven, as observed in simulations. Thus we expect there to be some excess memory bandwidth over that required to support a single stream of G-code instructions. Only a single level of nondeterministic jump prediction is supported. When a second conditional jump instruction is encountered in translating a G-code stream, translation stops until execution of the already-translated simple instruction stream resolves the outstanding jump.

Conditional jump instructions are translated into simple instructions that specify the index of a target address register (figure 3). When a conditional jump is taken, the IFU must flush the queue of translated but not yet executed simple instructions, for they were produced on the assumption that the jump would not be taken. It must abort translation of the G-code instruction currently in the translation unit and restart the pipeline by fetching the next instruction from the instruction cache. The appropriate TAR is selected and its contents become the next program counter. A cache hit is ensured by the prefetching done earlier. Most of these operations can be done simultaneously, in a single internal clock cycle.

When a conditional jump is not taken, the action required of the IFU is much simpler; it needs only to free all the alternative target address buffers. No interruption of the simple instruction stream occurs. Although this activity sounds complicated, it is in fact quite straightforward to support in hardware. It can have a considerable advantage if programs exhibit loops with conditional exit, as happens when tail-recursive function definitions are compiled. The translation unit unwinds the code loop as if it were straight-line code, punctuated occasionally by conditional jump instructions. The only taken jump will be the one that finally terminates execution of the loop. A large instruction cache will be able to hold entire loops†, thus giving high throughput with low latency for most of the time when non-loop control transfer is encountered.

† Some measurements referenced by Katevenis show, for a particular imperative language and architecture, that 55% of branches targeted within a distance of 128 bytes, and 93% of them within a 16K byte range.
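The taken and not-taken cases just described can be summarized in a short sketch. The C below models TAR allocation with target prefetch and the two resolution paths; the register count matches the design, but the data structures and function names are assumptions for illustration.

    /* Sketch of target address register (TAR) handling for conditional
     * jumps: allocate a TAR and prefetch the alternate target into the
     * cache; on a taken jump, flush and reload the PC from the TAR;
     * on a not-taken jump, simply free the TAR. */
    #include <stdint.h>

    #define NUM_TARS 4                /* supports up to 4-way case-switch */

    static uint32_t tar[NUM_TARS];
    static int      tar_used[NUM_TARS];
    static uint32_t pc;

    extern void cache_prefetch(uint32_t addr);
    extern void flush_instruction_queue(void);

    int note_conditional_jump(uint32_t target)   /* during translation */
    {
        for (int t = 0; t < NUM_TARS; t++)
            if (!tar_used[t]) {
                tar_used[t] = 1;
                tar[t] = target;
                cache_prefetch(target);  /* ensures a cache hit if taken */
                return t;                /* index carried by a simple instr */
            }
        return -1;                       /* all TARs busy: translation stalls */
    }

    void jump_resolved(int t, int taken)         /* from the execution unit */
    {
        if (taken) {
            flush_instruction_queue();   /* discard wrong-path instructions */
            pc = tar[t];                 /* restart fetch from the target   */
        }
        tar_used[t] = 0;                 /* free the register either way    */
    }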

4. The Simulation

The major components in the simulation model are the instruction pipeline (instruction store, instruction cache, instruction fetch unit, instruction and literals queues) and the execution unit. Several configurations were modeled to determine which provides the highest throughput, the lowest latency and maximum elasticity. The hardware features examined during the simulations were: instruction queue length, cache size, support for branch target instruction prefetching, and the instruction buffer. These features were simulated using different values for the following parameters: instruction type distribution, execution rate mix, instruction expansion ratio, locality of instruction reference, and the size of word fetched from memory (in bytes). These parameters are described in detail in the following sections.

4.1. The Simulation Models

Three basic configurations of the instruction pipeline were simulated.

(1) No prefetching into the cache of the possible branch targets of conditional jump and case-switch instructions. Different cache sizes (byte addressable) and instruction queue lengths were simulated to study the load balancing across the instruction pipeline.

(2) Explicit prefetching into the cache of the possible branch targets of conditional jump and case-switch instructions.

(3) Explicit prefetching with abort. The same as (2), except that the prefetching sequence is aborted if the branch is resolved by the execution unit before the sequence is completed.
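The flavor of instruction cache these configurations exercise can be sketched as a simulation-level model. The C below models a fully associative cache with a FIFO replacement pointer and hit/miss counters; the line size, tag derivation and linear lookup are simplifications for clarity, not a hardware design.

    /* Sketch of the simulated instruction cache: fully associative with
     * a FIFO replacement pointer, counting hits and misses. */
    #include <stdint.h>

    #define LINES 128                 /* one of the simulated cache sizes */

    typedef struct {
        uint32_t tag[LINES];
        int      valid[LINES];
        int      next;                /* FIFO replacement pointer */
        long     hits, misses;
    } ICache;

    extern void fetch_from_instruction_store(uint32_t addr);

    void cache_access(ICache *c, uint32_t addr)
    {
        uint32_t tag = addr / 4;      /* one word per line in this toy model */
        for (int i = 0; i < LINES; i++)
            if (c->valid[i] && c->tag[i] == tag) {
                c->hits++;
                return;               /* hit: instruction goes to the IFU */
            }
        c->misses++;                  /* miss: go to the instruction store */
        fetch_from_instruction_store(addr);
        c->tag[c->next]   = tag;      /* FIFO victim selection */
        c->valid[c->next] = 1;
        c->next = (c->next + 1) % LINES;
    }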

Three different versions of the simulations were run using the above configurations. The first version fetched instructions byte-at-a-time from the memory hierarchy (cache or instruction store). The second version performed implicit prefetching by fetching an instruction word into the instruction buffer. The third version is similar to the second, but performs a prefetch of the successive instruction word for every instruction fetch. This technique is known to increase throughput [32].

Each of the configurations was modeled in a multitasking facility for the C programming language [4]. The primitives of the facility provide for the creation and destruction of tasks according to the FORK-JOIN facility of UNIX. It caters to task synchronization and communication by providing locks, semaphores, queues, and the ability to suspend a task for variable length intervals. The instruction fetch unit, the memories (instruction cache and store), and the execution unit are each modeled as separate tasks communicating with each other through semaphores and queues. An animation facility was developed to show the interaction between the various tasks. This proved to be very useful for debugging and validation of our simulation models.

4.2. Performance Measures

The main performance parameters [21] of a pipeline are throughput, latency and elasticity. Throughput is the rate at which items are processed to completion. In these simulations the throughput was measured as:

    number of G-code instructions executed / simulated time

Latency is the time for one item to traverse the entire pipeline when it is otherwise empty. During the simulations the average latency of the pipeline was measured as:

    total time the instruction pipeline is idle due to branches / total number of branches taken

The total time the instruction pipeline is idle is used in the average latency measure instead of the total time the execution unit is idle. The reason is that instructions with literals may be interpreted following a control transfer, and these are never seen by the execution unit. A fixed amount of time is taken through the translation stage and the instruction queue, thus our measure is relatively accurate. Elasticity was measured in terms of the cache size, the queue size and the absence or presence of an instruction buffer. The performance of the instruction fetch unit, as measured by the above parameters, was evaluated using the above configurations.

4.3. Workload

The simulation is a statistical (probabilistic) one. The workload, which is the arrival of different G-code instruction types for execution, is a sequence based on a mapping from a random vector (a sequence of random numbers) to a distribution that represents the instruction type mix. This simulation technique allowed us to explore various strategies without being concerned with implementation details. The references to the different instruction types are distributed as follows:

    Instruction Type Distribution
    short                 40%
    single operand        10%
    short literal         10%
    long literal          10%
    unconditional jump†   10%
    conditional jump†     10%
    m-way case-switch†    10%

The instruction reference pattern is intended to give the simulation a reasonable approximation of a typical LML program execution environment and provide a basis for comparing the relative merits of the different hardware configurations. This distribution is based on the observed distribution of instruction types in an earlier functional simulation of the G-machine [30]. The figure for the case-switch instructions is based on a static count, since the compiler for LML which includes the case-switch statements was not completed at the time of this study. Other variations of instruction workloads, containing a higher percentage of short instructions and case-switch instructions, were also used. The results for these workloads yielded relative performances similar to those for the instruction type distribution given above.

The CISC to RISC instruction translation is based on a uniform probability distribution. A maximum of 25 instructions could be placed on the RISC queue per CISC instruction which requires translation. This approximation was arrived at by a static count of the RISC instructions generated by each CISC instruction type as referenced in the G-machine Programmers Guide [16].

† Katevenis reports that 30% of all instructions in a program produced from an imperative language are branch instructions, and more than 80% of them are unconditional branch instructions [15].
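The workload generator amounts to sampling from a cumulative distribution. The C sketch below maps uniform random numbers through the instruction-type mix above; the enum names are illustrative, and the expansion draw follows the uniform 1-to-25 assumption stated in the text.

    /* Sketch of the statistical workload generator used in such a
     * simulation: a uniform random number is mapped through the
     * cumulative instruction-type distribution of section 4.3. */
    #include <stdlib.h>

    typedef enum { SHORT, SINGLE_OP, SHORT_LIT, LONG_LIT,
                   UNCOND_JUMP, COND_JUMP, CASE_SWITCH } InstrType;

    /* Cumulative distribution: 40% short, then 10% for each other type. */
    static const struct { int cum; InstrType t; } dist[] = {
        { 40,  SHORT },       { 50, SINGLE_OP },   { 60, SHORT_LIT },
        { 70,  LONG_LIT },    { 80, UNCOND_JUMP }, { 90, COND_JUMP },
        { 100, CASE_SWITCH },
    };

    InstrType next_instruction_type(void)
    {
        int r = rand() % 100;         /* uniform in [0, 100) */
        for (int i = 0; ; i++)
            if (r < dist[i].cum)
                return dist[i].t;
    }

    /* Each CISC instruction expands into 1..25 simple instructions,
     * drawn from a uniform distribution as in the simulation. */
    int expansion_count(void)
    {
        return 1 + rand() % 25;
    }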


4.4. Results and Analysis

The purpose of the simulations was to gain an understanding of the relative performances of the different configurations described in section 4.1. Several factors influence the performance of the different configurations, and the following discussion examines these relationships. The different configurations of the instruction pipeline were run for one million cycles. This allowed the simulation to reach steady state behavior and provided enough information for comparing the relative merits of the various configurations. To understand the elasticity across the different stages of the instruction pipeline, different size inter-stage buffers (instruction cache and instruction queue) were modeled.

The instruction pipeline time for a simple instruction is six cycles when it is fetched from the instruction cache and ten cycles when fetched from the instruction store. Obviously, when the pipeline is full, instructions are executed in one or two cycles. If every instruction is executed in one cycle by the execution unit, then the instruction queue only needs to be of length one, since one instruction is produced and consumed per cycle. In our simulation, we have assigned a two cycle execution time for LOAD/STORE instructions. The ratio of LOAD/STORE instructions to single execution cycle instructions is referred to as the execution mix. For the simulations, this mix was varied from 10%:90% (two cycle:one cycle instructions) to 20%:80%†. Results only from the latter mix are shown. The length of the instruction queue was varied in different configurations. A queue length of four to eight increased the throughput for both execution mixes. Thus the execution mix has an effect on the queue length. For the rest of our analysis an instruction queue length of eight is assumed.

The cache size had a significant effect on both the throughput and the latency. For example, the inclusion of a 1024 line cache yielded a 24% increase in throughput in the configurations without any form of prefetching (figure 4). Similarly, there was a large decrease in latency (32%) when using a 1024 line cache. Explicit prefetching of alternate instruction streams when encountering a conditional control transfer instruction has a significant effect on latency but not on throughput. For example, in the configurations with a 128 line cache (figures 4, 5), with prefetching the latency decreases by 45% to 60% while the throughput increases by only 1%. If the execution mix and expansion ratio are varied, then a further increase in throughput may be possible.

A significant decrease in latency was observed in a configuration with a very small cache (16 lines) and with explicit prefetching. The throughput remained unchanged from the configuration without a cache. A significant increase in throughput (34%) is achieved through implicit prefetching of an instruction word into an instruction buffer. An instruction from the buffer can be assembled in one machine cycle. Hence, for a consecutive sequence of instructions, it is likely that the next instruction will be found in the instruction buffer. As expected, sequential prefetching (prefetch always) showed an increase in throughput. This increase was most significant with smaller caches rather than with a large cache (figures 5, 6). The decrease in latency was not as significant as in the configurations using explicit prefetching.

† For compiled imperative language programs, up to 20-30% of instructions have been observed to be of LOAD/STORE type.

5. Conclusion

High throughput and low latency have been achieved in the instruction pipeline by using a correct mix of different strategies. The performance increase in throughput has been clearly identified with the cache, and with the implicit and sequential prefetching techniques. The decrease in latency has been achieved using explicit prefetching, which has been made possible by the instruction translation scheme. The results clearly show that latency has very little effect on the throughput of the proposed pipelined architecture. In our design the choice comes down to either using sequential prefetching with a small cache or using a large cache without any explicit or sequential prefetching. A decision to use either configuration lies in the trade-off between the cost and complexity of the prefetching mechanism and a large instruction cache.

Advantage has been taken of both the RISC and CISC philosophies in this instruction fetch unit to support a functional programming language. New conventional architectures have also adopted similar techniques to support the high-level instructions necessary to support abstractions of the architecture. High performance is now possible along with the ability to support higher level abstractions. The scheme used here allows special instructions to be coded to support a particular abstraction. This makes it possible to create a retargetable RISC processor.

Acknowledgements

The authors would like to thank Professor Kieburtz for many discussions and comments on this work, Barton Schaefer and Mark Foster for their comments, and Dr. Bain for making available his multi-tasking environment.

References

[1] Anderson, D. W., Sparacio, F. J. and Tomasulo, R. M., "The Model 91: Machine Philosophy and Instruction Handling," IBM Journal of Research and Development, vol. 11, January 1967.
[2] Augustsson, L., "A Compiler for Lazy ML," Proc. ACM Symp. on LISP and Functional Programming, Austin, Texas, August 1984, pp. 218-227.
[3] Backus, J., "Can programming be liberated from the von Neumann style? A functional style and its algebra of programs," Communications of the ACM, vol. 21, 8 (August 1978), pp. 613-641.
[4] Bain, W. L., A Multi-tasking Facility for C, April 1985.
[5] Colwell, R. P., Hitchcock, C. Y., III, Jensen, E. D., Sprunt, H. M. B. and Kollar, C. P., "Instruction Sets and Beyond: Computers, Complexity and Controversy," Computer, September 1985.
[6] Edwards, D. B. G., Knowles, A. E. and Woods, J. V., "MU6-G: A new design to achieve mainframe performance from a mini-sized computer," SIGARCH Newsletter (ACM), vol. 8, 6 (June 1980), Proc. ACM SIGARCH Symp. on Computer Architecture.
[7] Farahmand-nia, S., Kuo, S. L. and Rankin, L., Simulation of the G-machine Instruction Fetch and Translation Unit, Dept. of Computer Sc. & Eng., Oregon Graduate Center, May 1985.
[8] Flynn, M. J., "Very High-Speed Computing Systems," Proceedings of the IEEE, vol. 54, 12 (December 1966), pp. 1901-1909.
[9] Hailpern, B. T. and Hitson, B. L., "S-1 Architecture Manual," Tech. Report STAN-CS-79-715, Computer Systems Lab., Stanford University, 1979.
[10] Hennessy, J. L., Jouppi, N., Baskett, F. and Gill, J., "MIPS: A VLSI Processor Architecture," Proc. CMU Conference on VLSI Systems and Computations, Computer Science Press, October 1981.
[11] Hsieh, J. T., Pleszkun, A. R. and Vernon, M. K., "Performance Evaluation of a Pipeline VLSI Architecture Using the Graph Model of Behaviour (GMB)," Proceedings CHDL 85, August 1985.
[12] Ibbett, R. N., The Architecture of High Performance Computers, Macmillan Press, 1982.
[13] Johnsson, T., The G-machine — an abstract architecture for graph-reduction, Dept. of Computer Sciences, Chalmers Univ. of Technology, Gothenburg, 1983.
[14] Johnsson, T., "Efficient compilation of lazy evaluation," Proc. of the 1984 ACM SIGPLAN Conf. on Compiler Construction, June 1984.
[15] Katevenis, M. G. H., "Reduced Instruction Set Computer Architecture for VLSI," Ph.D. Thesis, Report No. UCB/CSD 83/141, EECS, University of California, Berkeley, October 1983.
[16] Kieburtz, R. B., "G-machine Programmers Guide," Internal Document, Dept. of Computer Sc. & Eng., Oregon Graduate Center, August 1984.
[17] Kieburtz, R. B., Private discussion, 1984.
[18] Kieburtz, R. B., "Control of the Instruction Fetch Unit," Internal Document, Dept. of Computer Sc. & Eng., Oregon Graduate Center, April 1985.
[19] Kieburtz, R. B., "The G-machine: a fast, graph-reduction evaluator," Proc. of IFIP Conf. on Functional Prog. Lang. and Computer Arch., Nancy, 1985.
[20] Kuo, S. L. and Sutton, R., Internal Report, December 1985.
[21] Lampson, B. W., McDaniel, G. and Ornstein, S. M., "An Instruction Fetch Unit for a High-performance Personal Computer," IEEE Transactions on Computers, vol. C-33, 8 (August 1984).
[22] Lavington, S. H., "The Manchester Mark 1 and Atlas: A Historical Perspective," Communications of the ACM, vol. 21, 1 (January 1978).
[23] Lee, J. K. F. and Smith, A. J., "Branch Prediction Strategies and Branch Target Buffer Design," IEEE Computer, vol. 17, 1 (1984).
[24] Morris, D. and Ibbett, R. N., The MU5 Computer System, Springer Verlag, 1979.
[25] Patterson, D. A. and Sequin, C., "A VLSI RISC," Computer, vol. 15, 9 (September 1982), pp. 8-21.
[26] Patterson, D. A., Garrison, P., Hill, M., Lioupis, D., Nyberg, C., Sippel, T. and Van Dyke, K., "Architecture of a VLSI Instruction Cache for a RISC," Proc. ACM SIGARCH Symp. on Computer Architecture, June 1983.
[27] Patterson, D., "Reduced Instruction Set Computers," Communications of the ACM, vol. 28, 1 (January 1985).
[28] Radin, G., "The 801 Minicomputer," Proc. SIGARCH/SIGPLAN Symp. on Architectural Support for Programming Languages and Operating Systems, 1982.
[29] Sachs, H. and Hollingsworth, W., "A High Performance 846,000 Transistor UNIX Engine: The Fairchild Clipper," Proc. ICCD 85, October 1985, pp. 342-346.
[30] Sarangi, A., "Simulation and Performance Evaluation of a Graph Reduction Machine Architecture," Masters Thesis, Oregon Graduate Center, 1984.
[31] Smith, J. E., "A Study of Branch Prediction Strategies," Proc. ACM SIGARCH Symp. on Computer Architecture, May 1981.
[32] Smith, A. J., "Cache Evaluation and the Impact of Workload Choice," SIGARCH Newsletter — 12th Annual Int. Symposium on Computer Architecture, vol. 13, 3 (June 1985), pp. 64-73.
[33] Smith, J. E. and Goodman, J. R., "Instruction Cache Replacement Policies and Organisations," IEEE Transactions on Computers, vol. C-34, 3 (March 1985).
[34] Turner, D. A., "New implementation techniques for applicative languages," Software, Practice & Experience, vol. 9, 1 (1979).
[35] Wilkes, M. V., "Keeping Jump Instructions out of the Pipeline of a RISC-like Computer," ACM SIGARCH Newsletter, vol. 11, 5 (1983).
[36] Woods, J. V., Knowles, A. E., Lomas, P. B. and Edwards, D. B. G., "The MU6-G Computer System," to appear in IEE CDT, 1986.
[37] "A Simple Design May Pay Off Big For Hewlett-Packard," Electronics, March 3, 1986, pp. 39-44.

Figure 4: Throughput and latency vs. cache size (no cache, 128, 256, 512, 1024 lines) for configuration 1, with and without the instruction buffer, sequential prefetch, TAR prefetch, and abortable TAR prefetch.

Figure 5: Throughput and latency vs. cache size for configuration 2, under the same variants as figure 4.

Figure 6: Throughput and latency vs. cache size for configuration 3, under the same variants as figure 4.
