Decoupled Vector Architectures: a First Look

Roger Espasa and Mateo Valero
Departament d'Arquitectura de Computadors, Universitat Politecnica de Catalunya
c/ Gran Capita, Modul D6, 08034 Barcelona, SPAIN
e-mail: [email protected]

This work was supported by the Ministry of Education of Spain under contracts TIC 880/92 and 0429/95, by ESPRIT 6634 Basic Research Action (APPARC) and by the CEPBA (European Center for Parallelism of Barcelona).

Abstract
The purpose of this paper is to show that, by using decoupling techniques in a vector processor, the performance of vector programs can be greatly improved. We will show how, even for an ideal memory system with no latency, decoupling provides a significant advantage over the standard mode of operation. We will also present data showing that, for more realistic latencies, decoupled vector architectures perform substantially better than non-decoupled vector architectures. We will also introduce a bypassing technique between the queues and show how it can reduce the total memory traffic. A side effect of the decoupling technique presented is that it tolerates long memory latencies so well that it could make it feasible to use very slow DRAM parts in vector computers in order to reduce cost.
1 Introduction

Recent years have witnessed an increasing gap between processor speed and memory speed, which is due to two main reasons. First, technological improvements in cpu speed have not been matched by similar improvements in memory chips. Second, the instruction level parallelism available in a processor has increased. Since several instructions are issued in the same processor cycle, the total amount of data requested per cycle from the memory system is much higher. These two factors have led to a situation where memory chips are on the order of 10 to 100 times slower than cpus and where the total execution time of a program can be greatly dominated by the average memory access time.

Current superscalar processors have attacked the memory latency problem through basically three types of techniques: caching, multithreading and decoupling (which, sometimes, may appear together). Cache-based superscalar processors reduce the average memory access time by placing the working set of a program in a faster level of the memory hierarchy. Software and hardware techniques such as [1] have been devised to prefetch data from low levels in the memory hierarchy to higher levels (closer to the cpu) so that data arrives at the higher level before it is actually needed, thus hiding the memory latency of the lower levels. On top of that, program transformations such as loop blocking [2] have proven very useful to fit the working set of a program into the cache, so that the probability that the data needed by the processor is in the cache is very high.

Multithreaded processors [3] attack the memory latency problem by switching between threads of computation every time a long-latency operation (such as a cache miss) threatens to halt the processor. Since the cpu switches from one thread to another, the amount of exploitable parallelism increases, the probability of halting the cpu due to a hazard decreases, the occupation of the functional units increases and the total throughput of the system is improved. While each single thread still pays latency delays, the cpu is (presumably) never idle thanks to this mixing of different threads of computation.

Decoupled processors [4, 5, 6, 7] have focused on numerical computation and attack the memory latency problem by observing that the execution of a program can be split into two different tasks.
The first task consists of moving data from memory to the processor and sending back to memory the results generated by the arithmetic units. The second task consists of executing all the arithmetic instructions needed to carry out the program's computations. A decoupled processor typically has two independent processors (the address processor and the computation processor) that perform these two tasks asynchronously and that communicate through architectural queues. Latency is hidden by the fact that the address processor is usually able to slip ahead of the computation processor and start loading data that will soon be needed by the computation processor. This excess data produced by the address processor is stored in the queues, and stays there until it is retrieved by the computation processor.

Vector machines have traditionally tackled the latency problem by the use of long vectors. Once a (memory) vector operation is started, it pays some initial (potentially long) latency, but then it works on a long stream of elements and effectively amortizes this latency across all the elements. For example, a vector load that accesses 128 elements and has to pay a 12 cycle memory latency will be completed in 140 cycles (assuming no conflicts in the memory system), yielding an average of 1.09 cycles per element. Although vector machines have been very successful for certain types of numerical calculations, there is still much room for improvement. Several studies in recent years [8, 9] show how the performance achieved by vector architectures on real programs is far from the theoretical peak performance of the machine. Moreover, vector architectures have not completely managed to overcome the memory latency problem. For example, vector multiprocessors have conflicts in the interconnection network and the memory modules that make the average memory access time higher than desirable. Moreover, the hardware and software techniques that are used by scalar processors to reduce the impact of increased memory latency are difficult or expensive to use in vector architectures.

Caches have traditionally not been used in vector architectures because the working sets of vector programs are usually very large, and vector loads tend to produce "sweeps" over the cache that make all its data useless, induce a lot of cache pollution, and waste a lot of memory bandwidth on cache line refills [10]. Software pipelining is able to hide latency by overlapping work from several different iterations of a program loop [11]. This has the effect of enlarging register lifetimes, which translates into a higher number of registers needed to hold live variables and thus produces high pressure on the register allocator. Thus, software pipelining is also difficult to use in vector processors because of the limited number of different registers in the vector register file. Data prefetching has also not been used in vector processors, due to the high cost of an incorrect prefetch. In scalar machines prefetches are usually done on a cache line basis, and a useless prefetch has a small penalty in terms of bus occupation and can be done while no other activity is going on in the system bus. For a vector processor, a useless prefetch has a terrible cost both in terms of bus cycles and in terms of wasted buffer space. Thus, vector processors have mostly relied on the vector principle of amortizing latency over a large set of elements.
Nevertheless, [12, 9] point out that, even when working on large vectors, there are a large number of lost cycles due to latency and resource conflicts. Functional unit hazards and conflicts in the vector register file ports can make vector processors stall for long periods of time and suffer from the same latency problems as scalar processors. For example, [12] shows how the memory port of a single-port vector computer was heavily underutilized even for programs that were memory bound. This was mostly due to the fact that conflicts in the computation part of the processor (for example, two consecutive multiply instructions in a processor where there is only one multiplier) prevented the program from proceeding and issuing critical memory instructions that could have used the memory port while the multiplier conflict was being resolved. It also showed how a vector processor could spend up to 50% of all its execution cycles waiting for data to come from memory. The conclusion was that, in order to obtain full performance from a vector processor, some additional mechanism had to be used to reduce the memory delays (coming from lack of bandwidth and long latencies) experienced by programs. Since caches and software pipelining were already discarded for the aforementioned reasons, we turned to the principle of decoupling in order to reduce the number of cycles lost due to memory problems.

The purpose of this paper is to show that, by using decoupling techniques in a vector processor, the performance of vector programs can be greatly improved. We will show how, even for an ideal memory system with no latency, decoupling provides a significant advantage over the standard mode of operation. We will also present data showing that, for more realistic latencies, decoupled vector architectures perform substantially better than non-decoupled vector architectures.
We will also introduce a bypassing technique between the queues and show how it can reduce the total memory traffic. A side effect of the decoupling technique presented is that it tolerates long memory latencies so well that it could make it feasible to use very slow DRAM parts in vector computers in order to reduce cost.

The rest of this paper is organized as follows: section 2 will present the reference architecture. Section 3 will present our simulation methodology. Section 5 will describe the decoupled vector architecture, and sections 6 and 7 will present the performance benefits of decoupling. Finally, section 8 will present our conclusions and future work.
2 The Reference Architecture

In order to compare the decoupled vector architecture to a standard, non-decoupled vector architecture, we have taken as our reference model the Convex C34 series of processors [13]. We have designed a vector architecture, which we will refer to as the Reference Vector Architecture, that is a close model of the C34 architecture, although the low-level details of the particular implementation of the C34 have been abstracted away. The C34 architecture was chosen mainly because we had a C3480 machine readily available and we had developed a tracing tool specifically targeted to it [14]. The main implication of this choice is that this study is restricted to the class of vector computers having one memory port and two functional units. It is also important to point out that we used the output of the Convex compilers to evaluate our decoupled architecture. Thus, the results presented are a lower bound of the speedup achievable through decoupling, since the compiler was not aware of the decoupling. We believe that a compiler able to decouple the code could obtain better performance from the proposed decoupled vector architecture.

The reference architecture consists of a scalar part and an independent vector part. The scalar portion executes all instructions that involve scalar registers (A and S registers), and issues a maximum of one instruction per cycle. The vector part consists of two computation units (FU1 and FU2) and one memory accessing unit (the LD unit). The FU2 unit is a general purpose arithmetic unit capable of executing all vector instructions. The FU1 unit is a restricted functional unit that executes all vector instructions except multiplication, division and square root. Both functional units are fully pipelined. The vector unit has 8 vector registers which hold up to 128 elements of 64 bits each. These eight vector registers are connected to the functional units through a restricted crossbar. Every two vector registers are grouped in a register bank and share two read ports and one write port that link them to the functional units. The compiler is responsible for scheduling the vector instructions and allocating the vector registers so that no port conflicts arise. The machine modeled implements fully flexible chaining, through the use of two read pointers and one write pointer. The LD unit can only service one request to/from memory at a time. The real C34 architecture does not allow chaining the result of a vector load instruction into a vector computation instruction. We have included this restriction in our model, since the compiler schedules vector instructions taking it into account.
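For quick reference, the main machine parameters just described can be summarized as follows (a C fragment written for this paper; the identifiers are ours, not Convex's):

    /* Summary of the modeled Reference Vector Architecture (identifiers ours). */
    #define VECTOR_REGISTERS       8    /* vector register file size             */
    #define VECTOR_LENGTH        128    /* elements per vector register          */
    #define ELEMENT_BITS          64    /* element width                         */
    #define REGS_PER_BANK          2    /* registers grouped into banks...       */
    #define READ_PORTS_PER_BANK    2    /* ...sharing two read ports             */
    #define WRITE_PORTS_PER_BANK   1    /* ...and one write port                 */
    #define MEMORY_PORTS           1    /* single LD unit, one request at a time */

    /* FU2 executes every vector instruction; FU1 executes everything except
       multiplication, division and square root. Both are fully pipelined.      */
    enum vector_unit { FU1, FU2, LD };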
3 Experimental Framework

In order to perform a detailed performance analysis of the execution of the Perfect Club programs on the reference vector architecture and the proposed decoupled architecture, we have taken a trace driven approach. The tracing procedure is as follows: the Perfect Club programs are compiled on a Convex C34 machine using the Fortran compiler (version 8.0) at optimization level -O2 (which implies scalar optimizations plus vectorization). Then the executables are processed using Dixie [14], a tool that decomposes executables into basic blocks and then instruments the basic blocks to produce four types of traces: a basic block trace, a trace of all values set into the vector length register, a trace of all values set into the vector stride register, and a trace of all memory references (actually, a trace of the base address of all memory references). Dixie instruments all basic blocks in the program, including all library code. This is especially important since a number of Fortran intrinsic routines (SIN, COS, EXP, etc.) are translated by the compiler into library calls. These library routines are highly vectorized.
Figure 1: The instrumentation process. Dixie takes the original executable (a.out) and generates an instrumented version (a.out.dixie) plus a basic block file (a.out.bb); the trace data is split (split-trace) into the BB, VL, VS and memory traces, which drive the simulator to produce the performance results.
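To make the trace-driven approach concrete, the sketch below shows the shape of the simulator's driving loop (the record layouts, file names and empty timing hook are hypothetical and ours, not Dixie's actual formats):

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical trace records; Dixie's real formats differ. */
    typedef struct { unsigned bb_id; }      bb_rec;   /* basic block trace     */
    typedef struct { unsigned vl; }         vl_rec;   /* vector length trace   */
    typedef struct { int vs; }              vs_rec;   /* vector stride trace   */
    typedef struct { unsigned long addr; }  mem_rec;  /* memory base addresses */

    int main(void)
    {
        FILE *bb  = fopen("trace.bb",  "rb");
        FILE *vl  = fopen("trace.vl",  "rb");
        FILE *vs  = fopen("trace.vs",  "rb");
        FILE *mem = fopen("trace.mem", "rb");
        if (!bb || !vl || !vs || !mem) { perror("trace"); return EXIT_FAILURE; }

        bb_rec b;
        while (fread(&b, sizeof b, 1, bb) == 1) {
            /* Walk the static instructions of basic block b.bb_id; consume a
               VL/VS record whenever an instruction sets those registers and a
               memory record for every load or store, and feed each dynamic
               instruction to the timing model of the architecture under study. */
        }

        fclose(bb); fclose(vl); fclose(vs); fclose(mem);
        return EXIT_SUCCESS;
    }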
    Program     #bbs    #insns S   #insns V     #ops V   % Vect   avg. VL
    ARC2D        5.2       63.3       42.9      4086.5     98.5        95
    FLO52        5.7       37.7       22.8      1242.0     97.1        54
    BDNA        47.0       23.9       19.6      1589.9     86.9        81
    TRFD        44.8      352.2       49.5      1095.3     75.7        22
    DYFESM      34.5      236.1       33.0       696.2     74.7        21
    SPEC77     166.2     1147.8      262.6      3337.8     74.4        12
    MG3D       452.14   11066.75     663.79    17995.49    61.9        27
    MDG        185.90    4446.64     746.97     5611.66    55.8         7
    ADM         42.4      709.0      104.3       694.8     49.5         6
    OCEAN      165.64    4414.30      16.17     1444.92    24.7        89
    QCD         80.05    1079.77       4.10       91.84     7.8        22
    TRACK       50.67     505.96       4.47       26.96     5.1         6
    SPICE       31.12     279.06       3.37        7.80     2.7         2

Table 1: Basic operation counts for the Perfect Club programs.

5 The Decoupled Vector Architecture

Recall that the execution of a program can be split into two different tasks. The first task consists of moving data from memory to the processor and sending back to memory the results generated by the arithmetic units. The second task consists of executing all the arithmetic instructions needed to carry out the program's computations. In a typical vector processor, the compiler schedules both types of operations in a single instruction stream, attempting to minimize the possible hazards between them. The decoupled vector architecture, on the other hand, splits the instruction stream into three different streams. One contains only the vector computation instructions and is executed by the vector processor (VP). Another contains all the memory accessing instructions (both vector and scalar) and goes to the address processor (AP). The third contains the computation instructions executed in scalar mode and goes to the scalar processor (SP). The three processors are connected through a set of architectural queues and proceed independently. Thus, the SP and VP perform the computation part of a program and the AP performs all memory accessing functions (see figure 2).

In order to control these three processors, the decoupled vector architecture has a fourth processor responsible for fetching all instructions and distributing them among the AP, SP and VP. This processor is the fetch processor (FP). Note that in this paper we focus on the evaluation of a decoupled vector architecture able to execute normal, non-decoupled vector programs. If we had a decoupling vectorizing compiler, there would be no need for a fetch processor, since each of the AP, SP and VP would have its own program counter and would fetch its own instructions. The following sections describe each of the processors present in our architecture in more detail.
The Fetch Processor
The fetch processor is a very simplified version of the control unit of the reference vector architecture. Its purpose is to fetch instructions from a sequential, non-decoupled instruction stream and distribute them to the processor responsible for executing them. The FP is responsible for "translating" the incoming non-decoupled instruction stream into its decoupled counterpart. This translation task is accomplished through a set of very simple rules. Computation instructions are sent to their corresponding unit in a straightforward correspondence. Memory accessing instructions are sent to the AP, and a modified version of the instruction is sent to the processor that expects to receive the data. This modified instruction (a queue mov, or QMOV for short) instructs the computation processor (either VP or SP) to move data from its input queue into a destination register. The FP also takes care of generating all the necessary QMOV instructions when an instruction requires data from another processor (typically, a vector register being operated with a scalar register).
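As an illustration only, the following C fragment sketches these dispatch rules (the instruction classification, queue move helper and processor identifiers are invented for this sketch and do not correspond to actual Convex C34 opcodes):

    /* Hypothetical, simplified view of the instructions seen by the FP. */
    enum iclass { SCALAR_ARITH, VECTOR_ARITH, SCALAR_MEM, VECTOR_MEM };
    enum proc   { AP, SP, VP };      /* address, scalar and vector processors */

    struct insn {
        enum iclass cls;
        int is_load;                 /* 1 = load, 0 = store                  */
        int dest_reg;                /* destination register of a load       */
    };

    /* Dispatch one instruction of the sequential stream following the rules
       described above: computation goes to its own processor, every memory
       instruction goes to the AP, and a QMOV is generated for the processor
       that will consume the loaded data. Cross-processor operands (e.g. a
       scalar register used by a vector instruction) also generate QMOVs,
       which are omitted here for brevity.                                   */
    static void dispatch(struct insn i,
                         void (*send)(enum proc, struct insn),
                         void (*send_qmov)(enum proc, int dest_reg))
    {
        switch (i.cls) {
        case SCALAR_ARITH: send(SP, i); break;
        case VECTOR_ARITH: send(VP, i); break;
        case SCALAR_MEM:
            send(AP, i);                               /* AP performs the access */
            if (i.is_load) send_qmov(SP, i.dest_reg);  /* SP: input queue -> reg */
            break;
        case VECTOR_MEM:
            send(AP, i);
            if (i.is_load) send_qmov(VP, i.dest_reg);  /* VP: AVDQ -> vector reg */
            break;
        }
    }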
Figure 2: Block diagram of the decoupled vector architecture: the FP, AP, SP and VP, the scalar cache and memory, and the connecting queues (AFBQ, SFBQ, APIQ, SPIQ, VPIQ, ASDQ, SADQ, AVDQ, VADQ, SVDQ, VSDQ).
6 Performance of the Decoupled Vector Architecture

In this section we present the performance of the decoupled vector architecture versus the reference architecture. In order to compare the effectiveness of both architectures in executing vector programs, we have run simulations for each of the benchmark programs both on the reference architecture simulator and on the decoupled vector architecture simulator. In each run, the only changing parameter was the memory latency. For each program we conducted experiments with latencies ranging from 1 cycle to 100 cycles in steps of 10 cycles.

We have also included in the results lower bounds for the execution time of each program. To compute the lower bound for a program we consider what its execution time would be if there were no dependencies at all. Thus, we consider only resource constraints in order to determine the minimum possible execution time. Given that our two architectures both have essentially five resources (unit FU1, unit FU2, the memory port, the scalar processor and the scalar cache), we partition all operations executed by a program into these five categories. The category with the maximum number of operations then determines the minimum theoretical execution time for the program.

For the decoupled simulations, we have used the following parameters: instruction queues were all 16 instructions long. All scalar queues were of length 256. The vector load queue (AVDQ) also had 256 slots, where each slot is a vector register. The vector store queue (VADQ) was set to 16 slots. The rationale for these values is as follows: all queues were first set to "infinite" values, that is, 512 slots for the instruction queues and 256 slots for all other queues, to obtain an optimistic value of the speedup achievable by decoupling. Then, some simulations were conducted to evaluate the effect of reducing these queues. For the instruction queues, simulations showed that reducing their length to 16 slots did not noticeably affect final performance (less than 2% difference). For the vector store queue, we had to fix a small queue length due to the policy followed by the AP when dispatching stores. Since the VADQ is usually full (until a flush occurs), it is not reasonable to have 256 slots in the store queue, hence the final value chosen (16 slots). For the vector load queue we will present results using a queue length of 256, and the next section will present data on the actual usage of the queue.

Figure 3 presents the execution time of the selected programs for the two architectures under study, together with the corresponding ideal lower bound. For each program we plot its execution time for the different latency values studied. The overall results suggest two important points. First, the DVA architecture shows a clear speedup over the REF architecture even when memory latency is just 1 cycle. A latency of 1 cycle represents an idealized memory system, since no currently available memory system in a vector processor can deliver such good performance. This speedup is due to the fact that the AP slips ahead of the VP and loads data in advance, so that when the VP needs its input operands they are (almost) always ready in the queues. Even if there is no latency in the memory system, this "slipping" produces an effect similar to a prefetching technique, with the advantage that the AP always knows exactly which data has to be loaded (no incorrect prefetches). Thus, the partitioning of the program into separate tasks helps in exploiting more parallelism between the AP and VP and translates into an increase in performance, even in the absence of memory latency.

The second important point is that the slopes of the execution time curves for the reference and the decoupled architectures are substantially different. Note how the slope of the DVA curves is much smaller than the slope of the corresponding REF curves. This implies that decoupling tolerates long memory delays much better than current vector architectures. To summarize the speedups obtained, figure 4 presents the speedup of the DVA over the REF architecture for each particular value of memory latency. Speedups (at latency 100) range from 1.35 for ARC2D to 2.05 for SPEC77. An important fact is that memory latency will sharply increase in the near future, and latencies like 100 cycles will be commonplace, making decoupling a very attractive technique to hide these long memory latencies for numerical applications.

The only program that does not show significant speedups, DYFESM, is nevertheless not affected negatively by the decoupling principle. We have investigated DYFESM and found that its three most important loops do not benefit from decoupling. The first loop, which is responsible for 68% of all vector operations, cannot be executed in less than 3 chimes, and both the reference and the decoupled architectures achieve this minimum. Thus, it does not show any speedup. The next two loops, each responsible for 7.1% of all vector operations, contain a reduction vector operation that has a dependency with itself of distance 1. This dependency makes the SP stall and prevents the AP (which is waiting for a register coming from the SP) from getting ahead of the VP. The three processors have to work in a lockstep fashion and cannot improve upon the reference architecture.
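The lower bound described at the beginning of this section reduces to taking, for each program, the maximum operation count over the five resources; a minimal C transcription of that calculation (the struct layout and the one-operation-per-resource-per-cycle assumption are ours) would be:

    /* Minimum possible execution time under resource constraints only:
       the busiest of the five resources dictates the lower bound.          */
    struct op_counts {
        unsigned long fu1;      /* vector operations executed on FU1         */
        unsigned long fu2;      /* vector operations executed on FU2         */
        unsigned long mem;      /* element transfers through the memory port */
        unsigned long scalar;   /* scalar processor instructions             */
        unsigned long cache;    /* scalar cache accesses                     */
    };

    static unsigned long lower_bound_cycles(struct op_counts c)
    {
        unsigned long max = c.fu1;
        if (c.fu2    > max) max = c.fu2;
        if (c.mem    > max) max = c.mem;
        if (c.scalar > max) max = c.scalar;
        if (c.cache  > max) max = c.cache;
        return max;             /* one operation per resource per cycle assumed */
    }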
Figure 3: DVA versus Reference architecture for the benchmark programs. (Execution time in cycles versus memory latency, from 1 to 100 cycles, for BDNA, DYFESM, ARC2D, FLO52, TRFD and SPEC77; each plot shows the IDEAL, REF and DVA curves.)
Figure 4: Speedup of the DVA over the Reference architecture for the benchmark programs (SPEC77, TRFD, FLO52, BDNA, ARC2D and DYFESM) as a function of memory latency.
7 Queue Length

For the results presented in the previous section, we have been using an "infinite" length vector load queue. The purpose of this section is to study the actual usage of this queue in order to help choose an appropriate value. The other vector queue in the DVA architecture (the vector store queue, VADQ) did not show any significant effect on final performance when increased or decreased. We ran simulations with the VADQ having 8 and 4 slots and the performance impact was minimal. Therefore, in this section we will only consider the load queue.

Figure 5 presents the distribution of busy slots in the AVDQ for the benchmark programs. For each program we plot three distributions corresponding to three different memory latency values (latencies of 1, 30 and 100 cycles). Each bar in the graphs represents the total number of cycles during which the AVDQ had a certain number of busy slots (we plot absolute numbers of cycles instead of percentages to be able to compare the three latencies). For example, for BDNA, the AVDQ was completely empty (zero busy slots) for more than 400 million cycles.

From figure 5 we can see that for the six benchmark programs it is uncommon to use more than four slots in the queue. At latency 1, most programs typically have 0 or 1 busy slots. When moving to latency 30, the graphs show a clear increase in the number of cycles with 2 busy slots. At latency 100, the highest percentages are at 2 busy slots and the number of cycles with 3, 4 or even 5 busy slots becomes significant. As expected, the longer the memory latency, the higher the number of busy slots, since the memory system has more outstanding requests and, therefore, needs more slots in the queue.

The sharp increase in the number of cycles having 2 busy slots can be explained by analyzing the characteristics of the programs. The six benchmarks are, as a whole, memory bound. Therefore, it is fair to assume that the majority of their loops will also be memory bound. When the DVA is executing a memory bound loop, the VP executes faster than the AP can feed it. In the steady state, the VP spends most of its time waiting for data to arrive at its input queue (the AVDQ). As soon as one vector is ready in the AVDQ, the VP executes a QMOV instruction and starts moving it into a vector register. At the same time, and since we are assuming a memory bound loop, the AP will most probably start another vector load. Thus, we will have two busy slots in the AVDQ: the first is the one being consumed by the VP and the second is the slot reserved by the vector load instruction. This situation repeats itself because the VP processes its input data faster than the AP can feed it, and it explains why we see this "magic" value of two slots in the AVDQ most of the time.
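This steady state can be reproduced with a toy occupancy model, given below purely as an illustration (our own simplification, not the simulator used for this paper): one vector load issues every VL cycles on the single memory port, its data is complete LAT + VL cycles after issue, its AVDQ slot is reserved at issue time, and the VP drains one complete vector in VL cycles.

    #include <stdio.h>

    enum { VL = 128, LAT = 30, LOADS = 50 };     /* illustrative parameters */

    int main(void)
    {
        long issue[LOADS], freed[LOADS];
        long vp_free = 0;

        for (int i = 0; i < LOADS; i++) {
            issue[i]   = (long)i * VL;        /* memory port busy VL cycles/load */
            long ready = issue[i] + LAT + VL; /* last element arrives            */
            long start = ready > vp_free ? ready : vp_free;
            freed[i]   = start + VL;          /* QMOV drains the slot            */
            vp_free    = freed[i];
        }

        /* Histogram of AVDQ occupancy over a steady-state window. */
        int hist[LOADS + 1] = {0};
        for (long t = 10L * VL; t < 40L * VL; t++) {
            int busy = 0;
            for (int i = 0; i < LOADS; i++)
                if (issue[i] <= t && t < freed[i]) busy++;
            hist[busy]++;
        }
        for (int b = 0; b <= LOADS; b++)
            if (hist[b]) printf("%d busy slots for %d cycles\n", b, hist[b]);
        return 0;
    }

Under these assumptions the occupancy oscillates between two and three slots for moderate latencies, which is consistent with the behaviour reported in figure 5.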
Figure 5: Busy slots in the AVDQ for the benchmark programs (BDNA, DYFESM, ARC2D, FLO52, TRFD and SPEC77) for three different memory latency values (L = 1, L = 30 and L = 100).
Another important point is that the queue length appears to be bounded by 9 slots: none of the programs has more than 8 full slots at any point in time. This is a counterintuitive result, since one would expect that, for compute bound loops, the AVDQ would be completely filled. We have already mentioned that the six programs are memory bound, but that does not preclude them from having at least one loop that is compute bound. Thus, if a program has one compute bound loop, there should be more than nine elements in the queue for at least some cycles. What happens is that when a loop is compute bound, the resource that becomes the bottleneck is the instruction queue of the vector processor (VPIQ). Consider the following hypothetical compute bound loop: Ld/Ld/Mul/Add/Mul/Add/Mul/Add. This loop can be processed in three chimes. The AP only has to work during two chimes, and thus will use the third chime to prefetch data. At each loop iteration, the AVDQ length increases by one slot, and the queue should fill completely, provided that the loop performs enough iterations. Nevertheless, this will not happen because of the limited length of the VPIQ. Recall that each Ld instruction translates into a QMOV instruction in the VPIQ. This loop would therefore use 8 instruction slots in the VPIQ per iteration. As soon as the FP has dispatched two consecutive loop iterations to the VP, the VPIQ is completely filled. By that time, the AP can have completed a total of two loop iterations, and can have filled a total of four slots in the AVDQ. But the AP will not be able to continue filling the queue, because the FP is blocked on the VPIQ waiting for it to drain. Since the FP is blocked, it cannot dispatch more instructions to the AP, and the address processor will stall. From that point onward, the VP, FP and AP work in lockstep. The AP will only be able to bring in more data when the VP has executed enough instructions to release the FP, and the amount of data in the AVDQ will remain constant. Note that this behavior does not affect final performance since, if the loop is compute bound, the maximum speed is determined by the VP and not by the AP. That is, the VP already has enough work to do, so the stalling of the FP does not matter.
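The figures in this argument can be checked with a few lines of arithmetic (our own restatement of the reasoning above, using the 16-entry VPIQ from section 6):

    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical compute bound loop: Ld/Ld/Mul/Add/Mul/Add/Mul/Add. */
        const int loads_per_iter = 2;                 /* become QMOVs in the VPIQ */
        const int arith_per_iter = 6;
        const int vpiq_per_iter  = loads_per_iter + arith_per_iter;  /* 8 entries */
        const int vpiq_slots     = 16;

        int iters_ahead = vpiq_slots / vpiq_per_iter;             /* 2 iterations */
        int avdq_used   = iters_ahead * loads_per_iter;           /* 4 slots      */

        printf("FP can dispatch %d iterations before the VPIQ fills;\n", iters_ahead);
        printf("the AP can therefore occupy at most %d AVDQ slots.\n", avdq_used);
        return 0;
    }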
8 Conclusions and Future Work

In this paper we have presented decoupled vector architectures. To the best of our knowledge, the idea of applying the decoupling principles studied for scalar architectures to vector architectures has not been pursued before. We have described a basic decoupled architecture that uses the principles of decoupling to hide most of the memory latency seen by vector processors. The architecture is similar to current vector architectures, with the exception that it has two main processors (the AP and the VP) that proceed independently of one another and that communicate through architectural queues. The motivation behind our proposal is that the AP is able to slip ahead of the VP and load data from memory before it is actually needed by the VP. Thus, when the VP asks for the already loaded data, it finds it in a queue inside the processor, accessible with no latency, instead of having to wait for the main memory system to deliver the requested data. With its ability to advance all memory requests as much as possible, the decoupled vector architecture manages to make very good use of the memory bus.

This paper has shown two key points. First, the decoupled vector architecture is superior to a conventional architecture even for ideal memory models having only a one cycle latency. This is due to the fact that decoupling allows for parallelism between the computation and the memory unit. This implies that the AP can execute memory instructions independently of stalls or hazards occurring in the computation processors. This situation leads to the AP slipping ahead of the other two processors and "prefetching" data in advance. Second, as memory latency increases, decoupling helps in overlapping this latency with useful work and thus makes most of the programs studied insensitive to the growing latency. Again, this ability to tolerate long memory delays comes from the fact that the AP starts loading data as soon as possible, without waiting for prior computation instructions to complete.

Moreover, we have seen that these speed improvements can be implemented with a reasonable cost/performance tradeoff. We have shown that the queues do not need to be very long to allow the decoupling to take place. A vector load queue of four slots is enough to achieve a high fraction of the maximum performance obtainable with an infinite queue. Similarly, the vector store queue does not need to be very large. Our experiments varying the store queue length
indicate no performance differences between store queues of 4, 8 or 16 slots. Finally, decoupling shows that, at least for vector architectures with a single memory port, the memory system can be slowed down without jeopardizing performance. This is very interesting since the cost of current vector supercomputers is mostly dominated by the cost of the memory system. Expensive 15ns SRAM parts are currently used in vector machines to achieve the desired bandwidth. This paper suggests that, using decoupling techniques, the SRAM parts could be replaced by much cheaper DRAM parts and, although latency would increase sharply, decoupling would alleviate its effect and would permit achieving the same performance levels.
References

[1] Tien-Fu Chen and Jean-Loup Baer. A performance study of software and hardware data prefetching strategies. In Proceedings of the 21st International Symposium on Computer Architecture, 1994.
[2] Ken Kennedy and Kathryn S. McKinley. Optimizing for parallelism and data locality. In Proceedings of the International Conference on Supercomputing, pages 323-334, July 1992.
[3] A. Agarwal. Performance tradeoffs in multithreaded processors. IEEE Transactions on Parallel and Distributed Systems, 2(4):398-412, October 1991.
[4] Lizy Kurian, Paul T. Hulina, and Lee D. Coraor. Memory latency effects in decoupled architectures. IEEE Transactions on Computers, 43(10):1129-1139, October 1994.
[5] James E. Smith. Decoupled access/execute computer architectures. ACM Transactions on Computer Systems, 2:289-308, November 1984.
[6] Matthew K. Farrens and Andrew R. Pleszkun. Implementation of the PIPE processor. Computer, pages 65-70, January 1991.
[7] James E. Smith, G. E. Dermer, B. D. Naderwarn, S. D. Klinger, C. M. Rozewski, D. L. Fowler, K. R. Scidmore, and J. P. Laudon. The ZS-1 central processor. In 2nd International Conference on Architectural Support for Programming Languages and Operating Systems, pages 199-204. CS Press, 1987.
[8] Willi Schonauer and Hartmut Hafner. Supercomputers: where are the lost cycles? In Supercomputing, 1991.
[9] Willi Schonauer and Hartmut Hafner. Explaining the gap between theoretical peak performance and real performance for supercomputer architectures. Scientific Programming, 3:157-168, 1994.
[10] L. I. Kontothanassis, R. A. Sugumar, G. J. Faanes, J. E. Smith, and M. L. Scott. Cache performance in vector supercomputers. In Supercomputing, 1994.
[11] M. S. Lam. Software pipelining: an effective scheduling technique for VLIW machines. In Proceedings of the ACM SIGPLAN '88 Conference on Programming Language Design and Implementation, volume 23, pages 318-328, Atlanta, GA, 1988.
[12] Roger Espasa, Mateo Valero, David Padua, Marta Jimenez, and Eduard Ayguade. Quantitative analysis of vector code. In Euromicro Workshop on Parallel and Distributed Processing. IEEE Computer Society Press, January 1995.
[13] Convex Press, Richardson, Texas, U.S.A. CONVEX Architecture Reference Manual (C Series), sixth edition, April 1992.
[14] Roger Espasa and Xavier Martorell. Dixie: a trace generation system for the C3480. Technical Report CEPBA-RR-94-08, Universitat Politecnica de Catalunya, 1994.