PERFORMANCE ADVANTAGES OF MERGING INSTRUCTION- AND DATA-LEVEL PARALLELISM
Francisca Quintana
U. Las Palmas de Gran Canaria
[email protected]
Roger Espasa, Mateo Valero
U. Politècnica de Catalunya--Barcelona
{roger,mateo}@ac.upc.es
Abstract
This paper presents a new architecture based on adding a vector pipeline to a superscalar microprocessor. The goal of this paper is to show that instruction-level parallelism (ILP) and data-level parallelism (DLP) can be merged in a single architecture to execute regular vectorizable code at a performance level that cannot be achieved using only ILP techniques. We present an analysis of the two paradigms at the instruction set architecture (ISA) level showing that the DLP model has several advantages: it executes fewer instructions, fewer overall operations (by factors as large as 1.7), and generally performs fewer memory accesses. We then analyze the ILP model in terms of IPC. Our simulations show that a 4-way machine achieves IPCs in the range 1.03-1.52 and that, by scaling to 16-way, only 26% of the peak IPC is achieved. The combined ILP+DLP model, by contrast, is shown to perform from 1.24 to 2.84 times better than the 4-way ILP machine. Moreover, when we scale up the ILP+DLP machine, the speedup over the 16-way ILP machine increases to as much as 3.45. All this extra performance is achieved with very modest control hardware, ensuring that clock cycle time is not jeopardized in our proposed architecture.
1. Introduction
Over the last few years, microprocessor performance has been steadily increasing due to rapid innovation in microarchitecture techniques aimed at exploiting instruction-level parallelism (ILP). Current state-of-the-art microprocessors all include 4-wide fetch engines coupled with sophisticated branch predictors, large reorder buffers to dynamically schedule instructions based on dependencies, and non-blocking caches to allow multiple outstanding misses. All these techniques focus on a single goal: executing, in parallel, several instructions that are known to be independent. The larger the number of instructions that can be launched on each cycle, the better the performance achieved.

However, recent research has pointed out that it will be very difficult to keep this pace of improvement. The first concern is that the efficiency of current superscalar processors is very low: even issuing four instructions per cycle, the typical IPCs achieved are in the range 0.5-1.5 [1][2][3]. The second concern is that even if we increase the issue width, not much performance will be gained, as several studies have shown [2]. A very wide superscalar machine requires large amounts of hardware to control the execution of the in-flight instructions. Unfortunately, these large amounts of hardware are not enough to achieve good levels of ILP, because there is an inherent limit to the amount of instruction-level parallelism that can be extracted from a given program. Therefore, trying to scale current superscalar processors to issue widths of 16 or 32 does not seem justified from a cost/performance point of view.

Due to these limitations there is a renewed interest in alternatives to extract more parallelism from sequentially specified programs, and also in techniques that make more efficient use of the available hardware resources in superscalar engines. We propose to explore an alternative source of parallelism that has long been ignored by the microprocessor community: data-level parallelism (DLP). The DLP paradigm uses vectorization techniques to discover data-level parallelism in a sequentially specified program and expresses this parallelism using vector instructions. A single vector instruction specifies a series of operations to be performed on a stream of data. Each operation performed on each individual element is independent of all others and, therefore, a vector instruction is easy to pipeline and highly parallel.
Using vector instructions to express data-level parallelism has two very important advantages. First, the total number of operations that have to be executed to complete a program is reduced, because each vector instruction has more semantic content than the corresponding scalar instructions. Second, the fact that the individual operations in a single vector instruction are independent allows a more efficient execution: once a vector instruction is issued to a functional unit, it keeps that unit busy with useful work for many cycles. During those cycles, the processor can look for other vector instructions to be launched to the same or other functional units. It is very likely that, by the time a vector instruction completes all its work, another vector instruction is already ready to occupy the functional unit. In a scalar processor, by contrast, when an instruction is launched to a functional unit, another instruction is required at the very next cycle to keep the functional unit busy, and many hazards can get in the way of this requirement: true data dependencies, cache misses, branch misspeculation, etc.

The combination of these two effects has several related advantages. First, the pressure on the fetch unit is greatly reduced: by specifying many operations with a single instruction, the total number of instructions that have to be fetched is reduced, and many branches disappear, subsumed by the semantics of vector instructions. A second advantage is the simplicity of the control unit. With relatively few control lines, a vector architecture can control the execution of many different functional units, since most of them work in parallel in a fully synchronous way. A third advantage is related to the way the memory system is accessed: a single vector instruction can exactly specify a long sequence of memory addresses. Consequently, the hardware has considerable advance knowledge regarding memory references, can schedule these accesses in an efficient way [4], and needs to access no more data than is actually needed. In addition, a vector memory operation is able to amortize start-up latencies over a potentially long stream of vector elements.

The goal of this paper is to show that, by merging the ILP and DLP concepts, we can exploit the advantages of DLP to improve the performance of current ILP processors on numerical applications. We introduce a mixed model based on a typical 4-wide ILP core extended with a vector pipeline. By introducing a vector pipeline, our goal is to exploit an alternative source of parallelism rather than spend large silicon resources trying to over-exploit instruction-level parallelism. While a certain degree of dynamic scheduling is always beneficial to help saturate processor resources and overcome compiler limitations, the resources that have to be invested in order to issue large numbers of instructions in parallel do not pay off in terms of performance [2].

This paper is organized as follows. We start by comparing the ILP and DLP models at the ISA level. Then, we evaluate the performance of typical ILP-based superscalar machines under a wide range of parameters to show that their efficiency is very low (even under ideal conditions). We scale the issue width of these superscalar models to show that, even when fetching and dispatching large numbers of instructions, the performance obtained falls short of expectations, and that the cost/complexity of this scaled-up machine is not justified.
In section 5 we perform a full-scale comparison of the ILP model and our mixed ILP+DLP model in terms of IPC. We conclude this work with an analytical model that explains the different factors involved in the speed-ups achieved by our ILP+DLP model.
2. Methodology and Benchmarks
During this study we have used both trace-driven simulation and data gathered from hardware counters during real executions. All the data presented for the ILP paradigm have been gathered from the performance counters of a MIPS R10000 processor [5] and from execution-driven simulations using the SimpleScalar toolset [6]. The DLP architecture, on the other hand, is analysed using simulation data from traces gathered on a Convex C3400 vector machine with the Dixie tool [7].

What programs can benefit from the mixed ILP+DLP model? For many years, the majority of scientific computing applications have fit very well in the DLP model. There is a large body of vectorizable code that was optimized for yesterday's vector supercomputers and is being run on today's superscalar microprocessors. Moreover, in recent years the fraction of applications that contain highly regular, DLP code has increased -- DSP and multimedia applications such as graphics, compression and encryption [8]. We believe that the fraction of regular vectorizable applications is large enough to deserve special attention in future processors. In this initial study we focus on analysing the behaviour of the ILP+DLP model for highly vectorizable floating-point codes. These are the engineering and scientific codes traditionally run on parallel vector supercomputers, and the ones that will benefit most from merging instruction- and data-level parallelism. In future work, we will study medium- and low-vectorization floating-point codes.

We have chosen the Specfp92 benchmarks as our workload. The selection of the individual programs was based on their behaviour on the Convex machine: we compiled all of them on the Convex machine and selected the 7 programs that achieved at least 70% vectorization. The selected programs were: swm256, hydro2d, su2cor, nasa7, tomcatv, wave5 and mdljdp2.

Table 1 presents some measurements for the set of benchmarks collected using the performance counters of the MIPS R10000 processor. Column two presents the total number of instructions executed (in millions), not including mis-speculated ones. Column three presents the number of computation instructions (integer and floating-point) in these benchmarks. Column four shows the number of memory instructions executed.
Program     Total Instruct.   Compute Instruct.   Memory Instruct.
Swm256           11466              4087               3290
Hydro2d           5759              2142               1473
Nasa7             6897              2251               2392
Su2cor            5135              1618               1896
Tomcatv           1453               545                360
Wave5             2737               862               1014
Mdljdp2           3210              1235                740
Table 1: Benchmark Characteristics on the R10000 machine (in millions)
Program     Scalar Instr.   Vector Instr.   Vector Ops   % Vect   avg. VL
swm256            6.20           74.51        9534.34      99.9      127
Hydro2d          41.50           39.15        3973.78      99.0      101
Nasa7           152.42           67.26        3911.86      96.2       58
Su2cor          152.55           26.78        3356.83      95.7      125
Tomcatv         125.75            7.19         916.80      87.9      127
Wave5           676.52           41.27        1807.23      72.8       43
Mdljdp2        1495.87           89.05        3731.33      71.4       41
Table 2: Benchmark Characteristics on the Convex C34 machine (in millions)

Table 2 presents some statistics for the selected programs when run on the Convex machine. Columns two and three present the total number of instructions issued by the decode unit, broken down into scalar and vector instructions. Column four presents the number of operations performed by vector instructions. Each vector instruction can perform many operations (up to 128), hence the distinction between vector instructions and vector operations. The fifth column is the percentage of vectorization of each program. We define the percentage of vectorization as the ratio between the number of vector operations and the total number of operations performed by the program (i.e., column four divided by the sum of columns two and four). Finally, column six presents the average vector length used by vector instructions, which is the ratio of vector operations to vector instructions (columns four and three, respectively).
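As a quick sanity check on these definitions, the derived columns of Table 2 can be reproduced directly from the raw counts. The following is a minimal Python sketch (the numbers are the Table 2 values, in millions); small rounding differences with respect to the printed table are expected.

# Derive % vectorization and average vector length from the raw counts
# reported in Table 2 for the Convex C34 (values in millions).
benchmarks = {
    # name:      (scalar_instr, vector_instr, vector_ops)
    "swm256":   (6.20,    74.51, 9534.34),
    "hydro2d":  (41.50,   39.15, 3973.78),
    "nasa7":    (152.42,  67.26, 3911.86),
    "su2cor":   (152.55,  26.78, 3356.83),
    "tomcatv":  (125.75,   7.19,  916.80),
    "wave5":    (676.52,  41.27, 1807.23),
    "mdljdp2":  (1495.87, 89.05, 3731.33),
}

for name, (s_instr, v_instr, v_ops) in benchmarks.items():
    pct_vect = 100.0 * v_ops / (s_instr + v_ops)   # vector ops / total ops
    avg_vl = v_ops / v_instr                       # ops per vector instruction
    print(f"{name:8s}  %vect = {pct_vect:5.1f}   avg VL = {avg_vl:6.1f}")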
3. A comparison at the ISA level between ILP and DLP
We start by comparing the ILP and DLP paradigms at the instruction set architecture level. Although the really interesting comparison is always in terms of performance, i.e. time to execute a given task, we can point out many of the advantages that the DLP model can potentially contribute to a mixed ILP+DLP architecture just by looking at the dynamic instruction mixes, without assuming any particular microarchitecture implementation. We will look at three different issues that are solely determined by the instruction set being used and by the compiler: the number of instructions executed, the number of operations executed and the memory traffic generated. The distinction between instructions and operations is necessary because in the DLP model a vector instruction executes several operations (between 1 and 128 in our case).

3.1. Scalar and Vector ISAs
We use the MIPS IV instruction set architecture [9] as representative of architectures exploiting the ILP paradigm. The relevant aspects of this ISA from the point of view of our study are: separate integer and floating-point register sets, each of which holds 32 64-bit registers, conditional move instructions, compare-and-branch instructions, fused multiply-add floating-point instructions, and prefetch instructions.

For the vector instruction set we have used the Convex C34 ISA. The relevant aspects of this instruction set are as follows: 8 32-bit integer registers, 8 64-bit floating-point scalar registers, 8 vector registers each of which holds 128 words of 64 bits, and one mask register (for masked vector operations) holding 128 bits. The vector instructions include a full set of integer, logical (population count, trailing zero count, etc.) and floating-point operations (including square root), as well as vector reductions that take a vector as input and generate a 64-bit scalar result (min, max, sum, prod, parity). The vector memory instructions include loads and stores as well as gather/scatter instructions for irregular accesses. In the scalar instruction set, it is worth mentioning that the C34 includes some transcendental functions in hardware, such as sine, cosine and exponential, and some CISC-style memory accesses to the stack using push and pop opcodes.

3.2. Instructions executed
As already mentioned, vector instructions have a high semantic content in terms of operations specified. The result is that, to perform a given task, a vector program executes many fewer instructions than a scalar program.
Figure 1: ILP - DLP instructions

Figure 1 presents a comparison of the total number of instructions (in millions) executed on the ILP machine (R10000) and the DLP machine (Convex C34) for each of our benchmark programs. In the R10000 case, we use the counts of graduated instructions gathered using the hardware performance counters. In the C34 case, we use the traces provided by Dixie. As can be seen, the differences are huge. Obviously, as the degree of vectorization decreases, this gap narrows.
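To make the gap in figure 1 concrete, the following toy model contrasts the dynamic instruction count of a simple element-wise loop in scalar form and in strip-mined vector form. It is only a sketch: the per-iteration and per-strip instruction counts are hypothetical, and only the maximum vector length of 128 comes from the C34 ISA described above.

# Toy model of the instruction-count advantage of a vector ISA.
MVL = 128   # maximum vector length of the C34 vector registers

def scalar_instructions(n, per_iteration=6):
    # e.g. two loads, an add, a store, an index update and a branch per element
    return n * per_iteration

def vector_instructions(n, per_strip=6):
    # each strip of up to MVL elements needs roughly the same handful of
    # instructions (vector loads, vector add, vector store, loop control)
    strips = -(-n // MVL)          # ceiling division
    return strips * per_strip

n = 100_000
print(scalar_instructions(n), vector_instructions(n))   # 600000 vs 4692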
Although several compiler optimizations (loop unrolling, for example) can be used to lower the overhead of typical loop control instructions in the ILP code, DLP instructions are inherently more expressive. Vector instructions allow a loop to do its task in fewer iterations. This implies fewer computations for address calculations and loop control, as well as fewer instructions dispatched to execute the loop body itself. As a direct consequence of executing fewer instructions, the required instruction fetch bandwidth, the pressure on the fetch engine and the negative impact of branches are all reduced in comparison to an ILP processor. Also, a relatively simple control unit is enough to dispatch a large number of operations in a single go, whereas an ILP processor devotes an ever-increasing part of its area to managing out-of-order execution and multiple issue. This simple control, in turn, can potentially yield a faster clocking of the whole datapath.

3.3. Operations executed
Although the comparison in terms of instructions is important from the point of view of the pressure on the fetch engine, a more accurate comparison between the ILP and DLP models comes from looking at the total number of operations performed. Figure 2 plots the total number of operations executed on each platform for each program. In the R10000 case, one instruction equals one operation. For the C34 case, the total number of operations is broken down into three parts: at the bottom, scalar operations (that is, operations performed in scalar mode using scalar instructions); the middle part of the bar (DLP_vops) are operations performed in vector mode; finally, the top part of the bar (DLP_vspill) corresponds to the fraction of memory operations that are known to be spill code. We separate the latter from the bulk of vector operations to highlight the effect of the very small vector register file (small in terms of the number of logical vector registers, not in terms of size).

As figure 2 shows, for all programs but one, the DLP version of the program executes many fewer operations than the ILP version. The ratio of ILP operations to DLP operations can be favourable to the DLP model by factors that go from 1.10 up to 1.70. Program mdljdp2 behaves very differently from the other programs analysed. First, its DLP version executes 62% more operations than its ILP version. Upon inspection of the most executed loops and basic blocks, we determined that most loops have many conditionals inside their body. These conditional statements turn into long sequences of mask creation instructions which are then vectorized. The fact that the conditionals are heavily nested generates quite an explosion in the amount of work necessary to produce the final masks. This is the first factor that explains the large number of extra operations. The second factor is related to the speculation associated with execution under mask. If the mask is mostly filled with "true" values, then most of the work performed under mask is actually useful. But if the mask is mostly filled with "false" values, then all the operations performed and finally discarded are speculative overhead. Although we have been unable to properly quantify this effect, our measurements indicate that 65% of all vector operations are performed under mask. The third factor involves the large fraction of scalar code found in this program. This code is found in the setup and shutdown basic blocks associated with the main loop we just described. The implication is that this scalar code will not disappear with better vectorization techniques; it is an intrinsic part of a vector loop.
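To illustrate the speculative overhead of execution under mask discussed above, here is a minimal sketch. The mask density is a hypothetical value chosen for illustration; the point is that the vector unit processes every element position, but only positions whose mask bit is set contribute useful results.

# Sketch of vector execution under mask.
def masked_vector_add(a, b, mask):
    performed = 0
    result = list(a)
    for i, bit in enumerate(mask):
        performed += 1                  # the lane is occupied regardless
        if bit:
            result[i] = a[i] + b[i]     # committed only where the mask is set
    return result, performed, sum(mask)

a = list(range(128))
b = list(range(128))
mask = [i % 4 == 0 for i in range(128)]      # assume only 25% of lanes useful
_, performed, useful = masked_vector_add(a, b, mask)
print(f"operations performed: {performed}, useful results: {useful}")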
Figure 2: ILP - DLP operations

3.4. Memory Traffic in the ILP and DLP models
In this section we analyze the memory traffic generated under each ISA. By memory traffic we mean the number of words transferred between the register files and the next level of the memory hierarchy using load and store instructions. That is, we do not look into the traffic generated within the superscalar cache memory hierarchy; we are only interested in the loads and stores as generated by the compiler/programmer. Intuitively, and if we ignore spill code, one would expect both paradigms to execute a similar number of memory transactions for a given program. After all, both programs have to move the same pieces of data in and out (again, ignoring spill code and prefetching). As we will see, this is not the case.

Figure 3 compares the total memory traffic generated by the DLP and the ILP machines. For each program, the first bar plots the total DLP traffic, showing in different colours the normal program traffic and the spill code traffic. The second bar plots the traffic from the register file of the ILP machine to the L1 cache. Three different behaviours can be seen in figure 3. In two programs, swm256 and hydro2d, the data movement in the DLP case is larger than in the ILP case regardless of spill code. Two other programs, tomcatv and mdljdp2, have more traffic in the DLP model but mainly due to spill code (this is especially relevant in the mdljdp2 case, with 65% of all memory references being spill code). Finally, nasa7, su2cor and wave have significantly lower traffic in the DLP model, especially if we remove the spill code.

We investigated both swm256 and hydro2d to understand why they have intrinsically higher memory traffic. Looking at their most executed loops, we found that they show a pattern similar to the one in figure 4. All of them have array variables that are loaded in iteration "i" and can be reused in iteration "i+1" (as is the case of variable B in the example). When this situation arises, the ILP machine takes advantage of it by keeping the value of the variable in a register from iteration to iteration. By contrast, the vectorizing compiler of the C34 machine is forced to perform redundant loads: in each iteration it must load 128 elements starting at position "i" of array B and also load 128 elements starting at position "i+1" of array B. Clearly, the two loads differ in only 2 elements, with the remaining 126 data elements being loaded twice. This explains the extra traffic.
Figure 3: ILP - DLP Memory Traffic

In this section we have analysed the different behaviours of the ILP and DLP ISAs in instructions executed, operations executed and memory traffic. The DLP model executes fewer instructions and operations than the ILP model, which imposes lower pressure on the fetch and decode unit. This is a characteristic that can be exploited in a mixed ILP+DLP model to keep complexity low. Moreover, the ILP+DLP model can also take advantage of the lower memory traffic of a DLP ISA.
      DO J=1,N
        DO I=1,N
          A(..,I,..) = B(..,I+1,..) + B(..,I,..)
        ENDDO
      ENDDO
Figure 4: Typical loop in swm256 and hydro2d benchmark programs.
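A back-of-the-envelope sketch of the redundant traffic on array B (counting only B's load traffic, and assuming the 128-element maximum vector length described above) shows how the strip-mined vector code roughly doubles the number of words fetched compared with a scalar loop that keeps B(i+1) in a register across iterations:

# Redundant-load sketch for the loop of figure 4 (B's load traffic only).
VL = 128

def dlp_words_loaded(n):
    # per strip: one vector load of B[i..i+127] and one of B[i+1..i+128];
    # 126 of the 128 elements are fetched twice
    strips = n // VL
    return strips * 2 * VL

def ilp_words_loaded(n):
    # a scalar loop loads each element of B once and reuses it from a register
    return n + 1

n = 1024
print(dlp_words_loaded(n), ilp_words_loaded(n))   # 2048 vs 1025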
4. The ILP model
We now turn to the question of how the ILP model behaves in terms of performance under different assumptions. This section takes as its performance indicator the number of instructions executed per cycle (IPC) for each benchmark.

We have taken an approximate model of an R10000 processor as an example of a machine that exploits ILP (see figure 5). We have obtained the measurements for the ILP machine from execution-driven simulations using the SimpleScalar toolset. In particular, we have used the out-of-order simulator, which supports out-of-order issue and execution based on the Register Update Unit. This scheme uses a reorder buffer to automatically rename registers and hold the results of pending instructions. Each cycle the reorder buffer retires completed instructions in program order to the architected register file. We have set up the simulator to support the same functional units as the R10000 (both in number and type), as well as their latencies (shown in table 3). The processor memory system consists of a load/store queue. Loads are dispatched to the memory system when the addresses of all previous stores are known. Loads may be satisfied either by the memory system or by a previous store still in the queue if they have the same address. The memory model consists of an L1 data cache and an L1 instruction cache. Both are non-blocking and have been configured with the size and replacement policy of the R10000 L1 caches (32 KB). The main memory latency has been set to 40 cycles. The simulator performs speculative execution. It supports dynamic branch prediction with a branch target buffer with 2-bit saturating counters. The branch misprediction penalty is three cycles.
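The retirement discipline described above (completion may happen out of order, but commitment to the architected register file is strictly in program order) can be sketched as follows. This is an illustrative data structure only, not the SimpleScalar implementation.

from collections import deque

class ReorderBuffer:
    """In-order dispatch and retirement; out-of-order completion."""
    def __init__(self, size):
        self.size = size
        self.entries = deque()              # [instruction, completed?] in program order

    def dispatch(self, instr):
        if len(self.entries) >= self.size:
            return False                    # structural stall: ROB full
        self.entries.append([instr, False])
        return True

    def complete(self, instr):
        for entry in self.entries:
            if entry[0] == instr:
                entry[1] = True

    def retire(self, width):
        committed = []
        while self.entries and self.entries[0][1] and len(committed) < width:
            committed.append(self.entries.popleft()[0])
        return committed

rob = ReorderBuffer(32)
rob.dispatch("i1"); rob.dispatch("i2")
rob.complete("i2")                          # i2 finishes first...
print(rob.retire(width=4))                  # ...but nothing retires until i1 completes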
Figure 5: The ILP model (block diagram: Fetch, Decode&Rename, INT and FP register files, L/S unit, reorder buffer and L1 cache)
Latency    add   mul   logic/shift   div   sqrt
int         1     5        1          34    34
fp          2     2        2           9     9
Table 3: Functional Unit Latencies
4.1. ILP configurations
Measurements of the actual performance of applications running on machines exploiting the ILP paradigm show that the achieved IPC falls very short of the theoretical peak performance of the machine [1][2][3]. This lack of performance is due to many different effects that can be broadly classified in the following three categories:

Control limitations: In ILP processors, there is a restricted instruction window from which instructions are selected to be sent to the functional units for execution. Enough independent instructions will not always be found in it, and whenever this situation arises a performance loss appears. Another source of potential problems are branches: when the control unit mispredicts a branch outcome, all the resources allocated to mis-speculated instructions are wasted, which can severely limit the obtained IPC.

Memory access limitations: Due to the increasing gap between main memory and CPU speed, current ILP processors need increasingly large caches to keep up performance. In spite of these caches, there is a loss of performance each time a miss occurs, because the latency of the higher-level caches or main memory must be paid to bring the data in.

Resource limitations: Typically the number and type of functional units in an ILP processor are chosen by balancing the different instruction mixes found in programs. Therefore, when several instructions of the same type appear together in a burst, they may have to wait for an available functional unit of the right type in order to be executed. This extra time waiting for a hardware resource also incurs a performance decrease.

Our goal is to highlight the importance of these three aspects in the performance of current and future ILP processors. To do so, we start by defining, as a baseline, a processor configuration that closely matches that of an R10000 or Alpha 21264. In order to approximate an upper bound for future ILP processors, we will study machines that do not have some or all of the three limitations present in our baseline model. We define several configurations in which the limitations are gradually removed by assuming ideal/infinite hardware, until we reach an "Unbound" machine where performance is only limited by the intrinsic program parallelism and the issue width.

Table 4 presents the different configurations studied in this paper. A configuration is represented by IX_YC_YM, where I stands for the ILP model, X indicates the maximum number of instructions that can be fetched and issued per cycle, and Y indicates the type of control and memory system. In our convention, RC stands for Real Control, PC for Perfect Control, RM for Real Memory and PM for Perfect Memory. The Unbound configuration eliminates all resource limitations that could appear in the previous configurations. It has as many general-purpose functional units as instructions can be issued per cycle. Moreover, all the instruction repeat rates have been set to 1 cycle. It also has a perfect control unit (PC) and a perfect memory system (PM).

In table 4, column two indicates both the fetch and issue width of each configuration. The next column briefly describes the control unit: its two main parameters are the branch prediction configuration (either a BTB having 512 or 2048 entries with two-bit saturating counters, or an idealized perfect branch predictor) and the reorder buffer size. Columns four, five and six indicate the number of integer, floating-point and memory units, respectively. Column seven presents the latency of a miss in the L1 cache (remember that we do not model an L2 cache and, hence, a miss in L1 goes directly to main memory). Finally, columns eight and nine summarize the total computing power of each configuration: RPC stands for "results per cycle" and indicates the total number of functional units, whether integer or floating point, that can generate a result on a given cycle; MPC stands for "memory transactions per cycle" and indicates the total number of memory operations, whether loads or stores, that can be started on a given cycle.
Config        Fetch/Issue   Control Unit                     INT   FP   LD/ST   L1-to-mem Lat   RPC   MPC
I4_RC_RM           4        BTB: 512, RB: 32                  2     2     1          40           4     1
I4_RC_PM           4        BTB: 512, RB: 32                  2     2     1           1           4     1
I4_PC_RM           4        Perfect Branch Pred., RB: 512     2     2     1          40           4     1
I4_PC_PM           4        Perfect Branch Pred., RB: 512     2     2     1           1           4     1
I4_Unbound         4        Perfect Branch Pred., RB: 512     4     4     4           1           8     4
Table 4: ILP Configurations
4.2. Performance analysis of the ILP model
Using the SimpleScalar toolset [6] we simulated the execution of our seven benchmarks on each of the configurations defined in the previous section. Each simulation consisted of the execution of a billion instructions starting at the beginning of each program. The results, in terms of IPC, can be seen in figure 6. We show five bars for each program.

The first bar shows the IPC for a machine that is equivalent to today's superscalar architectures (I4_RC_RM), and is in the range 1.03--1.52. We note that, even in the best case, we hardly reach 38% of the theoretical peak performance. The second bar (I4_RC_PM) shows the IPC when we use a perfect memory system, where we assume that all data is in the L1 cache (no misses). In this case, there is quite a good speed-up for most programs (by factors as high as 1.54) and the IPC achieved now lies in the 1.26--2.35 range. Despite the improvement, we are still only at 59% utilization of the machine. The third bar, instead of improving the memory system, retains the real memory system used in the baseline but uses a perfect control unit (I4_PC_RM). Interestingly enough, the performance with a perfect control unit is better than the performance with a perfect memory system. In this case, IPC reaches the 1.73--2.34 range, which represents a speed-up over the baseline of typically around 1.65. This shows that the major bottleneck in superscalar architectures is the control unit, and that, with enough instruction-level parallelism, the negative effects of cache misses can be successfully hidden. When we join the two ideal systems (fourth bar, I4_PC_PM) we see only a very minor improvement in some programs. This is rather surprising, and reinforces the idea that the single most important thing in a superscalar processor is to properly feed the functional units by always having ready instructions in the reorder buffer. The only remaining limitation in the fourth bar is the number and type of functional units. In bar I4_Unbound this limitation is finally removed, and the performance achieved is only limited by the true program dependencies and by the issue width (4). We observe a large increase in performance, and IPC reaches the range 2.64--3.75, which represents between 66% and 94% of the peak performance. Note that the difference between actual and peak performance is dictated by the intrinsic characteristics of each program, not by our choice of parameters.
Figure 6: Performance Limitations in ILP

The overall conclusions that can be extracted from these data are as follows. First, control is the single most important piece of hardware in a superscalar machine (more so than the memory system) [10]. Second, the performance of a real superscalar machine (equivalent to I4_RC_RM) is very low and hardly reaches 38% of the peak performance. Third, although we have shown IPCs close to 3.75, these have been attained under ideal (i.e., impossible) assumptions. Therefore, the performance expected from microarchitecture techniques will necessarily lie between the real bar (I4_RC_RM) and the unbounded bar (I4_Unbound). Fourth, even under ideal conditions, the average performance is clearly not satisfactory.

4.3. Scaling up the ILP model
The analysis presented in the previous section clearly shows that there are only two alternatives to improve the IPC rate for our set of benchmarks: either improve the code (statically or dynamically) so that it has fewer dependencies in general and, therefore, exhibits more instruction-level parallelism or, alternatively, raise the issue width above 4. We will focus on raising the issue width to show that it is not the path to follow. As we will see, although performance does improve with larger issue widths, the cost associated with this improvement is too large to be justified.

We start by defining five new configurations that mimic those presented in table 4 but with all the necessary hardware scaling to accommodate an issue width of 16 instructions per cycle. Table 5 presents the configurations, which use the same naming convention as above. Configuration I16_RC_RM corresponds to a hypothetical 16-wide superscalar that might be built in the near future. We note that most hardware resources have been incremented by a factor of 4. For the perfect control (PC) configurations the reorder buffer is 4 times larger than in the other configurations; although this is not as good as a truly infinite reorder buffer, we believe it behaves very closely to one. Note also that the memory parameters have not been changed, although one might expect that, by the time a 16-wide superscalar can be built, the distance in terms of latency between the L1 on-chip cache and external memory would have increased. Nevertheless, we maintain the same memory latency to ease the comparison between the 4-wide and 16-wide machines.
Config         Fetch/Issue   Control Unit                     INT   FP   LD/ST   L1-to-mem Lat   RPC   MPC
I16_RC_RM          16        BTB: 2048, RB: 128                8     8     4          40          16     4
I16_RC_PM          16        BTB: 2048, RB: 128                8     8     4           1          16     4
I16_PC_RM          16        Perfect Branch Pred., RB: 512     8     8     4          40          16     4
I16_PC_PM          16        Perfect Branch Pred., RB: 512     8     8     4           1          16     4
I16_Unbound        16        Perfect Branch Pred., RB: 512    16    16    16           1          32    16
Table 5: Scaled ILP Configurations
We would like to point out the large hardware cost associated with the control structures required by these 16-wide configurations. Olukotun et al. [2] have looked at the die area requirements of scaling an R10000 from 4-wide issue up to 6-wide issue. According to their calculations, the fetch unit, the decode unit, the instruction queues and the reorder buffer take 16% of the total die area in the 4-wide R10000. When scaling to 6-wide issue, these same structures grow by a factor of almost 3.5 and take 35% of all die space. Our 16-wide configurations are substantially larger than the 6-wide machine analysed in [2], so we expect the control portion of our machines to take between 50% and 75% of all die space.

Figure 7 presents the results of simulating the configurations introduced in table 5 for each of our benchmarks (a billion instructions simulated per benchmark). The first bar shows the obtained IPC for a Real Control - Real Memory configuration. In comparison with the same bar from figure 6, we can see that scaling up the model by four has increased the IPC by approximately the same factor. In general, this trend holds for the other bars, where perfect memory, perfect control or both are included in the architecture. However, the unbounded model reaches a much lower efficiency. For example, in I16_RC_RM performance is around 16%--26% of the peak issue width, and for the Unbound configuration this percentage only improves to 25%--64%. The reason is that we have already exploited all the ILP available in the program codes.

Even though, from a performance point of view, the increase in IPC correlates well with the growth factor of 4 in the hardware parameters, from a cost point of view the required hardware grows quadratically. That is, in order to sustain an average IPC around 1, a 4-wide machine is required, while in order to sustain an average IPC around 3-4 a 16-wide machine is needed. This questions the increase of issue width as a long-term solution for reaching the theoretical peak performance, taking into account that an increase in issue width implies a quadratic increase in the control hardware, which becomes the most area-consuming part of the processor. It is necessary to use other techniques that increase IPC without increasing the cost or, if possible, the complexity.
Figure 7: Performance behaviour of a scaled-up ILP model

This paper proposes exploiting data-level parallelism, as well as instruction-level parallelism, as a way to increase IPC with a very modest increase in the cost/complexity of the control unit. Our proposal adds a simple DLP unit to the 4-way ILP processor, mostly enlarging its datapath with vector registers and vector functional units, but with minimal impact on the reorder buffer or instruction queues.
5. The ILP+DLP proposal
The basic idea of the ILP+DLP model is to add a vector pipeline, with its corresponding registers and functional units, to a current 4-way superscalar core. The goal of the vector pipeline is to execute the data-parallel portions of the code in a highly parallel way without increasing the amount of control hardware required to drive the vector units. As designers seek higher levels of performance, they can replicate the vector functional units without increasing control complexity, so that the vector pipeline can process several data items per cycle. The advantage of increasing hardware parallelism in this style is that clock cycle time is not jeopardized. As mentioned in [2][11][12], increasing the width of current superscalars will certainly improve the IPC rate, albeit not as much as would be desirable (as we have seen in section 4.3). Moreover, these researchers claim that the impact on cycle time of the longer wire delays required for wide-issue machines will offset the benefits of the increased IPC rate. By contrast, our proposed ILP+DLP model has a relatively simple out-of-order engine (4-wide or, at most, 6-wide) plus a vector pipeline that hardly affects the complexity of the instruction queue wakeup&select logic.

What happens, then, to programs that do not have data-parallel sections? Typically, programs that do not vectorize are programs that have pointer-chasing constructs or loops with dependencies/recurrences. These constructs are also difficult to execute efficiently on an ILP machine, because they tend to show an inherently sequential behaviour. We think that these kinds of programs, which have a low intrinsic level of instruction parallelism, can benefit more from faster clocks than from wide-issue machines. In this section we present a preliminary study of the advantages of the ILP+DLP model for highly vectorizable code. In future work, we plan to study the combined effect of wide-issue engines and cycle time impact on programs with low vectorization ratios.

5.1. The EIPC measure
Comparing the ILP+DLP and ILP models in terms of instructions executed per cycle presents a clear problem. As we have seen in section 3.2, a DLP machine executes many fewer "instructions" than the ILP machine, and, in most cases, it even executes fewer "operations". Therefore, using the IPC measure for the ILP+DLP model would be meaningless. Our goal is to introduce a measure that indicates how well an ILP machine should perform in order to match the performance of an ILP+DLP machine. One way of factoring in both the difference in the number of operations/instructions and the average number of cycles required to execute one operation is to define the following indicator of performance:
    EIPC = (total MIPS R10000 instructions) / (total ILP+DLP cycles)

EIPC stands for "Equivalent IPC", where IPC indicates the number of instructions executed per cycle in the ILP machine. The intuitive sense of the EIPC measure is simple: an EIPC of 10 indicates that an ILP machine would have to sustain 10 instructions executed per cycle to match the performance of the ILP+DLP machine.

5.2. ILP+DLP microarchitecture
The microarchitecture exploiting both ILP and DLP is derived from a combination of a loose model of an R10000 and a simplified version of the Convex C3400. The idea is to retain the out-of-order features of the R10000 and add a vector pipeline that shares the path to main memory with the scalar and integer pipelines. The microarchitecture can be seen in figure 8.
Figure 8: The ILP+DLP model

Instructions flow in order through the Fetch and Decode/Rename stages and then go to one of the four queues present in the architecture, based on instruction type. At the rename stage, a mapping table translates each virtual register into a physical register. There are 4 independent mapping tables, one for each type of register: integer, floating point, vector and mask. Each mapping table has its own associated list of free registers. When instructions are accepted into the decode stage, a slot in the reorder buffer is also allocated. Instructions enter and exit the reorder buffer in strict program order. When an instruction defines a new logical register, a physical register is taken from the free list, the mapping table entry for the logical register is updated with the new physical register number, and the old mapping is stored in the reorder buffer slot allocated to the instruction. When the instruction commits, the old physical register is returned to its free list. Note that the reorder buffer only holds a few bits to identify instructions and register names; it never holds register values. This is especially important since it is not conceivable to hold full vectors inside the reorder buffer.

All instruction queues have 16 slots. The reorder buffer can hold 32 instructions. The machine has a 64-entry BTB, where each entry has a 2-bit saturating counter for predicting the outcome of branches. Also, an 8-deep return stack is used to predict call/return sequences. Both scalar register files (integer and floating point) have 64 physical registers each. The mask register file has 8 physical registers. The vector register read/write ports have been modified from the original scheme: in the ILP+DLP microarchitecture each vector register has 1 dedicated read port and 1 dedicated write port. The original banking scheme of the register file in the C34 cannot be kept, since renaming shuffles all the carefully scheduled read/write ports and, therefore, would induce many port conflicts. See [13] for a more detailed description of the ILP+DLP machine.

5.3. ILP+DLP configurations
Table 6 presents the different configurations selected to study the performance of the ILP+DLP model. Two observations must be made: first, the reorder buffer of our architecture remains the same as the one defined in section 4.1 (see table 4) even though we have added a new vector pipeline. Second, the amount of control hardware is kept constant across all ILP+DLP experiments, to highlight the fact that our proposed machine does not require enlarged reorder buffers.
Config    Fetch/Issue   Control Unit     Scalar INT   Scalar FP   Vector INT+FP   LD/ST   L2-to-Mem Lat   RPC   MPC
ID_RM          4        BTB: 64, RB: 32       1           1             2           1          50          4     1
ID_PM          4        BTB: 64, RB: 32       1           1             2           1           1          4     1
Table 6: ILP+DLP Configurations
In table 6 each configuration is named "ID_YM", where ID stands for the ILP+DLP model and Y is a letter (R or P) that indicates a Real or Perfect memory system. We note that in both cases we do not model conflicts in the memory crossbars, memory sections or memory banks, and that the memory latency differs: 1 cycle for the ideal case and 50 cycles for the real case. We also point out that the BTB used in the ILP+DLP model is 8 times smaller than the smallest one of the ILP configurations. It is important to note that, in all cases, the ILP+DLP machine is limited to fetching and decoding 4 instructions per cycle. The total computing resources of the ILP+DLP configurations presented here are four results per cycle: since four instructions are issued each cycle, we expect to have the two vector units and the two scalar units busy. The advantage here is the out-of-order execution of vector and scalar instructions. See [13] for further comparisons of in-order and out-of-order execution in the context of vector architectures.

5.4. Performance analysis of the ILP+DLP model
Figure 9 presents the IPC/EIPC comparison of the ILP and ILP+DLP models, all of which have a peak performance of 4 RPC and can transfer at most 1 memory word per cycle. The first thing to note is that, in all but one case, the performance of ILP+DLP is higher than that of ILP, by factors that go from 1.25 up to 2.84. While the IPC of the superscalar machine hardly exceeds 1.4 in any case, the ILP+DLP machine is, for most programs, well over 2.3. When comparing the bars with real and perfect memory, we see that, while the ILP machine is very sensitive to main memory latency (when the latency increases from 1 to 40 cycles, IPC drops by factors between 1.01 and 1.58, except in wave), the ILP+DLP machine experiences almost no difference between a 1-cycle and a 50-cycle main memory latency. It is interesting to see that the only program where the ILP model outperforms the ILP+DLP model is mdljdp2, the program whose DLP version executes many more operations than its ILP version (see the discussion in section 3.3 and figure 2).

Note that the ILP+DLP machine is very close to its peak performance. Although the nominal peak performance is 4, if we look back at table 2 we can see that, for the majority of the time, at most three operations can be running concurrently: two vector functional units and the memory port. Even though the scalar units could work in parallel with the other three units, our analysis shows that (a) scalar and vector sections tend to be disjoint and (b) the fraction of scalar code is too small to make a significant difference. Thus, the actual peak is around 3 instructions per cycle. Five programs reach more than 80% of this peak.
Figure 9: Performance of 4-RPC ILP and ILP+DLP models

The overall conclusion is that the DLP model allows a typical superscalar machine to exploit the available parallelism in a program much better, providing an EIPC that is much closer to the theoretical peak.

5.5. Scaling up the ILP+DLP model
As we did for the pure ILP model, we are interested in seeing whether the EIPC rate can be improved by adding more hardware resources. Our goal is to exploit more datapath resources (that is, functional units and memory ports) without increasing control complexity. To this end, table 7 introduces two new configurations for the ILP+DLP model in which we have increased all computational resources by 4 while keeping the control unit exactly the same. The naming convention for the two new configurations is the same as in table 6. Note that the scalar units have been doubled with respect to the smaller ILP+DLP configurations, while the memory ports and vector units have been quadrupled. The reason for this non-uniform scaling is two-fold: first, we want to force the ILP+DLP machine to have a lower RPC value than its ILP counterpart, to make sure that we are not over-powering the ILP+DLP model and, second, we want to keep the control complexity almost identical to that of a 4-way R10000-like processor.

The notation '2x4' in the vector units indicates that we have 2 independent functional units that are 4-way deep. This means that, on each of these functional units, 4 independent operations from the same pair of vectors are launched every cycle. This is much simpler than actually having four independent functional units: when a single vector add, for example, is initiated, it will proceed at four results per cycle until all elements have been processed. In the case of the memory port, the notation '1x4' indicates that, for stride-1 accesses, data is brought from the memory system in blocks of four words. For stride-2 accesses, data arrives in blocks of 2 words, and for all other strides (and for scalar references) data arrives one word per cycle. The implementation is such that we save many address pins over a configuration where 4 different ports would be available: for stride-1 accesses, our system sends only every fourth address to the memory system, knowing that, in return, four words will be sent.
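A sketch of the address traffic implied by this scheme follows. The block sizes per stride come from the description above; the addressing details themselves are assumptions for illustration.

# Addresses sent over the memory port for one vector memory instruction.
def addresses_sent(base, stride, vector_length):
    if stride == 1:
        # one address per 4-word block; four words return per address
        return [base + i for i in range(0, vector_length, 4)]
    if stride == 2:
        # one address per 2-word block
        return [base + i * 2 for i in range(0, vector_length, 2)]
    # any other stride (or a scalar reference): one address per word
    return [base + i * stride for i in range(vector_length)]

print(len(addresses_sent(0x1000, 1, 128)))   # 32 addresses for 128 words
print(len(addresses_sent(0x1000, 2, 128)))   # 64 addresses
print(len(addresses_sent(0x1000, 3, 128)))   # 128 addresses, one per word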
Config     Fetch/Issue   Control Unit     Scalar INT   Scalar FP   Vector INT+FP   LD/ST   L2-to-Mem Lat   RPC   MPC
ID+_RM          4        BTB: 64, RB: 64       2           2            2x4         1x4         50         12     4
ID+_PM          4        BTB: 64, RB: 64       2           2            2x4         1x4          1         12     4
Table 7: Scaled ILP+DLP Configurations
Note that there is a big difference between the effort required to add these extra resources to the ILP+DLP architecture and the effort required for the 16-way ILP machine studied in section 4.3. In the ILP+DLP machine the extra functional units are added by partitioning the vector datapath and register file into 4 sections. Each section is completely independent of all others and, yet, they work synchronously under the control of the same instruction. Each section contains 1/4 of the total register file and 2 functional units. Thus, from the point of view of control, the extra resources do not require any special attention (a brief sketch of this lane-style organization is given below). On the other hand, in the ILP machine, increasing the number of functional units has forced us to add a 16-wide fetch engine and to implement a 128-entry reorder buffer. Moreover, the number of ports into the register files has grown enormously, either jeopardizing the cycle time or introducing the need for duplicated register files. Finally, the L1 cache in the ILP machine has to be 4-ported, while the ILP+DLP machine retains its simple scheme where only 1 address is sent over the memory port per cycle.

Figure 10 presents the IPC values for the two new configurations and compares them to the performance of the 16-way ILP machine introduced in section 4.3. As we saw above, if we compare "real" configurations, the ILP+DLP machine outperforms the ILP machine in most cases. In four programs, swm256, hydro2d, su2cor and tomcatv, the speed-up of the ILP+DLP machine over the ILP machine is in the range 2.0--3.45. For programs nasa7 and wave, speed-ups are more moderate but still significant: 1.2 and 1.13, respectively. Looking at the bars with real memory, we see that the ILP machine is typically below an IPC of 4, while the ILP+DLP machine exceeds an EIPC of 6 in four cases. Program mdljdp2 still behaves very differently due to the high number of extra speculative operations it executes. If a perfect memory system is considered for the superscalar machine, performance increases significantly. In one case, nasa7, the ILP machine with perfect memory outperforms the ILP+DLP machine, but only by a very small margin (less than 6%). In two cases, swm256 and tomcatv, the IPC of the ILP machine reaches 6.
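As referenced above, here is a minimal sketch of the '2x4' lane organization: the vector register file and the functional units are split into 4 lanes that operate in lockstep under a single instruction, each lane handling every fourth element. Python stands in for the hardware; the element-to-lane assignment and the timing comment are assumptions for illustration.

LANES = 4

def lane_partition(vector, lanes=LANES):
    # lane k holds elements k, k+4, k+8, ... of the architectural register
    return [vector[lane::lanes] for lane in range(lanes)]

def vector_add_4way(va, vb):
    # each lane produces one result per cycle, so a 128-element add
    # completes in roughly 128 / 4 = 32 cycles on one functional unit
    a_lanes, b_lanes = lane_partition(va), lane_partition(vb)
    out = [0] * len(va)
    for lane in range(LANES):
        for k, (x, y) in enumerate(zip(a_lanes[lane], b_lanes[lane])):
            out[lane + k * LANES] = x + y
    return out

print(vector_add_4way(list(range(8)), list(range(8))))   # [0, 2, 4, 6, 8, 10, 12, 14]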
Figure 10: Performance of 16-RPC ILP and ILP+DLP models
6. An analytical model for ILP+DLP performance
With all the data presented so far, one could believe that all the performance advantage of the ILP+DLP model over the pure ILP model comes only from the differences in semantic content of their respective ISAs. Indeed, by comparing figures 2 and 9, one can see that, whenever the number of operations is lower for the DLP model, the ILP+DLP architecture outperforms the pure ILP machine. And, conversely, in the only case where the number of operations favours the ILP model (mdljdp2), the ILP architecture outperforms the ILP+DLP model. However, an analytical study of the exact speedup values reveals that the semantic ISA gap cannot explain all the performance improvement. The intrinsic efficiency of each model (ILP and DLP) in executing its own ISA also has a strong impact on the final performance difference. To show this, let us unfold the speedup equation. Let T_I and T_ID be the execution times of the ILP and ILP+DLP machines, respectively. Then, the relative speedup of the ILP+DLP machine over the ILP machine can be expressed in terms of operations as:

    S = T_I / T_ID = (N_I × CPO_I × t_I) / (N_ID × CPO_ID × t_ID)        (1)

In this equation N represents the number of operations in each model (note that in the ILP model, operations and instructions are equivalent). The term CPO is the average number of cycles per operation executed (which, in the ILP model, is exactly equivalent to CPI, cycles per instruction). Finally, t represents the cycle time of each microarchitecture.

Unfolding the CPO terms (CPO = 1/OPC, where OPC is the number of operations executed per cycle), we obtain:

    S = (N_I / N_ID) × ((1/OPC_I) / (1/OPC_ID)) × (t_I / t_ID) = (N_I / N_ID) × (OPC_ID / OPC_I) × (t_I / t_ID)        (2)

As is well known, the speedup is related to the three factors that appear in equation (2). The first factor is the ratio of operations executed by each model, and expresses the relative effectiveness of each ISA at expressing a given program. The second factor describes the efficiency of each model at executing its own instruction set, and indirectly shows how well each microarchitecture can exploit the available parallelism. The third factor involves the relative cycle times of the two machines. Since this study has focused on a comparison of the first two factors, we will assume that both machines have the same cycle time (and, therefore, we eliminate it from the equation).

Table 8 presents the data needed to compute all the terms in equation (2). Columns two and three are the number of operations (instructions) executed by each model, in millions. The following two columns are the average number of operations per cycle executed by each microarchitecture (the ILP and the ILP+DLP, respectively). Also presented in table 8 are the two ratios that affect the speedup value and, finally, the total speedup of the ILP+DLP machine over the ILP machine.
Program      N_I      N_ID     OPC_I   OPC_ID   N_I/N_ID   OPC_ID/OPC_I     S
swm256      11466    9540.2     1.21     2.10     1.201        1.735      2.084
hydro2d      5759    4015.3     1.03     2.04     1.434        1.980      2.839
nasa7        6897    4064.3     1.52     1.66     1.696        1.092      1.852
su2cor       5135    3509.4     1.27     1.82     1.463        1.433      2.096
tomcatv      1453    1042.6     1.25     1.76     1.393        1.408      1.961
wave         2737    2483.8     1.24     1.41     1.102        1.137      1.253
mdljdp2      3210    5227.2     1.23     1.51     0.614        1.227      0.753
Table 8: Factors that affect the speedup equation (ignoring cycle time). The first two columns are in millions of operations.

As can be seen from table 8, the ratio between the number of operations executed under each ISA favours the ILP+DLP machine in all but one program (mdljdp2). If we ignore the microarchitecture effects, the ISA alone would provide speedups in the range 1.1 to 1.7. Interestingly enough, the ILP+DLP microarchitecture is also more efficient at executing its ISA than the ILP microarchitecture, as can be deduced from column seven: on each cycle, the ILP+DLP machine executes between 1.1 and almost 2 times as many operations as the ILP machine. Even in the mdljdp2 case, the ILP+DLP microarchitecture executes 23% more operations per cycle than the ILP machine.
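The decomposition of equation (2) is easy to verify numerically from the rows of Table 8. A minimal sketch, using the swm256 row and assuming equal cycle times (t_I / t_ID = 1):

def speedup_factors(n_i, n_id, opc_i, opc_id):
    isa_factor = n_i / n_id           # fewer operations under the DLP ISA
    uarch_factor = opc_id / opc_i     # more operations retired per cycle
    return isa_factor, uarch_factor, isa_factor * uarch_factor

# swm256 row of Table 8: N_I = 11466, N_ID = 9540.2, OPC_I = 1.21, OPC_ID = 2.10
print(speedup_factors(11466, 9540.2, 1.21, 2.10))   # ~(1.20, 1.74, 2.08)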
7. Summary
The limitations in the attainable parallelism of current state-of-the-art superscalar processors have promoted a renewed interest in alternatives to extract more parallelism from sequentially specified programs, and also in techniques that make more efficient use of the available hardware resources in superscalar engines. In this paper we have explored an alternative source of parallelism: data-level parallelism. We have proposed merging the ILP and DLP concepts to harness the advantages of the DLP model in order to improve the performance of current ILP processors. This initial study has focused on highly vectorizable floating-point codes, which are the codes that will benefit most from merging instruction- and data-level parallelism.

We have studied the behaviour of the ILP and DLP models at the ISA level, looking at the total number of instructions executed, the number of operations executed and the memory traffic. The DLP model executes many fewer instructions than the ILP machine due to the higher semantic content of its instructions, which translates into lower pressure on the fetch engine and the branch unit. Moreover, the DLP model executes fewer operations than the ILP machine in all but one case (by factors in the range 1.10-1.70). The analysis of memory traffic has revealed that, in general, and ignoring spill code effects, the DLP machine performs fewer data movements than the ILP machine; in the few cases where this is not true, it is due to intrinsic characteristics of the ISA we have used.

We have also analyzed the ILP model from a performance point of view. Our experiments show that the IPC obtained by a typical four-way ILP machine is in the range 1.03-1.52, very far from the theoretical peak. Our results also show that control is the most important part of an ILP machine, and that even with a perfect memory and perfect control model we only reach IPCs in the range 2.64-3.75, which still represents only 66%-94% of the theoretical peak. When the ILP model is scaled up by a factor of four, IPC also increases by roughly four; however, the implementation cost grows quadratically, which questions increasing the issue width as a long-term solution.

In this paper we have proposed exploiting data-level parallelism, as well as instruction-level parallelism, as a way to increase IPC with a very modest increase in the cost and complexity of the control unit. Simulations of the ILP+DLP model have shown that its performance exceeds that of the ILP machine by factors that go from 1.24 to 2.84. Moreover, the performance of the ILP+DLP machine is very close to its theoretical peak. As opposed to the ILP model, scaling up the ILP+DLP model does not jeopardize cycle time; simulations of this scaled-up ILP+DLP machine have shown speed-ups over the scaled ILP machine that reach 3.45. The overall conclusion is that the DLP model allows a typical ILP machine to exploit the available parallelism in a program much better, providing IPC values that are much closer to the theoretical peak. Finally, we have analytically split the overall ILP+DLP speed-up into two factors, ISA efficiency and microarchitecture efficiency; the ILP+DLP machine outperforms the ILP model in both.

Our future work is focused on the study of the ILP+DLP model when executing medium- and non-vectorizable codes. We are also repeating this study for a modern vector ISA, which will allow us to eliminate negative effects such as spill code and redundant memory accesses. Finally, we are studying the available parallelism in vector and scalar sections of code to understand their relative contribution to the average exploited parallelism.

References
1. Mikko H. Lipasti, Christopher B. Wilkerson and John Paul Shen. "Value Locality and Load Value Prediction". ASPLOS VII, October 1996, pp. 138-147.
2. K. Olukotun, B. Nayfeh, L. Hammond, K. Wilson and K. Chang. "The Case for a Single-Chip Multiprocessor". ASPLOS VII, October 1996, pp. 2-11.
3. Norman P. Jouppi and David W. Wall. "Available Instruction-Level Parallelism for Superscalar and Superpipelined Machines". ASPLOS, 1989, pp. 272-282.
4. M. Peiron, M. Valero, E. Ayguade and T. Lang. "Vector Multiprocessors with Arbitrated Memory Access". In 22nd Annual International Symposium on Computer Architecture, Santa Margherita Ligure, Italy, June 22-24, 1995, pp. 243-252.
5. K. Yeager et al. "The MIPS R10000 Superscalar Microprocessor". IEEE Micro, vol. 16, no. 2, April 1996, pp. 28-40.
6. Doug Burger, Todd M. Austin and Steve Bennett. "Evaluating Future Microprocessors: The SimpleScalar Tool Set". Technical Report CS-TR-1308, Computer Sciences Department, University of Wisconsin-Madison, July 1996.
7. Roger Espasa and Xavier Martorell. "Dixie: a Trace Generation System for the C3480". Technical Report CEPBA-RR-94-08, Universitat Politecnica de Catalunya, 1994.
8. John Wawrzynek, Krste Asanovic, Brian Kingsbury, David Johnson, James Beck and Nelson Morgan. "Spert-II: A Vector Microprocessor System". IEEE Computer, March 1996, pp. 79-86.
9. Charles Price. MIPS IV Instruction Set, revision 3.1. MIPS Technologies, Inc., Mountain View, California, January 1995.
10. Norman P. Jouppi and Parthasarathy Ranganathan. "The Relative Importance of Memory Latency, Bandwidth, and Branch Limits to Performance". Workshop on Mixing Logic and DRAM: Chips that Compute and Remember, held in conjunction with the 24th Annual International Symposium on Computer Architecture, Denver, June 1, 1997.
11. Subbarao Palacharla, Norman P. Jouppi and J. E. Smith. "Complexity-Effective Superscalar Processors". In 24th Annual International Symposium on Computer Architecture, Denver, 1997, pp. 206-218.
12. Keith I. Farkas, Norman P. Jouppi and Paul Chow. "Register File Design Considerations in Dynamically Scheduled Processors". In 2nd IEEE Symposium on High-Performance Computer Architecture, February 1996.
13. Roger Espasa, Mateo Valero and James E. Smith. "Out-of-order Vector Architectures". In 30th Annual International Symposium on Microarchitecture (MICRO-30), North Carolina, December 1-3, 1997.