Vector Microprocessors for Desktop Computing

Submitted to the 32nd Annual International Symposium on Microarchitecture

Mark G. Stoodley    Corinna G. Lee
University of Toronto    ATI Technologies, Inc.
June 1999

Abstract

Desktop workloads are expected to shift over the next few years to become increasingly media-centric. These multimedia applications have computational demands far larger than current desktop processors can satisfy. In this paper, we describe four major requirements that we believe any effective desktop processor should address: it should meet the performance requirements of desktop workloads, it should exploit advances in VLSI fabrication technology to provide this performance, it should provide scalable performance across processor generations with binary compatibility, and it should have mature compiler technology. We explain how vector microprocessors meet three of these requirements, but there is a perception that their performance on non-vectorizable codes would be unacceptably low. The first half of this paper argues that current desktop workloads such as productivity applications and the SPEC95 integer benchmarks are either highly interactive or contain little exploitable parallelism. We explain how vector microprocessors should be able to exploit advances in VLSI technology to provide acceptable performance for these types of applications. There is also an important class of multimedia programs, such as MPEG2 decoding, that cannot be vectorized without extensive loop reorganization because their parallelism is in outer loops. In the second half of the paper, we quantitatively evaluate three compiler approaches that enable vectorization of outer-loop parallelism: loop interchange, complete loop unrolling, and outer-loop vectorization. We find that, contrary to the standard practice of using loop interchange, outer-loop vectorization provides the greatest performance benefit of the three techniques for four multimedia benchmarks including the IDCT kernel from MPEG2 decoding.

1 Introduction

Desktop computing workloads are expected to undergo a fundamental shift over the next few years toward increasingly media-centric applications [12][21][10]. These applications process many different types of digital data, often simultaneously and with real-time synchronization requirements. Examples of such data include images, video, voice, audio, animations, and graphics. The computational requirements of media applications are inherently different from those of traditional desktop workloads represented by, for example, the SPECint95 integer benchmarks [12]. Real-time decrypting, decoding, and synchronization of continuous, high-data-rate bitstreams require different processing capabilities than integer programs such as compilers, simulators, and database programs. To address the new computational demands of media applications, much effort has recently been directed towards retrofitting SIMD-style media instructions into current general-purpose processor designs [27][31][9][25][30][20][15]. While these media instructions can exploit the parallelism that is abundantly present in virtually all multimedia applications, this approach is unlikely to scale well in light of the expected growth of media applications' computational requirements. The computational power of today's desktop processors permits a single MPEG2 video stream to be decoded at only a few frames per second in a window
occupying a few square inches. Full-motion, full-screen video decoding may require more than two orders of magnitude more processing power [33]. We believe that a vector microprocessor could meet this tremendous computational demand much more cost-effectively, and in fewer processor generations, than today's superscalar processors. In this paper, we present our case for using vector microprocessors in desktop computers. It is based upon four requirements that we believe any successful desktop processing solution should address: providing performance to meet the computational demands of desktop workloads, the ability to exploit the high clock rates enabled by advances in VLSI technology, binary compatibility combined with scalable performance among different processor implementations, and mature compiler technology. Vector microprocessors meet three of these requirements, but there is a perceived weakness regarding their ability to support the performance requirements of programs that cannot be easily vectorized. There are at least three classes of such programs: those that have little parallelism, those where the human user, not the computer, is the performance bottleneck, and many important multimedia programs, such as MPEG2 decoding, that cannot be vectorized without extensive loop reorganization. In the first part of this paper, we argue that a vector microprocessor should provide acceptable performance for the first and second classes of programs. In the second half, we evaluate three compiler techniques that can address the performance of the third class of programs. The paper is organized as follows. In Section 2, we present our case that vector microprocessors are an attractive alternative for desktop computing and address the performance of the first two classes of non-vectorizable programs outlined above. In Section 3, we describe a specific model for a desktop vector microprocessor.
We then focus our attention on the third class of applications by evaluating compiler techniques for vectorizing several important multimedia applications in Section 4. In Section 5 we discuss related work and we conclude in Section 6.

2 Why Vector Microprocessors for the Desktop?

We believe that any effective desktop computing solution should address at least four major requirements. First, it should cost-effectively exploit the abundant parallelism in multimedia applications while providing acceptable performance for more traditional desktop applications such as those typified by the SPECint95 benchmark suite. Second, it should be capable of taking advantage of the high clock rates enabled by continuing improvements in VLSI fabrication technology. Third, it should provide scalable performance without requiring applications to be recompiled to take advantage of more powerful hardware implementations. Fourth, it should rely on mature compiler technology to achieve high performance.

In the rest of this section, we explain why vector microprocessors are a good alternative for desktop processing by showing how these four requirements are met.

2.1 Applications

Any successful desktop computing solution will need to support the performance demands of desktop workloads. We first evaluate the performance requirements of today's desktop applications and then examine multimedia applications, which will become increasingly important as the shift to more media-centric desktop workloads continues. Many other microprocessor architectures, such as Very Long Instruction Word (VLIW), chip multiprocessors (CMP), multiscalar processors, and reconfigurable arrays, are believed to satisfy this requirement, but there is a perception that vector microprocessors do not. We discuss this perceived weakness in this section. Two classes of applications that typify current desktop workloads are productivity software, such as word processors, and the SPEC95 integer benchmarks. These applications are commonly believed to be non-vectorizable and so would execute only in the scalar unit of a vector microprocessor. Because our vector microprocessor design has an in-order issue mechanism narrower than those of current dynamically scheduled superscalar processors, relying solely on the scalar unit may degrade performance for these types of applications. For productivity software, this performance degradation should matter little because the computer is typically not the bottleneck to performance; the human user is. Applications such as word processors and presentation software are highly interactive. It is therefore likely that at some point in the future, if not already, computers will be fast enough that almost all of the wait time is on the user's side. Moreover, it is unlikely that the workload of such tools will increase at the same rate that computer performance is improving. For these reasons, we believe that the lower scalar performance of the vector microprocessor compared to today's processors is unlikely to translate into a perceptible performance degradation for the users of these applications.
For desktop applications such as the SPEC95 integer benchmarks, the performance degradation should not be large because these types of programs have only a moderate amount of instruction level parallelism (ILP). Although limit studies indicate that high levels of ILP (10 instructions per cycle (IPC) or more) are present in such programs [38][32], architectural and compiler innovations over the past decade have not been highly successful at extracting this parallelism to improve performance[22][2]. To see how much ILP researchers have been able to extract from these types of applications, we surveyed the IPC results presented in papers from three architecture conferences of 1998 and 1999 for two of the most

[Figure 1 data: IPC for gcc and go drawn from seven studies — [37] Fig. 5, [35] Fig. 12, [6] Fig. 11, [39] Fig. 3, [7] Table 1, [34] Fig. 4, and [36] Fig. 3. Reported gcc IPC values were 1.6, 2.0, 2.3, 1.7, 1.7, and 2.1; go IPC values ranged from 1.4 to 2.3. Configurations noted in the survey include a 42-way OOO multiscalar, ideal memory, 4-way commit, a multithreaded processor running a single thread, and two 8-way fetch, 16-way issue machines.]

Figure 1: Survey of recent architecture papers showing instructions-per-cycle (IPC) for gcc and go benchmarks on 8-way issue processors.

problematic benchmarks of the SPECint95 benchmark suite: gcc and go. Figure 1 summarizes our findings. The numbers in each column of this table cannot be directly compared because many parameters differ between the studies (in particular the compiler that was used). Nonetheless, the table demonstrates that even 8-wide issue enables at most 2.1 to 2.3 instructions on average to execute per clock cycle. While 8-way IPC numbers were most commonly presented, some papers also presented results for 4-way and 16-way issue processors. The 16-way results ranged from 1.8 to 3.3 IPC for these benchmarks [37][5] while the 4-way numbers ranged from 1.7 to 2.0 IPC [35][6]. A vector microprocessor such as the one described in Section 3, with 2-way in-order issue, should realistically achieve approximately 1.0 IPC and should therefore require about twice the number of clock cycles that an 8-way out-of-order issue processor would require to execute one of these programs. While a factor of two seems undesirable, other effects could counter this potential degradation in performance. There are three reasons why a vector microprocessor may still be able to provide acceptable performance for these types of applications. The first two concern the increase in the number of clock cycles while the third considers the cost-effectiveness of providing performance for these applications. First, some of these applications can, in fact, be partially vectorized. Asanovic has shown that, in some cases, performance-critical components of these types of programs can be vectorized [2]. By modifying only 0.3% of the source code of one application and at most 7.5% of another, four of the eight SPEC95 integer benchmarks (compress, ijpeg, li, m88ksim) can be partially vectorized.
These modifications result in an average 34% improvement in performance for the entire SPECint95 suite over the Torrent T0's scalar execution time, suggesting that some of the performance degradation incurred by narrower, in-order issue may be countered by taking advantage of the highly efficient vector hardware. Second, some of the increase in the number of clock cycles can be countered by the faster clock rate made
possible by the vector microprocessor's simpler design and in-order issue mechanism. By pipelining global signals and avoiding complex OOO hardware, a vector microprocessor is likely to have a faster clock rate than 8-way or 16-way issue dynamically scheduled superscalar processors. We discuss this issue in more detail in the next section. Third, it is important to keep in mind the cost of providing performance for these applications. The vector microprocessor may occupy only as much die area as today's 4-way issue processors [19][18]. Moreover, the die of the vector microprocessor consists primarily of datapath structures, including a high-capacity vector register file that provides high operand bandwidth with only a small number of register ports. In contrast, an 8-way or 16-way processor will occupy far more die area due to larger reorder buffers and multiported register files whose area increases at least quadratically with issue width. For these reasons we believe that an increase in the number of clock cycles by a factor of two could be countered by partial vectorization and a faster clock rate. Any remaining performance improvement provided by wider out-of-order issue would probably not be cost-effective. We now turn our attention to the performance requirements of multimedia applications, which are becoming increasingly important in desktop workloads. A characteristic of multimedia applications is that they contain abundant data parallelism. Vector processors are widely known for their ability to effectively exploit data parallelism in scientific applications, and recent work has established that this is also the case for some multimedia applications [2][19]. There is, however, a class of important multimedia applications, such as MPEG2 decoding, that cannot be easily vectorized because the parallelism in these programs is in outer loops while vectorizing compilers typically target only parallelism in inner loops.
We examine this class of applications in more detail in the second half of this paper, where we evaluate three compiler techniques that enable vectorization, and hence higher performance, for these types of applications. A vector microprocessor, then, can provide aggressive support for multimedia programs that contain abundant data parallelism and should be able to provide acceptable performance for other types of applications for which there is either limited parallelism to exploit or the human user is more of a performance bottleneck than the processor. While other architectural alternatives can also provide good support for multimedia applications and, in some cases, for non-vectorizable codes, they may not meet one of the other three requirements for an effective desktop solution.
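The claim above that multiported register file area grows at least quadratically with issue width can be illustrated with a toy model. The port counts and the area-growth assumption here are ours, purely for illustration, not measurements from this paper:

```c
/* Toy register-file area model: each extra port adds roughly one wordline
 * and one bitline crossing each bit cell, so the cell's side length grows
 * linearly with the port count and its area grows quadratically.
 * We assume 2 read ports + 1 write port per issued instruction. */
unsigned ports_for_issue_width(unsigned w) { return 3 * w; }

unsigned relative_cell_area(unsigned ports)
{
    return ports * ports; /* arbitrary units */
}
```

Under these assumptions, doubling issue width from 4 to 8 quadruples the per-cell register file area, while a vector lane's register file stripe keeps a constant, small port count regardless of how many lanes are added.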

2.2 VLSI Fabrication Technology

VLSI fabrication technology affects how microprocessor architectures are designed. For example, the number of transistors available on a single chip limits the sizes and types of structures that can be effectively implemented in a microprocessor. In this section, we discuss how a vector microprocessor can continue to take advantage of the high clock rates made possible by improvements in VLSI technology. First, a vector microprocessor uses a narrow, in-order issue mechanism. As we discussed in the previous section, this provides lower performance for programs that cannot use the vector hardware, such as the SPEC95 integer benchmarks. The benefit of this approach, however, is that it does not require a complex reorder buffer. While current 4-way OOO designs can achieve clock rates over 500 MHz and may continue to achieve high clock rates as VLSI technology improves, it is unlikely that wider-issue OOO processors will be cost-effective because the cost of issuing multiple instructions per cycle grows at least quadratically with issue width, and the required circuitry may soon limit the clock frequencies of OOO processors [29]. Moreover, these wider processors show diminishing performance returns for this increased cost, as we discussed in the previous section. A 2-way in-order issue mechanism should continue to take advantage of improvements in clock rates made possible by VLSI technology at least as well as current designs. Codes that cannot be vectorized should continue to benefit from the trend of increasing microprocessor clock frequencies. Olukotun et al. have made a similar argument for chip multiprocessors [28]. Second, the vector units in a vector microprocessor require only simple, localized control logic [2] and may be able to pipeline any performance-critical global signals, such as vector instruction dispatch, due to the multi-cycle execution of each vector instruction.
Some details of how the vector microprocessor can avoid global signals are included in Section 3. By requiring only local signals, the vector microprocessor avoids the negative effects of a VLSI technology trend that is becoming increasingly important: wire delays will begin to dominate transistor switching time as feature sizes decrease to 0.1 µm and below [24], so processors that require long wires may soon be unable to match the clock frequencies of processors with only short wires. Even as vector microprocessors are scaled to provide more parallelism in hardware, they still require only short wires because global signals can be pipelined. Third, vector microprocessors consist mainly of datapath circuitry [2][19], which can be easily replicated to consume available transistors. The tremendous improvements in transistor density provided by continually evolving fabrication processes can therefore be applied directly to supporting more parallelism. The datapath can be widened by adding more functional units, or the vector length can be extended by increasing the size of the vector register file, without limiting the new design's clock frequency. It is this ability of the vector architecture to utilize increased transistor density directly to exploit more
parallelism that suggests vector microprocessors can meet the huge computational demands of multimedia applications. While providing hardware to support more parallelism is key to improving performance, providing the means for software to seamlessly exploit different implementations of this hardware is also an important requirement for a desktop microprocessor.

2.3 Transparent Scalability

One of the touted advantages of dynamically scheduled superscalar processors is that applications do not have to be recompiled to exploit the improved performance provided by new generations of processors. We believe that any successful desktop solution should include this ability, both because not all users are willing to purchase new versions of software for every new processor generation and because of the costs of developing and maintaining software versions specific to each processor generation. While OOO multimedia processors can provide increased support for legacy scalar codes, they may provide this support less effectively for a workload dominated by multimedia applications because the new multimedia instructions introduced into general-purpose microprocessors over the past five years support only a fixed amount of parallelism [27][31][9][25][30][20][15]. This fixed amount of parallelism causes several problems, especially with regard to scalability. For example, a packed add instruction operating on 16-bit values performs four adds in parallel with a 64-bit datapath. For a user to benefit from the improved performance provided by, say, eight parallel adds in a future processor implementation, four things would have to occur: 1) a new instruction would need to be added to the instruction set, consuming precious opcode space; 2) the application would need to be rewritten to use this instruction because the compiler infrastructure to automatically exploit these instructions is not yet available; 3) the application would need to be recompiled; and 4) the application would need to be distributed to the users. In contrast, a vector microprocessor does not suffer from the four problems identified above. The ISA requires no changes to support greater parallelism in hardware.
Furthermore, an application needs to be compiled only once to exploit the vector hardware but thereafter can take advantage of higher performance implementations without being rewritten, recompiled or redistributed.
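The transparent scalability argument rests on the vector style of expressing a loop against a run-time vector length rather than a fixed instruction width. A minimal strip-mined sketch is shown below; `hw_vector_length` stands in for a value a real machine would expose through a control register, and the scalar inner loop stands in for what the hardware would execute as a single vector instruction:

```c
#include <stddef.h>

/* Hypothetical hardware vector length, discovered at run time.
 * A binary compiled once runs on any implementation; a wider
 * implementation simply reports a larger value here. */
static size_t hw_vector_length = 64;

/* Strip-mined vector add: each trip processes min(n, VL) elements,
 * the way a vectorizing compiler would emit setvl-style code. */
void vadd(const int *a, const int *b, int *c, size_t n)
{
    while (n > 0) {
        size_t vl = n < hw_vector_length ? n : hw_vector_length;
        for (size_t i = 0; i < vl; i++)  /* one vector instruction's worth */
            c[i] = a[i] + b[i];
        a += vl; b += vl; c += vl; n -= vl;
    }
}
```

Binary compatibility falls out naturally: the same loop runs unchanged whether the hardware vector length is 32, 64, or 128, exploiting whatever parallelism the implementation provides.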

2.4 Compiler Technology

Our fourth requirement for an effective desktop processor solution is reliance on mature compiler technology. Vector compiler technology is well established in the scientific domain and is very effective. Applying this mature, 25-year-old compiler technology to multimedia applications should be a small evolutionary step.

Meeting this requirement removes a potential hurdle to market acceptance; it should be much easier to motivate the move to a new architecture if the software tools required by that architecture are already robust and provide good performance. In contrast, inadequate compiler performance may obstruct or delay acceptance of even a well-designed architecture.

2.5 Summary

In summary, we have discussed four requirements that we believe any effective desktop solution should address. It should provide good performance for desktop applications, be able to take full advantage of the high clock frequencies enabled by improvements in VLSI technology, provide scalable performance with binary compatibility, and rely on mature compiler technology. Vector microprocessors meet three of these requirements, but there is a perceived weakness in executing desktop applications that cannot be easily vectorized. We presented evidence in this section that vector microprocessors should be able to provide acceptable performance for current desktop workloads typified by productivity applications and the SPEC95 integer benchmarks. There is still a class of important multimedia applications, however, that cannot be easily vectorized. We address the performance of these applications in Section 4. In the next section, we describe a vector microprocessor design that we believe is suitable for desktop computing dominated by multimedia workloads. We then turn our attention to techniques that enable vectorization of a class of multimedia applications that cannot take advantage of the vector hardware without extensive loop reorganization.

3 A Vector Microprocessor Design

We propose using a two-way issue, in-order superscalar processor tightly coupled with a vector processing unit as a scalable, cost-effective solution for desktop computing. A block diagram of the design and a table listing its features appear in Figure 2. This design is similar to the Torrent T0 vector microprocessor designed at Berkeley [4]. The T0 was designed to accelerate neural networks for speech recognition. Because the requirements for desktop processing are different from those for processing neural nets, our vector microprocessor design differs from the T0 as shown in Figure 3. Dual in-order issue is provided to mitigate the possible performance degradation for the code in desktop applications that does not utilize the vector hardware. Our design includes only a single vector unit because preliminary simulations have shown that, with only

[Figure 2 block diagram: instruction fetch and decode feed a 2-way in-order issue unit; the scalar register files serve one integer FU, one FP FU, and one load/store FU, which shares a single address port to the DCache over a 64-bit bus; the ICache and DCache connect to the L2; the vector register file (VRF) feeds the vector unit (VU) and the vector memory unit (VMU).]

Design features:
Issue logic: issue width 2; issue mechanism in-order
Scalar datapath: 1 integer FU; 1 FP FU; 1 load/store FU
Vector register file: 32 vector registers; vector length 64
Vector datapath: VU integer width 8; VU FP width 4; 1 VU; 1 VMU
Memory interface: 64-bit L1 data bus; 128-bit L1-L2 data bus; 1 address port
Figure 2: Vector Microprocessor Design

one vector memory unit (VMU), a second vector unit does not give a significant performance benefit. It is possible that providing a second vector memory unit may increase the cost-effectiveness of a second vector unit, although we have not investigated this possibility. The longer vectors in our design provide two benefits over the Torrent design. First, each vector instruction takes 8 cycles to execute (64 elements across 8 datapaths) rather than 4 cycles. Longer instruction execution implies that communication between the vector unit and the scalar unit occurs half as often in our design as in the T0, allowing greater freedom in accomplishing that communication. In particular, any global signals can be pipelined to a greater depth without increasing the processor clock cycle. A second benefit is that longer vectors provide greater performance both by reducing the dynamic instruction count and by exploiting more parallelism to decrease the average number of cycles per operation [19]. Longer vectors thus provide better performance than shorter vectors while enabling the vector microprocessor to more easily take advantage of the high clock rates continually made possible by advances in VLSI technology. Our design includes a 64-bit integer datapath in each functional unit of the vector unit to match the width of current desktop microprocessors. In addition, it includes four floating-point datapaths to enable vectorization of floating-point codes. Because of the T0's narrow application area, it was not required to support these data types. The reason for fewer floating-point datapaths than integer datapaths in our design

Feature          | Desktop Design          | Torrent T0   | Why Desktop is Different
Issue width      | 2-way                   | scalar       | faster execution of purely scalar code
Vector units     | 1                       | 2            | second unit not cost-effective
Vector length    | 64                      | 32           | better performance; individual vector instructions execute longer
Int datapaths    | 8 x 64b                 | 8 x 32b      | desktop systems process 64-bit data
FP datapaths     | 4 x 64b                 | none         | desktop systems process floating-point data
Memory system    | cache hierarchy (DRAM)  | flat (SRAM)  | SRAM too expensive for a desktop system

Figure 3: Differences between our design and the Berkeley Torrent T0

is that a floating-point vector datapath requires much more die area than an integer vector datapath. We quantify the cost of the vector floating-point datapath later in this section. The final, and possibly most significant, difference between our design and the T0's is in the memory system. Because we are designing a vector microprocessor suitable for desktop processing, an expensive static RAM memory system is not feasible. Instead, we use a cache-based memory hierarchy similar to the ones used in today's fastest processors. The vector unit is tightly coupled to the scalar processor in the sense that low-latency communication is provided between the vector unit and the scalar register file, allowing the vector unit to quickly access scalar values. Pairs of instructions (any combination of scalar and vector) are issued in program order and executed in either the scalar functional units or the vector units (VU and VMU). The operands for scalar (integer and floating-point) instructions are maintained in the scalar register files while vector operands (integer and floating-point) are maintained in the vector register file (VRF). The cache memory hierarchy is based on current high-performance desktop microprocessor designs, with access latencies most similar to those of the Alpha 21264 [14]. The split L1 non-blocking caches are each 64KB, 2-way set-associative with 32B lines and 4 banks. The unified L2 blocking cache is 512KB, 4-way set-associative with 64B lines and 8 banks. The L1 hit time is 1 cycle, the L1 miss penalty is 20 cycles, and the L2 miss penalty is 160 cycles. Details of the vector hardware are shown in Figure 4, which illustrates the organization of the VU, the VMU, and how they are connected to the vector register file. For the sake of presentation, this figure shows a vector unit with four parallel pipelines (VP0 - VP3) and vector registers (VR0 - VR31) with only 16 elements.
However, the vector microprocessor we are proposing has 8 parallel pipelines and vector registers with 64 elements. The layout of register elements is shown for vector register 2 (VR2 in Figure 4). Figure 4 illustrates two important vector parameters: vector width and vector length. The vector width

[Figure 4 diagram: the VMU moves data between the DCache and the vector register file (VRF) through a crossbar over a 64-bit bus, using a single address port; the VRF's registers VR0-VR31 are striped across four vector pipelines (VP0-VP3). The element layout shown for VR2 is round-robin: lane 0 holds elements 0, 4, 8, 12; lane 1 holds 1, 5, 9, 13; lane 2 holds 2, 6, 10, 14; lane 3 holds 3, 7, 11, 15.]
Figure 4: Vector Hardware Details

determines the number of parallel pipelines in the vector unit. The vector length parameter determines how many total operations should be performed by a vector instruction and how many total elements there are in a single vector register. The combination of a pipeline together with the stripe of the vector register file that feeds it is commonly called a lane. Our design therefore has eight lanes. The most commonly executed vector instructions require no inter-lane communication because the constituent operations of most vector instructions are independent; the lanes naturally partition the vector register file and vector datapaths into separate, simple processors. While some signals, such as a new-instruction broadcast, must connect to all of these processors, these signals do not have to propagate in a single clock cycle because vector instructions execute for many clock cycles; they should therefore not affect the clock cycle time. Less common vector instructions, such as reductions, may require inter-lane communication, but these instructions typically execute after all the iterations of a long vector loop, so their performance may not be critical to the performance of codes that require them. The vector memory unit (VMU) in this design provides a cost-effective mechanism for moving vectors of data between the vector register file and memory via a 64-bit data bus. This VMU design is similar to that of the Berkeley Torrent T0 and has been described elsewhere [3][19].
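The lane striping and multi-cycle execution described above can be sketched with two small helper functions. This is an illustrative model of the organization, not the hardware's actual implementation:

```c
/* Round-robin striping: element e of any vector register lives in lane
 * (e % lanes), at row (e / lanes) of that lane's stripe of the register
 * file. With 4 lanes, lane 0 holds elements 0, 4, 8, 12 -- matching the
 * VR2 layout illustrated in Figure 4. */
unsigned lane_of(unsigned element, unsigned lanes) { return element % lanes; }
unsigned row_of(unsigned element, unsigned lanes)  { return element / lanes; }

/* A fully pipelined vector instruction retires one element per lane per
 * cycle, so it occupies the datapath for ceil(vl / lanes) cycles:
 * 64 elements on 8 lanes gives the 8-cycle execution cited in the text,
 * versus 4 cycles for the T0's 32-element vectors. */
unsigned vinsn_cycles(unsigned vl, unsigned lanes)
{
    return (vl + lanes - 1) / lanes;
}
```

Because consecutive elements of the same register never share a lane's row with each other's operands, elementwise instructions need no inter-lane wiring at all, which is what lets each lane operate as a separate, simple processor.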

Processor Component                        Area (mm2)   % of Total
64b vector datapath:
  8 integer units                          24.0         30.9%
  4 floating-point units                   25.6         33.0%
  load/store unit                           9.5         12.2%
64b vector register file:
  32 64-element vector registers           10.3         13.2%
64b scalar unit:
  scalar integer and FP datapath            0.5          0.6%
  scalar integer and FP register file       0.8          1.0%
  instruction issue                         4.0          5.2%
Clocking and overhead                       3.0          3.9%
Total                                      77.7        100.0%

Figure 5: Die area breakdown for the vector microprocessor design in 0.25 µm technology

We conclude this section with a die area estimate for this vector microprocessor design. This area estimate builds on earlier results [19] by quantifying the die area of the floating-point vector datapath. We show estimated die areas for the major components of our design in Figure 5 for a 0.25 µm technology, omitting structures such as the caches, TLB, and pad ring. The die area for the vector FP units is shown in bold. To estimate this area, we measured the floating-point functional units from annotated die photos of two processors: the Alpha 21164 (6.1 mm2) and the MIPS R5000 (6.7 mm2). These areas are also scaled to 0.25 µm. We then multiplied the average of these areas (6.4 mm2) by four to compute the area of our 4-wide FP vector datapath. The die area for the entire vector microprocessor is 77.7 mm2, which is only slightly larger than the die areas of the processor portion of current 4-way superscalar designs such as the HP PA-8000 (68 mm2), the Alpha 21264 (70 mm2), and the MIPS R10000 (67 mm2) [19]. For completeness, we also measured the areas of the floating-point units in three out-of-order processors (Alpha 21264, HP PA-8000, and MIPS R10000) and found them to be slightly larger (on average 10.5 mm2). Using this area would inflate the vector microprocessor's die area to 94 mm2. While this is 34% larger than current designs, it is still well within the implementation limits of current technology. Figure 5 also shows the percentage of the die area allocated to the various components of the vector microprocessor. Three important observations can be drawn from this data. First, datapath components represent over 81% of the vector microprocessor's die area. Current superscalar processors, such as the HP PA-8000, allocate only about 43% of their die area to datapath circuitry [19].
Second, less than 10% of the die area is dedicated to instruction issue logic and overhead structures in the vector microprocessor; the corresponding fraction ranges from 43% in the HP PA-8000 to over 60% in the MIPS R10000 [19]. This percentage is
expected to grow as issue rates increase, and since the control circuitry in instruction issue logic is much more complex than datapath circuitry, the vector microprocessor is likely much easier to design than a dynamically scheduled superscalar processor. The final observation is that the vector register file (VRF) occupies about 12% of the vector microprocessor's die area. Current superscalar register files range from 7% (MIPS R10000) to 15% (HP PA-8000) of die area [19]. Although it occupies about the same area as a superscalar register file, the vector register file contains 16KB of (vector) data whereas the superscalar register files hold less than 1KB of (scalar) data. Moreover, the VRF delivers 16 64-bit operands and stores 8 64-bit results per cycle, totaling 192 bytes per cycle, or 96 GB/s at 500 MHz. Providing this bandwidth requires only one read port and one write port per bit cell. In contrast, current superscalar register file designs deliver approximately half this bandwidth because they execute only four instructions per cycle, yet to supply that operand bandwidth they require two read ports and one write port per functional unit per bit cell. The highly efficient vector register file organization is a large part of the reason why vector microprocessors can cost-effectively provide high performance for data-parallel applications such as multimedia programs. Of course, only vectorized programs can take advantage of this tremendous operand bandwidth.
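The bandwidth arithmetic above can be checked directly. The operand counts come from the text; the helper function itself is ours, for illustration only:

```c
/* VRF bandwidth: (reads + writes) operands per cycle, each operand_bytes
 * wide, at clock_hz cycles per second; result in decimal GB/s.
 * 16 reads + 8 writes of 8-byte operands is 192 bytes/cycle, and at
 * 500 MHz that works out to the 96 GB/s quoted in the text. */
double vrf_bandwidth_gbs(unsigned reads, unsigned writes,
                         unsigned operand_bytes, double clock_hz)
{
    return (reads + writes) * (double)operand_bytes * clock_hz / 1e9;
}
```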

4 Vectorizing Multimedia Programs

In the rest of the paper, we turn our attention to the problem of providing performance for multimedia applications that are written in such a way that they cannot be vectorized easily. We first describe how many of these multimedia programs contain abundant parallelism in outer loops, which can be vectorized only after extensive loop reorganization. We then discuss three compiler techniques that can transform outer-loop parallelism into a vectorizable form, and finally we evaluate the effectiveness of these techniques for three processor architectures.

4.1 Outer-Loop Parallelism in Multimedia Applications

Vector compilers primarily target parallelism in inner loops. While an application's inner loop may not contain much parallelism, there can still be abundant parallelism in outer loops. In scrutinizing the source code of many multimedia applications, such as MPEG2 encoding, MPEG2 decoding, and MP3 audio decoding, we have found abundant data parallelism in outer loops. In these codes, the inner loops frequently contain recurrences or have very short trip counts (less than ten). This code structure arises because many multimedia computations iterate over a large number of objects, with each object processed independently of the others. The motion compensation code that dominates MPEG2

for (i=0; i
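The loop shape described above — independent outer iterations over many objects, with a short, recurrence-carrying inner loop — can be sketched as follows. This is an illustrative kernel with hypothetical names and sizes, not the paper's motion compensation code:

```c
/* Shape typical of the multimedia kernels discussed in the text: the
 * inner loop carries a recurrence (an accumulation) and has a short trip
 * count, so it does not vectorize; the outer loop over independent
 * blocks is where the exploitable data parallelism lives. */
#define NBLOCKS 64
#define TAPS     8

void filter_blocks(const short in[NBLOCKS][TAPS], int out[NBLOCKS])
{
    for (int b = 0; b < NBLOCKS; b++) {   /* independent: vectorizable */
        int acc = 0;
        for (int t = 0; t < TAPS; t++)    /* recurrence, trip count 8 */
            acc += in[b][t];
        out[b] = acc;
    }
}
```

An inner-loop-only vectorizer sees just the 8-iteration accumulation and gives up; the techniques evaluated in this section (loop interchange, complete unrolling, and outer-loop vectorization) all aim to expose the outer `b` loop to the vector hardware instead.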
