Increasing performance with multiply-add units and wide buses

David López, Mateo Valero, Josep Llosa and Eduard Ayguadé
{david | mateo | josepll | eduard}@ac.upc.es
Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya
Campus Nord, Mòdul D6, Jordi Girona 1-3, 08034 Barcelona, SPAIN

Abstract

A balanced increase of memory bandwidth and computational performance is one of the current trends in high-performance microprocessors. This improvement can be attained either by replicating resources such as buses and functional units or by making them more complex. For example, some microprocessors, such as the IBM POWER2, double the width of the buses between the register file and the first-level data cache in order to obtain results similar to those of doubling the number of buses, but at a lower cost. Similarly, some microprocessors, such as the IBM POWER2 and RS6000, include multiply-add fused functional units to increase computation capability. In this paper we evaluate the performance and the effects on register pressure of these alternatives. The performance benefits have been evaluated using 1180 kernel loops of the Perfect Club benchmarks, which account for 78% of the total execution time. The results show that both techniques (widening buses and using multiply-add fused functional units) are complementary, cost-effective solutions to increase processor efficiency in numerical applications.

1. Introduction

High-performance microprocessors rely on pipelining to shorten the cycle time and on parallelism to maximize the work done per cycle. To effectively exploit instruction-level parallelism (ILP), current processors rely on architectural techniques (dynamic instruction execution), compilation techniques (aggressive scheduling techniques such as software pipelining) or a combination of both.


Numerical applications spend the great majority of their execution time in innermost loops. Software pipelining [Cha81, Lam88] is an effective compilation technique to extract ILP from innermost loops. In a software-pipelined loop, the number of cycles between the initiation of successive iterations is termed the Initiation Interval (II) [RaGl81]. The II between two successive iterations is bounded either by recurrences in the dependence graph (RecMII) or by the resource constraints of the architecture (ResMII) [DeTo93, Rau94]. This lower bound on the II is termed the Minimum Initiation Interval (MII), where MII = MAX(ResMII, RecMII), so reducing the greater of these two values is required to increase program performance.

One approach to reducing RecMII is to reduce the latency of the operations in the critical path of the loop. This can be achieved by reducing the latency of the functional units (FU) and/or by executing complex operations in the FU. On the other hand, ResMII can be reduced by increasing the number of operations performed per cycle, both in computation instructions and in memory-access instructions. For computation instructions, the number of operations per cycle can be increased by adding FUs and/or by executing complex operations in the FU. For memory-access instructions, low-latency and high-bandwidth memory subsystems are required. Low-latency memory accesses have traditionally been achieved by using cache memories, which have also contributed to increasing the bandwidth of the memory subsystem. However, heavily exploiting ILP places large demands on memory bandwidth. To meet the high bandwidth requirements, two complementary alternatives can be applied: increasing the number of buses, and increasing their width so that several (consecutive) words are moved per memory access (wide buses).
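The interplay between these bounds can be sketched numerically. The following is an illustrative computation (the function names and the sample operation counts are ours, not the paper's): ResMII comes from the most heavily used resource class, RecMII from the most constraining recurrence circuit.

```python
from math import ceil

def res_mii(op_counts, fu_counts):
    """Resource-constrained bound: each resource class needs ops/units cycles."""
    return max(ceil(op_counts[r] / fu_counts[r]) for r in op_counts)

def rec_mii(circuits):
    """Recurrence-constrained bound: total latency of each recurrence circuit
    divided by its dependence distance (iterations spanned by the circuit)."""
    return max(ceil(lat / dist) for lat, dist in circuits) if circuits else 0

def mii(op_counts, fu_counts, circuits):
    return max(res_mii(op_counts, fu_counts), rec_mii(circuits))

# Hypothetical loop: 3 memory ops on 1 bus, 2 arithmetic ops on 2 FUs,
# and one recurrence circuit of total latency 4 spanning 2 iterations.
print(mii({"mem": 3, "arith": 2}, {"mem": 1, "arith": 2}, [(4, 2)]))  # → 3
```

Here the memory bus is the bottleneck (ResMII = 3 > RecMII = 2), so adding memory bandwidth would lower the MII while reducing FU latency would not.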
The RS6000 [IBM94] and the POWER2 [WhDh94] microprocessors include a Multiply-Add Fused Functional Unit (MAF FU) that implements a floating-point multiply-and-add operation (FMA). An FMA consists of a floating-point multiply and a dependent add. In the POWER2 microprocessor, floating-point multiply, floating-point add and floating-point multiply-and-add can each be performed in two cycles, so if a multiply followed by a dependent add lies on the critical path of a loop, the RecMII can be reduced by two cycles. The FMA operation can also reduce ResMII, because two operations are performed simultaneously using only one FU. A part of this paper is dedicated to studying the performance benefits of MAF FUs relative to conventional functional units.

[Figure 1: Some memory configurations: a) a single bus, b) two single buses, c) a wide bus and d) two wide buses]

A high memory bandwidth can reduce ResMII because more memory operations can be done in parallel. The usual bus configuration in most microprocessors consists of a single bus between the cache memory and the register file (Figure 1a). Some current microprocessors [WhDh94, Hsu94, Hun95] have been built with two memory buses (Figure 1b). An alternative to two buses is one wide bus (Figure 1c). In this paper we study the benefits of one wide bus versus a single bus, and its performance degradation compared with two buses. Similarly, we evaluate the improvement offered by configurations with two wide buses (Figure 1d), as in the POWER2. These evaluations are compared against other current bus configurations, such as two load and one store unidirectional buses (as in the CRAY Y/MP [Vaj91]) and one bidirectional bus plus one unidirectional load bus (as in the Alpha 21164 microprocessor [ERPR95]). Since reducing the II can increase the register requirements of a scheduled loop [MAD92, LVAL95], the register requirements of each bus and FU configuration are studied. We show that the register requirements of a wide-bus configuration do not have a significant impact on the final performance. We also show that using MAF FUs reduces the register requirements. Using MAF FUs scarcely affects the performance of memory-bound loops (i.e., loops bounded by memory requirements). Moreover, when MAF FUs are used, some compute-bound loops (i.e., loops bounded by computation requirements) can become memory-bound. Conversely, the use of wide buses can convert memory-bound loops into compute-bound loops. We study the benefits of the


use of both techniques together. We show that the two techniques are complementary and have additional benefits, especially in loops that are almost balanced. For the evaluations, we use 1180 innermost loops that account for 78% of the execution time of the Perfect Club [BCKK88]. Even though wide loads/stores can be applied to both dynamically and statically scheduled machines, this paper targets statically scheduled machines. We use an aggressive scheduling technique to obtain maximum performance and to stress the memory system to the maximum. The scheduling technique used is Hypernode Reduction Modulo Scheduling (HRMS) [LVAG95], a software pipelining technique that tries to produce optimal schedules with reduced register requirements.

The outline of the paper is as follows. Section 2 discusses the benefits of using MAF FUs and wide buses. Section 3 presents an example to illustrate the advantages and drawbacks of the studied techniques. Section 4 presents the experimental framework used for our evaluations. In Section 5, different bus configurations and the use of MAF FUs are evaluated; this section also shows the performance advantages of combining both techniques. Finally, Section 6 states our conclusions.

2. Trade-offs of using MAF FUs and wide buses

Many experts credit the FMA instruction as a key component of the RS6000 processor's high floating-point performance. The use of MAF FUs is an appropriate addition to a floating-point architecture. In VLSI, connectivity is a very important factor in cost and performance, and a single unit that performs a multiply and an add produces a significant reduction in internal connection requirements. In addition, normalization must be done only once for both operations, which increases accuracy. This single unit also provides one pipelined operator, with single-cycle initiation, for what were originally two operations. On the other hand, an ALU with the FMA instruction can increase the chip area. An extensive discussion of the RS6000 and POWER2 floating-point units can be found in [MHR90] and [HFH94], respectively.
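The accuracy benefit of normalizing and rounding only once can be shown numerically. The sketch below is our own illustration, not from the paper: it emulates a double-precision fused multiply-add with exact rational arithmetic, using constants chosen so that the separately rounded version loses the result entirely.

```python
from fractions import Fraction

# Both constants are exactly representable as IEEE-754 doubles.
a = 1.0 + 2.0**-30
c = -(1.0 + 2.0**-29)

# Separate multiply then add: the product a*a is rounded before the addition,
# discarding its lowest-order term (2**-60), so the sum cancels to exactly zero.
two_roundings = a * a + c

# Fused multiply-add: exact product and sum, then one rounding at the end
# (emulated here with exact rationals; a MAF unit does this in hardware).
fused = float(Fraction(a) * Fraction(a) + Fraction(c))

print(two_roundings)  # → 0.0
print(fused)          # → 8.673617379884035e-19, i.e. 2**-60
```

The fused result preserves the low-order bits that two successive roundings destroy.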


Widening buses is a cost-effective way to increase memory bandwidth, because increasing the number of buses has several costs and drawbacks at different levels of the memory hierarchy: processor core, on-chip cache, TLB, and off-chip connection. At the processor level, doubling the number of buses requires doubling the number of load/store units. Each additional load/store unit in turn requires additional register-file ports: one read port in the integer register file (two if addresses can be computed from two registers, as in SPARC [GAB+88] or PA-RISC [LMM92]) for computing the address, and one write port and one read port in each register file (integer and floating-point) for loading/storing the data. Unfortunately, the access time of a multiported device, such as the register file, increases with the number of access ports [WeEs88]. Moreover, there is a noticeable area penalty (added to the extra area required for the additional load/store unit), since in CMOS the area of a multiported device is proportional to the square of the number of ports [Jol91].

Most current microprocessors implement on-chip first-level data caches. To allow two memory accesses per cycle, a dual-ported cache is required. Increasing the number of ports of a cache has the same drawbacks as for register files. To implement a dual-ported cache, the Alpha 21164 [ERPR95] duplicates the data cache and maintains identical copies of the data in each cache; this implementation doubles the die area of the primary data cache. Another option is to split the data cache into two (or more) independent banks [SoFr91]. However, bank conflicts can reduce the effective bandwidth, and this option also incurs the die-area cost of the crossbar between the load/store units and the banks, which can in turn increase the cache access time.
Almost all microprocessors implement low-latency address translation with a translation lookaside buffer (TLB), typically highly associative. If address translation has to be performed before (physically indexed, physically tagged data cache) or while (virtually indexed, physically tagged data cache) accessing the data cache, multiple translations must be performed per cycle; in this case, the TLB might be in the critical path of the processor. Multiporting the TLB (as with the register file) can increase cycle time as well as require some extra die area. Some processor designs implement multi-level TLBs [CLC+95]: the first level can be a small multiported TLB providing multiple translations per cycle, and the second level a bigger single-ported TLB providing a high hit rate. However, a single-ported TLB will provide the same, or even better, hit rate while requiring less area.

Finally, some microprocessors implement off-chip first-level caches [Hsu94, Hun95]. Performing two off-chip memory accesses per cycle requires two buses, each with address, control and data information. Having two off-chip buses increases the number of pins required (complicating the chip package and making it more expensive) and increases the complexity of the off-chip memory system (with problems similar to those of on-chip caches).

The alternative to increasing the number of buses is to exploit wide buses, with explicit wide load/store operations at the instruction-set level. This alternative has the advantage that each wide access requires only one issue slot, one address generator, one address translation per cycle, and one wide memory port (i.e., one address bus, one control bus and two data buses). This technique also has some drawbacks at the cache level (two modules are needed to access two consecutive words), at the register level and, especially, in performance. These points are studied in the following sections.

3. Motivating example

As an example, consider a processor with one fully pipelined general-purpose FU with a latency of 2 cycles, in which load/store operations are served in 1 cycle. Consider the loop shown in Figure 2a. Figure 2b shows the pseudo-code of the loop, representing the operations in the loop body, and Figure 2c shows the associated dependence graph. Figure 2d shows the loop unrolled with u=2 and Figure 2e the associated dependence graph. In these figures, the superscript associated with an operation indicates which replication of the original loop it belongs to (0..u-1). Figure 3d shows a schedule for one iteration of the unrolled loop on an architecture with a single bus (see Figure 1a) and an FU without the FMA operation. It also shows the lifetimes of the variables involved; each value is marked with the operation that produces it. Figure 3e shows the kernel code


a) DO I=1,N
     C(I) = A(I)*B(I) + D
   ENDDO

b)      R1 = D
        DO I=1,N
   A:     R2 = A(I)
   B:     R3 = B(I)
   *:     R4 = R2*R3
   +:     R5 = R4+R1
   C:     C(I) = R5
        ENDDO

c) Dependence graph of (b): A -> *, B -> *, * -> +, D -> +, + -> C.

d)      R1 = D
        DO I=1,N,2
   A0:    R2 = A(I)
   B0:    R3 = B(I)
   *0:    R4 = R2*R3
   +0:    R5 = R4+R1
   C0:    C(I) = R5
   A1:    R6 = A(I+1)
   B1:    R7 = B(I+1)
   *1:    R8 = R6*R7
   +1:    R9 = R8+R1
   C1:    C(I+1) = R9
        ENDDO

e) Dependence graph of (d): two copies of the chains in (c), one per replication (A0, B0 -> *0 -> +0 -> C0 and A1, B1 -> *1 -> +1 -> C1, with D feeding +0 and +1).

Figure 2: (a) Example loop. (b) Optimized pseudo-code representing the operations in the loop and (c) its dependence graph. (d) Optimized pseudo-code with unrolling factor u=2 and (e) its dependence graph

of the scheduling, where the subscripts indicate the stage of the schedule in which each operation is placed. In this case, 6 cycles are needed to execute two iterations (3 cycles per iteration), since the most used resource is the load/store unit and there are 3 memory accesses per iteration. This schedule requires at least 5 registers, since up to 5 variables overlap at a given cycle. Figure 3f shows the schedule of the loop with 2 buses (see Figure 1b). Notice that 2 iterations can now be scheduled in 4 cycles (2 cycles per iteration). However, reducing the II increases the register pressure: in this case 6 registers are needed, even though the scheduling was done trying to reduce the register requirements.

In this example all the memory accesses have stride 1. If wide buses (see Figure 1c) are available, two memory accesses to consecutive memory locations (for instance A0 and A1) can be performed simultaneously with a single wide memory access. If wide load/store operations are to be used, the unrolled graph (Figure 3a) must be compacted, resulting in the graph shown in Figure 3b. Figure 3g shows a schedule of the compacted graph with one wide bus. Notice that the schedule achieves the same throughput as having two buses. Because two compacted accesses must be done together, this technique can increase the lifetimes of the variables and therefore the register requirements (in the example, 7 registers are required). Notice that the load operations for A1 and B1 are performed one cycle earlier, and the store operation for C1 one cycle later, than in the schedule of Figure 3f.
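The cycle counts in this example follow directly from resource usage. A small sketch (ours, not the paper's scheduler) reproduces them, assuming the body unrolled with u=2 has 4 loads plus 2 stores and 4 arithmetic operations on 1 FU:

```python
from math import ceil

def cycles_per_two_iters(mem_ops, buses, fu_ops, fus):
    """ResMII of the unrolled body: the busiest resource sets the kernel length."""
    return max(ceil(mem_ops / buses), ceil(fu_ops / fus))

print(cycles_per_two_iters(6, 1, 4, 1))  # 1 single bus → 6 cycles (3 per iteration)
print(cycles_per_two_iters(6, 2, 4, 1))  # 2 single buses → 4 cycles (FU-limited)
print(cycles_per_two_iters(3, 1, 4, 1))  # 1 wide bus, accesses compacted → 4 cycles
print(cycles_per_two_iters(3, 1, 2, 1))  # wide bus + MAF, 4 ops fused to 2 → 3 cycles (II=1.5)
```

With the wide bus, the load/store unit stops being the bottleneck and the FU does; fusing the multiply-add pairs then lowers the bound again.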


[Figure 3: Three versions of the dependence graph for the program in Figure 2: a) with u=2, b) with wide memory accesses and c) with FMA operations. The scheduling and register requirements of a single iteration (d) with one single bus and the kernel code (e) applying software pipelining. Kernel code and register requirements for the different configurations: f) two buses, g) one wide bus and h) one wide bus and MAF FUs.]

In the last two cases, the most used resource is the FU. If a MAF FU is used, the four arithmetic operations can be fused into two (Figure 3c) and the loop can be scheduled in three cycles (II=1.5, see Figure 3h). Notice that the register requirements are not increased even though the II has been reduced. This is because no registers are needed to store the intermediate results between the multiply and the add operations, so MAF FUs have an additional benefit in terms of register requirements.

4. Experimental framework

In this section we describe the experimental framework used to evaluate the performance of the different bus and functional-unit organizations. The framework includes a set of processor core configurations, an experimental compiler, and a set of benchmark codes.


Different machine configurations are used throughout the paper. All of them include two pipelined functional units able to perform arithmetic operations with a latency of two cycles (with FMA or without); division and square-root operations are also performed by the functional units, but they are not pipelined and have latencies of 19 and 27 cycles, respectively. Load and store memory accesses have latencies of two (pipelined) and one cycle, respectively. Bus configurations include bidirectional single or wide buses (1 or 2), and two additional asymmetric configurations: 1 bidirectional plus 1 load bus (as in the Alpha 21164 processor) and 2 load plus 1 store buses (as in the Cray Y/MP processor). For the configurations with wide buses, we assume instructions at the ISA level to perform wide memory accesses; a study of the use of wide buses on superscalar processors without explicit wide instructions can be found in [LVLA97]. Table 1 summarizes the configurations and the names used to refer to them; for instance, configuration 1SB-2FMA corresponds to a machine with a single bidirectional bus and two functional units implementing fused multiply-add operations.

Bus configuration | # of buses | Bus type              | Width
1SB               | 1          | load/store            | single
2SB               | 2          | load/store            | single
1WB               | 1          | load/store            | wide
2WB               | 2          | load/store            | wide
1LS1LB            | 2          | 1 load/store + 1 load | single
2L1SB             | 3          | 2 load + 1 store      | single

FU configuration | # of FU | Implements FMA?
2FU              | 2       | no
2FMA             | 2       | yes

Table 1: Processor core configurations.

The compiler used to parse Fortran77 codes is ICTINEO [ABG+96]. After analysing the code and performing some basic optimizations, the compiler generates the dependence graphs that feed the instruction-scheduler module. These dependence graphs correspond to innermost loops composed of a single basic block (they include neither procedure calls nor conditional exits); loops with conditional statements in their body are first converted to single-basic-block code using IF-conversion [AKW83].


The dependence graph has been enhanced to include information about strides in memory references. Stride edges link memory accesses of the same kind (load or store) performed with a constant separation or offset between consecutive references. ICTINEO applies a compaction algorithm, similar to the one applied by the AIX Fortran77 compiler (version 3.2) for the IBM SP2 architecture, to unify memory accesses that can be performed with a single wide access. A detailed version of the algorithm can be found in [LVLA97]. Finally, HRMS (Hypernode Reduction Modulo Scheduling) [LVAG95] is applied in order to generate a near-optimal-throughput software pipeline with minimum register requirements for each loop (about 97.5% of the loops are scheduled with their minimum initiation interval).
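A minimal sketch of such a compaction pass (our own simplification of the idea, not the ICTINEO algorithm): after unrolling by two, same-kind accesses to consecutive words of the same array are paired into one wide access unless a dependence keeps them apart.

```python
def compact(accesses, dependences):
    """accesses: list of (name, kind, base, offset) tuples;
    dependences: set of (name, name) pairs that must not be merged.
    Returns a list of wide pairs and leftover narrow accesses."""
    wide, used = [], set()
    for i, (n1, k1, b1, o1) in enumerate(accesses):
        if n1 in used:
            continue
        for n2, k2, b2, o2 in accesses[i + 1:]:
            # Pair same-kind accesses to adjacent words with no dependence path.
            if (n2 not in used and k1 == k2 and b1 == b2 and abs(o1 - o2) == 1
                    and (n1, n2) not in dependences and (n2, n1) not in dependences):
                wide.append((n1, n2))
                used.update((n1, n2))
                break
        else:
            wide.append((n1,))       # no partner found: keep as a narrow access
            used.add(n1)
    return wide

# Loads of A(I) and A(I+1) compact; the store is left as a narrow access.
acc = [("A0", "load", "A", 0), ("A1", "load", "A", 1), ("C0", "store", "C", 0)]
print(compact(acc, set()))  # → [('A0', 'A1'), ('C0',)]
```

Passing a dependence pair such as ("A0", "A1") forces both accesses to stay narrow, mirroring how recurrence circuits prevent compaction.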

Program | % Execution time | % Memory operations | % Mem. ops. with stride 1 | % Mem. ops. compacted | % exec. time in loops with recurrences
ADM     | 78.73 | 38.31 | 71.04 | 53.36 | 48.43
QCD     | 42.81 | 54.30 | 86.09 | 64.29 | 23.64
MDG     | 62.33 | 46.72 | 67.25 | 51.27 | 14.39
TRACK   | 30.00 | 62.58 | 75.85 | 75.21 | 36.00
BDNA    | 68.89 | 18.92 | 45.25 | 36.23 | 98.83
OCEAN   | 96.79 | 43.67 | 26.10 | 20.35 | 72.24
DYFESM  | 97.50 | 58.47 | 89.49 | 87.52 | 58.43
MG3D    | 69.81 | 20.20 | 52.56 | 47.34 | 33.54
ARC2D   | 94.77 | 40.24 | 59.77 | 58.56 |  1.28
FLO52   | 92.29 | 42.30 | 90.73 | 89.36 |  1.69
TRFD    | 97.22 | 60.21 | 66.31 | 65.87 |  6.93
SPEC77  | 84.66 | 33.84 | 35.63 | 22.57 | 33.69

Table 2: Some general characteristics of the Perfect Club loops

The Perfect Club benchmarks [BCKK88] have been used to perform the evaluation. With the previous premises, ICTINEO selects 1180 loops that account for 78% of the total execution time of the benchmark set (running on a HP 9000/735 workstation).


Table 2 shows some general characteristics of the loops studied. The first column indicates, for each program, the percentage of the total execution time spent in the selected loops. The second column shows the (dynamic) percentage of operations in the loops that are memory operations. The third column shows the percentage of these memory operations that have stride 1 (with an unroll factor of two). Not all the memory operations with stride 1 can be compacted, because of dependences between them, so the fourth column shows the percentage of memory operations that can be compacted into wide accesses. Finally, the fifth column shows the percentage of execution time spent in loops that have recurrences. Recurrences can override the potential performance benefits of wide buses, because recurrence circuits impose critical paths on the execution of the loops, so the additional memory bandwidth may have no effect on performance. In addition, recurrence circuits increase the likelihood of paths between compactable memory accesses, preventing their compaction.

5. Performance evaluation

In this section we present the performance evaluation of the use of wide buses and MAF FUs, and of their combined effects, taking into account the effects on register pressure. In all the tests performed, we assume a perfect memory (i.e., a hit ratio of 100%).

5.1 Comparison of different bus configurations

First, we evaluate the performance obtained when increasing the number of buses and when widening them, over a base configuration with one single bus and two functional units (1SB-2FU). Table 3 shows the characteristics of the tested loops when running on the base architecture.


Table 3: Characterization of the programs with the 1SB-2FU configuration

Program | Rec. bound | Bal. | Res. bound | Comp. bound | Bal. | Mem. bound |  WB  | !WB  | %Mem. ops. | %load | %store | %compact.
adm     |    35.4    | 0.9  |    63.7    |    27.3     |  0   |    36.4    | 29.0 |  7.4 |    49.1    | 58.2  |  41.8  |   48.5
qcd     |     0      | 0.4  |    99.6    |     0       |  0   |    99.6    | 99.2 |  0.4 |    60.5    | 54.2  |  45.8  |   64.3
mdg     |     0.5    | 0.5  |    99.0    |    31.4     |  0   |    67.6    | 67.6 |  0   |    65.0    | 70.8  |  29.2  |   55.7
track   |     0.6    | 0    |    99.4    |     0       | 2.6  |    96.8    | 93.7 |  3.1 |    64.4    | 57.9  |  42.1  |   74.7
bdna    |    76.2    | 0    |    23.8    |    16.1     |  0   |     7.7    |  0.2 |  7.5 |    81.5    | 58.4  |  41.6  |    4.3
ocean   |    45.0    | 0    |    55.0    |    29.0     |  0   |    26.0    | 21.0 |  5.0 |    91.6    | 50.7  |  49.3  |   63.9
dyfesm  |     4.4    | 15.6 |    80.0    |     0       |  0   |    80.0    | 75.7 |  4.3 |    60.5    | 66.3  |  33.6  |   90.8
mg3d    |    41.9    | 0    |    58.1    |    39.5     |  0   |    18.6    | 13.5 |  5.1 |    67.0    | 54.6  |  45.4  |   80.8
arc2d   |     2.7    | 0    |    97.3    |    59.1     |  0   |    38.2    | 20.3 | 17.9 |    52.9    | 68.8  |  31.2  |   52.5
flo52   |     2.1    | 0    |    97.9    |    49.3     | 1.1  |    47.5    | 35.6 | 11.9 |    50.3    | 67.9  |  32.1  |   85.5
trfd    |     1.6    | 0    |    98.4    |     0       |  0   |    98.4    | 98.4 |  0   |    60.4    | 65.9  |  34.1  |   66.6
spec77  |    33.9    | 0    |    66.1    |    15.3     | 1.6  |    49.2    |  8.3 | 40.9 |    46.8    | 75.4  |  24.6  |   14.7

(Comp. bound, Bal. and Mem. bound subdivide the resource-bound column; WB and !WB subdivide the memory-bound column; the last four columns refer to the memory-bound loops.)

The second column of Table 3 shows, for each program, the (dynamic) percentage of time spent in loops with RecMII greater than ResMII (e.g., 35.4% for ADM). These loops are constrained by their recurrences and would never benefit from a higher memory bandwidth. The next column shows the percentage of time spent in loops with RecMII = ResMII (e.g., 0.9% for ADM). These loops are balanced, so a higher memory bandwidth does not improve their performance either. The fourth column shows the percentage of time spent in loops where ResMII is greater than RecMII (e.g., 63.7% for ADM). These loops are bounded by the available resources. In order to analyse their behaviour when the memory bandwidth changes, we further divide them into three groups: compute-bound, balanced and memory-bound loops (e.g., 27.3%, 0% and 36.4% for ADM, respectively). Memory-bound loops are the ones that can benefit from an increase in memory bandwidth, so some additional information is shown for them. The column labeled WB shows the percentage of time in loops whose performance increases when the bus is widened. The !WB column is its complement (i.e., loops whose performance does not vary when a wide bus is used).
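The classification used in Table 3 can be sketched as follows; this is an illustrative re-implementation under our own naming, splitting ResMII into its memory and compute components:

```python
def classify(rec_mii, mem_mii, comp_mii):
    """Classify a loop from its recurrence bound and per-resource bounds."""
    res_mii = max(mem_mii, comp_mii)
    if rec_mii > res_mii:
        return "recurrence bound"      # more bandwidth cannot help
    if rec_mii == res_mii:
        return "balanced (rec = res)"
    if mem_mii > comp_mii:
        return "memory bound"          # candidate for wider or more buses
    if mem_mii < comp_mii:
        return "compute bound"         # candidate for MAF FUs
    return "balanced (mem = comp)"

print(classify(2, 6, 4))  # → memory bound
print(classify(8, 6, 4))  # → recurrence bound
print(classify(2, 4, 6))  # → compute bound
```

Weighting each loop's class by its execution time yields the per-program percentages reported in the table.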


The last four columns show additional characteristics of the memory-bound loops: the percentage of memory operations over the total operations of these loops (e.g., 49.1% for ADM), the percentage of these operations that are loads and stores (e.g., 58.2% and 41.8%, respectively, for ADM) and the percentage (over the total memory operations) that can be compacted when wide buses are available. Figure 4 shows the speed-up of the different bus configurations using the execution time of 1SB-2FU as the baseline¹. The speed-ups obtained by increasing the memory bandwidth vary significantly according to the characteristics of each program (see Table 3). The use of two single buses instead of one single bus can increase performance by up to a factor of 2. This speed-up can be achieved only in memory-bound loops, so the programs with a high percentage of time spent in memory-bound loops are the ones that take most advantage of additional buses. These programs are TRFD, TRACK, QCD and DYFESM, with speed-ups of 1.98, 1.96, 1.86 and 1.67

[Figure 4: Speed-up of different bus configurations (1 single bus = 1.0) with two general-purpose FUs. Configurations shown: 1 wide bus, 2 single buses, 2 wide buses, 2 load + 1 store buses, 1 load/store + 1 load bus.]

1. The execution time of each loop has been estimated as the number of iterations times the II found by HRMS.


respectively. On the contrary, programs like BDNA (7.7% of the time spent in memory-bound loops) show a negligible speed-up. The use of one wide bus can get close to the performance of two single buses when the percentage of memory operations that can be compacted is close to 100%. The programs with the highest percentage of compactable memory operations are DYFESM, MG3D and FLO52, whose speed-ups reach 94.18%, 97.73% and 97.9%, respectively, of the speed-up of two single buses. Using 2 wide buses instead of 2 single buses yields a remarkable speed-up in programs that combine a high percentage of memory accesses with a high percentage of stride-1 accesses. This can be observed in TRFD, which obtains a speed-up close to three with respect to the base configuration (close to two with respect to the 1WB configuration) and of 1.47 with respect to the 2SB configuration. QCD and TRACK have speed-ups of 2.46 and 2.77 relative to the baseline architecture, of 1.67 and 1.75 relative to the 1WB architecture, and of 1.32 and 1.42 relative to the 2SB configuration. Notice that the 2WB configuration is the one with the best performance, except for SPEC77, where the best performance is achieved by the configuration with two load and one store unidirectional buses. The 2L1SB configuration (as in the CRAY Y/MP) has a peak speed-up of three, which can be achieved in memory-bound programs with a 2:1 ratio of load to store operations. For example, TRFD has a load:store ratio of 65.9:34.1 and spends 98.4% of its time in memory-bound loops; in this case, the speed-up with respect to the baseline configuration is 2.85. Notice that the performance of this configuration is always smaller than that of the 2WB one, except for SPEC77. This is caused by the low number of compactable memory operations available in this program (only 14.7% of them).
Finally, the results shown in Figure 4 indicate that a low-cost bus configuration such as 1LS1LB-2FU (present in the Alpha 21164) has a performance close to that of the 2SB-2FU configuration.


[Figure 5: Register behaviour depending on the bus configuration: dynamic cumulative distribution of register requirements (% of cycles vs. number of registers, 0-224) for the six bus configurations.]

5.2 Impact of using wide buses on the register pressure

Adding more buses to a processor can increase the register requirements of a loop, mainly because a more aggressive bus configuration can reduce the II of the scheduled loop. Register pressure is, to some extent, proportional to the number of concurrently executed iterations. Reducing the II, in general, increases the number of stages into which the schedule is divided and, therefore, the number of simultaneously overlapped iterations. In addition, lifetimes can become longer, because a smaller II imposes more resource constraints than a larger one. With wide buses, moreover, the lifetime of a value can grow because the other value of the wide operation is not yet ready to be stored, or must be loaded a few cycles earlier. If the additional register requirements of wide buses were high, the performance advantage could be counteracted by the performance lost to additional spill code.

A detailed analysis of the loops reveals that the ones with an important difference in register requirements are those that use a small number of registers. In general, the loops with high register requirements show minor differences, especially when weighted by execution time. Figure 5 shows the percentage of execution time spent in loops that can be scheduled with a given number of registers. For instance, loops accounting for more than 65% of the execution time can be scheduled with a 64-register file in all the bus configurations, and the loops that require more than 64 registers behave the same in all the bus configurations tested. In conclusion, the increase in the register requirements of a wide-bus configuration does not have a significant impact on the final performance, since the main (relative) differences are in loops with low register requirements (which can be scheduled without spill code).
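The register requirements discussed here can be approximated by MaxLive, the maximum number of values simultaneously live in any cycle of the kernel. A small sketch (ours; real modulo schedulers refine this bound):

```python
def max_live(lifetimes, ii):
    """Count, per kernel cycle, how many lifetimes overlap. Lifetimes longer
    than the II wrap around modulo the II, which is exactly how overlapped
    iterations add register pressure."""
    pressure = [0] * ii
    for start, end in lifetimes:
        for c in range(start, end):      # value live in [start, end)
            pressure[c % ii] += 1
    return max(pressure)

# Hypothetical kernel: the same three lifetimes need more registers at II=2
# than at II=4, illustrating how reducing the II raises register pressure.
print(max_live([(0, 2), (0, 3), (1, 4)], ii=4))  # → 3
print(max_live([(0, 2), (0, 3), (1, 4)], ii=2))  # → 4
```

Lengthening any lifetime, as compacted wide accesses can do, raises the count in the same way.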

5.3 Benefits of Multiply-Add Fused FUs and their impact on the register pressure

The use of MAF FUs can increase performance in compute-bound loops and in loops whose recurrences include fusible operations in the critical path. Table 4 shows some characteristics of the tested programs for a configuration with 2 wide buses and 2 FUs: this is the configuration that achieved the highest performance in Section 5.1, and it puts maximum pressure on the FUs. The goal is to study the performance when these two FUs implement the FMA operation.

Table 4: Characterization of the programs with the 2WB-2FU configuration

Program | Rec. bound | FMA  | !FMA | Bal. | Res. bound | Mem. bound | Bal. | Comp. bound | FMA  | !FMA | %arit. ops. | %comp. ops.
adm     |    46.5    | 29.4 | 17.1 |  0   |    53.5    |    1.2     | 2.3  |    50.0     | 47.8 |  2.2 |    69.6     |   52.4
qcd     |     0      |  0   |  0   | 1.0  |    99.0    |    4.9     | 24.0 |    70.1     | 70.1 |  0   |    58.7     |   59.3
mdg     |     1.2    |  0   |  1.2 |  0   |    98.8    |    7.8     | 1.3  |    89.7     | 87.3 |  2.4 |    40.5     |   82.6
track   |     5.9    |  0   |  5.9 |  0   |    94.1    |   16.2     | 27.8 |    50.1     | 49.5 |  0.6 |    58.2     |   76.2
bdna    |    76.3    |  0.1 | 76.2 |  0   |    23.7    |    7.6     |  0   |    16.1     | 16.1 |  0   |    73.2     |   67.3
ocean   |    42.3    |  1.3 | 41.0 | 6.2  |    51.5    |    5.7     | 3.8  |    42.0     | 41.0 |  1.0 |    76.9     |   43.0
dyfesm  |    41.8    |  2.2 | 39.6 | 8.1  |    50.1    |    0.6     | 49.3 |     0.2     |  0.2 |  0   |    54.2     |   84.8
mg3d    |    46.8    |  0   | 46.8 |  0   |    53.2    |    2.8     | 1.5  |    48.9     | 48.9 |  0   |    79.0     |   43.8
arc2d   |     3.4    |  0   |  3.4 |  0   |    96.6    |    6.9     | 2.2  |    87.5     | 85.8 |  1.7 |    62.9     |   41.8
flo52   |     2.7    |  0   |  2.7 |  0   |    97.3    |    2.6     | 1.2  |    93.5     | 88.5 |  5.0 |    60.2     |   55.3
trfd    |     4.5    |  0   |  4.5 | 1.3  |    94.2    |     0      | 94.1 |     0.1     |  0.1 |  0   |    66.6     |   50.0
spec77  |    41.6    |  0.3 | 41.3 | 13.7 |    44.7    |    8.5     | 0.7  |    35.5     | 31.8 |  3.7 |    60.8     |   46.7

(FMA and !FMA subdivide the recurrence-bound and compute-bound columns; %arit. ops. is the percentage of arithmetic operations in compute-bound loops, and %comp. ops. the part of them that can be fused.)

The second column shows the percentage of time spent in loops constrained by recurrences. This percentage is further divided into columns FMA and !FMA, which indicate what part benefits from MAF FUs and what part cannot. Notice that the only program with a significant potential benefit is ADM (29.4%).


[Figure 6a: per-program speed-up (between 1.0 and 1.3) of the 2WB-2FMA configuration over 2WB-2FU. Figure 6b: percentage of cycles vs. number of registers (0 to 224) for the 2WB-2FU and 2WB-2FMA configurations.]

Figure 6: a) Speed-up of converting the ALUs into MAF ALUs for a 2WB configuration and b) register behaviour of both FU configurations

The next column shows the percentage of time spent in balanced loops (in terms of ResMII and RecMII), followed by the percentage of time bounded by resource availability. The resource-bound loops are divided again into memory bound, balanced and compute bound. For the compute-bound loops, the table shows what part of the percentage comes from loops that can benefit from FUs with the FMA operation (labelled FMA) and what part cannot (labelled !FMA). The last two columns show the percentage of arithmetic operations found in compute-bound loops, and what part of these operations can be fused into one FMA operation. Figure 6a shows that the program with the best performance is MG3D, even though none of its recurrence-bound loops can profit from the MAF FUs: 48.9% of its time is spent in compute-bound loops, 79% of whose operations are arithmetic, 43.8% of them fusable. Figure 6b compares the register requirements of the 2WB-2FU and 2WB-2FMA configurations. It shows that the use of MAF FUs has lower register requirements than the use of single FUs, for two reasons. First, single FUs require one extra register to hold the result of the multiply before the add begins execution. Second, with a MAF FU the register holding the value to be added to the product is deallocated a few cycles earlier (2 cycles in the tested FUs).
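The ResMII/RecMII classification used in the table can be sketched with two illustrative helpers (hypothetical code, not the compiler's actual implementation); fusing multiply-add pairs lowers the FU operation count and can move a loop from compute bound to balanced:

```python
import math

def res_mii(op_counts, unit_counts):
    # Resource-constrained minimum II: the most saturated resource dominates.
    return max(math.ceil(op_counts[r] / unit_counts[r]) for r in op_counts)

def rec_mii(circuits):
    # Recurrence-constrained minimum II: (total latency, dependence distance)
    # per recurrence circuit in the dependence graph.
    return max(math.ceil(lat / dist) for lat, dist in circuits)

units = {"bus": 2, "fu": 2}
print(res_mii({"bus": 4, "fu": 6}, units))  # 3: compute bound (6 ops, 2 FUs)
# Fusing two mul+add pairs into FMAs leaves only 4 FU operations:
print(res_mii({"bus": 4, "fu": 4}, units))  # 2: now balanced with the buses
print(rec_mii([(6, 2)]))                    # 3: 6-cycle recurrence, distance 2
```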

17

In conclusion, the use of MAF FUs yields a significant speed-up due to the increase in operations per cycle; moreover, the register requirements decrease, so the need for spill code is reduced and performance can increase further.
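The fusion itself can be sketched as a peephole pass (hypothetical opcode tuples, not the representation used by our tool): a multiply whose result feeds the next add collapses into a single FMA, removing the temporary register mentioned above.

```python
def fuse_maf(ops):
    """Fuse adjacent mul/add pairs into fma operations.

    ops -- list of (opcode, dest, src1, src2) tuples.
    """
    fused, i = [], 0
    while i < len(ops):
        op = ops[i]
        if (op[0] == "mul" and i + 1 < len(ops)
                and ops[i + 1][0] == "add" and op[1] in ops[i + 1][2:]):
            add = ops[i + 1]
            # The add operand that is not the product is the addend.
            addend = add[3] if add[2] == op[1] else add[2]
            # fma d, a, b, c computes d = a * b + c; the temporary op[1]
            # no longer needs a register of its own.
            fused.append(("fma", add[1], op[2], op[3], addend))
            i += 2
        else:
            fused.append(op)
            i += 1
    return fused

# Dot-product body: s = s + a * b
print(fuse_maf([("mul", "t", "a", "b"), ("add", "s", "s", "t")]))
# [('fma', 's', 'a', 'b', 's')]
```

A real pass would also check that the product has no other uses before dropping its register.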

5.4 Using wide buses and MAF FUs

In Sections 5.1 and 5.3 we analysed the performance obtained when using wide buses and when using FMA FUs. In this section we study both techniques in co-operation. To analyse the behaviour of the loops we change to a new baseline: 2 single buses and 2 single FUs.

Table 5: characterization of the programs with a 2SB-2FU configuration (percentages of execution time)

| Program | Rec. Bound | (MAF) | (!MAF) | Bal. | Res. Bound | Bal. | Mem. Bound | (WB) | (!WB) | Comp. Bound | (MAF) | (!MAF) |
|---------|-----------|-------|--------|------|------------|------|------------|------|-------|-------------|-------|--------|
| ADM     | 45.1 | 28.5 | 16.6 | 0    | 54.9 | 0.2 | 7.1  | 7.1  | 0   | 47.6 | 46.4 | 1.2 |
| QCD     | 0.1  | 0    | 0.1  | 0.7  | 99.2 | 0.3 | 43.3 | 43.3 | 0   | 55.6 | 55.6 | 0   |
| MDG     | 1.2  | 0    | 1.2  | 0    | 98.8 | 1.2 | 11.2 | 11.2 | 0   | 86.4 | 84.2 | 2.2 |
| TRACK   | 4.3  | 0    | 4.3  | 0    | 95.7 | 1.0 | 58.2 | 58.1 | 0.1 | 36.5 | 36.5 | 0   |
| BDNA    | 76.3 | 0.1  | 76.2 | 0    | 23.7 | 0   | 7.6  | 0.1  | 7.5 | 16.1 | 16.0 | 0.1 |
| OCEAN   | 40.7 | 1.2  | 39.5 | 6.0  | 53.3 | 0.1 | 12.7 | 7.3  | 5.4 | 40.5 | 39.5 | 1.0 |
| DYFESM  | 33.4 | 1.8  | 31.6 | 6.3  | 60.3 | 0.7 | 59.4 | 59.3 | 0.1 | 0.2  | 0.2  | 0   |
| MG3D    | 45.6 | 0    | 45.6 | 0    | 54.4 | 2.7 | 6.2  | 6.0  | 0.2 | 45.5 | 45.5 | 0   |
| ARC2D   | 3.3  | 0    | 3.3  | 0    | 96.7 | 3.2 | 14.1 | 8.3  | 5.8 | 79.4 | 77.7 | 1.7 |
| FLO52   | 2.6  | 0    | 2.6  | 0    | 97.4 | 4.4 | 14.9 | 14.8 | 0.1 | 78.1 | 73.4 | 4.7 |
| TRFD    | 3.1  | 0    | 3.1  | 0.9  | 96.0 | 1.2 | 94.8 | 94.8 | 0   | 0.1  | 0.1  | 0   |
| SPEC77  | 41.4 | 0.4  | 41.0 | 13.4 | 45.2 | 1.9 | 9.7  | 1.4  | 8.3 | 33.6 | 30.0 | 3.6 |

Columns WB and !WB split the memory-bound time into loops that can and cannot profit from wide buses.

Table 5 shows the characteristics of the programs for this baseline, and Figure 7 shows the speed-up of the programs when widening the buses (black column), when using FUs with the FMA operation (shaded column), and when both techniques are used simultaneously (white column). The program that achieves the best performance when wide buses are used is TRFD, because 94.8% of its time is spent in memory-bound loops, all of which profit from the wide buses. Notice that this program is scarcely affected by the use of MAF FUs, since the recurrence-bound and compute-bound loops that profit from MAF FUs account for only 0.1%. The program that achieves the best performance when MAF FUs are available is MG3D, because 45.5% of its time is spent in compute-bound loops, all of which profit from these FUs.


[Figure 7: per-program speed-ups (between 1.0 and 1.4) for the twelve programs and their geometric mean (gm).]

Figure 7: Speed-ups caused by widening buses, using MAF ALUs, and using both techniques simultaneously.

Notice that this program can also obtain a small performance improvement from widening the buses (6.0% of its time is spent in memory-bound loops that profit from wide buses). It is important to remark that in this program, as in most of the programs tested, the performance obtained when using both techniques is greater than the sum of the gains of the two techniques applied separately. The program with the best global performance is TRACK, which achieves a speed-up of 57.1%, while the speed-up is only 11.3% when using MAF FUs alone and 35.6% when using wide buses alone; the sum of both is lower than the speed-up obtained when both techniques are combined. The program with the worst global performance is BDNA. This matches the results of Table 5, which shows that only 16.1% of its time is spent in loops that profit from MAF FUs and only 0.1% in loops that profit from wide buses.
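This super-additive behaviour has a simple explanation: the II is bounded by the worst resource, so relieving only one bottleneck may not help until the other is relieved too. A toy model (hypothetical operation counts, not taken from the benchmarks) makes this visible:

```python
import math

def ii(mem_ops, fu_ops, buses, fus):
    # The initiation interval is limited by the most saturated resource.
    return max(math.ceil(mem_ops / buses), math.ceil(fu_ops / fus))

base = ii(mem_ops=8, fu_ops=8, buses=2, fus=2)   # 4 cycles: both saturated
wide = ii(mem_ops=8, fu_ops=8, buses=4, fus=2)   # 4: FUs still saturate
maf  = ii(mem_ops=8, fu_ops=4, buses=2, fus=2)   # 4: buses still saturate
both = ii(mem_ops=8, fu_ops=4, buses=4, fus=2)   # 2: both bottlenecks relieved
print(base / wide, base / maf, base / both)  # 1.0 1.0 2.0
```

Here neither technique alone gives any speed-up, yet together they double performance, mirroring in exaggerated form the behaviour observed above.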

5.5 Evaluation of the RS6000 and POWER2 architectures

The RS6000 [IBM94] and POWER2 [WhDh94] are two examples of microprocessors where these techniques have been implemented. The RS6000 microprocessor has 1 single bus between the register file and the first-level cache, and 1 MAF FU. The POWER2 microprocessor includes 2 wide buses and 2 MAF FUs.


[Figure 8: per-program speed-ups (between 1.0 and 4.0) for the 4SB-4FU, POWER2 and RS6000 configurations.]

Figure 8: Speed-ups for the RS6000 and POWER2

Figure 8 shows the performance of both processors using a 1SB-1FU configuration as the baseline. The RS6000 achieves an average speed-up close to 1.15, while the POWER2 achieves an average speed-up close to 2.45. We have compared these two microprocessors with a 4-single-bus, 4-single-FU architecture, which gives an upper limit on the speed-up that the POWER2 architecture could aim for. Although the average speed-up of the 4SB-4FU configuration is close to 3.2, it is remarkable that in some programs (QCD, DYFESM and TRACK) the speed-up of the POWER2 is very close to this maximum. The results obtained with the RS6000 and POWER2 architectures show that the techniques studied in this paper can significantly increase performance, making them cost-effective solutions.
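Assuming the averages quoted here (and the "gm" bar in Figure 7) denote the geometric mean over the twelve programs, they can be computed from per-program cycle counts as follows (hypothetical data, for illustration only):

```python
import math

def geo_mean_speedup(base_cycles, new_cycles):
    # Geometric mean of per-program speed-ups (base / new cycle counts).
    ratios = [b / n for b, n in zip(base_cycles, new_cycles)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Two hypothetical programs: one doubles in speed, one is unchanged.
print(round(geo_mean_speedup([100, 200], [50, 200]), 3))  # 1.414
```

The geometric mean is the usual choice for averaging speed-ups because it is independent of which configuration is taken as the baseline.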

6. Conclusions

Increasing the ILP of future microprocessors will require the combined application of techniques such as good branch prediction mechanisms (to feed the processor with a continuous instruction flow), latency reduction, an increase of the effective memory bandwidth, and better speculative execution mechanisms, along with an adequate number of functional units and registers. This paper has focused on the bandwidth problem and on the use of FMA FUs.


We have studied two mechanisms to increase the memory bandwidth: increasing the number of buses and making them wider. The use of wide buses is a cost-effective technique if there are memory accesses with stride 1 that can be compacted. Different bus configurations have been studied. The results show that, for numerical programs, a 1-wide-bus configuration performs well, with simpler hardware than a 2-single-bus configuration, while the register pressure does not vary significantly. The 2-wide-buses configuration achieves the best results of all the bus configurations tested. The use of MAF FUs is a cost-effective method to increase performance in applications that have fusable multiply and add operations. It benefits loops bounded by recurrences and loops bounded by the number of arithmetic operations. We have also evaluated both techniques working together: in most cases, the global performance increase is greater than the sum of the increases obtained by each technique separately. Finally, we have analysed two existing architectures, the IBM RS6000 and IBM POWER2 microprocessors. In the first case, the single MAF FU achieves an average speed-up of 1.15; in the second, an average speed-up of 2.45 has been achieved with respect to a 1-single-bus, 1-single-FU configuration, exceeding 3 in some programs (QCD and TRACK). Besides, the average speed-up of the POWER2 architecture with respect to a 2-single-buses, 2-FU architecture is 1.20 for the tested programs.
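The stride-1 compaction condition for wide buses can be sketched as a simple pass (a hypothetical helper; the actual compiler support is described in [LVLA97]): two consecutive, aligned word accesses collapse into one double-width bus transaction.

```python
def compact_stride1(addresses):
    """Pack stride-1 word accesses into double-width bus transfers.

    addresses -- sorted word addresses referenced by the loop body.
    Returns (address, width_in_words) bus transactions.
    """
    transfers, i = [], 0
    while i < len(addresses):
        if (i + 1 < len(addresses)
                and addresses[i + 1] == addresses[i] + 1
                and addresses[i] % 2 == 0):   # pair must be aligned (assumed)
            transfers.append((addresses[i], 2))
            i += 2
        else:
            transfers.append((addresses[i], 1))
            i += 1
    return transfers

print(compact_stride1([0, 1, 2, 3, 5]))  # [(0, 2), (2, 2), (5, 1)]
```

Five word accesses need only three bus transactions, which is the source of the bandwidth gain; a non-unit stride defeats the pairing, matching the conclusion above.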


References

[ABG+96] E. Ayguadé, C. Barrado, A. González, J. Labarta, J. Llosa, D. López, S. Moreno, D. Padua, F. Reig, Q. Riera and M. Valero. Ictíneo: A Tool for Instruction-Level Parallelism Research. Research Report UPC-DAC-1996-61, December 1996.

[AKW83] J.R. Allen, K. Kennedy and J. Warren. Conversion of control dependence to data dependence. In Proc. 10th Annual Symposium on Principles of Programming Languages, January 1983.

[BCKK88] M. Berry, D. Chen, P. Koss and D. Kuck. The Perfect Club benchmarks: Effective performance evaluation of supercomputers. Technical Report 827, Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign, November 1988.

[Cha81] A.E. Charlesworth. An approach to scientific array processing: The architectural design of the AP-120B/FPS-164 family. Computer, 14(9):18-27, 1981.

[CLC+95] D.C. Chang, D. Lyon, C. Chen, L. Peng, M. Massoumi, M. Hakimi, S. Iyengar, E. Li and R. Remedios. Microarchitecture of HAL's memory management unit. In CompCon 95, pp. 272-279, 1995.

[DeTo93] J.C. Dehnert and R.A. Towle. Compiling for the Cydra 5. Journal of Supercomputing, 7(1/2):181-227, 1993.

[ERPR95] J.H. Edmondson, P. Rubinfeld, R. Preston and V. Rajagopalan. Superscalar instruction execution in the 21164 Alpha microprocessor. IEEE Micro, 15(2):33-43, April 1995.

[GAB+88] R.B. Garner, A. Agrawal, F. Briggs, E.W. Brown, D. Hough, B. Joy, S. Kleiman, S. Muchnik, M. Namjoo, D. Patterson, J. Pendleton and R. Tuck. The scalable processor architecture (SPARC). In CompCon 88, pp. 278-283, 1988.

[HFH94] T.N. Hicks, R.E. Fry and P.E. Harvey. POWER2 floating-point unit: Architecture and implementation. IBM Journal of Research and Development, 38(5):525-536, September 1994.

[Hsu94] P.Y.T. Hsu. Designing the TFP microprocessor. IEEE Micro, 14(2):23-33, April 1994.

[Hun95] D. Hunt. Advanced performance features of the 64-bit PA-8000. In CompCon 95, 1995.

[IBM94] IBM Inc. RISC System/6000 PowerPC System Architecture. Edited by Frank Levine and Steve Thurber. Morgan Kaufmann Publishers, San Francisco, California, 1st edition, July 1994.

[Jol91] R. Jolly. A 9-ns 1.4-gigabyte/s 17-ported CMOS register file. IEEE Journal of Solid-State Circuits, 25:1407-1412, October 1991.

[Lam88] M. Lam. Software pipelining: An effective scheduling technique for VLIW machines. In Proc. of the SIGPLAN'88 Conference on Programming Language Design and Implementation, pp. 318-328, June 1988.

[LMM92] R.B. Lee, M. Mahon and D. Morris. Pathlength reduction features in the PA-RISC architecture. In CompCon 92, pp. 129-135, 1992.

[LVAG95] J. Llosa, M. Valero, E. Ayguadé and A. González. Hypernode Reduction Modulo Scheduling. In 28th International Symposium on Microarchitecture (MICRO-28), pp. 350-360, December 1995.

[LVAL94] J. Llosa, M. Valero, E. Ayguadé and J. Labarta. Register requirements of pipelined loops and their effect on performance. In Proc. 2nd Int. Workshop on Massive Parallelism: Hardware, Software and Applications, pp. 173-189, October 1994.

[LVLA97] D. López, M. Valero, J. Llosa and E. Ayguadé. Increasing Memory Bandwidth with Wide Buses: Compiler, Hardware and Performance Trade-offs. In Proc. of the 11th Int. Conf. on Supercomputing (ICS-11), pp. 12-19, July 1997.

[MAD92] W. Mangione-Smith, S.G. Abraham and E.S. Davidson. Register requirements of pipelined processors. In Int. Conf. on Supercomputing, pp. 260-271, July 1992.

[MHR90] R.K. Montoye, E. Hokenek and S.L. Runyon. Design of the IBM RISC System/6000 floating-point execution unit. IBM Journal of Research and Development, 34(1):59-70, January 1990.

[RaGl81] B.R. Rau and C.D. Glaeser. Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing. In Proc. of the 14th Annual Microprogramming Workshop, pp. 183-197, October 1981.

[Rau94] B.R. Rau. Iterative modulo scheduling: An algorithm for software pipelining loops. In Proc. of the 27th Annual International Symposium on Microarchitecture, pp. 63-74, November 1994.

[SoFr91] G.S. Sohi and M. Franklin. High-bandwidth data memory systems for superscalar processors. In ASPLOS-IV, pp. 53-62, April 1991.

[Vaj91] S. Vajapeyam. Instruction-level characterization of the CRAY Y-MP processor. PhD thesis, University of Wisconsin-Madison, 1991.

[WeEs88] N. Weste and K. Eshraghian. Principles of CMOS VLSI Design: A Systems Perspective. Addison-Wesley, 1988.

[WhDh94] S.W. White and S. Dhawan. POWER2: Next generation of the RISC System/6000 family. IBM Journal of Research and Development, 38(5):493-502, September 1994.
