MOERAE: Portable, Thread-Based Interface between Parallelizing Compilers and Shared-Memory Multiprocessors

Seon Wook Kim    Michael Voss    Rudolf Eigenmann
School of Electrical and Computer Engineering
Purdue University, West Lafayette, IN 47907-1285
Abstract

Shared-memory multiprocessor (SMP) machines have become widely available. As the user community grows, the importance of compilers that can translate standard, sequential programs onto this machine class is increasing. In this paper we deal with the issue of how such parallelizing compilers generate code that can be re-targeted easily and efficiently at different machines. The goal is to create a portable interface that will permit us to take advantage of the best possible compilers for all of today's SMP machines. Our portable interface is called MOERAE and consists of two parts, a postpass to the Polaris compiler and a portable runtime library. The MOERAE postpass translates parallel programs, generated by the Polaris parallelizer, into code that can be compiled with a sequential compiler plus calls to the MOERAE runtime library. The runtime library implements a microtasking scheme and is based on the widely-available Pthreads package. We will show measurements taken on a SUN Enterprise and an SGI Origin 2000 multiprocessor that demonstrate that (1) Polaris+MOERAE is a portable parallelizer that is more powerful than the newest commercially available parallelizers; (2) MOERAE programs perform close to or better than programs expressed in the native parallel directive languages; (3) several MOERAE programs outperform those expressed in the portable OpenMP parallel directive standard. We will also discuss the performance of several parallelization techniques on real machines and programs to show strengths and weaknesses of state-of-the-art parallel compilers. The public availability of our compiler and runtime library makes it possible to analyze the performance in detail and identify improvements. This offers advantages over the proprietary nature of commercial compilers, such as those for the new OpenMP API and their associated libraries.
This work was supported in part by DARPA contract #DABT63-95-C-0097, and an NSF CAREER award. This work is not necessarily representative of the positions or policies of the U. S. Government.
I. Introduction

Shared-memory multiprocessor (SMP) systems are being increasingly used, but nonportable programming interfaces and their underlying compilers still represent a significant challenge for the user community. Typically, one of two methods is used for parallelizing a shared-memory program: a user may take advantage of a restructuring compiler to parallelize a program, or use the parallel directive language supported by the machine to explicitly express the parallelism in the code. Unfortunately, the codes generated by these methods are usually not portable across architectures. Figure 1 gives an example of parallel directives and shows differences between the constructs available on SUN and SGI machines.

C SUN Directives
C$DOALL SHARED(A, N), PRIVATE(I)
      DO I = 1, N, 1
        A(I) = 0.0
      ENDDO

C SGI Directives
C$DOACROSS SHARE(A, N), LOCAL(I)
      DO I = 1, N, 1
        A(I) = 0.0
      ENDDO

Figure 1: Parallel Loop Directives Available on SUN and SGI Multiprocessors.

Although the examples show only syntactic differences, there are often semantic subtleties for which a direct text replacement would not suffice. Porting a program between SMPs can involve significant rewriting or retranslating work. Understanding the performance behavior of a given program on a new machine can be a further challenge. This is due to differences in the machine architectures, but also due to the compilers, which may vary substantially in their optimization capabilities. For example, a small parallel loop may be selected for parallel execution by one compiler, but left serial by another compiler due to estimated parallelization overheads. In this paper we
will discuss such performance differences. The results are being used in our ongoing project of developing compiler techniques that can orchestrate a set of optimizing transformations in the best possible way.

One important step towards providing portable parallelism has been made by the recent OpenMP parallel language initiative [1]. OpenMP defines an application program interface (API) for expressing parallelism. This language can also be used for expressing the output of a parallelizing source-to-source compiler. In related work we have evaluated this parallelization method [2]. Relative to this work, the effort described in the present paper tries to answer the question of whether a direct translation of a program into a portable thread-based form can be advantageous. To this end, we will present measurements of several compilation schemes on a range of programs and machines.

Our MOERAE translator and runtime library system distinguishes itself in two key aspects from current systems. First, it provides a portable translator and execution environment without proprietary software components. This not only makes the system available to the research community at large, but also gives one unrestricted access to all of its performance-critical parts. Second, the MOERAE runtime library supports dynamically-sized thread-private arrays and array reduction operations. Both of these capabilities have been shown to be of great importance for parallelizing real applications [3]. In this paper we present overall performance results of these capabilities. The full paper will detail the effects of various implementation schemes.

Our portable interface, MOERAE, consists of two parts, a postpass to the Polaris compiler [4] and a portable runtime library. The postpass translates parallel programs into
code that can be compiled with standard, sequential compilers plus calls to the MOERAE runtime library. The runtime library implements a microtasking scheme and is based on the widely-available Pthreads package. Currently we have used this system on a SUN Enterprise multiprocessor and an SGI Origin 2000.

[Figure 2 is a block diagram: sequential programs pass through the Polaris parallelizer, which emits parallel programs in one of three languages (an architecture-specific directive language, the OpenMP API, or MOERAE); these are processed, respectively, by an architecture-specific parallel compiler and libraries, a parallel OpenMP compiler and libraries, or a sequential compiler with the Pthreads libraries, all targeting SMP machines.]

Figure 2: Language and Translator System Used in Our Machine Environment.

Figure 2 gives an overview of our compilation system. The Polaris translator can generate output in various forms, including machine-specific dialects, the OpenMP language, and the MOERAE system. We will describe MOERAE and its implementation in the next section. In Section III, we will then analyze MOERAE's performance and compare it to other parallelization approaches. Section IV concludes the paper.
II. MOERAE

The MOERAE system expresses loop-level parallelism in Fortran programs. MOERAE consists of two major components: a means to translate sequential programs into a portable thread-based form, and a runtime library for creating, managing, and destroying these parallel threads. The program transformations are performed by a modified version of the Polaris [4] compiler. We have added a postpass to Polaris to transform parallel loops into subroutines. The loops are replaced with calls to a scheduler, which dispatches the subroutines to the participating threads. The master thread executes the modified main program, and the child threads execute the newly-created subroutines, as orchestrated by the scheduler.

MOERAE provides three forms for expressing reduction operations, called blocked reduction, privatized reduction, and expanded reduction [5, 6]. Briefly, in blocked reductions all reduction statements are enclosed within critical sections such that they are performed atomically within a parallel loop. In contrast, in both privatized and expanded reductions, the reduction variables are replicated versions of the original variable. The replication is done either by array privatization or by array expansion. Each processor performs the reduction into its local version of the replicated variable. At the end of the loop, a global reduction is performed from the replicated variable into the original reduction variable. For privatized reductions this is done within a critical section in a postamble at the end of the parallel loop. For expanded reductions the global reduction can be done after the parallel loop, because expanded variables live beyond the loop end and are shared by all processors. Before the parallel loop body, a preamble clears the replicated reduction variable.

Most compilers for parallel languages can transform only scalar reductions. Polaris, however, is able to also recognize array reductions (in which the reduction variable is an array). MOERAE's task includes the transformation of such operations into the proper parallel form. Figure 3 illustrates this scheme, and Figure 4 shows an example.
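Although the code that the MOERAE postpass emits is Fortran plus runtime calls (see Figure 4), the expanded scheme just described can be summarized in a short C sketch. This is only an illustration under assumed names (reduction_preamble, reduction_chunk, NTHREADS, and so on); none of these identifiers belong to MOERAE.

/* Illustrative C sketch (not MOERAE-generated code) of an expanded array
   reduction for a loop of the form  DO i = lo,hi : s(k(i)) = s(k(i)) + a(i).
   The reduction array s is expanded by a thread dimension: a preamble clears
   each copy, the loop body accumulates without synchronization, and a
   postamble after the parallel loop sums the copies back into s.            */
#include <string.h>

#define NTHREADS 4
#define M 64                            /* size of the reduction array s    */

static double s[M];                     /* original, shared reduction array */
static double s_exp[NTHREADS][M];       /* expanded copies in shared memory */

void reduction_preamble(int tid)        /* run by each thread before the loop */
{
    memset(s_exp[tid], 0, sizeof s_exp[tid]);
}

void reduction_chunk(const double *a, const int *k, int lo, int hi, int tid)
{
    for (int i = lo; i < hi; i++)
        s_exp[tid][k[i]] += a[i];       /* no critical section needed */
}

void reduction_postamble(void)          /* after the parallel loop */
{
    for (int t = 0; t < NTHREADS; t++)
        for (int j = 0; j < M; j++)
            s[j] += s_exp[t][j];
}

/* In the blocked form, the update s[k[i]] += a[i] would instead sit inside a
   critical section in the loop; in the privatized form, each copy lives in
   thread-private storage and is combined into s under a critical section in
   the postamble.                                                            */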
[Figure 3 contrasts the user view with the MOERAE implementation: in the user view, PROGRAM TEST contains a sequential section, a parallel loop (DO I = 1, 1000, 1), and another sequential section, and is handled by a parallel compiler and libraries. In the implementation, the loop body is extracted into SUBROUTINE LTEST_(...), which iterates DO I = INIT, FINAL, STEP, and the original loop is replaced by CALL SCHEDULING(LTEST_); the master thread runs the main program and the child threads run the subroutine, all compiled by a sequential compiler and linked with the MOERAE runtime libraries.]
Figure 3: Overview of the MOERAE Translator and Runtime Libraries.

The key factor in achieving performance with MOERAE was the design of a fast and efficient runtime library for the scheduler and work dispatcher. The runtime library uses a microtasking scheme, similar to the methods introduced in [7, 8, 9, 10, 11, 12], in order to reduce the overhead of creating threads each time a parallel section is encountered. That is, at the beginning of the program (by calling routine INITIALIZE_THREAD) all of the threads are created. Using spin-wait, the threads sleep during serial program sections and wake up for each parallel section.
C --- Original program ---
      PROGRAM SAMPLE
      REAL ARRAY, BRRAY
      INTEGER I
      PARAMETER (ISIZE=100)
      DIMENSION ARRAY(ISIZE), BRRAY(ISIZE)
      COMMON /DUMMY/ BRRAY
C the following loop is parallel
      DO I = 1, ISIZE, 1
        ARRAY(I) = 0.0
        BRRAY(I) = 0.0
      ENDDO
      STOP
      END

C --- Transformed program ---
      SUBROUTINE _SUBR_SAMPLE_DO_1(INIT000, MAX000, STEP000, MYCPUID, ARRAY)
      REAL ARRAY, BRRAY
      INTEGER INIT000, MAX000, STEP000, MYCPUID, I
      PARAMETER (ISIZE=100)
      DIMENSION ARRAY(ISIZE), BRRAY(ISIZE)
      COMMON /DUMMY/ BRRAY
      DO I = INIT000, MAX000, STEP000
        ARRAY(I) = 0.0
        BRRAY(I) = 0.0
      ENDDO
      RETURN
      END

      PROGRAM SAMPLE
      EXTERNAL BLOCK_SCHEDULING, INITIALIZE_THREAD, STOP_EVENT
      EXTERNAL _SUBR_SAMPLE_DO_1
      REAL ARRAY, BRRAY
      PARAMETER (ISIZE=100)
      DIMENSION ARRAY(ISIZE), BRRAY(ISIZE)
      COMMON /DUMMY/ BRRAY
      CALL INITIALIZE_THREAD()
      CALL BLOCK_SCHEDULING(1, ISIZE, 1, _SUBR_SAMPLE_DO_1, 1, ARRAY)
      CALL STOP_EVENT()
      STOP
      END
Figure 4: Sample Program Transformed by Polaris Using MOERAE. Note that subroutines INITIALIZE_THREAD and STOP_EVENT are called only once at program start and end, whereas BLOCK_SCHEDULING would be called for every parallel loop in a larger program.
In order to provide portability, the runtime library is implemented using the Pthreads package and is also available on Solaris threads. (Our current implementation on SGI IRIX 6.4 uses processes created with sproc(), because Pthreads are not yet fully supported there.)
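To make the microtasking idea concrete, the following C/Pthreads sketch shows persistent worker threads that spin-wait for work and a dispatch routine that hands each thread a block of iterations. It is a minimal illustration only: the names moerae_init and moerae_block_schedule are invented (they merely echo INITIALIZE_THREAD and BLOCK_SCHEDULING from Figure 4), and a real runtime needs memory barriers or atomics as well as a termination path (cf. STOP_EVENT), which are omitted here.

#include <pthread.h>

#define NTHREADS 4                       /* total threads, including the master */

typedef void (*loop_body)(int lo, int hi, int step, int tid, void *arg);

static struct {
    volatile int generation;             /* incremented once per parallel loop */
    volatile int done[NTHREADS];         /* per-thread completion stamps       */
    loop_body    body;
    int          init, max, step;
    void        *arg;
} work;

static void run_my_block(int tid)        /* block (chunked) schedule */
{
    int n     = (work.max - work.init) / work.step + 1;
    int chunk = (n + NTHREADS - 1) / NTHREADS;
    int lo    = work.init + tid * chunk * work.step;
    int hi    = lo + (chunk - 1) * work.step;
    if (hi > work.max) hi = work.max;
    if (lo <= work.max)
        work.body(lo, hi, work.step, tid, work.arg);
}

static void *worker(void *p)
{
    int tid = (int)(long)p, seen = 0;
    for (;;) {
        while (work.generation == seen)  /* spin-wait during serial sections */
            ;
        seen = work.generation;
        run_my_block(tid);
        work.done[tid] = seen;           /* report completion */
    }
    return NULL;
}

void moerae_init(void)                   /* cf. INITIALIZE_THREAD */
{
    for (long t = 1; t < NTHREADS; t++) {
        pthread_t th;
        pthread_create(&th, NULL, worker, (void *)t);
    }
}

void moerae_block_schedule(int init, int max, int step,
                           loop_body body, void *arg)   /* cf. BLOCK_SCHEDULING */
{
    work.body = body; work.init = init; work.max = max;
    work.step = step; work.arg  = arg;
    work.generation++;                   /* wake the spinning child threads */
    run_my_block(0);                     /* the master executes its own chunk */
    for (int t = 1; t < NTHREADS; t++)
        while (work.done[t] != work.generation)
            ;                            /* wait until every child has finished */
}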
III. Performance Analysis

The execution times of five benchmark programs (TRFD, MDG, BDNA, ARC2D, and FLO52) from the Perfect Club Benchmark Suite were used to evaluate the performance of MOERAE [13, 14, 15, 16]. The codes were run on a SUN Enterprise 4000 (Solaris 2.5) [17] and an SGI Origin 2000 (IRIX 6.4) [18]. For comparison, each code was parallelized (1) by the SUN native parallel compiler (AutoPar) [19], (2) by the SGI native parallel compiler (PFA) [20], (3) by Polaris using the architecture-specific directives [19, 20], (4) by Polaris using the OpenMP API [21], and (5) by Polaris using the MOERAE scheme. Although this paper aims at uncovering intrinsic performance differences between these approaches, it must be noted that specific implementations are being compared. We compare the codes as generated by each compiler, without detailed analysis of the modifications that could be made in each language to optimize performance.

A. Overview
Figures 5 and 6 show the speedup of these codes, relative to the serial execution time of the original codes, using the various parallelization methods on the SUN and SGI systems. On the SUN Enterprise, the performance of the portable MOERAE system is overall comparable to the others. In TRFD, MOERAE and the native directive method perform best. In MDG and FLO52, Polaris with the OpenMP API outperforms the others. In BDNA, MOERAE gives better performance on 4 processors. In ARC2D, MOERAE has the best performance.
[Figure 5 is a bar chart of speedups on 1, 2, and 4 processors for TRFD, MDG, BDNA, ARC2D, and FLO52 under the four parallelization methods.]

Figure 5: Speedup of Benchmarks as Executed on SUN Enterprise 4000. SUN + AutoPar refers to the codes as transformed by the native parallelizer (AutoPar), Polaris + Directive are the codes transformed by Polaris expressed with the SUN iMPact directive set, Polaris + MOERAE are codes transformed using MOERAE, and Polaris + OpenMP are Polaris-transformed codes using the OpenMP API.
The performance on the SGI Origin 2000 is different from that on the SUN Enterprise 4000. (At the time of these measurements, the MOERAE versions of BDNA and FLO52 had not yet been timed in dedicated mode.) In FLO52, TRFD, and MDG, Polaris + MOERAE gives the best performance. In BDNA, Polaris with the OpenMP API is best, but in ARC2D the SGI native parallelizer (PFA) outperforms the others. The following sections will discuss these measurements in more detail.
[Figure 6 is a bar chart of speedups on 1, 2, and 4 processors for TRFD, MDG, BDNA, ARC2D, and FLO52 under the four parallelization methods.]

Figure 6: Speedup of Benchmarks as Executed on SGI Origin 2000. SGI + PFA refers to the codes as transformed by the native parallelizer (PFA), Polaris + Directive are the codes transformed by Polaris expressed with the SGI directive set, Polaris + MOERAE are codes transformed using MOERAE, and Polaris + OpenMP are Polaris-transformed codes using the OpenMP API.

B. TRFD
In TRFD, two loops, OLDA DO100 and OLDA DO300, take more than 90% of the total serial execution time. As shown in Figure 7, the shapes of the figures on SUN and SGI are very similar, except for the Polaris + Directive method on SGI. On SGI, the Polaris + Directive method incurs huge parallelization overheads because of repeated thread identification function calls inside the parallel loops. This overhead can be reduced, but doing so is beyond the scope of this paper [22]. The Polaris + MOERAE method outperforms the others on both SUN and SGI. The native compiler is unable to improve the performance of these loops, because the optimization would involve several advanced transformations [3].
Polaris + OpenMP parallelizes the code, but the speedups are limited due to high cache conflicts in expanded arrays inside the parallel loops. This is caused by the fact that OpenMP does not allow dynamically-sized private arrays. Instead, such arrays need to be dynamically expanded before the loop and placed in shared memory. In contrast, MOERAE is able to place these arrays on each thread's stack.
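The difference can be pictured with a small C sketch; the function names are hypothetical and the loop body is only a stand-in. The point is where the dynamically-sized work array lives: in a shared, expanded block indexed by the thread number, or on the executing thread's own stack (which requires variable-length arrays, here C99 VLAs).

#include <stddef.h>

/* Expanded form: all per-thread copies sit next to each other in shared
   memory, so neighboring copies can conflict in the cache.               */
void loop_chunk_expanded(double *expanded, int m, int tid, int lo, int hi)
{
    double *my = expanded + (size_t)tid * m;   /* this thread's slice */
    for (int j = 0; j < m; j++) my[j] = 0.0;
    for (int i = lo; i < hi; i++)
        my[i % m] += 1.0;                      /* stand-in for the real body */
}

/* MOERAE-style form: the dynamically-sized private array lives on the
   thread's own stack, touching only memory that is private to the thread. */
void loop_chunk_stack(int m, int lo, int hi)
{
    double my[m];                              /* C99 variable-length array */
    for (int j = 0; j < m; j++) my[j] = 0.0;
    for (int i = lo; i < hi; i++)
        my[i % m] += 1.0;
}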
C. MDG
In MDG, two parallel loops, INTERF DO1000 and POTENG DO2000, take more than 95% of the total serial execution time. INTERF DO1000 has three array reductions and one scalar reduction, and POTENG DO2000 has two scalar reductions. Figure 8 shows the speedup of these two loops; OpenMP gives the best performance on SUN, and MOERAE does on SGI. There is no speedup with the native compiler on either machine. The loop INTERF DO1000 can be expressed in parallel using the SUN directives, but not POTENG DO2000, due to an error in the native compiler. On SGI, POTENG DO2000 can be parallelized. On SUN, OpenMP yields better performance than MOERAE in both loops, which is caused by a minor difference in the generated code. On SGI, the overall performance of Polaris + MOERAE is better than the others, even though the performance of the major loops is very similar. In INTRAF DO1100, Polaris + OpenMP uses critical sections repeatedly for synchronizing three scalar reductions inside the parallel loop. This is costly when the number of loop iterations is very large. Polaris + MOERAE uses expanded reductions and parallelizes the preamble of the reduction variable initialization, which does not need critical section operations.
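The cost difference can be seen in a C sketch of one thread's chunk of such a loop; the three sums and the helper names are invented for illustration and are not the generated code for INTRAF DO1100.

#include <pthread.h>

static double sum1, sum2, sum3;                 /* shared reduction scalars */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

/* Critical section inside the loop: one lock/unlock pair per iteration,
   which becomes expensive when the iteration count is large.            */
void chunk_with_critical(const double *a, int lo, int hi)
{
    for (int i = lo; i < hi; i++) {
        pthread_mutex_lock(&m);
        sum1 += a[i];
        sum2 += a[i] * a[i];
        sum3 += 1.0 / (1.0 + a[i] * a[i]);
        pthread_mutex_unlock(&m);
    }
}

/* Expanded/privatized style: accumulate locally without synchronization
   and combine once per chunk, so the locking cost does not grow with the
   iteration count.                                                      */
void chunk_with_local_sums(const double *a, int lo, int hi)
{
    double s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (int i = lo; i < hi; i++) {
        s1 += a[i];
        s2 += a[i] * a[i];
        s3 += 1.0 / (1.0 + a[i] * a[i]);
    }
    pthread_mutex_lock(&m);
    sum1 += s1; sum2 += s2; sum3 += s3;
    pthread_mutex_unlock(&m);
}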
Figure 7: Speedup of Most Time Consuming Parallel Loops in TRFD. (a) SUN Enterprise 4000. (b) SGI Origin 2000. The sets of three bars show 1, 2, and 4 processor speedup.
Figure 8: Speedup of Most Time Consuming Parallel Loops in MDG. (a) SUN Enterprise 4000. (b) SGI Origin 2000. The sets of three bars show 1, 2, and 4 processor speedup.
D. BDNA
The performance of BDNA has many interesting aspects, both on SUN and on SGI. (On SGI, BDNA was not run in dedicated mode.) BDNA has two major parallel loops, ACTFOR DO500 and ACTFOR DO240, which take more than 80% of the total serial execution time. The loop ACTFOR DO500 performs reductions on 3 arrays and one scalar variable. On SUN, the native compiler outperforms the others, as shown in Figure 9 (a). The native compiler uses privatized reductions, and it predicts the sizes of the privatized arrays from the statements inside the parallel loops at compile time. This allows it to save the overhead of dynamic memory allocation. Furthermore, because the array sizes are known at compile time, the compiler can unroll the loop in a postamble. This results in a slightly larger than linear speedup. On SUN, OpenMP performs worse than the other schemes. Its transformation makes use of several unnecessary COMMON blocks and large auxiliary variables, which cause significant overhead.

On SUN, the performance of privatized reductions versus expanded reductions in MOERAE is different from that in INTERF DO1000. Expanded reductions are about 50% better than the privatized reductions. Even if the array size of the reduction variables is large, the overheads of the preamble (memory allocation and initialization) and the postamble (summation computation and free) are negligible compared to the loop computation time, which accounts for the main difference in the two schemes. Table 1 shows the overhead of the preamble, postamble, and computation in ACTFOR DO500. The initialization and summation in an expanded reduction are parallelized. The actual computation dominates the overall execution time. The overhead of the memory allocation in the privatized reduction is larger than that in an expanded reduction, because the privatized reduction allocates separately for each thread from a common pool, guarded by a lock/unlock pair. The overhead of initialization in the privatized reduction is less, because its variables are localized to threads. But the overhead of the summation in the privatized reduction is larger because of the overhead of lock and unlock operations. The reason for the difference in actual computation time is more subtle and required a detailed study of the assembly code. We have found that in neither case is the base address of the reduction array placed in a register. In the privatized reduction code the memory load and use of this address occur in two consecutive instructions, while in the expanded reduction case the load is moved upward in the instruction stream. The latter allows the pipelined SPARC architecture [23] to overlap this memory load with the subsequent instructions. For the code generator, the difference between the two reduction schemes is merely that of a subroutine parameter versus an address on the local stack. This raises code generation and register allocation issues beyond the scope of this paper.
In ACTFOR DO240, which has 12 array reductions and one scalar reduction, OpenMP gives better performance than the others. The ranks of the reduction arrays are determined at compile time, and hence no memory allocation is needed. Although both OpenMP and MOERAE use similar scheduling schemes, the specific schedules generated in these cases caused the measured performance differences. The performance of Polaris + Directive is worse than Polaris + MOERAE, since MOERAE interchanges the loop in the initialization of the reduction arrays, which reduces cache misses. The method SUN + AutoPar does not parallelize the loop ACTFOR DO240, but does so for ACTFOR DO235 and DO237, which are large, inner loops of ACTFOR DO240.
Figure 9: Speedup of Most Time Consuming Parallel Loops in BDNA. (a) SUN Enterprise 4000. (b) SGI Origin 2000. The sets of three bars show 1, 2, and 4 processor speedup.
Table 1: Composition of Execution Time (Seconds) in Loop ACTFOR DO500 when Using Privatized Reductions and Expanded Reductions on SUN Enterprise 4000.

Reduction   Statement            Number of Processors
                                 1          2          4
Privatized  Memory Allocation    0.000186   0.000358   0.000605
            Initialization       0.000109   0.000108   0.000098
            Computation          2.479811   1.189530   0.600789
            Summation            0.000192   0.000290   0.000345
            Free                 0.000020   0.000011   0.000011
Expanded    Memory Allocation    0.000293   0.000255   0.000284
            Initialization       0.000100   0.000223   0.000345
            Computation          1.486813   0.760792   0.396877
            Summation            0.000163   0.000205   0.000261
            Free                 0.000019   0.000012   0.000015
The performance on SGI is quite different from that on SUN. First, the native compiler can parallelize none of the loops. The Polaris + Directive method incurs the parallelization overhead in ACTFOR DO500 because of thread identification function calls in reduction arrays and dynamically-allocated memory access costs, which do not appear in ACTFOR DO240. The Polaris + MOERAE method gives moderate performance even though it was not run in dedicated mode. OpenMP performs best in ACTFOR DO500 because of the unrolled loop, but worst in ACTFOR DO240, which is due to the synchronization of reduction variables and the scheduling schemes. In ACTFOR DO240, Polaris + Directive and Polaris + MOERAE use interleaved scheduling, but OpenMP does not.
E. ARC2D
Figure 10 shows the major parallel loops in ARC2D, which take more than 50% of the total serial execution time. In STEPFX DO230, STEPFX DO210, and FILERX DO19, there is superlinear speedup because Polaris, SUN AutoPar, and SGI PFA all perform loop interchange to create stride-1 access patterns [22]. However, in the optimized serial version, which we use for speedup = 1, this transformation is not applied. The performance of the different translation schemes shown in Figure 10 is very similar. The reason for the worse overall performance of OpenMP on SUN, shown in Figure 5, is that many smaller loops incur higher overheads due to the longer parallel-loop startup latency. On SGI, the automatic parallelizer PFA and OpenMP incur the least overhead in small loops.
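Loop interchange for stride-1 access can be illustrated with a generic C fragment (C arrays are row-major, so stride-1 means the innermost loop varies the last subscript; in Fortran, which is column-major, the first subscript plays this role). This is not the actual ARC2D loop, only a sketch of the transformation.

#define N 512
static double x[N][N], y[N][N];

/* Before interchange: the inner loop strides through memory by N elements,
   so almost every access starts a new cache line.                          */
void copy_strided(void)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            y[i][j] = 2.0 * x[i][j];
}

/* After interchange: the inner loop touches consecutive elements (stride-1),
   which makes much better use of each cache line.                          */
void copy_stride1(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            y[i][j] = 2.0 * x[i][j];
}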
F. FLO52

In FLO52, the execution times of the major loops are very small, causing large parallelization overheads. (On SGI, FLO52 was not run in dedicated mode.) The loops from DFLUX DO30 to EULER DO46 in Figure 11 take more than 70% of the total serial execution time. In most cases, the performance is very similar, except for PSMOO DO40 and PSMOO DO80 using SUN directives. Due to excessive multi-version loop generation, the SUN directive method often introduces unnecessary overheads. Similarly to ARC2D, OpenMP shows the least parallelization overhead.
[Figure 10 plots the loops STEPFX DO230, STEPFX DO210, XPENTA DO11, FILERY DO39, FILERX DO19, XPENT2 DO11, and STEPFY DO435.]

Figure 10: Speedup of Most Time Consuming Parallel Loops in ARC2D. (a) SUN Enterprise 4000. (b) SGI Origin 2000. The sets of three bars show 1, 2, and 4 processor speedup.
[Figure 11 plots the loops DFLUX DO30, EFLUX DO30, EFLUX DO10, PSMOO DO40, EULER DO50, DFLUX DO60, PSMOO DO80, EFLUX DO40, EULER DO40, and EULER DO46.]

Figure 11: Speedup of Most Time Consuming Parallel Loops in FLO52. (a) SUN Enterprise 4000. (b) SGI Origin 2000. The sets of three bars show 1, 2, and 4 processor speedup.
IV. Conclusion

We have presented a compiler/runtime library interface, MOERAE, which provides portability of compiler-parallelized programs across shared-memory multiprocessor systems. We have shown that this system yields comparable and often better performance than other parallelization methods, such as using native parallelizing compilers, using machine-specific parallel directives, and using standard OpenMP directives. We have presented performance results of several benchmark applications at the overall and the loop level. They demonstrate the performance portability of the MOERAE system. Our results hold for both machines that we measured, a SUN Enterprise 4000 and an SGI Origin 2000 system, although the machines can perform quite differently for individual loops. One problem in MOERAE that we identified is that small loops incur relatively large overheads in the SGI implementation. We are currently developing improved synchronization schemes which consider the specific SGI hardware organization.

Our portable interface requires only that sequential Fortran and C compilers be available on the target machine (for example, GNU gcc, g77, or native sequential compilers). Therefore, its performance does not depend on the availability of a native parallel compiler and runtime library on the target system. In this way, MOERAE allows one to take advantage of the best compilers for today's SMP systems. As an example, we have shown significant performance gains of the Polaris parallelizer over the native compilers available on the target machines.

The greatest growth in the parallel machine market has been in lower-end systems.
These machines often use operating systems such as WIN/NT or LINUX, which have not traditionally been used for parallel processing. Hence, their parallel compilers may not yet exist or may not yet have fully matured. MOERAE would allow portability across these systems as well as across the variety of established multiprocessor machines.

One question addressed in this paper is how we can implement a parallelizing compiler and its related libraries in a performance-portable way. To this end we have studied the detailed performance behavior of a range of applications. We have found that the performance is influenced by a large number of compiler parameters. Among others, we have found that the backend compilers perform optimizations in an inconsistent manner. This calls for a tighter integration of the advanced program analysis capabilities of our parallelizing compiler with the backend optimizer. The MOERAE framework and our performance studies are a basis for this.

MOERAE can be viewed as a parallel program translation and execution system that is an alternative to the recently proposed OpenMP portable language standard and its corresponding compilers/libraries. The Polaris compiler can generate code through both interfaces, MOERAE and OpenMP, and we have discussed performance differences. More important than the specific differences is the fact that MOERAE is fully available to the research community. It allows the performance behavior to be analyzed in detail and improvements to be developed. This contrasts with the proprietary compiler and runtime environments that represent the state of the art on SMP machines. In an ongoing project, we are currently extending MOERAE to also offer an OpenMP API for parallel programs. This will effectively provide a freely available, portable OpenMP language system to the research community.
V. References

[1] Kuck and Associates, http://www.openmp.org/. OpenMP: A Proposed Industry Standard API for Shared Memory Programming, October 1997.

[2] Mike Voss and Rudolf Eigenmann. Generating portable shared-memory applications using OpenMP. Technical Report ECE-HPCLab-98207, Purdue University, School of Electrical and Computer Engineering, 1998.

[3] R. Eigenmann, J. Hoeflinger, and D. Padua. On the Automatic Parallelization of the Perfect Benchmarks. IEEE Transactions on Parallel and Distributed Systems, 9(1):5-23, January 1998.

[4] W. Blume, R. Doallo, R. Eigenmann, J. Grout, J. Hoeflinger, T. Lawrence, J. Lee, D. Padua, Y. Paek, B. Pottenger, L. Rauchwerger, and P. Tu. Parallel Programming with Polaris. IEEE Computer, pages 78-82, December 1996.

[5] Jee Ku. The design of an efficient and portable interface between a parallelizing compiler and its target machine. Master's thesis, University of Illinois at Urbana-Champaign, Department of Electrical Engineering, December 1995.

[6] Bill Pottenger and Rudolf Eigenmann. Idiom Recognition in the Polaris Parallelizing Compiler. Proceedings of the 9th ACM International Conference on Supercomputing, pages 444-448, 1995.

[7] M. Guzzi. Multitasking runtime systems for the Cedar multiprocessor. Master's thesis, University of Illinois at Urbana-Champaign, Department of Electrical Engineering, July 1986.

[8] R. Eigenmann, J. Hoeflinger, G. Jaxon, and D. Padua. Cedar Fortran and its Restructuring Compiler. In A. Nicolau, D. Gelernter, T. Gross, and D. Padua, editors, Advances in Languages and Compilers for Parallel Processing, pages 1-23. MIT Press, 1991.

[9] International Business Machines Corporation. Parallel FORTRAN: Language and Library Reference, 1988. SC23-0431-0.

[10] C. Beckmann and C. Polychronopoulos. The effect of barrier synchronization and scheduling overhead on parallel loops. In Proceedings of the 1989 International Conference on Parallel Processing, volume 2, pages 200-204, August 1989.

[11] Constantine Polychronopoulos, Nawaf Bitar, and Steve Kleiman. nanoThreads: A User-Level Threads Architecture. Proceedings of the 1993 International Conference on Parallel Computing Technologies, Moscow, Russia, September 1993.

[12] G. Fox et al. Common Runtime Support for High-Performance Parallel Languages, Parallel Compiler Runtime Consortium. In Supercomputing '93, pages 752-757, November 1993.

[13] M. Berry, D. Chen, P. Koss, D. Kuck, L. Pointer, S. Lo, Y. Pang, R. Roloff, A. Sameh, E. Clementi, S. Chin, D. Schneider, G. Fox, P. Messina, D. Walker, C. Hsiung, J. Schwarzmeier, K. Lue, S. Orszag, F. Seidl, O. Johnson, G. Swanson, R. Goodrum, and J. Martin. The Perfect Club Benchmarks: Effective Performance Evaluation of Supercomputers. International Journal of Supercomputer Applications, 3(3):5-40, Fall 1989.

[14] George Cybenko, Lyle Kipp, Lynn Pointer, and David Kuck. Supercomputer Performance Evaluation and the Perfect Benchmarks. Proceedings of ICS, Amsterdam, Netherlands, pages 254-266, March 1990.

[15] G. Cybenko, J. Bruner, S. Ho, and S. Sharma. Parallel Computing and the Perfect Benchmarks. Presented at the International Symposium on Supercomputing, Fukuoka, Japan, November 6-8, 1991.

[16] William Blume and Rudolf Eigenmann. Performance Analysis of Parallelizing Compilers on the Perfect Benchmarks Programs. IEEE Transactions on Parallel and Distributed Systems, 3(6):643-656, November 1992.

[17] SUN Microsystems Inc., Mountain View, CA, http://www.sun.com/servers/enterprise/e4000/index.html. SUN Enterprise 4000.

[18] Silicon Graphics Inc., http://www.sgi.com/origin/2000/index.html. Origin 2000.

[19] Sun Microsystems Inc., Mountain View, CA. Fortran Programmer's Guide, 1996.

[20] Silicon Graphics Inc. SGI Fortran Compiler.

[21] Kuck and Associates, Inc. Guide Reference Manual, Version 2.1, September 1996.

[22] Mike Voss. Portable loop-level parallelism for shared-memory multiprocessor architectures. Master's thesis, Electrical and Computer Engineering, Purdue University, December 1997.

[23] Richard P. Paul. SPARC Architecture, Assembly Language Programming, & C. Prentice Hall, Englewood Cliffs, NJ, 1994.