Proc. Natl. Sci. Counc. ROC(A) Vol. 23, No. 6, 1999. pp. 751-765

A Portable Parallelizing Compiler with Loop Partitioning

CHAO-TUNG YANG*, SHIAN-SHYONG TSENG**,†, MING-CHANG HSIAO**, AND SHIH-HUNG KAO**

*ROCSAT Ground System Section, National Space Program Office, Hsinchu, Taiwan, R.O.C.
**Department of Computer and Information Science, National Chiao Tung University, Hsinchu, Taiwan, R.O.C.

(Received December 4, 1998; Accepted May 14, 1999)

ABSTRACT

Multithreaded programming support seems to be the most obvious approach to helping programmers take advantage of operating system parallelism. In this paper, we present the design and implementation of a portable FORTRAN parallelizing compiler (PFPC) with loop partitioning on our AcerAltos-10000 multiprocessor system, running an OSF/1 multithreaded OS. In order to port the PFPC to other system environments, we define a minimal set of thread-related functions and data types, called B Threads, that form the kernel used in the PFPC to support execution on different platforms. Our compiler is highly modularized so that porting to other platforms is very easy, and it can partition parallel loops into multithreaded codes based on several loop-partitioning algorithms. The experimental results clearly show that the compiler achieves good speedup. Our goal is to construct a high-performance, portable FORTRAN parallelizing compiler on a shared-memory multiprocessor system. The results of this study make theoretical and technical contributions to the design of a high-performance parallelizing compiler.

Key Words: parallelizing compiler, loop parallelization, multithreaded, B threads, single-to-multiple threads translator, loop partitioning, speedup

I. Introduction

The last decade has seen the coming of age of parallel processing. Many different classes of multiprocessors have been designed and implemented in industry and academia (Hwang, 1993; Gates and Peterson, 1994). To achieve high speedup in such systems, it seems necessary that tasks be decomposed into several sub-tasks, which can be executed on different processors in parallel. Parallelizing compilers analyze sequential programs (Shen et al., 1990; Banerjee, 1988; Wolfe, 1989; Zima and Chapman, 1990; Bacon et al., 1993; Banerjee et al., 1993; Blume et al., 1994a, 1994b; Yang et al., 1994) in order to detect hidden parallelism and use this information for automatic restructuring of sequential programs into parallel sub-tasks for multiprocessors using scheduling algorithms (Tzen and Ni, 1993). In addition to the computer architectures, operating systems also provide services needed to achieve parallelism. Multithreading support seems

to be the most obvious approach to helping programmers take advantage of parallelism in operating systems. For example, Mach, OSF/1, Solaris, and Microsoft Windows NT are operating systems which support multithreading (Loepere, 1992a, 1992b). These operating systems usually have packages for handling multiple threads (Boykin et al., 1993), e.g., the C Threads package in Mach and the P Threads package in OSF/1. Although multithreaded multiprocessors are powerful, we often lack good parallelizing compilers that can help programmers exploit parallelism and improve performance. Therefore, it has become an important issue to develop parallelizing compiling techniques that can exploit the potential power of multiprocessors in multithreaded operating systems. In particular, loops are such a rich source of parallelism that their parallelization leads to considerable improvement of efficiency on multiprocessors (Polychronopoulos, 1988; Zima and Chapman, 1990), so we focused in this study only on the parallelism of loops. In brief, the data dependence analysis problem

†To whom all correspondence should be addressed.


is that of determining whether two references to the same array within a nest of loops may refer to the same elements of that array (Li et al., 1990; Kong et al., 1991; Goff et al., 1991; Maydan et al., 1991; Pugh, 1992; Wolfe and Tseng, 1992; Pugh and Wonnacott, 1994). If there is no cross-iteration dependence in a loop, then that loop can be executed in parallel on separate processors as a DOALL loop. If there are loop-carried dependences, the loop is referred to as a DOACROSS loop, whose iterations are executed either sequentially or in parallel with synchronization instructions enforced within the body of the concurrent loop, in which case some run-time overhead is incurred.

In this paper, we present the design and implementation of a portable FORTRAN parallelizing compiler (PFPC) with loop partitioning on an AcerAltos-10000 multiprocessor system (Acer, Inc., 1991), running OSF/1. In order to port the PFPC to other system environments, we define a minimal set of thread-related data types and functions, called B Threads, that are required for an operating system to support execution of the PFPC. This compiler is highly modularized and includes routines which can be used to generate thread-specific codes and partitioned loops for different platforms; that is, it is portable (Polychronopoulos and Kuck, 1987; Hummel et al., 1992; Tzen and Ni, 1993). The experimental results clearly show that our compiler achieves good speedup. The ultimate goal is to construct a high-performance and portable FORTRAN parallelizing compiler for shared-memory multiprocessors.

The rest of this paper is organized as follows. Section II introduces the concepts of data dependence analysis and parallel loop partitioning. Section III presents the system environment and the B Threads package for our compiler. In Section IV, we describe our approach to designing and implementing a parallelizing compiler for multithreaded OSF/1. In Section V, we present some experimental measurements. Finally, concluding remarks and future directions are given in Section VI.

II. Background

1. Data Dependences

A data dependence(1) is said to exist between two statements S1 and S2 if there is an execution path from S1 to S2, if both statements access the same memory


Fig. 1. The model of a nested loop for data dependence analysis.

location, and if at least one of the two statements writes the memory location (Pugh, 1992). There are three types of data dependences: true (flow) dependence occurs when S1 writes a memory location that S2 later reads; anti-dependence occurs when S1 reads a memory location that S2 later writes; output dependence occurs when S1 writes a memory location that S2 later writes.

The nested loop model assumes that the increment step is normalized to 1. Suppose that we want to test whether or not there exists a dependence from statement S1 to S2 in the nested loop model shown in Fig. 1. Let A = (α1, α2, α3, ..., αn) and B = (β1, β2, β3, ..., βn) be integer vectors of n indices within the ranges of the upper and lower bounds of the n loops in the loop model. There is a dependence from S1 to S2 if and only if there exist A and B such that A is lexicographically less than or equal to B and f(A) = g(B), where f and g are functions from Z^n to Z. Otherwise, the two array reference patterns are independent. If these dependences exist between statements in the same iteration, they are called loop-independent dependences; if they exist between statements in different iterations, they are called loop-carried dependences.

There are two types of loop parallelization in parallelizing compilers, DOALL and DOACROSS loops. A loop can validly be transformed into a DOALL(2) loop if it contains no loop-carried dependence (LCD). If there are LCDs between different iterations, then the loop can be transformed into a DOACROSS loop. All the iterations of a DOACROSS loop can be executed in parallel, like a DOALL loop, but synchronization instructions(3) are inserted to preserve the dependence relations.
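To make the test concrete, consider a hypothetical single loop (our example, not from the paper) in which S1 writes A(2i) and S2 reads A(2i+1), so that f(α) = 2α and g(β) = 2β + 1. A dependence from S1 to S2 would require

    \exists\,\alpha,\beta \in [1, N]:\ \alpha \le \beta \ \wedge\ 2\alpha = 2\beta + 1,

and since 2α is always even while 2β + 1 is always odd, no integer solution exists; the two references are independent and the loop can be executed as a DOALL loop.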

(1) Data dependence is normally defined with respect to the set of variables which are used and modified by a statement, denoted by the In/Out sets.
(2) All iterations of a DOALL loop can be executed in parallel to achieve high speedup in multiprocessor systems.
(3) The SIGNAL and WAIT statements are used to enforce synchronization between processes.


Fig. 2. An example of a DOACROSS loop.

Table 2. Sample Partition Sizes (N = 1000 and P = 4)

  Scheme         Partition sizes
  SS             1 1 1 1 1 1 1 1 1 1 1 1 1 ...
  CSS(125)       125 125 125 125 125 125 125 125
  CSS/4          250 250 250 250
  GSS            250 188 141 106 79 59 45 33 25 ...
  Factoring      125 125 125 125 62 62 62 62 31 ...
  TSS(88, 12)    88 84 80 76 72 68 64 60 ... 12
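The GSS and TSS rows of Table 2 can be reproduced directly from the formulas in Table 1 below. The following minimal C sketch is our own illustration (the function names are not part of the PFPC) and assumes the chunk sizes are rounded as in Table 2:

    #include <stdio.h>

    /* GSS: K_i = ceil(R_i / P), with R_0 = N and R_{i+1} = R_i - K_i */
    static void gss_sizes(int n, int p) {
        for (int remaining = n; remaining > 0; ) {
            int k = (remaining + p - 1) / p;
            printf("%d ", k);
            remaining -= k;
        }
        printf("\n");
    }

    /* TSS(f, l): I = 2N/(f + l) chunks, delta = (f - l)/(I - 1),
     * K_i = f - (i - 1) * delta; assumes the divisions are exact. */
    static void tss_sizes(int n, int f, int l) {
        int chunks = (2 * n) / (f + l);
        int delta  = (f - l) / (chunks - 1);
        for (int i = 0; i < chunks; i++)
            printf("%d ", f - i * delta);
        printf("\n");
    }

    int main(void) {
        gss_sizes(1000, 4);       /* 250 188 141 106 79 59 45 33 25 ... */
        tss_sizes(1000, 88, 12);  /* 88 84 80 76 72 68 64 60 ... 12     */
        return 0;
    }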

If, on the other hand, there is a dependence cycle, the loop may have to be executed sequentially, like a DO loop. For example, consider the program segment shown in Fig. 2(a). Since there is an LCD with a constant dependence distance of 5, loop I can be transformed into a DOACROSS loop with the synchronization primitives shown in Fig. 2(b) (Wolfe, 1989; Zima and Chapman, 1990). We (1) generate a SIGNAL statement immediately following S1 in the program text, SIGNAL(S1), and (2) generate a WAIT statement immediately before S2, WAIT(S1, I−d), where the dependence distance d is equal to 5 (Zima and Chapman, 1990).

2. Previous Loop Partitioning Algorithms

If a loop can be executed in parallel, we want to break it down into a set of tasks run on different processors (Gupta, 1992). Task granularity, an important issue in loop partitioning, heavily influences load balancing; a good loop-partitioning algorithm therefore achieves good load balancing with only a small overhead. Several loop-partitioning methods are available from different loop scheduling algorithms, for example, SS, GSS (Polychronopoulos and Kuck, 1987), CSS, Factoring (Hummel et al., 1992), and TSS (Tzen and Ni, 1993). Assume that the number of processors available is P, the number of iterations of the DOALL loop is N, and the size of the ith partition is K_i. Formulas for calculating K_i in the different algorithms are listed in Table 1, where the CSS/k algorithm partitions the DOALL loop into k equal-sized chunks.

Table 1. Various Loop Scheduling Algorithms

  Scheme       Formula
  SS           K_i = 1
  CSS(k)       K_i = k
  CSS/λ        K_i = ⌈N/λ⌉
  GSS          K_i = ⌈R_i/P⌉, where R_0 = N and R_{i+1} = R_i − K_i
  Factoring    K_i = ⌊(1/2)^⌈i/P⌉ · N/P⌋
  TSS(f, l)    K_i = f − (i−1)·δ, where I = ⌈2N/(f+l)⌉ and δ = (f−l)/(I−1)

Table 2 gives sample partition sizes for SS, CSS(125), CSS/4, GSS, Factoring, and TSS(88, 12) when N = 1000 and P = 4. We must emphasize here that some differences exist between loop partitioning and loop scheduling. A chunk produced by loop scheduling is executed by a particular processor until it is finished, but this is not always true of DO loop partitioning. In our implementation, one partition is mapped to one thread, and the scheduling of threads is the responsibility of the operating system, OSF/1.

III. Our System Environment

In this section, we describe the system environment in which the parallelizing compiler runs. Our target machine is the AcerAltos-10000 multiprocessor system, running the OSF/1 multithreaded OS, which provides C Threads and P Threads runtime libraries. For portability, a brief definition of B Threads, the kernel required to use our compiler, is also given; the compiler can easily be ported to other thread packages as well.

1. Target Machine

Our PFPC runs on the AcerAltos-10000 system, which is a PC-based shared-memory, symmetric multiprocessor computer designed for departmental client/server environments (Acer, Inc., 1991). The system includes up to four i486-DX (33-MHz) CPUs, an 8K internal cache and a 128K external cache per CPU, 32 MB of shared memory, and a 64-bit high-speed frame bus. Due to the symmetric architecture, computation tasks can easily be distributed to any available processor, which means that balanced loading of all processors can be obtained.

2. Multithreaded Operating Systems and Thread Packages

The Mach operating system, developed at Carnegie Mellon University for multiprocessing, is a kernelized operating system.

Table 3. Two Basic Thread Functions of B Threads

  Thread_Create()   This function creates a thread and returns a thread ID that may be used in other thread operations.
  Thread_Detach()   Usually, the runtime library must retain information about threads even after they have terminated. Thread operation can be optimized by providing advice that allows the library to release its records about a given thread using this function.

Table 4. The Data Types and Thread Functions Used in Mutex Objects

  Mutex_Object     A data type used in declaring mutex objects. We use a mutex object to ensure that access to shared data is mutually exclusive in time.
  Mutex_Init()     This function initializes a mutex object. The mutex object is accessed only through the following two standard atomic operations.
  Mutex_Lock()     Before accessing a shared datum in a thread, we must acquire a lock on the mutex object. When several threads try to acquire a lock at the same time, only one thread is granted access; the other threads are blocked until the mutex is unlocked, and then one blocked thread is signaled and granted the lock.
  Mutex_Unlock()   This function unlocks a locked mutex object, usually when access to a shared object is complete.

Mach supports the fundamental primitives by means of minimized kernel abstractions. There are five basic abstractions exported by the Mach kernel: task, thread, port, message, and memory object. A task is the basic unit of resource management, with a large address space and port rights that protect access to system resources, such as processors, communication capabilities, and virtual memory. A thread is a lightweight process executed within a task; it is the basic unit of CPU utilization and contains the minimal processing state associated with a computation. It has been estimated that creating a thread takes about one tenth the time of a UNIX fork/exec combination. All the threads within a task share the resources of that task. A thread is roughly equivalent to an independent program counter within a task, that is, an individual flow of control within a task; as such, it has its own private processor state, and a thread running within a task is then a process. Threads are the basic units of scheduling: they are scheduled onto processors by the Mach kernel and may run in parallel on a multiprocessor. Mach provides a C Threads package, a set of low-level, language-independent primitives for manipulating threads of control; the package is a runtime library that provides C Threads function calls and allows parallel programming in C under the Mach operating system.

The operating system run on our target machine was the OSF/1 multithreaded operating system (Boykin et al., 1993). The kernel services provided in OSF/1 were derived from Mach. OSF/1 provides the C Threads and P Threads packages, which are sets of low-level, language-independent primitives for manipulating control threads (Loepere, 1992a, 1992b). These packages, two runtime libraries, provide C Threads and P Threads function calls that allow parallel programming in C.

3. B Threads

B Threads comprise a basic thread package including facilities for thread creation, thread detachment, and synchronization. Two basic functions of B Threads are described in Table 3. The thread implementation must provide two synchronization methods: mutex objects for short-duration mutual exclusion and condition variables for event notification (Boykin et al., 1993). Table 4 shows the B Threads data types and functions used for synchronization with mutex objects. Occasionally, threads synchronize on the occurrence of events produced by other threads in their task; the condition variable construct implements this kind of synchronization. Table 5 shows the B Threads data types and functions used for condition variables. B Threads also form a subset of other thread packages; Table 6 shows the functions and data types defined in B Threads and their corresponding elements in P Threads and C Threads. Our parallelizing compiler can be ported to any operating system that supports functions and data types similar to those of B Threads.
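As a concrete illustration of the correspondence summarized in Table 6 (below), the B Threads interface could be realized on a P Threads system with a thin macro layer. The sketch below is our own and uses modern POSIX pthread signatures; OSF/1's draft-standard P Threads differ slightly (e.g., attribute arguments such as pthread_attr_default instead of NULL), and the PFPC's actual mapping is not necessarily implemented this way:

    /* b_threads.h -- hypothetical mapping of B Threads onto P Threads */
    #include <pthread.h>

    #define Mutex_Object            pthread_mutex_t
    #define Condition_Variable      pthread_cond_t

    #define Thread_Create(t, f, a)  pthread_create((t), NULL, (f), (a))
    #define Thread_Detach(t)        pthread_detach(t)
    #define Mutex_Init(m)           pthread_mutex_init((m), NULL)
    #define Mutex_Lock(m)           pthread_mutex_lock(m)
    #define Mutex_Unlock(m)         pthread_mutex_unlock(m)
    #define Condition_Init(c)       pthread_cond_init((c), NULL)
    #define Condition_Wait(c, m)    pthread_cond_wait((c), (m))
    #define Condition_Signal(c)     pthread_cond_signal(c)

Porting to C Threads would only require a second header that maps the same names onto cthread_fork(), mutex_alloc(), and so on.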

IV. Design Approach

1. The Parallelizing Compiler Model

The model for a PFPC intended to produce parallel object codes rather than being just a source-to-


Table 5. The Data Types and Thread Functions Used in Condition Variables

  Condition_Variable   A data type used in declaring condition variables, which are used to signal that some operation has been completed.
  Condition_Init()     This function initializes a condition variable.
  Condition_Wait()     A thread waits on a condition variable when this function call is invoked.
  Condition_Signal()   This function awakens a thread waiting on the specified condition variable. If no thread is waiting on the condition variable, this function has no effect. Note that Condition_Signal() wakes up exactly one waiting thread.

Table 6. Comparison of B Threads, P Threads and C Threads

  B Threads            P Threads                 C Threads            Comment
  Thread_Create()      pthread_create()          cthread_fork()       Function
  Thread_Detach()      pthread_detach()          cthread_detach()     Function
  Mutex_Object         pthread_mutex_t           mutex_t              Data type
  Mutex_Init()         pthread_mutex_init()      mutex_alloc()        Function
  Mutex_Lock()         pthread_mutex_lock()      mutex_lock()         Function
  Mutex_Unlock()       pthread_mutex_unlock()    mutex_unlock()       Function
  Condition_Variable   pthread_cond_t            condition_t          Data type
  Condition_Init()     pthread_cond_init()       condition_alloc()    Function
  Condition_Wait()     pthread_cond_wait()       condition_wait()     Function
  Condition_Signal()   pthread_cond_signal()     condition_signal()   Function

source restructurer (Hsiao et al., 1994) was designed as shown in Fig. 3 (Yang et al., 1995). The processing stages are described in the following:

(1) First, a practical parallelism detector (PPD) (Yang et al., 1996) is used to test the data dependence relations and then restructure a sequential FORTRAN source program into a parallel form; i.e., if a loop can be parallelized or partially parallelized, PPD marks that loop as a DOALL loop or a DOACROSS loop with comments. PPD takes a traditional FORTRAN 77 source program as input and generates the corresponding prompted parallel code. The structure of PPD is divided into two phases, an analysis phase and a codegen phase. In the analysis phase, a single-subscript testing algorithm, the I test, is used to check whether the linear equation formed by the array subscripts has an appropriate integer solution. Instead of linearizing the subscripts of an array, we check them subscript by subscript, since there is no certainty that either approach dominates the other in precision. The result of the analysis phase is the determination of the execution modes of all the loops. The execution mode of a loop may be one of three types: DOALL, DOACROSS, and DOSEQ, where the former two indicate that a loop can be executed in a fully or partially parallel manner, respectively, and the last one is the normal sequential type. In the codegen

Fig. 3. The model of the PFPC running on OSF/1.

phase, the outcome of the analysis phase is consulted in order to produce the prompted parallel codes. Optimizations of the synchronization statements of DOACROSS loops are also performed.

(2) Second, because there is no FORTRAN compiler in OSF/1 and because multithreading only supports C programming, a FORTRAN-to-C (f2c) converter (Feldman et al., 1992) is used to convert the processed FORTRAN program, output by PPD, into its C equivalent.

(3) Third, the next component, a single-to-multiple (s2m) thread translator, takes the program obtained from f2c as input and then generates output in which the parallel loops (DOALL or DOACROSS) are translated into sub-tasks by replacing them with multithreaded codes. The structure of the s2m thread translator (Hsiao et


Fig. 4. The structure of the new s2m.

Fig. 6. Main program of the general output produced by s2m.

al., 1994) consists of five modules as shown in Fig. 4. The kernel module is written so as to be portable; it calls functions in the thread-code generating module and calls functions in the DOALL loop-partition module. The thread-code generating module contains several functions that are used to generate different thread specific codes: P Threads or C Threads. The DOALL loop-partition and DOACROSS loop-partition modules contain routines which partition DOALL and DOACROSS loops, respectively. In this paper, we improve the power of s2m to partition and generate corresponding multithreaded codes for a DOACROSS loop. The config module is very small and contains several arrays of functions. When the s2m kernel calls a function in thread-code generating, DOALL loop-partition modules or DOACROSS loop-partition modules, there must be an entry in the config module pointing to that called function so that the s2m kernel can access the function through the config module. If users want to add their own thread-code generating routines, DOALL loop-partition routines, or DOACROSS loop-partition routines, they can

Fig. 5. The DOALL loop of the input program of s2m.

append their own functions to these three modules and then append entries pointing to those functions in the config module. Therefore, a new version of s2m can be produced by simply compiling the config module and the user functions directly, and it can be ported to other platforms easily.

(4) Finally, the resulting multithreaded program is compiled and linked with the P Threads runtime library using the native C compiler, e.g., the GNU C compiler. The generated parallel object codes can then be executed in parallel on the multiprocessors to achieve high processor utilization.

2. The S2m for DOALL Loops

We will now explain how s2m converts one specific type of conventional sequential program, i.e., DOALL loops, into a parallel equivalent with P Threads runtime library codes embedded in it. The general form of a DOALL loop program for s2m is shown in Fig. 5. In this figure, there is one for-loop enclosed in "/* $DOALL$ L???: */" and "/* $END_DOALL$ L???: */" comments; these two comments are used to indicate that the for-loop enclosed by them is a DOALL loop. The ??? here stands for the loop label used in the original FORTRAN program. The main program of the output produced by s2m has the form shown in Fig. 6. There are six rectangles in this figure; each corresponds to a session that performs a specific job.

The first session, the thread-related definitions (Fig. 7), outputs thread-related definitions. Some variables needed for using the thread package are defined in this session. The loop variable is an array of loop_args, which is used to pass the begin iteration, end iteration, and iteration step to each pthread created later on. The ThCount variable records the number of threads; this number is decreased by one when a thread is about to terminate.
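Figure 7 itself is not reproduced in this text; the following sketch shows a plausible shape of these definitions. The identifier names follow the prose above, but the exact field layout and the MAX_THREADS bound are our assumptions:

    /* thread-related definitions, roughly as described for Fig. 7 */
    #include <pthread.h>

    #define MAX_THREADS 150                 /* hypothetical upper bound     */

    typedef struct {
        int begin;                          /* first iteration of the chunk */
        int end;                            /* last iteration of the chunk  */
        int step;                           /* iteration step               */
    } loop_args;

    static loop_args       loop[MAX_THREADS];
    static pthread_t       tid[MAX_THREADS];
    static int             ThCount;         /* number of live threads       */
    static pthread_mutex_t CountLock;       /* protects ThCount             */
    static pthread_cond_t  ThCond;          /* signaled when a thread ends  */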



Fig. 8. The DOALL function definition for an output thread produced by s2m.

Fig. 7. Thread-related definition added to the output.

The second session is devoted to variable declarations. These declarations are originally in the main function of the input program (Fig. 5); s2m moves them out of the main function to make the variables visible to the entire program. This eases the parameter-passing problem when a thread is forked, since all the necessary variables are global. Note that when this approach is applied to functions other than the main function, the variables may need to be renamed to avoid conflicts.

The third session is mutex and condition variable initialization. This session initializes a mutex object and a condition variable with default attributes, since these two variables are needed when performing synchronization. The P Threads codes for this session are shown below.

    pthread_mutex_init(&CountLock, pthread_mutexattr_default);
    pthread_cond_init(&ThCond, pthread_condattr_default);

The fourth, fifth, and sixth sessions are for DOALL loops only. The iteration calculation session, also called the loop partitioning session, partitions the DOALL loop according to the user-assigned loop-partitioning algorithm. The default loop-partitioning algorithm is CSS/4, which divides the iterations into four chunks of equal size, but this can be changed with a command-line option when s2m is invoked. At the end of this session, the variable ThCount holds the number of threads that need to be created later on. The start iteration, end iteration, and iteration step for the ith thread are stored in loop[i].begin, loop[i].end, and loop[i].step, respectively. This pre-calculation of the tasks for each thread eliminates the need to synchronize on loop indices, as several loop scheduling algorithms must do, which makes our approach faster.

The fifth session is for forking threads. The default number of threads to be created is four; however,

this number can be changed with a command-line option when s2m is invoked. Usually, this session contains a for-loop that performs thread creation.

The sixth session uses the previously initialized mutex object and condition variable for synchronization purposes. It waits for the thread count (the ThCount variable) to reach zero (i.e., the threads created for a particular parallel loop have all finished) and then continues execution. The thread count is initially the number of threads created and is decreased by one before each thread terminates. This synchronization is necessary to ensure the correctness of program execution. The P Threads codes for this session may be as follows:

    pthread_mutex_lock(&CountLock);
    while (ThCount != 0)
        pthread_cond_wait(&ThCond, &CountLock);
    pthread_mutex_unlock(&CountLock);

Figure 8 shows the function definition of a DOALL loop for a thread. This function definition is mainly the corresponding for-loop with minor changes. First, the loop is executed from loop->begin to loop->end; these two variables are calculated in the iteration calculation session. Second, the thread count (the ThCount variable) is decreased by one before the thread terminates; we use the mutex object to ensure mutual exclusion while decreasing the count. The P Threads codes for this part are shown below.

    pthread_mutex_lock(&CountLock);
    ThCount--;
    pthread_mutex_unlock(&CountLock);
    pthread_cond_signal(&ThCond);
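Since Figs. 6 and 8 are not reproduced here, the following condensed sketch pulls the sessions together for a toy DOALL loop partitioned by CSS/4. It is our reconstruction, not the literal s2m output, and it uses modern POSIX pthread calls (OSF/1 P Threads would pass default attribute objects instead of NULL); the array a and the loop body are placeholders:

    #include <pthread.h>

    #define NTHREADS 4                         /* default: CSS/4              */
    typedef struct { int begin, end, step; } loop_args;

    static loop_args       loop[NTHREADS];
    static int             ThCount;
    static pthread_mutex_t CountLock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  ThCond    = PTHREAD_COND_INITIALIZER;
    static double          a[1000];            /* made global by s2m          */

    static void *L100_doall(void *arg)          /* thread body for loop L100   */
    {
        loop_args *lp = (loop_args *)arg;
        for (int i = lp->begin; i <= lp->end; i += lp->step)
            a[i - 1] = a[i - 1] * 2.0;          /* placeholder loop body       */

        pthread_mutex_lock(&CountLock);         /* sixth-session handshake     */
        ThCount--;
        pthread_mutex_unlock(&CountLock);
        pthread_cond_signal(&ThCond);
        return NULL;
    }

    int main(void)
    {
        int n = 1000;
        pthread_t tid[NTHREADS];

        /* iteration calculation session: CSS/4 gives equal-sized chunks */
        int chunk = (n + NTHREADS - 1) / NTHREADS;
        ThCount = NTHREADS;
        for (int t = 0; t < NTHREADS; t++) {
            loop[t].begin = t * chunk + 1;
            loop[t].end   = (t + 1) * chunk < n ? (t + 1) * chunk : n;
            loop[t].step  = 1;
        }

        /* forking session */
        for (int t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, L100_doall, &loop[t]);

        /* waiting session */
        pthread_mutex_lock(&CountLock);
        while (ThCount != 0)
            pthread_cond_wait(&ThCond, &CountLock);
        pthread_mutex_unlock(&CountLock);
        return 0;
    }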

3. The S2m for DOACROSS Loops

Now, we will describe how s2m converts DOACROSS loops into their equivalents with synchronization instructions. The general form of a DOACROSS program fed to s2m is shown in Fig. 9. In Fig. 9, there



Fig. 9. The DOACROSS loop of the input program for s2m.

is one for-loop enclosed in /* DOACR ??? */ and /* ENDACR ??? */ comments, where these two comments indicate that the for-loop is a DOACROSS loop. The statements /* SIGNAL (n) */ and /* WAIT (n, i−m) */ indicate the synchronization events, where n is a synchronized statement's number, i is the loop index, and m is the dependence distance. The execution of /* WAIT (n, i−m) */ is blocked until the matching /* SIGNAL (n) */ of iteration i−m has been executed.

Figure 10 shows the function definition of a DOACROSS loop for mapping into a thread. This function definition is mainly the corresponding for-loop with minor changes. A DOACROSS loop is divided into blocks by the loop-partitioning algorithm. The loop is executed from iteration loop->begin to iteration loop->end, and these two variables are calculated in the iteration calculation session shown in Fig. 6. There are four rectangles in Fig. 10. The first session sets the element of the ready_n array corresponding to index i to True when the synchronized statement has finished. In the second session, the thread tests whether the loop iteration is a head or not; if the iteration is a head, the third session is skipped, otherwise a busy-waiting scheme is used. This form of synchronization is particularly effective in bus-based multiprocessors with snoopy caches (Saltz et al., 1991), since in such architectures busy-waiting generates no bus traffic. In the fourth session, the thread count (the ThCount variable) is decreased by one before the thread terminates; a mutex object is used to ensure mutual exclusion while decreasing the count.
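Figure 10 itself is not reproduced here; the sketch below shows how the four sessions could look in C for a single synchronization event with dependence distance m. It is our reconstruction (ready_1 stands for the paper's ready_n array; the remaining names other than loop_args, ThCount, CountLock, and ThCond are ours), and the flag array must be volatile so that the busy-wait re-reads memory. We interpret the "head" test as skipping the wait for iterations that have no predecessor, i.e., i − m < 1:

    #include <pthread.h>

    #define N_ITER 12                          /* illustrative iteration count */
    static const int m = 3;                    /* dependence distance (example)*/
    static volatile int ready_1[N_ITER + 1];   /* one flag per iteration       */

    typedef struct { int begin, end, step; } loop_args;

    extern int             ThCount;
    extern pthread_mutex_t CountLock;
    extern pthread_cond_t  ThCond;

    static void *L200_doacross(void *arg)
    {
        loop_args *lp = (loop_args *)arg;
        for (int i = lp->begin; i <= lp->end; i += lp->step) {
            if (i - m >= 1)                    /* head iterations skip the wait */
                while (!ready_1[i - m])
                    ;                          /* busy-wait: WAIT(1, i - m)     */
            /* ... statements up to and including S1 ...                        */
            ready_1[i] = 1;                    /* SIGNAL(1) for iteration i     */
            /* ... remaining statements (S2, ...) ...                           */
        }
        pthread_mutex_lock(&CountLock);        /* fourth session, as for DOALL  */
        ThCount--;
        pthread_mutex_unlock(&CountLock);
        pthread_cond_signal(&ThCond);
        return NULL;
    }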

A. Partition Mechanism

It is well known that if the number of processors is large enough, i.e., if every iteration can be assigned to a unique processor, then the program can be run in parallel to obtain high performance. However, it is almost impossible to execute all iterations simultaneously in parallel because the number of processors is usually smaller than the number of iterations. In practice, all iterations are partitioned into chunks that can be assigned to processors and executed in parallel. There are two commonly used methods for partitioning loop iterations: vertical spreading and horizontal spreading. Vertical spreading assigns a contiguous chunk of iterations to each processor: let m be the number of processors available, let Pi denote the ith processor (1 ≤ i ≤ m), let n be the number of iterations, and let B = ⌈n/m⌉; then iterations (i−1)B+q, where q ranges from 1 to B, are mapped to Pi, and iterations (m−1)B+1 through n are mapped to Pm (Fig. 11(a)). Horizontal spreading maps the set of iterations i + q·m (0 ≤ q ≤ B) to Pi, with suitable adjustments when i + q·m exceeds n (Fig. 11(b)) (Zima and Chapman, 1990). Deciding which option is appropriate in a particular case is not a trivial matter: we must consider the dependence structure of the loop and the properties of the target machine. Vertical spreading is adopted for DOALL loop parallelization; when a processor has a cache memory, vertical spreading will often be the better method. However, it is not suitable for DOACROSS loops running on multiprocessor systems with few processors, because it will serialize a loop that contains a loop-carried dependence with a constant distance and may not lead to obvious speedup.

Fig. 10. The DOACROSS function definition for a thread of the output.



Fig. 11. (a) Vertical spreading (b) Horizontal spreading.

Fig. 12. A DOACROSS loop with a dependence distance of 3.

For example, a DOACROSS loop with a constant dependence distance of 3 is shown in Fig. 12. Figure 13(a) shows the execution order for this example. The DOACROSS loop, which contains 12 iterations, is partitioned into four equal-sized chunks by means of vertical spreading. Suppose that there are only four processors in the target machine and that each processor takes one time unit to execute one iteration. Due to the synchronization delay, we can see that high speedup is not possible; if synchronization overhead is also considered, the performance will be even worse. If the iterations are partitioned into more chunks, the synchronization overhead will degrade the performance severely. In Fig. 13(b), the loop is partitioned into four chunks by means of horizontal spreading. Though horizontal spreading takes less time to execute the loop in parallel than does vertical spreading, synchronization between chunks still degrades the performance. Therefore, DOACROSS loop partitioning is proposed to solve this problem. The key idea of this strategy is that iterations with dependence relations are put in the same chunk(4) as far as possible. The partition method is the same as horizontal spreading, but the number of partitioned chunks is determined by the dependence distance of the loop. This implies that DOACROSS loop partitioning is a special case of horizontal spreading in which the number of chunks equals the dependence distance. The loop of Fig. 12 can then be scheduled as shown in Fig. 13(c): all iterations are partitioned into three chunks, where 3 is the dependence distance of the loop.

Fig. 13. A comparison of various partition mechanisms for a DOACROSS loop.

The stride between contiguous iterations in the same chunk is the dependence distance. Because iterations with dependence relations are put into the same chunk, there is no synchronization between chunks. Ideally, the best speedup for this loop obtained using DOACROSS loop partitioning is 3. With this scheduling, the number of chunks created by s2m depends on the dependence distance of the DOACROSS loop detected by PPD. Note that parallelization is skipped when the dependence distance of the loop is 1 or the number of loop-carried dependences is larger than four, because these cases would incur large synchronization overhead when the loop is run.
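To make the two spreading mechanisms of Figs. 11 and 13 concrete, the sketch below fills loop_args descriptors for vertical spreading and for the proposed DOACROSS loop partitioning (horizontal spreading with the number of chunks equal to the dependence distance d). The helper names are ours, not s2m's:

    typedef struct { int begin, end, step; } loop_args;   /* as in Fig. 7 sketch */

    /* Vertical spreading: m contiguous chunks of about n/m iterations each. */
    static void vertical_partition(loop_args *lp, int n, int m)
    {
        int b = (n + m - 1) / m;                 /* chunk size B = ceil(n/m)   */
        for (int i = 0; i < m; i++) {
            lp[i].begin = i * b + 1;
            lp[i].end   = (i + 1) * b < n ? (i + 1) * b : n;
            lp[i].step  = 1;
        }
    }

    /* DOACROSS loop partitioning: d chunks with stride d, so that a chain of
     * iterations related by a dependence of distance d stays inside one chunk
     * and no inter-chunk synchronization is needed. */
    static void doacross_partition(loop_args *lp, int n, int d)
    {
        for (int c = 0; c < d; c++) {
            lp[c].begin = c + 1;                 /* iterations are numbered 1..n */
            lp[c].end   = n;
            lp[c].step  = d;
        }
    }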

4. The S2m Algorithm

We will now summarize our discussion of s2m as an algorithm. The number of iterations and processors must be known at compile time if we want to use a static scheduling algorithm; this is a necessary condition for static scheduling since the scheduling decision is made at compile time. In our implementation, dynamic loop partitioning is used; therefore, the loop bounds can be determined by programmers at compile time or at run time. The algorithm consists of six phases.

(4) One chunk is mapped to a single thread running on OSF/1.


Algorithm: S2m (single thread to multiple threads)

Input: the input program produced by processing with PPD.
Output: the converted program, in which the DOALL and DOACROSS for-loops are translated into sub-tasks by replacing them with multithreaded codes.

Usage: s2m [-s type] [-l lib] [-p pno] [-n kno] input-file output-file
  -s type: specify the loop-partitioning algorithm
  -l lib:  link with the P Threads or C Threads runtime library
  -p pno:  specify the number of processors to be used
  -n kno:  specify the k value in the CSS/k loop-partitioning algorithm

Phase 1: Check the input arguments for correct execution.
Phase 2: Add thread-related definitions and loop-partitioning definitions to the output file.
Phase 3: Add thread-initialization statements.
Phase 4: If a DO loop is a DOALL loop, perform loop partitioning using the specified loop-partitioning algorithm and generate the DOALL thread functions.
Phase 5: If a DO loop is a DOACROSS loop, perform loop partitioning using the specified loop-partitioning algorithm and generate the DOACROSS thread functions.
Phase 6: Copy the sequential statements directly into the output file.

V. Experimental Results

1. Performance of DOALL Loops

Loops can be roughly divided into four types: uniform workload, increasing workload, decreasing workload, and random workload. These four types are the most common in programs and should cover most cases (Tzen and Ni, 1993). We will show the performance gained by using our parallelizing compiler on the four loop types. The first example is matrix multiplication, shown in Fig. 14(a), where the two outer loops can be parallelized. Since this example is highly load balanced, every iteration of the outermost DO loop takes constant time to execute; this kind of loop is called a uniform workload loop. The second example is adjoint convolution, shown in Fig. 14(b). Adjoint convolution exhibits significant load imbalance; only the outer loop can be parallelized, and its ith iteration takes O(N²−i) time to execute. As index i increases from one to N², the workload decreases from O(N²) to O(1); this kind of loop is called a decreasing workload loop. The third example is reverse adjoint convolution, shown in Fig. 14(c), which also exhibits significant load imbalance; only the outer loop can be parallelized, and its ith iteration takes O(i) time to execute. As index i increases from one to N², the workload increases from O(1) to O(N²); this kind of loop is called an increasing workload loop. The fourth example is transitive closure, shown in Fig. 14(d), whose workload depends on the input data; that is, each iteration takes either O(1) or O(N)

Fig. 14. (a) Program 1-Matrix multiplication. (b) Program 2-Adjoint convolution. (c) Program 3-Reverse adjoint convolution. (d) Program 4-Transitive closure.

time. This kind of loop is called a random workload loop. In general, uniform workload loops can be parallelized using static scheduling methods, whereas increasing, decreasing, and random workload loops are better served by dynamic scheduling. Dynamic scheduling is suitable for all loop types, except that GSS initially allocates too many iterations to an idle processor when the third type of loop (a decreasing workload) is presented; that is, GSS may cause load imbalance. GSS is suitable for a loop with an increasing workload, so the reverse adjoint convolution example is used to demonstrate the performance of GSS.

The experiments were performed on the AcerAltos-10000 system. With a fixed number of processors (CPUs = 4), the four examples were run with different numbers of threads by varying the value k of CSS/k, where k is the number of chunks, to demonstrate the speedup of each program when different numbers of threads are forked. Figure 15 shows the speedup when the number of processors was fixed at 4 and the number of threads was varied from 1 to 150. The matrix size in Program 1 was 500×500, the matrix size in both Programs 2 and 3 was 150×150, and in Program 4 it was 300×300. Since Program 1 was highly load balanced, it gained more speedup than the others until the number of threads exceeded 25; beyond that point, performance dropped because too many threads were forked. Large numbers of threads can balance imbalanced workloads, such as those of Programs 2 and 3, but the thread-management overhead may reduce performance; this is a tradeoff. The results suggest that workloads should be partitioned as evenly as possible while keeping the number of forked threads small. As mentioned in Section II, CSS/k is suitable for uniform workloads, e.g., matrix multiplication, when k is equal to 4, 8, or 16. The performance of transitive closure, with its random workload, depends on the input data; this program gains high speedup when the number of threads is equal to 4, and its performance drops dramatically when too many threads are



Fig. 15. Speedup obtained by varying the number of threads (CPUs=4).

forked. Generally speaking, the best speedups for Programs 1, 2, 3, and 4 were about 3.75, 3.46, 3.81, and 3.2, respectively.

Furthermore, with a fixed number of threads, the four examples were run while varying the number of available processors. The DOALL loops were partitioned into eight threads by the CSS/8 algorithm to demonstrate the speedup obtained with different numbers of processors. Figure 16 shows the speedup when the number of threads was fixed at 8 and the number of processors was varied from 1 to 4. The problem sizes for Programs 1, 3, and 4 were the same as in Fig. 15; only n was changed from 150 to 120 in Program 2. Because of the highly imbalanced workloads of Programs 2, 3, and 4, the speedup gained was less than that of Program 1 when the number of threads was fixed at 8.

Fig. 16. Speedup obtained by varying the number of processors (CSS/8).

The final DOALL experiments demonstrate the speedup of each program with the different loop-partitioning algorithms. To avoid creating too many small partitions, which could produce too much runtime overhead, we used GSS(10), TSS(N/2P, 10), and Factoring(10) in our tests. Figure 17 shows the speedup when Program 1 was run with different loop-partitioning algorithms and arguments. Since the loop in Program 1 was highly balanced, GSS, TSS, and Factoring did not contribute much speedup. Figure 18 shows the experimental results obtained by running Program 2 with different loop-partitioning algorithms and arguments. Since the loop in Program 2 has a decreasing workload, loop-partitioning algorithms like GSS, TSS, and Factoring did not achieve much speedup. Figure 19 shows the experimental results obtained by running Program 3 with different loop-partitioning algorithms and arguments. Because of the increasing workload of Program 3, the TSS, Factoring, and GSS loop-partitioning algorithms allocated a large number of iterations to the first chunks and could obtain good speedup. Figure 20 shows the experimental results of running Program 4 with different loop-partitioning algorithms and arguments. Since the workload of this loop is random, GSS, TSS, and Factoring did not provide much speedup, but CSS/4 obtained good performance in most cases.

2. DOACROSS Loop Performance

A DOACROSS loop used to evaluate the performance of the PFPC is shown in Fig. 21. In this experiment, the number of processors was fixed at four, and the number of threads was varied(5). The execution speedup for different loop bounds n is shown in Fig. 23.

Fig. 17. Speedup of Program 1 with different loop-partitioning algorithms.

Fig. 18. Speedup of Program 2 with different loop-partitioning algorithms.



Fig. 19. Speedup of Program 3 with different loop-partitioning algorithms.

Fig. 20. Speedup of Program 4 with different loop-partitioning algorithms.

It is well known that if the iterations of a DOACROSS loop are executed in parallel, different iterations may have to be synchronized using SIGNAL and WAIT operations, and these operations incur large synchronization and scheduling overhead. In our experiment, the best speedup of the DOACROSS loop was only 1.51, obtained with the CSS/4 loop-partitioning algorithm.

The performance obtained in parallelizing a DOACROSS loop with various dependence distances is discussed next. The example shown in Fig. 22 was adapted from adjoint convolution. For this loop, measurements were taken at three different dependence distances, namely 4, 3, and 2. As mentioned above, the performance of a parallelized DOACROSS loop varies widely with the loop-partitioning mechanism. Two loop-partitioning mechanisms, DOACROSS loop partitioning and CSS/k, were adopted in this experiment. Table 7 lists the experimental results obtained by running the example with a dependence distance of 4 and the different partitioning mechanisms. Since the distance was 4, the DOACROSS partitioning algorithm partitioned the loop into four chunks. DOACROSS loop partitioning obtained high speedup because the synchronization overhead was reduced by putting the iterations with dependence relations into the same chunk. However, the CSS/k mechanism could not provide any real benefit; moreover, its performance was sometimes even worse than that of sequential execution. The performance statistics in Tables 8 and 9 also reveal that CSS/k failed to offer any benefit. Since there are only four processors in our multiprocessor system, vertical spreading is not suitable for parallelizing a DOACROSS loop; if more than four chunks are created, the performance deteriorates rapidly. As for DOACROSS loop partitioning, the performance results for Example 2 are shown in Fig. 24. Our experiments show that DOACROSS loops also exhibit much parallelism. However, if the partition mechanism is not appropriate, synchronization overhead may seriously degrade the performance; with DOACROSS loop partitioning, on the other hand, the performance improvement is readily observable. The experiments also show that if the number of processors is not large enough, vertical spreading is not a good choice.

Fig. 21. Example 1.

Fig. 22. Example 2.

(5) We use CSS/k, a kind of vertical spreading, which partitions a parallel loop into k equal-sized chunks; each chunk is then packaged into a thread.



VI. Conclusions and Further Directions

Since most investigations of parallelizing compilers have focused on source-to-source transformations, which cannot be run on multiprocessors without a code generation phase, we have designed a parallelizing compiler that generates parallel object codes rather than being just a source-to-source translator. In this paper, we have presented the design and implementation of a PFPC with loop partitioning on our AcerAltos-10000 multiprocessor system, running the OSF/1 multithreaded OS. In order to port our PFPC to other system environments, a minimal set of thread-related functions and data types, called B Threads, has been defined; it forms the kernel used in our PFPC to support execution on different platforms. Furthermore, the single-to-multiple thread translator component is highly modularized, so that porting to other platforms is very easy. Our portable compiler, based upon four loop-partitioning algorithms, CSS/k, GSS(k), Factoring, and TSS(f, l), can currently partition DOALL and DOACROSS loops into multithreaded codes. To show the performance of our compiler, several experiments have been performed. In the near future, we will integrate run-time data dependence testing and parallel loop transformations into our parallelizing compiler. Our goal is to construct a high-performance, portable FORTRAN parallelizing compiler for shared-memory multiprocessor systems.

Table 7. The Execution Time (sec.) for Example 2 with a Dependence Distance of 4

  Distance (m) = 4
  N       Sequential   Doacross   CSS/4    CSS/8
  5000    120.55       35.28      108.83   123.44
  2500    26.67        7.34       27.2     32.94
  1000    6.30         2.21       4.7      6.16

Table 8. The Execution Time (sec.) for Example 2 with a Dependence Distance of 3

  Distance (m) = 3
  N       Sequential   Doacross   CSS/4    CSS/8
  5000    117.20       38.73      108.8    122.89
  2500    32.91        9.71       27.12    31.84
  1000    6.3          2.62       4.37     5.77

Table 9. The Execution Time (sec.) for Example 2 with a Dependence Distance of 2

  Distance (m) = 2
  N       Sequential   Doacross   CSS/4    CSS/8
  5000    118.18       58.26      108.72   123.77
  2500    33.73        14.42      27.44    31.77
  1000    5.3          2.35       4.39     4.97


Acknowledgment

Fig. 23. The speedup of Example 1 obtained by varying the number of threads.

This work was supported in part by the National Science Council of the Republic of China under Grants NSC 82-0408-E009285, NSC 83-0408-E009-034 and NSC 84-2213-E009-090. We would like to thank the anonymous reviewers for suggesting improvements and offering encouragement.

References

Fig. 24. The speedup of Example 2 obtained by using DOACROSS loop partitioning.

Acer, Inc. (1991) AcerAltos-10000 System Guide. Acer Inc., Taipei, Taiwan, R.O.C.
Bacon, D. F., S. L. Graham, and O. J. Sharp (1993) Compiler transformations for high-performance computing. ACM Computing Surveys, 26(4), 345-420.
Banerjee, U. (1988) Dependence Analysis for Supercomputing. Kluwer Academic Publishers, Boston, MA, U.S.A.
Banerjee, U., R. Eigenmann, A. Nicolau, and D. A. Padua (1993) Automatic program parallelization. Proc. IEEE, 81(2), 211-243.
Blume, W., R. Eigenmann, K. Faigin, J. Grout, J. Hoeflinger, D. A. Padua, P. Peterson, B. Pottenger, L. Rauchwerger, P. Tu, and S. Weatherford (1994a) Polaris: the next generation in parallelizing compilers. Proc. of the 7th Workshop on Languages and Compilers for Parallel Computing, pp. 141-154, Ithaca, NY, U.S.A.
Blume, W., R. Eigenmann, J. Hoeflinger, D. Padua, P. Peterson, L. Rauchwerger, and P. Tu (1994b) Automatic detection of parallelism: a grand challenge for high-performance computing. IEEE Parallel and Distributed Technology, 2(3), 37-47.
Boykin, J., D. Kirschen, A. Langerman, and S. LoVerso (1993) Programming under Mach. Addison-Wesley, New York, NY, U.S.A.
Feldman, S. I., D. M. Gay, M. W. Maimone, and N. L. Schryer (1992) A FORTRAN-to-C Converter. Computing Science Technical Report No. 149, Bell Communication Research and Carnegie Mellon University, Pittsburgh, PA, U.S.A.
Gates, K. E. and W. P. Peterson (1994) A technical description of some parallel computers. Int'l J. High Speed Computing, 6(3), 399-449.
Goff, G., K. Kennedy, and C. W. Tseng (1991) Practical dependence testing. Proc. of the ACM SIGPLAN '91 Conf. on Programming Language Design and Implementation, pp. 15-29, Toronto, Canada.
Gupta, R. (1992) Synchronization and communication costs of loop partitioning on shared-memory multiprocessor systems. IEEE Trans. Parallel Distrib. Syst., 3(4), 505-512.
Hsiao, M. C., S. S. Tseng, C. T. Yang, and C. S. Chen (1994) Implementation of a portable parallelizing compiler with loop partition. Proc. of the 1994 Int'l Conf. on Parallel and Distributed Systems, pp. 333-338, Hsinchu, Taiwan, R.O.C.
Hummel, S. F., E. Schonberg, and L. E. Flynn (1992) Factoring: a method for scheduling parallel loops. Commun. of the ACM, 35(8), 90-101.
Hwang, K. (1993) Advanced Computer Architecture: Parallelism, Scalability, Programmability. McGraw-Hill, New York, NY, U.S.A.
Kong, X., D. Klappholz, and K. Psarris (1991) The I test: an improved dependence test for automatic parallelization and vectorization. IEEE Trans. Parallel Distrib. Syst., 2(3), 342-349.
Li, Z., P. C. Yew, and C. Q. Zhu (1990) An efficient data dependence analysis for parallelizing compilers. IEEE Trans. Parallel Distrib. Syst., 1(1), 26-34.
Loepere, K. (1992a) Mach 3 Kernel Principles. Open Software Foundation and Carnegie Mellon University, Pittsburgh, PA, U.S.A.
Loepere, K. (1992b) Mach 3 Server Writer's Guide. Open Software Foundation and Carnegie Mellon University, Pittsburgh, PA, U.S.A.
Maydan, D. E., J. L. Hennessy, and M. S. Lam (1991) Efficient and exact data dependence analysis. Proc. of the ACM SIGPLAN '91 Conf. on Programming Language Design and Implementation, pp. 1-14, Toronto, Canada.
Polychronopoulos, C. D. (1988) Parallel Programming and Compilers. Kluwer Academic Publishers, Boston, MA, U.S.A.
Polychronopoulos, C. D. and D. J. Kuck (1987) Guided self-scheduling: a practical self-scheduling scheme for parallel supercomputers. IEEE Trans. Comput., C-36(12), 1425-1439.
Pugh, W. (1992) A practical algorithm for exact array dependence analysis. Commun. of the ACM, 35(8), 102-114.
Pugh, W. and D. Wonnacott (1994) Static analysis of upper and lower bounds on dependences and parallelism. ACM Trans. Program. Lang. Syst., 16(4), 1248-1278.
Saltz, J. H., R. Mirchandaney, and K. Crowley (1991) Runtime parallelization and scheduling of loops. IEEE Trans. Comput., 40(5), 603-612.
Shen, Z., Z. Li, and P. C. Yew (1990) An empirical study of Fortran programs for parallelizing compilers. IEEE Trans. Parallel Distrib. Syst., 1(3), 356-364.
Tzen, T. H. and L. M. Ni (1993) Trapezoid self-scheduling: a practical scheduling scheme for parallel compilers. IEEE Trans. Parallel Distrib. Syst., 4(1), 87-98.
Wolfe, M. (1989) Optimizing Supercompilers for Supercomputers. MIT Press, Cambridge, MA, U.S.A.
Wolfe, M. (1995) High-Performance Compilers for Parallel Computing. Addison-Wesley Publishing, New York, NY, U.S.A.
Wolfe, M. and C. W. Tseng (1992) The power test for data dependence. IEEE Trans. Parallel Distrib. Syst., 3(5), 591-601.
Yang, C. T., S. S. Tseng, and C. S. Chen (1994) The anatomy of Parafrase-2. Proc. Natl. Sci. Counc. ROC(A), 18(5), 450-462.
Yang, C. T., S. S. Tseng, and M. C. Hsiao (1995) A model of a parallelizing compiler on multithreaded operating systems. Proc. HPC-ASIA 1995 Int'l Conf. on High Performance Computing, Taipei, Taiwan, R.O.C.
Yang, C. T., C. T. Wu, and S. S. Tseng (1996) PPD: a practical parallel loop detector for parallelizing compilers. IEICE Trans. Information and Systems, E79-D(11), 1545-1560.
Yang, C. T., S. S. Tseng, C. D. Chuang, and W. C. Shih (1997) Using knowledge-based techniques on loop parallelization for parallelizing compilers. Parallel Computing, 23(3), 291-309.
Zima, H. P. and B. Chapman (1990) Supercompilers for Parallel and Vector Computers. Addison-Wesley Publishing and ACM Press, New York, NY, U.S.A.

