Fortran-S: A Fortran Interface for Shared Virtual Memory Architectures

F. Bodin, L. Kervella, T. Priol
IRISA-INRIA, Campus de Beaulieu, 35042 Rennes Cedex, France
Reprinted from "Proceedings of Supercomputing'93," Portland (USA), November 1993.
Abstract
In this paper we present a new programming environment for distributed memory parallel computers consisting of a Fortran 77 compiler enhanced with directives to specify parallelism. These directives are used to specify shared data structures and parallel loops. Shared data structures are implemented using the KOAN shared virtual memory, which is available on an Intel iPSC/2 computer. Preliminary results obtained with the first prototype of the compiler are presented.
1 Introduction
The current generation of parallel architectures tends toward MIMD distributed memory parallel computers (DMPCs). However, the success of these parallel architectures will depend on the availability of programming environments that help users design parallel algorithms. The previous generation was characterized by a lack of software; users had to take the underlying architecture into account in the design of their parallel algorithms, including data distribution among the local memories, process mapping, etc. Since then, researchers have undertaken the task of providing parallel extensions to sequential languages such as C or Fortran to hide the distributed nature of these architectures. Several prototypes have been built, including Fortran-D [10], Vienna Fortran [3], and Pandore [2]. Recently, a coalition of industrial and academic groups founded the High Performance Fortran Forum to propose extensions to Fortran (HPF). All these approaches provide the user with a global address space
and the compilers are in charge of generating processes and communications. Data management is also carried out by the compiler. Because the compiler is in charge of everything, its design and implementation become intricate. Moreover, when data access patterns are unknown at compile time, the generation of efficient parallel code must rely on runtime techniques [24]. There is another alternative, based on the use of both a data management service available within the operating system of a DMPC and an ad hoc compiler, that could drastically simplify the design of a programming environment for DMPCs. This approach, which is the one explored here, takes advantage of a Shared Virtual Memory (SVM) implemented within the operating system. The basic idea of this concept is to hide the underlying architecture of DMPCs by providing the user with a virtual address space. Locality is achieved by a page caching mechanism. Thus, a DMPC can be programmed as a conventional shared memory parallel computer. The compiler is then responsible for generating parallel processes, and communications are handled by the operating system. Consequently, the compiler design is simplified in comparison to approaches based on explicit data distribution. Unfortunately, few experiments have been done to show the effectiveness of such an approach on message-passing parallel architectures. The KOAN project [14, 19] has been established to investigate the use of the SVM paradigm on DMPCs and to validate this programming model. We propose a Fortran 77 code generator based on a set of directives, called Fortran-S. It is intended to be a testbed for studying code generation and optimizations for SVM architectures. This paper gives an overview of Fortran-S, and performance results are provided for some parallel algorithms. These results were obtained with an iPSC/2 running the KOAN shared virtual memory.
The paper is organized as follows. Section 2 sketches the KOAN SVM runtime system. Section 3 describes the Fortran-S fundamentals. Section 4 presents a description of the Fortran-S directive set. Section 5 outlines the implementation of a prototype compiler. Section 6 gives performance results, followed by conclusions in Section 7.
2 KOAN SVM Runtime
The KOAN SVM is embedded in the operating system of the iPSC/2. It allows the use of fast and low-level communication primitives as well as the Memory Management Unit (MMU). It differs from SHIVA, described in [17], in that it is an operating-system-based implementation. The KOAN SVM implements the fixed distributed manager algorithm as described in [16], with an invalidation protocol for keeping the shared memory coherent at all times (strong coherency). This algorithm offers a suitable compromise between ease of implementation and efficiency. We now summarize the basic functionality of the KOAN SVM runtime system; a more detailed description of the KOAN SVM can be found in [14]. KOAN SVM provides the user with several memory management protocols for efficiently handling some particular memory access patterns. One of these arises when several processors have to write into different locations of the same page. This pattern involves a lot of messages because the page has to move from processor to processor (this is called the ping-pong effect, or false sharing). KOAN provides a weak cache coherence protocol to let processors concurrently modify their own copy of a page. At a synchronization point, all the copies of a page which have been modified are merged into a single page that reflects all the changes. This technique is similar to the one presented in [13] by Karp et al. However, in order to speed up the merging process, our implementation of this form of weak coherency has a restriction: the processors may only access memory regions that do not overlap. The starting and ending addresses specified by each processor are then used to carry out the merging process. A drawback of shared virtual memory on DMPCs is its inability to run efficiently parallel algorithms that contain a producer/consumer scheme, i.e. when a page is modified by one processor and then accessed by the other processors. KOAN SVM can efficiently manage this memory access pattern by using the broadcast facility of the underlying topology of DMPCs (hypercube, 2D-mesh, etc.).
All pages that have been modified by the processor in charge of running the producer phase are broadcast in parallel to all other processors that will run the consumer phase. Last of all, KOAN SVM provides a page locking mechanism to implement atomic updates and special synchronization schemes. Page locking allows a processor to lock a page into its cache until it decides to release it. Page locking is very efficient and minimizes the number of critical sections within a parallel code. We have performed measurements in order to determine the costs of various basic operations for both read and write page faults (the size of a page is 4 Kbytes) of the KOAN shared virtual memory. For each type of page fault (read or write), we tested the best and worst possible situations on different numbers of processors. For a 32-processor configuration, the time required to solve a read page fault is in the range of 3.412 ms to 3.955 ms. For a write page fault, timing results are in the range of 3.447 ms to 10.110 ms, depending on the number of copies that have to be invalidated. These results can be compared with the communication times of the iPSC/2. The latency is roughly 0.3 ms. Sending a 4 Kbyte message (a page) costs between 2.17 ms and 2.27 ms, depending on the number of intermediate routing steps.
3 Fortran-S Fundamental Concepts
Fortran-S is a code generator targeted at shared virtual memory parallel architectures. One of our goals was to respect the Fortran 77 standard since it is widely used in the scientific community. Therefore no extension to the language syntax has been made. A set of annotations provides the user with a simple programming model based on shared array variables and parallel loops. One of the unique features of Fortran-S is its SPMD (Single Program Multiple Data) execution model, which minimizes the overhead due to the management of parallel processes. The Fortran-S code generator creates one process per processor for the entire duration of the computation. There is no dynamic creation of processes during the execution. In the following subsections, we present both the user programming model and the execution model.
3.1 User Programming Model
The Fortran-S programming model is based on parallel loops and shared variables. By default, variables are not shared. Non-shared variables modified in a parallel loop have an undefined value at the end of the
execution of the loop. The value of a variable is said to be undefined when, at some point in the program, the variable can have different values on different processors. Except for parallel loops, the semantics of the program is identical to the semantics of the sequential program. In the sequential parts of programs, shared variables behave identically to non-shared variables. However, since shared variables are accessible by all the processors, computation of these variables can be done in parallel. Alternatively, all the variables in a program could be shared by default, and this would make a DMPC look like a shared memory architecture. This solution was not chosen because it could lead to very inefficient programs due to unnecessary data movement and synchronization. Taking into account communication costs, it is usually more efficient to compute the value of a variable redundantly on all the processors than to globally access a single set of values. On the other hand, data structures which are to be computed within a parallel loop need to be shared. Restricting shared variables to some data structures in programs constrains parallelism to the computation of those variables. For example, consider the following parallel loop:

      do in parallel i = 1, n
         temp = 0.
         do k = ia(i), ia(i + 1) - 1
            temp = temp + a(k)*x(ja(k))
         enddo
         y(i) = temp
      enddo

Variables a, ia, x, ja can be either shared or non-shared variables. If a variable is not shared it will be replicated on each processor (this remains transparent to the user). The programmer must take this point into account when deciding which variables should be shared. The choice is a tradeoff between efficiency and the amount of memory used. For example, note that in the previous loop the variable y has to be shared since it is modified in the loop. If such a variable is not declared as shared, each processor will see a different value of the variable y at the end of the loop execution. The case of variable temp is different. The variable is defined at the beginning of each iteration, and that value is only used within the iteration. So not only can it not be shared, it must be local to each processor. Subject to these constraints, the program will behave similarly on a sequential computer and on a DMPC. We believe that the programming model is easy to understand and use. Since Fortran-S is a full implementation of Fortran 77, an application can be quickly ported to a DMPC. However, efficiency and scalability frequently rely on finding a better parallel algorithm than the original one (for example, block algorithms instead of column approaches for numerical algorithms). Declaring parallel loops and shared variables is done using directives. An overview of the main directives is given in section 4. Because of the SPMD execution model (see section 3.2), there is only one process per processor and we do not support nested parallelism (nested parallelism is currently under investigation). However, perfectly nested loops can be easily implemented through loop collapsing. If loops are not perfectly nested, only the outermost loop is executed in parallel. To illustrate this, let us consider an example:

      program myprog
      ..........
C parallel loop 1
      do in parallel
C subroutine call 1
         call myfunction(x,y,z)
      enddo
      ..........
C subroutine call 2
      call myfunction(i,j,k)
      end

      subroutine myfunction(t,u,v)
      ..........
C parallel loop 2
      do in parallel
      .........
      enddo
      ..........
      end

In the above example we have two parallel loops (1 and 2) and two calls (1 and 2) to subroutine myfunction. When execution reaches parallel loop 1, iterations of that loop are distributed over the processors. However, parallel loop 2 will be executed sequentially for that call. When the program execution reaches the second call to the subroutine, the second parallel loop is then executed in parallel. This is decided at runtime. There is no constraint on the parameters to subroutines; they can be shared or non-shared variables, and the same parameter can be a shared variable at one call site and a local variable somewhere else. It is not required that the program be in the same file because Fortran-S supports full separate compilation.
3.2 SPMD Execution Model
The SPMD execution model has been widely used to program distributed memory parallel computers with a message passing programming style, since it allows a low-overhead parallel execution during both the loading and the running of parallel programs. With an SPMD execution model, a process is created on each processor at the beginning of the program execution and then executed. These processes remain alive until the end of the program. This execution model has been widely investigated in the design of programming environments that exploit data parallelism [2, 3, 10]. However, little work has been done to combine both control parallelism and SPMD code generation. This idea has been investigated within the EPEX project [9, 21], but the user programming model is more complicated than ours because users have to work with several types of code sections: uniprocessor sections (performed by a single processor), independent sections (each processor executes the same section in parallel) and shared sections (a parallel loop distributed among the processors). The basic synchronization primitives are implemented by shared variables (locks, barriers, etc.). The user is responsible for adding, within his program, macros that indicate to a preprocessor the boundaries of the uniprocessor, independent and shared sections. Adding these macros is a difficult task since the user has to know the control flow of his program perfectly. The SPMD execution model we implemented simulates a kind of fork&join paradigm. During the execution of a parallel loop, each process running on a processor accesses non-shared variables. Most of the overhead associated with a true fork&join implementation comes from this heritage. To avoid this extra cost, sequential code sections are duplicated on each processor. Hence, if a parallel loop is executed, each processor has the same view of the contents of the non-shared variables. The iteration space of the parallel loop is then distributed and the execution continues. Since sequential code sections are duplicated, the code generator has to manage some instructions and library subroutines. Indeed, in a sequential code section, we cannot let several processors write into the same shared variables. The code generator has to insert extra code so that only one processor modifies a shared variable, in order to avoid memory contention and race conditions. A similar problem arises when shared and non-shared variables are involved in I/O routine calls executed in a sequential code section. For reading operations in a sequential section, only one processor is allowed to execute the I/O routine; for non-shared variables, the read value has to be broadcast to every processor. Similarly, writing operations in a sequential section are carried out by only one processor. To ensure the correctness of these code generation rules, the code generator adds synchronization barriers when entering or leaving a parallel loop, as well as when a processor is elected for executing an I/O routine or for updating a shared variable. Generating efficient SPMD code from control parallelism requires that several optimizations be carried out by the code generator to remove these synchronization barriers.
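To make these code generation rules concrete, the fragment below is a hand-written sketch of the kind of SPMD code the generator might emit for a sequential section that reads a non-shared scalar and updates a shared array, followed by a parallel loop. The helper routines mynode(), bcast_int(), barrier() and the bounds functions ilow()/ihigh() are hypothetical placeholders, not actual KOAN or Fortran-S runtime calls.

C --- sketch of generated SPMD code (hypothetical helper routines) ---
      integer mynode, ilow, ihigh
C --- sequential section, duplicated on every processor ---
      if (mynode() .eq. 0) then
C        only the elected processor executes the READ and the store
C        into the shared array s
         read (*,*) n
         s(1) = real(n)
      endif
C     the non-shared variable n is broadcast so that every processor
C     keeps the same view of non-shared data
      call bcast_int(n, 0)
C     barrier before entering the parallel loop
      call barrier()
C --- parallel loop: each processor runs its slice of iterations ---
      do i = ilow(1, n), ihigh(1, n)
         s(i) = s(i) + 1.0
      enddo
C     barrier when leaving the parallel loop
      call barrier()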
4 Fortran-S directives
The compiler's main function is to translate a Fortran-S program into a Fortran SPMD program. Parallel loops and shared variables are declared using directives. In this section we present the main directives available in Fortran-S. Section 5 gives an overview of the code generation scheme performed by the first prototype of the compiler.
4.1 Shared Variables
A variable is declared as shared using the directive:

      real V(N,N)
C$ann[Shared(v)]
Presently, shared variables can only be declared in the main program. Strong coherency is applied by default to shared variables. However, in many cases this can lead to false sharing. The KOAN SVM allows shared variables to be managed using a weak cache coherence protocol in some parts of the program, as explained in section 2. The following paragraphs present the parallel loop directives and show how weak coherency can be used in Fortran-S.
4.2 Parallel Loops
A parallel loop is declared using the directive:

C$ann[DoShared("BLOCK")]
      do nel = 1, nbnel
         sounds(nel) = sqrt(gama*p(nel)/ro(nel))
      enddo
The string "BLOCK" indicates the scheduling strategy of the iterations. The strategies available are the following:
1. "BLOCK": chunks of contiguous iterations are assigned to the processors.

2. "CYCLIC": iterations are distributed over the processors according to a "modulo-p" scheme. The first iteration is assigned to the first processor, the second to the second processor, and so on.

3. "ALTERNATE": identical to the cyclic distribution, but odd-numbered processors execute their sequence of iterations in the reverse order. Depending on data accesses, this scheduling may help to reduce the amount of false sharing.

4. "AFFINITY": affinity scheduling allows the system to map iterations onto processors in order to exploit temporal locality and, consequently, make the best use of the local cache. Before using this kind of scheduling, the user must first specify a virtual iteration space by adding the annotation C$ann[Affinity("DIST",low,up,step)] to his program. DIST indicates the loop distribution strategy and the remaining parameters are the start, the end and the stride of a 1D iteration space. The affinity scheduling will refer to this virtual distribution in order to assign the same index value to the same processor, whatever the parameters of the loop to be distributed. An illustrative sketch is given after this list.

5. User defined: in this case the user defines how the iterations are distributed. This is useful, for instance, to solve load balancing problems, and can be used in sparse matrix computations: after examining the structure of the matrix at run time, the iteration space can be distributed according to the number of nonzero elements per row of the matrix.
Weak cache coherency. The following annotation, when applied to a parallel loop, establishes weak coherency for a shared variable and thus may greatly reduce the number of conflicts that may arise on the pages that hold the variable:
C$ann[WeakCoherency(y)]
where y is a shared variable. For example, in the following loop the variable y is written simultaneously by many processors, so there will be false sharing on the pages where the variable y is stored. The weak coherence protocol removes that phenomenon by merging the modifications of pages only at the end of the loop.

C$ann[DoShared("BLOCK")]
C$ann[WeakCoherency(y)]
      do i = 1, n
         temp = 0.
         do k = ia(i), ia(i + 1) - 1
            temp = temp + a(k)*x(ja(k))
         enddo
         y(i) = temp
      enddo
Reduction operations. Fortran-S provides two annotations for making reduction operations on either a scalar or a vector variable more efficient. The annotations are

C$ann[SGlobal(OP,v)]
C$ann[VGlobal(OP,v,n)]

where OP is an associative and commutative operator such as SUM, PROD, MIN or MAX. The parameter v is a scalar when it appears in SGlobal and a vector in VGlobal; in the latter case, n specifies the number of elements in the vector and the reduction operator is applied to every element of the array. The following example illustrates the use of a reduction operation (a global maximum):

      real u(N,M)
C$ann[Shared(u)]
.........
C$ann[DoShared("BLOCK")]
C$ann[SGlobal(SMAX,rmax)]
      do j = 1, 63
         do i = 3, 128 - 1, 2
            uold = u(i,2*j+1)
            u(i,2*j+1) = omega * ((2*u(i-1,2*j+1) + 2*u(i+1,2*j+1)
     &                   .... omega1*u(i,2*j+1)
            r = DABS(uold - u(i,2*j+1))
            if (r .gt. rmax) rmax = r
         enddo
      enddo
For instance, in order to compute the global maximum in the previous example, it is more efficient to compute the maximum on each processor (stored in a non-shared variable) and then to merge the results using the C$ann[SGlobal(SMAX,rmax)] directive. Several DMPCs have such reduction operations implemented efficiently with message passing techniques; Fortran-S is able to exploit them transparently.
4.3 Page broadcast
When a shared variable is modified within a sequential code section and then used within a parallel loop, every processor will send requests for the same pages to the processor that ran the sequential code section. To avoid this potential contention, it is possible to broadcast the pages that store the shared variable to every processor before they start to access it. This is done using the following directives:

C$ann[BeginBroadcast(var)]
C$ann[EndBroadcast()]
where var is a shared variable. These directives cannot appear in a parallel loop. For example, in the following code segment, the directive BeginBroadcast(v) is used to indicate that modifications made to the shared array v within the sequential code section must be recorded. The directive EndBroadcast() indicates that the recorded modifications must be broadcast to all the processors.

      do i = 1, m
C$ann[BeginBroadcast(v)]
         tmp = 0.0
         do k = 1, n
            tmp = tmp + v(k,i)*v(k,i)
         enddo
         xnorm = 1.0 / sqrt(tmp)
         do k = 1, n
            v(k,i) = v(k,i) * xnorm
         enddo
C$ann[EndBroadcast()]
C$ann[DoShared("BLOCK")]
         do j = i + 1, m
            tmp = 0.0
            do k = 1, n
               tmp = tmp + v(k,i)*v(k,j)
            enddo
            do k = 1, n
               v(k,j) = v(k,j) - tmp*v(k,i)
            enddo
         enddo
      enddo
The broadcast reduces contention when several processors want to access the same part of a shared variable (computed sequentially) and improves the performance of the parallel loop. The values of non-shared variables modified between a BeginBroadcast() and an EndBroadcast() are undefined after the broadcast. This is because only one processor executes the sequence of code delimited by the broadcast directives.
4.4 Directives for synchronization
When programming with shared variables, it is sometimes necessary to synchronize processors when they have to write to the same memory address. In the following example, the do loop can be parallelized only if we ensure that several processors will not write to the same array element at the same time. Unfortunately, the value of the index k is only known at runtime.

      do i = 1, n
         k = idx(i)
         v(k) = v(k) + a(i)
      enddo
Fortran-S provides two ways of synchronizing parallel computations: critical sections and atomic updates.

C$ann[BeginCritical()]
C$ann[EndCritical()]
With a critical section, the previous example can be rewritten as follows:

C$ann[DoShared("BLOCK")]
      do i = 1, n
         k = idx(i)
C$ann[BeginCritical()]
         v(k) = v(k) + a(i)
C$ann[EndCritical()]
      enddo
With an atomic update, the previous example can be rewritten as follows:

C$ann[DoShared("BLOCK")]
      do i = 1, n
         k = idx(i)
C$ann[AtomicUpdate()]
         v(k) = v(k) + a(i)
      enddo
This annotation has to be inserted before an assignment statement that modifies a shared variable within a parallel loop. It ensures that the page containing the address to be accessed is locked in memory until the assignment is completed. The main difference between these two synchronization techniques is their cost. Critical sections are implemented with a distributed algorithm based on sending a token around a ring of processors [22]. They should be used to synchronize large-grain computations, whereas atomic updates can be used to synchronize small-grain computations, as shown in the previous example. Experiments we have done on matrix assembly in a numerical application [4] have shown that this kind of page locking is efficient as a synchronization tool.
4.5 The Escape Mechanism
In some cases, to achieve efficiency, the user may want to take control of parallelism management and data structure distribution. This is possible by adding a C$ann[Escape()] directive that tells the compiler which subroutines are not to be modified. In these subroutines, programmers can freely exploit the SPMD execution model. The programmer has access to shared variables but he or she cannot declare parallel loops. This escape mechanism is similar to the one found in HPF [1].
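A minimal sketch of how the escape mechanism might be used is given below. The routine itself is invented for illustration; we also assume that the Escape directive is placed just before the subroutine it protects and that the processor number and processor count are passed in by the caller, since the paper does not specify the low-level SPMD interface.

C$ann[Escape()]
      subroutine scaleblock(v, n, me, nprocs)
C     This routine is left untouched by the Fortran-S compiler
C     (Escape), so the programmer exploits the SPMD model directly.
C     me is the processor number (0..nprocs-1) and nprocs the number
C     of processors; both are assumed to be provided by the caller.
      integer n, me, nprocs, i, lo, hi
      real v(n)
C     each processor scales its own contiguous block of the shared
C     array v; the blocks do not overlap, so no synchronization is
C     needed inside the routine
      lo = me * (n / nprocs) + 1
      hi = (me + 1) * (n / nprocs)
      if (me .eq. nprocs - 1) hi = n
      do i = lo, hi
         v(i) = 2.0 * v(i)
      enddo
      end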
5 Implementation of a Prototype Fortran-S Compiler
The compiler is implemented using the Sigma system developed at Indiana University [11]. Sigma has a Fortran parser, libraries for writing program transformations, and support for program annotations. The Fortran-S compiler is a source-to-source compiler. The function of the compiler is to generate the appropriate SPMD code for the following constructs:
Allocation of shared variables: the compiler generates subroutine calls to the KOAN system to allocate the shared variables.

Accesses to shared variables in sequential regions (i.e. the parts of the code that simulate the sequential execution of the program in the SPMD execution): when a shared variable is updated in a sequential region of the program, it must be ensured that only one processor updates the variable, all the other processors having to synchronize on that store operation. One of the optimizations the compiler performs is to decrease the number of these synchronizations by moving them out of loop bodies. Deciding if a variable is shared or non-shared is done at runtime, except for those variables for which this can be determined at compile time.

Parallel loops: at the beginning of a parallel loop the compiler generates a call to a function that decides the allocation of iterations to the processors. A barrier is also generated at the end of the parallel loop. The processor also sets a flag which indicates entry into a parallel loop.

Function bodies: the body of each function is duplicated. The decision as to which body to execute is made at run time according to a flag that indicates whether the call to the function is made in a parallel loop or not. Indeed, depending on whether the call is made in a parallel loop or in a sequential region, synchronizations must or must not be performed. So the compiler generates two bodies for the function:

1. A body identical to the original one, used if the function is called from within a parallel loop. Parallel loops are left sequential in this body.

2. A body with synchronizations and parallel loops. This body is executed when the function is called in a sequential region of the code. It contains synchronization for updating shared variables and manages the distribution of the iteration space of parallel loops.
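The sketch below illustrates, with hand-written code, the effect of this function-body duplication for a routine containing one parallel loop. The routine and the helpers in_par_loop(), distrib() and barrier() are hypothetical placeholders for the flag test and the generated runtime calls, which the paper does not name; whether the generator emits two separate routines or a single routine with a run-time test is a detail we gloss over here.

      subroutine work(a, n)
      integer n, i, ilo, ihi
      real a(n)
      logical in_par_loop
      if (in_par_loop()) then
C        body 1: the call comes from inside a parallel loop, so the
C        loop is kept sequential and no synchronization is added
         do i = 1, n
            a(i) = a(i) + 1.0
         enddo
      else
C        body 2: the call comes from a sequential region, so the
C        iteration space is distributed and bracketed by a barrier
         call distrib(1, n, ilo, ihi)
         do i = ilo, ihi
            a(i) = a(i) + 1.0
         enddo
         call barrier()
      endif
      end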
6 Preliminary Results
In this section we present the first results obtained using Fortran-S on an Intel iPSC/2 with 32 nodes. The goal of these experiments was to port sequential Fortran 77 programs to Fortran-S and to measure the performance obtained. In these early performance measurements, it was not intended that there be extensive modifications to the applications. Rather, the desire was to measure the performance of Fortran-S when applied in a straightforward manner to Fortran 77 code. Very few modifications were made to the original programs. The primary modification was to expose parallel loops in the programs. However, no modification of the data structures used in the programs was made. There were also no major modifications to the algorithms, so the scalability of some applications is not limited by Fortran-S but by the algorithm used in the application. The problem of false sharing that appears in many applications was solved using a weak coherence protocol.
C$ann[DoShared("BLOCK")]
C$ann[WeakCoherency(a)]
      do j = 2, n - 1
         do i = 2, n - 1
            a(i, j) = (b(i - 1, j) + ... + b(i + 1, j)
     &               + b(i, j - 1) + ... + b(i, j + 1)) / 4
         enddo
      enddo

C$ann[DoShared("BLOCK")]
C$ann[WeakCoherency(b)]
      do j = 2, n - 1
         do i = 2, n - 1
            b(i, j) = a(i, j)
         enddo
      enddo

Figure 1: Fortran-S code for the Jacobi loop.
6.1 Jacobi loops
The code used for the Jacobi experiment is shown in Figure 1. Table 1 gives the speedups and efficiencies for different problem sizes when using either a strong or a weak cache coherence protocol. For a matrix size of 100 x 100, we got a "speed-down" when the number of processors is greater than 16. This false sharing can be avoided by using our weak cache coherence protocol. For the same problem size, this cache coherence protocol improves the speedups a little, but they remain flat. For a larger problem size (200 x 200) we did not observe this phenomenon; however, when the number of processors is set to 32, the efficiency is poor (20.71%). The weak cache coherence protocol increases the efficiency to 32.49%. This behavior is observed only for small matrices, while for large matrices the efficiency is close to the maximum.
6.2 Matrix multiplication
Table 2 gives timing results for small matrices (100 x 100 and 200 x 200). For larger matrix sizes, speedups are close to the maximum. This can be seen in the table: for a 32-node configuration, the speedup increases from 3.58 to 24.57 when the number of matrix elements is quadrupled. However, for small matrices, the results can be improved by using the weak cache coherence protocol.
                      100 x 100               200 x 200
  P      Times (ms)    S     E (%)    Times (ms)    S     E (%)
                        With strong coherency
  1         3112       -       -        12933       -       -
  2         1927     1.61    80.75       7323     1.77    88.30
  4         1280     2.43    60.78       3975     3.25    81.34
  8         1322     2.35    29.43       2284     5.66    70.78
 16         3882     0.80     5.01       1446     8.94    55.90
 32         5339     0.58     1.82       1928     6.71    20.96
                        With weak coherency
  1         3112       -       -        12933       -       -
  2         1972     1.58    78.90       7323     1.77    88.30
  4         1311     2.37    59.34       4016     3.22    80.51
  8          923     3.37    42.15       2305     5.61    70.14
 16          921     3.38    21.12       1567     8.25    51.58
 32         1151     2.70     8.45       1244    10.40    32.49

Table 1: Performance results for the Jacobi loop.

                      100 x 100               200 x 200
  P      Times (ms)    S     E (%)    Times (ms)    S     E (%)
                        With strong coherency
  1        15694       -       -       127657       -       -
  2         7920     1.98    99.08      64037     1.99    99.67
  4         4056     3.87    96.73      32292     3.95    98.83
  8         2206     7.11    88.93      16522     7.73    96.58
 16         3393     4.63    28.91       8982    14.21    88.83
 32         4379     3.58    11.20       5196    24.57    76.78
                        With weak coherency
  1        15694       -       -       127657       -       -
  2         7923     1.98    99.04      64036     1.99    99.68
  4         4048     3.88    96.92      32276     3.96    98.88
  8         2202     7.13    89.09      16521     7.73    96.59
 16         1287    12.19    76.21       8972    14.23    88.93
 32          884    17.75    55.48       5206    24.52    76.63

Table 2: Performance results for the matrix multiply.
Indeed, the poor performance is always due to the same effect: false sharing. The same table provides timing results when the parallel loop is executed with weak coherency. For the small matrix, the gain in performance is impressive: when the number of processors is set to 32, the speedup increases from 3.58 to 17.75.
6.3 Modified Gram-Schmidt
Given a set of independent vectors {v1, ..., vn} in R^m, the Modified Gram-Schmidt (MGS) algorithm produces an orthonormal basis of the space generated by these vectors. The basis is constructed in steps where each newly computed vector replaces the old one. We added annotations to improve the efficiency of the parallel MGS algorithm.
                   Strong              Weak          Weak+Broadcast     MP
  P        Times (s)    S     Times (s)    S     Times (s)    S        S
                                  200 x 200
  1         125.99      -      125.99      -      125.99      -        -
  2          79.34    1.59      66.34    1.90      66.69    1.89     1.97
  4          64.20    1.96      37.07    3.40      37.09    3.40     3.82
  8          61.59    2.05      23.99    5.25      23.04    5.47     7.22
 16          65.49    1.92      20.61    6.11      16.85    7.48    13.98
 32          78.79    1.60      23.62    5.33      14.88    8.47    23.79
                                  500 x 500
  1        1986.81      -     1986.81      -     1986.81      -        -
  2        1029.11    1.93    1007.51    1.97    1013.20    1.96     1.97
  4         562.52    3.53     517.57    3.84     522.38    3.80     3.90
  8         339.23    5.86     276.17    7.19     278.72    7.13     7.74
 16         233.10    8.52     163.98   12.12     158.97   12.50    14.81
 32         205.75    9.66     124.71   15.93     101.62   19.55    28.87

Table 3: Performance results for the MGS algorithm.
The source code is shown in section 4.3. The vector which is modified in the sequential section is broadcast to every processor, since it will be accessed within the parallel loop. This is done using the C$ann[BeginBroadcast(v)] directive. A weak cache coherence protocol is also associated with the inner loop to avoid false sharing. A detailed study of this algorithm can be found in [19, 20]. Table 3 summarizes the results we obtained with different strategies. The last column shows the speedup of a hand-coded message passing version of the MGS algorithm. As the problem size increases, the performance of the two versions tends to draw closer since the SVM version generates less false sharing. This phenomenon appears very frequently in the experiments we have carried out.
6.4 BECAUSE 2.5.1
The benchmark program BECAUSE BBS 2.5.1 is based on the matrix assembly that occurs in the Everest semiconductor device modeling system. Only the Poisson's equation solver has been studied here. It consists of a simplified matrix assembly loop over a quasi-realistic mesh. A performance analysis of a version of this benchmark written in Fortran-S is described in [7]. With 10648 nodes, we achieved a speedup of 11.39 with 16 processors and 15.71 with 32 processors.
6.5 Ordinary Differential Equation
An ordinary differential equation solver [8] has been written with Fortran-S. The main parallel loop in that application is limited to 8 iterations. A speedup of 5.8 has been obtained with 8 processors.
6.6 Sparse Matrix-Vector product
A performance study of a sparse matrix-vector product is described in [6]. We have shown that the computations involved in a sparse matrix-vector multiply can be easily distributed by using the user-defined loop scheduling strategy provided by Fortran-S. Concerning data locality, the overhead induced by reading the matrix comes from an initial system distribution that is unrelated to the loop distribution, but these page moves become negligible as soon as the number of iterations calling the sparse matrix-vector multiply becomes sizable.
7 Conclusion
The approach presented in this paper can be seen as complementary to HPF when the latter is not able to generate efficient code (e.g. sparse computations). By using a shared virtual memory, the design of the Fortran-S compiler has been simplified. Our first prototype, which was designed and implemented in a few months, is able to generate efficient parallel code, and we have seen encouraging speedups for various applications. Our main objective in designing Fortran-S has been to provide a set of directives for the Fortran language to help the user parallelize applications. Because it took very little time to rewrite sequential algorithms in Fortran-S, we feel that this objective has been achieved. However, this code generator must be considered a research tool for evaluating new optimizations that will be added either to the compiler or to the SVM. For example, we plan to add and test optimizations such as those described in [5, 12, 15, 18, 23]. As we further investigate applications and gain experience as to what is required, the set of directives will no doubt grow. We will also carry out experiments on new, commercially available parallel machines such as the Paragon XP/S, which will have a Shared Virtual Memory.
Acknowledgment
The authors wish to thank Prof. D. Gannon for his careful reading and helpful suggestions and comments for improving this paper. This work is supported by Intel SSD under contract no. 1 92 C 250 00 31318 01 2 and by Esprit BRA APPARC.
References

[1] High Performance Fortran language specification. May 1993.

[2] Francoise Andre, Jean-Louis Pazat, and Henry Thomas. Pandore: A System to Manage Data Distribution. In International Conference on Supercomputing, ACM, June 11-15 1990.

[3] S. Benkner, B. Chapman, and H. Zima. Vienna Fortran 90. In Scalable High Performance Computing Conference, pages 51-59, IEEE Computer Society Press, April 1992.

[4] R. Berrendorf, M. Gerndt, Z. Lahjomri, T. Priol, and P. d'Anfray. Evaluation of numerical applications running with shared virtual memory. July 1993. APPARC ESPRIT deliverable.

[5] F. Bodin, C. Eisenbeis, W. Jalby, and D. Windheiser. A Quantitative Algorithm for Data Locality Optimization. Springer Verlag, 1992.

[6] F. Bodin, J. Erhel, and T. Priol. Parallel sparse matrix vector multiplication using a shared virtual memory environment. In Proceedings of the Sixth SIAM Conference on Parallel Processing for Scientific Computing, March 1993.

[7] F. Bodin and T. Priol. Overview of the KOAN programming environment for the iPSC/2 and performance evaluation of the BECAUSE test program 2.5.1. In BECAUSE Workshop, October 1992.

[8] Philippe Chartier. L-stable parallel one-block methods for ordinary differential equations. In Proceedings of the Sixth SIAM Conference on Parallel Processing, March 1993.

[9] F. Darema-Rodgers, V.A. Norton, and G.F. Pfister. Using A Single-Program-Multiple-Data Computational Model for Parallel Execution of Scientific Applications. Technical Report RC11552, IBM T.J. Watson Research Center, November 1985.

[10] G. Fox, S. Hiranandani, K. Kennedy, C. Koelbel, U. Kremer, C. Tseng, and M. Wu. Fortran D language specification. Technical Report TR90079, Department of Computer Science, Rice University, March 1991.

[11] Dennis Gannon, Jenq Kuen Lee, Bruce Shei, Sekhar Sarukkai, Srinivas Narayana, Neelakantan Sundaresan, Daya Atapattu, and Francois Bodin. Sigma II: a tool kit for building parallelizing compilers and performance analysis systems. To appear, Elsevier, 1992.

[12] E.D. Granston and H. Wijshoff. Managing pages in shared virtual memory systems: getting the compiler into the game. In Proceedings of the International Conference on Supercomputing, ACM, July 1993.

[13] A.H. Karp and V. Sarkar. Data Merging for Shared-Memory Multiprocessor. In Proc. Hawaii International Conference on System Sciences, pages 244-256, January 1993.

[14] Z. Lahjomri and T. Priol. KOAN: A Shared Virtual Memory for the iPSC/2 Hypercube. In CONPAR/VAPP92, September 1992.

[15] M. Lam, E. Rothberg, and M. Wolf. The cache performance and optimizations of blocked algorithms. In Proceedings of the Fourth ACM ASPLOS Conference, pages 63-75, 1991.

[16] Kai Li. Shared Virtual Memory on Loosely Coupled Multiprocessors. PhD thesis, Yale University, September 1986.

[17] Kai Li and Richard Schaefer. A hypercube shared virtual memory system. Proceedings of the 1989 International Conference on Parallel Processing, 1:125-131, 1989.

[18] K. S. McKinley. Automatic and Interactive Parallelization. PhD thesis, Rice University, 1992.

[19] T. Priol and Z. Lahjomri. Experiments with Shared Virtual Memory on an iPSC/2 Hypercube. In International Conference on Parallel Processing, pages 145-148, August 1992.

[20] T. Priol and Z. Lahjomri. Trade-offs Between Shared Virtual Memory and Message-passing on an iPSC/2 Hypercube. Technical Report 1634, INRIA, 1992.

[21] J.M. Stone. Nested Parallelism in a Parallel FORTRAN Environment. Technical Report RC11506, IBM T.J. Watson Research Center, November 1985.

[22] I. Suzuki and T. Kasami. An optimality theory for mutual exclusion algorithms in computer networks. In Conference on Distributed Computing Systems, October 1982.

[23] M. Wolf and M. Lam. An algorithm to generate sequential and parallel code with improved data locality. Technical Report, Stanford University, 1990.

[24] J. Wu, J. Saltz, S. Hiranandani, and H. Berryman. Runtime compilation methods for multicomputers. In Proceedings of the 1991 International Conference on Parallel Processing, pages 26-30, 1991.