Finding and Exploiting Parallelism in an Ocean Simulation Program: Experience, Results and Implications

Jaswinder Pal Singh and John L. Hennessy
Computer Systems Laboratory, Stanford University, CA 94305

Journal of Parallel and Distributed Computing, v. 15, no. 1, May 1992.

Abstract
How programmers will code and compile large application programs for execution on parallel processors is perhaps the biggest challenge facing the widespread adoption of multiprocessing. To gain insight into this problem, an ocean simulation application was converted to a parallel version. The parallel program demonstrated near-linear speed-up on an Encore Multimax, a sixteen-processor bus-based shared-memory machine. Parallelizing an existing sequential application—not just a single loop or computational kernel—leads to interesting insights about what issues are significant in the process of finding and implementing parallelism, and what the major challenges are. Three levels of approach to the problem of finding parallelism—loop-level parallelization, program restructuring, and algorithm modification—were attempted, with widely varying results. Loop-level parallelization did not scale sufficiently. High-level restructuring was useful for much of the application, but obtaining an efficient parallel program required algorithmic changes to one portion of it. Implementation issues for scalable performance, such as data locality and synchronization, are also discussed. The nature, requirements and success of the various transformations lend insight into the design of parallelizing tools and parallel programming environments.
1 Introduction

This paper presents our experience with parallelizing an ocean simulation application, written in FORTRAN, that investigates the role of mesoscale eddies and boundary currents in influencing ocean movements. The primary emphasis of the study is to examine the nature, requirements and scope of program transformations needed to obtain good speedups on a shared-memory multiprocessor. The application performs a time-dependent simulation, using finite differencing methods to repeatedly set up and solve a system of elliptic partial differential equations. While the problem is numerically intensive and well-suited to computer simulation, solving a realistic instance with an acceptable degree of accuracy is very time-consuming, which makes it natural to turn to parallel processing for assistance.

There are two opposite approaches to the use of parallelism in solving such problems. One is to write explicitly parallel programs, starting with a knowledge of the methods to be used and a clean programming slate, and the other is to construct tools that will automatically convert existing sequential programs into parallel ones. The sequential programs in existence are many and large. They are the result of a lot of painstaking effort by people who understood the physical and mathematical problems they were solving. Even if we ignore our own current lack of understanding about appropriate parallel paradigms and languages, it is unclear how much effort will be required of people in the application domains to produce effective parallel versions of their “dusty decks”, and whether they will be willing to make this investment. For new applications as well, there is at least the inertial desirability of allowing people to continue to program in the familiar sequential paradigm—as long as effective parallelizing tools can be developed—rather than forcing them to think about the synchronization, communication and data distribution required for efficient parallel execution. The unanswered question is whether this is possible.

Our goal in this work is to take a complete, real sequential application (not just a single loop or computational kernel) and parallelize it by hand, hoping to gain some insight into the transformations required to obtain an efficient parallel program. To what extent is higher-level knowledge of the problem domain or of alternative solution methods necessary, and how much can be accomplished with only that information which can be extracted from the sequential program? In the latter case, how should parallelizing tools and interactive environments best be designed to extract the required information, and what must be the user’s role in the process? What do the structures of real applications tell us about features that should be included in explicitly parallel languages to enhance expressiveness and efficiency, or in sequential languages to ease the task of a parallelizing tool? Finally, what do the execution characteristics of these applications imply for the design of multiprocessors on which they will be run? This work is one attempt to contribute towards understanding some of these issues.

The process of obtaining an effective parallel program can be divided into two parts, which happen to be quite separable in this application: finding the appropriate parallelism, and implementing it efficiently on the machine in question. We attempted three levels of approach to the problem of finding parallelism, each successive one because the previous levels proved inadequate:

1. Parallelizing existing loops, using a commercial parallelizing compiler.
2. Restructuring the program, after understanding the algorithms used, to make it more amenable to large-grained parallelism.
3. Changing algorithms when the original ones were not efficiently parallelizable.

The last two levels were implemented in explicitly parallel FORTRAN programs, using the PARMACS macro package for parallel constructs [7]. As we will see, automatic parallelization by a compiler was found to be insufficient and to not scale beyond a few processors. The second approach was very useful throughout the application; we discuss how we implemented it, where we spent most of our time, and what kinds of information we found ourselves needing along the way. Based on this, we also discuss some implications for sequential specification and parallelizing tools. The original algorithm for solving an elliptic equation, however, was found to have an inherently serial component that severely inhibited parallel efficiency. We therefore took the third approach and moved successfully to a completely different algorithm for this part of the application.

Implementation issues—such as partitioning, scheduling and synchronization—were found to be comparatively easy, especially on the Encore Multimax we used. Since this multiprocessor is bus-based and has
a centralized shared memory, data distribution for memory locality is not an issue. The relative speeds of the sixteen NS32032 processors and the memory and communication systems on our machine minimize the importance of cache locality as well, and we exploit it only to the extent that the payoff is high. We do, however, discuss the potential for exploiting data locality in higher-performance, non-uniform memory access machines. Sections 2 and 3 describe the physical problem to be solved and the sequential program, respectively. Section 4 discusses our high-level approach to parallelization and describes the automatic parallelization, profiling and dependence analysis we performed. A discussion of partitioning, scheduling and other implementation issues relevant to the entire application is found in Section 5, while Section 6 covers algorithmic and some implementation issues specific to solving the equations. Finally, Section 7 presents performance results and Section 8 summarizes our findings.
2 The Problem

This application studies the role of mesoscale eddies and boundary currents in influencing large-scale ocean movements. The computation proceeds for a number of time-steps—every time-step setting up and solving a couple of spatial partial differential equations—until a steady state is reached between the eddy currents and the mean ocean flow. A more detailed discussion of the problem and the governing time-dependent equation system can be found in Appendix A. For convenience of presentation here, a typical spatial (Helmholtz) equation can be represented as
\[
\frac{\partial^2 \psi}{\partial x^2} + \frac{\partial^2 \psi}{\partial y^2} - \lambda^2 \psi = \gamma \qquad (1)
\]
where ψ is a streamfunction we are solving for, λ is a constant, and γ is the driving function of the equation. The
terms constituting the driving functions are computed every time-step; they include wind stress, frictional forces, and some Laplacians and Jacobians of expressions computed from the streamfunction values at the beginning of the current time-step. A two-layer cuboidal ocean basin is modeled, with function values maintained at three horizontal crosssections: at the middles of the two layers, and at the interface between them. Second order finite differencing is used to discretize the functions on two-dimensional grids representing these cross-sections. Within limits imposed by the applicability of the model and the nature of information sought, the accuracy of the simulation depends on the physical distance between adjacent grid points. This means that improving the accuracy for a given ocean basin, or simulating larger basins with a given accuracy, requires more grid points in each dimension and hence longer execution times. We simulate square grids of size 14-by-14, 50-by-50 and 98-by-98 points, and use the same (constant) resolution in both dimensions. In fact, we maintain the same physical resolution across grid sizes as well, thus simulating larger ocean basins when we use larger grids.
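The discretized form of the equation is not reproduced in the text; a standard second-order (five-point) approximation of Eq. (1) at an interior grid point, with D denoting the physical distance between adjacent grid points as in Section 6.2.1, would be:

\[
\frac{\psi_{i-1,j} - 2\psi_{i,j} + \psi_{i+1,j}}{D^2}
+ \frac{\psi_{i,j-1} - 2\psi_{i,j} + \psi_{i,j+1}}{D^2}
- \lambda^2\,\psi_{i,j} \;=\; \gamma_{i,j}
\]

Rearranging this stencil for ψ at the centre point is what yields the iterative update of Eq. (2) used by the solver in Section 6.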
3 The Sequential Program

This section describes the sequential FORTRAN program, developed at the National Center for Atmospheric Research in Colorado, that was the starting point of our work.
3.1 Principal Data Structures
About twenty two-dimensional arrays, each representing a function discretized on an entire cross-sectional grid, constitute the principal data structures in the program. The functions represented include the streamfunctions at the mid-points of the two ocean layers (Ψ1, Ψ3) and at the interface between them (Ψ2), streamfunction values (Ψ1M, Ψ3M) at the previous time-step (needed to avoid numerical instability in computing the friction terms), the driving functions (γ) of the equations, their component terms, and various others.
3.2 High-level Control Flow
The program begins with the initialization of the grids and some one-time computation, which includes computing external forces and solving a single differential equation. After this, the outermost loop of the program iterates over a large number of time-steps. A flowchart for the computation within a time-step is shown in Figure 1, and two of the routines therein are later expanded in Figure 4. Since almost all the time-steps are identical, we simulate only 6 of these—instead of the 4500 simulated by the original sequential program—to reduce execution time.

[Figure 1: Flowchart of a Single Time-Step in the Sequential Program. The steps shown are: initialize γa and γb to all zeros; call ADVECT to compute the Jacobian values in the γ expressions (using the streamfunction matrices, see Appendix A); swap the Ψ matrices with the ΨM matrices (which lag one time-step) for the friction calculations; call FRICT to compute the curl and friction components in the γ expressions and update the γ matrices; call SOLVIT to solve the elliptic equations and produce the values required to update the streamfunction matrices for the next time-step; set TIMST to twice the time-step interval and call TIMINT to update the values in the streamfunction matrices; update the streamfunction running sums; and determine whether to print output before ending the time-step.]
4 The Approach to Parallelization

When the starting point for writing a parallel program is an understanding of the problem and a clean programming slate, the most likely approach is top-down: trying to identify all the high-level parallelism afforded by the problem before moving to lower levels, using the best known parallel algorithms at every step, and analyzing implementation tradeoffs to determine when to stop exploiting parallelism. Given an existing sequential program as a starting point, however, we can identify four levels of approach to its parallelization:

1. Loop Parallelization: the approach taken by current parallelizing compilers. Data dependences are analyzed among statements within a loop, and the parallelizability of every loop determined. A single thread executes the program and forks several others when a parallelizable loop is encountered. No high-level program restructuring or knowledge of the application is required.

2. Algorithm Parallelization: parallelizing an algorithm used by the sequential program, without modifying the basic algorithm. Since scientific algorithms are very often implemented by loops, we use algorithm parallelization to refer to those situations that require a restructuring of control beyond the parallelization of existing loops. The program module implementing an algorithm may be a code segment within a subroutine, an entire subroutine, or a set of subroutines.

3. Global Parallelization: finding parallelism across these modules by global dependence analysis. This often involves high-level restructuring of the program.

4. Problem Parallelization: reexamining the physical or mathematical problem to discover other methods—more amenable to parallelization—that can be used to solve it. The information required for this level of parallelization is not available in the sequential program.

We attempted all these four levels, incorporating global parallelization into our implementations of algorithm as well as problem parallelization. The results varied widely in terms of effort required and effectiveness, as we shall see.
4.1 Loop Parallelization by a Commercial Compiler
Given that our starting point was unfamiliar sequential code, the most natural approach for us was to parallelize existing, easily identifiable loops in it. This was done with low programming overhead by passing the code through a parallelizing compiler, using directives to help the compiler with loops that it failed on. Because our Encore Multimax does not have a parallelizing compiler, we performed this experiment on another bus-based shared-memory multiprocessor: an Alliant FX/8 with eight relatively fast processors, each having vector capability. Two sets of speedup curves, obtained with the FX/FORTRAN compiler, are shown in Figure 4.1: one for the concurrentized program over the sequential one (with vectorization disabled in each case), and the other for the vectorized and concurrentized program over the vectorized one on a single processor. The latter set is included only for comparison: We do not consider vectorization any further in this paper. Scalar optimizations were enabled in all cases.

Simply running the sequential program through a loop-level parallelizer was clearly not enough to obtain acceptable parallel performance. Sometimes, the compiler's tendency to parallelize the outermost possible loop in a nest interacted poorly with the lack of knowledge about program inputs and with the way the sequential program was written; sometimes the analysis wasn't sophisticated enough to parallelize the most appropriate, complex loops; and sometimes the effective transformations were outside the realm of what is considered to be compiler technology today (see [8] for a detailed treatment of compiler interactions). The rest of this section describes the steps we took in performing our higher-level combination of (explicit) algorithm and global parallelization, using in our analysis only information obtained from the sequential program itself, without any prior knowledge of the algorithms or problem domain.
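To make the compiler's choices concrete, the sketch below (in C rather than the application's FORTRAN, with invented names) shows the kind of near-neighbour loop nest such a program contains. A loop-level parallelizer would normally distribute the outer loop across processors; the payoff depends on the grid dimension relative to the processor count and on whether the compiler can prove the profitable loop independent, which matches the behaviour described above.

```c
/* Illustrative sketch (not the application's code): a five-point
 * Laplacian over an n-by-n grid.  A loop-level parallelizer typically
 * targets the outermost loop (over i); the inner loop is a natural
 * vectorization candidate. */
void laplacian(int n, double d, const double psi[n][n], double work[n][n])
{
    for (int i = 1; i < n - 1; i++)        /* candidate parallel loop */
        for (int j = 1; j < n - 1; j++)    /* inner, vectorizable loop */
            work[i][j] = (psi[i-1][j] + psi[i+1][j] +
                          psi[i][j-1] + psi[i][j+1] -
                          4.0 * psi[i][j]) / (d * d);
}
```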
4.2 Profile of a Sequential Execution
Our first step was to obtain an execution profile of the sequential program, using the UNIX gprof utility on a VAX 11/780. Such a profile is useful in several ways (although it is limited in each of these by the information it provides being at the granularity of entire subroutines, see Section 4.3):
- It tells us where the program spends most of its execution time.
- It shows us the subroutine call hierarchy of a real execution, and helps identify routines to which calls exist in the code but within conditionals that are never satisfied in solving the problem at hand. This is particularly useful in large programs and in software packages written for generality.
- The call hierarchy, the number of times a given subroutine is invoked, and the time spent per invocation combine to provide hints toward the level at which parallelism should be looked for in various parts of the program.
[Figure 4.1: Speedup of the loop-parallelized program on the Alliant FX/8 versus number of processors (1 to 8), for 14-by-14, 50-by-50 and 98-by-98 grids, with vectorization off and with vectorization on, together with the ideal speedup.]

Figure 2 reveals that three subroutines (jacob, lapla and trix) occupy over 70% of a representative execution. If no parallelism is involved in the execution of even the least time-consuming of these routines, the speedup obtainable with an arbitrarily large number of processors will be limited to about 5 [1]. The profile output excludes several subroutines from the equation-solving software package, subroutines that we find we can ignore in our parallelization. Finally, we see that while jacob and lapla are called only 3 and 4 times, respectively, every time-step—with a call averaging 180 and 95 milliseconds, respectively, for this grid size—trix is called 47 times per time-step with every call taking only 4 milliseconds. Since we may only parallelize within a single time-step (see Section 4.3.2), this observation indicates that we will need to look for parallelism within the work done in subroutines jacob and lapla, but points us toward the possibility of parallelizing the calls to trix rather than having to parallelize it internally. Subroutine hwscrt and its descendants in the call graph constitute the equation solver. Figure 3 shows the amount of time spent in the solver and in the rest of the sequential program, for different problem sizes. Notice that the uniprocessor execution time of the program grows linearly with the number of grid points.
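The limit of about 5 quoted above is simply Amdahl's law: if a fraction s of the execution remains serial, the speedup on p processors is bounded as below, and a serial fraction of roughly one-fifth (the share of the smallest of the three dominant routines) gives a ceiling near 5 regardless of the number of processors.

\[
S(p) \;=\; \frac{1}{s + (1-s)/p} \;\le\; \frac{1}{s},
\qquad s \approx 0.2 \;\Rightarrow\; S_{\max} = \frac{1}{s} \approx 5 .
\]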
[Figure 2: Execution Profile of Sequential Program Simulating 1500 Days on a 26-by-26 Grid. A call graph rooted at MAIN, whose descendants include ADVECT, JACOB, FRICT, LAPLA, SOLVIT, HWSCRT, GENBUN, POISD2, TRIX and other routines, annotated with call counts, per-call times, and the percentage of total execution time spent in each subroutine. Note: Italicized numbers along an arc from a parent to a child denote the number of times the child is called by that parent. Percentages in boldface denote the fraction of the total execution time spent in the subroutine next to which they appear. Numbers on the top left corner of a subroutine name represent the amount of time spent in that subroutine itself due to calls by its left (or only) parent, and those on the bottom left corner represent the time spent in the descendants of that subroutine due to such calls. An analogous interpretation applies to numbers on the top and bottom right corners.]
4.3 Discovering the Available Parallelism
Our next step was to uncover all the parallelism afforded by the program at various levels, before determining how to best implement it. In this subsection, we discuss some issues in the analysis of program dependences this required.

4.3.1 Levels of Parallelism and Analysis
[Figure 3: Uniprocessor Execution Time versus Number of Grid Points. Execution time in seconds (up to about 560) for the whole program and for the equation solver alone, on 14-by-14, 26-by-26, 50-by-50 and 98-by-98 grids; percentage annotations of 28.6%, 30.5% and 30.7% appear alongside the curves.]

The most natural level at which to analyze dependences, from both software and hardware points of view, is that of individually addressable memory locations (as compilers conventionally do). Global analysis at this granularity should allow us to discover all the statically determinable parallelism visible in software. Such analysis is forbiddingly expensive, however, and the overhead required to implement the fine-grained global parallelism it yields is often both prohibitive and unnecessary, particularly when higher level load-balanced partitions can be found (see Section 5.3). While, in theory, dependence information at any higher level we desire can be extracted from this fine-grained analysis, the extraction process is non-trivial as well. For this application, we found it useful to look for parallelism at a hierarchy of successively refined levels or granularities. At every level, we need to do three things: identify the basic units of dependence analysis—what we will call pieces of work or tasks—at that level, characterize every task by the data it reads and writes, and analyze dependences across the summarized tasks in the scope of interest.

Identifying the tasks at a given level is application-dependent and quite subjective. It is therefore the most difficult step to automate. As human users, we were able to conceptualize the computation in an appropriately structured manner after a careful understanding of the high-level control flow, a process that included sophisticated symbolic analysis, profiling of some control variables, and special-casing program generality to avoid being overly conservative. For an automatic tool, the obvious structural boundaries of a sequential program are subroutines and loops. As we shall see shortly, the former are most often inappropriate task delimiters even at the outermost level of analysis, and almost certainly so at lower levels. In some cases, the sequential program is written in such a way as to make loops inappropriate as boundaries too, and it is in any case difficult to automatically identify the most appropriate delimiting loop in a nest. It seems, therefore, that task boundary information must be communicated by the user to the parallelizing system, either in an appropriately designed source language or through interactive assistance. In this application, every outermost-level task in a time-step is a regular calculation on an entire grid(s)—such as adding two grids or computing the Laplacian of a grid function. At the next level in the hierarchy, a task may be an operation on a column or even on an individual grid element. The equation solver has a more complicated structure, which we describe in Section 6; at the outermost level, we view it as a single, separate task.

Once the tasks are identified, computing summary use-modification information for them is itself by no means straightforward for a tool. While this application does not have tasks at the outermost level operating on only parts of the two-dimensional grids, the sequential program often has two grids stored together in the same three-dimensional array, making array-name level summaries conservative even at this level. Intermediate levels in the hierarchy are more difficult in this regard, requiring summaries at the level of subarrays. Perhaps experience will allow us to identify classes of applications for which certain kinds of summarizing generally work well, or direct us to leave at least some of this responsibility to the user.

At levels that are coarse enough to keep the overhead manageable, the job of preserving dependences may be left to a run-time system once the tasks are identified and their data usage summarized. The compiler is then only responsible for parallelism within these tasks. Such a run-time system was not available to us, so we had to find and specify all the parallelism explicitly. A hierarchical approach allows us to perform the global analysis inexpensively at a coarse granularity, and use the information obtained to both exploit parallelism at that granularity (to perhaps a greater extent than run-time analysis would) as well as restrict the scope of finer-level
analysis as appropriate. The analysis at a given level is least expensive if its scope is limited to the parent task at the next coarser level. However, such restriction need not be rigidly enforced: Particularly at intermediate levels, it is often useful to extend the analysis across a subset of outer-level tasks in order to pipeline their execution or reduce synchronization overhead (see Section 5.3). The successively refined levels simply provide a useful conceptual framework that maps well to the structure of real scientific applications and reduces the cost of global analysis. The effectiveness of the analysis in finding parallelism can be traded off with its cost by adjusting its scope at various levels. An interactive analysis and explicit specification also allow the user greater control over the parallelism that is ultimately exploited. We found dependence analysis on summarized task information likely to be the easiest step for an automatic tool in the process of uncovering parallelism. Let us consider, for example, the analysis at the outermost level (across grid tasks).

4.3.2 True and False Dependences
To maximize parallelism, we would like to be constrained by only true dependences; that is, dependences in the work to be done, not those that are artificially imposed by either the control structure of the sequential program or the data structures employed in it. To begin with, there is the true dependence between time-steps: The results generated in one time-step are used at the beginning of the next, so that all desired parallelism must be looked for within the work done in a single time-step. We can divide a time-step into three parts, with true high-level dependences between them: computation of the terms in the driving functions, solution of the spatial equations, and the final updates of the streamfunctions and their running sums. In the flowchart for a regular time-step in Figure 1, these correspond to everything before the call to solvit, the work done within solvit, and the updates done in and after timint, respectively. We mentioned earlier that each equation solution in solvit is treated as a single task at the outermost level. Since the work that follows solvit is simply a set of grid updates, we focus our attention on the part of the time-step that precedes the call to solvit. A cursory look at the flowchart for this part (see Figure 1) does not reveal much promise for parallelism at the outermost level, since there appear to be dependences between successive steps. The analysis that allows us to recognize and remove different types of false dependences has two principal characteristics:

1. It is interprocedural. There is a perhaps significant distinction that can be made here between parallelizing a sequential program and parallelizing an application. Sequential programs are structured in subroutines according to various criteria: modularity, isolating design decisions, grouping together computations that represent the same physical phenomenon, etc. Execution dependences are not necessarily among these criteria. In fact, it is often the case that two pieces of work in the same routine are not independent, but each can be done in parallel with tasks in other routines. Consequently, to find all the available parallelism, our dependence analysis should have as its basic units these pieces of work, and should be neither constrained by nor subsumed in the subroutine structure.

2. For the most part, it is the sort of data dependence analysis that a compiler would perform for simple scalars. It also involves similar types of solutions (such as renaming variables) to get rid of false dependences, only at a much higher level: that of entire two-dimensional arrays.
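A minimal sketch, in C rather than the original FORTRAN, of the kind of summarized task information this analysis works with: each task is characterized by the sets of whole grids it reads and writes, and two tasks conflict (must be ordered, or have the conflict removed by renaming) exactly when one writes a grid the other touches. All names here are illustrative, not taken from the program.

```c
#include <stdbool.h>

/* Illustrative task summary: which whole grids a task reads and writes,
 * recorded as bit masks indexed by grid number (about twenty grids). */
typedef struct {
    unsigned long reads;   /* bit g set => task reads grid g  */
    unsigned long writes;  /* bit g set => task writes grid g */
} task_summary;

/* Two tasks may run in parallel only if neither writes a grid the other
 * touches; otherwise there is a (possibly false) dependence that must be
 * preserved, or removed by renaming / introducing temporaries. */
bool tasks_conflict(task_summary a, task_summary b)
{
    return (a.writes & (b.reads | b.writes)) != 0 ||
           (b.writes & a.reads) != 0;
}
```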
We use two representative examples to illustrate. First, the Ψ grids are used and modified in subroutine advect, and then used again in frict, a situation that would appear to sequentialize advect and frict. Closer, interprocedural dependence-checking, however, shows that this is only a naming problem and not a true dependence. The values in the Ψ grids when tasks in frict use them are not those produced by advect. In fact, they are swapped in from the ΨM grids—which are themselves unmodified in advect—between the two routines. This dependence can therefore be eliminated by having tasks in frict use the ΨM grids directly as input, particularly since these tasks themselves do not modify their input.

Our second example demonstrates the usefulness of high-level code restructuring, as well as the inappropriateness of constraints imposed by the sequential program's subroutine boundaries. The flowcharts for advect and frict in Figure 4 show that the γ grids are updated several times in each of these routines, with successive updates using the most recently updated γ grid values. Advect and frict are separate subroutines only because they compute terms in the γ expressions that come from different physical sources (advection and friction, respectively). High-level analysis across these subroutines reveals that although tasks in the sequential program update the γ grids immediately as they compute their terms, successive tasks do not actually use the updated γ grids in computing their terms. The terms in both routines can be computed independently and, since the updates are associative, either accumulated concurrently with mutual exclusion or stored in temporary grids which are accumulated together at the end (the latter being our choice). Once again, the high-level analysis required for this restructuring is interprocedural, but otherwise not complicated.
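A sketch of the chosen restructuring, in C with invented names: each term is computed into its own temporary grid, and the γ grid is updated in a single final pass, so the term computations carry no dependences among themselves. (The alternative mentioned above would instead accumulate directly into the γ grid under mutual exclusion.)

```c
/* Illustrative only: fold independently computed term grids into the
 * gamma grid in one final pass, instead of updating gamma after every
 * term as the sequential advect/frict routines do. */
void accumulate_terms(int n, double gamma[n][n],
                      int nterms, double temp[nterms][n][n])
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = gamma[i][j];
            for (int t = 0; t < nterms; t++)   /* updates are associative */
                sum += temp[t][i][j];
            gamma[i][j] = sum;
        }
}
```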
[Figure 4: Flow Charts of the advect and frict Subroutines in the Sequential Program. The advect panel: call LAPLA to put the Laplacian of Ψ1 in a temporary array WORK1; update the WORK1 array by adding the f array to it; call JACOB on the Ψ1 and WORK1 arrays and put the Jacobian in a second temporary (WORK3) array; use the values in WORK3 to update the γ arrays; repeat the above sequence on the Ψ3 arrays (note that the updated γ arrays from the first sequence are used in the γ updates this time); compute the values of Ψ2 and (Ψ1 − Ψ3) in WORK arrays; call JACOB to compute the Jacobian of these two WORK arrays and put the result in another; and use the Jacobian output to update the γa array. The frict panel: update the γ arrays with the CURL-z terms; call LAPLA to put the Laplacians of Ψ1 and Ψ3 in temporary arrays WORK1 and WORK3; and use the resultant Laplacians (the friction terms) to finally update the γ arrays.]
These mechanisms of renaming entire data structures and introducing temporary storage can thus be used to get rid of some apparent high-level dependences in the code. Figure 5 shows the resulting outermost-level parallel structure of the application.
5 The Parallel Implementation

The process of obtaining an efficient parallel application can be viewed as comprising several steps. The first of these is finding the available parallelism, which is the only one we have discussed so far. Not all the available parallelism is always implemented, however. Even if we ignore the memory system of the machine, other factors—such as load balancing and synchronization overhead—must be considered in determining the appropriate parallelism to exploit. On real machines, with nonuniform access times to different parts of the memory system, partitioning and scheduling tasks to minimize interprocessor communication and maximize data locality become critical as well. In this section, we ignore the equation solver (which is treated separately in Section 6) and examine some implementation tradeoffs in the rest of the application. Since the memory system is not a critical performance bottleneck on the Multimax we use, we do not actually measure the impact of the data locality issues we discuss, leaving such exploration to a future study of scalability on high-performance
machines. Some discussion of this application's interactions with a shared memory system can be found in [9].

[Figure 5: The Parallel Phases in a Time Step. The phases preceding the first equation solution include: initialize γa and γb; put the Laplacians of Ψ1 and Ψ3 in W1_1 and W1_3; copy Ψ1, Ψ3 into T_1, T_3; add the f values to the columns of W1_1 and W1_3; put the Jacobians of (W1_1, T_1) and (W1_3, T_3) in W5_1 and W5_3; put the computed Ψ2 values in W3; put Ψ1 − Ψ3 in W2; put the Jacobian of (W2, W3) in W6; put the Laplacian of Ψ1M, Ψ3M in W7_1,3; put the Laplacian of W7_1,3 in W4_1,3; put the Laplacian of W4_1,3 in W7_1,3; copy Ψ1M, Ψ3M into Ψ1, Ψ3; copy T_1, T_3 into Ψ1M, Ψ3M; and update the γ expressions. The remaining phases are: solve the equation for Ψa and put the result in γa; compute the integral of Ψa; compute Ψ = Ψa + C(t)Ψb (note: Ψa and now Ψ are maintained in the γa matrix); solve the equation for Φ and put the result in γb; use Ψ and Φ to update Ψ1 and Ψ3; and update the streamfunction running sums and determine whether to end the program. Note: Horizontal lines represent synchronization points among all processes, and vertical lines spanning phases demarcate threads of dependence.]
5.1 Vertical versus Horizontal Partitioning
For a large part of this section, we refer to the chart in Figure 5 depicting the parallelizable phases in a time-step. Once again, we focus on the part of the time-step that precedes the first equation solution. There are two ways in which one can imagine partitioning this work: vertically and horizontally. Vertical Partitioning The chart reveals a fixed number of independent vertical paths, composed of tasks on entire grids, which have dependence-constrained sequentialization imposed within them but which need not synchronize or communicate with one another until the end of the fourth phase. Vertical partitioning implies splitting the processors among these independent paths—according to the parallelizability and expected execution time of each—resulting in few processors sharing a task and large granularity between synchronization points. However, the widely differing amounts of work done on different paths, their variation with problem size, and the difficulty of predicting relative execution times on a real multiprocessor make it very difficult to statically obtain load-balanced assignments of processors to these partitions, particularly when the number of processors used is statically unknown. A better way to use vertical partitioning is with a dynamic tasking scheme in which processors whose initially assigned path has been completed move to the task queue of some other—yet uncompleted—path, and so on. Simple task-queueing compromises our control over data locality and reuse, however, and the dynamically determined synchronization patterns required by this mechanism are difficult to
express with the language constructs available to us. Besides, dynamic tasking at such a large granularity is by no means guaranteed to provide load balancing. Since we were able to find a load-balanced static scheduling method that works well for this application, we do not consider the dynamic scheduling—and therefore vertical partitioning—option further in this paper.

Horizontal Partitioning A more flexible partitioning mechanism exists that relies only on the regularity among subtasks of a single type of outermost-level task. We call this horizontal partitioning: All processors do the same amount of every type of work in every phase. To maximize load balancing (ignoring data locality for the moment), horizontal partitioning has two requirements: Every set of independent tasks of the same type should afford enough parallelism to utilize all processors, and equal-sized subtasks of the same type should take the same amount of time to execute. Other than the equation solver, the different types of tasks in the application are non-trivial near-neighbour calculations (the Laplacians and Jacobians), and various initializations and updates which clearly satisfy our criteria. In both the Laplacian and Jacobian calculations, the matrix written is different from the one(s) read. The computation and data referencing patterns for all non-boundary grid elements are identical and independent (boundary elements are simply set to zero). The regularity of referencing patterns extends to the equation solver as well, and allows us to use computational uniformity as our criterion for load-balanced partitioning (as long as we maintain a regularity of inter-partition communication patterns as well). A load-balanced partition, then, is one in which all processors compute equal numbers of internal elements and equal numbers of boundary elements. Creating such partitions is simplified by a restriction we impose on the program inputs: The internal dimension of the grid must be an integral multiple of the number of processors.
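A sketch (in C, names invented) of the static horizontal partition this restriction makes trivial: with the internal dimension an integral multiple of the number of processors, each process gets an equal, contiguous band of internal columns, and the load is balanced by construction.

```c
/* Illustrative static partition: internal columns 1..n_internal of a grid
 * are split into equal contiguous bands, one per process.  The program
 * restricts n_internal to be a multiple of nprocs, so every band has the
 * same width. */
void my_columns(int n_internal, int nprocs, int pid,
                int *first_col, int *last_col)
{
    int width = n_internal / nprocs;        /* exact by assumption        */
    *first_col = 1 + pid * width;           /* column 0 is a boundary     */
    *last_col  = *first_col + width - 1;    /* inclusive last column      */
}
```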
5.2 Scheduling and Data Locality
We use a horizontal partitioning scheme in this work, having all processors collaborate on every outermost-level (grid) task. Taking advantage of the global address space provided by the shared memory paradigm, the simplest way to distribute grid points among processors is through a dynamic, self-scheduling loop. This, however, compromises data locality significantly. Fortunately, the regular and input-independent nature of the referencing patterns allows static scheduling for data locality both within and across grid tasks, without the cost in load balancing that this might exact in less regular applications. A domain-decomposition based partitioning and scheduling method is both natural to the physical problem and well-suited to exploiting locality at many levels of a multiprocessor's memory hierarchy. Every processor is permanently associated with an equal subdomain of the ocean cross-section (excluding boundary elements, which provide the only tradeoff between data locality and load balancing in the application). The localized nature of the individual computations ensures that a task will access only grid points within its subdomain or their immediate neighbours, in all grids that it uses. Having every subdomain be contiguous yields intra-task locality in the near-neighbour computations. This spatial decomposition and conceptualization of the program can exploit the following kinds of locality in a multiprocessor:
Cache Locality: About thirty grid tasks are performed every time-step, each of these accessing a small number of the roughly twenty grids in the program. Every grid task sweeps through the entire ocean cross-section. The types of cache locality available therefore depend on the size of the caches relative to the problem size. If the caches in a machine are large enough relative to the data associated with a subdomain, they might yield data reuse across tasks that access the same grid. In this case, tasks can be temporally ordered (within dependence constraints) to enhance such reuse. For caches that aren’t large enough, a limited amount of blocking is obtainable, both within near-neighbour tasks as well as across tasks by complicated programming. Both these forms of blocking yield at most a small constant factor improvement in miss rates, and are essentially uniprocessor optimizations. The predictability of the referencing patterns ought to allow effective prefetching of data into the cache, if the necessary bandwidth is available. Other than these, the intra-task near-neighbour reuse is the only benefit that small caches provide in this application. Of course, all these benefits of caching can be disrupted by cache mapping collisions, which are particularly difficult for a programmer to predict or control in a program that accesses so many distinct grids in different combinations.
Memory Locality: The domain decomposition view of the problem can also exploit locality at the main memory level of a machine with physically distributed memory modules: If the machine's caches cannot hold their partitions of the entire data set—owing to either limited size or cache mapping collisions—data associated with a subdomain should be allocated on the memory module closest to the processor responsible for that subdomain. The locality and regularity of the access patterns ensure that data need not be dynamically redistributed after this initial distribution.

Network Locality: The near-neighbour communication patterns allow convenient exploitation of geographic locality in an interconnection network with nonuniform distances between processors (for example, a mesh or hypercube).
Such a domain decomposition viewpoint appears both essential for scalable performance as well as natural to many scientific applications that model natural phenomena. It results either from knowledge of the physical problem or from an understanding of the ways in which the program references its high-level data structures. Excluding synchronization variables and global sums, the only data that remain actively shared among processors in this application are the elements at interpartition boundaries of grids that are accessed in near-neighbour fashion. This has positive implications for message-passing machines as well as directory-based cache coherence protocols [10], since almost all the sharing is among only a small number of processors. While the computation per processor is proportional to the area of its subdomain, communication is proportional to only the perimeter, so the computation-to-communication ratio can be arbitrarily increased by increasing the problem size. The relative overhead of synchronization and global accumulations is thereby reduced as well, since their absolute overhead is independent of the problem size and since these events occur only a small constant number of times every time-step. While the collocation of tasks and their associated data for memory locality is conceptually quite simple in this application, the shared memory paradigm poses an interesting tradeoff in this regard. A shared address space allows every grid to be declared as a single two-dimensional array, just as it would be in a sequential program. This eases the task of indexing across subdomains in the near-neighbour computations, but can complicate the allocation of data on appropriate memory modules. For example, a convenient way to minimize the number of grid points that are actively shared by processors is to have the subdomains be as close to square as possible (we, however, use sets of adjacent columns as our subdomains for historical reasons that will become clear later). Such a square subdomain is not laid out contiguously in the memory allocated for a two-dimensional array, and hence does not necessarily map well onto the fixed size pages that are the units of physical memory allocation on a machine. To maximize performance, each subgrid would have to be declared as a separate array and allocated in the associated processor, just as would be done on a message passing machine. Otherwise, the two-dimensional grids can be converted to four-dimensional ones, with more complicated indexing (these techniques would also ameliorate the false sharing of large cache lines). Especially when memory latencies for remote references are a performance bottleneck, as they are likely to be on high performance multiprocessors being designed today, the programming convenience of the shared memory paradigm may have to be sacrificed in the interest of performance in such instances. Shared memory, however, enjoys the advantage of not making such attention to locality a correctness issue, allowing it to be ignored when it is unnecessary or inconvenient. In concluding our discussion of data locality outside the equation solver, we might point out that the data usage information required by a compile-time or run-time system to automatically schedule tasks for data locality is very similar to the summarized information needed in our dependence analysis at a given level. In this case, it is simply information about which subgrids are read or written by a task. 
Program conceptualization and specification in terms of appropriately defined data objects therefore appear promising for both parallelization and data locality. Although all processors end up collaborating on every grid-task in our domain-decomposition based implementation, the outermost-level, global dependence analysis that led to Figure 5 remains important. Discovering potentially parallel tasks allows us to increase the granularity of computation between synchronization points—rather than having processors synchronize after every task—thus reducing synchronization overhead and perhaps improving load balancing by averaging out idle times among tasks. This effect becomes particularly significant on large-scale multiprocessors if the number of processors increases relative to the problem size, tending to reduce granularity and simultaneously increase the cost of global synchronization. We will see later that some steps within the original equation solver do not offer enough parallelism to keep all processors busy
all the time; if high-level dependence analysis identifies other tasks that can be done in parallel with these steps, the idle processors can work on those tasks instead. An understanding of high-level dependences also defines the constraints within which tasks can be reordered with respect to one another to enhance the reuse of cached data. Finally, such analysis allows us to reduce communication and synchronization by using more vertical forms of partitioning when efficient performance calls for them (see, for example, Section 6.1).
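To make the layout tradeoff raised in Section 5.2 concrete, the sketch below (C, invented names and sizes) shows the "four-dimensional" alternative to declaring each grid as a single shared two-dimensional array: each processor's subblock is stored contiguously, so it maps cleanly onto pages and onto a local memory module, at the cost of more complicated indexing.

```c
#define NB 4      /* processors per grid dimension (illustrative)      */
#define B  32     /* subblock edge length; the grid is (NB*B) squared  */

/* grid[pi][pj] is the contiguous subblock owned by processor (pi, pj). */
static double grid[NB][NB][B][B];

/* Access logical element (i, j) of the NB*B-by-NB*B grid. */
static inline double *elem(int i, int j)
{
    return &grid[i / B][j / B][i % B][j % B];
}
```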
5.3 Synchronization
Synchronization has two purposes in this application: enforcing mutually exclusive access to shared variables, and ensuring that the order of operations with true dependences between them is preserved. Mutual exclusion is primarily required once every time-step, when processes accumulate their private sums into a shared sum in computing the integral of a function over its domain. We simulate the required fetch-and-add operation with locks.

The type of synchronization used to preserve dependences is determined by the granularity of dependence analysis. Loop-level analysis would insert barriers between successive grid tasks, for want of information about their interrelationships. Our outermost analysis with grid tasks as the basic units allows us to reduce the number of barriers by inserting them only between phases in Figure 5. Barriers are simple from the conceptual and ease-of-programming points of view. For a load-balanced application such as this with a number of synchronizations that is independent of problem size, they do not hurt performance much if the problem size is large and the number of processors small. However, by analyzing dependences at the level of subgrids and extending the analysis across grid-task boundaries, we find that they can almost always be either completely eliminated through static scheduling or replaced by more specific synchronization between only those processors that work on adjacent subdomains. The latter allows synchronization variables to also be shared by only neighbouring processors, is less contentious and therefore cheaper than barrier synchronization (particularly when the number of processors is large), and permits us to pipeline the execution of tasks pronounced dependent by the outermost analysis. Recall that the most natural synchronization, preserving dependences between individual matrix elements, is unnecessarily expensive in both programming effort and execution overhead. Choosing levels of analysis appropriately, and sometimes extending the analysis across task boundaries at the next outer level, is therefore important.

There is one situation every time-step that calls for synchronization among all processors to preserve a dependence. This is when a globally accumulated value is subsequently used by all processors. If the use immediately follows the accumulation and the load up to this point is well balanced, the convenient barrier implementation might make sense. For less balanced loads, if other useful work can be done between the accumulation and use, more lazy forms of synchronization may be useful. The original equation solver has a more complicated synchronization structure, which is discussed in Section 6.1.1.
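A sketch of the simulated fetch-and-add described above, using POSIX threads rather than the PARMACS macros of the actual program: each process accumulates a private partial sum and folds it into the shared total under a lock, once per time-step.

```c
#include <pthread.h>

static double global_sum = 0.0;
static pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;

/* Illustrative only: fold a process's private partial sum (e.g. its part
 * of the integral over its subdomain) into the shared sum.  This stands
 * in for the fetch-and-add operation simulated with locks in the text. */
void add_partial_sum(double private_sum)
{
    pthread_mutex_lock(&sum_lock);
    global_sum += private_sum;
    pthread_mutex_unlock(&sum_lock);
}
```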
6 Solving the Equations

So far, we have treated the equation solver as a single task in our outermost dependence analysis. In this section, we describe the internal parallelization of this task, by far the most time-consuming part of our effort (that it took an entirely different algorithm to obtain good parallel performance is therefore ironic). As mentioned earlier, our first approach was to learn what we could about the structure of the algorithm from the code itself in order to parallelize it. A complicated, interprocedural analysis of data-structure usage, high-level dependences, and patterns in the dynamic control flow led us to a parallel version with some clear limitations. A higher-level understanding of the direct algorithm used allowed us to push the parallelization further, but also led us to discover an inherently serial section in the part of the algorithm that we were having trouble with; that is, we found that we could not fully overcome the limitations we had identified while still retaining the sequential algorithm. Finally, we moved to a completely different, iterative algorithm that was much more efficiently parallelizable.
6.1 The Direct Method in the Original Program
The sequential algorithm maps the finite difference approximation to our generic elliptic equation (1) onto a block tridiagonal system of equations, and solves this system using a generalization of the Buneman variant of cyclic odd-even reduction. The algorithm is quite complicated, and is described in detail in [12, 11]; for our purposes, we only treat it skeletally to describe our parallelization. A more detailed description of the parallelization can be found in [9]. We can think of every column of the grid we are solving for as representing an unknown in the block tridiagonal system of equations. The bulk of the actual solution can be divided into three phases: the reduction phase, in which the original system is reduced to a single equation by eliminating half the current unknowns in every successive step; the solution phase, in which that equation is solved by working on a single column; and the back substitution phase, in which the number of columns worked on is doubled in successive steps to yield solutions for all the unknowns. The work done on a column in a given step of all these phases involves similar types of tasks: some updates and the setting up and solving of a series of successive linear tridiagonal systems, using a non-obvious algorithm involving Chebyshev matrix polynomials. Once we were able to visualize the outermost structure of a step in the different phases, a high-level parallelization mechanism was not difficult to see. By analyzing some complicated in-place data referencing patterns on a grid (the ability to visualize which would have been useful), we found that the work done for different columns in a given step is independent, and that these columns can therefore be worked on in parallel. We also found that the computations for all columns (except the last) in a given step are identical, and that there are some serializing dependences within each such computation. Our limited knowledge of the work within a column led us to initially treat it as an unparallelizable unit and exploit only the parallelism across columns. While simple in terms of data locality when sets of adjacent columns are used as subdomains, this method of parallelization—called no shared-column partitioning—is clearly not load balanced. There are steps—for example, the last few in the reduction phase, the single step in the solution phase, and the first few in the back substitution phase—in which the processors outnumber or do not exactly divide the columns to be worked on. An additional, less severe, load-imbalance is contributed by the fact that the work done on the last column in a reduction step is different than that done on other columns. Upon understanding the detailed working of the algorithm, we were in a position to implement the obvious improvement to the above method, at some cost in data locality: Have processes share, to the extent possible, the columns left over after equal distribution. We call this shared-column partitioning. Every successive linear tridiagonal system, solved as part of the work for a column, itself involves some updates and a couple of first-order linear recurrences. The setting up of these systems and the updates can be parallelized; the recurrences, however, cannot. 
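The serial bottleneck referred to above is a first-order linear recurrence of the form sketched below (C, invented names): each element depends on the one just computed, so the iterations cannot simply be distributed across processors.

```c
/* Illustrative first-order linear recurrence, x[i] = a[i]*x[i-1] + b[i].
 * The loop-carried dependence on x[i-1] is what makes this portion of the
 * per-column work inherently serial in the original solver. */
void recurrence(int n, const double a[], const double b[], double x[])
{
    for (int i = 1; i < n; i++)
        x[i] = a[i] * x[i - 1] + b[i];
}
```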
Since there are these inherently serial sections within the computation for a column, we would like to have as few processes as possible share a column as long as all of them can be assigned a column to work on (that is, at this level we prefer vertical partitioning to horizontal, see Section 5.1, and columnwise partitioning to square subblocks). The cyclic reduction algorithm clearly cannot be dichotomized into sections that can utilize only one processor and those that are fully parallelizable across all available processors: Parts of several steps utilize only a subset of the processors, the size of this subset varying with the step.

Especially under “shared-column” partitioning, the determination of which data should be private and which shared in the equation solver was one of the most tedious parts of the parallelization—a part that required careful analysis of how data were being used at different times as well as of the run-time values taken by certain control variables and subroutine arguments. Understanding the roles of the latter was also necessary to grasp the fairly complicated algorithm implementation and dependence structure, to restrict the code to the purposes of this application, and to be confident that our parallelization was not misguided. Tools to automate these analyses, perhaps interactively with the user and through controlled execution profiling, would have been very useful.

6.1.1 Synchronization
Every step of the cyclic reduction algorithm both reads and writes the grid being solved for. The communication required among processes is more complicated than the nearest-neighbour interactions we have seen so far. For these reasons, the easiest way to think about synchronization is at the grid level: inserting barriers between
successive steps in all three phases of the algorithm. Under “no shared-column” partitioning, the number of processors that need to synchronize decreases in the last few reduction steps and increases in the first few back substitution steps. A barrier mechanism that allows processes to specify their removal from or addition to the number of waiters is therefore a useful language feature. Dependence analysis at the next finer (column) level, however, shows that barriers are unnecessary in this case as well (although they are what we use in our programs). The key to more precise synchronization at the column level is the regularity of the algorithm structure. In a given step of the reduction phase (for example), the work for a column depends on (reads) only a few other columns, at regular offsets from it. These offsets start at zero or one and are essentially doubled every step; they are therefore fully determined by the grid size and the step number. Half the columns of the grid are active (written) in the first step of this phase. In every successive step, half the currently active columns are eliminated and the others remain active for the next step. Given its position and the grid dimension, a column can easily determine at run time the step in which it is eliminated. Before and during its elimination step, it is not read by any other column. Once it is eliminated, it is not modified again in the reduction phase; instead, other columns begin to read it in subsequent steps. Under “no shared-column” partitioning and static scheduling, a single processor is associated with a column. A column is therefore accessed by only that processor until a certain point (its elimination) and is then shared in read-only mode until the end of the reduction phase. The synchronization this requires can be easily expressed with only one flag per column. A column sets its flag when it is eliminated (or at the beginning if it is not active in the first step). In any step, an active column to be worked on waits for the flags to be set of only those columns that it depends on (the distances to these columns being easily expressed in the program). Back substitution steps can be similarly pipelined, and processor utilization thereby increased. The dependence structure of this algorithm has been quite naturally expressed in a concurrent object-oriented language with futures [2]. “Shared-column” partitioning complicates the situation, requiring separate but simultaneous synchronization among disjoint subsets of processors.
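A sketch of the one-flag-per-column scheme described above, using C11 atomics in place of the paper's language constructs; the column bound and the choice of dependency columns are placeholders, since the actual offsets are determined by the grid size and the step number.

```c
#include <stdatomic.h>

#define MAX_COLS 1024   /* illustrative bound on the number of columns */

/* One flag per column: set (released) once the column is eliminated, or
 * immediately if the column is not active in the first reduction step. */
static atomic_int eliminated[MAX_COLS];

void mark_eliminated(int col)
{
    atomic_store_explicit(&eliminated[col], 1, memory_order_release);
}

/* Before working on an active column in some reduction step, wait for a
 * column it reads (at a step-dependent offset) to have been eliminated. */
void wait_for_column(int col)
{
    while (!atomic_load_explicit(&eliminated[col], memory_order_acquire))
        ;   /* spin; the program as written uses barriers instead */
}
```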
6.2
The Iterative Method
Parallelism in the direct algorithm is clearly limited by the serial section in the work for a column. Fortunately, inexact algorithms that are far more amenable to parallelism exist to solve our elliptic equations: These are the methods of iteration to convergence. The accuracy of these algorithms (as well as the number of iterations they require for a given grid and hence their execution time) depends on the error tolerance used for convergence, among other factors. An iterative algorithm is therefore justified as a replacement for a direct one only if the tolerance used is fine enough to provide comparably useful results. The results we are really interested in are those of the whole problem, not just the solver itself. The limited resolution of the grid discretization has already introduced an approximation to the continuous problem, even when a direct solver is used, so that there is a limit on how small the iterative tolerance needs to be. If the grid resolution is made finer for the same ocean basin, the number of iterations to convergence increases as well, an effect not found in a direct method. However, the execution time per iteration scales more slowly with the increased number of grid points than does the execution time of the best known direct methods, making iterative methods potentially faster even on a uniprocessor as larger grids are used. Among the “classical” iterative methods are the Jacobi and SOR Gauss-Seidel iterations that we use. Others, such as the conjugate gradient method, could have been chosen as well. The latter might provide faster convergence under certain conditions, and are also useful if appropriate values for the parameters needed by some classical methods (such as SOR) are difficult to ascertain [5]. In general, finding a parallel iterative algorithm that converges quickly for a given problem, domain size and accuracy is a difficult task. However, the coefficient matrix for our equation and grid resolution is very well conditioned, causing SOR iterations to perform very well for all the grid sizes we use. Recall that we do not alter the grid resolution when simulating grids of different sizes. If the resolution were made much finer, the diminishing convergence rate of SOR might justify a move to more sophisticated iterative techniques.
6.2.1
The Classical Iterations
Using second-order finite differencing, the Helmholtz equation (1) yields the following iterative step:
\psi_{i,j}^{(t+1)} = \frac{\psi_{i-1,j}^{(t)} + \psi_{i+1,j}^{(t)} + \psi_{i,j-1}^{(t)} + \psi_{i,j+1}^{(t)} - D^2\,\gamma_{i,j}}{4 - D^2\lambda^2} \,,    (2)

where D is the physical distance between adjacent points on the grid, and t represents the iteration number. If a grid initialized with some arbitrary values (we use the values currently in the right-hand-side grid \gamma) has the
near-neighbour computation represented by Eq. (2) repeatedly performed on every element, the elements should ultimately stabilize to yield the solution to Eq. (1). We use the following two classical iterations.
Jacobi Iteration: The grids read and written in a given Jacobi iteration are different; the computation in every iteration is therefore exactly that in Eq. (2). The method is implemented using two grids, which alternately represent \psi^{(t)} and \psi^{(t+1)} in successive iterations. If processes synchronize between iterations (as in our implementation), parallelism does not alter the order of accesses or the numerical results.
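For concreteness, one sequential Jacobi sweep over the grid could look like the C sketch below. The function and array names, the grid size, and the use of the maximum pointwise change as a convergence measure are our illustrative choices, not taken from the original program; the stencil and denominator follow Eq. (2).

```c
#include <math.h>

#define N 98                        /* interior points per dimension (example size) */

/* One Jacobi sweep for the five-point stencil of Eq. (2): read "src", write
 * "dst".  Returns the largest pointwise change, which a driver can compare
 * against the convergence tolerance.  The grids are (N+2)x(N+2) with fixed
 * boundary rows and columns; lambda2 stands for lambda^2, D is the grid
 * spacing, and gam is the right-hand-side grid of Eq. (1). */
double jacobi_sweep(double dst[N + 2][N + 2],
                    const double src[N + 2][N + 2],
                    const double gam[N + 2][N + 2],
                    double D, double lambda2)
{
    const double denom = 4.0 - D * D * lambda2;
    double maxdiff = 0.0;

    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++) {
            dst[i][j] = (src[i - 1][j] + src[i + 1][j] +
                         src[i][j - 1] + src[i][j + 1] -
                         D * D * gam[i][j]) / denom;
            double d = fabs(dst[i][j] - src[i][j]);
            if (d > maxdiff) maxdiff = d;
        }
    return maxdiff;
}
```

A driver loop would call this with the two grids swapped on alternate iterations, stopping once the returned value falls below the chosen tolerance.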
Gauss-Seidel Iteration with Successive Over-Relaxation: Gauss-Seidel iteration is performed in place; that is, an element uses the updated values of those elements that have already been computed in the current iteration, and only a single grid is used. Successive over-relaxation (SOR) is a technique used to make the Gauss-Seidel method converge in fewer iterations. If the grid is traversed so that elements in column i are computed top-to-bottom before work is begun on column i + 1, then the equation representing an iteration is

\psi_{i,j}^{(t+1)} = \omega \, \frac{\psi_{i-1,j}^{(t+1)} + \psi_{i+1,j}^{(t)} + \psi_{i,j-1}^{(t+1)} + \psi_{i,j+1}^{(t)} - D^2\,\gamma_{i,j}}{4 - D^2\lambda^2} \;+\; (1 - \omega)\,\psi_{i,j}^{(t)} \,,    (3)

where \omega is the SOR parameter (\omega = 1 reduces SOR to ordinary Gauss-Seidel iteration). If the exact order
of computation in Eq. (3) is preserved, the only parallelism available is within diagonal wavefronts as they advance across the grid. Methods such as red-black ordering have therefore been proposed to yield fully parallel SOR updates. However, all these methods violate the ordering in Eq. (3) to varying extents. Since the goal is convergence, not preserving a particular ordering, we can justify using the following parallel iterations: In any iteration, all processors start to work on their grid partitions at the same time, ordering their computation as in Eq. (3) but without regard for updates in other partitions. (This method is also mentioned by Fox et al. in [3].) Now, the relative timing of the processes within an iteration determines whether an inter-partition boundary element uses old or updated near-neighbour values, and the result is no longer statically predictable. This may make the number of iterations to convergence vary with the number of processors used (or even with the particular run of the program), especially when the size of each partition is small; however, we found it to have a negligible impact for our equations as long as every partition has more than one column. The barrier synchronization and test for global convergence we use after every iteration in both methods can be replaced by allowing a process to drop out of the solver once its own partition has converged. Data locality issues in the iterative solvers are very similar to those in the rest of the application; the differences are that communication at subdomain boundaries is more frequent, and a single grid or pair of grids is accessed repeatedly.
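The parallel iteration just described can be rendered, for illustration, as the following C/OpenMP sketch (the original program is FORTRAN and uses a different synchronization library). Each thread sweeps its own block of adjacent columns in the order of Eq. (3) without waiting for its neighbours within an iteration, so reads at partition boundaries may see either old or updated values, which is exactly the behaviour discussed above; all threads then pass a barrier and test for global convergence. Names and the convergence measure are ours.

```c
#include <math.h>
#include <omp.h>

#define N 98                        /* interior points per dimension (example size) */

static double psi[N + 2][N + 2];    /* psi[i][j]: column i, row j; updated in place */
static double gam[N + 2][N + 2];    /* right-hand-side grid of Eq. (1)              */

/* Parallel SOR: each thread owns a block of adjacent columns, sweeps it in
 * the order of Eq. (3) (column by column, top to bottom within a column),
 * and does not wait for updates in neighbouring partitions within an
 * iteration.  Boundary-column reads may therefore see old or new values. */
void sor_solve(double D, double lambda2, double omega, double tol)
{
    const double denom = 4.0 - D * D * lambda2;
    double maxdiff;

    do {
        maxdiff = 0.0;
        #pragma omp parallel reduction(max : maxdiff)
        {
            int nth = omp_get_num_threads();
            int id  = omp_get_thread_num();
            int per = (N + nth - 1) / nth;          /* columns per partition */
            int i0  = 1 + id * per;
            int i1  = i0 + per - 1 > N ? N : i0 + per - 1;

            for (int i = i0; i <= i1; i++)          /* columns of this block */
                for (int j = 1; j <= N; j++) {      /* top to bottom         */
                    double gs = (psi[i - 1][j] + psi[i + 1][j] +
                                 psi[i][j - 1] + psi[i][j + 1] -
                                 D * D * gam[i][j]) / denom;
                    double upd = omega * gs + (1.0 - omega) * psi[i][j];
                    double d = fabs(upd - psi[i][j]);
                    if (d > maxdiff) maxdiff = d;
                    psi[i][j] = upd;
                }
        }   /* implicit barrier: every partition has finished this iteration */
    } while (maxdiff > tol);                        /* global convergence test */
}
```

The implicit barrier at the end of the parallel region plays the role of the per-iteration synchronization mentioned in the text; replacing the global test with a per-partition convergence test, as also suggested above, would allow a thread to drop out of the solver early.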
7 Speedup and Timing Results
Four levels of parallelization have been discussed in this paper: loop-level, algorithm, global and problem. Since the impact of global parallelization is expected to be significant for this application only when we go to much larger numbers of processors, we combine it with both algorithm and problem parallelization to yield the following three implementations: simple loop-level parallelization of the sequential code, high-level analysis and restructuring retaining the direct method of solving the elliptic equations, and similar restructuring and parallelization using the iterative method for the equation solutions. We implement both extents of parallelizing the direct method that were described in Section 6.1. Since the direct method is column-oriented, and since minimizing the number of interpartition boundary elements by using square subdomains does not make much difference to the performance of the rest of the application on our Multimax, we use a decomposition into blocks of adjacent columns in all our programs. Performance results for loop-level parallelization by a commercial compiler were presented in Section 4.1. In this section, we present results obtained using the other two approaches.
Two kinds of speedups are presented for each parallel program: normalized or true speedups, measured over the execution of the original sequential program (with the direct solver) on a single processor of the Multimax, and self-relative speedups, measured with respect to an execution of the parallel program using a single processor. Since the goal of parallelization is to obtain improved performance over that delivered by the sequential program, we believe that normalized speedup is ultimately the more meaningful metric. In all cases, results are reported for three different grid sizes: 14-by-14, 50-by-50 and 98-by-98. The smallest of these is chosen to yield one-column partitions when the largest number of processors (twelve) is used. The results were obtained when no other user processes were running on the machine. Although our Multimax has 16 processors, we restrict our measurements to 12 to avoid interference from system processes while retaining simplicity of grid partitioning for all problem sizes used. In measuring execution times, we omit initialization, process creation, input-output, and the first time-step with its cold-start cache misses. This omission is especially valid because we are simulating only 6 time-steps rather than 4500, so that the impact of the omitted portion is far greater in our measurements than in the actual use of the application.
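In symbols, the two speedup metrics defined above are (notation ours):

S_{\mathrm{norm}}(p) = \frac{T_{\mathrm{seq}}}{T_{\mathrm{par}}(p)} \,, \qquad S_{\mathrm{self}}(p) = \frac{T_{\mathrm{par}}(1)}{T_{\mathrm{par}}(p)} \,,

where T_{\mathrm{seq}} is the execution time of the original sequential program (with the direct solver) on one processor of the Multimax, and T_{\mathrm{par}}(p) is the execution time of the parallel program in question on p processors.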
7.1
Retaining the Direct Method to Solve the Equations
Table I shows the speedups obtained on the application, using the direct equation solver with both extents of parallelization. Because we removed many conditionally executed statements that provided generality in the code (particularly in the equation solver) when rewriting it for parallelism, the one-processor runs of the parallel programs are a little (about 5%, on average) faster than a run of the sequential program, despite the run-time overhead of the code introduced for parallel control flow. Normalized speedup results for the smallest and largest grid sizes are also plotted in Figure 6, which presents normalized speedups for all the parallel schemes that we implemented. Analogous graphs for self-relative speedups are shown in Figure 7. Keep in mind that all results are for the entire application using the solver in question, not just for the solver itself. As shown in Figure 3, for example, the direct solver takes up about 30 percent of the execution time on a uniprocessor for all the problem sizes we use. For our grid sizes, speedups for the rest of the application other than the solver improve as the grid size increases; that is, as the granularities of the partitions are increased, the relative synchronization overhead is reduced, and the proportion of inter-partition boundary elements (and hence communication) is decreased. In general, the predictability of this trend might be affected by the use of finite, direct-mapped caches, particularly since the application uses many distinct grids in different combinations (see Section 5.2). Speedups for the direct solver are also better for larger grids: The larger the grid relative to the number of processors, the greater the proportion of columns in the reduction and back substitution phases that are worked on in non-shared mode, in full parallelism with other columns, and the greater the speedup. However, the figures clearly reveal the non-scalability of parallelism owing to the serial sections in the work for a column. An interesting result is the difference between using “no shared-column” and “shared-column” partitioning in the direct solver, that is, between the parallelization that allows column-sharing when needed and the one that doesn’t. For all the extra understanding and effort that went into the former, this difference is small: The improvements allowed by column-sharing are overshadowed by the inherently serial sections that remain. Since the improvements are in those situations in which columns call for sharing, it is expected that the difference will be greater when the proportion of such columns is larger, i.e. as the number of processors increases for a fixed-size grid. This effect, borne out in Figures 6 and 7, may change on machines on which data locality is more significant.
7.2
Using the Iterative Method to Solve the Equations
In Section 6.2, we described the two iterative methods we use to solve our elliptic equations; the versions of the whole program that incorporate these methods are accordingly called Jacobi and SOR here. We use two tolerances for convergence in each case: The tolerance of 10^-7 produces a numerical answer for this problem that is very close to that obtained using a direct solver; the coarser 10^-5 is used mainly for comparison.
Num.     14-by-14 grid    26-by-26 grid    50-by-50 grid    98-by-98 grid
procs.   N-Sp.   S-Sp.    N-Sp.   S-Sp.    N-Sp.   S-Sp.    N-Sp.   S-Sp.

No Shared Column Partitioning
 1       1.03    1.00     1.05    1.00     1.05    1.00     1.05    1.00
 2       1.76    1.70     1.88    1.79     1.90    1.81     1.90    1.82
 4       2.87    2.77     3.09    2.94     3.16    3.01     3.25    3.10
 6       3.54    3.42     3.90    3.72     4.21    4.01     4.24    4.05
 8       —       —        4.45    4.24     4.71    4.48     4.92    4.70
12       3.96    3.83     5.03    4.79     5.65    5.38     5.94    5.67

Shared Column Partitioning
 1       1.03    1.00     1.06    1.00     1.06    1.00     1.06    1.00
 2       1.81    1.77     1.93    1.83     1.92    1.82     1.97    1.86
 4       2.83    2.76     3.22    3.05     3.31    3.13     3.36    3.19
 6       3.59    3.49     4.12    3.90     4.26    4.03     4.47    4.23
 8       —       —        —       —        —       —        —       —
12       3.82    3.72     5.21    4.94     5.89    5.58     6.24    5.91

Table I: Normalized and Self-Relative Speedups Obtained using the Direct Methods

A discussion of the number of iterations taken by the two schemes to converge can be found in Appendix B. Here, we focus on performance. Speedups for the application using both iterative solvers and both tolerances are shown in Table II and in Figures 6 and 7. Finally, to show that our findings can be extended to finer tolerances as well, we present results for a tolerance of 10^-9 in Appendix C.
[Figure 6: Normalized Computational Speedup versus Number of Processors. Two panels (smallest and largest grids); x-axis: Number of Processors, y-axis: Speedup. Curves: Direct: No Shared Column; Direct: Shared Column; Iterative: SOR, Tol = 1e-7; Iterative: Jacobi, Tol = 1e-7; Iterative: SOR, Tol = 1e-5; Iterative: Jacobi, Tol = 1e-5.]
Several trends are apparent from the table and figures. Let us first consider the performance of the two iterative solvers using a single processor, and then examine the speedups obtained by adding processors.
[Figure 7: Self-relative Computational Speedup versus Number of Processors. Two panels; x-axis: Number of Processors, y-axis: Speedup; an "ideal" (linear) speedup line is included. Curves: Direct: No Shared Column; Direct: Shared Column; Iterative: SOR, Tol = 1e-7; Iterative: Jacobi, Tol = 1e-7; Iterative: SOR, Tol = 1e-5; Iterative: Jacobi, Tol = 1e-5.]
Single Processor Performance
Using a single processor, we find that a tolerance of 10^-7 causes the iterative solvers to be a little slower than the direct one in all cases simulated, while a tolerance of 10^-5 causes the former to almost always be a little faster (Table II). Increasing the grid size tilts the balance substantially toward the iterative solvers. As expected, Jacobi performs substantially worse than SOR when the finer tolerance is used, doing less work per iteration (see Eqs. (2) and (3)) but taking more iterations to converge (see Appendix B). With the coarser tolerance, the two are relatively close in performance for these problem specifications. Results for an even finer tolerance confirm this relationship (see Appendix C).
                    Tolerance = 1e-7                                  Tolerance = 1e-5
Num. of   14-by-14        50-by-50        98-by-98        14-by-14        50-by-50        98-by-98
procs.    N-Sp.   S-Sp.   N-Sp.   S-Sp.   N-Sp.   S-Sp.   N-Sp.   S-Sp.   N-Sp.   S-Sp.   N-Sp.   S-Sp.

SOR
 1        0.80    1.00    0.90    1.00    0.95    1.00    0.97    1.00    1.13    1.00    1.19    1.00
 2        1.58    1.98    1.76    1.95    1.86    1.96    1.88    1.93    2.18    1.94    2.37    2.00
 4        3.02    3.79    3.40    3.79    3.67    3.85    3.52    3.63    4.37    3.88    4.74    4.00
 6        4.24    5.28    4.92    5.48    5.46    5.73    4.88    5.02    6.29    5.58    6.90    5.82
 8        —       —       6.53    7.27    6.94    7.28    —       —       8.64    7.66    9.10    7.66
12        5.54    6.95    9.57    10.66   10.64   11.16   6.69    6.89    12.28   10.89   13.46   11.34

Jacobi
 1        0.67    1.00    0.74    1.00    0.79    1.00    0.95    1.00    1.15    1.00    1.23    1.00
 2        1.27    1.88    1.43    1.88    1.52    1.94    1.82    1.91    2.19    1.91    2.39    1.95
 4        2.43    3.61    2.78    3.71    3.00    3.81    3.35    3.53    4.45    3.87    4.76    3.88
 6        3.40    5.04    4.14    5.48    4.40    5.59    4.75    5.00    6.56    5.71    7.16    5.84
 8        —       —       5.44    7.08    5.95    7.56    —       —       8.73    7.59    9.15    7.47
12        4.97    7.37    7.85    9.83    8.74    11.10   6.51    6.85    12.35   10.74   14.06   11.47

Table II: Normalized and Self-Relative Speedups Obtained using the Iterative Methods
The Effects of Parallelism
The iterative solvers exhibit more interprocessor communication than the rest of the application. The memory referencing patterns of the SOR and Jacobi solvers are different, which affects both uniprocessor and self-relative parallel performance. In terms of cache usage, the disadvantage of Jacobi iterations is that they use two grids for every iteration as opposed to the single grid used by SOR, thus increasing cache space requirements as well as the likelihood of mapping collisions; the advantage is that they suffer less read-write interference among processors at inter-partition boundaries. As long as the partitions are small enough for a processor to fit its share of two matrices into its cache, and we can ignore the effects of cache mapping collisions, we would expect Jacobi iterations to exhibit better self-relative speedups than SOR. However, mapping collisions can have a significant effect, and other factors, such as variations in the numbers of SOR iterations to convergence and simply quirks in particular runs of the program, come into play as well. In this case, these effects clearly dominate and no consistent trend is observed. Since the differences in self-relative speedup between Jacobi and SOR aren’t significant, the trends observed for uniprocessor performance carry over to normalized multiprocessor speedups (Table II): SOR is preferable to Jacobi under the finer tolerance; with the coarser tolerance, that order of preference is preserved for the smallest grid but reversed for the larger two.
8 Summary and Conclusions
In summary, we found the following observations to hold for this application. Simple loop-level compiler parallelization did not get us far in producing an efficient parallel program, even for the small numbers of processors we used. Algorithm parallelization, particularly of the equation solver, required very sophisticated analysis and was the most time-consuming part of our effort. The resulting explicitly parallel programs produced significantly better speedups than those yielded by the compiler. However, an inherently serial part of the algorithm restricted its parallel efficiency, and we had to move to a completely different, iterative algorithm to obtain truly effective and potentially scalable performance.

Parallelism, Dependence Analysis and Synchronization
The application affords parallelism at a hierarchy of levels or granularities. Correspondingly, a successively refined, hierarchical dependence analysis was found to be useful. The most difficult part of this process was identifying the tasks or units of analysis at every level (particularly because conventional sequential programs are not written with data dependence analysis or parallelization in mind), followed by summarizing the data usage information for each task. The former, at least, requires some user input, the extent of effort ranging from simple visual inspection of the code to higher-level algorithmic understanding. The actual dependence analysis among the summarized tasks was interprocedural but otherwise quite easy. Extending the analysis at a given level across tasks at the next outer level allowed us to distinguish the synchronization required by the application from that imposed by a lack of information from different levels of the hierarchy, enabling us to increase granularity, minimize synchronization overhead, and pipeline the execution of tasks when necessary. These effects are expected to become especially significant when larger numbers of processors are used. Synchronization among subsets of processes, as well as mechanisms to join or drop out of a synchronization group, were found to be useful. We also found that the original equation solver could not be dichotomized into parts that are highly parallel and parts that can only utilize a single processor.

Data Locality
A domain-decomposition based parallelization was found to be most appropriate from the points of view of both load balancing and data locality. It requires knowledge of the physical problem, or at least an understanding of how the application references its high-level data structures. It is the same approach that would naturally be taken on a message-passing machine, and appears essential for scalable performance. The inherent locality in this viewpoint, together with the regular and predictable data referencing of the application, allows the exploitation of a machine's memory hierarchy at several levels without trading off load balancing (except in the treatment of boundary values). The only problematic issues are cache mapping collisions among different grids, and appropriate data distribution across physical memory modules in units of fixed-size pages. Almost all the sharing of data is at the boundaries between adjacent subdomains, and the computation-to-communication ratio can be arbitrarily magnified by increasing the problem size. Program conceptualization and specification in terms of data objects was found to be very promising for both parallelization and automatic scheduling for data locality. The summaries of data usage required for the two purposes were also found to be very similar.

Our experience with this application points toward the importance of parallelizing at a high level and therefore indicates that the parallel programmer may need to be familiar with the algorithms and problem domain. Hopefully, experiences such as this one will help guide the design of parallel programming environments and enhance our understanding of which aspects of the parallelization process are best automated and which must remain the user's responsibility. Certainly, changing the algorithms used is beyond a traditional compiler's capability. Today's parallelizing compilers are also incapable of the kind of analysis and restructuring required for algorithm parallelization. In our case, this required interprocedural symbolic and dataflow analysis, often assisted by profiling: understanding patterns in the modifications to control variables, analyzing data structure usage (interprocedurally) to determine which data to privatize and which to share, propagating subroutine arguments symbolically, and extracting the control flow for this particular application from the generality afforded by the sequential code. Perhaps high-level interactive programming environments will find a place here. For example, the process of isolating high-level dependences would have been greatly aided by an interactive tool that could extract and analyze use-modification information and dependences for task-level data structures, and by language constructs that could specify tasks at this higher level. A tool to visualize data structure access patterns would also have been very useful, from the viewpoints of both parallelism and data locality, particularly since the information this provides for a small problem can easily be extrapolated to larger problems in a regular application such as this. The role today's parallelizing compiler would play in this application would simply be to handle the low-level bookkeeping involved in implementing the parallel loops in the restructured program.

Future Work
This work can be extended with several interesting experiments to examine scalability on high-performance machines. We would like to incorporate the data locality and synchronization techniques we have discussed, and run the application on machines with much larger numbers of processors to see what transformations are most significant in obtaining scalable performance. It will also be interesting to run the application on shared-memory multiprocessors with very different architectures, as well as on message-passing machines. Understanding the interactions between applications and architectures will be a significant factor in determining how much of the process of parallel programming for high performance can be automated, how effective compiler technology will be, and how soon all this will happen.
Acknowledgements We would like to thank Stephen Klotz for providing us with the sequential program. This work was supported by DARPA under Contract No. N00014-87-K-0828.
References
[1] Amdahl, G.M. Validity of the Single Processor Approach to Achieving Large-scale Computing Capabilities. AFIPS Conference Proceedings, 30 (1967), pp. 483-489.
[2] Chandra, R., Gupta, A., and Hennessy, J.L. COOL: A Language for Parallel Programming. In Gelernter, D., Nicolau, A., and Padua, D. (Eds.), Languages and Compilers for Parallel Computing, MIT Press, Cambridge, MA, 1990, pp. 126-148.
[3] Fox, G., Johnson, M., Lyzenga, G., Otto, S., Salmon, J., and Walker, D. Solving Problems on Concurrent Processors, Vol. 1, Prentice Hall, 1988, Chap. 7.
[4] FX/FORTRAN Programmer's Handbook, Alliant Computer Systems Corporation, Littleton, MA, Jul. 1988.
[5] Golub, G.H., and Van Loan, C.F. Matrix Computations, Second Edition, The Johns Hopkins University Press, 1989, Chap. 10.
[6] Holland, W.R. The Role of Mesoscale Eddies in the General Circulation of the Ocean—Numerical Experiments using a Wind-Driven Quasi-Geostrophic Model. J. Physical Oceanography, 8 (May 1978), pp. 363-392.
[7] Lusk, E.L., and Overbeek, R.A. Use of Monitors in FORTRAN: A Tutorial on the Barrier, Self-scheduling DO-Loop, and Askfor Monitors. Tech. Rep. ANL-84-51, Rev. 1, Argonne National Laboratory, Argonne, IL, Jun. 1987.
[8] Singh, J.P., and Hennessy, J.L. An Empirical Investigation of the Effectiveness and Limitations of Automatic Parallelization. Proc. International Symposium on Shared Memory Multiprocessing, Information Processing Society of Japan, Tokyo, Japan, Apr. 1991, pp. 25-36.
[9] Singh, J.P., and Hennessy, J.L. Parallelizing the Simulation of Ocean Eddy Currents. Tech. Rep. CSL-TR-89-388, Stanford University, Stanford, CA, Aug. 1989.
[10] Censier, L.M., and Feautrier, P. A New Solution to Coherence Problems in Multicache Systems. IEEE Trans. on Computers, C-27, 12 (Dec. 1978), pp. 1112-1118.
[11] Swartztrauber, P., and Sweet, R. Efficient FORTRAN Subprograms for the Solution of Elliptic Equations. NCAR Tech. Note TN/IA-109, National Center for Atmospheric Research, Boulder, CO, Jul. 1975.
[12] Sweet, R. A Cyclic Reduction Method for the Solution of Block Tridiagonal Systems of Equations. SIAM J. Num. Anal., 14, 4 (Sep. 1977), pp. 706-720.
[13] Using the Encore Multimax. Tech. Mem. 65, Rev. 1, Math. and Comp. Sci. Division, Argonne National Laboratory, Argonne, IL, Feb. 1987.
List of Footnotes
1. Mesoscale: having a horizontal extent on the order of 1 to 100 kilometres.
2. Eddy: current running contrary to the main current, as in a whirlpool.
3. A second-order leapfrogging step is used every 50 time-steps to avoid serious time-splitting of the developing solution. Since we simulate only a small number of time-steps, and since the leapfrogging steps are not significantly different from the parallelization point of view, we omit these steps.
4. It may be argued that the processors on the FX/8 are much faster than those on the Multimax, so that it is unfair to compare the speedups presented here with those obtained later in the paper by explicit parallelization on the Multimax. However, we have since measured similar speedups with the same explicit parallelization on the FX/8 as well.
5. This is not strictly true. Some pipelining is possible at the beginning and end of time-steps, but there is no reason to try to exploit this and ignoring it is cleaner, both conceptually and in implementation.
6. When identical independent tasks have to be performed on different grids (for example at the outermost level), intra-task communication may be reduced by splitting the processes among these tasks. This is useful if it does not compromise data locality, as might be the case if the same grids are each partitioned among all processors in other tasks. However, we find that the loss of inter-task data locality is not worth the reduced intra-task communication in this application.
7. Geostrophic: relating to the deflective forces caused by the rotation of the earth.
Appendix A  The Mathematics of the Problem
The application uses a quasi-geostrophic (footnote 7) circulation model of a wind-driven ocean basin. The forcing function is the wind stress from the overlying atmosphere; Coriolis effects and the impacts of bottom friction and lateral friction with the ocean walls (the latter provided by biharmonic damping) are included. The model requires the solution of the following equations (using the current state of the system) at every time-step [6]:
\frac{\partial}{\partial t}\,\nabla^2\psi_1 = J(f + \nabla^2\psi_1,\ \psi_1) - (f_0/H_1)\,w_2 + F_1 + H_1^{-1}\,\mathrm{curl}_z\,\tau    (d)

\frac{\partial}{\partial t}\,\nabla^2\psi_3 = J(f + \nabla^2\psi_3,\ \psi_3) + (f_0/H_3)\,w_2 + F_3    (e)

\frac{\partial}{\partial t}\,(\psi_1 - \psi_3) = J(\psi_1 - \psi_3,\ \psi_2) - (g'/f_0)\,w_2    (f)

where
\nabla^2 and J are the Laplacian and Arakawa Jacobian operators, respectively;
g' is the "reduced gravity" (= g\,\Delta\rho/\rho_0);
f = f_0 + \beta\,(y - L'/2) is the Coriolis parameter;
L, L' are the length and width, respectively, of the ocean basin;
H_1, H_3 are the depths of the two layers of the ocean (total depth H = H_1 + H_3);
\tau is the steady wind stress acting at the upper surface;
\mathrm{curl}_z\,\tau is the vertical component of the wind-stress curl;
\psi_1, \psi_3 are the quasi-geostrophic streamfunctions at the middle of the upper and lower layers, respectively;
\psi_2 is the quasi-geostrophic streamfunction at the interface between the layers;
w_2 is the vertical velocity at the interface; and
F_1, F_3 are the biharmonic lateral friction terms for the two layers.

For numerical solution in this application, the above equations are rewritten in finite-difference form as follows [6]:
Baroclinic Mode:

(\nabla^2 - \lambda^2)\,\psi = \gamma_a \,,    (1)

subject to

\iint \psi \; dx\,dy = 0 \,,

where

\gamma_a = J(f + \nabla^2\psi_1,\ \psi_1) - J(f + \nabla^2\psi_3,\ \psi_3) - \lambda^2 J(\psi_1 - \psi_3,\ \psi_2) + H_1^{-1}\,\mathrm{curl}_z\,\tau + F_1 - F_3 \,,    (2)

and

\psi = \frac{\partial}{\partial t}\,(\psi_1 - \psi_3) \,,    (3)

\psi_2 = \frac{H_1\psi_1 + H_3\psi_3}{H} \,,    (4)

\lambda^2 = \frac{f_0^2\,H}{g'\,H_1 H_3} \,.    (5)

To aid the solution, we let

\psi = \psi_a + C(t)\,\psi_b \,,    (6)

where

(\nabla^2 - \lambda^2)\,\psi_a = \gamma_a \,, \qquad \psi_a = 0 \ \text{on the grid boundary} \,,    (7)

(\nabla^2 - \lambda^2)\,\psi_b = 0 \,, \qquad \psi_b = 1 \ \text{on the grid boundary} \,,    (8)

and

C(t) = -\,\frac{\iint \psi_a \; dx\,dy}{\iint \psi_b \; dx\,dy} \,.    (9)
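Equation (9) is simply the integral constraint applied to the decomposition (6): since both are linear,

\iint \psi \; dx\,dy = \iint \psi_a \; dx\,dy + C(t)\iint \psi_b \; dx\,dy = 0 \quad\Longrightarrow\quad C(t) = -\,\frac{\iint \psi_a \; dx\,dy}{\iint \psi_b \; dx\,dy} \,.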
Barotropic Mode:

\nabla^2\Phi = \gamma_b \,, \qquad \Phi = 0 \ \text{on the grid boundary} \,,    (10)

where

\Phi = \frac{\partial}{\partial t}\left(\frac{H_1\psi_1 + H_3\psi_3}{H}\right) \,,    (11)

and

\gamma_b = \frac{H_1}{H}\,J(f + \nabla^2\psi_1,\ \psi_1) + \frac{H_3}{H}\,J(f + \nabla^2\psi_3,\ \psi_3) + H^{-1}\,\mathrm{curl}_z\,\tau + \frac{H_1}{H}\,F_1 + \frac{H_3}{H}\,F_3 \,.    (12)

It is these equations, (1)-(12), that are set up and solved every time-step.
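To make the overall structure concrete, the skeleton below sketches how one time-step might be organized around Eqs. (1)-(12). Every identifier is hypothetical (the real program is a FORTRAN code organized differently), the helper routines are only declared, and the time-integration scheme is deliberately left out. Since Eq. (8) for \psi_b does not depend on the model state, \psi_b can be solved once before time-stepping begins, which is presumably the "one-time" equation referred to in Appendix B; \psi_a and \Phi are then the two per-time-step solves.

```c
#define NG 34                       /* example grid dimension only */
typedef double grid[NG][NG];

/* Hypothetical helpers; declarations only, bodies not shown. */
void   setup_gamma_a(grid out, grid psi1, grid psi3);            /* Eq. (2)        */
void   setup_gamma_b(grid out, grid psi1, grid psi3);            /* Eq. (12)       */
void   helmholtz_solve(grid out, grid rhs);                      /* Eqs. (7), (8)  */
void   poisson_solve(grid out, grid rhs);                        /* Eq. (10)       */
double area_integral(grid g);                                    /* double integral over the basin */
void   advance_streamfunctions(grid psi1, grid psi3, grid psi, grid phi);  /* via Eqs. (3), (11) */

void time_step(grid psi1, grid psi3, grid psi_b)  /* psi_b: solved once from Eq. (8) */
{
    static grid gamma_a, gamma_b, psi_a, psi, phi;   /* static only to keep stack use small */

    setup_gamma_a(gamma_a, psi1, psi3);           /* right-hand sides from the current state */
    setup_gamma_b(gamma_b, psi1, psi3);

    helmholtz_solve(psi_a, gamma_a);              /* baroclinic: (nabla^2 - lambda^2) psi_a = gamma_a */
    double C = -area_integral(psi_a) / area_integral(psi_b);      /* Eq. (9) */
    for (int i = 0; i < NG; i++)                  /* Eq. (6): psi = psi_a + C(t) psi_b */
        for (int j = 0; j < NG; j++)
            psi[i][j] = psi_a[i][j] + C * psi_b[i][j];

    poisson_solve(phi, gamma_b);                  /* barotropic: nabla^2 phi = gamma_b */

    /* psi and phi are the time derivatives appearing in Eqs. (3) and (11); the
     * time-integration scheme that uses them to update psi1 and psi3 is omitted. */
    advance_streamfunctions(psi1, psi3, psi, phi);
}
```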
B  Iterations to Convergence
We mentioned in Section 6.2 that the number of Jacobi iterations to convergence is independent of the number of processors used, but the number of SOR iterations might not be. In addition, the number of SOR iterations is strongly dependent on the parameter \omega used in Eq. (3), and that of both types of iterations on the grid resolution and the tolerance used to determine iterative convergence. We use the same grid resolution as we go to larger grids. The problem is very well-conditioned for this resolution, and we do not see an increase in the number of iterations as grid sizes increase within our range. The amount of data required for a complete presentation of the SOR results along all the axes of comparison (tolerance, grid size, number of processors, and value of \omega) is large. In Table III, we present results for two tolerances, three grid sizes and six numbers of processors, but only for that value of \omega (= 1.15) that we found to consistently achieve the quickest convergence. Since we find that the numbers of iterations needed to solve the equations within a time-step are each constant across time-steps in almost all cases, Table III has only three columns, labeled o-t (for the one-time equation), ts1 and ts2 (for the two time-step equations), rather than thirteen for every grid size. The only exception to this finding is in the case when every processor is assigned only a single column (14-by-14 grid with 12 processors), so that all internal elements are at inter-partition boundaries and the order in which adjacent processors access and produce their elements in a given SOR iteration is quite indeterminate. Even in this case, the difference across time-steps or program runs is never greater than a single iteration. Table III also shows the results for Jacobi iterations for the same grid sizes.
                    Tolerance = 1e-7                                      Tolerance = 1e-5
Num. of   14-by-14, eqn:    50-by-50, eqn:    98-by-98, eqn:    14-by-14, eqn:    50-by-50, eqn:    98-by-98, eqn:
procs     o-t  ts1  ts2     o-t  ts1  ts2     o-t  ts1  ts2     o-t  ts1  ts2     o-t  ts1  ts2     o-t  ts1  ts2

SOR Gauss-Seidel
 1        14   9    7       14   8    6       14   7    6       10   5    4       10   4    3       10   4    2
 2        14   8    7       14   8    6       14   7    6       10   5    4       10   4    3       10   4    2
 4        13   8    7       14   8    7       14   7    6       10   5    4       10   4    3       10   4    2
 6        13   8    7       14   8    7       14   7    6        9   5    4       10   4    3       10   4    2
 8        —    —    —       14   8    7       14   7    6        9   5    4       10   4    3       10   4    2
12        15   10   9       13   8    7       14   7    6       11   6    4       10   4    3       10   4    2

Jacobi *  24   15   12      23   14   11      23   13   10      15   7    4       15   5    2       15   4    2

Table III: Number of Iterations to Convergence under the Iterative Schemes
Table III shows that the variation in the number of SOR iterations to convergence with the number of processors used is more marked for smaller grids, which have a greater proportion of inter-partition boundary elements for the same number of processors. In the case of the 98-by-98 grid, for instance, there is no variation across all the numbers of processors (up to 12) used. Our typical finding is that the number of iterations for a given (small) grid size tends to increase slightly with the number of processors (i.e. as we deviate more from true SOR Gauss-Seidel iteration), which contributes to diminishing the efficiency of self-relative speedup as the number of processors grows. This effect, however, is real, and we make no attempt to factor it out of the speedups we present.
C  Performance Results for an Even Finer Tolerance
In this appendix, we present the performance results obtained for the iterative schemes using a tolerance of 10^-9, i.e. a finer tolerance than the ones we used in Section 7, to show that the trends we observed in that section can be carried over to finer tolerances as well.

Table IV: Execution Times, Normalized and Self-Relative Speedups Obtained Using the Iterative Methods with a Finer Tolerance

                                 Tolerance = 1e-9
Num. of   14-by-14                  50-by-50                   98-by-98
procs     Time     N-Sp.   S-Sp.    Time     N-Sp.   S-Sp.     Time      N-Sp.   S-Sp.

SOR
 1        15.11    0.68    1.00     218.04   0.75    1.00      865.90    0.77    1.00
 2         7.77    1.32    1.94     112.96   1.44    1.38      442.31    1.50    1.96
 4         4.00    2.57    3.78      57.48   2.84    3.79      227.87    2.92    3.80
 6         2.86    3.59    5.28      38.84   4.20    5.61      153.88    4.32    5.83
 8        —        —       —         28.98   5.63    7.52      114.11    5.83    7.59
12         2.14    4.82    7.08      19.88   8.21    10.97      77.58    8.57    11.16

Jacobi
 1        19.72    0.52    1.00     299.28   0.55    1.00     1165.42    0.57    1.00
 2        10.59    0.97    1.86     160.05   1.02    1.87      612.22    1.09    1.90
 4         5.50    1.87    3.58      81.00   2.01    3.70      310.02    2.15    3.76
 6         3.89    2.64    5.07      54.11   3.01    5.53      208.86    3.18    5.58
 8        —        —       —         41.77   3.91    7.17      158.89    4.19    7.33
12         2.64    3.89    7.47      28.60   5.70    10.47     107.64    6.18    10.83
Table IV shows the absolute performance and the self-relative speedups obtained with the finer tolerance, and Figure 8 plots the normalized speedups in this case; the extension of the trends can be observed in the data presented. As we would expect, the finer tolerance causes the iterative schemes to do considerably worse than the direct schemes for uniprocessor execution. However, the better self-relative speedups more than compensate for this, and the absolute performance becomes better under the iterative method as the number of processors increases (Figure 8).
[Figure 8: Normalized Computational Speedup versus Number of Processors using a Finer Tolerance. Two panels; x-axis: Number of Processors, y-axis: Speedup. Curves: Direct: No Shared Column; Direct: Shared Column; Iterative: SOR, Tol = 1e-9; Iterative: Jacobi, Tol = 1e-9.]