2001 Springer-Verlag
Int J STTT (2001) 3: 1–18
Parallel programming with a pattern language

Berna L. Massingill1*, Timothy G. Mattson2, Beverly A. Sanders1

1 Department of Computer and Information Sciences, University of Florida, FL, USA; E-mail: {blm,sanders}@cise.ufl.edu
2 Parallel Algorithms Laboratory, Intel Corporation, USA; E-mail: [email protected]
Abstract. A design pattern is a description of a high-quality solution to a frequently occurring problem in some domain. A pattern language is a collection of design patterns that are carefully organized to embody a design methodology. A designer is led through the pattern language, at each step choosing an appropriate pattern, until the final design is obtained in terms of a web of patterns. This paper describes a pattern language for parallel application programs aimed at lowering the barrier to parallel programming by guiding a programmer through the entire process of developing a parallel program. We describe the pattern language, present two example patterns, and sketch a case study illustrating the design process using the pattern language.

Key words: Design patterns – Parallel programming
1 Introduction

Parallel hardware has been available for decades and is becoming increasingly mainstream. Parallel software that fully exploits the hardware is much rarer, however, and mostly limited to the specialized area of supercomputing. Part of the reason for this state of affairs could be that most parallel programming environments, which focus on the implementation of concurrency rather than higher-level design issues, are simply too difficult for most programmers to risk using.

We acknowledge the support of Intel Corporation, the National Science Foundation (grant #9704697), and the Air Force Office of Scientific Research (grant #4514209-12).
* Present address: Department of Computer Science, Trinity University; E-mail: [email protected]
A design pattern describes, in a prescribed format, a high-quality solution to a frequently occurring problem in some domain. The format is chosen to make it easy for the reader to quickly understand both the problem and the proposed solution. Because the pattern has a name, a collection of patterns provides a vocabulary with which to talk about these solutions. A structured collection of design patterns is called a pattern language. A pattern language is more than just a catalog of patterns: the structure of the pattern language is chosen to lead the user through the collection of patterns in such a way that complex systems can be designed using the patterns. At each decision point, the designer selects an appropriate pattern. Each pattern leads to other patterns, resulting in a final design in terms of a web of patterns. A pattern language thus embodies a design methodology and provides domain-specific advice to the application designer. (In spite of the overlapping terminology, a pattern language is not a programming language.)

This paper describes a pattern language for parallel application programs. The current state of the pattern language can be viewed at http://www.cise.ufl.edu/research/ParallelPatterns. The pattern language is extensively hyperlinked, allowing the programmer to work through it by following links. The goal of the pattern language is to lower the barrier to parallel programming by guiding a programmer through the entire process of developing a parallel program. The main target audience is experienced programmers who may lack experience with parallel programming. The programmer brings to the process a good understanding of the actual problem to be solved and then works through the pattern language, eventually obtaining a detailed parallel design or perhaps working code.
In this paper, we first give an overview of the organization of the pattern language. We then present the text of two of the patterns, followed by a simple case study illustrating the design process using the pattern language. We close with brief descriptions of related approaches.

2 Organization of the pattern language

The pattern language is organized into four design spaces – FindingConcurrency, AlgorithmStructure, SupportingStructures, and ImplementationMechanisms – which form a linear hierarchy, with FindingConcurrency at the top and ImplementationMechanisms at the bottom.

2.1 The FindingConcurrency design space

This design space is concerned with structuring the problem to expose exploitable concurrency. The designer working at this level focuses on high-level algorithmic issues and reasons about the problem to expose potential concurrency. There are three major design patterns in this space:
– DecompositionStrategy. This pattern addresses the question of how to decompose the problem into parts that can execute simultaneously. It and related patterns (TaskDecomposition and DataDecomposition) discuss the two major strategies for decomposing problems – task-based decomposition and data-based decomposition – and help the programmer select an appropriate strategy based on one or the other or a combination of both.
– DependencyAnalysis. Once the programmer has identified the entities into which the problem is to be decomposed, this pattern helps him or her understand how they depend on each other. These dependencies include both ordering constraints and data dependencies; this pattern and related patterns (GroupTasks, OrderTasks, and DataSharing) help the programmer analyze these dependencies in detail.
– DesignEvaluation. This pattern is a consolidation pattern; it is used to evaluate the results of the other patterns in this design space and prepare the programmer for the next design space, the AlgorithmStructure design space.
Figure 1 illustrates the relationships among patterns in this design space. In the figure, arrows indicate the direction in which the programmer moves between patterns; double-headed arrows indicate that it may be necessary to move back and forth between patterns as the analysis proceeds.

2.2 The AlgorithmStructure design space

This design space is concerned with structuring the algorithm to take advantage of potential concurrency.
That is, the designer working at this level reasons about how to use the concurrency exposed in the previous level. Patterns in this space describe overall strategies for exploiting concurrency. Selected patterns in this space have been described in [17]. Patterns in this design space can be divided into the following three groups, plus the ChooseStructure pattern, which addresses the question of how to use the analysis performed by using the FindingConcurrency patterns to select an appropriate pattern from those in this space. A key part of the ChooseStructure pattern is the figure included here as Fig. 2; it illustrates the decisions involved.

2.2.1 "Organize by ordering" patterns

These patterns are used when the ordering of groups of tasks is the major organizing principle for the parallel algorithm. This group has two members, reflecting two ways task groups can be ordered. One choice represents "regular" orderings that do not change during the algorithm; the other represents "irregular" orderings that are more dynamic and unpredictable.
– PipelineProcessing. The problem is decomposed into ordered groups of tasks connected by data dependencies.
– AsynchronousComposition. The problem is decomposed into groups of tasks that interact through asynchronous events.

2.2.2 "Organize by tasks" patterns

These patterns are those for which the tasks themselves are the best organizing principle. There are many ways to work with such "task-parallel" problems, making this the largest pattern group.
– EmbarrassinglyParallel. The problem is decomposed into a set of independent tasks. Most algorithms based on task queues and random sampling are instances of this pattern.
– SeparableDependencies. The parallelism is expressed by splitting up tasks among units of execution (threads or processes). Any dependencies between tasks can be pulled outside the concurrent execution by replicating the data prior to the concurrent execution and then combining the replicated data after the concurrent execution. This pattern applies when variables involved in data dependencies are written but not subsequently read during the concurrent execution.
– ProtectedDependencies. The parallelism is expressed by splitting up tasks among units of execution. In this case, however, variables involved in data dependencies are both read and written during the concurrent execution; thus, they cannot be pulled outside the concurrent execution but must be managed during the concurrent execution of the tasks.
[Figure: the programmer begins with DecompositionStrategy (with its related TaskDecomposition and DataDecomposition patterns), moves back and forth between it and DependencyAnalysis (with its related GroupTasks, OrderTasks, and DataSharing patterns), and proceeds through DesignEvaluation to the AlgorithmStructure design space.]
Fig. 1. Organization of patterns in the FindingConcurrency design space
[Figure: a decision tree. From Start, the designer chooses OrganizeByOrdering, OrganizeByTasks, or OrganizeByData. OrganizeByOrdering branches on Regular orderings (PipelineProcessing) versus Irregular orderings (AsynchronousComposition). OrganizeByTasks leads to a Partitioning decision: a Linear partition branches on Independent tasks (EmbarrassinglyParallel) versus Dependent tasks with Separable Dependencies (SeparableDependencies) or Inseparable Dependencies (ProtectedDependencies), while a Recursive partition leads to DivideAndConquer. OrganizeByData branches on Linear (GeometricDecomposition) versus Recursive (RecursiveData) decompositions. Key: decision/branch points versus terminal patterns.]
Fig. 2. Organization of the AlgorithmStructure design space
– DivideAndConquer. The problem is solved by recursively dividing it into subproblems, solving each subproblem independently, and then recombining the subsolutions into a solution to the original problem.

2.2.3 "Organize by data" patterns

These patterns are those for which the decomposition of the data is the major organizing principle in understanding the concurrency. There are two patterns in this group, differing in how the decomposition is structured (linearly in each dimension or recursively).
– GeometricDecomposition. The problem space is decomposed into discrete subspaces; the problem is then solved by computing solutions for the subspaces, with the solution for each subspace typically requiring data from a small number of other subspaces. Many instances of this pattern can be found in scientific computing, where it is useful in parallelizing grid-based computations, for example.
– RecursiveData. The problem is defined in terms of following links through a recursive data structure.
2.3 The SupportingStructures design space

This design space represents an intermediate stage between the AlgorithmStructure and ImplementationMechanisms design spaces. While it is sometimes possible to go directly from the AlgorithmStructure space to an implementation for a target programming environment, it often makes sense to construct the implementation in terms of other patterns that constitute an intermediate stage between the problem-oriented patterns of the AlgorithmStructure design space and the machine-oriented "patterns" of the ImplementationMechanisms space. Two important groups of patterns in this space are those that represent program-structuring constructs (such as SPMD and ForkJoin) and those that represent commonly used shared data structures (such as SharedQueue). In some cases, a library may be available that contains a ready-to-use implementation of the data structures. If so, the pattern describes how to use the data structure; otherwise, the pattern describes how to implement it.

2.4 The ImplementationMechanisms design space

This design space is concerned with how the patterns of the higher-level spaces are mapped into particular programming environments. We use it to provide pattern-based descriptions of common mechanisms for process or thread management (e.g., creating or destroying processes or threads) and process or thread interaction (e.g., monitors, semaphores, barriers, or message-passing). Patterns in this design space, like those in the SupportingStructures space, describe entities that strictly speaking are not patterns at all. We include them in our pattern language anyway, however, to provide a complete path from problem description to code, and we document them using our pattern notation for the sake of consistency.

3 Example patterns

This section presents the text of two representative patterns, the DecompositionStrategy pattern from the FindingConcurrency design space and the EmbarrassinglyParallel pattern from the AlgorithmStructure design space. The DecompositionStrategy pattern consists of a single document, presented as Sect. 3.1. The EmbarrassinglyParallel pattern consists of two documents, a main pattern (presented as Sect. 3.2) and a supporting "examples document" containing longer and more detailed examples (presented as Sect. 3.3).

The reader will note that while both patterns follow the same overall format, adapted from [10], there is some variation with regard to the sections included in each pattern; not all patterns include all sections. For the benefit of the reader of this paper, comments have been added to each pattern section indicating its general purpose. These comments, which appear in italics at the beginning of most sections, are not part of the actual pattern. Underlined words represent hyperlinks in the actual pattern. These could be links to a definition (in the pattern language's glossary), another pattern, example code, a document containing more detailed analysis, etc.

3.1 The DecompositionStrategy pattern

Pattern name: DecompositionStrategy

Intent: (Briefly states the problem solved by this pattern.) This pattern addresses the question "How do you go about decomposing your problem into parts that can be run concurrently?"

Motivation:
(Gives the context for the pattern, i.e., why a designer would use the pattern and what background information should be kept in mind when using it.) Parallel programs let you solve bigger problems in less time. They do this by simultaneously solving different parts of the problem on different processors. This can only work if your problem contains exploitable concurrency, i.e., multiple activities or tasks that can take place at the same time. Exposing concurrency requires decomposing the problem along two different dimensions: – Task decomposition. Break the stream of instructions into multiple chunks called tasks that can execute simultaneously. To achieve reasonable runtime performance, tasks must execute with minimal need to interact; i.e., the overhead associated with managing dependencies must be small compared to the program’s total execution time. – Data decomposition. Determine how data interacts with the tasks. Some of the data will be modified only within a task; i.e., it is local to each task. For such data, the algorithm designer must figure out how to break up the data and properly associate it with the right tasks. Other data is modified by multiple tasks; i.e., the data is global or shared between tasks. For shared data, the goal is to design the algorithm so that tasks don’t get in each other’s way as they work with the data.
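To see the two dimensions in miniature – this sketch is ours, not part of the pattern text, and update() is a hypothetical per-element operation – consider applying the same operation to every element of an array: the task view splits the iterations among units of execution, while the data view assigns each unit of execution its own block of the array.

#include <stddef.h>

/* Hypothetical helper: the per-element work. */
void update(double *x) { *x = *x * 0.5 + 1.0; }

/* Task-based view: each iteration i is a task; a scheduler may hand
   out iterations to units of execution in any order. */
void task_decomposition(double *a, size_t n) {
    for (size_t i = 0; i < n; i++) {   /* each i = one independent task */
        update(&a[i]);
    }
}

/* Data-based view: the array is split into contiguous blocks and each
   unit of execution (identified by ue, 0 <= ue < num_ues) owns one. */
void data_decomposition(double *a, size_t n, int ue, int num_ues) {
    size_t lo = n * (size_t)ue / (size_t)num_ues;
    size_t hi = n * (size_t)(ue + 1) / (size_t)num_ues;
    for (size_t i = lo; i < hi; i++) { /* this UE touches only its block */
        update(&a[i]);
    }
}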
Balancing the opposing forces of data and task decomposition occurs against the backdrop of two additional factors: efficiency and flexibility. The final program must effectively utilize the resources provided by the parallel computer. At the same time, parallel computers come in a variety of architectures, and you need enough flexibility to handle all the parallel computers you care about. Note that in some cases the appropriate decomposition will be obvious; in others you will need to dig deeply into the problem to expose the concurrency. Sometimes you may even need to completely recast the problem and restructure how you think about its solution.

Applicability: (Gives a high-level discussion of when this pattern can be used.) Use this pattern when:
– You have determined that your problem is large and significant enough that expending the effort to create a parallel program is worthwhile.
– You understand the key data structures of the problem and how they are used.
– You understand which parts of the problem are most compute-intensive. It is on these parts that you will focus your efforts.

Implementation: (Explains how to implement the pattern, usually in terms of patterns from lower-level design spaces.) The goal is to decompose your problem into relatively independent entities that can execute concurrently. As mentioned earlier, there are two dimensions to be considered:
– The task-decomposition dimension focuses on the operations that will take place within concurrently executing entities. We refer to a set of operations that are logically grouped together as a task. For a task to be useful, the operations that make up the task should be largely independent of the operations taking place inside other tasks.
– The data-decomposition dimension focuses on the data. You need to decompose the problem's data into chunks that can be operated on relatively independently.
While the decomposition needs to address both the tasks and the data, the nature of the problem usually (but not always) suggests one decomposition or the other as the primary decomposition, and it is easiest to start with that one.
– A data-based decomposition is a good starting point if:
  – The most compute-intensive part of the problem manipulates a large data structure.
  – Similar operations are being applied to different parts of the data structure, in such a way that the different parts can be operated on relatively independently.
For example, many problems can be cast in terms of the multiplication of large matrices. Mathematically, each element of the product matrix is computed using the same set of operations. This suggests that an effective way to think about this problem is in terms of the decomposition of the matrices. We talk about this approach in more detail in the DataDecomposition pattern. Data-based decompositions tend to be more scalable (i.e., their performance scales with the number of processing elements), since memory is being decomposed.
– A task-based decomposition is a good starting point if:
  – It is natural to think about the problem in terms of a collection of independent (or nearly independent) tasks.
For example, many problems can be considered in terms of a function that is evaluated repeatedly, with a slightly different set of conditions each time. We can associate each function evaluation with a task. If this is the case, you should start with a task-based decomposition, which is the subject of the TaskDecomposition pattern. If there are many nearly independent tasks, task-based decompositions tend to produce a design with a lot of flexibility, which is an advantage when later deciding how to allocate tasks to processing elements (a generic term used to reference a hardware element in a parallel computer that executes a stream of instructions).
In some cases, you can view the problem in either way. For example, we earlier described a data decomposition of a matrix multiplication problem. You can also view this as a task-based decomposition – for example, by associating the update of each matrix column with a task. In cases where a clear decision cannot be made, the best approach is to try each decomposition and see which one is most effective at exposing lots of concurrency.
During the design process, you also need to keep in mind the following competing forces:
– Flexibility. Is your design abstract enough that you have maximum flexibility to adapt to different implementation requirements? For example, you don't want to narrow your options to a single computer system or style of programming at this stage of the design.
– Efficiency. A parallel program is only useful if it scales efficiently with the size of the parallel computer (in terms of reduced runtime and/or memory utilization). For the problem's decomposition, this means you need enough tasks to keep all the processing elements (PEs) busy with enough work per task to compensate for overhead incurred to manage dependen-
cies. The drive for efficiency can lead to complex decompositions that lack flexibility. – Simplicity. Your decomposition needs to be complex enough to get the job done, but simple enough to let you debug and maintain your program in a reasonable amount of time. Balancing these competing forces as you decompose your problem is difficult, and in all likelihood you will not get it right the first time. Therefore, use an iterative decomposition strategy in which you decompose by the most obvious method (task or data) and then by the other method (data or task). Examples: (Provides implementations of the pattern in particular programming environments. Higher-level patterns may use pseudocode, while lower-level patterns may use code from one or more popular programming environments such as MPI, OpenMP, or Java.) Medical imaging. We will define a single problem here, taken from the field of medical imaging, and then decompose it two different ways: in terms of tasks and in terms of data. The decompositions will be presented in the DataDecomposition and TaskDecomposition patterns; the discussion here will serve to define the problem and then describe the way the two solutions interact. An important diagnostic tool is to give a patient a radioactive substance and then watch how that substance propagates through the body by looking at the distribution of emitted radiation. Unfortunately, the images are of low resolution, due in part to the scattering of the radiation as it passes through the body. It is also difficult to reason from the absolute radiation intensities, since different pathways through the body attenuate the radiation differently. To solve this problem, medical imaging specialists build models of how radiation propagates through the body and use these models to correct the images. A common approach is to build a Monte Carlo model. Randomly selected points within the body are assumed to emit radiation (usually a gamma ray), and the trajectory of each ray is followed. As a particle (ray) passes through the body, it is attenuated by the different organs it traverses, continuing until the particle leaves the body and hits a camera model, thereby defining a full trajectory. To create a statistically significant simulation, thousands if not millions of trajectories are followed. The problem can be parallelized in two ways. Since each trajectory is independent, it would be possible to parallelize the application by associating each trajectory with a task. This approach is discussed in the “Examples” section of the TaskDecomposition pattern. Another approach would be to partition the body into sections and
assign different sections to different processing elements. This approach is discussed in the "Examples" section of the DataDecomposition pattern. As in many ray-tracing codes, there are no dependencies between trajectories, making the task-based decomposition the natural choice. By eliminating the need to manage dependencies, the task-based algorithm also gives the programmer plenty of flexibility later in the design process, when how to schedule the work on different processing elements becomes important. The data decomposition, however, is much more effective at managing memory utilization. This is frequently the case with a data decomposition as compared to a task decomposition. Since memory is decomposed, data-decomposition algorithms also tend to be more scalable. These issues are important and point to the need to at least consider the types of platforms that will be supported by the final program. The need for portability drives one to make decisions about target platforms as late as possible. There are times, however, when delaying consideration of platform-dependent issues can lead one to choose a poor algorithm.

Parallel database. As another example of a single problem that can be decomposed in multiple ways, consider a parallel database. One approach is to break up the database itself into multiple chunks. Multiple worker processes would handle the actual searching operations, each on the chunk of the database it "owns", and a single manager would receive search requests and forward each to the relevant worker to carry out the search. A second approach for this parallel database problem would also use a manager and multiple workers but would keep the database intact in one logical location. The workers would be essentially identical, and each would be able to work on any piece of the database. Observe that the issues raised in this example are similar to those raised by the medical imaging example.

Iterative algorithms. Many linear-algebra problems can be solved by repeatedly applying some operation to a large matrix or other array. Effective parallelizations of such algorithms are usually based on parallelizing each iteration (rather than, say, attempting to perform the iterations concurrently). For example, consider an algorithm that solves a system of linear equations Ax = b (where A is a matrix and x and b are vectors) by calculating a sequence of approximations x(0), x(1), x(2), and so forth, where for some function f, x(k+1) = f(x(k)). A typical parallelization would be structured as a sequential iteration (computing the x(k)'s in sequence), with each iteration (computing x(k+1) = f(x(k)) for some value of k) being computed in a way that exploits potential concurrency. For
example, if each iteration requires a matrix multiplication, this operation can be parallelized using either a task-based decomposition (as discussed in the "Examples" section of the TaskDecomposition pattern) or a data-based decomposition (as discussed in the "Examples" section of the DataDecomposition pattern).
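As a concrete, hedged illustration of the data-based route just mentioned – this sketch is ours, not part of the pattern text – the matrix multiplication inside each iteration can be decomposed by rows of the product, which in C with OpenMP is a single parallel loop:

/* Data decomposition of C = A * B (all N x N, row-major):
   each unit of execution owns a block of rows of C.
   A minimal sketch, assuming OpenMP is available. */
void matmul_rows(const double *A, const double *B, double *C, int N)
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {          /* rows of C are decomposed */
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i * N + k] * B[k * N + j];
            C[i * N + j] = sum;            /* only one UE writes row i */
        }
    }
}

Viewing the same loop task-wise – one task per row (or column) update – leads to essentially the same code; the difference lies in whether one reasons first about the data blocks or about the independent tasks.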
3.2 The EmbarrassinglyParallel pattern (main pattern)

Pattern name: EmbarrassinglyParallel

Intent: (Briefly states the problem solved by this pattern.) This pattern is used to describe concurrent execution by a collection of independent tasks. Parallel algorithms that use this pattern are called embarrassingly parallel because once the tasks have been defined the potential concurrency is obvious.

Also known as: (Lists other names by which this pattern is commonly known.)
– Master-Worker, Task Queue.

Motivation: (Gives the context for the pattern, i.e., why a designer would use the pattern and what background information should be kept in mind when using it.)

Overview. Consider an algorithm that can be decomposed into many independent tasks. (Some algorithms are easily seen to be decomposable in this way. For others, some insight may be required to discover a decomposition into independent tasks.) Such an algorithm, often called an "embarrassingly parallel" algorithm, contains obvious concurrency that is trivial to exploit once these independent tasks have been defined, because of the independence of the tasks. Nevertheless, while the source of the concurrency is often obvious, taking advantage of it in a way that makes for efficient execution can be difficult. The EmbarrassinglyParallel pattern shows how to organize such a collection of tasks so they execute efficiently. The challenge is to organize the computation so that all units of execution (UEs) finish their work at about the same time – that is, so that the computational load is balanced among processors. Figure 3 illustrates the problem. This pattern automatically and dynamically balances the load as necessary. With this pattern, faster or less-loaded UEs automatically do more work. When the amount of work required for each task cannot be predicted ahead of time, this pattern produces a statistically optimal solution.

[Figure: six independent tasks A–F assigned to 4 UEs, contrasting a poor load balance with a good load balance.]
Fig. 3. Load balance with the EmbarrassinglyParallel pattern

Examples of this pattern include the following:
– Vector addition (considering the addition of each pair of elements as a separate task).
– Ray-tracing codes such as the medical-imaging example described in the DecompositionStrategy pattern. Here the computation associated with each "ray" becomes a separate task.
– Database searches in which the problem is to search for an item meeting specified criteria in a database that can be partitioned into subspaces that can be searched concurrently. Here the searches of the subspaces are the independent tasks.
– Branch-and-bound computations, in which the problem is solved by repeatedly removing a solution space from a list of such spaces, examining it, and either declaring it a solution, discarding it, or dividing it into smaller solution spaces that are then added to the list of spaces to examine. Such computations can be parallelized using this pattern by making each "examine and process a solution space" step a separate task.
As these examples illustrate, this pattern allows for some variation:
– The tasks can all be roughly equal in size, or they can vary in size.
– For some problems (the database search, for example), it may be possible to solve the problem without executing all the tasks.
– For some problems (branch-and-bound computations, for example), new tasks may be created during execution of other tasks.

More formal discussion. The EmbarrassinglyParallel pattern is applicable when what we want to compute is a solution(P) such that

    solution(P) = f(subsolution(P, 0), subsolution(P, 1), ..., subsolution(P, N-1))

where, for i and j different, subsolution(P, i) does not depend on subsolution(P, j). That is, the original problem can be decomposed into a number of independent subproblems such that we can solve the whole problem by solving all of the subproblems and then combining the results. We could code a sequential solution thus:

    Problem P;
    Solution subsolutions[N];
    Solution solution;
    for (i = 0; i < N; i++) {
        subsolutions[i] = compute_subsolution(P, i);
    }
    solution = compute_f(subsolutions);
If function compute_subsolution modifies only local variables, it is straightforward to show that the sequential composition implied by the for loop in the preceding program can be replaced by any combination of sequential and parallel composition without affecting the result. That is, we can partition the iterations of this loop among available UEs in whatever way we choose, so long as each is executed exactly once. This is the EmbarrassinglyParallel pattern in its simplest form – all the subproblems are defined before computation begins, and each subsolution is saved in a distinct variable (array element), so the computation of the subsolutions is completely independent. These computations of subsolutions then become the independent tasks of the pattern as described earlier. There are also some variations on this basic theme: – Subsolutions accumulated in a shared data structure. One such variation differs from the simple form in that it accumulates subsolutions in a shared data structure (a set, for example, or a running sum). Computation of subsolutions is no longer completely independent (since access to the shared data structure must be
synchronized), but concurrency is still possible if the order in which subsolutions are added to the shared data structure does not affect the result. – Termination condition other than “all tasks completed”. In the simple form of the pattern, all tasks must be completed before the problem can be regarded as solved, so we can think of the parallel algorithm as having the termination condition “all tasks completed”. For some problems, however, it may be possible to obtain an overall solution without solving all the subproblems. For example, if the whole problem consists of determining whether a large search space contains at least one item meeting given search criteria, and each subproblem consists of searching a subspace (where the union of the subspaces is the whole space), then the computation can stop as soon as any subspace is found to contain an item meeting the search criteria. As in the simple form of the pattern, each computation of a subsolution becomes a task, but now the termination condition is something other than “all tasks completed”. This can also be made to work, although care must be taken to either ensure that the desired termination condition will actually occur or to make provision for the case in which all tasks are completed without reaching the desired condition. – Not all subproblems known initially. A final and more complicated variation differs in that not all subproblems are known initially; that is, some subproblems are generated during solution of other subproblems. Again, each computation of a subsolution becomes a task, but now new tasks can be created “on the fly”. This imposes additional requirements on the part of the program that keeps track of the subproblems and which of them have been solved, but these requirements can be met without too much trouble, for example by using a thread-safe shared task queue. The trickier problem is ensuring that the desired termination condition (“all tasks completed” or something else) will eventually be met. What all of these variations have in common is that they meet the pattern’s key restriction: it must be possible to solve the subproblems into which we partition the original problem independently. In addition, if the subsolution results are to be collected into a shared data structure, it must be the case that the order in which subsolutions are placed in this data structure does not affect the result of the computation.
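To make the first variation (subsolutions accumulated in a shared data structure) concrete – this sketch is ours, not part of the pattern text, and assumes OpenMP in C with a hypothetical compute_subsolution – the subsolutions can be accumulated into a running sum, with the synchronization expressed as a reduction so that the order of accumulation does not affect the result:

/* Hypothetical per-subproblem solver; each call is an independent task. */
extern double compute_subsolution(int i);

/* Simplest form plus the "accumulate into a shared running sum" variation.
   The reduction clause makes the accumulation order-independent and safe. */
double solve(int N)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+: sum)
    for (int i = 0; i < N; i++) {
        sum += compute_subsolution(i);   /* task i; no other shared writes */
    }
    return sum;   /* plays the role of compute_f for a sum */
}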
Applicability: (Gives a high-level discussion of when this pattern can be used.) Use the EmbarrassinglyParallel pattern when: – The problem consists of tasks that are known to be independent; that is, there are no data dependencies be-
tween tasks (aside from those described in "Subsolutions accumulated in a shared data structure" above).
This pattern can be particularly effective when:
– The startup cost for initiating a task is much less than the cost of the task itself.
– The number of tasks is much greater than the number of processors to be used in the parallel computation.
– The effort required for each task or the processing performance of the processors varies unpredictably. This unpredictability makes it very difficult to produce an optimal static work distribution.

Structure: (Describes how the participants interact to define this pattern.) Implementations of this pattern include the following key elements:
– A mechanism to define a set of tasks and schedule their execution onto a set of UEs.
– A mechanism to detect completion of the tasks and terminate the computation.

Usage: (Describes how the pattern is used. Other patterns are referenced as applicable to explain how it can be used to solve larger problems.) This pattern is typically used to provide high-level structure for an application; that is, the application is typically structured as an instance of this pattern. It can also be used in the context of a simple sequential control structure such as sequential composition, if-then-else, or a loop construct. An example is given in our overview paper [15], where the program as a whole is a simple loop whose body contains an instance of this pattern.

Consequences: (Gives the designer the information needed to make intelligent design tradeoffs.) The EmbarrassinglyParallel pattern has some powerful benefits, but also a significant restriction.
– Parallel programs that use this pattern are among the simplest of all parallel programs. If the independent tasks correspond to individual loop iterations and these iterations do not share data dependencies, parallelization can be easily implemented with a parallel loop directive.
– With some care on the part of the programmer, it is possible to implement programs with this pattern that automatically and dynamically adjust the load between units of execution. This makes the EmbarrassinglyParallel pattern popular for programs designed to run on parallel computers built from networks of workstations.
– This pattern is particularly valuable when the effort required for each task varies significantly and unpredictably. It also works particularly well on heterogeneous networks, since faster or less-loaded processors naturally take on more of the work.
– The downside, of course, is that the whole pattern breaks down when the tasks need to interact during their computation. This limits the number of applications where this pattern can be used.

Implementation: (Explains how to implement the pattern, usually in terms of patterns from lower-level design spaces.)

There are many ways to implement this pattern. If all the tasks are of the same size, all are known a priori, and all must be completed (the simplest form of the pattern), the pattern can be implemented by simply dividing the tasks among units of execution using a parallel loop directive. Otherwise, it is common to collect the tasks into a queue (the task queue) shared among UEs. This task queue can then be implemented using the SharedQueue pattern. The task queue, however, can also be represented by a simpler structure such as a shared counter.

Key elements. (Describes elements identified in the "Structure" section.)

Defining tasks and scheduling their execution. A set of tasks is represented and scheduled for execution on multiple units of execution (UEs). Frequently the tasks correspond to iterations of a loop. In this case we implement this pattern by splitting the loop between multiple UEs. The key to making algorithms based on this pattern run well is to schedule their execution so the load is balanced between the UEs. The schedule can be:
– Static. In this case the distribution of iterations among the UEs is determined once, at the start of the computation. This might be an effective strategy when the tasks have a known amount of computation and the UEs are running on systems with a well-known and stable load. In other words, a static schedule works when you can statically determine how many iterations to assign to each UE in order to achieve a balanced load. Common options are to use a fixed interleaving of tasks between UEs, or a blocked distribution in which blocks of tasks are defined and distributed, one to each UE.
– Dynamic. Here the distribution of iterations varies between UEs as the computation proceeds. This strategy is used when the effort associated with each task is unpredictable or when the available load that can be supported by each UE is unknown and potentially changing. The most common approach used for dynamic load balancing is to define a task queue to be
used by all the UEs; when a UE completes its current task and is therefore ready to process more work, it removes a task from the task queue. Faster UEs or those receiving lighter-weight tasks will go to the queue more often and automatically grab more tasks. Implementation techniques include parallel loops and master-worker and SPMD versions of a task-queue approach.

Parallel loop. If the computation fits the simplest form of the pattern – all tasks the same size, all known a priori, and all required to be completed – they can be scheduled by simply setting up a parallel loop that divides them equally (or as equally as possible) among the available units of execution.

Master-Worker or SPMD. If the computation does not fit the simplest form of the pattern, the most common implementation involves some form of a task queue. Frequently this is done using two types of processes, master and worker. There is only one master process; it manages the computation by:
– Setting up or otherwise managing the workers.
– Creating and managing a collection of tasks (the task queue).
– Consuming results.
There can be many worker processes; each contains some type of loop that repeatedly:
– Removes the task at the head of the queue.
– Carries out the indicated computation.
– Returns the result to the master.
Frequently the master and worker processes form an instance of the ForkJoin pattern, with the master process forking off a number of workers and waiting for them to complete. A common variation is to use an SPMD program with a global counter to implement the task queue. This form of the pattern does not require an explicit master.

Detecting completion and terminating. Termination can be implemented in a number of ways. If the program is structured using the ForkJoin pattern, the workers can continue until the termination condition is reached, checking for an empty task queue (if the termination condition is "all tasks completed") or for some other desired condition. As each worker detects the appropriate condition, it terminates; when all have terminated, the master continues with any final combining of results generated by the individual tasks. Another approach is for the master or a worker to check for the desired termination condition and, when it is detected, create a "poison pill", a special task that tells all the other workers to terminate.

Correctness considerations. (Summarizes key results from the supporting theory, ideally in the form of guidelines, that if followed will result in a correct algorithm.) The keys to exploiting available concurrency while maintaining program correctness (for the problem in its simplest form) are as follows:
– Solve subproblems independently. Computing the solution to one subproblem must not interfere with computing the solution to another subproblem. This can be guaranteed if the code that solves each subproblem does not modify any variables shared between units of execution (UEs).
– Solve each subproblem exactly once. This is almost trivially guaranteed if static scheduling is used (i.e., if the tasks are scheduled via a parallel loop). It is also easily guaranteed if the parallel algorithm is structured as follows:
  – A task queue is created as an instance of a thread-safe shared data structure such as SharedQueue, with one entry representing each task.
  – A collection of UEs execute concurrently; each repeatedly removes a task from the queue and solves the corresponding subproblem.
  – When the queue is empty and each UE finishes the task it is currently working on, all the subsolutions have been computed, and the algorithm can proceed to the next step, combining them. (This also means that if a UE finishes a task and finds the task queue empty, it knows that there is no more work for it to do, and it can take appropriate action – terminating if there is a master UE that will take care of any combining of subsolutions, for example.)
– Correctly save subsolutions. This is trivial if each subsolution is saved in a distinct variable, since there is then no possibility that the saving of one subsolution will affect subsolutions computed and saved by other tasks. – Correctly combine subsolutions. This can be guaranteed by ensuring that the code to combine subsolutions does not begin execution until all subsolutions have been computed as discussed above. The variations mentioned earlier impose additional requirements: – Subsolutions accumulated in a shared data structure. If the subsolutions are to be collected into a shared data structure, then the implementation must guarantee that concurrent access does not damage the shared data structure. This can be ensured by implementing the shared data structure as an instance of a “threadsafe” pattern.
– Termination condition other than "all tasks completed". Then the implementation must guarantee that each subsolution is computed at most once (easily done by using a task queue as described earlier) and that the computation detects the desired termination condition and terminates when it is found. This is more difficult but still possible.
– Not all subproblems known initially. Then the implementation must guarantee that each subsolution is computed exactly once, or at most once (depending on the desired termination condition). In addition, the program designer must ensure that the desired termination condition will eventually be reached. For example, if the termination condition is "all tasks completed", then the pool of tasks generated must be finite, and each individual task must terminate. Again, a task queue as described earlier solves some of the problems; it will be safe for worker UEs to add as well as remove elements. Detecting termination of the computation is more difficult, however. It is not necessarily the case that when a "worker" finishes a task and finds the task queue empty there is no more work to do – another worker could generate a new task. One must therefore ensure that the task queue is empty and all workers are finished. Further, in systems based on asynchronous message passing, one must also ensure that there are no messages in transit that could, on their arrival, create a new task. There are many known algorithms that solve this problem. One that is useful in this context is described in [7]. Here tasks conceptually form a tree, where the root is the master task, and the children of a task are the tasks it generated. When a task and all its children have terminated, it notifies its parent that it has terminated. When all the children of the root have terminated, the computation has terminated. This of course requires children to keep track of their parents and to notify them when they are finished. Parents must also keep track of the number of active children (the number created minus the number that have terminated). Additional algorithms for termination detection are described in [2].

Efficiency considerations. (Discusses implementation issues that affect efficiency.)
– If all tasks are roughly the same length and their number is known a priori, static scheduling (usually performed using a parallel loop directive) is likely to be more efficient than dynamic scheduling.
– If a task queue is used, put the longer tasks at the beginning of the queue if possible. This ensures that there will be work to overlap with their computation.

Examples: (Provides implementations of the pattern in particular programming environments. Higher-level patterns may
use pseudocode, while lower-level patterns may use code from one or more popular programming environments such as MPI, OpenMP, or Java.)
Vector addition. Consider a simple vector addition, say C = A + B. As discussed earlier, we can consider each element addition Ci = Ai + Bi as a separate task and parallelize this computation in the form of a parallel loop: – See Sect. “Vector Addition” in the examples document (presented in this paper as Sect. 3.3).
Varying-length tasks. Consider a problem consisting of N independent tasks. Assume we can map each task onto a sequence of simple integers ranging from 0 to N − 1. Further assume that the effort required by each task varies considerably and is unpredictable. Several implementations are possible, including: – A master-worker implementation using a task queue. See Sect. “Varying-Length Tasks, Master-Worker Implementation” in the examples document (presented in this paper as Sect. 3.3). – An SPMD implementation using a task queue. See Sect.“Varying-Length Tasks, SPMD Implementation” in the examples document (presented in this paper as Sect. 3.3).
Optimization. See our overview paper [15] for an extended example using this pattern. Known uses: (Describes contexts in which the pattern has been used, including real programs, where possible in the form of literature references.) There are many application areas in which this pattern is useful. Many ray-tracing codes use some form of partitioning with individual tasks corresponding to scan lines in the final image [4]. Applications coded with the Linda coordination language are another rich source of examples of this pattern [3]. Parallel computational chemistry applications also make heavy use of this pattern. In the quantum chemistry code GAMESS, the loops over two electron integrals are parallelized with the TCGMSG task queue mechanism mentioned earlier. An early version of the distance geometry code, DGEOM, was parallelized with the master-worker form of the EmbarrassinglyParallel pattern. These examples are discussed in [19].
Related patterns: (Lists patterns related to this pattern. In some cases, a small change in the parameters of the problem can mean that a different pattern is indicated; this section notes such cases, with circumstances in which designers should use a different pattern spelled out.) The SeparableDependencies pattern is closely related to the EmbarrassinglyParallel pattern. To see this relation, think of the SeparableDependencies pattern in terms of a three-phase approach to the parallel algorithm. In the first phase, dependencies are pulled outside a set of tasks, usually by replicating shared data and converting it into task-local data. In the second phase, the tasks are run concurrently as completely independent tasks. In the final phase, the task-local data is recombined (reduced) back into the original shared data structure. The middle phase of the SeparableDependencies pattern is an instance of the EmbarrassinglyParallel pattern. That is, you can think of the SeparableDependencies pattern as a technique for converting problems into embarrassingly parallel problems. This technique can be used in certain cases with most of the other patterns in our pattern language. The key is that the dependencies can be pulled outside of the concurrent execution of tasks. If this isolation can be done, then the execution of the tasks can be handled with the EmbarrassinglyParallel pattern. Many instances of the GeometricDecomposition pattern (for example, “mesh computations” in which new values are computed for each point in a grid based on data from nearby points) can be similarly viewed as two-phase computations, where the first phase consists of exchanging boundary information among UEs and the second phase is an instance of the EmbarrassinglyParallel pattern in which each UE computes new values for the points it “owns”. It is also worthwhile to note that some problems in which the concurrency is based on a geometric data decomposition are, despite the name, not instances of the GeometricDecomposition pattern but instances of EmbarrassinglyParallel. An example is a variant of the vector addition example presented earlier, in which the vector is partitioned into “chunks”, with computation for each “chunk” treated as a separate task.
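To illustrate the three-phase structure just described (replicate, compute independently, recombine) – this sketch is ours, not part of the pattern text, assuming OpenMP in C and a hypothetical bin_of helper – consider accumulating a global histogram: each thread fills a private copy that is reduced into the shared array afterwards.

#include <string.h>

#define NBINS 64

/* Hypothetical helper: maps one data item to a histogram bin. */
extern int bin_of(double x);

/* SeparableDependencies, sketched: the dependency on the shared
   histogram is pulled out by replication, the middle phase is
   embarrassingly parallel, and a reduction recombines the copies. */
void histogram(const double *data, int n, long hist[NBINS])
{
    memset(hist, 0, NBINS * sizeof(long));
    #pragma omp parallel
    {
        long local[NBINS] = {0};               /* phase 1: replicate    */
        #pragma omp for nowait
        for (int i = 0; i < n; i++)            /* phase 2: independent  */
            local[bin_of(data[i])]++;
        #pragma omp critical                   /* phase 3: recombine    */
        for (int b = 0; b < NBINS; b++)
            hist[b] += local[b];
    }
}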
3.3 The EmbarrassinglyParallel pattern (supporting examples)

Pattern name: EmbarrassinglyParallel
Supporting document: examples

Vector addition: The following code uses an OpenMP parallel loop directive to perform vector addition.

!$OMP PARALLEL DO
      DO I = 1, N
         C(I) = A(I) + B(I)
      ENDDO
!$OMP END PARALLEL DO

Varying-length tasks, master-worker implementation: The following code uses a task queue and master-worker approach to solving the stated problem. We implement the task queue as an instance of the SharedQueue pattern. The master process, shown below, initializes the task queue, representing each task by an integer. It then uses the ForkJoin pattern to create the worker processes or threads and wait for them to complete. When they have completed, it consumes the results.

#define Ntasks 500           /* Number of tasks   */
#define Nworkers 5           /* Number of workers */

SharedQueue task_queue;              /* task queue            */
Results Global_results[Ntasks];      /* array to hold results */

void master()
{
   void Worker();

   // Create and initialize shared data structures
   task_queue = new SharedQueue();
   for (int i = 0; i < Ntasks; i++)
      enqueue(&task_queue, i);

   // Create Nworkers threads executing function Worker()
   ForkJoin (Nworkers, Worker);

   Consume_the_results (Ntasks);
}

The worker process, shown below, loops until the task queue is empty. Every time through the loop, it grabs the next task and does the indicated work, storing the results into a global results array. When the task queue is empty, the worker terminates.

void Worker()
{
   int i;
   Result res;

   while (!empty(task_queue)) {
      i = dequeue(task_queue);
      res = do_lots_of_work(i);
      Global_results[i] = res;
   }
}

Note that we ensure safe access to the key shared variable (the task queue) by implementing it using patterns from
the SupportingStructures space. Note also that the overall organization of the master process is an instance of the ForkJoin pattern.

Varying-length tasks, SPMD implementation: As an example of implementing this pattern without a master process, consider the following sample code using the TCGMSG message-passing library (described in [11]). The library has a function called NEXTVAL that implements a global counter. An SPMD program could use this construct to create a task-queue program as shown below.

while ((itask = NEXTVAL()) < Number_of_tasks) {
   do_lots_of_work(itask);
}
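For readers working outside TCGMSG, the same shared-counter idea can be sketched with an atomic fetch-and-increment; this is our illustration (with a hypothetical do_lots_of_work), not code from the pattern:

#include <stdatomic.h>

extern void do_lots_of_work(int itask);   /* hypothetical task body */

/* Shared-counter task queue: every unit of execution runs this loop;
   atomic_fetch_add hands out each task index exactly once. */
void spmd_worker(atomic_int *next_task, int number_of_tasks)
{
    int itask;
    while ((itask = atomic_fetch_add(next_task, 1)) < number_of_tasks) {
        do_lots_of_work(itask);
    }
}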
4 Using the pattern language

In this section, we illustrate the process of program design using the pattern language via a simple example. Another example appears in [15]; a more detailed discussion of a case study involving the example presented here can be found in [16].

4.1 Problem description

We consider the problem of developing a parallel version of a sequential molecular-dynamics simulation program, with the eventual goal of executing this parallel version in a distributed-memory message-passing environment. The theoretical background of molecular dynamics is interesting, but not relevant to the problem of "parallelizing" the sequential code, so the following discussion will focus on understanding the physical problem at its simplest level.

Molecular dynamics is used to simulate the motions of a large molecular system. For example, molecular dynamics simulations show how a large protein moves around and how different-shaped drugs might interact with the protein. Not surprisingly, molecular dynamics is extremely important in the pharmaceutical industry. Interestingly enough, molecular dynamics is also a significant topic in computer science, since it is a source of perfect test problems for parallel computing – it is simple to understand, relevant to science at large, and very difficult to effectively parallelize. Many papers have been written about parallel molecular dynamics algorithms; a list of representative references can be found in [19].

The basic idea behind molecular-dynamics simulations is to treat the molecule as a large collection of balls connected by springs. The balls represent the atoms in the molecule, while the springs represent the chemical bonds between the atoms. The molecular dynamics simulation itself is an explicit time-stepping process. At each time step, the algorithm computes the force on each atom and then uses standard classical mechanics to compute
how the forces move the atoms. This process is carried out repeatedly to step through time and compute a trajectory for the molecular system. The forces due to the chemical bonds (the “springs”) are relatively simple to compute. These correspond to the vibrations and rotations of the chemical bonds themselves. These are short-range forces that can be computed with knowledge of the handful of atoms that share chemical bonds. What makes the molecular dynamics problem so difficult is the fact that the “balls” have partial electrical changes. Hence, while atoms interact with a small neighborhood of atoms through the chemical bonds, the electrical charge causes every atom to apply a force to every other atom. This is the famous N -body problem. On the order of N 2 terms must be computed to get the long-range force. Since N is large (tens or hundreds of thousands) and the number of time steps in a simulation is huge (tens of thousands), the time required to compute these long-range forces dominates the computation. There are several elegant ways to reduce the effort required to solve the N body problem; for this example we will use the simplest of these, the so-called cutoff method. The idea is quite simple: even though each atom exerts a force on every other atom, this force decreases as the distance between the atoms grows. Hence, it should be possible to pick a distance beyond which the force contribution can be ignored. That is the cutoff, and it reduces a problem that scales as O(N 2 ) to one that scales as O(N × M ), where M is the number of atoms within the cutoff volume, usually hundreds. The computation is still huge, and it still dominates the overall runtime for the simulation, but at least the problem is tractable. The sequential program to be used as the basis for this example further simplifies the problem by modeling a system in which the short-range interactions can be ignored. (It approximates the behavior of a gas consisting of argon atoms.) There are many details, but the basic algorithm can be summarized in the pseudocode shown in Fig. 4. Key features of the individual subroutines are as follows: – updatePositions uses the forces array to calculate new values for the elements of the coordinates array coords. This is an O(N ) calculation and hence not a major contributor to the program’s overall running time. – findNeighbors calculates new values for the list of neighbors in nbrs using the atoms’ coordinates in coords. This is an O(N 2 ) calculation and hence the most computationally intensive part of the calculation. In the simulation, atoms do not move much during each time step, so this computation is not performed at every time step, but it is still a major part of the program’s overall computational effort. – longRangeForcescalculates new values for the forces array by summing, for each atom, the force contri-
    c     N is a constant integer representing number of atoms
    c     M is a constant integer representing a maximum number
    c     of pairs of "neighbor" atoms

    c     3D coordinates of atoms
          real coords(N, 3)
    c     forces on atoms in each dimension
          real forces(N, 3)
    c     list of pairs of "neighbor" atoms
          integer nbrs(2, M)

          do move = 1, numberOfMoves
             call updatePositions(N, coords, forces)
             call findNeighbors(N, coords, M, nbrs)
             call longRangeForces(N, coords, forces, M, nbrs)
             call physicalProperties( .. )
             call printResults( .. )
          end do

Fig. 4. Pseudocode for sequential molecular-dynamics simulation
Key features of the individual subroutines are as follows:

– updatePositions uses the forces array to calculate new values for the elements of the coordinates array coords. This is an O(N) calculation and hence not a major contributor to the program's overall running time.
– findNeighbors calculates new values for the list of neighbors in nbrs using the atoms' coordinates in coords. This is an O(N²) calculation and hence the most computationally intensive part of the calculation. In the simulation, atoms do not move much during each time step, so this computation is not performed at every time step, but it is still a major part of the program's overall computational effort.
– longRangeForces calculates new values for the forces array by summing, for each atom, the force contributions made by all its "neighbors", as recorded in the nbrs array. This is an O(N × M) calculation and hence the second most computationally intensive part of the calculation.
– physicalProperties performs a number of calculations that, while interesting in their own right, are not discussed here because they constitute an O(N) calculation that is not a major contributor to the program's overall running time.
– printResults prints summary results, so it likewise is not a major contributor to the program's overall running time and need not be discussed here.

4.2 Parallelization using our pattern language

A programmer with experience in parallel programming, particularly for problems in this application domain, might immediately see what parts of this algorithm would benefit most from parallelization and how to parallelize them, and might even see that the AlgorithmStructure patterns of most use are the EmbarrassinglyParallel and SeparableDependencies2 patterns. A less experienced programmer, however, would probably need to work through the FindingConcurrency patterns to arrive at this conclusion.
2 The full text of this pattern can be found in [17] or at our Web site at http://www.cise.ufl.edu/research/ParallelPatterns. Briefly, this pattern is used for task-based decompositions in which the dependencies between tasks can be eliminated as follows: necessary global data is replicated and (partial) results are stored in local data structures. Global results are then obtained by reducing (combining) results from the individual tasks. The simplest examples of this pattern are reduction operations such as global sums, maxima, etc.
4.2.1 Using the FindingConcurrency design space

The first step for such a programmer is to find the concurrency in the algorithm, which is the domain of our FindingConcurrency design space. Entering our pattern language at that level, the programmer would first look at the DecompositionStrategy pattern. Reading through its Applicability section, the programmer would learn that efforts should be focused on those parts of the program that are most computationally intensive. For this problem, the most computationally intensive tasks are the computation of the "neighbors" list (findNeighbors, an O(N²) computation) and the computation of forces (longRangeForces, an O(N × M) computation), so the programmer would use the pattern to determine how to effectively parallelize these two steps. Each of these steps involves a loop over pairs of atoms, with each loop iteration essentially independent of the others, so the DecompositionStrategy pattern would suggest a task-based decomposition of each of the two critical steps, with one task per atom.
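To make the granularity of "one task per atom" concrete, the following C fragment is a hypothetical sketch of the work done by a single longRangeForces task; the function name, the neighbor-list layout, and the placeholder pair force are our inventions, not the program's actual code. The key point is that a task reads shared data (coords and its own neighbor indices) but writes only its own atom's entry of forces, which is why the tasks are essentially independent.

    #include <stddef.h>

    /* Hypothetical sketch of one per-atom task in the longRangeForces step.
     * Task i reads shared data and writes only forces[i], so distinct
     * tasks do not interfere with one another. */
    void force_on_atom(size_t i,
                       const double coords[][3],   /* shared, read-only   */
                       const size_t nbrs_of_i[],   /* neighbors of atom i */
                       size_t nbr_count,           /* how many neighbors  */
                       double forces[][3])         /* task writes row i   */
    {
        double fx = 0.0, fy = 0.0, fz = 0.0;
        for (size_t k = 0; k < nbr_count; k++) {
            size_t j = nbrs_of_i[k];
            double dx = coords[j][0] - coords[i][0];
            double dy = coords[j][1] - coords[i][1];
            double dz = coords[j][2] - coords[i][2];
            double r_sq = dx * dx + dy * dy + dz * dz;
            double f = 1.0 / r_sq;    /* placeholder pair force, not physical */
            fx += f * dx;
            fy += f * dy;
            fz += f * dz;
        }
        forces[i][0] = fx;
        forces[i][1] = fy;
        forces[i][2] = fz;
    }

A findNeighbors task for atom i has the same shape: it scans candidate atoms, applies the cutoff test shown earlier, and records the surviving indices in data "owned" by atom i.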
With this decision in mind, the programmer would turn to the DependencyAnalysis pattern to analyze dependencies among tasks. This pattern (and the related GroupTasks, OrderTasks, and DataSharing patterns) leads the programmer to the following conclusions:

– The tasks can be grouped according to the routine from which they arise – that is, there should be a task group for each of the original routines (findNeighbors, longRangeForces, updatePositions, physicalProperties, and printResults). (Note that only the first two groups have actually been decomposed into multiple tasks.)
– The original control flow of the program imposes ordering constraints on the task groups; they must execute in the same sequence as in the sequential program.
– Within each task group, there are two conclusions to be drawn about data dependencies. First, there is no obvious way of partitioning data among tasks such that each task needs access only to its own data. However, if we look only at the variables modified by the tasks, it is possible to compute their values by having each task compute a local value and then combining these local values (a sketch of this idea appears at the end of this subsection).
– Data dependencies between the two major task groups (for the two program steps to be parallelized) are such that the data generated by a particular atom's task in the findNeighbors step is exactly that needed for that atom's task in the longRangeForces step, suggesting that we should use the same scheme for partitioning tasks among units of execution in both steps. Careful consideration of the data dependencies between tasks in these two groups also reveals that there is actually no need to recombine the results computed by the independent tasks in the findNeighbors step, since each task's results are used only by the corresponding task in the longRangeForces step.

Finally, the programmer would turn to the DesignEvaluation pattern to reconsider the analysis so far. Among the factors that this pattern reminds the programmer to consider is the choice of target platform. (From a computer scientist's perspective, it would be more elegant to develop a platform-independent design and then implement it for the desired platform or platforms, but unfortunately it can happen that a design that is effective for one class of platform is ineffective for another, so our pattern language recommends that programmers consider this factor even during the early phases of the design.) Recall that in this example the goal is to produce a program for a distributed-memory message-passing platform. For this environment, it is critical to reduce communication among processes, so at this point in the design process the programmer would review again the data-sharing requirements of the tasks identified so far, looking not only at the variables each task modifies but also at the variables whose values it needs.

It will be observed that so far our pattern language has dealt with design issues rather than with anything that maps in a straightforward way onto the code of the parallel program being designed; this would be even more obvious if we were starting with a problem description rather than with sequential code. Nevertheless, the patterns used so far have contributed to the design process by helping the programmer reach an understanding of the best alternatives for structuring a solution to the problem.
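Although the FindingConcurrency patterns deliberately stop short of code, the "compute a local value, then combine" conclusion above can be pictured concretely. The following C fragment is a minimal, hypothetical sketch (the names and the flat array layout are ours, not the program's): several private partial copies of the forces array are combined elementwise into the single shared array.

    #include <stddef.h>

    /* Hypothetical sketch: num_parts private copies of the forces array
     * (one per group of tasks, or later one per unit of execution) are
     * combined elementwise into the shared forces array.  The partial
     * copies are stored back to back, num_atoms rows each. */
    void combine_force_partials(size_t num_parts, size_t num_atoms,
                                const double partial[][3],
                                double forces[][3])
    {
        for (size_t i = 0; i < num_atoms; i++)
            forces[i][0] = forces[i][1] = forces[i][2] = 0.0;

        for (size_t p = 0; p < num_parts; p++)
            for (size_t i = 0; i < num_atoms; i++)
                for (size_t d = 0; d < 3; d++)
                    forces[i][d] += partial[p * num_atoms + i][d];
    }

In a shared-memory setting the partial arrays might be per-thread copies; in the distributed-memory design developed below, each process holds one copy and the combine step becomes a global-sum communication operation.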
4.2.2 Using the AlgorithmStructure design space

Having worked through the patterns of the FindingConcurrency space to identify and understand the problem's exploitable concurrency, the programmer would then turn to the AlgorithmStructure design space to determine how to take advantage of this concurrency. In this design space, the programmer would turn first to the ChooseStructure pattern, which provides guidance in determining which AlgorithmStructure pattern or patterns best fit the concurrency identified in the previous step. For this problem, the ChooseStructure pattern would steer the programmer toward the EmbarrassinglyParallel pattern (for the findNeighbors step) and the SeparableDependencies pattern (for the longRangeForces step).

The programmer would next turn to the EmbarrassinglyParallel pattern to determine how to parallelize the findNeighbors step. He or she would determine that this computational step could be viewed as an instance of the EmbarrassinglyParallel pattern in its simplest form – completely independent tasks, all known before the computation begins. This simplest form of the pattern can be parallelized effectively using a simple static distribution of tasks among units of execution (UEs), which is readily accomplished by in effect partitioning atoms among processes and having each UE perform the calculations for the atoms it "owns".

Next the programmer would turn to the SeparableDependencies pattern to determine how to parallelize the longRangeForces step. Again, he or she would find that this computational step could be viewed as an instance of the pattern in one of its simpler forms – independent tasks computing local results that are then combined in a global "reduction" operation. This form of the pattern can be effectively parallelized by using a static distribution of tasks among UEs – as in the findNeighbors calculation – followed by a global reduction operation.

At this point the programmer has come to the following conclusions about the design:

– The major source of exploitable concurrency is in the findNeighbors and longRangeForces steps.
– The single neighbor list of the original sequential program can be replaced with per-atom or per-UE neighbor lists, since calculation of the force on each atom requires only a knowledge of which other atoms are its neighbors. (This is particularly attractive in a design intended for a distributed-memory platform, since it means there is no need for an expensive communication operation to combine sublists.)
– Both the findNeighbors and longRangeForces steps can be easily parallelized by partitioning the iterations of the original loops over all atoms among UEs (a sketch of such a partition appears below). In the longRangeForces step, this parallelized loop must be followed by a global-sum operation to combine the results calculated by each UE.

The programmer is now ready to begin working on code.
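A static distribution of per-atom tasks among UEs can be as simple as a block partition of the atom indices. The helper below is a hypothetical sketch in C, not code from the case study: given a UE's identifier and the total number of UEs, it returns the half-open range of atoms that UE "owns"; both parallelized loops then run only over that range.

    #include <stddef.h>

    /* Hypothetical sketch of a static block partition of num_atoms atoms
     * among num_ues units of execution.  Atoms [*begin, *end) belong to
     * UE ue_id; leftover atoms are spread over the first few UEs so the
     * blocks differ in size by at most one. */
    void owned_range(size_t ue_id, size_t num_ues, size_t num_atoms,
                     size_t *begin, size_t *end)
    {
        size_t base  = num_atoms / num_ues;
        size_t extra = num_atoms % num_ues;
        *begin = ue_id * base + (ue_id < extra ? ue_id : extra);
        *end   = *begin + base + (ue_id < extra ? 1 : 0);
    }

Each UE would then perform the per-atom tasks only for the atoms in its range and, in the longRangeForces step, contribute its partial results to the global sum described next.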
4.2.3 Using the SupportingStructures design space

At this point the programmer is ready to consider how to map the design onto code for the target platform. Guided by the AlgorithmStructure patterns (in particular, by examples in the EmbarrassinglyParallel and SeparableDependencies patterns), he or she would conclude that for a distributed-memory message-passing architecture a good choice for overall program structure would be the one represented by the SPMD (single program, multiple data) pattern. This pattern would in turn guide him or her to two additional important design decisions. First, in a distributed-memory environment it is necessary to decide whether variables should be somehow distributed among processes or duplicated in each process. Based on the analysis performed with the FindingConcurrency patterns, it is apparent that the patterns of data access in this application make it desirable to duplicate data in every process. Second, in a distributed-memory environment those parts of the computation that are not explicitly parallelized must either be performed redundantly by all processes or delegated to one selected process. For this application, again based on the data-dependencies analysis, it makes most sense to duplicate the non-parallelized parts of the computation, with the exception of the parts of the program that write out results to disk, which should be performed by one process only.

Based on the discussion in the SeparableDependencies pattern of implementing a global reduction operation, the programmer would also consult the Reduction pattern to see how this could be implemented. Most programming environments provide a library routine that implements this pattern; if the particular environment did not, this pattern would explain how it could be implemented in terms of other patterns or constructs. (An example of invoking such a library routine is sketched below.)

Figure 5 gives pseudocode for the resulting design. While it omits most of the details of the simulation, it includes everything relevant to the parallel structure of the program.
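For instance, in an MPI-based environment the global sum of the forces array can be expressed with the library's built-in reduction routine. The fragment below is a sketch under the assumption that forces is stored as a contiguous block of 3 × N doubles; the function name is ours, not the case study's.

    #include <mpi.h>

    /* Sketch of the global-sum step using MPI's reduction routine.  Every
     * process contributes its locally computed force contributions, and
     * every process receives the elementwise sum; MPI_IN_PLACE avoids a
     * separate receive buffer.  All processes must make this call. */
    void global_sum_forces(double *forces, int num_atoms)
    {
        MPI_Allreduce(MPI_IN_PLACE, forces, 3 * num_atoms,
                      MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    }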
4.2.4 Using the ImplementationMechanisms design space

Once the programmer has arrived at a design at the level of the pseudocode in Fig. 5, he or she must then implement it in a particular programming environment, addressing whatever additional issues are relevant in that environment. For this problem, the design has been specifically aimed at distributed-memory message-passing platforms, so the only issues here are mapping the SPMD and Reduction constructs to the particular programming environment of choice. These issues are encapsulated in the SPMD and Reduction patterns, so the programmer can consult the implementation sections of these patterns for guidance on how to implement them for the desired environment (and recall, the Reduction pattern will likely already be implemented and available as a library component). These patterns in turn guide the programmer into the ImplementationMechanisms design space, which provides lower-level and more environment-specific help, such that after review of the relevant patterns the programmer can finish the process of turning a problem description into finished code for the target environment.
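As an illustration of what this mapping might look like in MPI, the following C skeleton mirrors the structure of Fig. 5. It is a hedged sketch only: the application routines are empty stand-ins, the problem-size constants are borrowed from the experiment reported in Fig. 6 purely for concreteness, and details such as the neighbor-list storage are omitted.

    #include <stdio.h>
    #include <mpi.h>

    #define N_ATOMS 2048      /* illustrative problem size, as in Fig. 6 */
    #define N_MOVES 800

    /* Empty stand-ins for the application routines of Fig. 5. */
    static void update_positions(double c[][3], double f[][3]) { (void)c; (void)f; }
    static void find_neighbors_owned(const double c[][3], int id, int np) { (void)c; (void)id; (void)np; }
    static void long_range_forces(const double c[][3], double f[][3]) { (void)c; (void)f; }
    static void physical_properties(const double c[][3]) { (void)c; }
    static void print_results(void) {}

    int main(int argc, char **argv)
    {
        static double coords[N_ATOMS][3], forces[N_ATOMS][3];
        int myid, nprocs;
        const int print_id = 0;    /* one process is responsible for output */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        for (int move = 0; move < N_MOVES; move++) {
            update_positions(coords, forces);            /* duplicated in every process */
            find_neighbors_owned(coords, myid, nprocs);  /* only atoms this process owns */
            long_range_forces(coords, forces);           /* partial forces from owned atoms */

            /* combine the partial forces computed by each process */
            MPI_Allreduce(MPI_IN_PLACE, &forces[0][0], 3 * N_ATOMS,
                          MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

            physical_properties(coords);                 /* duplicated in every process */
            if (myid == print_id)
                print_results();                         /* output from one process only */
        }

        MPI_Finalize();
        return 0;
    }

The SPMD structure, a single program whose behavior varies only through myid, is what allows the non-parallelized steps to be duplicated cheaply while output remains restricted to a single process.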
4.3 Implementation results

We implemented a parallel program based on a sequential molecular-dynamics program and the design described in the preceding discussion. The parallel program produced results identical to those of the sequential program, and it performed acceptably on a network of PCs, as the graph in Fig. 6 indicates.

5 Related work

Parallel programming. It is impractical to attempt to give a reasonable treatment of the previous work in the field of parallel computing in this paper. We will limit ourselves to mentioning one work [8] that, while not in the pattern format, is similar to our approach in its emphasis on the design of parallel software and the importance of starting with a high-level design. The term "embarrassingly parallel" has been attributed to Geoffrey Fox; a representative work describing the concept is [9].

Design patterns and pattern languages. The first pattern language was elaborated by Alexander [1] and addressed problems from architecture, landscaping, and city planning. Design patterns and pattern languages were first introduced in software engineering by Kent Beck and Ward Cunningham at a workshop associated with OOPSLA in 1987. Interest in patterns for software exploded with the publication of [10]. Since then, considerable work has been done in identifying and exploiting patterns to facilitate software development, on levels ranging from overall program structure to detailed design. The common theme of all of this work is that of identifying a pattern that captures some aspect of effective program design and/or implementation and then reusing this design or implementation in many applications. Early work on patterns dealt mostly with object-oriented sequential programming, but more recent work [12, 21] addresses concurrent programming as well, though mostly at a fairly low level. Ortega-Arjona and Roberts [20] have given what they call architectural patterns for parallel programming that are similar to the patterns in our AlgorithmStructure design space. Their patterns do not, however, belong to a pattern language.
    c     N is a constant integer representing number of atoms
    c     M is a constant integer representing a maximum number
    c     of pairs of "neighbor" atoms

    c     3D coordinates of atoms
          real coords(N, 3)
    c     forces on atoms in each dimension
          real forces(N, 3)
    c     list of pairs of "neighbor" atoms
          integer nbrs(2, M)

          myid = getProcessID()
          nprocs = getNumberOfProcesses()

          do move = 1, numberOfMoves

    c        unchanged
             call updatePositions(N, coords, forces)

    c        modified to find neighbors only for atoms
    c        "owned" by this process
             call findNeighborsModified(N, coords, M, nbrs, myid)

    c        now computes only forces due to atoms "owned" by
    c        this process -- no changes required to this routine,
    c        consequence of modifications to findNeighbors
             call longRangeForces(N, coords, forces, M, nbrs)
             call globalSum(N, forces)

    c        unchanged
             call physicalProperties( .. )

    c        print from one process only
             if (myid .eq. printID) then
                call printResults( .. )
             endif

          end do

Fig. 5. Pseudocode for each process in SPMD parallel molecular-dynamics simulation
Program skeletons and frameworks. Algorithmic skeletons and frameworks capture very high-level patterns that provide the overall program organization, with the user providing lower-level code specific to an application. Skeletons, as in [6], are typically envisioned as higher-order functions, while frameworks often use object-oriented technology. Particularly interesting is the work of MacDonald et al. [13], which uses design patterns to generate frameworks for parallel programming from pattern template specifications.
Programming archetypes. Programming archetypes [5, 14] combine elements of all of the above categories: they capture common computational and structural elements at a high level, but they also provide a basis for implementations that include both high-level frameworks and low-level code libraries. A parallel programming archetype combines a computational pattern with a parallelization strategy; this combined pattern can serve as a basis both for designing and reasoning about programs (as a design pattern does) and for code skeletons and libraries (as a framework does). Archetypes do not, however, directly address the question of how to choose an appropriate archetype for a particular problem.
6 Summary

We have described a pattern language for parallel programming. Currently, the top two design spaces (FindingConcurrency and AlgorithmStructure) are relatively mature, with several of the patterns having undergone scrutiny at two writers' workshops for design patterns [17, 18]. Although the lower two design spaces are still under construction, the pattern language is now usable and several case studies are in progress.
[Fig. 6 is a graph of execution time (seconds) versus number of processors, comparing the original sequential program, the parallel program, and an ideal parallel program.]

Fig. 6. Performance of pattern-based parallel code for molecular-dynamics simulation, implemented using MPI and executed on a network of Intel-based PCs running Linux and connected by a 100 Mb/s switch. Problem size is 2,048 atoms and 800 time steps. The graph compares execution times of the parallel program running on varying numbers of processors to the execution time of the sequential program running on one processor and to execution time of an "ideal" parallel program with execution time equal to the execution time of the sequential program divided by the number of processors
Preliminary results of the case studies and feedback from the writers' workshops leave us optimistic that our pattern language can indeed achieve our goal of lowering the barriers to parallel programming.

It is worth reiterating at this point that parts of our pattern language (e.g., the patterns of the FindingConcurrency design space) deal with design issues rather than with anything that maps in a straightforward way onto the code of the parallel program being designed. This can be seen in the example in Sect. 4; it would be even more obvious if we were starting with a problem description rather than with sequential code. Most of the literature concerned with the use of patterns in software engineering associates code, or constructs that will directly map onto code, with each pattern. It is important to appreciate, however, that patterns solve problems and that these problems are not always directly associated with code. The FindingConcurrency patterns are an example of this; as shown in Sect. 4, they do not lead to code but rather help the programmer reach an understanding of the best alternatives for structuring a solution to the problem. It is this guidance for the algorithm designer that is missing in most parallel programming environments.

Acknowledgements. We thank Doug Lea and Alan O'Callaghan for their careful critiques of those patterns submitted in earlier forms to the 1999 and 2000 Pattern Languages of Programs workshops. We also thank Vidyamani Parkhe for his work on the example described in Sect. 4.

References
1. Alexander, C., Ishikawa, S., Silverstein, M.: A Pattern Language: Towns, Buildings, Construction. Oxford University Press, New York, 1977
2. Bertsekas, D.P., Tsitsiklis, J.N.: Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Englewood Cliffs, N.J., USA, 1989
3. Bjornson, R., Carriero, N., Mattson, T.G., Kaminsky, D., Sherman, A.: Experience with Linda. Technical Report RR866, Yale University Computer Science Department, August 1991
4. Bjornson, R., Kolb, C., Sherman, A.: Ray tracing with network Linda. SIAM News 24(1), January 1991
5. Chandy, K.M.: Concurrent program archetypes. In: Proc. Scalable Parallel Library Conference, 1994
6. Cole, M.I.: Algorithmic Skeletons: Structured Management of Parallel Computation. MIT Press, Cambridge, Mass., USA, 1989
7. Dijkstra, E.W., Scholten, C.S.: Termination detection for diffusing computations. Inf. Process. Lett. 11(1), August 1980
8. Foster, I.: Designing and Building Parallel Programs. Addison-Wesley, Reading, Mass., USA, 1995
9. Fox, G.C., Williams, R.D., Messina, P.C.: Parallel Computing Works. Morgan Kaufmann, San Francisco, 1994
10. Gamma, E., Helm, R., Johnson, R., Vlissides, J.: Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, Reading, Mass., USA, 1995
11. Harrison, R.J.: Portable tools and applications for parallel computers. Int. J. Quantum Chem. 40: 847–863, 1991
12. Lea, D.: Concurrent Programming in Java: Design Principles and Patterns. Addison-Wesley, Reading, Mass., USA, 1997
13. MacDonald, S., Szafron, D., Schaeffer, J., Bromling, S.: From patterns to frameworks to parallel programs. Submitted to IEEE Concurrency, August 1999; see also http://www.cs.ualberta.ca/~stevem/papers/IEEECON99.ps.gz
14. Massingill, B.L., Chandy, K.M.: Parallel program archetypes. In: Proc. 13th Int. Parallel Processing Symposium (IPPS '99), 1999. Extended version available as Caltech CS-TR-96-28 (ftp://ftp.cs.caltech.edu/tr/cs-tr-96-28.ps.Z)
15. Massingill, B.L., Mattson, T.G., Sanders, B.A.: A pattern language for parallel application programming. Technical Report CISE TR 99-022, University of Florida, 1999. Available via ftp://ftp.cise.ufl.edu/cis/tech-reports/tr99/tr99-022.ps
16. Massingill, B.L., Mattson, T.G., Sanders, B.A.: A pattern language for parallel application programming (Web site). http://www.cise.ufl.edu/research/ParallelPatterns, 1999
17. Massingill, B.L., Mattson, T.G., Sanders, B.A.: Patterns for parallel application programs. In: Proc. 6th Pattern Languages of Programs Workshop (PLoP '99), 1999. See also our Web site at http://www.cise.ufl.edu/research/ParallelPatterns
18. Massingill, B.L., Mattson, T.G., Sanders, B.A.: Patterns for finding concurrency for parallel application programs. In: Proc. 7th Pattern Languages of Programs Workshop (PLoP '00), 2000. See also our Web site at http://www.cise.ufl.edu/research/ParallelPatterns
19. Mattson, T.G. (ed.): Parallel Computing in Computational Chemistry. ACS Symposium Series, vol. 592. American Chemical Society, Washington, D.C., USA, 1995
20. Ortega-Arjona, J., Roberts, G.: Architectural patterns for parallel programming. In: Proc. 3rd European Conference on Pattern Languages of Programming and Computing, 1998. See also http://www.cs.ucl.ac.uk/staff/J.Ortega-Arjna/research/europlop98.prn
21. Schmidt, D.C.: The ADAPTIVE Communication Environment: an object-oriented network programming toolkit for developing communication software. http://www.cs.wustl.edu/~schmidt/ACE-papers.html, 1993