A3: A Simple and Asymptotically Accurate Model for Parallel Computation

A. Grama and V. Kumar, Computer Science, Univ. Minnesota, Minneapolis, MN 55455
S. Ranka, Computer Science, Univ. Florida, Gainesville, FL 32611
V. Singh, IBM Watson Research Ctr., 30 Saw Mill River Road, Hawthorne, NY 10532
Abstract
Many parallel algorithm design models have been proposed for abstracting a large class of parallel architectures. However, all of these models potentially make inaccurate asymptotic performance predictions that may be too optimistic or too pessimistic depending on the circumstances. We propose a new, simpler parallel model called A3 (Approximate Model for Analysis of Aggregate Communication Operations) that provides asymptotically accurate time estimates for a wide class of parallel programs that are based on aggregate communication operations. Accuracy is attained (1) by making the model sensitive to the structure of aggregate data communication operations and (2) by classifying these aggregate communication operations into those that are cross-section bandwidth sensitive and those that are not. We note that algorithms expressed exclusively using those aggregate communication operations that are cross-section bandwidth insensitive have the same time complexity across a wide range of architectures. Other algorithms (using aggregate communication operations sensitive to cross-section bandwidth) may have different time complexity, but their implementations may still be portable, and possibly optimal, across a wide range of architectures as long as they use a library of aggregate communication operations customized to each architecture. We note that the simpler, asymptotically accurate algorithm analysis facilitated by A3 can make algorithm design much faster and simpler.

Acknowledgment: This work was supported by Army Research Office contract DA/DAAH04-95-1-0538, and by the Army High Performance Computing Research Center under the auspices of the Department of the Army, Army Research Laboratory cooperative agreement number DAAH04-95-2-0003/contract number DAAH04-95-C-0008, the content of which does not necessarily reflect the position or the policy of the government, and no official endorsement should be inferred. Related papers are available via WWW at URL: http://www.cs.umn.edu/users/kumar/papers.html. The work of Sanjay Ranka was supported in part by AFMC and ARPA under contract #F19628-94-C-0057.
1 Introduction
Parallel computing holds the promise of cost-effective and large-scale computing power for a variety of critical applications. However, this promise is tempered by exorbitant software development costs and development cycles. This is largely due to the wide variety and fast-changing nature of hardware architectures. In the absence of abstract parallel computation models, programs must be custom-designed for individual machines. The lack of portability is one of the major impediments to the widespread use of parallel computing. The framework for an architecture-independent computation model (either sequential or parallel) is illustrated in Figure 1. An algorithm designer designs algorithm A for solving problem P on an abstract machine model M. A translator T exists that can take this algorithm and a description of a physical architecture (or computer) C, and develop (in a reasonable amount of time) machine code for solving problem P on architecture C. This framework makes it possible to design efficient algorithms in an architecture-independent manner if the runtime of the algorithm A on model M is asymptotically the same as its runtime on all physical architectures such as C. For serial algorithm design, the von Neumann model serves as model M. Although the processor architecture and memory organization of sequential computers differ, they all conform to the same abstract model. Consequently, the asymptotic time complexity of an algorithm on different implementations of the von Neumann model is the same. Indeed, the existence of a universal model has played a key role in the impressive growth of the application of sequential
[Figure 1: Problem P and abstract computation model M are input to the algorithm designer, who produces algorithm A for solving P on M; translator T combines A with a description of architecture C to produce machine code for solving P on architecture C.]
Figure 1: Framework for an architecture-independent computation model

computers. If algorithms had to be redesigned for every new sequential architecture, the cost of software development would have been much higher, greatly limiting the cost-effective application of computers. For parallel program design, no satisfactory model has been found that can be used to design programs for a variety of architectures, despite extensive efforts. A parallel computation model must exhibit two characteristics. First, it must be sufficiently general to capture the salient properties of a large class of parallel computers. Second, programs designed for this abstract model must execute efficiently on actual parallel computers. These characteristics represent conflicting goals. For example, a universal parallel machine model must make assumptions about the degree of connectivity among the processors. Lower dimensional networks have a relatively low degree of connectivity compared to networks such as hypercubes and multi-stage networks. If the model assumes a low degree of connectivity, algorithms designed for it will be optimal for parallel computers with this characteristic. However, the same algorithms may be suboptimal for architectures with a higher degree of connectivity, because the communication resources of the computer may be under-utilized. On the other hand, if the model assumes a high degree of connectivity among processors, then algorithms designed for the model may be sub-optimal on an architecture such as a mesh, yet may be optimal on a hypercube. Thus, regardless of the connectivity of the parallel model, it may yield sub-optimal performance for some architectures. Trading accuracy and generality appropriately is critical to any such model. LogP was proposed recently as a more realistic parallel model compared to earlier proposals. Although it certainly provides some improvements, LogP can make asymptotically incorrect time estimates for par-
allel algorithms on mainstream parallel computers, too pessimistic or too optimistic depending on the circumstances. We propose a new, simpler model called A3 which provides asymptotically accurate time estimates for a wide class of parallel programs that use aggregate communication operations. Accuracy is attained (1) by making the model sensitive to the structure of aggregate data communication operations and (2) by classifying these aggregate communication operations into those that are cross-section bandwidth sensitive and those that are not. We note that algorithms expressed exclusively using those aggregate communication operations that are cross-section bandwidth insensitive have the same time complexity across a wide range of architectures. Other algorithms (using aggregate communication operations sensitive to cross-section bandwidth) may have different time complexity, but their implementations may still be portable, and possibly optimal, across a wide range of architectures as long as they use a library of aggregate communication operations customized to each architecture. We note that the simpler, asymptotically accurate algorithm analysis facilitated by A3 can make algorithm design much faster [footnote 1], since many algorithm alternatives can be quickly pruned down to a small set of potential contenders. We identify a small set of aggregate communication operations that cover the communication patterns in a vast majority of parallel applications. The A3 model quantifies the performance of parallel programs on the basis of these communication operations. The rest of the paper is organized as follows: Section 2 presents a brief overview of communication costs in parallel computers; Section 3 discusses existing models for parallel computation; Section 4 presents the A3 model; Section 5 analyzes sample algorithms using the new model; and conclusions are drawn in Section 6.
2 Communication Model of Conventional Parallel Computers
The cost of communicating a message between two processors in a parallel computer arises from various sources. These delays can be categorized into three classes:
Message-size dependent delay: This delay is a consequence of copying the message into buffers and of the finite capacity of the interconnection network. We capture these delays in a constant tw, henceforth referred to as the per-word transfer time.

Path-length dependent delay: This delay is a consequence of the overhead at intermediate network interfaces and the signal propagation time on each link. This delay is represented by a constant th, also referred to as the per-hop time.

Start-up latency: This delay is independent of message length and number of hops. It is incurred due to handshaking between processors, buffer management, and routing. This is represented by ts, also referred to as the message startup time.

A message of size m traversing d hops of a cut-through (CT) routed network incurs a communication delay given by

    Tcomm(CT) = ts + th d + tw m

The corresponding cost for a store-and-forward (SF) network is

    Tcomm(SF) = ts + (th + tw m) d

Footnote 1: We demonstrate that for a representative class of algorithms, A3 predicts communication overhead within a small constant factor.
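The two point-to-point cost formulas above can be sketched directly in code. This is a minimal illustration of the paper's model; the parameter values in the example are assumptions chosen for readability, not measurements of any machine.

```python
# Sketch of the two point-to-point communication cost formulas of Section 2.
# ts = startup time, th = per-hop time, tw = per-word time,
# m = message size in words, d = number of hops.

def t_comm_ct(ts, th, tw, m, d):
    """Cut-through routing: Tcomm = ts + th*d + tw*m."""
    return ts + th * d + tw * m

def t_comm_sf(ts, th, tw, m, d):
    """Store-and-forward routing: Tcomm = ts + (th + tw*m)*d."""
    return ts + (th + tw * m) * d

# Illustrative values: ts=100, th=1, tw=2 cycles, a 1000-word message over 4 hops.
print(t_comm_ct(100, 1, 2, 1000, 4))  # 100 + 1*4 + 2*1000 = 2104
print(t_comm_sf(100, 1, 2, 1000, 4))  # 100 + (1 + 2*1000)*4 = 8104
```

The example makes the paper's point concrete: under CT routing the path length d contributes only an additive th·d term, while under SF routing the entire message payload is paid once per hop.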
Most conventional computers have cut-through routed networks. In the rest of this paper, we will restrict ourselves to cut-through networks only. The startup time ts is often large, and can be several hundred machine cycles or more. The per-word transfer time tw is determined by the link bandwidth. Lower degree networks typically have higher link bandwidth (and consequently lower tw). This is because the individual channels can be made fatter for the same overall cost [12, 5]. The per-word transfer time tw is often higher than tc, the time to do a unit computation on data available in the cache. The per-hop component th d can often be subsumed into the startup time ts without significant loss of accuracy. This is because the diameter of the network, which is the upper bound on the distance d, is relatively small for most practical-sized machines, and th also tends to be small. The above expressions adequately model communication time only for lightly loaded networks. However, as the network becomes more congested, multiple messages attempting to traverse a particular link on the network are serialized or time-multiplexed on the link. This effectively increases tw for the contending messages. On shared address space systems, reading m consecutive words from a remote processor's memory also takes time ts + th d + tw m. Often, th d is constant for all memory modules, and can be subsumed into ts. The parameter m is usually the cache line size. The value of ts tends to be much smaller than its counterpart in message passing systems. As before, the communication time corresponds to lightly loaded networks. If there are more memory transfers in progress than the cross-section bandwidth of the network, the time for these operations will be higher. The communication overhead in both shared address space and message passing computers can often be reduced. In many parallel computers, it is possible for the communication to proceed after a certain startup cost has been paid at the source and destination processors. In this case, the transit time for messages, given by th d + tw m, can be effectively masked by overlapping it with computation at the source and destination processors.
3 Existing Models for Parallel Computation
The method of analyzing precise communication cost presented in Section 2 is used extensively by practitioners [12, 8, 7]. If the parallel algorithm is such that the network remains lightly loaded (for a wide range of architectures), then the communication costs given by this model are also largely architecture independent. This is because the th d term can often be ignored in comparison with the large startup overhead ts, or with tw m for sufficiently large message size m. But this precise analysis tends to be architecture specific when the network becomes congested. Given a parallel algorithm, the effective tw for any message transfer can depend on the topology of the network and on the other messages that are in transit concurrently. Nevertheless, this model of communication cost encourages the design of parallel algorithms that are good for a variety of practical architectures. The startup cost ts encourages the design of parallel algorithms that have bulk access locality, since it makes it more attractive to access data in bulk. This is also called spatial locality in the context of serial algorithms. The per-word access time tw encourages data volume locality, since it rewards parallel algorithms that have fewer non-local data accesses. This is also called temporal locality in the context of serial algorithms. Since the precise analysis of communication costs is architecture specific, it is difficult to design good parallel algorithms in an architecture-independent manner. Hence a large number of models for parallel computers have been investigated. The oldest and most well-known of these models is PRAM. It is relatively easy to design and analyze parallel algorithms for a PRAM model. But the actual performance of these parallel algorithms can be quite poor on realistic parallel computers. The reason is that
this model promotes neither data volume locality (it assumes tw to be 1 or 0) nor bulk-access locality (it assumes ts to be 0). Furthermore, the PRAM model assumes that any arbitrary set of p locations in the memory can be accessed simultaneously by p processors. In contrast, in all practical parallel processors the memory is normally organized into blocks. Access by a processor to a memory block precludes concurrent access to the same block by other processors. If, in a PRAM algorithm, all processors require access to different locations that happen to be within the same memory block on a practical parallel computer, then all these accesses will happen serially. Many models have been designed that remove various drawbacks of the PRAM model [1, 6, 11, 13, 9, 2, 17]. But most of these models fail to promote either the bulk access locality or the data volume locality. The LogP model [4] tries to address the shortcomings of earlier models such as PRAM [13, 9], LPRAM [1], BPRAM [2], HPRAM [11], YPRAM [6], and BSP [17]. A comprehensive discussion of these models and their relationship to each other is provided in [10]. In the spectrum of abstract to practical parallel computers, the LogP model comes closer to practical computers than earlier models, including its starting point, the BSP model. In this paper, we concentrate on comparing our model to the LogP model.
3.1 LogP Model
The underlying hardware assumed by the LogP model consists of a set of identical processing elements, each with local memory, connected by an interconnection network. The LogP model is invariant with respect to the topology of the interconnection network and the routing algorithm. The LogP model characterizes a parallel computer using four parameters: the communication delay (L), the communication overhead (o), the communication bandwidth (g), and the number of processors (p). The communication delay L is the delay (latency) experienced by a fixed-size message consisting of a small number of words from the source to the destination. Sending a message from one processor to another involves overheads at the source and destination processors. These overheads include copying buffers and preparing headers. When the processor is performing these functions, it cannot do anything else. This overhead is modeled by the communication overhead o. The rate at which individual packets can be injected into the network is limited by the bandwidth available to each processor. If a processor has to wait a minimum of g units of time between two consecutive message sends, then g is referred to as the gap.
Clearly, the gap is determined by the per-processor communication bandwidth of the network. Finally, the machine is assumed to have p processors. A single unit of time is defined as the time it takes to perform a single local computation. Larger messages are sent in LogP by breaking them into smaller messages and pipelining them. For each of the smaller messages, the processor incurs an overhead of o at the sending end. The message can then be injected into the network. Subsequent messages can be injected at gaps of g. Here, g is determined by the network bandwidth. If a sequence of short messages needs to be sent, the gap of the first message can be overlapped with the preparation overhead o of the second message. Depending on the values of g and o, two situations arise. In the first case, g > o: the sending processor can prepare the messages faster than the network can accept them, so the network bandwidth will not be wasted. In the second case, o > g: the processor cannot inject messages into the network fast enough to utilize the available bandwidth. Therefore, it is desirable that the size of the "small fixed message" m be large enough such that g > o. Note that m is the fifth "hidden" parameter of the LogP model (that is often not mentioned). The LogP model rewards bulk access of up to m words. Messages can be injected into the network at intervals of g. Of this time, o is spent by the processor preparing the next message. The remaining time g - o (assuming g > o) can potentially be used to perform useful computation, although it is more complex to utilize computation time with such small granularity.
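The pipelined transfer described above can be sketched as a small calculator. This is a hedged illustration using the standard LogP estimate for sending k packets, o + (k - 1) max(g, o) + L + o; the function name and the numeric values are assumptions for the example, not part of the LogP proposal itself.

```python
import math

# Sketch of LogP's pipelined transfer of a long message (Section 3.1).
# A message of M words is split into ceil(M/m) packets of at most m words;
# successive packets are injected every max(g, o) time units. The total
# delivery time uses the standard LogP estimate o + (k-1)*max(g, o) + L + o.

def logp_send_time(M, m, o, g, L):
    k = math.ceil(M / m)  # number of fixed-size packets
    return o + (k - 1) * max(g, o) + L + o

# Illustrative (assumed) values: 64-word message, 8-word packets, o=2, g=4, L=10.
print(logp_send_time(64, 8, 2, 4, 10))  # 2 + 7*4 + 10 + 2 = 42
```

Note how, once k is large, the max(g, o) term dominates: exactly the regime in which the text argues the gap (and hence the per-processor bandwidth) governs the cost of bulk transfers.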
3.2 Discussion of LogP Model
The major attractiveness of LogP compared to previous models such as BSP is that it rewards both bulk access locality (using o) and data volume locality (using L and g). In fact, the LogP model is quite similar to the general communication model discussed in Section 2, with the exception that the effective tw in the LogP model is scaled to compensate for the cross-section bandwidth c by

    tw(effective) = tw (p/c).
The major drawback of LogP is that it is still unable to exploit structure in the underlying communication patterns. The model assumes a random mapping of computation onto processors. It is therefore incapable of distinguishing benign data redistributions from those that congest the network. It assumes that all redistributions can be accomplished in time proportional to the message size per processor. Let us study the implications of this assumption closely. Assume that each processor has a data item of size m that it needs to communicate to a unique, randomly chosen destination (a random permutation). LogP assumes that this permutation can be performed in Θ(m) time. However, there are no known algorithms that can perform the permutation in this time, even on Θ(p) cross-section bandwidth networks such as hypercubes, for fixed values of m [footnote 2]. The transportation primitive of Shankar and Ranka [14, 15, 16] performs this permutation in Θ(m) time, but it requires m to grow as Θ(p) [footnote 3]. Now consider an adaptation of these models for a mesh. The effective bandwidth is scaled down by the cross-section bandwidth to account for the limited bandwidth of the mesh while performing generalized redistributions. However, aggregate communication operations such as NEWS and broadcasts consist entirely of redistributions that do not congest the network, and are therefore capable of extracting the same network performance from a mesh as from a hypercube. In these cases, the communication cost predicted by LogP is asymptotically larger than the actual cost. The communication time predicted by LogP is therefore neither a consistent upper bound nor a consistent lower bound. When applied to communication operations such as broadcast and NEWS, the time predicted by LogP is an upper bound, since it assumes worst-case bandwidth utilization (all messages cross the bisection). However, when applied to some bandwidth-sensitive operations (such as the transportation primitive), the time predicted by LogP is actually a lower bound. To remedy these problems, the LogP model itself suggests using different values of the gap g for different communication patterns. This reflects one of the underlying motivations for our parallel computation model.
4 A3: A Parallel Computation Model for Architecture Independent Programming and Analysis
In this section, we propose a parallel computation model based on an abstraction of the communication inherent in the algorithm. Many existing models incorporate bulk access locality and data volume locality. Their main drawback has been their inability to identify favorable data redistributions and their costs. In our model, we remedy this by classifying redistributions into categories and assigning appropriate costs to them on k:d cubes.

Footnote 2: If the message size m grows as Θ(log p), then it is possible to perform the permutation in Θ(m) time [17].
Footnote 3: The transportation primitive of Shankar and Ranka [14] requires m to grow as Θ(p^2). This can be reduced to Θ(p) by increasing the number of stages at the expense of higher constants.

If a parallel program is described in terms of low-level communication operations such as send and receive, then the performance of the program can change dramatically from one architecture to another. This is because the sequence of low-level primitives suited for one architecture may not be good for another. On the other hand, if a parallel program is described only in terms of a library of high-level aggregate communication operations (such as all-to-all broadcast), then the parallel program can be effectively portable, as these aggregate communication operations can be implemented efficiently for each architecture. Since the total number of commonly used aggregate communication operations is relatively small, it is possible to package them into libraries and use a table to look up their communication costs. This forms the basis of our model. Typical applications can be viewed as a set of interacting subtasks executing on different processors. Interaction is by means of exchange of messages. It is often possible to formulate these tasks in such a way that the computation proceeds in interleaved steps of computation and communication. Groups of processors go through phases of local computation followed by communication for non-local data. At any given snapshot, all processors within a group are in the same phase (i.e., computing or communicating). Algorithms fashioned in this manner are referred to as synchronous algorithms. (Note that this definition of synchronous algorithms is not the same as SIMD algorithms. Instead of synchronizing after each instruction, synchronous algorithms interact after phases of computation.) All processors (or subsets thereof) participate in the communication step. Since the communication involves more than a single pair of processors (aggregates or groups), such communication operations are also referred to as aggregate or collective communication operations.
All parallel programs written in data-parallel languages such as HPF and NESL [3] fall in this category. Programs based on MPI that use only collective communication operations also fall in this category. The A3 model presented in this paper is suitable for the class of synchronous algorithms. We have identified the following set of aggregate communication operations that appear most frequently: many-to-many broadcast (and its dual, reduce, and variations such as scan), many-to-many personalized, k-shifts, and 2-D grid based communication, NEWS (and its generalization to log p dimensions, the butterfly). For a more detailed description of these operations, the reader is referred to Kumar et al. [12] and Shankar and Ranka [15, 16]. Personalized communication appears in such applications as matrix transpose, hash joins, and Fast Fourier Transforms. Broadcasts and scans are ubiquitous. NEWS communication is required in finite difference and finite element methods, and a variety of image processing applications. The performance properties of aggregate communication operations can be ascertained from the underlying data movement. We first briefly discuss a widely used [12] accurate cost model, and then present a new, simpler cost model called A3 that can be applied in the presence of adequate slack. We will demonstrate in Section 5 the use of these models for performance prediction. We show that the A3 cost model is adequate in the presence of slack and yields very good estimates of the performance.
4.1 Accurate Cost Model
The accurate cost model is based on a tabulation of the aggregate communication operations and their exact cost on each architecture. Algorithm analysis is based on a table lookup for the appropriate operation-architecture combination. The runtime of the entire parallel algorithm can be computed in this manner. Table 1 tabulates the various aggregate communication operations along with their cost on k:d cubes.
4.2 A3: A Simpler Model
A closer examination of the above table yields interesting insights. The space of aggregate communication operations can be partitioned into two parts: operations sensitive to the cross-section bandwidth, and those insensitive to the cross-section bandwidth. Sensitivity to cross-section bandwidth is determined by the volume of data transferred and the redistribution pattern. For example, operations such as random permutations and some structured redistributions such as k-shifts are cross-section bandwidth sensitive. Let the maximum amount of data entering or leaving any processor during an aggregate communication operation be m. If m grows at a particular rate with respect to p, it is possible to derive concise estimates of the asymptotic time requirements of various communication operations based solely on m and the cross-section bandwidth. This rate of growth of m is referred to as the slack in communication. The amount of slack required is different for different algorithms and architectures. A slack of Θ(p) is adequate for all cases illustrated in Table 1. In many of these cases, a lower slack is adequate. For some other aggregate communication operations, the slack required may be higher. For example, the transportation primitive of Shankar and Ranka [14] requires a slack of O(p^2) per processor. This can be reduced to O(p) per processor by increasing the number of stages, resulting in higher constants. Most other aggregate communication operations require less slack than the transportation primitive. For bandwidth-sensitive operations, the asymptotic performance is largely a function of the cross-section bandwidth. This implies that, irrespective of interconnection topologies, various communication operations will have similar asymptotic performance if the cross-section bandwidth is equal. This implies an asymptotic equivalence of architectures such as the crossbar, hypercube, fat-tree and multi-stage networks in the presence of adequate slack for a wide class of aggregate communication operations. In the presence of adequate slack, the communication times can be approximated as follows:
Aggregate Communication Operations Sensitive to Cross-Section Bandwidth

Aggregate communication operations such as many-to-many personalized and k-shifts are cross-section bandwidth sensitive operations. Let m be the maximum amount of data leaving or entering any processor and c be the cross-section bandwidth of the network. For such operations, each link in the network has an effective bandwidth of c/(tw p). The time taken by any aggregate communication operation in this class is given by

    Tcomm = tw m (p/c)    (1)

Specifically, for a k:d cube, the cross-section bandwidth is given by c = p^((d-1)/d). Substituting the value of c in Equation 1, we get

    Tcomm = tw m p^(1/d)    (2)

Equation 2 gives the time for an aggregate communication operation in this class on a k:d cube.
Aggregate Communication Operations Insensitive to Cross-Section Bandwidth

Aggregate communication operations such as many-to-many broadcast and NEWS belong to this class. NEWS is defined as 2-D grid structured communication. The communication operation is bandwidth insensitive on the family of k:d meshes for d >= 2. The cost of these operations is largely independent of the number of processors. If m is the maximum amount of data entering or leaving any processor, the time for these operations is given by

    Tcomm = tw m    (3)

A parallel algorithm can be analyzed for a k:d cube by using these two communication costs applied to the appropriate aggregate communication operations.
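The two A3 cost rules can be captured in a few lines. This is a minimal sketch of Equations 1-3 for a k:d cube with p processors; the function name and example values are assumptions for illustration.

```python
# Sketch of the A3 cost rules for a k:d cube with p processors (Section 4.2):
# bandwidth-sensitive operations cost tw*m*p^(1/d) (Equation 2, using
# cross-section bandwidth c = p^((d-1)/d)); insensitive ones cost tw*m
# (Equation 3). m = maximum data entering or leaving any processor.

def a3_cost(tw, m, p, d, bandwidth_sensitive):
    if bandwidth_sensitive:               # e.g. many-to-many personalized, k-shifts
        return tw * m * p ** (1.0 / d)    # Equation 2
    return tw * m                         # Equation 3: e.g. broadcast, NEWS

# Example: p = 256 processors, m = 1000 words, tw = 1, 2-D mesh (d = 2).
print(a3_cost(1, 1000, 256, 2, True))    # sensitive: 1000 * 256^(1/2) = 16000.0
print(a3_cost(1, 1000, 256, 2, False))   # insensitive: 1000
```

The d = 2 case shows the model's central distinction: on a mesh, a bandwidth-sensitive operation pays a factor of sqrt(p) over an insensitive one, while on a hypercube (d = log p) the penalty essentially disappears.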
Bandwidth Insensitive Operations

  One-to-all broadcast:
    Linear array:  (ts + tw m) log p + th (p - 1)
    2-D mesh:      (ts + tw m) log p + 2 th (√p - 1)
    Hypercube:     (ts + tw m) log p

  One-to-all broadcast with Θ(p) slack:
    Linear array:  2(ts p + tw m)
    2-D mesh:      2(2 ts √p + tw m)
    Hypercube:     2(ts log p + tw m)

  All-to-all broadcast:
    Linear array:  (ts + tw m)(p - 1)
    2-D mesh:      2 ts (√p - 1) + tw m (p - 1)
    Hypercube:     ts log p + tw m (p - 1)

  One-to-all personalized:
    Linear array:  (ts + tw m)(p - 1)
    2-D mesh:      2 ts (√p - 1) + tw m (p - 1)
    Hypercube:     ts log p + tw m (p - 1)

Bandwidth Sensitive Operations

  All-to-all personalized:
    Linear array:  (ts + tw m p/2)(p - 1)
    2-D mesh:      (2 ts + tw m p)(√p - 1)
    Hypercube:     (ts + tw m)(p - 1) + (th/2) p log p

  Circular q-shift:
    Linear array:  (ts + tw m) floor(p/2)
    2-D mesh:      (ts + tw m)(2 floor(√p/2) + 1)
    Hypercube:     ts + tw m + th log p

Table 1: Various communication operations and their cost on k:d cubes (the 2-D mesh is wraparound and square).

In both these cases, the A3 model ignores the impact of the startup latency (ts) and the per-hop time (th). This restricts the applicability of the model to cases where there is adequate slack. In other cases, it yields lower bounds on the communication time for the operation.
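The table-lookup approach of the accurate cost model (Section 4.1) can be sketched as a small dictionary of cost functions. Only two entries from Table 1 are shown; the table keys, function names, and example values are assumptions made for this sketch.

```python
import math

# Sketch of the accurate cost model of Section 4.1: a per-architecture table
# mapping each aggregate communication operation to its closed-form cost.
# Each entry is a function of (ts, tw, m, p); two rows of Table 1 are shown.

COST_TABLE = {
    ("hypercube", "all-to-all-broadcast"):
        lambda ts, tw, m, p: ts * math.log2(p) + tw * m * (p - 1),
    ("2d-mesh", "all-to-all-broadcast"):
        lambda ts, tw, m, p: 2 * ts * (math.sqrt(p) - 1) + tw * m * (p - 1),
}

def accurate_cost(arch, op, ts, tw, m, p):
    """Analyze one communication step by table lookup."""
    return COST_TABLE[(arch, op)](ts, tw, m, p)

# Example (assumed values): p = 16, ts = 10, tw = 1, m = 4 words per processor.
print(accurate_cost("hypercube", "all-to-all-broadcast", 10, 1, 4, 16))  # 100.0
print(accurate_cost("2d-mesh", "all-to-all-broadcast", 10, 1, 4, 16))    # 120.0
```

The lookup mirrors how the paper envisions the model being used: the per-architecture table is built once, and every algorithm analysis reduces to summing table entries.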
5 Analysis of Algorithms
In this section, we present results of a sample analysis of the parallel dense matrix-matrix product. (The reader is referred to [10] for a more extensive analytical study of parallel algorithms for sample sort and Gaussian elimination.) We compare the A3 and LogP models to the accurate model. We show that the A3 model yields accurate asymptotic estimates of communication overhead. The LogP model fails to do so for the mesh. This is true of the other algorithms discussed in [10] as well. Furthermore, although it may seem that the LogP model gives accurate runtimes for the hypercube, it gives no hints about the placement of tasks on processors. Randomly placing subtasks on processors may cause contention over the network, degradation of performance, and worse runtimes. In our model, it is assumed that the library call to aggregate communication operations ensures proper placement of subtasks on processors. Note that this library needs to be constructed just once for each architecture. From the analysis, it also becomes clear that our model makes it simpler to analyze algorithms. We assume knowledge of the algorithms used for the basic communication operations. Whereas this knowledge is not necessary for analyzing algorithms in our framework, it is necessary for analyzing them under the LogP model. The reader is referred to [12] for detailed descriptions of these analyses. For the LogP model, we assume that the overhead o is small and that the gap g > o. This allows us to overlap computation and communication where feasible. Furthermore, for asymptotic analysis, as message sizes increase, it is possible to drop the term L of the LogP model.
5.1 Dense Matrix-Matrix Product
Consider two n × n matrices A and B partitioned into p blocks A(i,j) and B(i,j) (0 <= i, j < √p) of size (n/√p) × (n/√p) each. These blocks are mapped onto a √p × √p logical mesh of processors. The processors are labeled from P(0,0) to P(√p-1,√p-1). Processor P(i,j) initially stores A(i,j) and B(i,j) and computes block C(i,j) of the result matrix. Computing submatrix C(i,j) requires all submatrices A(i,k) and B(k,j) for 0 <= k < √p. To acquire all the required blocks, an all-to-all broadcast of matrix A's blocks is performed in each row of processors, and an all-to-all broadcast of matrix B's blocks is performed in each column. After P(i,j) acquires A(i,0), A(i,1), ..., A(i,√p-1) and B(0,j), B(1,j), ..., B(√p-1,j), it performs the submatrix multiplications and additions to compute C(i,j).
Accurate Analysis. Assume that the logical mesh of processors is embedded into a $p$-processor hypercube. The algorithm requires two all-to-all broadcast steps (each consisting of $\sqrt{p}$ concurrent broadcasts in all rows and columns of the processor mesh) among groups of $\sqrt{p}$ processors. The messages consist of submatrices of $n^2/p$ elements. The total communication time is $2(t_s \log p + t_w (n^2/p)(\sqrt{p} - 1))$ for the hypercube. After the communication step, each processor computes a submatrix $C_{i,j}$, which requires $\sqrt{p}$ multiplications of $(n/\sqrt{p}) \times (n/\sqrt{p})$ submatrices. This takes a total of $\sqrt{p}\,(n/\sqrt{p})^3\,t_c = (n^3/p)\,t_c$ time. Thus, the parallel run time for multiplying two $n \times n$ matrices using this algorithm on a $p$-processor hypercube is approximately

$$T_P = \frac{n^3}{p} t_c + 2 t_s \log p + 2 t_w \frac{n^2}{\sqrt{p}} \qquad (4)$$
If we use a mesh instead of a hypercube, only the term associated with $t_s$ in the parallel run time (Equation 4) of this matrix multiplication algorithm is affected. On a wraparound mesh with store-and-forward routing, each all-to-all broadcast among the $\sqrt{p}$ processors of a row or column of the mesh takes approximately $t_s \sqrt{p} + t_w n^2/\sqrt{p}$ time. Thus, the total parallel run time is

$$T_P = \frac{n^3}{p} t_c + 2 t_s \sqrt{p} + 2 t_w \frac{n^2}{\sqrt{p}} \qquad (5)$$
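For concreteness, the two run-time expressions can be written as small helper functions (a sketch with our own naming; the machine parameters $t_c$, $t_s$, $t_w$ are inputs in the paper's units of time per operation, per message startup, and per word):

```python
import math

def tp_hypercube(n, p, tc, ts, tw):
    # Equation (4): n^3/p * tc + 2 ts log p + 2 tw n^2 / sqrt(p)
    return n**3 / p * tc + 2 * ts * math.log2(p) + 2 * tw * n**2 / math.sqrt(p)

def tp_mesh(n, p, tc, ts, tw):
    # Equation (5): only the startup term changes, ts log p -> ts sqrt(p)
    return n**3 / p * tc + 2 * ts * math.sqrt(p) + 2 * tw * n**2 / math.sqrt(p)
```

Note that the computation term and the bandwidth term are identical for the two networks; only the startup (latency) term differs.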
Analysis Using A3 Model. As noted earlier, all-to-all broadcast is not a bandwidth-sensitive operation. Therefore, the time taken is $t_w m$, where $m$ is the maximum amount of data entering or leaving any processor. For matrix multiplication, each all-to-all broadcast along the columns or rows requires $(n^2/p)\sqrt{p} = n^2/\sqrt{p}$ data to enter each processor. Therefore,

$$T_P = \frac{n^3}{p} t_c + 2 t_w \frac{n^2}{\sqrt{p}}$$
Since the all-to-all broadcast is not a bandwidth-sensitive operation, the approximate analysis for the 2-D mesh is the same as that for the hypercube. Therefore, the predicted communication overhead is still $2 t_w n^2/\sqrt{p}$.
Analysis Using LogP Model. Assume that $g > o$, making it possible to pipeline messages. The analysis for the LogP model is similar to that for the accurate model. The communication overhead is given by $2 g n^2/\sqrt{p}$. For a hypercube, $g$ is determined by the link bandwidth $t_w$; therefore, the communication overhead predicted by LogP is $2 t_w n^2/\sqrt{p}$. For a mesh, however, the gap $g$ must be scaled up to compensate for the lower bisection width of the mesh. Therefore $g = t_w \sqrt{p}$, and the communication overhead predicted by the LogP model is $2 t_w n^2$.

Comments. Figure 2 graphically presents the communication overhead predicted by all three models. The communication characteristics (values of $t_s$ and $t_w$) are assumed to be those of the Intel Delta, for which $t_s = 77$ microseconds and $t_w = 0.54$ microseconds/word. For the hypercube, both the LogP and A3 models predict the communication overhead of the parallel algorithm accurately in asymptotic terms; the A3 analysis is, however, much simpler, since the LogP model requires knowledge of the all-to-all broadcast algorithm to derive the communication time. For the mesh, the A3 model predicts the communication overhead accurately in asymptotic terms, whereas the LogP model overestimates it by a factor of $\sqrt{p}$. Thus, for this algorithm, the A3 model is both more accurate and easier to use than the LogP model.
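As a sanity check on the $\sqrt{p}$ overestimate, the three mesh-overhead predictions can be tabulated directly from the formulas above (our own helper `overheads`, using the Intel Delta parameters quoted in the text):

```python
import math

def overheads(n, p, ts, tw):
    """Predicted communication overhead (microseconds) on the mesh,
    under the three analyses of Section 5.1."""
    accurate = 2 * ts * math.sqrt(p) + 2 * tw * n**2 / math.sqrt(p)  # from Eq. (5)
    a3 = 2 * tw * n**2 / math.sqrt(p)    # A3: bandwidth term only
    logp = 2 * tw * n**2                 # LogP with g = tw * sqrt(p)
    return accurate, a3, logp

ts, tw = 77.0, 0.54   # Intel Delta parameters
for p in (4, 16, 64, 256):
    acc, a3, logp = overheads(256, p, ts, tw)
    print(f"p={p:4d}  accurate={acc:9.1f}  A3={a3:9.1f}  "
          f"LogP={logp:9.1f}  LogP/A3={logp/a3:.1f}")
```

The ratio in the last column is exactly $\sqrt{p}$: the LogP bandwidth term stays at $2 t_w n^2$ while the true overhead shrinks as $1/\sqrt{p}$.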
6 Concluding Remarks

The A3 model is valid for synchronous algorithms based on aggregate communication operations. This methodology of parallel program design offers several advantages:

- Algorithms are portable across a variety of platforms with a minimal loss of efficiency.

- The parallel model hides the complexity of writing efficient communication routines for different machines from the user.
[Figure 2 appears here: two plots of communication overhead (microseconds) versus number of processors $p$ (4 to 256), comparing the accurate, A3, and LogP predictions. Top panel: matrix product on the hypercube ($n = 256$); bottom panel: matrix product on the mesh ($n = 256$).]

Figure 2: Communication overhead predicted by accurate analysis, A3, and LogP models for multiplying two $1024 \times 1024$ matrices.
- Programming parallel machines using this paradigm is much simpler.

- The model brings the algorithm design process closer to the functionality of MPI and HPF.

- The model provides a means of determining good data partitioning and placement strategies. Data placement and the communication operations it induces together determine the figure of merit for an algorithm; the key to designing efficient algorithms is to identify favorable combinations of the two.

Although a large majority of parallel algorithms are synchronous, there is a class of algorithms that does not fit into this framework nicely; examples include discrete event simulation and branch-and-bound tree search [12]. There is also a possible loss in efficiency resulting from organizing parallel algorithms as synchronous algorithms based on aggregate communication operations. Furthermore, the A3 model is not applicable when the slack is small. In such cases, the setup overhead $t_s$ can dominate the bandwidth term $m t_w$ and may become a critical factor in the overall analysis. Even for large slacks, there are aggregate communication operations for which fine-tuned mappings can be achieved for particular interconnection networks and routing strategies; for these, the time predicted by our model would be asymptotically inferior to the best attainable time and would thus only be an upper bound. Based on our experience, however, such cases are infrequent [12, 8]. One example is the butterfly shift operation, in which each processor $P$ sends data to processor $P + 2^i$ or $P - 2^i$. For different values of $i$ and different mappings of virtual processors to physical processors, this operation can be bandwidth sensitive or insensitive. In such cases, a precise analysis can be done using the exact cost of each communication operation for the specific architecture, but the analysis of the run time becomes quite architecture specific. Note that such parallel programs can still be made portable; i.e., the same program can run on many different architectures without rewriting any code. The performance of the parallel program, however, may change in a less predictable manner.
References
[1] A. Agarwal, A. K. Chandra, and M. Snir. Communication complexity of PRAMs. Technical Report RC 14998 (No. 64644), IBM T.J. Watson Research Center, Yorktown Heights, NY, 1989.

[2] A. Agarwal, A. K. Chandra, and M. Snir. On communication latency in PRAM computations. Technical Report RC 14973 (No. 66882), IBM T.J. Watson Research Center, Yorktown Heights, NY, 1989.

[3] Guy E. Blelloch, Siddhartha Chatterjee, Jonathan C. Hardwick, Jay Sipelstein, and Marco Zagha. Implementation of a portable nested data-parallel language. Journal of Parallel and Distributed Computing, 21(1):4-14, April 1994.

[4] D. Culler, R. Karp, D. Patterson, et al. LogP: Towards a realistic model of parallel computation. In Principles and Practices of Parallel Programming, May 1993.

[5] W. J. Dally. Analysis of k-ary n-cube interconnection networks. IEEE Transactions on Computers, 39(6), June 1990.

[6] Pilar de la Torre and Clyde P. Kruskal. Towards a single model of efficient computation in real parallel machines. Future Generation Computer Systems, 8:395-408, 1992.

[7] Ian Foster. Designing and Building Parallel Programs. Addison-Wesley, Reading, MA, 1995.

[8] G. C. Fox, M. Johnson, G. Lyzenga, S. W. Otto, J. Salmon, and D. Walker. Solving Problems on Concurrent Processors: Vol. 1. Prentice-Hall, Englewood Cliffs, NJ, 1988.

[9] P. B. Gibbons. A more practical PRAM model. In Proceedings of the 1989 ACM Symposium on Parallel Algorithms and Architectures, pages 158-168, 1989.

[10] Ananth Grama, Vipin Kumar, Sanjay Ranka, and Vineet Singh. On architecture independent models for parallel program design. Technical report, Department of Computer Science, University of Minnesota, 1996.

[11] T. Heywood and S. Ranka. A practical hierarchical model of parallel computation. I. The model. Journal of Parallel and Distributed Computing, 16(3):212-232, November 1992.

[12] Vipin Kumar, Ananth Grama, Anshul Gupta, and George Karypis. Introduction to Parallel Computing: Algorithm Design and Analysis. Benjamin Cummings/Addison-Wesley (ISBN 0-8053-3170-0), Redwood City, CA, 1994.

[13] A. G. Ranade. How to emulate shared memory. In Proceedings of the 28th IEEE Annual Symposium on Foundations of Computer Science, pages 185-194, 1987.

[14] S. Ranka, R. V. Shankar, and K. A. Alsabti. Many-to-many personalized communication with bounded traffic. In Proceedings of Frontiers '95: The Fifth Symposium on the Frontiers of Massively Parallel Computation, February 1995.

[15] R. Shankar and S. Ranka. Random data access on a coarse grained parallel machine I. One-to-one mappings. Technical report, School of Computer and Information Science, Syracuse University, Syracuse, NY, October 1994. A short version appeared in the Proceedings of the First International Workshop on Parallel Processing, Bangalore, India, December 1994.

[16] R. Shankar and S. Ranka. Random data access on a coarse grained parallel machine II. One-to-many and many-to-one mappings. Technical report, School of Computer and Information Science, Syracuse University, Syracuse, NY, October 1994. A short version appeared in the Proceedings of the Symposium on Parallel and Distributed Processing, 1994.

[17] L. G. Valiant. General purpose parallel architectures. In Handbook of Theoretical Computer Science, 1990.