On the Impact of Communication Complexity on the Design of Parallel Numerical Algorithms

DENNIS B. GANNON AND JOHN VAN ROSENDALE

Abstract - This paper describes two models of the cost of data movement in parallel numerical algorithms. One model is a generalization of an approach due to Hockney, and is suitable for shared memory multiprocessors where each processor has vector capabilities. The other model is applicable to highly parallel nonshared memory MIMD systems. In this second model, algorithm performance is characterized in terms of the communication network design. Techniques used in VLSI complexity theory are also brought in, and algorithm-independent upper bounds on system performance are derived for several problems that are important to scientific computation.

Index Terms - Interconnection networks, numerical software, parallel algorithms, parallel architectures, VLSI complexity.

Manuscript received April 6, 1984; revised August 17, 1984. This work was supported by the National Aeronautics and Space Administration under NASA Contracts NAS1-17070 and NAS1-17130 while the authors were in residence at the Institute for Computer Applications in Science and Engineering, NASA Langley Research Center, Hampton, VA 23665. Primary support for the first author was provided by IBM and by ONR Contract N84K0012.

D. Gannon is with the Department of Computer Sciences, Purdue University, West Lafayette, IN 47907.

J. Van Rosendale is with ICASE, NASA Langley Research Center, Hampton, VA 23665.

I. INTRODUCTION

THE traditional model of parallel algorithm analysis was motivated by a desire to explore the potential of parallelism. Thus, the question was asked: given an unlimited number of processing elements and an infinite capacity to move and permute data, what is the fastest method to solve the problem under consideration? This has proven to be a fruitful area of research, and much has been learned. However, with the appearance of the Illiac IV, the Cray-1, and the CDC Cyber 205, it was quickly realized that the design of data structures and the cost of processor-to-processor and processor-to-memory communication are critical ingredients in the design and analysis of practical algorithms. The goal of this paper is to highlight the role played by communication cost in the analysis of numerical algorithms.

Because the spectrum of parallel architectures suitable for scientific computation is so broad, it is difficult to derive one analytical model of computation that characterizes the performance of every machine. In this paper, we restrict our attention to three families of machine architectures and describe an analytical model of performance that is reasonably suited to each. In particular, our goal for each model is to characterize the effect of communication costs on system performance. In each model, we give an estimate of the effective efficiency or speedup of a computation as a function of the latency and bandwidth of the processor communication medium.

The first model is that of a "medium-scale" shared memory multiprocessor, having perhaps 2-32 processors, with each processor capable of exploiting substantial local vector parallelism. Section II of this paper gives a formal description of the shared memory model and illustrates a method of analyzing algorithms for machines of this type. Several standard, but important, numerical problems are studied, and a number of alternate implementations are analyzed. In particular, it is shown that for machines which have two levels of parallelism, the performance of algorithms strongly depends on the way in which the problem is partitioned to fit on the architecture. The performance of the algorithms is given as a function of global and local memory latencies, the speed of arithmetic operations, the number of processors, and the size of the problem.

The second model is that of a highly parallel MIMD system where processors communicate through a large network and there is no shared memory. We assume here a number of processors ranging from perhaps 32 to a few thousand, but with processors of lesser power than in the shared memory model. Analysis and design of algorithms for such systems turns out to be significantly different from that for the shared memory machines. In Section III, it is shown that the techniques used in VLSI complexity analysis can be used to derive reasonable upper bounds on speedup and efficiency. The appropriate parameters for this analysis turn out to be the ratio of message transmission times to arithmetic speed, and the relation of the problem being solved to the topology of the communication network. By looking at specific algorithms, it is shown that many of the derived upper bounds are exact.

As a variant of this second architecture model, in Section IV we consider machines interconnected by packet-switched communication networks. Analysis of algorithms for such machines is similar to that for other nonshared memory machines, except that communication delays play a central role. The paper concludes with a discussion of the shortcomings of the approaches described here and suggests several directions where more work needs to be done.
II. SHARED MEMORY MACHINES
One of the clearest trends in commercial systems is the trend toward multiprocessor shared memory architectures (see Fig. 1), where each processor has either a pipelined multitasking or vector capability. This family of multiprocessors includes the Cray X-MP, the Cray-2, the HEP-1 and HEP-2, and the ETA GF10. The proposed Cedar multiprocessor [12] may be viewed as a machine in this class where each "processor" is, in fact, a cluster of smaller processors.

Fig. 1. Shared memory multiprocessor.

In this section, we consider the design and analysis of algorithms for such machines. We begin with a list of properties shared by many machines in this class. (Of course, no list will characterize every machine, and the set of specifications below should be considered only as an approximation to a family of architectures.)

1) There are p processors, with p roughly in the range $2 \le p \le 32$.

2) All processors have equal access to shared memory, and vectors may be of arbitrary length and stride.

3) Each processor also has a sizable local memory from which it can fetch vectors of arbitrary length and stride.

4) Each processor can perform vector dyadic operations (or vector triadic operations where one operand is a scalar) using operands from either the local memory or the global shared memory. The execution time for a vector operation of length n is

$\tau_2 (n + n^G_{1/2})$  if either operand is in global memory, or

$\tau_2 (n + n^L_{1/2})$  if both operands are in local memory.

5) The parallel execution of p tasks on p processors is denoted by

pardo(i = 1, p) task(i); endpar;

Here task(i) is a procedure, block, or statement that is executed on the ith processor. No assumptions will be made about the processor synchronization or task-scheduling mechanisms other than that the execution order will be consistent with the serial data dependencies.
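As a concrete illustration of the two-level timing model, the following minimal sketch evaluates the vector-operation time numerically. The function name and parameter defaults are ours; the values $\tau_2 = 1$, $n^G_{1/2} = 1000$, and $n^L_{1/2} = 10$ are taken from the worked examples later in this section.

```python
# Hedged sketch of the two-level vector timing model: a length-n vector
# operation costs tau2 * (n + n_half), where the half-performance length
# n_half depends on whether an operand lives in global or local memory.

def vector_op_time(n, tau2=1.0, n_half_global=1000, n_half_local=10,
                   global_operand=True):
    """Model execution time of one dyadic vector operation of length n."""
    n_half = n_half_global if global_operand else n_half_local
    return tau2 * (n + n_half)

# A short operation on global operands is dominated by latency, while a
# long operation approaches the asymptotic rate 1/tau2.
for n in (64, 64_000):
    t = vector_op_time(n)
    print(n, t, n / t)   # achieved rate n/t approaches 1/tau2 as n grows
```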
If we again consider effective efficiency, one finds that for $n \gg p$ the approximate performance is

$E_{subst,GM} \approx \dfrac{8/17}{1 + \dfrac{16 p^2 n^G_{1/2}}{17 n T}} \qquad (2.3)$

$E_{subst,LM} \approx \dfrac{8/22}{1 + \dfrac{17 n^L_{1/2}}{22 T} + \dfrac{(10p + 16p^2) n^G_{1/2}}{8 n T}} \qquad (2.4)$

Fig. 3 plots the effective efficiencies (2.1)-(2.4) for the four methods described above as a function of T for n = 64 000, p = 32, $n^L_{1/2} = 10$, and $n^G_{1/2} = 1000$. Observe that as T becomes large, the simple shared memory method has the best asymptotic performance. On the other hand, for small T, the local memory substructured algorithm is clearly superior. Consequently, we find that the choice of optimal algorithm depends heavily on the relation of the problem parameters to the machine parameters.

Consider next a two-dimensional model problem on an n x n grid, which leads to the n tridiagonal systems

$A_j Z^j = Y^j$ for $1 \le j \le n$,

where the superscript j refers to the jth column of the grid. The set of solution columns $Z^j$ is then viewed as a set of vectors in the solution of another n tridiagonal systems

$B_j X_j = Z_j$ for $1 \le j \le n$.

This method is known as the Alternating Direction Implicit (ADI) method and is used in many applications [27]. We examine two solution schemes.
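Before turning to the parallel schemes, a serial sketch of the two ADI sweeps may help fix ideas. This is a minimal sketch with assumed constant-coefficient model systems and helper names of our own; the paper's operators $A_j$ and $B_j$ are problem dependent.

```python
# One ADI sweep pair on an n x n grid: a tridiagonal solve along every
# column (A_j Z^j = Y^j), then along every row (B_j X_j = Z_j). The
# Thomas algorithm is the standard O(n) tridiagonal solver.

def thomas(a, b, c, d):
    """Solve a tridiagonal system; a, b, c are sub/main/super diagonals."""
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def adi_sweep(Y):
    """Column solves producing Z, then row solves producing X."""
    n = len(Y)
    a = [-1.0] * n; b = [4.0] * n; c = [-1.0] * n   # model coefficients
    # Column solves: Z[j] holds the solution vector for column j.
    Z = [thomas(a, b, c, [Y[i][j] for i in range(n)]) for j in range(n)]
    # Row solves: row i of the intermediate grid is Z[.][i].
    X = [thomas(a, b, c, [Z[j][i] for j in range(n)]) for i in range(n)]
    return X  # X[i][j] approximates the row-equation solutions

Y = [[1.0] * 8 for _ in range(8)]
print(adi_sweep(Y)[0][:4])
```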
Assume that the components of the arrays are stored by rows. The most natural partitioning of the algorithm is to substructure the system (the A matrix, B matrix, and Y array) into blocks of size n/p x n. Let $r_i = in/p$ and $s_i = (i + 1)n/p$. The block $A^{[r_i;s_i]}$ is the set of coefficients corresponding to n/p column equations, and the block $B^{[r_i;s_i]}$ corresponds to components $r_i$ through $s_i$ of all n row equations. By "downloading" the data $A^{[r_i;s_i]}$, $B^{[1;n]}$, and $Y^{[r_i;s_i]}$ into the local memory of processor i, one may use the "simple" algorithm to compute

$Z^{[r;s]} = (A^{[r;s]})^{-1} Y^{[r;s]}$

in each processor using only local memory. Then, without "uploading" the Z array back to shared memory, it is possible to use the substructured method to solve the row equations

$X^{[r;s]} = (B^{[1;n]})^{-1} Z .$

The only use of the shared memory is to solve the n reduced systems of size 2p required to complete the substructured elimination. The total time to complete this is as follows.

Step 1): Download data and solve column equations:

$\tau_2 (8n - 7)\left(\frac{n}{p} + n^L_{1/2}\right) + 2\tau_2 \left(\frac{n}{p}\right)\left(n + n^G_{1/2}\right).$

Step 2): Do the substructured elimination and upload final results:

$17\tau_2 \left(\frac{n}{p} - 2\right)\left(n + n^L_{1/2}\right) + \tau_2 (16p - 7)\left(\frac{n}{p} + n^G_{1/2}\right) + \tau_2 \left(\frac{n^2}{p} + n\, n^G_{1/2}\right).$

The alternative is to use the simple algorithm for both column and row equations. Unfortunately, this requires that the partial solution vector Z be moved back to shared memory and read back to local memory in transposed order. Based on our memory addressing assumptions, this step requires a minimum of $\tau_2 (n/p)(n + n^G_{1/2})$ seconds.

Step 2'): Transpose Z, use the simple method, and upload the final results:

$\tau_2 (8n - 7)\left(\frac{n}{p} + n^L_{1/2}\right) + 2\tau_2 \left(\frac{n^2}{p} + n\, n^G_{1/2}\right) + \tau_2 \left(\frac{n}{p}\right)\left(n + n^G_{1/2}\right).$

To determine the effective efficiency, observe that to compute $B^{-1}A^{-1}Y$ requires at least $n(16n - 14)$ operations. If the machine could be programmed to run at 100 percent efficiency, the execution time would be

$\tau_2 \left(\frac{n}{p}\right)(16n - 14).$

The asymptotic efficiency for the substructured algorithm is 50 percent (computed in the limit as n goes to infinity) and 62 percent for the transposed simple method. In the case that n is small (near or below $n^G_{1/2}$), the substructured algorithm is superior. To illustrate this, consider the special case of p = 32 processors, $n^G_{1/2} = 1000$, and $n^L_{1/2} = 10$. Fig. 4 depicts the efficiency as a function of n. The point at which both methods are equal is when n is approximately 12 000. In general, one can show that the substructured algorithm is superior to the simple scheme when n is small relative to $n^G_{1/2}$.

Fig. 4. Algorithmic efficiency as a function of n for $n^G_{1/2} = 1000$, $n^L_{1/2} = 10$, and p = 32. (Curves: Simple ADI and Substructured ADI; X-axis = log2(n); Y-axis = efficiency.)
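This comparison can be scripted directly from the step times above. The sketch below assumes those expressions exactly as written, with function names and parameter defaults of our own; treat the printed efficiencies as illustrating the method of comparison rather than as the paper's exact curves.

```python
# Effective-efficiency sketch for the two ADI schemes, using the step
# times given above: efficiency = (ideal time) / (modeled time).

def t_ideal(n, p, tau2=1.0):
    return tau2 * (n / p) * (16 * n - 14)

def t_substructured(n, p, tau2=1.0, ng=1000, nl=10):
    step1 = tau2 * (8 * n - 7) * (n / p + nl) + 2 * tau2 * (n / p) * (n + ng)
    step2 = (17 * tau2 * (n / p - 2) * (n + nl)
             + tau2 * (16 * p - 7) * (n / p + ng)
             + tau2 * (n * n / p + n * ng))
    return step1 + step2

def t_transposed_simple(n, p, tau2=1.0, ng=1000, nl=10):
    step1 = tau2 * (8 * n - 7) * (n / p + nl) + 2 * tau2 * (n / p) * (n + ng)
    step2p = (tau2 * (8 * n - 7) * (n / p + nl)
              + 2 * tau2 * (n * n / p + n * ng)
              + tau2 * (n / p) * (n + ng))
    return step1 + step2p

p = 32
for n in (1_000, 10_000, 100_000):
    e_sub = t_ideal(n, p) / t_substructured(n, p)
    e_tr = t_ideal(n, p) / t_transposed_simple(n, p)
    print(n, round(e_sub, 3), round(e_tr, 3))
```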
III. ALGORITHM ANALYSIS FOR MACHINES BASED ON LARGE NETWORKS

To choose the optimal partitioning of the problem, we minimize $T_{FFTs} + T_{Tri}$ as a function of $k \in [p/n^{1/2}, \min(p, n^{1/2})]$. The function takes the form

$T(k) = R + 30\beta\,\frac{n^{1/2}}{k} + \left(\frac{n\beta}{p} - \frac{17\alpha}{2} - \frac{15\beta}{2}\right)\log(k) \qquad (3.2)$

where R is independent of k. Minimizing (3.2) as a function of k is, in general, not easy. There are two interesting cases to consider.

Fig. Recursive reduction of eight equations to two, and the Shuf implementation.

Case 1: $(n/p) \le (17/2)(\alpha/\beta) + (15/2)$. In this case, the last term in (3.2) is negative; thus, we choose k as large as possible. If $p \le n^{1/2}$, then we set k = p, which implies that it is best to distribute the FFT's across all p processors and to solve the columns of tridiagonal systems without communication ($n^{1/2}/p$ columns per processor). If instead $n^{1/2} < p$, setting $k = n^{1/2}$ again distributes the FFT's, and the execution time and the optimal value for k follow by substituting $k = n^{1/2}$ into (3.2).

Case 2: $(n/p) > (17/2)(\alpha/\beta) + (15/2)$. In the case that $n^{1/2} \gg 15p/2$, we have k = 1 and the execution time is

$T_{FPS} = \alpha\left(\frac{2n}{p}\log(n) + 2 + 34\log(p)\right) + 30\beta\left(n^{1/2} + \log(p)\right).$

Setting k = 1 implies that each row of the grid is stored completely in one processor and the FFT's involve no communication. Consequently, only the solution of the tridiagonal systems involves communication, and the time bound above is valid for a tree network. The resulting speedup is of optimal complexity. On the other hand, the theoretical lower bound for communication cost for an elliptic problem is $\beta C (n^{1/2}/p^{1/2})$ for a Mesh network and $\beta C (n^{1/2}/p)$ for the Shuf network. The time estimate above suggests that this form of the fast Poisson solver is suboptimal for these networks. Based on the upper bounds for speedup in Table I, Fig. 9 illustrates the relative performance of optimal fast Poisson solvers as a function of problem size (n) when p = 512 and r = 1 for the three network topologies Ring, Mesh, and Shuf.
Fig. 9. Efficiency upper bounds based on Table I for fast Poisson solvers. p = 512, r = 1.0. (Curves: Elliptic PDE Shuffle, Elliptic PDE Mesh, Elliptic PDE Ring/Tree; X-axis = log2(n); Y-axis = efficiency.)
Multigrid iterative methods [3], [11] are structured so that each stage of the iteration reduces an $n^{1/2} \times n^{1/2}$ problem to a problem of smaller size, which is solved by a direct method. Let the initial problem be distributed such that square subgrids of size $(n/p)^{1/2} \times (n/p)^{1/2}$ are mapped to each processor in a mesh. The communication cost to reduce the original problem to one of size $p^{1/2} \times p^{1/2}$ is $\beta C_1 (n/p)^{1/2}$ for some constant $C_1$. Using the FPS method (Case 1) to solve the reduced problem on the mesh requires an additional $\beta C_2 p^{1/2}$ communication steps. Applying K such iterations will reduce the error to a fixed level and cost

$\beta K \left( C_1 (n/p)^{1/2} + C_2 p^{1/2} \right),$

which is of the optimal complexity. Other methods, such as the preconditioned conjugate gradient, can also be shown to have this communication bound. On the other hand, the authors know of no method that achieves the lower bound of $\beta C (n^{1/2}/p)$ for the networks with high bandwidth.
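To see how these communication terms scale, the following sketch evaluates the tree-network communication term of $T_{FPS}$ against the Mesh and Shuf lower bounds quoted above. C is an unspecified constant in the source, set to 1 here, and the function names are ours.

```python
# Compare the tree-network communication term 30*beta*(sqrt(n) + log2(p))
# of the k = 1 fast Poisson solver with the stated lower bounds
# beta*C*sqrt(n)/sqrt(p) (Mesh) and beta*C*sqrt(n)/p (Shuf).

import math

def comm_tree_fps(n, p, beta=1.0):
    return 30 * beta * (math.sqrt(n) + math.log2(p))

def comm_lower_mesh(n, p, beta=1.0, C=1.0):
    return beta * C * math.sqrt(n) / math.sqrt(p)

def comm_lower_shuf(n, p, beta=1.0, C=1.0):
    return beta * C * math.sqrt(n) / p

p = 512
for n in (2 ** 10, 2 ** 16, 2 ** 20):
    print(n, comm_tree_fps(n, p),
          comm_lower_mesh(n, p), comm_lower_shuf(n, p))
```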
IV. MULTISTAGE PACKET-SWITCHED NETWORKS

The analysis of nonshared memory multiprocessors to this point assumed a machine M with a fixed or switchable connection topology G(M). The only communication cost in this model was the time β taken by the processors to perform sends and receives. For most nonshared memory multiprocessors, this model is probably quite reasonable. However, it corresponds poorly to machines with multistage packet-switched networks.

An example of such a network is the Omega network of Lawrie [22]. With this type of network, messages can be broken into packets of uniform size, where each packet consists of a destination tag and a data field. The switches in the network read the destination tag and forward the packet along its route. An Omega network interconnection of p processors contains log(p) stages, each having p/2 switches. In the best case, a packet can be routed through the network in log(p) steps. However, when the network has heavy traffic, contention occurs and contending packets must be queued at the switches. Simulation results suggest that packets are delayed by an average amount $C_0 \log(p)$, and that the number of packets transmitted per clock cycle is about $C_1 p$, for any number of processors p. Thus, the Omega network behaves much like a crossbar switch, except that message propagation delays are relatively long [20], [15]. Performance of other log(p)-stage packet networks, such as the Banyan network, the baseline network, and so on, is comparable to that of the Omega network.

In principle, packet-switched networks can be accommodated in our multiprocessor model by treating switches as specialized processors. Although feasible, this approach is difficult since the communication patterns in packet networks are complex. A more illuminating approach is to view the packet-switched network as a close emulation of a crossbar switch and to modify our multiprocessor model accordingly. The model given in the previous section needs to be modified only in the two provisions governing communication.

3': Each arithmetic operation takes α seconds. Transmission or receipt of a word of data takes β seconds. Receipt of a message can be done γ seconds after its transmission or any time thereafter.

4': The connection topology is a complete graph.

The delay parameter γ here is designed to model the time taken to route and forward messages in the network, and the time packets spend queued at switches when there is contention. With this model, sending a one-word message between two processors takes total time $2\beta + \gamma$. In sending a k-word message, k sends and receives are required. But the propagation delays can be overlapped, so after receiving the first word, a new word can be received every β seconds, giving a total message delay of

$(k + 1)\beta + \gamma .$

With this model of a multiprocessor interconnected by a packet-switching network, it is possible to look at any of the algorithms already considered. We look here at fast Fourier transforms, since there are interesting aspects of this algorithm not yet treated. The communications required in an FFT can all be viewed as permuting data between processors. Suppose that we have n words of data, with n/p words per processor. Then we can ask: how much time is required to simultaneously move the data on every processor to some other processor? Let the time taken to perform this operation be denoted by $X_p(n)$. To compute the value of $X_p(n)$, note that moving n/p data words from one processor to another should take time

$\left(\frac{n}{p} + 1\right)\beta + \gamma ,$

as discussed above. But this will be so only if the target processor is ready to receive the data as soon as they arrive there. In permuting n words of data, each processor must send n/p and receive n/p words, so the execution time for
this permutation cannot be less than $(2n/p)\beta$. In fact, the time required to perform this permutation is

$X_p(n) = \max\left[\left(\frac{2n}{p}\right)\beta,\ \left(\frac{n}{p} + 1\right)\beta + \gamma\right] \qquad (4.1)$
as one can easily verify. Now consider the problem of performing an FFT on a vector of length n with this multiprocessor model. The execution time will be

$T_{FFT\_n} = \alpha\left(\frac{2n}{p}\right)\log(n) + X_p(n)\log(p) = \alpha\left(\frac{2n}{p}\right)\log(n) + \max\left[\left(\frac{2n}{p}\right)\beta,\ \left(\frac{n}{p} + 1\right)\beta + \gamma\right]\log(p).$
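A small sketch of (4.1) and the resulting FFT time under this model follows, assuming the forms above; the function names are ours and the defaults follow the parameter setting used for Fig. 10.

```python
# Permutation-time model (4.1) and the single-FFT time built on it.
# Note that gamma multiplies only log2(p), not the vector length n,
# so a large packet delay is partially masked on big problems.

import math

def xp(n, p, beta=1.0, gamma=50.0):
    """Time to permute n words spread n/p per processor, per (4.1)."""
    return max((2 * n / p) * beta, (n / p + 1) * beta + gamma)

def t_fft(n, p, alpha=1.0, beta=1.0, gamma=50.0):
    """Single length-n FFT on p processors under the packet model."""
    arithmetic = alpha * (2 * n / p) * math.log2(n)
    return arithmetic + xp(n, p, beta, gamma) * math.log2(p)

p, n = 512, 1024
print(t_fft(n, p))              # with contention delay gamma = 50
print(t_fft(n, p, gamma=0.0))   # gamma removed, for comparison
```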
Apart from the new term involving γ, $T_{FFT\_n}$ is exactly the execution time derived previously. Notice here that the vector length n does not multiply γ, so the impact of a large γ, caused perhaps by packet contention, is not as severe as one might expect.

Next consider the problem of performing multiple fast Fourier transforms. In treating fast Poisson solvers, it was implicitly assumed that to take fast Fourier transforms of m vectors, one would just repeat the parallel algorithm for a single FFT m times. This is not necessarily the best approach, especially on packet-switched networks where one needs to contend with propagation delays. At least four reasonable approaches to performing m fast Fourier transforms can be found.

1) Repeat the parallel algorithm FFT_n for a single data vector m times.

2) Combine m invocations of FFT_n to overlap communication. Each step of the FFT would be performed on all m data vectors at once before proceeding to the next step.

3) If $m \le p$ and m divides p, the data can be permuted so that data vector i resides on processors $(i - 1)(p/m) + 1$ through $i(p/m)$. Then the algorithm FFT_n for a single data vector can be performed on each block of p/m processors. Finally, the results must be permuted back to their correct locations.

4) If $p \le m$ and p divides m, the data can be permuted so each data vector resides on only one processor. Each processor then performs sequential FFT's on the m/p data vectors it has, and finally, the results are permuted back to their correct locations.

Now looking in detail at each of these, for the first approach, the execution time is just m times the execution time of FFT_n. That is,
$T1\_FFT\_n = \alpha\left(\frac{2mn}{p}\right)\log(n) + m X_p(n)\log(p).$

With the second approach, one will have only log(p) communication steps rather than m log(p) as in the first approach. However, each step is now a permutation on mn words of data rather than on n words as in the first approach. The execution time is thus

$T2\_FFT\_n = \alpha\left(\frac{2mn}{p}\right)\log(n) + X_p(mn)\log(p).$

At first sight there appears to be little difference between the two approaches. However, the function $X_p(n)$ satisfies the inequality

$X_p(mn) \le m X_p(n)$

for all m, so the second approach is always at least as good as the first approach.

The third approach here is somewhat more complex. Two operations are involved: permuting the data, so each of the m vectors is distributed over a block of p/m processors, and then performing FFT's on these processor blocks. The FFT algorithm needed here is just the FFT for a single data vector, FFT_n, already studied, except that only p/m processors are used now. The execution time to perform these FFT's on processor blocks is

$\alpha\left(\frac{2mn}{p}\right)\log(n) + X_p(mn)\left(\log(p) - \log(m)\right),$

where we have used the identity $X_{p/m}(n) = X_p(mn)$, which is easily derived from (4.1). The other operation needed is permuting the m data vectors. Each data vector is originally distributed evenly over the p processors and must be moved so it is distributed over a block of p/m processors. The cost of this is

$T_{data} = \max\left[\frac{2mn}{p}\left(\frac{p-1}{p}\right)\beta,\ \left(\frac{mn}{p}\left(\frac{p-1}{p}\right) + 1\right)\beta + \gamma\right].$

The factors (p - 1)/p arise since a fraction of each vector is already in the proper processor memory and does not need to be transmitted. Assuming that p is large, (p - 1)/p is close to unity, and we can set

$T_{data} \approx X_p(mn).$

The data need to be permuted before and after performing the FFT's, so the total execution time becomes

$T3\_FFT\_n = \alpha\left(\frac{2mn}{p}\right)\log(n) + X_p(mn)\left(\log(p) - \log(m) + 2\right).$

Analysis of the fourth algorithm is similar. No communication is involved in the FFT's in this case, but data permutations are required before and after the FFT's. The execution time is thus

$T4\_FFT\_n = \alpha\left(\frac{2mn}{p}\right)\log(n) + 2 X_p(mn).$
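The four execution times can be compared directly, as in the sketch below. It assumes the formulas above; efficiency is computed as speedup divided by p, and the parameters follow the Fig. 10 setting. Helper names are ours.

```python
# Compare the four m-FFT strategies under the packet-network model.
# T3 applies only when m <= p, and T4 only when m >= p; both are
# evaluated here purely for illustration.

import math

def xp(n, p, beta, gamma):
    # Permutation time per (4.1).
    return max((2 * n / p) * beta, (n / p + 1) * beta + gamma)

def times(m, n, p, alpha, beta, gamma):
    arith = alpha * (2 * m * n / p) * math.log2(n)
    t1 = arith + m * xp(n, p, beta, gamma) * math.log2(p)
    t2 = arith + xp(m * n, p, beta, gamma) * math.log2(p)
    t3 = arith + xp(m * n, p, beta, gamma) * (math.log2(p) - math.log2(m) + 2)
    t4 = arith + 2 * xp(m * n, p, beta, gamma)
    return t1, t2, t3, t4

p, n, alpha, beta, gamma = 512, 1024, 1.0, 1.0, 50.0
serial = alpha * 2 * n * math.log2(n)   # serial time for one FFT
for m in (1, 4, 16, 64, 256):
    effs = [m * serial / (p * t) for t in times(m, n, p, alpha, beta, gamma)]
    print(m, [round(e, 2) for e in effs])
```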
One way to compare these four algorithms for computing FFT's is to compute their speedups. The results are
$S_1 = \dfrac{p}{1 + \max\left[\dfrac{\beta}{\alpha},\ \dfrac{\beta}{2\alpha} + \dfrac{p(\beta+\gamma)}{2n\alpha}\right]\dfrac{\log(p)}{\log(n)}}$

$S_2 = \dfrac{p}{1 + \max\left[\dfrac{\beta}{\alpha},\ \dfrac{\beta}{2\alpha} + \dfrac{p(\beta+\gamma)}{2mn\alpha}\right]\dfrac{\log(p)}{\log(n)}}$

$S_3 = \dfrac{p}{1 + \max\left[\dfrac{\beta}{\alpha},\ \dfrac{\beta}{2\alpha} + \dfrac{p(\beta+\gamma)}{2mn\alpha}\right]\dfrac{\log(p) - \log(m) + 2}{\log(n)}}$

$S_4 = \dfrac{p}{1 + \max\left[\dfrac{\beta}{\alpha},\ \dfrac{\beta}{2\alpha} + \dfrac{p(\beta+\gamma)}{2mn\alpha}\right]\dfrac{2}{\log(n)}}$

Fig. 10. Efficiency as a function of m for four methods to implement algorithms for doing m FFT's of size n. p = 512, n = 1024, α = β = 1.0, and γ = 50.0. (Curves: Methods 1, 2, and 3; X-axis = log2(m); Y-axis = efficiency.)
Comparing these equations, it is clear that the first algorithm is never better than the second, as already mentioned, since the impact of γ is smaller in the second. Note that this conclusion applies only for the packet-switched networks under consideration. For networks with fixed or circuit-switched topology, these two algorithms perform identically. Fig. 10 illustrates the efficiency in the case that α = β = 1.0, γ = 50.0, p = 512, and n = 1024. In this case, we have plotted the performance as a function of the number of equations m. Observe that the third algorithm becomes better than the second when m ≥ 4. Had we included the cost of the "bit reversal" permutation in all algorithms, the third algorithm would have become better even earlier. Between the third and fourth algorithms, there is nothing to decide, since the third applies only to the case m ≤ p and the fourth to the case m ≥ p. Fig. 10 depicts these two methods as one, with a transition at m = p.

Although searching for optimal algorithms is interesting, the real issue here is the impact of the delay γ caused by the use of a packet-switching network. The effect of γ depends on the ratio of problem size to the number of processors; on problems with a great deal of computation, γ is well masked. In fact, in the three FFT algorithms for multiple data vectors found to be best, γ always enters the execution time in the ratio

$\frac{p\gamma}{2mn\alpha} .$

Although this analysis was performed only for FFT algorithms, experience suggests that the delays caused by packet-switched networks are relatively unimportant on most compute-bound problems.
V. CONCLUSION

This paper has considered three basic families of multiprocessors and the analysis of communication complexity in the algorithms for these architecture classes. The principal goal here was to look at communication and its impact on algorithm performance. For large shared memory multiprocessors, analyzing communication turns out to be relatively straightforward. The main issues are memory latency and finding ways to organize or substructure problems to minimize its effect.

Studying algorithms on nonshared memory machines is more difficult, since the topology of the communication network is a central issue. Our analysis of nonshared memory network-based machines was divided into two parts, the first covering machines with a fixed or circuit-switched topology, and the second covering machines based on packet-switched networks. On circuit-switched machines, techniques borrowed from VLSI complexity theory provide a nice tool for obtaining lower bounds on algorithm complexity. Given an interconnection topology, one can, with relative ease, compute upper bounds on the efficiency of the problem solution. An important point here is that these are upper bounds on the problem (e.g., FFT, fast Poisson solve, direct solution of tridiagonal systems), not on any particular implementation of an algorithm for solving the problem. In the cases studied, these upper bounds are apparently quite tight; in two of the three cases studied, they are actually attained.

By contrast, analysis of algorithms on machines interconnected by a large packet-switching network is far easier, given our simple model of the behavior of packet-switching networks. Here the propagation delay parameter γ, modeling the impact of packet contention, is quite important. But on most large problems, it seems to be possible to substructure the problem so that the effect of γ is minor. With our model of a packet-switched network, in which such a network is treated as a crossbar switch with delay, analysis of algorithms is no more difficult than for shared memory multiprocessors. (In fact, the delay γ is closely related to the value $n_{1/2}$.) At the moment, this model rests only on heuristic considerations and simulation results, so it would be valuable to establish the precise circumstances under which it holds.

Many important problems remain to be solved. In particular, improved techniques are needed for lower bounds on communication in multiprocessors. In the case of specific algorithms, we do not know of better lower bounds (or better
algorithms) in the case of elliptic PDE's on high bandwidth networks such as the Shuf connection. Of particular importance are issues that were not considered at all in this paper. These include a systematic approach to algorithms with dynamic data structures, such as adaptive grid algorithms for PDE's. Do these problems have a reasonably nice solution on nonshared memory systems? If so, what is the structure of the communication? A closely related problem is the analysis of the complexity of communication in data flow machines. How does it differ from the models we have surveyed in this paper?

REFERENCES
[1] A. Aggarwal, "Tradeoffs for transitive functions in VLSI with delay," in Proc. 23rd IEEE Symp. Foundations of Comput. Sci., Tucson, AZ, 1983.
[2] A. V. Aho, J. E. Hopcroft, and J. D. Ullman, The Design and Analysis of Computer Algorithms. Reading, MA: Addison-Wesley, 1974.
[3] A. Brandt, "Multigrid solvers on parallel computers," ICASE, NASA Langley Res. Cen., Hampton, VA, Rep. 80-23, 1980.
[4] R. Brent and L. Goldschlager, "Some area-time tradeoffs for VLSI," SIAM J. Comput., vol. 11, pp. 737-747, Nov. 1982.
[5] J. Browne and R. Kapur, "Block tridiagonal systems on reconfigurable array computers," in Proc. 1981 Int. Conf. Parallel Processing, M. Liu and J. Rothstein, Eds., 1981, pp. 92-97.
[6] B. Buzbee, "A fast Poisson solver amenable to parallel computation," IEEE Trans. Comput., vol. C-22, pp. 793-796, 1973.
[7] B. Buzbee, G. Golub, and C. Nielson, "On direct methods for solving Poisson's equation," SIAM J. Numer. Anal., vol. 7, pp. 627-656, 1970.
[8] B. Chazelle and L. Monier, "A model of computation for VLSI with related complexity results," in Proc. 13th Annu. Symp. Theory of Comput., 1981, pp. 318-325.
[9] D. DeGroot, "Expanding and contracting SW-Banyan networks," in Proc. 1983 Int. Conf. Parallel Processing, pp. 19-24.
[10] J. Fishburn and R. Finkel, "Quotient networks," IEEE Trans. Comput., vol. C-31, Apr. 1982.
[11] D. Gannon and J. Van Rosendale, "Highly parallel multigrid solvers for elliptic P.D.E.'s: An experimental analysis," ICASE, NASA Langley Res. Cen., Hampton, VA, Tech. Rep., 1982.
[12] D. Gajski, D. Kuck, D. Lawrie, and A. Sameh, "CEDAR, a large scale multiprocessor," in Proc. 1983 Int. Conf. Parallel Processing, 1983, pp. 524-529.
[13] M. Gentleman and G. Sande, "Fast Fourier transforms - For fun and profit," in 1966 Fall Joint Comput. Conf., AFIPS Proc., vol. 29, 1966, pp. 563-578.
[14] A. Gottlieb, R. Grishman, C. P. Kruskal, K. P. McAuliffe, L. Rudolph, and M. Snir, "The NYU Ultracomputer - Designing an MIMD shared memory parallel computer," IEEE Trans. Comput., vol. C-32, pp. 175-189, Feb. 1983.
[15] A. Gottlieb and J. T. Schwartz, "Networks and algorithms for very-large-scale parallel computation," IEEE Computer, Jan. 1982.
[16] C. Grosch, "Performance analysis of Poisson solvers on array computers," in Supercomputers, vol. 2, R. W. Hockney and C. R. Jesshope, Eds. Infotech, 1979, pp. 149-181.
[17] D. Heller, "A survey of parallel algorithms in numerical linear algebra," SIAM Rev., vol. 20, pp. 740-777, 1978.
[18] R. W. Hockney and C. R. Jesshope, Parallel Computers. Bristol, England: Adam Hilger Ltd., 1981.
[19] H. Jordan, "A special purpose architecture for finite element analysis," in Proc. 1978 Int. Conf. Parallel Processing, IEEE, 1978, pp. 263-266.
[20] C. P. Kruskal and M. Snir, "Analysis of Omega-type networks for parallel processing," Ultracomput. Note, Courant Inst., New York Univ., New York, NY, 1982.
[21] J. Lambiotte and R. Voigt, "The solution of tridiagonal linear systems on the CDC Star-100 computer," ACM TOMS, vol. 1, pp. 308-329, 1975.
[22] D. H. Lawrie, "Access and alignment of data in an array processor," IEEE Trans. Comput., vol. C-24, pp. 1145-1155, Dec. 1975.
[23] F. T. Leighton, "New lower bound techniques for VLSI," in Proc. 22nd IEEE Symp. Foundations of Comput. Sci., Oct. 1981, pp. 1-12.
[24] J. Lipovski and A. Tripathi, "A reconfigurable varistructured array processor," in Proc. Int. Conf. Parallel Processing, 1977, pp. 165-174.
[25] A. Noor, K. Hussein, and R. Fulton, "Substructuring techniques - Status and projections," Comput. and Structures, vol. 8, pp. 621-632, 1978.
[26] D. S. Parker, Jr., "Notes on shuffle/exchange-type switching networks," IEEE Trans. Comput., vol. C-29, pp. 213-222, 1980.
[27] D. W. Peaceman and H. H. Rachford, Jr., "The numerical solution of parabolic and elliptic differential equations," J. Soc. Indust. Appl. Math., vol. 3, pp. 28-41, Mar. 1955.
[28] F. Preparata and J. Vuillemin, "The cube-connected cycles: A versatile network for parallel computation," Commun. ACM, vol. 24, pp. 300-319, 1981.
[29] A. Sameh, S. Chen, and D. Kuck, "Parallel Poisson and biharmonic solvers," Computing, vol. 17, pp. 219-230, 1976.
[30] A. Sameh and D. Kuck, "On stable parallel linear system solvers," J. ACM, vol. 25, pp. 81-91, Jan. 1978.
[31] J. E. Savage, "Planar circuit complexity and the performance of VLSI algorithms," in VLSI Systems and Computations, H. Kung, B. Sproull, and G. Steele, Eds. CS Press, 1981, pp. 61-68.
[32] H. J. Siegel and R. J. McMillen, "Using the augmented data manipulator network in PASM," Computer, vol. 14, pp. 25-33, Feb. 1981.
[33] L. Snyder, "Introduction to the configurable highly parallel computer," Computer, vol. 15, pp. 47-56, Jan. 1982.
[34] H. S. Stone, "Parallel processing with the perfect shuffle," IEEE Trans. Comput., vol. C-20, pp. 153-161, Feb. 1971.
[35] H. S. Stone, "Parallel tridiagonal solvers," ACM TOMS, vol. 1, pp. 289-307, 1975.
[36] C. Thompson, "A complexity theory for VLSI," Ph.D. dissertation, Carnegie-Mellon Univ., Pittsburgh, PA, 1980.
[37] C. D. Thompson and H. T. Kung, "Sorting on a mesh-connected parallel computer," Commun. ACM, vol. 20, no. 4, pp. 263-270, 1977.
[38] J. E. Vuillemin, "A combinatorial limit to the computing power of VLSI circuits," in Proc. 21st Annu. Symp. Foundations of Comput. Sci., 1980, pp. 240-300.
Dennis B. Gannon received the B.S., M.S., and Ph.D. degrees in mathematics from the University of California, Davis, in 1969, 1971, and 1974, respectively, and the Ph.D. degree in computer science from the University of Illinois, Urbana-Champaign, in 1980.
He is an Assistant Professor in the Department of Computer Science, Purdue University, West Lafayette, IN. His technical interests center around the interaction of the design of algorithms, software, and hardware for large-scale parallel computation. He is currently developing software and applications for an experimental 64-processor parallel computer.
John Van Rosendale received the B.S. degree in physics, the M.S. degree in mathematics, and the Ph.D. degree in computer science in 1980, all from the University of Illinois, Urbana-Champaign.
He performed research in scientific computation at Boeing Computer Services, Seattle, WA, in 1980-1981. He is currently a Staff Scientist at ICASE (Institute for Computer Applications in Science and Engineering), NASA Langley Research Center, Hampton, VA. His principal research interests are network-based multiprocessor architectures, highly parallel algorithms, and functional programming languages.