Time-Optimal Simulations of Networks by Universal Parallel Computers

Friedhelm Meyer auf der Heide, Rolf Wanka
Informatik II, Universität Dortmund, D-4600 Dortmund 50, Fed. Rep. of Germany
ABSTRACT: For technological reasons, in a realistic parallel computer the processors have to communicate via a communication network with bounded degree. Thus the question for a "good" communication network comes up. In this paper we present such a network, a universal parallel computer (UPC) with the following properties: (i) It has optimal time-loss, namely O(log c) for simulating networks of degree c. (We also prove the lower bound Ω(log c) for the time-loss.) (ii) We introduce the broadcast-capability (how many processors can be reached by one processor in i steps?) and demonstrate its influence on the number of processors needed for a simulation of a network with n processors. E.g. for broadcast-capability O(c^i) (e.g. networks with degree c), O(n^{1+ε} log n) processors are needed (ε > 0 arbitrary), whereas O(n polylog(n)) processors suffice for networks with polynomial broadcast-capability (e.g. k-dimensional grids). (iii) The UPC is potentially infinite and has multi-user capabilities, i.e. it can be arbitrarily partitioned into finite UPCs, each with the above efficiency. This construction generalizes a UPC described in [MadH2], where, given a fixed degree c, for each n a UPC M0 is constructed which needs O(n^{1+ε} log n) processors to achieve constant time-loss for simulating networks with n processors and degree c.
1. Introduction

I. The Problem
Recently in complexity theory the following basic concept for parallel computation has been developed: p processors cooperate in the execution of parallel algorithms by interchanging data via a common mechanism for communication. The most powerful instance of this concept is the model of the parallel random access machine (PRAM), where the processors have read and write access to a shared memory consisting of an infinite number of storage locations. This model is widely used to study various problems of parallel computation. Unfortunately it is not realistic, because it ignores the current technological constraints of realizing communication among the processors. In this paper we study the more realistic instance of the basic concept where networks of processors are investigated. In this model pairs of processors are interconnected by wires along which they can communicate. The interconnection pattern, which we call the communication graph, must be fixed.

(Supported in part by DFG-grants Me 872/1-2 and We 1066/2-1)
A network is unrealistic if the number of wires going out of a single processor becomes too large. Therefore we demand that the degree of a processor is bounded by a small integer which is independent of the size, i.e. the number of processors in the network. A network with this property is called realistic. All networks we propose in this paper have degree less than 8. In the literature, many different communication networks are proposed for special purposes, e.g. the Hypercube [Pease], the Butterfly network [Waksman], or the Shuffle-Exchange network [Stone] for sorting, permuting and fast Fourier transform, Leighton's modification [Leighton] of the AKS-network [AKS] for sorting, and multi-dimensional grids for interpolation problems, matrix operations, or graph problems (compare e.g. [Hambrusch], [Atallah,Kosaraju]). On the other hand it is not feasible to build a new network for each new application. Thus we have to look for versatile, i.e. for universal parallel computers (UPCs). A UPC is a realistic network that can simulate all other networks from a given class U using "few" resources (time, number of processors). Let us make this concept more precise. Let U be a class of networks. A realistic network M0 is universal for U (is a UPC for U) if it can simulate each network from U. The time-loss of a simulation is the factor by which the simulation takes longer than the simulated machine would take. As usual, we do not take into account the time needed for preprocessing M0 for the network to be simulated.

II. Known results about universal parallel computers

There are two well-known propositions for UPCs: the Shuffle-Exchange network due to Stone [Stone], and the Cube-Connected Cycles network due to Preparata and Vuillemin [Prep,Vuil]. Both networks are universal for U(n), the class of all networks with n processors.
For networks from the subclass U(n,c) of U(n) of networks of degree c, the above UPCs have time-loss O(min{c log n, sort(n)}), where sort(n) denotes the time necessary to sort n items on the respective network. sort(n) can be chosen as O((log n)²) using Batcher's sort [Batcher], or O(log n) using the probabilistic sort due to Reif and Valiant [Reif,Valiant]. Galil and Paul have in [Galil,Paul] extended this approach to potentially infinite UPCs. They also take the preprocessing mentioned above into account to define a model for purposes of uniform complexity theory. In [MadH3] a method is presented to equip the above UPC from [Galil,Paul] with "multi-user properties", i.e. to modify it in such a way that, after removing the first m processors, the remaining network is isomorphic to M0, and thus can be used in the same way by other users for further simulations. In [MadH2] it is shown that a UPC for U(n,c), c constant, exists with constant time-loss. It has O(n^{1+ε} log n) processors, for arbitrary ε > 0. It is not known whether this UPC is best possible among those with constant time-loss. The best lower bound, from [MadH1], shows the trade-off

  (time-loss) · (size) = Ω(n log n / log log n)
for each UPC for U(n,c), c constant. The model of simulations used is very general. It covers all known simulations, including those from this paper. It allows that a processor P of the simulated network is simulated by several processors of the UPC, and that the set of processors simulating P and the time needed for simulating one step vary during the simulation. The time-loss is measured in an amortized sense. In [MadH2] it is shown that the time-loss of each UPC for U(n) is Ω(log n), independent of its size. On the other hand there is a UPC with O(n^{1+ε}) processors and constant time-loss if only those networks from U(n) with predictable communication are simulated [MadH2]. "Predictable communication" means that each processor can internally precompute in O(t) steps the addresses of the processors it wants to communicate with in the next t steps.

III. The new results

In this paper we introduce a new restriction (besides size and degree) that seems to have influence on the efficiency of the simulation: the broadcast-capability. The broadcast-capability s(i) of a network M ∈ U(n) is measured by a function s : {1,…,n} → {1,…,n}, where s(i) is the maximum number of processors one processor can reach via a path of length i. For general networks from U(n), s(i) ≤ n; for networks from U(n,c), s(i) = O(c^i). But for interesting classes of networks, s(i) can even become polynomial, e.g. for k-dimensional grids, s(i) = O(i^k). Let U(n, s(i)) (U(n, c, s(i))) denote the class of all networks with n processors, (degree c,) and broadcast-capability s(i). The main result of this paper is the construction of an infinite UPC M0 for the class U of all networks with the following properties: (i) M0 simulates each M ∈ U(n,c) with time-loss O(log c). This time-loss is shown to be asymptotically optimal even for M ∈ U(n, c, s(i)) with s(i) ≥ c + 1, i.e. as small as possible. (ii) The number of processors of M0 needed for the simulation of M ∈ U(n) is sensitive to its broadcast-capability s(i). The number of processors used for simulating M is O(n log n · s(ε log n)^{1+δ}). Thus, e.g. for s(i) = O(c^i) it is O(n^{1+ε} log n), and for s(i) = O(i^k), i.e. s(i) polynomial in i, it is O(n (log n)^{(1+δ)k+1}), for arbitrary ε, δ > 0. (iii) M0 has multi-user capabilities in the sense described in II. In order to achieve this we generalize the construction of a UPC M0 for U(n,c), c constant, from [MadH2] mentioned above in four respects: a) We generalize the distributor, i.e. the network from [MadH1] used for information exchange, to an infinite distributor with an additional property. This is done in chapter 2. b) We modify M0 such that it is universal for U(n), with optimal time-loss Θ(log c) for networks from U(n,c). This is done in chapter 3, where also the lower bound for the time-loss is shown. c) We introduce the broadcast-capability and present a UPC whose performance is sensitive to this capability. This is done in chapter 4.
d) We modify this UPC in such a way that it becomes infinite without significant loss of efficiency compared to b) and c), and such that it gets the multi-user capability. This is done in chapter 5. This paper is written in an informal way. A detailed description is given in [Wanka].
2. An infinite distributor

Because in the next sections we have to transport strings of numbers from one set of processors to another set of processors, where the exact sources and sinks depend on the simulated network, we use a network with this capability introduced in [MadH1], called a distributor. Let a, b be integers with a ≤ b. An (a,b)-distributor D_{a,b} is a realistic network which has a input-processors A_1, …, A_a and b output-processors B_1, …, B_b, and the following property: If each output-processor B_i, 1 ≤ i ≤ b, has stored an integer adr(i) ∈ {1,…,a}, then D_{a,b} can initialize itself such that afterwards the following holds: If each input-processor A_j, 1 ≤ j ≤ a, has stored a string x̂_j of numbers, then D_{a,b} can send the string x̂_{adr(i)} to each processor B_i. We say D_{a,b} distributes (x̂_1, …, x̂_a) according to (adr(1), …, adr(b)). (Note: If a = b and the adr(i)'s are distinct, then D_{a,b} permutes the x̂_j's.) The number of steps needed to initialize D_{a,b} we call the initialization time, and the time to distribute the strings we call the distribution time. Since we are looking for the time we need to transport information, and the initialization is not part of the simulation time, we are interested in the distribution time only. In [MadH1] a realistic distributor W_n is presented which has n log n processors, maximum degree 6 and a distribution time of O(log n + ℓ), where ℓ is the maximum length of the distributed strings. The set of input-processors is equal to the set of output-processors. The input/output-processors have degree 5. W_n is based upon the Butterfly network introduced in [Waksman]. It is possible to initialize W_n to be a distributor D_{a,n} for each a ≤ n. Further, W_n has the property that it can be partitioned into the distributors W_i and W_{n−i} for each i ≤ n, where the input/output-processors of a smaller distributor are consecutive input/output-processors of W_n. This decomposition can be repeated.
Because of its regular design, the structure of the distributor extends naturally to an infinite one. Therefore we say that W_n is an initial part of the infinite distributor W for distributing n strings.
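Viewed purely functionally (ignoring the network realization, its routing and its timing), the behaviour of an (a,b)-distributor can be sketched as follows; the function name is ours, not from [MadH1]:

```python
def distribute(strings, adr):
    """Functional model of an (a,b)-distributor.

    strings: the strings (x_1, ..., x_a) stored in the a input-processors.
    adr:     for each output-processor B_i the address adr(i) (0-based here)
             of the input string it wants to receive.
    Returns the string each of the b output-processors ends up with.
    """
    a, b = len(strings), len(adr)
    assert a <= b and all(0 <= j < a for j in adr)
    # Every output-processor B_i receives x_{adr(i)}; several outputs may
    # request the same input string.
    return [strings[j] for j in adr]
```

If a = b and adr is a permutation, this is exactly a permutation of the strings, as noted above; the point of W_n is that the network realizes this map with n log n processors and distribution time O(log n + ℓ).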
3. The tight bound for the time-loss
In this chapter we prove that the time-loss of UPCs for U(n,c) is Θ(log c). First we show that simulating networks with degree c on an arbitrary UPC with bounded degree has time-loss Ω(log c). This result does not depend on the size of the UPC. Then we present a network which is a special version of that introduced in [MadH2] and which can simulate each network of degree c with time-loss O(log c).
Theorem 1:
a) Each UPC for U(n, c, s(i)), s(i) ≥ c + 1, has time-loss Ω(log c).

b) There is a UPC M0 for U(n) with degree 7 and O(n^{1+ε} log n) processors (for arbitrary ε > 0) which can simulate each network with degree c with time-loss O(1 + (log c)/ε). Thus, for constant ε, this time-loss is O(log c).

Proof of a): Consider the network M consisting of n processors whose communication graph is the K_{c+1} (the complete graph on c + 1 vertices) together with n − c − 1 isolated vertices. In [MadH2] it is proved that the time-loss for simulating a network with communication graph K_n is Ω(log n). Therefore the time-loss for simulating M is Ω(log c). As M has degree c and broadcast-capability c + 1, a) follows.

Proof of b): (Sketch; we modify the algorithm from [MadH2].) First we recall the construction from [MadH2]. Let T be a complete c-ary tree with depth t, and M a network with n processors and degree c. We say that T weakly simulates t steps of M for processor P of M if the following holds:

- Initially, the root processor knows the current configuration of P. Inductively, if a processor of T knows the current configuration of some processor P′ of M, then its successors in T know the current configurations of the neighbours of P′ in M.

- Finally, the processors on level i of T have simulated t − i steps of the processor they are initialized with. In particular, the root has simulated t steps of P.

This can be done in t steps, as shown in [MadH2]. We say that n copies of T weakly simulate t steps of M if each processor P of M is weakly simulated by one of the trees. When t steps are performed, no processor is able to compute the next configuration. In order to go on simulating, the root processors of the trees must inform the other processors about the t-th successor configurations, which are known only to them. Because it is possible that one configuration becomes very large, we transport only the numbers read by the root processors during the last t steps. This string of numbers has length t. Also, it suffices to send the strings to the leaves of the trees: if these processors know the numbers they have to read during the t steps, they can compute their t-th successor configuration, starting again from the 0-th successor configuration, and then so can all other processors in the trees, because their successors can do this. To perform the transportation we use the distributor introduced in chapter 2. We have to send the strings of numbers read by the root processors during the last t steps to the processors on the last level of the trees.
For this purpose we interconnect the n roots with the leaves of the trees using the (n, n·c^t)-distributor W_{n·c^t}. Thus we obtain the following algorithm for simulating t steps of M.

Init.1 Initialize the n trees for a weak simulation of t steps of M starting with configuration K.
Init.2 Initialize the distributor W_{n·c^t} such that the following holds: For each P in M, each leaf processor simulating P can be reached from the root processor simulating P.
Phase 1. Simulate weakly t steps of M starting with K on the trees.
Phase 2. Distribute the information read by the root processors during the t steps to the leaf processors according to Init.2.
Phase 3. Let each processor of the UPC simulate t steps of the processor of M it is initialized with, starting again with configuration K. (Now the trees are initialized for a weak simulation of the next t steps, starting with the t-th successor configuration of K.)

We shall use this method in the next sections as a subroutine, too. Essentially, during Phase 1 we compute the string of numbers each processor has to read during t steps. During Phase 2 we distribute this string to the corresponding leaf processors. Finally each processor in the trees computes its t-th successor configuration, so that this method can be applied again for the next t steps. Thus the time for simulating t steps is O(t) + O(log(n·c^t) + t) + O(t) = O(log n + t·(1 + log c)), and the time-loss is O((log n)/t + 1 + log c). If we choose t = ⌊ε log_c n⌋ for some arbitrary ε > 0, then the total number of processors is O(n^{1+ε} log n). The time-loss is O(1 + (log c)/ε). This was shown in [MadH2] in a very similar way. If we want to simulate a network with degree c′ where c′ > c, we can do this as follows: simulating processors in the trees are only those on levels i·⌈log c′ / log c⌉ (i = 1, 2, …). The other processors are used to forward the messages. Note that a processor on level i·⌈log c′ / log c⌉ has at least c′ successors on level (i+1)·⌈log c′ / log c⌉. The first c′ of these successors are simulating processors. All these chosen processors are initialized in the same way as the simulating processors of a c′-ary tree. Thus a tree simulates at most ⌊(ε/2)·log_{c′} n⌋ − 1 steps. We get an extra factor for the time-loss of O(log c′ / log c). If we now choose n binary trees (i.e. c = 2) for our UPC, it is universal for U(n), has time-loss O(1 + (log c′)/ε), maximum degree 7 and O(n^{1+ε} log n) processors to simulate an arbitrary network M ∈ U(n, c′). This proves b).
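The dependency structure that the c-ary trees unfold can also be read functionally: the configuration of a processor after t steps is determined by its own and its neighbours' configurations after t − 1 steps, which is exactly what level i of a tree holds about the distance-i neighbourhood. A minimal sketch (our naming; a real weak simulation runs all trees in parallel and shares work instead of recomputing it):

```python
def weak_simulate(adj, init, step, P, t):
    # Configuration of processor P after t steps of the simulated network M.
    # adj[P]  : neighbours of P in M (degree <= c),
    # init[P] : 0-th configuration of P,
    # step    : transition (own config, neighbour configs) -> next config.
    # The recursion tree is exactly the complete c-ary simulation tree of
    # depth t: the node for P needs, one level down, the nodes for P and
    # for each of its neighbours, simulated for t - 1 steps.
    if t == 0:
        return init[P]
    own = weak_simulate(adj, init, step, P, t - 1)
    nbrs = [weak_simulate(adj, init, step, Q, t - 1) for Q in adj[P]]
    return step(own, nbrs)
```

For example, on a triangle where each processor repeatedly adds its neighbours' values to its own, two simulated steps of processor 0 are obtained from the two-level recursion, mirroring a depth-2 ternary simulation tree.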
4. A UPC sensitive to the broadcast-capability

The trees which are part of the UPC constructed in the previous chapter have exactly c^i processors on each level i. But it is possible that on each level some processors of M are simulated very often, because there are many distinct paths in M of length i from the processors assigned to the roots to the processors affecting the computation of the root processors. In this section we construct a UPC M0 for U(n) which is sensitive to this property, i.e. the more often processors are simulated repeatedly in the trees of the UPC from chapter 3, the fewer processors M0 needs. We call the maximal number of processors affecting the computation of one processor at the i-th step the broadcast-capability of the network. A formal definition is as follows: Let G = (V, E) be an undirected graph and u ∈ V. The function s_u : ℕ → ℕ with

  s_u(i) = |{v ∈ V : there exists a path from u to v of length at most i}|

is called the broadcast-capability of u. We call s(i) = max{s_u(i) : u ∈ V} the broadcast-capability of G.

Theorem 2: There is a UPC M0 for U(n, s(i)) with O(n log n · s(ε log n)^{1+δ}) processors, degree 7, and time-loss O(log c) for networks from U(n, c, s(i)), for arbitrary δ, ε > 0.
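Under this definition, s(i) can be computed by a breadth-first search from every vertex. A straightforward sketch (our code; quadratic time, which is fine for illustrating the definition):

```python
from collections import deque

def broadcast_capability(adj):
    # adj: {vertex: list of neighbours} of an undirected graph G = (V, E).
    # Returns s with s[i] = max_u |{v : dist(u, v) <= i}| for 0 <= i < |V|.
    n = len(adj)
    s = [0] * n
    for u in adj:
        dist = {u: 0}                      # BFS distances from u
        queue = deque([u])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        for i in range(n):
            s[i] = max(s[i], sum(1 for d in dist.values() if d <= i))
    return s
```

On a path with 5 vertices this yields s = [1, 3, 5, 5, 5]; more generally, on a degree-c network s(i) grows at most like c^i, and on a k-dimensional grid like i^k, the two cases contrasted in the table below.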
For reasonable broadcast-capabilities the following table shows the number of processors used for the simulations.

  Broadcast-capability | Number of processors
  s(i) = O(i^k)        | O(n (log n)^{(1+δ)k+1})
  s(i) = O(c^i)        | O(n^{1+ε+εδ} log n)
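The first row of the table follows from the general bound by substitution: s(ε log n)^{1+δ} = (ε log n)^{k(1+δ)}, so n log n · s(ε log n)^{1+δ} = Θ(n (log n)^{(1+δ)k+1}). A small numeric sanity check of this substitution (our code; the constant ε^{k(1+δ)} suppressed by the O-notation shows up as the fixed ratio):

```python
import math

def upc_processors(n, s, eps, delta):
    # General processor count from Theorem 2: n * log n * s(eps*log n)^(1+delta).
    log_n = math.log2(n)
    return n * log_n * s(eps * log_n) ** (1 + delta)

def polylog_bound(n, k, delta):
    # Claimed bound for s(i) = i^k: n * (log n)^((1+delta)*k + 1).
    return n * math.log2(n) ** ((1 + delta) * k + 1)
```

For s(i) = i² and ε = δ = 1/2 the ratio of the two quantities is the constant ε^{k(1+δ)} = (1/2)³ = 0.125, independently of n, confirming that the table entry is the general bound up to a constant factor.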
Proof: To simulate a network M ∈ U(n, c, s(i)) it suffices that only s(i) processors are on level i in the trees of the network from chapter 3. Unfortunately, we cannot determine in advance a general communication graph between a level with s(i) and a level with s(i+1) processors. Therefore take a complete c-ary tree T of depth d and a forest F consisting of s(d) copies of complete c-ary trees of depth d, connected with T by an (s(d), c^d)-distributor W_{c^d}. T is initialized for a weak simulation of d steps. Each processor P in M which is simulated by a leaf processor of T is weakly simulated by exactly one tree of F, which is also initialized for a weak simulation of d steps. This construction can be iterated to obtain the graph shown in Fig. 1. We call it a tree-distributor network (TD-network). Let t be the total depth of the TD-network and r the number of distributors in it. Thus r = t/d − 1. The total number of processors is O((t/d) · c^d · s(t) · (d + log s(t))), with O(c^d · s(t)) per forest. A network consisting of n copies of this TD-network can weakly simulate t steps of a network M with degree c and broadcast-capability s(i) with time-loss O((log s(t))/d) as follows: Each forest weakly simulates d steps of its roots. Now each root processor of the forests knows the d-th successor configuration. Then we distribute the information read by the root processors to the corresponding processors on the lowest level of the forests above them. When the strings of numbers reach the leaves of the forests, each processor in the forests computes its d-th successor configuration. Note that this is the method we already used in chapter 3. Thus the time needed to compute the d-th successor configurations is at most of the order of the distribution time plus the depth of the forests. Since on each level of the TD-networks there are at most s(t) · c^d processors, the distribution time is O(log(s(t) · c^d)) = O(d + log s(t)) if c is a constant. Therefore the computation of the t-th successor configurations of the processors at level 0 needs time
  O((t/d) · (d + log s(t))) = O(t + (t/d) · log s(t)).

The time-loss is O((log s(t))/d). Hence we choose d = ⌊δ log_c s(t)⌋ for an arbitrary δ > 0. Now n copies of the TD-network can weakly simulate t steps of M. Finally, a distributor W_{n·s(t)·c^d} connects the processors at level 0 with the leaves of the lowest forests. Now, after having distributed the strings of numbers read by the roots to the corresponding leaves, again all processors can compute their t-th successor configuration. The last distribution time together with the time for updating all processors is

  O(log(n · s(t) · c^d) + t) = O(log n + (1 + δ) log s(t) + t).
Fig. 1. A TD-network.

If we choose t = ⌊ε log_c n⌋, the time becomes O(log n) and the time-loss O(1), if c is a constant. Finally, we set c = 2 and use the method introduced in the previous chapter for simulations with time-loss O(log c′) for arbitrary c′. We obtain the following results for this network:

(i) the total number of processors is O(n log n · s(ε log n)^{1+δ}),
(ii) the time-loss is O(log c′) (which is optimal),
(iii) the maximum degree is 7.
5. A potentially infinite UPC

The network proposed in the last chapter still depends on the number n and the function s(i), so one would have to construct a special network for each case. It is better to design an infinite network M0 and to take a part of it that can initialize itself to perform simulations of several networks with varying n, c, s(i), with the same efficiency as obtained in chapter 4. This is done in what follows.

Theorem 3: There is an infinite UPC M0 for U with degree 7 such that, for each N ∈ ℕ, there is an initial part M0(N) of M0 with O(N^{1+δ} (log N)²) processors (for arbitrary δ > 0) with the following properties: For each n ≤ N and each broadcast-capability s(i), M0(N) can simultaneously simulate a number H of networks from U(n, c, s(i)) with time-loss O(log c). H is so large that the number O(N^{1+δ} (log N)² / H) of processors used for each simulation is only by a factor O(log N / log s(ε log n)) bigger than in the UPC from Theorem 2.
Proof: In order to construct an infinite UPC we proceed in a way similar to the previous chapter. But now we need a network between the distributors that can be partitioned into forests whose size depends on s(i). For this purpose we define the following network, which we can separate into many trees. A tree-net B_{m,n} is the network with processor set P and communication graph G = (P, E) where

  P = {P_{i,j} : 0 ≤ i ≤ m−1, 0 ≤ j ≤ n−1},
  E = {{P_{i,j}, P_{i+1,j}} : 0 ≤ i ≤ m−2, 0 ≤ j ≤ n−1}
      ∪ {{P_{i,j}, P_{i+1,j+2^i}} : 0 ≤ i ≤ m−2, 0 ≤ j ≤ n−1, j + 2^i ≤ n−1}.

We say: P_{i,j} is in the i-th row and j-th column.
Fig. 2. The tree-net B_{4,8} and two trees with depth 2.

Obviously B_{m,n} has m·n processors. Note that the structure of the tree-net is potentially infinite. Therefore we denote the infinite version of it by B and say that B_{m,n} is an initial part of B. The remaining part is isomorphic to B. Let k < m. It is possible to partition B_{m,n} into at least ⌊n/2^k⌋ disjoint binary trees with depth k. This can easily be proved by induction on k, using the idea for partitioning illustrated in Fig. 2, where the construction for B_{4,8} and k = 2 is shown. We shall substitute the tree-nets for the forests we used in the previous sections for the weak simulations. But first we have to modify the way we use the distributors. In the previous chapter we used them for fixed n, so that their depth depends on n. Now we take W, the potentially infinite version of the distributor. As mentioned in chapter 2, we are able to partition this large distributor into distributors W_n for arbitrary n, where the consecutive input/output-processors are the same as in W restricted to W_n. The remaining part of W is isomorphic to W. Now we can define the infinite network M0. Let M0 be the following UPC: Tree-nets B^(1), B^(2), … and distributors W^(1), W^(2), … are given. B^(i) is interconnected with B^(i+1) in such a way that, for each j, the processors in the 0-th row and j-th column of B^(i) and of B^(i+1) each have a wire to input/output-processor number j of W^(i).
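The partition into binary trees can be made explicit: root a tree at P_{0, r·2^k} and take as children of P_{i,j} the two processors P_{i+1,j} (vertical edge) and P_{i+1,j+2^i} (diagonal edge). Each node in the resulting depth-k structure then has a unique parent, and roots at columns 0, 2^k, 2·2^k, … give node-disjoint trees. A sketch (our code) that builds this partition and can be checked against Fig. 2:

```python
def tree_nodes(root_col, k):
    # Binary tree of depth k in the tree-net, rooted at P_{0, root_col}:
    # the children of P_{i,j} are P_{i+1,j} (vertical edge) and
    # P_{i+1,j+2^i} (diagonal edge).  Nodes are (row, column) pairs.
    nodes = {(0, root_col)}
    frontier = [(0, root_col)]
    for i in range(k):
        frontier = [child for (_, j) in frontier
                    for child in ((i + 1, j), (i + 1, j + 2 ** i))]
        nodes.update(frontier)
    return nodes

def partition_tree_net(m, n, k):
    # floor(n / 2^k) node-disjoint binary trees of depth k inside B_{m,n}
    # (rows 0..m-1, columns 0..n-1), rooted in row 0 at columns 0, 2^k, ...
    assert k < m
    return [tree_nodes(r * 2 ** k, k) for r in range(n // 2 ** k)]
```

For B_{4,8} and k = 2 this yields the two depth-2 trees of Fig. 2; each tree has 2^{k+1} − 1 = 7 nodes, their node sets are disjoint, and the tree rooted at column r·2^k occupies exactly the columns r·2^k, …, (r+1)·2^k − 1 on its deepest row.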
Fig. 3. The infinite UPC M0.

This yields the network illustrated in Fig. 3. The maximum degree is 7, since there are two additional edges to the input/output-processors of the distributors. By M0(N) we denote the initial part of M0 where log N copies of B_{log N, N^{1+δ}} and log N − 1 distributors W_{N^{1+δ}} (for arbitrary δ > 0) are interconnected as described above. The total number of processors is O(log N · (N^{1+δ} · log N)) = O(N^{1+δ} (log N)²). For n ≤ N, M0(N) is able to simulate one network M ∈ U(n, c, s(i)) as follows:

Init.1 Choose t = ε log n and d = ⌊δ log s(t)⌋. Initialize tree-net B^(i), 1 ≤ i ≤ t/d, with s((i−1)·d) trees of depth d.
Init.2 Initialize each tree and each distributor as described in chapter 4 for the TD-network, using the trick for simulating a network with degree c.
Init.3 Additionally, initialize W^(1) as described for the large distributor that extends the TD-network.

FOR k := t/d DOWNTO 1 DO
  Phase 1. In each B^(i), 1 ≤ i ≤ k, the trees weakly simulate d steps of M starting in the (t − k·d)-th successor configuration K′.
  Phase 2. Distribute the information read by the root processors of B^(i+1) during the d steps to the first row of B^(i) via W^(i). Send the information arriving at the 0-th row of B^(i) to the processors on level d along the vertical edges.
  Phase 3. Now the processors in the trees compute the d-th successor configurations starting with K′.
DONE

Phase 4. Distribute the information read by the root processors of B^(1) during the last t steps to the 0-th row of B^(1) as follows: If on level d of B^(t/d) a processor P′ is simulated in column j, then the processor in the j-th column of the 0-th row of B^(1) gets the string of numbers which P′ needs for computing the t-th successor configuration.
Phase 5. Send the strings of numbers along the wires connecting the tree-nets and the distributors from B^(1) to B^(t/d). Then send these strings along the vertical edges to the leaves of the forest of this tree-net.
Phase 6. Now all processors perform t steps starting again with the 0-th configuration.

Also this algorithm uses the method introduced in chapter 3. After executing this algorithm, each simulating processor knows the t-th successor configuration. They can now repeat this algorithm (without the initialization) to go on simulating. Because the used part of M0 is the same as in the UPC from chapter 4, the time-loss is O(log c). Again the number of simulating processors is O(n log n · s(ε log n)^{1+δ}). Moreover this UPC has the property that the unused part to the right of and below the simulating part consists of tree-nets and distributors. Therefore we can partition this part again into structures which can simulate t steps of further networks with n processors, degree c, and broadcast-capability s(i). We can partition the whole UPC into

  N^{1+δ} / (n · s(ε log n)^{1+δ})

disjoint structures, each simulating a network, one beside another. In the same way we can partition the UPC into

  (log N) / (t/d) = Θ((log N · log s(ε log n)) / log n)

structures, one beneath the other. Thus the total number of networks from U(n, c, s(i)) which M0(N) can simulate simultaneously is

  H = Θ((N^{1+δ} · log N · log s(ε log n)) / (n · log n · s(ε log n)^{1+δ})).

Thus we obtain the following ratio:

  (size of M0(N)) / (number of simulated networks) = O((log N / log s(ε log n)) · n log n · s(ε log n)^{1+δ}),

which is identical to the result achieved for the special UPCs in chapter 4, except for a factor O(log N / log s(ε log n)).
6. Conclusion

1) Note that it is possible to initialize M0(N) for simulating networks with different sizes, degrees and broadcast-capabilities simultaneously. The way to do this is the same as described for the initialization of M0(N) in chapter 5.

2) M0(N) does not use the processors which are below a forest in a tree-net, so we have a factor of O(log N) in the ratio of the number of processors in M0(N) to the number of networks we can simulate on it. By extending the construction of the distributor, the factor O(log N) is replaced by O(log n), and the ratio above even becomes independent of N (see [Wanka]).

3) Because the distributor from [MadH1] which we used in this paper is very similar to the tree-net, it is possible to embed the tree-nets into the distributors. This reduces the number of processors by a constant factor and makes the structure of the UPC even simpler, but makes the description of the simulations more difficult.
References

[AKS] M. Ajtai, J. Komlós, E. Szemerédi, An O(n log n) sorting network, Proc. 15th ACM-STOC (1983) 1-9
[Atallah,Kosaraju] M. Atallah, S. R. Kosaraju, Graph problems on a mesh-connected processor array, Proc. 14th ACM-STOC (1982) 345-353
[Batcher] K. E. Batcher, Sorting networks and their applications, AFIPS Conf. Proceedings 32 (1968) 307-314
[Galil,Paul] Z. Galil, W. J. Paul, An efficient general-purpose parallel computer, JACM 30 (1983) 360-387
[Hambrusch] S. E. Hambrusch, VLSI algorithms for the connected component problem, SIAM Journal on Computing 12 (1983) 354-365
[Leighton] T. Leighton, Tight bounds on the complexity of parallel sorting, Proc. 16th ACM-STOC (1984) 71-80
[MadH1] F. Meyer auf der Heide, Efficiency of universal parallel computers, Acta Informatica 19 (1983) 269-296
[MadH2] F. Meyer auf der Heide, Efficient simulations among several models of parallel computers, SIAM Journal on Computing 15 (1986) 106-109
[MadH3] F. Meyer auf der Heide, Infinite cube-connected cycles, Information Processing Letters 16 (1983) 1-2
[Pease] M. C. Pease, The indirect binary n-cube microprocessor array, IEEE Transactions on Computers C-26 (1977) 458-473
[Prep,Vuil] F. Preparata, J. Vuillemin, The cube-connected cycles: a versatile network for parallel computation, CACM 24 (1981) 300-309
[Reif,Valiant] J. H. Reif, L. G. Valiant, A logarithmic time sort for linear size networks, Proc. 15th ACM-STOC (1983) 10-16
[Stone] H. Stone, Parallel processing with the perfect shuffle, IEEE Transactions on Computers C-20 (1971) 153-161
[Waksman] A. Waksman, A permutation network, JACM 15 (1968) 159-163
[Wanka] R. Wanka, Über die Effizienz von Simulationen eingeschränkter Netzwerkklassen auf universellen parallelen Rechnern, Diplomarbeit, University of Dortmund, 1988