Optimal Expression Evaluation and Term Matching on the Boolean Hypercube and on Hypercubic Networks Ernst W. Mayr
Ralph Werchner
y
June 28, 1994
Abstract An optimal tree contraction algorithm for the boolean hypercube and the constant degree hypercubic networks, such as the shue exchange or the butter y network, is presented. The algorithm is based on a recursive approach, uses novel routing techniques, and, for certain small subtrees, simulates optimal PRAM algorithms. For trees of size , stored processor hypercube in in-order, the running time of the all mon a gorithm is ( np log ), which can be shown to be optimal on the hypercube by a corresponding lower bound. Tree contraction can be used to evaluate algebraic expressions consisting of operators + ? and rational operands, as well as for the membership problem for certain subclasses of languages in DCFL. The same algorithmic ingredients can also be used to solve the term matching problem, one of the fundamental problems in logic programming. n
p
O
;
p
; ;=
1 Introduction Tree contraction is a fundamental technique for solving problems on trees. A given tree is reduced to a single node by repeatedly contracting edges, resp., merging adjacent nodes. This operation can be used for problems like top-down or bottom-up algebraic tree computations [?], which themselves can be applied to solve the membership problem for certain subclasses of languages in DCFL [?] or to evaluate expressions consisting of rational operands and the operators +; ?; ; =. Institut fur Informatik, Technische Universitat Munchen, P.O. Box 20 24 20, W-8000 Munchen 2, Germany, Phone: +49 89 2105 2680, e-mail:
[email protected] y Theoretische Informatik, Fachbereich 20, J.W. Goethe-Universit at, Robert-Mayer-Strae 11-15, 60054 Frankfurt, Germany, Phone: +49 69 798 8324, Fax: +49 69 798 8353, e-mail:
[email protected]
formatik.uni-frankfurt.de
1
In the context of parallel processing the objective is to contract the tree using a small number of stages, each consisting of independent edge contractions. Brent [?] was rst to show that a logarithmic number of such stages is sucient, and he applied this result to the restructuring of algebraic expression trees, producing logarithmic depth. Subsequent work [?] [?] [?] [?] [?] concentrated on the ecient parallel computation of contraction sequences, and eventually resulted in work optimal logarithmic time tree contraction algorithms on the EREW PRAM. We consider the tree contraction problem on the binary hypercube and on similar networks. In this model, computing a suitable contraction sequence is more dicult, and the additional problem of routing pairs of nodes to be contracted to common processors arises. For the contraction of paths, both problems become trivial and can be solved in logarithmic time by a parallel pre x operation [?]. For arbitrary trees, we use two known (PRAM) contraction techniques. To achieve a logarithmic running time we separate the local communication operations from the few long distance communication steps and perform them in appropriately sized subcubes. Our algorithm contracts a tree of size n on a p processor l m binary hypercube in O( np log p) steps, which we also show is asymptotically optimal by a matching lower bound. Term matching is an important special case of the uni cation problem. It is one of the most time consuming tasks in logic programming and term rewriting systems. A term is an expression composed of function symbols (with arbitrary arity), constants, and variables. In the term matching problem, we are given two terms A and B , with B containing no variable symbols. The task is to nd a substitution of terms for the variables (in A) making A syntactically identical to B . Consider, for example, A = F(G(x; y; a); F(x; b)) and B = F(G(b; F(a; b); a); F(b; b)); the desired substitution is x 7! b ; y 7! F(a; b). In [?] Paterson and Wegman solve the general uni cation problem sequentially in linear time. The rst parallel algorithm for the term matching problem achieving optimal speedup and logarithmic running time is due to Kedem and Palem [?]. Their algorithms use the CRCW PRAM or the randomized CREW PRAM model. Employing a recursive approach conceptually very similar to the one presented in this paper, Kosaraju and Delcher [?] developed a work-optimal logarithmic time term matching algorithm for the CREW PRAM model. Term matching naturally decomposes into two phases. First, all pairs of corresponding symbols in A and B are determined, where this correspondence is de ned by considering A and B as ordered trees. Two nodes correspond to each other, if they are both roots, or their parents correspond and they are both the i-th child of their respective parent. Since the symbols of A appear in the same order as the corresponding symbols in B , a single monotone shift of the symbols in both terms suces to align corresponding symbols with each other. In the following, this rst part will be called the term alignment problem . In the second phase of term matching, all pairs of corresponding symbols and all subtrees of B corresponding to the same variable are tested for equality. One contribution of this paper is an optimal hypercube algorithm for the term alignment problem. A p processor hypercube can compute and perform the desired routing for terms 2
l m
of size n in O( np log p) steps. Applying our general approach, we rst align the terms A and B roughly, considering only their global structure, and postpone the exact alignment to a recursive call in lower dimensional subcubes. The task of comparing subtrees mapped to the same variable generally has to be performed by a standard sorting algorithm.l This can m be done by a randomized sorting algorithm [?][?] within the same time bound O( np log p). The best known deterministic sorting algorithm [?] exceeds this time bound by a factor of O(log log2 p).
2 Fundamental Concepts and Notation We rst give a short description of our model of computation. A network of processors is a set of processors, interconnected by bidirectional communication links. Each processor has the capabilities of a RAM, a unique processor-id and additional instructions to send or receive one machine word to respectively from a direct neighbor. We assume the word length of the processors to be (log p) bits where p is the number of processors. The topological structure of the communication links can be described by an undirected graph (V; E ). A d-dimensional hypercube is a network of processors represented by the graph G = (V; E ) with
V = f0; 1gd ; E = f(u; v) j u and v dier in exactly one bitg : A network with bounded degree and a structure very similar to the hypercube is the ddimensional shue-exchange network. The processors are numbered as they are in the hypercube. Two processors are connected by a link if their ids dier only in the last bit (exchange edges) or if one id is a cyclic shift by one position of the other id (shue edges). A compendium of results concerning hypercubes and shue-exchange networks can be found in [?]. A tree contraction is the reduction of a given binary tree to a single node, proceeding in stages. In a stage, a set of disjoint pairs of adjacent nodes in the current tree is selected, and the edges connecting these pairs are contracted, i.e., each pair of nodes is merged into a single node. The contractions are not allowed to produce a node with more than two children. For the tree contraction problem on networks of processors we assume that the data comprising each node can be stored in a constant number of processor words and that two adjacent nodes stored by the same processor can be merged in constant time. On networks the complexity of the tree contraction problem depends on the representation of the given tree. We assume the in-order sequence of the nodes to be evenly distributed over the sequence of processors ordered by processor-id. To uniquely describe the structure of the tree in this way, each nontrivial subtree is assumed to be enclosed by a pair of parentheses. To contract a tree given in pre- or post-order, it can be transformed to in-order without asymptotically increasing the time complexity using the routing algorithm of [?]. 3
If the tree is given by an arbitrary, but balanced distribution of the nodes linked by pointers, a list ranking on the Euler tour of the tree and a corresponding routing operation can be used to transform the tree into in-orderl representation. On a p processor hypercube, m 2 n this transformation can be carried out in O( p log p log log log p log p) steps for trees of size n. The dominating part is the time required for the list-ranking. In the following, we will assume that trees are stored in in-order, i.e., arithmetical expressions are given in in x notation. In the context of the term matching problem, terms are usually represented in pre x notation, and that is what we also assume here. Our algorithms use the following basic operations, which are all known to be executable in logarithmic time on the hypercube or shue-exchange network:
parallel-pre x-operation and segmented parallel-pre x-operation [?]; monotone routing (the order of the data items is unchanged) [?]; sparse enumeration sort (sorting p data items on a p processor hypercube, for < 1) [?]; parentheses-structured routing (routing between matching pairs in a well-formed word of parentheses) [?]. In our algorithms we use parallel recursive calls for sets of signi cantly smaller subproblems. The sizes of the subproblems sum up to no more than the original problem size, and thus the processors can be allocated to perform all recursive calls in parallel. In the tree contraction algorithm each subproblem is already stored in its allocated subcube. In the term alignment algorithm we just generate a sequence of subproblems of bounded but unknown sizes. In [?] an optimal general method is proposed to allocate appropriately sized subcubes to these subproblems without signi cant overhead in the running time. This justi es our running time analysis, where we just sum up the running time of our algorithm for each level of the recursion.
3 Tree Contraction on the Hypercube In this section we show how l mtrees of size n can be contracted on a p processor hypercube or hypercubic network in O( np log p) steps. We concentrate on the contraction of trees with at most p nodes in logarithmic time on the hypercube. For inputs of a size n > p, a hypercube of size (n) can be simulated with a slowdown of ( np ). Because of the simple communication structure, the hypercube algorithm can be simulated on a shue exchange network of the same size with just constant slowdown. Furthermore, due to a general simulation result [?], any algorithm on the shue exchange network can be simulated with constant slowdown on each of the other so called hypercubic networks, such as the de Bruijn network, the cube connected cycles network, or the butter y network. We give an outline of the contraction algorithm. It proceeds in three phases: 4
1. We identify and contract small subtrees in a recursive call of the algorithm. 2. The remaining tree may still be quite large, but, if so, has a structure that allows a signi cant further reduction by eliminating the leaves and contracting chains of nodes with a single child each to one edge, an operation similar to the \compact" operation in [?]. 3. After rearranging the elements of the expression in the hypercube, the PRAM tree contraction algorithm of [?] is simulated, with only constant slowdown.
3.1 Contraction of Subtrees In the rst phase,3 we divide the processors of the d-dimensional hypercube into contiguous j k d 3 b c blocks of size 2 4 . These blocks are 4 d -dimensional subcubes, say C1; C2; : : : ; C2dd=4e . We determine those subtrees entirely contained in a single Ci. For this, we rst compute the height , i.e., the level of nesting, of the parentheses enclosing the subtrees. An opening parenthesis with height h is part of a local subtree i there is a parenthesis to its right, but within its block, with a height h. Performing a parallel-pre x-computation within each block Ci, we can compute for each parenthesis the minimum height of the parentheses to its right within the block. Comparing this value with its own height, each opening parenthesis can determine whether it is matched within its block. An analogous computation is performed for the closing parentheses. In a recursive call to this algorithm, for all subcubes Ci in parallel, the identi ed subtrees are contracted. In general, a subcube Ci may contain a sequence of several maximal local subtrees. Hence the algorithm must be capable of contracting a sequence of trees. We only describe how to contract a single tree, the modi cations to the algorithm for dealing with a sequence of trees, however, are quite straightforward.
3.2 Compaction Denote by T , T 0, and T 00 the trees at the beginning of the algorithm, after the rst phase, and after the second phase, respectively. Let L be the set of pairs (u; v), where v is a leaf of T 0, u is its parent, and v is not the left sibling of another leaf. We reduce T 0 to T 00 by eliminating all pairs of nodes in L (see Figure ??). This reduction is equivalent to one \rake" operation (eliminating the leaves) and a repeated \compress" operation (contracting the maximal chains of nodes with a single child). Together, they were called a \compact" operation in [?]. The result is again an extended binary tree, i.e., a tree with each node having zero or two children. We rst prove that the resulting tree is small. Then we show how the compact operation can be carried out on the hypercube in logarithmic time.
5
T’
T’’ compact
C1
C2
C4
C3
Figure 1: The compact operation applied to T 0
Lemma 1 Starting with an expression tree of size at most 2d , applying the rst and second phase of our algorithm, the resulting tree T 00 is an extended binary tree with at most 2dd=4e ? 1 leaves.
Proof: Each leaf of T 0 corresponds to a maximal subtree of T initially stored totally
within one of the blocks C1; C2; : : : ; C2dd=4e . Assign to each leaf the index of its subcube. Then the sequence of numbers assigned to the leaves in T 0 is sorted and two leaves that are siblings of each other are assigned dierent numbers. As the only leaves of T 0 surviving in T 00 are those whose right sibling is also a leaf, they are all assigned distinct numbers. Since these numbers are in the range from 1 to 2dd=4e ? 1, T 00 has at most 2dd=4e ? 1 leaves. To perform the compact operation eciently on the hypercube we rst switch some siblings in the tree such that for all (u; v) 2 L, the leaf v is a right child of u. For our in-order representation of the tree this means that we have to make a modi cation according to the following pattern: : : : (v u(: : :)) : : : ?! : : : ((: : :)u0 v) : : : ; (1) where u is modi ed to u0 to denote that its children have been interchanged. The required routing belongs to a special class of partial permutations called parentheses structured routings [?]. To de ne this class consider the following situation: Some of the processors are storing a parenthesis. Together with some of the opening parentheses data items are stored. The sequence of parentheses is well-formed, and each data item has to be routed from its opening parenthesis to the processor storing the matching closing parenthesis. A partial permutation that can be described in this way is called a parentheses structured routing request. The algorithm proposed in [?] performs the routing in logarithmic time on the hypercube, the shue exchange network or any other hypercubic network. In (??), the routing of the data item (u; v) can be guided by the pair of parentheses enclosing the subtree rooted at u. 6
Let (u1; v1); (u2; v2); : : :; (ur ; vr) be a maximal chain of tupels from L in T 0, i.e., ui is a child of ui+1 for 1 i < r, and ur is either the root or the sibling of an internal node, and v1's sibling w is either a leaf or the parent of two internal nodes. After the above routing step, this chain is, in in-order notation, a sequence
: : : u1v1)u2v2) : : :)ur vr : : : : In each such chain, the edges (ui; vi) are contracted by routing the leaves vi one position to the left. The remaining path u1; : : : ; ur is contracted to a single node by a segmented parallelpre x-operation. Finally, the resulting node is routed to w by a sparse-enumeration-sort, and the two nodes are merged.
3.3 Simulation of PRAM Contraction Algorithm In the third and last phase, we simulate the logarithmic time PRAM tree contraction algorithm proposed in [?] or [?]. Although Lemma ?? guarantees that there are only very few nodes left, a step-by-step simulation of the PRAM algorithm, performing a routing phase for each parallel random access step of the PRAM, would result in a suboptimal algorithm. In our approach, we rst compute the communication structure of the complete execution of the PRAM algorithm (without a complete simulation). Then, we route the nodes that will have to communicate to adjacent hypercube processors. After rearranging the data, we are able to simulate the PRAM tree contraction algorithm without slowdown. In [?] the tree is reduced using the operation \rake". This operation, applied to a leaf l of the tree, involves three nodes: the leaf l, its parent p and its sibling s. The two edges connecting these nodes are contracted and the nodes l and p are eliminated while node s survives. If the given tree has m leaves, the contraction algorithm proceeds in 2 blog mc stages called [1,l], [1,r], [2,l], [2,r], : : : ; [blog mc,l], [blog mc,r]. The leaves are numbered left to right from 1 to m. In stage [i,l] (resp., [i,r]) the rake operation is applied to all leaves numbered by odd multiples of 2i?1 that are left (resp. right ) children. In this fashion, no node is involved in more than one rake operation in each stage (see Figure ?? for an example). We apply this contraction algorithm to the tree T 00. For a stage x, let Tx be the reduced tree just before stage x. The contraction of T 00 can be represented by a tree T~ as shown in Figure ??. For each stage x of the contraction procedure there is one level of T~ containing exactly the nodes of Tx. In T~, a node v with children v1; v2; v3 corresponds to the contraction of v1; v2; v3 to v. The depth of T~ is bounded by bd=2c since T 00 has m < 2dd=4e leaves. Thus T~ can easily be embedded in a d-dimensional cube with dilation 2. To contract T 00, we compute this embedding of T~, route the nodes of T 00 to the corresponding leaves of T~, and perform the contraction along the (embedded) edges of T~. Note that the nodes of T 00 can be routed in logarithmic time by a sparse enumeration sort followed by a monotone routing. The major problem is to compute their destinations according to the embedding of T~. 7
stage [1, l]
f
stage [1, r]
f e
a
g e
1
b
7
8
e b
6
4
8 6
2
d
2
6
f
8
b
stage [2, l]
4
5
c
2
d
3 4
5 stage [2, l]
stage [2, r]
f e 4
8
stage [3, l]
f 4
8
6
Figure 2: Tree contraction 8
f 8
4
f e
8 6
4
f e b 2
8 6
4
Figure 3: The top 4 levels of the corresponding T~ 8
8
For each node v of T~, we de ne p(v) 2 f0; 1; 2; 3gbd=2c as the description of the path from the root of T~ to v. The i-th element of p(v) is j 2 f1; 2; 3g if the i-th edge on this path points to a j -th child, and an appropriate number of 0's is appended to pad p(v) to length bd=2c. Substituting each number of p(v) by its two-bit-representation, the hypercube address of v in the embedding of T~ can easily be obtained. It remains to show how to compute p(v) eciently for the leaves of T~, i.e. the nodes of T 00. Investigating the tree contraction algorithm, the following facts can easily be derived: (a) The nodes of T[i;l] are the leaves of T 00 numbered by multiples of 2i?1 and their lowest common ancestors in T 00. (b) Let v be a node of T~ on a level corresponding to some stage x of the contraction. The descendants of v in T~ are those nodes that have been contracted into v. These are v itself and those nodes u connected in T 00 (considered as an undirected tree) to the parent of v by a path containing no node of Tx. We compute the elements of the strings p(v) corresponding to a stage x in parallel for all nodes v of T~. The computation for stage x can be visualized by a two dimensional array of processors, with each column corresponding to a path from the root of T 00 to a leaf and with each row corresponding to an internal node of T 00. The operations performed on the array are pre x computations and broadcast operations along the rows and the columns. Placing rows and columns in appropriately sized subcubes yields a logarithmic running time while using no more than bd=2c 2d=2+2 processors. For stage x, we rst compute the nodes of Tx using Fact (a) above. Simulating one stage of the contraction, the behavior of each node v of Tx determines j 2 f1; 2; 3g so that v is a j -th child in T~ on the level corresponding to stage x. This j is broadcast to those nodes u of T 00 which are descendants of v in T~ since it is part of their string p(u). Fact (b) tells how to nd these descendants. First, v's parent broadcasts j along the path towards the root of T 00 until reaching a node of Tx. Then, the nodes reached on this segment broadcast j towards the leaves of T 00, again stopping the broadcast operation at nodes of Tx. The total tree contraction algorithmj fork trees of size p consists of a logarithmic number of parallel steps and a recursive call in 43 d -dimensional subcubes. Thus, we have:
Theorem 1 A tree of size n can be contracted on a p processor hypercube in O( steps.
l m
n p
log p)
The running time of this algorithm can be improved by a constant factor viewing stages [i; l] and [i; r] as one new stage. In this way, the number of stages and the depth of T~ is halved. On the other hand, the degree of T~ just grows to 7 (instead of 9) as can be shown by a more detailed analysis of the contraction procedure. Hence, the size of T 00 could be to 2d=3, and the algorithm can be modi ed as to making the recursive call in l increased m 2 d -dimensional subcubes. 3 9
The primitive operations used in the algorithm and the nal simulation of the PRAM tree contraction algorithm can all just as well be performed on the shue exchange network in logarithmic time. Combined with a general simulation result in [?], we have:
Corollary 1 A tree of size n can be contracted on any p processor hypercubic network in l m O(
n p
log p) steps.
The results of Theorem ?? and Corollary ?? can be shown to be asymptotically optimal for any input size if we assume that the only operations allowed on the nodes of the tree are copying, routing, and contracting two nodes. In this case, the number of messages required for tree contraction is at least the sum of the distances in the network spanned by the edges in the input distribution divided by the maximal degree of a node.
4 Dynamic Expression Evaluation As an application, the contraction algorithm given in the previous sectionlcan m be used to n evaluate algebraic expressions of size n on a p processor hypercube in O( p log p) steps, provided that the operators in the expression satisfy certain closure properties [?]. Furthermore, the value of each subexpression can be computed within the same time bound: after contracting the tree, we reverse the contraction steps, starting with a single node, ending up with the original tree, and maintaining an expression tree with the values of all subexpressions already computed. Tree contraction can be applied to the evaluation of algebraic expressions with rational numbers and the operators +; ?; ; =, but the model of computation has to be adjusted appropriately. As the numbers computed in the evaluation may grow rapidly we have to supply the network with the ability to communicate rational numbers and perform the basic arithmetic operations on rational numbers in constant time. We call this model of computation a rational number network. It can be shown that our expression evaluation algorithm on the rational number hypercube is asymptotically optimal for each input size. Thus, the maximum speed-up achievable for the expression evaluation problem on the hypercube is O(p= log p). Note that, on the EREW-PRAM, the optimal speed-up of (p) can be achieved for input sizes of (p log p). Intuitively, the lower bound for the rational number hypercube derives from the fact that the initial distribution of the operands in the hypercube can be chosen in such a way that nodes adjacent in the expression tree are stored in distant processors of the hypercube. In our proof, Abelson's argument [?] bounding the communication complexity from below by the rank of a certain matrix is applied simultaneously to the d partitions dividing the d-dimensional hypercube into two d ? 1-dimensional subcubes. For the constant degree hypercubic networks, the lower bound can be obtained considering a small bisector of the network.
Theorem 2 Expressions containing n rational numbers and the operators +; ?; ; = can be 10
evaluated, including alll subexpressions, on a p processor rational number hypercube or hym percubic network in O( np log p) steps, and this bound is optimal.
5 The Term Alignment Problem Our term alignment algorithm is conceptually very similar to the tree contraction algorithm presented above. Since both algorithms use the primitive operations listed in section 2 in a similar fashion, we will keep our presentation on a high level without presenting any details. In our term alignment algorithm for the hypercube, we follow the general approach of Kosaraju and Delcher [?] for the CREW PRAM model. We only show how to solve the term alignment problem on a p processor hypercube in O(log p) steps for terms A and B with a total length of n p, stored in the rst n processors. For n p the problem is solved by simulating a hypercube of size at least n with a slowdown factor of (n=p). We identify the terms A and B with their corresponding trees. Each matching pair of parentheses corresponds to a node. In a preprocessing step, we convert the trees for A and B into binary trees maintaining the node correspondence between the original nodes of A and B . This can easily be done by introducing k ? 1 new dummy nodes d1; : : : ; dk?1 for each original node v with k > 1 children v1; : : :; vk . We make di the right child of di?1 if i > 1, and we make d1 the right child of v. The node vi becomes the left child of di?1 if i > 1, and v1 continues to be the left child of v. To further simplify the problem we temporarily ignore the function symbols, constants and variables in A and B . Thus, we have to nd all pairs of corresponding parentheses in the strings A and B . To each parenthesis, we assign the attribute l or r indicating whether the node is a left or right child (by default, the root is assigned l). The algorithm now proceeds in four phases. In the rst phase, we divide the d-dimensional 2d b c 3 hypercube into contiguous blocks of size m = 2 . Within each block we identify those parentheses with minimal height. Call the set of nodes corresponding to these parentheses the critical nodes in the trees for A and B . Note that there are at most 2n=m critical nodes and that removing the critical nodes from the trees for A and B leaves subtrees of size at most 2m each. Call a node semicritical if it is not critical but corresponds to a critical node in the other tree. In the second phase, we identify the semicritical nodes. This phase can be implemented in logarithmic time following the ideas given in [?]. In the third phase, we remove the critical and semicritical nodes from the trees for A and B . The resulting subtrees of A and B can be paired according to the correspondence of their roots. There are at most 2n=m pairs each of size at most 4m. These pairs are the subproblems to be solved in the recursive call. To apply the generic divide-and-conquer implementation presented in [?], the generated subproblems need to be stored in contiguous intervals of processors. Thus, we have to permute the data accordingly. The subtrees obtained by the removal of the critical and semicritical nodes can be separated and paired 11
to form subproblems using monotone routings, a parentheses structured routing, and a bitpermute-permutation. In the fourth phase, the subproblems are solved recursively. The total implementation of this divide-and-conquer algorithm includes some additional steps to route the generated subproblems into their allocated subcubes. A detailed description of these steps can be found in [?]. Since the rst three phases can be performed in logarithmic time and the algorithm is called recursively for subproblems of size O(2 23 d) the total running time on inputs of length p is O(log p).
Theorem 3 The term alignment problem l m for terms of size n can be solved on a p processor
hypercube or hypercubic network in O(
n p
log p) steps.
As noted in the introduction the term matching problem can be solved by a term alignment followed by a standard sorting algorithm. Thus, we have
Corollary 2 The term matching problem for terms l m of size n can be solved on a p processor p log log2 p) steps or, using a randomhypercube with a worst case running time of O( np log l m ized algorithm, with an expected running time of O( np log p) steps.
6 Conclusion and Open Problems Using a recursive approach, we have shown how to contract trees eciently on hypercubes and related networks. The same general approach and the same building blocks can be used for other problems on trees like the term matching problem. Our technique of precomputing the communication structure of a PRAM contraction algorithm, rearranging the data and simulating the algorithm without any further overhead can be used for the contraction of small trees in networks. It could also be applied to the construction of simple NC 1 circuits for the expression evaluation problem. Due to the recursive call and the usage of the parentheses structured routing algorithm, the constant factor in the worst case running time of our contraction algorithm is quite large. But the algorithm performs much better for random trees. As noted in [?], with high probability most of the nodes are contained in local subtrees and will be eliminated in the recursive call or even in the reduction within a single processor. For rational number hypercubes and hypercubic networks, the maximal speed-up achievable for the expression evaluation problem is O(p= log p). We conjecture that a similar result holds for all constant degree networks, but we have so far been unable to apply the known lower bound results for communication complexity to this case.
12
References [A80]
H. Abelson. Lower bounds on information transfer in distributed computations. Journal of the ACM, 27:384{392, 1980. [ADKP89] K. Abrahamson, N. Dadoun, D.G. Kirkpatrick and T. Przytycka. A simple parallel tree contraction algorithm. Journal of Algorithms, 10:287{302, 1989. [ALMN90] B. Aiello, F.T. Leighton, B. Maggs and M. Newman. Fast algorithms for bitserial routing on a hypercube. Proceedings of the 2nd Annual ACM Symposium on Parallel Algorithms and Architectures, 55{64, 1990. [B74] R. Brent. The parallel evaluation of general arithmetical expressions. Journal of the ACM, 21:201{206, 1974. [CP90] R. Cypher and C.G. Plaxton. Deterministic sorting in nearly logarithmic time on the hypercube and related computers. Proceedings of the 22nd Annual ACM Symposium on Theory of Computing, 193{203, 1990. [GMT88] H. Gazit and G.L. Miller and S.-H. Teng. Optimal tree contraction in the EREW model. In: Tewksbury, Stuart K. and Bradley W. Dickinson and Stuart C. Schwartz (eds.): Concurrent Computations: Algorithms, Architecture, and Technology. Plenum Press: New York-London (1988), 139{156. [GR86] A. Gibbons and W. Rytter. An optimal parallel algorithms for dynamic expression evaluation and its applications. Proceedings of the 6th Conference on Foundations of Software Technology and Theoretical Computer Science, Springer Verlag, LNCS-241:453{469, 1986. [KP92] Z.M. Kedem and K.V. Palem. Optimal parallel Algorithms for forest and term matching. Theoretical Computer Science, 93:245{264, 1992. [KD88] S.R. Kosaraju and A.L. Delcher. Optimal parallel evaluation of tree-structured computations by raking. Proceedings of the 3rd Aegean Workshop on Computing: VLSI Algorithms and Architectures, 101{110, 1988. [KD90] S.R. Kosaraju and A.L. Delcher. A tree-partitioning technique with applications to expression evaluation and term matching. Proceedings of the 31st Annual Symposium on Foundations of Computer Science, 163{172, 1990. [L92] F.T. Leighton. Introduction to Parallel Algorithms and Architectures. Morgan Kaufmann Publishers, 1992. [MW92] E.W. Mayr and R. Werchner. Optimal routing of parentheses on the hypercube. Proceedings of the 4th Annual ACM Symposium on Parallel Algorithms and Architectures, 109{117, 1992. 13
[MW93] [MR85] [NS81] [NS82] [PW78] [PS91] [S90] [S80] [VB81]
E.W. Mayr and R. Werchner. Divide-and-conquer algorithms on the hypercube. 10th Annual Symposium on Theoretical Aspects of Computer Science, LNCS665:153{162, 1993. G.L. Miller and J.H. Reif. Parallel tree contraction and its applications. Proceedings of the 26th Annual Symposium on Foundations of Computer Science, 478{489, 1985. D. Nassimi and S. Sahni. Data broadcasting in SIMD computers. IEEE Transactions on Computers, C-30:101{107, 1981. D. Nassimi and S. Sahni. Parallel permutation and sorting algorithms and a new generalized connection network. JACM, 29:642{667, 1982. M.S. Paterson and M.N. Wegman. Linear uni cation. Journal of Computer and System Sciences, 16: 158{167, 1978. G. Pietsch and E. Schomer. Optimal parallel recognition of bracket languages on hypercubes. 8th Annual Symposium on Theoretical Aspects of Computer Science, LNCS-480:434{443, 1991. E.J. Schwabe. On the computational equivalence of hypercube-derived networks. Proceedings of the 2nd Annual ACM Symposium on Parallel Algorithms and Architectures, 388{397, 1990. J.T. Schwartz. Ultracomputers. ACM Transactions on Programming Languages and Systems, 2:484{521, 1980. L.G. Valiant and G.J. Brebner. Universal schemes for parallel communication. Proceedings of the 13th Annual ACM Symposium on Theory of Computing, 263{ 277, 1981.
14