

Efficient VLSI Networks for Parallel Processing Based on Orthogonal Trees

DHRUVA NATH, S. N. MAHESHWARI, AND P. C. P. BHATT

Abstract-In this paper we describe two interconnection networks for parallel processing, namely the orthogonal trees network and the orthogonal tree cycles (OTN and OTC). Both networks are suitable for VLSI implementation and have been analyzed using Thompson's model of VLSI. While the OTN and OTC have time performances similar to fast networks such as the perfect shuffle network (PSN), the cube connected cycles (CCC), etc., they have substantially better area * time^2 performances for a number of matrix and graph problems. For instance, the connected components and a minimal spanning tree of an undirected N-vertex graph can be found in O(log^4 N) time on the OTC with an area * time^2 performance of O(N^2 log^8 N) and O(N^2 log^9 N) respectively. This is asymptotically much better than the performances of the CCC, PSN and mesh. The OTC and OTN can be looked upon as general purpose parallel processors since a number of other problems such as sorting and DFT can be solved on them with an area * time^2 performance matching that of other networks. Finally, programming the OTN and OTC is simple, and they are also amenable to pipelining a series of problems.

Index Terms-Area-time complexity, interconnection networks, matrix multiplication, orthogonal trees networks, parallel algorithms, parallel processing, sorting, VLSI.

Manuscript received May 18, 1981; revised July 28, 1982.
D. Nath was with the Department of Electrical Engineering, Indian Institute of Technology, Delhi, New Delhi 110016, India. He is now with the National Institute of Information Technology, B-4/159, Safdarjung Enclave, New Delhi 110029, India.
S. N. Maheshwari and P. C. P. Bhatt are with the Centre for Computer Science and Engineering and the Department of Electrical Engineering, Indian Institute of Technology, Delhi, New Delhi 110016, India.

I. INTRODUCTION

A number of interconnection networks for parallel computers have been studied in the literature [23], [25], [32].


Recently, these networks have been examined from the point of view of VLSI implementation [23], [29]. Thompson [29] has described a VLSI model of computation in which both the time taken to solve a problem and the area of the network on a chip are of interest. A figure of merit proposed to take both time and chip area into account is area * time^2 (or AT^2). For a number of problems, lower bounds on AT^2 have been derived [27], [29].
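As a toy illustration (ours, not from the paper), the snippet below evaluates the AT^2 figure of merit for two of the sorting bounds quoted in this introduction, with all hidden constants set to 1. It only makes the area-versus-time trade concrete; it proves nothing about the networks.

import math

def at2(area: float, time: float) -> float:
    """Figure of merit: area * time^2 (asymptotic expressions, constants dropped)."""
    return area * time ** 2

for n in (2 ** 10, 2 ** 16, 2 ** 20):
    lg = math.log2(n)
    mesh = at2(n * lg ** 2, n ** 0.5)      # mesh sorter: O(N log^2 N) area, O(N^1/2) time
    psn = at2(n ** 2 / lg ** 2, lg ** 3)   # PSN: O(N^2/log^2 N) area, O(log^3 N) time
    print(f"N = 2^{int(lg)}: mesh AT^2 = {mesh:.3g}, PSN AT^2 = {psn:.3g}")

Both expressions are N^2 times a polylog factor; the networks differ in how that product is split between area and time.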

The networks studied so far achieve or nearly achieve these lower bounds for some problems and fall into two broad classes.

1) Networks such as the mesh [17], [29] and the hexagonal array [15], which use low chip areas but take a large time to solve problems. An O(N log^2 N) area mesh can sort N numbers in O(N^1/2) time, for an optimal AT^2 figure of O(N^2 log^2 N) [29]. (All logarithms in this paper are to the base 2.) Two N x N Boolean matrices can be multiplied on an O(N^2) area mesh in O(N) time, for an optimal AT^2 performance [15], [27]. A major problem with these networks is the asymptotically large computation time. Also, the proposed implementations of problems such as sorting and FFT require that a new problem be fed into the network only after the current problem is fully solved [29]. Therefore, pipelining of problems to obtain higher throughput is not possible.

2) Networks such as the perfect shuffle network (PSN, also known as the shuffle exchange network) [25], and the cube connected cycles (CCC) [23], which take less computation time but use larger chip areas. Under the assumption that communication between interconnected processors takes constant time, N numbers can be sorted in O(log^2 N) time on both the CCC and the PSN using N processors [23], [25]. The chip area used is O(N^2/log^2 N) in both cases [23], [14]. These networks require very large areas for a number of matrix manipulation problems. Both the CCC and PSN can multiply two N x N matrices in O(log N) time (again, assuming communication between interconnected nodes takes constant time), but require about N^3 processors if classical matrix multiplication is used. This leads to chip areas of about O(N^6/log^2 N). It is possible to reduce this large processor requirement by using Pan's matrix multiplication technique as described in [10]. However, at present at least O(N^2.49) processors are required to achieve this O(log N) time bound [31], although research is in progress to reduce this figure. This corresponds to a chip area of about O(N^5), which still leads to a substantially nonoptimal AT^2 bound. Similar comments apply to certain graph theoretic problems using the adjacency matrix representation, such as finding the connected components of an undirected graph and finding a minimal spanning tree in a weighted, undirected graph. Also, as in the case of the mesh, these networks are not amenable to pipelining a series of problems.

In this paper we propose and analyse the performance of two simple interconnection networks which we call the orthogonal trees network (OTN) and its derivative, the orthogonal tree cycles (OTC). Using these interconnection schemes, we can efficiently solve a large class of problems such as sorting, FFT, matrix multiplication, finding connected components in a graph, finding a minimal spanning tree in a graph, etc.


The time performance of these networks is comparable to that of the other fast networks such as the CCC and PSN. Interestingly, the asymptotic AT^2 performance of the OTN and OTC for a number of problems is far superior to that of the other fast networks. For example, we have the following.

1) The connected components and a minimal spanning tree of an undirected N-vertex graph can be found on an OTC with an AT^2 of O(N^2 log^8 N) and O(N^2 log^9 N) respectively. Both these problems require an AT^2 of at least O(N^4) on the mesh, PSN and CCC.

2) Two N x N Boolean matrices can be multiplied in O(log^2 N) time on an OTC with an AT^2 of O(N^4 log^2 N). This is much better than the AT^2 performance of other fast, general purpose networks. For example, on the PSN and CCC an AT^2 of about O(N^6) is obtained [23].

Programming the OTN and OTC is simple. Also, mapping algorithms onto these networks is straightforward. Finally, they are amenable to pipelined operation.

It is interesting to note that a number of other researchers have also independently considered the OTN or networks similar to it. The OTN is implicit in Muller and Preparata's sorting network [18], although it is not described explicitly. Brent and Goldschlager [5] have considered a network similar to the OTN for the evaluation of propositional calculus formulas. Leighton [16] has used the OTN (which he calls the mesh of trees) to solve problems such as sorting and matrix-vector multiplication. In addition, he obtains a tight lower bound on the area of the OTN. Finally, in a recent technical report, Capello and Steiglitz use the OTN (which they call the orthogonal forest) for integer multiplication [8].

A. VLSI Model of Computation

The choice of the VLSI model to be used is complicated by the proliferation of models used in the literature. These models differ chiefly in the time required for a bit of information to propagate across a wire. Some researchers [5], [23], [24] work under the assumption that this transfer takes O(1) time, independent of the length of the wire. Others assume a delay of O(log N), where the wire is of length N units [29], [30]. Still others work with an O(N) delay [4], [8]. In the present paper, for the most part we use the logarithmic delay model proposed by Thompson [29]. The salient features of this model are as follows.

1) One bit of logic or storage requires O(1) area.
2) Wires are O(1) units wide and can cross at right angles.
3) A wire of length K has a driver of area K, which consists of log K stages of amplification. Therefore, the delay of the wire and driver together is O(log K). However, the amplifier stages are individually clocked, and pipelining can be used to transmit one bit every O(1) units of time through the wire.

Assumption 3) is of crucial importance to the performance of all known interconnection schemes because the time to communicate between processors generally dominates over processing time, and sets a lower limit on the achievable performance. The model used by Preparata and Vuillemin [23], among others, differs from the above in that it assumes only O(1) delay for communication between interconnected nodes.


Under this assumption, algorithms on the CCC can exhibit time performances which are as small as O(log N). However, using assumption 3), the same algorithms will have an extra factor of log N in the time. This is because the longest wires in the VLSI layout of the CCC are O(N/log N) units long and hence have an O(log N) delay associated with them. Since the O(1) delay model has also been used by many researchers, it would be interesting to compare the performance of the OTN and OTC with other networks under this model as well. This is done in Section VII of this paper.

This paper is organized as follows. Section II introduces the orthogonal trees network and some basic operations on it. These are then applied to the simple problem of sorting N numbers. In Section III we discuss some problems involving matrices and graphs. Section IV contains some recursive algorithms, namely bitonic sort and FFT. In Section V we introduce the orthogonal tree cycles and describe some basic operations on it. Section VI contains algorithms on it, and in Section VII we compare the AT^2 performance of the OTN and OTC with other networks for various problems.

II. THE ORTHOGONAL TREES NETWORK

A. Structure of the Orthogonal Trees Network

The orthogonal trees network can be looked upon as an N x N matrix of processors in which each row and each column of processors forms the leaves of a binary tree. The root and the internal nodes of each binary tree are also processors. This structure is called an (N x N)-OTN for short (see Fig. 1). The N^2 leaf processors form the base of the network and are called base processors (BP's). Each BP is addressed by a pair (i, j) where i is its row index and j its column index. For convenience, the same addressing mechanism is used to refer to registers in the BP's. Thus, A(i, j) refers to register A in BP(i, j). The 2N(N - 1) nonleaf processors are called internal processors (IP's). Most of the processing is done by the BP's. The IP's are used for communication between BP's. During the course of this communication the IP's may also be required to carry out some simple operations, such as summing and extracting the minimum, on the data.

Note that the OTN is a generalization of the tree network, which has been studied extensively [2], [3], [7].

Fig. 1. Layout of a (4 x 4)-OTN.

A simple layout of the OTN on a chip is shown in Fig. 1, where the BP's and IP's are represented as white circles and black dots respectively. Any two adjacent rows or columns of the base are O(log N) distance apart. This interrow (column) area is used to embed the corresponding row (column) tree. (Each processor occupies O(log N) area as we show subsequently, and does not increase the separation between adjacent columns by more than a constant factor.) Since there are N rows and N columns, the total area of the layout is O(N^2 log^2 N). This area has recently been shown to be optimal by Leighton [16], who has also described a new layout for the network. The new layout requires the same O(N^2 log^2 N) area but has a factor of log N fewer wire crossings, and is also more adaptable for use with large processors.

For most problems considered in this paper, the roots of the trees are used for input/output. The roots of the row trees are used as input ports and the roots of the column trees are used as output ports. Both the input and output ports are numbered from 0 to N - 1.

B. Operations on the Orthogonal Trees Network

Operations on the OTN can be broadly classified into two groups:
i) Processing: This refers to processing by the BP's.
ii) Communication: This refers to communication between BP's or between BP's and the roots of their trees. It includes some simple functions like COUNT-LEAFTOROOT (described later) which are implemented by the IP's.

In what follows, we describe some of the commonly used communication operations in the OTN, starting with some primitive operations.

1) ROOTOLEAF (Vector, Dest): "Vector" here refers to either a row or a column of BP's. The contents of the data register in the root of the corresponding tree are broadcast to the leaves of the tree. "Dest" refers to a set of BP's in Vector and to a register R in these BP's which will receive the broadcast data, and is specified by a pair (Selector, R). "Selector" selects a subset of the BP's in "Vector." The broadcast data are placed in register R in each BP selected. For example, ROOTOLEAF (row (0), dest = (j: j is even, A)) will broadcast the data available at the root of row tree (0). The data will be placed in the A register of all BP's with addresses of the form (0, j) where j is even. This operation is implemented by making each IP pick up data from its parent and pass it on to its sons.

2) LEAFTOROOT (Vector, Source): "Source" is also a pair of the form (Selector, R). Selector specifies one BP in Vector whose register R contents are sent to the root of the corresponding tree. For example, LEAFTOROOT (column (0), source = (5, B)) selects BP(5, 0) and sends the contents of its B register to the root of column tree (0).
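The following Python sketch is our illustration of the data movement these two primitives perform; it reduces an (N x N)-OTN to plain arrays, abstracts away the tree routing and bit-serial delays entirely, and all names and signatures are ours, not the paper's.

# Hypothetical register-level model of an (N x N)-OTN; the operation names
# mirror the paper's, but the data structures are an assumption of ours.
N = 4
regs = {r: [[0] * N for _ in range(N)] for r in "AB"}  # regs[R][i][j] models R(i, j)
row_root = [None] * N   # data register at the root of each row tree
col_root = [None] * N   # data register at the root of each column tree

def rootoleaf_row(i, dest_reg, selector):
    """ROOTOLEAF on row (i): broadcast the row-root datum into register
    dest_reg of every BP chosen by selector."""
    for j in range(N):
        if selector(j):
            regs[dest_reg][i][j] = row_root[i]

def leaftoroot_col(j, src_reg, i):
    """LEAFTOROOT on column (j): send register src_reg of BP(i, j) to the
    root of column tree (j)."""
    col_root[j] = regs[src_reg][i][j]

row_root[0] = 42
rootoleaf_row(0, "A", lambda j: j % 2 == 0)  # ROOTOLEAF(row(0), dest = (j even, A))
leaftoroot_col(0, "A", 0)                    # LEAFTOROOT(column(0), source = (0, A))
assert col_root[0] == 42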


The operation is implemented by making each IP of the tree pick up data (if available) from its son and pass it on to its parent.

3) COUNT-LEAFTOROOT (Vector): Each BP is assumed to contain a single bit flag. COUNT-LEAFTOROOT counts the number of flags in "Vector" set to 1, and makes the result available at the root of the corresponding tree. The operation is performed by having each IP in the tree add up the counts accumulated by its two sons and pass the sum on to its parent.

4) SUM-LEAFTOROOT (Vector, Source): Source again specifies a set of BP's and a register R. The contents of R in all selected BP's are added and the result appears at the root. The implementation is similar to that of COUNT-LEAFTOROOT.
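A minimal sketch (ours) of the reduction pattern behind COUNT-LEAFTOROOT and SUM-LEAFTOROOT: each IP combines the values of its two sons and passes the result to its parent, so after log N levels the root holds the total.

def tree_reduce(leaves, combine):
    """Reduce a power-of-two list of leaf values level by level, the way
    the IP's of one row or column tree combine their sons' values."""
    level = list(leaves)
    while len(level) > 1:
        level = [combine(level[k], level[k + 1]) for k in range(0, len(level), 2)]
    return level[0]

flags = [1, 0, 1, 1, 0, 0, 1, 0]
print(tree_reduce(flags, lambda a, b: a + b))  # COUNT-LEAFTOROOT analogue: 4
print(tree_reduce([5, 9, 2, 7], min))          # MIN-LEAFTOROOT analogue: 2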


To analyze the time taken by each primitive, we make the following assumptions.
i) All numbers being used are O(log N) bits long.
ii) Both communication and processing are bit serial.

Each primitive operation involves movement of an O(log N) bit word from the root to the leaves of a tree or vice versa. The longest branch in this path is O(N log N) units long and hence introduces an O(log N) delay. Since there are log N branches in the path, transmitting one bit from root to leaf or vice versa takes O(log^2 N) time. The O(log N) bits can be transmitted in a pipeline at intervals of O(1) units of time, and therefore each primitive takes O(log^2 N) time.

Next we use these primitives to define some composite operations.

1) LEAFTOLEAF (Vector, Source, Dest): This can be expressed as the sequence
LEAFTOROOT (Vector, Source);
ROOTOLEAF (Vector, Dest).
(In this and in all subsequent composite operations defined, Vector is the same in all primitives which make up the composite.) This operation transfers an entire O(log N) bit word to the root bit-serially, and when the entire word is available in the root it is transferred to the destination leaves, again bit by bit. It is possible to have a different implementation in which, as each bit reaches the root, it is sent down. This implementation requires only O(1) storage at each IP.

2) COUNT-LEAFTOLEAF (Vector, Dest): This is expressed as
COUNT-LEAFTOROOT (Vector);
ROOTOLEAF (Vector, Dest).

3) SUM-LEAFTOLEAF (Vector, Source, Dest): This is expressed as
SUM-LEAFTOROOT (Vector, Source);
ROOTOLEAF (Vector, Dest).

Each of these composite operations is composed of two primitives taking O(log^2 N) time. Hence, each composite also takes O(log^2 N) time.

At this point, we sketch briefly the internal structure and area of each processor. Since each word is O(log N) bits long, we require a few (three or four) O(log N) bit registers in each BP. Also, bit-serial operations like comparing two numbers or adding two numbers can be done with O(1) logic and hence O(1) area. In Section III we require the multiplication operation on two O(log N) bit numbers. This can be done using O(log N) area and O(log N) time, by the "serial pipeline multiplication technique" [6], [13]. Therefore, O(log N) area is sufficient for each BP, and also for each IP, which is simpler than a BP.

Since the original submission of this paper, Thompson [31] has shown how each of these communication operations can be implemented in just O(log N) time instead of O(log^2 N) time. He uses a technique called "scaling" in which each IP is a constant factor larger than its children. Interestingly, the area is maintained at O(N^2 log^2 N). In the present paper, however, we assume that each communication operation takes O(log^2 N) time.

As a simple application of these operations, we now give an algorithm to sort N numbers in O(log^2 N) time. It is assumed that the numbers are initially available at the input ports, and are all distinct. Procedure SORT-OTN makes the numbers available in ascending order at the output ports.

Procedure SORT-OTN
for each i (0 <= i < N) pardo
begin
  1) ROOTOLEAF (row (i), dest = (all, A));
  2) LEAFTOLEAF (column (i), source = (i, A), dest = (all, B));
  3) for each j (0 <= j < N) pardo
       flag (i, j) := if A(i, j) > B(i, j) then 1 else 0;
  4) COUNT-LEAFTOLEAF (row (i), dest = (all, R));
  5) LEAFTOROOT (column (i), source = (j: R(j, i) = i, A))
end
end SORT-OTN.

After steps 1) and 2), each BP(i, j) contains x(i) in its A register and x(j) in its B register, where x(i) refers to the ith number. Step 3) compares each x(i) with each other x(j) and sets a flag if x(i) > x(j). After step 4), the rank of x(i) is available in register R of each BP in row (i), the rank being between 0 and N - 1. The next step picks up for the ith column the ith ranked element and places it at the root.

Since all steps take O(log^2 N) time except step 3), which is done in O(log N) time, the overall time complexity of the algorithm is O(log^2 N).

If the numbers are not all distinct, step 3) is modified as follows.

3) for each j (0 <= j < N) pardo
     flag (i, j) := if A(i, j) > B(i, j) or (A(i, j) = B(i, j) and i > j)
                    then 1
                    else 0;

The time analysis is unchanged.
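The following sequential sketch (ours) computes exactly the ranks that SORT-OTN computes in parallel, using the tie-breaking rule of the modified step 3): x(i) outranks x(j) when x(i) > x(j), or when the two are equal and i > j, which makes the ranks distinct even for repeated keys.

def sort_otn_reference(x):
    """Sequential reference for SORT-OTN: rank(i) counts the j's for which
    flag(i, j) would be set; x(i) is then placed at output port rank(i)."""
    n = len(x)
    out = [None] * n
    for i in range(n):
        rank = sum(1 for j in range(n)
                   if x[i] > x[j] or (x[i] == x[j] and i > j))
        out[rank] = x[i]
    return out

print(sort_otn_reference([3, 1, 3, 2]))  # [1, 2, 3, 3]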


We mention here that the technique of sorting by computing ranks is a fairly well known one. It has been discussed by Muller and Preparata [18] and Nassimi and Sahni [20] among others.

III. MATRIX AND GRAPH ALGORITHMS ON THE ORTHOGONAL TREES NETWORK

A. Matrix Multiplication

Consider the following vector-matrix product,

c(j) = Σ_{k=0}^{N-1} a(k) b(k, j),  0 <= j < N.

To multiply two N x N matrices, the rows A_i of the first matrix are fed through the network in a pipeline:

for i = 0 to N - 1 pipedo
  VECTORMATRIXMULT-OTN (A_i, [b(k, j)]).

pipedo indicates that procedure VECTORMATRIXMULT-OTN is performed in a pipeline, assuming the A_i's to be available successively at the input ports. The separation in time between successive i's in the pipeline is O(log N) units because all numbers have been assumed to be O(log N) bits long. (For this reason, in the rest of this paper pipelining implies a separation of O(log N) time between successive elements in the pipeline, unless otherwise specified.) The resultant matrix is available in row major order at the output ports, the first row appearing O(log^2 N) time after A_0 is input and successive rows being separated by O(log N) units of time.
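For concreteness, a sequential sketch (ours, not the paper's procedure body) of the arithmetic one vector-matrix step performs: each c(j) is the inner sum over k, which on the OTN would be accumulated by a column-tree summation; pipelining the rows of A then yields the full product C = AB.

def vector_matrix_mult(a, B):
    """c(j) = sum over k of a(k) * B[k][j]; the column-tree reduction
    (SUM-LEAFTOROOT) plays the role of the inner sum."""
    n = len(a)
    return [sum(a[k] * B[k][j] for k in range(n)) for j in range(n)]

def matrix_mult_pipelined(A, B):
    # Feeding row A[i] into the network at step i yields row i of A * B.
    return [vector_matrix_mult(row, B) for row in A]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matrix_mult_pipelined(A, B))  # [[19, 22], [43, 50]]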


if J > 1
then
  begin
    for each i (0 <= i < K) pardo COMPEX-OTN (column (i), J);
    for each of the two (J/2 x K) bitonic sequences formed pardo
      BITONICMERGE-OTN (J/2, K)
  end
else
  if K > 1 then
    begin
      COMPEX-OTN (row, K);
      for each of the two (1 x K/2) bitonic sequences formed pardo
        BITONICMERGE-OTN (1, K/2)
    end
end BITONICMERGE-OTN.
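For reference, a sequential version (ours) of the compare-exchange recursion that BITONICMERGE-OTN distributes over the rows and columns: the loop corresponds to one COMPEX (compare-exchange) step, and the two recursive calls merge the two bitonic halves it produces.

def bitonic_merge(a, ascending=True):
    """Merge a bitonic sequence of power-of-two length into sorted order."""
    n = len(a)
    if n == 1:
        return a
    half = n // 2
    for i in range(half):                  # the compare-exchange step
        if (a[i] > a[i + half]) == ascending:
            a[i], a[i + half] = a[i + half], a[i]
    return (bitonic_merge(a[:half], ascending) +
            bitonic_merge(a[half:], ascending))

print(bitonic_merge([1, 4, 7, 8, 6, 5, 3, 2]))  # [1, 2, 3, 4, 5, 6, 7, 8]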

We mention at this point that this implementation is very similar to that of Nassimi and Sahni [19] on the mesh. The major difference is in the way communication takes place: along the mesh in [19] and along the trees in the OTN. In fact, the bitonic sort algorithm of [19] (which is based on bitonic merging) can also be implemented on the OTN by modifying the interprocessor communication in a similar manner. Details of the implementation are given in [21]. It must be pointed out, however, that the OTN uses O(N log^2 N) area for an O(N^1/2 log N) time bound, whereas an O(N^1/2) time bound can be obtained on a mesh of equal area [29].

B. Discrete Fourier Transform

As discussed in [29], the FFT algorithm for computing an N-element DFT has a structure very similar to that of bitonic merging. By using an implementation similar to BITONICMERGE-OTN, we can compute the DFT in O(N^1/2 log N) time on an (N^1/2 x N^1/2)-OTN. Details are omitted for brevity.
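As a point of reference only (our sketch, not the paper's OTN implementation), the radix-2 decimation-in-time recursion below is the FFT structure being referred to; on the OTN its butterfly stages would be scheduled like the compare-exchange stages of bitonic merging.

import cmath

def fft(x):
    """Radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    tw = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(n // 2)]
    return ([even[k] + tw[k] for k in range(n // 2)] +
            [even[k] - tw[k] for k in range(n // 2)])

print([round(abs(v), 3) for v in fft([1, 1, 1, 1, 0, 0, 0, 0])])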

V. THE ORTHOGONAL TREE CYCLES

A. Structure of the Orthogonal Tree Cycles

For most of the problems discussed in this paper, the processing in the base of a (K x K)-OTN is of the following type: a binary operation is performed in parallel on each pair of elements (a(i), b(j)), 0 <= i, j < K. This is done by keeping pair (a(i), b(j)) in BP(i, j). If the time taken for these K^2 operations is t, the same operations can be performed in O(Kt) time on a cycle of BP's of length K. Pair (a(i), b(i)) is kept in the ith BP in the cycle. Then, by keeping the a(i)'s fixed and circulating the b(j)'s around the cycle, all K^2 operations can be performed in t' = O(Kt) time. Under bit-serial operation t = O(log N) and t' = O(K log N).

This idea can be used to reduce the area without increasing the time, in the following way. Consider an (N/log N x N/log N)-OTN with the difference that each BP is replaced by a cycle of BP's of length log N. Each internal node is still a single processor (IP). We call such a structure the orthogonal tree cycles (or an (N/log N x N/log N)-OTC). The base of the OTC is now a matrix of cycles. BP's in each cycle are numbered from 0 to log N - 1. Each BP (and its registers) is addressed by a triple (i, j, q) where i and j identify the row and column respectively of the cycle, and q is its position within this cycle. BP(0) in each cycle is connected to the row and column trees for the cycle. Figs. 2 and 3 show a possible layout of a cycle and the complete OTC respectively. Each cycle is laid out horizontally, and since each BP of the cycle is an O(log N) x O(1) rectangle, the separation between adjacent rows and columns of the OTC is O(log N). This leads to an overall area of O(N^2). Note that this is the same as the area of an (N/log N x N/log N)-OTN.

The (N/log N x N/log N)-OTC can implement a number of algorithms such as sorting, finding connected components and finding a minimal spanning tree in the same time as an (N x N)-OTN, while using less area. The reasons for this are as follows. If the base of the OTN is considered to be composed of squares of log N x log N BP's each, then the processing in square (i, j) of the OTN can be simulated by cycle (i, j) of the OTC as described at the start of this section. To describe the simulation of communication operations, consider the problem of broadcasting N elements available at the roots of the N row (or column) trees of the OTN, to the leaves. Each root broadcasts one element and O(log^2 N) time is taken. In the OTC, log N elements must be sent to each cycle and therefore each root must broadcast log N elements. Specifically, if the row trees of the OTN are grouped into N/log N groups of log N each, the ith group is simulated by the ith row tree of the OTC. Since each element has O(log N) bits, the log N elements can be broadcast by each OTC tree in a pipeline, adjacent elements being separated by O(log N) time units. Therefore the broadcast of all N elements from the roots to the leaves takes O(log^2 N) time on the OTC, which is the same as the time taken on the OTN. All the other communication operations can similarly be shown to require the same time. Processing at the base of the OTC is now slower than on the OTN. However, for most problems it is the communication time which dominates, and therefore the time required on the OTC is the same as on the OTN but the area required is less.

It is interesting to note that the idea of replacing one processor by a cycle of log N processors has also been used to advantage by Preparata and Vuillemin [23] for the cube connected cycles. It seems as though this technique has wide applicability in reducing processor requirement without affecting time performance.
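A small sketch (ours) of the trade described above: with the a(i)'s held fixed and the b(j)'s circulated one position per step, a cycle of K BP's covers all K^2 pairs in K steps.

def cycle_all_pairs(a, b, op):
    """Apply op to every pair (a[i], b[j]) on a K-cycle: in step s, BP(i)
    holds b[(i + s) mod K], so K rotations cover all K^2 pairs."""
    k = len(a)
    results = {}
    for step in range(k):
        for i in range(k):
            j = (i + step) % k          # which b-value BP(i) holds this step
            results[(i, j)] = op(a[i], b[j])
        # between steps, CIRCULATE passes each b-value to the next BP
    return results

out = cycle_all_pairs([1, 2, 3], [10, 20, 30], lambda x, y: x * y)
print(out[(0, 2)], out[(2, 1)])  # 30 60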


Fig. 2. Layout of a cycle of the OTC.

B. Operations on the Orthogonal Tree Cycles

First we define a communication operation which is local to each cycle: CIRCULATE (i, j, Register Set), where (i, j) identifies a cycle. The implementation is as follows:

for each register R in Register Set do
  for each q (0 <= q < log N) pardo
    R(i, j, q) := R(i, j, (q + 1) mod (log N)).

Using this operation, we define a composite operation VECTORCIRCULATE (Vector, Register Set), which is implemented as

for each cycle (i, j) in Vector pardo
  CIRCULATE (i, j, Register Set).

Note that Vector is a set of cycles. Both CIRCULATE and VECTORCIRCULATE involve transmission of O(log N) bit words bit-serially over a distance of at most O(log N) and therefore take O(log N) time.

Fig. 3. Left half of the layout of a (4 x 4)-OTC. N = 16, log N = 4.

We now describe a series of composite operations on the OTC for communication between cycles.

1) ROOTOCYCLE (Vector, Dest): Vector is again a row/column of cycles, and Dest is a pair (Selector, R) which selects a set of cycles in Vector and a register R. The implementation is as follows.

for p := 0 to log N - 1 pipedo
begin
  ROOTOLEAF (Vector, Dest);
  VECTORCIRCULATE (Vector, {R})
end.

The ROOTOLEAF operation above is similar to the ROOTOLEAF operation on the OTN, with the difference that in the OTC, BP(0) of each selected cycle acts as the destination processor. (In fact, all operations defined for the OTN can also be defined for the OTC in a similar manner, with BP(0) of a cycle in the OTC acting as the corresponding BP of the OTN.) For the ROOTOCYCLE operation we assume that log N numbers are available at intervals of O(log N) time at the root of the tree on Vector. These numbers are transferred to the selected cycles in a pipeline, O(log N) time apart. Number (q) is finally available in R(q) of each selected cycle.

2) CYCLETOROOT (Vector, Source): This is defined as the following.

for p := 0 to log N - 1 pipedo
begin
  LEAFTOROOT (Vector, Source);
  VECTORCIRCULATE (Vector, Source Register Set)
end.

The source register set specified in the VECTORCIRCULATE operation of CYCLETOROOT includes R and also all registers (if any) specified in the selector. For example, in CYCLETOROOT (row (i), source = (j: A(i, j) = 1, B)), the source register set is {A, B}. The effect of this operation is to make log N numbers available at intervals of O(log N) time at the root. Number (q) is taken from register B(q) of cycle (i, j) such that register A(q) in this cycle contains a 1. As there are log N VECTORCIRCULATE operations in all, the contents of the source registers are invariant before and after the operation.

If the LEAFTOROOT step in CYCLETOROOT is replaced by SUM-LEAFTOROOT or MIN-LEAFTOROOT, we get composite operations SUM-CYCLETOROOT and MIN-CYCLETOROOT respectively.
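The following sketch (ours) captures the CYCLETOROOT pattern: log N rounds of "emit the value at BP(0), then CIRCULATE" stream a cycle's contents past its tree connection one value per round, and the register contents are back in place once all rounds complete.

def cycletoroot(cycle_regs):
    """Emit cycle_regs[0], rotate, repeat; after len(cycle_regs) rounds the
    registers return to their original positions."""
    regs = list(cycle_regs)
    emitted = []
    for _ in range(len(regs)):
        emitted.append(regs[0])        # BP(0) is attached to the trees
        regs = regs[1:] + regs[:1]     # CIRCULATE by one position
    assert regs == list(cycle_regs)    # contents are invariant overall
    return emitted

print(cycletoroot([7, 1, 9, 4]))  # [7, 1, 9, 4]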

3) CYCLETOCYCLE (Vector, Source, Dest): This is defined as the following.

CYCLETOROOT (Vector, Source);
ROOTOCYCLE (Vector, Dest).


Operation CYCLETOCYCLE results in the data in BP(q) of the source cycle being available in BP(q) of each destination cycle (0 <= q < log N).

Composite operations SUM-CYCLETOCYCLE and MIN-CYCLETOCYCLE can be defined in a manner similar to SUM-CYCLETOROOT and MIN-CYCLETOROOT.

Each of the above operations involves a pipeline of length O(log^2 N) in which log N elements are transmitted at O(log N) intervals of time. Therefore the time required is O(log^2 N).

VI. ALGORITHMS ON THE ORTHOGONAL TREE CYCLES

We now show how the ideas of the previous section can be used to implement OTN algorithms on the OTC. Since the transformation to be applied is similar for all algorithms, we describe it in detail only for the sorting algorithm.

A. Sorting

Since the number of input ports available is only N/log N, log N numbers will have to be entered through each port. We assume that at each input port the numbers are entered at time intervals of O(log N) units. The procedure SORT-OTC sorts these numbers and places them at the output ports. First the N/log N smallest numbers appear in ascending order. O(log N) time later the next N/log N smallest numbers are output, and so on.

Procedure SORT-OTC
for each i (0 <= i < N/log N) pardo
begin
  1) ROOTOCYCLE (row (i), dest = (all, A));
  2) CYCLETOCYCLE (column (i), source = (i, A), dest = (all, B));
  3) for each j (0 <= j < N/log N) pardo
     begin
       3.1) for each BP(q) in cycle (i, j) pardo C(q) := 0;
       3.2) for p := 0 to log N - 1 do
            begin
              for each BP(q) in cycle (i, j) pardo
                if A(q) > B(q) then C(q) := C(q) + 1;
              CIRCULATE (i, j, B)
            end
     end
  4) SUM-CYCLETOCYCLE (row (i), source = (all, C), dest = (all, R));
  5) for p := 0 to log N - 1 pipedo
     begin
       5.1) for each j (0 <= j < N/log N) pardo
              if cycle (i, j) contains in some A register the number with
                rank = (N/log N) * p + j
              then move it to register D(0) in the cycle
              else load NULL in D(0);
       5.2) LEAFTOROOT (column (i), source = (j: D(i, j) ≠ NULL, D))
     end
end
end SORT-OTC.

B. Other Algorithms

In the same manner as procedure SORT-OTN was converted to SORT-OTC, we can convert the matrix and graph algorithms of Section III to run on the OTC. Details are omitted for brevity and can be obtained from [21]. We simply mention the area and time figures obtained.

Matrix multiplication requires O(log^2 N) time and O(N^4) area if an (N^2/log N^2 x N^2/log N^2)-OTC is used. In case the matrices are Boolean, the interval between successive elements in a pipeline can be reduced to O(1), and O(log^2 N) elements transmitted through each tree. Moreover, the length of each cycle can be increased to O(log^2 N) and the number of cycles reduced to N^2/log^2 N x N^2/log^2 N. Each BP now occupies O(1) x O(1) area on a chip, so that each cycle can fit into an O(log N) x O(log N) area as before. Therefore the spacing between adjacent rows and adjacent columns remains O(log N), so that the entire layout occupies only O(N^4/log^2 N) area. Note that the time taken remains O(log^2 N).

The algorithm for finding connected components now requires O(N^2) area for the same O(log^4 N) time as before. Note that each cycle must store a log N x log N submatrix of the adjacency matrix. In the MST algorithm, the area goes down to O(N^2 log N) and not O(N^2). This is because the entire N x N weight matrix must be stored on the chip, and each element requires O(log N) bits.

It must be pointed out that algorithms such as FFT and bitonic sort cannot take advantage of the reduced area of the OTC. This is because these algorithms transmit many elements in a pipeline, even in the OTN. Recently, improved algorithms have been developed for the OTN [21]. In fact, for problems such as sorting and matrix multiplication, these algorithms have the same area and time performance as algorithms on the OTC.

VII. COMPARISON WITH OTHER NETWORKS

In this section we compare the area * time^2 performance of both the OTN and the OTC with other interconnection schemes like the mesh, PSN and CCC, for the problems discussed in this paper. For meaningful comparisons we consider the performance of all interconnection schemes under Thompson's logarithmic delay model [29]. Subsequently, we also compare the networks under the O(1) delay model. Note that even under Thompson's model it is possible to improve the time performance of the OTN by a factor of log N by using "scaling" [31].

A. Sorting

Table I lists the performance of various networks for the problem of sorting N numbers. Note that the O(log^2 N) algorithm for sorting on the CCC in [23] requires O(log^3 N) time under Thompson's model. We mention at this point that some of the figures given in the table have recently been improved upon. In particular, it is shown in [21] that the OTN can achieve the same O(N^2 log^4 N) AT^2 bound as the OTC. Moreover, research is currently in progress to develop improved sorting algorithms on the PSN and CCC.


TABLE I
SORTING

Network      Area          Time      Area * Time^2
Mesh [29]    N log^2 N     N^1/2     N^2 log^2 N
PSN [30]     N^2/log^2 N   log^3 N   N^2 log^4 N
CCC [23]     N^2/log^2 N   log^3 N   N^2 log^4 N
OTN          N^2 log^2 N   log^2 N   N^2 log^6 N
OTC          N^2           log^2 N   N^2 log^4 N

B. Boolean Matrix Multiplication

The performance of various networks for the matrix multiplication problem is shown in Table II.

TABLE II
BOOLEAN MATRIX MULTIPLICATION

Network      Area          Time      Area * Time^2
Mesh [15]    N^2           N         N^4
PSN [10]     N^6/log N     log^2 N   N^6 log^3 N
CCC [23]     N^6/log^2 N   log^2 N   N^6 log^2 N
OTN          N^4 log^2 N   log^2 N   N^4 log^6 N
OTC          N^4/log^2 N   log^2 N   N^4 log^2 N

Note that on both the OTN and OTC, while the time performance matches that of the other fast networks, the AT^2 performance is significantly better. The figures given for the PSN and CCC refer to the implementation of classical matrix multiplication using N^3 processors. The number of processors can be reduced to about N^2.49 by using Pan's matrix multiplication technique [10], [31], but the corresponding AT^2 of about O(N^5) is still much worse than that of the OTN and OTC. In fact, if the AT^2 of the PSN and CCC is improved any further, it would automatically lead to a more efficient sequential algorithm than the O(N^2.49) algorithm presently known. We mention at this point that an optimal O(N^4) AT^2 figure has been obtained by Preparata and Vuillemin for this problem [24], but this uses a special purpose network. Since the initial submission of this paper, we have also become aware of the work of Schwartz [28] and Leighton [16] on this problem. Schwartz describes efficient matrix multiplication and graph algorithms on the PSN. Leighton describes an interesting network called the three-dimensional mesh of trees (a generalization of the OTN to three dimensions). Using this network, he is able to get an efficient AT^2 bound for matrix multiplication (area = O(N^4), time = O(log N), AT^2 = O(N^4 log^2 N)).

C. Graph Algorithms

Table III compares various networks for the connected components problem (the area and time figures for finding a minimal spanning tree are similar).

TABLE III
CONNECTED COMPONENTS

Network      Area          Time      Area * Time^2
Mesh [11]    N^2           N         N^4
PSN, CCC     N^4/log^4 N   log^4 N   N^4 log^4 N
OTN          N^2 log^2 N   log^4 N   N^2 log^10 N
OTC          N^2           log^4 N   N^2 log^8 N

Interestingly, for both these problems the OTC and OTN have time performances comparable to fast-but-large networks, while using chip areas comparable to slow-but-small networks. As a result, the AT^2 figures on the OTN and OTC are substantially better than for both existing classes of networks.

Recently, improved algorithms for connected components and minimal spanning trees have been proposed [22]. These algorithms require only O(log^3 N log log N) time on the OTN, OTC, PSN and CCC, while the area requirements remain the same as in Table III. We conjecture that even faster algorithms are possible.

The figures given in this table for the PSN and CCC refer to a straightforward implementation of algorithm CONNECT of [12].

As mentioned earlier, these figures have been improved somewhat in [22]. It may even be possible to improve upon them further. However, the AT^2 obtained on these networks cannot be better than Ω(N^4/log^2 N). This is because Ω(N^2) operations are necessary if the adjacency matrix representation is used [33]. Therefore, for an algorithm taking time t, Ω(N^2/t) processors would be needed, leading to an area of Ω((N^2/t)^2/log^2 (N^2/t)) on both the CCC and PSN. The resulting lower bound on AT^2 is Ω(N^4/log^2 N).

D. Comparisons Under the "Constant Delay" VLSI Model

Since much of the literature is based on the O(1) or "constant delay" VLSI model [5], [23], [24], it is worthwhile to compare the OTN and OTC with other networks based on this model. Consider the ROOTOLEAF operation under this model. Since each word has been assumed to be O(log N) bits long and is transmitted bit-serially between IP's, it still takes O(log N) time to send it from an IP to its children. Therefore, the entire operation takes O(log^2 N) time as before. However, a different implementation is possible, in which, as each bit is received by an IP, it is transmitted forward. Effectively, the first bit of the word moves from the root to the leaves in just O(log N) time, and the other bits follow in a pipeline at intervals of O(1) time. Therefore, the ROOTOLEAF operation takes O(log N) time.

The other operations are implemented in a similar manner, except that in SUM-LEAFTOROOT and MIN-LEAFTOROOT, each word must be operated upon by an IP before it is passed on. For this reason, the order in which the bits of a word are passed on becomes significant. In the SUM-LEAFTOROOT operation, the least significant bits should arrive first; otherwise it will be necessary to receive all the bits before the sum can be transmitted. On the other hand, in the MIN-LEAFTOROOT operation, the most significant bits (MSB's) should arrive first, because it is not possible to decide which of two numbers is smaller without looking at their MSB's.
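A short sketch (ours) of why the bit order matters: bit-serial addition can emit each result bit as soon as the corresponding least significant bits have arrived, while selecting the minimum must see the most significant bits first.

def serial_add_lsb_first(a_bits, b_bits):
    """Add two numbers streamed LSB first; each output bit depends only on
    the bits seen so far, so an IP can forward it immediately."""
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):
        s = a + b + carry
        out.append(s & 1)
        carry = s >> 1
    return out + [carry]

def serial_min_msb_first(a_bits, b_bits):
    """Pick the smaller of two equal-length numbers streamed MSB first:
    the first position where the bits differ decides the winner."""
    for a, b in zip(a_bits, b_bits):
        if a != b:
            return a_bits if a < b else b_bits
    return a_bits  # equal

print(serial_add_lsb_first([1, 0, 1], [1, 1, 0]))  # 5 + 3 = 8 -> [0, 0, 0, 1]
print(serial_min_msb_first([1, 0, 1], [0, 1, 1]))  # min(5, 3) -> [0, 1, 1]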


Consider the sorting problem under this model. It is fairly easy to see that the OTN requires only O(log N) time for the complete algorithm. Both the PSN and the CCC require O(log^2 N) time [23], [25]. The time performance of the mesh does not change because it has only short wires and is therefore unaffected by changes in communication time.

Interestingly, under this new model there is no longer any need for the OTC. The whole idea behind the OTC was that log N words could be broadcast in a pipeline in the same O(log^2 N) time as one word. Under the new model, transmitting one word requires only O(log N) time whereas log N words would take O(log^2 N) time. So, the reduction in area that the OTC achieves would be offset by an increase in time.

The AT^2 figures for sorting on these various networks are shown in Table IV.

TABLE IV
SORTING UNDER THE CONSTANT DELAY MODEL

Network      Area          Time      Area * Time^2
Mesh         N log^2 N     N^1/2     N^2 log^2 N
PSN          N^2/log^2 N   log^2 N   N^2 log^2 N
CCC          N^2/log^2 N   log^2 N   N^2 log^2 N
OTN          N^2 log^2 N   log N     N^2 log^4 N

Note that there is no substantial change due to the change in model. In fact, this is true for all the problems discussed in the paper.

VIII. CONCLUSIONS

We have described the orthogonal trees network and the orthogonal tree cycles and shown that they permit efficient VLSI implementation of a number of algorithms. The salient features of these networks are as follows.

1) They are easy to program, particularly the OTN. Further, algorithms have a straightforward mapping onto them.

2) They exhibit an asymptotic AT^2 performance far superior to existing general purpose networks for some problems involving matrices and graphs.

3) For a number of other problems such as sorting and DFT, they have AT^2 performances comparable to those of existing networks. In view of 2) and 3), they can be used either as special purpose chips or as general purpose networks.

4) As mentioned in the Introduction, for a number of problems these networks are amenable to pipelining. Consider for instance the problem of sorting N numbers on an (N x N)-OTN. In algorithm SORT-OTN the flow of computation is from the row roots down to the column roots, back to the row roots, and finally down to the column roots again. Thus the algorithm has three phases. At any stage of the computation, only processors at one level of the network are active (for instance at the base, or all processors at a height K in the row or column trees). Since there are O(log N) such levels, there can be O(log N) distinct problems in the network at one time, each in a different stage of computation and separated by O(log N) time. However, for this pipelining to work, two important points must be borne in mind. Firstly, a single problem uses each processor in the network exactly once during each phase of computation. With pipelining, each processor will need to execute three different steps simultaneously, one in each phase. This problem can be solved by allocating three time slices to each processor and assigning one to each phase. Secondly, during the LEAFTOLEAF operation in step 2), log N other sets of numbers arrive in the base and must be stored. This requires O(log^2 N) bits of storage at each BP and does not increase the area requirement of the network. If an (N x N)-OTC is used, the O(log^2 N) bits of storage are already available in each cycle and the same procedure can be used. In either case, a new set of sorted numbers is output every O(log N) time units. Since the area is O(N^2 log^2 N) in both cases, the pipelined AT^2 performance is O(N^2 log^4 N). Interestingly, this is the same as the AT^2 performance of the OTC without using pipelining.

ACKNOWLEDGMENT

We are grateful to the referees, whose comments have led to a substantial improvement in the paper.

REFERENCES

[1] K. E. Batcher, "Sorting networks and their applications," in Proc. AFIPS SJCC, vol. 32, Apr. 1968, pp. 307-314.
[2] J. L. Bentley, "A parallel algorithm for constructing minimum spanning trees," Dep. Comput. Sci., Carnegie-Mellon Univ., Tech. Rep., Aug. 1979.
[3] J. L. Bentley and H. T. Kung, "A tree machine for searching problems," Dep. Comput. Sci., Carnegie-Mellon Univ., Tech. Rep., Sept. 1979.
[4] G. Bilardi, M. Pracchi, and F. P. Preparata, "A critique and appraisal of VLSI models of computation," in Proc. CMU Conf. VLSI Syst. Computat., Oct. 1981, pp. 81-88.
[5] R. P. Brent and L. M. Goldschlager, "Some area-time trade offs for VLSI," Australian National Univ., Tech. Rep., Dec. 1980.
[6] R. P. Brent and H. T. Kung, "The area-time complexity of binary multiplication," J. Ass. Comput. Mach., vol. 28, pp. 521-534, July 1981.
[7] S. Browning, "Computation on a tree of processors," Caltech Internal Memorandum.
[8] P. R. Capello and K. Steiglitz, "Area-efficient VLSI structures for multiplying at clock rate," Dep. Elec. Eng. Comput. Sci., Princeton Univ., Sept. 1981.
[9] B. Chazelle and L. Monier, "A model of computation for VLSI with related complexity results," in Proc. 13th Annu. Ass. Comput. Mach. Symp. Theory Computat., May 1981, pp. 318-325.
[10] E. Dekel, D. Nassimi, and S. Sahni, "Parallel matrix and graph algorithms," Dep. Comput. Sci., Univ. Minnesota, June 1979.
[11] L. J. Guibas, H. T. Kung, and C. D. Thompson, "Direct VLSI implementation of combinatorial algorithms," Dep. Comput. Sci., Carnegie-Mellon Univ., Res. Rep., Mar. 1979.
[12] D. S. Hirschberg, A. K. Chandra, and D. V. Sarwate, "Computing connected components on parallel computers," Commun. Ass. Comput. Mach., vol. 22, pp. 461-464, Aug. 1979.
[13] L. B. Jackson, S. F. Kaiser, and H. S. MacDonald, "An approach to the implementation of digital filters," IEEE Trans. Audio Electroacoust., vol. AU-16, pp. 413-421, Sept. 1968.
[14] D. Kleitman, F. T. Leighton, M. Lepley, and G. L. Miller, "New layouts for the shuffle-exchange graph," in Proc. 13th Ass. Comput. Mach. Symp. Theory Comput., May 1981, pp. 278-292.
[15] H. T. Kung and C. E. Leiserson, "Algorithms for VLSI processor arrays," in Proc. Symp. Sparse Matrix Computat., Knoxville, TN, Nov. 1978.
[16] F. T. Leighton, "New lower bound techniques for VLSI," in Proc. 22nd Annu. IEEE Symp. Foundations Comput. Sci., Oct. 1981, pp. 1-12.
[17] K. N. Levitt and W. H. Kautz, "Cellular arrays for the solution of graph problems," Commun. Ass. Comput. Mach., vol. 15, pp. 789-801, 1972.
[18] D. E. Muller and F. P. Preparata, "Bounds to complexities of networks for sorting and switching," J. Ass. Comput. Mach., vol. 22, pp. 195-201, Apr. 1975.
[19] D. Nassimi and S. Sahni, "Bitonic sort on a mesh-connected parallel computer," IEEE Trans. Comput., vol. C-28, pp. 2-7, Jan. 1979.
[20] D. Nassimi and S. Sahni, "Parallel permutation and sorting algorithms and a new generalized-connection-network," Dep. Comput. Sci., Univ. Minnesota, Tech. Rep. 79-8, Apr. 1979.
[21] D. Nath, "Efficient VLSI networks and parallel algorithms based on them," Ph.D. dissertation, Dep. Elec. Eng., Indian Inst. Technol., New Delhi, Feb. 1982.
[22] D. Nath and S. N. Maheshwari, "Parallel algorithms for the connected components and minimal spanning tree problems," Inform. Process. Lett., vol. 14, pp. 7-11, Mar. 27, 1982.
[23] F. P. Preparata and J. Vuillemin, "The cube-connected cycles: A versatile network for parallel computation," IRIA, France, Res. Rep., June 1979.
[24] F. P. Preparata and J. Vuillemin, "Area-time optimal VLSI networks for multiplying matrices," Inform. Process. Lett., vol. 11, pp. 77-80, Oct. 1980.
[25] H. S. Stone, "Parallel processing with the perfect shuffle," IEEE Trans. Comput., vol. C-20, pp. 153-161, 1971.
[26] C. Savage, "Fast, efficient parallel algorithms for some graph problems," Dep. Comput. Studies, North Carolina State Univ., Tech. Rep., Nov. 1978.


[27] J. E. Savage, "Area-time tradeoffs for matrix multiplication and transitive closure in the VLSI model," in Proc. 17th Annu. Allerton Conf. Commun., Control, Comput., Oct. 1979.
[28] J. T. Schwartz, "Ultracomputers," Ass. Comput. Mach. Trans. Program. Languages Syst., vol. 2, pp. 484-521, 1980.
[29] C. D. Thompson, "A complexity theory for VLSI," Ph.D. dissertation, Dep. Comput. Sci., Carnegie-Mellon Univ., Aug. 1980.
[30] C. D. Thompson, "The VLSI complexity of sorting," Univ. California, Berkeley, Memo. UCB/ERL M82/5, Feb. 1982.
[31] C. D. Thompson, private communication.
[32] C. D. Thompson and H. T. Kung, "Sorting on a mesh-connected parallel computer," Commun. Ass. Comput. Mach., vol. 20, pp. 263-271, Apr. 1977.
[33] R. L. Rivest and J. Vuillemin, "A generalisation of the Aanderaa-Rosenberg conjecture," in Proc. 7th Annu. Symp. Theory Comput., Ass. Comput. Mach., 1975, pp. 6-11.

Dhruva Nath received the B.Tech. degree and Ph.D. degree in electrical engineering from the Indian Institute of Technology, Delhi, in 1977 and 1982 respectively. Since 1979, he has also been associated with a group at IIT Delhi working on the design automation of digital systems. His research interests are in the design of efficient algorithms, parallel processing and VLSI.

S. N. Maheshwari received the B.Tech. degree in electrical engineering from the Indian Institute of Technology, Delhi, in 1969 and the M.S. and Ph.D. degrees in computer science from Northwestern University, Evanston, IL, in 1971 and 1974 respectively. He is currently an Associate Professor with the Centre of Computer Science and Engineering and Department of Electrical Engineering at IIT Delhi. He has also held appointments with the Computer Science Departments at IIT Kanpur, IIT Bombay and the University of Colorado at Boulder. His current research interests are in the areas of analysis of algorithms, parallel processing and VLSI, and the theory of relational data bases. Dr. Maheshwari is a member of the Association for Computing Machinery.

P. C. P. Bhatt received the B.E., M.E. and Ph.D. degrees in electrical engineering from Saugar University, Calcutta University, and Indian Institute of Technology, Kanpur, in 1962, 1964, and 1969, respectively. Currently, he is a Professor with the Centre for Computer Science and Engineering and the Department of Electrical Engineering at Indian Institute of Technology, Delhi. His research interests are in the areas of computer architecture and programming languages.
