Fast and Processor Efficient Parallel Matrix Multiplication Algorithms on a Linear Array With a Reconfigurable Pipelined Bus System

Keqin Li, Senior Member, IEEE, Yi Pan, Senior Member, IEEE, and Si Qing Zheng

Abstract—We present efficient parallel matrix multiplication algorithms for linear arrays with reconfigurable pipelined bus systems (LARPBS). Such systems are able to support a large volume of parallel communication of various patterns in constant time. An LARPBS can also be reconfigured into many independent subsystems and, thus, is able to support parallel implementations of divide-and-conquer computations like Strassen's algorithm. The main contributions of the paper are as follows: We develop five matrix multiplication algorithms with varying degrees of parallelism on the LARPBS computing model, namely, MM1, MM2, MM3, and the compound algorithms &1(e) and &2(δ). Algorithm &1(e) has adjustable time complexity at the sublinear level. Algorithm &2(δ) implies that it is feasible to achieve sublogarithmic time using o(N^3) processors for matrix multiplication on a realistic system. Algorithms MM3, &1(e), and &2(δ) all have o(N^3) cost and, hence, are very processor efficient. Algorithms MM1, MM3, and &1(e) are general-purpose matrix multiplication algorithms, where the array elements are in any ring. Algorithms MM2 and &2(δ) are applicable to array elements that are integers of bounded magnitude, or floating-point values of bounded precision and magnitude, or Boolean values. Extensions of algorithms MM2 and &2(δ) to unbounded integers and reals are also discussed.

Index Terms—Compound algorithm, linear array, matrix multiplication, optical pipelined bus, reconfigurability, Strassen's algorithm.
1 INTRODUCTION
MATRIX multiplication is one of the most fundamental and important problems in science and engineering. Many other important matrix problems can be solved via matrix multiplication, e.g., finding the Nth power, the inverse, the determinant, eigenvalues, an LU-factorization, and the characteristic polynomial of a matrix, and the product of a sequence of matrices [20]. Many graph theory problems also reduce to matrix multiplication. Examples are finding the transitive closure, all-pairs shortest paths, the minimum-weight spanning tree, topological sort, and critical paths of a graph [6], [12]. Therefore, fast and processor efficient parallel algorithms for matrix multiplication are of fundamental importance.

The standard matrix multiplication algorithm takes O(N^3) operations. Most existing parallel algorithms are parallelizations of the standard method. For instance, matrix multiplication can be performed in O(N) time on a mesh with wraparound connections and N × N processors [7]; in O(log N) time on a three-dimensional mesh of trees with N^3 processors [20]; and in O(log N) time on a hypercube or shuffle-exchange network with N^3 processors [12]. It is clear that all implementations of the standard method have cost, i.e., time-processor product, of at least Ω(N^3). Therefore, it is interesting to develop highly parallel and processor efficient algorithms that have o(N^3) cost. It turns out that such an algorithm should not be a parallelization of the standard matrix multiplication algorithm.

Extensive research has been done to develop algorithms that require O(N^β) operations, where β < 3. Strassen was the first to find an O(N^{log 7}) time algorithm, where log 7 < 2.8074, that recursively reduces the calculation of the product of two matrices of size N × N to the calculation of the products of seven submatrices of size N/2 × N/2 [46]. Chandra implemented Strassen's algorithm on shared memory multiprocessors in O(log N) time using N^{2.8074}/log N processors [8]. It is well known that the matrix product can be computed by a CREW PRAM in O(log N) time, using M(N) processors [38], where M(N) = O(N^β) is the number of operations performed by the best sequential algorithm. Since Strassen's initial work, the value of β has been reduced a number of times.
————————
• K. Li is with the Department of Mathematics and Computer Science, State University of New York, New Paltz, NY 12561-2499. E-mail: [email protected].
• Y. Pan is with the Department of Computer Science, University of Dayton, Dayton, OH 45469-2160. E-mail: [email protected].
• S.Q. Zheng is with the Department of Computer Science, Louisiana State University, Baton Rouge, LA 70803-4020. E-mail: [email protected].
Manuscript received 15 Oct. 1997; revised 9 Mar. 1998.
Currently, the best matrix multiplication algorithm takes O(N^β) operations, where β < 2.375477 [11]. However, these algorithms are highly sophisticated even for sequential execution (the constant involved in the O(N^β) term is so large that the break-even point with the standard matrix multiplication algorithm is well above practically interesting matrix sizes); parallelizations of these methods are only of theoretical interest and far from practical (see, e.g., [37, p. 624] and [20, p. 312]).
The reader is referred to [4] for more information on parallel matrix algorithms.

A parallel algorithm on the PRAM model only reveals the maximum parallelism of a problem. Large scale parallel computing on a shared memory system is impractical, at least with current technology. As for distributed memory systems, to the best of the authors' knowledge, even Strassen's algorithm has not been parallelized on any popular static network with time and processor complexities O(log N) and O(N^{2.8074}), respectively. (Note that Ω(log N) time is inherent in Strassen's recursive method and is independent of any implementation on any system.) It is unlikely that Strassen's algorithm can be parallelized on static networks with the above-mentioned performance, due to its large amount of data transfer and the mismatch between its communication patterns and the structure of most commonly used networks.

Efficient communication is the most difficult issue in designing a high performance parallel system, which requires both an architecture and technology that support high communication bandwidth and algorithms that involve little communication. The communication capacity of static networks, such as meshes and hypercubes, is limited by their finite node degrees, while communication latency is proportional to network diameter. Hence, increasing the size of a network may not result in a further decrease in the time complexity of a parallel algorithm running on a static network. As a matter of fact, the time complexity of a nontrivial parallel computation is bounded from below by the network diameter. One way to overcome this difficulty is to use electronic/optical buses for communication, since they provide a direct connection between any two processors in the system. Processor arrays with various bus systems have become increasingly interesting due to recent advances in VLSI and fiber optic technologies. Arrays of processors with global buses [5], multiple buses [10], hyperbuses [17], multidimensional bus architectures [29], and reconfigurable buses [30] have been proposed for efficient parallel computations.

The time taken by a datum to go from one end of an electronic bus to the other end is a function of the length L of the bus. This function can be L^2, L, or log L, depending on the physical properties of the technology used to implement the bus [2]. Similarly, the time for a signal to traverse a given section of the bus is also unpredictable. Other difficulties associated with electronic buses are low bandwidth (since they cannot support concurrent accesses), temperature sensitivity, capacitive loading, and cross talk caused by mutual inductance. On the other hand, fiber optic communication technologies offer a combination of high bandwidth, predictable message delay, low interference and error probability, and gigabit transmission capacity, and have been used extensively in wide-area networks. Clearly, the ability to control optical delays precisely can also be used in many ways to support high bandwidth multiprocessor interconnection. For instance, precise delays can be used for the buffers required in temporal rearrangement of TDM signals, for collision resolution in packet switched networks, and for synchronization of incoming packets on separate channels.
Depending upon the material through which the signals propagate, one millimeter corresponds to three to seven picoseconds. With the precision available in mechanical layout, sub-picosecond time precision is achievable. Based on the characteristics of fiber optic communications, several researchers have proposed using optical interconnections to connect processors in a parallel computer system [3], [13], [15], [44], [47], [50]. In such a system, messages can be transmitted concurrently on a pipelined optical bus without partitioning the bus into several segments, while the time delay between the furthest processors is only the end-to-end propagation delay of light over a waveguided bus. This design integrates the advantages of both optical transmission and electronic computation, and leads to an entirely new realm of techniques for parallel algorithm development. Many parallel algorithms, such as the Hough transform, singular value decomposition, order statistics, sorting, and some numerical algorithms, have been proposed for systems with optical interconnections [14], [16], [26], [31], [32], [33], [40], [41], [43], [45]. These works indicate that arrays with pipelined optical buses are very efficient for parallel computation due to the high bandwidth within a pipelined bus system.

More recently, linear arrays with reconfigurable pipelined bus systems (LARPBS) and arrays with reconfigurable optical buses (AROB) have been independently proposed by Pan et al. [34], [35], and Pavel and Akl [41]. In these systems, messages can be transmitted concurrently on a bus in a pipelined fashion, and the bus can be reconfigured dynamically under program control to support algorithmic requirements. Many algorithms have been designed for these two models, including selection, quicksort, matrix operations, computational geometry, and PRAM simulation, on the LARPBS model [22], [23], [24], [28], [34], [36], and on the AROB model [40], [41], [42]. A fundamental difference between the two similar models is that counting is not allowed during a bus cycle in the LARPBS model, while it is permitted in the AROB model. In fact, the LARPBS does not allow any processor involvement during a bus cycle, except setting switches at the beginning of a bus cycle. Recently, a linear array with a different pipelined optical bus system has been proposed in [27], [50]. This system can implement most algorithms designed for the LARPBS with much less hardware. Other simplified but equivalent models have also been developed [49].

In this paper, we present efficient parallel matrix multiplication algorithms for linear arrays with reconfigurable pipelined bus systems. Such systems are able to support a large volume of parallel communication of various patterns in constant time. An LARPBS can also be reconfigured into many independent subsystems and, thus, is able to support parallel implementations of divide-and-conquer computations, like Strassen's algorithm. The main contributions of the paper are as follows: We develop five matrix multiplication algorithms with varying degrees of parallelism on the LARPBS computing model:

• A linear time algorithm MM1 (applicable to an arbitrary ring or a closed semiring) using N^2 processors, where each processor needs a constant amount of memory;
• A constant time algorithm MM2 using N^3 processors, where each processor needs a constant amount of memory, and we assume that the array elements are integers of bounded magnitude, or floating-point values of bounded precision and magnitude, or Boolean values;
• A parallelized Strassen's algorithm MM3 (applicable to an arbitrary ring) with time complexity O(log N), using O(N^{log 7}) processors, where each processor needs O(log N) amount of memory;
• A compound algorithm &1(e) (applicable to an arbitrary ring) with time complexity O(log N + N^{1−e}), where 0 ≤ e ≤ 1, using O(N^{2+0.8074e}) processors, where each processor needs O(log N) amount of memory;
• A compound algorithm &2(δ) with time complexity O((log N)^δ), where 0 ≤ δ ≤ 1, using O(N^3 (7/8)^{(log N)^δ}) processors, where each processor needs O(log N) amount of memory, and the array elements are integers of bounded magnitude or floating-point values of bounded precision and magnitude.

Algorithm &1(e) has adjustable time complexity at the sublinear level. Algorithm &2(δ) implies that it is feasible to achieve sublogarithmic time using o(N^3) processors for matrix multiplication on a realistic system. Algorithms MM3, &1(e), and &2(δ) all have o(N^3) cost and, hence, are very processor efficient. Algorithms MM1, MM3, and &1(e) are general-purpose matrix multiplication algorithms, where the array elements are in any ring. Notice that the restriction on the array elements in algorithms MM2 and &2(δ) is actually made by virtually all sequential and parallel algorithms, since it is usually assumed that an arithmetic operation on array elements takes constant time. Otherwise, even for such a primitive operation (on integers of unbounded magnitude or real values of unbounded precision and magnitude), one of the resources, i.e., execution time or processor hardware, must grow in proportion to the magnitude/precision. Nevertheless, in this paper, extensions of algorithms MM2 and &2(δ) to unbounded integers and reals are also discussed.

We have not seen parallel computing systems with conventional interconnection networks that solve the matrix multiplication problem in constant running time, or support efficient implementation of Strassen's algorithm, or provide such a wide range of cost and performance combinations. Our results have the following important implication: due to its high communication bandwidth, the versatile communication patterns it supports, and its ability to utilize communication reconfigurability as an integral part of a parallel computation, the LARPBS is a powerful architecture for exploiting a degree of parallelism in a computational problem that most other machine models cannot achieve.

The rest of the paper is organized as follows. In Section 2, we explain pipelined optical buses and linear arrays with reconfigurable pipelined bus systems. In Section 3, we describe the implementation of several primitive communication operations on the LARPBS, which are the building blocks of our algorithms, and also discuss the time complexity measure in LARPBS computations. In Sections 4-6, we present three simple matrix multiplication algorithms MM1, MM2, and MM3. In Section 7, we develop two compound algorithms (combinations of simple algorithms), &1(e) and &2(δ). Section 8 concludes the paper.

Fig. 1. A linear array of N processors with an optical bus.
2 LINEAR ARRAYS WITH RECONFIGURABLE PIPELINED BUS SYSTEMS

A pipelined optical bus system uses optical waveguides instead of electrical signals to transfer messages among (electronic) processors. In addition to the high propagation speed of light, there are two important properties of optical signal (pulse) transmission on an optical bus, namely, unidirectional propagation and predictable propagation delay. (Notice that electronic buses lack these characteristics.) These advantages of using waveguides enable synchronized concurrent accesses to an optical bus in a pipelined fashion [9], [21]. Such pipelined optical bus systems can support a massive volume of communication simultaneously and are particularly appropriate for applications that involve intensive communication operations such as broadcasting, one-to-one communication, multicasting, global aggregation, and irregular communication patterns.

Fig. 1 shows a linear array of N processors connected via an optical bus. Each processor is connected to the bus with two directional couplers, one for transmitting on the upper segment and the other for receiving from the lower segment of the bus [44]. Optical signals propagate unidirectionally from left to right on the upper segment and from right to left on the lower segment. An optical bus contains three identical waveguides, i.e., the message waveguide for carrying data, the reference waveguide, and the select waveguide for carrying address information, as shown in Fig. 2. For simplicity, the message waveguide, which resembles the reference waveguide, has been omitted from the figure. An optical pulse p can represent a binary bit 1, and the absence of a pulse represents bit 0. Messages are organized as fixed-length message frames, each frame containing at most b pulses. For example, if b = 8, the integer 165 can be represented as the message frame (p, −, p, −, −, p, −, p), where − stands for the absence of a pulse. The most important hardware parameters of an optical bus system are given as follows:
Fig. 3. Optical switches.
Fig. 2. An optical bus.
ω: the duration in seconds of a single optical pulse;
cb: the velocity (speed) of light in waveguides;
∆: the delay (spatial length) of a single optical pulse, i.e., ∆ = ω × cb;
τ: the pipeline cycle, i.e., the time for an optical pulse to travel the section between two consecutive processors;
T: the bus cycle, i.e., the time (2N − 1)τ for an optical pulse to propagate through the entire bus (P0 → P1 → ⋯ → PN−1 → PN−1 → PN−2 → ⋯ → P0), plus the (N − 1)ω delays introduced for reference and select pulses for the purpose of pipelined data transmission, plus T′ for message processing;
b: the length of a message frame. (It is assumed that a message frame is long enough to hold one data item such as an array element. For processors to send their indices, b is at least log N.)

Messages transmitted by different processors may overlap with each other even if they propagate unidirectionally on the bus. We call these message overlappings "transmission conflicts." To ensure that there is no transmission conflict, it is required that τ > bω, so that each message frame fits into a pipeline cycle τ and, in a bus cycle T, up to N messages can be transmitted simultaneously without collisions on the bus, as long as all processors are synchronized at the beginning of each bus cycle.

To operate an optical bus in a pipelined fashion so that multiple data transfers can be performed in parallel, extra timing capabilities are necessary. First, one unit delay ∆ (shown as a loop in Fig. 2) is added between two consecutive processors on the receiving segments of the reference waveguide and of the message waveguide. Each loop is an extra section of fiber, and the amount of delay added can be accurately chosen based on the length of the segment. As a result, the propagation delays on the receiving segments of the select waveguide and the reference waveguide are no longer the same. Second, a conditional delay ∆ is added between any two consecutive processors Pi−1 and Pi, where 1 ≤ i ≤ N − 1, on the transmitting segment of the select waveguide (cf. Fig. 2). The switch between processors Pi−1 and Pi is called S(i) and is local to processor Pi. Thus, every processor has its own switch except processor P0. The conditional delays can be implemented using 2 × 2 optical switches such as the Ti:LiNbO3 switches used in an optical computer [3]. Each switch S(i) can be set by the local processor Pi to one of two possible states, straight or cross, as shown in Fig. 3. When a switch is set to straight, it takes τ time for an optical signal on the transmitting segment of the select waveguide to propagate from one processor to its successor. When a switch is set to cross, a delay ω is introduced, and such propagation takes τ + ω time. Clearly, the maximum delay that the S(i) switches can introduce is (N − 1)ω.

The coincident pulse addressing technique [9], [21], [44], i.e., a time-division switching method, can be applied to route messages on an optical bus system. Using this approach, a source processor Pi determines the relative time delay of a select pulse and a reference pulse so that they will coincide and produce a double-height pulse only at a destination processor Pj. By properly adjusting the detecting threshold of the detector at processor Pj, this double-height pulse can be detected, thereby addressing Pj. The details of processor addressing and data communication are given in Section 3.

The capability of a linear array with a pipelined optical bus can be significantly enhanced by including switches for reconfiguring the bus system. The enhanced model, called a linear array with a reconfigurable pipelined bus system (LARPBS), originated in [34]. (See [35] for a more elaborate exposition.) In the LARPBS shown in Fig. 4, we insert two optical switches, namely, RST(i), a 1 × 2 optical switch, on the section between Pi and Pi+1 of the transmitting segment, and RSR(i), a 2 × 1 optical switch, on the section between Pi and Pi+1 of the receiving segment, where 0 ≤ i ≤ N − 2. Both the RSR(i) and RST(i) switches are controlled by processor Pi. These switches are able to reconfigure a bus system into several independent subsystems that can operate in parallel. When all switches are set to straight, the bus system operates as a regular pipelined bus system. When RSR(i) and RST(i) are set to cross, the bus system is split into two separate systems, one consisting of processors P0, P1, ..., Pi, and the other consisting of Pi+1, Pi+2, ..., PN−1. The total delay for a signal to pass the optical fiber between RST(i) and RSR(i) is made equal to τ. Hence, the subarray with processors P0 to Pi can operate as a regular linear array with a pipelined bus system, and so does the subarray with processors Pi+1 to PN−1. Fig. 4 demonstrates the LARPBS model with N = 6 processors, where the array is split into two subarrays, the first having four processors and the second having two processors. In the figure, only one waveguide is shown, and the conditional switches are omitted to avoid confusion. Notice that the time for setting up switches for one reconfiguration is a constant.
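To make the frame format and the conflict-free condition above concrete, here is a minimal Python sketch (our illustration, not part of the LARPBS model); the function names and the parameter values in the example are ours and purely illustrative.

def frame(value, b):
    # Encode an integer as a b-slot pulse frame: 'p' = pulse present, '-' = absent.
    bits = format(value, "0{}b".format(b))
    return tuple("p" if bit == "1" else "-" for bit in bits)

assert frame(165, 8) == ("p", "-", "p", "-", "-", "p", "-", "p")

def conflict_free(tau, b, omega):
    # A frame of b pulses fits into one pipeline cycle when tau > b * omega.
    return tau > b * omega

assert conflict_free(tau=10.0, b=8, omega=1.0)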
Fig. 4. An LARPBS with two subarrays.
3 THE LARPBS COMPUTATIONAL MODEL

An LARPBS is a distributed memory system which consists of processors linearly connected by a reconfigurable pipelined optical bus system. Each processor can perform ordinary arithmetic and logic computations and interprocessor communication. All computations and communications are synchronized by bus cycles, which is similar to an SIMD machine. However, due to reconfigurability, an LARPBS can be divided into subsystems, and these subsystems can operate independently to solve different subproblems, which makes it behave like an MIMD machine. Input/output can be handled in a way similar to an ordinary SIMD/MIMD machine; further discussion of this issue is beyond the scope of the paper, since we are mainly interested in algorithm development and analysis.

In this section, we show in detail how to perform several basic communication, data movement, and global aggregation operations on the LARPBS model using the coincident pulse processor addressing technique. We justify that all these primitive operations can be performed in a constant number of bus cycles. These powerful primitive operations that support massive parallel communication, plus the reconfigurability of the LARPBS model, make the LARPBS very attractive for solving problems that are both computation and communication intensive, such as matrix multiplication. We will see that optical buses are not only communication channels among the processors, but also active components and agents of certain computations, e.g., global data aggregation. The unique characteristics of optical buses have led to an entirely new computational model and opened up rich avenues for designing novel and interesting algorithms (see chapter 10 of [2]).
3.1 Time Complexity Measure

Let us first discuss the issue of measuring the time complexity of an LARPBS computation. As in many other parallel computing systems, a computation on an LARPBS is a sequence of alternating communication and computation steps:

m1 data movement/global aggregation steps
↓
n1 local computation steps
↓
m2 data movement/global aggregation steps
↓
n2 local computation steps
↓
⋮

The time complexity of an algorithm is measured in terms of the total number of bus cycles in all the (m1 + m2 + ⋯) communication steps, as long as the time of the ni local computation steps is bounded by a constant and independent of N. Notice that the time for setting up switches for one reconfiguration is a constant, and the number of system reconfigurations does not exceed (and is actually far less than) the number of communication steps. Therefore, this part of the system overhead is negligible or can be merged into the big-O notation.

The above complexity measure implies that a bus cycle takes constant time, which obviously needs more justification. Recall that the bus cycle length T = (2N − 1)τ + (N − 1)ω + T′ is the end-to-end message transmission time over a bus, (2N − 1)τ + (N − 1)ω, plus the message processing time T′ for message generation, preparation, buffering, encoding, decoding, etc., in the source and destination processors [15], [42]. The ratio r = T′/τ of message processing time (which is comparable to processor speed) to message transmission time between successive processors is at least on the order of 10^3, which implies that a bus cycle can be regarded as a constant for systems of size O(r). It has been observed that in parallel computing, the time to access a memory location in a global shared memory system, the time for a message to traverse a link in a direct network, and the time for a message to move to the next stage in an indirect network are all assumed to be constants [2]. Strictly speaking, all these times are in proportion to system size. Thus, the assumption that T is a constant has been adopted widely in the literature and is consistent with other models [2], [5], [10], [15], [16], [17], [30], [40], [41], [42], [45]. As pointed out in the last section of the paper, our algorithms are highly scalable. Therefore, on a realistic system whose size is much less than the problem size, we can still run the algorithms so that linear speedup is obtained.
3.2 Primitive Operations

The following primitive operations on the LARPBS are used in this paper, and our algorithms are developed by using these operations as building blocks. Implementation details of these operations on optical buses are given in Section 3.4.

3.2.1 One-to-One Communication
Assume that processors Pi1, Pi2, ..., Pim are senders and processors Pj1, Pj2, ..., Pjm are receivers. In particular, processor Pik sends the value in its register R(ik) to the register R(jk) in Pjk. The operation is represented as

for 1 ≤ k ≤ m do in parallel
  R(jk) ← R(ik)
endfor

(Note that we use R(i) to denote both the name and the content of register R(i).)

3.2.2 Broadcasting
Here, we have a source processor Pi which sends the value in its register R(i) to all the N processors:

R(0), R(1), R(2), ..., R(N − 1) ← R(i)

3.2.3 Multiple Multicasting
In a multicasting operation, we have a source processor Pi which sends the value in its register R(i) to a subset of the N processors Pj1, Pj2, ..., Pjm:

R(j1), R(j2), ..., R(jm) ← R(i)

Assume that we have g disjoint groups of destination processors,

G1 = {Pj1,1, Pj1,2, Pj1,3, ...},
G2 = {Pj2,1, Pj2,2, Pj2,3, ...},
⋮
Gg = {Pjg,1, Pjg,2, Pjg,3, ...},

and there are g senders Pi1, Pi2, ..., Pig. Processor Pik has the value R(ik) to be broadcast to all the processors in Gk, where 1 ≤ k ≤ g. Since there are g simultaneous multicastings, we have a multiple multicasting operation, which is denoted as follows:

for 1 ≤ k ≤ g do in parallel
  R(jk,1), R(jk,2), R(jk,3), ... ← R(ik)
endfor

It is important to point out that, due to the relatively slow speed of the electronic processors, each processor can only read one message frame during one bus cycle. Thus, the g groups of destination processors must be disjoint.

3.2.4 Element Pair-Wise Operations
Let Pi, Pi+1, ..., Pj be a consecutive group of processors. We use R[i..j] as an abbreviation of the registers R(i), R(i + 1), ..., R(j). R[i..j] can be used to store a vector or a matrix in the row-major or column-major order. Assume that A and B are two matrices of size N × N. All array elements are in a domain with a binary operator ⊕. Elements aij and bij are stored in R[m + (i − 1)N + j − 1] and R[n + (i − 1)N + j − 1], respectively. Then C = A ⊕ B, where cij = aij ⊕ bij, can be computed as follows, where cij is found in R[m + (i − 1)N + j − 1]:

for 1 ≤ k ≤ N^2 do in parallel
  R(m + k − 1) ← R(m + k − 1) ⊕ R(n + k − 1)
endfor

3.2.5 Binary Prefix Sums
Assume that every processor Pi, where 0 ≤ i ≤ N − 1, has a register R(i) which holds a binary value. We need to calculate sj = R(0) + R(1) + ⋯ + R(j) and save the result in R(j). The operation is represented as

for 0 ≤ j ≤ N − 1 do in parallel
  R(j) ← R(0) + R(1) + ⋯ + R(j)
endfor

3.2.6 Extraction and Compression
In this operation, every processor Pj has a value xj, and we wish to extract those xjs that have a certain property and compact them to the beginning of the linear array. In particular, if xi1, xi2, ..., xim are those values that have the property, where i1 < i2 < ⋯ < im, then we should have R(k − 1) = xik for all 1 ≤ k ≤ m after the extraction and compression operation.

3.2.7 Binary Value Aggregation
Assume that every processor Pi, where 0 ≤ i ≤ N − 1, has a register R(i) which holds a binary value. We need to calculate R(0) + R(1) + ⋯ + R(N − 1) and save the result in R(0). The operation is represented as

R(0) ← R(0) + R(1) + ⋯ + R(N − 1)

3.2.8 Boolean Value Aggregation
Assume that every processor Pi, where 0 ≤ i ≤ N − 1, has a register R(i) which holds a truth value. We need to calculate

R(0) and R(1) and ⋯ and R(N − 1),

or

R(0) or R(1) or ⋯ or R(N − 1),

and save the result in R(0).

3.2.9 Integer Aggregation
The binary value aggregation operation can be generalized to calculate the summation of N integers.

3.2.10 Floating-Point Value Aggregation
The integer aggregation operation can be further generalized to calculate the summation of N reals.
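Before turning to the optical implementations, the following minimal Python sketch (ours, not part of the paper) states the semantics of a few of the primitives above on an array R of register contents; on the LARPBS each of these corresponds to O(1) bus cycles, whereas the sketch is simply an ordinary sequential specification.

def one_to_one(R, senders, receivers):
    vals = [R[i] for i in senders]            # P(ik) sends R(ik) ...
    for jk, v in zip(receivers, vals):
        R[jk] = v                             # ... into R(jk) of P(jk)

def broadcast(R, i):
    v = R[i]
    for j in range(len(R)):                   # every processor reads the frame
        R[j] = v

def multiple_multicast(R, senders, groups):
    # The destination groups must be pairwise disjoint
    # (one frame per receiver per bus cycle).
    for ik, G in zip(senders, groups):
        for j in G:
            R[j] = R[ik]

def binary_prefix_sums(R):
    s, out = 0, []
    for x in R:
        s += x
        out.append(s)                         # s_j = R(0) + ... + R(j)
    R[:] = out

def extract_and_compress(x, has_property):
    kept = [v for v in x if has_property(v)]
    return kept + [None] * (len(x) - len(kept))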
3.3 Coincident Pulse Processor Addressing Technique

Before we go to the implementation of the primitive operations, let us look at the coincident pulse technique. Assume that all the S(i) switches on the transmitting segments are set to straight, where 1 ≤ i ≤ N − 1, so that no delay is introduced for the select pulses. Suppose that a source processor Pi wants to send a message to a destination processor Pj.
Fig. 5. An address frame.
Fig. 6. The address frame for broadcasting.
Processor Pi sends a reference pulse on the reference waveguide at time tref, i.e., the beginning of a bus cycle, and a message frame at time tref on the message waveguide, which propagates synchronously with the reference pulse sent by Pi. Processor Pi also sends a select pulse at time tsel(j) on the select waveguide. Whenever a processor Pj detects a coincidence of a reference pulse and a select pulse, it reads the message frame. Thus, in order for processor Pi to send a message to processor Pj, we need the two pulses to coincide at processor Pj. This happens if and only if

tsel(j) = tref + (N − j − 1)ω, where 0 ≤ i, j < N.

Fig. 5 gives an address frame relative to a reference pulse for addressing processor Pj. An address frame is a sequence of N consecutive time slots sN−1, sN−2, ..., s1, s0 at the beginning of a bus cycle. Each time slot sj has duration ω. The starting time of slot sj is tsel(j) = tref + (N − j − 1)ω. Within an address frame, processor Pi has the chance to address one or more destinations. Obviously, processor Pj is expected to receive the message sent by Pi if and only if there is a select pulse at sj. In Fig. 5, the select pulse is delayed by (N − j − 1)ω seconds relative to the reference pulse. Hence, the two pulses meet at processor Pj after the reference pulse goes through N − j − 1 delays, and processor Pj receives the message at time

ti,j = tref + (N − 1 − i)τ + (N − j)τ + (N − 1 − j)ω = tsel(j) + (2N − i − j − 1)τ,

where (N − 1 − i)τ is the amount of time for the reference pulse and the message frame to travel from Pi to PN−1 on the transmitting segment, (N − j)τ is the amount of time for the reference pulse and the message frame to travel from PN−1 to Pj on the receiving segment (excluding the extra loop delays), and (N − 1 − j)ω is the amount of time due to loop delays. Since the select pulse is initiated at time tsel(j) and takes (2N − i − j − 1)τ time to reach Pj on the receiving segment, the select pulse meets the reference pulse and the message frame at the right time ti,j at the right place Pj.

3.4 Implementation of Primitive Operations

We now present in detail how the 10 primitive operations defined in the last subsection are implemented on reconfigurable pipelined optical buses. Readers only interested in algorithmic aspects can skip this section (which is on optical engineering implementation details) without loss of continuity.

3.4.1 One-to-One Communication
One-to-one communication can be implemented in one bus cycle as follows. All the Piks send a reference pulse and a message frame containing R(ik) at time tref, and a select pulse at time tsel(jk). Whenever a processor Pjk detects a coincidence of a reference pulse and a select pulse, it reads the message frame into its R(jk).
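The timing relations of Section 3.3 can be checked with the following small Python sketch (ours; all numerical values are illustrative), which assumes every S(i) switch is set to straight.

def t_select(j, N, w, t_ref=0.0):
    # Select-pulse launch time needed to address destination P_j.
    return t_ref + (N - j - 1) * w

def t_receive(i, j, N, tau, w, t_ref=0.0):
    # Time at which P_j sees the coincidence of the reference and select
    # pulses sent by source P_i: (N-1-i)*tau up the transmitting segment,
    # (N-j)*tau back down the receiving segment, plus (N-1-j) loop delays.
    return t_ref + (N - 1 - i) * tau + (N - j) * tau + (N - 1 - j) * w

# Sanity check of the identity t_ij = t_sel(j) + (2N - i - j - 1)*tau.
N, tau, w = 8, 10.0, 1.0
for i in range(N):
    for j in range(N):
        lhs = t_receive(i, j, N, tau, w)
        rhs = t_select(j, N, w) + (2 * N - i - j - 1) * tau
        assert abs(lhs - rhs) < 1e-9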
3.4.2 Broadcasting
To implement the broadcast operation, we set all the S(i) switches on the transmitting segments to straight. The source processor Pi sends a reference pulse at the beginning of its address frame. It is also clear that to broadcast a message, Pi needs to select all the N processors. Thus, if the source processor Pi sends N consecutive select pulses in its address frame on the select waveguide, as shown in Fig. 6, every processor on the bus detects a double-height pulse and thus reads the message.

3.4.3 Multiple Multicasting
The implementation of this operation is similar to that of broadcast, that is, processor Pi sends select pulses at times tsel(j1), tsel(j2), ..., tsel(jm). It is clear that the multiple multicasting operation can also be implemented in one bus cycle.
3.4.4 Element Pair-Wise Operations
It is clear that element pair-wise operations are similar to one-to-one communications. We first perform a one-to-one communication, that is, Pn+k−1 sends R(n + k − 1) to Pm+k−1 for all 1 ≤ k ≤ N^2, and then the Pm+k−1s conduct a local operation ⊕.
3.4.5 Binary Prefix Sums Assume that R(im) = R(im−1) L = R(i1) = 1, where 0 < im < im−1 < L < i1 and all other values are 0 (we assume that R(0) = 0 for the moment). We divide the N processors into (m + 1)
J
groups G1, G2, ..., Gm+1, where Gk = Pi , Pi k
k +1
, K , Pi
k −1 −1
L,
and we assume that im+1 = 0 and i0 = N, for convenience. Thus, the goal is to let processor Pj know the value (m + 1 − k) if Pj ∈ Gk. The binary prefix sums operation takes four bus cycles. • In the first cycle, each processor Pi sends a message
frame that contains its index i, a reference pulse, and a
712
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 9, NO. 8, AUGUST 1998
select pulse at time slot sN−1. In other words, all processors try to send its index to PN−1. Furthermore, processor Pi sets S(i) to cross if R(i) = 1, and straight if R(i) = 0. Such a switch setting changes the destination of some of the N messages due to the delays introduced by the switches S(i), where i ∈ {i1, i2, ..., im}. In particular, the messages sent by processors in Gk are received by PN−k, where 1 ≤ k ≤ m + 1. However, processor PN−k only reads the first message that arrives, i.e., index ik−1 − 1. • In the second bus cycle, processor PN−k, which re-
ceived something during the first cycle, sends the value (k − 1) to processor Pi − 1 for all 1 ≤ k ≤ m + 1. k −1
This is a one-to-one communication. • In the third bus cycle, processors Pi
k −1 −1
, 1 ≤ k ≤ m + 1,
that received something during the second bus cycle, set their RST(i) and RSR(i) switches to cross. Hence, the LARPBS is reconfigured into (m + 1) subarrays, that contain groups of processors G1, G2, ..., Gm+1, respectively. Processor Pi − 1 , 1 ≤ k ≤ m + 1, broadcasts k −1
the value (k − 1) it received in the second bus cycle to all other processors in its subarray. • In the fourth bus cycle, processor P0 broadcasts m to all processors and all Pj save sj = m − (k − 1) in R(j) if Pj ∈ Gk. (Notice that every Pj knows that it belongs to Gk from the value k − 1 it received in the third bus cycle.) The above solution works only when R(0) = 0 initially. The reason is that processor P0 does not have an S(0) switch that causes a delay for its select pulse (cf. Fig. 2). To solve this issue, we need one extra bus cycle in which processor P0 broadcasts its initial value to all processors, and all processors add the value to their final result.
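The four-cycle protocol can be mimicked in software. The following Python sketch (our simulation, not the paper's code) models only where each message lands under the switch-induced delays and reproduces the prefix sums for the case R(0) = 0.

def larpbs_prefix_sums(R):
    N = len(R)
    assert R[0] == 0            # handled by one extra bus cycle otherwise
    # Cycle 1: P_i addresses slot s_{N-1}; each cross switch its select pulse
    # passes (one per 1 to its right) shifts the coincidence down by one.
    landing = {}
    for i in range(N):
        d = sum(R[i + 1:])                       # cross switches passed
        landing.setdefault(N - 1 - d, []).append(i)
    # Cycle 2: each P_{N-k} that received something returns the value k-1
    # to the largest sender index it saw (the last processor of group G_k).
    leader_value = {}
    for dest, sources in landing.items():
        leader_value[max(sources)] = (N - 1) - dest
    # Cycle 3: leaders split the bus and broadcast k-1 inside their group.
    group_val = [0] * N
    current = None
    for j in range(N - 1, -1, -1):               # each group ends at a leader
        if j in leader_value:
            current = leader_value[j]
        group_val[j] = current
    # Cycle 4: P_0 broadcasts m = sum(R); every P_j stores s_j = m - (k-1).
    m = sum(R)
    return [m - group_val[j] for j in range(N)]

# Example: R = [0,1,0,0,1,1,0] gives the prefix sums [0,1,1,1,2,3,3].
assert larpbs_prefix_sums([0, 1, 0, 0, 1, 1, 0]) == [0, 1, 1, 1, 2, 3, 3]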
3.4.6 Extraction and Compression
To implement this operation, we first invoke the binary prefix sums operation, where R(j) = 1 if and only if xj has the property. In particular, Pik knows the value k and its destination Pk−1 for all 1 ≤ k ≤ m. Then, one more one-to-one communication brings all the xjs to the right processors.
3.4.7 Binary Value Aggregation
Assume that R(i1) = R(i2) = ⋯ = R(im) = 1, where i1 < i2 < ⋯ < im, and all other values are 0. Thus, the goal is to let processor P0 know the value m. It is clear that this operation is simpler than the binary prefix sums operation. The aggregation operation takes two bus cycles.

• In the first cycle, processor P0 sends a dummy message z, a reference pulse, and a select pulse at time slot sN−1. In other words, processor P0 tries to send a message to PN−1. Furthermore, processor Pi sets S(i) to cross if R(i) = 1, and to straight if R(i) = 0. Such a switch setting changes the destination of message z due to the delays introduced by the switches S(i), where i ∈ {i1, i2, ..., im}. In particular, it is easy to see that processor PN−1−m receives the message z.
• In the second bus cycle, processor PN−1−m, which received something during the first cycle, sends the value m back to processor P0.

After the two bus cycles, processor P0 obtains the value m. The above solution works only when m ≤ N − 1 or R(0) = 0 initially. To solve this problem, we let processor P0 add 1 to the value received in the second bus cycle if R(0) = 1 initially.
3.4.8 Boolean Value Aggregation
It is clear that the above algorithm is also applicable to Boolean values, where 1 means true and 0 means false. The logical and of the R(i)s is true if and only if the sum m is N. The logical or of the R(i)s is true if and only if the sum m is not zero.
3.4.9 Integer Aggregation
The binary value aggregation operation can be generalized to calculate the summation of N integers. Assume that each R(j) holds a d-bit integer R(j) = xj,d−1 xj,d−2 ⋯ xj,0, where 0 ≤ j ≤ N − 1. The sum σ = R(0) + R(1) + R(2) + ⋯ + R(N − 1) can be calculated using d binary value aggregation operations, that is, finding

σk = Σ_{j=0}^{N−1} xj,k

and storing the result in P0. After these d binary value aggregations, processor P0 assembles

σ = Σ_{k=0}^{d−1} 2^k σk = Σ_{k=0}^{d−1} 2^k Σ_{j=0}^{N−1} xj,k = Σ_{j=0}^{N−1} Σ_{k=0}^{d−1} 2^k xj,k = Σ_{j=0}^{N−1} R(j)

via local calculation. The entire integer aggregation operation takes O(d) time, which is constant time as long as d is a constant. Multiple independent aggregations are also possible by reconfiguring the system.

For unbounded integer values, we can conduct log M simultaneous binary value aggregations using (log M)N processors, where M is the largest magnitude. In particular, we assume that there are N values x0, x1, ..., xN−1 to be aggregated, where xj = xj,log M−1 xj,log M−2 ⋯ xj,0 is a log M-bit integer and is in processor Pj initially, for all 0 ≤ j ≤ N − 1. The (log M)N processors are divided into log M groups Gk = {PkN, PkN+1, ..., P(k+1)N−1}, where k = 0, 1, ..., log M − 1.

• First, we perform N multicastings simultaneously, such that xj is available to all of Pj, PN+j, ..., P(log M−1)N+j, for all 0 ≤ j ≤ N − 1.
• Second, the linear array with (log M)N processors is reconfigured into log M subarrays G0, G1, ..., Glog M−1. Processors in Gk extract the kth bit xj,k of the xjs via local computation and then perform a binary value aggregation, so that σk is calculated and put into PkN, where 0 ≤ k ≤ log M − 1.
• Third, the processors PkN send the σks to P0, P1, ..., Plog M−1 by a one-to-one communication.
After the above three steps, each of which takes constant time, we have reduced the problem of aggregating N integers to that of aggregating log M integers, which can be added using the ordinary binary tree method in O(log log M) time. As a matter of fact, the number of processors can be reduced to Nφ(M), where

φ(M) = ⌈log M / log log M⌉,

such that there are φ(M) groups of processors and each group calculates log log M of the σks. This increases the time complexity of the second and third steps from constant to O(log log M), which is still within the overall execution time.
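For bounded d, the bit-plane decomposition underlying integer aggregation can be summarized by the following Python sketch (ours, not the paper's code); each σk would be obtained by one binary value aggregation on the bus, while the final assembly is the local computation at P0.

def integer_aggregate(values, d):
    # sigma_k = number of operands whose kth bit is 1 (one binary value
    # aggregation per bit plane, each O(1) bus cycles on the LARPBS).
    sigma = [sum((v >> k) & 1 for v in values) for k in range(d)]
    # Local assembly at P0: sigma = sum_k 2**k * sigma_k.
    return sum((1 << k) * sigma[k] for k in range(d))

vals = [13, 7, 0, 21, 2]
assert integer_aggregate(vals, d=5) == sum(vals)   # 43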
3.4.10 Floating-Point Value Aggregation
It turns out that our method for integer aggregation can easily be extended to handle floating-point real values. Let a real value xj = (sj, mj, ej) be represented by a sign bit sj, a p-bit mantissa mj = mj,1 mj,2 ⋯ mj,p, and a q-bit exponent ej = ej,1 ej,2 ⋯ ej,q, such that

xj = (−1)^sj × (1.mj,1 mj,2 ⋯ mj,p)_2 × r^((ej,1 ej,2 ⋯ ej,q)_2 − 2^(q−1)),

where r is an implicit radix and the exponents are in an excess-2^(q−1) code. For instance, in the IEEE 754 floating-point standard found in virtually every computer invented since 1980, we have p = 23, r = 2, and q = 8 for 32-bit, single precision numbers, and p = 52, r = 2, and q = 11 for 64-bit, double precision numbers [18]. Notice that the precision is 2^(−(p+q−1)) and the magnitude is in the order of 2^(2^q).

In the following, we sketch the method to add N floating-point values x0, x1, ..., xN−1, where xj is stored in Pj for all 0 ≤ j ≤ N − 1. We assume that r = 2.

• First, the maximum exponent emax = max(e0, e1, ..., eN−1) is found. This is accomplished by applying the extraction and compression operation q times. Let S0 be the set of all ejs initially. In the kth repetition, 1 ≤ k ≤ q, we extract from Sk−1 the set of ejs with ej,k = 1 and compact these ejs into Sk; if Sk = ∅, then we reset Sk = Sk−1. In O(q) time, we find emax in Sq.
• Second, emax is made available to all processors via broadcasting, and each Pj normalizes ej by shifting mj. This step takes a small amount of local computation time.
• Third, the shifted mjs are added as if they were integers, and this requires O(p) bus cycles.
• Finally, processor P0 performs normalization of the summation via local computation.

The complete floating-point value aggregation takes O(p + q) time, which is constant time as long as p and q are fixed. The above method works only when all the sjs are the same. In general, we can calculate the summation S+ of all positive values and the summation S− of all negative values separately, by extracting all positive (negative) values, compressing them to the left (right) half of the linear array, and reconfiguring the array. S+ and S− are finally added together. For floating-point values with unbounded precision and magnitude, the shifted mjs can be added in O(log p) time using Nψ(p) processors, where

ψ(p) = ⌈p / log p⌉.
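The following Python sketch (ours, not the paper's code) illustrates the floating-point aggregation idea of Section 3.4.10 for nonnegative values with radix r = 2: find emax, align every mantissa to emax by shifting it right, add the shifted mantissas as integers, and renormalize; positive and negative operands would be summed separately, as described above. Inputs are (mantissa, exponent) pairs representing m × 2^e with p-bit mantissas, and bits shifted past the precision are simply dropped.

def fp_aggregate(xs, p):
    e_max = max(e for _, e in xs)             # found with q extraction rounds
    shifted = [m >> (e_max - e) for m, e in xs]   # local exponent alignment
    total = sum(shifted)                      # added as integers (Sec. 3.4.9)
    e = e_max
    while total >= (1 << p):                  # renormalization at P0
        total >>= 1
        e += 1
    return total, e

# (3 * 2**4) + (4 * 2**2) + (1 * 2**4) = 48 + 16 + 16 = 80 = 5 * 2**4
m, e = fp_aggregate([(3, 4), (4, 2), (1, 4)], p=8)
assert m * 2**e == 80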
The discussion in this section can be summarized as follows:

THEOREM 0. One-to-one communication, broadcasting, multiple multicasting, element pair-wise operations, binary prefix sums, extraction and compression, binary value aggregation, Boolean value aggregation, integer (of bounded magnitude) aggregation, and real value (of bounded precision and magnitude) aggregation all take O(1) time in the LARPBS computing model.

THEOREM 0A. The summation of N integers can be calculated in O(log log M) time, using Nφ(M) processors, in the LARPBS computing model, where M is the maximum magnitude.

THEOREM 0B. The summation of N real values, with precision up to 2^(−(p+q−1)) and magnitude in the order of M = 2^(2^q), can be calculated in O(log log M + log p) time, using Nψ(p) processors, in the LARPBS computing model.
4 AN O(N) ALGORITHM USING N^2 PROCESSORS

We start with a linear time algorithm MM1 that calculates the product of two N × N matrices by using N^2 processors in the LARPBS model. Let A = (aij)N×N and B = (bij)N×N be two matrices, where the matrix elements aij and bij are in an arbitrary ring (or a closed semiring) with additive operator ⊕ and multiplicative operator ⊗. The product of A and B is C = A × B = (cij)N×N, where

cij = (ai1 ⊗ b1j) ⊕ (ai2 ⊗ b2j) ⊕ ⋯ ⊕ (aiN ⊗ bNj),

for all 1 ≤ i, j ≤ N.

We find that Cannon's algorithm [7], [19] can be easily implemented on the LARPBS model. For convenience, we label the N^2 processors using pairs (i, j), 1 ≤ i, j ≤ N, and the processors are linearly ordered in terms of the lexicographical order. In other words, processor P(i, j) corresponds to P(i−1)N+(j−1) in the linear array of processors shown in Fig. 1. Each processor P(i, j) has three registers A(i, j), B(i, j), and C(i, j). Elements of arrays A and B are initially stored in the A and B registers in the row-major order, i.e., aij in A(i, j) and bij in B(i, j). Finally, array C is stored in the C registers in the row-major order, i.e., element cij is found in C(i, j), for all 1 ≤ i, j ≤ N.

The algorithm involves an initial alignment of the rows of A and the columns of B, which consists of two one-to-one communications:

for 1 ≤ i, j ≤ N do in parallel
  A(i, cyclic-shift(j, i − 1)) ← A(i, j)
endfor
for 1 ≤ i, j ≤ N do in parallel
  B(cyclic-shift(i, j − 1), j) ← B(i, j)
endfor

where

cyclic-shift(i, j) = i − j, if 0 ≤ j ≤ i − 1;
cyclic-shift(i, j) = i − j + N, if i ≤ j ≤ N − 1.

After such shifts, we have the following data distribution:
P(1, 1)        P(1, 2)        ⋯   P(1, N)
(a11, b11)     (a12, b21)     ⋯   (a1N, bNN)
P(2, 1)        P(2, 2)        ⋯   P(2, N)
(a22, b21)     (a23, b32)     ⋯   (a21, b1N)
⋮              ⋮              ⋱   ⋮
P(N, 1)        P(N, 2)        ⋯   P(N, N)
(aNN, bN1)     (aN1, b12)     ⋯   (aN,N−1, bN−1,N)

Then we have a sequence of N alternating local computation and row/column cyclic shift steps:

for 1 ≤ k ≤ N do
  C(i, j) ← C(i, j) ⊕ (A(i, j) ⊗ B(i, j))
  for 1 ≤ i, j ≤ N do in parallel
    A(i, cyclic-shift(j, 1)) ← A(i, j)
  endfor
  for 1 ≤ i, j ≤ N do in parallel
    B(cyclic-shift(i, 1), j) ← B(i, j)
  endfor
endfor

(cf. chapter 5 of [19] for a complete illustration of an implementation on a mesh with wraparound connections). All these alignments and cyclic shifts are one-to-one communications in the LARPBS model. Thus, the initial alignment and each of the subsequent cyclic shifts takes two bus cycles. During the N local computation steps, the value cij is gradually accumulated. The entire algorithm requires O(N) time. Thus, we obtain the following result:

THEOREM 1. Matrix multiplication (on an arbitrary ring or a closed semiring) can be performed by using algorithm MM1 on the LARPBS model in O(N) time, with N^2 processors, and each processor needs O(1) memory.
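The data movement of MM1 is exactly Cannon's algorithm. The following Python sketch (ours, written as a sequential simulation with 0-based indices over the ordinary ring of integers) may help visualize the initial alignment and the N shift-and-accumulate steps.

def mm1_cannon(A, B):
    N = len(A)
    # Initial alignment: row i of A shifted left by i, column j of B up by j.
    a = [[A[i][(j + i) % N] for j in range(N)] for i in range(N)]
    b = [[B[(i + j) % N][j] for j in range(N)] for i in range(N)]
    C = [[0] * N for _ in range(N)]
    for _ in range(N):
        for i in range(N):
            for j in range(N):
                C[i][j] += a[i][j] * b[i][j]   # local computation
        a = [[a[i][(j + 1) % N] for j in range(N)] for i in range(N)]  # shift A left
        b = [[b[(i + 1) % N][j] for j in range(N)] for i in range(N)]  # shift B up
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
assert mm1_cannon(A, B) == [[19, 22], [43, 50]]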
5 AN O(1) ALGORITHM USING N^3 PROCESSORS

In this section, we present a constant time algorithm MM2 to calculate the product of two N × N matrices using N^3 processors in the LARPBS model. This algorithm is applicable to integers and floating-point values of bounded precision and magnitude, and to Boolean values.

For convenience, we label the N^3 processors using triplets (i, j, k), where 1 ≤ i, j, k ≤ N. Processors P(i, j, k) are ordered in the linear array using the lexicographical order. We use P(i, j, ∗) to denote the group of consecutive processors P(i, j, 1), P(i, j, 2), ..., P(i, j, N). It is noticed that the original system can be reconfigured into N^2 subsystems, namely, the P(i, j, ∗)s. Each processor P(i, j, k) has three registers A(i, j, k), B(i, j, k), and C(i, j, k).

The input and output of algorithm MM2 are specified as follows. Initially, elements aij and bji are stored in registers A(1, i, j) and B(1, i, j), respectively, for all 1 ≤ i, j ≤ N. If we partition A into N row vectors A1, A2, ..., AN, and B into N column vectors B1, B2, ..., BN,

A = [A1; A2; ...; AN],   B = [B1, B2, ..., BN],

then the initial data distribution is as follows:

P(1, 1, ∗)   P(1, 2, ∗)   ⋯   P(1, N, ∗)
(A1, B1)     (A2, B2)     ⋯   (AN, BN)

(Other processors are not shown here for simplicity, since they do not carry any data initially.) In other words, the two input matrices A and B are put at the beginning of the linear array in the row-major and the column-major orders, respectively. This is a quite simple and easy-to-manage way of preparing the input data. When the computation is done, cij is found in C(1, i, j). If we divide C into N row vectors C1, C2, ..., CN,

C = [C1; C2; ...; CN],

then the output layout is

P(1, 1, ∗)   P(1, 2, ∗)   ⋯   P(1, N, ∗)
(C1)         (C2)         ⋯   (CN)
Again, such an arrangement makes it easy for C to be used as an input to another computation.

The algorithm proceeds as follows. First, we change the placement of matrix A in such a way that element aij is stored in A(i, 1, j) for all 1 ≤ i, j ≤ N. This can be accomplished by a one-to-one communication:

for 1 ≤ i, j ≤ N do in parallel
  A(i, 1, j) ← A(1, i, j)
endfor

After such replacement, we have the following data distribution,
P(1, 1, ∗)   P(1, 2, ∗)   ⋯   P(1, N, ∗)
(A1, B1)     (A2, B2)     ⋯   (AN, BN)
P(2, 1, ∗)   P(2, 2, ∗)   ⋯   P(2, N, ∗)
(A2, −)      (−, −)       ⋯   (−, −)
⋮            ⋮            ⋱   ⋮
P(N, 1, ∗)   P(N, 2, ∗)   ⋯   P(N, N, ∗)
(AN, −)      (−, −)       ⋯   (−, −)

where the linear array of N^3 processors is logically arranged as a two-dimensional N × N array, and each element in the array stands for a group of N processors P(i, j, ∗). The symbol "−" means that the A and B registers are still undefined. Next, we distribute the rows of A and the columns of B to the right processors, such that processors P(i, j, ∗) hold Ai and Bj, for all 1 ≤ i, j ≤ N. This can be performed using multiple multicasting operations:

for 1 ≤ i, k ≤ N do in parallel
  A(i, 2, k), A(i, 3, k), ..., A(i, N, k) ← A(i, 1, k)
endfor
for 1 ≤ j, k ≤ N do in parallel
  B(2, j, k), B(3, j, k), ..., B(N, j, k) ← B(1, j, k)
endfor
After multicasting, the data distribution is as follows:
P(1, 1, ∗)   P(1, 2, ∗)   ⋯   P(1, N, ∗)
(A1, B1)     (A1, B2)     ⋯   (A1, BN)
P(2, 1, ∗)   P(2, 2, ∗)   ⋯   P(2, N, ∗)
(A2, B1)     (A2, B2)     ⋯   (A2, BN)
⋮            ⋮            ⋱   ⋮
P(N, 1, ∗)   P(N, 2, ∗)   ⋯   P(N, N, ∗)
(AN, B1)     (AN, B2)     ⋯   (AN, BN)
At this point, the processors are ready to compute. In particular, P(i, j, k) calculates A(i, j, k) × B(i, j, k):

for 1 ≤ i, j, k ≤ N do in parallel
  C(i, j, k) ← A(i, j, k) × B(i, j, k)
endfor

Then, the values C(i, j, k), 1 ≤ k ≤ N, are assembled using integer aggregation operations, and the result cij is in C(i, j, 1), as shown below:

for 1 ≤ i, j ≤ N do in parallel
  C(i, j, 1) ← C(i, j, 1) + C(i, j, 2) + ⋯ + C(i, j, N)
endfor

Here, for the purpose of multiple aggregations, the original system is reconfigured into N^2 subsystems P(i, j, ∗), for 1 ≤ i, j ≤ N. After this calculation, the data distribution is as follows:

P(1, 1, 1)   P(1, 2, 1)   ⋯   P(1, N, 1)
(c11)        (c12)        ⋯   (c1N)
P(2, 1, 1)   P(2, 2, 1)   ⋯   P(2, N, 1)
(c21)        (c22)        ⋯   (c2N)
⋮            ⋮            ⋱   ⋮
P(N, 1, 1)   P(N, 2, 1)   ⋯   P(N, N, 1)
(cN1)        (cN2)        ⋯   (cNN)

Finally, one more data movement via one-to-one communication brings the cijs to the right processors:

for 1 ≤ i, j ≤ N do in parallel
  C(1, i, j) ← C(i, j, 1)
endfor

It is clear that each step in the algorithm involves a simple local computation or a primitive communication operation and, hence, takes O(1) time. Thus, the entire MM2 algorithm can be executed in O(1) time. This gives rise to the following claims.

THEOREM 2. For integers of bounded magnitude, or floating-point values of bounded precision and magnitude, or Boolean values, matrix multiplication can be performed by using algorithm MM2 on the LARPBS model in O(1) time, with N^3 processors, and each processor needs O(1) memory. (For Boolean values, + is replaced by or and × by and.)

It is observed that in virtually all sequential and parallel algorithm design and analysis, it is assumed that an arithmetic operation can be performed by a processor in O(1) time, regardless of the magnitude/precision of the operands. In other words, the maximum magnitude M of integers, and p and q for floating-point number representations, are all fixed. Therefore, Theorem 2 is already enough for most applications and is theoretically very reasonable. However, algorithm MM2 can also be extended to handle matrices containing integers/reals with unbounded magnitude and precision. We present the following results and leave out the details.

THEOREM 2A. For integers with unbounded magnitude, matrix multiplication can be performed in O(log log M) time on the LARPBS model, with N^3 φ(M) processors, and each processor needs O(log log M) memory, where M is the largest magnitude.

THEOREM 2B. For reals with precision up to 2^(−(p+q−1)) and magnitude in the order of M = 2^(2^q), matrix multiplication can be performed in O(log log M + log p) time on the LARPBS model, with N^3 ψ(p) < N^3 p/log p processors, and each processor needs O(log p) memory.

A constant time matrix multiplication algorithm has been obtained on reconfigurable meshes with O(N^4) processors [39] and on the AROB model using N^3 log N processors [40], assuming that all integers have O(log N) bits. It is clear that algorithm MM2 uses fewer processors, and Theorems 2A and 2B are applicable to more general cases.
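The overall structure of MM2 can be summarized by the following Python sketch (ours, not the paper's code), which simulates the N^3 processors P(i, j, k) with a dictionary of registers; each of the three phases below corresponds to O(1) bus cycles on the LARPBS.

def mm2(A, B):
    N = len(A)
    # Multicasting phase: P(i, j, *) ends up holding row A_i and column B_j.
    Areg = {(i, j, k): A[i][k] for i in range(N) for j in range(N) for k in range(N)}
    Breg = {(i, j, k): B[k][j] for i in range(N) for j in range(N) for k in range(N)}
    # Local products, all N**3 at once.
    Creg = {p: Areg[p] * Breg[p] for p in Areg}
    # N**2 simultaneous aggregations, one per reconfigured subarray P(i, j, *).
    return [[sum(Creg[(i, j, k)] for k in range(N)) for j in range(N)]
            for i in range(N)]

assert mm2([[1, 2], [3, 4]], [[5, 6], [7, 8]]) == [[19, 22], [43, 50]]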
6 AN O(log N) ALGORITHM USING O(N^2.8074) PROCESSORS

The matrix multiplication algorithm MM3 described in the present section is a parallelization of Strassen's recursive method, which is applicable to an arbitrary ring. First of all, let us describe Strassen's strategy. Without loss of generality, we assume that N is a power of two, i.e., N = 2^n, for some n > 1. Given two matrices A = [A11, A12; A21, A22] and B = [B11, B12; B21, B22], where the Aij and Bij are matrices of size N/2, Strassen's algorithm to calculate C = A × B = [C11, C12; C21, C22] is shown in Table 1. There are three major steps in Strassen's algorithm. First, the matrices Xg and Yg, 1 ≤ g ≤ 7, are calculated using matrix addition and subtraction. Second, the matrices Mg, 1 ≤ g ≤ 7, are calculated using matrix multiplication of the Xgs and Ygs. Third, the matrices C11, C12, C21, C22 are calculated using matrix addition and subtraction of the Mgs. The original motivation of the algorithm was to reduce the number of element multiplications from N^3 to N^2.8074, at the cost of more additions and subtractions; however, the total number of element-level operations is only O(N^2.8074).

The key point of Strassen's algorithm is that the product C of two matrices A and B of size N can be calculated based on seven products M1, M2, ..., M7 of matrices of size N/2. Such a reduction of matrix size can be continued until the size becomes N = 1, i.e., n = 0. Due to this recursive nature, Strassen's algorithm takes Ω(log N) time.
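For reference, one level of the recursion summarized in Table 1 can be written as the following Python sketch (ours, using NumPy for brevity); in MM3 the seven products M1, ..., M7 are computed by recursive calls on seven subpartitions of the LARPBS rather than by the sequential calls shown here.

import numpy as np

def strassen_one_level(A, B, multiply=np.dot):
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
    X = [A11 + A22, A21 + A22, A11, A22, A11 + A12, A21 - A11, A12 - A22]
    Y = [B11 + B22, B11, B12 - B22, B21 - B11, B22, B11 + B12, B21 + B22]
    M = [multiply(x, y) for x, y in zip(X, Y)]        # the recursive products
    C11 = M[0] + M[3] - M[4] + M[6]
    C12 = M[2] + M[4]
    C21 = M[1] + M[3]
    C22 = M[0] - M[1] + M[2] + M[5]
    return np.block([[C11, C12], [C21, C22]])

A = np.arange(16).reshape(4, 4)
B = np.arange(16, 32).reshape(4, 4)
assert np.array_equal(strassen_one_level(A, B), A @ B)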
TABLE 1
STRASSEN'S MATRIX MULTIPLICATION METHOD

X1 = A11 + A22,   Y1 = B11 + B22,   M1 = X1 × Y1
X2 = A21 + A22,   Y2 = B11,         M2 = X2 × Y2
X3 = A11,         Y3 = B12 − B22,   M3 = X3 × Y3
X4 = A22,         Y4 = B21 − B11,   M4 = X4 × Y4
X5 = A11 + A12,   Y5 = B22,         M5 = X5 × Y5
X6 = A21 − A11,   Y6 = B11 + B12,   M6 = X6 × Y6
X7 = A12 − A22,   Y7 = B21 + B22,   M7 = X7 × Y7

C11 = M1 + M4 − M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 − M2 + M3 + M6

TABLE 2
COMPARISON OF N^2 AND (1/7)N^(log 7)

N                2      4      8      16      32       64        128
N^2              4      16     64     256     1,024    4,096     16,384
N^(log 7)        7      49     343    2,401   16,807   117,649   823,543
(1/7)N^(log 7)   1      7      49     343     2,401    16,807    117,649

We are going to show that each level of the recursion in Strassen's algorithm can be implemented in O(1) time, so that the entire algorithm takes O(log N) time. Since each level of the recursion increases the number of processors by a factor of 7, the total number of processors in this parallel implementation is 7^(log N) = N^(log 7) = N^2.8074. The reconfigurability of the LARPBS model is crucial in such an implementation. Also, a pipelined optical bus architecture can efficiently support all necessary data transfer in the parallelized Strassen's algorithm.

Let the p = N^(log 7) processors be labeled P0, P1, ..., Pp−1. Each processor Pi has six registers A(i), B(i), C(i), X(i), Y(i), and M(i) for each level of the recursion. Hence, O(log N) amount of memory is required in each processor. Let Π(i, j) denote a partition of an LARPBS system consisting of the consecutive processors Pi, Pi+1, ..., Pj. The original system is Π(0, p − 1). When we mention a partition Π(i, j), it is assumed that the original system has been reconfigured such that Π(i, j) is a complete and independent LARPBS system. We use R[i..j] as an abbreviation of the registers R(i), R(i + 1), ..., R(j).

Algorithm MM3 is recursively defined. Our first call to the algorithm is MM3(A, B, C, N, Π(0, N^(log 7) − 1)), which means calculating the product C of matrices A and B of size N using all the N^(log 7) processors. Later on, algorithm MM3 will be invoked to calculate submatrices of C of smaller sizes using partitions of the system. Thus, our description of MM3(A, B, C, N, Π(i, j)) is general, where j − i + 1 = N^(log 7).

The input arrays A and B are stored in the A and B registers of the first N^2 processors of the partition, both in the row-major order. The output array C is also stored in the C registers of the first N^2 processors of the partition, in the row-major order. The processors Pi, Pi+1, ..., Pj are divided into seven groups of equal size, or, in other words, the partition Π(i, j) is reconfigured into seven independent subpartitions during the recursion:

Πg = Π(i + (g − 1)π, i + gπ − 1),

where 1 ≤ g ≤ 7, and

π = (j − i + 1)/7.

We wish the input arrays to be stored within the range of Π1. This is possible only when N > 8, as shown in Table 2. Thus, if N ≤ 8, we reach the base case of the recursion and calculate C directly using any method, e.g., the sequential algorithm. This takes constant time.

In general, when N > 8, we calculate Xg, Yg, and Mg using the partition Πg, and all seven computations are done in parallel. To this end, we first make A and B available to all seven partitions via multiple multicasting, though a partition may only need parts of A and B:

for 0 ≤ k ≤ N^2 − 1 do in parallel
  A[i + π + k], A[i + 2π + k], ..., A[i + 6π + k] ← A[i + k]
endfor
for 0 ≤ k ≤ N^2 − 1 do in parallel
  B[i + π + k], B[i + 2π + k], ..., B[i + 6π + k] ← B[i + k]
endfor

Next, Xg and Yg are calculated on Π(i + (g − 1)π, i + gπ − 1) and saved in X[i + (g − 1)π..i + (g − 1)π + N^2 − 1] and Y[i + (g − 1)π..i + (g − 1)π + N^2 − 1], respectively, for all 1 ≤ g ≤ 7, in parallel:

for 1 ≤ g ≤ 7 do in parallel
  calculate Xg on Π(i + (g − 1)π, i + gπ − 1) and save the results in X[i + (g − 1)π..i + (g − 1)π + N^2 − 1]
endfor
for 1 ≤ g ≤ 7 do in parallel
  calculate Yg on Π(i + (g − 1)π, i + gπ − 1) and save the results in Y[i + (g − 1)π..i + (g − 1)π + N^2 − 1]
endfor

Different operations are performed for different Πgs. However, only one element pair-wise matrix addition
LI ET AL.: FAST AND PROCESSOR EFFICIENT PARALLEL MATRIX MULTIPLICATION ALGORITHMS ON A LINEAR ARRAY
and subtraction is involved in each partition Πg. Then, the Mgs are obtained recursively on Πg, for all 1 ≤ g ≤ 7, in parallel:
TABLE 3 TIME AND PROCESSOR COMPLEXITIES OF &1(e ) e
for 1 ≤ g ≤ 7 do in parallel recursively call MM3(X, Y, M, N/2, Π(i + (g − 1)π, i + gπ − 1)) endfor
2.8074
performed by using algorithm MM3 on the LARPBS model 2.8074
in O(log N) time, with O(N ) processors, and each processor needs O(log N) memory. The time complexity O(log N) is due to the nature of Strassen’s algorithm, not the physical limitation of LARPBS systems. It is unlikely that Strassen’s algorithm can be easily parallelized on static networks such as meshes and hypercubes with the same performance as that of MM3, nor on ordinary bus systems with limited bandwidth.
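The recursion that MM3 parallelizes can be checked against Table 1 with a short sequential sketch. The Python code below is our own illustration, not the LARPBS implementation; the names strassen and threshold are ours, and the N ≤ 8 cutoff mirrors the base case used by MM3.

    import numpy as np

    def strassen(A, B, threshold=8):
        """Sequential reference for the recursion parallelized by MM3.
        A and B are square matrices whose size is a power of two."""
        N = A.shape[0]
        if N <= threshold:                 # base case: any direct method
            return A @ B
        h = N // 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
        # Step 1: the seven X/Y pairs of Table 1 (additions/subtractions only).
        X = [A11 + A22, A21 + A22, A11, A22, A11 + A12, A21 - A11, A12 - A22]
        Y = [B11 + B22, B11, B12 - B22, B21 - B11, B22, B11 + B12, B21 + B22]
        # Step 2: seven products of half-size matrices.
        M = [strassen(x, y, threshold) for x, y in zip(X, Y)]
        # Step 3: assemble C from the M's, again by addition/subtraction.
        C11 = M[0] + M[3] - M[4] + M[6]
        C12 = M[2] + M[4]
        C21 = M[1] + M[3]
        C22 = M[0] - M[1] + M[2] + M[5]
        return np.block([[C11, C12], [C21, C22]])

    # Quick check against the standard algorithm:
    # A = np.random.randint(0, 10, (16, 16)); B = np.random.randint(0, 10, (16, 16))
    # assert np.array_equal(strassen(A, B), A @ B)

In MM3, the seven recursive products in step 2 are not computed one after another as in this sketch; they are computed simultaneously on the seven subpartitions Πg = Π(i + (g − 1)π, i + gπ − 1), which is where the reconfigurability of the LARPBS is used.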
7 COMPOUND ALGORITHMS

A compound algorithm is a combination of multiple existing algorithms solving the same problem. Two compound algorithms for matrix multiplication on LARPBS, i.e., &1(e) and &2(δ), are given in this section. As a matter of fact, &1(e) and &2(δ) are algorithm schemes with a parameter, such that they have adjustable time complexity at different levels. Without loss of generality, we assume that N is a power of two in the following discussion.

We first consider algorithm &1(e), the combination of MM3 and MM1 with a parameter e. Algorithm &1(e) proceeds in the same way as MM3 for the first m levels of the recursion. Then, the submatrix sizes are reduced to N/2^m. At this point, &1(e) employs algorithm MM1 to calculate the remaining products of submatrices of size N/2^m, using (N/2^m)^2 processors for each product. Thus, the time complexity is

    O(N/2^m + m),

and the total number of processors is

    O(7^m (N/2^m)^2).

If we let m = e log N, for some constant e, where 0 ≤ e ≤ 1, then the time complexity becomes

    O(N^{1−e} + e log N),

and the number of processors is

    7^{e log N} (N^{1−e})^2 = O(N^{2 + 0.8074e}),

so that time can be traded for processors by adjusting e, as illustrated in Table 3.

TABLE 3
TIME AND PROCESSOR COMPLEXITIES OF &1(e)
(columns: e, time, number of processors)

The second compound algorithm, &2(δ), combines MM3 and MM2: it proceeds as MM3 for the first m = (log N)^δ levels of the recursion and then uses the constant time algorithm MM2 to compute the remaining products of submatrices of size N/2^m. The time complexity is therefore O((log N)^δ), and the total number of processors is 7^m (N/2^m)^3 = N^3/(8/7)^{(log N)^δ} = N^3/(1.1428...)^{(log N)^δ}. For any x > 0 and δ ≥ 0,

    x^{(log N)^δ} = N^{(log x)/(log N)^{1−δ}}.

Hence, we have

    N^3/(1.1428...)^{(log N)^δ} = N^{3 − (3 − log 7)/(log N)^{1−δ}} = o(N^3), 0 < δ ≤ 1.

THEOREM 5. For integers of bounded magnitude, or floating-point values of bounded precision and magnitude, or Boolean values, matrix multiplication can be performed by using algorithm &2(δ) on the LARPBS model in O((log N)^δ) time, with N^3/(1.1428...)^{(log N)^δ} = o(N^3) processors, where 0 < δ ≤ 1.
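The trade-offs behind &1(e) and &2(δ) are easy to tabulate. The short Python script below is our own numerical illustration of the formulas above (the function names are ours, not part of the original algorithms): it prints, for a few values of e, the time exponent 1 − e and the processor exponent 2 + e(log 7 − 2), and, for a few values of δ, the fraction of N^3 processors that &2(δ) needs.

    import math

    LOG7 = math.log2(7)                      # ~2.8074

    def c1_exponents(e):
        """Time and processor exponents of the compound algorithm &1(e).
        For e = 1 the additive e*log N term dominates, giving MM3's O(log N) time."""
        return 1.0 - e, 2.0 + e * (LOG7 - 2.0)

    def c2_processor_fraction(N, delta):
        """Processor count of &2(delta) divided by N^3, i.e., (7/8)^((log N)^delta)."""
        return (7.0 / 8.0) ** (math.log2(N) ** delta)

    if __name__ == "__main__":
        for e in (0.0, 0.25, 0.5, 0.75, 1.0):
            t, p = c1_exponents(e)
            print(f"&1({e}): time O(N^{t:.4f}), processors O(N^{p:.4f})")
        for delta in (0.5, 0.75, 1.0):
            print(f"&2({delta}), N = 2^20: processors / N^3 = "
                  f"{c2_processor_fraction(2 ** 20, delta):.4f}")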
Theorem 5 implies that it is feasible to achieve sublogarithmic time using o(N^3) processors for matrix multiplication on a realistic system. We notice that algorithm &2(δ) can also be extended to arbitrary integers and reals.

THEOREM 5A. For integers with unbounded magnitude, matrix multiplication can be performed in O((log N)^δ + log log M) time on the LARPBS model, with

    N^3 φ(M) = O( (N^3 (log N)^δ / (1.1428...)^{(log N)^δ}) ⋅ (log M / log log M) )

processors, where 0 ≤ δ ≤ 1, M is the largest magnitude, and each processor needs O((log N)^δ + log log M) memory.

THEOREM 5B. For reals with precision up to 2^{−(p+q−1)} and magnitude in the order of M = 2^q, matrix multiplication can be performed in O((log N)^δ + log log M + log p) time on the LARPBS model, with

    N^3 ψ(p) = O( (N^3 (log N)^δ / (1.1428...)^{(log N)^δ}) ⋅ (p / log p) )

processors, where 0 ≤ δ ≤ 1, M is the largest magnitude, and each processor needs O((log N)^δ + log p) memory.

Since the set of Boolean values with operations and/or is a closed semiring (which is not a ring), Strassen's algorithm is not directly applicable. However, Boolean matrix multiplication can be treated as integer matrix multiplication over the ring Z_{N+1} (cf. Section 6.6 in [1]), and we have the following result based on Theorem 5A.

THEOREM 5C. Boolean matrix multiplication can be performed in O((log N)^δ + log log N) time on the LARPBS model, with

    N^3 f(N) = O( (N^3 (log N)^δ / (1.1428...)^{(log N)^δ}) ⋅ (log N / log log N) )

processors, where 0 ≤ δ ≤ 1, and each processor needs O((log N)^δ + log log N) memory.

Recently, the well-known Four Russians' algorithm has been parallelized on the LARPBS model; the resulting algorithm has constant execution time and uses O(N^3/log N) processors [22].
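The reduction behind Theorem 5C is easy to state concretely. The sketch below is our own sequential illustration (the function name boolean_matmul is ours): it treats Boolean entries as 0/1 integers, multiplies them over the integers, and maps every nonzero entry of the product back to true; since each entry of the integer product is at most N, arithmetic over Z_{N+1} as in [1] loses no information.

    def boolean_matmul(a, b):
        """Boolean matrix product via integer matrix multiplication.
        a and b are n x n lists of bools (or 0/1 integers)."""
        n = len(a)
        # Treat True/False as 1/0 and form the ordinary integer product;
        # every entry lies between 0 and n, so working over Z_{n+1} suffices.
        c_int = [[sum(int(a[i][k]) * int(b[k][j]) for k in range(n))
                  for j in range(n)] for i in range(n)]
        # An entry of the Boolean product is True iff some term a_ik and b_kj
        # holds, i.e., iff the corresponding integer entry is nonzero.
        return [[c_int[i][j] != 0 for j in range(n)] for i in range(n)]

    # Example: boolean_matmul([[True, False], [False, True]],
    #                         [[False, True], [True, False]])
    #          == [[False, True], [True, False]]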
8 CONCLUDING REMARKS

We developed a number of matrix multiplication algorithms on linear arrays with reconfigurable pipelined bus systems. Algorithm &1(e) has adjustable time complexity at the sublinear level. Algorithm &2(δ) implies that it is feasible to achieve sublogarithmic time using o(N^3) processors for matrix multiplication on a realistic system. Algorithms MM3, &1(e), and &2(δ) all have o(N^3) cost and, hence, are very processor efficient. Algorithms MM1, MM3, and &1(e) are general-purpose matrix multiplication algorithms, where the array elements are in any ring. Extension of algorithms MM2 and &2(δ) to unbounded integers and real values was also discussed in this paper.

Many other important matrix problems and graph theory problems can be solved using a matrix multiplication algorithm as a subroutine [24]. Hence, the algorithms developed in this paper can be used as subroutines for solving a large class of problems. We have not seen parallel computing systems with conventional interconnection networks that solve the matrix multiplication problem in constant running time, support efficient implementation of Strassen's algorithm, and/or provide such a wide range of cost and performance combinations. Our results have the following important implication: due to its high communication bandwidth, the versatile communication patterns it supports, and its ability to utilize communication reconfigurability as an integral part of a parallel computation, the LARPBS is a powerful architecture for exploiting a degree of parallelism in a computational problem that most other machine models cannot achieve.

To conclude the paper, we would like to briefly discuss the scalability of our algorithms. While the algorithms presented in this paper are fast and processor efficient in the theoretical sense, it is less practical to build an LARPBS system with O(N^c) processors, where c > 1, for large scale matrix computations. For a system with fixed size, a large matrix should be partitioned into submatrices such that processors can perform computations and communications at the submatrix level. In other words, our primitive operations should be extended to handle blocks of data, not just a single data item. Some initial work in this direction has been reported in [48].

When the system size is far less than the problem size, the parallel time complexity can generally be represented as O(T(N)/p + Tcomm(N, p)), where N is the problem size, p is the number of processors available, T(N) is the time complexity of the sequential algorithm being parallelized, and Tcomm(N, p) is the overall communication overhead of a parallel implementation. We say that a parallel implementation is scalable in the range [1..P] if linear speedup can be achieved for all 1 ≤ p ≤ P. The implementation is highly scalable if P is as large as Θ(T(N)/(T*(N)(log N)^k)), where k ≥ 0 is some constant and T*(N) is the best possible parallel time. The implementation is fully scalable if k = 0. A fully scalable parallel implementation means that the sequential algorithm can be fully parallelized, and communication overhead in parallelization is negligible. We have obtained the following results, namely, there is a fully scalable implementation of the standard matrix multiplication algorithm on LARPBS, and there is a highly scalable parallelization of Strassen's algorithm on LARPBS with k = 2.4771.... Due to space limitation, we will report this part of the research in a separate paper [25].
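As a concrete reading of this scalability definition, the snippet below uses a hypothetical cost model of our own (it is not taken from [25]) and checks up to which processor count p the modeled parallel time T(N)/p + Tcomm(N, p) still stays within a constant factor of the ideal T(N)/p.

    import math

    def is_linear_speedup(T, Tcomm, N, p, slack=2.0):
        """True if the modeled parallel time T(N)/p + Tcomm(N, p) stays within a
        constant factor 'slack' of the ideal T(N)/p, i.e., speedup is linear."""
        return T(N) / p + Tcomm(N, p) <= slack * (T(N) / p)

    # Hypothetical cost model: cubic sequential work, p*log p communication overhead.
    T = lambda N: N ** 3
    Tcomm = lambda N, p: p * math.log2(max(p, 2))

    N, P = 1024, 1
    while is_linear_speedup(T, Tcomm, N, P + 1):
        P += 1
    print(f"under this model, linear speedup (within factor 2) holds for all p <= {P}")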
ACKNOWLEDGMENTS The authors are grateful to the four anonymous reviewers whose criticism and constructive comments helped to improve the presentation of the paper. Keqin Li was supported by the U.S. National Aeronautics and Space Administration and the State University of New York through the NASA/University Joint Venture in Space Science Program under Grant NAG8-1313 and the 1996 NASA/ASEE Summer Faculty Fellowship Program. Yi Pan was supported by the U.S. National Science Foundation under Grants CCR-9211621 and CCR-9503882, the U.S. Air Force Avionics Laboratory, Wright Laboratory, Dayton, Ohio, under Grant F33615-C-2218, and an Ohio Board of Regents Investment Fund Competition Grant. Si Qing Zheng was supported by the U.S. National Science Foundation under Grant ECS9626215 and Louisiana Grant LEQSF (1996-99)-RD-A-16.
REFERENCES [1] A.V. Aho, J.E. Hopcroft, and J.D. Ullman, The Design and Analysis of Computer Algorithms. Reading, Mass.: Addison-Wesley, 1974. [2] S.G. Akl, Parallel Computation: Models and Methods. Upper Saddle River, N.J.: Prentice Hall, 1997. [3] A.F. Benner, H.F. Jordan, and V.P. Heuring, “Digital Optical Computing With Optically Switched Directional Couplers,” Optical Eng., vol. 30, pp. 1,936-1,941, 1991. [4] D. Bini and V.Y. Pan, Polynomial and Matrix Computations, vol. 1, Fundamental Algorithms. Boston: Birkhäuser, 1994. [5] S.H. Bokhari, “Finding Maximum on an Array Processor With a Global Bus,” IEEE Trans. Computers, vol. 32, pp. 133-139, 1984. [6] R.A. Brualdi and H.J. Ryser, Combinatorial Matrix Theory. New York: Cambridge Univ. Press, 1991. [7] L.E. Cannon, “A Cellular Computer to Implement the Kalman Filter Algorithm,” PhD thesis, Montana State Univ., 1969. [8] A.K. Chandra, “Maximal Parallelism in Matrix Multiplication,” Report RC-6193, IBM T.J. Watson Research Center, Oct. 1979. [9] D. Chiarulli, R. Melhem, and S. Levitan, “Using Coincident Optical Pulses for Parallel Memory Addressing,” Computer, vol. 30, pp. 48-57, 1987. [10] K.L. Chung, “Generalized Mesh-Connected Computers With Multiple Buses,” Proc. Int’l Conf. Parallel and Distributed Systems, pp. 622-626, Dec. 1993. [11] D. Coppersmith and S. Winograd, “Matrix Multiplication via Arithmetic Progressions,” J. Symbolic Computation, vol. 9, pp. 251-280, 1990. [12] E. Dekel, D. Nassimi, and S. Sahni, “Parallel Matrix and Graph Algorithms,” SIAM J. Computing, vol. 10, pp. 657-673, 1981. [13] P.W. Dowd, “Wavelength Division Multiple Access Channel Hypercube Processor Interconnection,” IEEE Trans. Computers, vol. 41, pp. 1,223-1,241, 1992. [14] Z. Guo, “Sorting on Array Processors With Pipelined Buses,” Proc. Int’l Conf. Parallel Processing, pp. 289-292, Aug. 1992. [15] Z. Guo, R. Melhem, R. Hall, D. Chiarulli, and S. Levitan, “Pipelined Communications in Optically Interconnected Arrays,” J. Parallel and Distributed Computing, vol. 12, pp. 269-282, 1991.
[16] M. Hamdi and Y. Pan, “Efficient Parallel Algorithms on Optically Interconnected Arrays of Processors,” IEE Proc. Computers and Digital Techniques, vol. 142, pp. 87-92, Mar. 1995. [17] S.J. Horng, “Prefix Computation and Some Related Applications on Mesh Connected Computers With Hyperbus Broadcasting,” Proc. Int’l Conf. Computing and Information, pp. 366-388, July 1995. [18] IEEE, Standard 754, Order No. CN-953, Los Alamitos, Calif.: IEEE CS Press, 1985. [19] V. Kumar, A. Grama, A. Gupta, and G. Karypis, Introduction to Parallel Computing: Design and Analysis of Algorithms. Redwood City, Calif.: Benjamin/Cummings, 1994. [20] T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays ⋅ Trees ⋅ Hypercubes. San Mateo, Calif.: Morgan Kaufmann, 1992. [21] S. Levitan, D. Chiarulli, and R. Melhem, “Coincident Pulse Techniques for Multiprocessor Interconnection Structures,” Applied Optics, vol. 29, pp. 2,024-2,039, 1990. [22] K. Li, “Constant Time Boolean Matrix Multiplication on a Linear Array With a Reconfigurable Pipelined Bus System,” J. Supercomputing, vol. 11, no. 4, pp. 391-403, 1997. A preliminary version appeared in Proc. 11th Ann. Int’l Symp. High Performance Computing Systems, pp. 179-190, July 1997. [23] K. Li, Y. Pan, and S.Q. Zheng, “Simulation of Parallel Random Access Machines on a Linear Array With a Reconfigurable Pipelined Bus System,” Proc. Int’l Conf. Parallel and Distributed Processing Techniques and Applications, vol. II, pp. 590-599, July 1997. [24] K. Li, Y. Pan, and S.Q. Zheng, “Fast and Efficient Parallel Matrix Computations on a Linear Array With a Reconfigurable Pipelined Optical Bus System,” High Performance Computing Systems and Applications, J. Schaeffer and R. Unrau, eds. Kluwer Academic, 1998. [25] K. Li, Y. Pan, and S.Q. Zheng, “Scalable Parallel Matrix Multiplication Using Reconfigurable Pipelined Optical Bus Systems,” Proc. 10th Int’l Conf. Parallel and Distributed Computing and Systems, Oct. 1998. [26] K. Li, Y. Pan, and S.Q. Zheng, eds., Parallel Computing Using Optical Interconnections. Kluwer Academic, 1998 (forthcoming). [27] Y. Li, Y. Pan, and S.Q. Zheng, “Pipelined TDM Optical Bus With Conditional Delays,” Optical Eng., vol. 36, no. 9, pp. 2,417-2,424, 1997. [28] Y. Li and S.Q. Zheng, “Parallel Selection on a Pipelined TDM Optical Buses,” Proc. Int’l Conf. Parallel and Distributed Computing Systems, pp. 69-73, Dijon, France, Sept. 1996. [29] A. Louri, “Three-Dimensional Optical Architecture and DataParallel Algorithms for Massively Parallel Computing,” IEEE Micro, vol. 11, no. 2 pp. 24-81, Apr. 1991. [30] R. Miller, V.K. Prasanna-Kumar, D.I. Reisis, and Q.F. Stout, “Parallel Computations on Reconfigurable Meshes,” IEEE Trans. Computers, vol. 42, pp. 678-692, 1993. [31] Y. Pan, “Hough Transform on Arrays With an Optical Bus,” Proc. Fifth Int’l Conf. Parallel and Distributed Computing and Systems, pp. 161-166, Oct. 1992. [32] Y. Pan, “Order Statistics on Optically Interconnected Multiprocessor Systems,” Proc. First Int’l Workshop Massively Parallel Processing Using Optical Interconnections, pp. 162-169, Apr. 1994. [33] Y. Pan and M. Hamdi, “Efficient Computation of Singular Value Decomposition on Arrays With Pipelined Optical Buses,” J. Network and Computer Applications, vol. 19, pp. 235-248, July 1996. [34] Y. Pan, M. Hamdi, and K. Li, “Efficient and Scalable Quicksort on a Linear Array With a Reconfigurable Pipelined Bus System,” to appear in Future Generation Computer Systems. A preliminary version appeared in Proc. IEEE Int’l Symp. 
Parallel Architectures, Algorithms, and Networks, pp.313-319, June 1996. [35] Y. Pan and K. Li, “Linear Array With a Reconfigurable Pipelined Bus System—Concepts and Applications,” to appear in Information Sciences—An Int’l J. A preliminary version appeared in Proc. Int’l Conf. Parallel and Distributed Processing Techniques and Applications, vol. III, pp. 1,431-1,442, Aug. 1996 [36] Y. Pan, K. Li, and S.Q. Zheng, “Fast Nearest Neighbor Algorithms on a Linear Array With a Reconfigurable Pipelined Bus System,” to appear in Parallel Algorithms and Applications. A preliminary version appeared in Proc. IEEE Int’l Symp. Parallel Architectures, Algorithms, and Networks, pp. 444-450, Dec. 1997. [37] V. Pan, “Parallel Solution of Sparse Linear and Path Systems,” Synthesis of Parallel Algorithms, J.H. Reif ed., pp. 621-678. San Mateo, Calif.: Morgan Kaufmann, 1993. [38] V. Pan and J. Reif, “Efficient Parallel Solution of Linear Systems,” Proc. Seventh ACM Symp. Theory of Computing, pp. 143-152, May 1985. [39] H. Park, H.J. Kim, and V.K. Prasanna, “An O(1) Time Optimal Algorithm for Multiplying Matrices on Reconfigurable Mesh,” Information Processing Letters, vol. 47, pp. 109-113, 1993.
[40] S. Pavel, “Computation and Communication Aspects of Arrays With Optical Pipelined Buses,” PhD thesis, Dept. of Computing and Information Science, Queen’s Univ., Ontario, Canada, 1996. [41] S. Pavel and S.G. Akl, “Matrix Operations Using Arrays With Reconfigurable Optical Buses,” J. Parallel Algorithms and Applications, vol. 8, pp. 223-242, 1996. [42] S. Pavel and S.G. Akl, “On the Power of Arrays With Reconfigurable Optical Buses,” Proc. Int’l Conf. Parallel and Distributed Processing Techniques and Applications, vol. III, pp. 1,443-1,454, Aug. 1996. [43] C. Qiao, “Efficient Matrix Operations in a Reconfigurable Array With Spanning Optical Buses,” Proc. Fifth Symp. Frontiers of Parallel Computation, pp. 273-280, 1995. [44] C. Qiao and R. Melhem, “Time-Division Optical Communications in Multiprocessor Arrays,” IEEE Trans. Computers, vol. 42, pp. 577-590, 1993. [45] S. Rajasekaran and S. Sahni, “Sorting, Selection, and Routing on the Array With Reconfigurable Optical Buses,” IEEE Trans. Parallel and Distributed Systems, vol. 8, no. 11, pp. 1,123-1,131, 1997. [46] V. Strassen, “Gaussian Elimination Is Not Optimal,” Numerische Mathematik, vol. 13, pp. 354-356, 1969. [47] C. Tocci and H.J. Caulfield, Optical Interconnection—Foundations and Applications. Artech House, Inc., 1994. [48] J.L. Trahan, Y. Pan, R. Vaidyanathan, and A.G. Bourgeois, “Scalable Basic Algorithms on a Linear Array With a Reconfigurable Pipelined Bus System,” Proc. 10th Int’l Conf. Parallel and Distributed Computing Systems, pp. 564-569, Oct. 1997. [49] J.L. Trahan, Y. Pan, R. Vaidyanathan, and A.G. Bourgeois, “Scalable Algorithms and Simulation Results on a Linear Array With a Reconfigurable Pipelined Bus System,” submitted for publication. [50] S.Q. Zheng and Y. Li, “Pipelined Asynchronous Time-Division Multiplexing Optical Bus,” Optical Eng., vol. 36, no. 12, pp. 3,392-3,400, 1997.
Keqin Li (S’90-M’91-SM’96) received the BS degree in computer science from Tsinghua University, China, in 1985, and the PhD degree in computer science from the University of Houston in 1990. He is currently an associate professor of computer science at the State University of New York at New Paltz. Dr. Li’s research interests are mainly in design and analysis of algorithms, and parallel and distributed computing, with particular interests in approximation algorithms, job scheduling, task dispatching, load balancing, performance evaluation, dynamic tree embedding, scalability analysis, and parallel computing using optical interconnects. His early work on processor allocation and job scheduling on meshes has inspired extensive subsequent work by numerous researchers, and created a very active and productive research field. He has published nearly 100 journal articles, book chapters, and research papers in refereed international conference proceedings. He received Best Paper Awards at the 1996 International Conference on Parallel and Distributed Processing Techniques and Applications and the 1997 IEEE National Aerospace and Electronics Conference. He has also coedited four conference proceedings and a new book entitled Parallel Computing Using Optical Interconnections, published by Kluwer Academic in 1998. Dr. Li is the associate editor-in-chief of the International Journal of Parallel and Distributed Systems and Networks and has been a guest editor of Informatica and Information Sciences—An International Journal. He has served in various capacities for numerous international conferences as program committee member, track chair, and special session organizer. He was the program chair of the Ninth International Conference on Parallel and Distributed Computing and Systems (October 1997), he is a general cochair of the 10th International Conference on Parallel and Distributed Computing and Systems (October 1998), a conference cochair of the Fourth International Conference on Computer Science and Informatics (October 1998), and the vice program chair of the IPPS Workshop on Optics and Computer Science (April 1999). Dr. Li is a senior member of the IEEE, and a member of the IEEE Computer Society, ACM, SIGACT, SIGARCH, SIAM, ISMM, IASTED, and SCS.
Yi Pan (SM’97) entered Tsinghua University in March 1978 with the highest college entrance examination score among all 1977 high school graduates in Jiangsu Province. He received his BEng degree in computer engineering from Tsinghua University, China, in 1982, and his PhD degree in computer science from the University of Pittsburgh, in 1991. Dr. Pan joined the Department of Computer Science at the University of Dayton, Ohio, in 1991 and has been an associate professor since 1996. His research interests include parallel algorithms and architectures, optical communication and computing, distributed computing, task scheduling, and networking. He has published more than 50 papers in international journals and conference proceedings. He has received several awards, including the U.S. National Science Foundation Research Opportunity Award, U.S. Air Force Office of Scientific Research Summer Faculty Fellowship, Andrew Mellon Fellowship from Mellon Foundation, the best paper award from Parallel and Distributed Processing Techniques and Applications Conference ‘96, and Summer Research Fellowship from the Research Council of the University of Dayton. His research has been supported by the U.S. National Science Foundation, U.S. Air Force Office of Scientific Research, the U.S. Air Force, and the state of Ohio. Dr. Pan is currently on the editorial board of the Journal of Supercomputing and the Journal of Parallel and Distributed Computing Practices, and is an associate editor of the International Journal of Parallel and Distributed Systems and Networks. He has served as a guest editor of special issues for four international journals: Information Sciences, Parallel Processing Letters, Informatica, and International Journal of Parallel and Distributed Systems and Networks. He is the program chair of the 10th International Conference on Parallel and Distributed Computing and Systems in 1998, the conference cochair of the Fourth International Conference on Computer Science and Informatics in 1998, and the program chair of the 1999 IPPS Workshop on Optics and Computer Science. He has also served as vice program chair, publicity chair, session chair, or committee member for more than 15 international conferences. Dr. Pan is a senior member of the IEEE and a member of the IEEE Computer Society. He is listed in Men of Achievement and Marquis Who’s Who in the Midwest.
Si Qing Zheng received the PhD degree in computer science from the University of California, Santa Barbara, in 1987. In August 1987, he joined the faculty of Louisiana State University, where he is currently an associate professor of computer science and an adjunct associate professor of electrical and computer engineering. Dr. Zheng’s research interests include VLSI design, computer architectures, parallel and distributed computing, optical interconnections, computer networks, and design and analysis of algorithms. He has published more than 130 refereed papers in these areas. Dr. Zheng was the program committee chairman of the Eighth International Conference on Computing and Information, the program committee cochairman of the 10th ISCA International Conference on Parallel and Distributed Computing Systems, and the program committee vice chairman of the Second International Conference on Parallel and Distributed Computing and Networks. He is an associate editor of several computing journals.