Computation of Singular Value Decomposition on Arrays With Pipelined Optical Buses

Yi Pan, University of Dayton
Mounir Hamdi, Hong Kong University of Science and Technology

Abstract

In this paper, we present parallel algorithms for solving the Singular Value Decomposition (SVD) problem, which arises in many application areas. The algorithms are designed for efficient performance on two architectures: a 1-D array of processors with a single pipelined optical bus, and a 2-D array of processors with multiple pipelined optical buses. Underlying the algorithms is a careful design of the routing permutations that takes full advantage of the unique properties of data transmission on optical buses. Analysis of the parallel time requirements shows that the 1-D algorithm takes O(mn) time and the 2-D algorithm needs only O(n log m) time. These time complexities are asymptotically equivalent to those achieved on the hypercube, while using substantially less hardware.

Introduction

Singular value decomposition (SVD) is commonly used in determining the solution of unconstrained linear least squares problems. In many applications, such as real-time signal processing, the solution is needed in the shortest possible time. SVD is one of the most important factorizations of real m-by-n (m ≥ n) matrices, and is computationally intensive. Given the growing availability of parallel computers and the need for faster solutions to the SVD problem, there has been great interest in the development of parallel implementations of singular value decomposition.

© 1993 ACM

The singular value decomposition of a real m-by-n (m ≥ n) matrix A is its factorization into the product of three matrices:

A = U D V^T

where U is an m-by-n matrix with orthogonal columns, D is an n-by-n non-negative diagonal matrix, and V is an n-by-n orthogonal matrix. For more detail, please refer to [9]. The most common implementation of SVD is the Golub-Kahan SVD algorithm [9, 10], which requires O(mn^2) time on a single-processor computer. In order to reduce the computation time of this sequential algorithm, there has been much interest recently in developing faster, parallel SVD algorithms [2-4, 7, 10, 14, 16, 23]. On a linear processor array, the most efficient SVD algorithm is the Jacobi-like algorithm given by Brent and Luk [2]. The algorithm, which is based on a one-sided orthogonalization method due to Hestenes [15], takes O(mnS) time on O(n) processors, where S is the number of sweeps. However, Brent and Luk's algorithm is not optimal in terms of communication overhead. Unnecessary costs are incurred by mapping the systolic array architecture onto a ring-connected linear array, due to the double sends and receives required between pairs of neighboring processors. Eberlein [6], Bischof [1] and others have proposed various modifications of this algorithm for hypercube implementations, which require the embedding of rings via binary reflected Gray codes. Gao and Thomas [8] have investigated this problem using a recursive divide-exchange communication pattern. These algorithms have the same order of magnitude of time complexity as the one described by Brent and Luk, but reduce the constant factor. SVD algorithms on hypercube and shuffle-exchange computers have also been designed by Pan and Chuang [18]. Instead of mapping a column of data onto a processor in a hypercube, as is done in [4, 8], they map a column pair of data onto a column of processors in a hypercube or a shuffle-exchange computer. Using this method, the total time is reduced to O(n log m) per sweep using mn/(2 log m) processors.
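As a quick numerical illustration of the factorization A = U D V^T (not of the parallel algorithms developed in this paper), NumPy's built-in SVD routine can be used to check the stated shapes and orthogonality properties; the matrix dimensions below are arbitrary examples of our own choosing:

```python
# Illustrating A = U D V^T for a real m-by-n (m >= n) matrix with NumPy.
import numpy as np

m, n = 5, 3
A = np.random.default_rng(0).standard_normal((m, n))

# full_matrices=False yields the "thin" SVD: U is m-by-n, D is n-by-n, V is n-by-n.
U, d, Vt = np.linalg.svd(A, full_matrices=False)
D = np.diag(d)

assert U.shape == (m, n) and D.shape == (n, n) and Vt.shape == (n, n)
# U has orthonormal columns, V is orthogonal, and the product reconstructs A.
assert np.allclose(U.T @ U, np.eye(n))
assert np.allclose(Vt @ Vt.T, np.eye(n))
assert np.allclose(U @ D @ Vt, A)
```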

Most of the parallel algorithms for the SVD problem are targeted for execution on point-to-point connected parallel architectures. Data transmission in these architectures is quite different from that of arrays of processors with pipelined optical buses. One of the unique features of a pipelined optical bus is its ability to allow the concurrent transmission of messages in a pipelined fashion with a very short end-to-end message delay. This makes it very efficient for the interconnection of parallel computers. For this reason, the idea of using arrays of processors with pipelined optical buses has attracted attention recently. We will briefly introduce this novel architecture in section 3. Many important parallel algorithms, such as broadcasting [20], sorting [12], numerical problems [13], and the Hough transform [19], have been implemented on arrays of processors with pipelined optical buses. In this paper, we use this novel architecture for the implementation of efficient parallel algorithms for the SVD problem.

This paper is organized as follows. The Hestenes method for solving the SVD problem is reviewed in the next section. In section 3, a parallel SVD algorithm, which is based on the Hestenes method, is implemented on a 1-D array of processors with a pipelined optical bus. In section 4, a variation of this parallel algorithm is implemented on a 2-D array of processors with pipelined optical buses. In section 5, we give a summary and assessment of our results.

Singular Value Decomposition

There is a wealth of serial SVD algorithms available in the literature. Most of these algorithms can be transformed into parallel algorithms to be run on parallel computers. Each of these serial algorithms exhibits a certain degree of parallelism. Thus, in choosing the serial algorithm to be transformed into a parallel algorithm, we have to take the potential of inherent parallelism within the algorithm into account. We choose to implement the Hestenes method of solving the SVD problem [15] on an array of processors with pipelined optical buses because of its high potential of parallelism and its flexibility in being transformed efficiently into a parallel algorithm suited for our architecture, even though its sequential version is less efficient than that of the Golub-Kahan-Reinsch SVD method [9, 10].

The basic idea of the decomposition is to generate an orthogonal matrix V such that the transformed matrix AV = W has orthogonal columns. Normalizing the Euclidean length of each non-null column to unity, we get the relation

W = U'D    (1)

where U' is a matrix whose non-null columns form an orthonormal set of vectors, and D is a non-negative diagonal matrix. An SVD of A is then given by the following relation:

A = WV^T = U'DV^T    (2)

As a null column of U' is always associated with a zero diagonal element of D, (1) and (2) are essentially identical. Hestenes [15] uses plane rotations to construct V. A sequence of matrices {A_k} is generated by

A_{k+1} = A_k R_k    (k = 1, 2, ...)

where A_1 = A, and each R_k is a plane rotation. Let A_k = (a_1^k, ..., a_n^k), where a_i^k is the ith column of A_k. Suppose R_k = (r_ij) is a rotation on the (p, q) plane which orthogonalizes columns a_p^k and a_q^k of A_k, with p < q. R_k is an orthogonal matrix with all of its elements identical to those of the unit matrix except that

r_pp = cos θ,   r_pq = sin θ
r_qp = -sin θ,  r_qq = cos θ    (3)

We note that post-multiplication of A_k by R_k affects only columns a_p^k and a_q^k, and that

(a_p^{k+1}, a_q^{k+1}) = (a_p^k, a_q^k) [  cos θ   sin θ ]
                                        [ -sin θ   cos θ ]    (4)

The rotation angle θ should be chosen so that the two new columns are orthogonal. For this we can use the formulas given by Rutishauser [22]. Defining

γ = (a_p^k)^T a_q^k,   α = (a_p^k)^T a_p^k,   β = (a_q^k)^T a_q^k    (5)

we set θ = 0 if γ = 0. Otherwise, we compute

ζ = (β - α)/(2γ),   t = sign(ζ)/(|ζ| + √(1 + ζ^2)),   cos θ = 1/√(1 + t^2),   sin θ = t cos θ    (6)

The rotation angle θ always satisfies |θ| ≤ π/4. In this way, we orthogonalize the pth and the qth columns in the kth step. The process in which all column pairs (i, j), for i < j, are orthogonalized exactly once is called a sweep.
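The method reviewed above can be sketched serially in a few lines of Python. This is a minimal illustration of the Hestenes one-sided method with Rutishauser's formulas, not the parallel algorithm of this paper; the function name, tolerance, and sweep limit are our own choices:

```python
# Serial one-sided Jacobi (Hestenes) SVD: sweeps of plane rotations make the
# columns of A mutually orthogonal; the column norms are then the singular values.
import numpy as np

def hestenes_svd(A, eps=1e-12, max_sweeps=30):
    A = A.astype(float).copy()
    m, n = A.shape
    V = np.eye(n)
    for _ in range(max_sweeps):
        converged = True
        for p in range(n - 1):              # one sweep visits every pair (p, q), p < q
            for q in range(p + 1, n):
                alpha = A[:, p] @ A[:, p]
                beta  = A[:, q] @ A[:, q]
                gamma = A[:, p] @ A[:, q]
                if abs(gamma) <= eps * np.sqrt(alpha * beta):
                    continue                # columns already orthogonal: theta = 0
                converged = False
                # Rutishauser's stable formulas; the angle satisfies |theta| <= pi/4
                zeta = (beta - alpha) / (2.0 * gamma)
                sgn = 1.0 if zeta >= 0 else -1.0
                t = sgn / (abs(zeta) + np.sqrt(1.0 + zeta * zeta))
                c = 1.0 / np.sqrt(1.0 + t * t)
                s = c * t
                # post-multiplication by R_k affects only columns p and q
                R = np.array([[c, s], [-s, c]])
                A[:, [p, q]] = A[:, [p, q]] @ R
                V[:, [p, q]] = V[:, [p, q]] @ R
        if converged:
            break
    d = np.linalg.norm(A, axis=0)           # singular values (unsorted)
    U = np.divide(A, d, out=np.zeros_like(A), where=d > 0)
    return U, d, V

A = np.random.default_rng(1).standard_normal((6, 4))
U, d, V = hestenes_svd(A)
assert np.allclose((U * d) @ V.T, A)                                 # A = U D V^T
assert np.allclose(np.sort(d)[::-1], np.linalg.svd(A, compute_uv=False))
```

Note that the rotation is only ever applied to the two columns it affects, which is exactly the property the parallel algorithms exploit by distributing column pairs across processors.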

b w ≤ t_c    (7)

We call t_c a pipeline cycle; thus the term pipelined bus is self-explanatory. The above condition ensures that each message will "fit" into a pipeline cycle, so that in a bus cycle, up to N messages can be transmitted by processors simultaneously without collisions on the bus. This is the biggest difference between an optical bus and an electronic bus: the former can handle up to N simultaneous messages when connected to N processors, while the latter can handle just one message regardless of the number of processors. In a parallel array, messages normally have very short length (b is very small).

Figure 1. A 1-D Array with an Optical Bus

Because of the properties of the optical bus (waveguides), messages propagate unidirectionally from right to left on the lower bus segment and from left to right on the upper bus segment. That is why two buses are needed for such an array.

In the remaining discussions, we assume that the condition given by equation (7) above is always satisfied, and that no transmission conflicts are possible as long as all processors are synchronized at the beginning of each bus cycle. In SIMD environments, each processor knows the source (processor id) of the message it is receiving and when it is sent relative to the beginning of a bus cycle. Therefore, the receiving time (relative to the beginning of a bus cycle) can be easily calculated by the processor receiving the message. Let the processor that wishes to receive a message be processor i, and assume that the message is sent by processor j at the beginning of bus cycle C. We use a receiving function, rec(i, j, C), to determine the time that processor i has to wait, relative to the beginning of bus cycle C, before receiving the message from processor j. The value of the function, in number of pipeline cycles, is given by the following equation:

rec(i, j, C) = i + j + 1    (8)

By calculating the receiving function as given by equation (8), each processor is able to selectively receive any message from a train of messages in each bus cycle. These values of the receiving function can be precomputed and stored in a set of receiving control registers for fast retrieval during the execution of the algorithm. The precomputing of these values is done according to the design of the given algorithm. Readers may refer to [17] for details.

Consider the SVD algorithm on a 1-D array with an optical bus of size n/2, in which every PE has a local memory large enough to store a pair of columns of data elements. Suppose that the two columns are stored in arrays A(i) and B(i), 0 ≤ i < n/2, respectively. Moving a column between PEs can be done in m bus cycles, since each PE can put one message onto the bus and receive one message from the bus in one bus cycle, and a column contains m data elements. Here the exchange sequence described in [5] is used to move data column A(i) in PE[i], for all i, to all other PEs sequentially to produce all (A, B) column pairs {(A_p, B_q) | p = 0, 1, ..., n/2 - 1, q = 0, 1, ..., n/2 - 1}. In order to obtain all possible column pairs, one still has to consider the pairing among the A columns and among the B columns. This can be achieved by iteratively applying the process of exchanging the A and the B columns between neighboring nodes to create subarrays, and then moving the A columns within the subarrays. The entire algorithm is given in procedure 1-D-SVD.

procedure 1-D-SVD
  while not (all |γ| < ε) do
    1-D-ORT;
    for k := 0 step 1 until g-1 do
      for p := 1 step 1 until 2^k - 1 do
        h := f(g-k, ...);
        for j := 1 step 1 until m do
          Bus ⇐ A[h](j);
          A[h](j) ⇐ Bus;
        end for j;
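As a toy illustration of selective reception, the receiving function of equation (8) can be used to pick one message out of the train produced in a bus cycle. The dictionary-based bus model and all names below are purely illustrative, not part of the architecture described above:

```python
# Sketch of selective reception on a pipelined optical bus using
# rec(i, j, C) = i + j + 1 from equation (8).

def rec(i, j, C=0):
    """Wait, in pipeline cycles relative to the start of bus cycle C,
    before processor i sees the message sent by processor j."""
    return i + j + 1

# In one bus cycle, all N processors inject a message simultaneously;
# the messages travel as a pipelined train without colliding.
N = 8
receiver = 3
train = {}  # arrival slot at the receiver -> sending processor
for sender in range(N):
    slot = rec(receiver, sender)
    assert slot not in train          # distinct senders arrive in distinct slots
    train[slot] = sender

# Precomputing rec() values (the "receiving control registers" of the text)
# lets processor 3 pick out, say, the message from processor 5:
wanted = 5
assert train[rec(receiver, wanted)] == wanted
```

Because the wait times for distinct senders never coincide at a given receiver, a processor can latch exactly the message it needs from the train, which is what makes the column exchanges of procedure 1-D-SVD collision-free.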
