departments of mathematics - CiteSeerX

3 downloads 253 Views 470KB Size Report
Aug 6, 1996 - Page 1 ..... the parallel virtual machine is called a host. The main .... Each host computer performs the computation of a whole col-. 14 ...
ISSN 1360-1725

UMIST

Fvpspack: a Fortran and PVM Package to Compute the Field of Values and Pseudospectra of Large Matrices T. Braconnier Numerical Analysis Report No. 293 August 1996 Manchester Centre for Computational Mathematics Numerical Analysis Reports

DEPARTMENTS OF MATHEMATICS Reports available from: And over the World-Wide Web from URLs Department of Mathematics http://www.ma.man.ac.uk/MCCM/MCCM.html University of Manchester ftp://ftp.ma.man.ac.uk/pub/narep Manchester M13 9PL England

Fvpspack: a Fortran and PVM Package to Compute the Field of Values and Pseudospectra of Large Matrices Thierry Braconnier August 6, 1996 Abstract The eld of values and pseudospectra are tools which yield insight into the spectral behavior of a matrix. For large sparse matrices, both sets can be eciently computed using a Lanczos type method. Since both computations can be done in a natural parallel way, we have developed a package including Fortran and PVM routines. Experiments show that the PVM codes achieve excellent speed-ups and eciency.

Key words. Field of values, pseudospectra, PVM, parallel algorithm, speed-up.

1 Introduction The eld of values of a n by n complex matrix A is the set of all possible Rayleigh quotients:  z  Az  n F (A) = z z : 0 6= z 2 Cl :  Department of Mathematics, University of Manchester, Manchester, M13 9PL, UK.

[email protected],

http://www.ma.man.ac.uk/~braconni/

1

This set F (A) is convex and when A is normal, i.e., AA = AA , it is the convex hull of the spectrum of A. Thus, the knowledge of F (A) allows the user to have a view of the non-normality of the matrix A. Pseudospectra are parameterized sets in the complex plane and can be de ned as follows. For  > 0, the -pseudospectrum of a n by n complex matrix A is the set (A) = f z : z 2 (A + A) for some A with kAk2   g: If the matrix A is normal then the -pseudospectrum is the union of the closed balls of radius  about each of the eigenvalues, otherwise it can be much larger than the union of these -balls. Once again, the knowledge of the sets (A) allows the user to visualize the non-normality of the matrix A. Braconnier and Higham (1995) have proposed a Lanczos type algorithm with Chebyshev acceleration to compute both sets. Their Matlab (see [Mat1990]) routines have been translated in Fortran. In section 2, we brie y recall the methods for computing both sets and the main ideas of the algorithm. Since both computations can be parallelized in a natural way, we have implemented a PVM (see Geist, Beguelin, Dongara and al. (1994)) version of both codes. After recalling some basic ideas of PVM in section 3, the parallel implementation of the codes is described in section 4. The sequential and PVM versions have been included in the package. In section 5, numerical experiments on homogeneous and heterogeneous clusters of workstations are given which show very good speed-ups and eciencies of the PVM implementations. A detailed description of the package is given in the Appendix at the end of this paper.

2

2 Computation of the Field of Values and Pseudospectra In this section we recall how to compute both sets and the main ideas of the method proposed by Braconnier and Higham (1995). Finally, we describe the test matrices used for the numerical experiments and show on this class of matrices what kind of information the eld of values and the pseudospectra can deliver.

2.1 The Field of Values Computation fv (θ1) max

θ1

θ2

fv (θ1) min

Figure 1: Field of values algorithm

We are interested in determining the boundary of F (A). The standard algorithm is based on the following observation (see Johnson (1978)). If we write A = AH + AK , where AH = (A + A )=2 is the Hermitian part of A and AK = (A ? A )=2 is the skew-Hermitian part of A, then, for any z 2 Cl n, z Az = z|  A{zH z} + z|  A{zK z} pure imaginary

real

which implies that z AH z =  (A ); min f Re(w) : w 2 F (A) g = min min H z6=0 z  z z AH z =  (A ); max f Re(w) : w 2 F (A) g = max max H z6=0 z  z 3

where min and max denote the algebraically smallest and largest eigenvalues, respectively, of a Hermitian matrix. Since F (ei A) = ei F (A), we can apply the same reasoning to the rotated matrix A() = ei A. Carrying out the computation for a range of  2 [0; ] we obtain an approximation to the boundary of the eld of value as illustrated by Figure 1. The standard algorithm can be written as

Algorithm 1

Choose the number of angles kmax For k = 0: kmax do 1. Set

i A + e?i A e k : k = k ; AH (k ) = 2 max 2. Compute the extreme eigenpairs of AH (k ): k

k

(min; xmin); (max; xmax): 3. Set



Axmin ; fvmin(k ) = xxmin  x fvmax(k ) =

min min max Axmax : xmaxxmax

x

Note that only the extreme eigenpairs of a Hermitian matrix have to be computed. If the matrix A is real then the boundary of the eld of values is symmetric about the x-axis and we can halve the computation, i.e., perform the computation for  2 [0; =2].

2.2 The Pseudospectra Computation An equivalent de nition of the -pseudospectrum, in terms of the resolvent (zI ? A)?1, is (A) = f z : k(zI ? A)?1k2  ?1 g = f z : min(zI ? A)   g; 4

where min denotes the smallest singular value (see Trefethen (1991). Thus, a natural approach is to impose a grid on the given region of the complex plane and to compute min(zI ? A) at each grid point z. A graphical plot is obtained by feeding the results to a contour or surface plotter. Instead of computing min(zI ? A) directly we can compute the smallest eigenvalue of the Hermitian matrix A(z) = (zI ? A)(zI ? A). Note that if the matrix A is real then the level curves are symmetric about the x-axis, since min(zI ? A) = min(zI ? A); this fact can be used to halve the computational work.

2.3 The Lanczos-Chebyshev Algorithm Let B be a n by n Hermitian matrix. Starting from an initial vector u, the Lanczos process1 applied to B iteratively builds an orthonormal basis Vm 2 Cl nm of the Krylov subspace Km(u; B ) = span(u; Bu; : : : ; B m?1u) and a tridiagonal matrix Tm 2 IRmm such that BVm = VmTm + tm+1;m vm+1 eTm ; (1) where ej denotes the j th column of the identity matrix. If Tm = S diag(i)S  is a spectral decomposition, with S = [s1; s2; : : : ; sm] unitary, then the Ritz pairs (i; Vmsi), i = 1: m, provide approximations to a desired number r  m  n of eigenpairs of B . To maintain the orthogonality of the Krylov basis Vm we use selective re-orthogonalization (see, e.g., Golub and Van Loan (1989)). This step ensures the backward stability of the Lanczos process, in an appropriately de ned sense (Braconnier (1995)). We use the Lanczos method in conjunction with Chebyshev acceleration to improve the speed of convergence to the desired eigenvalues (Saad (1984)). The Chebyshev ellipse reduces to an interval because B is Hermitian. The proposed algorithm can be summarized as follows. We use the terminology of Paige (1994) that distinguishes between the Lanczos process (construction of the Krylov basis) and the Lanczos method (the Lanczos process together with a method for computing the Ritz pairs from the resulting tridiagonal matrix). 1

5

Algorithm 2

Parameters: integers r (number of desired eigenpairs) and m (number of Lanczos steps), with r  m  n; Lanczos starting vector u; degree k of Chebyshev acceleration polynomial and starting vector z0 . 1. Perform m steps of the Lanczos process on B with selective reorthogonalization to compute Vm and Tm . 2. Compute the Ritz pairs (i; yi)i=1:m of B by applying the QR algorithm to Tm . 3. If the stopping criterion is satis ed then exit. 4. Build the Chebyshev interval. 5. Perform Chebyshev acceleration to obtain a better starting vector u for the Lanczos process; go to step 1. The choice of the parameters and of the stopping criterion are described in the Appendix.

2.4 Test Matrices The test matrices come from the discretization using centered nite di erences on a m  m mesh of the two-dimensional model convection-di usion problem

?u(x; y) + p1 rxu(x; y) + p2 ry u(x; y) ? p3u(x; y) on the unit square [0; 1]  [0; 1], with Dirichlet boundary conditions. p1 ; p2 and p3 are real numbers. These matrices are n by n non-symmetric matrices where n = m2. The eigenpairs are known explicitly (see Bai (1993)). Let h = 1=(m + 1), = p1h,

= p2 h and  = p3 h2. Then, the eigenvalues of the matrices are given by

kl = 4 ?  + 2

q

1 ? 2 cos

!

!

k + 2q1 ? 2 cos l : m+1 m+1

Suppose we take  = p1 = p2 = p3 2 . Then, the eigenvalues of the matrix A(n; ) satisfy the following properties 2

For such a choice, the spectrum lies in the right half plane.

6

p

1. if   ( n + 1) then all the eigenvalues are real and contained within the interval [0; 8], and 2. when the mesh-size h decreases, i.e., the size n increases, then the separation of all the eigenvalues decreases. Let us show on the test matrices A(100; ) for  = 1; 11; 100 what kind of information the computation of F (A) and (A) can give. For the eld of values computation, we took 512 equally spaced angles and for the pseudospectra computation the region of the complex plane of interest is

f z = a + ib : ?0:25  a  8:25; ?4:25  b  4:25 g; with a mesh-size equal to 1=64 in each direction. The given pseudospectra plots are log-scaled. For the eld of values plots, the eigenvalues are marked as `+'. 4

4

3

3

2

2

1

1

0

0

−1

−1

−2

−2

−3

−3

−4

−4 0

1

2

3

4

5

6

7

8

0

Figure 2: F (A(100; 1)).

1

2

3

4

5

6

7

8

Figure 3: F (A(100; 11)).

Figure 2 shows that for  = 1, the matrix A(100; 1) has a small non-normality: F (A(100; 1)) is close to the interval [0; 8]. For  = 11, the non-normality of the matrix increases because the boundary of F (A(100; 11)) is far apart from the spectrum which is real and lies in [0; 8], as illustrated by Figure 3. For  = 100, the boundary of F (A(100; 100)) encloses a region far from [0; 8] (see Figure 4) but it stays close to the spectrum. Thus, 7

40

30

20

10

0

−10

−20

−30

−40 −30

−20

−10

0

10

20

30

40

Figure 4: F (A(100; 100)).

p

1. when   ( n + 1) the matrix has a small non-normality,

p

2. when  is close to ( n + 1) the matrix A(n; ) becomes highly non-normal, and

p

3. when  increases and is much larger than ( n + 1) the non-normality of the matrix A(n; ) becomes smaller and smaller. 0.6

4

0.4

3

4

3

0.2 2

2 0

1

1 −0.2

0

−0.4

0

−1

−0.6

−1

−2

−0.8

−2

−1

−3

−3

−1.2 −4

−4 0

1

2

3

4

5

6

7

8

0

1

2

3

4

5

6

Figure 5:  = 1: z 7! log10 (k(zI ? A(100; ))?1 )k2 and (A(100; )).

8

7

8

4

0

4

3

−2

3

2

−4

2

1

−6

1

0

−8

0

−1

−10

−1

−2

−12

−3

−14

−4

−16 0

1

2

3

4

5

6

7

−2

−3

−4

8

0

1

2

3

4

5

6

7

8

7

8

Figure 6:  = 11: z 7! log10 (k(zI ? A(100; ))?1)k2 and  (A(100; )).

4

0.4

4

3

0.2

3

2

0

2

−0.2

1

1

−0.4 0

0 −0.6

−1

−1 −0.8

−2

−2 −1

−3

−1.2

−4

−1.4 0

1

2

3

4

5

6

7

8

−3

−4 0

1

2

3

4

5

6

Figure 7:  = 100: z 7! log10 (k(zI ? A(100; ))?1)k2 and  (A(100; )).

9

The contour lines in the right-hand plots of (A(n; )) represent 10 equally spaced values  in the range given in the corresponding left-hand plots of z 7! log10 (k(zI ? A(100; ))?1)k2. The pseudospectra of the matrices A(100; 1) tell us that the matrix is close to a normal one because the pseudospectra are close to the union of these -balls as shown in Figure 5. For  = 11, Figure 6 shows that the matrix is highly non-normal because the pseudospectra are much larger than the union of these -balls. And for  = 100, the matrix A(100; ) is more close to a normal one as shown in Figure 7. The pseudospectra analysis con rms the information given by the eld of values of the corresponding matrices.

3 Basic Ideas about PVM Workstation 1

Super Computer 1

Super Computer 2

Network : ethernet, FDDI, HiPPI . . .

Workstation 2

Workstation 3

Workstation 4

Figure 8: Example of a parallel virtual machine.

PVM is a public domain message passing library which is accessible via netlib. We give a short overview of this package. A more detailed description is given in Geist, Beguelin, Dongara and al. (1994). This package allows the user to run code (Fortran or C) as if he runs it on a parallel computer just using the available computers connected via a network. An example of possible parallel virtual machine is given in Figure 8. According to the 10

terminology of Geist, Beguelin, Dongara and al. (1994), each computer present in the parallel virtual machine is called a host. The main advantages of PVM are that 1. the users simply need to have access to a number of workstations and not necessarily to a massively parallel computer, 2. di erent types of computer with di erent types of arithmetic can be used, and 3. the package is easy to install and PVM codes are easy to write and run.

3.1 Environment Before compiling the libraries (one for C codes, one for Fortran codes and one for communication between one computer and a group of computers), the user has to set two variables in his .cshrc le: PVM ROOT which gives the path where the libraries reside (default: /pvm3) and PVM ARCH which sets the architecture of a given computer in the parallel virtual machine. A shell script is given in the package to automatically set the second variable. Then, the user has to compile the libraries for each type, i.e., architecture of computer to be used in the application.

3.2 Starting and Running PVM Codes There are two ways to start PVM: a manual way using the command pvmd3 hostname

and the more useful way is to start the PVM console using the command pvm hostfile

where `host le' is a le containing the names of the computers used for the application, the path where PVM daemon pvmd3 is (default: /pvm3/lib/pvmd3), the path where the executable les are (default ~/pvm3/bin/PVM ARCH) and some other options (relative speed of each host computer, the possibility to manually start one host : : :). 11

A daemon, i.e., a process, is started with an identi cation number for each host, which permits all of them to communicate. On each host, two les are created:



/tmp/pvmd. userid



/tmp/pvml. userid


which is related to the running daemon and


which contains information about the PVM session.

If we use the second way of starting PVM then a console interface is created. It allows the user to visualize the parallel virtual machine and to act on it (add/delete host : : :). To run a PVM code, the user has to put the needed executable les in each directory ~/pvm3/bin/PVM ARCH for each type of computer and can start the code on any host present in the parallel virtual machine.

4 Parallel Implementation of the Codes In this section, we show how the computations of the eld of values and pseudospectra are performed using PVM. The sequential codes use the idea of continuation (see Carpraux, Erhel, and Sadkane (1993), Lui (1995)). For the pseudospectra computation, this idea can be explained as follows. At each grid point we have to compute the smallest eigenvalue 1 of a Hermitian matrix. If, at a given grid point, we obtain 1 and a corresponding eigenvector x1 , then as the Lanczos starting vector for Algorithm 2 at an adjacent grid point we take u = x1 . For the eld of values computation no obvious way to use continuation leads to an easy parallel implementation. Thus, we have not used it: for each angle a random vector is used as starting vector for Algorithm 2. The parallel implementations are described below.

4.1 Parallel Field of Values Computation This computation is done using a master-slave scheme as described by Algorithm 3 for the master code and by Algorithm 4 for the slave code. Let u be the starting 12

vector for the Lanczos method, nh the number of hosts and pmax the chosen number of angles for the computation.

Algorithm 3 % Master Code % Initialize the PVM run Enroll PVM Spawn the slave code to the nh hosts % Send the needed data to the nh hosts Broadcast A; u; pmax % Send the rst nh angles to the slaves Do k = 1 : nh Send k ? 1 to host(k) k = nh % Receive any message from any host Do i = 1 : pmax Receive j; fvmin(j ); fvmax(j ) k =k+1 If k  pmax Send k to sending host else Send exit ag to sending host % Exit the PVM run Exit PVM

Algorithm 4 % Replicated Slave Code % Initialize the PVM run 13

Enroll PVM % Receive the needed data from the master Receive A; u; pmax % Receive angle number from the master 1 Receive p If p = exit ag goto 2 Set p = p=pmax Compute (min = max; xmin = max) of AH (p) = (eip A + e?ip A)=2 using Algorithm 2 Set fvmin(p) = xminAxmin=xminxmin Set fvmax(p) = xmaxAxmax=xmaxxmax % Send the results to the master Send p; fvmin(p); fvmax(p) goto 1 % Exit the PVM run 2 Exit PVM We can see on Algorithm 3 that the computation is performed using the task farming technique: the master is waiting for any message sent by any slave. The task farming allows the user to take into account that 1. the computation times are not necessarily the same for all angles, as they may vary with the matrix AH (), 2. jobs run by other users on a host can slow down its performance, and 3. one of the host computers can be much slower than the other ones. In the two last cases the slower host will perform less computations than the faster ones.

4.2 Parallel Pseudospectra Computation For the pseudospectra computation, we have used the continuation technique over the grid as follows. Each host computer performs the computation of a whole col14

host 1

host 1

host 2

xmin

1 0 0 1 xmin

u

xmin

Master

Master

Figure 9: continuation: step 1.

host 1

host 2

Figure 10: continuation: step 2.

host 3

host 1

host 2

host 3

xmin(3)

1 0 0 1 xmin(2)

xmin(2)

1 0 0 1

1 0 0 1

1 0 0 1

xmin

1 0 1 0

xmin

1 0 1 0

1 0 1 0

xmin

1 0 1 0

1 0 1 0

xmin xmin xmin

Master

wait for free host

Master

Figure 11: continuation: step 3.

Figure 12: continuation: step 4.

15

umn. Once, it has computed the min and the corresponding eigenvector xmin for the rst node, it sends xmin to the master who starts the computation for the next column in the grid on another host, if free, using xmin as starting vector (see Figures 10 and 12). The current host will then perform the computation of the next node of its column using the same xmin as starting vector and so on until the end of the column. Figures 9 to 12 illustrate this idea with 3 host computers. De ne 1. the number of hosts by nh, 2. the starting vector for the Lanczos method by u, 3. the abscissa of the nodes in the grid by xz, 4. the ordinates of the nodes in the grid by yz, 5. the number of rows by nrow , 6. the number of columns by ncol , and 7. the array containing the pseudospectra by  = f(i; j )gi=1:ncol; j=1:nrow . The master code is given by Algorithm 5 and the slave code is given by Algorithm 6.

Algorithm 5 % Master Code % Initialize the PVM run Enroll PVM Spawn the slave code to the nh hosts % Send the needed data to the nh hosts Broadcast A; yz; nrow 16

% Initialize the computation Send 1; u; xz(1) to the rst host k=1 % Receive any message from any host Do i = 1: 2ncol Receive task If task = 1 Receive u k =k+1 If free host If k  ncol Send k; u; xz(k) to free host else Send exit ag to free host else Receive j; (:; j ) If u is available If k  ncol Send k; u; xz(k) to sending host else Send exit ag to sending host % Exit the PVM run Exit PVM The integer task is set and sent to the master by the any host. It allows the master to know either if the host has sent the starting vector for the next column in the grid (task = 1) or if the host has sent the computed column (task = 2).

Algorithm 6 % Replicated Slave Code % Initialize the PVM run 17

Enroll PVM % Receive the needed data from the master Receive A; yz; nrow % Receive the index of the column from the master 1 Receive p If p = exit ag goto 2 % Receive starting vector and abscissa from the master Receive u; x(p) from master Do j = 1 : nrow If j = 2 Send 1; xmin(j ? 1) Set z = x(p) + iyz(j ) Compute (min(j ); xmin(j )) of A(z) = (zI ? A) (zI ? A) using Algorithm 2 % Send the results to the master Send 2; p; min(:) goto 1 % Exit the PVM run 2 Exit PVM Once again, the task farming technique is used.

5 Numerical Results Throughout this section, a cluster of workstations is said to be homogeneous if the workstations have a similar relative speed of computation, otherwise it is said to be heterogeneous. We de ne the speed-up by S = Tp=Ts and the eciency by E = 100Tp=p; where Tp is the running time on p workstations and Ts is the running time for the sequential Fortran code on one workstation. The given times are the average of the running time of ve consecutive runs of the corresponding code. In this section, we give numerical results in terms of speed-up and eciency when the PVM implementation of the package is run either on a homogeneous or a heterogeneous cluster of workstations and on a sparse matrix of moderate size. Results for 18

large matrix computations are given only for a homogeneous cluster of workstations. For all computations we have checked that the relative di erence between the results given by the sequential code and the PVM one is at the level of the machine precision.

5.1 Experiments on a Homogeneous Cluster We have run the PVM versions of the code on the test matrix A(100; 100). For the eld of values computation, we took 512 equally spaced angles and for the pseudospectra computation the region of the complex plane of interest is f z = a + ib : ?0:25  a  8:25; ?4:25  b  4:25 g; with a mesh-size equal to 1=64 in each direction. Two types of homogeneous clusters have been used: a cluster of 8 SunOS 4.2 workstations and a cluster of 3 DEC OSF-1 workstations. Table 1 (resp. Table 2) gives the results for the eld of values (resp. pseudospectra) computation. Computer ] of hosts Time (secs) Speed{up Eciency (%) 1 425:20 ?? ?? 2 211:11 2:01 100:5 4 108:29 3:93 98:25 SUN4 6 72:14 5:90 98:33 8 53:83 7:90 98:75 1 19:54 ?? ?? 2 9:92 1:97 98:50 ALPHA 3 6:51 2:99 99:67 Table 1: Field of values computation: results for homogeneous clusters.

We can see that the speed-ups and the eciencies are close to the ideal ones. This is due to the fact that 19

Computer ] of hosts Time (secs) Speed{up Eciency (%) 1 4798:82 ?? ?? 2 2401:66 2:00 100 SUN4 4 1225:50 3:92 98 6 830:49 5:78 96:33 8 627:60 7:65 95:62 1 207:33 ?? ?? ALPHA 2 101:63 2:04 102 3 71:87 2:88 96 Table 2: Pseudospectra computation: results for homogeneous clusters.

1. we use load balancing for both computations, 2. the communication time between the master and the slave codes is negligible compared with the computation time on each slave, and 3. all workstations either in the SUN4 or in the ALPHA cluster approximatively have similar computational speeds. On some experiments, the eciency is larger that 100. This is due to the loading of the given hosts in the parallel virtual machine. Thus, for both computations a PVM implementation is ecient on a homogeneous cluster of workstations.

5.2 Experiments on a Heterogeneous Cluster We have ran the same computations on a parallel virtual machine composed of 1. 8 SunOS 4.2 workstations, 2. 3 DEC OSP-1 workstations, 3. one Silicon Graphics IRIS 5.3 workstation and 20

4. one Sun4 Solaris 2.4 workstation. Table 3 (resp. Table 4) gives the results for the eld of values (resp. pseudospectra) computation. Master computer ] of hosts Time (secs) Speed{up Eciency (%) SUN4 1 425:20 ?? ?? 13 9:58 44:38 341:38 SUN4SOL2 1 358:55 ?? ?? 13 8:69 41:26 316:92 SGI5 1 249:37 ?? ?? 13 6:95 35:88 276 ALPHA 1 19:54 ?? ?? 13 8:00 2:45 18:85 Table 3: Field of values computation: results for heterogeneous cluster

We can see that for the eld of values computation the load balanced implementation is ecient. For example, the speed-up S = 2:45 when the sequential and the PVM code are run on a DECAlpha computer (which is the fastest available computer for this experiment) is not so bad compared with the number of DECAlpha computers , 3, in the heterogeneous cluster. For the pseudospectra computation, the results are not as good as expected. For example, the speed-up S = 1:15 when the sequential code and the master code are run on a DECAlpha computer is signi cantly less than the number of DECAlpha computers, 3, in the heterogeneous cluster. This is due to the fact that at the end of the computation, the DECAlpha computers nish their work and the master then has to wait for the slowest hosts, i.e., the Sun4 computers. In the worst case, the master has to wait for the computation of a whole column on the Sun4 which can take a lot of time compared with the same computation on a DECAlpha computer. 21

Master computer ] of hosts Time (secs) Speed{up Eciency (%) SUN4 1 4798:82 ?? ?? 13 176:64 27:17 209 13(nc) 97:54 49:20 378:45 1 2389:17 ?? ?? SUN4SOL2 13 189:15 12:63 97:15 13(nc) 96:18 24:84 191:08 SGI5 1 1199:90 ?? ?? 13 185:30 6:48 49:85 13(nc) 95:70 12:54 96:45 ALPHA 1 207:33 ?? ?? 13 179:55 1:15 8:85 13(nc) 90:36 2:30 17:65 Table 4: Pseudospectra computation: results for heterogeneous cluster

We can see the improvement that results from omitting the continuation on heterogeneous clusters (see the rows labeled 13(nc) in Table 4). The speed-ups are approximatively twice those when the continuation is used. In this case, each host performs the computation of the smallest singular value min for only one node. The speed-up when the PVM and the sequential code are run on a DECAlpha computer is closer to 3, i.e., the number of DECAlpha computers in the cluster. The strategy of not using continuation is more appropriate for heterogeneous clusters whereas using continuation is ecient for homogeneous clusters. Note that the strategy of not using continuation requires much more communication than if we use it. For a p  p grid, the PVM code without continuation needs 2p communications while the second approach requires p2 communications. Note also that the run times on the 13 hosts can be di erent when the master code is run on di erent computer. The PVM code which is run on 13 hosts with a 22

master code run on a DECAlpha is slower than the PVM code run on a homogeneous cluster of 3 DECAlpla (Table 2) because the other hosts in the heterogeneous cluster (i.e., the Sun4) have a relative speed of computation which is much slower than the DECAlpha computers.

5.3 Experiments on a Large Matrix For the eld of values computation we took the test matrix A(10000; 100) and 16 equally spaced angles. The computed eld of values is given in Figure 13. Note that p the spectrum of these matrix is real and lies in the interval [0; 8] because   ( n+1). 4 3 2 1 0 −1 −2 −3 −4 0

1

2

3

4

5

6

7

8

Figure 13: F (A(10000; 100)).

We ran the PVM code on a homogeneous cluster composed of 3 DEC OSP-1 workstations. The results are summarized in Table 5. For the pseudospectra computation we ran the PVM code on the test matrix A(2500; 100) to estimate the pseudospectra in the region of the complex plane

f z = a + ib : ?0:25  a  8:25; ?8:25  b  8:25 g; with a mesh-size equal to 1=24 in each direction. Figure 14 shows the computed p pseudospectra. The given plots are log-scaled. Note that since  > ( n + 1) the 23

Computer ] of hosts Time (secs) Speed{up Eciency (%) 1 4526:98 ?? ?? ALPHA 2 2297:96 1:97 98:50 3 1504:06 3:01 100:32 Table 5: Field of values computation: results for homogeneous clusters.

eigenvalues of the matrix A(2500; 100) are not a priori real. 8

0

6

−2

8

6

4

−4

4

2

−6

2

0

−8

0

−2

−10

−2

−4

−12

−4

−6

−14

−6

−16

−8

−8 0

1

2

3

4

5

6

7

8

0

1

2

3

4

5

6

7

8

Figure 14: z 7! log10 (k(zI ? A(2500; 100))?1)k2 and (A(2500; 100)).

The speed-up and eciencies are given in Table 6 for a homogeneous cluster of 3 DECAlpha workstations. For both computations, we can see that the speed-ups and the eciencies are very close to the ideal ones. Even for large matrices, a PVM version of both codes delivers a reliable answer in an acceptable computational time. Note that the pseudospectra in Figure 14 show the transition between the case p when  = ( n + 1) (see Figure 6), i.e., the matrix is highly non-normal, and when 24

Computer ] of hosts Time (secs) Speed{up Eciency (%) 1 6544:60 ?? ?? ALPHA 2 3290:00 1:99 99:46 3 2211:35 2:96 98:65 Table 6: Pseudospectra: results for homogeneous clusters.

p

  ( n + 1) (see Figure 7), i.e., the matrix becomes much more normal.

6 Conclusion The eciency and the reliability of using a Lanczos type method for computing the eld of values and the pseudospectra of a matrix compared with the reference algorithms, i.e., the QR method and the SVD decomposition are described in Braconnier and Higham (1995). The proposed package gives a Fortran sequential implementation of the ideas developed in the previous paper. Moreover, since both sets can be computed in a natural parallel way, the package gives PVM implementations of the sequential codes when the parallel virtual machine is composed of either homogeneous or heterogeneous computers. All the PVM versions have been shown to give very good speed-ups and eciencies on both types of clusters of computers. Experiments on large sparse matrices show that both computations can be performed in an acceptable computational time using the PVM implementation of the proposed method. The proposed package based on the Fortran and PVM routines used for the experiments is reliable and ecient for computing the eld of values and the pseudospectra of large sparse matrices.

25

Acknowledgments The author wants to thank L. Giraud and L. Carvalho (CERFACS) for useful discussions and details about the use of the PVM package. He also wants to thank Z. Bai (University of Kentucky) for providing him with the test matrices.

References Z. Bai, (1993), A collection of test matrices for the large scale nonsymmetric eigen-

value problem, Tech. Rep., University of Kentucky.

M. Bennani and T. Braconnier, (1994), Stopping criteria for eigensolvers,

Tech. Rep. TR/PA/94/22, CERFACS.

T. Braconnier and N. J. Higham, (1995), Computing the Field of Values and

Pseudospectra Using the Lanczos Method with Continuation, Numerical Analysis Report No. 279, Manchester Centre for Computational Mathematics. To appear in BIT.

T. Braconnier, (1995), In uence of the Orthogonality on the Backward Error and

the Stopping Criterion for Krylov Methods, Numerical Analysis Report No. 281, Manchester Centre for Computational Mathematics. Submitted to BIT.

J. Carpraux, J. Erhel, and M. Sadkane, (1993), Spectral Portrait for non

Hermitian Large Sparse Matrices, Tech. Rep. 777, INRIA/IRISA, unite de recherche INRIA Rennes.

V. Fraysse, L. Giraud, and V. Toumazou, (1996), Parallel Computation of

Spectral Portraits on the Meiko CS2, Tech. Rep. TR/PA/96/02, CERFACS. To appear in HPCN96 proceedings.

A. Geist, A. Beguelin, J. Dongara, W. Jiang, R. Manchek, and V. Sunderam, (1994), PVM: A Users' Guide and Tutorial for

26

Networked Parallel Computing, The MIT Press, Cambridge, Mas-

sachusetts.

G. Golub and C. Van Loan, (1989),

Matrix computations, Johns Hopkins

University Press. Second edition.

W. W. Hager, (1984), Condition estimates, SIAM J. Sci. Stat. Comput., 5, 311{

316.

N. J. Higham, (1990), Experience with a matrix norm estimator, SIAM J. Sci.

Stat. Comput., 11, 804{809.

C. R. Johnson, (1978), Numerical Determination of the Field of Values of a Gen-

eral Complex Matrix, SIAM J. Num. Anal., 15(3), 595{602.

S. H. Lui, (1995), Computation of Pseudospectra by Continuation, SIAM J. Sci.

Comput. To appear.

[Mat1990] The Math Works Inc., (1990), Pro - Matlab User's Guide for SUN workstations. Matlab Version 4.0a. C. C. Paige, (1994), Krylov subspace processes, Krylov subspace methods, and

iteration polynomials., in Proceedings of the Cornelius Lanczos International Centenary Conference, J. D. Brown, M. T. Chu, D. C. Ellison, and R. J. Plemmons, eds., Philadelphia, SIAM, 83{92.

B. Parlett, (1980), The Symmetric Eigenvalue Problem, Prentice-Hall, En-

glewood Cli s.

J. Rigal and J. Gaches, (1967), On the compatiblity of a given solution with the

data of a linear system, J. Assoc. Comput. Mach., 14, 3, 543{526.

Y. Saad, (1984), Chebyshev Acceleration Techniques for Solving Nonsymmetric

Eigenvalue Problems, Math. Comp., 42, 166, 567{588.

Y. Saad, (1992), Numerical Methods for Large Eigenvalue Problems, Algo-

rithms and Architectures for Advanced Scienti c Computing, Manchester University Press, Manchester, U.K. 27

L. N. Trefethen, (1991), Pseudospectra of matrices, in

14th Dundee Bien-

nal Conference on Numerical Analysis, D. F. Griths and G. A. Watson, eds.

A Appendix The package fvpspack is organized as follows

lib doc

field sequen pseudo matlab

examples data field pvm pseudo src

The directory src contains the Fortran source les needed for both computations. The directory lib will contain the compiled library libfvpspack.a. The directory doc contains this paper and a description of each routine contained in src. The directory examples contains 28

1. The subdirectory matlab which contains Matlab post-processing routines in order to visualize the computed eld of values or pseudospectra; and two drivers for starting the sequential and the PVM code directly from a Matlab interface, visualizing the resulting set and comparing the obtained speed-up. 2. The subdirectory data which contains Fortran routines for building the test matrices and some other tools. 3. the subdirectory sequen which contains drivers for the eld of values (subdirectory eld) and for the pseudospectra computation (subdirectory pseudo) to be run in a sequential way. 4. The subdirectory pvm which contains drivers for the eld of values (subdirectory eld) and for the pseudospectra computation (subdirectory pseudo) using the PVM package. In the next sections, we describe the speci cities of the package and the parameters used for the Lanczos-Chebyshev algorithm. The same speci cities and parameters are used for the sequential and PVM codes. Details are given only for the sequential version of the codes.

A.1 Speci cities of the Package The eld of values and pseudospectra codes use reverse communication to access to matrix. Because we are using a Lanczos type method, for the eld of values and for each angle , we need to perform the computation y 21 (ei A + e?i A)x where x is the input vector and y the output one. The user is asked to perform the matrix-vector products z1 Ax and z2 A x outside the main routine foval.f as follows

Algorithm 7 29

% Initialize the reverse communication parameter revcom = 0 10 continue call foval (parameters; revcom; vrcom) if revcom  0 call Atimesx (parameters; vrcom(:; 1); vrcom(:; 2)) If revcom = 2 call Attimesx (parameters; vrcom(:; 1); vrcom(:; 3)) go to 10 else Exit By this way, the user can store the matrix using any storage scheme and use fast sparse matrix-vector products (routine Atimesx computes vrcom(: ; 2) Avrcom(: ; 1) and routine Attimesx computes vrcom(: ; 3) A vrcom(: ; 1)). In Algorithm 7, when revcom = 1 then only z1 Ax is asked by the routine foval.f. This is the case when the minimum and maximum Rayleigh quotients fvmin and fvmax are computed from the eigenvectors. The integer parameter revcom and the multiplication by ei =2 and e?i =2 are dynamically set by the routine foval.f. The same technique is used for the pseudospectra computation. But, since we use a shift and invert technique around the 0 shift, we need to perform the computation y (A ? zI )? (A ? zI )?1 x where x is the input vector and y the output one. The user is asked to solve (A ? zI )z1 = x and (A ? zI ) y = z1 outside the main routine pseud.f as follows

Algorithm 8 % Initialize the reverse communication parameter revcom = 0 30

info(3) = 0 10 continue call pseudo (parameters; revcom; vrcom; info) If revcom  0 If info(3) = 0 Up date the diagonal of the matrix Factorize the matrix or compute the preconditioner If the matrix is singular info(3) = 2 call solveA (parameters; vrcom(:; 1); vrcom(:; 3)) call solveAt (parameters; vrcom(:; 3); vrcom(:; 2)) go to 10 else Exit By this way, the user can store the matrix using any storage scheme and use fast sparse linear solvers (routine solveA solves Avrcom(: ; 3) = vrcom(: ; 1) and routine solveAt solves A vrcom(: ; 2) = vrcom(: ; 3)). In Algorithm 8, the parameter info(3) allows the user to perform only one factorization of the matrix A ? zI per node if he uses a direct solver or to compute the preconditioner just once if he uses an iterative linear solver. If the matrix is detected to be singular (for example, a pivot in the LU factorization is close to 0) then this parameter has to be set to 2. It avoids performing the Lanczos-Chebyshev method because 0 is then the smallest singular value. Note that it is the user's responsibility to use a backward stable linear solver, i.e., one which produces a relative residual kAx~ ? bk kAkkx~k + kbk at the level of machine precision for the computed solution x~. For the eld of values computation, the user has to set 3 input parameters: 1. info(1) : which determines if the matrix A is real or not in order to halve the computation, 31

2. info(2) : which determines if the user provides an initial vector (a good approximation of a vector lying in the invariant subspace associated with min at the rst node of the grid, for example), otherwise a random starting vector is used, and 3. themax: the number of angles (if it is set to 0 then, by default, the computation is performed on 16 angles). For the pseudospectra computation, the user has to set 11 input parameters: 1. info(1) : (the same as for the eld of values computation), 2. info(2) : (the same as for the eld of values computation), 3. default(1) : if set to 0 then no geometric limits of the region of the complex plane are given (see normA), 4. default(2) : if set to 0 then by default, the number of nodes on the grid is set to 16 in each direction, 5. limit(1: 4) : the geometric limits [xmin; xmax; ymin; ymax] of the region of the complex plane (not referenced if default(1) = 0), 6. normA: an estimate of kAk. If default(1) = 0 then the geometric limits of the region of the complex plane are set to [-normA, normA, -normA, normA]. In the directory data, a routine is given to compute such an estimate using Higham's modi cation of Hager's algorithm (see Hager (1984), Higham (1990)). It requires the matrix vector products y Ax and y A x, and 7. npx, npy: the number of nodes in each direction (not referenced if default(2) = 0). Thus, for both computations default values are provide. Error checking on each entry is done at the beginning of the computation.

A.2 Parameters for the Lanczos-Chebyshev Algorithm 32

In this section, we discuss the parameters needed for the Lanczos-Chebyshev algorithm and our choice of the stopping criterion. Choice of m size of the projection: The Kaniel{Saad theory suggests that we take m p close to n for the classical Lanczos method (see Parlett (1980)). Thus, the choice we have made is

 for the eld of values computation: if n < 25 then m = 5, if 25  n  400 p p then m = n and for n > 400 we take m = n=2,  for pseudospectra computation: if n < 100 then m = 5 and if n > 100 then p m = n=2. Choice of the degree of the Chebyshev polynomial: We have used a dynamic choice of this parameter (see Saad (1992)). Let ai be the horizontal semi-axis of the Chebyshev interval passing through i, M the machine precision, and suppose we want to compute 1. Then, the degree of the Chebyshev polynomial is set to

p ) log ( 10 k = log (max M ai ) : i6=1 a1 10

We have xed an upper bound kmax in order to avoid to compute polynomials of larger degree: kmax = 25 if n < 400, kmax = 100 otherwise. Choice of the stopping criterion: We de ne the normwise backward error associated with the approximate eigenpair (; x) of the matrix B to be

 = minf  > 0 : kB k2  kB k2 ; (B + B )x = x g: From a standard result on linear system backward errors (see Rigal and Gaches (1967)) or via a simple computation, we nd that  = kBkkrkk2xk ; 2 2 where r = Bx ? x. For any Ritz vector yi and its corresponding Ritz value i we can compute  without forming r, because we nd that krk2 = jtm+1;m jjsmij, 33

where Tm = S diag(i)S  is a spectral decomposition. As suggested by Bennani and Braconnier (1994), our stopping criterion for the Lanczos algorithm is that the \Lanczos backward error" L is less than or equal to the unit roundo , where L = jtm+1kB;mkjjsmij 2 and the ith Ritz pair approximates the eigenpair of interest. In forming L, we approximate kB k2 by maxi jij, which is a lower bound for kB k2 (Golub and Van Loan (1989)). For the eld of values computation, we are interested in the eigenvector (the boundary is de ned by the Rayleigh quotients built on the eigenvectors of the smallest and largest eigenvalue). For the pseudospectra computation, we are interested in the eigenvalue and a good approximation of the eigenvector in order to use the continuation idea. We recall the result (see Parlett (1980), p. 222) that if B is Hermitian then, for any vector x and de ned by = x Bx=x x, we have

krk22 min j  ( B ) ? j  i i kxk 2

where r = Bx ? x and = minj6=i ji(B ) ? j (B )j. Because the Lanczos algorithm is an orthogonal projection method, if we choose x as an eigenvector then the Rayleigh quotient is the associated computed Ritz value of B . Therefore, our stopping criterion for the Lanczos algorithm applied to the pseudospectra is that if is not too small (greater than or equal to 10?2 in the code) then L2 = is less than or equal to the unit roundo , otherwise L is less than or equal to the unit roundo . Note that the gap is directly computed using the Ritz values of the matrix B .

34