Algorithms and Library Software for Periodic and Parallel Eigenvalue Reordering and Sylvester-Type Matrix Equations with Condition Estimation

Robert Granat

PhD Thesis, November 2007

Department of Computing Science and HPC2N, Umeå University

Department of Computing Science
Umeå University
SE-901 87 Umeå, Sweden
[email protected]

Copyright © 2007 by Robert Granat
Except Paper I, © SIAM Journal on Matrix Analysis and Applications, 2006
Paper II, © Springer Netherlands, 2007
Papers III–IV, © Robert Granat and Bo Kågström, 2007
Paper V, © Robert Granat, Bo Kågström and Daniel Kressner, 2007
Cover, © Björn Granat, 2007

ISBN 978-91-7264-410-6
ISSN 0348-0542
UMINF-07.21
Printed by Print & Media, Umeå University, 2007:2003806

Abstract

This Thesis contains contributions in two different but closely related subfields of Scientific and Parallel Computing which arise in the context of various eigenvalue problems: periodic and parallel eigenvalue reordering, and parallel algorithms for Sylvester-type matrix equations with applications in condition estimation.

Many real-world phenomena behave periodically, e.g., helicopter rotors, revolving satellites and dynamic systems corresponding to natural processes, like the water flow in a system of connected lakes, and can be described in terms of periodic eigenvalue problems. Typically, eigenvalues and invariant subspaces (or, specifically, eigenvectors) of certain periodic matrix products are of interest and have direct physical interpretations. The eigenvalues of a matrix product can be computed without forming the product explicitly via variants of the periodic Schur decomposition.

In the first part of the Thesis, we propose direct methods for eigenvalue reordering in the periodic standard and generalized real Schur forms which extend earlier work on the standard and generalized eigenvalue problems. The core step of the methods consists of solving periodic Sylvester-type equations to high accuracy. Periodic eigenvalue reordering is vital in the computation of periodic eigenspaces corresponding to specified spectra. The proposed direct reordering methods rely on orthogonal transformations and can be generalized to more general periodic matrix products where the factors have varying dimensions and ±1 exponents of arbitrary order.

In the second part, we consider Sylvester-type matrix equations, like the continuous-time Sylvester equation AX − XB = C, where A of size m × m, B of size n × n, and C of size m × n are general matrices with real entries, which have applications in many areas. Examples include eigenvalue problems and condition estimation, and several problems in control system design and analysis.
The parallel algorithms presented are based on the well-known Bartels–Stewart method and extend earlier work on triangular Sylvester-type matrix equations, resulting in a novel software library, SCASY. The parallel library provides robust and scalable software for solving 44 sign and transpose variants of eight common Sylvester-type matrix equations. SCASY also includes a parallel condition estimator associated with each matrix equation.

In the last part of the Thesis, we propose parallel variants of the direct eigenvalue reordering method for the standard and generalized real Schur forms. Together with the existing and future parallel implementations of the non-symmetric QR/QZ algorithms and the parallel Sylvester solvers presented in the Thesis, the developed software can be used for parallel computation of invariant and deflating subspaces corresponding to specified spectra and associated reciprocal condition number estimates.

Preface

This PhD Thesis consists of the following five papers and an introduction including a summary of the papers.

Paper I

R. Granat and B. Kågström. Direct Eigenvalue Reordering in a Product of Matrices in Periodic Schur Form.¹ From Technical Report UMINF-05.05, Department of Computing Science, Umeå University. Published in SIAM J. Matrix Anal. Appl. 28(1), 285–300, 2006.

Paper II

R. Granat, B. Kågström, and D. Kressner. Computing Periodic Deflating Subspaces Associated with a Specified Set of Eigenvalues.² From Technical Report UMINF-06.29, Department of Computing Science, Umeå University. Accepted for publication in BIT Numerical Mathematics, June 2007.

Paper III

R. Granat and B. Kågström. Parallel Solvers for Sylvester-type Matrix Equations with Applications in Condition Estimation, Part I: Theory and Algorithms. From Technical Report UMINF-07.15, Department of Computing Science, Umeå University. Submitted to ACM Transactions on Mathematical Software, July 2007.

Paper IV

R. Granat and B. Kågström. Parallel Solvers for Sylvester-type Matrix Equations with Applications in Condition Estimation, Part II: The SCASY Software Library. From Technical Report UMINF-07.16, Department of Computing Science, Umeå University. Submitted to ACM Transactions on Mathematical Software, July 2007.

Paper V

R. Granat, B. Kågström, and D. Kressner. Parallel Eigenvalue Reordering in Real Schur Forms. From Technical Report UMINF-07.20, Department of Computing Science, Umeå University. Submitted to Concurrency and Computation: Practice and Experience, September 2007. Also published as LAPACK Working Note #192.

¹ Reprinted by permission of SIAM Journal on Matrix Analysis and Applications.
² Reprinted by permission of Springer Netherlands.


The papers concern periodic and parallel algorithms and software for eigenvalue reordering, Sylvester-type matrix equations and condition estimation.


Acknowledgements

First of all, I thank my supervisor Professor Bo Kågström, co-author of all papers in this contribution, for the past 5+ years of cooperation. Thanks for your guidance, your true commitment to our common projects and your care about me as a student. Bo also read earlier versions of this manuscript and gave many constructive comments, for which I am grateful.

Next, I want to send big thanks to Dr Isak Jonsson, my assistant supervisor, especially for the assistance with the linear algebra kernel implementations.

Thanks to Dr Daniel Kressner, co-author of two of the papers in this Thesis, for fruitful collaboration. I am looking forward to visiting you, Ana and your daughter Mara in Switzerland. Also, thanks for constructive comments on earlier versions of this manuscript.

Thanks to Dr.-Ing. Andras Varga for valuable discussions on periodic systems and related applications.

For the last years I have enjoyed the company of Lars Karlsson, who started the PhD programme last year. Thank you and good luck in your future thesis work. And watch out for antique programming models!

Very special thanks to Pedher Johansson for the superior LaTeX templates you provided, and many thanks to my younger brother Björn ("I don't do photos, only pictures...") for all assistance with the cover.

Thanks to all other friends and colleagues at the Department of Computing Science, especially the members of the Numerical Linear Algebra and Parallel and High-Performance Computing Groups. Thanks to the staff at HPC2N (High Performance Computing Center North) for providing a great computing environment and superior technical support. Sorry guys, there will be no more cakes after today. By the way, it was not my fault!

I wish to thank my family Eva, Elias and Anna for enriching my life in so many ways. I love you with all of my heart! Also many thanks to my parents Ann and Ulf, my two other younger brothers Arvid ("Thanks, but no thanks!") and Lars ("Coolt!"), and my grandparents Margareta and Sigvard for all your encouragement. I also want to thank my mother-in-law, Gun-Brith, for the greatest moose stew in the world!

Financial support was provided jointly by the Faculty of Science and Technology, Umeå University, by the Swedish Research Council under grant VR 621-2001-3284, and by the Swedish Foundation for Strategic Research under the frame program grant A3 02:128.

Finally, without faith in the Lord Jesus Christ, I should never have been able to complete this Thesis in the first place. I owe Him everything and dedicate my work to the glory of His name. The Lord is my Shepherd; I shall not want. (Psalms 23:1)

Umeå, November 2007
Robert Granat


Contents

1 Introduction 1
  1.1 Motivation for this work 1
  1.2 Parallel computations, computers and programming models 2
  1.3 Matrix computations in CACSD and periodic systems 5
  1.4 Standard and periodic eigenvalue problems 7
  1.5 Sylvester-type matrix equations 13
  1.6 High performance linear algebra software libraries 17
  1.7 The need for improved algorithms and library software 19

2 Contributions in this Thesis 21
  2.1 Paper I 21
  2.2 Paper II 22
  2.3 Paper III 23
  2.4 Paper IV 24
  2.5 Paper V 25

3 Ongoing and future work 29
  3.1 Software tools for periodic eigenvalue problems 29
  3.2 Frobenius norm-based parallel estimators 30
  3.3 Parallel periodic Sylvester-solvers 30
  3.4 New parallel versions of real Schur decomposition algorithms 31
  3.5 Parallel periodic eigenvalue reordering 31

4 Related papers and project websites 33

Paper I 47
Paper II 67
Paper III 101
Paper IV 135
Paper V 155

Chapter 1

Introduction

This chapter motivates and introduces the work presented in this Thesis and gives some background to the topics considered.

1.1 Motivation for this work

The growing demand for high-quality and high-performance library software is driven by the technical development of new computer architectures as well as the increasing need from industry, research facilities and communities to solve larger and more complex problems faster than ever before. Often the problems considered are so complex that they cannot be solved by ordinary desktop computers in a reasonable amount of time. Scientists and engineers are more and more forced to utilize high-end computational devices, like specialized high-performance computational units from vendor-constructed shared or distributed memory parallel computers, or self-made so-called cluster-based systems consisting of high-end commodity PC processors connected with high-speed networks. High-performance computing clusters, even with a relatively small number of processor nodes, are becoming very common due to their good scalability properties and high cost-effectiveness, i.e., a relatively low cost per performance unit compared to the more specialized and highly complex supercomputer systems. In fact, any local area network (LAN) connecting a number of workstations can be considered a cluster-based system. However, the latter systems are mainly used for high-throughput computing applications (see, e.g., the Condor project [29]), while clusters with specialized high-speed networks are used for challenging parallel computations.

To solve a problem in a reliable and efficient way, many considerations beyond the application itself must be made regarding the solution method, discretization, data distribution and granularity, the expected and achieved accuracy of computed results, and how to utilize the available computer power in the best way: would it be beneficial to use a parallel computer or not, does the state-of-the-art algorithm at hand match the memory



hierarchy of the target computer system well enough, etc. Typically, an appropriate and efficient usage of high-performance computing (HPC) systems, like parallel computers, calls for non-trivial reformulations of the problem settings and the development of novel, highly efficient and scalable algorithms which are able to match the properties of a wide range of target computer platforms. Therefore, scientists and researchers, engineers and programmers can save a lot of time and effort by utilizing extensively tested, highly efficient, robust and reliable high-quality software libraries as basic building blocks in solving their computational problems. By this procedure, much more attention can be focused on the applications and related theory.

Most problems in the real world are non-linear, i.e., the output from a phenomenon or a process does not depend linearly on the input. Moreover, most problems are also continuous, i.e., the output depends continuously on the input. Roughly speaking, the graph of a continuous function can be painted without ever lifting the pen from the paper. Very few real-world problems can be solved analytically, that is, by finding an exact mathematical expression that fully describes the relation between the input and the output of the process or phenomenon. Typically, we are forced to linearize and discretize the problems to make them solvable in a finite amount of time using a finite amount of computational resources (such as computing devices, data storage, network bandwidth, etc.). This means that the computed solution will always be a more or less accurate approximation. The good thing is that, through linearization and discretization, many problems can be solved effectively by standard linear algebra methods.

Numerical linear algebra studies systems of linear equations, eigenvalue problems, optimization problems and related solution methods, i.e., matrix computations. The focus is on reliable and efficient algorithms for large-scale matrix computational problems from the point of view of using finite-precision arithmetic. Developments in the area of numerical linear algebra often result in widely available public-domain software libraries, like LAPACK, SLICOT and ScaLAPACK (see Section 1.6).

1.2 Parallel computations, computers and programming models

The basic idea behind parallel computations is to solve larger computational problems faster. Larger, in the sense that the main memory in an ordinary workstation or laptop is not enough to store all data structures associated with a very large computational problem; to have enough storage space we might need to use p times as much main memory. Faster, by using more than one (say p) processing units (usually corresponding to a single processor or core) concurrently. Consequently, by using p processing units equipped with p (common or individual) storage modules, we may



solve a problem that is up to p times larger (in the sense of the storage requirements) up to p times faster. This is the good news. The bad news is that it is often required that the problem at hand and existing serial solution methods or algorithms are reformulated in terms of parallel tasks that can be computed concurrently, each making up a small piece of the total solution. This can be very hard, especially when the problem at hand has a very strong sequential nature; one often mentioned simple example is to compute the Fibonacci numbers

F(n) := { 0, if n = 0;  1, if n = 1;  F(n−1) + F(n−2), if n > 1. }   (1.1)

As can be seen from the recursive formula above, no numbers in the sequence can be computed in parallel because of their strong internal dependency. Fortunately, most real-world computational problems can be reformulated such that they can be solved by parallel computations. Some problems are indeed embarrassingly parallel and require no synchronization or cooperation between the involved processing units. Other problems can be parallelized at the cost of some explicit cooperation and synchronization among the processing units. To be able to use computer resources effectively, any parallel algorithm should strive to minimize synchronization costs in relation to the amount of arithmetical work performed.
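The contrast can be illustrated with a small Python sketch (illustrative only, not from the Thesis): the Fibonacci recurrence forces strictly sequential evaluation, while an elementwise map over independent data is embarrassingly parallel and can be farmed out to a pool of workers.

```python
from concurrent.futures import ThreadPoolExecutor

def fib(n):
    # Inherently sequential: F(n) cannot start before F(n-1) and F(n-2)
    # are known, so the loop iterations must run one after another.
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def work(x):
    # Independent per-element task: no element depends on any other,
    # so the map below is embarrassingly parallel.
    return x * x

serial = [work(x) for x in range(16)]
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(work, range(16)))

assert parallel == serial  # same result, computed concurrently
```

Here threads share one address space, corresponding to the SM model discussed below; replacing the thread pool with a process pool would correspond to DM-style workers with separate memories.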

FIGURE 1: A concept view of a shared memory architecture [101].

For many decades, the two main streams of parallel computers have been shared memory machines (or multiprocessors) and distributed memory machines (or multicomputers). Shared memory machines are typically made up of several high-end processors


having their own respective local cache memories and sharing a number of large main memory modules accessed through a common bus, a multi-stage network or a crossbar network, see Figure 1. Distributed memory machines are in principle a number of stand-alone computers with their own local caches and main memory modules, connected to each other via a network of a certain topology, see Figure 2. The most common topology today is the ring-based k-dimensional torus. For example, for k = 3 this corresponds to a three-dimensional mesh with wrap-around connections, see Figure 3.

FIGURE 2: A concept view of a distributed memory architecture [89].

These two types of parallel computers are (mainly) programmed using two distinct classes of programming models, quite naturally called the shared memory (SM) model and the distributed memory (DM) model. In the SM model, the focus is more on fine-grained parallelism, e.g., parallelizing loops by assigning loop indices to specific threads executing in the shared memory environment. This programming model is very useful in applications with massive data parallelism. In the DM model, the focus is set more towards parallel execution of disjoint (and, mostly, partly connected) tasks and explicit synchronization through communication operations taking the underlying network topology into account. Notice that the latter model can be simulated in the SM model. For example, DM point-to-point communication can be simulated by copying a specific area of the memory which contains a message (using, e.g., the C/C++ memcpy function) into an area reserved for the receiving process. Such an approach may be beneficial when porting existing, or simplifying new, implementations of well-known DM algorithm variants to SM environments. Hybrid variants of the mentioned mainstream parallel computers also exist, in which programming models from both paradigms can be combined. For an example, see the 64-bit Linux cluster sarek in Paper IV of this Thesis. One of the challenges today is how to incorporate multicore processors, e.g., dual-cores and quad-cores, and the future manycore processors in the parallel programming


FIGURE 3: A square mesh of processing units (nodes) with wrap-around connections [21].

models. For now, many multicore processors can be programmed using programming techniques that have evolved in the SM model, e.g., heavy processes (using fork in Unix/Linux) and lightweight processes (OpenMP [93] or POSIX threads; see, e.g., [97]). The natural motivation is that the cores in most cases share all or parts of the processor's memory hierarchy (from L1 cache to main memory). With the upcoming evolution of manycores [7], which raise the need for local memory modules connected to each individual core and an on-chip inter-core network (e.g., the local storage (LS) modules and the ring network in IBM's Cell processor), the picture becomes more complex and unclear. However, our feeling is that future programming techniques for manycore environments will borrow much from DM programming, with explicit point-to-point communication between the cores over the local network, global communication operations, and execution of standard, highly mature and well-known distributed memory algorithms. Notice that some or all of these DM-like features may be hidden from ordinary users behind software libraries or directly in the hardware. Nevertheless, a future scenario like this is very interesting from the point of view that users may execute real parallel algorithms and get true supercomputer performance and speedup on their manycore-based laptops! Therefore, we expect parallel computing to become an even hotter area in the future. For a good introduction to parallel models, computers and algorithms see, e.g., the standard textbook [48].
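The idea of simulating DM-style point-to-point communication inside a shared memory environment can be sketched in a few lines of Python (an illustration of the concept, not code from the Thesis): each "process" is a thread, and a send amounts to copying the message into a buffer owned by the receiver.

```python
import threading
import queue

# One inbox per simulated "process"; a send copies the message into the
# receiver's buffer, mimicking DM point-to-point communication on top of
# shared memory.
inboxes = {0: queue.Queue(), 1: queue.Queue()}

def send(dest, msg):
    inboxes[dest].put(list(msg))   # copy, so sender and receiver share no state

def recv(rank):
    return inboxes[rank].get()     # blocks until a message arrives

received = {}

def worker(rank):
    if rank == 0:
        send(1, [1.0, 2.0, 3.0])   # "process" 0 sends a vector to "process" 1
    else:
        received[rank] = recv(rank)

threads = [threading.Thread(target=worker, args=(r,)) for r in (0, 1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The copy in `send` plays the role of the memcpy mentioned above: after the transfer, the receiver owns an independent replica of the message, just as it would after a real DM network transfer.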

1.3 Matrix computations in CACSD and periodic systems

Matrix computations are fundamental to many areas of science and engineering and occur frequently in a variety of applications, for example in Computer-Aided Control System Design (CACSD). In CACSD, various linear control systems are considered, like the following linear continuous-time descriptor system



E ẋ(t) = A x(t) + B u(t)
y(t) = C x(t) + D u(t),   (1.2)

or a similar discrete-time system of the form

E x_{k+1} = A x_k + B u_k
y_k = C x_k + D u_k,   (1.3)

where x(t), x_k ∈ R^n are state vectors, u(t), u_k ∈ R^m are the vectors of inputs (or controls) and y(t), y_k ∈ R^r are the vectors of outputs. The systems are described by the state matrix pair (A, E) ∈ R^{(n×n)×2}, the input matrix B ∈ R^{n×m}, the output matrix C ∈ R^{r×n} and the feed-forward matrix D ∈ R^{r×m}. The matrix E is possibly singular. With E = I, where I is the identity matrix of order n, standard state-space systems are considered. Other subsystems, described by the tuples (E, A, B) and (E, A, C), which arise from the system pencil corresponding to Equations (1.2)–(1.3):

    S(λ) = [ A  B ] − λ [ E  0 ]
           [ C  D ]     [ 0  0 ],

are studied when investigating the controllability and observability characteristics of a system (see, e.g., [36, 64]).

Applications with periodic behavior, e.g., rotating helicopter blades and revolving satellites, can be described by discrete-time periodic descriptor systems of the form

E_k x_{k+1} = A_k x_k + B_k u_k
y_k = C_k x_k + D_k u_k,   (1.4)

where the matrices A_k, E_k ∈ R^{n×n}, B_k ∈ R^{n×m}, C_k ∈ R^{r×n} and D_k ∈ R^{r×m} are periodic with periodicity K ≥ 1. For example, this means that A_K = A_0, A_{K+1} = A_1, etc. Important problems studied in CACSD include state-space realization, minimal realization, linear-quadratic (LQ) optimal control, pole assignment, distance to controllability and observability considerations, etc. For details see, e.g., [91].

The systems (1.2)–(1.4) can be studied by various matrix computational approaches, e.g., by solving related eigenvalue and subspace problems. In this area, improved algorithms and software are developed for computing and investigating different subspaces, e.g., condition estimation of invariant or deflating subspaces [73, 75, 74], solving various important matrix equations, like (periodic) Sylvester-type and (periodic) Riccati matrix equations, and computing canonical structure information [32, 33, 37, 38, 64, 63]. One common step in computing such structure information is the need to separate the stable¹ and unstable eigenvalues by an eigenvalue reordering technique (see, e.g., [9, 103, 57, 109, 108]).
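As a concrete (hypothetical) instance of the discrete-time recursion (1.3) with E = I, the following NumPy sketch simulates a stable scalar system under a constant input; the state settles at the steady state x* = (I − A)^{−1} B u. The system matrices are made up purely for illustration.

```python
import numpy as np

# Hypothetical scalar system x_{k+1} = A x_k + B u_k, y_k = C x_k + D u_k
# (with E = I), chosen only to illustrate the recursion in (1.3).
A = np.array([[0.5]])
B = np.array([[1.0]])
C = np.array([[1.0]])
D = np.array([[0.0]])

u = np.array([1.0])        # constant input u_k = 1
x = np.zeros(1)
for k in range(100):
    y = C @ x + D @ u      # output equation
    x = A @ x + B @ u      # state update

# Steady state from (I - A) x* = B u; here x* = 2.
x_star = np.linalg.solve(np.eye(1) - A, B @ u)
```

Since the single eigenvalue of A is 0.5 (stable in the discrete-time sense |λ| ≤ 1), the iterated state converges to x_star.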



1.4 Standard and periodic eigenvalue problems

Given a general matrix A ∈ R^{n×n}, the standard eigenvalue problem (see, e.g., [83]) consists of finding n eigenvalue-eigenvector pairs (λ_i, x_i) such that

A x_i = λ_i x_i, i = 1, . . . , n.   (1.5)

Notice that Equation (1.5) only concerns right eigenvectors. Left eigenvectors are defined by y_i^T A = λ_i y_i^T [46], i.e., they are right eigenvectors of the transposed matrix A^T. The standard method for the general standard eigenvalue problem is the unsymmetric QR algorithm (see, e.g., [81, 44]), which is a backward stable algorithm belonging to a large family of bulge-chasing algorithms [111] that by iteration reduces the matrix A to real Schur form via an orthogonal similarity transformation Q ∈ R^{n×n} such that

Q^T A Q = S,   (1.6)

where all eigenvalues of A appear as 1 × 1 and 2 × 2 blocks on the main diagonal of the quasi-triangular matrix S. The column vectors q_i, i = 1, 2, . . . , n, of Q are called the Schur vectors of the decomposition (1.6). If S(2, 1) = 0 in (1.6), then q_1 = x_1 is an eigenvector associated with the eigenvalue λ_1 = S(1, 1). More importantly, given k ≤ n such that no 2 × 2 block resides in S(k : k + 1, k : k + 1), the first k Schur vectors q_i, i = 1, 2, . . . , k, form an orthonormal basis for an invariant subspace of A associated with the first k eigenvalues λ_1, λ_2, . . . , λ_k. In most practical applications, the information retrieved from the Schur decomposition (eigenvalues and invariant subspaces) is sufficient and the eigenvectors need not be computed explicitly. By definition, the eigenvector x_i belongs to the null space of the shifted matrix A − λ_i I (see, e.g., [83, 63]). In case the matrix A is diagonalizable (see, e.g., [46]), the eigenvectors can be computed by successively reordering each of the eigenvalues in the Schur form to the top-left corner of the matrix (see, e.g., [9, 35, 23]) and reading off the first Schur vector q_1. However, the latter approach is not utilized in practice for the standard eigenvalue problem, but the basic idea can be useful in other contexts (see below).

¹ The definition of stable and unstable eigenvalues depends on the system considered. However, the common definitions of a stable eigenvalue λ for discrete-time and continuous-time systems are |λ| ≤ 1 and Re(λ) ≤ 0, respectively.
² The original reference for this problem is unknown, but several variants of it can be found at, e.g., [85, 24].

FIGURE 4: The considered system of three (polluted) connected lakes.

Example 1 Consider the closed system of lakes² in Figure 4. The system was polluted when a close-by factory closed down and dumped w_0 cubic meters of toxic waste into

one of the lakes (marked polluted lake). Every year there is a certain flow of the toxic waste between the three lakes in the system such that f_ij percent of the waste in lake i is transferred to lake j. Given w_0 and f_ij, i, j = 1, 2, 3, the task is to formulate this as a linear algebra problem and to predict the long-time behavior of the system.

Assume w_0 = 1000 and f_12 = 2, f_13 = 5, f_21 = 3, f_23 = 5, f_31 = 3, f_32 = 2. Then, the concentration c = [c_1 c_2 c_3]^T at year k + 1 can be computed as

c_{k+1} = A · c_k = A^k · c_0,   (1.7)

with

        [ 0.93  0.03  0.03 ]
    A = [ 0.02  0.92  0.02 ]
        [ 0.05  0.05  0.95 ]

and c_0 = [1000, 0, 0]^T. To predict the long-time behavior, denote the (right) eigenvalue-eigenvector pairs of A by (λ_i, x_i) and notice that c_0 can be written as

c_0 = β_1 x_1 + β_2 x_2 + β_3 x_3,   (1.8)

where β_i, i = 1, 2, 3, represent change-of-basis coordinates of c_0 computed from the linear system [x_1, x_2, x_3][β_1, β_2, β_3]^T = c_0. By combining (1.7) and (1.8), we get

c_{k+1} = A^k (β_1 x_1 + β_2 x_2 + β_3 x_3) = β_1 λ_1^k x_1 + β_2 λ_2^k x_2 + β_3 λ_3^k x_3,   (1.9)

i.e., the long-time behavior of the system can be predicted by solving an eigenvalue problem of the form (1.5). Actually, the values β_i, i = 1, 2, 3, do not need to be computed explicitly for this example (see below).



By invoking the MATLAB Schur decomposition command schur, we get the following output matrices³

        [ 0.9000  0.0149  −0.0050 ]        [ −0.7926  0.6097   0      ]
    S = [ 0       1.0000  −0.0340 ] ,  Q = [  0.2265  0.2944  −0.9285 ] ,
        [ 0       0        0.9000 ]        [  0.5661  0.7359   0.3714 ]

i.e., the eigenvalues of A are λ_1 = λ_3 = 0.9000 and λ_2 = 1.0000. This implies that

c_∞ = β_2 x_2,   (1.10)

since all other terms of (1.9) converge to zero as k → ∞. The remaining task is now to compute x_2, which we do (and illustrate) from a reordered Schur form. By invoking the MATLAB command ordschur, the eigenvalue λ_2 is reordered to the top-left corner of the updated Schur decomposition

        [ 1.0000  0.0149  −0.0343 ]        [ 0.4867   0.8736   0      ]
    S̃ = [ 0       0.9000   0.0000 ] ,  Q̃ = [ 0.3244  −0.1807  −0.9285 ] ,
        [ 0       0        0.9000 ]        [ 0.8111  −0.4519   0.3714 ]

and x_2 = Q̃(:, 1) can be read off. By scaling down each element in x_2 by ‖x_2‖_1 = ∑(x_2) (such that each element in x_2 represents the amount of toxic waste in percent of w_0), we arrive at

c_∞ = w_0 [0.3000, 0.2000, 0.5000]^T = [300.0, 200.0, 500.0]^T,

i.e., in the long run the system arrives at a stable equilibrium where lake 1 contains 300 cubic meters of toxic waste, lake 2 contains 200 cubic meters and lake 3 contains 500 cubic meters. Furthermore, this equilibrium is reached after 50 to 60 years, see Figure 5.

The periodic (or product) eigenvalue problem (see, e.g., [79, 111]) consists, in its simplest form, of computing eigenvalues and invariant subspaces of the matrix product

A = A_{K−1} A_{K−2} · · · A_0,   (1.11)

where A_0, A_1, . . . , A_{K−1}, with A_{K+i} = A_i, i = 0, 1, . . ., is a K-cyclic matrix sequence. Such problems can arise from forming the monodromy matrix [110] of discrete-time periodic descriptor systems of the form (1.4) with E = I. Another well-known product eigenvalue problem is the singular value decomposition which, implicitly, consists of computing eigenvalues and eigenvectors of the matrix product AA^T [111].

³ All computed numbers in this introduction are rounded and presented to only four decimals accuracy, while all computations are performed in full double precision arithmetic.
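The schur/ordschur computation of Example 1 can be reproduced in Python with SciPy, whose sort argument to scipy.linalg.schur plays the role of MATLAB's ordschur here (a sketch for illustration; the Thesis itself uses MATLAB):

```python
import numpy as np
from scipy.linalg import schur

A = np.array([[0.93, 0.03, 0.03],
              [0.02, 0.92, 0.02],
              [0.05, 0.05, 0.95]])

# Real Schur form with the eigenvalue lambda_2 = 1 reordered to the top-left
# corner: the sort callback selects eigenvalues with |lambda|^2 > 0.99.
S, Q, sdim = schur(A, output='real', sort=lambda re, im: re**2 + im**2 > 0.99)

x2 = Q[:, 0]                    # first Schur vector = eigenvector for lambda = 1
c_inf = 1000.0 * x2 / x2.sum()  # scale so the entries are cubic meters of waste
```

With the 0.99 threshold only λ_2 = 1 is selected and moved to the top left (so sdim is 1), and c_inf reproduces the equilibrium [300, 200, 500] of the example; the division by the sum also removes any sign ambiguity in the computed Schur vector.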



FIGURE 5: Evolution of the amount of pollution (in m³) in the three-lake system as a function of time (in years), for each of the lakes 1, 2 and 3, illustrated by explicit application of Equation (1.7). A stable equilibrium is reached after 50 to 60 years.

For cost and accuracy reasons, it is necessary to work with the individual factors in (1.11) and not to form the product A explicitly [22, 111]. In general, the eigenvalues of a K-cyclic matrix product are obtained by computing the periodic real Schur form (PRSF) [22, 57]

Z_{k⊕1}^T A_k Z_k = S_k, k = 0, 1, . . . , K − 1,   (1.12)

where k ⊕ 1 = (k + 1) mod K, Z_0, Z_1, . . . , Z_{K−1} are orthogonal, and the sequence S_k consists of K − 1 upper triangular matrices and one upper quasi-triangular matrix. This is done by applying the periodic QR algorithm [22, 78] to the matrix sequence A_0, A_1, . . . , A_{K−1}. The periodic QR algorithm is essentially analogous to the standard QR algorithm applied to a (block) cyclic matrix [80]. The placement of the quasi-triangular matrix may be specified to fit the actual application. Sometimes, for example in pole assignment, the resulting PRSF should be ordered [107], i.e., the eigenvalues must appear in a specified order along the diagonals of the individual S_k factors.

Periodic descriptor systems of the form (1.4) are conceptually studied by computing eigenvalues and eigenspaces of matrix products of the form

E_{K−1}^{−1} A_{K−1} E_{K−2}^{−1} A_{K−2} · · · E_1^{−1} A_1 E_0^{−1} A_0,   (1.13)

which can be accomplished via the generalized periodic Schur decomposition [22, 57]: there exists a K-cyclic orthogonal matrix pair sequence (Q_k, Z_k) with Q_k, Z_k ∈ R^{n×n} such that

S_k = Q_k^T A_k Z_k,
T_k = Q_k^T E_k Z_{k⊕1},   (1.14)

where all matrices S_k, except for some fixed index j with 0 ≤ j ≤ K − 1, and all matrices T_k are upper triangular. The matrix S_j is upper quasi-triangular; typically j is chosen to be 0 or K − 1, but the quasi-triangular form can, in principle, be assigned to any matrix in the sequence. The sequence (S_k, T_k) is called the generalized periodic real Schur form (GPRSF) of (A_k, E_k), k = 0, 1, . . . , K − 1, and the generalized eigenvalues λ_i = (α_i, β_i) of the corresponding product (1.13) can be computed from the diagonals of the triangular factors. Notice that calculation of the eigenvalues corresponding to a 2 × 2 diagonal block in S_j, which signals a complex conjugate pair of eigenvalues, requires some postprocessing, including forming explicit matrix products of dimension 2 × 2 and length K, which may cause bad accuracy⁴ or even numerical over- or underflow; scaling techniques could offer a remedy for such ill-conditioned cases. It is sometimes better to keep the generalized eigenvalues in factored form, which allows, e.g., computing the logarithms of eigenvalue magnitudes without forming the product explicitly.

The GPRSF provides the necessary tools to avoid computing the matrix product (1.13) explicitly (which can cause numerical instabilities) and allows to handle possibly singular factors E_k. Moreover, by ordering the eigenvalues in the GPRSF, periodic deflating subspaces associated with a specified set of eigenvalues can be computed. This is important in certain applications, e.g., solving periodic Riccati-type matrix equations [57, 65, 52].

Each formal K-cyclic matrix product of the form (1.11) is associated with a matrix tuple Ā = (A_{K−1}, A_{K−2}, . . . , A_1, A_0), and an associated vector tuple x̄ = (x_{K−1}, x_{K−2}, . . . , x_1, x_0), with x_k ≠ 0, is called a right eigenvector of the tuple Ā corresponding to the eigenvalue λ if there exist scalars µ_k, possibly complex, such that the relations

A_k x_k = µ_k x_{k+1}, k = 0, 1, . . . , K − 1,
λ := ∏_{k=K−1}^{0} µ_k   (1.15)

hold with x_K = x_0 [14]. A left eigenvector ȳ of the tuple Ā corresponding to λ is defined similarly. In this context, a direct eigenvalue reordering method may be utilized to compute eigenvectors corresponding to each eigenvalue in the corresponding periodic Schur form by reordering each eigenvalue to the top-left corner of the periodic Schur form, similarly to the non-periodic case. Corresponding definitions of eigenvectors can be given for more general matrix products of the form (1.13); see, e.g., [14].

⁴ Matrix multiplication is not backward stable in general for n > 1 [61].


Chapter 1

Example 2. Consider again the system of lakes in Example 1. By a more rigorous investigation of the system, it was observed that the water flow between the lakes could not be captured by the simple full-year model above. Instead, it showed a clear periodic behavior and differed depending on the seasons, as follows:

    c_{k+1} = A_4 A_3 A_2 A_1 · c_k,    (1.16)

or, more specifically,

    c_{k,2}   = A_1 · c_{k,1},
    c_{k,3}   = A_2 · c_{k,2},
    c_{k,4}   = A_3 · c_{k,3},
    c_{k+1,1} = A_4 · c_{k,4},    (1.17)

where

    A_1 = [ 0.9400  0.0300  0.0200 ]    A_2 = [ 0.9200  0.0500  0.0300 ]
          [ 0.0100  0.9300  0.0200 ]          [ 0.0100  0.9400  0.0100 ]
          [ 0.0500  0.0400  0.9600 ],         [ 0.0700  0.0100  0.9600 ],

    A_3 = [ 0.9300  0.0100  0.0100 ]    A_4 = [ 0.9300  0.0300  0.0600 ]
          [ 0.0400  0.9100  0.0400 ]          [ 0.0200  0.9000  0.0100 ]
          [ 0.0300  0.0800  0.9500 ],         [ 0.0500  0.0700  0.9300 ],

and the subindices 1, 2, 3, and 4 correspond to the seasons summer, autumn, winter and spring, respectively. Notice that A = (A_1 + A_2 + A_3 + A_4)/4, where A is the transition matrix from Example 1, i.e., the model in Example 1 was based on average measurements. To not lose any accuracy or information, we keep the matrix product above factorized and compute the eigenvalues via the periodic Schur form, which results in the eigenvalues λ_{1,2} = 0.6555 ± 0.0004i, λ_3 = 1.0000. Notice that |λ_{1,2}| < 1. After reordering the diagonal block sequence corresponding to the eigenvalue 1.0000 to the top-left corner of the periodic Schur form, we end up with the sequence S̃_k as

    S̃_1 = [ −1.0020  −0.0023  −0.0271 ]    S̃_2 = [ −1.0028  −0.0150   0.0191 ]
          [  0        0.9150   0.0007 ]          [  0        0.8864   0.0350 ]
          [  0       −0.0007   0.9132 ],         [  0        0        0.9313 ],

    S̃_3 = [  0.9960   0.0468   0.0514 ]    S̃_4 = [  0.9993  −0.0251   0.0495 ]
          [  0        0.9100  −0.0372 ]          [  0        0.8883   0.0028 ]
          [  0        0        0.8831 ],         [  0        0        0.8727 ],


and the sequence Q̃_k of transformation matrices as

    Q̃_1 = [ −0.4964   0.7844  −0.3718 ]    Q̃_2 = [  0.4914   0.7827  −0.3820 ]
          [ −0.3226  −0.5643  −0.7599 ]          [  0.3205  −0.5704  −0.7563 ]
          [ −0.8059  −0.2572   0.5332 ],         [  0.8098  −0.2492   0.5311 ],

    Q̃_3 = [ −0.4911   0.7635  −0.4195 ]    Q̃_4 = [ −0.4698   0.7953  −0.3831 ]
          [ −0.3134  −0.6042  −0.7327 ]          [ −0.3387  −0.5632  −0.7537 ]
          [ −0.8128  −0.2283   0.5360 ],         [ −0.8152  −0.2244   0.5340 ].

Moreover, the periodic eigenvector corresponding to the eigenvalue λ = 1.0000 can be read off as (x_1, x_2, x_3, x_4) = (Q̃_1(:, 1), Q̃_2(:, 1), Q̃_3(:, 1), Q̃_4(:, 1)), corresponding to (µ_1, µ_2, µ_3, µ_4) = (S̃_1(1, 1), S̃_2(1, 1), S̃_3(1, 1), S̃_4(1, 1)). This eigenvector corresponds to the stable equilibria for each season, which are reached after about 20 years, see Figures 6-7. Even though the differences between the equilibria of the different seasons are not very large, the periodic model does not mislead us into believing that the level of pollution in each lake stays fixed when arriving at the steady state, as is the case with the non-periodic model in Example 1. Moreover, we get a more accurate prediction of the long-term behavior by working with a periodic Schur form of a matrix product of period K = 4. Working directly with the explicitly formed product A = A_4 A_3 A_2 A_1, in the spirit of Example 1, would only result in one stable equilibrium, corresponding to the x_1 part of the periodic eigenvector above. For other applications with periodic behavior, see, e.g., the modelling of the growth of citrus trees in [27].

There exist a number of generalizations of the presented periodic Schur forms. For example, the extended periodic real Schur form (EPRSF) [106] generalizes the PRSF for handling square products where the involved matrices are rectangular. The EPRSF can be computed by a slightly modified periodic QR algorithm. We also remark that the GPRSF can be modified to cover matrix products with rectangular factors and/or ±1 exponents of arbitrary order (see Section 3).
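The computations in Example 2 can be reproduced with a few lines of Python/NumPy. The sketch below forms the full-year product explicitly, which is acceptable at this tiny scale but is exactly what the periodic Schur form avoids in general; the season matrices are those given above, while the initial pollution vector is a hypothetical choice:

```python
import numpy as np

# Season transition matrices from Example 2 (summer, autumn, winter, spring).
A1 = np.array([[0.94, 0.03, 0.02], [0.01, 0.93, 0.02], [0.05, 0.04, 0.96]])
A2 = np.array([[0.92, 0.05, 0.03], [0.01, 0.94, 0.01], [0.07, 0.01, 0.96]])
A3 = np.array([[0.93, 0.01, 0.01], [0.04, 0.91, 0.04], [0.03, 0.08, 0.95]])
A4 = np.array([[0.93, 0.03, 0.06], [0.02, 0.90, 0.01], [0.05, 0.07, 0.93]])

# Eigenvalues of the explicitly formed full-year product (1.16):
# magnitudes 1.0000 and |0.6555 +/- 0.0004i|, as stated in the text.
M = A4 @ A3 @ A2 @ A1
lam = np.linalg.eigvals(M)

# Iterating whole years converges to the summer equilibrium x1 of the
# periodic eigenvector; A1, A2, A3 then give the other seasons' equilibria.
c = np.array([600.0, 200.0, 200.0])   # hypothetical initial pollution levels
for _ in range(100):
    c = M @ c
x1 = c
x2 = A1 @ x1                          # autumn equilibrium
x3 = A2 @ x2                          # winter equilibrium
x4 = A3 @ x3                          # spring equilibrium
assert np.allclose(M @ x1, x1) and np.allclose(A4 @ x4, x1)
```

The dominant eigenvalue is exactly 1 because each A_k is column stochastic (the pollution is only redistributed between the lakes), so the iteration converges to the seasonal equilibria shown in Figures 6-7.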

1.5 Sylvester-type matrix equations

Matrix equations have been in the focus of the numerical community for quite some time. Applications include eigenvalue problems and condition estimation (e.g., see [73,

FIGURE 6: The level of pollution in each lake as a function of time, illustrated by explicit application of Equation (1.17). The four panels show the three-lake system in summer, autumn, winter and spring; the stable equilibria are reached after about 20 years.

60, 96]) as well as various control problems (e.g., see [36, 77]). Already in 1972, R. H. Bartels and G. W. Stewart published the paper Algorithm 432: Solution of the Matrix Equation AX + XB = C [11], which presented a Schur-based method for solving the continuous-time Sylvester (SYCT) equation

    AX + XB = C,    (1.18)

where A of size m × m, B of size n × n, and C of size m × n are arbitrary matrices with real entries. Equation (1.18) has a unique solution if and only if A and −B have no eigenvalues in common. The solution method in [11] follows the general idea from mathematics of problem solving via reformulations and coordinate transformations: first transform the problem to a form where it is (more easily) solvable, then solve the transformed problem

FIGURE 7: The stable equilibria of the three-lake system corresponding to summer, autumn, winter and spring, which are reached after about 20 years.

and finally transform the solution back to the original coordinate system. Other examples include computing derivatives by a spectral method, using the forward and backward Fourier transforms as transformation method [39], and computing explicit inverses of general square matrices, using LU factorization and matrix multiplication as transformation methods [61]. Bartels–Stewart's method for Equation (1.18) is as follows:

1. Transform the matrices A and B to real Schur form:

       S_A = Q^T A Q,    (1.19)
       S_B = P^T B P,    (1.20)

   where Q and P are orthogonal matrices and S_A and S_B are in real Schur form.

2. Update the matrix C with respect to the two Schur decompositions:

       C̃ = Q^T C P.    (1.21)

3. Solve the resulting reduced triangular matrix equation:

       S_A X̃ + X̃ S_B = C̃.    (1.22)

4. Transform the obtained solution back to the original coordinate system:

       X = Q X̃ P^T

(1.23)

The first step, which is performed by reducing the left hand side coefficient matrices to Hessenberg form and applying the QR algorithm to compute their real Schur forms, is also known to be the dominating part in terms of floating point operations [46] and execution time (see Paper III). Due to recent developments in obtaining close to level 3 performance in the bulge-chasing [25] and advanced deflation techniques for the QR algorithm [26], this might change in the future. The classic paper of Bartels and Stewart [11] has served as a foundation for later developments of direct solution methods for related problems, see, e.g., Hammarling's method [56] and the Hessenberg-Schur approach by Golub, Nash and Van Loan [45]. However, these methods were developed before matrix blocking became vital for handling the increasing performance gap between processors and memory modules.

SYCT can be formulated in terms of computing a block diagonalization of a matrix in standard real Schur form:

    [ I  X ]^{-1} [ S_11  S_12 ] [ I  X ]   =   [ S_11  S_12 + S_11 X − X S_22 ]
    [ 0  I ]      [ 0     S_22 ] [ 0  I ]       [ 0     S_22                   ]    (1.24)

This leads to the idea of computing reciprocal condition number estimates of selected eigenvalue clusters [10] and the corresponding eigenspaces [73], which are, roughly speaking, based on how "hard" it is to solve the corresponding triangular Sylvester equation. A similar reasoning can be applied to the generalized real Schur form (see, e.g., [75, 74]) in terms of solving triangular generalized coupled Sylvester (GCSY) equations

    (AX − YB, DX − YE) = (C, F).    (1.25)

Other condition estimates can be formulated for other Sylvester-type matrix equations as well (see, e.g., [66, 67] and Papers III-IV below). Bartels–Stewart-style methods can be formulated for all matrix equations in Table 2.1 (see Paper III). For example, for the generalized matrix equations such methods would require efficient and reliable implementations of the QZ algorithm.
For parallel DM environments, the lack of such an implementation has driven the development of fully iterative methods for Sylvester-type matrix equations based on Newton iteration of the matrix sign function [15, 16, 17]. A comparison of such a fully iterative method and Bartels–Stewart's method was presented in [50]. We remark that a preliminary version of a highly efficient parallel QZ algorithm is included in the software package SCASY, see Papers III-IV of this Thesis.
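For illustration, the four steps of Bartels–Stewart's method described above can be sketched with NumPy/SciPy, delegating step 3 to the LAPACK routine TRSYL. This is only a minimal serial sketch (function name and test problem are ours), not the blocked and parallel algorithms developed in Papers III-IV:

```python
import numpy as np
from scipy.linalg import schur, lapack

def bartels_stewart(A, B, C):
    """Solve AX + XB = C (Equation 1.18) by the four Bartels-Stewart steps."""
    SA, Q = schur(A, output='real')       # 1. S_A = Q^T A Q   (1.19)
    SB, P = schur(B, output='real')       #    S_B = P^T B P   (1.20)
    Ct = Q.T @ C @ P                      # 2. C~ = Q^T C P    (1.21)
    # 3. Quasi-triangular solve S_A X~ + X~ S_B = scale * C~  (1.22);
    #    isgn=1 selects the '+' sign, scale prevents overflow.
    Xt, scale, info = lapack.dtrsyl(SA, SB, Ct, isgn=1)
    assert info == 0
    return Q @ (Xt / scale) @ P.T         # 4. X = Q X~ P^T    (1.23)

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5)) + 6 * np.eye(5)  # shift so that A and -B
B = rng.standard_normal((4, 4))                  # have well-separated spectra
X = rng.standard_normal((5, 4))
C = A @ X + X @ B                                # problem with known solution
assert np.allclose(bartels_stewart(A, B, C), X)
```

The spectral shift of A guarantees the unique-solvability condition above (no common eigenvalues of A and −B) for this random test problem.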


Periodic Sylvester-like equations arise as periodic counterparts of similar problems, e.g., condition estimation of eigenvalue clusters or of certain periodic eigenspaces (see Section 2 and Papers I-II below). Also, periodic variants of Bartels–Stewart's method follow straightforwardly by considering periodic matrix sequences and a periodic Schur decomposition, performing K-periodic updates of the right hand side, and solving triangular periodic Sylvester-type matrix equations, see, e.g., [105, 49].

1.6 High performance linear algebra software libraries

For a long time, software libraries have been a fundamental tool for problem solving in Computational Science and Engineering (CSE). Many computational problems arising from real world problems, or from discretizations and linearizations of mathematical models, can be formulated in terms of common matrix and vector operations. Along with the development of more advanced computer systems with complex memory hierarchies, there is a continuing demand for new algorithms that efficiently utilize and adapt to new architecture features. Besides, as we gain new insight from CSE research and developments, new demands and challenges emerge that request the solution of new and more complex matrix computational problems. These facts have been driving forces behind library software packages such as the BLAS [20], LAPACK [82, 5], SLICOT [41, 102] and their parallel counterparts ScaLAPACK [18] and PSLICOT [90]. Below, we give some insight into the functionality of these libraries. For other attempts at providing high performance software libraries in matrix computations, see, e.g., FLAME [54], where the triangular SYCT equation (see Section 1.5) was used as a model problem for formal derivation and automatic generation of numerical algorithms [98], and PLAPACK [104]. Moreover, the Umeå HPC research group, in collaboration with the IBM T.J. Watson Research Center, recently presented novel recursive blocked algorithms and hybrid data formats for dense linear algebra library software targeting the deep memory hierarchies of processor nodes (see the review paper [40] and further references therein).

The BLAS (Basic Linear Algebra Subprograms) are structured in three levels. The level 1 BLAS are concerned with vector-vector operations, e.g., scalar products, rotations, etc., and were developed during the seventies. The level 2 BLAS perform matrix-vector operations and were originally motivated by the increasing number of vector machines during the eighties. The level 3 BLAS concern matrix-matrix operations, such as the well-known GEMM (GEneral Matrix Multiply and add) operation

    C = β C + α A B,


where α and β are scalars, A is an m × k matrix, B is a k × n matrix, and C is an m × n matrix. In general, the level 3 BLAS perform O(n^3) arithmetic operations while moving O(n^2) data elements through the memory hierarchy of the computer, leading to a volume-to-surface effect on the performance. If the level 3 BLAS are properly tuned for the cache memory hierarchy of the target computer system, and the computations in the actual program are organized into level 3 operations, the execution may run with close to practical peak performance. In fact, the whole level 3 BLAS may be organized in terms of GEMM operations [71, 72], which means that the performance will depend mainly on how well tuned the GEMM operation is. Computer vendors often supply their own high performance implementation of the BLAS, optimized for their specific architecture. Automatically tuned libraries also exist, see, e.g., ATLAS [8]. See also the GOTO-BLAS [47], which makes use of data streaming to efficiently utilize the memory hierarchy of the target computer.

LAPACK (Linear Algebra Package) combines the functionality of the former LINPACK and EISPACK libraries and performs all kinds of matrix computations, from solving linear systems of equations to calculating all eigenvalues of a general matrix. The computations in LAPACK are organized to perform as much as possible in level 3 operations for optimal performance. The LAPACK project [5, 34, 31] has been extremely successful and now forms the underlying "computational layer" of the interactive MATLAB [86] environment, which is perhaps the most popular tool for solving computational problems in science and engineering and for educational purposes.

ScaLAPACK (Scalable LAPACK) [18] implements a subset of the algorithms in LAPACK for distributed memory environments (see also Section 1.2).
Basic building blocks are a two-dimensional (2D) block cyclic data distribution (see, e.g., [48]) over a logical rectangular processor mesh, in combination with a Fortran 77 object-oriented approach for handling the involved global distributed matrices. In connection to ScaLAPACK, parallel versions of the BLAS exist, the PBLAS (Parallel BLAS) [94]. Explicit communication in ScaLAPACK is performed using the BLACS (Basic Linear Algebra Communication Subprograms) library [19], which provides processor mesh setup tools and basic point-to-point, collective and reduction communication routines. The BLACS are usually implemented using MPI (Message Passing Interface) [88], the de-facto standard for message passing communication. We remark that the basic building blocks of LAPACK and ScaLAPACK are under reconsideration [31].

SLICOT (Subroutine Library in Systems and Control Theory) provides Fortran 77 implementations of numerical algorithms for computations in systems and control applications. Based on numerical linear algebra routines from the BLAS and LAPACK libraries, SLICOT provides methods for the design and analysis of control systems.
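To make the 2D block cyclic distribution concrete, the mapping from a global matrix entry to the owning process can be computed as follows (a simplified sketch assuming, as in the ScaLAPACK default, that the distribution starts at process (0, 0)):

```python
def block_cyclic_owner(i, j, mb, nb, prow, pcol):
    """Return the (row, col) coordinates of the process in a prow x pcol
    mesh that owns global entry (i, j) under a 2D block cyclic data
    distribution with blocks of size mb x nb (all indices zero-based)."""
    return (i // mb) % prow, (j // nb) % pcol

# A matrix in 2 x 2 blocks distributed over a 2 x 3 process mesh:
assert block_cyclic_owner(0, 0, 2, 2, 2, 3) == (0, 0)   # first block
assert block_cyclic_owner(2, 0, 2, 2, 2, 3) == (1, 0)   # next block row below
assert block_cyclic_owner(4, 6, 2, 2, 2, 3) == (0, 0)   # wraps around cyclically
```

This is the essence of ScaLAPACK's INDXG2P index computation; the library additionally allows a nonzero starting process via the RSRC/CSRC descriptor entries.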



Similarly to LAPACK and ScaLAPACK, a parallel version of SLICOT, called PSLICOT, is under development. The goal is to include most of the functionality of SLICOT in a parallel version. PSLICOT also builds on the existing functionality of ScaLAPACK, the PBLAS and the BLACS. We refer to [28] for a discussion of the impact of multicore architectures on mathematical software.

1.7 The need for improved algorithms and library software

As computational science evolves together with the technical development of new and complex computational platforms, the need for reliable and efficient numerical software continues to grow. Typically, new and computationally intensive problems arise in the context of real applications, ranging from bridge construction, automobile design and satellite control to particle simulations and quantum computations in physics and chemistry. In the rest of this Thesis, we present and discuss our recent contributions regarding new and improved algorithms and library software for computational problems in the following topics of numerical linear algebra:

1. Periodic eigenvalue reordering, which arises in the context of computing periodic eigenspaces of certain matrix products; these contributions also include new and improved algorithms for solving periodic Sylvester-type matrix equations.

2. Parallel solvers and condition estimators for Sylvester-type matrix equations for distributed memory and hybrid distributed and shared memory environments.

3. Parallel eigenvalue reordering in the standard Schur form of a matrix (see Section 1.4), which fills one of the functionality gaps in ScaLAPACK (see Section 1.6) by providing novel algorithms and software for parallel computation of invariant subspaces.

In Chapter 2 of this introduction, we give brief summaries of the contributions of the five papers included in this Thesis. Chapter 3 outlines some ongoing and future work, and Chapter 4 presents some of our related publications and project websites.


Chapter 2

Contributions in this Thesis

This chapter gives a brief summary of the contributions of the five papers in this Thesis. The first two papers are concerned with periodic eigenvalue reordering in periodic (cyclic) matrix and matrix pair sequences. The third and fourth papers concern parallel solvers and condition estimators for Sylvester-type matrix equations. The last paper concerns a parallel version of the standard algorithm for eigenvalue reordering in the real Schur form of a general matrix. The library software deliverables from this Thesis are novel contributions that open up new possibilities to solve challenging computational problems regarding periodic systems as well as other large-scale control systems applications. Examples include optimal linear quadratic control problems, which involve the solution of various algebraic and differential Riccati equations [87, 57, 4, 65, 52]. In Chapter 3, we discuss how the contributions from this Thesis can be further extended and refined to cover new challenging computational problems in the considered areas.

2.1 Paper I

The first contribution concerns periodic eigenvalue problems. The paper presents the derivation and analysis of a direct method (see, e.g., [9, 69]) for eigenvalue reordering in a K-cyclic matrix product A_{K−1} A_{K−2} ··· A_1 A_0, without evaluating any part of the matrix product. The method relies on orthogonal transformations only and performs the reordering tentatively to guarantee backward stability. By applying a bubble-sort-like procedure of adjacent swaps of diagonal block sequences in the corresponding periodic real Schur form, an ordered periodic real Schur form is computed robustly. One important step in the method is computing the numerical solution of an associated triangular periodic Sylvester equation (PSE)


    A11^(k) X_k − X_{k+1} A22^(k) = −A12^(k),  k = 0, 1, ..., K − 1,    (2.1)

where X_K = X_0. Methods for solving Equation (2.1) are discussed, including Gaussian elimination with partial or complete pivoting (GEPP/GECP) and iterative refinement (see, e.g., [61]). An error analysis of the direct reordering method is presented, revealing that the accuracy of the reordered eigenvalues is essentially governed by the accuracy of the computed solution of the associated PSE. The theoretical results are also specialized to the standard case K = 1, delivering an even sharper error bound than the previously known result in [9]. Some experimental results are presented that illustrate the reliability and robustness of the direct reordering method for a selected number of problems, including well- and ill-conditioned artificial problems with short and long periods, and an application with long period from satellite control.
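For small dimensions, the PSE (2.1) can be illustrated by forming the corresponding lifted linear system with Kronecker products and solving it densely. This is a naive sketch for intuition only (function name and test data are ours), not the structured solvers analyzed in the paper:

```python
import numpy as np

def solve_pse(A11, A22, A12):
    """Naive solver for the PSE (2.1): A11[k] X[k] - X[k+1] A22[k] = -A12[k],
    k = 0, ..., K-1, with X[K] = X[0], via the lifted Kronecker system."""
    K, (m, n) = len(A11), A12[0].shape
    N = m * n
    Z = np.zeros((K * N, K * N))
    rhs = np.empty(K * N)
    for k in range(K):
        rows = slice(k * N, (k + 1) * N)
        Z[rows, rows] = np.kron(np.eye(n), A11[k])            # vec(A11[k] X[k])
        nxt = ((k + 1) % K) * N                               # cyclic coupling
        Z[rows, nxt:nxt + N] -= np.kron(A22[k].T, np.eye(m))  # vec(X[k+1] A22[k])
        rhs[rows] = -A12[k].flatten(order='F')
    x = np.linalg.solve(Z, rhs)
    return [x[k * N:(k + 1) * N].reshape((m, n), order='F') for k in range(K)]

# Random example; the shift keeps the spectra of the two periodic coefficient
# sequences well separated, so the PSE has a unique, well-conditioned solution.
rng = np.random.default_rng(2)
K, m, n = 4, 3, 2
A11 = [rng.standard_normal((m, m)) + 5 * np.eye(m) for _ in range(K)]
A22 = [rng.standard_normal((n, n)) for _ in range(K)]
A12 = [rng.standard_normal((m, n)) for _ in range(K)]
X = solve_pse(A11, A22, A12)
for k in range(K):
    assert np.allclose(A11[k] @ X[k] - X[(k + 1) % K] @ A22[k], -A12[k])
```

The lifted matrix has dimension K·m·n, so this approach is only viable for tiny problems; it does, however, make the cyclic coupling X_K = X_0 explicit.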

2.2 Paper II

The second contribution concerns periodic eigenvalue reordering in periodic matrix pairs which arise in the study of generalized periodic eigenvalue problems, i.e., computing eigenvalues and eigenspaces of matrix products of the form

    E_{K−1}^{−1} A_{K−1} E_{K−2}^{−1} A_{K−2} ··· E_1^{−1} A_1 E_0^{−1} A_0,    (2.2)

where K is the period. Such matrix products were studied, e.g., in [12, 14, 84, 111]. Extending and generalizing the results from Paper I and [69, 74], the swapping of the diagonal block sequences in the associated generalized periodic real Schur form by orthogonal transformations is based on computing the numerical solution to an associated periodic generalized coupled Sylvester equation (PGCSY) of the form

    A11^(k) R_k − L_k A22^(k) = −A12^(k),
    E11^(k) R_{k⊕1} − L_k E22^(k) = −E12^(k),    (2.3)

and the computation of K pairs of orthogonal matrices from the solution to (2.3). In this paper, the discussion of solution methods for periodic Sylvester equations from Paper I is broadened to also cover structured variants of an orthogonal QR factorization of a matrix representation of the periodic generalized Sylvester operator of (2.3). This QR-based procedure is known to be numerically stable even when Gaussian elimination with partial pivoting fails because of successive pivot growth (see, e.g., [112]).


The presented error analysis generalizes the existing non-periodic result [69] and is connected to the expected accuracy of the QR-based solution method for the associated periodic generalized coupled Sylvester equation. The paper ends with some experimental results that demonstrate the reliability and robustness of the developed reordering method, including an example with infinite eigenvalues and an example related to periodic LQ-optimal control.

2.3 Paper III

The third contribution presents the theory and algorithms which serve as the foundation of the parallel SCASY software library presented in Paper IV. The aim is to provide robust and efficient parallel algorithms and software for computing the numerical solution of the eight common Sylvester-type matrix equations displayed in Table 2.1, by extending and parallelizing the different steps of the well-known Schur decomposition-based Bartels–Stewart method (see Section 1.5). The algorithms presented are novel and complete ScaLAPACK-style implementations.

TABLE 2.1: Considered standard and generalized matrix equations. CT and DT denote the continuous-time and discrete-time variants, respectively.

    Name                           Matrix Equation                             Acronym
    Standard CT Sylvester          AX − XB = C ∈ R^{m×n}                       SYCT
    Standard CT Lyapunov           AX + XA^T = C ∈ R^{m×m}                     LYCT
    Standard DT Sylvester          AXB − X = C ∈ R^{m×n}                       SYDT
    Standard DT Lyapunov           AXA^T − X = C ∈ R^{m×m}                     LYDT
    Generalized Coupled Sylvester  (AX − YB, DX − YE) = (C, F) ∈ R^{(m×n)×2}   GCSY
    Generalized Sylvester          AXB^T − CXD^T = E ∈ R^{m×n}                 GSYL
    Generalized CT Lyapunov        AXE^T + EXA^T = C ∈ R^{m×m}                 GLYCT
    Generalized DT Lyapunov        AXA^T − EXE^T = C ∈ R^{m×m}                 GLYDT

The work by Kågström and Poromaa [73] and Poromaa [96] on blocked and parallel algorithms for the triangular continuous-time Sylvester (SYCT) matrix equation (1.18) and the generalized coupled Sylvester (GCSY) matrix equation (1.25) is extended and refined to cover other types of Sylvester-like matrix equations, such as the discrete-time and/or generalized variants of the Sylvester and Lyapunov matrix equations. For the discrete-time matrix equations, a novel technique for reducing the arithmetic complexity needed in the explicit blocking is proposed. Two different communication schemes used in the triangular solvers are presented, and a new algorithm for performing global scaling to prevent numerical overflow is discussed. One of the shortcomings of the previous algorithms was the lack of support for

handling quasi-triangular matrices where some 2 × 2 blocks, corresponding to complex conjugate pairs of eigenvalues, were shared between different blocks (and processors) in the explicit blocking of the algorithms. This problem was resolved by proposing an implicit redistribution of the matrices in the initial stage of step 3 of Bartels–Stewart's method. Notice that a second possible solution is outlined below in Section 2.5. The paper includes a generic scalability analysis of the developed parallel algorithms, which provides theoretical insight into the expected behavior of the algorithms, including performance models that predict a certain level of scaled parallel speedup. The developed algorithms are also combined with the (Sca)LAPACK-style condition estimation functionality [5] to produce scalable and robust parallel condition estimators. Experimental results from three parallel platforms with different characteristics are presented and analyzed using several performance and accuracy metrics. We remark that the work presented in Papers III-IV in this Thesis was based on preliminary results from [51].
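The shared 2 × 2 block issue above can be made concrete: a block partitioning of a quasi-triangular matrix must never place a cut through a 2 × 2 diagonal block. A simple serial sketch of such a splitting-point adjustment (a simplified stand-in for the implicit redistribution used in the paper; names are ours) is:

```python
import numpy as np
from scipy.linalg import schur

def safe_block_starts(S, nb):
    """Block-partition indices for a quasi-triangular S (real Schur form)
    such that no 2 x 2 diagonal block, i.e., no complex conjugate
    eigenvalue pair, is split across a block boundary."""
    n = S.shape[0]
    starts, i = [], 0
    while i < n:
        starts.append(i)
        j = min(i + nb, n)
        if j < n and S[j, j - 1] != 0.0:   # this cut would split a 2 x 2 block:
            j += 1                          # move the boundary one step down
        i = j
    return starts

# A random real Schur form almost surely contains 2 x 2 blocks.
S, _ = schur(np.random.default_rng(3).standard_normal((10, 10)), output='real')
for s in safe_block_starts(S, 3)[1:]:
    assert S[s, s - 1] == 0.0              # every boundary lies at a safe cut
```

Since a quasi-triangular matrix never has two consecutive nonzero subdiagonal entries, moving a boundary by one row always lands on a safe cut.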

2.4 Paper IV

In this fourth contribution, the parallel ScaLAPACK-style software package SCASY is presented. SCASY builds on the theory and algorithms presented in Paper III and extends the functionality by providing parallel solvers and condition estimators for 44 different sign and transpose variants of the eight Sylvester-type matrix equations considered in Paper III. By using and extending existing functionality from the BLAS, LAPACK, ScaLAPACK and RECSY [68] libraries, SCASY delivers high-performance routines for general and triangular Sylvester-type matrix equations and is designed and implemented for DM and hybrid DM and SM platforms. Fortran interfaces to the different core subroutines and test utilities, SCASY's internal design and routine dependency hierarchy (see Figure 8), the software documentation, install instructions and library usage are discussed in some detail. Some experimental results demonstrating SCASY's novel capacity of handling two different levels of parallelism and programming models concurrently are also presented. This is conducted by combining the distributed memory model of the ScaLAPACK-style core routines in SCASY with the shared memory model of the OpenMP version of the node solver library RECSY and a shared memory implementation of the BLAS (SMP-BLAS), on a target distributed memory machine with SMP-aware NUMA (Non-Uniform Memory Access) multiprocessor nodes.


FIGURE 8: Software hierarchy of SCASY.

SCASY is currently used by researchers at the ICPS Group [62] in the context of a CFD application for solving the linearized Navier-Stokes equations with stochastic forcing, see, e.g., [42]. The application is implemented in Python with f2py-generated [100] ScaLAPACK wrappers, and one core step is to use the general Lyapunov solver PGELYCTD from SCASY to solve general Lyapunov equations of size 20000 × 20000 which are so ill-conditioned that fully iterative methods (see Section 1.5) fail to compute the solution.

2.5 Paper V

The final contribution in this Thesis presents novel parallel variants of the standard (blocked) algorithm for eigenvalue reordering in the standard (and generalized) real Schur form of a square matrix (regular matrix pair). High serial performance is accomplished by adopting the recently developed idea of using computational windows and delaying the outside-window updates until each window has been completely reordered locally. By using multiple concurrent windows, the parallel algorithm achieves a high level of concurrency, and most work is performed as level 3 BLAS operations, which in turn leads to high parallel performance. The basic ideas are illustrated in Figure 9, where a real Schur form is distributed over a 2×3 mesh of processors and the eigenvalues are reordered using two concurrent computational windows.

FIGURE 9: Using multiple concurrent computational windows in parallel eigenvalue reordering in the real Schur form. The computational windows (red) are local and the update regions (green and blue) are shared.

The parallel Sylvester solvers and the associated condition estimators from Papers III-IV are applied to compute reciprocal condition number estimates for the selected cluster of eigenvalues and the associated eigenspaces (invariant and deflating subspaces). As an application example, stable invariant subspaces and associated condition number estimates of Hamiltonian matrices are computed with satisfying results. Experimental results for ScaLAPACK-style Fortran implementations on a Linux cluster confirm the efficiency and scalability of our algorithms: the serial speedup goes up to 14 times on one processor, and more than 16 times of parallel speedup is additionally delivered using 64 processors for large-scale problems. Paper V also provides the software for an alternative solution of the problem with the shared 2 × 2 blocks considered in Papers III-IV, by solving the problem already at step 1 of Bartels–Stewart's method by means of computing an ordered (generalized) real Schur form that does not have any shared 2×2 blocks in the parallel data distribution. However, we do not expect this to have any obvious effect on the parallel performance


of the parallel Sylvester solvers since its complexity, although still negligible compared to the total cost of solving the actual Sylvester equations, is higher than the cost of performing the implicit redistribution.
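The key point of the delayed updates is that, once a window has been reordered locally by an accumulated orthogonal transformation U, the off-window parts of the Schur factor and the global transformation matrix are updated by plain GEMM operations. A schematic sketch (window position, U and the function name are ours; the real algorithm works on distributed submatrices):

```python
import numpy as np

def apply_window_update(T, Q, lo, hi, U):
    """After rows/cols lo:hi of the Schur factor T have been reordered
    locally by the orthogonal window transformation U, apply the delayed
    off-window updates as three level 3 BLAS (GEMM) operations."""
    T[lo:hi, hi:] = U.T @ T[lo:hi, hi:]   # block row to the right of window
    T[:lo, lo:hi] = T[:lo, lo:hi] @ U     # block column above the window
    Q[:, lo:hi] = Q[:, lo:hi] @ U         # accumulate into the global Q

# Check against the full similarity transformation G = diag(I, U, I).
rng = np.random.default_rng(4)
T = np.triu(rng.standard_normal((8, 8)))
Q = np.eye(8)
lo, hi = 2, 5
U, _ = np.linalg.qr(rng.standard_normal((3, 3)))
G = np.eye(8); G[lo:hi, lo:hi] = U
T2, Q2 = T.copy(), Q.copy()
T2[lo:hi, lo:hi] = U.T @ T2[lo:hi, lo:hi] @ U   # the local window reordering
apply_window_update(T2, Q2, lo, hi, U)
assert np.allclose(T2, G.T @ T @ G) and np.allclose(Q2, Q @ G)
```

Delaying these three updates until the window is fully reordered is what turns many small rotations into a few large matrix multiplies, and with several concurrent windows the updates of disjoint regions can proceed in parallel.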


Chapter 3

Ongoing and future work

In this chapter, we discuss how the algorithms and software presented in the Thesis can be used and further developed to solve new challenging and open problems. Work in these directions is already in progress in our research team.

3.1 Software tools for periodic eigenvalue problems

The reordering methods from Papers I-II can be extended to consider product eigenvalue problems in their most general form:

    A_p^{s_p} A_{p−1}^{s_{p−1}} ··· A_1^{s_1},    (3.1)

with exponents s_1, ..., s_p ∈ {1, −1}. The dimensions of A_1, ..., A_p should match, i.e.,

    A_k ∈ R^{n_{k+1} × n_k}  if s_k = 1,
    A_k ∈ R^{n_k × n_{k+1}}  if s_k = −1,

where n_{p+1} = n_1. Additionally, the condition

    ∑_{k: s_k = 1} n_k + ∑_{k: s_k = −1} n_{k+1}  =  ∑_{k: s_k = −1} n_k + ∑_{k: s_k = 1} n_{k+1},    (3.2)

with both sides summing over k = 1, ..., p, ensures that the corresponding lifted block cyclic pencil is square (see, e.g., [43] for details) and that the eigenvalues (and eigenvectors) are properly defined. The level of generality imposed by (3.1) allows covering other applications (see, e.g., [13]) that do not necessarily lead to a matrix product of the somewhat simpler form (1.13). It is worth mentioning that this admits matrices A_k corresponding to an index s_k = −1 that are singular or even rectangular, in which case the matrix product (3.1) should only be understood in a formal sense: formally, if A ∈ R^{m×n}, then A^{−1} ∈ R^{n×m}. Notice that this is consistent with the definition of the pseudoinverse A^+, which can actually be computed via an SVD.


In this context, we rely on a different definition of the periodic Schur decomposition which allows for handling dimension-induced zero and infinite eigenvalues. The periodic Sylvester-like matrix equation associated with eigenvalue reordering and condition estimation of eigenvalues and periodic eigenspaces now takes the form

    A11^(k) X_k − X_{k+1} A22^(k) = −A12^(k),  for s_k = 1,
    A11^(k) X_{k+1} − X_k A22^(k) = −A12^(k),  for s_k = −1,    (3.3)

and can be addressed by extending and refining the methods presented in Papers I-II. We refer to [52, 53] for details.

3.2 Frobenius norm-based parallel estimators

The current release of the SCASY library (see the library homepage [99]) includes parallel condition estimators based on the LAPACK-style matrix norm estimation technique developed by Hager and Higham [55, 59], which is based on the matrix 1-norm and was successfully applied to SYCT in [73]. Kågström and Westin introduced a corresponding matrix norm estimation technique in [76] that is based on the Frobenius norm, which is needed in, e.g., computing reciprocal condition number estimates for selected clusters of eigenvalues in the generalized real Schur form [75, 74]. We plan to include such Frobenius norm-based techniques in the next release of SCASY. This would also make the current implementation of the parallel eigenvalue reordering routines from Paper V complete in comparison to the functionality offered by the corresponding LAPACK routines [5].
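For reference, the 1-norm estimation idea of Hager underlying these condition estimators can be sketched as follows: it produces a lower bound on ||A||_1 using only a handful of products with A and A^T, so A may be an implicitly defined operator such as the inverse of a Sylvester operator (function and variable names are ours):

```python
import numpy as np

def hager_estimate(apply_A, apply_AT, n, itmax=5):
    """Hager's estimator: a lower bound on the 1-norm of an n x n operator
    given only products v -> A v and v -> A^T v."""
    x = np.full(n, 1.0 / n)                 # start from the uniform vector
    est = 0.0
    for _ in range(itmax):
        y = apply_A(x)
        est = max(est, np.abs(y).sum())     # ||A x||_1 with ||x||_1 = 1
        z = apply_AT(np.sign(y))            # (sub)gradient direction
        j = int(np.argmax(np.abs(z)))
        if np.abs(z[j]) <= z @ x:           # no ascent possible: converged
            break
        x = np.zeros(n)
        x[j] = 1.0                          # jump to the most promising e_j
    return est

rng = np.random.default_rng(5)
Amat = rng.standard_normal((30, 30))
est = hager_estimate(lambda v: Amat @ v, lambda v: Amat.T @ v, 30)
assert est <= np.linalg.norm(Amat, 1) + 1e-10   # always a lower bound
```

LAPACK's xLACON implements a refined variant of this iteration due to Higham, and it is this reverse-communication interface style that the (Sca)LAPACK-like estimators build on.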

3.3 Parallel periodic Sylvester solvers

The algorithms and techniques in SCASY can be extended to also cover K-periodic matrix equations (see, e.g., [49] for a summary of various periodic Sylvester-type matrix equations). For example, assume that the involved K-periodic matrices Ak, Bk and Ck in the PSE (2.1) are distributed over the process mesh such that they are aligned internally, and that Ak and Bk are in periodic Schur form with quasi-triangular factors Ar and Bs. Then the solution sequence Xk can be obtained by computing the implicit distribution from Ar and Bs, applying it to all of Ak, Bk, and Ck, and traversing the block diagonals of the Ck sequence just as in the non-periodic case. Node subsystems can be solved by utilizing the periodic Sylvester solvers from Papers I-II or from the upcoming PEP library [52, 95]. Right-hand side updates can be performed as before using level 3 BLAS operations (now in a periodic sense). See [6] for some preliminary


work in this direction. Of course, such parallel periodic matrix equation solvers would in the unreduced case rely on the existence of a parallel distributed memory implementation of the periodic Schur decomposition. To the best of our knowledge, no such parallel algorithm exists today. Moreover, most underlying physical systems studied so far have very small dimensions, while very long periods are much more common (see, e.g., Example 2 in Paper I). For that reason, a distributed memory implementation along the lines of the current de facto standards ScaLAPACK and PSLICOT would have little practical relevance. Very long sequences of periodic matrices with small dimensions instead call for shared memory implementations that can repeat large numbers of small (and sometimes simple) operations of relatively low complexity in a scalable, high-throughput way. Nevertheless, when discretizations of 3D models of periodic behavior are considered, the need for parallel periodic Schur decomposition algorithms and Sylvester solvers may well arise.

3.4 New parallel versions of real Schur decomposition algorithms

As demonstrated in Papers III-V, the performance bottleneck in many parallel Schur-based algorithms is the currently slow parallel Schur decomposition from ScaLAPACK. We plan to contribute to the development of new parallel implementations of the QR and QZ algorithms based on the work presented in [30, 1, 2, 3]. For example, the aggressive early deflation techniques developed and presented in [26, 70] can be implemented in parallel using the parallel eigenvalue reordering techniques presented in Paper V. Moreover, high serial performance should be attainable by using the same delay-and-accumulate technique for the local orthogonal transformations as was used in the parallel reordering algorithms (see also [25]). A high level of concurrency and scalability can be realized by chasing long and well-separated chains of small to medium-sized bulges (see, e.g., [58]) over the diagonal blocks of the corresponding Hessenberg matrix, similarly to working with separated selected subclusters in the parallel reordering algorithms in Paper V.

3.5 Parallel periodic eigenvalue reordering

A straightforward extension of the parallel reordering algorithm is to include periodic eigenvalue reordering of a sequence of matrices in periodic (generalized) real Schur form, much like the ideas for developing parallel periodic Sylvester solvers outlined above. We only have to require that the matrices are properly aligned and distributed in the same


way over the process mesh to be able to follow the outline of the parallel eigenvalue reordering algorithms, substantially guided by the quasi-triangular factor. As in the case of the parallel periodic Sylvester solvers, fully parallel computation of periodic eigenspaces associated with selected spectra of a square matrix product would also require a parallel implementation of the periodic QR and QZ algorithms.
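As background (a hedged NumPy sketch of the serial core operation behind eigenvalue reordering, in the spirit of the direct swapping methods [9], not the parallel or periodic algorithms; the function name swap_1x1_blocks is our own), exchanging two adjacent 1 x 1 blocks in a real Schur form amounts to solving a tiny Sylvester equation and applying the resulting orthogonal transformation:

```python
import numpy as np

def swap_1x1_blocks(T):
    """Swap the two eigenvalues of a 2 x 2 upper triangular T by an
    orthogonal similarity: solve the scalar Sylvester equation
    t11*x - x*t22 = t12, then take Q with first column along the
    eigenvector (-x, 1) of T belonging to t22 (distinct eigenvalues
    assumed)."""
    t11, t12, t22 = T[0, 0], T[0, 1], T[1, 1]
    x = t12 / (t11 - t22)
    v = np.array([-x, 1.0])
    v /= np.linalg.norm(v)
    Q = np.column_stack([v, [-v[1], v[0]]])   # orthogonal, det(Q) = 1
    return Q, Q.T @ T @ Q                     # diagonal of result: t22, t11

T = np.array([[2.0, 5.0], [0.0, 7.0]])
Q, Ts = swap_1x1_blocks(T)
```

The reordering algorithms of Papers I-II and V chain many such orthogonal swaps (of 1 x 1 and 2 x 2 blocks), which is what keeps the computation backward stable.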


Chapter 4

Related papers and project websites

The following five peer-reviewed conference publications (Papers I-II and IV-VI) and my Licentiate Thesis (Paper III) are related to the contents of this Thesis.

I. R. Granat and B. Kågström. Evaluating Parallel Algorithms for Solving Sylvester-Type Matrix Equations: Direct Transformation-Based versus Iterative Matrix-Sign-Function-Based Methods. In: PARA 2004, J. Dongarra et al., Eds. Lecture Notes in Computer Science (LNCS), Vol. 3732, Springer Verlag, pp. 719-729, 2005.

II. R. Granat, I. Jonsson, and B. Kågström. Combining Explicit and Recursive Blocking for Solving Triangular Sylvester-Type Matrix Equations in Distributed Memory Platforms. In: Euro-Par 2004, M. Danelutto et al., Eds. Lecture Notes in Computer Science (LNCS), Vol. 3149, Springer Verlag, pp. 742-750, 2004.

III. R. Granat. Contributions to Parallel Algorithms for Sylvester-type Matrix Equations and Periodic Eigenvalue Reordering in Cyclic Matrix Products. Licentiate Thesis, Technical Report UMINF-05.18, ISSN 0348-0542, ISBN 91-7305-903-X, Department of Computing Science, Umeå University, May 2005.

IV. R. Granat, B. Kågström, and D. Kressner. Reordering the Eigenvalues of a Periodic Matrix Pair with Applications in Control. In: Proc. of 2006 IEEE Conference on Computer Aided Control Systems Design (CACSD), pp. 25-30, ISBN 0-7803-9797-5, 2006.

V. R. Granat, I. Jonsson, and B. Kågström. Recursive Blocked Algorithms for Solving Periodic Triangular Sylvester-type Matrix Equations. In: Proc. of PARA'06: State-of-the-art in Scientific and Parallel Computing, B. Kågström et al., Eds. Lecture Notes in Computer Science (LNCS), Vol. 4699, Springer Verlag, pp. 531-539, 2007.


VI. R. Granat, B. Kågström, and D. Kressner. MATLAB Tools for Solving Periodic Eigenvalue Problems. In: Proc. of the 3rd IFAC Workshop PSYCO'07, Saint Petersburg, Russia, 2007.

Also, the following project websites are related to the contents of this Thesis.

I. The SCASY homepage [99] is related to the public release and maintenance of the parallel algorithms and software presented in Papers III-IV in this Thesis. Currently, the SCASY library is available for download as a β-version (beta0.10).

II. The PEP project [95] has the goal to provide a complete package of efficient and reliable software for computing eigenvalues and eigenspaces of matrix products of varying dimensions and signatures (the order of the ±1 exponents), including the special cases treated in Papers I-II of this Thesis. A prospectus of this package, including related MATLAB tools, was published as Paper VI above. The final results will be presented in [53].

III. OpenFGG - Open Fortran Gateway Generator [92] was initiated as a project running in parallel with the LAPACK-style Fortran software development connected to Paper I. Writing numerical software is hard without proper debugging tools, and using the MATLAB environment to test-drive the developed software simplifies and speeds up implementation, maintenance and debugging. Fortran routines are accessed from MATLAB using MEX interfaces, which provide a so-called gateway to the routine. To construct a gateway, the user must provide a MEX-interface file together with the Fortran source code to the MATLAB MEX compiler, which in turn constructs the gateway. In this context, OpenFGG can simplify the gateway construction by generating it automatically from the Fortran source and a few hints from the user passed through OpenFGG's Java-based GUI. The tool was designed and implemented in cooperation with three MSc students: Magnus Andersson, Johan Sejdhage and Joakim Hjertstedt. A User's Guide and some documentation and examples can be found on the OpenFGG website (see above). The latest version, OpenFGG 1.5, also supports a subset of the constructs in Fortran 90/95. The OpenFGG website has also been posted on MATLAB Central, a worldwide open exchange forum for the MATLAB and Simulink user community.


References

[1] B. Adlerborn, K. Dackland, and B. Kågström. Parallel Two-Stage Reduction of a Regular Matrix Pair to Hessenberg-Triangular Form. In T. Sørvik et al., editor, Applied Parallel Computing: New Paradigms for HPC Industry and Academia, volume 1947 of Lecture Notes in Computer Science, pages 92-102. Springer-Verlag, 2001.
[2] B. Adlerborn, K. Dackland, and B. Kågström. Parallel and blocked algorithms for reduction of a regular matrix pair to Hessenberg-triangular and generalized Schur forms. In J. Fagerholm et al., editor, Applied Parallel Computing - PARA 2002, volume 2367 of Lecture Notes in Computer Science, pages 319-328. Springer-Verlag, 2002.
[3] B. Adlerborn, D. Kressner, and B. Kågström. Parallel Variants of the Multishift QZ Algorithm with Advanced Deflation Techniques. In B. Kågström et al., editor, Applied Parallel Computing - State of the Art in Scientific Computing (PARA'06), volume 4699 of Lecture Notes in Computer Science, pages 117-126. Springer, 2007.
[4] G. S. Ammar, P. Benner, and V. Mehrmann. A multishift algorithm for the numerical solution of algebraic Riccati equations. Electr. Trans. Num. Anal., 1:33-48, 1993.
[5] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. W. Demmel, J. J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. C. Sorensen. LAPACK Users' Guide. SIAM, Philadelphia, PA, third edition, 1999.
[6] P. Andersson. Parallella algoritmer för periodiska ekvationer av Sylvester-typ [Parallel algorithms for periodic equations of Sylvester type]. Master's thesis, Department of Computing Science, Umeå University, 2007. In Swedish (to appear).
[7] K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick. The Landscape of Parallel Computing Research: A View from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec 2006.
[8] ATLAS - Automatically Tuned Linear Algebra Software. See http://math-atlas.sourceforge.net/.
[9] Z. Bai and J. W. Demmel. On swapping diagonal blocks in real Schur form. Linear Algebra Appl., 186:73-95, 1993.
[10] Z. Bai, J. W. Demmel, and A. McKenney. On computing condition numbers for the nonsymmetric eigenproblem. ACM Trans. Math. Software, 19(2):202-223, 1993.
[11] R. H. Bartels and G. W. Stewart. Algorithm 432: Solution of the Matrix Equation AX + XB = C. Communications of the ACM, 15(9):820-826, 1972.
[12] P. Benner and R. Byers. Evaluating products of matrix pencils and collapsing matrix products. Numerical Linear Algebra with Applications, 8:357-380, 2001.
[13] P. Benner, R. Byers, V. Mehrmann, and H. Xu. Numerical computation of deflating subspaces of skew-Hamiltonian/Hamiltonian pencils. SIAM J. Matrix Anal. Appl., 24(1), 2002.
[14] P. Benner, V. Mehrmann, and H. Xu. Perturbation analysis for the eigenvalue problem of a formal product of matrices. BIT, 42(1):1-43, 2002.
[15] P. Benner and E. S. Quintana-Ortí. Solving Stable Generalized Lyapunov Equations with the Matrix Sign Function. Numerical Algorithms, 20(1):75-100, 1999.
[16] P. Benner, E. S. Quintana-Ortí, and G. Quintana-Ortí. Numerical Solution of Discrete Stable Linear Matrix Equations on Multicomputers. Parallel Algorithms and Applications, 17(1):127-146, 2002.
[17] P. Benner, E. S. Quintana-Ortí, and G. Quintana-Ortí. Solving Stable Sylvester Equations via Rational Iterative Schemes. Preprint SFB393/04-08, TU Chemnitz, 2004.
[18] L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. W. Demmel, I. Dhillon, J. J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. ScaLAPACK Users' Guide. SIAM, Philadelphia, PA, 1997.
[19] BLACS - Basic Linear Algebra Communication Subprograms. See http://www.netlib.org/blacs/index.html.

[20] BLAS - Basic Linear Algebra Subprograms. See http://www.netlib.org/blas/index.html.
[21] BlueGene/L - Advanced Simulation and Computing. See https://asc.llnl.gov/computing_resources/bluegenel/configuration.html.
[22] A. Bojanczyk, G. H. Golub, and P. Van Dooren. The periodic Schur decomposition: algorithm and applications. In Proc. SPIE Conference, volume 1770, pages 31-42, 1992.
[23] A. Bojanczyk and P. Van Dooren. Reordering diagonal blocks in the real Schur form. In NATO ASI on Linear Algebra for Large Scale and Real-Time Applications, pages 351-356, 1993.
[24] Biochemical and Biophysical Kinetics in Freshwater Lakes. See http://www.pitb.de/biophys/bp19/.
[25] K. Braman, R. Byers, and R. Mathias. The multishift QR algorithm, I: Maintaining well-focused shifts and level 3 performance. SIAM J. Matrix Anal. Appl., 23(4):929-947, 2002.
[26] K. Braman, R. Byers, and R. Mathias. The multishift QR algorithm, II: Aggressive early deflation. SIAM J. Matrix Anal. Appl., 23(4):948-973, 2002.
[27] R. Bru, R. Cantó, and B. Ricarte. Modelling nitrogen dynamics in citrus trees. Mathematical and Computer Modelling, 38:975-987, 2003.
[28] A. Buttari, J. Dongarra, J. Kurzak, J. Langou, P. Luszczek, and S. Tomov. The Impact of Multicore on Math Software. In B. Kågström et al., editor, Applied Parallel Computing - State of the Art in Scientific Computing (PARA'06), volume 4699 of Lecture Notes in Computer Science, pages 1-10. Springer, 2007.
[29] Condor - High throughput computing. See http://www.cs.wisc.edu/condor/.
[30] K. Dackland and B. Kågström. Blocked Algorithms and Software for Reduction of a Regular Matrix Pair to Generalized Schur Form. ACM Trans. Math. Software, 25(4):425-454, 1999.
[31] J. Demmel, J. Dongarra, B. Parlett, W. Kahan, M. Gu, D. Bindel, Y. Hida, X. Li, O. Marques, J. Riedy, C. Vömel, J. Langou, P. Luszczek, J. Kurzak, A. Buttari, J. Langou, and S. Tomov. Prospectus for the Next LAPACK and ScaLAPACK Libraries. In B. Kågström et al., editor, Applied Parallel Computing - State of the Art in Scientific Computing (PARA'06), volume 4699 of Lecture Notes in Computer Science, pages 11-23. Springer, 2007.
[32] J. Demmel and B. Kågström. The Generalized Schur Decomposition of an Arbitrary Pencil A − λB: Robust Software with Error Bounds and Applications. Part I: Theory and Algorithms. ACM Trans. Math. Software, 19(2):160-174, June 1993.
[33] J. Demmel and B. Kågström. The Generalized Schur Decomposition of an Arbitrary Pencil A − λB: Robust Software with Error Bounds and Applications. Part II: Software and Applications. ACM Trans. Math. Software, 19(2):175-201, June 1993.
[34] J. W. Demmel and J. J. Dongarra. LAPACK 2005 prospectus: Reliable and scalable software for linear algebra computations on high end computers. LAPACK Working Note 164, University of California, Berkeley and University of Tennessee, Knoxville, 2005.
[35] J. J. Dongarra, S. Hammarling, and J. H. Wilkinson. Numerical considerations in computing invariant subspaces. SIAM J. Matrix Anal. Appl., 13(1):145-161, 1992.
[36] G. E. Dullerud and F. Paganini. A Course in Robust Control Theory - A Convex Approach. Springer-Verlag, New York, 2000.
[37] A. Edelman, E. Elmroth, and B. Kågström. A Geometric Approach to Perturbation Theory of Matrices and Matrix Pencils. Part I: Versal Deformations. SIAM J. Matrix Anal. Appl., 18(3):653-692, 1997. (Awarded the SIAM Linear Algebra Prize 2000 for the most outstanding paper published during 1997-99.)
[38] A. Edelman, E. Elmroth, and B. Kågström. A Geometric Approach to Perturbation Theory of Matrices and Matrix Pencils. Part II: A Stratification-Enhanced Staircase Algorithm. SIAM J. Matrix Anal. Appl., 20(3):667-699, 1999.
[39] B. Eliasson. Numerical Vlasov-Maxwell Modelling of Space Plasma. PhD thesis, Uppsala University, Department of Information Technology, Scientific Computing, 2004.
[40] E. Elmroth, F. Gustavson, I. Jonsson, and B. Kågström. Recursive Blocked Algorithms and Hybrid Data Structures for Dense Matrix Library Software. SIAM Review, 46(1):3-45, 2004.

[41] E. Elmroth, P. Johansson, B. Kågström, and D. Kressner. A web computing environment for the SLICOT library. In The Third NICONET Workshop on Numerical Control Software, pages 53-61, 2001.
[42] B. F. Farrell and P. J. Ioannou. Stochastic forcing of the linearized Navier-Stokes equation. Phys. Fluids, A(5):2600-2609, 1993.
[43] D. S. Flamm. A new shift-invariant representation for periodic linear systems. Syst. Control Lett., 17(1):9-14, 1991.
[44] J. G. F. Francis. The QR Transformation, Parts I and II. Computer Journal, 4:265-271, 332-345, 1961, 1962.
[45] G. H. Golub, S. Nash, and C. F. Van Loan. A Hessenberg-Schur method for the problem AX + XB = C. IEEE Trans. Automat. Control, 24(6):909-913, 1979.
[46] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, MD, third edition, 1996.
[47] GOTO-BLAS - High-Performance BLAS by Kazushige Goto. See http://www.cs.utexas.edu/users/flame/goto/.
[48] A. Grama, A. Gupta, G. Karypis, and V. Kumar. Introduction to Parallel Computing, Second Edition. Addison-Wesley, 2003.
[49] R. Granat, I. Jonsson, and B. Kågström. Recursive Blocked Algorithms for Solving Periodic Triangular Sylvester-type Matrix Equations. In B. Kågström et al., editor, Applied Parallel Computing - State of the Art in Scientific Computing (PARA'06), volume 4699 of Lecture Notes in Computer Science, pages 531-539. Springer, 2007.
[50] R. Granat and B. Kågström. Evaluating Parallel Algorithms for Solving Sylvester-Type Matrix Equations: Direct Transformation-Based versus Iterative Matrix-Sign-Function-Based Methods. In J. Dongarra et al., editor, PARA'04 State-of-the-Art in Scientific Computing, volume 3732 of Lecture Notes in Computer Science, pages 719-729. Springer Verlag, 2005.
[51] R. Granat and B. Kågström. Parallel Algorithms and Condition Estimators for Standard and Generalized Triangular Sylvester-Type Matrix Equations. In B. Kågström et al., editor, Applied Parallel Computing - State of the Art in Scientific Computing (PARA'06), volume 4699 of Lecture Notes in Computer Science, pages 127-136. Springer, 2007.

[52] R. Granat, B. Kågström, and D. Kressner. MATLAB Tools for Solving Periodic Eigenvalue Problems. In Proc. of the 3rd IFAC Workshop PSYCO'07, Saint Petersburg, Russia, 2007.
[53] R. Granat, B. Kågström, and D. Kressner. Algorithms and Software Tools for Product and Periodic Eigenvalue Problems. 2007. In preparation.
[54] J. A. Gunnels, F. G. Gustavson, G. M. Henry, and R. A. van de Geijn. FLAME: Formal Linear Algebra Methods Environment. ACM Trans. Math. Softw., 27(4):422-455, 2001.
[55] W. W. Hager. Condition estimates. SIAM J. Sci. Statist. Comput., 5(3):311-316, 1984.
[56] S. J. Hammarling. Numerical Solution of the Stable, Non-negative Definite Lyapunov Equation. IMA Journal of Numerical Analysis, 2:303-323, 1982.
[57] J. J. Hench and A. J. Laub. Numerical solution of the discrete-time periodic Riccati equation. IEEE Trans. Automat. Control, 39(6):1197-1210, 1994.
[58] G. Henry, D. S. Watkins, and J. J. Dongarra. A parallel implementation of the nonsymmetric QR algorithm for distributed memory architectures. SIAM J. Sci. Comput., 24(1):284-311, 2002.
[59] N. J. Higham. Fortran codes for estimating the one-norm of a real or complex matrix, with applications to condition estimation. ACM Trans. Math. Software, 14(4):381-396, 1988.
[60] N. J. Higham. Perturbation theory and backward error for AX − XB = C. BIT, 33(1):124-136, 1993.
[61] N. J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM, Philadelphia, PA, second edition, 2002.
[62] The ICPS Group - Scientific Parallel Computing and Imaging. See http://icps.u-strasbg.fr/.
[63] P. Johansson. Software Tools for Matrix Canonical Computations and Web-Based Software Library Environments. PhD Thesis UMINF-06.30, Department of Computing Science, Umeå University, SE-901 87 Umeå, Sweden, June 2006.
[64] S. Johansson. Stratification of Matrix Pencils in Systems and Control: Theory and Algorithms. Licentiate Thesis, Report UMINF-05.17, Department of Computing Science, Umeå University, SE-901 87 Umeå, Sweden, 2005.

[65] S. Johansson, B. Kågström, A. Shiriaev, and A. Varga. Comparing One-shot and Multi-shot Methods for Solving Periodic Riccati Equations. In Proc. of the 3rd IFAC Workshop PSYCO'07, Saint Petersburg, Russia, 2007.
[66] I. Jonsson and B. Kågström. Recursive blocked algorithms for solving triangular systems. I. One-sided and coupled Sylvester-type matrix equations. ACM Trans. Math. Software, 28(4):392-415, 2002.
[67] I. Jonsson and B. Kågström. Recursive blocked algorithms for solving triangular systems. II. Two-sided and generalized Sylvester and Lyapunov matrix equations. ACM Trans. Math. Software, 28(4):416-435, 2002.
[68] I. Jonsson and B. Kågström. RECSY - A High Performance Library for Solving Sylvester-Type Matrix Equations. In H. Kosch et al., editor, Euro-Par 2003 Parallel Processing, volume 2790 of Lecture Notes in Computer Science, pages 810-819. Springer-Verlag, 2003.
[69] B. Kågström. A direct method for reordering eigenvalues in the generalized real Schur form of a regular matrix pair (A, B). In Linear Algebra for Large Scale and Real-Time Applications (Leuven, 1992), volume 232 of NATO Adv. Sci. Inst. Ser. E Appl. Sci., pages 195-218. Kluwer Acad. Publ., Dordrecht, 1993.
[70] B. Kågström and D. Kressner. Multishift Variants of the QZ Algorithm with Aggressive Early Deflation. SIAM J. Matrix Anal. Appl., 29(1):199-227, 2006.
[71] B. Kågström, P. Ling, and C. Van Loan. GEMM-Based Level 3 BLAS: High-Performance Model Implementations and Performance Evaluation Benchmark. ACM Trans. Math. Software, 24(3):268-302, 1998.
[72] B. Kågström, P. Ling, and C. Van Loan. Algorithm 784: GEMM-Based Level 3 BLAS: Portability and Optimization Issues. ACM Trans. Math. Software, 24(3):303-316, 1998.
[73] B. Kågström and P. Poromaa. Distributed and shared memory block algorithms for the triangular Sylvester equation with sep⁻¹ estimators. SIAM J. Matrix Anal. Appl., 13(1):90-101, 1992.
[74] B. Kågström and P. Poromaa. Computing eigenspaces with specified eigenvalues of a regular matrix pair (A, B) and condition estimation: theory, algorithms and software. Numer. Algorithms, 12(3-4):369-407, 1996.
[75] B. Kågström and P. Poromaa. LAPACK-style algorithms and software for solving the generalized Sylvester equation and estimating the separation between regular matrix pairs. ACM Trans. Math. Software, 22(1):78-103, 1996.

[76] B. Kågström and L. Westin. Generalized Schur methods with condition estimators for solving the generalized Sylvester equation. IEEE Trans. Autom. Contr., 34(4):745-751, 1989.
[77] M. Konstantinov, D. Gu, V. Mehrmann, and P. Petkov. Perturbation Theory for Matrix Equations. Number 9 in Studies in Computational Mathematics. Elsevier, North Holland, 2003.
[78] D. Kressner. An efficient and reliable implementation of the periodic QZ algorithm. In IFAC Workshop on Periodic Control Systems, 2001.
[79] D. Kressner. Numerical Methods and Software for General and Structured Eigenvalue Problems. PhD thesis, TU Berlin, Institut für Mathematik, Berlin, Germany, 2004.
[80] D. Kressner. The periodic QR algorithm is a disguised QR algorithm. Linear Algebra Appl., 417(2-3):423-433, 2006.
[81] V. N. Kublanovskaya. On some algorithms for the solution of the complete eigenvalue problem. USSR Comp. Math. Phys., 3:637-657, 1961.
[82] LAPACK - Linear Algebra Package. See http://www.netlib.org/lapack/.
[83] D. C. Lay. Linear Algebra and its Applications, 2nd edition. Addison-Wesley, 1997.
[84] W.-W. Lin and J.-G. Sun. Perturbation analysis for the eigenproblem of periodic matrix pairs. Linear Algebra Appl., 337:157-187, 2001.
[85] Lecture series in Linear Algebra by Jörgen Löfström. University of Gothenburg, 1998-1999.
[86] The MathWorks, Inc., Cochituate Place, 24 Prime Park Way, Natick, Mass., 01760. MATLAB Version 6.5, 2002.
[87] V. Mehrmann. The Autonomous Linear Quadratic Control Problem: Theory and Numerical Solution. Number 163 in Lecture Notes in Control and Information Sciences. Springer-Verlag, Heidelberg, 1991.
[88] MPI - Message Passing Interface. See http://www-unix.mcs.anl.gov/mpi/.
[89] Multiprocessors. See http://www.csee.umbc.edu/~plusquel/611/slides/chap8_1.html.

[90] Niconet Task II: Model Reduction. See http://www.win.tue.nl/niconet/NIC2/NICtask2.html.
[91] N. S. Nise. Control Systems Engineering. Wiley, 2003. Fourth International Edition.
[92] OpenFGG - Open Fortran Gateway Generator. See http://www.cs.umu.se/research/parallel/openfgg.
[93] OpenMP - Simple, Portable, Scalable SMP Programming. See http://www.openmp.org/.
[94] PBLAS - Parallel Basic Linear Algebra Subprograms. See http://www.netlib.org/scalapack/html/pblas_qref.html.
[95] PEP - MATLAB tools for solving periodic eigenvalue problems. See http://www.cs.umu.se/research/nla/pep.
[96] P. Poromaa. Parallel Algorithms for Triangular Sylvester Equations: Design, Scheduling and Scalability Issues. In B. Kågström et al., editor, Applied Parallel Computing. Large Scale Scientific and Industrial Problems, volume 1541 of Lecture Notes in Computer Science, pages 438-446. Springer Verlag, 1998.
[97] POSIX threads programming. See http://www.llnl.gov/computing/tutorials/pthreads/.
[98] E. S. Quintana-Ortí and R. A. van de Geijn. Formal derivation of algorithms: The triangular Sylvester equation. ACM Trans. Math. Software, 29(2):218-243, June 2003.
[99] SCASY - ScaLAPACK-style solvers for Sylvester-type matrix equations. See http://www.cs.umu.se/~granat/scasy.html.
[100] Scientific Tools for Python. See http://www.scipy.org/.
[101] Shared-memory MIMD computers. See http://www.netlib.org/utk/papers/advanced-computers/sm-mimd.html.
[102] SLICOT Library in the Numerics in Control Network (Niconet). See http://www.win.tue.nl/niconet/index.html.
[103] J. Sreedhar and P. Van Dooren. Pole placement via the periodic Schur decomposition. In Proceedings Amer. Contr. Conf., pages 1563-1567, 1993.

[104] R. A. van de Geijn. Using PLAPACK: Parallel Linear Algebra Package. The MIT Press, 1997.
[105] A. Varga. Periodic Lyapunov equations: some applications and new algorithms. Internat. J. Control, 67(1):69-87, 1997.
[106] A. Varga. Balancing related methods for minimal realization of periodic systems. Systems Control Lett., 36(5):339-349, 1999.
[107] A. Varga. Robust and minimum norm pole assignment with periodic state feedback. IEEE Trans. Automat. Control, 45(5):1017-1022, 2000.
[108] A. Varga. On solving discrete-time periodic Riccati equations. In Proc. of 16th IFAC World Congress, Prague, Czech Republic, 2005.
[109] A. Varga. On solving periodic differential matrix equations with applications to periodic system norms computation. In Proc. of CDC'05, Seville, Spain, 2005.
[110] A. Varga and P. Van Dooren. Computational methods for periodic systems - an overview. In Proc. of IFAC Workshop on Periodic Control Systems, Como, Italy, pages 171-176, 2001.
[111] D. S. Watkins. Product Eigenvalue Problems. SIAM Review, 47:3-40, 2005.
[112] S. J. Wright. A collection of problems for which Gaussian elimination with partial pivoting is unstable. SIAM J. Sci. Comput., 14(1):231-238, 1993.
