Solving Stable Stein Equations on Distributed Memory Computers*

Peter Benner¹, Enrique S. Quintana-Ortí², and Gregorio Quintana-Ortí²

¹ Zentrum für Technomathematik, Fachbereich 3 – Mathematik und Informatik, Universität Bremen, D-28334 Bremen, Germany; phone: +49 421 2184819; fax: +49 421 2184863; e-mail: [email protected].
² Depto. de Informática, Universidad Jaime I, 12080 Castellón, Spain; phone: +34 964 728257; fax: +34 964 728435; e-mail: {quintana,gquintan}[email protected].

Abstract. We investigate the parallel performance of numerical algorithms for solving Stein equations as they appear for instance in discrete-time control problems. We assume that the coefficient matrix of the equation is stable with respect to the unit circle. The methods used here are the squared Smith iteration and the sign function method applied to the Lyapunov equation resulting from a Cayley transformation of the original equation. We report experimental results of these algorithms on distributed-memory multicomputers.

Topic 13: Numerical Algorithms for Linear and Nonlinear Algebra.

1 Introduction

We study the numerical solution of the Stein equation, often also referred to as discrete Lyapunov equation,

$$A X A^T - X + C = 0, \tag{1}$$

where $A, C \in \mathbb{R}^{n \times n}$, $C = C^T$, and $X \in \mathbb{R}^{n \times n}$ is the sought-after solution. It easily follows that if there exists a unique solution to (1), then this solution has to be symmetric. Stein equations play a fundamental role in linear control and filtering theory for discrete-time systems; see, e.g., [14, 11, 16]. Throughout this paper we assume (1) to be Schur stable, that is, if $\rho(A)$ denotes the spectral radius of a square matrix $A$, then we assume $\rho(A) < 1$. This is equivalent to the usual Schur stability of the matrix $A$, i.e., $\sigma(A) \subset \{z \in \mathbb{C} : |z| < 1\}$, where $\sigma(A)$ denotes the spectrum of $A$. It is well known that under this assumption, the Stein equation (1) has a unique solution; see, e.g., [13]. Stein equations with Schur stable coefficient matrix $A$ appear in many computational problems for linear control systems. For example, such an equation

* Partially supported by the DAAD programme Acciones Integradas Hispano-Alemanas. Enrique S. Quintana-Ortí and Gregorio Quintana-Ortí were also supported by the Spanish CICYT Project TIC96-1062-C03-03.

has to be solved in each step of Newton's method for discrete-time algebraic Riccati equations [9, 12]. Moreover, the controllability Gramian $W_c$ and observability Gramian $W_o$ of a discrete-time linear time-invariant (LTI) system of the form

$$x_{k+1} = A x_k + B u_k, \quad y_k = C x_k, \quad k = 0, 1, 2, \ldots, \quad x_0 = x^{(0)},$$

are given by the solutions of the corresponding Stein equations

$$A W_c A^T - W_c + B B^T = 0, \qquad A^T W_o A - W_o + C^T C = 0. \tag{2}$$

The Gramians of LTI systems play a fundamental role in many analysis and design problems for LTI systems, such as computing balanced, minimal, or partial realizations, the Hankel singular values and Hankel norm of the system, and model reduction. The need for parallel computing in this area can be seen from the fact that already for a system with state-space dimension $n = 1000$, the corresponding Stein equations represent a set of linear equations with one million unknowns. Moreover, in many applications, on-line solutions of these equations are required. Even for small state-space dimension $n$ this means that a system of linear equations with a few thousand unknowns has to be solved in a few milli- or even microseconds. We assume here that the coefficient matrices are dense. If sparsity of the matrices is to be exploited, other computational techniques have to be employed.

The standard methods for solving (1) are based on the Bartels-Stewart method; see [2, 16]. In a first stage of this algorithm the coefficient matrix $A$ is reduced to quasi-upper triangular form by the QR algorithm. Hence, in order to use these methods on parallel computers, it is necessary to have an efficient parallelization of the QR algorithm. However, several experimental studies report the difficulties in parallelizing the double implicit shifted QR algorithm on distributed-memory multiprocessors (see, e.g., [7, 18]). The algorithm presents a fine granularity which introduces a loss of performance due to communication start-up overhead (latency). Besides, traditional data layouts (column/row block scattered) lead to an unbalanced distribution of the computational load. A different approach relies on a block Hankel distribution, which improves the balancing of the computational load [7]. Attempts to increase the granularity by employing multishift techniques have been proposed recently in [8]. Nevertheless, the parallelism and scalability of these algorithms are still far from those of matrix multiplications, matrix factorizations, triangular linear system solvers, etc.; see, e.g., [5] and the references given therein.

For these reasons we use methods that are based purely on computational kernels which are easy to parallelize. These are the squared Smith iteration and the sign function method applied to the Lyapunov equation resulting from a Cayley transformation of (1). The employed algorithms are reviewed in Section 2. The algorithms considered here are implemented using the kernels in the BLACS, PBLAS, and ScaLAPACK libraries [5]. This ensures the portability of the codes across a wide variety of platforms for distributed memory computing.

The computational performance of the implemented algorithms with respect to accuracy, execution time, and scalability is reported in Section 3. Some final remarks are given in Section 4.
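To fix ideas before turning to the parallel algorithms: for small $n$, equation (1) can be solved directly by vectorization, since $\mathrm{vec}(A X A^T) = (A \otimes A)\,\mathrm{vec}(X)$ for the column-major vec operator; this also makes explicit the remark above that for $n = 1000$ the Stein equation is a linear system with $10^6$ unknowns. The following NumPy sketch is our own illustration of this textbook approach (the function name stein_dense is ours); at $O(n^6)$ cost it is no substitute for the solvers studied below, but it serves as a reference for checking them.

import numpy as np

def stein_dense(A, C):
    # Solve A X A^T - X + C = 0 via (I - A kron A) vec(X) = vec(C).
    # The system matrix is nonsingular whenever rho(A) < 1.
    n = A.shape[0]
    K = np.eye(n * n) - np.kron(A, A)
    x = np.linalg.solve(K, C.reshape(-1, order="F"))  # column-major vec(C)
    return x.reshape((n, n), order="F")

# usage sketch: scale a random matrix to spectral radius 0.9 and verify
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
A *= 0.9 / np.abs(np.linalg.eigvals(A)).max()
X = stein_dense(A, np.eye(6))
print(np.linalg.norm(A @ X @ A.T - X + np.eye(6)))  # residual near machine precision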

2 Numerical Algorithms

2.1 The Smith Iteration

We can rewrite equation (1) in fixed point form, $X = A X A^T + C$, and form the fixed point iteration

$$X_0 := C, \quad X_{k+1} = C + A X_k A^T, \quad k = 0, 1, 2, \ldots$$

This iteration converges to $X$ if and only if $\rho(A) < 1$, i.e., convergence is guaranteed under the given assumptions. The convergence rate of this iteration is linear. A quadratically convergent version of this fixed point iteration is suggested in [6, 17]. Setting $X_0 := C$, $A_0 := A$, this iteration can be written as

$$X_{k+1} := X_k + A_k X_k A_k^T, \quad A_{k+1} := A_k^2, \quad k = 0, 1, 2, \ldots \tag{3}$$

The above iteration is referred to as the squared Smith iteration. Indeed, one can verify that $X_k = \sum_{j=0}^{2^k - 1} A^j C (A^T)^j$, so each step doubles the number of accumulated terms of the series representation $X = \sum_{j=0}^{\infty} A^j C (A^T)^j$ of the solution, which explains the quadratic convergence. The most appealing feature of the squared Smith iteration regarding its parallel implementation is that all the computational cost comes from matrix products. These are known to be highly parallelizable; see, e.g., [5].

Remark 1. The convergence theory of the Smith iteration derived in [17] yields that for $\rho(A) < 1$ there exist real constants $M > 0$ and $0 < r < 1$ such that

$$\| X - X_k \|_2 \le M \| C \|_2 (1 - r)^{-1} r^{2^k}.$$

This shows that the method converges for all equations with a Schur stable coefficient matrix $A$. Nevertheless, if the coefficient matrix $A$ is highly nonnormal such that $\|A\|_2 > 1$, then overflow may occur in the early stages of the iteration due to increasing $\|A_k\|_2$, although eventually $\lim_{k \to \infty} A_k = 0$. Hence we apply the squared Smith iteration only if $\|A\|_F < 1$, as this ensures a sequence of $A_k$ with decreasing 2-norm.
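A minimal serial NumPy sketch of iteration (3) may help; the function name, stopping heuristic, and iteration cap are our choices for illustration and are not part of the Fortran 77 routine DGEDLSM discussed in Section 3.

import numpy as np

def smith_squared(A, C, tol=1e-15, maxit=30):
    # Squared Smith iteration (3) for A X A^T - X + C = 0 with rho(A) < 1.
    # Per the safeguard above, the paper applies this only when ||A||_F < 1.
    X, Ak = C.copy(), A.copy()
    for _ in range(maxit):
        X = X + Ak @ X @ Ak.T
        Ak = Ak @ Ak  # A_{k+1} = A_k^2 tends to zero quadratically
        # the next correction is bounded by ||A_k||_F^2 ||X||_F, so stop
        # once ||A_k||_F^2 drops below the (relative) tolerance
        if np.linalg.norm(Ak, "fro") ** 2 < tol:
            break
    return X

Each sweep costs two general matrix products for updating $X$ plus one for squaring $A_k$, which is why the parallel implementation reduces entirely to calls to the (P)BLAS matrix-multiply kernel.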

2.2 The Sign Function Method

The sign function method was first introduced in 1971 by Roberts [15] for solving algebraic Riccati equations. Roberts also shows how to solve Sylvester equations of the form

$$\hat{A} X - X \hat{B} + \hat{C} = 0, \quad \hat{A} \in \mathbb{R}^{n \times n}, \quad \hat{B} \in \mathbb{R}^{m \times m}, \quad \hat{C} \in \mathbb{R}^{n \times m}, \tag{4}$$

via the matrix sign function if both $\sigma(\hat{A})$ and $\sigma(-\hat{B})$ are contained in the open left half complex plane, i.e., if $\hat{A}$ and $-\hat{B}$ are Hurwitz stable.

Let $Z \in \mathbb{R}^{n \times n}$ have no eigenvalues on the imaginary axis and denote by

$$Z = S \begin{bmatrix} J^- & 0 \\ 0 & J^+ \end{bmatrix} S^{-1}$$

its Jordan decomposition, with $J^- \in \mathbb{C}^{k \times k}$, $J^+ \in \mathbb{C}^{(n-k) \times (n-k)}$ containing the Jordan blocks corresponding to the eigenvalues in the open left and right half planes, respectively. Then the matrix sign function of $Z$ is defined as

$$\mathrm{sign}(Z) := S \begin{bmatrix} -I_k & 0 \\ 0 & I_{n-k} \end{bmatrix} S^{-1}.$$

Note that $\mathrm{sign}(Z)$ is unique and independent of the order of the eigenvalues in the Jordan decomposition of $Z$ (see, e.g., [12, Section 22.1]). Many other equivalent definitions for $\mathrm{sign}(Z)$ can be given; see, e.g., the survey paper [10]. The sign function can be computed via the Newton iteration for the equation $Z^2 = I$ where the starting point is chosen as $Z$, i.e.,



$$Z_0 := Z, \quad Z_{k+1} := \left( Z_k + Z_k^{-1} \right) / 2, \quad k = 0, 1, 2, \ldots \tag{5}$$

It is shown in [15] that $\mathrm{sign}(Z) = \lim_{k \to \infty} Z_k$ and moreover that

$$\mathrm{sign}\left( \begin{bmatrix} \hat{A} & \hat{C} \\ 0 & \hat{B} \end{bmatrix} \right) + I_{n+m} = 2 \begin{bmatrix} 0 & X \\ 0 & I_m \end{bmatrix}, \tag{6}$$

i.e., under the given assumptions, (4) can be solved by applying the iteration (5) to $Z_0 := \begin{bmatrix} \hat{A} & \hat{C} \\ 0 & \hat{B} \end{bmatrix}$. The computation of the sign function via (5) only requires basic numerical linear algebra tools like matrix multiplication, inversion, and/or solving linear systems. These computations are implemented efficiently on most parallel architectures and, in particular, ScaLAPACK [5] provides easy to use and portable computational kernels for these operations. Hence, the sign function method is an appropriate tool to design and implement efficient and portable numerical software for distributed memory parallel computers.

If we apply the Cayley transformation $c(A) = (A - I_n)(A + I_n)^{-1}$ to $A$ from (1), then the Stein equation (1) is equivalent to a Lyapunov equation as in (4) with $\hat{A} = -\hat{B}^T = c(A)$ and $\hat{C} = 2 (A + I_n)^{-1} C (A + I_n)^{-T}$. Hence the Stein equation can be solved by applying the sign function iteration to the Lyapunov equation resulting from the Cayley transformation.

In [15] it is observed that, applying the Newton iteration (5) to the matrix $\begin{bmatrix} \hat{A} & \hat{C} \\ 0 & -\hat{A}^T \end{bmatrix}$ and exploiting the block-triangular structure of all matrices involved, (5) boils down to

$$A_0 := \hat{A}, \quad A_{k+1} := \tfrac{1}{2} \left( A_k + A_k^{-1} \right), \qquad C_0 := \hat{C}, \quad C_{k+1} := \tfrac{1}{2} \left( C_k + A_k^{-1} C_k A_k^{-T} \right), \quad k = 0, 1, 2, \ldots, \tag{7}$$

and hence from (6) it follows that $X = \frac{1}{2} \lim_{k \to \infty} C_k$. Other iterative schemes for computing the sign function like the Newton-Schulz iteration or Halley's method (see, e.g., [10]) can also be implemented efficiently to solve Lyapunov and hence Cayley-transformed Stein equations; details of the resulting algorithms will be reported in [4].
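As a serial illustration of the Cayley transformation followed by iteration (7), consider the NumPy sketch below; the name stein_sign, the convergence test, and the use of explicit inverses for brevity are our choices, while the paper's routine DGEDLSG is the Fortran 77/ScaLAPACK realization of the same scheme.

import numpy as np

def stein_sign(A, C, tol=1e-12, maxit=50):
    # Solve A X A^T - X + C = 0 via Cayley transformation + iteration (7).
    n = A.shape[0]
    I = np.eye(n)
    AIinv = np.linalg.inv(A + I)     # nonsingular since rho(A) < 1
    Ak = (A - I) @ AIinv             # A_hat = c(A), Hurwitz stable
    Ck = 2.0 * AIinv @ C @ AIinv.T   # C_hat = 2 (A+I)^{-1} C (A+I)^{-T}
    for _ in range(maxit):
        Akinv = np.linalg.inv(Ak)
        Ck = 0.5 * (Ck + Akinv @ Ck @ Akinv.T)
        Anew = 0.5 * (Ak + Akinv)
        done = np.linalg.norm(Anew - Ak, 1) < tol * np.linalg.norm(Anew, 1)
        Ak = Anew
        if done:
            break
    return 0.5 * Ck                  # X = (1/2) lim C_k

Since $\mathrm{sign}(\hat{A}) = -I_n$ here, the sequence $A_k$ converges to $-I_n$, which the relative-change test detects; each step requires one matrix inversion (or LU solve) and a few matrix products, operations for which ScaLAPACK provides scalable kernels.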

3 Performance Results

In this section we compare the accuracy and performance of several solvers for Stein equations of the form (1). All the experiments were performed on a cluster of 16 nodes, connected via a Myrinet cross-bar switch. Each node consists of an Intel Pentium-II processor at 300 MHz and 128 MBytes of RAM. The algorithms were coded in Fortran 77, using IEEE double-precision arithmetic ($\varepsilon \approx 2.2 \times 10^{-16}$), a tuned BLAS library for Intel Pentium-II processors, and the LAPACK 2.0 [1], BLACS 1.1, PBLAS 2.0, and ScaLAPACK 1.6 libraries [5].

3.1 Numerical accuracy

We first analyze the reliability of our Stein solvers by means of several numerical examples. Specifically, we compare the following Stein solvers:

– SB03PD. The Bartels-Stewart method for the Stein equation as implemented in the Subroutine Library in Control Theory – SLICOT¹ [3]. The method is numerically backward stable and hence gives a lower bound for the accuracy that any numerically reliable method should obtain.
– DGEDLSM. The squared Smith iteration for the Stein equation as given in (3).

Example 1. In this example the matrix $A$ has the following structure:

$$A = U^T \begin{bmatrix} A_{11} & A_{12} \\ 0 & A_{22} \end{bmatrix} U, \quad A_{11} = \begin{bmatrix} 1-\alpha & 0 & 0 & 0 \\ 0 & 1-\alpha & 0 & 0 \\ 0 & 0 & 1-\alpha & 0 \\ 0 & 0 & \alpha/2 & 1-\alpha \end{bmatrix}, \quad A_{22} = -A_{11}^T,$$

$A_{12}$ is a $4 \times 4$ random matrix (not necessarily symmetric), $U$ is a random orthogonal matrix, and $C = I_8$. As $\alpha$ approaches 0, the eigenvalues of $A$ get closer to the unit circle. We evaluate the accuracy of the solvers by means of the relative residual, $\| A \tilde{X} A^T - \tilde{X} + C \|_1 / \| C \|_1$, with $\tilde{X}$ the computed solution; see Table 1. The table shows that DGEDLSM always obtains better results than SB03PD.



α       SB03PD            DGEDLSM
10^-1   7.3763 · 10^-13   2.4869 · 10^-14
10^-2   3.3078 · 10^-12   6.0907 · 10^-13
10^-3   1.5827 · 10^-10   4.6569 · 10^-11
10^-4   3.9989 · 10^-8    6.6894 · 10^-9
10^-5   5.7602 · 10^-6    6.6383 · 10^-8
10^-6   2.8823 · 10^-4    1.0390 · 10^-4

Table 1. Relative residuals for Example 1.

¹ Available by anonymous ftp at ftp://wgs.esat.kuleuven.ac.be/pub/WGS/SLICOT.

Example 2. We generated Stein equations of order $n$ from 100 to 1000. A stable matrix $A$ was generated by dividing a matrix with random entries uniformly distributed in $[0, 1]$ by the 1-norm of that matrix. The solution matrix was then generated to be symmetric and positive semidefinite as $X = G^T G$, with $G$ a random uniform matrix. Finally, $C$ was set to $C := X - A X A^T$. The relative errors, $\| X - \tilde{X} \|_1 / \| X \|_1$, were similar for solvers SB03PD and DGEDLSM. For the largest problem sizes, the relative errors of solver DGEDLSM were typically two digits smaller than those of solver SB03PD.
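The construction of Example 2 and the two accuracy metrics can be reproduced with a few lines of NumPy. This sketch reuses the hypothetical smith_squared function from our sketch in Section 2.1; the seed and problem size are arbitrary. Note that for these nonnegative matrices $\|A\|_1 = 1$ keeps the powers $A_k$ bounded even when $\|A\|_F$ exceeds 1.

import numpy as np

rng = np.random.default_rng(1)
n = 200
M = rng.uniform(0.0, 1.0, (n, n))
A = M / np.linalg.norm(M, 1)   # ||A||_1 = 1 and rho(A) < 1 (Schur stable)
G = rng.uniform(0.0, 1.0, (n, n))
X = G.T @ G                    # symmetric positive semidefinite solution
C = X - A @ X @ A.T            # then A X A^T - X + C = 0 by construction

Xt = smith_squared(A, C)       # solver sketch from Section 2.1
res = np.linalg.norm(A @ Xt @ A.T - Xt + C, 1) / np.linalg.norm(C, 1)
err = np.linalg.norm(X - Xt, 1) / np.linalg.norm(X, 1)
print(f"relative residual {res:.2e}, relative error {err:.2e}")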

3.2 Serial and parallel performance

In this subsection we investigate the performance of the Stein equation solvers SB03PD and DGEDLSM. We include in the comparison a third solver:

– DGEDLSG. The Cayley transformation applied to (1), followed by the Newton iteration for the matrix sign function as given in (7).

Following the naming convention in ScaLAPACK, we use the prefix "P" for the parallel versions of these routines. The matrices in the following experiments were generated as described in Example 2. Although the execution time of the iterative solvers (routines DGEDLSM and DGEDLSG) depends on the number of iterations necessary for convergence, we always perform a fixed amount of 10 iterations. Our experiments showed that in practice, 8–10 iterations are enough for convergence.

In our first experiment, we evaluate the performance of the serial solvers based on the Bartels-Stewart method, SB03PD, the Smith iteration, DGEDLSM, and the Newton iteration for the matrix sign function, DGEDLSG. Figure 1 reports the execution time of these solvers for Stein equations of size $n$ varying from 100 to 1000. In the left-hand side plot we show the global execution time of the solvers. In the right-hand side plot we take the execution time of SB03PD as the unit time and report how much "faster" the iterative solvers are.

Fig. 1. Execution times of the serial Stein equation solvers (left: execution time versus problem size n; right: execution time with SB03PD as the unit). Legend: "· · + · ·" = SB03PD, "– · – · –" = DGEDLSG, and "– – – –" = DGEDLSM. [Plots omitted.]

The plots in the figure show that the execution time of the iterative solvers DGEDLSM and DGEDLSG (with 10 iterations) is smaller than that of solver SB03PD. Routine DGEDLSM consistently requires only 60–65% of the execution time of SB03PD. Routine DGEDLSG requires (except for the smaller problem sizes) around 90–95% of the execution time of SB03PD.


Our next experiment is designed to analyze the scalability and performance of the parallel solvers PDGEDLSM and PDGEDLSG. No parallel implementation of the Bartels-Stewart algorithm is included, as that would require a parallel kernel for solving Stein equations with $A$ reduced to real Schur form, which is not available in the current version of ScaLAPACK (version 1.6). Figure 2 reports the MFlop rate (millions of floating-point arithmetic operations per second) of the serial algorithms and the MFlop rate per node of the parallel implementations. We evaluate the parallel algorithms on $p$ = 1, 4, 9, and 16 nodes, and we set $n$ so that $n / \sqrt{p}$ is constant and equal to 1000. Our parallel algorithms were evaluated using several distribution block sizes (32 was the optimal one in our experiments) and both square and rectangular logical topologies. For both routines the best results were obtained for the square 2×2, 3×3, and 4×4 topologies.

Fig. 2. Performance of the parallel Stein equation solvers with $n / \sqrt{p} = 1000$ (left: MFlop rate per node versus number of nodes p; right: MFlop rate per node with the serial rate as the unit). Legend: "– · – · –" = PDGEDLSG, and "– – – –" = PDGEDLSM. [Plots omitted.]

The left-hand plot of the figure shows the high scalability of the parallel routines, as the performance remains almost constant as $p$ is increased. The right-hand plot reports the MFlop rate of the parallel algorithms divided by that of the serial algorithm. Routine PDGEDLSM shows a scaled efficiency of around 80–90%. The results are slightly worse for routine PDGEDLSG, which obtains about 70–80% scaled efficiency.

4 Concluding Remarks

We have described iterative algorithms for solving Stein and Lyapunov equations on parallel distributed memory architectures. Our experiments with Stein equations with stable random matrices show similar numerical accuracies for the iterative solvers and the numerically stable Bartels-Stewart method. For coefficient matrices with spectra well separated from the unit circle, the Smith iteration is faster than the Bartels-Stewart method by a factor of about 1.5. The Smith iteration basically consists of matrix products, and the experimental results on a PC cluster show the scalability and efficiency of this algorithm. The Newton iteration for the matrix sign function only requires scalable parallel kernels; the performance of this algorithm is only slightly worse than that of the Smith iteration.

References

1. E. Anderson et al. LAPACK Users' Guide. SIAM, Philadelphia, PA, 2nd ed., 1995.
2. A.Y. Barraud. A numerical algorithm to solve $A^T X A - X = Q$. IEEE Trans. Autom. Contr., AC-22:883–885, 1977.
3. P. Benner, V. Mehrmann, V. Sima, S. Van Huffel, and A. Varga. SLICOT – a subroutine library in systems and control theory. Applied and Computational Control, Signals, and Circuits, 1:505–546, 1999.
4. P. Benner, E.S. Quintana-Ortí, and G. Quintana-Ortí. Solving linear matrix equations via rational iterative schemes. In preparation.
5. L.S. Blackford et al. ScaLAPACK Users' Guide. SIAM, Philadelphia, PA, 1997.
6. E.J. Davison and F.T. Man. The numerical solution of $A'Q + QA = -C$. IEEE Trans. Autom. Contr., AC-13:448–449, 1968.
7. G. Henry and R. van de Geijn. Parallelizing the QR algorithm for the unsymmetric algebraic eigenvalue problem: myths and reality. SIAM J. Sci. Comput., 17:870–883, 1997.
8. G. Henry, D.S. Watkins, and J.J. Dongarra. A parallel implementation of the nonsymmetric QR algorithm for distributed memory architectures. Technical Report LAPACK Working Note 121, University of Tennessee at Knoxville, 1997.
9. G.A. Hewer. An iterative technique for the computation of steady state gains for the discrete optimal regulator. IEEE Trans. Autom. Contr., AC-16:382–384, 1971.
10. C. Kenney and A.J. Laub. The matrix sign function. IEEE Trans. Autom. Contr., 40(8):1330–1348, 1995.
11. V. Kučera. Analysis and Design of Discrete Linear Control Systems. Academia, Prague, Czech Republic, 1991.
12. P. Lancaster and L. Rodman. The Algebraic Riccati Equation. Oxford University Press, Oxford, 1995.
13. P. Lancaster and M. Tismenetsky. The Theory of Matrices. Academic Press, Orlando, 2nd ed., 1985.
14. V. Mehrmann. The Autonomous Linear Quadratic Control Problem, Theory and Numerical Solution. Number 163 in Lecture Notes in Control and Information Sciences. Springer-Verlag, Heidelberg, July 1991.
15. J.D. Roberts. Linear model reduction and solution of the algebraic Riccati equation by use of the sign function. Internat. J. Control, 32:677–687, 1980. (Reprint of Tech. Rep. No. TR-13, CUED/B-Control, Cambridge Univ., Eng. Dept., 1971.)
16. V. Sima. Algorithms for Linear-Quadratic Optimization, volume 200 of Pure and Applied Mathematics. Marcel Dekker, Inc., New York, NY, 1996.
17. R. Smith. Matrix equation $XA + BX = C$. SIAM J. Appl. Math., 16(1):198–201, 1968.
18. G.W. Stewart. A parallel implementation of the QR algorithm. Parallel Computing, 5:187–196, 1987.
