Parallel Solution of Large-Scale Algebraic Bernoulli Equations with the Matrix Sign Function Method

Sergio Barrachina (Depto. de Ingeniería y Ciencia de Computadores, Universidad Jaume I, 12.071 Castellón, Spain)
Peter Benner (Fakultät für Mathematik, Technische Universität Chemnitz, D-09107 Chemnitz, Germany)
Enrique S. Quintana-Ortí (Depto. de Ingeniería y Ciencia de Computadores, Universidad Jaume I, 12.071 Castellón, Spain)

December 5, 2005

Abstract

We investigate the numerical solution of algebraic Bernoulli equations via the Newton iteration for the matrix sign function. Bernoulli equations are nonlinear matrix equations arising in control and systems theory in the context of stabilization of linear systems, coprime factorization of rational matrix-valued functions, and model reduction. The algorithm proposed here is easily parallelizable and thus provides an efficient tool for solving large-scale problems. We report the parallel performance and scalability of our implementations on a cluster of Intel Xeon processors.

Key words. Bernoulli equation; linear and nonlinear matrix equations; matrix sign function; control and systems theory; parallel computers.

1 INTRODUCTION

We consider the algebraic Bernoulli equation (ABE)

    A^T X + X A − X G X = 0,        (1)

where A, G ∈ R^{n×n}, G = G^T, and X ∈ R^{n×n} is the sought-after solution. Equation (1) is a homogeneous algebraic Riccati equation (ARE), i.e., an ARE with a zero constant term. Also note that, in correspondence with Bernoulli's differential equation

    ẏ(t) = p(t) y(t) + q(t) y(t)^k,    k ≠ 0, 1,

the name algebraic Bernoulli equation can denote any equation of the form

    L(X) + A_0 X ( ∏_{j=1}^{k−1} A_j X ) A_k = 0,

where L(X) is a linear operator and A_j ∈ R^{n×n} for j = 0, 1, ..., k. Thus, the ABE in (1) is a special Bernoulli equation as well as a special ARE.

The ABE has several applications in systems and control theory, mostly related to stabilization of linear systems: suppose

    ẋ(t) = A x(t) + B u(t),    x(0) = x_0 ∈ R^n,        (2)

where A ∈ R^{n×n}, B ∈ R^{n×m}, x(t) ∈ R^n is the state of the linear system given by the differential equation (2), and u(t) ∈ R^m is a control input. The stabilization problem is then to find a control function u(t) such that the solution of (2) asymptotically converges to zero. A particular solution of the stabilization problem is given by the feedback law

    u(t) := −B^T X_* x(t),    t ≥ 0,        (3)

where X_* is the maximal (w.r.t. the ordering of symmetric matrices) solution of the ABE (1) with G = BB^T, i.e., for any solution X of the ABE, X_* − X is a symmetric positive semidefinite matrix. The solution X_* is called a stabilizing solution of (1) if A − GX_* has all its eigenvalues in the open left half plane. Thus, we can obtain the stabilizing control input (3) from the stabilizing solution of (1) with G = BB^T. Note that there exist other stabilizing feedback laws for u(t). The particular one provided by the ABE appears in several coprime factorization problems for rational transfer functions used in robust control, see, e.g., (Ionescu et al., 1999) or (Zhou et al., 1996, Section 13.7). Also, recent methods for model reduction of unstable dynamical systems (Zhou et al., 1999) require the numerical solution of ABEs.

Large-scale unstable dynamical linear systems arise, for instance, in modeling RLC circuits and VLSI devices (Kamon et al., 1998). Here "large-scale" means that the dimension of the ABE, n, can be as large as 10^3 or even 10^4. Numerically reliable methods such as those described below require O(n^3) flops (floating-point arithmetic operations) and storage for O(n^2) real numbers. Thus, the solution of an ABE of dimension n in the thousands (or larger) greatly benefits from the use of parallel computing techniques.

By considering the ABE as a degenerate case of the ARE,

    A^T X + X A − X G X + Q = 0,

well-known methods for solving the latter equation can also be applied to the ABE. Thus, e.g., we can employ spectral projector procedures to compute a basis [U^T V^T]^T, U, V ∈ R^{n×n}, for the invariant subspace of the 2n × 2n matrix

    H := [ A    G   ]
         [ 0  −A^T  ]        (4)

associated with the eigenvalues of H in the open left half plane. Then, if the stabilizing solution of the ABE exists, it can be computed as X_* := −V U^{−1} (Sima, 1996). We can obtain the appropriate subspace of (4) by means of the QR algorithm for the (real) Schur form (Golub and Van Loan, 1996) or any of its structure-preserving variants (Benner et al., 2002), followed by a procedure to reorder the eigenvalues of the matrix conveniently (see the serial sketch at the end of this section). However, this approach leads to a fine-grain algorithm which is difficult to parallelize on distributed-memory architectures (Henry and van de Geijn, 1997). Although there exists a parallel implementation of the QR algorithm as part of ScaLAPACK (Blackford et al., 1997), no parallel eigenvalue reordering procedure is provided in this library.

In this paper we pursue a different approach, based on the matrix sign function (Roberts, 1980), which yields iterative solvers that can be easily parallelized and deliver excellent performance on parallel distributed-memory architectures.

The paper is structured as follows. In Section 2 we derive an iteration for solving the ABE based on the matrix sign function. Three different variants together with several implementation details are then introduced in Section 3. Experiments reporting the numerical accuracy and the efficiency of a parallel implementation of the new method on a cluster of Intel Xeon processors are given in Section 4. The final section summarizes a few concluding remarks.
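For reference, the invariant-subspace approach just described can be prototyped serially in a few lines. The following NumPy/SciPy sketch is ours, not an implementation from the paper; scipy.linalg.schur with sort='lhp' plays the role of the QR algorithm plus eigenvalue reordering:

```python
# Serial sketch of the Schur-based (invariant subspace) ABE solver described
# above; function name and structure are illustrative, not from the paper.
import numpy as np
from scipy.linalg import schur

def abe_schur(A, G):
    """Stabilizing solution X_* = -V U^{-1} of A^T X + X A - X G X = 0."""
    n = A.shape[0]
    H = np.block([[A, G],
                  [np.zeros((n, n)), -A.T]])   # the matrix H in (4)
    # Ordered real Schur form: eigenvalues in the open left half plane are
    # moved to the leading block, so Q[:, :n] spans the desired subspace.
    T, Q, sdim = schur(H, sort='lhp')
    U, V = Q[:n, :n], Q[n:, :n]
    return -V @ np.linalg.inv(U)               # X_* = -V U^{-1}

# With G = B @ B.T, the stabilizing feedback (3) is u(t) = -B.T @ X_* @ x(t).
```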

2 SIGN FUNCTION ITERATIVE SCHEMES

Consider a matrix Z ∈ R^{l×l} with no eigenvalues on the imaginary axis, and let

    Z = S [ J^−   0  ] S^{−1}
          [  0   J^+ ]

be its Jordan decomposition. Here, the Jordan blocks in J^− ∈ R^{j×j} and J^+ ∈ R^{(l−j)×(l−j)} contain, respectively, the eigenvalues of Z in the open left and right half planes. The matrix sign function of Z is then defined as

    sign(Z) := S [ −I_j      0    ] S^{−1},
                 [   0    I_{l−j} ]

where I_j denotes the identity matrix of order j. By applying Newton's root-finding iteration to Z^2 = I_l, with the starting point chosen as Z, we obtain the Newton iteration for the matrix sign function:

    Z_0 ← Z,    Z_{k+1} ← (1/2) ( Z_k + Z_k^{−1} ),    k = 0, 1, 2, ....        (5)

The sequence {Z_k}_{k=0}^∞ converges to sign(Z) = lim_{k→∞} Z_k with an ultimately quadratic convergence rate (Roberts, 1980).
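In exact arithmetic the iteration is just a repeated average of Z_k and its inverse. A minimal serial sketch (our names; the relative-change test anticipates the stopping criterion discussed below):

```python
# Plain Newton iteration (5) for sign(Z); assumes Z has no eigenvalues on
# the imaginary axis. A serial sketch with illustrative names.
import numpy as np

def matrix_sign(Z, tol=1e-12, maxit=100):
    Zk = Z.astype(float).copy()
    for _ in range(maxit):
        Znew = 0.5 * (Zk + np.linalg.inv(Zk))   # Z_{k+1} = (Z_k + Z_k^{-1})/2
        if np.linalg.norm(Znew - Zk, 'fro') <= tol * np.linalg.norm(Zk, 'fro'):
            return Znew
        Zk = Znew
    return Zk
```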

The sign function can be used to solve AREs when applied to the corresponding Hamiltonian matrix, see (Byers, 1987; Roberts, 1980; Sima, 1996). Adapting this approach to the ABE, we can apply the iteration (5) to the Hamiltonian matrix H in (4). Now, exploiting the block-triangular structure of H, we have

    H^{−1} = [ A^{−1}   A^{−1} G A^{−T} ]
             [   0          −A^{−T}     ]

and we obtain from (5) the following Newton iteration for the computation of sign(H):

    A_0 ← A,    A_{k+1} ← (1/2) ( (1/c_k) A_k + c_k A_k^{−1} ),
    G_0 ← G,    G_{k+1} ← (1/2) ( (1/c_k) G_k + c_k A_k^{−1} G_k A_k^{−T} ),        (6)
    k = 0, 1, 2, ....

In order to accelerate the convergence, we can use the determinantal scaling

    c_k := |det(H_k)|^{1/n}

proposed in (Byers, 1987). In case A is stable or anti-stable, it is often more efficient to use the optimal norm scaling

    c_k := √( ||H_k||_2 / ||H_k^{−1}||_2 ),

where ||·||_2 denotes the matrix spectral norm (also known as the 2-norm) and

    H_k := [ A_k     G_k   ]
           [  0    −A_k^T  ].

In order to avoid the computationally expensive spectral norm, one can use instead the Frobenius norm, denoted ||·||_F, and exploit the special structure of H_k, so that

    c_k := √( ||H_k||_F / ||H_k^{−1}||_F )
         = ( ( 2 ||A_k||_F^2 + ||G_k||_F^2 ) / ( 2 ||A_k^{−1}||_F^2 + ||A_k^{−1} G_k A_k^{−T}||_F^2 ) )^{1/4}.

A suitable stopping criterion for the iteration is

    ||A_{k+1} − A_k||_F ≤ τ · ||A_k||_F,

where τ is a tolerance threshold. In order to avoid stagnation we choose τ = √ε, with ε the machine precision, and perform 1–3 additional iterations once this criterion is satisfied. Due to the quadratic convergence of the Newton iteration (5), this is usually enough to achieve the attainable accuracy. At convergence, after k̄ iterations, the solution of (1) can be obtained from the full-rank linear least-squares problem (see (Byers, 1987; Roberts, 1980; Sima, 1996))

    [ G_k̄          ]       [ A_k̄ + I_n ]
    [ I_n − A_k̄^T  ] X  =  [ 0_n        ].        (7)
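Putting the pieces together, a serial sketch of the complete solver — iteration (6) with the structured Frobenius-norm scaling, the stagnation-safe stopping rule, and the least-squares problem (7) — might look as follows. These NumPy calls are stand-ins for the ScaLAPACK kernels used in the actual parallel codes; function and variable names are ours:

```python
# Serial sketch of the sign-function ABE solver: iteration (6), structured
# Frobenius scaling, stopping criterion with extra steps, and LSQ solve (7).
import numpy as np

def abe_sign(A, G, maxit=50, extra=2):
    n = A.shape[0]
    tau = np.sqrt(np.finfo(float).eps)       # tolerance tau = sqrt(eps)
    Ak, Gk = A.astype(float).copy(), G.astype(float).copy()
    done = 0
    for _ in range(maxit):
        Ainv = np.linalg.inv(Ak)
        AiGAiT = Ainv @ Gk @ Ainv.T          # A_k^{-1} G_k A_k^{-T}
        # structured Frobenius-norm scaling of H_k
        num = 2 * np.linalg.norm(Ak, 'fro')**2 + np.linalg.norm(Gk, 'fro')**2
        den = 2 * np.linalg.norm(Ainv, 'fro')**2 + np.linalg.norm(AiGAiT, 'fro')**2
        ck = (num / den) ** 0.25
        Anew = 0.5 * (Ak / ck + ck * Ainv)
        Gk = 0.5 * (Gk / ck + ck * AiGAiT)
        if np.linalg.norm(Anew - Ak, 'fro') <= tau * np.linalg.norm(Anew, 'fro'):
            done += 1                        # criterion met: a few extra steps
        Ak = Anew
        if done > extra:
            break
    # least-squares problem (7): [G_kbar; I_n - A_kbar^T] X = [A_kbar + I_n; 0_n]
    M = np.vstack([Gk, np.eye(n) - Ak.T])
    rhs = np.vstack([Ak + np.eye(n), np.zeros((n, n))])
    X, *_ = np.linalg.lstsq(M, rhs, rcond=None)
    return X
```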

3 IMPLEMENTATION DETAILS

The iterates in (6) can be implemented in a number of ways. Here we consider the following three variants (serial sketches of the different operation mixes are given at the end of this section):

– pdgecbne_v1. Matrix A_k is first decomposed into triangular factors L and U using the LU factorization with partial pivoting (LUPP), also known as Gaussian elimination; these factors are then used to solve two linear systems and thus obtain A_k^{−1} G_k A_k^{−T}; finally, the explicit inverse A_k^{−1} is built from L and U.

– pdgecbne_v2. Matrix A_k is again factorized into two triangular factors using the LUPP, but now A_k^{−1} is immediately built from these factors; next, A_k^{−1} G_k A_k^{−T} is computed as two matrix products involving the explicit inverse. We expect high efficiency from this variant due to the parallel performance of the matrix product operation.

– pdgecbne_v3. Matrix A_k is now directly inverted using an efficient parallel procedure based on Gauss-Jordan transformations (Quintana-Ortí et al., 2001). The expression A_k^{−1} G_k A_k^{−T} is again formed by performing two matrix products with the explicit inverse.

The computational cost of (6) is dominated by forming the inverse of A_k and solving the linear systems/performing the matrix products for the sequence G_k. This adds up to 6n^3 flops per iteration and requires storage for four n × n matrices, for all three variants. The solution of the full-rank linear system in (7) using the QR factorization requires 14n^3/3 additional flops and can be accommodated using the storage space employed during the iteration plus space for two more n × n matrices.

The variants of the ABE solver described in this section are basically composed of traditional matrix computations: matrix (LU and QR) factorizations, solution of triangular linear systems, matrix products, and matrix inversion (via Gaussian elimination or Gauss-Jordan transformations). All these operations can be performed efficiently employing parallel linear algebra libraries for distributed-memory computers (Blackford et al., 1997; van de Geijn, 1997). Here we employ the parallel kernels in the ScaLAPACK and PBLAS libraries (Blackford et al., 1997) to perform these operations and also to build a parallel matrix inversion routine based on Gauss-Jordan transformations.
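The following serial SciPy sketch contrasts the operation mixes of the first two variants; the calls below are only serial stand-ins for the corresponding distributed ScaLAPACK kernels, and the names and structure are ours:

```python
# Serial sketches of one update of A_k^{-1} and A_k^{-1} G_k A_k^{-T}
# in the style of pdgecbne_v1 and pdgecbne_v2.
import numpy as np
from scipy.linalg import lu_factor, lu_solve, inv

def step_v1(Ak, Gk):
    # pdgecbne_v1: factorize once, solve two linear systems (triangular
    # solves), then build the explicit inverse from the same LU factors.
    lu, piv = lu_factor(Ak)
    W = lu_solve((lu, piv), Gk)            # W = A_k^{-1} G_k
    AiGAiT = lu_solve((lu, piv), W.T).T    # (A_k^{-1} W^T)^T = W A_k^{-T}
    Ainv = lu_solve((lu, piv), np.eye(Ak.shape[0]))
    return Ainv, AiGAiT

def step_v2(Ak, Gk):
    # pdgecbne_v2: invert immediately from the LU factors, then use two
    # matrix products (GEMM-rich, hence more parallel-friendly).
    Ainv = inv(Ak)                         # LUPP-based inversion
    return Ainv, Ainv @ Gk @ Ainv.T

# pdgecbne_v3 differs from v2 only in how A_k is inverted: a Gauss-Jordan
# sweep (Quintana-Ortí et al., 2001) replaces the LU-based inversion,
# trading triangular solves for perfectly load-balanced updates.
```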

4 EXPERIMENTAL RESULTS

All the experiments presented in this section were performed on a cluster of n_p = 30 nodes using IEEE double-precision floating-point arithmetic (ε ≈ 2.2204 × 10^{−16}). Each node consists of an Intel Xeon processor with 1 GByte of RAM. We employ a BLAS library specially tuned for this processor that achieves around 3800 Mflop/s (millions of flops per second) for the matrix product (routine DGEMM from http://www.cs.utexas.edu/users/kgoto). The nodes are connected via a Myrinet multistage network, and the MPI communication library is specially developed and tuned for this network. The performance of the interconnection network was measured with a simple loop-back message transfer, resulting in a latency of 18 µs and a bandwidth of 1.4 Gbit/s.


Table 1. Numerical performance of the different ABE solvers.

Example     n    δ         R_1(X_ne) / #Iter.    R_1(X_sh)
1.1         2    1.0e−6    8.0e−22 /  4          8.0e−22
1.2         2    0.0e+0    1.4e−14 /  5          4.7e−16
2.1         2    0.0e+0    0.0e+00 /  5          0.0e+00
2.3         2    1.0e+0    4.4e−16 /  4          4.4e−16
2.4         2    1.0e+0    0.0e+00 /  5          1.1e−15
2.5         2    1.0e+0    7.0e−16 /  5          3.5e−15
2.6         3    0.0e+0    1.4e−09 /  6          1.1e−09
2.7         4    1.0e−8    6.2e−15 /  7          3.0e−11
2.8         4    1.0e−8    2.0e−16 /  6          2.8e−16
3.1        39    1.0e−1    2.0e−15 /  6          3.4e−16
3.2        64    1.0e−4    9.3e−12 / 17          5.3e−15
4.1        21    1.0e+0    1.0e+00 /  8          2.8e+00
4.3        60    1.0e−4    5.8e−15 / 17          1.9e−15

4.1 Numerical performance

In this subsection we analyze the numerical performance of the ABE solvers as a stabilizing tool for linear systems of the form (2), using a collection of benchmark examples for the solution of AREs (Abels and Benner, 1999). In particular, among these examples we only select those corresponding to linear systems with an unstable state matrix A. Besides, since many of these examples have one or more poles on the imaginary axis, posing an ill-conditioned problem, especially for the sign function solvers, we overcome this difficulty by shifting the set of eigenvalues as Ā = A + δI_n, so that the actual ABE that is solved is

    Ā^T X + X Ā − X B B^T X = 0.

The solution thus obtained still yields a stabilizing feedback law.

All the numerical experiments were performed using Matlab® 6.0.0.88. As no noticeable differences were found between the three variants, we only report results for a Matlab implementation of the first one, hereafter gecbne_v1. We also include in the comparison a Matlab implementation of the ABE solver based on the QR algorithm, namely gecbsh.

Table 1 reports the relative residual

    R_1(X_*) := ||Ā^T X_* + X_* Ā − X_* B B^T X_*||_1 / ||X_*||_1

for the solutions X_ne and X_sh computed using routines gecbne_v1 and gecbsh, respectively. For the former routine we also show the number of steps needed for convergence of the sign iteration. The results show that routine gecbsh yields a smaller residual for Example 3.2 while, for Example 2.7, the residual from routine gecbne_v1 is much better; in all other cases both routines perform similarly. The numerical performance of these two approaches is problem dependent and, therefore, it is "dangerous" to infer general conclusions from such a comparison. Stabilizing solutions were obtained from both methods for all linear systems except, as could be expected from the large relative residual, Example 4.1, where both methods failed.
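For concreteness, the accuracy check just described (shift, solve, evaluate R_1) can be sketched as follows, reusing the abe_sign sketch from Section 2; the data below is illustrative only and is not taken from the benchmark collection:

```python
# Sketch of the residual check R_1 for a shifted ABE; abe_sign is the
# serial sketch from Section 2, and the random data is illustrative.
import numpy as np

def relative_residual_1(Abar, B, X):
    R = Abar.T @ X + X @ Abar - X @ B @ B.T @ X
    return np.linalg.norm(R, 1) / np.linalg.norm(X, 1)

rng = np.random.default_rng(0)
n, m, delta = 8, 2, 1.0e-1
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, m))
Abar = A + delta * np.eye(n)          # eigenvalue shift Abar = A + delta*I
X = abe_sign(Abar, B @ B.T)           # stabilizing solution (sketch)
print(relative_residual_1(Abar, B, X))
```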

4.2 Parallel performance

We next analyze the parallelism and the scalability of the different variants of the ABE solver. As the performance is quite independent of the actual number of iterations performed by the algorithm (as long as the number of iterations is large), all results in this subsection correspond to 10 iterations. There is currently no other parallel solver for this equation: as discussed previously, parallelizing a solver based on the QR algorithm would require a parallel implementation of the eigenvalue reordering procedure, which is not currently available in any existing parallel library.

Figure 1. Parallel performance of the different variants of the ABE solver (pdgecbne_v1, pdgecbne_v2, pdgecbne_v3) on the Intel cluster. Top: execution time (in seconds) versus number of processors. Bottom: Mflop/s per node versus number of processors.

Our first experiment reports the execution time of the parallel routines for an ABE of dimension n = 3000. This is about the largest size we could solve on a single node of the platform, considering the number of data matrices involved, the amount of workspace necessary for the computations, and the size of the RAM per node. The plot at the top of Figure 1 reports the execution time of the parallel algorithm using n_p = 1, 2, 4, 6, 8, 10, and 12 processors. The execution of the parallel algorithm on a single node is likely to require more time than a serial implementation of the algorithm (using, e.g., LAPACK and BLAS); however, at least for such a problem dimension, we expect this overhead to be negligible compared with the overall execution time.

The figure shows a notable reduction in the execution time when a small number of processors is employed: roughly from 12 minutes down to 2 minutes. For such a small problem, using more than 12 processors does not achieve a significant further reduction. It can also be observed that the execution time of variant pdgecbne_v2 is lower than that of pdgecbne_v1. This is due to the higher parallelism of the class of operations the former consists of (matrix products) compared with those of pdgecbne_v1 (triangular linear system solves). The lowest execution times, however, correspond to pdgecbne_v3. This variant combines the high performance of matrix products with an inversion via Gauss-Jordan transformations, which achieves perfect load balancing and high parallelism as well (Quintana-Ortí et al., 2001).

Table 2 reports the speed-up of the different variants of the algorithm. Naturally, these values decrease as n_p grows: as the system dimension is fixed, the problem size per node is reduced, and so is the amount of computation and the opportunity for parallelism. The speed-ups delivered by pdgecbne_v2 are higher than those obtained by pdgecbne_v1 for any number of processors. Also, the results of pdgecbne_v3 are better than those of pdgecbne_v2. In particular, the speed-ups achieved using 4/12 processors increase from 2.62/4.88 for pdgecbne_v1 to 3.16/6.53 for pdgecbne_v3.

We next evaluate the scalability of the parallel algorithm when the problem size per node is kept constant. For that purpose, we fix the problem dimensions to n/√(n_p) ≈ 3000 and report the Mflop/s per node. The plot at the bottom of Figure 1 shows the Mflop rate per node of the parallel routines.


Table 2. Speed-up of the different variants of the ABE solver on the Intel cluster.

n_p             2      4      6      8     10     12
pdgecbne_v1    1.28   2.62   3.31   3.95   4.90   4.88
pdgecbne_v2    1.49   2.82   3.70   4.31   5.37   5.45
pdgecbne_v3    1.64   3.16   4.09   4.91   5.74   6.53

These results show a high degree of scalability for our parallel kernels: there is only a small decrease in the performance of the algorithm when n_p is increased while the problem dimension per node remains constant. In going from a serial execution to a parallel one, a moderate loss of performance is experienced due to the communication overhead (e.g., for pdgecbne_v1 the performance falls from 2347 Mflop/s on a single processor to 2098 Mflop/s on 4 processors, a loss of about 10%). The Mflop rate for pdgecbne_v1 is then maintained at around 2000 Mflop/s, except for the n_p = 30 case, where a considerable decrease is encountered for the first two variants. This may be due to the use of a non-square logical processor topology in this case. Interestingly enough, the pdgecbne_v3 variant not only achieves the highest Mflop rate but also does not experience the decrease encountered by the previous variants when n_p = 30. The reason is that the method employed by this variant to obtain the inverse presents a perfectly balanced distribution of the workload and is thus quite independent of the logical topology used in the experiments.
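The bookkeeping behind Table 2 and the weak-scaling setup is simple enough to state in code; a brief sketch (the 3000-per-node constant is the one reported above, everything else is illustrative):

```python
# Helper sketch for the measurements above: fixed-size speed-up as in
# Table 2, and the problem dimension used in the weak-scaling experiment,
# chosen so that n / sqrt(n_p) ~ 3000 (constant memory per node).
import math

def speedup(t_serial, t_parallel):
    return t_serial / t_parallel                  # S(n_p) = T(1) / T(n_p)

def weak_scaling_dim(n_p, n_per_node=3000):
    return round(n_per_node * math.sqrt(n_p))     # keeps n^2 / n_p constant

# e.g., weak_scaling_dim(4) -> 6000 and weak_scaling_dim(9) -> 9000
```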

5 CONCLUSIONS

We have proposed a new iterative scheme, based on the matrix sign function, for solving the ABE. The method can be easily parallelized using the kernels in ScaLAPACK. Furthermore, three variants of this method have been proposed that allow the efficient solution of ABEs of dimension n in the order of thousands on parallel distributed-memory computers. The experimental comparison of the different implementations shows that the variant which uses Gauss-Jordan elimination to compute the inverse is faster, exhibits a higher degree of parallelism, and delivers better scalability.

ACKNOWLEDGMENTS

Sergio Barrachina and E.S. Quintana-Ortí were supported by the CICYT project No. TIC2002-004400-C03-01 and FEDER, and project No. P1B-2004-6 of the Fundació Caixa-Castelló/Bancaixa and UJI. Sergio Barrachina was also supported by a mobility grant of the Fundació Caixa Castelló-Bancaixa (No. 05I006.41/1). P. Benner was supported by the DFG Sonderforschungsbereich 393 "Parallele Numerische Simulation für Physik und Kontinuumsmechanik" at TU Chemnitz.

References

Abels, J. and Benner, P. (1999). CAREX – a collection of benchmark examples for continuous-time algebraic Riccati equations (version 2.0). SLICOT Working Note 1999-14. Available from http://www.win.tue.nl/niconet/NIC2/reports.html.

Benner, P., Byers, R., Mehrmann, V., and Xu, H. (2002). Numerical computation of deflating subspaces of skew-Hamiltonian/Hamiltonian pencils. SIAM J. Matrix Anal. Appl., 24(1):165–190.


Blackford, L., Choi, J., Cleary, A., D'Azevedo, E., Demmel, J., Dhillon, I., Dongarra, J., Hammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D., and Whaley, R. (1997). ScaLAPACK Users' Guide. SIAM, Philadelphia, PA.

Byers, R. (1987). Solving the algebraic Riccati equation with the matrix sign function. Linear Algebra Appl., 85:267–279.

Golub, G. and Van Loan, C. (1996). Matrix Computations. Johns Hopkins University Press, Baltimore, third edition.

Henry, G. and van de Geijn, R. (1997). Parallelizing the QR algorithm for the unsymmetric algebraic eigenvalue problem: myths and reality. SIAM J. Sci. Comput., 17:870–883.

Ionescu, V., Oară, C., and Weiss, M. (1999). Generalized Riccati Theory and Robust Control. A Popov Function Approach. John Wiley & Sons, Chichester.

Kamon, M., Wang, F., and White, J. (1998). Recent improvements for fast inductance extraction and simulation packaging. In Proc. of the IEEE 7th Topical Meeting on Electrical Performance of Electronic Packaging, pages 281–284.

Quintana-Ortí, E., Quintana-Ortí, G., Sun, X., and van de Geijn, R. (2001). A note on parallel matrix inversion. SIAM J. Sci. Comput., 22:1762–1771.

Roberts, J. (1980). Linear model reduction and solution of the algebraic Riccati equation by use of the sign function. Internat. J. Control, 32:677–687. (Reprint of Technical Report No. TR-13, CUED/B-Control, Cambridge University, Engineering Department, 1971.)

Sima, V. (1996). Algorithms for Linear-Quadratic Optimization, volume 200 of Pure and Applied Mathematics. Marcel Dekker, Inc., New York, NY.

van de Geijn, R. (1997). Using PLAPACK: Parallel Linear Algebra Package. MIT Press, Cambridge, MA.

Zhou, K., Doyle, J., and Glover, K. (1996). Robust and Optimal Control. Prentice-Hall, Upper Saddle River, NJ.

Zhou, K., Salomon, G., and Wu, E. (1999). Balanced realization and model reduction for unstable systems. Int. J. Robust Nonlinear Control, 9(3):183–198.

