Computing Covariance Matrices for Constrained Nonlinear Large Scale Parameter Estimation Problems Using Krylov Subspace Methods

Ekaterina Kostina and Olga Kostyukova

Abstract. In this paper we show how to compute, based on preconditioned Krylov subspace methods, the covariance matrix of parameter estimates, which is crucial for efficient methods of optimum experimental design.

Mathematics Subject Classification (2000). Primary 65K10; Secondary 15A09, 65F30.

Keywords. Constrained parameter estimation, covariance matrix of parameter estimates, optimal experimental design, nonlinear equality constraints, iterative matrix methods, preconditioning.

This work was completed with the support of the ESF Activity "Optimization with PDE Constraints" (Short visit grant 2990).
1. Introduction

Parameter estimation (PE) and optimum experimental design (OED) are important steps in establishing models that reproduce a given process quantitatively correctly. The aim of parameter estimation is to reliably and accurately identify model parameters from sets of noisy experimental data. The "accuracy" of the parameters, i.e. their statistical distribution depending on the data noise, can be estimated up to first order by means of a covariance matrix approximation and corresponding confidence regions. In practical applications, however, one often finds that the experiments performed to obtain the required measurements are expensive, but nevertheless do not guarantee satisfactory parameter accuracy or even well-posedness of the parameter estimation problem. In order to maximize the accuracy of the parameter estimates, additional experiments can be designed with optimal experimental settings or controls (e.g. initial conditions, measurement devices, sampling times, temperature profiles, feed streams etc.) subject to
constraints. As an objective functional a suitable function of the covariance matrix can be used. The possible constraints in this problem describe costs, feasibility of experiments, the domain of the models etc. The methods for optimum experimental design that have been developed over the last few years can handle processes governed by differential algebraic equations (DAE), but methods for processes governed by partial differential equations (PDE) are still in their infancy because of the extreme complexity of the models and of the optimization problem. The topic of this paper is one aspect of numerical methods for OED, namely the efficient computation of covariance matrices and their derivatives.

So far, numerical methods for parameter estimation and optimal design of experiments in dynamic processes have been based on direct linear algebra methods. On the other hand, for very large scale constrained systems with sparse matrices of special structure, e.g. originating from the discretization of PDEs, direct linear algebra methods are not competitive with iterative linear algebra methods even for forward models. Generally, the covariance matrix can be calculated via a generalized inverse of the Jacobians of the parameter estimation problem. But the generalized inverse cannot, in general, be computed explicitly by iterative methods, hence the statistical assessment of the parameter estimate cannot be provided by the standard procedures of such methods. Hence, in the case of PE and OED in PDE models, generalizations of iterative linear algebra methods to the computation of the covariance matrix and its derivatives are crucial for practical applications. The aim of this paper is to show how to compute covariance matrices using Krylov subspace methods and thus to make a step towards efficient numerical methods for optimum experimental design for processes described by systems of non-stationary PDEs.
2. Covariance matrix and its numerical computation using Krylov subspace methods

As in [3], we consider the constrained nonlinear parameter estimation problem

\[
\min_{z \in \mathbb{R}^n} \|F_1(z)\|_2^2, \quad \text{s.t. } F_2(z) = 0, \qquad (2.1)
\]
which results from a discretization of a parameter estimation problem in a process described by, e.g., a PDE. Here the vector $z$ includes the unknown parameters and the variables resulting from the discretization of the PDE, and $\|s\|_2^2 = s^T s$. To solve problem (2.1) we apply a generalized Gauss-Newton method. At each iteration of the Gauss-Newton algorithm we solve a linear least-squares problem, which can be written in the form

\[
\min_{x} \|Ax - b\|_2^2, \quad \text{s.t. } Bx = 0, \qquad (2.2)
\]
where $b = -F_1(z)$, $A = A(z) = \frac{\partial F_1(z)}{\partial z} \in \mathbb{R}^{k \times n}$, $B = B(z) = \frac{\partial F_2(z)}{\partial z} \in \mathbb{R}^{m \times n}$, $\bar m = n - m > 0$, $z$ is given, and

\[
\operatorname{rank} B = m, \qquad \operatorname{rank} \begin{pmatrix} A \\ B \end{pmatrix} = n. \qquad (2.3)
\]
It is shown in [3] how to compute covariance matrices when the underlying finite dimensional constrained linearized parameter estimation problem (2.2) is solved using a conjugate gradient technique. One of the intriguing results of [3] is that solving linear constrained least squares problems by conjugate gradient methods we get as a by-product the covariance matrix and confidence intervals as well as their derivatives. The results are generalized to LSQR methods and numerically tested in [10, 6], and are briefly summarized in the following.

Under the regularity assumptions (2.3), the optimal solution $x^*$ and the Lagrange vector $\lambda^*$ satisfy the KKT system

\[
K \begin{pmatrix} x^* \\ \lambda^* \end{pmatrix} = \begin{pmatrix} A^T b \\ 0 \end{pmatrix}, \qquad K := \begin{pmatrix} A^T A & B^T \\ B & 0 \end{pmatrix},
\]

and can be explicitly written using the linear operator $A^+$ as follows:

\[
x^* = A^+ \begin{pmatrix} b \\ 0 \end{pmatrix}, \qquad A^+ = \begin{pmatrix} I & 0 \end{pmatrix} K^{-1} \begin{pmatrix} A^T & 0 \\ 0 & I \end{pmatrix}.
\]

The approximation of the covariance matrix can be expressed as [3]

\[
C = A^+ \begin{pmatrix} I & 0 \\ 0 & 0 \end{pmatrix} A^{+T}. \qquad (2.4)
\]
It is shown in [3] that $C$ satisfies a linear system of equations.

Lemma 2.1. The covariance matrix $C$ (2.4) is equal to the sub-matrix $X$ of the matrix $K^{-1}$,

\[
K^{-1} = \begin{pmatrix} X & W \\ S & T \end{pmatrix},
\]

and satisfies the following linear equation system with respect to the variables $C \in \mathbb{R}^{n \times n}$ and $S \in \mathbb{R}^{m \times n}$:

\[
K \begin{pmatrix} C \\ S \end{pmatrix} = \begin{pmatrix} I \\ 0 \end{pmatrix}. \qquad (2.5)
\]

Another representation of the covariance matrix is given in [10].

Lemma 2.2. The covariance matrix $C$ (2.4) is equal to

\[
C = Z (Z^T A^T A Z)^{-1} Z^T,
\]

where the columns of the orthogonal matrix $Z \in \mathbb{R}^{n \times \bar m}$ span the null space of $B$.
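For illustration, the following small numpy sketch checks these identities on random test data; the dimensions and matrices are our own illustrative assumptions and are not taken from the paper. It forms $C$ from (2.4), from the sub-matrix of $K^{-1}$ in Lemma 2.1, and from the null-space expression in Lemma 2.2, and confirms that all three agree.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, m = 20, 8, 3                                # illustrative sizes (assumptions)
A = rng.standard_normal((k, n))
B = rng.standard_normal((m, n))                   # generically of full row rank m

K = np.block([[A.T @ A, B.T], [B, np.zeros((m, m))]])
Kinv = np.linalg.inv(K)

# A^+ = [I 0] K^{-1} [A^T 0; 0 I] and C from (2.4)
Aplus = Kinv[:n, :] @ np.block([[A.T, np.zeros((n, m))],
                                [np.zeros((m, k)), np.eye(m)]])
C_24 = Aplus @ np.block([[np.eye(k), np.zeros((k, m))],
                         [np.zeros((m, k)), np.zeros((m, m))]]) @ Aplus.T

C_lemma1 = Kinv[:n, :n]                           # sub-matrix X of K^{-1} (Lemma 2.1)

_, _, Vt = np.linalg.svd(B)
Z = Vt[m:].T                                      # orthonormal basis of Ker B
C_lemma2 = Z @ np.linalg.inv(Z.T @ A.T @ A @ Z) @ Z.T   # Lemma 2.2

print(np.allclose(C_24, C_lemma1), np.allclose(C_24, C_lemma2))
```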
Let us present formulas for the computation of the derivatives of the covariance matrix $C = C(A, B)$ and the matrix $S = S(A, B)$, regarded as functions of the matrices $A$ and $B$. These derivatives are needed in numerical methods for the design of optimal nonlinear experiments in case the trace of $C$ is chosen as a criterion.
Let $A(t) = A + t\,\Delta A$ and $B(\mu) = B + \mu\,\Delta B$. Then

\[
\frac{\partial \operatorname{tr} C(A(t), B)}{\partial t} = - \sum_{i=1}^{n} C^{(i)T} (\Delta A^T A + A^T \Delta A)\, C^{(i)},
\]
\[
\frac{\partial \operatorname{tr} C(A, B(\mu))}{\partial \mu} = -2 \sum_{i=1}^{n} C^{(i)T} \Delta B^T S^{(i)},
\]

where $C^{(i)}$ and $S^{(i)}$ denote the $i$-th columns of the matrices $C$ and $S$, respectively. The columns $C^{(i)}$ and $S^{(i)}$ are related as follows:

\[
B^T S^{(i)} = e_i - A^T A\, C^{(i)}, \quad i = 1, \ldots, n.
\]
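The first of these formulas can be checked numerically. The sketch below (our own illustrative random data and step size, with the covariance matrix formed via Lemma 2.2) compares the analytic directional derivative at $t = 0$ with a central finite difference; the second formula can be checked analogously once $S$ is available.

```python
import numpy as np

rng = np.random.default_rng(1)
k, n, m = 20, 8, 3                                # illustrative sizes (assumptions)
A  = rng.standard_normal((k, n))
B  = rng.standard_normal((m, n))
dA = rng.standard_normal((k, n))                  # perturbation direction Delta A

_, _, Vt = np.linalg.svd(B)
Z = Vt[m:].T                                      # Ker B basis; B itself is unchanged

def cov(A_):                                      # covariance via Lemma 2.2
    return Z @ np.linalg.inv(Z.T @ A_.T @ A_ @ Z) @ Z.T

C = cov(A)
# -sum_i C^(i)T (dA^T A + A^T dA) C^(i)  =  -tr( C (dA^T A + A^T dA) C )
analytic = -np.trace(C @ (dA.T @ A + A.T @ dA) @ C)
h = 1e-6                                          # assumed finite-difference step
fd = (np.trace(cov(A + h * dA)) - np.trace(cov(A - h * dA))) / (2 * h)
print(analytic, fd)                               # the two values should agree closely
```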
2.1. Computing the covariance matrix using the conjugate gradient method

Let us show how to compute the covariance matrix using the CG method.

Algorithm CG($\mathcal{A}$, b) (for solving $\min_x \|\mathcal{A}x - b\|_2^2$)
Step 1: (Initialization) $x_1 = 0$, $r_1 = \mathcal{A}^T b$, $p_1 = r_1$.
Step 2: For $k = 1, 2, 3, \ldots$ repeat steps 2.1–2.2.
  2.1: (Update)
    1. $\alpha_k = r_k^T r_k / (p_k^T \mathcal{A}^T \mathcal{A} p_k)$
    2. $x_{k+1} = x_k + \alpha_k p_k$
    3. $r_{k+1} = r_k - \alpha_k \mathcal{A}^T \mathcal{A} p_k$
    4. $\beta_k = r_{k+1}^T r_{k+1} / (r_k^T r_k)$
    5. $p_{k+1} = r_{k+1} + \beta_k p_k$
  2.2: (Test for convergence) If $\|r_{k+1}\| \le$ Tol, then STOP.

To solve problem (2.2) we apply the Algorithm CG($\mathcal{A}$, b) with $\mathcal{A} = AP$, where $P$ is an orthogonal projector onto the null space of $B$. In Step 2.1 we have to compute the vectors $AP p_k$ and $P A^T A P p_k$. Since $P p_k = p_k$ by construction, we have $AP p_k = A p_k$, and we need to compute only one projection $P A^T A p_k$ of the vector $A^T A p_k$ onto the null space of the matrix $B$. This projection is not calculated explicitly; instead a corresponding unconstrained least squares problem (lower level problem) is solved, i.e. to compute $Pw$ we solve $q^* = \arg\min_q \|B^T q - w\|_2^2$ and set $Pw = w - B^T q^*$.
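As a hedged illustration, the following dense numpy sketch transcribes this projected CG iteration for small random test data (our own assumptions); after $\bar m = n - m$ steps the collected directions $p_k$ and scalars $\gamma_k = 1/\|Ap_k\|_2^2$ reproduce the covariance matrix exactly as stated in Theorem 2.3 below.

```python
import numpy as np

rng = np.random.default_rng(2)
k, n, m = 20, 8, 3                                # illustrative sizes (assumptions)
A = rng.standard_normal((k, n))
B = rng.standard_normal((m, n))
b = rng.standard_normal(k)
m_bar = n - m

def proj(w):
    # Lower level problem: P w = w - B^T q*,  q* = argmin_q ||B^T q - w||_2^2
    q, *_ = np.linalg.lstsq(B.T, w, rcond=None)
    return w - B.T @ q

x = np.zeros(n)
r = proj(A.T @ b)                                 # (AP)^T b = P A^T b
p = r.copy()
P_cols, gammas = [], []
for _ in range(m_bar):                            # run the full m_bar steps (cf. Remark 2.5)
    Ap = A @ p                                    # A P p_k = A p_k since P p_k = p_k
    alpha = (r @ r) / (Ap @ Ap)
    x = x + alpha * p
    r_new = r - alpha * proj(A.T @ Ap)            # the single projection per iteration
    P_cols.append(p)
    gammas.append(1.0 / (Ap @ Ap))
    beta = (r_new @ r_new) / (r @ r)
    p = r_new + beta * p
    r = r_new

Pmat = np.column_stack(P_cols)
C_cg = Pmat @ np.diag(gammas) @ Pmat.T            # covariance as in Theorem 2.3

# reference values from the KKT system and from Lemma 2.2
K = np.block([[A.T @ A, B.T], [B, np.zeros((m, m))]])
x_ref = np.linalg.solve(K, np.concatenate([A.T @ b, np.zeros(m)]))[:n]
_, _, Vt = np.linalg.svd(B); Z = Vt[m:].T
C_ref = Z @ np.linalg.inv(Z.T @ A.T @ A @ Z) @ Z.T
print(np.abs(x - x_ref).max(), np.abs(C_cg - C_ref).max())
```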
Theorem 2.3. [3] Suppose that the Algorithm CG($AP$, b) results after $\bar m$ iterations in the vectors $x_{\bar m}, p_1, \ldots, p_{\bar m}$. The solution $x^*$ of the problem (2.2) and the corresponding covariance matrix $C$ (2.4) can be computed as

\[
x^* = x_{\bar m}, \qquad C = P \operatorname{diag}(\gamma_k,\, k = 1, \ldots, \bar m)\, P^T,
\]

with $P = (p_1, \ldots, p_{\bar m})$, $\gamma_k = 1/\|A p_k\|_2^2$, $k = 1, \ldots, \bar m$.

2.2. Computing the covariance matrix using LSQR

Let us show how to compute the covariance matrix using LSQR, an iterative method for solving large linear systems or least-squares problems which is known to be numerically more reliable than other conjugate-gradient methods [7, 8].
Algorithm LSQR($\mathcal{A}$, b) (for solving $\min_x \|\mathcal{A}x - b\|_2^2$)
Step 1: (Initialization) $\beta_1 u_1 = b$, $\alpha_1 v_1 = \mathcal{A}^T u_1$, $p_1 = v_1$, $x_0 = 0$, $\bar\phi_1 = \beta_1$, $\bar\rho_1 = \alpha_1$.
Step 2: For $k = 1, 2, 3, \ldots$ repeat steps 2.1–2.4.
  2.1: (Continue the bidiagonalization)
    1. $\beta_{k+1} u_{k+1} = \mathcal{A} v_k - \alpha_k u_k$
    2. $\alpha_{k+1} v_{k+1} = \mathcal{A}^T u_{k+1} - \beta_{k+1} v_k$
  2.2: (Construct and apply the next orthogonal transformation)
    1. $\rho_k = (\bar\rho_k^2 + \beta_{k+1}^2)^{1/2}$
    2. $c_k = \bar\rho_k / \rho_k$
    3. $s_k = \beta_{k+1} / \rho_k$
    4. $\theta_{k+1} = s_k \alpha_{k+1} / \rho_k$
    5. $\bar\rho_{k+1} = -c_k \alpha_{k+1}$
    6. $\phi_k = c_k \bar\phi_k$
    7. $\bar\phi_{k+1} = s_k \bar\phi_k$
  2.3: (Update)
    1. $d_k = (1/\rho_k)\, p_k$
    2. $x_k = x_{k-1} + (\phi_k / \rho_k)\, p_k$
    3. $p_{k+1} = v_{k+1} - \theta_{k+1} p_k$
  2.4: (Test for convergence) If $\|r_k\| := \|\mathcal{A} x_k - b\| \le$ Tol, then STOP.

In this algorithm, the parameters $\beta_i$ and $\alpha_i$ are computed such that $\|u_i\|_2 = \|v_i\|_2 = 1$.

To solve problem (2.2) we apply the Algorithm LSQR($\mathcal{A}$, b) with $\mathcal{A} = AP$, where $P$ is an orthogonal projector onto the null space of $B$. In Step 2.1 we have to compute the vectors $AP v_k$ and $P A^T u_{k+1}$. Since $P v_k = v_k$ by construction (see [10]), we have $AP v_k = A v_k$, and we need to compute only one projection $P A^T u_{k+1}$ of the vector $A^T u_{k+1}$ onto the null space of the matrix $B$. This projection is not calculated explicitly; instead, as in the CG method, a corresponding lower level problem is solved, i.e. to compute $Pw$ we solve $q^* = \arg\min_q \|B^T q - w\|_2^2$ and set $Pw = w - B^T q^*$.
Theorem 2.4. Suppose that the Algorithm LSQR($AP$, b) results after $\bar m$ iterations in the vectors $x_{\bar m}, d_1, \ldots, d_{\bar m}$. The solution $x^*$ of the problem (2.2) and the corresponding covariance matrix $C$ (2.4) can be computed as $x^* = x_{\bar m}$, $C = D D^T$ with $D = (d_1, \ldots, d_{\bar m})$.

Proof. The first assertion follows from the theory of Krylov-type methods. We prove that $C = D D^T$. From the properties of LSQR [7], we get $D^T P A^T A P D = I$, where the projection matrix $P$ can be expressed as $P = Z Z^T$ with an orthogonal matrix $Z$ spanning the null space of $B$. Hence

\[
D^T Z Z^T A^T A Z Z^T D = (D^T Z)(Z^T A^T A Z)(Z^T D) = I. \qquad (2.6)
\]
We show that the matrix $Z^T D$ is nonsingular. Indeed, by LSQR [7] we have $D = V R^{-1}$, where $R$ is an upper bi-diagonal matrix and $V = (v_1, \ldots, v_{\bar m})$, $\operatorname{rank} V = \bar m$. Since $v_i \in \operatorname{Ker} B$ and $\operatorname{rank} V = \bar m$, it follows that $V = Z N$ with a nonsingular matrix $N$. Summing up, $Z^T D = Z^T Z N R^{-1} = N R^{-1}$ is nonsingular as a product of nonsingular matrices, and we get from (2.6) that $(Z^T A^T A Z)^{-1} = (Z^T D)(D^T Z)$. Hence

\[
C = Z (Z^T A^T A Z)^{-1} Z^T = Z (Z^T D)(D^T Z) Z^T = D D^T,
\]

since $Z Z^T D = Z Z^T Z N R^{-1} = Z N R^{-1} = V R^{-1} = D$.
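The following numpy sketch is a direct transcription of the LSQR iteration above applied to $AP$, with the lower level problem solved by a dense least-squares call; the test data are our own illustrative assumptions. Running the full $\bar m$ iterations and collecting $d_k = p_k/\rho_k$ reproduces $C = DD^T$ as stated in Theorem 2.4.

```python
import numpy as np

rng = np.random.default_rng(3)
k, n, m = 20, 8, 3                                # illustrative sizes (assumptions)
A = rng.standard_normal((k, n))
B = rng.standard_normal((m, n))
b = rng.standard_normal(k)
m_bar = n - m

def proj(w):                                      # P w = w - B^T q*,  q* = argmin ||B^T q - w||_2^2
    q, *_ = np.linalg.lstsq(B.T, w, rcond=None)
    return w - B.T @ q

beta = np.linalg.norm(b); u = b / beta            # beta_1 u_1 = b
v = proj(A.T @ u)                                 # alpha_1 v_1 = (AP)^T u_1
alpha = np.linalg.norm(v); v = v / alpha
p = v.copy(); x = np.zeros(n)
phibar, rhobar = beta, alpha
D = []
for _ in range(m_bar):
    u = A @ v - alpha * u                         # A P v_k = A v_k
    beta = np.linalg.norm(u); u = u / beta
    v_new = proj(A.T @ u) - beta * v
    alpha = np.linalg.norm(v_new)                 # may be tiny at the last step; the
    v = v_new / alpha                             # resulting v, p are then no longer used
    rho = np.hypot(rhobar, beta)
    c, s = rhobar / rho, beta / rho
    theta = s * alpha / rho
    rhobar = -c * alpha
    phi, phibar = c * phibar, s * phibar
    D.append(p / rho)                             # d_k = p_k / rho_k
    x = x + (phi / rho) * p
    p = v - theta * p
Dmat = np.column_stack(D)

_, _, Vt = np.linalg.svd(B); Z = Vt[m:].T
C_ref = Z @ np.linalg.inv(Z.T @ A.T @ A @ Z) @ Z.T
print(np.abs(Dmat @ Dmat.T - C_ref).max())        # expected to be near machine precision
```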
Remark 2.5. Here and in what follows we assume that all algorithms converge only after the full $\bar m$ iterations. If an algorithm converges after fewer than $\bar m$ iterations, that is, for some $k < \bar m$ we have $\|r_k\| \le$ Tol, then $x_k$ solves the problem (2.2), but for computing the matrix $C$ we need to continue the process, as we need the "complete" set of vectors $p_1, \ldots, p_{\bar m}$ (CG methods) or $d_1, \ldots, d_{\bar m}$ (LSQR methods).

Preliminary numerical results [6, 10] show, as expected, that the iterative process needs proper preconditioning in order to achieve a reasonable efficiency and accuracy in the elements of the covariance matrix. Our aim is to accelerate the solution process by applying appropriate preconditioners [1, 4, 9]. The question to be answered in the remainder of this paper is how to compute the linear operator $A^+$ and the covariance matrix in case the KKT systems are solved iteratively with preconditioning.
3. Computing the covariance matrix using preconditioned Krylov subspace methods

Solving problem (2.2) by a Krylov subspace method is equivalent to solving the following system of linear equations:

\[
P^T A^T A P x = P^T A^T b. \qquad (3.1)
\]
Here $P$ is, as before, an orthogonal projector onto the null space of $B$. Preconditioning consists in an equivalent reformulation of the original linear system (3.1),

\[
\bar{\tilde A} x = \bar b, \qquad (3.2)
\]

where the new matrix $\bar{\tilde A}$ has "better" properties than the matrix $\tilde A = P^T A^T A P$ from the original system (3.1). Suppose that we apply preconditioning, that is, we change the variables $x$ using a nonsingular matrix $W \in \mathbb{R}^{n \times n}$, $\bar x = W^{-1} x$. Then the problem (2.2) is equivalent to the following problem:

\[
\min_{\bar x} \|\bar A \bar x - b\|_2^2, \quad \text{s.t. } \bar B \bar x = 0, \qquad (3.3)
\]
in the sense that if $\bar x^*$ solves the problem (3.3) then $x^* = W \bar x^*$ solves the problem (2.2). Here $\bar A = AW$, $\bar B = BW$. Furthermore, the solution of problem (3.3) is equivalent to solving the linear system

\[
M^T \bar A^T \bar A M y = M^T \bar A^T b, \qquad (3.4)
\]

where $M$ is a projector onto $\operatorname{Ker} \bar B$: if $y^*$ is a solution of system (3.4) then $\bar x^* = M y^*$ solves the problem (3.3) and $x^* = W \bar x^*$ is a solution to problem (2.2).

We consider two types of projectors $M$ onto the null space of $\bar B$: $M = \bar P$ and $M = W^{-1} P W$, where $P$ and $\bar P$ are the orthogonal projection operators onto the subspaces $\operatorname{Ker} B$ and $\operatorname{Ker} \bar B$, respectively. These types of projection operators generate two types of preconditioning for system (3.1):

• Preconditioning of type I corresponds to solving system (3.4) with $M = \bar P$, i.e. the system
\[
\bar P^T W^T A^T A W \bar P y = \bar P^T W^T A^T b. \qquad (3.5)
\]
• Preconditioning of type II corresponds to solving system (3.4) with $M = W^{-1} P W$, i.e. the system
\[
W^T P^T A^T A P W y = W^T P^T A^T b. \qquad (3.6)
\]
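As a sanity check of the two formulations, the following dense numpy sketch (with an arbitrary SPD matrix $Q = WW^T$ and random data, both our own illustrative assumptions) solves (3.5) and (3.6) with a direct least-squares call and confirms that both recover the solution of (2.2) obtained from the KKT system.

```python
import numpy as np

rng = np.random.default_rng(4)
k, n, m = 20, 8, 3                                # illustrative sizes (assumptions)
A = rng.standard_normal((k, n))
B = rng.standard_normal((m, n))
b = rng.standard_normal(k)
M0 = rng.standard_normal((n, n))
Q = M0 @ M0.T + n * np.eye(n)                     # some SPD preconditioner (assumed)
W = np.linalg.cholesky(Q)                         # Q = W W^T

def orth_proj(Bmat):                              # orthogonal projector onto Ker(Bmat)
    return np.eye(Bmat.shape[1]) - Bmat.T @ np.linalg.solve(Bmat @ Bmat.T, Bmat)

P    = orth_proj(B)
Pbar = orth_proj(B @ W)                           # projector onto Ker(B W)
Abar = A @ W

# reference solution of (2.2) from the KKT system
K = np.block([[A.T @ A, B.T], [B, np.zeros((m, m))]])
x_ref = np.linalg.solve(K, np.concatenate([A.T @ b, np.zeros(m)]))[:n]

# type I:  Pbar W^T A^T A W Pbar y = Pbar W^T A^T b,   x = W Pbar y
y1, *_ = np.linalg.lstsq(Pbar @ Abar.T @ Abar @ Pbar, Pbar @ Abar.T @ b, rcond=None)
x1 = W @ (Pbar @ y1)

# type II: W^T P A^T A P W y = W^T P A^T b,            x = P W y
y2, *_ = np.linalg.lstsq(W.T @ P @ A.T @ A @ P @ W, W.T @ P @ A.T @ b, rcond=None)
x2 = P @ (W @ y2)

print(np.abs(x1 - x_ref).max(), np.abs(x2 - x_ref).max())
```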
3.1. Preconditioning of type I

3.1.1. Conjugate gradient method. Setting $\mathcal{A} = \bar A \bar P$, where $\bar P$ is an orthogonal projector onto the null space of the matrix $\bar B$, we may apply the Algorithm CG($\mathcal{A}$, b) directly to solve the problem (3.3). Note that now in step 2.1 we have $\mathcal{A} p_k = \bar A \bar P p_k = \bar A p_k$ and $\mathcal{A}^T \mathcal{A} p_k = \bar P \bar A^T \bar A p_k = \bar A^T \bar A p_k - \bar B^T q^*$, where $q^* = \arg\min_q \|\bar B^T q - \bar A^T \bar A p_k\|_2^2$.

Suppose that the Algorithm CG($\mathcal{A}$, b) with $\mathcal{A} = \bar A \bar P$ results after $\bar m$ iterations in the vectors $x_{\bar m}, p_1, \ldots, p_{\bar m}$. Then the solution of the problem (3.3) and the corresponding covariance matrix are computed as $\bar x^* = x_{\bar m}$, $\bar C = \bar P \operatorname{diag}(\bar\gamma_k,\, k = 1, \ldots, \bar m)\, \bar P^T$, where $\bar P = (p_1, p_2, \ldots, p_{\bar m})$, $\bar\gamma_k = 1/\|\bar A p_k\|_2^2$, $k = 1, \ldots, \bar m$.

Lemma 3.1. The solution $x^*$ of the problem (2.2) and the corresponding covariance matrix $C$ are computed as $x^* = W \bar x^*$, $C = W \bar C W^T$.

Proof. It follows from Lemma 2.1 that the matrix $\bar C$ solves the system

\[
\begin{pmatrix} \bar A^T \bar A & \bar B^T \\ \bar B & 0 \end{pmatrix} \begin{pmatrix} \bar C \\ \bar S \end{pmatrix} = \begin{pmatrix} I \\ 0 \end{pmatrix}
\;\Rightarrow\;
\begin{pmatrix} W^T & 0 \\ 0 & I \end{pmatrix} K \begin{pmatrix} W & 0 \\ 0 & I \end{pmatrix} \begin{pmatrix} \bar C \\ \bar S \end{pmatrix} = \begin{pmatrix} I \\ 0 \end{pmatrix}
\;\Rightarrow\;
K \begin{pmatrix} W \bar C \\ \bar S \end{pmatrix} = \begin{pmatrix} W^{-T} & 0 \\ 0 & I \end{pmatrix} \begin{pmatrix} I \\ 0 \end{pmatrix},
\]

while the covariance matrix $C$ solves the system (2.5). Hence

\[
\begin{pmatrix} W \bar C \\ \bar S \end{pmatrix} = K^{-1} \begin{pmatrix} W^{-T} \\ 0 \end{pmatrix} = K^{-1} \begin{pmatrix} I \\ 0 \end{pmatrix} W^{-T} = \begin{pmatrix} C \\ S \end{pmatrix} W^{-T}.
\]
It follows from the last equation that $W \bar C = C W^{-T}$ and $\bar S = S W^{-T}$. Thus $C = W \bar C W^T$, $S = \bar S W^T$.

The Algorithm CG($\bar A \bar P$, b) for problem (3.3) makes use of the matrix products $AW$ and $BW$, which is not always reasonable, see [5]. Using the variable transformation

\[
p_k^{(old)} = W^{-1} p_k, \quad r_k^{(old)} = W^T r_k, \quad x_k^{(old)} = W^{-1} x_k, \quad k = 1, 2, \ldots,
\]

and the matrix $Q = W W^T$, it is not difficult to rewrite the Algorithm CG($\bar A \bar P$, b) without carrying out the variable transformation explicitly.
Algorithm Preconditioned CG-I(A, B, b, Q) (for solving problem (2.2))
Step 1: (Initialization) $x_1 = 0$, $r_1 = A^T b - B^T q_1$, where $q_1$ solves $\min_q \|B^T q - A^T b\|_Q^2$, $p_1 = Q r_1$.
Step 2: For $k = 1, 2, 3, \ldots$ repeat steps 2.1–2.2.
  2.1: (Update)
    1. $\alpha_k = r_k^T Q r_k / (p_k^T A^T A p_k)$
    2. $x_{k+1} = x_k + \alpha_k p_k$
    3. $r_{k+1} = r_k - \alpha_k (A^T A p_k - B^T q_k)$, where $q_k$ solves $\min_q \|B^T q - A^T A p_k\|_Q^2$
    4. $\beta_k = r_{k+1}^T Q r_{k+1} / (r_k^T Q r_k)$
    5. $p_{k+1} = Q r_{k+1} + \beta_k p_k$
  2.2: (Test for convergence) If $\|r_k\| \le$ Tol, then STOP.

In this algorithm and in what follows, $\|s\|_Q^2 := s^T Q s$.
Lemma 3.2 (Computation of the covariance matrix with preconditioned CG). Suppose that the Algorithm PCG-I(A, B, b, Q) converges after $\bar m$ iterations and we get the vectors $x_{\bar m}$ and $p_1, \ldots, p_{\bar m}$. Then the solution $x^*$ of the problem (2.2) and the corresponding covariance matrix are given by

\[
x^* = x_{\bar m}, \qquad C = P \operatorname{diag}(\gamma_k,\, k = 1, \ldots, \bar m)\, P^T,
\]

where $P = (p_1, \ldots, p_{\bar m})$, $\gamma_k = 1/\|A p_k\|_2^2$, $k = 1, \ldots, \bar m$.
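A minimal dense sketch of Algorithm PCG-I is given below; the random data and the SPD preconditioner $Q = WW^T$ are our own illustrative assumptions, and the $Q$-weighted lower level problems are solved via their normal equations $BQB^Tq = BQw$. After $\bar m$ steps the directions $p_k$ reproduce the covariance matrix of Lemma 3.2.

```python
import numpy as np

rng = np.random.default_rng(5)
k, n, m = 20, 8, 3                                # illustrative sizes (assumptions)
A = rng.standard_normal((k, n))
B = rng.standard_normal((m, n))
b = rng.standard_normal(k)
m_bar = n - m
M0 = rng.standard_normal((n, n))
Q = M0 @ M0.T + n * np.eye(n)                     # assumed SPD preconditioner Q = W W^T

def proj_Q(w):
    # q solves min_q ||B^T q - w||_Q^2, i.e. B Q B^T q = B Q w; return w - B^T q
    q = np.linalg.solve(B @ Q @ B.T, B @ (Q @ w))
    return w - B.T @ q

x = np.zeros(n)
r = proj_Q(A.T @ b)                               # r_1
p = Q @ r                                         # p_1 = Q r_1
P_cols, gammas = [], []
for _ in range(m_bar):                            # full m_bar steps (cf. Remark 2.5)
    Ap = A @ p
    alpha = (r @ Q @ r) / (Ap @ Ap)
    x = x + alpha * p
    r_new = r - alpha * proj_Q(A.T @ Ap)
    P_cols.append(p)
    gammas.append(1.0 / (Ap @ Ap))
    beta = (r_new @ Q @ r_new) / (r @ Q @ r)
    p = Q @ r_new + beta * p
    r = r_new

Pmat = np.column_stack(P_cols)
C_pcg = Pmat @ np.diag(gammas) @ Pmat.T           # covariance as in Lemma 3.2
_, _, Vt = np.linalg.svd(B); Z = Vt[m:].T
C_ref = Z @ np.linalg.inv(Z.T @ A.T @ A @ Z) @ Z.T
print(np.abs(C_pcg - C_ref).max())                # expected to be small
```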
3.1.2. LSQR method. Similarly to the CG method, setting $\mathcal{A} = \bar A \bar P$, where $\bar P$ is an orthogonal projector onto the null space of the matrix $\bar B$, we may apply the Algorithm LSQR($\mathcal{A}$, b) directly to solve the problem (3.3). Note that now in step 2.1 we have $\mathcal{A} v_k = \bar A \bar P v_k = \bar A v_k$ and $\mathcal{A}^T u_{k+1} = \bar P \bar A^T u_{k+1} = \bar A^T u_{k+1} - \bar B^T q^*$, where $q^* = \arg\min_q \|\bar B^T q - \bar A^T u_{k+1}\|_2^2$.

Suppose that the Algorithm LSQR($\mathcal{A}$, b) with $\mathcal{A} = \bar A \bar P$ results after $\bar m$ iterations in the vectors $x_{\bar m}, d_1, \ldots, d_{\bar m}$. Then the solution of the problem (3.3) and the corresponding covariance matrix are given by $\bar x^* = x_{\bar m}$, $\bar C = \bar D \bar D^T$, where $\bar D = (d_1, d_2, \ldots, d_{\bar m})$.
The solution $x^*$ of the problem (2.2) and the corresponding covariance matrix $C$ are computed according to Lemma 3.1: $x^* = W \bar x^*$, $C = W \bar C W^T$.

As before, using the variable transformation

\[
p_k^{(old)} = W^{-1} p_k, \quad v_k^{(old)} = W^T v_k, \quad x_k^{(old)} = W^{-1} x_k, \quad d_k^{(old)} = W^{-1} d_k, \quad k = 1, 2, \ldots,
\]

and the matrix $Q = W W^T$, we rewrite the Algorithm LSQR($\bar A \bar P$, b) without carrying out the variable transformation explicitly.
Algorithm Preconditioned LSQR-I(A, B, b, Q) (for solving problem (2.2))
Step 1: (Initialization) $\beta_1 u_1 = b$, $\alpha_1 v_1 = A^T b - B^T q_0$, where $q_0$ solves the problem $\min_q \|B^T q - A^T b\|_Q^2$, $p_1 = Q v_1$, $x_0 = 0$, $\bar\phi_1 = \beta_1$, $\bar\rho_1 = \alpha_1$.
Step 2: For $k = 1, 2, 3, \ldots$ repeat steps 2.1–2.4.
  2.1: (Continue the bidiagonalization)
    1. $\beta_{k+1} u_{k+1} = A Q v_k - \alpha_k u_k$
    2. $\alpha_{k+1} v_{k+1} = A^T u_{k+1} - B^T q_k - \beta_{k+1} v_k$, where $q_k$ solves the problem $\min_q \|B^T q - A^T u_{k+1}\|_Q^2$
  2.2: as in Algorithm LSQR($\mathcal{A}$, b)
  2.3: (Update)
    1. $d_k = (1/\rho_k)\, p_k$
    2. $x_k = x_{k-1} + (\phi_k / \rho_k)\, p_k$
    3. $p_{k+1} = Q v_{k+1} - \theta_{k+1} p_k$
  2.4: (Test for convergence) If $\|r_k\| := \|A x_k - b\| \le$ Tol, then STOP.

In this algorithm, the parameters $\beta_i$ and $\alpha_i$ are computed in such a way that $\|u_i\|_2 = \|v_i\|_Q = 1$.

Lemma 3.3 (Computation of the covariance matrix with preconditioned LSQR). Suppose that the Algorithm PLSQR-I(A, B, b, Q) converges after $\bar m$ iterations and we get the vectors $x_{\bar m}$ and $d_1, \ldots, d_{\bar m}$. Then the solution $x^*$ of the problem (2.2) and the corresponding covariance matrix are given by

\[
x^* = x_{\bar m}, \qquad C = D D^T, \quad D = (d_1, \ldots, d_{\bar m}).
\]
3.2. Preconditioning of type II

3.2.1. CG method. Setting $\mathcal{A} = APW$, where $P$ is an orthogonal projector onto the null space of the matrix $B$, we may apply the Algorithm CG($\mathcal{A}$, b) directly to solve the system (3.6). Note that now in step 2.1 we need to compute $\mathcal{A} p_k = A u_k$, where

\[
u_k := P W p_k = W p_k - B^T q^*, \quad q^* = \arg\min_q \|B^T q - W p_k\|_2^2, \qquad (3.7)
\]

and $\mathcal{A}^T \mathcal{A} p_k = W^T P A^T A u_k$, where $P A^T A u_k = A^T A u_k - B^T q^*$ with $q^* = \arg\min_q \|B^T q - A^T A u_k\|_2^2$, which means that we have to solve the lower level problem twice.
Suppose that after applying the Algorithm CG($APW$, b) we get the vectors and numbers

\[
x_{\bar m},\; u_1, \ldots, u_{\bar m},\; \alpha_1, \ldots, \alpha_{\bar m} \qquad (3.8)
\]

(see (3.7) for the definition of the vectors $u_k$). Let us show how we can use this information in order to compute the solution $x^*$ of the problem (2.2) and the corresponding matrix $C$.

Lemma 3.4. Suppose that the Algorithm CG($APW$, b) converges after $\bar m$ iterations and we get the vectors and numbers (3.8). Then the solution $x^*$ of the problem (2.2) and the corresponding covariance matrix are given by

\[
x^* = \sum_{k=1}^{\bar m} \alpha_k u_k, \qquad C = U \operatorname{diag}(\gamma_k,\, k = 1, \ldots, \bar m)\, U^T, \qquad (3.9)
\]

where $U = (u_1, \ldots, u_{\bar m})$, $\gamma_k = 1/\|A u_k\|_2^2$, $k = 1, \ldots, \bar m$.
Proof. By construction, $y^* = x_{\bar m}$ solves the system (3.6). Then $\bar x^* = M y^*$ with $M = W^{-1} P W$ solves the problem (3.3) and $x^* = W \bar x^*$ is a solution to problem (2.2). Hence,

\[
x^* = W M y^* = P W y^* = \sum_{k=1}^{\bar m} \alpha_k P W p_k = \sum_{k=1}^{\bar m} \alpha_k u_k.
\]

By the properties of the CG method, the vectors

\[
p_k, \quad k = 1, \ldots, \bar m, \qquad (3.10)
\]

possess the following properties:
a1) the vectors (3.10) are linearly independent;
a2) $p_k = W^T \tilde p_k$, where $\tilde p_k \in \operatorname{Ker} B$, $k = 1, \ldots, \bar m$;
a3) the vectors (3.10) are $W^T P A^T A P W$-conjugate:

\[
p_i^T W^T P A^T A P W p_j \begin{cases} = 0 & \text{if } i \ne j, \\ \ne 0 & \text{if } i = j. \end{cases}
\]

Let us show that the vectors (see (3.7))

\[
u_k = P W p_k, \quad k = 1, \ldots, \bar m, \qquad (3.11)
\]

possess similar properties:
b1) the vectors (3.11) are linearly independent;
b2) the vectors (3.11) form a basis of the null space of $B$;
b3) the vectors (3.11) are $A^T A$-conjugate:

\[
u_i^T A^T A u_j \begin{cases} = 0 & \text{if } i \ne j, \\ \ne 0 & \text{if } i = j. \end{cases}
\]
We first show that property b1) holds true. Suppose the contrary: then there exists a vector $l \in \mathbb{R}^{\bar m}$ such that $(u_1, \ldots, u_{\bar m})\, l = 0$, $l \ne 0$, or equivalently $P W (p_1, \ldots, p_{\bar m})\, l = 0$, $l \ne 0$. It follows from a2) that

\[
P W W^T (\tilde p_1, \ldots, \tilde p_{\bar m})\, l = 0, \quad l \ne 0. \qquad (3.12)
\]

Multiplying (3.12) with $l^T (\tilde p_1, \ldots, \tilde p_{\bar m})^T$ and taking into account that $\tilde p_k \in \operatorname{Ker} B$ (and hence $P \tilde p_k = \tilde p_k$), we get

\[
l^T (\tilde p_1, \ldots, \tilde p_{\bar m})^T W W^T (\tilde p_1, \ldots, \tilde p_{\bar m})\, l = 0.
\]

Since $W$ is nonsingular, the last equality yields $(\tilde p_1, \ldots, \tilde p_{\bar m})\, l = 0$, $l \ne 0$, which contradicts the fact that the vectors $\tilde p_1, \ldots, \tilde p_{\bar m}$ are linearly independent (see properties a1) and a2)). Hence property b1) holds true. Property b2) follows from b1) and the fact that $\bar m$ is the dimension of the null space of $B$. Property b3) follows immediately from a3). The computation of the covariance matrix (3.9) then follows from the properties b1)–b3) and [3].

It is easy to modify the Algorithm CG($APW$, b) in order to compute recursively the vectors (3.11) and the vector $x^*$. Further, we modify the algorithm such that it makes use of the matrix $Q = W W^T$:

Algorithm Preconditioned CG-II(A, B, b, Q) (for solving problem (2.2))
Step 1: (Initialization) $x_1 = 0$, $r_1 = A^T b - B^T q_1$, where $q_1$ solves $\min_q \|B^T q - A^T b\|_2^2$, $p_1 = Q r_1$.
Step 2: For $k = 1, 2, 3, \ldots$ repeat steps 2.1–2.2.
  2.1: (Update)
    1. $u_k = p_k - B^T \tilde q_k$, where $\tilde q_k$ solves $\min_q \|B^T q - p_k\|_2^2$
    2. $\alpha_k = r_k^T Q r_k / (u_k^T A^T A u_k)$
    3. $x_{k+1} = x_k + \alpha_k u_k$
    4. $r_{k+1} = r_k - \alpha_k (A^T A u_k - B^T q_k)$, where $q_k$ solves $\min_q \|B^T q - A^T A u_k\|_2^2$
    5. $\beta_k = r_{k+1}^T Q r_{k+1} / (r_k^T Q r_k)$
    6. $p_{k+1} = Q r_{k+1} + \beta_k p_k$
  2.2: as in Algorithm PCG-I(A, B, b, Q)
Note that in Step 2.1 we have to solve the lower level problem twice.
Lemma 3.5 (Computation of the covariance matrix with preconditioned CG). Suppose that the Algorithm PCG-II(A, B, b, Q) converges after $\bar m$ iterations and we get the vectors $x_{\bar m}$ and $u_1, \ldots, u_{\bar m}$. Then the solution $x^*$ of the problem (2.2) and the corresponding covariance matrix are given by

\[
x^* = x_{\bar m}, \qquad C = U \operatorname{diag}(\gamma_k,\, k = 1, \ldots, \bar m)\, U^T,
\]

where $U = (u_1, \ldots, u_{\bar m})$, $\gamma_k = 1/\|A u_k\|_2^2$, $k = 1, \ldots, \bar m$.
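Analogously, the sketch below transcribes Algorithm PCG-II for illustrative random data and an assumed SPD matrix $Q$; note the two (unweighted) lower level problems per iteration. The directions $u_k$ reproduce the covariance matrix of Lemma 3.5.

```python
import numpy as np

rng = np.random.default_rng(6)
k, n, m = 20, 8, 3                                # illustrative sizes (assumptions)
A = rng.standard_normal((k, n))
B = rng.standard_normal((m, n))
b = rng.standard_normal(k)
m_bar = n - m
M0 = rng.standard_normal((n, n))
Q = M0 @ M0.T + n * np.eye(n)                     # assumed SPD matrix Q = W W^T

def proj(w):                                      # P w = w - B^T q*,  q* = argmin ||B^T q - w||_2^2
    q, *_ = np.linalg.lstsq(B.T, w, rcond=None)
    return w - B.T @ q

x = np.zeros(n)
r = proj(A.T @ b)                                 # r_1 (unweighted projection here)
p = Q @ r                                         # p_1 = Q r_1
U_cols, gammas = [], []
for _ in range(m_bar):
    u = proj(p)                                   # first lower level solve: u_k = P p_k
    Au = A @ u
    alpha = (r @ Q @ r) / (Au @ Au)
    x = x + alpha * u
    r_new = r - alpha * proj(A.T @ Au)            # second lower level solve
    U_cols.append(u)
    gammas.append(1.0 / (Au @ Au))
    beta = (r_new @ Q @ r_new) / (r @ Q @ r)
    p = Q @ r_new + beta * p
    r = r_new

Umat = np.column_stack(U_cols)
C_pcg2 = Umat @ np.diag(gammas) @ Umat.T          # covariance as in Lemma 3.5
_, _, Vt = np.linalg.svd(B); Z = Vt[m:].T
C_ref = Z @ np.linalg.inv(Z.T @ A.T @ A @ Z) @ Z.T
print(np.abs(C_pcg2 - C_ref).max())               # expected to be small
```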
3.2.2. LSQR method. As in the previous subsubsection, setting $\mathcal{A} = APW$, where $P$ is an orthogonal projector onto the null space of the matrix $B$, we may apply the Algorithm LSQR($\mathcal{A}$, b) directly to solve the system (3.6). Note that now in step 2.1 we need to compute $\mathcal{A} v_k = A P W v_k$, where $P W v_k = W v_k - B^T q^*$, $q^* = \arg\min_q \|B^T q - W v_k\|_2^2$, and $\mathcal{A}^T u_{k+1} = W^T P A^T u_{k+1}$, where $P A^T u_{k+1} = A^T u_{k+1} - B^T q^*$ with $q^* = \arg\min_q \|B^T q - A^T u_{k+1}\|_2^2$, which means that we have to solve the lower level problem twice.

Suppose that after applying the Algorithm LSQR($APW$, b) we get the vectors $x_{\bar m}, d_1, \ldots, d_{\bar m}; v_1, \ldots, v_{\bar m}$. Let us show how we can use this information in order to compute the solution $x^*$ of the problem (2.2) and the corresponding matrix $C$. By construction, $y^* = x_{\bar m}$ solves the system (3.6). Then $\bar x^* = M y^*$ with $M = W^{-1} P W$ solves the problem (3.3) and $x^* = W \bar x^*$ is a solution to problem (2.2). Hence,

\[
x^* = W M y^* = P W y^* = \sum_{k=1}^{\bar m} \frac{\phi_k}{\rho_k}\, P W p_k.
\]

Moreover, using the properties of the LSQR method [7, 8] (see also the proof of Theorem 2.4) one can show that

\[
C = P W \tilde D \tilde D^T W^T P, \quad \text{where } \tilde D = (d_1, \ldots, d_{\bar m}).
\]

As $P W d_k = \frac{1}{\rho_k} P W p_k$, in order to compute $x^*$ and $C$ we need the vectors

\[
P W p_k, \quad k = 1, \ldots, \bar m. \qquad (3.13)
\]

Since $p_1 = v_1$, $p_k = v_k - \theta_k p_{k-1}$, $k = 2, 3, \ldots$, and the vectors $P W v_k$ have already been calculated in the algorithm (see step 2.1 and the remarks at the beginning of this subsection), it is easy to modify the Algorithm LSQR($APW$, b) in order to compute recursively the vectors (3.13) and the vector $x^*$. Further, we modify the algorithm such that it makes use of the matrix $Q = W W^T$:

Algorithm Preconditioned LSQR-II(A, B, b, Q) (for solving problem (2.2))
Step 1: (Initialization) $\beta_1 u_1 = b$, $\alpha_1 v_1 = A^T u_1 - B^T q_0$, where $q_0$ solves $\min_q \|B^T q - A^T u_1\|_2^2$, $p_1 = Q v_1$, $g_0 = 0$, $x_0 = 0$, $\bar\phi_1 = \beta_1$, $\bar\rho_1 = \alpha_1$, $\theta_1 = 0$.
Step 2: For $k = 1, 2, 3, \ldots$ repeat steps 2.1–2.4.
  2.1: (Continue the bidiagonalization)
    1. $\hat v_k = Q v_k - B^T \hat q_k$, where $\hat q_k$ solves $\min_q \|B^T q - Q v_k\|_2^2$
    2. $\beta_{k+1} u_{k+1} = A \hat v_k - \alpha_k u_k$
    3. $\alpha_{k+1} v_{k+1} = A^T u_{k+1} - B^T q_k - \beta_{k+1} v_k$, where $q_k$ solves $\min_q \|B^T q - A^T u_{k+1}\|_2^2$
    4. $g_k = \hat v_k - \theta_k g_{k-1}$
  2.2: as in Algorithm LSQR($\mathcal{A}$, b)
  2.3: (Update)
    1. $d_k = (1/\rho_k)\, g_k$
    2. $x_k = x_{k-1} + (\phi_k / \rho_k)\, g_k$
    3. $p_{k+1} = Q v_{k+1} - \theta_{k+1} p_k$
  2.4: as in Algorithm PLSQR-I(A, B, b, Q)

In this algorithm, the parameters $\beta_i$ and $\alpha_i$ are computed in such a way that $\|u_i\|_2 = \|v_i\|_Q = 1$. Note that in Step 2.1 we have to solve the lower level problem twice: to compute $P Q v_k$ and $P A^T u_{k+1}$.

Lemma 3.6 (Computation of the covariance matrix). Suppose that the Algorithm PLSQR-II(A, B, b, Q) converges after $\bar m$ iterations and we get the vectors $x_{\bar m}$ and $d_1, \ldots, d_{\bar m}$. Then the solution $x^*$ of the problem (2.2) and the corresponding covariance matrix are given by

\[
x^* = x_{\bar m}, \qquad C = D D^T, \quad D = (d_1, \ldots, d_{\bar m}).
\]
3.2.3. Remarks.

Remark 3.7. In the Algorithms PCG-I(A, B, b, Q), PCG-II(A, B, b, Q), PLSQR-I(A, B, b, Q) and PLSQR-II(A, B, b, Q), matrix products like $AW$ and $BW$ are never explicitly formed. Only the action of the preconditioner operation $Q := W W^T$ on a given vector needs to be computed. This property is important for systems resulting from PDE discretizations. In this sense these algorithms are preferable to the corresponding Algorithms CG($\bar A \bar P$, b), CG($APW$, b), LSQR($\bar A \bar P$, b) and LSQR($APW$, b).

Remark 3.8. Obviously, the vectors $x_k$, $k = 1, 2, \ldots$, generated by the Algorithm M($\bar A \bar P$, b) (or by the Algorithm M($APW$, b)) in exact arithmetic are connected with the vectors $\bar x_k$, $k = 1, 2, \ldots$, generated by the Algorithm PM-I(A, B, b, Q) (or by the Algorithm PM-II(A, B, b, Q)) as follows: $\bar x_k = W x_k$, $k = 1, 2, \ldots$. Here M = CG or M = LSQR. Thus, these vector sequences may be considered as the same except for a multiplication with $W$. But we can show that the Algorithms PM-I(A, B, b, Q) and PM-II(A, B, b, Q) generate in general completely different vector sequences.

Remark 3.9. Algorithm PCG-I(A, B, b, Q) and Algorithm 3.1 (preconditioned CG in expanded form) from [4] are the same.
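To illustrate Remark 3.7, the toy snippet below represents a hypothetical Jacobi-type preconditioner purely through its action on a vector; inside the preconditioned algorithms every product of $Q$ with a vector would be replaced by such a call, so neither $W$, $AW$ nor $BW$ is ever formed.

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((20, 8))                  # illustrative Jacobian block (assumption)

# Hypothetical Jacobi-type choice: Q = diag(1 / ||A e_i||_2^2). Only its action
# on a vector is needed, so Q (and hence W, AW, BW) is never formed explicitly.
q_diag = 1.0 / np.einsum('ij,ij->j', A, A)
apply_Q = lambda v: q_diag * v                    # action of Q on a vector

# Inside PCG-I/PCG-II every product "Q @ vector" would be replaced by apply_Q(vector).
print(apply_Q(np.ones(8)))
```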
4. Conclusions

For solving constrained parameter estimation and optimal design problems, we need the covariance matrix of the parameter estimates and its derivatives. Hence, the development of effective methods for the representation and computation of the covariance matrix and its derivatives, based on iterative methods, is crucial for practical applications. In this paper, we have shown that when solving linearized constrained least squares problems by Krylov subspace methods we obtain these matrices as a by-product, practically for free. The forthcoming research will be devoted to numerical aspects, including the choice of effective preconditioners and the efficient implementation of the described methods for parameter estimation and design of optimal experiments in processes defined by partial differential equations.
References

[1] A. Battermann and E. W. Sachs. Block preconditioners for KKT systems in PDE-governed optimal control problems. In K. H. Hoffman, R. H. W. Hoppe, and V. Schulz, editors, Fast Solution of Discretized Optimization Problems, 1–18. ISNM, Int. Ser. Numer. Math. 138, 2001.
[2] H. G. Bock. Randwertproblemmethoden zur Parameteridentifizierung in Systemen nichtlinearer Differentialgleichungen, volume 183 of Bonner Mathematische Schriften. University of Bonn, 1987.
[3] H. G. Bock, E. A. Kostina, and O. I. Kostyukova. Conjugate gradient methods for computing covariance matrices for constrained parameter estimation problems. SIAM Journal on Matrix Analysis and Applications, 29:626–642, 2007.
[4] H. S. Dollar and A. J. Wathen. Approximate factorization constraint preconditioners for saddle-point matrices. SIAM J. Sci. Comput., 27(5):1555–1572, 2006.
[5] C. T. Kelley. Iterative Methods for Linear and Nonlinear Equations. SIAM, Philadelphia, 1995.
[6] E. Kostina, M. Saunders, and I. Schierle. Computation of covariance matrices for constrained parameter estimation problems using LSQR. Technical Report, Department of Mathematics and Computer Science, University of Marburg, 2008.
[7] C. C. Paige and M. A. Saunders. LSQR: An algorithm for sparse linear equations and sparse least-squares. ACM Trans. Math. Softw., 8(1):43–71, 1982.
[8] C. C. Paige and M. A. Saunders. LSQR: Sparse linear equations and least-squares. ACM Trans. Math. Softw., 8(2):195–209, 1982.
[9] T. Rees, H. S. Dollar, and A. J. Wathen. Optimal solvers for PDE-constrained optimization. Technical Report RAL-TR-2008-018, Rutherford Appleton Laboratory, 2008.
[10] I. Schierle. Computation of Covariance Matrices for Constrained Nonlinear Parameter Estimation Problems in Dynamic Processes Using Iterative Linear Algebra Methods. Diploma thesis, Universität Heidelberg, 2008.
Ekaterina Kostina
University of Marburg
Hans-Meerwein-Strasse
35032 Marburg
Germany
e-mail: [email protected]

Olga Kostyukova
Institute of Mathematics
Belarus Academy of Sciences
Surganov Str. 11
220072 Minsk
Belarus
e-mail: [email protected]