AD of Matrix Calculations: Regression and Cholesky

Brian Huge
Danske Bank
[email protected]

1 Notation and simple matrix rules

We will use the notation $A^T$ for the transpose of $A$ and $A^{-1}$ for the inverse, with $A^{-T} \equiv \left(A^T\right)^{-1}$. Also, we will define the trace by
$$\mathrm{Tr}\, A = \sum_i A_{ii}$$

We will also use the notation $C = A \circ B$ for an elementwise multiplication, i.e. $C_{ij} = A_{ij} B_{ij}$. $I$ is the identity matrix, hence $A = AI = IA$. We will use the formulas ($D$ is a diagonal matrix)
$$\mathrm{Tr}\left(A^T\right) = \mathrm{Tr}(A)$$
$$\mathrm{Tr}(AB) = \mathrm{Tr}(BA)$$
$$\mathrm{Tr}(A + B) = \mathrm{Tr}(A) + \mathrm{Tr}(B)$$
$$\mathrm{Tr}\left(A (B \circ C)\right) = \mathrm{Tr}\left(\left(A \circ B^T\right) C\right)$$
$$D (B \circ C) = (DB) \circ C$$
$$(DB) \circ I = D (B \circ I) = (B \circ I) D = (I \circ B) D = (BD) \circ I$$

For a symmetric matrix $A = A^T$ and an antisymmetric matrix $B = -B^T$ we have
$$\mathrm{Tr}(AB) = \mathrm{Tr}\left(-A^T B^T\right) = \mathrm{Tr}(-BA) = \mathrm{Tr}(-AB) \iff \mathrm{Tr}(AB) = 0$$
This means that for a general matrix $C$ and a symmetric matrix $A$
$$\mathrm{Tr}(CA) = \mathrm{Tr}\left(A^T C^T\right) = \mathrm{Tr}\left(C^T A^T\right) = \mathrm{Tr}\left(C^T A\right) \iff \mathrm{Tr}(CA) = \frac{1}{2} \mathrm{Tr}\left(\left(C + C^T\right) A\right)$$
Assume we have an output $V$ where we want to calculate sensitivities wrt some parameters $(x_i)_{i=1,\dots,N}$. We define the adjoints
$$\bar{x}_i \equiv \frac{\partial V}{\partial x_i}$$

Then for matrices $A$, $B$, $C$ with $C \equiv f(A, B)$ we have
$$dC = \frac{\partial f}{\partial A} dA + \frac{\partial f}{\partial B} dB$$
and from the chain rule we have
$$\frac{\partial V}{\partial x_k} = \sum_{i,j} \frac{\partial V}{\partial C_{ij}} \frac{\partial C_{ij}}{\partial x_k} = \sum_{i,j} \bar{C}_{ij} \frac{\partial C_{ij}}{\partial x_k}$$
hence
$$\mathrm{Tr}\left(\bar{A}^T dA\right) + \mathrm{Tr}\left(\bar{B}^T dB\right) = \sum_{i,j} \bar{A}_{ij}\, dA_{ij} + \sum_{i,j} \bar{B}_{ij}\, dB_{ij} = dV = \sum_{i,j} \bar{C}_{ij}\, dC_{ij} = \mathrm{Tr}\left(\bar{C}^T dC\right) = \mathrm{Tr}\left(\bar{C}^T \left(\frac{\partial f}{\partial A} dA + \frac{\partial f}{\partial B} dB\right)\right)$$
and we conclude
$$\bar{A} = \left(\frac{\partial f}{\partial A}\right)^T \bar{C}, \qquad \bar{B} = \left(\frac{\partial f}{\partial B}\right)^T \bar{C}$$

which is the recipe for working backwards from outputs to inputs. The results in the rest of this paper can all be derived using this recipe.
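As a quick numerical illustration of this recipe, here is a minimal NumPy sketch for $f(A, B) = AB$ (the multiplication rule of Section 2.2 below); the test function $V = \mathrm{Tr}\left(M^T C\right)$ with a fixed weight matrix $M$, and all variable names, are choices made here for the check, not taken from [1]:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
M = rng.standard_normal((3, 5))   # fixed weights, so C_bar = dV/dC = M

def V(A, B):
    return np.sum(M * (A @ B))    # V = Tr(M^T C) with C = f(A, B) = A B

# The recipe applied to f(A, B) = A B gives
A_bar = M @ B.T                   # A_bar = C_bar B^T
B_bar = A.T @ M                   # B_bar = A^T C_bar

# Central finite differences on single entries as a check
h = 1e-6
EA = np.zeros_like(A); EA[1, 2] = 1.0
EB = np.zeros_like(B); EB[0, 3] = 1.0
print(A_bar[1, 2], (V(A + h * EA, B) - V(A - h * EA, B)) / (2 * h))
print(B_bar[0, 3], (V(A, B + h * EB) - V(A, B - h * EB)) / (2 * h))
```

Both printed pairs should agree to several digits; the same pattern can be used to check every rule stated in the rest of this paper.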

2 Getting started

The results in Section 2 are from [1].

2.1 Addition

$$C = A + B$$
$$dC = dA + dB$$

Now, use the recipe to see
$$\mathrm{Tr}\left(\bar{C}^T dC\right) = \mathrm{Tr}\left(\bar{C}^T dA\right) + \mathrm{Tr}\left(\bar{C}^T dB\right)$$
which leaves the result
$$\bar{A} = \bar{C}, \qquad \bar{B} = \bar{C}$$
The next results can all be derived the same way, so we just state the results.

2.2 Multiplication

$$C = AB$$
$$dC = dA\, B + A\, dB$$
$$\bar{A} = \bar{C} B^T, \qquad \bar{B} = A^T \bar{C}$$

2.3 Inverse

$$C = A^{-1}$$
$$dC = -C\, dA\, C$$
$$\bar{A} = -C^T \bar{C} C^T$$

2.4 Determinant

$$C = \det(A)$$
$$dC = C\, \mathrm{Tr}\left(A^{-1} dA\right)$$
$$\bar{A} = C \bar{C} A^{-T}$$

2.5 Matrix inverse product

$$C = A^{-1} B$$
$$dC = -A^{-1} dA\, A^{-1} B + A^{-1} dB$$
$$\bar{A} = -A^{-T} \bar{C} B^T A^{-T} = -A^{-T} \bar{C} C^T, \qquad \bar{B} = A^{-T} \bar{C}$$

2.6 First quadratic form

$$C = B^T A B$$
$$dC = dB^T A B + B^T dA\, B + B^T A\, dB$$
$$\bar{A} = B \bar{C} B^T, \qquad \bar{B} = A B \bar{C}^T + A^T B \bar{C}$$

2.7 Second quadratic form

$$C = B^T A^{-1} B$$
$$dC = dB^T A^{-1} B - B^T A^{-1} dA\, A^{-1} B + B^T A^{-1} dB$$
$$\bar{A} = -A^{-T} B \bar{C} B^T A^{-T}, \qquad \bar{B} = A^{-1} B \bar{C}^T + A^{-T} B \bar{C}$$
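The same finite-difference pattern also checks the less obvious rules; here is a NumPy sketch for the second quadratic form (again with the test function $V = \mathrm{Tr}\left(M^T C\right)$; the well-conditioned choice of $A$ and all names are choices made here):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 4, 2
A = rng.standard_normal((n, n)) + n * np.eye(n)   # keep A well-conditioned
B = rng.standard_normal((n, k))
M = rng.standard_normal((k, k))                   # C_bar = M

def V(A, B):
    C = B.T @ np.linalg.solve(A, B)               # C = B^T A^{-1} B
    return np.sum(M * C)

Ainv = np.linalg.inv(A)
A_bar = -Ainv.T @ B @ M @ B.T @ Ainv.T            # A_bar = -A^{-T} B C_bar B^T A^{-T}
B_bar = Ainv @ B @ M.T + Ainv.T @ B @ M           # B_bar = A^{-1} B C_bar^T + A^{-T} B C_bar

h = 1e-6
EA = np.zeros_like(A); EA[0, 1] = 1.0
EB = np.zeros_like(B); EB[2, 0] = 1.0
print(A_bar[0, 1], (V(A + h * EA, B) - V(A - h * EA, B)) / (2 * h))
print(B_bar[2, 0], (V(A, B + h * EB) - V(A, B - h * EB)) / (2 * h))
```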

2.8 Eigenvalues and SVD

Define
$$U = [U_1 | U_2], \qquad S = \begin{bmatrix} S_1 & 0_{k,n-k} \\ 0_{m-k,k} & 0_{m-k,n-k} \end{bmatrix}, \qquad V = [V_1 | V_2]$$
where $U$, $V$ are orthogonal matrices such that
$$U^T U = U U^T = I_m$$
$$V^T V = V V^T = I_n$$
$$U_1^T U_1 = V_1^T V_1 = I_k$$
$$U_1^T U_2 = 0_{k,m-k}$$
$$V_1^T V_2 = 0_{k,n-k}$$
$$U_2 U_2^T = I_m - U_1 U_1^T$$
$$V_2 V_2^T = I_n - V_1 V_1^T$$

In the following two subsections we give slightly extended versions of [1]. We have supplied a proof in Appendix 8.1.

2.8.1 Eigenvalues of a symmetric matrix with distinct eigenvalues

Let $A$ be a symmetric $m \times m$ matrix with $k$ distinct positive eigenvalues and $m - k$ eigenvalues which are all 0. Then there exist an $m \times k$ matrix $U_1$ of eigenvectors and a $k \times k$ diagonal matrix $S_1$ with the positive eigenvalues in the diagonal. Then we have
$$A = U_1 S_1 U_1^T$$
$$\bar{A} = U_1 \left(\left(U_1^T \bar{U}_1\right) \circ F_1 + \bar{S}_1\right) U_1^T + \left(I - U_1 U_1^T\right) \bar{U}_1 S_1^{-1} U_1^T \qquad (1)$$
where $(F_1)_{ij} = \frac{1}{d_j - d_i}$ for $i \neq j$ and 0 for $i = j$, with $d_i$ the $i$-th diagonal element of $S_1$. Notice that this way eigenvalues that are 0 cannot change. If we want sensitivities wrt an eigenvalue that is 0, it should be treated as any other eigenvalue and it should be distinct.
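A minimal NumPy check of the $\bar{S}_1$ term in (1), assuming $V$ depends only on the eigenvalues (so $\bar{U}_1 = 0$) and $A$ has full rank $k = m$; the cubic test function is a choice made here:

```python
import numpy as np

rng = np.random.default_rng(2)
m = 5
X = rng.standard_normal((m, m))
A = X @ X.T                       # symmetric positive definite, so k = m

def V(A):
    d = np.linalg.eigvalsh(A)     # V uses only eigenvalues, hence U1_bar = 0
    return np.sum(d**3)

d, U = np.linalg.eigh(A)
S_bar = np.diag(3 * d**2)         # bar(d_i) = dV/d(d_i) = 3 d_i^2
A_bar = U @ S_bar @ U.T           # (1) with U1_bar = 0 and I - U1 U1^T = 0

# Check along a symmetric perturbation direction E
h = 1e-6
E = np.zeros((m, m)); E[0, 1] = E[1, 0] = 1.0
fd = (V(A + h * E) - V(A - h * E)) / (2 * h)
print(A_bar[0, 1] + A_bar[1, 0], fd)   # Tr(A_bar^T E) vs finite difference
```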

2.8.2 SVD with distinct singular values

Let $A$ be an $m \times n$ matrix with $k$ distinct singular values different from 0. Then there exist an $m \times k$ matrix $U_1$, a $k \times k$ diagonal matrix $S_1$ with the singular values, and an $n \times k$ matrix $V_1$ such that
$$A = U_1 S_1 V_1^T$$
$$\bar{A} = U_1 \left[I \circ \bar{S}_1 + G \circ \left(S_1 \left(V_1^T \bar{V}_1 - \bar{V}_1^T V_1\right) + \left(U_1^T \bar{U}_1 - \bar{U}_1^T U_1\right) S_1\right)\right] V_1^T + U_1 S_1^{-1} \bar{V}_1^T \left(I_n - V_1 V_1^T\right) + \left(I_m - U_1 U_1^T\right) \bar{U}_1 S_1^{-1} V_1^T \qquad (2)$$

where $G_{ij} = \frac{1}{d_j^2 - d_i^2}$ for $i \neq j$ and 0 for $i = j$. Notice again that this way singular values that are 0 cannot change. If we want sensitivities wrt a singular value that is 0, it should be treated as any other singular value and it should be distinct. Also, if $A$ is symmetric then $n = m$ and $V_1 = U_1$. To see that (1) is the same as (2), we notice that
$$\bar{A} = U_1 \left[I \circ \bar{S}_1 + G \circ \left(S_1 \left(U_1^T \bar{U}_1 - \bar{U}_1^T U_1\right) + \left(U_1^T \bar{U}_1 - \bar{U}_1^T U_1\right) S_1\right)\right] U_1^T + U_1 S_1^{-1} \bar{U}_1^T \left(I_m - U_1 U_1^T\right) + \left(I_m - U_1 U_1^T\right) \bar{U}_1 S_1^{-1} U_1^T$$
$$= U_1 \left[\frac{1}{2} I \circ \bar{S}_1 + F \circ \left(U_1^T \bar{U}_1\right)\right] U_1^T + \left(I_m - U_1 U_1^T\right) \bar{U}_1 S_1^{-1} U_1^T + \left(U_1 \left[\frac{1}{2} I \circ \bar{S}_1 + F \circ \left(U_1^T \bar{U}_1\right)\right] U_1^T + \left(I_m - U_1 U_1^T\right) \bar{U}_1 S_1^{-1} U_1^T\right)^T$$
which is exactly the symmetric part of
$$U_1 \left[I \circ \bar{S}_1 + F \circ \left(U_1^T\, 2\bar{U}_1\right)\right] U_1^T + 2 \left(I_m - U_1 U_1^T\right) \bar{U}_1 S_1^{-1} U_1^T$$
as in (1). The $2\bar{U}_1$ appears because we have added the contributions for both $\bar{U}_1$ and $\bar{V}_1$.
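Analogously for (2): a NumPy sketch assuming $V$ depends only on the singular values (so $\bar{U}_1 = \bar{V}_1 = 0$ and only the $I \circ \bar{S}_1$ term survives); the squared-norm test function is a choice made here:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 4, 6
A = rng.standard_normal((m, n))

def V(A):
    s = np.linalg.svd(A, compute_uv=False)   # singular values only
    return np.sum(s**2)

U1, s, V1t = np.linalg.svd(A, full_matrices=False)   # thin SVD: A = U1 S1 V1^T
A_bar = U1 @ np.diag(2 * s) @ V1t                    # (2) with U1_bar = V1_bar = 0

h = 1e-6
E = np.zeros((m, n)); E[1, 3] = 1.0
fd = (V(A + h * E) - V(A - h * E)) / (2 * h)
print(A_bar[1, 3], fd)
```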

2.9 Cholesky Decomposition

For a positive definite matrix $C$ there is a unique Cholesky decomposition. Define
$$C = LL^T$$
We will follow [1]. In our Cholesky implementation we have the standard doubly-nested loop.
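A minimal Python sketch of this loop (a generic textbook implementation with names chosen here, not the exact listing from [1]):

```python
import math

def cholesky(C):
    # Lower-triangular factor L with C = L L^T, for symmetric positive definite C.
    n = len(C)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = C[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(s)    # diagonal pivot
            else:
                L[i][j] = s / L[j][j]     # column below the pivot
    return L
```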

For the Tikhonov-regularized regression $\min_x \left\|A^T x - B\right\|^2 + \left\|\lambda x\right\|^2$ solved through the SVD $A = U S V^T$, the solution is $x = U D V^T B$ with $d_i = D_{ii} = \frac{s_i}{s_i^2 + \lambda^2} 1_{\{s_i > \epsilon\}}$, and $\Sigma$ denotes the diagonal matrix with $\Sigma_{ii} = \frac{1}{s_i^2 + \lambda^2} 1_{\{s_i > \epsilon\}}$. Differentiating $d_i$ wrt $s_i$ and $\lambda$ gives
$$\bar{S}_{ii} = \bar{d}_i \frac{\lambda^2 - s_i^2}{\left(s_i^2 + \lambda^2\right)^2} 1_{\{s_i > \epsilon\}}$$
and
$$\bar{\lambda} = \sum_i \bar{d}_i \frac{-2\lambda s_i}{\left(s_i^2 + \lambda^2\right)^2} 1_{\{s_i > \epsilon\}} = -2\lambda\, \mathrm{Tr}\left(\bar{D} D \Sigma\right) = -2\lambda\, \mathrm{Tr}\left(\left(U^T \bar{x} B^T V\right) \circ D\Sigma\right) = -2\lambda\, \mathrm{Tr}\left(U^T \bar{x} B^T V D \Sigma\right) = -2\lambda\, \mathrm{Tr}\left(U^T \bar{x} B^T V D U^T U \Sigma\right) = -2\lambda\, \mathrm{Tr}\left(U^T \bar{x} x^T U \Sigma\right) = -2\lambda\, \mathrm{Tr}\left(x^T U \Sigma U^T \bar{x}\right) = -2\lambda\, x^T U \Sigma U^T \bar{x}$$
Define $W = U^T \bar{x} B^T V$; then
$$\bar{S}_{ii} = W_{ii} \frac{\lambda^2 - s_i^2}{\left(s_i^2 + \lambda^2\right)^2} 1_{\{s_i > \epsilon\}}$$
Next, we will insert in (2). First,
$$S \left(V^T \bar{V} - \bar{V}^T V\right) = S \left(V^T B \bar{x}^T U D - D U^T \bar{x} B^T V\right) = S W^T D - S D W$$
then

$$\left(U^T \bar{U} - \bar{U}^T U\right) S = \left(U^T \bar{x} B^T V D - D V^T B \bar{x}^T U\right) S = W D S - D W^T S$$
Adding it together we get

$$\left(W D S - S D W\right)_{ij} = W_{ij} \left(d_j s_j - d_i s_i\right)$$
$$\left(S W^T D - D W^T S\right)_{ij} = W_{ji} \left(s_i d_j - d_i s_j\right)$$

and multiplying with $G$ we need
$$\frac{d_j s_j - d_i s_i}{s_j^2 - s_i^2} = \frac{1}{\left(s_i^2 + \lambda^2\right)\left(s_j^2 + \lambda^2\right)} \cdot \frac{s_j^2 1_{\{s_j > \epsilon\}} \left(s_i^2 + \lambda^2\right) - s_i^2 1_{\{s_i > \epsilon\}} \left(s_j^2 + \lambda^2\right)}{s_j^2 - s_i^2} = \frac{\lambda^2 1_{\{s_i > \epsilon, s_j > \epsilon\}}}{\left(s_i^2 + \lambda^2\right)\left(s_j^2 + \lambda^2\right)} + \frac{1}{s_j^2 - s_i^2} \left(\frac{s_j^2 1_{\{s_j > \epsilon, s_i \leq \epsilon\}}}{s_j^2 + \lambda^2} - \frac{s_i^2 1_{\{s_i > \epsilon, s_j \leq \epsilon\}}}{s_i^2 + \lambda^2}\right)$$
and
$$\frac{s_i d_j - d_i s_j}{s_j^2 - s_i^2} = \frac{1}{\left(s_i^2 + \lambda^2\right)\left(s_j^2 + \lambda^2\right)} \cdot \frac{s_i s_j 1_{\{s_j > \epsilon\}} \left(s_i^2 + \lambda^2\right) - s_i s_j 1_{\{s_i > \epsilon\}} \left(s_j^2 + \lambda^2\right)}{s_j^2 - s_i^2} = -\frac{s_i s_j 1_{\{s_i > \epsilon, s_j > \epsilon\}}}{\left(s_i^2 + \lambda^2\right)\left(s_j^2 + \lambda^2\right)} + \frac{1}{s_j^2 - s_i^2} \left(\frac{s_i s_j 1_{\{s_j > \epsilon, s_i \leq \epsilon\}}}{s_j^2 + \lambda^2} - \frac{s_i s_j 1_{\{s_i > \epsilon, s_j \leq \epsilon\}}}{s_i^2 + \lambda^2}\right)$$

Adding it all together we find
$$\Gamma_{ij} \equiv \left(I \circ \bar{S} + G \circ \left(S \left(V^T \bar{V} - \bar{V}^T V\right) + \left(U^T \bar{U} - \bar{U}^T U\right) S\right)\right)_{ij} = \begin{cases} \dfrac{\lambda^2 W_{ij} - s_i s_j W_{ji}}{\left(s_i^2 + \lambda^2\right)\left(s_j^2 + \lambda^2\right)} & \text{for } s_i > \epsilon,\ s_j > \epsilon \\[2ex] s_i \dfrac{s_i W_{ij} + s_j W_{ji}}{\left(s_i^2 + \lambda^2\right)\left(s_i^2 - s_j^2\right)} & \text{for } s_i > \epsilon,\ s_j \leq \epsilon \\[2ex] s_j \dfrac{s_j W_{ij} + s_i W_{ji}}{\left(s_j^2 + \lambda^2\right)\left(s_j^2 - s_i^2\right)} & \text{for } s_i \leq \epsilon,\ s_j > \epsilon \\[2ex] 0 & \text{for } s_i \leq \epsilon,\ s_j \leq \epsilon \end{cases}$$
Notice, there is no longer any singularity. We also have to check the singularity from the second part. We find
$$U S^{-1} \bar{V}^T \left(I_n - V V^T\right) + \left(I_m - U U^T\right) \bar{U} S^{-1} V^T = U \Sigma U^T \bar{x} B^T \left(I_n - V V^T\right) + \left(I_m - U U^T\right) \bar{x} B^T V \Sigma V^T$$

so this singularity is also gone and we have
$$\bar{A} = U \Gamma V^T + U \Sigma U^T \bar{x} B^T \left(I_n - V V^T\right) + \left(I_m - U U^T\right) \bar{x} B^T V \Sigma V^T$$
Mathematically there are no singularities, but in practical problems we may have a problem when $s_i = \epsilon - \Delta\epsilon < \epsilon < \epsilon + \Delta\epsilon = s_j$ for small $\Delta\epsilon$. In this case
$$\Gamma_{ij} = \frac{\left(\epsilon + \Delta\epsilon\right)^2 W_{ij} + \left(\epsilon^2 - \Delta\epsilon^2\right) W_{ji}}{\left(\left(\epsilon + \Delta\epsilon\right)^2 + \lambda^2\right) 4\epsilon\Delta\epsilon} \simeq \frac{\epsilon^2}{\epsilon^2 + \lambda^2} \cdot \frac{W_{ij} + W_{ji}}{4\epsilon\Delta\epsilon}$$

To avoid this we can always calculate the matrix $A$ as the singular value decomposition
$$A = U S V^T$$
where $S_{ii} = s_i 1_{\{s_i > \epsilon\}}$, and then solve $\min_x \left\|A^T x - b\right\|^2 + \left\|\lambda x\right\|^2$. In this case
$$\Gamma_{ij} \equiv \left(I \circ \bar{S} + G \circ \left(S \left(V^T \bar{V} - \bar{V}^T V\right) + \left(U^T \bar{U} - \bar{U}^T U\right) S\right)\right)_{ij} = \begin{cases} \dfrac{\lambda^2 W_{ij} - s_i s_j W_{ji}}{\left(s_i^2 + \lambda^2\right)\left(s_j^2 + \lambda^2\right)} & \text{for } s_i > \epsilon,\ s_j > \epsilon \\[2ex] \dfrac{W_{ij}}{s_i^2 + \lambda^2} & \text{for } s_i > \epsilon,\ s_j \leq \epsilon \\[2ex] \dfrac{W_{ij}}{s_j^2 + \lambda^2} & \text{for } s_i \leq \epsilon,\ s_j > \epsilon \\[2ex] 0 & \text{for } s_i \leq \epsilon,\ s_j \leq \epsilon \end{cases}$$

Also, in this case $A = U_1 S_1 V_1^T = U S V^T$, where all positive singular values are in $S_1$, will both generate the same $\bar{A}$ since
$$\left(I_m - U_1 U_1^T\right) \bar{x} B^T V_1 \Sigma V_1^T = U_2 U_2^T \bar{x} B^T V_1 \Sigma V_1^T \equiv U_2 W_{21} \Sigma V_1^T$$
Furthermore, if we use the relations $I_m - U_1 U_1^T = U_2 U_2^T$ and $I_n - V_1 V_1^T = V_2 V_2^T$, we can write $\bar{A}$ in more compact form as $\bar{A} = U \Omega V^T$ where $\Omega \in \mathbb{R}^{m \times n}$ is
$$\Omega = \begin{bmatrix} \Gamma & \Sigma U_1^T \bar{x} B^T V_2 \\ U_2^T \bar{x} B^T V_1 \Sigma & 0 \end{bmatrix}$$

This is only a useful formula if U2 , V2 are easily available.
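As a sanity check of the $\bar{\lambda}$ expression derived above, here is a NumPy sketch for the vector case $B = b$ with all $s_i > \epsilon$; the output functional $V = c^T x$ (so $\bar{x} = c$) and all names are choices made here:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, lam = 4, 6, 0.3
A = rng.standard_normal((m, n))
b = rng.standard_normal(n)
c = rng.standard_normal(m)               # V = c^T x, so x_bar = c

def solve(lam):
    # Tikhonov-regularized regression: min_x ||A^T x - b||^2 + ||lam x||^2
    return np.linalg.solve(A @ A.T + lam**2 * np.eye(m), A @ b)

x = solve(lam)
U, s, Vt = np.linalg.svd(A)              # full SVD; U is m x m, len(s) = m here
Sigma = U @ np.diag(1.0 / (s**2 + lam**2)) @ U.T   # = (A A^T + lam^2 I)^{-1}
lam_bar = -2 * lam * x @ Sigma @ c       # lam_bar = -2 lam x^T U Sigma U^T x_bar

h = 1e-6
fd = (c @ solve(lam + h) - c @ solve(lam - h)) / (2 * h)
print(lam_bar, fd)
```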

6.1 Functional Tikhonov Parameter

If we look up Tikhonov regularization on Wikipedia, we see that there are a number of different ways to choose the Tikhonov factor $\lambda$. Most methods will have a relationship with the input matrices $A$, $B$, so $\lambda = f(A, B)$ is really a function of these 2 matrices. Therefore we also need to calculate the contribution to the derivatives from this function, $\left(\bar{A}, \bar{B}\right) = g\left(\bar{\lambda}\right)$. Usually, this does not have any singularities, so we can use the tape to calculate the contributions to $\bar{A}$, $\bar{B}$ coming from changes in $\lambda$. Remember, these contributions should be added together. Alternatively, $g$ can often be calculated directly.


6.2 Matrix Regularization

Sometimes it is preferable to regularize the matrices before the regression. So instead of solving $A^T x = B$ we will transform the matrices $\alpha = f_1^0(A)$, $\beta = f_1^1(B)$, solve $\alpha^T y = \beta$ instead, and then transform $x = f_3(y)$. We can use the method illustrated in Figure 2, where $y = f_2(\alpha, \beta)$ is the regression step. We then find
$$\bar{y} = g_3\left(\bar{x}\right)$$
$$\bar{A} = g_1^0\left(\bar{\alpha}\right)$$
$$\bar{B} = g_1^1\left(\bar{\beta}\right)$$
either by analytical calculations or using the tape, and $\left(\bar{\alpha}, \bar{\beta}\right) = g_2\left(\bar{y}\right)$ is given from the regression step.
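A runnable NumPy sketch of this pipeline, with concrete example choices made here (not from the text): $f_1^0$ adds diagonal loading $\delta I$, $f_1^1$ and $f_3$ are identities, and the adjoints of the regression step $y = \left(\alpha^T\right)^{-1}\beta$ follow the matrix inverse product rule of Section 2.5:

```python
import numpy as np

rng = np.random.default_rng(5)
m, delta = 5, 0.1
A = rng.standard_normal((m, m))
B = rng.standard_normal(m)
c = rng.standard_normal(m)                   # V = c^T x, so x_bar = c

def forward(A, B):
    alpha = A + delta * np.eye(m)            # f1^0: diagonal loading (example choice)
    beta = B                                 # f1^1: identity (example choice)
    y = np.linalg.solve(alpha.T, beta)       # f2: regression step alpha^T y = beta
    return y                                 # f3: identity (example choice)

x = forward(A, B)
x_bar = c
# Reverse sweep: g3, then g2 (Section 2.5 adjoints), then g1.
y_bar = x_bar                                # g3 is the identity here
alpha = A + delta * np.eye(m)
beta_bar = np.linalg.solve(alpha, y_bar)     # beta_bar = alpha^{-1} y_bar
alpha_bar = -np.outer(x, beta_bar)           # alpha_bar = -y (alpha^{-1} y_bar)^T
A_bar, B_bar = alpha_bar, beta_bar           # g1: d(alpha)/dA = I, d(beta)/dB = I

h = 1e-6
E = np.zeros((m, m)); E[1, 2] = 1.0
fd = (c @ forward(A + h * E, B) - c @ forward(A - h * E, B)) / (2 * h)
print(A_bar[1, 2], fd)
```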

7 Cholesky Decomposition

If $C$ is not positive definite but only semi-definite, then $L$ is no longer unique. For this reason, and for the stability of the algorithm, we use a permutation matrix $P$, i.e. all elements are either 1 or 0 and both row and column sums are 1. Also, $P^T P = P P^T = I$. We will instead find the Cholesky decomposition of $A = P^T C P = L L^T$. There exists a permutation matrix such that $A$ has a unique Cholesky decomposition; see Higham, Analysis of the Cholesky Decomposition of a Semi-definite Matrix. In this case
$$L = \begin{bmatrix} L_{11} & 0 \\ L_{21} & 0 \end{bmatrix}$$
where $L_{11}$ is lower triangular. Notice,
$$A = P^T C P$$
$$dA = P^T dC\, P$$
$$\mathrm{Tr}\left(\bar{A}^T dA\right) = \mathrm{Tr}\left(\bar{A}^T P^T dC\, P\right) = \mathrm{Tr}\left(P \bar{A}^T P^T dC\right)$$
$$\bar{C} = P \bar{A} P^T$$

Assume the rank of $C$ is $m \leq n$. Then we can find an $n \times m$ matrix $B$ such that $C = B B^T$, i.e. $B$ could be built from the $m$ eigenvectors of $C$ with corresponding eigenvalues $s_i > 0$. Define the diagonal $m \times m$ matrix $\sqrt{S_1}$ with the square roots of these eigenvalues. Hence $B = U_1 \sqrt{S_1}$. We have to adjust the algorithm a little bit, as sketched below.
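One standard way to adjust the loop for the semi-definite case is to leave zero pivots, and the entries below them, at zero; a minimal Python sketch of this idea (with a tolerance `tol` chosen here, not the author's exact adjustment):

```python
import math

def cholesky_semidefinite(A, tol=1e-12):
    # Cholesky-like factor for a symmetric positive semi-definite A:
    # zero pivots terminate their column instead of being divided by.
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(s) if s > tol else 0.0
            elif L[j][j] > 0.0:
                L[i][j] = s / L[j][j]
            # else: L[i][j] stays 0 in a rank-deficient column
    return L
```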