AD of Matrix Calculations: Regression and Cholesky

Brian Huge, Danske Bank
[email protected]

1 Notation and simple matrix rules
We will use the notation $A^T$ for the transpose of $A$ and $A^{-1}$ for the inverse, and $A^{-T} \equiv \left(A^T\right)^{-1}$. Also, we will define the trace by
$$\operatorname{Tr}(A) = \sum_i A_{ii}$$
We will also use the notation $C = A \circ B$ for an elementwise multiplication, i.e. $C_{ij} = A_{ij} B_{ij}$. $I$ is the identity matrix, hence $A = AI = IA$. We will use the formulas ($D$ is a diagonal matrix)
$$\operatorname{Tr}\left(A^T\right) = \operatorname{Tr}(A)$$
$$\operatorname{Tr}(AB) = \operatorname{Tr}(BA)$$
$$\operatorname{Tr}(A + B) = \operatorname{Tr}(A) + \operatorname{Tr}(B)$$
$$\operatorname{Tr}\left(A\left(B \circ C\right)\right) = \operatorname{Tr}\left(\left(A \circ B^T\right) C\right)$$
$$D\left(B \circ C\right) = (DB) \circ C$$
$$(DB) \circ I = D\left(B \circ I\right) = \left(B \circ I\right) D = \left(I \circ B\right) D = (BD) \circ I$$
For a symmetric matrix $A = A^T$ and an antisymmetric matrix $B = -B^T$ we have
$$\operatorname{Tr}(AB) = \operatorname{Tr}\left(-A^T B^T\right) = \operatorname{Tr}(-BA) = \operatorname{Tr}(-AB) \iff \operatorname{Tr}(AB) = 0$$
This means that for a general matrix $C$ and a symmetric matrix $A$
$$\operatorname{Tr}(CA) = \operatorname{Tr}\left(A^T C^T\right) = \operatorname{Tr}\left(C^T A^T\right) = \operatorname{Tr}\left(C^T A\right) \iff \operatorname{Tr}(CA) = \frac{1}{2}\operatorname{Tr}\left(\left(C + C^T\right) A\right)$$
Assume we have an output $V$ where we want to calculate sensitivities with respect to some parameters $(x_i)_{i=1,\dots,N}$. We define the adjoints
$$\bar{x}_i \equiv \frac{\partial V}{\partial x_i}$$
Then for matrices $A, B, C$ with $C \equiv f(A, B)$ we have
$$dC = \frac{\partial f}{\partial A} dA + \frac{\partial f}{\partial B} dB$$
and from the chain rule we have
$$\frac{\partial V}{\partial x_k} = \sum_{i,j} \frac{\partial V}{\partial C_{ij}} \frac{\partial C_{ij}}{\partial x_k} = \sum_{i,j} \bar{C}_{ij} \frac{\partial C_{ij}}{\partial x_k}$$
hence
$$\operatorname{Tr}\left(\bar{A}^T dA\right) + \operatorname{Tr}\left(\bar{B}^T dB\right) = \sum_{i,j} \bar{A}_{ij}\, dA_{ij} + \bar{B}_{ij}\, dB_{ij} = dV = \sum_{i,j} \bar{C}_{ij}\, dC_{ij} = \operatorname{Tr}\left(\bar{C}^T dC\right) = \operatorname{Tr}\left(\bar{C}^T \left(\frac{\partial f}{\partial A} dA + \frac{\partial f}{\partial B} dB\right)\right)$$
and we conclude
$$\bar{A} = \left(\frac{\partial f}{\partial A}\right)^T \bar{C}, \qquad \bar{B} = \left(\frac{\partial f}{\partial B}\right)^T \bar{C}$$
which is the recipe for working backwards from outputs to inputs. The results in the rest of this paper can all be derived using this recipe.
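As a concrete illustration of the recipe, here is a small numerical check, a minimal Python/NumPy sketch that is not part of the original text. The scalar output $V = \sum_{i,j} C_{ij}^2$ is chosen purely for illustration; for the elementwise product $C = A \circ B$ the recipe gives $\bar{A} = \bar{C} \circ B$ and $\bar{B} = \bar{C} \circ A$.

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))

# Forward: C = f(A, B) = A o B (elementwise), scalar output V = sum(C**2)
C = A * B
Cbar = 2 * C                           # Cbar_ij = dV/dC_ij

# The recipe gives Abar = Cbar o B and Bbar = Cbar o A
Abar, Bbar = Cbar * B, Cbar * A

# Check dV ~ Tr(Abar^T dA) + Tr(Bbar^T dB) for small random bumps
dA, dB = 1e-7 * rng.standard_normal((3, 3)), 1e-7 * rng.standard_normal((3, 3))
dV_fd = np.sum(((A + dA) * (B + dB))**2) - np.sum(C**2)
dV_ad = np.trace(Abar.T @ dA) + np.trace(Bbar.T @ dB)
print(dV_fd, dV_ad)                    # agree to first order
```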
2 Getting started

The results in Section 2 are from [1].
2.1 Addition
$$C = A + B$$
$$dC = dA + dB$$
Now, use the recipe to see
$$\operatorname{Tr}\left(\bar{C}^T dC\right) = \operatorname{Tr}\left(\bar{C}^T dA\right) + \operatorname{Tr}\left(\bar{C}^T dB\right)$$
which leaves the result
$$\bar{A} = \bar{C}, \qquad \bar{B} = \bar{C}$$
The next results can all be done the same way, so we just state the results.
2.2 Multiplication
$$C = AB$$
$$dC = dA\, B + A\, dB$$
$$\bar{A} = \bar{C} B^T, \qquad \bar{B} = A^T \bar{C}$$

2.3 Inverse
$$C = A^{-1}$$
$$dC = -C\, dA\, C$$
$$\bar{A} = -C^T \bar{C} C^T$$
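These rules are easy to check numerically. Below is a minimal NumPy sketch, not from the original text, for the inverse rule, using the illustrative scalar output $V = \operatorname{Tr}(\bar{C}^T C)$ so that $\bar{C}$ is a fixed matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
Cbar = rng.standard_normal((4, 4))       # fixed upstream adjoint

# Inverse rule (Section 2.3): C = A^{-1}, Abar = -C^T Cbar C^T
C = np.linalg.inv(A)
Abar = -C.T @ Cbar @ C.T

# Finite-difference check of dV/dA_01 for V = Tr(Cbar^T A^{-1})
h = 1e-7
Ah = A.copy(); Ah[0, 1] += h
fd = (np.trace(Cbar.T @ np.linalg.inv(Ah)) - np.trace(Cbar.T @ C)) / h
print(fd, Abar[0, 1])                    # should agree to ~1e-6
```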
2.4 Determinant
$$C = \det(A)$$
$$dC = C \operatorname{Tr}\left(A^{-1} dA\right)$$
$$\bar{A} = \bar{C} C A^{-T}$$
2.5 Matrix inverse product
$$C = A^{-1} B$$
$$dC = -A^{-1}\, dA\, A^{-1} B + A^{-1}\, dB$$
$$\bar{A} = -A^{-T} \bar{C} B^T A^{-T} = -A^{-T} \bar{C} C^T = -\bar{B} C^T, \qquad \bar{B} = A^{-T} \bar{C}$$
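In practice $C = A^{-1}B$ is computed as a linear solve, and the adjoints above then cost just one extra solve with $A^T$. A minimal NumPy sketch, not from the original text:

```python
import numpy as np

rng = np.random.default_rng(2)
A, B = rng.standard_normal((4, 4)), rng.standard_normal((4, 2))

C = np.linalg.solve(A, B)                # C = A^{-1} B, no explicit inverse
Cbar = rng.standard_normal(C.shape)      # upstream adjoint

Bbar = np.linalg.solve(A.T, Cbar)        # Bbar = A^{-T} Cbar
Abar = -Bbar @ C.T                       # Abar = -Bbar C^T reuses both results

# Finite-difference check of one entry of Abar
h = 1e-7
Ah = A.copy(); Ah[2, 0] += h
fd = (np.sum(Cbar * np.linalg.solve(Ah, B)) - np.sum(Cbar * C)) / h
print(fd, Abar[2, 0])                    # should agree
```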
2.6 First quadratic form
$$C = B^T A B$$
$$dC = dB^T A B + B^T dA\, B + B^T A\, dB$$
$$\bar{A} = B \bar{C} B^T, \qquad \bar{B} = A B \bar{C}^T + A^T B \bar{C}$$
2.7 Second quadratic form
$$C = B^T A^{-1} B$$
$$dC = dB^T A^{-1} B - B^T A^{-1} dA\, A^{-1} B + B^T A^{-1} dB$$
$$\bar{A} = -A^{-T} B \bar{C} B^T A^{-T}, \qquad \bar{B} = A^{-1} B \bar{C}^T + A^{-T} B \bar{C}$$
2.8 Eigenvalues and SVD

Define
$$U = [U_1 | U_2], \qquad S = \begin{bmatrix} S_1 & 0_{k, n-k} \\ 0_{m-k, k} & 0_{m-k, n-k} \end{bmatrix}, \qquad V = [V_1 | V_2]$$
where $U, V$ are orthogonal matrices such that
$$U^T U = U U^T = I_m$$
$$V^T V = V V^T = I_n$$
$$U_1^T U_1 = V_1^T V_1 = I_k$$
$$U_1^T U_2 = 0_{k, m-k}, \qquad V_1^T V_2 = 0_{k, n-k}$$
$$U_2 U_2^T = I_m - U_1 U_1^T, \qquad V_2 V_2^T = I_n - V_1 V_1^T$$
In the following two subsections we give slightly extended versions of [1]. We have supplied a proof in Appendix 8.1.

2.8.1 Eigenvalues of a symmetric matrix with distinct eigenvalues

Let $A$ be a symmetric $m \times m$ matrix with $k$ distinct positive eigenvalues and $m - k$ eigenvalues which are all 0. Then there exists an $m \times k$ matrix $U_1$ of eigenvectors and a $k \times k$ diagonal matrix $S_1$ with the positive eigenvalues in the diagonal. Then we have
$$A = U_1 S_1 U_1^T$$
$$\bar{A} = U_1 \left( \left(U_1^T \bar{U}_1\right) \circ F_1 + \bar{S}_1 \right) U_1^T + \left(I - U_1 U_1^T\right) \bar{U}_1 S_1^{-1} U_1^T \qquad (1)$$
where $(F_1)_{ij} = \frac{1}{d_j - d_i}$ for $i \neq j$ and 0 for $i = j$, with $d_i$ the $i$-th eigenvalue. Notice that this way, eigenvalues that are 0 cannot change. If we want sensitivities with respect to an eigenvalue that is 0, it should be treated as any other eigenvalue, and it should be distinct.
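A minimal NumPy sketch of equation (1), not from the original text, for the full-rank case $k = m$, where the projection term vanishes. The finite-difference check uses a symmetric bump, and the eigenvector columns are sign-aligned because eigenvectors are only defined up to sign.

```python
import numpy as np

rng = np.random.default_rng(3)
m = 4
X = rng.standard_normal((m, m))
A = X @ X.T                              # symmetric positive definite, k = m

d, U = np.linalg.eigh(A)                 # A = U diag(d) U^T
dbar = rng.standard_normal(m)            # upstream adjoints of eigenvalues
Ubar = rng.standard_normal((m, m))       # upstream adjoints of eigenvectors

# F_ij = 1 / (d_j - d_i) off the diagonal, 0 on it
F = 1.0 / (d[None, :] - d[:, None] + np.eye(m)) * (1 - np.eye(m))

# Equation (1) with k = m: Abar = U ((U^T Ubar) o F + diag(dbar)) U^T
Abar = U @ ((U.T @ Ubar) * F + np.diag(dbar)) @ U.T

# Finite-difference check with a symmetric bump E
E = rng.standard_normal((m, m)); E = (E + E.T) / 2
h = 1e-7
d2, U2 = np.linalg.eigh(A + h * E)
U2 *= np.sign(np.sum(U2 * U, axis=0))    # align eigenvector signs
dV_fd = (dbar @ (d2 - d) + np.sum(Ubar * (U2 - U))) / h
print(dV_fd, np.sum(Abar * E))           # should agree
```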
2.8.2 SVD with distinct singular values

Let $A$ be an $m \times n$ matrix with $k$ distinct singular values different from 0. Then there exists an $m \times k$ matrix $U_1$, a $k \times k$ diagonal matrix $S_1$ with the singular values, and an $n \times k$ matrix $V_1$ such that
$$A = U_1 S_1 V_1^T$$
$$\bar{A} = U_1 \left[ I \circ \bar{S}_1 + G \circ \left( S_1 \left(V_1^T \bar{V}_1 - \bar{V}_1^T V_1\right) + \left(U_1^T \bar{U}_1 - \bar{U}_1^T U_1\right) S_1 \right) \right] V_1^T + U_1 S_1^{-1} \bar{V}_1^T \left(I_n - V_1 V_1^T\right) + \left(I_m - U_1 U_1^T\right) \bar{U}_1 S_1^{-1} V_1^T \qquad (2)$$
where $G_{ij} = \frac{1}{d_j^2 - d_i^2}$ for $i \neq j$ and 0 for $i = j$. Notice again that this way, singular values that are 0 cannot change. If we want sensitivities with respect to a singular value that is 0, it should be treated as any other singular value, and it should be distinct. Also, if $A$ is symmetric then $n = m$ and $V_1 = U_1$. To see that (1) is the same as (2), we notice that
$$\bar{A} = U_1 \left[ I \circ \bar{S}_1 + G \circ \left( S_1 \left(U_1^T \bar{U}_1 - \bar{U}_1^T U_1\right) + \left(U_1^T \bar{U}_1 - \bar{U}_1^T U_1\right) S_1 \right) \right] U_1^T + U_1 S_1^{-1} \bar{U}_1^T \left(I_m - U_1 U_1^T\right) + \left(I_m - U_1 U_1^T\right) \bar{U}_1 S_1^{-1} U_1^T$$
$$= U_1 \left[ \frac{1}{2} I \circ \bar{S}_1 + F \circ \left(U_1^T \bar{U}_1\right) \right] U_1^T + \left(I_m - U_1 U_1^T\right) \bar{U}_1 S_1^{-1} U_1^T + \left( U_1 \left[ \frac{1}{2} I \circ \bar{S}_1 + F \circ \left(U_1^T \bar{U}_1\right) \right] U_1^T + \left(I_m - U_1 U_1^T\right) \bar{U}_1 S_1^{-1} U_1^T \right)^T$$
which is exactly the symmetric part of
$$U_1 \left[ I \circ \bar{S}_1 + F \circ \left(U_1^T 2\bar{U}_1\right) \right] U_1^T + \left(I_m - U_1 U_1^T\right) 2\bar{U}_1 S_1^{-1} U_1^T$$
as in (1). The $2\bar{U}_1$ appears because we have added the contributions for both $\bar{U}_1$ and $\bar{V}_1$.
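A corresponding NumPy sketch of equation (2), not from the original text, for a full-column-rank $A$ with $k = n < m$: the $\bar{V}_1$ projection term then vanishes because $V_1 V_1^T = I_n$, while the $\bar{U}_1$ one survives.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 5, 3
A = rng.standard_normal((m, n))                      # full column rank, k = n

U1, s, V1t = np.linalg.svd(A, full_matrices=False)   # A = U1 diag(s) V1^T
V1 = V1t.T
Sbar = rng.standard_normal(n)                        # adjoints of sing. values
U1bar = rng.standard_normal((m, n))
V1bar = rng.standard_normal((n, n))

# G_ij = 1 / (s_j^2 - s_i^2) off the diagonal, 0 on it
s2 = s**2
G = 1.0 / (s2[None, :] - s2[:, None] + np.eye(n)) * (1 - np.eye(n))

S = np.diag(s)
inner = (np.diag(Sbar)                               # I o Sbar1
         + G * (S @ (V1.T @ V1bar - V1bar.T @ V1)
                + (U1.T @ U1bar - U1bar.T @ U1) @ S))
Abar = (U1 @ inner @ V1.T
        + (np.eye(m) - U1 @ U1.T) @ U1bar @ np.diag(1 / s) @ V1.T)

# Finite-difference check (singular vector pairs sign-aligned)
E = rng.standard_normal((m, n)); h = 1e-7
U2, s_2, V2t = np.linalg.svd(A + h * E, full_matrices=False)
V2 = V2t.T
sign = np.sign(np.sum(U2 * U1, axis=0))
U2 *= sign; V2 *= sign
dV_fd = (Sbar @ (s_2 - s) + np.sum(U1bar * (U2 - U1))
         + np.sum(V1bar * (V2 - V1))) / h
print(dV_fd, np.sum(Abar * E))                       # should agree
```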
2.9 Cholesky Decomposition

For a positive definite matrix $C$ there is a unique Cholesky decomposition. Define
$$C = L L^T$$
We will follow [1]. In our Cholesky implementation we loop over the rows and columns of $L$, computing
$$L_{jj} = \sqrt{C_{jj} - \sum_{k < j} L_{jk}^2}, \qquad L_{ij} = \frac{1}{L_{jj}} \left( C_{ij} - \sum_{k < j} L_{ik} L_{jk} \right) \text{ for } i > j$$
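A minimal Python sketch of this recursion together with its reverse (adjoint) sweep, in the spirit of [1]; the adjoint below is obtained by reversing each elementwise operation and is a sketch rather than the paper's exact listing. $\bar{C}$ is returned for the lower-triangular entries of $C$.

```python
import numpy as np

def cholesky(C):
    """Cholesky-Banachiewicz recursion: C = L L^T, L lower triangular."""
    n = C.shape[0]
    L = np.zeros_like(C)
    for i in range(n):
        for j in range(i + 1):
            s = C[i, j] - L[i, :j] @ L[j, :j]
            L[i, j] = np.sqrt(s) if i == j else s / L[j, j]
    return L

def cholesky_adjoint(L, Lbar):
    """Reverse sweep: given L and Lbar, accumulate Cbar (lower triangle)."""
    n = L.shape[0]
    Lbar = Lbar.copy()
    Cbar = np.zeros_like(L)
    for i in range(n - 1, -1, -1):
        for j in range(i, -1, -1):      # exact reverse of the forward order
            if i == j:
                sbar = Lbar[i, i] / (2.0 * L[i, i])
            else:
                sbar = Lbar[i, j] / L[j, j]
                Lbar[j, j] -= sbar * L[i, j]
            Lbar[i, :j] -= sbar * L[j, :j]
            Lbar[j, :j] -= sbar * L[i, :j]
            Cbar[i, j] = sbar
    return Cbar
```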
6 Tikhonov Regularization

With the singular value decomposition $A = U S V^T$, the solution of the regularized regression $\min_x \|A^T x - B\|^2 + \|\lambda x\|^2$ with truncation threshold $\epsilon$ is $x = U D V^T B$, where $D$ is the diagonal matrix with $D_{ii} = d_i = \frac{s_i}{s_i^2 + \lambda^2} 1_{\{s_i > \epsilon\}}$, and we define the diagonal matrix $\Sigma$ by $\Sigma_{ii} = \frac{1_{\{s_i > \epsilon\}}}{s_i^2 + \lambda^2}$. Differentiating the truncated diagonal gives
$$dd_i = \frac{\lambda^2 - s_i^2}{\left(s_i^2 + \lambda^2\right)^2} 1_{\{s_i > \epsilon\}}\, ds_i - \frac{2\lambda s_i}{\left(s_i^2 + \lambda^2\right)^2} 1_{\{s_i > \epsilon\}}\, d\lambda$$
so for the Tikhonov parameter
$$\bar{\lambda} = \sum_i \bar{d}_i \frac{-2\lambda s_i}{\left(s_i^2 + \lambda^2\right)^2} 1_{\{s_i > \epsilon\}} = -2\lambda \operatorname{Tr}\left(\bar{D} D \Sigma\right) = -2\lambda \operatorname{Tr}\left(\left(\left(U^T \bar{x} B^T V\right) \circ I\right) D \Sigma\right) = -2\lambda \operatorname{Tr}\left(U^T \bar{x} B^T V D \Sigma\right)$$
$$= -2\lambda \operatorname{Tr}\left(U^T \bar{x} B^T V D U^T U \Sigma\right) = -2\lambda \operatorname{Tr}\left(U^T \bar{x} x^T U \Sigma\right) = -2\lambda \operatorname{Tr}\left(x^T U \Sigma U^T \bar{x}\right) = -2\lambda\, x^T U \Sigma U^T \bar{x}$$
Define $W = U^T \bar{x} B^T V$; then
$$\bar{S}_{ii} = W_{ii} \frac{\lambda^2 - s_i^2}{\left(s_i^2 + \lambda^2\right)^2} 1_{\{s_i > \epsilon\}}$$
Next, we will insert in (2). First,
$$S\left(V^T \bar{V} - \bar{V}^T V\right) = S\left(V^T B \bar{x}^T U D - D U^T \bar{x} B^T V\right) = S W^T D - S D W$$
then
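A quick numerical confirmation of the $\bar{\lambda}$ formula, a NumPy sketch under the setup above (the helper name is illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, lam, eps = 3, 6, 0.5, 1e-10
A, B = rng.standard_normal((m, n)), rng.standard_normal(n)

def tikhonov_solve(A, B, lam):
    # x = U D V^T B with d_i = s_i / (s_i^2 + lam^2) 1{s_i > eps}
    U, s, Vt = np.linalg.svd(A)
    d = np.where(s > eps, s / (s**2 + lam**2), 0.0)
    D = np.zeros((A.shape[0], A.shape[1]))
    np.fill_diagonal(D, d)
    return U @ D @ Vt @ B, U, s

x, U, s = tikhonov_solve(A, B, lam)
xbar = rng.standard_normal(m)               # upstream adjoint of x

sigma = np.where(s > eps, 1.0 / (s**2 + lam**2), 0.0)
lambar = -2 * lam * x @ U @ np.diag(sigma) @ U.T @ xbar

h = 1e-7
x2, _, _ = tikhonov_solve(A, B, lam + h)
print((xbar @ (x2 - x)) / h, lambar)        # should agree
```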
and
$$\left(U^T \bar{U} - \bar{U}^T U\right) S = \left(U^T \bar{x} B^T V D - D V^T B \bar{x}^T U\right) S = W D S - D W^T S$$
Adding it together we get
$$\left(W D S - S D W\right)_{ij} = W_{ij} \left(d_j s_j - d_i s_i\right)$$
$$\left(S W^T D - D W^T S\right)_{ij} = W_{ji} \left(s_i d_j - d_i s_j\right)$$
and multiplying with $G$ we need
$$\frac{d_j s_j - d_i s_i}{s_j^2 - s_i^2} = \frac{1}{\left(s_i^2+\lambda^2\right)\left(s_j^2+\lambda^2\right)} \cdot \frac{s_j^2 1_{\{s_j>\epsilon\}}\left(s_i^2+\lambda^2\right) - s_i^2 1_{\{s_i>\epsilon\}}\left(s_j^2+\lambda^2\right)}{s_j^2 - s_i^2} = \frac{\lambda^2 1_{\{s_j>\epsilon,\, s_i>\epsilon\}}}{\left(s_i^2+\lambda^2\right)\left(s_j^2+\lambda^2\right)} + \left( \frac{s_j^2 1_{\{s_j>\epsilon,\, s_i\le\epsilon\}}}{s_j^2+\lambda^2} - \frac{s_i^2 1_{\{s_i>\epsilon,\, s_j\le\epsilon\}}}{s_i^2+\lambda^2} \right) \frac{1}{s_j^2 - s_i^2}$$
and
$$\frac{s_i d_j - d_i s_j}{s_j^2 - s_i^2} = \frac{1}{\left(s_i^2+\lambda^2\right)\left(s_j^2+\lambda^2\right)} \cdot \frac{s_i s_j 1_{\{s_j>\epsilon\}}\left(s_i^2+\lambda^2\right) - s_i s_j 1_{\{s_i>\epsilon\}}\left(s_j^2+\lambda^2\right)}{s_j^2 - s_i^2} = -\frac{s_i s_j 1_{\{s_j>\epsilon,\, s_i>\epsilon\}}}{\left(s_i^2+\lambda^2\right)\left(s_j^2+\lambda^2\right)} + \left( \frac{s_i s_j 1_{\{s_j>\epsilon,\, s_i\le\epsilon\}}}{s_j^2+\lambda^2} - \frac{s_i s_j 1_{\{s_i>\epsilon,\, s_j\le\epsilon\}}}{s_i^2+\lambda^2} \right) \frac{1}{s_j^2 - s_i^2}$$
Adding it all together we find
$$\Gamma_{ij} \equiv \left( I \circ \bar{S} + G \circ \left( S\left(V^T \bar{V} - \bar{V}^T V\right) + \left(U^T \bar{U} - \bar{U}^T U\right) S \right) \right)_{ij} = \begin{cases} \dfrac{\lambda^2 W_{ij} - s_i s_j W_{ji}}{\left(s_i^2+\lambda^2\right)\left(s_j^2+\lambda^2\right)} & \text{for } s_i > \epsilon,\ s_j > \epsilon \\[1.5ex] s_i \dfrac{s_i W_{ij} + s_j W_{ji}}{\left(s_i^2+\lambda^2\right)\left(s_i^2 - s_j^2\right)} & \text{for } s_i > \epsilon,\ s_j \le \epsilon \\[1.5ex] s_j \dfrac{s_j W_{ij} + s_i W_{ji}}{\left(s_j^2+\lambda^2\right)\left(s_j^2 - s_i^2\right)} & \text{for } s_i \le \epsilon,\ s_j > \epsilon \\[1.5ex] 0 & \text{for } s_i \le \epsilon,\ s_j \le \epsilon \end{cases}$$
Notice, there is no longer any singularity. We also have to check the singularity from the second part. We find
$$U S^{-1} \bar{V}^T \left(I_n - V V^T\right) + \left(I_m - U U^T\right) \bar{U} S^{-1} V^T = U \Sigma U^T \bar{x} B^T \left(I_n - V V^T\right) + \left(I_m - U U^T\right) \bar{x} B^T V \Sigma V^T$$
so this singularity is also gone and we have
$$\bar{A} = U \Gamma V^T + U \Sigma U^T \bar{x} B^T \left(I_n - V V^T\right) + \left(I_m - U U^T\right) \bar{x} B^T V \Sigma V^T$$
Mathematically there are no singularities, but in practical problems we may have a problem when $s_i = \epsilon - \Delta\epsilon < \epsilon < \epsilon + \Delta\epsilon = s_j$ for small $\Delta\epsilon$. In this case
$$\Gamma_{ij} = \frac{(\epsilon + \Delta\epsilon)^2 W_{ij} + \left(\epsilon^2 - \Delta\epsilon^2\right) W_{ji}}{\left((\epsilon + \Delta\epsilon)^2 + \lambda^2\right) 4\epsilon\, \Delta\epsilon} \simeq \frac{\epsilon^2}{\epsilon^2 + \lambda^2} \cdot \frac{W_{ij} + W_{ji}}{4\epsilon\, \Delta\epsilon}$$
To avoid this we can always calculate the matrix $A$ from the singular value decomposition as
$$A = U S V^T$$
where $S_{ii} = s_i 1_{\{s_i > \epsilon\}}$, and then solve $\min_x \left\|A^T x - B\right\|^2 + \|\lambda x\|^2$. In this case
$$\Gamma_{ij} \equiv \left( I \circ \bar{S} + G \circ \left( S\left(V^T \bar{V} - \bar{V}^T V\right) + \left(U^T \bar{U} - \bar{U}^T U\right) S \right) \right)_{ij} = \begin{cases} \dfrac{\lambda^2 W_{ij} - s_i s_j W_{ji}}{\left(s_i^2+\lambda^2\right)\left(s_j^2+\lambda^2\right)} & \text{for } s_i > \epsilon,\ s_j > \epsilon \\[1.5ex] \dfrac{W_{ij}}{s_i^2+\lambda^2} & \text{for } s_i > \epsilon,\ s_j \le \epsilon \\[1.5ex] \dfrac{W_{ij}}{s_j^2+\lambda^2} & \text{for } s_i \le \epsilon,\ s_j > \epsilon \\[1.5ex] 0 & \text{for } s_i \le \epsilon,\ s_j \le \epsilon \end{cases}$$
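The four branches translate directly into code. Below is a NumPy sketch, not from the original text (the function name is illustrative; the singular value arrays are padded with zeros beyond $\min(m, n)$ so the indicators cover the rectangular part). $\bar{A}$ then follows as $\bar{A} = U \Gamma V^T$ plus the two projection terms above.

```python
import numpy as np

def gamma_truncated(s_row, s_col, W, lam, eps):
    """Gamma for the truncated-S variant; s_row[i], s_col[j] pair with the
    rows and columns of W = U^T xbar B^T V (zero-padded beyond min(m, n))."""
    m, n = W.shape
    Gam = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            si, sj = s_row[i], s_col[j]
            if si > eps and sj > eps:
                Gam[i, j] = (lam**2 * W[i, j] - si * sj * W[j, i]) / (
                    (si**2 + lam**2) * (sj**2 + lam**2))
            elif si > eps:
                Gam[i, j] = W[i, j] / (si**2 + lam**2)
            elif sj > eps:
                Gam[i, j] = W[i, j] / (sj**2 + lam**2)
    return Gam
```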
Also, in this case $A = U_1 S_1 V_1^T = U S V^T$, where all positive singular values are in $S_1$, will both generate the same $\bar{A}$, since
$$\left(I_m - U_1 U_1^T\right) \bar{x} B^T V_1 \Sigma V_1^T = U_2 U_2^T \bar{x} B^T V_1 \Sigma V_1^T \equiv U_2 W_{21} \Sigma V_1^T$$
Furthermore, if we use the relations $I_m - U_1 U_1^T = U_2 U_2^T$ and $I_n - V_1 V_1^T = V_2 V_2^T$, we can write $\bar{A}$ in more compact form as $\bar{A} = U \Omega V^T$, where $\Omega \in \mathbb{R}^{m \times n}$ is
$$\Omega = \begin{bmatrix} \Gamma & \Sigma U_1^T \bar{x} B^T V_2 \\ U_2^T \bar{x} B^T V_1 \Sigma & 0 \end{bmatrix}$$
This is only a useful formula if $U_2, V_2$ are easily available.
6.1 Functional Tikhonov Parameter

If we look up Tikhonov regularization on Wikipedia, we see that there are a number of different ways to choose the Tikhonov factor $\lambda$. Most methods will have a relationship with the input matrices $A, B$, so $\lambda = f(A, B)$ is really a function of these two matrices. Therefore we also need to calculate the contribution to the derivatives from this function, $\left(\bar{A}, \bar{B}\right) = g(\bar{\lambda})$. Usually, this does not have any singularities, so we can use the tape to calculate the contributions to $\bar{A}, \bar{B}$ coming from changes in $\lambda$. Remember, these contributions should be added together. Alternatively, $g$ can often be calculated directly.
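For instance, if $\lambda$ is tied to the largest singular value of $A$ (an illustrative rule assumed here, not one prescribed in the text), $g$ can be written directly, since the adjoint of the largest singular value $s_1$ is $u_1 v_1^T$:

```python
import numpy as np

def lambda_and_contribution(A, c=0.1):
    """Illustrative rule lam = c * s_max(A); returns lam and a function that
    adds the resulting contribution g(lambar) to Abar."""
    U, s, Vt = np.linalg.svd(A)
    lam = c * s[0]
    def add_to_Abar(lambar, Abar):
        # d s_max / dA = u1 v1^T, so Abar += lambar * c * u1 v1^T
        return Abar + lambar * c * np.outer(U[:, 0], Vt[0, :])
    return lam, add_to_Abar
```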
6.2 Matrix Regularization

Sometimes it is preferable to regularize the matrices before the regression. So instead of solving $A^T x = B$, we transform the matrices $\alpha = f_1^0(A)$, $\beta = f_1^1(B)$, solve $\alpha^T y = \beta$ instead, and then transform back $x = f_3(y)$. We can use the method illustrated in Figure 2, where $y = f_2(\alpha, \beta)$ is the regression step. We then find
$$\bar{y} = g_3(\bar{x}), \qquad \bar{A} = g_1^0(\bar{\alpha}), \qquad \bar{B} = g_1^1(\bar{\beta})$$
either by analytical calculations or using the tape, and $\left(\bar{\alpha}, \bar{\beta}\right) = g_2(\bar{y})$ is given from the regression step, as sketched below.
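A minimal end-to-end sketch of this chain, not from the original text: the concrete choices of $f_1^0$, $f_1^1$ and $f_3$ are illustrative, and $g_2$ is the matrix inverse product adjoint from Section 2.5.

```python
import numpy as np

rng = np.random.default_rng(6)
A, B, delta = rng.standard_normal((3, 3)), rng.standard_normal(3), 0.1

# Forward: f10 shifts the diagonal (illustrative); f11 and f3 are identities
alpha, beta = A + delta * np.eye(3), B      # alpha = f10(A), beta = f11(B)
y = np.linalg.solve(alpha.T, beta)          # y = f2(alpha, beta) = alpha^{-T} beta
x = y                                       # x = f3(y)

# Reverse sweep for some upstream xbar
xbar = rng.standard_normal(3)
ybar = xbar                                 # ybar = g3(xbar)
betabar = np.linalg.solve(alpha, ybar)      # g2: betabar = alpha^{-1} ybar
alphabar = -np.outer(y, betabar)            # g2: alphabar = -y betabar^T
Abar, Bbar = alphabar, betabar              # g10, g11 are identities here
```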
7 Cholesky Decomposition

If $C$ is not positive definite but only semi-definite, then $L$ is no longer unique. For this reason, and for stability of the algorithm, we use a permutation matrix $P$, i.e. all elements are either 1 or 0 and both row and column sums are 1. Also, $P^T P = P P^T = I$. We will instead find the Cholesky decomposition of $A = P^T C P = L L^T$. There exists a permutation matrix such that $A$ has a unique Cholesky decomposition; see Higham, Analysis of the Cholesky Decomposition of a Semi-definite Matrix. In this case
$$L = \begin{bmatrix} L_{11} & 0 \\ L_{21} & 0 \end{bmatrix}$$
where $L_{11}$ is lower triangular. Notice,
$$A = P^T C P$$
$$dA = P^T dC\, P$$
$$\operatorname{Tr}\left(\bar{A}^T dA\right) = \operatorname{Tr}\left(\bar{A}^T P^T dC\, P\right) = \operatorname{Tr}\left(P \bar{A}^T P^T dC\right)$$
$$\bar{C} = P \bar{A} P^T$$
Assume the rank of $C$ is $m \le n$. Then we can find an $n \times m$ matrix $B$ such that $C = B B^T$, i.e. $B$ could be built from the $m$ eigenvectors of $C$ with corresponding eigenvalues $s_i > 0$. Define the diagonal $m \times m$ matrix $\sqrt{S_1}$ with the square roots of these eigenvalues; hence $B = U_1 \sqrt{S_1}$. We have to adjust the algorithm a little bit, as sketched below.
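A sketch of such an adjusted recursion, assuming the adjustment amounts to zeroing columns whose pivot falls below a tolerance (the paper's own listing is not reproduced here), which yields the $L = [L_{11}\ 0;\ L_{21}\ 0]$ structure above:

```python
import numpy as np

def semidefinite_cholesky(A, tol=1e-12):
    """Cholesky of a positive semi-definite A = P^T C P (already permuted so
    the leading rank-m block comes first): pivots below tol end the column."""
    n = A.shape[0]
    L = np.zeros_like(A)
    for i in range(n):
        for j in range(i + 1):
            s = A[i, j] - L[i, :j] @ L[j, :j]
            if i == j:
                L[i, i] = np.sqrt(s) if s > tol else 0.0
            else:
                L[i, j] = s / L[j, j] if L[j, j] > 0.0 else 0.0
    return L
```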