Information Matrix Splitting✩

Shengxin Zhu, Tongxiang Gu, Xingping Liu

arXiv:1605.07646v1 [stat.ME] 24 May 2016

Laboratory of Computational Physics, Institute of Applied Physics and Computational Mathematics, P.O. Box 8009, Beijing 100088, P.R. China.

Abstract

Efficient statistical estimation via the maximum likelihood method requires the observed information, the negative Hessian of the underlying log-likelihood function. Computing the observed information is computationally expensive; therefore the expected information matrix, the Fisher information matrix, is often preferred for its simplicity. In this paper, we prove that the average of the observed and Fisher information of the restricted/residual log-likelihood function for the linear mixed model can be split into two matrices. The expectation of one part is the Fisher information matrix, yet it has a simpler form than the Fisher information matrix itself. The other part, which carries most of the expensive computations, is a random matrix with zero expectation and is therefore negligible. Leveraging such a splitting simplifies the evaluation of an approximate Hessian of the log-likelihood function.

Keywords: Observed information matrix, Fisher information matrix, Newton method, linear mixed model, variance parameter estimation.
2010 MSC: 65J12, 62P10, 65C60, 65F99

✩ The project is supported by the Natural Science Fund of China (NSFC) (No. 11501044), partially by NSFC (Nos. 1101002, 1151047, 61472462 and 91430218), and by the Laboratory of Computational Physics. Email address: [email protected] (Shengxin Zhu)

1. Introduction

Many statistical methods require an estimate of the unknown (co-)variance parameter(s) of a model. The estimate is usually obtained by maximizing a log-likelihood (the logarithm of a likelihood function). In principle, one requires the observed information matrix, the negative Hessian (second derivative) of the log-likelihood, to obtain an accurate maximum likelihood estimate via the (exact) Newton method [2]; in practice the expected value of the observed information matrix, usually referred to as the Fisher information matrix or simply the information matrix, is used instead because of its simplicity. The Fisher information matrix describes the essential information the data carry about the unknown parameters and has a simpler form than the observed information matrix (as we shall see in Table 1). Therefore, the Fisher information matrix is preferred not only for analyzing the asymptotic behavior of maximum likelihood estimates [1, 16, 17] but also for finding the variance of an estimator and in Bayesian inference [11]. In particular, if the Fisher information matrix is used within a maximum (log-)likelihood procedure, the resulting algorithm is called the Fisher scoring algorithm [7, 10], which is widely used in machine learning. Besides traditional application fields such as the genetical theory of natural selection and breeding [3], many other fields, including theoretical physics, have introduced Fisher information matrix theory [5, 6, 12, 14].

The Fisher scoring algorithm succeeds in simplifying the approximation of the Hessian matrix of the log-likelihood, with some compromise on the convergence rate. Still, evaluating the elements of the Fisher information matrix is one of the bottlenecks of a log-likelihood maximization procedure, and it prohibits the use of the Fisher scoring algorithm for large data sets. In particular, high-throughput technologies in the biological sciences and engineering mean that the sizes of data sets and of the corresponding statistical models have suddenly increased by several orders of magnitude. Further reducing the computation is possible through the average information technique. There are several papers in Nature journals on large-scale genome-wide association studies [8, 9, 18, 19] which involve many thousands of parameters to be estimated; however, judging from the limited but essential formulas that appear in recent publications, for example [19, eq. 7, eq. 8], it seems that the average information technique has not been widely noticed. The aim of this letter is to supply a concise mathematical statement and a rigorous proof of the average information technique introduced in [4]. The results are applicable to (co-)variance parameter estimation in linear mixed models, and the mathematical abstraction can shed light on general nonlinear acceleration schemes and on applications that involve maximum likelihood estimation, such as genome-wide association studies.

2. Preliminary

Consider the linear mixed model

y = Xτ + Zu + ǫ,    (1)

where y ∈ R^{n×1} is the vector of observations, τ ∈ R^{p×1} is the vector of fixed effects, X ∈ R^{n×p} is the design matrix for the fixed effects, u ∈ R^{b×1} is the vector of unobserved random effects, Z ∈ R^{n×b} is the design matrix for the random effects, and ǫ ∈ R^{n×1} is the vector of residual errors. The random effects u and the residual errors ǫ are multivariate normal with E(u) = 0, E(ǫ) = 0, u ∼ N(0, σ^2 G), ǫ ∼ N(0, σ^2 R) and

var[(u^T, ǫ^T)^T] = σ^2 [G 0; 0 R],    (2)

where G ∈ R^{b×b} and R ∈ R^{n×n}. Typically G and R are functions of parameters that need to be estimated, say G = G(γ) and R = R(φ); we further denote κ = (γ, φ). Estimating the variance parameters θ = (σ^2, κ) requires a conceptually simple nonlinear iterative procedure which can be computationally expensive to carry out. One has to maximize a residual log-likelihood function of the form [13, p.252]

ℓ_R = const − (1/2){ (n − ν) log σ^2 + log det(H) + log det(X^T H^{-1} X) + y^T P y/σ^2 },    (3)

where H = R(φ) + Z G(γ) Z^T, ν = rank(X) and P = H^{-1} − H^{-1} X (X^T H^{-1} X)^{-1} X^T H^{-1}. Here we suppose ν = p. The first derivatives of ℓ_R are referred to as the scores for the variance parameters θ := (σ^2, κ)^T [13, p.252]:

s(σ^2) = ∂ℓ_R/∂σ^2 = −(1/2){ (n − p)/σ^2 − y^T P y/σ^4 },    (4)
s(κ_i) = ∂ℓ_R/∂κ_i = −(1/2){ tr(P ∂H/∂κ_i) − y^T P (∂H/∂κ_i) P y/σ^2 },    (5)

where κ = (γ^T, φ^T)^T. We shall denote S(θ) = (s(σ^2), s(κ_1), . . . , s(κ_m))^T. The negative of the Hessian of the log-likelihood function is referred to as the observed information matrix, which we denote by I_O:

I_O = − [ ∂^2ℓ_R/∂σ^2∂σ^2   ∂^2ℓ_R/∂σ^2∂κ_1   · · ·   ∂^2ℓ_R/∂σ^2∂κ_m
          ∂^2ℓ_R/∂κ_1∂σ^2   ∂^2ℓ_R/∂κ_1∂κ_1   · · ·   ∂^2ℓ_R/∂κ_1∂κ_m
                 ...               ...          ...         ...
          ∂^2ℓ_R/∂κ_m∂σ^2   ∂^2ℓ_R/∂κ_m∂κ_1   · · ·   ∂^2ℓ_R/∂κ_m∂κ_m ].    (6)
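To make these quantities concrete, the following sketch evaluates H, P, the residual log-likelihood ℓ_R (up to the constant) and the scores (4)–(5) with NumPy. It is only an illustration under an assumed toy structure G = γI_b and R = φI_n, so that κ = (γ, φ); the function and variable names are ours, not part of the paper, and for realistic model sizes one would never form H^{-1} or P explicitly.

```python
import numpy as np

def projection(H, X):
    """P = H^{-1} - H^{-1} X (X^T H^{-1} X)^{-1} X^T H^{-1}, the weighted projection."""
    Hinv = np.linalg.inv(H)
    HinvX = Hinv @ X
    return Hinv - HinvX @ np.linalg.solve(X.T @ HinvX, HinvX.T)

def reml_loglik_and_scores(y, X, Z, sigma2, gamma, phi):
    """Residual log-likelihood (3) (up to a constant) and scores (4)-(5)
    for the toy structure G = gamma*I, R = phi*I, so that kappa = (gamma, phi)."""
    n, p = X.shape
    H = phi * np.eye(n) + gamma * (Z @ Z.T)      # H = R(phi) + Z G(gamma) Z^T
    P = projection(H, X)
    yPy = y @ P @ y
    _, logdetH = np.linalg.slogdet(H)
    _, logdetXHX = np.linalg.slogdet(X.T @ np.linalg.solve(H, X))
    lR = -0.5 * ((n - p) * np.log(sigma2) + logdetH + logdetXHX + yPy / sigma2)
    H_dot = [Z @ Z.T, np.eye(n)]                 # dH/dgamma, dH/dphi
    s_sigma2 = -0.5 * ((n - p) / sigma2 - yPy / sigma2**2)
    s_kappa = [-0.5 * (np.trace(P @ Hd) - (y @ P @ Hd @ P @ y) / sigma2) for Hd in H_dot]
    return lR, np.array([s_sigma2, *s_kappa])
```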

Given an initial estimate θ_0 of the variance parameters, a standard approach to maximize ℓ_R, or equivalently to find the roots of the score equations S(θ) = 0, is the Newton–Raphson method (Algorithm 1), which requires the elements of the observed information matrix.

Algorithm 1 Newton–Raphson method to solve S(θ) = 0.
1: Give an initial guess θ_0
2: for k = 0, 1, 2, · · · until convergence do
3:   Solve I_O(θ_k) δ_k = S(θ_k)
4:   θ_{k+1} = θ_k + δ_k
5: end for

For convenience, we list the elements of the observed information matrix in the second column of Table 1. In particular,

I_O(κ_i, κ_j) = [tr(P H_{ij}) − tr(P H_i P H_j)]/2 + [2 y^T P H_i P H_j P y − y^T P H_{ij} P y]/(2σ^2),    (7)

where H_i = ∂H/∂κ_i and H_{ij} = ∂^2H/∂κ_i∂κ_j. Each element I_O(κ_i, κ_j) involves two trace terms, the traces of products of two and of four matrices respectively. Evaluating these terms is computationally expensive, which prohibits the practical use of the (exact) Newton method for large data sets. In practice, the Fisher information matrix, I = E(I_O), is preferred. The elements of the Fisher information matrix have simpler forms than those of the observed information matrix (see column 3 in Table 1); the corresponding algorithm is referred to as the Fisher scoring algorithm [10]. Still, the element I(κ_i, κ_j) of the Fisher information matrix involves a computationally expensive trace term (see Table 1). In [4], the authors introduced the average information matrix I_A [4, eq. 7] (see Table 1), which is the main part of the average of the observed and expected information matrices. Precisely,

[I(κ_i, κ_j) + I_O(κ_i, κ_j)]/2 = y^T P H_i P H_j P y/(2σ^2) + [tr(P H_{ij}) − y^T P H_{ij} P y/σ^2]/4,    (8)

where the first term on the right-hand side is I_A(κ_i, κ_j) and the second term is denoted I_Z(κ_i, κ_j).
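Before stating the main result, it may help to see Algorithm 1 as code. The sketch below is illustrative only, not the authors' implementation: the information matrix is passed in as a callable, so the same iteration can be run with I_O, with the Fisher information I (Fisher scoring), or with the average information I_A discussed next.

```python
import numpy as np

def newton_raphson(score, information, theta0, tol=1e-8, max_iter=100):
    """Algorithm 1: solve S(theta) = 0. `score` returns S(theta);
    `information` returns an information matrix (observed, Fisher, or average)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        S = score(theta)                          # S(theta_k)
        I = information(theta)                    # I(theta_k)
        delta = np.linalg.solve(I, S)             # I(theta_k) delta_k = S(theta_k)
        theta = theta + delta                     # theta_{k+1} = theta_k + delta_k
        if np.linalg.norm(delta) <= tol * (1.0 + np.linalg.norm(theta)):
            break
    return theta
```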

The authors in [4] set the term I_Z(κ_i, κ_j) to zero without a proof. A natural explanation comes from the perspective of classical matrix splitting [15, p.94]. The explanation can be formulated as a general statement.

3. Main result

Theorem 1. Let I_O and I be the observed information matrix and the Fisher information matrix for the residual log-likelihood of the linear mixed model, respectively. Then the average of the observed information matrix and the Fisher information matrix can be split as (I_O + I)/2 = I_A + I_Z, such that the expectation of I_A is the Fisher information matrix and E(I_Z) = 0.

Such a splitting aims to remove computationally expensive yet negligible terms, so that a Newton-like method becomes applicable to large data sets involving many thousands of fixed and random effects, while keeping the essential information in the observed information matrix. An approximate Newton iteration is obtained by replacing I_O with I_A in Algorithm 1.

Table 1: Observed information splitting table.

index        | I_O [13, p.252]                 | I                   | I_A [4]
(σ^2, σ^2)   | y^T P y/σ^6 − (n − ν)/(2σ^4)    | (n − ν)/(2σ^4)      | y^T P y/(2σ^6)
(σ^2, κ_i)   | y^T P H_i P y/(2σ^4)            | tr(P H_i)/(2σ^2)    | y^T P H_i P y/(2σ^4)
(κ_i, κ_j)   | I_O(κ_i, κ_j)                   | tr(P H_i P H_j)/2   | y^T P H_i P H_j P y/(2σ^2)

Here I_O(κ_i, κ_j) = [tr(P H_{ij}) − tr(P H_i P H_j)]/2 + [2 y^T P H_i P H_j P y − y^T P H_{ij} P y]/(2σ^2), with H_i = ∂H/∂κ_i and H_{ij} = ∂^2H/∂κ_i∂κ_j.

4. Proof of the main result

4.1. Basic lemma

Lemma 1. Let y ∼ N(Xτ, σ^2 H) be a random vector, where H is a symmetric positive definite matrix and rank(X) = ν. Then P = H^{-1} − H^{-1} X (X^T H^{-1} X)^{-1} X^T H^{-1} is a weighted projection matrix such that
1. PX = 0;
2. PHP = P;
3. tr(PH) = n − ν;
4. PE(yy^T) = σ^2 PH.

Proof. The first two items can be verified directly by computation. Since H is positive definite, there exists H^{1/2} such that

tr(PH) = tr(H^{1/2} P H^{1/2}) = tr(I − X̂(X̂^T X̂)^{-1} X̂^T) = n − rank(X̂) = n − ν,

where X̂ = H^{-1/2} X. The fourth item follows because

PE(yy^T) = P(var(y) + Xτ(Xτ)^T) = σ^2 PH + PXτ(Xτ)^T = σ^2 PH.
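The projection properties in Lemma 1 are easy to confirm numerically; the short sketch below (purely illustrative, with randomly generated X and H) checks items 1–3 up to rounding error.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 4
X = rng.standard_normal((n, p))
A = rng.standard_normal((n, n))
H = A @ A.T + n * np.eye(n)                      # a symmetric positive definite H

Hinv = np.linalg.inv(H)
HinvX = Hinv @ X
P = Hinv - HinvX @ np.linalg.solve(X.T @ HinvX, HinvX.T)

print(np.allclose(P @ X, 0))                     # item 1: PX = 0
print(np.allclose(P @ H @ P, P))                 # item 2: PHP = P
print(np.isclose(np.trace(P @ H), n - np.linalg.matrix_rank(X)))   # item 3: tr(PH) = n - nu
```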

Lemma 2. Let H be a parametric matrix in κ and X a constant matrix. Then the partial derivative of the projection matrix P = H^{-1} − H^{-1} X (X^T H^{-1} X)^{-1} X^T H^{-1} with respect to κ_i is Ṗ_i = −P Ḣ_i P, where Ṗ_i = ∂P/∂κ_i and Ḣ_i = ∂H/∂κ_i.

Proof. Using the derivative of the inverse of a matrix,

∂A^{-1}/∂κ_i = −A^{-1} (∂A/∂κ_i) A^{-1},

we have

Ṗ_i = ∂/∂κ_i { H^{-1} − H^{-1} X (X^T H^{-1} X)^{-1} X^T H^{-1} }
    = −H^{-1} Ḣ_i H^{-1} + H^{-1} Ḣ_i H^{-1} X (X^T H^{-1} X)^{-1} X^T H^{-1}
      − H^{-1} X (X^T H^{-1} X)^{-1} X^T H^{-1} Ḣ_i H^{-1} X (X^T H^{-1} X)^{-1} X^T H^{-1}
      + H^{-1} X (X^T H^{-1} X)^{-1} X^T H^{-1} Ḣ_i H^{-1}
    = { −H^{-1} + H^{-1} X (X^T H^{-1} X)^{-1} X^T H^{-1} } Ḣ_i P
    = −P Ḣ_i P.

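Lemma 2 can also be checked by finite differences. The sketch below is illustrative and assumes a single-parameter structure H(κ) = I + κ Z Z^T; it compares the analytic derivative −P Ḣ P with a central difference of P.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, b = 25, 3, 6
X = rng.standard_normal((n, p))
Z = rng.standard_normal((n, b))

def P_of(kappa):
    """P(kappa) for H(kappa) = I + kappa * Z Z^T."""
    H = np.eye(n) + kappa * (Z @ Z.T)
    Hinv = np.linalg.inv(H)
    HinvX = Hinv @ X
    return Hinv - HinvX @ np.linalg.solve(X.T @ HinvX, HinvX.T)

kappa, eps = 0.7, 1e-6
P = P_of(kappa)
H_dot = Z @ Z.T                                   # dH/dkappa
analytic = -P @ H_dot @ P                         # Lemma 2: dP/dkappa = -P Hdot P
numeric = (P_of(kappa + eps) - P_of(kappa - eps)) / (2 * eps)
print(np.max(np.abs(analytic - numeric)))         # small, at finite-difference accuracy
```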

4.2. Formulas of the observed information matrix

Lemma 3. The elements of the observed information matrix for the residual log-likelihood (3) are given by

I_O(σ^2, σ^2) = y^T P y/σ^6 − (n − p)/(2σ^4),    (9)
I_O(σ^2, κ_i) = y^T P Ḣ_i P y/(2σ^4),    (10)
I_O(κ_i, κ_j) = [tr(P Ḧ_{ij}) − tr(P Ḣ_i P Ḣ_j)]/2 + [2 y^T P Ḣ_i P Ḣ_j P y − y^T P Ḧ_{ij} P y]/(2σ^2),    (11)

where Ḣ_i = ∂H/∂κ_i and Ḧ_{ij} = ∂^2H/∂κ_i∂κ_j.

Proof. The result (9) is standard and follows from the definition. The result (10) follows by applying Lemma 2 to the score (4). The first term in (11) follows because

∂/∂κ_j tr(P Ḣ_i) = tr(P Ḧ_{ij}) + tr(Ṗ_j Ḣ_i) = tr(P Ḧ_{ij}) − tr(P Ḣ_j P Ḣ_i),

since Ṗ_j = −P Ḣ_j P. The second term in (11) follows because, employing the result in Lemma 2,

−∂/∂κ_j (P Ḣ_i P) = P Ḣ_j P Ḣ_i P − P Ḧ_{ij} P + P Ḣ_i P Ḣ_j P.    (12)

Further, note that Ḣ_i, Ḣ_j and P are symmetric, so that y^T P Ḣ_i P Ḣ_j P y = y^T P Ḣ_j P Ḣ_i P y.
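As an illustration of Lemma 3, the sketch below assembles I_O from (9)–(11) for the same assumed toy structure G = γI, R = φI (so that κ = (γ, φ) and all second derivatives Ḧ_{ij} vanish), reusing the hypothetical projection() helper from Section 2. The result can be compared against a finite-difference negative Hessian of ℓ_R as a check.

```python
import numpy as np
# assumes projection() from the earlier sketch

def observed_information(y, X, Z, sigma2, gamma, phi):
    """Assemble I_O over theta = (sigma^2, gamma, phi) from (9)-(11) for the
    toy model H = phi*I + gamma*Z Z^T, where Hdot = (Z Z^T, I) and Hddot = 0."""
    n, p = X.shape
    H = phi * np.eye(n) + gamma * (Z @ Z.T)
    P = projection(H, X)
    Py = P @ y
    yPy = y @ Py
    H_dot = [Z @ Z.T, np.eye(n)]                 # dH/dgamma, dH/dphi
    IO = np.zeros((3, 3))
    IO[0, 0] = yPy / sigma2**3 - (n - p) / (2 * sigma2**2)                    # (9)
    for i in (0, 1):
        IO[0, 1 + i] = IO[1 + i, 0] = (Py @ H_dot[i] @ Py) / (2 * sigma2**2)  # (10)
    for i in (0, 1):
        for j in (0, 1):
            PHi, PHj = P @ H_dot[i], P @ H_dot[j]
            IO[1 + i, 1 + j] = (-np.trace(PHi @ PHj) / 2                      # (11) with
                                + (y @ PHi @ PHj @ Py) / sigma2)              # Hddot_ij = 0
    return IO
```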

4.3. Formulas of the Fisher information matrix

The Fisher information matrix, I, is the expected value of the observed information matrix, I = E(I_O). It has a simpler form than the observed information matrix while providing the essential information supplied by the data, and thus it is a natural approximation to the negative Hessian of the log-likelihood.

Lemma 4. The elements of the Fisher information matrix for the residual log-likelihood function (3) are given by

I(σ^2, σ^2) = E(I_O(σ^2, σ^2)) = tr(PH)/(2σ^4) = (n − ν)/(2σ^4),    (13)
I(σ^2, κ_i) = E(I_O(σ^2, κ_i)) = tr(P Ḣ_i)/(2σ^2),    (14)
I(κ_i, κ_j) = E(I_O(κ_i, κ_j)) = tr(P Ḣ_i P Ḣ_j)/2.    (15)

Proof. The formulas can be found in [13]; here we supply an alternative proof. First note that PX = 0 and

PE(yy^T) = P(σ^2 H + Xτ(Xτ)^T) = σ^2 PH.    (16)

Then

E(y^T P y) = E(tr(P yy^T)) = tr(P E(yy^T)) = σ^2 tr(PH) = (n − ν)σ^2.    (17)

Therefore

E(I_O(σ^2, σ^2)) = E(y^T P y)/σ^6 − (n − ν)/(2σ^4) = (n − ν)/(2σ^4).    (18)

Second, notice that PHP = P. Applying the same argument as in (17), we have

E(y^T P Ḣ_i P y) = tr(P Ḣ_i P E(yy^T)) = σ^2 tr(P Ḣ_i P H) = σ^2 tr(P H P Ḣ_i) = σ^2 tr(P Ḣ_i),    (19)
E(y^T P Ḣ_i P Ḣ_j P y) = σ^2 tr(P Ḣ_i P Ḣ_j P H) = σ^2 tr(P H P Ḣ_i P Ḣ_j) = σ^2 tr(P Ḣ_i P Ḣ_j),    (20)
E(y^T P Ḧ_{ij} P y) = σ^2 tr(P Ḧ_{ij} P H) = σ^2 tr(P Ḧ_{ij}).    (21)

Substituting (19) into (10) yields (14); substituting (20) and (21) into (11) yields (15).

Using the Fisher information matrix as an approximation to the negative Hessian results in the famous Fisher scoring algorithm [10], which is widely used in many machine learning algorithms.

4.4. Proof of the main result

Proof. Let

I_A(σ^2, σ^2) = y^T P y/(2σ^6),    (22)
I_A(σ^2, κ_i) = y^T P Ḣ_i P y/(2σ^4),    (23)
I_A(κ_i, κ_j) = y^T P Ḣ_i P Ḣ_j P y/(2σ^2);    (24)

then I_Z = (I_O + I)/2 − I_A has elements

I_Z(σ^2, σ^2) = 0,    (25)
I_Z(σ^2, κ_i) = tr(P Ḣ_i)/(4σ^2) − y^T P Ḣ_i P y/(4σ^4),    (26)
I_Z(κ_i, κ_j) = [tr(P Ḧ_{ij}) − y^T P Ḧ_{ij} P y/σ^2]/4.    (27)

Applying the result in (17), we have

E(I_A(σ^2, σ^2)) = (n − ν)/(2σ^4) = I(σ^2, σ^2).    (28)

Applying the result in (19), we have

E(I_A(σ^2, κ_i)) = tr(P Ḣ_i)/(2σ^2) = I(σ^2, κ_i) and E(I_Z(σ^2, κ_i)) = 0.    (29)

Applying the results in (20) and (21), we have

E(I_A(κ_i, κ_j)) = tr(P Ḣ_i P Ḣ_j)/2 = I(κ_i, κ_j) and E(I_Z(κ_i, κ_j)) = 0.    (30)

Hence E(I_A) = I and E(I_Z) = 0, which proves Theorem 1.
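The splitting can also be probed numerically. The following sketch is illustrative: it again assumes the toy structure G = γI, R = φI (so that Ḧ_{ij} = 0 and I_Z(κ_i, κ_j) vanishes identically), reuses the hypothetical projection() helper, and checks over repeated samples of y that the empirical mean of I_A(κ_i, κ_j) approaches I(κ_i, κ_j) while the mean of I_Z(σ^2, κ_i) shrinks toward zero.

```python
import numpy as np
# assumes projection() from the earlier sketch

rng = np.random.default_rng(2)
n, p, b = 40, 3, 8
sigma2, gamma, phi = 1.0, 0.5, 1.0
X = rng.standard_normal((n, p))
Z = rng.standard_normal((n, b))
tau = rng.standard_normal(p)

H = phi * np.eye(n) + gamma * (Z @ Z.T)
P = projection(H, X)
Hi, Hj = Z @ Z.T, np.eye(n)                     # dH/dgamma, dH/dphi
fisher_ij = np.trace(P @ Hi @ P @ Hj) / 2       # I(kappa_i, kappa_j), eq. (15)

C = np.linalg.cholesky(sigma2 * H)
IA_ij, IZ_si = [], []
for _ in range(2000):
    y = X @ tau + C @ rng.standard_normal(n)    # y ~ N(X tau, sigma^2 H)
    Py = P @ y
    IA_ij.append((Py @ Hi @ P @ Hj @ Py) / (2 * sigma2))            # eq. (24)
    IZ_si.append(np.trace(P @ Hi) / (4 * sigma2)
                 - (Py @ Hi @ Py) / (4 * sigma2**2))                # eq. (26)

print(np.mean(IA_ij), fisher_ij)                # close: E(I_A) = I
print(np.mean(IZ_si))                           # close to zero: E(I_Z) = 0
```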

5. Discussion

Compare I_A with I_O and I in Table 1. In contrast with I_O(κ_i, κ_j), which involves four matrix–matrix products, I_A(κ_i, κ_j) only involves a quadratic form which can be evaluated with four matrix–vector multiplications and one inner product: let ξ = Py, η_i = H_i ξ, η_j = H_j ξ and ζ = P η_j; then I_A(κ_i, κ_j) = η_i^T ζ/(2σ^2). This significantly reduces the cost of evaluating the approximate Hessian of the log-likelihood function.
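A sketch of this evaluation strategy follows (illustrative only; here P is applied as an explicit dense matrix, whereas in practice one would apply it implicitly through factorizations of H and X^T H^{-1} X):

```python
import numpy as np

def average_information_element(P, H_i, H_j, y, sigma2):
    """I_A(kappa_i, kappa_j) = y^T P H_i P H_j P y / (2 sigma^2), evaluated with
    four matrix-vector products and one inner product, as in Section 5."""
    xi = P @ y            # xi = P y
    eta_i = H_i @ xi      # eta_i = H_i xi
    eta_j = H_j @ xi      # eta_j = H_j xi
    zeta = P @ eta_j      # zeta = P eta_j
    return (eta_i @ zeta) / (2 * sigma2)
```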


Acknowledgement

The first author would like to thank Dr Sue Welham and Professor Robin Thompson at Rothamsted Research for introducing him to the restricted maximum likelihood method.

References

[1] F.Y. Edgeworth and R.A. Fisher. On the efficiency of maximum likelihood estimation. Annals of Statistics, 4(3):501–514, 1976.
[2] B. Efron and D.V. Hinkley. Assessing the accuracy of the maximum likelihood estimator: observed versus expected Fisher information. Biometrika, 65(3):457–487, 1978.
[3] R. Fisher. The Genetical Theory of Natural Selection. Oxford: Clarendon Press, 1930.
[4] A.R. Gilmour, R. Thompson, and B.R. Cullis. Average information REML: An efficient algorithm for variance parameter estimation in linear mixed models. Biometrics, 51(4):1440–1450, 1995.
[5] A.F. Heavens, M. Seikel, B.D. Nord, M. Aich, Y. Bouffanais, B.A. Bassett, and M.P. Hobson. Generalised Fisher matrices. Mon. Not. R. Astron. Soc., arXiv:1404.2854v2.
[6] W. Janke, D.A. Johnston, and R. Kenna. Information geometry and phase transitions. Physica A, 336(1-2):181, 2004.
[7] R.I. Jennrich and P.F. Sampson. Newton–Raphson and related algorithms for maximum likelihood variance component estimation. Technometrics, 18(1):11–17, 1976.
[8] C. Lippert, J. Listgarten, Y. Liu, C.M. Kadie, R.I. Davidson, and D. Heckerman. Fast linear mixed models for genome-wide association studies. Nature Methods, 8(10):833–835, 2011.
[9] J. Listgarten, C. Lippert, C.M. Kadie, R.I. Davidson, E. Eskin, and D. Heckerman. Improved linear mixed models for genome-wide association studies. Nature Methods, 9(6):525–526, 2012.
[10] N.T. Longford. A fast scoring algorithm for maximum likelihood estimation in unbalanced mixed models with nested random effects. Biometrika, (4):817–827, 1987.
[11] J.I. Myung and D.J. Navarro. Information matrix. In B. Everitt and D. Howell, editors, Encyclopedia of Behavioral Statistics. 2004.
[12] M. Prokopenko, J.T. Lizier, O. Obst, and X.R. Wang. Relating Fisher information to order parameters. Physical Review E, 84(4), 2011.
[13] S.R. Searle, G. Casella, and C.E. McCulloch. Variance Components. Wiley Series in Probability and Statistics. Wiley-Interscience, Hoboken, NJ, 2006. Reprint of the 1992 original.
[14] M. Vallisneri. A user manual for the Fisher information matrix. Technical report, California Institute of Technology, Jet Propulsion Laboratory, 2007.
[15] R.S. Varga. Matrix Iterative Analysis. Springer-Verlag, second edition.
[16] R. Zamir. A necessary and sufficient condition for equality in the matrix Fisher information inequality. Technical report, Tel Aviv University, 1997.
[17] R. Zamir. A proof of the Fisher information inequality via a data processing argument. IEEE Transactions on Information Theory, 44:1246–1250, 1998.
[18] Z. Zhang, E. Ersoz, C.-Q. Lai, R.J. Todhunter, H.K. Tiwari, M.A. Gore, P.J. Bradbury, J. Yu, D.K. Arnett, J.M. Ordovas, et al. Mixed linear model approach adapted for genome-wide association studies. Nature Genetics, 42(4):355–360, 2010.
[19] X. Zhou and M. Stephens. Genome-wide efficient mixed-model analysis for association studies. Nature Genetics, 44(7):821–824, 2012.
