Biometrika (2001), 88, 3, pp. 727–737 © 2001 Biometrika Trust Printed in Great Britain
Case-deletion measures for models with incomplete data

BY HONGTU ZHU

Department of Mathematics and Statistics, University of Victoria, P.O. Box 3045, Victoria, B.C., Canada V8W 3P4
[email protected]
SIK-YUM LEE

Department of Statistics, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong, China
[email protected]
BO-CHENG WEI

Department of Applied Mathematics, Southeast University, Nanjing 210096, China
[email protected]
JULIE ZHOU

Department of Mathematics and Statistics, University of Victoria, P.O. Box 3045, Victoria, B.C., Canada V8W 3P4
[email protected]
SUMMARY

This paper proposes several case-deletion measures for assessing the influence of an observation for complicated models with real missing data or hypothetical missing data corresponding to latent random variables. The idea is to generalise Cook's (1977) approach to the conditional expectation of the complete-data loglikelihood function in the EM algorithm. On the basis of the diagnostic measures, a procedure is proposed for detecting influential observations. Two examples illustrate our methodology. We show that the method can be applied efficiently to a wide variety of complicated problems that are difficult to handle by existing methods.

Some key words: Case-deletion measure; EM algorithm; Missing data; Q-function.
1. INTRODUCTION

The identification of influential observations has received a great deal of attention; see the extensive bibliographies of Atkinson (1988), Cook & Weisberg (1982), Belsley et al. (1980), Chatterjee & Hadi (1988) and Rousseeuw & Leroy (1987). To quantify the impact of the $i$th observation, the most common approach is to compute single-case diagnostics with the $i$th case deleted. Since the pioneering work of Cook (1977), case-deletion diagnostics such as Cook's distance or the likelihood distance have been successfully applied to various statistical models; see for example Andersen (1992), Atkinson (1982), Banerjee & Frees (1997), Davison & Tsai (1992), Wei (1998) and references therein. Recently, many
important models have been proposed with latent variables and/or incomplete data. The development of case-deletion diagnostic measures for such models is difficult because the associated likelihood functions usually involve intractable integrals (Davidian & Giltinan, 1995, p. 328). The EM algorithm (Dempster et al., 1977) is a very powerful method for computing maximum likelihood estimates in incomplete-data problems, and many complicated statistical models can be analysed by augmenting the observed data with some appropriate hypothetical missing data; see for example Tanner (1993), McCulloch (1997), Meng & van Dyk (1997) and McLachlan & Krishnan (1997). Inspired by the EM algorithm, we develop case-deletion measures based on the conditional expectation of the complete-data loglikelihood function instead of the more complicated observed-data loglikelihood. The approach is simple to implement and is generally applicable to models which can be analysed via EM-type algorithms. Shih & Weisberg (1986) proposed a case-deletion method based on the EM algorithm for linear regression with incomplete data, but our approach is different and more generally applicable.

The classical case-deletion measures are briefly reviewed in § 2. In § 3, we develop some case-deletion measures for assessing global influence for general models with incomplete data. Statistical properties and motivation of these measures are discussed, and a procedure for identifying influential observations is proposed. Two illustrative examples are reported in § 4.

2. A REVIEW OF CASE-DELETION MEASURES

Let $Y_o$ and $Y_m$ be the observed data and the missing data, respectively, and let $Y_c = \{Y_o, Y_m\}$ be the complete dataset with density function $p(Y_c \mid \theta)$ parameterised by an $r$-dimensional parameter vector $\theta$. In most statistical applications, $L_c(\theta \mid Y_c) = \log p(Y_c \mid \theta)$ is simple, in contrast to the observed-data loglikelihood $L_o(\theta \mid Y_o) = \log p(Y_o \mid \theta)$. Let $y_{i,\mathrm{obs}}$ be the $i$th observation of $Y_o = \{y_{1,\mathrm{obs}}, \ldots, y_{n,\mathrm{obs}}\}$. A quantity with a subscript '$[i]$' means the relevant quantity with the $i$th observation $y_{i,\mathrm{obs}}$ deleted. For instance, $Y_{o[i]}$, $Y_{m[i]}$ and $Y_{c[i]}$ are the corresponding observed data, missing data and complete data with $y_{i,\mathrm{obs}}$ deleted. Dots over a function denote derivatives with respect to $\theta$; for example, $\dot{L}_o(\theta \mid Y_o) = \partial L_o(\theta \mid Y_o)/\partial\theta$ and $\ddot{L}_o(\theta \mid Y_o) = \partial^2 L_o(\theta \mid Y_o)/\partial\theta\,\partial\theta^{\mathrm{T}}$.

Let $\hat\theta$ and $\hat\theta^*_{[i]}$ be the maximum likelihood estimators of $\theta$ based on $L_o(\theta \mid Y_o)$ and $L_o(\theta \mid Y_{o[i]})$, respectively. Cook's distance $C_i$ is given by
$$C_i(M) = (\hat\theta^*_{[i]} - \hat\theta)^{\mathrm{T}} M (\hat\theta^*_{[i]} - \hat\theta),$$
for an appropriate choice of $M$. Cook & Weisberg (1982) considered several choices of $M$ for the linear regression model. In general, the choice can refer either to an external reference scale or to the internal scatter of the changes from $\hat\theta$ to $\hat\theta^*_{[i]}$ (Cook & Weisberg, 1982, p. 113). For external norms, $M$ is usually chosen to be $I_{\mathrm{obs}} = -\ddot{L}_o(\theta \mid Y_o)|_{\theta=\hat\theta}$, which corresponds to the asymptotic confidence contour
$$\{\theta : (\theta - \hat\theta)^{\mathrm{T}} I_{\mathrm{obs}} (\theta - \hat\theta) \leq \chi^2(\alpha; r)\}, \qquad (1)$$
where $\chi^2(\alpha; r)$ is the upper $\alpha$ point of the chi-squared distribution with $r$ degrees of freedom. For internal norms, the $n$ values $\hat\theta - \hat\theta^*_{[i]}$ are treated as a random sample to which Wilks' (1963) multivariate outlier technique may be applied to detect influential observations.

Alternatively, a natural measure of the difference between $\hat\theta$ and $\hat\theta^*_{[i]}$ corresponding to the change in values of the observed-data loglikelihood function is the following likelihood
distance $\mathrm{LD}_i$:
$$\mathrm{LD}_i = 2\{L_o(\hat\theta \mid Y_o) - L_o(\hat\theta^*_{[i]} \mid Y_o)\}.$$
A motivation for using the likelihood distance is that it may be interpreted in terms of the following asymptotic confidence region (Cox & Hinkley, 1974; Cook & Weisberg, 1982):
$$\{\theta : 2[L_o(\hat\theta) - L_o(\theta)] \leq \chi^2(\alpha; r)\}. \qquad (2)$$
Hence, $C_i(M)$ and $\mathrm{LD}_i$ are calibrated by comparison to the $\chi^2$ distribution. In practice, the $i$th observation is regarded as influential if the value of $C_i(M)$ or $\mathrm{LD}_i$ is relatively large. The degree of 'largeness' (Cook & Weisberg, 1982, p. 183) can be based on critical points of the $\chi^2$ distribution. As mentioned in § 1, for some complicated models the observed-data loglikelihood function is intractable, and this leads to serious difficulties in applying Cook's distance or the likelihood distance.

3. INFLUENCE MEASURES FOR MODELS WITH INCOMPLETE DATA

3·1. Case-deletion estimate and its approximation

We base our approach on the function
$$Q(\theta \mid \hat\theta) = E\{L_c(\theta \mid Y_c) \mid Y_o, \hat\theta\}, \qquad (3)$$
which plays a crucial role in the E-step of the EM algorithm. For many incomplete-data problems with intractable loglikelihood functions, the function $Q(\theta \mid \hat\theta)$ is significantly simpler. To assess the influence of the $i$th case, we will consider the function
$$Q_{[i]}(\theta \mid \hat\theta) = E\{L_c(\theta \mid Y_{c[i]}) \mid Y_o, \hat\theta\}.$$
Let $\hat\theta_{[i]}$ be the maximiser of $Q_{[i]}(\theta \mid \hat\theta)$. Since $Q(\theta \mid \hat\theta)$ achieves its global maximum at $\hat\theta$, more attention should be paid to a case whose deletion seriously influences the estimate; if $\hat\theta_{[i]}$ is far from $\hat\theta$ in some sense, then the $i$th case is 'influential'. To obtain $\hat\theta_{[i]}$, we can apply the Newton–Raphson algorithm to $Q_{[i]}(\theta \mid \hat\theta)$, with $\hat\theta$ as the initial value. Since $\hat\theta_{[i]}$ is needed for every case, the total computational burden involved can be quite heavy, and the following one-step approximation $\hat\theta^1_{[i]}$ of $\hat\theta_{[i]}$ is used to reduce the burden:
$$\hat\theta^1_{[i]} = \hat\theta + \{-\ddot{Q}(\hat\theta \mid \hat\theta)\}^{-1} \dot{Q}_{[i]}(\hat\theta \mid \hat\theta). \qquad (4)$$
Note that the Hessian matrix $-\ddot{Q}_{[i]}(\theta \mid \hat\theta)|_{\theta=\hat\theta}$ in the true Newton–Raphson step is replaced by $-\ddot{Q}(\hat\theta \mid \hat\theta) = -\ddot{Q}(\theta \mid \hat\theta)|_{\theta=\hat\theta}$. For diagnostic purposes, the approximation defined by (4) might be adequate for assessing influence even if the corresponding estimator is poor (Cook & Weisberg, 1982; Kass et al., 1989). An additional justification for using the approximation is given by the following theorem.

THEOREM 1. Assuming that
$$\ddot{Q}(\theta \mid \hat\theta) = O_p(n), \quad \dot{Q}_{[i]}(\theta \mid \hat\theta) - \dot{Q}(\theta \mid \hat\theta) = O_p(1), \quad \ddot{Q}_{[i]}(\theta \mid \hat\theta) - \ddot{Q}(\theta \mid \hat\theta) = O_p(1),$$
we have that
$$\hat\theta_{[i]} = \hat\theta + \{-\ddot{Q}(\hat\theta \mid \hat\theta)\}^{-1} \dot{Q}_{[i]}(\hat\theta \mid \hat\theta) + O_p(n^{-2}) = \hat\theta^1_{[i]} + O_p(n^{-2}). \qquad (5)$$

Proofs of Theorem 1 and other theorems are sketched in the Appendix. The validity of
assumptions in Theorem 1 can be seen from the following argument. Assume that $Y_m$ and $Y_c$ are given as $Y_m = (y_{1,\mathrm{mis}}, \ldots, y_{n,\mathrm{mis}})$ and $Y_c = (y_{1,\mathrm{com}}, \ldots, y_{n,\mathrm{com}})$. A statistical model with incomplete data is called an independent-type-incomplete-data model if (i) $y_{i,\mathrm{com}} = (y_{i,\mathrm{obs}}, y_{i,\mathrm{mis}})$, and (ii) $y_{i,\mathrm{com}}$ and $y_{j,\mathrm{com}}$ are independent for $i \neq j$. This kind of model is general and subsumes most commonly used models, such as the generalised linear mixed model. For this type of model, it can be shown that
$$\ddot{Q}(\theta \mid \hat\theta) = \sum_{i=1}^n E\{\ddot{L}_c(\theta \mid y_{i,\mathrm{com}}) \mid y_{i,\mathrm{obs}}, \hat\theta\},$$
$$\dot{Q}_{[i]}(\theta \mid \hat\theta) - \dot{Q}(\theta \mid \hat\theta) = -E\{\dot{L}_c(\theta \mid y_{i,\mathrm{com}}) \mid y_{i,\mathrm{obs}}, \hat\theta\},$$
$$\ddot{Q}_{[i]}(\theta \mid \hat\theta) - \ddot{Q}(\theta \mid \hat\theta) = -E\{\ddot{L}_c(\theta \mid y_{i,\mathrm{com}}) \mid y_{i,\mathrm{obs}}, \hat\theta\}.$$
Also, since $\dot{Q}(\hat\theta \mid \hat\theta) = 0$, we have that (McLachlan & Krishnan, 1997, p. 85)
$$\dot{Q}_{[i]}(\hat\theta \mid \hat\theta) = -E\{\dot{L}_c(\hat\theta \mid y_{i,\mathrm{com}}) \mid y_{i,\mathrm{obs}}, \hat\theta\} = -\dot{L}_o(\hat\theta \mid y_{i,\mathrm{obs}}) = O_p(1). \qquad (6)$$
It follows that the assumptions in Theorem 1 hold at least for independent-type-incomplete-data models.

3·2. Asymptotic region

An important reason for using $\mathrm{LD}_i$, or $C_i(M)$, is its interpretation in terms of the asymptotic confidence region (2) or (1). We now show that analogous results hold for the $Q$-function. Let $I_{\mathrm{com}} = -\ddot{Q}(\hat\theta \mid \hat\theta)$, $I_{\mathrm{obs}}$ and $I_{\mathrm{mis}}$ be the complete information matrix, the observed information matrix and the missing information matrix (McLachlan & Krishnan, 1997, p. 101), respectively.

THEOREM 2. Under mild conditions,
$$2\{Q(\hat\theta \mid \hat\theta) - Q(\theta_0 \mid \hat\theta)\} = (\hat\theta - \theta_0)^{\mathrm{T}} I_{\mathrm{com}} (\hat\theta - \theta_0) + O_p(n^{-1/2}). \qquad (7)$$
Moreover, if $B$ is a nonnegative definite matrix with eigenvalues $\lambda_1 \geq \cdots \geq \lambda_r \geq 0$ and $I_{\mathrm{obs}}^{-1/2} I_{\mathrm{mis}} I_{\mathrm{obs}}^{-1/2} \to B$ in probability, then both $2\{Q(\hat\theta \mid \hat\theta) - Q(\theta_0 \mid \hat\theta)\}$ and $(\hat\theta - \theta_0)^{\mathrm{T}} I_{\mathrm{com}} (\hat\theta - \theta_0)$ converge to $X$ in distribution, where $X$ has the same distribution as $\sum_{j=1}^r (1 + \lambda_j) Z_j^2$, in which $Z_1, \ldots, Z_r$ are independently and identically distributed as $N(0, 1)$.

Theorem 2 reveals the close asymptotic relationship between $2\{Q(\hat\theta \mid \hat\theta) - Q(\theta_0 \mid \hat\theta)\}$ and $(\hat\theta - \theta_0)^{\mathrm{T}} I_{\mathrm{com}} (\hat\theta - \theta_0)$. The asymptotic distribution is a weighted $\chi^2$ distribution with weights determined by the 'ratio' of $I_{\mathrm{mis}}$ over $I_{\mathrm{obs}}$. When the missing information $I_{\mathrm{mis}}$ is relatively small with respect to $I_{\mathrm{obs}}$, $2\{Q(\hat\theta \mid \hat\theta) - Q(\theta_0 \mid \hat\theta)\}$ is asymptotically close to the likelihood distance
$$2\{L_o(\hat\theta \mid Y_o) - L_o(\theta_0 \mid Y_o)\} = (\hat\theta - \theta_0)^{\mathrm{T}} I_{\mathrm{obs}} (\hat\theta - \theta_0) + o_p(\|\hat\theta - \theta_0\|^2).$$
Moreover, we can construct the asymptotic confidence regions
$$\{\theta : (\theta - \hat\theta)^{\mathrm{T}} I_{\mathrm{com}} (\theta - \hat\theta) \leq X(\alpha)\}, \qquad \{\theta : 2\{Q(\hat\theta \mid \hat\theta) - Q(\theta \mid \hat\theta)\} \leq X(\alpha)\},$$
where $X(\alpha)$ is the upper $\alpha$ point of the $X$ distribution.

3·3. Case-deletion measures

Case-deletion measures based on the $Q$-function can be developed as in § 2. Influence of the $i$th case can be measured by a suitably weighted combination of the elements of $\hat\theta - \hat\theta_{[i]}$ (Cook & Weisberg, 1982). The first norm is defined by an external
reference scale with $M = -\ddot{Q}(\hat\theta \mid \hat\theta)$. Thus, we obtain a generalised Cook distance for models with incomplete data:
$$D_i(I_{\mathrm{com}}) = (\hat\theta_{[i]} - \hat\theta)^{\mathrm{T}} \{-\ddot{Q}(\hat\theta \mid \hat\theta)\} (\hat\theta_{[i]} - \hat\theta). \qquad (8)$$
Another measure of the difference between $\hat\theta$ and $\hat\theta_{[i]}$ is the $Q$-distance $\mathrm{QD}_i$, where
$$\mathrm{QD}_i = 2\{Q(\hat\theta \mid \hat\theta) - Q(\hat\theta_{[i]} \mid \hat\theta)\}, \qquad (9)$$
which is related to the change in values of the $Q$-function. Substituting (4) into (8) and (9), we obtain the following approximations, $\mathrm{QD}^1_i$ and $D^1_i(I_{\mathrm{com}})$, of $\mathrm{QD}_i$ and $D_i(I_{\mathrm{com}})$, respectively:
$$\mathrm{QD}^1_i = 2\{Q(\hat\theta \mid \hat\theta) - Q(\hat\theta^1_{[i]} \mid \hat\theta)\}, \quad D^1_i(I_{\mathrm{com}}) = \dot{Q}_{[i]}(\hat\theta \mid \hat\theta)^{\mathrm{T}} \{-\ddot{Q}(\hat\theta \mid \hat\theta)\}^{-1} \dot{Q}_{[i]}(\hat\theta \mid \hat\theta). \qquad (10)$$
It can be seen from Theorem 1 and equations (8) to (10) that the measures $D_i(I_{\mathrm{com}})$ and $\mathrm{QD}_i$ and their approximations are all positive, and
$$D_i(I_{\mathrm{com}}) = \mathrm{QD}_i + O_p(n^{-2}) = D^1_i(I_{\mathrm{com}}) + O_p(n^{-2}) = \mathrm{QD}^1_i + O_p(n^{-2}). \qquad (11)$$
Based on the above, we obtain the following results concerning the measure $D_i(I_{\mathrm{com}})$. These results are also valid for the other measures.

THEOREM 3. Under some mild conditions, we have the following results:
(i) $0 \leq D_i(I_{\mathrm{com}}) = O_p(n^{-1}) = \dot{Q}_{[i]}(\hat\theta \mid \hat\theta)^{\mathrm{T}} \{-\ddot{Q}(\hat\theta \mid \hat\theta)\}^{-1} \dot{Q}_{[i]}(\hat\theta \mid \hat\theta) + O_p(n^{-2})$;
(ii) if $B_1$ is a nonnegative definite matrix with eigenvalues $\tilde\lambda_1 \geq \cdots \geq \tilde\lambda_r \geq 0$ and
$$I_{\mathrm{com}}^{-1/2} \sum_{i=1}^n \dot{Q}_{[i]}(\hat\theta \mid \hat\theta) \dot{Q}_{[i]}(\hat\theta \mid \hat\theta)^{\mathrm{T}} I_{\mathrm{com}}^{-1/2} \to B_1$$
in probability, then
$$\sum_{i=1}^n D_i(I_{\mathrm{com}}) \to \mathrm{tr}(B_1) = \sum_{i=1}^r \tilde\lambda_i$$
in probability.

Theorem 3 shows that $D_i(I_{\mathrm{com}})$, $\mathrm{QD}_i$ and their approximations are rather small, of the order of $1/n$. Thus, it is unreasonable to use $X(\alpha)$ as the benchmark for the proposed measures, because it is usually too large. For the independent-type-incomplete-data models, the following theorem gives the relationship between the sum of the $D_i(I_{\mathrm{com}})$ and the missing/observed information ratio matrix $B$; see Theorem 2. This provides a guideline for calibrating $D_i(I_{\mathrm{com}})$ and $\mathrm{QD}_i$.

THEOREM 4. For the independent-type-incomplete-data models, under the assumptions of Theorems 1 and 2, and assuming that
$$n^{-1} \sum_{i=1}^n \dot{L}_o(\hat\theta \mid y_{i,\mathrm{obs}}) \dot{L}_o(\hat\theta \mid y_{i,\mathrm{obs}})^{\mathrm{T}} - n^{-1} I_{\mathrm{obs}} \to 0 \qquad (12)$$
in probability, we have that
$$\sum_{i=1}^n D_i(I_{\mathrm{com}}) \to \mathrm{tr}\{(I + B)^{-1}\} = \sum_{i=1}^r (1 + \lambda_i)^{-1} \qquad (13)$$
in probability, where $I$ is the $r \times r$ identity matrix. In particular, when $I_{\mathrm{obs}}^{-1} I_{\mathrm{mis}} = o_p(1)$, we have that, in probability,
$$\sum_{i=1}^n D_i(I_{\mathrm{com}}) \to r. \qquad (14)$$
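To make the calibration in Theorem 4 concrete, the following is a minimal numerical sketch (not from the paper) of the no-missing-data limiting case, where the generalised Cook distance reduces to a score-based distance $D_i = \dot{L}_o(\hat\theta \mid y_i)^{\mathrm{T}} I_{\mathrm{obs}}^{-1} \dot{L}_o(\hat\theta \mid y_i)$ for a linear model with known unit error variance; the sample size, design and true parameter values are illustrative assumptions. By (14) the measures should sum to roughly $r$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 500, 3                       # sample size, parameter dimension
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
theta_true = np.array([1.0, -0.5, 2.0])
y = X @ theta_true + rng.normal(size=n)   # unit error variance, assumed known

theta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # MLE (= ordinary least squares here)
resid = y - X @ theta_hat
I_obs = X.T @ X                                  # observed information (sigma = 1)
I_inv = np.linalg.inv(I_obs)

# Per-case score L_o_dot(theta_hat | y_i) = x_i * resid_i, and the
# case-deletion measure D_i ~ score_i^T I_obs^{-1} score_i (Theorem 4 form).
scores = X * resid[:, None]
D = np.einsum('ij,jk,ik->i', scores, I_inv, scores)

print(D.sum())   # should be close to r = 3, each D_i being of order r/n
```

Cases with $D_i$ much larger than $r/n = 0{\cdot}006$ would be flagged as influential under the guideline following Theorem 4.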
Assumption (12) is generally true for most parametric models. In particular, if $\{y_{i,\mathrm{obs}}; i = 1, \ldots, n\}$ is a random sample of identically distributed observations, both $n^{-1} I_{\mathrm{obs}}$ and $n^{-1} \sum_{i=1}^n \dot{L}_o(\hat\theta \mid y_{i,\mathrm{obs}}) \dot{L}_o(\hat\theta \mid y_{i,\mathrm{obs}})^{\mathrm{T}}$ converge to the same Fisher information matrix.

The implication of (13) is that the case-deletion measures $D_i(I_{\mathrm{com}})$ and $\mathrm{QD}_i$ and their approximations are around $\sum_{i=1}^r (1 + \lambda_i)^{-1}/n$. Attention should be paid to those cases with comparatively large values of the case-deletion measures. Hence, this provides a rough guideline for calibrating these measures. If there are no missing data, $Q(\theta \mid \hat\theta)$ is just the loglikelihood function, and $D_i(I_{\mathrm{com}})$ and $\mathrm{QD}_i$ reduce to Cook's distance $C_i(I_{\mathrm{obs}})$ and the likelihood distance $\mathrm{LD}_i$. Therefore, Theorem 4 also provides a rough guideline for calibrating $C_i(I_{\mathrm{obs}})$ and $\mathrm{LD}_i$; that is, attention should be paid to those cases with values of $C_i$ and $\mathrm{LD}_i$ that are significantly larger than $r/n$.

3·4. A practical procedure

We propose the following strategy for labelling influential points for models with incomplete data.

Step 1. Calculate maximum likelihood estimates.

Step 2. Calculate the first-order approximation $\hat\theta^1_{[i]}$.

Calculation of $\hat\theta^1_{[i]}$ is not time-consuming. For simple models, the proposed case-deletion measures can be obtained directly by taking the appropriate conditional expectations of $L_c(\theta \mid Y_c)$, $\dot{L}_c(\theta \mid Y_{c[i]})$ and $\ddot{L}_c(\theta \mid Y_c)$. For certain complicated models, explicit forms of the conditional expectations may involve high-dimensional integrals, which may be approximated by Monte Carlo methods. For example, $\ddot{Q}(\hat\theta \mid \hat\theta)$ and $\dot{Q}_{[i]}(\hat\theta \mid \hat\theta)$ can be approximated with the help of randomly drawn observations $\{Y_m^{(s)}, s = 1, \ldots, S\}$ from the conditional distribution $p(Y_m \mid Y_o, \hat\theta)$:
$$\ddot{Q}(\hat\theta \mid \hat\theta) \approx \frac{1}{S} \sum_{s=1}^S \ddot{L}_c(\hat\theta \mid Y_o, Y_m^{(s)}), \qquad \dot{Q}_{[i]}(\hat\theta \mid \hat\theta) \approx \frac{1}{S} \sum_{s=1}^S \dot{L}_c(\hat\theta \mid Y_{o[i]}, Y_{m[i]}^{(s)}).$$
We obtain $\hat\theta^1_{[i]}$ by substituting the above matrices into (4). There are now many techniques for sampling from a general density, including rejection sampling, the Gibbs sampler and Metropolis–Hastings algorithms. In fact, this random sample may well be available as a by-product of some Monte Carlo simulation in the estimation procedure.

Step 3. Calculate $D^1_i(I_{\mathrm{com}})$ and $\mathrm{QD}^1_i$: substituting $\hat\theta^1_{[i]}$ into (8), we get $D^1_i(I_{\mathrm{com}})$, and $\mathrm{QD}^1_i$ can be approximated as
$$\mathrm{QD}^1_i \approx \frac{2}{S} \sum_{s=1}^S \{L_c(\hat\theta \mid Y_o, Y_m^{(s)}) - L_c(\hat\theta^1_{[i]} \mid Y_o, Y_m^{(s)})\}.$$

Step 4. Identify influential points by computing and comparing case-deletion measures for all observations.

4. EXAMPLES

4·1. Analysis of the motorette data

The motorette data reported in Schmee & Hahn (1979) give the results of temperature-accelerated life tests on electrical insulation in 40 motorettes. At each of the temperatures
150°C, 170°C, 190°C and 220°C, ten motorettes were tested and the times to failure in hours were recorded (Tanner, 1993, p. 41). The dataset is fitted with the following right-censored linear regression model:
$$t_i = \beta_0 + \beta_1 v_i + \sigma e_i = x_i^{\mathrm{T}}\beta + \sigma e_i,$$
where $x_i = (1, v_i)^{\mathrm{T}}$, $\beta = (\beta_0, \beta_1)^{\mathrm{T}}$ and $e_i \sim N(0, 1)$; $v_i = 1000/(\text{temperature} + 273{\cdot}2)$ and $t_i = \log_{10}(i\text{th failure time})$. The data were recorded so that the first $q$ observations are uncensored and the remaining $n - q$ are censored. The relevant complete-data loglikelihood function is given by
$$L_c(\theta \mid Y_c) = -n \log\sigma - \sum_{i=1}^q \frac{(t_i - \beta_0 - \beta_1 v_i)^2}{2\sigma^2} - \sum_{i=q+1}^n \frac{(T_i - \beta_0 - \beta_1 v_i)^2}{2\sigma^2},$$
where $T_j$ is the unknown failure time for case $j$, which is known to be greater than the censored event time $c_j$. In this example, $T_{q+1}, \ldots, T_n$ are regarded as the missing data $Y_m$. Let $r_i = t_i - x_i^{\mathrm{T}}\hat\beta$, for $i = 1, \ldots, q$, and $r_i = T_i - x_i^{\mathrm{T}}\hat\beta$, for $i = q+1, \ldots, n$. It follows that
$$\frac{\partial L_c(\theta \mid Y_c)}{\partial\theta^{\mathrm{T}}} = \left( \sum_{i=1}^n r_i x_i/\sigma^2, \;\; \sum_{i=1}^n r_i^2/(2\sigma^4) - n/(2\sigma^2) \right),$$
$$\frac{\partial^2 L_c(\theta \mid Y_c)}{\partial\theta\,\partial\theta^{\mathrm{T}}} = \begin{pmatrix} -\sum_{i=1}^n x_i x_i^{\mathrm{T}}/\sigma^2 & -\sum_{i=1}^n r_i x_i/\sigma^4 \\ -\sum_{i=1}^n r_i x_i^{\mathrm{T}}/\sigma^4 & n/(2\sigma^4) - \sum_{i=1}^n r_i^2/\sigma^6 \end{pmatrix}.$$
The conditional distribution $p(t_j \mid \theta, c_j)$ is the normal distribution $N(x_j^{\mathrm{T}}\beta, \sigma^2)$ conditional on $T_j > c_j$, for $j = q+1, \ldots, n$. We delete one observation $t_j$ at a time to calculate all the $\hat\theta_{[i]}$ and $\hat\theta^1_{[i]}$, and then obtain the case-deletion measures. Index plots of $\hat\beta^1_{0[i]}$ and $\hat\beta^1_{1[i]}$ are displayed in Fig. 1(a) and (b) respectively, in which the index identifies the corresponding data point. For each $i$, $\hat\theta_{[i]}$ and $\hat\theta^1_{[i]}$ are close to each other. Index plots of $\mathrm{QD}^1_i$ and $D^1_i(I_{\mathrm{com}})$ are respectively presented in Fig. 1(c) and (d). As expected, they show similar behaviour. Index plots of $\hat\beta_{0[i]}$, $\hat\beta_{1[i]}$, $\mathrm{QD}_i$ and $D_i(I_{\mathrm{com}})$ are extremely similar. Results in Fig. 1 clearly suggest that cases 21 and 22 are influential. Compared with other observations at 190°C, cases 21 and 22 are unusual, with quite low failure times.
Finally, note that the measures for the 11th case are relatively large.
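The EM iteration underlying this example can be sketched as follows. This is a runnable illustration on simulated right-censored data, not the motorette measurements: the design, true parameters and common censoring threshold are assumptions. The E-step uses the standard truncated-normal moments $E(T \mid T > c) = \mu + \sigma\lambda(z)$ and $E(T^2 \mid T > c) = \mu^2 + \sigma^2 + \sigma(c + \mu)\lambda(z)$, with $z = (c - \mu)/\sigma$ and $\lambda(z) = \phi(z)/\{1 - \Phi(z)\}$.

```python
import numpy as np
from math import erf, exp, pi, sqrt

rng = np.random.default_rng(3)

def phi(z):   # standard normal density
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

def Phi(z):   # standard normal distribution function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Simulated right-censored data (assumed design, not the motorette data).
n = 300
v = rng.uniform(0.0, 1.0, n)
X = np.column_stack([np.ones(n), v])
beta_true, sigma_true = np.array([1.0, 2.0]), 0.5
t_star = X @ beta_true + sigma_true * rng.normal(size=n)
c = 2.5                               # common censoring threshold
delta = t_star <= c                   # True = uncensored
t = np.where(delta, t_star, c)

beta, sigma = np.zeros(2), 1.0
for _ in range(200):
    mu = X @ beta
    m1, m2 = t.copy(), t**2           # moments; exact for uncensored cases
    for i in np.where(~delta)[0]:     # E-step: impute censored T_i moments
        z = (c - mu[i]) / sigma
        lam = phi(z) / (1.0 - Phi(z))
        m1[i] = mu[i] + sigma * lam
        m2[i] = mu[i]**2 + sigma**2 + sigma * (c + mu[i]) * lam
    # M-step: least squares on imputed first moments, then the variance.
    beta = np.linalg.solve(X.T @ X, X.T @ m1)
    mu = X @ beta
    sigma = np.sqrt(np.mean(m2 - 2.0 * m1 * mu + mu**2))

print(beta, sigma)   # should approach beta_true and sigma_true
```

With the converged $\hat\theta$ in hand, the one-step estimates $\hat\theta^1_{[i]}$ follow from (4) by subtracting case $i$'s conditional score from the full-data quantities, as in § 3·4.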
4·2. Analysis of the rat data

The rat dataset has been studied by Ochi & Prentice (1984) and McCulloch (1994). In this experiment, 16 pregnant rats received a control diet and another 16 rats received a chemically treated diet. The litter size for each rat was recorded after 4 and 21 days. Let $i$ index treatment, $i = 1, 2$, $j$ index litter, $j = 1, \ldots, 16$, and $k$ index the newborn rats within a litter, $k = 1, \ldots, m_{ij}$, where $m_{ij}$ represents the size of the litter after 4 days. Let $x_{ij}$ denote the size of the litter after 21 days. It is assumed that
$$x_{ij} \mid a_{ij} \sim \mathrm{Bi}(m_{ij}, p_{ij}),$$
where $p_{ij} = \Phi(\beta_i + a_{ij})$, $\beta_i$ represents the treatment mean, the random litter effects $a_{ij}$ are independently and identically distributed as $N(0, \sigma_i^2)$ and $\Phi$ denotes the standard normal distribution function. In this example, we delete one observation $(x_{ij}, m_{ij})$ at a time.

To apply our methodology, one possibility is to treat the $a_{ij}$ as missing data. Another possibility is to follow McCulloch (1994) and assume a latent survival model of the form
$$y_{ijk} = \beta_i + a_{ij} + e_{ijk},$$
Fig. 1. The motorette data. Index plots of case-deletion measures $\hat\beta^1_{0[i]}$, $\hat\beta^1_{1[i]}$, $\mathrm{QD}^1_i$ and $D^1_i(I_{\mathrm{com}})$.
where the residual errors $e_{ijk}$ are independently and identically distributed as $N(0, 1)$. Instead of observing $y_{ijk}$, we only observe the binary variable that indicates whether or not $y_{ijk}$ exceeds 0, and $x_{ij}$ is the sum of these binary variables for the $i$th treatment and the $j$th litter. Without loss of generality, we assume that $y_{ijk} > 0$ for $k \leq x_{ij}$ and $y_{ijk} \leq 0$ for $k > x_{ij}$. If we treat $\{y_{ijk}, a_{ij}\}$ as missing data, the complete-data loglikelihood function is given by
$$L_c(\theta \mid Y_c) = -8 \sum_i \log\sigma_i^2 - \sum_{i,j,k} \frac{(y_{ijk} - \beta_i - a_{ij})^2}{2} - \sum_{i,j} \frac{a_{ij}^2}{2\sigma_i^2}.$$
Let $\theta = (\beta_1, \beta_2, \sigma_1^2, \sigma_2^2)^{\mathrm{T}}$. Then it can be shown that, for $i = 1, 2$,
$$\frac{\partial L_c(\theta \mid Y_c)}{\partial\beta_i} = \sum_{j,k} (y_{ijk} - \beta_i - a_{ij}), \qquad \frac{\partial L_c(\theta \mid Y_c)}{\partial\sigma_i^2} = -\sum_j \left( \frac{1}{2\sigma_i^2} - \frac{a_{ij}^2}{2\sigma_i^4} \right),$$
$$\ddot{L}_c(\theta \mid Y_c) = \mathrm{diag}\left( -\sum_j m_{1j}, \; -\sum_j m_{2j}, \; -\sum_j \left\{\frac{a_{1j}^2}{\sigma_1^6} - \frac{1}{2\sigma_1^4}\right\}, \; -\sum_j \left\{\frac{a_{2j}^2}{\sigma_2^6} - \frac{1}{2\sigma_2^4}\right\} \right).$$
In this case, the conditional distribution $p(Y_m \mid Y_o, \theta)$ is not a standard distribution. To obtain the required conditional expectations, we resort to the Gibbs sampler defined by the following iteration.

Step A. Given $a_{ij}^{(l)}$, we generate $\{y_{ijk}^{(l+1)}, k = 1, \ldots, m_{ij}\}$ from the truncated normal distribution $N(\beta_i + a_{ij}^{(l)}, 1)\, I(y_{ijk} > 0)$ for $k \leq x_{ij}$ and $N(\beta_i + a_{ij}^{(l)}, 1)\, I(y_{ijk} \leq 0)$ for $k > x_{ij}$.
Step B. Given $\{y_{ijk}^{(l+1)}, k = 1, \ldots, m_{ij}\}$, $a_{ij}^{(l+1)}$ is generated from the normal distribution with mean
$$(\sigma_i^{-2} + m_{ij})^{-1} \sum_{k=1}^{m_{ij}} (y_{ijk}^{(l+1)} - \beta_i)$$
and variance $(\sigma_i^{-2} + m_{ij})^{-1}$.

We apply the procedure proposed in § 3·4 as follows. In Step 1, a stochastic approximation algorithm (Delyon et al., 1999) was used to obtain the maximum likelihood estimate of $\theta$, namely
$$\hat\theta = (\hat\beta_1, \hat\beta_2, \hat\sigma_1^2, \hat\sigma_2^2)^{\mathrm{T}} = (1{\cdot}311, 0{\cdot}948, 0{\cdot}067, 1{\cdot}059)^{\mathrm{T}}.$$
We also obtain estimates of $\ddot{Q}(\hat\theta \mid \hat\theta)$ and $\dot{Q}_{(i,j)}(\hat\theta \mid \hat\theta)$. In Step 2, based on the estimates in Step 1, we calculate all the $\hat\theta^1_{(i,j)}$'s. With the above results, we obtain $D^1_{(i,j)}(I_{\mathrm{com}})$ in Step 3. To calculate $\mathrm{QD}^1_{(i,j)}$, 10 000 random observations from the Gibbs sampler are used. Figure 2 presents the case-deletion measures $\mathrm{QD}^1_{(i,j)}$ and $D^1_{(i,j)}(I_{\mathrm{com}})$ plotted against the indices $(i - 1) \times 16 + j$. The measures associated with $(x_{2,16}, m_{2,16})$ are larger than 0·3; it can be seen from Fig. 2 that this is the only outstandingly influential observation.
Fig. 2. The rat data. Index plots of case-deletion measures $\mathrm{QD}^1_{(i,j)}$ and $D^1_{(i,j)}(I_{\mathrm{com}})$.
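Steps A and B of the Gibbs sampler can be sketched for a single litter as follows. This is an illustrative, self-contained implementation: the litter size $m$, observed count $x$, treatment mean and variance components below are assumed values chosen for demonstration, not the paper's fitted quantities, and simple rejection sampling stands in for whatever truncated-normal sampler the authors used.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed illustrative values: one litter of m = 10 newborns with x = 8
# survivors, treatment mean beta and litter-effect variance sigma2.
beta, sigma2 = 1.3, 0.5
m, x = 10, 8

def draw_y(mean, positive):
    """Rejection-sample y ~ N(mean, 1) truncated to y > 0 or y <= 0."""
    while True:
        y = rng.normal(mean, 1.0)
        if (y > 0) == positive:
            return y

a = 0.0                         # initial litter effect a_ij
a_draws = []
for _ in range(2000):
    # Step A: latent survival variables given the current litter effect;
    # the first x components are positive, the rest nonpositive.
    y = np.array([draw_y(beta + a, k < x) for k in range(m)])
    # Step B: litter effect given the latent variables, N(mean, var) with
    # var = (sigma2^{-1} + m)^{-1} and mean = var * sum_k (y_k - beta).
    var = 1.0 / (1.0 / sigma2 + m)
    a = rng.normal(var * np.sum(y - beta), np.sqrt(var))
    a_draws.append(a)

print(np.mean(a_draws[500:]))   # Monte Carlo posterior mean of the litter effect
```

The draws $\{y^{(l)}, a^{(l)}\}$ from such chains, one per litter, supply the Monte Carlo averages used in Steps 2 and 3 of § 3·4.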
ACKNOWLEDGEMENT

The authors are indebted to the editor and the referee for their constructive comments. The research of H. Zhu and J. Zhou is supported by the Pacific Institute of Mathematical Sciences and the Natural Science and Engineering Research Council of Canada. The research of S.-Y. Lee is supported by a grant from the Research Grant Council of the Hong Kong Special Administration Region. The research of B.-C. Wei is supported by a grant from the National Natural Science Foundation of China.

APPENDIX

Proofs

Proof of Theorem 1. A first-order expansion of $\dot{Q}_{[i]}(\theta \mid \hat\theta)$ about $\hat\theta$ yields
$$\hat\theta_{[i]} - \hat\theta = \{-\ddot{Q}_{[i]}(\hat\theta \mid \hat\theta)\}^{-1} \dot{Q}_{[i]}(\hat\theta \mid \hat\theta) + O_p(n^{-2}) = O_p(n^{-1}),$$
$$\{-\ddot{Q}_{[i]}(\hat\theta \mid \hat\theta)\}^{-1} = \{-\ddot{Q}(\hat\theta \mid \hat\theta)\}^{-1}\{1 + O_p(n^{-1})\}^{-1} = \{-\ddot{Q}(\hat\theta \mid \hat\theta)\}^{-1} + O_p(n^{-2}).$$
This establishes equation (5). □
Proof of Theorem 2. Under some mild conditions, $I_{\mathrm{obs}}^{1/2}(\hat\theta - \theta_0)$ is asymptotically normal with mean zero and covariance matrix equal to the identity matrix, where $\theta_0$ is the true parameter (Cox & Hinkley, 1974, p. 296). It follows from a Taylor expansion that
$$2\{Q(\hat\theta \mid \hat\theta) - Q(\theta_0 \mid \hat\theta)\} = (\hat\theta - \theta_0)^{\mathrm{T}} I_{\mathrm{com}} (\hat\theta - \theta_0) + o_p(\|\hat\theta - \theta_0\|^2)$$
$$= (\hat\theta - \theta_0)^{\mathrm{T}} I_{\mathrm{obs}}^{1/2} \{I_p + I_{\mathrm{obs}}^{-1/2} I_{\mathrm{mis}} I_{\mathrm{obs}}^{-1/2}\} I_{\mathrm{obs}}^{1/2} (\hat\theta - \theta_0) + o_p(\|\hat\theta - \theta_0\|^2).$$
Thus, (7) is true. The other results in this theorem follow from the multivariate Slutzky theorem (Lehmann, 1999, p. 283). □

Proof of Theorem 3. It follows from (5) and (8) that $D_i(I_{\mathrm{com}}) = \dot{Q}_{[i]}(\hat\theta \mid \hat\theta)^{\mathrm{T}} I_{\mathrm{com}}^{-1} \dot{Q}_{[i]}(\hat\theta \mid \hat\theta) + O_p(n^{-2})$, and
$$\sum_{i=1}^n D_i(I_{\mathrm{com}}) = \mathrm{tr}\left( I_{\mathrm{com}}^{-1} \sum_{i=1}^n \dot{Q}_{[i]}(\hat\theta \mid \hat\theta) \dot{Q}_{[i]}(\hat\theta \mid \hat\theta)^{\mathrm{T}} \right) + O_p(n^{-1}).$$
The desired results follow from the assumptions. □

Proof of Theorem 4. It follows from (6) that $D_i(I_{\mathrm{com}}) = \dot{L}_o(\hat\theta \mid y_{i,\mathrm{obs}})^{\mathrm{T}} I_{\mathrm{com}}^{-1} \dot{L}_o(\hat\theta \mid y_{i,\mathrm{obs}}) + O_p(n^{-1})$, and
$$\sum_{i=1}^n D_i(I_{\mathrm{com}}) = \mathrm{tr}\left( I_{\mathrm{com}}^{-1} \sum_{i=1}^n \dot{L}_o(\hat\theta \mid y_{i,\mathrm{obs}}) \dot{L}_o(\hat\theta \mid y_{i,\mathrm{obs}})^{\mathrm{T}} \right) + O_p(n^{-1}).$$
Thus, we have
$$\sum_{i=1}^n D_i(I_{\mathrm{com}}) = \mathrm{tr}(I_{\mathrm{com}}^{-1} I_{\mathrm{obs}}) + o_p(1) = \mathrm{tr}\{(I_p + I_{\mathrm{obs}}^{-1/2} I_{\mathrm{mis}} I_{\mathrm{obs}}^{-1/2})^{-1}\} + o_p(1).$$
Under the given assumptions, we obtain the desired results (13) and (14). □
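The limiting variable $X = \sum_{j=1}^r (1 + \lambda_j) Z_j^2$ of Theorems 2 and 3 is easy to simulate, which also gives an empirical route to the calibration point $X(\alpha)$. The eigenvalues below are arbitrary illustrative choices, not values from either example; the mean of $X$ is $\sum_j (1 + \lambda_j)$.

```python
import numpy as np

rng = np.random.default_rng(4)
lam = np.array([0.5, 0.2, 0.0])            # assumed eigenvalues of B (r = 3)
Z = rng.normal(size=(200_000, lam.size))   # Z_j ~ N(0, 1), i.i.d.
X = ((1.0 + lam) * Z**2).sum(axis=1)       # draws of X = sum_j (1+lambda_j) Z_j^2

print(X.mean())              # approximately (1+0.5) + (1+0.2) + (1+0.0) = 3.7
print(np.quantile(X, 0.95))  # empirical upper 5% point X(0.05)
```

When $B = 0$ this reduces to the ordinary $\chi^2(\,\cdot\,; r)$ calibration of § 2.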
REFERENCES

ANDERSEN, E. B. (1992). Diagnostics in categorical data analysis. J. R. Statist. Soc. B 54, 781–91.
ATKINSON, A. C. (1982). Regression diagnostics, transformations and constructed variables (with Discussion). J. R. Statist. Soc. B 44, 1–36.
ATKINSON, A. C. (1988). Plots, Transformations and Regression. Oxford: Oxford University Press.
BANERJEE, M. & FREES, E. W. (1997). Influence diagnostics for linear longitudinal models. J. Am. Statist. Assoc. 92, 999–1005.
BELSLEY, D. A., KUH, E. & WELSCH, R. E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: John Wiley.
CHATTERJEE, S. & HADI, A. S. (1988). Sensitivity Analysis of Linear Regression. New York: John Wiley.
COOK, R. D. (1977). Detection of influential observations in linear regression. Technometrics 19, 15–8.
COOK, R. D. & WEISBERG, S. (1982). Residuals and Influence in Regression. New York: Chapman and Hall.
COX, D. R. & HINKLEY, D. V. (1974). Theoretical Statistics. London: Chapman and Hall.
DAVIDIAN, M. & GILTINAN, D. M. (1995). Nonlinear Models for Repeated Measurement Data. London: Chapman and Hall.
DAVISON, A. C. & TSAI, C. L. (1992). Regression model diagnostics. Int. Statist. Rev. 60, 337–55.
DELYON, B., LAVIELLE, M. & MOULINES, E. (1999). Convergence of a stochastic approximation version of the EM algorithm. Ann. Statist. 27, 94–128.
DEMPSTER, A. P., LAIRD, N. M. & RUBIN, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with Discussion). J. R. Statist. Soc. B 39, 1–38.
KASS, R. E., TIERNEY, L. & KADANE, J. B. (1989). Approximate methods for assessing influence and sensitivity in Bayesian analysis. Biometrika 76, 663–74.
LEHMANN, E. L. (1999). Elements of Large-sample Theory. New York: Springer-Verlag.
MCCULLOCH, C. E. (1994). Maximum likelihood variance components estimation for binary data. J. Am. Statist. Assoc. 89, 330–5.
MCCULLOCH, C. E. (1997). Maximum likelihood algorithms for generalized linear mixed models. J. Am. Statist. Assoc. 92, 162–70.
MCLACHLAN, G. J. & KRISHNAN, T. (1997). The EM Algorithm and Extensions. New York: Wiley.
MENG, X.-L. & VAN DYK, D. (1997). The EM algorithm—an old folk-song sung to a fast new tune (with Discussion). J. R. Statist. Soc. B 59, 511–67.
OCHI, Y. & PRENTICE, R. L. (1984). Likelihood inference in a correlated probit regression model. Biometrika 71, 531–43.
ROUSSEEUW, P. J. & LEROY, A. M. (1987). Robust Regression and Outlier Detection. London: Chapman and Hall.
SCHMEE, J. & HAHN, G. J. (1979). A simple method for regression analysis with censored data. Technometrics 21, 417–32.
SHIH, W. J. & WEISBERG, S. (1986). Assessing influence in multiple linear regression with incomplete data. Technometrics 28, 231–9.
TANNER, M. A. (1993). Tools for Statistical Inference, 2nd ed. New York: Springer-Verlag.
WEI, B. C. (1998). Exponential Family Nonlinear Models. Singapore: Springer-Verlag.
WILKS, S. S. (1963). Multivariate statistical outliers. Sankhyā A 25, 407–26.
[Received June 2000. Revised November 2000]