68 On Measures of Divergence and the Divergence Information Selection Criterion

Kyriacos Mattheou and Alex Karagrigoriou Department of Mathematics and Statistics, University of Cyprus, Cyprus

Abstract: The aim of this work is to develop a new model selection criterion using a general discrepancy-based technique, by constructing an asymptotically unbiased estimator of the overall average discrepancy between the true and the fitted models. Furthermore, a lower bound for the mean squared error of prediction is established.

Keywords and phrases: DIC, power divergence, MSE of prediction

68.1 Introduction

Divergence measures are used as indices of similarity or dissimilarity between populations. They are also used either to measure mutual information concerning two variables or to construct model selection criteria. A model selection criterion can be constructed as an approximately unbiased estimator of an expected "overall discrepancy" (or divergence), a nonnegative quantity which measures the "distance" between the true model and a fitted approximating model. A well-known discrepancy is the Kullback-Leibler discrepancy, which was used by Akaike (1973) to develop the Akaike Information Criterion (AIC). Measures of discrepancy or divergence between two probability distributions have a long history. A unified analysis was recently provided by Cressie and Read (1984), who introduced, for both the continuous and the discrete case, the so-called power divergence family of statistics, which depends on a parameter λ and is used for multinomial goodness-of-fit tests. The additive and nonadditive directed divergences of order α were introduced in the 60's and the 70's (Renyi, 1961; Rathie and Kannappan, 1972). It should be noted that for λ tending to 0 and for α tending to 1 the above measures become the Kullback-Leibler measure. Another family of measures is the Φ-divergence, known also as


Csiszar's measure of information (Csiszar, 1963), the discrete form of which is given by
$$I^{c}(P;Q) = \sum_{i=1}^{k} q_i\, \Phi\!\left(\frac{p_i}{q_i}\right),$$
where Φ is a real-valued convex function on [0, ∞) and P = (p_1, p_2, ..., p_k) and Q = (q_1, q_2, ..., q_k) are two discrete finite probability distributions (a small numerical sketch is given below). For different choices of Φ the measure takes different forms. The Kullback-Leibler measure is obtained for Φ(u) = u log(u), while the additive directed divergence is obtained for Φ(u) = sgn(α − 1)u^α together with the transformation (α − 1)^{-1} log I^c. For a comprehensive discussion on measures of divergence the reader is referred to Pardo (2006). A new discrepancy measure was recently introduced by Basu et al. (1998). In this paper, we develop a new model selection criterion which is an approximately unbiased estimator of the expected overall power divergence that corresponds to Basu's power divergence measure. Furthermore, we obtain a lower bound for the mean squared error (MSE) of prediction.
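To make the Φ-divergence concrete, here is a minimal numerical sketch (Python with NumPy; the function name is illustrative and not taken from any particular package) of the discrete Csiszar measure, together with a check that Φ(u) = u log u recovers the Kullback-Leibler measure.

    import numpy as np

    def csiszar_divergence(p, q, phi):
        # Discrete Csiszar measure: I^c(P;Q) = sum_i q_i * Phi(p_i / q_i)
        p, q = np.asarray(p, float), np.asarray(q, float)
        return float(np.sum(q * phi(p / q)))

    p = np.array([0.2, 0.5, 0.3])
    q = np.array([0.25, 0.25, 0.5])

    # Phi(u) = u*log(u) gives the Kullback-Leibler measure sum_i p_i log(p_i/q_i)
    print(csiszar_divergence(p, q, lambda u: u * np.log(u)))
    print(np.sum(p * np.log(p / q)))  # same value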

68.2 Basu's Power Divergence Measure

One of the most recently proposed discrepancies is Basu's power divergence [Basu et al. (1998)], which is defined as:
$$d_a(g, f) = \int \left\{ f^{1+a}(z) - \left(1 + \frac{1}{a}\right) g(z)\, f^{a}(z) + \frac{1}{a}\, g^{1+a}(z) \right\} dz, \quad a > 0, \qquad (68.2.1)$$
where g is the true model, f the fitted approximating model, and a a positive constant. The discrete form of the measure is given by
$$\sum_{i=1}^{k} \left\{ p_i^{1+a} - \left(1 + \frac{1}{a}\right) p_i^{a}\, q_i + \frac{1}{a}\, q_i^{1+a} \right\},$$
where p_i and q_i, i = 1, 2, ..., k, are as in Section 68.1. Observe that the above measure takes the form $\sum_{i=1}^{k} q_i^{1+a}\, \Phi(p_i/q_i)$ where Φ(u) = u^{1+a} − (1 + a^{-1})u^a + a^{-1}.

Lemma 68.2.1 The limit of (68.2.1) when a → 0 is the Kullback-Leibler divergence. Furthermore, the discrete form of the measure tends to the Kullback-Leibler measure for a → 0 and for Φ(u) = u log u.

It is easy to see that Basu's measure satisfies the basic properties of divergence measures, namely nonnegativity and continuity. In particular, the value of the measure is nonnegative, while small changes in the distributions result in small changes in the measure. Finally, the value of the discrete measure is not affected by the simultaneous and equivalent reordering of the discrete masses, which confirms the symmetry property of Basu's measure.
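As a quick numerical check of the discrete form and of the limiting behaviour in Lemma 68.2.1, the following sketch (Python; names are illustrative only) evaluates the measure for two probability vectors: it vanishes when the distributions coincide, is nonnegative, and approaches a Kullback-Leibler quantity as a → 0, with q_i playing the role of the true distribution g of (68.2.1).

    import numpy as np

    def basu_discrete(p, q, a):
        # Discrete form of Basu's power divergence, a > 0
        p, q = np.asarray(p, float), np.asarray(q, float)
        return float(np.sum(p**(1 + a) - (1 + 1/a) * p**a * q + q**(1 + a) / a))

    p = np.array([0.2, 0.5, 0.3])    # fitted masses
    q = np.array([0.25, 0.25, 0.5])  # true masses

    print(basu_discrete(q, q, 0.5))        # 0.0: identical distributions
    for a in (1.0, 0.1, 0.01, 0.001):
        print(a, basu_discrete(p, q, a))   # nonnegative; approaches the KL value below
    print(np.sum(q * np.log(q / p)))       # Kullback-Leibler limit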


Consider a random sample X_1, ..., X_n from the distribution g and a candidate model f_t from a parametric family of models {f_t}, indexed by an unknown parameter t ∈ Θ. To construct the new criterion for goodness of fit we shall consider the quantity:
$$W_t = \int \left\{ f_t^{1+a}(z) - \left(1 + \frac{1}{a}\right) g(z)\, f_t^{a}(z) \right\} dz, \quad a > 0, \qquad (68.2.2)$$
which is the same as (68.2.1) without the last term, which remains constant irrespective of the model f_t used. Observe that (68.2.2) can also be written as:
$$W_t = \int f_t^{1+a}(z)\, dz - \left(1 + \frac{1}{a}\right) E_g\!\left(f_t^{a}(z)\right), \quad a > 0. \qquad (68.2.3)$$
Our target theoretical quantity, to be estimated by the new criterion, is
$$E\!\left(W_t \,\big|\, t = \hat{\theta}\right), \qquad (68.2.4)$$
which can be viewed as the average distance between g and f_t up to a constant and is known as the expected overall discrepancy between g and f_t. In (68.2.4), θ̂ is the estimator of t that minimizes an estimate of d_a(g, f_t) with respect to t. Note that the estimator of t is obtained by minimizing (68.2.3) when the expectation is replaced by its sample analogue, namely n^{-1} Σ_{i=1}^{n} f_t^a(X_i). In the theorem below, Basu et al. (1998) provide the asymptotic properties of θ̂.

Theorem 68.2.1 (Basu et al. (1998)) Under regularity conditions, there exists an estimator θ̂ which is consistent, and n^{1/2}(θ̂ − θ) is asymptotically normal with mean zero and variance J(θ)^{-2} K(θ), where, under the assumption that the true distribution g belongs to the parametric family {f_t}, θ being the true value of the parameter and $\xi = \int u_\theta(z)\, f_\theta^{1+a}(z)\, dz$,
$$J(\theta) = \int [u_\theta(z)]^2 f_\theta^{1+a}(z)\, dz \quad \text{and} \quad K(\theta) = \int [u_\theta(z)]^2 f_\theta^{1+2a}(z)\, dz - \xi^2. \qquad (68.2.5)$$

The Lemma below provides the derivatives of (68.2.3) in the case where g belongs to the family {f_t} (see Mattheou and Karagrigoriou (2006a)).

Lemma 68.2.2 For a > 0 and if the true distribution g belongs to the parametric family {f_t}, the derivatives of (68.2.3) are:

(a) $$\frac{\partial W_t}{\partial t} = (a+1)\left[\int u_t(z)\, f_t^{1+a}(z)\, dz - E_g\!\left(u_t(z)\, f_t^{a}(z)\right)\right] = 0,$$

(b) $$\frac{\partial^2 W_t}{\partial t^2} = (a+1)\left\{ (a+1)\int [u_t(z)]^2 f_t^{1+a}(z)\, dz - \int i_t\, f_t^{1+a}\, dz + E_g\!\left(i_t(z)\, f_t^{a}(z)\right) - E_g\!\left(a\,[u_t(z)]^2 f_t^{a}(z)\right) \right\} = (a+1)J,$$

where $u_t = \frac{\partial}{\partial t}(\log f_t)$, $i_t = -\frac{\partial^2}{\partial t^2}(\log f_t)$ and $J = \int [u_t(z)]^2 f_t^{1+a}(z)\, dz$.
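The minimization that defines θ̂ (replacing the expectation in (68.2.3) by its sample analogue) is easy to carry out numerically. Below is a sketch for a normal location family N(t, σ²) with known σ (Python with NumPy/SciPy; the function names and the contamination setup are illustrative assumptions, not part of the original development). For this family the integral ∫ f_t^{1+a}(z) dz equals (2πσ²)^{-a/2}(1 + a)^{-1/2} and does not depend on t.

    import numpy as np
    from scipy.optimize import minimize_scalar
    from scipy.stats import norm

    def empirical_objective(t, x, a, sigma=1.0):
        # Sample analogue of (68.2.3) for the N(t, sigma^2) family:
        # integral of f_t^{1+a}  minus  (1 + 1/a) * mean of f_t^a(X_i)
        integral_term = (2 * np.pi * sigma**2) ** (-a / 2) / np.sqrt(1 + a)
        sample_term = np.mean(norm.pdf(x, loc=t, scale=sigma) ** a)
        return integral_term - (1 + 1 / a) * sample_term

    rng = np.random.default_rng(0)
    x = rng.normal(loc=2.0, scale=1.0, size=200)
    x[:20] = 10.0                                  # 10% gross outliers

    a = 0.25
    fit = minimize_scalar(lambda t: empirical_objective(t, x, a),
                          bounds=(-5.0, 15.0), method="bounded")
    print("theta_hat:", fit.x)       # close to 2, hardly affected by the outliers
    print("sample mean:", x.mean())  # the MLE-type estimate, pulled toward the outliers

The robustness seen here is the behaviour discussed in Section 68.5: the outliers contribute almost nothing to the sample term because f_t^a is negligible at them.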


Theorem 68.2.2 Under the assumptions of Lemma 68.2.2 and for a p-dimensional parameter t, the expected overall discrepancy at t = θ̂ is given by
$$E\!\left(W_t \,\big|\, t = \hat{\theta}\right) = W_\theta + \frac{(a+1)}{2}\, E\!\left[\left(\hat{\theta} - \theta\right)' J \left(\hat{\theta} - \theta\right)\right]. \qquad (68.2.6)$$
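For intuition, (68.2.6) follows from a second-order Taylor expansion of W_t around θ together with Lemma 68.2.2 (a sketch under the stated regularity assumptions):
$$W_{\hat{\theta}} \approx W_\theta + \left(\hat{\theta} - \theta\right)'\left[\frac{\partial W_t}{\partial t}\right]_\theta + \frac{1}{2}\left(\hat{\theta} - \theta\right)'\left[\frac{\partial^2 W_t}{\partial t^2}\right]_\theta\left(\hat{\theta} - \theta\right) = W_\theta + \frac{a+1}{2}\left(\hat{\theta} - \theta\right)' J \left(\hat{\theta} - \theta\right),$$
since the first derivative vanishes and the second equals (a + 1)J by Lemma 68.2.2; taking expectations yields (68.2.6).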

68.3 The divergence information criterion

In this section we introduce the new criterion and prove that it is an approximately unbiased estimator of (68.2.4). Since the true distribution g is unknown, we estimate (68.2.3) by means of the empirical distribution function, which gives:
$$Q_t = \int f_t^{1+a}(z)\, dz - \left(1 + \frac{1}{a}\right) \frac{1}{n}\sum_{i=1}^{n} f_t^{a}(X_i). \qquad (68.3.7)$$

Lemma 68.3.1 The derivatives of (68.3.7) are:

(a) $$\frac{\partial Q_t}{\partial t} = (a+1)\left[\int u_t(z)\, f_t^{1+a}(z)\, dz - \frac{1}{n}\sum_{i=1}^{n} u_t(X_i)\, f_t^{a}(X_i)\right], \quad a > 0,$$

(b) $$\frac{\partial^2 Q_t}{\partial t^2} = (a+1)\left\{ (a+1)\int [u_t(z)]^2 f_t^{1+a}(z)\, dz - \int i_t\, f_t^{1+a}(z)\, dz + \frac{1}{n}\sum_{i=1}^{n} i_t(X_i)\, f_t^{a}(X_i) - a\,\frac{1}{n}\sum_{i=1}^{n} [u_t(X_i)]^2 f_t^{a}(X_i) \right\},$$

where $u_t = \frac{\partial}{\partial t}(\log f_t)$ and $i_t = -(u_t)' = -\frac{\partial^2}{\partial t^2}(\log f_t)$.

It is easy to see that, by the weak law of large numbers, as n → ∞ we have:
$$\left[\frac{\partial Q_t}{\partial t}\right]_\theta \xrightarrow{\;P\;} \left[\frac{\partial W_t}{\partial t}\right]_\theta \quad \text{and} \quad \left[\frac{\partial^2 Q_t}{\partial t^2}\right]_\theta \xrightarrow{\;P\;} \left[\frac{\partial^2 W_t}{\partial t^2}\right]_\theta. \qquad (68.3.8)$$
The consistency of θ̂ and (68.3.8) can be used to evaluate the expectation of the empirical estimator evaluated at the true point θ.

Theorem 68.3.1 Under the assumptions of Lemma 68.2.2, the expectation of Q_t evaluated at θ is given by
$$EQ_\theta \equiv E\!\left(Q_t \,\big|\, t = \theta\right) = EQ_{\hat{\theta}} + \frac{a+1}{2}\, E\!\left[\left(\hat{\theta} - \theta\right)' J \left(\hat{\theta} - \theta\right)\right].$$

The asymptotically unbiased estimator of $E\!\left(W_t \,\big|\, t = \hat{\theta}\right)$ is provided in the theorem below (see Mattheou and Karagrigoriou (2006b)).


Theorem 68.3.2 An asymptotically unbiased estimator of the expected overall discrepancy evaluated at θ̂ is given by
$$DIC = Q_{\hat{\theta}} + (a+1)\, (2\pi)^{-a/2} \left(\frac{1+a}{1+2a}\right)^{1+\frac{p}{2}} p. \qquad (68.3.9)$$
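To illustrate how (68.3.9) might be used, here is a small sketch (Python; the function and the normal candidate family are illustrative assumptions, and the parameters are plugged in as simple moment estimates rather than the minimum divergence estimates of Section 68.2). For a fitted N(μ̂, σ̂²) model, the integral in (68.3.7) has the closed form (2πσ̂²)^{-a/2}(1 + a)^{-1/2}.

    import numpy as np
    from scipy.stats import norm

    def dic_normal(x, mu_hat, sigma_hat, p, a):
        # Q_theta_hat from (68.3.7) for a fitted N(mu_hat, sigma_hat^2) model
        integral_term = (2 * np.pi * sigma_hat**2) ** (-a / 2) / np.sqrt(1 + a)
        q_hat = integral_term - (1 + 1 / a) * np.mean(norm.pdf(x, mu_hat, sigma_hat) ** a)
        # penalty term of (68.3.9); p = number of estimated parameters
        penalty = (a + 1) * (2 * np.pi) ** (-a / 2) * ((1 + a) / (1 + 2 * a)) ** (1 + p / 2) * p
        return q_hat + penalty

    rng = np.random.default_rng(1)
    x = rng.normal(2.0, 1.0, size=200)
    a = 0.25
    # candidate 1: mean fixed at 0 (p = 1); candidate 2: both mean and scale estimated (p = 2)
    print(dic_normal(x, 0.0, x.std(), p=1, a=a))
    print(dic_normal(x, x.mean(), x.std(), p=2, a=a))  # smaller value -> preferred candidate

As with AIC, the candidate model minimizing the criterion is selected.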

68.4 Lower bound of the MSE of prediction

Let X_j be the design matrix of the model Y = X_j β + ε, where β = (β_0, β_1, β_2, ...)′, ε ∼ N(0, σ²I) and I is the infinite-dimensional identity matrix. Let V(j) = {β^{(j)} : β^{(j)} = (β_0, 0, ..., β_{j_1}, 0, ..., β_{j_{k_j}}, 0, ...)′} be the subspace that contains only the k_j + 1 parameters β_{j_i} involved in the model, and let β^{(j)} be the projection of β on V(j). The prediction ŷ is given by ŷ = X_j β̂, where the estimator of β^{(j)} obtained through a set of observations (X_{i j_1}, ..., X_{i j_{k_j}}, y_i), i = 1, 2, ..., n, is denoted by
$$\hat{\beta} = \left(\hat{\beta}_0, 0, ..., \hat{\beta}_{j_1}, 0, ..., \hat{\beta}_{j_{k_j}}, 0, ...\right)'.$$
The mean squared error (MSE) of prediction and the average MSE of prediction are defined, respectively, by
$$Q_n(j) = E\!\left[\left(\hat{y}_{n+1} - y_{n+1}\right)^2 \,\big|\, X\right] - n\sigma^2 \quad \text{and} \quad L_n(j) \equiv E\!\left(Q_n(j)\right).$$

Lemma 68.4.1 Under the notation and conditions of this section we have that
$$Q_n(j) = \left\|\hat{\beta} - \beta\right\|^2_{M_n(j)} \quad \text{and} \quad L_n(j) = E\left\|\hat{\beta} - \beta\right\|^2_{M_n(j)},$$
where $M_n(j) = X_j' X_j$ and $\|A\|^2_R = A' R A$.

The Lemma below provides a lower bound for the MSE of prediction. In particular, we show that Q_n(j) is asymptotically never below the quantity L_n(j*) = min_j L_n(j).

Lemma 68.4.2 Let $L_n(j^*) = \min_j L_n(j)$. Under certain regularity conditions, we have that for every δ > 0,
$$\lim_{n \to \infty} P\!\left[\frac{Q_n(j)}{L_n(j^*)} > 1 - \delta\right] = 1.$$
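For concreteness, a small numerical sketch of the quantity in Lemma 68.4.1 (Python; purely illustrative, with β̂ taken for simplicity to be the least-squares estimator on a finite design):

    import numpy as np

    rng = np.random.default_rng(2)
    n, k = 100, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # design matrix X_j
    beta = np.array([1.0, 0.5, -2.0])                               # coefficients in the model
    y = X @ beta + rng.normal(scale=1.0, size=n)

    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]  # estimator of beta
    M = X.T @ X                                      # M_n(j) = X_j' X_j

    diff = beta_hat - beta
    Q_n = diff @ M @ diff                            # ||beta_hat - beta||^2_{M_n(j)}
    print(Q_n)                                       # same as np.sum((X @ diff)**2)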

68.5 Discussion

Note that the family of candidate models is indexed by a single parameter a. The value of a dictates to what extent the estimating methods become more robust than the maximum likelihood methods.


One should be aware of the fact that the larger the value of a, the bigger the efficiency loss. As a result, one should be interested in small values of a ≥ 0, say between zero and one. The proposed DIC criterion could be used in applications where outliers or contaminated observations are involved. Prior knowledge of contamination may be useful in identifying an appropriate value of a. Preliminary simulations with a 10% contamination proportion show that DIC has a tendency to underestimate the true model, in contrast with AIC, which tends to overestimate it.

References

1. Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle, Proc. of the 2nd Intern. Symposium on Information Theory (Petrov, B. N. and Csaki, F., eds.), Akademiai Kiado, Budapest.

2. Basu, A., Harris, I. R., Hjort, N. L. and Jones, M. C. (1998). Robust and efficient estimation by minimising a density power divergence, Biometrika, 85, 549-559.

3. Cressie, N. and Read, T. R. C. (1984). Multinomial goodness-of-fit tests, J. R. Statist. Soc. B, 46, 440-464.

4. Csiszar, I. (1963). Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten, Magyar Tud. Akad. Mat. Kutato Int. Kozl., 8, 85-108.

5. Mattheou, K. and Karagrigoriou, A. (2006a). A discrepancy based model selection criterion, Proc. 18th Conf. Greek Statist. Society (to appear).

6. Mattheou, K. and Karagrigoriou, A. (2006b). On asymptotic properties of DIC, TR 09/2006, Dept. of Math. and Stat., Univ. of Cyprus.

7. Pardo, L. (2006). Statistical Inference Based on Divergence Measures, Chapman and Hall/CRC.

8. Rathie, P. N. and Kannappan, P. (1972). A directed-divergence function of type β, Information and Control, 20, 38-45.

9. Renyi, A. (1961). On measures of entropy and information, Proc. 4th Berkeley Symp. on Math. Statist. Prob., 1, 547-561, Univ. of California Press.