A fully parametric approach to minimum power-divergence estimation

D. Ferrari (1) and D. La Vecchia (2)

(1) Dipartimento di Economia Politica, Università di Modena e Reggio Emilia, via Berengario 51, 41100 Modena, Italy
(2) Dipartimento di Economia (Metodi Quantitativi), Università di Lugano, 6904 Lugano, Switzerland
Keywords: Asymptotic efficiency; Change-of-variance function; M-estimation; Maximum likelihood; Minimum divergence estimation; Robustness.
Abstract

Let FΘ = {F_t, t ∈ Θ ⊆ R^p}, p ≥ 1, be an identifiable family of parametric distributions with densities f_t with respect to Lebesgue measure, and let G be a distribution having density g with respect to Lebesgue measure. We assume f_t and g to have common support X ⊆ R^k, k ≥ 1. Here G represents the "true" distribution generating the data, which is regarded as close to, but not exactly equal to, some member of FΘ. Although the focus of our presentation is on continuous distributions, our arguments apply to the discrete case as well.

One way to find parameter values is to minimize a data-based divergence measure between the candidate model F_t and an empirical version of G. By far the most popular minimum-divergence method is maximum likelihood estimation, which minimizes the empirical version of the Kullback-Leibler divergence (Kullback (1951), Akaike (1973)). Although the maximum likelihood method is optimal when G ∈ FΘ, even mild deviations from the assumed model can seriously affect its precision. On the other hand, traditional robust M-estimators, which can tolerate deviations from model assumptions, do not achieve first-order efficiency for most parametric families (Hampel et al. (1986)). Beran (1977) reconciled robustness and efficiency by minimizing the Hellinger distance: his minimum Hellinger distance estimator can withstand a large fraction of bad data while maintaining full efficiency at the model. Lindsay (1994) and Basu and Lindsay (1994) extended Beran's approach by considering the power divergences, a larger class of divergences which includes the Hellinger distance as a special case. Related divergences are considered by Basu et al. (1997).

The family of power divergences is defined by

$$
D_q(f_t \,\|\, dG) \;=\; -\frac{1}{q}\int_{X} L_q\!\left(\frac{f_t(x)}{g(x)}\right) dG(x), \qquad (1)
$$

where L_q(u) := (u^{1−q} − 1)/(1 − q) and q ∈ (−∞, ∞) \ {1}. This quantity was first considered by Cressie and Read (1984) in the context of goodness-of-fit testing. Expression (1) includes other notable divergences as special cases: the Kullback-Leibler divergence (q → 1), twice the Hellinger distance (q = 1/2), Neyman's chi-square (q = −1) and Pearson's chi-square (q = 2).

Given the data, the current approaches to minimizing (1) are feasible only in low-dimensional problems, because of the need to replace g by some smooth kernel estimate. This has two serious drawbacks: (i) some degree of nonparametric analysis for the choice of the kernel bandwidth is unavoidable, with nontrivial complications when dim(X) is large; (ii) the accuracy of the parameter estimators relies on the convergence of the kernel smoother, which suffers from the curse of dimensionality.
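To make the roles of q and L_q concrete, the following Python sketch numerically evaluates (1) for a simple pair of densities and checks the special cases listed above. The normal densities, the quadrature routine and the q values are illustrative choices only, not part of the original formulation.

```python
# Illustrative numerical check of the power divergence in (1): the candidate
# density f_t, the "true" density g, and the q values below are hypothetical
# choices made only for demonstration.
import numpy as np
from scipy import stats
from scipy.integrate import quad

def L_q(u, q):
    """L_q(u) = (u^(1-q) - 1) / (1 - q); the natural logarithm is the q -> 1 limit."""
    if np.isclose(q, 1.0):
        return np.log(u)
    return (u ** (1.0 - q) - 1.0) / (1.0 - q)

def power_divergence(f, g, q):
    """D_q(f_t || dG) = -(1/q) * int_X L_q(f_t(x)/g(x)) dG(x), evaluated by quadrature."""
    integrand = lambda x: L_q(f(x) / g(x), q) * g(x)
    value, _ = quad(integrand, -np.inf, np.inf)
    return -value / q

# Candidate model f_t = N(0.3, 1) versus "true" density g = N(0, 1).
f = stats.norm(loc=0.3, scale=1.0).pdf
g = stats.norm(loc=0.0, scale=1.0).pdf

for q in (0.5, 1.0, -1.0, 2.0):
    # q = 1/2: twice Hellinger distance; q -> 1: Kullback-Leibler divergence;
    # q = -1: Neyman's chi-square; q = 2: Pearson's chi-square.
    print(f"q = {q:4.1f}:  D_q = {power_divergence(f, g, q):.6f}")
```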
In the present talk, we consider a procedure for parameter estimation based on the minimization of (1) which is fully parametric and, consequently, avoids any issue related to kernel smoothing. The method can therefore be applied also when dim(X) and dim(Θ) are moderate or large. Our approach has an information-theoretical flavor, as it relies on the generalized entropy (or q-entropy) proposed by Havrda and Charvát (1967) and later employed by Tsallis (1988) in the context of statistical mechanics, a quantity closely related to (1).

The resulting estimator of the parameters is indexed by a single constant q, which allows for tuning the trade-off between robustness and efficiency. If q = 1, our estimator is the maximum likelihood estimator; if q = 1/2, we minimize a fully parametric version of the Hellinger distance. Choices 1/2 < q < 1 give remarkably robust estimators with negligible efficiency losses compared to maximum likelihood. The method can also be understood as the Fisher-consistent version of the estimating procedure proposed by Ferrari and Yang (2009). In the special case of location models, our estimator is related to the minimum density power divergence estimator proposed by Basu et al. (1998).

Convergence results for the parameter estimator are provided, and infinitesimal robustness is addressed using two well-established measures: the influence function and the change-of-variance function. While the former has been widely employed in the literature to approximate the asymptotic bias under contamination, to our knowledge, expressions for the change-of-variance function have been obtained only for one-parameter location or scale problems by Hampel et al. (1986) and Genton and Rousseeuw (1995). We derive a general expression for multi-parameter M-estimators and use it to compute an approximate upper bound for the mean squared error under contamination. The worst-case mean squared error is employed for an analytic selection of q (min-max approach), yielding estimators that successfully balance robustness and efficiency and improve upon both traditional M-estimators and maximum likelihood, regardless of whether or not the data are close to the assumed model. Our numerical studies clearly illustrate this.
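As a rough numerical illustration of how q governs the robustness/efficiency trade-off, the sketch below fits a normal location-scale model by maximizing the Lq-likelihood of Ferrari and Yang (2009), used here only as a simple stand-in; it is not the Fisher-consistent estimator of the talk. The contaminated sample, the optimizer and the grid of q values are hypothetical choices for demonstration.

```python
# Toy illustration of the role of q: maximize sum_i L_q(f_t(x_i)) for a normal
# location-scale model on a contaminated sample (a stand-in, not the talk's
# Fisher-consistent estimator). All settings below are hypothetical.
import numpy as np
from scipy import stats
from scipy.optimize import minimize

def L_q(u, q):
    if np.isclose(q, 1.0):
        return np.log(u)
    return (u ** (1.0 - q) - 1.0) / (1.0 - q)

def mlq_fit(x, q):
    """Maximize sum_i L_q(phi(x_i; mu, sigma)) over (mu, log sigma)."""
    def neg_objective(theta):
        mu, log_sigma = theta
        dens = stats.norm.pdf(x, loc=mu, scale=np.exp(log_sigma))
        dens = np.clip(dens, 1e-300, None)  # guard against underflow
        return -np.sum(L_q(dens, q))
    start = np.array([np.median(x), np.log(np.std(x))])
    res = minimize(neg_objective, start, method="Nelder-Mead")
    mu_hat, log_sigma_hat = res.x
    return mu_hat, np.exp(log_sigma_hat)

rng = np.random.default_rng(0)
# 95% of the sample from the assumed N(0, 1) model, 5% gross outliers from N(10, 1).
x = np.concatenate([rng.normal(0.0, 1.0, 190), rng.normal(10.0, 1.0, 10)])

for q in (1.0, 0.9, 0.7, 0.5):   # q = 1 is ordinary maximum likelihood
    mu_hat, sigma_hat = mlq_fit(x, q)
    print(f"q = {q:3.1f}:  mu_hat = {mu_hat:6.3f}, sigma_hat = {sigma_hat:5.3f}")
```

Under this kind of contamination, values of q below 1 should leave the location estimate close to that of the clean data, whereas the maximum likelihood fit is dragged towards the outliers; the min-max criterion described above replaces such ad hoc choices of q with an analytic one.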
References

H. Akaike (1973). Information theory and an extension of the maximum likelihood principle. In: 2nd International Symposium on Information Theory. Editors: B. N. Petrov and F. Csaki.
A. Basu, I. R. Harris, N. L. Hjort and M. C. Jones (1998). Robust and efficient estimation by minimizing a density power divergence. Biometrika, 85:549–559.
A. Basu and B. G. Lindsay (1994). Minimum disparity estimation for continuous models: efficiency, distributions and robustness. Annals of the Institute of Statistical Mathematics, 46:683–705.
R. Beran (1977). Minimum Hellinger distance estimates for parametric models. The Annals of Statistics, 5:445–463.
E. Choi, P. Hall and B. Presnell (2000). Rendering parametric procedures more robust by empirically tilting the model. Biometrika, 87:453–465.
N. Cressie and T. R. C. Read (1984). Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society, Series B, 46:440–464.
D. Ferrari and Y. Yang (2009). Maximum Lq-likelihood. The Annals of Statistics (in press).
M. Genton and P. J. Rousseeuw (1995). The change of variance function of M-estimators of scale under general contamination. Journal of Computational and Applied Mathematics, 64:69–80.
F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw and W. A. Stahel (1986). Robust Statistics: The Approach Based on Influence Functions. Wiley & Sons, New York.
J. Havrda and F. Charvát (1967). Quantification method of classification processes: concept of structural entropy. Kybernetika, 3:35–80.
C. Tsallis (1988). Possible generalization of Boltzmann-Gibbs statistics. Journal of Statistical Physics, 52:479–487.