Testing for Bimodality with Mixture Models (?)
Finding features in large datasets
David W. Scott1
Department of Statistics, Rice University
e-mail: [email protected]
Keywords: density features, modes, nonparametric.
1. Introduction
Bump-hunting is an engaging problem in applied statistics. Karl Pearson (1894) considered this problem by developing a method-of-moments estimator for a two-component univariate Gaussian model, whose numerical solution involved finding the roots of a nonic equation. Good and Gaskins (1980) devised an iterative algorithm to evaluate the weight of evidence for a mode (or a bump) by statistical surgery, in which probability mass around the mode is reallocated to nearby regions so as to remove that mode (or bump). The weight of evidence for the mode is then obtained from the resulting change in penalized likelihood.
2. Previous Work
Silverman (1981) devised a bootstrap test for the number of modes, but the use of a fixed-bandwidth kernel estimator reduces the power and interpretability of the test results. Regions where the density is undersmoothed produce a number of small modes, whose count is unstable under bootstrap resampling. Similarly, excess mass approaches (Mannen, 19xx) can have difficulty with fixed-bandwidth estimation. Minnotte (1997) devised a variation of Good’s approach, together with a notion of the size of a bump similar to excess mass, to provide a mode-by-mode testing capability using bootstrap samples. Its power with large samples is unknown, and it has only been used in one dimension. Scott (1992) advocated visual clustering as an informal way of using multiple kernel estimates to judge the features in the data. More formal notions of testing for modes are now becoming fashionable. These ideas are limited by the curse of dimensionality.
Mixture modeling provides a rich environment in which to approximate the distribution of any data set. However, mixture models are quite challenging to estimate, and testing is equally challenging because the usual asymptotics generally do not hold for Gaussian mixture densities. Gaussian mixtures, in particular, are highly correlated representations of a multivariate density, making interpretation of individual components rather difficult. Information-based techniques for determining the correct number of components seem highly influenced by an incorrect choice of mixing model, by outliers, and by dimension, in addition to the lack of orthogonality. Scott (2004) introduced the idea of partial mixture estimation. This technique turns out to be highly related to mode-finding in kernel density estimates. (?)
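The sensitivity of mode counts to a fixed bandwidth, mentioned above in connection with Silverman's test, can be illustrated with a small sketch. This is not the bootstrap test itself, only the mode-counting step it relies on. The sample, the grid, the bandwidth values, and the helper names `gaussian_kde_on_grid` and `count_modes` are all assumptions chosen for illustration.

```python
import numpy as np

def gaussian_kde_on_grid(data, grid, bandwidth):
    """Fixed-bandwidth Gaussian kernel density estimate evaluated on a grid."""
    z = (grid[:, None] - data[None, :]) / bandwidth
    kernel_sum = np.exp(-0.5 * z**2).sum(axis=1)
    return kernel_sum / (len(data) * bandwidth * np.sqrt(2.0 * np.pi))

def count_modes(density):
    """Count strict local maxima of a density sampled on a grid."""
    inner = density[1:-1]
    return int(np.sum((inner > density[:-2]) & (inner > density[2:])))

rng = np.random.default_rng(0)
# Hypothetical sample: one sharp peak and one broad peak, so that no single
# bandwidth suits both regions.
data = np.concatenate([rng.normal(0.0, 0.3, 300), rng.normal(4.0, 2.0, 300)])
grid = np.linspace(data.min() - 1.0, data.max() + 1.0, 2000)

# Print the number of modes found at each (assumed) bandwidth.
for h in (0.05, 0.2, 0.5, 1.0, 3.0, 10.0):
    print(h, count_modes(gaussian_kde_on_grid(data, grid, h)))
```

A bandwidth small enough to resolve the sharp peak tends to leave the broad region undersmoothed, so many small modes appear there; their number would be expected to vary from one bootstrap resample to the next.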
Work partially supported by NSF and ONR
The simplest hypothesis that is well-suited to mixture modeling is the existence of a second mode. Hartigan and Hartigan (1985) provided the dip test of unimodality, which does not require any distributional assumptions.
3. New Material
In this paper, we revisit the fitting of a two-component mixture and attempt to assess the weight of evidence for a second mode. Some simulation results will be presented. A real example from preliminary simulations of a D0 experiment at Fermilab will also be considered.
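As a rough illustration of what fitting a two-component mixture involves, the sketch below runs a plain EM algorithm on simulated data, counts the modes of the fitted density, and reports the gain in log-likelihood over a single Gaussian. It is a minimal sketch under stated assumptions, not the procedure used in this paper: the simulated sample, the median-split starting values, the floor on the standard deviations, and all function names are hypothetical.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def mixture_pdf(x, params):
    w, mu1, s1, mu2, s2 = params
    return w * normal_pdf(x, mu1, s1) + (1.0 - w) * normal_pdf(x, mu2, s2)

def fit_two_component(x, n_iter=1000, tol=1e-9, sigma_floor=1e-3):
    """EM for w*N(mu1, s1^2) + (1-w)*N(mu2, s2^2); returns (params, loglik)."""
    # Assumed starting values: split the sample at its median.
    med = np.median(x)
    lower, upper = x[x <= med], x[x > med]
    params = (0.5, lower.mean(), max(lower.std(), sigma_floor),
              upper.mean(), max(upper.std(), sigma_floor))
    loglik = np.log(mixture_pdf(x, params)).sum()
    for _ in range(n_iter):
        w, mu1, s1, mu2, s2 = params
        # E-step: posterior probability that each point came from component 1.
        r = w * normal_pdf(x, mu1, s1) / mixture_pdf(x, params)
        # M-step: weighted updates; sigma_floor (an assumption) guards against
        # a component collapsing onto a single observation.
        w = r.mean()
        mu1 = np.sum(r * x) / r.sum()
        mu2 = np.sum((1.0 - r) * x) / (1.0 - r).sum()
        s1 = max(np.sqrt(np.sum(r * (x - mu1) ** 2) / r.sum()), sigma_floor)
        s2 = max(np.sqrt(np.sum((1.0 - r) * (x - mu2) ** 2) / (1.0 - r).sum()),
                 sigma_floor)
        params = (w, mu1, s1, mu2, s2)
        new_loglik = np.log(mixture_pdf(x, params)).sum()
        converged = new_loglik - loglik < tol
        loglik = new_loglik
        if converged:
            break
    return params, loglik

def count_modes(density):
    """Count strict local maxima of a density sampled on a grid."""
    inner = density[1:-1]
    return int(np.sum((inner > density[:-2]) & (inner > density[2:])))

rng = np.random.default_rng(1)
# Hypothetical data: two well-separated Gaussian clusters.
x = np.concatenate([rng.normal(0.0, 1.0, 400), rng.normal(4.0, 1.0, 400)])

params, ll2 = fit_two_component(x)
ll1 = np.log(normal_pdf(x, x.mean(), x.std())).sum()  # single-Gaussian MLE fit
grid = np.linspace(x.min(), x.max(), 2000)
n_modes = count_modes(mixture_pdf(grid, params))

print("fitted (w, mu1, s1, mu2, s2):", np.round(params, 3))
print("modes of fitted mixture:", n_modes)
print("2 * log-likelihood gain over one Gaussian:", 2.0 * (ll2 - ll1))
```

Two caveats apply. A fitted two-component mixture need not be bimodal, which is why the modes of the fitted density are counted directly instead of being inferred from the number of components. And because the usual asymptotics do not hold for Gaussian mixtures, the log-likelihood gain printed here should not be referred to a chi-squared distribution; calibrating it would require simulation or some other argument.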
References
Good, I.J., Gaskins, R.A. (1980) Density estimation and bump-hunting by the penalized likelihood method exemplified by scattering and meteorite data, Journal of the American Statistical Association, 75, 42–56.
Hartigan, J.A., Hartigan, P.M. (1985) The dip test of unimodality, The Annals of Statistics, 13, 70–84.
Minnotte, M.C. (1997) Nonparametric testing of the existence of modes, The Annals of Statistics, 25, 1646–1660.
Pearson, K. (1894) Contributions to the mathematical theory of evolution, Phil. Trans. Roy. Soc. London A, 185, 71–110.
Scott, D.W. (1992) Multivariate density estimation: theory, practice, and visualization, John Wiley & Sons, New York.
Scott, D.W. (2004) Outlier Detection and Clustering by Partial Mixture Modeling, Proceedings of COMPSTAT 2004, Antoch (Ed.), Prague, 453–464.
Silverman, B.W. (1981) Using kernel density estimates to investigate multimodality, Journal of the Royal Statistical Society, Series B, 43, 97–99.