Acta Appl Math (2007) 97: 211–219 DOI 10.1007/s10440-007-9127-9
Clustering Effect on the Statistical Estimation Accuracy of Distribution Density

Rimantas Rudzkis · Tomas Ruzgas
Published online: 6 April 2007
© Springer Science+Business Media B.V. 2007

R. Rudzkis (✉) · T. Ruzgas
Institute of Mathematics and Informatics, Akademijos 4, 08663 Vilnius, Lithuania
e-mail: [email protected]
Abstract The paper is devoted to statistical nonparametric estimation of multivariate distribution density. The influence of data pre-clustering on the estimation accuracy of multimodal densities is analyzed by means of the Monte Carlo method. It is shown that soft clustering is more advantageous than hard clustering. While a moderate increase in the number of clusters increases the calculation time, it considerably reduces the estimation error.

Keywords Multivariate distribution density · Nonparametric estimation · Data clustering
1 Introduction

Modern data analysis employs a wide variety of distribution density estimation methods. Kernel density estimators are the most common ones (see [13, 14]); many other algorithms are also popular and widely used (see, e.g., [2, 4, 8], and [15]). It is difficult to choose an effective estimation procedure when the data are multivariate, their distribution density is multimodal, and the sample size is not large. In [12], it is shown that, in this case, the accuracy of the estimators increases significantly if the observations are clustered (treating the analyzed multimodal density as a mixture of unimodal densities) and popular nonparametric estimators are applied separately to each cluster. The idea of pre-clustering is not new; however, to the best of our knowledge, it has so far been applied only for selecting the bandwidth of kernel density estimators (see [3, 9]).

In this paper, we continue the investigations of [12] and search for a more effective clustering procedure. Using the Monte Carlo method, we analyze the relationship between the estimation error, the clustering type, and the number of clusters. After the sample clustering has been completed, the mixture components are estimated in two ways: using (1) the kernel estimator with adaptive bandwidth and (2) Friedman's algorithm based on projection pursuit. The comparative accuracy analysis of several nonparametric estimators in [12] shows that, in most of the cases analyzed, Friedman's procedure is the most effective, and the kernel estimator is slightly more accurate only if the sample size is small. Although the simulation results of [12] show that data pre-clustering in the estimation of a multimodal distribution density is expedient, the question of which clustering method is the best remains open. Since a preliminary comparative analysis made in this research has shown that probabilistic clustering methods are clearly more advantageous than geometrical ones (k-means, etc.), only the sample clustering methods based on density approximation by a mixture of Gaussian distribution densities are considered in the sequel.

In this paper, we investigate (by means of simulation) the problem of how to choose the number of clusters. In addition, we try to answer the question whether soft clustering is more advantageous than hard clustering. The paper comprises the following sections: Sect. 2 describes the EM algorithm used for sample clustering; Sect. 3 reviews the analyzed density estimators; Sect. 4 contains the simulation results and conclusions.
2 Sample Clustering Using the EM Algorithm

Let X(1), ..., X(n) be observed i.i.d. d-dimensional random vectors with an unknown distribution density f(x). In case this density is multimodal, it can be analyzed as a mixture of a few unimodal densities:

$$ f(x) = \sum_{k=1}^{q} p_k f_k(x). \qquad (1) $$
Let the distribution of the random vector X depend on a random variable ν that takes the values 1, ..., q. It is interpreted as the number of the class the observed object belongs to. In classification theory, the quantities p_k = P{ν = k} are called the a priori probabilities that the observed object belongs to the kth class, and the quantities π_k(x) = P{ν = k | X = x} are the a posteriori probabilities. The function f_k is treated as the conditional density of X given ν = k. By the term soft clustering of a sample we refer to the estimation of the values π_k(X(t)) for all k = 1, ..., q, t = 1, ..., n. A sample is hard-clustered if estimators ν̂(1), ..., ν̂(n) of ν(1), ..., ν(n) are indicated, where ν(t) denotes the class number of the observation X(t).

The mixture of Gaussian distributions is the most popular model in clustering theory and practice. Therefore, in this section, we assume that the f_k(x) are Gaussian densities with means M_k and covariance matrices R_k. Let f(θ, x) denote the right-hand side of (1), where θ = ((p_k, M_k, R_k), k = 1, ..., q). Since

$$ \pi_k(x) = \frac{p_k f_k(x)}{f(\theta, x)}, \qquad k = 1, \ldots, q, \qquad (2) $$

the estimators of the a posteriori probabilities are obtained, as usual, by the "plug-in" method, which replaces the unknown parameter vector θ on the right-hand side of (2) by its maximum-likelihood estimate θ* = arg max_θ L(θ), where L(θ) = ∏_{t=1}^{n} f(θ, X(t)). The EM algorithm, an iterative procedure most frequently used to find this estimate, is also applied in this study. Let π̂_k = π̂_k^{(r)} be the estimates obtained after r cycles of the iterative procedure. Then the new estimate θ̂ = θ̂^{(r+1)} is defined by the equalities

$$ \hat{p}_k = \frac{1}{n} \sum_{t=1}^{n} \hat{\pi}_k(X(t)), $$
$$ \hat{M}(k) = \frac{1}{n \hat{p}_k} \sum_{t=1}^{n} \hat{\pi}_k(X(t))\, X(t), \qquad \hat{R}(k) = \frac{1}{n \hat{p}_k} \sum_{t=1}^{n} \hat{\pi}_k(X(t))\, [X(t) - \hat{M}(k)][X(t) - \hat{M}(k)]' $$
for all k = 1, ..., q. Inserting θ̂^{(r+1)} into the right-hand side of expression (2), we get π̂_k^{(r+1)}(X(t)), k = 1, ..., q, t = 1, ..., n. This iterative procedure yields a nondecreasing sequence L(θ̂^{(r)}); however, its convergence to the global maximum depends on the initial value θ̂^{(0)} (or π̂^{(0)}). The simplest solution of the initial-value selection problem is the random start technique: the EM procedure is repeated many times from random starting values π̂^{(0)}, and the result with the maximal value of L(θ̂) is selected as the final one. The methodology of consecutive extraction of the mixture components [11] can also be applied.

For choosing the number of clusters q, various tests of model adequacy can be used. Beginning with q = 1, the value of the parameter q is consecutively increased until the hypothesis of model (1) adequacy is no longer rejected (e.g., at the significance level α = 0.1). To test this hypothesis, a criterion based on the increment of the maximum-likelihood function can also be applied. Let θ̂(q) be the estimator of θ obtained by the EM algorithm with the number of clusters equal to q. Let

$$ \psi = L(\hat{\theta}(q+1)) - L(\hat{\theta}(q)), \qquad (3) $$
and let Ĝ(u) denote the estimated distribution function of ψ obtained by using the parametric bootstrap (see, e.g., [6, 16]) under the condition that mixture model (1) of Gaussian distributions is valid. Then the hypothesis of model (1) adequacy is not rejected if

$$ 1 - \hat{G}(\psi) \ge \alpha. \qquad (4) $$
We denote by q* the number of clusters obtained in this way.
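For illustration, the EM iteration described above can be sketched in a few lines of Python (NumPy/SciPy). This is a minimal sketch of our own, not the implementation used in the study; the function name, the random-start initialization, the fixed number of iterations, and the small covariance regularization term are assumptions made for the example.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gaussian_mixture(X, q, n_iter=200, seed=0):
    """One random start of the EM algorithm for a q-component Gaussian mixture.
    Returns the parameter estimates, the a posteriori probabilities pi_k(X(t))
    (the soft clustering) and the log-likelihood log L(theta)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Random start: equal weights, means drawn from the sample, pooled sample covariance.
    p = np.full(q, 1.0 / q)
    M = X[rng.choice(n, size=q, replace=False)].copy()
    R = np.array([np.cov(X, rowvar=False) + 1e-6 * np.eye(d) for _ in range(q)])

    for _ in range(n_iter):
        # E-step: pi_k(x) = p_k f_k(x) / f(theta, x), cf. (2).
        dens = np.column_stack([p[k] * multivariate_normal.pdf(X, M[k], R[k])
                                for k in range(q)])
        post = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted updates of p_k, M(k), R(k) given in Sect. 2.
        nk = post.sum(axis=0)
        p = nk / n
        M = (post.T @ X) / nk[:, None]
        for k in range(q):
            C = X - M[k]
            R[k] = (post[:, k, None] * C).T @ C / nk[k] + 1e-6 * np.eye(d)

    # Log-likelihood and posteriors for the final parameter values.
    dens = np.column_stack([p[k] * multivariate_normal.pdf(X, M[k], R[k])
                            for k in range(q)])
    post = dens / dens.sum(axis=1, keepdims=True)
    loglik = np.log(dens.sum(axis=1)).sum()
    return p, M, R, post, loglik
```

In practice the procedure would be restarted from several random initial values and the run with the largest log-likelihood retained, as described above; the soft clustering is given by the returned a posteriori probabilities, and a hard clustering by their componentwise argmax.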
3 The Density Estimation Algorithms Analyzed

The comparative analysis of estimation accuracy is made by the Monte Carlo method for two different types of statistics. The following statistical estimators of the distribution density are considered: (1) Friedman's method (PPDE) [4, 5], based on projection pursuit and sequential gaussianization of the projections; (2) Silverman's [14] adaptive kernel density estimator (AKDE), which uses a different bandwidth for each observation. Before applying these methods, the sample is standardized, i.e., it is transformed so as to have zero mean and unit covariance matrix. Let us describe the methods in more detail.

PPDE Algorithm. J.H. Friedman proposed an iterative algorithm based on the sequential search for univariate projections of the data with the distribution most different from the Gaussian one [7] and subsequent transformation of these projections to Gaussian random variables. Let X
be a standardized random vector (i.e., a random vector with zero mean and unit covariance matrix) with unknown distribution density f(x). The variable X is transformed after each step, Z^{(k)} = Q_k(X), k = 1, 2, ..., with Z^{(0)} = X. Given Z^{(k-1)}, Z^{(k)} is obtained by the following procedure. Let g_k(u), u ∈ R^1, denote the distribution density of the univariate projection τ'Z^{(k-1)}, where the direction vector τ = τ(k) is selected so that g_k differs most from the standard normal density ϕ. Let us denote the corresponding distribution functions by G_k and Φ. Define

$$ Z^{(k)} = Z^{(k-1)} + \left[ \Phi^{-1}\!\left( G_k(\tau' Z^{(k-1)}) \right) - \tau' Z^{(k-1)} \right] \tau. $$

Thus, the random vector Z^{(k-1)} is transformed so that the projection of Z^{(k)} onto the direction τ has the distribution Φ, while the projections onto the (d-1)-dimensional subspace of R^d orthogonal to the direction τ remain unchanged. Friedman [5] proved that the random vector Z^{(k)} converges in distribution to the standard Gaussian random vector as k → ∞. Hence, for M large enough,

$$ f(x) \approx \varphi(z^{(M)}) \prod_{k=1}^{M} \frac{g_k(\tau(k)' z^{(k-1)})}{\varphi(\tau(k)' z^{(k)})}, \qquad (5) $$

where z^{(k)} = Q_k(x). From expression (5) we obtain Friedman's statistic by substituting statistical estimators for the unknown univariate distribution densities g_k on the right-hand side of the expression. The projective estimator based on Legendre orthogonal polynomials [1] is used to estimate the densities g_k of the univariate projections. Let ξ_1, ..., ξ_n be univariate random variables with distribution density g(u). By means of the transformation η_t = 2Φ(ξ_t) − 1, we obtain random variables η_1, ..., η_n with distribution density g*(y) = g(u)/(2ϕ(u)), y = 2Φ(u) − 1, concentrated in the interval [−1, 1]. Using the expansion in Legendre polynomials {ψ_j}_{j=0}^{∞},

$$ g^*(y) = \sum_{j=0}^{\infty} b_j \psi_j(y), $$

and replacing the coefficients b_j = (j + 1/2) Eψ_j(η_t) by their empirical counterparts, we obtain the final estimator

$$ \hat{g}(u) = \varphi(u) \sum_{j=0}^{s} \frac{2j+1}{n} \sum_{t=1}^{n} \psi_j(\eta_t) \psi_j(y). \qquad (6) $$

According to the recommendations in [8], the order s ≤ 6 of expansion (6) is used. The density of the original multivariate data is estimated by combining these projected univariate density estimates. The projection directions τ(k) are selected to maximize the value of the estimated projection index, as proposed by Friedman:

$$ \tau(k) = \arg\max_{\tau} \{\hat{I}(\tau)\}. \qquad (7) $$
Here I(τ) is defined by the equalities

$$ I(\tau) = \int_{-1}^{1} g^{*2}(r)\, dr - \frac{1}{2} = \sum_{j=1}^{\infty} \frac{2j+1}{2}\, E^2[\psi_j(r)], \qquad r = 2\Phi(\tau' X) - 1, $$

and its estimator Î(τ) is obtained from (6) with η_t = 2Φ(τ'X(t)) − 1, t = 1, ..., n.
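As an illustration of formulas (6) and (7), the following Python sketch (our own, not Friedman's implementation) computes the projective estimate ĝ(u) of a univariate projection density and the corresponding projection-index estimate; the helper names and the crude random search over candidate directions are assumptions made for the example (a proper numerical optimizer would normally be used for (7)).

```python
import numpy as np
from scipy.stats import norm
from scipy.special import eval_legendre

def legendre_coeffs(xi, s=6):
    """Empirical coefficients b_j = ((2j+1)/(2n)) * sum_t psi_j(eta_t),
    where eta_t = 2*Phi(xi_t) - 1 and psi_j are Legendre polynomials."""
    eta = 2.0 * norm.cdf(np.asarray(xi, dtype=float)) - 1.0
    n = eta.size
    return np.array([(2 * j + 1) / (2.0 * n) * eval_legendre(j, eta).sum()
                     for j in range(s + 1)])

def g_hat(u, xi, s=6):
    """Projective estimate (6) of the density g(u) of a univariate projection."""
    b = legendre_coeffs(xi, s)
    y = 2.0 * norm.cdf(u) - 1.0
    g_star = sum(b[j] * eval_legendre(j, y) for j in range(s + 1))
    return 2.0 * norm.pdf(u) * g_star          # g(u) = 2*phi(u)*g*(y)

def projection_index(xi, s=6):
    """Plug-in estimate of I(tau) = int g*^2(r) dr - 1/2 (the j = 0 term equals 1/2)."""
    b = legendre_coeffs(xi, s)
    return sum(b[j] ** 2 * 2.0 / (2 * j + 1) for j in range(1, s + 1))

def best_direction(X, n_candidates=500, s=6, seed=0):
    """Crude version of (7): largest estimated index over random unit directions."""
    rng = np.random.default_rng(seed)
    best_tau, best_val = None, -np.inf
    for _ in range(n_candidates):
        tau = rng.standard_normal(X.shape[1])
        tau /= np.linalg.norm(tau)
        val = projection_index(X @ tau, s)
        if val > best_val:
            best_tau, best_val = tau, val
    return best_tau, best_val
```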
AKDE Algorithm. The kernel distribution density estimator with a variable bandwidth is defined by the expression

$$ \hat{f}(x) = \frac{1}{n} \sum_{t=1}^{n} \frac{1}{h_t^d} K\!\left( \frac{x - X(t)}{h_t} \right). \qquad (8) $$

We use the same procedure as in [8]: the standard Gaussian function ϕ is used as the kernel function, and the bandwidth [10] is determined by the equality h_j = h (f̃(X(j))/g)^{−γ}, where h = (4/((2d+1)n))^{1/(d+4)}, f̃(·) is the kernel density estimator (8) obtained by substituting h for h_t, g = exp((1/n) Σ_{t=1}^{n} log f̃(X(t))) is the geometric mean of these pilot values, and γ is a sensitivity parameter. The authors of [8] propose to select the value of this parameter from the set {0.2, 0.4, 0.6, 0.8} by using the cross-validation method.

Let us now discuss the application of these methods to the estimation of the conditional densities in the case of a clustered sample. As noted at the beginning of Sect. 2, we assume that the observed random vector X depends on a latent random variable ν taking values i = 1, ..., q and that the conditional density f_i of the random vector X given {ν = i} is unimodal for i = 1, ..., q. Since formula (1) holds, we use the equality
$$ \hat{f}(x) = \sum_{i=1}^{q} \hat{p}_i \hat{f}_i(x) \qquad (9) $$
to estimate the density f(x) after the sample clustering. Replacing the a posteriori classification probabilities π_i(x) = P{ν = i | X = x} by their estimates π̂_i(x) and using the expression p_i = Eπ_i(X), we have

$$ \hat{p}_i = \frac{1}{n} \sum_{t=1}^{n} \hat{\pi}_i(X(t)), \qquad i = 1, \ldots, q. $$
When estimating the component f_i(x) of the mixture (9) by the PPDE algorithm, the weight π̂_i(t) = π̂_i(X(t)) is assigned to the observation ξ_t = τ'X(t), t = 1, ..., n. Therefore, when applying equality (5) to estimate the density f_i(x), formula (6) is replaced by

$$ \hat{g}_i(u) = \varphi(u) \sum_{j=0}^{s} \frac{2j+1}{\hat{p}_i n} \sum_{t=1}^{n} \hat{\pi}_i(t) \psi_j(\eta_t) \psi_j(y). \qquad (6') $$

Similarly, when applying the AKDE algorithm to the estimation of f_i(x), formula (8) is replaced by

$$ \hat{f}_i(x) = \frac{1}{\hat{p}_i n} \sum_{t=1}^{n} \frac{\hat{\pi}_i(t)}{h_t^d} K\!\left( \frac{x - X(t)}{h_t} \right). \qquad (8') $$

The same formulas are valid in the case of hard clustering, with π̂_i(t) = 1{ν̂(t) = i} and p̂_i = (1/n) Σ_{t=1}^{n} π̂_i(t).
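The following NumPy sketch (our own illustration, not the implementation used in the study) combines the adaptive bandwidth rule of the AKDE algorithm with the weighted formulas (8′) and (9); the function names are assumptions, the pilot estimate uses the single fixed bandwidth h described above, and γ is fixed rather than chosen by cross-validation.

```python
import numpy as np

def gaussian_kernel(z):
    """Standard Gaussian kernel in R^d, evaluated row-wise."""
    d = z.shape[-1]
    return np.exp(-0.5 * np.sum(z * z, axis=-1)) / (2.0 * np.pi) ** (d / 2.0)

def pilot_density(X, h):
    """Fixed-bandwidth estimate (8) with h_t = h, needed by the bandwidth rule."""
    n, d = X.shape
    diff = (X[:, None, :] - X[None, :, :]) / h
    return gaussian_kernel(diff).mean(axis=1) / h ** d

def adaptive_bandwidths(X, gamma=0.5):
    """h_t = h * (f~(X(t))/g)^(-gamma) with h = (4/((2d+1)n))^(1/(d+4)),
    g the geometric mean of the pilot values (cf. the AKDE description)."""
    n, d = X.shape
    h = (4.0 / ((2 * d + 1) * n)) ** (1.0 / (d + 4))
    f_pilot = pilot_density(X, h)
    g = np.exp(np.mean(np.log(f_pilot)))
    return h * (f_pilot / g) ** (-gamma)

def akde_component(x, X, weights, h_t):
    """Weighted estimate (8'): (1/(p_i n)) sum_t pi_i(t) K((x - X(t))/h_t) / h_t^d."""
    n, d = X.shape
    p_i = weights.mean()                       # p_i = (1/n) sum_t pi_i(t)
    diff = (x[None, :] - X) / h_t[:, None]
    return np.sum(weights * gaussian_kernel(diff) / h_t ** d) / (p_i * n)

def mixture_density(x, X, post, h_t):
    """Estimate (9): sum_i p_i f_i(x), with post[t, i] the soft-clustering weights."""
    p_hat = post.mean(axis=0)
    comps = [akde_component(x, X, post[:, i], h_t) for i in range(post.shape[1])]
    return float(np.dot(p_hat, comps))
```

With `post` taken, for example, from the EM sketch of Sect. 2, `mixture_density(x0, X, post, adaptive_bandwidths(X))` gives the clustered AKDE estimate at a point x0; for hard clustering, `post` is simply the matrix of class indicators.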
4 Study of the Estimation Accuracy

The accuracy of the density estimation methods is analyzed using the Monte Carlo method (by simulation). The main attention is paid to the case where the density of the independent d-dimensional observations is a mixture of Cauchy densities. The multidimensional Cauchy density is defined as a product of univariate Cauchy densities with the standard scale parameter. Thus,

$$ f(x) = \sum_{i=1}^{q} p_i f_C(x, m_i), \qquad \text{where} \quad f_C(x, m_i) = \prod_{j=1}^{d} \frac{1}{\pi (1 + (x_j - m_{ij})^2)}. $$
For comparison, samples from the corresponding Gaussian mixtures with unit covariance matrices are also generated. In the latter case,

$$ f(x) = \sum_{i=1}^{q} p_i f_N(x, m_i), \qquad \text{where} \quad f_N(x, m_i) = \prod_{j=1}^{d} \frac{1}{\sqrt{2\pi}} \exp\!\left( -\frac{(x_j - m_{ij})^2}{2} \right). $$
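As an illustration, samples from any of the models listed below can be generated by drawing a component label with probabilities p_i and adding independent standard Cauchy or standard Gaussian noise to the corresponding center m_i; a minimal sketch (our own helper, with an illustrative parameter choice) follows.

```python
import numpy as np

def sample_mixture(n, probs, means, component="cauchy", seed=0):
    """Draw n observations from a mixture with component centers `means` and
    independent coordinates: standard Cauchy or standard Gaussian noise."""
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    q, d = means.shape
    labels = rng.choice(q, size=n, p=probs)
    noise = (rng.standard_cauchy(size=(n, d)) if component == "cauchy"
             else rng.standard_normal(size=(n, d)))
    return means[labels] + noise, labels

# Example: model 4 below with p_2 = 0.3 and k = 2, i.e., centers (0, 0) and (1, 1).
X, nu = sample_mixture(400, [0.7, 0.3], [[0.0, 0.0], [1.0, 1.0]], component="cauchy")
```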
We consider samples from the following models:

1. q = 2, p_1 = 1 − p_2, p_2 = 0.1, 0.3, 0.5; m_1 = (0, 0, 0, 0, 0), m_2 = (0.5k, 0.5k, 0.5k, 0.5k, 0.5k) (here and below, the index k takes the values 1, 2, ..., 6).
2. q = 3, p_1 = p_2 = (1 − p_3)/2, p_3 = 0.1, 1/3, 0.8; m_1 = (0, 0, 0, 0, 0), m_2 = (0.5k, 0.5k, 0.5k, 0.5k, 0.5k), m_3 = (0.5k, 0.5k, 0, 0, 0).
3. q = 4, p_1 = p_2 = p_3 = (1 − p_4)/3, p_4 = 0.1, 0.25, 0.7; m_1 = (0, 0, 0, 0, 0), m_2 = (0.5k, 0.5k, 0.5k, 0.5k, 0.5k), m_3 = (0.5k, 0.5k, 0, 0, 0), m_4 = (0, 0, 0.5k, 0.5k, 0.5k).
4. q = 2, p_1 = 1 − p_2, p_2 = 0.1, 0.3, 0.5; m_1 = (0, 0), m_2 = (0.5k, 0.5k).
5. q = 3, p_1 = p_2 = (1 − p_3)/2, p_3 = 0.1, 1/3, 0.8; m_1 = (0, 0), m_2 = (0.5k, 0.5k), m_3 = (0.5k, 0).
6. q = 4, p_1 = p_2 = p_3 = (1 − p_4)/3, p_4 = 0.1, 0.25, 0.7; m_1 = (0, 0), m_2 = (0.5k, 0.5k), m_3 = (0.5k, 0), m_4 = (0, 0.5k).

Note that this set of models is similar to that used in [8, 12], which enables us to compare our results with those of [8, 12]. For each model, samples of different sizes (50, 100, 200, 400, 800) were simulated. The accuracy of density estimation is measured by

$$ \delta_1 = \frac{1}{n} \sum_{t=1}^{n} |f(X(t)) - \hat{f}(X(t))| \approx \int |f(x) - \hat{f}(x)|\, f(x)\, dx $$

and

$$ \delta_2 = \frac{1}{n} \sum_{t=1}^{n} \frac{|f(X(t)) - \hat{f}(X(t))|}{f(X(t)) + \hat{f}(X(t))} \approx \frac{1}{2} \int |f(x) - \hat{f}(x)|\, dx. $$

Fig. 1 Relationship between the sample size and the errors (model 2 of mixtures for the Cauchy and Gaussian distributions: d = 5; p_1 = p_2 = 0.45, p_3 = 0.1; k = 1)

Fig. 2 Relationship between the distances of the mixture component centers and the errors (model 2 for a mixture of Cauchy distributions: n = 400; d = 5; p_1 = p_2 = 0.45, p_3 = 0.1)
For each combination of model, parameter values, and sample size, the averages δ̄_i of the errors δ_i were calculated from 100 independent samples. The conclusions presented below are based on the analysis of these averages.
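A minimal sketch (with our own helper name) of how δ_1 and δ_2 can be computed for a single simulated sample, given the values of the true density f and of an estimate f̂ at the sample points, is given below; averaging over the 100 independent samples is then a simple loop.

```python
import numpy as np

def estimation_errors(f_true, f_hat):
    """delta_1 and delta_2 for one sample; f_true and f_hat hold the values
    f(X(t)) and f^(X(t)) of the true and estimated densities at the n sample points."""
    f_true = np.asarray(f_true, dtype=float)
    f_hat = np.asarray(f_hat, dtype=float)
    delta1 = np.mean(np.abs(f_true - f_hat))
    delta2 = np.mean(np.abs(f_true - f_hat) / (f_true + f_hat))
    return delta1, delta2
```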
Fig. 3 Relationship between the selected number of clusters and errors (model 2 for a mixture of Cauchy distributions: n = 400; d = 5; p1 = p2 = 0.45, p3 = 0.1)
Results of the Study. The results obtained corroborate the conclusion of [12] about the expediency of the clustering procedure. The dependence of both errors δ_1 and δ_2 on the sample size, the type of clustering, the number of clusters, and the distance between the modes of the densities is of a rather similar qualitative nature. The clustering procedure substantially reduces the estimation error. The errors are 5–10% smaller in the case of soft clustering as compared with hard clustering. Recall that q* stands for the optimal number of clusters (see the definition at the end of Sect. 2); a larger number of clusters yields only a very slight improvement, while the computational time increases drastically. Friedman's estimator gives a higher accuracy than the kernel one.

The functional dependence of the estimation errors on the sample size, the distance between the modes, and the number of clusters is illustrated for model 2 in Figs. 1, 2 and 3 (the symbols A and P denote the AKDE and PPDE algorithms, respectively; the density estimators without clustering are plotted by solid lines, the density estimators with hard clustering by short-dashed lines, and the density estimators with soft clustering by long-dashed lines). A similar behavior is observed for the other models.
References

1. Aburdene, M.F.: Recursive computation of discrete Legendre polynomial coefficients. Multidimens. Syst. Signal Process. 7(2), 221–224 (1996)
2. Ćwik, J., Koronacki, J.: Multivariate density estimation: a comparative study. Neural Comput. Appl. 6(3), 173–185 (1997)
3. Duong, T.: Bandwidth matrices for multivariate kernel density estimation. PhD thesis (2004), p. 161
4. Friedman, J.H.: Exploratory projection pursuit. J. Am. Stat. Assoc. 82(397), 249–266 (1987)
5. Friedman, J.H., Stuetzle, W., Schroeder, A.: Projection pursuit density estimation. J. Am. Stat. Assoc. 79, 599–608 (1984)
6. Hall, P.: The Bootstrap and Edgeworth Expansion. Springer, New York (1992)
7. Huber, P.J.: Projection pursuit. Ann. Stat. 13(2), 435–475 (1985)
8. Hwang, J.N., Lay, S.R., Lippman, A.: Nonparametric multivariate density estimation: a comparative study. IEEE Trans. Signal Process. 42(10), 2795–2810 (1994)
9. Jeon, B., Landgrebe, D.A.: Fast Parzen density estimation using clustering-based branch and bound. IEEE Trans. Pattern Anal. Mach. Intell. 16(9), 950–954 (1994)
10. Jones, M.C., Marron, J.S., Sheather, S.J.: A brief survey of bandwidth selection for density estimation. J. Am. Stat. Assoc. 91, 401–407 (1996)
11. Rudzkis, R., Radavičius, M.: Statistical estimations of a mixture of Gaussian distributions. Acta Appl. Math. 38, 37–54 (1995)
12. Ruzgas, T., Rudzkis, R., Kavaliauskas, M.: Application of clustering in the non-parametric estimation of distribution density. Nonlinear Anal. Model. Control 11(4), 393–411 (2006)
13. Scott, D.W.: Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley, New York (1992)
14. Silverman, B.W.: Density Estimation for Statistics and Data Analysis. Chapman and Hall, London (1986)
15. Stone, C.J., Hansen, M., Kooperberg, C., Truong, Y.K.: Polynomial splines and their tensor products in extended linear modeling. Ann. Stat. 25, 1371–1470 (1997)
16. Wong, M.: A bootstrap testing procedure for investigating the number of subpopulations. J. Stat. Comput. Simul. 22, 99–112 (1985)