Computational Statistics and Data Analysis 102 (2016) 67–84
On bandwidth selection using minimal spanning tree for kernel density estimation
Sreevani, C.A. Murthy
Machine Intelligence Unit, Indian Statistical Institute, Kolkata 700108, India
Article history: Received 1 June 2015; Received in revised form 19 January 2016; Accepted 19 April 2016; Available online 25 April 2016.
Keywords: Kernel density estimation; Bandwidth selection; Euclidean minimal spanning tree; Unbiased estimator.
Abstract
Kernel density estimation is widely used in machine learning applications such as classification, clustering and feature selection. One of the major issues in constructing kernel density estimators is the tuning of the bandwidth parameter. Most bandwidth selection procedures optimize the mean integrated squared or absolute error, which requires substantial computational time as the size of the data increases. Here, the bandwidth is taken to be a function of the inter-point distances of the data set: it is defined as a function of the length of the Euclidean Minimal Spanning Tree (EMST) of the given sample points. No rigorous theory about the asymptotic properties of the EMST based density estimator has previously been developed in the literature. In this article, a theoretical analysis of these asymptotic properties is established, and it is proved that the estimator is asymptotically unbiased for the original density at every continuity point. Moreover, the analysis is provided for a general kernel. Experiments are conducted on both synthetic and real-life data sets to compare the performance of the EMST bandwidth with conventional cross-validation and plug-in bandwidth selectors. The EMST based estimator is found to achieve comparable performance, while being simpler and faster than the conventional estimators. © 2016 Elsevier B.V. All rights reserved.
1. Introduction

Many tasks in pattern recognition and machine learning require knowledge of the underlying densities of the observed data (Menardi and Azzalini, 2014; Brown et al., 2012; Stover and Ulm, 2013; Brox et al., 2007; Ji et al., 2014; Jones and Rehg, 2002; Liu et al., 2007). For example, in Bayes classification the decision rule involves estimation of the class conditional probabilities of the training data (Duda et al., 1999; Ramoni and Sebastiani, 2001; Kim and Scott, 2010), and in model based clustering every cluster corresponds to a 'mode' or 'peak' in the estimated probability density of a given set of points (Li et al., 2007; Hinneburg and Gabriel, 2007; Tang et al., 2015). Estimation of the density can be done either in a parametric or a non-parametric way. In parametric estimation, assumptions are made about the structure of the density, whereas in non-parametric estimation no assumptions are made about the form of the density function. Various methods have been studied for non-parametric density estimation, such as the histogram, kernel density estimators, spline estimators and orthogonal series estimators (Silverman, 1986; Scott, 2009; Golyandina et al., 2012). The kernel method is perhaps the most popular and well known technique of non-parametric estimation (Parzen, 1962; Cacoullos, 1966).

Throughout this article, we use the following notation. Let $X_1, X_2, \ldots, X_n \in \mathbb{R}^d$, $d \ge 2$, denote $n$ independent and identically distributed random vectors, with $X_i = (X_{i1}, \ldots, X_{id})'$, where $(\cdot)'$ denotes the transpose. A general vector $x$ has the
representation $x = (x_1, \ldots, x_d)'$, and $E(\cdot)$ denotes the expectation of a random vector. Also, $dx$ will be shorthand for $dx_1 \cdots dx_d$, and $\int_{\mathbb{R}^d} (\cdot)\, dx$ will be shorthand for $\int_{\mathbb{R}} \cdots \int_{\mathbb{R}} (\cdot)\, dx_1 \cdots dx_d$.
A general d-dimensional kernel density estimator $\hat f_n$, for a random sample $X_1, X_2, \ldots, X_n$ with common probability density function $f$, is
$$\hat f_n(x) = \frac{1}{n} \sum_{i=1}^{n} K_H(x - X_i),$$
where $K_H$ is the scaled kernel function, i.e., $K_H(x) = |H|^{-1/2} K(H^{-1/2} x)$, $K$ is a d-variate kernel function, $H$ is a symmetric positive definite $d \times d$ bandwidth matrix and $|\cdot|$ denotes the determinant. Traditionally, $K$ is assumed to be symmetric and to satisfy $\int K(x)\, dx = 1$. Some commonly used kernel functions are the uniform, triangle, Epanechnikov and Gaussian kernels; the most widely used is the Gaussian with zero mean and unit variance. From the above equation for $\hat f_n(x)$, it is clear that the kernel density estimate at any test point $x$ is simply the average of the kernel values contributed by all training points $X_i$. It is well known that bandwidth selection is the most important and crucial step in obtaining a good estimate. There are mainly two computational challenges associated with KDE: one is the selection of the bandwidth, which is estimated using the training data, and the other is the construction of the density estimate at any test point. Note that the issue of bandwidth selection is the only problem considered in this article. The bandwidth matrix $H$ can be taken to be a diagonal positive definite matrix, i.e., $H = \operatorname{diag}(h_1^2, \ldots, h_d^2)$, $h_i > 0\ \forall i$, to simplify the above equation. Further simplification is obtained from the restriction $h_i = h\ (>0)\ \forall i$, i.e., $H = \operatorname{diag}(h^2, \ldots, h^2)$, and this leads to the single bandwidth kernel density estimator
$$\hat f_n(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - X_i) = \frac{1}{n h^d} \sum_{i=1}^{n} K\big(h^{-1}(x - X_i)\big).$$
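For concreteness, a minimal Python sketch of this single-bandwidth estimator with the standard Gaussian kernel is given below. The function name and array layout are ours; it is an illustrative sketch, not the implementation used in the experiments reported later.

```python
import numpy as np

def gaussian_kde_estimate(x, X, h):
    """Single-bandwidth kernel density estimate
    f_hat(x) = (1 / (n h^d)) * sum_j K(h^{-1} (x - X_j)),
    with K the standard d-variate Gaussian kernel."""
    n, d = X.shape
    u = (x - X) / h                                        # scaled differences, shape (n, d)
    K = np.exp(-0.5 * np.sum(u ** 2, axis=1)) / (2.0 * np.pi) ** (d / 2.0)
    return K.sum() / (n * h ** d)
```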
A full bandwidth matrix provides more flexibility, but it also introduces more complexity into the estimator, since more parameters need to be tuned (Wand and Jones, 1994). Although the selection of the bandwidth can be done subjectively, there is a great demand for automatic selection. Several automatic procedures compute an optimal bandwidth value by minimizing the discrepancy between the estimate and the target density using some error criterion. A few such error criteria are given below (Wand and Jones, 1994).
• Mean Squared Error (MSE): $\mathrm{MSE}\,\hat f_n(x) = E\big[\hat f_n(x) - f(x)\big]^2$.
• Mean Integrated Squared Error (MISE): $\mathrm{MISE}\,\hat f_n = E \int \big[\hat f_n(x) - f(x)\big]^2\, dx$.
• Mean Integrated Absolute Error (MIAE): $\mathrm{MIAE}\,\hat f_n = E \int \big|\hat f_n(x) - f(x)\big|\, dx$.
Most of the modern bandwidth selectors are motivated by minimizing the MSE or MISE because these two criteria can be decomposed into variance and bias terms, as
$$\mathrm{MSE}\,\hat f_n(x) = \big[E \hat f_n(x) - f(x)\big]^2 + \mathrm{Var}\,\hat f_n(x) = \big[\mathrm{Bias}\,\hat f_n(x)\big]^2 + \mathrm{Var}\,\hat f_n(x),$$
$$\mathrm{MISE}\,\hat f_n = E \int \big[\hat f_n(x) - f(x)\big]^2 dx = \int E\big[\hat f_n(x) - f(x)\big]^2 dx = \int \mathrm{MSE}\,\hat f_n(x)\, dx.$$
The optimal bandwidth that minimizes MISE, when the underlying density is a d-variate normal and a diagonal bandwidth matrix is employed, can be approximated by (Silverman, 1986)
$$h_i = \sigma_i \left(\frac{4}{(d+2)\,n}\right)^{1/(d+4)}, \quad \forall\, i = 1, 2, \ldots, d;$$
where $\sigma_i$ is the standard deviation of the ith variable. This is called the ''Normal Reference (NR) rule''.
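As a small illustration of the NR rule, a Python sketch is given below; the function name is ours and the per-coordinate standard deviations are estimated from the sample, which is an assumption about how the rule would be applied in practice.

```python
import numpy as np

def normal_reference_bandwidths(X):
    """Normal Reference rule: h_i = sigma_i * (4 / ((d + 2) n))^(1 / (d + 4))."""
    n, d = X.shape
    sigma = X.std(axis=0, ddof=1)          # per-coordinate sample standard deviations
    return sigma * (4.0 / ((d + 2) * n)) ** (1.0 / (d + 4))
```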
The most studied automatic bandwidth selector which aims to minimize MISE is Least Squares Cross Validation (LSCV) (Bowman, 1984). The LSCV selector of $H$ is
$$\hat H_{\mathrm{LSCV}} = \arg\min_{H} \mathrm{LSCV}(H), \quad \text{where } \mathrm{LSCV}(H) = \int \hat f(x; H)^2\, dx - \frac{2}{n} \sum_{i=1}^{n} \hat f_{-i}(X_i; H),$$
and $\hat f_{-i}(\cdot\,; H)$ is the kernel estimator based on the sample without $X_i$. One can minimize $\mathrm{LSCV}(H)$ over all positive definite diagonal matrices. Since the cross-validation criterion involves numerical optimization, it becomes increasingly difficult and time consuming as the data size increases. It has also been shown that $\hat H_{\mathrm{LSCV}}$ is highly variable (Wand and Jones, 1994). A number of modifications of LSCV have been proposed in order to improve its performance. These include the Biased Cross Validation (BCV) method (Scott and Terrell, 1987), the method in Ahmad and Ran (2004) based on kernel contrasts, likelihood cross validation (Zhang et al., 2006), indirect cross validation (Savchuk et al., 2010), and the do-validation method (Mammen et al., 2012). BCV is a well-known method and aims to minimize the Asymptotic MISE (AMISE) instead of the exact MISE formula. Another popular data-driven bandwidth selector is the Direct Plug-In (DPI) (Sheather and Jones, 1991; Liao et al., 2010). This is also based on AMISE, where estimates of the unknown quantities are ''plugged in''. Plug-in produces more stable bandwidths than cross validation, and hence is currently the more popular method. Smoothed Cross Validation (SCV) can be thought of as a hybrid of LSCV and BCV (Hall et al., 1992); it is based on the asymptotic integrated variance, but uses the exact integrated squared bias rather than its asymptotic form. Although the bandwidth is usually taken to be a constant, several methods have been proposed to vary it; examples are the popular kth nearest neighbor estimate (Loftsgaarden and Quesenberry et al., 1965), the adaptive kernel estimate proposed in Breiman et al. (1977), and the method in Mahapatruni and Gray (2011). In Zougab et al. (2014), Bayesian estimation of adaptive bandwidth matrices in multivariate kernel density estimation is investigated, when quadratic and entropy loss functions are used. A number of articles dealing with the problem of bandwidth selection exist in the literature (Terrell and Scott, 1992; Hall, 1992; Heidenreich et al., 2013; Cheng et al., 2014).

The most successful state-of-the-art bandwidth selection methods involve optimizing one of the MSE, MISE and AMISE criterion functions. Evaluation of such criterion functions involves $O(n^2)$ computations, where $n$ is the number of samples (Silverman, 1986). Optimization of these criterion functions then becomes computationally very expensive for large data sets. In the case of LSCV, $\frac{1}{2} n(n-1)$ evaluations of the kernel function $K(x)$ are needed to compute each value of the criterion function. If the Gaussian kernel is considered, a single component value $K(x)$ requires $O(d^2)$ operations, where $d$ is the dimension of the data (Silverman, 1986). So $O(n^2 d^2)$ computations are required to compute each value of the criterion function. Other methods like BCV, PI, etc., whose criterion functions depend on the estimation of general integrated squared density derivative functionals, also require $O(n^2 d^2)$ computations. To optimize such criterion functions, the most commonly used method is Quasi-Newton, an iterative procedure that approximates the Hessian matrix using recent function and gradient evaluations (Gill et al., 1981). If finite difference gradients are used, each iteration of this technique requires $d$ function evaluations.
In addition to this, the calculation of step directions and Hessian approximations involves $2d^2 + O(d)$ multiplications and the same number of additions and subtractions per iteration (Byrd et al., 1988). Since each function evaluation needs $O(n^2 d^2)$ operations, the entire cost per iteration is $d \times O(n^2 d^2) + 2d^2 + O(d) = O(n^2 d^3)$. If the optimization process takes $n_c$ iterations, the total complexity is then $O(n_c n^2 d^3)$. One can reduce the computational time by binning the samples (Scott and Sheather, 1985; Yang et al., 2003; Raykar et al., 2010). Another approach is to use a representative condensed set instead of the whole data set for density estimation (Girolami and He, 2003; Deng et al., 2008; Wang et al., 2014). However, much of the information about the position of the samples is lost by these methods. As large data sets are very common in modern machine learning applications, the design of a novel bandwidth selector that can handle large data sets with reduced computational time while providing similar accuracy is very much required.

Parzen (1962) showed that if a sequence $h = h_n$ of positive numbers satisfies $h_n \to 0$ and $n h_n \to \infty$ as $n \to \infty$, then the resulting kernel density estimator is asymptotically unbiased and consistent for the target density. The generalization of Parzen's work to the multivariate case is presented in Cacoullos (1966). Here, the bandwidth is considered as a function of the number of data points ($n$) only. Two different data sets with the same number of data points are shown in Fig. 1; one is a scaled version of the other. Clearly, the same bandwidth, depending only on the number of data points, does not work for both cases. It is desirable that the inter-point distances of the data play a role in the selection of the bandwidth; therefore, the bandwidth should be a function not only of $n$, but also of the inter-point distances of the data. The Euclidean Minimal Spanning Tree (EMST) (Shamos and Hoey, 1975; Preparata and Shamos, 1985; March et al., 2010) is entirely determined by the Euclidean distances between sample points, and it has a close relationship with the distribution of the samples. So, the length of the EMST has been considered as the bandwidth for kernel density estimation. In Chaudhuri et al. (1996), the bandwidth was defined using the EMST of the given samples, but the theory about the asymptotic properties requires the kernel to be uniform. For proving the results, they constructed two sequences of numbers $t_n, \gamma_n$ such that for every $\epsilon > 0$, $\exists M_0 > 0$ s.t. $P(t_n \le h_n \le \gamma_n) \ge 1 - \epsilon\ \forall n \ge M_0$; $t_n, \gamma_n \to 0$, $n t_n^2, n \gamma_n^2 \to \infty$ and $\frac{t_n}{\gamma_n} \to 1$ as $n \to \infty$. Such a framework has not been extended to prove the asymptotic properties for a general kernel. Additionally, two assumptions were considered, which are given below. Suppose $N_\alpha(x) = \{y : |x_i - y_i| \le \alpha\ \forall i\}$, $A_n$ is a sequence of sets such that $P(h_n \in A_n) \ge 1 - \epsilon\ \forall n$, and $\nu_{\xi n}(A_n) = P(h_n \in A_n \mid x_1 = \xi)$.
Fig. 1. Data with different scales.
• Assumption 1. There exists $M_1 > 0$ such that $P(x_1 \in N_\alpha(x) \mid h_n = \alpha) \le \alpha^2 M_1$, $\forall n \ge M_2 > 0$ and $\forall \alpha$.
• Assumption 2. There exists $M_3 > 0$ such that $\nu_{\xi n}(A_n) \ge 1 - \epsilon$, $\forall n > M_3$, $\forall \xi$ and for every such sequence $A_n$.

In this article, neither of these assumptions is made; note that Assumption 2 is a strong assumption. Moreover, the two sequences $(t_n, \gamma_n)$ mentioned above are also not used. The approach of the proof is very different from that of the earlier article. Here, using a different approach, a theoretical analysis of the asymptotic properties of the EMST based density estimator for a general kernel, under a mild assumption of compact support, is provided. It is proved that the resulting estimator is asymptotically unbiased for the original density at every continuity point. The performance of the EMST bandwidth is found to be comparable to that of traditional cross validation and plug-in techniques, while its computational time is much smaller.

The rest of the paper is organized as follows. Section 2 contains the theoretical analysis of the asymptotic properties of the EMST based bandwidth and the resulting density estimator. In Section 3, the practical benefits of this estimator are demonstrated on both synthetic and real-life data sets by comparing it with some of the existing methods. Finally, Section 4 contains the discussion and conclusion.

2. Asymptotic analysis of EMST based bandwidth selector

This section presents theoretical results for the EMST based density estimator. First, bandwidth selection using the EMST of the given samples is described.

Definition 1 (Euclidean Minimal Spanning Tree (EMST)). Let $V = \{x_1, x_2, \ldots, x_n\}$ be the set of given observations. Let $G = (V, E)$ be the fully connected, undirected graph defined on $V$, with $E = \{e_{ij}\}$, $\forall i, j$, the set of all edges $e_{ij} = (x_i, x_j)$. A weight $w_{ij}$, the Euclidean distance between $x_i$ and $x_j$, is assigned to each edge $e_{ij}$. Then the EMST is a subgraph $G'$ of $G$ with minimum total weight connecting all the vertices without loops. The sum of the weights (distances) of the edges in $G'$ is defined as the length of the EMST.

Let $X_1, X_2, \ldots, X_n \in \mathbb{R}^d$, $d \ge 2$, be independent and identically distributed random vectors with common density function $f(x)$. Construct the EMST ($E_n$) of the given $n$ samples, and let $L_n$ denote the length of $E_n$.
Define
$$T_n = \left(\frac{L_n}{n}\right)^{1/d}.$$
Clearly, the value of $T_n$ is a function not only of $n$ but also of the inter-point distances of the data. Note that $\{T_n\}_{n=1}^{\infty}$ is a sequence of random variables. The kernel density estimate of $f(x)$ with $T_n$ as bandwidth is
$$\hat f_n(x) = \frac{1}{n} \sum_{j=1}^{n} K_{T_n}(x - X_j) = \frac{1}{n T_n^d} \sum_{j=1}^{n} K\big(T_n^{-1}(x - X_j)\big).$$
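A minimal sketch of the EMST bandwidth $T_n$ is given below, assuming SciPy's distance and spanning-tree routines; the function name is ours, and the dense $O(n^2)$ pairwise-distance graph is used (the faster constructions discussed later are not attempted here). It is an illustration of the definition above, not the authors' implementation.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def emst_bandwidth(X):
    """T_n = (L_n / n)^(1/d), with L_n the length of the Euclidean minimal
    spanning tree of the sample (dense O(n^2) distance graph; assumes the
    sample points are distinct, so all off-diagonal distances are > 0)."""
    n, d = X.shape
    D = squareform(pdist(X))                  # n x n Euclidean distance matrix
    L_n = minimum_spanning_tree(D).sum()      # total edge weight of the EMST
    return (L_n / n) ** (1.0 / d)
```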
Note that $T_n$ is a continuous random variable. Let $g_n$ be the probability density function of $T_n$. For the statements about the asymptotic theory, we make the following assumptions on the target density. Let the support $A \subseteq \mathbb{R}^d$ of $f$ be path connected and compact, with $\mathrm{cl}(\mathrm{int}(A)) = A$ and $\lambda(\delta(A)) = 0$; here cl denotes closure, int denotes interior, $\delta(A)$ denotes the boundary of $A$, i.e., $\delta(A) = \mathrm{cl}(A) \cap \mathrm{cl}(A^c)$, and $\lambda$ denotes Lebesgue measure on $\mathbb{R}^d$. Let $\int_V f(x)\, dx > 0$ for every open set $V \subseteq \mathbb{R}^d$ with $V \cap A \ne \emptyset$. Suppose $K(y)$ is a symmetric density function, centered at the zero vector, satisfying the following conditions.

(K1) $\int_{\mathbb{R}^d} K(y)\, dy = 1$.
(K2) $\sup_y K(y) < \infty$.
(K3) $\lim_{\|y\| \to \infty} \|y\|^d K(y) = 0$.
Initially, the asymptotic properties of $L_n$ and $T_n$ are discussed. Proofs of the propositions are given for the case of $\mathbb{R}^2$ for ease of understanding; similar proofs follow in the case of $\mathbb{R}^d$.

Proposition 2.1. $L_n \to \infty$ as $n \to \infty$ in probability, i.e., $\forall\, M > 0$, $P(L_n > M) \to 1$ as $n \to \infty$.

Proof. Since $\mathrm{cl}(\mathrm{int}(A)) = A$, $A$ contains a square of size $a \times a$ ($a > 0$, say). Divide this square to form a grid of size $(2k-1) \times (2k-1)$, for some integer $k \ge 2$. The size of each box $A_{ij}$ in the grid is then $\frac{a}{2k-1} \times \frac{a}{2k-1}$, $i, j = 1, 2, \ldots, 2k-1$. Let $p = 1, 3, 5, \ldots, 2k-1$ and $q = 1, 3, 5, \ldots, 2k-1$. Let $B_{pqn}$ denote the event that at least one among $X_1, X_2, \ldots, X_n$ belongs to $A_{pq}$, and let $B = \cap_p \cap_q B_{pqn}$.

Claim: $P(B_{pqn}) \to 1$ as $n \to \infty$.
$$P(X \notin A_{pq}) = 1 - P(X \in A_{pq}) = 1 - \int_{A_{pq}} f(x)\, dx = b_{pq}\ \text{(say)}, \quad \text{where } 0 < b_{pq} < 1.$$
So,
$$P(X_1 \notin A_{pq}, \ldots, X_n \notin A_{pq}) = \big[P(X_1 \notin A_{pq})\big]^n = b_{pq}^n \to 0 \quad \text{as } n \to \infty.$$
Thus, for every $p$ and $q$,
$$P(\text{at least one among } X_1, X_2, \ldots, X_n \in A_{pq}) = 1 - b_{pq}^n \to 1 \quad \text{as } n \to \infty.$$
Hence the claim.

Claim: $P(B) \to 1$ as $n \to \infty$.
$$P(B_{p_1 q_1 n} \cap B_{p_2 q_2 n}) = P(B_{p_1 q_1 n}) + P(B_{p_2 q_2 n}) - P(B_{p_1 q_1 n} \cup B_{p_2 q_2 n}) \ge P(B_{p_1 q_1 n}) + P(B_{p_2 q_2 n}) - 1 = 1 - (b_{p_1 q_1}^n + b_{p_2 q_2}^n) \to 1 \quad \text{as } n \to \infty,$$
$$P(B_{p_1 q_1 n} \cap B_{p_2 q_2 n} \cap B_{p_3 q_3 n}) \ge P(B_{p_1 q_1 n} \cap B_{p_2 q_2 n}) + P(B_{p_3 q_3 n}) - 1 = 1 - (b_{p_1 q_1}^n + b_{p_2 q_2}^n + b_{p_3 q_3}^n) \to 1 \quad \text{as } n \to \infty.$$
By induction, $P(\cap_p \cap_q B_{pqn}) \to 1$ as $n \to \infty$, i.e., $P(B) \to 1$ as $n \to \infty$. Hence the claim.

Now, consider a configuration of $k^2$ points, one from each $A_{pq}$, $p, q = 1, 3, 5, \ldots, 2k-1$. An EMST for such a configuration (with $k = 3$) is shown by the dotted line in Fig. 2. Note that the distance between any pair of such points is $\ge \frac{a}{2k-1}$ (since a point can be anywhere inside $A_{pq}$). Hence the length of the EMST of the $k^2$ points satisfies
$$L_{k^2} \ge (k+1)(k-1)\,\frac{a}{2k-1}.$$
Even if we consider a configuration of $n\ (> k^2)$ points inside $A$, $L_n$ would still be $\ge (k+1)(k-1)\frac{a}{2k-1}$. Since $(k+1)(k-1)\frac{a}{2k-1} \to \infty$ as $k \to \infty$, it follows that $L_n \to \infty$ as $n \to \infty$ in probability. $\square$
Proposition 2.2. Tn → 0 as n → ∞ in probability i.e., ∀ δ > 0, P(Tn < δ) → 1 as n → ∞.
Fig. 2. An EMST for the distribution of data points denoted by ×.
Proof.
$$\frac{L_n}{n} = \frac{L_n}{n-1} \times \frac{n-1}{n} < \text{Maximal edge weight of the EMST}. \quad (1)$$

Claim: The maximal edge weight of the EMST $\to 0$ as $n \to \infty$ in probability. By the same argument as above, we can show that
$$P(\text{at least one among } X_1, X_2, \ldots, X_n \in A_{ij},\ \forall\, i, j = 1, 2, 3, \ldots, 2k-1) \to 1 \quad \text{as } n \to \infty.$$
Since, for every $i$ and $j$, a point can be placed anywhere inside $A_{ij}$, the maximal distance from a point to a point in an adjacent occupied box is at most
$$\sqrt{\left(\frac{2a}{2k-1}\right)^2 + \left(\frac{a}{2k-1}\right)^2} = \sqrt{5} \times \frac{a}{2k-1} \to 0 \quad \text{as } k \to \infty.$$
Hence the maximal edge weight of the EMST $\to 0$ as $n\ (> k) \to \infty$ in probability, which proves the claim. From Eq. (1), $T_n = \left(\frac{L_n}{n}\right)^{1/d} \to 0$ as $n \to \infty$ in probability. $\square$
Proposition 2.3. $n T_n^d \to \infty$ as $n \to \infty$ in probability, i.e., $\forall\, M > 0$, $P(n T_n^d > M) \to 1$ as $n \to \infty$.

Proof. $n T_n^d = n \times \frac{L_n}{n} = L_n$. Using Proposition 2.1, $n T_n^d \to \infty$ as $n \to \infty$ in probability. $\square$
The following two lemmas are needed before the main result is stated.

Lemma 2.4. Suppose $K(y)$ satisfies the assumptions (K1)–(K3). Then for any integer $m \ge 1$, $\int_{\mathbb{R}^d} K(z)^m\, dz < \infty$.
Proof.
$$\int_{\mathbb{R}^d} K(z)^m\, dz \le \sup_z K(z) \int_{\mathbb{R}^d} K(z)^{m-1}\, dz \le \cdots \le \Big(\sup_z K(z)\Big)^{m-1} \int_{\mathbb{R}^d} K(z)\, dz < \infty,$$
using $K(z) \le \sup_z K(z)$ repeatedly, together with (K1) and (K2). $\square$

Lemma 2.5. Let $t > 0$. Suppose $K(y)$ satisfies the assumptions (K1)–(K3). Let $s(y)$ be a bounded function satisfying $\int_{\mathbb{R}^d} |s(y)|\, dy < \infty$. Define
$$s_{t,m}(x) = \int_{\mathbb{R}^d} \frac{1}{t^d}\, K(t^{-1} y)^m\, s(x - y)\, dy.$$
Then at every continuity point $x$ of $s$,
(A) $\lim_{t \to 0} s_{t,m}(x) = s(x) \int_{\mathbb{R}^d} K(y)^m\, dy$;
(B) $s_{t,m}(x)$ is bounded.
Proof. First, part (A) of the lemma will be proved.
$$\Big| s_{t,m}(x) - s(x) \int_{\mathbb{R}^d} K(y)^m\, dy \Big| = \Big| \int_{\mathbb{R}^d} \frac{1}{t^d} K(t^{-1} y)^m s(x-y)\, dy - s(x) \int_{\mathbb{R}^d} \frac{1}{t^d} K(t^{-1} y)^m\, dy \Big| = \Big| \int_{\mathbb{R}^d} \big[s(x-y) - s(x)\big]\, \frac{1}{t^d} K(t^{-1} y)^m\, dy \Big| = S\ \text{(say)}.$$
Let $\eta > 0$. Split the region of integration into two regions, $\|y\| \le \eta$ and $\|y\| > \eta$:
$$S \le \int_{\|y\| \le \eta} \big|s(x-y) - s(x)\big| \frac{1}{t^d} K(t^{-1} y)^m\, dy + \int_{\|y\| > \eta} \big|s(x-y)\big| \frac{1}{t^d} K(t^{-1} y)^m\, dy + |s(x)| \int_{\|y\| > \eta} \frac{1}{t^d} K(t^{-1} y)^m\, dy$$
$$\le \max_{\|y\| \le \eta} \big|s(x-y) - s(x)\big| \int_{\mathbb{R}^d} K(z)^m\, dz + \frac{1}{\eta^d}\Big(\sup_z K(z)\Big)^{m-1} \sup_{\|z\| > \eta/t} \big[\|z\|^d K(z)\big] \int_{\mathbb{R}^d} |s(y)|\, dy + |s(x)| \int_{\|z\| > \eta/t} K(z)^m\, dz, \quad (2)$$
where in the second term we used $\frac{1}{t^d} K(t^{-1}y)^m = \frac{1}{\|y\|^d}\, \|t^{-1} y\|^d K(t^{-1} y)\, K(t^{-1}y)^{m-1}$ together with $\|y\| > \eta$, and in the other two terms the substitution $z = t^{-1} y$.

Fix $\eta > 0$ and let $t \to 0$; then $\sup_{\|z\| > \eta/t} \|z\|^d K(z) \to 0$ (using (K3)), so the second term in (2) goes to $0$ (since $\int_{\mathbb{R}^d} |s(y)|\, dy < \infty$). Also,
$$|s(x)| \int_{\|z\| > \eta/t} K(z)^m\, dz \to 0 \quad \text{as } t \to 0,$$
by Lemma 2.4 and the boundedness of $s$. Now, letting $\eta \to 0$, $\max_{\|y\| \le \eta} |s(x-y) - s(x)| \to 0$, since $x$ is a continuity point of $s$. Therefore, using Eq. (2), $S \to 0$ as $t \to 0$ and $\eta \to 0$.
Next, part (B) of the lemma will be proved.
$$|s_{t,m}(x)| = \Big| \int_{\mathbb{R}^d} \frac{1}{t^d} K(t^{-1} y)^m s(x-y)\, dy \Big| \le \int_{\mathbb{R}^d} \frac{1}{t^d} K(t^{-1} y)^m \big|s(x-y)\big|\, dy = \int_{\mathbb{R}^d} K(z)^m \big|s(x - tz)\big|\, dz \le \sup_y |s(y)| \int_{\mathbb{R}^d} K(z)^m\, dz < \infty,$$
by the boundedness of $s$ and Lemma 2.4. Hence $s_{t,m}(x)$ is bounded. $\square$
The main result can now be stated.

Theorem 2.6. Let $x$ be a continuity point of $f$. Then, under the stated assumptions on the support of $f$ and the kernel conditions (K1)–(K3), the EMST based kernel density estimator satisfies $E\big[\hat f_n(x)\big] \to f(x)$ as $n \to \infty$; that is, $\hat f_n(x)$ is asymptotically unbiased at every continuity point of $f$.

The proof conditions on the bandwidth $T_n$ and combines Lemma 2.5 with Propositions 2.1–2.3: for every $\epsilon > 0$ there exists $\delta > 0$ controlling the smoothed value for bandwidths below $\delta$ (by the continuity of $f$ and Lemma 2.5), and for every such $\delta > 0$ there exists $N_{\delta,\epsilon} > 0$ such that $P(T_n > \delta)$ is arbitrarily small for all $n \ge N_{\delta,\epsilon}$ (by Proposition 2.2).
The above results are proved for a single (scalar) bandwidth (i.e., $h_i = h > 0$). The proofs can easily be extended to diagonal bandwidth matrices (i.e., $H = \operatorname{diag}(h_1^2, \ldots, h_d^2)$, $h_i > 0$).

Computational complexity: Delaunay triangulation followed by Kruskal's algorithm can be employed to compute the EMST. The Delaunay triangulation of the given data set can be computed in $O(n \log n)$ time, where $n$ is the number of samples. In the case of $\mathbb{R}^2$, the Delaunay triangulation generates only $O(n)$ edges. Since the EMST is a subgraph of every Delaunay triangulation of the given $n$ points, any standard minimum spanning tree algorithm, such as Prim's or Kruskal's algorithm, applied to this graph requires $O(n \log n)$ time (see the illustrative sketch below). However, in higher dimensions ($d > 2$) the triangulation of the point set might contain the complete graph; therefore, the computational complexity of the EMST for $d > 2$ is $O(n^2)$.

3. Experimental results

In this section, the performance of the EMST based bandwidth selector is studied. The experimental study has been conducted with the help of artificial data sets (from Gaussian distributions) as well as real-life publicly available data sets. Ten different shapes of bivariate Gaussian densities, obtained by varying the modes, have been considered. For each such case, data samples of different sizes, n = 50, 100, 250, 500, 1000, 2500, and 5000, have been generated. The optimal bandwidth $h^*$ (Wand and Jones, 1993) is computed by minimizing the MISE score for all the considered Gaussian densities. The performance of a bandwidth selector, say $\hat h$, is compared via the ratio $\hat h / h^*$. In the case of real-life data, the performance of the proposed bandwidth is compared via Bayes classification accuracy: half of the data is taken as the training set (which is used for the estimation of the bandwidth) and the rest is used as the test set. The standard Gaussian kernel has been employed to obtain the estimate of the underlying density. All the experiments were performed on a Windows 8 machine with a 3.4 GHz processor and 32 GB RAM.

The organization of the results is as follows. First, the details of both the artificial and the real-life data sets are discussed. Then, the performance of the EMST bandwidth, in terms of computational time, $\hat h / h^*$ and Bayes classification accuracy, is compared with four existing bandwidth selectors; these results demonstrate the practical advantages of the EMST based bandwidth selector. The effect of different kernel shapes with the EMST bandwidth has also been studied.
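The sketch below illustrates, under our own naming and using SciPy's Delaunay and spanning-tree routines, the Delaunay based EMST construction referenced in the computational complexity remark above for the bivariate case; it is an assumption-laden illustration, not the implementation used in the experiments.

```python
import numpy as np
from scipy.spatial import Delaunay
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import minimum_spanning_tree

def emst_length_2d(X):
    """Length of the EMST of bivariate points X, computed from the Delaunay
    triangulation (which always contains the EMST) instead of the complete graph."""
    n = X.shape[0]
    tri = Delaunay(X)
    edges = set()
    for i, j, k in tri.simplices:             # the three edges of every triangle
        edges.update([(min(i, j), max(i, j)),
                      (min(j, k), max(j, k)),
                      (min(i, k), max(i, k))])
    rows, cols = zip(*edges)
    weights = np.linalg.norm(X[list(rows)] - X[list(cols)], axis=1)
    graph = coo_matrix((weights, (rows, cols)), shape=(n, n))
    return minimum_spanning_tree(graph).sum()
```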
Table 1. Formulas and parameters of the target densities A–J (all bivariate normal densities or mixtures thereof).
A: $N\big((0,0)',\ \begin{psmallmatrix}1 & 0\\ 0 & 1\end{psmallmatrix}\big)$
B: $N\big((0,0)',\ \begin{psmallmatrix}1/2 & 0\\ 0 & 2\end{psmallmatrix}\big)$
C: $N\big((0,0)',\ \begin{psmallmatrix}1 & 6/10\\ 6/10 & 1\end{psmallmatrix}\big)$
D: $\frac{1}{2} N\big((-2,0)',\ \begin{psmallmatrix}1 & 0\\ 0 & 1\end{psmallmatrix}\big) + \frac{1}{2} N\big((2,0)',\ \begin{psmallmatrix}1 & 0\\ 0 & 1\end{psmallmatrix}\big)$
E: $\frac{1}{2} N\big((-3/2,0)',\ \begin{psmallmatrix}1/2 & 0\\ 0 & 1\end{psmallmatrix}\big) + \frac{1}{2} N\big((3/2,0)',\ \begin{psmallmatrix}1/2 & 0\\ 0 & 1\end{psmallmatrix}\big)$
F: $\frac{1}{2} N\big((0,-1)',\ \begin{psmallmatrix}1 & 0\\ 0 & 1/2\end{psmallmatrix}\big) + \frac{1}{2} N\big((0,1)',\ \begin{psmallmatrix}1 & 0\\ 0 & 1/2\end{psmallmatrix}\big)$
G: $\frac{1}{2} N\big((-1,1)',\ \begin{psmallmatrix}4/9 & 14/45\\ 14/45 & 4/9\end{psmallmatrix}\big) + \frac{1}{2} N\big((1,-1)',\ \begin{psmallmatrix}4/9 & 0\\ 0 & 4/9\end{psmallmatrix}\big)$
H: $\frac{3}{7} N\big((-1,0)',\ \begin{psmallmatrix}1/2 & 0\\ 0 & 1/2\end{psmallmatrix}\big) + \frac{3}{7} N\big((1,0)',\ \begin{psmallmatrix}1/2 & 0\\ 0 & 1/2\end{psmallmatrix}\big) + \frac{1}{7} N\big((0,1)',\ \begin{psmallmatrix}1/2 & 0\\ 0 & 1/2\end{psmallmatrix}\big)$
I: $\frac{1}{4} N\big((-1,0)',\ \begin{psmallmatrix}1/4 & 0\\ 0 & 1\end{psmallmatrix}\big) + \frac{1}{2} N\big((0,0)',\ \begin{psmallmatrix}1/4 & 0\\ 0 & 1\end{psmallmatrix}\big) + \frac{1}{4} N\big((1,0)',\ \begin{psmallmatrix}1/4 & 0\\ 0 & 1\end{psmallmatrix}\big)$
J: trimodal normal mixture with weights $\frac{3}{7}, \frac{3}{7}, \frac{1}{7}$, taken from Wand and Jones (1993); its component means involve the values $-1$, $1$, $0$ and $\pm 2/\sqrt{3}$, and its covariance matrices have entries $9/25$, $63/250$ and $49/100$.
Table 2. Real-life data sets with the number of instances, attributes and classes.
Data set           | Instances | Attributes | Classes
Adult              | 48 842    | 6          | 2
Mammography        | 11 183    | 6          | 2
Banana             | 5 267     | 2          | 2
Blood transfusion  | 748       | 4          | 2
Liver              | 345       | 6          | 2
VertebralColumn    | 310       | 3          | 2
Haberman           | 306       | 3          | 2
3.1. Data sets

Ten bivariate Gaussian densities have been considered, covering a wide range of density shapes. Among these ten densities, three (A, B, C) are uni-modal, four (D, E, F, G) are bi-modal and three (H, I, J) are tri-modal normal densities. Densities A, G and J are taken from Wand and Jones (1993). The formulas for these densities are given in Table 1 and the corresponding contour plots are depicted in Fig. 3. Density A is a uni-modal normal with a diagonal covariance matrix having equal components. Density B has a diagonal covariance matrix but unequal variances. Density C has a correlation between the two variates. Density D is a mixture of two Gaussians in which the coordinate variances are equal, whereas for densities E and F the coordinate variances within each component are unequal. Density G has one spherical and one oblique elliptical component. Density H is a tri-modal normal with all spherical components, whereas density I has elliptical components. The modes of the tri-modal density J have different orientations, with a small gap between them.

Seven real-life public domain data sets were used in the experiments, and details of these data sets are provided in Table 2. The data sets have sample sizes ranging from about 300 to 50,000 and dimensions ranging from 2 to 6. The Banana data set is taken from a machine learning data sets repository; the remaining data sets are from the UCI machine learning repository (Bache and Lichman, 2013). The Adult data set has 14 features, of which 6 are continuous and 8 are nominal; only the six continuous features are considered here. As discussed above, ten Gaussian densities, each with seven different sample sizes, and seven real-life data sets, i.e., (10 × 7) + 7 data sets in total, are used for the experiments.

3.2. Performance

The performance of the EMST based bandwidth selector is compared with traditional and state-of-the-art bandwidth selectors, namely the Least Squares Cross Validation (LSCV), Biased Cross Validation (BCV), Smoothed Cross Validation (SCV) and Plug-In (PI) bandwidth selectors. The code for these bandwidth selectors is taken from the R package 'ks' (Duong, 2014).
Fig. 3. Bivariate target densities A–J. (Panels (a)–(j): contour plots of densities A–J.)
Recent extensions of the above methods (Chacón and Duong, 2010) can also generate full bandwidth matrices. As the EMST bandwidth is a scalar bandwidth matrix, we have considered only scalar bandwidth matrices for the above methods.

Running time comparison
The running time of the EMST bandwidth selector is compared with the above mentioned bandwidth selectors. For each data set, bandwidths are computed using the training samples, and the average training time in seconds, over ten trials, is reported. Figs. 4 and 5 compare the running times of the various bandwidth selectors on the synthetic and real-life data sets, respectively. These results clearly indicate that the computational time for the EMST based bandwidth selection method is much smaller than for the existing bandwidth selection methods. For all data sets with size up to 1000, the training time for the EMST bandwidth is found to be about half a second, and for larger data sets with size up to 5000, the result is obtained within a minute. As the running time differences between EMST and the traditional bandwidth selectors do not vary much across different density shapes, running time comparisons are discussed here for different data sizes irrespective of the density shape. For sample size 50, the time costs of LSCV, SCV and PI are about 0.01 s, which is 10 times slower than the EMST bandwidth; the time needed for executing BCV is close to 0.1 s, which is 100 times larger than for EMST. For sample sizes up to 500, the computational times of LSCV, SCV and PI are about 5 times larger than that of EMST, whereas BCV runs 50 times slower than EMST. For sample size equal to 1000, LSCV's and PI's time costs lie between 2 and 3 s, and the running time of BCV lies between 40 and 50 s, while the running time for the EMST bandwidth varies from 0.3 to 0.7 s. The computational cost of PI is about 50 s, while the EMST method runs in only 5 s, for almost all data sets of size 2500. PI's execution time ranges from 250 to 300 s, while the EMST method has a running time of less than 30 s, for any data set of size 5000.
Fig. 4. Comparison of computational time for different bandwidth selectors. X-axis: number of sample points; Y-axis: log10 of the average training time (in seconds).
From the graphs it is clear that BCV is computationally much more expensive than all the other methods; for data sizes bigger than 2500, it is almost 100 times slower than EMST. One can notice that the superiority in terms of computational time increases as the sample size increases. For example, for smaller sample sizes (less than a thousand), the speed-up factor of the EMST scheme compared to BCV is about 10–50, but for large samples the factor is about 100. Even for real-life data, the running time of the EMST bandwidth is smaller than that of the traditional methods. Note that BCV is not applied to the real-life data, as its complexity increases greatly with the dimension and sample size. One can observe from the figure that the traditional method SCV runs 2–3 times slower, whereas PI's computational time is about 10 times larger, than that of EMST.
Fig. 5. Comparison of computational time for different bandwidth selectors. Y-axis: log10 of the average training time (in seconds).

Table 3. Comparison of bandwidth selectors on Gaussian data. Each row gives the sample mean (standard deviation) of ĥ/h* for n = 50, 100, 250, 500, 1000, 2500, 5000 (10 replications in each case). LSCV—Least Squares Cross Validation, BCV—Biased Cross Validation, SCV—Smoothed Cross Validation, PI—PlugIn.

Target density A (n = 50, 100, 250, 500, 1000, 2500, 5000):
EMST: 1.055 (0.045) 1.032 (0.040) 1.009 (0.018) 0.995 (0.010) 0.954 (0.013) 0.901 (0.006) 0.868 (0.005)
LSCV: 1.007 (0.201) 0.970 (0.167) 1.058 (0.086) 1.016 (0.113) 0.984 (0.054) 0.998 (0.098) 1.030 (0.062)
BCV: 0.954 (0.116) 1.010 (0.094) 1.041 (0.043) 1.045 (0.040) 1.042 (0.038) 1.064 (0.017) 1.085 (0.007)
SCV: 1.088 (0.095) 1.094 (0.071) 1.090 (0.037) 1.079 (0.036) 1.060 (0.031) 1.056 (0.020) 1.050 (0.011)
PI: 0.822 (0.086) 0.876 (0.056) 0.940 (0.045) 0.955 (0.047) 0.959 (0.027) 0.990 (0.029) 0.999 (0.025)

Target density B (n = 50, 100, 250, 500, 1000, 2500, 5000):
EMST: 1.137 (0.037) 1.111 (0.028) 1.103 (0.022) 1.068 (0.011) 1.033 (0.012) 0.978 (0.007) 0.947 (0.006)
LSCV: 1.003 (0.159) 0.971 (0.131) 0.997 (0.116) 1.002 (0.083) 0.999 (0.106) 1.003 (0.067) 1.000 (0.044)
BCV: 0.948 (0.101) 0.999 (0.107) 1.040 (0.079) 1.033 (0.019) 1.052 (0.032) 1.055 (0.025) 1.045 (0.015)
SCV: 1.146 (0.078) 1.114 (0.050) 1.123 (0.035) 1.110 (0.028) 1.095 (0.021) 1.075 (0.020) 1.070 (0.006)
PI: 0.869 (0.047) 0.889 (0.043) 0.944 (0.039) 0.970 (0.029) 0.983 (0.025) 0.995 (0.021) 1.005 (0.014)

Target density C (n = 50, 100, 250, 500, 1000, 2500, 5000):
EMST: 1.218 (0.037) 1.182 (0.015) 1.144 (0.021) 1.120 (0.017) 1.083 (0.012) 1.031 (0.007) 0.987 (0.005)
LSCV: 1.026 (0.184) 0.982 (0.175) 0.949 (0.180) 0.918 (0.130) 0.986 (0.088) 0.992 (0.054) 1.003 (0.057)
BCV: 1.206 (0.143) 1.207 (0.174) 1.230 (0.079) 1.253 (0.060) 1.268 (0.051) 1.289 (0.041) 1.299 (0.030)
SCV: 1.246 (0.075) 1.190 (0.076) 1.133 (0.033) 1.118 (0.038) 1.102 (0.024) 1.088 (0.017) 1.070 (0.006)
PI: 0.968 (0.062) 0.975 (0.069) 0.989 (0.040) 1.007 (0.046) 1.030 (0.028) 1.046 (0.018) 1.058 (0.011)
We can conclude that the MST based bandwidth is computationally much simpler than the well known bandwidth selectors LSCV, SCV, BCV and PI.

Comparison via ĥ/h*
The performance of the proposed EMST bandwidth is compared with the traditional bandwidth selectors via the ratio $\hat h / h^*$ for all the considered bivariate Gaussian data. Tables 3–5 provide the mean and standard deviation of $\hat h / h^*$ over ten independent runs. For the simple normally distributed data sets generated from A and B, the proposed EMST selector performs well, with the mean bandwidth for each sample size being close to the optimal bandwidth. For sample sizes up to 1000, the EMST bandwidth is closer to the optimal bandwidth than those of SCV and PI. For all sample sizes of A and B, the LSCV bandwidth is close to the optimal bandwidth but has large variation. For density C, the performance of EMST is close to that of SCV and better than that of BCV. For all sample sizes of density C, the proposed bandwidth has smaller variation than all the traditional selectors; one can observe that the sample variation of the LSCV bandwidth is almost 10 times larger than that of the EMST bandwidth. In the case of the mixture distributions D and F, for all sample sizes up to 500, EMST's performance is slightly superior to all traditional methods. For density E, the proposed bandwidth is better than the cross validation methods BCV and SCV. For densities H and I with sample sizes up to 1000, the proposed bandwidth performs better than the traditional methods, with the bandwidth being very close to the optimal one. For densities C, G and J, which have correlated structure, EMST performs poorly for smaller sample sizes, but as the sample size increases the EMST bandwidth comes very close to the optimal bandwidth. In most of the cases of Gaussian data, the EMST method is found to achieve performance similar to SCV, and in almost all cases its performance is slightly superior to BCV. Also, for all data sets and sample sizes considered, the proposed bandwidth has smaller sample variability than the existing methods.
Table 4. Comparison of bandwidth selectors on Gaussian data. Each row gives the sample mean (standard deviation) of ĥ/h* for n = 50, 100, 250, 500, 1000, 2500, 5000 (10 replications in each case). LSCV—Least Squares Cross Validation, BCV—Biased Cross Validation, SCV—Smoothed Cross Validation, PI—PlugIn.

Target density D (n = 50, 100, 250, 500, 1000, 2500, 5000):
EMST: 1.058 (0.042) 1.046 (0.022) 1.039 (0.020) 0.989 (0.021) 0.961 (0.009) 0.907 (0.005) 0.887 (0.003)
LSCV: 1.000 (0.167) 1.054 (0.117) 0.917 (0.189) 1.018 (0.077) 1.024 (0.071) 0.989 (0.041) 0.986 (0.071)
BCV: 1.301 (0.089) 1.358 (0.073) 1.423 (0.097) 1.434 (0.061) 1.457 (0.023) 1.516 (0.020) 1.558 (0.015)
SCV: 1.271 (0.060) 1.249 (0.050) 1.198 (0.048) 1.159 (0.035) 1.133 (0.023) 1.093 (0.014) 1.074 (0.012)
PI: 0.972 (0.040) 1.009 (0.042) 1.016 (0.051) 1.023 (0.033) 1.031 (0.024) 1.025 (0.014) 1.024 (0.014)

Target density E (n = 50, 100, 250, 500, 1000, 2500, 5000):
EMST: 1.181 (0.054) 1.179 (0.036) 1.136 (0.014) 1.125 (0.013) 1.082 (0.011) 1.029 (0.005) 0.982 (0.004)
LSCV: 1.031 (0.139) 1.028 (0.186) 0.920 (0.086) 0.980 (0.121) 1.011 (0.061) 0.988 (0.071) 0.993 (0.060)
BCV: 1.343 (0.155) 1.420 (0.071) 1.450 (0.054) 1.509 (0.055) 1.511 (0.032) 1.532 (0.018) 1.541 (0.003)
SCV: 1.309 (0.089) 1.299 (0.079) 1.212 (0.044) 1.203 (0.050) 1.149 (0.033) 1.107 (0.010) 1.098 (0.001)
PI: 0.979 (0.070) 1.026 (0.065) 1.012 (0.030) 1.046 (0.041) 1.032 (0.028) 1.030 (0.012) 1.023 (0.011)

Target density F (n = 50, 100, 250, 500, 1000, 2500, 5000):
EMST: 1.054 (0.050) 1.076 (0.021) 1.017 (0.017) 0.996 (0.012) 0.975 (0.012) 0.931 (0.003) 0.900 (0.001)
LSCV: 1.182 (0.218) 1.019 (0.114) 0.958 (0.123) 0.987 (0.060) 0.978 (0.086) 0.960 (0.061) 1.060 (0.051)
BCV: 1.049 (0.091) 1.102 (0.061) 1.122 (0.043) 1.157 (0.028) 1.191 (0.030) 1.219 (0.013) 1.245 (0.008)
SCV: 1.224 (0.097) 1.174 (0.047) 1.148 (0.048) 1.134 (0.027) 1.144 (0.031) 1.112 (0.015) 1.095 (0.019)
PI: 0.922 (0.092) 0.919 (0.042) 0.944 (0.049) 0.958 (0.023) 1.005 (0.033) 1.003 (0.019) 1.013 (0.010)

Target density G (n = 50, 100, 250, 500, 1000, 2500, 5000):
EMST: 1.487 (0.070) 1.506 (0.020) 1.483 (0.017) 1.444 (0.021) 1.396 (0.008) 1.339 (0.008) 1.285 (0.004)
LSCV: 0.942 (0.120) 0.916 (0.130) 1.015 (0.103) 1.001 (0.119) 0.967 (0.084) 0.993 (0.043) 0.999 (0.045)
BCV: 1.219 (0.109) 1.377 (0.088) 1.458 (0.070) 1.497 (0.066) 1.519 (0.042) 1.522 (0.016) 1.539 (0.006)
SCV: 1.480 (0.105) 1.420 (0.049) 1.341 (0.025) 1.280 (0.040) 1.245 (0.020) 1.194 (0.008) 1.140 (0.001)
PI: 1.151 (0.064) 1.172 (0.041) 1.178 (0.022) 1.165 (0.041) 1.160 (0.019) 1.146 (0.007) 1.121 (0.001)

Table 5. Comparison of bandwidth selectors on Gaussian data. Each row gives the sample mean (standard deviation) of ĥ/h* for n = 50, 100, 250, 500, 1000, 2500, 5000 (10 replications in each case). LSCV—Least Squares Cross Validation, BCV—Biased Cross Validation, SCV—Smoothed Cross Validation, PI—PlugIn.

Target density H (n = 50, 100, 250, 500, 1000, 2500, 5000):
EMST: 1.110 (0.038) 1.102 (0.032) 1.088 (0.018) 1.044 (0.015) 1.007 (0.011) 0.957 (0.005) 0.917 (0.002)
LSCV: 0.950 (0.267) 0.961 (0.138) 1.012 (0.142) 0.937 (0.100) 0.985 (0.075) 0.966 (0.074) 0.964 (0.054)
BCV: 0.991 (0.080) 1.024 (0.075) 1.084 (0.065) 1.105 (0.028) 1.118 (0.024) 1.142 (0.012) 1.156 (0.011)
SCV: 1.143 (0.068) 1.131 (0.056) 1.146 (0.052) 1.120 (0.026) 1.095 (0.028) 1.085 (0.020) 1.080 (0.015)
PI: 0.862 (0.075) 0.884 (0.053) 0.953 (0.049) 0.958 (0.026) 0.974 (0.031) 0.995 (0.025) 0.999 (0.011)

Target density I (n = 50, 100, 250, 500, 1000, 2500, 5000):
EMST: 1.092 (0.064) 1.054 (0.033) 1.026 (0.020) 0.989 (0.009) 0.952 (0.010) 0.903 (0.007) 0.887 (0.003)
LSCV: 0.992 (0.153) 0.924 (0.179) 1.014 (0.124) 0.928 (0.120) 0.983 (0.053) 0.990 (0.073) 1.016 (0.064)
BCV: 0.961 (0.162) 0.967 (0.091) 1.007 (0.045) 1.024 (0.032) 1.036 (0.034) 1.047 (0.017) 1.056 (0.012)
SCV: 1.094 (0.087) 1.068 (0.049) 1.088 (0.042) 1.064 (0.033) 1.064 (0.022) 1.062 (0.016) 1.085 (0.010)
PI: 0.831 (0.078) 0.843 (0.051) 0.927 (0.045) 0.926 (0.050) 0.959 (0.016) 0.992 (0.030) 0.999 (0.025)

Target density J (n = 50, 100, 250, 500, 1000, 2500, 5000):
EMST: 1.379 (0.039) 1.406 (0.037) 1.376 (0.019) 1.350 (0.018) 1.301 (0.011) 1.245 (0.011) 1.199 (0.007)
LSCV: 0.928 (0.098) 0.964 (0.214) 1.003 (0.106) 0.979 (0.078) 0.953 (0.087) 0.983 (0.047) 0.984 (0.037)
BCV: 1.297 (0.106) 1.356 (0.077) 1.420 (0.050) 1.473 (0.045) 1.498 (0.022) 1.546 (0.023) 1.576 (0.013)
SCV: 1.406 (0.097) 1.427 (0.103) 1.308 (0.046) 1.269 (0.028) 1.209 (0.019) 1.182 (0.012) 1.159 (0.008)
PI: 1.046 (0.053) 1.112 (0.070) 1.104 (0.048) 1.112 (0.027) 1.094 (0.018) 1.108 (0.011) 1.088 (0.008)
Bayes classification accuracy
Given a test sample $x = (x_1, \ldots, x_d)'$ to classify, and class labels $c_1, c_2, \ldots, c_k$, the Bayes classifier computes the posterior probability of the sample belonging to each class, which is evaluated as $P(c_j \mid x) \propto P(c_j)\, P(x \mid c_j)$, where
$$P(x \mid c_j) = \frac{1}{n_j h^d} \sum_{X \in \text{class } j} K\big(h^{-1}(x - X)\big), \quad n_j = \text{number of samples in class } j,$$
$P(c_j) = n_j / (\text{number of training samples})$, and $H = \operatorname{diag}(h^2, \ldots, h^2)$ is the bandwidth matrix. The class label $c_j$ with the highest posterior probability is assigned to $x$.
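A minimal sketch of this KDE-based Bayes classifier is given below. The function name, the use of a single scalar bandwidth shared by all classes, and the Gaussian kernel are our assumptions for illustration; the paper does not specify whether the bandwidth is estimated globally or per class.

```python
import numpy as np

def kde_bayes_classify(x, class_samples, h):
    """Assign x to the class maximizing P(c_j) * P(x | c_j), where P(x | c_j) is a
    Gaussian kernel density estimate built from the training samples of class j with
    scalar bandwidth h, and P(c_j) = n_j / (total number of training samples)."""
    n_total = sum(Xj.shape[0] for Xj in class_samples)
    best_class, best_score = None, -np.inf
    for j, Xj in enumerate(class_samples):
        nj, d = Xj.shape
        u = (x - Xj) / h
        K = np.exp(-0.5 * np.sum(u ** 2, axis=1)) / (2.0 * np.pi) ** (d / 2.0)
        score = (nj / n_total) * K.sum() / (nj * h ** d)   # prior * class-conditional KDE
        if score > best_score:
            best_class, best_score = j, score
    return best_class
```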
Table 6. Classification performance of different bandwidth selectors on real-life data (average misclassification rate, Ave.Mrate, with standard deviation in parentheses).
Data set          | EMST Ave.Mrate (SD)       | SCV Ave.Mrate (SD)        | PI Ave.Mrate (SD)         | NR Ave.Mrate (SD)
Adult             | 5.305e−01 (3.544e−04)     | 5.466e−01 (6.941e−03)     | 5.522e−01 (9.019e−03)     | 5.496e−01 (2.338e−03)
Mammography       | 1.480e−02 (8.822e−04)     | 1.406e−02 (1.053e−03)     | 1.428e−02 (1.308e−03)     | 1.495e−02 (9.061e−04)
Banana            | 9.697e−02 (3.082e−03)     | 9.648e−02 (3.761e−03)     | 9.591e−02 (3.844e−03)     | 1.116e−01 (3.102e−03)
BloodTransfusion  | 2.190e−01 (8.713e−03)     | 2.217e−01 (1.343e−02)     | 2.316e−01 (1.157e−02)     | 2.319e−01 (8.716e−03)
Liver             | 3.700e−01 (2.528e−02)     | 3.553e−01 (2.393e−02)     | 3.825e−01 (1.704e−02)     | 3.838e−01 (2.909e−02)
VertebralColumn   | 1.806e−01 (3.455e−02)     | 1.787e−01 (3.072e−02)     | 1.794e−01 (3.218e−02)     | 1.816e−01 (3.851e−02)
Haberman          | 2.611e−01 (1.455e−02)     | 2.704e−01 (1.821e−02)     | 2.815e−01 (2.453e−02)     | 2.657e−01 (1.938e−02)
LSCV—Least Squares Cross Validation, SCV—Smoothed Cross Validation, PI—PlugIn, NR—Normal Reference scale.

Table 7. Formulas for the target densities S–T (the triangular densities $p_1, \ldots, p_4$ are defined below).
S: $S(x, y) = p_1(x)\, p_2(y)$
T: $T(x, y) = \frac{1}{2}\, p_1(x)\, p_2(y) + \frac{1}{2}\, p_3(x)\, p_4(y)$
Table 6 provides the Bayes classification performance of different bandwidth selectors on real-life data. The average and standard deviation of the misclassification rate (denoted by Mrate) over ten independent runs are provided in the table. Along with SCV and PI, the performance of the EMST method is also compared with the Normal Reference (NR) scale method. As we observe from the results, no method is uniformly better and no method is uniformly worse. For all data sets, EMST has better performance than NR. For the Adult, Blood transfusion and Haberman data sets, the EMST scheme is slightly better than the traditional methods SCV and PI. On the other hand, for the Mammography and Liver data sets, the traditional methods perform better than EMST, but the computational time is much less for the proposed scheme. It is also clear from the table that, in most cases, the sample standard deviation of Mrate for the proposed method is smaller than those of the other selectors.

Effect of different kernels with EMST bandwidth
In this section, the effect of the uniform and Gaussian kernels with the EMST bandwidth for kernel density estimation is studied. To compare the effect of these kernels, both synthetic and real-life data sets are considered. Synthetic data generated from Gaussian and triangular distributions are used: the above mentioned Gaussian densities (A–J) are considered, and the formulas for the triangular density functions are given in Table 7. In the case of synthetic data, as the true density is known in each case, the performance of the EMST bandwidth with different kernels can be compared via the accuracy of the resulting kernel density estimator, measured by MISE. The kernel density estimate at any test point $x = (x_1, \ldots, x_d)'$, based on the training observations $X_j = (X_{j1}, \ldots, X_{jd})'$, $j = 1, 2, \ldots, n$, and bandwidth matrix $H = \operatorname{diag}(h^2, \ldots, h^2)$, is given by
$$\hat f_n(x) = \frac{1}{n h^d} \sum_{j=1}^{n} K\big(h^{-1}(x - X_j)\big),$$
where $n$ is the number of training samples and $K$ is the kernel. The mean integrated squared error between the estimated density $\hat f_n$ and the original density $f$ is given by
$$\mathrm{MISE}(\hat f_n) = E \int \big(\hat f_n(x) - f(x)\big)^2\, dx.$$
The theoretical MISE is approximated on test samples by
$$\mathrm{MISE}(\hat f_n) \approx \frac{1}{m} \sum_{i=1}^{m} \big(\hat f_n(x_i) - f(x_i)\big)^2,$$
where $m$ is the number of test samples.
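A minimal sketch of this test-sample approximation is given below; the function name is ours, and the estimated and true densities are assumed to be supplied as callables.

```python
import numpy as np

def approx_mise(f_hat, f_true, test_points):
    """Approximate MISE(f_hat) by the average of (f_hat(x_i) - f(x_i))^2
    over the m test points, as in the formula above."""
    errs = [(f_hat(x) - f_true(x)) ** 2 for x in test_points]
    return np.mean(errs)
```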
The triangular densities used in Table 7 are
$$p_1(x) = \begin{cases} x & \text{if } 0 < x \le 1;\\ 2 - x & \text{if } 1 < x \le 2;\\ 0 & \text{otherwise,}\end{cases} \qquad p_2(x) = \begin{cases} x - 0.5 & \text{if } 0.5 < x \le 1.5;\\ 2.5 - x & \text{if } 1.5 < x \le 2.5;\\ 0 & \text{otherwise,}\end{cases}$$
$$p_3(x) = \begin{cases} x - 1 & \text{if } 1 < x \le 2;\\ 3 - x & \text{if } 2 < x \le 3;\\ 0 & \text{otherwise,}\end{cases} \qquad p_4(x) = \begin{cases} x - 1.5 & \text{if } 1.5 < x \le 2.5;\\ 3.5 - x & \text{if } 2.5 < x \le 3.5;\\ 0 & \text{otherwise.}\end{cases}$$
For all the above mentioned synthetic data, samples of sizes n = 1000 and 5000 have been considered. For each case, the mean integrated squared error between the density estimated using the EMST bandwidth with the specific kernel and the original density has been computed using the above formula. Comparative results (average and standard deviation of MISE over ten independent runs) are given in Table 8.
Table 8. Effect of different kernel shapes with EMST bandwidth on synthetic data (average MISE, Ave.MISE, with standard deviation in parentheses, over 10 independent runs).
Target density | n    | Gaussian Ave.MISE (SD)   | Uniform Ave.MISE (SD)
A              | 1000 | 1.070e−04 (6.24e−05)     | 2.142e−04 (6.27e−05)
A              | 5000 | 3.861e−05 (1.16e−05)     | 8.816e−05 (1.43e−05)
B              | 1000 | 1.272e−04 (2.83e−05)     | 1.966e−04 (3.03e−05)
B              | 5000 | 3.341e−05 (1.36e−05)     | 8.653e−05 (2.11e−05)
C              | 1000 | 1.707e−04 (5.41e−05)     | 2.651e−04 (7.02e−05)
C              | 5000 | 6.435e−05 (2.13e−05)     | 1.264e−04 (2.02e−05)
D              | 1000 | 4.495e−05 (1.95e−05)     | 8.363e−05 (2.24e−05)
D              | 5000 | 1.343e−05 (3.33e−06)     | 3.255e−05 (4.85e−06)
E              | 1000 | 8.344e−05 (3.39e−05)     | 1.296e−04 (2.09e−05)
E              | 5000 | 2.912e−05 (5.66e−06)     | 5.681e−05 (5.16e−06)
F              | 1000 | 6.591e−05 (3.38e−05)     | 1.322e−04 (2.80e−05)
F              | 5000 | 2.736e−05 (8.24e−06)     | 5.654e−05 (8.47e−06)
G              | 1000 | 4.375e−04 (1.03e−04)     | 4.123e−04 (6.80e−05)
G              | 5000 | 2.193e−04 (4.42e−05)     | 1.786e−04 (3.44e−05)
H              | 1000 | 2.731e−04 (4.90e−05)     | 3.221e−04 (6.24e−05)
H              | 5000 | 1.711e−04 (2.01e−05)     | 1.931e−04 (1.75e−05)
I              | 1000 | 1.072e−04 (4.13e−05)     | 2.367e−04 (5.51e−05)
I              | 5000 | 4.194e−05 (1.83e−05)     | 1.025e−04 (1.71e−05)
J              | 1000 | 6.300e−04 (1.05e−04)     | 5.376e−04 (8.76e−05)
J              | 5000 | 4.543e−04 (7.36e−05)     | 3.472e−04 (6.46e−05)
S              | 1000 | 4.001e−03 (1.66e−03)     | 3.961e−03 (1.16e−03)
S              | 5000 | 1.466e−03 (2.88e−04)     | 1.469e−03 (2.22e−04)
T              | 1000 | 2.045e−03 (6.15e−04)     | 2.101e−03 (4.61e−04)
T              | 5000 | 7.752e−04 (2.77e−04)     | 1.013e−03 (2.62e−04)
Table 9. Effect of different kernel shapes with EMST bandwidth on real-life data (average misclassification rate, Ave.Mrate, with standard deviation in parentheses).
Data set          | Gaussian Ave.Mrate (SD)   | Uniform Ave.Mrate (SD)
Adult             | 5.307e−01 (1.875e−04)     | 5.347e−01 (1.580e−04)
Mammography       | 1.659e−02 (8.832e−04)     | 1.938e−02 (1.491e−03)
Banana            | 9.803e−02 (2.460e−03)     | 9.846e−02 (2.607e−03)
Blood Transfusion | 2.315e−01 (9.193e−03)     | 2.451e−01 (9.266e−03)
Liver             | 3.756e−01 (2.486e−02)     | 4.550e−01 (2.863e−02)
VertebralColumn   | 1.815e−01 (2.151e−02)     | 1.948e−01 (1.768e−02)
Haberman          | 2.647e−01 (1.268e−02)     | 2.673e−01 (1.478e−02)
For almost all of the synthetic data sets considered, the EMST bandwidth with the Gaussian kernel provides better performance than with the uniform kernel. For data generated from C, E, F, I and T, the error values for the EMST density estimator with the Gaussian kernel are much smaller than those for the EMST density estimator with the uniform kernel. It is also clear from the table that the sample standard deviation of MISE using the Gaussian kernel is smaller than that using the uniform kernel. In the case of real-life data, the classification performances of EMST with the uniform and Gaussian kernels, i.e., the average and standard deviation of Mrate, are given in Table 9. For all the data sets considered, the EMST bandwidth with the Gaussian kernel provides better performance than with the uniform kernel.

4. Discussion and conclusions

A Euclidean minimal spanning tree based bandwidth has been considered for multivariate kernel density estimation. The key idea is that the inter-point distances of the data should affect the selection of the bandwidth. Unlike the traditional approaches, which are based on explicitly optimizing MSE, MISE or AMISE, here the bandwidth is constructed using the EMST of the given samples. The absence of an error criterion to optimize, and the dependence on the minimal spanning tree, lead to a low computational cost for the bandwidth selection. It has been established that the resulting density estimator is an asymptotically unbiased estimator of the original density at every continuity point, under the mild assumption of compact support. The theoretical results are provided for a general kernel, i.e., not restricted to a specific kernel such as the Gaussian or the uniform. It is demonstrated through numerical experiments that the EMST based method requires less CPU time than the other schemes. The experiments also show that the EMST bandwidth is close to the optimal bandwidth and that the resulting estimator is comparable to the existing cross validation and plug-in estimators in terms of Bayes classification error. All of this makes it suitable for a wide variety of machine learning applications involving density estimation on large data sets. The effect of different kernel shapes with the EMST bandwidth has also been studied, and the results show that the Gaussian kernel with the EMST bandwidth performs better than the uniform kernel.
Extension of this method to density functions with unbounded support is under investigation. In this work, the theoretical results have been proved for a bandwidth which is a simple function of the EMST length; other, more sophisticated functions of the EMST length could also be used. Also, a single bandwidth has been used for all directions; a natural extension, to different bandwidths along different dimensions or to full bandwidth matrices, may also be investigated.

Acknowledgments

A part of this work was done at the Center for Soft Computing Research (CSCR), ISI, Kolkata, Project No: (IR/S3/ENC01/2002). The authors would like to thank Prof. Probal Chaudhuri, Theoretical Statistics and Mathematics Unit, Indian Statistical Institute, for his valuable comments. The authors would also like to thank Dr Tarn Duong, University of Paris 6, Paris, France, for sharing the 'R' codes related to bandwidth selection.

References

Ahmad, I.A., Ran, I.S., 2004. Kernel contrasts: a data-based method of choosing smoothing parameters in nonparametric density estimation. J. Nonparametr. Stat. 16 (5), 671–707.
Bache, K., Lichman, M., 2013. UCI Machine Learning Repository. URL: http://archive.ics.uci.edu/ml.
Bowman, A.W., 1984. An alternative method of cross-validation for the smoothing of density estimates. Biometrika 71 (2), 353–360.
Breiman, L., Meisel, W., Purcell, E., 1977. Variable kernel estimates of multivariate densities. Technometrics 19 (2), 135–144.
Brown, G., Pocock, A., Zhao, M.-J., Luján, M., 2012. Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. J. Mach. Learn. Res. 13 (1), 27–66.
Brox, T., Rosenhahn, B., Cremers, D., Seidel, H.-P., 2007. Nonparametric density estimation with adaptive, anisotropic kernels for human motion tracking. In: Human Motion–Understanding, Modeling, Capture and Animation. Springer, pp. 152–165.
Byrd, R.H., Schnabel, R.B., Shultz, G.A., 1988. Parallel quasi-Newton methods for unconstrained optimization. Math. Program. 42 (1–3), 273–306.
Cacoullos, T., 1966. Estimation of a multivariate density. Ann. Inst. Statist. Math. 18 (1), 179–189.
Chacón, J., Duong, T., 2010. Multivariate plug-in bandwidth selection with unconstrained pilot bandwidth matrices. TEST 19 (2), 375–398.
Chaudhuri, D., Chaudhuri, B.B., Murthy, C., 1996. A data driven procedure for density estimation with some applications. Pattern Recognit. 29 (10), 1719–1736.
Cheng, T., Gao, J., Zhang, X., 2014. Semiparametric localized bandwidth selection in kernel density estimation. Available at SSRN 2435478.
Deng, Z., Chung, F.-L., Wang, S., 2008. FRSDE: Fast reduced set density estimator using minimal enclosing ball approximation. Pattern Recognit. 41 (4), 1363–1372.
Duda, R.O., Hart, P.E., Stork, D.G., 1999. Pattern Classification. John Wiley & Sons.
Duong, T., 2014. ks: Kernel smoothing, R package version 1.9.2. URL: http://CRAN.R-project.org/package=ks.
Gill, P.E., Murray, W., Wright, M.H., 1981. Practical Optimization. Academic Press Inc.
Girolami, M., He, C., 2003. Probability density estimation from optimally condensed data samples. IEEE Trans. Pattern Anal. Mach. Intell. 25 (10), 1253–1264.
Golyandina, N., Pepelyshev, A., Steland, A., 2012. New approaches to nonparametric density estimation and selection of smoothing parameters. Comput. Statist. Data Anal. 56 (7), 2206–2218.
Hall, P., 1992. On global properties of variable bandwidth density estimators. Ann. Statist. 762–778.
Hall, P., Marron, J., Park, B.U., 1992. Smoothed cross-validation. Probab. Theory Related Fields 92 (1), 1–20.
Heidenreich, N.B., Schindler, A., Sperlich, S., 2013. Bandwidth selection for kernel density estimation: a review of fully automatic selectors. Adv. Stat. Anal. 97 (4), 403–433.
Hinneburg, A., Gabriel, H.-H., 2007. Denclue 2.0: Fast clustering based on kernel density estimation. In: Advances in Intelligent Data Analysis VII. Springer, pp. 70–80.
Ji, P., Zhao, N., Hao, S., Jiang, J., 2014. Automatic image annotation by semi-supervised manifold kernel density estimation. Inform. Sci. 281, 648–660.
Jones, M.J., Rehg, J.M., 2002. Statistical color models with application to skin detection. Int. J. Comput. Vis. 46 (1), 81–96.
Kim, J., Scott, C.D., 2010. L2 kernel classification. IEEE Trans. Pattern Anal. Mach. Intell. 32 (10), 1822–1831.
Li, J., Ray, S., Lindsay, B.G., 2007. A nonparametric statistical approach to clustering via mode identification. J. Mach. Learn. Res. 8 (8), 1687–1723.
Liao, J., Wu, Y., Lin, Y., 2010. Improving Sheather and Jones bandwidth selector for difficult densities in kernel density estimation. J. Nonparametr. Stat. 22 (1), 105–114.
Liu, Z., Shen, L., Han, Z., Zhang, Z., 2007. A novel video object tracking approach based on kernel density estimation and Markov random field. In: Image Processing, IEEE International Conference on, vol. 3. IEEE.
Loftsgaarden, D.O., Quesenberry, C.P., et al., 1965. A nonparametric estimate of a multivariate density function. Ann. Math. Statist. 36 (3), 1049–1051.
Mahapatruni, R., Gray, A.G., 2011. CAKE: Convex adaptive kernel density estimation. In: International Conference on Artificial Intelligence and Statistics, pp. 498–506.
Mammen, E., Miranda, M.D.M., Nielsen, J.P., Sperlich, S., 2012. A comparative study of new cross-validated bandwidth selectors for kernel density estimation. ArXiv Preprint. arXiv:1209.4495.
March, W.B., Ram, P., Gray, A.G., 2010. Fast Euclidean minimum spanning tree: algorithm, analysis, and applications. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pp. 603–612.
Menardi, G., Azzalini, A., 2014. An advancement in clustering via nonparametric density estimation. Stat. Comput. 24 (5), 753–767.
Parzen, E., 1962. On estimation of a probability density function and mode. Ann. Math. Statist. 1065–1076.
Preparata, F.P., Shamos, M.I., 1985. Computational Geometry: An Introduction. Springer-Verlag.
Ramoni, M., Sebastiani, P., 2001. Robust Bayes classifiers. Artificial Intelligence 125 (1), 209–226.
Raykar, V.C., Duraiswami, R., Zhao, L.H., 2010. Fast computation of kernel estimators. J. Comput. Graph. Statist. 19 (1), 205–220.
Savchuk, O.Y., Hart, J.D., Sheather, S.J., 2010. Indirect cross-validation for density estimation. J. Amer. Statist. Assoc. 105 (489), 415–423.
Scott, D.W., 2009. Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley & Sons.
Scott, D.W., Sheather, S.J., 1985. Kernel density estimation with binned data. Comm. Statist. Theory Methods 14 (6), 1353–1359.
Scott, D.W., Terrell, G.R., 1987. Biased and unbiased cross-validation in density estimation. J. Amer. Statist. Assoc. 82 (400), 1131–1146.
Shamos, M.I., Hoey, D., 1975. Closest-point problems. In: 16th Annual Symposium on Foundations of Computer Science, 1975. IEEE, pp. 151–162.
Sheather, S.J., Jones, M.C., 1991. A reliable data-based bandwidth selection method for kernel density estimation. J. R. Stat. Soc. Ser. B Methodol. 683–690.
Silverman, B.W., 1986. Density Estimation for Statistics and Data Analysis. CRC Press.
Stover, J.H., Ulm, M.C., 2013. Hyperparameter estimation and plug-in kernel density estimates for maximum a posteriori land-cover classification with multiband satellite data. Comput. Statist. Data Anal. 57 (1), 82–94.
Tang, Y., Browne, R.P., McNicholas, P.D., 2015. Model based clustering of high-dimensional binary data. Comput. Statist. Data Anal.
Terrell, G.R., Scott, D.W., 1992. Variable kernel density estimation. Ann. Statist. 1236–1265.
Wand, M., Jones, M., 1993. Comparison of smoothing parameterizations in bivariate kernel density estimation. J. Amer. Statist. Assoc. 88 (422), 520–528.
Wand, M.P., Jones, M.C., 1994. Kernel Smoothing. CRC Press.
Wang, S., Wang, J., Chung, F., 2014. Kernel density estimation, kernel methods, and fast learning in large data sets. IEEE Trans. Cybern. 44 (1), 1–20.
Yang, C., Duraiswami, R., Gumerov, N.A., Davis, L., 2003. Improved fast gauss transform and efficient kernel density estimation. In: Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on, pp. 664–671.
Zhang, X., King, M.L., Hyndman, R.J., 2006. A Bayesian approach to bandwidth selection for multivariate kernel density estimation. Comput. Statist. Data Anal. 50 (11), 3009–3031.
Zougab, N., Adjabi, S., Kokonendji, C.C., 2014. Bayesian estimation of adaptive bandwidth matrices in multivariate kernel density estimation. Comput. Statist. Data Anal. 75, 28–38.