IEEE TRANSACTIONS ON CYBERNETICS, VOL. 44, NO. 1, JANUARY 2014
Kernel Density Estimation, Kernel Methods, and Fast Learning in Large Data Sets

Shitong Wang, Jun Wang, and Fu-lai Chung
Abstract—Kernel methods such as the standard support vector machine and support vector regression trainings take O(N^3) time and O(N^2) space in their naive implementations, where N is the training set size. It is thus computationally infeasible to apply them to large data sets, and a replacement of the naive method for finding the quadratic programming (QP) solutions is highly desirable. By observing that many kernel methods can be linked up with kernel density estimate (KDE), which can be efficiently implemented by some approximation techniques, a new learning method called fast KDE (FastKDE) is proposed to scale up kernel methods. It is based on establishing a connection between KDE and the QP problems formulated for kernel methods using an entropy-based integrated-squared-error criterion. As a result, FastKDE approximation methods can be applied to solve these QP problems. In this paper, the latest advance in fast data reduction via KDE is exploited. With just a simple sampling strategy, the resulting FastKDE method can be used to scale up various kernel methods with a theoretical guarantee that their performance does not degrade a lot. It has a time complexity of O(m^3), where m is the number of data points sampled from the training set. Experiments on different benchmarking data sets demonstrate that the proposed method has performance comparable with the state-of-the-art method and that it is effective for a wide range of kernel methods to achieve fast learning in large data sets.

Index Terms—Kernel density estimate (KDE), kernel methods, quadratic programming (QP), sampling, support vector machine (SVM).
Manuscript received July 13, 2010; revised September 12, 2011 and November 5, 2012; accepted December 8, 2012. Date of publication June 18, 2013; date of current version December 12, 2013. This work was supported in part by The Hong Kong Polytechnic University under Grant 1-ZV5V, by the National Natural Science Foundation of China under Grants 60903100, 61272210, and 61170122, by the Fundamental Research Funds for the Central Universities under Grants JUSRP111A38 and JUSRP21128, and by the Natural Science Foundation of Jiangsu Province under Grants BK2009067 and BK2011417. This paper was recommended by Editor H. Kargupta.

S. Wang is with the School of Digital Media, Jiangnan University, Wuxi 214122, China. He is also with the Department of Computing, The Hong Kong Polytechnic University, Kowloon, Hong Kong, and also with the National Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing 100080, China (e-mail: [email protected]).
J. Wang is with the School of Digital Media, Jiangnan University, Wuxi 214122, China (e-mail: [email protected]).
F. Chung is with the Department of Computing, The Hong Kong Polytechnic University, Kowloon, Hong Kong (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TSMCB.2012.2236828

I. INTRODUCTION

DUE TO THEIR attractive performance in classification and regression tasks, kernel methods like support vector machine (SVM), support vector regression (SVR), and support vector data description (SVDD) have been extensively studied
in machine learning and pattern recognition [1]. These methods are often formulated as quadratic programming (QP) problems, with the distinguished advantage of not suffering from the local minimum problem. However, given N training patterns (or data points), a naive implementation of the QP solver takes O(N^3) training time and at least O(N^2) space. Therefore, a major issue in applying kernel methods to large data sets is how to scale up these QP solvers. In recent years, many attempts have been made to address this important issue [30]-[35]. A popular approach is to replace the kernel matrix by low-rank approximations. These approximations can be obtained by methods such as the Nystrom method [2], greedy approximation [3], sampling [4], or matrix decompositions [5]. Another approach to scale up kernel methods is by chunking [6] or more sophisticated decomposition methods [7]-[9] like the well-known sequential minimal optimization (SMO) algorithm [8], [10] and the reduced SVM [11]. The latest advance in scaling up kernel methods is perhaps the generalized core vector machine (CVM) [12], [13], [16], [29]. The generalized CVM utilizes an approximation algorithm for the center-constrained minimum enclosing ball (MEB) problem in computational geometry to achieve an asymptotic time complexity that is linear in N and a space complexity that is independent of N. It is based on the idea of the so-called core sets, and its distinctive time and space complexities have been demonstrated theoretically and experimentally in [13]. The generalized CVM is suitable for solving the QP problem in the following form:

\max_\alpha \; \alpha^T(\mathrm{diag}(K) + \Delta) - \alpha^T K \alpha    (1)
where Δ ≥ 0 is a vector. However, its applicability still depends critically on the following requirement being satisfied:

\alpha^T \mathbf{1} = 1, \quad \alpha \ge 0.    (2)
We should note that there are still some kernel methods that violate this requirement and thus cannot be handled by the generalized CVM. For example, the total margin SVM (TM-SVM) [14] takes the QP form in (1) but has to satisfy the following constraint instead:

\alpha^T \mathbf{1} = 1, \quad \mu \le \alpha \le \lambda, \quad \mu, \lambda \in [0, 1].    (3)
In this paper, a study of how to scale up the solver of the aforementioned QP form with the constraint in (3) instead of that in (2) is presented. Recently, a fast data reduction method via kernel density estimate (KDE) approximation [15] was proposed. Despite its simplicity, a surprising property is revealed
that the expected value of the integrated squared error (ISE) between the KDE of the original data set and that of the reduced data set depends only on the kernel bandwidth h and the size m of the reduced data set after sampling. According to this property, fast data reduction can be achieved via KDE approximation, i.e., by sampling the KDE, and the corresponding time complexity can be only O(m). Motivated by this important property, we attempt to link up KDE with kernel methods. By introducing an entropy-based ISE criterion, many kernel methods can be formulated as corresponding KDE approximation problems. Such an equivalence facilitates a new perspective of kernel methods and, more importantly, provides a solid foundation for making use of the fast data reduction method via KDE approximation to scale up kernel methods.

Thus, a novel fast learning strategy called fast KDE (FastKDE) for kernel methods is proposed here. It can be used to achieve fast learning in large data sets. It applies the KDE approximation to the kernelized feature space formed by the aforementioned QP problem in (1) to obtain a reduced data set and solves the problem subject to the constraint in (3) using the reduced data set. As a result, for a large data set, the complexity of solving such a QP problem can be substantially decreased from O(N^3) to O(m^3) without degrading the performance a lot. Our work is differentiated from that of the generalized CVM in the sense that a significantly larger repertoire of kernel methods can be cast in the aforementioned QP form with the constraint in (3) rather than that in (2). Furthermore, it is completely different from the core sets used in [12] and [13], as it makes use of a reduced representation of the KDE to achieve fast learning in large data sets.

To make this paper easier to follow, we outline the logical flow of this study as follows.
1) By defining the entropy-based ISE criterion, we theoretically reveal that typical kernel methods with Gaussian and non-Gaussian kernels can be cast as KDE approximation problems. This formulation is instrumental in making kernel methods scalable to large data sets.
2) We theoretically prove that the KDE of a data set can be approximately estimated by a sampling strategy and that the upper bound of the resulting entropy-based ISE is independent of the data set size. This theoretical result helps us build a simple but effective fast learning algorithm, called FastKDE, for different kernel methods, making them suitable for large data set applications.

The rest of this paper is organized as follows. In Section II, an interpretation of kernel methods from a KDE perspective is presented. We consider the cases of Gaussian and non-Gaussian kernels to build a relationship between the aforementioned QP problems and KDE approximation. Based on this relationship, a novel fast learning strategy, i.e., FastKDE, for kernel methods is proposed in Section III. It is based on a simple sampling strategy such that the sampled data points estimate the true density as closely as possible; the theoretical basis is also discussed. In Section IV, the results of several experiments are reported to demonstrate the effectiveness of the proposed technique. Finally, conclusions are drawn in Section V.
II. NEW PERSPECTIVE OF KERNEL METHODS: KERNEL DENSITY ESTIMATION

In this section, kernel density estimation and kernel methods are linked to each other, and an interpretation of the kernel methods from a KDE perspective is presented. We consider the cases of Gaussian and non-Gaussian kernels and derive the corresponding relationships.

A. Entropy-Based ISE Criterion for Kernel Density Estimation

Given a data set S = {x_1, ..., x_N} ⊂ R^d, the general form of the kernel density estimator can be represented as

\hat q(x; \alpha) = \sum_{i=1}^{N} \alpha_i k_\sigma(x, x_i)    (4)

where k_σ(·) is the kernel function with bandwidth σ and α = {α_1, ..., α_N} are variable weights subject to the constraint \sum_{i=1}^N \alpha_i = 1 and α_i ≥ 0, i = 1, ..., N. It is a nontrivial task to estimate the underlying density distribution of the data set. However, one may approximate it by the Parzen window estimator

\hat p(x) = \frac{1}{N}\sum_{i=1}^{N} k_\sigma(x, x_i).    (5)

The L2 distance, or integrated squared error (ISE), is often used to decide which data points are more important than others by choosing the α_i such that a larger α_i indicates a more important point. The ISE between \hat p(x) and \hat q(x; α) can be computed as

ISE(\alpha) = \int (\hat p(x) - \hat q(x; \alpha))^2 dx = \int \hat p^2(x)\,dx - 2\int \hat p(x)\hat q(x;\alpha)\,dx + \int \hat q^2(x;\alpha)\,dx.    (6)

Substituting (4) and (5) into (6), we have

ISE(\alpha) = \frac{1}{N^2}\int\Big(\sum_{i=1}^N k_\sigma(x, x_i)\Big)^2 dx - \frac{2}{N}\int \sum_{i=1}^N k_\sigma(x, x_i)\sum_{j=1}^N \alpha_j k_\sigma(x, x_j)\,dx + \int\Big(\sum_{i=1}^N \alpha_i k_\sigma(x, x_i)\Big)^2 dx

= \frac{1}{N^2}\int\Big(\sum_{i=1}^N k_\sigma(x, x_i)\Big)^2 dx - \frac{2}{N}\sum_{i=1}^N\sum_{j=1}^N \alpha_j \int k_\sigma(x, x_i)k_\sigma(x, x_j)\,dx + \sum_{i=1}^N\sum_{j=1}^N \alpha_i\alpha_j \int k_\sigma(x, x_i)k_\sigma(x, x_j)\,dx.    (7)
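To make (4)-(6) concrete, the following is a minimal one-dimensional sketch (our own illustration, not code from the paper) that evaluates the Parzen estimate, the weighted estimator, and their ISE by numerical integration on a grid; all function names are ours.

```python
import numpy as np

def gauss_kernel(x, xi, sigma):
    return np.exp(-(x - xi) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def p_hat(grid, data, sigma):                      # Parzen window estimate (5)
    return np.mean([gauss_kernel(grid, xi, sigma) for xi in data], axis=0)

def q_hat(grid, data, alpha, sigma):               # weighted KDE (4)
    return sum(a * gauss_kernel(grid, xi, sigma) for a, xi in zip(alpha, data))

def ise(data, alpha, sigma, grid):                 # ISE (6) via the trapezoid rule
    diff = p_hat(grid, data, sigma) - q_hat(grid, data, alpha, sigma)
    return np.trapz(diff ** 2, grid)

rng = np.random.default_rng(0)
data = rng.normal(size=50)
grid = np.linspace(-5.0, 5.0, 2001)
alpha = np.full(50, 1.0 / 50)                      # uniform weights reproduce p_hat, so ISE is ~0
print(ise(data, alpha, 0.5, grid))
```

With uniform weights the two estimators coincide and the ISE is essentially zero; concentrating the weight on a few points raises it, which is exactly the quantity the criterion below trades off against the entropy term.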
On the other hand, the quadratic entropy of a probability density function q(x) is defined as

V = 1 - \int q^2(x)\,dx.    (8)

Substituting (4) into (8), we get the following nonparametric estimator for the quadratic entropy:

\hat V(S;\alpha) = 1 - \int \hat q^2(x;\alpha)\,dx = 1 - \int\Big(\sum_{i=1}^N \alpha_i k_\sigma(x, x_i)\Big)^2 dx = 1 - \sum_{i=1}^N\sum_{j=1}^N \alpha_i\alpha_j \int k_\sigma(x, x_i)k_\sigma(x, x_j)\,dx.    (9)

From the information theoretic point of view, the more regular the set S is, the smaller the value of V will be. Our aim is to find an appropriate α = {α_1, ..., α_N} so that \hat p(x) and \hat q(x; α) are as close as possible and \hat V(S; α) is as small as possible under the constraints α^T 1 = 1 and α ≥ 0. Thus, the following entropy-based ISE criterion is proposed:

J(\alpha) = ISE(\alpha) + \lambda \hat V(S;\alpha)    (10)

where λ > 0, λ ≠ 1 is a parameter used to control the tradeoff between the ISE and the quadratic entropy. Thus, kernel density estimation can be described as the following optimization problem:

\alpha = \arg\min J(\alpha) = \arg\min\; ISE(\alpha) + \lambda\hat V(S;\alpha) \quad \text{s.t. } \alpha^T\mathbf{1} = 1,\; \alpha \ge 0.    (11)

B. Estimation With Gaussian Kernels

Now, let us consider the case that the kernel is of Gaussian type. According to Deng et al. [16], the following relationship holds:

\int k_\sigma(x, x_i)k_\sigma(x, x_j)\,dx = k_{\sqrt{2}\sigma}(x_i, x_j).    (12)

Thus, the ISE in (7) becomes

ISE(\alpha) = \int(\hat p(x) - \hat q(x;\alpha))^2 dx = \frac{1}{N^2}\sum_{i=1}^N\sum_{j=1}^N k_{\sqrt{2}\sigma}(x_i, x_j) - \frac{2}{N}\sum_{i=1}^N\sum_{j=1}^N \alpha_i k_{\sqrt{2}\sigma}(x_i, x_j) + \sum_{i=1}^N\sum_{j=1}^N \alpha_i\alpha_j k_{\sqrt{2}\sigma}(x_i, x_j).    (13a)

Similarly, the estimator in (9) can be rewritten as

\hat V(S;\alpha) = 1 - \int \hat q^2(x;\alpha)\,dx = 1 - \sum_{i=1}^N\sum_{j=1}^N \alpha_i\alpha_j k_{\sqrt{2}\sigma}(x_i, x_j).    (13b)

Substituting (13) into the optimization problem in (11) and after a series of simplifications, we have

\alpha = \arg\min\; ISE(\alpha) + \lambda\hat V(S;\alpha)
= \arg\min\Big(\frac{1}{N^2}\sum_{i=1}^N\sum_{j=1}^N k_{\sqrt{2}\sigma}(x_i, x_j) - \frac{2}{N}\sum_{i=1}^N\sum_{j=1}^N \alpha_i k_{\sqrt{2}\sigma}(x_i, x_j) + \sum_{i=1}^N\sum_{j=1}^N \alpha_i\alpha_j k_{\sqrt{2}\sigma}(x_i, x_j) + \lambda\Big(1 - \sum_{i=1}^N\sum_{j=1}^N \alpha_i\alpha_j k_{\sqrt{2}\sigma}(x_i, x_j)\Big)\Big)
= \arg\min\Big(\sum_{i=1}^N\sum_{j=1}^N \alpha_i\alpha_j k_{\sqrt{2}\sigma}(x_i, x_j) - \frac{2}{(1-\lambda)N}\sum_{i=1}^N\sum_{j=1}^N \alpha_i k_{\sqrt{2}\sigma}(x_i, x_j)\Big).    (14)

The term (1/N^2)\sum_{i=1}^N\sum_{j=1}^N k_{\sqrt{2}\sigma}(x_i, x_j) can be omitted as it is a constant independent of α.

Let us discuss (14) in a way similar to [17]. Assuming that X and Y are two independent random variables, the following relationship holds:

f_{X,Y}(x, y) = f_X(x) \cdot f_Y(y)    (15)

where f_{X,Y}(x, y) is the joint probability density function, f_X(x) is the probability density function of X, and f_Y(y) is the probability density function of Y. Then, one may use kernel density estimation to approximate f_{X,Y}(x, y), f_X(x), and f_Y(y) as follows:

\hat f_{X,Y}(x, y) = \sum_{i=1}^N \alpha_i k_\sigma(x, x_i)k_\sigma(y, x_i)    (16a)

\hat f_X(x) = \frac{1}{N}\sum_{i=1}^N k_\sigma(x, x_i)    (16b)

\hat f_Y(y) = \sum_{i=1}^N \alpha_i k_\sigma(y, x_i).    (16c)

Substituting (16) into (15) and making use of (12), we have

\sum_{i=1}^N \alpha_i k_\sigma(x, x_i)k_\sigma(y, x_i) = \frac{1}{N}\sum_{i=1}^N k_\sigma(x, x_i)\cdot\sum_{i=1}^N \alpha_i k_\sigma(y, x_i).    (17)

Similar to [17], by integrating (17) along the line x = y and using (12), we obtain the following equation:

\sum_{i=1}^N \alpha_i k_{\sqrt{2}\sigma}(x_i, x_i) \approx \frac{1}{N}\sum_{i=1}^N\sum_{j=1}^N \alpha_i k_{\sqrt{2}\sigma}(x_i, x_j).    (18)
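The Gaussian convolution identity (12), on which the derivation above rests, can be checked numerically. The short sketch below is ours, with illustrative names, and simply compares the two sides of (12) in one dimension.

```python
import numpy as np

def k(x, c, sigma):                       # normalized one-dimensional Gaussian kernel
    return np.exp(-(x - c) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

sigma, xi, xj = 0.7, -0.3, 1.1
grid = np.linspace(-20.0, 20.0, 200_001)
lhs = np.trapz(k(grid, xi, sigma) * k(grid, xj, sigma), grid)   # left side of (12)
rhs = k(xi, xj, np.sqrt(2) * sigma)                             # right side of (12)
print(lhs, rhs)                           # both are about 0.148, matching (12)
```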
Substituting it into (14), we finally have

\alpha = \arg\min\; ISE(\alpha) + \lambda\hat V(S;\alpha)
\approx \arg\min\Big(\sum_{i=1}^N\sum_{j=1}^N \alpha_i\alpha_j k_{\sqrt{2}\sigma}(x_i, x_j) - \frac{2}{1-\lambda}\sum_{i=1}^N \alpha_i k_{\sqrt{2}\sigma}(x_i, x_i)\Big)
= \arg\min\Big(\alpha^T K\alpha - \frac{2}{1-\lambda}\alpha^T\mathrm{diag}(K)\Big)    (19)

where K is a Gaussian kernel matrix with bandwidth \sqrt{2}\sigma and 2\alpha^T\mathrm{diag}(K)/(1-\lambda) is a constant (for a Gaussian kernel, the diagonal entries of K are all equal, so this term is fixed by the constraint α^T 1 = 1). With such derivations, we conclude that the right side of (19) essentially amounts to finding the KDE of the given data set with the criterion ISE(α) + λ\hat V(S;α).

C. Estimation With Generic Kernels

When the kernel is no longer of Gaussian type, the relationship (12) does not hold anymore. Fortunately, the KDE with non-Gaussian kernels can still be derived. With the details of the derivation presented in Appendix A, we can formulate the corresponding optimization problem as

\alpha = \arg\min J(\alpha) = \arg\min\Big(\alpha^T K\alpha - \frac{2}{1-\lambda}\alpha^T\mathrm{diag}(K)\Big)

where K is an N x N matrix built from the kernel function h. Thus

\alpha = \arg\min\big(\alpha^T K\alpha - \alpha^T(\mathrm{diag}(K) + \Delta)\big)    (20)

where the vector Δ = (1 + λ)diag(K)/(1 - λ). Note here that K can be any positive definite kernel matrix, and we do not require that each h(x_i, x_i) be the same constant. In other words, any positive definite kernel can be adopted in (20).

As in the previous section, the QP problem in (20) can now be viewed from the KDE perspective. In other words, from the KDE perspective, we can explain the optimization problem in the form of (20) as follows. Given the data set S = {x_1, ..., x_N} ⊂ R^d, with the introduced entropy-based ISE criterion, the goal of the optimization problem is to find the proper α = {α_1, ..., α_N} such that \hat q(x; α) is as close as possible to \hat p(x). It should also be emphasized that, in terms of KDE, the constraint \sum_{i=1}^N \alpha_i = 1 and α_i ≥ 0 can be imposed more strictly. For example, we can set the constraint \sum_{i=1}^N \alpha_i = 1 and μ < α_i < ν, μ, ν ∈ [0, 1]. In the next section, we will see that such a constraint helps explain more kernel method variants from the KDE perspective.

D. Casting Kernel Methods as Kernel Density Estimation

In the past decade, kernel methods have been highly successful in various machine learning and pattern recognition problems. In essence, kernel methods can be categorized into three types in terms of their applicability, i.e., for one-class classification or novelty detection, for binary or multiclass classification, and for regression. Generally speaking, they can often be formulated as corresponding QP problems. In this section, some representative kernel methods are selected to show that their corresponding QP problems can be cast as the KDE derived in (20).

1) SVDD: As a one-class SVM variant, the SVDD model [18] has been extensively used for novelty detection problems, among which the hard-margin SVDD has been fully studied in [18]. The hard-margin SVDD can be treated equivalently as a MEB problem as follows. Given a data set S = {x_1, ..., x_N} ⊂ R^d, the MEB of S, denoted as MEB(S), is the smallest ball that contains all data points in S. The MEB problem can be formulated as the following constrained optimization problem:

\min_{c,R}\; R^2 \quad \text{s.t. } \|c - \varphi(x_i)\|^2 \le R^2, \quad \forall x_i \in S    (21)

where φ denotes the feature map associated with a given kernel k, and c and R are the center and radius of the MEB in the kernel-induced feature space. Its dual can be formulated as the following QP problem [18]:

\max_\alpha\; \alpha^T\mathrm{diag}(K) - \alpha^T K\alpha \quad \text{s.t. } \alpha^T\mathbf{1} = 1,\; \alpha \ge 0    (22)

where α = [α_1, ..., α_N]^T is the vector of Lagrange multipliers, 0 = [0, ..., 0]^T and 1 = [1, ..., 1]^T are N-dimensional zero and unit vectors, respectively, and K is the corresponding N x N kernel matrix. By comparing (20) with (22), it can be seen that they share the same objective and constraints. Thus, both the hard-margin SVDD model and the corresponding MEB problem can be viewed from the KDE perspective.

2) L2-SVR: L2-SVR in [19] has been extensively used for function regression tasks. Its idea can be stated as follows. Given a data set {z_i = (x_i, y_i)}_{i=1}^N, with input x_i ∈ R^d and y_i ∈ R, L2-SVR attempts to construct a linear function

f(x) = w^T\varphi(x) + b

in the kernel-induced feature space such that it deviates least from the training data. If we adopt the following ε-insensitive loss function:

|y - f(x)|_\varepsilon = \begin{cases} 0 & \text{if } |y - f(x)| \le \varepsilon \\ |y - f(x)| - \varepsilon & \text{otherwise} \end{cases}    (23)

then the primal of L2-SVR can be formulated as

\min\; \|w\|^2 + b^2 + \frac{C}{\mu N}\sum_{i=1}^N\big(\xi_i^2 + \xi_i^{*2}\big) + 2C\varepsilon
\text{s.t. } y_i - (w^T\varphi(x_i) + b) \le \varepsilon + \xi_i, \quad (w^T\varphi(x_i) + b) - y_i \le \varepsilon + \xi_i^* \quad \text{for } i = 1, \ldots, N.    (24)
The parameter μ > 0 is used to control the size of ε. With the Lagrange multiplier method, its dual problem can be formulated as

\max_{[\lambda^T\,\lambda^{*T}]}\; \frac{2}{C}[\lambda^T\;\lambda^{*T}]\begin{bmatrix} y \\ -y \end{bmatrix} - [\lambda^T\;\lambda^{*T}]\,\tilde K\begin{bmatrix} \lambda \\ \lambda^* \end{bmatrix}
\text{s.t. } [\lambda^T\;\lambda^{*T}]\,\mathbf{1} = 1 \text{ and } \lambda_i^*, \lambda_i \ge 0    (25)

where y = [y_1, ..., y_N]^T, λ = [λ_1, ..., λ_N]^T and λ* = [λ_1^*, ..., λ_N^*]^T are the 2N Lagrange multipliers, and

\tilde K = \begin{bmatrix} K + \mathbf{1}\mathbf{1}^T + \frac{\mu N}{C}I & -(K + \mathbf{1}\mathbf{1}^T) \\ -(K + \mathbf{1}\mathbf{1}^T) & K + \mathbf{1}\mathbf{1}^T + \frac{\mu N}{C}I \end{bmatrix}    (26)

is a 2N x 2N "kernel matrix." After solving for the dual variables λ and λ*, the primal variables can be recovered as

w = C\sum_{i=1}^N(\lambda_i^* - \lambda_i)\varphi(x_i), \qquad b = C\sum_{i=1}^N(\lambda_i - \lambda_i^*).    (27)

By defining \tilde\alpha = [\tilde\alpha_1, ..., \tilde\alpha_{2N}]^T = [\lambda^T\;\lambda^{*T}]^T and \Delta = -\mathrm{diag}(\tilde K) + \frac{2}{C}[y^T\;-y^T]^T, (25) can be rewritten as

\max_{\tilde\alpha}\; \tilde\alpha^T(\mathrm{diag}(\tilde K) + \Delta) - \tilde\alpha^T\tilde K\tilde\alpha \quad \text{s.t. } \tilde\alpha^T\mathbf{1} = 1 \text{ and } \tilde\alpha \ge 0.    (28)

Obviously, this is also in the form of (20), and thus L2-SVR can also be explained from the KDE perspective.

3) L2-SVM: For two-class classification tasks, given the training set {z_i = (x_i, y_i)}_{i=1}^N, where y_i ∈ {±1} is the class label, L2-SVM attempts to construct a hyperplane for which the separation between the positive and negative examples is maximized and, at the same time, the training error is minimized. Its primal can be formulated as

\min\; \|w\|^2 + b^2 - 2\rho + C\sum_{i=1}^N \xi_i^2 \quad \text{s.t. } y_i(w^T\varphi(x_i) + b) \ge \rho - \xi_i, \quad i = 1, 2, \ldots, N    (29)

where C > 0 is a parameter that controls the tradeoff between the margin and the training error. Its dual can be formulated as

\max\; -\alpha^T\tilde K\alpha \quad \text{s.t. } \alpha^T\mathbf{1} = 1 \text{ and } \alpha \ge 0    (30)

where \tilde K = [y_i y_j k(x_i, x_j) + y_i y_j + \delta_{ij}/C] is an N x N matrix; δ_ij = 1 if i = j, and δ_ij = 0 otherwise. Obviously, it is also in the form of (20), and thus the aforementioned L2-SVM can also be explained from the KDE perspective.

It is worth pointing out here that each element of the kernel matrices in both L2-SVM and L2-SVR is not always positive, which seems not in line with the KDE representation. Although this issue is somewhat subtle, it can be addressed as follows. Let us consider a binary data set, with f(x) and g(x) denoting the density distributions of the positive and negative classes, respectively. When two classes exist simultaneously, the probability of a data point x belonging to the positive class can be f(x)(1 - g(x)) = f(x) - f(x)g(x), in which the first term reflects the positive effect coming from the same class while the second term reflects the negative effect coming from the opposite class. Similarly, the probability of x belonging to the negative class can be g(x)(1 - f(x)) = g(x) - f(x)g(x). In other words, both the positive term and the negative term in the respective probability distributions should be considered in this case. Thus, the kernel matrices in both L2-SVM and L2-SVR can be viewed as a possible choice to embody the basic characteristic of such probability distributions.

It should also be emphasized that, in terms of KDE, the constraint \sum_{i=1}^N \alpha_i = 1 and α_i ≥ 0 can be imposed more strictly. For example, we can impose the constraint \sum_{i=1}^N \alpha_i = 1 and μ < α_i < ν, μ, ν ∈ [0, 1]. In the following, we will see that such a stricter constraint allows more kernel methods to be explained from the KDE perspective.

4) One-Class SVM: As discussed in [1], the primal of the one-class SVM can be formulated as

\min\; \frac{1}{2}\|w\|^2 + \frac{1}{\nu N}\sum_{i=1}^N \xi_i - \rho \quad \text{s.t. } w^T\varphi(x_i) \ge \rho - \xi_i,\; \xi_i \ge 0,\; i = 1, \ldots, N.    (31)

Here, the parameter ν is analogous to the parameter ν in ν-SVM [20]. Its dual can be formulated as

\max\; -\alpha^T K\alpha \quad \text{s.t. } \alpha^T\mathbf{1} = 1 \text{ and } 0 \le \alpha \le \frac{1}{\nu N}.    (32)

When applied to large data sets, such an SVM variant cannot be scaled up by CVM [12], [13] because its constraint violates the requirement in (2). On the other hand, our fast learning strategy based on the KDE perspective can be employed in this case.

5) TM-SVM: The TM-SVM is proposed in [14]. Its primal is formulated as

\min\; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^N \xi_i - (1 - \lambda)\sum_{i=1}^N \delta_i
\text{s.t. } y_i(w^T\varphi(x_i) + b) \ge 1 - \xi_i + \delta_i, \quad i = 1, 2, \ldots, N
\xi_i \ge 0,\; \delta_i \ge 0, \quad i = 1, 2, \ldots, N    (33)

where δ_i denotes the distance between data point x_i and the separating hyperplane. We cannot directly develop its dual form into that of (20). However, with a slight modification, one may have the following TM-SVM variant to achieve this goal:

\min\; \|w\|^2 + b^2 - 2\rho^2 + C\sum_{i=1}^N \xi_i - (1 - \lambda)\sum_{i=1}^N \delta_i
\text{s.t. } y_i(w^T\varphi(x_i) + b) \ge \rho^2 - \xi_i + \delta_i, \quad i = 1, 2, \ldots, N
\xi_i \ge 0,\; \delta_i \ge 0, \quad i = 1, 2, \ldots, N.    (34)
Based on the constraints and the objective in (34), we can construct the following Lagrangian:

L(w, b, \rho, \xi_i, \delta_i) = \|w\|^2 + b^2 - 2\rho^2 + C\sum_{i=1}^N \xi_i - (1 - \lambda)\sum_{i=1}^N \delta_i - \sum_{i=1}^N \alpha_i\big(y_i(w^T\varphi(x_i) + b) - \rho^2 + \xi_i - \delta_i\big) - \sum_{i=1}^N \beta_i\xi_i - \sum_{i=1}^N \gamma_i\delta_i    (35)

where α_i ≥ 0, β_i ≥ 0, and γ_i ≥ 0 are the Lagrange multipliers. By setting the partial derivatives of L with respect to (w.r.t.) w, b, ρ, ξ_i, and δ_i to 0, we arrive at the following equations after a few steps of simplification:

w = \frac{1}{2}\sum_{i=1}^N \alpha_i y_i \varphi(x_i), \quad b = \frac{1}{2}\sum_{i=1}^N \alpha_i y_i, \quad \sum_{i=1}^N \alpha_i = 1, \quad 0 \le \alpha_i \le C, \quad \alpha_i \ge C(1 - \lambda).    (36)

Substituting (36) into the objective in (35), we get the corresponding dual of (35) as

\max\; -\alpha^T\tilde K\alpha \quad \text{s.t. } C(1 - \lambda) \le \alpha \le C, \quad \alpha^T\mathbf{1} = 1    (37)

where

\tilde K = [y_i y_j k(x_i, x_j) + y_i y_j].    (38)

In this way, the optimization problem in (34) has the form of (20), although the linear term in the objective is zero. Here, α_i (i = 1, ..., N) has different upper and lower bounds, as indicated in (37), when compared with that in (20). Just like the one-class SVM, the aforementioned TM-SVM variant can be modified to develop a total margin SVR. In order to save space, we directly give its dual as follows:

\max_{\tilde\alpha}\; \tilde\alpha^T(\mathrm{diag}(\tilde K) + \Delta) - \tilde\alpha^T\tilde K\tilde\alpha \quad \text{s.t. } \tilde\alpha^T\mathbf{1} = 1 \text{ and } C(1 - \lambda) \le \tilde\alpha \le C    (39)

where \tilde K, \tilde\alpha, and Δ have the same meaning as those of L2-SVR in (28), and C and λ take the same meaning as in (37).
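To make the shared structure of the duals above concrete, the following is a minimal sketch (our own, not the paper's Matlab or C++ code) of solving a QP of the generic form (20) under a box constraint of the type in (3) or (37); scipy's SLSQP solver is used purely for illustration, whereas the paper's experiments rely on SMO or Matlab's quadprog(). All names are ours.

```python
import numpy as np
from scipy.optimize import minimize

def solve_box_qp(K, delta, lower, upper):
    """Maximize a^T (diag(K) + delta) - a^T K a  s.t.  1^T a = 1, lower <= a_i <= upper."""
    n = K.shape[0]
    lin = np.diag(K) + delta

    def neg_obj(a):                                # minimize the negated objective
        return a @ K @ a - a @ lin

    def neg_grad(a):
        return 2.0 * K @ a - lin

    a0 = np.full(n, 1.0 / n)                       # feasible start when lower <= 1/n <= upper
    cons = [{"type": "eq", "fun": lambda a: a.sum() - 1.0}]
    res = minimize(neg_obj, a0, jac=neg_grad, method="SLSQP",
                   bounds=[(lower, upper)] * n, constraints=cons)
    return res.x

# toy usage with a random Gaussian kernel matrix and TM-SVM-style bounds as in (37)
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
C, lam = 0.05, 0.5                                 # feasibility requires 1/N <= C <= 1/((1 - lam) N)
alpha = solve_box_qp(K, np.zeros(30), lower=C * (1 - lam), upper=C)
print(round(alpha.sum(), 6), bool(alpha.min() >= C * (1 - lam) - 1e-8))
```

The same routine covers the duals (22), (28), (30), (32), (37), and (39) by an appropriate choice of K, delta, and the bounds; it is this box constraint, rather than (2), that the generalized CVM cannot accommodate.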
III. SCALING UP KERNEL METHODS BY FAST KERNEL DENSITY ESTIMATION

A. Main Idea

In the previous section, we have already shown that the QP problems in the form of (20), obtained by KDE using an entropy-based ISE criterion, can also be the QP problems formulated by the kernel methods. While this facilitates a KDE perspective of kernel methods, an immediate advantage is that fast data reduction techniques for KDE can be employed to scale up the kernel methods for large data set applications.

While KDE is a very attractive technique as no a priori information on the probability distribution of the data set is needed, its high computational requirement becomes a major bottleneck in its application to a large data set. An effective solution to this problem is to reduce the effective number of data points involved in KDE. For example, Scott and Sheather [21] proposed a binning strategy, in which the data samples are prebinned and the kernel density estimator is applied to the bin centers rather than the original data points. Holmstrom [22] discussed the multivariate form of the binned kernel density estimator. Rather than using a binning strategy, an alternative approach is to cluster the data points and employ the cluster centers as the reduced data set. A clustering-based branch and bound approach is adopted in [23], while clustering is employed in [24] to identify a set of reference vectors for the Parzen window classifier. In [25], the self-organizing map is used to locate and provide the reference vectors for the density estimators. In [26], Girolami and He proposed the reduced set density estimator (RSDE) based on the ISE criterion. The equivalence between RSDE and the MEB problem is revealed by Deng et al. [16]. The latest advance in data reduction via KDE approximation is proposed by Freedman and Kisilev [15], in which a sampling strategy and a mean shift procedure are combined to reduce the data samples substantially. An important theorem is presented there, and it provides a solid mathematical foundation for the fast learning strategy to be introduced next.

B. FastKDE

In this section, a FastKDE method to scale up various kernel methods for solving QP problems of the form (20) is proposed. For a large data set with N data points, one of the problems with the KDE representation is its O(N) complexity. Thus, it is very meaningful to find a more compact KDE representation. Assume that \hat p(x) is a KDE which utilizes all the data points in density estimation, i.e., \hat p(x) = (1/N)\sum_{i=1}^N k(x, x_i), where k(x, x_i) is a kernel function with a diagonal bandwidth matrix H = h^2 I. Our aim is to find another KDE \tilde p(x), using only a few selected data points, that is close to \hat p(x). In other words, the objective is to

\min\; D(\hat p(x), \tilde p(x)) \quad \text{s.t. } \tilde p(x) = \frac{1}{m}\sum_{i=1}^m k(x, x_i)    (40)
where D(·) is a distance measure between the two probability densities \hat p(x) and \tilde p(x), and m ≪ N. Here, the expected ISE can be employed as the distance function, and the parameter m is the number of data points sampled from the training set. The optimization problem in (40) seeks the data points x_i underlying the KDE \tilde p(x) that approximate \hat p(x) well with far fewer data points.

In the following, an important theorem, i.e., Theorem 1, from [15] is first recalled. Then, we derive our own results, i.e., Theorems 2-4. By replacing the diagonal bandwidth matrix H = \tilde h^2 I in Theorem 1 with a generalized positive definite bandwidth matrix, we independently infer these three theorems, which all reveal that the upper bound of the corresponding ISE does not depend on N. This theoretical result facilitates the development of the new FastKDE.

Theorem 1 [15]: Let \hat p(x) be the KDE with N data points and \tilde p(x) be the KDE constructed by sampling m times from \hat p(x), and assume that every kernel function k(x, x_i) has a diagonal bandwidth matrix H = \tilde h^2 I. Let the expected ISE between the two densities be given by J = E[\int(\hat p(x) - \tilde p(x))^2 dx]. Then, we have

J \le 4A\tilde h + A^2\tilde h^2 + \frac{V + B}{m\tilde h^d} + \frac{ABV}{m\tilde h^{d-1}}    (41)
where A, B, and V are constants that do not depend on h and m.

Let us recall the entropy-based ISE criterion defined in Section II-A. As the quadratic entropy is irrelevant to N, in terms of Theorem 1, we immediately have the following corollary.

Corollary 1: With the same assumptions as in Theorem 1, let the expected entropy-based ISE between the two densities be given by

J = E\Big[\int(\hat p(x) - \tilde p(x))^2 dx\Big] + \lambda E\Big[1 - \int(\tilde p(x))^2 dx\Big].

Then, the upper bound of J depends only on λ, h, d, and m, but not on N.

Theorem 2: Let \hat p(x) be the KDE with N data points and \tilde p(x) be a KDE constructed by sampling m times from \hat p(x), and assume that a Gaussian kernel function K(x - x_i, H_i) is adopted with a positive definite bandwidth matrix H_i centered at x_i satisfying \int K(x - x_i, H_i)dx = 1. Let the ISE between the two densities be given by J = \int(\hat p(x) - \tilde p(x))^2 dx. Then, J is upper bounded by

1 - \frac{2}{m}\sum_{i=1}^m \min_{j\in S_i} K(x_i - x_j, H_i + H_j) + \frac{1}{m}\sum_{i=1}^m \max_{l,j\in S_i} K(x_l - x_j, 2H_i)    (42)

and its expected ISE is upper bounded by

1 - \frac{2}{m}\sum_{i=1}^m K(x_i - x_i, 2H_i) + \frac{1}{m}\sum_{i=1}^m K(x_i - x_i, 2H_i).    (43)

The proof of Theorem 2 can be seen in Appendix B.
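A quick empirical illustration of the spirit of Theorems 1 and 2 is given below: with m fixed, the ISE between the full KDE \hat p and a KDE built from m uniformly sampled points stays of the same order as N grows. This is purely our own illustration, with illustrative names and settings, not an experiment from the paper.

```python
import numpy as np

def kde(grid, centers, h):                         # one-dimensional Gaussian KDE on a grid
    out = np.zeros_like(grid)
    for chunk in np.array_split(centers, max(1, len(centers) // 5000)):
        out += np.exp(-(grid[:, None] - chunk[None, :]) ** 2 / (2 * h * h)).sum(axis=1)
    return out / (len(centers) * np.sqrt(2 * np.pi) * h)

rng = np.random.default_rng(0)
grid = np.linspace(-6.0, 6.0, 2001)
m, h = 200, 0.3
for N in (2_000, 20_000, 200_000):
    X = rng.normal(size=N)
    S = rng.choice(X, size=m, replace=False)       # uniform sampling of m points
    ise = np.trapz((kde(grid, X, h) - kde(grid, S, h)) ** 2, grid)
    print(N, round(float(ise), 5))                 # the ISE stays of the same order as N grows
```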
Obviously, when the entropy-based ISE criterion is adopted, we have the following Theorem 3.

Theorem 3: With the same assumptions as in Theorem 2, let the entropy-based ISE between the two densities be given by J = \int(\hat p(x) - \tilde p(x))^2 dx + \lambda(1 - \int(\tilde p(x))^2 dx). Then, J is upper bounded by

\lambda + (1 - \lambda) - \frac{2}{m}\sum_{i=1}^m \min_{j\in S_i} K(x_i - x_j, H_i + H_j) + \frac{1}{m}\sum_{i=1}^m \max_{l,j\in S_i} K(x_l - x_j, 2H_i)    (44)

and, with the expected value of J given by

E[J] = E\Big[\int(\hat p(x) - \tilde p(x))^2 dx\Big] + \lambda E\Big[1 - \int(\tilde p(x))^2 dx\Big],

E[J] is upper bounded by

\lambda + (1 - \lambda) - \frac{2}{m}\sum_{i=1}^m K(x_i - x_i, 2H_i) + \frac{1}{m}\sum_{i=1}^m K(x_i - x_i, 2H_i)    (45)

which means that the upper bounds of both J and E[J] depend on λ, H_i, and m only, but not on N. The proof of this theorem can be seen in Appendix C.

Theorem 4: With the same assumptions as in Theorem 2 except that \tilde p(x) = \sum_{i=1}^m \alpha_i K(x, x_i), where \sum_{i=1}^m \alpha_i = 1 and α_i > 0, let the entropy-based ISE between the two densities be given by J = \int(\hat p(x) - \tilde p(x))^2 dx + \lambda(1 - \int(\tilde p(x))^2 dx). Then, J is upper bounded by

\lambda + \sum_{i=1}^m m(1 - \lambda)\alpha_i^2 - \sum_{i=1}^m 2\alpha_i \min_{j\in S_i} K(x_i - x_j, H_i + H_j) + \frac{1}{m}\sum_{i=1}^m \max_{l,j\in S_i} K(x_l - x_j, 2H_i)

and, with the expected value of J given by

E[J] = E\Big[\int(\hat p(x) - \tilde p(x))^2 dx\Big] + \lambda E\Big[1 - \int(\tilde p(x))^2 dx\Big],

E[J] is upper bounded by

\lambda + \sum_{i=1}^m m(1 - \lambda)\alpha_i^2 - \sum_{i=1}^m 2\alpha_i K(x_i - x_i, 2H_i) + \frac{1}{m}\sum_{i=1}^m K(x_i - x_i, 2H_i)

which means that the upper bounds of both J and E[J] do not depend on N. The proof of Theorem 4 can be seen in Appendix D.

Although Theorems 1, 2, 3, and 4 assume different kernel forms, the exact shape of the kernel function does not affect the approximation a lot, as pointed out in [16]. In other words, the parameters of the kernel function are more important. The meaning of Theorems 1 and 2 is straightforward: two KDEs, \hat p(x) and \tilde p(x), will be close in terms of the ISE or its expectation if m is large enough and if the bandwidth h or H_i is chosen properly as a function of m. Moreover, Freedman and
Kisilev [15] also gave a criterion for selecting an optimal bandwidth \tilde h_{opt}, i.e.,

\tilde h_{opt} = (N/m)^{1/(d+1)}\,\hat h    (46)

where \hat h is the bandwidth of \hat p(x). For a given training data set, N is fixed, and \tilde h_{opt} can be determined by the number of sampled data points m.

In terms of Theorems 1, 2, 3, and 4, we can conclude that the upper bound of J is determined only by the number of sampled data points once the size N of the training data set and the bandwidth \hat h are given. One may also conclude that the upper bound of J depends neither on N nor on the adopted sampling method. Thus, an appropriate choice of m can help us achieve an acceptable performance (in terms of both speed and accuracy) of the KDE approximation. More importantly, Theorem 1 tells us that, with the expected ISE criterion, one may apply a sampling method to a large data set to get a reduced data set without degrading the corresponding KDE approximation accuracy a lot. Similarly, although it is based on an entropy-based ISE criterion, the QP problem in (20) formulated by KDE can also be solved efficiently by scaling up the solver using a sampled data set. Based on this observation, a fast learning strategy called FastKDE can be proposed as shown in Fig. 1. The proposed method is very simple: it consists of a sampling step to obtain a reduced data set whose corresponding QP problem can then be solved in a scalable way.

Fig. 1. The FastKDE method.

C. Time Complexity

Given a large data set with N data points, by sampling m data points, FastKDE takes O(m^3) time to solve the corresponding QP problem. Thus, when a simple sampling strategy such as uniform sampling is adopted, the overall time complexity of FastKDE becomes O(m + m^3) ≈ O(m^3), which depends only on m and not on N. Recall from [13] that the generalized CVM has a surprising property: its time complexity depends only on the approximation accuracy ε, i.e., it is O(1/ε) and independent of N, where ε is often empirically set between 10^{-3} and 10^{-6}. In contrast to the generalized CVM, the time complexity of FastKDE is cubic in m, but when m is set appropriately to be far less than N, FastKDE exhibits performance that is highly competitive with the generalized CVM.
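The following is a minimal sketch of the reduction step of Fig. 1, under our own assumptions about the details it leaves open (uniform sampling, Gaussian kernel); names are illustrative and not from the paper's implementation. It returns the reduced m x m kernel matrix, which would then be handed to any QP solver for (20), for example the SLSQP sketch given after (39).

```python
import numpy as np

def reduce_problem(X, m, h_full, seed=0):
    N, d = X.shape
    rng = np.random.default_rng(seed)
    idx = rng.choice(N, size=m, replace=False)      # Step 1: uniform sampling, O(m)
    h_opt = h_full * (N / m) ** (1.0 / (d + 1))     # bandwidth rescaling rule (46)
    Xs = X[idx]
    sq = ((Xs[:, None, :] - Xs[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2.0 * h_opt ** 2))            # reduced m x m kernel matrix for the QP in (20)
    return idx, K, h_opt

# usage: 100,000 points reduced to m = 300 before the O(m^3) QP solve
X = np.random.default_rng(0).normal(size=(100_000, 3))
idx, K, h_opt = reduce_problem(X, m=300, h_full=0.5)
print(K.shape, round(h_opt, 3))
```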
D. Generalized CVM Versus FastKDE

TABLE I. Generalized CVM Versus FastKDE.

In this section, the relationship between the proposed FastKDE and the generalized CVM is discussed. As mentioned in the introductory section, kernel methods such as SVM and SVR are often formulated as QP problems. Both the (generalized) CVM and FastKDE can be used to solve QP problems in the form of (20). The generalized CVM makes use of an approximation algorithm for the MEB problem in computational geometry to solve the problem by finding a so-called core set. However, it assumes the constraint 0 ≤ α_i < 1. When α_i has different lower and upper bounds, the (generalized) CVM is not applicable. On the other hand, our work makes a connection between the KDE approximation and the QP problem in the form of (20). That is, an approximation of the KDE with fewer data points can be used to scale up the solver of such QP problems without degrading the performance a lot. Another merit of the proposed FastKDE method lies in the fact that more general upper and lower bounds on α can be set. Specifically, the constraint on α applied to the kernel methods for solving the aforementioned QP problem is generalized to μ ≤ α ≤ λ, λ, μ ∈ [0, 1]. This means that more kernel methods can be scaled up by our FastKDE method. Table I summarizes the similarities and differences between the generalized CVM and the proposed FastKDE method.

In the next section, various experimental results are reported to demonstrate the effectiveness of FastKDE in scaling up different kernel methods for different tasks. They are also compared with those of the generalized CVM.

IV. EXPERIMENTAL RESULTS

In this section, we report the performance of using FastKDE to scale up different kernel methods, including L2-SVR for regression and L2-SVM and TM-SVM for classification. In order to illustrate the effectiveness of FastKDE, we simply use the uniform sampling strategy to condense the original data set. Here, we use the condensation rate (CR) to indicate the degree to which the original training set is condensed. It is defined as

CR = \frac{\text{the size of the sampled data}}{\text{the size of the full training data set}}.    (47)
Obviously, a smaller CR value means a stronger condensation of the training set.
TABLE II. Details of the Data Sets for Regression Tasks.
In order to reveal the relationship between the sample size and the original full training set size, we also introduce another parameter, SS, to indicate the sample size drawn from the training set in our experiments.

In the experiments, our FastKDE implementation for kernel methods is written in Matlab. SMO is used to solve each QP problem in Step 3 of Fig. 1. For SMO, the value of ε is fixed at 10^{-8} by default. Meanwhile, we use the Gaussian kernel k(x, y) = exp(-\|x - y\|^2/\beta) with \beta = (1/m^2)\sum_{i,j=1}^m \|x_i - x_j\|^2 unless otherwise specified.

An important competitor of our FastKDE is CVM. We downloaded its C++ implementation from the Web site [20]. However, the original code cannot measure the training and testing time when it works under the cross-validation mode; in order to report the training and testing time of CVM under the cross-validation mode, we modified it accordingly. Although Matlab code runs much more slowly than C++ code under the same environment, so that the CPU times cannot be compared directly, the speed advantage of our FastKDE can still be seen. The environment used in the experiments is a computer with a 1.8-GHz dual-core CPU, 3-GB RAM, and the Windows Vista operating system.

All the data sets involved were divided into two parts: one for validation and the other for training and testing with a fivefold cross-validation strategy. Unless otherwise specified, the parameters of all the algorithms were tuned for the best accuracy on a validation set of size 5000.

A. Regression Tasks

In this section, the algorithms were experimented with several data sets, whose details are summarized in Table II. Because the first three data sets are toy problems, we have the liberty of generating the data sets with arbitrary size as we
wanted. Of course, these problems do not need so many data points for training, but it is convenient for us to illustrate the scaling behavior of different algorithms. The next several data sets were downloaded from the Web site resources [36]. In order to show the scalability of different algorithms, we generated several subsets with different sizes by sampling from the original data sets. These subsets were taken as the training and testing input of the algorithms.

For ease of description, our implementation is denoted as a string starting with "L2SVR-FastKDE" followed by important parameters. For example, the string "L2SVR-FastKDE (CR = 1%)" indicates a FastKDE implementation of L2-SVR with a CR of 1%. Because the performance of the kernel methods is always seriously influenced by the parameters C and μ when the Gaussian kernel width is computed using \beta = (1/m^2)\sum_{i,j=1}^m \|x_i - x_j\|^2, we show the best results after searching C and μ over the grid {0.1, 1, 10, 100, 1000}.

In the experiment, two strategies of uniform sampling are adopted. In the first strategy, we perform sampling with fixed CR values; with the increase of the full data set size, the actual input size to the QP problem increases. In the second strategy, we perform sampling with fixed sample sizes; the actual input size to the QP problem then does not change when the full input data set size increases. These two strategies help us observe the behavior and advantage of the proposed method with uniform sampling over other kernel methods.

In the experiment, the following implementations for regression tasks are run for comparison.
1) CVR (ε = 10^{-5}): the CVM implementation for L2-SVR (in C++). The value of ε is set at 10^{-5} because it is an acceptable value for most data sets.
Fig. 2. Two-dimensional function regression results using FastKDE for L2-SVR: (a) Original function with 2000 training data points and (b) result when CR = 1% is used.
Fig. 3. Mexico hat function regression results using FastKDE for L2-SVR: (a) Original function with 50 000 training data points and (b) result when CR = 1% is used.
According to our experience based on a large number of trials, an even smaller ε value does not show improved generalization performance but may increase the training time unnecessarily.
2) nuSVR: the LibSVM implementation in C++.
3) L2SVR-FastKDE(CR): the FastKDE implementation of L2-SVR in Matlab. The suffix "CR" indicates that a fixed CR is adopted; therefore, the sampling size increases with the full data set size.
4) L2SVR-FastKDE(SS): the FastKDE implementation of L2-SVR in Matlab. The suffix "SS" indicates that a fixed sampling size is used.

To evaluate the generalization ability of these algorithm implementations, the following performance indices were adopted:
1) the rooted relative squared error

RRSE = \sqrt{\sum_{i=1}^n (f(x_i) - y_i)^2 \Big/ \sum_{i=1}^n (\bar y - y_i)^2}

2) the squared correlation coefficient

SCC = \frac{\big(n\sum_{i=1}^n f(x_i)y_i - \sum_{i=1}^n f(x_i)\sum_{i=1}^n y_i\big)^2}{\big(n\sum_{i=1}^n f(x_i)^2 - (\sum_{i=1}^n f(x_i))^2\big)\big(n\sum_{i=1}^n y_i^2 - (\sum_{i=1}^n y_i)^2\big)}

where n is the test data set size, f(x_i) is the output value obtained by the corresponding algorithm implementation, y_i is the corresponding true value, and \bar y = (1/n)\sum_{i=1}^n y_i. We report the experimental results on RRSE and SCC using their respective average values and standard deviations under a fivefold cross-validation strategy in the following experiments.

To visually show the regression performance of FastKDE, Figs. 2 and 3 plot the regression results on the data sets 2-D function and Mexico hat. The original function values are plotted in red while the regression results are plotted in blue. It can be seen that the regression performance remains acceptable even at very small CR values, i.e., under very strong condensation.
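The two indices above are straightforward to compute; the following small helpers (our own names, a sketch rather than the paper's code) implement the RRSE and SCC formulas given above for arrays of predictions and test targets.

```python
import numpy as np

def rrse(y_pred, y_true):
    y_bar = y_true.mean()
    return np.sqrt(((y_pred - y_true) ** 2).sum() / ((y_bar - y_true) ** 2).sum())

def scc(y_pred, y_true):
    n = y_true.size
    num = (n * (y_pred * y_true).sum() - y_pred.sum() * y_true.sum()) ** 2
    den = (n * (y_pred ** 2).sum() - y_pred.sum() ** 2) * \
          (n * (y_true ** 2).sum() - y_true.sum() ** 2)
    return num / den
```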
Fig. 4. Comparison of the RRSE’s for different algorithm implementations. (a) Two-dimensional function. (b) Mexico hat. (c) Friedman. (d) Cart delve. (e) Ailerons. (f) Compact.
Figs. 4 and 5 show the average values and standard deviations of the RRSEs and SCCs, respectively, of different implementations on our regression data sets. From these results, one may observe that, although FastKDE uses a simple uniform sampling strategy for very large data sets, its generalization performance is still comparable to that of the MEB-based regression approaches. Although it does not have an advantage over CVR in training time when the sampling size is relatively small, FastKDE becomes better as the sampling size grows. The reason lies in the fact that there is a lot of raw data in the full training sets and the sampling strategy reduces this raw data effectively; thus, it speeds up the learning process without affecting the performance a lot. On the other hand, CVR works well on large data sets. However, when the data sets become extremely large, particularly when there is a lot of raw data inside them, CVR suffers from the
heavy computational burden of computing the distance between the MEB center and all the data points in the full training sets and of solving the corresponding QP problem repeatedly. This implies that, when the sizes of the full training sets become extremely large, the speed of CVR is still a problem. Note that nuSVR had to be terminated early on several data sets because of the excessive training time needed. Fig. 6 compares the training time of the various implementations on large data sets. One may observe that L2SVR-FastKDE is faster than CVR and nuSVR on most data sets, except the data set Ailerons. This is because this data set is relatively small and FastKDE has to sample enough data points from it to obtain an ideal performance; thus, as pointed out earlier, FastKDE does not have a distinctive advantage over CVR and nuSVR in this case.
Fig. 5. Comparison of the SCC’s for different algorithm implementations. (a) Two-dimensional function. (b) Mexico hat. (c) Friedman. (d) Cart delve. (e) Ailerons. (f) Compact.
However, when the size of the input data set increases, CVR becomes slower because it incurs a higher cost to compute distances and to solve the QP problems. In contrast, unlike CVM, the QP solver used in FastKDE has a computational burden that is independent of the data set size, i.e., it depends only on the sampling size. When the sampling size is fixed, one attractive property of FastKDE is that its running speed remains essentially unchanged as the size of the full data set increases; see the curves in Fig. 6. This property facilitates the use of kernel methods on very large data sets. Notice that our FastKDE is coded in Matlab while CVR and nuSVR are coded in C++. We believe that, when they are coded in the same programming language, FastKDE will run much faster than what is shown in Fig. 6. Finally, Fig. 7 compares the numbers of support vectors obtained by the different algorithms after the training process. As can be seen, L2SVR-FastKDE produces far fewer support vectors than the other implementations in most cases. This is because the upper bound of the number of support vectors is determined
by the sampling size. As a result, L2SVR-FastKDE is also faster in testing, as the number of kernel evaluations between the test patterns and the support vectors has been significantly reduced. In summary, the experimental results here indicate that FastKDE is effective for regression tasks in scaling up the solver for the corresponding QP problem without degrading the regression performance a lot.

B. Classification Tasks

In this section, we evaluate the scalability of FastKDE for classification tasks. The classification accuracy and the balanced loss are adopted as the performance indices. Recall from [28] that the balanced loss is computed as

l_{bal} = 1 - \frac{TP + TN}{2}    (48)
where

TP = \frac{\text{number of positives correctly classified}}{\text{total number of positives}}    (49)

TN = \frac{\text{number of negatives correctly classified}}{\text{total number of negatives}}.    (50)

Fig. 6. Comparison of the training time for different algorithm implementations. (a) Two-dimensional function. (b) Mexico hat. (c) Friedman. (d) Cart delve. (e) Ailerons. (f) Compact.
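The balanced loss (48)-(50) reduces to a small helper; the sketch below uses our own names and assumes binary labels in {+1, -1}.

```python
import numpy as np

def balanced_loss(y_pred, y_true):
    pos, neg = y_true == 1, y_true == -1
    tp = (y_pred[pos] == 1).mean()        # true positive rate (49)
    tn = (y_pred[neg] == -1).mean()       # true negative rate (50)
    return 1.0 - (tp + tn) / 2.0          # l_bal in (48)
```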
Obviously, a smaller balanced loss value indicates a better performance. To demonstrate the applicability of FastKDE, we compare the following classification task solvers.
1) CVM (ε = 10^{-5}): the CVM implementation for L2-SVM (in C++). The value of ε is set at 10^{-5}. According to our experience based on a large number of trials, an even smaller ε value does not show improved generalization performance but may increase the training time unnecessarily.
2) nuSVM: the LibSVM implementation in C++.
3) L2SVM-FastKDE(CR): the FastKDE implementation of L2-SVM. The sampling size increases with the increase of the full data set size.
4) L2SVM-FastKDE(SS): the FastKDE implementation of L2-SVM. The sampling size is fixed as the size of the input data set increases.

1) L2-SVM: In the first part of this experiment, FastKDE is used to scale up L2-SVM for classification, as in Section II-D, on the 4 x 4 checkerboard data set, which has been commonly used for evaluating large-scale SVM implementations [12]. Fig. 8 shows this benchmarking data set. In our experiment, a series of input data sets with different sizes is generated. The performance of FastKDE is compared with that of CVM and nuSVM. We implemented L2-SVM in Matlab and solved the corresponding QP problem with the quadprog() function. Both nuSVM and CVM are C++ codes downloaded from the Web site.
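The paper does not give its checkerboard generator, so the following is only an assumed sketch of how a 4 x 4 checkerboard data set of the kind shown in Fig. 8 can be produced: points drawn uniformly from [0, 4)^2 and labeled +1 or -1 by the parity of their grid cell.

```python
import numpy as np

def checkerboard(n, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 4.0, size=(n, 2))
    y = np.where((X.astype(int).sum(axis=1) % 2) == 0, 1, -1)
    return X, y

X, y = checkerboard(40_000)   # e.g., a training set of the size used later for TM-SVM
```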
Fig. 7. Comparison of the number of support vectors for different algorithm implementations. (a) Two-dimensional function. (b) Mexico hat. (c) Friedman. (d) Cart delve. (e) Ailerons. (f) Compact.
Fig. 8. 4 × 4 checkerboard data sets.
Fig. 9 illustrates the performance of the different classification algorithms on the checkerboard data set. Here, we report CVM's results in two settings, i.e., CVM1 and CVM2. Under the fivefold cross-validation strategy mentioned earlier, CVM1 denotes CVM run with the value of C that yields the best accuracy, while CVM2 denotes CVM run with the value of C that makes it fastest. Fig. 9(a) and (b) compares the accuracy and balanced loss, respectively, obtained by CVM1, CVM2, nuSVM, and FastKDE. We can see from Fig. 9 that, although CVM1 achieves the best accuracy, it runs much more slowly than FastKDE. CVM2 runs quickly, but its accuracies are not satisfactory. nuSVM requires a very long, or even unbearable, running time to reach acceptable performance. Obviously, FastKDE is effective in scaling up L2-SVM while preserving acceptable performance compared with CVM and nuSVM.
Fig. 9. Performance of different classification algorithms on checkerboard. (a) Accuracy. (b) Balance loss. (c) Training time. (d) Number of support vectors.

In the second part of this experiment, we compare several algorithm implementations for an image segmentation task. Because the supervised information used in the training set is
available from the reference anatomical model, we can treat this image segmentation task as a classification one. Our experiment was conducted on the data set BrainWeb. We downloaded from the Web site [37] the simulated anatomical model, which is the T1-weighted magnetic resonance (MR) phantom with a slice thickness of 1 mm, 3% noise, and 20% intensity inhomogeneity. Each pixel i is represented by a three-dimensional feature vector in R^3 consisting of its intensity x_i ∈ R, its filtered intensity \bar x_i, and the standard deviation s_i of the intensities of the pixels in its 8-neighborhood. The original MR phantom has nine categories. For ease of description, in our experiment, we only involve the two categories White Matter and Gray Matter, without considering background areas as another independent category. In the experiment, we first determined the parameters on a validation set containing 5000 patterns from the anatomical model. Then, we trained the classifiers with training sets of different sizes to obtain the corresponding decision functions, which were further used to classify the data points corresponding to the pixels in an unseen MR image. Finally, with the classified data points, we can easily realize the segmentation of an MR image. Fig. 10 shows the performance of the different algorithm implementations, and Fig. 11 illustrates the segmentation results of FastKDE on different MR images when the sampling size is 400, where the images to segment are shown in Fig. 11(a), the ground-truth reference images in Fig. 11(b), and the segmentation results for the images in Fig. 11(a) in Fig. 11(c). By observing Fig. 11(b) and (c), we can intuitively see that the segmentation results using FastKDE are satisfactory.
Fig. 10(a) and (b) shows the average performance of the segmentation results after running each algorithm implementation ten times. It is obvious that, unlike CVM and nuSVM, when the size of the full training set becomes large, our FastKDE implementation almost always keeps ideal accuracy and balanced loss performance in a relatively short time. What is more, both CVM and nuSVM need much more time than FastKDE when the training set becomes extremely large. Note that the QP solver used here is the quadprog() function in Matlab, whose computational complexity is cubic in the input data size; if more efficient QP solvers are adopted, our FastKDE will run much faster. Fig. 10(d) also shows the support vectors obtained by the different algorithm implementations. As can be seen, the FastKDE implementation obtains far fewer support vectors, and this makes the segmentation process more efficient. In contrast, CVM and nuSVM often obtain thousands of support vectors, which makes them slower than FastKDE for classifying the pixels.

2) TM-SVM: In the third part of this experiment, FastKDE is used to scale up the TM-SVM [14] of Section II-D for solving the 4 x 4 checkerboard data set. To save space, we do not additionally report experimental results on the data set BrainWeb in this section. Recall that, in order to ensure α^T 1 = 1 under the box constraint C(1 - λ) ≤ α_i ≤ C in (37), C must satisfy 1/N ≤ C ≤ 1/((1 - λ)N); hence, the parameter C in TM-SVM has to change as the training set size grows, the upper and lower bounds of α in (37) change accordingly, and CVM cannot be applied. In the experiment, the training set size is set at 40 000, and the quadprog() function in Matlab is used to solve the corresponding QP problem.
Fig. 10. Performance of different classification algorithms on BrainWeb. (a) Accuracy. (b) Balance loss. (c) Training time. (d) Number of support vectors.

TABLE III. FastKDE for TM-SVM: Performance for Different CRs.
Fig. 11. Segmentation results. (a) Images to segment. (b) Ground-truth images. (c) Segmentation results.

When we directly ran TM-SVM on the full training set, the training time was unbearably long on our computing platform. Consequently, we used FastKDE with CR values varying from 0.5% to 2%. Table III shows the performance of FastKDE for TM-SVM. We can easily find that FastKDE is highly effective in scaling up TM-SVM, even though very strong condensation (small CR values) is used. Notice that we use the quadprog() function in Matlab to solve the QP problem, and its computational complexity is cubic in the training set size; if a more efficient QP solver is used, our FastKDE will run much faster.

V. CONCLUSION
Kernel methods have been highly successful in various machine learning and pattern recognition problems. SVM and SVR are typical kernel methods that are often formulated as QP problems under certain constraints. However, scalability becomes a serious problem when they are applied to large data sets, for which the corresponding QP problems take much longer to solve. Thus, a replacement of the naive methods for finding the QP solutions posed by SVM is highly desirable. In this paper, a novel approach based on KDE approximation to scale up kernel methods has been proposed. We have built an important connection between QP problems and KDE based on an entropy-based ISE criterion. In this way, the FastKDE approximation methods
can be applied to solve these QP problems. For large data sets, the time complexity of solving these QP problems using the proposed FastKDE method can be greatly decreased from O(N^3) to O(m^3), where N and m (m ≪ N) denote the number of training data points and the number of sampled data points, respectively. The experimental results show that the proposed FastKDE method is effective in scaling up different kernel methods without degrading the classification/regression performance a lot. The contributions of this paper can be summarized as follows.
where kh (·) is the kernel function. The discrete quadratic entropy of (A.2) can be derived as ⎛ ⎞2 N N N ⎝ Vˆ = 1 − qˆ2 (xi ; α) = 1 − αj kh (xi , xj )⎠ i=1
=1 −
N
i=1
⎛ ⎝
i=1
1) A connection between kernel methods and KDE. With the introduction of an innovative entropy-based ISE criterion, kernel methods can be viewed from a KDE perspective and consequently be scaled up by KDE approximation methods. 2) A simple and yet effective method for fast learning in large data sets. With the relevant theoretical results, the FastKDE method is proposed to scale up different kernel methods for large data set applications. With just a simple sampling strategy, kernel methods can be scaled up with a theoretical guarantee that their performance does not degrade a lot. 3) Broader adaptability. Compared with CVM, FastKDE works on a stricter constraint, making it adaptable to a broader set of kernel methods for fast learning in a large data set. This work has opened up new opportunities for dealing with QP problems in large data sets. The use of efficient or scalable KDE approximation methods provides us great opportunity to scale up kernel methods. Perhaps sampling is the most commonly used method to reduce the data set size. An ideal sampling strategy could estimate the original data distribution optimally with as few data samples as possible. In our FastKDE method, the sampling strategy has not been assumed. Our experiments have simply adopted one naive sampling method, i.e., uniform sampling. In order to enhance FastKDE, it is obvious that many other sophisticated sampling strategies should be considered. In the near future, we plan to develop a set of effective sampling strategies for FastKDE. On the other hand, we found that the dimensionality of the data set affects the number of data samples required for certain performance. For data sets of high dimensionality, more data samples are required, and this brings higher computational cost for solving the corresponding QP problems. Hence, one of our future works is to study other FastKDE approaches that can deal with high-dimensional data sets.
=1 −
N N
⎞ αj αk kh (xi , xj )kh (xi , xk )⎠
j=1 k=1
N N
αj αk
j=1 k=1
Obviously, tion. Let
j=1
N i=1
N
kh (xi , xj )kh (xi , xk ).
(A.3)
i=1
kh (xi , xj )kh (xi , xk ) is also a kernel func-
h(xj , xk ) =
N
kh (xi , xj )kh (xi , xk ).
(A.4)
i=1
To avoid confusion of the variables in (A.1) and (A.2), we use y as the variable for equation (A.2) here. Assuming pˆ(x) and qˆ(y; α) are independent of each other, we have fˆ(x, y) = pˆ(x)ˆ q (y; α) where fˆ(x, y) is the joint probability density function. For x = y, we have N
fˆ(xi , xi ) =
i=1
N
pˆ(xi )ˆ q (xi ; α).
i=1
By considering N
pˆ(xi )ˆ q (xi ; α)
i=1
⎛
⎞ N N 1 ⎝ = kh (xi , xj )⎠ αk kh (xi , xk ) N j=1 i=1 N
k=1
⎛
⎞ N N N 1 ⎝ = αk kh (xi , xj )kh (xi , xk )⎠ N i=1 j=1 k=1
=
N N 1 αk h(xj , xk ), N j=1
(A.5)
k=1
we have
A PPENDIX A KDE W ITH G ENERIC K ERNELS
N
N N 1 αk h(xj , xk ). fˆ(xi , xi ) = N j=1 i=1
Let
(A.6)
k=1
pˆ(x) =
qˆ(x; α) =
N 1 kh (x, xi ) N i=1 N i=1
αi kh (x, xi )
(A.1)
(A.2)
On the other hand, fˆ(x, y) can be estimated using the following KDE [17] fˆ(x, y) =
N i=1
αi kh (x, xi )kh (y, xi ).
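As an aside (our own sketch, not part of the paper), the claim that h(x_j, x_k) in (A.4) is itself a kernel can be checked numerically: writing Kt for the matrix with entries k_h(x_i, x_j), the matrix H with entries h(x_j, x_k) equals Kt^T Kt and is therefore positive semidefinite.

```python
import numpy as np

# Numerical sanity check (ours, not from the paper): with Kt[i, j] = k_h(x_i, x_j),
# the matrix H[j, k] = sum_i k_h(x_i, x_j) k_h(x_i, x_k) equals Kt.T @ Kt, so its
# eigenvalues are nonnegative, i.e. h(., .) of (A.4) is a valid (PSD) kernel.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                 # 200 two-dimensional points
h = 0.7                                       # Gaussian bandwidth (illustrative value)
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
Kt = np.exp(-d2 / (2.0 * h ** 2))             # k_h evaluated on all pairs
H = Kt.T @ Kt                                 # entries are exactly h(x_j, x_k) in (A.4)
print(np.linalg.eigvalsh(H).min() >= -1e-8)   # True up to numerical round-off
```

The derivation in the appendix continues from the estimate of \hat{f}(x, y) given above.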
For x = y, the estimate above gives

  \sum_{i=1}^{N} \hat{f}(x_i, x_i) = \sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_j k_h(x_i, x_j) k_h(x_i, x_j)
    = \sum_{j=1}^{N} \alpha_j \sum_{i=1}^{N} k_h(x_i, x_j) k_h(x_i, x_j)
    = \sum_{j=1}^{N} \alpha_j h(x_j, x_j).    (A.7)

Combining (A.6) and (A.7), we obtain the relationship

  \frac{1}{N}\sum_{j=1}^{N}\sum_{k=1}^{N} \alpha_k h(x_j, x_k) = \sum_{j=1}^{N} \alpha_j h(x_j, x_j).    (A.8)

Based on (A.4), (A.5), and (A.8), ISE(α) can be computed as

  \mathrm{ISE}(\alpha) = \sum_{i=1}^{N} \big(\hat{p}(x_i) - \hat{q}(x_i; \alpha)\big)^2
    = \sum_{i=1}^{N} \hat{p}^2(x_i) - 2\sum_{i=1}^{N} \hat{p}(x_i)\hat{q}(x_i; \alpha) + \sum_{i=1}^{N} \hat{q}^2(x_i; \alpha)
    = \frac{1}{N^2}\sum_{i=1}^{N}\Big(\sum_{j=1}^{N} k_h(x_i, x_j)\Big)^2 - 2\sum_{j=1}^{N} \alpha_j h(x_j, x_j) + \sum_{j=1}^{N}\sum_{k=1}^{N} \alpha_j \alpha_k h(x_j, x_k).

As in Section II-A, to make \hat{q}(x; \alpha) approximate \hat{p}(x) better, we construct the following objective function (dropping the constant term that does not depend on α):

  J(\alpha) = \mathrm{ISE}(\alpha) + \lambda \hat{V}(S; \alpha)
    = \sum_{j=1}^{N}\sum_{k=1}^{N} \alpha_j \alpha_k h(x_j, x_k) - 2\sum_{j=1}^{N} \alpha_j h(x_j, x_j)
      + \lambda\Big(1 - \sum_{j=1}^{N}\sum_{k=1}^{N} \alpha_j \alpha_k \sum_{i=1}^{N} k_h(x_i, x_j) k_h(x_i, x_k)\Big)
    = (1 - \lambda)\sum_{j=1}^{N}\sum_{k=1}^{N} \alpha_j \alpha_k h(x_j, x_k) - 2\sum_{j=1}^{N} \alpha_j h(x_j, x_j) + \lambda    (A.9)

and the optimization problem

  \alpha = \arg\min_{\alpha} J(\alpha)
    = \arg\min_{\alpha}\Big(\sum_{j=1}^{N}\sum_{k=1}^{N} \alpha_j \alpha_k h(x_j, x_k) - \frac{2}{1-\lambda}\sum_{j=1}^{N} \alpha_j h(x_j, x_j)\Big)
    = \arg\min_{\alpha}\; \alpha^{T} K \alpha - \frac{2}{1-\lambda}\, \alpha^{T}\mathrm{diag}(K)    (A.10)

where K is an N × N matrix defined by the kernel function h. Thus,

  \alpha = \arg\min_{\alpha}\; \alpha^{T} K \alpha - \alpha^{T}\big(\mathrm{diag}(K) + \Delta\big)

where the vector \Delta = \frac{1+\lambda}{1-\lambda}\,\mathrm{diag}(K).

APPENDIX B
PROOF OF THEOREM 2

According to the Cauchy-Schwarz inequality

  \Big(\sum_{i=1}^{m} a_i\Big)^2 \le m \sum_{i=1}^{m} a_i^2

we immediately know that J can be upper bounded by

  J = \int \big(\hat{p}(x) - \tilde{p}(x)\big)^2 dx
    = \int \Big(\sum_{i=1}^{m}\Big(\frac{1}{m}K(x - x_i, H_i) - \frac{1}{N}\sum_{j \in S_i} K(x - x_j, H_j)\Big)\Big)^2 dx
    \le m \sum_{i=1}^{m} \int \Big(\frac{1}{m}K(x - x_i, H_i) - \frac{1}{N}\sum_{j \in S_i} K(x - x_j, H_j)\Big)^2 dx
    = m \sum_{i=1}^{m} \varepsilon_i    (A.11)

where

  \varepsilon_i = \int \Big(\frac{1}{m}K(x - x_i, H_i) - \frac{1}{N}\sum_{j \in S_i} K(x - x_j, H_j)\Big)^2 dx, \qquad i = 1, 2, \ldots, m

which holds for any m and any disjoint partitioning {S_1, ..., S_m} of K(x − x_j, H_j), j = 1, 2, ..., N. Thus, without loss of generality, assume that |S_i| = N/m for i = 1, 2, ..., m. As the Gaussian kernel is adopted, we have

  \int K(x - x_i, H_i)\, K(x - x_j, H_j)\, dx = K(x_i - x_j, H_i + H_j)
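This convolution identity, which the remaining steps rely on, is easy to verify numerically when K(u, H) denotes a normalized Gaussian with (co)variance H. The short check below is our own sketch and not part of the proof; the numerical values are illustrative.

```python
import numpy as np

# Verify  ∫ K(x - xi, Hi) K(x - xj, Hj) dx = K(xi - xj, Hi + Hj)
# for normalized 1-D Gaussian kernels K(u, H) = N(u; 0, H).
def K(u, H):
    return np.exp(-u ** 2 / (2.0 * H)) / np.sqrt(2.0 * np.pi * H)

xi, xj, Hi, Hj = 0.3, -1.2, 0.5, 0.8
x = np.linspace(-30.0, 30.0, 20001)                # wide, fine grid for the integral
dx = x[1] - x[0]
lhs = np.sum(K(x - xi, Hi) * K(x - xj, Hj)) * dx   # numerical integral
rhs = K(xi - xj, Hi + Hj)                          # closed form used in the proof
print(abs(lhs - rhs) < 1e-6)                       # True
```

With this identity in hand, the bound on each ε_i follows as derived next.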
Expanding the square in ε_i and applying this identity, we have

  \varepsilon_i = \frac{1}{m^2}\int K^2(x - x_i, H_i)\, dx - \frac{2}{mN}\sum_{j \in S_i}\int K(x - x_i, H_i) K(x - x_j, H_j)\, dx + \frac{1}{N^2}\sum_{l \in S_i}\sum_{j \in S_i}\int K(x - x_l, H_i) K(x - x_j, H_i)\, dx
    = \frac{1}{m^2} - \frac{2}{mN}\sum_{j \in S_i} K(x_i - x_j, H_i + H_j) + \frac{1}{N^2}\sum_{l \in S_i}\sum_{j \in S_i} K(x_l - x_j, 2H_i)
    \le \frac{1}{m^2} - \frac{2}{mN}\cdot\frac{N}{m}\min_{j \in S_i} K(x_i - x_j, H_i + H_j) + \frac{1}{N^2}\cdot\Big(\frac{N}{m}\Big)^2 \max_{l,j \in S_i} K(x_l - x_j, 2H_i)
    = \frac{1}{m^2} - \frac{2}{m^2}\min_{j \in S_i} K(x_i - x_j, H_i + H_j) + \frac{1}{m^2}\max_{l,j \in S_i} K(x_l - x_j, 2H_i).

Thus,

  J \le m\sum_{i=1}^{m}\varepsilon_i \le 1 - \frac{2}{m}\sum_{i=1}^{m}\min_{j \in S_i} K(x_i - x_j, H_i + H_j) + \frac{1}{m}\sum_{i=1}^{m}\max_{l,j \in S_i} K(x_l - x_j, 2H_i)

i.e., the upper bound of J depends only on m and H_i, but not on N. Next, let us analyse E(J). By using K(x_i − x_i, 2H_i) to represent the expected value of K(x_i − x_j, H_i + H_j) over S_i, we have

  E(J) \le m\sum_{i=1}^{m} E(\varepsilon_i) \le 1 - \frac{2}{m}\sum_{i=1}^{m} K(x_i - x_i, 2H_i) + \frac{1}{m}\sum_{i=1}^{m} K(x_i - x_i, 2H_i).

Thus, the upper bound of E(J) also depends only on m and H_i, but not on N.

APPENDIX C
PROOF OF THEOREM 3

In terms of the proof of Theorem 2, we can easily obtain

  J \le m\sum_{i=1}^{m}\varepsilon_i + \lambda - \lambda\int \tilde{p}^2(x)\, dx
    \le \lambda + \frac{1-\lambda}{m} - \frac{2}{m}\sum_{i=1}^{m}\min_{j \in S_i} K(x_i - x_j, H_i + H_j) + \frac{1}{m}\sum_{i=1}^{m}\max_{l,j \in S_i} K(x_l - x_j, 2H_i)

and

  E(J) \le \lambda + \frac{1-\lambda}{m} - \frac{2}{m}\sum_{i=1}^{m} K(x_i - x_i, 2H_i) + \frac{1}{m}\sum_{i=1}^{m} K(x_i - x_i, 2H_i).

Thus, the theorem obviously holds.

APPENDIX D
PROOF OF THEOREM 4

In terms of the proof of Theorem 2, we can easily derive

  J \le \lambda + \sum_{i=1}^{m} m(1-\lambda)\alpha_i^2 - \sum_{i=1}^{m}\frac{2\alpha_i}{m}\min_{j \in S_i} K(x_i - x_j, H_i + H_j) + \frac{1}{m}\sum_{i=1}^{m}\max_{l,j \in S_i} K(x_l - x_j, 2H_i)

and

  E(J) \le \lambda + \sum_{i=1}^{m} m(1-\lambda)\alpha_i^2 - \sum_{i=1}^{m}\frac{2\alpha_i}{m} K(x_i - x_i, 2H_i) + \frac{1}{m}\sum_{i=1}^{m} K(x_i - x_i, 2H_i).

Obviously, the bound on J above does not depend on N. Similarly, the same conclusion holds for E(J). Thus, this theorem is proved.
ACKNOWLEDGMENT

The authors would like to thank the reviewers for their comments, which greatly helped improve the quality of this paper.
R EFERENCES [1] B. Schölkopf and A. Smola, Learning With Kernels. Cambridge, MA: MIT Press, 2002. [2] C. K. I. Williams and M. Seeger, “Using the Nyström method to speed up kernel machines,” in Advances in Neural Information Processing Systems 13, T. Leen, T. Dietterich, and V. Tresp, Eds. Cambridge, MA: MIT Press, 2001, pp. 682–688. [3] A. Smola and B. Schölkopf, “Sparse greedy matrix approximation for machine learning,” in Proc. 7th Int. Conf. Mach. Learn., Stanford, CA, Jun. 2000, pp. 911–918. [4] D. Achlioptas, F. McSherry, B. Schölkopf, and B. S. Olkopf, “Sampling techniques for kernel methods,” in Advances in Neural Information Processing Systems 14, T. Dietterich, S. Becker, and Z. Ghahramani, Eds. Cambridge, MA: MIT Press, 2002, pp. 335–342. [5] S. Fine and K. Scheinberg, “Efficient SVM training using low-rank kernel representations,” J. Mach. Learn. Res., vol. 2, pp. 243–264, Mar. 2001. [6] V. Vapnik, Statistical Learning Theory. New York: Wiley, 1998. [7] C.-C. Chang and C.-J. Lin, “LIBSVM: A Library for Support Vector Machines,” CSIE, Taipei, Taiwan, 2004, 2010. [8] J. C. Platt, “Fast training of support vector machines using sequential minimal optimization,” in Advances in Kernel Methods—Support Vector Learning, B. Schölkopf, C. Burges, and A. Smola, Eds. Cambridge, MA: MIT Press, 1998, pp. 185–208. [9] S. Vishwanathan, A. Smola, and M. Murty, “SimpleSVM,” in Proc. 20th Int. Conf. Mach. Learn., Washington, DC, 2003, pp. 760–767. [10] N. Takahashi and T. Nishi, “Rigorous proof of termination of SMO algorithm for support vector machines,” IEEE Trans. Neural Netw., vol. 16, no. 3, pp. 774–776, May 2005. [11] Y.-J. Lee and O. Mangasarian, “RSVM: Reduced support vector machines,” in Proc. 1st SIAM Int. Conf. Data Mining, San Jose, CA, 2001, pp. 184–200. [12] I. W. Tsang, J. T. Kwok, and P. M. Cheung, “Core vector machines: Fast SVM training on very large data sets,” J. Mach. Learn. Res., vol. 6, pp. 363–392, Dec. 2005. [13] I. W. Tsang, J. T. Kwok, and J. M. Zurada, “Generalized core vector machines,” IEEE Trans. Neural Netw., vol. 17, no. 5, pp. 1126–1140, Sep. 2006. [14] M. Yoon, Y. Yun, and H. Nakayama, “A role of total margin in support vector machines,” in Proc. Int. Joint Conf. Neural Netw., Portland, OR, 2003, vol. III, pp. 2049–2053. [15] D. Freedman and P. Kisilev, “Fast data reduction via KDE approximation,” in Proc. IEEE Data Compress. Conf., 2009, p. 445. [16] Z. H. Deng, F. L. Chung, and S. T. Wang, “FRSDE: Fast reduced set density estimator using minimal enclosing ball approximation,” Pattern Recognit., vol. 41, no. 4, pp. 1363–1372, Apr. 2008. [17] W. Liu, P. P. Plkharel, and J. C. Principe, “Correntropy: Properties and applications in non-Gaussian signal processing,” IEEE Trans. Signal Process., vol. 55, no. 11, pp. 5286–5298, Nov. 2007. [18] D. Tax and R. Duin, “Support vector domain description,” Pattern Recognit. Lett., vol. 20, no. 11–13, pp. 1191–1199, Nov. 1999. [19] A. J. Smola, B. Schölkopf, and B. S. Olkopf, “A tutorial on support vector regression,” Stat. Comput., vol. 14, no. 3, pp. 199–222, Aug. 2003. [20] I. W. Tsang, A. Kocsor, and J. T. Kwok, “LibCVM Toolkit,” CSE, Kowloon, Hong Kong, 2010. [21] D. W. Scott and S. J. Sheather, “Kernel density estimation with binned data,” Commun. Stat.—Theory Methods, vol. 14, no. 6, pp. 1353–1359, 1985. [22] L. Holmstrom, “The accuracy and the computational complexity of a multivariate binned kernel density estimator,” J. Multivariate Anal., vol. 72, no. 2, pp. 264–309, 2000. [23] B. Jeon and D. A. 
Landgrebe, “Fast Parzen density estimation using clustering-based branch and bound,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 16, no. 9, pp. 950–954, Sep. 1994. [24] G. A. Babich and O. Camps, “Weighted Parzen windows for pattern classification,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 18, no. 5, pp. 567–570, May 1996. [25] L. Holmstrom and A. Hamalainen, “The self-organising reduced kernel density estimator,” in Proc. IEEE Int. Conf. Neural Netw., 1993, vol. 1, pp. 417–421. [26] M. Girolami and C. He, “Probability density estimation from optimally condensed data samples,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 10, pp. 1253–1264, Oct. 2003. [27] J. Friedman, “Multivariate adaptive regression splines (with discussion),” Ann. Stat., vol. 19, no. 1, pp. 1–141, Mar. 1991.
[28] J. Weston, B. Schölkopf, E. Eskin, C. Leslie, and S. Noble, “Dealing with large diagonals in kernel matrices,” in Principles of Data Mining and Knowledge Discovery, vol. 243, Lecture Notes in Computer Science. Helsinki, Finland: Springer-Verlag, 2002, pp. 494–511. [29] F.-L. Chung, Z. Deng, and S. Wang, “From minimum enclosing ball to fast fuzzy inference system training on large datasets,” IEEE Trans. Fuzzy Syst., vol. 17, no. 1, pp. 173–184, Feb. 2009. [30] C. Yang, L. Wang, and J. Feng, “On feature extraction via kernels,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 2, pp. 553–557, Apr. 2008. [31] A. G. Bors and N. Nasios, “Kernel bandwidth estimation for nonparametric modeling,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 39, no. 6, pp. 1543–1555, Dec. 2009. [32] S.-W. Kim and B. John Oommen, “On using prototype reduction schemes to optimize kernel-based Fisher discriminant analysis,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 2, pp. 564–570, Apr. 2008. [33] L. Wang, K. L. Chan, and P. Xue, “A criterion for optimizing kernel parameters in KBDA for image retrieval,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 35, no. 3, pp. 556–562, Jun. 2005. [34] M. Lu, C. L. P. Chen, J. Huo, and X. Wang, “Optimization of combined kernel function for SVM based on large margin learning theory,” in Proc. IEEE Int. Conf. Syst., Man Cybern., 2008, pp. 353–358. [35] S. Chen, X. Hong, and C. J. Harris, “Probability density estimation with tunable kernels using orthogonal forward regression,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 40, no. 4, pp. 1101–1114, Aug. 2009. [36] Regression DataSets. [Online]. Available: http://www.liaad.up.pt/~ltorgo/ Regression/DataSets.html [37] BrainWeb: Simulated Brain Database. [Online]. Available: http://mouldy. bic.mni.mcgill.ca/brainweb/
Shitong Wang received the M.S. degree in computer science from the Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 1987. He has spent over five years as a Research Scientist visiting London University, London, U.K.; Bristol University, Bristol, U.K.; Hiroshima International University, Hiroshima, Japan; the Hong Kong University of Science and Technology, Hong Kong, China; The Hong Kong Polytechnic University, Kowloon, Hong Kong, China; and Hong Kong City University, Hong Kong, China. He is currently a Full Professor with the School of Digital Media, Jiangnan University, Wuxi, China. His research interests include artificial intelligence, neuro-fuzzy systems, pattern recognition, and image processing. He has published about 80 papers in international/national journals and has authored/coauthored seven books.
Jun Wang received the Ph.D. degree from the Nanjing University of Science and Technology, Nanjing, China, in 2011. He is currently an Associate Professor with the School of Digital Media, Jiangnan University, Wuxi, China. He has published nearly 20 papers in international/national authoritative journals. His research interests include pattern recognition, data mining, and digital image processing.
Fu-lai Chung received the B.Sc. degree from the University of Manitoba, Winnipeg, Canada, in 1987 and the M.Phil. and Ph.D. degrees from the Chinese University of Hong Kong, Shatin, Hong Kong, in 1991 and 1995, respectively. He joined the Department of Computing, The Hong Kong Polytechnic University, Kowloon, Hong Kong, in 1994, where he is currently an Associate Professor. He has published widely in the areas of data mining, machine learning, fuzzy systems, pattern recognition, and multimedia in international journals and conferences.