SCALE ESTIMATION FOR KERNEL-BASED CLASSIFICATION

Nikolaos Nasios and Adrian G. Bors
Department of Computer Science, University of York, York YO10 5DD, UK
E-mail: [email protected]

ABSTRACT

This paper considers kernel density estimation for unsupervised data classification. A new methodology is proposed for finding the kernel scale using Bayesian statistics. K-nearest neighbourhoods are sampled by considering K as a random variable. The variance of each K-nearest neighbourhood is calculated and its probability density is fitted with a Gamma distribution. The estimated Gamma distribution is used to calculate the kernel scale. The proposed methodology is applied in three different machine learning algorithms: scale space, mean shift and quantum clustering. Quantum clustering employs the Schrödinger partial differential equation and uses the analogy between particles with their electromagnetic field and data samples with their corresponding probability density function (pdf). The classification relies on mode detection and the consequent data assignment in the resulting pdf. The proposed algorithm is applied to the classification of modulated signals and of topography extracted from radar images of terrain.

1. INTRODUCTION

Kernel density estimation techniques assign a kernel at each data sample location. The resulting function represents an approximation of the data probability density function (pdf) [1]. Comparisons among several different kernel functions have revealed only small differences in their data modelling capabilities [1]. Various methods have been used to represent and interpret the resulting kernel-based pdfs. Scale space methods interpret directly the function resulting from kernel addition [2]. The mean shift is an iterative algorithm which updates a set of centers based on the local gradient [3, 4]. More recently, an algorithm derived from an analogy with quantum mechanics [5], called quantum clustering, employs the Schrödinger partial differential equation for representing the data potential function [6].

The pdf representation, in all the approaches mentioned above, depends on the kernel scale (bandwidth) [1, 3, 6, 7, 8]. The algorithms used to date for estimating the scale are mostly applied to univariate data and can be classified into two categories: quality-of-fit and plug-in methods. The first category uses cross-validation, by leaving one data sample out and approximating the pdf with the sum of kernels

centered at the other data [9]. The scale in this case is chosen according to a least squares criterion [1]. Such methods produce excessive bumpiness in the resulting potential surface and their result depends on the selection of specific data samples [7]. The plug-in methods calculate the bias and variance such that the mean integrated square error (MISE) between the real density and its kernel density approximation is minimized [4]. However, since the real density is unknown, plug-in algorithms require an initial estimate of the scale, which is updated in an iterative process [8]. The plug-in methods tend to oversmooth the density function to be approximated, often missing modalities and important features of the pdf [7, 8].

In this paper we propose to estimate the kernel scale using empirical distributions of variances of K-nearest neighbours. Statistics of K-nearest neighbours are sampled from the data set and their variance is calculated. The number of neighbours K is generated by a uniform distribution with given bounds. The Gamma probability density function is suitable for modelling statistics of variances when only very general assumptions about the data are available. We estimate the parameters of the Gamma distribution fitted to the local variance statistics and consider its mean as the appropriate kernel scale.

Unsupervised classification is achieved by splitting the data according to the modes in the resulting kernel-based function representation. Gradient descent has been used on the resulting quantum potential in order to identify the data modality [6]. However, the gradient descent algorithm can easily get stuck in local extrema and its result depends on the choice of specific thresholds. In this study we instead evaluate the eigenvalue signs of the local Hessian in order to detect the modes in the representative kernel density function. The proposed scale estimation is applied to three nonparametric algorithms: scale space, mean shift and quantum clustering. The proposed approach is used in the unsupervised classification of modulated signals [10] and in that of vector fields of surface normals extracted from images of terrain [11].

The kernel density estimation methodology is provided in Section 2. The estimation of the scale parameter is presented in Section 3, while mode identification is explained in Section 4. Experimental results are provided in Section 5 and the conclusions of this study are drawn in Section 6.

2. KERNEL DENSITY ESTIMATION

Probability density function estimation is a fundamental concept in many research fields. While data modelling using "naive" nonparametric estimators, such as histograms, lacks interpretability and generalization capability, kernel-based estimators provide smooth functions that asymptotically represent the true pdf. Lately, kernel-based estimation has been used in a large variety of applications [2, 3, 7, 9]. A kernel function is assigned to each data sample X_i, i = 1, ..., N and the pdf is approximated by:

\psi(X) = \frac{1}{N \sigma^d} \sum_{i=1}^{N} K\left( \frac{X - X_i}{\sigma} \right)    (1)

where K(·) is the kernel function, σ is a parameter called the scale, window width or bandwidth, and d is the data dimension. Due to its smoothness and differentiability properties, the preferred kernel is the Gaussian kernel [1, 2]:

K(t_i) = \exp\left( -\frac{t_i^T t_i}{2} \right)    (2)

where t_i = (X - X_i)/\sigma.
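As a concrete illustration, a minimal NumPy sketch of evaluating (1) with the Gaussian kernel (2) follows; the function name and signature are our own, not from the paper:

```python
import numpy as np

def kde(X_query, X_data, sigma):
    """Evaluate the kernel density estimate (1) with the Gaussian
    kernel (2) at each row of X_query, given the samples X_data."""
    n, d = X_data.shape
    # t_i = (X - X_i) / sigma for every (query, sample) pair
    t = (X_query[:, None, :] - X_data[None, :, :]) / sigma
    k = np.exp(-0.5 * np.sum(t * t, axis=2))    # K(t_i) from (2)
    return k.sum(axis=1) / (n * sigma ** d)     # psi(X) from (1)

# example: 200 two-dimensional samples, density evaluated at the origin
X = np.random.default_rng(0).standard_normal((200, 2))
print(kde(np.zeros((1, 2)), X, sigma=0.5))
```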

2.1. The mean shift algorithm

The mean shift algorithm employs the estimation of the local gradient of the density function ψ(X) from equation (1), instead of the function itself [3, 4]. The mean shift update, when using the Epanechnikov kernel [1], is given by [4]:

M_\sigma(\hat{C}_{t+1}) = \frac{\sigma^2}{d+2} \frac{\nabla \psi(\hat{C}_t)}{\psi(\hat{C}_t)} = \frac{1}{N_\sigma} \sum_{X_i \in S_\sigma(\hat{C}_t)} (X_i - \hat{C}_t)    (3)

where N_\sigma is the cardinality of the data samples lying in the hypersphere S_\sigma(\hat{C}_t) of radius σ, and \hat{C}_t is the updated cluster center at iteration t, initialized as one of the data samples. As can be observed from equation (3), the mean shift vector M_\sigma(\hat{C}_{t+1}) points towards the direction of the steepest slope of the density function ψ(X), leading to the modes of the underlying density [3]. The mean shifts depend on the scale σ, which defines the influence area S_\sigma(\hat{C}_t) and implicitly the cardinality N_\sigma. After convergence, two cluster candidates are considered different modes if there is a local minimum on the line that joins them [4]; otherwise, the two clusters are merged.
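A minimal sketch of iterating (3) with NumPy is given below; averaging over the hypersphere S_σ reproduces the Epanechnikov-kernel shift of (3), while the post-convergence merging of nearby centres is omitted. The function name and stopping tolerances are our own choices:

```python
import numpy as np

def mean_shift(X, sigma, max_iter=100, tol=1e-6):
    """Iterate the mean shift update (3): every centre, initialised at
    a data sample, moves by the shift M_sigma, i.e. to the mean of the
    samples inside the radius-sigma hypersphere around it."""
    centres = X.copy()
    for _ in range(max_iter):
        largest_shift = 0.0
        for j in range(len(centres)):
            in_ball = np.linalg.norm(X - centres[j], axis=1) <= sigma
            if not in_ball.any():
                continue
            shift = X[in_ball].mean(axis=0) - centres[j]   # M_sigma(C_t)
            centres[j] += shift
            largest_shift = max(largest_shift, np.linalg.norm(shift))
        if largest_shift < tol:
            break
    return centres   # converged centres accumulate at the modes
```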

2.2. Quantum clustering

The kernel function can be assimilated with an energy field that manifests around a particle. In this situation, the effect of several particles, associated with data samples, results in a potential energy function. Horn and Gottlieb introduced a new unsupervised algorithm called quantum clustering [6]. The state of a quantum mechanical system is completely specified by a function ψ(X, t), similar to the kernel function from (2), that depends on the coordinates of the particle located at X at time t. According to the first postulate of quantum mechanics, the probability that a particle lies in a volume element dX, located at X, at time t, is given by [5]:

P(X, t) = \int |\psi(X, t)|^2 \, dX    (4)

The fifth postulate of quantum mechanics states that a quantum system evolves according to the Schrödinger differential equation [5]. The time-independent Schrödinger equation is:

H \psi(X) \equiv \left( -\frac{\sigma^2}{2} \nabla^2 + V(X) \right) \psi(X) = E \cdot \psi(X)    (5)

where H is the Hamiltonian operator, E is the energy, ψ(X) corresponds to the state of the given quantum system, V(X) is the Schrödinger potential, and ∇² is the Laplacian. Conventionally, the potential V(X) is given and equation (5) is solved to find the solutions ψ(X). In clustering we consider the inverse problem: we assume the locations of the data samples to be known and take their state, as given by equation (1), to be a solution of (5). We evaluate the potential V(X) produced by the quantum mechanical system that is assimilated with our data samples. The Schrödinger potential of the given set of data samples is [6]:

V(X) = E - \frac{d}{2} + \frac{1}{2 \sigma^2 \psi(X)} \sum_{i=1}^{N} (X - X_i)^T (X - X_i) \exp\left( -\frac{(X - X_i)^T (X - X_i)}{2 \sigma^2} \right)    (6)

The local minima of the resulting hypersurface described by V(X) are associated with data classes.
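The potential (6) can be evaluated directly on query points, as in the following sketch; following [6], ψ is computed as the plain sum of Gaussian kernels (a constant factor common to numerator and denominator cancels), and E, which in [6] is chosen so that min V(X) = 0, is left here as a parameter since only the locations of the minima matter for clustering:

```python
import numpy as np

def quantum_potential(X_query, X_data, sigma, E=0.0):
    """Evaluate the Schrodinger potential V(X) of (6) at each row of
    X_query, for the data samples X_data and kernel scale sigma."""
    d = X_data.shape[1]
    diff = X_query[:, None, :] - X_data[None, :, :]
    sq = np.sum(diff * diff, axis=2)          # (X - X_i)^T (X - X_i)
    g = np.exp(-sq / (2.0 * sigma ** 2))      # Gaussian kernels
    psi = g.sum(axis=1)                       # unnormalised psi(X)
    return E - d / 2.0 + (sq * g).sum(axis=1) / (2.0 * sigma ** 2 * psi)
```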

3. BAYESIAN ESTIMATION OF KERNEL SCALE

One of the main problems in kernel density estimation consists of estimating the kernel scale (bandwidth) [3, 7, 6, 9, 8]. Kernel scale estimation algorithms developed to date can be classified into quality-of-fit and plug-in methods [1]. Comparative studies have shown that while quality-of-fit methods produce excess bumpiness, the plug-in methods provide a scale that oversmoothes the resulting probability density function [7, 8]. In this paper we propose a new kernel scale estimation approach, which is outlined as a graphical model in Fig. 1. The main idea behind the proposed approach is that the scale σ̂ should be associated with the local data variance.

Initially, we consider K-nearest neighbour (KNN) data sample populations [12]. K defines the number of neighbours and has a uniform distribution U[K_1, K_2] as a prior. After sampling K from the uniform distribution, we sample random data from the given data set {X_i, i = 1, ..., N} and choose their K nearest neighbours according to their Euclidean distance to X_i. We calculate the local variance of the K nearest neighbours as:

s_i = \frac{\sum_{k=1}^{K} \| X_{i,(k)} - X_i \|^2}{K}    (7)

for i = 1, ..., N, where K < N is the cardinality of the data set that defines the chosen neighbourhood, and where {X_{i,(k)}, k = 1, ..., K} are the nearest neighbour data samples to X_i.

Fig. 1. Bayesian algorithm for estimating the scale σ̂. (Graphical model: K_1, K_2 → U[K_1, K_2] → K; K and the data X → KNN variances s; s → Gamma fit p(s; α, β) → σ̂.)

The empirical pdf of s_i is represented by a histogram of KNN variances. σ̂ is considered a statistical estimate of the probability density function corresponding to the local variance population s_i calculated in (7). In the next stage we model this empirical distribution with a Gamma distribution:

p(s; \alpha, \beta) = \frac{\beta^\alpha s^{\alpha - 1}}{\Gamma(\alpha)} e^{-\beta s}    (8)

where α > 0 is the shape parameter, which enables the function to change its appearance, and β > 0 is the scale parameter of the Gamma distribution. Γ(·) represents the Gamma function:

\Gamma(t) = \int_0^\infty r^{t-1} e^{-r} \, dr    (9)

The distribution of variances for data samples generated by independent Normal distributions with mean 0 and variance 1 is modelled by the Chi-square distribution. The Gamma distribution is a generalization of the Chi-square distribution and is suitable for modelling distributions of data sample variances when we have no knowledge about the underlying data distribution. The multivariate extension of the Gamma distribution was used for modelling distributions of covariance matrices [10]. Moreover, the Gamma is a long-tailed distribution and its tail can account for the presence of outliers in the distribution of s_i.

The parameters α and β from (8) are estimated using maximum likelihood estimation. The likelihood function for the distribution from (8) is:

L(\alpha, \beta) = \frac{\beta^{\alpha N}}{\Gamma^N(\alpha)} \left[ \prod_{i=1}^{N} s_i \right]^{\alpha - 1} e^{-\beta \sum_{i=1}^{N} s_i}    (10)

where we consider N data samples s_i, i = 1, ..., N. By differentiating the logarithm of the expression (10) with respect to α and β, respectively, we obtain:

\ln(\hat{\alpha}) - \frac{\Gamma'(\hat{\alpha})}{\Gamma(\hat{\alpha})} + \ln\left[ \frac{\left( \prod_{i=1}^{N} s_i \right)^{1/N}}{\bar{s}} \right] = 0, \qquad \hat{\beta} = \frac{\hat{\alpha}}{\bar{s}}    (11)

where \bar{s} is the sample mean of the variable s_i:

\bar{s} = \frac{\sum_{i=1}^{N} s_i}{N}    (12)

The system of equations (11) is solved using the Newton-Raphson algorithm and the parameters α and β are inferred. The resulting Gamma probability density function models the local data variance. The kernel scale σ̂ is calculated as the mean estimate of the Gamma distribution:

\hat{\sigma} = \frac{\hat{\alpha}}{\hat{\beta}}    (13)

This estimate provides an appropriate measure of the local data spread and is suitable to be used as the kernel scale.
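Putting Section 3 together, a compact sketch of the estimator might look as follows; the number of random draws and the Newton-Raphson starting guess α₀ ≈ 1/(2c) are our own choices:

```python
import numpy as np
from scipy.special import digamma, polygamma

def estimate_scale(X, K1, K2, n_draws=1000, n_newton=25, seed=0):
    """Draw K ~ U[K1, K2], collect the KNN variances s_i of (7), fit
    the Gamma pdf (8) by maximum likelihood (11), and return the Gamma
    mean (13) as the kernel scale."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    s = np.empty(n_draws)
    for m in range(n_draws):
        K = int(rng.integers(K1, K2 + 1))      # K ~ U[K1, K2]
        i = int(rng.integers(N))               # random sample X_i
        d2 = np.sum((X - X[i]) ** 2, axis=1)
        s[m] = np.sort(d2)[1:K + 1].mean()     # eq. (7), skipping X_i itself
    s_bar = s.mean()
    c = np.log(s_bar) - np.log(s).mean()       # ln(s_bar) - mean(ln s_i)
    alpha = 0.5 / c                            # first-order starting guess
    for _ in range(n_newton):                  # Newton-Raphson for (11)
        f = np.log(alpha) - digamma(alpha) - c
        alpha -= f / (1.0 / alpha - polygamma(1, alpha))
    beta = alpha / s_bar                       # beta_hat from (11)
    return alpha / beta                        # sigma_hat, eq. (13)
```

Note that, since β̂ = α̂/s̄ in (11), the returned mean α̂/β̂ coincides with the sample mean s̄ of the local variances; the fitted α̂ and β̂ nevertheless characterize the full shape of the variance distribution.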

4. DELIMITING THE MODES IN PDFS

Having an appropriate estimate σ̂, we can achieve a good pdf representation of the data. The kernel-based modelling leads to a function representation such as ψ(X) from (1), or V(X) as provided by the Schrödinger equation (6). Let us consider a rectangular lattice Z, spaced at equidistant intervals of σ̂²/2 and defined in the space where the data set is located. We consider a kernel function at each knot location on this lattice. Classifying the data is equivalent to finding the modes in the kernel-based density representation. The modes correspond to maxima for (1) and to minima for (6). The mean shift algorithm finds the modes using an iterative gradient ascent algorithm according to (3). In the following we develop an approach for finding the modes in functional representations.

Local maxima and minima of a function are determined by the signs of the eigenvalues of the local Hessian. The Hessian is calculated on the given lattice Z as:

H[F(Z)] = \left[ \frac{\partial^2 F(Z)}{\partial x \, \partial y} \right]    (14)

where F(Z) can be either ψ(Z) or V(Z), and where for the sake of simplification we have considered d = 2, i.e. the lattice is defined only along the {x, y} axes. The eigendecomposition of the Hessian matrix provides:

H = T \cdot \Lambda \cdot T^{-1}    (15)

where T is a matrix whose columns represent the eigenvectors, while Λ = {λ_i, i = 1, ..., d} is a diagonal matrix that contains the eigenvalues λ_i. The signs of the eigenvalues of the Hessian matrix are used to identify the local minima, local maxima and saddle points according to the following decisions:

\lambda_i(Z) > 0, \ \forall i = 1, \ldots, d \ \Rightarrow \ \text{local minimum}
\lambda_i(Z) < 0, \ \forall i = 1, \ldots, d \ \Rightarrow \ \text{local maximum}    (16)
\exists \lambda_j(Z) > 0 \ \wedge \ \exists \lambda_i(Z) < 0, \ i \neq j \ \Rightarrow \ \text{saddle point}

In the scale space approach for ψ(X) [2] we consider compact 8-knot neighbourhood sets on the lattice that all correspond to local maxima and are surrounded by saddle points. Conversely, for the quantum potential V(X) from (6) we consider compact sets of local minima knots. Consequently, each distinct mode is identified and labelled accordingly.

After labelling all the knots corresponding to each mode, we proceed to splitting the lattice into regions, each assigned to a data class. The algorithm continues with a region growing process applied to the labelled regions, which consists of a generalization of the binary dilation from mathematical morphology, under the set of local conditions given by (16). At each iteration a layer of lattice points is added simultaneously to each cluster of labelled knots. This process continues until lattice regions corresponding to two different clusters become adjacent, and ends when all the knots on the lattice are assigned to one or another of the classes. The proposed approach preserves the shape of the clusters without assuming knowledge of any parameter. A sketch of the complete procedure is given below.
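The following NumPy/SciPy sketch implements the Hessian sign test (14)-(16) on a 2-D lattice, followed by the layer-by-layer growing of the labelled extremum regions; resolving ties where two growing fronts meet by the larger label is our own simplification:

```python
import numpy as np
from scipy import ndimage

def label_modes(F, find_maxima=True):
    """Classify every lattice knot of F (= psi(Z) or V(Z)) by the signs
    of the eigenvalues of the local Hessian, eqs. (14)-(16), label each
    connected extremum region, and grow the labels over the lattice."""
    Fx, Fy = np.gradient(F)
    Fxx, Fxy = np.gradient(Fx)
    _,  Fyy = np.gradient(Fy)
    # closed-form eigenvalues of the 2x2 Hessian [[Fxx, Fxy], [Fxy, Fyy]]
    half_trace = 0.5 * (Fxx + Fyy)
    disc = np.sqrt(np.maximum(half_trace ** 2 - (Fxx * Fyy - Fxy ** 2), 0.0))
    lam1, lam2 = half_trace + disc, half_trace - disc
    if find_maxima:
        extrema = (lam1 < 0) & (lam2 < 0)   # modes of psi: all eigenvalues < 0
    else:
        extrema = (lam1 > 0) & (lam2 > 0)   # modes of V: all eigenvalues > 0
    labels, n_modes = ndimage.label(extrema)
    if n_modes == 0:
        return labels, 0
    # region growing: unlabelled knots take a neighbouring label,
    # one lattice layer per iteration, until the lattice is covered
    while (labels == 0).any():
        dilated = ndimage.grey_dilation(labels, size=3)
        labels = np.where(labels == 0, dilated, labels)
    return labels, n_modes
```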

5. EXPERIMENTAL RESULTS

The proposed methodology for kernel scale estimation is applied in the context of three machine learning algorithms. The first algorithm is the scale space algorithm, which applies the mode seeking procedure described in Section 4 directly to the function ψ(X) from (1). The second updates a set of centers iteratively using the mean shift (3), while the third applies the algorithm described in Section 4 to the quantum potential V(X) modelled by equation (6).

The first set of experiments concerns the classification of phase shift keying (8-PSK) modulated signals under interference and corruption by noise. These data consist of a mixture of eight two-dimensional Gaussian functions whose centers, each corresponding to a symbol, are equidistantly located on a circle of radius 1. Such data, characteristic of communication systems, have been used in [10] as well. We generate a total of N = 960 data samples with a variance of 0.11, assuming equal probability for each Gaussian component. We have sampled seven values for K, the number of nearest neighbours, at equal intervals between two pre-set limits [K_1, K_2]. We have chosen several values for K_1 and K_2, and the results of fitting the Gamma distribution p(s; α, β) from (8) for each K are shown in Fig. 2. The Gamma distribution that models the data variance over the entire range of neighbours K ∈ [K_1, K_2] is represented in bold in Fig. 2.

By varying σ, we have applied the methodology described in Section 4 for finding the number of modes in the scale space and quantum clustering approaches. The results are displayed in Fig. 4, where the range of estimated σ̂ is indicated by vertical lines. The function ψ(X) for the 8-PSK data set is represented in Fig. 3(a), while the quantum potential V(X) is displayed in Fig. 3(b). As can be observed from these plots, 8 clusters have been identified by all the methods when considering the estimated scale range. For the numerical evaluation we consider the average mean square error with respect to the cluster centers, MSE_SM (calculated by averaging the given data for each symbol), and the mean square error with respect to the optimal symbol value, MSE_O. The results achieved are shown in Table 1, where the confidence intervals, obtained from several runs, are specified with "±". The synthetic constellation can be generated as sketched below.
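For reference, a minimal sketch of the data generation described above follows; reading the quoted 0.11 as the per-component noise variance of each Gaussian is our assumption:

```python
import numpy as np

def make_8psk(n=960, var=0.11, seed=0):
    """Synthetic 8-PSK constellation of Section 5: a mixture of eight
    equiprobable 2-D Gaussians with centres equidistant on the unit
    circle; 'var' is assumed to be the per-component noise variance."""
    rng = np.random.default_rng(seed)
    angles = 2.0 * np.pi * np.arange(8) / 8.0
    centres = np.column_stack((np.cos(angles), np.sin(angles)))
    symbols = rng.integers(8, size=n)        # equal prior for each symbol
    noise = np.sqrt(var) * rng.standard_normal((n, 2))
    return centres[symbols] + noise, symbols
```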

Fig. 2. Scaling parameter estimation for 8-PSK signals: Gamma fits p(s; α, β) of the KNN variance histograms for each sampled K, with the fit over the full range K ∈ [K_1, K_2] in bold. (a) K ∼ U[N/7, N/4], σ̂ = 0.19; (b) K ∼ U[N/10, N/3], σ̂ = 0.23; (c) K ∼ U[N/20, 2N/5], σ̂ = 0.26; (d) K ∼ U[N/20, N/2], σ̂ = 0.36.

Table 1. Mean square error with confidence intervals.

Method         | MSE_SM              | MSE_O
Scale space    | 0.0024 ± 0.00074    | 0.004 ± 0.0007
Mean shift     | 0.0168 ± 0.0097     | 0.020 ± 0.0109
Quantum clus.  | 0.00092 ± 0.00026   | 0.002 ± 0.0004

Fig. 3. Symbol classification in 8-PSK signals: (a) evaluation of ψ(X); (b) evaluation of V(X).

In the second set of experiments we apply the proposed methodology for classifying regions of terrain according to the local surface orientation.

Fig. 4. The number of modes detected in the 8-PSK data as a function of σ: (a) scale space; (b) mean shift; (c) quantum clustering. The estimated range σ̂ ∈ [0.19, 0.36] is marked by vertical lines.

The vector fields of surface normals were extracted from Synthetic Aperture Radar (SAR) images of terrain using shape-from-shading methods adapted to the corresponding illuminance model and image statistics [11]. An area from Wales is shown in Fig. 5(a), while an image of Titan, a moon of Saturn, sent in November 2004 by the Cassini-Huygens spacecraft, is shown in Fig. 5(c). Their corresponding surface normals, after being smoothed by the vector median algorithm [11], are displayed in Figs. 5(b) and 5(d), respectively. We consider the surface normal vector entries along the x and y axes as the input data. The distribution of the resulting 2-D data is non-Gaussian, and a parametric method, such as that adopted in [10], would not provide appropriate results in this case.

We define the range of K nearest neighbours as K ∼ U[N/20, N/2], where N is the number of surface normal vectors, and we evaluate the variance of the sampled neighbourhoods. We apply the estimated scale σ̂ in three kernel-based clustering algorithms: scale space, mean shift and quantum clustering. The pdf surfaces representing ψ(X) and V(X) for the two vector fields are shown in Fig. 6. The modes are located on these surfaces and the classification results are displayed in Fig. 7 for the SAR image from Wales and in Fig. 8 for the SAR image from Titan. A total of 5 clusters were found by the scale space algorithm, 6 by the mean shift and 9 by the quantum clustering for the terrain image from Wales. For the SAR image from Titan, the scale space algorithm found 8 clusters, the mean shift 5 and the quantum clustering 10.

All three nonparametric methods correctly classify the dominant terrain features in the "Wales" image, such as the lake in the top right corner, the slope descending from the semicircular ridge on the left side, as well as the deep valley crossing the image obliquely. Terrain features can be clearly identified in the SAR image from Titan after applying the proposed methodology. From the surface plots in Fig. 6 we can observe that ψ(X) provides rather shallow local maxima compared to the local minima produced by the potential V(X). On the other hand, the quantum clustering algorithm consistently produced more clusters for the same data set, which highlights its sensitivity to small variations in the surface normal orientation.

Fig. 5. SAR images of terrain and their corresponding vector fields of surface normals: (a) Wales; (b) its surface normals; (c) Titan; (d) its surface normals.

Fig. 6. Kernel density surfaces for the fields of vector normals: (a) ψ(X) for Wales; (b) V(X) for Wales; (c) ψ(X) for Titan; (d) V(X) for Titan.

Fig. 7. Topography classification in a SAR image of terrain from Wales: (a) scale space; (b) mean shift; (c) quantum clustering; (d) original 3D map.

Fig. 8. Surface normals classification in the Titan SAR image: (a) scale space; (b) mean shift; (c) quantum clustering; (d) 3D reconstruction.

6. CONCLUSIONS

This paper outlines a new methodology for using kernel density estimation for data classification. Three different methods are considered: scale space, mean shift and quantum clustering. The third approach represents the quantum potential for the given data set using the Schrödinger partial differential equation. A new methodology is proposed for estimating the kernel scale. We calculate the variances of K-nearest neighbourhoods when K is assumed to be a random variable with a uniform distribution as its prior. The statistics of the local variance are fitted with a Gamma distribution and the scale is estimated as the mean of the Gamma distribution. The classification is performed by detecting the modes in the resulting pdf. The proposed algorithm is applied to symbol classification in corrupted modulated signals and to classifying vector fields of surface normals according to their orientation.

7. REFERENCES

[1] B.W. Silverman, Density Estimation for Statistics and Data Analysis, Chapman & Hall, 1986.

[2] S.J. Roberts, "Parametric and non-parametric unsupervised cluster analysis," Pattern Recognition, vol. 30, no. 2, pp. 261-272, 1997.

[3] D. Comaniciu, "An algorithm for data-driven bandwidth selection," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 25, no. 2, pp. 281-288, 2003.

[4] D. Comaniciu and P. Meer, "Distribution free decomposition of multivariate data," Pattern Analysis and Applications, vol. 2, no. 1, pp. 22-30, 1999.

[5] S. Gasiorowicz, Quantum Physics, J. Wiley, 1996.

[6] D. Horn and A. Gottlieb, "Algorithm for data clustering in pattern recognition problems based on quantum mechanics," Physical Review Letters, vol. 88, no. 1, pp. 1-4, 2002.

[7] C.L. Loader, "Bandwidth selection: classical or plug-in?," The Annals of Statistics, vol. 27, no. 2, pp. 415-438, 1999.

[8] M.C. Jones, J.S. Marron, and S.J. Sheather, "A brief survey of bandwidth selection for density estimation," Journal of the American Statistical Association, vol. 91, no. 433, pp. 401-407, 1996.

[9] M. Girolami and C. He, "Probability density estimation from optimally condensed data samples," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 25, no. 10, pp. 1253-1264, 2003.

[10] N. Nasios and A.G. Bors, "Variational learning of Gaussian mixture models," IEEE Trans. on Systems, Man and Cybernetics - Part B: Cybernetics, vol. 36, no. 4, Aug. 2006.

[11] A.G. Bors, E.R. Hancock, and R.C. Wilson, "Terrain analysis using radar shape-from-shading," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 25, no. 8, pp. 974-992, 2003.

[12] R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, J. Wiley, 2000.
