
Kernel Bandwidth Estimation in Methods based on Probability Density Function Modelling

Adrian G. Bors and Nikolaos Nasios
Dept. of Computer Science, University of York, York YO10 5DD, UK
[email protected]

Abstract

In kernel density estimation methods, an approximation of the data probability density function is achieved by locating a kernel function at each data location. The smoothness of the functional approximation and the modelling ability are controlled by the kernel bandwidth. In this paper we propose a Bayesian estimation method for finding the kernel bandwidth. The distribution corresponding to the bandwidth is estimated from distributions characterizing the second order statistics estimates calculated from local neighbourhoods. The proposed bandwidth estimation method is applied in three different kernel density estimation based approaches: scale space, mean shift and quantum clustering. The third method is a novel pattern recognition approach using the principles of quantum mechanics.

1. Introduction

While parametric methods focus on finding appropriate parameter estimates that describe a predefined density function, in kernel density estimation (KDE) the emphasis is on achieving a good estimate of the density function without any underlying model assumption. In KDE the kernel function is centered at each data sample location and defines an influence region around it. A scale parameter, also called the bandwidth or window width, controls the kernel function smoothing over the surrounding space. The scale space method represents the probability density function (pdf) by simply summing the kernel functions for all data [7]. Mean shift is an iterative updating algorithm which employs the local gradient for finding the maxima of the pdf representation [3]. An algorithm which relies on the analogy between the pdf representation and the quantum potential of physical particles was proposed in [4].

The performance of KDE methods crucially depends on the value of the kernel's bandwidth [6, 8, 9, 10]. The bandwidth is responsible for smoothing the resulting KDE representation as well as for defining an appropriate mode localization. The algorithms used for finding the bandwidth can be classified into two categories: quality-of-fit and plug-in methods. The first category uses cross-validation, leaving certain data samples out while approximating the pdf with the sum of kernels located at the remaining data. The plug-in methods calculate the bias in the pdf approximation such that it minimizes the mean integrated square error (MISE) between the real density and its kernel-based approximation [9, 10]. However, since the real density function is unknown, plug-in algorithms require an initial pilot estimate of the bandwidth for an iterative estimation process [5]. These studies show that cross-validation algorithms lead to spurious bumpiness in the estimated density, whereas plug-in methods tend to oversmooth the density function to be approximated.

In this paper we propose a new approach for estimating the bandwidth for KDE methods. Distributions of local variances are evaluated from randomly sampled K-nearest neighbours data sets. A uniform prior is assumed for K, the number of data samples in a neighbourhood, while a Gamma distribution is used for modelling the distribution of local variances. The proposed bandwidth estimation method is used in three KDE methods: scale space, mean shift and quantum clustering. These methods are applied for segmenting modulated signals and local terrain orientation estimated from Synthetic Aperture Radar (SAR) images [1].

Kernel density estimation methods are presented in Section 2. The estimation of the bandwidth parameter is described in Section 3. Experimental results are provided in Section 4 and the conclusions of this study are drawn in Section 5.
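To make the sensitivity to the bandwidth concrete, the following minimal Python sketch counts the modes of a one-dimensional Gaussian KDE for three bandwidth values. The data set, the evaluation grid and the function names `kde` and `count_modes` are hypothetical choices made only for illustration; they are not part of the proposed method.

```python
import math


def kde(x, samples, sigma):
    """Gaussian kernel density estimate at x (1-D) with bandwidth sigma."""
    norm = len(samples) * sigma * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-(x - xi) ** 2 / (2.0 * sigma ** 2)) for xi in samples) / norm


def count_modes(samples, sigma, lo, hi, steps=400):
    """Count strict local maxima of the KDE on a regular grid over [lo, hi]."""
    grid = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    dens = [kde(x, samples, sigma) for x in grid]
    return sum(1 for k in range(1, steps) if dens[k - 1] < dens[k] > dens[k + 1])


# Hypothetical 1-D data: two groups of samples, around 0 and around 5.
samples = [-0.6, -0.2, 0.1, 0.5, 4.4, 4.9, 5.2, 5.6]

modes_small = count_modes(samples, 0.05, -3.0, 8.0)  # bandwidth far too small
modes_mid = count_modes(samples, 0.6, -3.0, 8.0)     # bandwidth matching the group spread
modes_large = count_modes(samples, 5.0, -3.0, 8.0)   # bandwidth far too large
print(modes_small, modes_mid, modes_large)
```

With these samples a very small bandwidth produces one mode per sample (spurious bumpiness), whereas a very large bandwidth merges the two groups into a single mode (oversmoothing).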

2. Kernel density estimation

Various kernel density estimation methods have been proposed, including those based on the scale space [7] and the mean shift [3], which are well known in pattern recognition. In the following we describe in detail a novel KDE method called quantum clustering [4]. In quantum clustering each data sample is associated with a particle that is part of a quantum mechanical system and has a specific field defined around its location. The activation field at a location $X$, calculated from $N$ data samples, is given as in classical KDE by:

$$\psi(X) = \sum_{i=1}^{N} \exp\left[ -\frac{(X - X_i)^2}{2\sigma^2} \right] \tag{1}$$

where $X_i$ are data samples, $i = 1, \ldots, N$, and $\sigma$ represents the bandwidth parameter. The fifth postulate of quantum mechanics states that a quantum system evolves according to the Schrödinger differential equation. The time-independent Schrödinger equation is:

$$H \psi(X) \equiv \left( -\frac{\sigma^2}{2} \nabla^2 + V(X) \right) \psi(X) = E \, \psi(X) \tag{2}$$

where $H$ is the Hamiltonian operator, $E$ is the energy, $\psi(X)$ corresponds to the state of the given quantum system, $V(X)$ is the Schrödinger potential and $\nabla^2$ is the Laplacian. Conventionally, the potential $V(X)$ is given and the equation is solved to find the solutions $\psi(X)$. These are used in quantum mechanics to describe the location of electron orbits in atoms.

In our case we consider the inverse problem: we assume that the locations of the data samples are known and that their state is given by equation (1), which is considered as a solution of (2), subject to the calculation of constants. We want to calculate the resulting potential $V(X)$ created by the quantum system associated with the given data. After substituting $\psi(X)$ from (1) into (2), we solve for the Schrödinger potential of the given set of data samples as [4]:

$$V(X) = E_l + \frac{1}{2\sigma^2 \psi(X)} \sum_{i=1}^{N} (X - X_i)^2 \exp\left[ -\frac{(X - X_i)^2}{2\sigma^2} \right] \tag{3}$$

where $E_l = E - d/2$ and $d$ is the dimension of the data space. From the statistical point of view, the quantum potential formulation can be written as:

$$V(X) = E_l + \frac{\sum_{i=1}^{N} \|X - X_i\|^2 \, P(X|X_i)}{2\sigma^2} \tag{4}$$

where $P(X|X_i)$ is the a posteriori probability for $X$, given the data samples $\{X_i, i = 1, \ldots, N\}$. This expression represents the weighted Euclidean distance from a datum $X$ to the set of given data samples, where the weights are the a posteriori probabilities.
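As an illustration of equations (1), (3) and (4), the following Python sketch evaluates the activation field and the quantum potential for a small set of two-dimensional points. It is a minimal sketch under stated assumptions: the sample coordinates, the value of `sigma`, the default `e_l = 0` and the function names are hypothetical, and the weights in `potential` are shifted by the smallest squared distance purely for numerical stability, which leaves the normalized weights $P(X|X_i)$ unchanged.

```python
import math


def sq_dist(x, y):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def psi(x, samples, sigma):
    """Equation (1): sum of Gaussian kernels centred at the data samples."""
    return sum(math.exp(-sq_dist(x, xi) / (2.0 * sigma ** 2)) for xi in samples)


def potential(x, samples, sigma, e_l=0.0):
    """Equations (3)-(4): E_l plus the posterior-weighted squared distance / (2 sigma^2)."""
    d2 = [sq_dist(x, xi) for xi in samples]
    d2_min = min(d2)
    # Shifting by d2_min avoids underflow; the normalized weights are unaffected.
    weights = [math.exp(-(v - d2_min) / (2.0 * sigma ** 2)) for v in d2]
    weighted = sum(w * v for w, v in zip(weights, d2))
    return e_l + weighted / (2.0 * sigma ** 2 * sum(weights))


# Hypothetical data: two small groups of 2-D samples.
samples = [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.2), (3.0, 3.0), (3.1, 2.8), (2.9, 3.2)]
sigma = 0.5  # assumed bandwidth, chosen by hand for this example

centre = (0.0, 0.0)
midway = (1.5, 1.5)
print(psi(centre, samples, sigma), psi(midway, samples, sigma))
print(potential(centre, samples, sigma), potential(midway, samples, sigma))
```

In this example the potential is lower near a group of samples than midway between the two groups, while the activation field behaves in the opposite way, which is the property used when clusters are associated with minima of $V(X)$.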

3. Statistical estimation of the bandwidth

Estimating the bandwidth in KDE is very important, particularly in applications where we need to detect the local extrema of the pdf. In [4], $\sigma$ was initialized to arbitrary values. Classical statistical estimation methods, such as the quality-of-fit and plug-in methods, are known to provide biased estimates for the bandwidth [5]. In our study we propose a statistical approach for the estimation of the scale $\sigma$ by modelling the distribution of data variance in randomly selected neighbourhoods.

Let us assume a range of neighbourhoods of various sizes $\{K_j, j = 1, \ldots, n\}$. We define a probability density function $P(s|K_j, X)$ that depends on the neighbourhood size $K_j$, and combine these densities as a pseudo-likelihood of the form:

$$P(s|X) = \prod_{j=1}^{n} P(s|X_{K_j}) \tag{5}$$

where the probability $P(s|X_{K_j})$ is considered for the nearest-neighbourhood data sample population of size $K_j$, and $X_{K_j}$ represents a datum from within the $K_j$-nearest neighbourhood. We evaluate:

$$P(s|X_{K_j}) = \int P(s|K_j, X_{K_j}) \, P(K_j|X_{K_j}) \, dK_j \tag{6}$$

and, after applying Bayes' rule, we obtain:

$$P(K_j|X_{K_j}) = \frac{P(X_{K_j}|K_j) \, P(K_j)}{P(X_{K_j})} \tag{7}$$

where $P(X_{K_j}|K_j)$ is the probability of the data sample population depending on the specific neighbourhood size. In the following we consider $P(K_j)$ as a uniform distribution limited to the range $[K_1, K_2]$. For each neighbourhood size $K_j$ we estimate the local variance:

$$s_i = \frac{\sum_{k=1}^{K_j} \|X_{(k)} - X_i\|^2}{K_j} \tag{8}$$

for $i = 1, \ldots, N$, where $K_j < N$ is the cardinality of the data set which defines the chosen neighbourhood, and $\|\cdot\|$ denotes the Euclidean distance between a data sample $X_i$ and another datum $X_{(k)}$ from its neighbourhood.

The local data scale can be represented using a Gamma distribution, whose parameters are estimated from the empirical distribution formed by the data calculated according to (8). The Gamma distribution is a long-tailed distribution which can account for the presence of outliers in the data to be modelled. It depends on two parameters and is given by:

$$P(s|\alpha, \beta) = \frac{\beta^{\alpha} s^{\alpha - 1}}{\Gamma(\alpha)} e^{-\beta s} \tag{9}$$

where $\alpha > 0$ is the shape parameter, $\beta > 0$ is the scale parameter (in the form (9) it acts as an inverse scale, or rate) and $\Gamma(t) = \int_0^{\infty} r^{t-1} e^{-r} \, dr$ represents the Gamma function. The parameters $\alpha$ and $\beta$ are estimated from the empirical distribution of local variances (8). The maximum likelihood approach was proposed in [2] for estimating the Gamma distribution parameters. The likelihood function corresponding to the distribution (9) is:

$$L(\alpha, \beta) = \frac{\beta^{\alpha M}}{\Gamma^{M}(\alpha)} \left[ \prod_{i=1}^{M} s_i \right]^{\alpha - 1} e^{-\beta \sum_{i=1}^{M} s_i} \tag{10}$$

where we consider $M$ data samples $\{s_i \,|\, i = 1, \ldots, M\}$, each representing the variance of a local neighbourhood, calculated according to (8). We differentiate the likelihood with respect to $\alpha$ and $\beta$ and equate the derivatives to zero in order to estimate the Gamma distribution parameters. This results in the following system of equations:

$$\begin{cases} \ln(\hat{\alpha}) - \dfrac{\Gamma'(\hat{\alpha})}{\Gamma(\hat{\alpha})} + \ln\left[ \dfrac{\left( \prod_{i=1}^{M} s_i \right)^{1/M}}{\bar{s}} \right] = 0 \\[2ex] \dfrac{\hat{\alpha}}{\hat{\beta}} = \bar{s} \end{cases} \tag{11}$$

where $\bar{s}$ is the sample mean of the variable $s$, $\bar{s} = \sum_{i=1}^{M} s_i / M$. The third term of the first equation in (11) is the logarithm of the ratio between the geometric and arithmetic means. The first, nonlinear, equation from (11) is solved with respect to $\hat{\alpha}$ using the Newton-Raphson iterative algorithm [2]. This results in the following updating equation:

$$\hat{\alpha}_{t+1} = \hat{\alpha}_t - \frac{\ln(\hat{\alpha}_t) - \Psi(\hat{\alpha}_t) + \ln\left[ \left( \prod_{i=1}^{M} s_i \right)^{1/M} / \bar{s} \right]}{1/\hat{\alpha}_t - \Psi'(\hat{\alpha}_t)} \tag{12}$$

where $\Psi(\hat{\alpha}_t)$ is the Digamma function, which represents the logarithmic derivative of the Gamma function:

$$\Psi(\hat{\alpha}) = \frac{\Gamma'(\hat{\alpha})}{\Gamma(\hat{\alpha})} \tag{13}$$

and $\Psi'$ denotes its derivative. After estimating $\hat{\alpha}$ at the convergence of (12), we substitute it into the second equation from (11) and estimate $\hat{\beta}$. Finally, $\hat{\sigma}$ is estimated from the Gamma distribution (9) by using a sampling procedure.
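The estimation steps of this section can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the implementation used in the experiments: the way neighbourhood centres are drawn, the range `[k_min, k_max]`, the synthetic data, the series-based `digamma` and `trigamma` helpers and the starting value of the Newton-Raphson iteration are all hypothetical choices. The local variances follow (8), with $K$ drawn from a uniform distribution, and the Gamma parameters are obtained from (11) and (12).

```python
import math
import random


def sq_dist(x, y):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def local_variances(data, k_min, k_max, m, rng):
    """Draw m local variances as in (8), with K sampled uniformly in [k_min, k_max]."""
    variances = []
    for _ in range(m):
        k = rng.randint(k_min, k_max)        # uniform prior on the neighbourhood size
        i = rng.randrange(len(data))         # randomly chosen centre X_i (assumption)
        d2 = sorted(sq_dist(data[i], x) for j, x in enumerate(data) if j != i)
        variances.append(sum(d2[:k]) / k)    # mean squared distance to the K nearest
    return variances


def digamma(x):
    """Digamma function for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def trigamma(x):
    """Derivative of the digamma function for x > 0."""
    result = 0.0
    while x < 6.0:
        result += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return result + inv + 0.5 * inv2 + inv * inv2 * (1.0 / 6 - inv2 * (1.0 / 30 - inv2 / 42))


def fit_gamma(s, tol=1e-10, max_iter=100):
    """Maximum likelihood (alpha, beta) for the density (9) via the update (12)."""
    mean = sum(s) / len(s)
    # c = ln(geometric mean / arithmetic mean); negative unless all s_i are equal.
    c = sum(math.log(v) for v in s) / len(s) - math.log(mean)
    alpha = -0.5 / c                         # assumed starting value for the iteration
    for _ in range(max_iter):
        f = math.log(alpha) - digamma(alpha) + c
        step = f / (1.0 / alpha - trigamma(alpha))
        alpha -= step
        if abs(step) < tol:
            break
    return alpha, alpha / mean               # second equation of (11): alpha / beta = mean


rng = random.Random(0)
data = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(200)]  # synthetic data
s = local_variances(data, k_min=10, k_max=100, m=300, rng=rng)
alpha_hat, beta_hat = fit_gamma(s)
mean_s = alpha_hat / beta_hat                # mean of the fitted Gamma distribution
print(alpha_hat, beta_hat, mean_s)
```

By the second equation of (11), the mean of the fitted distribution, `alpha_hat / beta_hat`, coincides with the sample mean of the local variances. How a bandwidth value is then drawn from the fitted distribution is left open here, since the sampling procedure is not detailed in this section.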

4. Experimental results

The proposed methodology of bandwidth selection is embedded into three different KDE based methods: scale space, mean shift and quantum clustering. In the first example we consider 8-PSK (phase-shift keying) modulated signals, which result in 8 clusters whose centers are located radially at equal angles from each other. The perturbation channel equations for 8-PSK signals assuming interference are:

$$x_I(t) = I(t) + 0.2 I(t-1) - 0.2 Q(t) - 0.04 Q(t-1)$$

$$x_Q(t) = Q(t) + 0.2 Q(t-1) + 0.2 I(t) + 0.04 I(t-1)$$

where $(x_I(t), x_Q(t))$ are the in-phase and in-quadrature signal components at time $t$ on the communication line, and $I(t)$ and $Q(t)$ correspond to the signal symbols. We also consider additive Gaussian noise with SNR = 22 dB. We have generated 960 signals, assuming equal probabilities for all intersymbol combinations. Neighbourhoods of sizes in a specified range are assumed, Gamma distributions of distances from data to K-neighbourhoods are formed, and their mean is used as an estimate for $\hat{\sigma}$. After estimating the density by KDE, we segment the resulting potential function into the component clusters, each associated with a signal in this example.

Table 1 provides the bias and confidence intervals in blind detection of modulated signals for 8-PSK when estimating the locations of cluster centers. We calculate the mean square error between the detected centers and the sample mean ($MSE_{SM}$) as well as the mean square error between the cluster centers and the ground truth centers ($MSE_{O}$). The number of clusters was always estimated correctly as 8. As can be observed from Table 1, quantum clustering provides better results than the other methods.

Method          MSE_SM             MSE_O
Scale space     0.0024 ± 0.0007    0.0041 ± 0.0007
Mean shift      0.0168 ± 0.0097    0.0200 ± 0.0109
Quant. clus.    0.0009 ± 0.0003    0.0022 ± 0.0004

Table 1. Blind detection of 8-PSK modulated signals.

In another application we consider Synthetic Aperture Radar (SAR) images representing terrain information, as shown in Figures 1(a) and 1(b) for images from Wales and of Ayers rock, respectively. We want to identify various topographic regions in these images by clustering the local surface orientation. In [1] a surface orientation estimation method was proposed for SAR images of terrain. In the data used in this study the estimated surface normals have been smoothed using a curvature consistency approach. The Wales image data consists of the (x, y) coordinates for 1518 local estimates of surface normals. The histogram of local neighbourhood distances for $K_j \in [N/20, N/2]$, where $N$ is the data size, is shown when fitted to a Gamma distribution in Figures 2(a) and 2(b) for the Wales and Ayers rock images, respectively. The potential $\psi(z)$ from (1) is shown in Figure 3(a), while the quantum potential $V(z)$ from (3) is displayed in Figure 3(b).

The vector field of surface normals is segmented based on the vector orientation similarity. Each segmented region corresponds to a local maximum in $\psi(z)$ and to a local minimum in $V(z)$. The segmentation of the Wales and Ayers rock SAR images into topographic regions, based on the orientation of the local surface normals, is shown in Figures 4(a) and 4(c), respectively, for the mean shift method and in Figures 4(b) and 4(d), respectively, for the quantum potential. Both methods use the identical kernel bandwidth, as shown in the plots from Figures 2(a) and 2(b) for the Wales and Ayers rock images, respectively. Overlapped on these images are the vector fields of surface normals as estimated and smoothed in [1]. It can be observed that the segmentation fits well with the actual terrain features from both images.

Figure 1. SAR images of terrain. (a) Wales. (b) Ayers rock.

Figure 2. Bandwidth estimation: P(s) plotted against s. (a) Wales. (b) Ayers Rock. The plots are annotated with σ = 0.24 and σ = 0.18.

Figure 3. Potential function for Wales data, plotted over (z1, z2). (a) Scale space, ψ(Z). (b) Quantum potential, V(Z).

Figure 4. Topographical segmentation of SAR images from Wales and Ayers rock. (a) Mean shift. (b) Quantum clustering. (c) Mean shift. (d) Quantum clustering.

5. Conclusions

This study provides a new methodology for kernel bandwidth estimation applied to the following methods: scale space, mean shift and quantum clustering. Quantum clustering uses the analogy between data clustering and the probability of localizing particles on orbits as defined in quantum mechanics. A Bayesian approach is proposed for estimating the bandwidth by modelling distributions of variances of localized data subsets. A uniform distribution is considered for modelling the neighbourhood size. After sampling the neighbourhood size value K, we calculate variances of K-nearest neighbours. We fit a Gamma distribution to the statistics of variances for these neighbourhoods. The Gamma distribution parameters are estimated by maximizing the log-likelihood. The kernel bandwidth is then estimated from the resulting Gamma distribution. The proposed methodology is applied in blind detection and for topography segmentation of radar images of terrain.

References

[1] A. G. Bors, E. Hancock, and R. Wilson. Terrain analysis using radar shape-from-shading. IEEE Trans. on Pattern Analysis and Machine Intelligence, 25(8):974–992, 2003.
[2] S. C. Choi and R. Wette. Maximum likelihood estimation of the parameters of the gamma distribution and their bias. Technometrics, 11(4):683–690, 1969.
[3] D. Comaniciu and P. Meer. Mean shift: a robust approach toward feature space analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence, 24(5):603–619, 2002.
[4] D. Horn and A. Gottlieb. Algorithm for data clustering in pattern recognition problems based on quantum mechanics. Physical Review Letters, 88(1):1–4, 2002.
[5] M. Jones, J. Marron, and S. Sheather. A brief survey of bandwidth selection for density estimation. J. of the American Statistical Association, 91(433):401–407, 1996.
[6] C. Loader. Bandwidth selection: classical or plug-in? The Annals of Statistics, 27(2):415–438, 1999.
[7] S. J. Roberts. Parametric and non-parametric unsupervised cluster analysis. Pattern Recognition, 30(2):261–272, 1997.
[8] B. W. Silverman. Density estimation for statistics and data analysis. Chapman and Hall, 1986.
[9] H. Wang and D. Suter. Robust adaptive-scale parametric model estimation for computer vision. IEEE Trans. on Pattern Analysis and Machine Intelligence, 26(11):1459–1474, 2004.
[10] L. Yang and R. Tschernig. Multivariate bandwidth selection for local linear regression. Journal of Royal Stat. Soc., Series B, 61(4):793–815, 1999.