Variational Bayes Approach for Classification of Points in Superpositions of Point Processes

Tuomas Rajala a,∗, Claudia Redenbach b, Aila Särkkä a, Martina Sormani b,c

a Chalmers University of Technology and University of Gothenburg, Department of Mathematical Sciences, 412 96 Gothenburg, Sweden
b University of Kaiserslautern, Mathematics Department, 67653 Kaiserslautern, Germany
c Fraunhofer Institut für Techno- und Wirtschaftsmathematik, Fraunhofer-Platz 1, 67663 Kaiserslautern, Germany

Abstract

We investigate the problem of classifying superpositions of spatial point processes. In particular, we are interested in realisations formed as a superposition of a regular point process and a Poisson point process. The aim is to decide which of the two processes each point belongs to. Recently, a Markov chain Monte Carlo (MCMC) approach was suggested by Redenbach et al. (2015), which, however, is computationally heavy. In this paper, we introduce a method based on a variational Bayes approximation and compare its performance to that of a slightly refined version of the MCMC approach.

Keywords: spatial point process, superposition, Bayesian inference, Markov chain Monte Carlo, noise detection

1. Introduction

Classification of points in a superposition of spatial point processes has not been studied very widely. Most of the studies in the literature consider the situation where the data are generated by two Poisson processes with different intensities and where minefield detection is in focus. The area of interest is typically divided into two parts, one with low intensity in which only noise is present and the other with a higher intensity containing both mines and noise (see Allard and Fraley, 1997; Byers and Raftery, 1998; Dasgupta and Raftery, 1998; Cressie and Lawson, 2000; Stanford and Raftery, 2000). After detecting the boundaries of the minefield, the points in this area can be classified as either mines or noise. Typically, this classification is based on Bayesian approaches which allow one to estimate the posterior probability of each point being a mine.

There are also some studies where a regular process is superimposed with Poisson noise. Walsh and Raftery (2002) consider mines located in parallel lines mixed with Poisson noise. A test based on partial Bayes factors for testing the Poisson hypothesis against a Strauss process superimposed with Poisson noise is presented by the same authors, see Walsh and Raftery (2005). Furthermore, having a glaciological problem in mind, Redenbach et al. (2015) introduce an MCMC approach to classify the points in a superposition of a Strauss and a Poisson process using some of the ideas of Walsh and Raftery (2002). This method seems to classify the points quite well in the sense that Ripley's K function for the classified Strauss process does not show any significant deviation from that of the original process. In practice, the parameters of the superposition model need to be estimated simultaneously with the classification of the points. Even though the quality of the parameter estimates is not that good, the classification still seems to work.

In this paper, we will introduce an alternative approach based on a variational Bayes approximation. We will restrict ourselves to superpositions of two stationary and isotropic point processes, where one of the processes is a pairwise interaction process and the other one is a Poisson process. We will compare the performance of the new variational Bayes classification approach to the performance of the MCMC approach. We will also recall a classification method based on distances to the nearest neighbours, see Byers and Raftery (1998). This method was introduced for a superposition of two Poisson processes but, according to the authors, it works even when one of the processes is regular.

∗ Corresponding author. [email protected]


The paper is organized as follows. First, we recall the two methods that can be found in the literature, namely the method based on nearest neighbour distances and the MCMC approach. Then, we introduce the new approach based on a variational Bayes approximation. Furthermore, we perform a simulation study in 2D to compare the new method to the existing ones in the case where the regular process is a Strauss process.

2. Model specification

We assume that we observe a point pattern x consisting of n points contained in a study region A ⊂ R^d. The point pattern x is interpreted as a superposition of two point patterns x_0 and x_1 consisting of n_0 and n_1 points, respectively. We restrict ourselves to the case where the pattern x_0 is a realisation of a Poisson process Ξ_0 and x_1 a realisation of a regular process, here a Strauss process Ξ_1. Note, however, that the Strauss process could be replaced by any pairwise interaction Gibbs point process. Furthermore, if processes other than regular ones were of interest, a more general class of models could be considered. The two processes are assumed to be stationary, isotropic, and independent of each other. In applications we often assume that Ξ_1 models the locations of interest while Ξ_0 gives the locations of noise points.

The distributions of both Ξ_0 and Ξ_1 are determined by densities with respect to the distribution of a Poisson process with unit intensity. The Poisson process Ξ_0 has the density

f_0(x_0) = e^{−(λ_0 − 1)|A|} λ_0^{n_0},

where λ_0 > 0 is the intensity and |A| is the volume of the study region A. The density of the Strauss process Ξ_1 is

f_1(x_1) = α β^{n_1} γ^{s_r(x_1)},

where s_r(x_1) is the number of pairs of points in x_1 with distance less than or equal to r, and α is the normalizing constant. This density depends on three parameters: β > 0 is a parameter governing the intensity, γ ∈ [0, 1] is the strength of interaction, and r > 0 is the range of interaction. It is well known that it is difficult to estimate γ and r simultaneously. Hence, we assume r to be known, so that the parameters of the full model x = x_0 ∪ x_1 are θ = (λ_0, β, γ). The intensity λ_1 of the Strauss process depends on the parameters β, r and γ and can be approximated using the Poisson saddlepoint approximation for stationary Gibbs point processes introduced by Baddeley and Nair (2012).

Furthermore, we introduce a latent variable Z ∈ {0, 1}^n to indicate which of the points in x are real and which are noise. We define

Z_i = \begin{cases} 1 & \text{if the } i\text{th point belongs to } Ξ_1, \\ 0 & \text{if the } i\text{th point belongs to } Ξ_0. \end{cases}
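The superposition model above can be simulated directly with the spatstat package; the following sketch is only illustrative, and the parameter values are arbitrary (they are not the settings of the simulation study in Section 5).

```r
# Illustrative simulation of the model: a Strauss pattern x1 superimposed with
# Poisson noise x0 on the unit square. The latent labels Z correspond to which
# component each point of the superposition came from.
library(spatstat)

set.seed(42)
W  <- square(1)
x1 <- rStrauss(beta = 100, gamma = 0.1, R = 0.09, W = W)  # regular component (Xi_1)
x0 <- rpoispp(lambda = 10, win = W)                       # Poisson noise (Xi_0)
x  <- superimpose(x1, x0)                                 # observed pattern; labels are lost
```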
3. Classification methods

We recall two methods found in the literature that can be used to classify the points of a superposition of a Poisson and a regular point process.

3.1. Method based on the kth nearest neighbours

Byers and Raftery (1998) classify points in a superposition of two homogeneous point processes using the distance M_k to the kth nearest neighbour of a point. For a homogeneous Poisson process the distribution of M_k is known. For two superimposed Poisson processes, Byers and Raftery (1998) postulate that the distribution of M_k is approximately a mixture of the distributions of M_k for two homogeneous Poisson processes. A superposition of two processes, a noise process and a feature process, is considered. The noise points are distributed as a homogeneous Poisson point process with intensity λ_0, and the feature points as a Poisson point process with intensity λ_1 restricted to a certain part of the observation window and overlaid on the noise. This gives a Poisson point process with piecewise constant intensity. An EM algorithm is developed to classify each point as feature or noise and to estimate the two intensities λ_0 and λ_1 as well as the mixture parameter p. The missing data are the classifications Z_i into the two process components. The procedure is described in Algorithm 1 below.


Algorithm 1 nnclean algorithm
1: Choose the value of k.
2: Compute the distance M_k to the kth nearest neighbour for each point in the data.
3: Apply the EM algorithm to these distances to estimate λ_0, λ_1, and p.
4: Use the estimate of p to classify the points by a thresholding method.
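The spatstat implementation of this procedure (the nnclean function mentioned in the next paragraph) can be called roughly as in the following sketch; the toy pattern, the choice k = 5 and the plotting are our own, and the exact format of the returned marks is as documented in spatstat.

```r
# Sketch of Algorithm 1 via spatstat's nnclean on a toy superposition.
library(spatstat)

set.seed(1)
X <- superimpose(rStrauss(beta = 100, gamma = 0.1, R = 0.09, W = square(1)),
                 rpoispp(30))          # Strauss pattern plus Poisson noise, unit square

d_k <- nndist(X, k = 5)                # step 2: distances to the 5th nearest neighbour
Xc  <- nnclean(X, k = 5)               # steps 3-4: EM fit of the mixture and classification;
                                       # the classification information is stored in the marks
plot(Xc)
```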

The method is implemented in the R package spatstat (Baddeley and Turner, 2005) as the function nnclean. It works well when the feature intensity is at least twice the noise intensity. Byers and Raftery (1998) mention that, in addition to superpositions of two Poisson processes, the method works well even for superpositions of a regular process and a Poisson process. However, such results were not reported in the paper.

3.2. MCMC method

Having a glaciological problem in mind, Redenbach et al. (2015) constructed an MCMC algorithm that classifies each point of a superposition of a Strauss and a Poisson pattern as belonging to one of the two patterns. In the most general case of the approach, we can both classify the points and estimate the parameters θ = (λ_0, β, γ) of the model (Poisson intensity and Strauss parameters). In this case, the parameters in θ and Z are updated in turns.

For the study presented in this paper, we slightly modified the approach presented by Redenbach et al. (2015) to improve the results of the parameter estimation and to make the setting comparable to the variational Bayes method introduced below. The first modification concerns the prior distributions of the parameters in θ. Here, we use lognormal prior distributions, while the prior distributions in Redenbach et al. (2015) were uniform. The second modification concerns the proposal distribution of γ in the Metropolis-Hastings algorithm. We decided to sample new values of γ from the prior distribution of γ itself, while in Redenbach et al. (2015) new values of γ are sampled from a lognormal distribution centered at the current value of the parameter. In combination with the lognormal priors this modification resulted in improved estimates of the parameter γ.

Finally, the simulation study presented below requires an automatic procedure for classifying the points according to their posterior probabilities of being a Strauss point. Redenbach et al. (2015) discovered a particular structure of the posterior probabilities depending on the arrangement of the points. Ideally, one would expect to find clusters of low and high posterior probabilities corresponding to Poisson and Strauss points, respectively. However, it was observed that points organised in r-close pairs form an additional cluster with intermediate posterior probabilities. Having this observation in mind, we proceed in the following way. First, we compute the maximum number of r-close neighbours of any point in the pattern. If this maximum is 0, the pattern is considered a realisation of a hardcore process, so all points should be classified as Strauss points. If the maximum is 1, we cluster the posterior probabilities into two clusters, and in all other cases into three clusters. The middle cluster in the three-cluster case, and the lowest cluster in the two-cluster case, should correspond to the r-close pairs. With a probability depending on the estimated θ, we either classify both points in such pairs as Strauss points or one of them as Strauss and the other as Poisson, see Redenbach et al. (2015). Given the number of clusters, clustering is performed using the function fanny, available in the R package cluster, which turned out to perform better than the k-means clustering considered in Redenbach et al. (2015). The main idea of the MCMC approach is summarized in Algorithm 2 below.
Algorithm 2 MCMC algorithm
1: Sample initial parameters λ_0, β, and γ from the predefined prior distributions, fix the range r > 0.
2: Sample n_0 from the binomial distribution with parameters n and λ_0/(λ_0 + λ_1) and set n_1 = n − n_0.
3: Sample the n_0 initial noise points from the n points of x, each point being chosen with the same probability. The remaining data points are the initial true points. A selection is only accepted if the (unnormalized) density of the resulting configuration (x_0, x_1) is not zero.
4: Choose values for the number of recorded iterations N and the number of burn-in iterations M.
5: Start with the initial values of (θ, Z).
6: while the number of repetitions is less than N + M do
7:   Propose an update (θ′, Z′) from a proposal density Q(θ′, Z′ | θ, Z).
8:   Accept the proposal with probability α(θ′, Z′ | θ, Z) = min(1, H(θ′, Z′ | θ, Z)), where H is the Hastings ratio.
9:   In case of acceptance set (θ, Z) = (θ′, Z′).
10: end while
11: Discard the first M iterations. The remaining sample (θ^{M+1}, Z^{M+1}), ..., (θ^{M+N}, Z^{M+N}) is considered a sample from the posterior distribution of (θ, Z) given x. One may also consider recording only every kth (k > 1) value of this sample.
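As a rough illustration of the label-updating part of this algorithm, the following base-R sketch runs single-flip Metropolis-Hastings updates of Z for fixed parameters θ = (λ_0, β, γ). It is not the authors' sampler: θ is not updated, the proposal is a simple symmetric label flip, and the clustering-based post-processing described above is omitted; all function and variable names are our own.

```r
# Simplified sketch: Metropolis-Hastings updates of the labels z only (1 = Strauss,
# 0 = Poisson noise), with theta held fixed. Terms not depending on z are dropped
# from the log density since they cancel in the Hastings ratio.
log_unnorm_density <- function(xy, z, lambda0, beta, gamma, r) {
  n0 <- sum(z == 0); n1 <- sum(z == 1)
  x1 <- xy[z == 1, , drop = FALSE]
  s_r <- if (n1 > 1) sum(dist(x1) <= r) else 0      # r-close pairs in the Strauss part
  n0 * log(lambda0) + n1 * log(beta) + s_r * log(gamma)
}

mcmc_labels <- function(xy, lambda0, beta, gamma, r, n_iter = 10000) {
  n <- nrow(xy)
  z <- rbinom(n, 1, 0.5)                            # arbitrary initial labels
  keep <- matrix(NA_integer_, n_iter, n)
  cur <- log_unnorm_density(xy, z, lambda0, beta, gamma, r)
  for (it in seq_len(n_iter)) {
    i <- sample.int(n, 1)                           # propose flipping one label
    z_prop <- z; z_prop[i] <- 1 - z[i]
    prop <- log_unnorm_density(xy, z_prop, lambda0, beta, gamma, r)
    if (log(runif(1)) < prop - cur) { z <- z_prop; cur <- prop }  # symmetric proposal
    keep[it, ] <- z
  }
  burn <- seq_len(n_iter %/% 2)                     # discard the first half as burn-in
  colMeans(keep[-burn, , drop = FALSE])             # estimated P(z_i = 1 | x, theta)
}
```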

4. Classification by variational Bayes approach

The posterior probabilities of the indicator variables Z_i can also be computed directly using suitable approximation techniques. Both the Strauss model and the Poisson model belong to the class of Gibbs models, and the joint model, conditional on the indicator variable Z = z, is also a Gibbs model. As Gibbs models are exponential family models, computationally inexpensive maximum likelihood estimates of Gibbs models can be derived using pseudo-likelihood estimation equations.

The algorithm presented in Section 4.4 relies on three consecutive approximations. Recall that the main problem is the estimation of the posteriors P(z_i = 1 | x). Since Z = z is unobserved, the standard non-MCMC tool for its estimation is the expectation-maximization (EM) algorithm.

However, EM cannot be applied directly here since the E-step functions are not independent. We therefore use the more general mean field variational approximation, for which we need an analytically manageable joint likelihood function; this is obtained by the two approximations presented in the following Sections 4.1 and 4.2.

4.1. Logistic pseudo-likelihood approximation to Gibbs models

Assume for now that the noise/true label vector z is known. Write θ_0 = \log λ_0, θ_1 = [\log β, \log γ]^T = [θ_{11}, θ_{12}]^T and θ = [θ_0, θ_1^T]^T for the parameters in canonical form. The corresponding form of the Strauss process density is

f_1(x_1 | θ_1) ∝ \exp[θ_1^T t_1(x_1)],

where t_1(x_1) = [n_1, \sum_{i ≠ j} 1(\|x_i − x_j\| < r)]^T for a fixed range parameter r > 0 and \|x_i − x_j\| is the distance between x_i and x_j. The normalizing constant of f_1 is intractable as it involves an integral over all possible point configurations. The density of the Poisson noise can be written likewise as f_0(x_0 | θ_0) ∝ \exp[θ_0 t_0(x_0)] with t_0(x_0) = n_0. Even though f_0 can be normalized, the exponential form is more convenient. We assume independent noise, and therefore the superposition x = x_0 ∪ x_1 has the density

f(x | θ) ∝ \exp\Big( θ_0 n_0 + θ_{11} n_1 + θ_{12} \sum_{i ≠ j:\, x_i, x_j ∈ x_1} 1(\|x_i − x_j\| < r) \Big).

This density cannot be used directly for inference as the normalization is not available. Hence, one typically considers conditional densities, defined as

λ_l(u | x_l, θ_l) := \frac{f_l(x_l ∪ u | θ_l)}{f_l(x_l \setminus u | θ_l)}, \qquad l = 0, 1, \; u ∈ A.

In the case of Gibbs processes, these have the simple form λ_l(u | x_l, θ_l) = \exp[θ_l^T t_l(u; x_l)], where t_l(u; x_l) := t_l(x_l ∪ u) − t_l(x_l \setminus u).

We adopt an approximation of the Gibbs likelihood around its mode given by Baddeley et al. (2014). Here, we only recall the key parts needed for the bivariate case. Let D_0 and D_1 be two sets of generated dummy points with intensities ρ_0 > 0 and ρ_1 > 0, respectively. We chose to simulate the dummy points from Poisson processes (some other models are discussed in Baddeley et al. (2014)). After the dummy point generation, let m_l = |D_l ∩ A| for l = 0, 1 denote the dummy point counts in the observation area A, and write N = n_0 + n_1 + m_0 + m_1 for the total number of data and dummy points. Let u_i ∈ x_0 ∪ x_1 ∪ D_0 ∪ D_1 stand for a generic data or dummy point, and split the indices 1, ..., N such that I_l = {i : u_i ∈ x_l ∪ D_l}. Define the binary indicator variables y_i^l := 1(u_i ∈ x_l), l = 0, 1, i = 1, ..., N, which track the data points of type l, and write y_l for the vector of the y_i^l's with i ∈ I_l. We approximate the densities f_l by the logistic estimation approximation described in Baddeley et al. (2014), Section 3, namely

f_l(x_l | θ_l) ≈ \prod_{i ∈ I_l} π_{li}^{y_i^l} (1 − π_{li})^{1 − y_i^l} =: \tilde f_l(y_l | θ_l),

where

π_{li} = \frac{\exp[θ_l^T t_l(u_i; x_l) + o_{li}]}{1 + \exp[θ_l^T t_l(u_i; x_l) + o_{li}]}

with additional offsets o_{li} := −\log ρ_l. Again, see Baddeley et al. (2014), Section 3, for more details. The joint likelihood of the observed pattern x = x_0 ∪ x_1 given θ and Z = z is then well approximated at the mode by the joint likelihood of the augmented binary variables y = y_0 ∪ y_1,

\tilde f(y | z, θ) = \tilde f_0(y_0 | θ_0)\, \tilde f_1(y_1 | θ_1) = \prod_{j ∈ J} π_j^{y_j} (1 − π_j)^{1 − y_j},   (1)

where j now runs through J := I_0 ∪ I_1 in order and y_j = 1(u_j ∈ x).

Let Φ be an N × N matrix with elements

Φ_{ji} = 1(u_i ∈ x)\, 1(0 < \|u_j − u_i\| < r),   (2)

where the second term comes from the pairwise interaction function of the Strauss process. Therefore, Φ codes the interacting pairs in the data, and can be computed once the data, the dummy points and the range parameter r are fixed. Next, augment the vector z with values for the dummy points such that z_j = 1 for all u_j ∈ D_1 and z_j = 0 for all u_j ∈ D_0, and write \bar z_j = 1 − z_j. Then we can write the "covariates" of the "observations" y_j as

X_j := [\bar z_j t_0(u_j; x_0), \; z_j t_{11}(u_j; x_1), \; z_j t_{12}(u_j; x_1)] = [\bar z_j, \; z_j, \; z_j Φ_{j·} z],

where Φ_{j·} is the jth row of Φ. With the help of the X_j's we can write the logistic likelihood in Equation (1) as

\tilde f(y | z, θ) = \prod_{j ∈ J} \frac{\exp[y_j (X_j θ + o_j)]}{1 + \exp[X_j θ + o_j]}

and in log-scale

\log \tilde f(y | z, θ) = \sum_j \big\{ y_j X_j θ + y_j o_j − \log[1 + \exp(X_j θ + o_j)] \big\}.

In vector notation this reads

\log \tilde f(y | z, θ) = y^T X θ + y^T o − 1_N^T \log[1_N + \exp(Xθ + o)]   (3)

with the "design" matrix X having rows X_j and 1_N being a vector of N ones. Note that in the current implementation of the algorithm we set ρ_0 = ρ_1 = n/|A| (the recommendation in Baddeley et al. (2014) is 2n/|A|). These intensities could be adapted during the run to match the intensities suggested by the current estimates of the z posteriors. Here, we did not need to do that since using the fixed dummy points worked well. Note also that the independent Poisson process assumption for the noise is not restrictive. In fact, both processes can be from the general family of pairwise interaction Gibbs processes, and an additional cross-type pairwise interaction term can be incorporated. This is possible because when non-canonical parameters such as the range are fixed, the matrix Φ is fixed, and X depends on z in a simple enough way to be tractable. In this special case of independent Poisson noise, we can replace the Strauss process by any other pairwise interaction Gibbs process simply by redefining (2).
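To make the bookkeeping above concrete, here is a base-R sketch (our own code and notation, not the authors' implementation) of the dummy-point generation and of the quantities y, Φ, o and X for a given label vector z, with A = [0, 1]^2 assumed:

```r
# Sketch: dummy points D0, D1, the interaction matrix Phi of (2), the offsets o
# and the design matrix X with rows X_j, for data coordinates xy (n x 2) and
# data labels z_data (1 = Strauss, 0 = noise). Unit-square window assumed.
build_design <- function(xy, z_data, r, rho0, rho1) {
  n  <- nrow(xy)
  m0 <- rpois(1, rho0); m1 <- rpois(1, rho1)               # Poisson dummy point counts
  D0 <- matrix(runif(2 * m0), ncol = 2)
  D1 <- matrix(runif(2 * m1), ncol = 2)
  U  <- rbind(xy, D0, D1)                                  # all data and dummy points u_i
  N  <- nrow(U)
  y  <- c(rep(1, n), rep(0, m0 + m1))                      # y_j = 1(u_j is a data point)
  z  <- c(z_data, rep(0, m0), rep(1, m1))                  # augmented labels: D0 -> 0, D1 -> 1
  d  <- as.matrix(dist(U))
  Phi <- (d > 0 & d < r) * matrix(y, N, N, byrow = TRUE)   # Phi[j,i] = 1(u_i in x) 1(0 < d < r)
  o  <- ifelse(z == 1, -log(rho1), -log(rho0))             # offsets o_j = -log(rho_l)
  X  <- cbind(1 - z, z, z * (Phi %*% z))                   # rows X_j = [zbar_j, z_j, z_j Phi_{j.} z]
  list(y = y, z = z, X = X, o = o, Phi = Phi)
}
```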

4.2. Quadratic approximation to the logistic regression likelihood

The approximating log-likelihood (3) is in closed form but not yet manageable enough for the variational approximation. The main culprit is the function \log(1 + e^x), over which it is hard to take the expectation. Assuming still that z is known, the second approximation we make is the tangential function approximation (described in detail in Jaakkola and Jordan (2000)), which uses the lower bound

−\log(1 + e^x) ≥ h(ξ)x^2 − \tfrac{1}{2}x + g(ξ) \qquad \text{for any } ξ > 0, \; x ∈ \mathbb{R},

where h(ξ) = −\tanh(ξ/2)/(4ξ) and g(ξ) = ξ/2 − \log(1 + e^{ξ}) + (ξ/4)\tanh(ξ/2). The parameter ξ is called the "variational parameter", and its value determines how close the lower bound is to the true value. Equality holds when ξ^2 = x^2. Applying this approximation to each term in the log-likelihood (3) with a vector of variational parameters ξ = [ξ_1, ..., ξ_N]^T, we obtain the vector form inequality

−1_N^T \log[1_N + \exp(Xθ + o)] ≥ (Xθ + o)^T H(ξ)(Xθ + o) − \tfrac{1}{2}1_N^T(Xθ + o) + 1_N^T g(ξ)
= θ^T X^T H(ξ)Xθ + 2o^T H(ξ)Xθ + o^T H(ξ)o − \tfrac{1}{2}1_N^T(Xθ + o) + 1_N^T g(ξ),

where H(ξ) = \mathrm{diag}(h(ξ)). The expression is quadratic in θ, so by setting a Gaussian prior θ ∼ N(μ_0, Σ_0), the log of the joint density \tilde f(y, θ | z) = \tilde f(y | z, θ) f(θ) has the closed form lower bound

\log \tilde f(y, θ | z) ≥ −\tfrac{1}{2}θ^T[Σ_0^{−1} − 2X^T H(ξ)X]θ + [(y − \tfrac{1}{2}1_N + 2H(ξ)o)^T X + μ_0^T Σ_0^{−1}]θ + 1_N^T g(ξ) + o^T H(ξ)o + (y − \tfrac{1}{2}1_N)^T o − \tfrac{p}{2}\log(2π) − \tfrac{1}{2}\log|Σ_0| − \tfrac{1}{2}μ_0^T Σ_0^{−1}μ_0 =: \log \tilde f(y, θ | z; ξ),   (4)

where the dimension p of θ is three in our example. The approximate full model log-likelihood, including the prior P(z_i = 1) = p_0, is obtained by adding this prior to (4); see the Appendix. The bound is proportional to a Gaussian density in θ, so given z, θ has an approximately Gaussian posterior distribution with covariance matrix and mean given by

Σ_ξ^{−1} = Σ_0^{−1} − 2X^T H(ξ)X,
μ_ξ = Σ_ξ[(y − \tfrac{1}{2}1_N + 2H(ξ)o)^T X + μ_0^T Σ_0^{−1}]^T,   (5)

which depend on the unknown variational parameter vector ξ. A natural criterion for finding the best variational parameters ξ is to maximize the lower bound of the evidence

\tilde f(y | z; ξ) = \int \tilde f(y, θ | z; ξ)\, dθ = \int \tilde f(y | θ, z; ξ) f(θ)\, dθ.

This criterion is equivalent to minimizing the Kullback-Leibler divergence since \tilde f(y | z; ξ) ≤ \tilde f(y | z) for all ξ. As f(θ) is Gaussian, the integral above can be solved in closed form, yielding

\log \tilde f(y | z; ξ) = \tfrac{1}{2}\log|Σ_ξ| − \tfrac{1}{2}\log|Σ_0| + 1_N^T g(ξ) + \tfrac{1}{2}μ_ξ^T Σ_ξ^{−1}μ_ξ − \tfrac{1}{2}μ_0^T Σ_0^{−1}μ_0 + o^T H(ξ)o + (y − \tfrac{1}{2}1_N)^T o.   (6)

Numerical techniques could now be used to find the ξ that maximizes (6), but Jaakkola and Jordan (2000) proposed to use a simple EM algorithm instead, since it is suitable even when N is large. In this optimization the values of ξ play the role of missing values. With some matrix calculus, in particular the linearity of the trace, it can be shown that the E-step function has the form

Q(ξ′ | ξ) = E_{θ | y, z; ξ} \log \tilde f(y, θ | z; ξ′) = \mathrm{tr}\big( [X(Σ_ξ + μ_ξ μ_ξ^T)X^T + oo^T + 2Xμ_ξ o^T] H(ξ′) \big) + 1_N^T g(ξ′) + \text{const}.

By differentiating the above with respect to ξ′ and solving the equation obtained by setting the derivative to zero for ξ > 0, it can be shown that the optimal values in the M-step are given by

ξ^2 = \mathrm{diag}[X(Σ_ξ + μ_ξ μ_ξ^T)X^T + oo^T + 2Xμ_ξ o^T],

cf. Ormerod and Wand (2010), Section 3.1, with the extension of including the offset terms. By iterating the E and M steps we arrive at the optimal ξ and, with it, the optimal approximation of the posterior of θ given z and x.

4.3. Mean field variational approximation of the posterior probabilities of the latent Z variables

In the previous two sections we derived closed form approximations for the distributions f(x | θ, z) and f(θ | x, z). We now describe the final step and approximate f(z | x). We use the variational Bayes approach to approximate the true posterior of the model by the factorisation

f(z, θ | x) ≈ Q(z, θ) = q(θ) \prod_{j ∈ J} q_j(z_j).

The factorisation of Q is in general arbitrary. Here, we chose this so-called "mean field" factorization as it leads to tractable q_j's. It can be shown (see e.g. Ormerod and Wand (2010), Section 2.1) that the q's minimizing the Kullback-Leibler divergence between the distribution f and the chosen Q need to be of the form

\log q(θ) ∝ E_z \log f(x, θ, z) \quad \text{and} \quad \log q_j(z_j) ∝ E_{θ, z_{−j}} \log f(x, θ, z)

with z_{−j} = \{z_i : i ≠ j\}. The interlocked dependency in the q's leads to iterative updates of each distribution in turn, in an EM manner. Since we do not have the actual densities f available, we replace them with the approximations from the previous sections. Since the grand approximate joint likelihood \tilde f(y, θ, z) is at most quadratic in θ and the z_i's are mere binary variables in the design matrix X, the closed forms for the q's are tractable.
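A minimal base-R sketch (our own code) of this inner loop: the bound functions h and g, the Gaussian posterior (Σ_ξ, μ_ξ) of (5), and the M-step update of ξ, for a fixed design matrix X, i.e. for known z. The initial value ξ = 1 and the fixed number of iterations are arbitrary choices.

```r
h_fun <- function(xi) -tanh(xi / 2) / (4 * xi)                             # h(xi) in the bound
g_fun <- function(xi) xi / 2 - log(1 + exp(xi)) + (xi / 4) * tanh(xi / 2)  # g(xi), enters (6)

update_theta_posterior <- function(X, y, o, mu0, Sigma0, n_em = 25) {
  N  <- nrow(X)
  xi <- rep(1, N)                                              # initial variational parameters
  Sigma0_inv <- solve(Sigma0)
  for (it in seq_len(n_em)) {
    H <- diag(h_fun(xi), N, N)
    Sigma_xi <- solve(Sigma0_inv - 2 * t(X) %*% H %*% X)       # covariance in (5)
    mu_xi <- Sigma_xi %*% t((y - 0.5 + 2 * h_fun(xi) * o) %*% X + t(mu0) %*% Sigma0_inv)
    M   <- Sigma_xi + mu_xi %*% t(mu_xi)
    xi2 <- rowSums((X %*% M) * X) + o^2 + 2 * as.numeric(X %*% mu_xi) * o  # M-step
    xi  <- sqrt(pmax(as.numeric(xi2), 1e-10))
  }
  list(mu = mu_xi, Sigma = Sigma_xi, xi = xi)
}
```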

The factorisation of Q is in general arbitrary. Here, we chose this so called ”mean field” factorization as it leads to tractable qj ’s. It can be shown (see e.g. Ormerod and Wand (2010), Section 2.1) that the q’s minimizing the Kullback-Leibler divergence between the distributions f and the chosen Q need to be of the form log q(θ) ∝ Ez log f (x, θ, z) and log qj (zj ) ∝ Eθ,z−j log f (x, θ, z) with z−j = {zi : i 6= j}. The inter-locked dependency in the q’s leads to iterative updates of each distribution in turn in an EM manner. Since we do not have the actual densities f available, we replace them with the approximations from the previous sections. Since the grand approximate joint likelihood f˜(y, θ, z) is at most quadratic in θ and zi ’s are mere binary variables in the design matrix X, the closed forms for the q’s are tractable. 4.4. Variational Bayes classification algorithm The density approximations involve iterative procedures themselves so the end result is a nested optimization algorithm. Each time the vector z is required in the inner loop, for example in the updates of ξ when optimizing the lower bound (6), we use the densities where we have taken the expectation with respect to z under the current best guess given by the outer loop. The details of the updates are given in Appendix, and the combined approach is summarized in Algorithm 3. After the q’s have converged in the outer loop, lines 5-11, we arrive to the approximate posteriors of interest, P (zi = 1|x) ≈ qi (1). Thresholding these values using 0.5 as the cut-off leads to the classification. We determine the convergence of ξi ’s and pi ’s on lines 5 and 6 by using the absolute difference of the corresponding consecutive estimates. Some tolerance level  > 0 needs to be set, here  = 0.01. Since the logistic regression approximation corresponds to the true likelihood only at the mode, the Σξ is not the true posterior covariance of θ. However, this covariance could be estimated later using ˆ , e.g. by sampling vectors z0 and averaging the asymptotic maximum the posterior class probabilities p likelihood estimates given {yi : zi0 = 1}. We did not pursue this further since the focus here is on classification. Note that the prior in the MCMC approach imposes a (circular) dependency between z and θ and is not tractable here since λ1 , and any good approximation of it, is an unknown non-linear function of β = exp(θ11 ), hindering derivation of the q’s. We therefore set to the zi ’s a much simpler Bernoulli prior P (zi = 1) = p0 which seems to work sufficiently well. 7

Algorithm 3 Variational Bayes algorithm
1: Set priors θ ∼ N(μ_0, Σ_0) and P(z_i = 1) = p_0, fix the range r > 0.
2: Generate dummy points D_1 and D_0 with intensities ρ_1 > 0 and ρ_0 > 0, respectively, from the dummy process.
3: Create the vector y, the interaction matrix Φ and the design matrix X.
4: Initialize an N × 1 non-negative vector ξ, and set P̂(z_i = 1 | x) = p̂_i = p_0 for i = 1, ..., n.
5: while p̂ not converged do
6:   while ξ not converged do
7:     Update ξ using E_θ E_z log f̃(y, θ, z; ξ)
8:     Update Σ_ξ, μ_ξ using E_θ E_z log f̃(y, θ, z; ξ)
9:   end while
10:  Update p̂_i using q(z_i) = E_{θ, z_{−i}} log f̃(y, θ, z; ξ), i = 1, ..., n
11: end while
12: Return (μ_ξ, Σ_ξ, p̂)
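Two small helpers make the convergence checks on lines 5-6 (absolute differences against the tolerance ε = 0.01) and the final thresholding explicit; the posterior values in the example call are made up for illustration.

```r
converged <- function(new, old, eps = 0.01) all(abs(new - old) < eps)

classify <- function(p_hat) as.integer(p_hat > 0.5)   # 1 = Strauss point, 0 = Poisson noise

# example with made-up approximate posteriors q_i(1):
classify(c(0.93, 0.88, 0.12, 0.97, 0.44))
```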

5. Comparison of the methods

We perform a simulation study in which we compare the performance of the developed VB approach and the refined MCMC approach. The nearest neighbour approach was also included in the comparison, but since its performance was clearly worse than that of the other methods, the results are not shown here.

We simulate superpositions of Strauss and Poisson patterns in a unit square with different signal-to-noise ratios and interaction parameters γ. For the signal-to-noise ratio, or noise level, we used the values 10% and 30%, and for γ the values 0.01, 0.1 and 0.5, ranging from very strong regularity to weak regularity. The number of points was 50 in the main simulation study but we also made some experiments with 250 points. The interaction radius r of the Strauss process was fixed to 70% of the maximal range under the hard spheres model: r = 0.09 for λ = 50 and r = 0.04 for λ = 250. We simulate 100 patterns with each combination of γ and noise level. Furthermore, prior sensitivity is analysed by varying the lognormally distributed priors p(θ) between an uninformative prior (P1, low precision for all parameters), a partly informative prior (P2, high precision for γ) and an informative prior (P3, high precision for all parameters). In all cases, the priors are concentrated on the true parameter values.

To compare the performance, we analyse three aspects. For the purpose of discovering the true second order structure, we compute an integrated version of the squared difference between the (estimated) K function of the real Strauss pattern and the K function of the classified Strauss pattern. For the purpose of parameter estimation, we compare the estimates given by the algorithms with the true γ and noise level values. We also fit the Strauss model to the real and to the classified patterns and compare the estimates of the parameter γ.

Results based on the K function are shown in Figure 1, where the integrated squared differences between the K functions of the classified patterns and of the true Strauss patterns are reported as boxplots. Columns 2-4 in each subplot give the performance of the refined MCMC approach, and columns 5-7 the performance of the VB approach. As a reference, we show the difference between the K functions of the true and noisy Strauss patterns in column 1. Noisy patterns should be closer to Poisson patterns than the classified Strauss patterns are. Therefore, one would expect the difference in K functions between a true pattern and the corresponding noisy pattern to be larger on average than the difference between the true pattern and the classified pattern after noise removal. With the largest value γ = 0.5 (independently of which of the two noise levels was used), it is not clear whether the results without noise are better than the noisy results. Even when γ = 0.1 and 10% noise is present one cannot see a clear improvement after the noise has been removed. This is expected since the noisy pattern is very close to the true one due to the low noise level. However, with stronger interaction or more noise, i.e. γ = 0.1 and 30% noise, or γ = 0.01 and 10% or 30% noise, the results are improved after the removal of the noise. When comparing the VB approach and the MCMC approach, one can see that neither of the methods works particularly well with γ = 0.5. The performance of VB2 and VB3 seems better than that of the remaining combinations, but the observed patterns here are too close to Poisson patterns for the classification methods to provide any clear advantage.
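The first criterion, the integrated squared difference between K functions, can be computed with spatstat's Kest; the r-grid and the simple Riemann-sum integration below are our own choices, not necessarily those used for Figure 1.

```r
# Integrated squared difference between the K functions of the true and the
# classified Strauss pattern (both ppp objects).
library(spatstat)

k_discrepancy <- function(X_true, X_class, rmax = 0.25, nr = 256) {
  rr <- seq(0, rmax, length.out = nr)
  K1 <- Kest(X_true,  r = rr, correction = "isotropic")$iso
  K2 <- Kest(X_class, r = rr, correction = "isotropic")$iso
  sum((K1 - K2)^2) * diff(rr)[1]        # Riemann sum over the grid
}
```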


Figure 1: Integrated squared error in K function estimates of classified patterns compared to the true (noiseless) Strauss pattern. In each subfigure, the first column gives the difference between the noisy and true patterns, columns 2-4 the difference between the MCMC classified and true patterns, and columns 5-7 the difference between the VB classified and true patterns using different noise levels and interaction parameter values. The number after the abbreviations MCMC and VB refers to the prior used, 1 referring to the lowest precision and 3 to the highest precision.



In situations where we do see an improvement of VB and MCMC over the noisy case (γ = 0.1 with 30% noise, or γ = 0.01), the variational Bayes approach seems to give better results (better accuracy and smaller variance) than the refined MCMC approach. (When the nearest neighbour approach was used, the difference between the K functions was at least ten times as large as the difference obtained with VB and MCMC.)

We also investigated how well we can estimate the parameter γ and the noise level. In Figure 2, we show the differences between the estimated and true γ values. A logarithmic scale is used to emphasize the differences. For the two larger γ values, 0.1 and 0.5, the estimates tend to be quite good when either the prior P2 or P3 (high precision for γ) is used. With the uninformative prior P1, VB tends to underestimate γ (overestimating interaction) while the MCMC approach overestimates it (underestimating interaction). For γ = 0.01, VB seems to work better (better accuracy and smaller variance) than the MCMC approach. In the latter case, both methods overestimate γ if the uninformative prior P1 is used, while with P2 and P3, γ is underestimated by VB and overestimated by MCMC.

Figure 2: Difference between the γ values produced by the algorithms and the true values in log scale. In each subfigure, columns 1-3 give the difference between the MCMC classified and true patterns and columns 4-6 the difference between the VB classified and true patterns, using different noise levels and interaction parameter values. The number after the abbreviations MCMC and VB refers to the prior used, 1 referring to the lowest precision and 3 to the highest precision.

Next, we estimated γ from the classified Strauss patterns that the algorithms produced and from the corresponding true patterns by using the function ppm in spatstat. The results are shown in Figure 3. All estimates except the ones based on VB1 are quite good for γ = 0.5. The estimates from the MCMC classified Strauss patterns are better than the ones based on the VB classified patterns when γ = 0.1 (independently of the prior) but the other way around for γ = 0.01. In the latter case, the VB estimates are clearly the best ones.

Figure 3: Difference between the ppm estimated γ values of the classified and true patterns in log scale. In each subfigure, the first column gives the difference between the estimates of the noisy and true patterns, columns 2-4 the difference between the MCMC classified and true patterns, and columns 5-7 the difference between the VB classified and true patterns using different noise levels and interaction parameter values. The number after the abbreviations MCMC and VB refers to the prior used, 1 referring to the lowest precision and 3 to the highest precision.

Last, we compared the estimated noise levels to the true ones. The results are shown in Figure 4, where the true noise levels are indicated by horizontal dashed lines. Both methods tend to underestimate the noise level, probably because the noise points that respect the regularity of the Strauss process are difficult to detect, with the exception that VB1 overestimates the noise level when γ = 0.5 and only 10% noise is present. In general, VB seems to work better than the MCMC method, giving more accurate and less varying estimates.

Figure 4: Error in the estimated noise level of the classified patterns. The true noise levels are indicated by horizontal lines. In each subfigure, columns 1-3 give the error in the MCMC classified patterns and columns 4-6 the error in the VB classified patterns using different noise levels and interaction parameter values. The number after the abbreviations MCMC and VB refers to the prior used, 1 referring to the lowest precision and 3 to the highest precision.

We also made some experiments with 250 points and interaction parameter γ = 0.01. In this case the MCMC algorithm is very slow, so we used only the most informative prior P3 and did not use MCMC to estimate γ. For VB, priors P2 and P3 were used and the method was also used to estimate γ. The results were very similar to the case of 50 points. The spatial structure of the MCMC and VB classified patterns was compared to the true Strauss pattern by using the K function. Both the VB and MCMC classified patterns (with both noise levels, 10% and 30%) were closer to the true pattern than the noisy pattern is, and the VB classified patterns tend to be closer to the true pattern than the MCMC patterns are. As we saw earlier with 50 points, VB tends to underestimate γ and therefore to overestimate interaction. (There are no MCMC estimates of γ available.) The ppm function overestimates γ in MCMC classified patterns and underestimates it in VB patterns. The noise level is underestimated by both methods, more and with larger variance by MCMC than by VB.
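For reference, the ppm-based refitting used for Figure 3 and in the 250-point experiments can be carried out along the following lines; the stand-in pattern and the parameter values are illustrative only.

```r
# Fit a stationary Strauss model at the fixed range r and read off the estimate
# of gamma from the interaction coefficient.
library(spatstat)

set.seed(3)
X_class <- rStrauss(beta = 100, gamma = 0.1, R = 0.09, W = square(1))  # stand-in for a classified pattern
fit <- ppm(X_class ~ 1, interaction = Strauss(r = 0.09))
coef(fit)        # intercept = log(beta), interaction coefficient = log(gamma)
exp(coef(fit))   # back-transformed estimates
```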



6. Discussion

We have introduced an approach based on a variational Bayes approximation which classifies the points of a superposition of a regular point process (a Strauss process in our case) and a Poisson point process into two classes. We have conducted a simulation study to compare the performance of the new method to the performance of a refined version of the recently presented MCMC method. We also included the method based on the kth nearest neighbours presented by Byers and Raftery (1998), but its performance was poor and the results were not included in the paper. It is worth mentioning, however, that in Byers and Raftery (1998) the noise points were distributed in the whole study region A while the feature (true) points were distributed only in a subregion of A. Here, both the feature and noise points are distributed in the whole study region.

The performance of the methods was checked by comparing the K functions of the true and classified data as well as the estimates of the interaction parameter γ and the noise level. When comparing the K functions of the true pattern and of the classified Strauss patterns, in the case where the underlying Strauss pattern has weak interaction, both the VB and MCMC methods tend to result in point patterns that


are closer to Poisson patterns than the true patterns are. However, when the Strauss pattern is clearly regular (small γ), the VB approach typically performs better than the MCMC approach. Both the VB and MCMC methods estimate γ quite well. In very regular cases, VB tends to perform better than the MCMC approach. The noise level is typically underestimated by both methods, more so by MCMC than by VB. Note, however, that with only 10% noise and 50 points, only a few noise points are present and they can be very difficult to recognize as noise. Furthermore, if a lot of noise points were present, the noisy patterns would be difficult to distinguish from a Poisson pattern and the classifying algorithms would not have a chance to succeed.

To summarize the results, both methods seem to find the underlying true spatial pattern and estimate the parameters quite well if the Strauss pattern is regular enough and the noise level moderate. However, the VB approach seems to be better than MCMC in highly regular cases. This may partly be due to the fact that VB has a tendency to classify at least one of the points in every close pair (closer together than the interaction radius) as noise, resulting in patterns with no interacting pairs of points. The MCMC approach, on the other hand, tends to leave too many such true-true pairs. The prior distributions that give high precision to the parameter γ lead to better results than the uninformative prior. Therefore, it is beneficial to have good prior information on the interaction parameter. All the priors used in this study are concentrated on the true values of the parameters. Note also that new values for γ in the MCMC algorithm were sampled from the prior distribution. A more thorough sensitivity study would be needed to evaluate the effect of misspecified priors.

One reason for introducing the VB approach was that the MCMC approach is computationally very time consuming. VB, on the other hand, is very fast. While the MCMC approach takes about 10 minutes to run if about 10^6 iterations are used (several days are needed when several patterns are considered), the VB classification with a reasonably low number of points (up to a couple of hundred) needs only a few seconds. However, any type of model can in principle be used in the MCMC approach, while the VB method introduced here is restricted to pairwise interaction models. The mean field approximation distributions needed in VB can get quite complicated and might not be tractable for more complicated models.

Acknowledgements

This work was supported by the Knut and Alice Wallenberg Foundation, the Swedish Foundation for Strategic Research, and the German Research Foundation within the research training group 1932 "Stochastic Models for Innovations in the Engineering Sciences" and grant RE 3002/3-1.

References

D Allard and C Fraley. Nonparametric maximum likelihood estimation of features in spatial point processes using Voronoi tessellation. Journal of the American Statistical Association, 92(440):1485-1493, 1997.

A Baddeley and G Nair. Fast approximation of the intensity of Gibbs point processes. Electronic Journal of Statistics, 6:1155-1169, 2012.

A Baddeley and R Turner. spatstat: An R Package for Analyzing Spatial Point Patterns. Journal of Statistical Software, 12(6):1-42, 2005.

A Baddeley, JF Coeurjolly, E Rubak, and R Waagepetersen. Logistic regression for spatial Gibbs point processes. Biometrika, 101:377-392, 2014.

S Byers and A Raftery. Nearest-neighbor clutter removal for estimating features in spatial point processes. Journal of the American Statistical Association, 1998.

N Cressie and A Lawson. Hierarchical Probability Models and Bayesian Analysis of Mine Locations. Advances in Applied Probability, 32:315-330, 2000.

A Dasgupta and A Raftery. Detecting features in spatial point processes with clutter via model-based clustering. Journal of the American Statistical Association, 93(441):294-302, 1998.

T Jaakkola and M Jordan. Bayesian parameter estimation via variational methods. Statistics and Computing, pages 25-37, 2000.

J Ormerod and M Wand. Explaining variational approximations. The American Statistician, 2010.

C Redenbach, A Särkkä, and M Sormani. Classification of points in superpositions of Strauss and Poisson processes. Spatial Statistics, 12:81-95, 2015.

D Stanford and A Raftery. Finding curvilinear features in spatial point patterns: principal curve clustering with noise. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(6):601-609, 2000.

D Walsh and A Raftery. Detecting mines in minefields with linear characteristics. Technometrics, 2002.

D Walsh and A Raftery. Classification of Mixtures of Spatial Point Processes via Partial Bayes Factors. Journal of Computational and Graphical Statistics, 14(1):139-154, 2005.

Appendix: Update details for Algorithm 3

In the following, we use the shorthand notations

P(z_i = 1 | x) ≈ q_i(1) =: p_i \quad \text{and} \quad \bar p_i = 1 − p_i,

and p and \bar p for the corresponding vectors. Recall the approximate full model log-likelihood

\log \tilde f(y, θ, z; ξ) = −\tfrac{1}{2}θ^T[Σ_0^{−1} − 2X^T H(ξ)X]θ + [(y − \tfrac{1}{2}1_N + 2H(ξ)o)^T X + μ_0^T Σ_0^{−1}]θ + 1_N^T g(ξ) + o^T H(ξ)o + (y − \tfrac{1}{2}1_N)^T o + 1_N^T z \log\frac{p_0}{1 − p_0} + \text{constant},   (7)

which we obtain by adding the prior P(z_i = 1) = p_0 to (4), and the design matrix X = [\bar z \; z \; z · Φz], where · denotes the element-wise product.

Updating the posterior of θ (lines 7 and 8 in Algorithm 3). Taking the expectation with respect to z of the summands in (7) gives

E_z X = [\bar p \; p \; p · Φp] =: \hat X,
E_z[X^T H(ξ)X] = \begin{bmatrix} A & 0 & 0 \\ 0 & B & C \\ 0 & C & D \end{bmatrix} =: G,

where

A = \sqrt{\bar p}^T H(ξ)\sqrt{\bar p} = \bar p^T h(ξ),
B = \sqrt{p}^T H(ξ)\sqrt{p} = p^T h(ξ),
C = p^T H(ξ)(Φp),
D = p^T H(ξ)[(Φp)^2 + (Φ^2)(p − p^2)] =: p^T H(ξ)v,

and we obtain

Σ_ξ^{−1} = Σ_0^{−1} − 2G,
μ_ξ = Σ_ξ[(y − \tfrac{1}{2}1_N + 2H(ξ)o)^T \hat X + μ_0^T Σ_0^{−1}]^T.
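A base-R sketch (our own notation) of these z-expectations, computing X̂ and G from the current probabilities p, the interaction matrix Φ and the variational parameters ξ; squares are taken elementwise as in the formulas above.

```r
expected_moments <- function(p, Phi, xi) {
  h    <- -tanh(xi / 2) / (4 * xi)
  Xhat <- cbind(1 - p, p, p * (Phi %*% p))           # E_z X = [pbar, p, p * (Phi p)]
  A <- sum((1 - p) * h)                              # pbar' h(xi)
  B <- sum(p * h)                                    # p' h(xi)
  C <- sum(p * h * (Phi %*% p))                      # p' H(xi) (Phi p)
  v <- (Phi %*% p)^2 + (Phi^2) %*% (p - p^2)         # elementwise squares
  D <- sum(p * h * v)                                # p' H(xi) v
  G <- matrix(c(A, 0, 0,  0, B, C,  0, C, D), 3, 3)
  list(Xhat = Xhat, G = G)
}
```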

Optimize ξ (line 7 in Algorithm 3). The expectation E_z of the relevant summands of (7) is given above. The EM optimization of the quadratic approximation involves the E-step function

Q(ξ′ | ξ) = E_θ[θ^T E_z(X^T H(ξ′)X)θ + 2o^T H(ξ′)\hat X θ + \mathrm{tr}(oo^T H(ξ′))] + 1_N^T g(ξ′)
          = \mathrm{tr}\big( [M_{11} 1_N \bar p^T + M_{22} 1_N p^T + M_{33} v p^T + 2M_{23}(Φp)p^T + 2\hat X μ_ξ o^T + oo^T] H(ξ′) \big) + 1_N^T g(ξ′),

where M = Σ_ξ + μ_ξ μ_ξ^T, and Q attains its maximum at

ξ^2 = \mathrm{diag}\big( M_{11} 1_N \bar p^T + M_{22} 1_N p^T + M_{33} v p^T + 2M_{23}(Φp)p^T + 2\hat X μ_ξ o^T + oo^T \big).

Update p_i = q_i(1) ≈ P(z_i = 1 | x) (line 10 in Algorithm 3). The relevant expectations of the summands in (7) are

S_{z_i} := E_{θ, z_{−i}} b^T Xθ = E_{z_{−i}} b^T X μ_ξ = F_{z_i} μ_ξ + \text{const},

where F_{z_i} = [\bar z_i b_i \; z_i b_i \; z_i(b_i Φ_{i·}p + Φ_{·i}^T(b · p))] and b = y − \tfrac{1}{2}1_N + 2H(ξ)o, and

T_{z_i} := E_{θ, z_{−i}} θ^T X^T H(ξ)Xθ = E_{z_{−i}} \mathrm{tr}(M X^T H(ξ)X) = \mathrm{tr}(M E_{−i}[X^T H(ξ)X]) + \text{const} = \mathrm{tr}(M J_{z_i}),

where J_{z_i} is the symmetric matrix

J_{z_i} = \begin{bmatrix} \bar z_i h(ξ_i) & 0 & 0 \\ 0 & z_i h(ξ_i) & z_i[h(ξ_i)Φ_{i·}p + Φ_{·i}^T H(ξ)p] \\ 0 & · & z_i[h(ξ_i)(Φ_{i·}p)^2 + (p · Φ_{·i})^T H(ξ)Φ_{·i} + 2p^T H(ξ)Φ\tilde p^{\,i}] \end{bmatrix}

(· mirrors the (2,3) entry) and \tilde p^{\,i} = \{p : p_i = 0\} is the vector p with its ith entry set to zero. Since we must have p_i + \bar p_i = 1,

\hat P(z_i = 1 | x) = σ(u_i), \qquad u_i = \log q(z_i = 1) − \log q(z_i = 0) = T_1 − T_0 + S_1 − S_0 + \log\frac{p_0}{1 − p_0},

where σ(x) = (1 + e^{−x})^{−1}.
