Learning mixtures by simplifying kernel density estimators

Olivier Schwander⋆ and Frank Nielsen⋆†

⋆ Laboratoire d'Informatique, École Polytechnique, Palaiseau, France
† Sony Computer Science Laboratories Inc., Tokyo, Japan
{schwander,nielsen}@lix.polytechnique.fr

Abstract. Gaussian mixture models are a widespread tool for modeling various and complex probability density functions. They can be estimated by various means, often using Expectation-Maximization or Kernel Density Estimation. In addition to these well-known algorithms, new and promising stochastic modeling methods include Dirichlet Process mixtures and k-Maximum Likelihood Estimators. Most of these methods, including Expectation-Maximization, lead to compact models but may be expensive to compute. On the other hand, Kernel Density Estimation yields large models which are computationally cheap to build. In this paper we present new methods to obtain high-quality models that are both compact and fast to compute. This is accomplished by simplifying a Kernel Density Estimator. The simplification is a clustering method based on k-means-like algorithms. Like all k-means algorithms, our method relies on divergences and centroid computations, and we use two different divergences (and their associated centroids): Bregman and Fisher-Rao. Along with the description of the algorithms, we describe the pyMEF library, a Python library designed for the manipulation of mixtures of exponential families. Unlike most other existing tools, this library allows the use of any exponential family instead of being limited to a particular distribution. This generality makes it possible to rapidly explore the different available exponential families in order to choose the one best suited for a particular application. We evaluate the proposed algorithms by building mixture models on examples from a bio-informatics application. The quality of the resulting models is measured in terms of log-likelihood and of Kullback-Leibler divergence.

Keywords: Kernel Density Estimation, simplification, Expectation-Maximization, k-means, Bregman, Fisher-Rao

1 Introduction

Statistical methods are nowadays commonplace in modern signal processing. There are basically two major approaches for modeling experimental data by probability distributions: we may either consider a semi-parametric modeling by a finite mixture model learnt using the Expectation-Maximization (EM) procedure, or alternatively choose a non-parametric modeling using a Kernel Density Estimator (KDE).

On the one hand, mixture modeling requires fixing or learning the number of components but provides a useful compact representation of data. On the other hand, a KDE finely describes the underlying empirical distribution at the expense of a dense model size. In this paper, we present a novel statistical modeling method that efficiently simplifies a KDE model with respect to an underlying distance between Gaussian kernels. We consider the Fisher-Rao metric and the Kullback-Leibler divergence. Since the underlying Fisher-Rao geometry of Gaussians is hyperbolic without a closed-form equation for the centroids, we rather adopt a close approximation that bears the name of hyperbolic model centroid, and show its use in a single-step clustering method. We report on our experiments that show that the KDE simplification paradigm is a competitive approach over the classical EM, in terms of both processing time and quality.

In Section 2, we present generic results about exponential families: definition, Legendre transform, various forms of parametrization and associated Bregman divergences. These preliminary notions allow us to introduce the Bregman hard clustering algorithm for the simplification of mixtures. In Section 3, we present the mixture models and we briefly describe some algorithms to build them. In Section 4, we introduce tools for the simplification of mixture models. We begin with the well-known Bregman Hard Clustering and present our new tool, the Model Hard Clustering [23], which makes use of an expression of the Fisher-Rao distance for the univariate Gaussian distribution. The Fisher-Rao distance is expressed using the Poincaré hyperbolic distance and the associated centroids are computed with model centroids. Moreover, since an iterative algorithm may be too slow in time-critical applications, we introduce a one-step clustering method which consists in removing the iterative part of a traditional k-means and taking only the first step of the computation. This method is shown experimentally to achieve the same approximation quality (in terms of log-likelihood) at the cost of a small increase in the number of components of the mixtures. In Section 5, we describe our new software library pyMEF, aimed at the manipulation of mixtures of exponential families. The goal of this library is to unify the various tools used to build mixtures, which are usually limited to one kind of exponential family. The use of the library is further explained with a short tutorial. In Section 6, we study experimentally the performance of our methods through two applications. First we give a simple example of the modeling of the intensity histogram of an image, which shows that the proposed methods are competitive in terms of log-likelihood. Second, a real-world application in bio-informatics is presented where the models built by the proposed methods are compared to reference state-of-the-art models built using Dirichlet Process Mixtures.

2 Exponential families

2.1 Definition and examples

A wide range of usual probability density functions belongs to the class of exponential families: the Gaussian distribution but also the Beta, Gamma and Rayleigh distributions and many more. An exponential family is a set of probability mass or probability density functions admitting the following canonical decomposition:

$$p(x; \theta) = \exp\big(\langle t(x), \theta\rangle - F(\theta) + k(x)\big) \qquad (1)$$

with
– t(x) the sufficient statistic,
– θ the natural parameters,
– ⟨·, ·⟩ the inner product,
– F the log-normalizer,
– k(x) the carrier measure.

The log-normalizer characterizes the exponential family [5]. It is a strictly convex and differentiable function which is equal to:

$$F(\theta) = \log \int_x \exp\big(\langle t(x), \theta\rangle + k(x)\big)\, \mathrm{d}x \qquad (2)$$
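For illustration, the Rayleigh distribution mentioned above fits this decomposition with t(x) = x², θ = −1/(2σ²), F(θ) = −log(−2θ) and k(x) = log x. The sketch below is a minimal NumPy/SciPy check of Eqs. (1) and (2) for this family; the function names are ours and it is independent of the pyMEF library.

```python
# Canonical decomposition (Eq. (1)) and log-normalizer (Eq. (2)) for the
# Rayleigh family: t(x) = x^2, theta = -1/(2 sigma^2), F(theta) = -log(-2 theta),
# k(x) = log x.  Minimal illustrative sketch, not pyMEF code.
import numpy as np
from scipy.integrate import quad
from scipy.stats import rayleigh

def canonical_pdf(x, theta):
    # p(x; theta) = exp(<t(x), theta> - F(theta) + k(x)) for x > 0
    return np.exp(theta * x**2 + np.log(-2.0 * theta) + np.log(x))

def log_normalizer_numeric(theta):
    # Eq. (2): F(theta) = log of the integral of exp(<t(x), theta> + k(x)) dx,
    # here x * exp(theta x^2) integrated over (0, +inf).
    value, _ = quad(lambda x: x * np.exp(theta * x**2), 0.0, np.inf)
    return np.log(value)

sigma = 2.0
theta = -1.0 / (2.0 * sigma**2)
x = np.linspace(0.5, 5.0, 5)
print(canonical_pdf(x, theta))
print(rayleigh.pdf(x, scale=sigma))                          # same values
print(log_normalizer_numeric(theta), -np.log(-2.0 * theta))  # both = log(sigma^2)
```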

The next paragraphs detail the decomposition of some common distributions.

Univariate Gaussian distribution The normal distribution is an exponential family: the usual formulation of the density function

$$f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

matches the canonical decomposition of the exponential families with

$$\begin{cases}
t(x) = (x, x^2), \\
(\theta_1, \theta_2) = \left(\dfrac{\mu}{\sigma^2}, -\dfrac{1}{2\sigma^2}\right), \\
F(\theta_1, \theta_2) = -\dfrac{\theta_1^2}{4\theta_2} + \dfrac{1}{2}\log\left(-\dfrac{\pi}{\theta_2}\right), \\
k(x) = 0.
\end{cases} \qquad (3)$$
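For instance, a small Python sketch (our own function names, not the pyMEF API) can convert the usual (µ, σ²) parameters to the natural parameters of Eq. (3) and check that the canonical form of Eq. (1) reproduces the usual density:

```python
# Univariate Gaussian as an exponential family: convert source parameters
# (mu, sigma^2) into natural parameters (theta1, theta2) and evaluate the
# density through the canonical decomposition of Eqs. (1) and (3).
import numpy as np
from scipy.stats import norm

def to_natural(mu, sigma2):
    return mu / sigma2, -1.0 / (2.0 * sigma2)

def log_normalizer(theta1, theta2):
    return -theta1**2 / (4.0 * theta2) + 0.5 * np.log(-np.pi / theta2)

def canonical_pdf(x, theta1, theta2):
    # t(x) = (x, x^2) and k(x) = 0 for this family.
    return np.exp(theta1 * x + theta2 * x**2 - log_normalizer(theta1, theta2))

mu, sigma2 = 1.0, 4.0
theta = to_natural(mu, sigma2)
x = np.linspace(-5.0, 7.0, 5)
print(canonical_pdf(x, *theta))
print(norm.pdf(x, loc=mu, scale=np.sqrt(sigma2)))  # same values
```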

Multivariate Gaussian distribution The multivariate normal distribution (d is the dimension of the space of the observations)

$$f(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\sqrt{\det(\Sigma)}} \exp\left(-\frac{(x-\mu)^T \Sigma^{-1} (x-\mu)}{2}\right) \qquad (4)$$

can be described using the canonical parameters as follows:

$$\begin{cases}
t(x) = (x, -xx^T) \\
(\theta_1, \theta_2) = \left(\Sigma^{-1}\mu, \dfrac{1}{2}\Sigma^{-1}\right) \\
F(\theta_1, \theta_2) = \dfrac{1}{4}\operatorname{tr}\left(\theta_2^{-1}\theta_1\theta_1^T\right) - \dfrac{1}{2}\log\det\theta_2 + \dfrac{d}{2}\log\pi \\
k(x) = 0
\end{cases}$$

2.2 Dual parametrization

The natural parameters space used in the previous section admits a dual space. This dual parametrization of the exponential families comes from the properties of the log-normalizer. Since it is a strictly convex and differentiable function, it admits a dual representation by the Legendre-Fenchel transform:

$$F^\star(\eta) = \sup_\theta \{\langle\theta, \eta\rangle - F(\theta)\} \qquad (5)$$

The maximum is reached for η = ∇F(θ). The parameters η are called expectation parameters since η = E[t(x)]. The gradients of F and of its dual F⋆ are inverse functions of each other:

$$\nabla F = (\nabla F^\star)^{-1} \qquad (6)$$

and F⋆ itself can be computed by:

$$F^\star = \int (\nabla F)^{-1} + \text{constant.} \qquad (7)$$

Notice that this integral is often difficult to compute and the convex conjugate F⋆ of F may not be known in closed form. We can bypass the anti-derivative operation by plugging into Eq. (5) the optimal value ∇F(θ⋆) = η (that is, θ⋆ = (∇F)⁻¹(η)). We get

$$F^\star(\eta) = \langle (\nabla F)^{-1}(\eta), \eta\rangle - F\big((\nabla F)^{-1}(\eta)\big) \qquad (8)$$

This requires taking the reciprocal gradient (∇F)⁻¹ = ∇F⋆, but allows us to discard the constant of integration in Eq. (7). Thus a member of an exponential family can be described equivalently with the natural parameters or with the dual expectation parameters.
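The sketch below makes the dual parametrization explicit for the univariate Gaussian: ∇F maps the natural parameters to the expectation parameters η = E[t(x)] = (µ, µ² + σ²), its inverse maps back, and F⋆ is evaluated through Eq. (8). The gradients are written out by hand and the function names are ours, not pyMEF's.

```python
# Expectation parameters of the univariate Gaussian: eta = grad F(theta)
# equals E[t(x)] = (mu, mu^2 + sigma^2); F*(eta) follows from Eq. (8).
import numpy as np

def F(theta1, theta2):
    return -theta1**2 / (4.0 * theta2) + 0.5 * np.log(-np.pi / theta2)

def grad_F(theta1, theta2):
    # Gradient of the log-normalizer: natural -> expectation parameters.
    eta1 = -theta1 / (2.0 * theta2)
    eta2 = theta1**2 / (4.0 * theta2**2) - 1.0 / (2.0 * theta2)
    return eta1, eta2

def grad_F_inverse(eta1, eta2):
    # Reciprocal gradient (= grad F*): expectation -> natural parameters.
    sigma2 = eta2 - eta1**2
    return eta1 / sigma2, -1.0 / (2.0 * sigma2)

def F_star(eta1, eta2):
    # Legendre conjugate via Eq. (8): F*(eta) = <theta(eta), eta> - F(theta(eta)).
    t1, t2 = grad_F_inverse(eta1, eta2)
    return t1 * eta1 + t2 * eta2 - F(t1, t2)

mu, sigma2 = 1.0, 4.0
theta = (mu / sigma2, -1.0 / (2.0 * sigma2))
eta = grad_F(*theta)
print(eta)                   # (1.0, 5.0) = (mu, mu^2 + sigma^2)
print(grad_F_inverse(*eta))  # recovers theta = (0.25, -0.125)
print(F_star(*eta))
```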

2.3 Bregman divergences

The Kullback-Leibler (KL) divergence between two members of the same exponential family can be computed in closed form using a bijection between Bregman divergences and exponential families. Bregman divergences are a family of divergences parametrized by the set of strictly convex and differentiable functions F:

$$B_F(p \| q) = F(p) - F(q) - \langle p - q, \nabla F(q)\rangle \qquad (9)$$

F is a strictly convex and differentiable function called the generator of the Bregman divergence. The family of Bregman divergences generalizes many usual divergences, for example:
– the squared Euclidean distance, for F(x) = x²,
– the Kullback-Leibler (KL) divergence, with the Shannon negative entropy $F(x) = \sum_{i=1}^d x_i \log x_i$ (also called Shannon information).

Banerjee et al. [2] showed that Bregman divergences are in bijection with the exponential families through the generator F. This bijection allows one to compute the Kullback-Leibler divergence between two members of the same exponential family:

$$\mathrm{KL}\big(p(x; \theta_1), p(x; \theta_2)\big) = \int_x p(x; \theta_1) \log \frac{p(x; \theta_1)}{p(x; \theta_2)}\, \mathrm{d}x \qquad (10)$$
$$= B_F(\theta_2, \theta_1) \qquad (11)$$

where F is the log-normalizer of the exponential family and the generator of the associated Bregman divergence. Thus, computing the Kullback-Leibler divergence between two members of the same exponential family amounts to computing a Bregman divergence between their natural parameters (with swapped order).
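As a concrete check of Eq. (11), the following sketch computes the KL divergence between two univariate Gaussians both as a Bregman divergence on their natural parameters and with the classical closed form; the helper names are ours and the example is independent of pyMEF.

```python
# KL divergence between two univariate Gaussians computed in two ways:
# via the Bregman divergence on natural parameters (Eq. (11)) and via the
# classical closed form, illustrating the bijection of Banerjee et al. [2].
import numpy as np

def F(theta):
    t1, t2 = theta
    return -t1**2 / (4.0 * t2) + 0.5 * np.log(-np.pi / t2)

def grad_F(theta):
    t1, t2 = theta
    return np.array([-t1 / (2.0 * t2), t1**2 / (4.0 * t2**2) - 1.0 / (2.0 * t2)])

def bregman(p, q):
    # B_F(p || q) = F(p) - F(q) - <p - q, grad F(q)>
    p, q = np.asarray(p, float), np.asarray(q, float)
    return F(p) - F(q) - np.dot(p - q, grad_F(q))

def to_natural(mu, sigma2):
    return np.array([mu / sigma2, -1.0 / (2.0 * sigma2)])

def kl_gaussians(mu1, s1, mu2, s2):
    # Classical closed form of KL(N(mu1, s1) || N(mu2, s2)); s1, s2 are variances.
    return 0.5 * (np.log(s2 / s1) + (s1 + (mu1 - mu2)**2) / s2 - 1.0)

mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0
theta1, theta2 = to_natural(mu1, s1), to_natural(mu2, s2)
print(bregman(theta2, theta1))         # KL(p1 || p2) = B_F(theta2, theta1) ~ 0.3466
print(kl_gaussians(mu1, s1, mu2, s2))  # same value
```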

2.4 Bregman centroids

Except for the squared Euclidean distance and the squared Mahalanobis distance, Bregman divergences are not symmetrical. This leads to two sided definitions for Bregman centroids:
– the left-sided one

$$c_L = \arg\min_x \sum_i \omega_i B_F(x, p_i) \qquad (12)$$

– and the right-sided one

$$c_R = \arg\min_x \sum_i \omega_i B_F(p_i, x) \qquad (13)$$

These two centroids are centroids by optimization, that is, the unique solution of an optimization problem. Using this principle and various symmetrizations of the KL divergence, we can design symmetrized Bregman centroids:
– Jeffreys-Bregman divergences:

$$S_F(p, q) = \frac{B_F(p, q) + B_F(q, p)}{2} \qquad (14)$$

– Jensen-Bregman divergences [18]:

$$J_F(p, q) = \frac{B_F\left(p, \frac{p+q}{2}\right) + B_F\left(q, \frac{p+q}{2}\right)}{2} \qquad (15)$$

– Skew Jensen-Bregman divergences [18]:

$$J_F^{(\alpha)}(p, q) = \alpha B_F\big(p, \alpha p + (1-\alpha)q\big) + (1-\alpha) B_F\big(q, \alpha p + (1-\alpha)q\big) \qquad (16)$$

Closed-form formulas are known for the left- and right-sided centroids [2]:

$$c_R = \arg\min_x \sum_i \omega_i B_F(p_i, x) \qquad (17)$$
$$= \sum_{i=1}^n \omega_i p_i \qquad (18)$$

$$c_L = \arg\min_x \sum_i \omega_i B_F(x, p_i) \qquad (19)$$
$$= \nabla F^\star\left(\sum_i \omega_i \nabla F(p_i)\right) \qquad (20)$$
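The sketch below instantiates Eqs. (17)–(20) for univariate Gaussian components expressed in natural coordinates: the right-sided centroid is the weighted arithmetic mean of the natural parameters, and the left-sided centroid is obtained through ∇F and its inverse ∇F⋆. The gradients are hand-coded for this family; this is an illustration under our own naming, not the pyMEF implementation.

```python
# Left- and right-sided Bregman centroids of univariate Gaussian components,
# in natural coordinates, using the closed forms of Eqs. (17)-(20).
import numpy as np

def grad_F(theta):
    t1, t2 = theta[..., 0], theta[..., 1]
    return np.stack([-t1 / (2.0 * t2),
                     t1**2 / (4.0 * t2**2) - 1.0 / (2.0 * t2)], axis=-1)

def grad_F_star(eta):
    # Inverse of grad F: expectation -> natural parameters.
    e1, e2 = eta[..., 0], eta[..., 1]
    sigma2 = e2 - e1**2
    return np.stack([e1 / sigma2, -1.0 / (2.0 * sigma2)], axis=-1)

def right_centroid(weights, thetas):
    # Eq. (18): weighted arithmetic mean of the natural parameters.
    return np.average(thetas, axis=0, weights=weights)

def left_centroid(weights, thetas):
    # Eq. (20): grad F* applied to the weighted mean of the gradients.
    return grad_F_star(np.average(grad_F(thetas), axis=0, weights=weights))

# Three Gaussian components given as (mu, sigma^2), converted to natural form.
mus_sigma2s = np.array([[0.0, 1.0], [2.0, 0.5], [5.0, 2.0]])
thetas = np.stack([mus_sigma2s[:, 0] / mus_sigma2s[:, 1],
                   -1.0 / (2.0 * mus_sigma2s[:, 1])], axis=-1)
weights = np.array([0.5, 0.3, 0.2])
print(right_centroid(weights, thetas))
print(left_centroid(weights, thetas))
```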

3 Mixture Models

3.1 Statistical mixtures

Mixture models are a widespread tool for modeling complex data in a variety of domains, from image processing to medical data analysis through speech recognition. This success is due to the capacity of these models to estimate the probability density function (pdf) of complex random variables. For a mixture f of n components, the probability density function takes the form:

$$f(x) = \sum_{i=1}^n \omega_i\, g(x; \theta_i) \qquad (21)$$

where ωi denotes the weight of component i (with Σi ωi = 1) and θi are the parameters of the exponential family g.

Gaussian mixture models (GMM) are a universal special case used in the large majority of mixture model applications:

$$f(x) = \sum_{i=1}^n \omega_i\, g(x; \mu_i, \sigma_i^2) \qquad (22)$$

Each component g(x; µi, σi²) is a normal distribution, either univariate or multivariate. Even if GMMs are the most used mixture models, mixtures of exponential families like Gamma, Beta or Rayleigh distributions are common in some fields ([14,12]).
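For illustration, evaluating the density of Eq. (22) amounts to a weighted sum of Gaussian component densities; the following minimal NumPy/SciPy sketch (independent of pyMEF) does exactly that for a univariate GMM.

```python
# Evaluating the density of a univariate Gaussian mixture model (Eq. (22)):
# f(x) = sum_i w_i N(x; mu_i, sigma_i^2).
import numpy as np
from scipy.stats import norm

def gmm_pdf(x, weights, mus, sigma2s):
    x = np.asarray(x, dtype=float)[..., None]        # broadcast over components
    comps = norm.pdf(x, loc=mus, scale=np.sqrt(sigma2s))
    return comps @ weights                           # weighted sum, Eq. (22)

weights = np.array([0.4, 0.35, 0.25])                # must sum to 1
mus     = np.array([-2.0, 0.5, 3.0])
sigma2s = np.array([0.5, 1.0, 2.0])
xs = np.linspace(-5.0, 7.0, 7)
print(gmm_pdf(xs, weights, mus, sigma2s))
```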

3.2 Getting mixtures

We present here some well-known algorithms to build mixtures. For more details, please refer to the references cited in the next paragraphs.

Expectation-Maximization The most common tool for the estimation of the parameters of a mixture model is the Expectation-Maximization (EM) algorithm [8]. It maximizes the likelihood of the density estimation by iteratively computing the expectation of the log-likelihood using the current estimate of the parameters (E step) and by updating the parameters in order to maximize the log-likelihood (M step). Even if originally considered for Mixtures of Gaussians (MoGs), Expectation-Maximization has been extended by Banerjee et al. [2] to learn mixtures of arbitrary exponential families. The pitfall is that this method leads only to a local maximum of the log-likelihood. Moreover, the number of components is difficult to choose.

Dirichlet Process Mixtures To avoid the problem of the choice of the number of components, it has been proposed to use a mixture model with an infinite number of components. This can be done with a Dirichlet process mixture (DPM) [20], which uses a Dirichlet process to build priors for the mixing proportions of the components. If one needs a finite mixture, it is easy to sort the components according to their weights ωi and to keep only the components above some threshold. The main drawback is that building the model requires evaluating a Dirichlet process using Markov chain Monte Carlo (for example with the Metropolis algorithm), which is computationally costly.

Kernel Density Estimation The kernel density estimator (KDE) [19] (also known as the Parzen windows method) avoids the problem of the choice of the number of components by using one component (a Gaussian kernel) centered on each point of the dataset. All the components share the same weight and, since the µi parameters come directly from the data points, the only remaining parameters are the σi, which are chosen equal to a constant called the bandwidth (a small construction sketch is given below, after the pros and cons). The critical part of the algorithm is the choice of the bandwidth: many studies have addressed the automatic tuning of this parameter (see [25] for a comprehensive survey) but it can also be chosen by hand depending on the dataset. Since there is one Gaussian component per point in the dataset, a mixture built with a kernel density estimator is difficult to manipulate: the size is large and common operations (evaluation of the density, random sampling, etc.) are slow since it is necessary to loop over all the components of the mixture.

Pros and cons The main drawbacks of the EM algorithm are the risk of converging to a local optimum and the number of iterations needed to find this optimum. While it may be costly, this time is only spent during the learning step. On the other hand, learning a KDE is nearly free but evaluating the associated pdf is costly since we need to loop over each component of the mixture. Given the typical size of a dataset (a 120 × 120 image leads to 14400 components), the mixture can be unsuitable for time-critical applications. Dirichlet process mixtures usually give high-precision models which are very useful in some applications [3], but at a computational cost which is not affordable in most applications. Since mixtures with a low number of components have proved their capacity to model complex data (Figure 1), it would be useful to build such a mixture while avoiding the costly learning step of EM or DPM.
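To fix ideas, a KDE can be materialized as the mixture just described: one Gaussian kernel per data point, equal weights 1/n and a shared bandwidth, so that every evaluation of the pdf loops over all n components. The sketch below uses a hand-picked bandwidth and our own function names; it is an illustration, not the pyMEF API.

```python
# A kernel density estimator viewed as a Gaussian mixture with one component
# per data point, equal weights 1/n and a common, hand-picked bandwidth.
import numpy as np
from scipy.stats import norm

def kde_as_mixture(data, bandwidth):
    n = len(data)
    weights = np.full(n, 1.0 / n)        # equal weights
    mus = np.asarray(data, dtype=float)  # one kernel centered on each point
    sigmas = np.full(n, bandwidth)       # shared bandwidth
    return weights, mus, sigmas

def kde_pdf(x, weights, mus, sigmas):
    # Evaluation sums over every component: O(n) per query point, which is
    # exactly what makes a raw KDE costly on large datasets.
    x = np.asarray(x, dtype=float)[..., None]
    return norm.pdf(x, loc=mus, scale=sigmas) @ weights

data = np.random.default_rng(0).normal(size=500)
weights, mus, sigmas = kde_as_mixture(data, bandwidth=0.3)
print(kde_pdf([-1.0, 0.0, 1.0], weights, mus, sigmas))
```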

4 Simplification of kernel density estimators

4.1 Bregman Hard Clustering

The Bregman Hard Clustering algorithm is an extension of the celebrated k-means clustering algorithm to the class of Bregman divergences [2]. It has been proposed in Garcia et al. [10] to use this method for the simplification of mixtures of exponential families. Similarly to the Lloyd k-means algorithm, the goal is to minimize the following cost function, for the simplification of an n-components mixture into a k-components mixture (with k < n):

$$L = \min_{\theta'_1, \dots, \theta'_k} \sum_{j=1}^{k} \sum_{\theta_i \in \mathcal{C}_j} B_F(\theta_i, \theta'_j)$$

where Cj denotes the set of original components assigned to the simplified component θ'j.
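To make the k-means analogy concrete, here is a sketch of Lloyd-type Bregman hard clustering on the natural parameters of the components of a univariate Gaussian mixture: assignments use the Bregman divergence generated by the Gaussian log-normalizer, and the update step takes the right-sided centroid, here a mean weighted by the mixture weights ωi (our choice for this sketch). It illustrates the principle only and is not the exact pyMEF implementation.

```python
# Sketch of Bregman hard clustering on the natural parameters of the components
# of a univariate Gaussian mixture (e.g. a KDE), simplified to k components.
import numpy as np

def F(theta):
    t1, t2 = theta[..., 0], theta[..., 1]
    return -t1**2 / (4.0 * t2) + 0.5 * np.log(-np.pi / t2)

def grad_F(theta):
    t1, t2 = theta[..., 0], theta[..., 1]
    return np.stack([-t1 / (2.0 * t2),
                     t1**2 / (4.0 * t2**2) - 1.0 / (2.0 * t2)], axis=-1)

def bregman(p, q):
    # B_F(p || q), broadcast over leading dimensions.
    return F(p) - F(q) - np.sum((p - q) * grad_F(q), axis=-1)

def bregman_hard_clustering(thetas, weights, k, iterations=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = thetas[rng.choice(len(thetas), size=k, replace=False)]
    for _ in range(iterations):
        # Assignment step: each component goes to its closest center w.r.t. B_F.
        divs = bregman(thetas[:, None, :], centers[None, :, :])
        labels = np.argmin(divs, axis=1)
        # Update step: right-sided centroid (weighted mean) of each cluster.
        for j in range(k):
            mask = labels == j
            if mask.any():
                centers[j] = np.average(thetas[mask], axis=0, weights=weights[mask])
    return centers, labels

# Toy input: 200 Gaussian components with equal weights, simplified to k = 3.
rng = np.random.default_rng(1)
mus = rng.normal(0.0, 3.0, size=200)
sigma2s = np.full(200, 0.25)
thetas = np.stack([mus / sigma2s, -1.0 / (2.0 * sigma2s)], axis=-1)
weights = np.full(200, 1.0 / 200)
centers, labels = bregman_hard_clustering(thetas, weights, k=3)
print(centers)
```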