
Learning and Data Clustering

Joachim M. Buhmann
Rheinische Friedrich-Wilhelms-Universität Bonn
Institut für Informatik III, Römerstraße 164, D-53117 Bonn, Germany

RUNNING HEAD: Data Clustering and Learning

Correspondence: Joachim Buhmann, Universität Bonn, Institut für Informatik III, Römerstraße 164, D-53117 Bonn, Germany. Phone: +49 228 550 380, Fax: +49 228 550 382, email: [email protected], [email protected]


1 INTRODUCTION

Data clustering (Jain and Dubes, 1988) aims at discovering and emphasizing structure which is hidden in a data set. The structural relationships between individual data points, e.g., pronounced similarity of groups of data vectors, have to be detected in an unsupervised fashion. This search for prototypes poses a delicate tradeoff: not to superimpose structure which does not exist in the data set, and not to overlook existing structure because of a too simplistic modelling philosophy. Conceptually, there exist two different approaches to data clustering, which are discussed and compared in this paper:

• Parameter estimation of mixture models by parametric statistics.

• Vector quantization of a data set by combinatorial optimization.

Parametric statistics assumes that noisy data have been generated by an unknown number of qualitatively similar stochastic processes. Each individual process is characterized by a univariate probability density which is sometimes called a "natural" cluster of the data set. The density of the full data set is described by a parametrized mixture model; frequently one uses Gaussian mixtures (McLachlan and Basford, 1988). This model-based approach to data clustering requires estimating the mixture parameters, e.g., the means and variances of four Gaussians for the data set depicted in Fig. 1a. Bayesian statistics provides a conceptual framework to compare and validate different mixture models.

The second approach to data clustering, which has been popularized as vector quantization in communication theory, partitions the data into clusters according to a suitable cost function. The resulting assignment problem is a deterministic, NP-hard combinatorial optimization problem. This second strategy can be related to the statistical, model-based approach in two different ways: (i) a stochastic optimization strategy like SIMULATED ANNEALING is employed to search for solutions with low costs; (ii) the data set constitutes a randomly drawn instance of the combinatorial optimization problem. The optimization approach to clustering aims at partitioning the data set into disjoint cells according to a cost function. Applied to the data set in Fig. 1a, this approach does not search for decision boundaries of exactly four partitions which correspond to the univariate components of the underlying probability distribution.

The most difficult problem in mixture density estimation and in vector quantization concerns the complexity of the clustering solution, e.g., how many clusters should be chosen and how solutions with different degrees of complexity should be compared and validated. Learning, which is unsupervised in both cases, addresses the question of how to estimate the model parameters or the data assignments to clusters, respectively. The parameter values can be adjusted in an iterative or on-line fashion if a steady stream of data is available. Another, preferred choice for small data sets is batch learning, where the parameters are estimated on the basis of the full data set.
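To make the optimization view of clustering concrete, the following Python fragment evaluates one common example of such a cost function, the squared-error distortion of vector quantization, i.e., the sum of squared distances between data points and the prototypes of the clusters they are assigned to. This is only an illustrative sketch under the assumption of a squared-error criterion; the function and variable names are hypothetical and do not come from the paper.

```python
import numpy as np

def squared_error_cost(X, assignments, prototypes):
    """Sum of squared distances between data points and their assigned prototypes.

    X           : (N, d) array of data vectors x_i
    assignments : (N,) array of cluster indices in {0, ..., K-1}
    prototypes  : (K, d) array of cluster prototypes
    """
    return np.sum((X - prototypes[assignments]) ** 2)
```

Finding the assignments and prototypes that minimize such a cost over all possible partitions is the NP-hard combinatorial optimization problem mentioned above.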

2 Gaussian Mixture Models

Natural clusters in data sets are modelled by a mixture of stochastic data sources. Each component of this mixture, a data cluster, is described by a univariate probability density which is the stochastic model for an individual cluster. The sum of all component densities forms the probability density of the mixture model

$$P(x \mid \Theta) = \sum_{\alpha=1}^{K} \pi_\alpha \, p(x \mid \theta_\alpha) \,. \tag{1}$$
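As a concrete reading of Eq. (1), the following Python/NumPy sketch evaluates a mixture density at a single point, assuming Gaussian component densities as in the case discussed below. The names `weights`, `means`, and `covs` are illustrative placeholders for the mixing parameters, mean vectors, and covariance matrices; they are not notation from the paper.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate Gaussian density p(x | y_alpha, Sigma_alpha)."""
    d = x.shape[0]
    diff = x - mean
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm

def mixture_density(x, weights, means, covs):
    """Eq. (1): P(x | Theta) = sum_alpha pi_alpha * p(x | theta_alpha)."""
    return sum(w * gaussian_pdf(x, m, c)
               for w, m, c in zip(weights, means, covs))
```

For the four-Gaussian example of Fig. 1a, `weights` would hold four mixing proportions and `means` and `covs` the corresponding centers and covariance matrices.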

Let us assume that the functional form of the probability density $P(x \mid \Theta)$ is completely known up to a finite and presumably small number of parameters $\Theta = (\theta_1, \ldots, \theta_K)$. For the most common case of Gaussian mixtures, the parameters $\theta_\alpha = (y_\alpha, \Sigma_\alpha)$ are the coordinates of the mean vector and the covariance matrix. The a priori probability $\pi_\alpha$ of the component $\alpha$ is called the mixing parameter. Adopting this framework of parametric statistics, the detection of data clusters reduces mathematically to the problem of estimating the parameters $\Theta$ of the probability density for a given mixture model. A powerful statistical tool for finding mixture parameters is the maximum likelihood (ML) method (Duda and Hart, 1973), i.e., one maximizes the probability of the independently, identically distributed data set $\{x_i \mid i = 1, \ldots, N\}$ given a particular mixture model. For analytical purposes it is more convenient to maximize the log-likelihood

$$L(\Theta) = \sum_{i=1}^{N} \log P(x_i \mid \Theta) = \sum_{i=1}^{N} \log\left( \sum_{\alpha=1}^{K} \pi_\alpha \, p(x_i \mid \theta_\alpha) \right) \tag{2}$$

which yields the same maximum likelihood parameters due to the monotonicity of the logarithm. The straightforward maximization of Eq. (2) results in a system of transcendental equations with multiple roots. The ambiguity in the solutions originates from the lack of knowledge as to which mixture component $\alpha$ has generated a specific data vector $x_i$ and, therefore, which parameter $\theta_\alpha$ is influenced by $x_i$. An efficient bootstrap solution to this computational problem, i.e., how to estimate the parameters of mixture models with the maximum likelihood method, is provided by the expectation maximization (EM) algorithm (Dempster et al., 1977). The EM algorithm estimates the unobservable assignment variables $\{M_{i\alpha}\}$ in a first step; $M_{i\alpha} = 1$ denotes that $x_i$ has been generated by component $\alpha$, and $M_{i\alpha} = 0$ otherwise. On the basis of these maximum likelihood estimates $\{\hat{M}_{i\alpha}\}$, the parameters $\Theta$ are calculated in a second step. An iteration of these two steps renders the following algorithm for Gaussian mixtures:



• E-step: The expectation value of the complete data log-likelihood is calculated conditioned on the observed data $\{x_i\}$ and the parameter estimates $\hat{\Theta}$. This yields the expected assignments of data to mixture components, i.e.,

$$\hat{M}_{i\alpha}^{(t+1)} = \frac{p\big(x_i \mid \hat{y}_\alpha^{(t)}, \hat{\Sigma}_\alpha^{(t)}\big)}{\sum_{\nu=1}^{K} p\big(x_i \mid \hat{y}_\nu^{(t)}, \hat{\Sigma}_\nu^{(t)}\big)}
= \frac{\big|\hat{\Sigma}_\alpha^{(t)}\big|^{-1/2} \exp\!\left(-\tfrac{1}{2}\big(x_i - \hat{y}_\alpha^{(t)}\big)^T \big(\hat{\Sigma}_\alpha^{(t)}\big)^{-1} \big(x_i - \hat{y}_\alpha^{(t)}\big)\right)}{\sum_{\nu=1}^{K} \big|\hat{\Sigma}_\nu^{(t)}\big|^{-1/2} \exp\!\left(-\tfrac{1}{2}\big(x_i - \hat{y}_\nu^{(t)}\big)^T \big(\hat{\Sigma}_\nu^{(t)}\big)^{-1} \big(x_i - \hat{y}_\nu^{(t)}\big)\right)} \,. \tag{3}$$

• M-step: The likelihood maximization step estimates the mixture parameters, e.g., the centers and the variances of the Gaussians (Duda and Hart, 1973, pp. 47); a code sketch of both steps is given after the update equations below.

$$\hat{y}_\alpha^{(t+1)} = \frac{\sum_{i=1}^{N} \hat{M}_{i\alpha}^{(t+1)} \, x_i}{\sum_{i=1}^{N} \hat{M}_{i\alpha}^{(t+1)}} \tag{4}$$

$$\hat{\Sigma}_\alpha^{(t+1)} = \frac{1}{\sum_{i=1}^{N} \hat{M}_{i\alpha}^{(t+1)}} \sum_{i=1}^{N} \hat{M}_{i\alpha}^{(t+1)} \big( x_i - \hat{y}_\alpha^{(t+1)} \big) \big( x_i - \hat{y}_\alpha^{(t+1)} \big)^T \tag{5}$$

Note that Eqs. (4, 5) have a unique solution after the expected assignments $\{\hat{M}_{i\alpha}^{(t+1)}\}$ have been estimated.
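As an illustration, the two steps can be transcribed almost literally into Python/NumPy. The following sketch is not from the paper: the names `e_step`, `m_step`, and `M_hat` are hypothetical, and SciPy's multivariate normal density stands in for the Gaussian component densities $p(x_i \mid y_\alpha, \Sigma_\alpha)$.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, means, covs):
    """Eq. (3): expected assignments M_hat[i, alpha] of data points to components."""
    # densities[i, alpha] = p(x_i | y_alpha, Sigma_alpha)
    densities = np.column_stack([
        multivariate_normal(mean=m, cov=c).pdf(X)
        for m, c in zip(means, covs)
    ])
    return densities / densities.sum(axis=1, keepdims=True)

def m_step(X, M_hat):
    """Eqs. (4) and (5): re-estimate centers and covariances from M_hat of shape (N, K)."""
    mass = M_hat.sum(axis=0)                      # sum_i M_hat[i, alpha]
    means = (M_hat.T @ X) / mass[:, None]         # Eq. (4)
    covs = []
    for alpha in range(M_hat.shape[1]):
        diff = X - means[alpha]                   # x_i - y_alpha
        covs.append((M_hat[:, alpha, None] * diff).T @ diff / mass[alpha])  # Eq. (5)
    return means, np.array(covs)
```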

The monotonic increase of the likelihood up to a local maximum guarantees the convergence of the EM algorithm. In our example of four Gaussians (Fig. 1a), the EM algorithm estimates centers and variances as depicted by stars and circles in Fig. 1b. An important question has not yet been raised: how do we estimate the correct number of components in the data set? To differentiate between structure and noise of a data set, the complexity of the mixture model has to be constrained in the spirit of Occam's razor. Such a preference for simple models with few components is mathematically implemented, for example, by the MINIMUM DESCRIPTION LENGTH principle (Rissanen, 1989). The reader should note that ML estimation without constraints yields the singular solution of one component with zero variance for each data vector (Duda and Hart, 1973, pp. 198).

Gaussian mixture models have attracted a lot of attention in the neural network community for primarily three reasons: (I) Networks of neural units with Gaussian-shaped receptive fields, so-called radial basis functions, compute function approximations in a robust and efficient way (Poggio and Girosi, 1990). (II) The a priori assumption that synaptic weights in neural networks are generated by a Gaussian mixture model has considerably reduced the generalization error of layered neural networks in time series prediction tasks (Nowlan and Hinton, 1992), e.g., predicting sun spots and stock market indicators. (III) A neural network architecture based on Gaussian mixtures, the HIERARCHICAL MIXTURE OF EXPERTS (Jacobs et al., 1991), is able to efficiently solve real-world classification or regression tasks in a divide-and-conquer fashion.
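Returning to the EM iteration described above, a minimal driver loop illustrates how the E- and M-steps are alternated and how the monotonically increasing log-likelihood can serve as a stopping criterion. The sketch assumes the `e_step` and `m_step` functions from the previous example are in scope, evaluates the log-likelihood of Eq. (2) with equal mixing proportions $1/K$ (an assumption made here for simplicity, not a statement from the paper), and leaves the initialization of centers and covariances to the caller.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def fit_gaussian_mixture(X, means, covs, max_iter=100, tol=1e-6):
    """Iterate the E- and M-steps until the log-likelihood stops increasing."""
    K = len(means)
    prev_ll = -np.inf
    for t in range(max_iter):
        M_hat = e_step(X, means, covs)        # Eq. (3), sketched above
        means, covs = m_step(X, M_hat)        # Eqs. (4) and (5), sketched above
        # log-likelihood of Eq. (2), here with equal mixing proportions 1/K
        log_pdfs = np.column_stack([
            multivariate_normal(mean=m, cov=c).logpdf(X)
            for m, c in zip(means, covs)
        ])
        ll = logsumexp(log_pdfs - np.log(K), axis=1).sum()
        if ll - prev_ll < tol:                # monotonic increase; stop near a local maximum
            break
        prev_ll = ll
    return means, covs
```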


3 Data Clustering as a Vector Quantization Process

The second approach to data clustering is based on an optimization principle to partition a set of data points which are characterized either by coordinates $\{x_i\}$ or by distances $\{D_{ik}\}$