On the optimal number of clusters in histogram clustering

Joachim M. Buhmann & Marcus Held
Institut für Informatik III, Universität Bonn, Germany
{jb,[email protected]

May 19, 1999

Abstract

Unsupervised learning algorithms are designed to extract structure from data samples. The quality of a structure is measured by a cost function which is usually minimized to infer optimal parameters characterizing the hidden structure in the data. Reliable and robust inference requires a guarantee that extracted structures are typical for the data source, i.e., similar structures have to be extracted from a second sample set of the same data source. Lack of robustness is known as overfitting in the statistics and machine learning literature. In this paper the overfitting phenomenon is characterized for a class of histogram clustering models which play a prominent role in information retrieval, linguistic and computer vision applications. Learning algorithms with robustness to sample fluctuations are derived from large deviation theory and the maximum entropy principle for the learning process. The theory validates continuation methods like simulated or deterministic annealing as robust approximation schemes from a statistical learning theory point of view. It predicts the optimal approximation quality, parameterized by a minimal finite temperature, for given covering numbers of the hypothesis class for structures. Monte Carlo simulations are presented to support the underlying inference principle.

1 Introduction

Learning algorithms are designed to extract structure from data. Two classes of learning paradigms have been widely discussed in the literature: supervised and unsupervised learning. The distinction between the two classes depends on supervision or teacher information, which is either available to the learning algorithm or missing. This paper applies statistical learning theory to the problem of unsupervised learning. In particular, error bounds as a safeguard against overfitting are derived for the recently developed Asymmetric Clustering Model (ACM) for histogram data [1, 2]. These theoretical results show that the use of the continuation method deterministic annealing yields robustness of the learning process in the sense of statistical learning theory. The computational temperature of annealing algorithms plays the role of a control parameter regulating the complexity of the learning machine.

The motivation to investigate histogram clustering from a statistical learning point of view is twofold. First, histogram clustering enables the data analyst to compactify histogram data without any need to introduce often arbitrary transformations and metrics in feature spaces.1 Second, due to the discrete nature of the ACM the theoretical analysis is tractable, and it provides us with a quantitative view on the overfitting phenomena in deterministic annealing algorithms.

1 In the context of ordinal data it is usually unclear how to define a suitable metric to compare the different objects at hand.


Let us assume that a hypothesis class H of loss functions is given. These loss functions measure the quality of structures in data.2 The complexity of H is controlled by coarsening, i.e., we define a γ-cover of H. Informally, the inference principle implemented by deterministic annealing performs learning by the following two inference steps:

1. Determine the optimal approximation level γ for consistent learning (in the sense of a large deviation theory).

2. Given the optimal value γ, average over all hypotheses in an appropriate neighborhood of the empirical minimizer, i.e.,
   (a) determine the empirical minimizer for given training data;
   (b) average over all hypotheses within a 2ε_app-risk range of the empirical minimizer.

ε_app is a function of γ, to be determined in the following by uniform convergence arguments. The reader should note that the learning algorithm has to return a structure which is typical in a γ-cover sense, but it is not required to return the structure with minimal empirical risk as in Vapnik's "Empirical Risk Minimization" (ERM) induction principle for classification and regression [3]. The loss function with minimal empirical risk is usually a structure with maximal complexity, e.g., in clustering the ERM principle will necessarily yield a solution with the maximal number of clusters and, therefore, is not suitable as a principle to determine a number of clusters which is stable under sample fluctuations. All structures in the data with risk differences smaller than 2ε_app are considered to be equivalent in the approximation process. The choice of the approximation accuracy is such that induction becomes reliable with respect to sample fluctuations. This induction approach is named Empirical Risk Approximation (ERA) [4] in the following.

The general framework of ERA is presented in Sec. 3. This framework is applied to the case of ACM clustering (Sec. 4). The connection to deterministic annealing algorithms for this clustering approach is established in Sec. 5, yielding a new justification for the maximum entropy principle. In Sec. 6 we present empirical results for the studied ACM learning problem, and Sec. 7 contains a conclusion.

2 Related Work in Statistical and Computational Learning Theory

Note that adopting the notion of loss functions to quantify structures in data blurs the distinction between supervised learning and unsupervised learning. Both types of learning are mathematically formulated as variational problems with an expected loss to be minimized. This observation has been the basis of a number of papers on the uniform convergence behavior in the vector quantization problem [5, 6, 7]. These approaches derive upper and lower bounds for the expected distortion error given a finite number of training points and a box constraint on the support of the data distribution. These bounds do not depend on explicit information about the distribution but have the form of the well-known Vapnik-Chervonenkis bounds [3]. In contrast to this strategy, our approach relies on a distribution-specific covering number approach to determine the detailed behavior of the learning curves. The resulting bound is more specific in the sense that it also covers the phenomenon of phase transitions.

2 It is a problem of unsupervised learning to quantify the notion of relevance by a loss function. Here we assume that such a loss function is given and that only the best hypothesis should be inferred on the basis of empirical data.


Consequently, the derived bound relies mainly on the so-called computational temperature and not on the number of clusters as, e.g., in [5]. Controlling the complexity of the underlying hypothesis class by the temperature is in fact related to the notion of the optimal margin in supervised learning. Approximating the complexity of such margin classifiers by the VC dimension is similar to approximating the complexity for vector quantization by the zero-temperature entity, the number of clusters. We believe that there are tight connections to the scale-sensitive dimensions, where the purely combinatorial nature of the VC dimension is replaced by a complexity which not only covers the coarse behavior of classifiers with respect to labelling, but also the behavior with respect to units measured by a given loss function.3 In fact this work should finally result in a theory which is independent of the data distribution to be learned or, as a relaxation of this requirement, the theory should at least extract (estimates of) the relevant values from the given training data instance alone.

In spirit the ERA algorithm is similar to the Gibbs algorithm presented for example in [9], where the label of the (l+1)-th data point x_{l+1} is predicted by sampling a random hypothesis from the version space. The version space is defined as the set of hypotheses which are all consistent with the first l selected data samples. In our approach we use an alternative definition of consistency, where all hypotheses in a 2ε_app-ball around the empirical minimizer define the version space (see also [10]). Averaging over this ball yields a hypothesis with risk equivalent to the expected risk obtained by random sampling over this ball. From a Bayesian point of view this is similar to averaging over a posterior distribution, where a uniform distribution over the hypothesis space is used as prior. In addition there is a tight methodological relationship to the papers [10] and [11], where learning curves for the learning of two-class classifiers are derived using techniques from statistical mechanics. Especially in [11] the notion of an optimal temperature with respect to the generalization error is introduced. These works present asymptotic results in the sense that the learning behavior is studied in the limit of an infinite number of data samples l and an infinite number of hypotheses, where the ratio between these two entities is fixed. In contrast, we discuss the phenomenon of an optimal temperature in the light of a fixed hypothesis class and a finite number of training samples.

3 In contrast to classification problems with 0-1 loss functions, unsupervised learning problems are normally characterized by real-valued loss functions. Scale-sensitive dimensions are for example the fat-shattering dimension [8] or the annealed entropy for real-valued functions [3], which is a distribution-dependent entity. In the presented approach we work directly on the distribution-dependent covering numbers using the union bound.

3 The Empirical Risk Approximation Principle

The data samples Z = {z_r ∈ Ω : 1 ≤ r ≤ l} which have to be analyzed by the unsupervised learning algorithm are elements of a suitable object (resp. feature) space Ω. The samples are distributed according to a measure μ which is not assumed to be known for the analysis.4

A mathematically precise statement of the ERA principle requires several definitions which formalize the notion of searching for structure in the data. The quality of structures extracted from the data set Z is evaluated by the empirical risk of a structure given the training set Z:

$$\hat{R}(\alpha; Z) = \frac{1}{l} \sum_{r=1}^{l} h(z_r; \alpha). \qquad (1)$$

The function h(z; α) is known as the loss function in statistics. It measures the costs for processing a generic datum z with model α. Each value α ∈ Λ parameterizes an individual loss function, with Λ denoting the set of possible parameters. This paper is concerned with the class of unbounded non-negative functions, since we are particularly interested in logarithmic loss functions O(log f(z)) as they occur in maximum likelihood approaches. The relevant quality measure for unsupervised learning is the expectation value of the loss, known as the expected risk

$$R(\alpha) = \int_{\Omega} h(z; \alpha)\, d\mu(z). \qquad (2)$$

4 Knowledge of covering numbers is required in the following analysis, which is a weaker type of information than complete knowledge of the density μ (see also [12]). For the presented empirical examination this density is used to determine the expressiveness of the derived theoretical results.

The distribution μ is assumed to decay sufficiently fast such that all r-th moments (r > 2) of the loss function h(z; α) are bounded by E{|h(z;α) − R(α)|^r} ≤ r! λ^{r−2} V{h(z;α)}, ∀α ∈ Λ. E{·} and V{·} denote the expectation and variance of a random variable, respectively, and λ is a distribution-dependent constant. ERA requires the learning algorithm to select a hypothesis on the basis of the finest consistently learnable cover of the hypothesis class, i.e., given a learning accuracy γ, a subset of parameters Λ_γ = {α_1, ..., α_{n_γ}} can be defined such that the hypothesis class H is covered by the function balls

$$\mathcal{B}_\gamma(\alpha) := \left\{ \alpha' : \int_{\Omega} \left| h(z; \alpha') - h(z; \alpha) \right| d\mu(z) \le \gamma \right\}, \qquad (3)$$

i.e., H ⊆ ∪_{α ∈ Λ_γ} B_γ(α). Here we are interested in averaging over hypotheses which are statistically indistinguishable from the empirical minimizer α̂⊥ := arg min_{α ∈ Λ_γ} R̂(α; Z). The optimal structure is denoted by α⊥ := arg min_{α ∈ Λ} R(α). Large deviation theory is used to determine the approximation accuracy for learning a hypothesis from the hypothesis class H. Using the cover property of the γ-cover yields the chain of bounds [13]

$$R(\hat\alpha^\perp) - \inf_{\alpha\in\Lambda} R(\alpha) \;\le\; R(\hat\alpha^\perp) - \inf_{\alpha\in\Lambda_\gamma} R(\alpha) + \gamma \;\le\; R(\hat\alpha^\perp) - \hat{R}(\hat\alpha^\perp) + \sup_{\alpha\in\Lambda_\gamma} |\hat{R}(\alpha) - R(\alpha)| + \gamma \;\le\; 2 \sup_{\alpha\in\Lambda_\gamma} |\hat{R}(\alpha) - R(\alpha)| + \gamma. \qquad (4)$$

In the following, deviations of the empirical risk from the expected risk are measured on the scale of the maximal standard deviation σ_> := sup_{α∈Λ} (V{h(x;α)})^{1/2}. The expected risk of the empirical minimizer exceeds the global minimum of the expected risk by εσ_> with a probability bounded by Bernstein's inequality [14]:5

$$P\left\{ R(\hat\alpha^\perp) - R(\alpha^\perp) > \epsilon\,\sigma_> \right\} \;\le\; P\left\{ \sup_{\alpha\in\Lambda_\gamma} |\hat{R}(\alpha) - R(\alpha)| \ge \tfrac{1}{2}\left(\epsilon\,\sigma_> - \gamma\right) \right\} \;\le\; 2\,|\mathcal{H}_\gamma|\, \exp\!\left( - \frac{l\,(\epsilon - \gamma/\sigma_>)^2}{8 + 4\,\lambda\,(\epsilon - \gamma/\sigma_>)} \right). \qquad (5)$$

The complexity |H_γ| of the considered hypothesis class has to be small enough to guarantee with high confidence small ε-deviations. H_γ denotes the γ-cover of the hypothesis class H which minimizes the right-hand side of (5).

5 The choice of the Bernstein inequality instead of the bound proposed by Vapnik [3, Chap. 5.4] yields a bound where the coarse graining of the hypothesis space might be analyzed more easily, as the connections between coarse graining and the annealed entropy are not so clear-cut.


This large deviation inequality weighs two competing effects in the learning problem: the probability of a large deviation decreases exponentially with growing sample size l, whereas a large deviation becomes increasingly likely with growing cardinality of the γ-cover of the hypothesis class. According to (5), the sample complexity l_0(ε, γ, δ) is defined by

$$\log|\mathcal{H}_\gamma| - \frac{l_0\,(\epsilon - \gamma/\sigma_>)^2}{8 + 4\,\lambda\,(\epsilon - \gamma/\sigma_>)} + \log\frac{2}{\delta} = 0. \qquad (6)$$

With probability 1 − δ the deviation of the empirical risk from the expected risk is bounded by ½(ε_opt σ_> − γ) =: ε_app. Averaging over a function ball with radius 2ε_app around the empirical minimizer yields a hypothesis corresponding to a statistically significant structure in the data, i.e., R̂(α⊥) − R̂(α̂⊥) ≤ R(α⊥) + ε_app − (R(α̂⊥) − ε_app) ≤ 2ε_app since R(α⊥) ≤ R(α̂⊥). The key task in the following remains to calculate an upper bound for the cardinality |H_γ| of the γ-cover. It is exemplarily shown in the next section for the asymmetric clustering model that empirical risk approximation leads naturally to a finite stopping temperature in deterministic annealing algorithms, where the effective complexity of the hypothesis class depends on the computational temperature. In essence, deterministic annealing implements a sampling procedure for indistinguishable solutions and returns a typical or average solution, thereby avoiding the danger of overfitting. The computational temperature serves as a Lagrange parameter for the accuracy γ, which is justified by uniform convergence considerations.
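Condition (6) can be evaluated numerically. The following Python sketch is not part of the original paper; the functional form of the bound and all input values are assumptions taken from the reconstruction above. It solves the quadratic condition for the accuracy ε at which the uniform convergence statement holds with confidence δ, given a sample size l_0, the log-cardinality of the γ-cover, the precision γ, the scale σ_>, and the moment constant λ.

```python
import numpy as np

def accuracy_from_bound(l0, log_cover, gamma, sigma, lam, delta):
    """Solve log|H_gamma| - l0*x**2/(8 + 4*lam*x) + log(2/delta) = 0
    for x = eps - gamma/sigma (a quadratic in x) and return eps."""
    C = log_cover + np.log(2.0 / delta)   # shorthand C = log|H_gamma| + log(2/delta)
    x = (2.0 / l0) * (lam * C + np.sqrt(lam**2 * C**2 + 2.0 * l0 * C))
    return gamma / sigma + x

# illustrative numbers only
print(accuracy_from_bound(l0=1600, log_cover=40.0, gamma=0.1, sigma=1.0,
                          lam=1.0, delta=0.05))
```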

4 Asymmetric clustering model for histogram data

The asymmetric clustering model was developed on the basis of the distributional clustering model [1] for the analysis resp. grouping of objects characterized by co-occurrences of objects and certain feature values [2]. Application domains for this explorative data analysis approach can be found for example in texture segmentation [15], in statistical language modeling [1] or in document retrieval [16].

The ACM can be stated formally in the following way: Denote by Ω = X × Y the product space of objects x_i ∈ X, 1 ≤ i ≤ n, and features y_j ∈ Y, 1 ≤ j ≤ m. The x_i ∈ X are characterized by observations Z = {z_r} = {(x_{i(r)}, y_{j(r)}) : r = 1, ..., l}. The sufficient statistics [17] of how often the object-feature pair (x_i, y_j) occurs in the data set Z is measured by the set of frequencies {n_{ij} := number of observations (x_i, y_j) / total number of observations}. Derived measurements are the frequency of observing object x_i, i.e., n_i = Σ_j n_{ij}, and the frequency of observing feature y_j given object x_i, i.e., n_{j|i} = n_{ij}/n_i.

The ACM defines a generative model of a finite mixture of component probability distributions in feature space, i.e., data are generated by the following steps (a code sketch of this sampling process is given after Figure 1):

1. select an object x_i ∈ X with uniform probability 1/n (for simplicity reasons),6
2. choose the cluster C_ν according to the cluster membership ν = c(x_i),
3. select y_j ∈ Y from the class-conditional distribution q_{j|ν}.

This is the simplest model for the analysis of co-occurrence data [2], and a generalization of the presented results to more complex models is straightforward. The name Asymmetric Clustering Model reflects the fact that the clustering is performed only on the X-space. A toy example for this generative model is given in Figure 1.

6 In the general case, the probabilities P{x_i} also have to be estimated.


Figure 1: A simple example of a generative model for co-occurrence data. Depicted are the class-conditional distributions q_{j|ν}, ν = 1, ..., 4, the assignments of the objects x_i, i = 1, ..., 20, to classes (part a) and an example of co-occurrence data for 1600 observations (part b), which is the only information the data analyst has at hand. Here the absolute frequencies are shown.
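The three generation steps above translate directly into a sampling routine. The following sketch is illustrative only and is not the authors' code; the function name, the uniform object prior and the representation of the assignments c(x_i) as an integer array are assumptions.

```python
import numpy as np

def sample_cooccurrence_data(assignments, q, n_samples, seed=0):
    """assignments: length-n array of class indices c(x_i) in {0,...,k-1};
    q: (k, m) array of class-conditional feature distributions q_{j|nu}."""
    rng = np.random.default_rng(seed)
    n = len(assignments)
    m = q.shape[1]
    counts = np.zeros((n, m), dtype=int)   # absolute co-occurrence counts
    for _ in range(n_samples):
        i = rng.integers(n)                # 1. object x_i, uniform 1/n
        nu = assignments[i]                # 2. cluster via membership c(x_i)
        j = rng.choice(m, p=q[nu])         # 3. feature y_j ~ q_{.|nu}
        counts[i, j] += 1
    return counts / n_samples              # relative frequencies n_ij
```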

The modeling idea behind the ACM is best contrasted by comparing it to k-means clustering, where each object, i.e., data point x_i, is assigned to a prototypical centroid. In the ACM the objects x_i are assigned to a class-specific feature histogram q_{j|ν}. These prototypes are not defined as points in a Euclidean space but as generators for class-specific feature distributions in the space of discrete distributions. One should note that this problem is a density estimation problem, but in contrast to classical density estimation we do not want to infer something like the right binning for density estimators7 but to determine the generation process underlying the given co-occurrence data. In this context the class-conditional distributions q_{j|ν} are interpreted as histograms.

In addition to the class-conditional distributions q⃗ = (q_{j|ν}) we introduce indicator variables M_{iν} ∈ {0, 1} for the class membership of object x_i, where ν = 1, ..., k denotes the cluster and k denotes the maximal number of clusters. The constraint Σ_{ν=1}^{k} M_{iν} = 1 for all i, 1 ≤ i ≤ n, enforces unique assignments. Using these variables the observed data Z are distributed according to the generative model over X × Y:

$$P\{x_i, y_j \mid \mathbf{M}, \vec{q}\,\} = \frac{1}{n} \sum_{\nu=1}^{k} M_{i\nu}\, q_{j|\nu}. \qquad (7)$$

For the analysis of the unknown data source, characterized (at least approximately) by the empirical data Z, a structure α = (M, q⃗) with M ∈ {0,1}^{n×k} has to be inferred. The aim of an ACM analysis is to group the objects x_i according to the unknown indicator variables M_{iν} and to estimate for each class ν a prototypical feature distribution q_{j|ν}. The log-likelihood function is given by

$$\mathcal{L} = \sum_{r=1}^{l} \log P\left\{ x_{i(r)}, y_{j(r)} \right\} = \sum_{r=1}^{l} \sum_{\nu=1}^{k} M_{i(r)\nu} \log q_{j(r)|\nu} - l \log n. \qquad (8)$$

The maximization of the log-likelihood is equivalent to the minimization of the KL divergence between the empirical distribution of z and the model distribution P{x_i, y_j | α}. A good approximation of the empirical distribution should therefore in principle yield a good approximation of the true distribution of the unknown data source.

7 The binning is already defined by the number of feature values.


Figure 2: Solutions of the ACM model for different temperatures. The KL divergence of the cluster histograms to the mean histogram is plotted to depict phase transitions.

The conditions for this property to generalize with respect to unknown observations are formulated in the context of uniform convergence results from statistical learning theory and will be examined in the following. Using the sufficient statistics Z̃ = (n_{ij}) and the loss function h(x_i, y_j; α) = log n − Σ_ν M_{iν} log q_{j|ν}, the maximization of the likelihood can be formulated as minimization of the empirical risk

$$\hat{R}(\alpha; Z) = \sum_{i=1}^{n} \sum_{j=1}^{m} n_{ij}\, h(x_i, y_j; \alpha), \qquad (9)$$

where the essential quantity to be minimized is the expected risk

$$R(\alpha) = \sum_{i=1}^{n} \sum_{j=1}^{m} P^{true}\{x_i, y_j\}\, h(x_i, y_j; \alpha). \qquad (10)$$
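For concreteness, the two risks (9) and (10) can be evaluated with a few lines of code. The sketch below is illustrative; the array layout, the hard assignments c(i) and the clipping of zero probabilities are assumptions, not part of the paper.

```python
import numpy as np

def acm_risk(weights, assignments, q):
    """ACM risk with loss h(x_i, y_j; alpha) = log n - log q_{j|c(i)}.
    weights: (n, m) array, either n_ij (empirical risk) or P_true{x_i, y_j}
    (expected risk); assignments: length-n array of cluster indices;
    q: (k, m) class-conditional histograms."""
    n, m = weights.shape
    log_q = np.log(np.maximum(q[assignments], 1e-12))  # (n, m): log q_{j|c(i)}
    h = np.log(n) - log_q                              # loss per (object, feature)
    return float((weights * h).sum())
```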

Minimization of (9) using differentiation and the method of Lagrange parameters to ensure proper normalization of the model parameters yields the stationary equations for the empirical estimates of q⃗ and M:

$$\hat{q}_{j|\nu} = \frac{\sum_{i=1}^{n} \hat{M}_{i\nu}\, n_{ij}}{\sum_{i=1}^{n} \hat{M}_{i\nu}\, n_i} = \frac{\sum_{i=1}^{n} \hat{M}_{i\nu}\, n_i\, n_{j|i}}{\sum_{i=1}^{n} \hat{M}_{i\nu}\, n_i}, \qquad (11)$$

$$\hat{M}_{i\nu} = \begin{cases} 1 & \text{if } \nu = \arg\min_{\mu} \left\{ -\sum_{j=1}^{m} n_{j|i} \log \hat{q}_{j|\mu} \right\}, \\ 0 & \text{else.} \end{cases} \qquad (12)$$

A local minimum of (9) (a local maximum of (8)) is obtained by alternating these equations until a fixed point is found. It is worth mentioning that (11) generalizes the centroid condition of the k-means algorithm. Furthermore, (12) is the ACM version of a nearest neighbor rule, i.e., it assigns each object x_i to the class ν such that the empirically estimated distribution n_{j|i} has minimal KL-divergence to the class-conditional distribution q_{j|ν}. As in classical k-means, the deterministic annealing version of the ACM is obtained by replacing the assignment variables M_{iν} by their probabilistic counterparts:

$$\langle M_{i\nu} \rangle = \frac{\exp\left( \beta \sum_{j=1}^{m} n_{j|i} \log q_{j|\nu} \right)}{\sum_{\mu=1}^{k} \exp\left( \beta \sum_{j=1}^{m} n_{j|i} \log q_{j|\mu} \right)}, \qquad (13)$$

where β denotes the inverse computational temperature.

Empirically there exists evidence [2] that stopping at a finite temperature (β < ∞) avoids overfitting, i.e., the inferred structure generalizes with respect to the unknown distribution. The dynamics of the cooling process is depicted in Fig. 2. At high temperatures all feature distributions q_{j|ν} are the same. At a critical temperature β^{-1} ≈ 0.15 the feature distributions differentiate into two clusters, and at still lower temperatures into three and four clusters. The phase transition plot (Fig. 2) depicts the distance of the inferred cluster distributions q_{j|ν} from the mean feature histogram. At zero temperature, all k = 4 clusters are distinct.
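The alternation of (11) and (13) along an increasing schedule of inverse temperatures can be summarized in a short routine. The following is a minimal sketch under stated assumptions (random soft initialization, a fixed inner iteration budget, numerical clipping of zero probabilities); it is not the authors' implementation.

```python
import numpy as np

def acm_deterministic_annealing(n_ij, k, betas, tol=1e-6, seed=0):
    """Sketch of ACM deterministic annealing: alternate the centroid
    equation (11) for q_{j|nu} with the annealed assignments (13) while
    increasing the inverse temperature beta along the schedule `betas`."""
    rng = np.random.default_rng(seed)
    n, m = n_ij.shape                        # objects x features, n_ij relative frequencies
    n_i = n_ij.sum(axis=1, keepdims=True)    # frequency of object x_i
    n_j_given_i = n_ij / np.maximum(n_i, 1e-12)
    M = rng.dirichlet(np.ones(k), size=n)    # random soft assignments as a start
    q = np.full((k, m), 1.0 / m)
    for beta in betas:
        for _ in range(200):
            # (11): class histograms as weighted averages of the n_{j|i}
            w = M * n_i                      # weights M_{i,nu} * n_i
            q_new = w.T @ n_j_given_i
            q_new /= np.maximum(q_new.sum(axis=1, keepdims=True), 1e-12)
            # (13): annealed assignments, shifted for numerical stability
            logits = beta * (n_j_given_i @ np.log(np.maximum(q_new, 1e-12)).T)
            logits -= logits.max(axis=1, keepdims=True)
            M_new = np.exp(logits)
            M_new /= M_new.sum(axis=1, keepdims=True)
            converged = (np.abs(M_new - M).max() < tol
                         and np.abs(q_new - q).max() < tol)
            M, q = M_new, q_new
            if converged:
                break
    return M, q
```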

5 The critical temperature

Due to the limited precision of the observed data it is natural to study histogram clustering as a learning problem with the hypothesis class

$$\mathcal{H} = \left\{ -\sum_{\nu=1}^{k} M_{i\nu} \log q_{j|\nu} \;:\; \forall i\, \forall\nu\; M_{i\nu} \in \{0,1\} \,\wedge\, \sum_{\nu=1}^{k} M_{i\nu} = 1,\;\; \forall j\; q_{j|\nu} \in \{1/l, 2/l, \ldots, 1\},\; \sum_{j=1}^{m} q_{j|\nu} = 1 \right\}. \qquad (14)$$

The limited number of observations causes a limited precision of the frequencies n_{j|i}. The value q_{j|ν} = 0 has been excluded since it gives rise to infinite expected risk for P^{true}{y_j|x_i} > 0. The size of the regularized hypothesis class H can be upper bounded by the cardinality of the complete hypothesis class divided by the minimal cardinality of a function ball, i.e.,

$$|\mathcal{H}_\gamma| \le |\mathcal{H}| \Big/ \min_{\tilde h \in \mathcal{H}} |\mathcal{B}_\gamma(\tilde h)|. \qquad (15)$$

The cardinality of a function ball with radius γ can be approximated by adopting techniques from statistical physics and asymptotic analysis [18] (Θ(x) = 1 for x ≥ 0 and 0 for x < 0):8

$$|\mathcal{B}_\gamma(\tilde h)| = \sum_{\mathbf{M}} \sum_{\{q_{j|\nu}\}} \Theta\!\left( \gamma - \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} P^{true}\{y_j|x_i\} \log \frac{q_{j|m(i)}}{\tilde q_{j|\tilde m(i)}} \right) \qquad (16)$$

$$= \frac{l^{k(m-1)}}{(2\pi)^k} \int_0^1 \!\cdots\! \int_0^1 \prod_{j=1}^{m} \prod_{\nu=1}^{k} dq_{j|\nu} \int_{-i\infty}^{+i\infty} \!\cdots\! \int_{-i\infty}^{+i\infty} \prod_{\nu=1}^{k} dQ_\nu \int_{-i\infty}^{+i\infty} \frac{d\hat{x}}{2\pi \hat{x}}\, \exp\left( n\, S(\vec{q}, Q, \hat{x}) \right),$$

8 In the following the Fourier representation of the Heaviside function is used.

and the entropy S is given by

$$S(\vec{q}, Q, \hat{x}) = \hat{x}\, \gamma - \sum_{\nu=1}^{k} Q_\nu \left( \sum_{j=1}^{m} q_{j|\nu} - 1 \right) + \frac{1}{n} \sum_{i=1}^{n} \log \sum_{\nu=1}^{k} \exp\left( -\hat{x} \sum_{j=1}^{m} P^{true}\{y_j|x_i\} \log \frac{q_{j|\nu}}{\tilde q_{j|\tilde m(i)}} \right). \qquad (17)$$

The auxiliary variables Q = {Q_ν}_{ν∈{1,...,k}} are Lagrange parameters to enforce the constraint Σ_{j=1}^{m} q_{j|ν} = 1. Choosing q̃_{j|m̃(i)} = 1/l for all j, we obtain a lower bound on the entropy and therefore a bound on the integral which is independent of h̃. Using the abbreviation ρ_{iν} := Σ_{j=1}^{m} P^{true}{y_j|x_i} (log q_{j|ν} + log l) and

$$P_{i\nu} = \frac{\exp(-\hat{x}\, \rho_{i\nu})}{\sum_{\mu=1}^{k} \exp(-\hat{x}\, \rho_{i\mu})}, \qquad (18)$$

the following saddle point approximations are obtained:

$$q_{j|\nu} = \frac{\sum_{i=1}^{n} P^{true}\{y_j|x_i\}\, P_{i\nu}}{\sum_{l'=1}^{m} \sum_{i=1}^{n} P^{true}\{y_{l'}|x_i\}\, P_{i\nu}}, \qquad (19)$$

$$\gamma = \frac{1}{n} \sum_{i=1}^{n} \sum_{\nu=1}^{k} P_{i\nu} \sum_{j=1}^{m} P^{true}\{y_j|x_i\} \left( \log q_{j|\nu} + \log l \right) = \frac{1}{n} \sum_{i=1}^{n} \sum_{\nu=1}^{k} P_{i\nu}\, \rho_{i\nu}. \qquad (20)$$

The entropy S evaluated at the saddle point yields, in combination with the Laplace approximation, an estimate for the cardinality of the γ-cover (15):

$$\log|\mathcal{H}_\gamma| = \frac{1}{2}\log n + \frac{1}{2}\log|A_{sp}| + n\,(\log k - S), \qquad (21)$$

where |A_{sp}| denotes the determinant of the Hessian matrix evaluated at the saddle point. Inserting this complexity into equation (6) yields an equation which determines the required number of samples l_0 for fixed precision ε and confidence δ. This equation defines a functional relationship between the precision ε and the approximation quality γ for fixed sample size l_0 and confidence δ. Under this assumption the precision ε depends on γ in a non-monotone fashion, i.e.,

$$\epsilon = \frac{\gamma}{\sigma_>} + \frac{2}{l_0} \left[ \lambda C + \sqrt{\lambda^2 C^2 + 2\, l_0\, C} \right], \qquad (22)$$

using the abbreviation C = log|H_γ| + log(2/δ). The minimum of the function ε(γ) defines a compromise between the uncertainty originating from empirical fluctuations and the loss of precision due to the approximation by a γ-cover. Differentiating with respect to γ and setting the result to zero yields an upper bound for the inverse temperature:

$$\hat{x} \;\le\; \frac{1}{\sigma_>}\, \frac{l_0}{2n} \left( \lambda + \frac{l_0 + \lambda^2 C}{\sqrt{2\, l_0\, C + \lambda^2 C^2}} \right)^{-1}. \qquad (23)$$

Analogous to estimates for k-means, phase transitions occur in the ACM while lowering the temperature.


i     1  2  3  4  5  6  7  8  9 10   11 12 13 14 15 16 17 18 19 20   21 22 23 24 25 26 27 28 29 30
m(i)  5  3  2  5  2  2  5  4  2  2    2  4  1  5  3  5  3  4  1  2    2  3  1  1  2  5  5  2  2  1

ν   q_{j|ν}
1   {0.11, 0.01, 0.11, 0.07, 0.08, 0.04, 0.06, 0, 0.13, 0.07, 0.08, 0.1, 0, 0.11, 0.03}
2   {0.18, 0.1, 0.09, 0.02, 0.05, 0.09, 0.08, 0.03, 0.06, 0.07, 0.03, 0.02, 0.07, 0.06, 0.05}
3   {0.17, 0.05, 0.05, 0.06, 0.06, 0.05, 0.03, 0.11, 0.09, 0, 0.02, 0.1, 0.03, 0.07, 0.11}
4   {0.15, 0.07, 0.1, 0.03, 0.09, 0.03, 0.04, 0.05, 0.06, 0.05, 0.08, 0.04, 0.08, 0.09, 0.04}
5   {0.09, 0.09, 0.07, 0.1, 0.07, 0.06, 0.06, 0.11, 0.07, 0.07, 0.1, 0.02, 0.07, 0.02, 0}

Figure 3: Generative model used in the Monte Carlo experiments for the evaluation of the theoretical results for the ACM. The first table shows the assignments m(i) of the objects to classes, while the second table lists the class-conditional distributions q_{j|ν} used.

The mixture model for the data at hand can be partitioned into more and more components, revealing finer and finer details of the generation process. In the limit each object x_i would be assigned to a separate cluster. The critical x̂_opt defines the resolution limit below which details cannot be resolved in a reliable fashion on the basis of the sample size l_0. Given the inverse temperature x̂, the cardinality of the hypothesis class can be upper bounded via the solution of the fixed point equations (19) and (20). On the other hand, this cardinality together with (23) and the sample size l_0 defines an upper bound on x̂. Iterating these two steps, we finally obtain an upper bound for the critical inverse temperature given a sample size l_0.
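The iteration just described can be written down compactly. The sketch below is a reconstruction under assumptions (the closed form of (23) as reconstructed above, a user-supplied routine for the saddle-point estimate of log|H_γ|, and illustrative constants σ_> and λ); it is not the authors' implementation.

```python
import numpy as np

def critical_temperature_bound(l0, n, delta, sigma, lam, log_cover_at,
                               x_hat=1.0, n_iter=50):
    """Alternate (i) the saddle-point estimate log|H_gamma| at the current x_hat
    (supplied by log_cover_at, cf. Eqs. (19)-(21)) with (ii) the bound (23)."""
    for _ in range(n_iter):
        C = log_cover_at(x_hat) + np.log(2.0 / delta)
        x_new = (l0 / (2.0 * n * sigma)) / (
            lam + (l0 + lam**2 * C) / np.sqrt(2.0 * l0 * C + lam**2 * C**2))
        if abs(x_new - x_hat) < 1e-6:
            break
        x_hat = x_new
    return x_hat
```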

6 Empirical Results

For the evaluation of the derived theoretical result, a series of Monte Carlo experiments on artificial data has been performed for the asymmetric clustering model. Given the number of objects n = 30, the number of groups k = 5 and the size of the histograms m = 15, the generative model for these experiments was created randomly and is summarized in Figure 3. From this generative model sample sets of arbitrary size can be generated and the true distributions P^{true}{y_j|x_i} can be calculated. In Figure 4a,b the predicted temperatures (vertical lines) are compared to the empirically observed critical temperatures (minima of the expected risk), which have been estimated on the basis of 2000 different samples of randomly generated co-occurrence data for each l_0. The expected risk (solid) and the empirical risk (dashed) of these 2000 inferred models are averaged. The algorithm for the estimation of the critical temperature is depicted in Fig. 5. Figure 4c indicates that on average the minimal expected risk is attained when the effective number of clusters is smaller than or equal to five, i.e., the number of clusters of the true generative model. Therefore, predicting the right computational temperature also enables the data analyst to solve the cluster validation problem for the asymmetric clustering model. Especially for l_0 = 800 these results suggest that, in the light of such a small training set, five clusters normally cannot be estimated in a reliable way. On the other hand, for l_0 = 1600 and l_0 = 2000 the right temperature prevents the algorithm from inferring too many clusters, which would be an instance of overfitting. As an interesting point one should note that for an infinite number of observations the critical inverse temperature diverges, but not more than the five effective clusters are extracted. At this point we conclude that, for the case of histogram clustering, the ERA solves for realizable rules the problem of model validation, i.e., choosing the right number of clusters.
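To make the evaluation procedure concrete, the following sketch mimics the experiment on a reduced scale: it reuses the hypothetical helpers sketched in Section 4 (sample_cooccurrence_data, acm_deterministic_annealing, acm_risk), averages far fewer than the 2000 sample sets of the paper, and locates the empirical critical temperature as the minimizer of the averaged expected risk. All names and settings are illustrative assumptions, not the authors' code.

```python
import numpy as np

def empirical_critical_temperature(assign_true, q_true, l0, betas, n_runs=20, seed=0):
    """Average the expected risk over several generated sample sets and return
    the inverse temperature beta at which it is minimal."""
    p_true = q_true[assign_true] / len(assign_true)   # P_true{x_i, y_j} = q_{j|c(i)} / n
    expected = np.zeros(len(betas))
    for run in range(n_runs):                         # the paper averages 2000 runs
        n_ij = sample_cooccurrence_data(assign_true, q_true, l0, seed=seed + run)
        for b, beta in enumerate(betas):
            M, q = acm_deterministic_annealing(n_ij, k=q_true.shape[0],
                                               betas=np.linspace(0.01, beta, 20))
            hard = M.argmax(axis=1)                   # hard assignments at this beta
            expected[b] += acm_risk(p_true, hard, q) / n_runs
    return betas[int(np.argmin(expected))], expected  # empirical critical beta, risk curve
```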


Figure 4: Comparison between the theoretically derived upper bound on the optimal temperature and the observed critical temperatures (minimum of the x̂ vs. expected risk curve). Panels a) and b) plot the risk against the inverse temperature β; panel c) plots the effective number of prototypes against β. The bold (dashed) lines denote the expected (empirical) risk averaged over 2000 i.i.d. sample sets. Depicted are the plots for l_0 = 800, 1200, 1600, 2000. Vertical lines indicate the predicted critical temperatures. In addition, the average effective number of clusters is drawn in part c).

7 Conclusions

The two conditions, that the empirical risk has to converge uniformly towards the expected risk and that all loss functions within a 2ε_app-range of the global empirical risk minimum have to be considered in the inference process, limit the complexity of the underlying hypothesis class for a given number of samples. The maximum entropy method, which has been widely employed in deterministic annealing procedures for optimization problems, is substantiated by our analysis. Maximizing the entropy with a fixed approximation accuracy γ corresponds to minimizing the generalized free energy with fixed temperature. This continuation method for combinatorial or continuous optimization is substantiated as the maximally robust optimization strategy with optimal generalization properties. Solutions with too many clusters clearly overfit the data and do not generalize. The condition that the hypothesis class should only be divided into function balls of size γ forces us to stop the stochastic search at the lower bound of the computational temperature.

Structural Risk Minimization (SRM) was introduced by Vapnik as an induction principle to adjust the trade-off between a limited number of training samples and the complexity of the hypothesis class.

Given:
- The number of observations l_0.
- A maximal number of clusters k.
- The true probability distribution P^{true}(y_j|x_i) (or an estimate P^{est}(y_j|x_i) of this distribution on the basis of l_0 additional data points).
- Fix q̃_{j|ν} = 1/l_0 for all j, ν.

Algorithm:
1. Sample randomly K hypotheses such that min_{j,ν} q_{j|ν} ≥ 1/l_0. Estimate the maximal expected risk R_max on the basis of these K hypotheses.
2. Determine the temperature-zero solution R_min (equations (19) and (20)).
3. Set x̂ = 1.0.
4. Set x̂_act = 0.0001.
5. While x̂_act < x̂: repeat the update of q⃗ with eq. (19) at the actual inverse temperature x̂_act until convergence of q⃗, then increase x̂_act ← x̂_act (1 + η) by a small step η.
6. Calculate γ with eq. (20).
7. Compute the risk difference R − R_min for the current solution.
8. Estimate the minimal λ and the maximal standard deviation σ_> for a γ-cover of the space of the randomly generated hypotheses.
9. Calculate a new estimate of the temperature using equation (23).
10. If the change in x̂ is sufficiently large, go to step 4.

Figure 5: The algorithm to determine the critical temperature.

Hypothesis classes are defined as nested structures of sets of functions, and the function with minimal empirical risk is selected from the appropriate subset according to the number of samples. ERA differs from the SRM principle in two respects: (i) the ERA algorithm determines an averaged hypothesis from a subset in the sequence rather than selecting the function with minimal empirical risk; (ii) the inference algorithm initially samples from a large subset representing a crude approximation and proceeds to smaller and smaller subsets in the sequence when additional data permit a refined estimate of the parameters. The similarity between temperature-dependent regularization and structural risk minimization as a scheme for complexity control in supervised learning is apparent and hints at additional, yet still unexplored, parallels between supervised and unsupervised learning.

Another important result of this investigation is the fact that choosing the right stopping temperature for the annealing process not only avoids overfitting but also solves the cluster validation problem in the case of the ACM. The possible inference of too many clusters on the basis of the empirical risk functional is suppressed and only an effective number of clusters is inferred at the critical temperature.

References

[1] F. C. N. Pereira, N. Z. Tishby, and L. Lee. Distributional clustering of English words. In 30th Annual Meeting of the Association for Computational Linguistics, Columbus, Ohio, pages 183-190, 1993.

[2] T. Hofmann and J. Puzicha. Statistical models for co-occurrence data. AI-MEMO 1625, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 1998.

[3] V. N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.

12

[4] J. M. Buhmann. Empirical risk approximation. Technical Report IAI-TR 98-3, Institut für Informatik III, Universität Bonn, 1998.

[5] T. Linder, G. Lugosi, and K. Zeger. Rates of convergence in the source coding theorem, in empirical quantizer design, and in universal lossy source coding. IEEE Transactions on Information Theory, 40(6):1728-1740, November 1994.

[6] T. Linder, G. Lugosi, and K. Zeger. Empirical quantizer design in the presence of source noise or channel noise. IEEE Transactions on Information Theory, 43(2):612-623, March 1997.

[7] D. Pollard. Quantization and the method of k-means. IEEE Transactions on Information Theory, 28(2):199-205, 1982.

[8] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. In Symposium on Foundations of Computer Science, pages 292-300. IEEE Computer Society Press, 1993.

[9] D. Haussler, M. Kearns, and R. Schapire. Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14(1):83-113, 1994.

[10] D. Haussler, M. Kearns, H. S. Seung, and N. Tishby. Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25:195-236, 1997.

[11] H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review A, 45(8):6056-6091, April 1992.

[12] D. Haussler and M. Opper. Mutual information, metric entropy and cumulative relative entropy risk. Annals of Statistics, December 1996.

[13] V. N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16:264-280, 1971.

[14] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer-Verlag, New York, Berlin, Heidelberg, 1996.

[15] J. Puzicha, T. Hofmann, and J. M. Buhmann. Discrete mixture models for unsupervised texture segmentation. In Proceedings of the DAGM-Symposium Mustererkennung 1998, pages 135-142, 1998.

[16] T. Hofmann, J. Puzicha, and M. I. Jordan. Learning from dyadic data. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11. MIT Press, 1999. To appear.

[17] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973.

[18] N. G. De Bruijn. Asymptotic Methods in Analysis. North-Holland Publishing Co., Amsterdam, 1958; reprinted by Dover, 1981.

