UNSUPERVISED LEARNING WITHOUT OVERFITTING: EMPIRICAL RISK APPROXIMATION AS AN INDUCTION PRINCIPLE FOR RELIABLE CLUSTERING

Joachim M. Buhmann, Institut für Informatik III, Universität Bonn, Germany
Marcus Held, Institut für Informatik III, Universität Bonn, Germany

ABSTRACT

Unsupervised learning algorithms are designed to extract structure from data samples on the basis of a cost function for structures. For a reliable and robust inference process, the unsupervised learning algorithm has to guarantee that the extracted structures are typical for the data source. In particular, it has to reject all structures for which the inference is dominated by the arbitrariness of the sample noise and which, consequently, can be characterized as overfitting in unsupervised learning. This paper summarizes an inference principle called Empirical Risk Approximation which allows us to quantitatively measure the overfitting effect and to derive a criterion as a safeguard against it. The crucial condition for learning is met if (i) the empirical risk of learning converges uniformly towards the expected risk and (ii) the hypothesis class retains a minimal variety for consistent inference. Parameter selection of learnable data structures is demonstrated for the case of k-means clustering, and Monte Carlo simulations are presented to support the selection principle.

1 PHILOSOPHY OF UNSUPERVISED LEARNING

Learning algorithms are designed to extract structure from data. Two classes of algorithms have been widely discussed in the literature: supervised and unsupervised learning. The distinction between the two classes depends on supervision or teacher information, which is either available to the learning algorithm or missing in the learning process. This paper describes a statistical learning theory of unsupervised learning. We summarize an induction principle for unsupervised learning which is referred to as Empirical Risk Approximation [1]. This principle is based on the optimization of a quality functional for structures in data and, most importantly, it contains a safeguard against overfitting of structure to the noise in a sample set. The extracted structure of the data is encoded by a loss function and is assumed to produce a learning risk below a predefined risk threshold. The induction principle is summarized by the following two inference steps:

1. Define a hypothesis class containing loss functions which evaluate candidate structures of the data and which measure their quality. Control the complexity of the hypothesis class by an upper bound on the costs.

2. Select an arbitrary loss function from the smallest subset which still guarantees consistent learning (in the sense of Statistical Learning Theory).

The reader should note that the learning algorithm has to return a structure with costs bounded by a preselected cost threshold, but it is not required to return the structure with minimal empirical risk as in Vapnik's "Empirical Risk Minimization" induction principle for classification and regression [2]. All structures in the data with risk smaller than the selected risk bound are considered to be equivalent in the approximation process, without further distinction.

Various cases of unsupervised learning algorithms are discussed in the literature with emphasis on the optimization aspect, i.e., data clustering or vector quantization, self-organizing maps, principal or independent component analysis and principal curves, projection pursuit, and algorithms for relational data such as multidimensional scaling, pairwise clustering, and histogram-based clustering methods. How can we select a loss function from a hypothesis class with the desired approximation quality or, equivalently, find a structure with bounded costs? The nested sequence of hypothesis classes suggests a tracking strategy for solutions which are elements of smaller and smaller hypothesis classes in that sequence. The sequential sampling can be implemented by solution tracking, i.e., admissible solutions are incrementally improved by gradient descent whenever the process samples from a subset with a smaller risk bound than in the previous sampling step. Candidate procedures for sampling structures from the hypothesis class are stochastic search techniques or continuation methods like simulated or deterministic annealing, although the theory does not refer to any particular learning algorithm. The general framework of Empirical Risk Approximation and its mathematical formulation are presented in Sec. 2, whereas a more detailed discussion can be found in [1]. This theory of unsupervised learning is applied to the case of central clustering in Sec. 3.

2 MATHEMATICAL FRAMEWORK FOR UNSUPERVISED LEARNING

The data samples $\mathcal{X} = \{x_i \in \Omega \subset \mathbb{R}^d : 1 \le i \le l\}$ which have to be analysed by the unsupervised learning algorithm are elements of a suitable $d$-dimensional Euclidean space. The data are distributed according to a measure $\mu$ which is assumed to be known for the analysis. A mathematically precise statement of the Empirical Risk Approximation induction principle requires several definitions which formalize the notion of searching for structure in the data. The quality of structures extracted from the data set $\mathcal{X}$ is evaluated by a learning cost function

$$\hat{R}(\alpha; \mathcal{X}) = \frac{1}{l} \sum_{i=1}^{l} h(x_i; \alpha). \qquad (1)$$

$\hat{R}(\alpha; \mathcal{X})$ denotes the empirical risk of learning a structure $\alpha$ for an i.i.d. sample set $\mathcal{X}$. The function $h(x; \alpha)$ is known as the loss function in statistics. It measures the costs for processing a generic datum $x$ and often corresponds to the negative log-likelihood of a stochastic model. For example, in vector quantization the loss function quantifies the costs of assigning a data point $x$ to a particular prototype or codebook vector. Each value $\alpha \in \Lambda$ parametrizes an individual loss function, with $\Lambda$ denoting the parameter set. The parameter $\alpha$ characterizes the different structures of the data set which are hypothetically considered as candidate structures in the learning process and which have to be validated. Statistical learning theory distinguishes between different classes of loss functions, i.e., 0-1 functions in classification problems, bounded functions and unbounded non-negative functions in regression problems. This paper is concerned with the class of unbounded non-negative functions since we are particularly interested in polynomially increasing loss functions ($O(\|x\|^p)$) as they occur in vector quantization. Note that the quality measure $\hat{R}(\alpha; \mathcal{X})$ depends on the i.i.d. data set $\mathcal{X}$ and, therefore, is itself a random variable. The relevant quality measure for unsupervised learning, however, is the expectation value of this random variable, known as the expected risk of learning

$$R(\alpha) = \int_{\Omega} h(x; \alpha) \, d\mu(x). \qquad (2)$$
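To make the distinction between (1) and (2) concrete, the following sketch estimates both quantities for a quadratic vector-quantization loss with a fixed two-prototype codebook. The Gaussian data source, the prototype positions, and the use of a large test sample as a stand-in for the measure $\mu$ are illustrative assumptions of this sketch, not part of the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def h(x, codebook):
    """Quadratic vector-quantization loss: squared distance to the closest prototype."""
    d2 = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)   # (n, k)
    return d2.min(axis=1)

# Illustrative setting: d-dimensional standard Gaussian source, fixed codebook alpha.
d, l = 2, 100
codebook = np.array([[-1.0, 0.0], [1.0, 0.0]])       # one hypothetical structure alpha
X = rng.standard_normal((l, d))                      # small i.i.d. sample
X_test = rng.standard_normal((100_000, d))           # large sample approximating mu

R_hat = h(X, codebook).mean()        # empirical risk, Eq. (1)
R = h(X_test, codebook).mean()       # Monte Carlo estimate of the expected risk, Eq. (2)
print(f"empirical risk R_hat = {R_hat:.3f}, expected risk R ~ {R:.3f}")
```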

While minima of $\hat{R}(\alpha; \mathcal{X})$ or solutions with bounded empirical risk are influenced by fluctuations in the samples, it is the expected risk $R(\alpha)$ which completely assesses the quality of learning results in a sound probabilistic fashion. The distribution $\mu$ is assumed to be known in the following analysis and it has to decay sufficiently fast such that all $r$th moments ($r > 2$) of the loss function $h(x; \alpha)$ are bounded by $E\{|h(x;\alpha) - R(\alpha)|^r\} \le r!\,\tau^{r-2}\, V\{h(x;\alpha)\}$, $\forall \alpha \in \Lambda$. $E\{\cdot\}$ and $V\{\cdot\}$ denote the expectation and variance of a random variable, respectively, and $\tau$ is a distribution-dependent constant. The moment constraint on the distribution $\mu$ holds for all distributions with exponentially fast decaying tails, which includes all distributions with finite support. Empirical Risk Approximation is an induction principle which requires the learning algorithm to sample from the smallest consistently learnable subset of the hypothesis class. In the following, the hypothesis class $H$ contains all loss functions $h(x;\alpha)$, $\alpha \in \Lambda$, and the subsets of risk bounded loss functions $H_{R^\star}$ are defined as

$$H_{R^\star} = \{ h(x;\alpha) : \alpha \in \Lambda \wedge R(\alpha) \le R^\star \}. \qquad (3)$$

The subsets $H_{R^\star}$ are obviously nested, since $H_{R_1} \subseteq H_{R_2} \subseteq \cdots \subseteq H$ for $R_1 \le R_2 \le \cdots \le \infty$ and $H = \lim_{R^\star \to \infty} H_{R^\star}$. $R^\star$ induces a structure on the hypothesis class in the sense of Vapnik's "Structural Risk" and essentially controls the complexity of the hypothesis class. The Empirical Risk Approximation induction principle requires us to define a nested sequence of hypothesis classes with bounded expected risk and to sample from the hypothesis class with the desired approximation quality. Algorithmically, however, we select the loss function according to the bounded empirical risk $\hat{R}(\alpha;\mathcal{X}) \le R^\star$. This induction principle is consistent if a bounded empirical risk implies in the asymptotic limit that the loss function has bounded expected risk $R(\alpha) \le R^\star$ and, therefore, is an element of the hypothesis class $H_{R^\star}$, i.e.,

$$\forall \alpha \in \Lambda: \quad \lim_{l \to \infty} \hat{R}(\alpha; \mathcal{X}) \le R^\star \;\Longrightarrow\; h(x;\alpha) \in H_{R^\star}. \qquad (4)$$

This implication essentially states that the expected risk asymptotically does not exceed the risk bound ($R(\alpha) \le R^\star$) and, therefore, the inferred structure is an approximation of the optimal data structure with risk not worse than $R^\star$. The consistency assumption (4) for Empirical Risk Approximation holds if the empirical risk converges uniformly towards the expected risk,

$$\lim_{l \to \infty} P\left\{ \sup_{\alpha \in \Lambda_\epsilon} \frac{|R(\alpha) - \hat{R}(\alpha;\mathcal{X})|}{\sqrt{V\{h(x;\alpha)\}}} > \epsilon \right\} = 0, \quad \forall \epsilon > 0, \qquad (5)$$

where $\Lambda_\epsilon$ denotes the hypothesis set of $\epsilon$-distinguishable structures. $\Lambda_\epsilon$ defines an $\epsilon$-net $H_{\epsilon,R^\star}$ on the set of loss functions $H_{R^\star}$, or coarsens the hypothesis class with an $\epsilon$-separated set if there does not exist a finite $\epsilon$-net for $H_{R^\star}$. Using results from the theory of empirical processes, the probability of an $\epsilon$-deviation of the empirical risk from the expected risk can be bounded by Bernstein's inequality ([3], Lemma 2.2.11):

$$P\left\{ \sup_{\alpha \in \Lambda_\epsilon} \frac{|R(\alpha) - \hat{R}(\alpha;\mathcal{X})|}{\sqrt{V\{h(x;\alpha)\}}} > \epsilon \right\} \le 2\,|H_\epsilon| \exp\!\left( - \frac{l\,\epsilon^2}{2(1 + \epsilon\tau/\sigma_{\min})} \right). \qquad (6)$$

The minimal variance of all loss functions is denoted by $\sigma_{\min}^2 = \inf_{\alpha \in \Lambda} V\{h(x;\alpha)\}$. $|H_\epsilon|$ denotes the cardinality of an $\epsilon$-net constructed for the hypothesis class $H_{R^\star}$ under the assumption of the measure $\mu$. The confidence level $\delta$ limits the probability of large deviations [1]. The large deviation inequality weighs two competing effects in the learning problem, i.e., the probability of a large deviation decreases exponentially with growing sample size $l$, whereas a large deviation becomes increasingly likely with growing cardinality of the hypothesis class.

A compromise between both effects determines how reliable an estimate actually is for a given data set $\mathcal{X}$. The sample complexity $l_0(\epsilon,\delta)$ is defined as the necessary number of training points to guarantee that $\epsilon$-large deviations of the empirical risk from the expected risk, relative to the variance, occur with probability not larger than $\delta$. According to Eq. (6) the sample complexity $l_0(\epsilon,\delta)$ is given by

$$l_0 = \frac{2}{\epsilon^2} \left( 1 + \frac{\epsilon\tau}{\sigma_{\min}} \right) \left( \log |H_\epsilon| + \log \frac{2}{\delta} \right). \qquad (7)$$
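A minimal numerical reading of Eq. (7), assuming illustrative values for $\log|H_\epsilon|$, $\delta$ and $\tau/\sigma_{\min}$ (none of these numbers are taken from the paper, and $\log|H_\epsilon|$ is treated as a constant even though it actually grows as $\epsilon$ decreases): the required sample size follows directly from the formula, and a bisection over $\epsilon$ gives the accuracy achievable with a fixed sample size.

```python
import math

def sample_complexity(eps, delta, log_card, tau_over_sigma):
    """l_0 of Eq. (7): (2/eps^2) * (1 + eps*tau/sigma_min) * (log|H_eps| + log(2/delta))."""
    return 2.0 / eps**2 * (1.0 + eps * tau_over_sigma) * (log_card + math.log(2.0 / delta))

def achievable_eps(l, delta, log_card, tau_over_sigma, lo=1e-4, hi=10.0):
    """Smallest eps whose sample complexity does not exceed l (l_0 is decreasing in eps)."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if sample_complexity(mid, delta, log_card, tau_over_sigma) > l else (lo, mid)
    return hi

# Illustrative values (assumptions): log|H_eps| = 20, delta = 0.05, tau/sigma_min = 4.
print(f"l_0(eps=0.1)  = {sample_complexity(0.1, 0.05, 20.0, 4.0):,.0f}")
print(f"eps(l=10,000) = {achievable_eps(10_000, 0.05, 20.0, 4.0):.3f}")
```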

Equation (7) effectively limits the accuracy of Empirical Risk Approximation. A selected accuracy $\epsilon$ for the ERA induction principle and a predefined confidence value $\delta$ require a minimal size $l_0$ of the data set. Solving Eq. (7) for $\epsilon$ yields a lower bound on the achievable accuracy for a given sample set. An obvious strategy to improve this critical accuracy is to decrease the cardinality of the hypothesis class by lowering the risk bound $R^\star$ and sampling from a smaller subset of loss functions. In the extreme case of zero learning, e.g., if $H_\epsilon$ is a singleton set with exactly one loss function, the log-cardinality of the hypothesis class $\log |H_\epsilon|$ vanishes. Is this extreme case achievable and, if not, what is the minimal admissible complexity of the hypothesis class $H_\epsilon$? The cardinality of the hypothesis class is controlled by the risk bound $R^\star$. A minimal value for the risk bound would ensure a minimal size of the hypothesis class $H_{\epsilon,R^\star}$, which is equivalent to a minimal variety of structures in the data considered in the unsupervised learning task. The second condition of the Empirical Risk Approximation principle ensures that all loss functions with $\epsilon$-typical risk are included in the hypothesis class. Consequently, the large upper deviation $\hat{R}(\alpha;\mathcal{X}) \le R(\alpha) + \epsilon\sqrt{V\{h(x;\alpha)\}}$ provides a lower bound for the risk bound $R^\star$. The ERA principle requires that we randomly select a loss function with the property $\hat{R}(\alpha;\mathcal{X}) \le R^\star$. Inserting this constraint yields a minimal admissible value for the loss bound,

$$R^\star > \min_{\alpha \in \Lambda} \left\{ R(\alpha) + \epsilon \sqrt{V\{h(x;\alpha)\}} \right\}. \qquad (8)$$

Violations of inequality (8) deprive the inference algorithm of possible hypotheses which cannot be ruled out on the basis of the samples and which, therefore, should be considered as candidate structures in the data. In general, $R^\star$ is larger than the minimal expected risk. The lower bound on $R^\star$ protects the unsupervised learning algorithm against overly restricted inference and serves as a built-in self-consistency principle. The two bounds (7) and (8) are coupled by the considered hypothesis class $H_\epsilon$. A decrease of $\epsilon$ with fixed confidence $\delta$ increases the number of required samples. On the other hand, an increasing number of samples allows us to reduce $\epsilon$ which, as a consequence, reduces the number of candidate structures in $H_\epsilon$.
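The lower bound (8) can be probed numerically as well. The sketch below scans a grid of candidate two-prototype structures for one-dimensional data and reports the smallest admissible risk bound according to (8), with plug-in estimates replacing $R(\alpha)$ and $V\{h(x;\alpha)\}$; the data source, the grid, and $\epsilon$ are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative 1-D source: mixture of two Gaussians (an assumption of this sketch).
x = np.concatenate([rng.normal(-1.0, 0.3, 5000), rng.normal(1.0, 0.3, 5000)])

def loss(x, y1, y2):
    """Quadratic loss of assigning each point to the closer of the two prototypes y1, y2."""
    return np.minimum((x - y1) ** 2, (x - y2) ** 2)

eps = 0.1
candidates = np.linspace(-2.0, 2.0, 41)
best = np.inf
for y1 in candidates:                       # grid over candidate structures alpha = (y1, y2)
    for y2 in candidates:
        h_vals = loss(x, y1, y2)
        # plug-in estimates of R(alpha) and V{h(x; alpha)}
        best = min(best, h_vals.mean() + eps * np.sqrt(h_vals.var()))

print(f"minimal admissible risk bound:  R* > {best:.3f}   (plug-in version of Eq. (8))")
```

In this toy setting the minimum is attained near the prototype pair close to the two mixture modes, i.e., the bound (8) anchors $R^\star$ at the best structure plus an $\epsilon$-dependent safety margin.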

3 A STATISTICAL LEARNING THEORY OF CENTRAL CLUSTERING

As an important example of the general framework for unsupervised learning, a statistical theory is developed in this section for central clustering, i.e., for the problem of partitioning a data set $\mathcal{X}$ into $k$ different regions. A partition is represented by a mapping $m : \mathbb{R}^d \to \{1,\dots,k\}$ which assigns a partition label $\nu \in \{1,\dots,k\}$ to each data vector $x_i$, i.e., $m(x_i) = \nu$. The quality of a partition is measured by the sum of squared distances between data points which are assigned to the same partition, i.e., the empirical cost function $\hat{R}(m;\mathcal{X})$ is given by

$$\hat{R}(m;\mathcal{X}) = \frac{1}{l} \sum_{i=1}^{l} \frac{\sum_{j=1}^{l} \delta_{m(x_i),m(x_j)} \|x_i - x_j\|^2}{\sum_{j=1}^{l} \delta_{m(x_i),m(x_j)}}. \qquad (9)$$

$\delta_{i,j}$ denotes the Kronecker delta, i.e., $\delta_{i,j} = 1$ if $i = j$ and $\delta_{i,j} = 0$ otherwise. This criterion favors compact clusters.
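As a sanity check on Eq. (9), the following sketch evaluates the pairwise within-cluster cost directly and verifies the standard identity that it equals twice the centroid-based cost $\frac{2}{l}\sum_i \|x_i - y_{m(x_i)}\|^2$, which anticipates the centroid form of the loss in Eq. (10) below. Data and partition are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

def pairwise_cost(X, m):
    """Empirical risk of Eq. (9): for every x_i, the mean squared distance to all
    points carrying the same label m(x_i), averaged over i."""
    total = 0.0
    for i in range(len(X)):
        same = (m == m[i])
        total += ((X[same] - X[i]) ** 2).sum(axis=1).mean()
    return total / len(X)

def centroid_cost(X, m):
    """Twice the usual k-means cost, 2/l * sum_i ||x_i - y_{m(x_i)}||^2."""
    cost = 0.0
    for nu in np.unique(m):
        cluster = X[m == nu]
        cost += ((cluster - cluster.mean(axis=0)) ** 2).sum()
    return 2.0 * cost / len(X)

X = rng.standard_normal((200, 2))
m = rng.integers(0, 3, size=200)                  # an arbitrary (not optimized) 3-partition
print(pairwise_cost(X, m), centroid_cost(X, m))   # both numbers agree
```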

Figure 1: A particular loss function for two means located at $x_{1,2} = \pm 0.125$ is shown as the bold solid line, which is bounded by the two parabolas depicted by dotted lines. The respective assignment mapping jumps between the values 1 and 2.

The expected risk $R(m)$ and the loss function $h(x;m)$ of $k$-means clustering are defined by (2) and

$$h(x;m) = \frac{\int_\Omega \|x - z\|^2 \,\delta_{m(x),m(z)}\, d\mu(z)}{\int_\Omega \delta_{m(x),m(z)}\, d\mu(z)} \;\approx\; \|x - y_{m(x)}\|^2 \qquad (10)$$

with the $k$ means

$$y_\nu = \frac{\int_\Omega z\, \delta_{\nu,m(z)}\, d\mu(z)}{\int_\Omega \delta_{\nu,m(z)}\, d\mu(z)}, \qquad \nu \in \{1,\dots,k\}. \qquad (11)$$

A loss function for two means is depicted in Fig. 1. The risk bounded hypothesis class of $k$-means clustering is defined as $H_{R^\star} = \{ h(x;m) : m(x) \text{ is a } k\text{-partition} \wedge R(m) \le R^\star \}$. The complexity of the hypothesis class is determined by the $\mu$-measurable $k$-partitions of $\Omega$. $H$ cannot be covered by a finite $\epsilon$-cover if $\mu$-measurability is the only restriction on $m$. We, therefore, have to introduce an appropriate regularization of the function set $H$. A natural choice is to quantize the data space $\Omega = \bigcup_{i=1}^{\tilde{n}} \Omega_i$ into $\tilde{n}$ cells $\Omega_i$ and to require constant assignments of data to clusters within a cell, i.e., $\forall x \in \Omega_i: m(x) = m_i$, $1 \le i \le \tilde{n}$. Other learning theoretical approaches to $k$-means clustering assume that data are always assigned to the closest mean [4, 5, 6]. A clustering algorithm without this constraint on data assignments can be found in [7]; its algorithm III, with hard assignments selected according to predefined probabilities, comes close in spirit to a randomized selection of a loss function from the hypothesis class $H_{R^\star}$. How should the cells $\Omega_i$ be defined? A natural choice is to quantize the data space in such a way that two loss functions which are identical everywhere except in cell $\Omega_i$ have an $L_1(\mu)$ distance of $\epsilon$ if the data are assigned to the two closest clusters, respectively, i.e.,

$$\int_{\Omega_i} \left| \|x - y_i^{\rm 2nd}\|^2 - \|x - y_i^{\rm 1st}\|^2 \right| d\mu(x) \;\doteq\; \epsilon\,\sigma_{\min}. \qquad (12)$$

The centroids $y_i^{\rm 1st}, y_i^{\rm 2nd}$ are the closest and the second closest centroid of cell $\Omega_i$, defined by $\int_{\Omega_i} \|x - y_i^{\rm 1st}\|^2 d\mu(x) \le \int_{\Omega_i} \|x - y_i^{\rm 2nd}\|^2 d\mu(x) \le \int_{\Omega_i} \|x - y_\nu\|^2 d\mu(x)$ for all $y_\nu \ne y_i^{\rm 1st}$. Summation of both sides of (12) yields the number of cells,

$$\tilde{n} = \frac{2}{\epsilon\,\sigma_{\min}} \sum_{i=1}^{\tilde{n}} \int_{\Omega_i} \left| \left( x - \frac{y_i^{\rm 1st} + y_i^{\rm 2nd}}{2} \right)^{\!T} \left( y_i^{\rm 1st} - y_i^{\rm 2nd} \right) \right| d\mu(x). \qquad (13)$$

Equation (12) quantizes the space into cells analogous to the data partitioning of $K$-nearest-neighbor classifiers. Low data density and small loss lead to large cells, whereas parts of the data space with high density are finely partitioned. The hypothesis class $H_\epsilon$ which results from the space quantization is finite and can be considered as a regularized version of $H$. It is defined as $H_\epsilon = \{ h(x;m) : R(m) \le R^\star \wedge \forall i\ \forall x \in \Omega_i: m(x) = m_i \in \{1,\dots,k\} \}$. Other schemes to regularize the hypothesis class $H$ of all $\mu$-measurable $k$-partitions of the data space are imaginable, e.g., for distributions with finite support the data space could be uniformly partitioned into cells of size $\Delta$, yielding a hypothesis class $H_\Delta$. Loss functions with different assignments in cells with low data density and short distances to the next mean have smaller $L_1(\mu)$ distances than loss functions which differ in cells with high data density and long distances to the next mean. The hypothesis class $H_\epsilon$ can be considered as an $\epsilon$-net of the class $H$. In this sense $H_\epsilon$ is the prototypical, coarsened hypothesis class for a variety of different regularization schemes. The probability of large deviations (6) depends on the cardinality of the hypothesis class $|H_\epsilon|$. This cardinality can be calculated by techniques borrowed from statistical physics. For a large number of cells the following stationary equations are derived [1]:

$$y_\nu = \frac{\sum_{i=1}^{\tilde{n}} P_{i\nu} \int_{\Omega_i} x\, d\mu(x)}{\sum_{i=1}^{\tilde{n}} P_{i\nu} \int_{\Omega_i} d\mu(x)} \quad\text{with}\quad P_{i\nu} = \frac{\exp\!\left( -\hat{x}\,\tilde{n} \int_{\Omega_i} \|x - y_\nu\|^2\, d\mu(x) \right)}{\sum_{\nu'=1}^{k} \exp\!\left( -\hat{x}\,\tilde{n} \int_{\Omega_i} \|x - y_{\nu'}\|^2\, d\mu(x) \right)}, \qquad (14)$$

$$R^\star = \sum_{i=1}^{\tilde{n}} \sum_{\nu=1}^{k} \int_{\Omega_i} P_{i\nu} \|x - y_\nu\|^2\, d\mu(x). \qquad (15)$$

$P_{i\nu} = P\{m_i = \nu\}$ denotes the probability that data in cell $\Omega_i$ are assigned to cluster $\nu$ when a loss function is randomly selected from $H_\epsilon$. Equation (14) is a probabilistic version of the $k$-means definition (11): the hard assignment of a datum $x$ to cluster $\nu$, encoded by $\delta_{m(x),\nu}$, is replaced by the probabilistic assignment $P_{i\nu}$. It is quite instructive to compare the stationarity equations (14) with the reestimation equations for deterministic annealing of central clustering given a data set $\mathcal{X}$ [8, 9]:

$$\tilde{y}_\nu = \frac{\sum_{i=1}^{l} \tilde{P}_{i\nu}\, x_i}{\sum_{i=1}^{l} \tilde{P}_{i\nu}} \quad\text{with}\quad \tilde{P}_{i\nu} = \frac{\exp\!\left( -\|x_i - \tilde{y}_\nu\|^2 / (l\,T^{\rm comp}) \right)}{\sum_{\nu'=1}^{k} \exp\!\left( -\|x_i - \tilde{y}_{\nu'}\|^2 / (l\,T^{\rm comp}) \right)}. \qquad (16)$$
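The reestimation equations (16) translate into a compact deterministic-annealing loop. The sketch below is an illustrative implementation; the initialization, the geometric annealing schedule, the small symmetry-breaking jitter, and the number of inner iterations are ad-hoc assumptions, not prescriptions from the paper.

```python
import numpy as np

def deterministic_annealing(X, k, T_schedule, n_iter=50, seed=0):
    """Alternate the soft assignments P~_{i nu} and the means y~_nu of Eq. (16)
    while the computational temperature T^comp is lowered along T_schedule."""
    rng = np.random.default_rng(seed)
    l, d = X.shape
    y = np.tile(X.mean(axis=0), (k, 1))                       # start at the center of mass
    for T in T_schedule:
        y += 1e-6 * rng.standard_normal((k, d))               # jitter so clusters can split
        for _ in range(n_iter):
            d2 = ((X[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # (l, k)
            logits = -d2 / (l * T)                                 # exponent of Eq. (16)
            logits -= logits.max(axis=1, keepdims=True)            # numerical stabilization
            P = np.exp(logits)
            P /= P.sum(axis=1, keepdims=True)                      # soft assignments
            y = (P.T @ X) / P.sum(axis=0)[:, None]                 # reestimated means
    return y, P

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    X = np.concatenate([rng.normal(-1, 0.2, (100, 1)), rng.normal(1, 0.2, (100, 1))])
    y, _ = deterministic_annealing(X, k=2, T_schedule=np.geomspace(1.0, 1e-3, 20))
    print(np.sort(y.ravel()))      # approaches the two mixture modes at -1 and +1
```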

The auxiliary variable $\hat{x} = 1/(l\,T^{\rm comp})$ corresponds to the inverse computational temperature [8, 9]. $\tilde{P}_{i\nu}$ and $\tilde{y}_\nu$ are the empirical estimators of the assignment probabilities and the means on the basis of the data set $\mathcal{X}$. Deterministic annealing approaches to probabilistic clustering utilize the inverse temperature as a parameter to control the smoothness of the effective cost function for clustering, i.e., optimization is pursued by gradually reducing the temperature. The relation between the loss bound $R^\star$ and the inverse temperature $\hat{x} = 1/(l\,T^{\rm comp})$ is established by (15), i.e., $\hat{x}$ can be interpreted as a Lagrange variable which enforces the risk bound. Solutions of (15) are only permitted for $R^\star \le \int_\Omega \|x\|^2\, d\mu(x)$, the risk of assigning all data with the same probability $1/k$ to clusters which are all located at the center of mass of the data distribution ($y_1 = \cdots = y_k = 0$). Equality in this relation defines a critical point and is related to the phenomenon of phase transitions in clustering [8]. Additional phase transitions occur at lower values of $1/\hat{x}$ and indicate that the data set can be partitioned into finer and finer clusters. Figure 2 shows the positions of six cluster means for a one-dimensional partitioning problem with a mixture distribution composed of five Gaussians centered at $x_0 \in \{-2, -1.5, -1, 0, 2\}$. The temperature as a resolution parameter controls the complexity of the partition, e.g., at $T = 0.8$ the clusters $y_2,\dots,y_6$ are all located at $x = -1.1$. For $T < 0.25$ all six clusters are separated.
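For the measure-based equations (14) and (15), the following sketch iterates the fixed-point equations on a discretized version of the five-Gaussian mixture of Fig. 2 and evaluates the associated risk bound from (15). The grid discretization (which neglects the within-cell variance), the values of $\hat{x}$, the jitter, and the iteration count are assumptions of this sketch; it only reproduces the qualitative behaviour that more means separate and $R^\star$ drops as $1/\hat{x}$ is lowered.

```python
import numpy as np

rng = np.random.default_rng(5)

# Discretization of the 1-D five-Gaussian mixture of Fig. 2 into cells with centers c_i
# and masses p_i (an approximation of the measure mu; within-cell variance is neglected).
modes = np.array([-2.0, -1.5, -1.0, 0.0, 2.0])
c = np.linspace(-3.5, 3.5, 400)
density = np.exp(-(c[:, None] - modes) ** 2 / (2 * 0.05)).sum(axis=1)
p = density / density.sum()
n_cells = len(c)

def stationary_solution(x_hat, k=6, n_iter=500):
    """Iterate the fixed-point equations (14) at fixed x_hat; return the sorted means
    and the corresponding risk bound R* of Eq. (15)."""
    y = c @ p + 1e-3 * rng.standard_normal(k)                 # start near the center of mass
    for _ in range(n_iter):
        cell_cost = p[:, None] * (c[:, None] - y[None, :]) ** 2   # ~ integral over cell i
        logits = -x_hat * n_cells * cell_cost
        logits -= logits.max(axis=1, keepdims=True)
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)                          # P_{i nu}
        y = (P * (p * c)[:, None]).sum(axis=0) / (P * p[:, None]).sum(axis=0)
        y += 1e-6 * rng.standard_normal(k)                         # jitter to allow splits
    return np.sort(y), (P * cell_cost).sum()

for x_hat in (0.2, 1.0, 5.0, 50.0):
    y, R_star = stationary_solution(x_hat)
    print(f"1/x_hat = {1 / x_hat:6.2f}   R* = {R_star:6.3f}   means = {np.round(y, 2)}")
```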


Figure 2: The positions of six means $y_1,\dots,y_6$ are shown as a function of the computational temperature $T^{\rm comp}$. The data distribution $\mu$, depicted below, is a mixture of five Gaussians with variance 0.05 located at $x_0 \in \{-2, -1.5, -1, 0, 2\}$.

Neglecting exponentially small terms in $\hat{x}$, the sample complexity $l_0$ in (7) yields a lower bound for the computational temperature $1/\hat{x}$,

$$l_0\, T^{\rm comp} = \frac{1}{\hat{x}} \;\ge\; \frac{2\tilde{n}\,\sigma_{\min}}{l_0\,\epsilon^2 - 2\log\frac{2}{\delta}}. \qquad (17)$$

The correction $\epsilon\tau/\sigma_{\min}$ has been neglected. Inequality (17) ensures that the hypothesis class $H_\epsilon$ is not too constrained and that a minimal variety of possible loss functions is considered in the inference process. If we select a too small accuracy value $\epsilon$ for a given size $l$ of the sample set, then the computational temperature diverges and balances the overfitting effects. We would like to emphasize that inequality (17) defines a parameter selection criterion for data partitionings, i.e., it provides a bound on the maximal number of clusters which can be estimated reliably. Lowering the temperature in simulated or deterministic annealing yields more and more clusters by a cascade of phase transitions (see Fig. 2). The ERA induction principle, however, stops this refinement before the extreme of as many clusters as data points is reached, since splitting the data into too small groups is not warranted by the uniform convergence requirement.
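Criterion (17) is easy to evaluate once $2\tilde{n}\sigma_{\min}$, $\epsilon$ and $\delta$ are fixed. The sketch below reuses the value $2\tilde{n}\sigma_{\min} \approx 18.4$ estimated for the bimodal source of the experiments reported below; the choices $\epsilon = 0.2$ and $\delta = 0.05$ are illustrative assumptions. A non-positive denominator signals that the sample is too small for the requested accuracy, in which case no finite stopping temperature exists.

```python
import math

def min_temperature(l0, eps, delta, two_n_sigma_min=18.4):
    """Lower bound on T^comp from Eq. (17); None means the denominator is not positive,
    i.e. the sample is too small for the requested (eps, delta)."""
    denom = l0 * eps**2 - 2.0 * math.log(2.0 / delta)
    if denom <= 0:
        return None
    return two_n_sigma_min / (l0 * denom)

for l0 in (100, 200, 400, 1000):
    T_min = min_temperature(l0, eps=0.2, delta=0.05)
    print(f"l0 = {l0:5d}   stop annealing at T^comp >= {T_min}")
```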

Figure 3: Overfitting effects in four typical learning sequences: $\epsilon(m) = |R(m) - \hat{R}(m;\mathcal{X})| / \sqrt{V\{h(x;m)\}}$ (left) and the expected risk $R(m)$ of the hard partitioning (right), plotted vs. $1/T^{\rm comp}$ for training sets of size $l_0 = 100, 200, 400, 1000$; phase transitions are marked.

Too large fluctuations might render the cluster estimates $y_\nu$ unreliable for small sample sets. The computational temperature acts as a regularization parameter which prevents cluster refinements with overfitting behavior. Figure 3 demonstrates empirically the predicted small-sample-size overfitting effect for the central clustering problem. The data source for all experiments is a bimodal mixture of two 50-dimensional Gaussians with unit variance and means $\pm\frac{1}{\sqrt{50}}(1,\dots,1)^T$, chosen according to the analysis in [10]. The expected risk $R(m)$ and its variance $V\{h(x;m)\}$ are approximated by an empirical test set of size 10000. The curves in Figure 3 plot the quantity $\epsilon(m) = |R(m) - \hat{R}(m;\mathcal{X})| / \sqrt{V\{h(x;m)\}}$ for a typical assignment function $m(x)$ which has been calculated by deterministic annealing techniques [8]. These curves indicate that decreasing the computational temperature might yield overfitting, i.e., an increase in the $\epsilon$-deviation between the expected risk and the empirical risk. In addition, these curves depict cases ($l_0 = 100, 400$) where the criterion of large $\epsilon$-deviations discards cluster splits for too small sample sizes, since the two cluster centers are then uncorrelated with the means of the source modes [10]. Note that the hard partitioning is defined by applying the nearest-neighbor rule to the set of prototypes given by equation (16). The inferred hypothesis is therefore a mixture of hypotheses, each weighted by its plausibility; put the other way around, the result of the deterministic annealing algorithm can be interpreted as an $\epsilon$-typical hypothesis.

Additional experiments were performed to verify the presented theoretical bound for overfitting phenomena in unsupervised learning, i.e., central clustering, by Monte Carlo simulations. The evaluation of the inferred probabilistic bound (17) requires extensive simulations of learning on finite sample sizes. Such Monte Carlo experiments consist of the following steps (a code sketch follows below):

1. Fix $\epsilon$ and the size of the training set $l_0$.

2. Carry out a number $L$ of experiments with different sample sets, each of size $l_0$.

3. For each inverse temperature $\beta$ (i.e., $T^{\rm comp} = 1/\beta$), calculate the fraction of experiments with an $\epsilon$-deviation larger than $\epsilon$.

This fraction is an estimate of $\delta$ and has to be inserted into the $(\epsilon, \delta, l_0)$-bound on the critical temperature (17). Note that the unknown values $\tilde{n}$ and $\sigma_{\min}$ can always be estimated via Monte Carlo sampling or by analytical means for tractable data distributions.
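A compact sketch of this Monte Carlo protocol is given below. It is illustrative only: the one-dimensional two-Gaussian source, the plain soft k-means updates standing in for the full deterministic-annealing runs, the assumed $\log|H_\epsilon|$ and $\tau/\sigma_{\min}$, and all parameter values are assumptions of this sketch rather than the experimental setup behind Figs. 3 and 4.

```python
import numpy as np

rng = np.random.default_rng(4)

def soft_kmeans(X, k, beta, n_iter=100):
    """Fixed-point iterations of the soft assignment / mean updates (cf. Eq. (16))
    at a fixed inverse temperature beta."""
    y = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d2 = (X[:, None] - y[None, :]) ** 2
        P = np.exp(-beta * (d2 - d2.min(axis=1, keepdims=True)))
        P /= P.sum(axis=1, keepdims=True)
        y = (P * X[:, None]).sum(axis=0) / P.sum(axis=0)
    return y

def hard_risk(X, y):
    """Quadratic loss under nearest-prototype (hard) assignments."""
    return ((X[:, None] - y[None, :]) ** 2).min(axis=1)

def draw(n):     # illustrative 1-D source: two Gaussians at +/- 1
    return np.where(rng.random(n) < 0.5, rng.normal(-1, 0.5, n), rng.normal(1, 0.5, n))

X_test = draw(100_000)                     # large test set approximating mu
eps, beta, l0, L = 0.2, 4.4, 1000, 200
exceed = 0
for _ in range(L):                         # steps 1-3 of the Monte Carlo protocol
    X = draw(l0)
    y = soft_kmeans(X, k=2, beta=beta)
    h_train, h_test = hard_risk(X, y), hard_risk(X_test, y)
    exceed += abs(h_test.mean() - h_train.mean()) / h_test.std() > eps
print(f"empirical   delta ~= {exceed / L:.3f}")

log_card, tau_over_sigma = 8.0, 4.3        # assumed values for the bound of Eq. (6)
delta_theo = min(1.0, 2 * np.exp(log_card - l0 * eps**2 / (2 * (1 + eps * tau_over_sigma))))
print(f"theoretical delta <= {delta_theo:.3f}")
```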

Figure 4: Empirical ("emp") and theoretical ("theo") $\delta$ vs. $\epsilon$ dependencies for different sample sizes ($l_0 = 200, 300, 400$ left; $l_0 = 600, 700, 800, 1000$ right) and a fixed temperature $\beta = 4.4$ are compared on the basis of 1000 error curves.

For the generative bimodal distribution utilized in our experiments we estimate $2\tilde{n}\sigma_{\min} \approx 18.4$ and $\tau \approx 4.3$. Figure 4 indicates that the experiments at least qualitatively support the derived relationship between the overfitting phenomenon and the computational temperature. Note that the empirical estimates of $\delta$ are biased towards larger values of $|R(m) - \hat{R}(m;\mathcal{X})| / \sqrt{V\{h(x;m)\}}$, since we use the mean field solutions in the DA experiments to define the probabilistic partition of the data space.

4 DISCUSSION

The theory of Empirical Risk Approximation for unsupervised learning problems extends the highly successful statistical learning theory of supervised learning to a large class of unsupervised learning problems with their respective loss functions. The two conditions, that the empirical risk has to converge uniformly towards the expected risk and that all loss functions within an $\epsilon$ range of the global risk minimum have to be considered in the inference process, limit the complexity of the hypothesis class for a given number of samples. The maximum entropy method, which has been widely employed in the deterministic annealing procedure for optimization problems, is substantiated by our analysis and is identified as an optimal procedure for large scale problems. Many unsupervised learning problems are formulated as combinatorial or continuous optimization problems. The presented framework for unsupervised learning also defines a theory of robust algorithm design for noisy combinatorial optimization problems, e.g., minimum flow problems with noisy cost values, process optimization with fluctuating capacities, the traveling salesman problem with fluctuating travel times between cities, or scheduling problems with stochastic demands on the process. Robust solutions for all these noisy optimization problems require stability against fluctuations in the problem definition, i.e., solutions have to generalize from one problem instance to an equally likely second instance. Neglecting the noise influence and searching for the global optimum yields inferior solutions with higher expected risk than necessary. Furthermore, these inferior solutions are often harder to find, for computational complexity reasons, than approximations with minimal expected risk. Structural Risk Minimization (SRM) was introduced by Vapnik as an induction principle for circumstances in which the size of the sample set increases. Hypothesis classes are defined as nested structures of sets of functions, and the function with minimal empirical risk is selected from the appropriate subset according to the number of samples.

Empirical Risk Approximation differs from the SRM principle in two respects: (i) the ERA algorithm samples from a subset in the sequence rather than selecting the function with minimal empirical risk; (ii) the inference algorithm initially samples from a large subset representing a crude approximation and proceeds to smaller and smaller subsets in the sequence when additional data permit a refined estimate of the parameters. The most important consequence from a pattern recognition point of view is the possibility to derive a parameter selection criterion from the Empirical Risk Approximation principle. Solutions with too many parameters clearly overfit the problem parameters and do not generalize. The condition that the hypothesis class should contain at least all $\epsilon$-typical loss functions close to the global minimum forces us to stop the stochastic search at the lower bound of the computational temperature. The ERA-driven induction process returns a set of indistinguishable hypotheses. The similarity between temperature-dependent regularization and structural risk minimization as a scheme for complexity control in supervised learning is apparent and hints at additional, yet still unexplored, parallels between supervised and unsupervised learning.

Acknowledgments: It is a pleasure to thank N. Tishby and J. Puzicha for illuminating discussions. This work has been supported by the German Research Foundation (DFG, #BU 914/3-1) and by the German Israel Foundation for Science and Research Development (GIF, #1-0403-001.06/95).

REFERENCES

[1] J. M. Buhmann. Empirical risk approximation. Technical Report IAI-TR 98-3, Institut für Informatik III, Universität Bonn, 1998.

[2] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, Berlin, Heidelberg, 1995.

[3] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer-Verlag, New York, Berlin, Heidelberg, 1996.

[4] D. Pollard. Quantization and the method of k-means. IEEE Transactions on Information Theory, 28(2):199-205, 1982.

[5] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York, Berlin, Heidelberg, 1996.

[6] T. Linder, G. Lugosi, and K. Zeger. Empirical quantizer design in the presence of source noise or channel noise. IEEE Transactions on Information Theory, 43(2):612-623, March 1997.

[7] M. Kearns, Y. Mansour, and A. Y. Ng. An information-theoretic analysis of hard and soft assignment methods for clustering. In Proceedings of Uncertainty in Artificial Intelligence. AAAI, 1997.

[8] K. Rose, E. Gurewitz, and G. Fox. Statistical mechanics and phase transitions in clustering. Physical Review Letters, 65(8):945-948, 1990.

[9] J. M. Buhmann and H. Kühnel. Vector quantization with complexity costs. IEEE Transactions on Information Theory, 39(4):1133-1145, July 1993.

[10] N. Barkai and H. Sompolinsky. Statistical mechanics of the maximum-likelihood density estimation. Physical Review A, 50:1766-1769, September 1994.