Empirical Risk Approximation: An Induction Principle for Unsupervised Learning

Joachim M. Buhmann
Rheinische Friedrich-Wilhelms-Universität Bonn
Institut für Informatik III, Römerstraße 164, D-53117 Bonn, Germany
email: [email protected], WWW: http://www-dbv.cs.uni-bonn.de

April 3, 1998

Abstract

Unsupervised learning algorithms are designed to extract structure from data without reference to explicit teacher information. The quality of the learned structure is determined by a cost function which guides the learning process. This paper proposes Empirical Risk Approximation as a new induction principle for unsupervised learning. The complexity of the unsupervised learning models is automatically controlled by the two conditions for learning: (i) the empirical risk of learning should uniformly converge towards the expected risk; (ii) the hypothesis class should retain a minimal variety for consistent inference. The maximum entropy principle with deterministic annealing as an efficient search strategy arises from the Empirical Risk Approximation principle as the optimal inference strategy for large learning problems. Parameter selection of learnable data structures is demonstrated for the case of k-means clustering.

1 What is unsupervised learning?

Learning algorithms are designed with the goal in mind that they should extract structure from data. Two classes of algorithms have been widely discussed in the literature: supervised and unsupervised learning. The distinction between the two classes relates to supervision or teacher information, which is either available to the learning algorithm or missing in the learning process. This paper presents a theory of unsupervised learning which has been developed in analogy to the highly successful statistical learning theory of classification and regression [Vapnik, 1982, Vapnik, 1995]. In supervised learning of classification boundaries or of regression functions, the learning algorithm is provided with example points and selects the best candidate function from a set of functions, called the hypothesis class. Statistical learning theory, developed by Vapnik and Chervonenkis in a series of seminal papers (see [Vapnik, 1982, Vapnik, 1995]), measures the amount of information in a data set which can be used to determine the parameters of the classification or regression models. Computational learning theory [Valiant, 1984] addresses computational problems of supervised learning in addition to the statistical constraints.

J. M. Buhmann: Empirical Risk Approximation; Technical Report IAI-TR98-3


In this paper I propose a theoretical framework for unsupervised learning based on optimization of a quality functional for structures in data. The learning algorithm extracts an underlying structure from a sample data set under the guidance of a quality measure denoted as learning costs. The extracted structure of the data is encoded by a loss function, and it is assumed to produce a learning risk below a predefined risk threshold. This induction principle is referred to as Empirical Risk Approximation (ERA) and is summarized by the following two inference steps:

1. Define a hypothesis class containing loss functions which evaluate candidate structures of the data and which measure their quality. Control the complexity of the hypothesis class by an upper bound on the learning costs¹.

2. Select an arbitrary loss function from the smallest subset which still guarantees consistent learning (in the sense of Statistical Learning Theory).

The reader should note that the learning algorithm has to return a structure with costs bounded by a preselected cost threshold, but it is not required to return the structure with minimal empirical risk as in Vapnik's "Empirical Risk Minimization" induction principle for classification and regression [Vapnik, 1995]. All structures in the data with risk smaller than the selected risk bound are considered to be equivalent in the approximation process. Empirical Risk Approximation does not introduce any additional criterion to discriminate between data structures apart from the risk threshold. Conceptually, the ERA inference principle resembles the "Structural Risk Minimization" induction principle [Vapnik, 1995] as far as the search through a nested sequence of subsets is concerned, although the learning algorithm has to sample a loss function from a hypothesis class rather than find the loss function with minimal risk in that class of functions.

How can we select a loss function from a hypothesis class with desired approximation quality or, equivalently, find a structure with bounded learning costs? The nested sequence of hypothesis classes suggests a tracking strategy of solutions which are elements of smaller and smaller hypothesis classes in that sequence. The sequential sampling can be implemented by solution tracking, i.e., admissible solutions are incrementally improved by gradient descent if the process samples from a subset with a smaller risk bound than in the previous sampling step. The induction principle does not require that the algorithm sample randomly from the risk-bounded hypothesis class, which could pose complicated implementation problems.
In the following I am not referring to any particular learning algorithm, although candidate procedures are stochastic search techniques or continuation methods like simulated or deterministic annealing, respectively. How general can a theory of unsupervised learning be if it is based on optimization? Any learning algorithm which returns a structure validated by data has to be evaluated by a comparison function for alternative structures in data. Whether explicitly formulated or not, unsupervised learning conceptually relies on such a comparison of structures. Consequently, various cases of unsupervised learning algorithms are discussed in the literature with emphasis on the optimization aspect, i.e., data clustering or vector quantization [Rose et al., 1990, Buhmann and Kuhnel, 1993, Gersho and Gray, 1992], self-organizing maps [Kohonen, 1984, Ritter et al., 1992], principal or independent component analysis [Duda and Hart, 1973, Bell and Sejnowski, 1995], principal curves [Hastie and Stuetzle, 1989], and projection pursuit

¹This bound effectively defines a nested sequence of hypothesis classes.


[Huber, 1985], structure detection in images [Becker and Hinton, 1992], and algorithms for relational data such as multidimensional scaling [Cox and Cox, 1994, Klock and Buhmann, 1997], pairwise clustering [Hofmann and Buhmann, 1997], and histogram-based clustering methods [Pereira et al., 1993, Hofmann and Puzicha, 1998].

An alternative view of unsupervised learning is prevalent in Bayesian statistics [Ripley, 1996]. The Bayesian school in neural computing defines unsupervised learning as density estimation of a data source [Hinton and Ghahramani, 1997]. According to this Bayesian view, all interesting questions related to the structure of the data can be answered with a robust probabilistic model of the data. The ultimate success of unsupervised learning is achieved if the data can be generated from the learned model, i.e., when the learning algorithm has inferred a generative model of the data [Hinton and Ghahramani, 1997]. Mixture models [McLachlan and Basford, 1988], e.g., Gaussian mixtures, are prominent examples of this concept of unsupervised learning. Helmholtz machines [Dayan et al., 1995], Generative Topographic Maps [Bishop et al., 1998] and various other neural network models [Poggio and Girosi, 1990, Jordan and Jacobs, 1994, Bishop, 1995] also belong to this class of generative models.

The general framework of Empirical Risk Approximation and its mathematical formulation is developed in Sec. 2. The sample complexity with a precision bound for learning and the parameter selection criterion are discussed in Sec. 2.3. The theory of unsupervised learning is applied to the case of central clustering in Sec. 3, where a cluster selection criterion is derived to limit the number of learnable clusters.

2 Mathematical Framework for Unsupervised Learning 



The data samples $\mathcal{X} = \{x_i \in \Omega : 1 \le i \le l\}$, $\Omega \subseteq \mathbb{R}^d$, which have to be analysed by the unsupervised learning algorithm are assumed to be elements of a suitable d-dimensional Euclidean space. The data are distributed according to a measure $\mu$ which is assumed to be known for the analysis.

2.1 Empirical Risk, Expected Risk and Hypothesis Class

A mathematically precise statement of the Empirical Risk Approximation induction principle requires several definitions which formalize the notion of learning costs for structure in the data. The quality of structures extracted from the data set $\mathcal{X}$ is evaluated by a learning cost function

$$\hat{R}(\alpha; \mathcal{X}) = \frac{1}{l} \sum_{i=1}^{l} h(x_i; \alpha). \qquad (1)$$

$\hat{R}(\alpha; \mathcal{X})$ denotes the empirical risk of learning for an i.i.d. sample set $\mathcal{X}$. The function $h(x; \alpha)$ is known as the loss function in statistics. It measures the learning costs for processing a generic datum $x$ and often corresponds to the negative log-likelihood of a stochastic model. Each value $\alpha \in \Lambda$ parametrizes an individual loss function, with $\Lambda$ denoting the parameter set. The parameter $\alpha$ characterizes the different structures of the data set which are hypothetically considered candidate structures in the learning process and which have to be validated. Statistical learning theory distinguishes between different classes of loss functions, i.e., 0-1 functions in classification problems, bounded functions, and unbounded non-negative functions in regression problems. This paper is concerned with the class of unbounded non-negative functions since


we are particularly interested in polynomially increasing loss functions ($O(\|x\|^p)$) as they occur in vector quantization, principal component analysis ($O(\|x\|^2)$) and multidimensional scaling ($O(\|x\|^4)$). Density estimation as an unsupervised learning problem fits into this framework by choosing the negative log-likelihood as a loss function. It is important to note that the quality measure $\hat{R}(\alpha; \mathcal{X})$ depends on the i.i.d. data set $\mathcal{X}$ and, therefore, is itself a random variable. The relevant quality measure for unsupervised learning, however, is not the easily evaluable empirical risk $\hat{R}(\alpha; \mathcal{X})$ but its expectation value, known as the expected risk of learning

$$R(\alpha) = \int_{\Omega} h(x; \alpha)\, d\mu(x). \qquad (2)$$

While minima of $\hat{R}(\alpha; \mathcal{X})$ or solutions with bounded empirical risk are influenced by fluctuations in the samples, it is the expected risk $R(\alpha)$ which completely assesses the quality of learning results in a sound probabilistic fashion. The distribution $\mu$ is assumed to be known in the following analysis and is assumed to decay sufficiently fast such that all $r$th moments ($r > 2$) of the loss function $h(x; \alpha)$ are bounded by $E\{|h(x;\alpha) - R(\alpha)|^r\} \le r!\, \tau^{r-2}\, V\{h(x;\alpha)\}$, $\forall \alpha \in \Lambda$. $E\{\cdot\}$ and $V\{\cdot\}$ denote the expectation and variance of a random variable, respectively, and $\tau$ is a distribution-dependent constant. This moment constraint on the distribution $\mu$ holds for all distributions with exponentially fast decaying tails, which includes all distributions with finite support. Empirical Risk Approximation is postulated in this paper as an induction principle which requires the learning algorithm to sample from the smallest consistently learnable subset of the hypothesis class. In the following, the hypothesis class $H$ contains all loss functions $h(x;\alpha)$, $\alpha \in \Lambda$, and the subsets of risk-bounded loss functions $H_R$ are defined as

$$H_R = \{h(x;\alpha) : \alpha \in \Lambda \wedge R(\alpha) \le R\}. \qquad (3)$$

The subsets $H_R$ are obviously nested, since $H_{R_1} \subseteq H_{R_2} \subseteq \cdots \subseteq H$ for $R_1 \le R_2 \le \cdots \le \infty$ and $H = \lim_{R \to \infty} H_R$. $R$ induces a structure on the hypothesis class in the sense of Vapnik's











"Structural Risk" and essentially controls the complexity of the hypothesis class. Distances between different functions of $H$ are measured according to the $L_1(\mu)$ norm². The complexity of the hypothesis class $H_R$ can vary significantly from one unsupervised learning problem to the next. Three cases have to be distinguished, in analogy to the theory of supervised learning: (i) the cardinality of $H_R$ is finite ($|H_R| < \infty$); (ii) $H_R$ can be covered by a finite $\epsilon$-net of cardinality $|H_{\epsilon,R}| \le \tilde{n}$; (iii) there does not exist a finite $\epsilon$-cover for $H_R$. The first two cases are equivalent since the set of representative loss functions is finite either by definition (i) or by construction (ii). The third case of a hypothesis class without a finite $\epsilon$-cover poses a much more complicated situation, since learning cannot be achieved under these circumstances³. To enable learning in this case, the hypothesis class has to be regularized such that only a finite number of representative functions are admitted. Unsupervised learning problems with such characteristics arise in data clustering, where the space of all possible k-partitionings defines the











²An alternative to the $L_1(\mu)$ norm would be the $L_q(\mu)$ norm with $q = p/(p-1)$ if the loss function scales as $O(\|x\|^p)$ (see [Buhmann and Tishby, 1998]). The $L_\infty(\mu)$ norm is not suitable for our purpose since functions with significant differences in their expected risk could be very close or identical in terms of $L_\infty(\mu)$ distance.

³An analogous situation exists for a set of classifiers with infinite VC-dimension, which cannot be trained even with an arbitrarily large but finite number of training data without restrictive assumptions on the data distribution. Such a set of classifiers is characterized as non-falsifiable by Vapnik [Vapnik, 1995].


hypothesis class. In the following, $H_\epsilon \equiv H_{\epsilon,R}$ denotes a finite hypothesis class which is either identical to $H_R$ (i) or which has been constructed by coarsening of $H_R$ (ii, iii). The subscript $R$ is omitted since we only consider risk-bounded finite hypothesis classes in the remainder of this paper. Index sets which define these finite hypothesis classes are denoted by $\Lambda_\epsilon$.





2.2 The Empirical Risk Approximation Induction Principle

The Empirical Risk Approximation induction principle requires one to define a nested sequence of hypothesis classes with bounded expected risk and to sample from the hypothesis class with the desired approximation quality. Algorithmically, however, we select the loss function according to the bounded empirical risk $\hat{R}(\alpha; \mathcal{X}) \le R$. This induction principle is consistent if a bounded empirical risk implies in the asymptotic limit that the loss function has bounded expected risk $R(\alpha) \le R$ and, therefore, is an element of the hypothesis class $H_R$, i.e.,

$$\forall \alpha \in \Lambda_\epsilon: \quad \lim_{l \to \infty} \hat{R}(\alpha; \mathcal{X}) \le R \;\Longrightarrow\; h(x;\alpha) \in H_R. \qquad (4)$$

This implication (4) essentially states that the expected risk asymptotically does not exceed the risk bound ($R(\alpha) \le R$) and, therefore, the loss function $h(x;\alpha)$ is an approximation of the optimal data structure with risk not worse than $R$. The consistency assumption (4) for Empirical Risk Approximation holds if the empirical risk uniformly converges towards the expected risk,

$$\lim_{l \to \infty} P\left\{ \sup_{\alpha \in \Lambda_\epsilon} \frac{|R(\alpha) - \hat{R}(\alpha; \mathcal{X})|}{\sqrt{V\{h(x;\alpha)\}}} > \epsilon \right\} = 0, \quad \forall \epsilon > 0. \qquad (5)$$

How can we bound the deviation of the empirical risk from the expected risk (5) when the loss function is selected from a class of functions? The concentration of measure phenomenon for large sums of random variables as in (1) shows the effect that "a random variable that depends (in a 'smooth' way) on the influence of many independent variables (but not too much on any of them) is essentially constant" [Talagrand, 1996]. Mathematical statistics has formalized this statement and provides a guarantee in the form of confidence intervals, i.e., we search for a tight bound on the probability that the expected risk deviates by more than $\epsilon$ from the empirical risk for the most critical function in a hypothesis class. A necessary and sufficient condition for learning in this setting requires the empirical risk to converge uniformly over the set of loss functions towards the expected risk [Vapnik, 1982]. For the class of unbounded non-negative loss functions, the probability of large relative deviations of the empirical risk from the expected risk is bounded by Bernstein's inequality ([van der Vaart and Wellner, 1996], Lemma 2.2.11), i.e., the uniform convergence constraint over a finite set of loss functions yields the probability

$$P\left\{ \sup_{\alpha \in \Lambda_\epsilon} \frac{|R(\alpha) - \hat{R}(\alpha; \mathcal{X})|}{\sqrt{V\{h(x;\alpha)\}}} > \epsilon \right\} \le \sum_{\alpha \in \Lambda_\epsilon} P\left\{ |R(\alpha) - \hat{R}(\alpha; \mathcal{X})| > \epsilon \sqrt{V\{h(x;\alpha)\}} \right\} \le 2 |H_\epsilon| \exp\left( -\frac{\epsilon^2 l}{2(1 + \epsilon\tau/\sigma_{\min})} \right) \equiv \delta. \qquad (6)$$

The minimal variance of all loss functions is denoted by $\sigma_{\min}^2 = \inf_{\alpha \in \Lambda_\epsilon} V\{h(x;\alpha)\}$. The confidence level $\delta$ limits the probability of large deviations.
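The convergence of the empirical towards the expected risk can be made concrete with a small numerical experiment. The following sketch is my own illustration, not part of the paper: it uses the squared loss $h(x;\alpha) = \|x - \alpha\|^2$ on Gaussian data, for which the expected risk (2) is available in closed form, and shows the deviation between empirical risk (1) and expected risk shrinking as the sample size $l$ grows.

```python
import numpy as np

# Illustrative sketch (not from the paper): squared loss h(x, a) = ||x - a||^2
# for data x ~ N(mu, sigma^2 I).  The expected risk (2) is then known in
# closed form, R(a) = ||mu - a||^2 + d * sigma^2, so we can watch the
# empirical risk (1) converge towards it as the sample size l grows.
rng = np.random.default_rng(0)
d, mu, sigma = 2, np.array([1.0, -1.0]), 0.5
a = np.zeros(d)                       # one fixed hypothesis alpha

def empirical_risk(X, a):
    """Eq. (1): average loss over the i.i.d. sample set X."""
    return np.mean(np.sum((X - a) ** 2, axis=1))

expected = np.sum((mu - a) ** 2) + d * sigma ** 2   # R(alpha), eq. (2)

for l in (10, 1000, 100000):
    X = rng.normal(mu, sigma, size=(l, d))
    print(l, abs(empirical_risk(X, a) - expected))
```

For a single hypothesis the deviation shrinks at the usual $O(1/\sqrt{l})$ rate; the point of the bound (6) is that the same behavior must hold uniformly over all of $\Lambda_\epsilon$ at once, which is what costs the $\log|H_\epsilon|$ term.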


2.3 Sample Complexity and Parameter Selection

The large deviation inequality weighs two competing effects in the learning problem: the probability of a large deviation decreases exponentially with growing sample size $l$, whereas a large deviation becomes increasingly likely with growing cardinality of the hypothesis class. A compromise between both effects determines how reliable an estimate actually is for a given data set $\mathcal{X}$. The sample complexity $l_0(\epsilon, \delta)$ is defined as the necessary number of data to guarantee that $\epsilon$-large deviations of the empirical risk from the expected risk relative to the variance occur with probability not larger than $\delta$ (6). According to Eq. (6), the sample complexity $l_0(\epsilon, \delta)$ is given by

$$l_0 = \frac{2}{\epsilon^2} \left( 1 + \frac{\epsilon\tau}{\sigma_{\min}} \right) \left( \log|H_\epsilon| + \log\frac{2}{\delta} \right). \qquad (7)$$

Equation (7) effectively limits the accuracy of Empirical Risk Approximation: a selected accuracy $\epsilon$ for the ERA induction principle and a predefined confidence value $\delta$ require a minimal size $l_0$ of the data set. Solving Eq. (7) for $\epsilon$ yields a lower bound on the achievable accuracy for a given sample set. An obvious strategy to improve this critical accuracy is to decrease the cardinality of the hypothesis class by lowering the risk bound $R$ and sampling from a smaller subset of loss functions. In the extreme case of zero learning, e.g., if $H_\epsilon$ is a singleton set with exactly one loss function, the log-cardinality $\log|H_\epsilon|$ of the hypothesis class vanishes. Is this extreme case achievable, and if not, what is the minimal admissible complexity of the hypothesis class $H_\epsilon$? The cardinality of the hypothesis class is controlled by the risk bound $R$. A minimal value for the risk bound would ensure a minimal size of the hypothesis class $H_{\epsilon,R}$, which is equivalent to a minimal variety of structures in the data considered in the unsupervised learning task. The large deviation bound (6) defines a confidence interval for the empirical risk, which assumes a value in the range

$$R(\alpha) - \epsilon \sqrt{V\{h(x;\alpha)\}} \;\le\; \hat{R}(\alpha; \mathcal{X}) \;\le\; R(\alpha) + \epsilon \sqrt{V\{h(x;\alpha)\}} \qquad (8)$$

with probability $1 - \delta$. Such events are usually considered to be $\delta$-typical. The second condition of the Empirical Risk Approximation principle ensures that all loss functions with $\delta$-typical risk should be included in the hypothesis class. Consequently, the large upper deviation $\hat{R}(\alpha; \mathcal{X}) \le R(\alpha) + \epsilon\sqrt{V\{h(x;\alpha)\}}$ provides a lower bound for the risk bound $R$. The ERA principle requires that we randomly select a loss function with the property $\hat{R}(\alpha; \mathcal{X}) \le R$. Inserting this constraint yields a minimal admissible value for the loss bound

$$R > \min_{\alpha \in \Lambda_\epsilon} \left( R(\alpha) + \epsilon \sqrt{V\{h(x;\alpha)\}} \right). \qquad (9)$$

Violations of inequality (9) deprive the inference algorithm of possible hypotheses which cannot be ruled out on the basis of the samples and which, therefore, should be considered as candidate structures in the data. In general, $R$ is larger than the minimal expected risk, and it assumes the minimal value of the risk only if the variance of the loss function vanishes. I would like to emphasize that inequality (9) ensures a minimal variety or richness of the hypothesis class, i.e., $\delta$-typical loss functions should not be excluded from the hypothesis class. The lower bound on $R$ protects the unsupervised learning algorithm against overly restrictive inferences and serves as a built-in self-consistency principle.


The two bounds (7) and (9) are coupled by the considered hypothesis class $H_\epsilon$. A decrease of $\epsilon$ with fixed confidence $\delta$ increases the number of required samples. On the other hand, an increased number of samples allows us to reduce $\epsilon$ which, as a consequence, reduces the number of candidate structures in $H_\epsilon$. A critical assumption in the Empirical Risk Approximation theory for unsupervised learning is the requirement that the distribution $\mu$ of the data is known. This assumption is indispensable to construct the coarsened hypothesis class, i.e., to construct the $\epsilon$-net $H_\epsilon$. One might ask, however, if the distribution has to be identified in detail or if it only has to belong to a class of distributions characterized by, e.g., moment constraints. The bounds (7,9) establish the fact that the sample complexity depends on the expectation value and the variance of the random variable $h$ for a given hypothesis class $H_\epsilon$ but is otherwise independent of the details of the distribution. The construction of $H_\epsilon$ might require additional information about the distribution $\mu$. For the case of central clustering (see Sect. 3), however, only the second moment of the data distribution and the variance of the loss function have to be known⁴.
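To make the interplay of the two bounds concrete, here is a small, purely illustrative calculator for the sample-complexity bound (7); the numerical values for $\epsilon$, $\delta$, $|H_\epsilon|$, $\tau$ and $\sigma_{\min}$ are invented for demonstration, since all of them are problem dependent.

```python
import math

# Hedged sketch: evaluate the sample-complexity bound (7),
#   l0 = (2 / eps^2) * (1 + eps * tau / sigma_min) * (log|H_eps| + log(2/delta)),
# for illustrative values of the accuracy eps, confidence delta, hypothesis
# class cardinality and the distribution constants tau, sigma_min.

def sample_complexity(eps, delta, card_H, tau, sigma_min):
    return (2.0 / eps ** 2) * (1.0 + eps * tau / sigma_min) \
           * (math.log(card_H) + math.log(2.0 / delta))

l0 = sample_complexity(eps=0.1, delta=0.05, card_H=10 ** 6, tau=1.0, sigma_min=1.0)
print(math.ceil(l0))     # required number of samples for these inputs

# Shrinking the hypothesis class (e.g. by lowering the risk bound R) lowers l0:
print(sample_complexity(eps=0.1, delta=0.05, card_H=10 ** 3, tau=1.0, sigma_min=1.0))
```

Note the logarithmic dependence on $|H_\epsilon|$ and on $1/\delta$ against the quadratic dependence on $1/\epsilon$: accuracy, not confidence or class size, dominates the required sample size.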

3 A Statistical Learning Theory of Central Clustering

As an important example of the general framework for unsupervised learning, a statistical theory is developed in this section for central clustering, also known as vector quantization [Gersho and Gray, 1992]. Of particular interest is the problem of partitioning the data space into k partitions according to the least-squares criterion, known as k-means clustering.

3.1 Risk of Central Clustering

k-means clustering is a classical algorithm to divide a data set $\mathcal{X}$ into $k$ different partitions. A partition is represented by a mapping $m : \mathbb{R}^d \to \{1, \dots, k\}$ which assigns a partition label $\nu \in \{1, \dots, k\}$ to each data vector $x_i$, i.e., $m(x_i) = \nu$. The quality of a partition is measured by the sum of squared distances between data points which are assigned to the same partition, i.e., the empirical cost function $\hat{R}(m; \mathcal{X})$ is given by

$$\hat{R}(m; \mathcal{X}) = \frac{1}{l} \sum_{i=1}^{l} \frac{\sum_{k=1}^{l} \delta_{m(x_i), m(x_k)} \|x_i - x_k\|^2}{\sum_{j=1}^{l} \delta_{m(x_i), m(x_j)}}. \qquad (10)$$

$\delta_{i,k} = 1$ if $i = k$ and $\delta_{i,k} = 0$ if $i \ne k$ denotes the Kronecker delta. The empirical costs (10) measure the Euclidean distances between data which are assigned to the same cluster. This criterion favors clusters with maximal intra-cluster compactness. The expected risk $R(m)$ and the loss function $h(x; m)$ of k-means clustering are defined as

$$R(m) = \int_{\Omega} h(x; m)\, d\mu(x), \qquad (11)$$

$$h(x; m) = \frac{\int \|x - y\|^2\, \delta_{m(x), m(y)}\, d\mu(y)}{\int \delta_{m(x), m(y)}\, d\mu(y)} \;\ge\; \|x - y_{m(x)}\|^2 \qquad (12)$$

⁴I claim that such a situation is typical for many unsupervised learning problems with an empirical risk function which is additive in the losses for individual data points and which, therefore, shows the concentration of measure phenomenon.


Figure 1: A particular loss function for two means located at $x_{1,2} = 0.125$ is shown as the bold solid line, which is bounded by the two parabolas depicted by dotted lines. The respective assignment mapping jumps between the values 1 and 2.

with the $k$ means

$$y_{m(x)} = \frac{\int y\, \delta_{m(x), m(y)}\, d\mu(y)}{\int \delta_{m(x), m(y)}\, d\mu(y)}. \qquad (13)$$
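As a numerical sanity check on the pairwise form (10) — my own illustration, not part of the paper — one can verify that for any partition the cost (10) equals twice the mean squared distance to the cluster centroids, which ties the pairwise criterion to the centroid-based loss of (12)/(13):

```python
import numpy as np

# Sketch with synthetic data: the pairwise empirical cost (10) versus the
# familiar centroid form of k-means.  For any partition, the average
# intra-cluster pairwise squared distance per point equals twice the mean
# squared distance to the cluster centroid.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
m = rng.integers(0, 3, size=60)           # an arbitrary 3-partition m(x_i)

def pairwise_cost(X, m):                  # eq. (10)
    total = 0.0
    for i in range(len(X)):
        same = (m == m[i])                # points in the same cluster as x_i
        d2 = np.sum((X[same] - X[i]) ** 2, axis=1)
        total += d2.sum() / same.sum()
    return total / len(X)

def centroid_cost(X, m):                  # mean squared distance to centroid
    total = 0.0
    for c in np.unique(m):
        Xc = X[m == c]
        total += np.sum((Xc - Xc.mean(axis=0)) ** 2)
    return total / len(X)

print(pairwise_cost(X, m), 2 * centroid_cost(X, m))   # agree up to rounding
```

The factor of two explains why minimizing (10) over partitions is equivalent to the usual k-means objective, while the centroid form (12)/(13) drops the additive intra-cluster variance term responsible for the inequality in (12).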

A loss function for two means is depicted in Fig. 1. The risk-bounded hypothesis class of k-means clustering is defined as

$$H_R = \{h(x; m) : m(x) \text{ is a } k\text{-partition} \;\wedge\; R(m) \le R\}. \qquad (14)$$

The complexity of the hypothesis class is determined by the $\mu$-measurable k-partitions of $\Omega$. This set falls into category (iii) of the general discussion (see Sect. 2), since $H$ cannot be covered by a finite $\epsilon$-cover if $\mu$-measurability is the only restriction on $m$. We, therefore, have to introduce an appropriate regularization of the function set $H$. A natural choice is to quantize the data space $\Omega = \bigcup_{i=1}^{\tilde{n}} \Omega_i$ into $\tilde{n}$ cells $\Omega_i$ and to require constant assignments of data to clusters within a cell, i.e., $\forall x \in \Omega_i:\ m(x) = m_i$, $1 \le i \le \tilde{n}$. Other learning-theoretical approaches to k-means clustering assume that data are always assigned to the closest mean [Pollard, 1982, Linder et al., 1994, Devroye et al., 1996, Linder et al., 1997]. An analysis without this constraint on data assignments can be found in [Kearns et al., 1997], i.e., algorithm III with hard assignments selected according to predefined probabilities resembles a randomized selection of a loss function from the hypothesis class $H$.


3.2 Regularized Hypothesis Class and its Cardinality

How should the cells $\Omega_i$ be defined? A natural choice is to quantize the data space in such a way that two loss functions which are identical everywhere except in cell $\Omega_i$ have an $L_1(\mu)$ distance of $\epsilon$ if the data are assigned to the two closest clusters, respectively, i.e.,

$$\int_{\Omega_i} \left( \|x - y_i^{2nd}\|^2 - \|x - y_i^{1st}\|^2 \right) d\mu(x) \doteq \epsilon. \qquad (15)$$

The centroids $y_i^{1st}, y_i^{2nd}$ are the closest and the second-closest centroid of cell $\Omega_i$, defined by $\int_{\Omega_i} \|x - y_i^{1st}\|^2\, d\mu(x) \le \int_{\Omega_i} \|x - y_i^{2nd}\|^2\, d\mu(x) \le \int_{\Omega_i} \|x - y_\nu\|^2\, d\mu(x)$ $\forall y_\nu \ne y_i^{1st}$. This construction represents the $L_1(\mu)$ distance between two functions $h, h'$, which exclusively assign data in $\Omega_i$ either to the closest or the second-closest mean, by $\epsilon$ times the number of cells with different assignments ($d(h, h') = \epsilon \sum_{i=1}^{\tilde{n}} (1 - \delta_{m_i, m_i'})$). Summation of both sides of (15) yields the number of cells

$$\tilde{n} = \frac{1}{\epsilon} \sum_{i=1}^{\tilde{n}} \int_{\Omega_i} 2 \left( y_i^{2nd} - y_i^{1st} \right)^T \left( \frac{y_i^{2nd} + y_i^{1st}}{2} - x \right) d\mu(x). \qquad (16)$$

Equation (15) quantizes the space into cells analogous to the data partitioning of K-nearest-neighbor classifiers. Low data density and small loss lead to large cells, whereas parts of the data space with high density are finely partitioned. The hypothesis class $H_\epsilon$ which results from the space quantization is finite and can be considered as a regularized version of $H$. It is defined as

$$H_\epsilon = \{h(x; m) : R(m) \le R \;\wedge\; \forall i\ \forall x \in \Omega_i:\ m(x) = m_i \in \{1, \dots, k\}\}. \qquad (17)$$

Other schemes to regularize the hypothesis class $H$ of all $\mu$-measurable k-partitions of the data space are imaginable, e.g., for distributions with finite support the data space could be uniformly partitioned into cells of fixed size, yielding an analogous hypothesis class. Loss functions with different assignments in cells with low data density and short distances to the next mean have smaller $L_1(\mu)$ distances than loss functions which differ in cells with high data density and long distances to the next mean. The hypothesis class $H_\epsilon$ can be considered as an $\epsilon$-net of the class $H$. In this sense $H_\epsilon$ is the prototypical, coarsened hypothesis class for a variety of different regularization schemes. The probability of large deviations (6) is limited by the cardinality of the hypothesis class times an exponentially decaying Hoeffding factor. Consider the regularized hypothesis class $H_\epsilon$ defined in (17). Its cardinality $|H_\epsilon|$ is given by

$$|H_\epsilon| = k^{\tilde{n}}\, P\{R(m) \le R\} = \prod_{\nu=1}^{k} \int dy_\nu \int_{-i\infty}^{+i\infty} \frac{1}{i\sqrt{2\pi \hat{x}}} \exp\left( \tilde{n}\, S(\hat{x}, m) \right) d\hat{x} + O(\tilde{n}), \qquad (18)$$

$$S(\hat{x}, m) = \hat{x} R + \frac{1}{\tilde{n}} \sum_{i=1}^{\tilde{n}} \log \sum_{\nu=1}^{k} \exp\left( -\tilde{n} \hat{x} \int_{\Omega_i} (x - y_\nu)^2\, d\mu(x) \right). \qquad (19)$$

The cardinality depends on the entropy $S(\hat{x}, m)$ and is determined in the asymptotic limit $\tilde{n} \to \infty$ by the maximum of $S$. Linear terms $O(\tilde{n})$ in Eq. (18) are neglected since the cardinality scales as $O(\exp(\tilde{n} S^{sp}))$. Details of the calculation and a refined analysis with $O(\tilde{n})$ corrections can be found in the appendix. $P\{R(m) \le R\}$ is the probability of selecting a loss function with expected risk smaller than the risk bound $R$ if we randomly select loss functions from $H_{\epsilon,\infty}$. The random choice of loss functions reflects the strategy of Empirical Risk Approximation, since all hypotheses with risk smaller than $R$ are admitted without additional preferences. For a large number of cells the integral is dominated by its maximum, with the respective stationarity equations

$$y_\nu = \frac{\sum_{i=1}^{\tilde{n}} P_{i\nu} \int_{\Omega_i} x\, d\mu(x)}{\sum_{i=1}^{\tilde{n}} P_{i\nu} \int_{\Omega_i} d\mu(x)}, \qquad (20)$$

$$R = \sum_{i=1}^{\tilde{n}} \sum_{\nu=1}^{k} P_{i\nu} \int_{\Omega_i} (x - y_\nu)^2\, d\mu(x) \quad \text{with} \qquad (21)$$

$$P_{i\nu} = \frac{\exp\left( -\hat{x} \tilde{n} \int_{\Omega_i} (x - y_\nu)^2\, d\mu(x) \right)}{\sum_{\mu=1}^{k} \exp\left( -\hat{x} \tilde{n} \int_{\Omega_i} (x - y_\mu)^2\, d\mu(x) \right)}. \qquad (22)$$

$P_{i\nu} = P\{m_i = \nu\}$ denotes the probability that data in cell $\Omega_i$ are assigned to cluster $\nu$ when a loss function is randomly selected from $H_\epsilon$. Equation (20) is a probabilistic version of the k-means definition (13). The hard assignment of datum $x$ to clusters, encoded by $\delta_{m(x),\nu}$, is replaced by the probabilistic assignment $P_{i\nu}$ (22). It is quite instructive to compare the stationarity equations (20,22) with the reestimation equations for deterministic annealing of central clustering given a data set $\mathcal{X}$ [Rose et al., 1990, Buhmann and Kuhnel, 1993]:

$$\tilde{y}_\nu = \frac{\sum_{i=1}^{l} \tilde{P}_{i\nu}\, x_i}{\sum_{i=1}^{l} \tilde{P}_{i\nu}} \quad \text{with} \quad \tilde{P}_{i\nu} = \frac{\exp\left( -(x_i - \tilde{y}_\nu)^2 / T^{comp} \right)}{\sum_{\mu=1}^{k} \exp\left( -(x_i - \tilde{y}_\mu)^2 / T^{comp} \right)}. \qquad (23)$$

The auxiliary variable $\hat{x} = 1/T^{comp}$ corresponds to the inverse computational temperature [Rose et al., 1990, Buhmann and Kuhnel, 1993]. $\tilde{P}_{i\nu}, \tilde{y}_\nu$ are the empirical estimators of the assignment probabilities and the means on the basis of the data set $\mathcal{X}$. Deterministic annealing approaches to probabilistic clustering utilize the inverse temperature as a parameter to control the smoothness of the effective cost function for clustering, i.e., optimization is pursued by gradually reducing the temperature. The relation between the loss bound $R$ and the inverse temperature $\hat{x} = 1/T^{comp}$ is established by (21), i.e., $\hat{x}$ can be interpreted as a Lagrange variable enforcing the risk bound. Solutions of (21) are only permitted for $R \le \int \|x\|^2\, d\mu(x)$, with equality assigning all data with the same probability $1/k$ to clusters which are all located at the center of mass of the data distribution ($y_1 = \cdots = y_k = 0$). Equality in this equation defines a critical point and is related to the phenomenon of phase transitions in clustering [Rose et al., 1990], which is discussed in detail in [Buhmann and Tishby, 1998]. Additional phase transitions occur at lower values of $1/\hat{x}$ and indicate that the data set can be partitioned into finer and finer clusters.
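A minimal deterministic-annealing sketch of the reestimation equations (23) — with an invented one-dimensional data set and cooling schedule, purely for illustration — exhibits the behavior described above: at high temperature the means coincide near the center of mass, and below the critical temperature they split towards the mixture modes.

```python
import numpy as np

# Illustrative deterministic-annealing run of eq. (23): soft assignments
# P~_iv at computational temperature T^comp, alternated with the weighted
# mean update for the cluster centers y~_v.  Data and schedule are invented.
rng = np.random.default_rng(2)
X = np.concatenate([rng.normal(-2.0, 0.3, 200), rng.normal(2.0, 0.3, 200)])
k = 2
y = np.array([0.01, -0.01])        # near-degenerate start at the center of mass

for T in [8.0, 4.0, 2.0, 1.0, 0.5, 0.25, 0.1]:     # cooling schedule
    for _ in range(50):
        d2 = (X[:, None] - y[None, :]) ** 2          # (x_i - y~_v)^2
        logits = -d2 / T
        logits -= logits.max(axis=1, keepdims=True)  # numerical stabilization
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)            # soft assignments P~_iv
        y = (P * X[:, None]).sum(axis=0) / P.sum(axis=0)   # means y~_v

print(sorted(y))    # the two means end up near the two mixture modes
```

Early in the schedule the soft assignments are nearly uniform and both means stay close to the overall mean; the symmetry-breaking split appears only once the temperature falls below the critical value, which is the phase-transition cascade discussed above.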
Figure 2 shows the positions of six cluster means for a one-dimensional partitioning problem with a mixture distribution composed of five Gaussians centered at $x_0 \in \{-2, -1.5, -1, 0, 2\}$. The temperature as a resolution parameter controls the complexity of the partition, e.g., at $T = 0.8$ the clusters $y_2, \dots, y_6$ are all located at $x = -1.1$. For $T < 0.25$ all six clusters are separated.


[Figure 2 appears here: plot of the six mean positions $y_1, \dots, y_6$ against the computational temperature $T^{comp}$, with the data density shown below.]

Figure 2: The positions of six means $y_1, \dots, y_6$ are shown as a function of the computational temperature. The data distribution $\mu$, depicted below, is a mixture of five Gaussians with variance 0.05 located at $x_0 \in \{-2, -1.5, -1, 0, 2\}$.

3.3 The Sample Complexity for Clustering

The necessary size $l_0$ of the sample set is related to the accuracy $\epsilon$ of clustering estimates for a given confidence level $\delta$ (see Eq. 7) by

$$\frac{l_0 \epsilon^2}{2(1 + \epsilon\tau/\sigma_{\min})} = \tilde{n}\, S^{sp} + \log\frac{2}{\delta}. \qquad (24)$$


According to (24,19), highly accurate estimates are achievable for a hypothesis class with small cardinality, i.e., with the lowest possible $R$. We, therefore, expand the entropy $S^{sp}$ for large $\hat{x}$ values⁵. The entropy evaluated at the saddle point is approximately given by

$$S^{sp} \approx \hat{x} \left( R - \sum_{i=1}^{\tilde{n}} \int_{\Omega_i} \|x - y_i^{1st}\|^2\, d\mu(x) \right). \qquad (25)$$

Terms of the order $O\left( \exp\left[ -\tilde{n} \hat{x} \int_{\Omega_i} \left( \|x - y_i^{2nd}\|^2 - \|x - y_i^{1st}\|^2 \right) d\mu(x) \right] \right)$ have been neglected in Eq. (25). The first term in (25) measures the cost difference between the maximal loss bound $R$ and the costs of assigning all data to the closest mean. This difference is bounded by Eq. (9), i.e.,

$$R - \sum_{i=1}^{\tilde{n}} \int_{\Omega_i} \|x - y_i^{1st}\|^2\, d\mu(x) \;\ge\; R - \min_{h \in H_\epsilon} R(m) \;\ge\; \epsilon\, \sigma_{\min}. \qquad (26)$$

Neglecting exponentially small terms in x̂, the sample complexity l_0 in (24) yields a lower bound for the computational temperature 1/x̂:

T^{comp} = \frac{1}{\hat x} \;\geq\; \frac{\tilde n\,\epsilon\,\sigma_{\min}}{\dfrac{l_0\,\epsilon^2}{2(1+\epsilon/\sigma_{\min})} - \log\dfrac{2}{\delta}}. \qquad (27)

Inequality (27) ensures that the hypothesis class H is not too constrained and that a minimal variety of possible loss functions is considered in the inference process. If we select too small an accuracy value ε for a given size l of the sample set, then the computational temperature diverges and balances overfitting effects. I would like to emphasize that inequality (27) defines a parameter selection criterion for data partitionings, i.e., it provides a bound on the maximal number of clusters which can be estimated reliably. Lowering the temperature in simulated or deterministic annealing yields more and more clusters by a cascade of phase transitions (see Fig. 2). The ERA induction principle, however, stops this refinement before the extreme of as many clusters as data points is reached, since splitting the data into too small groups is not warranted by the uniform convergence requirement. Too large fluctuations might render cluster estimates y_ν unreliable for small sample sets, and the computational temperature acts as a regularization parameter to prevent cluster refinements with overfitting behavior (see [Hofmann and Puzicha, 1998] for empirical justification). This regularization effect is particularly pronounced in high-dimensional partitioning problems with a small ratio of the number of samples to the number of dimensions. It is worth mentioning that the sample complexity l_0 (24) depends on the data distribution only through the minimal variance of the loss function σ_min, the number of cells ñ and the entropy S^{sp}. All three quantities are integrals of the distribution and measure various moments of the data. The analysis therefore holds for all distributions which belong to the class of distributions defined by these moments.
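The parameter selection criterion can be made concrete numerically. The sketch below is ours: it assumes the bound has the form T ≥ ñ ε σ_min / (l_0 ε² / (2(1 + ε/σ_min)) − log(2/δ)), and all function and parameter names are illustrative rather than the paper's notation.

```python
import math

def min_temperature(l0, eps, delta, n_cells, sigma_min):
    """Evaluate an assumed lower bound on the computational temperature.

    Assumed form: T >= n_cells * eps * sigma_min /
                       (l0*eps**2 / (2*(1 + eps/sigma_min)) - log(2/delta)).
    """
    denom = l0 * eps**2 / (2.0 * (1.0 + eps / sigma_min)) - math.log(2.0 / delta)
    if denom <= 0:
        # Too few samples for the requested accuracy: the temperature
        # diverges and no non-trivial partition is learnable.
        return math.inf
    return n_cells * eps * sigma_min / denom

# More samples permit a lower temperature, i.e., finer partitions.
t_coarse = min_temperature(l0=5000, eps=0.1, delta=0.05, n_cells=50, sigma_min=1.0)
t_fine = min_temperature(l0=50000, eps=0.1, delta=0.05, n_cells=50, sigma_min=1.0)
assert t_fine < t_coarse
```

With a tenfold larger sample set the admissible temperature drops by roughly an order of magnitude, so a finer cluster hierarchy becomes reliably estimable.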

4 Discussion

The theory of Empirical Risk Approximation for unsupervised learning problems extends the highly successful statistical learning theory of supervised learning to a large class of unsupervised learning problems with their respective loss functions. The two conditions that the empirical

^5 This limit corresponds to the low-temperature expansion in statistical mechanics.


risk has to converge uniformly towards the expected risk, and that all loss functions within an ε-range of the global risk minimum have to be considered in the inference process, limit the complexity of the hypothesis class for a given number of samples. The example of central clustering demonstrates that the underlying hypothesis class of unsupervised learning problems possibly assumes an infinite VC-dimension and has to be regularized by a finite set of representative loss functions. The cardinality of this set is determined by the entropy of the solution parameters which encode the hypothesis class. The maximum entropy method, which has been widely employed in the deterministic annealing procedure for optimization problems, is substantiated by our analysis and is identified as the optimal procedure for large-scale problems up to logarithmic corrections. Many unsupervised learning problems are formulated as combinatorial or continuous optimization problems. The new framework for unsupervised learning also defines a theory of robust algorithm design for noisy combinatorial optimization problems, e.g., minimum flow problems with noisy cost values, process optimization with fluctuating capacities, the traveling salesman problem with imprecise city locations, or scheduling problems with fluctuating demands on the process. Robust solutions for all these noisy optimization problems require stability against fluctuations in the problem definition, i.e., solutions have to generalize from one problem instance to an equally likely second instance. In many real-world applications the noisy optimization problem characterizes the situation much better than its deterministic variant. Furthermore, neglecting the noise influence and searching for the global optimum yields inferior solutions which are often hard to find for computational complexity reasons. Structural Risk Minimization (SRM) was introduced by Vapnik as an induction principle for circumstances in which the size of the sample set increases.
Hypothesis classes are defined as nested structures of sets of functions, and the appropriate subset is selected according to the number of samples. Within a hypothesis class the function with minimal empirical risk is chosen. Empirical Risk Approximation is also based on a nested sequence of sets of functions, but it differs from the SRM principle in two respects: (i) the ERA algorithm samples from a subset in the sequence rather than selecting the function with minimal empirical risk; (ii) the inference algorithm initially samples from a large subset representing a crude approximation and proceeds to smaller and smaller subsets in the sequence when additional data permit a refined estimate of the parameters. The most important consequence from a pattern recognition point of view is the possibility to derive a parameter selection criterion from the Empirical Risk Approximation principle. Solutions with too many parameters clearly overfit the problem parameters and do not generalize. The condition that the hypothesis class should contain at least all ε-typical loss functions close to the global minimum forces us to stop the stochastic search at the lower bound of the computational temperature. The similarity between temperature-dependent regularization and structural risk minimization as a scheme for complexity control in supervised learning is apparent and hints at additional, yet still unexplored parallels between supervised and unsupervised learning.

Acknowledgments: It is a pleasure to thank N. Tishby for illuminating discussions and M. Held and J. Puzicha for careful reading of the manuscript. This work has been supported by the German Research Foundation (DFG, #BU 914/3-1), by the German-Israeli Foundation for Scientific Research and Development (GIF, #1-0403-001.06/95) and by the Newton Institute (Cambridge, UK) under the Neural Networks and Machine Learning program.


Appendix

This appendix summarizes the calculations to derive Eq. (18).

|H| = k^{\tilde n}\,P\{R(m)\le R\}
    = k^{\tilde n}\sum_{m_i\in\{1,\dots,k\}}\Bigl(\prod_{i=1}^{\tilde n}\frac 1k\Bigr)\prod_{\nu=1}^{k}\int dy_\nu\;\delta\Biggl(y_\nu-\frac{\sum_{i=1}^{\tilde n}\delta_{m_i,\nu}\int_{\Omega_i}x\,d\mu(x)}{\sum_{i=1}^{\tilde n}\delta_{m_i,\nu}\int_{\Omega_i}d\mu(x)}\Biggr)\;\Theta\Biggl(R-\sum_{i=1}^{\tilde n}\int_{\Omega_i}(x-y_{m_i})^2\,d\mu(x)\Biggr) \qquad (28)

Dirac's δ-function and Heaviside's step function Θ can be regularized by the transformations

\delta(x-a) = \frac{1}{2\pi i}\int_{-i\infty}^{+i\infty}\exp\bigl(-x'(x-a)\bigr)\,dx' \qquad (29)

\Theta(x-a) = \frac{1}{2\pi i}\int_{0}^{+\infty}dx''\int_{-i\infty}^{+i\infty}\exp\bigl(-x'(x''+a-x)\bigr)\,dx' \qquad (30)

We introduce the cluster probabilities p_\nu = \sum_{i=1}^{\tilde n} M_{i\nu}\int_{\Omega_i}d\mu(x) using the abbreviation M_{i\nu}\equiv\delta_{m_i,\nu}. The cardinality (28) can be rewritten with these transformations as

|H| = k^{\tilde n}\sum_{\substack{M_{i\nu}\in\{0,1\},\\ \sum_{\nu=1}^k M_{i\nu}=1}} \exp\Bigl(-\sum_{i=1}^{\tilde n}\sum_{\nu=1}^{k}M_{i\nu}\ln k\Bigr)
\prod_{\nu=1}^{k}\int dy_\nu\int_{-i\infty}^{+i\infty}\frac{d\hat y_\nu}{(2\pi i)^d}\,\exp\Biggl(-\sum_{\nu=1}^{k}\hat y_\nu\Bigl(y_\nu-\frac 1{p_\nu}\sum_{i=1}^{\tilde n}M_{i\nu}\int_{\Omega_i}x\,d\mu(x)\Bigr)\Biggr)
\times\prod_{\nu=1}^{k}\int_0^1 dp_\nu\int_{-i\infty}^{+i\infty}\frac{d\hat p_\nu}{2\pi i}\,\exp\Biggl(-\sum_{\nu=1}^{k}\hat p_\nu\Bigl(p_\nu-\sum_{i=1}^{\tilde n}M_{i\nu}\int_{\Omega_i}d\mu(x)\Bigr)\Biggr)
\times\int_0^{+\infty}dz\int_{-i\infty}^{+i\infty}\frac{d\hat x}{2\pi i}\,\exp\Biggl(-\hat x(z-R)-\hat x\sum_{i=1}^{\tilde n}\sum_{\nu=1}^{k}M_{i\nu}\int_{\Omega_i}(x-y_\nu)^2\,d\mu(x)\Biggr) \qquad (31)

= k^{\tilde n}\sum_{\substack{M_{i\nu}\in\{0,1\},\\ \sum_{\nu=1}^k M_{i\nu}=1}}\prod_{\nu=1}^{k}\int dy_\nu\int_{-i\infty}^{+i\infty}\frac{d\hat y_\nu}{(2\pi i)^d}\int_0^1 dp_\nu\int_{-i\infty}^{+i\infty}\frac{d\hat p_\nu}{2\pi i}\int_0^{+\infty}dz\int_{-i\infty}^{+i\infty}\frac{d\hat x}{2\pi i}
\;\exp\Biggl(-\sum_{\nu=1}^{k}(\hat y_\nu y_\nu+\hat p_\nu p_\nu)-\hat x(z-R)\Biggr)
\times\exp\Biggl(-\sum_{i=1}^{\tilde n}\sum_{\nu=1}^{k}M_{i\nu}\Bigl(\ln k-\frac{\hat y_\nu}{p_\nu}\int_{\Omega_i}x\,d\mu(x)-\hat p_\nu\int_{\Omega_i}d\mu(x)+\hat x\int_{\Omega_i}(x-y_\nu)^2\,d\mu(x)\Bigr)\Biggr) \qquad (32)

The summation over the admissible values of the assignment variables M_{iν} can be carried out in closed form since these variables occur linearly in the exponential function. The integration of the auxiliary variable z can be carried out using the residue formula, with an integration path along the contours of the first and fourth quadrant of the complex plane for negative or positive imaginary part of x̂, respectively. Furthermore, the transformation of the auxiliary variables \hat y_\nu \to \hat y_\nu\tilde n, \hat p_\nu \to \hat p_\nu\tilde n, \hat x \to \hat x\tilde n to intensive coordinates yields the expression:

|H| = k^{\tilde n}\,\tilde n^{k(d+1)}\prod_{\nu=1}^{k}\int dy_\nu\int_{-i\infty}^{+i\infty}\frac{d\hat y_\nu}{(2\pi)^d}\int_0^1 dp_\nu\int_{-i\infty}^{+i\infty}\frac{d\hat p_\nu}{2\pi}\int_{-i\infty}^{+i\infty}\frac{d\hat x}{2\pi i\,\hat x}
\;\exp\Biggl(\tilde n\hat xR-\tilde n\sum_{\nu=1}^{k}(\hat y_\nu y_\nu+\hat p_\nu p_\nu)\Biggr)
\times\prod_{i=1}^{\tilde n}\sum_{\nu=1}^{k}\exp\Biggl(-\ln k+\tilde n\int_{\Omega_i}\Bigl(\frac{\hat y_\nu}{p_\nu}x+\hat p_\nu-\hat x(x-y_\nu)^2\Bigr)d\mu(x)\Biggr) \qquad (33)

= \tilde n^{k(d+1)}\prod_{\nu=1}^{k}\int dy_\nu\int_{-i\infty}^{+i\infty}\frac{d\hat y_\nu}{(2\pi)^d}\int_0^1 dp_\nu\int_{-i\infty}^{+i\infty}\frac{d\hat p_\nu}{2\pi}\int_{-i\infty}^{+i\infty}\frac{d\hat x}{2\pi i\,\hat x}\;\exp\bigl(\tilde n\,\hat S(p_\nu,\hat p_\nu,y_\nu,\hat y_\nu,\hat x)\bigr) \qquad (34)

with the abbreviations

\hat S(p_\nu,\hat p_\nu,y_\nu,\hat y_\nu,\hat x) = \hat xR-\sum_{\nu=1}^{k}(\hat y_\nu y_\nu+\hat p_\nu p_\nu)+\frac 1{\tilde n}\sum_{i=1}^{\tilde n}\ln\sum_{\nu=1}^{k}\exp(\tilde n\,\Lambda_{i\nu}),
\qquad
\Lambda_{i\nu} = \int_{\Omega_i}\Bigl(\frac{\hat y_\nu}{p_\nu}x+\hat p_\nu-\hat x(x-y_\nu)^2\Bigr)d\mu(x),

where the prefactor k^{\tilde n} has been absorbed against the factors \exp(-\ln k) per cell.

The cardinality |H| is determined by the maxima of the function Ŝ. Stationarity of Ŝ w.r.t. variations of the variables p_ν, p̂_ν, y_ν, ŷ_ν, x̂ requires that the following equations are satisfied:

\frac{\partial\hat S}{\partial p_\nu} = 0 = -\hat p_\nu-\frac{\hat y_\nu}{p_\nu^2}\sum_{i=1}^{\tilde n}\int_{\Omega_i}x\,d\mu(x)\,\hat P_{i\nu} \qquad (35)

\frac{\partial\hat S}{\partial\hat p_\nu} = 0 = -p_\nu+\sum_{i=1}^{\tilde n}\int_{\Omega_i}d\mu(x)\,\hat P_{i\nu} \qquad (36)

\frac{\partial\hat S}{\partial y_\nu} = 0 = -\hat y_\nu+2\hat x\sum_{i=1}^{\tilde n}\int_{\Omega_i}(x-y_\nu)\,d\mu(x)\,\hat P_{i\nu} \qquad (37)

\frac{\partial\hat S}{\partial\hat y_\nu} = 0 = -y_\nu+\frac 1{p_\nu}\sum_{i=1}^{\tilde n}\int_{\Omega_i}x\,d\mu(x)\,\hat P_{i\nu} \qquad (38)

\frac{\partial\hat S}{\partial\hat x} = 0 = R-\sum_{i=1}^{\tilde n}\sum_{\nu=1}^{k}\int_{\Omega_i}(x-y_\nu)^2\,d\mu(x)\,\hat P_{i\nu} \qquad (39)

with the abbreviation

\hat P_{i\nu} \equiv \frac{\exp(\tilde n\,\Lambda_{i\nu})}{\sum_{\mu=1}^{k}\exp(\tilde n\,\Lambda_{i\mu})}.
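The interpretation of x̂ as a Lagrange parameter enforcing the risk constraint can be illustrated numerically. The sketch below is our own toy construction, not the paper's algorithm: each 1-d data point is treated as its own cell with equal weight, the centroid and assignment conditions are solved by fixed-point iteration for a fixed multiplier β, and β is then tuned by bisection until the expected risk matches a prescribed bound R.

```python
import numpy as np

def risk_at_beta(x, k, beta, n_iter=200):
    """Expected clustering risk as a function of the multiplier beta.

    Solves the centroid/assignment stationarity conditions by fixed-point
    iteration for 1-d data x (a toy discretization: one cell per point).
    """
    rng = np.random.default_rng(0)
    y = x.mean() + 1e-3 * rng.standard_normal(k)  # perturbed degenerate start
    for _ in range(n_iter):
        b = (x[:, None] - y[None, :]) ** 2            # per-cell costs
        logits = -beta * b
        logits -= logits.max(axis=1, keepdims=True)
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)             # Gibbs assignments
        y = (P * x[:, None]).sum(0) / P.sum(0)        # centroid condition
    return (P * b).sum() / len(x)                     # mean expected cost

def beta_for_risk(x, k, R, lo=1e-3, hi=1e3, tol=1e-6):
    """Bisection for the multiplier beta enforcing risk == R.

    The risk is (essentially) monotonically decreasing in beta,
    so a simple bisection locates the crossing.
    """
    while hi - lo > tol * (1 + hi):
        mid = 0.5 * (lo + hi)
        if risk_at_beta(x, k, mid) > R:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Large β (low temperature) drives the expected risk down towards the within-cluster variance, small β lets it rise towards the total variance; the bisection recovers the multiplier that saturates the risk bound.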


Multiplying Eq. (38) with p_ν and inserting Eq. (36) into the resulting equation yields a condition for the centroids y_ν such that the second term in Eq. (37) vanishes at the stationarity point. Therefore, extremality of Ŝ implies that ŷ_ν = 0 ∀ν and, as a consequence of (35), the auxiliary variables p̂_ν = 0 have to vanish as well. In this sense the definition of the centroids as representatives for the data is natural, since no conjugate fields (ŷ_ν, p̂_ν) with finite values are necessary to bias the statistics of the y_ν. The integration over the p_ν variables yields one since the argument of the exponential function does not explicitly depend on these variables. After replacing the variables ŷ_ν, p̂_ν in the logarithmic term of Ŝ we derive the cardinality |H|:

|H| = \prod_{\nu=1}^{k}\int dy_\nu\int_{-i\infty}^{+i\infty}\frac{d\hat x}{2\pi i\,\hat x}\,\exp\bigl(\tilde n\,S(\hat x,m)\bigr)+O(\tilde n), \qquad (18)

S(\hat x,m) = \hat xR+\frac 1{\tilde n}\sum_{i=1}^{\tilde n}\log\sum_{\nu=1}^{k}\exp\Bigl(-\tilde n\,\hat x\int_{\Omega_i}(x-y_\nu)^2\,d\mu(x)\Bigr). \qquad (19)

Corrections of O(ñ) are neglected since the integral scales as O(exp(ñ S^{sp})). How can we evaluate the complex integral in Eq. (18)? The integration along the imaginary axis can be performed by applying the residue formula to the closed path 0−i∞ → 0+i∞ → x̂^{sp}+i∞ → x̂^{sp}−i∞ → 0−i∞ in the complex x̂ plane. The singularity at the origin is bypassed by a small semicircle in the half-plane with positive real part of x̂; as a consequence of this construction, the closed path does not surround any singularities and the contour integral therefore vanishes. The pieces of the integration path at ±i∞ vanish due to the "infinitely fast oscillations" of the integrand. Consequently, integration along the imaginary axis is equivalent to integrating along a line parallel to the imaginary axis through the stationarity point. The stationarity point is located on the real axis, which renders the temperature parameter 1/x̂ a real variable at the saddle point. Approximately, the cardinality is given by

|H| \approx \frac{1}{\hat x^{sp}}\exp\bigl(\tilde n\,S(\hat x^{sp},y^{sp},p^{sp})\bigr)+O(\tilde n). \qquad (40)

(See [Bruijn, 1981] for details of the saddle point/stationary phase method.) To capture the O(ñ) terms correctly, we have to calculate the Hessian of Ŝ at the stationary

point. The second derivatives, evaluated at the stationary point (ŷ_ν = 0, p̂_ν = 0), are given by the following expressions:

\frac{\partial^2\hat S}{\partial p_\nu\partial p_\mu}=0,\qquad
\frac{\partial^2\hat S}{\partial p_\nu\partial y_\mu}=0,\qquad
\frac{\partial^2\hat S}{\partial p_\nu\partial\hat y_\mu}=0,\qquad
\frac{\partial^2\hat S}{\partial p_\nu\partial\hat x}=0,\qquad
\frac{\partial^2\hat S}{\partial p_\nu\partial\hat p_\mu}=-\delta_{\nu\mu}

\frac{\partial^2\hat S}{\partial\hat p_\nu\partial\hat p_\mu}=\tilde n\sum_{i=1}^{\tilde n}\Bigl(\int_{\Omega_i}d\mu(x)\Bigr)^2\Delta^i_{\nu\mu}

\frac{\partial^2\hat S}{\partial\hat p_\nu\partial y_\mu}=2\hat x\,\tilde n\sum_{i=1}^{\tilde n}\int_{\Omega_i}(x-y_\mu)\,d\mu(x)\int_{\Omega_i}d\mu(x)\,\Delta^i_{\nu\mu}

\frac{\partial^2\hat S}{\partial\hat p_\nu\partial\hat y_\mu}=\frac{\tilde n}{p_\mu}\sum_{i=1}^{\tilde n}\int_{\Omega_i}x\,d\mu(x)\int_{\Omega_i}d\mu(x)\,\Delta^i_{\nu\mu}

\frac{\partial^2\hat S}{\partial\hat p_\nu\partial\hat x}=-\tilde n\sum_{i=1}^{\tilde n}\sum_{\mu=1}^{k}\int_{\Omega_i}d\mu(x)\int_{\Omega_i}(x-y_\mu)^2\,d\mu(x)\,\Delta^i_{\nu\mu}

\frac{\partial^2\hat S}{\partial y_\nu\partial y_\mu}=4\hat x^2\tilde n\sum_{i=1}^{\tilde n}\int_{\Omega_i}(x-y_\nu)\,d\mu(x)\int_{\Omega_i}(x-y_\mu)\,d\mu(x)\,\Delta^i_{\nu\mu}

\frac{\partial^2\hat S}{\partial y_\nu\partial\hat y_\mu}=-\delta_{\nu\mu}+\frac{2\hat x\tilde n}{p_\mu}\sum_{i=1}^{\tilde n}\int_{\Omega_i}(x-y_\nu)\,d\mu(x)\int_{\Omega_i}x\,d\mu(x)\,\Delta^i_{\nu\mu}

\frac{\partial^2\hat S}{\partial y_\nu\partial\hat x}=2\sum_{i=1}^{\tilde n}\int_{\Omega_i}(x-y_\nu)\,d\mu(x)\,\hat P_{i\nu}
-2\hat x\tilde n\sum_{i=1}^{\tilde n}\sum_{\mu=1}^{k}\int_{\Omega_i}(x-y_\nu)\,d\mu(x)\int_{\Omega_i}(x-y_\mu)^2\,d\mu(x)\,\Delta^i_{\nu\mu}

\frac{\partial^2\hat S}{\partial\hat y_\nu\partial\hat y_\mu}=\frac{\tilde n}{p_\nu p_\mu}\sum_{i=1}^{\tilde n}\Bigl(\int_{\Omega_i}x\,d\mu(x)\Bigr)^2\Delta^i_{\nu\mu}

\frac{\partial^2\hat S}{\partial\hat y_\nu\partial\hat x}=-\frac{\tilde n}{p_\nu}\sum_{i=1}^{\tilde n}\sum_{\mu=1}^{k}\int_{\Omega_i}x\,d\mu(x)\int_{\Omega_i}(x-y_\mu)^2\,d\mu(x)\,\Delta^i_{\nu\mu}

\frac{\partial^2\hat S}{\partial\hat x^2}=\tilde n\sum_{i=1}^{\tilde n}\sum_{\nu=1}^{k}\sum_{\mu=1}^{k}\int_{\Omega_i}(x-y_\nu)^2\,d\mu(x)\int_{\Omega_i}(x-y_\mu)^2\,d\mu(x)\,\Delta^i_{\nu\mu}

with the abbreviation \Delta^i_{\nu\mu}\equiv\hat P_{i\nu}(\delta_{\nu\mu}-\hat P_{i\mu}).


A quadratic expansion of the exponential function in Eq. (34) depends on the Hessian

A \equiv \begin{pmatrix}
\frac{\partial^2\hat S}{\partial p_\nu\partial p_\mu} & \frac{\partial^2\hat S}{\partial p_\nu\partial\hat p_\mu} & \frac{\partial^2\hat S}{\partial p_\nu\partial y_\mu} & \frac{\partial^2\hat S}{\partial p_\nu\partial\hat y_\mu} & \frac{\partial^2\hat S}{\partial p_\nu\partial\hat x}\\
\frac{\partial^2\hat S}{\partial\hat p_\nu\partial p_\mu} & \frac{\partial^2\hat S}{\partial\hat p_\nu\partial\hat p_\mu} & \frac{\partial^2\hat S}{\partial\hat p_\nu\partial y_\mu} & \frac{\partial^2\hat S}{\partial\hat p_\nu\partial\hat y_\mu} & \frac{\partial^2\hat S}{\partial\hat p_\nu\partial\hat x}\\
\frac{\partial^2\hat S}{\partial y_\nu\partial p_\mu} & \frac{\partial^2\hat S}{\partial y_\nu\partial\hat p_\mu} & \frac{\partial^2\hat S}{\partial y_\nu\partial y_\mu} & \frac{\partial^2\hat S}{\partial y_\nu\partial\hat y_\mu} & \frac{\partial^2\hat S}{\partial y_\nu\partial\hat x}\\
\frac{\partial^2\hat S}{\partial\hat y_\nu\partial p_\mu} & \frac{\partial^2\hat S}{\partial\hat y_\nu\partial\hat p_\mu} & \frac{\partial^2\hat S}{\partial\hat y_\nu\partial y_\mu} & \frac{\partial^2\hat S}{\partial\hat y_\nu\partial\hat y_\mu} & \frac{\partial^2\hat S}{\partial\hat y_\nu\partial\hat x}\\
\frac{\partial^2\hat S}{\partial\hat x\partial p_\mu} & \frac{\partial^2\hat S}{\partial\hat x\partial\hat p_\mu} & \frac{\partial^2\hat S}{\partial\hat x\partial y_\mu} & \frac{\partial^2\hat S}{\partial\hat x\partial\hat y_\mu} & \frac{\partial^2\hat S}{\partial\hat x^2}
\end{pmatrix}. \qquad (41)

Each entry has to be read as a submatrix with indices ν, μ ∈ {1, …, k}. The cardinality is determined by

|H| = \frac{1}{\hat x^{sp}\sqrt{\tilde n\,\det A}}\,\exp\bigl(\tilde n\,\hat S(p^{sp},0,y^{sp},0,\hat x^{sp})\bigr)\,\bigl(1+O(\tilde n^{-1})\bigr). \qquad (42)


References

[Becker and Hinton, 1992] Becker, S. and Hinton, G. (1992). Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355:161-163.

[Bell and Sejnowski, 1995] Bell, A. and Sejnowski, T. J. (1995). An information-maximisation approach to blind separation and blind deconvolution. Neural Computation, 7(6):1004-1034.

[Bishop, 1995] Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Clarendon Press, Oxford.

[Bishop et al., 1998] Bishop, C. M., Svensen, M., and Williams, C. K. I. (1998). GTM: the generative topographic mapping. Neural Computation, 10(1):215-234.

[Bruijn, 1981] de Bruijn, N. G. (1958; reprinted 1981). Asymptotic Methods in Analysis. North-Holland Publishing Co. (repr. Dover), Amsterdam.

[Buhmann and Kuhnel, 1993] Buhmann, J. M. and Kuhnel, H. (1993). Vector quantization with complexity costs. IEEE Transactions on Information Theory, 39(4):1133-1145.

[Buhmann and Tishby, 1998] Buhmann, J. M. and Tishby, N. (1998). On the combinatorial nature of phase transitions in data clustering. Internal Report.

[Cox and Cox, 1994] Cox, T. F. and Cox, M. (1994). Multidimensional Scaling. Number 59 in Monographs on Statistics and Applied Probability. Chapman & Hall, London.

[Dayan et al., 1995] Dayan, P., Hinton, G., Neal, R., and Zemel, R. (1995). The Helmholtz machine. Neural Computation, 7(5):889-904.

[Devroye et al., 1996] Devroye, L., Györfi, L., and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer Verlag, New York, Berlin, Heidelberg.

[Duda and Hart, 1973] Duda, R. O. and Hart, P. E. (1973). Pattern Classification and Scene Analysis. Wiley, New York.

[Gersho and Gray, 1992] Gersho, A. and Gray, R. M. (1992). Vector Quantization and Signal Compression. Kluwer Academic Publishers, Boston.

[Hastie and Stuetzle, 1989] Hastie, T. and Stuetzle, W. (1989). Principal curves. Journal of the American Statistical Association, 84:502-516.

[Hinton and Ghahramani, 1997] Hinton, G. E. and Ghahramani, Z. (1997). Generative models for discovering sparse distributed representations. Philosophical Transactions of the Royal Society B, 352:1177-1190.

[Hofmann and Buhmann, 1997] Hofmann, T. and Buhmann, J. M. (1997). Pairwise data clustering by deterministic annealing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(1):1-14.

[Hofmann and Puzicha, 1998] Hofmann, T. and Puzicha, J. (1998). Statistical models for co-occurrence data. AI Memo 1625, Artificial Intelligence Laboratory, Massachusetts Institute of Technology.


[Huber, 1985] Huber, P. (1985). Projection pursuit. Annals of Statistics, 13:435-475.

[Jordan and Jacobs, 1994] Jordan, M. and Jacobs, R. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181-214.

[Kearns et al., 1997] Kearns, M., Mansour, Y., and Ng, A. Y. (1997). An information-theoretic analysis of hard and soft assignment methods for clustering. In Proceedings of Uncertainty in Artificial Intelligence. AAAI.

[Klock and Buhmann, 1997] Klock, H. and Buhmann, J. M. (1997). Multidimensional scaling by deterministic annealing. In Pelillo, M. and Hancock, E. R., editors, Proceedings EMMCVPR'97, Lecture Notes in Computer Science, pages 245-260. Springer Verlag.

[Kohonen, 1984] Kohonen, T. (1984). Self-organization and Associative Memory. Springer, Berlin.

[Linder et al., 1994] Linder, T., Lugosi, G., and Zeger, K. (1994). Rates of convergence in the source coding theorem, in empirical quantizer design and in universal lossy source coding. IEEE Transactions on Information Theory, 40(6):1728-1740.

[Linder et al., 1997] Linder, T., Lugosi, G., and Zeger, K. (1997). Empirical quantizer design in the presence of source noise or channel noise. IEEE Transactions on Information Theory, 43(2):612-623.

[McLachlan and Basford, 1988] McLachlan, G. J. and Basford, K. E. (1988). Mixture Models. Marcel Dekker, Inc., New York, Basel.

[Pereira et al., 1993] Pereira, F., Tishby, N., and Lee, L. (1993). Distributional clustering of English words. In 31st Annual Meeting of the Association for Computational Linguistics, Columbus, Ohio, pages 183-190.

[Poggio and Girosi, 1990] Poggio, T. and Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE, 78(9).

[Pollard, 1982] Pollard, D. (1982). Quantization and the method of k-means. IEEE Transactions on Information Theory, 28(2):199-205.

[Ripley, 1996] Ripley, B. (1996). Pattern Recognition and Neural Networks. Cambridge University Press.

[Ritter et al., 1992] Ritter, H., Martinetz, T., and Schulten, K. (1992). Neural Computation and Self-organizing Maps. Addison Wesley, New York.

[Rose et al., 1990] Rose, K., Gurewitz, E., and Fox, G. (1990). Statistical mechanics and phase transitions in clustering. Physical Review Letters, 65(8):945-948.

[Talagrand, 1996] Talagrand, M. (1996). A new look at independence. The Annals of Probability, 24(1):1-34.

[Valiant, 1984] Valiant, L. G. (1984). A theory of the learnable. Communications of the Association for Computing Machinery, 27:1134-1142.


[van der Vaart and Wellner, 1996] van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer-Verlag, New York, Berlin, Heidelberg.

[Vapnik, 1982] Vapnik, V. N. (1982). Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York, Berlin, Heidelberg.

[Vapnik, 1995] Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer-Verlag, New York, Berlin, Heidelberg.
