Estimating the sample complexity of a multi-class discriminant model

Yann Guermeur

André Elisseeff and Hélène Paugam-Moisy

[email protected]

{aelissee,hpaugam}@univ-lyon2.fr

LIP6, UMR CNRS 7606, Université Paris 6, 4 place Jussieu, 75252 Paris cedex 05

ERIC, Université Lumière Lyon 2, 5 avenue Pierre Mendès-France, 69676 Bron cedex

Abstract

We study the generalization performance of a multi-class discriminant model. Several bounds on its sample complexity are derived from uniform convergence results based on different measures of capacity. This gives us an insight into the nature of the capacity measure which is best suited to the study of multi-class discriminant models.

1 Introduction

Since the pioneering work of Vapnik and Chervonenkis [14], extending the classical Glivenko-Cantelli theorem to give a uniform convergence result over classes of indicator functions, many studies have dealt with uniform strong laws of large numbers (see for instance [10, 7]). The bounds they provide, both in pattern recognition and in regression estimation, can readily be used to derive a sample complexity which is an increasing function of a particular measure of the capacity of the model considered. The choice of the appropriate bound, as well as the computation of a tight upper bound on the capacity measure, thus appears to be of central importance to derive a tight bound on the sample complexity. This is the subject we investigate in this paper, through an application to the Multivariate Linear Regression (MLR) combiner described in [6]. Up to now, very few studies in statistical learning theory have dealt with multi-class discrimination. Section 2 briefly outlines the implementation of the MLR model for the combination of class posterior probability estimates. Section 3 is devoted to the estimation of the sample complexity using two combinatorial quantities generalizing the Vapnik-Chervonenkis (VC) dimension, called the graph dimension and the Natarajan dimension. In section 4, another bound is derived from a theorem in which the capacity of the family of functions is characterized by its covering number. Both bounds are discussed in section 5, where further improvements are proposed as perspectives.

2 MLR combiner for classifier combination

We consider a Q-category discrimination task, under the usual hypothesis that there is a joint distribution, fixed but unknown, on S = X × Y, where X is the input space and Y the set of categories. We further assume that, for each input pattern x ∈ X, the outputs of P classifiers are available. Let fj denote the function computed by the j-th of these classifiers: fj(x) = [fjk(x)] ∈ ℝ^Q. The k-th output fjk(x) approximates the class posterior probability p(Ck|x). Precisely, fj(x) ∈ U with

U = { u ∈ ℝ₊^Q : 1_Q^T u = 1 }.

In other words, the outputs are non-negative and sum to 1. Let F(x) = [fj(x)], (1 ≤ j ≤ P) (F(x) ∈ U^P) be the vector of predictors. The MLR model studied here, parameterized by v = [vk] ∈ ℝ^{PQ²}, computes the functions g ∈ G given by:

g(x) = [g1(x), ..., gk(x), ..., gQ(x)]^T,   with gk(x) = vk^T F(x) for k = 1, ..., Q.

The MLR combiner is thus designed to take as inputs class posterior probability estimates and to output better estimates with respect to some given criterion (least squares, cross-entropy, ...). vk,l,m, the general term of vk, is the coefficient associated with the predictor flm(x) in the regression computed to estimate p(Ck|x). Let v be the vector of all the parameters (v = [vk] ∈ ℝ^{PQ²}). Let Λ be the set of loss functions satisfying the general conditions for outputs to be interpreted as probabilities (see for instance [3]) and s an N-sample of observations (s ∈ S^N). Among the functions of G, the MLR combiner is any of the functions which constitutes a solution to the following optimization problem:

Problem 1 Given a convex loss function L ∈ Λ and an N-sample s ∈ S^N, find a function in G minimizing the empirical risk Ĵ(v) and taking its values in U.

As was pointed out in [6], optimal solutions to Problem 1 are obtained by minimizing Ĵ(v) subject to v ∈ V with

V = { v ∈ ℝ₊^{PQ²} : ∀(l, m), Σ_k (vk,l,m − vk,l,Q) = 0, 1_{PQ²}^T v = Q }.

This result holds irrespective of the choice of L ∈ Λ and s. Details on the computation of a global minimum can be found in [5]. To sum up, the model we study is a multiple-output perceptron, under the constraints ∀x ∈ X, F(x) ∈ U^P and v ∈ V. One can easily verify that the Euclidean norm ||F(x)||₂ of every vector in U^P is bounded above by √P. Furthermore, solving a simple quadratic programming problem establishes that for all v ∈ V and k ∈ {1, ..., Q}, ||vk||₂ ≤ √Q.
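To make the model concrete, here is a minimal sketch in Python (with NumPy) of how the MLR combiner could be evaluated on a single pattern and followed by Bayes' estimated decision rule. The dimensions, the classifier outputs and the choice of v are illustrative values introduced only for this example, not taken from the paper, and the resolution of Problem 1 is not shown.

import numpy as np

# Illustrative sizes (hypothetical): P classifiers, Q categories.
P, Q = 3, 4
rng = np.random.default_rng(0)

# F(x): concatenation of the P classifier outputs f_j(x), each one a
# probability vector of size Q (non-negative entries summing to 1), so that
# F(x) belongs to U^P.  The outputs are random here, for illustration only.
raw = rng.random((P, Q))
F_x = (raw / raw.sum(axis=1, keepdims=True)).ravel()      # shape (P*Q,)

# Parameters v = [v_1, ..., v_Q], one block v_k in R^{PQ} per category.
# As a feasible illustrative choice (not a solution of Problem 1), v_k simply
# averages the P available estimates of p(C_k | x).
V = np.zeros((Q, P * Q))
for k in range(Q):
    for j in range(P):
        V[k, j * Q + k] = 1.0 / P

# g(x): combined estimates of the class posterior probabilities.
g_x = V @ F_x
assert np.isclose(g_x.sum(), 1.0)

# Bayes' estimated decision rule: pick the category with the largest g_k(x).
print("g(x) =", np.round(g_x, 3), "-> predicted category:", int(np.argmax(g_x)))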

3 Growth function bounds

From now on, we consider the MLR combiner as a discriminant model, by application of Bayes' estimated decision rule. Although the learning process described in the previous section only amounts to estimating probability densities on a finite sample, the criterion of interest, the generalization ability in terms of recognition rate, can be both estimated and controlled. To perform this task, we use results derived in the framework of computational learning theory and of the statistical learning theory introduced by Vapnik. Roughly speaking, they represent corollaries of uniform strong laws of large numbers. The theorems used in this section are derived from bounds grounded on a combinatorial capacity measure called the growth function. In order to express them, we must first introduce additional notations and definitions.

Let H be a family of functions from X to a finite set Y. Es(h) and E(h) respectively designate the error on s (observed error) and the generalization error of a function h belonging to H.

Definition 1 Let H be a set of indicator functions (values in {0, 1}). Let sX be an N-sample of X and ΔH(sX) the number of different classifications of sX by the functions of H. The growth function ΠH is defined by:

ΠH(N) = max { ΔH(sX) : sX ∈ X^N }.

Definition 2 The VC dimension of a set H of indicator functions is the maximum number d of vectors that can be shattered, i.e. separated into two classes in all 2^d possible ways, using functions of H. If this maximum does not exist, the VC dimension is equal to infinity.
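As a quick illustration of these two definitions (this example is not part of the original paper), the following Python sketch counts the classifications of a small sample induced by the one-dimensional threshold functions x ↦ 1[x ≥ θ]: the number of distinct labelings grows as N + 1 rather than 2^N, which reflects a VC dimension of 1 for this class.

def threshold_labelings(points):
    """Distinct labelings of `points` by the indicator functions x -> 1[x >= theta]."""
    pts = sorted(points)
    # One threshold below every point, one between consecutive points, one above all.
    candidates = ([pts[0] - 1.0]
                  + [(a + b) / 2.0 for a, b in zip(pts, pts[1:])]
                  + [pts[-1] + 1.0])
    return {tuple(1 if x >= theta else 0 for x in points) for theta in candidates}

sample = [0.0, 1.0, 2.0, 3.0]            # an N-sample of X = R, with N = 4
labelings = threshold_labelings(sample)
print(len(labelings), "labelings out of", 2 ** len(sample), "possible")
# Prints "5 labelings out of 16": the growth function is N + 1, so no pair of
# points can be shattered and the VC dimension of this class is 1.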

These definitions only apply to sets of indicator functions. The following theorem, dealing with multiple-output functions, appears as an immediate corollary of a result due to Vapnik [13] (see for instance [1]):

Theorem 1 Let s ∈ S^N. With probability 1 − δ,

E(h) < Es(h) + √( (1/N) (ln(4 Π_{G_H}(2N)) − ln δ) ) + 1/N.

G_H, the graph space of H, is defined as follows. For h ∈ H, let Gh be the function from X × Y to {0, 1} defined by Gh(x, y) = 1 ⟺ h(x) = y, and let G_H = {Gh : h ∈ H}. Π_{G_H} is the growth function of G_H [9]. Thus, in order to obtain an upper bound on the confidence interval in Theorem 1, one must find an upper bound on the growth function associated with the MLR combiner. This constitutes the object of the next two subsections, where two different bounds are stated.

3.1 Graph dimension of the MLR combiner

The discriminant functions computed by the MLR combiner are also computed by the single hidden layer perceptron with threshold activation functions depicted in Figure 1.

[Figure 1: single hidden layer perceptron computing the same discriminant functions as the MLR combiner. The P×Q inputs feed C_Q^2 hidden threshold units with weight vectors v1−v2, v1−v3, ..., v(Q−1)−vQ; the Q output units are associated with the categories C1, ..., CQ (Bayes' estimated decision rule applied to g(x)). The threshold function t satisfies t(z) = +1 if z > 0 and t(z) = −1 otherwise. The weights of the output layer, either +1 (solid lines) or −1 (dashed lines), and the biases are chosen so that the output units compute a logical AND. The number below each layer corresponds to the number of units.]

Several articles have been devoted to bounding the growth function and the VC dimension of multilayer perceptrons [9, 11]. Proceeding as in [11], one can use as an upper bound on the growth function the product of the bounds on the growth functions of the individual hidden units derived from Sauer's lemma. Since the VC dimension of each hidden unit is at most equal to the dimension of the smallest subspace of ℝ^{PQ} containing U^P, and this dimension is dE = P(Q − 1) + 1 [6], we get:

Π_{G_MLP}(2N) ≤ (2eN/dE)^{C_Q^2 dE}   (1)

This bound can be significantly improved by making use of the dependences between the hidden units. For lack of place, we only exhibit here a simple way to do so. Let x ∈ sX. Among all the possible classifications performed by the hidden unit whose vector is v1 − v2, exactly half associate to F(x) the value 1. The same is true for the hidden unit whose vector is v2 − v3. Now, if these two units provide the same output for F(x), then the output of the hidden unit whose vector is v1 − v3 is known. Consequently, the growth function associated with these three units is at most (3/4)(2eN/dE)^{3dE}. Proceeding step by step, we thus get an improved bound:

Π_{G_MLP}(2N) < (3/4)^{(C_Q^2 − 1)/2} (2eN/dE)^{C_Q^2 dE}   (2)
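To give an idea of the orders of magnitude involved, the following Python sketch plugs illustrative values of P, Q, N and δ into dE = P(Q − 1) + 1, into the bound (1) on the growth function and into the confidence interval of Theorem 1 as reconstructed above. All numerical values are arbitrary; the code is an illustration, not part of the original study.

import math

def log_growth_bound_1(N, P, Q):
    """Logarithm of the upper bound (1) on Pi_{G_MLP}(2N): (2eN/d_E)^(C_Q^2 d_E)."""
    d_E = P * (Q - 1) + 1            # bound on the VC dimension of one hidden unit
    c_Q2 = Q * (Q - 1) // 2          # number of hidden units (one per pair of classes)
    return c_Q2 * d_E * math.log(2.0 * math.e * N / d_E)

def theorem1_confidence(N, P, Q, delta):
    """Confidence term of Theorem 1 when Pi_{G_H}(2N) is replaced by bound (1)."""
    return math.sqrt((log_growth_bound_1(N, P, Q) + math.log(4.0) - math.log(delta)) / N) + 1.0 / N

P, Q, delta = 3, 4, 0.05             # arbitrary illustrative values
for N in (10_000, 100_000, 1_000_000):
    print(f"N = {N:>9}: confidence term = {theorem1_confidence(N, P, Q, delta):.3f}")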

3.2 Natarajan dimension

Natarajan introduced in [9] an extension of the VC dimension to multiple-output functions, based on the following definition of shattering:

Definition 3 A set H of discrete-valued functions shatters a set sX of vectors ⟺ there exist two functions h1 and h2 belonging to H such that: (a) for any x ∈ sX, h1(x) ≠ h2(x); (b) for all s1 ⊆ sX, there exists h3 ∈ H such that h3 agrees with h1 on s1 and with h2 on sX \ s1, i.e. ∀x ∈ s1, h3(x) = h1(x), and ∀x ∈ sX \ s1, h3(x) = h2(x).

We established the following result in [6]:

Theorem 2 The Natarajan dimension dN of the MLR combiner satisfies:

P + ⌊Q/2⌋ ≤ dN ≤ (1/2) C_Q^2 (dE − 1).

In [8], the following theorem was proved:

Theorem 3 Let Nc_H(N) be the number of different classifications performed by a set of functions H on a set of size N. Let dN be the Natarajan dimension of H. Then, for all d ≥ dN:

Nc_H(N) ≤ Σ_{i=0}^{d} C_N^i (C_Q^2)^i.

We have the following inequality:

Σ_{i=0}^{d} C_N^i (C_Q^2)^i ≤ (Q² eN / (2d))^d   (3)

Obviously, Π_{G_H}(N) ≤ Nc_H(N). Thus, substituting the upper bound on dN provided by Theorem 2 for d in the formula of Theorem 3 and applying (3), we get:

Π_{G_MLP}(2N) ≤ (2eN Q² / (C_Q^2 (dE − 1)))^{C_Q^2 (dE − 1)/2}   (4)

This last bound is very similar to those provided by (1) and (2). This is a good indication that these bounds are quite loose, since bounding Π_{G_H}(N) by Nc_H(N) is crude. In fact, the difficulty with the use of the growth function lies in the way the specificity of the model (the links between the hidden units) can be taken into account when applying a generalization of Sauer's lemma. This difficulty concerns all the approaches using combinatorial methods which do not consider how the model is built. This led us to consider an alternative definition of the capacity of the combiner, which provides an original way to obtain confidence bounds for multi-class discriminant models based on the computation of the a posteriori probabilities p(Ck|x).
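As a numerical sanity check (again, not part of the original paper), the sketch below compares the exact value of the Haussler-Long sum of Theorem 3 with the closed-form upper bound used in (3), as reconstructed above, for small illustrative values of N, Q and d.

import math

def haussler_long_sum(N, Q, d):
    """Exact value of sum_{i=0}^{d} C(N, i) * (C_Q^2)^i from Theorem 3."""
    pairs = math.comb(Q, 2)
    return sum(math.comb(N, i) * pairs ** i for i in range(d + 1))

def closed_form_bound(N, Q, d):
    """Closed-form upper bound used in (3): (Q^2 e N / (2 d))^d."""
    return (Q * Q * math.e * N / (2.0 * d)) ** d

N, Q = 200, 4                        # arbitrary illustrative values
for d in (2, 5, 10):
    print(f"d = {d:>2}: sum = {haussler_long_sum(N, Q, d):.4e}  <=  bound = {closed_form_bound(N, Q, d):.4e}")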

4 Covering number bounds

Many studies have been devoted to giving rates of uniform convergence of means to their expectations, based on a capacity measure called the covering number [12, 10, 7]. To define this measure, we first introduce the notion of ε-cover in a pseudo-metric space.

Definition 4 Let (E, ρ) be a pseudo-metric space. A finite set T ⊂ E is an ε-cover of a set H ⊂ E with respect to ρ if for all h ∈ H there is a h̄ ∈ T such that ρ(h, h̄) ≤ ε.

Definition 5 Given ε and ρ, the covering number N(ε, H, ρ) of H is the size of the smallest ε-cover of H with respect to ρ.

Covering numbers were introduced in learning theory to derive bounds for regression estimation problems. In [2], Bartlett has extended their use to discrimination, specifically to the computation of dichotomies. From now on, H is supposed to be a set of functions taking their values in ℝ. The discriminant function associated with any of the functions h ∈ H is the function t ∘ h, where t is the sign function defined in the caption of Figure 1. E(h) is defined accordingly. The corresponding result, referred to below as Theorem 4, bounds E(h), with probability 1 − δ, by the empirical error of h measured with a margin γ plus a confidence term of the form

√( (2/N) (ln(2 N_∞(γ/2, H, 2N)) − ln δ) ).

The MLR model takes its values in [0, 1]^Q. For each pattern x, the desired output y is the canonical coding of the category of x. If |gk(x) − yk| < 1/2 for all k, then the use of Bayes' estimated decision rule will provide the correct classification for x. A contrario, an error will occur only if |gk(x) − yk| ≥ 1/2 for at least one k. For all k, we define E(gk) as the error of gk − 1/2, and E(g) as the generalization error of the discriminant function associated with g. Then, we have:

E(g) ≤ Σ_{k=1}^{Q} E(gk)   (5)

As in the previous section, bounding the generalization ability of the MLR combiner amounts to bounding a measure of capacity. This time, however, this measure is defined for each individual function gk. The end of this section thus deals with stating bounds on N_∞(γ/2, H̄, 2N), where H̄ equals {gk − 1/2} for any k in {1, ..., Q}. Since the function π_γ is 1-Lipschitzian,

N_∞(γ/2, π_γ(H̄), 2N) ≤ N_∞(γ/2, H̄, 2N)   (6)

For 0 < γ < 1/2, a result on the covering numbers of linear operators (see for instance [4]) gives

N_∞(γ/2, T(B_PQ), ||·||_∞) ≤ (8 ||T̃|| / γ)^{PQ}   (7)

where T̃ is the linear operator associated with T and ||T̃|| = sup_{w ∈ B_PQ} ||T̃w||_∞ / ||w||₂ (B_PQ denoting the unit ball of ℝ^{PQ}). Since ||F(x)||₂ ≤ √P, it follows from the Cauchy-Schwarz inequality that ||T̃|| is bounded above by √(PQ). By injection of (7) in (6), and by applying Theorem 4 to bound the right-hand side of (5), it yields

E(g) ≤ Σ_{k=1}^{Q} Es^{γ_k}(gk) + √( (2/N) (ln 2 + PQ ln(8 √(PQ) / γ) − ln δ) ).

This bound on the error should be related to the bound derived from Theorem 1 by substituting for Π_{G_H}(2N) the successive bounds established in section 3.
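For comparison with the growth-function approach, the following Python sketch evaluates the confidence term of the covering-number bound above for illustrative values of N, P, Q, γ and δ. The values are arbitrary and the formula is the reconstruction given above, so the sketch should be read as an illustration only.

import math

def covering_confidence(N, P, Q, gamma, delta):
    """Confidence term sqrt((2/N) * (ln 2 + P*Q*ln(8*sqrt(P*Q)/gamma) - ln(delta)))."""
    log_cover = P * Q * math.log(8.0 * math.sqrt(P * Q) / gamma)
    return math.sqrt(2.0 / N * (math.log(2.0) + log_cover - math.log(delta)))

P, Q, gamma, delta = 3, 4, 0.25, 0.05   # arbitrary illustrative values, 0 < gamma < 1/2
for N in (10_000, 100_000, 1_000_000):
    print(f"N = {N:>9}: confidence term = {covering_confidence(N, P, Q, gamma, delta):.3f}")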

5 Discussion and Conclusion

This paper analyses two strategies to derive bounds on the generalization error of a multi-class discriminant model. One is based on combinatorial dimensions and the other uses covering numbers. The bounds on the generalization ability derived in the two former sections cannot be readily compared, since they rest on different definitions of the empirical error. A direct application of these results to a biological application [5] leads to similar bounds in both cases, which are roughly 35%. Such bounds are not accurate enough for this application: in their current implementation, both methods fail to provide useful bounds for real-world applications. However, this study sheds light on some features which could be used to improve the generalization control so as to make it of practical interest. As pointed out in section 3, the combinatorial approach does not take into account the specificity of the model and should be avoided as it is used here. Improving the bound on the Natarajan dimension will indeed not decrease the gap between Π_{G_H}(N) and Nc_H(N), which is one of the most important Achilles' heels of the method. To use knowledge of the structure when bounding the generalization error, we have introduced a method based on covering numbers which is new for multi-class discrimination. This method has the disadvantage, however, of crudely bounding E(g) by the sum Σ_{k=1}^{Q} E(gk). This is the main weakness of the method, and future work should address it. One way to do so is to develop a global approach by directly controlling the generalization error E(g) in terms of covering numbers, instead of controlling the individual generalization errors E(gk). This will be the subject of our next research. Thus, by presenting a new approach for the study of multi-class discriminant models with real-valued internal representations, we have stated new bounds and pointed out a way to improve them. Covering numbers make it possible to use knowledge of the learning system and to include it in the confidence bounds. Further work will be to derive more practical bounds for the model of interest.

References

[1] M. Anthony (1997): Probabilistic Analysis of Learning in Artificial Neural Networks: The PAC Model and its Variants. Neural Computing Surveys, Vol. 1, 1-47.
[2] P. Bartlett (1996): The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. Technical report, Department of Systems Engineering, Australian National University, ftp: syseng.anu.edu.au:pub/peter/TR96d.ps.
[3] C.M. Bishop (1995): Neural Networks for Pattern Recognition. Clarendon Press, Oxford.
[4] B. Carl and I. Stephani (1990): Entropy, compactness, and the approximation of operators. Cambridge University Press, Cambridge, UK.
[5] Y. Guermeur, C. Geourjon, P. Gallinari and G. Deléage (1999): Improved Performance in Protein Secondary Structure Prediction by Inhomogeneous Score Combination. To appear in Bioinformatics.
[6] Y. Guermeur, H. Paugam-Moisy and P. Gallinari (1998): Multivariate Linear Regression on Classifier Outputs: a Capacity Study. ICANN'98, 693-698.
[7] D. Haussler (1992): Decision Theoretic Generalizations of the PAC Model for Neural Net and Other Learning Applications. Information and Computation, 100, 78-150.
[8] D. Haussler and P.M. Long (1995): A Generalization of Sauer's Lemma. Journal of Combinatorial Theory, Series A, 71, 219-240.
[9] B.K. Natarajan (1989): On Learning Sets and Functions. Machine Learning, Vol. 4, 67-97.
[10] D. Pollard (1984): Convergence of Stochastic Processes. Springer Series in Statistics, Springer-Verlag, N.Y.
[11] J. Shawe-Taylor and M. Anthony (1991): Sample sizes for multiple-output threshold networks. Network: Computation in Neural Systems, Vol. 2, 107-117.
[12] V.N. Vapnik (1982): Estimation of Dependences Based on Empirical Data. Springer-Verlag, N.Y.
[13] V.N. Vapnik (1998): Statistical Learning Theory. John Wiley & Sons, Inc., N.Y.
[14] V.N. Vapnik and A.Y. Chervonenkis (1971): On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, Vol. 16, 264-280.

