
APPROXIMATION ERROR BOUNDS THAT USE VC-BOUNDS¹

Federico Girosi

Center for Biological and Computational Learning and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA

¹ A version of this paper will appear in the Proceedings of the International Conference on Artificial Neural Networks, Paris, 1995.

1 Introduction

Approximation theory and statistics are two branches of mathematics that look at very similar problems. However, they use quite different mathematical tools, and provide answers to different sorts of questions. Results from approximation theory have rarely been used in statistics, and vice versa. We believe that both disciplines would benefit from a substantial exchange of ideas and techniques. In this paper we present a simple result that comes from the application of the concept of VC-dimension, a fundamental tool in statistical learning theory, to the problem of computing a non-asymptotic, dimension independent, uniform error bound for a class of Radial Basis Functions approximation techniques (Moody and Darken, 1989; Girosi, Jones and Poggio, 1995). In particular, given a function belonging to an appropriate class of functions, we will derive a bound on how well that function can be approximated, in the L_\infty norm, by a linear superposition of Gaussians of different variances centered at different locations. The coefficients of the linear combination can be chosen to be either 1 or -1, and the error bound is O(\sqrt{\ln l / l}) in any number of variables, where l is the number of terms in the superposition. Although this result is specific to Radial Basis Functions, the proof technique can be applied to other approximation methods as well. The plan of the paper is the following: in the next section we present the statistical theory background and introduce the necessary definitions and theorems; in section 3 we present the main approximation theory result; and in section 4 we conclude with some comments on the results of section 3.

2 Some notions of statistical learning theory

In this section we briefly review the problem of learning from examples as formulated in statistical learning theory. In all instances of learning from examples we have two sets of variables, which we will generically indicate by X and Y, that are related by a probabilistic relationship. We say that the relationship is probabilistic because it usually happens that an element of X does not determine uniquely an element of Y, but rather a probability distribution on Y. This can be formalized by assuming that a probability distribution P(x, y) is defined over the set X \times Y. The probability distribution P(x, y) is unknown, and under very general conditions can be written as

P(x, y) = P(x) P(y|x)

where P(y|x) is the conditional probability of y given x, and P(x) is the marginal probability of x. From now on we will assume that X is a subset of a k-dimensional Euclidean space and Y is a subset of the real line. In all problems of learning from examples we are provided with examples of this probabilistic relationship, that is, with a data set D_l \equiv \{(x_i, y_i) \in X \times Y\}_{i=1}^{l}, obtained by sampling l times the set X \times Y according to P(x, y). The problem consists in, given the data set D_l, providing an estimator, that is a function f : X \to Y, that can be used, given any value of x \in X, to predict the corresponding value y. A common way to solve this problem consists in defining a risk functional, which measures the average amount of error associated with an estimator, and then looking for the estimator with the lowest risk. If V(y - f(x)) is a measure of the error we make when we predict y by f(x) (for example V(x) = x^2 or V(x) = |x|), then the average error is the so-called expected risk:

I[f] \equiv \int dx\, dy\; V(y - f(x))\, P(x, y)     (1)

We assume that the expected risk is defined on a "large" class of functions F, which could be, for example, a Sobolev space or a set of differentiable functions, and we will denote by f_0 the function which minimizes the expected risk in this space:

f_0(x) = \arg\min_{f \in F} I[f]     (2)

The function f_0 is our ideal estimator, and it is often called the "target" function. Unfortunately this function cannot be found, because the probability distribution P(x, y) that defines the expected risk is unknown, and only a sample of it (the data set D_l) is available. A common approach consists in using the data set D_l to build a stochastic approximation of the expected risk, which is usually called the empirical risk, and is defined as:

I_{emp}[f; l] \equiv \frac{1}{l} \sum_{i=1}^{l} V(y_i - f(x_i))     (3)

It is important to notice that, in practice, the empirical risk is not minimized on the class F (this problem usually has an infinite number of solutions), but on a smaller set H \subset F. Actually, the empirical risk, rather than being defined on a single set H, is usually defined on an element H_n of an infinite nested structure: H_0 \subset H_1 \subset \ldots \subset H_n \subset \ldots. The index n is a positive number measuring the complexity of the set H_n. For example H_n could be the set of polynomials of degree n, or a set of splines with n nodes, or some more complicated nonlinear parametrization with n parameters. Therefore the solution of the learning problem which is found in practice is the following:

\hat f_{n,l} = \arg\min_{f \in H_n} I_{emp}[f; l]     (4)
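As a purely illustrative sketch (none of the numerical choices below appear in the paper), the following Python fragment carries out the empirical risk minimization of eq. (4) with the square loss V(x) = x^2 over the nested family H_n of polynomials of degree n; the synthetic data set and the degrees are arbitrary assumptions.

import numpy as np

# Illustrative only: empirical risk minimization (eqs. 3-4) with square loss
# over the nested family H_n = {polynomials of degree n}.

def empirical_risk(f, x, y):
    # I_emp[f; l] = (1/l) sum_i V(y_i - f(x_i)), with V(u) = u^2
    return np.mean((y - f(x)) ** 2)

def erm_polynomial(x, y, n):
    # hat f_{n,l}: a least squares polynomial fit is the ERM in H_n for the square loss
    return np.poly1d(np.polyfit(x, y, deg=n))

# A synthetic data set D_l (the "unknown" P(x, y) is chosen here only for illustration).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(100)

for n in (1, 3, 5):
    f_hat = erm_polynomial(x, y, n)
    print(n, empirical_risk(f_hat, x, y))   # the empirical risk decreases with n

As n grows, the minimum of the empirical risk can only decrease, while the expected risk of \hat f_{n,l} need not; this is the tension quantified by the bounds discussed below.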

Intuitively, we expect that when the number of data points grows to infinity the empirical risk converges to the expected risk, and the solution \hat f_{n,l} converges to the function:

f_n = \arg\min_{f \in H_n} I[f]     (5)

More precisely, in order for the empirical risk minimization principle to be well defined, one has to guarantee that the minimum of the empirical risk converges to the minimum of the expected risk. In formulas, the following conditions should be satisfied:

\lim_{l \to \infty} I_{emp}[\hat f_{n,l}; l] = \lim_{l \to \infty} I[\hat f_{n,l}] = I[f_n]     (6)

It has been shown by Vapnik and Chervonenkis (1989) that in order for the limits in eq. (6) to hold true, or more precisely, for the empirical risk minimization principle to be non-trivially consistent, the following condition has to be satisfied:

\lim_{l \to \infty} P\left\{ \sup_{f \in H_n} \left( I[f] - I_{emp}[f; l] \right) > \varepsilon \right\} = 0 \qquad \forall\, \varepsilon > 0     (7)

Vapnik and Chervonenkis derived necessary and sufficient conditions for nontrivial consistency, and the results are formulated in terms of the VC-dimension, which is a measure of the complexity of the set H_n. In order to be more precise, let us give a more general definition of the expected and empirical risk:

I[f] = \int dz\; Q(f, z)\, P(z)     (8)

I_{emp}[f; l] = \frac{1}{l} \sum_{i=1}^{l} Q(f, z_i)     (9)

where the vectors z_i are random samples of the probability distribution P(z), Q is a functional of f which can also depend explicitly on z, and f is an element of a set of functions H_n. If we set z = (x, y) and Q(f, z) = V(y - f(x)), we recover the definitions given in eqs. (1) and (3). Since the VC-dimension is more easily defined when Q is an indicator function, that is a function which takes values in the set {0, 1}, we will consider this case first. Given a set of N points z_1, ..., z_N, an indicator function defines a labeling of this set, assigning the value 0 or 1 to each of the points. There are clearly at most 2^N different ways of separating a set of N points, and as f varies in H_n it will implement some of them. If H_n is a small set, it may not contain all the functions f needed to implement all the labelings, that is, to separate the points in all possible ways. The idea underlying the VC-dimension is to count the maximum number of labelings that can be implemented by a class of indicator functions.

Definition 2.1 The VC-dimension of a set of indicator functions Q(f, z), f \in H_n, is the maximum number h of vectors z_1, ..., z_h that can be separated into two classes in all 2^h possible ways using functions of the set H_n.
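As a small illustration of definition 2.1 (not part of the original paper), the following Python fragment checks by enumeration that balls in R^1, that is intervals {z : |z - c| <= r}, can separate 2 points in all possible ways but not 3. This is the special case d = 1 of the result of Dudley (1979) used in section 3, namely that balls in R^d have VC-dimension d + 1.

from itertools import combinations

# Illustrative only: labelings of a finite point set on the line that are
# realizable by indicator functions of intervals {z : |z - c| <= r}.
# For sorted points these are exactly the empty set, the singletons and
# the contiguous runs, which is what is enumerated below.

def interval_labelings(points):
    pts = sorted(points)
    subsets = {frozenset()}                      # a ball far from all points
    subsets.update(frozenset([p]) for p in pts)  # a tiny ball around one point
    for i, j in combinations(range(len(pts)), 2):
        subsets.add(frozenset(pts[i:j + 1]))     # a ball covering a contiguous run
    return subsets

for points in ([0.0, 1.0], [0.0, 1.0, 2.0]):
    shattered = len(interval_labelings(points)) == 2 ** len(points)
    print(len(points), "points shattered:", shattered)   # True, then False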

If, for any number N, it is possible to find N points z_1, ..., z_N that can be labeled in all the 2^N possible ways, we will say that the VC-dimension of H_n is infinite. We can now give the definition of VC-dimension in the case in which Q is a real-valued function (Vapnik, 1982):

Definition 2.2 Let A \le Q(f, z) \le B, f \in H_n, with A and B possibly infinite. The VC-dimension of Q is defined as the VC-dimension of the indicator function

Q_\beta(f, z) = \theta\big( Q(f, z) - \beta \big), \qquad \beta \in (A, B)

where \theta is the Heaviside step function.

The remarkable property of this quantity is that, as was proved by Vapnik and Chervonenkis, finiteness of the VC-dimension is a necessary and sufficient condition for the consistency of the empirical risk minimization technique (see eq. 6). If H_n is a set of functions parametrized by n parameters, it is clear that the VC-dimension and n will be related, but not always in an obvious way. In fact, although in many cases it is true that the VC-dimension is equal to the number of free parameters, there are cases in which the number of parameters is finite and the VC-dimension is infinite, and cases in which the number of parameters is infinite and the VC-dimension is finite. An important outcome of the work of Vapnik and Chervonenkis is that the uniform deviation between empirical risk and expected risk can be bounded in terms of the VC-dimension, as shown in the following theorem:

Theorem 2.1 (Vapnik and Chervonenkis) Let A \le Q(f, z) \le B, f \in H_n, let H_n be a set of totally bounded functions and h the VC-dimension of Q. Then, with probability 1 - \eta, the following inequality holds simultaneously for all the elements f of H_n:

| I[f] - I_{emp}[f; l] | \le (B - A) \sqrt{ \frac{ h \ln\frac{2el}{h} - \ln\frac{\eta}{4} }{ l } }
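For concreteness, a minimal Python sketch that evaluates the right-hand side of the bound of theorem (2.1); the numerical values of h, l, \eta and B - A below are arbitrary assumptions, not taken from the paper.

import math

def vc_bound(h, l, eta, B_minus_A=1.0):
    # right-hand side of theorem 2.1: uniform deviation between expected
    # and empirical risk, holding with probability 1 - eta
    return B_minus_A * math.sqrt((h * math.log(2 * math.e * l / h) - math.log(eta / 4)) / l)

# the bound decays like sqrt(ln l / l), independently of the dimensionality of z
for l in (10**2, 10**3, 10**4, 10**5):
    print(l, round(vc_bound(h=10, l=l, eta=0.05), 4))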

A very interesting feature of theorem (2.1) is that the error rate does not depend on the dimensionality of the variable z. Moreover, this error bound is non-asymptotic, meaning that it holds for any finite number of data points l. The quantity | I[f] - I_{emp}[f; l] | is often called the estimation error, and bounds of the type above are usually called VC bounds. Knowing that, with probability 1 - \eta, we have a bound like the one of theorem (2.1), of the form:

| I[f] - I_{emp}[f; l] | \le \varepsilon(n, l, \eta) \qquad \forall f \in H_n

is an important component needed to determine the generalization error of an approximation technique. We define the generalization error as \| f_0 - \hat f_{n,l} \|_{L_2(P)}, that is, the distance between our target function f_0 and the result of the training algorithm \hat f_{n,l}. In fact, it has been shown (Niyogi and Girosi, 1994) that the generalization error can always be bounded as follows:

\| f_0 - \hat f_{n,l} \|^2_{L_2(P)} \le \| f_0 - f_n \|^2 + 2\, \varepsilon(n, l, \eta)     (10)

The first part of the right-hand side of this inequality is called the approximation error, because it is a measure of how well elements of the relatively small set H_n can approximate the target function f_0, an element of the larger set F. Therefore, the generalization error is bounded by a combination of the approximation and the estimation error. The approximation error depends on the relative complexity of the sets F and H_n, and it is studied mainly in approximation theory. The estimation error does not depend on the target function f_0, but depends exclusively on the complexity of the approximating set H_n, through the VC-dimension h, which in turn depends on the number of parameters n. In both cases an approximation problem is involved, and therefore it is quite natural to ask whether techniques used to study the estimation error can be used to study the approximation error. In the next section we show that this is indeed the case, and we will show how to use VC bounds in approximation theory.

3 VC-dimension and approximation theory

It is now natural to ask whether bounds of the type of theorem (2.1) can be used in disciplines other than statistics. We will now show that a natural application can be found in approximation theory. In fact, let us consider functions of the form

f(x) = \int dt\; K(x, t)\, \lambda(t)     (11)

where x \in R^d, K is some kernel such that |K| \le \tau, and \lambda is a positive function in L_1. The function f has the form of the expected risk I[f] of eq. (8), if the following "translation" of definition (8) is performed:

I \to f, \quad f \to x, \quad z \to t, \quad Q \to K, \quad P \to \lambda, \quad H_n \to R^d

According to the rules above we also have that:

I_{emp}[f] \to \frac{1}{l} \sum_{i=1}^{l} K(x, t_i)

Therefore, applying theorem (2.1) one can show that if we sample l points {t_1, ..., t_l} according to \lambda(t) (more precisely, according to the probability distribution \lambda(t)/|\lambda|), the following error bound holds with probability 1 - \eta:

\left\| f(x) - \frac{|\lambda|}{l} \sum_{i=1}^{l} K(x, t_i) \right\|_{L_\infty} \le |\lambda|\, \tau \sqrt{ \frac{ h \ln\frac{2el}{h} - \ln\frac{\eta}{4} }{ l } }     (12)

where |\lambda| \equiv \|\lambda\|_{L_1}, and h is the VC-dimension of K(x, t) as in definition (2.2), in which the role of z is played by t and the role of f is played by x. In order to make this result useful to approximation theory two steps must be taken:

1. The restriction that \lambda is positive should be eliminated. This is easily done by writing, in eq. (11), \lambda as \lambda(t) = sign(\lambda(t)) |\lambda(t)|. If we now consider the kernel K(x, t) sign(\lambda(t)) rather than the kernel K(x, t), the following error bound holds with probability 1 - \eta:

\left\| f(x) - \frac{|\lambda|}{l} \sum_{i=1}^{l} c_i\, K(x, t_i) \right\|_{L_\infty} \le |\lambda|\, \tau \sqrt{ \frac{ h \ln\frac{2el}{h} - \ln\frac{\eta}{4} }{ l } }

where the c_i represent a random sampling of sign(\lambda(t)) and therefore only take the values 1 or -1.

2. The bound above is not useful as it is, being formulated in probability. However, convergence in probability implies that for any \eta there always exists at least one set of l binary coefficients c_1, ..., c_l and l points t_1, ..., t_l such that the bound holds. In the limit \eta \to 1, the bound assumes the form:

\left\| f(x) - \frac{|\lambda|}{l} \sum_{i=1}^{l} c_i\, K(x, t_i) \right\|_{L_\infty} < |\lambda|\, \tau \sqrt{ \frac{ h \ln\frac{2el}{h} + \ln 4 }{ l } }
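A minimal one-dimensional numerical sketch of this construction (the kernel, the function \lambda, and all numerical choices below are assumptions made only for illustration): sample the centers t_i from |\lambda(t)|/|\lambda|, set c_i = sign(\lambda(t_i)), and measure the sup-norm error of the resulting superposition.

import numpy as np

# Illustrative only: the sampling argument behind eq. (12) and steps 1-2,
# in one dimension, with K(x, t) = exp(-(x - t)^2) (so tau = 1) and an
# arbitrarily chosen signed function lambda supported on [-3, 3].

rng = np.random.default_rng(0)
t_grid = np.linspace(-3.0, 3.0, 4001)
x_grid = np.linspace(-3.0, 3.0, 401)
dt = t_grid[1] - t_grid[0]

def lam(t):
    return np.sin(t) * np.exp(-t ** 2)

def K(x, t):
    return np.exp(-(x - t) ** 2)

# target f(x) = integral of K(x, t) lambda(t) dt, evaluated on a grid of x values
f_true = K(x_grid[:, None], t_grid[None, :]) @ lam(t_grid) * dt
abs_mass = np.sum(np.abs(lam(t_grid))) * dt          # |lambda| = ||lambda||_{L_1}
p = np.abs(lam(t_grid))
p = p / p.sum()                                      # sampling density |lambda(t)| / |lambda|

for l in (100, 1000, 10000):
    t_i = rng.choice(t_grid, size=l, p=p)            # centers sampled from |lambda|/|lambda|
    c_i = np.sign(lam(t_i))                          # binary coefficients, +1 or -1
    f_hat = (abs_mass / l) * np.sum(c_i * K(x_grid[:, None], t_i[None, :]), axis=1)
    print(l, np.max(np.abs(f_true - f_hat)))         # sup-norm error on the grid

The observed sup-norm error decreases as l grows, in qualitative agreement with the bound above.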

As an example of application of this idea, let us make the choice K(x, t) = G_m(x - t), where G_m is the Bessel potential function, that is, the function whose Fourier transform is

\tilde G_m(s) = (1 + \|s\|^2)^{-\frac{m}{2}}

Functions of the form (11) are therefore functions that can be written as f = G_m * \lambda for some \lambda \in L_1. This normed function space, denoted by L^m_1, is called the Bessel potential space or Sobolev-Liouville space, and the norm of f = G_m * \lambda is simply defined as \|f\|_{L^m_1} = |\lambda|. If the smoothness index m is smaller than or equal to the number of variables d, the Bessel potential is not continuous at the origin. Since we are interested in approximating continuous functions, we only consider the case m > d, which is well known to guarantee that L^m_1 \subset C[R^d]. Moreover, this constraint also ensures that there exists \tau such that |G_m(x)| \le \tau over R^d. From the previous considerations, and taking \tau = 1, we obtain the following result as an application of the VC bound (2.1):

Proposition 3.1 Let f be an element of L^m_1, with m > d. Then we can find l points {t_1, ..., t_l} and l coefficients c_i \in \{-1, 1\} such that

\left\| f(x) - \frac{\|f\|_{L^m_1}}{l} \sum_{i=1}^{l} c_i\, G_m(x - t_i) \right\|_{L_\infty} < \|f\|_{L^m_1} \sqrt{ \frac{ h_m \ln\frac{2el}{h_m} + \ln 4 }{ l } }

where h_m is the VC-dimension of G_m(x - t).

This result, which is an approximation error estimate, is not particularly satisfactory, because it involves the kernel G_m, whose analytic expression is unknown, and the VC-dimension h_m, which is also unknown. A more interesting result can be derived by noticing that the kernel G_m has an integral representation as an infinite superposition of Gaussians of different variances (Stein, 1970, p. 132):

G_m(x) = \frac{(2\sqrt{\pi})^{-m}}{\Gamma(\frac{m}{2})} \int_0^\infty d\beta\; \beta^{-\frac{m-d}{2}-1}\, e^{-\frac{1}{4\beta}}\, e^{-\beta \|x\|^2}     (13)

Therefore, we see that functions of the form f = G_m * \lambda can be written as

f(x) = \frac{(2\sqrt{\pi})^{-m}}{\Gamma(\frac{m}{2})} \int d\beta\, dt\; e^{-\beta \|x - t\|^2}\, \nu(\beta, t)     (14)

where

\nu(\beta, t) = e^{-\frac{1}{4\beta}}\, \beta^{-\frac{m-d}{2}-1}\, \lambda(t)

Notice that equation (14) is of the form (11), with the replacements t \to (\beta, t) and \lambda \to \nu. Moreover, since we have made the assumption m > d, the function \nu(\beta, t) is integrable over \beta and t, and its total variation will be denoted by C_{m,d} |\lambda|. In order to apply the bound of theorem (2.1) we need to compute the VC-dimension h of the kernel \exp(-\beta \|t - x\|^2), where x is to be considered the free parameter and (\beta, t) corresponds to the variable z of definition (2.2). By definition (2.2), h is the VC-dimension of the indicator function \theta(\exp(-\beta \|t - x\|^2) - \alpha), where \alpha is an auxiliary free parameter. By an appropriate redefinition of \alpha this is equivalent to the VC-dimension of \theta(\alpha - \beta \|t - x\|^2). Since the variable \beta only rescales the free parameter \alpha, it does not play any role in the computation of h and can be set to 1. Now the problem is equivalent to computing the VC-dimension of the indicator function of an arbitrary d-dimensional sphere, which is known to be h = d + 1 (Dudley, 1979). Putting together all the previous considerations, we can therefore conclude that the following proposition holds true:

Proposition 3.2 Let f be an element of L^m_1, with m > d. Then we can find l points {t_1, ..., t_l}, l variances {\beta_1, ..., \beta_l} and l coefficients c_i \in \{-1, 1\} such that

\left\| f(x) - \frac{C_{m,d}\, \|f\|_{L^m_1}}{l} \sum_{i=1}^{l} c_i\, e^{-\beta_i \|x - t_i\|^2} \right\|_{L_\infty} < C_{m,d}\, \|f\|_{L^m_1} \sqrt{ \frac{ (d+1) \ln\frac{2el}{d+1} + \ln 4 }{ l } }
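To make concrete how mildly the rate of proposition 3.2 depends on the dimension d (the values of l and d below, and the normalization C_{m,d} \|f\|_{L^m_1} = 1, are arbitrary illustrative assumptions), one can tabulate it in Python:

import math

# Illustrative only: the rate of proposition 3.2 with C_{m,d} ||f||_{L^m_1} set to 1;
# the dimension d enters only through the factor (d + 1) ln(2el/(d + 1)) under the root.
def prop_3_2_rate(l, d):
    return math.sqrt(((d + 1) * math.log(2 * math.e * l / (d + 1)) + math.log(4)) / l)

for d in (1, 10, 100):
    print(d, [round(prop_3_2_rate(l, d), 3) for l in (10**3, 10**4, 10**5)])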

4 Remarks

The result derived in the previous section is not particularly elegant, nor do we expect the bound to be particularly tight. However, it is comparable, except for the logarithmic factor, to other results (Jones, 1992; Barron, 1993; Breiman, 1993; Girosi and Anzellotti, 1993; Mhaskar and Micchelli, 1994), which all share the characteristic of being "dimension independent", although they apply to different classes of target functions. Except for the work of Mhaskar and Micchelli (1994), in all the other cases the reason for obtaining a "dimension independent" result lies in the existence of an integral representation of the form (11). However, in those cases the result is obtained using an approximation lemma (Jones, 1992; Barron, 1993) which is valid in Hilbert spaces and leads naturally to results that only hold in the L_2 norm.² The main point of this paper is the technique used to derive propositions (3.2) and (3.1), rather than the specific results. The concept of VC-dimension has never been used in approximation theory, and it usually appears in statistical learning theory in the estimation bounds, that is, bounds on the uniform deviation between empirical and expected risk. We find the appearance of the VC-dimension in an approximation bound intriguing, and it might open the way for deeper, cross-disciplinary investigations.

² However, a result in the L_\infty norm, very similar to the one of proposition (3.2), has been obtained by Girosi and Anzellotti (1993) working with Hilbert spaces of the Sobolev type, rather than with L_2.

References

[1] A.R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930-945, May 1993.

[2] L. Breiman. Hinging hyperplanes for regression, classification, and function approximation. IEEE Transactions on Information Theory, 39(3):999-1013, May 1993.

[3] R.M. Dudley. Balls in R^k do not cut all subsets of k+2 points. Advances in Mathematics, 31:306-308, 1979.

[4] F. Girosi and G. Anzellotti. Rates of convergence for radial basis functions and neural networks. In R.J. Mammone, editor, Artificial Neural Networks for Speech and Vision, pages 97-113, London, 1993. Chapman & Hall.

[5] F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural networks architectures. Neural Computation, 7:219-269, 1995.

[6] L.K. Jones. A simple lemma on greedy approximation in Hilbert space and convergence rates for Projection Pursuit Regression and neural network training. The Annals of Statistics, 20(1):608-613, March 1992.

[7] H.N. Mhaskar and C.A. Micchelli. Dimension independent bounds on the degree of approximation by neural networks. IBM Journal of Research and Development, 38:277-284, 1994.

[8] J. Moody and C. Darken. Fast learning in networks of locally-tuned processing units. Neural Computation, 1(2):281-294, 1989.

[9] P. Niyogi and F. Girosi. On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. A.I. Memo 1467, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 1994.

[10] E.M. Stein. Singular Integrals and Differentiability Properties of Functions. Princeton University Press, Princeton, N.J., 1970.

[11] V.N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, Berlin, 1982.

[12] V.N. Vapnik and A.Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 17(2):264-280, 1971.
