On the Near Optimality of the Stochastic Approximation of Smooth Functions by Neural Networks

V.E. Maiorov
Department of Mathematics, Technion, Haifa 32000, Israel

R. Meir
Department of Electrical Engineering, Technion, Haifa 32000, Israel

November 16, 1997

Abstract

We consider the problem of approximating the Sobolev class of functions by neural networks with a single hidden layer, establishing both upper and lower bounds. The upper bound uses a probabilistic approach based on the Radon and wavelet transforms, and yields rates similar to those derived recently under more restrictive conditions on the activation function; moreover, the construction using the Radon and wavelet transforms seems very natural to the problem. Additionally, geometrical arguments are used to establish lower bounds for two types of commonly used activation functions. The results demonstrate the tightness of the bounds, up to a factor logarithmic in the number of nodes of the neural network.

Key words: Stochastic approximation, Neural networks, Upper bounds, Lower bounds

Running title: Near optimality of stochastic approximation by neural networks

This research was partially supported by a grant from the Israel Science Foundation. Support from the Ollendorff Center of the Department of Electrical Engineering at the Technion is also acknowledged. The first author was partially supported by the Center for Absorption in Science, Ministry of Immigrant Absorption, State of Israel. Part of the work of the second author was done while visiting the Isaac Newton Institute, Cambridge, England.


1 Introduction

Investigations of the approximation of multivariate real-valued functions by neural networks have, in recent years, assumed an important place in the general theory of nonlinear function approximation (see, for example, [2], [5], [14], to name but a few). General theorems pertaining to the density of typical neural networks in various functional spaces can be found in [14], [16] and [17]. Going beyond density issues, the various methods for obtaining upper bounds on approximation rates can be broken up into two broad classes. The first class, composed of stochastic methods, is based on the demonstration that some well-defined stochastic procedure yields an approximating structure which, on average, possesses some approximation rate. Standard arguments then lead to the conclusion that there exists a certain function in the approximating class with the desired approximation rate. A second, more classical, approach is based on the construction of a specific algorithm for which the desired approximation rates can be established. We refer to this class of methods as deterministic.

Within the stochastic approach, Barron [2] considered the approximation of functions by neural networks, and obtained upper bounds on approximation rates for classes defined by the Fourier transform. More recently, Delyon et al. [9], developing the method of Barron [2], obtained an upper bound for the approximation of functions from the Sobolev class $W_p^{r,d}$ by linear combinations of wavelet functions. We observe that the stochastic methods alluded to above usually make use of some integral representation of the function being approximated, using a kernel depending on the structure of the particular approximation method, followed by an application of the Monte Carlo method. Barron [2] makes use of the Fourier representation of functions, while Delyon et al. [9] utilize the multi-dimensional wavelet representation.

In this paper we consider the neural network manifold

$$H_n(\varphi) = \left\{ h(x) = \sum_{k=1}^{n} c_k\, \varphi(a_k \cdot x + b_k) : a_k \in \mathbb{R}^d,\ c_k, b_k \in \mathbb{R},\ \forall k \right\}, \qquad (1)$$

where $\varphi$ is some sigmoidal function on $\mathbb{R}$, and $a_k \cdot x$ is the inner product of the vectors $a_k$ and $x$ in $\mathbb{R}^d$. Note that $H_n(\varphi)$ has the following invariance property with respect to affine transformations: if $h(x) \in H_n(\varphi)$ then $h(Ax + t) \in H_n(\varphi)$ for any nondegenerate matrix $A$ and vector $t \in \mathbb{R}^d$. Expressing the function to be approximated by an integral representation combining both the Radon [12] and wavelet [8] transforms, we obtain an upper bound of order $n^{-(r/d - \varepsilon)}$ for the approximation error of functions from the Sobolev class by the manifold $H_n(\varphi)$, where $\varepsilon$ is an arbitrary positive number. The techniques used for obtaining our estimates of the upper bound make use of the methods of Delyon et al. [9], as well as basic properties of the Radon and wavelet transforms.
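To make the object $H_n(\varphi)$ concrete, the following short sketch (illustrative only; the logistic activation, the dimensions and the random weights are arbitrary choices, not taken from the paper) evaluates a member of $H_n(\varphi)$ and checks numerically that an affine change of variables $x \mapsto Ax + t$ yields another member of the same manifold, with weights $a_k' = A^\top a_k$ and biases $b_k' = b_k + a_k \cdot t$.

```python
import numpy as np

def network(x, a, b, c, phi=lambda s: 1.0 / (1.0 + np.exp(-s))):
    """Evaluate h(x) = sum_k c_k * phi(a_k . x + b_k) for a batch of points x.

    x: (N, d) inputs, a: (n, d) weights, b: (n,) biases, c: (n,) outer weights.
    """
    return (c * phi(x @ a.T + b)).sum(axis=1)

rng = np.random.default_rng(0)
d, n, N = 3, 5, 10
a = rng.normal(size=(n, d)); b = rng.normal(size=n); c = rng.normal(size=n)
A = rng.normal(size=(d, d)); t = rng.normal(size=d)   # affine map x -> A x + t
x = rng.normal(size=(N, d))

# h(Ax + t) equals the network with transformed weights a'_k = A^T a_k and
# biases b'_k = b_k + a_k . t, so it again lies in H_n(phi).
lhs = network(x @ A.T + t, a, b, c)
rhs = network(x, a @ A, b + a @ t, c)
print(np.allclose(lhs, rhs))   # True
```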

Deterministic methods for the approximation of functions by neural networks are considered in the works of Mhaskar and Micchelli [18], Chui and Mhaskar [4], and Mhaskar [16]. The latter work considered the approximation of the Sobolev class $W_p^{r,d}$ by the manifold $H_n(\varphi)$ for appropriate $\varphi$, and gave the best possible upper bound, of order $n^{-r/d}$, but using sigmoidal functions $\varphi$ from a rather restrictive class. DeVore, Oskolkov and Petrushev [7] obtained an upper bound on the rate of approximation of the Sobolev class $W_p^{r,d}$ for the special case of two-dimensional functions, $d = 2$, for a wide class of sigmoidal functions. Recently Petrushev [19] has extended these results to the class $W_2^{r,d}$, $d > 2$.

The problem of deriving lower bounds on the approximation rates achievable by neural networks has received considerably less emphasis. An important step, however, was taken by DeVore, Howard and Micchelli [6], who showed that under the additional requirement of continuity of the approximation operator, a lower bound of order $n^{-r/d}$ holds for approximating the Sobolev class $W_p^{r,d}$ by nonlinear $n$-dimensional manifolds. In this work we obtain a lower bound of order $(n \ln n)^{-r/d}$ without the additional continuity restriction. We consider sigmoidal functions of two types: piecewise polynomial functions and the standard sigmoid $\varphi(x) = (1 + e^{-x})^{-1}$. These results make use of known results from the analysis of multivariate algebraic polynomials ([24], [25], [13]; see the review in [22]).

Before presenting the main results of this work, we stress that the general conclusions presented here have broad applicability beyond the field of function approximation. It has become clear in recent years that efficient and robust strategies exist whereby multivariate functions may be estimated using samples of the function values at a finite set of points (see, for example, [23]). It turns out that the error incurred by such schemes is composed of two parts, the first being the approximation error discussed above, and the second being a stochastic error resulting from the finiteness of the sample used to estimate an appropriate approximating function. Thus, any attempt to deliver useful performance bounds for function estimation requires, as a first step, the establishment of tight bounds on the approximation error, of the form discussed in this work and the other papers alluded to above.

The remainder of the paper is organized as follows. We begin in Section 2 by expressing a general function as an integral representation, using properties of the Radon and wavelet transforms. This integral is then approximated in Section 3 by a finite sum, and upper bounds are derived on the approximation error. Section 4 is devoted to the computation of a lower bound for the case of piecewise polynomial activation functions. Finally, we extend the lower bound results in Section 5 to the widely studied case of the standard sigmoidal activation function.


2 An Integral Representation for Neural Networks

In this section we pursue the line of thought discussed in Section 1, pertaining to the stochastic approach to approximation. These methods are based on two basic ingredients. First, the function of interest is expressed as a convex integral of the form

$$f(x) = \int \phi(x, z)\, w(z)\, dz,$$

where $\phi$ and $w$ depend on the function $f$, and the non-negative function $w(z)$ can be thought of as a probability density function for the `random variable' $z$, namely $\int w(z)\, dz = 1$. The second step then corresponds to approximating the integral by a finite sum, as in the theory of Monte Carlo integration (see, for example, [2]). Standard results from the latter theory then yield upper bounds on the rate of convergence of the approximation to the exact value of the function.

Following the work of Barron [2], various authors have recently utilized this basic idea in constructing neural network approximations. In this section we follow the line of work of Delyon et al. [9], who applied it to wavelet networks rather than neural networks. The basic idea behind the integral representation constructed in this work is the use of both the Radon and the wavelet transforms. First we observe that neural networks, as in (1), are characterized by constant values on hyperplanes in $\mathbb{R}^d$. This observation leads naturally to the Radon transform, which represents general multivariate functions by a superposition of integrals over hyperplanes in $\mathbb{R}^d$. This property then permits the transformation of the problem into a one-dimensional one, for which the wavelet integral representation is well known and understood. One advantage of using the Radon transform in this fashion is that, in contrast to the work of Delyon et al. [9], only one-dimensional wavelets need to be considered, as opposed to the $d$-dimensional wavelets of [9]. We observe that the Radon transform has been used previously in work related to approximation by neural networks. In particular, Chen et al. [3] have provided an elegant proof of the density property of neural networks using this transform. In this work, however, we go beyond density and establish convergence rates as well.

Let $K$ be a compact set in the space $\mathbb{R}^d$, and $1 \le p \le \infty$. Consider the space $L_p(K; \mathbb{R}^d)$ of functions defined on $\mathbb{R}^d$ with support in the set $K$ and norm

$$\|f\|_p = \left( \int_{\mathbb{R}^d} |f(x)|^p\, dx \right)^{1/p}.$$

We denote the ball of radius $r$ in $\mathbb{R}^d$ by $B^d(r) = \{x = (x_1, \ldots, x_d) : \sum_{i=1}^d x_i^2 \le r^2\}$. In the sequel we mainly consider the unit ball $B^d(1)$, and will simplify the notation somewhat by writing $B^d = B^d(1)$ and $L_p = L_p(B^d; \mathbb{R}^d)$. The results can be immediately extended to general compact domains $K$ by use of standard extension theorems, as in [1]. For any two sets $W, H \subset L_p$ we define the distance of $W$ from $H$ by

$$\mathrm{dist}(W, H; L_p) = \sup_{f \in W} \mathrm{dist}(f, H; L_p), \qquad (2)$$

where $\mathrm{dist}(f, H; L_p) = \inf_{h \in H} \|f - h\|_{L_p}$. Furthermore, for any function $f \in L_1$ we denote by $F(f)$ or $\hat{f}$ the Fourier transform of $f$,

$$\hat{f}(u) = (2\pi)^{-d/2} \int_{\mathbb{R}^d} f(x)\, e^{i u \cdot x}\, dx,$$

where $u \in \mathbb{R}^d$ and $u \cdot x$ is the inner product of $u$ and $x$. The inverse Fourier transform will be denoted by $F^{-1}$. In the space $L_1$ define the derivative of order $\lambda \ge 0$ as

$$D^{\lambda} f = (-\Delta)^{\lambda/2} f = F^{-1}\{ |u|^{\lambda} F(f)(u) \}, \qquad (3)$$

where $|u| = \sqrt{u_1^2 + \cdots + u_d^2}$. In the space $L_p$ consider the Sobolev class of functions

$$W_p^{r,d} = \Bigl\{ f : \max_{0 \le \lambda \le r} \|D^{\lambda} f\|_p \le 1 \Bigr\}.$$

As stressed above, a basic tool in the construction of this section is the wavelet transform (see, for example, [11]). We follow here the definitions and notation of Delyon et al. [9].

The Wavelet Transform

We introduce the class $\Gamma_q^r$ of functions $\varphi$ defining the neural network class (1). Let $\varphi$ be some function in the space $L_1(\mathbb{R})$. Using the function $\varphi$, construct a second function $\psi$ on $\mathbb{R}$ satisfying the equality

$$\int_{\mathbb{R}} \lambda^{-1}\, \hat{\varphi}(\lambda)\, \hat{\psi}(\lambda)\, d\lambda = 1, \qquad (4)$$

where $\hat{\varphi}$ and $\hat{\psi}$ are the Fourier transforms of $\varphi$ and $\psi$, respectively. The function class $\Gamma_q^r$ consists of all functions $\varphi \in L_q(\mathbb{R}) \cap L_1(\mathbb{R})$ for which there exists a function $\psi$ satisfying (4) and such that for all $\lambda \in [0, r]$, $D^{\lambda}\varphi \in L_q(\mathbb{R})$ and $D^{-\lambda}\psi \in L_1(\mathbb{R})$.

We observe that in many neural network applications it is customary to use sigmoidal functions which approach a constant non-zero value at infinity. However, by taking suitable linear combinations of such functions, one can always obtain functions vanishing at infinity, which belong to $L_1(\mathbb{R})$. For example, for a sigmoidal function $\sigma(t)$, $t \in \mathbb{R}$, with $\lim_{t \to -\infty} \sigma(t) = 0$, $\lim_{t \to \infty} \sigma(t) = 1$ and $\sigma(t)$ non-decreasing, we require $\varphi(t) = \sigma(t + 1) - \sigma(t) \in \Gamma_q^r$.

Examples

One can easily verify that the functions

$$\varphi(t) = \sqrt{2}\, e^{-t^2/2}, \qquad \frac{1}{\sqrt{2}}(1 - t^2)\, e^{-t^2/2}, \qquad \frac{(t+1)^3}{3}\, \chi_{[-1,0]}(t) + \frac{(1-t)^3}{3}\, \chi_{[0,1]}(t),$$

where $\chi_{\Delta}$ is the characteristic function of the segment $\Delta$, belong to the class $\Gamma_q^r$. The corresponding functions $\psi$ are, respectively,

$$\psi(t) = \sqrt{2}\,(1 - t^2)\, e^{-t^2/2}, \qquad \frac{1}{\sqrt{2}}(1 - t^2)\, e^{-t^2/2}, \qquad -\chi_{[-1,0]}(t) + \chi_{[0,1]}(t).$$

Using any legitimate pair of functions $\varphi$ and $\psi$ as above, and following [8], we construct the wavelet integral representation of functions on $\mathbb{R}$. Let $g(t)$ be any function from the space $L_1(\mathbb{R})$. Then

$$g(t) = \int_{\mathbb{R}_+ \times \mathbb{R}} u(a, \tau)\, \varphi(a(t - \tau))\, a\, da\, d\tau, \qquad (5)$$

where $a \in \mathbb{R}_+$, $\tau \in \mathbb{R}$, and the function $u$ is defined as

$$u(a, \tau) = \int_{\mathbb{R}} g(t)\, \psi(a(t - \tau))\, dt. \qquad (6)$$
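For the reader's convenience we add the following heuristic check of (5), which is not in the original text; it ignores normalization constants and the precise range of the frequency integral, both of which depend on the Fourier and sign conventions in force.

```latex
\[
\mathcal{F}_{t\to\xi}\!\left[\int_{\mathbb{R}_+\times\mathbb{R}} u(a,\tau)\,\varphi(a(t-\tau))\,a\,da\,d\tau\right]
\;\propto\;
\hat g(\xi)\int_0^{\infty} a^{-1}\,\hat\varphi(\xi/a)\,\hat\psi(\xi/a)\,da
\;=\;
\hat g(\xi)\int \lambda^{-1}\,\hat\varphi(\lambda)\,\hat\psi(\lambda)\,d\lambda
\;\overset{(4)}{=}\;\hat g(\xi).
\]
```

Thus, after the substitution $\lambda = \xi/a$, condition (4) is exactly what makes the $a$-integral collapse to $1$, so the right-hand side of (5) has (up to convention-dependent constants) the same Fourier transform as $g$.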

In order to utilize this one-dimensional integral representation for neural networks, we first use the Radon transform to express a general multivariate function in terms of projections onto hyperplanes.

The Radon Transform

Let $S^{d-1} = \{x \in \mathbb{R}^d : \sum_{i=1}^d x_i^2 = 1\}$ be the unit sphere in $\mathbb{R}^d$. For a given $\omega \in S^{d-1}$ and $t \in \mathbb{R}_+$ consider the hyperplane in $\mathbb{R}^d$ given by $\Pi_{\omega,t} = \{x \in \mathbb{R}^d : x \cdot \omega = t\}$. For any $f \in W_1^{d-1,d}$ define the Radon transform $Rf(\omega, t)$ on $S^{d-1} \times \mathbb{R}$ by

$$Rf(\omega, t) = \int_{\Pi_{\omega,t}} f(x)\, dm(x), \qquad (7)$$

where $dm(x)$ is the Lebesgue measure on the hyperplane $\Pi_{\omega,t}$. Using the Radon transform, construct a function $g(\omega, t)$ on $S^{d-1} \times \mathbb{R}$. For odd $d$,

$$g(\omega, t) = \left( \frac{\partial}{\partial t} \right)^{d-1} \bigl( Rf(\omega, t) \bigr),$$

and for even $d$,

$$g(\omega, t) = H_t \left( \frac{\partial}{\partial t} \right)^{d-1} \bigl( Rf(\omega, t) \bigr),$$

where $H_t p(t) = \frac{1}{\pi}\, \mathrm{p.v.} \int_{\mathbb{R}} \frac{p(\tau)}{t - \tau}\, d\tau$ is the Hilbert transform of the function $p(t)$, and the integral is evaluated as a Cauchy principal value, i.e. $\mathrm{p.v.} \int_{\mathbb{R}} \frac{p(\tau)}{t - \tau}\, d\tau = \lim_{\delta \to +0} \int_{|t - \tau| > \delta} \frac{p(\tau)}{t - \tau}\, d\tau$. The function $f \in W_1^{d-1,d}$ is then given by the so-called inverse Radon transform (see [12])

$$f(x) = \int_{S^{d-1}} g(\omega, x \cdot \omega)\, d\omega, \qquad (8)$$

where $d\omega$ is the normalized Lebesgue measure on $S^{d-1}$. Two important properties of the Radon transform (see [12] and [21]) are the following. For any function $f \in L_1$ the following connection between the Radon and Fourier transforms holds:

$$Rf(\omega, t) = \int_{\Pi_{\omega,t}} f(x)\, dm(x) = (2\pi)^{-1} \int_{\mathbb{R}} \hat{f}(s\omega)\, e^{ist}\, ds. \qquad (9)$$

Moreover, the Hilbert transform satisfies the equation $\widehat{H_t g}(\omega) = \mathrm{sgn}(\omega)\, \hat{g}(\omega)$. Therefore, directly from the definition of the function $g(\cdot)$ and (9) it follows that

$$g(\omega, t) = (2\pi)^{-1} i^{d-1} \int_{\mathbb{R}} |s|^{d-1}\, \hat{f}(s\omega)\, e^{ist}\, ds. \qquad (10)$$
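As a concrete illustration of definition (7) (a worked example we add here, not taken from the original text), consider the Gaussian $f(x) = e^{-|x|^2/2}$ on $\mathbb{R}^d$. Splitting $x$ into its component $t$ along $\omega$ and the component in the orthogonal hyperplane gives

```latex
\[
Rf(\omega,t)\;=\;\int_{x\cdot\omega=t} e^{-|x|^2/2}\,dm(x)
\;=\; e^{-t^2/2}\int_{\mathbb{R}^{d-1}} e^{-|y|^2/2}\,dy
\;=\;(2\pi)^{(d-1)/2}\,e^{-t^2/2}.
\]
```

In particular, the Radon transform of a radial function does not depend on the direction $\omega$, and integration over a single direction reduces the problem to a one-dimensional profile, which is precisely the structure exploited by the wavelet representation of the previous subsection.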

At this stage we have all the tools needed in order to construct an integral representation combining the Radon and wavelet transforms. This representation will be used in Section 3 in order to compute an upper bound for the approximation error by neural networks.

The Radon-Wavelet Integral Representation

Using (8), (5) and (6) we construct the neural network integral representation

$$f(x) = \int_{S^{d-1} \times \mathbb{R}_+ \times \mathbb{R}} u(\omega, a, \tau)\, \varphi(a(x \cdot \omega - \tau))\, a\, da\, d\tau\, d\omega, \qquad (11)$$

where

$$u(\omega, a, \tau) = \int_{\mathbb{R}} g(\omega, t)\, \psi(a(t - \tau))\, dt. \qquad (12)$$

This expression may be simplified using the following notation. Let $\beta$ and $l$ be some positive numbers. Consider the set $Z = S^{d-1} \times \mathbb{R}_+ \times \mathbb{R}$, and denote elements of $Z$ by $z = (\omega, a, \tau)$. Introduce a measure on $Z$ by $dz = d\omega\, da\, d\tau$. Then (11) may be re-written as

$$f(x) = Q \int_Z F(x, z)\, w(z)\, dz, \qquad (13)$$

where we have used the notation

$$F(x, z) \equiv F(x, z; \beta, l) = \frac{a^{1-\beta}}{1 + a^l}\, \varphi(a(x \cdot \omega - \tau))\, \mathrm{sgn}(u(z)), \qquad
w(z) = \frac{a^{\beta}(1 + a^l)\, |u(z)|}{Q(u, \beta, l)}, \qquad
Q \equiv Q(u, \beta, l) = \int_Z a^{\beta}(1 + a^l)\, |u(z)|\, dz. \qquad (14)$$

From the definition of the function $w$ it follows that $w(z) \ge 0$ for all $z \in Z$, and

$$\int_Z w(z)\, dz = 1. \qquad (15)$$

Thus, $w(z)$ can be viewed as a probability density function on the set $Z$. For any $w$-measurable function $G(z)$ we denote by

$$E_w(G) = \int_Z G(z)\, w(z)\, dz \qquad (16)$$

the average value of $G(z)$, where $z$ is viewed as a random variable with probability density function $w(z)$. From (13) and (16) we have, for all $x$,

$$f(x) = Q\, E_w(F(x, z)). \qquad (17)$$

3 Upper Bound

Starting from the integral representation (13), we construct an approximation to $f(x)$ by a finite sum, corresponding to a neural network. For this purpose use is made of the probabilistic structure inherent in the probability measure $w$. Fix a number $n$ and points $z_1, \ldots, z_n \in Z$, and let $\bar{z} = (z_1, \ldots, z_n)$. Consider the function

$$f_n(x; \bar{z}) = \frac{Q}{n} \sum_{i=1}^{n} F(x, z_i), \qquad (18)$$

as a function of the variable $x$. Consider the direct product $Z^n = Z \times \cdots \times Z$ of $n$ copies of the set $Z$, and define on $Z^n$ the product measure $\bar{w} = w \times \cdots \times w$. For any function $h(\bar{z})$ defined on $Z^n$ we set

$$E_{\bar{w}}(h) = \int_{Z^n} h(\bar{z})\, \bar{w}(\bar{z})\, d\bar{z} = \int_{Z^n} h(z_1, \ldots, z_n)\, w(z_1) \cdots w(z_n)\, dz_1 \cdots dz_n.$$
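The construction (18) is just a Monte Carlo estimate of the expectation in (17). The following toy sketch (our illustration; the kernel $F$, the density $w$ and the target $f$ below are arbitrary stand-ins, not the ones defined in (14)) shows the principle: a function written as an expectation is approximated by an empirical average over $n$ samples, with an error that decays roughly like $n^{-1/2}$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy example: f(x) = E_{z ~ N(0,1)}[cos(x z)] has the closed form exp(-x^2 / 2).
def F(x, z):
    return np.cos(np.outer(x, z))        # shape (len(x), len(z))

x = np.linspace(-3.0, 3.0, 61)
f_exact = np.exp(-x**2 / 2)

for n in (10, 100, 1000, 10000):
    z = rng.standard_normal(n)           # i.i.d. draws z_1, ..., z_n from w
    f_n = F(x, z).mean(axis=1)           # empirical average, the analogue of (18)
    err = np.max(np.abs(f_n - f_exact))
    print(f"n = {n:6d}   sup-error ~ {err:.3f}")
# The error decreases roughly like n^{-1/2}, the Monte Carlo rate behind the bounds below.
```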

Let $1 < q < \infty$ be a fixed number. Below we estimate the average value

$$E_{\bar{w}}\bigl(\|f(x) - f_n(x; \bar{z})\|_q^q\bigr) = \int_{B^d} E_{\bar{w}}\bigl(|f(x) - f_n(x; \bar{z})|^q\bigr)\, dx. \qquad (19)$$

From (13) and (18) we have

$$f(x) - f_n(x; \bar{z}) = \frac{Q}{n} \sum_{i=1}^{n} \bigl[ E_w(F(x, z)) - F(x, z_i) \bigr]. \qquad (20)$$

The Burkholder inequality (see, for example, [10]) guarantees the existence of a constant $c_0$, depending only on $q$, such that

$$E_{\bar{w}} \left| \sum_{i=1}^{n} \bigl[ E_w(F(x, z)) - F(x, z_i) \bigr] \right|^q \le c_0^q\, E_{\bar{w}} \left[ \Bigl( \sum_{i=1}^{n} \bigl| E_w(F(x, z)) - F(x, z_i) \bigr|^2 \Bigr)^{q/2} \right]. \qquad (21)$$

From (19)-(21), using Hölder's inequality in the form

$$\Bigl( \sum_{i=1}^{n} |a_i|^2 \Bigr)^{1/2} \le n^{(1/2 - 1/q)_+} \Bigl( \sum_{i=1}^{n} |a_i|^q \Bigr)^{1/q},$$

where $(\lambda)_+ = \lambda$ for $\lambda \ge 0$ and zero otherwise, we obtain

$$E_{\bar{w}}\bigl(\|f(x) - f_n(x; \bar{z})\|_q^q\bigr) \le \left( \frac{c_0 Q\, n^{(1/2 - 1/q)_+}}{n} \right)^q \int_{B^d} E_{\bar{w}} \Bigl( \sum_{i=1}^{n} \bigl| E_w F(x, z) - F(x, z_i) \bigr|^q \Bigr)\, dx,$$

and hence

$$E_{\bar{w}}\bigl(\|f(x) - f_n(x; \bar{z})\|_q^q\bigr) \le \left( \frac{2 c_0 Q}{n^{\delta}} \right)^q \int_{B^d} E_w |F(x, z)|^q\, dx, \qquad (22)$$

where $\delta = \min\{1 - \frac{1}{q}, \frac{1}{2}\}$. We now estimate the integral. Set the parameter in (14) to $\beta = 1 - 1/q$. Then we have

Lemma 1 For any $z \in Z$ the inequality

$$\int_{B^d} E_w |F(x, z)|^q\, dx \le v_{d-1}\, \|\varphi\|_{L_q(\mathbb{R})}^q$$

holds, where $v_{d-1} = \pi^{(d-1)/2} / \Gamma\bigl(\frac{d-1}{2} + 1\bigr)$ is the $(d-1)$-dimensional volume of the unit ball $B^{d-1}$.

Proof. From the definition (14) of the function $F$ we have

$$\int_{B^d} E_w\bigl(|F(x, z)|^q\bigr)\, dx = \int_{B^d} \int_Z |F(x, z)|^q\, w(z)\, dz\, dx
= \int_{B^d} \int_{S^{d-1}} \int_{\mathbb{R}_+} \int_{\mathbb{R}} \frac{a^{q(1-\beta)}}{(1 + a^l)^q}\, |\varphi(a(x \cdot \omega - \tau))|^q\, w(\omega, a, \tau)\, dx\, d\omega\, da\, d\tau$$
$$\le \int_{B^d} \int_{S^{d-1}} \int_{\mathbb{R}_+} \int_{\mathbb{R}} a\, |\varphi(a(x \cdot \omega - \tau))|^q\, w(\omega, a, \tau)\, dx\, d\omega\, da\, d\tau
= \int_{S^{d-1}} \int_{\mathbb{R}_+} \int_{\mathbb{R}} w(\omega, a, \tau)\, d\omega\, da\, d\tau \int_{B^d} a\, |\varphi(a(x \cdot \omega - \tau))|^q\, dx. \qquad (23)$$

Since $x \cdot \omega = s$ for any $x \in \Pi_{\omega,s}$, for any fixed $\omega$, $a$ and $\tau$ we have

$$\int_{B^d} a\, |\varphi(a(x \cdot \omega - \tau))|^q\, dx = \int_{-1}^{1} ds \int_{\Pi_{\omega,s} \cap B^d} a\, |\varphi(a(x \cdot \omega - \tau))|^q\, dm(x)
= \int_{-1}^{1} a\, |\varphi(a(s - \tau))|^q\, m(\Pi_{\omega,s} \cap B^d)\, ds \le v_{d-1} \int_{\mathbb{R}} |\varphi(a(s - \tau))|^q\, d(as) = v_{d-1}\, \|\varphi\|_{L_q(\mathbb{R})}^q.$$

Hence, from (23), using (15), we obtain

$$\int_{B^d} E_w |F(x, z)|^q\, dx \le v_{d-1}\, \|\varphi\|_{L_q(\mathbb{R})}^q \int_{S^{d-1} \times \mathbb{R}_+ \times \mathbb{R}} w(\omega, a, \tau)\, d\omega\, da\, d\tau = v_{d-1}\, \|\varphi\|_{L_q(\mathbb{R})}^q,$$

establishing the lemma.

From inequality (22) and Lemma 1 we have

$$E_{\bar{w}}\bigl(\|f(x) - f_n(x; \bar{z})\|_q^q\bigr) \le v_{d-1} \left( \frac{2 c_0 Q}{n^{\delta}} \right)^q \|\varphi\|_{L_q(\mathbb{R})}^q,$$

which, upon using Hölder's inequality, yields

$$E_{\bar{w}}\bigl(\|f(x) - f_n(x; \bar{z})\|_q\bigr) \le \Bigl[ E_{\bar{w}}\bigl(\|f(x) - f_n(x; \bar{z})\|_q^q\bigr) \Bigr]^{1/q} \le v_{d-1}^{1/q}\, \frac{2 c_0 Q}{n^{\delta}}\, \|\varphi\|_{L_q(\mathbb{R})}.$$

We therefore obtain the following statement.

Lemma 2 Let $\varphi \in \Gamma_q^r$, where $1 < q < \infty$ and $r \ge 0$. Then

$$E_{\bar{w}}\bigl(\|f(x) - f_n(x; \bar{z})\|_q\bigr) \le \frac{c_1 Q}{n^{\delta}}\, \|\varphi\|_{L_q(\mathbb{R})},$$

where $c_1 = 2 v_{d-1}^{1/q} c_0$.

A similar approximation result may be established for the derivative of the function f .

Lemma 3 For any $\varphi \in \Gamma_q^r$, $1 < q < \infty$, and any $0 \le l \le r$,

$$E_{\bar{w}}\bigl(\|D_x^l f(x) - D_x^l f_n(x; \bar{z})\|_q\bigr) \le \frac{c_1 Q}{n^{\delta}}\, \|D_t^l \varphi(t)\|_{L_q(\mathbb{R})}.$$

Proof. By analogy with (22) we have the estimate

$$E_{\bar{w}}\bigl(\|D_x^l f(x) - D_x^l f_n(x; \bar{z})\|_q^q\bigr) \le \left( \frac{2 c_0 Q}{n^{\delta}} \right)^q \int_{B^d} E_w\bigl(|D_x^l F(x, z)|^q\bigr)\, dx. \qquad (24)$$

We estimate the last integral. From definition (3) of the operator $D^l$ it follows that $D_x^l \varphi(a(x \cdot \omega - \tau)) = D_t^l \varphi(a(t - \tau))\big|_{t = x \cdot \omega} = a^l\, (D_t^l \varphi)(a(x \cdot \omega - \tau))$. Hence, from the definition (14) of the function $F$ and the relation $\beta = 1 - 1/q$, we conclude that

$$\int_{B^d} E_w\bigl(|D_x^l F(x, z)|^q\bigr)\, dx = \int_{B^d} \int_Z \frac{a^{lq}\, a}{(1 + a^l)^q}\, |(D_t^l \varphi)(a(x \cdot \omega - \tau))|^q\, w(z)\, dz\, dx \le \int_Z \left( \int_{B^d} |(D_t^l \varphi)(a(x \cdot \omega - \tau))|^q\, a\, dx \right) w(z)\, dz.$$

For the inner integral,

$$\int_{B^d} |(D_t^l \varphi)(a(x \cdot \omega - \tau))|^q\, a\, dx = \int_{-1}^{1} ds \int_{\Pi_{\omega,s} \cap B^d} |(D_t^l \varphi)(a(x \cdot \omega - \tau))|^q\, a\, dm(x)
= \int_{-1}^{1} |(D_s^l \varphi)(a(s - \tau))|^q\, a\, m(\Pi_{\omega,s} \cap B^d)\, ds \le v_{d-1}\, \|D_t^l \varphi(t)\|_{L_q(\mathbb{R})}^q.$$

Combining these results using Hölder's inequality, the claim of the lemma follows.

In the next lemma we estimate the quantity $Q = Q(u, \beta, l) = \int_Z a^{\beta}(1 + a^l)\, |u(z)|\, dz$, used to define the approximant $f_n(x; \bar{z})$ in (18). The function $u(z) = u(\omega, a, \tau)$ was defined in (12).

Lemma 4 If $\sigma > 1 - \frac{1}{q} + l$, $l \ge 0$, $q > 1$ are any numbers, $\beta = 1 - \frac{1}{q}$, and $\varphi \in \Gamma_q^r$, then

$$Q(u, \beta, l) \le c_2\, \max\bigl\{ \|\psi\|_{L_1(\mathbb{R})},\ \|\psi_\sigma\|_{L_1(\mathbb{R})} \bigr\} \int_{S^{d-1} \times \mathbb{R}} \bigl( |g(\omega, t)| + |D_t^{\sigma} g(\omega, t)| \bigr)\, d\omega\, dt,$$

where $\psi_\sigma = D^{-\sigma}\psi$ and $c_2 = \max\Bigl\{ \frac{2}{1 - 1/q},\ \frac{2}{\sigma - l - 1 + 1/q} \Bigr\}$.

Proof. We have from (12)

$$\int_{\mathbb{R}} |u(\omega, a, \tau)|\, d\tau = \int_{\mathbb{R}} \Bigl| \int_{\mathbb{R}} g(\omega, t)\, \psi(a(t - \tau))\, dt \Bigr|\, d\tau. \qquad (25)$$

Set $\psi_\sigma = D^{-\sigma}\psi$. From Plancherel's formula it follows that for fixed $\omega$, $a$ and $\tau$,

$$\int_{\mathbb{R}} g(\omega, t)\, \psi(a(t - \tau))\, dt = a^{-1} \int_{\mathbb{R}} \hat{g}(\omega, \lambda)\, \hat{\psi}(-\lambda/a)\, e^{i\lambda\tau}\, d\lambda
= a^{-\sigma - 1} \int_{\mathbb{R}} |\lambda|^{\sigma}\, \hat{g}(\omega, \lambda)\; |\lambda/a|^{-\sigma}\, \hat{\psi}(-\lambda/a)\, e^{i\lambda\tau}\, d\lambda.$$

Therefore, from (25), using the inequality $\|F^{-1}\{\hat{U}\hat{V}\}\|_{L_1} \le \|F^{-1}\hat{U}\|_{L_1}\, \|F^{-1}\hat{V}\|_{L_1}$, we have

$$\int_{\mathbb{R}} |u(\omega, a, \tau)|\, d\tau \le a^{-\sigma - 1}\, \|F^{-1}\{|\lambda|^{\sigma} \hat{g}(\omega, \lambda)\}\|_{L_1(\mathbb{R})}\, \|F^{-1}\{|\lambda/a|^{-\sigma} \hat{\psi}(-\lambda/a)\}\|_{L_1(\mathbb{R})}
= a^{-\sigma - 1}\, \|D_t^{\sigma} g(\omega, t)\|_{L_1(\mathbb{R})}\, \|\psi_\sigma\|_{L_1(\mathbb{R})}. \qquad (26)$$

Represent the integral in $Q(u, \beta, l)$ as

$$Q(u, \beta, l) = \int_Z a^{\beta}(1 + a^l)\, |u(z)|\, dz = \int_{S^{d-1}} \int_{\mathbb{R}} \left( \int_0^1 a^{\beta}(1 + a^l)\, |u(\omega, a, \tau)|\, da + \int_1^{\infty} a^{\beta}(1 + a^l)\, |u(\omega, a, \tau)|\, da \right) d\omega\, d\tau, \qquad (27)$$

and estimate each summand separately. From Fubini's theorem and inequality (26) with $\sigma = 0$ we have

$$\int_{S^{d-1}} \int_{\mathbb{R}} \int_0^1 a^{\beta}(1 + a^l)\, |u(\omega, a, \tau)|\, da\, d\omega\, d\tau
= \int_{S^{d-1}} d\omega \int_0^1 a^{\beta}(1 + a^l)\, da \int_{\mathbb{R}} |u(\omega, a, \tau)|\, d\tau$$
$$\le \|\psi\|_{L_1(\mathbb{R})} \int_{S^{d-1}} \|g(\omega, t)\|_{L_1(\mathbb{R})}\, d\omega \int_0^1 a^{\beta - 1}(1 + a^l)\, da
\le c_3\, \|\psi\|_{L_1(\mathbb{R})} \int_{S^{d-1}} \|g(\omega, t)\|_{L_1(\mathbb{R})}\, d\omega, \qquad (28)$$

where $c_3 = 2/(1 - \frac{1}{q})$. The second summand in (27) is estimated using Fubini's theorem once more and the inequality (26):

$$\int_{S^{d-1}} \int_{\mathbb{R}} \int_1^{\infty} a^{\beta}(1 + a^l)\, |u(\omega, a, \tau)|\, da\, d\omega\, d\tau
\le \|\psi_\sigma\|_{L_1(\mathbb{R})} \int_{S^{d-1}} \|D_t^{\sigma} g(\omega, t)\|_{L_1(\mathbb{R})}\, d\omega \int_1^{\infty} a^{\beta - \sigma - 1}(1 + a^l)\, da
\le c_4\, \|\psi_\sigma\|_{L_1(\mathbb{R})} \int_{S^{d-1}} \|D_t^{\sigma} g(\omega, t)\|_{L_1(\mathbb{R})}\, d\omega, \qquad (29)$$

where $c_4 = 2\bigl( \sigma - l - 1 + \frac{1}{q} \bigr)^{-1}$. From (28) and (29) we obtain the statement of Lemma 4.

In the following lemma we obtain an estimate for the approximation of a function $f$ and its derivatives by neural networks.

Lemma 5 Let $\varphi \in \Gamma_q^r$, $l \ge 0$, $1 < q < \infty$, and $0 < \varepsilon < 1 - \frac{1}{q}$ be any numbers, and let $\delta = \min\{1 - \frac{1}{q}, \frac{1}{2}\}$. Then

$$E_{\bar{w}}\bigl(\|D^l f(x) - D^l f_n(x; \bar{z})\|_q\bigr) \le \frac{c_6}{\varepsilon\, n^{\delta}} \Bigl( \|D^{\frac{d-1}{2}} f(x)\|_2 + \|D^{\frac{d-1}{2} + 1 - \frac{1}{q} + l + \varepsilon} f(x)\|_2 \Bigr),$$

where $D = D_x$ and $c_6$ is a constant depending only on $d$, $c_1$ and the pair $(\varphi, \psi)$.

Proof. Set $\sigma = 1 - \frac{1}{q} + l + \varepsilon$. From Lemmas 3 and 4 we conclude that

$$E_{\bar{w}}\bigl(\|D^l f(x) - D^l f_n(x; \bar{z})\|_q\bigr) \le \frac{c_1 Q}{n^{\delta}}\, \|D_t^l \varphi(t)\|_{L_q(\mathbb{R})}
\le \frac{c_1 c_2}{n^{\delta}}\, \max\bigl\{\|\psi\|_{L_1(\mathbb{R})}, \|\psi_\sigma\|_{L_1(\mathbb{R})}\bigr\}\, \|D_t^l \varphi(t)\|_{L_q(\mathbb{R})} \int_{S^{d-1} \times \mathbb{R}} \bigl( |g(\omega, t)| + |D_t^{\sigma} g(\omega, t)| \bigr)\, d\omega\, dt.$$

According to the conditions of the lemma, $\sigma - l - 1 + \frac{1}{q} = \varepsilon < 1 - \frac{1}{q}$, so that $c_2 = \max\bigl\{ \frac{2}{1 - 1/q}, \frac{2}{\varepsilon} \bigr\} = \frac{2}{\varepsilon}$. Therefore

$$E_{\bar{w}}\bigl(\|D^l f(x) - D^l f_n(x; \bar{z})\|_q\bigr) \le \frac{c_1'}{\varepsilon\, n^{\delta}} \int_{S^{d-1} \times \mathbb{R}} \bigl( |g(\omega, t)| + |D_t^{\sigma} g(\omega, t)| \bigr)\, d\omega\, dt, \qquad (30)$$

where $c_1'$ absorbs the norms of $\varphi$, $\psi$ and $\psi_\sigma$. From Hölder's inequality we have

$$\int_{S^{d-1} \times \mathbb{R}} |D_t^{\sigma} g(\omega, t)|\, d\omega\, dt \le c_5^{1/2} \left( \int_{S^{d-1} \times \mathbb{R}} |D_t^{\sigma} g(\omega, t)|^2\, d\omega\, dt \right)^{1/2}, \qquad (31)$$

where $c_5 = 2^d\, \Gamma(3/2)^d / \Gamma(d/2 + 1)$, and similarly for the term involving $|g(\omega, t)|$. From the identity (10) and the definition of the operator $D_t$ we have

$$D_t^{\sigma} g(\omega, t) = (2\pi)^{-1} i^{d-1}\, D_t^{\sigma} \int_{\mathbb{R}} |s|^{d-1}\, \hat{f}(s\omega)\, e^{ist}\, ds = (2\pi)^{-1} i^{d-1} \int_{\mathbb{R}} |s|^{\sigma + d - 1}\, \hat{f}(s\omega)\, e^{ist}\, ds.$$

Hence, by the Plancherel formula, we obtain

$$\int_{\mathbb{R}} |D_t^{\sigma} g(\omega, t)|^2\, dt = (2\pi)^{-2} \int_{\mathbb{R}} dt\, \Bigl| \int_{\mathbb{R}} |s|^{\sigma + d - 1}\, \hat{f}(s\omega)\, e^{ist}\, ds \Bigr|^2 = (2\pi)^{-2} \int_{\mathbb{R}} \bigl( |s|^{\sigma + d - 1}\, |\hat{f}(s\omega)| \bigr)^2\, ds. \qquad (32)$$

Passing to polar coordinates we have

$$\int_{S^{d-1}} \int_{\mathbb{R}} \bigl( |s|^{\sigma + d - 1}\, |\hat{f}(s\omega)| \bigr)^2\, ds\, d\omega = \int_{S^{d-1}} d\omega \int_{\mathbb{R}} |s|^{2\sigma + d - 1}\, |\hat{f}(s\omega)|^2\, |s|^{d-1}\, ds = \int_{\mathbb{R}^d} |x|^{2\sigma + d - 1}\, |\hat{f}(x)|^2\, dx = \|D^{(\sigma + \frac{d-1}{2})} f\|_2^2. \qquad (33)$$

Hence from (31)-(33) we conclude

$$\int_{S^{d-1} \times \mathbb{R}} |D_t^{\sigma} g(\omega, t)|\, d\omega\, dt \le (2\pi)^{-1} c_5^{1/2}\, \|D^{(\sigma + \frac{d-1}{2})} f\|_2,$$

and analogously for the term involving $|g(\omega, t)|$ (which corresponds to $\sigma = 0$). From here and (30) we finally obtain the statement of Lemma 5.

We are now ready to prove the main result of this section, establishing an upper bound for the approximation of the Sobolev class by neural networks. We state the results separately for small and for large values of the smoothness parameter of the given class.

Theorem 1 (Small smoothness) Let $f \in W_2^{d/2,d}$ and $\varphi \in \Gamma_2^{d/q}$. Then for any integers $d \ge 1$, $n \ge 2$, and any $1 < q < 2 - \frac{1}{\ln n}$, there exists a function $h \in H_n(\varphi)$ such that

$$\|f - h\|_q \le \frac{c\, \ln n}{n^{1/2}}\, \|f\|_{W_2^{d/2,d}},$$

where $c$ is a constant depending only on $d$ and $q$.

Proof. Set $l = 0$, $\tilde{q} = 2 - \frac{1}{\ln n}$, and $\varepsilon = \frac{1}{\tilde{q}} - \frac{1}{2}$. According to the conditions of the theorem we have $\varepsilon \ge \frac{1}{4 \ln n}$ and $n^{1 - 1/\tilde{q}} \ge e^{-1/2} n^{1/2}$. Therefore, from Hölder's inequality and Lemma 5 (applied with $\tilde{q}$) we obtain

$$E_{\bar{w}}\bigl(\|f(x) - f_n(x; \bar{z})\|_q\bigr) \le c_7\, E_{\bar{w}}\bigl(\|f(x) - f_n(x; \bar{z})\|_{\tilde{q}}\bigr) \le \frac{c_6 c_7}{\varepsilon\, n^{1 - 1/\tilde{q}}} \Bigl( \|D^{\frac{d-1}{2}} f(x)\|_2 + \|D^{\frac{d}{2}} f(x)\|_2 \Bigr) \le \frac{c\, \ln n}{n^{1/2}}\, \|f\|_{W_2^{d/2,d}},$$

where $c_7 = |B^d|^{1/q - 1/\tilde{q}}$ and $c = 4 e^{1/2} c_6 c_7$. Since the expected value obeys the required bound, there clearly exists a set of values of the parameters $\bar{z} = (\omega_i, a_i, \tau_i)_{i=1}^{n}$ which yields the desired result.

Theorem 2 (Large smoothness) Let $f \in W_2^{r,d}$ and let $\varphi \in \Gamma_q^r$ be any function, with $d \ge 1$, $1 < q \le 2$, $n \ge 2$, $r = \frac{kd}{2} + \frac{1}{\ln n}$, $k \in \mathbb{N}$. Then there exists a function $h \in H_n(\varphi)$ such that

$$\|f - h\|_q \le \frac{c\, (\ln n)^k}{n^{r/d}}\, \|f\|_{W_2^{r,d}},$$

where $c = (4 c_6)^k\, k^{r/d}$.

Proof. It suffices to prove the theorem for $q = 2$. This will be done using a variant of the iterative approach introduced in [9]. From Lemma 5 it follows that for any function $f \in W_2^{r,d}$ and any numbers $l \ge 0$, $0 < \hat{\varepsilon} < 1/2$, the following inequalities hold:

$$E_{\bar{w}}\bigl(\|f(x) - f_n(x; \bar{z})\|_2\bigr) \le \frac{c_6}{\hat{\varepsilon} \sqrt{n}} \Bigl( \|D^{\frac{d-1}{2}} f(x)\|_2 + \|D^{\frac{d}{2} + \hat{\varepsilon}} f(x)\|_2 \Bigr), \qquad (34)$$

$$E_{\bar{w}}\bigl(\|D^l(f(x) - f_n(x; \bar{z}))\|_2\bigr) \le \frac{c_6}{\hat{\varepsilon} \sqrt{n}} \Bigl( \|D^{\frac{d-1}{2}} f(x)\|_2 + \|D^{\frac{d}{2} + l + \hat{\varepsilon}} f(x)\|_2 \Bigr). \qquad (35)$$

Let $\{\bar{z}_i\}_{i=1}^{k}$ be $k$ independent draws of the random vector $\bar{z}$, with corresponding densities $\bar{w}_i = \bar{w}$. Then from (34) we have

$$E_{\bar{w}_1} E_{\bar{w}_2} \cdots E_{\bar{w}_k}\bigl(\|f(x) - f_n(x; \bar{z}_1) - \cdots - f_n(x; \bar{z}_k)\|_2\bigr)
\le \frac{c_6}{\hat{\varepsilon} \sqrt{n}} \Bigl\{ E_{\bar{w}_1} \cdots E_{\bar{w}_{k-1}}\bigl(\|D^{\frac{d-1}{2}}(f(x) - f_n(x; \bar{z}_1) - \cdots - f_n(x; \bar{z}_{k-1}))\|_2\bigr)$$
$$+\ E_{\bar{w}_1} \cdots E_{\bar{w}_{k-1}}\bigl(\|D^{\frac{d}{2} + \hat{\varepsilon}}(f(x) - f_n(x; \bar{z}_1) - \cdots - f_n(x; \bar{z}_{k-1}))\|_2\bigr) \Bigr\}.$$

Continuing this iterative inequality $k$ times, using (34) and (35), we obtain

$$E \equiv E_{\bar{w}_1} \cdots E_{\bar{w}_k}\bigl(\|f(x) - f_n(x; \bar{z}_1) - \cdots - f_n(x; \bar{z}_k)\|_2\bigr) \le \left( \frac{c_6}{\hat{\varepsilon} \sqrt{n}} \right)^{k} \sum_{s=0}^{k} \Bigl( \|D^{\frac{d-1}{2}} f\|_2 + \|D^{s(\frac{d}{2} + \hat{\varepsilon})} f\|_2 \Bigr).$$

Now define $\hat{\varepsilon}$ so that $k \hat{\varepsilon} = 1/\ln n$. Then for $r = \frac{kd}{2} + \frac{1}{\ln n}$ we have

$$E \le \frac{(2 c_6)^k}{n^{k/2}\, \hat{\varepsilon}^{k}} \max_{\lambda \le \frac{kd}{2} + \frac{1}{\ln n}} \|D^{\lambda} f\|_2 \le \frac{c\, (\ln n)^k}{n^{r/d}}\, \|f\|_{W_2^{r,d}},$$

which establishes the claim.

4 Lower Bound - Piecewise Polynomial Functions

In this section and the next we establish lower bounds on the approximation error achievable by neural networks, focusing on two specific types of activation functions $\varphi$. In particular, we first consider piecewise polynomial activation functions, followed in Section 5 by the study of the standard sigmoidal activation function.

The basic idea of the proof is to transform the problem into a finite-dimensional one by an appropriate linear transformation, as was done, for example, in [25]. Lower bounds for the latter problem can be more easily established and, due to the linearity of the transformation, serve as lower bounds for the original problem. We start with some notation. Denote the $l_p^m$-norm of a vector $f = (f_1, \ldots, f_m) \in \mathbb{R}^m$ by $\|f\|_{l_p^m} = (\sum_{i=1}^m |f_i|^p)^{1/p}$, $p \ge 1$, and let the unit ball in the space $l_p^m$ be denoted by $B_p^m = \{f \in \mathbb{R}^m : \|f\|_{l_p^m} \le 1\}$. For any $f$ define the vector $\mathrm{sgn}\, f = (\mathrm{sgn}\, f_1, \ldots, \mathrm{sgn}\, f_m)$, where $\mathrm{sgn}\, a = 1$ if $a > 0$, $\mathrm{sgn}\, a = -1$ if $a < 0$, and $\mathrm{sgn}\, a = 0$ if $a = 0$. For any set $G \subset \mathbb{R}^m$ let $\mathrm{sgn}\, G = \{\mathrm{sgn}\, f : f \in G\}$. Finally, if $A, B \subset l_p^m$, the distance of the set $A$ from $B$ is given by

$$\mathrm{dist}(A, B; l_p^m) = \sup_{a \in A} \mathrm{dist}(a, B; l_p^m),$$

where $\mathrm{dist}(a, B; l_p^m) = \inf_{b \in B} \|a - b\|_{l_p^m}$.

Let $\tilde{m}$ be any integer, and set $m = \tilde{m}^d$. Consider the cube $I^d = \bigl[ -\frac{1}{\sqrt{d}}, \frac{1}{\sqrt{d}} \bigr]^d$ contained in the unit ball $B^d$, and construct the uniform net

$$S_m = \left\{ \frac{1}{\sqrt{d}} \left( \frac{2 i_1}{\tilde{m} - 1} - 1, \ldots, \frac{2 i_d}{\tilde{m} - 1} - 1 \right) : 0 \le i_1, \ldots, i_d \le \tilde{m} - 1 \right\}$$

consisting of the $m$ points $\{\xi_1, \ldots, \xi_m\}$. Introduce the manifold in $\mathbb{R}^m$

$$\bar{H}_{nm}(\varphi) = \{ (h(\xi_1), \ldots, h(\xi_m)) : h \in H_n(\varphi) \},$$

which is the restriction of the manifold $H_n(\varphi)$ to $S_m$. Consider the finite set in $\mathbb{R}^m$

$$E = \{ (\varepsilon_1, \ldots, \varepsilon_m) : \varepsilon_i \in \{-1, +1\},\ \forall\, i = 1, \ldots, m \}.$$

In the following theorem we establish a lower bound for the distance of $E$ from $\bar{H}_{nm}(\varphi)$, specialized to the case where $\varphi$ is a piecewise polynomial function.

Theorem 3 Let $\varphi$ be a piecewise polynomial function on $\mathbb{R}$ with $q$ breakpoints $t_1, \ldots, t_q$, such that on each interval $[t_i, t_{i+1})$, $i = 0, 1, \ldots, q$ (here $t_0 = -\infty$ and $t_{q+1} = +\infty$), the function $\varphi$ is an algebraic polynomial of degree at most $r$. Let $n$ and $m$ be integers such that $m = [c(d + 2) n \log_2 n]$. Then

$$\mathrm{dist}(E, \bar{H}_{nm}(\varphi); l_1^m) \ge \tilde{c}\, m,$$

where $c$ and $\tilde{c}$ depend only on $d$, $q$ and $r$.

We first prove an auxiliary lemma.

Lemma 6 The cardinality of the set $\mathrm{sgn}\, \bar{H}_{nm}(\varphi)$ obeys, for any positive integers $n$ and $m$,

$$|\mathrm{sgn}\, \bar{H}_{nm}(\varphi)| \le \left( \frac{c\, m^2}{n} \right)^{(d+2)n},$$

where $c$ depends only on $d$, $q$ and $r$.
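Before giving the proof, the following small numerical experiment (our own illustration; the grid size, network width, activation and sampling scheme are arbitrary choices, not taken from the paper) conveys the point of Lemma 6: although $E$ contains $2^m$ sign vectors, a network with $n$ nodes and a piecewise polynomial activation realizes, in our runs, only a comparatively small number of distinct sign patterns on the $m$ grid points; the counting argument below quantifies this.

```python
import numpy as np

rng = np.random.default_rng(2)

d, n, m_tilde = 2, 2, 4
m = m_tilde ** d                                    # number of grid points

# Uniform grid S_m in the cube [-1/sqrt(d), 1/sqrt(d)]^d.
axis = np.linspace(-1.0, 1.0, m_tilde) / np.sqrt(d)
grid = np.stack(np.meshgrid(axis, axis), axis=-1).reshape(m, d)

def relu_net(xi, a, b, c):
    # A piecewise linear (hence piecewise polynomial) activation, max(0, .).
    return (c * np.maximum(xi @ a.T + b, 0.0)).sum(axis=1)

patterns = set()
draws = 200000
for _ in range(draws):                              # random draws of the (d+2)n parameters
    a = rng.normal(size=(n, d)); b = rng.normal(size=n); c = rng.normal(size=n)
    patterns.add(tuple(np.sign(relu_net(grid, a, b, c)).astype(int)))

print(f"{len(patterns)} distinct sign patterns from {draws} parameter draws; 2^m = {2**m}")
```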

Proof. The manifold $\bar{H}_{nm}(\varphi)$ is the set of all functions on $S_m$ of the form

$$\xi \mapsto h(\xi; a, b, c) = \sum_{i=1}^{n} c_i\, \varphi(a_i \cdot \xi + b_i),$$

where $c_i, b_i \in \mathbb{R}$, $a_i \in \mathbb{R}^d$ for $1 \le i \le n$, $a = (a_1, \ldots, a_n)$, $b = (b_1, \ldots, b_n)$, $c = (c_1, \ldots, c_n)$, and $a_i \cdot \xi$ is the inner product of the vectors $a_i$ and $\xi$. Denote $\beta = (a, b, c) \in \mathbb{R}^{(d+2)n}$ and $h(\xi; a, b, c) = h(\xi; \beta)$. The manifold $\bar{H}_{nm}(\varphi)$ can be expressed in terms of the parameters $\beta$,

$$\bar{H}_{nm}(\varphi) = \{ h(\xi; \beta) : \beta \in \mathbb{R}^{(d+2)n} \}.$$

The following statement is well known (see, for example, [25], Theorem 3, or [22] for a more modern treatment).

Claim 1: If $p_1(\beta), \ldots, p_M(\beta)$ are algebraic polynomials of degree at most $r$ in $N \le M$ variables $\beta = (\beta_1, \ldots, \beta_N) \in \mathbb{R}^N$, then

$$|\{ (\mathrm{sgn}\, p_1(\beta), \ldots, \mathrm{sgn}\, p_M(\beta)) : \beta \in \mathbb{R}^N \}| \le \left( \frac{4 e M r}{N} \right)^{N}.$$

Construct the partition of the space $\mathbb{R}^{(d+2)n}$ by the hyperplanes

$$\Gamma_{ijk} = \{ (a, b, c) \in \mathbb{R}^{(d+2)n} : a_i \cdot \xi_k + b_i = t_j \},$$

where $i = 1, \ldots, n$, $j = 1, \ldots, q$, and $k = 1, \ldots, m$. Set $M = nqm$ and $N = (d + 2)n$. Denote by $S_1, \ldots, S_P$ the connected components of the set $\mathbb{R}^{(d+2)n} \setminus \bigcup_{ijk} \Gamma_{ijk}$. Since every polynomial of the form $p(\beta) = a_i \cdot \xi_k + b_i - t_j$ retains a single sign on any component $S_l$, the following estimate for the number of components $P$ follows from Claim 1:

$$P \le \left( \frac{4 e M}{N} \right)^{N} = \left( \frac{4 e m q}{d + 2} \right)^{(d+2)n}. \qquad (36)$$

Using this result, we can now estimate the cardinality of the set $\mathrm{sgn}\, \bar{H}_{nm}(\varphi)$. From the construction of the hyperplanes $\{\Gamma_{ijk}\}$ it follows that the space $\mathbb{R}^{(d+2)n}$ is divided into $P$ polyhedral regions $S_1, \ldots, S_P$, and therefore

$$|\mathrm{sgn}\, \bar{H}_{nm}(\varphi)| = |\{ (\mathrm{sgn}\, h(\xi_1; \beta), \ldots, \mathrm{sgn}\, h(\xi_m; \beta)) : \beta \in \mathbb{R}^{(d+2)n} \}| \le \sum_{l=1}^{P} |\{ (\mathrm{sgn}\, h(\xi_1; \beta), \ldots, \mathrm{sgn}\, h(\xi_m; \beta)) : \beta \in S_l \}|. \qquad (37)$$

Since for any $l$ and $i$ the function $\beta \mapsto h(\xi_i; \beta)$, $\beta \in S_l$, is a polynomial of degree at most $r$ in $\beta$, according to Claim 1 we obtain

$$|\{ (\mathrm{sgn}\, h(\xi_1; \beta), \ldots, \mathrm{sgn}\, h(\xi_m; \beta)) : \beta \in S_l \}| \le \left( \frac{4 e m r}{(d + 2)n} \right)^{(d+2)n}. \qquad (38)$$

Hence from (37), (38) and (36) we have

$$|\mathrm{sgn}\, \bar{H}_{nm}(\varphi)| \le \left( \frac{4 e m q}{d + 2} \right)^{(d+2)n} \left( \frac{4 e m r}{(d + 2)n} \right)^{(d+2)n} = \left( \frac{c\, m^2}{n} \right)^{(d+2)n},$$

where the constant $c$ depends only on $d$, $q$ and $r$.

Proof of Theorem 3. Let $\alpha$ be the absolute constant satisfying the equation $2\alpha^2 - 8\alpha + 7 = 0$ (i.e. $\alpha = 2 \pm 1/\sqrt{2}$, the smaller root being the relevant one). Fix a vector $\beta \in \mathbb{R}^{(d+2)n}$ and construct the subset of $E$

$$E_\beta = \Bigl\{ \varepsilon \in E : \sum_{i=1}^{m} |\varepsilon_i - \mathrm{sgn}\, h(\xi_i; \beta)| \le \alpha m \Bigr\}.$$

The cardinality of the set $E_\beta$ can be estimated as follows. Any $\varepsilon \in E_\beta$ agrees with the sign vector $(\mathrm{sgn}\, h(\xi_1; \beta), \ldots, \mathrm{sgn}\, h(\xi_m; \beta))$ in all but a restricted number of coordinates, so counting the admissible sign vectors by a binomial sum and applying the Chernoff bound, together with the identity $2 - 2\alpha + \alpha^2/2 = 1/4$ (equivalent to $2\alpha^2 - 8\alpha + 7 = 0$), gives

$$|E_\beta| \le \binom{m}{0} + \binom{m}{1} + \cdots + \binom{m}{\lceil \alpha m/2 \rceil} \le 2^{m(2 - 2\alpha + \alpha^2/2)} = 2^{m/4}.$$

Consider the set $E' = E \setminus \bigcup_{\beta \in \mathbb{R}^{(d+2)n}} E_\beta$. From the definition of the set $\mathrm{sgn}\, \bar{H}_{nm}(\varphi)$ it follows that

$$|E'| = \Bigl| E \setminus \bigcup_{\beta} E_\beta \Bigr| \ge 2^m - |\mathrm{sgn}\, \bar{H}_{nm}(\varphi)|\, \max_{\beta} |E_\beta| \ge 2^m - |\mathrm{sgn}\, \bar{H}_{nm}(\varphi)|\, 2^{m/4}.$$

Set $m = [(d + 2) n \log_2 n]$. Then there exist constants $c_1, c_2 > 0$ such that $c_1\, m/\log_2 m \le (d + 2)n \le c_2\, m/\log_2 m$. Therefore it follows from Lemma 6 that, with an appropriate choice of constants,

$$|\mathrm{sgn}\, \bar{H}_{nm}(\varphi)| \le \left( \frac{c\, m^2}{n} \right)^{(d+2)n} \le 2^{m/4}.$$

Thus $|E'| \ge 2^m - 2^{m/2} > 0$. Hence there exists in $E'$ a vector $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_m)$ such that, for all $\beta \in \mathbb{R}^{(d+2)n}$,

$$\sum_{i=1}^{m} |\varepsilon_i - h(\xi_i; \beta)| \ge \frac{1}{2} \sum_{i=1}^{m} |\varepsilon_i - \mathrm{sgn}\, h(\xi_i; \beta)| \ge \frac{\alpha m}{2},$$

which proves the theorem.

The following lemma yields a lower bound on the approximation error of the Sobolev class, with the help of the lower bound established for the finite-dimensional case in Theorem 3.

Lemma 7 If $1 \le p, q \le \infty$ and $\frac{r}{d} > \bigl( \frac{1}{p} - \frac{1}{q} \bigr)_+$, then

$$\mathrm{dist}(W_p^{r,d}, H_n(\varphi); L_q) \ge \frac{\bar{c}}{m^{r/d - 1/p + 1/q}}\, \mathrm{dist}\bigl( B_p^m, \bar{H}_{nm}(\varphi); l_q^m \bigr).$$

Proof. In order to derive a lower bound for $W_p^{r,d}$ it clearly suffices to find a lower bound holding for some set $F_m \subset W_p^{r,d}$, which will now be constructed. For $x \in \mathbb{R}^d$ let $\psi$ be any function in $W_p^{r,d}$ which satisfies $\psi(x) = 1$ for $x \in \frac{1}{2} I^d$ and $\psi(x) = 0$ for $x \notin I^d$; in the remaining region the function may be defined, for example, by means of smooth spline functions. Let $m$ and $\tilde{m}$ be positive integers such that $m = \tilde{m}^d$. Consider the uniform grid of $m$ points $S_m = \{\xi_i\}_{i=1}^{m}$, where $\xi_i \in I^d$. For each $i$ define the function

$$\psi_i(x) = \psi(\tilde{m}(x - \xi_i)).$$

Consider the normed space $l_p^m$ consisting of vectors $a = (a_1, \ldots, a_m) \in \mathbb{R}^m$, and denote by $B_p^m$ the unit ball in this space. Construct the functional subclass

$$F_m = \Bigl\{ f_a(x) = \frac{1}{m^{r/d - 1/p}} \sum_{i=1}^{m} a_i\, \psi_i(x) : \|a\|_{l_p^m} \le 1 \Bigr\}.$$
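As an aside, the subclass $F_m$ is easy to visualize: it consists of small, disjointly supported bumps placed on the grid $S_m$, scaled so that the Sobolev norm stays bounded. The sketch below (our illustration; the particular smooth bump, which is not exactly $1$ on the half-cube as required above, and the values of $d$, $r$, $p$, $\tilde m$ and the coefficient vector are arbitrary choices) evaluates one such $f_a$.

```python
import numpy as np

def bump(t):
    """A smooth bump supported where max|t_i| < 1 (a product of 1-D bumps)."""
    inside = np.all(np.abs(t) < 1.0, axis=-1)
    u = np.clip(np.abs(t), 0.0, 1.0 - 1e-12)
    vals = np.exp(-1.0 / (1.0 - u**2)).prod(axis=-1)
    return np.where(inside, vals, 0.0)

d, m_tilde, r, p = 2, 4, 2.0, 2.0
m = m_tilde ** d

# Grid points xi_i in the cube I^d = [-1/sqrt(d), 1/sqrt(d)]^d (cell centres).
side = ((np.arange(m_tilde) + 0.5) / m_tilde * 2.0 - 1.0)
xi = np.stack(np.meshgrid(side, side), axis=-1).reshape(m, d) / np.sqrt(d)

rng = np.random.default_rng(3)
a = rng.uniform(-1, 1, size=m)
a /= np.linalg.norm(a, ord=p)            # put a on the unit sphere of l_p^m

def f_a(x):
    """f_a(x) = m^{-(r/d - 1/p)} * sum_i a_i * bump(m_tilde * (x - xi_i))."""
    scale = m ** (-(r / d - 1.0 / p))
    contrib = bump(m_tilde * (x[:, None, :] - xi[None, :, :]))   # shape (N, m)
    return scale * (contrib * a).sum(axis=1)

x = rng.uniform(-1 / np.sqrt(d), 1 / np.sqrt(d), size=(5, d))
print(f_a(x))
```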

We first show that $F_m \subset W_p^{r,d}$. Let $\lambda \in [0, r]$. Then for any $a \in B_p^m$,

$$\|D^{\lambda} f_a\|_{L_p}^p = \frac{1}{m^{rp/d - 1}} \int_{I^d} \Bigl| \sum_{i=1}^{m} a_i\, D^{\lambda} \psi_i(x) \Bigr|^p dx = \frac{1}{m^{rp/d - 1}} \sum_{i=1}^{m} \int_{\xi_i + \frac{1}{\tilde{m}} I^d} \bigl| a_i\, D^{\lambda} \psi(\tilde{m}(x - \xi_i)) \bigr|^p dx,$$

where $\frac{1}{\tilde{m}} I^d$ is the cube $I^d$ scaled down by a factor of $\tilde{m}$ on each side (so that the supports of the $\psi_i$ are disjoint). Since $D_x^{\lambda} \psi(\tilde{m}(x - \xi_i)) = \tilde{m}^{\lambda}\, D_t^{\lambda} \psi(t)\big|_{t = \tilde{m}(x - \xi_i)}$, clearly

$$\|D^{\lambda} f_a\|_{L_p}^p = \frac{\tilde{m}^{\lambda p - d}}{m^{rp/d - 1}} \Bigl( \sum_{i=1}^{m} |a_i|^p \Bigr) \int_{I^d} |D^{\lambda} \psi(t)|^p\, dt \le \|D^{\lambda} \psi\|_{L_p}^p \le 1.$$

We therefore conclude that $f_a \in W_p^{r,d}$. We proceed to bound $\mathrm{dist}(W_p^{r,d}, H_n(\varphi); L_q)$ from below. For all $f_a \in F_m$ and $h \in H_n(\varphi)$ we have

$$\|f_a - h\|_{L_q}^q = \int_{I^d} \Bigl| \frac{1}{m^{r/d - 1/p}} \sum_{i=1}^{m} a_i\, \psi_i(x) - h(x) \Bigr|^q dx = \sum_{i=1}^{m} \int_{\frac{1}{\tilde{m}} I^d} \Bigl| \frac{1}{m^{r/d - 1/p}}\, a_i\, \psi_i(y + \xi_i) - h(y + \xi_i) \Bigr|^q dy.$$

Since $\psi_i(y + \xi_i) = \psi(\tilde{m} y)$, $i = 1, 2, \ldots, m$, for $y \in \frac{1}{\tilde{m}} I^d$, we obtain

$$\mathrm{dist}(W_p^{r,d}, H_n(\varphi); L_q)^q \ge \sup_{f \in F_m} \inf_{h \in H_n(\varphi)} \|f - h\|_{L_q}^q
\ge \frac{1}{m^{(r/d - 1/p)q}} \sup_{a \in B_p^m} \inf_{h \in H_n(\varphi)} \sum_{i=1}^{m} \int_{\frac{1}{\tilde{m}} I^d} |a_i\, \psi(\tilde{m} y) - h(y + \xi_i)|^q\, dy$$
$$= \frac{1}{m^{(r/d - 1/p)q}\, \tilde{m}^d} \sup_{a \in B_p^m} \inf_{h \in H_n(\varphi)} \sum_{i=1}^{m} \int_{I^d} |a_i\, \psi(y) - h(y/\tilde{m} + \xi_i)|^q\, dy. \qquad (39)$$

Note that, due to the infimum taken over $H_n(\varphi)$, it has been possible to rescale $h(\cdot)$ by the factor $m^{r/d - 1/p}$. Define $a_y = (a_1, \ldots, a_m)\, \psi(y)$ and $h_y = (h(y/\tilde{m} + \xi_1), \ldots, h(y/\tilde{m} + \xi_m))$. Then, moving the summation into the integral in (39), we have

$$\mathrm{dist}(W_p^{r,d}, H_n(\varphi); L_q)^q \ge \frac{1}{m^{(r/d - 1/p + 1/q)q}} \sup_{a \in B_p^m} \inf_{h \in H_n(\varphi)} \int_{I^d} \|a_y - h_y\|_{l_q^m}^q\, dy
\ge \frac{1}{m^{(r/d - 1/p + 1/q)q}} \sup_{a \in B_p^m} \int_{I^d} \inf_{h \in H_n(\varphi)} \|a_y - h_y\|_{l_q^m}^q\, dy.$$

Using the affine invariance of $H_n(\varphi)$ mentioned in Section 1, we conclude that for any $y$,

$$\inf_{h \in H_n(\varphi)} \|a_y - h_y\|_{l_q^m} = \mathrm{dist}\bigl( a, \bar{H}_{nm}(\varphi); l_q^m \bigr)\, |\psi(y)|.$$

Thus

$$\mathrm{dist}(W_p^{r,d}, H_n(\varphi); L_q)^q \ge \frac{1}{m^{(r/d - 1/p + 1/q)q}} \sup_{a \in B_p^m} \mathrm{dist}\bigl( a, \bar{H}_{nm}(\varphi); l_q^m \bigr)^q \int_{I^d} |\psi(y)|^q\, dy.$$

The lemma follows upon taking $\bar{c} = \bigl( \int_{I^d} |\psi(y)|^q\, dy \bigr)^{1/q}$. We can now prove the main theorem.





Theorem 4 For any 1  p; q  1 and dr > 1p ? 1q + the inequality dist(Wpr;d; Hn('); Lq )  (n logcn)r=d :

holds, where c depends on p; q; d and r.


Proof. Let $\varepsilon$ be the vector constructed in the proof of Theorem 3. Consider the vector $a = (a_1, \ldots, a_m)$ with $a_i = m^{-1/p} \varepsilon_i$; clearly $a \in B_p^m$. Set $m = [c\, n (d + 2) \log n]$. Then, using Lemma 7 and Theorem 3 (together with the inequality $\|v\|_{l_q^m} \ge m^{1/q - 1} \|v\|_{l_1^m}$), we obtain

$$\mathrm{dist}(W_p^{r,d}, H_n(\varphi); L_q) \ge \frac{m^{-1/p}\, (c_3\, m)^{1/q}}{m^{r/d - 1/p + 1/q}} \ge \frac{c}{(n \log n)^{r/d}}.$$

5 Lower Bound - The Standard Sigmoid

We extend the results of Section 4 to cover the case of the widely studied sigmoidal function $\sigma(t) = \frac{1}{1 + e^{-t}}$. We derive estimates for the distance of $W_p^{r,d}$ from the manifold $H_n(\sigma)$. The main result is the following theorem.

Theorem 5 For any $1 \le p, q \le \infty$ and $\frac{r}{d} > \bigl( \frac{1}{p} - \frac{1}{q} \bigr)_+$, the following inequality holds:

$$\mathrm{dist}(W_p^{r,d}, H_n(\sigma); L_q) \ge \frac{c}{(n \log n)^{r/d}}.$$

First we need an auxiliary lemma. In the space $\mathbb{R}^m$ consider the manifold $\bar{H}_{nm}(\sigma)$ defined in Section 4, that is,

$$\bar{H}_{nm}(\sigma) = \{ (h(\xi_1), \ldots, h(\xi_m)) : h \in H_n(\sigma) \}.$$

We make use of the following result.

Claim 2: Let $\rho_1(\beta) = \frac{p_1(\beta)}{q_1(\beta)}, \ldots, \rho_M(\beta) = \frac{p_M(\beta)}{q_M(\beta)}$ be rational functions of degree at most $r$, i.e. $p_i$ and $q_i$, $i = 1, 2, \ldots, M$, are algebraic polynomials of degree at most $r$ in $N \le M$ variables $\beta = (\beta_1, \ldots, \beta_N) \in \mathbb{R}^N$. Then

$$|\{ (\mathrm{sgn}\, \rho_1(\beta), \ldots, \mathrm{sgn}\, \rho_M(\beta)) : \beta \in \mathbb{R}^N \}| \le \left( \frac{4 e M r}{N} \right)^{2N}.$$

Proof. Clearly each sign vector $(\mathrm{sgn}\, \rho_1(\beta), \ldots, \mathrm{sgn}\, \rho_M(\beta))$ is determined by the pair of sign vectors $(\mathrm{sgn}\, p_1(\beta), \ldots, \mathrm{sgn}\, p_M(\beta))$ and $(\mathrm{sgn}\, q_1(\beta), \ldots, \mathrm{sgn}\, q_M(\beta))$, so that

$$|\{ (\mathrm{sgn}\, \rho_1(\beta), \ldots, \mathrm{sgn}\, \rho_M(\beta)) : \beta \in \mathbb{R}^N \}| \le |\{ (\mathrm{sgn}\, p_1(\beta), \ldots, \mathrm{sgn}\, p_M(\beta)) : \beta \in \mathbb{R}^N \}| \cdot |\{ (\mathrm{sgn}\, q_1(\beta), \ldots, \mathrm{sgn}\, q_M(\beta)) : \beta \in \mathbb{R}^N \}|.$$

The result then follows upon using Claim 1.

Lemma 8 The cardinality of the set $\mathrm{sgn}\, \bar{H}_{nm}(\sigma)$ is upper bounded as follows:

$$|\mathrm{sgn}\, \bar{H}_{nm}(\sigma)| \le (c\, m)^{2(1 + \frac{1}{d})(d+2)n}.$$

Proof. Any function $h(\xi; \beta) \in \bar{H}_{nm}(\sigma)$ has the form

$$h(\xi_k; \beta) = \sum_{i=1}^{n} \frac{c_i}{1 + e^{-a_i \cdot \xi_k - b_i}}, \qquad (40)$$

where $\beta = (a, b, c)$, $a \in \mathbb{R}^{dn}$, $b, c \in \mathbb{R}^n$, and the grid point $\xi_k$ has coordinates $\xi_{k_j} = k_j / \tilde{m}$, $0 \le k_j \le \tilde{m} - 1$, with $\tilde{m} = m^{1/d}$. Denote in (40) $e^{-a_{ij}/\tilde{m}} = t_{ij}$ and $e^{-b_i} = \gamma_i$. Then

$$h(\xi_k; \beta) = \sum_{i=1}^{n} \frac{c_i}{1 + t_{i1}^{k_1} \cdots t_{id}^{k_d}\, \gamma_i}.$$

Therefore, for each fixed $\xi_k$, the function $\beta \mapsto h(\xi_k; \beta)$ is a rational function in the $(d+2)n$ variables $c_i$, $\gamma_i$ and $t_{i1}, \ldots, t_{id}$, $1 \le i \le n$, of degree at most $((\tilde{m} - 1)d + 1)(n - 1) + 1 \le 4 d n m^{1/d}$. Hence, using Claim 2, we conclude that

$$|\mathrm{sgn}\, \bar{H}_{nm}(\sigma)| \le \left( \frac{4 e\, m \cdot 4 d n m^{1/d}}{(d + 2)n} \right)^{2(d+2)n} \le (c\, m)^{2(1 + \frac{1}{d})(d+2)n}.$$

Corollary 1 If r  1 is an integer, and 1  p  1 then for any n = 1; 2; :::

c1 r;d ; H ( ); L )  c2 ;  dist( W n p p r=d (n log n) nr=d where c1 and c2 depend only on r, d and p.

Acknowledgments We are grateful to Prof. Allan Pinkus for helpful discussions.

21

(41)

References

[1] R.A. Adams, Sobolev Spaces, Academic Press, New York, 1975.
[2] A.R. Barron, Universal approximation bounds for superpositions of a sigmoidal function, IEEE Trans. Information Theory, 39, pp. 930-945, 1993.
[3] T. Chen, H. Chen and R. Liu, Approximation capability in C(R^n) by multilayer feedforward networks and related problems, IEEE Trans. Neural Networks, 6(1), pp. 25-30, 1995.
[4] C.K. Chui, X. Li and H.N. Mhaskar, Some limitations of neural networks with one hidden layer, preprint, 1995.
[5] G. Cybenko, Approximation by superpositions of a sigmoidal function, Math. of Control, Signals and Systems, 2, pp. 303-314, 1989.
[6] R.A. DeVore, R. Howard and C.A. Micchelli, Optimal nonlinear approximation, Manuscripta Mathematica, 63, pp. 469-478, 1989.
[7] R.A. DeVore, K. Oskolkov and P. Petrushev, Approximation by feed-forward neural networks, Ann. Numer. Math., 4, pp. 261-287, 1997.
[8] I. Daubechies, Ten Lectures on Wavelets, SIAM, Philadelphia, 1992.
[9] B. Delyon, A. Juditsky and A. Benveniste, Accuracy analysis for wavelet approximations, IEEE Transactions on Neural Networks, 6, pp. 332-348, 1995.
[10] P. Hall and C.C. Heyde, Martingale Limit Theory and Its Applications, Academic Press, Orlando, 1980.
[11] E. Hernandez and G.L. Weiss, A First Course on Wavelets, CRC Press, 1996.
[12] S. Helgason, The Radon Transform, Birkhauser, Boston, 1980.
[13] M. Karpinski and A.J. Macintyre, Polynomial bounds for VC dimension of sigmoidal and general Pfaffian neural networks, Journal of Computer and System Sciences, 54, pp. 160-176, 1997.
[14] V.Ya. Lin and A. Pinkus, Fundamentality of ridge functions, J. Approx. Theory, 75, pp. 295-311, 1993.
[15] V. Maiorov and J. Ratsaby, On the degree of approximation using manifolds of finite pseudo-dimension, submitted to Journal of Constructive Approximation.
[16] H.N. Mhaskar, Neural networks for optimal approximation of smooth and analytic functions, Neural Computation, 8, pp. 164-177, 1996.
[17] H.N. Mhaskar and C.A. Micchelli, Approximation by superposition of a sigmoidal function and radial basis functions, Adv. Appl. Math., 16, pp. 350-373, 1992.
[18] H.N. Mhaskar and C.A. Micchelli, Dimension independent bounds on the degree of approximation by neural networks, IBM Journal of Research and Development, 38, pp. 277-284, 1994.
[19] P.P. Petrushev, Approximation by ridge functions and neural networks, submitted, 1997.
[20] A. Pinkus, n-Widths in Approximation Theory, Springer-Verlag, Berlin, 1985.
[21] E. Stein and G. Weiss, Introduction to Fourier Analysis on Euclidean Spaces, Princeton University Press, Princeton, New Jersey, 1971.
[22] M. Vidyasagar, A Theory of Learning and Generalization, Springer-Verlag, London, 1997.
[23] V. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995.
[24] A.G. Vitushkin, Estimation of the Complexity of the Tabulation Problem, Fizmatgiz, Moscow, 1959.
[25] H.E. Warren, Lower bounds for approximation by nonlinear manifolds, Trans. Amer. Math. Soc., 133, pp. 167-178, 1968.

