LETTER

Communicated by Dana Ron

Almost Linear VC-Dimension Bounds for Piecewise Polynomial Networks

Peter L. Bartlett
Department of Systems Engineering, Australian National University, Canberra, ACT 0200, Australia

Vitaly Maiorov
Department of Mathematics, Technion, Haifa 32000, Israel

Ron Meir
Department of Electrical Engineering, Technion, Haifa 32000, Israel

We compute upper and lower bounds on the VC-dimension and pseudo-dimension of feedforward neural networks composed of piecewise polynomial activation functions. We show that if the number of layers is fixed, then the VC-dimension and pseudo-dimension grow as W log W, where W is the number of parameters in the network. This contrasts with the case of an unbounded number of layers, where the VC-dimension and pseudo-dimension grow as W^2. We combine our results with recently established approximation error rates and determine error bounds for the problem of regression estimation by piecewise polynomial networks with unbounded weights.

1 Motivation

It is a well-known result of much recent work that in order to derive useful performance bounds for classification, regression, and time-series prediction, use must be made of both approximation and estimation error bounds. Approximation errors result from using functional classes of limited complexity (such as neural networks with a finite number of computational units), and estimation errors are caused by the finiteness of the sample used for learning. In order to achieve good performance bounds, both the approximation and the estimation errors need to be made small. Unfortunately, the demands required in order to reduce these two terms are usually conflicting. For example, in order to derive good estimation error bounds, one often assumes that the parameters are restricted to a bounded domain (see, for example, Bartlett, 1998), which may be allowed to increase with the sample size, as in the method of sieves (Geman & Hwang, 1982). However, imposing such boundedness constraints leads to great difficulties in establishing approximation error bounds, which in general have not been surmounted


to date. Recently it has been shown that the VC-dimension of many types of neural networks with continuous activation functions is finite even without imposing any conditions on the magnitudes of the parameters (Macintyre & Sontag, 1993; Goldberg & Jerrum, 1995; Karpinski & Macintyre, 1997). Since there is a close connection between the VC-dimension and the estimation error (see section 4), this result is significant in the context of learning. Thus, as long as the function itself is bounded, one may proceed to derive good upper bounds for the covering numbers needed in establishing the estimation error bounds. These results can then be combined with available results for the approximation error, the derivation of which is greatly facilitated by avoiding the constraints on the parameters.

In this article we consider a specific class of multilayered feedforward neural networks composed of piecewise polynomial activation functions. Since it is known (Leshno, Lin, Pinkus, & Schocken, 1993) that a necessary and sufficient condition for universal approximation of continuous functions over compacta by neural networks is that the transfer functions be nonpolynomial, it is interesting to consider the "minimal" complexity transfer functions that guarantee this universality. It has recently been shown (Maiorov & Meir, 1997) that the rate of approximation of the Sobolev space W^r (with a constraint on r; see section 4) by functions of this type is optimal up to a logarithmic factor, when the parameters of the polynomials are unrestricted in size. In order to complete the derivation of total error bounds, use must be made of results for the covering number of the above class of functions. From the results of Pollard (1984) and Haussler (1992, 1995), these numbers may be upper bounded if upper bounds are available for the pseudo-dimension. We establish upper and lower bounds for both the VC-dimension and the pseudo-dimension of such networks. For completeness, we give definitions of these terms in the next section.

Let F be the class of binary-valued functions computed by feedforward neural networks with piecewise polynomial activation functions, such that there are k computational (noninput) units and W weights. Goldberg and Jerrum (1995) have shown that VCdim(F) ≤ c_1(W^2 + Wk) = O(W^2), where c_1 is a constant. Moreover, Koiran and Sontag (1997) have demonstrated that in fact VCdim(F) ≥ c_2 W^2 = Ω(W^2), which would lead one to conclude that the bounds are tight up to a constant. However, the proof they used to establish the lower bound made use of the fact that the number of layers is unbounded. In practical applications, this number is often a small constant. Thus, the question remains as to whether it is possible to obtain a better bound in the realistic scenario where the number of layers is fixed.

The contribution of this work is the proof of upper and lower bounds on the VC-dimension and pseudo-dimension of piecewise polynomial nets. As we will see, the upper bound behaves as O(WL^2 + WL log WL), where L is the number of layers. (We use the "big-oh" notation f(n, m) = O(g(n, m)) to imply the existence of a constant c and integers n_0 and m_0 such that f(n, m) ≤ cg(n, m) for all n > n_0 and m > m_0.) If L is fixed, this is O(W log W),


which is superior to the currently available bound of O(W^2). Moreover, using ideas from Goldberg and Jerrum (1995) and Koiran and Sontag (1997), we are able to derive a lower bound on the VC-dimension, which is Ω(WL) for L = O(W). Maass (1994) shows that three-layer networks with threshold activation functions and binary inputs have VC-dimension Ω(W log W), and Sakurai (1993) shows that this is also true for two-layer networks with threshold activation functions and real inputs. It is easy to show that these results imply similar lower bounds if the threshold activation function is replaced by any piecewise polynomial activation function f that has bounded and distinct limits lim_{x→−∞} f(x) and lim_{x→∞} f(x). We thus conclude that if the number of layers L is fixed, the VC-dimension of piecewise polynomial networks with L ≥ 2 layers and real inputs, and of piecewise polynomial networks with L ≥ 3 layers and binary inputs, grows as W log W. Combining this result with the approximation error bounds mentioned above, we are able to derive total error bounds that are much better than currently available results for sigmoidal networks.

2 Upper Bounds

We begin the technical discussion with a precise definition of the VC-dimension and the pseudo-dimension. For more details and examples see Devroye, Györfi, and Lugosi (1996) and Vidyasagar (1996).

Definition 1. Let X be a set and A a system of subsets of X. A set S = {x_1, ..., x_n} is shattered by A if, for every subset B ⊆ S, there exists a set A ∈ A such that S ∩ A = B. The VC-dimension of A, denoted by VCdim(A), is the largest integer n such that there exists a set of cardinality n that is shattered by A.

Intuitively, the VC-dimension measures the size, n, of the largest set of points for which all possible 2^n labelings may be achieved by sets A ∈ A. It is often convenient to talk about the VC-dimension of classes of indicator functions F. In this case we simply identify the sets of points x ∈ X for which f(x) = 1 with the subsets of A, and use the notation VCdim(F). For classes of real-valued functions, one defines the pseudo-dimension as follows:

Definition 2. Let X be a set and F ⊆ [0, B]^X. A set S = {x_1, ..., x_n} is P-shattered by F if there exists a real vector c ∈ [0, B]^n such that, for every binary vector e ∈ {0, 1}^n, there exists a corresponding function f_e ∈ F such that

f_e(x_i) ≥ c_i  if  e_i = 1,  and  f_e(x_i) < c_i  if  e_i = 0.

The pseudo-dimension of F, denoted by Pdim(F), is the largest integer n such that there exists a set of cardinality n that is P-shattered by F.
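To make the notion of shattering concrete, the following brute-force sketch (not part of the original paper; the function names and the exhaustive-search strategy are illustrative only) checks whether a finite set of points is shattered by a given finite collection of indicator functions, and computes the VC-dimension of that collection restricted to a finite ground set.

```python
from itertools import combinations

def is_shattered(points, functions):
    """True if every one of the 2^n labelings of `points` is realized by some f."""
    realized = {tuple(f(x) for x in points) for f in functions}
    return len(realized) == 2 ** len(points)

def vc_dimension(ground_set, functions):
    """Largest n such that some n-point subset of `ground_set` is shattered (brute force)."""
    best = 0
    for n in range(1, len(ground_set) + 1):
        if any(is_shattered(subset, functions) for subset in combinations(ground_set, n)):
            best = n
    return best

# Example: threshold indicators x |-> 1{x >= t} on the real line have VC-dimension 1.
thresholds = [lambda x, t=t: 1 if x >= t else 0 for t in (-1.0, 0.0, 0.5, 2.0)]
print(vc_dimension([-2.0, -0.5, 0.25, 1.0, 3.0], thresholds))  # prints 1
```

Such exhaustive search is of course feasible only for tiny examples; the bounds developed below are what make these quantities useful for networks.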


For any class F of real-valued functions, let sgn(F) be the class of functions obtained by taking the sign of functions in F, where sgn(u) = 1 if u ≥ 0 and zero otherwise. It is then clear from the definition that VCdim(sgn(F)) ≤ Pdim(F), implying that a lower bound on the VC-dimension of sgn(F) immediately yields a lower bound on the pseudo-dimension of F.

We now characterize the class of networks considered in this work. A feedforward multilayer network is a directed acyclic graph that represents a parameterized real-valued function of d real inputs. Each node is called an input unit or a computation unit. The computation units are arranged in L layers. Edges are allowed from input units to computation units. There can also be an edge from a computation unit to another computation unit, but only if the first unit is in a lower layer than the second. There is a single unit in the final layer, called the output unit. Each input unit has an associated real value, which is one of the components of the input vector x ∈ R^d. Each computation unit has an associated real value, called the unit's output value. Each edge has an associated real parameter, as does each computation unit. The output of a computation unit is given by σ(Σ_e w_e z_e + w_0), where the sum ranges over the set of edges leading to the unit, w_e is the parameter (weight) associated with edge e, z_e is the output value of the unit from which edge e emerges, w_0 is the parameter (bias) associated with the unit, and σ: R → R is called the activation function of the unit. The argument of σ is called the net input of the unit. We suppose that in each unit except the output unit, the activation function is a fixed piecewise polynomial function of the form

σ(u) = φ_i(u)  for u ∈ [t_{i−1}, t_i),

for i = 1, ..., p + 1 (and set t_0 = −∞ and t_{p+1} = ∞), where each φ_i is a polynomial of degree no more than l. We say that σ has p break points and degree l. The activation function in the output unit is the identity function.

Let k_i denote the number of computational units in layer i, and suppose there is a total of W parameters (weights and biases) and k computational units (k = k_1 + k_2 + ··· + k_{L−1} + 1). For input x and parameter vector a ∈ A = R^W, let f(x, a) denote the output of this network, and let F = {x ↦ f(x, a) : a ∈ R^W} denote the class of functions computed by such an architecture, as we vary the W parameters. We first discuss the computation of the VC-dimension, and thus consider the class of functions sgn(F) = {x ↦ sgn(f(x, a)) : a ∈ R^W}.
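As a concrete illustration of such an activation function, the following sketch builds a σ with p break points and polynomial pieces of degree at most l; the particular break points and coefficients below (a piecewise linear "hard" sigmoid) are arbitrary choices for illustration, not taken from the paper.

```python
import numpy as np

def make_piecewise_poly(break_points, coeff_lists):
    """sigma(u) = phi_i(u) for u in [t_{i-1}, t_i), with t_0 = -inf and t_{p+1} = +inf.

    `break_points` is the sorted list t_1 < ... < t_p; `coeff_lists[i]` holds the
    coefficients of phi_{i+1}, highest degree first (the numpy.polyval convention)."""
    assert len(coeff_lists) == len(break_points) + 1
    def sigma(u):
        u = np.asarray(u, dtype=float)                        # expects a 1-D array of inputs
        idx = np.searchsorted(break_points, u, side="right")  # piece index for each input
        out = np.empty_like(u)
        for i, coeffs in enumerate(coeff_lists):
            out[idx == i] = np.polyval(coeffs, u[idx == i])
        return out
    return sigma

# A piecewise linear "hard" sigmoid: p = 2 break points, degree l = 1.
sigma = make_piecewise_poly([-1.0, 1.0], [[0.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
print(sigma([-2.0, 0.0, 2.0]))  # [0.  0.5 1. ]
```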


Before giving the main theorem of this section, we present the following result, which is a slight improvement (see Anthony & Bartlett, 1998, chap. 8) of a result due to Warren (1968).

Lemma 1. Suppose f_1(·), f_2(·), ..., f_m(·) are fixed polynomials of degree at most l in n ≤ m variables. Then the number of distinct sign vectors {sgn(f_1(a)), ..., sgn(f_m(a))} that can be generated by varying a ∈ R^n is at most 2(2eml/n)^n.

We then have our main result:

Theorem 1. For any positive integers W, k ≤ W, L ≤ W, l, and p, consider a network with real inputs, up to W parameters, up to k computational units arranged in L layers, a single output unit with the identity activation function, and all other computation units with piecewise polynomial activation functions of degree l and with p break points. Let F be the class of real-valued functions computed by this network. Then

VCdim(sgn(F)) ≤ 2WL log(2eWLpk) + 2WL^2 log(l + 1) + 2L.

Since L and k are O(W), for fixed l and p this implies that VCdim(sgn(F)) = O(WL log W + WL^2).

Before presenting the proof, we outline the main idea in the construction. For any fixed input x, the output of the network f(x, a) corresponds to a piecewise polynomial function in the parameters a, of degree no larger than (l + 1)^{L−1} (recall that the last layer is linear). Thus, the parameter domain A = R^W can be split into regions, in each of which the function f(x, ·) is polynomial. From lemma 1, it is possible to obtain an upper bound on the number of sign assignments that can be attained by varying the parameters of a set of polynomials. The theorem will be established by combining this bound with a bound on the number of regions.

Proof of Theorem 1. For an arbitrary choice of m points x_1, x_2, ..., x_m, we wish to bound

K = |{(sgn(f(x_1, a)), ..., sgn(f(x_m, a))) : a ∈ A}|.

Fix these m points, and consider a partition {S_1, S_2, ..., S_N} of the parameter domain A. Clearly

K ≤ Σ_{i=1}^{N} |{(sgn(f(x_1, a)), ..., sgn(f(x_m, a))) : a ∈ S_i}|.

We choose the partition so that within each region S_i, f(x_1, ·), ..., f(x_m, ·) are all fixed polynomials of degree no more than (l + 1)^{L−1}. Then, by lemma 1, each term in the sum above is no more than

2(2em(l + 1)^{L−1}/W)^W.    (2.1)
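As a quick numerical illustration (not part of the original argument), the following sketch evaluates Warren's bound 2(2eml/n)^n from lemma 1 and the per-region term 2.1; the particular values of m, l, n, W, and L are arbitrary.

```python
from math import e

def warren_bound(m, l, n):
    """Lemma 1: at most 2 * (2*e*m*l / n)**n sign vectors for m degree-l polynomials in n variables."""
    return 2 * (2 * e * m * l / n) ** n

def per_region_term(m, l, W, L):
    """Equation 2.1: 2 * (2*e*m*(l + 1)**(L - 1) / W)**W sign patterns within one region."""
    return 2 * (2 * e * m * (l + 1) ** (L - 1) / W) ** W

print(f"Warren bound for m=100, l=3, n=10:        {warren_bound(100, 3, 10):.3e}")
print(f"Per-region term for m=50, l=1, W=20, L=3: {per_region_term(50, 1, 20, 3):.3e}")
```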


The only remaining point is to construct the partition and determine an upper bound on its size. The partition is constructed recursively, using the following procedure. Let S_1 be a partition of A such that, for all S ∈ S_1, there are constants b_{h,i,j} ∈ {0, 1} for which

sgn(p_{h,x_j}(a) − t_i) = b_{h,i,j}  for all a ∈ S,

where j ∈ {1, ..., m}, h ∈ {1, ..., k_1}, and i ∈ {1, ..., p}. Here the t_i are the break points of the piecewise polynomial activation functions, and p_{h,x_j} is the affine function describing the net input to the hth unit in the first layer, in response to x_j. That is, p_{h,x_j}(a) = a_h · x_j + a_{h,0}, where a_h ∈ R^d and a_{h,0} ∈ R are the weights of the hth unit in the first layer. Note that the partition S_1 is determined solely by the parameters corresponding to the first hidden layer, as the input to this layer is unaffected by the other parameters. Clearly, for a ∈ S, the output of any first-layer unit in response to an x_j is a fixed polynomial in a.

Now, let W_1, ..., W_L be the number of variables used in computing the unit outputs up to layer 1, ..., L, respectively (so W_L = W), and let k_1, ..., k_L be the number of computation units in layer 1, ..., L, respectively (recall that k_L = 1). Then we can choose S_1 so that |S_1| is no more than the number of sign assignments possible with mk_1 p affine functions in W_1 variables. Lemma 1 shows that

|S_1| ≤ 2(2emk_1 p/W_1)^{W_1}.

Now, we define S_n (for n > 1) as follows. Assume that for all S in S_{n−1} and all x_j, the net input of every unit in layer n in response to x_j is a fixed polynomial function of a ∈ S, of degree no more than (l + 1)^{n−1}. Let S_n be a partition of A that is a refinement of S_{n−1} (that is, for all S ∈ S_n, there is an S′ ∈ S_{n−1} with S ⊆ S′), such that for all S ∈ S_n there are constants b_{h,i,j} ∈ {0, 1} such that

sgn(p_{h,x_j}(a) − t_i) = b_{h,i,j}  for all a ∈ S,    (2.2)

where p_{h,x_j} is the polynomial function describing the net input of the hth unit in the nth layer, in response to x_j, when a ∈ S. Since S ⊆ S′ for some S′ ∈ S_{n−1}, equation 2.2 implies that the output of each nth-layer unit in response to an x_j is a fixed polynomial in a of degree no more than l(l + 1)^{n−1}, for all a ∈ S.

Finally, we can choose S_n such that for all S′ ∈ S_{n−1}, the number |{S ∈ S_n : S ⊆ S′}| is no more than the number of sign assignments of mk_n p polynomials in W_n variables of degree no more than (l + 1)^{n−1}, and by lemma 1 this is no more than

2(2emk_n p(l + 1)^{n−1}/W_n)^{W_n}.


Notice also that the net input of every unit in layer n + 1 in response to x_j is a fixed polynomial function of a ∈ S ∈ S_n of degree no more than (l + 1)^n. Proceeding in this way, we get a partition S_{L−1} of A such that for S ∈ S_{L−1} the network output in response to any x_j is a fixed polynomial of a ∈ S of degree no more than l(l + 1)^{L−2}. Furthermore,

|S_{L−1}| ≤ 2(2emk_1 p/W_1)^{W_1} Π_{i=2}^{L−1} 2(2emk_i p(l + 1)^{i−1}/W_i)^{W_i}
         ≤ Π_{i=1}^{L−1} 2(2emk_i p(l + 1)^{i−1}/W_i)^{W_i}.

Multiplying by the bound in equation 2.1 gives the result

K ≤ Π_{i=1}^{L} 2(2emk_i p(l + 1)^{i−1}/W_i)^{W_i}.

Since the points x_1, ..., x_m were chosen arbitrarily, this gives a bound on the maximal number of dichotomies induced by a ∈ A on m points. An upper bound on the VC-dimension is then obtained by computing the largest value of m for which this number is at least 2^m, yielding

m < L + Σ_{i=1}^{L} W_i log(2empk_i(l + 1)^{i−1}/W_i)
  < L[1 + (L − 1)W log(l + 1) + W log(2empk)],

where all logarithms are to the base 2. We conclude (see, for example, Vidyasagar, 1996, lemma 4.4) that

VCdim(F) ≤ 2L[(L − 1)W log(l + 1) + W log(2eWLpk) + 1].

Theorem 1 can be immediately extended to obtain results for the pseudo-dimension of the class of piecewise polynomial networks. As explained, for example, in Vidyasagar (1996), the pseudo-dimension of a neural network F is identical to the VC-dimension of an equivalent network with an additional scalar parameter input to the output unit. Since each one of the piecewise polynomial activation functions in the network considered already includes a free scalar input, which contributes to the output unit, it is clear that no additional freedom is obtained by adding a further scalar parameter. We thus conclude that the bound of theorem 1 is also an upper bound for the pseudo-dimension of F.
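To see the gap between the fixed-depth bound of theorem 1 and the earlier O(W^2)-type bound numerically, the following sketch evaluates 2WL log_2(2eWLpk) + 2WL^2 log_2(l + 1) + 2L alongside c_1(W^2 + Wk); the network sizes and the choice c_1 = 1 are arbitrary and purely illustrative.

```python
from math import e, log2

def theorem1_bound(W, L, p, k, l):
    """Theorem 1: VCdim(sgn(F)) <= 2WL*log2(2eWLpk) + 2W(L**2)*log2(l + 1) + 2L."""
    return 2 * W * L * log2(2 * e * W * L * p * k) + 2 * W * L ** 2 * log2(l + 1) + 2 * L

def quadratic_bound(W, k, c1=1.0):
    """Earlier bound of the form c1 * (W**2 + W*k); c1 = 1 is an arbitrary illustrative choice."""
    return c1 * (W ** 2 + W * k)

# Three-layer piecewise linear networks (l = 1, p = 1) of growing size.
for W, k in [(100, 20), (1_000, 200), (10_000, 2_000)]:
    print(f"W={W:>6}: fixed-depth bound ~ {theorem1_bound(W, L=3, p=1, k=k, l=1):,.0f}, "
          f"quadratic bound ~ {quadratic_bound(W, k):,.0f}")
```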


3 Lower Bound

We now compute a lower bound on the VC-dimension (and thus also on the pseudo-dimension) of neural networks with continuous activation functions. This result generalizes the lower bound in Koiran and Sontag (1997), since it holds for any number of layers.

Theorem 2.

Suppose f : R → R has the following properties:

1. lim_{α→∞} f(α) = 1 and lim_{α→−∞} f(α) = 0, and

2. f is differentiable at some point x_0 with derivative f′(x_0) ≠ 0.

Then for any L ≥ 1 and W ≥ 10L − 14, there is a feedforward network with the following properties: the network has L layers and W parameters, the output unit is a linear unit, all other computation units have activation function f, and the set sgn(F) of functions computed by the network has

VCdim(sgn(F)) ≥ ⌊W/2⌋⌊L/2⌋,

where ⌊u⌋ is the largest integer less than or equal to u.

Proof. As in Koiran and Sontag (1997), the proof follows that of theorem 2.5 in Goldberg and Jerrum (1995), but we show how the functions they described can be computed by a network, and keep track of the number of parameters and layers required. We first prove the lower bound for a network containing linear threshold units and linear units (with the identity activation function), and then show that all except the output unit can be replaced by units with activation function f, and the resulting network still shatters the same set.

Fix positive integers M, N ∈ N. We now construct a set of MN points that may be shattered by a network with O(N) weights and O(M) layers. Let {a_i}, i = 1, 2, ..., N, denote a set of N parameters, where each a_i ∈ [0, 1) has an M-bit binary representation,

a_i = Σ_{j=1}^{M} 2^{−j} a_{i,j},  a_{i,j} ∈ {0, 1},

that is, the M-bit base-two representation of a_i is a_i = 0.a_{i,1} a_{i,2} ··· a_{i,M}. We will consider inputs in B_N × B_M, where B_N = {e_i : 1 ≤ i ≤ N}, e_i ∈ {0, 1}^N has ith bit 1 and all other bits 0, and B_M is defined similarly. We show how to extract the bits of the a_i, so that for input x = (e_l, e_m), the network outputs a_{l,m}. Since there are NM inputs of the form (e_l, e_m), and the a_{l,m} can take on all possible 2^{MN} values, the result will follow.

There are three stages to the computation of a_{l,m}: (1) computing a_l, (2) extracting a_{l,k} from a_l, for every k, and (3) selecting a_{l,m} among the a_{l,k}.


Suppose the network input is x = ((u_1, ..., u_N), (v_1, ..., v_M)) = (e_l, e_m). Using one linear unit, we can compute Σ_{i=1}^{N} u_i a_i = a_l. This involves N + 1 parameters and one computation unit in one layer. In fact, we need only N parameters, but we need the extra parameter when we show that this linear unit can be replaced by a unit with activation function f. Consider the parameter c_k = 0.a_{l,k} ··· a_{l,M}, that is,

c_k = Σ_{j=k}^{M} 2^{k−1−j} a_{l,j},

for k = 1, ..., M. Since c_k ≥ 1/2 iff a_{l,k} = 1, clearly sgn(c_k − 1/2) = a_{l,k} for all k. Also, c_1 = a_l and c_k = 2c_{k−1} − a_{l,k−1}. Thus, consider the recursion

c_k = 2c_{k−1} − a_{l,k−1},
a_{l,k} = sgn(c_k − 1/2),

with initial conditions c_1 = a_l and a_{l,1} = sgn(a_l − 1/2). Clearly, we can compute a_{l,1}, ..., a_{l,M−1} and c_2, ..., c_{M−1} in another 2(M − 2) + 1 layers, using 5(M − 2) + 2 parameters in 2(M − 2) + 1 computational units (cf. Figure 1). We could compute a_{l,M} in the same way, but the following approach gives fewer layers. Set

b = sgn(2c_{M−1} − a_{l,M−1} − Σ_{i=1}^{M−1} v_i).

If m ≠ M, then b = 0. If m = M, then the input vector (v_1, ..., v_M) = e_M, and thus Σ_{i=1}^{M−1} v_i = 0, implying that b = sgn(c_M) = sgn(0.a_{l,M}) = a_{l,M}.

In order to conclude the proof, we need to show how the variables a_{l,m} may be recovered, depending on the inputs (v_1, v_2, ..., v_M). Let ∨ and ∧ denote the boolean operations of disjunction and conjunction, respectively. We then have

a_{l,m} = b ∨ ⋁_{i=1}^{M−1} (a_{l,i} ∧ v_i).

Since for boolean x and y, x ∧ y = sgn(x + y − 3/2), and ⋁_{i=1}^{M} x_i = sgn(Σ_{i=1}^{M} x_i − 1/2), we see that the computation of a_{l,m} involves an additional 5M parameters in M + 1 computational units and adds another two layers. In total, there are 2M layers and 10M + N − 7 parameters, and the network shatters a set of size NM.

Clearly we can add parameters and layers without affecting the function of the network. So for any L, W ∈ N, we can set M = ⌊L/2⌋ and N = W + 7 − 10M, which is at least ⌊W/2⌋ provided W ≥ 10L − 14. In that case, the VC-dimension is at least

⌊W/2⌋⌊L/2⌋.


Figure 1: Network of linear threshold units and linear units for the lower bound.

The network just constructed uses linear threshold units and linear units. However, it is easy to show (see Koiran & Sontag, 1997, theorem 5) that each unit except the output unit can be replaced by a unit with activation function f so that the network still shatters the set of size MN. For linear units, the input and output weights are scaled so that the linear function can be approximated to sufficient accuracy by f in the neighborhood of the point x_0. For linear threshold units, the input weights are scaled so that the behavior of f at infinity accurately approximates a linear threshold function.
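The bit-extraction construction in the proof can be simulated directly. The following sketch (plain Python, not a claim about any particular network library) computes a_l from the one-hot input e_l, runs the recursion c_k = 2c_{k−1} − a_{l,k−1}, a_{l,k} = sgn(c_k − 1/2), forms b, and selects a_{l,m}; the extra −1/2 in the unit computing b is an implementation choice here so that the sgn(0) boundary case comes out right, and the layer and parameter counts of the proof are not modeled.

```python
def sgn(u):
    """Threshold unit as in the paper: 1 if u >= 0, else 0."""
    return 1 if u >= 0 else 0

def network_output(bits, u, v):
    """Simulate the construction: bits[i][j] = a_{i+1, j+1}; u, v are the one-hot inputs e_l, e_m."""
    N, M = len(bits), len(bits[0])
    a = [sum(2.0 ** -(j + 1) * bits[i][j] for j in range(M)) for i in range(N)]
    # Stage 1: a single linear unit computes a_l = sum_i u_i * a_i.
    a_l = sum(u[i] * a[i] for i in range(N))
    # Stage 2: extract a_{l,1}, ..., a_{l,M-1} via c_k = 2 c_{k-1} - a_{l,k-1}, a_{l,k} = sgn(c_k - 1/2).
    c = a_l
    extracted = [sgn(c - 0.5)]
    for _ in range(2, M):
        c = 2 * c - extracted[-1]
        extracted.append(sgn(c - 0.5))
    # The unit computing b; the extra -0.5 threshold shift is an implementation choice (see above).
    b = sgn(2 * c - extracted[-1] - 0.5 - sum(v[:M - 1]))
    # Stage 3: a_{l,m} = b OR (OR_i (a_{l,i} AND v_i)), realized with threshold units.
    and_terms = [sgn(extracted[i] + v[i] - 1.5) for i in range(M - 1)]
    return sgn(b + sum(and_terms) - 0.5)

# Check that every one of the N*M stored bits is recovered exactly.
N, M = 3, 4
bits = [[1, 0, 1, 1], [0, 1, 1, 0], [1, 1, 0, 1]]
for l in range(N):
    for m in range(M):
        u = [1 if i == l else 0 for i in range(N)]
        v = [1 if j == m else 0 for j in range(M)]
        assert network_output(bits, u, v) == bits[l][m]
print("all", N * M, "labels recovered")
```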


Remark. Note that the input to the network constructed in the proof of theorem 2 was a vector of dimension N + M; its size scales linearly with the number of weights and layers. In fact, it is possible to obtain the same general result using scalar inputs. Similarly to Koiran and Sontag (1997, proposition 1), one first constructs a set of MN points in R^2 that can be shattered and then proceeds to show that this result can be applied also for inputs in R (Koiran & Sontag, 1997, theorem 1). The only modification to the proof presented above is that in the first layer we will need 3(N − 1) + 1 parameters, N − 1 threshold gates, and a single linear gate instead of the N + 1 parameters and single linear gate used in our proof. The remainder of the proof is unchanged. The only difference in the final bound is a multiplicative factor.

4 Application to Learning

We briefly describe the application of our results to the problem of learning, focusing on the case of regression. The case of classification can be treated similarly. Consider then the problem of estimating a regression function E(Y|X = x) from a set of n independently and identically distributed samples {(X_1, Y_1), ..., (X_n, Y_n)}, each drawn from a probability distribution P(X, Y) with |Y| ≤ M < ∞. In particular, we focus for simplicity on a learning algorithm that constructs a function f̂_n ∈ F by minimizing the empirical error

L̂_n(f) = Σ_{i=1}^{n} |Y_i − f(X_i)|^2

over f ∈ F, where |f| ≤ M. We consider the class of feedforward neural networks with a single hidden layer and an output nonlinearity restricting the range of the function to [−M, M]. Specifically, let

f(x) = π_M(Σ_{i=1}^{k} c_i σ(a_i^T x + b_i) + c_0),

where the activation function σ(·) is assumed to be piecewise polynomial, and the output unit activation function π_M(·) is given by

π_M(α) = −M if α < −M,  M if α > M,  and α otherwise.

The need to restrict the range of the functions f arises here for technical reasons having to do with the derivation of the estimation error (see, e.g., Haussler, 1992).
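A minimal sketch of this model class, with a piecewise linear choice of σ and randomly drawn, deliberately unbounded parameters purely for illustration: it computes f(x) = π_M(Σ_i c_i σ(a_i^T x + b_i) + c_0) for a batch of inputs.

```python
import numpy as np

def pi_M(alpha, M):
    """Output clipping: -M if alpha < -M, M if alpha > M, alpha otherwise."""
    return np.clip(alpha, -M, M)

def sigma(u):
    """A piecewise polynomial (here piecewise linear) activation with two break points."""
    return np.clip(u, 0.0, 1.0)

def network(x, A, b, c, c0, M):
    """f(x) = pi_M(sum_i c_i * sigma(a_i . x + b_i) + c_0) for a batch of inputs x."""
    hidden = sigma(x @ A.T + b)      # shape (n_samples, k)
    return pi_M(hidden @ c + c0, M)

# k = 5 hidden units, d = 3 inputs; the weights are deliberately large and unconstrained.
rng = np.random.default_rng(0)
d, k, M = 3, 5, 2.0
A, b = rng.normal(size=(k, d)) * 10.0, rng.normal(size=k)
c, c0 = rng.normal(size=k) * 10.0, rng.normal()
x = rng.normal(size=(4, d))
print(network(x, A, b, c, c0, M))    # every output lies in [-M, M]
```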


Previous work along these lines has relied on bounded parameter values. In particular, Barron (1994) bounds the parameters c, a, and b and imposes a Lipschitz constraint on the activation function σ, while Haussler (1992) constrains the parameters c so that Σ_i |c_i| ≤ β < ∞, and assumes that the activation function σ is nondecreasing and bounded (see also Lee et al., 1996; Lugosi & Zeger, 1995; Bartlett, 1998). We do not assume any constraints on the parameter values and do not require boundedness of the activation function σ.

The expected loss of the resulting estimate f̂_n is given by L(f̂_n) = E|Y − f̂_n(X)|^2, where the expectation is taken with respect to X and Y. Standard results from the theory of uniform convergence of empirical measures (Pollard, 1984), together with bounds on certain covering numbers in terms of the pseudo-dimension (Haussler, 1995, corollary 3), show that

EL(f̂_n) ≤ s^2 + c_1 inf_{f∈F} L̃(f) + c_2 M Pdim(F) log n / n,    (4.1)

where c_1 and c_2 are absolute constants, s^2 = E{(Y − E(Y|X))^2} is the noise variance, and L̃(f) = E{|E(Y|X) − f(X)|^2}. The quantity inf_{f∈F} L̃(f) in equation 4.1 is referred to as the approximation error and depends on the actual regression function E(Y|X = x). Note that in equation 4.1 we have used a result from Haussler (1992, sect. 2.4), where the bound relies on a choice of c_1 > 1. As discussed in Haussler (1992), it is possible to set c_1 = 1, which shows that the expected loss converges to the approximation error plus noise variance, but this is at the cost of increasing the third term to c′_2 M (Pdim(F) log n/n)^{1/2}. In cases where the approximation error inf_{f∈F} L̃(f) can be made arbitrarily small, it is clear that the former bound, namely equation 4.1, is superior.

In order to derive an upper bound on the expected loss, use must be made of upper bounds on the approximation error as well as upper bounds on the pseudo-dimension Pdim(F), which we computed in section 2. The approximation error term can be handled using techniques studied in Maiorov and Meir (1997), where the assumption is made that the regression function belongs to the Sobolev space W^r, containing all functions f: R^d → R with square integrable derivatives up to order r > d/2. In particular, we consider the subset W^r_K of functions f ∈ W^r satisfying ‖f^{(r)}‖_{L_2} ≤ K. Since r > d/2, well-known embedding theorems from the theory of Sobolev spaces (see corollary 5.16 in Adams, 1975) imply that such a function f is in fact uniformly bounded. Standard techniques show that there is a bound M(K) so that, for all f ∈ W^r_K and all x ∈ R^d, |f(x)| ≤ M(K) (see, for example, section 11 in Kantorovich & Akilov, 1964). For a given K, if the regression function is in W^r_K and the network class F is as described above, with k hidden units and an output bound satisfying M ≥ M(K), it has been shown in Maiorov and Meir (1997) that

inf_{f∈F} L̃(f) ≤ c log k / k^{r/d},


where d is the Euclidean dimension and c is a constant that depends only on K, r, and d. We note that the result in Maiorov and Meir (1997) was proved for unbounded functions of the form Σ_i c_i σ(a_i^T x + b_i). However, it is easy to show that if the regression function is uniformly bounded by M(K), then the same approximation rate holds for the class of bounded functions π_M(Σ_i c_i σ(a_i^T x + b_i)), M ≥ M(K), which is considered here. This result is only slightly worse than the best result available (Mhaskar, 1996) for the standard sigmoidal activation function σ(u) = 1/(1 + e^{−u}), where the approximation error is upper bounded by c/k^{r/d}. Although the constraint r > d/2 is rather restrictive, it is required in order to yield bounded functions. In future work we hope to remove this restriction, but since this leads to unbounded functions, new techniques might be necessary for the estimation error bounds.

A final comment is in order concerning the use of the pseudo-dimension in equation 4.1. It can be shown that for the hard-limiting activation function π_M, the pseudo-dimension yields nearly optimal bounds for the estimation error, since in that case the pseudo-dimension is essentially equivalent to another combinatorial dimension, called the fat-shattering dimension, which gives nearly matching lower bounds on the estimation error (see Bartlett, Long, & Williamson, 1996).

Combining the two results, we obtain the total error bound

EL(f̂_n) ≤ s^2 + c log k / k^{r/d} + c′ Pdim(F) log n / n,    (4.2)

where c and c′ are constants that depend only on K.
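To illustrate the comparison drawn below between the resulting estimation term O(k log k log n/n) for piecewise polynomial networks and the corresponding O(k^4 log n/n) term for standard sigmoidal networks, the following sketch tabulates the two quantities for a few values of k, with all constants set to 1 purely for illustration.

```python
from math import log

def piecewise_poly_term(k, n):
    """Estimation term with Pdim(F) = O(k log k): ~ k log(k) log(n) / n (constants dropped)."""
    return k * log(k) * log(n) / n

def sigmoid_term(k, n):
    """Estimation term with Pdim(F) = O(k**4): ~ k**4 log(n) / n (constants dropped)."""
    return k ** 4 * log(n) / n

n = 100_000
for k in (5, 20, 80):
    print(f"k={k:>2}: piecewise polynomial ~ {piecewise_poly_term(k, n):.5f}, "
          f"sigmoidal ~ {sigmoid_term(k, n):.2f}")
```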

By allowing the output bound M to increase suitably slowly as n increases, we see that for any f ∈ W^r, the total error decreases as indicated by equation 4.2, but with constants that depend on f. In the case of a single-hidden-layer network with k hidden units, the number of parameters W is given by W = k(d + 2) + 1 = O(k). Thus, we observe that the final term in equation 4.2 is of order O(k log k log n/n). For the standard sigmoid, we have from Karpinski and Macintyre (1997) that Pdim = O(W^2 k^2) = O(k^4), and thus this term is O(k^4 log n/n). Keeping in mind that the approximation error terms are identical up to a logarithmic factor log k, we see that the guaranteed upper bound on the rate of convergence of the loss in the case of piecewise polynomial activation functions is much faster. It is an interesting open problem to determine whether piecewise polynomial networks indeed have smaller error in this sense.

Acknowledgments

V. M. was partially supported by the Center for Absorption in Science, Ministry of Immigrant Absorption, State of Israel. The work of R. M. was supported in part by a grant from the Israel Science Foundation. Part of


this work was done while R. M. was visiting the Isaac Newton Institute, Cambridge, England. Support from the Ollendorff Center of the Department of Electrical Engineering at the Technion is also acknowledged. The work of P. B. was supported in part by a grant from the Australian Research Council. Thanks to two anonymous referees for helpful comments.

References

Adams, R. A. (1975). Sobolev spaces. New York: Academic Press.

Anthony, M., & Bartlett, P. L. (1998). A theory of learning in artificial neural networks. Unpublished manuscript.

Barron, A. (1994). Approximation and estimation bounds for artificial neural networks. Machine Learning, 4, 115–133.

Bartlett, P. L. (1998). The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44, 525–536.

Bartlett, P. L., Long, P. M., & Williamson, R. C. (1996). Fat-shattering and the learnability of real-valued functions. Journal of Computer and System Sciences, 52(3), 434–452.

Devroye, L., Györfi, L., & Lugosi, G. (1996). A probabilistic theory of pattern recognition. New York: Springer-Verlag.

Geman, S., & Hwang, C. R. (1982). Nonparametric maximum likelihood estimation by the method of sieves. Annals of Statistics, 10(2), 401–414.

Goldberg, P. W., & Jerrum, M. R. (1995). Bounding the VC dimension of concept classes parameterized by real numbers. Machine Learning, 18, 131–148.

Haussler, D. (1992). Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100, 78–150.

Haussler, D. (1995). Sphere packing numbers for subsets of the boolean n-cube with bounded Vapnik-Chervonenkis dimension. Journal of Combinatorial Theory, Series A, 69, 217–232.

Kantorovich, L. V., & Akilov, G. P. (1964). Functional analysis in normed spaces. Oxford: Pergamon Press.

Karpinski, M., & Macintyre, A. (1997). Polynomial bounds for VC dimension of sigmoidal and general Pfaffian neural networks. Journal of Computer and System Sciences, 54, 169–176.

Koiran, P., & Sontag, E. D. (1997). Neural networks with quadratic VC dimension. Journal of Computer and System Sciences, 54, 190–198.

Lee, W.-S., Bartlett, P. L., & Williamson, R. C. (1996). Efficient agnostic learning of neural networks with bounded fan-in. IEEE Transactions on Information Theory, 42(6), 2118–2132.

Leshno, M., Lin, V., Pinkus, A., & Schocken, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6, 861–867.

Lugosi, G., & Zeger, K. (1995). Nonparametric estimation via empirical risk minimization. IEEE Transactions on Information Theory, 41(3), 677–687.


Maass, W. (1994). Neural nets with superlinear VC-dimension. Neural Computation, 6(5), 877–884.

Maiorov, V., & Meir, R. (1997). On the near optimality of the stochastic approximation of smooth functions by neural networks (Tech. Rep. No. CC-223). Haifa: Department of Electrical Engineering, Technion.

Macintyre, A., & Sontag, E. D. (1993). Finiteness results for sigmoidal neural networks. In Proc. 25th ACM STOC.

Mhaskar, H. (1996). Neural networks for optimal approximation of smooth and analytic functions. Neural Computation, 8(1), 164–177.

Pollard, D. (1984). Convergence of empirical processes. New York: Springer-Verlag.

Sakurai, A. (1993). Tighter bounds on the VC-dimension of three-layer networks. In World Congress on Neural Networks. Hillsdale, NJ: Erlbaum.

Vidyasagar, M. (1996). A theory of learning and generalization. New York: Springer-Verlag.

Warren, H. E. (1968). Lower bounds for approximation by nonlinear manifolds. Transactions of the American Mathematical Society, 133, 167–178.

Received December 1, 1997; accepted April 22, 1998.

