Approximation of smooth functions by neural networks

H. N. Mhaskar∗
Department of Mathematics, California State University, Los Angeles, CA 90032, U.S.A.

April, 1997

(Appeared as Chapter 13 in Dealing with complexity, a neural networks approach, (M. Kárný, K. Warwick and V. Kurková, Eds.), Springer Verlag (Perspectives in Neural Computing), London/Berlin, 1998, pp. 189-204)

∗ This research was supported, in part, by National Science Foundation Grant DMS 9404513 and Air Force Office of Scientific Research Grant F49620-97-1-0211.

Abstract

We review some aspects of our recent work on the approximation of functions by neural and generalized translation networks.
1 Introduction
Many applications of neural networks are based on their universal approximation property. For example, a common approach in the prediction of time series t_1, t_2, · · · is to consider each t_n as an unknown function of a certain (fixed) number of previous values. A neural network is then trained to approximate this unknown function. We note that one of the reasons for the popularity of neural networks over their precursors, perceptrons, is their universal approximation property. Mathematically, a neural network can only evaluate a special function, depending upon its architecture. For example, if n, s ≥ 1 are integers, the output of a neural network with one hidden layer comprising n principal elements (neurons), each evaluating a nonlinear function φ and receiving an input vector x ∈ IR^s, can be expressed in the form ∑_{k=1}^{n} a_k φ(w_k · x + b_k), where, for
k = 1, . . . , n, the weights w_k ∈ IR^s, and the thresholds b_k and the coefficients a_k are real numbers. In the sequel, the class of all such output functions will be denoted by Π_{φ;n,s}. We often refer to the output function itself as the neural network. Some of the questions that arise naturally in a theoretical study of the approximation properties of neural networks are the following.

1. Density. Given a continuous (real-valued) function f on a compact subset K ⊂ IR^s and a positive number ε, is it possible to find some integer n and a network P ∈ Π_{φ;n,s} such that

    |f(x) − P(x)| ≤ ε,    x ∈ K?        (1.1)
What are the necessary and sufficient conditions on φ for this property to hold? If it does not hold, what functions can be approximated in this way?

2. Complexity. If we have some a priori assumption about the target function f, formulated mathematically by the statement f ∈ W for some function class W, can one obtain a good bound on the number of neurons n in the network of (1.1) in terms of ε? How does the choice of φ affect this bound?

3. Construction. How does one construct a network with a theoretically minimal size that approximates any function from W within a prescribed accuracy?

4. Limitations. Is there any advantage to be gained by using a more complicated architecture, such as networks with multiple hidden layers; i.e., are there some limitations on networks with one hidden layer?

The density problem is perhaps the most widely investigated problem. In the context of neural networks, the works of Cybenko [6], Funahashi [10], and Hornik, Stinchcombe, and White [13] are often cited. In [2], [3], Chui and Li have given a constructive proof which also shows that one may restrict the weights to be integer multiples of a fixed number. The problem has been studied in the context of radial basis function networks by Park and Sandberg [41]. In our paper [31] with Micchelli, we have formulated necessary and sufficient conditions for the function φ so as to achieve density. We have also given similar conditions for the radial and elliptic basis function networks. This work has been further generalized by Pinkus and his collaborators [21], [40]. The results of Ito [14], [15] are similar in spirit to those in [31]. In the context of multiple layers, the analogue is given in [25]. In this context, the well known Kolmogorov-Lorentz theorem [22] adds a new perspective. This theory has been studied extensively by de Figueiredo [7], Hecht-Nielsen [12], Kurková [17], [18], Nees [38], [39], and Sprecher [16], [42], among others.
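As a concrete illustration of the class Π_{φ;n,s} introduced above, the following is a minimal sketch (our own, not from the paper; the sigmoid activation, the dimensions, and the random parameters are arbitrary illustrative choices) of evaluating such an output function.

    # Evaluate an element of Pi_{phi;n,s}: sum_{k=1}^n a_k * phi(w_k . x + b_k)
    import numpy as np

    def network_output(x, weights, thresholds, coefficients, phi):
        """Output of a one-hidden-layer network at a point x in R^s."""
        return sum(a * phi(np.dot(w, x) + b)
                   for w, b, a in zip(weights, thresholds, coefficients))

    rng = np.random.default_rng(0)
    n, s = 4, 3                                   # neurons and input dimension
    W = rng.standard_normal((n, s))               # weights w_k in R^s
    b = rng.standard_normal(n)                    # thresholds b_k
    a = rng.standard_normal(n)                    # coefficients a_k
    sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))  # phi(t) = (1 + e^{-t})^{-1}

    print(network_output(rng.standard_normal(s), W, b, a, sigmoid))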
In this paper, we review some aspects of our work of the past few years regarding the remaining problems. We will briefly mention some related ideas, and point out some practical applications.
2 Preliminaries
We adopt the following notations. In the remainder of this paper, the symbol s will denote a fixed integer, s ≥ 1. If A ⊆ IR^s is (Lebesgue) measurable, and f : A → IR is a measurable function, we define the L^p(A) norms of f as follows:

    ||f||_{p,A} := (∫_A |f(x)|^p dx)^{1/p},   if 1 ≤ p < ∞,
                   ess sup_{x∈A} |f(x)|,       if p = ∞.        (2.1)
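A crude numerical illustration of (2.1) on A = [−1, 1]^s (our own sketch; the sample size and the test function are arbitrary choices): the L^p norm is estimated by Monte Carlo quadrature, and the case p = ∞ by the maximum over the sample.

    import numpy as np

    def lp_norm_estimate(f, p, s, n_samples=200_000, seed=1):
        rng = np.random.default_rng(seed)
        x = rng.uniform(-1.0, 1.0, size=(n_samples, s))   # points in [-1,1]^s
        values = np.abs(f(x))
        if p == np.inf:
            return values.max()                           # stands in for ess sup
        volume = 2.0 ** s                                 # measure of [-1,1]^s
        return (volume * np.mean(values ** p)) ** (1.0 / p)

    f = lambda x: np.cos(np.pi * x).prod(axis=1)          # a smooth test function
    print(lp_norm_estimate(f, 2, s=3), lp_norm_estimate(f, np.inf, s=3))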
The class of all functions f for which ||f||_{p,A} < ∞ is denoted by L^p(A). It is customary (and in fact, essential from a theoretical point of view) to adopt the convention that if two functions are equal almost everywhere in the measure-theoretic sense then they should be considered as equal elements of L^p(A). We make two notational simplifications. The symbol L^∞(A) will denote the class of continuous functions on A. In this paper, we have no occasion to consider discontinuous functions in what is normally denoted by L^∞(A), and using this symbol for the class of continuous functions will simplify the statements of our theorems. Second, when the set A = [−1, 1]^s, we will not mention the set in the notation. Thus, ||f||_p will mean ||f||_{p,[−1,1]^s}, etc.

In applications, the target function is usually unknown. However, in theoretical investigations, one needs to assume some a priori conditions on the target function. Mathematically, these conditions are embodied in the statement that the target function belongs to some known class of functions. One of the most common and least demanding of such conditions is that the function has a certain number of partial derivatives. Some information about the size of these derivatives is also assumed. We now formulate these conditions more precisely.

Let r ≥ 1 be an integer and Q be a cube in IR^s. The class W^p_{r,s}(Q) consists of all functions with r − 1 continuous partial derivatives on Q which in turn can be expressed (almost everywhere on Q) as indefinite integrals of functions in L^p(Q). Alternatively, the class W^p_{r,s}(Q) consists of functions which have, at almost all points of Q, all partial derivatives up to order r such that all of these derivatives are in L^p(Q). For f ∈ W^p_{r,s}(Q), we write

    ||f||_{W^p_{r,s}(Q)} := ∑_{0≤k≤r} ||D^k f||_{p,Q},        (2.2)

where, for the multi-integer k = (k_1, . . . , k_s) ∈ ZZ^s, 0 ≤ k ≤ r means that each component of k is nonnegative and does not exceed r, |k| := ∑_{j=1}^{s} |k_j|, and

    D^k f = ∂^{|k|} f / (∂x_1^{k_1} · · · ∂x_s^{k_s}),    k ≥ 0.
Again, W^∞_{r,s}(Q) will denote the class of functions which have continuous derivatives of order up to r. In the sequel, we make the following convention regarding constants. The symbols c, c_1, c_2, · · · will denote positive constants depending only on φ, p, r, d, s, and other explicitly indicated parameters. Their value may be different at different occurrences, even within a single formula.

In most applications, the target function may be assumed to be in W^p_{r,s}(IR^s), although the approximation is desired only on [−1, 1]^s. In theoretical investigations, we do not need this assumption. It is known [43] that there exists a linear operator T : W^p_{r,s}([−1, 1]^s) → W^p_{r,s}([−π, π]^s) such that Tf(x) = f(x) for x ∈ [−1, 1]^s, and

    ||f||_{W^p_{r,s}([−1,1]^s)} ≤ c_1 ||Tf||_{W^p_{r,s}([−π,π]^s)} ≤ c_2 ||f||_{W^p_{r,s}([−1,1]^s)}.        (2.3)
If ψ : IR^s → IR is an infinitely many times continuously differentiable function which is identically equal to 1 on [−1, 1]^s and 0 outside [−3/2, 3/2]^s, and g = ψTf, then g(x) = f(x) if x ∈ [−1, 1]^s, g(x) = 0 outside [−3/2, 3/2]^s, and

    ||f||_{W^p_{r,s}([−1,1]^s)} ≤ c_1 ||g||_{W^p_{r,s}([−π,π]^s)} ≤ c_2 ||f||_{W^p_{r,s}([−1,1]^s)}.        (2.4)
In particular, we may denote the function g again by f, and extend it to a 2π-periodic function on IR^s. We will always assume that our functions are extended in this way. The symbol ||f||_{p,r,s} will then denote ||f||_{W^p_{r,s}([−π,π]^s)}. A motivation for our work is the following well known theorem in approximation theory [45]. For integer m ≥ 0, we denote the class of all polynomials of s variables and coordinatewise degree not exceeding m by P_{m,s}.

Theorem 2.1 Let 1 ≤ p ≤ ∞ and s, r ≥ 1 be integers. Then for every f ∈ W^p_{r,s}([−1, 1]^s) and integer m ≥ 0, there exists a polynomial P ∈ P_{m,s} such that

    ||f − P||_p ≤ c(m + 1)^{−r} ||f||_{p,r,s}.        (2.5)
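The rate in (2.5) is easy to observe numerically in the simplest case s = 1, p = ∞. The following sketch (our own illustration, not the construction of [45]; it uses Chebyshev interpolation as a convenient near-best polynomial approximation, and the test function |x|^3, whose derivatives up to order three are bounded, so r = 3) shows the uniform error decaying essentially like m^{−3}.

    import numpy as np
    from numpy.polynomial import chebyshev as C

    f = lambda x: np.abs(x) ** 3          # derivatives up to order 3 bounded on [-1,1]

    xs = np.linspace(-1.0, 1.0, 5001)     # grid on which the uniform error is measured
    for m in (4, 8, 16, 32, 64):
        nodes = np.cos((2 * np.arange(m + 1) + 1) * np.pi / (2 * (m + 1)))  # Chebyshev nodes
        coeffs = C.chebfit(nodes, f(nodes), m)     # degree-m interpolant in Chebyshev form
        err = np.max(np.abs(f(xs) - C.chebval(xs, coeffs)))
        # err * m^3 should stay roughly of the same size (up to a slowly growing factor)
        print(m, err, err * m ** 3)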
Theorem 2.1 can be formulated for arbitrary f ∈ L^p using the notion of K-functionals and higher order moduli of continuity [8]. This is a standard procedure in approximation theory [8], and hence, we need to study only the case when f has r derivatives. We further observe that a polynomial in s variables and coordinatewise degree not exceeding m depends upon n = (m + 1)^s parameters, namely, its coefficients. In terms of the number of parameters involved, the estimate (2.5) can be restated as

    inf_{P∈P_{m,s}} ||f − P||_p ≤ c n^{−r/s} ||f||_{p,r,s}.        (2.6)

Denoting the class of all functions in W^p_{r,s}([−1, 1]^s) such that ||f||_{p,r,s} ≤ 1 by B^p_{r,s}, we may reformulate this estimate further in the form

    sup_{f∈B^p_{r,s}} inf_{P∈P_{m,s}} ||f − P||_p ≤ c n^{−r/s}.        (2.7)
This estimate is very interesting from the following point of view. The only information we know about the target function is that it is in W^p_{r,s}([−1, 1]^s). We may choose our scale so that f ∈ B^p_{r,s}. The estimate (2.7) then states that this knowledge alone guarantees that the target function, whatever it may be, can be approximated by elements of the n-parameter family P_{m,s} of polynomials within the accuracy O(n^{−r/s}). This accuracy is thus dependent only on our prior knowledge about the function and the class of approximants chosen! It is to be noted in this connection that the estimate (2.7) makes no assumptions about the manner in which the approximation is obtained. In particular, nonlinear algorithms are not ruled out. Moreover, different assumptions about the target function will lead to different bounds on the accuracy, some of them even independent of s.

Focusing again on the class B^p_{r,s}, one wonders if the bound in (2.7) can be improved by using a different class of models, neural networks in particular, instead of polynomials. To investigate this question, we describe the notion of nonlinear n-widths (cf. [9]). Let W be any class of functions in L^p. Any approximation process depending upon n parameters can be expressed mathematically as a composition of two functions. The function π_n : W → IR^n selects the parameters, and the function A : IR^n → L^p selects the approximating model depending upon these parameters. The approximation to the target function f is then given by A(π_n(f)). The error in approximating any function from W is then given by sup_{f∈W} ||f − A(π_n(f))||_p. For example, in the case of polynomial approximation, we may choose π_n to be the mapping from f ∈ W to the coefficients of the best approximation to f from P_{m,s}, and the method A simply reconstructs the polynomial given its coefficients. The expression sup_{f∈B^p_{r,s}} ||f − A(π_n(f))||_p, with this choice of π_n and A, reduces to the left hand side of (2.7). Returning to the general case, if the only knowledge about the target function is that f ∈ W, the best we can expect with any approximation method is given by the nonlinear n-width

    ∆_{n,p}(W) := inf sup_{f∈W} ||f − A(π_n(f))||_p,        (2.8)

where, to avoid certain pathologies, we take the infimum over all continuous functions π_n : W → IR^n and all functions A : IR^n → L^p. The quantity ∆_{n,p}(W) thus gives the error inherent in approximating (in the L^p-norm) an unknown function in W by a model depending upon n parameters. It turns out [9] that

    ∆_{n,p}(B^p_{r,s}) ≥ c n^{−r/s}.        (2.9)
Thus, apart from constant factors, polynomials are the best class of approximants. We stress that the inherent bound on the accuracy of approximation depends entirely on the a priori assumptions on the target functions, not on the method of approximation. The well known dimension-independent bounds in neural network approximation are derived for functions belonging to a different class of functions. If the only a priori assumption on the target function is its membership in B^p_{r,s}, this information is not strong enough to yield better bounds, whether we use neural networks or any other sophisticated method. The issue is not whether neural networks offer any advantages over polynomials, but whether neural networks can be constructed to achieve the same order of approximation.
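The dependence of (2.7) and (2.9) on s is already visible from elementary arithmetic: ignoring constants, achieving accuracy ε for every f ∈ B^p_{r,s} requires roughly n ~ ε^{−s/r} parameters, whatever the n-parameter method. A small sketch (our own) tabulating this growth:

    # Back-of-the-envelope reading of (2.7)/(2.9), constants ignored:
    # accuracy eps for every f in B^p_{r,s} needs about n ~ eps**(-s/r) parameters.
    eps, r = 1e-2, 2
    for s in (1, 2, 4, 8, 16):
        print(f"s = {s:2d}:  n ~ {eps ** (-s / r):.3e}")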
3 Complexity Theorems
In addition to neural networks, we wish to include the radial basis function networks and generalized regularization networks in our study. Therefore, we study generalized translation networks (GTN's). Let 1 ≤ d ≤ s, n ≥ 1 be integers, f : IR^s → IR and φ : IR^d → IR. A generalized translation network with n neurons evaluates a function of the form ∑_{k=1}^{n} a_k φ(A_k(·) + b_k), where the weights A_k are d × s real matrices, the thresholds b_k ∈ IR^d and the coefficients a_k ∈ IR (1 ≤ k ≤ n). Extending our previous notation, the set of all such functions (with a fixed n) will be denoted by Π_{φ;n,s}. In the case when d = 1, the class Π_{φ;n,s} denotes the outputs of the classical neural networks with one hidden layer consisting of n neurons, each evaluating the univariate activation function φ. In the case d = s and φ is a radially symmetric function, we have the radial (or elliptic) basis function networks. In [11], Girosi, Poggio and Jones have pointed out the importance of the study of the more general case considered here. They have demonstrated how such general networks arise naturally in such applications as image processing and graphics as solutions of certain extremal problems. For f ∈ L^p, we write

    E_{φ;n,p,s}(f) := inf_{P∈Π_{φ;n,s}} ||f − P||_p.        (3.1)

For a class W ⊆ L^p of target functions, we write

    E_{φ;n,p,s}(W) := sup_{f∈W} E_{φ;n,p,s}(f).        (3.2)
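A minimal sketch (our own; the Gaussian activation, the dimensions, and the random parameters are arbitrary choices) of evaluating a generalized translation network follows. Taking d = s, A_k the identity matrix, and φ radially symmetric reduces it to a radial basis function network.

    # Evaluate sum_{k=1}^n a_k * phi(A_k x + b_k) with d x s matrices A_k
    import numpy as np

    def gtn_output(x, matrices, thresholds, coefficients, phi):
        """Output of a generalized translation network; phi maps R^d to R."""
        return sum(a * phi(A @ x + b)
                   for A, b, a in zip(matrices, thresholds, coefficients))

    rng = np.random.default_rng(0)
    n, d, s = 5, 2, 3
    A = rng.standard_normal((n, d, s))            # weights: d x s matrices
    b = rng.standard_normal((n, d))               # thresholds in R^d
    a = rng.standard_normal(n)                    # coefficients
    gaussian = lambda t: np.exp(-np.dot(t, t))    # phi(t) = exp(-|t|^2)

    print(gtn_output(rng.standard_normal(s), A, b, a, gaussian))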
There are two aspects related to the estimation of E_{φ;n,p,s}(W). We may study its dependence on φ, in which case we make only those assumptions on φ without which the density property itself would not hold, but might not obtain the optimal estimates analogous to (2.7). Alternatively, we may insist on obtaining estimates similar to (2.7), but restrict the class of activation functions φ.
Micchelli and this author have investigated the first aspect of this problem in [32], mainly in the case when φ is a 2π-periodic function. For the simplicity of exposition, we describe the results only in the neural network case; i.e., when φ is a univariate function. The necessary and sufficient condition on φ for the density property to hold is that

    φ̂(1) := (1/(2π)) ∫_{−π}^{π} φ(t) e^{−it} dt ≠ 0.        (3.3)

In the case of periodic functions, the role of algebraic polynomials is played by trigonometric polynomials. The class of all trigonometric polynomials of s variables with coordinatewise degree not exceeding m will be denoted by IH_{m,s}. The class of all 2π-periodic functions in L^p([−π, π]^s) will be denoted by L^{p∗}_s, and the corresponding norm by ||·||^∗_p. Similarly, the 2π-periodic version of W^p_{r,s}([−π, π]^s) will be denoted by W^{p∗}_{r,s}, with the corresponding norm ||f||^∗_{p,r,s}. For f ∈ L^{p∗}_s, we write

    E^∗_{m,p,s}(f) := inf_{T∈IH_{m,s}} ||f − T||^∗_p.        (3.4)

A well known result in approximation theory is that

    E^∗_{m,p,s}(f) ≤ c m^{−r} ||f||^∗_{p,r,s},    f ∈ W^{p∗}_{r,s}.        (3.5)
Moreover, there are uniformly bounded linear operators v^∗_{m,s}, such that v^∗_{m,s}(f) ∈ IH_{2m−1,s} for every f ∈ L^{p∗}_s, v^∗_{m,s}(T) = T for every T ∈ IH_{m,s}, and (consequently)

    ||f − v^∗_{m,s}(f)||^∗_p ≤ c E^∗_{m,p,s}(f),    f ∈ L^{p∗}_s.        (3.6)
The fundamental observation in [32] is the following theorem.

Theorem 3.1 Let 1 ≤ p ≤ ∞, φ ∈ L^{p∗}_1 and φ̂(1) ≠ 0. Then for any integer N ≥ 1,

    || e^{i·} − (1/((2N+1) φ̂(1))) ∑_{k=0}^{2N} exp(2ikπ/(2N+1)) φ(· − 2πk/(2N+1)) ||^∗_p ≤ (c/|φ̂(1)|) E^∗_{N,p,1}(φ).        (3.7)

This powerful observation enabled us to translate many of the interesting results for trigonometric approximation to the language of generalized translation networks. In particular, we observe the following estimate for neural network approximation, obtained from (3.6), (3.7), and some technical estimates.

Theorem 3.2 Let 1 ≤ p ≤ ∞, α := max(1/p, 1/2), φ ∈ L^{p∗}_1, and φ̂(1) ≠ 0. Then for any integers m, N, s ≥ 1, and f ∈ L^{p∗}_s,

    E_{φ;M,p,s}(f) ≤ c_1 E^∗_{m,p,s}(f) + c_2 m^{sα} ||f||^∗_p E^∗_{N,p,1}(φ),        (3.8)

where M = 2^s m^s (2N + 1).
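The identity behind (3.7) is easy to test numerically. The following sketch (our own experiment; the analytic activation φ(t) = exp(cos t) and all parameter values are arbitrary choices) builds the combination of 2N + 1 translates appearing in Theorem 3.1 and measures how well it reproduces e^{ix}; for an analytic φ the error is already tiny for small N, since E^∗_{N,p,1}(φ) decays very fast.

    import numpy as np

    phi = lambda t: np.exp(np.cos(t))                   # analytic, 2*pi-periodic

    # phi_hat(1) = (1/2pi) * integral of phi(t) e^{-it} over a period, computed by
    # the trapezoidal rule, which is spectrally accurate for periodic integrands
    M = 4096
    t = 2 * np.pi * np.arange(M) / M
    phi_hat_1 = np.mean(phi(t) * np.exp(-1j * t))

    N = 4
    tk = 2 * np.pi * np.arange(2 * N + 1) / (2 * N + 1)   # translation points
    x = np.linspace(-np.pi, np.pi, 2001)
    approx = sum(np.exp(1j * k) * phi(x - k) for k in tk) / ((2 * N + 1) * phi_hat_1)
    print(np.max(np.abs(np.exp(1j * x) - approx)))        # roughly 1e-7 for N = 4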
In applying Theorem 3.2, one chooses N such that the two terms on the right hand side of (3.8) are of the same order of magnitude. If φ is analytic in an annular region, and f ∈ W^{p∗}_{r,s}, then we obtain E_{φ;M,p,s}(f) ≤ c m^{−r} ||f||^∗_{p,r,s} with M ≤ c m^s log m. In general, Theorem 3.2 suggests that the smoother the function φ, the better the expected approximation order. If φ is not a periodic function, but some finite linear combination ψ of φ is integrable on IR, one takes the function ∑_{k∈ZZ} ψ(· − 2kπ) in place of φ in Theorem 3.2. The network is then obtained by truncating the infinite series. Our ideas led to one of the first theorems of its kind in the theory of RBF approximation, where the degree of approximation was estimated in terms of the number of RBF evaluations rather than in terms of a scaling factor. Another novel application of these ideas is given in [33], where we obtain dimension-independent bounds similar to those obtained by Barron [1], but with a more general class of activation functions and much easier proofs.

Next, we describe the results where we look for an estimate of the form (2.7), but with a restricted class of activation functions.

Definition 3.1 Let d ≥ 1 be an integer. A real function φ, defined on some nonempty open subset of IR^d, is said to be in the class C_d if the domain of φ contains a sphere in which φ is infinitely many times continuously differentiable, and

    D^k φ(b) ≠ 0,    k ≥ 0, k ∈ ZZ^d,        (3.9)
where b is the center of this sphere.

It seems to be general wisdom that if φ is a univariate, infinitely many times differentiable function on an interval, then φ ∈ C_1 if and only if φ is not a polynomial. We are unable to locate a reference to this fact, but it is easy to see this if φ is assumed to be analytic on a closed interval. In particular, the function φ(x) = (1 + e^{−x})^{−1} ∈ C_1. Among other functions in C_d are the following, where |x| denotes the Euclidean norm of x ∈ IR^d:

    φ(x) := (1 + |x|^2)^α,   α ∉ ZZ,                                       (Generalized multiquadrics)

    φ(x) := |x|^{2q−d} log |x| (d even),  |x|^{2q−d} (d odd),  q ∈ ZZ, q > d/2,   (Thin plate splines)

    φ(x) := exp(−|x|^2).                                                   (The Gaussian function)
We prefer to state our theorems in the case which is superficially more general than the case when the activation function is in C_d.

Definition 3.2 Let d ≥ 1 be an integer, 1 ≤ p ≤ ∞, k ≥ 1 be an integer, and φ be a real function defined on some nonempty open subset of IR^d. We say that φ ∈ A_{p,k,d} if for some nonempty open subset D of the domain of φ, the L^p(D)-closure of Π_{φ;k,d} contains a function whose restriction to D is in C_d.
Analogous to Theorem 3.1, we have the following theorem [29] regarding approximation of algebraic polynomials by GTN's with activation function in A_{p,k,d}. The theorem is not stated exactly in this form, but the proof is easy from the results in [29].

Theorem 3.3 Let 1 ≤ d ≤ s, m ≥ 0, k ≥ 1 be integers, 1 ≤ p ≤ ∞, φ ∈ A_{p,k,d}, and ε > 0 be arbitrary. Then there exists a set A := A_{φ;m,s,ε,p}, with not more than k(6m + 1)^s elements, with the following property. For any P ∈ P_{m,s}, there exists a GTN N := N_{P,ε,p} ∈ Π_{φ;k(6m+1)^s,s}, with weights and thresholds belonging to A, such that ||P − N||_∞ ≤ ε ||P||_p. If φ ∈ C_d, then we may choose all thresholds in N to be equal to b.

One important feature of Theorem 3.3 is that the set of matrices and thresholds is independent of the polynomial being approximated. Also, the number of real parameters in the elements of this set is of the same order of magnitude as those in polynomials in P_{m,s}. If f ∈ L^p \ P_{m,s} and ε > 0, we may first find a polynomial P ∈ P_{m,s} such that ||f − P||_p ≤ (1 + ε) E_{m,p,s}(f), where

    E_{m,p,s}(f) := inf_{P∈P_{m,s}} ||f − P||_p.        (3.10)
Necessarily, ||P||_p ≤ ||f − P||_p + ||f||_p ≤ (1 + ε) E_{m,p,s}(f) + ||f||_p ≤ (2 + ε) ||f||_p. Using Theorem 3.3, we obtain a network G ∈ Π_{φ;k(6m+1)^s,s} such that

    ||P − G||_p ≤ (ε E_{m,p,s}(f) / ((2 + ε) ||f||_p)) ||P||_p ≤ ε E_{m,p,s}(f).
Hence, ||f − G||_p ≤ ||f − P||_p + ||P − G||_p ≤ (1 + 2ε) E_{m,p,s}(f). Since ε > 0 is arbitrary, we have proved the following corollary.

Corollary 3.1 Let d, s, p, φ, m, k be as in Theorem 3.3, and f ∈ L^p. Then E_{φ;k(6m+1)^s,p,s}(f) ≤ E_{m,p,s}(f).

Using the quasi-interpolatory polynomial operators defined in [34], [35] in place of the operators v^∗_{m,s} in the trigonometric case, Prestin and this author have proved the following theorem in [36], where we obtain the optimal approximation order using the samples of the target function at judiciously chosen "nodes". We recall that our functions are defined on [−π, π]^s and are equal to zero outside [−3/2, 3/2]^s. First, we describe the nodes in the case when s = 1. The nodes in this case are the zeros of Jacobi polynomials on [−2, 2], which
we now define. In the remainder of this section, let α, β > −1 be fixed numbers. Their mention will be omitted from the notations and constants. For x ∈ [−2, 2], let w(x) := (2 − x)^α (2 + x)^β. By Gram-Schmidt orthogonalization, one may obtain a unique system of orthonormalized polynomials {p_n}, such that p_n ∈ P_{n,1}, n = 0, 1, . . ., the leading coefficient of each p_n is positive, and for integers n, m ≥ 0,

    ∫_{−2}^{2} p_n(x) p_m(x) w(x) dx = 1 if n = m, and 0 otherwise.        (3.11)

The polynomials {p_n} are called the Jacobi polynomials (adjusted to [−2, 2]). It is customary to define these polynomials as orthogonal polynomials with respect to a weight function on [−1, 1], but for our purpose, it is essential to define them the way we have. It is well known that for each integer m ≥ 1, the polynomial p_m has m simple zeros {x_{k,m}}, 1 ≤ k ≤ m, in (−2, 2). In the case s ≥ 1, we define the nodes as follows. For an integer m ≥ 1 and a multi-integer k, we write x_{k,m} := (x_{k_1,m}, . . . , x_{k_s,m}).

Theorem 3.4 Let s, r ≥ 1, m ≥ 2 be integers, and φ be as in Theorem 3.3. There exists a set B := B_{φ,m,r,s} containing O(km^s) elements, an integer N = O(km^s), and m^s GTN's N_{k,m} ∈ Π_{φ;N,s}, 1 ≤ k ≤ m, each with weights and thresholds chosen from B, with the following property. Let f ∈ B^p_{r,s}. Then

    ||f − ∑_{1≤k≤m} f(x_{k,m}) N_{k,m}|| ≤ c m^{−r}.        (3.12)
If φ ∈ C_d, then we may choose all thresholds in all N_{k,m}'s to be equal to b. We observe that the networks N_{k,m} in the above theorem are independent of the target function. Thus, there is no training of the networks in the usual sense. Moreover, the matrices can be chosen to have a very small norm. One drawback of our construction is that the coefficients in the networks become unbounded as m → ∞. We do not feel that this is a serious shortcoming, partly because the networks N_{k,m} are constructed once for all the target functions, and partly because the divided difference schemes involved in their construction are numerically stable if φ ∈ C_d. Moreover, we have shown in [28] that this phenomenon cannot be avoided if the activation function is smoother than the target function and matrices with a small norm are used. In the case when the activation function is defined on IR^s, the networks in the above theorems are not "pure" translation networks in the sense that they are not linear combinations of functions of the form φ(· − y_k). In the case when φ is the Gaussian function, we are able ([27], [24]) to obtain theorems similar to Theorem 3.4 with pure translation networks. In [30], we have applied our ideas to the problem of approximating a nonlinear functional defined on a compact subset of a function space, obtaining near optimal results.
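For readers who wish to experiment with these nodes, the univariate zeros x_{k,m} can be obtained from standard Gauss-Jacobi routines; the following sketch (our own, assuming SciPy is available) rescales the zeros computed for the usual Jacobi weight on [−1, 1] to the interval [−2, 2] used above. For s ≥ 1, the multivariate nodes x_{k,m} are then formed coordinatewise from these values.

    import numpy as np
    from scipy.special import roots_jacobi

    def jacobi_nodes(m, alpha, beta):
        # roots_jacobi gives the m simple zeros in (-1,1) of the polynomial
        # orthogonal for the weight (1-t)^alpha (1+t)^beta; the substitution
        # t -> x/2 turns these into the zeros for w(x) = (2-x)^alpha (2+x)^beta
        # on [-2,2], so the adjusted nodes are simply twice the standard ones.
        t, _ = roots_jacobi(m, alpha, beta)
        return 2.0 * t

    print(jacobi_nodes(5, alpha=0.0, beta=0.0))   # alpha = beta = 0: rescaled Legendre zeros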
4 Local approximation
To illustrate the concept of local approximation, let s ≥ 2 be an integer, f be continuous on [−1, 1]^s, and continuously differentiable at every point of this cube except for one point c ∈ (−1, 1)^s. We consider φ(x) = χ_s(x), where

    χ_s(x) := 1 if x ∈ [0, 1]^s, and 0 otherwise.        (4.1)

Since f is not continuously differentiable on the whole cube, we do not expect a uniform approximation to f from Π_{χ_s;m^s,s} within 1/m, m = 1, 2, · · ·. However, if an approximation is desired at a point x away from c, we may actually approximate f(x) by a network with just 1 neuron arbitrarily closely! Indeed, let m be any integer such that the distance between x and c is greater than 1/m, and for k = 1, · · · , s, let y_k be the integer part of mx_k. Since f is continuously differentiable in the sphere of diameter 1/m centered at x,

    |f(x) − f(y) χ_s(m(x − y))| ≤ ||f||_{∞,1,s}/m,        (4.2)
where y := y(x) := (y_1, . . . , y_s). We observe that the function y takes at most O(m^s) values on [−1, 1]^s. Therefore, the curse of dimensionality is manifest in the form of the database of points of the form (y, f(y)) that we have to maintain, not in the network architecture itself. Moreover, the network in (4.2) does not require any training at all. In a computer simulation, we may choose m to be a power of two, and the point y can be computed easily by just truncating the binary representation of x. It is argued in [26] that this is a reasonable way to get around the curse of dimensionality when the only knowledge about f is of the form f ∈ W^p_{r,s}. If σ is the Heaviside function, then it is easy to see that

    χ_s(x) = σ( ∑_{j=1}^{s} (σ(x_j) + σ(1 − x_j)) − 2s + 1/2 ).        (4.3)
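The lookup network of (4.2) is trivial to implement. The following sketch (our own; for definiteness, y here denotes the truncated point itself, so that the stored value f(y) is returned for every x in the same dyadic cell) illustrates the binary-truncation idea and the O(1/m) error.

    import numpy as np

    def truncate(x, m):
        """Round each coordinate of x down to a multiple of 1/m."""
        return np.floor(m * np.asarray(x)) / m

    def local_approximation(f, x, m):
        return f(truncate(x, m))              # one table lookup, no training

    f = lambda x: np.sin(np.sum(x))           # an illustrative smooth target
    x = np.array([0.31, -0.77, 0.52])
    for m in (8, 64, 512):                    # powers of two
        print(m, abs(f(x) - local_approximation(f, x, m)))   # error decays like 1/m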
Thus, (4.3) shows that the function χ_s can be represented as a classical neural network with 2s + 1 neurons and two hidden layers. It is proved in [4] that there is no fixed integer ℓ such that χ_s is in the L^1-closure of the outputs of the networks Π_{σ;ℓ,s} having one hidden layer! On the other hand, if one is willing to use sigmoidal functions of higher order as defined in [31] and networks with multiple hidden layers, then it is possible to obtain (cf. [25], [4]) the so called higher order B-splines and Chui-Wang spline wavelets arbitrarily accurately using a fixed number of neurons, the number being independent of the desired accuracy. To explore this limitation on networks with one hidden layer, let Y := Y_m be the set of all values of y as in (4.2), and f ∈ B^∞_{1,s}. It is easy to see that

    ||f − ∑_{z∈Y} f(z) χ_s(m(· − z))||_∞ ≤ c/m.        (4.4)
The localization in the above formula can be recognized by the following fact. If ∆(f) is the diameter of the closure of the set {x : f(x) ≠ 0} and Λ(f, m) := (∆(f)^s + m^{−s}), then the network in (4.4) is actually in Π_{χ_s;cΛ(f,m)m^s,s} for some constant c. In general, the function f is supported on the whole cube, the localization factor Λ(f, m) ∼ c, and (4.4) is consistent with the curse of dimensionality (2.9). Using a smooth partition of identity, we may think of f as a sum of O(m^s) functions, each supported on a cube of diameter c/m. For each of these components, the localization factor is of the order m^{−s} and one can actually use networks with a constant number of neurons as in (4.2) to approximate each component.

In the case of networks with one hidden layer, we may formulate the local approximation problem as follows. We wish to approximate an arbitrary function f ∈ W^p_{1,s} by networks from Π_{φ;Λ(f,m)N,s} within an accuracy 1/m measured in the L^p-norm. The problem is to estimate N as a function of m so that this is possible. In view of the negative results in [4], we may even go beyond neural networks, and allow each neuron to evaluate its own activation function, depending upon the target function f. To formalize these ideas, let L^p_{d,loc} be the class of all measurable functions, p-th power integrable on compact subsets of IR^d, and for integer N ≥ 1,

    ε_{N,p,d,s}(f) := inf ||f − ∑_{j=1}^{N} c_j g_j(A_j(·))||_1,        (4.5)

where the infimum is taken over all c_j ∈ IR, d × s matrices A_j, and g_j ∈ L^p_{d,loc}, j = 1, · · · , N. If N ≥ 1 is not an integer, we define ε_{N,p,d,s}(f) = ε_{m,p,d,s}(f), where m is the integer part of N. We are interested in

    ε̃_{m,p,d,s} := inf{ t > 0 : sup_{f∈B^p_{1,s}} ε_{Λ(f,m)t,p,d,s}(f) ≤ 1/m }.        (4.6)
The quantity ε̃_{m,p,d,s} represents the minimum number of neurons in a network with one hidden layer, each evaluating its own activation function in L^p_{d,loc}, that are necessary to guarantee a localized approximation with accuracy 1/m for every function in B^p_{1,s}. The analogous quantity for localized approximation from GTN's is

    Ẽ_{φ;m,p,s} := inf{ t > 0 : sup_{f∈B^p_{1,s}} E_{φ;Λ(f,m)t,p,s}(f) ≤ 1/m }.        (4.7)
In [5], we proved the following theorem.

Theorem 4.1 Let s ≥ 2 be an integer and 1 ≤ p ≤ ∞.
(a) For any δ > 0, integers k ≥ 1, 1 ≤ d ≤ s, and φ ∈ A_{p,k,d},

    Ẽ_{φ;m,p,s} ≤ c(δ) k m^{s+δ},    m = 1, 2, · · · .        (4.8)

(b) On the other hand, if 1 ≤ d < s is an integer,

    ε̃_{m,p,d,s} ≥ c m^s log m,    m = 1, 2, · · · .        (4.9)
We have not stated the theorem exactly in this way in [5], but the proof can be extended easily to prove the theorem as stated. Along with Khachikyan, we applied our ideas in [25] to the problem of predicting the flour data [44]. In the paper [37], we applied the ideas to the problem of beamforming in phased array antennas. In both cases, our algorithms required some preprocessing of the data. This eliminated the need to perform any training of the networks in the traditional sense. In particular, we were able to avoid any nonlinear optimization similar to back-propagation, nearest neighbor, etc., and avoid pitfalls associated with such procedures, such as local minima. In both cases, our results were substantially better than previously known results. We achieved 30 to 40 times better results in the case of flour data, and an improvement of 50% to 100% in the case of the beamforming problems.
5 Some open problems

1. In private discussions with this author, Micchelli has often conjectured that it is possible to obtain optimal approximation using any activation function for which the density result is true. In any case, it will be interesting to give a complete characterization of activation functions for which optimal approximation can be obtained.

2. In particular, a characterization of the class A_{p,k,d} for a fixed k will be interesting. This question has been studied in some detail by Kurková [19], [20].

3. It is an open problem whether one can obtain the upper bound Ẽ_{φ;m,p,s} ≤ c m^s log m for some φ.

4. It will also be interesting to find out if an upper bound of the form Ẽ_{φ;m,p,s} ≤ c m^s can be obtained if φ : IR^s → IR is a radially symmetric function.
References

[1] A. R. Barron, Universal approximation bounds for superposition of a sigmoidal function, IEEE Trans. Information Theory, 39 (1993), 930-945.
[2] C. K. Chui and X. Li, Approximation by ridge functions and neural networks with one hidden layer, J. Approx. Theory, 70 (1992), 131-141.
[3] C. K. Chui and X. Li, Realization of neural networks with one hidden layer, in Multivariate Approximations: From CAGD to Wavelets, (K. Jetter and F. Utreras, Eds.), World Scientific Publ., 1993, pp. 77-89.
[4] C. K. Chui, X. Li, and H. N. Mhaskar, Localized approximation by neural networks, Mathematics of Computation, 63 (1994), 607-623.
[5] C. K. Chui, X. Li, and H. N. Mhaskar, Limitations of the approximation capabilities of neural networks with one hidden layer, Advances in Computational Mathematics, 5 (1996), 233-243.
[6] G. Cybenko, Approximation by superposition of sigmoidal functions, Mathematics of Control, Signals and Systems, 2 (1989), 303-314.
[7] R. J. P. de Figueiredo, Implications and applications of Kolmogorov's theorem, IEEE Trans. Automatic Control, 25 (1980), 1227-1231.
[8] R. DeVore and G. G. Lorentz, "Constructive Approximation", Springer Verlag, New York, 1993.
[9] R. DeVore, R. Howard and C. A. Micchelli, Optimal nonlinear approximation, Manuscripta Mathematica, 63 (1989), 469-478.
[10] K. I. Funahashi, On the approximate realization of continuous mappings by neural networks, Neural Networks, 2 (1989), 183-192.
[11] F. Girosi, M. Jones and T. Poggio, Regularization theory and neural networks architectures, Neural Computation, 7 (1995), 219-269.
[12] R. Hecht-Nielsen, Kolmogorov's mapping neural network existence theorem, Proc. of the Internat. Conf. on Neural Networks, IEEE, New York, 1987, III, pp. 11-14.
[13] K. Hornik, M. Stinchcombe, and H. White, Multilayer feedforward networks are universal approximators, Neural Networks, 2 (1989), 359-366.
[14] Y. Ito, Representation of functions by superpositions of a step or sigmoid function and their applications to neural network theory, Neural Networks, 4 (1991), 385-394.
[15] Y. Ito, Approximation of functions on a compact set by finite sums of a sigmoid function without scaling, Neural Networks, 4 (1991), 817-826.
[16] H. Katsuura and D. A. Sprecher, Computational aspects of Kolmogorov's superposition theorem, Neural Networks, 7 (1994), 455-461.
[17] V. Kurková, Kolmogorov's theorem is relevant, Neural Computation, 3 (1991), 617-622.
[18] V. Kurková, Kolmogorov's theorem and multilayer neural networks, Neural Networks, 5 (1992), 501-506.
[19] V. Kurková, Approximation of functions by perceptron networks with bounded number of hidden units, Neural Networks, 8 (1995), 745-750.
[20] V. Kurková, Trade-off between the size of weights and the number of hidden units in feedforward networks, Research Report ICS-96-495.
[21] M. Leshno, V. Lin, A. Pinkus, and S. Schocken, Multilayer feedforward networks with a nonpolynomial activation function can approximate any function, Neural Networks, 6 (1993), 861-867.
[22] G. G. Lorentz, "Approximation of Functions", Holt, Rinehart and Winston, New York, 1966.
[23] L. Khachikyan and H. N. Mhaskar, Neural networks for function approximation, in "Neural Networks for Signal Processing, V", (F. Girosi, J. Makhoul, E. Manolakos, E. Wilson, Eds.), IEEE, New York, 1995, pp. 21-29.
[24] H. N. Mhaskar, "An Introduction to the Theory of Weighted Polynomial Approximation", World Scientific, Singapore, 1996.
[25] H. N. Mhaskar, Approximation properties of a multilayered feedforward artificial neural network, Advances in Computational Mathematics, 1 (1993), 61-80.
[26] H. N. Mhaskar, Neural networks for localized approximation of real functions, in "Neural Networks for Signal Processing, III", (Kamm, Huhn, Yoon, Chellappa and Kung, Eds.), IEEE, New York, 1993, pp. 190-196.
[27] H. N. Mhaskar, Versatile Gaussian networks, Proceedings of the IEEE Workshop on Nonlinear Image and Signal Processing, (I. Pitas, Ed.), Halkidiki, Greece, June, 1995, IEEE, pp. 70-73.
[28] H. N. Mhaskar, On smooth activation functions, To appear in Annals of Mathematics and Artificial Intelligence.
[29] H. N. Mhaskar, Neural networks for optimal approximation of smooth and analytic functions, Neural Computation, 8 (1996), 164-177.
[30] H. N. Mhaskar and N. Hahm, Neural networks for functional approximation and system identification, Neural Computation, 9 (1997), 143-159.
[31] H. N. Mhaskar and C. A. Micchelli, Approximation by superposition of a sigmoidal function and radial basis functions, Advances in Applied Mathematics, 13 (1992), 350-373.
[32] H. N. Mhaskar and C. A. Micchelli, Degree of approximation by neural and translation networks with a single hidden layer, Advances in Applied Mathematics, 16 (1995), 151-183.
[33] H. N. Mhaskar and C. A. Micchelli, Dimension-independent bounds on the degree of approximation by neural networks, IBM J. Research and Development, 38 (1994), 277-284.
[34] H. N. Mhaskar and J. Prestin, Bounded quasi-interpolatory polynomial operators, Submitted for publication.
[35] H. N. Mhaskar and J. Prestin, On Marcinkiewicz-Zygmund-type inequalities, To appear in "Approximation Theory: in Memory of A. K. Varma", (N. K. Govil, R. N. Mohapatra, Z. Nashed, A. Sharma, and J. Szabados, Eds.), Marcel Dekker.
[36] H. N. Mhaskar and J. Prestin, On a choice of sampling nodes for optimal approximation of smooth functions by generalized translation networks, To appear in the Proceedings of the International Conference on Artificial Neural Networks, Cambridge, U.K., July, 1997.
[37] H. N. Mhaskar and H. L. Southall, Neural beam-steering and direction finding, To appear in Proc. EANN97.
[38] M. Nees, Approximative versions of Kolmogorov's superposition theorem, proved constructively, J. Comput. Appl. Math., 54 (1994), 350-373.
[39] M. Nees, Chebyshev approximation by discrete superposition, application to neural network, Adv. Comput. Math., 5 (1996), 137-151.
[40] A. Pinkus, TDI-subspaces of C(IR^d) and some density problems from neural networks, Journal of Approximation Theory, 85 (1996), 269-287.
[41] J. Park and I. W. Sandberg, Universal approximation using radial basis function networks, Neural Computation, 3 (1991), 246-257.
[42] D. A. Sprecher, A universal mapping for Kolmogorov's superposition theorem, Neural Networks, 6 (1993), 1089-1094.
[43] E. M. Stein, "Singular Integrals and Differentiability Properties of Functions", Princeton Univ. Press, Princeton, 1970.
[44] G. C. Tiao and R. S. Tsay, Model specification in multivariate time series, J. R. Statist. Soc., 51 (1989), 157-213.
[45] A. F. Timan, "Theory of Approximation of Functions of a Real Variable", Macmillan Co., New York, 1963.