Recent Results and Mathematical Methods for Functional Approximation by Neural Networks

Paul C. Kainen
Dept. of Mathematics, Georgetown University
Washington, DC 20057
[email protected]

Abstract

We describe various models for functional approximation by neural networks, including recent "dimension-independent" results, and emphasizing geometry and topology. These are compared with the more general theory of parametric approximation. Some ideas for heuristic improvement of neural networks are given, and the issues of uniqueness and continuity of parameter selection are discussed.

Keywords. Feedforward architectures, parallel computation, best approximants, geometric models.

1 Introduction

The general setting of parameterized approximation by neural networks is that given a target function $f : \mathbb{R}^d \to \mathbb{R}^m$ and a desired accuracy $\varepsilon > 0$, one attempts to find a parameter vector $\theta$ such that $\|f - A_\theta\| < \varepsilon$, where $A_\theta$ denotes the input-output function produced by a neural network architecture $A$ using $\theta$ as "weights". When the input dimension is $d$ and the output dimension is 1, $A_\theta : \mathbb{R}^d \to \mathbb{R}$. For example, if $A$ is the standard perceptron architecture with activation function $\sigma$, $k$ hidden units and 1 linear output unit, then for $\theta$ composed of the $v_i$, $c_i$ and $b_i$ below, one has the usual expression
\[
A_\theta(x) = \sum_{i=1}^{k} c_i\, \sigma(v_i \cdot x + b_i),
\]
where $v_i \in \mathbb{R}^d$ and $b_i, c_i \in \mathbb{R}$ are the parameters.

The norm $\|\cdot\|$ depends on the problem. Typical functional-analysis norms such as the supremum or $L^p$ norms, or others involving various derivatives and moduli of continuity, are quantifiable and convenient for the theory but may not agree with human judgement. The choice of norm is a key step in any application.

There are two fundamentally different cases which may arise, corresponding essentially to individual vs. variable. In the first case, which is the most widely studied, given the architecture $A$, the target function $f$ and a required accuracy $\varepsilon$, one wishes to choose any suitable $\theta$, possibly subject to conditions such as an $L^1$ bound on the subvector of output weights. In the variable scenario, it is assumed that the parameter is currently well chosen to match the target function, but the target changes, so that the control must be commensurately adjusted. Thus, the emphasis is on discovering how the parameter vector depends on the target, rather than merely picking an example. One of the techniques of nonlinear approximation, the hypothesis of "continuous selection", formalizes the role of stability of approximation by requiring that the parameter vector be a continuous function of the target. But requiring such a continuous selection narrows the possible choices of parameterization and can lead to less accurate approximation.

This chapter is organized as follows. The next section considers the individual and variable problems. Then, in section 3, we present some recent results on functional approximation which provide a quadratic rate of convergence, independent of the dimension of the input space. Section 4 gives a general description of feedforward architectures and points out how limited heuristic capabilities in the hidden units could enable substantial performance improvements. Section 5 outlines the ideas of nonlinear approximation which provide a lower bound for the rate of approximation for any method utilizing continuous selection of parameters. Section 6 discusses uniqueness of parameterization. In section 7, we survey some other approaches which have been proposed in the literature, including the use of geometry and topology, and we include some comments on strategic aspects. We conclude with an acknowledgement.
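As a concrete illustration of the perceptron input-output map introduced above, here is a minimal numerical sketch (our own, not from the chapter; the logistic sigmoid is chosen arbitrarily as the activation):

\begin{verbatim}
import numpy as np

def sigmoid(t):
    """A standard sigmoidal activation (any bounded monotone sigmoid would do)."""
    return 1.0 / (1.0 + np.exp(-t))

def A_theta(x, V, b, c):
    """Perceptron net with k hidden units: sum_i c_i * sigmoid(v_i . x + b_i).

    x : input vector of shape (d,)
    V : hidden weight matrix of shape (k, d); b, c : vectors of shape (k,)"""
    return c @ sigmoid(V @ x + b)

# Example: d = 3 inputs, k = 4 hidden units, parameters chosen at random.
rng = np.random.default_rng(0)
d, k = 3, 4
V, b, c = rng.normal(size=(k, d)), rng.normal(size=k), rng.normal(size=k)
print(A_theta(rng.normal(size=d), V, b, c))
\end{verbatim}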

2 Individual vs. variable context

An illuminating example of neural network approximation is to view a picture as a function $f : I^2 \to I$, where $I = [0,1]$; $f$ associates a brightness to each point of the image domain. To choose a positive integer $k$ and a parameter vector $\theta \in \mathbb{R}^k$ for which $\|f - A_\theta\|$ is small amounts to compressing the information in the picture while still retaining an approximate identity. This corresponds to the individual case. Note that by generalizing picture to pictorial flow (i.e., by looking at a picture as a function on $I^3$), this already includes a dynamical perspective. The goal of the neural network is to simulate a flow of light approximately corresponding to that of the picture. On the other hand, one may also wish to classify the picture as belonging to one of a set of possible types. This can be done by using a neural network to compress the characteristic function of the classification, as we described with Kůrková in [40]. One hopes that by choosing the space so that its geometry corresponds to the visual grammar, it is possible to have the characteristic function represented by closed convex bodies, but this topic has not yet been investigated. The choice of geometric model leads to a way to encode the picture in some parameters corresponding to the size, shape and placement of the convex bodies. So studying the variable case of approximation might allow us to find efficient compression of variable scenes, with the encodings facilitating classification of the picture.

The existence of individual approximation is settled once it is shown that a neural network architecture produces a dense set of functions in some space of possible targets. For some target spaces, one can even describe the cost of approximation in terms of the number of units or bounds on the norms of parameters. A goal of this research program, suggested some years ago by R. Hecht-Nielsen, is to find a "canonical" parameterization. The work of A. Barron [5] and our results with Kůrková and Kreinovich [41] achieve this goal for some functions by explicitly computing parameters based on a finite-sum approximation to an integral formula, of Fourier or plane-wave type, respectively. Moreover, convergence can be achieved at a quadratic rate in both the $L^2$ and $L^\infty$ norms.

However, many of the problems to which one would like to apply neural networks are variable, since they depend on a changing context. For instance, in terms of pictures, if one already has a suitable representation for one frame, one would hope to be able to modify the representation only slightly to accommodate the next frame in a video signal (as is already done in other methods of compression like MPEG). Variable approximation arises in several natural ways. One can simply generalize the structure of, say, a perceptron network by allowing inputs to be elements in a function space, with weights being, e.g., measures defined parametrically and with scalar multiplication replaced by convolution. The activation function now becomes a nonlinear operator. Modeling considerations also lead to a variable viewpoint, not only because signals tend to be nonstationary but also because dynamical information can reduce ambiguity. An elementary appearance of this idea is that movement directly toward or away from the observation point is often more difficult to detect than lateral movement, and this is indeed exploited in biological behaviors.
Variable approximation is only beginning to be analyzed; see, e.g., Chen and Chen [9]; Dingankar and Sandberg [17], also [62]; Puechmorel, Ibnkahla and Castanie [56], [57]; Mhaskar and Hahm [50].
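To make the individual case concrete, the following minimal sketch (our own illustration; the synthetic "picture" and the fitting procedure are not from the chapter) compresses a brightness function $f : I^2 \to I$ with a small perceptron net, drawing the hidden parameters at random and fitting only the output weights by least squares, which is merely one convenient way to choose $\theta$:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# A synthetic "picture": brightness f(s, t) sampled on a grid in I^2 = [0,1]^2.
n = 32
s, t = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n))
X = np.column_stack([s.ravel(), t.ravel()])              # points of I^2
f = (0.5 + 0.5 * np.sin(6 * s) * np.cos(4 * t)).ravel()  # brightness values in [0,1]

# A k-hidden-unit perceptron net: random (v_i, b_i), output weights c by least squares.
k = 40
V, b = rng.normal(scale=4.0, size=(k, 2)), rng.normal(scale=4.0, size=k)
H = np.tanh(X @ V.T + b)                                 # hidden-unit responses
c, *_ = np.linalg.lstsq(H, f, rcond=None)

approx = H @ c
err = np.sqrt(np.mean((f - approx) ** 2))
print(f"parameters: {V.size + b.size + c.size}  vs  pixels: {f.size},  RMS error: {err:.3f}")
\end{verbatim}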

3 Nonlinear approximation

In this section, we shall describe and indicate the significance of several recent "dimension-independent" results on functional approximation. Suppose one is given a target space $K$ (that is, a collection of functions to be approximated) and an ambient space $X$ in which $K$ is compactly embedded. Let us assume $X$ is a Banach space for concreteness. For any $\varepsilon > 0$, find the least $n$ for which it is possible to approximately factor the embedding through $\mathbb{R}^n$,
\[
K \xrightarrow{\;a\;} \mathbb{R}^n \xrightarrow{\;M\;} X,
\]
with $a$ continuous, so that the "continuous selection" $a$ and "manifold" $M$ satisfy $\|f - Ma(f)\|_X < \varepsilon$ for all $f$ in $K$. The problem is often phrased in dual form: given $f$ and $n$, what is the infimum $d_n(K, X)$ of the error $\|f - Ma(f)\|_X$ for any approximate factorization of $f$ through $\mathbb{R}^n$ via a continuous selection? This is called the nonlinear $n$-width, and it applies to various methods of approximation, such as wavelets, as well as to neural networks. Recall that for sequences $a(n)$, $b(n)$, we write $a(n) \asymp b(n)$ as $n \to \infty$ if the $n$-th term of each sequence is asymptotically less than or equal to a positive constant times the $n$-th term of the other. DeVore et al. ([15], [16]) have shown that there is a space $W$ of functions on the unit cube in $d$ dimensions, defined by a smoothness constraint of order $\alpha$, with $d_n(W) \asymp n^{-\alpha/d}$. Thus, for a fixed degree of smoothness, as dimension increases, the accuracy of approximation decreases for any method based on a continuous selection.

Neural networks fit the nonlinear paradigm with $M(\theta) = A_\theta$. In [47], Mhaskar proves that neural networks with a single hidden layer are capable of providing the optimal order of approximation, achievable with a continuous selection, for functions with a given number of derivatives, provided that the activation function obeys certain mild technical constraints. He also constructs networks that provide a geometric order of approximation for analytic target functions. Ignoring the issue of continuous selection, Barron [5] and Jones [29] showed that if the target function is suitably bounded, then the number of hidden units needed to achieve accuracy $\varepsilon$ need only grow quadratically in $1/\varepsilon$, independent of the input dimension, even if the parameters are chosen incrementally. (See the chapter by Kůrková in this volume for a review of incremental approximants.) For these more well-behaved target functions, such dimension-independent quadratic bounds are a major improvement on previous approximation results, for which the number of hidden units is $O(1/\varepsilon^d)$, where $\varepsilon$ is the error and $d$ is the dimension. However, the earlier theorems apply to a much more general class of functions [49], [33]. In the Jones-Barron scheme, one can further have an $L^1$ bound on the output coefficients. Such an upper bound on the sum of the absolute values of the output weights leads to an interesting generalization of total variation and is an example of how theoretical questions regarding neural networks can lead to new mathematical insights; see Kůrková [37].

Jones [29] estimated the rate of approximation for functions in the convex closure of a bounded subset of a Hilbert space; see also Barron [5, p. 934]. The result had actually been found previously by Maurey, but in rediscovering it, they gave a constructive argument, based on approximately optimizing certain convex combinations, which permits its application to incremental approximation by neural networks.

Theorem 3.1 Let $H$ be a Hilbert space with norm $\|\cdot\|$, let $B$ be a positive real number and let $G$ be a subset of $H$ such that $\|g\| \le B$ for every $g \in G$. Then for every $f \in \operatorname{cl\,conv} G$, for every $c > B^2 - \|f\|^2$ and for every natural number $n$, there exists $f_n$ that is a convex combination of $n$ elements of $G$ such that
\[
\|f - f_n\|^2 \le \frac{c}{n}.
\]

In the theorem, $\mathrm{cl}$ refers to closure in the Hilbert space and $\mathrm{conv}$ denotes convex hull. Note that if $f \ne 0$, then $B^2$ will suffice for $c$. To estimate the number of hidden units in neural networks, one takes $G$ to be the set of functions computable by single-hidden-unit networks for various types of computational units. Convex combinations of $n$ such functions can be computed by a network with $n$ hidden units and one linear output unit. Thus, an element in the closure of the convex hull of a bounded set in Hilbert space can be approximated within a mean square error of $c/n$ by neural networks with $n$ hidden units.
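The constructive argument can be illustrated numerically. The following minimal sketch (our own illustration) performs incremental convex approximation in a finite-dimensional Hilbert space, with a finite candidate set $G$ and an exact line search at each step; by Theorem 3.1, the squared error is expected to decay on the order of $1/n$.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# A finite-dimensional stand-in for the Hilbert space H and a finite candidate set G
# with ||g|| <= B for every g in G; the target f lies in conv G by construction.
D, m = 50, 200
G = rng.normal(size=(m, D))
G /= np.linalg.norm(G, axis=1, keepdims=True)        # now ||g|| = 1, so B = 1
f = rng.dirichlet(np.ones(m)) @ G                    # a convex combination of the g's

def incremental_approximation(f, G, steps):
    """Greedy convex updates f_n = (1 - a) f_{n-1} + a g, choosing g in G and
    a in [0, 1] to minimize the residual at each step (in the spirit of the
    Maurey-Jones-Barron argument)."""
    errors = []
    fn = G[np.argmin(np.linalg.norm(G - f, axis=1))]   # best single element of G
    errors.append(float(np.linalg.norm(f - fn) ** 2))
    for _ in range(1, steps):
        r = f - fn
        diffs = G - fn                                 # candidate directions g - f_{n-1}
        alphas = np.clip((diffs @ r) / (np.sum(diffs ** 2, axis=1) + 1e-12), 0.0, 1.0)
        new_errs = np.sum((r[None, :] - alphas[:, None] * diffs) ** 2, axis=1)
        j = int(np.argmin(new_errs))
        fn = fn + alphas[j] * (G[j] - fn)
        errors.append(float(new_errs[j]))
    return errors

errs = incremental_approximation(f, G, steps=64)
for n in (1, 4, 16, 64):
    print(f"n = {n:3d}   ||f - f_n||^2 = {errs[n - 1]:.5f}   n * error = {n * errs[n - 1]:.3f}")
\end{verbatim}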

The following application is due to Barron [5].

Theorem 3.2 Suppose $f$ is a function on $\mathbb{R}^d$ with a Fourier representation of the form
\[
f(x) = \int_{\mathbb{R}^d} e^{i\omega \cdot x}\, \tilde{f}(\omega)\, d\omega,
\]
where $\tilde{f}$ is complex-valued, and further suppose that $\omega \tilde{f}(\omega)$ is in $L^1(\mathbb{R}^d)$. If $f \ne 0$ and $\sigma$ is an arbitrary sigmoidal function, then for every $n \ge 1$, there is a function $P_n$, formed by the output of an $n$-hidden-unit perceptron net, such that on the unit ball $B$ in $\mathbb{R}^d$,
\[
\|f - P_n\|_{L^2(B)} \le \|\omega \tilde{f}(\omega)\|_{L^1(\mathbb{R}^d)}\, n^{-1/2}.
\]

We have stated the theorem in a simple special case in order to give the basic idea. If one uses the unit cube instead of $B$, then the constant above is replaced by $(1/2)\|w\tilde{f}(\omega)\|_{L^1(\mathbb{R}^d)}$, where $w = \|\omega\|_1$ is just the sum of the absolute values of the coordinates of $\omega$.

In [41] we proved with Kůrková and Kreinovich a somewhat analogous result, utilizing distribution theory in place of Fourier analysis. Recall that $D_e^{(d)}$ denotes the $d$-th iterated directional derivative in the direction $e$ ([18]) and $H_{eb}$ is the hyperplane $\{y \in \mathbb{R}^d : e \cdot y + b = 0\}$.

Theorem 3.3 If $d$ is odd and $f \ne 0$ is a function on $\mathbb{R}^d$ with continuous $d$-th derivatives and compact support, then for every integer $n \ge 1$ and every bounded sigmoidal function $\sigma$, there is a function $P_n$, produced by an $n$-hidden-unit perceptron net, such that
\[
\|f - P_n\|_{L^2(B)} \le (1/2)(2\pi)^{1-d}\, v_d^{1/2}\, \|w_f\|_{L^1(S^{d-1} \times \mathbb{R})}\, n^{-1/2},
\]
where $v_d$ is the volume of the unit ball in $\mathbb{R}^d$ and $w_f(e, b) = \int_{H_{eb}} D_e^{(d)} f(y)\, dy$.

Again, we are not giving the strongest form of the theorem, in order to improve clarity. However, the restriction to odd dimensions is required by our proof, which utilizes a special fact about delta distributions only valid in odd dimensions. If the unit ball is replaced by the unit cube, the theorem holds as stated except that the square root of $v_d$ is no longer present in the upper bound.

Our theorem can be applied in two ways to yield bounds. If there is a reasonable global bound on the directional derivative at every point in $\mathbb{R}^d$ (some bound does exist because the derivatives are continuous with compact support), then one can multiply it by an estimate of the largest possible cross-sectional area of any hyperplane intersected with the support. For the Euclidean ball in $d$-space, this is clearly $v_{d-1}$, and there is a corresponding version for the unit cube due to Ball [3], where the largest cross-sectional area is $2^{1/2}$, independent of $d$. Here "area" means $(d-1)$-dimensional Lebesgue measure. On the other hand, if there is a global bound $R$ in the Radon transform space on the maximum possible flow across any hyperplane (which might, e.g., come from physical considerations), then we have a bound on $|w_f|$. This provides a bound on the $L^1$ norm of $w_f$, since we can restrict the integral to the (compact) set $H(f)$ of all hyperplanes which intersect the support of $f$. If the support of $f$ is contained in the unit ball, then the volume of $H(f)$ is at most twice the area of the unit sphere in $d$ dimensions. Multiplying this by $R$ gives a bound on the $L^1$-norm of $w_f$. Since the various constants in our result are decreasing at an exponential rate as a function of the dimension $d$, it follows that unless some family of functions has $d$-th derivatives increasing at a superexponential rate (in terms of $d$), the rate of approximation is improving with higher dimension. Thus, while Barron's result is more generally applicable, the plane-wave approach used here gives a tighter estimate for rates of approximation.

In [41] we also derived an integral formula of independent interest. Write $C^d(\mathbb{R}^d)$ for the set of functions on $\mathbb{R}^d$ with continuous $d$-th order partial derivatives. We write $\vartheta$ for the Heaviside threshold function defined by $\vartheta(t) = 0$ if $t < 0$ and $\vartheta(t) = 1$ if $t \ge 0$.

Theorem 3.4 For every odd positive integer $d$, if $f \in C^d(\mathbb{R}^d)$ is compactly supported, then for all $x \in \mathbb{R}^d$,
\[
f(x) = -a_d \int_{S^{d-1}} \int_{\mathbb{R}} \left( \int_{H_{eb}} D_e^{(d)} f(y)\, dy \right) \vartheta(e \cdot x + b)\, db\, de,
\]
where $a_d = \frac{1}{2}(-1)^{(d-1)/2}(2\pi)^{1-d}$.

This result represents any compactly supported $f$ with continuous $d$-th order partials as the input-output function of a generalized perceptron network (with a single hidden layer containing a continuum of units), using the Heaviside activation, where the weights are defined analytically. It seems plausible to us that such a network could be implemented using optical technology. The weights themselves might be approximable via Monte Carlo methods.

The rate-of-approximation results obtained show that one can obtain an error of $O(k^{-1/2})$ with $k$ hidden units. In a certain sense, this avoids the problem of dimensionality because it does not involve $d$ explicitly, but actually $d$ plays a role in two ways: the class of functions for which the approximation holds is getting smaller with increasing $d$, and the asymptotic constant involves $d$ as well. However, for the geometric approach based on plane waves, the dimensional constants can actually decrease with increasing dimension.

Neural network approximation is one of a variety of nonlinear techniques, including wavelets and splines, which depend on a vector of parameters. Given the parameters, each method produces a function. Conversely, given the function, one must first find a parameter vector so that the resulting parameterized function is a good approximation. DeVore et al. [16] show that wavelet methods are asymptotically optimal among those satisfying the continuity condition. But the neural network results which achieve quadratic accuracy for incremental approximation in a manner independent of the dimension are doing better than their lower bound and so must be parameterizing in a non-continuous way.

4 Feedforward architectures

In addition to the perceptron networks mentioned above, another useful design is the radial basis function (or RBF) network, for which $A_\theta(x) = \sum_{i=1}^{k} c_i\, \sigma(\|v_i - x\|\, b_i)$, where $v_i \in \mathbb{R}^d$ and $b_i > 0$, $c_i \in \mathbb{R}$ are the parameters and $\|\cdot\|$ denotes Euclidean distance. We give a definition which includes both of these networks as special cases. Many of the results in this chapter, which are stated for the perceptron case, will also hold in the general case. The idea of a common generalization of both perceptron and radial basis function networks seems to have first been noticed by Kůrková [33] and by Bottou and Gallinari [7]. An even more general notion than the one described here is called $\phi$-networks by Kůrková in her chapter in this volume; see also [41]. Suppose $\phi : \mathbb{R}^d \times Y \to \mathbb{R}$. Then the input-output function produced by a $\phi$-network has the form $\sum_{i=1}^{n} w_i\, \phi(x, y_i)$ for $x$ in $\mathbb{R}^d$, where the $y_i$ are in $Y$, the space of parameters. More recently, Mhaskar and Micchelli [48] have given a framework which includes generalized regularization networks [23]. See also [20].

By a feedforward architecture $A$ we shall denote the following:

(i) A family of acyclic directed graphs $A_n$, $n \in \mathbb{N}^+$, where $A_n$ has $d + y + k + r$ vertices, with $d, y, k, r$ positive integers corresponding to the input dimension, the number of additional controls (e.g., 1 in the perceptron case, corresponding to the bias term), the number of hidden units and the output dimension, respectively.

(ii) A collection algorithm operating at each hidden-unit vertex on the data it receives from the inputs by using a parameter vector (for instance, the parameters weight the inputs in the perceptron case, while for RBF networks the parameters jointly define the location of the archetypal points, i.e., the centers of the hills, as well as the scaling factors).

(iii) An activation function (or functions) defined at every hidden unit, which takes the value presented to it by the collection algorithm and produces an output for that unit (in the perceptron case the activation is typically sigmoidal, while for RBFs one normally takes something radially symmetric and positive, going to zero asymptotically, like a Gaussian).

(iv) An output algorithm, which is usually taken as a weighted sum of the hidden unit outputs and applied at each of the $r$ output nodes. Other possibilities include affine maps and "hard limits" (i.e., truncation to 0 or 1, or to $\pm 1$).

This description fits the model of a $\phi$-network where $\phi$ is a function from $\mathbb{R}^d \times Y$ to $\mathbb{R}$ (the collection algorithm) composed with $\sigma : \mathbb{R} \to \mathbb{R}$. This is similar to the idea of "inner function" due to Stinchcombe and White. For a parameter vector $\theta \in \mathbb{R}^n$ and a neural architecture $A$, we shall write $A_\theta$ for the function which takes any input vector $x \in \mathbb{R}^d$ to the corresponding output of the neural network following the above rules. For simplicity, and because it suffices for most examples, we shall take the number of outputs $r = 1$ from now on, except for some remarks in section 7.

Heuristic improvement of the collection algorithm could be valuable in situations for which the number $d$ of inputs is very large. The algorithm could sample the weighted input data and estimate where it lies with respect to the activation function. For instance, in the perceptron case, steep sigmoids are almost constant except along a fuzzy hyperplane, so if the estimate is away from the hyperplane then a very crude approximation suffices, while if the estimate is near the hyperplane then full computation might be required.
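To make the common framework concrete, here is a minimal sketch (our own illustration; the helper names are ours) of a $\phi$-network in which the perceptron and RBF cases differ only in the choice of collection algorithm and activation:

\begin{verbatim}
import numpy as np

def phi_network(x, params, collect, activation):
    """Evaluate sum_i w_i * activation(collect(x, y_i)) at a single input x.

    `params` is a list of pairs (w_i, y_i): an output weight and a hidden-unit
    parameter y_i drawn from the parameter space Y."""
    return sum(w * activation(collect(x, y)) for w, y in params)

# Perceptron case: y = (v, b), the collection algorithm is the weighted sum v.x + b,
# and the activation is sigmoidal.
perceptron_collect = lambda x, y: float(np.dot(y[0], x) + y[1])
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# RBF case: y = (v, b) with b > 0, the collection algorithm is the scaled distance
# ||v - x|| * b, and the activation is a Gaussian bump.
rbf_collect = lambda x, y: float(np.linalg.norm(y[0] - x) * y[1])
gaussian = lambda t: np.exp(-t * t)

rng = np.random.default_rng(1)
d, k = 3, 5
params = [(rng.normal(), (rng.normal(size=d), rng.random() + 0.5)) for _ in range(k)]
x = rng.normal(size=d)

print("perceptron-type output:", phi_network(x, params, perceptron_collect, sigmoid))
print("RBF-type output:       ", phi_network(x, params, rbf_collect, gaussian))
\end{verbatim}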
Operation of the net is entirely controlled by the choice of activation, collection algorithm and parameter vector. Fixing the first two, one gets a function $M$ from the linear space of $n$-parameter vectors into the function space $X$. Moreover, the union over $n$ of all such parameterized functions turns out to be dense in many useful function spaces, such as $C(J)$, the set of continuous real-valued functions on $J$ with the sup norm, when $J$ is a compact subset of $\mathbb{R}^d$. Thus, for any continuous function $f$ and any $\varepsilon > 0$ there is a choice $a$ of parameters so that the resulting input-output function $g = M(a) \in X$ satisfies $|f(x) - g(x)| < \varepsilon$ for all $x \in J$. Such density results are now known for many different classes of functions and norms; see, e.g., [49].

5 Lower bounds on rate of approximation

Neural networks are one of several branches of the larger enterprise of approximation theory; see, e.g., [45], [1], [58] and [63]. We review the classical theory of functional approximation by polynomials and its modern generalization to the nonlinear case. Chebyshev and Weierstrass showed that all continuous functions on a closed bounded subset $J$ of a Euclidean space $\mathbb{R}^d$ can be uniformly approximated by linear combinations of polynomials. Moreover, not only is best approximation possible but the best approximant is unique. Further, it turns out that instead of using the set $P_n$ of $n+1$ basis functions $1, x, x^2, \ldots, x^n$, one can use any linearly independent family $\Phi = \{\phi_0, \ldots, \phi_n\}$ contained in $C(J)$ provided it satisfies the Haar condition that any nontrivial linear combination of the $\phi_k$ (a "generalized polynomial") has at most $n$ roots in $J$. A linearly independent family $\Phi = \{\phi_0, \ldots, \phi_n\}$ is called an $n$-basis (or a Haar $n$-basis if it also satisfies the Haar condition). "Polynomials" in terms of Haar bases act much like the classical (or "algebraic") case. Given an $n$-basis $\Phi$ in $C(J)$, one defines $E_n(f; \Phi) := \inf \|f - \sum_{k=0}^{n} c_k \phi_k\|$, where the infimum is over all choices of parameters $c_k$. When $\Phi = P_n$, we will omit it from the notation.

Consider a countable subset $S = \{\phi_0, \phi_1, \ldots\}$ of $C(J)$ such that (i) for every $n$, the first $n+1$ members constitute an $n$-basis and (ii) $S$ is dense in $C(J)$. For any strictly decreasing sequence $E_n$ of positive numbers converging monotonically to 0, there is a function $f$ having exactly that rate of convergence, i.e., for all $n$, $E_n = E_n(f; \{\phi_0, \ldots, \phi_n\})$. So the rate of convergence to 0 can be arbitrarily slow. Further, if one considers the worst case over all $f$ in some set $K$, one needs $K$ compact even to be sure of getting a finite supremum.

When the space $X = C(I)$, we can give a specific relation between the smoothness of the element $f$ in $X$ and its rate of approximation with respect to algebraic polynomials. Recall [63, p. 93] that the modulus of continuity $\omega(f; h)$ is defined for $f$ bounded on some interval $[a, b]$ and $0 \le h \le |b - a|$ by the formula

\[
\omega(f; h) = \sup\{\, |f(x_1) - f(x_2)| : x_1, x_2 \in [a, b],\ |x_1 - x_2| \le h \,\}.
\]
One has the following result of Jackson ([63, pp. 287-288]). We state it for $[a, b] = [-1, 1]$ to keep the constant explicit.

Theorem 5.1 If $f \in C([-1, 1])$, then for every nonnegative $n$, $E_n(f) \le 6\,\omega(f; n^{-1})$.

In fact, for functions satisfying some reasonable conditions, smoothness and rate of approximation improve at asymptotically the same rate ([63, section 7.1.13]). This phenomenon can hold for entire spaces of functions. The (Kolmogorov) $n$-width, denoted $w_n(K, X)$, of a compact subset $K$ of a normed linear space $X$, is the least possible $h$ for which $K$ is contained in the $h$-neighborhood of some $(n+1)$-dimensional subspace of $X$;
\[
w_n(K, X) = \inf_{\Phi} \sup_{f \in K} E_n(f; \Phi),
\]
where the infimum is over all $n$-bases $\Phi$. When $K = W^{(r)}(L^2[0,1])$, the space of functions whose $r$-th derivatives have $L^2$ norm at most 1, with all lower-order derivatives 0 at the endpoints, and $X$ is the space $L^2$, then Kolmogorov showed that $w_{2n}(K, X) = w_{2n-1}(K, X) = (2\pi n)^{-r}$. More recently, Barron [5] showed that $w_n(\Gamma_C, X) \ge \kappa\, n^{-1/d}\, C/d$ for his class $\Gamma_C$ of functions satisfying $\int \|\omega\|_B\, F(d\omega) \le C$, where $\kappa$ is a constant, $\|\omega\|_B$ denotes $\sup_{x \in B} |\omega \cdot x|$ and $F$ is the Fourier magnitude distribution corresponding to $f$. For $d \ge 3$, the lower bound on a global linear approximant for the entire class $\Gamma_C$ is improved upon by the incremental Jones-Barron neural network procedure. However, we note that for $d = 1$ and $d = 2$, it seems that the network cannot outperform a linear method.

Polynomial approximation has the property of being closed under addition; the sum of two degree-$s$ polynomials is a degree-$s$ polynomial. This is not true for neural networks; the sum of the functions produced by an $s$-hidden-unit network and a $t$-hidden-unit network is produced by an $(s+t)$-hidden-unit net. However, while neural network approximation does not satisfy linearity in the way polynomial approximation does, we conjecture that some of the same phenomena can be found there. The structural characterizations of optimal approximants given by Chebyshev, de la Vallée Poussin and Bernstein might have validity for neural net approximants and might also depend on the nature of the activation function. For instance, a result somewhat analogous to Jackson's theorem has been proved by Kůrková [36]: an expression involving the (higher-order) moduli of continuity is an upper bound on rates of approximation as a function of the size of parameters in networks with a fixed number of hidden units.

Recall that the nonlinear $n$-width, $d_n$, sets a lower bound on the achievable approximation for any continuous selection; see section 3. When $X$ is clear, we may write $d_n(K)$ for $d_n(K, X)$. It turns out [16] that several other feasible definitions of nonlinear $n$-width (which extend the linear case) are all equivalent to this one. Let $I^d$ denote the unit cube in $\mathbb{R}^d$ and write $L^s := L^s(I^d)$, $0 < s \le \infty$. The Besov space $B^\alpha := B^\alpha_q(L^s)$, $\alpha > 0$, $0 < q \le \infty$, is a subset of $L^s$ defined by two subtly related indices of smoothness, $\alpha$ and $q$ (see, e.g., [8]). For any metric space $Y$, $U(Y)$ will denote the unit ball. For $r$ real, $r_+$ is the larger of $r$ and 0.

Theorem 5.2 (DeVore et al.) For $1 < p \le \infty$, if $\alpha > d(1/s - 1/p)_+$, then
\[
d_n(U(B^\alpha_q(L^s)), L^p) \asymp n^{-\alpha/d}.
\]

The condition on $\alpha$ ensures that $U(B^\alpha)$ is embeddable as a compact subset of $L^p$. Now if $s = 2$ and $p = \infty$, then for all $\alpha > d/2$, $d_n(U(B^\alpha)) \asymp n^{-\alpha/d}$, so, under the hypothesis of continuous selection, for functions of sufficient smoothness in the Hilbert space $L^2(I^d)$ one can do no better than $O(n^{-1/2})$ with respect to approximation in the $L^\infty$-norm.

To prove the theorem, one shows the upper bound by actually constructing some nonlinear method which achieves it; in [16] wavelets are used. Further, [15] shows that $d_n(K, X) \ge b_n(K, X)$, where the latter invariant, related to work of Bernstein, is the largest radius of any $(n+1)$-dimensional ball contained in an $(n+1)$-dimensional linear subspace of $X$. For convenience, we will assume that $0 \in K$ and that all balls have center at 0. The assumption of continuity in the selection $a$ is used to show that $b_n$ is a lower bound on $d_n$ via the Borsuk-Ulam theorem. Recall that $S^n$ denotes the $n$-sphere, the $n$-dimensional manifold of all points in $\mathbb{R}^{n+1}$ at constant distance from 0. For every continuous function $\psi : S^n \to \mathbb{R}^n$ there exists $x \in S^n$ such that $\psi(x) = \psi(-x)$. Now suppose that $\rho$ is a positive real number, $X_{n+1}$ is an $(n+1)$-dimensional linear subspace of $X$ and $\rho U \subseteq K$ (i.e., $K$ contains a ball of radius $\rho$). Let $a$ be a continuous selection $a : K \to \mathbb{R}^n$. Then $a$ induces a continuous map of the boundary of the ball into $\mathbb{R}^n$. It follows by the Borsuk-Ulam theorem that there must be an element $f_0$ in the boundary of $\rho U$ with $a(f_0) = a(-f_0)$. Hence, writing $\|\cdot\|$ for $\|\cdot\|_X$, for any $M : \mathbb{R}^n \to X$,
\[
2\|f_0\| \le \|f_0 - Ma(f_0)\| + \|{-f_0} - Ma(-f_0)\|,
\]
and therefore either $f_0$ or $-f_0$ is approximated with an error of at least $\|f_0\|$. Thus, $\sup_{f \in K} \|f - Ma(f)\| \ge \rho$, so $d_n \ge b_n$. To actually show the lower bound on $b_n$, a family of functions $\phi_j \in B^\alpha$ is constructed, and the linear span of the first $n$ of them is called $X_n$. Computing the appropriate Besov seminorm and using, e.g., Hölder's inequality, one finds that for $\rho = C n^{-\alpha/d}$, $\rho U(X_n) \subseteq B^\alpha$, which implies the lower bound.
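As a rough numerical illustration of this trade-off (our own arithmetic, not taken from the sources, ignoring multiplicative constants): with a continuous selection and $d_n \asymp n^{-\alpha/d}$, reaching accuracy $\varepsilon$ requires roughly $n \approx \varepsilon^{-d/\alpha}$ parameters, whereas the dimension-independent quadratic rate, when it applies, needs only $n \approx \varepsilon^{-2}$ hidden units. For example,
\[
\alpha = 2,\ d = 10,\ \varepsilon = 0.1: \qquad
n_{\mathrm{cont}} \approx \varepsilon^{-d/\alpha} = 10^{5},
\qquad
n_{\mathrm{quad}} \approx \varepsilon^{-2} = 10^{2}.
\]
The comparison presupposes, of course, that the target lies in the restricted (Barron-type) class for which the quadratic rate holds.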

6 Uniqueness of approximation by neural networks

There can be no uniqueness of parameterization. In polynomial approximation, the components are linearly independent functions (e.g., $x^n$) and so are distinguishable. For neural nets, however, the hidden units can easily be permuted without changing the resulting function performed by the network. More formally, the parameter vector $a$ can be rearranged to form $a'$ with $M(a) = M(a')$. Further, in the perceptron case, if the activation $\sigma : \mathbb{R} \to \mathbb{R}$ happens to be odd or even, then one can reverse the signs of parameters involving some of the hidden units without altering the resulting input-output function. Let us say that $\sigma$ satisfies an affine recursion if there is a functional equation of the form
\[
\sigma(t) = \mathrm{const} + \sum_{k=1}^{r} w_k\, \sigma(\alpha_k t + h_k),
\]
where $r \ge 0$, all $w_k \ne 0$, all $\alpha_k \ne 0$, all pairs $(\alpha_k, h_k)$ are different, and if $h_k = 0$ then $\alpha_k \ne 1$. We shall refer to functions as affinely recursive only if they can satisfy a nontrivial affine recursion, with $r \ge 2$ in the functional equation; otherwise, all constant, odd, even or periodic functions would be included. If $\sigma$ is affinely recursive, then it is clear that parameterizations are not unique and, indeed, one can replace an $n$-dimensional parameterization by parameterizations of greater length, replacing some of the original hidden units by an entire subnetwork realizing the above functional equation. This can also be iterated. We showed with Kůrková [39] that this is essentially the only way that non-uniqueness of parameterization can occur: if $\sigma$ is not affinely recursive, then perceptron-type nets with $\sigma$ as activation function have unique parameterizations up to the symmetries of the situation. These symmetries operate on the Euclidean space of parameters, and so $M$ can be factored through a quotient mapping (as A. Vogt pointed out to us). We also note that affine recursiveness could be defined for mappings from $\mathbb{R}^d$, but it turns out that the $d = 1$ case is equivalent, as it is for density: if the activation function permits density for $d = 1$, it is true in general. Further, all polynomials are affinely recursive.

It is a different question whether there is a unique function (as opposed to parameterization) produced by some neural network on $n$ hidden units that is the best approximant of a target function. According to Haar's condition [63, p. 36], if every nontrivial linear combination of the functions $\phi_j(x) = \sigma(y_j \cdot x + b_j)$, $1 \le j \le n$, has at most $n - 1$ roots in $J$, then there are unique real numbers $c_j$, $1 \le j \le n$, such that $\|f - \sum_j c_j \phi_j\|$ is minimal. But there might be other choices for the $y_j$ or $b_j$ leading to equally close approximants.

Finally, let us note that non-uniqueness may actually be better for approximation. Since norms only depend on some aspects of a function, if one has several approximants with identically good properties in terms of the norm of their difference with the target, one gains additional control by having multiple choices. Also, from the standpoint of network design, it appears that the affine recursions described above give a way to introduce parallelism. To approximate an entire class of functions requires a non-polynomial activation function [49]. However, for any specific function, we could use polynomial activation functions (of sufficiently high degree), and these can be implemented by affine recursions.
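The permutation and sign-reversal symmetries are easy to check numerically. The following minimal sketch (our own illustration) uses the odd activation $\tanh$: permuting the hidden units, or reversing the signs of $(v_i, b_i, c_i)$ for any subset of units, leaves the input-output function unchanged.

\begin{verbatim}
import numpy as np

def perceptron_net(x, v, b, c):
    """Input-output function sum_i c_i * tanh(v_i . x + b_i) for a batch of inputs x."""
    return np.tanh(x @ v.T + b) @ c

rng = np.random.default_rng(2)
d, k, m = 4, 6, 100
v, b, c = rng.normal(size=(k, d)), rng.normal(size=k), rng.normal(size=k)
x = rng.normal(size=(m, d))

y = perceptron_net(x, v, b, c)

# Symmetry 1: permute the hidden units.
perm = rng.permutation(k)
y_perm = perceptron_net(x, v[perm], b[perm], c[perm])

# Symmetry 2: since tanh is odd, reversing the signs of (v_i, b_i, c_i) for any
# subset of units does not change the network's input-output function.
flip = rng.integers(0, 2, size=k) * 2 - 1          # entries in {-1, +1}
y_flip = perceptron_net(x, flip[:, None] * v, flip * b, flip * c)

print(np.allclose(y, y_perm), np.allclose(y, y_flip))   # expected: True True
\end{verbatim}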

7 Other Approaches

Nets with complex activation functions or complex weights have been considered since the late 1980s, but the theory is quite incomplete and does not seem to utilize the special character of complex variables. Complex backpropagation (to determine the weights) has been independently rediscovered on several occasions; see, e.g., [44], [42], [6] and [21]. If we regard visual perception as paradigmatic, it is reasonable to adopt a 2-dimensional (i.e., complex) model. Such a viewpoint would be compatible with optical implementation, and would also be required for a neural net based on quantum computation [14], [43].

Earlier, we considered the application of neural networks to pictures. While $L^p$ norms are commonly used in the current theory of neural network approximation, for the space $X$ of pictures, or for interesting subspaces which occur naturally within some given context, one may question the relevance of such measures. For example, if $f^{op}(x) = 1 - f(x)$, then the distance from $f$ to $f^{op}$ is the diameter of the space $X$. That is, a picture and its negative are far apart in terms of $L^p$. However, while it is difficult to recognize a particular face from its negative, a face and its negative look much closer together than a face and a tree. A more natural metric between pictures should recognize approximate identities up to translation, rotation, dilation and local distortion, such as we achieve in vision. See, for instance, [11] for an interesting example of how the distribution of gradients differs for "real" images compared with the Gaussian ideal. Using some sort of "rubber sheet" deformations, one can easily imagine distortion which is nevertheless quite transparent to the human visual system. Even small discontinuities can be tolerated; think of reflection in rippling water. To formalize this, we consider the notion of an isometry, which is a distance-preserving mapping, and, more generally, of an $\varepsilon$-isometry, which is a mapping $\tau$ such that for all $x, y$ in the domain, $|m(x, y) - m(\tau(x), \tau(y))| < \varepsilon$ (writing $m$ for the metric). Hyers and Ulam [27] proved that each $\varepsilon$-isometry is also globally close (in the function space sup norm) to a strict isometry: if $\tau : \mathbb{R}^d \to \mathbb{R}^d$ is an $\varepsilon$-isometry of $\mathbb{R}^d$ onto $\mathbb{R}^d$ with $\tau(0) = 0$, then for $z(x) = \lim_{n \to \infty} 2^{-n} \tau(2^n x)$, $z$ is an isometry of $\mathbb{R}^d$ and for all $x \in \mathbb{R}^d$, $|\tau(x) - z(x)| < 10\varepsilon$, where $|\cdot|$ denotes the Euclidean norm. Thus, there is a possibility of considering functions to be equivalent if one can be stretched to coincide with the other. (For the class of functions defined in terms of pictures, this is called "morphing".)

Higher-dimensional data are natural if one wishes to imagine neural nets operating directly on torques (or spins). The elements of various geometric algebras have already been proposed [10], [2]. It is interesting to note in this regard that quaternionic models have also been utilized to model eye movements. An abstract advantage for the quaternions and octonions (based on the real division algebras of order 4 and 8, respectively) is their different operational attributes. Because multiplication in the quaternions is not commutative, while the octonions are not associative, both of these variable types allow the behavior of a neural net with "higher order" hidden units (units that can multiply) to depend on the order in which it collects the inputs. These properties might provide a means for the neural net to carry a combinatorial internal state which describes its order of associating external inputs.

Banach and Hilbert space settings, not just for the analysis but for the inputs, outputs and parameters, have begun to be used, especially in connection with variable approximation. This makes it possible for neural networks to be regarded as operators, which could then act on other operators, for instance other neural networks. This interesting idea was proposed by Puechmorel and Ibnkahla [57]. Note also that the issue of functional representation (as opposed to approximation) grew out of a problem of Hilbert; see [33], [34]. Taking the input space as a whole lets us substitute more sophisticated translations for the raw data. For instance, one can take transforms based on projective geometry which associate with a scene certain rotational and translational invariants. For transformations with distributed local representation of global information, as in holograms, the signal could be reconstructed from a sparse sample. Conversely, by thinking of the output space more geometrically, one can use coding theory to reduce the output dimension and improve accuracy. See, e.g., [31]. Some other papers considering unusual mathematical models for neural networks include [52], which utilizes the theory of continued fractions in a complex-variable setting, based on the "cascade" architecture [19]; [54], which uses symplectic transformations to accomplish a nonlinear version of principal components analysis; and [55], which considers back propagation in a Clifford algebra. Finally, we mention an interesting approach to using Lie algebras in identifying nonlinear systems [51].

In problems where neural nets have been successfully applied, the target functions allow themselves to be roughly approximated with rather few parameters. But approximation is optimization: one is trying to minimize the distance from the target to a member of the constructable class. Now it is interesting to note that typical optimizations which a priori depend on many different constraints are, in fact, dominated by a few. In the case of linear programming, for example, where the goal is to maximize a linear function defined on a finite-dimensional polytope, S. Gass did computer experiments many years ago that showed that a random ray from within the polytope usually intersects the boundary in one of a small number of faces. Perhaps this is related to the fact that, with a rather small number of hidden units (in the dozens or fewer), feedforward neural networks can achieve, say, 80 to 95% of perfect accuracy on some nontrivial problems. If one could prepare several relatively small neural nets that depended on independent features, then the combined performance could achieve desired levels of, say, 99.9% accuracy. Approaches of this type are currently the focus of intensive efforts; see, e.g., the "connectionists" mailing list for discussions and references. While tempting, the method seems to have practical limitations caused by the difficulty in ensuring independence and also the systems problem involving the coordination of multiple neural networks.

While the geometric approach to neural networks has the seeming disadvantage of requiring large numbers of parameters, this may turn out to offer special opportunities. For instance, high-dimensional Euclidean spaces have strongly counter-intuitive properties for the content of their balls and spheres (see, e.g., Hamming [25] and also [30]). Work in this area, which includes the field of computational geometry, may offer substantial benefits to the development of a robust neural network technology. Further developments in other areas of mathematics will also be applicable to neural networks. The theory of knots (and links), which are topological images of one (or more) circles embedded in three-dimensional space, is a good candidate. If outputs are taken in $\mathbb{R}^3$ while the input varies over an interval of time, one gets a trajectory, so in the periodic case there are knots and links, and one could look at their invariants. This model might explain how "neuro-control" can keep two parallel processes (e.g., movement of left and right feet) from interfering.

Acknowledgements

The author is grateful to Computacao Biologica-Sistemas-COPPE at the Federal University of Rio de Janeiro for providing a pleasant and stimulating research environment during which the final draft of this chapter was completed.

References

[1] Achieser, N. I. (1992, orig. 1956). Theory of Approximation. New York: Dover Publications.
[2] Arena, P., Fortuna, L., Muscato, G., & Xibilia, M. G. (1997). Multilayer perceptrons to approximate quaternion valued functions. Neural Networks, 10, 335-342.
[3] Ball, K. (1986). Cube slicing in $\mathbb{R}^d$. Proceedings of the AMS, 99, 1-10.
[4] Barron, A. R. (1992). Neural net approximation. In Proceedings of the 7th Yale Workshop on Adaptive and Learning Systems (pp. 69-72).
[5] Barron, A. R. (1993). Universal approximation bounds for superposition of a sigmoidal function. IEEE Transactions on Information Theory, 39, 930-945.
[6] Benvenuto, N., & Piazza, F. (1992). On the complex backpropagation algorithm. IEEE Transactions on Signal Processing, 40(4), 967-969.
[7] Bottou, L., & Gallinari, P. (1991). A framework for cooperation of learning algorithms. In Neural Information Processing Systems, Lippman, R. P. et al., eds. (Vol. 3, pp. 781-788). Morgan Kaufmann.
[8] Brenner, P., Thomée, V., & Wahlbin, L. B. (1975). Besov Spaces and Applications to Difference Methods for Initial Value Problems (Lecture Notes in Mathematics 434). Berlin: Springer.
[9] Chen, T., & Chen, H. (1995). Approximation capability to functions of several variables, nonlinear functionals, and operators by radial basis function neural networks. IEEE Transactions on Neural Networks, 6, 904-910 (et seq. 911-917).
[10] Chudy, L., & Chudy, V. (1996). Why complex valued neural networks? (On the possible role of geometric algebra framework for the neural network computation.) In CMP'96, pp. 109-114.
[11] Cipra, B. (1997). Seeking signs of intelligence in the theory of control. SIAM News, 30(3), April, p. 13.
[12] Courant, R., & Hilbert, D. (1992, orig. 1962). Methods of Mathematical Physics, vol. II. New York: Interscience.
[13] Darken, C., Donahue, M., Gurvits, L., & Sontag, E. (1993). Rate of approximation results motivated by robust neural network learning. In Proceedings of the 6th Annual ACM Conference on Computational Learning Theory (pp. 303-309). New York: ACM.
[14] Deutsch, D. (1985). Quantum theory, the Church-Turing principle and the universal quantum computer. Proc. Royal Soc. London, A 400, 97-117.
[15] DeVore, R., Howard, R., & Micchelli, C. (1989). Optimal nonlinear approximation. Manuscripta Mathematica, 63, 469-478.
[16] DeVore, R. A., Kyriazis, G., Leviatan, D., & Tikhomirov, V. M. (1993). Wavelet compression and nonlinear n-widths. Advances in Computational Mathematics, 1, 197-214.

[17] Dingankar, A. T., & Sandberg, I. W. (1995). Network approximation of dynamical systems. In NOLTA'95 (Las Vegas).
[18] Edwards, C. H. (1994). Advanced Calculus of Several Variables. New York: Dover.
[19] Fahlman, S. E., & Lebiere, C. (1990). The cascade-correlation learning architecture. In Advances in Neural Information Processing Systems II (Denver, 1989), D. Touretzky, ed. (pp. 524-532). San Mateo: Morgan Kaufmann.
[20] Gegout, C., Girau, B., & Rossi, F. (1995). A general feedforward neural network model. NeuroCOLT TR-95-041 (36 pages).
[21] Georgiou, G. M., & Koutsougeras, C. (1992). Complex domain backpropagation. IEEE Transactions on Circuits and Systems, 330-334.
[22] Girosi, F. (1995). Approximation error bounds that use VC-bounds. In Proceedings of ICANN'95 (pp. 295-302). Paris: EC2 & Cie.
[23] Girosi, F., Jones, M., & Poggio, T. (1995). Regularization theory and neural networks architectures. Neural Computation, 7, 219-269.
[24] Girosi, F., & Anzellotti, G. (1993). Rates of convergence for radial basis functions and neural networks. In Artificial Neural Networks for Speech and Vision (pp. 97-113). London: Chapman & Hall.
[25] Hamming, R. W. (1986). Coding and Information Theory. Englewood Cliffs, NJ: Prentice-Hall.
[26] Hewitt, E., & Stromberg, K. (1965). Real and Abstract Analysis. New York: Springer.
[27] Hyers, D. H., & Ulam, S. M. (1945). Bull. American Math. Soc., 51, 288-292.
[28] Ito, Y. (1991). Representation of functions by superpositions of a step or sigmoid function and their applications to neural network theory. Neural Networks, 4, 385-394.
[29] Jones, L. K. (1992). A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training. The Annals of Statistics, 20, 601-613.
[30] Kainen, P. C. (1996). Utilizing geometric anomalies of high dimension: When complexity makes computation easier. In Computer-Intensive Methods in Control and Signal Processing: Curse of Dimensionality, K. Warwick and M. Karny, Eds., Boston: Birkhauser, 1997, pp. 283-294.
[31] Kainen, P. C., & Kůrková, V. (1993). Quasiorthogonal dimension of Euclidean space. Applied Math. Letters, 6, 7-10.
[32] Kelley, J. L. (1955). General Topology. Princeton: Van Nostrand.
[33] Kůrková, V. (1992). Kolmogorov's theorem and multilayer neural networks. Neural Networks, 5, 501-506.
[34] Kůrková, V. (1995). Kolmogorov's theorem. In The Handbook of Brain Theory and Neural Networks (Ed. M. Arbib) (pp. 501-502). Cambridge: MIT Press.
[35] Kůrková, V. (1995). Approximation of functions by perceptron networks with bounded number of hidden units. Neural Networks, 8, 745-750.
[36] Kůrková, V. (1996). Trade-off between the size of weights and the number of hidden units in feedforward networks. Technical Report ICS-96-495, Institute of Computer Science, Czech Academy of Sciences, Prague.

[37] Kůrková, V. (1997). Dimension-independent rates of approximation by neural networks. In Computer-Intensive Methods in Control and Signal Processing: Curse of Dimensionality, K. Warwick and M. Karny, Eds., Boston: Birkhauser, pp. 261-270.
[38] Kůrková, V. (1997). Rates of approximation of multivariable functions by one-hidden-layer neural networks. In WIRN'97 (in press).
[39] Kůrková, V., & Kainen, P. C. (1996). Singularities of finite scaling functions. Applied Math. Letters, 9, 33-37.
[40] Kůrková, V., & Kainen, P. C. (1994). Affinely recursive functions and neural networks. In Proceedings of the 14th IMACS World Congress on Applied and Computational Mathematics (pp. II.776-779). Atlanta: Georgia Tech.
[41] Kůrková, V., Kainen, P. C., & Kreinovich, V. (1997). Estimates of the number of hidden units and variation with respect to half-spaces. Neural Networks (in press).
[42] Leung, H., & Haykin, S. (1991). The complex backpropagation algorithm. IEEE Transactions on Signal Processing, 39, 2101-2104.
[43] Lewenstein, M. (1994). Quantum perceptrons. Journal of Modern Optics, 41, 2491-2501.
[44] Little, G. R., Gustafson, S. C., & Senn, R. A. (1990). Generalization of the backpropagation neural network learning algorithms to permit complex weights. Applied Optics, 29(11), 1591-1592.
[45] Lorentz, G. G. (1966). Approximation of Functions. New York: Holt, Rinehart and Winston.
[46] McShane, E. J. (1944). Integration. Princeton: Princeton University Press.
[47] Mhaskar, H. N. (1996). Neural networks for optimal approximation of smooth and analytic functions. Neural Computation, 8, 164-177.
[48] Mhaskar, H. N., & Micchelli, C. A. (1995). Degree of approximation by neural and translation networks with a single hidden layer. Advances in Applied Mathematics, 16, 151-183.
[49] Mhaskar, H. N., & Micchelli, C. A. (1992). Approximation by superposition of sigmoidal and radial basis functions. Advances in Applied Mathematics, 13, 350-373.
[50] Mhaskar, H. N., & Hahm, N. (1995). Neural networks for functional approximation and system identification. Preprint.
[51] Moreau, Y., & Vanderwalle (1996). System identification using composition networks. In CMP'96, pp. 183-192.
[52] Neruda, R., & Stedry, A. (1995). Approximation capabilities of chain architectures. In ICANN'95 (Paris).
[53] Park, J., & Sandberg, I. W. (1993). Approximation and radial-basis-function networks. Neural Computation, 5, 305-316.
[54] Parra, L. C., Deco, G., & Miesbach, S. (1995). Redundancy reduction with information-preserving nonlinear maps. Network, 6(1), 61-72.
[55] Pearson, J. K., & Bisset, D. L. (1992). Back propagation in a Clifford algebra. In ICANN'92 (Brighton).
[56] Puechmorel, S., Ibnkahla, M., & Castanie (1994). The manifold back propagation algorithm. In ICNN'94 (Orlando, FL) (pp. 395-400). IEEE.
[57] Puechmorel, S., & Ibnkahla, M. (1995). Operator valued neural networks. In WCNN'95 (Washington, DC) (Vol. I, pp. 68-71).

[58] Rivlin, T. J. (1990). Chebyshev Polynomials. New York: Wiley.
[59] Rudin, W. (1964). Principles of Mathematical Analysis. New York: McGraw-Hill.
[60] Rudin, W. (1973). Functional Analysis. New York: McGraw-Hill.
[61] Sejnowski, T. J., & Yuhas, B. P. (1989). Mapping between high-dimensional representations of acoustic and speech signal. In Computation and Cognition (pp. 52-68). Philadelphia: SIAM.
[62] Sandberg, I. W. (1994). General structures for classification. IEEE Trans. on Circuits and Systems I: Fundamental Theory and Applications, 41, 372-376.
[63] Timan, A. F. (1994, orig. 1963). Theory of Approximation of Functions of a Real Variable. New York: Dover Publications.
[64] Zemanian, A. H. (1987, orig. 1965). Distribution Theory and Transform Analysis. New York: Dover Publications.