Estimates of Approximation Rates by Gaussian Radial-Basis Functions

Paul C. Kainen (1), Věra Kůrková (2), and Marcello Sanguineti (3)

(1) Department of Mathematics, Georgetown University, Washington, D.C. 20057-1233, USA
[email protected]
(2) Institute of Computer Science, Academy of Sciences of the Czech Republic, Pod Vodárenskou věží 2, Prague 8, Czech Republic
[email protected]
(3) Department of Communications, Computer, and System Sciences (DIST), University of Genoa, Via Opera Pia 13, 16145 Genova, Italy
[email protected]
Abstract. Rates of approximation by networks with Gaussian RBFs with varying widths are investigated. For certain smooth functions, upper bounds are derived in terms of a Sobolev-equivalent norm. Coefficients involved are exponentially decreasing in the dimension. The estimates are proven using Bessel potentials as auxiliary approximating functions.
1 Introduction
Gaussian radial-basis functions (RBFs) are known to be able to approximate with arbitrary accuracy all continuous functions and all L2-functions on compact subsets of Rd (see, e.g., [7], [16], [17], [18], [19]). In such approximations, the number of RBF units plays the role of a measure of model complexity, which determines the feasibility of network implementation. For Gaussian RBFs with a fixed width, rates of decrease of the approximation error with increasing number of units were investigated in [5], [6], [11], [10].

In this paper, we investigate rates of approximation by Gaussian RBF networks with varying widths. Since the set of linear combinations of scaled Gaussians coincides with the set of linear combinations of their Fourier transforms, by Plancherel's equality the L2-errors in approximating a function and its Fourier transform are the same. We exploit this possibility of alternating between a function and its Fourier transform to obtain upper bounds on rates of approximation. Our bounds are obtained by composing two approximations. As auxiliary approximating functions we use certain special functions called Bessel potentials; these potentials characterize functions belonging to Sobolev spaces.

The paper is organized as follows. In Sect. 2, concepts, notation, and auxiliary results for the investigation of approximation by Gaussian RBF networks are introduced.

[B. Beliczynski et al. (Eds.): ICANNGA 2007, Part II, LNCS 4432, pp. 11-18, 2007. © Springer-Verlag Berlin Heidelberg 2007]

Using an integral representation of the Bessel potential and its Fourier transform in terms of scaled Gaussians, in Sect. 3 we derive upper bounds on rates of approximation of Bessel potentials by linear combinations of scaled
Gaussians. In Sect. 4, Bessel potentials are used as approximators and upper bounds are derived for functions from certain subsets of Sobolev spaces. In Sect. 5, the estimates from the previous two sections are combined to obtain a bound on the approximation of certain smooth functions by Gaussian RBFs. Due to space limitations, proofs are omitted; they are given in [9].
2 Approximation by Gaussian RBF Networks
In this paper, we consider the approximation error measured by the L2-norm with respect to Lebesgue measure λ. Let L2(Ω) and L1(Ω) denote, respectively, the spaces of square-integrable and absolutely integrable functions on Ω ⊆ Rd, and ∥·∥L2(Ω), ∥·∥L1(Ω) the corresponding norms. When Ω = Rd, we write merely L2 and L1. For nonzero f in L2, f° = f/∥f∥L2 is the normalization of f. For F ⊂ L2, F|Ω denotes the set of functions from F restricted to Ω, F̂ is the set of Fourier transforms of functions in F, and F° the set of their normalizations. For n ≥ 1, we define span_n F := {Σ_{i=1}^n w_i f_i | f_i ∈ F, w_i ∈ R}. In a normed linear space (X, ∥·∥X), for f ∈ X and A ⊂ X, we write ∥f − A∥X = inf_{g∈A} ∥f − g∥X. Two norms ∥·∥ and |·| on the same linear space are equivalent if there exists κ > 0 such that κ^{−1} ∥·∥ ≤ |·| ≤ κ ∥·∥.

A Gaussian radial-basis-function unit with d inputs computes all scaled and translated Gaussian functions on Rd. For b > 0, let γ_b : Rd → R be the scaled Gaussian defined by

  γ_b(x) = exp(−b∥x∥²) = e^{−b∥x∥²} .  (1)

A simple calculation shows that ∥γ_b∥L2 = (π/2b)^{d/4}. Let

  G = {τ_y γ_b | y ∈ Rd, b > 0}, where (τ_y f)(x) = f(x − y),

denote the set of translates of scaled Gaussians, and let G_0 = {γ_b | b ∈ R+} denote the set of scaled Gaussians centered at 0. We investigate rates of approximation by networks with n Gaussian RBF units and one linear output unit, which compute functions from the set span_n G.

The d-dimensional Fourier transform is the operator F on L2 ∩ L1 given by

  F(f)(s) = f̂(s) = (2π)^{−d/2} ∫_{Rd} e^{i x·s} f(x) dx ,  (2)

where · denotes the Euclidean inner product on Rd. The Fourier transform behaves nicely on G_0: for every b > 0,

  γ̂_b(s) = (2b)^{−d/2} γ_{1/4b}(s)  (3)
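The norm formula following (1) and the transform identity (3) are easy to sanity-check numerically in dimension d = 1. The following sketch (not part of the paper; it uses plain trapezoidal quadrature on a truncated domain) compares both against direct numerical integration.

```python
import numpy as np

def trapezoid(y, x):
    # plain trapezoidal rule (avoids version-specific numpy helpers)
    return np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x))

b = 0.7
x = np.linspace(-30.0, 30.0, 200_001)
gamma_b = np.exp(-b * x**2)

# after (1): ||gamma_b||_{L2} = (pi/(2b))^{d/4} with d = 1
l2_norm = np.sqrt(trapezoid(gamma_b**2, x))

# (3): the Fourier transform of gamma_b is (2b)^{-d/2} gamma_{1/(4b)}
errs = []
for s in (0.0, 0.5, 1.3):
    ft = trapezoid(np.exp(1j * s * x) * gamma_b, x) / np.sqrt(2 * np.pi)
    predicted = (2 * b) ** -0.5 * np.exp(-s**2 / (4 * b))
    errs.append(abs(ft - predicted))

print(l2_norm, (np.pi / (2 * b)) ** 0.25, max(errs))
```

Because the integrand decays like a Gaussian, truncating at |x| = 30 and using a fine grid reproduces both closed forms to high accuracy.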
Estimates of Approximation Rates by Gaussian Radial-Basis Functions
(cf.[22, p. 43]). Thus
13
0 . spann G0 = spann G
(4)
γb L2 = (2b)−d/2 (2bπ)d/4 = γb L2
(5)
Calculation shows that
and Plancherel's identity [22, p. 31] asserts that the Fourier transform is an isometry on L2: for all f ∈ L2,

  ∥f∥L2 = ∥f̂∥L2 .  (6)

By (6) and (4) we get the following equality.

Proposition 1. For all positive integers d, n and all f ∈ L2,

  ∥f − span_n G_0∥L2 = ∥f − span_n Ĝ_0∥L2 = ∥f̂ − span_n Ĝ_0∥L2 = ∥f̂ − span_n G_0∥L2 .

So in estimating rates of approximation by linear combinations of scaled Gaussians, one can switch between a function and its Fourier transform.

To derive our estimates, we use a result on approximation by convex combinations of n elements of a bounded subset of a Hilbert space derived by Maurey [20], Jones [8], and Barron [2,3]. Let F be a bounded subset of a Hilbert space (H, ∥·∥H), s_F = sup_{f∈F} ∥f∥H, and let

  uconv_n F = {Σ_{i=1}^n (1/n) f_i | f_i ∈ F}

denote the set of n-fold convex combinations of elements of F with all coefficients equal. By the Maurey-Jones-Barron result [3, p. 934], for every function h in cl conv (F ∪ −F), i.e., in the closure of the symmetric convex hull of F, we have

  ∥h − uconv_n F∥H ≤ s_F / √n .  (7)

The bound (7) implies an estimate of the distance from span_n F holding for any function in H. The estimate is formulated in terms of a norm tailored to F, called F-variation and denoted by ∥·∥F. It is defined for any bounded subset F of any normed linear space (X, ∥·∥X) as the Minkowski functional of the closed convex symmetric hull of F (closure with respect to the norm ∥·∥X), i.e.,

  ∥h∥F = inf {c > 0 | c^{−1} h ∈ cl conv (F ∪ −F)} .  (8)

Note that F-variation can be infinite (when the set on the right-hand side is empty) and that it depends on the ambient space norm. In the next sections, we only consider variation with respect to the L2-norm. The concept of F-variation was introduced in [12] as an extension of "variation with respect to half-spaces" defined in [2]; see also [13]. It follows from (7) that for all h ∈ H and all positive integers n,

  ∥h − span_n F∥H ≤ ∥h∥F° / √n .  (9)
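The bound (7) can be illustrated by the standard probabilistic argument behind it: write h ∈ conv F as an expectation over the dictionary and average n i.i.d. samples, so that the mean squared error is at most s_F²/n. The following Monte Carlo sketch is our own illustration (the dictionary of random unit vectors and all parameters are arbitrary choices, not from the paper).

```python
import numpy as np

rng = np.random.default_rng(0)
m, N, n = 50, 200, 25

# a bounded dictionary F: N unit vectors in R^m, so s_F = 1
F = rng.normal(size=(N, m))
F /= np.linalg.norm(F, axis=1, keepdims=True)

# a target h in conv(F) with convex weights p
p = rng.dirichlet(np.ones(N))
h = p @ F

# probabilistic argument behind (7): draw n elements i.i.d. with
# probabilities p and average; E||avg - h||^2 = (E||f||^2 - ||h||^2)/n <= s_F^2/n
sq_errs = [
    np.sum((F[rng.choice(N, size=n, p=p)].mean(axis=0) - h) ** 2)
    for _ in range(2000)
]
rms = float(np.sqrt(np.mean(sq_errs)))
print(rms, 1 / np.sqrt(n))  # compare the rms error with the bound s_F/sqrt(n)
```

The sampled n-fold averages are elements of uconv_n F, and their root-mean-square distance to h stays within the s_F/√n bound, matching (7).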
It is easy to see that if ψ is any linear isometry of (X, ∥·∥X), then for any f ∈ X, ∥f∥F = ∥ψ(f)∥ψ(F). In particular,

  ∥f̂∥G°_0 = ∥f∥G°_0 .  (10)
To combine estimates of variations with respect to two sets, we use the following lemma, whose proof follows easily from the definition of variation.

Lemma 1. Let F, H be nonempty, nonzero subsets of a normed linear space X. Then for every f ∈ X, ∥f∥F ≤ (sup_{h∈H} ∥h∥F) ∥f∥H.

For φ : Ω × Y → R, let F_φ = {φ(·, y) : Ω → R | y ∈ Y}. The next theorem is an extension of [14, Theorem 2.2]. A function on a subset of a Euclidean space is continuous almost everywhere if it is continuous except on a set of Lebesgue measure zero.

Theorem 1. Let Ω ⊆ Rd, Y ⊆ Rp, w : Y → R, and let φ : Ω × Y → R be continuous almost everywhere and such that for all y ∈ Y, ∥φ(·, y)∥L2(Ω) = 1. Let f ∈ L1(Ω) ∩ L2(Ω) be such that for all x ∈ Ω,

  f(x) = ∫_Y w(y) φ(x, y) dy .

Then ∥f∥F_φ ≤ ∥w∥L1(Y).
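To make the objects of this section concrete, here is a toy illustration (ours, not part of the paper's argument) of approximation from span_n G: fix a pool of Gaussian units with randomly chosen centers and widths and fit a smooth target by linear least squares over the first n units. Since the spanned subspaces are nested, the L2-error is nonincreasing in n.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-5.0, 5.0, 1001)
target = 1.0 / (1.0 + x**2)  # a smooth function to approximate

# fixed pool of Gaussian units exp(-b(x - y)^2) with random centers y, widths b
pool = 64
y = rng.uniform(-5.0, 5.0, pool)
b = rng.uniform(0.1, 4.0, pool)
design = np.exp(-b * (x[:, None] - y) ** 2)  # shape (len(x), pool)

errs = {}
for n in (2, 4, 8, 16, 32):
    A = design[:, :n]  # first n units: a network in span_n G
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    errs[n] = float(np.sqrt(np.mean((A @ coef - target) ** 2)))

print(errs)  # RMS errors; nonincreasing because the subspaces are nested
```

This does not certify any rate, of course; the theory below supplies n^{−1/2} bounds for suitable targets.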
3 Approximation of Bessel Potentials by Gaussian RBFs
In this section, we estimate rates of approximation by span_n G for certain special functions, called Bessel potentials, which are defined via their Fourier transforms. For r > 0, the Bessel potential of order r, denoted by β_r, is the function on Rd with Fourier transform

  β̂_r(s) = (1 + ∥s∥²)^{−r/2} .

To estimate the G°_0-variations of β_r and β̂_r, we use Theorem 1 with representations of these two functions as integrals of scaled Gaussians. For r > 0, it is known ([21, p. 132]) that β_r is non-negative, radial, exponentially decreasing at infinity, analytic except at the origin, and a member of L1. It can be expressed as the integral

  β_r(x) = c_1(r, d) ∫_0^∞ exp(−t/4π) t^{−d/2+r/2−1} exp(−(π/t)∥x∥²) dt ,  (11)

where c_1(r, d) = (2π)^{d/2} (4π)^{−r/2} / Γ(r/2) (see [15, p. 296] or [21]). The factor (2π)^{d/2} occurs because we use the Fourier transform (2), which includes the factor (2π)^{−d/2}. Rearranging (11) slightly, we get a representation of the Bessel potential as an integral of normalized scaled Gaussians.

Proposition 2. For every r > 0, every positive integer d, and every x ∈ Rd,

  β_r(x) = ∫_0^{+∞} v_r(t) γ°_{π/t}(x) dt ,

where v_r(t) = c_1(r, d) ∥γ_{π/t}∥L2 exp(−t/4π) t^{−d/2+r/2−1} ≥ 0.
Let Γ(z) = ∫_0^∞ t^{z−1} e^{−t} dt be the Gamma function (defined for z > 0) and put

  k(r, d) := (π/2)^{d/4} Γ(r/2 − d/4) / Γ(r/2) .

By calculating the L1-norm of the weighting function v_r and applying Theorem 1, we get an estimate of the G°_0-variation of β_r.

Proposition 3. For d a positive integer and r > d/2,

  ∫_0^∞ v_r(t) dt = k(r, d)  and  ∥β_r∥G° ≤ ∥β_r∥G°_0 ≤ k(r, d).
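The identity ∫_0^∞ v_r(t) dt = k(r, d) in Proposition 3 can be verified numerically. The sketch below is our own check (with d = 2, r = 5, chosen arbitrarily); it uses ∥γ_{π/t}∥L2 = (t/2)^{d/4}, which follows from the norm formula after (1).

```python
import math
import numpy as np

def k(r, d):
    # k(r, d) = (pi/2)^{d/4} Gamma(r/2 - d/4) / Gamma(r/2)
    return (math.pi / 2) ** (d / 4) * math.exp(
        math.lgamma(r / 2 - d / 4) - math.lgamma(r / 2)
    )

d, r = 2, 5.0
c1 = (2 * math.pi) ** (d / 2) * (4 * math.pi) ** (-r / 2) / math.gamma(r / 2)

t = np.linspace(1e-8, 400.0, 500_001)
gauss_norm = (t / 2) ** (d / 4)  # ||gamma_{pi/t}||_{L2} = (t/2)^{d/4}
v = c1 * gauss_norm * np.exp(-t / (4 * math.pi)) * t ** (-d / 2 + r / 2 - 1)
integral = float(np.sum(0.5 * (v[1:] + v[:-1]) * np.diff(t)))

print(integral, k(r, d))  # the quadrature matches the closed form
```

The integrand decays exponentially, so truncating at t = 400 with a fine trapezoidal grid reproduces k(5, 2) to many digits.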
Also the Fourier transform of the Bessel potential can be expressed as an integral of normalized scaled Gaussians.

Proposition 4. For every r > 0, every positive integer d, and every s ∈ Rd,

  β̂_r(s) = ∫_0^∞ w_r(t) γ°_t(s) dt ,

where w_r(t) = ∥γ_t∥L2 t^{r/2−1} e^{−t} / Γ(r/2).

A straightforward calculation shows that the L1-norm of the weighting function w_r is the same as the L1-norm of v_r, so by Theorem 1 we get the same upper bound on the G°_0-variation of β̂_r.

Proposition 5. For d a positive integer and r > d/2,

  ∫_0^∞ w_r(t) dt = k(r, d)  and  ∥β̂_r∥G° ≤ ∥β̂_r∥G°_0 ≤ k(r, d).
Combining Propositions 3 and 5 with (9), we get the following upper bound on rates of approximation of Bessel potentials and their Fourier transforms by Gaussian RBFs.

Theorem 2. For d a positive integer, r > d/2, and n ≥ 1,

  ∥β_r − span_n G∥L2 = ∥β_r − span_n G_0∥L2 = ∥β̂_r − span_n G_0∥L2 ≤ k(r, d) n^{−1/2} .

So for r > d/2, both the Bessel potential of order r and its Fourier transform can be approximated by span_n G_0 with rates bounded above by k(r, d) n^{−1/2}. If for some fixed c > 0 we set r_d = d/2 + c, then with increasing d, k(r_d, d) = (π/2)^{d/4} Γ(c/2)/Γ(d/4 + c/2) converges to zero exponentially fast. So for sufficiently large d, β_{r_d} and β̂_{r_d} can be approximated quite well by Gaussian RBF networks with a small number of hidden units.
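The claimed exponential decrease of k(r_d, d) for r_d = d/2 + c is easy to tabulate; the values below are our own illustration for the arbitrary choice c = 2.

```python
import math

def k(r, d):
    # k(r, d) = (pi/2)^{d/4} Gamma(r/2 - d/4) / Gamma(r/2), via log-gamma
    return (math.pi / 2) ** (d / 4) * math.exp(
        math.lgamma(r / 2 - d / 4) - math.lgamma(r / 2)
    )

c = 2.0  # fixed margin above d/2, i.e. r_d = d/2 + c
for d in (1, 5, 10, 20, 40, 80):
    print(d, k(d / 2 + c, d))
```

The growth of (π/2)^{d/4} is eventually overwhelmed by Γ(d/4 + c/2) in the denominator, so the constant first rises for small d and then decays super-fast.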
4 Rates of Approximation by Bessel Potentials
In this section we investigate rates of approximation by linear combinations of translated Bessel potentials. For the translation operator (τ_x f)(y) = f(y − x) and r > 0, let G_{β_r} = {τ_x β_r | x ∈ Rd} denote the set of translates of the Bessel potential of order r. For r > d/2, G_{β_r} ⊂ L2, since translation does not change the L2-norm. Let r > d/2. By a mostly straightforward argument (see [4]), we get

  ∥β̂_r∥L2 = λ(r, d) := π^{d/4} (Γ(r − d/2)/Γ(r))^{1/2} .  (12)
The convolution h ∗ g of two functions h and g is defined by (h ∗ g)(x) = ∫_{Rd} h(y) g(x − y) dy. For functions which are convolutions with β_r, the integral representation f(x) = ∫_{Rd} w(y) β_r(x − y) dy combined with Theorem 1 gives the following upper bound on G_{β_r}-variation.

Proposition 6. For d a positive integer and r > d/2, let f = w ∗ β_r, where w ∈ L1 and w is continuous a.e. Then ∥f∥G_{β_r} ≤ ∥f∥G°_{β_r} ≤ λ(r, d) ∥w∥L1.

For d a positive integer and r > d/2, consider the Bessel potential space (with respect to Rd)

  L^{r,2} = {f | f = w ∗ β_r, w ∈ L2}, where ∥f∥L^{r,2} = ∥w∥L2 for f = w ∗ β_r.

The function w has Fourier transform equal to (2π)^{−d/2} f̂/β̂_r, since the transform of a convolution is (2π)^{d/2} times the product of the transforms; hence w = (2π)^{−d/2} (f̂/β̂_r)ˇ. In particular, w is uniquely determined by f, so the Bessel potential norm is well-defined. The Bessel potential norm is equivalent to the corresponding Sobolev norm (see [1, p. 252]); let κ = κ(d) be the constant of equivalence.

For a function h : U → R on a topological space U, the support of h is the set supp h = cl {u ∈ U | h(u) ≠ 0}. It is a well-known consequence of the Cauchy-Schwarz inequality that ∥w∥L1 ≤ a^{1/2} ∥w∥L2, where a = λ(supp w). Hence by Theorem 1 and (12), we have the following estimate.

Proposition 7. Let d be a positive integer, r > d/2, and f ∈ L^{r,2}. If w = (2π)^{−d/2} (f̂/β̂_r)ˇ is bounded, almost everywhere continuous, and compactly supported on Rd, and a = λ(supp w), then ∥f∥G°_{β_r} ≤ a^{1/2} λ(r, d) ∥f∥L^{r,2}.

Combining this estimate of G_{β_r}-variation with the Maurey-Jones-Barron bound (9) gives an upper bound on rates of approximation by linear combinations of n translates of the Bessel potential β_r.
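The Cauchy-Schwarz step ∥w∥L1 ≤ a^{1/2} ∥w∥L2 used above can be checked on a concrete compactly supported weight; the example below is ours (a weight supported on [−1, 2], so a = λ(supp w) = 3, with rectangle-rule quadrature).

```python
import numpy as np

# a compactly supported weight w on [-1, 2]; a = lambda(supp w) = 3
x = np.linspace(-1.0, 2.0, 300_001)
w = np.cos(3 * x) + x**2

dx = x[1] - x[0]
l1 = float(np.sum(np.abs(w)) * dx)      # ||w||_{L1}
l2 = float(np.sqrt(np.sum(w**2) * dx))  # ||w||_{L2}
a = 3.0

print(l1, np.sqrt(a) * l2)  # Cauchy-Schwarz: l1 <= sqrt(a) * l2
```

Equality would require |w| to be constant on its support, so for any non-constant weight the inequality is strict.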
Theorem 3. Let d be a positive integer, r > d/2, and f ∈ L^{r,2}. If w = (2π)^{−d/2} (f̂/β̂_r)ˇ is bounded, almost everywhere continuous, and compactly supported on Rd, and a = λ(supp w), then for n ≥ 1,

  ∥f − span_n G_{β_r}∥L2 ≤ a^{1/2} λ(r, d) ∥f∥L^{r,2} n^{−1/2} .
5 Bound on Rates of Approximation by Gaussian RBFs
The results of the previous sections imply an upper bound on rates of approximation by networks with n Gaussian RBF units for certain functions in the Bessel potential space.

Theorem 4. Let d be a positive integer, r > d/2, and f ∈ L^{r,2}. If w = (2π)^{−d/2} (f̂/β̂_r)ˇ is bounded, almost everywhere continuous, and compactly supported on Rd, and a = λ(supp w), then

  ∥f − span_n G∥L2 ≤ a^{1/2} λ(r, d) ∥f∥L^{r,2} k(r, d) n^{−1/2} ,

where k(r, d) = (π/2)^{d/4} Γ(r/2 − d/4)/Γ(r/2) and λ(r, d) = π^{d/4} (Γ(r − d/2)/Γ(r))^{1/2}.
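The product λ(r_d, d) k(r_d, d) of the two constants in Theorem 4 can be tabulated for r_d = d/2 + c; the sketch below (with our arbitrary choice c = 2) illustrates how rapidly the combined constant falls with the dimension.

```python
import math

def k(r, d):
    # k(r, d) = (pi/2)^{d/4} Gamma(r/2 - d/4) / Gamma(r/2)
    return (math.pi / 2) ** (d / 4) * math.exp(
        math.lgamma(r / 2 - d / 4) - math.lgamma(r / 2)
    )

def lam(r, d):
    # lambda(r, d) = pi^{d/4} (Gamma(r - d/2)/Gamma(r))^{1/2}
    return math.pi ** (d / 4) * math.exp(
        0.5 * (math.lgamma(r - d / 2) - math.lgamma(r))
    )

c = 2.0
for d in (1, 10, 20, 40):
    r = d / 2 + c
    print(d, k(r, d) * lam(r, d))
```

Both factors contain a polynomially growing power of π over a Gamma function of an argument growing linearly in d, so the product decays super-exponentially.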
The exponential decrease of k(r_d, d) noted after Theorem 2, with r_d = d/2 + c, applies also to λ(r_d, d). Thus, unless the constant of equivalence κ(d) between the Sobolev and Bessel norms grows very fast with d, even functions with large Sobolev norms will be well approximated for sufficiently large d. Note that our estimates might be conservative, as they were obtained by composing two approximations.

Acknowledgements. The collaboration between V. K. and M. S. was partially supported by the 2004-2006 Scientific Agreement among the University of Genoa, the National Research Council of Italy, and the Academy of Sciences of the Czech Republic. V. K. was partially supported by GA ČR grant 201/05/0557 and by the Institutional Research Plan AV0Z10300504; M. S. by the PRIN Grant from the Italian Ministry for University and Research, project New Techniques for the Identification and Adaptive Control of Industrial Systems. The collaboration of P. C. K. with V. K. and M. S. was partially supported by Georgetown University.
References

1. Adams, R. A., Fournier, J. J. F.: Sobolev Spaces. Academic Press, Amsterdam (2003)
2. Barron, A. R.: Neural net approximation. In: Proc. 7th Yale Workshop on Adaptive and Learning Systems, K. Narendra (Ed.), Yale University Press (1992) 69-72
3. Barron, A. R.: Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory 39 (1993) 930-945
4. Carlson, B. C.: Special Functions of Applied Mathematics. Academic Press, New York (1977)
5. Girosi, F.: Approximation error bounds that use VC-bounds. In: Proc. International Conference on Neural Networks, Paris (1995) 295-302
6. Girosi, F., Anzellotti, G.: Rates of convergence for radial basis functions and neural networks. In: Artificial Neural Networks for Speech and Vision, R. J. Mammone (Ed.), Chapman & Hall, London (1993) 97-113
7. Hartman, E. J., Keeler, J. D., Kowalski, J. M.: Layered neural networks with Gaussian hidden units as universal approximations. Neural Computation 2 (1990) 210-215
8. Jones, L. K.: A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training. Annals of Statistics 20 (1992) 608-613
9. Kainen, P. C., Kůrková, V., Sanguineti, M.: Rates of approximation of smooth functions by Gaussian radial-basis-function networks. Research report ICS-976 (2006), www.cs.cas.cz/research/publications.shtml
10. Kon, M. A., Raphael, L. A., Williams, D. A.: Extending Girosi's approximation estimates for functions in Sobolev spaces via statistical learning theory. J. of Analysis and Applications 3 (2005) 67-90
11. Kon, M. A., Raphael, L. A.: Approximating functions in reproducing kernel Hilbert spaces via statistical learning theory. Preprint (2005)
12. Kůrková, V.: Dimension-independent rates of approximation by neural networks. In: Computer-Intensive Methods in Control and Signal Processing: Curse of Dimensionality, K. Warwick, M. Kárný (Eds.), Birkhäuser, Boston (1997) 261-270
13. Kůrková, V.: High-dimensional approximation and optimization by neural networks. Chapter 4 in: Advances in Learning Theory: Methods, Models and Applications, J. Suykens et al. (Eds.), IOS Press, Amsterdam (2003) 69-88
14. Kůrková, V., Kainen, P. C., Kreinovich, V.: Estimates of the number of hidden units and variation with respect to half-spaces. Neural Networks 10 (1997) 1061-1068
15. Martínez, C., Sanz, M.: The Theory of Fractional Powers of Operators. Elsevier, Amsterdam (2001)
16. Mhaskar, H. N.: Versatile Gaussian networks. In: Proc. IEEE Workshop on Nonlinear Image Processing (1995) 70-73
17. Mhaskar, H. N., Micchelli, C. A.: Approximation by superposition of a sigmoidal function and radial basis functions. Advances in Applied Mathematics 13 (1992) 350-373
18. Park, J., Sandberg, I. W.: Universal approximation using radial-basis-function networks. Neural Computation 3 (1991) 246-257
19. Park, J., Sandberg, I. W.: Approximation and radial basis function networks. Neural Computation 5 (1993) 305-316
20. Pisier, G.: Remarques sur un résultat non publié de B. Maurey. In: Séminaire d'Analyse Fonctionnelle, vol. I(12). École Polytechnique, Centre de Mathématiques, Palaiseau (1980-1981)
21. Stein, E. M.: Singular Integrals and Differentiability Properties of Functions. Princeton University Press, Princeton, NJ (1970)
22. Strichartz, R.: A Guide to Distribution Theory and Fourier Transforms. World Scientific, NJ (2003)