Spectrum-based Design of Sinusoidal RBF Neural Networks

Péter András
Department of Psychology, University of Newcastle, Newcastle upon Tyne, NE1 7RU, UK
[email protected]

Abstract

The paper introduces and describes the spectrum-based design of RBF neural networks. The RBF neural networks used in the paper work with damped sinusoidal nonlinear activation functions. The concept of the associated spectrum of the data is introduced, and it is shown how to apply this spectrum to find the number and internal parameters of the hidden neurons for a neural network solution of the related data processing problem. A time series prediction application is presented. The relation of the proposed method to the support vector machine method, and the application of the method to select appropriate basis functions for a problem with given data, are discussed.

I. INTRODUCTION

Radial basis function (RBF) neural networks [6] have become one of the most popular artificial neural networks. They are organized in three layers, with nonlinear processing units in the hidden layer. The advantage of these networks is that they permit locally tunable approximations. RBF neural networks are applied to various tasks, including approximation, classification, time series prediction, and control. Their approximation and prediction capabilities have been studied by many researchers since the mid-1980s. It was established that they have the universal approximation property with respect to continuous functions [11], and error bounds were calculated for these networks [2,3].

In this paper we use sinusoidal RBF (s-RBF, for short) networks that work with nonlinear activation functions of the type

f(x) = sin(λ||x − c||) / (π||x − c||).

The most important classical approximation results, adapted for sinusoidal RBF networks, are presented briefly in the paper. A new approach to the design of neural networks is proposed. Traditionally, the parameter selection of the hidden nodes is done by random selection, or by combining random selection with some heuristic method. Generally the number of hidden nodes is set to be sufficiently large on the basis of a priori knowledge or some heuristic. Here we propose a more precise


approach, which makes it possible to obtain a good estimate of the number of necessary hidden nodes and of their internal parameters. The approach is based on the analysis of the associated s-RBF spectrum of the data (see the definition in Section 3). We show the applicability of the proposed method in the case of the Mackey-Glass time series.

The rest of the paper is structured as follows. In Section 2 we briefly present the approximation theory results that apply to the proposed neural networks. In Section 3 we introduce the associated spectrum of the data. In Section 4 the spectrum-based neural network design is described. In Section 5 we discuss the proposed method. We end the paper with the conclusions in Section 6.

II. APPROXIMATION PROPERTIES

In this section we present the classical neural network approximation results adapted for s-RBF networks. We give only the statements of the results and point to the key elements of the proofs.

The s-RBF networks are composed of three layers of neurons. There are n neurons in the input layer (n is the dimension of the input vectors), one for each component of the input vector. The input neurons have linear activation functions and do not modify the input values. The output layer contains a single neuron (if an m-dimensional output is needed, it contains m neurons, one for each component of the output vector). The output neuron has a linear activation function, and its output is a linear combination of the outputs of the hidden layer neurons. The network has a single hidden layer, containing neurons with nonlinear activation functions.

The first result is the universal approximation property of these networks. In the literature we can find two ways of proving this property. The first is based on the Stone-Weierstrass theorem [7], and the second on the properties of δ-series [4,5,11]. In the case of the s-RBF networks this property can be proved following the second way. The universal approximation property of the s-RBF neural networks is formulated as follows. Let f be a bounded and uniformly continuous function defined on R^n, and let

g_{c,λ}: R^n → R,  g_{c,λ}(x) = sin(λ||x − c||) / (π||x − c||);

then it is possible to build neural networks with a single hidden layer, with neurons having g_{c,λ} functions as activation functions, such that the network's output approximates the function f arbitrarily well.

The second result concerns the approximation precision of the s-RBF networks. The basic result in this sense was given by Barron [2,3], and it is based on the statistical properties of the convex closures of function spaces. The proof of the approximation precision result for the s-RBF networks follows the line of Barron's proof. The approximation precision result for s-RBF neural networks is formulated as follows. If f is a function defined on R^n that satisfies the smoothness criteria specified in [2], then there exist real numbers β and C such that there is an s-RBF neural network with n hidden units whose output g_n approximates the function βf(x), with

||g_n − βf||² ≤ C / n.
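To make the architecture described above concrete, the following minimal Python sketch implements the s-RBF activation g_{c,λ} and a network output formed as a linear combination of hidden unit outputs. The parameter values and array shapes are illustrative assumptions, not anything prescribed by the paper.

```python
import numpy as np

def s_rbf(x, c, lam):
    """Damped sinusoidal RBF unit: sin(lam * ||x - c||) / (pi * ||x - c||)."""
    r = np.linalg.norm(x - c)
    if r == 0.0:
        return lam / np.pi          # limiting value at the centre
    return np.sin(lam * r) / (np.pi * r)

def s_rbf_network(x, centres, lambdas, weights):
    """Network output: linear combination of the hidden s-RBF unit outputs."""
    hidden = np.array([s_rbf(x, c, lam) for c, lam in zip(centres, lambdas)])
    return float(weights @ hidden)

# Illustrative network with two hidden units acting on one-dimensional inputs.
centres = [np.array([0.3]), np.array([0.8])]
lambdas = [2.0, 5.0]
weights = np.array([0.7, -0.4])
print(s_rbf_network(np.array([0.5]), centres, lambdas, weights))
```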

III. THE ASSOCIATED SPECTRUM

Our intention is to use orthogonal approximation theory as the basis of neural network design. In order to do this we have to implement an orthogonalization of the set of s-RBF basis functions. By orthogonalization we mean that we try to find an infinite subset G_0 ⊆ G, where G is the set of all s-RBF basis functions, and an inner product such that for all g_1, g_2 ∈ G_0 we have

<g_1, g_2> = c_g if g_1 = g_2 = g, and 0 otherwise.

In what follows we restrict our attention to the one-dimensional case, but the construction can easily be extended to the multi-dimensional case too. First we observe that

∫_{c−π}^{c+π} (x − c)² · [sin(n(x − c)) / (π(x − c))] · [sin(m(x − c)) / (π(x − c))] dx = (1/π²) ∫_{−π}^{π} sin(nx) · sin(mx) dx,

which vanishes whenever n ≠ m. Now we can define orthogonalizations of the basis functions by considering the following inner products:

<f, g>_c = (π/2) ∫_{c−π}^{c+π} (x − c)² · f(x) · g(x) dx.

Using the notation

g_{c,n}(x) = sin(n(x − c)) / (π(x − c)),

we can see easily that we get

<g_{c,n}, g_{c,m}>_c = 1 if n = m, and 0 otherwise.

Let us denote the Hilbert space defined by this inner product by H_c. We now state three theorems with respect to the orthogonal approximation properties of the s-RBF neural networks. The theorems can be proved using the framework of the theory of Fourier series [R].

Theorem 1. For each c and each function f ∈ H_c,

g* = Σ_{k=0}^{∞} <f, g_{c,k}>_c · g_{c,k}

is the best approximation of f, in the sense of the ||·||_c norm, in the linear closure of G_c = {g_{c,n} | n ∈ Z}.

The next theorem bounds the norm of the best approximation.

Theorem 2. If g* is the best approximation of f ∈ H_c in the linear closure of G_c, then

||g*||²_c ≤ ||f||²_c.
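The orthogonality relation above can be checked numerically. The sketch below, a minimal illustration assuming SciPy quadrature and the one-dimensional definitions given in the text, evaluates <g_{c,n}, g_{c,m}>_c for a few index pairs; the function and variable names are assumptions made only for the example.

```python
import numpy as np
from scipy.integrate import quad

def g(c, n):
    """Basis function g_{c,n}(x) = sin(n*(x - c)) / (pi*(x - c))."""
    def fn(x):
        r = x - c
        return n / np.pi if r == 0 else np.sin(n * r) / (np.pi * r)
    return fn

def inner_c(f, h, c):
    """Inner product <f, h>_c over [c - pi, c + pi] with weight (pi/2)*(x - c)^2."""
    integrand = lambda x: (np.pi / 2.0) * (x - c) ** 2 * f(x) * h(x)
    return quad(integrand, c - np.pi, c + np.pi)[0]

c = 0.0
for n in range(1, 4):
    row = [inner_c(g(c, n), g(c, m), c) for m in range(1, 4)]
    print(np.round(row, 6))
# Off-diagonal entries are numerically zero, so the g_{c,n} are mutually
# orthogonal; the diagonal entries all share the same positive value.
```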


Now we are ready to introduce the s-RBF spectrum of a function. If f ∈ H_c, we call the associated s-RBF spectrum of f the function

γ_c(f; n) = <f, g_{c,n}>_c,

or, more generally, we can define

γ(f; c, t) = <f, sin(t(x − c)) / (π(x − c))>_c

with t ∈ R. The next theorem describes the properties of the s-RBF spectrum.

Theorem 3. If f ∈ H_c, then the following hold:
a. there exists an M > 0 such that |γ_c(f; n)| < M for all n ∈ Z;
b. lim_{n→∞} γ_c(f; n) = 0;
c. Σ_{n=−∞}^{∞} <f, g_{c,n}>²_c ≤ γ*(f; c), where

γ*(f; c) = (π/2) ∫_{c−π}^{c+π} (x − c)² · f²(x) dx.
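As an illustration of these definitions, the short sketch below computes the associated spectrum γ_c(f; n) of a sample function by numerical quadrature. The target function is a hypothetical choice made only for this example and is not taken from the paper.

```python
import numpy as np
from scipy.integrate import quad

def spectrum(f, c, n):
    """Associated spectrum value gamma_c(f; n) = <f, g_{c,n}>_c by quadrature."""
    def integrand(x):
        r = x - c
        g = n / np.pi if r == 0 else np.sin(n * r) / (np.pi * r)
        return (np.pi / 2.0) * r ** 2 * f(x) * g
    return quad(integrand, c - np.pi, c + np.pi)[0]

# Hypothetical target function, chosen only for illustration.
f = lambda x: np.exp(-x ** 2) * np.cos(3.0 * x)
for n in range(1, 11):
    print(f"gamma_c(f; {n:2d}) = {spectrum(f, 0.0, n):+.5f}")
# The values stay bounded and tend towards zero as n grows (Theorem 3a, 3b).
```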

IV. SPECTRUM-BASED NEURAL NETWORK DESIGN

We now interpret the information contained in the s-RBF spectrum of a function. First we observe that, for each c, the orthogonal approximation is close to the goal function in a small vicinity of c. On the other hand, we observe that this approximation is composed of an infinite sum of s-RBF basis functions centred at c, and each component basis function contributes to this sum with a certain importance. The importance of a basis function is determined by the corresponding value of the s-RBF spectrum.

We restrict our investigation to a special set of basis functions, namely the set formed by the functions g_{c,n}. These functions form an orthogonal set in the context of the previously introduced orthogonal approximation framework. We are interested in how important the contribution of these functions is to the approximation of the goal function. By calculating the spectrum values for the pairs (c, n), with c ∈ D, where D is the definition domain of the approximated function, we determine the importance of the corresponding basis functions. Selecting the peaks of this reduced spectrum, we effectively select the most important g_{c,n} basis functions. After the selection of the most important basis functions we can build our spectrum-based neural network.

The previously described method can be formulated as an algorithm as follows:

Step 1. Calculate the reduced s-RBF spectrum of the goal function, γ(f; c, n), for c ∈ D and n ∈ N, where D is the definition domain of the approximated function (D is bounded, and the function has bounded values on it).
Step 2. Select the peaks (both positive and negative) of the spectrum.
Step 3. List the selected (c, n) pairs in descending order of the associated spectrum values.
Step 4. Select those parameter combinations which satisfy |γ(f; c, n)| ≥ α · γ*(f; c), where α is a positive real number.
Step 5. Build the RBF neural network using the selected parameter combinations as internal parameters. For the initial weights of the network we use the values γ(f; c, n) / γ*(f; c).

As we can expect, this network does not approximate the target function very precisely with its initial weights. The spectral analysis gives us the necessary number of hidden nodes and the values of their internal parameters. In addition we get the initial weights, but these should be trained.
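The sketch below outlines Steps 1-5 for a sampled one-dimensional target function, together with a least-squares fit of the output weights as discussed in the text. The grid resolutions, the value of α, and the simplification of the peak selection in Step 2 to a simple threshold are illustrative assumptions rather than choices prescribed by the paper.

```python
import numpy as np
from scipy.integrate import quad

def gamma(f, c, n):
    """Reduced s-RBF spectrum gamma(f; c, n) = <f, g_{c,n}>_c (Step 1)."""
    def integrand(x):
        r = x - c
        g = n / np.pi if r == 0 else np.sin(n * r) / (np.pi * r)
        return (np.pi / 2.0) * r ** 2 * f(x) * g
    return quad(integrand, c - np.pi, c + np.pi)[0]

def gamma_star(f, c):
    """Normalising value gamma*(f; c) = (pi/2) * int (x - c)^2 f(x)^2 dx."""
    return quad(lambda x: (np.pi / 2.0) * (x - c) ** 2 * f(x) ** 2,
                c - np.pi, c + np.pi)[0]

def design_s_rbf(f, centres, freqs, alpha=0.1):
    """Steps 1-4: keep (c, n) pairs whose spectrum value passes the threshold."""
    selected = []
    for c in centres:
        gs = gamma_star(f, c)
        for n in freqs:
            val = gamma(f, c, n)
            if abs(val) >= alpha * gs:
                selected.append((c, n, val / gs))   # initial weight = gamma / gamma*
    selected.sort(key=lambda item: -abs(item[2]))   # Step 3: order by importance
    return selected

def fit_output_weights(f, selected, xs):
    """Step 5 plus training: least-squares fit of the output weights on samples xs."""
    def g(x, c, n):
        r = x - c
        return n / np.pi if r == 0 else np.sin(n * r) / (np.pi * r)
    H = np.array([[g(x, c, n) for (c, n, _) in selected] for x in xs])
    w, *_ = np.linalg.lstsq(H, f(xs), rcond=None)
    return w

# Illustrative run on a hypothetical target function.
f = lambda x: np.exp(-x ** 2) * np.cos(3.0 * x)
units = design_s_rbf(f, centres=np.arange(-1.0, 1.01, 0.25), freqs=range(1, 11))
xs = np.linspace(-1.0, 1.0, 200)
w = fit_output_weights(f, units, xs)
print(len(units), "hidden units selected; trained output weights:", np.round(w, 3))
```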


The training can be done by simple least squares optimisation. The advantage of the presented method is that it gives an objective, theory-rooted criterion for selecting the number of hidden nodes and their internal parameters. The methods previously proposed for this purpose are either too theoretical [9,13], need too many hidden neurons [10], or are almost purely empirical [8]. We note that the support vector methods provide an efficient way to select a small number of basis functions; we discuss the relation between the proposed method and the method of support vectors in the next section.

We can observe in practice that for some functions or data sets the number of important peaks is small, while for others it is large. Consequently, if we have a small number of peaks then we can use the s-RBF neural networks with good results, but if we have many peaks then it is probably better to look for some other type of neural network for the approximation (i.e., a neural network that works with different basis functions). In all cases the spectral method gives us the parameters of the necessary basis functions. This observation points to a remarkable property of the presented approach: it makes it possible to evaluate the appropriateness of applying s-RBF networks to the approximation of certain categories of functions.

V. APPLICATION AND DISCUSSION

In this section we first present an application of the s-RBF networks to time series prediction, comparing the proposed spectral method with the random parameter initialisation method. Next, we discuss some aspects of the spectrum-based neural network design method.

A. Time series prediction with s-RBF networks

One problem with time series prediction using spectrum-based neural networks is that we do not know the analytic form of the function that governs the behaviour of the time series. To overcome this problem we use Monte Carlo numerical integration [12] to obtain the values of the spectrum. By applying Monte Carlo integration we calculate the integrals as sums over randomly distributed sample values, which can be done in the case of a time series. To calculate the s-RBF spectrum of a time series we calculate the integrals for the values of the spectrum

Figure 1: Approximation results of the two networks

γ(f; c, n) = (π/2) ∫_{[c−π, c+π]^d} ||x − c||² · f(x) · [sin(n||x − c||) / (π||x − c||)] dx

by calculating the sum

γ(f; c, n) = (π/2) · Σ_{k=1}^{N} ||x_k − c||² · y_k · [sin(n||x_k − c||) / (π||x_k − c||)],

where the x_k are d-dimensional vectors of observations, i.e., x_k = (x_k, x_{k−1}, …, x_{k−d+1}), and y_k = x_{k+1}.
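A minimal sketch of this Monte Carlo estimate is given below; it treats the embedded observation vectors as the sample points, exactly as in the sum above. The embedding helper, the variable names, and the synthetic series used in the example are illustrative assumptions, not part of the paper.

```python
import numpy as np

def mc_spectrum(series, c, n, d=1):
    """Monte Carlo estimate of gamma(f; c, n) from an observed time series.

    The embedded vectors x_k = (x_k, x_{k-1}, ..., x_{k-d+1}) are the sample
    points and y_k = x_{k+1} the corresponding target values, as in the text."""
    xs = np.array([series[k - d + 1:k + 1][::-1]
                   for k in range(d - 1, len(series) - 1)])
    ys = np.asarray(series[d:])
    r = np.linalg.norm(xs - np.asarray(c), axis=1)
    safe_r = np.where(r == 0, 1.0, r)              # avoid 0/0 at the centre
    basis = np.where(r == 0, n / np.pi, np.sin(n * safe_r) / (np.pi * safe_r))
    return (np.pi / 2.0) * np.sum(r ** 2 * ys * basis)

# Illustrative call on a short synthetic series with a one-dimensional embedding.
series = np.sin(0.3 * np.arange(200)) + 0.5
print(mc_spectrum(series, c=np.array([0.5]), n=3, d=1))
```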

We compare two s-RBF networks applied to predict the Mackey-Glass time series. The first is built using the proposed spectral method, and the second by classical random parameter selection. The Mackey-Glass time series is defined by the difference equation

x_{t+1} = 0.9 · x_t + 0.2 · x_{t−τ} / (1 + x_{t−τ}^10),

where τ = 30 in our case. We used the simplest prediction hypothesis, trying to predict the next value solely on the basis of the previous value. Such predictions should be reasonably good most of the time, as there is a high correlation between consecutive values of the series. We calculated the Monte Carlo approximation of the associated spectrum values for n = −20, …, 20 and for c values from the range [−0.5, 1.5] with a resolution of 0.05, and determined the 20 most important peaks of the associated spectrum. Both networks had the same number of hidden neurons and were trained with 25,000 training pairs (x_t, x_{t+1}), organized in batches of 500 pairs. Figure 1 presents the prediction results. Figure 2 shows the evolution of the mean squared error during training for both networks. The figures show that the spectrum-based network performs the task better, confirming our expectations based on the theoretical framework presented above.
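The grid scan described above can be reproduced in a few lines. The sketch below specialises the Monte Carlo estimator from the previous listing to d = 1, generates a synthetic Mackey-Glass series with τ = 30, and keeps the 20 pairs (c, n) with the largest absolute spectrum values; the initial condition, the series length, and the peak-picking rule are illustrative assumptions.

```python
import numpy as np

# Synthetic Mackey-Glass series with tau = 30 (assumed initial condition and length).
tau, x = 30, [1.2] * 31
for t in range(30, 3030):
    x.append(0.9 * x[t] + 0.2 * x[t - tau] / (1.0 + x[t - tau] ** 10))
series = np.array(x)

def mc_spectrum_1d(series, c, n):
    """d = 1 specialisation of the Monte Carlo spectrum estimate."""
    xs, ys = series[:-1], series[1:]
    r = np.abs(xs - c)
    safe_r = np.where(r == 0, 1.0, r)
    basis = np.where(r == 0, n / np.pi, np.sin(n * safe_r) / (np.pi * safe_r))
    return (np.pi / 2.0) * np.sum(r ** 2 * ys * basis)

# Scan n = -20..20 and c in [-0.5, 1.5] with resolution 0.05; keep the 20 largest peaks.
grid = [(c, n, mc_spectrum_1d(series, c, n))
        for c in np.arange(-0.5, 1.5001, 0.05) for n in range(-20, 21)]
peaks = sorted(grid, key=lambda item: -abs(item[2]))[:20]
for c, n, val in peaks[:5]:
    print(f"c = {c:+.2f}  n = {n:+d}  spectrum = {val:+.4f}")
```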

B. Discussion

We discuss two issues in this sub-section. First, we look at the relation between the support vector machine method and the proposed spectrum-based neural network design method. Next, we briefly discuss the use of spectrum-based analysis to select the best type of basis function for an approximation or prediction task.

As noted before, an alternative method for finding a simple-structure neural network with good approximation / prediction ability is to apply the support vector machine methodology [14] to find a representative subset of the data.


The support vector machine method can be viewed as a Bayesian method for the determination of the hidden neuron parameters. In this context the prior distribution over the parameter space is the uniform linear combination of Dirac-δ distributions centred at the representative data points. In our case we can view the associated spectrum of the data as a representation of a prior distribution over the space of the parameters of the hidden neurons [1]. The structure of this prior distribution is driven not only by the data (as is the case with the support vector machine method), but jointly by the data and the nature of the basis functions. This means that, in a sense, the prior distribution represented by the associated spectrum is more natural (it fits both the data and the

Figure 2: Evolution of the mean square error for the two networks

class of basis functions) than the one that can be associated with the method of support vector machines. The main difference between the two approaches is that, while in the case of the support vector machine method the set of allowed parameters is strictly determined by the considered data (i.e., a subset of the full data set determines the internal parameters of the hidden neurons), in the case of the spectrum-based method these parameters can take other values that better fit the nature of the basis functions that are used. We note that, while the spectrum-based method may lead to a more natural prior, its computational load is much higher than that of the support vector machine method. This means that there is a trade-off between the two methods in terms of the computational complexity of the method and the architectural complexity of the resulting neural network.

Another interesting issue is the use of the proposed spectrum-based method to select the most appropriate set of basis functions for solving an approximation or prediction problem. There are many classes of basis functions (e.g., Gaussian RBFs, sigmoidal functions, step functions) which possess the universal approximation property with respect to continuous functions. Each of these classes defines a linear combination mesh within the set of continuous functions, such that for each continuous function we can find an arbitrarily close point of the mesh (i.e., an arbitrarily close approximation by a linear combination of basis functions). Considering the number of basis functions involved in the composition of each mesh point, we can define the granularity level of mesh points: those composed of a few basis functions are at a low granularity level, while those composed of many basis functions are at a high granularity level. Finding a simple approximation of a function means finding a low granularity level combination of basis functions.

The proposed spectrum-based neural network design method allows us to estimate the granularity level of good approximations of a target function: many important peaks of similar magnitude in the associated spectrum imply a high granularity level, while few important peaks imply a low granularity level. Going a step further, we expect that the spectrum-based analysis of function classes can reveal how appropriate it is to approximate functions of one class by linear combinations of functions from some other class. This may significantly help the efficient search for simple approximations of target functions, although at the moment the computational load associated with such a generic evaluation may seem prohibitive.

VI. CONCLUSIONS

A new approach to the design of RBF neural networks is presented in this paper. The approach is based on the spectral analysis of the approximated function or of the predicted time series. By this method we can estimate the number of hidden neurons of the s-RBF neural network and the internal parameters of these neurons. In addition, the proposed method allows the evaluation of the appropriateness of applying s-RBF networks to different categories of functions or time series.

The spectrum-based neural network design is similar in some sense to the support vector machine method, and the two methods are complementary in other respects. In both cases a 'natural' prior distribution over the space of hidden neuron parameters is sought that can be used to build a low-complexity neural network solution to the problem. The main difference is that, while the support vector machine method is mostly data driven, the spectrum-based method allows for a significant influence by the nature of the considered basis functions. The computational and architectural complexities associated with the two methods indicate a trade-off between them.

The spectrum-based neural network design method offers a new way to analyse the match between two function classes


in the sense of the existence of low-complexity approximations of functions from one class by linear combinations of functions from the other class. Such an analysis may become very useful in the search for simple solutions of critical, complex prediction, approximation, or control problems, where a key issue is to reduce the chance of overfitting by keeping the approximating models simple.

REFERENCES

[1] P. Andras: "RBF neural networks with orthogonal basis functions", in R.J. Howlett & L.C. Jain (eds.), Radial Basis Function Networks 1. Recent Developments in Theory and Algorithms, Physica-Verlag, Heidelberg, pp. 67-94, 2001.
[2] A.R. Barron: "Universal approximation bounds for superpositions of a sigmoidal function", IEEE Transactions on Information Theory, vol. 39, pp. 930-945, 1993.
[3] A.R. Barron: "Approximation and estimation bounds for artificial neural networks", Machine Learning, vol. 14, pp. 115-133, 1994.
[4] G. Cybenko: "Approximation by superpositions of a sigmoidal function", Mathematics of Control, Signals, and Systems, vol. 2, pp. 303-314, 1989.
[5] S.W. Ellacott: "Aspects of numerical analysis of neural networks", Acta Numerica, pp. 145-202, 1994.
[6] S. Haykin: Neural Networks. A Comprehensive Foundation, Macmillan Publishing Company, New York, 1994.
[7] K. Hornik: "Multilayer feedforward networks are universal approximators", Neural Networks, vol. 2, pp. 183-192, 1989.
[8] N.B. Karayiannis, G.W. Mi: "Growing radial basis neural networks: merging supervised and unsupervised learning with network growth techniques", IEEE Transactions on Neural Networks, vol. 8, pp. 1492-1507, 1997.
[9] V. Kurkova, P.C. Kainen, V. Kreinovich: "Estimates of the number of hidden units and variation with respect to half-spaces", Neural Networks, vol. 10, pp. 1061-1068, 1997.
[10] T. Poggio, F. Girosi: "Networks for approximation and learning", Proceedings of the IEEE, vol. 78, pp. 1481-1497, 1990.
[11] J. Park, I.W. Sandberg: "Universal approximation using radial-basis-function networks", Neural Computation, vol. 3, pp. 246-257, 1991.
[12] W.H. Press, S.A. Teukolsky, W.T. Vetterling, B.P. Flannery: Numerical Recipes in C. The Art of Scientific Computing, Cambridge University Press, Cambridge, UK, 1999.
[13] H.J. Sussmann: "Uniqueness of the weights for minimal feedforward nets with a given input-output map", Neural Networks, vol. 5, pp. 589-593, 1992.
[14] V.N. Vapnik: The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995.
