Neural Processing Letters 11: 131–138, 2000. © 2000 Kluwer Academic Publishers. Printed in the Netherlands.


An Incremental Algorithm for Parallel Training of the Size and the Weights in a Feedforward Neural Network

KATEŘINA HLAVÁČKOVÁ-SCHINDLER 1,2 and MANFRED M. FISCHER 3

1 Institute for Urban and Regional Research, Austrian Academy of Sciences, Postgasse 7/4/2, A–1010 Vienna, Austria; 2 Institute of Computer Science, Academy of Sciences of the Czech Republic, Pod Vodárenskou věží 2, 18207 Praha 8, Czech Republic, E-mail: [email protected]; 3 Department of Economic and Social Geography, Wirtschaftsuniversität Wien, Augasse 2–6, A–1090 Vienna, Austria, E-mail: [email protected]

Abstract. An algorithm for incremental approximation of functions in a normed linear space by feedforward neural networks is presented. The concept of variation of a function with respect to a set is used, together with the weight decay method, to estimate the approximation error and to optimize the size and the weights of the network in each iteration step of the algorithm. Two alternatives, recursively incremental and generally incremental, are proposed. In the generally incremental case, the algorithm optimizes the parameters of all units in the hidden layer at each step. In the recursively incremental case, the algorithm optimizes the parameters corresponding to only one unit in the hidden layer at each step, so an optimization problem with a smaller number of parameters is solved at each step.

Key words: approximation of a function, feedforward network, incremental algorithm, variation of a function with respect to a set, weight decay

1. Introduction

The standard approach to constructing a neural network for functional approximation is to determine the parameters of the entire network for a fixed network size. Parameter determination is viewed as a non-linear optimization problem solved in a high-dimensional space. Determining a network topology (i.e. the number of layers and their sizes) which guarantees a given degree of accuracy in the approximation is difficult. It can be approached by training an oversized network and then pruning it, removing units based on some measure of saliency. A major shortcoming of pruning techniques is their weak theoretical rationale on the one hand and their high computational demand for larger network models on the other. This motivates investigating weight training algorithms in which the size of the network is determined incrementally. Such algorithms require solving a nonlinear optimization problem over parameters in a lower-dimensional space at each step (see, for example, [1], [5]).



In this paper, we present two alternatives for incremental training of a single hidden-layer feedforward network using the concept of variation of a function with respect to a set. The activation functions corresponding to the hidden layer of the network are taken to belong to this set. Variation of a function with respect to the set of approximants was introduced in [7]; further work relating it to the error of approximation by feedforward networks can be found in [8] and [5]. In contrast to the incremental algorithm described in [5], the training algorithm proposed in this paper does not require knowledge of the variation of the function f to be approximated. It uses an estimate of the variation of the n-th approximation fn of f, which is easier to compute. At the same time, n is also the iteration step of the algorithm. Moreover, the training in the presented algorithm is combined with weight decay.

The paper is organized as follows. Section 2 summarizes key results on incremental approximation. Both alternatives of the training algorithm, together with optimization using weight decay, are presented in Section 3. Section 4 is devoted to discussion.

2. Preliminaries

First, some theoretical results on approximation of functions ([8]) will be presented and then applied to feedforward networks. Let (X, || · ||) be a real vector space, N+ the set of positive integers and span G the linear hull of G (i.e. the set of linear combinations of functions from G, where G is a subset of X). For a normed linear space (X, || · ||₂) of functions on a set K ⊂ R^d, d ∈ N+, Kůrková ([7]) introduced the norm variation of a function f ∈ X with respect to a subset G of X as

$$V(f, G) := \inf\{B \ge 0;\ f \in \mathrm{cl\,conv}\, G(B)\},$$

where the closure is taken with respect to the topology generated by the norm || · || and G(B) := {wg; g ∈ G, w ∈ R, |w| ≤ B}. Let a real linear space (X, || · ||₂) denote a space with norm || · ||₂ generated by an inner product. The following theorem is a corollary of the Jones-Barron theorem ([1]) (originally proven for convex combinations of functions), reformulated in terms of variation and for linear combinations of functions. Since our algorithm will deal with finite sets, we use a stronger formulation of the theorem, holding for compact sets G [8].

THEOREM 2.1. Let G be a compact subset of a normed linear space (X, || · ||₂). Then for every f ∈ X such that V(f, G) < ∞ and for every n = 1, . . . , card G there exists a linear combination f_n ∈ span_n G of n elements of G such that

$$\|f - f_n\|_2 \le \sqrt{\frac{B^2 - \|f\|_2^2}{n}},$$

where B = V(f, G) sup_{g∈G} ||g||₂.
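A purely numerical illustration (not from the original theorem, with hypothetical values of B and ||f||₂): the short Python sketch below tabulates the right-hand side of the bound in Theorem 2.1, showing how the guaranteed error decays as O(1/√n) with the number n of elements of G.

```python
import math

# Hypothetical values, chosen only for illustration:
# B = V(f, G) * sup_{g in G} ||g||_2 and norm_f = ||f||_2.
B = 5.0
norm_f = 2.0

for n in (1, 2, 4, 8, 16, 32, 64):
    bound = math.sqrt((B**2 - norm_f**2) / n)   # right-hand side of Theorem 2.1
    print(f"n = {n:3d}   ||f - f_n||_2 <= {bound:.4f}")
```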

V(f, G) can be characterized as follows:



LEMMA 2.2. Let (X, || · ||₂) be a normed linear space and G be its subset such that G ≠ {0}.
(i) [7] Then for every f ∈ X

$$V(f, G) = \sup_{h \in S} \frac{|f \cdot h|}{\sup_{g \in G} |g \cdot h|},$$

where S = {h ∈ X − G^⊥; ||h||₂ = 1} and G^⊥ denotes the orthogonal complement of G.
(ii) [8] If G is finite with card G = m and f ∈ span G, then

$$V(f, G) = \min\Big\{\sum_{i=1}^{m} |w_i|;\ f = \sum_{i=1}^{m} w_i g_i,\ m \in \mathbb{N}_+,\ (\forall i = 1, \dots, m)(w_i \in \mathbb{R},\ g_i \in G)\Big\}.$$

The expression in Lemma 2.2 (ii) will be applied to determine the error of approximation of the function in each step of our algorithm. Before the algorithm is described, we explain the relationship between the above theory and feedforward networks.
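Since Lemma 2.2 (ii) expresses V(f, G) for a finite G as a minimum of the ℓ₁ norm of the coefficients under a linear constraint, it can be computed by linear programming once the functions are sampled on a grid. The following Python sketch illustrates this; the dictionary G, the grid and the use of scipy.optimize.linprog are illustrative assumptions of the example, not prescribed by the paper.

```python
import numpy as np
from scipy.optimize import linprog

# Sample a small, hypothetical dictionary G = {g_1, g_2, g_3} on a grid of K = [0, 1].
xs = np.linspace(0.0, 1.0, 20)
G = [np.tanh(3.0 * (xs - c)) for c in (0.2, 0.5, 0.8)]
A = np.column_stack(G)               # column i holds the samples of g_i
f = 1.5 * G[0] - 0.5 * G[2]          # some f in span G

# Lemma 2.2 (ii): V(f, G) = min sum_i |w_i| subject to f = sum_i w_i g_i.
# Standard LP reformulation with auxiliary variables t_i >= |w_i|, variables z = (w, t).
m = A.shape[1]
c = np.concatenate([np.zeros(m), np.ones(m)])
A_eq = np.hstack([A, np.zeros((A.shape[0], m))])
A_ub = np.block([[np.eye(m), -np.eye(m)],     #  w_i - t_i <= 0
                 [-np.eye(m), -np.eye(m)]])   # -w_i - t_i <= 0
res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * m), A_eq=A_eq, b_eq=f,
              bounds=[(None, None)] * m + [(0, None)] * m)
w = res.x[:m]
print("V(f, G) =", np.abs(w).sum())  # here the representation is unique, so this is 2.0
```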

2.1. FEEDFORWARD NETWORKS AND APPROXIMATION

It is known that, to approximate a continuous function by a perceptron-type neural network with a variety of continuous non-polynomial activation functions, one hidden layer with an arbitrarily large number of hidden units suffices, due to the universal approximation property (Cybenko [4], Hornik, Stinchcombe, White [6]). In this paper, we limit our attention to perceptron networks with d input units, one hidden layer of n hidden units and one output unit. This allows the application of the above-mentioned theory to the estimation of the approximation error of a feedforward network with one hidden layer: if the activation functions g_i, i = 1, . . . , n of the hidden layer with n units are members of the set G, then the output of the network with weights w_i, i = 1, . . . , n between the hidden layer and the output unit, corresponding to the function f_n = Σ_{i=1}^{n} w_i g_i, belongs to span_n G and the theory can be applied. However, as V(f, G) is usually not given and the computation of variation from Lemma 2.2 is generally difficult, some simplification is still necessary. This simplification is described in the following subsection.

2.2. ESTIMATE OF THE ERROR OF APPROXIMATION FOR A FIXED NETWORK TOPOLOGY

Variations of f_n, where f_n is the n-th approximation of f, will be used and estimated. f_n is a linear combination of n members of a finite-dimensional space, thus from Lemma 2.2 (ii)

$$V(f_n, G) = \min\Big\{\sum_{i=1}^{m} |w_i^n|;\ f_n = \sum_{i=1}^{m} w_i^n g_i,\ m \in \mathbb{N}_+,\ (\forall i = 1, \dots, m)(w_i^n \in \mathbb{R},\ g_i \in G)\Big\}, \tag{1}$$

where w_i^n, i = 1, . . . , m denotes the output weight corresponding to hidden unit i of the function f_n (and at the same time the weights achieved in the n-th iteration of the algorithm). In Theorem 2.3 we restrict ourselves to m = n (i.e. only linear combinations of n members of G will be considered) to get an estimate for V(f_n, G). Approximation (1) can be correctly applied to Theorem 2.1 thanks to the fact that for every n ∈ N+, V(f_n, G) ≥ V(f, G) for finite G ≠ {0}, and V(f_n, G) is close to V(f, G) for sufficiently high n ≫ 1 (the proofs can be found in [5]). Denote by F = f(u) = min_{x∈K} {f(x)} the minimum of f in the given interval K ⊂ R^d (d denotes the dimension of the input space). To approximate the norm ||f||₂, the estimate ||f||₂ ≥ F µ(K) will be used, where µ is the Lebesgue measure in R^d. Our algorithm is based on the following theorem.

THEOREM 2.3. Let G be a compact subset of a normed linear space (X, || · ||₂) of continuous functions defined on a compact interval K ⊂ R^d for some integer d ≥ 1. Then for every f ∈ span G and for every n = 1, . . . , card G there exists a linear combination f_n ∈ span_n G of n elements of G such that

$$\|f - f_n\|_2 \le \sqrt{\frac{V_n^2(f_n, G)\, \sup_{g \in G} \|g\|_2^2 - (F\mu(K))^2}{n}}, \tag{2}$$

where V_n(f_n, G) := Σ_{i=1}^{n} |w_i^n| such that f_n = Σ_{i=1}^{n} w_i^n g_i, (∀i = 1, . . . , n)(w_i^n ∈ R, g_i ∈ G), and the w_i^n were found by weight decay optimization in the n-th step; F = f(u) = min_{x∈K} {f(x)} denotes the minimum of f on the interval K. The function f_n is an n-th approximation of the function f.

This theorem guarantees that the error of approximation on K is limited by (2) in each step n. In practice this means that if, for a continuous function f given by a set of input/output values, such a feedforward network f_n is constructed, then both the in-sample and out-of-sample performance of this network will not give an error greater than (2) for all data in the applied compact set K. In comparison, commonly used algorithms for approximation by neural networks generally guarantee the bound on the approximation error only on the finite set of input values (using an error function on the input data).
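For concreteness, the bound (2) can be evaluated directly from the hidden-to-output weights obtained at step n. The sketch below is illustrative (the function and parameter names, and the example values, are hypothetical); it simply computes the right-hand side of (2).

```python
import numpy as np

def theorem_2_3_bound(output_weights, sup_norm_g, F, mu_K, n):
    """Right-hand side of the bound (2): sqrt((V_n^2 * sup||g||^2 - (F*mu(K))^2) / n).

    output_weights -- the weights w_i^n between the hidden layer and the output unit
    sup_norm_g     -- sup_{g in G} ||g||_2 over the set of activation functions
    F              -- the minimum of f on the compact interval K
    mu_K           -- the Lebesgue measure of K
    """
    V_n = np.abs(output_weights).sum()                 # estimate (1): V_n(f_n, G) = sum_i |w_i^n|
    inside = V_n**2 * sup_norm_g**2 - (F * mu_K)**2
    return np.sqrt(max(inside, 0.0) / n)               # clamp guards against a negative estimate

# Hypothetical values for illustration only.
w_out = np.array([0.8, -1.1, 0.4])
print(theorem_2_3_bound(w_out, sup_norm_g=1.0, F=0.2, mu_K=1.0, n=len(w_out)))
```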

3. The Training Algorithm

3.1. INITIALIZATION

Some practical issues concerning the initialization of the training algorithm will be pointed out here. These hints can be used in any simulation of our algorithm. To avoid overfitting and to initialize the size of the network, hints based on large numbers of simulations, recommended in Sarle [10], will be used. Let the function f : K → R (which is to be approximated by a feedforward network) be given by a set of input/output values (x, y) (the training set S) on a compact interval K ⊂ R^d. Suppose that the training set contains a point (u, f(u)) such that F = f(u) = min_{x∈K} {f(x)} is the minimum of f on K.

If weight decay is used in optimizing the cost function, a considerable number of hidden units is recommended in order to avoid spurious ridges and valleys in the output surface, which can be produced by networks with a smaller number of hidden units (Chester [3]). As no method is known for initializing this 'considerable' number of hidden units, the application of an incremental algorithm can avoid this problem. The more weights there are in the network relative to the number of training cases, the more overfitting amplifies noise in the targets ([9]). Reducing the size of the weights reduces the effective number of weights with respect to weight decay ([9]). It is obvious from these practical hints that setting the appropriate size of the network in advance is not an easy task. To circumvent this problem, we suggest using incremental training in combination with weight decay.

In [10] it is recommended, in order to avoid overfitting, to have about 30 times as many training cases as there are weights in the network for noisy data, and about 5 times as many training cases as weights for noise-free data. For a given training set of data, the upper bound on the number of approximants f_n is therefore set to n_max = s / (30(d + 1)) in the case of noisy data and to n_max = s / (5(d + 1)) in the case of noise-free data, where s is the number of training data and d the number of input units. Our algorithm will be used approximately up to such n.

The feedforward network considered in our algorithm has d input units, n hidden units and one output unit. The activation functions are functions g_j, j = 1, . . . , n from a given set G. The output of the network is given by

$$z(x, w) = \sum_{i=1}^{n} w_i\, g_i\!\left(\sum_{j=1}^{d} w_{ji} x_j + w_{i0}\right) + w_0, \tag{3}$$

where w = (w_{ji}, w_i) is the vector of weights for i = 0, . . . , n and j = 1, . . . , d.
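As an illustration of Equation (3), a minimal Python sketch of the network output follows; the parameter layout (a matrix W of input-to-hidden weights, hidden biases b, output weights v, output bias v0) and the choice of tanh as activation are assumptions of the example only.

```python
import numpy as np

def z(x, W, b, v, v0, g=np.tanh):
    """Output of the single-hidden-layer network in Equation (3).

    W[i, j] plays the role of w_{ji}, b[i] of w_{i0}, v[i] of w_i and v0 of w_0.
    """
    hidden = g(W @ x + b)     # inner sums over j = 1, ..., d for each hidden unit i
    return v @ hidden + v0    # outer sum over i = 1, ..., n plus the output bias

# Example with d = 2 inputs and n = 3 hidden units (random weights for illustration).
rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 2)), rng.normal(size=3)
v, v0 = rng.normal(size=3), 0.0
print(z(np.array([0.5, -1.0]), W, b, v, v0))
```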

3.2. WEIGHT DECAY

Weight decay is one of the simplest regularization methods. It is known that producing an over-fitted mapping with regions of large curvature requires relatively large values of the weights. By using a regularizer (a weight decay summand) of the form ν Σ_{i,j} w_{ij}², the weights will tend to be small and overfitting can be avoided. The generalization error (cost functional) used here is of the form

$$\bar{E}(w) = E(w) + \nu\, \Omega(w),$$

where Ω(w) = Σ_{i,j} w_{ij}². The weight decay parameter ν determines the trade-off between the degree of smoothness of the solution and its closeness to the data.

Determining an appropriate value of the parameter ν can be done analytically if the function to be approximated is given analytically (for example [2]) or if f is a realization of a random field with a known prior probability distribution. The situation is, however, more difficult if the function f is given only by a finite set of input/output values. One can set ν experimentally: for example, several networks with different values of the decay parameter are trained and the generalization error is estimated for each; the decay parameter that minimizes the estimated generalization error is then chosen. This method is unfortunately time-consuming and hence not very appropriate.
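The experimental selection of ν described above can be sketched as a simple search over candidate values, each judged by an estimate of the generalization error on held-out data. The routine train_network and the candidate grid below are hypothetical placeholders, not part of the paper.

```python
import numpy as np

def select_decay(train_network, X_tr, y_tr, X_val, y_val,
                 candidates=(1e-4, 1e-3, 1e-2, 1e-1)):
    """Choose the weight decay parameter nu with the smallest estimated
    generalization error; train_network(X, y, nu) is assumed to return a
    predictor trained by minimizing E(w) + nu * sum_{i,j} w_{ij}^2."""
    best_nu, best_err = None, np.inf
    for nu in candidates:
        model = train_network(X_tr, y_tr, nu)
        err = float(np.mean((model(X_val) - y_val) ** 2))  # estimated generalization error
        if err < best_err:
            best_nu, best_err = nu, err
    return best_nu
```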

3.3. THE ALGORITHM

Let the function f be given by a finite set S of input/output values (x, y). The approximations f_n of the function f will be generated (recursively incrementally or generally incrementally) for every n ≥ 1, where n denotes both the step of the algorithm and the size of the hidden layer of the network corresponding to the function f_n. The training algorithm optimizes both the size of the hidden layer n and the training error in parallel, in two variants (GI and RI). The function f_n is of the form f_n(x) = z(x, w).

GI (generally incremental): The function f_n is of the form f_n = Σ_{i=1}^{n} w_i^n g_i, where the weights w_i^n are achieved by minimization of the cost function

$$\bar{E}^n(w) = \sum_{(x,y) \in S} \left[\sum_{i=1}^{n} w_i^n\, g_i\!\left(\sum_{j=1}^{d} w_{ji} x_j + w_{i0}\right) + w_0 - y\right]^2 + \nu_1 \sum_{i=1}^{n} \sum_{j=1}^{d} (w_{ji})^2 + \nu_2 \sum_{i=1}^{n} (w_i^n)^2.$$

The minimization is with respect to all the weights in the network (i.e. (d + 1)n parameters; ν₁ and ν₂ are constants).

RI (recursively incremental): The function f_n is constructed recursively according to
1. f_1 ∈ G;
2. if f_{n−1} = Σ_{i=1}^{n−1} w_i^{n−1} g_i, then f_n = f_{n−1} + w_n^n g_n = Σ_{i=1}^{n−1} w_i^n g_i + w_n^n g_n.
(The upper indices n and n − 1 denote that the weights w_i are assigned to the functions f_n and f_{n−1}, respectively.) The weights w_{jn} and w_n^n of the added hidden unit are achieved by minimization of the cost function

$$\bar{E}^n(w) = \sum_{(x,y) \in S} \left[f_{n-1}(x) + w_n^n\, g_n\!\left(\sum_{j=1}^{d} w_{jn} x_j + w_{n0}\right) + w_0 - y\right]^2 + \nu_1 \sum_{j=1}^{d} (w_{jn})^2 + \nu_2 (w_n^n)^2,$$

while the weights w_{ji} and w_i^n, i = 1, . . . , n − 1, are set equal to the weights achieved in step n − 1. (The minimization is with respect to d + 1 parameters; ν₁ and ν₂ are constants.)
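A minimal sketch of the RI variant is given below, under several assumptions of the example: a single activation function g is reused for every added unit, the inner minimization is done with scipy.optimize.minimize (the paper does not prescribe an optimizer), the output bias w_0 is omitted for brevity, and the loop stops when the bound (2) falls below a tolerance eps or when n reaches the n_max of Section 3.1.

```python
import numpy as np
from scipy.optimize import minimize

def train_ri(X, y, g, sup_norm_g, F, mu_K, eps, nu1=1e-3, nu2=1e-3, noisy=True):
    """Sketch of the recursively incremental (RI) variant: at step n only the
    parameters of the newly added hidden unit and its output weight w_n^n are
    optimized (with weight decay); previously trained units stay frozen."""
    s, d = X.shape
    n_max = max(1, s // (30 * (d + 1)) if noisy else s // (5 * (d + 1)))  # Section 3.1
    units, w_out = [], []                       # frozen (input weights, bias) pairs and output weights

    def f_prev(x):                              # current approximation f_{n-1}
        return sum(w * g(p[:d] @ x + p[d]) for w, p in zip(w_out, units))

    for n in range(1, n_max + 1):
        def cost(p):                            # p = (w_{1n}, ..., w_{dn}, w_{n0}, w_n^n)
            v, b, wn = p[:d], p[d], p[d + 1]
            pred = np.array([f_prev(x) + wn * g(v @ x + b) for x in X])
            return (np.sum((pred - y) ** 2)
                    + nu1 * np.sum(v ** 2) + nu2 * wn ** 2)

        res = minimize(cost, np.random.default_rng(n).normal(size=d + 2))
        units.append(res.x[:d + 1])
        w_out.append(res.x[d + 1])

        V_n = float(np.abs(w_out).sum())        # estimate (1) of V(f_n, G)
        bound = np.sqrt(max(V_n**2 * sup_norm_g**2 - (F * mu_K)**2, 0.0) / n)
        if bound <= eps:                        # error bound (2) is satisfactory
            break
    return units, np.array(w_out)
```

The GI variant would differ only in re-optimizing the cost function over all network parameters at every step instead of only those of the newly added unit.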



Denote by ŵ a (generally different) solution for the two methods. To estimate V(f_n, G) from Theorem 2.3, the weights ŵ_i^n, i = 1, . . . , n achieved by GI or RI will be used. If the error bound computed from the estimate (1) with these weights is still not satisfactory (i.e. is greater than a given ε > 0), it is necessary to increase the number of hidden units, n := n + 1, and to execute the minimization (one of the variants GI or RI) again.

Both variants GI and RI have their pros and cons. The minimization in variant GI has freedom in all coordinates of the vector w. Using a global optimization method, the achieved minimum will be a global minimum of Ē^n(w). The price is optimization in a high-dimensional space ((d + 1)n dimensions). In contrast, the RI variant minimizes Ē^n(w) with respect to a limited number of parameters (those corresponding to the added hidden unit). The minimization is done in a space of smaller dimension (d + 1 dimensions), and thus the algorithm with this variant is faster. However, the achieved minimum can in general be larger than the minimum achieved by GI, even if a global optimization method is used. We must also remark that the upper bound controlling the error of approximation depends only on the weights between the hidden layer and the output unit and treats the remaining weights (the output bias and the input-to-hidden weights and biases) as fixed.

4. Discussion and Outlook

Two variants of a new incremental training of the weights and the size of a feedforward network were presented. Using the theory described above, for a continuous function f given by a set of input/output values, the approximating feedforward network represented by the function f_n in each step n guarantees that both the in-sample and out-of-sample performance of the network gives an error of at most the right-hand side of formula (2) for all data in the applied compact set K. This important feature is not generally guaranteed by common algorithms applied to the problem of learning from examples, which can in general guarantee that the error does not exceed a given value only on the input data. In contrast to the algorithm in [5], the simplification of considering V_n(f_n, G) in Theorem 2.3 can cause the upper bound (2) to be larger than the one in [5]. However, the advantage of this simplification is that Theorem 2.3 allows an easier implementation and speeds up the algorithm in practice with respect to [5].

The pros and cons of both variants GI and RI are currently being studied and tested in an approximation problem from the real world: modelling of the Austrian interregional telecommunication traffic flow. First simulations support the theoretical results. As these simulations have many parameters, a large number of simulations is necessary to compare them to the common methods, and this is beyond the framework of this paper.



Acknowledgements

The authors gratefully acknowledge grant no. P12681–INF provided by the Fonds zur Förderung der wissenschaftlichen Forschung (FWF). Kateřina Hlaváčková-Schindler was partially supported by grant CR 201/99/0092.

References

1. Barron, A. R.: Universal approximation bounds for superpositions of a sigmoidal function, IEEE Transactions on Information Theory 39 (1993), 930–945.
2. Bös, S.: Optimal weight decay in a perceptron, Proc. of the International Conference on Neural Networks (1996), 551–556.
3. Chester, D. L.: Why two layers are better than one, Proc. of the International Joint Conference on Neural Networks 1, Lawrence Erlbaum, Washington, D.C., 1990, pp. 265–268.
4. Cybenko, G.: Approximation by superpositions of a sigmoidal function, Mathematics of Control, Signals and Systems 2 (1989), 303–314.
5. Hlaváčková, K. and Sanguineti, M.: Algorithm of incremental approximation using variation of a function with respect to a subset, Proc. of the International Conference on Neural Computation, ICSC Academic Press, Canada/Switzerland, 1998, pp. 896–899.
6. Hornik, K., Stinchcombe, M. and White, H.: Multilayer feedforward networks are universal approximators, Neural Networks 2 (1989), 359–366.
7. Kůrková, V.: Dimension-independent rates of approximation by neural networks, In: K. Warwick and M. Kárný (eds.), Computer-Intensive Methods in Control and Signal Processing: Curse of Dimensionality, Birkhäuser, Boston, 1997, pp. 261–270.
8. Kůrková, V., Savický, P. and Hlaváčková, K.: Representations and rates of approximation of real-valued Boolean functions by neural networks, Neural Networks 11 (1998), 651–659.
9. Moody, J. E.: The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems, Proc. of NIPS 4 (1992), 847–854.
10. Sarle, W. S.: http://camphor.elcom.nitech.ac.jp/Music/neural/FAQ3.html, 1997.
