On-Line Stochastic Functional Smoothing Optimization for Neural Network Training

Chuan Wang, Jose C. Principe
Computational NeuroEngineering Laboratory
University of Florida, Gainesville, FL 32611
[email protected]

Acknowledgments: This work was partially supported by NSF grant ECS-9208789 and ARPA/ONR grant N00014-94-1-0858.

Send correspondence to:

Dr. Jose C. Principe, Ph.D. Electrical Engineering Department, CSE 405 University of Florida, Gainesville, FL 32611

Phone: (904) 392-2662 Fax: (904) 392-0044 E-mail: [email protected]

Running title: On-line stochastic function smoothing.

Preferred sections: Mathematical and Computational Analysis.


On-Line Stochastic Functional Smoothing Optimization for Neural Network Training

Abstract: A set of new algorithms for training neural networks is proposed, based on an on-line implementation of a well known global optimization strategy, stochastic functional smoothing. These algorithms differ from other on-line global optimization approaches because they use not only first-order but also second-order (Hessian) gradient information; therefore, they converge faster than first-order gradient descent search methods. Convergence and sensitivity analyses of the proposed method are provided. The on-line algorithms are compared with a second-order gradient method, momentum learning, and conjugate gradients to demonstrate their consistent, global convergence, and with a conventional stochastic global optimization scheme to demonstrate their faster learning rate. Computer simulation results are presented to support the analysis.

Keywords---stochastic functional smoothing, on-line algorithms, global convergence, mean square error, supervised learning, momentum learning, conjugate learning, second-order derivative.


I. Introduction

Despite great successes in the development of learning algorithms for neural networks such as classical back-propagation (BP), two major issues remain unsolved. The first is entrapment in local minima, and the second is the slow learning speed or convergence rate for most complex learning tasks. Many alternative approaches have been suggested to overcome these two drawbacks. For the first problem, simulated annealing [Kirkpatrick, S., et al., 1983] and genetic algorithms [Holland, 1975] are two methods that guarantee convergence to the global minimum if the cooling schedule is slow enough or the population is large enough, respectively. But both methods usually display very slow convergence even for simple learning tasks [Hassoun, 1995]. Other methods, such as deterministic and stochastic global optimization, have also been proposed [Bazaraa et al., 1993]. In deterministic optimization, many initial weight values need to be tried, so it is not suitable for optimization in high dimensional spaces such as neural networks. Stochastic optimization, which is considered the only feasible method for global optimization in high dimensional spaces, is based on the following idea [Schoen, 1991; Hassoun, 1995]. A noise term perturbs the cost function ε(w) being minimized in order to avoid entrapment in a bad local minimum. This noise should be scheduled appropriately during adaptation. That is, during the search process, the perturbations to the effective function being minimized are gradually removed, so that it becomes exactly ε(w) towards the end of learning.

In its most basic form, the stochastic algorithm performs the search on a perturbed form ε̂(w) of the cost function ε(w), where the perturbation is additive in nature:

ε̂(w) = ε(w) + η(t) ∑_{i=1}^{m} w_i n_i(t)    (1)

where N = [n_1(t), …, n_m(t)] is a vector of independent noise sources, W = [w_1, …, w_m] is a vector of adjustable parameters, and η(t) is a parameter that controls the magnitude of the noise. To achieve the gradual reduction in noise mentioned above, η(t) must be selected in such a way that it approaches zero as t tends to infinity. A simple choice for η(t) is η(t) = π e^{−λt} with π ≠ 0 and λ > 0. An alternative is to use gradient descent on this noisy performance surface, yielding the stochastic gradient rule:

w(t+1) = w(t) − η [ ∇_w ε(w) + η(t) N(t) ]    (2)

It is important to observe that learning with Eq. (2) uses only first-order gradient information; the sole contribution of the noise term is to make the search escape from local minima. This search strategy has been proven to converge to a global minimum [Schoen, 1991]. Imposing a constraint on the noise distribution, Kushner proposed a global search scheme and provided an asymptotic convergence analysis [Kushner, 1987]. Geman and Hwang further developed a method for global optimization which is essentially the simulation of an annealed diffusion process [Geman and Hwang, 1986]. Both of these methods use only first-order derivative information for adaptation, and both have been used for neural network training. Hanson added random noise to the weight adaptation and experimentally showed convergence to a global minimum with constant η(t) [Hanson, 1990]. Using exactly the idea given by Eq. (2) but with a damped η(t), Darken and Moody proposed the search-then-convergence scheme [Darken and Moody, 1991, 1992].
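To make the role of the annealed noise term concrete, the following is a minimal Python/NumPy sketch of the update rule of Eq. (2); the cost function, its gradient, and the schedule constants are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def noisy_gradient_descent(grad, w0, steps=2000, lr=0.05, pi=1.0, lam=1e-3, seed=0):
    """Stochastic gradient rule of Eq. (2): gradient descent on a cost surface
    perturbed by additive noise whose magnitude eta(t) = pi * exp(-lam * t)
    is annealed toward zero, so the search settles on the unperturbed surface."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    for t in range(steps):
        eta_t = pi * np.exp(-lam * t)           # annealed noise magnitude
        noise = rng.standard_normal(w.shape)    # N(t): independent noise sources
        w = w - lr * (grad(w) + eta_t * noise)  # Eq. (2)
    return w

# Example with a simple nonconvex cost (an assumption for illustration only)
grad = lambda w: (4 * w**3 - 32 * w + 5) / 2.0
w_star = noisy_gradient_descent(grad, w0=[3.0, 3.0])
```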

As far as the slow search convergence is concerned, it is well known that the convergence speed can be improved greatly if higher-order curvature information is employed in the adaptation [Becker and Le Cun, 1988; Baritompa, 1994; Le Cun et al., 1991, 1993]. Theoretically, in a neural network the cost function ε(w) can be written as a Taylor expansion around the 'current' weight vector w_o [Le Cun et al., 1990; Smagt, 1994],


ε(w) = ε(w_o) + ∑_{i,j} (∂ε(w)/∂w_ij)|_{w_o} w_ij + (1/2) ∑_{i,j,i′,j′} (∂²ε(w)/∂w_ij ∂w_i′j′)|_{w_o} w_ij w_i′j′ + …    (3)

where i and j are the indexes of the weights.

In the classical back-propagation algorithm, all terms beyond the first-order term of this Taylor expansion are ignored; hence, BP is a linear approximation. When the second-order term is also retained, a combination of linear and quadratic approximations is used, so convergence is faster than with classical BP.

Momentum learning [Rumelhart et al., 1986] and conjugate gradients [Moller, 1993; Boray and Srinath, 1992; Smagt, 1994], which use second-order information indirectly [Smagt, 1994], and second-order learning methods, even with approximations to the Hessian [Le Cun et al., 1990, 1991, 1993; Pearlmutter, 1993], have been shown to improve the learning speed tremendously. Therefore, it is desirable to use second-order derivative information. But a major difficulty with the use of the second derivative is the storage and computational complexity of the Hessian matrix. Some simplifying approaches, such as the approximation of the full Hessian by a diagonal matrix, have been suggested [Le Cun et al., 1990].

In this paper, we propose a strategy called smoothed function optimization [Rubinstein, 1981 and 1986] for training neural networks that deals directly with the above two issues. It will be proven that on-line smoothed function optimization is globally convergent, and it will also be shown that it uses both first-order and second-order (Hessian) cost function derivatives; therefore, it has desirable properties for optimization in high dimensional spaces.

This paper is organized as follows. The learning problem is defined in section II, specifying batch and on-line learning modes. Section III presents the smoothed function optimization approach and derives several new observations for gradient estimation. Convergence analysis is given in section IV, in which we show that on-line smoothed function optimization with both single-sided and double-sided estimation is also globally convergent, using the general results obtained by Kushner [Kushner, 1987]. In order to see that smoothed function optimization is natural for neural networks, a sensitivity analysis of the weights is given in section V. In section VI, some practical learning approaches, including a modified second-order derivative learning, a modified momentum learning, and a modified conjugate gradient learning, are proposed to implement the smoothed function optimization. Section VII gives simulation results to verify the validity of the smoothed function optimization. Conclusions and remarks are stated in section VIII.

II. Classical supervised learning

The learning system used in this paper as a prototype is the multilayer perceptron (MLP) with two layers (Figure 1), but the conclusions can be extended to recurrent topologies. In this network, x_k denotes one element of an input vector; y_i is the ith output of the output layer; W_ij represents a weight between the hidden layer and the output layer; V_jk is a weight between the input and hidden layers; and P_j denotes the activation of the jth hidden unit. The nonlinearity f(.) in each neuron of the network is a logistic function, which has the form

f(x) = 1 / (1 + exp(−x))    (4)

and its derivative is

f′ = f (1 − f)    (5)

The training algorithm employed here is the BP algorithm [Rumelhart et al., 1986].


Figure 1. A two-layer MLP: inputs x_k, hidden-layer activations P_j, and outputs y_i; V_jk are the input-to-hidden weights and W_ij the hidden-to-output weights.
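For concreteness, here is a minimal NumPy sketch of the forward pass of the two-layer MLP of Figure 1 with the logistic nonlinearity of Eq. (4); the layer sizes and variable names are illustrative assumptions.

```python
import numpy as np

def logistic(x):
    # Eq. (4): f(x) = 1 / (1 + exp(-x)); its derivative is f' = f (1 - f), Eq. (5)
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, V, W):
    """Forward pass of the two-layer MLP of Figure 1.
    x: input vector (x_k), V: input-to-hidden weights (V_jk),
    W: hidden-to-output weights (W_ij)."""
    P = logistic(V @ x)   # hidden-layer activations P_j
    y = logistic(W @ P)   # output-layer activations y_i
    return P, y

# Example dimensions (assumed): 4 inputs, 3 hidden units, 2 outputs
rng = np.random.default_rng(0)
V = rng.standard_normal((3, 4))
W = rng.standard_normal((2, 3))
P, y = forward(rng.standard_normal(4), V, W)
```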

Let d i ( t ) denote some desired response or target response for output neuron i at time t, where t is the discrete time index. Let the corresponding value of the actual response of this neuron be denoted by y i ( t ) . The response y i ( t ) is produced by a stimulus (vector) x ( t ) applied to the input of the network in which neuron i is embedded. The input vector x ( t ) and desired response d i ( t ) for neuron i constitute a particular example presented to the network at time t. Typically, the actual response y i ( t ) of neuron i is different from the desired response d i ( t ) . Hence, we may define an error signal as the difference between the target response d i ( t ) and the actual response y i ( t ) , as shown by

e_i(t) = d_i(t) − y_i(t)    (6)

The ultimate purpose of learning is to minimize a cost function based on the error signal e i ( t ) , such that the actual response of each output neuron in the network approaches the target response for that neuron in some statistical sense. A criterion commonly used for the cost function is the


Mean-Square-Error (MSE) criterion, defined as the mean value of the sum of squared errors [Haykin, 1994]:

J = E[ (1/2) ∑_i (e_i(t))² ]    (7)

where E is the statistical expectation operator, and the summation is over all the neurons (i = 1, ..., M) in the output layer of the network. Minimization of the cost function J with respect to the network parameters can be easily formulated by the method of gradient descent. However, the difficulty with this optimization procedure is that it requires knowledge of the statistical characteristics of the underlying processes. Practically, this shortcoming can be overcome by settling for an approximate solution to the optimization problem. Specifically, the Instantaneous value of the sum of Squared Errors (ISE), also called Least Mean Square (LMS) in the adaptive signal processing community, is the criterion of choice [Haykin, 1994]:

ε(t) = (1/2) ∑_i (e_i(t))²    (8)

The network parameters (weights) are then optimized by minimizing ε(t). Actually, this procedure leads to the so-called on-line gradient descent algorithm, because the weights can be updated for each sample [Widrow and Hoff, 1960]. If the learning rate is small enough, the noisy gradient estimates are averaged over the iterations and the solution coincides with the statistical solution given by Eq. (7) [Haykin, 1994]. We will present our formulation with statistical operators, but later in the paper we will propose an on-line version.
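As an illustration of the difference between the batch (MSE) and on-line (ISE) modes, here is a minimal sketch of the per-sample gradient update; the grad_ise(w, x, d) routine, which would backpropagate the instantaneous error of Eq. (8), and all names are assumptions made only for illustration.

```python
def train_online(w, data, grad_ise, lr=0.1, epochs=100):
    """On-line (per-sample) gradient descent on the ISE criterion of Eq. (8).
    The weights are updated after every (input, target) pair, so each gradient
    is a noisy estimate whose average approaches the MSE gradient of Eq. (7)."""
    for _ in range(epochs):
        for x, d in data:
            w = w - lr * grad_ise(w, x, d)  # one update per training sample
    return w
```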


III. Stochastic functional smoothing (SFS)

In this section, we review the method of optimization using stochastic smoothed functions, extend it to on-line versions, and provide some new observations. Stochastic smoothed function optimization [Rubinstein, 1981 and 1986] is a method of minimization applicable to nonconvex functions. The original nonconvex function is replaced by an auxiliary one, called a smoothed function, which possesses some nice properties for the optimization (e.g., a single minimum). Operating on the smoothed function, the global minimum of the original problem can be found. The smoothed function idea is similar to applying a low-pass filter to the original nonconvex function in parameter space. Local minima of the performance surface can be viewed as high frequency components; hence, by low-pass filtering in parameter space we can obtain a convex function which has only a single minimum, the global minimum.

Let us consider the following optimization problem,

min_w J(w) = min_w E[ (1/2) ∑_i (d_i(t) − y_i(t))² ]    (9)

where J(w) is a cost function dependent on the variable w. In supervised learning of a neural network, w denotes network weights.

To solve this problem let us introduce the following convolution function [Rubinstein, 1981 and 1986]

Ĵ(w, β) = ∫_{−∞}^{∞} ĥ(v, β) J(w − v) dv = ∫_{−∞}^{∞} ĥ((w − v), β) J(v) dv    (10)

which is called a smoothed functional, where β is a control parameter and v is a random variable.

We can imagine that ĥ(v, β) is the impulse response of a lowpass filter with parameter β. In order for Ĵ(w, β) to be useful for the original optimization, the impulse response ĥ(v, β) should satisfy the following assumptions [Rubinstein, 1981 and 1986]:

(i) ĥ(v, β) = (1/β^n) h(v/β) is a piecewise differentiable function with respect to v.

(ii) lim_{β→0} ĥ(v, β) = δ(v), where δ(v) is Dirac's delta function.

(iii) lim_{β→0} Ĵ(w, β) = J(w).

(iv) ∫_{−∞}^{∞} ĥ(v, β) dv = 1, so that

Ĵ(w, β) = E_v[ J(w − v) ]    (11)

The Gaussian and uniform kernels are two p.d.f.'s satisfying conditions (i)-(iv), but others may exist. In this paper, we assume that the kernel is Gaussian, since in practice the noise is normally assumed to be additive Gaussian noise.

The idea of the smoothed functional goes as follows: for a given function J(w), construct a smoothed function Ĵ(w, β) and, operating only with Ĵ(w, β), find the global minimum of J(w). In other words, while operating only with Ĵ(w, β), we want to avoid all fluctuations and local minima of J(w), and find w*, the optimal parameter corresponding to the global minimum.

The impulse response parameter β determines the degree of smoothing applied to J(w). For large β the effect of smoothing is large, and vice versa. When β → 0, it follows from condition (ii) that Ĵ(w, β) = J(w), and then there is no smoothing.

It is intuitively clear that, to avoid local minima, β has to be sufficiently large at the start of the optimization. However, on approaching the optimum the effect of smoothing should be reduced by letting β vanish, since at the minimum point w* we want coincidence between J(w) and Ĵ(w, β). Accordingly, a set of smoothed functions Ĵ(w, β_s), s = 1, 2, …, is required while constructing an iterative procedure for finding w*.

It is important to note that, to find a gradient of the smoothed function Ĵ(w, β_s), we do not need to know the gradient of J(w), which sometimes does not exist at all. Therefore, we consider two different cases in this paper: (a) gradients of J(w) are not available (this corresponds to the case of using a hard-limiter nonlinearity for the processing elements of the neural network); and (b) gradients of J(w) are available.

We will select the impulse response kernel as a multinormal (Gaussian) distribution with dimension n and standard deviation β, that is,

ĥ(v, β) = (1/((2π)^{n/2} β^n)) exp( −(1/2) ∑_{j=1}^{n} (v_j/β)² )    (12)

It is not difficult to show [Styblinski and Tang, 1990] from Eq. (12) that, for case (a),


∇_w Ĵ(w, β) = −(1/β) E_v[ v J(w − βv) ]    (13)

and for case (b),

∇_w Ĵ(w, β) = E_v[ ∇_w J(w − βv) ]    (14)

Then, we have the following unbiased estimators of Eqs. (13) and (14) for a finite number of samples N,

∇_w Ĵ(w, β) = −(1/(βN)) ∑_{j=1}^{N} v_j J(w − βv_j)    (15)

∇_w Ĵ(w, β) = (1/N) ∑_{j=1}^{N} ∇_w J(w − βv_j)    (16)

where N is the number of samples v_j drawn from the multivariate p.d.f. given in Eq. (12). The relevant point of Eqs. (15) and (16) is that the finite average can be used to estimate, with no bias, the expectations of Eqs. (13) and (14).
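A minimal NumPy sketch of the finite-sample estimators of Eqs. (15) and (16) follows; the function names and the test cost are assumptions made only for illustration.

```python
import numpy as np

def smoothed_grad_case_a(J, w, beta, N, rng):
    """Eq. (15): estimator using only cost evaluations (case (a))."""
    v = rng.standard_normal((N, w.size))            # samples from the kernel of Eq. (12)
    values = np.array([J(w - beta * vi) for vi in v])
    return -(v * values[:, None]).mean(axis=0) / beta

def smoothed_grad_case_b(gradJ, w, beta, N, rng):
    """Eq. (16): estimator averaging gradients at perturbed points (case (b))."""
    v = rng.standard_normal((N, w.size))
    return np.mean([gradJ(w - beta * vi) for vi in v], axis=0)

# Example with an assumed nonconvex cost: both estimates approximate the
# gradient of the smoothed functional J_hat(w, beta) of Eq. (10).
rng = np.random.default_rng(0)
J = lambda w: np.sum(w**4 - 16 * w**2 + 5 * w) / 2.0
gradJ = lambda w: (4 * w**3 - 32 * w + 5) / 2.0
w = np.array([1.0, -2.0])
g_a = smoothed_grad_case_a(J, w, beta=0.5, N=200, rng=rng)
g_b = smoothed_grad_case_b(gradJ, w, beta=0.5, N=200, rng=rng)
```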

Eqs. (15) and (16) utilize single-sided gradient estimators for Ĵ(w, β). Alternatively, a double-sided estimator can be applied for this operation, which has the form [Styblinski and Tang, 1990; Rubinstein, 1981 and 1986], for cases (a) and (b), respectively,

∇_w Ĵ(w, β) = (1/(2βN)) ∑_{i,j=1}^{N} [ v_i J(w − βv_i) − v_j J(w + βv_j) ]    (17)

∇_w Ĵ(w, β) = (1/(2N)) ∑_{i,j=1}^{N} [ ∇_w J(w − βv_i) + ∇_w J(w + βv_j) ]    (18)

Notice that the samples v i and v j in the right hand side of Eqs. (17) and (18) are different since these two terms are functions of different variables [Rubinstein, 1981 and 1986].

Gradients given by Eqs. (15) to (18) can be used directly for off-line or batch search modes. But for on-line search algorithms, which are the most important for adaptive systems (linear or nonlinear), a stochastic approximation technique is needed.

III.1. Towards an on-line algorithm for stochastic functional smoothing

Applying the stochastic approximation technique [Robbins and Monro, 1951] to the single-sided estimators, Eqs. (15) and (16) yield

∇_w Ĵ(w, β) = −(1/β) v J(w − βv)    (19)

∇_w Ĵ(w, β) = ∇_w J(w − βv)    (20)

where ν is a sample from the normal distribution given in Eq. (12).

Similarly, the double-sided stochastic estimators of Eqs. (17) and (18) are

∇_w Ĵ(w, β) = (1/(2β)) ( v_i J(w − βv_i) − v_j J(w + βv_j) )    (21)

∇_w Ĵ(w, β) = (1/2) [ ∇_w J(w − βv_i) + ∇_w J(w + βv_j) ]    (22)

where ν i and ν j are different sampling points.

Since ISE is an approximation (in the stochastic sense) of MSE and the gradient operator is a linear operator, the gradients of the smoothed cost function ε̂(w, β) of ε(t) can be estimated by

∇_w ε̂(w, β) = −(1/β) v ε(w − βv)    (23)

∇_w ε̂(w, β) = ∇_w ε(w − βv_j)    (24)

∇_w ε̂(w, β) = (1/(2β)) ( v_i ε(w − βv_i) − v_j ε(w + βv_j) )    (25)

and

∇_w ε̂(w, β) = (1/2) [ ∇_w ε(w − βv_i) + ∇_w ε(w + βv_j) ]    (26)

corresponding to Eqs. (19), (20), (21), and (22), respectively. Here we have ignored the discrete time index t for simplicity.

In order to better understand each estimator, some observations are now made. Eq. (23) says that the single-sided estimator of the smoothed gradient, using the stochastic smoothed functional principle and the stochastic approximation technique, is obtained by multiplying a random variable by the original cost function evaluated at the present value w perturbed randomly. Hence, we can conclude that this estimator will not work very well, since it does not provide any consistent information for searching the original performance surface.

The gradient estimation given by Eq. (25) is qualitatively different, since it is proportional to the difference of two values of the original cost function with different random perturbations, which can be viewed as a proper gradient estimate. Actually, this method is a numerical approach to obtaining the gradient of cost functions that are discontinuous to first order. Therefore, for case (a) the double-sided estimation should be much better than the single-sided estimation, so we will prefer the estimate given by Eq. (25) when the gradient information of the original cost function is not available.

When the estimate of the original gradient is available, as is normally the case, Eqs. (24) and (26) express the estimation using the single-sided and double-sided estimators, respectively. Note that in each case gradient values at points different from the operating point (w) are needed. This strains the computations, mainly for on-line algorithms, because several evaluations are needed with the same data sample. Since the estimate of Eq. (24) is the gradient of ε(w) at the point w perturbed by a random variable βv_j, it can be expressed as a Taylor series expansion about w,

∇_w ε(w − βv_j) = ∇_w ε(w) − βv_j ∂²ε(w)/∂w² + ((βv_j)²/2) ∂³ε(w)/∂w³ + …    (27)

Recall that v_j is nothing but the sampled noise value added to the original function for smoothing purposes and usually takes a very small value, so the higher-order terms in Eq. (27) can be ignored. Therefore, the gradient of the smoothed cost function ε̂(w, β) has the approximate form

∇_w ε̂(w, β) = ∇_w ε(w) − βv_j ∂²ε(w)/∂w²    (28)

Similarly, we obtain the corresponding result for the double-sided estimator of Eq. (26):

∇_w ε̂(w, β) ≈ 2∇_w ε(w) − β(v_j − v_i) ∂²ε(w)/∂w²    (29)

Since v_i and v_j are random variables with normal distributions, their difference, which is a new random variable, also obeys a normal distribution [Papoulis, 1991]. Hence, we may use a single Gaussian random variable to replace the difference of the two Gaussian random variables.

It is now clear that the smoothed gradient (Eqs. (28) and (29)) is estimated at the operating point w, which simplifies implementation of the learning procedure. Both the single-sided and double-sided estimates are combinations of gradients and second-order derivatives of the original performance surface, so they contain similar information. But when the double-sided estimator is used, the computational complexity is larger than that of the single-sided estimation. Hence, we prefer the single-sided estimation for on-line learning.

It is critical to observe that the gradient estimator (Eqs. (28) and (29)) from the stochastic smoothed functional method contains information from both the gradient and the second-order derivative of the original cost function. So estimating gradients with the smoothed functional method is a second order method that will increase the speed of convergence of the search algorithm with respect to pure gradient methods. The second-order term is a multiplication of the Hessian by a random variable with decreasing variance (β). So the speed-ups should be more apparent during the initial steps of learning when β is still large.

Since in most neural networks the gradient of the cost function is directly computable, we focus the analysis on Eqs. (28) and (29). Of course, the estimates given by Eqs. (23) and (25) are useful when the derivative of the cost function does not exist.
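To illustrate how the on-line single-sided estimator can be implemented, here is a minimal sketch of both the exact form of Eq. (24) (gradient evaluated at randomly perturbed weights) and the approximation of Eq. (28) (gradient plus a noise-scaled second-derivative term); the diagonal second-derivative routine and all names are assumptions for illustration.

```python
import numpy as np

def sfs_grad_exact(grad, w, beta, rng):
    """Eq. (24): single-sided on-line estimate of the smoothed gradient,
    obtained by evaluating the ordinary gradient at randomly perturbed weights."""
    v = rng.standard_normal(w.shape)      # one sample from the Gaussian kernel
    return grad(w - beta * v)

def sfs_grad_approx(grad, diag_hessian, w, beta, rng):
    """Eq. (28): first-order approximation of Eq. (24) evaluated at the
    operating point w, using a (diagonal) second-derivative estimate."""
    v = rng.standard_normal(w.shape)
    return grad(w) - beta * v * diag_hessian(w)
```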

IV. Asymptotic analysis of the behavior of smoothed functional strategy

In most learning systems, the goal is to design an iterative algorithm which usually has the form,

w(t+1) = w(t) + Δw(t)    (30)

where w is the adjusted variable and t is the iteration number. For convergent supervised learning systems,

Δw(t) = −η(t) g( ∂J(w)/∂w, ∂²J(w)/∂w², …, ∂ⁿJ(w)/∂wⁿ )    (31)

where g is a continuous function and η is the step size or learning rate.

The simplest case is to use only the first-order term, which is the gradient. Certainly, when higher-order terms are used, the convergence should be better, but more computation is needed. Therefore, the expansion is normally restricted to second-order terms. Another issue is how to design an algorithm that ensures learning can reach the global minimum, since the big drawback of gradient learning is entrapment in local minima. In the previous section we saw how the smoothed function optimization method provides an estimate of the gradient and the second derivative of the performance surface at the operating point. Here we will study the convergence properties of the iterative process that adapts the weights according to Eq. (31).

The smoothed function optimization using iterative search can be shown to converge with probability one (w.p.1) to the global minimum when the double-sided estimators given in Eqs. (17) and (18) are used, provided that we select the β parameter correctly and keep the step size bounded [Rubinstein, 1981, 1986]. Since the estimators given in Eqs. (21) and (22) are stochastic approximation versions of Eqs. (17) and (18), learning with Eqs. (25) and (26) also converges w.p.1 to the global minimum if the following two conditions are satisfied [Kushner and Clark, 1978],

∑_t η(t) = ∞    (32)

and

∑_t η²(t) < ∞    (33)

Then, we can conclude that with the double-sided estimation (Eqs. (25) or (26)), if the step size obeys Eqs. (32) and (33) and the β parameter is selected correctly, the search procedure converges to the global minimum w.p.1.

For the single-sided estimation, we consider only the case in which the original gradient is available, since when the gradient is not available (Eq. (15)) the smoothed gradient estimator is a random variable. Learning with that estimator is very much like simulated annealing if the step size is scheduled with a Boltzmann distribution. For the estimator given by Eq. (16), Katkovnik showed that the iterative search algorithm also converges to a global minimum for convex cost functions, under some mild conditions such as independent sampling from the kernel function and finite step size [Katkovnik et al., 1972]. But for nonconvex cost functions, which is normally the case in neural network optimization, no extensions are available in the literature. We will now provide a proof of convergence for smoothed function optimization with the single-sided estimator given in Eq. (24), based on the results of Kushner [Kushner, 1987].

Let us review the search procedure considered by Kushner [Kushner, 1987]:

w(t+1) = w(t) + η(t) b(w(t), ξ(t)) + η(t) ψ(t) γ(w(t))    (34)

where {ψ(t)} is i.i.d. Gaussian, {ξ(t)} is a bounded sequence of random variables, and

η(t) = A_0 / log(A_1 + t)    (35)

where A_0 and A_1 are two constants; and

E( b(w(t), ξ(t)) ) = b̄(w(t))    (36)

is the negative of the gradient of the cost function J(w). Both γ(.) and b(., ξ(t)) are Lipschitz continuous, uniformly in ξ(t), and γ(.) is bounded.

Such stochastic approximation algorithms are a Monte Carlo version of the annealing method for locating the global minimum of a function with many minima. For example, let J(.) denote a continuously differentiable function and set E( b(w(t), ξ(t)) ) = b̄(w(t)). Suppose that noise-corrupted samples of b̄(w(t)), namely b(w(t), ξ(t)), are available for a system whose mean performance is J(w). This corresponds to LMS (least mean square) learning in adaptive systems, since we always assume there is sampling noise for each sample. Then, the algorithm

w(t+1) = w(t) + η(t) b(w(t), ξ(t))    (37)

is a standard form of a stochastic approximation method for locating a local zero of b̄(.) or a local minimum of J(.) under appropriate conditions on {η(t)}, such as Eqs. (32) and (33). The ψ(t) γ(w(t)) term might be added artificially, following the usual logic of the annealing scheme, in order to force the sequence to jump around until it eventually settles near the global minimum of J(.).

For the above learning scheme, Kushner showed that, for large A_0, the iterative procedure converges to a global minimum [Kushner, 1987]. If the rate of decrease of {η(t)} is faster than O(1/log(t)), then w(t) will converge w.p.1 and not drift away from the stable set in which the algorithm is trapped. Therefore, selection of the annealing rate of the step size is the key point for convergence.

Now, let us take a look at an iterative algorithm based on the gradient estimator of Eq. (28). If we explicitly include the sampling noise in the original gradient estimate, Eq. (28) can be written

η(t) ∇_w ε̂(w(t), β) = η(t) ∇_w ε(w(t), ξ(t)) − η(t) β v_j(t) ∂²ε(w(t), ξ(t))/∂w²    (38)

Letting η(t)β = η_1(t) represent a new step size variable,

η(t) ∇_w ε̂(w(t), β) = η(t) ∇_w ε(w(t), ξ(t)) − η_1(t) v_j(t) ∂²ε(w(t), ξ(t))/∂w²    (39)

Assume that both step sizes η(t) and η_1(t) are annealed at the rate given in Eq. (35), and that the second-order derivatives are differentiable (which is valid for neural networks with the sigmoid function). Then ∇_w ε(w(t), ξ(t)) and ∂²ε(w(t), ξ(t))/∂w² are Lipschitz continuous, uniformly in ξ(t), and ∂²ε(w(t), ξ(t))/∂w² is bounded. Since the noise v_j(t) is assumed to have a Gaussian distribution, all terms in Eq. (39) satisfy the requirements for the convergence of Eq. (34). Comparing Eq. (34) and Eq. (39), we conclude that for large A_0 the iterative procedure given in Eq. (39) will converge to a global minimum.

V. Sensitivity analysis of the on-line SFS scheme

V.I. A reasonable approximation of smoothed function optimization

Since the Hessian matrix is not easy to compute, the analysis and strict implementation of the smoothed function optimization is not straightforward. A reasonable approximation is given below.


Let us consider the weight dynamics in a neural network. It is well known that the weight adjustment has the form

w(t+1) = w(t) − η ∂ε(w)/∂w(t)    (40)

Actually, Eq. (40) is an approximation of the following continuous dynamical equation [Matsuoka et al., 1994],

(1/η) dw(t)/dt = −∂ε(w)/∂w(t)    (41)

When the time constant 1/η of the dynamical system of Eq. (41) is much larger than the network's time constant, the dynamic equation can be rewritten as Eq. (40). Hence, the weight adaptation may be viewed as a slowly changing dynamical process. With this idea in mind, a suitable approximation of the second-order derivative can be found using the forward difference

∂²ε(w(t))/∂w² = ∂(∇_w ε(w(t)))/∂w = ∇_w ε(w(t)) − ∇_w ε(w(t−1))    (42)

Moreover, a reasonable approximation for the slow weight adaptation in successive time steps is

∇_w ε(w(t)) = ε(t) ∇_w ε(w(t−1))    (43)

where ε(t) is the gradient increment factor between successive iterations.

Therefore, both Eqs. (28) and (29) become

∇_w ε(w − βv_j) = ∇_w ε(w(t)) − βv_j (ε(t) − 1) ∇_w ε(w(t−1))    (44)

which leads to the following explicit formula for the weight adjustment,

Δw(t) = −η ∇_w ε(w(t)) − α v_j ∇_w ε(w(t−1))    (45)

where α is a new step size with annealing given by Eq. (35).
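A minimal sketch of the weight adjustment of Eq. (45), with the second step size annealed as in Eq. (35), might look as follows; the gradient arguments and constants are assumptions made only for illustration.

```python
import numpy as np

def sfs_update(w, grad_now, grad_prev, t, eta=0.1, A0=0.5, A1=2.0, rng=None):
    """Eq. (45): Delta w(t) = -eta*grad(w(t)) - alpha*v_j*grad(w(t-1)),
    with the second step size alpha annealed as A0 / log(A1 + t) (Eq. (35))."""
    rng = rng or np.random.default_rng()
    alpha = A0 / np.log(A1 + t)           # annealed step size
    v = rng.standard_normal(np.shape(w))  # Gaussian smoothing noise v_j
    return w - eta * grad_now - alpha * v * grad_prev
```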

V.II Sensitivity analysis of the smoothed function optimization

Sensitivity theory is a systematic study of how internal or external perturbations affect the system performance (e.g., system output, system parameters, or more elaborate performance indices) [Frank, 1978]. In neural networks the weights contain all the information learned from the input data, so we are primarily concerned with weight sensitivity. The weight sensitivity can be expressed as

w(t)_new = w(t)_old + Δw(t)    (46)

where the new weight is the old weight plus a sensitivity variable that translates the effect of the perturbation on a dependent variable. This formula bears an inescapable resemblance to the weight adaptation formula in iterative learning (Eq. (30)), in particular when the sensitivity under analysis is with respect to perturbations of the derivative of the cost function (i.e., the gradient). Since the stochastic smoothed functional method was shown to be a perturbation of the original gradient (Eq. (28)), an alternate way to understand this new estimator is to study the weight sensitivity to a perturbation of the gradient.

As an example, we will use the feedforward network given in Figure 1. The weight adaptation formulas for the original cost function ε(w(t)) are given by the classical BP algorithm with the noiseless desired signal [Hertz et al., 1991]:

ΔW_ij(t) = −η ∇_W ε(W(t)) = −η ( (d_i(t) − y_i(t)) y_i(t) [1 − y_i(t)] P_j(t) )    (47)

for the weights between the hidden layer and the output layer, and

ΔV_jk(t) = −η ∇_V ε(V(t)) = −η P_j(t) [1 − P_j(t)] ∑_i (d_i(t) − y_i(t)) y_i(t) [1 − y_i(t)] x_k    (48)

for the weights between the input layer and the hidden layer, where η is the step size.

With the new gradient given by Eq. (45), which corresponds to the smoothed error cost function ε̂(w(t)), Eqs. (47) and (48) become

ΔW_ij(t) = −η ∇_W ε(W(t)) − α v_j ∇_W ε(W(t−1))
         = −η ( (d_i(t) − y_i(t)) y_i(t) [1 − y_i(t)] P_j(t) ) − α v_j ( (d_i(t−1) − y_i(t−1)) y_i(t−1) [1 − y_i(t−1)] P_j(t−1) )    (49)

for the weights between the hidden layer and the output layer, and

ΔV_jk(t) = −η ∇_V ε(V(t)) − α v_j ∇_V ε(V(t−1))
         = −η P_j(t) [1 − P_j(t)] ∑_i (d_i(t) − y_i(t)) y_i(t) [1 − y_i(t)] x_k(t) − α v_j P_j(t−1) [1 − P_j(t−1)] ∑_i (d_i(t−1) − y_i(t−1)) y_i(t−1) [1 − y_i(t−1)] x_k(t−1)    (50)

The perturbations derived from using the smoothed gradient estimator are easily recognized as the second terms of Eqs. (49) and (50), and they may be expressed explicitly as

ΔW_p,ij(t) = −α v_j ( (d_i(t−1) − y_i(t−1)) y_i(t−1) [1 − y_i(t−1)] P_j(t−1) )    (51)

and

ΔV_p,jk(t) = −α v_j P_j(t−1) [1 − P_j(t−1)] ( ∑_i (d_i(t−1) − y_i(t−1)) y_i(t−1) [1 − y_i(t−1)] x_k(t−1) )    (52)

for the weights W(t) and V(t), respectively.

It is instructive to observe that these perturbations are proportional to the error propagated through the dual network. This means that the smoothed cost function criterion is translated internally into random perturbations to each weight. Moreover, the weights that have larger updates receive larger perturbations, which is exactly what is desirable to take the system out of local minima. A point that distinguishes this method from others (Darken and Moody, Hanson), where each perturbation must be controlled by the designer, is that this effect is computed automatically for us by the BP algorithm. Therefore, it is an ideal method to escape from local minima.

VI. On-line algorithms to implement the smoothed functional strategy (SFS)

In this section, we look at several possible on-line algorithms based on the theory of the smoothed functional. The straightforward way is to compute the first-order and second-order derivatives directly, and then use Eqs. (28) or (29) to estimate the gradient. But it is well known that the computation of the second-order Hessian matrix is highly complex, so some other approximations are needed.

(i) SFS momentum learning

Although the second-order derivative can be used directly in learning, a first-order approximation of the second-order derivative information can also be applied to train neural networks, as shown in section V. With this first-order approximation, momentum learning and conjugate gradient learning are two approaches employed frequently and, therefore, they will be studied separately in this subsection and the next.

Momentum learning [Rumelhart et al., 1986] has been verified by many researchers to accelerate learning. Its adjustment rule is

Δw(t) = −η ∇_w ε(w(t)) + η_1 Δw(t−1)    (53)

where η and η_1 are the step sizes for the gradient and momentum terms, respectively. Both step sizes are normally held constant during training.

Since

Δw(t−1) = −η ∇_w ε(w(t−1)) + η_1 Δw(t−2)    (54)

it is straightforward to see, using the approximation of Eq. (43), that

Δw(t−1) ≈ −η_2 ∇_w ε(w(t−1))    (55)

where η_2 is a different step size. Then, Eq. (53) can be rewritten as

Δw(t) = −η ∇_w ε(w(t)) − α ∇_w ε(w(t−1))    (56)

A comparison of the adaptation rule given by Eq. (45) for the smoothed functional scheme and the adaptation rule given by Eq. (56) for momentum learning shows that the first terms on the right hand side of both equations are the same, but the second terms are different: in Eq. (45) the second term is multiplied by a random variable sampled from a Gaussian distribution, and its step size is annealed based on Eq. (35). With this observation, we propose to modify momentum learning based on the theory of stochastic functional smoothing as follows,


Δw(t) = −η ∇_w ε(w(t)) + η_1 v_j Δw(t−1)    (57)

The step size η_1 should be annealed by Eq. (35). Notice that the sign of the second term of Eq. (57) is not important, since the noise v_j is a zero-mean Gaussian random variable. It should be pointed out that global convergence can be achieved with Eq. (57) if the step sizes are properly scheduled (see section IV).
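A minimal sketch of the SFS momentum rule of Eq. (57) is given below; the function name, the gradient routine, and the schedule constants are assumptions for illustration.

```python
import numpy as np

def sfs_momentum_step(w, delta_prev, grad, t, eta=0.1, A0=0.8, A1=2.0, rng=None):
    """Eq. (57): Delta w(t) = -eta*grad(w(t)) + eta1*v_j*Delta w(t-1),
    with eta1 annealed as eta1(t) = A0 / log(A1 + t) (Eq. (35))."""
    rng = rng or np.random.default_rng()
    eta1 = A0 / np.log(A1 + t)                      # annealed momentum step size
    v = rng.standard_normal(np.shape(w))            # Gaussian smoothing noise v_j
    delta = -eta * grad(w) + eta1 * v * delta_prev  # Eq. (57)
    return w + delta, delta
```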

(ii) SFS Conjugate gradient algorithm

It has been recognized that convergence of the conjugate gradient approach is superior to the original gradient descent (LMS) [Hestenes, 1980]. The conjugate gradient approach can be regarded as lying between the method of gradient descent and Newton’s method (in which the Hessian must be computed) in terms of complexity and convergence properties [Boray et al., 1992].

In neural networks, one scheme of implementing the conjugate gradient algorithm is [Hassoun, 1995]

Δw(t) = −η ∇_w ε(w(t)) + ( [∇_w ε(w(t)) − ∇_w ε(w(t−1))]^T ∇_w ε(w(t)) / ‖∇_w ε(w(t−1))‖² ) Δw(t−1)    (58)

where the term multiplying ∆ w ( t − 1 ) is called adaptive momentum which is used to adjust the search direction.

By analogy with Eq. (57), we propose, based on the theory of stochastic functional smoothing, a modified conjugate gradient algorithm as


Δw(t) = −η ∇_w ε(w(t)) + ( [∇_w ε(w(t)) − ∇_w ε(w(t−1))]^T ∇_w ε(w(t)) / ‖∇_w ε(w(t−1))‖² ) v_j Δw(t−1)    (59)

where the step size η could be annealed by Eq. (35); the second term needs no extra annealed step size, since the adaptive momentum factor

[∇_w ε(w(t)) − ∇_w ε(w(t−1))]^T ∇_w ε(w(t)) / ‖∇_w ε(w(t−1))‖²

is already decreasing at an exponential rate in most cases [Hassoun, 1995]. Of course, an annealed step size could also be applied to it in some special situations. The convergence of this rule to the global minimum can also be proved, based on the results presented in section IV.
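The following is a minimal sketch of the modified conjugate gradient rule of Eq. (59); the names, the scalar treatment of the noise sample, and the small constant added to the denominator (to avoid division by zero) are assumptions for illustration.

```python
import numpy as np

def sfs_conjugate_step(w, delta_prev, g_now, g_prev, eta=0.1, rng=None, eps=1e-12):
    """Eq. (59): conjugate-gradient-style update whose adaptive momentum factor
    is multiplied by a Gaussian sample v_j, following the SFS idea."""
    rng = rng or np.random.default_rng()
    beta = np.dot(g_now - g_prev, g_now) / (np.dot(g_prev, g_prev) + eps)  # adaptive momentum
    v = rng.standard_normal()                                              # Gaussian sample v_j
    delta = -eta * g_now + beta * v * delta_prev                           # Eq. (59)
    return w + delta, delta
```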

(iii) SFS second-order learning rule

One of the efficient second-order learning rules, proposed by Le Cun [Le Cun et al., 1990] and applied in the optimal brain damage (OBD) strategy, has been shown to have a fast learning rate, but its global convergence is not guaranteed. We will propose, based on this second-order learning rule and SFS, a modified version of OBD which can reach the global minimum.

First of all, let us clarify how to efficiently estimate the second-order derivative in Eq. (28) for neural networks. Rewriting Eq. (3) into

ε(w) = ε(w_o) + ∑_{i,j} (∂ε(w)/∂w_ij)|_{w_o} w_ij + (1/2) ∑_{i,j} (∂²ε(w)/∂w_ij²)|_{w_o} w_ij² + (1/2) ∑_{i,j,i′,j′} (∂²ε(w)/∂w_ij ∂w_i′j′)|_{w_o} w_ij w_i′j′ + O(w³)    (60)

The second-order derivatives may be separated into two sums: one including the second-order derivatives with respect to different pairs of weights w_ij and w_i′j′ (the third term in Eq. (60)), and the other including the second-order derivatives with respect to only one weight (the second term in Eq. (60)). This clearly shows that the Hessian is not a local-in-space operator and that its computational complexity is high. Fortunately, a reasonable technique which employs the 'diagonal' terms only (the second term in Eq. (60)) was proposed to train neural networks [Le Cun et al., 1990]. That is,

ε(w) = ε(w_o) + ∑_{i,j} (∂ε(w)/∂w_ij)|_{w_o} w_ij + (1/2) ∑_{i,j} (∂²ε(w)/∂w_ij²)|_{w_o} w_ij²    (61)

The corresponding on-line learning rule can be expressed as

Δw_ij = η(t) ∂ε(w)/∂w_ij + η_1(t) ∑_{i,j} ∂²ε(w)/∂w_ij²    (62)

The diagonal terms of the Hessian of the MLP can be obtained by [Le Cun et al., 1990]

∂²ε(w)/∂w_ij² = (∂²ε(w)/∂a_i²) x_j²    (63)

where x_i = f(a_i) is the activation of unit i and a_i = ∑_j w_ij x_j is the total input to the neuron, and

∂²ε(w)/∂a_i² = f′(a_i)² ∑_l w_li² ∂²ε(w)/∂a_l² − f″(a_i) ∂ε(w)/∂x_i    (64)

for the non-output layers, where the index l runs over the units of the following layer;

∂²ε(w)/∂a_i² = 2 f′(a_i)² − 2 (d_i − x_i) f″(a_i)    (65)

for all units in the output layer. Here d_i is the corresponding desired signal.

Combining SFS with the OBD second-order learning rule, a modified learning rule which can reach the global minimum takes the form

Δw_ij = η ∂ε(w)/∂w_ij + η_1 v_j ∑_{i,j} ∂²ε(w)/∂w_ij²    (66)

The step size η 1 should be annealed by Eq. (35).
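A minimal sketch of an SFS OBD style update for a single output layer is given below, using the diagonal Hessian terms of Eqs. (63) and (65) for the logistic nonlinearity; the descent form of the update, the treatment of the noise term, and all names are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def sfs_obd_step(W, x, d, t, eta=0.1, A0=0.3, A1=2.0, rng=None):
    """SFS OBD style update in the spirit of Eq. (66) for an output layer
    y = f(W x) with logistic f, using the diagonal Hessian of Eqs. (63), (65)."""
    rng = rng or np.random.default_rng()
    a = W @ x
    y = logistic(a)
    fp = y * (1.0 - y)                              # f'(a), Eq. (5)
    fpp = fp * (1.0 - 2.0 * y)                      # f''(a) for the logistic
    grad = -np.outer((d - y) * fp, x)               # first-order term, d(eps)/dW
    d2_da2 = 2.0 * fp**2 - 2.0 * (d - y) * fpp      # Eq. (65)
    diag_hess = np.outer(d2_da2, x**2)              # Eq. (63)
    eta1 = A0 / np.log(A1 + t)                      # annealed step size, Eq. (35)
    v = rng.standard_normal(W.shape)                # Gaussian smoothing noise
    return W - eta * grad - eta1 * v * diag_hess    # gradient step plus noisy Hessian term
```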

VII. Simulation results

We will show with an example that the algorithms proposed in section VI can find the global minimum. The problem is the parity problem given by Rumelhart et al. [Rumelhart et al., 1986], which can be stated as follows: assuming the input and desired patterns are binary strings, the desired signal is 1 when the input pattern contains an odd number of 1s, and 0 otherwise. This is a very difficult problem because the most similar patterns require different answers. The XOR problem is a parity problem with patterns of size two. As a harder example, 3-bit input patterns are considered here, and it has been shown that learning for this problem encounters local minima and the convergence rate is very slow for some initial conditions [Rumelhart et al., 1986].

Since the smallest network that solves this three-bit parity problem is 3-3-1, that is, three input units, three hidden neurons, and one output neuron [Rumelhart et al., 1986], we use this network topology in all experiments. The nonlinearity is the logistic function given by Eq. (4).

The noise is sampled from a zero-mean Gaussian source with a variance chosen for each experiment.

Annealing of the step size will be based on the following rule, which is an approximation of the ideal decay given by Eq. (35):

η = η_o / (1 + N_I / c)    (67)

where η o is the initial step size, c is the search time constant, and N I is the iteration number. The search time constant is selected for each simulation.

It can be shown that the MSE for the global solution of the three-bit parity problem is zero.
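For reference, a minimal sketch of the experimental setup follows: the 3-bit parity training set and the step-size decay of Eq. (67). The specific constants are the kind a user would choose per experiment, not values taken from the paper.

```python
import itertools
import numpy as np

def parity_dataset(n_bits=3):
    """All 2**n binary input patterns with the odd-parity target (1 if an odd
    number of 1s, else 0); for n_bits=3 this is the problem used in section VII."""
    X = np.array(list(itertools.product([0, 1], repeat=n_bits)), dtype=float)
    d = X.sum(axis=1) % 2
    return X, d

def annealed_step_size(eta0, n_iter, c):
    """Eq. (67): eta = eta0 / (1 + N_I / c), an approximation of Eq. (35)."""
    return eta0 / (1.0 + n_iter / c)

X, d = parity_dataset(3)                          # 8 patterns; 3-3-1 is the smallest MLP that solves it
eta = annealed_step_size(0.7, n_iter=1000, c=500)
```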

The convergence of the SFS algorithms in the mean was established in section IV. The purpose of the simulations is to show that the algorithms we propose have (1) a much better ability to reach the global minimum than other algorithms using second-order information, and (2) a faster convergence rate than traditional stochastic optimization approaches, of which the search-then-convergence scheme proposed by Darken and Moody is an example. So, in subsections VII.1, VII.2, and VII.3 we concentrate on the global search abilities of the SFS algorithms; in subsection VII.4 we compare the convergence rate of the SFS algorithms with the search-then-convergence scheme, since both strategies have been shown to converge to the global minimum.

VII.1 Training with SFS momentum learning

The adjustment rules of Eqs. (53) and (57) are used here to train the 3-3-1 MLP. For different step sizes η and η 1 , learning curves are shown below.


Figure 2. Learning curves using the momentum-type learning rules (solid: straight momentum rule; dotted: momentum with annealed step size; dashed: SFS momentum rule).

In Figure 2, the solid line is the learning curve using the momentum learning rule given by Eq. (53) without annealing of the step size, where η = 0.2, η_1 = 0.8, and the final MSE is 0.8122; the dotted line is also obtained with the rule of Eq. (53) but with the step size annealed according to Eq. (67) (c = 500), where η_0 = 0.7, η_10 = 0.8, and the final MSE is 0.8085; the dashed line is the learning curve using the SFS momentum learning rule given by Eq. (57), where η_0 = 0.7, η_10 = 0.8, the noise variance is σ² = 0.01, c = 500, and the final MSE is 5.0449e-08. The same initial weights were used for all three learning curves.

Let us try different parameters and see what happens.


5000

Figure 3 The learning curves using the momentum type algorithms The parameters used for the learning curves in Figure 3 are η = 0.7 and η 1 = 0.2 for the momentum learning without decay in step size; η 0 = 0.4 ,

η 10 = 0.4 , and the constant

c=700 for momentum rule with annealed step size; η 0 = 0.9 ,

η 10 = 0.6 , and the constant

c=700 for SFS momentum rule. The corresponding MSEs are 1.6449, 0.7857, and 2.310e-05, respectively.

Obviously, the global minimum is reached consistently with the modified momentum adaptation rule.

The original momentum learning and momentum learning with decay can also reach the global minimum for some combinations of initial weights and step sizes. With the same parameters, including the initial weights, the SFS algorithm also finds the global minimum, and with a consistent convergence rate, as shown in Figure 4.


Figure 4. Learning curves using the momentum-type learning rules (solid: straight momentum rule; dotted: momentum with annealed step size; dashed: SFS momentum rule).

Trying several runs with different initial weights, different step sizes, and different values of c leads to a similar conclusion: the SFS algorithm is much more robust in global search than the original momentum rule and the momentum rule with decay.

VII.2 Learning with conjugate gradient

Generally speaking, learning with the conjugate gradient rules given by Eqs. (58) and (59) leads to conclusions similar to those for the momentum rules. Here we show only one set of results, to indicate that the modified conjugate gradient rule is the best one.

Figure 5 is obtained with the conjugate gradient learning rules given by Eqs. (58) and (59). The solid line is the learning curve of the straight conjugate gradient algorithm with η = 0.7 and η_1 = 0.1, and the final MSE is 2.1318; the dotted line is obtained with the conjugate gradient algorithm with annealed step size, where η = 0.7, η_1 = 0.2, c = 500, and the final MSE is 0.7144; the dashed line is the result of the SFS conjugate gradient algorithm with η = 0.7, η_1 = 0.8, and c = 500. It can be seen from Figure 5 that the SFS conjugate gradient algorithm achieves the best result.


Figure 5. Learning curves using the conjugate gradient type algorithms (solid: straight conjugate gradient rule; dotted: conjugate gradient with annealed step size; dashed: SFS conjugate gradient rule).

VII.3 Learning with the OBD type rules

The SFS OBD and the OBD learning rules given by Eqs. (62) and (66) are studied here. Learning curves for the 3-bit parity problem with the straight OBD, the SFS OBD and the OBD with annealed step size are given in Figure 6.

Figure 6. Learning curves of the OBD-type learning rules (solid: OBD with annealed step size; dotted: straight OBD rule; dashed: SFS OBD rule).

All the learning curves of Figure 6 are obtained with the same initial weights. The parameters are η = 0.1 and η_1 = 0.1 for the OBD with annealed step size; η = 0.1, η_1 = 0.3, and σ² = 0.01 for the SFS OBD; and η = 0.1, η_1 = 0.02 for the straight OBD rule. The annealing of the step size η_1 is based on Eq. (67) with c = 200, and η is fixed for all learning procedures. The final MSEs are 2.2580, 1.0129e-06, and 10.213 for the OBD with annealed step size, the SFS OBD, and the straight OBD, respectively.

With another set of initial weights, the corresponding learning curves are presented in Figure 7, where η = 0.1, η_1 = 0.05; η = 0.2, η_1 = 0.2; and η = 0.06, η_1 = 0.05, for the OBD with annealed step size, the SFS OBD, and the straight OBD, respectively. The corresponding final MSEs are 2.226, 3.3633e-07, and 3.3892, respectively.

Figure 7. Learning curves of the OBD-type learning rules for a different set of initial weights (solid: OBD with annealed step size; dotted: straight OBD rule; dashed: SFS OBD rule).

Clearly, the SFS OBD rule reaches the global minimum very quickly, and the other two algorithms are trapped in local minima.

Running this experiment several times with the same initial weights for all the algorithms, the SFS OBD algorithm always reaches the global minimum with a properly selected step size η_1, whereas the straight OBD and the OBD with decaying step size seldom reach the global minimum.


Based on the results of the extensive simulation experiments in subsections VII.1-VII.3, we can state that the SFS algorithms have a much better capability to reach the global minimum.

VII.4 Comparing with search-then-convergence scheme

The search-then-convergence scheme, proposed by Darken and Moody, adds Gaussian noise to each weight adjustment and anneals the step size of the noise term at the same time. It has been shown that this scheme can reach the global minimum and learns faster than LMS learning. In this subsection we show, using the SFS OBD rule as an example, that an algorithm based on SFS can reach the global minimum faster than search-then-convergence.

First, let us look at one result from the search-then-convergence scheme, depicted as the solid line in Figure 8, where η = 0.1, η_1 = 0.2, and the final MSE is 1.2281e-05.

Figure 8. Learning curves of the search-then-convergence scheme (solid) and the SFS OBD rule (dashed).

With the same initial weights, the learning curve of the SFS OBD rule is also shown in Figure 8 as the dashed line, where η = 0.2, η_1 = 0.2, and the final MSE is 1.797e-07.


Another experimental result is given in Figure 9 for a different set of initial weights.

Figure 9. Learning curves of the search-then-convergence scheme (solid) and the SFS OBD rule (dashed) for a different set of initial weights.

Thus, based on the results given above, SFS OBD learning is able to learn faster than the search-then-convergence scheme, although both algorithms display global convergence.

Summarizing the results obtained in section VII, we can say that the modified algorithms find the global minimum where other second-order algorithms fail, and that they speed up learning compared with the search-then-convergence scheme. The key issue for these SFS-based learning algorithms is how to find the optimal parameters, such as the step sizes, the annealing rate, and the variance of the noise. Since all these parameters are problem dependent, it is not easy to give a theoretical answer; LMS and other adaptive algorithms have similar problems.

VIII. Conclusions

We proposed several adaptive global search algorithms based on the smoothed functional optimization strategy. Not only did we provide a proof of convergence of these algorithms, but computer simulations were given to verify their correctness. We pointed out that the adaptive smoothed functional method has some advantages over stochastic optimization (Eq. (2)) and second-order optimization approaches. Additionally, using sensitivity analysis, we showed that these algorithms are very suitable for gradient-based learning in neural networks, since the error backpropagation procedure is a linear transform.

References

Baritompa, W., "Accelerations for global optimization covering methods using second derivatives," Journal of Global Optimization, 4, 329-341, 1994.

Bazaraa, M., Sherali, H. D., and Shetty, C. M., Nonlinear Programming: theory and algorithms, John Wiley & Sons, Inc., New York, 1993.

Becker, S., and Le Cun,Y., “Improving the convergence of back-propagation learning with second order methods, Proc. of the 1988 connectionist models summer school, pp. 29-37, Morgan Kaufmann, San Mateo, CA. 1989.

Boray, G. K., and Srinath, M. D., “Conjugate gradient techniques for adaptive filtering,” IEEE trans. on Circuits and Systems--I: Fundamental theory and applications, Vol. 39, No. 1, January, 1992.

Darken, C., and Moody, J., “Towards Faster Stochastic Gradient Search,” NIPs 4, 1991.

Darken, C., Chang, J., and Moody, J., “Learning Rate Schedules for Faster Stochastic Gradient Search,” IEEE Neural Networks for Signal Processing, 1992.

Frank, P. M., introduction to system sensitivity theory, Academic Press, New York, 1978.

Hanson, S. J., “A stochastic version of the delta rule,” Physica D 42, 265-272, 1990.


Hassoun M., Fundamentals of Artificial Neural Networks, MIT Press, 1995.

Haykin, S, Neural Networks---A Comprehensive Foundation, Macmillan College Publishing Company, New York, 1994.

Hertz, J., Krogh, A., and Palmer, R. G., Introduction to the Theory of Neural Computation, Addison-Wesley, 1991.

Hestenes, M., Conjugate direction methods in optimization, Springer-Verlag, New York, Heidelberg, Berlin, 1980.

Holland, J. H., Adaptation in natural and artificial systems, the University of Michigan Press, Ann Arbor, Mich., 1975.

Katkovnik, V. Ya., and Yu. Kulchitsky, “Convergence of a class of random search algorithms,” Automat. Remote Control 1972, No. 8, 1321-1326.

Kirkpatrick, S., et al., "Optimization by simulated annealing," Science 220, 671-680, 1983.

Kushner, H. J., “Asymptotic global behavior for stochastic approximation and diffusions with slowly decreasing noise effects: Global minimization via Monte Carlo,” SIAM J. Appl. Math., 1987.

Kushner, H, and Clark, D. S., Stochastic Approximation Methods for Constrained and Unconstrained Systems, Springer-Verlag, New York, 1978.

Le Cun,Y., Denker, J. S., and Solla, S. A., “Optimal brain damage,” NIPs 2, p. 598-605, Morgan Kaufmann, San Mateo, CA. 1990.


Le Cun,Y., Kanter, I., and Solla, S. A., “Second order properties of error surfaces: Learning time and generalization,” NIPs 3, p. 918-924, Morgan Kaufmann, San Mateo, CA. 1991.

Le Cun,Y., Simard, P. Y., and Pearlmutter, B. A., “Automatic learning rate maximization by online estimation of Hessian’s eigenvectors,” NIPs 5, pp. 156-163, Morgan Kaufmann, San Mateo, CA. 1993.

Matsuoka, K., and Kawamoto, M., “A neural network that self-organizes to perform three operations related to principal component analysis,” Neural Networks, Vol. 7, No. 5, pp. 753-765, 1994.

Moller, M. F., “A scaled conjugate gradient algorithm for fast supervised learning,” Neural Networks, Vol. 6, p. 525-533, 1993.

Papoulis, A., Probability, random variables, and stochastic processes, New York, McGraw-Hill, 1993.

Pearlmutter, B. A., “Fast exact multiplication by the Hessian,” Neural Computation 6, 147-160, 1993.

Schoen, F., “Stochastic techniques for global optimization: A survey of recent advances,” Journal of Global Optimization, 1, 207-228, 1991.

van der Smagt, P. P., “Minimisation methods for training feedforward neural networks,” Neural Networks, Vol. 7, No. 1, p.1-11, 1994.

Styblinski, M.A., and Tang, T.-S, “Experiments in Nonconvex Optimization: Stochastic Approximation with Function Smoothing and Simulated Annealing,” Neural Networks, Vol.3, 1990.


Robbins, H., and Monro, S., “A stochastic approximation method,” Ann. Mat. Stat. 22, 400-407, 1951.

Rubinstein, R., Simulation and the Monte Carlo Method, Wiley,1981.

Rubinstein, R., Monte Carlo Optimization, Simulation and Sensitivity of the Queueing Networks, Wiley, 1986.

Rumelhart et al, Parallel Distributed Processing, Vol.1, MIT Press, 1986.

Wang, C., and Principe, J. C., “Training neural networks with additive noise in the desired signal,” Submitted to Neural Computation, Feb. 1995.

Widrow, B., and Hoff, M., “Adaptive switching circuits,” IRE WESCON Convention Record, pp. 96-104, 1960.
