AN EFFICIENT ALGORITHM FOR FEED-FORWARD NEURAL NETWORK REGRESSION ANALYSIS

Shin-ichi Mayekawa*

The feed-forward neural network model can be considered a very powerful nonlinear regression analytic tool. However, the existing popular back-propagation algorithm using the steepest descent method is very slow to converge, prohibiting the everyday use of the neural network regression model. In this regard, a fast converging algorithm for the estimation of the weights in feed-forward neural network models was developed using the alternating least squares (ALS) or conditional Gauss-Newton method. In essence, the algorithm alternates the minimization of the residual sums of squares (RSS) with respect to the weights in each layer until the reduction of RSS is negligible. With this approach, neither the calculation of a complex second derivative matrix nor the inversion of a large matrix is necessary. In order to avoid the inflation of the weight values, a ridge method and a quasi-Bayesian method were also investigated. The methods were evaluated using several problems and found to be very fast compared to the steepest descent method. With a fast converging algorithm at hand, it is hoped that the statistical nature of the neural network model as a nonlinear regression analysis model will be clearly revealed.

Key Words and Phrases: Neural Network Training, Feed-Forward Neural Network, Bayesian Method, Gauss-Newton Method, Nonlinear Regression Analysis

* Research Division, The National Center for University Entrance Examinations, 2-19-23, Komaba, Meguro-ku, Tokyo, 153 Japan.

1. Introduction

Since the invention/rediscovery of the error back-propagation method by Rumelhart et al. (1986), neural networks, especially feed-forward networks, alias multilayer perceptrons, have been widely used in a variety of areas. In essence, what they do is a nonlinear approximation to the relationship between two sets of multivariate variables, x and y. In applications of neural networks, we "teach" or "train" the network to learn the existing relationship using "training" data. That is, the network modifies itself until it "learns" the relationship. Error back-propagation is a numerical algorithm that enables this learning process.

Traditionally, the relationships among variables have been studied by statistical methods such as regression analysis or discriminant analysis. In traditional statistics, we first construct a parametric model of the phenomenon of interest and test it in light of the data at hand. The key to this approach is the assumed probability distribution of the error term in the model. The parameters of the model are estimated so that the likelihood of the model is maximized.

From a numerical analytic point of view, the learning process of a neural network can be described as the minimization of the residual sums of squares between the data and the output of the network with respect to the parameters of the network, namely, the weights and the offset or bias values. Therefore, with an appropriate probabilistic assumption for the residual terms, we may treat the neural network as a statistical model. For example, this approach enables us to test whether a particular neural network model fits the data on the basis of the likelihood, or to construct a prediction interval for the output.

For the feasibility of this statistical approach to neural network models, statisticians must have an opportunity to use the network models as one of their tools. Unfortunately, the existing popular learning algorithm is not satisfactory for their everyday use, because it is very slow to converge compared even to the most complex statistical models such as linear structural equation models or finite mixture models. Although there have been several attempts to improve the learning algorithm (Bishop, 1992, and Jacobs, 1988, for example), most users of the network models still use the original back-propagation algorithm, which might take hours to converge on small computers.

The purpose of this article is to develop an efficient numerical method which enables us to use neural network models easily. Given such a method, it is hoped that the application of neural networks to problems traditionally analyzed using statistical methods becomes more popular, as does research such as Ripley's (1993), revealing the statistical aspects of neural network models more clearly.

2. The feed-forward neural network model

Let (x_a, y_a), a = 1, 2, ..., N, be the paired observations, where x and y, respectively, refer to the 1 × p regressor vector and the 1 × m criterion vector. Collectively, they can be arranged in the following data matrices:

$$X = [x_1', x_2', \ldots, x_a', \ldots, x_N']' \qquad (1)$$

and

$$Y = [y_1', y_2', \ldots, y_a', \ldots, y_N']'. \qquad (2)$$

(In this article, we denote the i-th row of a matrix, say X, by x_i, and the j-th column by x_{(j)}.) The purpose of neural network regression is to explain the criterion matrix Y by the regressor matrix X using the feed-forward neural network model.

The n-layer feed-forward neural network model, or multi-layer perceptron, can be described as follows. Let n_k, k = 1, 2, ..., n, be the number of neurons per layer. In this article, we use the convention that the data matrix, X, is referred to as the output of the first layer, and that the model prediction, Ŷ, of the Y matrix is referred to as the output of the last layer. Therefore, n_1 = p and n_n = m. An observation, x_a, when presented to a neural network, generates an input and an output of all the neurons. The input and the output of the j-th neuron of the k-th layer are denoted, respectively, as i^k_{aj} and o^k_{aj}. Given an input, a neuron is "activated" and produces an output as

$$o^k_{aj} = \phi(i^k_{aj}) \qquad (3)$$

where φ(·) is called the activation or excitation function. In this article we use the logistic function

$$\phi(i^k_{aj}) = \frac{1}{1 + \exp(-i^k_{aj})} \qquad (4)$$

as the activation function.

The relationship between the output of the (k-1)st layer and the input of the k-th layer is as follows. Let the collection of the inputs and the outputs of all the neurons in the k-th layer be, respectively,

$$i^k_a = [i^k_{a1}, i^k_{a2}, \ldots, i^k_{aj}, \ldots, i^k_{a n_k}] \qquad (5)$$

and

$$o^k_a = [o^k_{a1}, o^k_{a2}, \ldots, o^k_{aj}, \ldots, o^k_{a n_k}]. \qquad (6)$$

The inputs of the k-th layer neurons are linearly related to the outputs of the (k-1)st layer neurons as

$$i^k_{aj} = \sum_{i=1}^{n_{k-1}} o^{k-1}_{ai} w^k_{ij} + \theta^k_j \qquad (7)$$

where w^k_{ij} is the weight that connects the i-th neuron of the (k-1)st layer to the j-th neuron of the k-th layer and θ^k_j is the offset value for the j-th neuron of the k-th layer. Therefore, for observation x_a, the n_k neurons of the k-th layer receive the following inputs

$$i^k_a = o^{k-1}_a W^k + \theta^k \qquad (8)$$

where W^k is the n_{k-1} × n_k weight matrix and θ^k is the 1 × n_k offset vector. Using the N × n_k matrices I^k and O^k, whose rows consist, respectively, of i^k_a and o^k_a, the above relationship can be expressed as

$$O^k = \Psi(I^k), \qquad k = 2, 3, \ldots, n \qquad (9)$$

and

$$I^k = O^{k-1} W^k + 1_N \theta^k, \qquad k = 2, 3, \ldots, n \qquad (10)$$

where Ψ(·) denotes the elementwise logistic function and 1_N is the N × 1 unit vector. By further denoting the recursive function which maps O^{k-1} to O^k by Γ^k, namely,

$$O^k = \Psi(I^k) = \Psi(O^{k-1} W^k + 1_N \theta^k) = \Gamma^k(O^{k-1}), \qquad (11)$$

the output of the last layer can be expressed as

$$O^n = \Gamma^n(O^{n-1}) = \Gamma^n(\Gamma^{n-1}(O^{n-2})) = \Gamma^n \circ \Gamma^{n-1} \circ \cdots \circ \Gamma^2(O^1) = \Gamma(O^1) \qquad (12)$$

where

$$\Gamma(\cdot) = \Gamma^n \circ \Gamma^{n-1} \circ \cdots \circ \Gamma^2(\cdot). \qquad (13)$$

Therefore, by writing

$$X = O^1 \quad \text{and} \quad \hat{Y} = O^n, \qquad (14)$$

the n-layer feed-forward neural network's prediction of the criterion variables is given by

$$\hat{Y} = \Gamma(X). \qquad (15)$$
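As an illustration, the forward pass of equations (8) through (12) and the prediction of equation (15) can be sketched in a few lines of code. This is a minimal sketch assuming numpy is available; the function and variable names (forward, weights, offsets) are ours, not the author's.

```python
import numpy as np

def logistic(I):
    # Elementwise logistic activation, equation (4).
    return 1.0 / (1.0 + np.exp(-I))

def forward(X, weights, offsets):
    """Forward pass of the n-layer network, equations (8)-(12).

    weights[k] is the n_{k-1} x n_k matrix W^k and offsets[k] the 1 x n_k
    vector theta^k, for k = 2, ..., n (stored here as lists indexed from 0).
    Returns the outputs O^k of every layer; the last entry is the
    prediction Y-hat of equation (15).
    """
    outputs = [X]                        # O^1 = X, equation (14)
    for W, theta in zip(weights, offsets):
        I = outputs[-1] @ W + theta      # I^k = O^{k-1} W^k + 1_N theta^k, eq. (8)/(10)
        outputs.append(logistic(I))      # O^k = Psi(I^k), equation (9)
    return outputs

# Example use with arbitrary sizes (p = 3 regressors, one hidden layer of 4 neurons, m = 2 criteria).
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
weights = [rng.normal(size=(3, 4)), rng.normal(size=(4, 2))]
offsets = [np.zeros((1, 4)), np.zeros((1, 2))]
Y_hat = forward(X, weights, offsets)[-1]
```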

Using the above derivation, the neural network model can be expressed in the following form of the usual regression analysis:

$$Y = \Gamma(X) + E \qquad (16)$$

where E is the N × m residual matrix. The parameters of the neural network regression model shown above are {W^k, θ^k}, k = 2, 3, ..., n, and the dispersion matrix of the residual term. We restrict ourselves to the case where the error dispersion matrix is σ²I. In standard regression analysis, it is customary to estimate the parameters by the least squares method of minimizing the residual sums of squares

$$RSS = \| Y - \hat{Y} \|^2. \qquad (17)$$

However, contrary to the usual multivariate linear regression model, the neural network regression model is not linear in terms of the parameters. Therefore, in order to estimate the parameters, some numerical method must be used. Given the current value of the parameters, ξ^(t), standard numerical iterative methods update the parameters as

$$\xi^{(t+1)} = \xi^{(t)} + d^{(t)} \qquad (18)$$

where d^(t) is the direction vector calculated in various ways. In the usual application of the neural network model, the direction vector is chosen by the method of steepest descent. In the next section, we review the usual back-propagation algorithm.

3. Estimation of the Parameters by the Steepest Descent Method

Let us denote the row-vectorized version of the weight matrix and the offset values of the k-th layer by

$$\xi^k = (w^k_1, w^k_2, \ldots, w^k_{n_{k-1}}, \theta^k), \qquad k = 2, 3, \ldots, n \qquad (19)$$

where w^k_i denotes the i-th row of the W^k matrix. Note that ξ^k is 1 × q_k, where q_k, k = 2, 3, ..., n, is the number of parameters in the k-th layer. Collectively, all the parameters can be written as the 1 × q vector

$$\xi = (\xi^2, \xi^3, \ldots, \xi^n) \qquad (20)$$

where q = Σ_{k=2}^{n} q_k. The method of steepest descent uses the negative gradient vector as the direction vector

$$g = (g^2, g^3, \ldots, g^n) \qquad (21)$$

where

$$g^k = -\left. \frac{\partial RSS}{\partial \xi^k} \right|_{\xi = \xi^{(t)}}. \qquad (22)$$

In this article, the partial derivatives of a scalar, s, with respect to each element of a 1 × p vector, v, are denoted by the 1 × p vector ∂s/∂v, which is arranged in the same order as the original vector. Similarly, the partial derivatives of a 1 × q or q × 1 vector, u, with respect to each element of v are denoted by the q × p matrix ∂u/∂v.

Denoting the i-th column of the O^k and Ŷ matrices as o^k_{(i)} and ŷ_{(i)}, respectively, the gradient vectors can be calculated as follows. The gradient vector g^k consists of the derivatives of RSS with respect to w^k_i and θ^k. Using the Hadamard product (⊙), they can be expressed as:

$$-\frac{1}{2} \frac{\partial RSS}{\partial w^k_i} = 1_N' \left( (o^{k-1}_{(i)} 1'_{n_k}) \odot \Delta^k \right) \qquad (23)$$

and

$$-\frac{1}{2} \frac{\partial RSS}{\partial \theta^k} = 1_N' \Delta^k \qquad (24)$$

where minus one-half of the derivative of RSS with respect to the k-th layer input vector is denoted as the N × n_k matrix Δ^k. This matrix can be expressed as

$$\Delta^k = [\delta^k_{aj}] = [\delta^{k\prime}_1, \delta^{k\prime}_2, \ldots, \delta^{k\prime}_a, \ldots, \delta^{k\prime}_N]' \qquad (25)$$

where the (a, j) element of the Δ^k matrix is given, recursively, by

$$\delta^k_{aj} = -\frac{1}{2} \frac{\partial RSS}{\partial i^k_{aj}} = \begin{cases} o^k_{aj}(1 - o^k_{aj})(y_{aj} - o^k_{aj}) & \text{if } k = n \\ o^k_{aj}(1 - o^k_{aj})\left(\delta^{k+1}_a w^{k+1\prime}_j\right) & \text{otherwise.} \end{cases} \qquad (26)$$

Note that when the last layer activation function is linear, the k = n case in the above equation must be modified to (y_{aj} − o^n_{aj}). The whole set of the N × n_k derivatives can be written as

$$\Delta^k = -\frac{1}{2} \frac{\partial RSS}{\partial I^k} = \begin{cases} O^n \odot (1_N 1'_m - O^n) \odot (Y - O^n) & \text{if } k = n \\ O^k \odot (1_N 1'_{n_k} - O^k) \odot (\Delta^{k+1} W^{k+1\prime}) & \text{otherwise.} \end{cases} \qquad (27)$$
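The recursion of equations (26)/(27), combined with equations (23) and (24), is the back-propagation computation of the gradient. A minimal sketch follows, assuming the forward-pass helper shown earlier and logistic activation in every layer; the names (backprop_gradients, outputs) are ours, and the returned arrays are −½ ∂RSS/∂W^k and −½ ∂RSS/∂θ^k for each layer.

```python
import numpy as np

def backprop_gradients(outputs, Y, weights):
    """Negative half-gradients of RSS, equations (23), (24), (26)/(27).

    outputs is the list [O^1, ..., O^n] returned by the forward pass and
    weights the matching list [W^2, ..., W^n].
    """
    grads_W, grads_theta = [], []
    O_n = outputs[-1]
    delta = O_n * (1.0 - O_n) * (Y - O_n)            # Delta^n, eq. (27), k = n case
    for k in range(len(weights) - 1, -1, -1):        # paper layers n, n-1, ..., 2
        O_prev = outputs[k]                          # O^{k-1} in the paper's numbering
        grads_W.insert(0, O_prev.T @ delta)          # eq. (23) stacked over the rows of W^k
        grads_theta.insert(0, delta.sum(axis=0, keepdims=True))   # eq. (24): 1_N' Delta^k
        if k > 0:                                    # propagate to the layer below, eq. (27)
            delta = O_prev * (1.0 - O_prev) * (delta @ weights[k].T)
    return grads_W, grads_theta
```

The steepest descent direction of equation (18) is then proportional to these quantities, since the negative gradient is simply twice each returned array.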

4. Estimation of the Parameters by the Gauss-Newton Method

In standard applications of nonlinear regression, the residual sums of squares are usually minimized using the Gauss-Newton method (see, for example, Gallant, 1987). The Gauss-Newton method takes advantage of the fact that RSS is a quadratic function of the elements of the Ŷ matrix, and achieves quadratic convergence without calculating the second derivative matrix of RSS. In general, using a linear approximation to the s-th column, ŷ_{(s)}, of the Ŷ matrix in the neighborhood of ξ^(t),

$$\hat{y}_{(s)} \simeq \hat{y}_{(s)}^{(t)} + \left. \frac{\partial \hat{y}_{(s)}}{\partial \xi} \right|_{\xi = \xi^{(t)}} (\xi - \xi^{(t)})', \qquad s = 1, 2, \ldots, m, \qquad (28)$$

RSS can be expressed as

$$RSS \simeq \sum_{s=1}^{m} \left\| y_{(s)} - \hat{y}_{(s)}^{(t)} - F_s (\xi - \xi^{(t)})' \right\|^2 \qquad (29)$$

where ŷ_{(s)}^{(t)} is the value of ŷ_{(s)} evaluated at ξ^(t) and

$$F_s = \left. \frac{\partial \hat{y}_{(s)}}{\partial \xi} \right|_{\xi = \xi^{(t)}} \qquad (30)$$

is the Jacobian matrix. Therefore, the Gauss-Newton direction is

$$d^{(t)} = H^{-1} \sum_{s=1}^{m} F_s' \left( y_{(s)} - \hat{y}_{(s)}^{(t)} \right) \qquad (31)$$

where

$$H = \sum_{s=1}^{m} F_s' F_s. \qquad (32)$$

Note that this q × q matrix, where q = Σ_{k=2}^{n} q_k, may become too large to invert as many times as is needed during the course of the iterations. Moreover, Saarinen et al. (1993) found that the N × q Jacobian matrix is often ill-conditioned, resulting in a nearly singular H matrix. Therefore, we choose the following conditional minimization method by the alternating least squares (ALS) approach. (See, for example, Takane et al., 1977.) The conditional minimization divides all the parameters into n−1 subsets, namely, ξ^k, k = 2, 3, ..., n, and tries to minimize RSS with respect to the parameters in each subset, treating the parameters in the other subsets as fixed. The updated parameters are immediately fed back when proceeding to the next subset. The update formula is

$$\xi^{k(t+1)} = \xi^{k(t)} + d^{k(t)}, \qquad k = 2, 3, \ldots, n \qquad (33)$$

where the (conditional) Gauss-Newton direction can be calculated as follows:

$$d^{k(t)} = H_k^{-1} \sum_{s=1}^{m} F_s^{k\prime} \left( y_{(s)} - \hat{y}_{(s)}^{(t)} \right) \qquad (34)$$

where

$$F_s^k = \frac{\partial \hat{y}_{(s)}}{\partial \xi^k} \qquad (35)$$

and

$$H_k = \sum_{s=1}^{m} F_s^{k\prime} F_s^k. \qquad (36)$$
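To make the alternating scheme concrete, here is a minimal sketch of one sweep of equations (33)-(36). It assumes a user-supplied helper `layer_jacobians(...)` that returns the m Jacobian matrices F_s^k of equation (35) for a given layer (their construction follows from equations (37)-(40) below); all names are ours, and a plain, ridge-free solve is used purely for illustration.

```python
import numpy as np

def conditional_gauss_newton_sweep(X, Y, params, predict, layer_jacobians):
    """One ALS sweep over the layers, equations (33)-(36).

    params[k] is the current 1 x q_k vector xi^k; predict(X, params) returns
    Y-hat; layer_jacobians(X, params, k) returns the list [F_1^k, ..., F_m^k].
    """
    for k in sorted(params):                       # layers k = 2, 3, ..., n
        Y_hat = predict(X, params)                 # re-evaluate with the latest parameters
        F = layer_jacobians(X, params, k)          # F_s^k, equation (35)
        H_k = sum(Fs.T @ Fs for Fs in F)           # equation (36)
        rhs = sum(Fs.T @ (Y[:, [s]] - Y_hat[:, [s]]) for s, Fs in enumerate(F))
        d_k = np.linalg.solve(H_k, rhs)            # conditional direction, equation (34)
        params[k] = params[k] + d_k.ravel()        # update xi^k, equation (33); fed back immediately
    return params
```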

The N × q_k Jacobian matrix F_s^k, consisting of the derivatives of the output, i.e., ŷ_{(s)} = o^n_{(s)}, with respect to the parameters of the k-th layer, has the following form:

$$\frac{\partial o^n_{(s)}}{\partial w^k_i} = (o^{k-1}_{(i)} 1'_{n_k}) \odot \Delta_s^k \qquad (37)$$

and

$$\frac{\partial o^n_{(s)}}{\partial \theta^k} = \Delta_s^k. \qquad (38)$$

The (a, j) element, δ^{k,s}_{aj}, of the N × n_k matrix Δ_s^k is expressed recursively as

$$\delta^{k,s}_{aj} = \frac{\partial o^n_{as}}{\partial i^k_{aj}} = \begin{cases} \begin{cases} o^n_{as}(1 - o^n_{as}) & \text{if } j = s \\ 0 & \text{otherwise} \end{cases} & \text{if } k = n \\ o^k_{aj}(1 - o^k_{aj})\left(\delta^{k+1,s}_a w^{k+1\prime}_j\right) & \text{otherwise.} \end{cases} \qquad (39)$$

Note that when the last layer activation function is linear, the k = n and j = s case in the previous equation must be set equal to unity. As before, in matrix notation, we have

$$\Delta_s^k = \begin{cases} \left( o^n_{(s)} \odot (1_N - o^n_{(s)}) \right) e_s' & \text{if } k = n \\ O^k \odot (1_N 1'_{n_k} - O^k) \odot (\Delta_s^{k+1} W^{k+1\prime}) & \text{otherwise} \end{cases} \qquad (40)$$

where e_s is the m × 1 vector whose s-th element is unity and whose other elements are zero.
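Under the same assumptions as the earlier sketches (logistic activation throughout, numpy at hand), equations (37)-(40) can be turned into a small routine that builds the Jacobian block F_s^k for a single output column s; wrapping it in a loop over s would give the `layer_jacobians` helper assumed above. The names and the flattening order of the columns are our own choices, chosen to match the ordering of ξ^k in equation (19).

```python
import numpy as np

def delta_s(outputs, weights, k, s):
    """Delta_s^k of equation (40) for output column s; k is the list index of the layer (0 for W^2)."""
    O_n = outputs[-1]
    Delta = np.zeros_like(O_n)
    Delta[:, s] = O_n[:, s] * (1.0 - O_n[:, s])              # k = n case of eq. (40)
    for j in range(len(weights) - 1, k, -1):                 # recurse down to layer k
        O_j = outputs[j]                                     # output feeding weights[j]
        Delta = O_j * (1.0 - O_j) * (Delta @ weights[j].T)   # "otherwise" case of eq. (40)
    return Delta

def layer_jacobian(outputs, weights, k, s):
    """F_s^k assembled from equations (37) and (38): one N x n_k block per
    row of W^k, followed by the offset block, concatenated column-wise."""
    Delta = delta_s(outputs, weights, k, s)
    O_prev = outputs[k]                                      # O^{k-1} in the paper's numbering
    blocks = [O_prev[:, [i]] * Delta for i in range(O_prev.shape[1])]   # eq. (37)
    blocks.append(Delta)                                     # eq. (38)
    return np.hstack(blocks)
```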

5. Technical details

In this section we discuss some of the technical issues related to the implementation of the Gauss-Newton algorithm.

5.1 Step size

The direction vector, d^(t) for the steepest descent method or d^{k(t)} for the Gauss-Newton method, given in the previous sections tells us the direction in which to search for a smaller RSS value than the current one. However, it does not tell us how much we should modify the current value of ξ^(t) or ξ^{k(t)}. This is the step size problem. The usual approach is, starting from η = 1, to halve the step size until RSS(ξ^(t) + η d^(t)) becomes smaller than RSS(ξ^(t)).
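A minimal sketch of this step-halving rule, under the same assumptions as the earlier code (the rss callable and the cap on the number of halvings are ours):

```python
def halve_step(rss, xi, d, max_halvings=20):
    """Backtracking step size: start at eta = 1 and halve until RSS decreases."""
    base = rss(xi)
    eta = 1.0
    for _ in range(max_halvings):
        if rss(xi + eta * d) < base:
            return eta
        eta *= 0.5
    return eta   # fall back to the smallest step tried
```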
