AN EFFICIENT NEURAL NETWORK ALGORITHM FOR FEED-FORWARD REGRESSION ANALYSIS

Shin-ichi Mayekawa*
The feed-forward neural network model can be considered a very powerful nonlinear regression analytic tool. However, the existing popular back-propagation algorithm using the steepest descent method is very slow to converge, prohibiting the everyday use of the neural network regression model. In this regard, a fast converging algorithm for the estimation of the weights in feed-forward neural network models was developed using the alternating least squares (ALS) or conditional Gauss-Newton method. In essence, the algorithm alternates the minimization of the residual sums of squares (RSS) with respect to the weights in each layer until the reduction of RSS is negligible. With this approach, neither the calculation of a complex second derivative matrix nor the inversion of a large matrix is necessary. In order to avoid the inflation of the weight values, a ridge method and a quasi-Bayesian method were also investigated. The methods were evaluated using several problems and found to be very fast compared to the steepest descent method. With a fast converging algorithm at hand, it is hoped that the statistical nature of the neural network model as a nonlinear regression analysis model will be clearly revealed.

Key Words and Phrases: Neural Network, Neural Network Training, Feed-Forward Neural Network, Gauss-Newton Method, Bayesian Method, Nonlinear Regression Analysis

* Research Division, The National Center for University Entrance Examinations, 2-19-23, Komaba, Meguro-ku, Tokyo, 153 Japan.
1. Introduction
Since the invention/rediscovery of the error back-propagation method by Rumelhart et al. (1986), neural networks, especially feed-forward networks, alias multilayer perceptrons, have been widely used in a variety of areas. In essence, what they do is a nonlinear approximation to the relationship between two sets of multivariate variables, x and y. In applications of neural networks, we "teach" or "train" the network to learn the existing relationship using "training" data. That is, the network modifies itself until it "learns" the relationship. Error back-propagation is a numerical algorithm that enables this learning process.
Traditionally, the relationships among variables have been studied by statistical methods such as regression analysis or discriminant analysis. In traditional statistics, we first construct a parametric model of the phenomenon of interest and test it in light of the data at hand. The key to this approach is the assumed probability distribution of the error term in the model. The parameters of the model are estimated so that the likelihood of the model is maximized.
From a numerical analytic point of view, the learning process of a neural network can be described as the minimization of the residual sums of squares between the data and the output of the network with respect to the parameters of
the network, namely, the weights and the offset or bias values. Therefore, with an appropriate probabilistic assumption for the residual terms, we may treat the neural network as a statistical model. For example, this approach enables us to test whether a particular neural network model fits the data on the basis of the likelihood, or to construct a prediction interval for the output.
For this statistical approach to the neural network models to be feasible, statisticians must have an opportunity to use the network models as one of their tools. Unfortunately, the existing popular learning algorithm is not satisfactory for their everyday use, because it is very slow to converge compared even to the most complex statistical models such as linear structural equation models or finite mixture models. Although there have been several attempts to improve the learning algorithm (for example, Bishop, 1992, and Jacobs, 1988), most users of the network models still use the original back-propagation algorithm, which might take hours to converge on small computers.
The purpose of this article is to develop an efficient numerical method which enables us to use neural network models easily. Given such a method, it is hoped that the application of neural networks to problems traditionally analyzed using statistical methods becomes more popular, as does research such as Ripley's (1993), revealing the statistical aspect of neural network models more clearly.
2. The feed-forward neural network model
Let (x_a, y_a), a = 1, 2, ..., N, be the paired observations, where x_a and y_a, respectively, refer to the 1 × p regressor vector and the 1 × m criterion vector. Collectively, they can be arranged in the following data matrices:

X = [x_1', x_2', ..., x_a', ..., x_N']'    (1)

and

Y = [y_1', y_2', ..., y_a', ..., y_N']'.    (2)

(In this article, we denote the i-th row of a matrix, say X, by x_i, and the j-th column by x_(j).) The purpose of neural network regression is to explain the criterion matrix Y by the regressor matrix X using the feed-forward neural network model.
The n-layer feed-forward neural network model, or multi-layer perceptron, can be described as follows. Let n_k, k = 1, 2, ..., n, be the number of neurons in the k-th layer. In this article, we use the convention that the data matrix, X, is referred to as the output of the first layer, and that the model prediction, Ŷ, of the Y matrix is referred to as the output of the last layer. Therefore, n_1 = p and n_n = m. An observation, x_a, when presented to a neural network, generates an input and an output for all the neurons. The input and the output of the j-th neuron of the k-th layer are denoted, respectively, as i^k_{aj} and o^k_{aj}. Given an input, a neuron is "activated" and produces an output as

o^k_{aj} = φ(i^k_{aj})    (3)

where φ(·) is called the activation or excitation function. In this article we use the logistic function

φ(i^k_{aj}) = 1 / (1 + exp(-i^k_{aj}))    (4)
as the activation function.
The relationship between the output of the (k-1)st layer and the input of the k-th layer is as follows. Let the collections of the inputs and the outputs of all the neurons in the k-th layer be, respectively,

i^k_a = [i^k_{a1}, i^k_{a2}, ..., i^k_{aj}, ..., i^k_{a n_k}]    (5)

and

o^k_a = [o^k_{a1}, o^k_{a2}, ..., o^k_{aj}, ..., o^k_{a n_k}].    (6)

The inputs of the k-th layer neurons are linearly related to the outputs of the (k-1)st layer neurons as

i^k_{aj} = Σ_{i=1}^{n_{k-1}} o^{k-1}_{ai} w^k_{ij} + θ^k_j    (7)
where w^k_{ij} is the weight that connects the i-th neuron of the (k-1)st layer to the j-th neuron of the k-th layer and θ^k_j is the offset value for the j-th neuron of the k-th layer. Therefore, for observation x_a, the n_k neurons of the k-th layer receive the following inputs:

i^k_a = o^{k-1}_a W^k + θ^k    (8)
where W^k is the n_{k-1} × n_k weight matrix and θ^k is the 1 × n_k offset vector. Using the N × n_k matrices I^k and O^k, whose rows consist, respectively, of i^k_a and o^k_a, the above relationship can be expressed as

O^k = Φ(I^k),    k = 2, 3, ..., n    (9)

and

I^k = O^{k-1} W^k + 1_N θ^k,    k = 2, 3, ..., n    (10)
where Φ(·) denotes the elementwise logistic function and 1_N is the N × 1 unit vector. By further denoting the recursive function which maps O^{k-1} to O^k by Γ^k, namely,

O^k = Φ(I^k) = Φ(O^{k-1} W^k + 1_N θ^k) = Γ^k(O^{k-1}),    (11)

the output of the last layer can be expressed as

O^n = Γ^n(O^{n-1}) = Γ^n(Γ^{n-1}(O^{n-2})) = Γ^n ∘ Γ^{n-1} ∘ ... ∘ Γ^2(O^1) = Γ(O^1)    (12)

where

Γ(·) = Γ^n ∘ Γ^{n-1} ∘ ... ∘ Γ^2(·).    (13)

Therefore, by writing

X = O^1  and  Ŷ = O^n,    (14)

the n-layer feed-forward neural network's prediction of the criterion variables is given by

Ŷ = Γ(X).    (15)
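As an illustration of the recursion in (9)-(12), the following short sketch (in Python with NumPy; the function and variable names are our own illustrative choices, not part of the original formulation) computes the prediction Ŷ = Γ(X) of eq. (15) from lists of weight matrices W^k and offset vectors θ^k.

    # illustrative sketch, not part of the original article
    import numpy as np

    def logistic(z):
        # elementwise logistic activation, eq. (4)
        return 1.0 / (1.0 + np.exp(-z))

    def forward(X, weights, offsets):
        # X is the N x p regressor matrix (the output O^1 of the first layer);
        # weights = [W^2, ..., W^n], offsets = [theta^2, ..., theta^n]
        O = X
        for W, theta in zip(weights, offsets):
            I = O @ W + theta        # eq. (10): I^k = O^{k-1} W^k + 1_N theta^k
            O = logistic(I)          # eq. (9):  O^k = Phi(I^k)
        return O                     # O^n = Gamma(X), eqs. (12) and (15)

    # a small three-layer example: p = 3 regressors, 4 hidden neurons, m = 2 criteria
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10, 3))
    weights = [rng.normal(size=(3, 4)), rng.normal(size=(4, 2))]
    offsets = [np.zeros(4), np.zeros(2)]
    Y_hat = forward(X, weights, offsets)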
Using the above derivation, the neural network model can be expressed in the following form of the usual regression analysis:

Y = Γ(X) + E    (16)

where E is the N × m residual matrix. The parameters of the neural network regression model shown above are {W^k, θ^k}, k = 2, 3, ..., n, and the dispersion matrix of the residual term. We restrict ourselves to the case where the error dispersion matrix is σ²I. In standard regression analysis, it is customary to estimate the parameters by the least squares method of minimizing the residual sums of squares

RSS = ||Y - Ŷ||².    (17)
However, contrary to the usual multivariate linear regression model, the neural network regression model is not linear in terms of the parameters. Therefore, in order to estimate the parameters, some numerical method must be used. Given the current value of the parameters, ξ^{(t)}, standard iterative numerical methods update the parameters as

ξ^{(t+1)} = ξ^{(t)} + d^{(t)}    (18)

where d^{(t)} is the direction vector calculated in various ways. In the usual application of the neural network model, the direction vector is chosen by the method of steepest descent. In the next section, we review the usual back-propagation algorithm.
3. Estimation of the Parameters by the Steepest Descent Method
Let us denote the row-vectorized version of the weight matrix and the offset values of the k-th layer by

ξ^k = (w^k_1, w^k_2, ..., w^k_{n_{k-1}}, θ^k),    k = 2, 3, ..., n    (19)

where w^k_i denotes the i-th row of the W^k matrix. Note that ξ^k is 1 × q_k, where q_k, k = 2, 3, ..., n, is the number of parameters in the k-th layer. Collectively, all the parameters can be written as the 1 × q vector

ξ = (ξ^2, ξ^3, ..., ξ^n)    (20)
where q = Σ_{k=2}^{n} q_k. The method of steepest descent uses the negative gradient vector as the direction vector:

g = (g^2, g^3, ..., g^n)    (21)

where

g^k = ∂RSS/∂ξ^k |_{ξ = ξ^{(t)}}.    (22)
In this article, the partial derivative of a scalar, s, with respect to each element of a 1 × p vector, v, is denoted by the 1 × p vector ∂s/∂v, which is arranged in the same order as the original vector. Similarly, the partial derivatives of a 1 × q or q × 1 vector, u, with respect to each element of v are denoted by the q × p matrix ∂u/∂v.
Denoting the i-th columns of the O^k and I^k matrices as o^k_{(i)} and i^k_{(i)}, respectively, the gradient vectors can be calculated as follows. The gradient vector g^k consists of the derivatives of RSS with respect to the w^k_i and θ^k. Using the Hadamard product (⊙), they can be expressed as

-(1/2) ∂RSS/∂w^k_i = 1_N' ((o^{k-1}_{(i)} 1'_{n_k}) ⊙ Δ^k)    (23)

and

-(1/2) ∂RSS/∂θ^k = 1_N' Δ^k    (24)
where minus one half of the derivative of RSS with respect to the k-th layer inputs is denoted as the N × n_k matrix Δ^k. This matrix can be expressed as

Δ^k = [δ^k_{aj}] = [δ^k_1', δ^k_2', ..., δ^k_a', ..., δ^k_N']'    (25)

where the (a, j) element of the Δ^k matrix is given, recursively, by

δ^k_{aj} = -(1/2) ∂RSS/∂i^k_{aj}
         = o^k_{aj}(1 - o^k_{aj})(y_{aj} - o^k_{aj})          if k = n
         = o^k_{aj}(1 - o^k_{aj})(δ^{k+1}_a w^{k+1}_j')        otherwise.    (26)
Note that when the last-layer activation function is linear, the k = n case in the above equation must be modified to (y_{aj} - o^n_{aj}). The whole set of the N × n_k derivatives can be written as

Δ^k = -(1/2) ∂RSS/∂I^k
    = O^n ⊙ (1_N 1'_m - O^n) ⊙ (Y - O^n)                    if k = n
    = O^k ⊙ (1_N 1'_{n_k} - O^k) ⊙ (Δ^{k+1} W^{k+1}')        otherwise.    (27)
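The recursion (25)-(27) is what the back-propagation algorithm actually computes. A minimal sketch, assuming the logistic output layer of eq. (26) and the same illustrative array layout as the earlier sketch (one entry of weights/offsets per layer; the names are ours, not the article's), is:

    # illustrative sketch, not part of the original article
    import numpy as np

    def rss_gradients(X, Y, weights, offsets):
        # forward pass, keeping every layer output (outs[0] = X = O^1)
        outs = [X]
        for W, theta in zip(weights, offsets):
            outs.append(1.0 / (1.0 + np.exp(-(outs[-1] @ W + theta))))
        # Delta^n of eq. (27), k = n case
        Delta = outs[-1] * (1.0 - outs[-1]) * (Y - outs[-1])
        grads_W, grads_t = [], []
        for k in range(len(weights) - 1, -1, -1):
            # eqs. (23)-(24): dRSS/dW^k = -2 O^{k-1}' Delta^k, dRSS/dtheta^k = -2 1_N' Delta^k
            grads_W.insert(0, -2.0 * outs[k].T @ Delta)
            grads_t.insert(0, -2.0 * Delta.sum(axis=0))
            if k > 0:
                # Delta^k from Delta^{k+1}, eq. (27) "otherwise" case
                Delta = outs[k] * (1.0 - outs[k]) * (Delta @ weights[k].T)
        return grads_W, grads_t

The steepest descent direction of eqs. (21)-(22) is simply minus these gradients, applied with a step size as discussed in Section 5.1.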
4. Estimation of the Parameters by the Gauss-Newton Method

In standard applications of nonlinear regression, the residual sums of squares are usually minimized using the Gauss-Newton method (see, for example, Gallant, 1987). The Gauss-Newton method takes advantage of the fact that RSS is a quadratic function of the elements of the Ŷ matrix, and achieves quadratic convergence without calculating the second derivative matrix of RSS. In general, using a linear approximation to the s-th column, ŷ_{(s)}, of the Ŷ matrix in the neighborhood of ξ^{(t)},
ŷ_{(s)} ≈ ŷ_{(s)}^{(t)} + F_s (ξ - ξ^{(t)})',    s = 1, 2, ..., m    (28)

RSS can be expressed as

RSS ≈ Σ_{s=1}^{m} || y_{(s)} - ŷ_{(s)}^{(t)} - F_s (ξ - ξ^{(t)})' ||²    (29)
where ŷ_{(s)}^{(t)} is the value of ŷ_{(s)} evaluated at ξ^{(t)} and

F_s = ∂ŷ_{(s)}/∂ξ |_{ξ = ξ^{(t)}}    (30)

is the Jacobian matrix. Therefore, the Gauss-Newton direction is

d^{(t)} = H^{-1} Σ_{s=1}^{m} F_s' (y_{(s)} - ŷ_{(s)}^{(t)})    (31)

where

H = Σ_{s=1}^{m} F_s' F_s.    (32)
Note that this q × q matrix, where q = Σ_{k=2}^{n} q_k, may become too large to invert as many times as is needed during the course of the iterations. Moreover, Saarinen et al. (1993) found that the N × q Jacobian matrix is often ill-conditioned, resulting in a nearly singular H matrix. Therefore, we choose the following conditional minimization method based on alternating least squares (ALS). (See, for example, Takane et al., 1977.) The conditional minimization divides all the parameters into n - 1 subsets, namely ξ^k, k = 2, 3, ..., n, and minimizes RSS with respect to the parameters in each subset, treating the parameters in the other subsets as fixed. The updated parameters are immediately fed back when proceeding to the next subset. The update formula is
ξ^{k(t+1)} = ξ^{k(t)} + d^{(t)}_{(k)},    k = 2, 3, ..., n    (33)

where the (conditional) Gauss-Newton direction can be calculated as follows:

d^{(t)}_{(k)} = H_k^{-1} Σ_{s=1}^{m} F^k_s' (y_{(s)} - ŷ_{(s)}^{(t)})    (34)

where

F^k_s = ∂ŷ_{(s)}/∂ξ^k    (35)

and

H_k = Σ_{s=1}^{m} F^k_s' F^k_s.    (36)

The N × q_k Jacobian matrix F^k_s, consisting of the derivatives of the output, i.e., ŷ_{(s)} = o^n_{(s)}, with respect to the parameters of the k-th layer, has the following form:

∂o^n_{(s)}/∂w^k_i = (o^{k-1}_{(i)} 1'_{n_k}) ⊙ Δ^k_s    (37)

and

∂o^n_{(s)}/∂θ^k = Δ^k_s.    (38)

The (a, j) element, δ^{k,s}_{aj}, of the N × n_k matrix Δ^k_s is expressed recursively as

δ^{k,s}_{aj} = ∂o^n_{as}/∂i^k_{aj}
             = o^n_{as}(1 - o^n_{as})                           if k = n and j = s
             = 0                                                 if k = n and j ≠ s
             = o^k_{aj}(1 - o^k_{aj})(δ^{k+1,s}_a w^{k+1}_j')    otherwise.    (39)
Note that when the last-layer activation function is linear, the k = n and j = s case in the previous equation must be set equal to unity. As before, in matrix notation, we have

Δ^k_s = (o^n_{(s)} ⊙ (1_N - o^n_{(s)})) e_s'                     if k = n
      = O^k ⊙ (1_N 1'_{n_k} - O^k) ⊙ (Δ^{k+1}_s W^{k+1}')        otherwise    (40)

where e_s is the m × 1 vector whose s-th element is unity and whose other elements are zero.
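To make the conditional (layer-wise) Gauss-Newton step concrete, the following sketch builds the Jacobian blocks of eqs. (37)-(40) column by column, accumulates H_k and the right-hand side of eq. (34), and solves for the direction of one layer. It is only an illustration under our own array conventions (a list of weight matrices, logistic output layer); names such as als_gauss_newton_step are ours and not from the original text.

    # illustrative sketch, not part of the original article
    import numpy as np

    def als_gauss_newton_step(X, Y, weights, offsets, k):
        # one conditional Gauss-Newton update of (W, theta) for layer k of the
        # weights list, holding all other layers fixed (eqs. (33)-(40))
        outs = [X]
        for W, th in zip(weights, offsets):
            outs.append(1.0 / (1.0 + np.exp(-(outs[-1] @ W + th))))
        Yhat = outs[-1]
        N, m = Y.shape
        n_prev, n_k = weights[k].shape
        q_k = (n_prev + 1) * n_k                     # parameters in this layer
        H = np.zeros((q_k, q_k))
        rhs = np.zeros(q_k)
        for s in range(m):
            # Delta^n_s of eq. (40): only the s-th column is nonzero
            D = np.zeros_like(Yhat)
            D[:, s] = Yhat[:, s] * (1.0 - Yhat[:, s])
            # back-propagate Delta_s down to the layer being updated ("otherwise" case)
            for j in range(len(weights) - 1, k, -1):
                D = outs[j] * (1.0 - outs[j]) * (D @ weights[j].T)
            # Jacobian F^k_s of eqs. (37)-(38): one N x n_k block per input neuron,
            # plus the offset block, in the row-vectorized order of eq. (19)
            F = np.hstack([outs[k][:, [i]] * D for i in range(n_prev)] + [D])
            r = Y[:, s] - Yhat[:, s]
            H += F.T @ F                             # eq. (36)
            rhs += F.T @ r                           # summand of eq. (34)
        d = np.linalg.solve(H, rhs)                  # conditional Gauss-Newton direction
        dW = d[: n_prev * n_k].reshape(n_prev, n_k)
        dtheta = d[n_prev * n_k:]
        return weights[k] + dW, offsets[k] + dtheta

A full ALS sweep of eq. (33) simply applies this step to each layer in turn, feeding the updated weights back immediately; if H_k is nearly singular, a small ridge term could be added before solving, along the lines mentioned in the abstract.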
5. Technical details

In this section we discuss some of the technical issues related to the implementation of the Gauss-Newton algorithm.
5.1 Step size

The direction vector, d^{(t)} for the steepest descent method or d^{(t)}_{(k)} for the Gauss-Newton method, given in the previous sections tells us the direction in which to search for a smaller RSS value than the current one. However, it does not tell us how much we should modify the current value of ξ^{(t)} or ξ^{k(t)}. This is the step size problem. The usual approach is, starting from η = 1, to halve the step size until RSS(ξ^{(t)} + η d^{(t)})
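The text breaks off here, but the halving rule it describes can be sketched as follows. The acceptance criterion (stop once the trial RSS falls below the current RSS) is our assumption about how the truncated sentence continues, and rss stands for any routine that evaluates eq. (17) at a given parameter vector.

    # illustrative sketch, not part of the original article
    def halve_step(rss, xi, d, eta=1.0, max_halvings=30):
        # step-size search of Section 5.1: try xi + eta*d and halve eta until RSS improves
        rss_now = rss(xi)
        for _ in range(max_halvings):
            if rss(xi + eta * d) < rss_now:      # assumed acceptance criterion
                return xi + eta * d, eta
            eta *= 0.5
        return xi, 0.0                            # no improving step found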