A Second Order Learning Scheme based on Iteratively Reweighted Least Squares

Bradley A. Warner

Dept. of Mathematical Sciences United States Air Force Academy 2354 Fairchild Drive, Suite 6D2A USAF Academy, CO 80840-6252 Ph: (719) 333-4470 Fax: (719) 333-2114

[email protected]

Manavendra Misra

Dept. of Mathematical and Computer Sciences Colorado School of Mines Golden, CO 80401 Ph. (303)-273-3873 Fax. (303)-273-3875 [email protected]

Abstract

In this work we demonstrate a method to obtain maximum likelihood weight estimates for a multilayered feedforward neural network using least squares. This method has certain advantages when compared to other second-order methods. The proposed method uses the Fisher's information matrix instead of the Hessian matrix to compute the search direction. Since this matrix is formulated as an inner product, it is guaranteed to be positive definite. This ensures that each step of the optimization procedure will be in a descent direction and the solution found by the method will be a minimum and not a saddle point or a maximum. In addition, at convergence, the Fisher's information matrix yields estimates of the variance of the weights. The formulation used by the method also provides an interesting way of highlighting the multicollinearity problem in multilayered feedforward networks. This in turn shows the need for a regularization procedure such as weight decay in the learning process. The approach also has the advantage of allowing the use of regression diagnostics for analyzing neural networks.

1 Introduction

The process of learning in a multi-layered feed-forward network is one of determining the values of the weights in the network that minimize some error function (E) between the actual outputs corresponding to the training patterns and the desired outputs for those patterns. Learning¹ is therefore a non-linear optimization process, and a number of techniques developed to solve such optimization problems can be applied to learn the weights in an artificial neural network (ANN). Most of these methods extensively use information about the gradient of the error in the weight space to guide the search for a minimum of the error.

Traditionally, the ANN literature has tended to focus on back-propagation [1, 2] as it provides efficient local update rules for iteratively estimating the weights of an ANN. However, there are well documented problems with using such a steepest gradient descent technique for solving this non-linear optimization problem. Back-propagation often results in slow convergence to the final values of weights, and can also be caught in local minima or error surface plateaus. Another situation that causes problems for steepest descent techniques is when the error function has significantly different curvatures along different directions (such as when a valley with steep sides in the error surface gently slopes towards the minimum). A first order method such as back-propagation can require a large number of iterations to converge to the minimum in this case. This has led researchers to look for ad-hoc methods that can help speed the rate of convergence [3, 4, 5, 6, 7]. These methods help in the practical implementation of back-propagation learning, but the method remains a first-order method with its inherent drawbacks.

Another approach that has been investigated involves the explicit use of second-order information about the error surface in order to find a direction that can lead us to a minimum value of the error quickly. These second-order methods use a quadratic approximation to the error in order to increase the convergence rate (especially in cases like the gently sloping valley with steep sides described above) [8, 9, 10]. Newton's method [11] is a second-order method that uses the following search direction to find the weights that result in the minimum of the error function, E:

$$w^* = w - H^{-1} U \qquad (1)$$

where w is the vector of all the weights in the network, $w^*$ is the value of this vector at a minimum of E, H is the Hessian matrix (a matrix of the second derivatives of E with respect to the individual weights), and $U = \nabla E$ is the gradient of E in the weight space. The product $H^{-1}U$ is referred to as the Newton direction, and has the advantage that it always points towards a stationary point (something that is not guaranteed by the local negative gradient on its own). Therefore, the use of the Newton direction will often lead to a fewer number of steps in getting to a stationary point when compared to a local gradient-descent based optimization approach such as back-propagation. However, the fewer number of steps to get to a stationary point come at the cost of a higher computational complexity per step.

There are a number of drawbacks that have limited the use of second-order methods for neural network learning [11]. First, the determination of the Newton direction requires the computation of the Hessian matrix, H. This is an m × m matrix of second derivatives, where m is the number of weights in the weight vector, w. The computation of H is an O(nm²) operation, where n is the total number of patterns in the training set. Once the Hessian has been computed, it has to be inverted, which is an O(m³) operation, and therefore very expensive. Finally, although the Newton direction always points to a stationary point, this stationary point is a minimum only when H is positive definite². In other cases, the Newton direction points at a saddle-point or a maximum (and is therefore undesirable when one is trying to minimize E).

In this work, we propose a second-order learning method for multilayered feed-forward neural networks that uses the Fisher's information matrix, F. Our method attempts to address some of the drawbacks inherent in second-order methods identified above. We will show that we determine a descent direction that utilizes second-order information about the error function, but does not require the computation of the exact Hessian, H. Secondly, the matrix F is formulated as an inner product, and is therefore guaranteed to be positive definite. This implies that the direction we work with is a descent direction, and leads us to a minimum (and not a maximum or saddle point) in the error surface. Although we still require the computation of a matrix inverse, this is in the context of the solution of a system of linear equations, and we can use standard least squares software. Since least squares software is widely available, this algorithm is easily implemented. Also, decomposition methods such as QR decomposition are easily and efficiently parallelized, and can lead to a reduction in computational complexity on a parallel machine.

¹ The statistical literature uses the expression "parameter estimation" instead of learning, and we use the terms "parameter" and "weight" interchangeably.
² A matrix H is positive definite if $x^T H x > 0$ for all vectors x.
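In code, the Newton update of Equation 1 is usually carried out by solving a linear system rather than forming $H^{-1}$ explicitly. A minimal sketch, assuming callables grad_fn and hess_fn (names ours) that return U and H; the caveat that H may not be positive definite still applies:

```python
import numpy as np

def newton_step(w, grad_fn, hess_fn):
    """One Newton step: w_new = w - H^{-1} U (Equation 1).

    Solving the linear system H d = U is cheaper and numerically
    safer than forming the explicit inverse of H.
    """
    U = grad_fn(w)              # gradient of E at w, shape (m,)
    H = hess_fn(w)              # Hessian of E at w, shape (m, m)
    d = np.linalg.solve(H, U)   # Newton direction H^{-1} U
    return w - d
```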
Most work in neural networks has focused on using the mean squared error criterion:

$$E = \frac{1}{2} \sum_{p=1}^{n} \sum_{k} \left(t_k^{(p)} - y_k^{(p)}\right)^2 \qquad (2)$$

where $t_k^{(p)}$ is the target (or desired) value for output unit k, and $y_k^{(p)}$ is the actual activation value of output unit k when the network is presented with training pattern p. Under some assumptions, the mean squared error function can be looked upon as a special case of an error function derived from the principle of maximum likelihood [11]. We therefore concentrate on a more general error function based on finding the maximum likelihood of a certain set of weights, given a particular labeled training set [12].

The learning algorithm developed here is similar to the iteratively reweighted least squares algorithm used for parameter estimation in the generalized linear model (GLM) [13]. A GLM requires the construction of a matrix called the design matrix, which is constructed by placing the input values corresponding to each training pattern as the rows of this matrix. Since the ANN models considered here have hidden units, we have to add additional columns that are the result of the dynamics of the hidden units. We call this new matrix the augmented design matrix, and we show how it is constructed and used in the learning process.

In addition to its use as a learning algorithm, the proposed method provides additional insight into the learning process in feed-forward networks. We will use the augmented design matrix to illustrate that there is an inherent multicollinearity problem in the multilayered feed-forward neural network [14]. Multicollinearity leads to slow training and unstable estimates of some weights, and demonstrates the need for a regularization technique such as weight decay.

This paper is divided into the following sections. Section 2 introduces the iteratively reweighted least squares algorithm for learning weight values. The idea of using the Fisher's information matrix (F) in the learning procedure facilitates the decomposition of F into an inner product of an augmented design matrix, guaranteeing that F will always be positive definite. This augmented design matrix allows learning to take place without the need to explicitly compute the matrix of second derivatives, H. Section 3 makes use of the augmented design matrix to examine issues such as multicollinearity and unstable weight estimation. This leads to the discussion of regularization methods (penalized likelihood) to address these concerns. We also discuss how the suggested formulation allows the application of regression diagnostic tools to analyze neural network structures. In Section 4, the method developed in Section 2 is applied to two synthetic data sets. Section 5 concludes the paper with some pointers to future work.

2 Learning Algorithm

In this section we discuss the forms of the multilayered feedforward neural networks considered in this paper and explain maximum likelihood based learning using least squares. This involves the formation of an augmented design matrix and an adjusted dependent variable in an iteratively reweighted least squares algorithm.

2.1 The Network Architecture

The networks considered in this paper are essentially variants of the traditional multilayered feedforward network, and an example is shown in Figure 1. Such a network is composed of input units that comprise the input layer, one or more hidden layers, and one output unit. There are standard feedforward connections from layers to subsequent layers, but we also allow for connections that skip over layers. The weights corresponding to connections into the output node, including optional skip layer connections, are denoted by $\beta_i$, where i represents the node connecting into the output. Connections from the input nodes to a hidden unit are denoted by $\alpha_{ki}$, where i denotes the input node and k the hidden unit. The activation functions for the hidden units are standard sigmoidal functions of the form (although any differentiable function is acceptable):

$$h(x) = \frac{1}{1 + e^{-x}} \qquad (3)$$

The architecture represented in Figure 1 has a single output unit (or response variable) because the objective (or error) function being used, the likelihood function discussed in Section 2.2, is defined for a single variable. This is not as restrictive as it first appears. The common multilayered feedforward neural network with multiple output nodes usually has a single categorical response variable. The multiple output nodes correspond to the different categories of the output. Categorical output variables are still single response variables and can be modeled as a polytomous response using the appropriate objective (likelihood) function.

The choice of the activation function used for the output unit depends on the nature of the output (or response) variable. For instance, if the response variable is binary (such as in a two-class classification problem), we need an activation function that generates outputs in [0, 1]. A sigmoidal function (such as the logistic or a cumulative density function) is appropriate in this case. On the other hand, the identity output function would suffice for a problem with a continuous output variable³.

Figure 1: Multilayered feedforward neural network considered in this paper. The skip layer connections directly from the inputs to the output are optional.
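To fix notation, the following is a minimal sketch of the forward pass for this architecture. The array shapes and orderings are our assumptions; alpha and beta mirror the $\alpha$ and $\beta$ weights above:

```python
import numpy as np

def sigmoid(x):
    # Hidden unit activation h(x) = 1 / (1 + e^{-x})  (Equation 3)
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, alpha, beta, out_act=lambda s: s):
    """Forward pass for the Figure 1 architecture.

    x       : inputs for one pattern, shape (d,)
    alpha   : hidden-unit weights, shape (K, d+1); column 0 is the bias
    beta    : output weights, shape (1 + d + K,) ordered as
              [bias, skip connections from inputs, hidden-to-output]
    out_act : output activation (identity for a continuous response,
              a sigmoid for a binary one)
    """
    z = alpha @ np.concatenate(([1.0], x))   # net inputs to hidden units
    h = sigmoid(z)                           # hidden unit activations
    a = np.concatenate(([1.0], x, h))        # all activities feeding the output
    return out_act(a @ beta)
```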

2.2 Estimation of Network Weights

The learning strategy developed in this work is based on the principle of maximum likelihood. Most work in the area of multilayered feedforward ANN learning employs the mean squared error (MSE) function. However, the MSE function can be seen as a special case of a more general error function derived from the principle of maximum likelihood [11]. Such maximum likelihood based error functions have been used in other ANN learning tasks too [15, 16]. To estimate the weights (also called the parameters of the network), the likelihood function is maximized with respect to the weights. The weights that are most "likely" to explain the observed training data (the maximum likelihood estimates of the weights) are computed by maximizing the likelihood function with respect to the weights. Given a training set composed of pairs $\{x_i, t_i\}$, where $x_i$ is the input vector corresponding to pattern i, and $t_i$ is the target output value for pattern i, the likelihood function is defined as:

$$L = \prod_i p(t_i \mid x_i) \qquad (4)$$

Since the conditional probability distribution $p(t \mid x)$ of the output value t given a new input vector x depends on the weights of the network, the likelihood function depends on the weights. Often, the logarithm of the likelihood function is used because it is mathematically more tractable. Since logarithms are monotonic, this does not change the value of the weights that maximize the original likelihood function. In addition, since we want to look upon the objective function as an error function to be minimized, the negative of the log likelihood is used as the objective function. This yields the same weight estimates as would be computed by maximizing the likelihood. The error function corresponding to the likelihood function in Equation 4 is therefore given by:

$$E = -\sum_i \ln p(t_i \mid x_i) \qquad (5)$$

³ In general, the activation function of the output node is selected as the inverse of the canonical link function corresponding to the generalized linear model with the same likelihood function [13]. This is done to simplify the derivatives.
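For a binary (Bernoulli) response, Equation 5 is the familiar cross-entropy error. A small sketch; the clipping guard against log(0) is our addition:

```python
import numpy as np

def nll(t, y):
    """E = -sum_i ln p(t_i | x_i)  (Equation 5), binary response.

    For a Bernoulli output, p(t|x) = y^t (1-y)^(1-t), where y is the
    network output, so the error is the cross-entropy.
    """
    eps = 1e-12                        # guard against log(0)
    y = np.clip(y, eps, 1.0 - eps)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))
```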


In assuming a particular conditional probability distribution $p(t \mid x)$, we specify the structure of the probability distribution of the error at the output. This permits modeling of count data through the Poisson distribution, strictly positive continuous data through the gamma distribution, binary data through the binomial distribution, and continuous data through the Gaussian distribution [13]. Error structures such as the Gaussian, binomial, and gamma distributions belong to the more general exponential family [17]. For the exponential family, the log likelihood is defined as:

$$l(\theta_i; t_i) = \frac{t_i \theta_i - b(\theta_i)}{a(\phi)} + c(t_i, \phi) \qquad (6)$$

where the index i is over the training patterns. $\theta$ is called the canonical parameter, and is a function of the weights in the network. $t_i$ is the target output of the network for training pattern i. $b(\cdot)$ is called the cumulant function; its first derivative is the expected value of the output for a testing pattern (which is by definition the actual output of the network, y), and its second derivative is proportional to the variance of y. $\phi$ is called the dispersion parameter and is constant across all training patterns. $a(\phi)$ and $c(y, \phi)$ are specific functions of the dispersion parameter $\phi$ that are important in defining the different distributions, but do not affect the learning process. In our case, we simplify analysis by considering only canonical links (remember that the canonical link is the inverse of the output activation function, which is chosen appropriately, given the output range and probability density assumptions). A choice of the canonical link results in the canonical parameter $\theta_i$ being equal to the weighted sum of activities (net input) feeding into the output node for training pattern i. Thus,

$$\theta_i = \sum_j \beta_j a_j \qquad (7)$$

where $\beta_j$ is the weight of the connection from unit j to the output, and $a_j$ is the activation value of that unit.

One area of concern in using the likelihood function as the error function is the specification of a probabilistic mechanism for the data. In other words, if our output is binary, then we are assuming it is the result of a sequence of Bernoulli trials. What if we do not know the probabilistic mechanism, or what if the data deviate from the assumed probabilistic mechanism? Then the likelihood function can still be used as a quasi-likelihood model. This model only requires the specification of the first two moments: a specification of the relationship between the mean and the input variables, and a specification of the variance in terms of the mean response [13]. Thus the quasi-likelihood model provides a means to model the data without the specification of the probabilistic mechanism.

To estimate the weights using a second-order method requires the computation of first and second partial derivatives of the log likelihood with respect to the weights. We will derive expressions for these derivatives, although (as we shall see later) we will not need to explicitly compute them in our learning strategy. For the weights connected to the output ($\beta_i$), the vector of first derivatives (called the score vector in the statistical literature) is given by:

$$U(\beta) = \frac{\partial l}{\partial \beta} \qquad (8)$$
$$= \frac{\partial l}{\partial \theta} \frac{\partial \theta}{\partial \beta} \qquad (9)$$
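To make Equation 6 concrete, consider the Bernoulli case with $a(\phi) = 1$: the cumulant function is $b(\theta) = \ln(1 + e^{\theta})$, so $b'(\theta)$ is the logistic mean y and $b''(\theta) = y(1-y)$ is the variance. A quick numerical self-check:

```python
import numpy as np

# Bernoulli member of the exponential family (Equation 6), a(phi) = 1:
#   b(theta)   = ln(1 + e^theta)        (cumulant function)
#   b'(theta)  = 1 / (1 + e^{-theta})   = E[t] = y, the logistic output
#   b''(theta) = y (1 - y)              proportional to Var(y)
theta = 0.7                      # canonical parameter = net input (Equation 7)
y = 1.0 / (1.0 + np.exp(-theta))
var = y * (1.0 - y)

# Check that b'(theta) matches a finite difference of b(theta):
b = lambda th: np.log1p(np.exp(th))
assert abs((b(theta + 1e-6) - b(theta - 1e-6)) / 2e-6 - y) < 1e-6
```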

Computing the above partial derivatives on Equations 6 and 7, and using the fact that $b'(\theta) = y$, we get:

$$U(\beta) = X_{output}^T \left(\frac{t - y}{a(\phi)}\right) \qquad (10)$$

where $X_{output}$ is the matrix of activities $a_j$ into the output unit. Each row of $X_{output}$ corresponds to a particular training pattern, and is composed of the activation values of all units that connect to the output unit. Also, t is the vector of target (or desired) output values and y is the vector of actual outputs of the network corresponding to the different training patterns.

The matrix of second derivatives with respect to the output weights is computed using the chain rule in a manner similar to the computation of $U(\beta)$, and is given by:

$$H(\beta) = \frac{\partial^2 l}{\partial \beta\, \partial \beta^T} = -\frac{1}{a(\phi)} X_{output}^T W X_{output} \qquad (11)$$

where W is the following diagonal matrix:

$$W = \begin{bmatrix} Var(y_1) & 0 & \cdots & 0 \\ 0 & Var(y_2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & Var(y_n) \end{bmatrix} \qquad (12)$$

where $y_i$ is the actual output of the network for the ith training pattern. W is known as the weight matrix in the statistical literature; this is not to be confused with the vector of weights in the network, which we will represent as w.

For the weights from the input units to the hidden units, $\alpha$, we have to account for the activation function h of the hidden units. The corresponding score vector (computed using the chain rule in a manner similar to Equation 9) is given by:

$$U(\alpha) = \frac{\partial l}{\partial \alpha} = X_{input}^T G \left(\frac{t - y}{a(\phi)}\right) \qquad (13)$$

where $X_{input}$ is the matrix of input values into the hidden layer (row i of the matrix is composed of the activation values of the input units for the ith training pattern). In the statistical literature $X_{input}$ is called the design matrix, and is the matrix of input values for all training patterns (all observations of the predictors). G is the following diagonal matrix:

$$G = \begin{bmatrix} h'(x_1^T \alpha) & 0 & \cdots & 0 \\ 0 & h'(x_2^T \alpha) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & h'(x_N^T \alpha) \end{bmatrix} \qquad (14)$$

where $x_i$ is the ith row of the matrix $X_{input}$. Each non-zero term in G corresponds to the derivative of the activation function of the hidden units, computed at the net input into the hidden unit, for a given training pattern. Note that Equations 10 and 13 are standard representations of the first derivative of the error function (l in this case) with respect to the weights. A procedure like back-propagation can be used to compute these terms efficiently.

The matrix of second derivatives, $H(\alpha)$, requires the computation of the second partial derivative of the error function with respect to all possible pairs of weights in the network. There are five possible combinations of the kinds of weights chosen as the denominator in the second partial derivatives. The first case is when the second partial derivative is computed with respect to two hidden-to-output weights, as was the case in Equation 11. The other four combinations correspond to second partial derivatives with respect to input weights into the same hidden unit, input weights into different hidden units, one output weight and one hidden weight corresponding to the same hidden unit, and finally the case of one hidden unit weight and one output weight corresponding to different hidden units. We will not provide expressions for all of

these cases, but as an example of one case, consider the second partial derivative of the log likelihood with respect to input weights into the same hidden unit, $\partial^2 l / (\partial \alpha_{ki_1}\, \partial \alpha_{ki_2})$ [10]:

$$\frac{\partial^2 l}{\partial \alpha_{ki_1}\, \partial \alpha_{ki_2}} = \sum_p x_{pi_1} \left[ (t_p - y_p)\, h''(z_{kp})\, \beta_{H_k} - Var(y_p) \left\{ h'(z_{kp})\, \beta_{H_k} \right\}^2 \right] \frac{x_{pi_2}}{a(\phi)} \qquad (15)$$

where $\beta_{H_k}$ is the weight of the connection from hidden unit k to the output, and $z_{kp}$ is the net input into hidden unit k for training pattern p ($z_{kp} = x_p^T \alpha_k$).

The above expressions demonstrate that the computation of the matrix of second derivatives, H, of the error is rather complicated. Instead of explicitly computing H in our learning process, we will use a second-order method known as Fisher's scoring method to compute the weights of the network. The most familiar second-order method is the Newton-Raphson method (see Equation 1), which estimates weights iteratively using:

$$w_{k+1} = w_k - H(w_k)^{-1} U(w_k) \qquad (16)$$

where we have defined a new vector w composed of both kinds of weights used earlier, $w = \{\beta, \alpha\}$. Fisher's scoring method is similar to Newton-Raphson except that the Fisher's information matrix is used instead of the Hessian matrix. Fisher's information matrix is defined as:

$$F = -E\left[\frac{\partial^2 l}{\partial w\, \partial w^T}\right] \qquad (17)$$

(that is, the negative of the expected value of the Hessian). The weights are therefore computed by the following:

$$w_{k+1} = w_k + F(w_k)^{-1} U(w_k) \qquad (18)$$

Since the Fisher's information matrix depends on the current weights, this procedure is iterative. Applying the definition of Fisher's information matrix to the Hessian matrix, Equation 11, in the estimating equations for the output weights yields (ignoring the constant $a(\phi)$ term):

$$F_\beta = -E\left[-X_{output}^T W X_{output}\right] = X_{output}^T W X_{output} \qquad (19)$$

Notice that F in this case can be written as an inner product of two matrices. For the input to hidden unit weights, $\alpha$, some of the elements of the Hessian (see Equation 15 for an example) contain the term (t − y). In Fisher's scoring method, under the expectation the terms (t − y) vanish. In other words, when averaged over all the training data, the actual value of the output y equals the target output t. Therefore, under the expectation, the Fisher's information matrix for this case ($F_\alpha$) also has the form of an inner product of two matrices. These results imply that the overall Fisher's information matrix for the network (the combination of $F_\beta$ and $F_\alpha$) is an inner product of the form:

$$F = X_{aug}^T W X_{aug} \qquad (20)$$

where $X_{aug}$ is referred to as the augmented design matrix: an augmentation of the design matrix with additional columns that arise during the computation of F⁴. W is the weight matrix defined in Equation 12.

⁴ Note that an alternate definition of Fisher's information matrix is as the inverse covariance matrix of the parameters.

To illustrate the form of the augmented design matrix, consider the network depicted in Figure 1. This network has two input (predictor) variables, two hidden units, and skip layer connections (connections directly from the inputs to the output). Thus, there are eleven weights in the network that need to be estimated: 5 weights into the output unit and 6 weights into the hidden units. For one training pattern, a row of the augmented design matrix has the form:

$$[\,1 \;\; x_1 \;\; x_2 \;\; h(z_1) \;\; h(z_2) \;\; \beta_{H_1} h'(z_1) \;\; \beta_{H_1} h'(z_1) x_1 \;\; \ldots \;\; \beta_{H_2} h'(z_2) x_2\,] \qquad (21)$$

The first term (1) corresponds to the bias connection into the output unit. The second and third terms ($x_1$, $x_2$) correspond to the skip connections from the input units to the output unit. Next, we have terms ($h(z_1)$, $h(z_2)$) corresponding to the connections from the hidden units to the output. The next term ($\beta_{H_1} h'(z_1)$) corresponds to the bias weight to the first hidden unit. After this, we have similar terms that correspond to the connections into the hidden units. These "pseudo" variables are formed by multiplying the derivative of the hidden unit output, the hidden unit to output weight associated with this hidden unit, and the input variable. Each row of the augmented design matrix corresponds to a particular training pattern, and the entries in a row are formed in sequence as follows:

FORMATION OF THE AUGMENTED DESIGN MATRIX

1. For each skip layer connection, add the corresponding input value $x_i$ as a term in the row;

2. For each hidden unit to output connection, use $h(x^T \alpha)$;

3. For each input to hidden unit connection, form the product of the hidden unit's associated output weight $\beta_H$, the associated input variable, and the derivative of the associated hidden unit's output.

Using the augmented design matrix, notice that the score vector (the overall score vector for the network is a concatenation of Equations 10 and 13) becomes:

$$U(\beta, \alpha) = X_{aug}^T (t - y) \qquad (22)$$

(ignoring $a(\phi)$), and the overall Fisher's information matrix for the network is given by:

$$F(\beta, \alpha) = X_{aug}^T W X_{aug} \qquad (23)$$

where W is a diagonal matrix with $Var(y_i)$ along the diagonal (Equation 12). The weight updates are calculated as:

$$w_{new} = w_{old} + F^{-1} U \qquad (24)$$
$$F w_{new} = F w_{old} + U \qquad (25)$$

Rearranging and substituting the expressions from Equations 22 and 23 into Equation 25 yields:

$$X_{aug}^T W X_{aug}\, w_{new} = X_{aug}^T W \left[ X_{aug}\, w_{old} + W^{-1} (t - y) \right] \qquad (26)$$

Defining a new adjusted dependent variable as:

$$z = X_{aug}\, w_{old} + W^{-1} (t - y) \qquad (27)$$

implies that the estimation of the weights can be written in the following form:

$$(X_{aug}^T W X_{aug})\, w_{new} = (X_{aug}^T W)\, z \qquad (28)$$

We have now reduced the computation of the weights to a weighted least squares problem⁵ [19]; however, since both the adjusted dependent variable z and the augmented design matrix $X_{aug}$ depend on the current weights, the solution must be obtained by iteration. Once a new set of weights is obtained, a new augmented design matrix and adjusted dependent variable are calculated, and the process is repeated until convergence of the weights. Also, at convergence, the elements of the inverse Fisher's information matrix represent an estimate of the covariance matrix of the network's weights.

⁵ The standard least squares problem [18] involves finding a vector x that minimizes the Euclidean norm $||Ax - b||$, where A is of size n × m with n typically larger than m. This can be solved as the system of equations $A^T A x = A^T b$. The weighted version of the problem involves solving $A^T W A x = A^T W b$, which has the same form as Equation 28. In our case, if the number of training patterns is given by n, and the number of weights in the network is given by m, then the sizes of the matrices and vectors of interest are: $X_{aug}$ is n × m, W is n × n, w is of size m, and z is of size n.

This work can be extended to networks with more than one hidden layer. Consider a network with two hidden layers, depicted in Figure 2. Input weights into a hidden unit have the notation $\alpha_{kij}$. Here, the subscript i indicates the hidden layer number, the index k indicates the hidden unit number in layer i, and j indicates the unit number in layer (i − 1) that is providing the input into hidden unit k. In terms of forming the augmented design matrix, the only change is the addition of columns corresponding to the new hidden layer. The coefficients for the existing hidden layer and the output are calculated as before.

Figure 2: A multilayered architecture with two hidden layers.

For hidden layer 1 weights, the new variables added to the rows of the augmented design matrix have the form:

$$\frac{\partial l}{\partial \alpha_{k1j}} = \sum_{l=1}^{L} \beta_l\, h'(z_{l2})\, \alpha_{l2k}\, h'(z_{k1})\, x_j$$

Here L is the number of hidden units in layer 2, and $z_{nm}$ is the weighted sum into hidden unit n of layer m. This idea can then be extended in a similar manner to a greater number of hidden layers.
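Pulling Sections 2.1 and 2.2 together, the following is a compact sketch of the whole learning loop for a binary response with a logistic output unit (for a Gaussian response with identity output, W reduces to a constant diagonal). The weight ordering, random initialization, fixed iteration count, variance floor, and the optional decay argument (anticipating the weight-decay row augmentation of Section 3) are all our choices, not the authors':

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def augmented_design(X, alpha, beta):
    """Rows of X_aug (Equation 21) for a single-hidden-layer network with
    skip-layer connections: [1, x, h(z), beta_Hk * h'(z_k) * (1, x)]."""
    n, d = X.shape
    X1 = np.hstack([np.ones((n, 1)), X])      # bias column + inputs
    Z = X1 @ alpha.T                          # net inputs to the K hidden units
    H = sigmoid(Z)
    dH = H * (1.0 - H)                        # h'(z) for the logistic h
    beta_H = beta[d + 1:]                     # hidden-to-output weights
    blocks = [np.ones((n, 1)), X, H]
    for k in range(alpha.shape[0]):           # formation step 3: pseudo-variables
        blocks.append(beta_H[k] * dH[:, [k]] * X1)
    return np.hstack(blocks), H

def irls(X, t, K, iters=20, decay=0.0, seed=0):
    """Fisher scoring as IRLS (Equations 27-28), binary response."""
    n, d = X.shape
    m = 1 + d + K + K * (d + 1)               # total number of weights
    w = np.random.default_rng(seed).normal(scale=0.1, size=m)
    for _ in range(iters):
        beta = w[:1 + d + K]
        alpha = w[1 + d + K:].reshape(K, d + 1)
        Xa, H = augmented_design(X, alpha, beta)
        y = sigmoid(np.hstack([np.ones((n, 1)), X, H]) @ beta)
        Wd = np.clip(y * (1.0 - y), 1e-8, None)   # W = diag(Var(y_i))
        z = Xa @ w + (t - y) / Wd                 # adjusted dependent var (Eq. 27)
        A = np.sqrt(Wd)[:, None] * Xa             # W^{1/2} X_aug: solve Eq. 28 by a
        b = np.sqrt(Wd) * z                       # direct decomposition (Section 3)
        if decay > 0.0:                           # weight decay by row augmentation
            A = np.vstack([A, np.sqrt(decay) * np.eye(m)])
            b = np.concatenate([b, np.zeros(m)])
        w = np.linalg.lstsq(A, b, rcond=None)[0]
    return w
```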

3 Discussion

An advantage of the formulation of the information matrix given above is that the Fisher information matrix, $X_{aug}^T W X_{aug}$, is positive definite [20]. This is important for two reasons: first, it guarantees that the critical point (or stationary point) found by this second-order optimization method will be a minimizer and not a saddle point or maximizer. Second, it guarantees that the Fisher's scoring step is in a descent direction [21]. The true Hessian is not guaranteed to be positive definite, and often is not positive definite. This implies that an optimization method based on the Hessian might end up at a saddle point or maximum of the error function.

The augmented design matrix also allows learning using a second-order method without the explicit computation of the matrix of second derivatives. Instead, the method uses a direct decomposition of the augmented design matrix such as the QR decomposition [18]. From Equation 28, we see that the weighted least squares problem in this case is to find a w that minimizes

$$\left\| W^{1/2} X_{aug}\, w - W^{1/2} z \right\| \qquad (29)$$

where $W^{1/2}$ (the square root of the weight matrix) is a diagonal matrix with non-zero diagonal entries given by $\sqrt{Var[y_i]}$, and z is the adjusted dependent variable⁶. Note that F does not need to be computed explicitly in this case. Using a direct decomposition also improves the conditioning of the problem⁷. The condition number of the Fisher's information matrix, $X_{aug}^T W X_{aug}$, is the square of the condition number of $W^{1/2} X_{aug}$; thus $\kappa(X_{aug}^T W X_{aug}) = \kappa^2(W^{1/2} X_{aug})$. The smaller condition number of $W^{1/2} X_{aug}$ means that the computation of the weights is numerically more stable using a direct decomposition of the augmented design matrix instead of forming the Fisher's information matrix explicitly and solving the resulting estimating equations.

⁶ If we represent $A = W^{1/2} X_{aug}$ and $b = W^{1/2} z$, QR decomposition factors A = QR, where Q is composed of orthogonal columns and R is upper triangular. The solution w is then found by solving the system of equations $Rw = Q^T b$.
⁷ The condition number of a matrix is defined as the ratio of the largest singular value of the matrix to its smallest singular value, $\kappa = \sigma_1 / \sigma_n$.

Although QR decomposition is an efficient technique for computing the weights within the least squares formulation that we have proposed, it is still computationally expensive. For our case, with $X_{aug}$ of size n × m, the sequential implementation would be O(3nm² − m³), and since n > m, this is O(m³). However, there are efficient algorithms to implement QR factorization on a parallel machine. Most of these algorithms employ Givens' rotations to compute the factorization. There are a number of efficient parallel algorithms to carry out Givens' rotations [22, 23, 24, 25]. The classic parallel algorithm in this area was presented by Sameh and Kuck [24], and implements QR decomposition in 5(2n − 3) steps using m(3m − 2)/2 = O(m²) processors. Parallel implementations based on these algorithms, using novel load balancing techniques, have been developed for the SGI Power Challenge and Origin 2000 parallel machines [26].

Another concern that might arise is the high storage complexity of the method. The $X_{aug}$ matrix requires O(nm) storage (where n is the number of training patterns, and m is the number of weights in the network). This high complexity arises because the proposed method is a batch method, and asks for the data corresponding to all the training patterns to be used at once. One approach to reducing the storage requirements at any given time is to partition the whole training set into blocks, and train the network using one block at a time (see page 264 of [11] for details of this approach). Another approach to handling this problem is evident when we notice that the augmented columns in $X_{aug}$ are products of the columns of the original design matrix with derivatives of the output of the hidden units. It may be feasible to write code for a QR decomposition that calculates the augmented columns of the design matrix as needed. Thus, only the original design matrix would need to be stored. This way, we can trade off computation for storage depending on which one is cheaper in a particular implementation.

The method proposed here does not require the explicit computation or inversion of the Hessian matrix, and it guarantees that a minimum of the error function (and not a saddle point or a maximum) will be found. The proposed method is therefore desirable when compared with some of the other second-order methods. It is natural to ask how this method compares to first order methods such as back-propagation. It is obvious that a single iteration of a first order method will require less computation than a single iteration of the proposed method. However, there are a number of situations where the gradient information on its own does not provide enough information to result in fast convergence. For instance, an error function that has vastly different curvatures in different directions (such as a valley with steep sides that gently slopes towards a stationary point) can cause a steepest descent based method to take an inordinately large number of iterations to converge to the minimum. A second order method such as the one proposed here, on the other hand, can approach the minimum in very few iterations. A second order method would require less time overall to find the minimum value in such situations (see [27] for a comparison of steepest descent techniques against some second order methods on a number of real problems). In general, it might be possible to use a hybrid method that employs a gradient descent algorithm to get close to the minimum, and then use a few of the more expensive iterations of the proposed method to get to the minimum value quickly ([28] presents an alternate hybrid approach for a network with linear outputs).

One of the issues of concern with multilayered neural network learning is the slowness of the training process. Slow training has been attributed to the linear convergence of back-propagation [3, 29]; however, multicollinearity might also be part of the problem. Multicollinearity is the situation where a column of the

design matrix is strongly correlated with a linear combination of some other columns of the design matrix [30]. Examining the problem in terms of Fisher's information matrix, multicollinearity manifests itself in the form of small singular values of the information matrix in relation to its largest singular value [31]. This leads to a large condition number and indicates an ill-conditioned matrix. Multicollinearity implies that there is little information in the data about estimated weights corresponding to small singular values, or equivalently, that the variance of the estimated weights corresponding to the smallest singular values is large.

The issue of near multicollinearity in the multilayered feedforward neural network has been discussed in [14], but it can also be understood easily by examining the augmented design matrix. There are two situations under which near multicollinearity can arise. First, notice that if a hidden unit is at saturation, then the derivative term h'(z) is near zero. This implies that the columns of the augmented design matrix for that hidden unit are all close to zero. Thus, these columns are linearly dependent on the first column, which corresponds to the bias connection to the output unit and is a column of ones. The implication of this is that the Fisher's information matrix is nearly singular, and thus the weights corresponding to these linearly dependent columns have large variance. The second situation occurs when the input to hidden unit weights have small values (values close to zero). In this situation, the sigmoid activation function will be approximately linear and the derivative, h'(z), is approximately a constant. Now, each column associated with input weights into the hidden unit is linearly dependent on the columns with just the original input values. For example, the term $\beta_{H_1} h'(z_1) x_2$ is, up to a constant, approximately $x_2$ in this case. Another way to look at this is: if h(z) is linear, then it equals z and its derivative with respect to z is constant. By examining neural networks through the augmented design matrix, it is apparent that neural network models are inherently ill conditioned. The slow training of the multilayered feedforward neural network, usually attributed to the use of a gradient descent algorithm, may also be due in part to the ill-conditioning inherent in the architecture.

One method to handle the issue of multicollinearity, and thus to stabilize learning, is through the use of weight decay (also known as penalized likelihood) [32]. This approach is based on regularization theory [33, 34], and is easily incorporated into the iteratively reweighted least squares algorithm we have proposed by row augmenting the design matrix and the adjusted dependent variable vector with a diagonal matrix of penalty parameters and a vector of zeros, respectively [35]. Thus Equation 29 becomes

$$\begin{bmatrix} W^{1/2} X_{aug} \\ \sqrt{2\lambda}\, I \end{bmatrix} w = \begin{bmatrix} W^{1/2} z \\ 0 \end{bmatrix}$$

where $\lambda$ is the weight decay term and I is the identity matrix.

Perhaps the greatest advantage of the augmented design matrix formulation is the ability to analyze a particular problem using regression diagnostics. Diagnostics are used in regression problems to assess the fit and sensitivity generated by the prediction model [36]. In particular, we are interested in determining how one or a small number of training patterns affect the weights of the network. Information of this sort is typically hard to obtain in traditional neural network settings. The effect of a training pattern on the fit produced by the network can be assessed by the method of deletion. This is a process where an individual pattern is dropped from the training set, and the weights determined without it. The notation used to represent the weights when pattern i has been dropped is $w_{(i)}$. One way to determine the weights without pattern i would be to take one step in the Newton-Raphson direction computed using the reduced design matrix [37]. However, recall that we are using Fisher's scoring method, and thus the expected information instead of the observed information. A natural compromise, therefore, is to take one step in the scoring direction; this procedure is termed "MLV"⁸ diagnostics [38]. The adjusted weights could be computed using:

$$w_{(i)} \approx w + \left[ X_{aug(i)}^T W_{(i)} X_{aug(i)} \right]^{-1} X_{aug(i)}^T W_{(i)} \left( t_{(i)} - y_{(i)} \right) \qquad (30)$$

⁸ MLV stands for Moolgavkar, Lustbader, and Venzon, the originators of this method.
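A sketch of Equation 30 as one Fisher scoring step on the reduced data, assuming the converged augmented design matrix, W diagonal, fitted outputs y, and weights w from the IRLS fit are at hand:

```python
import numpy as np

def one_step_deletion(Xaug, Wdiag, t, y, w, i):
    """MLV-style one-step approximation to w_(i) (Equation 30):
    one scoring step from the converged weights w, with training
    pattern i removed from the data."""
    keep = np.arange(len(t)) != i              # drop pattern i
    Xi, Wi = Xaug[keep], Wdiag[keep]
    Fi = Xi.T @ (Wi[:, None] * Xi)             # X_(i)^T W_(i) X_(i)
    Ui = Xi.T @ (Wi * (t[keep] - y[keep]))     # X_(i)^T W_(i) (t_(i) - y_(i))
    return w + np.linalg.solve(Fi, Ui)
```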

From Equation 30, many diagnostics such as standardized residuals, DFBETAS, DFFIT, DFFITS, and Cook's distance can be calculated [36, 39]. One of the most useful diagnostics is Cook's distance because it summarizes the effect on the weights of an individual training pattern. Cook's distance is defined as:

$$C_i = \frac{\left( w - w_{(i)} \right)^T X_{aug}^T W X_{aug} \left( w - w_{(i)} \right)}{k}$$

where k is the rank of $X_{aug}$.

Another type of diagnostic, whose use will be demonstrated in the next section, is based on the singular value decomposition [40]. If we let A denote $W^{1/2} X_{aug}$, then Fisher's information matrix is given by $A^T A$. This matrix provides a measure of the information in the data about the weights. We can use this approach to estimate the information about any linear combination of the weights in any linear combination of the training patterns. For example, the information about the linear combination of weights $v^T w$ in the linear combination of outcomes $u^T t$ is

$$(v^T A^T u)(u^T A v) \qquad (31)$$

where v and u are the column vectors that define the respective linear combinations of the weights and outcomes. Thus, if u is a vector of all ones and v is a vector of zeroes with a single one, then Equation 31 will yield the inverse of the variance of the weight chosen by the non-zero entry in v.

As an example of how this last methodology may be useful, consider a neural network model with two input nodes, one hidden layer consisting of two nodes, and one output node (with no skip-layer connections). This network has nine weights that need to be estimated. Suppose we are interested in finding the training patterns that are most influential in the determination of the weights associated with the first hidden unit, i.e., the weights from the two input nodes and the bias unit to the hidden node, and from the hidden node to the output node. Thus, the vector v has ones for the weights associated with the hidden unit and zeroes elsewhere: $[0\ 1\ 0\ 1\ 1\ 1\ 0\ 0\ 0]^T$ (refer to Equation 21 to see the order in which entries are made). For each training pattern i, u is set to the unit vector with a 1 in the ith position. Then $(v^T A^T u_i)(u_i^T A v)$ computes the information from pattern i in the linear combination of the weights associated with the hidden unit. This calculation can be repeated for each training pattern, and those patterns with the largest information contribution to the hidden unit provide insight into the features in the input space predicted by the hidden unit. For example, if the patterns with the highest information for a particular hidden unit were all in the same region of the input space, then that hidden unit would be contributing to fitting the surface in that region of the input space. An example to show the application of this approach is presented in the next section.
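Because $u_i$ is a unit vector, the quantity $(v^T A^T u_i)(u_i^T A v)$ in Equation 31 is simply the square of the ith entry of Av, so the per-pattern information for a chosen v can be computed in one line. A sketch; the usage lines for the nine-weight example above are hypothetical:

```python
import numpy as np

def pattern_information(A, v):
    """Per-pattern information about the weight combination v^T w
    (Equation 31), with u ranging over the unit vectors e_i.
    A is W^{1/2} X_aug from the converged fit; v selects weights."""
    Av = A @ v          # u_i^T A v is the i-th entry of A v
    return Av ** 2      # (v^T A^T e_i)(e_i^T A v) for every pattern i

# Hypothetical usage for the nine-weight example network above:
# v = np.array([0, 1, 0, 1, 1, 1, 0, 0, 0], dtype=float)
# info = pattern_information(A, v)
# top = np.argsort(info)[::-1]   # patterns most influential for hidden unit 1
```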

4 Simulation Results

The methodology proposed in this paper was tested on two simulated problems. The first test problem selected is a simple interaction problem of the form:

$$g(x_1, x_2) = 10.391 \left[ (x_1 - 0.4)(x_2 - 0.6) + 0.36 \right] \qquad (32)$$

where the domain of both $x_1$ and $x_2$ is [0, 1] [41]. A plot of the true surface is given in Figure 3. Notice the curvature of the plane corresponding to the interaction. A training data set of 250 points was generated by drawing values for $x_1$ and $x_2$ from a uniform distribution over the domain [0, 1]. The target data were generated from Equation 32 with the addition of random noise from a Gaussian distribution with mean 0 and standard deviation of 0.25. In addition, a cross-validation set of the same size, 250, was generated in the same manner. This validation set was used to estimate the decay (shrinkage) parameter. The fit of the network was evaluated on a grid of 2,500 evenly spaced points over the domain of the input space.

Figure 3: The surface of the simple interaction problem.

Figure 4: The fit from a multilayered feedforward neural network trained on the 250 training points.

A network with three hidden units and skip layer connections was trained using the 250 training points. The best network in terms of predicted performance on the cross-validation (holdout) set had a minus log likelihood of −1,735. This network converged in 12 iterations with a weight decay term of 0.075. The resulting fit as evaluated over the 2,500 equally spaced testing points is illustrated in Figure 4.

For comparison with another second-order method, two networks, with four hidden units each, were built. No weight decay was used with either of these networks because we wanted to evaluate the performance of the Hessian matrix versus the Fisher's information matrix. The first network used the standard Newton's step computed using the Hessian matrix. After 57 iterations, the Hessian was still not positive definite. At convergence, after 85 iterations, this network had a minus log likelihood of −1,695.71. The second network used the Fisher's information matrix in place of the Hessian matrix, as described previously. This network converged to a minus log likelihood of −1,710 after only 15 iterations. Using the Fisher's information matrix in place of the Hessian matrix will not always provide superior results, but for this example it is clear that it offers certain advantages.

The other issue that was brought up in Section 3 was the multicollinearity or ill-conditioning of the design matrix. To illustrate how multicollinearity affects the network, we again use a network with 4 hidden units. Generating 10 sets of random starting weights and performing a singular value decomposition of the design matrix [18], the condition number is calculated for each set. Over these ten sets of weights, the average condition number is 1.30 × 10^7, with a minimum of 1.31 × 10^5 and a maximum of 7.04 × 10^7. The largest singular value has a small range, from 17.26 down to 16.57, indicating that the condition number is highly dependent on the smallest singular value. The smallest singular value is dependent upon the multicollinearity induced by the choice of initial starting weights. This same exercise is then repeated for a network with 10 hidden units. Here the average condition number is 1.79 × 10^12, with a minimum of 2.12 × 10^9 and a maximum of 8.60 × 10^12. A similar observation of the dependence on the smallest singular value is also noted for this experiment. This example indicates the need for the weight decay discussed in Section 3 to handle the multicollinearity.

The next problem considered is a complex interaction problem of the form:

$$g(x_1, x_2) = 2.57 + 1.9\, e^{x_1} \sin\!\left[ 13 (x_1 - 0.6)^2 \right] e^{-x_2} \sin(7 x_2)$$


This complex surface is illustrated in Figure 5.
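For reproducibility, the two synthetic surfaces and the sampling scheme described above can be generated along these lines. The random seed is our choice, and we reuse the 0.25 noise standard deviation for the complex problem, which the text implies but does not state explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)   # seed is our choice

def simple_interaction(x1, x2):
    # Equation 32
    return 10.391 * ((x1 - 0.4) * (x2 - 0.6) + 0.36)

def complex_interaction(x1, x2):
    # The complex interaction surface of Figure 5
    return (2.57 + 1.9 * np.exp(x1) * np.sin(13 * (x1 - 0.6) ** 2)
            * np.exp(-x2) * np.sin(7 * x2))

def make_data(g, n=250, sd=0.25):
    # n uniform inputs on [0,1]^2, Gaussian noise added to the targets
    x = rng.uniform(0.0, 1.0, size=(n, 2))
    t = g(x[:, 0], x[:, 1]) + rng.normal(0.0, sd, size=n)
    return x, t

X_train, t_train = make_data(simple_interaction)   # training set
X_val, t_val = make_data(simple_interaction)       # cross-validation set
```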

Figure 5: The true surface of the complex interaction problem.

Figure 6: The fit of a multilayered feedforward neural network built on the 250 training points for the complex interaction problem.

The same types of data sets were generated for this example. A network with 4 hidden units converged in 25 iterations with a value of the minus log likelihood equal to −1,131. The fitted network surface is illustrated on a uniform grid of 2,500 points in Figure 6. Notice that this network is picking up the complex features of the surface. The small knoll in the surface located at small values of $x_1$ and $x_2$ is apparent. The large hill located at large values of $x_1$ is also visible, although the network is not picking up the crest of the hill well. This is a problem with the crest being located at the end of the domain of the data. Near the edge of the domain of the inputs, the data is sparse.


Features near the edge of the domain are thus difficult to model. This is why extrapolation past the edge of the domain of the inputs is highly discouraged. The network did find the upturned surface at the back corner ($x_1$ and $x_2$ equal to 1) and the small upturn in the front corner ($x_1$ and $x_2$ equal to 0). Considering that the network is built using only 250 training points, it is doing well to pick up these features.

It is interesting to briefly explore the potential for the use of the diagnostics described in Section 3 on this problem. The information matrix is employed to try and discover what features in Figure 6 the hidden units are modeling. Remember from Section 3 that the information about $v^T w$ in $u^T t$ is $(v^T A^T u)(u^T A v)$ (where A is the product of the square root of the weight matrix, $W^{1/2}$, and the augmented design matrix, $X_{aug}$). Thus if u is the identity vector $e_i$ and v is a vector with a 1 in the positions corresponding to the weights associated with a particular hidden unit and zeroes in the other positions, then $(v^T A^T u)(u^T A v)$ is the information in training pattern i about the weights associated with that hidden unit. For example, setting $u = \{1\ 0\ 0 \cdots 0\ 0\}$ and $v = \{0\ 0\ 0\ 1\ 0\ 0\ 0\ 1\ 1\ 1\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\}$ yields the information from the first training data point in the first hidden unit of the network. The information in each of the four hidden units is calculated for all training data points.

Figure 7: Plot of the region of the input space that explains the information in the hidden units. The numbers plotted at each point correspond to the respective hidden unit.

For the first hidden unit, 27 of the 250 training data points explain 50.1% of the information in the hidden unit's weights. For the second hidden unit, 28 of the 250 observations explain 49.4% of the information in the hidden unit's weights. For the third hidden unit, 44 of the 250 observations explain 50.4% of the information, and for the fourth hidden unit, 39 of the 250 observations explain 50.3% of the information. These points are plotted in Figure 7. The axes of Figure 7 cover the entire domain of the input space. The number plotted at each point corresponds to the number of the hidden unit most influenced by this data point. For example, a point plotted with the symbol 1 is a data point explaining the information in the first hidden unit.

The points in Figure 7 define the regions of the input space that determine the weights of the hidden units. Hidden unit 2 heavily depends on the data in the large hill located where $x_1$ has large values. Likewise, it appears that the first hidden unit is being determined by the features in the back right corner of the plot in Figure 6. Hidden unit 3 is dominated by the valley and hill that occur for large values of $x_1$. The weights associated with hidden unit 4 are being determined by the diagonal ridge across the center of Figure 6. It is interesting to note that the small knoll in the front of the figure is not a dominant feature in terms of the information in the weights of any of the hidden units.

5 Conclusion

We have demonstrated a second-order learning scheme for a multilayered feedforward neural network using iteratively reweighted least squares. This method has certain advantages when compared to other second-order methods. The proposed method uses the Fisher's information matrix instead of the Hessian matrix to compute the search direction. Since this matrix is formulated as an inner product, it is guaranteed to be positive definite. This ensures that each step of the optimization procedure will be in a descent direction and that the solution found by the method will be a minimum and not a saddle point or a maximum. In addition, at convergence, the Fisher's information matrix yields estimates of the variance of the weights. Another advantage is that standard least squares software can be used to implement the method. However, the method remains a high computational complexity algorithm when compared to an efficient first-order method. One approach to reducing this complexity is to use a parallel machine, as the underlying high complexity step (that of computing the QR factors) is easily and efficiently parallelized.

The formulation also yielded insight into the multicollinearity problem in multilayered networks. By examining the augmented design matrix, it was clear that the multilayered feedforward neural network is an ill-conditioned problem, thus contributing to slow training regardless of whether a first-order method, such as backpropagation, or a second-order method is used. This inherent multicollinearity implies that a regularization scheme such as weight decay is necessary to obtain good convergence.

There are a number of situations where the error function of a problem causes first order learning methods such as back-propagation to converge slowly (see [27] for the performance of a variety of methods on some such problems). In such situations, a second order method (either alone, or as part of a hybrid scheme where the second order method is applied after a few steps of a first order method) can yield faster overall runtimes. The second order method proposed in this work has been shown to have some desirable properties as compared to some of the other methods, and could be useful in these situations. In addition, the formulation developed in this paper allows the application of regression diagnostics to neural networks, thus making it possible to analyze the prediction behavior of a network in detail.

References

[1] James L. McClelland and David E. Rumelhart, Explorations in Parallel Distributed Processing, MIT Press, Cambridge, Massachusetts, 1988.
[2] P. J. Werbos, Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences, Ph.D. thesis, Harvard University, 1974.
[3] Robert A. Jacobs, "Increased rates of convergence through learning rate adaptation," Neural Networks, vol. 1, pp. 295-308, 1988.
[4] D. Plaut, S. Nowlan, and G. E. Hinton, "Experiments on learning by back propagation," Tech. Rep. CMU-CS-86-126, Carnegie Mellon University, 1986.
[5] R. Battiti, "Accelerated backpropagation learning: two optimization methods," Complex Systems, vol. 3, pp. 331-342, 1989.
[6] T. P. Vogl, J. K. Mangis, A. K. Rigler, W. T. Zink, and D. L. Alkon, "Accelerating the convergence of the back-propagation method," Biological Cybernetics, vol. 59, pp. 257-263, 1988.
[7] R. L. Watrous, "Learning algorithms for connectionist networks: Applied gradient methods of nonlinear optimization," in Proceedings IEEE First International Conference on Neural Networks, San Diego, 1987, IEEE, vol. 2, pp. 619-627.
[8] Roberto Battiti, "First- and second-order methods for learning: Between steepest descent and Newton's method," Neural Computation, vol. 4, pp. 141-166, 1992.
[9] S. Becker and Y. LeCun, "Improving the convergence of back-propagation learning with second-order methods," in Proceedings of the 1988 Connectionist Models Summer School, D. S. Touretzky, J. L. Elman, T. J. Sejnowski, and G. E. Hinton, Eds., San Mateo, CA, 1988, Morgan Kaufmann.
[10] C. M. Bishop, "Exact calculation of the Hessian matrix for the multilayer perceptron," Neural Computation, vol. 4, pp. 494-501, 1992.
[11] Christopher M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1995.
[12] H. Akaike, "Information theory and an extension of the maximum likelihood principle," in Second International Symposium on Information Theory, B. N. Petrov and F. Caski, Eds., Budapest, 1973, pp. 267-281, Akademiai Kaido. Reprinted in Breakthroughs in Statistics, eds. Kotz, S. & Johnson, N. L. (1992), volume I, pp. 599-624, New York: Springer.
[13] P. McCullagh and J. A. Nelder, Generalized Linear Models, Chapman & Hall, London, 1989.
[14] S. Saarinen, R. Bramley, and G. Cybenko, "Ill-conditioning in neural network training problems," SIAM Journal of Scientific Computing, vol. 14, no. 3, pp. 693-714, May 1993.
[15] J. B. Hampshire and B. Pearlmutter, "Equivalence proofs for multilayer perceptron classifiers and the Bayesian discriminant function," in Proceedings of the 1990 Connectionist Models Summer School, D. S. Touretzky, J. L. Elman, T. J. Sejnowski, and G. E. Hinton, Eds., San Mateo, CA, 1990, pp. 159-172, Morgan Kaufmann.
[16] S. A. Solla, E. Levin, and M. Fleisher, "Accelerated learning in layered neural networks," Complex Systems, vol. 2, pp. 625-639, 1988.
[17] George Casella and Roger L. Berger, Statistical Inference, Statistics/Probability Series, Brooks/Cole Publishing Company, Belmont, California, 1990.
[18] Gene H. Golub and Charles F. Van Loan, Matrix Computations, The Johns Hopkins University Press, Baltimore, MD, third edition, 1996.
[19] A. R. Webb, "Functional approximation by feed-forward networks: a least-squares approach," IEEE Transactions on Neural Networks, vol. 5, pp. 363-371, 1994.
[20] Ronald R. Hocking, The Analysis of Linear Models, Brooks/Cole Publishing Company, Monterey, California, 1985.
[21] J. E. Dennis Jr. and Robert B. Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations, Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1983.
[22] F. T. Luk, "A rotation method for computing the QR decomposition," SIAM J. Sci. Statist. Comput., vol. 7, pp. 452-459, 1986.
[23] J. J. Modi and M. R. B. Clarke, "An alternative Givens ordering," Numerical Mathematics, vol. 43, pp. 83-90, 1984.
[24] A. H. Sameh and D. J. Kuck, "On stable parallel linear system solvers," Journal of the Association for Computing Machinery, vol. 25, no. 1, pp. 81-91, January 1978.
[25] Michel Cosnard and Denis Trystram, Parallel Algorithms and Architectures, PWS Publishing Company, Boston, MA, 1995.
[26] Jeff Boleng and Manavendra Misra, "Parallel modeling of rough surface electromagnetic scattering," in Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'97), June 1997.
[27] A. R. Webb, D. Lowe, and M. D. Bedworth, "A comparison of nonlinear optimisation strategies for feed-forward adaptive layered networks," Tech. Rep. RSRE Memorandum No. 4157, Royal Signals and Radar Establishment, July 1988.
[28] A. R. Webb and D. Lowe, "A hybrid optimisation strategy for adaptive feed-forward layered networks," Tech. Rep. RSRE Memorandum No. 4193, Royal Signals and Radar Establishment, September 1988.
[29] R. S. Sutton, "Two problems with backpropagation and other steepest-descent learning procedures for networks," Proceedings of the Eighth Annual Conference of the Cognitive Science Society, pp. 823-831, 1986.
[30] Ronald A. Thisted, Elements of Statistical Computing, Chapman and Hall, New York, New York, 1988.
[31] S. D. Silvey, "Multicollinearity and imprecise estimation," Journal of the Royal Statistical Society, Series B, vol. 31, pp. 539-552, 1969.
[32] G. E. Hinton, "Learning distributed representations of concepts," in Proceedings of the Eighth Annual Conference of the Cognitive Science Society (Amherst, 1986), Hillsdale, 1986, pp. 1-12, Erlbaum.
[33] F. Girosi, M. Jones, and T. Poggio, "Regularization theory and neural networks architectures," Neural Computation, vol. 7, pp. 219-269, 1995.
[34] T. Poggio and F. Girosi, "Regularization algorithms for learning that are equivalent to multilayer networks," Science, vol. 247, pp. 978-982, 1990.
[35] William J. Kennedy and James E. Gentle, Statistical Computing, Marcel Dekker, Inc., 1980.
[36] David A. Belsley, Edwin Kuh, and Roy E. Welsch, Regression Diagnostics: Identifying Influential Data and Sources of Collinearity, Wiley, New York, 1980.
[37] Daryl Pregibon, "Logistic regression diagnostics," The Annals of Statistics, vol. 9, no. 4, pp. 705-724, 1981.
[38] Suresh H. Moolgavkar, Edward D. Lustbader, and David J. Venzon, "A geometric approach to nonlinear regression diagnostics with application to matched case-control studies," Annals of Statistics, vol. 12, pp. 816-826, 1984.
[39] R. Dennis Cook and Sanford Weisberg, Residuals and Influence in Regression, Chapman and Hall, New York, 1982.
[40] John A. Nelder, "An alternative interpretation of the singular-value decomposition in regression," The American Statistician, vol. 39, pp. 63-64, 1985.
[41] Jeng-Neng Hwang, Shyh-Rong Lay, Martin Maechler, R. Douglas Martin, and James Schimert, "Regression modeling in back-propagation and projection pursuit learning," IEEE Transactions on Neural Networks, vol. 5, no. 3, pp. 342-353, 1994.