Multiple optimal learning factors for feed-forward networks


Sanjeev S. Malalur and Michael T. Manry
Department of Electrical Engineering, University of Texas at Arlington, Arlington, TX 76013

ABSTRACT

A batch training algorithm for feed-forward networks is proposed which uses Newton's method to estimate a vector of optimal learning factors, one for each hidden unit. Backpropagation, using this learning factor vector, is used to modify the hidden units' input weights. Linear equations are then solved for the network's output weights. Elements of the new method's Gauss-Newton Hessian matrix are shown to be weighted sums of elements from the total network's Hessian. In several examples, the new method performs better than backpropagation and conjugate gradient, with similar numbers of required multiplies. The method performs as well as or better than Levenberg-Marquardt, with several orders of magnitude fewer multiplies due to the small size of its Hessian.

Keywords: Neural networks, multilayer perceptron, output weight optimization, backpropagation, orthogonal least squares, multiple optimal learning factors, linear dependence, Newton's method, Gauss-Newton Hessian

1. INTRODUCTION

Feed-forward neural networks such as the multi-layer perceptron (MLP) are statistical tools widely used for regression and classification applications in the areas of parameter estimation1,2, document analysis and recognition3, finance and manufacturing4, and data mining5. The MLP draws its computing power from a layered, parallel architecture and has several favorable properties, such as universal approximation6 and the ability to mimic Bayes discriminants7 and maximum a-posteriori (MAP) estimates8. Existing learning algorithms include first order methods such as backpropagation9 (BP) and conjugate gradient10, and second order learning methods related to Newton's method. Since Newton's method for the MLP often has non-positive definite11,12 or even singular Hessians, Levenberg-Marquardt13,14 (LM) and other methods are used instead. In this paper, Newton's method is used to obtain a vector of optimal learning factors, one for each MLP hidden unit.

Section 2 reviews MLP notation, a simple first order training method, and an expression for the optimal learning factor15 (OLF). The multiple optimal learning factor (MOLF) method is introduced in section 3. Section 4 presents a discussion of the effect of linearly dependent inputs and hidden units on learning with the proposed MOLF algorithm. Results and conclusions are presented in sections 5 and 6.

2. REVIEW OF MULTI-LAYER PERCEPTRON

In this section, MLP notation is introduced and a convergent first order training method is described.

2.1. MLP notation

In the fully connected MLP of figure 1, input weights w(k,n) connect the nth input to the kth hidden unit. Output weights woh(m,k) connect the kth hidden unit's activation op(k) to the mth output yp(m), which has a linear activation. The bypass weight woi(m,n) connects the nth input to the mth output. The training data, described by the set {xp, tp}, consists of N-dimensional input vectors xp and M-dimensional desired output vectors tp. The pattern number p varies from 1 to Nv, where Nv denotes the number of training vectors present in the data set. In order to handle thresholds in the hidden and output layers, the input vectors are augmented by an extra element xp(N+1), where xp(N+1) = 1, so xp = [xp(1), xp(2), ..., xp(N+1)]T. Let Nh denote the number of hidden units. The vector of hidden layer net functions, np, and the actual outputs of the network, yp, can be written as


\[ \mathbf{n}_p = \mathbf{W}\cdot\mathbf{x}_p, \qquad \mathbf{y}_p = \mathbf{W}_{oi}\cdot\mathbf{x}_p + \mathbf{W}_{oh}\cdot\mathbf{o}_p \tag{1} \]

where the kth element of the hidden unit activation vector op is calculated as op(k) = f(np(k)) and f(·) denotes the hidden layer activation function. Training an MLP typically involves minimizing the mean squared error between the desired and the actual network outputs, defined as

\[ E = \frac{1}{N_v}\sum_{p=1}^{N_v}\sum_{m=1}^{M}\left[t_p(m) - y_p(m)\right]^2 \tag{2} \]

Figure 1: A fully connected multi-layer perceptron
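As an illustration of the notation in (1), a minimal NumPy sketch of the forward pass for one augmented input vector is given below. The sigmoid activation, the array shapes, and the function names are assumptions made for this example rather than details fixed by the paper.

```python
import numpy as np

def sigmoid(n):
    # Hidden layer activation f(.); a sigmoid is assumed here for illustration.
    return 1.0 / (1.0 + np.exp(-n))

def mlp_forward(xp, W, Woi, Woh):
    """Forward pass of the fully connected MLP of equation (1).

    xp  : augmented input vector of length N+1 (last element is the constant 1)
    W   : Nh x (N+1) input weight matrix
    Woi : M x (N+1) bypass weight matrix
    Woh : M x Nh output weight matrix
    """
    net = W @ xp               # hidden layer net functions n_p
    op = sigmoid(net)          # hidden unit activations o_p(k) = f(n_p(k))
    yp = Woi @ xp + Woh @ op   # linear output layer, equation (1)
    return yp, op
```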

2.2. Training using output weight optimization-backpropagation

In a given iteration of output weight optimization-backpropagation (OWO-BP), we (i) find the output weight matrices Woh and Woi connected to the network outputs, and (ii) separately train the input weights W using BP. Output weight optimization (OWO) is a technique for solving for the weights connected to the actual outputs of the network16. Since the outputs have linear activations, finding the weights connected to the outputs is equivalent to solving a system of linear equations. The expression for the actual outputs given in (1) can be rewritten as

\[ \mathbf{y}_p = \mathbf{W}_o\cdot\tilde{\mathbf{x}}_p \tag{3} \]

where x̃p = [xpT, opT]T is the augmented input vector and Wo = [Woi : Woh] denotes all the weights connected to the outputs. x̃p is a column vector of size Nu, where Nu = N + Nh + 1, and Wo is M by Nu. The output weights can be found by setting ∂E/∂Wo = 0, which leads to a set of linear equations given by

\[ \mathbf{C}_a = \mathbf{W}_o\cdot\mathbf{R}_a^T \tag{4} \]

where

\[ \mathbf{C}_a = \frac{1}{N_v}\sum_{p=1}^{N_v}\mathbf{t}_p\,\tilde{\mathbf{x}}_p^T, \qquad \mathbf{R}_a = \frac{1}{N_v}\sum_{p=1}^{N_v}\tilde{\mathbf{x}}_p\,\tilde{\mathbf{x}}_p^T \]

Equation (4) is most easily solved using orthogonal least squares17 (OLS). In the second half of an OWO-BP iteration, the input weight matrix W is updated as

\[ \mathbf{W} \leftarrow \mathbf{W} + z\cdot\mathbf{G} \tag{5} \]


where G is a generic direction matrix that contains information about the direction of learning, and the learning factor z contains information about the step length to be taken in the direction G. The weight update z·G in equation (5) can also be denoted as

\[ z\cdot\mathbf{G} = \Delta\mathbf{W} \tag{6} \]

For backpropagation, the direction matrix is simply the Nh by (N+1) negative input weight Jacobian matrix computed as

\[ \mathbf{G} = \frac{1}{N_v}\sum_{p=1}^{N_v}\boldsymbol{\delta}_p\,\mathbf{x}_p^T \tag{7} \]

Here δp = [δp(1), δp(2), ..., δp(Nh)]T is the Nh by 1 column vector of hidden unit delta functions9. A description of OWO-BP is given below. For every training epoch:

i. Solve the system of linear equations in (4) using OLS and update the output weights Wo.
ii. Find the negative Jacobian matrix G described in equation (7).
iii. Update the input weights W using equation (5).

This method is attractive for several reasons. First, training is fast, since finding the weights connected to the outputs is equivalent to solving linear equations. Second, it helps avoid some local minima. Third, the method exhibits improved training performance compared to using only BP to update all the weights in the network.

2.3. Optimal learning factor

The choice of the learning factor z in equation (5) has a direct effect on the convergence rate of OWO-BP. Early steepest descent methods used a fixed constant, with slow convergence, while later methods used a heuristic scaling approach to modify the learning factor between iterations and thus speed up convergence. However, using a Taylor series for the error E in equation (2), a non-heuristic optimal learning factor (OLF) for OWO-BP can be derived15 as

\[ z = \frac{-\partial E/\partial z}{\partial^2 E/\partial z^2} \tag{8} \]

where the numerator and denominator derivatives are evaluated at z = 0. The expression for the second derivative of the error with respect to the OLF is found using (2) as

\[ \frac{\partial^2 E}{\partial z^2} = \sum_{k=1}^{N_h}\sum_{j=1}^{N_h}\sum_{n=1}^{N+1}\sum_{i=1}^{N+1} g(k,n)\,\frac{\partial^2 E}{\partial w(k,n)\,\partial w(j,i)}\,g(j,i) = \sum_{k=1}^{N_h}\sum_{j=1}^{N_h}\mathbf{g}_k^T\,\mathbf{H}_R^{k,j}\,\mathbf{g}_j \tag{9} \]

where column vector gk contains the elements g(k,n) of G for all values of n. HR is the reduced size input weight Hessian with Niw rows and columns, where Niw = (N+1)·Nh is the total number of input weights. HRk,j contains the elements of HR for all input weights connected to the jth and kth hidden units and has size (N+1) by (N+1). When Gauss-Newton12 updates are used, elements of HR are computed as

\[ \frac{\partial^2 E}{\partial w(j,i)\,\partial w(k,n)} = \frac{2}{N_v}\,u(j,k)\sum_{p=1}^{N_v} x_p(i)\,x_p(n)\,o'_p(j)\,o'_p(k), \qquad u(j,k) = \sum_{m=1}^{M} w_{oh}(m,j)\,w_{oh}(m,k) \tag{10} \]

where o'p(k) denotes the first partial derivative of op(k) with respect to its net function. Because (10) represents the Gauss-Newton approximation to the Hessian, it is positive semi-definite. Equation (9) shows that (i) the OLF can be obtained from elements of the Hessian HR, (ii) HR contains useful information even when it is singular, and (iii) a smaller nonsingular Hessian ∂2E/∂z2 can be constructed using HR. Therefore, it may be profitable to construct Hessian matrices of intermediate size for use in Newton's algorithm.
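To make the OLF computation concrete, the sketch below evaluates (8) using the Gauss-Newton quantities of (9) and (10). The sigmoid hidden units, the NumPy array layout, and the assumption that G is the BP direction of (7) are illustrative choices, not requirements of the paper.

```python
import numpy as np

def sigmoid(n):
    return 1.0 / (1.0 + np.exp(-n))

def optimal_learning_factor(X, T, W, Woi, Woh, G):
    """Evaluate the OLF of equation (8) with the Gauss-Newton form of (9)-(10).

    X : Nv x (N+1) augmented inputs, T : Nv x M desired outputs,
    W : Nh x (N+1) input weights, Woi : M x (N+1), Woh : M x Nh,
    G : Nh x (N+1) BP direction (negative input weight Jacobian).
    """
    Nv = X.shape[0]
    O = sigmoid(X @ W.T)               # hidden unit activations o_p(k)
    Oprime = O * (1.0 - O)             # o'_p(k) for the sigmoid assumed here
    Y = X @ Woi.T + O @ Woh.T          # current network outputs y_p(m)
    Err = T - Y                        # t_p(m) - y_p(m)
    A = Oprime * (X @ G.T)             # o'_p(k) * sum_n g(k,n) x_p(n)
    B = A @ Woh.T                      # first order change in y_p(m) per unit z
    num = (2.0 / Nv) * np.sum(Err * B)     # -dE/dz at z = 0
    den = (2.0 / Nv) * np.sum(B * B)       # Gauss-Newton d2E/dz2 from (9)-(10)
    return num / max(den, 1e-12)           # guard against vanishing curvature
```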


2.4. Convergence of OWO-BP

To show convergence of OWO-BP, we have to show that both the OWO and BP stages are convergent.

Lemma 1: When OWO-BP training is used with an optimal learning factor (OLF), the training MSE, E, converges.

During the first part of a given iteration, OWO solves a system of linear equations and can only decrease the error E or leave it the same26. Subsequently, a BP step with an OLF for the input weights will also either reduce the training error E or leave it the same. Sigmoid MLP networks form outputs which are continuous functions of the weights. Hence the error E(W + z·G) is a continuous function of z. It is possible to find an optimal z that minimizes the mean square error E in a given iteration by solving ∂E(W + z·G)/∂z = 0. However, E(W + z·G) may have many critical points where the partial derivative is zero. Since G is the negative Jacobian matrix, we search for the smallest non-negative z which minimizes E. Every time the input weights are updated using BP, the output weights must be re-calculated and updated. As mentioned before, OWO finds output weights by solving a system of linear equations, and in any training iteration the error after OWO is guaranteed to be less than or equal to the error in the previous iteration. Let Ek denote the error at the kth step of OWO-BP training. For k odd, Ek denotes the error after an OWO stage, and for k even, Ek denotes the error after a BP step. Since z is non-negative and optimal, the BP step can only decrease E or leave it unchanged. Similarly, OWO steps can only decrease E or leave it the same. If OWO-BP is run for Nit iterations, then the errors Ek form a monotone sequence of non-negative numbers such that Ek+1 ≤ Ek. Such a sequence is guaranteed to converge18 as Nit → ∞.
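For concreteness, a minimal sketch of one OWO-BP iteration is given below. It reuses the sigmoid() and optimal_learning_factor() helpers sketched above, and it uses np.linalg.lstsq on the linear equations of (4) as a stand-in for the OLS solver the paper recommends; these substitutions and the variable names are assumptions made for illustration.

```python
import numpy as np

def owo_bp_iteration(X, T, W, Woi, Woh):
    """One OWO-BP iteration: OWO for the output weights, then a BP step with an OLF.

    X : Nv x (N+1) augmented inputs (last column all ones), T : Nv x M desired outputs.
    Returns the updated weight matrices (W, Woi, Woh).
    """
    Nv = X.shape[0]
    # OWO: solve the linear equations (4) for all output weights (lstsq stands in for OLS).
    O = sigmoid(X @ W.T)
    Xa = np.hstack([X, O])                        # augmented vector [x_p, o_p]
    Wo = np.linalg.lstsq(Xa, T, rcond=None)[0].T  # M x Nu output weight matrix
    Nplus1 = X.shape[1]
    Woi, Woh = Wo[:, :Nplus1], Wo[:, Nplus1:]
    # BP: negative input weight Jacobian G of equation (7).
    Err = T - Xa @ Wo.T
    Delta = (Err @ Woh) * O * (1.0 - O)           # hidden unit delta functions
    G = (Delta.T @ X) / Nv
    # Update the input weights with the optimal learning factor of (8).
    z = optimal_learning_factor(X, T, W, Woi, Woh, G)
    return W + z * G, Woi, Woh
```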

3. MULTIPLE OPTIMAL LEARNING FACTOR ALGORITHM

Here, a vector z of optimal learning factors is derived, which has one element for each hidden unit. The result is called the multiple optimal learning factor (MOLF) training method.

3.1. Derivation of multiple optimal learning factors

Assume that an MLP is being trained using OWO-BP. Also assume that a separate OLF zk is being used to update each hidden unit's input weights w(k,n), where 1 ≤ n ≤ (N+1). The error function to be minimized is given by (2). The predicted output yp(m) is given by

\[ y_p(m) = \sum_{n=1}^{N+1} w_{oi}(m,n)\,x_p(n) + \sum_{k=1}^{N_h} w_{oh}(m,k)\, f\!\left(\sum_{i=1}^{N+1}\bigl(w(k,i) + z_k\,g(k,i)\bigr)\,x_p(i)\right) \]

where g(k,n) is an element of the negative Jacobian matrix G. The first partial derivative of E with respect to zj is

\[ \frac{\partial E}{\partial z_j} = \frac{-2}{N_v}\sum_{p=1}^{N_v}\sum_{m=1}^{M}\left[t'_p(m) - \sum_{k=1}^{N_h} w_{oh}(m,k)\,o_p(z_k)\right] w_{oh}(m,j)\,o'_p(j)\,\Delta n_p(j) \tag{11} \]

where

\[ t'_p(m) = t_p(m) - \sum_{n=1}^{N+1} w_{oi}(m,n)\,x_p(n), \qquad \Delta n_p(j) = \sum_{n=1}^{N+1} x_p(n)\,g(j,n), \qquad o_p(z_k) = f\!\left(\sum_{n=1}^{N+1}\bigl(w(k,n) + z_k\,g(k,n)\bigr)\,x_p(n)\right) \]

Using Gauss-Newton updates, the second partial derivative elements of the Hessian Hmolf are


\[ h_{molf}(l,j) \approx \frac{\partial^2 E}{\partial z_l\,\partial z_j} = \frac{2}{N_v}\sum_{m=1}^{M} w_{oh}(m,l)\,w_{oh}(m,j)\sum_{p=1}^{N_v} o'_p(l)\,o'_p(j)\,\Delta n_p(l)\,\Delta n_p(j) = \sum_{i=1}^{N+1}\sum_{n=1}^{N+1}\left[\frac{2}{N_v}\,u(l,j)\sum_{p=1}^{N_v} x_p(i)\,x_p(n)\,o'_p(l)\,o'_p(j)\right] g(l,i)\,g(j,n) \tag{12} \]

The Gauss-Newton update guarantees that Hmolf is non-negative definite. Given the negative gradient vector gmolf = [−∂E/∂z1, −∂E/∂z2, ..., −∂E/∂zNh]T and the Hessian Hmolf, we minimize E with respect to the vector z using Newton's method. Note that −gmolf(j) is given in (11). In each iteration of the training algorithm, the steps are as follows:

i. Solve linear equations for all output weights.
ii. Calculate the negative input weight Jacobian G using BP.
iii. Calculate z using Newton's method and update the input weights as

\[ w(k,n) \leftarrow w(k,n) + z_k\,g(k,n) \tag{13} \]

Here, the MOLF procedure has been inserted into the OWO-BP algorithm, resulting in an algorithm we denote as MOLF-BP. The MOLF procedure can be inserted into other algorithms as well.

3.2. MOLF Hessian

If Hmolf and gmolf are the Hessian and negative gradient, respectively, of the error with respect to z, then the multiple optimal learning factors are computed as

\[ \mathbf{z} = \mathbf{H}_{molf}^{-1}\cdot\mathbf{g}_{molf} \tag{14} \]

The term within the square brackets in (12) is simply an element of the reduced Hessian HR from the Gauss-Newton method for updating input weights (as in (10)). Hence,

\[ \frac{\partial^2 E}{\partial z_l\,\partial z_j} = \sum_{i=1}^{N+1}\sum_{n=1}^{N+1}\left[\frac{\partial^2 E}{\partial w(l,i)\,\partial w(j,n)}\right] g(l,i)\,g(j,n) \tag{15} \]

Lemma 2: For fixed (l, j), hmolf(l,j) can be expressed in vector notation as

\[ \frac{\partial^2 E}{\partial z_l\,\partial z_j} = \sum_{i=1}^{N+1} g_l(i)\sum_{n=1}^{N+1} h_R^{l,j}(i,n)\,g_j(n) \tag{16} \]

In matrix notation,

\[ h_{molf}(l,j) = \mathbf{g}_l^T\,\mathbf{H}_R^{l,j}\,\mathbf{g}_j \tag{17} \]

where column vector gl contains the G elements g(l,n) for all values of n, and the (N+1) by (N+1) matrix HRl,j contains the elements of HR for weights connected to the lth and jth hidden units. The reduced Hessian in (16) uses four indices (l, i, j, n) and can therefore be viewed as a 4-dimensional array HR4 in R(N+1)×Nh×Nh×(N+1). Then

\[ \mathbf{H}_{molf}^4 = \mathbf{G}\,\mathbf{H}_R^4\,\mathbf{G}^T \tag{18} \]

From (16), we see that hmolf(l,j) = h4molf(l,j,l,j), i.e., the 4-dimensional Hmolf4 is transformed into the 2-dimensional Hmolf by setting i = j and n = l. Each element of the MOLF Hessian combines the information from (N+1) rows and columns of the reduced Hessian HRl,j. This makes MOLF less sensitive to input conditions. Note the similarities between (9) and (16).
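A compact sketch of the MOLF computation in (11)-(14) is shown below. The sigmoid derivative, the NumPy array layout, and the use of np.linalg.lstsq to solve Hmolf·z = gmolf (in place of the OLS solve suggested later in section 4.2) are assumptions made for this illustration.

```python
import numpy as np

def molf_newton_step(X, T, W, Woi, Woh, G):
    """Compute the learning factor vector z of (14) and update W as in (13).

    X : Nv x (N+1) augmented inputs, T : Nv x M desired outputs,
    W : Nh x (N+1) input weights, Woi : M x (N+1), Woh : M x Nh,
    G : Nh x (N+1) negative input weight Jacobian from BP.
    """
    Nv = X.shape[0]
    O = 1.0 / (1.0 + np.exp(-(X @ W.T)))   # sigmoid hidden units assumed
    Oprime = O * (1.0 - O)                 # o'_p(k)
    Err = T - (X @ Woi.T + O @ Woh.T)      # t_p(m) - y_p(m)
    dN = X @ G.T                           # delta n_p(j) of equation (11)
    A = Oprime * dN                        # o'_p(j) * delta n_p(j)
    # Negative gradient g_molf: equation (11) evaluated at z = 0.
    g_molf = (2.0 / Nv) * np.sum((Err @ Woh) * A, axis=0)
    # Gauss-Newton Hessian H_molf: first line of equation (12).
    U = Woh.T @ Woh                        # u(l, j)
    H_molf = (2.0 / Nv) * U * (A.T @ A)
    # Solve H_molf z = g_molf; lstsq tolerates the zero rows/columns of Lemma 4.
    z = np.linalg.lstsq(H_molf, g_molf, rcond=None)[0]
    return W + z[:, None] * G              # input weight update of equation (13)
```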


3.3. Computational cost

The proposed MOLF algorithm involves inverting a Hessian; however, compared to Newton's method or LM, the size of the Hessian is much smaller. Updating the input weights using Newton's method or LM requires a Hessian with Niw rows and columns, whereas the Hessian used in the proposed MOLF has only Nh rows and columns. Recall that Nu denotes the number of weights connected to each output. The total number of weights in the network is denoted as Nw = M·Nu + (N+1)·Nh. The number of multiplies required to solve for the output weights using OLS is given by

\[ M_{ols} = N_u(N_u+1)\left[M + \frac{1}{6}N_u(2N_u+1) + \frac{3}{2}\right] \tag{19} \]

The numbers of multiplies per training iteration for OWO-BP, LM and MOLF are given below.

\[ M_{owo\text{-}bp} = N_v\bigl[2N_h(N+2) + M(N_u+1) + N_u(N_u+1) + M(N+6N_h+4)\bigr] + 2M_{ols} + N_h(N+1) \tag{20} \]

\[ M_{lm} = N_v\bigl[MN_u + 2N_h(N+1) + M(N+6N_h+4) + MN_u\bigl(N_u + 3N_h(N+1)\bigr) + 4N_h^2(N+1)^2\bigr] + N_w^3 + N_w^2 \tag{21} \]

\[ M_{molf} = M_{owo\text{-}bp} + N_v\bigl[N_h(N+4) - M(N+6N_h+4)\bigr] + N_h^3 \tag{22} \]

Note that Mmolf consists of Mowo-bp plus the required multiplies for calculating optimal learning factors.
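A short helper for evaluating the per-iteration multiply counts of (19)-(22) for a given network configuration might look like the sketch below; the function name and the dictionary of results are illustrative choices, and the formulas are simply transcribed from the equations above.

```python
def multiplies_per_iteration(N, M, Nh, Nv):
    """Evaluate the per-iteration multiply counts of equations (19)-(22)."""
    Nu = N + Nh + 1                       # weights connected to each output
    Nw = M * Nu + (N + 1) * Nh            # total weights in the network
    M_ols = Nu * (Nu + 1) * (M + Nu * (2 * Nu + 1) / 6.0 + 1.5)       # (19)
    M_owo_bp = (Nv * (2 * Nh * (N + 2) + M * (Nu + 1) + Nu * (Nu + 1)
                      + M * (N + 6 * Nh + 4))
                + 2 * M_ols + Nh * (N + 1))                           # (20)
    M_lm = (Nv * (M * Nu + 2 * Nh * (N + 1) + M * (N + 6 * Nh + 4)
                  + M * Nu * (Nu + 3 * Nh * (N + 1))
                  + 4 * Nh ** 2 * (N + 1) ** 2)
            + Nw ** 3 + Nw ** 2)                                      # (21)
    M_molf = (M_owo_bp + Nv * (Nh * (N + 4) - M * (N + 6 * Nh + 4))
              + Nh ** 3)                                              # (22)
    return {"OWO-BP": M_owo_bp, "LM": M_lm, "MOLF-BP": M_molf}
```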

4. EFFECT OF LINEAR DEPENDENCE ON THE MOLF ALGORITHM

In this section, we analyze the performance of MOLF applied to OWO-BP in the presence of linear dependence.

4.1. Dependence in the input layer

A linearly dependent input can be modeled as

\[ x_p(N+2) = \sum_{n=1}^{N+1} b(n)\,x_p(n) \tag{23} \]

During OWO, the weights from the dependent input to the outputs will be set to zero, and the output weight adaptation will not be affected. During the input weight adaptation, the expression for the gradient given by (11) can be rewritten as

\[ \frac{\partial E}{\partial z_j} = \frac{-2}{N_v}\sum_{p=1}^{N_v}\sum_{m=1}^{M}\left(t'_p(m) - \sum_{k=1}^{N_h} w_{oh}(m,k)\,o_p(z_k)\right) w_{oh}(m,j)\,o'_p(j)\left[\Delta n_p(j) + g(j,N+2)\sum_{n=1}^{N+1} b(n)\,x_p(n)\right] \tag{24} \]

and the expression for an element of the Hessian (12) can be rewritten as


\[ \frac{\partial^2 E}{\partial z_l\,\partial z_j} = \frac{2}{N_v}\sum_{m=1}^{M} w_{oh}(m,l)\,w_{oh}(m,j)\sum_{p=1}^{N_v} o'_p(l)\,o'_p(j)\Bigl[\Delta n_p(l)\,\Delta n_p(j) + \Delta n_p(l)\,g(j,N+2)\sum_{n=1}^{N+1} b(n)\,x_p(n) + \Delta n_p(j)\,g(l,N+2)\sum_{i=1}^{N+1} b(i)\,x_p(i) + g(j,N+2)\,g(l,N+2)\sum_{n=1}^{N+1}\sum_{i=1}^{N+1} b(i)\,x_p(i)\,b(n)\,x_p(n)\Bigr] \tag{25} \]

Let H'molf be the Hessian when the extra dependent input xp(N+2) is included. Then

\[ h'_{molf}(l,j) = h_{molf}(l,j) + \frac{2}{N_v}\,u(l,j)\sum_{p=1}^{N_v} x_p(N+2)\,o'_p(l)\,o'_p(j)\Bigl[g(l,N+2)\sum_{n=1}^{N+1} x_p(n)\,g(j,n) + g(j,N+2)\sum_{i=1}^{N+1} x_p(i)\,g(l,i) + x_p(N+2)\,g(l,N+2)\,g(j,N+2)\Bigr] \tag{26} \]

Comparing (11) with (24), and (12) with (25) and (26), we see additional terms appearing within the square brackets in the expressions for the gradient and Hessian in the presence of a linearly dependent input. Clearly, these parasitic cross-terms will cause training using MOLF to be different in the case of linearly dependent inputs. This leads to the following lemma.

Lemma 3: Linearly dependent inputs, when added to the network, do not force H'molf to be singular.

As seen in (25) and (26), each element h'molf(l,j) simply gains some first and second degree terms in the variables b(n), unlike Newton's method, where a linearly dependent input will cause the Hessian matrix to be singular, leading to a poor solution.

4.2. Dependence in the hidden layer

Here we look at how a dependent hidden unit affects the performance of MOLF. Assume that some hidden unit activations are linearly dependent upon others, as

\[ o_p(N_h+1) = \sum_{k=1}^{N_h} c(k)\,o_p(k) \tag{27} \]

Further assume that OLS is used to solve for the output weights. The weights in the network are updated during every training iteration, and it is quite possible that this could cause some hidden units to be linearly dependent upon each other. The dependence could manifest itself in hidden units being identical to, or a linear combination of, other hidden unit outputs. Consider one of the hidden units to be dependent. The autocorrelation matrix Ra will be singular, and since we are using OLS to solve for the output weights, all the weights connecting the dependent hidden unit to the outputs will be forced to zero. This ensures that the dependent hidden unit does not contribute to the output. To see whether this has any impact on learning in MOLF, we can look at the expressions for the gradient and Hessian, given by (11) and (12) respectively. Both equations contain a sum of product terms, and since OWO sets the output weights for the dependent hidden unit to zero, this also sets the corresponding gradient and Hessian elements to zero. In general, any dependence in the hidden layer will cause the corresponding learning factor to be zero and will not affect the performance of MOLF.

Lemma 4: For each dependent hidden unit, the corresponding row and column of Hmolf is zero-valued.


This follows from (12). The zero-valued rows and columns in Hmolf need not cause any difficulties, if (14) is rewritten as19 Hmolf ⋅z = gmolf and solved using OLS.
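The toy sketch below illustrates this remedy: when a hidden unit is dependent, its row and column of Hmolf are zero, and a least squares solve of Hmolf·z = gmolf (np.linalg.lstsq here, as a stand-in for the OLS solve named above) still returns a usable z whose entry for the dependent unit is zero. The 3-hidden-unit numbers are purely illustrative.

```python
import numpy as np

# Illustrative 3-hidden-unit case: unit 3 is dependent, so its row and
# column of H_molf and its entry of g_molf are zero (Lemma 4).
H_molf = np.array([[4.0, 1.0, 0.0],
                   [1.0, 3.0, 0.0],
                   [0.0, 0.0, 0.0]])
g_molf = np.array([2.0, 1.0, 0.0])

# A plain inverse does not exist, but the minimum-norm least squares solution
# simply assigns a zero learning factor to the dependent unit.
z = np.linalg.lstsq(H_molf, g_molf, rcond=None)[0]
print(z)   # approximately [0.4545, 0.1818, 0.0]
```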

5. NUMERICAL RESULTS FOR MOLF-BP

Here we compare the performance of MOLF-BP to those of OWO-BP, LM, and full conjugate gradient (CG), where the OLF was used in the latter three algorithms. In full CG and LM, all weights are varied in each iteration. In OWO-BP and MOLF-BP, we first solve linear equations for the output weights and subsequently update the input weights. For a given network, we obtain the training error and the cumulative number of multiplies required for training. We also obtain the validation error for a fully trained network. This information is used to generate the plots and compare performances. Table 1 lists the data sets used for comparison and for generating the plots. We use the k-fold cross-validation procedure to obtain the average training and validation errors. Given a data set, we split the set into k non-overlapping parts of equal size, and use (k-1) parts for training and the remaining part for validation. The procedure is repeated until we have exhausted all k combinations (k = 10 for our simulations); a minimal splitting sketch is given after Table 1. All the data sets used for simulation are publicly available.

Table 1: Data set description

Data set name     No. of inputs   No. of outputs   No. of patterns
Prognostics            17               9               4,745
Remote Sensing         16               3               5,992
Federal Reserve        15               1               1,049
Housing                16               1              22,784
Concrete                8               1               1,030
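A minimal sketch of the k-fold splitting described above is given below; the shuffling, the NumPy-based indexing, and the hypothetical train_mlp() and mse() helpers in the usage comment are assumptions made for this example.

```python
import numpy as np

def k_fold_indices(Nv, k=10, seed=0):
    """Yield (train_idx, val_idx) index pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(Nv), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# Example usage (train_mlp and mse are hypothetical helpers, not from the paper):
# errors = [mse(train_mlp(X[tr], T[tr]), X[va], T[va])
#           for tr, va in k_fold_indices(len(X), k=10)]
```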

5.1. Prognostics data set

This data file is available from the Image Processing and Neural Networks Lab repository20. It consists of parameters that are available in the Bell Helicopter health usage monitoring system (HUMS), which performs flight load synthesis, a form of prognostics21. The data set, containing 17 input features and 9 output parameters, was obtained from the M430 flight load level survey conducted in Mirabel, Canada in early 1995.

Figure 2: Prognostics data: average error vs. (a) no. of iterations and (b) no. of multiplies per training iteration

We trained an MLP having 15 hidden units. In Fig. 2-a, the average mean square error (MSE) for 10-fold training is plotted versus the number of iterations for each algorithm. In Fig. 2-b, the average 10-fold training MSE is plotted versus the required number of multiplies (shown on a log10 scale).


From Fig. 2-a, MOLF-BP is better than the existing OWO-BP and CG in terms of the overall training error. LM has a better overall training error; however, that performance comes with a significantly higher computational demand, as shown in Fig. 2-b.

5.2. Federal Reserve economic data set

This file contains some economic data for the USA from 01/04/1980 to 02/04/2000 on a weekly basis. From the given features, the goal is to predict the 1-Month CD Rate22. For this data file, we trained an MLP having 15 hidden units. In Fig. 3-a, the average 10-fold training MSE is plotted versus the number of iterations for each algorithm. In Fig. 3-b, the average 10-fold training MSE is plotted versus the required number of multiplies. From Fig. 3-a, the average training MSE for MOLF-BP is better than those of all the other algorithms being compared. Fig. 3-b shows the computational cost of achieving this performance. The proposed algorithm requires slightly more computation than OWO-BP and full CG. However, OWO-BP utilizes about two orders of magnitude fewer computations than LM.

Figure 3: Federal Reserve data: average error vs. (a) iterations and (b) multiplies per iteration

5.3. Housing data set

This data file is available from the DELVE data set repository23. It was designed on the basis of data provided by the US Census Bureau (under Lookup Access: Summary Tape File 1), collected as part of the 1990 US census. These are mostly counts cumulated at different survey levels; for the purpose of this data set, a State-Place level was used, and data from all states were obtained. Most of the counts were changed into appropriate proportions24. The tasks are all concerned with predicting the median price of houses in a region based on demographic composition and the state of the housing market in the region. For low task difficulty, more correlated attributes were chosen, as signified by a univariate smooth fit of that input on the target. Tasks with high difficulty have had their attributes chosen to make the modeling more difficult due to higher variance or lower correlation of the inputs to the target.

We trained an MLP having 15 hidden units. From Fig. 4-a and 4-b, we see the same trend observed in the previous examples. MOLF-BP performs better than CG and OWO-BP and comes very close to LM in terms of the average 10-fold training MSE. The multiplies required for MOLF-BP are not significantly more than those required for CG and OWO-BP.


Figure 4: Housing data: average error vs. (a) iterations and (b) multiplies per iteration

5.4. Concrete compressive strength data set

This data file is available from the UCI Machine Learning Repository25. It contains the actual concrete compressive strength (MPa) for a given mixture at a specific age (days), determined in the laboratory. The concrete compressive strength is a highly nonlinear function of age and ingredients. These ingredients include cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, and fine aggregate.

Figure 5: Concrete data: average error vs. (a) iterations and (b) multiplies per iteration

We trained an MLP having 15 hidden units. For this data set, the LM algorithm has a better overall 10-fold training error. However, MOLF-BP presents a good balance between performance and computational cost.

5.5. Remote sensing data set

This data file is available from the Image Processing and Neural Networks Lab repository20. It consists of 16 inputs and 3 outputs and represents the training set for inversion of the surface permittivity, the normalized surface rms roughness, and the surface correlation length found in backscattering models from randomly rough dielectric surfaces26. We trained an MLP having 15 hidden units. From Fig. 6-a, the average 10-fold training MSE for MOLF-BP is less than those of the other algorithms under consideration, with no significant extra computational overhead, unlike LM.


Figure 6: Remote sensing data: average error vs. (a) iterations and (b) multiplies per iteration

Table 2 compares the average 10-fold training and validation errors of the proposed MOLF-BP algorithm with those of LM, OWO-BP and CG on the different data sets. We can see that MOLF-BP often has a performance comparable to or better than that of LM.

Table 2: Average 10-fold training and validation error

Data set          MSE     MOLF-BP      LM           OWO-BP       CG
Prognostics       Etrn    2.8321E7     1.2960E7     10.1418E7    4.5095E7
                  Eval    3.1404E7     1.5360E7     10.4366E7    4.8960E7
Federal Reserve   Etrn    0.0305       0.0311       0.0410       0.0398
                  Eval    0.0412       0.0398       0.0454       0.0452
Housing           Etrn    1.3505E9     1.3472E9     1.4817E9     1.5658E9
                  Eval    1.4174E9     1.3917E9     1.5869E9     1.5231E9
Concrete          Etrn    27.0929      19.0569      52.8452      40.6759
                  Eval    37.6512      27.4454      60.4792      48.1583
Remote Sensing    Etrn    0.5628       0.6800       1.3409       1.1917
                  Eval    0.6841       0.7309       1.4021       1.2612

6. CONCLUSIONS

A second order learning algorithm has been presented that simultaneously calculates optimal learning factors for all hidden units using Newton's method. The proposed method makes it easier to detect dependent hidden units during inversion of the Hessian. The method is less heuristic than LM and is also faster, since its Hessian is smaller. Elements of the reduced Hessian are shown to be weighted sums of the elements of the entire network's Hessian. For the data sets investigated, the proposed algorithm is about as computationally efficient as OWO-BP and has validation errors about as small as those of LM. In effect, the MOLF procedure has boosted the performance of OWO-BP so that it is better than CG. In future work, the MOLF approach will be applied to other training algorithms and to networks having additional hidden layers.


REFERENCES

[1] Odom, R. C., Pavlakos, P., Diocee, S. S., Bailey, S. M., Zander, D. M., and Gillespie, J. J., "Shaly sand analysis using density-neutron porosities from a cased-hole pulsed neutron system," SPE Rocky Mountain Regional Meeting Proceedings, Society of Petroleum Engineers, 467-476 (1999).
[2] Khotanzad, A., Davis, M. H., Abaye, A., and Maratukulam, D. J., "An artificial neural network hourly temperature forecaster with applications in load forecasting," IEEE Transactions on Power Systems, 11(2), 870-876 (1996).
[3] Marinai, S., Gori, M., and Soda, G., "Artificial neural networks for document analysis and recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(1), 23-35 (2005).
[4] Kamruzzaman, J., Sarker, R. A., and Begg, R., [Artificial Neural Networks: Applications in Finance and Manufacturing], Idea Group Inc (IGI) (2006).
[5] Wang, L., and Fu, X., [Data Mining With Computational Intelligence], Springer-Verlag (2005).
[6] Cybenko, G., "Approximations by superposition of a sigmoidal function," Mathematics of Control, Signals, and Systems (MCSS), vol. 2, 303-314 (1989).
[7] Ruck, D. W., Rogers, S. K., Kabrisky, M., Oxley, M. E., and Suter, B. W., "The multi-layer perceptron as an approximation to a Bayes optimal discriminant function," IEEE Transactions on Neural Networks, 1(4) (1990).
[8] Yu, Q., Apollo, S. J., and Manry, M. T., "MAP estimation and the multi-layer perceptron," Proceedings of the 1993 IEEE Workshop on Neural Networks for Signal Processing, 30-39 (1993).
[9] Rumelhart, D. E., Hinton, G. E., and Williams, R. J., "Learning internal representations by error propagation," in D. E. Rumelhart and J. L. McClelland (Eds.), Parallel Distributed Processing, vol. I, The MIT Press, Cambridge, Massachusetts (1986).
[10] Hestenes, M. R., and Stiefel, E., "Methods of conjugate gradients for solving linear systems," Journal of Research of the National Bureau of Standards, 49(6), 409-436 (1952).
[11] McLoone, S., and Irwin, G., "A variable memory quasi-Newton training algorithm," Neural Processing Letters, vol. 9, 77-89 (1999).
[12] Shepherd, A. J., [Second-Order Methods for Neural Networks], Springer-Verlag New York, Inc. (1997).
[13] Levenberg, K., "A method for the solution of certain problems in least squares," Quart. Appl. Math., vol. 2, 164-168 (1944).
[14] Marquardt, D., "An algorithm for least-squares estimation of nonlinear parameters," SIAM J. Appl. Math., vol. 11, 431-441 (1963).
[15] Fletcher, R., [Practical Methods of Optimization], Second Edition, Wiley Publications, New York (1987).
[16] Barton, S. A., "A matrix method for optimizing a neural network," Neural Computation, 3(3), 450-459 (1991).
[17] Kaminski, W., and Strumillo, P., "Kernel orthonormalization in radial basis function neural networks," IEEE Transactions on Neural Networks, 8(5), 1177-1183 (1997).
[18] Yeh, J., [Real Analysis: Theory of Measure and Integration], World Scientific Publishing Company (2006).
[19] Battiti, R., "First- and second-order methods for learning: between steepest descent and Newton's method," Neural Computation, vol. 4, 141-166 (1992).
[20] Univ. of Texas at Arlington, Training Data Files - http://www-ee.uta.edu/eeweb/ip/training_data_files.htm
[21] Manry, M. T., Chandrasekaran, H., and Hsieh, C.-H., "Signal processing applications of the multilayer perceptron," book chapter in Handbook on Neural Network Signal Processing, edited by Yu Hen Hu and Jenq-Neng Hwang, CRC Press (2001).
[22] US Census Bureau [http://www.census.gov] (under Lookup Access [http://www.census.gov/cdrom/lookup]: Summary Tape File 1)
[23] Univ. of Toronto, Delve Data Sets - http://www.cs.toronto.edu/delve/data/datasets.html
[24] Bilkent University, Function Approximation Repository - http://funapp.cs.bilkent.edu.tr/DataSets/
[25] Univ. of California, Irvine, Machine Learning Repository - http://archive.ics.uci.edu/ml/
[26] Fung, A. K., Li, Z., and Chen, K. S., "Back scattering from a randomly rough dielectric surface," IEEE Transactions on Geoscience and Remote Sensing, 30(2) (1992).
