Neural Networks PERGAMON
Neural Networks 13 (2000) 485–505 www.elsevier.com/locate/neunet
A new algorithm for learning in piecewise-linear neural networks E.F. Gad a,*, A.F. Atiya b, S. Shaheen c, A. El-Dessouki d a
Department of Electrical Engineering, Carleton University, 1125 Colonel By Drive, Ottawa, Ont., Canada K15 5B6 b Department of Electrical Engineering, Caltech 136-93 Pasadena, CA 91125, USA c Department of Computer Engineering, Cairo University, Giza, Egypt d Informatics Research Institute, MCSRTA, Alexandria, Egypt Received 27 June 1997; revised 15 March 2000; accepted 15 March 2000
Abstract Piecewise-linear (PWL) neural networks are widely known for their amenability to digital implementation. This paper presents a new algorithm for learning in PWL networks consisting of a single hidden layer. The approach adopted is based upon constructing a continuous PWL error function and developing an efficient algorithm to minimize it. The algorithm consists of two basic stages in searching the weight space. The first stage of the optimization algorithm is used to locate a point in the weight space representing the intersection of N linearly independent hyperplanes, with N being the number of weights in the network. The second stage is then called to use this point as a starting point in order to continue searching by moving along the single-dimension boundaries between the different linear regions of the error function, hopping from one point (representing the intersection of N hyperplanes) to another. The proposed algorithm exhibits significantly accelerated convergence, as compared to standard algorithms such as back-propagation and improved versions of it, such as the conjugate gradient algorithm. In addition, it has the distinct advantage that there are no parameters to adjust, and therefore there is no time-consuming parameters tuning step. The new algorithm is expected to find applications in function approximation, time series prediction and binary classification problems. 䉷 2000 Elsevier Science Ltd. All rights reserved. Keywords: Neural networks; Convergence; Function approximation
1. Introduction Neural network (NN) science has been a scene of great activity in recent years as its applications have flooded many domains. In addition, there have been many advances in the hardware implementation of NN, where the digital and analog techniques represent the main two components in this terrain (Haykin, 1994). Continuing reduction in the world of digital VLSI coupled with limitations in the areareduction of analog VLSI have tempted many researchers to focus their attention on digitally implementable NN. In addition to that, there are basic advantages for digital implementation over that of the analog (Hammerstrom, 1992). The sigmoidal nature of networks that make use of the error back-propagation (BP) algorithm (Bishop, 1995) presents some difficulties for digital implementation. The essential requirement needed for the implementation of BP is that the backward path from the network output to the internal nodes must be differentiable (Pao, 1989). As a * Corresponding author. E-mail addresses:
[email protected] (E.F. Gad), amir@work. caltech.edu (A.F. Atiya),
[email protected] (S. Shaheen).
consequence, the nonlinear element, such as the sigmoid, must be continuous and that makes it difficult for digital implementation (Batruni, 1991). Nonetheless, digital implementation of networks with nonlinear segments can be carried out by approximating the sigmoid with linear segments and storing its derivative in a lookup table in the memory. Care should be taken to sample the region of transition from linear to saturation more heavily. Naturally this makes the memory requirement nontrivial. Although the memory requirement does not represent an acute problem as it did a decade ago, the accurate implementation of the BP requires a very large sampling. On the other hand, the analog NN that adopt the tanh sigmoidal as their nonlinear element also face some problems due to the crude shape of the “tanh” produced by the analog VLSI semiconductors, which can result in a discrepancy between the off-line trained weights and the network on-line performance. One way to spare us these potential troubles is to use a piecewise-linear (PWL) activity function that possesses the smallest possible number of linear segments as the activation function of the hidden neurons. However, the nondifferentiability of the proposed PWL NN will deprive an algorithm such as the BP of a vital requirement for proper
0893-6080/00/$ - see front matter 䉷 2000 Elsevier Science Ltd. All rights reserved. PII: S0893-608 0(00)00024-1
486
E.F. Gad et al. / Neural Networks 13 (2000) 485–505
Fig. 1. The activity function of the hidden neutrons.
functioning (Pao, 1989). Approximating the derivative of the activation function by piecewise constant segments will not work easily with a BP-type scheme, since the minimum always occurs at the discontinuities (we have failed to achieve good results in several exploratory simulations using this approach). Hence, learning in PWL NN mandates some sort of “tailored” algorithms. For example, the algorithm presented in Lin and Unbehauen (1995) is used to train networks with a single hidden layer employing the absolute value as the activation function of the hidden neuron. This algorithm was further generalized to multilayer networks with cascaded structures in Batruni (1991). In Staley (1995), the algorithm used adopts the incremental-type approach (or constructive approach) for constructing a network with two hidden layers, where the hidden neurons possess a PWL function in the range [0–1]. This means that one starts with a minimal network configuration and then adds more units in a prescribed manner to minimize the error. Hush and Horne (1998) proposed another incremental-type approach that uses quadratic programming to adjust the weights. One drawback of the incremental approach is that it is not committed to an upper bound in the number of the network parameters (connection weights) and can end up with large networks, and that generally leads to questionable generalization performance. The intimate relationship between the number of the weights in the network and the generalization performance has been revealed in different places (see, for example, Abu-Mostafa, 1989; Baum & Haussler, 1989; Moody, 1992). In this paper, we studied and developed an algorithm for a class of networks consisting of a single hidden layer and a single output neuron. The activation function of the hidden neurons is a PWL of the saturation type with outputs in [⫺1,1] (see Fig. 1), whereas the output neuron possesses a pure linear function. This hidden unit function has been used in some types of Hopfield networks (Michel, Si & Yen, 1991) and is called saturation nonlinearity. The proposed learning algorithm for this class approaches the learning task by constructing a PWL objective function for the total error and then seeks to minimize it. The algorithm developed is a new algorithm for optimizing general PWL functions, and is therefore a contribution to the general optimization field as this path has been very scarcely trod-
den in the literature. For example, we could locate only one identical task, in Benchekroun and Falk (1991), where the adopted algorithm is based on the Branch and Bound scheme and defines subproblems that can be solved using linear programming. Our approach for PWL optimization, on the other hand, depends on utilizing the boundaries between the linear regions of the PWL function as the search directions. One of the main advantages of the proposed algorithm over BP is that it does not need any parameters (such as the learning rate in BP)—and therefore there is no time-consuming parameter tuning step—nor does it use any derivative information of the error surface and thus can be very promising as an accelerated learning algorithm (see Battiti, 1992). Another advantage of the proposed algorithm is that, due to the discrete nature of the problem, it converges in a finite number of steps, unlike BP where the minimum is approached only asymptotically. The proposed algorithm can be considered to belong to a broad category of algorithms called basis exchange algorithms, which is based on exchanging one variable with another in a basis set. An example of a basis exchange algorithm is the simplex algorithm for linear programming. Basis exchange algorithms have also been applied in the NN field, for example Bobrowski and Niemiro (1984) for training linear classifier networks and Bobrowski (1991) for training multilayer classifiers with hard-threshold neurons (though these algorithms are quite different in concept from the proposed algorithm, for example Bobrowski’s 1991 algorithm is based on adding hidden nodes that are trained to maximally separate between one class and the remaining classes). Besides being specifically suited for PWL NN, the proposed algorithm provides another fundamental advantage over gradient-type algorithms for smooth error functions. It is well known that if a gradient descent is used to minimize a function Q(x), the rate of convergence becomes governed by the relation (Battiti & Tecchiolli, 1994; Golden, 1996): hmax ⫺ hmin ·兩Q
xk ⫺ Q
xⴱ 兩; 兩Q
xk⫹1 ⫺ Q
xⴱ 兩 ⬇ hmax ⫹ hmin where xⴱ is the actual minimum, xk⫹1 and xk are the vectors of the independent variables at iterations
k ⫹ 1 and k, respectively, and h max and h min are the largest and smallest
E.F. Gad et al. / Neural Networks 13 (2000) 485–505
eigenvalues of the Hessian matrix of Q(x). It is clear that if the two eigenvalues are very different (i.e. the Hessian matrix is very ill conditioned), the distance from the minimum is multiplied each iteration by a number that is very close to 1. The problem with the NN learning is that the Hessian matrix of the error function is usually ill conditioned. The analysis in Saarinen, Bramley and Cybenko (1993) shows why this ill conditioning is very common with NN learning. Our algorithm, on the other hand, is not of a gradient type and as a result is not prone to such problems and does not suffer from the slow rate of convergence as could be demonstrated by the results. The organization of this paper is as follows. Firstly, Section 2 presents a formal terminology for describing any general PWL function. Section 3 is involved in demonstrating the basic features of the neural model considered in this paper. The mathematical derivations of the algorithm are presented in Section 4, which consists basically of three subsections that handle the main stages of the algorithm. Simulation results are given in Section 5 and finally Section 6 presents a brief summary and conclusion. 2. Piecewise-linear functions and their representation A complicated mapping function f : D ! RM (with D a compact subset of R N) can be approximated through pieces of functions. A PWL function can be formally defined as follows (Lin and Unbehauen, 1995; see also Lin and Unbehauen, 1990): Definition 1 (PWL function). A PWL function f : D ! RM with D, a compact subset in R N, is defined from two aspects: 1. The partition of the domain D into a set of polyhedra called regions P n [ R p D; R p 傽 R p 0 f; p 苷 p 0 ; p; p 0 R U R p 傺 D p1
僆 {1; 2; …; P}jg1
(1)
by a set of boundaries of
N ⫺ 1-dimensional hyperplanes H U {Hq 傺 D; q 1; …; Q}; where Hq U {x 僆 D兩具aq ; x典 ⫹ bq 0};
2
where aq 僆 RN ; bq 僆 R; 具,典 denotes the inner product operation of two vectors and f is the empty set. 2. Local (linear) functions f p
x U Jp x ⫹ wp
3
Jp 僆 with f
x f p
x for any x 僆 Rp where Rp 僆 R; M×N M and wp 僆 R : There have been several approaches R 1
The over-bar notation denotes the closure of the set.
487
for describing a PWL function in a closed form (Kevenaar & Leenaerts, 1992). Consider the case M 1; i.e. f : D ! R: We define the following: Definition 2. A local minimum of f is a connected set, L; of points such that f
x f min
for x 僆 L
and f
x ⬎ f min
for x 僆 V \ L
where V is some open neighborhood of L: As a special case, an isolated local minimum is a point x, where f
x 0 ⬎ f
x for x 0 僆 V ⫺ x; where V is an open neighborhood around x. The following is a generalization of a result proved by Gauss for the case of the minimization of the least absolute P deviation function f
x Ki1 兩具a i ; x典 ⫹ bi 兩: Theorem 1. Let f be a PWL function defined as in Eqs. (1) and (2). If L is a connected set of local minima of f, and L is bounded, then there is an x0 僆 L such that x0 is the intersection of at least N hyperplanes. Proof. Consider the point x0 僆 L that is the intersection of the largest number, K, of hyperplane boundaries, i.e. 具a1 ; x典 ⫹ b1 0 .. .
…
x 僆 L
4
具aK ; x典 ⫹ bK 0 The function f evaluated at points x close to x0 that satisfy Eq. (4) is linear, since there is no other hyperplane intersecting with the K hyperplanes at point x0 and creating a discontinuity. Let x x0 ⫹ lx 0 where x 0 僆 Ꭽ
A; 3 2 a1 7 6 6a 7 6 27 7 6 A6 . 7 6 . 7 6 . 7 5 4 aK and Ꭽ(A) denotes the null space of A. If we assume K ⬍ N; then x 0 will be nonzero, and we can take l as large as which is another possible until x hits the boundary of L; hyperplane. This will make x at the intersection of K ⫹ 1 hyperplanes, which contradicts the definition of K. Hence, the assumption K ⬍ N is invalid, and there must be a point
488
E.F. Gad et al. / Neural Networks 13 (2000) 485–505
Fig. 2. A schematic layout for the NN model.
in L that is at the intersection of at least N hyperplane boundaries. A Corollary. An isolated local minimum is at the intersection of at least N hyperplane boundaries (N is the dimension of x). The above theorem is helpful in guiding the optimization procedure. We note that typically the minimum is at the intersection of N hyperplanes. It is at the intersection of more than N hyperplanes only in case of degeneracy, such as when more than N hyperplanes have nonempty intersection.
3. The piecewise-linear neural model For mathematical convenience, we adopted an NN with a single hidden layer and a single output neuron. Fig. 2 shows a schematic layout for the model. Every neuron in the hidden layer exhibits the transfer function (see Fig. 1) 8 ⫺1 6 ⬍ ⫺1 > > < 1 ⱖ 6 ⱖ ⫺1
5 f h
6 6 > > : 6⬎1 1
On the other hand, the output neuron exhibits the linear transfer function: fo
6 6
6
The output of the net, yo
m; under the presentation of a training example m, drawn from a training set of size M, will thus be given by: 0 !1 H I X X vj fh wji ui
m ⫹ wjo A
7 yo
m fo @vo ⫹ j1
i1
where ui
m is the input of example m at the input source node i, H is the number of the hidden neurons, I is the dimensionality of the input training examples, wji is the weight connecting the input node i to the hidden neuron j and vj is the weight from the hidden neuron j to the output neuron. Alternatively, by making use of the PWL nature of the hidden neuron and Eq. (6), we can write ! I X X X X yo
m vo ⫹ wji ui
m ⫹ wjo ·vj ⫹ vj ⫺ vj ; j僆SLm
i1
j僆S⫹ m
j僆S⫺ m
8 where SLm is the set of hidden neurons activated in the linear
E.F. Gad et al. / Neural Networks 13 (2000) 485–505
the variables appearing in Eq. (14), i.e.
region upon the presentation of the example m, i.e. SLm U f j 僆 {1; 2; …; H}j ⫺ 1 ⬍
I X
wji ui
m ⫹ wjo ⬍ 1g
9
xT U w 01o ; …; w 01I ; w 02o ; …; w 02I ; … …; w 0Ho ; …; w 0HI ; vo ; v1 ; …; vH ;
i1
S⫹ m
S⫺ m
and are the sets of hidden neurons activated in the positive and negative saturation regions, respectively, i.e. S⫹ m U f j 僆 {1; 2; …; H}j
I X
wji ui
m ⫹ wjo ⱖ 1g
10
wji ui
m ⫹ wjo ⱕ ⫺1g
11
i1
S⫺ m U f j 僆 {1; 2; …; H}j
I X i1
L ⫹ From Eqs. (9)–(11), we can see that S⫺ m 傽 Sm 傽 Sm f and ⫺ L ⫹ denotes the cardinality of 兩Sm 傼 Sm 傼 Sm 兩 H; where 兩A兩 the set A: Taking the sum of the absolute deviations as our cost function, E, then
E
M X
兩yo
m ⫺ T
m兩
12
m1
where T(m) is the desired output for the example m. Substituting Eq. (8) into Eq. (12) we get: " ! M I X X X sm vo ⫹ wji ui
m ⫹ wjo ·vj E j僆SLm
m1
⫹
X
vj ⫺
j僆s⫹ m
X j僆S⫺ m
i1
# vj ⫺ T
m
13
where sm is the sign of the mth term between the square brackets. Defining w 0ji wji ·vj ; Eq. (13) can be written as " ! M I X X X 0 0 sm vo ⫹ w ji ui
m ⫹ w jo E j僆SLm
m1
⫹
X j僆S⫹ m
vj ⫺
X j僆S⫺ m
i1
# vj ⫺ T
m
489
and N is the total number of the weights in the network, given by N
I ⫹ 2H ⫹ 1: Thus, if we were to seek a good minimum for E, as would stipulate any learning task in the NNs, we may search the space whose coordinates are the set of variables
w 01o ; …; w 01I ; w 02o ; … ; w 02I ; … …; w 0Ho ; …; w 0HI ; vo ; v1 ; …; vH instead of searching the original weight space. Equivalently, we may refer to the variable space of (15) as the weight space. We also assume that we can define the boundary configuration for E;
H E rigorously. Intuitively, the boundary configuration H E should have three outstanding types of boundaries or hyperplanes: the first two types are closely similar to each other and are located at the two breakpoints of the activity function of each one of the hidden neurons, or mathematically speaking, they are located wherever I X
wji ui
m ⫹ wjo ^1
16
i1
whereby multiplying both sides of Eq. (16) by vj enables us to formulate this boundary equation in terms of the same set of variables expressing the error function E: I X
w 0ji ui
m ⫹ w 0jo ⫺ vj 0
17
i1
m 1; 2; …; M; j 1; 2; …; H I X
w 0ji ui
m ⫹ w 0jo ⫹ vj 0
18
i1
14
It might appear that the above formulation introduces a potential problem when the optimization procedure produces vj 0 while w 0ji 苷 0: However, this problem can be avoided by imposing the constraints: vj ⬎ e; (e is a very small positive constant) and thus restricting the search region to positive vj’s, assuming that vj ⬎ 0 initially (any function using a network with unrestricted vj’s can also be shown to be implemented with nonnegative vj’s). Practically speaking, however, these constraints turned out to degrade the efficiency of the algorithm. On the other hand, we have not encountered any case in which such a situation has disturbed the algorithm, hence such constraints have not been enforced. Now we can see that E is a PWL function. Using the notation of the previous section, we have E : D 傺 RN ! R where every x 僆 D is the vector form constituted from all
(15)
m 1; 2; …; M; j 1; 2; …; H We shall refer to the boundaries in Eq. (17) as type ‘Ia’ Ia ; and to that of Eq. hyperplanes, denoting them by Hj;m Ib : It is interesting to (18) as type ‘Ib’ hyperplanes (or Hj;m ⫺ and S exchange their members note that the sets SLm ; S⫹ m m across these boundaries. The third type of hyperplanes or boundaries can be found whenever yo
m T
m for all m 1; 2; …; M; i.e. when the mth term of Eq. (14) changes its sign: ! I X X X X 0 0 w ji ui
m ⫹ w jo ⫹ vj ⫺ vj ⫺ T
m vo ⫹ j僆SLm
0
i1
j僆S⫹ m
j僆S⫺ m
(19)
We shall refer to the boundaries given by Eq. (19) as type ‘II’ hyperplanes or HmII : Thus the whole set of the boundary
490
E.F. Gad et al. / Neural Networks 13 (2000) 485–505
configuration, H E ; can be written as: H E U
Ia {{Hj;m ;
j 僆 {1; 2; …; H}; m
Ib ; j 僆 {1; 2; …; H}; m 僆 {1; 2; …; M}} 傼 {Hj;m
僆 {1; 2; …; M}} 傼 {HmII ; m 僆 {1; 2; …; M}}}
20
Evidently, it can be seen that E is continuous across H E ; since E is composed of the sum of continuous functions (the absolute value functions). A crucial issue regarding the type II hyperplanes should be stressed here. Closer inspection of Eq. (19) reveals that a hyperplane of type II is not a strictly linear hyperplane, because by varying the weights wji, a hidden neuron can move from the linear to the saturation region. It may be rather conceived as a zigzag hyperplane with its linear segments occurring between the types Ia and Ib hyperplanes. To illustrate better such a subtle point, consider a simple network configuration consisting of only two hidden neurons. If a training example, say p, is activating both of the hidden neurons in the linear region, i.e. SLp ⫺ {1; 2}; S⫹ p Sp f; then a hyperplane of type II for the example p, HpII ; would take the form vo ⫹ u1
pw 011 ⫹ u2
pw 012 ⫹ w 01o ⫹ u1
pw 021 ⫹ u2
pw 022 ⫹w 02o ⫺ T
p 0
(21)
Now imagine that we move in the weight space on the hyperplane of Eq. (21) until we happen to cross the boundary of the hyperplane H1;Ia p ; this in fact implies that the example p will change its activation region of the hidden neuron 1 to the positive saturation region. Thus HpII becomes vo ⫹ v1 ⫹ u1
pw 021 ⫹ u2
pw 022 ⫹ w 01o ⫺ T
p 0
22
Now, we can see that Eq. (22) provides a different form for HpII : The previous argument then suggests the following statement: a boundary associated with type II and example m, HmII ; written as in Eq. (19) is not strictly a hyperplane. It is rather zigzagging with its turning points occurring at its crossings with hyperplanes of types Ia and Ib. Fig. 3 depicts a graphical illustration of this statement. Note that the main concept of the proposed algorithm will work as well for any PWL objective function, so for example one can add to the L1 error a penalty function such as the sum of absolute values of the weights. 4. Derivation of the algorithm 4.1. Overview of the algorithm The algorithm consists basically of two stages: Stage-A and Stage-B. Given a randomly chosen starting point in the weight space, the role of Stage-A is to find an intersection point between N linearly independent hyperplane boundaries of the error function E. The set of these hyperplanes
will be referred to throughout the algorithm description by H^ where H^ 傺 H E : This is done by a sequence of N steps of successive projection of the gradient direction on N hyperplanes. In each step of Stage-A, a new point is obtained, where this new point is at the intersection of one more hyperplane than the point of the previous step. Stage-A is thus terminated after exactly N steps with the desired point as output, which in turn serves as the starting point for Stage-B. Stage-B carries out the bulk of the optimization process. It does so by generating a sequence of points that represent the intersection of N linearly independent hyperplane boundaries. This is done by moving along single-dimensional search directions lying at the intersection of
N ⫺ 1 hyperplane boundaries. The details of computing these directions will be explained in Section 4.3. Furthermore, Stage-B ensures that each point thus generated reduces the error function E. The termination point of Stage-B (and of the algorithm) is reached when all the available search directions at any one point fail to produce a point with lower error E. A theorem is also given to show that the termination point of Stage-B is a local minimum of E. In addition to that, restricting Stage-B to generate only points that represent the intersection of N hyperplane boundaries of the error function E is useful according to theorem (1), because it keeps the search process more focused on the local minimum candidate points in the weight space. However, the PWL nature of type II hyperplanes described in the previous section could introduce some difficulties during Stage-B. More specifically, it is quite possible that Stage-B generates a point representing the intersection of p hyperplanes where p ⬍ N: Clearly, this situation requires detection and/or correction. This is done by a routine (we name it “Restore”) that is called from within the Stage-B procedure. This routine is based on the idea of hypothetical hyperplanes, which is explained in Section 4.4. 4.2. Stage-A In this stage we start randomly and then move in the direction of the gradient descent until we hit the nearest discontinuity of the set H E : The search direction is then taken along the projection of the gradient on this hyperplane. We continue moving in this new direction until we hit another new hyperplane belonging to the set H E : Thus far, we would be standing by the intersection of two hyperplanes in an N-dimensional space, or equivalently a hyperplane of dimensionality N ⫺ 2; assuming that the two hyperplanes are linearly independent. We then project the direction of the gradient on the above lesser dimension hyperplane and move along it until we hit another new hyperplane. The new point obtained from the last movement thus represents the intersection of three hyperplanes, or in other words a hyperplane of dimensionality N ⫺ 3: Proceeding in the same manner, we continue until we reach the end of Stage-A, where the final point
E.F. Gad et al. / Neural Networks 13 (2000) 485–505
represents the intersection of N hyperplanes. Suppose that, before we reach the end of Stage-A, we happened to be standing by the intersection of k hyperplanes, so that 具ai ; x典 ⫹ bi 0 (for i 1; 2; …; k where x is given as in Eq. (15); a i and b i are a vector and a constant, respectively, that depend on the type of the ith hyperplane. Or simply, in matrix form we can write: Ax b where, A a1 ; …; ak T ; ai aTi ; b b1 ; …; bk T ; bi ⫺bi : Let g represent the gradient of the error function E, i.e. " 2E 2E 2E 2E T ; …; ; ; …; ; … …; g U 2w 01o 2w 01I 2w 02o 2w 02I # 2E 2E 2E 2E 2E
23 ; …; ; ; ; …; 2w 0Ho 2w 0HI 2vo 2v1 2vH which can be obtained easily by checking the sets SLm ; S⫹ m; ; and it is of course a constant vector that depends on and S⫺ m the region we are in. The projection of the gradient on the hyperplane Ax b is given by Pg I ⫺ AT
AAT ⫺1 Ag
24
hence, as discussed above, our movement from xold to xnew would be given according to xnew xold ⫺ hmin Pg
25
or assuming h ⫺Pg : 3 2 0 3 2 0 3 2 w 1o;old w 1o;new h1 7 6 7 6 7 6 7 6 . 7 6 7 6 .. .. 7 6 . 7 6 7 6 7 7 6 6 . . . 7 6 7 6 7 6 7 6 7 6 0 7 6 0 7 6 7 6w 7 6w h 7 6 H
I⫹1 6 HI;new 7 6 HI;old 7 7 6 7 6 7 6 7 6 7 6 7 6 7 h 6 vo;new 7 6 vo;old 7 ⫹ hmin 6 H
I⫹1⫹1 7 6 7 6 7 6 7 6 7 6 7 6 7 6h 6 v1;new 7 6 v1;old 7 7 6 H
I⫹1⫹2 7 6 7 6 7 6 7 6 7 6 7 6 7 7 6 6 . 7 6 . . 7 6 . 7 6 .. .. 7 6 7 6 . 7 6 5 4 5 4 5 4 vH;new
hIa j;m U ⫺
^ ⫺ vj w 0j; T u
m I X
u^i
mhj
I⫹1⫺I⫹i ⫺ hH
I⫹1⫹1⫹j
28 Note that in the above equation we dispensed with subscript “old”, assuming that it is implicitly understood that the step size will be computed in terms of the current values (the old ones). Similarly, using Eq. (18) we can Ib ; along the obtain the step size required to reach any Hj;m positive direction of h,
hIb j;m U ⫺
^ w 0j; T u
m ⫹ vj I X
u^i
mhj
I⫹1⫺I⫹i ⫹ hH
I⫹1⫹1⫹j
29 To determine the step size required to reach any hyperplane of type II for any example m, we follow the above argument, where moving to a point on the hyperplane HmII implies that, using Eq. (19), ! I X X 0 0 w ji;new ui
m ⫹ w jo;new vo;new ⫹ X
i1
vj;new ⫺
j僆S⫹ m
X j僆S⫺ m
vj;new ⫺ T
m 0
X
j僆SLm i0
⫺
j僆SLm i0
u^ i
mhj
I⫹1⫺I⫹i ⫹
X
X j僆S⫺ m
j僆S⫹ m
) hH
I⫹1⫹1⫹j ⫹ hH
I⫹1⫹1 ⫺ T
m 0
31
where yo,old(m) is the old output of the network for the example m before stepping onto HmII ; hIIm is the step size along h that would carry us to HmII : By ignoring the subscript “old”, then
hH
I⫹1⫹1⫹j ⫺
j僆S⫹ m
30
Substituting from Eq. (30) in Eq. (26) ( I X X X II u^i
mhj
I⫹1⫺I⫹i ⫹ hH
I⫹1⫹1⫹j yo;old
m ⫹ hm
e
m I X
u^o
m 1
i0
⫹
where h min is the step required to reach the nearest hyper1a 1b plane
Hj;m ; Hj;m ; or HmII and hi is the ith component of the vector h. Also note that the presence of the minus sign in the second term of Eq. (25) is to account for the descent direction. To compute h min, imagine that we move
u^o
m 1
i0
hN
vH;old
hIIm U ⫺
^ where w 0j; T new w 0jo;new ; w 0j1;new ; …; w 0jI;new and u
m 1; u1
m; …; uI
mT : Substituting Eq. (27) into Eq. (26), Ia we can obtain the step size hIa j;m required to reach Hj;m ;
j僆SLm
26
491
X j僆S⫺ m
u^o
m 1
32
hH
I⫹1⫹1⫹j ⫹ hH
I⫹1⫹1
Ia : along the direction h until we hit the hyperplane Hj;m Clearly this implies that at the new point we should have, using Eq. (17)
e
m yo
m ⫺ T
m
^ ⫺ vj;new 0 w 0j; T new u
m
Eqs. (28), (29) and (32) define the step sizes needed to reach
27
where e
m is the deviation error of the example m
33
492
E.F. Gad et al. / Neural Networks 13 (2000) 485–505
Fig. 3. A graphical representation for the comportment of type II hyperplanes for the example p.
any hyperplane H 僆 H E in terms of the current values of the weights. Thus, the step size that carries the search, along the positive direction of h, to the nearest discontinuity of the set H E is obtained from:
hmin Ib II min{ min {hIa j;m ⬎ 0} 傼 min {hj;m ⬎ 0} 傼 min {hm ⬎ 0}} j;m
j;m
m
34 After hitting a new hyperplane H^ k ; this hyperplane is inserted as the newest member of the set H^ of satisfied constraints. We augment the matrix A with the row expressing this new hyperplane, and also augment the vector b. Naturally, the hyperplane equations 具a; x典 ⫹ b 0 depend on its type. For hyperplanes of types Ia, Ib and II, a is given, respectively, as
aIa j;m 兩ith component 8 u^ k
m i j
I ⫹ 1 ⫺ I ⫹ k > > < U ⫺1 i H
I ⫹ 1 ⫹ j ⫹ 1 > > : 0 otherwise
for all k 0; …; I
37
bIa j;m
bIb j;m
bIIm
0; ⫺T
m. The constant b is given by: Stage-A is terminated after exactly N steps, or, in other words, when the number of the intersecting hyperplanes (the ^ is equal to N. See Appendix B for a cardinality of the set H) detailed pseudo code description of Stage-A. We note that the projection operation (Eq. (24)) need not be computed from scratch each new step. There are recursive algorithms for updating the projection matrix when adding a matrix row. However, we will not go into these details, as we will present a more efficient method later in the paper to effectively perform the equivalent of Stage-A steps (see Section 4.4). 4.3. Stage-B
35
aIb j;m 兩ith component 8 u^ k
m i j
I ⫹ 1 ⫺ I ⫹ k > > < U 1 i H
I ⫹ 1 ⫹ j ⫹ 1 > > : 0 otherwise
aIIm 兩ith component 8 > u^k
m i j
I ⫹ 1 ⫺ I ⫹ k for all k 0; …; I and all j 僆 SLm > > > > > ⫺1 i H
I ⫹ 1 ⫹ j ⫹ 1 for all j 僆 S⫹ > m > < U 1 i H
I ⫹ 1 ⫹ j ⫹ 1 for all j 僆 S⫺ m > > > > > 1 i H
I ⫹ 1 ⫹ 1 > > > : 0 otherwise
for all k 0; …; I
After terminating Stage-A at a point representing the intersection of N-hyperplanes, Stage-B is then initiated. We assume for now that the equations of the hyperplanes are linearly independent. At the start of Stage-B, we can write the following equation 2 3 2 3 b1 a1 6 7 6 7 6 .. 7 6 . 7
38 6 . 7·x 6 .. 7 4 5 4 5 aN
36
bN
or in matrix form: Ax b keeping in mind that A is now an N × N matrix; ai aTi ; bi ⫺bi : If we let bk bk ⫹ y
E.F. Gad et al. / Neural Networks 13 (2000) 485–505
493
Fig. 4. An illustration of Stage-B iterations on a two-dimensional example.
and hence b yT b1 …
bk ⫹ y…b N where y is a small constant, then the new point, xy , computed from x y A⫺1 ·by is a point representing the intersection of the following set of hyperplanes 具ai ; x典 ⫹ bi 0
i 1; 2; …; N i 苷 k
39a
具ai ; x典 ⫹ bi y
ik
39b
In fact, the hyperplane (39b) is not a legitimate member of the boundary configuration H E ; but the real gain in the above step is that a new search direction can now be obtained using the two points x and xy : h
x y ⫺ x=储xy ⫺ x储
40
Furthermore, the search direction h is a hyperplane of single degree of freedom that represents the intersection of the set of N ⫺ 1 hyperplanes of Eq. (39a). Thus, we can use this new search direction to resume the search process as in Stage-A, where the nearest hyperplane encountered along the positive direction of h, (say H 0k ; where H 0k 僆 H E ; H 0k U {x 僆 D兩具a 0k ; x典 ⫹ b 0k 0}; becomes the new member of the family of the N intersecting hyperplanes, while the old member H^ k is dismissed. If we let a 0k T a 0k and b 0k ⫺b 0k ; then the new point satisfies the system of equations Ak ·x bk ; where the matrix Ak is the same as A except that its kth row is a 0k instead of ak. It should be noted that the direction h is not necessarily committed to decrease the error E, but it remains up to us to backtrack and reject this search direction if it increases it; for example we can try another search direction by perturbing another component of the vector b (e.g. k ⫹ 1 and proceeding as described above. The total number of search directions we explore is 2N, corresponding to N positive perturbations (corresponding to positive y in Eq. (39b)) and N negative perturbations (corresponding to negative y in Eq. (39b)) of the components of the vector b. If a decrease in the error has occurred in one search direction, we move to the new point and continue the search by creating a new search direction starting from this new point (see Fig. 4 for an illustration). Fortunately, creating a new search direction will be less expensive, from a computational point of view, since it requires the inversion of the matrix Ak,
which differs in only one row from the matrix A whose inverse is already available from the previous step, and consequently we shall not be obliged to compute A k⫺1 from scratch (see Appendix A for a recursive matrix inversion algorithm by row exchange). Now, there is still left a subtle problem that we have to face. Consider the following scenario: suppose that a hyper^ Thus we have aII plane of type II is an element of the set H: p represented as some row in the matrix A. However, referring to Eq. (37), we find that the specification of the vector aIIp is L ⫹ dependent upon the configuration of the sets S⫺ p ; Sp ; and Sp : Hence, any change to their configurations due to a crossing from one region to another during the course of moving from one point to another in Stage-B necessitates a change in that row corresponding to the new configuration of the L ⫹ sets S⫺ p ; Sp ; and Sp : Even the 2N search directions might correspond to different A matrices (see Fig. 4 for an illustration). So it is important to track any changes in the region configurations, and adjust the rows of the matrix A accordingly. It should be stressed that this situation can never occur in Stage-A, since the procedure there does not penetrate the hyperplanes as does Stage-B; it merely contents with “hugging” them. We propose two ways to examine the 2N directions and determine which one will decrease the error. Procedure 1: In the first method we simply examine the change in the error function if a move has been taken in a given direction. Let us describe the available 2N directions more compactly in the form: hi ^eA⫺1 ei
i 1; …; N
41
where e is a sufficiently small positive constant and ei is the ith unity vector. Thus the procedure needed to examine each one of the directions in turn is: a. Let xi;new xold ⫹ hmin hi b. If E
xi;new ⬍ E
xold then Let xold xi;new ; otherwise Let i i ⫹ 1 and goto (a). Procedure 2: In the second method we think of the 2N directions as like “transformed coordinate axes” (the
494
E.F. Gad et al. / Neural Networks 13 (2000) 485–505
positive and the negative sides of the axes). We think of each region flanked by the N axes like a “quadrant”. Assume the matrix A and the gradient g are readily available for say the region flanked by the positive coordinate axes (corresponding to positive y perturbations of the components of the vector b as in Eqs. (39a) and (39b). Then, as mentioned above the 2N directions are h i ^eA⫺1 ei : If the dot product gT hi ⬍ 0 then hi is a descent direction. We can therefore explore the N directions in one shot by calculating gT h1 h2 …hN which is equal to gT A⫺1 (because e1 …eN turns out to be the identity matrix). We pick one of the negative components of this row vector, since this indicates a negative dot product between the gradient and the corresponding hi vector, hence indicating a descent direction. As for the remaining N coordinate axes (search directions), we calculate the matrix A 0 and the gradient vector g 0 corresponding to the region flanked by these coordinate axes. By moving from the positive quadrant to the negative quadrant, we would “officially” be operating on different sides of all the N constraints, thus changing L ⫹ the configuration of the sets S⫺ p ; Sp ; and Sp : This might affect any type II constraints in matrix A, as we discussed before, creating a new matrix A 0 . The sign of the elements ⫺1 of the row vector g 0T A 0 determines the possible descent directions. We note that the inverse of the new matrix A 0 need not be calculated from scratch, but can be obtained from A ⫺1 in a straightforward manner, thus saving computational load. The reason is as follows. Assume A consists of a constraint of type II, represented as a row vector aIIm : Moreover, assume that A has k rows aI1 ; aI2 ; …; aIk corresponding to constraints of type I for example m. For the new matrix, the change in regions will affect only aIIm : Let the new row vector be aIIm 0 : By observing Eqs. (35)–(37), we can see that
aIIm 0 aIIm ⫹
k X
^1aIi
42
i1
where the sign in (^1) depends on the direction of the region change. If we were considered on the linear side of the hyperplane aIi ; and we moved onto one of the saturation regions, then the sign would be ⫺1, otherwise it is ⫹1. Thus, considering the matrix inversion equation (Y being the inverse of A 0 ): A 0Y I
43
and performing some row addition and subtraction operations, we get A 0⫺1 ⬅ Y A⫺1 D
44
where D is the matrix whose row di equals the unity vector for the a I rows, and a vector of ^1s for the rows a II, to account for the relation (42). Appendix B shows a pseudo code for Stage-B (using the first method of examining the search directions). Both
described procedures lead to a descent in the error function. In other words, we get the following result: Claim. The error function is nonincreasing along the trajectories, when using either Procedure 1 or Procedure 2. The proof of this result is given in Appendix C. Exploring the 2N directions, either using the first method and observing the change in the error or using the second method and observing the signs of the row vectors gT A⫺1 and g 0T A 0⫺1 ; will determine a descent direction. In fact, exploring the 2N directions is sufficient to determine whether we have reached a local minimum. In other words, if the error increases along the 2N directions h1 ; …; h2N ; then we have reached a local minimum. This is given by the following theorem: Theorem 2. Assume the matrix A is nonsingular. If E
x0 ⫹ ehj ⬎ E
x0 for all j 1; …; 2N and e is a sufficiently small constant, then x0 is a local minimum of E, in the sense that E
x0 ⫹ eh ⬎ E
x0 for any vector h. The proof is given in Appendix D. However, in the above discussions we have overlooked the possibility that the new row, a 0k ; entering the matrix A is linearly dependent on the N ⫺ 1 rows ai, i 1; 2; …; N; i 苷 k: In this case the new matrix Ak will be singular and, consequently, it will not be possible to compute its inverse A⫺1 k for generating new search directions in the subsequent steps. However, since A will have a rank equal to N ⫺ 1; we are effectively in Stage-A. In that case, we update x by moving in the direction of the null space (or the projection of the gradient onto the null space) of matrix A, and thus the update would be similar to Stage-A. If we later reach Stage-B with more than N satisfied constraints (more than N hyperplanes intersecting at a point, indicating a degeneracy), then more than 2N single-dimensional directions have to be explored. This will slow down the algorithm a little. Another method, that is suboptimal, but produces better results in terms of convergence speed, is to avoid the step that will carry us into a situation where A becomes singular. As discussed before, in Stage-B we compute the minimum step size h min required to carry us along the positive direction of h to the nearest hyperplane, but in contrast to Stage-A, the procedure of Stage-B is granted the choice of repudiating any step that causes the increase in the error function. Hence, in the proposed method an accepted movement in the weight space should satisfy the following two conditions: (1) it should be associated with a reduction in the error function E; and (2) it should not be ending on a Ib II hyperplane whose constant vector
aIa j;m ; aj;m ; or am is linearly dependent on the remaining vectors ai, i 1; …; N
i 苷 k: Next, upon a successful step, i.e. one that satisfies the above two conditions, the nearest hyperplane replaces ^ which entails the the kth hyperplane, H^ k ; in the set H;
E.F. Gad et al. / Neural Networks 13 (2000) 485–505
replacement of its corresponding constant vector Ib II
aIa j;m ; aj;m ; or am in the kth row of matrix A. On the other hand, if the step taken would violate any of the above-two conditions, we just ignore it and consider a new search direction through the perturbation of the
k ⫹ 1th component of vector b. In the next subsection, we provide two techniques that deal with or eliminate the possibility that the matrix A is handed over from Stage-A to Stage-B as a singular matrix. Another point to consider is the following situation. Assume that we are at point x and that we have explored the 2N directions and have found no directions that lead to a decrease in the error, but rather we have found some directions that lead to no change in the error. In such a situation we choose one of the directions that lead to no change in the error, and move along this direction, in the hope that a descent direction will ultimately be found at a later step. However, we have to pay attention to the fact that there could be a flat area in the error surface, and that by following this procedure we could be caught in a never-ending cycle moving from one point to the next along a constant-error area. To avoid this situation, we perform the following. Whenever we take a direction that leads to no change in the error, we save the currently visited point in a list. As long as the error does not change in the upcoming moves, we always check the list before actually taking a move. If the point of destination is on the list, then we ignore this move and choose a new search direction. This mechanism will effectively prevent cycling. An interesting feature of the algorithm, given by the theorem below, is that because of its discrete nature it converges in a finite number of iterations, unlike gradient-type algorithms for smooth functions where the minimum is approached asymptotically but usually never reached exactly in a finite number of iterations. Theorem 3. The proposed algorithm converges to the local minimum in a finite number of iterations. The proof is described in Appendix E. 4.4. Hypothetical hyperplanes We have pointed out that the very nature of type II hyperplanes needs careful manipulation during Stage-B procedure. The reason for that was that these hyperplanes take a PWL form that required additional computational procedures to follow the search directions more closely as they “bend” from one point to the next. We develop in this subsection a technique that has been shown to provide some computational advantages, besides being more homogeneous with Stage-B. In addition to that, this routine provides an excellent alternative to Stage-A, as will be demonstrated later. In this technique, as we move from one region to the next some of the type II constraints in matrix A might change direction. Instead of bending these directions, we simply
495
drop these constraints in the hope that we will recoup these lost constraints in the next few steps. As we lose some constraints, we are effectively in StageA, since there are fewer than N satisfied constraints. However, switching back to Stage-A is computationally expensive since Stage-A involves the projection of the gradient on a set of hyperplanes. Although the computation cost of the gradient direction in our algorithm is very trivial as compared to the BP algorithm, the computation of the projection is quite cumbersome. This is clearly observed in Eq. (24) since there is a matrix inversion followed by a number of matrix multiplication operations, i.e. the complexity involved in this procedure is at least of order O
n3 where n ⬍ N: In addition to that, it is generally recommended to avoid the use of the gradient in the advanced stages of the search process (Battiti & Tecchiolli, 1994). In order to avoid the potential troubles arising from this problem Stage-B switches to a slightly modified version of itself (called “Restore”) instead of switching back to Stage-A. The proposed procedure is based on what will be referred to, from now on, as hypothetical hyperplanes. To understand the basis behind this notion, consider Fig. 5a. In this figure, assume that the current point in the weight space is point a, where point a is at the brink of breaking a type II hyperplane HmII : Suppose that a successful step in Stage-B is executed whereupon the current point becomes now the point b. It is clear that point b 僆 HmII any more, but instead it lies on an obsolete form of it (its extension beyond point a). We may say that b is now on a hypothetical hyperplane that does not have existence in the boundary configuration H E : The role of switching to the “Restore” is to redress this situation by taking point b to the nearest real hyperplane (or a solid line in Fig. 5a). This situation is firstly detected in step (14) of Stage-B (See Appendix B), by comparing the L ⫹ contents of the sets S⫺ m ; Sm ; and Sm ; before and after any step is taken. This is done for all training examples p that have a ^ If there hyperplane HmII as a member of the set H: happened to be a difference, this means that one or more of the constraints 具aIIm ; x典 ⫹ bIIm 0 has been lost during the last movement, and that the current point is not an intersection of N legitimate hyperplanes of the boundary configuration H E : Step (15) then groups the indices of the lost constraints (or their row numbers in the matrix A) into the set G and then passes it to the “Restore” procedure. Upon entering the Restore procedure (see Appendix B), the algorithm cycles through the hypothetical hyperplanes trying to replace them with their real counterparts. Note that the cycling is indexed by the elements of the set G . Also note that it was not necessary to compute both of the direction h ⫹ and h ⫺ since h ⫹ ⫺h⫺ : So it is sufficient to compute the smallest positive and the largest negative step sizes to determine the nearest hyperplanes along the direction h ⫹ and h ⫺, respectively. The smallest positive and the largest negative steps are denoted by h⫹ min and h⫺ ; respectively. The accepted step size is then selected min
496
E.F. Gad et al. / Neural Networks 13 (2000) 485–505
Fig. 5. (a) A typical situation occurring in Stage-B. (b) Analogy to one-dimensional function.
⫺ from {h⫹ min ; hmin } as the one that further reduces the value of E. It is worth stressing that the algorithm may quit the Restore procedure without restoring all the hypothetical hyperplanes. This would happen during the cycling through the elements of G , if one of the entering rows does not satisfy the nonsingularity condition. However, there is no harm if the Stage-B procedure is resumed with ^ Any later calling to hypothetical hyperplanes in H: Restore can remedy the lingering hypothetical hyperplanes from the previous call when some of the rows in the matrix A are renewed. The notion of hypothetical hyperplanes can even be borrowed to Stage-A procedure by starting with an identity matrix
A I and assuming that the vector b is equal to the initial random point. In this case, the starting hypothetical hyperplanes are given by hyperplanes parallel to the coordinate axes. We might resort to this technique in implementing Stage-A to avoid any possibility of ending up with a singular matrix A at the end of this stage. On the other hand, using this technique may result in handing over the matrix A to Stage-B with one or more hypothetical hyperplanes. However, the basic advantage of this technique is that it replaces the computation of the projection of the gradient whose cost is O
N 3 with a recursive matrix inversion whose cost is O
N 2 (see Appendix A for details). Another heuristic measure has also been developed to avoid the problem of singular A at the end of Stage-A procedure. As has been practically observed, the main
reason for the emergence of a singular matrix in A is the Ia accumulation of more than 3 or 4 hyperplanes of type Hj;m Ib (or Hj;m ; for the same j, during Stage-A. This situation occurs regularly if we happened to have w 0j0 ^vj ; while w 0ji 0 for i 1; …; I or when w 0ji ⬇ vj ⬇ 0: At this point of the Ia or weight space there will be M hyperplanes (of types Hj;m Ib Hj;m queuing in front of Stage-A, which will wallow in blindly stuffing the matrix A with their constant vectors Ib (aIa j;m and aj;m ; and eventually produce a singular matrix to the machine precision. To eliminate this situation, the network weights are initialized so that all the training examples span the three different regions of each hidden neuron.
5. Simulation results The new algorithm has been tested on a number of problems. Several other algorithms applied to sigmoidaltype networks have also been tested on the same problems to obtain a comparative idea on the speed of the developed algorithm. The algorithms used in the comparisons were the BP algorithm (using both of its update modes, the sequential or pattern update mode and the batch update mode), the Bold-Driver (BD) (Battiti & Tecchiolli, 1994; Vogl, Mangis, Rigler, Zink & Alkon, 1988) and the conjugate gradient algorithm (CG). In the test comparisons, we used the abbreviation PWLO (piecewise-linear optimization) to stand for the new algorithm. In each test, the PWLO is run first (starting from 10 randomly chosen different points) until it converges to a local minimum. The final error
E.F. Gad et al. / Neural Networks 13 (2000) 485–505
497
Table 1 Learning the mapping y sin
x using 1-15-1 network Type of algorithm
BP(seq) BP(batch) BD CG PWLO
SSE error
Average no. of iterations
Best case
Worst case
Average
0.016432 0.016882 0.016887 0.016742 0.016887
0.046638 0.052533 0.052544 0.051743 0.052545
0.02716 0.026496 0.028535 0.025727 0.026502
9453.58 13227.2 39153.1 258.2 1406
level, SSE, is then recorded for each of the 10 trials, where: SSE
M X
y0
m ⫺ T
m2
45
m1
This error level is then set as an error goal for the other algorithms. The first trial is used to obtain the best-tuned values of the learning rate h and momentum a parameters. These parameters are then fixed for the other nine trials. In any case, these algorithms are stopped if they do not attain the error goal in 50 000 epochs. In each experiment considered, we provide the required number of iterations as well as the CPU time on a Sun Sparc Ultra 10 machine. In the case of the PWLO algorithm, an iteration is taken as a complete search cycle in Stage-B or in the Restore procedure (explained in Section 4.4), whether an actual reduction in the error is obtained or not. For the other algorithms, the iteration is the usual epoch. We used the version of the proposed algorithm as described in Appendix B, except that we used the hypothetical hyperplanes idea (as described in Section 4.4) in place of Stage-A. We coded the PWLO method using matlab. For the other methods we used netlab, a public domain matlab code developed by Bishop and Nabney, 1998.
Average CPU time (s)
1276.435 461.006 3606.52 34.333 30.49
Parameters
a
h
0.0 1.0 – – –
0.011 .0028 – – –
The computational complexity of a single iteration in Stage-B (or in the “Restore” procedure) is due to three sources (as defined in Section 3, let N, I, H and M be, respectively, the number of weights, the number of inputs, the number of hidden nodes and the number of examples): • The first source is coming mainly from the updating of the inverse matrix, i.e. O
N 2 . • The second source is from computing the change in E, which is O
NM: • The final source is from computing h min, which is O
IHM; i.e. approximately O
NM: Since usually N ⬍ M; the complexity is about O
NM: But this is the complexity of probing one direction only. However, let us assume that for any direction we probe the probability of it being a descent direction equals the probability of it being an ascent direction, i.e. 1/2. Since we keep probing till we find a descent direction, the expected number of directions we test is only 2. Hence the average complexity figure will be about twice that described above, thus still O
NM: As a comparison, for the BP the complexity per cycle is O
NM; hence same
Fig. 6. 1-15-1 PWL network output.
498
E.F. Gad et al. / Neural Networks 13 (2000) 485–505
Table 2 Learning the mapping of Eq. (46) using 8-4-1 network Type of algorithm
BP(seq) BP(batch) BD CG PWLO
SSE error
Average no. of iterations
Best case
Worst case
Average
0.013179 0.034845 0.014475 0.006403 0.006404
0.055585 0.560913 0.081473 0.039224 0.039266
0.040205 0.15121 0.03953 0.025853 0.023455
46225.3 50001 43668.9 20436.2 2385
order as our proposed method. Note also that because we are using matlab, some methods benefit more than others from the matlab speed advantage when using predominantly matrix–vector operations. For example, the BP batch update method is significantly faster than the BP sequential update (in terms of CPU time per iteration), because the gradient is evaluated using one big matrix involving all examples.
10234.14 4439.763 3907.669 3753.394 31.678
Table 1 shows the algorithms to learn a neurons. The training samples drawn from
results of applying the different network consisting of 15 hidden set consists of 60 input–output the function y sin
x in the
Parameters
a
h
0.0 0.0 – – –
0.001 0.001 – – –
range from ⫺2p to 2p. Fig. 6 shows the PWL network output after training. 5.2. Multi-dimension function approximation The used mapping function in this example is y
5.1. Single-dimension function approximation
Average CPU time (s)
x1 x2 ⫹ x3 x4 ⫹ x5 x6 ⫹ x7 x8 400
46
where xi is drawn at random from [0–10]. Similar mapping functions have been used in Barmann and Biegler-Konig (1992). Fifty examples have been used to train a network of size 8-4-1. Table 2 shows the comparison results for an 84-1 network.
Fig. 7. Forecasting the Nile River flow using a 4-3-1 network.
E.F. Gad et al. / Neural Networks 13 (2000) 485–505
499
Table 3 Results of learning to forecast the Nile River flow using a 4-3-1 network Type of algorithm
BP(seq) BP(batch) BD CG PWLO
SSE error
Average no. of iterations
Best case
Worst case
Average
173.2549 179.5864 177.808 74.17225 64.15788
472.3864 426.9391 307.3966 94.20655 94.23526
226.6544 230.5609 214.2032 82.24458 76.41261
Average CPU time (s)
50000 50000 50000 36767 4873
5.3. Time series prediction: forecasting river flow In this experiment, the used time series consists of 650 points, where each point represents the 10-day mean of the Nile River flow measured at Dongola Station in northern Sudan in millions of cubic meters. The data extends from 1975 to 1993. These data points (used in Atiya, El-Shoura, Shaheen & El-Sherif, 1999) have been divided into two sets of 325 points each. The first set, which covers a 9-year period (1975–1984) has been used as a training set for a network consisting of three hidden neurons. An input example consists of three previous measurements and an additional measurement for the same period in the preceding year. The second set has been used to test the network prediction as shown in Fig. 7. Table 3 displays a summary of the 10 trials. Fig. 8 presents a graphical comparison between the learning speed of the BD and the PWLO algorithms. From Fig. 8, it can be observed that PWLO shows a tangible slow performance in the first stages of the algorithm in comparison to the BD. This may be referred to the relatively large number of training examples in this
70349.95 2524.3 2825.285 7768.794 187.558
Parameters
a
h
0.3 0.3 –
0.01 0.001 –
–
–
test, which resulted in stuffing the weight space with a huge number of hyperplanes that subsequently cause many stoppages, even along very steep descent directions. However, the BD suffers too much in the vicinity of the minimum, while the PWLO encounters no substantial troubles to find its way to the minimum. The average normalized root mean square error, NRMSE, of the PWLO algorithm is 0.202 for the training phase and 0.231 for the testing phase, where v uM M X uX
y0
m ⫺ T
m2 =
T
m2 NRMSE t m1
47
m1
For comparison, the BP(batch), BP(seq), BD and CG achieved an NRMSE of, respectively, 0.285, 0.296, 0.293 and 0.194 in the training set, and an NRMSE of, respectively, 0.377, 0.348, 0.338 and 0.258 in the testing set. In this example and the other examples we have noticed that the training error performance, to a large extent, carried over to the generalization error performance. This means that
Fig. 8. Learning speeds of BD and PWLO for the Nile River forecast problem.
500
E.F. Gad et al. / Neural Networks 13 (2000) 485–505
Table 4
Bankruptcy prediction using a 5-4-1 network

Type of      SSE error                               Average no.     Average CPU    Parameters
algorithm    Best case    Worst case    Average      of iterations   time (s)       α        η
BP(seq)      0.005187     24.70495      10.89279     48152           3177.248       0.3      0.1
BP(batch)    0.033912     29.2197       11.74011     37036           2376.906       0.5      0.005
BD           0.000043     27.37903      9.6168       43888           3117.431       –        –
CG           1.0 × 10⁻⁹   3.609034      1.0413       2404            326.24         –        –
PWLO         0.0          3.963935      1.0797       1523            18.678         –        –
5.4. Binary classification problems: bankruptcy prediction

Bankruptcy prediction is a good example of a binary classification problem. The data set for this experiment was used by Odom and Sharda (1990) in their work on applying NNs to bankruptcy prediction. In this test, the inputs reflect a set of indicators defining the financial status of a given firm, while the desired output assumes a binary value to indicate a potential bankruptcy. The PWLO algorithm has been slightly modified for binary classification problems. The first modification is to apply the binary threshold function f(x) to the output of the network, where

$f(x) = \begin{cases} 1, & x > 0.95 \\ x, & 0.05 \le x \le 0.95 \\ 0, & x < 0.05 \end{cases}$      (48)

instead of using a linear output node.
The second modification is to take the SSE as the criterion to accept or reject the weight update in Stage-B or in the Restore procedure, instead of the least absolute deviations. The number of examples is 88, and we used a 5-4-1 network. Table 4 summarizes the results of the 10 trials, and Fig. 9 shows the learning curves for PWLO, BD and CG.

6. Summary and conclusions

This paper has been essentially aimed at developing a novel optimization method. One advantage this optimization method offers is that it is well suited to training single-hidden-layer, digitally implemented NNs. It requires only the storage of an N × N matrix, where N is the total number of weights in the network. This is in contrast to the indefinite memory requirements that BP schemes may need in order to approximate the derivatives of the sigmoidal function, in addition to the potential troubles that may arise as a result of such an approximation.
Fig. 9. Learning speeds of BD, CG and PWLO for the bankruptcy prediction problem.
However, a more important advantage is that this optimization method provides a radically different approach that avoids the inherent ill-conditioning associated with the gradient-based optimization used by the BP algorithms. This can be clearly observed in the obtained results, in the form of a significantly accelerated learning process as compared to existing techniques. We could see, for example, how accurate and fast the algorithm was in the binary bankruptcy prediction problem as compared to the other algorithms. Future research efforts could be directed at developing efficient heuristics to avoid unnecessary stoppages along very steep descent directions. For example, one can use a hybrid technique, starting with BP training in the first stages and switching to the PWLO algorithm when the BP becomes very slow. Another issue that we plan to consider is improving the performance for very large sets of training data (thousands or tens of thousands of examples). For such a case the number of hyperplane boundaries increases significantly, and the algorithm loses part of its advantage over the other algorithms. Global optimization is another issue that needs to be tackled.
Acknowledgements

A.A. would like to acknowledge the support of NSF's Engineering Research Center at Caltech. The authors would like to thank Prof. M. Odom for supplying the bankruptcy data.
Appendix A. Recursive matrix inversion

Suppose we have a nonsingular matrix A ∈ R^{N×N} and suppose that A⁻¹ is already known. We now wish to compute A_b⁻¹, where A_b is obtained from A by exchanging the rth row with a vector b ∈ R^{1×N}. Firstly, let us write b as a linear combination of the rows of A:

$b = \sum_{i=1}^{N} y_i a_i, \qquad y_i \in \mathbb{R}$      (A1)

or, in matrix form,

$y = (A^T)^{-1} b^T = (A^{-1})^T b^T$      (A2)

From Eq. (A1) we have

$a_r = \frac{1}{y_r}\Bigl( b - \sum_{i=1,\, i \ne r}^{N} y_i a_i \Bigr)$

or, in matrix form,

$a_r = h^T A_b$      (A3)

where

$h = \Bigl( -\frac{y_1}{y_r}, \ldots, -\frac{y_{r-1}}{y_r}, \frac{1}{y_r}, -\frac{y_{r+1}}{y_r}, \ldots, -\frac{y_N}{y_r} \Bigr)^T$      (A4)

Hence

$A = E A_b$      (A5)

where the matrix E ∈ R^{N×N} is obtained from the identity matrix by exchanging the rth row with h^T. Multiplying Eq. (A5) by A⁻¹ and A_b⁻¹ from both sides we get

$A_b^{-1} = A^{-1} E$      (A6)

By inspecting Eq. (A3), we can deduce the necessary condition for the existence of A_b⁻¹, i.e. y_r ≠ 0. Thus, to inspect the singularity condition of A_b, we can obtain y_r by multiplying the rth row of (A⁻¹)^T (or the rth column of A⁻¹) by the vector b. Using Eq. (A2), we can write y = C^T b^T, where C ≜ A⁻¹.

procedure Recursive_Inverse_by_Row_Exchange (C, b, r)
comment: Given the inverse C of some matrix, say A, this algorithm computes the inverse of the matrix A_b, where A_b is obtained from A by replacing the rth row, a_r, with the row vector b. The procedure uses the matrix D as temporary storage for A_b⁻¹. Note that this algorithm presumes that the new matrix A_b will be nonsingular; hence, the calling procedure has to check the existence of A_b⁻¹.
begin
  Define the matrix D ∈ R^{N×N}
  Compute y (Eq. (A2)) and h (Eq. (A4))
  for i = 1 to N {
    for j = 1 to N {
      if j = r then d_{ij} = h_j c_{ir}
      else d_{ij} = h_j c_{ir} + c_{ij}
    }
  }
  return (D)
end
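For illustration only (not part of the original paper), the update of Eqs. (A2), (A4) and (A6) can be sketched in NumPy and checked against a direct inversion; the function name and the numerical tolerance below are my own choices.

import numpy as np

def inverse_by_row_exchange(C, b, r):
    """Given C = inv(A), return inv(A_b), where A_b replaces row r of A by b.

    Sketch of Appendix A: y = C.T @ b (Eq. (A2)); if y[r] is zero the new
    matrix is singular; otherwise build h (Eq. (A4)) and form
    inv(A_b) = C @ E (Eq. (A6)), where E is the identity with row r
    replaced by h.
    """
    C = np.asarray(C, dtype=float)
    b = np.asarray(b, dtype=float)
    N = C.shape[0]
    y = C.T @ b                      # Eq. (A2)
    if abs(y[r]) < 1e-12:            # singularity check: y_r must be nonzero
        raise np.linalg.LinAlgError("A_b is (numerically) singular")
    h = -y / y[r]
    h[r] = 1.0 / y[r]                # Eq. (A4)
    E = np.eye(N)
    E[r, :] = h
    return C @ E                     # Eq. (A6): inv(A_b) = inv(A) E

# quick check against direct inversion
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
b = rng.standard_normal(5)
r = 2
A_b = A.copy()
A_b[r, :] = b
print(np.allclose(inverse_by_row_exchange(np.linalg.inv(A), b, r), np.linalg.inv(A_b)))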
Appendix B. Pseudo code

Procedure Initialization
comment: a_k is the kth row of the matrix A; c_k is the kth column of the matrix C = A⁻¹. The indices m, j, i are understood to scan the ranges m = 1, …, M; j = 1, …, H and i = 0, 1, …, I wherever they appear.
begin
(1) Start by assigning random values to w_{ji}, v_j.
(2) Determine the members of the sets S⁻_m, S^L_m and S⁺_m. Use Eqs. (9)–(11).
(3) Compute w′_{ji} = w_{ji} · v_j.
(4) Define the vectors:
    w′_j^T ≜ (w′_{j0}, w′_{j1}, …, w′_{jI}),
    x^T ≜ (w′_{10}, …, w′_{1I}, w′_{20}, …, w′_{2I}, …, w′_{H0}, …, w′_{HI}, v_0, v_1, …, v_H).
(5) Define the vectors a^{Ia}_{j,m} and a^{Ib}_{j,m}. Use Eqs. (35) and (36).
(6) Define the constants b^{Ia}_{j,m} = b^{Ib}_{j,m} = 0, b^{II}_m = −T(m).
(7) Define the hyperplanes H^{Ia}_{j,m} ≜ {⟨a^{Ia}_{j,m}, x⟩ + b^{Ia}_{j,m} = 0}, H^{Ib}_{j,m} ≜ {⟨a^{Ib}_{j,m}, x⟩ + b^{Ib}_{j,m} = 0}.
(8) Initialize the set Ĥ to the empty set ∅.
(9) Define the matrix A ∈ R^{N×N} and the vector b ∈ R^N.
(10) Let k = 1.
(11) Call-Procedure Stage_A.
end

Procedure Stage-A
begin
while k ≤ N
begin
(1) Compute y_o(m), e(m). Use Eqs. (8) and (33).
(2) Define the vectors a^{II}_m. Use Eq. (37).
(3) Define the hyperplanes H^{II}_m: H^{II}_m ≜ {⟨a^{II}_m, x⟩ + b^{II}_m = 0}.
(4) Compute the gradient vector g.
(5) Compute the search direction h: if k = 1 then h = −g, else use Eq. (24).
(6) Compute the values of η^{Ia}_{j,m}, η^{Ib}_{j,m} and η^{II}_m. Use Eqs. (28), (29) and (32).
(7) Determine the indices (j*_1, m*_1), (j*_2, m*_2), m* such that:
    η^{Ia}_{j*_1,m*_1} = min_{j,m} {η^{Ia}_{j,m} | η^{Ia}_{j,m} > 0},
    η^{Ib}_{j*_2,m*_2} = min_{j,m} {η^{Ib}_{j,m} | η^{Ib}_{j,m} > 0},
    η^{II}_{m*} = min_m {η^{II}_m | η^{II}_m > 0}.
(8) Compute the minimum step size η_min: η_min = min(η^{Ia}_{j*_1,m*_1}, η^{Ib}_{j*_2,m*_2}, η^{II}_{m*}).
(9) Denote the nearest hyperplane as Ĥ_k = {⟨â, x⟩ + b̂ = 0}; then insert it as the kth member in Ĥ.
(10) Update the matrix A and the vector b correspondingly: a_k = â, b_k = −b̂, Ĥ = Ĥ ∪ {Ĥ_k}.
(11) Update the weights: w′_{ji} = w′_{ji} + η_min h_{j(I+1)−I+i}, v_j = v_j + η_min h_{H(I+1)+1+j}, x = x + η_min · h.
(12) Compute w_{ji}: w_{ji} = w′_{ji}/v_j.
(13) Compute the sets S⁻_m, S^L_m and S⁺_m. Use Eqs. (9)–(11).
(14) Increment k: k = k + 1.
end
(15) Call-Procedure Stage_B.
end
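To make steps (6)–(8) concrete, the sketch below (my own illustration, not the paper's code) computes, for a generic family of hyperplanes ⟨a, x⟩ + b = 0, the step η along a direction h at which each hyperplane is crossed, and picks the smallest positive one. It assumes, as is standard for a line search against hyperplanes, that the crossing step solves ⟨a, x + ηh⟩ + b = 0; this is the role played above by η^{Ia}, η^{Ib}, η^{II} and η_min. The names A_planes and b_planes are mine.

import numpy as np

def nearest_hyperplane(x, h, A_planes, b_planes, tol=1e-12):
    """Return (eta_min, index) of the closest hyperplane <a, x> + b = 0
    crossed when moving from x along direction h (smallest positive step).

    A_planes: array of shape (P, N), one hyperplane normal per row.
    b_planes: array of shape (P,).
    """
    A_planes = np.asarray(A_planes, dtype=float)
    b_planes = np.asarray(b_planes, dtype=float)
    denom = A_planes @ h
    numer = -(A_planes @ x + b_planes)
    with np.errstate(divide="ignore", invalid="ignore"):
        eta = np.where(np.abs(denom) > tol, numer / denom, np.inf)
    eta = np.where(eta > tol, eta, np.inf)   # keep strictly positive steps only
    idx = int(np.argmin(eta))
    return eta[idx], idx

In Stage-A, η_min would be the minimum of the values returned for the three families of hyperplanes (types Ia, Ib and II), and the normal â and offset b̂ of the winning hyperplane are the ones inserted into A and b.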
Procedure Stage-B
begin
Set: SINGULAR_Flag = FALSE, LOCAL_MIN_Flag = FALSE, LOCAL_MIN_Ind = 0, γ = 0.5, C = A⁻¹, ε = 10E−6.
Compute the error E using Eq. (14).
while E > E_min and LOCAL_MIN_Flag = FALSE
begin
(1) Compute y_o(m), e(m). Use Eqs. (8) and (33).
(2) Define the vectors a^{II}_m. Use Eq. (37).
(3) Define the hyperplanes H^{II}_m as H^{II}_m ≜ {⟨a^{II}_m, x⟩ + b^{II}_m = 0}.
(4) Create a new search direction: b_γ = b + γ e_k, x_γ = C · b_γ, h = (x_γ − x)/‖x_γ − x‖.
(5) Repeat steps (6) through (9) of Stage_A to find and identify the nearest hyperplane.
(6) Will the matrix A become singular if the vector â becomes its kth row? Test for singularity using the method developed in Appendix A:
    if |⟨â, c_k⟩| < ε then SINGULAR_Flag = TRUE else SINGULAR_Flag = FALSE
    w′_{ji,new} = w′_{ji} + η_min h_{j(I+1)−I+i}, v_{j,new} = v_j + η_min h_{H(I+1)+1+j}, x_new = x + η_min h.
(7) Compute the new value of the error function: E_new = E(w′_{ji,new}, v_{j,new}).
if E_new > E or SINGULAR_Flag = TRUE then
{
    k = k + 1
    if k > N then { k = 1; γ = −γ; }
    LOCAL_MIN_Ind = LOCAL_MIN_Ind + 1
    if LOCAL_MIN_Ind = 2N then LOCAL_MIN_Flag = TRUE
}
else
begin
(8) An accepted step. Update as in steps (10) and (11) of Stage_A.
(9) Update the matrix C: C = Recursive_Inverse_by_Row_Exchange(C, a_k, k).
(10) Remove the kth member of the set Ĥ and insert the nearest hyperplane as the new member.
(11) Store the previous configuration of the sets S⁻_m, S^L_m and S⁺_m: Ŝ⁻_m = S⁻_m, Ŝ^L_m = S^L_m, Ŝ⁺_m = S⁺_m.
(12) Compute w_{ji}: w_{ji} = w′_{ji}/v_j.
(13) Compute the new configuration of the sets S⁻_m, S^L_m and S⁺_m.
(14) Detect the emergence of any corrupted type II hyperplanes in Ĥ:
    T ≜ {p ∈ {1, 2, …, M} | H^{II}_p ∈ Ĥ},
    V ≜ {p ∈ T | Ŝ⁻_p ≠ S⁻_p or Ŝ^L_p ≠ S^L_p or Ŝ⁺_p ≠ S⁺_p}.
(15) If there are any corrupted type II hyperplanes then call the Restore procedure:
    if V ≠ ∅ then { Γ = {r ∈ {1, 2, …, N} | a_r = a^{II}_p for some p ∈ V}; Call-Procedure Restore(Γ) }.
(16) Reset the local minimum indicator to zero: LOCAL_MIN_Ind = 0.
end /* end of if–else statement */
end /* end of while statement */
end /* end of the Stage-B procedure */
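For illustration (again my own sketch, not the paper's code), step (4) of Stage-B and the singularity test of step (6) translate almost literally into NumPy; the function names are mine, and eps and gamma mirror ε and γ above.

import numpy as np

def stage_b_direction(C, b, x, k, gamma=0.5):
    # Step (4): perturb the kth constraint by gamma and head toward the
    # resulting corner x_gamma = C (b + gamma e_k), normalizing the direction.
    b_gamma = b.copy()
    b_gamma[k] += gamma
    x_gamma = C @ b_gamma
    h = x_gamma - x
    return h / np.linalg.norm(h)

def would_be_singular(a_hat, C, k, eps=1e-6):
    # Step (6): A becomes singular if the candidate row a_hat is (numerically)
    # orthogonal to the kth column of C = inv(A)  (cf. Appendix A, y_r = 0).
    return abs(a_hat @ C[:, k]) < eps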
Procedure Restore (Γ)
begin
SINGULAR_Flag = FALSE
for r = 1 to |Γ|
begin
(1) Let k = Γ(r), then repeat steps (1) through (6) of Stage_A.
(2) Compute η⁺_min, the minimum step size required to reach the nearest hyperplane along the "positive" direction of h:
    (a) find the indices (j*_1, m*_1), (j*_2, m*_2), m* such that
        η^{Ia}_{j*_1,m*_1} = min_{j,m} {η^{Ia}_{j,m} | η^{Ia}_{j,m} > 0},
        η^{Ib}_{j*_2,m*_2} = min_{j,m} {η^{Ib}_{j,m} | η^{Ib}_{j,m} > 0},
        η^{II}_{m*} = min_m {η^{II}_m | η^{II}_m > 0};
    (b) η⁺_min = min(η^{Ia}_{j*_1,m*_1}, η^{Ib}_{j*_2,m*_2}, η^{II}_{m*}).
(3) Compute η⁻_min, the minimum step size required to reach the nearest hyperplane along the "negative" direction of h:
    (a) find (j*_1, m*_1), (j*_2, m*_2), m* such that
        η^{Ia}_{j*_1,m*_1} = max_{j,m} {η^{Ia}_{j,m} | η^{Ia}_{j,m} < 0},
        η^{Ib}_{j*_2,m*_2} = max_{j,m} {η^{Ib}_{j,m} | η^{Ib}_{j,m} < 0},
        η^{II}_{m*} = max_m {η^{II}_m | η^{II}_m < 0};
    (b) η⁻_min = max(η^{Ia}_{j*_1,m*_1}, η^{Ib}_{j*_2,m*_2}, η^{II}_{m*}).
(4) Select the accepted step size (η⁺_min or η⁻_min) as the one that would further reduce E; denote it by η_min.
(5) Denote the hyperplane associated with η_min as Ĥ_k = {⟨â, x⟩ + b̂ = 0}.
(6) Repeat steps (6) and (7) of Stage_B to test for the singularity of the matrix A if â becomes the kth row, and to compute E_new, respectively.
if SINGULAR_Flag = FALSE then
(7) Repeat steps (8) through (10) of Stage_B to update.
end
end
Appendix C. The error is nonincreasing along the trajectories

For Procedure 1 it is clear, since in Step B we take a step only if the new error is less than the old error. For Procedure 2, we take a step if g^T h_i < 0. Since the error function is linear in the quadrant (region) considered, the new error is given by

$E(x_{new}) = E(x_{old}) + g^T (x_{new} - x_{old}) = E(x_{old}) + \eta\, g^T h_i < E(x_{old})$

since, as mentioned, g^T h_i < 0.
Appendix D. Local minimum for piecewise-linear functions

Assume we are at an intersection of N hyperplanes, at a point x_0 given by

$A x_0 = b$      (D1)

(see Eq. (38)). Assume that we have explored the 2N directions:

$h_i = A_i^{-1} b^{(i)} - x_0$      (D2)

where

$b^{(i)} = (b_1, \ldots, b_i + \gamma, \ldots, b_N)^T$   for i = 1, …, N, and

$b^{(i)} = (b_1, \ldots, b_{i-N} - \gamma, \ldots, b_N)^T$   for i = N + 1, …, 2N.

The index i in A_i reflects the fact that in the different regions around x_0 the a^{II} rows of A might have changed, as discussed in Section 4.3. Using Eqs. (D1) and (D2), we can write

$h_i = \gamma A_i^{-1} e_i, \quad i = 1, \ldots, N; \qquad h_i = -\gamma A_i^{-1} e_{i-N}, \quad i = N + 1, \ldots, 2N$      (D3)

Let x′ be a point in the neighbourhood of x_0. Let us express x′ − x_0 as a linear combination of N of the h_i's:

$x' = x_0 + \sum_{j=1}^{N} \lambda_j h_{i_j}$      (D4)

under the condition that h_i and h_{i+N} cannot both be represented in the expansion. Such an expansion is possible, since A_i is nonsingular, and hence any choice of N h_{i_j}'s will be linearly independent (because of relation (D3)). Next we show that there is a set of N h_{i_j}'s such that expansion (D4) results in nonnegative λ_j. Let us arrange the h_{i_j}'s as columns in a matrix H, and let us arrange the λ_j's as a column vector λ. Then

$x' - x_0 = H \lambda$      (D5)

From Eq. (D3) we get

$H = \gamma A^{-1} E$      (D6)

where E is a diagonal matrix with e_{jj} = +1 or −1 depending on the choice of h_{i_j}, and A is the constraint matrix associated with the region flanked by the h_{i_j} vectors. Then, from Eqs. (D5) and (D6), we get

$\lambda = \frac{1}{\gamma} E A (x' - x_0)$      (D7)

where we used the fact that E⁻¹ = E. We will show here that if λ has negative components, then it is possible to change the basis vector combination in order to achieve nonnegative λ_j. Consider the λ_j's corresponding to type I constraints. They will be equal to (1/γ) e_{jj} a^I_{k_j} (x′ − x_0). If they are negative, then replace the corresponding basis vector h_{i_j} by −h′_{i_j}; this will change the sign of e_{jj}, and hence the new λ_j will be positive. After changing the basis vectors associated with type I constraints, the rows of the A matrix associated with type II constraints will have changed, creating a new matrix A′ and a new vector λ′. Then we observe the λ′_j's associated with type II constraints. If they are negative, then again we replace the corresponding basis vector h_{i_j} by −h′_{i_j}. This replacement will not tamper further with the A matrix, and hence this new basis results in nonnegative λ_j. Since the error function is linear (no discontinuities) in the region spanned by any N basis vectors h_{i_j}, then by using Eq. (D4) we get

$E(x') = E(x_0) + \sum_{j=1}^{N} \lambda_j\, E(h_{i_j})$      (D8)

where E(h_{i_j}) denotes the change in the error incurred by moving from x_0 along h_{i_j}. Using Eq. (D2) and the theorem assumption that all the 2N directions lead to a higher error, we have E(h_{i_j}) > 0, and hence E(x′) > E(x_0).
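As a purely numerical illustration of Eqs. (D5)–(D7) (my own sketch, not from the paper, and simplified by taking A_i = A, i.e. ignoring the possible change of the type II rows between regions), one can check that the coefficients λ of an expansion in the directions given by the columns of H = γ A⁻¹ E are recovered by λ = (1/γ) E A (x′ − x_0):

import numpy as np

rng = np.random.default_rng(1)
N, gamma = 4, 0.5
A = rng.standard_normal((N, N))
b = rng.standard_normal(N)
x0 = np.linalg.solve(A, b)                 # Eq. (D1): A x0 = b

signs = np.array([1.0, -1.0, 1.0, -1.0])   # one choice of basis directions (e_jj = +/-1)
E = np.diag(signs)
H = gamma * np.linalg.inv(A) @ E           # Eq. (D6): columns are the h_{i_j}

lam_true = np.array([0.3, 0.0, 1.2, 0.7])  # nonnegative expansion coefficients
x_prime = x0 + H @ lam_true                # Eq. (D5)

lam_rec = (1.0 / gamma) * E @ A @ (x_prime - x0)   # Eq. (D7)
print(np.allclose(lam_rec, lam_true))      # True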
Appendix E. Convergence in a finite number of iterations

The number of points that could possibly be visited is finite, and equals the number of intersections of N hyperplane boundaries (or more than N in case of degeneracies). Along a descent trajectory no point is visited more than once. This is also true along the trajectories of constant E, because of the mechanism described to prevent cycling. Also, no move will carry us to infinity. If the move is a descent move, then an infinite-sized move would result in E going to minus infinity, which is impossible because E cannot be negative. If the move is a constant-E move and we encounter a move that would go to infinity, then we can simply ignore this move and choose another search direction. To summarize, there is a finite number of points, every point is visited at most once, and every move leads from one point to another (unless we are at a local minimum). Hence the number of iterations is finite.
References

Abu-Mostafa, Y. (1989). The Vapnik–Chervonenkis dimension: information versus complexity in learning. Neural Computation, 1, 312–317.
Atiya, A., El-Shoura, S., Shaheen, S., & El-Sherif, M. (1999). A comparison between neural network forecasting techniques: case study: river flow forecasting. IEEE Transactions on Neural Networks, 10 (2), 402–409.
Barmann, F., & Biegler-Konig, F. (1992). On a class of efficient learning algorithms for neural networks. Neural Networks, 2, 75–90.
Batruni, R. (1991). A multilayer neural network with piecewise-linear structure and backpropagation learning. IEEE Transactions on Neural Networks, 2, 395–403.
Battiti, R. (1992). 1st order and 2nd order methods for learning: between steepest descent and Newton method. Neural Computation, 4, 141–166.
Battiti, R., & Tecchiolli, G. (1994). Learning with first, second and no derivatives: a case study in high energy physics. Neurocomputing, 6, 181–206.
Baum, E., & Haussler, D. (1989). What net size gives valid generalization? Neural Computation, 1, 151–160.
Benchekroun, B., & Falk, J. (1991). A nonconvex piecewise-linear optimization problem. Computers in Mathematical Applications, 21, 77–85.
Bishop, C. (1995). Neural networks for pattern recognition. Oxford: Oxford University Press.
Bishop, C., & Nabney, I. (1998). Netlab neural network simulator, http://www.ncrg.aston.ac.uk/netlab/index.html.
Bobrowski, L. (1991). Design of piecewise linear classifiers from formal neurons by a basis exchange technique. Pattern Recognition, 24, 863–870.
Bobrowski, L., & Niemiro, W. (1984). A method of synthesis of linear discriminant functions in the case of nonseparability. Pattern Recognition, 17, 205–210.
Golden, R. (1996). Mathematical methods for neural network analysis and design. Cambridge, MA: MIT Press.
Hammerstrom, D. (1992). Electronic neural network implementation, Tutorial. International Joint Conference on Neural Networks. Baltimore, MD.
Haykin, S. (1994). VLSI implementation of neural networks. Neural networks: a comprehensive foundation. New York: IEEE Press (chap. 15).
Hush, D., & Horne, B. (1998). Efficient algorithms for function approximation with piecewise linear sigmoidal networks. IEEE Transactions on Neural Networks, 9 (6), 1129–1141.
Kevenaar, T., & Leenaerts, D. M. W. (1992). A comparison of piecewise-linear model descriptions. IEEE Transactions on Circuits and Systems, 39, 996–1004.
Lin, J.-N., & Unbehauen, R. (1990). Adaptive nonlinear digital filter with canonical piecewise-linear structure. IEEE Transactions on Circuits and Systems, 37, 347–353.
Lin, J.-N., & Unbehauen, R. (1995). Canonical piecewise-linear neural networks. IEEE Transactions on Neural Networks, 6, 43–50.
Michel, A., Si, J., & Yen, G. (1991). Analysis and synthesis of a class of discrete-time neural networks described in hypercubes. IEEE Transactions on Neural Networks, 2 (1), 32–46.
Moody, J. (1992). The effective number of parameters: an analysis of generalization and regularization in nonlinear learning systems. In J. Moody, S. Hansen & R. Lippmann (Eds.), Advances in Neural Information Processing Systems, 4 (pp. 847–854).
Odom, M., & Sharda, R. (1990). A neural network model for bankruptcy prediction. In Proceedings of the IEEE International Conference on Neural Networks (pp. 163–168).
Pao, Y. H. (1989). Adaptive pattern recognition and neural networks. Reading, MA: Addison-Wesley.
Saarinen, S., Bramely, R., & Cybenko, G. (1993). Ill-conditioning in neural network training problems. SIAM Journal of Scientific Computing, 14 (3), 693–714.
Staley, M. (1995). Learning with piecewise-linear neural networks. International Journal of Neural Systems, 6, 43–59.
Vogl, T., Mangis, J., Rigler, A., Zink, W., & Alkon, D. (1988). Accelerating the convergence of the back-propagation method. Biological Cybernetics, 59, 257–263.