
Orthogonal Least Squares Algorithm for Training Cascade Neural Networks

Gao Huang, Shiji Song, and Cheng Wu

Abstract—This paper proposes a novel constructive training algorithm for cascade neural networks. By reformulating the cascade neural network as a linear-in-the-parameters model, we use the orthogonal least squares (OLS) method to derive a novel objective function for training new hidden units. With this objective function, the sum of squared errors (SSE) of the network can be maximally reduced after each new hidden unit is added, thus leading to a network with fewer hidden units and better generalization performance. Furthermore, the proposed algorithm considers both the input weights training and the output weights training in an integrated framework, which greatly simplifies the training of the output weights. The effectiveness of the proposed algorithm is demonstrated by simulation results.

Index Terms—Cascade correlation, constructive neural networks, Newton's method, orthogonal least squares.

I. INTRODUCTION

ARTIFICIAL neural networks have been widely used for designing, simulating, and modeling circuits and systems [1]–[4]. One of the most important issues in the application of neural networks is determining the network architecture. The multilayer perceptron (MLP) architecture has been widely used over the past decades, due to its universal approximation ability, structural flexibility, and the variety of available training algorithms [5]. However, recent studies have shown that the MLP architecture is less powerful than the cascade network (CN), which allows connections across layers [6], [7]. For example, B. M. Wilamowski has shown that a CN with 8 units (or neurons) can solve the Parity-255 problem, while an MLP architecture with one hidden layer and 8 units can only solve the Parity-7 problem [7], [8]. Generally, the CN requires significantly fewer units and connections than the MLP architecture to solve the same problem, and its advantage in generalization over the MLP architecture is evident. Other network architectures, such as the radial basis function (RBF) network [9] and the self-organizing map (SOM) [10], have also been widely adopted. However, these architectures usually require a relatively large number of hidden units. In this work, we focus on cascade networks with sigmoidal activation functions.

Manuscript received July 03, 2011; revised January 01, 2012; accepted January 12, 2012. Date of publication October 12, 2012; date of current version October 24, 2012. This work was supported by the National Natural Science Foundation of China under Grant 60874071, the Research Foundation for the Doctoral Program of Higher Education under Grant 20090002110035, the Project of China Ocean Association under Grant DYXM-125-03, and the Independent Research Project at Tsinghua University under Grant 2010THZ07002. This paper was recommended by Associate Editor B. Shi. The authors are with the Department of Automation, Tsinghua University, Beijing 100084, China. Digital Object Identifier 10.1109/TCSI.2012.2189060

An important issue in training cascade networks is to determine the network size, i.e., the number of hidden units.

Too small a network will lack the computational power to learn a problem well, while too large a network may overfit the training data and generalize poorly to new patterns. Usually, the network size is determined by a trial-and-error approach: we independently train a number of networks with different sizes and then select the smallest network that can learn the problem well. Many algorithms have been proposed for training CNs with fixed sizes [11]–[14]. The recently proposed neuron-by-neuron (NBN) algorithm is an efficient second-order algorithm for training various feedforward networks, including CNs [13], [14]. It has solved Wieland's two-spiral problem [15] using a CN with only 8 units and 52 weights [14], while the error back-propagation (EBP) algorithm requires a CN with at least 12 units and 102 weights to solve this problem.

An alternative way to determine the network size is to adopt an efficient algorithm that can automatically find the optimal network size. Two classes of algorithms have been proposed for determining the network size automatically [16], [17]. One is the class of pruning algorithms, which start with a large network and gradually remove units and connections whose absence will not significantly degrade the network's performance. The other is the class of constructive algorithms, which start from the simplest network and then add units and connections until a satisfactory solution is found. In practice, constructive algorithms are more widely adopted since they are easier to implement and computationally more economical.

The cascade correlation network (CCN) algorithm proposed by S. E. Fahlman and C. Lebiere is one of the most popular constructive algorithms for training cascade networks [18]. Different from algorithms that deal with fixed-size networks, the CCN grows from the simplest structure without hidden units. The algorithm then installs hidden units into the network one by one during the training process until a given performance criterion is satisfied. The CCN has several remarkable advantages: 1) it is simple and efficient, since at any time only the weights of a single hidden unit are trained; 2) the network can determine its own size and topology; 3) it is useful for incremental learning, in which new information is added to the existing network. The CCN consists of training the weights feeding into a new hidden unit (input training), and training the weights that connect the input units and all the hidden units to the output units (output training) (see Fig. 1).

Generally speaking, both the input and output training in the CCN have disadvantages. Firstly, during the input training, the input weights are trained to maximize an objective function defined by the correlation between the output of a new hidden unit and the residual error of the existing network. However, this objective function cannot guarantee a maximal error reduction when a new hidden unit is added to the network, which may slow down the convergence of the algorithm and result in a large network with poor generalization performance. Secondly, the output training is performed repeatedly



after each new hidden unit is added, which is usually time consuming.

Fig. 1. Interpretation of adding the second hidden unit to the cascade network.

In this work, we propose an orthogonal least squares based cascade network (OLSCN) algorithm to overcome these disadvantages. We reformulate the cascade network as a linear-in-the-parameters model, and adopt the OLS algorithm [19], [20] to derive a new objective function for input training. By using the second-order algorithm developed in our work, the proposed objective function can be optimized efficiently. We also introduce a method to simplify the output training by using information from the input training phase. It is worth mentioning that the conventional OLS algorithm and other forward subset selection methods, such as the forward recursive algorithm (FRA) [21], [22], have been widely adopted for constructing RBF networks, in which they are used for selecting RBF neurons from a given candidate pool. In our work, however, the OLS is used for deriving a new objective function for training hidden units, which is inspired by recent works that use forward subset selection methods to search for optimal RBF neurons over a continuous parameter space [2], [23]–[25].

The proposed method has two main advantages over the CCN and its variants. Firstly, it theoretically guarantees a maximal SSE reduction when each new hidden unit is added, thus leading to a smaller network with better generalization performance. Secondly, there is no need to train the output weights repeatedly in our algorithm: the output weights are calculated only once, after all the hidden units have been added, and moreover, this calculation is greatly simplified.

The rest of this paper is organized as follows. In Section II, we introduce the basic idea of the CCN algorithm and discuss its disadvantages; then we formulate the cascade network as a linear-in-the-parameters model and introduce the OLS algorithm. The details of the proposed OLSCN algorithm are presented in Section III. Section IV gives the simulation results, and Section V concludes this paper.

II. PROBLEM FORMULATION

A. Cascade Correlation Network and its Disadvantages

The CCN begins with the simplest structure, in which the input units are directly connected to the output units. Then the weights are trained to minimize an error criterion such as the SSE. The obtained network is actually a linear model, which usually cannot learn a complex problem well, and nonlinear hidden units are required to further reduce the training error.

When a new hidden unit is to be added, we generate a candidate unit that receives trainable input connections, denoted as input weights, from the input units and all the pre-existing hidden units (the input weights of the second hidden unit are shown by empty boxes in Fig. 1). The output of the candidate unit is not yet connected to any unit in the network. These input weights are initialized with small random values, and then they are trained to maximize the correlation function defined by

C = | Σ_{p=1}^{N} (v_p − v̄)(e_p − ē) |    (1)

where N is the number of training patterns, v_p and e_p are the output of the candidate unit and the residual error of the network for pattern p before the candidate unit is added, and v̄ and ē are the corresponding values averaged over all patterns. The above training process is referred to as the input training, for which a gradient based optimization algorithm, such as the quickprop algorithm [26], is usually adopted. To alleviate the local minima problem of a gradient based method, we can separately train a pool of candidate units with differently initialized input weights. The candidate unit with the largest correlation value is then added to the network, by connecting it to the output unit and fixing (freezing) its input weights. After the new hidden unit is added, all the weights connected to the output units, denoted as output weights, are once again trained to minimize the network error; this training process is referred to as the output training (shown by crosses and small circles in Fig. 1).

As mentioned in the Introduction, the CCN algorithm has two major disadvantages. The first disadvantage concerns the input training: the objective function (1) is designed in an empirical way. As discussed in [18], early versions of the CCN used a true correlation measure with some normalization terms, but the use of (1) works better in most situations. In fact, when a new hidden unit is to be added, the CCN intends to maximally reduce the training error, which is, however, not theoretically guaranteed. In order to improve the performance of the CCN, many new objective functions have been proposed to replace the correlation function [27]–[31]. In [31], T. Y. Kwok and D. Y. Yeung surveyed a class of objective functions whose value can be computed in time proportional to the number of training patterns. They proved that some of these objective functions lead to an optimal solution that maximizes a given error measure, provided the existing output weights are frozen [31]. But this assumption does not hold for the CCN, since all the existing output weights are updated after each new hidden unit is added. As a consequence of this retraining of the output weights, these objective functions still cannot guarantee a maximal error reduction. The other disadvantage of the CCN is that the output weights are trained repeatedly and independently from the input training. A method has been proposed in [31] to simplify the output training by using information from the input training phase, but the retraining of the output weights is still required in that method.

In this work, a novel constructive algorithm is proposed to alleviate these disadvantages. First, the proposed objective function for input training theoretically guarantees a maximal SSE reduction when each hidden unit is added to the network. Second, the output training is greatly simplified and there is no need to train the output weights repeatedly.
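As a concrete illustration of the correlation objective (1), the following is a minimal NumPy sketch. The helper names (candidate_output, ccn_correlation) and the assumed weight layout (a bias followed by connections to the inputs and the existing hidden units) are illustrative, not taken from [18].

```python
import numpy as np

def candidate_output(X, hidden_outputs, w):
    """Output of a candidate unit with tanh activation.

    X: (N, n) input patterns; hidden_outputs: (N, h) outputs of the
    pre-existing hidden units; w: weight vector of length 1 + n + h,
    bias first (this layout is an assumption made for illustration).
    """
    Z = np.hstack([np.ones((X.shape[0], 1)), X, hidden_outputs])
    return np.tanh(Z @ w)

def ccn_correlation(v, e):
    """Correlation objective (1): |sum_p (v_p - v_bar)(e_p - e_bar)|."""
    return abs(np.sum((v - v.mean()) * (e - e.mean())))

# Usage: train w (e.g., by quickprop or any gradient method) to maximize
# ccn_correlation(candidate_output(X, H, w), residual_error).
```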


B. The Linear-in-the-Parameters Model and the OLS Algorithm

Suppose that a set of patterns {(x_p, y_p)}, p = 1, ..., N, is used as training data (N is the number of training patterns, and n is the dimension of the input patterns), where x_p ∈ R^n is the p-th input pattern and y_p ∈ R is the corresponding desired output. In cascade networks, the input units are allowed to be connected to the output units directly. In order to simplify notation, we use the term basis units to denote the input units and the hidden units in a unified framework, i.e., the j-th basis unit denotes the j-th input unit for j = 1, ..., n, and the (j − n)-th hidden unit for j > n. Thus the output of the j-th basis unit is given by

φ_{j,p} = x_{p,j},  j = 1, ..., n;   φ_{j,p} = f( w_{j,0} + Σ_{i=1}^{j−1} w_{j,i} φ_{i,p} ),  j > n    (2)

where f(·) is a nonlinear activation function, and w_j = [w_{j,0}, w_{j,1}, ..., w_{j,j−1}]^T are the weights that connect the input units and the pre-existing hidden units to the j-th basis unit. We consider a cascade network with n + K basis units and one linear output unit; then the output of the network is given by

ŷ_p = Σ_{j=1}^{n+K} θ_j φ_{j,p}    (3)

where θ = [θ_1, ..., θ_{n+K}]^T are the weights that connect all the basis units to the output unit. Thus, the prediction error of the network is given by

e_p = y_p − ŷ_p.    (4)

With the above expressions, we can reformulate the cascade network as a linear-in-the-parameters problem:

y = Φθ + e    (5)

where y = [y_1, ..., y_N]^T, e = [e_1, ..., e_N]^T, and Φ = [φ_1, ..., φ_{n+K}] is referred to as the regression matrix with columns φ_j = [φ_{j,1}, ..., φ_{j,N}]^T, i.e., φ_j is the output vector of the j-th basis unit. For problem (5), we aim to minimize the SSE defined as

E = e^T e.    (6)

The main idea of OLS is to transform the set of vectors φ_1, ..., φ_{n+K} into a set of orthogonal basis vectors by performing the QR factorization [32] on Φ:

Φ = QA    (7)

where A is an upper triangular matrix with 1's on the diagonal, and Q = [q_1, ..., q_{n+K}] is a matrix with orthogonal columns such that

Q^T Q = D = diag(q_1^T q_1, ..., q_{n+K}^T q_{n+K})    (8)

where q_j is the orthogonalized vector of φ_j. The OLS solution [32] to the output weights is given by

θ̂ = A^{−1} D^{−1} Q^T y.    (9)

By using (5), (7), and (9), the error vector can be expressed by

e = y − Q D^{−1} Q^T y    (10)

or

e = (I − Q D^{−1} Q^T) y    (11)

where I is an N × N identity matrix. Let Φ_k = [φ_1, ..., φ_k] be the regression matrix of the network with k basis units, and Q_k = [q_1, ..., q_k] the corresponding orthogonal matrix. We define the following residual matrix series:

R_0 = I,   R_k = I − Σ_{j=1}^{k} q_j q_j^T / (q_j^T q_j),   k = 1, 2, ...    (12)

where q_j are the columns of Q_k. It is easy to verify that R_k is symmetric and idempotent:

R_k^T = R_k,   R_k R_k = R_k.    (13)

From (11)–(13), the SSE in (6) can be computed by

E_k = y^T R_k^T R_k y    (14)

or

E_k = y^T R_k y.    (15)

This expression plays a fundamental role in the OLS algorithm. In Section III, we will use it to derive a new objective function for training new hidden units in cascade networks.
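As a concrete illustration of the OLS machinery in (7), (8), (15), and (17), the following is a minimal NumPy sketch. It assumes the regression matrix has full column rank, and the function names (ols_decompose, sse_after_k_units) are illustrative rather than taken from the paper.

```python
import numpy as np

def ols_decompose(Phi, y):
    """Classical Gram-Schmidt version of the QR factorization (7)-(8).

    Phi: (N, M) regression matrix with columns phi_j; y: (N,) targets.
    Returns Q (orthogonal columns), A (unit upper triangular) and the
    OLS output weights of (9). A sketch for illustration only.
    """
    N, M = Phi.shape
    Q = np.zeros((N, M))
    A = np.eye(M)
    for k in range(M):
        q = Phi[:, k].copy()
        for j in range(k):
            A[j, k] = Q[:, j] @ Phi[:, k] / (Q[:, j] @ Q[:, j])  # cf. (18)
            q -= A[j, k] * Q[:, j]                               # cf. (17)
        Q[:, k] = q
    d = np.einsum('ij,ij->j', Q, Q)            # diagonal of D in (8)
    g = (Q.T @ y) / d                          # OLS coefficients, cf. (21)
    theta = np.linalg.solve(A, g)              # A theta = g, cf. (9) and (28)
    return Q, A, theta

def sse_after_k_units(Q, y, k):
    """SSE of (15): E_k = y^T y - sum_{j<=k} (q_j^T y)^2 / (q_j^T q_j)."""
    proj = [(Q[:, j] @ y) ** 2 / (Q[:, j] @ Q[:, j]) for j in range(k)]
    return y @ y - sum(proj)
```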

III. OLSCN ALGORITHM

The proposed OLSCN algorithm differs from the CCN in three aspects: 1) the OLSCN adopts a new objective function for input training; 2) the objective function in the OLSCN is optimized by an efficient second-order algorithm; 3) the output weights in the OLSCN are obtained in a greatly simplified way. The details of the OLSCN algorithm are given in this section.

A. The New Objective Function

Since the trained input weights are frozen, the outputs of the existing hidden units do not change as the network grows. It follows that the orthogonalized vectors q_1, ..., q_{k−1} and the residual matrix R_{k−1} defined in (12) remain unchanged while the k-th basis unit is being trained, where φ_j is the output vector of the j-th basis unit and q_j is the orthogonalized vector of φ_j.


Therefore, the net SSE reduction due to adding the k-th basis unit to the network can be calculated from (15):

ΔE_k = E_{k−1} − E_k = (q_k^T y)^2 / (q_k^T q_k)    (16)

where E_{k−1} is the SSE before the k-th basis unit is added. This equation allows us to evaluate the contribution of a candidate unit to the SSE reduction without solving the least squares problem explicitly. In the OLSCN algorithm, we use (16) as the objective function for input training, so that the net SSE reduction can be directly maximized when a new hidden unit is added.

B. Input Training

Consider that the k-th basis unit is to be added to the network. We first generate a candidate unit with randomly initialized input weights w_k, and calculate its output vector φ_k using (2). Then the Gram-Schmidt method is used to orthogonalize φ_k with q_1, ..., q_{k−1}, which is equivalent to orthogonalizing φ_k with R_{k−1}:

q_k = φ_k − Σ_{j=1}^{k−1} a_{j,k} q_j = R_{k−1} φ_k    (17)

where

a_{j,k} = q_j^T φ_k / (q_j^T q_j),   j = 1, ..., k − 1.    (18)

Note that for j ≤ n, q_j and a_{i,j} are obtained by performing the QR factorization Φ_n = Q_n A_n (n is the dimension of the input patterns). By using the matrix series defined in (12), q_k can also be expressed by R_{k−1} φ_k. With q_k computed from (17), the value of the objective function (16) is obtained. By maximizing (16) with respect to the input weights, the SSE can be maximally reduced. In this paper, a modified Newton's method is applied to optimize the objective function. This requires the computation of the gradient vector and the Hessian matrix, which are given as follows.

The first- and second-order partial derivatives of the candidate unit's output vector with respect to the input weights are denoted as

∂φ_k/∂w_{k,i}   and   ∂²φ_k/(∂w_{k,i} ∂w_{k,l}).    (19)

Thus, the partial derivatives of q_k with respect to w_k are computed by

∂q_k/∂w_{k,i} = R_{k−1} (∂φ_k/∂w_{k,i})    (20)

where ∂q_k/∂w_{k,i} denotes the gradient vector of q_k with respect to the i-th input weight w_{k,i}, and

g_j = q_j^T y / (q_j^T q_j).    (21)

Notice that R_{k−1} can be computed recursively by

R_k = R_{k−1} − q_k q_k^T / (q_k^T q_k)    (22)

and q_k is computed from (17); thus the partial derivatives can be computed without the direct use of the matrix R_{k−1}, and the space requirement is only proportional to N. Hence, the gradient vector of (16) with respect to w_k is given by

∂ΔE_k/∂w_{k,i} = 2 g_k (y − g_k q_k)^T (∂q_k/∂w_{k,i})    (23)

where g_k is defined in (21). The Hessian matrix H is given by (24) (the detailed derivation is given in Appendix A), where the auxiliary quantity in (25) is used and ‖·‖ denotes the Euclidean norm. Note that with the above formulas, the time and space complexity of computing the Hessian matrix are also proportional to N.

Given the gradient vector and the Hessian matrix, Newton's method for the input weights update is given by

Δw_k = −H^{−1} ∇ΔE_k(w_k).    (26)

Since the Hessian matrix may be singular, or the Newton direction of (26) may not be a direction of ascent when the Hessian H is not negative definite, a modified Newton's method is used in this paper:

Δw_k = (μI − H)^{−1} ∇ΔE_k(w_k)    (27)

where μ is a nonnegative damping factor which is adjusted at each iteration. If an update is successful (the objective function value is increased), μ is decreased, bringing the algorithm closer to (26); otherwise μ is increased, bringing the algorithm closer to the steepest ascent algorithm.
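The damped update (27) can be sketched as follows. The objective, gradient, and Hessian of (16) are passed in as callables, since (24) and (25) are not reproduced above, and the initial damping value and the adjustment factors are assumptions chosen for illustration.

```python
import numpy as np

def train_candidate(w0, objective, gradient, hessian,
                    mu=1.0, mu_up=10.0, mu_down=0.1, max_iter=100, tol=1e-8):
    """Damped Newton ascent in the spirit of (27).

    objective(w) returns the SSE reduction (16) for input weights w;
    gradient(w) and hessian(w) return its first and second derivatives.
    The step is (mu*I - H)^{-1} grad; mu shrinks after a successful step
    (toward the Newton step (26)) and grows after a failed one (toward
    steepest ascent). Names and schedules are illustrative.
    """
    w = np.asarray(w0, dtype=float)
    f = objective(w)
    for _ in range(max_iter):
        g = gradient(w)
        H = hessian(w)
        try:
            step = np.linalg.solve(mu * np.eye(len(w)) - H, g)
        except np.linalg.LinAlgError:
            mu *= mu_up              # singular system: damp more and retry
            continue
        f_new = objective(w + step)
        if f_new > f:                # successful update: accept, reduce damping
            w, f = w + step, f_new
            mu *= mu_down
            if np.linalg.norm(step) < tol:
                break
        else:                        # failed update: increase damping
            mu *= mu_up
    return w, f
```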

C. Output Training

In the OLSCN algorithm, the output weight training is greatly simplified. Noticing that the column vectors of Q are orthogonal and A is an upper triangular matrix with 1's on the diagonal, we can rewrite (9) as

A θ̂ = g,   i.e.,   θ̂_j = g_j − Σ_{i=j+1}^{n+K} a_{j,i} θ̂_i,   j = n + K, ..., 1    (28)

where a_{j,i} are the elements of A, and g_j are given by (21). Thus, the output weights can be easily solved using back substitution from the above equation. Since the output weights are not needed in the input training phase, we are not required to perform the output training until all the hidden units have been added.

D. Summary of OLSCN Algorithm

The OLSCN algorithm is summarized as follows.

Step 1): Initialization phase. Connect the input units to the output units directly, and let k = n (n is the dimension of the network input). Compute the regression matrix Φ_n from (2), and perform the QR factorization Φ_n = Q_n A_n.

Step 2): Check phase. If the stop criterion is satisfied, e.g., the SSE is less than a given threshold, then go to Step 5); otherwise, go to Step 3).

Step 3): Input training phase. Randomly generate a number of candidate units, i.e., initialize the input weights of the candidate units with small random values. Train the input weights of the candidate units to maximize the objective function (16) with the modified Newton's method in (27).

Step 4): Update phase. Install the candidate unit that gives the largest error reduction into the network, and
• Let k = k + 1, Φ_k = [Φ_{k−1}, φ_k], and Q_k = [Q_{k−1}, q_k].
• Update A with the new elements computed from (18).
• Update R_k using (22).
• Update the SSE: E_k = E_{k−1} − ΔE_k.
Go to Step 2).

Step 5): Output training phase. Calculate the output weights using (28).

E. Discussion

In [31], a number of objective functions, including the correlation function (1), for training input weights in cascade networks have been studied in a unified framework. The basic form of those objective functions is

S_k = (φ_k^T e_{k−1})^2 / (φ_k^T φ_k)    (29)

where φ_k and e_{k−1} are defined in (2) and (5). In fact, the proposed objective function can also be treated as a variant of (29). Noticing that q_k = R_{k−1} φ_k and e_{k−1} = R_{k−1} y, we can rewrite (16) as

ΔE_k = (φ_k^T e_{k−1})^2 / (φ_k^T R_{k−1} φ_k).    (30)

The difference between (29) and (30) explains how the OLSCN algorithm is different from those algorithms surveyed in [31]. In the proposed objective function (30), the matrix R_{k−1} carries the information of the existing network, which is useful for evaluating the true contribution of a candidate unit. By considering this information, the SSE reduction in OLSCN is given by the maximum value of (30). Note that in [31], a lower bound of the SSE reduction given by (29) has been obtained:

E_{k−1} − E_k ≥ max_{φ_k ∈ F_k} (φ_k^T e_{k−1})^2 / (φ_k^T φ_k) = (φ*^T e_{k−1})^2 / (φ*^T φ*)    (31)

where F_k is the set of all the possible functions that can be implemented by the k-th basis unit, and φ* denotes the optimal solution. For any φ_k ∈ F_k, we have

(φ_k^T e_{k−1})^2 / (φ_k^T R_{k−1} φ_k) ≥ (φ_k^T e_{k−1})^2 / (φ_k^T φ_k)    (32)

since R_{k−1} is symmetric and idempotent, which implies φ_k^T R_{k−1} φ_k = ‖R_{k−1} φ_k‖^2 ≤ ‖φ_k‖^2. It follows that

max_{φ_k ∈ F_k} (φ_k^T e_{k−1})^2 / (φ_k^T R_{k−1} φ_k) ≥ (φ*^T e_{k−1})^2 / (φ*^T φ*).    (33)

Therefore, we can conclude that the proposed objective function (30) theoretically guarantees a greater error reduction than (29) does when a new hidden unit is added to the network.
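To make the procedure of Section III-D concrete, here is a compact NumPy sketch of the constructive loop under stated simplifications: each candidate's input weights are drawn at random and the best of a small pool is installed, rather than being refined with the modified Newton method of (27), and all names (olscn_sketch, orthogonalize, install) are illustrative rather than taken from the paper.

```python
import numpy as np

def olscn_sketch(X, y, max_hidden=10, n_candidates=16, rng=None):
    """Constructive OLS-based cascade training loop (Section III-D sketch).

    X: (N, n) inputs, y: (N,) targets. The OLS bookkeeping follows the
    text: orthogonalization (17)-(18), SSE reduction (16) as the score,
    output weights by back substitution (28).
    """
    rng = np.random.default_rng(rng)
    N, n = X.shape
    Q_cols, A_cols, Phi_cols, weights = [], [], [], []

    def orthogonalize(phi):
        """Gram-Schmidt step against the current orthogonal columns."""
        a = np.zeros(len(Q_cols))
        q = phi.copy()
        for j, qj in enumerate(Q_cols):
            a[j] = qj @ phi / (qj @ qj)
            q -= a[j] * qj
        return q, a

    def install(phi, w=None):
        q, a = orthogonalize(phi)
        Q_cols.append(q); A_cols.append(a); Phi_cols.append(phi); weights.append(w)

    # Step 1: the input units are the first n basis units.
    for j in range(n):
        install(X[:, j])

    # Steps 2-4: add hidden units one by one, installing the best candidate.
    for _ in range(max_hidden):
        best = None
        for _ in range(n_candidates):
            w = rng.normal(scale=0.5, size=1 + len(Phi_cols))   # bias + all basis units
            Z = np.column_stack([np.ones(N)] + Phi_cols)
            phi = np.tanh(Z @ w)                                # candidate output, cf. (2)
            q, _ = orthogonalize(phi)
            delta = (q @ y) ** 2 / (q @ q + 1e-12)              # SSE reduction (16)
            if best is None or delta > best[0]:
                best = (delta, phi, w)
        install(best[1], best[2])

    # Step 5: output weights by back substitution (28).
    M = len(Q_cols)
    g = np.array([q @ y / (q @ q) for q in Q_cols])             # (21)
    A = np.eye(M)
    for k, a in enumerate(A_cols):
        A[:len(a), k] = a
    theta = np.linalg.solve(A, g)
    return theta, Phi_cols, weights
```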


TABLE I RESULTS ON THE TWO-SPIRAL PROBLEM

IV. SIMULATION EXAMPLES

The proposed OLSCN algorithm has been empirically tested on several classification and approximation problems. We mainly compared our algorithm with the CCN and its variants [18], [31], and the networks used in these algorithms all had linear output units and hyperbolic tangent hidden units. All the algorithms were implemented in Matlab 7.4.

A. Two-Spiral Problem

We first tested our algorithm on the two-spiral problem [15], which consists of 194 pairs of patterns (half of these patterns belong to class 1, and the rest belong to class 2). For this problem, we aimed at constructing a network with the smallest size (least number of hidden units) that can correctly classify all the patterns. As discussed in Section II-A, in order to alleviate the local minimum problem of a gradient based optimization method, a pool of candidate units with differently initialized weights is used for training each hidden unit. The traditional CCN algorithm usually uses four to eight candidate units for each pool, and increasing the number of candidate units can hardly improve its performance [18]. But the objective function in the OLSCN algorithm is more complicated, which suggests that larger candidate pools may lead to better results. To verify this hypothesis, we tested the OLSCN algorithm with different sizes of candidate pools, and for each size we performed 100 independent trials. All the trials were successful, and the results are summarized in Table I, where Candidates specifies the number of candidate units used in our experiment, Mean and Std represent the mean and standard deviation of the obtained network sizes, Min and Max represent the minimal and maximal network sizes over the 100 trials, and Time represents the average running time, measured in seconds, of each trial. We also present the results obtained by two existing CCN algorithms in this table, where CCN-Quickprop represents the traditional CCN that adopts the quickprop algorithm for input and output training [18], and CCN-Hybrid represents an improved CCN with a hybrid optimization method [33].

From Table I, one may observe that with the same 8 candidate units, the average network size obtained by the OLSCN is smaller than that obtained by the traditional CCN, i.e., 13.2 vs. 15.2. As the size of the candidate pools was increased, the performance of the OLSCN improved evidently. With 32 candidate units, the OLSCN solved the two-spiral problem with an average of 11.1 hidden units, and the smallest network had only 8 hidden units. In contrast, the performance of the CCN showed no significant improvement even when the number of candidate units was increased to 32. The running times given in the table show that the OLSCN with 32 candidate units required less time than the CCN with 8 candidate units. It is obvious that for the two-spiral problem, the proposed algorithm is much more efficient than the CCN. With 16 candidate units, the OLSCN performed even better than the CCN-Hybrid algorithm, which requires 10 antibodies (each antibody can be viewed as a candidate unit) and almost 5000 iterations to train each hidden unit [33]. By contrast, the OLSCN requires only an average of 40 iterations to train one candidate unit.

Note that the OLSCN requires at least 8 hidden units and one linear output unit to solve the two-spiral problem, while the NBN algorithm can solve it with only 7 hidden units (13% success rate) [14]. Because a greedy method is adopted, i.e., at each step the candidate unit that maximally reduces the training error is installed, the OLSCN may lead to a suboptimal network

with a relatively larger size. Nevertheless, the major advantage of the OLSCN is that it can automatically determine its size. Furthermore, the OLSCN may be more efficient for handling large networks, since at any time only the weights of one hidden unit are trained. For example, to train a cascade network with two inputs and 30 hidden units, the NBN has to update all the 558 weights at each iteration, where second-order algorithms become less efficient or even break down. In comparison, if the OLSCN is used to train this network, then at most 33 weights are updated at a time.

B. Two-Dimensional Functions

In order to compare the proposed algorithm with the CCN and its variants surveyed in [31], we evaluated these algorithms on a set of function approximation problems. Compared to the CCN, its variants adopt the six objective functions surveyed in [31] for input training, denoted collectively as (34); the quantities appearing in these objective functions are defined in Section II-A. The approximation problems are a set of five two-dimensional functions originally used in [28]. For each problem, we generated a noisy training data set with 225 patterns. The two inputs of the training data were generated as random variables uniformly distributed on the input domain and independent from each other, and the corresponding output was obtained by adding zero-mean, unit-variance Gaussian noise to the function value. A test data set of size 100 × 100 was generated from equally spaced points on the input space for each problem. The output data were the exact values of the above functions without any noise.
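The data-generation protocol described above can be sketched as follows. The function func stands in for one of the five benchmark functions of [28] (not reproduced here), and both the input range [0, 1] and the unit noise scale are assumptions made for illustration; the FVU helper follows the error measure defined later in this subsection.

```python
import numpy as np

def make_data(func, n_train=225, grid=100, low=0.0, high=1.0, rng=None):
    """Generate one noisy training set and a noise-free test grid."""
    rng = np.random.default_rng(rng)
    X_train = rng.uniform(low, high, size=(n_train, 2))          # uniform, independent inputs
    y_train = func(X_train[:, 0], X_train[:, 1]) + rng.standard_normal(n_train)

    g = np.linspace(low, high, grid)                              # 100 x 100 equally spaced grid
    X1, X2 = np.meshgrid(g, g)
    X_test = np.column_stack([X1.ravel(), X2.ravel()])
    y_test = func(X_test[:, 0], X_test[:, 1])                     # exact values, no noise
    return X_train, y_train, X_test, y_test

def fvu(y_true, y_pred):
    """Fraction of variance unexplained: sum of squared errors over the
    (constant) total sum of squares of the desired outputs."""
    return np.sum((y_pred - y_true) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
```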


TABLE II TESTING FVUS ON THE TWO-DIMENSIONAL FUNCTIONS AND NETWORK SIZES CORRESPONDING TO THE LOWEST FVUS

The fraction of variance unexplained (FVU) was used as the error measure, as described in [28]:

FVU = Σ_p (ŷ_p − y_p)^2 / Σ_p (y_p − ȳ)^2

where y_p and ŷ_p are the desired output and the network output for the p-th pattern, ȳ is the mean of the desired outputs, and the denominator is a constant for a given data set. All the algorithms, except the OLSCN, adopted the quickprop algorithm for input and output training. The candidate pools used in these algorithms all had 20 candidate units. We computed the FVU on the testing data set after each hidden unit was added. The training process was stopped when the number of hidden units reached 15. The lowest testing FVU among the 15 networks, ranging from 1 to 15 hidden units, in each trial was used for comparison.

We performed 100 independent trials of these algorithms on each data set, and the averaged lowest testing FVUs are presented in Table II, where Mean, Std, and Size have the same meaning as in Table I. We can observe that the proposed objective function performed consistently better than all the other objective functions on the five approximation problems. Some of the objective functions in (34) obtained relatively good results on two of the problems, but for the two more complex functions, poor testing FVUs were obtained with the correlation function (1) and the six objective functions in (34). In comparison, the proposed objective function obtained averaged testing FVUs of 0.0639 and 0.0835 on these two problems, respectively. A comparison between the proposed objective function and one of the objective functions in (34) on these two problems is shown in Figs. 2 and 3, respectively. It is clear that the proposed objective function gives a faster error reduction on both the training and the testing data sets. As shown in Fig. 2, the proposed objective function obtained a testing FVU of 0.0645 with 5 hidden units on the first of these two problems, while the lowest testing FVU of the competing objective function on this problem is 0.2166 with 14 hidden units. Similar observations can also be made from Fig. 3.

Fig. 2. Comparison of the proposed objective function and a competing objective function on the first of the two more complex problems.

Fig. 3. Comparison of the proposed objective function and a competing objective function on the second of the two more complex problems.

C. UCI Data Sets

We also validated our algorithm on some real-world problems from the UCI database [34] and the StatLib repository [35].


Three classification data sets (Wine, Vowel, and Segment) and three regression data sets (Mpg, Housing, and Abalone) are used in our experiment. We randomly picked 80% of the patterns for training and used the rest for testing. Table III lists the characteristics of these data sets.

TABLE III SUMMARY OF THE UCI DATA SETS USED

For multiclass problems we adopted the winner-take-all strategy, in which the output neuron with the largest activation is designated as the network classification. In this experiment, the input units were not directly connected to the output units, since doing so may generate many insignificant connections when the number of features is large. The number of candidate units in the OLSCN and the CCN was set to 20 for both. We stopped the algorithms when the network reached a maximal size, which was selected empirically according to the complexity of the given problem: 15 hidden units for the Wine, Mpg, and Housing data sets, and 30 hidden units for the other data sets. For classification problems, we calculated the classification error on the testing data set after each hidden unit was added, and report the lowest error along with the corresponding network size. For regression problems, the root mean squared error was used as the error measure. The results averaged over 100 trials are listed in Tables IV and V.

TABLE IV RESULTS ON THE UCI CLASSIFICATION DATA SETS

TABLE V RESULTS ON THE UCI REGRESSION DATA SETS

From Table IV, we can observe that the OLSCN gives lower testing errors than the CCN algorithm on all three classification problems, and the optimal network constructed by the OLSCN has a smaller size. For the three regression data sets, similar results have been obtained, as shown in Table V. One may notice that the optimal network obtained by the OLSCN is significantly smaller. The optimal network obtained by the OLSCN has an average of 2.3, 2.5, and 5.2 hidden units for the Wine, Mpg, and Housing data sets, respectively, while the optimal network obtained by the CCN has an average of 9.5, 11.2, and 14.5, respectively. It is clear that the proposed algorithm outperforms the CCN on these real-world data sets in terms of prediction accuracy and network size.

V. CONCLUSION

In this paper, we have reformulated the cascade network as a linear-in-the-parameters model, and used the OLS algorithm to derive a new objective function for input training. The proposed method makes it possible to directly maximize the SSE reduction when each hidden unit is added, which speeds up the process of constructing an optimal network. With the OLS algorithm, the computation of the output weights in our algorithm is greatly simplified by considering the input training and the output training in an integrated framework. Furthermore, with the calculation procedure introduced in Section II, the time and space complexity of the proposed algorithm are both proportional to the number of training patterns. The simulation results have shown that the proposed algorithm generally leads to significantly smaller networks with better generalization performance than the cascade correlation algorithm and its variants.

APPENDIX
COMPUTING THE HESSIAN MATRIX

Using the gradient vector computed by (23), the (i, l)-th element of the Hessian matrix is derived as (35), where the remaining term is computed by (36). Introducing appropriate shorthand notation, we have (37). Substituting the above expression into (35) yields (24).


REFERENCES [1] Y. N. Zhang, W. M. Ma, and B. H. Cai, “From Zhang neural network to Newton iteration for matrix inversion,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 56, no. 7, pp. 1405–1415, Jul. 2009. [2] K. Li, J. X. Peng, and E. W. Bai, “Two-stage mixed discrete-continuous identification of radial basis function (RBF) neural models for nonlinear systems,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 56, no. 3, pp. 630–643, Mar. 2009. [3] M. D. Marco, M. Forti, M. Grazzini, and L. Pancioni, “Limit set dichotomy and convergence of cooperative piecewise linear neural networks,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 58, no. 5, pp. 1052–1062, May 2011. [4] C. Feng, R. Plamondon, and C. O’Reilly, “Analyzing oscillations for an -node recurrent neural networks model with time delays and general activation functions,” IEEE Trans. Circuits Syst. I, Reg. Papers, no. 99, Aug. 2011. [5] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagation errors,” Nature, vol. 323, pp. 533–536, 1986. [6] B. M. Wilamowski, “Neural network architectures and learning algorithms,” IEEE Ind. Electron. Mag., vol. 3, no. 4, pp. 56–63, Dec. 2009. [7] B. M. Wilamowski, “Challenges in applications of computational intelligence in industrial electronics,” in Proc. IEEE Int. Symp. Ind. Electron., Jul. 2010, pp. 15–22. [8] B. Wilamowski, D. Hunter, and A. Malinowski, “Solving parityproblems with feedforward neural networks,” in Proc. Int. Joint Conf. Neural Netw., Jul. 2003, vol. 4, pp. 2546–2551. [9] J. Park and I. W. Sandberg, “Universal approximation using radialbasis-function networks,” Neural Comput., vol. 3, pp. 246–257, 1991. [10] T. Kohonen, “The self-organizing map,” Proc. IEEE, vol. 78, no. 9, pp. 1464–1480, Sep. 1990. [11] N. K. Treadgold and T. D. Gedeon, “A cascade network algorithm employing progressive rprop,” in Int. Work Conf. Artif. Natural Neural Netw., 1997, pp. 733–742. [12] B. W. Wah and M. Qian, “Constrained formulations for neural networks training and their applications to solve the two-spiral problem,” in Proc. 5th Int. Conf. Comput. Sci. Informatics, 2000, vol. 1, pp. 598–601. [13] B. M. Wilamowski and Y. Hao, “Improved computation for Levenberg-Marquardt training,” IEEE Trans. Neural Netw., vol. 21, no. 6, pp. 930–937, Jun. 2010. [14] B. M. Wilamowski and H. Yu, “Neural network learning without backpropagation,” IEEE Trans. Neural Netw., vol. 21, no. 11, pp. 1793–1803, Nov. 2010. [15] K. J. Lang and M. J. Witbrock, “Learning to tell two spirals apart,” in Proc. 1988 Connectionist Models Summer School, D. S. Touretzky, G. E. Hinton, and T. J. Sejnowski, Eds. San Mateo, CA: Morgan Kaufmann, 1988, pp. 52–59. [16] T. Y. Kwok and D. Y. Yeung, “Constructive algorithms for structure learning in feedforward neural networks for regression problems,” IEEE Trans. Neural Netw., vol. 8, no. 3, pp. 630–645, May 1997. [17] R. Reed, “Pruning algorithms-a survey,” IEEE Trans. Neural Netw., vol. 4, no. 5, pp. 740–747, Sep. 1993. [18] S. E. Fahlman and C. Lebiere, “The cascade-correlation learning architecture,” in Advances in Neural Information Processing Systems 2, D. S. Touretzky, Ed. San Mateo, CA: Morgan Kaufmann, 1991, pp. 524–532. [19] S. Chen, S. A. Billings, and W. Luo, “Orthogonal least squares methods and their applications to non-linear system identification,” Int. J. Control, vol. 50, pp. 1873–1896, 1989. [20] S. Chen, C. Cowan, and P. M. Grant, “Orthogonal least squares learning algorithm for radial basis function networks,” IEEE Trans. 
Neural Netw., vol. 2, no. 2, pp. 302–309, Mar. 1991. [21] K. Li, J. X. Peng, and G. W. Irwin, “A fast nonlinear model identification method,” IEEE Trans. Autom. Control, vol. 50, no. 8, pp. 1211–1216, Aug. 2005. [22] K. Li and J. X. Peng, “Neural input selection—A fast model-based approach,” Neurocomput., vol. 70, no. 4–6, pp. 762–769, Oct. 2007. [23] J. X. Peng, K. Li, and D. S. Huang, “A hybrid forward algorithm for rbf neural network construction,” IEEE Trans. Neural Netw., vol. 17, no. 6, pp. 1439–1451, Nov. 2006. [24] S. Chen, X. Hong, C. Harris, and X. Wang, “Identification of nonlinear systems using generalized kernel models,” IEEE Trans. Control Syst. Technol., vol. 13, no. 3, pp. 401–411, May 2005.


[25] S. Chen, X. Hong, B. L. Luk, and C. J. Harris, “Construction of tunable radial basis function networks using orthogonal forward selection,” IEEE Trans. Syst., Man, Cybern. B, vol. 39, no. 2, pp. 457–466, Apr. 2009. [26] S. E. Fahlman, “Faster learning variations on backpropagation: An empirical study,” in Proc. 1988 Connectionist Models Summer School, D. S. Touretzky, G. E. Hinton, and T. J. Sejnowski, Eds. San Mateo, CA: Morgan Kaufmann, 1988, pp. 38–51. [27] J. L. Yuan and T. L. Fine, “Forecasting demand for electric power,” in Advances in Neural Information Processing Systems 5, S. J. Hanson, J. Cowan, and C. L. Giles, Eds. San Mateo, CA: Morgan Kaufmann, 1993, pp. 739–746. [28] J. N. Hwang, S. R. Lay, M. Maechler, R. D. Martin, and J. Schimert, “Regression modeling in back-propagation and projection pursuit learning,” IEEE Trans. Neural Netw., vol. 5, no. 3, pp. 342–353, May 1994. [29] O. Fujita, “Optimization of the hidden unit function in feedforward neural networks,” Neural Netw., vol. 5, no. 5, pp. 755–764, 1992. [30] P. Courrieu, “A convergent generator of neural networks,” Neural Netw., vol. 6, no. 6, pp. 835–844, 1993. [31] T. Y. Kwok and D. Y. Yeung, “Objective functions for training new hidden units in constructive neural networks,” IEEE Trans. Neural Netw., vol. 8, no. 5, pp. 1131–1148, Sep. 1997. [32] G. Golub and C. F. V. Loan, Matrix Computations. Baltimore, MD: Johns Hopkins Univ. Press, 1993. [33] X. Z. Gao, X. Wang, and S. J. Ovaska, “Fusion of clonal selection algorithm and differential evolution method in training cascade-correlation neural network,” Neurocomputing, vol. 72, pp. 2483–2490, Jun. 2009. [34] A. Frank and A. Asuncion, UCI Machine Learning Repository 2011 [Online]. Available: http://archive.ics.uci.edu/ml [35] D. Michie, D. J. Spiegelhalter, and C. Taylor, Machine Learning, Neural and Statistical Classification. New York: Ellis Horwood, 1994.

Gao Huang was born in 1988. He received the B.S. degree from the School of Automation Science and Electrical Engineering at Beihang University, Beijing, China, in 2009. He is currently working toward the Ph.D. degree in the Department of Automation at Tsinghua University, Beijing, China. His current research interests include system identification and time series analysis.

Shiji Song was born in 1965. He received the Ph.D. degree from the Department of Mathematics, Harbin Institute of Technology, China, in 1996. He is a Professor in the Department of Automation, Tsinghua University, China. His research and teaching interests include system identification, fuzzy systems, and stochastic neural networks.

Cheng Wu was born in 1940. He received the B.S. and M.S. degrees in electrical engineering from Tsinghua University, China. Since 1967 he has been at Tsinghua University, where he is currently a Professor in the Department of Automation. His main research interests include system integration, modeling, scheduling, and optimization of complex industrial systems. Mr. Wu is a member of the Chinese Academy of Engineering.
