A Theoretically Sound Learning Algorithm for Constructive Neural Networks
Tin-Yau Kwok ([email protected])    Dit-Yan Yeung ([email protected])
Department of Computer Science, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong
Abstract
Determining the size of a neural network used to rely on various ad hoc rules of thumb. In recent years, several researchers have proposed methods that handle this problem with as little human intervention as possible. Among these, the cascade-correlation learning architecture is probably the most popular. Despite its promising empirical performance, this heuristically derived method lacks strong theoretical support. In this paper, we analyze the problem of learning in constructive neural networks from a Hilbert space point of view. A novel objective function for training new hidden units using a greedy approach is derived. More importantly, we prove that a network constructed incrementally in this way still preserves the universal approximation property with respect to L2 performance criteria. While the theoretical results obtained so far on the universal approximation capabilities of multi-layer feed-forward networks provide only existence proofs, our results move one step further by providing a theoretically sound procedure for constructive approximation that still preserves the universal approximation property.
1 Introduction
Neural networks have been shown by several researchers [1, 3, 4, 5] to be universal approximators with respect to Lp performance criteria, under certain conditions on the hidden unit functions. However, these results provide no theoretical bound on the number of hidden units required. In recent years, attempts have been made to determine network size automatically. A particularly successful method is the cascade-correlation architecture [2]. It begins with a minimal network, then trains and adds new hidden units one at a time in a greedy manner by maximizing an objective function. After a new unit is added, all the weights between the input and hidden units, as well as those among the hidden units, are held constant; only the hidden-to-output weights are allowed to change. This heuristic is referred to as input weight freezing. By employing this heuristic together with the cascade-correlation architecture, Fahlman and Lebiere obtained fast learning in all the cases tested. However, the design of this objective function is rather ad hoc. Moreover, it is unclear whether the universal approximation property is preserved when input weight freezing is used. For example, [9] reported that when freezing was used, the network was unable to converge to error levels reachable without freezing. Besides, because the maximization of the objective function is a nonlinear optimization problem, whether the quality of the solution obtained in this process affects the universal approximation property is also unknown.

This paper is organized as follows. Section 2 formulates the problem of constructive neural network learning as constructive approximation in function space. Section 3 presents a greedy approach to the construction of multi-layer feed-forward networks with only one linear output unit. Section 4 describes the network construction algorithm. The last section gives some concluding remarks and discusses directions for further research.

This research is supported by the Hong Kong Research Grants Council under grant RGC/HKUST 15/91. The first author is also supported by the Sir Edward Youde Memorial Fellowship.
2 Constructive Neural Network Learning as Constructive Approximation in Function Space
Given a Hilbert space $E$ and a set of basis elements $U \subset E$, consider the problem of approximating an arbitrary element $g \in E$ by a linear combination of basis elements from a finite subset of $U$, i.e., finding finitely many $u_i \in U$, $i = 1, 2, \ldots, n$, together with a linear combination
$$\hat{g} = \sum_{i=1}^{n} \alpha_i u_i, \qquad \alpha_i \in \mathbb{R},$$
such that the approximation error
$$\Big\| g - \sum_{i=1}^{n} \alpha_i u_i \Big\|$$
achieves its minimum.

Proposition 2  For a fixed set of basis elements $u_1, u_2, \ldots, u_n$, the error $\| g - \sum_{i=1}^{n} \alpha_i u_i \|$ achieves its minimum iff
$$\Big\langle g - \sum_{i=1}^{n} \alpha_i u_i, \; u_r \Big\rangle = 0, \qquad r = 1, 2, \ldots, n,$$
i.e., iff the residual is orthogonal to every $u_r$.

Remark. The optimal coefficients are thus the solution of a system of linear equations in the $\alpha_i$'s; in network terms, they are obtained by adjusting the hidden-to-output weights.

3 A Greedy Approach to Network Construction
Instead of searching for all the $u_i$'s simultaneously, the approximation is built up incrementally. Suppose that after step $n$, the elements $u_1, u_2, \ldots, u_n$ together with the optimal coefficients $\alpha_1^{(n)}, \ldots, \alpha_n^{(n)}$ have been found; we then want to find $u_{n+1} \in U$, with $u_{n+1} \neq u_1, u_2, \ldots, u_n$, such that
$$\Big\| g - \Big( \sum_{i=1}^{n} \alpha_i^{(n)} u_i + \beta u_{n+1} \Big) \Big\|$$
is made as small as possible.

Theorem 1  Suppose the $u_{n+1}$'s selected at successive steps satisfy
$$\Big\langle g - \alpha_1^{(1)} u_1, \; u_2 \Big\rangle^2 > 0, \quad \ldots, \quad \Big\langle g - \sum_{i=1}^{n-1} \alpha_i^{(n-1)} u_i, \; u_n \Big\rangle^2 > 0, \quad \ldots,$$
and that all the coefficients are re-adjusted to their optimal values after each new element is added. If the span of $U$ is dense in $E$, then
$$\lim_{n \to \infty} \Big\| g - \sum_{i=1}^{n} \alpha_i^{(n)} u_i \Big\| = 0.$$
Detailed proofs can be found in [8].

Remark. Hence, if the span of $U$ is dense in $E$, then for an arbitrary element $g$ in $E$, the network so constructed incrementally is a universal approximator.
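To make the greedy scheme concrete, the following sketch (our illustration, not code from the paper) approximates a sampled target $g$ from a finite dictionary of candidate basis elements: at each step it adds an unused candidate whose inner product with the current residual is non-zero (here the one with the largest $\langle e_n, u \rangle^2 / \langle u, u \rangle$, anticipating Section 4), and then re-adjusts all coefficients by least squares. The sampling grid, the tanh dictionary, and all function names are assumptions made for the example.

```python
import numpy as np

def greedy_approximate(g, candidates, n_units, tol=1e-12):
    """Greedy constructive approximation of g (illustrative sketch).

    g          : (P,) target sampled on a grid, so <f, h> is approximated by a dot product
    candidates : (P, K) dictionary U of candidate basis elements, one per column
    """
    chosen, alpha = [], np.zeros(0)
    residual = g.copy()
    for _ in range(n_units):
        # score every candidate by <e_n, u>^2 / <u, u>; Theorem 1 only requires this to be > 0
        scores = (candidates.T @ residual) ** 2 / np.sum(candidates ** 2, axis=0)
        if chosen:
            scores[chosen] = -np.inf            # u_{n+1} must differ from u_1, ..., u_n
        best = int(np.argmax(scores))
        if scores[best] <= tol:                 # residual (nearly) orthogonal to all remaining u's
            break
        chosen.append(best)
        U = candidates[:, chosen]
        alpha, *_ = np.linalg.lstsq(U, g, rcond=None)   # re-adjust all coefficients alpha^(n)
        residual = g - U @ alpha
    return chosen, alpha, residual

# toy usage: approximate a smooth target with a dictionary of random tanh units
x = np.linspace(-1.0, 1.0, 200)
g = np.sin(3.0 * x) + 0.5 * x
rng = np.random.default_rng(0)
W, b = 5.0 * rng.normal(size=(1, 50)), rng.normal(size=50)
dictionary = np.tanh(x[:, None] @ W + b)        # 50 candidate hidden-unit functions
idx, coeffs, res = greedy_approximate(g, dictionary, n_units=10)
print(len(idx), float(np.linalg.norm(res)))     # residual norm shrinks as units are added
```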
4 Network Construction Algorithm
This section presents an algorithm for the construction of feed-forward networks with linear output units. It consists of two phases: input-to-hidden weight training and hidden-to-output weight training.
4.1 Input-to-Hidden Weight Training
The input-to-hidden weight training phase selects the new hidden unit to be installed in the network by training the weights connecting all the input units to the new hidden unit. From Theorem 1, if the span of the set of hidden unit functions is dense in L2 space, then the network so constructed is a universal approximator with respect to L2 performance criteria. From [4], hidden unit functions satisfying this property are those with arbitrary bounded and non-constant transfer functions; common examples include sigmoidal functions and radial basis functions. In the cascade-correlation architecture, connections are established to the new hidden unit from all existing hidden units as well as from the input units. Notice that if the original set of hidden unit functions is dense in L2 space, then the new set of hidden unit functions (with such cascade connections added) is also dense in L2 space, because the weights on these cascade connections can simply be set to zero. Hence, this algorithm applies both to the traditional single-hidden-layer architecture and to other architectures with cascade connections.

Although Theorem 1 states a very loose condition on the selection of the $u_i$'s while still preserving the universal approximation property, in real applications we want to keep the number of hidden units in the network as small as possible, because too many hidden units may degrade the generalization performance of the network. Now, the decrease in squared distance from $g$ to the new subspace, before re-adjustment of the $\alpha_i$'s, is:
$$\frac{\sum_{j=1}^{m} \Big\langle g_j - \sum_{i=1}^{n} \alpha_{ji}^{(n)} u_i, \; u_{n+1} \Big\rangle^2}{\langle u_{n+1}, u_{n+1} \rangle},$$
where $g_j$ denotes the target function at the $j$th of the $m$ output units and the $\alpha_{ji}^{(n)}$ are the current hidden-to-output weights.
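For a single output, this expression follows from a one-step calculation: writing $e_n = g - \sum_{i=1}^{n} \alpha_i^{(n)} u_i$ for the current residual and $\beta$ for the coefficient of the new unit,
$$\min_{\beta} \| e_n - \beta u_{n+1} \|^2 = \min_{\beta} \Big( \| e_n \|^2 - 2 \beta \langle e_n, u_{n+1} \rangle + \beta^2 \langle u_{n+1}, u_{n+1} \rangle \Big) = \| e_n \|^2 - \frac{\langle e_n, u_{n+1} \rangle^2}{\langle u_{n+1}, u_{n+1} \rangle},$$
with the minimum attained at $\beta^* = \langle e_n, u_{n+1} \rangle / \langle u_{n+1}, u_{n+1} \rangle$; summing the resulting decrease over the $m$ outputs gives the expression above.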
Thus a greedy way to approximate $g$ is to maximize this decrease every time $u_{n+1}$ is added. In what follows, we denote the residual output error observed at output $o$ for pattern $p$ by $E_{po}$, the corresponding activation of the candidate hidden unit by $H_p$, and the parameter (weight) vector associated with the candidate hidden unit by $w$. The previous expression then becomes
$$S = \frac{\sum_{o} \big( \sum_{p} E_{po} H_p \big)^2}{\sum_{p} H_p^2}.$$
The new hidden unit will be the one that maximizes $S$. It can be obtained by gradient ascent, in which $w$ is successively changed by $\Delta w \propto \nabla_w S$. Notice that although a greedy approach only selects the hidden unit that yields the highest value of $S$ in a local neighborhood, and this nonlinear optimization problem is frequently plagued by local optima, the universal approximation property of the network is not affected.
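To illustrate this training phase, the following sketch (our own, not code from the paper) computes $S$ and its analytic gradient for one candidate hidden unit and updates the candidate's input weights by plain gradient ascent. The tanh transfer function, the fixed learning rate and step count, and all names are assumptions made for the example.

```python
import numpy as np

def candidate_objective_and_grad(w, X, E):
    """S = sum_o (sum_p E_po H_p)^2 / sum_p H_p^2 and its gradient with respect to w.

    X : (P, d) training inputs (append a bias column if desired)
    E : (P, O) residual output errors E_po of the current network
    w : (d,)   input weights of the candidate hidden unit (tanh transfer assumed)
    """
    H = np.tanh(X @ w)                 # H_p, candidate activations
    C = E.T @ H                        # C_o = sum_p E_po H_p
    D = float(H @ H)                   # sum_p H_p^2
    S = float(np.sum(C ** 2)) / D
    # dS/dH_p = (2 / D) * (sum_o C_o E_po - S * H_p)
    dS_dH = (2.0 / D) * (E @ C - S * H)
    grad = X.T @ (dS_dH * (1.0 - H ** 2))   # chain rule through tanh'(a_p) = 1 - H_p^2
    return S, grad

def train_candidate(X, E, lr=0.05, steps=500, rng=None):
    """Select a new hidden unit by gradient ascent on S (illustrative sketch)."""
    rng = rng or np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=X.shape[1])
    for _ in range(steps):
        S, grad = candidate_objective_and_grad(w, X, E)
        w = w + lr * grad              # Delta w proportional to grad_w S
    return w, S
```

One could also train several candidates from different random starts and keep the one with the largest $S$, as is done in cascade-correlation; since the ascent may stop at a local optimum, this helps in practice but, as noted above, is not needed for the universal approximation property.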
4.2 Hidden-to-Output Weight Training
In the second training phase, we adjust the hidden-to-output weights, as discussed in the remark of Proposition 2. This may be done by computing the pseudo-inverse exactly. However, because the error surface in this case is a paraboloid, the weights can also be computed efficiently by any gradient descent method that uses second-order information, so the time and space cost of computing the pseudo-inverse exactly need not be incurred.

The network construction algorithm is summarized as follows:
1. Repeat:
   (a) Perform input-to-hidden weight training by gradient ascent of $S$ with respect to $w$;
   (b) Add the trained hidden unit to the network; the input-to-hidden weights are then frozen;
   (c) Perform hidden-to-output weight training, either by computing the pseudo-inverse or by gradient descent;
2. until the residual error of the network falls below a certain threshold.
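A minimal sketch of this two-phase loop for the single-hidden-layer case (no cascade connections) is given below; it reuses the hypothetical train_candidate routine from the previous sketch, and the error threshold, the unit limit, and the use of numpy.linalg.pinv for the exact pseudo-inverse solution are our assumptions.

```python
import numpy as np
# assumes train_candidate() from the previous sketch is in scope

def construct_network(X, Y, threshold=1e-3, max_units=30):
    """Constructive learning of a single-hidden-layer network with linear outputs (sketch).

    X : (P, d) training inputs    Y : (P, O) training targets
    """
    P = X.shape[0]
    H = np.ones((P, 1))                     # start from a bias column; hidden outputs are appended
    W_out = np.linalg.pinv(H) @ Y           # hidden-to-output weights via the pseudo-inverse
    for _ in range(max_units):
        E = Y - H @ W_out                   # residual output errors E_po
        if float(np.mean(E ** 2)) < threshold:
            break                           # residual error has fallen below the threshold
        w, _ = train_candidate(X, E)        # phase 1: input-to-hidden training (weights then frozen)
        H = np.hstack([H, np.tanh(X @ w)[:, None]])
        W_out = np.linalg.pinv(H) @ Y       # phase 2: re-adjust all hidden-to-output weights
    return H, W_out
```

A gradient descent method with second-order information (e.g., conjugate gradients) could replace the pseudo-inverse step when the number of patterns or hidden units makes the exact solution expensive, as discussed above.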
5 Conclusion
In this paper, we formulate the problem of learning in constructive neural networks as constructive approximation in a Hilbert space. Using a greedy approach, we propose a novel objective function for training new hidden units. The resultant network construction algorithm is similar to the cascade-correlation architecture, but with much stronger theoretical support. Moreover, we prove that if the hidden unit functions satisfy the universal approximation property, the network so constructed incrementally still preserves the universal approximation property with respect to L2 performance criteria. Besides, although the objective function is highly nonlinear and the training of new hidden units may end up in local optima, the universal approximation property of the resultant network is not affected. Hence, while theoretical results obtained so far on the universal approximation capabilities of multi-layer feed-forward networks provide only existence proofs, our results move one step further by providing a theoretically sound procedure for constructive approximation that still preserves the universal approximation property. Moreover, this applies not only to the traditional single-hidden-layer architecture, but also to other architectures with cascade connections.

Preliminary simulation experiments using some benchmark learning tasks have been performed. The results obtained in these experiments agree well with the theoretical results discussed in this paper; details are reported in [6]. However, although a greedy algorithm is a reasonable approach to a complex problem like this, one should be aware that there exist scenarios in which a greedy approach to network construction may be ineffective [7].

In our future work, we will compare the generalization behavior of the proposed algorithm with that of existing algorithms. Besides, a limitation of this research is that the analysis has to be performed in a Hilbert space, and an Lp space is a Hilbert space iff p = 2. Hence, a possible direction is to extend the results to Banach spaces, so that the universal approximation property with respect to other Lp performance criteria can also be investigated. Moreover, as pointed out previously, a greedy approach to network construction may be ineffective in some circumstances; we will investigate other heuristic approaches to this problem.
References
[1] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2:303-314, 1989.
[2] S.E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. Technical Report CMU-CS-90-100, School of Computer Science, Carnegie Mellon University, 1990.
[3] K.I. Funahashi. On the approximate realization of continuous mappings by neural networks. Neural Networks, 2:183-192, 1989.
[4] K. Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4:251-257, 1991.
[5] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2:359-366, 1989.
[6] T.Y. Kwok and D.Y. Yeung. Constructive neural networks: Some practical considerations. Submitted to the International Conference on Neural Networks, 1994.
[7] T.Y. Kwok and D.Y. Yeung. Experimental analysis of input weight freezing in constructive neural networks. In Proceedings of the 1993 IEEE International Conference on Neural Networks, volume 1, pages 511-516, San Francisco, California, USA, 1993.
[8] T.Y. Kwok and D.Y. Yeung. Theoretical analysis of constructive neural networks. Technical Report HKUST-CS93-12, Department of Computer Science, Hong Kong University of Science and Technology, 1993.
[9] C.S. Squires and J.W. Shavlik. Experimental analysis of aspects of the cascade-correlation learning architecture. Machine Learning Research Group Working Paper 91-1, Computer Sciences Department, University of Wisconsin-Madison, Madison, WI, USA, 1991.