A Constructive Higher Order Network Algorithm that is Polynomial-Time

Nicholas J. Redding
DSTO Information Technology Division, Salisbury, South Australia, Australia

Adam Kowalczyk
Telecom Australia, Research Laboratories, Clayton, Victoria, Australia

Tom Downs
Dept of Electrical Engineering, University of Queensland, Australia

Running title: Constructive HONs
Acknowledgement: We wish to thank Garry Newsam, Peter Bartlett and Andrew Back for their valuable suggestions and comments; Raymond Lister and especially Marcus Frean for their help on the two-or-more clumps problem; and Ewa Kowalczyk for her help in the simulations. This work was partially supported by the Australian Telecommunications and Electronics Research Board. Requests for reprints should be sent to Nicholas Redding, DSTO Information Technology Division, P.O. Box 1500, Salisbury SA 5108, Australia. Email:
[email protected]
A Constructive Higher Order Network Algorithm that is Polynomial-Time

Abstract

Constructive learning algorithms are important because they address two practical difficulties of learning in artificial neural networks. Firstly, it is not always possible to determine the minimal network consistent with a particular problem. Secondly, algorithms like backpropagation can require networks that are larger than the minimal architecture for satisfactory convergence. Furthermore, constructive algorithms have the advantage that polynomial-time learning is possible if network size is chosen by the learning algorithm so that the learning of the problem under consideration is simplified. This paper considers the representational ability of feedforward networks (FFNs) in terms of the fan-in required by the hidden units of a network. We define network order to be the maximum fan-in of the hidden units of a network. We prove, in terms of the problems they may represent, that a higher order network (HON) is at least as powerful as any other FFN architecture when the orders of the networks are the same. Next, we present a detailed theoretical development of a constructive, polynomial-time algorithm that will determine an exact HON realization with minimal order for an arbitrary binary or bipolar mapping problem. This algorithm does not have any parameters that need tuning for good performance. We show how a FFN with sigmoidal hidden units can be determined from the HON realization in polynomial time. Lastly, simulation results of the constructive HON algorithm are presented for the two-or-more clumps problem, demonstrating that the algorithm performs well when compared with the Tiling and Upstart algorithms.

Keywords: constructive networks, higher order networks, feedforward networks, network order, fan-in, representation, polynomial time, two-or-more clumps problem.
Nomenclature

B: binary {0,1} or bipolar {−1,+1} set
x: vector input pattern
x_i: i-th input element of pattern vector x
X: input pattern set containing all inputs x
X+, X−: sets of positively and negatively classified patterns from X
ψ: function mapping X → B
k: integer denoting order
φ: hidden function of a hidden unit
Φ: hidden function family (HFF)
Φ_i: set of monomials of degree i
Φ_{≤i}: union of the sets Φ_0, Φ_1, ..., Φ_i
w_i: scalar weight
θ: scalar threshold
Ω: combining function at the output of the network
⟨·⟩: decision function
|·|: cardinality
I+, I−: hidden unit patterns (image space) corresponding to X+ and X−
T = {i_1, i_2, ..., i_s}: set of integer indices
φ_T: monomial, the product of the input elements with indices in T
M: universal set, a set of monomials
M_i: subset of monomials from M of degree i
M_{≤i}: union of the subsets M_0, M_1, ..., M_i
[φ]_X: X-space vector of φ on X
[M]_X: set of X-space vectors from the monomials in M on X
Span(·): linear combination
1 Introduction
Judd (1990) has proven the NP-completeness of the loading problem, i.e., "can a given neural network map a predetermined set of input patterns to a desired set of output patterns?" However, Baum, in a review (1991), has pointed out that by framing the learning issues more realistically an NP-completeness result need not arise. The key is that in Judd's loading problem the learning algorithm has no control over the network's size, which is predetermined. In practice, the network's size is chosen by the user, rather than by an "adversary", ensuring that the problem lies within the ability of the network and learning algorithm. If the networks available to the learning system are unrestricted, then polynomial-time complexity results are possible. There are, however, two practical difficulties with this approach: it is not always possible to determine the minimal network consistent with a particular problem, and secondly, algorithms like backpropagation can require networks that are larger than the minimal architecture for satisfactory convergence. These two points provide the motivation for the development of constructive learning algorithms, which do not decide a priori upon the required network size, but "grow" the network as needed. Most constructive algorithms, however, provide no assurance that the resulting network will be minimal in any sense. This is an important issue, not only for cost reasons, but also because there is convincing evidence that using a larger network than required can adversely affect the network's ability to generalize (Denker et al. 1987; Baum, & Haussler 1989). Other approaches that have been used to try and achieve a minimal solution either "prune" the network after training has taken place (Mozer, & Smolensky 1989; Le Cun, Denker, & Solla 1990; Sietsma, & Dow 1991) or use a bias term in the error function to inhibit network size (Denker et al. 1987; Hanson, & Pratt 1989; Ji, Snapp, & Psaltis 1990). Some of the main constructive algorithms employ a number of architectures which create a hierarchical partitioning of the input space (Mezard, & Nadal 1989; Nadal 1989; Fahlman, & Lebiere 1990; Frean 1990; Golea, & Marchand 1990; Sirat, & Nadal 1990; Marchand, Golea, & Rujan 1990). (A summary of the salient features of these constructive algorithms can be found in Hertz, Krogh and Palmer (1991) and in Wynne-Jones (1991).) Other algorithms (Ash 1989; Refenes, & Vithlani 1991; Wynne-Jones 1992) create additional nodes during backpropagation training, and the algorithm of Hanson (1990) behaves similarly but uses a stochastic search in place of
backpropagation. Convergence proofs for some of these constructive algorithms are available, but most do not contain any statements regarding the time-complexity of training. One exception is the regular partitioning algorithm of Rujan and Marchand (1989), which has polynomial-time complexity (and is also minimal in some sense), although this is in terms of p̂ = 2^n, the total number of patterns in an input space of n dimensions. In practice, however, as the input dimension n increases, the actual number of patterns p may form a decreasing fraction p/p̂ of the possible patterns, so that the algorithm becomes prohibitively expensive. Recently, Blum and Rivest (1992) have presented material that extends Judd's results to show that training a simple feedforward network (FFN) with linear threshold functions is NP-complete. They also showed for a simple example that polynomial-time learning is possible if the network is enlarged to include extra inputs that are nonlinear combinations of the original inputs. Blum and Rivest, however, did not present a general algorithm that could be used. In this paper, we present a general algorithm that will learn an arbitrary mapping problem in polynomial time. More precisely, this paper extends the previous work of the authors (Redding 1991; Redding, Kowalczyk, & Downs 1991) on higher order networks (HONs) (Giles, & Maxwell 1987) to develop a constructive HON algorithm for an arbitrary problem that trains in time polynomial in the number of examples in the training set. The constructive HON algorithm we present has a number of important properties. Apart from the polynomial-time complexity property that we have already mentioned, the constructed network has minimal maximum fan-in, where maximum fan-in is defined to be the largest number of network inputs that connect with any single hidden unit. This maximum fan-in of a network is defined to be the order of the network by Minsky and Papert (1988). Network order is largely neglected by the current literature, but it has quite a long history, dating back to the mid 1960's (Cover 1965; Krishnan 1966; Minsky, & Papert 1988). We will demonstrate that the minimum network order required to solve a given problem is independent of the kind of architecture under consideration, and in this sense order is a structure-free property of the problem. In addition, we will show how the algorithm can be used to construct a sigmoidal net in polynomial time from the HON. In FFNs, full connectivity from the inputs to the hidden units (i.e., maximal order) is only required for a very small percentage of learning problems. For many real world problems, maximal order is infeasible, because such problems often have many thousands of inputs. Researchers have tried to address this problem a posteriori
using techniques like pruning (Sietsma, & Dow 1991) and weight decay (Kramer, & Sangiovanni-Vincentelli 1989). Our constructive approach automatically ensures that the network does not have a larger fan-in than necessary. In Section 2, we present more precise definitions of the concepts that will be used in the remainder of this paper. Section 3 deals with representational issues of order in FFNs. Next, Section 4 gives a development of the constructive HON algorithm. Finally, simulation results and a discussion are presented in Section 5.
2 Terminology

2.1 Feedforward Networks

The learning problems that are of interest to us here can be expressed in the following manner. We are given a finite set of two-valued patterns X ⊆ B^n, where B is either the binary set {0,1} or the bipolar set {−1,+1}. (In what follows, most of our definitions will be in terms of bipolar values, although sometimes we will find it convenient to use binary values.) The pattern set X is formed from the union of two disjoint sets X+ and X−. We wish the network to learn a function ψ that exactly maps the set X into a two-valued set that will be used to indicate whether the pattern x ∈ X is a member of X+ or X−. The two-valued set of outputs can be either binary or bipolar as convenient, so that ψ performs the mapping ψ: X → B. The function ψ(x), where x is the input, is computed in two stages. Firstly, a set of functions Φ(x) = {φ_1(x), φ_2(x), ..., φ_r(x)} is computed, and then the results are combined by means of a combining function, say Ω, of r arguments to obtain ψ. Since both ψ and the φ_i are functions, it is often convenient to refer to the functions φ_i as hidden functions to distinguish them from the overall function ψ. This framework, called the perceptron scheme, is not limited to any particular variety of hidden functions, and will allow us to derive results for a variety of different FFN architectures. (See Figure 1 for a depiction of the perceptron scheme.) A network within this framework will here be called a perceptron, and includes the backpropagation network and most other FFNs. We are particularly interested in the higher order network (HON) architecture and we shall discuss later how this fits within the perceptron scheme. Note that, for simplicity, the argument x is often omitted from the functions ψ and φ. Let Φ = {φ_1, φ_2, ..., φ_r} be a hidden function family (HFF) defined on some subset of B^n. Let us introduce the following definition, stated for bipolar-valued ψ.
Definition 1 A function ψ is a linear threshold function with respect to Φ if there exists a number θ and a set of numbers {w_1, w_2, ..., w_r} such that ψ(x) = 1 if and only if

    w_1 φ_1(x) + ... + w_r φ_r(x) > θ

and ψ(x) = −1 if and only if

    w_1 φ_1(x) + ... + w_r φ_r(x) < θ.

Note that θ is commonly termed the threshold. This is equivalent to saying that ψ is linearly separable (LS) with respect to Φ, the space spanned by the hidden functions φ_1, φ_2, ..., φ_r. Nilsson (1965) termed this space the image space. In the perceptron scheme, the function Ω is the familiar linear combination followed by thresholding. In this scheme, the function ψ is written explicitly as

    Ω(φ_1(x), φ_2(x), ..., φ_r(x)) = ⟨ Σ_{i=1}^{r} w_i φ_i(x) ⟩

where the decision function ⟨u⟩ for bipolar outputs is determined by

    ⟨u⟩ = 1 if u ≥ 0,  and  ⟨u⟩ = −1 if u < 0.

Thus a perceptron defined upon the HFF Φ computes the function ψ(x) as follows:

    ψ(x) = ⟨ Σ_{φ_i ∈ Φ} w_i φ_i(x) ⟩.    (1)
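As a concrete illustration of the perceptron scheme, the following minimal Python sketch (with names of our own choosing) evaluates Equation (1) for an HFF supplied as a list of ordinary functions; the weights shown happen to realize the 2-variable XOR discussed later in Example 1.

def perceptron_output(x, hff, w):
    """Evaluate Equation (1): psi(x) = < sum_i w_i * phi_i(x) >.

    x   -- input pattern (sequence of binary or bipolar values)
    hff -- hidden function family Phi, a list of callables phi_i
    w   -- weights, one per hidden function
    Returns +1 or -1 (the bipolar decision function <u>).
    """
    u = sum(wi * phi(x) for wi, phi in zip(w, hff))
    return 1 if u >= 0 else -1

# Example: the HFF of monomials {1, x1, x2, x1*x2} on bipolar inputs;
# with weight -1 on x1*x2 the perceptron computes the 2-variable XOR.
hff = [lambda x: 1.0, lambda x: x[0], lambda x: x[1], lambda x: x[0] * x[1]]
for x in ([-1, -1], [-1, 1], [1, -1], [1, 1]):
    print(x, perceptron_output(x, hff, w=[-0.5, 0.0, 0.0, -1.0]))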
2.2 Problem Order

The order of a given problem is defined here in terms of the order of the network architecture required to implement the problem. So it is reasonable to talk of both function (or problem) order and network order, two related but different concepts. Consider a hidden function φ(x) and its implementation φ(i_1, ..., i_k) depending upon the inputs x_{i_1}, ..., x_{i_k}.

Definition 2 We define the fan-in of a hidden function implementation φ(i_1, ..., i_k): X → B, denoted fan_in(φ), to be k, the number of inputs to φ.

Much of the following development concerns HFFs of monomials (e.g., x_{i_1} x_{i_2} ··· x_{i_k}). Products involving higher powers of input elements need not be considered in the binary or bipolar input case because, for instance, x_i^n = x_i for all n ∈ N when x_i ∈ {0,1}. The real case is not considered in this paper. We now wish to consider the concept of problem order. To assist us, we first introduce a definition of network order.
Definition 3 In a 2-layer FFN, let the number of inputs to hidden unit i be k_i. The order of the network is equal to max_i(k_i).
The order of a given function is then defined as being equal to the order of the lowest order network necessary to realize that function. A more formal definition, which avoids reference to any specific network architecture, is as follows.
Definition 4 The order of a function ψ, ord(ψ), is the smallest number k for which we can find a set of hidden functions Φ = {φ_i} with fan_in(φ_i) ≤ k for all φ_i ∈ Φ, such that ψ is a linear threshold function with respect to Φ.
Other equivalent definitions of order are possible in the case of two-valued inputs (Krishnan 1966; Wang, & Williams 1991). An interesting consequence of the definitions of problem order and fan-in is that it is possible to discuss how they relate to FFN structures with more than two layers. Figure 2 indicates how a multi-layered FFN can be contracted down so that all the layers of hidden units can be considered as a single layer of composite hidden units. When this is done, a composite hidden unit's fan-in is simply determined by the number of inputs, and the value of the largest such fan-in is the network order. This allows one to discuss the relationships between network order and problem order with complete generality. For example, if the order of a three-layer FFN is k, then it is only capable of realizing a problem of order k or less; the fact that it is a three-layer network has no special significance.
2.3 The Hidden Function Family

We have already mentioned how the use of a HFF determines the particular type of FFN modelled. This requires further elaboration. In the case of HONs, the HFF, denoted Φ_HON, is simply a subset of possible monomials on the input elements x_1, x_2, ..., x_n. In the case of binary-valued or bipolar-valued inputs, the HFF is

    Φ_HON ⊆ {1, x_1, x_2, ..., x_n, x_1 x_2, x_1 x_3, ..., x_2 x_3 ··· x_n, x_1 x_2 ··· x_n}.

When the inputs are bipolar valued, the monomials can be considered to compute XOR functions. In the case where the inputs are binary, the monomials compute masks (conjunctions); a numerical check of both remarks is sketched below. For simplicity we will only concern ourselves with the case of a network with a single output; the results that follow can be easily extended to networks with multiple outputs.
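The XOR and mask interpretations of monomials are easy to verify by enumeration; the following check is purely illustrative.

from itertools import product

# Bipolar inputs: the monomial x1*x2 is -1 exactly when x1 and x2 differ,
# so (up to a sign) it computes the XOR of its two inputs.
for x1, x2 in product([-1, 1], repeat=2):
    print(x1, x2, x1 * x2)

# Binary inputs: the monomial x1*x2 is 1 only when both inputs are 1,
# i.e., it computes the mask (conjunction) x1 AND x2.
for x1, x2 in product([0, 1], repeat=2):
    print(x1, x2, x1 * x2)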
The equation for a single output of a HON, denoted by y, is often written in the following manner:

    y = ⟨ Σ_{i=1}^{n} w_i x_i + w_{n+1} x_1 x_2 + w_{n+2} x_1 x_3 + ... + w_{2^n−1} x_1 x_2 ··· x_n − θ ⟩.    (2)

In the above equation, the w_r, r = 1, 2, ..., 2^n − 1, are weightings on the outputs of the hidden functions, θ is the threshold, and ⟨·⟩ denotes the decision function. It is often convenient to replace the threshold by the negative of a weight w_0 upon the augmenting input x_0 = 1, i.e., w_0 x_0 = −θ. Network order in a HON is easily identified: when the polynomial in Equation (2) includes products of up to k input terms, the network is termed a k-th order HON (k ≤ n). A number of authors have considered HONs, but the list is quite small in comparison with the popular Hopfield and sigmoidal FFN architectures (see, for example, Poggio 1975; Lee et al. 1986; Peretto, & Niez 1986; Giles, & Maxwell 1987; Keeler 1987; Personnaz, Guyon, & Dreyfus 1987; Psaltis, Park, & Hong 1988; Kohring 1990; Sirat, & Jorand 1990; Shin, & Ghosh 1991; Perantonis, & Lisboa 1992).
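When the monomial hidden units are stored as index sets T (as in the nomenclature), Equation (2) is straightforward to evaluate; the sketch below assumes that representation, and the names are ours.

import numpy as np

def hon_output(x, monomials, w, theta):
    """Evaluate Equation (2) for a HON whose hidden units are monomials
    phi_T, each given as a tuple T of input indices; the empty tuple ()
    denotes the constant monomial 1."""
    x = np.asarray(x, dtype=float)
    u = sum(wi * x[list(T)].prod() for wi, T in zip(w, monomials)) - theta
    return 1 if u >= 0 else -1

# A second-order HON on three inputs with hidden units 1, x1, x2, x3, x1*x2:
monomials = [(), (0,), (1,), (2,), (0, 1)]
print(hon_output([1, -1, 1], monomials, w=[0.5, 0, 0, 0, -1.0], theta=0.0))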
3 Representational Issues

The task we wish the HON to solve is to classify correctly the finite set of patterns X ⊆ B^n into the two subsets X+ and X−, corresponding to a unitary positive and negative network output, respectively. In addition, we require that the HON achieve this with a network of minimal order. These task requirements mean, firstly, that the HON architecture must be able to represent the classification of X. Secondly, they mean that if the classification of X is a problem of order k, then this representation must be possible in a network of order k. We will address these issues in this section. The most important property of HONs when compared with other FFNs is that if a HON of at least order k is required to classify a set of patterns X ⊆ B^n, then a FFN (with any architecture) of order less than k that will classify X (with complete accuracy) does not exist. This means, for example, that choosing a HFF such as linear units with sigmoidal outputs over the monomial HFF Φ_HON offers no advantage in terms of order. This rather remarkable property can be explained by means of the following theorem.
Theorem 1 Any function on X ⊆ B^n that can be implemented by an FFN of order k can be implemented by a HON of order k.
Proof For a bipolar vector y ∈ X ⊆ {−1,1}^n, we can construct a polynomial δ_y(x) = Π_{i=1}^{n} (1 + y_i x_i)/2 vanishing everywhere on X but at x = y. A similar polynomial exists for the binary case. Clearly, δ_y(x) is a linear combination of monomials of degree not more than n. Thus any hidden function on X can be written as a linear combination of monomials {φ_i} of degree (and so fan-in) not more than n. Similarly, if one of a network's hidden functions φ has fan_in(φ) = k, then φ can be written as a linear combination of monomials of degree not more than k. Therefore, it is always possible to transform the hidden units of an order-k FFN to a linear combination of monomials, such that the resulting HON has order no larger than k when X ⊆ B^n.

It is, however, possible to conceive of situations where a restriction on the type of hidden functions can lead to an increase in the required network order. Because of this, it is of some interest to consider whether the HFF Φ is order preserving:

Definition 5 The HFF Φ is order preserving if any problem of order k can be realized as a network of order k using only hidden functions from Φ.

Clearly the HFF Φ_HON is order preserving. A notable form of FFN is the backpropagation network formed of sigmoidal units, and it can easily be shown that sigmoidal hidden units are also order preserving.
Theorem 2 The hidden function family of sigmoidal units is order preserving for any problem on X ⊆ B^n.

Proof Any monomial φ(x) = x_{i_1} x_{i_2} ··· x_{i_s} is equivalent to a threshold logic unit (TLU) of the form

    φ(x) = ⟨ x_{i_1} + x_{i_2} + ... + x_{i_s} − s + 0.5 ⟩    (3)

that has the same fan-in, when x ∈ X is binary valued. (A similar expression can be found for the case of bipolar-valued x ∈ X using simple translations to binary-valued variables.) Given a TLU, it is always possible to find a sigmoidal unit of the same fan-in with output arbitrarily close to that of the TLU by adjusting the slope of the sigmoidal transfer function. (It is not necessary to get exactly the same output from the sigmoidal unit as from the TLU, because all we require is that the patterns in the image space be rendered LS by the hidden units.) Therefore, any set of monomial hidden units that realizes a particular problem on X can be translated into a set of sigmoidal hidden units of the same fan-in that will also realize the problem.
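The construction in this proof is mechanical, as the following sketch shows: Equation (3) gives the TLU for a binary-input mask monomial, and a steep sigmoid approximates the TLU to any desired accuracy. The slope value 50 is an arbitrary illustrative choice.

import numpy as np

def monomial_tlu(x, T):
    """Equation (3): a TLU equal to the mask monomial prod_{i in T} x_i
    for binary-valued inputs x."""
    s = len(T)
    return 1 if sum(x[i] for i in T) - s + 0.5 > 0 else 0

def monomial_sigmoid(x, T, slope=50.0):
    """A sigmoidal unit with the same fan-in; for a large enough slope its
    output is arbitrarily close to that of the TLU above."""
    s = len(T)
    u = sum(x[i] for i in T) - s + 0.5
    return 1.0 / (1.0 + np.exp(-slope * u))

x = [1, 0, 1, 1]
T = (0, 2, 3)                                  # the monomial x1*x3*x4
print(monomial_tlu(x, T), round(monomial_sigmoid(x, T), 6))   # 1 1.0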
It is possible then, by the mechanism outlined in the proof above, to transform a HON constructed in polynomial time by the algorithm we will describe into a FFN with only sigmoidal hidden units. This therefore allows, indirectly, a sigmoidal FFN to be constructed in polynomial time. One may well conclude from Theorem 1 that the HON architecture is representationally much more powerful than certain FFN architectures. The following theorem indicates that this is not the case and that the architectures are in fact equivalent in representational power.
Theorem 3 FFNs of order k with hidden functions of the following kinds are all equivalent in the functions on X ⊆ B^n that they can represent:
(i) HONs of order k;
(ii) FFNs of order k with (compositions of) sigmoidal hidden functions.
Proof By Theorem 1, any function f on X ⊆ B^n realized by an arbitrary FFN (e.g., any number of layers or unit characteristics) of order k can be realized by a HON of order k. We can now use methods from the proof of Theorem 2 to determine the converse relationship. As was shown in this proof, every monomial on X ⊆ B^n has an equivalent TLU and sigmoidal unit of the same fan-in. Therefore, for any HON of order k realizing f on X ⊆ B^n, there is at least one equivalent two-layer sigmoidal FFN, also of order k. We can construct a three-layer sigmoidal FFN of order k from a two-layer sigmoidal FFN of order k by adding a layer of sigmoidal units so that each unit has only a single (scalar) input from the previous layer. Selecting a large slope for the sigmoid of each of these units will ensure that the input to the next layer will effectively be unchanged. Therefore, for any HON of order k realizing f there is at least one equivalent three-layer sigmoidal FFN of order k. We can repeat this procedure to create equivalent sigmoidal FFNs of order k with an arbitrary number of layers.
4 Constructing a HON

The first stage in our constructive HON training algorithm is to compute a minimal set of monomials for a particular pattern set, and use these monomials as the basis of the HON's hidden units. The technique we will outline in the following sections goes a considerable way towards eliminating the combinatorial explosion of high order terms, and consequently can be used to answer some of the criticism that HONs
have received. The method is based upon the following straightforward algorithm for determining the order of a problem on the set X of two-valued patterns. Let Φ_k denote the set of monomials of degree k, and define Φ_{≤k} = ∪_{i=0}^{k} Φ_i. Recall that the image space (denoted by I) is the space spanned by the hidden functions φ_1, φ_2, ..., φ_r. The image space set I+ is then defined such that each pattern in I+ corresponds to the state of the image space when a pattern x ∈ X+ is input to the network. Similarly, each pattern in I− corresponds to a pattern in X−.
Algorithm 1
01  k = 0
02  Φ_0 = {1}
03  do
04      k := k + 1
05      determine Φ_{≤k}
06      find images I+ and I− of X+ and X− under Φ_{≤k}
07  until I+ and I− are LS

A HON that realizes the problem on X will now have been established.
In the case of two-valued inputs (i.e., X ⊆ B^n, where B^n = {0,1}^n or B^n = {−1,1}^n), the algorithm must terminate because the degree of a monomial on X clearly need not be higher than n. This algorithm is demonstrated by the following simple example.

Example 1 Determine the order of the 2-variable XOR problem. Given that x = (x_1, x_2) ∈ X, and X = B^2, the set of monomials of up to degree one is given by Φ_{≤1} = {1, x_1, x_2}. The images of Φ_{≤1} on X+ = {(0,1), (1,0)} and X− = {(0,0), (1,1)} are the sets I+ = {(1,0,1), (1,1,0)} and I− = {(1,0,0), (1,1,1)}. The next step, determining the linear separability of I− and I+, will show that the images of the monomials Φ_{≤1} on X are not LS. Therefore we must test the next highest order monomials: those of 2nd order. Now, Φ_2 = {x_1 x_2}, so Φ_{≤2} = {1, x_1, x_2, x_1 x_2}, and so the new image sets are I+ = {(1,0,1,0), (1,1,0,0)} and I− = {(1,0,0,0), (1,1,1,1)}. Applying a test for linear separability then shows that I+ and I− are LS (confirming that the 2-variable XOR is 2nd order). Therefore, a HON that realizes the 2-variable XOR problem will require monomials from the HFF Φ_{≤2} = {1, x_1, x_2, x_1 x_2} to form its hidden units.
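The steps of Example 1 can be reproduced mechanically; the sketch below uses a small feasibility linear program (via scipy.optimize.linprog, our choice, not the paper's) as the linear separability test.

import numpy as np
from scipy.optimize import linprog

def linearly_separable(I_pos, I_neg):
    """Feasibility LP: is there (w, b) with w.v - b >= 1 on I_pos
    and w.v - b <= -1 on I_neg?"""
    d = len(I_pos[0])
    A, rhs = [], []
    for v in I_pos:                       # -(w.v) + b <= -1
        A.append(list(-np.asarray(v, float)) + [1.0]); rhs.append(-1.0)
    for v in I_neg:                       # w.v - b <= -1
        A.append(list(np.asarray(v, float)) + [-1.0]); rhs.append(-1.0)
    res = linprog(np.zeros(d + 1), A_ub=np.array(A), b_ub=np.array(rhs),
                  bounds=[(None, None)] * (d + 1))
    return res.success

# Images of {1, x1, x2} on the XOR pattern set: not linearly separable.
print(linearly_separable([(1, 0, 1), (1, 1, 0)], [(1, 0, 0), (1, 1, 1)]))

# Adding the degree-2 monomial x1*x2 renders the images separable.
print(linearly_separable([(1, 0, 1, 0), (1, 1, 0, 0)],
                         [(1, 0, 0, 0), (1, 1, 1, 1)]))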
The algorithm just described has one significant difficulty that renders it unusable for large dimensional problems without some additional controlling mechanisms.
This is, of course, the combinatorial explosion of monomials that will result as the dimension of the input space n and the order k increase. In fact, the number of monomials of order k in an n-dimensional two-valued input space is given by the binomial coefficient C(n, k). So, for large n, an unacceptably large number of dimensions in the image space will be needed during the execution of the algorithm. In addition, high-dimensional tests for linear separability will need to be performed. This combinatorial explosion, quite daunting in its magnitude, is often seen as insurmountable and a good reason to avoid HONs (Minsky, & Papert 1988; Reid, Spirkovska, & Ochoa 1989). There are, however, a number of techniques available for keeping this problem within manageable limits. Some techniques for dealing with the combinatorial explosion of monomial hidden functions were suggested by Giles and Maxwell (1987). These techniques, however, are not suitable for our purposes because of our requirement of an exact mapping. An explanation of the approach we have developed to deal with the combinatorial explosion follows. This approach results in a constructive technique for determining a HON solution to the classification of a pattern set X ⊆ B^n.
4.1 Universal Set of Monomials

The combinatorial explosion can be dealt with by recognizing that in the majority of cases, many of the monomials are redundant because of their mutual linear dependence (as functions on the set X). This linear dependence will occur in any problem that is incompletely specified, and all generalization problems are of this type. By restricting the monomial hidden units to only nonredundant, linearly independent monomials, the number of hidden units is kept to an acceptable level without reducing the representational ability of the network, as we will see in the following development. This representational ability is captured by what we term a universal set of monomials. The universal set is so named because it is constructed only from the set of input patterns X, and therefore remains the same across all the possible classifications of the patterns in X. The following definition is fundamental to the development of the universal set. Consider the space of real-valued functions that can be defined on a finite set of patterns X. A particular function on X can be identified by its real-valued outputs for each pattern in the set X. If |X| = p, say, then the function will be identified by a point in the p-dimensional real space R^p, which will be called the X-space. Each coordinate in this space will correspond to the value of the function for a particular
pattern in X . Whenever we talk about monomials being linearly independent on X , we mean that the X -space vectors of the monomials are linearly independent. Let us now introduce some notation. Firstly, let 'T , where T = fi1 ; i2; : : : ; is g, represent a particular monomial xi1 xi2 xi . Then, let the point in X -space that is given by the outputs of the monomial 'T for the patterns in X be denoted by T ' X . Further, if M denotes a set of monomials, let Mk M denote the subset of all monomials in M of degree k. Finally, [M]X is used to denote the set of X -space vectors from the monomials in M on X . The idea of the universal set is simple in nature: a universal set only includes monomials whose outputs on X cannot be written as a combination of the outputs of the other monomials in the set. If a universal set for a set of patterns X is denoted by M, then the monomials in M will form a basis for the X -space. A universal set must have one further property | it must be order preserving (see De nition 5) with respect to any problem on the set of patterns X . The concept of a universal set is de ned more formally as follows. s
Definition 6 Let M be a set of monomials on X. Then M is termed a universal set of monomials on X if and only if the following two conditions are met:
(i) All the monomials in M are linearly independent on X, such that M provides a basis for all real-valued functions on X, i.e., the X-space is generated by the outputs of the monomials in the set M.
(ii) Any polynomial of degree k ≤ n restricted to X can be written as a linear combination of monomials in M_{≤k}, i.e., M is an order-preserving set of hidden functions for any problem defined on X.

The following theorem highlights two important properties of a universal set.
Theorem 4 Consider a universal set of monomials M on X. Then:
(i) the cardinality of M is equal to the cardinality of X, i.e., |M| = |X|;
(ii) a universal set for X is not necessarily unique.
Proof The first part of this theorem follows from the fact that there can only be |X| linearly independent vectors in an |X|-dimensional linear space. The second part follows from the fact that more than one set of vectors may span a linear space.

As a consequence of this theorem, one need never consider more than p monomial hidden units, where p = |X|. (Incidentally, this places an upper bound of p on the number of hidden units required in a FFN.)
An obvious first algorithm to compute a universal set of monomials M for the set X can then be expressed as follows.
Algorithm 2
01  k = 0
02  M_0 = {1}
03  do
04      k := k + 1
05      for each φ of degree k
06          if φ is linearly independent of M_{≤k} on X
07              M_k := M_k ∪ {φ}
08  until |M| = |X|
Theorem 5 Algorithm 2 constructs a universal set for X.

Proof By testing all monomials of lower degree first, and eventually testing all possible monomials, both properties in Definition 6 are satisfied, which ensures that M is a universal set. Furthermore, because X is a set of two-valued patterns, the algorithm will terminate with a set M, |M| = |X|, such that the degree of any monomial in M is no more than n.

This algorithm is faced with one difficulty: it is unclear at this stage how to generate each φ of degree k in line 05. We will solve this problem in the following development, and produce a polynomial-time algorithm for determining a universal set. If the two sets M_i and M_j are the i-th and j-th degree subsets of a universal set on the patterns X, then the set of ordinary products of the monomials in M_i and M_j, denoted by M_i M_j, will usually contain a number of monomials of degree i + j. Most importantly, however, the number of monomials in M_i M_j of degree i + j will usually be much less than the total number of monomials that exist with degree i + j. The following theorems indicate how a universal set can be constructed by considering only those monomials in the ordinary product M_i M_j. We will use the notation Span(·) to denote a linear combination.
Theorem 6 Given a universal set M on X with subsets M_i and M_j of degree i and j, respectively, then [φ]_X ∈ Span([M_i M_j]_X ∪ [M_{≤i+j−1}]_X) for any monomial φ of degree i + j ≤ n.
... confirming that at most n|X| tests for linear independence need to be performed. Each test for linear independence requires examination of a rectangular matrix which has each dimension ≤ |X|, and as a result (see the Appendix), the complexity of each test will be O(|X|^3). And because n|X| such tests have to be carried out, the computational complexity of Algorithm 3 is O(n|X|^4). In the Appendix, we will develop an improved algorithm by using a form of Gaussian elimination to speed up the tests for linear independence. The algorithm complexity is given by the following theorem (for proof, see the Appendix).
Theorem 10 A universal set for the set of patterns X can be computed in O(n|X|^3) time.
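A direct (unoptimized) rendering of the universal-set construction is sketched below: candidates of degree k are taken from the products M_1 M_{k−1}, and each is kept only if a numpy rank test shows its X-space vector to be linearly independent of those already kept. The rank test stands in for the incremental Gaussian elimination of the Appendix, so this sketch does not attain the O(n|X|^3) bound; the function names are ours.

import numpy as np

def universal_set(X):
    """Build a universal set of monomials (as index tuples T) for the
    two-valued pattern set X; the X-space vector of T is prod_{i in T} x_i."""
    X = np.asarray(X, dtype=float)
    p, n = X.shape
    M, rows = [()], [np.ones(p)]               # degree 0: the constant 1
    by_degree = {0: [()]}
    for k in range(1, n + 1):
        by_degree[k] = []
        if k == 1:
            cands = {(i,) for i in range(n)}
        else:                                  # products M_1 * M_{k-1}
            cands = {T for T in (tuple(sorted(set(a) | set(b)))
                                 for a in by_degree[1]
                                 for b in by_degree[k - 1]) if len(T) == k}
        for T in sorted(cands):
            v = X[:, list(T)].prod(axis=1)     # X-space vector of phi_T
            if np.linalg.matrix_rank(np.vstack(rows + [v])) == len(rows) + 1:
                M.append(T); rows.append(v); by_degree[k].append(T)
        if len(M) == p:                        # |M| = |X|: basis complete
            break
    return M

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
print(universal_set(X))      # [(), (0,), (1,), (0, 1)]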
4.2 Constructing a HON Using a Universal Set of Monomials

The following theorem indicates how the concept of a universal set of monomials may be incorporated into Algorithm 1 to reduce the computational overhead imposed by the need of the algorithm to consider all possible monomials. This theorem clearly indicates that a universal subset M_{≤k} can be used at any point in Algorithm 1 in place of the larger set Φ_{≤k} of monomials. The speed improvement is immediately apparent when one considers that |Φ| = 2^n whereas |M| = |X| and X ⊆ B^n.
Theorem 11 The problem ψ on X such that ψ: X → B has order ≤ k if and only if it can be implemented with hidden functions from the subset M_{≤k} of a universal set of monomials M on X.

Proof This theorem follows from the definition of a universal set (Definition 6), and Theorem 1.

Once we have determined a universal set M on X that forms the set from which we will choose the hidden units of a HON, the next step is to determine the minimum k such that the monomials in the subset M_{≤k}, when used as hidden units, will correctly classify X according to ψ. From Theorem 11, a problem on X is of order ≤ k if the
following inequalities can be satisfied for some set of weights w_i, i = 1, ..., |M_{≤k}|:

    Σ_{φ_i ∈ M_{≤k}} w_i φ_i(x)  > 0 if x ∈ X+,  < 0 if x ∈ X−.    (5)
Let us introduce the notation t(x) = 1 if x ∈ X+ and t(x) = −1 if x ∈ X−. We can then frame the problem of Equation (5) as a Chebyshev solution to a set of linear equations (Cheney 1982):

    min_w E(w) = min_w max_{x ∈ X} | Σ_{φ_i ∈ M_{≤k}} w_i φ_i(x) − t(x) |.    (6)
It can easily be shown that any solution to Equation (6) for which E(w) < 1 is also a solution to Equation (5), and that if a solution to Equation (5) exists then any solution to Equation (6) must satisfy E(w) < 1 (Kaplan, & Winder 1965). The inequalities in Equation (6) can be rewritten as the following linear programming problem:

    minimize_{w, r}  r
    subject to   Σ_{φ_i ∈ M_{≤k}} w_i φ_i(x) − t(x) − r ≤ 0,  for each x ∈ X
                 Σ_{φ_i ∈ M_{≤k}} w_i φ_i(x) − t(x) + r ≥ 0,  for each x ∈ X.
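Any LP solver can be used for this step; the sketch below hands the problem to scipy.optimize.linprog (our substitution for Karmarkar's method, which changes the complexity bound but not the result), with the image matrix Phi holding one row per pattern and one column per monomial in M_{≤k}.

import numpy as np
from scipy.optimize import linprog

def chebyshev_fit(Phi, t):
    """Solve Equation (6): minimize over w the value max_x |Phi w - t|.
    Phi -- p x m matrix of hidden-unit outputs on X (the image of X)
    t   -- length-p vector of +/-1 targets
    Returns (w, E), the weights and the Chebyshev error E(w)."""
    p, m = Phi.shape
    c = np.r_[np.zeros(m), 1.0]                 # minimize r
    A = np.block([[Phi, -np.ones((p, 1))],      # Phi w - t - r <= 0
                  [-Phi, -np.ones((p, 1))]])    # -(Phi w - t) - r <= 0
    b = np.r_[t, -t]
    res = linprog(c, A_ub=A, b_ub=b, bounds=[(None, None)] * (m + 1))
    return res.x[:m], res.x[m]

# Hidden units {1, x1, x2, x1*x2} on the (binary) XOR problem:
Phi = np.array([[1, 0, 0, 0], [1, 0, 1, 0], [1, 1, 0, 0], [1, 1, 1, 1]], float)
t = np.array([-1.0, 1.0, 1.0, -1.0])
w, E = chebyshev_fit(Phi, t)
print(E < 1)        # True, so Equation (5) is satisfied and XOR is realized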
After translation into Karmarkar's form (Bazaraa, Jarvis, & Sherali 1990), this linear programming problem can be solved in O(p^{5.5}) time (Karmarkar 1984), where p = |X|, assuming arithmetic operations require unit time. Because a universal set M is constructed for any problem on X, a problem of order k would usually not require all the monomials in M_{≤k} as hidden units of a HON realization. Using a linear programming algorithm in the manner described above generally eliminates at least some of these redundant hidden units. We can now adapt our initial algorithm, Algorithm 1, to incorporate the concept of a universal set.
Algorithm 4
01  k = 0
02  M_0 = {1}
03  do
04      k := k + 1
05      determine M_k
06  until Equation (5) is satisfied
A HON that realizes the problem on X will now have been established.
Theorem 12 Algorithm 4 will construct a minimal order HON to realize the problem on X ⊆ B^n in O(np^{5.5}) time.

Proof When X ⊆ B^n, the order of a problem on X cannot be larger than n. Therefore the do-loop of Algorithm 4 will never be executed more than n times. The cost of solving the linear programming problem in Equation (5) is O(p^{5.5}) time, much larger than that required to determine M_k. Therefore the total algorithm will require at most O(np^{5.5}) time.

In effect, then, the algorithm described constructs a network with minimal maximum hidden unit fan-in (and this can be done in polynomial time using Karmarkar's algorithm); the network obtained is not necessarily, however, one that contains the minimum possible number of hidden units. An impediment to obtaining a minimal solution concerns the fact that the universal set is not unique. Finding a truly minimal solution from all possible universal sets for a given problem would be infeasible.
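Putting the pieces together, a compact rendering of Algorithm 4 might look as follows; it reuses the universal_set and chebyshev_fit sketches above, and is an illustration of the procedure rather than the authors' implementation. The tolerance on E < 1 is ours, to guard against solver round-off.

import numpy as np

def chon(X, t, tol=1e-6):
    """Algorithm 4: admit hidden units degree by degree until the
    Chebyshev error drops below 1, i.e., Equation (5) is satisfied."""
    X = np.asarray(X, dtype=float)
    M = universal_set(X)                        # ordered by degree
    for k in range(X.shape[1] + 1):
        Mk = [T for T in M if len(T) <= k]      # the subset M_{<=k}
        Phi = np.column_stack([X[:, list(T)].prod(axis=1) for T in Mk])
        w, E = chebyshev_fit(Phi, np.asarray(t, dtype=float))
        if E < 1.0 - tol:
            return Mk, w                        # a minimal-order HON
    raise RuntimeError("no realization found; cannot happen for X in B^n")

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
monomials, weights = chon(X, t=[-1, 1, 1, -1])  # XOR targets
print(monomials)                                # [(), (0,), (1,), (0, 1)]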
5 Discussion

The strength of the constructive HON (CHON) algorithm we have presented lies in its ability to construct a minimal-order set of spanning monomials that are then used to determine the hidden units for an arbitrary problem on X. In problems of the training-by-example type, and in other problems where generalization is required, we normally have |X| ≪ 2^n (X ⊆ B^n). In such cases our universal set algorithm generally eliminates the vast majority of the possible 2^n monomials from consideration as hidden units. Of course, if every pattern were present in the training set (exceedingly rare in practical learning problems), the selection of a universal set would be a pointless exercise, because it would, of necessity, contain all 2^n monomials for the pattern set X = B^n. Whilst the ability of the algorithm to construct a correctly classifying HON for an arbitrary set of patterns has been demonstrated mathematically, it is instructive to examine the performance of the algorithm on a test problem. At the same time we will investigate the generalization behaviour of the resulting network. The particular problem that we will use for this purpose is the "two-or-more clumps" problem, following (Denker et al. 1987; Mezard, & Nadal 1989; Frean 1990). An input pattern x is classified as belonging to X+ if x contains two or more clumps of 1's; otherwise it
belongs to X−. Note that cyclic boundary conditions apply to x: the element x_1 of x is considered to be next to x_n. The two-or-more clumps problem is a second-order problem. A two-or-more clumps problem with a mean of 1.5 clumps and 25 inputs was tested to make possible a direct comparison with the results for the Upstart and Tiling constructive algorithms presented in (Frean 1990). The training set in each of the trials contained up to 800 patterns[3] and the performance of the network was tested on a further 600 patterns. These generalization test results are presented in Figure 3. Figure 4 indicates the growth of the number of weights in the resulting HON as the size of the training set was increased for the two-or-more clumps problem. We present the growth in the HON in terms of the number of weights, rather than the more usual situation of quoting the number of hidden units (although they occur with the same frequency in a HON). We have done this to emphasize that the hidden units of a HON are more cost effective than in typical FFN structures (including the network constructed by the Upstart algorithm). A HON has only one weight for each monomial hidden unit. In typical FFNs, however, each hidden unit has in addition a weight for each input element, and a threshold (giving n + 2 weights per hidden unit in total for an n-dimensional input space). Furthermore, the maximum fan-in of the monomials is kept as small as possible by the CHON algorithm, so the computation required to determine the output of each hidden unit of a HON is smaller than for typical FFNs. Table 1 indicates the network order of the constructed HONs on the two-or-more clumps problem. From this table it is possible to compute the total connectivity. The apparent variation in order of the two-or-more clumps problem, as evidenced by the different network orders obtained in the trials of Table 1, can be simply explained in the following way. The concept of a learning problem (e.g., the predicate "are there two or more clumps?") can be only partially captured by an incomplete set of patterns. As the number of patterns in the set decreases, it becomes increasingly more probable that the unspecified patterns can be assigned so that the order of the problem collapses to a lower value.

[3] Although the size of the pattern sets is a minuscule fraction of the 2^25 possible patterns, the mean of 1.5 clumps ensures that there is a considerable bias towards the patterns that have less than two clumps, of which there are only 602. Therefore, as the pattern set sizes increase, the number of duplicate patterns will increase, making the generalization performance statistic less and less interesting (Frean 1992). For the training sets used here, it was found that the 50, 100, 200, 400, 600 and 800 pattern training sets had a mean of 3, 8, 20, 57, 101 and 155 duplicate patterns, respectively.
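For reference, the classification rule of the test problem is a one-liner over a cyclic clump count; the function below restates the problem definition (it is not the pattern generator used in the original experiments).

def num_clumps(x):
    """Count maximal runs of 1's in the binary pattern x, treating x as
    cyclic (x[0] is adjacent to x[-1])."""
    if all(v == 1 for v in x):
        return 1                  # a single clump wrapping the whole ring
    # A clump starts wherever a 1 follows a 0 (cyclically).
    return sum(1 for i in range(len(x)) if x[i] == 1 and x[i - 1] == 0)

x = [0, 1, 1, 0, 0, 1, 0, 0]
print(num_clumps(x) >= 2)         # True: the pattern belongs to X+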
It is interesting to note the dramatic increases in performance on the test set as the training set increases from 400 to 600 patterns and from 600 to 800 patterns. These increases correspond with only a marginal increase in the number of monomials and weights in the constructed HONs (Figure 4). Furthermore, at 600 and 800 training patterns, the variation in the number of weights over the 25 trials is too small to mark on the figure, having a standard deviation from the mean of only 0.9. This behaviour is not observed in training sets of up to and including 600 patterns under the Upstart and Tiling algorithms (Frean 1990). It seems that the CHON algorithm has learnt considerably more of the structure of the two-or-more clumps problem from 600 and 800 patterns than the increase that occurred from 200 to 400 patterns would suggest was possible. We can compare our results with those for the Tiling and Upstart algorithms presented in (Frean 1990) for training sets of 600 or fewer patterns; these results have been reproduced in Figure 3 from data generously supplied to us by M. Frean. At 600 patterns, the CHON algorithm outperforms the Upstart and Tiling algorithms. At less than 600 training patterns, the Upstart algorithm is a slightly better performer, and the Tiling algorithm performs roughly the same as the CHON algorithm we have presented.
5.1 Conclusions

By utilizing a constructive architecture we have developed an algorithm that solves a mapping problem in time polynomial in terms of the number of patterns. The scheme involves the selection of multiplicative nonlinearities as hidden units (the HON architecture) based on their relevance to (i.e., linear independence on) the particular pattern set. Recently, Blum and Rivest (1992) have suggested this as a possible approach to overcoming hardness results for the training problem. Furthermore, using the concept of order, we have shown that the representational ability of an arbitrary FFN cannot be better than that of a HON of equivalent order. In addition, as long as the net structure is not set before training, we have shown that it is a simple matter to use a HON to construct a sigmoidal net of the same maximal fan-in in polynomial time. Finally, the algorithm does not have any parameters that require tuning for good performance and it seems to perform reasonably well when compared with other constructive algorithms on a standard test problem. These properties make the CHON algorithm an attractive one. The universal set concept and algorithm could also be used with a least squares
algorithm to obtain a least squares fit to the training patterns. However, a network obtained by such an approach would not have the property of being guaranteed to be a network of minimal order for the test patterns presented. A paper dealing with the case of real-valued patterns (of limited precision) is forthcoming. An incremental version of the constructive HON algorithm could be developed to obtain an "online" version of the algorithm.
References
Ash, T. (1989). Dynamic node creation in backpropagation networks. ICS Report 8901. Institute for Cognitive Science, UCSD.

Baum, E. B. (1991). Review of Neural Network Design and the Complexity of Learning, by S. Judd. IEEE Transactions on Neural Networks, 2(1), 181–182.

Baum, E. B., & Haussler, D. (1989). What size net gives valid generalization? Neural Computation, 1, 151–160.

Bazaraa, M. S., Jarvis, J. J., & Sherali, H. D. (1990). Linear Programming and Network Flows (second edition). New York, NY: John Wiley & Sons.

Blum, A. L., & Rivest, R. L. (1992). Training a 3-node neural network is NP-complete. Neural Networks, 5, 117–127.

Cheney, E. W. (1982). Introduction to Approximation Theory (second edition). New York, NY: Chelsea Publishing Company.

Cover, T. M. (1965). Geometrical and statistical properties of systems of linear inequalities with applications to pattern recognition. IEEE Transactions on Electronic Computers, EC-14, 326–334.

Denker, J., Schwartz, D., Wittner, B., Solla, S., Howard, R., & Jackel, L. (1987). Large automatic learning, rule extraction, and generalization. Complex Systems, 1, 877–922.

Fahlman, S. E., & Lebiere, C. L. (1990). The cascade-correlation learning architecture. In D. S. Touretzky (Ed.), Advances in Neural Information Processing Systems 2 (pp. 524–532). San Mateo, CA: Morgan Kaufmann Publishers.

Frean, M. (1990). The upstart algorithm: a method for constructing and training feedforward neural networks. Neural Computation, 2(2), 198–209.

Frean, M. (1992). Personal communication via email, 14 August.

Giles, C. L., & Maxwell, T. (1987). Learning, invariance, and generalization in higher-order neural networks. Applied Optics, 26, 4972–4978.

Golea, M., & Marchand, M. (1990). A growth algorithm for neural network decision trees. Europhysics Letters, 12, 205–210.
Hanson, S. J. (1990). Meiosis networks. In D. S. Touretzky (Ed.), Advances in Neural Information Processing Systems 2 (pp. 533–541). San Mateo, CA: Morgan Kaufmann Publishers.

Hanson, S. J., & Pratt, L. Y. (1989). Comparing biases for minimal network construction with back-propagation. In D. S. Touretzky (Ed.), Advances in Neural Information Processing Systems 1 (pp. 177–185). San Mateo, CA: Morgan Kaufmann Publishers.

Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the Theory of Neural Computation. Reading, MA: Addison Wesley.

Ji, C., Snapp, R. R., & Psaltis, D. (1990). Generalizing smoothness constraints from discrete samples. Neural Computation, 2(2), 188–197.

Judd, J. S. (1990). Neural Network Design and the Complexity of Learning. Cambridge, MA: MIT Press.

Kaplan, K. R., & Winder, R. O. (1965). Chebyshev approximation and threshold functions. IEEE Transactions on Electronic Computers, EC-14, 250–252.

Karmarkar, N. (1984). A new polynomial-time algorithm for linear programming. Combinatorica, 4, 373–395.

Keeler, J. D. (1987). Information capacity of outer product neural networks. Physics Letters A, 124, 53–58.

Kohring, G. A. (1990). Neural networks with many neuron interactions. Journal de Physique, 51(2), 145–155.

Kramer, A. H., & Sangiovanni-Vincentelli, A. (1989). Efficient parallel learning algorithms for neural networks. In D. S. Touretzky (Ed.), Advances in Neural Information Processing Systems 1 (pp. 40–48). San Mateo, CA: Morgan Kaufmann Publishers.

Krishnan, T. (1966). On the threshold order of a Boolean function. IEEE Transactions on Electronic Computers, EC-15, 369–372.

Le Cun, Y., Denker, J. S., & Solla, S. A. (1990). Optimal brain damage. In D. S. Touretzky (Ed.), Advances in Neural Information Processing Systems 2 (pp. 598–605). San Mateo, CA: Morgan Kaufmann Publishers.

Lee, Y. C., Doolen, G., Chen, H. H., Sun, G. Z., Maxwell, T., Lee, H. Y., & Giles, C. L. (1986). Machine learning using a higher order correlation network. Physica, 22D, 276–306.
Lipschutz, S. (1968). Theory and Problems of Linear Algebra. Schaum's Outline Series. New York, NY: McGraw-Hill.

Marchand, M., Golea, M., & Rujan, P. (1990). Convergence theorem for sequential learning in two layer perceptrons. Europhysics Letters, 11, 487–492.

Mezard, M., & Nadal, J.-P. (1989). Learning in feedforward layered networks: the tiling algorithm. Journal of Physics A: Mathematical and General, 22(12), 2191–2203.

Minsky, M. L., & Papert, S. A. (1988). Perceptrons (second edition). MIT Press.

Mozer, M. C., & Smolensky, P. (1989). Skeletonization: a technique for trimming the fat from a network via relevance assessment. In D. S. Touretzky (Ed.), Advances in Neural Information Processing Systems 1 (pp. 107–115). San Mateo, CA: Morgan Kaufmann Publishers.

Nadal, J.-P. (1989). Study of a growth algorithm for a feedforward network. International Journal of Neural Systems, 1(1), 55–59.

Nilsson, N. J. (1965). Learning Machines. New York, NY: McGraw-Hill.

Perantonis, S. J., & Lisboa, P. J. G. (1992). Translation, rotation, and scale invariant pattern recognition by high-order neural networks and moment classifiers. IEEE Transactions on Neural Networks, 3(2), 241–251.

Peretto, P., & Niez, J. J. (1986). Long term memory storage capacity of multiconnected neural networks. Biological Cybernetics, 54, 53–63.

Personnaz, L., Guyon, I., & Dreyfus, G. (1987). High-order neural networks: information storage without errors. Europhysics Letters, 4, 863–867.

Poggio, T. (1975). On optimal nonlinear associative recall. Biological Cybernetics, 19, 201–209.

Psaltis, D., Park, C. H., & Hong, J. (1988). Higher order associative memories and their optical implementation. Neural Networks, 1, 149–163.

Redding, N. J. (1991). Some Aspects of Representation and Learning in Artificial Neural Networks. PhD Thesis. University of Queensland.

Redding, N. J., Kowalczyk, A., & Downs, T. (1991). Higher order separability and minimal hidden unit fan-in. In T. Kohonen, K. Makisara, O. Simula, & J. Kangas (Eds.), Artificial Neural Networks, Vol. 1 (pp. 25–30). North-Holland: Elsevier Science Publishers.
Refenes, A. N., & Vithlani, S. (1991). Constructive learning by specialisation. In T. Kohonen, K. Makisara, O. Simula, & J. Kangas (Eds.), Artificial Neural Networks, Vol. 2 (pp. 923–929). North-Holland: Elsevier Science Publishers.

Reid, M. B., Spirkovska, L., & Ochoa, E. (1989). Rapid training of higher-order neural networks for invariant pattern recognition. Proceedings of the International Joint Conference on Neural Networks, Washington D.C., Vol. 1 (pp. 689–692). IEEE.

Rujan, P., & Marchand, M. (1989). Learning by minimizing resources in neural networks. Complex Systems, 3, 229–241.

Shin, Y., & Ghosh, J. (1991). The pi-sigma network: an efficient higher-order neural network for pattern classification and function approximation. International Joint Conference on Neural Networks, Seattle, Vol. I (pp. 13–18).

Sietsma, J., & Dow, R. J. F. (1991). Creating artificial neural networks that generalize. Neural Networks, 4, 67–80.

Sirat, J. A., & Jorand, D. (1990). Third-order Hopfield networks: extensive calculations and simulations. Philips Journal of Research, 44, 501–519.

Sirat, J. A., & Nadal, J.-P. (1990). Neural trees: a new tool for classification. Network: Computation in Neural Systems, 1(4), 423–428.

Wang, C., & Williams, A. C. (1991). The threshold order of a Boolean function. Discrete Applied Mathematics, 31, 51–69.

Wynne-Jones, M. (1991). Constructive algorithms and pruning: improving the multilayer perceptron. In R. Vichnevetsky, & J. J. H. Miller (Eds.), Proceedings of 13th IMACS World Congress on Computation and Applied Mathematics, Dublin (pp. 747–750).

Wynne-Jones, M. (1992). Node splitting: a constructive algorithm for feed-forward neural networks. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in Neural Information Processing Systems 4. San Mateo, CA: Morgan Kaufmann Publishers.
Appendix
In this appendix we give the development of an algorithm for computing a universal set in O(np^3) time. In constructing the universal set, it is necessary to make use of the X-space concept in performing the linear independence tests in Algorithm 3. A set of monomials M = {φ_1, φ_2, ..., φ_r} is linearly independent on a set of patterns X if the monomials' X-space points are linearly independent vectors. By letting each of these X-space vectors form the rows of a matrix, we can perform linear independence tests on this matrix in the manner outlined in the following development. First, we have to introduce the echelon form of a matrix.
Definition 7 A matrix A = (a_ij) is said to be a matrix in echelon form, or an echelon matrix, if the number of zeros preceding the first nonzero entry of a row increases row by row until only zero rows remain.
The echelon form of a matrix can be used to determine linear independence, as the following theorem indicates (Lipschutz 1968, p. 87).
Theorem 13 The non-zero rows of a matrix in echelon form are linearly independent.
A matrix can be placed in echelon form using the Gaussian elimination algorithm. In the standard form of this algorithm, an entire matrix is converted to echelon form, but this does not completely fit our requirements for a linear independence test. For instance, consider what happens during the operation of Algorithm 3. In this algorithm, the universal set M is constructed incrementally, so that during the course of the algorithm, a single new monomial is tested for linear independence from the existing monomials in M. It would be desirable to make use of the work done in previous linear independence tests in each new test, and if this is done a complete implementation of the Gaussian elimination algorithm is not necessary; the following algorithm suffices.
Algorithm 5 (incremental Gaussian elimination) The matrix A = (a_ij) is an r × c matrix in echelon form, with rows denoted by a_i. (For our purposes, the number of columns c = |X| and the number of rows r equals the number of monomials already in the universal set.) To this matrix we wish to add a new row, a_q, and place the new matrix in echelon form.

01  for i = 1, ..., r
02      let j indicate the first nonzero column of row a_i
03      a_q := a_q − (a_qj / a_ij) a_i
04  form a new (r + 1) × c matrix by placing a_q and the rows of A into echelon row order
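A direct numpy transcription of Algorithm 5 is sketched below; for simplicity it re-sorts a plain list of rows rather than maintaining the linked list used later in Algorithm 6, and the numerical tolerance is ours.

import numpy as np

def first_nonzero(row, tol=1e-12):
    """Index of the first entry of row with magnitude above tol, or None."""
    idx = np.flatnonzero(np.abs(row) > tol)
    return int(idx[0]) if idx.size else None

def add_row_echelon(rows, a_q, tol=1e-12):
    """Algorithm 5: reduce the new row a_q against the echelon rows, then
    re-insert everything in echelon row order. Returns (rows, independent)."""
    a_q = np.array(a_q, dtype=float)
    for a_i in rows:                                  # line 01
        j = first_nonzero(a_i, tol)                   # line 02
        if j is not None and abs(a_q[j]) > tol:
            a_q = a_q - (a_q[j] / a_i[j]) * a_i       # line 03
    if first_nonzero(a_q, tol) is None:
        return rows, False          # a_q reduced to zero: linearly dependent
    rows = sorted(rows + [a_q],                       # line 04
                  key=lambda r: first_nonzero(r, tol))
    return rows, True

rows = [np.array([1.0, 1.0, 1.0, 1.0])]
rows, ok = add_row_echelon(rows, [0, 0, 1, 1]); print(ok)   # True
rows, ok = add_row_echelon(rows, [1, 1, 0, 0]); print(ok)   # False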
This simple procedure of first placing a matrix in echelon form and then testing for non-zero rows will form the basis of an efficient algorithm for determining a universal set. The following example demonstrates how the linear independence test is applied to the X-space vectors for each monomial under consideration. By incorporating the incremental Gaussian elimination procedure (Algorithm 5) into Algorithm 3, we obtain an efficient algorithm for determining a universal set. Note that in this algorithm (detailed below) the X-space vectors after each incremental Gaussian elimination are effectively placed in echelon-row order using a linked list. Therefore, placing the row vectors in echelon row order (step 04 of Algorithm 5) is simply a matter of performing a linear search through the list to find the appropriate point at which to insert a pointer to the new row. In this algorithm, the set of patterns X has cardinality |X| = p and X ⊆ B^n. The X-space outputs are used to form the rows of a matrix A upon which the incremental Gaussian elimination algorithm is performed. During the operation of the algorithm it is necessary to keep track of some internal processes that take place, using notation which we will now introduce. The X-space vector for the current monomial under consideration is denoted by z, and its i-th element by z_i. The variable s is used to denote the current number of linearly independent monomials. The integer l_k is used as an index to keep track of the monomials that have been added to the universal set. The value of l_k is the index of the final monomial of degree k found to be linearly independent, and hence included in the universal set. If M denotes the universal set constructed, then the elements of the set are given by

    M = {φ_1, ..., φ_{l_0}, ..., φ_{l_{k−1}+1}, ..., φ_{l_k}, ..., φ_{l_n}}

where the monomials follow the order in which they are added to M by the algorithm. Further, the monomials φ_{l_{k−1}+1}, ..., φ_{l_k} are all the monomials of degree k in M, i.e., the subset M_k of M.
Algorithm 6

% Initialization
01  T_1 = {0}
02  φ_{T_1} = 1
03  a_1 = (1, ..., 1)
04  l_0 = 1
05  s = 1

% Find linearly independent monomials of degree 1.
06  for i = 1, ..., n
07      z = [φ_{{i}}]_X
08      for r = 1, ..., s
09          let q = index of first nonzero element of row a_r
10          z := z − (z_q / a_rq) a_r
        % If z is nonzero then φ_{{i}} is a linearly independent monomial
11      if z has nonzero elements
12          s := s + 1
13          a_s = z
14          T_s = {i}
15          reorder a_1, ..., a_s into echelon row order
16  l_1 = s

% Find linearly independent monomials of degree > 1.
17  for k = 2, ..., n
18      for i = 1, ..., l_1
19          for j = l_{k−2} + 1, ..., l_{k−1}
20              z = [φ_{T_i} φ_{T_j}]_X
21              for r = 1, ..., s
22                  let q = index of first nonzero element of row a_r
23                  z := z − (z_q / a_rq) a_r
                % If z is nonzero then φ_{T_i} φ_{T_j} is a linearly independent monomial
24              if z has nonzero elements
25                  s := s + 1
26                  a_s = z
27                  T_s = T_i ∪ T_j
28                  reorder a_1, ..., a_s into echelon row order
29      l_k = s

% Finally, a universal set
30  M = {φ_{T_i} | i = 1, ..., l_n}
A more stable algorithm is possible if the largest element of row a_r is used rather than the first nonzero one in lines 9 and 22, but this will require some reordering of the columns of A (this is called pivoting). The complexity of Algorithm 6 is stated in the following theorem.
Theorem 14 Algorithm 6 constructs a universal set for X in O(n|X|^3) time.

Proof Clearly, the computationally expensive piece of pseudo-code is the block that is most deeply nested. In Algorithm 6, this is the block of code in lines 21-28. This block of code is divided into two sections of interest: lines 21-23 containing the for loop, which we will refer to as the for-block, and the if-statement of lines 24-28, referred to as the if-block. The computational complexity of the if-block is given by the cost of line 26, which will be a constant factor of p, because the reordering of the rows in line 28 is a simple list operation that will cost a constant factor of s in the worst case. The complexity of the for-block, however, is greater than p because the operation in line 23, executed s times, involves the p elements of the vectors z and a_r. As a result, the computational complexity of Algorithm 6 is determined by the total cost of executing the for-block during the algorithm's operation. Assuming that the cost of arithmetic operations is constant, the cost that the for-block contributes to the overall algorithm is determined by summing over the nested loops within which this block lies. There are four loops to take into account in this calculation: the loops that occur on lines 17, 18, 19, and 21 of the algorithm. So then the contribution to the algorithm's total cost from the for-block is given by

    total cost(for-block) = Σ_{k=2}^{n} Σ_{i=1}^{l_1} Σ_{j=l_{k−2}+1}^{l_{k−1}} cost(for-block).    (A.1)

The cost of the for-block by itself is given by

    cost(for-block) = Σ_{r=1}^{s} cp,

where c is a constant and s is the current number of linearly independent monomials. We can determine the value of s in Equation (A.1) by noting that for each iteration of the outer loop on line 17 with loop variable k, the current number of linearly independent monomials will be less than or equal to l_k, so in Equation (A.1) s ≤ l_k. So, then, the total cost of the for-block in the algorithm is

    total cost(for-block) ≤ Σ_{k=2}^{n} Σ_{i=1}^{l_1} Σ_{j=l_{k−2}+1}^{l_{k−1}} Σ_{r=1}^{l_k} cp,

which can be simplified in the following manner:

    total cost(for-block) ≤ cp Σ_{k=2}^{n} Σ_{i=1}^{l_1} Σ_{j=l_{k−2}+1}^{l_{k−1}} l_k
                          ≤ cp l_1 Σ_{k=2}^{n} (l_{k−1} − l_{k−2}) l_k.

By noting that

    Σ_{k=2}^{n} (l_{k−1} − l_{k−2}) l_k ≤ l_n Σ_{k=2}^{n} (l_{k−1} − l_{k−2}),

we obtain

    total cost(for-block) ≤ cp l_1 l_n Σ_{k=2}^{n} (l_{k−1} − l_{k−2})
                          = cp l_1 l_n (l_{n−1} − l_0)
                          ≤ cp l_1 l_n l_{n−1}
                          ≤ cnp^3,

since the telescoping sum collapses to l_{n−1} − l_0, and l_1 ≤ n + 1 while l_n, l_{n−1} ≤ p. Finally then, as determined by the cost of the for-block, the complexity of the algorithm is O(np^3).
Table 1: Network order occurrences for 25 trials in the solutions to the two-or-more clumps problem as constructed by the CHON algorithm. In each trial, for particular training set sizes, the network order is recorded along with the mean number of connections between the inputs and hidden layer.

    Training      Network Order      Mean Number
    Set Size      1     2     3      of Connections
    50            23    2     -      29
    100           -     25    -      131
    200           -     25    -      287
    400           -     20    5      616
    600           -     25    -      627
    800           -     25    -      627
[Figure 1 about here: the perceptron scheme, showing the input, the hidden functions, the combining function Ω, and the output.]
[Figure 2 about here: the contraction of a multi-layer FFN down to a single composite hidden layer (label: output link).]
[Figure 3 about here: percentage of the test set correct (50 to 100) versus the number of training patterns (0 to 900) for the CHON, Upstart and Tiling algorithms.]
[Figure 4 about here: number of weights (0 to 350) versus the number of training patterns (0 to 900) for the CHON algorithm.]
Figure 1 This figure illustrates the important aspects of the perceptron scheme. These include the input elements x_i of the input pattern x, the "hidden units" of the network which compute the hidden functions φ_i(x) from the set Φ(x), and the combining function Ω which gives the output ψ(x).

Figure 2 This figure demonstrates how multiple layers of hidden units in a FFN can be contracted down into a single composite layer for the purposes of computing network order and hidden-unit fan-in. Then the order of a network with more than one layer of hidden units can be seen to be given by the largest number of inputs that connect to any one output link.

Figure 3 Performance of the CHON, Upstart and Tiling algorithms on the two-or-more clumps problem. Each point is computed from the mean of 25 trials on different training sets, and the error bars indicate one standard deviation from the mean. Error bars for the Upstart and Tiling algorithms are not displayed to reduce clutter, although the standard deviations range between 1.7 and 2.8 for the Upstart algorithm and between 2.1 and 2.9 for the Tiling algorithm. Data for the Upstart and Tiling algorithms was supplied by M. Frean.

Figure 4 Number of weights (which is equal to the number of monomial hidden units) in the networks constructed by the CHON algorithm for the two-or-more clumps problem. Each point is computed from the mean of 25 trials on different training sets, and the error bars indicate one standard deviation from the mean. The standard deviation for 600 and 800 training patterns is too small to plot, at 0.9.