A Knowledge-Based Approach to MLP Configuration

Melanie Hilario, Bruno Orsier, Ahmed Rida, Christian Pellegrini
CUI - University of Geneva, 24 rue General Dufour, 1211 Geneva 4
Voice: +41 22 705-7791   Fax: +41 22 705-7780
hilario | orsier | rida | [email protected]

ABSTRACT

We propose a knowledge-based approach to the task of determining the topology of multilayer perceptrons (MLPs). The idea consists in integrating well-founded and empirically proven configuration techniques into a knowledge-based system. A preliminary study showed that the use of these techniques depends primarily on the amount of prior domain knowledge available: configuration techniques can be situated along a spectrum going from knowledge-intensive techniques like symbolic structure compilation, through partial techniques using pieces of domain knowledge like hints, to knowledge-lean techniques based mainly on heuristic search. Our knowledge base for MLP configuration formalizes the conditions of applicability of these techniques as well as interdependencies and complementarities among them which might lead to novel hybrid configuration strategies.

1. Introduction

Neural network (NN) research has generated a plethora of architectures and algorithms, and it is becoming increasingly difficult to choose those which are best adapted to a given task. NN design remains a highly empirical process: efficient networks have typically been developed through an iterative cycle of handcrafting and fine-tuning. A number of general-purpose simulators offer a variety of models and methods but provide no guidance on which to choose. On the other hand, connectionist research has given rise to a corpus of collective expertise which might allow us to address the design task in a more knowledge-intensive fashion. Our proposal is to exploit this expertise and operationalize it in the form of a knowledge-based system for neural network design.

NN design can be decomposed into a series of related decisions concerning the network model (e.g., multilayer perceptron, Hopfield net) and topology (e.g., 12-6-1 for a feedforward network with 12 inputs, a hidden layer with 6 units, and one output), the learning algorithm (e.g., backpropagation) and its corresponding parameter values (e.g., learning rate, momentum), the experimentation strategy (e.g., stopping criteria), and so on. The automation of the NN design process is a long-term goal which can only be attained in a succession of stages, each of which is devoted to a single decision point. As a first cut at the problem, we focus on neural network configuration, defined as the choice of topology, and freeze all other choice points. For instance, we circumvent the choice of network model by limiting our investigation to multilayer perceptrons (MLPs). As for the learning algorithm, we use RPROP [19], a local adaptation technique which is insensitive to the size of the error gradient and requires no painstaking parameter tuning. Finally, to facilitate validation and ensure replicability, our experimentation strategy follows the benchmarking rules defined in PROBEN1 [18].

MLP configuration is perhaps the most crucial MLP design step. It can be decomposed into four subtasks: choose the number of hidden layers, choose the number of hidden units for each layer, determine the overall connection policy (local or global feedforward connections), and optionally add specific constraints on selected connections. For each of these subtasks, different methods have been proposed in the literature. The problem consists in gleaning well-founded and useful methods with a view to building a coherent system for MLP configuration. A study of the state of the art showed that the choice of a method depends

primarily on the amount of prior knowledge available in the application domain. Existing methods can thus be situated along an axis representing the amount of domain knowledge available, going from one endpoint where there is little or no domain knowledge to the other extreme where available domain knowledge constitutes an approximately correct theory. Between these two endpoints are a number of intermediate points representing the availability of specific and partial domain knowledge (see Fig. 1).
[Fig. 1: Mapping between prior domain knowledge and MLP configuration techniques. The axis of prior domain knowledge runs from an approximately correct theory (knowledge compilation techniques), through partial domain knowledge (handcrafting, weight sharing, virtual examples), to little or no domain knowledge (dynamic and static configuration methods); for each family, the figure indicates which configuration subtasks are covered: number of hidden layers, number of hidden units, general connectivity, specific connections.]

In the following sections, we investigate a number of MLP configuration techniques in the three main subregions of the (domain) knowledge spectrum. Sections 2 and 3 present knowledge-intensive and knowledge-lean techniques respectively, while Section 4 discusses intermediate techniques for using partial and often piecemeal knowledge to limit the space of candidate topologies. In Section 5, we introduce our strategy for integrating these different techniques into a coherent knowledge-based system. Section 6 describes related work.

2. Knowledge-intensive techniques

When prior knowledge is rich enough to form an approximately complete and correct domain theory, knowledge compilation can be used to build knowledge-based neural networks. These are generated by translating symbolic structures such as propositional rules [21], rules with certainty factors [9, 12, 15] or differential equations [4]. As shown in Fig. 1, knowledge compilation methods perform all four configuration subtasks, since the entire network topology is defined by a deterministic mapping algorithm from existing symbolic structures. Most of these compilation techniques produce feedforward nets; an exception is Omlin and Giles' method [17], which transforms finite state automata into second-order recurrent NNs.

We focus mainly on the use of rules as source representations for neural network construction. Perhaps the best known rule compilation method is the KBANN algorithm [21]. The source structure is a set of acyclic propositional rules expressed as Horn clauses. Before the rule set can be used, rules must be rewritten to eliminate disjuncts (two or more rules with the same consequent) having more than one conjunct. For instance, the rules A:-B,C; A:-D,E,F are rewritten as A:-A'; A':-B,C; A:-A"; A":-D,E,F. The rewritten ruleset is then translated into a neural network: final conclusions are mapped onto output units, intermediate conclusions onto hidden units, and supporting facts onto input units. Weights and biases are initialized according to a precise scheme, and new input and hidden units (and the corresponding links) are added to represent features not specified in the initial ruleset. The resulting network can then be trained using any standard neural learning method to obtain a refined domain theory.
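The rewrite-and-map step can be illustrated with a small sketch. The listing below is our own illustration of the procedure just described, not Towell's implementation; the rule representation, the helper names, and the treatment of unused domain features are assumptions.

    # Illustrative sketch (not Towell's code) of KBANN-style rule rewriting and
    # rule-to-topology mapping.  Rules are propositional Horn clauses encoded as
    # (consequent, [antecedents]) pairs.
    from collections import defaultdict

    def rewrite_disjuncts(rules):
        """A :- B,C ; A :- D,E,F  ==>  A :- A'1 ; A'1 :- B,C ; A :- A'2 ; A'2 :- D,E,F."""
        by_head = defaultdict(list)
        for head, body in rules:
            by_head[head].append(body)
        rewritten = []
        for head, bodies in by_head.items():
            if len(bodies) > 1 and any(len(b) > 1 for b in bodies):
                for i, body in enumerate(bodies, 1):
                    aux = f"{head}'{i}"
                    rewritten.append((head, [aux]))   # A :- A'i
                    rewritten.append((aux, body))     # A'i :- original antecedents
            else:
                rewritten.extend((head, b) for b in bodies)
        return rewritten

    def map_to_topology(rules, domain_features):
        """Final conclusions -> output units, intermediate conclusions -> hidden
        units, supporting facts plus unused domain features -> input units."""
        heads = {h for h, _ in rules}
        antecedents = {a for _, body in rules for a in body}
        outputs = heads - antecedents
        hidden = heads & antecedents
        inputs = (antecedents - heads) | (set(domain_features) - heads)
        return sorted(inputs), sorted(hidden), sorted(outputs)

    rules = [("A", ["B", "C"]), ("A", ["D", "E", "F"])]
    print(map_to_topology(rewrite_disjuncts(rules), ["B", "C", "D", "E", "F", "G"]))
    # -> a 6-2-1 skeleton: inputs B..G, hidden units A'1 and A'2, output A

The real algorithm additionally initializes weights and biases so that the untrained network reproduces the ruleset's behaviour; that step is omitted here.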

The main problem with the above method is that, since standard learning algorithms (usually backpropagation) are used, the neural network generated by the knowledge compilation procedure should provide units and links for all features known in the domain but not used in the initial ruleset. This results in extremely large networks and raises efficiency problems. As a remedy, neural learning algorithms have been developed which allow for addition and deletion of nodes during training. RAPTURE [15] deletes nodes using a revised backpropagation algorithm, adds links using the ID3 information gain metric, and adds nodes using the UPSTART [7] algorithm.

3. Knowledge-lean techniques

At the other extreme, where domain knowledge is scarce or unusable, MLP configuration techniques rely mainly on guided search. These knowledge-lean methods can be subdivided into dynamic and static methods. In dynamic methods, the network topology is modified during the training process, whereas in static methods, the topology is chosen before training, based on knowledge extracted via a preliminary analysis of the training set.

An example of the more recent dynamic configuration methods is Wang et al.'s procedure [24] for determining the optimal topology of a static 4-layer MLP (in this paper, an n-layer MLP has one input layer, n - 2 hidden layers, and one output layer) which approximates continuous non-linear functions. A canonical decomposition technique establishes a link between the number of neurons in the two hidden layers and the dimensions of the subspaces of the canonical decomposition. The procedure starts with a small number of hidden units, trains the network, and then computes two square matrix determinants that help decide whether to stop, add one unit to one hidden layer, or add one unit to both hidden layers. Unfortunately, the precise conditions under which the method performs well are not clearly stated in [24] (see Section 5).

As for static configuration, the approach of Vysniauskas et al. [22] estimates the optimal number of learning samples and the number of hidden units needed for a 3-layer MLP to approximate a function to a desired accuracy. The approximation error ε of a network is investigated as a function F of the number of hidden units h and the number of learning samples N. Both the representation error (bias) and the generalization error (variance) are taken into account. Two models of F, together with an experimental procedure for determining their parameters, have been developed. Once its parameters are known, a model of the approximation error can be used to determine ε, h, or N as a function of the other two (the corresponding confidence intervals are provided). Since training sets are not indefinitely extendable and the user has an idea of what he considers an acceptable error, h will typically be solved for on the basis of a given N and a preselected ε. The main limitation of this method is the number of learning processes (usually several hundred) required to determine the parameters of the error model. The number of available patterns must also be reasonably high.
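The actual error models and confidence-interval procedure are those of [22]; purely as an illustration of the static idea, the sketch below assumes a toy model ε(h, N) = a/h + b·h/N (a representation term shrinking with h plus a generalization term growing with h and shrinking with N), fits it to a handful of preliminary runs, and solves for the smallest h meeting a target error at the available sample size. The model form, the data, and the helper names are our own assumptions.

    # Toy sketch of static configuration in the spirit of [22]; not the
    # published error models.  Assumed model: eps(h, N) = a/h + b*h/N.
    import numpy as np

    def fit_error_model(runs):
        """runs: list of (h, N, measured error) triples from preliminary training runs."""
        H = np.array([[1.0 / h, h / N] for h, N, _ in runs])
        y = np.array([eps for _, _, eps in runs])
        (a, b), *_ = np.linalg.lstsq(H, y, rcond=None)
        return a, b

    def smallest_h(a, b, N, target_eps, h_max=200):
        """Smallest hidden-layer size predicted to reach target_eps with N samples."""
        for h in range(1, h_max + 1):
            if a / h + b * h / N <= target_eps:
                return h
        return None   # target not reachable within h_max under this model

    # Hypothetical measurements (h, N, observed test error):
    runs = [(2, 500, 0.30), (5, 500, 0.16), (10, 500, 0.11), (20, 500, 0.10)]
    a, b = fit_error_model(runs)
    print(smallest_h(a, b, N=500, target_eps=0.12))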

4. Between the extremes: using partial knowledge

Between knowledge-intensive techniques which require an almost perfect domain theory and knowledge-lean techniques which suppose little or no domain knowledge, a number of configuration techniques make use of partial knowledge about the domain, the task, or the function to be learned. These techniques cannot generally determine the number of hidden layers and units; they can thus only be used in conjunction with knowledge-lean techniques that address these other decision points.

One type of partial domain knowledge involves the structure of the application task. It can be used to partition a task into more or less independent subtasks, each of which is to be performed by a distinct network. This technique, which involves the creation of modular networks [2, 20], is a research area in itself and remains outside the scope of our work.

A second type of partial prior knowledge concerns global constraints on the function to be learned, or partial information about its implementation. In pattern recognition tasks, for instance, it might be known a priori that the target function is invariant under certain transformations such as shift, rotation or scaling. Other function characteristics which might prove useful in other domains are symmetry, monotonicity, or parity. These specific forms of functional knowledge can be integrated into neural networks in several ways. They can be integrated directly into the network structure: for instance, function monotonicity or convexity can be handcrafted directly as hard constraints on connection weights, whereas invariances can be introduced using more systematic methods such as weight sharing [13, 3]. Alternatively, they can be incorporated indirectly into the final network structure via the use of virtual examples coupled with a modification of the error function; in this case, such partial knowledge is referred to as hints [1, 20].
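As a concrete illustration of the virtual-example mechanism, the sketch below augments a squared-error loss with a penalty tying each pattern to a transformed virtual copy. The shift transform, the network interface (a callable on batches of patterns), and the penalty weight are illustrative assumptions, not the formulations of [1] or [20].

    # Sketch of the virtual-example route to an invariance hint: the network is
    # penalized whenever its output changes under the (assumed) invariance transform.
    import numpy as np

    def shift(x, k=1):
        """Hypothetical invariance transform: circular shift of an input vector."""
        return np.roll(x, k)

    def hint_augmented_loss(net, X, y, hint_weight=0.1):
        """Squared error on the data plus a penalty forcing equal outputs on each
        pattern and its shifted virtual copy (the hint: f(x) == f(shift(x)))."""
        pred = net(X)
        data_term = np.mean((pred - y) ** 2)
        virtual = np.array([shift(x) for x in X])
        hint_term = np.mean((net(virtual) - pred) ** 2)
        return data_term + hint_weight * hint_term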

5. A KBS for integrating configuration techniques

Whatever the degree of prior knowledge involved in the different configuration techniques, their integration into a knowledge-based system entails two clear advantages. First, it introduces order into the extreme variety of techniques by expressing in explicit declarative form each one's conditions of applicability. Second, it reveals interdependencies and complementarities between partial configuration techniques that might serve as a basis for novel combinations. This section gives a brief account of how this integration is effected.

5.1. Application conditions of techniques

The primary condition for the application of a given technique is the availability of the required domain knowledge. Our overall strategy for incorporating the different configuration methods into a unified system can be stated as follows: use knowledge-intensive techniques where possible; otherwise, use knowledge-lean techniques and try to refine the topologies thus obtained by using specialized techniques based on partial domain knowledge. In the case of knowledge-intensive techniques, another condition is the initial representation of prior domain knowledge. A necessary condition for the use of a given knowledge compilation method is the availability of an initial domain theory in the source symbolic representation required by the method. For example, when a domain theory in the form of propositional rules is available, the most likely candidates are rule compilation methods. If the rules contain no certainty factors, then KBANN is used; otherwise the choice is between RAPTURE and Lacher et al.'s method [12].

A second application condition is the nature of the application task. Most applications can be categorized as either classification or (continuous-valued) function approximation tasks. When a configuration technique has been developed within the context of a given task, our knowledge base specifies this specific task unless further investigations have shown that the technique extends to other tasks. For instance, rule compilation techniques are useful only for classification tasks. For static configuration, Vysniauskas et al.'s method is aimed explicitly at function approximation. Among dynamic configuration techniques, Cascade-Correlation performs well on classification tasks but quite poorly on function approximation, which is the target task of Wang et al.'s method.

Technique-specific prerequisites constitute a third group of application conditions. These are constraints explicitly stipulated by authors on the use of their methods. For instance, the initial ruleset used in KBANN should meet the following conditions: rules should be acyclic (no rule or combination of rules should contain a proposition as both an antecedent and a consequent) and should be expressed in the form of Horn clauses. Similarly, the applicability of certain techniques is compromised by technique-specific limitations which have been uncovered either by careful reading of the authors' accounts or through our own experimentation with these techniques. For instance, our experiments on Wang's method have shown that no redundant input patterns are allowed (otherwise the determinants will be null and the algorithm will stop systematically after initial network training). For the same reason, it is dangerous to use odd activation functions like the hyperbolic tangent: two opposite input vectors will produce opposite hidden unit activations and thus lead to null determinants. Such limitations have to be integrated into the knowledge base to ensure that configuration techniques are not utilized beyond their strict areas of competence.
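To give a flavour of how such conditions can be stated declaratively, here is a toy sketch; the actual knowledge base is written in LOOM and covers many more conditions (prerequisites and known limitations) than the few fields shown. The field names and values below are our own shorthand.

    # Toy sketch of declarative applicability conditions; not the LOOM knowledge base.
    TECHNIQUES = [
        {"name": "KBANN",       "needs_theory": "propositional rules",
         "certainty_factors": False, "task": "classification"},
        {"name": "RAPTURE",     "needs_theory": "propositional rules",
         "certainty_factors": True,  "task": "classification"},
        {"name": "Vysniauskas", "needs_theory": None, "task": "approximation"},
        {"name": "Wang et al.", "needs_theory": None, "task": "approximation"},
        {"name": "CascadeCorr", "needs_theory": None, "task": "classification"},
    ]

    def applicable(techniques, theory_form, has_cf, task):
        """Return techniques whose declared conditions match the application."""
        return [t["name"] for t in techniques
                if t["task"] == task
                and (t["needs_theory"] is None or
                     (t["needs_theory"] == theory_form
                      and t.get("certainty_factors", False) == has_cf))]

    print(applicable(TECHNIQUES, "propositional rules", has_cf=False,
                     task="classification"))
    # -> ['KBANN', 'CascadeCorr']: rule compilation preferred, search as fallback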

5.2. Interdependencies and complementarities between techniques

A well-known disadvantage of knowledge-lean techniques is the high computational cost of the search-intensive procedures used. However, there is no way of avoiding the use of these techniques when there is little or no domain knowledge available. In this case, one way of reducing costs is by exploiting interdependencies and complementarities between different techniques. For instance, the problem with constructive configuration methods is the unbounded number of hidden units (h): some methods like Cascade-Correlation [6] start with one unit and can grow to a point that leads to inefficiency (at best) or overfitting (at worst). Possible remedies to this problem can be imported from other techniques. For instance, Vysniauskas' method can be used to obtain an interval of values of h which can serve as a starting point for any dynamic method. Also, pruning methods such as Optimal Brain Surgeon [10] can be applied to a network generated by constructive algorithms in order to derive a smaller, more efficient network. Another example of interdependencies between partial configuration techniques concerns the number of hidden layers and hidden units. All the static configuration techniques discussed above work for only one hidden layer, so for the moment dynamic methods should be envisaged when there is reason to believe that more hidden layers are needed.
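One way such a combination could look in practice: use a static estimate to bracket h, then let a constructive loop grow the network only within that bracket. This is a sketch of the combination idea only, not of Cascade-Correlation or of [22]; train_and_score stands in for whatever simulator calls are available, and the early-stopping rule is an assumption.

    # Sketch: a constructive search over hidden units bounded by a static estimate.
    def bounded_constructive_search(h_interval, train_and_score, patience=3):
        """train_and_score(h) is assumed to build, train and evaluate an MLP with
        h hidden units and return a validation error."""
        h_min, h_max = h_interval
        best_h, best_err, stale = h_min, float("inf"), 0
        for h in range(h_min, h_max + 1):       # bounded by the static estimate
            err = train_and_score(h)
            if err < best_err:
                best_h, best_err, stale = h, err, 0
            else:
                stale += 1                      # stop early when growth stops paying off
                if stale >= patience:
                    break
        return best_h, best_err

A pruning pass (e.g., Optimal Brain Surgeon) could then be applied to the selected network, as noted above.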

5.3. Implementation

A knowledge-based system for MLP configuration needs to be coupled with other tools for reasoning about, creating, training and evaluating feedforward networks. The early implementation stages have therefore been devoted to the design of such an environment, called SCANDAL (Symbolic-Connectionist Architecture for Neural Network Design and Learning). It follows a metaprocessing integration scheme [11] in which symbolic modules guide the base-level module, SNNS, a neural network simulator developed at the University of Stuttgart [5]. The simulator contains a variety of neural network architectures and algorithms but, more importantly, it facilitates modular integration of new models and methods. To allow for explicit reasoning about prior knowledge, whether domain- or meta-level, the simulator has been embedded in a multiagent environment where it interacts closely with symbolic agents endowed with modelling, pattern-matching, and inferencing capabilities. These agents are implemented in LOOM, a Lisp-based system which combines a frame-like description language with inference and production rules [14]. The agents perform the MLP configuration task by reasoning on explicit models of neural networks and application domains; their decisions are transmitted to the SNNS agent as service requests (e.g., create a network with a specified topology, train the current network for x epochs using 10-fold cross-validation, etc.).
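As an illustration of the kind of service requests mentioned above, here is a minimal sketch; the actual SCANDAL/SNNS message protocol is not specified in this paper, so the format, field names, and parameter values below are hypothetical.

    # Hypothetical message format for service requests to the SNNS agent.
    import json

    def service_request(action, **params):
        return json.dumps({"to": "snns-agent", "action": action, "params": params})

    # e.g., create a 12-6-1 feedforward net, then train it with RPROP:
    messages = [
        service_request("create_network", topology=[12, 6, 1],
                        connection_policy="feedforward"),
        service_request("train", algorithm="RPROP", epochs=500,
                        validation="10-fold cross-validation"),
    ]
    for m in messages:
        print(m)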

6. Related work and summary

Though intensive research has been devoted to the problem of choosing MLP topology, we are aware of only two antecedents which share our long-term objective of automating the entire MLP design process. Wah and Kriplani [23] propose a heuristic design method for selecting and training promising n-layer MLPs (n ≥ 2) under given resource (e.g., time and memory) constraints. First, limited experiments are performed to yield a set of promising configurations; these are then trained either to completion or until some time ceiling is reached. The resulting nets form a search tree and are evaluated using a criterion that is itself refined by means of another search process in the space of system-generated heuristic functions. NEUREX [16] attempts to model the methodology of a human MLP designer while trying to automate the two main phases of MLP design: choice of design parameters (e.g., number of hidden units, learning rate, momentum, etc.) and supervision of network training. A preliminary experimental study was conducted to extract empirical relations among these parameters, which are then used to guide the MLP design process. However, it remains to be seen whether these relations, determined on the basis of four toy problems, can generalize to more complex real-world tasks.

These two methods remain highly search-intensive. By contrast, our approach reduces search by bringing prior knowledge to bear on the MLP configuration task: domain-specific knowledge is exploited whenever possible; otherwise, the system falls back on its metalevel knowledge (i.e., general theoretical and empirical knowledge about MLPs) to reduce the cost of applying knowledge-lean techniques. Finally, it attempts to optimize the resulting network configurations by using partial domain knowledge to constrain specific connection weights. The knowledge base for MLP configuration is the metalevel component of SCANDAL, a system in which symbolic and connectionist agents cooperate to improve (and, hopefully, to automate) the MLP design process.

References

[1] Y.S. Abu-Mostafa. Hints. Neural Computation, 7:639–671, 1995.

[2] D.H. Ballard. Modular learning in neural networks. In Proceedings of the National Conference on Artificial Intelligence (AAAI-87), July 1987.
[3] E. Barnard and D. Casasent. Invariance and neural nets. IEEE Transactions on Neural Networks, 2:498–508, 1991.
[4] R. Cozzio. The Design of Neural Networks Using A Priori Knowledge. PhD thesis, ETHZ, Zurich, 1995.
[5] A. Zell et al. SNNS, Stuttgart Neural Network Simulator: User Manual, Version 4.0. University of Stuttgart, 1995.
[6] S.E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. Technical Report CMU-CS-90-100, Carnegie Mellon University, 1990.
[7] M. Frean. The upstart algorithm: a method for constructing and training feedforward neural networks. Neural Computation, 2:198–209, 1990.
[8] L. Fu, editor. International Symposium on Integrating Knowledge and Neural Heuristics, Pensacola, FL, May 1994.
[9] L.M. Fu. Integration of neural heuristics into knowledge-based inference. Connection Science, 1:325–340, 1989.
[10] B. Hassibi and D. Stork. Second-order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing, 5. Morgan-Kaufmann, 1993.
[11] M. Hilario. An overview of strategies for neurosymbolic integration. In R. Sun and F. Alexandre, editors, IJCAI-95 Workshop on Connectionist-Symbolic Integration: From Unified to Hybrid Approaches, pages 1–6, Montreal, August 1995.
[12] R.C. Lacher, S.I. Hruska, and D.C. Kuncicky. Backpropagation learning in expert networks. IEEE Transactions on Neural Networks, 3:63–72, 1992.
[13] Y. LeCun, B. Boser, J.S. Denker, et al. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing, 2, pages 396–404. Morgan-Kaufmann, 1990.
[14] R. MacGregor. Using a description classifier to enhance deductive inference. In Proc. 7th IEEE Conference on AI Applications, pages 141–147, 1991.
[15] J.J. Mahoney and R.J. Mooney. Modifying network architectures for certainty-factor rule-base revision. In Fu [8], pages 75–84.
[16] F. Michaud, R. Gonzalez-Rubio, D. Dalle, and S. Ward. Simulateur de réseaux de neurones artificiels intégrant une supervision de leur entraînement. Revue canadienne de génie électrique et informatique, 20(1), January 1995.
[17] C.W. Omlin and C.L. Giles. Integrating temporal symbolic knowledge and recurrent neural networks. In Fu [8], pages 25–31.
[18] L. Prechelt. PROBEN1: A set of neural network benchmark problems and benchmarking rules. Technical report, University of Karlsruhe, September 1994.
[19] M. Riedmiller and H. Braun. A direct adaptive method for faster backpropagation learning: the RPROP algorithm. In IEEE International Conference on Neural Networks, San Francisco, CA, 1993.
[20] S.C. Suddarth and A.D.C. Holden. Symbolic-neural systems and the use of hints for developing complex systems. International Journal of Man-Machine Studies, 35:291–311, 1991.
[21] G.G. Towell. Symbolic knowledge and neural networks: insertion, refinement and extraction. PhD thesis, University of Wisconsin-Madison, Computer Science Dept., 1992.
[22] V. Vysniauskas, F.C.A. Groen, and B.J.A. Krose. The optimal number of learning samples and hidden units in function approximation with a feedforward network. Technical Report CS-93-15, CSD, University of Amsterdam, 1993.
[23] B.W. Wah and H. Kriplani. Resource constrained design of artificial neural networks. In IJCNN'90, volume III, pages 269–280, San Diego, CA, 1990.
[24] Z. Wang, C. Di Massimo, M.T. Tham, and A.J. Morris. A procedure for determining the topology of multilayer feedforward neural networks. Neural Networks, 7(2):291–300, 1994.