Training and Retraining of Neural Network Trees

Qiangfu Zhao

The University of Aizu, Aizu-Wakamatsu, Japan 965-8580
[email protected]

Abstract

In machine learning, symbolic approaches usually yield comprehensible results, but without free parameters for further (incremental) retraining. On the other hand, non-symbolic (connectionist, or neural network based) approaches usually yield black boxes that are difficult to understand and reuse. The goal of this study is to propose a machine learner that is both incrementally retrainable and comprehensible, through the integration of decision trees and neural networks. In this paper, we introduce a kind of neural network tree (NNTree), propose algorithms for its training and retraining, and verify the efficiency of the algorithms through experiments with a digit recognition problem.

1 Introduction

Up to now, many algorithms have been proposed for machine learning. These algorithms can be roughly divided into two categories: symbolic and non-symbolic. Typical symbolic approaches include decision trees (DTs), decision rules (DRs), finite-state automata (FSA), and so on. These approaches usually yield comprehensible results, but without free parameters for on-line or incremental retraining. On the other hand, non-symbolic approaches, say those based on neural networks (NNs), can be used even if the training environment changes, because they are retrainable. However, non-symbolic learners are usually black boxes: we do not know what has been learned even if we have the correct answers. Another key problem in using NNs is that the number of free parameters is usually too large to be determined efficiently. To have the advantages of both symbolic and non-symbolic approaches, it is important to integrate them. For this purpose, many methods have already been proposed in the literature (see [1], [2] and the references therein).

Figure 1: A neural network tree (NNTree)

Considering the integration of DTs and NNs alone, for example, we can design a DT first and then derive an NN from the DT [3]-[5]. This method is good for the fast design of NNs. Conversely, we can design an NN first and then extract a DT from it [6]-[8]. This approach is helpful for opening the NN black box. We can also combine the two to refine domain knowledge. Using such transformational approaches, however, we cannot really obtain comprehensible domain knowledge. In fact, if the features (attributes) are continuous numbers, binary DTs are also very difficult to understand. In addition, since DTs and NNs are good at solving different problems [9], a direct transformation from NN to DT (or vice versa) may not give good results (for example, the generalization ability may become worse). In this study, we adopt the model shown in Fig. 1 [10]. We call it a neural network tree (NNTree). This model is not new, because it (or a similar one) was also adopted in [12] and [13]. An NNTree is a DT in which each non-terminal (internal) node is an expert neural network (ENN). We can also consider an NNTree as a modular neural network (MNN) [11]. Each ENN is used to extract some complex feature(s) and to make a local decision based on those features. For example, an ENN can be used to recognize a sub-pattern (or sub-concept), and the overall decision is made by the whole DT. The key point is that, for on-line applications, the free parameters contained in the ENNs can be updated to adapt to a changing environment.

To some extent, the global decision (the overall structure of the tree) is transparent to changes in the sub-patterns. Of course, NNTrees are not suitable for cases in which the sub-patterns change dramatically. Using NNTrees, it is possible to understand the domain knowledge roughly, with the detailed information hidden in the ENNs; the detailed information can be extracted when necessary. One of the key problems in designing an NNTree is how to determine the feature(s) to be extracted (or the sub-pattern to be recognized) by each ENN. In some applications (say, image recognition), the features might be pre-defined by the designer. In this paper, we try to do this automatically using evolutionary algorithms (EAs). Interpretation of the physical meaning of the sub-pattern recognized by each ENN is left for future study. In this paper, we study how to train and retrain NNTrees. Training means designing an NNTree using the currently available data; retraining means refining the NNTree using new data. For training, we can use the same recursive process as that used in conventional algorithms [14]-[15]. The only thing to do is to embed some EA into this process to design each ENN. Note that to design the ENNs, the only efficient way seems to be EAs, because we do not know in advance which example should be assigned to which group (i.e., we do not have a teacher signal). The only thing we can do is to choose one ENN, among infinitely many, that optimizes some criterion (say, the information gain). Retraining of an NNTree is a kind of local training process. The basic idea for retraining is to update the weights of the ENNs along the classification path (similar to the search path used in search trees). A detailed discussion will be given later; here we just make some comments on the usefulness of retraining. First of all, retraining can be used for reducing the tree size. If we design an NNTree directly from all available data, the tree can be very large, and thus the comprehensibility will be low. Instead, we can design an NNTree using part of the data, and then retrain the tree using all data to cover the domain knowledge better. This is the case for the experiments given in this paper. Retraining can also be used to update the NNTree using newly observed data. For example, an NNTree for robot control can be retrained to adapt to a changing environment. We do not have to design an NNTree for each environment: a general-purpose (to some extent) NNTree can be used by different robots living in different (but similar) environments. For the time being, we will not consider the case in which the environment changes dramatically.

2 A Brief Review of Related Algorithms

2.1 Design of Decision Trees

To construct a DT, it is usually assumed that a training set consisting of feature vectors and their corresponding class labels is available. The DT is then constructed by partitioning the feature space in such a way that the tree is generated recursively. This procedure involves three steps: splitting nodes, determining which nodes are terminal nodes, and assigning class labels to terminal nodes. Among them, the most important and most time-consuming step is splitting the nodes. One of the most popular algorithms for designing DTs is C4.5 [14]. In C4.5, the information gain ratio is used as the criterion for splitting nodes. The basic idea is to partition the current training set in such a way that the average information required to classify a given example is reduced the most. Let S stand for the current training set (with |S| training examples), and let n_i be the number of cases belonging to the i-th class (i = 1, 2, \ldots, N). The average information (entropy) needed to identify the class of a given example is

info(S) = -\sum_{i=1}^{N} \frac{n_i}{|S|} \log_2 \frac{n_i}{|S|}    (1)

Now suppose that S is partitioned into S_1, S_2, \ldots, S_n by some test X. The information gain is given by

gain(X) = info(S) - info_X(S)    (2)

where

info_X(S) = \sum_{i=1}^{n} \frac{|S_i|}{|S|} \, info(S_i)    (3)

The information gain ratio is defined as follows:

gain ratio(X) = gain(X) / split info(X)    (4)

where

split info(X) = -\sum_{i=1}^{n} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}    (5)

For detailed discussion, refer to [14].
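As a concrete illustration, the following Python sketch computes Eqs. (1)-(5) for a candidate partition of a node's training set. The function names and the toy data are mine, introduced only for illustration; they are not part of the paper or of any particular library.

```python
# Minimal sketch of the C4.5 splitting criterion, Eqs. (1)-(5).
# `labels` holds the class labels of the current training set S;
# `partition` is a list of label lists, one per subset S_1..S_n produced by a test X.
import math
from collections import Counter

def info(labels):
    """Entropy of a label set, Eq. (1)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def gain_ratio(labels, partition):
    """Information gain ratio of a partition of `labels`, Eqs. (2)-(5)."""
    total = len(labels)
    info_x = sum(len(s) / total * info(s) for s in partition)        # Eq. (3)
    gain = info(labels) - info_x                                     # Eq. (2)
    split_info = -sum((len(s) / total) * math.log2(len(s) / total)
                      for s in partition if s)                       # Eq. (5)
    return gain / split_info if split_info > 0 else 0.0              # Eq. (4)

# Toy example: a 10-example set split into two subsets by some test X.
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
partition = [[0, 0, 0, 1], [1, 1, 2, 2, 2, 2]]
print(gain_ratio(labels, partition))
```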

2.2 GA Based Neural Network Design

Figure 2: Clustering of examples by neural networks

In this paper, we consider only multilayer feedforward neural networks (MLPs) with a single hidden layer. The EA used here is a simple genetic algorithm (SGA) with three operators: truncation selection, one-point crossover, and bit-by-bit mutation. We adopt this SGA simply because it is easy to use. The genotype of an MLP is the concatenation of all weight vectors (including the threshold values), represented by binary numbers. The definition of the fitness is domain dependent; in this paper, the fitness is defined as the information gain ratio. To improve the generalization ability, a secondary fitness function is also used (see later).
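A minimal sketch of such an SGA is given below. The operator rates and the 16-bit weight encoding follow the parameters listed in Section 3.3; the decoding range W_MAX and the elitist fill-up of the next generation are my assumptions, and the fitness function (e.g., the gain ratio of the partition induced by the decoded MLP) is supplied by the caller.

```python
# Sketch of the SGA: binary genotype (16 bits per MLP weight), truncation
# selection, one-point crossover, and bit-by-bit mutation.
import random

BITS_PER_WEIGHT = 16
W_MAX = 10.0  # assumed weight range; not specified in the paper

def decode(genome, n_weights):
    """Map each 16-bit slice of the genome to a real weight in [-W_MAX, W_MAX]."""
    weights = []
    for k in range(n_weights):
        bits = genome[k * BITS_PER_WEIGHT:(k + 1) * BITS_PER_WEIGHT]
        value = int("".join(map(str, bits)), 2) / (2 ** BITS_PER_WEIGHT - 1)
        weights.append(-W_MAX + 2 * W_MAX * value)
    return weights

def sga(fitness, n_weights, pop_size=200, generations=1000,
        select_rate=0.2, p_cross=0.7, p_mut=0.01):
    length = n_weights * BITS_PER_WEIGHT
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda g: fitness(decode(g, n_weights)), reverse=True)
        parents = pop[:int(select_rate * pop_size)]        # truncation selection
        children = parents[:]                              # keep the selected parents
        while len(children) < pop_size:
            a, b = random.sample(parents, 2)
            if random.random() < p_cross:                  # one-point crossover
                cut = random.randrange(1, length)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            children.append([1 - bit if random.random() < p_mut else bit
                             for bit in a])                # bit-by-bit mutation
        pop = children[:pop_size]
    best = max(pop, key=lambda g: fitness(decode(g, n_weights)))
    return decode(best, n_weights)
```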

3 Training of NNTrees Based on GA

3.1 General Considerations

To design a DT recursively, the current training set is each time partitioned into several subsets by testing the value of one of the features. If the feature has n values, there will be n subsets. If the feature is continuous, we often divide the current training set into two subsets according to some threshold. The point is to select the feature that optimizes, say, the information gain ratio. In an NNTree, all internal nodes are ENNs. To simplify the problem, we assume that all ENNs are single-hidden-layer MLPs of the same size and that the number of output neurons is n = 2 (all results given here can be extended easily to n > 2). A given training example is assigned to the i-th subset if the i-th output neuron has the largest value when this example is used as input. Again, the point is to find an ENN that maximizes the information gain ratio. To find such an ENN for each node, we can adopt the SGA, with the fitness defined directly as the information gain ratio.
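The splitting rule and the fitness evaluation can be sketched as follows. The sigmoid hidden layer and the reuse of gain_ratio() from the sketch in Section 2.1 are my assumptions; the paper fixes neither the activation functions nor a concrete API.

```python
# Sketch of how a candidate ENN splits the current training set: an example
# goes to the i-th subset when the i-th output neuron is largest, and the ENN
# is scored by the information gain ratio of the resulting partition.
import numpy as np

def mlp_outputs(x, w1, b1, w2, b2):
    """Single-hidden-layer MLP: sigmoid hidden units, linear output units (assumed)."""
    h = 1.0 / (1.0 + np.exp(-(w1 @ x + b1)))
    return w2 @ h + b2

def split_by_enn(examples, labels, w1, b1, w2, b2):
    """Group the labels by the index of the largest output neuron."""
    parts = [[] for _ in range(len(b2))]
    for x, y in zip(examples, labels):
        parts[int(np.argmax(mlp_outputs(np.asarray(x), w1, b1, w2, b2)))].append(y)
    return parts

def enn_fitness(examples, labels, w1, b1, w2, b2):
    """Primary fitness: the information gain ratio (Eq. 4) of the induced split."""
    return gain_ratio(labels, split_by_enn(examples, labels, w1, b1, w2, b2))
```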

3.2 To Improve the Generalization Ability

There are two ways to improve the performance of the NNTree. The first is to provide new examples and modify the NNTree incrementally using the learning ability of each ENN. The second is to improve the performance of the NNTree using the existing examples. Here we consider the latter only; the former will be studied in the next section.

The basic idea for improving the NNTree using existing examples is to map the training examples into different clusters, minimizing the distance between examples of the same cluster and maximizing the distance between examples of different clusters (this is very similar to Fisher's criterion for linear discriminants). In this sense, looking at Fig. 2, we can say that ENN2 is better than ENN1, although their information gain ratios are the same. Using the above idea, we propose a method to improve the generalization ability of NNTrees. First, we define the approximation error of an ENN as follows. Suppose that the current training set S is partitioned into S_1, S_2, \ldots, S_n. If an example x is assigned to S_i, the desired output (d_1, d_2, \ldots, d_n) of the ENN for this example is defined as

d_j = \begin{cases} 1, & j = i \\ 0, & \text{otherwise} \end{cases}    (6)

Using the training examples and the desired outputs, we can train the ENN to minimize the following approximation error:

E = \frac{1}{2} \sum_{k=1}^{|S|} \sum_{i=1}^{n} (y_i^k - d_i^k)^2    (7)

where y_i^k and d_i^k are, respectively, the actual and the desired output of the i-th output neuron for the k-th training example. For this purpose, we could embed the so-called back-propagation (BP) algorithm into C4.5 along with the SGA. In our study, however, we do not use BP but adopt a much simpler method. The idea is to use E as a secondary fitness function. By doing so, we can find ENNs with a high information gain ratio as well as a low approximation error using the SGA alone. This can be implemented easily by modifying the program as follows. First, for every ENN, compute the information gain ratio and the approximation error. Then, use them in the sorting process as follows:

- If the information gain ratio of ENN1 is larger than that of ENN2, we say ENN1 is better.
- If the information gain ratio of ENN1 equals that of ENN2, but the approximation error of ENN1 is smaller than that of ENN2, then ENN1 is better.

All other parts of the program will be the same.
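A minimal sketch of this modified sorting rule is shown below: individuals are compared first by gain ratio (larger is better) and, on ties, by the approximation error E of Eq. (7) (smaller is better). The helper names and the way individuals are passed around are illustrative, not part of the paper.

```python
# Sketch of the secondary-fitness sorting used inside the SGA.
import numpy as np

def approximation_error(outputs, assignments, n_out):
    """E of Eq. (7): squared error against the one-hot targets of Eq. (6)."""
    error = 0.0
    for y, i in zip(outputs, assignments):   # i = index of the subset S_i of this example
        d = np.zeros(n_out)
        d[i] = 1.0                           # Eq. (6)
        error += float(np.sum((np.asarray(y) - d) ** 2))
    return 0.5 * error

def sort_population(population, gain_ratio_of, error_of):
    """Higher gain ratio first; ties broken by lower approximation error."""
    return sorted(population,
                  key=lambda ind: (-gain_ratio_of(ind), error_of(ind)))
```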

Table 1: Results of training (by integrating C4.5 and SGA directly)

Run       ε1   ε2       #Non-terminal   #Terminal
1         0    175      39              40
2         0    177      39              40
3         0    178      43              44
4         0    150      47              48
5         0    133      36              37
6         0    182      34              35
7         0    149      42              43
8         0    163      44              45
9         0    170      51              52
10        0    157      43              44
Average   0    163.40   41.8            42.8

3.3 Experimental Results

To test the effectiveness of the algorithms proposed above, we conducted experiments with a digit recognition problem. The data used here is the "optdigits" data set, taken from the machine learning repository of the University of California at Irvine. The number of training examples is 3823 and the number of test examples is 1797. The number of features is 64, with each feature being an integer in [0, 16]. The number of classes is 10 (the 10 digits). Detailed information can be found in the file optdigits.names, which is included in the data set. The main experimental parameters are as follows: 1) number of output neurons of each ENN = 2, 2) number of hidden neurons = 4, 3) number of bits per weight = 16, 4) number of generations = 1000, 5) population size = 200, 6) selection rate = 0.2, 7) crossover rate = 0.7, and 8) mutation rate = 0.01. Tables 1 and 2 give the experimental results, where ε1 is the number of classification errors for the training data, ε2 is that for the test data, #Non-terminal is the number of non-terminal (internal) nodes of the NNTree (i.e., the number of ENNs), and #Terminal is the number of terminal nodes. Ten runs were conducted with each algorithm. Comparing the two tables, we can see that introducing the approximation error as the secondary fitness measure reduces the number of misclassifications on the test set from about 163 to about 143. In addition, comparing these results with those obtained by C4.5 (see Table 3), we can see that the number of nodes is greatly reduced and, at the same time, the number of classification errors is also much smaller. Of course, if we count the number of neurons contained in the whole tree, the complexity of an NNTree might be higher than that of a conventional DT. The more important points are that an NNTree is retrainable and (possibly) more comprehensible.

Table 2: Results of training (with the approximation error as the secondary fitness measure)

Run       ε1   ε2       #Non-terminal   #Terminal
1         0    149      42              43
2         0    133      38              39
3         0    134      35              36
4         0    139      37              38
5         0    143      42              43
6         0    165      43              44
7         0    128      40              41
8         0    144      39              40
9         0    145      40              41
10        0    147      40              41
Average   0    142.70   39.60           40.60

Table 3: Results of C4.5

                 Error-Train   Error-Test   Tree-Size
Before pruning   76            257          335
After pruning    81            256          319

4 Retraining of NNTrees

4.1 The Retraining Algorithm

The decision made by an NNTree is local in the sense that it depends only on the training examples assigned by the tree to each leaf (terminal node). Therefore, retraining can also be performed locally. The basic idea for retraining is to update the weights of the ENNs along the classification path (CP). For example, if the CP of an example is root → node1 → node4 → node8 (= leaf), we can modify the ENNs of root, node1 and node4 (in our study, a leaf does not have an ENN). This simple idea, however, cannot be used directly when an example is misclassified. In this case, we should modify the CP (and the nodes on the modified CP) so that the example might be classified correctly after retraining. To simplify the discussion, we introduce the following notation:

- x is a given example for retraining (note that we use "training" and "retraining" with different meanings in this paper).

- y is the actual output of the ENN of the current node.
- d is the desired output of the ENN (to be defined later).

- A is the set of children of the current node.

Table 4: Results before retraining (subset size = 1/10 of the training set)

Run       ε1       ε2       #Non-terminal   #Terminal
1         432      307      9               10
2         409      264      9               10
3         381      263      10              11
4         427      260      10              11
5         453      276      10              11
6         438      292      13              14
7         528      338      9               10
8         475      302      10              11
9         533      338      10              11
10        469      338      9               10
Average   454.50   297.80   9.90            10.90

Table 5: Results after retraining (subset size = 1/10 of the training set)

Run       ε1       ε2       #Non-terminal   #Terminal
1         253      223      9               10
2         282      192      9               10
3         257      201      10              11
4         291      199      10              11
5         278      220      10              11
6         343      248      13              14
7         261      210      9               10
8         305      218      10              11
9         288      227      10              11
10        272      206      9               10
Average   283.00   214.40   9.90            10.90

- I_A = {1, 2, ..., n} is the index set of A, where n is the number of outputs of the ENN.

- B = {b ∈ A | some training example assigned to b by the tree has the same label as x} is a subset of A.

- I_B = {i_1, i_2, ..., i_m} is the index set of B, where m ≤ n.
- C = {c ∈ B | x can be classified correctly by c} is the set of children (sub-trees) of the current node that can classify x correctly.

- I_C = {j_1, j_2, ..., j_k} is the index set of C, where k ≤ m.

With the above definitions, we can describe the algorithm for retraining as follows. For any given retraining example x, we start decision making from the root node. In general, for the current node, we have:

- Case 1: C is not empty. That is, there is at least one sub-tree that can classify x correctly.

Table 6: Results before retraining (subset size = 1/5 of the training set)

Run       ε1       ε2      #Non-terminal   #Terminal
1         299      223     11              12
2         247      215     12              13
3         309      222     10              11
4         287      210     13              14
5         284      223     14              15
6         279      263     12              13
7         308      243     10              11
8         217      203     11              12
9         307      253     11              12
10        286      240     13              14
Average   282.30   229.5   11.70           12.70

Table 7: Results after retraining (subset size = 1/5 of the training set)

Run       ε1       ε2       #Non-terminal   #Terminal
1         238      197      11              12
2         207      200      12              13
3         255      194      10              11
4         261      186      13              14
5         237      206      14              15
6         228      200      12              13
7         239      184      10              11
8         229      191      11              12
9         240      192      11              12
10        226      201      13              14
Average   236.00   195.10   11.70           12.70

In this case (Case 1), select an index j from I_C such that, if we define the desired output d by

d_i = \begin{cases} 1, & i = j \\ 0, & \text{otherwise} \end{cases}    (8)

then |d - y| is minimized (that is, the decision made by the current node is respected as much as possible). Then we update the weights of the ENN of the current node using BP (or some other algorithm) to make |d - y| smaller (so that, when x is presented as input again, the current node is more likely to make the correct decision). This procedure is continued recursively for the j-th child.

- Case 2: C is empty but B is not. In this case, select an index j from I_B such that, if we define the desired output d by Eq. (8), |d - y| is minimized (that is, the decision made by the current node is respected as much as possible). Then we update the weights of the ENN of the current node using BP to make |d - y| smaller (so that, when x is presented as input again, the current node may make the correct decision). This procedure is continued recursively for the j-th child.

- Case 3: Both C and B are empty. This means x is an example with an unknown label (not learned during training). In this paper, we do not consider this case.
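The retraining procedure can be summarized by the following Python sketch. The Node/ENN interface used here (fields is_leaf, label, labels, children; methods outputs(x) and bp_update(x, d)) is hypothetical and introduced only to show the control flow of Cases 1-3; the paper does not define a concrete API.

```python
# Sketch of retraining along the (possibly corrected) classification path.
import numpy as np

def predict(node, x):
    """Follow the classification path of x and return the label of the leaf reached."""
    while not node.is_leaf:
        node = node.children[int(np.argmax(node.enn.outputs(x)))]
    return node.label

def retrain(node, x, label):
    if node.is_leaf:
        return
    y = np.asarray(node.enn.outputs(x))
    # B: children whose assigned training examples contain the label of x.
    I_B = [j for j, c in enumerate(node.children) if label in c.labels]
    # C: children (sub-trees) that already classify x correctly.
    I_C = [j for j in I_B if predict(node.children[j], x) == label]
    candidates = I_C if I_C else I_B        # Case 1, otherwise Case 2
    if not candidates:                      # Case 3: unknown label, not handled
        return
    # Pick j whose one-hot target d (Eq. 8) is closest to the actual output y,
    # so that the current decision is respected as much as possible.
    def dist(j):
        d = np.zeros(len(y)); d[j] = 1.0
        return float(np.linalg.norm(d - y))
    j = min(candidates, key=dist)
    d = np.zeros(len(y)); d[j] = 1.0
    node.enn.bp_update(x, d)                # one BP step to make |d - y| smaller
    retrain(node.children[j], x, label)     # continue along the classification path
```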

4.2 Experimental Results

Using the above algorithm, we conducted several experiments with the same data as before. In each experiment, we first extracted a subset of the whole training set and used it for designing (training) an NNTree; the size of the subset was 1/10 or 1/5 of the whole training set. All training data were then used for retraining. The main purpose here, as stated earlier, is to obtain a smaller NNTree. The number of epochs for retraining was 1,000, the learning rate used in BP was 0.1, and no momentum term was used. From the results shown in Tables 4-7, we can see that the tree size can be reduced greatly if we design the tree using a small part of the training data. The side effect of reducing the tree size is that the generalization ability becomes lower; this can be compensated to some extent by retraining. All results after retraining are better than those obtained by C4.5, in the sense that a higher recognition rate is obtained with much smaller trees.
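For completeness, a rough sketch of this experimental protocol is shown below. It assumes the retrain() sketch from Section 4.1 and a hypothetical train_nntree() routine implementing the training procedure of Section 3; only the subset fraction, the 1,000 retraining epochs, and the learning rate of 0.1 (no momentum, applied inside the BP step) come from the paper.

```python
# Rough sketch of the retraining experiment; train_nntree and retrain are
# passed in as functions because the paper does not define a concrete API.
import random

def run_retraining_experiment(X, y, train_nntree, retrain,
                              fraction=0.1, epochs=1000):
    # 1) Design a small NNTree from a random subset (1/10 or 1/5 of the data).
    idx = random.sample(range(len(X)), int(fraction * len(X)))
    tree = train_nntree([X[i] for i in idx], [y[i] for i in idx])
    # 2) Retrain the ENNs along the classification paths using ALL the data.
    for _ in range(epochs):
        for x, label in zip(X, y):
            retrain(tree, x, label)
    return tree
```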

5 Conclusion and Future Work

In this paper, we have studied a kind of neural network tree (NNTree) whose internal nodes are expert neural networks (ENNs), and proposed algorithms for its training and retraining. Experimental results with a digit recognition problem show that the NNTree is more efficient than traditional decision trees in the sense that a higher recognition rate can be achieved with fewer nodes. Considering the complexity alone, however, an NNTree may not be better than a traditional DT. The important points are that an NNTree can be more comprehensible and flexible. We still have many questions to answer. For example: 1) How can the ENNs be retrained incrementally or on-line to adapt to a changing environment? 2) How can the ENNs, as well as the whole tree, be interpreted? 3) How should the architecture of each ENN be chosen? 4) What should we do if there are examples with labels not learned during training? 5) How should the examples for training be chosen?

References

[1] A. Kandel and G. Langholz (Eds.), Hybrid Architectures for Intelligent Systems, CRC Press, 1992.
[2] S. Wermter and R. Sun (Eds.), Hybrid Neural Systems, Springer-Verlag, 2000.
[3] I. K. Sethi, "Entropy nets: from decision trees to neural networks," Proc. IEEE, Vol. 78, No. 10, pp. 1605-1613, 1990.
[4] R. P. Brent, "Fast training algorithms for multilayer neural nets," IEEE Trans. on Neural Networks, Vol. 2, No. 3, pp. 346-354, 1991.
[5] G. G. Towell and J. W. Shavlik, "Knowledge-based artificial neural networks," Artificial Intelligence, 70(1-2), pp. 119-165, 1994.
[6] H. Tsukimoto, "Extracting rules from trained neural networks," IEEE Trans. on Neural Networks, Vol. 11, No. 2, pp. 377-389, 2000.
[7] M. W. Craven, "Extracting comprehensible models from trained neural networks," Ph.D. Thesis, University of Wisconsin-Madison, 1996.
[8] G. P. J. Schmitz, C. Aldrich and F. S. Gouws, "ANN-DT: an algorithm for extraction of decision trees from artificial neural networks," IEEE Trans. on Neural Networks, Vol. 10, No. 6, pp. 1392-1401, 1999.
[9] J. R. Quinlan, "Comparing connectionist and symbolic learning methods," in R. Rivest (Ed.), Computational Learning Theory and Natural Learning Systems: Constraints and Prospects, pp. 445-456, MIT Press, 1994.
[10] Q. F. Zhao, "Evolutionary design of neural network tree: integration of decision tree, neural network and GA," Proc. IEEE Congress on Evolutionary Computation, Seoul, Korea, 2001.
[11] Q. F. Zhao, "Modeling and evolutionary learning of modular neural networks," Proc. 6th International Symposium on Artificial Life and Robotics, pp. 508-511, Tokyo, 2001.
[12] H. Guo and S. B. Gelfand, "Classification trees with neural network feature extraction," IEEE Trans. on Neural Networks, Vol. 3, No. 6, pp. 923-933, 1992.
[13] H. H. Song and S. W. Lee, "A self-organizing neural tree for large-set pattern classification," IEEE Trans. on Neural Networks, Vol. 9, No. 3, pp. 369-380, 1998.
[14] J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
[15] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, Belmont, CA: Wadsworth, 1984.
