Growing Neural Network Trees Efficiently and Effectively

Takaharu Takeda and Qiangfu Zhao
The University of Aizu
Aizuwakamatsu, Japan 965-8580
Email: {m5061116, qf-zhao}@u-aizu.ac.jp

Abstract. A neural network tree (NNTree) is a hybrid learning model with the overall structure being a decision tree (DT), and each non-terminal node containing an expert neural network (ENN). Generally speaking, NNTrees outperform conventional DTs because better features can be extracted by the ENNs, and the performance can be improved further through incremental learning. In addition, as we have shown recently, NNTrees can always be interpreted in polynomial time if we restrict the number of inputs for each ENN. Recently, we proposed an algorithm that can grow the tree automatically and provide very good results. However, the algorithm is not efficient because a genetic algorithm (GA) is used both in re-training the ENNs and in creating new nodes. In this paper, we propose a way to replace GA with the back propagation (BP) algorithm in the growing algorithm. Experiments with several public databases show that the improved algorithm can grow better NNTrees with much lower computational cost.

1 Introduction

A neural network tree (NNTree) is a hybrid learning model with the overall structure being a decision tree (DT), and each non-terminal node containing an expert neural network (ENN). Generally speaking, NNTrees outperform conventional DTs because better features can be extracted by the ENNs [1], and the performance can be improved further through learning of the ENNs [3]-[5]. In addition, as we have shown recently, NNTrees can be interpreted in polynomial time if we restrict the number of inputs for each ENN [6]. In fact, the time complexity of most algorithms for interpreting a trained NN is exponential [7]. Based on these considerations, it is reasonable to consider NNTrees a good model for unifying learning and understanding.

One important problem in using NNTrees is how to design the tree on-line and efficiently. A direct way to design an NNTree is to follow the same recursive procedure used for designing a conventional DT (say, [9]), and to design the ENN for each non-terminal node using a genetic algorithm (GA) [3]. We use GA here because we do not have teacher signals for designing the ENNs. This procedure, however, is useful only for off-line learning, because all data must be provided at once. In principle, an NNTree obtained by off-line learning can be re-trained using new data. Unfortunately, this kind of re-training usually cannot provide good NNTrees, because an NNTree with a fixed structure is not powerful enough to incorporate the information contained in the new data [4]-[5]. To solve this problem, we proposed an algorithm that can grow NNTrees automatically using data provided incrementally [8].

This algorithm gives better results, but it is not efficient because GA is used both in re-training the ENNs and in creating new nodes. In this paper, we propose a way to replace GA with the back propagation (BP) algorithm in the growing algorithm. Experiments with several public databases show that the improved algorithm grows much better NNTrees, and does so more efficiently.

The paper is organized as follows. Section 2 gives a brief review of the off-line learning of NNTrees. Section 3 introduces the growing algorithm. In Section 4 we point out the problems in the growing algorithm and propose methods for solving them. Section 5 provides experimental results, and Section 6 concludes the paper.

2 Off-line Learning of NNTrees

To construct a DT, it is often assumed that a training set consisting of feature vectors and their corresponding class labels is available. The DT is then constructed by partitioning the feature space in such a way as to recursively generate the tree. This procedure involves three steps: splitting nodes, determining which nodes are terminal nodes, and assigning class labels to terminal nodes. Among them, the most important and most time-consuming step is splitting the nodes. There are many criteria for splitting nodes; one of the most popular is the information gain ratio used in C4.5 [9].

A direct way to design an NNTree is to follow the same procedure used in C4.5. The point here is to find an ENN in each non-terminal node that maximizes the information gain ratio. To find the ENNs, we can use a genetic algorithm (GA). Usually, a GA contains three basic operations: selection, crossover and mutation. In a standard GA, the genotype of each ENN is a binary string consisting of all connection weights, with each weight represented as a binary number. The fitness is defined directly as the information gain ratio. The genetic operators used here are truncation selection (a selection method that removes some of the current worst individuals in each generation and replaces them with the offspring of the remaining individuals), one-point crossover and bit-by-bit mutation.

The ENNs used in our study are multilayer perceptrons (MLPs) with only two outputs, and the NNTrees are always binary trees. For any input example, if the first output of an ENN is larger, the example is assigned to the left child; otherwise, it is assigned to the right child.
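For reference, the following Python sketch shows how the GA fitness of a candidate ENN, i.e., the information gain ratio of the binary partition it induces at a node, could be computed. The function names and data layout are ours for illustration; the paper itself gives no code.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((cnt / n) * math.log2(cnt / n) for cnt in Counter(labels).values())

def information_gain_ratio(labels, branches):
    """Gain ratio of the binary partition an ENN induces at a node.

    labels   : class labels of the examples reaching the node
    branches : 0/1 branch chosen by the ENN for each example
    """
    n = len(labels)
    groups = [[l for l, b in zip(labels, branches) if b == v] for v in (0, 1)]
    gain = entropy(labels) - sum(len(g) / n * entropy(g) for g in groups if g)
    split_info = -sum(len(g) / n * math.log2(len(g) / n) for g in groups if g)
    return gain / split_info if split_info > 0 else 0.0
```

In the GA, each candidate weight string is decoded into an MLP, the examples at the node are propagated through it, and the resulting branch assignments are scored with a function of this kind.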

3 On-line Learning of NNTrees

The main purpose of on-line learning is to improve the performance of an NNTree using data provided incrementally. A straightforward way for on-line learning is to fix the structure of the tree and update only the weights of the ENNs. The initial tree can be designed off-line using currently available data. On-line learning with a fixed tree structure is actually supervised learning. For a new example $x$ of class $c$, the teacher signal for the current node $n$ (starting from the root) can be determined as follows. Suppose that for $x$, the $i$-th output of the ENN of node $n$ is the maximum ($i \in \{1, 2\}$ in this study). If the $i$-th child of $n$ was assigned some examples of class $c$ during off-line learning, the teacher signal is defined by

$$ t_k = \begin{cases} 1 & \text{if } k = i \\ 0 & \text{otherwise.} \end{cases} \qquad (1) $$

[Figure 1: Flow chart of the growing algorithm]

On the other hand, if the $i$-th child of $n$ was not assigned any example of class $c$ during off-line learning, but the $j$-th ($j \neq i$) child was, the teacher signal is defined by

$$ t_k = \begin{cases} 1 & \text{if } k = j \\ 0 & \text{otherwise.} \end{cases} \qquad (2) $$
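To make equations (1) and (2) concrete, the following Python sketch computes the teacher signal for a two-output ENN. The names are ours; `class_counts[j][c]` is assumed to record how many examples of class $c$ have been assigned to child $j$ so far.

```python
import numpy as np

def teacher_signal(enn_outputs, class_counts, c):
    """Target vector for a two-output ENN, following eqs. (1) and (2).

    enn_outputs  : the two outputs of the current node's ENN for example x
    class_counts : class_counts[j][c] = number of class-c examples assigned
                   to child j so far (e.g. during off-line learning)
    c            : class label of the new example x
    """
    i = int(np.argmax(enn_outputs))             # branch currently favoured by the ENN
    target = np.zeros(2)
    if class_counts[i].get(c, 0) > 0:           # eq. (1): child i has already seen class c
        target[i] = 1.0
    elif class_counts[1 - i].get(c, 0) > 0:     # eq. (2): redirect to the other child
        target[1 - i] = 1.0
    else:                                       # neither child has seen class c; this case is
        target[i] = 1.0                         # not covered by (1)-(2), keep branch i here
    return target
```

The ENN is then updated by ordinary supervised learning so that its outputs approach this target.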

Recently, we applied the above re-training method to produce smaller NNTrees [5]. The basic idea is to first design an NNTree using partial training data (say, 1/10 of the data), and then re-train the tree using the remaining data. The tree obtained by this approach is usually much smaller than one designed directly from all data. However, if the training set is not highly redundant, the small tree designed from partial data can never be re-trained to become as good as the one obtained using all data. This means that re-training the ENNs alone is not enough to incorporate new information.

To improve the learnability of an NNTree, we proposed a growing algorithm in [8]. Fig. 1 shows the flow chart of this algorithm. The learning process is described as follows. Suppose that the new example is $x$ and its class label is $c$, and start from the root node.

Step 1: See if the current node is a terminal node. If not, go to Step 2. If yes, see whether the class label of the node is $c$. If it is (i.e., the example is recognizable), receive the next training example and reset the current node to the root; if not, split the node into two, with one child being the same as the old node and the other containing the current example. The current node now becomes a non-terminal node, and a new ENN is designed for it. In designing the ENN, all training examples assigned to this node so far can be used.

Step 2: Re-train the current node. In this step, we update the weights of the ENN only once, using the BP algorithm with the current example. The teacher signal is defined by (1) and (2).

 

Step 3: See if $x$ can be classified to the correct branch. If yes, go to Step 5; otherwise, continue.

Step 4: Re-train the node again. Now we re-train the ENN using all examples assigned to this node up to now. This is actually a review process, and it results in an ENN that can classify both new and old examples better.

Step 5: Re-train the $j$-th child recursively, where $j$ is the branch indicated by the teacher signal defined by (1) and (2).
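The five steps above can be summarized as a short recursive routine. The sketch below assumes a hypothetical `Node` class with methods for storing examples, splitting, and BP-based minor/major revision; it is meant to mirror Fig. 1, not to reproduce the authors' implementation.

```python
def learn_example(node, x, c):
    """Process one new example (x, c) with the growing algorithm (Steps 1-5)."""
    node.store(x, c)                          # keep the example for later "reviews"
    if node.is_terminal():                    # Step 1
        if node.class_label != c:             # mis-classified: grow the tree here
            node.split(c)                     # old leaf becomes one child, a new leaf
            node.train_new_enn()              # for class c the other; a new ENN is then
                                              # trained on all examples stored at the node
        return                                # the next example starts again from the root
    node.minor_revision(x, c)                 # Step 2: a single BP update on (x, c)
    if node.branch_for(x) != node.desired_branch(c):   # Step 3
        node.major_revision()                 # Step 4: BP over all stored examples ("review")
    j = node.desired_branch(c)                # Step 5: recurse into the child indicated by
    learn_example(node.children[j], x, c)     # the teacher signal
```

In practice, the split in Step 1 can additionally be gated by a splitting condition such as (3) below, so that the tree does not grow too fast.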

Note that in the above algorithm, for any input example, each node on the classification path (a concept similar to a search path) is re-trained in two steps: minor revision and major revision. The basic idea is to revise the ENN slightly using BP. If, after the minor revision, the node is already good enough to recognize the current example, we input another example; otherwise, we revise the ENN with all currently available data. These two steps can be considered as "learning and reviewing", which resembles a simplified process of human learning.

Note also that in the growing algorithm, a new terminal node is split into two whenever an example is mis-classified. As a result, the tree may grow too fast. One method for reducing the tree size is to split a node only when the number of mis-classifications satisfies some splitting condition. The following condition is used in our study:

$$ N \geq \theta \quad \text{AND} \quad \frac{N_{mis}}{N} \geq r \qquad (3) $$

where $N$ is the total number of examples assigned to the node by the tree, $N_{mis}$ is the number of mis-classified examples, $\theta$ is a threshold and $r$ is the splitting rate. Generally speaking, $\theta$ and $r$ depend on the training set size and the number of classes. In our study, we set $\theta = 30$ and $r = 0.1$ for simplicity; fine tuning is not performed. With these values, the above condition can be read as "a new node will not be created if $N$ is less than 30, or if the percentage of mis-classified examples is less than 10%".
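As a quick illustration, condition (3) with the values used in the paper (threshold 30, splitting rate 0.1) can be checked as follows; the variable names are ours.

```python
def should_split(n_total, n_mis, threshold=30, split_rate=0.1):
    """Splitting condition (3): create a new node only if this node has seen
    enough examples AND a sufficient fraction of them were mis-classified."""
    return n_total >= threshold and (n_mis / n_total) >= split_rate
```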

4 Growing the NNTrees More Efficiently

The growing algorithm given in the previous section has been verified using several public databases [8]. The results show that the algorithm obtains much better NNTrees through on-line learning than learning with a fixed tree structure does. One important problem with the growing algorithm, however, is that its computational cost is too high. There are three reasons for this. First, GA is used for creating an initial NNTree. This procedure is usually very slow, although only part of the data are used. Second, GA is used for creating a new ENN when a node is split into two. Third, GA is used for the major revision of an ENN when an example is mis-classified even after the minor revision.

As for the first reason, GA is needed there because we do not have teacher signals for designing each ENN off-line. However, we can remove this initialization part completely if we start from an empty tree: as a careful look at the growing algorithm shows, the tree can then grow automatically through on-line learning alone. As for the second reason, we adopted GA because of a biased assumption: we thought that whenever a new ENN is created, GA must be used because there are no teacher signals. However, if we start from an empty tree and define the teacher signals using (1) and (2), training a new node is actually supervised training, and can be done using BP.
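This point can be made concrete with a small sketch: once the teacher signal of (1) and (2) is available, creating a node's ENN (and, likewise, a major revision) is ordinary supervised BP training over the examples stored at that node. The `enn.forward`/`enn.backward` interface below is assumed for illustration, not taken from the paper.

```python
def train_node_enn(enn, examples, teacher_signal, lr=0.5, epochs=1000):
    """Supervised BP training of a node's two-output MLP, usable both for
    a newly created node and for major revision of an existing one."""
    for _ in range(epochs):
        for x, c in examples:
            outputs = enn.forward(x)                    # forward pass of the MLP
            target = teacher_signal(outputs, c)         # target from eqs. (1)-(2)
            enn.backward(x, target, learning_rate=lr)   # one BP weight update
```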

Table 1: Parameters of the databases

Database       #Example   #Cross validation   #Run   #Feature   #Class
dermatology    366        5                   20     34         5
ionosphere     351        5                   20     34         2
tic-tac-toe    958        10                  10     9          2
housevotes84   435        5                   20     16         2
car            1728       10                  10     6          4

The third reason is actually not a reason at all. Major revision of an ENN is actually supervised training; this was already known from our study of on-line learning with a fixed tree structure. Our original purpose of using GA was to simplify the program. However, in the growing algorithm we already use BP for minor revision, so we can use BP for major revision as well without increasing the complexity of the program.

From the above considerations, we propose to use BP in all parts of the growing algorithm. This reduces the computational cost greatly. At the same time, as will be shown later, the results so obtained are also much better than before.

5 Experimental Results

To verify the effectiveness of the method proposed in this paper, we conducted several experiments using five databases taken from the machine learning repository of the University of California at Irvine. Parameters related to these databases are given in Table 1, where # means "number of". To make the results more reliable, we adopted $k$-fold cross-validation for all databases. For example, for "housevotes84", $k = 5$, and thus a 5-fold cross-validation is used: 4/5 of the data are used for learning and 1/5 of the data are used for testing. The number $k$ is chosen so that there are enough examples in the test set, which is important for a reliable evaluation of the results. To increase the reliability further, 10 or 20 runs are conducted for each case, so that altogether 100 experiments are conducted for each database. In each run, the training data are shuffled before learning, so that the examples are provided in a different order (a sketch of this protocol is given after the list below). Three algorithms are compared:











- Off-line: the GA-based off-line learning algorithm proposed before. Since all data are assumed to be available at once, results obtained by this algorithm can be considered the upper limit for on-line learning.
- All-GA: the growing algorithm using GA for creating new nodes and for major revision.
- All-BP: the growing algorithm using BP only.
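As referenced above, the evaluation protocol (repeated, shuffled $k$-fold cross-validation) can be sketched as follows. `train_fn` stands for any of the three algorithms and `test_fn` for a test-set accuracy measure; both are placeholders rather than the authors' code.

```python
import random

def evaluate(dataset, k, runs, train_fn, test_fn):
    """Repeat k-fold cross-validation `runs` times on shuffled data and
    return the average test accuracy (illustrative protocol only)."""
    accuracies = []
    for _ in range(runs):
        data = list(dataset)
        random.shuffle(data)                     # present the examples in a different order
        fold_size = len(data) // k               # leftover examples are ignored in this sketch
        for f in range(k):
            test = data[f * fold_size:(f + 1) * fold_size]
            train = data[:f * fold_size] + data[(f + 1) * fold_size:]
            model = train_fn(train)              # e.g. grow an NNTree with All-BP
            accuracies.append(test_fn(model, test))
    return sum(accuracies) / len(accuracies)
```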



Parameters related to the ENNs are:

- Number of inputs: equal to the number of features
- Number of hidden neurons: 5
- Number of outputs: 2

Parameters related to GA, for off-line learning and for creating a new ENN, are:

- Population size: 200
- Number of generations: 1000
- Selection rate: 0.2
- Crossover rate: 0.7
- Mutation rate: 0.01
- Bits per weight: 16

Parameters related to GA for major revision are:

- Number of generations: 100
- Other parameters are the same as given above.

Parameters related to BP for minor revision are:

- Learning rate: 0.5
- Momentum: 0
- Number of epochs: 1

Parameters related to BP, both for major revision and for creating new ENNs, are:

- Number of epochs: 1,000
- Other parameters are the same as given above.

Table 2: Performance of NNTrees designed by off-line learning

Database       Tree size   Performance
dermatology    12.72       0.92469
ionosphere     8.08        0.89949
tic-tac-toe    20.32       0.9204
housevotes84   8.8         0.9145
car            41.06       0.94763

Table 2 shows the results obtained by off-line learning. Only the recognition rates (averaged over 100 runs) for the test sets are given here; those for the training set are always 1 (or 100%). Tables 3 to 5 give the sizes of the NNTrees obtained by the different algorithms. For off-line learning, we have results obtained using all of the data, 1/2 of the data, 1/5 of the data, and 1/10 of the data. For on-line learning, we have results obtained with and without initial trees. In most cases, the NNTrees obtained by All-BP are smaller than those obtained by All-GA, and they are even comparable with the results of off-line learning.

Tables 6 to 10 show the performance of the NNTrees obtained using the different methods, for each database. From these results we can see that, in most cases, the trees after on-line learning are better than the initial trees. In addition, the generalization ability (recognition rate for the test set) of the NNTrees generated by All-BP is actually much better than that of those generated by All-GA. For some cases the results obtained by All-BP are even better than those obtained by off-line learning (although the difference is not large).

Table 3: Sizes of the NNTrees obtained by off-line learning

Data used   dermatology   ionosphere   tic-tac-toe   housevotes84   car
all         12.72         8.08         20.32         8.8            41.06
1/2         11.5          5.42         15.96         5.16           31.38
1/5         10.94         3.62         11.02         3.34           20.64
1/10        10.66         3.02         7.66          3.04           14.32

Table 4: Sizes of the NNTrees obtained by All-GA

Data used   dermatology   ionosphere   tic-tac-toe   housevotes84   car
1/2         13.52         10.3         62.32         9.14           78.04
1/5         13.38         10           63.5          7.64           66.52
1/10        14.06         10.88        64.48         7.38           60.84
none        28.96         10.04        60.48         8.14           160

Table 5: Sizes of the NNTrees obtained by All-BP

Data used   dermatology   ionosphere   tic-tac-toe   housevotes84   car
1/2         12.02         8.14         22.36         6.8            42.6
1/5         11.88         7.36         17.62         5.54           31.16
1/10        11.9          6.36         14.74         5.14           26.36
none        11.58         5.24         8.36          3.8            18.46

6 Conclusion

In this paper, we have studied on-line learning of NNTrees and introduced a method to improve the efficiency and effectiveness of the growing algorithm. The validity of the proposed algorithm has been verified through experiments with several public databases. Currently, we are trying to improve the growing algorithm along two lines. The first is to "review" (perform the major revision) with partial data, rather than with all data observed so far. The second is to make it possible to design NNTrees with ENNs that have a limited number of inputs. The former is very useful for long-term on-line learning, and the latter is important for interpreting the learned NNTrees.

Acknowledgment

This research is supported in part by a Grant-in-Aid for Scientific Research of the Japan Society for the Promotion of Science (JSPS).

Table 6: Results for "dermatology"

            Performance for test data          Performance for training data
Data used   Initial    All-GA     All-BP       Initial    All-GA     All-BP
1/2         0.90038    0.90545    0.9228       0.92836    0.95962    0.96693
1/5         0.81074    0.89858    0.92021      0.84946    0.95892    0.96731
1/10        0.6705     0.89245    0.92207      0.70365    0.94918    0.95817
none        —          0.84521    0.92989      —          0.9141     0.97257

Table 7: Results for "ionosphere"

            Performance for test data          Performance for training data
Data used   Initial    All-GA     All-BP       Initial    All-GA     All-BP
1/2         0.87057    0.87441    0.89994      0.93361    0.92999    0.96897
1/5         0.82275    0.87052    0.89933      0.8536     0.92974    0.9709
1/10        0.77777    0.85181    0.8986       0.79419    0.91984    0.96688
none        —          0.86696    0.90333      —          0.92336    0.97667

Table 8: Results for "tic-tac-toe"

            Performance for test data          Performance for training data
Data used   Initial    All-GA     All-BP       Initial    All-GA     All-BP
1/2         0.8637     0.80579    0.97818      0.93155    0.83522    0.99
1/5         0.7824     0.78547    0.97829      0.82331    0.8255     0.98944
1/10        0.7266     0.78168    0.97588      0.75004    0.82183    0.97588
none        —          0.7887     0.97695      —          0.82464    0.98969

Table 9: Results for "housevotes84"

            Performance for test data          Performance for training data
Data used   Initial    All-GA     All-BP       Initial    All-GA     All-BP
1/2         0.9005     0.90034    0.91437      0.95353    0.94754    0.97822
1/5         0.8877     0.88517    0.91494      0.90842    0.93721    0.9754
1/10        0.8566     0.89356    0.91621      0.87003    0.94552    0.97914
none        —          0.89655    0.92241      —          0.94448    0.98348

Table 10: Results for "car"

            Performance for test data          Performance for training data
Data used   Initial    All-GA     All-BP       Initial    All-GA     All-BP
1/2         0.91539    0.86742    0.95422      0.95596    0.87993    0.96638
1/5         0.95098    0.86517    0.96098      0.88048    0.87947    0.97057
1/10        0.80216    0.85999    0.95173      0.81778    0.87603    0.96457
none        —          0.80607    0.9588       —          0.81828    0.96798

References

[1] H. Guo and S. B. Gelfand, "Classification trees with neural network feature extraction," IEEE Trans. on Neural Networks, Vol. 3, No. 6, pp. 923-933, Nov. 1992.
[2] Q. F. Zhao, "Neural network tree: integration of symbolic and non-symbolic approaches," Technical Report of IEICE, NC2000-57 (2000-10).
[3] Q. F. Zhao, "Evolutionary design of neural network tree - integration of decision tree, neural network and GA," Proc. IEEE Congress on Evolutionary Computation, pp. 240-244, Seoul, 2001.
[4] Q. F. Zhao, "Training and re-training of neural network trees," Proc. INNS-IEEE International Joint Conference on Neural Networks, pp. 726-731, 2001.
[5] T. Takeda and Q. F. Zhao, "Size reduction of neural network trees through re-training," Technical Report of IEICE, PRMU2002-105 (2002-10).
[6] S. Mizuno and Q. F. Zhao, "Evolutionary design of neural network trees with nodes of limited number of inputs," Proc. IEEE International Conference on Systems, Man and Cybernetics (SMC'02), Tunisia, 2002.
[7] H. Tsukimoto, "Extracting rules from trained neural networks," IEEE Trans. Neural Networks, Vol. 11, No. 2, pp. 377-389, 2000.
[8] T. Takeda, Q. F. Zhao and Y. Liu, "A study on on-line learning of NNTrees," Proc. INNS-IEEE International Joint Conference on Neural Networks, 2003.
[9] J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, 1993.
