Evolutionary Design of Neural Network Tree — Integration of Decision Tree, Neural Network and GA

Qiangfu Zhao
The University of Aizu
Aizu-Wakamatsu, Japan 965-8580
[email protected]

Abstract- Decision tree (DT) is one of the most popular approaches for machine learning. Using DTs, we can extract comprehensible decision rules and make decisions based only on useful features. The drawback is that, once a DT is designed, there are no free parameters for further development. On the contrary, a neural network (NN) is adaptable or learnable, but the number of free parameters is usually too large to be determined efficiently. To have the advantages of both approaches, it is important to combine them. Among the many ways of combining NNs and DTs, this paper introduces the neural network tree (NNTree). An NNTree is a decision tree in which each node is an expert neural network (ENN). The overall tree structure can be designed by following the same procedure as used in designing a conventional DT, and each node (an ENN) can be designed using genetic algorithms (GAs). Thus, the NNTree also provides a way of integrating DT, NN and GA. Through experiments with a digit recognition problem, we show that NNTrees are more efficient than traditional DTs in the sense that a higher recognition rate can be achieved with fewer nodes. Furthermore, if the fitness function for each node is defined properly, better generalization ability can also be achieved.

1 Introduction

Up to now, many algorithms have been proposed for machine learning. These algorithms can be roughly divided into two categories: symbolic and non-symbolic. Typical symbolic approaches include decision trees (DTs), decision rules (DRs), finite-state automata (FSA), and so on. These approaches are usually good at extracting comprehensible rules, but are not suitable for on-line learning. On the contrary, non-symbolic approaches, say neural networks (NNs), can be used even if the training environment changes, because they are learnable. However, non-symbolic learners are usually black-boxes: we do not know what has been learned even when the answers are correct. Another key problem in using NNs is that the number of free parameters is usually too large to be determined efficiently. To have the advantages of both symbolic and non-symbolic approaches, it is important to combine them. For this purpose, many methods have been proposed in the last decade. Here are just some examples related to combining DTs and NNs:

1. Design a DT first, and then derive an NN from the tree [1]-[2].


2. Design an NN, and then extract a symbolic representation from the NN [4]-[6].

3. Design a tree-structured NN using genetic programming (GP) [7]-[8].

4. Soften (fuzzify) the local decisions made by each non-terminal node of a DT, and make the global decision based on all terminal nodes [9]-[12].

5. Embed NNs directly into DTs [13]-[14].

Figure 1: A neural network tree (NNTree)

The first approach was proposed originally for fast design of neural networks. It can be extended to a more general framework known as knowledge-based neural networks [3]. NNs so obtained are expected to cover the domain knowledge better with fewer connections. However, once they are exposed to changing environments, they become black-boxes again. The second approach enables us to open the black-boxes and see what is inside them. The main problem is that the computational complexity is usually too large to interpret an NN in real time. The third method is good at designing a tree-structured NN, but the NNs so obtained are usually too complex to understand. The decision trees used in the fourth approach are actually modular neural networks (MNNs). They are usually more powerful or effective than crisp DTs for regression problems. However, since the final result is obtained from all terminal nodes, this kind of MNN is basically not comprehensible.

In this paper, we adopt the fifth approach. Using this approach, it is possible to reduce the tree size because each node is more powerful (though not too complex). The decision trees so obtained are also MNNs, but they use local decisions only. Therefore, it is possible to perform local re-training on-line. Fig. 1 shows an example of a neural network tree (NNTree). In this NNTree, each node is an expert neural network (ENN). The basic idea is to design small ENNs first for extracting certain features, and then put them together to get the whole decision tree. For on-line applications, the free parameters contained in the ENNs can be updated to adapt to a changing environment; the global decision rules (i.e., the overall structure of the tree), however, remain unchanged. Of course, this kind of NNTree is not suitable for cases in which the patterns change dramatically. The feature(s) to be extracted by each ENN can be pre-defined by the designer or determined automatically during the design process. In this paper, we consider only the latter. (There is a question here: how to interpret each ENN in this case? We are trying to answer this question in the near future.)

To design an NNTree, we can use the same recursive process as that used in conventional algorithms [15]-[17]. The only thing to do is to embed some algorithm into this process to design each ENN. Specifically, in this paper, we embed a simple genetic algorithm (SGA) into C4.5. To simplify the problem, we make two assumptions:

1. The architecture (topology and size) of all ENNs is the same, and is pre-specified by the user.

2. Each ENN has n branches, with n ≥ 2.

First, by fixing the architecture of all ENNs, we greatly restrict the search space for finding the ENN of each node. In addition, a fixed architecture is also easier to implement, say, on a VLSI chip. Second, we allow multiple branches because each ENN is not only a feature extractor, but also a local decision maker. An ENN can extract complex features from the given input vector (an example), and then assign the example to one of the n groups. In this sense, the NNTree used in [13] is a special case (with n = 2) of the model given here. (Here we have another question: what is the advantage of having more than two branches? For example, can we reduce the total size of the tree? This is an interesting question to be answered in the future.)

To design the ENNs, the only efficient way seems to be evolutionary algorithms, because we do not know in advance which example should be assigned to which group (i.e., we do not have a teacher signal). The only thing we can do is to choose one ENN, among infinitely many, that optimizes some criterion (say, the information gain). For this purpose, we will use the SGA. Through experiments with a digit recognition problem, we will show that NNTrees obtained by embedding the SGA into C4.5 are more efficient than traditional decision trees in the sense that a higher recognition rate can be achieved with fewer nodes. In addition, better generalization ability can also be achieved if the fitness of each ENN is defined properly (details will be given later).
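As an illustration of how such a tree makes its final decision using local decisions only, the following sketch classifies an example with a trained NNTree. The structure layout, the constant MAX_BRANCH and the function names are only illustrative assumptions, not definitions taken from this paper.

#define MAX_BRANCH 4                 /* n, the number of branches per node (illustrative) */

typedef struct Node {
    int          is_terminal;        /* 1 for a terminal (leaf) node        */
    int          class_label;        /* meaningful only for terminal nodes  */
    struct Node *next[MAX_BRANCH];   /* children of a non-terminal node     */
    /* a non-terminal node would also hold the parameters of its ENN */
} Node;

/* the ENN's local decision at a non-terminal node: index of the selected branch */
extern int EnnSelectBranch(const Node *node, const double *x);

int ClassifyWithNNTree(const Node *node, const double *x)
{
    /* only local decisions are used: follow a single path down to a terminal node */
    while (!node->is_terminal)
        node = node->next[EnnSelectBranch(node, x)];
    return node->class_label;
}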

2 A Brief Review of Related Algorithms

2.1 Design of Decision Trees

To construct a decision tree, it is often assumed that a training set consisting of feature vectors and their corresponding class labels is available. The decision tree is then constructed by partitioning the feature space in such a way as to recursively generate the tree. This procedure involves three steps: splitting nodes, determining which nodes are terminal, and assigning class labels to the terminal nodes. Among them, the most important and most time-consuming step is splitting the nodes. One of the popular algorithms for designing decision trees is C4.5 [16]. In C4.5, the information gain ratio is used as the criterion for splitting nodes. The basic idea is to partition the current training set in such a way that the average information required to classify a given example is reduced the most. Let S denote the current training set (with |S| training examples), and n_i the number of cases belonging to the i-th class (i = 1, 2, ..., N). The average information (entropy) needed to identify the class of a given example is

    info(S) = - \sum_{i=1}^{N} \frac{n_i}{|S|} \log_2 \frac{n_i}{|S|}    (1)

Now suppose that S is partitioned into n sub-sets S_1, S_2, ..., S_n by some test X. The information gain is given by

    gain(X) = info(S) - info_X(S)    (2)

where

    info_X(S) = \sum_{j=1}^{n} \frac{|S_j|}{|S|} \, info(S_j)    (3)

The information gain ratio is defined as follows:

    gain\_ratio(X) = \frac{gain(X)}{split\_info(X)}    (4)

where

    split\_info(X) = - \sum_{j=1}^{n} \frac{|S_j|}{|S|} \log_2 \frac{|S_j|}{|S|}    (5)

For detailed discussion, refer to [16].
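As a concrete illustration of Eqs. (1)-(5), the following self-contained fragment (not part of the original paper; the class counts are a made-up example) computes the information gain ratio of one candidate split:

#include <stdio.h>
#include <math.h>

/* entropy of a set described by its per-class counts, Eq. (1) */
static double info(const int *count, int n_class, int total)
{
    double h = 0.0;
    for (int i = 0; i < n_class; i++)
        if (count[i] > 0) {
            double p = (double)count[i] / total;
            h -= p * log2(p);
        }
    return h;
}

int main(void)
{
    /* hypothetical example: 3 classes, |S| = 24 examples, a test X with n = 2 sub-sets */
    enum { N_CLASS = 3, N_SUB = 2 };
    int s[N_CLASS]          = { 8, 8, 8 };          /* class counts in S   */
    int sub[N_SUB][N_CLASS] = { { 7, 1, 2 },        /* class counts in S_1 */
                                { 1, 7, 6 } };      /* class counts in S_2 */
    int total = 24;

    double info_s = info(s, N_CLASS, total);        /* Eq. (1) */
    double info_x = 0.0, split_info = 0.0;
    for (int j = 0; j < N_SUB; j++) {
        int size_j = 0;
        for (int i = 0; i < N_CLASS; i++) size_j += sub[j][i];
        double w = (double)size_j / total;
        info_x     += w * info(sub[j], N_CLASS, size_j);   /* Eq. (3) */
        split_info -= w * log2(w);                         /* Eq. (5) */
    }
    double gain = info_s - info_x;                         /* Eq. (2) */
    printf("gain = %f, gain ratio = %f\n", gain, gain / split_info);  /* Eq. (4) */
    return 0;
}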

2.2 GA Based Neural Network Design

In this paper, we consider only multilayer feedforward neural networks (MLPs) with a single hidden layer (Fig. 2). The evolutionary algorithm used here is a simple genetic algorithm (SGA) with three operators: truncation selection, one-point crossover, and bit-by-bit mutation. We adopt this SGA simply because it is easy to use. The genotype of an MLP is the concatenation of all weight vectors (including the threshold values) represented by binary numbers. The definition of the fitness is domain dependent; in this paper, the fitness is defined as the information gain ratio. To improve the generalization ability, a secondary fitness function is also used (see later).

Figure 2: Multilayer neural network model for each ENN (input layer x_1, ..., x_m; hidden layer z_1, ..., z_k with weights [v_ji] of size k × m; output layer y_1, ..., y_n with weights [w_kj] of size n × k)
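The binary coding of the weights is not spelled out here, so the following is only a sketch under the assumption of a simple fixed-point code: each weight (or threshold) is quantized to BITS_PER_WEIGHT bits over an assumed range [-WMAX, WMAX], and the bits of all weights are concatenated to form the chromosome (one gene per bit). Both constants are hypothetical.

#define BITS_PER_WEIGHT 16      /* assumed resolution of one weight         */
#define WMAX            8.0     /* assumed range of a weight: [-WMAX, WMAX] */

/* write one weight into the chromosome, starting at gene position pos */
static void encode_weight(unsigned char *chrom, int pos, double w)
{
    unsigned long q = (unsigned long)((w + WMAX) / (2.0 * WMAX)
                                      * ((1UL << BITS_PER_WEIGHT) - 1) + 0.5);
    for (int b = 0; b < BITS_PER_WEIGHT; b++)
        chrom[pos + b] = (unsigned char)((q >> b) & 1UL);   /* one gene per bit */
}

/* read one weight back from the chromosome (part of the phenotype mapping) */
static double decode_weight(const unsigned char *chrom, int pos)
{
    unsigned long q = 0;
    for (int b = 0; b < BITS_PER_WEIGHT; b++)
        q |= (unsigned long)chrom[pos + b] << b;
    return (double)q / ((1UL << BITS_PER_WEIGHT) - 1) * 2.0 * WMAX - WMAX;
}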

3 Evolutionary Design of NNTrees

3.1 The algorithm

A decision tree is designed recursively: each time, the current training set is partitioned into several sub-sets by testing the value of one of the features. If the feature has n values, there will be n sub-sets; if the feature is continuous, there are two sub-sets. The point is to select the feature that maximizes the information gain ratio. In an NNTree, all nodes are ENNs. Each ENN is an MLP with a single hidden layer and n output neurons. Any given training example is assigned to the i-th sub-set if the i-th output neuron has the largest value when this example is used as input. Again, the point is to find an ENN that maximizes the information gain ratio. To find such an ENN for each node, we can adopt the SGA, with the fitness defined directly as the information gain ratio. The simplified source code (written in the C language) for controlling the whole design process is given below:

void TrainTree(PNode tree, int low, int high){
  int branch_size[n_branch], i, low1, high1;

  tree->NodeLabel = NonTerminal;
  // Find the ENN for this node
  tree->nn = TrainNode(tree, branch_size, low, high);
  low1 = low;
  for(i = 0; i < n_branch; i++){
    // Examples with indices low1 .. high1-1 are assigned to the i-th branch
    high1 = low1 + branch_size[i];
    // Label the i-th child as terminal or non-terminal (the original call was
    // garbled in extraction; LabelNode() stands in for it)
    LabelNode(tree->next[i], low1, high1-1);
    if(tree->next[i]->NodeLabel == NonTerminal){
      // Find the ENNs for the children
      TrainTree(tree->next[i], low1, high1-1);
    }
    low1 = high1;
  }
}
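The test performed by a single node, namely feeding the example through the MLP of Fig. 2 and taking the output neuron with the largest value, can be sketched as follows. The sigmoid activation, the array layout and the sizes N_IN, N_HID and N_OUT are assumptions made only for this illustration.

#include <math.h>

#define N_IN   64    /* m: number of input features (illustrative)       */
#define N_HID  8     /* k: number of hidden neurons (illustrative)       */
#define N_OUT  4     /* n: number of branches of the node (illustrative) */

static double sigmoid(double a) { return 1.0 / (1.0 + exp(-a)); }

/* returns the branch index (0 .. N_OUT-1) the example x is assigned to */
int SelectBranch(const double x[N_IN],
                 const double v[N_HID][N_IN  + 1],   /* hidden weights + threshold */
                 const double w[N_OUT][N_HID + 1])   /* output weights + threshold */
{
    double z[N_HID];
    int    best = 0;
    double best_y = -1e30;

    for (int j = 0; j < N_HID; j++) {                /* hidden layer */
        double a = -v[j][N_IN];                      /* threshold term */
        for (int i = 0; i < N_IN; i++) a += v[j][i] * x[i];
        z[j] = sigmoid(a);
    }
    for (int k = 0; k < N_OUT; k++) {                /* output layer and argmax;    */
        double y = -w[k][N_HID];                     /* a monotone output activation */
        for (int j = 0; j < N_HID; j++)              /* would not change the argmax  */
            y += w[k][j] * z[j];
        if (y > best_y) { best_y = y; best = k; }
    }
    return best;
}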

The source code for designing each ENN (the TrainNode function called above) is as follows:

NeuroNet TrainNode(PNode tree, int branch_size[],
                   int low, int high){
  NeuroNet NN[PopSize];
  int i, j, g, n;

  // Initialize all individuals
  Initialization();
  // Find the phenotypes of all individuals
  for(i = 0; i < PopSize; i++)
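  // The source listing is cut off at this point in this excerpt. What follows
  // is only a plausible sketch of the remainder, using the SGA of Section 2.2
  // (truncation selection, one-point crossover, bit-by-bit mutation) with the
  // information gain ratio of Eq. (4) as the fitness; all helper names, the
  // constant MaxGeneration and the array fitness[] are hypothetical.
    DecodeGenotype(NN[i]);                 // map the binary chromosome to MLP weights

  double fitness[PopSize];                 // not declared in the excerpt; added here
  for(g = 0; g < MaxGeneration; g++){
    // fitness of each individual: the gain ratio of the split it induces on
    // the training examples with indices low .. high
    for(i = 0; i < PopSize; i++)
      fitness[i] = GainRatio(NN[i], tree, low, high);
    TruncationSelection(NN, fitness);      // keep the best individuals
    for(i = 0; i < PopSize; i += 2)
      OnePointCrossover(NN[i], NN[i+1]);
    for(i = 0; i < PopSize; i++)
      BitMutation(NN[i]);
  }
  // return the fittest ENN and record how many training examples fall into
  // each of its n branches
  j = IndexOfBest(fitness, PopSize);
  CountBranchSizes(NN[j], branch_size, low, high);
  return NN[j];
}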