A Hybrid Expert System for Error Message Classification H.C. Lau, K.Y. Szeto, K.Y.M. Wong and D.Y. Yeung*
Department of Physics, * Department of Computer Science, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong.
[email protected],
[email protected]
Abstract
A hybrid intelligent classifier is built for pattern classification. It consists of a classification and regression tree (CART), a genetic algorithm (GA) and a neural network (NN). CART extracts features of the patterns by setting up decision rules. Rule improvement by the GA is explored. The rules act as a pre-processing layer for the NN, a multi-class neural classifier, through which the most probable class is determined. A realistic test of classifying error messages generated from a telephone exchange shows that the CART-NN hybrid system has performance comparable to a Bayesian neural network.
1 Introduction

In a telephone network, error messages are continuously generated. Usually the number of combinations of error messages is enormous, making diagnosis a difficult problem. It is therefore desirable to have an efficient and reliable expert system for the classification of error messages in intelligent network management. Neural networks and related intelligent systems have been found useful in analyzing complex technical devices [1]. Here we introduce three approaches which have complementary advantages, and whose hybridization yields results improved over the individual systems.
2 A Hybrid Network Architecture

To formulate the classification problem, we consider a pattern coded in binary vector form, $x = (x_1, \ldots, x_N)$. Each attribute $x_i$ can take either 0 or 1, indicating the presence of a particular feature. The purpose of a classifier is to map the pattern vector $x$ to a class variable $c$, where $c$ takes discrete values in $\{1, \ldots, C\}$. The hybrid classifier is composed of a rule-based hidden layer and a linear neural network [2], as shown in Figure 1. Each node in the input layer corresponds to an input attribute $x_i$, while each node in the output layer corresponds to an output class $c$. The hidden layer contains $R$ nodes representing $R$ classification rule vectors $\{r_1, \ldots, r_R\}$, which are fully connected to the input attributes. Each rule $r_j = (r_1, \ldots, r_N)$ has the same dimension as the input vector, but each component $r_i$ can take 0, 1 or `don't care'.
Figure 1: Hybrid Classifier Network Architecture (the input attributes $x_1, \ldots, x_N$ feed a rule-based layer of rules $r_1, \ldots, r_R$, whose outputs $y_1, \ldots, y_R$ feed a perceptron layer with output-class nodes $z_1, \ldots, z_C$).

The output of a hidden node, $y_j$, is related to a matching function between the attributes in $x$ and $r_j$. It is defined as
$$y_j = S_j \left( 1 - \frac{H(x, r_j)}{D(r_j)} \right), \qquad 1 \le j \le R, \qquad (1)$$
where $H(x, r_j)$ is the Hamming distance between $x$ and $r_j$, $D(r_j)$ is the effective dimension of the rule $r_j$, and the matching function is normalized in $[0, 1]$. In evaluating $H$ and $D$, the `don't care' attributes in $r_j$ and the corresponding attributes in $x$ are ignored. $S_j$ is the strength of rule $r_j$, indicating how important the rule is. As a result, each node fires an output between $S_j$ and 0, with a perfectly matched and a perfectly unmatched pattern being the two extremes. $y_{R+1}$ is set to 1 as a bias term for the perceptron.
The rule-based pre-processor aims at selecting the features of the input pattern. With $R < N$, it also serves the purpose of dimensionality reduction for the network. The set of $y$'s is then fed to a perceptron with weights $w_{kj}$ and output node activations $z_k$, given by
$$z_k = \sum_{j=1}^{R+1} w_{kj} y_j, \qquad 1 \le k \le C. \qquad (2)$$
Each $z_k$ is an estimate of the corresponding class probability. Therefore
$$c = \arg\max_{1 \le k \le C} \{z_k\} \qquad (3)$$
offers the first choice of our class prediction. The second and third alternatives, and so on, can be determined in a similar manner by finding the second and third largest values in $\{z_k\}$.
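As an illustrative sketch of equations (1)-(3) (not the authors' original code), the forward pass can be written as follows; the `DONT_CARE` encoding, the array shapes and the helper names are our own assumptions.

```python
import numpy as np

DONT_CARE = -1  # marker for "don't care" attributes in a rule (our encoding)

def rule_activation(x, rule, strength=1.0):
    """Equation (1): y_j = S_j (1 - H(x, r_j) / D(r_j)).

    'Don't care' components of the rule are ignored in both the Hamming
    distance H and the effective dimension D.
    """
    mask = rule != DONT_CARE
    D = mask.sum()
    if D == 0:                         # rule tests no attributes at all
        return strength
    H = np.sum(x[mask] != rule[mask])
    return strength * (1.0 - H / D)

def forward(x, rules, strengths, W):
    """Equations (2)-(3): perceptron scores and classes ranked by score.

    W has shape (C, R + 1); the extra column multiplies the bias y_{R+1} = 1.
    """
    y = np.array([rule_activation(x, r, s) for r, s in zip(rules, strengths)])
    y = np.append(y, 1.0)              # bias term y_{R+1}
    z = W @ y                          # one score per class, equation (2)
    return z, np.argsort(z)[::-1]      # ranking gives the 1st, 2nd, 3rd choices
```

For example, a rule $(1, \text{don't care}, 0)$ matched against $x = (1, 1, 0)$ gives $H = 0$ and $D = 2$, hence $y_j = S_j$.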
3 Rule Extraction: CART

We now turn to the problem of finding the rules in the hidden layer. The rules are extracted from a classification and regression tree (CART) [3]. When a set of training patterns is presented, the tree is grown by recursively finding splitting conditions until all terminal nodes have pure class membership.
Figure 2: Decision Node of a Classification Tree

Consider a branch node $t$ and its left and right children $t_L$ and $t_R$ respectively, as in Figure 2. Denote by $p(i|t)$ the conditional probability that a pattern belongs to class $i$, given that it lands in node $t$. Define the node impurity function by the Gini criterion [3]:
$$i(t) = \sum_{i} \sum_{j \ne i} p(i|t)\, p(j|t). \qquad (4)$$
An efficient discrimination is achieved by selecting the attribute that provides the greatest reduction in impurity among the patterns. In other words, $x_p$ is chosen to maximize
$$\Delta i(x_p, t) = i(t) - i(t_L)\, p_L(t) - i(t_R)\, p_R(t), \qquad (5)$$
where $p_L(t)$ and $p_R(t)$ are the conditional probabilities that a pattern lands in node $t_L$ and $t_R$ respectively, given that it lands in node $t$. After the tree is grown, it is pruned by minimizing the error complexity using a pruning factor which represents the cost per node. The number of rules, $R$, is thus kept below the dimension of the pattern vector, $N$. CART can be used independently to classify patterns, but here we use it to produce classification rules for subsequent neural network processing. The rule vectors $r_j$ are generated by exhausting all routes of the hierarchical tree, setting the appropriate attributes tested at the decision nodes of each route, and inserting `don't cares' for those attributes not examined in the splitting criteria. These rules constitute the rule-based layer in Figure 1; for this purpose, the class outcome of each rule is not important at this stage. Alternatively, rule improvement by the genetic algorithm can be explored before building the rule-based layer, as described in the following section.
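To make the splitting and rule-extraction steps concrete, here is a minimal sketch of the Gini impurity (4), the best-split search (5) for binary attributes, and the conversion of tree paths into rule vectors. The simplification $i(t) = 1 - \sum_i p(i|t)^2$, the function names and the dict-based tree format are our own assumptions, not taken from [3].

```python
import numpy as np

DONT_CARE = -1  # same 'don't care' encoding as in the rule-layer sketch above

def gini(labels):
    """Equation (4): i(t) = sum_{i != j} p(i|t) p(j|t) = 1 - sum_i p(i|t)^2."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, labels):
    """Equation (5): pick the attribute x_p maximizing
    i(t) - i(t_L) p_L(t) - i(t_R) p_R(t) for the binary split x_p = 1."""
    labels = np.asarray(labels)
    parent = gini(labels)
    best_attr, best_gain = None, 0.0
    for p in range(X.shape[1]):
        on = X[:, p] == 1
        p_left = on.mean()
        gain = parent - p_left * gini(labels[on]) - (1 - p_left) * gini(labels[~on])
        if gain > best_gain:
            best_attr, best_gain = p, gain
    return best_attr, best_gain

def tree_to_rules(node, N, prefix=None):
    """Turn every root-to-leaf path of a grown tree into a rule vector,
    leaving untested attributes as 'don't care'.  The tree format
    {"split_attr": p, "yes": ..., "no": ...} is our assumption."""
    if prefix is None:
        prefix = np.full(N, DONT_CARE, dtype=int)
    if "split_attr" not in node:          # leaf: the path is complete
        return [prefix]
    p = node["split_attr"]
    left, right = prefix.copy(), prefix.copy()
    left[p], right[p] = 1, 0              # yes-branch requires x_p = 1
    return (tree_to_rules(node["yes"], N, left)
            + tree_to_rules(node["no"], N, right))
```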
4 Rule Evolution: Genetic Algorithm

The rules generated by CART, each with its corresponding class output, act as the initial population in the GA. We investigate several simple variations on the application of genetic operators, reporting here the results of adaptive crossover [4, 5]. We randomly generate a fixed number $N_a$ of additional rules per class according to the bit frequencies of the training set. All the rules belonging to class $c$ are initially given the same strength $P(c)$, the probability of occurrence of class $c$ based on the statistics of the training set. The $j$-th rule with class output $c$ has its strength $S_j$ modified according to
$$S_j(t+1) = S_j(t) \pm p_{c,j}. \qquad (6)$$
The strength is increased (plus sign) if the rule gives the correct class, otherwise it is decreased (minus sign). The probability $p_{c,j}$ is a measure of how closely the rule matches the given training example and is computed by a Boltzmann distribution at temperature $T = 1/\beta$,
$$p_{c,j} = \frac{e^{-\beta d_{c,j}}}{\sum_{(c',j')} e^{-\beta d_{c',j'}}}, \qquad (7)$$
where
$$d_{c,j} = v_{c,j} \log\!\left(\frac{v_{c,j}}{P(c)}\right) + P(c) \log\!\left(\frac{P(c)}{v_{c,j}}\right) \qquad (8)$$
is the symmetric Kullback-Leibler measure of cross entropy between the distance $v_{c,j}$ and the probability $P(c)$ of occurrence of class $c$. Here $v_{c,j} = H(x, r_j)/D(r_j)$ is the distance between the rule $r_j$ and the input $x$. The population of rules ($R + C N_a$ in total) is kept constant in size: after each generation, the rule with the lowest strength is replaced. At the end of the evolution, we remove the $N_a$ weakest rules in each class, so that the final number of rules is $R$, although some of them differ from the initial set. Fine tuning of the GA is achieved by adjusting the temperature parameter.
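The strength update (6)-(8) can be sketched as below. This is our reading of the description above rather than the authors' implementation: the normalization in (7) is taken over all rules in the population, and the clipping of $v_{c,j}$ away from zero is our own safeguard against $\log(0)$.

```python
import numpy as np

def kl_distance(v, P_c, eps=1e-9):
    """Equation (8): symmetric Kullback-Leibler measure between the
    rule-to-example distance v_{c,j} and the class prior P(c)."""
    v = np.clip(v, eps, 1.0)              # clipping away from 0 is our assumption
    return v * np.log(v / P_c) + P_c * np.log(P_c / v)

def boltzmann_probs(d, beta):
    """Equation (7): p_{c,j} proportional to exp(-beta d_{c,j}),
    normalized over the whole rule population."""
    w = np.exp(-beta * d)
    return w / w.sum()

def update_strengths(S, v, rule_classes, class_priors, true_class, beta):
    """Equation (6): S_j <- S_j +/- p_{c,j}; plus if the rule's class output
    agrees with the class of the current training example."""
    d = kl_distance(v, class_priors[rule_classes])
    p = boltzmann_probs(d, beta)
    sign = np.where(rule_classes == true_class, 1.0, -1.0)
    return S + sign * p
```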
5 Neural Network Learning

In this section, we describe the supervised learning in the perceptron layer, based on a modified training procedure that includes multi-class outputs [6]. Suppose we are given a training set consisting of pairs of a pattern vector $x$ and its corresponding class $\alpha$. Initially, the weights of the perceptron are set to random values between 0 and 1. A training pair $\{x, \alpha\}$ is then selected at random and fed to the network, through which the most probable class is predicted. For correct classification, it is desirable that the predicted class $c$ equals the actual class $\alpha$ with high certainty. Otherwise, an update of the weights takes place whenever the desired node output $z_\alpha$ fails to exceed the other outputs by a learning threshold $\theta$. The weights are then modified according to a multi-class proportional increment training procedure:
$$\Delta w_{kj} = 0 \qquad \text{if } z_k + \theta < z_\alpha, \qquad (9)$$
$$\Delta w_{kj} = -\eta\, y_j \qquad \text{if } z_k + \theta \ge z_\alpha,\ k \ne \alpha, \qquad (10)$$
$$\Delta w_{\alpha j} = m\, \eta\, y_j, \qquad (11)$$
where $m$ is the number of $k$'s that account for the modification in (10), and $\eta$ is the learning rate. The learning threshold $\theta$ is introduced to increase the stability of the network.
The above learning procedure is repeated until a satisfactory percentage of training patterns are classified correctly. This algorithm has the advantages of fast convergence and fault tolerance.
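The following sketch shows one update of the multi-class proportional increment procedure (9)-(11); the default values follow Section 6 ($\eta = 1$, $\theta = 1000$), while the loop structure and variable names are our own.

```python
import numpy as np

def train_step(W, y, true_class, eta=1.0, theta=1000.0):
    """One update of equations (9)-(11).

    W: weights of shape (C, R + 1); y: rule-layer outputs with the bias
    y_{R+1} = 1 appended; true_class: index alpha of the desired output node.
    """
    z = W @ y
    # classes whose output the desired node fails to exceed by theta (eq. 10)
    offenders = [k for k in range(len(z))
                 if k != true_class and z[k] + theta >= z[true_class]]
    m = len(offenders)
    if m == 0:
        return W                          # equation (9): no change needed
    W = W.copy()
    for k in offenders:
        W[k] -= eta * y                   # equation (10): suppress offending classes
    W[true_class] += m * eta * y          # equation (11): boost the desired class
    return W
```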
6 Diagnosis of Error Messages

In order to compare with existing expert systems on realistic data, we use error messages generated from a telephone exchange computer, indicating which circuit card is malfunctioning [1]. The training set consists of 442 samples and the test set of 112 samples. Each sample is a bit string consisting of an error vector of 122 bits ($N$) and one of 36 possible error classes ($C$). A simple measure of the performance of the system is the generalization rate. Using CART, a decision tree is constructed and pruned. The individual tree gives a generalization rate of 68.8% when unpruned and a best generalization rate of 69.6% when pruned with a factor of 0.9. A multi-class perceptron yields a generalization rate of 61.6% with learning rate $\eta = 1$ and learning threshold $\theta = 1000$; the result rises to 83.9% when the first three choices are included. Next we combine the perceptron with a pre-processing layer of CART rules obtained with pruning factor 1.3, which by itself gives only 63.4% generalization but reduces the input dimension from 122 to 72. With no GA taking part, all rules are given equal strength, $S_j = 1$, and a linear function is selected for the matching criterion (1). The resulting CART-NN hybrid network shows a performance boost, reaching 74.1% on the first prediction and 88.4% when up to the third alternative is included. This is clearly better than the individual results of CART and NN. Using GA, with initial rules generated by CART, around 70% on the first prediction is achieved with NN. We observe that the number of additional rules per class during evolution, $N_a$, has to be small in order for the perceptron to achieve good performance. Fine tuning of the temperature parameter and systematic elimination of weak rules at the end of evolution are necessary for GA to be compatible with NN. A comparison of the classification results among different methods is summarized in Table 1. The hybridization of CART and NN yields a slightly better performance than the Bayesian neural network [1].
Table 1: Classification Results of Various Techniques. The figures indicate the success rate of classification when the specified choices are included.

Choice       RBF    Multi-Layer   Bayesian    NN     CART-NN
                      Backprop                        Hybrid
Training set results (in %)
1st          61.3      91.2         80.8      92.8     82.6
1st-2nd      75.3      94.6         92.3      98.4     94.8
1st-3rd      82.3      95.1         96.1      99.5     98.2
Rest         17.7       4.9          3.9       0.5      1.8
Testing set results (in %)
1st          43.8      67.0         72.3      61.6     74.1
1st-2nd      57.2      75.9         82.1      75.9     83.0
1st-3rd      62.6      76.8         87.4      83.9     88.4
Rest         37.4      23.2         12.6      16.1     11.6
7 Conclusions

A perceptron classifier performs well when the data are separable. However, in most real-life situations, patterns have irregular distributions. Therefore a combination of different classification techniques may be required to achieve better performance. In this specific application, there is a strong correlation inherent in the data (e.g. concurrent error bits caused by the simultaneous breakdown of components). This probably accounts for the success of rule-based approaches such as CART and the Bayesian neural network. NN alone does not perform well when the input dimension is large and few training examples are available. Therefore, we use CART as a pre-processing layer of NN for dimensionality reduction and feature extraction. Other conventional ways of pre-processing, such as radial basis functions and principal component analysis, were found to be inferior to CART. On the other hand, CART provides understandable rules and easily interpreted concepts regarding the nature of the patterns. However, it may generate irrevocable errors at an early stage of the hierarchy, and improvement of the rules is hard to implement; GA has the potential to improve them. Finally, NN provides the necessary flexibility and fault tolerance to the generated rules. In conclusion, the combination of these methods gives not only a good classification success rate, but also some rules of thumb which are generally inaccessible in a neural network.
Acknowledgements
We thank Dr. A. Holst for providing us with the data on fault diagnosis of the telephone exchange computer, and Mr. D. Chan for technical support. This project is supported by the Hong Kong Telecom Institute of Information Technology, HKUST.
References
[1] A. Holst and A. Lansner, "Diagnosis of Technical Equipment using a Bayesian Neural Network," in Proceedings of the International Workshop on Applications of Neural Networks to Telecommunications, ed. by J. Alspector, R. Goodman, T.X. Brown, pp. 147-153, 1993.
[2] R.M. Goodman, C.M. Higgins, J.W. Miller and P. Smyth, "Rule-Based Neural Networks for Classification and Probability Estimation," Neural Computation, vol. 4, no. 6, pp. 781-804, 1992.
[3] L. Breiman, J.H. Friedman, R.A. Olshen and C.J. Stone, Classification and Regression Trees, Pacific Grove, CA: Wadsworth, 1984.
[4] J.D. Schaffer and A. Morishima, "An Adaptive Crossover Distribution Mechanism for Genetic Algorithms," in Genetic Algorithms and their Applications: Proceedings of the Second International Conference on Genetic Algorithms, p. 36, 1987.
[5] S.J. Louis and G.J.E. Rawlins, "Syntactic Analysis of Convergence in Genetic Algorithms," in Foundations of Genetic Algorithms 2, ed. by L.D. Whitley, San Mateo, CA: Morgan Kaufmann, p. 141, 1993.
[6] J. Sklansky and G.N. Wassel, Pattern Classifiers and Trainable Machines, New York: Springer-Verlag, 1981.
(to appear in Proceedings of the International Workshop on Applications of Neural Networks to Telecommunications, Stockholm, May 22-24, 1995)