Improvement of Decision Accuracy Using Discretization of Continuous Attributes

QingXiang Wu1,2,3, David Bell2, Martin McGinnity3, Girijesh Prasad3, Guilin Qi2, and Xi Huang1

1 School of Physics and OptoElectronic Technology, Fujian Normal University, Fuzhou, Fujian, China {qxwu, xihuang}@fjnu.edu.cn
2 School of Computer Science, Queen's University, Belfast, BT7 1NN, UK {Q.Wu, DA.Bell}@qub.ac.uk
3 School of Computing and Intelligent Systems, University of Ulster at Magee, Londonderry, BT48 7JL, N. Ireland, UK {Q.Wu, G.Prasad, TM.McGinnity}@ulster.ac.uk
Abstract. The naïve Bayes classifier has been widely applied to decision-making and classification. Because the naïve Bayes classifier prefers to deal with discrete values, a novel discretization approach is proposed in this paper to improve the naïve Bayes classifier and enhance decision accuracy. Based on the statistical information used by the naïve Bayes classifier, a distributional index is defined in the new discretization approach. The distributional index can be applied to find a good solution for the discretization of continuous attributes, so that the naïve Bayes classifier can reach high decision accuracy for instance information systems with continuous attributes. Experimental results on benchmark data sets show that the naïve Bayes classifier with the new discretizer can reach higher accuracy than the C5.0 tree. Keywords: Decision-making, Classification, Naive Bayes Classifier, Discretizer.
1 Introduction

The naïve Bayes classifier [1,2] is a highly practical machine learning method and has been widely applied to decision-making and classification. Since the original naïve Bayes classifier encounters problems when the sample set is small, virtual samples have been applied in [1,3]. Based on experiments on benchmark data sets from the UCI Machine Learning Repository, a modified naïve Bayes classifier with an empirical formula is proposed in this paper. Because this classifier prefers to deal with symbolic data, a transformation from continuous data to symbolic data is required. This transformation is also called a continuous attribute discretizer. Two classes of approaches (unsupervised discretizers and supervised discretizers) have been surveyed and proposed in [4,5,6]. Based on the statistical information used in the naïve Bayes classifier, a new adaptive discretizer is proposed to solve this data type transformation problem. For this, a dichotomic entropy is defined and applied to determine the splitting point within an interval. Based on the decision distribution and the value distribution, a compound distributional index is defined to guide selection of the interval
that should be split. The discretizer can adaptively discretize any continuous attribute according to adaptive rules based on minimal dichotomic entropy and maximal compound distributional index. As a set of optimal value intervals is obtained by the discretizer, decision accuracy is improved when the naïve Bayes classifier is applied. The discretizer can also be applied to other knowledge discovery approaches for the discretization of continuous attributes. In Section 2, a modified naïve Bayes classifier is proposed. In Section 3, a dichotomic entropy, a compound distributional index, and adaptive rules are defined, and an example is used to illustrate the algorithm. Experimental results and analysis are given in Section 4. Section 5 presents conclusions.
2 Modified Naïve Bayes Classifier

Following the notation in [3,8], let H = < U, A > represent an information system, where U = {o1, o2, …, oi, …, on} is a finite non-empty set, called an object space or universe, and oi is called an object. Each object has a finite non-empty set of attributes A = {a1, a2, a3, …, ai, …, am}, where m is the number of attributes. An instance information system is defined to distinguish an information system with decision attributes from a general information system, and an instance is defined to distinguish an object with decision values from general objects. Let I = < U, A∪D > represent an instance information system, where U = {u1, u2, …, ui, …, un} is a finite non-empty set, called an instance space or universe, ui is called an instance in U, and n is the number of instances. Each instance has a set of attributes A and decision values D. D is a non-empty set of decision attributes or class attributes, and A∩D = ∅. As each instance u has a set of attribute values, a(u) represents the value of attribute a obtained by applying an operation on instance u; in other words, a(u) is the value of attribute a of instance u. The domain Va is defined as

Va = {a(u) : u ∈ U} for a ∈ A.        (1)
For a given universe U, all attribute domains can be obtained according to (1). For an instance information system, the domain of the decision attribute or class attribute is defined as

Vd = {d(u) : u ∈ U} for d ∈ D.        (2)
The condition vector space, which is generated from the attribute domains Va, is denoted by

V×A = ×_{a∈A} Va = Va1 × Va2 × … × Va|A|,    |V×A| = ∏_{i=1}^{|A|} |Vai|,        (3)

where |V×A| is the size of the condition vector space. The decision vector space, which is generated from the decision domain (or class domain) Vd, is denoted by
V×D = ×_{d∈D} Vd = Vd1 × Vd2 × … × Vd|D|,    |V×D| = ∏_{i=1}^{|D|} |Vdi|,        (4)
where |V×D| is the size of the decision vector space. A conjunction of attribute values for an instance, corresponding to a condition vector in the condition vector space, is denoted by

A(u) = [a1(u), a2(u), …, a|A|(u)].        (5)

Let A_U represent the set of condition vectors that exist in the instance information system:

A_U = {A(u) : u ∈ U}.        (6)
If |A_U| = |V×A|, the system is called a complete instance system. In the real world, training sets for decision-making or classification are rarely complete instance systems. In order to illustrate the algorithms in this paper, Table 1 is taken as an example of an instance information system.

Table 1. Example instance information system

  U     a1   a2   a3   a4   d
  u1    1    1    1    4    +
  u2    1    2    3    3    -
  u3    2    3    1    4    +
  u4    2    4    2    1    -
  u5    3    4    2    2    -
  u6    4    4    2    3    +
  u7    4    3    3    3    -
  u8    5    2    2    4    +
  u9    6    1    1    4    +
  u10   7    1    2    3    +
  u11   7    2    3    1    -
  u12   7    3    3    2    -
In this instance information system, there are 4 attributes A = {a1, a2, a3, a4}, 12 instances U = {u1, u2, …, u12}, and one decision attribute with 2 values Vd = {+, -}. The value domains for each attribute are as follows: Va1 = {1,2,3,4,5,6,7}, |Va1| = 7; Va2 = {1,2,3,4}, |Va2| = 4; Va3 = {1,2,3}, |Va3| = 3; Va4 = {1,2,3,4}, |Va4| = 4. The size of the condition vector space is |V×A| = 7 × 4 × 3 × 4 = 336. The number of condition vectors appearing in the table is |A_U| = 12. Clearly, 324 possible condition
vectors (or conjunctions of attribute values) do not appear in Table 1. Table 1 is therefore not a complete instance information system; it has 324 unseen instances. A classifier should be able to extract rules from such an incomplete training set and classify all instances, including the 324 unseen ones. According to the Bayes classifier, the most probable decision can be expressed as follows:

d_MP = arg max_{di∈Vd} P(di | A).        (7)
This expression can be rewritten by means of Bayes' theorem:

d_MP = arg max_{di∈Vd} P(A | di) P(di) / P(A) = arg max_{di∈Vd} P(A | di) P(di).        (8)
Unseen instances cannot be classified by rules based on Equation (8), because the condition vector A for an unseen instance does not appear in the training set and thus P(A | di) cannot be obtained from the training set. In order to classify unseen instances, it is assumed that attribute values are conditionally independent given the decision value, i.e.

P(A | di) = P(a1, a2, …, a|A| | di) = ∏_j P(aj | di).        (9)
And so the naïve Bayes classifier [7] is obtained:

d_MP = arg max_{di∈Vd} P(di) ∏_j P(aj | di).        (10)
In order to represent Equation (10) with distribution numbers and share the statistical information with a discretizer, a set of statistical numbers is defined as follows. Suppose that there is an instance information system I = < U, A∪D >. Let N_{dk} represent the number of instances with decision value dk:

N_{dk} = |{u : d(u) = dk for all u ∈ U}|.        (11)
Let N_{dk,ai,vx} represent the number of instances with decision value dk and attribute value vx ∈ Vai:

N_{dk,ai,vx} = |{u : d(u) = dk and ai(u) = vx for all u ∈ U}|.        (12)
Let N_{ai,vx} represent the number of instances with attribute value vx ∈ Vai over all decisions dx ∈ Vd:

N_{ai,vx} = |{u : ai(u) = vx for all u ∈ U}|.        (13)
For example, the value number distribution for Table 1 is shown in Table 2. The number N_{dk,ai,vx} is the basic distribution number. Based on N_{dk,ai,vx}, the numbers N_{dk} and N_{ai,vx} can be calculated by the following expressions.
N_{dk} = ∑_{vx∈Vai} N_{dk,ai,vx} for any ai.        (14)

N_{ai,vx} = ∑_{dx∈Vd} N_{dx,ai,vx}.        (15)
Table 2. Distribution of value numbers N_{dk,ai,vx} for vx ∈ Vai

  Decision   Attribute   Domain Va        vx1  vx2  vx3  vx4  vx5  vx6  vx7   N_{dk}
  d1 = '+'   a1          {1, 2, …, 7}     1    1    0    1    1    1    1     6
             a2          {1, 2, 3, 4}     3    1    1    1    --   --   --
             a3          {1, 2, 3}        3    3    0    --   --   --   --
             a4          {1, 2, 3, 4}     0    0    2    4    --   --   --
  d2 = '-'   a1          {1, 2, …, 7}     1    1    1    1    0    0    2     6
             a2          {1, 2, 3, 4}     0    2    2    2    --   --   --
             a3          {1, 2, 3}        0    2    4    --   --   --   --
             a4          {1, 2, 3, 4}     2    2    2    0    --   --   --
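For illustration, the distribution numbers of Equations (11)-(13) can be accumulated in a single pass over Table 1, and Equations (14) and (15) then follow by summing the basic counts. The following Python sketch is ours, not the paper's; the helper name count_distribution is hypothetical.

from collections import Counter

# Table 1: each instance is (a1, a2, a3, a4, decision)
TABLE1 = [
    (1, 1, 1, 4, '+'), (1, 2, 3, 3, '-'), (2, 3, 1, 4, '+'), (2, 4, 2, 1, '-'),
    (3, 4, 2, 2, '-'), (4, 4, 2, 3, '+'), (4, 3, 3, 3, '-'), (5, 2, 2, 4, '+'),
    (6, 1, 1, 4, '+'), (7, 1, 2, 3, '+'), (7, 2, 3, 1, '-'), (7, 3, 3, 2, '-'),
]

def count_distribution(instances):
    """Return (N_dk, N_dk_ai_vx, N_ai_vx) as Counters keyed by
    dk, (dk, i, vx) and (i, vx), where i is the attribute index."""
    n_dk = Counter()            # Equation (11)
    n_dk_ai_vx = Counter()      # Equation (12), the basic distribution number
    n_ai_vx = Counter()         # Equation (13)
    for *attrs, dk in instances:
        n_dk[dk] += 1
        for i, vx in enumerate(attrs):
            n_dk_ai_vx[(dk, i, vx)] += 1
            n_ai_vx[(i, vx)] += 1
    return n_dk, n_dk_ai_vx, n_ai_vx

n_dk, n_dk_ai_vx, n_ai_vx = count_distribution(TABLE1)
print(n_dk['+'], n_dk['-'])        # 6 6, the last column of Table 2
print(n_dk_ai_vx[('-', 2, 3)])     # 4 instances with d = '-' and a3 = 3
print(n_ai_vx[(0, 7)])             # 3 instances have a1 = 7, as in Equation (15)

The printed values can be checked directly against the corresponding entries of Table 2.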
Based on these distribution numbers, Equation (10) is rewritten as follows:

d_MP = arg max_{dk∈Vd} (N_{dk} / |U|) ∏_i (N_{dk,ai,ai(u)} / N_{dk}),        (16)
where |U| is the total number of instances in the information system. Considering virtual samples [1], a modified naïve Bayes classifier is proposed as follows:

d_MP = arg max_{dk∈Vd} (N_{dk} / |U|) ∏_i ((N_{dk,ai,ai(u)} + β·|U|) / (N_{dk} + β·|U|·|Vai|)),        (17)

where β is a small constant; a typical value β = 0.02 is chosen in our experiments. By tuning this constant, a high accuracy can be obtained. Note that Equation (17) is different from the formula in [1]: the attribute value number |Vai| and the instance number |U| are taken into account here so that high accuracy can be obtained. Distribution numbers are used instead of probabilities because the numbers can be updated easily when a new instance is added to the instance information system. Suppose that a new instance u13 = (a1=7, a2=3, a3=3, a4=1, d='-') is added to the instance information system. Only 5 numbers need to be updated, as follows: N_{d-,a1,vx7} ⇒ 3; N_{d-,a2,vx3} ⇒ 3; N_{d-,a3,vx3} ⇒ 5; N_{d-,a4,vx1} ⇒ 3; N_{d-} ⇒ 7. By means of this update approach, any number of new instances can be added online. Equation (17) together with the distribution of value numbers shown in Table 2 can be regarded as a type of knowledge for decision-making. The advantage of this representation is that it enables an intelligent system to update its knowledge in a changing environment. A sketch of this decision rule and update scheme is given below.
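The following Python sketch is illustrative only; the class name NaiveBayesCounts and its methods are ours, not from the paper. It stores the distribution numbers, supports the online count update described above, and applies Equation (17) with β = 0.02.

from collections import Counter

class NaiveBayesCounts:
    def __init__(self, beta=0.02):
        self.beta = beta
        self.n_dk = Counter()        # N_{dk}
        self.n_dk_ai_vx = Counter()  # N_{dk,ai,vx}
        self.domains = {}            # attribute index -> set of seen values
        self.n_total = 0             # |U|

    def add_instance(self, attrs, dk):
        """Online update: only the counters touched by the new instance change."""
        self.n_total += 1
        self.n_dk[dk] += 1
        for i, vx in enumerate(attrs):
            self.n_dk_ai_vx[(dk, i, vx)] += 1
            self.domains.setdefault(i, set()).add(vx)

    def classify(self, attrs):
        """Equation (17): argmax over dk of (N_dk/|U|) times the smoothed product."""
        best, best_score = None, float('-inf')
        for dk, n_dk in self.n_dk.items():
            score = n_dk / self.n_total
            for i, vx in enumerate(attrs):
                num = self.n_dk_ai_vx[(dk, i, vx)] + self.beta * self.n_total
                den = n_dk + self.beta * self.n_total * len(self.domains[i])
                score *= num / den
            if score > best_score:
                best, best_score = dk, score
        return best

# Usage with the Table 1 data (the decision is the last element of each tuple)
clf = NaiveBayesCounts(beta=0.02)
for *attrs, dk in [(1, 1, 1, 4, '+'), (1, 2, 3, 3, '-'), (2, 3, 1, 4, '+'),
                   (2, 4, 2, 1, '-'), (3, 4, 2, 2, '-'), (4, 4, 2, 3, '+'),
                   (4, 3, 3, 3, '-'), (5, 2, 2, 4, '+'), (6, 1, 1, 4, '+'),
                   (7, 1, 2, 3, '+'), (7, 2, 3, 1, '-'), (7, 3, 3, 2, '-')]:
    clf.add_instance(attrs, dk)
print(clf.classify((7, 3, 3, 1)))   # the attribute values of u13; expected to favour '-'

Because only counters are stored, adding u13 afterwards would touch exactly the five numbers listed above, and no probabilities need to be recomputed.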
3 Discretization of Continuous Attributes

3.1 Definition of Dichotomic Entropy
In order to share statistical information with the naïve Bayes classifier, a discretizer based on a distributional index and the minimum of a dichotomic entropy is proposed to discretize continuous attributes, instead of the existing discretization approaches [5-7]. A compound index is obtained from the distribution numbers to guide interval splitting. Let vx ∈ Vai be a value of continuous attribute ai, and let N_{dk,ai,vx} represent the number of instances with decision value dk ∈ Vd and value vx for attribute ai. Suppose that continuous attribute ai is split by a border value vbd. The number of instances with decision dk and value ai(u) ≤ vbd is represented by N_{dk,left}:

N_{dk,left} = ∑_{vx ≤ vbd} N_{dk,ai,vx}.        (18)
The number of instances with value ai(u) ≤ vbd over all decisions dk ∈ Vd is represented by N_{ai,left}:

N_{ai,left} = ∑_{dk∈Vd} N_{dk,left}.        (19)
The number of instances with decision dk and value ai(u) > vbd is represented by N_{dk,right}:

N_{dk,right} = ∑_{vx > vbd} N_{dk,ai,vx}.        (20)
The number of instances with value ai(u) > vbd over all decisions dk ∈ Vd is represented by N_{ai,right}:

N_{ai,right} = ∑_{dk∈Vd} N_{dk,right}.        (21)
In order to indicate the instance number and the degree of homogeneity over the decision space within an attribute value interval, a decision distributional index is defined as follows:

Ed(vstart→vend) = ∑_{dk∈Vd} −N_{dk,vstart→vend} · log2(N_{dk,vstart→vend} / N_{ai,vstart→vend}),        (22)

where N_{dk,vstart→vend} represents the number of instances with decision value dk and attribute value between vstart and vend, and N_{ai,vstart→vend} represents the number of instances with attribute value between vstart and vend over all decision values. This decision distributional index has two properties: the larger the number of instances within the interval, the larger the index; and the more homogeneous the decision distribution, the larger the index. Based on this concept, two decision distributional indexes can be obtained when an interval is split. A left decision distributional index is given by the following expression:
Eleft(vx ≤ vbd) = ∑_{dk∈Vd} −N_{dk,left} · log2(N_{dk,left} / N_{ai,left}).        (23)

A right decision distributional index is given by the following expression:

Eright(vx > vbd) = ∑_{dk∈Vd} −N_{dk,right} · log2(N_{dk,right} / N_{ai,right}).        (24)
A dichotomic entropy for splitting point vbd is defined as

E(vbd) = (1/|U|) Eleft(vx ≤ vbd) + (1/|U|) Eright(vx > vbd),        (25)
where |U| is the total number of instances. According to information entropy principles in machine learning theory [1,4,5,7,8], the smaller the entropy, the better the attribute discretization. Applying Equation (25), a border value vborder can be obtained by minimizing the dichotomic entropy:

vborder = arg min_{vbd∈Vai} [ (1/|U|) Eleft(vx ≤ vbd) + (1/|U|) Eright(vx > vbd) ].        (26)
In other words, the minimal entropy is obtained when vborder is used to split the attribute into two intervals. A short sketch of this split search is given below.
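The following Python sketch illustrates Equations (18)-(26); the function names are ours, not the paper's. It evaluates the dichotomic entropy of every candidate border value of one attribute and returns the border with minimal entropy.

import math
from collections import Counter

def dichotomic_entropy(values, decisions, v_bd):
    """E(v_bd) of Equation (25) for one attribute, given parallel lists of
    attribute values and decision labels."""
    n = len(values)
    left = Counter(d for v, d in zip(values, decisions) if v <= v_bd)   # N_{dk,left}
    right = Counter(d for v, d in zip(values, decisions) if v > v_bd)   # N_{dk,right}

    def side_index(counts):
        total = sum(counts.values())          # N_{ai,left} or N_{ai,right}
        return sum(-c * math.log2(c / total) for c in counts.values())

    e_left = side_index(left) if left else 0.0      # Equation (23)
    e_right = side_index(right) if right else 0.0   # Equation (24)
    return (e_left + e_right) / n                   # Equation (25)

def best_border(values, decisions):
    """Equation (26): candidate borders are the distinct attribute values
    (excluding the maximum, which would leave the right side empty)."""
    candidates = sorted(set(values))[:-1]
    return min(candidates, key=lambda v: dichotomic_entropy(values, decisions, v))

# Example with attribute a1 and the decisions of Table 1
a1 = [1, 1, 2, 2, 3, 4, 4, 5, 6, 7, 7, 7]
d  = ['+', '-', '+', '-', '-', '+', '-', '+', '+', '+', '-', '-']
print(best_border(a1, d))   # border value of a1 with minimal dichotomic entropy

As a sanity check, splitting a1 at vbd = 2 leaves two '+' and two '-' instances on the left and four of each on the right, so E(2) = (4 + 8)/12 = 1 bit, the largest possible value for a two-class split; the border returned above achieves a strictly smaller entropy.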
3.2 Continuous Attribute Discretizer

Applying Equation (26), a continuous attribute can be split into two intervals. One of the two intervals can then be selected and split into two further intervals in the same way, so that the total number of intervals becomes 3. In this manner, a continuous attribute can be split into any desired number of intervals. Here two questions have to be answered: Which interval needs to be split further? How many intervals are best for decision-making? In principle, an interval with an inhomogeneous value distribution and a large number of instances should be split. In order to obtain a quantitative criterion, a distributional index for the instance number distribution over the attribute values and the decision space within an interval, called a value distributional index, is defined as follows:

Ev(vstart→vend) = ∑_{vstart ≤ vx < vend} ∑_{dk∈Vd} −N_{dk,ai,vx} · log2(N_{dk,ai,vx} / N_{ai,vx}).        (27)
It is obvious that Ev(vstart→vend) is small if the distribution varies rapidly with the value vx, and Ev(vstart→vend) is large if the distribution varies slowly with vx, in other words, if the distribution over the values vx is homogeneous. Based on the difference Ed(vstart→vend) − Ev(vstart→vend), a compound distributional index is defined as follows:

ΔE(vstart→vend) = (Ed(vstart→vend) − Ev(vstart→vend)) / |U|,        (28)
where |U| is the total number of instances in the instance information system. Dividing by |U| ensures that ΔE(vstart→vend) is a real value within [0,1] and that ΔE(vstart→vend) decreases monotonically as the instance number of the interval decreases, i.e. as the interval becomes smaller. This compound distributional index can be applied as a criterion to determine whether an interval should be split further. The interval with the largest ΔE(vstart→vend) is selected for further splitting in the adaptive discretizer. If ΔE(vstart→vend) = 0, i.e. the value distribution is homogeneous with respect to vx, the interval does not need to be split further. In most cases, 5 intervals are enough to reach high decision accuracy in our experiments. Therefore, the adaptive discretizer stops when the number of intervals reaches five or when ΔE(vstart→vend) is less than a threshold. These adaptive rules are thus very different from the approach in [4]. The value of ΔE(vstart→vend) is applied to select the interval that should be split. The formal algorithm is as follows.

A1. Algorithm for discretization of a continuous attribute
1   Calculate the distribution numbers according to Equations (11)-(15) to obtain the
    distribution numbers over the sampled values and over the decision space.
    The results have the form of Table 2.
2   Calculate the dichotomic entropy and determine the splitting point.
2.1 Initial values:
    interval control number n = 0; vstart = vmin; vend = vmax;
    splitting point sequence list S_list = [vmin, vmax].
2.2 Determine the splitting point:
    vborder-n = arg min_{vbd∈Vai} E(vbd)
              = arg min_{vbd∈Vai} [ (1/N_{ai,vstart→vend}) Eleft(vx ≤ vbd)
                                  + (1/N_{ai,vstart→vend}) Eright(vx > vbd) ].
2.3 Add the splitting point to the splitting point sequence list:
    S_list = [vmin, vborder-n, vmax].
3   Select an interval for further splitting.
3.1 Calculate the compound index:
    Ev(vstart→vend) = ∑_{vstart ≤ vx < vend} ∑_{dk∈Vd} −N_{dk,ai,vx} · log2(N_{dk,ai,vx} / N_{ai,vx}),
    ΔE(vstart→vend) = (Ed(vstart→vend) − Ev(vstart→vend)) / |U|.
3.2 Record the compound index for each interval:
    C-index-list = [ ΔE(vmin→vborder-n), ΔE(vborder-n→vmax) ].
3.3 Adaptive rule control:
    n = n + 1;
    if ΔEmax < 0.0001 then end program;
    if n > 5 then end program.
3.4 Select the interval with maximal ΔEmax, set the parameters
    vstart = vstart-with-max; vend = vend-with-max,
    and go to 2.2.
After running this algorithm, a splitting point sequence list is obtained.
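The following Python sketch assembles the steps of Algorithm A1 under stated assumptions: intervals are treated as half-open ranges, candidate cut points are midpoints between consecutive distinct values, and all function names are ours rather than the authors'. It repeatedly splits the interval with the largest compound index ΔE (Equation (28)) at the point of minimal dichotomic entropy, stopping at five intervals or when ΔE drops below a threshold.

import math
from collections import Counter

def delta_e(values, decisions, lo, hi, n_total):
    """Compound distributional index ΔE(lo→hi) of Equations (22), (27), (28)."""
    pairs = [(v, d) for v, d in zip(values, decisions) if lo <= v < hi]
    if not pairs:
        return 0.0
    n_int = len(pairs)                                       # N_{ai, lo→hi}
    dec = Counter(d for _, d in pairs)                       # N_{dk, lo→hi}
    e_d = sum(-c * math.log2(c / n_int) for c in dec.values())        # Eq. (22)
    per_value = Counter((v, d) for v, d in pairs)            # N_{dk,ai,vx}
    value_tot = Counter(v for v, _ in pairs)                 # N_{ai,vx}
    e_v = sum(-c * math.log2(c / value_tot[v])
              for (v, _), c in per_value.items())            # Eq. (27)
    return (e_d - e_v) / n_total                             # Eq. (28)

def best_split(values, decisions, lo, hi):
    """Cut point inside [lo, hi) minimizing the dichotomic entropy of the interval."""
    pairs = [(v, d) for v, d in zip(values, decisions) if lo <= v < hi]
    distinct = sorted({v for v, _ in pairs})
    if len(distinct) < 2:
        return None

    def entropy(cut):
        sides = [Counter(d for v, d in pairs if v < cut),
                 Counter(d for v, d in pairs if v >= cut)]
        return sum(-c * math.log2(c / sum(s.values()))
                   for s in sides for c in s.values()) / len(pairs)

    cuts = [(x + y) / 2 for x, y in zip(distinct[:-1], distinct[1:])]
    return min(cuts, key=entropy)

def discretize(values, decisions, max_intervals=5, threshold=1e-4):
    """Return the sorted splitting point list (the S_list of Algorithm A1)."""
    n_total = len(values)
    lo, hi = min(values), max(values) + 1      # half-open working range
    splits = [lo, hi]
    while len(splits) - 1 < max_intervals:
        intervals = list(zip(splits[:-1], splits[1:]))
        a, b = max(intervals,
                   key=lambda iv: delta_e(values, decisions, iv[0], iv[1], n_total))
        if delta_e(values, decisions, a, b, n_total) < threshold:
            break                              # adaptive rule: interval homogeneous enough
        cut = best_split(values, decisions, a, b)
        if cut is None:
            break
        splits = sorted(set(splits + [cut]))
    return splits

a1 = [1, 1, 2, 2, 3, 4, 4, 5, 6, 7, 7, 7]
d  = ['+', '-', '+', '-', '-', '+', '-', '+', '+', '+', '-', '-']
print(discretize(a1, d))   # splitting point list for attribute a1 of Table 1

Applying the same routine to each continuous attribute of a data set yields the interval boundaries that the classifier of Section 2 then treats as symbolic values.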
4 Experimental Results

In order to test the improvement obtained with the adaptive discretizer, the improved naïve Bayes classifier was applied to 16 benchmark data sets from the UCI Machine Learning Repository. The decision accuracies under the ten-fold cross-validation standard are given in the column Bayes of Table 3. Subcolumn Org lists decision accuracies for the classifier without the adaptive discretizer, subcolumn Dich lists decision accuracies for the classifier with the adaptive discretizer, and subcolumn D5 lists decision accuracies for the classifier with a 5-identical-interval discretizer. It can be seen that the modified naïve Bayes classifier with the adaptive discretizer improved the decision accuracy for 14 data sets. The average accuracy over the 16 data sets is better than those of the approaches without the adaptive discretizer and with a 5-identical-interval discretizer. The adaptive discretizer was also applied to the C5.0 tree; the results are shown in the column C5.0 tree. The average accuracy is still improved when the adaptive discretizer is attached for data preparation. The modified naïve Bayes classifier combined with the discretizer obtains a higher average accuracy than the C5.0 tree on the 16 data sets. Column Att gives the number of attributes in each data set; the string '60c60' indicates that there are 60 attributes, of which 60 are continuous. Column N gives the number of instances in each data set. Names marked with '♣' indicate that some attribute values are missing in the data set.

Table 3. Comparative results for the naïve Bayes classifier with the discretizer
  Data Name      Att     N      Bayes                   C5.0 tree
                                Org    Dich   D5        Org    Dich   D5
  Sonar          60c60   208    64.0   89.9   81.7      71.0   74.0   73.0
  Horse-colic♣   27c7    300    73.7   75.0   73.7      78.3   80.0   80.7
  Ionosphere     34c34   351    82.6   90.3   88.6      88.3   91.5   89.5
  Wine           13c13   178    98.3   98.3   96.1      93.2   97.7   95.5
  Crx_data♣      15c6    690    81.0   86.5   85.0      85.1   86.4   84.1
  Heart          13c6    270    76.3   84.4   84.8      78.1   76.3   79.3
  Hungarian♣     13c6    294    83.3   83.7   83.7      79.2   80.9   78.2
  SPECTF         44c44   80     66.3   88.8   75.0      70.0   76.2   68.8
  Astralian      14c6    690    85.2   85.5   86.5      83.2   85.9   84.5
  Echocard♣      12c8    132    60.6   73.5   68.0      68.2   57.5   69.0
  Bupa           6c6     345    64.4   70.2   67.0      67.2   65.2   66.4
  Iris_data      4c4     150    94.0   96.7   82.0      96.7   96.7   96.0
  Ecoli          6c6     336    71.5   75.3   75.0      71.5   83.6   82.4
  Anneal♣        38c6    798    85.4   86.7   86.2      98.6   98.6   98.6
  Hepatitis♣     19c6    155    69.6   70.8   69.6      62.0   68.5   67.3
  Bands♣         39c20   540    64.8   66.3   68.5      68.0   68.3   69.4
  Average                       76.9   82.6   79.5      78.7   80.2   80.5
Table 4. Comparison with other existing approaches

  Data / Approach   Bayes-Dich   CLIP4-CAIM   C4.5-MChi   Chi-C4.5   MDLPC-C4.5
  Iris              96.7         92.7         94.7        94.0       94.0
  Heart             84.4         79.3         32.9        55.1       54.5
Since it is an NP-hard problem to find the best solution for discretization, the proposed discretizer gives a good solution with low time complexity. Therefore, for a few data sets, for example Heart and Bands, the decision accuracies are slightly lower than those of the 5-identical-interval discretizer. Results for comparison with other existing approaches [5] are shown in Table 4.
5 Conclusion

A dichotomic entropy is defined, and the minimal dichotomic entropy can be applied to determine the border value for splitting an interval. A compound index, composed of a decision distributional index and a value distributional index, is defined and applied to guide selection of the interval that should be split during discretization. Based on these concepts, a continuous attribute is first split into two intervals at the border point with minimal dichotomic entropy, and then the compound index is applied to select an interval to split into further intervals until the desired number of intervals is reached or the compound index is small enough. Applying the improved naïve Bayes classifier with the adaptive discretizer to 16 benchmark data sets, the experimental results show that the average accuracy is improved to varying degrees. The improved naïve Bayes classifier with the adaptive discretizer obtained a higher average accuracy than the C5.0 tree.
References
1. Mitchell, T.: Machine Learning. McGraw Hill, co-published by the MIT Press Companies, Inc. (1997)
2. Rish, I.: An Empirical Study of the Naive Bayes Classifier. IJCAI-01 Workshop on Empirical Methods in Artificial Intelligence (2001)
3. Wu, Q.X., Bell, D.A., McGinnity, T.M.: Multi-knowledge for Decision Making. International Journal of Knowledge and Information Systems, Springer-Verlag, 2 (2005) 246-266
4. Wu, X.: A Bayesian Discretizer for Real-Valued Attributes. The Computer Journal, 39(8) (1996) 688-691
5. Kurgan, L.A., Cios, K.J.: CAIM Discretization Algorithm. IEEE Transactions on Knowledge and Data Engineering, 16(2) (2004) 145-153
6. Dougherty, J., Kohavi, R., Sahami, M.: Supervised and Unsupervised Discretization of Continuous Features. Proc. of the International Conference on Machine Learning (1995) 194-202
7. Wu, Q.X., Bell, D.A.: Multi-Knowledge Extraction and Application. In: Wang, G.Y., Liu, Q., Yao, Y.Y., Skowron, A. (eds.) Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing, LNAI 2639, Springer, Berlin (2003) 274-279
8. Quinlan, J.R.: Induction of Decision Trees. Machine Learning, 1(1) (1986) 81-106