A Kernel-Based Semi-Naive Bayesian Classifier Using P-Trees

Anne Denton, William Perrizo
Department of Computer Science, North Dakota State University
Fargo, ND 58105-5164, USA
{anne.denton, william.perrizo}@ndsu.nodak.edu

Patents are pending on the P-tree technology. This work is partially supported by GSA Grant ACT#: K96130308.

Abstract

A novel semi-naive Bayesian classifier is introduced that is particularly suitable to data with many attributes. The naive Bayesian classifier is taken as a starting point, and correlations are reduced by joining highly correlated attributes. Our technique differs from related work in its use of kernel functions that systematically include continuous attributes rather than relying on discretization as a preprocessing step. This retains distance information within the attribute domains and ensures that attributes are joined based on their correlation at the particular values of the test sample. We implement a kernel-based semi-naive Bayesian classifier using P-trees and demonstrate that it generally outperforms the naive Bayesian classifier as well as a discrete semi-naive Bayesian classifier.

Keywords: Bayesian classifiers, semi-naive Bayes, scalable algorithms, correlations, kernel methods, P-trees.

1. INTRODUCTION
One of the main challenges in data mining is handling data with many attributes. The volume of the space that is spanned by all attributes grows exponentially with the number of attributes, and the density of training points decreases accordingly. This phenomenon is also termed the curse of dimensionality [1]. Many current problems, such as DNA sequence analysis and text analysis, suffer from this problem (see the "spam" data set from [2], discussed below). Classification techniques that estimate a class label based on the density of training points, such as nearest-neighbor classifiers, have to base their decision on far-distant neighbors. Decision tree algorithms can only use a few attributes before the small number of data points in each
branch makes results insignificant. This means that the potential predictive power of all other attributes is lost. A classifier that suffers relatively little from high dimensionality is the naive Bayesian classifier. Despite its name and its reputation of being trivial, it can lead to surprisingly high accuracy, even in cases in which the assumption of independence of attributes, on which it is based, is no longer valid [3]. Other classifiers, namely generalized additive models [4], have been developed that make similar use of the predictive power of large numbers of attributes and improve on the naive Bayesian classifier. These classifiers, however, require an optimization procedure that is computationally unacceptable for many data mining applications.

Classifiers that require an eager training or optimization step are particularly ill suited to settings in which the training data changes continuously. Such a situation is common for sliding-window approaches in data streams, in which old data is discarded at the rate at which new data arrives [5]. We introduce a lazy classifier that does not require a training phase. Data that is no longer considered accurate can be eliminated or replaced without the need to recreate a classifier or redo an optimization procedure.

Our classifier improves on the accuracy of the naive Bayesian classifier by treating strongly correlated attributes as one. Approaches that aim at improving the validity of the naive assumption through joining of attributes are commonly referred to as semi-naive Bayesian classifiers [6-8]. Kononenko originally proposed this idea [6], and Pazzani [7] more recently evaluated Cartesian product attributes in a wrapper approach. In previous work, continuous attributes were intervalized as a preprocessing step, significantly limiting the usefulness of the classifier for continuous data. Other classifiers that improve on the naive Bayesian classifier include Bayesian network and augmented Bayesian classifiers [9-11]. These classifiers use a tree- or network-like relationship between attributes to account for correlations. Such classifiers commonly assume that correlations are determined for attributes as a whole, but generalizations that consider specific instances are also discussed [9]. They do, however, all
discretize continuous attributes, and thereby lose distance information within attributes.

Our approach is founded on a more general definition of the naive Bayesian classifier that involves kernel density estimators to compute probabilities [4]. We introduce a kernel-based correlation function and join attributes when the value of the correlation function at the location of the test sample exceeds a predefined threshold. We show that in this framework it is sufficient to define the kernel function of joined attributes without stating the shape of the joined attributes themselves. No information is lost in the joining process. The benefits of a kernel-based definition of density estimators are thereby fully extended to the elimination of correlations. In contrast to most other techniques, attributes are only joined if their values are correlated at the location of the test sample. An example of the impact of a local correlation definition could be the classification of e-mail messages based on author age and message length. These two attributes are probably highly correlated if the author is a young child, i.e., they should be joined if the age and message length of the test sample are both very small. For other age groups there is probably little basis for considering those attributes combined.

Evaluation of kernel functions requires the fast computation of counts, i.e., of the number of records that satisfy a given condition. We use compressed, bit-column-oriented data structures, namely P-trees, to represent the data [12-16]. This allows us to efficiently evaluate the number of data points that satisfy particular conditions on the attributes. Speed is further increased through a distance function that leads to intervals for which counts can be calculated particularly easily, namely the HOBbit distance [12]. Intervals are weighted according to a Gaussian function.

Section 2 presents the concept of kernel density estimation in the context of Bayes' theorem and the naive Bayesian classifier. Section 2.1 defines correlation functions between attributes, Section 2.2 summarizes P-tree properties, Section 2.3 shows how the HOBbit distance can be used in kernel functions, and Section 2.4 combines the components into the full algorithm. Section 3 presents our results, with Section 3.1 introducing the data sets and Sections 3.2-3.4 discussing accuracy. Section 3.5 demonstrates the performance of our algorithm, and Section 4 concludes the paper.
2. NAIVE AND SEMI-NAIVE BAYESIAN CLASSIFIER USING KERNEL DENSITY ESTIMATION

Bayes' theorem is used in different contexts. In what is called Bayesian estimation in statistics, Bayes' theorem allows consistent use of prior knowledge of the data by transforming a prior into a posterior distribution through use of empirical knowledge of the data [1,4]. Using the simplest assumption of a constant prior distribution, Bayes' theorem leads to a straightforward relationship between conditional probabilities. Given a class label C with m classes c1, c2, ..., cm and an attribute vector x of all other attributes, the conditional probability of class label ci can be expressed as follows:
$$P(C = c_i \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid C = c_i)\, P(C = c_i)}{P(\mathbf{x})} \qquad (1)$$
P(C = ci) is the probability of class label ci and can be estimated from the data directly. The probability of a particular unknown sample, P(x), does not have to be calculated because it does not depend on the class label, and the class with the highest probability can be determined without it. Probabilities for discrete attributes can be calculated based on frequencies of occurrence. For continuous attributes different alternatives exist. We use a representation that is used by the statistics community and that is based on one-dimensional kernel density estimates, as discussed in [4]. The conditional probability P(x | C = ci) can be written as a kernel density estimate for class ci

$$P(\mathbf{x} \mid C = c_i) = f_i(\mathbf{x}) \qquad (2)$$

with

$$f_i(\mathbf{x}) = \frac{1}{N} \sum_{t=1}^{N} K_i(\mathbf{x}, \mathbf{x}_t) \qquad (3)$$
where xt are training points and Ki(x, xt) is a kernel function. Estimation of this quantity from the data would be equivalent to density-based estimation [1], which is also called Parzen window classification [4] or, when using HOBbit distances, Podium classification [13]. The naive Bayesian model assumes that for a given class the probabilities for individual attributes are independent:

$$P(\mathbf{x} \mid C = c_i) = \prod_{k=1}^{M} P(x_k \mid C = c_i) \qquad (4)$$
where xk is the kth attribute of a total of M attributes. The conditional probability P(xk | C = ci) can, for categorical attributes, simply be derived from the sample proportions. For numerical attributes, two alternatives exist to the kernel-based representation used in this paper. Intervalization leads to the same formalism for continuous as for categorical attributes and does not need to be treated separately. The traditional naive Bayesian classifier estimates probabilities by an approximation of the data through a function, such as a Gaussian distribution

$$P(x_k \mid C = c_i) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\!\left(-\frac{(x_k - \mu_i)^2}{2\sigma_i^2}\right) \qquad (5)$$
where µi is the mean of the values of attribute xk averaged over training points with class label ci and σi is the standard deviation. We use a one-dimensional kernel density estimate that comes naturally from (2):

$$f_i(\mathbf{x}) = \prod_{k=1}^{M} f_i^{(k)}\!\left(x^{(k)}\right) = \frac{1}{N} \prod_{k=1}^{M} \sum_{t=1}^{N} K_i^{(k)}\!\left(x^{(k)}, x_t^{(k)}\right) \qquad (6)$$

where the one-dimensional Gaussian kernel function is given by

$$K_{i,\mathrm{Gauss}}^{(k)}\!\left(x^{(k)}, x_t^{(k)}\right) = \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\!\left(-\frac{\left(x^{(k)} - x_t^{(k)}\right)^2}{2\sigma_k^2}\right) [C = c_i] \qquad (7)$$

with σk selected for good overall prediction accuracy. We chose σk as half of the standard deviation of attribute k. Note the difference between the Gaussian distribution function (5) and the kernel density estimate, even in the case of a Gaussian kernel function. The Gaussian kernel function evaluates the density of training points within a range of the sample point x. This does not imply that the distribution of weighted training points must be Gaussian. We show that the density-based approximation is commonly more accurate. A toy configuration can be seen in Figure 1, in which the Gaussian kernel density estimate has two peaks, whereas the Gaussian distribution function has, as always, one peak only. The data points are represented by vertical lines.

Figure 1: Comparison between Gaussian distribution function and kernel density estimate.

Categorical attributes can be discussed within the same framework. The kernel function for categorical attributes is

$$K_{i,\mathrm{Cat}}^{(k)}\!\left(x^{(k)}, x_t^{(k)}\right) = \frac{1}{N_i} \left[x^{(k)} = x_t^{(k)}\right] \qquad (8)$$

where Ni is the number of training points with class label ci, and [π] indicates that the term is 1 if the predicate π is true and 0 if it is false. We use this notation throughout the paper. The kernel density for categorical attributes is identical to the conditional probability:

$$f_i^{\mathrm{Cat}}\!\left(x^{(k)}\right) = \sum_{t=1}^{N} K_{i,\mathrm{Cat}}^{(k)}\!\left(x^{(k)}, x_t^{(k)}\right) = \sum_{t=1}^{N} \frac{1}{N_i} \left[x^{(k)} = x_t^{(k)}\right] [C = c_i] = P\!\left(x^{(k)} \mid C = c_i\right) \qquad (9)$$
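As an illustration of how equations (1), (6) and (7) can be evaluated directly from the training data, the following Java sketch (Java being the paper's implementation language; the class and field names are our own assumptions, not the authors' code) scores each class by multiplying one-dimensional Gaussian kernel density estimates with the class prior. Categorical attributes (8) and the P-tree and HOBbit machinery of the later sections are omitted here.

```java
/**
 * Minimal sketch (assumed names, not the authors' implementation) of the
 * kernel-based naive Bayesian score of equations (1), (6) and (7): for each
 * class, one-dimensional Gaussian kernel density estimates are computed per
 * attribute and multiplied under the naive independence assumption.
 */
public class KernelNaiveBayes {

    private final double[][] train;  // train[t][k]: value of continuous attribute k for training point t
    private final int[] label;       // label[t]: class index of training point t
    private final int numClasses;
    private final double[] sigma;    // bandwidth per attribute; the paper chooses half the standard deviation

    public KernelNaiveBayes(double[][] train, int[] label, int numClasses, double[] sigma) {
        this.train = train;
        this.label = label;
        this.numClasses = numClasses;
        this.sigma = sigma;
    }

    /** One-dimensional Gaussian kernel of equation (7); the class restriction is applied by the caller. */
    private static double gauss(double x, double xt, double s) {
        double d = x - xt;
        return Math.exp(-d * d / (2 * s * s)) / (Math.sqrt(2 * Math.PI) * s);
    }

    /** Returns the class maximizing P(C = c_i) * f_i(x); class-independent constants cancel in the argmax. */
    public int classify(double[] x) {
        int n = train.length;
        int best = -1;
        double bestLog = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < numClasses; i++) {
            int ni = 0;
            for (int t = 0; t < n; t++) if (label[t] == i) ni++;
            if (ni == 0) continue;
            double logScore = Math.log((double) ni / n);   // prior P(C = c_i)
            for (int k = 0; k < x.length; k++) {
                double sum = 0;                            // sum over training points of class i, eqs. (6)/(7)
                for (int t = 0; t < n; t++) {
                    if (label[t] == i) sum += gauss(x[k], train[t][k], sigma[k]);
                }
                logScore += Math.log(sum / n);             // per-attribute kernel density estimate
            }
            if (logScore > bestLog) {
                bestLog = logScore;
                best = i;
            }
        }
        return best;
    }
}
```

Working in log space merely avoids numerical underflow of the product in (6); the argmax of the class scores is unchanged.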
2.1. Correlation function of attributes
We will now go beyond the naive Bayesian approximation by joining attributes if they are highly correlated. Attributes are considered highly correlated if the product assumption in (6) can be shown to be a poor approximation. The validity of the product assumption can be verified for any two attributes a and b individually by calculating the following correlation function:

$$\mathrm{Corr}(a,b) = \frac{\displaystyle\sum_{t=1}^{N} \prod_{k=a,b} K^{(k)}\!\left(x^{(k)}, x_t^{(k)}\right)}{\displaystyle\prod_{k=a,b} \sum_{t=1}^{N} K^{(k)}\!\left(x^{(k)}, x_t^{(k)}\right)} - 1 \qquad (10)$$
where the kernel function is a Gaussian function (7) for continuous data and (8) for categorical data. If the product assumption (6) is satisfied Corr(a,b) is 0 by definition. This means that kernel density estimates can be calculated for the individual attributes a and b independently and combined using the naive Bayesian model. If, however, the product assumption is shown to be poor, i.e., the combined kernel function differs from the product of individual ones, then the two attributes will be considered together. The product assumption is shown to be poor if the correlation function between two attributes at the location of the unknown sample (10) exceeds a threshold, typically 0.05-1. The respective attributes are then considered together. The kernel function for joined attributes is
$$K^{(a,b)}\!\left(x^{(a)}, x^{(b)}, x_t^{(a)}, x_t^{(b)}\right) = \sum_{t=1}^{N} \prod_{k=a,b} K^{(k)}\!\left(x^{(k)}, x_t^{(k)}\right) \qquad (11)$$

With this definition it is possible to work with joined attributes in the same way as with the original attributes. Since all probability calculations are based on kernel functions, there is no need to explicitly state the format of the joined variables themselves. It also becomes clear how multiple attributes can be joined: the kernel function of joined attributes can be used in the correlation calculation (10) without further modification. In fact, it is possible to look at the joining of attributes as successively interchanging the summation and product in (6). The correlation calculation (10) compares the kernel density estimate at the location of the unknown sample after a potential join with that before a join. If the difference is great, a join is performed.

It is important to observe that the criterion for correlation is a function of the attribute values. If attributes are correlated overall but not for the values of a particular test sample, no join is performed. One may, e.g., expect that in a classification problem of student satisfaction with a course, class size and the ability to hear the instructor are correlated attributes. If the test sample refers to a class size of 200 students this may, however, not be the case if microphones are used for such large classrooms. This applies to both categorical and continuous attributes. An example could be the attributes hair color and age. It is probably fair to assume that hair color "gray" should have a high correlation with age, whereas little correlation is to be expected for other hair colors. Attributes should therefore only be joined if the test sample has hair color "gray".

Note that the traditional naive Bayesian classifier, in which a distribution function is used for continuous attributes (5), would lead to a correlation function that vanishes identically for two continuous attributes. The sum over all data points that is explicit in (6) is part of the definition of the mean and standard deviation for the traditional naive Bayesian classifier. Without this sum the numerator and denominator in (10) would be equal and the correlation function would vanish by definition. We will now proceed to discuss the P-tree data structure and the HOBbit distance that make the calculation efficient.
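Before turning to those data structures, the following sketch (assumed names, not the paper's code) makes the local correlation test concrete for two continuous attributes: it evaluates the correlation function (10) at the attribute values of the test sample and, above the threshold, the joined kernel density obtained by interchanging summation and product as in (11). Numerator and denominator are normalized as averages over the N training points so that independent attributes give a value near zero, as stated above; that normalization is an assumption of this sketch.

```java
/**
 * Sketch of the local correlation function, eq. (10), and of the joined kernel
 * density obtained by interchanging sum and product, eq. (11), for two
 * continuous attributes a and b. Names and the average-based normalization
 * are illustrative assumptions, not the paper's implementation.
 */
public class AttributeJoining {

    /** Gaussian kernel of equation (7); the class indicator is omitted, as described in Section 2.4. */
    private static double gauss(double x, double xt, double sigma) {
        double d = x - xt;
        return Math.exp(-d * d / (2 * sigma * sigma)) / (Math.sqrt(2 * Math.PI) * sigma);
    }

    /** Correlation of attributes a and b at the attribute values (xA, xB) of the test sample. */
    public static double corr(double[] colA, double[] colB, double xA, double xB,
                              double sigmaA, double sigmaB) {
        int n = colA.length;
        double joint = 0, margA = 0, margB = 0;
        for (int t = 0; t < n; t++) {
            double ka = gauss(xA, colA[t], sigmaA);
            double kb = gauss(xB, colB[t], sigmaB);
            joint += ka * kb;   // numerator: sum over t of the product over k = a, b
            margA += ka;        // denominator: product over k of the sums over t
            margB += kb;
        }
        return (joint / n) / ((margA / n) * (margB / n)) - 1;
    }

    /** Joined kernel density at the test sample: summation and product interchanged relative to (6). */
    public static double joinedDensity(double[] colA, double[] colB, double xA, double xB,
                                       double sigmaA, double sigmaB) {
        double sum = 0;
        for (int t = 0; t < colA.length; t++) {
            sum += gauss(xA, colA[t], sigmaA) * gauss(xB, colB[t], sigmaB);
        }
        return sum / colA.length;
    }

    /** Attributes are treated as one when the local correlation exceeds the chosen threshold. */
    public static boolean shouldJoin(double correlation, double threshold) {
        return correlation > threshold;
    }
}
```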
2.2. P-Trees
The P-tree data structure was originally developed for spatial data [10] but has been successfully applied in many contexts [14,15]. P-trees store the bit columns of the data in sequence to allow compression as well as fast evaluation of counts of records that satisfy a particular condition. Most storage structures for tables store data by rows. Each row must be scanned whenever an algorithm requires information on the number of records that satisfy a particular condition. Such database scans, which scale as O(N), are the basis for many data mining algorithms. P-tree data structures use column-wise storage in which columns are further broken up into bits. A tree-based structure replaces subtrees that consist entirely of 0 values by a higher-level "pure 0" node, and subtrees that consist entirely of 1 values by higher-level "pure 1" nodes. The number of records that satisfy a particular condition is then evaluated by a bit-wise AND on the compressed bit sequences.

Figure 2 illustrates the storage of a table with two integer attributes and one Boolean attribute. The number of records with A1 = 12 (i.e., the bit sequence 1100) is evaluated as a bit-wise AND of the two P-trees corresponding to the higher-order bits of A1 and the complements of the two P-trees corresponding to the lower-order bits. This AND operation can be done very efficiently for the first half of the data set, since the single high-level 0-bit already indicates that the condition is not satisfied for any of the records. This is the basis for a scaling better than O(N) for such operations.

Figure 2: Storage of tables as hierarchically compressed bit columns.

The efficiency of P-tree operations depends strongly on the compression of the bit sequences, and thereby on the ordering of rows. For data that shows inherent continuity, such as spatial or multimedia data, such an ordering can easily be constructed. For spatial data, for example, neighboring pixels will often represent similar color values. Traversing an image in a sequence that keeps close points close will preserve the continuity when moving from the two-dimensional image to the
one-dimensional storage representation. An example of a suitable ordering is one according to a space-filling curve such as Peano or recursive raster ordering [10]. For time sequences such as data streams the time dimension will commonly show the corresponding continuity. If data shows no natural continuity, it may be beneficial to sort it. Sorting according to full attributes is one option, but it can easily be seen that it is not likely to be the best one. Sorting the table in Figure 2 according to attribute A1 would correspond to sorting according to the first four bits of the bit-wise representation on the right-hand side. That means that the lowest-order bit of A1 affects the sorting more strongly than the highest-order bit of A2. It is, however, likely that the highest-order bit of A2 will be more descriptive of the data and will correlate more strongly with further attributes, such as A3. We therefore sort according to all highest-order bits first. Figure 2 indicates at the bottom the sequence in which bits are used for sorting.

Figure 3 shows the relationship between traversal of points in Peano order and sorting according to highest-order bits. It can be seen that the Peano order traversal of an image corresponds to sorting records according to their spatial coordinates in just this way. We will therefore refer to it as generalized Peano order sorting. A main difference between generalized Peano order sorting and the Peano order traversal of images lies in the fact that spatial data does not require storing spatial coordinates, since they are redundant. When feature attributes are used for sorting, not all value combinations exist, and the attributes that are used for sorting therefore have to be represented as P-trees. Only integer data types are traversed in Peano order, since the concept of proximity doesn't apply to categorical data. The relevance of categorical data for the ordering is chosen according to the number of attributes needed to represent it.

Figure 3 shows how two numerical attributes are traversed, with crosses representing existing data points. The attribute values are then listed in table format. The higher-order bit of attribute x is chosen as an example to demonstrate the construction of a P-tree of fan-out 2, i.e., each node has 0 or 2 children. In the P-tree graph, 1 stands for nodes that represent only 1 values, 0 stands for nodes that represent only 0 values, and m ("mixed") stands for nodes that represent a combination of 0 and 1 values. Only "mixed" nodes have children. For other nodes the data sequence can be reconstructed based on the purity information and node level alone. The P-tree example is kept simple for demonstration
purposes. The implementation has a fan-out of 16, and uses an array rather than pointers to represent the tree structure.
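The counting primitive that P-trees accelerate can be modeled with uncompressed bit columns; the sketch below (illustrative names, using java.util.BitSet) counts records matching a value, or matching only its high-order bits, by ANDing bit columns and their complements. The hierarchical "pure 0"/"pure 1" compression and the fan-out-16 array layout of the actual implementation are omitted, so only the interface is modeled, not the sub-O(N) behavior.

```java
import java.util.BitSet;

/**
 * Sketch of the counting primitive behind P-trees: each bit position of an
 * attribute is stored as a bit column, and the number of records with a given
 * value is obtained by ANDing columns (or their complements). The hierarchical
 * compression into "pure 0"/"pure 1" nodes used by real P-trees is omitted.
 */
public class BitColumnCounter {

    private final BitSet[] column;  // column[b]: bit b (0 = least significant), one bit per record
    private final int numRecords;

    public BitColumnCounter(int[] values, int numBits) {
        this.numRecords = values.length;
        this.column = new BitSet[numBits];
        for (int b = 0; b < numBits; b++) {
            column[b] = new BitSet(numRecords);
            for (int r = 0; r < numRecords; r++) {
                if (((values[r] >> b) & 1) == 1) column[b].set(r);
            }
        }
    }

    /** Number of records whose value equals v, e.g. A1 = 12 = 1100 in the example of Figure 2. */
    public int countEqual(int v) {
        return countHighOrderMatch(v, 0);
    }

    /** Number of records that agree with v in all bits above position lowBits (a HOBbit interval). */
    public int countHighOrderMatch(int v, int lowBits) {
        BitSet mask = new BitSet(numRecords);
        mask.set(0, numRecords);                                // start with all records
        for (int b = lowBits; b < column.length; b++) {
            BitSet bit = (BitSet) column[b].clone();
            if (((v >> b) & 1) == 0) bit.flip(0, numRecords);   // complement the column for 0-bits of v
            mask.and(bit);
        }
        return mask.cardinality();
    }
}
```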
Figure 3: Peano order sorting and P-tree construction.

2.3. HOBbit Distance

The nature of a P-tree-based data representation with its bit-column structure has a strong impact on the kinds of algorithms that will be efficient. P-trees allow easy evaluation of the number of data points in a neighborhood that can be represented by a single bit pattern. The HOBbit distance has the desired property. It is defined as

$$d_{\mathrm{HOBbit}}\!\left(x_s^{(k)}, x_t^{(k)}\right) = \begin{cases} 0 & \text{for } x_s^{(k)} = x_t^{(k)} \\[6pt] \max\limits_{j=0,\dots,\infty} \left\{\, j+1 \;:\; \left\lfloor \dfrac{x_s^{(k)}}{2^{j}} \right\rfloor \neq \left\lfloor \dfrac{x_t^{(k)}}{2^{j}} \right\rfloor \right\} & \text{for } x_s^{(k)} \neq x_t^{(k)} \end{cases} \qquad (12)$$

where $x_s^{(k)}$ and $x_t^{(k)}$ are the values of attribute k for points xs and xt, and ⌊·⌋ denotes the floor function. The HOBbit distance is thereby the number of bits by which two values have to be right-shifted to make them equal. The numbers 32 (100000) and 37 (100101), e.g., have HOBbit distance 3 because only the first three digits are equal, and the numbers consequently have to be right-shifted by 3 bits. The first three bits (100) define the neighborhood of 32 (or 37) with a HOBbit distance of no more than 3.

We would like to approximate functions that are defined for Euclidean distances by the HOBbit distance. The exponential HOBbit distance corresponds to the average Euclidean distance of all values within a neighborhood of a particular HOBbit distance:

$$d_{\mathrm{EH}}\!\left(x_s^{(k)}, x_t^{(k)}\right) = \begin{cases} 0 & \text{for } x_s^{(k)} = x_t^{(k)} \\[4pt] 2^{\,d_{\mathrm{HOBbit}}\left(x_s^{(k)},\, x_t^{(k)}\right) - 1} & \text{for } x_s^{(k)} \neq x_t^{(k)} \end{cases} \qquad (13)$$

With this definition we can write the one-dimensional Gaussian kernel function in HOBbit approximation as

$$K_{i,\mathrm{Gauss}}^{(k)}\!\left(x_s^{(k)}, x_t^{(k)}\right) = \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\!\left(-\frac{d_{\mathrm{EH}}\!\left(x_s^{(k)}, x_t^{(k)}\right)^2}{2\sigma_k^2}\right) [C = c_i] \qquad (14)$$

Note that this can be seen as a type of intervalization, due to the fact that a large number of points will have the same HOBbit distance from the sample. This explains why the HOBbit- and P-tree-based algorithm can be efficient despite the high computational effort that multi-dimensional kernel density functions and correlation functions would otherwise involve.
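Equations (12)-(14) translate directly into code. The following sketch is illustrative only; the paper's P-tree-based implementation obtains the counts per HOBbit interval through AND operations rather than per-point loops, and the class indicator of (14) is handled by restricting the counts to one class.

```java
/** Sketch of the HOBbit distance (12), exponential HOBbit distance (13), and HOBbit Gaussian kernel (14). */
public final class Hobbit {

    /** Equation (12): number of right shifts needed until the two (non-negative) values agree. */
    static int hobbitDistance(int xs, int xt) {
        int d = 0;
        while (xs != xt) {        // e.g. 32 (100000) and 37 (100101) need 3 shifts
            xs >>= 1;
            xt >>= 1;
            d++;
        }
        return d;
    }

    /** Equation (13): representative Euclidean distance of values within the HOBbit neighborhood. */
    static double exponentialHobbit(int xs, int xt) {
        return xs == xt ? 0.0 : Math.pow(2, hobbitDistance(xs, xt) - 1);
    }

    /** Equation (14): Gaussian weight of the exponential HOBbit distance (class indicator omitted). */
    static double hobbitGaussKernel(int xs, int xt, double sigma) {
        double d = exponentialHobbit(xs, xt);
        return Math.exp(-d * d / (2 * sigma * sigma)) / (Math.sqrt(2 * Math.PI) * sigma);
    }
}
```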
2.4. Algorithm
Our classification algorithm uses the components developed above as follows. Based on the attribute values of each test sample, we evaluate kernel functions for all attributes: for continuous attributes we evaluate (7) and for categorical attributes (8), for all values of the class label. Note that these kernel functions only have to be evaluated once for each attribute value and can be reused as long as the training data is unchanged.

Kernel functions for pairs of attributes are then evaluated to determine the correlation function (10). For this purpose we use kernel functions that are independent of the class label rather than doing the analysis for different classes separately. This makes the solution numerically more stable by avoiding situations in which two attributes are joined for one class label and not for another. Attributes are joined if the correlation function exceeds a given threshold that is commonly chosen in the range 0.05 to 1. Joining of attributes consists of computing the joined kernel functions (11) and using the result to replace the individual kernel functions of the respective two attributes in (6).

Evaluation of kernel functions for continuous attributes involves evaluating the number of data points within each of the HOBbit ranges of the attribute(s). This is done through AND operations on the P-trees that correspond to the respective highest-order bits. The number of points in each range is then weighted according to the value of the Gaussian function (14) using the exponential HOBbit distance that corresponds to the given HOBbit range.

The products of kernel functions of single and combined attributes (6) are evaluated for all class label values. It can be seen from (2) that they correspond to the probabilities P(x | C = ci). These probabilities are used in (1) together with the total probabilities of the class label values, P(C = ci), which are determined from the training set. The probability of the unknown sample is constant for all classes and does not have to be evaluated. The class label with the highest probability P(x | C = ci) P(C = ci) is chosen as the prediction.
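For the final decision step, once the kernel sums of all attribute groups (single attributes, or attributes joined because their local correlation exceeded the threshold) have been evaluated for the test sample, the prediction reduces to the comparison in equation (1). A minimal sketch with assumed names:

```java
/**
 * Sketch of the decision step (illustrative names): given, for one test sample,
 * the per-class kernel sums of every attribute group and the class priors
 * estimated from the training set, return the most probable class, eq. (1).
 */
public class SemiNaiveDecision {

    /**
     * kernelSum[g][i]: kernel sum of attribute group g for class i at the test sample;
     * prior[i]: P(C = c_i) estimated from the training set.
     */
    public static int predict(double[][] kernelSum, double[] prior) {
        int best = -1;
        double bestLog = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < prior.length; i++) {
            double logScore = Math.log(prior[i]);
            for (double[] group : kernelSum) {
                logScore += Math.log(group[i]);   // product over attribute groups, eq. (6)
            }
            if (logScore > bestLog) {
                bestLog = logScore;
                best = i;
            }
        }
        return best;
    }
}
```

Because P(x) is the same for every class, it never has to be computed, which is part of what allows the lazy, training-free evaluation described above.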
3. IMPLEMENTATION AND RESULTS
We implemented all algorithms in Java and evaluated them on four data sets. Data sets were selected to have at least 3000 data points and to contain continuous attributes. Two thirds of the data were taken as the training set and one third as the test set. Due to the consistently large size of the data sets, cross-validation was considered unnecessary. All experiments were done using the same parameter values for all data sets.

3.1. Data Sets
Three of the data sets were obtained from the UCI machine learning repository [2], where full documentation on the data sets is available. These data sets include the following:

• spam data set: word and letter frequencies are used to classify e-mail as spam
• adult data set: census data is used to predict whether income is greater than $50,000
• sick-euthyroid data set: medical data is used to predict sickness from thyroid disease
An additional data set was taken from the spatial data mining domain (crop data set). The RGB colors in the photograph of a cornfield are used to predict the yield of the field [17]. The class label is the first bit of the 8-bit yield information, i.e., the class label is 1 if the yield is higher than 128 for a given pixel. Table 1 summarizes the properties of the data sets. No preprocessing of the data was done, but some attributes were identified as being logarithmic in nature, and the logarithm was encoded in the P-trees. The following attributes were chosen as logarithmic: "capital-gain" and "capital-loss" of the adult data set, and all attributes of the "spam" data set.
3.2. Results

We will now compare the results of our semi-naive algorithm with three more traditional implementations. The traditional naive Bayesian algorithm uses a Gaussian distribution function, cf. equation (5). The P-tree naive Bayesian algorithm uses HOBbit-based kernel density estimation, cf. equation (14), but does not check for correlations. Section 3.3 compares the traditional naive Bayesian algorithm with the P-tree naive Bayesian one. Section 3.4 compares the new kernel-based semi-naive Bayesian algorithm with both the P-tree naive Bayesian and the discrete semi-naive Bayesian ones. The discrete semi-naive Bayesian algorithm does eliminate correlations but discretizes continuous data as a preprocessing step. Table 2 gives a summary of the results.
Table 1: Summary of data set properties (attribute counts exclude the class label)

| Data set       | Attributes (total) | Categorical | Continuous | Classes | Training instances | Test instances | Unknown values | Probability of minority class label |
|----------------|--------------------|-------------|------------|---------|--------------------|----------------|----------------|-------------------------------------|
| spam           | 57                 | 0           | 57         | 2       | 3067               | 1534           | no             | 37.68                               |
| crop           | 3                  | 0           | 3          | 2       | 784080             | 87120          | no             | 24.69                               |
| adult          | 14                 | 8           | 6          | 2       | 30162              | 15060          | no             | 24.57                               |
| sick-euthyroid | 25                 | 19          | 6          | 2       | 2108               | 1055           | yes            | 9.29                                |
Table 2: Results (error rate ± standard error)

| Data set       | Traditional Naïve Bayes | P-Tree Naïve Bayes | Discrete Semi-Naïve Bayes (1) | Discrete Semi-Naïve Bayes (2) | Kernel Semi-Naïve Bayes (1) | Kernel Semi-Naïve Bayes (2) |
|----------------|-------------------------|--------------------|-------------------------------|-------------------------------|-----------------------------|-----------------------------|
| spam           | 11.9 ± 0.9              | 10.0 ± 0.8         | 9.4 ± 0.8                     | 9.5 ± 0.8                     | 8.4 ± 0.7                   | 8.0 ± 0.7                   |
| crop           | 21.6 ± 0.2              | 22.0 ± 0.2         | 28.8 ± 0.2                    | 21.0 ± 0.2                    | 28.5 ± 0.2                  | 21.0 ± 0.2                  |
| adult          | 18.3 ± 0.4              | 18.0 ± 0.3         | 17.3 ± 0.3                    | 17.1 ± 0.3                    | 17.0 ± 0.3                  | 16.6 ± 0.3                  |
| sick-euthyroid | 15.2 ± 1.2              | 5.9 ± 0.7          | 11.4 ± 1.0                    | 8.8 ± 1.0                     | 4.2 ± 0.6                   | 4.6 ± 0.7                   |
3.3. P-Tree Naive Bayesian Classifier

Before using the semi-naive Bayesian classifier, we will evaluate the performance of a simple naive Bayesian classifier that uses kernel density estimates based on the HOBbit distance. Whether this classifier improves performance depends on two factors. Classification accuracy should benefit from the fact that a kernel density estimate in general gives more detailed information on an attribute than the choice of a simple distribution function. Our implementation does, however, use the HOBbit distance when determining the kernel density estimate, which may make classification worse overall. Figure 4 shows that for three of the data sets the P-tree naive Bayesian algorithm constitutes an improvement over the traditional naive Bayesian classifier.

Figure 4: Decrease in error rate for the P-tree naive Bayesian classifier compared with the traditional naive Bayesian classifier, in units of the standard error.
3.4. Semi-Naive Bayesian Classifier

The semi-naive Bayesian classifier was evaluated using two parameter combinations. Figure 5 shows the decrease in error rate compared with the P-tree naive Bayesian classifier, which is the relevant comparison when evaluating the benefit of combining attributes.

Figure 5: Decrease in error rate for the kernel-based semi-naive Bayesian classifier compared with the P-tree naive classifier, in units of the standard error. Two parameter combinations were used: (1) t = 0.3, anti-correlations eliminated (left); (2) t = 0.05, only correlations eliminated (right).

It can clearly be seen that for the chosen parameters accuracy is increased over the P-tree naive Bayesian algorithm. Both parameter combinations lead to a significant improvement for all data sets, despite the fact that the cutoff value for correlations varies from 0.05 to 0.3 and the runs differ in their treatment of anti-correlations. It can therefore be concluded that the benefits of correlation elimination are fairly robust with respect to parameter choice. Run (1) used a cutoff threshold t = 0.3, while run (2) used t = 0.05. Run (1) eliminates not only correlations, i.e., attribute pairs for which Corr(a,b) > t, but also anti-correlations, i.e., attribute pairs for which Corr(a,b) < -t.

We also tried using multiple iterations, in which not only two but more attributes were joined. We did not get any consistent improvement, which is in accordance with the observation that correlated attributes can benefit the naive Bayesian classifier [18]. The mechanism by which correlation elimination can lead to a decrease in accuracy in the semi-naive Bayesian algorithm is as follows. Correlation elimination was introduced as a simple interchanging of summation and multiplication in equation (6). For categorical attributes the kernel function is a step function. The product of step functions will quickly result in a small volume in attribute space, leading to correspondingly few training points and poor statistical significance. By combining many attributes the problems of the curse of dimensionality can recur. Similarly, it turned out not to be beneficial to eliminate anti-correlations as extensively as correlations because they will even more quickly lead to a decrease in the number of points that are considered.

We then compared our approach with the alternative strategy of discretizing continuous attributes as a preprocessing step; Figure 6 shows the corresponding decrease in error rate of the kernel-based implementation.

Figure 6: Decrease in error rate for two parameter combinations for the kernel-based semi-naive Bayesian classifier compared with a semi-naive Bayesian classifier using discretization, in units of the standard error. See Figure 5 for parameters.

The improvement in accuracy for the kernel-based representation is clearly evident. This may explain why early discretization-based results on semi-naive classifiers that were reported by Kononenko [6] showed little or no improvement over the naive Bayesian classifier on a data set that is similar to our sick-euthyroid data set, whereas the kernel-based algorithm leads to a clear improvement.

3.5. Performance

It is important for data mining algorithms to be efficient for large data sets. Figure 7 shows that the P-tree-based semi-naive Bayesian algorithm scales better than O(N) as a function of the number of training points. This scaling is closely related to the P-tree storage concept, which benefits increasingly from compression as the data set size grows.
Figure 7: Scaling of execution time as a function of training set size (time per test sample in milliseconds vs. number of training points; measured execution time compared with a linear interpolation).
4. CONCLUSIONS
We have presented a semi-naive Bayesian algorithm that treats continuous data through kernel density estimates rather than discretization. We were able to show that it increases accuracy for data sets from a wide range of domains, both from the UCI machine learning repository and from an independent source. By avoiding discretization, our algorithm ensures that distance information within numerical attributes is represented accurately, and improvements in accuracy could clearly be demonstrated. Categorical and continuous data are thereby treated on an equally strong footing, which is unusual, since classification algorithms tend to favor one or the other type of data. Our algorithm is particularly valuable for the classification of data sets with many attributes. It does not require training of a classifier and is thereby suitable for settings such as data streams. The implementation using P-trees has an efficient, sub-linear scaling with respect to training set size. We have thereby introduced a tool that is equally interesting from a theoretical and a practical perspective.

References

[1] D. Hand, H. Mannila, P. Smyth, "Principles of Data Mining", The MIT Press, Cambridge, Massachusetts, 2001.
[2] http://www.ics.uci.edu/mlearn/MLSummary.html
[3] P. Domingos, M. Pazzani, "Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier", 13th International Conference on Machine Learning, 105-112, 1996.
[4] T. Hastie, R. Tibshirani, J. Friedman, "The Elements of Statistical Learning: Data Mining, Inference, and Prediction", Springer-Verlag, New York, 2001.
[5] M. Datar, A. Gionis, P. Indyk, R. Motwani, "Maintaining Stream Statistics over Sliding Windows", ACM-SIAM Symposium on Discrete Algorithms (SODA), 2002.
[6] I. Kononenko, "Semi-Naive Bayesian Classifier", Proceedings of the Sixth European Working Session on Learning, 206-219, 1991.
[7] M. Pazzani, "Constructive Induction of Cartesian Product Attributes", Information, Statistics and Induction in Science, Melbourne, Australia, 1996.
[8] Z. Zheng, G. Webb, K.-M. Ting, "Lazy Bayesian Rules: A Lazy Semi-Naive Bayesian Learning Technique Competitive to Boosting Decision Trees", Proc. 16th Int. Conf. on Machine Learning, 493-502, 1999.
[9] N. Friedman, D. Geiger, M. Goldszmidt, "Bayesian Network Classifiers", Machine Learning, 29, 131-163, 1997.
[10] E. J. Keogh, M. J. Pazzani, "Learning Augmented Bayesian Classifiers: A Comparison of Distribution-Based and Classification-Based Approaches", Uncertainty 99: The Seventh International Workshop on Artificial Intelligence and Statistics, Ft. Lauderdale, FL, USA, 1999.
[11] J. Cheng, R. Greiner, "Comparing Bayesian Network Classifiers", Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence (UAI'99), 101-107, Morgan Kaufmann, 1999.
[12] M. Khan, Q. Ding, W. Perrizo, "K-Nearest Neighbor Classification of Spatial Data Streams Using P-Trees", PAKDD 2002, Taipei, Taiwan, May 2002.
[13] W. Perrizo, Q. Ding, A. Denton, K. Scott, Q. Ding, M. Khan, "PINE - Podium Incremental Neighbor Evaluator for Spatial Data Using P-Trees", Symposium on Applied Computing (SAC'03), Melbourne, Florida, USA, 2003.
[14] Q. Ding, W. Perrizo, Q. Ding, "On Mining Satellite and Other Remotely Sensed Images", DMKD 2001, 33-40, Santa Barbara, CA, 2001.
[15] W. Perrizo, W. Jockheck, A. Perera, D. Ren, W. Wu, Y. Zhang, "Multimedia Data Mining Using P-Trees", Multimedia Data Mining Workshop, KDD, Sept. 2002.
[16] A. Perera, A. Denton, P. Kotala, W. Jockheck, W. Valdivia Granda, W. Perrizo, "P-tree Classification of Yeast Gene Deletion Data", SIGKDD Explorations, Dec. 2002.
[17] http://midas10.cs.ndsu.nodak.edu/data/images/data_set_94/
[18] R. Kohavi, G. John, "Wrappers for Feature Subset Selection", Artificial Intelligence, 97(1-2), 273-324, 1997.