WBCSVM: Weighted Bayesian Classification based on Support Vector Machines
Thomas Gärtner
[email protected]
Knowledge Discovery Team, German National Research Center for Information Technology, Schloß Birlinghoven, 53754 Sankt Augustin, Germany; and Department of Computer Science, University of Bristol, United Kingdom

Peter A. Flach    PETER.FLACH@BRISTOL.AC.UK
Department of Computer Science, University of Bristol, Woodland Road, Bristol BS8 1UB, United Kingdom
Abstract

This paper introduces an algorithm that combines naïve Bayes classification with feature weighting. Most of the related approaches to feature transformation for naïve Bayes suggest various heuristics and non-exhaustive search strategies for selecting a subset of features with which naïve Bayes performs better than with the complete set of features. In contrast, the algorithm introduced in this paper employs feature weighting performed by a support vector machine. The weights are optimised such that the danger of overfitting is reduced. To the best of our knowledge, this is the first time that naïve Bayes classification has been combined with feature weighting. Experimental results on 15 UCI domains demonstrate that WBCSVM compares favourably to state-of-the-art machine learning approaches.
1. Introduction

A domain representation may be inadequate for naïve Bayes learning if the features of the representation are not conditionally independent given the class. One special case of features that are not conditionally independent is redundant features. The performance of naïve Bayes classification on such domains can be improved by removing the redundant features (Langley & Sage, 1994). Removal of features is often referred to as feature subset selection. Much research has been aimed at combining feature subset selection techniques and naïve Bayes classification (Hall, 1999; Kohavi & John, 1997; Langley & Sage, 1994). These approaches all have in common that they perform a non-exhaustive search in the space of feature subsets, guided by some heuristic. As the number of distinct feature subsets is exponential in the number of attributes, it is in general not feasible to perform an
exhaustive search. Closely related to feature subset selection is feature weighting. Feature weighting assigns a continuous weight to each feature and has mostly been applied to lazy learning algorithms (Wettschereck, Aha, & Mohri, 1997). With continuous weights, feature weighting is more flexible than feature subset selection, but also typically more computationally expensive and more likely to overfit noisy data. These problems are overcome by the algorithm proposed in this paper.

WBCSVM combines naïve Bayes classification with feature weighting. It is based on the support vector machine approach, i.e., an algorithm that looks for an optimal hyperplane separating two classes in some given feature space. By choosing this hyperplane such that the margin separating the two classes is maximal, the danger of overfitting is reduced. Support vector machines can be customised by using different kernel functions, each of which corresponds to a different feature and hypothesis space. The kernel function suggested in this paper corresponds to the hypothesis space of weighted Bayesian classification. The weights defining the separating hyperplane have a direct interpretation as feature weights in the naïve Bayes classifier. Although naïve Bayes and feature weighting have both been studied extensively, combining them has, to the best of our knowledge, never been tried. In contrast to other well-known kernel functions that have been used in support vector machines, the one suggested in this paper depends on the distribution of instances and classes.

The outline of the paper is as follows. Section 2 concentrates on the underlying concepts of WBCSVM: naïve Bayes classification, feature weighting, and support vector machines. The corresponding kernel function is devised in section 3. The algorithm is experimentally evaluated in section 4 and compared to related work in section 5. The paper concludes with prospects for future work.
2. Weighted Bayesian Classification

This section describes the basic concepts of weighted Bayesian classifiers and motivates them as a generalisation of well-known subset selection approaches. Section 2.1 briefly introduces the naïve Bayes classifier, while section 2.2 concentrates on bringing feature transformation (in particular feature weighting) and naïve Bayes together. Finally, section 2.3 describes how feature weights are optimised using a support vector machine.

In the following sections $x_i$ denotes the $i$th test instance, while $x_{i,j}$ denotes the value of the $j$th attribute of $x_i$. The variable $z$ is used for training examples; $z_i$ and $z_{i,j}$ are defined analogously. The expression $\langle x_k * z_i \rangle$ denotes the inner product ($\langle x_k * z_i \rangle = \sum_j x_{k,j} \cdot z_{i,j}$) of the instances $x_k$ and $z_i$. When used in inner products, instances are assumed to have only numeric values. Where the inner product of instances in a feature space (defined by the feature transformation $\phi$) is calculated, instances may contain symbolic values; in this case, however, the components of the vectors in the transformed feature space have to be numeric.

2.1 Naïve Bayes Classification

The naïve Bayes classifier uses Bayes' theorem to calculate the most likely classification of an example given the attribute-value distributions of the training examples. Given a test instance $x_i$ described by the attributes $A_j$ with values $x_{i,j}$ and the possible classifications $v_c$, the maximum posterior classification $v_{map}$ is:

$$v_{map}(x_i) = \arg\max_{v_c} P(v_c \mid A_1 = x_{i,1}, \ldots, A_n = x_{i,n})$$
$$= \arg\max_{v_c} P(v_c) \cdot P(A_1 = x_{i,1}, \ldots, A_n = x_{i,n} \mid v_c)$$

The attribute-value description $A_j = x_{i,j}$ of an instance is usually abbreviated to $x_{i,j}$:

$$v_{map}(x_i) = \arg\max_{v_c} P(v_c) \cdot P(x_{i,1}, \ldots, x_{i,n} \mid v_c)$$
The probability of an instance description, i.e., of its attribute-value pairs, given its class, is hard to estimate directly from the training data, because this would require a very large training set. Assuming conditional independence of the attributes, the probability can be decomposed to:

$$P(x_{i,1}, \ldots, x_{i,n} \mid v_c) = P(x_{i,1} \mid v_c) \cdot \ldots \cdot P(x_{i,n} \mid v_c)$$

$$v_{NB}(x_i) = \arg\max_{v_c} P(v_c) \prod_{j \in S_A} P(x_{i,j} \mid v_c)$$

where $S_A = \{\, j \mid A_j \text{ is an attribute} \,\}$.
The decomposed conditional probabilities $P(x_{i,j} \mid v_c)$ are easier to estimate, as they require far fewer training examples. In our approach, the Laplace formula is used to estimate the conditional probability of an attribute value given a class; conditional probabilities of numeric attributes are approximated by a normal distribution. We introduce the following notational convention:

$$p_0(x_i, v) = P(v), \qquad p_j(x_i, v) = P(x_{i,j} \mid v) \text{ for } j > 0, \qquad S = S_A \cup \{0\}.$$

$$v_{NB}(x_i) = \arg\max_{v_c} \prod_{j \in S} p_j(x_i, v_c)$$

Whenever the conditional independence assumption is justified, the naïve Bayes classification corresponds to the maximum posterior classification.
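To make the estimates and the decision rule above concrete, the following minimal Python sketch handles symbolic attributes only (numeric attributes would additionally need the normal-distribution estimate mentioned above). The helper names laplace_estimates and naive_bayes_predict are ours, not part of the paper.

```python
from collections import defaultdict
from math import prod

def laplace_estimates(X, y):
    """Estimate p_0(v) = P(v) and p_j(x, v) = P(x_j | v) with Laplace smoothing
    from symbolic training data X (list of tuples) and class labels y (list)."""
    classes = sorted(set(y))
    n_attr = len(X[0])
    # Laplace-smoothed class priors
    prior = {v: (y.count(v) + 1) / (len(y) + len(classes)) for v in classes}
    # counts[j][v][value] = number of class-v examples with attribute j equal to value
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(n_attr)]
    values = [set() for _ in range(n_attr)]
    for xi, yi in zip(X, y):
        for j, val in enumerate(xi):
            counts[j][yi][val] += 1
            values[j].add(val)

    def p(j, value, v):
        """Laplace-smoothed P(A_j = value | v); strictly positive by construction."""
        n_v = sum(counts[j][v].values())
        return (counts[j][v][value] + 1) / (n_v + len(values[j]))

    return prior, p

def naive_bayes_predict(x, prior, p):
    """v_NB(x) = argmax_v P(v) * prod_j P(x_j | v)."""
    return max(prior, key=lambda v: prior[v] * prod(p(j, val, v) for j, val in enumerate(x)))
```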
2.2 Weighting versus Subset Selection in Naïve Bayes

Feature transformation techniques attempt to overcome the effects of bad data representations that hinder successful learning. Feature transformation refers to any modification made to the representation space of a learning algorithm. Traditional machine learning has focused on finding a good hypothesis with respect to a given representation; feature transformation approaches instead look for a representation in which the search for a good hypothesis will be more successful. Whereas most feature transformation techniques introduce new features, feature subset selection and feature weighting try to reduce the influence of redundant and misleading features.

In the literature it has often been suggested that the predictive accuracy of naïve Bayes can be improved by removing redundant or highly correlated features (Hall, 1999; Langley & Sage, 1994). This is intuitive, as such features violate the conditional independence assumption described above.

Example (Langley & Sage, 1994): Consider a domain with three features. The naïve Bayes classification is $v = \arg\max_{v_c} P(v_c) \cdot P(x_{i,1} \mid v_c) \cdot P(x_{i,2} \mid v_c) \cdot P(x_{i,3} \mid v_c)$. Introducing the dependent feature $x_{i,4} = x_{i,2}$, the classifier becomes $v = \arg\max_{v_c} P(v_c) \cdot P(x_{i,1} \mid v_c) \cdot P(x_{i,2} \mid v_c)^2 \cdot P(x_{i,3} \mid v_c)$, so twice as much influence is given to the second feature. If the concept is true if and only if at least two of the three features are present, then a naïve Bayes classifier can easily learn this concept on the original domain. However, with the redundant feature, naïve Bayes will consistently misclassify one out of the eight possible instances.

Combinations of subset selection and naïve Bayes are also known as selective Bayesian classifiers. They can be formalised as follows:
$$v_{SBC}(x_i, S') = \arg\max_{v_c} \prod_{j \in S'} p_j(x_i, v_c); \qquad S' \subseteq S$$

This can be written as:

$$v_{SBC}(x_i, w) = \arg\max_{v_c} \prod_{j \in S} p_j(x_i, v_c)^{w_j}; \qquad w_j = \begin{cases} 1 & \text{if } j \in S' \\ 0 & \text{otherwise} \end{cases}$$
Example: In terms of the example above, the selective Bayesian classifier could learn the concept given either of the subsets $S' = \{1,2,3\}$ or $S' = \{1,3,4\}$.

Feature subset selection algorithms are often distinguished by their search strategy and heuristic. Kohavi and John (1997) further distinguish between filter and wrapper approaches according to the interaction between subset selection and classification. Filters are data driven and select a subset merely based on some property or heuristic measure of the data. Wrappers are hypothesis driven: they select a subset by repeatedly adding and removing features, and comparing the performance of a learning algorithm with different subsets by cross-validation.

Relaxing the above restriction $w \in \{0,1\}^{|S|}$ and choosing $w$ to be a vector of continuous weights, we can formalise the weighted Bayesian classifier:

$$v_{WBC}(x_i, w) = \arg\max_{v_c} \prod_{j \in S} p_j(x_i, v_c)^{w_j}; \qquad w \in \mathbb{R}^{|S|}$$
Example: The weighted Bayesian classifier could learn the concept given in the above example by choosing the weight vector $w = \lambda \cdot (1, \alpha, 1, 1-\alpha)$, with $\alpha$ and $\lambda$ arbitrary.

In order to find a good subset of features, a search has to be performed in the space of all possible subsets of the set of attributes $S_A$. Exhaustive search is in general not feasible, as the number of subsets is exponential in the number of attributes. Hall (1999), among others, notes that since feature weighting is a generalisation of feature subset selection, it involves searching a much larger space. It has also been noted that feature weighting increases the danger of overfitting noisy data. Below, we introduce an efficient way of optimising the weight vector while reducing the danger of overfitting. This can be achieved by using a support vector machine with an appropriately chosen kernel function.
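For illustration, the weighted classifier defined above can be sketched as follows, reusing the hypothetical laplace_estimates helper from the sketch in section 2.1. Setting all weights to 1 recovers plain naïve Bayes, and binary weights recover a selective Bayesian classifier.

```python
from math import log

def weighted_bayes_predict(x, prior, p, w):
    """v_WBC(x, w) = argmax_v prod_{j in S} p_j(x, v)^{w_j}.
    Index 0 of w weights the class prior p_0, as in the text."""
    def score(v):
        # work in log space: the logarithm is monotone, so the argmax is unchanged
        s = w[0] * log(prior[v])
        s += sum(w[j + 1] * log(p(j, val, v)) for j, val in enumerate(x))
        return s
    return max(prior, key=score)

# w = [1.0] * (n_attributes + 1) gives plain naive Bayes; w restricted to {0, 1}
# gives a selective Bayesian classifier with feature subset {j : w[j] = 1}.
```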
2.3 Weighting with Support Vector Machines

Support vector machines (Boser, Guyon, & Vapnik, 1992; Cristianini & Shawe-Taylor, 2000) use linear classifiers to implement non-linear class boundaries. This is achieved by using a non-linear mapping to transform the input space into a representation space which is usually (but not necessarily) of much higher dimension. Learning in the new representation space is performed by constructing the maximum margin hyperplane. One important aspect of support vector machines is that the search for this hyperplane can be performed by solving a quadratic optimisation problem and that the non-linear mapping does not have to be calculated explicitly.

Linear threshold machines consider class boundaries that can be described by:

$$v(x) = \begin{cases} t & \text{if } f_{LTM}(x) > 0 \\ f & \text{otherwise} \end{cases}, \qquad f_{LTM}(x) = \langle w * x \rangle + b$$

with classes $t$ and $f$, the linear decision function $f_{LTM}$, feature vector $x$, and weight vector $w$. If a feature transformation $\phi$ is to be applied to the examples, the decision function becomes:

$$f'(x) = \langle w_\phi * \phi(x) \rangle + b$$

where $w_\phi$ denotes the weight vector in the representation space. The weight vector $w_\phi$ is to be determined from the training instances $z_i$. In order to improve generalisation ability, the separating hyperplane, and therefore the weight vector, is optimised such that the margin is maximal. The margin is defined as the distance between the separating hyperplane and the instances closest to it on either side. Optimisation theory allows this problem to be represented in a dual form, i.e., a formula in which training instances only occur as part of an inner product with the test instance $x$:

$$f_{dual}(x) = \sum_i y_i \alpha_i \langle z_i * x \rangle + b$$

with the Lagrange multipliers $\alpha_i$, the training instances $z_i$ and the corresponding classes $y_i$. If a feature transformation is to be applied to the data, then a kernel function $K$ can be used to calculate the inner product of two examples in a possibly high-dimensional feature space without actually performing the transformation. Substituting $\langle \phi(z_i) * \phi(x) \rangle = K(z_i, x)$:

$$f_{SVM}(x) = \sum_i y_i \alpha_i K(z_i, x) + b$$

The term linear support vector machine is used in this paper to refer to support vector machines that use the trivial kernel function, i.e., the inner product in the original feature space, $K_{LTM}(x, z) = \langle x * z \rangle$.

In the case of kernel functions that calculate the inner product in a known feature space of moderate dimension, the weight vector in the feature space can be calculated explicitly. The fact that more weight will be given to more discriminating features can be used to increase domain understanding.
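The dual form above is straightforward to state in code. The following sketch (our illustration, with hypothetical argument names) evaluates f_SVM(x) = sum_i y_i * alpha_i * K(z_i, x) + b for an arbitrary kernel, with the trivial kernel K_LTM as an example.

```python
def svm_decision(x, train_z, train_y, alphas, b, kernel):
    """Dual-form decision function f_SVM(x) = sum_i y_i * alpha_i * K(z_i, x) + b.
    y_i in {-1, +1}; alpha_i is zero for all instances that are not support vectors."""
    return sum(a * y * kernel(z, x) for a, y, z in zip(alphas, train_y, train_z)) + b

def k_ltm(x, z):
    """Trivial (linear) kernel: the inner product in the original feature space."""
    return sum(xj * zj for xj, zj in zip(x, z))
```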
3. The WBCSVM Kernel

In this section we introduce a kernel function that can be used to perform weighted Bayesian classification with a support vector machine. Due to the nature of linear threshold algorithms we restrict ourselves to binary classification. However, it is worth mentioning that techniques exist to extend linear threshold classification to multi-class classification; these techniques can similarly be applied to WBCSVM.

As described above, weighted naïve Bayes classification can be written as:

$$v_{WBC}(x_i, w) = \arg\max_{v_c} \prod_{j \in S} p_j(x_i, v_c)^{w_j}; \qquad w \in \mathbb{R}^{|S|}$$
On two-class domains, this is equivalent to:

$$v_{WBC}(x_i, w) = \begin{cases} t & \text{if } f_{WBC}(x_i, w) > 0 \\ f & \text{otherwise} \end{cases}, \qquad f_{WBC}(x_i, w) = \prod_{j \in S} p_j(x_i, t)^{w_j} - \prod_{j \in S} p_j(x_i, f)^{w_j}$$
As the logarithm is strictly monotonically increasing, applying it to both terms does not change the classification and we can use the following decision function instead:

$$f'_{WBC}(x, w) = \ln \prod_{j \in S} p_j(x, t)^{w_j} - \ln \prod_{j \in S} p_j(x, f)^{w_j}$$
Note that $p_j(x, v) > 0$ for all $j, x, v$, as the Laplace formula is used to estimate the conditional probabilities, and thus the logarithm is always defined. For reasons of simplicity, the index $i$ in $x_i$ is omitted where possible. The above function can be rewritten as:

$$f'_{WBC}(x, w) = \sum_{j \in S} w_j \left( \ln p_j(x, t) - \ln p_j(x, f) \right)$$
We now re-interpret this formula as a linear decision function in a transformed feature space:

$$f'_{WBC}(x, w) = \langle w * \phi(x) \rangle, \qquad \phi_j(x) = \ln p_j(x, t) - \ln p_j(x, f)$$
The optimal decision function, i.e., the maximum margin hyperplane, can be found by a support vector machine that uses the following kernel:

$$K_{WBC}(x, z) = \langle \phi(x) * \phi(z) \rangle = \sum_{j \in S} \left( \ln p_j(x, t) - \ln p_j(x, f) \right) \cdot \left( \ln p_j(z, t) - \ln p_j(z, f) \right)$$
There are two important things to notice that distinguish $K_{WBC}$ from most other kernel functions. Firstly, information about all instances (not just the support vectors) is incorporated in the kernel function, as the conditional probabilities are estimated from all training instances. Secondly, class information is used in the kernel function, as naïve Bayes uses different probabilistic models for each class. The approach is easily implemented by (i) estimating the conditional probabilities from the training set, and (ii) plugging this kernel function into a standard support vector machine. The current prototype is implemented within the Weka framework (Witten & Frank, 2000), making use of Weka's naïve Bayes and support vector machine implementations.

We will further describe the WBCSVM algorithm with respect to model comprehensibility, simplicity of the algorithm, ease of implementation, and VC-dimension. In terms of the comprehensibility of the model, it is worth noticing again that the weights found by the support vector machine have a direct interpretation in the naïve Bayes framework. An attribute of normal relevance will get a weight close to the average of all positive weights. An unimportant attribute will be assigned a weight close to 0. Very important attributes might be assigned weights greater than average, while counter-productive attributes might be assigned negative weights. The simplicity of the algorithm corresponds to the amount of understanding necessary before the algorithm can be implemented. Support vector machines are used as a black box and the new kernel function is simply plugged in; thus only the way kernel functions are used in support vector machines, and the way weights are incorporated in naïve Bayes, have to be understood. The ease of implementation refers to the amount of programming necessary to implement the algorithm. Assuming the source code of support vector machines and naïve Bayes is given, the amount of work needed to implement WBCSVM is low. Finally, the VC-dimension (Vapnik & Chervonenkis, 1971) of the kernel function proposed above is linear in the number of attributes. This is obvious, as the number of features in feature space corresponds to the number of attributes plus one (for the prior probabilities), and the VC-dimension of linear threshold machines is given by the dimensionality of the feature space plus one.
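Following steps (i) and (ii) above, a minimal sketch of the approach might look as follows. This is our illustration, not the authors' Weka-based prototype: it assumes two-class symbolic data, reuses the hypothetical laplace_estimates helper from the sketch in section 2.1, computes the feature map explicitly, and feeds the resulting Gram matrix to scikit-learn's SVC as a precomputed kernel.

```python
import numpy as np
from sklearn.svm import SVC

def wbc_features(X, prior, p, pos, neg):
    """phi_0(x) = ln P(pos) - ln P(neg) and phi_j(x) = ln p_j(x, pos) - ln p_j(x, neg)."""
    rows = []
    for x in X:
        row = [np.log(prior[pos]) - np.log(prior[neg])]
        row += [np.log(p(j, val, pos)) - np.log(p(j, val, neg)) for j, val in enumerate(x)]
        rows.append(row)
    return np.array(rows)

def train_wbcsvm(X_train, y_train):
    """(i) estimate the conditional probabilities, (ii) plug K_WBC into a standard SVM."""
    pos, neg = sorted(set(y_train))                  # two-class domains assumed
    prior, p = laplace_estimates(X_train, y_train)   # sketch from section 2.1
    phi_train = wbc_features(X_train, prior, p, pos, neg)
    clf = SVC(kernel="precomputed")                  # K_WBC(x, z) = <phi(x), phi(z)>
    clf.fit(phi_train @ phi_train.T, y_train)
    return clf, phi_train, prior, p, (pos, neg)

def predict_wbcsvm(model, X_test):
    clf, phi_train, prior, p, (pos, neg) = model
    phi_test = wbc_features(X_test, prior, p, pos, neg)
    return clf.predict(phi_test @ phi_train.T)
```

Because the feature map is computed explicitly here, the weight vector can be recovered after training as clf.dual_coef_ @ phi_train[clf.support_], which yields the per-feature weights whose interpretation is discussed above.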
4. Comparison

The sections above introduced WBCSVM, a new algorithm that performs feature weighting in naïve Bayes. This section describes the empirical evaluation in terms of the algorithms WBCSVM was compared with, the data sets used, and the results of the comparison.

Empirical evidence has been obtained on 15 domains from the UCI repository (Blake & Merz, 1998). These domains have all been widely used in the literature. We selected all domains that have a binary class attribute and are available in a Weka-
readable file format. The left part of Table 1 lists the properties of these data sets (the number of missing values, the number of numeric and symbolic attributes, and the number of examples of the minority and majority class). The right column shows the mean accuracy and standard deviation obtained by WBCSVM in ten stratified tenfold cross-validations using default options.

Table 1. Properties of the data sets and accuracy of WBCSVM

Dataset          Miss   Attributes (Num:Sym)   Class (Min:Max)   WBCSVM
Breast-cancer    9      0:10                   85:201            69.90 ±7.85
Breast-w         16     9:0                    241:458           96.51 ±2.21
Cleveland-14     7      6:7                    138:165           82.85 ±7.22
Credit-rating    67     6:9                    307:383           85.46 ±3.96
German-credit    0      7:13                   300:700           74.91 ±3.54
Heart-statlog    0      13:0                   120:150           83.37 ±6.61
Hepatitis        167    6:14                   32:123            83.72 ±8.67
Horse-colic      1927   7:15                   136:232           84.02 ±6.1
Ionosphere       0      34:0                   126:225           91.97 ±4.51
KR-vs-KP         0      0:37                   1527:1669         95.16 ±1.18
Labor            326    8:8                    20:37             90.40 ±11.41
Mushroom         2480   0:22                   3916:4208         100.00 ±0
Pima-diabetes    0      8:0                    268:500           76.81 ±3.95
Sonar            0      60:0                   97:111            81.69 ±7.85
Vote             392    0:16                   168:267           95.86 ±3.13
The WBCSVM algorithm has been compared to other algorithms using a paired T test on the same domains. The
test was performed two-tailed at the 99.9% significance level. The algorithms used were naïve Bayes, naïve Bayes with feature subset selection based on the Relief measure, a decision tree learner (J48), a boosted decision tree learner (AdaBoostM1 with J48) and a linear support vector machine. These algorithms have been used as implemented in the Weka machine learning toolkit and are described in detail by Witten and Frank (2000). For all algorithms the default settings have been used; no parameter optimisation has been performed. J48 is Weka's implementation of the well-known C4.5 algorithm (Quinlan, 1993). Relief is described in detail in (Kira & Rendell, 1992; Kononenko, 1994).

Table 2 shows the mean predictive accuracy obtained by these algorithms on the above described datasets in ten stratified tenfold cross-validations using default options. The results of the paired T test are summarised in the bottom row of Table 2: ✗ counts the domains on which the algorithm was significantly better than WBCSVM, and ✓ counts the domains on which WBCSVM performed significantly better than the other algorithm. Naïve Bayes (with and without feature subset selection) performs worse than WBCSVM on 8 out of the 15 domains, and better on 1 domain only. The decision tree learner performed significantly better than WBCSVM on 3 data sets, and significantly worse on 9. The boosted decision tree learner performed better twice and worse 6 times. The linear support vector machine performed better on 1 data set and worse on 4. We conclude that WBCSVM demonstrates superior performance.
Table 2. Results of different machine learning algorithms and paired T tests

Dataset          Naïve Bayes    NB with Relief-FSS  Linear SVM     J48            AdaBoostM1 + J48
Breast-cancer    73.12 ±7.48    72.74 ±7.74         69.82 ±6.57    73.87 ±5.65    66.61 ±8.43
Breast-w         96.05 ±2.24    96.28 ±2.17         96.80 ±2.13    94.48 ±2.86    96.01 ±2.26
Cleveland-14     83.83 ±7.44    83.07 ±6.53         84.09 ±6.57    76.96 ±7.24    78.69 ±7.27
Credit-rating    77.84 ±3.9     85.55 ±3.91         84.91 ±3.97    85.41 ±4.08    83.71 ±4.27
German-credit    74.98 ±4.14    73.42 ±3.77         75.08 ±3.94    71.18 ±3.92    70.77 ±3.95
Heart-statlog    84.37 ±6.13    84.11 ±6.15         83.78 ±5.59    78.67 ±7.3     78.96 ±7.79
Hepatitis        83.87 ±7.42    82.32 ±7.59         84.97 ±7.75    79.10 ±8.66    81.78 ±9.24
Horse-colic      78.28 ±6.93    82.06 ±6.07         82.66 ±6.08    85.44 ±5.29    82.88 ±6.26
Ionosphere       82.60 ±6.67    89.44 ±5.93         87.98 ±4.68    89.74 ±5.55    93.59 ±3.82
KR-vs-KP         87.84 ±1.84    93.96 ±1.43         95.83 ±1.29    99.40 ±0.47    99.62 ±0.35
Labor            93.93 ±10.01   88.13 ±14.41        93.80 ±10.44   80.20 ±15.76   87.93 ±13.45
Mushroom         95.77 ±0.75    98.00 ±0.51         100.00 ±0      100.00 ±0      100.00 ±0
Pima-diabetes    75.73 ±4.64    75.73 ±4.64         77.10 ±4.14    73.74 ±4.49    71.47 ±4.55
Sonar            67.97 ±10.13   68.77 ±10.72        77.78 ±8.42    73.33 ±9.46    79.81 ±8.9
Vote             90.19 ±3.92    93.84 ±3.53         96.23 ±2.77    96.46 ±2.88    95.52 ±3.09
(✗ / / ✓)        (1 / 6 / 8)    (1 / 6 / 8)         (1 / 10 / 4)   (3 / 3 / 9)    (2 / 7 / 6)
We further compared WBCSVM to some other recently published algorithms. As these papers do not report standard deviations, we could not perform a paired T test; instead we only compare mean accuracies. The number of domains used for this comparison is limited, as we could only use results obtained with cross-validation on UCI datasets. Tabular results are omitted due to space restrictions.

The wrapper approach was recently described as the best feature subset selection method for naïve Bayes by Hall and Holmes (2000). Accuracy was averaged over ten tenfold cross-validations. Comparison with WBCSVM can be performed on 8 domains; the wrapper performs better on 3 of them and worse on 5. For the other domains no results were available. Cohen and Singer (1999) proposed the boosted rule learner SLIPPER and demonstrated that it performs better than other rule learning algorithms such as C5.0 rules and RIPPER. SLIPPER achieved better results than WBCSVM on 2 data sets and worse on 5 data sets. For the other domains either no results at all, or no cross-validation results, were available. The comparison with C5.0 rules, as given in the same paper, shows that WBCSVM performs worse on only one domain and better on 6 domains.

In our experiments WBCSVM compared favourably to six well-known machine learning algorithms that are widely used in the literature, and to one recently proposed algorithm.
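As an aside, the significance-testing protocol used in this section (a two-tailed paired T test at the 99.9% level over the accuracies of ten stratified tenfold cross-validations) can be sketched as follows. This is our illustration of the form of the test reported in Table 2, not the exact experiment code; the precise fold pairing used in the Weka experiments may differ in detail.

```python
from scipy.stats import ttest_rel

def compare_to_wbcsvm(acc_other, acc_wbcsvm, alpha=0.001):
    """Two-tailed paired t-test over per-fold accuracies (e.g. 10 x 10-fold CV = 100 pairs)."""
    t_stat, p_value = ttest_rel(acc_other, acc_wbcsvm)
    if p_value >= alpha:
        return "no significant difference"
    return "other significantly better" if t_stat > 0 else "WBCSVM significantly better"
```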
5. Discussion of Related Work

This work has to be compared with other approaches that combine feature transformation and naïve Bayes classification, in particular subset selection and feature weighting. As a support vector machine is used to find an appropriate set of weights, it also has to be compared with support vector machines that use other kernel functions.

5.1 Feature Transformation

Several approaches that combine feature subset selection and naïve Bayes have been discussed in the literature (Hall, 1999; Kohavi & John, 1997; Langley & Sage, 1994). These approaches are all based on heuristic search algorithms and thus on local evaluations of feature subsets. In contrast, WBCSVM operates globally, and is guaranteed to find the feature weights such that the margin in the feature space is maximal. The relation between the classification error and the margin is well understood, and maximising the margin reduces the danger of overfitting. Furthermore, as described above, feature weighting is more flexible than subset selection.

Research in feature weighting has up to now mostly concentrated on lazy learning algorithms, in particular the
nearest neighbour classifier. Wettschereck, Aha, and Mohri (1997) review several approaches and distinguish between preset and performance biases. These correspond to the filter and wrapper approaches for feature subset selection, respectively. The algorithm described in this paper does not fit into either of these categories, as neither a heuristic nor performance feedback is used.

To the best of our knowledge, no previous attempts have been made to combine feature weighting and naïve Bayes. This might be caused by the computational complexity of feature weighting and the increased danger of overfitting. Both disadvantages are overcome in WBCSVM by adopting a support vector machine based approach: the generalisation ability is enhanced by optimising the weights such that the margin is maximal, and the search is performed efficiently by a quadratic programming algorithm.

5.2 Support Vector Machines

Obviously, the algorithm presented here is strongly related to research on support vector machines. However, support vector machines are a general approach and behave very differently given different kernel functions, as the hypothesis space strongly depends on the kernel function at hand. An interesting and important difference between the WBCSVM kernel function and other kernel functions is its dependence on the instance distribution and on class information. As described above, before the kernel function can be evaluated, the conditional probabilities have to be estimated from the given data. This can be seen as optimising the kernel function itself with respect to the data. In more traditional support vector machines, the separating hyperplane depends only on the support vectors (the instances closest to the hyperplane), and thus the trained classifier would be the same even if some of the instances that are not support vectors were removed from the training set. In our approach, such instances would still have an impact, as they influence the conditional probabilities in the feature vectors. Through these conditional probabilities, class information also influences the kernel function.

Several kernel functions have been suggested in the literature. However, they are very different from the kernel suggested in this paper, as they search very different hypothesis spaces. A more detailed discussion of this would exceed the scope of this paper. The following subsections describe the kernel functions that appear most closely related to the kernel function introduced in this paper.
5.3 Selective Bayesian Classification based on Support Vector Machines

Most closely related to the approach presented here is the kernel-based selective Bayesian classifier (Gärtner, 2000):

$$K_{SBC}(x, z) = K_{t,t}(x, z) + K_{f,f}(x, z) - K_{t,f}(x, z) - K_{f,t}(x, z)$$
$$K_{v_1, v_2}(x, z) = \prod_{j \in S} \left( p_j(x, v_1) \cdot p_j(z, v_2) + m \right)$$

with $x$, $z$, $t$, $f$, and the functions $p_j$ defined as in section 2, the set of attributes $S$, and parameter $m$. It can be shown that this kernel function calculates the inner product in a feature space containing the features:

$$f_{SBC}(x, S') = \prod_{j \in S'} p_j(x, t) - \prod_{j \in S'} p_j(x, f) \qquad (\forall S' \subseteq S)$$

On two-class domains this corresponds to the selective Bayesian classification $v_{SBC}(x, S')$ described in section 2.2. The kernel $K_{SBC}$ linearly combines the results of naïve Bayes classifiers on all possible subsets of features. This algorithm is somewhat related to MFS (Bay, 1998), which combines nearest neighbour classifiers built on different feature subsets. The parameter $m$ determines the bias of the algorithm towards large or small feature subsets, respectively.

In section 2, it has been argued that feature weighting is more expressive than feature subset selection, as it is not restricted to binary weights. Additionally, WBCSVM is computationally much less expensive than the selective approach presented there. A practical problem that appeared with large exponents in $K_{SBC}$, i.e., with large numbers of attributes, is also not present in the kernel $K_{WBC}$. Finally, we note that by explicitly calculating the weight vector $w$ and imposing a weight threshold to decide which attributes are to be used for classification, it is possible to use WBCSVM for explicitly selecting a subset of features.

5.4 Fisher Kernel

The Fisher kernel (Jaakkola & Haussler, 1999) uses a feature transformation that represents instances by a vector of real numbers. The feature transformation is based on the Fisher scores of a probabilistic model, and can thus even be applied to data that is more complex than simple tuples, for example sequential data. The Fisher scores correspond to the gradients of the log-likelihood of the conditional probability of an instance given the parameters of the probabilistic model. The most commonly chosen model is the hidden Markov model. The Fisher kernel is a slight variant of the linear inner product in the feature space made up by all Fisher scores. This transformation from hidden Markov models into the
feature space of Fisher scores can generally be used to combine generative models with discriminative classifiers.

Similar to the Fisher kernel method, WBCSVM also uses a probabilistic model of the instances, and some function of these probabilities as features in the transformed feature space. Furthermore, both classifiers use support vector machines to optimise the weights of these features. However, while the Fisher kernel is based on hidden Markov models, WBCSVM is based on a much simpler probabilistic model of conditional independence. The Fisher kernel uses the probabilistic model to put information about the instance distribution into the kernel function. This information can either be given as background knowledge or else has to be estimated from the training data. For our comparison it is important to notice that no class information about the instances is incorporated into the hidden Markov model and thus into the kernel function. The probabilistic model of WBCSVM is much simpler and is fixed in advance by the conditional independence assumption. However, as its features are based on conditional probabilities of the attributes given the class, it does take class information into account.

The advantage of the Fisher kernel is that it neither assumes conditional independence of the attributes, nor requires instances to be represented by tuples. Thus it is more generally applicable, and potentially more accurate in domains where these assumptions are violated. The advantage of the WBCSVM kernel is that it employs different probabilistic models for each class. To the best of our knowledge there is currently no other kernel function that takes class information into account. We believe that both approaches can be combined into a more general kernel definition that exhibits both advantages; such an approach would be based on the idea of having separate probabilistic models for each class.
6. Conclusions and Prospects

This paper motivated, introduced, and evaluated a new machine learning algorithm based on the idea of improving naïve Bayes classification by assigning a different weight to each conditional probability. The weights are found by a support vector machine such that the danger of overfitting is reduced. To the best of our knowledge, this is the first time a combination of naïve Bayes and feature weighting has been tried. The evaluation of our algorithm demonstrated better performance than state-of-the-art machine learning algorithms. Due to the low VC dimension of the kernel function, WBCSVM has desirable theoretical properties.

One possible extension of the algorithm is to apply it to relational data. Flach and Lachiche (2000) suggest
conditional probabilities over sets and lists. Using these conditional probabilities in the kernel function would enable WBCSVM to be applied to data that is too complex to be described by a single attribute-value table. Another promising direction is to generalise the definition of the kernel function to other probabilistic models, such as Bayesian belief networks, hidden Markov models, or probabilistic independence networks. In contrast to the Fisher kernel, the focus would still be on incorporating class information into the kernel function.
Acknowledgements

The authors are grateful to Thorsten Joachims, Mathias Kirsten, Jonathan Lawry, Stefan Wrobel, and the anonymous reviewers for valuable suggestions on earlier drafts. Part of this work has been supported by the Esprit V project (IST-1999-11495) Data Mining and Decision Support for Business Competitiveness: Solomon Virtual Enterprise.
References

Bay, S.D. (1998). Combining nearest neighbor classifiers through multiple feature subsets. Proceedings of the Fifteenth International Conference on Machine Learning (pp. 37-45). Morgan Kaufmann.

Blake, C.L., & Merz, C.J. (1998). UCI Repository of machine learning databases. Irvine, CA: University of California, Department of Information and Computer Science. [http://www.ics.uci.edu/~mlearn/MLRepository.html].

Boser, B.E., Guyon, I.M., & Vapnik, V.N. (1992). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory (pp. 144-152). ACM Press.

Cohen, W.W., & Singer, Y. (1999). A simple, fast, and effective rule learner. Proceedings of the Sixteenth National Conference on Artificial Intelligence and the Eleventh Conference on Innovative Applications of Artificial Intelligence (pp. 335-342). AAAI/MIT Press.

Cristianini, N., & Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines and Other Kernel-based Methods. Cambridge University Press.

Flach, P.A., & Lachiche, N. (2000). Decomposing probability distributions over structured individuals. In Brito, P., Costa, J., & Malerba, D. (Eds.), Proceedings of the ECML-2000 Workshop on Dealing with Structured Data in Machine Learning and Statistics (pp. 33-43).

Gärtner, T. (2000). Kernel-Based Feature Space Transformation in Inductive Logic Programming. MSc dissertation, University of Bristol, United Kingdom.

Hall, M.A. (1999). Correlation-based Feature Selection for Machine Learning. Doctoral dissertation, Department of Computer Science, The University of Waikato, Hamilton, New Zealand.

Hall, M.A., & Holmes, G. (2000). Benchmarking attribute selection techniques for data mining. Technical report, Department of Computer Science, The University of Waikato, Hamilton, New Zealand.

Jaakkola, T., & Haussler, D. (1999). Exploiting generative models in discriminative classifiers. In Advances in Neural Information Processing Systems 11. MIT Press.

Kira, K., & Rendell, L.A. (1992). A practical approach to feature selection. In Sleeman, D., & Edwards, P. (Eds.), Proceedings of the Ninth International Conference on Machine Learning (pp. 249-256). Morgan Kaufmann.

Kohavi, R., & John, G. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97(1-2), 273-324.

Kononenko, I. (1994). Estimating attributes: Analysis and extensions of Relief. In De Raedt, L., & Bergadano, F. (Eds.), Machine Learning: ECML-94 (pp. 171-182). Springer Verlag.

Langley, P., & Sage, S. (1994). Induction of selective Bayesian classifiers. Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence (pp. 399-406). Morgan Kaufmann, Seattle, WA.

Provost, F., Fawcett, T., & Kohavi, R. (1998). The case against accuracy estimation for comparing induction algorithms. Proceedings of the Fifteenth International Conference on Machine Learning (pp. 445-453). Morgan Kaufmann, San Francisco, CA.

Quinlan, R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.

Vapnik, V.N., & Chervonenkis, A.Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2), 264-280.

Wettschereck, D., Aha, D.W., & Mohri, T. (1997). A review and empirical evaluation of feature-weighting methods for a class of lazy learning algorithms. Artificial Intelligence Review, 11(1-5), 273-314.

Witten, I.H., & Frank, E. (2000). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann.