Multi-objective learning of multi-dimensional Bayesian classifiers
Juan D. Rodríguez and Jose A. Lozano
Intelligent Systems Group, Department of Computer Science and Artificial Intelligence, University of the Basque Country, San Sebastián, Spain
[email protected],
[email protected]
Abstract

Multi-dimensional classification is a generalization of supervised classification that considers more than one class variable to classify. In this paper we review the existing multi-dimensional Bayesian classifiers and introduce a new one: the KDB multi-dimensional classifier. Then we define different classification rules for the multi-dimensional setting. Finally, we introduce a structural learning approach for multi-dimensional Bayesian classifiers based on the multi-objective evolutionary algorithm NSGA-II. The solution of the learning approach is a Pareto front representing different multi-dimensional classifiers and their accuracy values for the different classes, so a decision maker can easily choose the classifier that is most interesting for the particular problem and domain.

1. Introduction

A classical supervised classification task [1, 4] tries to predict the value of a unique class variable based on a set of predictive variables (features). However, in many application domains there is more than one class variable and an instance has to be assigned to a combination of classes, usually the most probable one. The problems arising in these domains are not one-dimensional, so it is necessary to introduce the concept of multi-dimensional classification, where there is more than one class variable to predict.

Classical Bayesian network classifiers restrict the number of class variables to one, so they cannot be straightforwardly applied to multi-dimensional classification. There are several possibilities for applying a Bayesian network approach to multi-dimensional classification. One approach is to construct a unique class variable that models all possible combinations of classes (the Cartesian product of all the class variables). The problem is that this compound class variable can easily end up with an excessively high cardinality. Furthermore, the model does not reflect the real structure of the classification problem. Another approach is to develop multiple classifiers, one for each class variable. This approach does not reflect the real structure of the problem either, because it does not model the conditional dependences between the different class variables. Recent works [6, 11] have proposed learning and inference algorithms for Bayesian classifiers with more than one variable to be predicted, but so far they are restricted to fully tree-augmented and fully polytree-augmented multi-dimensional classifiers. For these kinds of models, those authors have demonstrated that the general problem of learning the restricted structure can be solved in polynomial time.

2 Multi-dimensional Bayesian Classifiers

2.1 Structure of multi-dimensional Bayesian classifiers

A Bayesian network is a pair B = {S, Θ}, where S is a directed acyclic graph (DAG) whose vertices correspond to random variables and Θ is a set of parameters. We consider Bayesian networks over a finite set V = {C_1, ..., C_m, X_1, ..., X_n}, where each variable C_j and X_i takes values in a finite set determined for each variable. Θ is formed by a parameter θ_{x_i|Pa(x_i)} and θ_{c_j|Pa(c_j)} for each value that x_i and c_j can take and for each value assignment Pa(x_i) and Pa(c_j) to the sets Pa(X_i) and Pa(C_j) of parents of X_i and C_j, respectively. In a multi-dimensional Bayesian network classifier the DAG structure S = (V, A) has the set V of random variables partitioned into the set V_C = {C_1, ..., C_m}, m > 1, of class variables and the set V_F = {X_1, ..., X_n}, n ≥ 1, of feature variables. Moreover, the set of arcs A can be partitioned into three sets, A_CF, A_C and A_F, with the following properties:
• A_CF ⊆ V_C × V_F is composed of the arcs between the class variables and the feature variables, so we can define the feature selection subgraph of S induced by A_CF and V_C ∪ V_F as S_CF = (V, A_CF). This subgraph represents the selection of features that seem relevant for classification given the class variables.

• A_C ⊆ V_C × V_C is composed of the arcs between the class variables, so we can define the class subgraph of S induced by V_C as S_C = (V_C, A_C).

• A_F ⊆ V_F × V_F is composed of the arcs between the feature variables, so we can define the feature subgraph of S induced by V_F as S_F = (V_F, A_F).

In Figure 1 we show a multi-dimensional Bayesian classifier with 3 class variables and 5 features, together with its division into subgraphs.

Figure 1. A multi-dimensional Bayesian classifier and its division

Depending on the structure of the class subgraph and the feature subgraph, we can distinguish the following subfamilies of multi-dimensional Bayesian classifiers:

• Fully naive multi-dimensional classifier: both the class subgraph and the feature subgraph are empty [6].

• Fully tree-augmented multi-dimensional classifier: both the class subgraph and the feature subgraph are directed trees [6].

• Fully polytree-augmented multi-dimensional classifier: both the class subgraph and the feature subgraph are polytrees (singly connected) [11].

In this article we use another multi-dimensional Bayesian classifier in which both the class subgraph and the feature subgraph are K-dependence Bayesian classifiers (KDB [9]). It allows each predictive variable X_i to have a maximum of K dependences on other predictive variables X_j with j < i, apart from the class variables. We call it the KDB multi-dimensional classifier. Figure 2 shows the different multi-dimensional Bayesian classifiers developed so far.

Figure 2. Multi-dimensional Bayesian classifiers

2.2 Multi-dimensional classification rules

A multi-dimensional Bayesian classifier represents the joint probability p(C_1, ..., C_m, X_1, ..., X_n). Using the same Bayesian network, we can obtain different classifiers depending on the classification rule we use (a small numerical sketch contrasting the two rules follows the list below).

• Joint classification rule: It is the most obvious classification rule for multi-dimensional Bayesian classifiers. The estimated class values (ĉ_1, ..., ĉ_m) are computed as:

  (ĉ_1, ..., ĉ_m) = argmax_{c_1, ..., c_m} p(c_1, ..., c_m | x_1, ..., x_n)
                  = argmax_{c_1, ..., c_m} p(c_1, ..., c_m, x_1, ..., x_n) / p(x_1, ..., x_n)
                  = argmax_{c_1, ..., c_m} p(c_1, ..., c_m, x_1, ..., x_n)

• Marginal classification rule: We propose a classification rule that classifies each class variable separately, marginalizing over the rest of the classes:

  ĉ_i = argmax_{c_i} p(c_i | x_1, ..., x_n)
      = argmax_{c_i} Σ_{c_¬i} p(c_i, c_¬i | x_1, ..., x_n)

  where c_¬i = {c_1, ..., c_{i-1}, c_{i+1}, ..., c_m}.
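To illustrate the difference between the two rules, the following Python sketch (ours, not part of the original implementation) applies both rules by brute-force enumeration to a hypothetical posterior p(c_1, c_2 | x) over two binary class variables; the probability values are invented for illustration.

```python
# Illustrative sketch (not the authors' code): joint vs. marginal classification
# rules for two class variables, given the posterior p(c1, c2 | x) as a dict.
from collections import defaultdict

# Hypothetical posterior over (c1, c2) for one instance x; values sum to 1.
joint_posterior = {
    (0, 0): 0.10, (0, 1): 0.35,
    (1, 0): 0.30, (1, 1): 0.25,
}

# Joint rule: predict the most probable *combination* of classes.
joint_prediction = max(joint_posterior, key=joint_posterior.get)   # -> (0, 1)

# Marginal rule: each class is predicted separately after summing out the others.
def marginal_prediction(posterior, num_classes=2):
    preds = []
    for i in range(num_classes):
        marg = defaultdict(float)
        for combo, p in posterior.items():
            marg[combo[i]] += p          # marginalize over the other class variables
        preds.append(max(marg, key=marg.get))
    return tuple(preds)

print(joint_prediction)                      # (0, 1)
print(marginal_prediction(joint_posterior))  # (1, 1): p(c1=1|x)=0.55, p(c2=1|x)=0.60
```

Note that the two rules can disagree: here the most probable joint configuration is (0, 1), while the per-class marginals favour (1, 1).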
3 Multi-objective approach to multi-dimensional classification
The general problem of learning a Bayesian network, in a classification context, is to find a network B = {S, Θ} that achieves the minimum generalization error for a new instance. So, in classical supervised classification we have to find the structure that maximizes the accuracy of the class variable given a data set of instances D = {d_1, d_2, ..., d_N}. In multi-dimensional domains we have to maximize it for each class variable.

Once the classification rule is decided, we can measure the accuracy of each class separately by means of any evaluation technique. So, there will be several error estimations for the classifier, one for each class variable. This suggests that the multi-dimensional classification learning problem can be considered as a multi-objective optimization problem whose objective is to find the network B that maximizes the accuracy of each class variable. Moreover, we suppose that the classes are in conflict with each other, that is, we cannot find a classifier that improves the classification accuracy of one class variable without worsening the accuracy of another class variable. To carry out this approach to multi-dimensional classification by means of multi-objective optimization, we use a classical multi-objective evolutionary algorithm (MOEA), the nondominated sorting genetic algorithm II (NSGA-II) [3]. The search space is a binary vector representing the structure of the multi-dimensional Bayesian network and the functions to optimize are the accuracies of each class.

There are some supervised classification approaches based on multi-objective optimization [7]. However, none of them deals with multi-dimensional classification. In the supervised classification field, multi-objective optimization has been used to optimize sensitivity and specificity on ROC curves, rule mining and partial classification, model accuracy versus model complexity, feature selection, accuracy on two different data sets, or ensemble learning by means of the integration of diverse classifiers.

4 Multi-objective optimization

A multi-objective optimization problem (MOP) can be defined as an optimization problem with multiple objectives, measured with different performance functions that are usually in conflict with each other, and a set of restrictions. Hence, optimization means finding a solution that gives values of all the objective functions acceptable to the decision maker [8]. Then the decision maker, for example an expert or a function based on a cost matrix, must choose the optimum solution. The aim is to find good compromises (trade-offs) rather than a single solution [2]. Formally, a multi-objective optimization problem can be formulated as finding the vector x that satisfies l inequality restrictions g_i(x) ≥ 0 for i = 1, 2, ..., l and k equality restrictions h_i(x) = 0 for i = 1, 2, ..., k, and optimizes the vector of objective functions:

  f(x) = [f_1(x), ..., f_m(x)]

An important concept in multi-objective optimization is Pareto dominance: a vector u = (u_1, ..., u_m) is said to dominate v = (v_1, ..., v_m) (denoted by u ≺ v) if and only if u is partially less than v (on minimization), that is,

  ∀i ∈ {1, ..., m}: u_i ≤ v_i  ∧  ∃i ∈ {1, ..., m}: u_i < v_i

An Edgeworth-Pareto optimal solution [10] is a nondominated solution, that is, a solution that cannot be improved in any objective function without simultaneously worsening some other objective. A set of Pareto optimal solutions composes a Pareto optimal set and forms a Pareto front. The expected solution of a multi-objective optimization problem is a Pareto front representing the values of the performance functions for each objective. A small sketch illustrating nondominated filtering is given at the end of this section.

There are several multi-objective evolutionary algorithms (MOEAs), but we think that, for a first approach, NSGA-II is appropriate. NSGA-II is a multi-objective evolutionary algorithm with a nondominated sorting of the population, elitism and diversity preservation by means of a crowded-comparison operator. Initially a random parent population P_0 of size N is created and sorted based on nondomination. Each solution is assigned a nondomination level (level 1 is the best: solutions in level 1 are nondominated among themselves, while every solution in a lower level is dominated by some solution in a higher level). Selection, crossover and mutation operators are used to create the offspring population Q_0 of size N. The combined population R_0 = P_0 ∪ Q_0 is then sorted into nondomination levels F_i. In our case, we use binary tournament selection, single-point crossover and a bit-flip mutation operator over the whole solution. Elitism is ensured because all current and previous population members are included in R_0. Finally, the new population P_1 is created by placing the solutions of each nondomination level, from level 1 onwards. This procedure continues until a nondomination level is bigger than the remaining space in the new population. Then the solutions of that nondomination level are sorted in descending order using a crowded-comparison operator, and the new population is filled with the best solutions of the level. This operator gives an estimate of the density of the solutions surrounding a particular solution: it measures the distance between the two nearest solutions on each side of the particular solution for each objective, and the overall crowding-distance value is calculated as the sum of the individual distance values corresponding to each objective. The operation of this algorithm is represented in Figure 3.

Figure 3. Selection schema in NSGA-II
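As a concrete illustration of the dominance relation (stated above for minimization; for per-class accuracies, which are maximized, the inequalities are simply reversed), the following sketch, which is ours and not part of the experimental code, filters a set of accuracy vectors down to its nondominated set. The candidate accuracy pairs are invented for illustration.

```python
# Illustrative sketch (ours): Pareto dominance and nondominated filtering for
# accuracy vectors, where every objective is maximized.
from typing import List, Sequence

def dominates(u: Sequence[float], v: Sequence[float]) -> bool:
    """u dominates v if u is at least as good in every objective and strictly better in one."""
    return all(ui >= vi for ui, vi in zip(u, v)) and any(ui > vi for ui, vi in zip(u, v))

def pareto_front(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Keep only the nondominated accuracy vectors (a brute-force O(n^2) scan)."""
    return [p for i, p in enumerate(points)
            if not any(dominates(q, p) for j, q in enumerate(points) if j != i)]

# Hypothetical per-class accuracies (accuracy of C_1, accuracy of C_2) of candidate classifiers.
candidates = [(0.70, 0.85), (0.72, 0.83), (0.68, 0.84), (0.74, 0.80), (0.70, 0.80)]
print(pareto_front(candidates))   # [(0.70, 0.85), (0.72, 0.83), (0.74, 0.80)]
```

NSGA-II applies this kind of nondominated filtering repeatedly to build the levels F_1, F_2, ... described above; the brute-force scan here is only meant to make the definition concrete.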
5 Structural learning approach

In this multi-objective approach to multi-dimensional classification, the search space X is the set of binary vectors x representing the arcs of the multi-dimensional Bayesian network. A vector represents all the possible arcs of any KDB multi-dimensional classifier. The maximum K value for a KDB multi-dimensional classifier is equal to the number of feature variables minus one. Given that it is impossible to work with a KDB that represents too many dependences, we have applied a post-processing rule to the individual in order to remove surplus dependences. We use a method based on the mutual information of two features (X_i parent of X_j) given the class parents of X_j: if a feature has more than K dependences, we only keep the K dependences with maximum mutual information and delete the rest. Therefore, a unique classifier can be equivalent to several different solutions.

The individual represents all the possible arcs of the class subgraph, the feature subgraph and the feature selection subgraph. In the fully naive case, the solution represents only the feature selection subgraph arcs, because the feature subgraph and the class subgraph are empty. The individual is codified in the following way: the first m · (m − 1) components represent the arcs between the class variables, that is, the set A_C; the next m · n components represent the arcs between the class variables and the features, the set A_CF; and the last n · (n − 1) components represent the arcs between the features, the set A_F (a decoding sketch is given at the end of this section).

The objective functions are the k-fold cross-validation error estimations (kcv) of each class. So, the aim of the multi-objective optimization algorithm is to optimize the vector of per-class accuracies:

  kcv(x) = [kcv_1(x), ..., kcv_m(x)]

We have used the two classification rules and we have distinguished two different multi-dimensional accuracies, one for each classification rule. For each classification rule we develop a different 5-fold cross-validation error estimator based on the corresponding accuracy. These estimators are used as the objective functions.

• Accuracy based on the joint classification rule: In this case we classify all the classes together and compute the error of each class separately, using the joint classification rule. We check whether each class, taken separately, is correctly classified within the predicted joint class. For example, if we classify an instance x as (ĉ_1 = 0, ĉ_2 = 1) and the real value is (c_1 = 0, c_2 = 0), we count ĉ_1 as a success and ĉ_2 as an error.

• Accuracy based on the marginal classification rule: In this case we classify each class bearing in mind all possible values of the other classes, that is, marginalizing over them. It uses the marginal classification rule.
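To make the codification of the individual concrete, the following sketch (ours; the actual implementation relies on the weka and jMetal Java libraries, which we do not reproduce here) decodes a binary vector into the three arc sets and applies the mutual-information-based pruning described above. The pair enumeration order and the function and variable names are assumptions made for this illustration.

```python
# Illustrative sketch (ours, not the paper's implementation): decoding a binary
# individual into the arc sets A_C, A_CF and A_F of a multi-dimensional classifier.
from itertools import permutations, product

def decode_individual(bits, m, n):
    """Split a 0/1 vector of length m(m-1) + m*n + n(n-1) into the three arc sets.

    Class variables are C_0..C_{m-1}, features are X_0..X_{n-1}; the ordered-pair
    enumeration below is an assumption made for this sketch.
    """
    class_pairs   = list(permutations(range(m), 2))     # candidate C_j -> C_k arcs (A_C)
    cf_pairs      = list(product(range(m), range(n)))   # candidate C_j -> X_i arcs (A_CF)
    feature_pairs = list(permutations(range(n), 2))     # candidate X_i -> X_j arcs (A_F)
    assert len(bits) == len(class_pairs) + len(cf_pairs) + len(feature_pairs)

    a_c  = {p for p, b in zip(class_pairs, bits) if b}
    a_cf = {p for p, b in zip(cf_pairs, bits[len(class_pairs):]) if b}
    a_f  = {p for p, b in zip(feature_pairs, bits[len(class_pairs) + len(cf_pairs):]) if b}
    return a_c, a_cf, a_f

def prune_to_k_dependences(a_f, mutual_info, k):
    """Keep at most k feature parents per feature, preferring higher (conditional)
    mutual information, mirroring the post-processing rule described above.
    `mutual_info` is a hypothetical dict keyed by (parent, child)."""
    pruned = set()
    for child in {j for (_, j) in a_f}:
        parents = sorted((i for (i, j) in a_f if j == child),
                         key=lambda i: mutual_info[(i, child)], reverse=True)
        pruned.update((i, child) for i in parents[:k])
    return pruned
```

In the fully naive case only the middle block of the vector is relevant, since A_C and A_F are constrained to be empty.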
6 Simulation study

6.1 Experimental design

First, we describe the data sets used to test the multi-objective approach to multi-dimensional structural learning. Since typical benchmark data repositories in supervised classification do not provide data sets with multiple class variables, we use biomedical multi-dimensional data sets on Crohn's disease and ulcerative colitis from a biomedical corporation. These data sets contain genetic and environmental information and the progress of the diseases in different aspects. The genetic and environmental variables are the features and the progress-of-the-disease variables are the classes of the multi-dimensional classification problem. There are 58 features and 2 classes in both problems. All the variables are discrete, with cardinalities from 2 to 5. In these data sets, the maximum number of arcs (the size of the individual) for any KDB multi-dimensional classifier is 1770.

The main study is carried out with the KDB multi-dimensional classifier with K = 2. However, we extend the study to previously developed multi-dimensional Bayesian classifiers, such as the fully naive and the fully tree-augmented ones, in order to compare the results. The remaining NSGA-II parameters for the multi-dimensional Bayesian classifier are the population size (set equal to the size of the individual) and the number of generations of the population (1000). In order to make all these experiments possible we use different open source Java libraries: the weka library [12] for the classification utilities and the jMetal library [5] for the multi-objective optimization utilities.

6.2 Experimental results

In this section we present the results of the experimentation. Each point of the Pareto fronts represents a classifier and its accuracy value for each class. Figure 4 shows the Pareto fronts of the KDB multi-dimensional classifier for each data set (Crohn and colitis) and each classification rule. The first thing we notice is that the multi-dimensional Bayesian classifiers achieve acceptable solutions for both class variables and both classification rules.
In the colitis data set, both class accuracies exceed 80%, and in the Crohn disease data set one class accuracy is nearly 90% while the other is around 70% in the Pareto set solutions.

Figure 4. Pareto fronts of the different classification rules (marginal vs. joint) for (a) the Crohn data set and (b) the colitis data set; axes: accuracy of C_1 versus accuracy of C_2.

As we have said, the classifiers of the Pareto front achieve acceptable accuracies for both classes, but there are few solutions on it. There are no extreme classifiers which classify one class well while classifying the other badly. This may be the result of classes that are not strongly in conflict, or of the use of small data sets.

At this point it is of interest to compare the Pareto fronts obtained with the different classification rules. Classifiers based on the marginal classification rule seem to behave slightly better than those based on the joint classification rule, but this seems to depend on the problem.

Finally, we have run the other multi-dimensional Bayesian classifiers on the data sets. In Figure 5 we compare the Pareto fronts of each classifier based on the joint classification rule. The fully naive classifier is the least accurate one and the multi-dimensional KDB with K = 2 is the most accurate. It seems that the more dependences the classifier models, the greater the accuracy, although at the expense of more computational cost.

Figure 5. Pareto fronts of the different classifiers (FKDB, FTAN, FNB) based on the joint classification rule for (a) the Crohn data set and (b) the colitis data set; axes: accuracy of C_1 versus accuracy of C_2.

7 Conclusions

In this paper, we have presented a multi-objective approach to the structural learning of a multi-dimensional Bayesian classifier. To that end we have defined a multi-dimensional classifier, learned a population of classifiers (nondominated solutions) by means of a multi-objective optimization technique and tested it on two multi-dimensional classification data sets. We have used a multi-objective approach because the learning of the multi-dimensional classifier can be seen as finding the network B that maximizes the accuracy of each class variable. The objective functions for the multi-objective approach are the multi-dimensional k-fold cross-validation estimations of the errors.

Another important task is to choose a proper classification rule. We have presented two different rules, each with its own k-fold cross-validation estimator: the joint classification rule and the marginal classification rule.

The multi-dimensional classifiers find, for both data sets, classifiers with acceptable accuracy values for all the classes. Classifiers that use the marginal classification rule obtain slightly better accuracies. However, we expected to find a more diversified Pareto front and some extreme solutions.

We have seen that the multi-objective approach is feasible and gives appropriate classifiers. Even more important, it offers a very interesting graphical representation of the behaviour of several different multi-dimensional Bayesian classifiers learned from the same data set and of the accuracy of each class variable, so a decision maker can easily choose the appropriate one from the Pareto front.

As future work, we would like to extend the experimentation to data sets with more than two class variables. Also, the results of the Pareto front are validated with the same data sets used in the learning process, so it would be of interest to validate them with unseen data. Then we could compare the multi-dimensional Bayesian results with a traditional approach. Moreover, it could be interesting to use multi-objective optimization techniques oriented to feature selection in multi-dimensional classification.
8 Acknowledgment

This work has been partially supported by the Etortek, Saiotek and Research Groups 2007-2012 (IT-242-07) programs (Basque Government), the TIN2005-03824 and Consolider Ingenio 2010 - CSD2007-00018 projects (Spanish Ministry of Education and Science) and the COMBIOMED network in computational biomedicine (Carlos III Health Institute).

References

[1] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[2] C. Coello, G. Lamont, and D. Van Veldhuizen. Evolutionary Algorithms for Solving Multi-Objective Problems (Genetic and Evolutionary Computation). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
[3] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2):182–197, April 2002.
[4] R. Duda, P. Hart, and D. Stork. Pattern Classification. Wiley Interscience, 2001.
[5] J. Durillo, A. Nebro, F. Luna, B. Dorronsoro, and E. Alba. jMetal: A Java framework for developing multi-objective optimization metaheuristics. Technical report, Departamento de Lenguajes y Ciencias de la Computación, University of Málaga.
[6] L. van der Gaag and P. de Waal. Multi-dimensional Bayesian network classifiers. In Proceedings of the Third European Workshop on Probabilistic Graphical Models, pages 107–114, 2006.
[7] J. Handl, D. Kell, and J. Knowles. Multiobjective optimization in bioinformatics and computational biology. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 4(2):279–292, 2007.
[8] A. Osyczka. Multicriteria Optimization for Engineering Design. Academic Press, 1985.
[9] M. Sahami. Learning limited dependence Bayesian classifiers. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 335–338, 1996.
[10] W. Stadler. Multicriteria Optimization in Engineering and in the Sciences. Springer, 1988.
[11] P. de Waal and L. van der Gaag. Inference and learning in multi-dimensional Bayesian network classifiers. Lecture Notes in Artificial Intelligence, 4724:501–511, 2007.
[12] I. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. The Morgan Kaufmann Series in Data Management Systems, 2000.