A new way to build Oblique Decision Trees using Linear Programming

Guy Michel♣, Jean Luc Lambert♣, Bruno Cremilleux♣ & Michel Henry-Amar♠

GREYC♣, CNRS UPRESA 6072 – GRECAN♠, CNRS UPRES 1772, Esplanade de la Paix, Université de Caen, F-14032 Caen Cedex, France

Summary: Adding linear combination splits to decision trees allows multivariate relations to be expressed more accurately and succinctly than univariate splits alone. We propose to use linear programming to determine an oblique hyperplane separating two sets. This formulation yields a straightforward way to treat missing values. A computational comparison of this linear programming algorithm with classical univariate split algorithms demonstrates the interest of the method.

Key words: Oblique decision tree, missing values, linear programming.

Introduction

Classification and decision trees are still an active research topic and many methods remain to be explored [5]. Unlike classical methods, which produce univariate trees, we propose to use linear programming to generate oblique decision trees¹. Over the years, many methods have been proposed to create ODT, but finding the best multivariate hyperplane is an NP-complete problem. Murthy, Kasif and Salzberg [6] proposed OC1, a solution based on impurity measures and a perturbation algorithm. Other methods, proposed by Mangasarian, Setiono and Wolberg [4] or Bennet [1], induce ODT using linear programming, but they find the optimal linear discriminants by optimising specific goodness measures defined by the authors. Instead of such an analytic approach, we use geometrical results of linear programming, and no special measure is needed. Only data sets containing two classes can be treated for the moment, but an original solution to the problem of missing values is proposed. This makes it possible to work with a medical data set and to compare the performance of the algorithm to C4.5 [7], a standard univariate decision tree builder.

♣ {gmichel,lambert,cremilleux}@info.unicaen.fr
♠ [email protected]
¹ Trees using oblique hyperplanes to partition the data are called oblique decision trees, noted ODT.

1 Linear Programming: a geometrical approach

Two principal results of linear programming are used in this work: the Simplex Algorithm and the Duality Theorem. Linear programming is an algebraic tool, so the data set has to be translated into an algebraic form. Assume, for the moment, that the data set contains two classes, that all data are numeric and that there is no missing value.

1.1 Global approach

The data set can be projected into a Euclidean space, noted E, of dimension n (n is the number of characteristics describing each datum, so each datum becomes one point in E), and each class is represented by a cloud of points. The aim of learning, in this case, is to separate the two classes (or the two clouds of points): in the case of Figure 1, the hyperplane D divides the x-class and the y-class and allows one to decide to which class new points belong.

[Figure 1: Duality theorem]

As linear programming produces linear borderlines, dividing the two clouds of points or their convex covers is equivalent. Let C1 = {E1, ..., Ep} and C2 = {F1, ..., Fq} with Ei and Fj in ℝn. The convex covers of C1 and C2, noted Con(C1) and Con(C2), are obtained by introducing classical convexity notions. The intersection of Con(C1) and Con(C2) is empty (and a linear borderline can be found) if and only if System 1 has no solution:

\[
\begin{cases}
\displaystyle \sum_{i=1}^{p} \lambda_i E_i - \sum_{j=1}^{q} \mu_j F_j = 0 \\[1ex]
\forall i \in [1..p]\;\; \lambda_i \ge 0, \qquad \forall j \in [1..q]\;\; \mu_j \ge 0 \\[1ex]
\displaystyle \sum_{i=1}^{p} \lambda_i = 1, \qquad \sum_{j=1}^{q} \mu_j = 1
\end{cases}
\]

System 1: Intersection of the convex covers.

Otherwise, another result of linear programming gives a method for obtaining good results.
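Concretely, testing whether System 1 has a solution is a pure linear programming feasibility problem. Below is a minimal sketch of this test, assuming NumPy and SciPy are available; the helper name convex_hulls_intersect is illustrative and not from the paper.

import numpy as np
from scipy.optimize import linprog

def convex_hulls_intersect(E, F):
    """E: (p, n) points of C1; F: (q, n) points of C2.
    Returns True iff System 1 has a solution, i.e. Con(C1) and Con(C2) meet."""
    p, n = E.shape
    q = F.shape[0]
    # Variables: lambda_1..lambda_p, mu_1..mu_q (all >= 0).
    # Equalities: sum_i lambda_i E_i - sum_j mu_j F_j = 0   (n rows)
    #             sum_i lambda_i = 1, sum_j mu_j = 1        (2 rows)
    A_eq = np.zeros((n + 2, p + q))
    A_eq[:n, :p] = E.T
    A_eq[:n, p:] = -F.T
    A_eq[n, :p] = 1.0
    A_eq[n + 1, p:] = 1.0
    b_eq = np.zeros(n + 2)
    b_eq[n] = b_eq[n + 1] = 1.0
    res = linprog(c=np.zeros(p + q), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * (p + q), method="highs")
    return res.status == 0  # status 0: a feasible point was found

For instance, with E = np.array([[0., 0.], [1., 0.]]) and F = np.array([[0.5, -1.], [0.5, 1.]]), the two segments cross at (0.5, 0) and the function returns True.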

1.2 Tools and methods

The Simplex Algorithm and the Duality Theorem [3] are used in the two following situations.

1.2.1 A borderline exists

First, consider the case in which a borderline dividing the convex covers exists (i.e. a situation like the one in Figure 1). A linear borderline exists if and only if System 2 is satisfied:

\[
\exists a \in \mathbb{R}^n,\; b \in \mathbb{R} : \quad
\forall i \in [1..p]\;\; a^{t} E_i < b,
\qquad
\forall j \in [1..q]\;\; a^{t} F_j > b
\]

System 2: Dual of System 1.

By noticing that System 2 is the dual of System 1 and using the Duality Theorem, the following result can be proved: if System 1 has no solution, the base of the dual is a solution of System 2. This base, easily extractable from System 1, gives, for each dimension, the coefficient of the vector a (and, of course, the equation of the hyperplane D).
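The paper reads the vector a off the dual base produced by the Simplex Algorithm; as a self-contained approximation of that step, the sketch below solves System 2 directly, replacing the strict inequalities by a unit margin (feasible exactly when the two convex covers are disjoint, since any strict separator can be rescaled to margin 1). The helper name separating_hyperplane is ours, not the authors'.

import numpy as np
from scipy.optimize import linprog

def separating_hyperplane(E, F):
    """Returns (a, b) with a.E_i < b < a.F_j for all points, or None."""
    p, n = E.shape
    q = F.shape[0]
    # Variables: a_1..a_n, b (all free). Constraints (A_ub x <= b_ub):
    #   a.E_i - b <= -1   for every point of C1
    #  -a.F_j + b <= -1   for every point of C2
    A_ub = np.zeros((p + q, n + 1))
    A_ub[:p, :n] = E
    A_ub[:p, n] = -1.0
    A_ub[p:, :n] = -F
    A_ub[p:, n] = 1.0
    b_ub = -np.ones(p + q)
    res = linprog(c=np.zeros(n + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (n + 1), method="highs")
    if res.status != 0:
        return None  # the convex covers intersect: no borderline exists
    return res.x[:n], res.x[n]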

1.2.2 No borderline exists

Most of the time, however, no hyperplane dividing the classes exists and the dual has no solution (see Figure 2 for an example of this kind of situation). The Simplex Algorithm then returns a proof of the non-existence of such hyperplanes by giving a small set of points belonging to both classes. Figure 2 shows that Con(C1) ∩ Con(C2) ≠ ∅, and one proof is given by the points {x2, x3, y1, y2} (because [x2, x3] ∩ [y1, y2] ≠ ∅). In this case, the idea is to take one point out of this set and to consider the new convex covers again.

[Figure 2: Simplex Algorithm]
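A sketch of how such a certificate can be recovered, under the same assumptions as the previous sketches: a basic feasible solution of System 1 has few nonzero coefficients, and the points carrying them witness the intersection (as {x2, x3, y1, y2} does in Figure 2). The name intersection_certificate is illustrative.

import numpy as np
from scipy.optimize import linprog

def intersection_certificate(E, F, tol=1e-9):
    """Returns (indices in C1, indices in C2) of the witnessing points,
    or None when the convex covers are disjoint."""
    p, n = E.shape
    q = F.shape[0]
    A_eq = np.zeros((n + 2, p + q))
    A_eq[:n, :p], A_eq[:n, p:] = E.T, -F.T
    A_eq[n, :p] = 1.0
    A_eq[n + 1, p:] = 1.0
    b_eq = np.zeros(n + 2)
    b_eq[n] = b_eq[n + 1] = 1.0
    # Dual simplex ("highs-ds") returns a basic (vertex) solution,
    # so the support of (lambda, mu) stays small.
    res = linprog(np.zeros(p + q), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * (p + q), method="highs-ds")
    if res.status != 0:
        return None  # covers disjoint: a borderline exists (see § 1.2.1)
    lam, mu = res.x[:p], res.x[p:]
    return np.where(lam > tol)[0], np.where(mu > tol)[0]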

Logically, after a finite number m of iterations, two distinct convex covers are obtained and the results of § 1.2.1 may be applied. These two remarks allow us to propose the following algorithm:

1.3 A trivial algorithm

Let C1 and C2 be two sets of points in ℝn. The procedure used to induce a split dividing the two sets of points is defined as Algorithm 1.

Algorithm 1: Application

  Build system
  while Indivisible do
      Find intersection
      Choose point
      Extract point from system
  end while
  Extract hyperplane from system

The function that chooses one point in the intersection set is very important and not easy to define. For example, in Figure 2, it is more interesting to extract x2 than x3, y1 or y2. For the moment, the choice function is trivial and chooses a point randomly. A better solution can certainly be found, and this point should be studied in the future.
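Putting the pieces together, here is a minimal sketch of Algorithm 1 with the trivial random choice function, reusing the two helpers sketched in § 1.2 (all names are ours, not the authors').

import numpy as np

def trivial_oblique_split(C1, C2, rng=None):
    """Removes random certificate points until a borderline exists,
    then returns the hyperplane (a, b)."""
    rng = np.random.default_rng() if rng is None else rng
    E, F = np.asarray(C1, float), np.asarray(C2, float)
    while True:
        cert = intersection_certificate(E, F)    # "Find intersection"
        if cert is None:                         # covers are now disjoint
            return separating_hyperplane(E, F)   # "Extract hyperplane"
        idx1, idx2 = cert
        # Trivial choice function: extract one certificate point at random.
        pool = [(0, i) for i in idx1] + [(1, j) for j in idx2]
        side, k = pool[rng.integers(len(pool))]
        if side == 0:
            E = np.delete(E, k, axis=0)          # "Extract point from system"
        else:
            F = np.delete(F, k, axis=0)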

2 Missing values treatment

In this representation of the data set, there is no place for missing values, whereas it is very important to handle this kind of problem, which is highly present in data extracted from the real world.

2.1 Linear Programming interpretation of missing values

Algorithms usually propose [2][8] to replace missing values by values extracted from the data set (e.g. the median value of the attribute) or to adopt a probabilistic approach [7]. The linear programming approach allows us to apply another treatment: instead of fixing missing values, they are replaced by variables. Bounds are given to those variables to ensure that they take plausible values. This means an expert has to define a maximum value and a minimum value for each dimension in which missing values occur. Sometimes those values can be extracted from the data base by finding the actual maximum and minimum, provided the data base is representative enough.

2.2 Geometric interpretation of missing values

To understand the approach, it is interesting to have a geometrical interpretation of this operation: a datum having p missing values is replaced by a hypercube of dimension p (i.e. a hypercube with 2^p vertices). Consider, in ℝ2, the following case: A = (?, α) and B = (β, ?), illustrated in Figure 3.

[Figure 3: Missing values treatment]

The constraints given by the linear programming approach are stronger than those given by the classical approaches, and the choice of hyperplanes is more limited. It is even possible to have situations in which hyperplanes exist in the first case whereas the intersection of the two hypercubes is not empty (no hyperplane can exist in that condition). This proves that the two approaches are not equivalent. The algorithm given in § 1.3 provides an elegant solution to this problem because bad points are eliminated.
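For instance, with bounds m1 and M1 on the first component (§ 2.1), the point A = (?, α) above becomes a one-dimensional hypercube, i.e. a segment, and its contribution λ_A A to System 1 is relaxed into a bounded variable. This is exactly the pattern generalised by System 3 below (the notation here is only illustrative):

\[
A = (?, \alpha) \;\longrightarrow\; \{\, (t, \alpha) : m_1 \le t \le M_1 \,\},
\qquad
\lambda_A A \;\longrightarrow\; (X_{A,1},\; \lambda_A \alpha)
\quad \text{with} \quad m_1 \lambda_A \le X_{A,1} \le M_1 \lambda_A .
\]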

2.3 Generalisation

The algebraic form is generalised for C1 and C2, two sets of points in ℝn. Let Ω = (Ω1, …, Ωn) and Θ = (Θ1, …, Θn) be two families of n index sets. For all i in [1..n], Ωi (resp. Θi) is the set of indices of points of C1 (resp. C2) for which the ith component is unknown. For example, r is in Ωi if and only if xr is in C1 and xr,i is a missing value. Let, for all i in [1..n], mi and Mi be the bounds for the ith component of the data base (read in the data base or given by an expert of the domain). The system obtained (System 3) is linear, so linear programming algorithms may be applied to solve it.

\[
\begin{cases}
\forall h \in [1..n]: \quad \displaystyle
\sum_{i=1,\, i \notin \Omega_h}^{p} \lambda_i E_{i,h}
+ \sum_{r \in \Omega_h} X_{r,h}
- \sum_{k=1,\, k \notin \Theta_h}^{q} \mu_k F_{k,h}
- \sum_{s \in \Theta_h} Y_{s,h} = 0 \\[1ex]
\forall j \in [1..n],\; \forall r \in \Omega_j: \quad
m_j \lambda_r \le X_{r,j} \le M_j \lambda_r \\[1ex]
\forall k \in [1..n],\; \forall s \in \Theta_k: \quad
m_k \mu_s \le Y_{s,k} \le M_k \mu_s \\[1ex]
\forall i \in [1..p]\;\; \lambda_i \ge 0, \qquad \forall j \in [1..q]\;\; \mu_j \ge 0 \\[1ex]
\displaystyle \sum_{i=1}^{p} \lambda_i = 1, \qquad \sum_{j=1}^{q} \mu_j = 1
\end{cases}
\]

System 3: Generalised system.

Notice that if some components are not numeric, it is always possible to transform them. For example, [Small, Medium, Big] becomes [1, 2, 3], and [Yellow, Blue, Red] is replaced with three binary components: [0,1] for Yellow, [0,1] for Blue and [0,1] for Red. This treatment is important and, at a rough estimate, can be seen as a drawback of the method because it cannot be done automatically by algorithms: expert knowledge is needed. But, with a view to a semantic interpretation of the attributes, only human users can be efficient.
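As a small illustration of these manual encodings (not the authors' code):

# Ordered attribute: ranks preserve the ordering Small < Medium < Big.
SIZE_RANK = {"Small": 1, "Medium": 2, "Big": 3}
# Unordered attribute: one binary component per colour.
COLOURS = ("Yellow", "Blue", "Red")

def encode(size, colour):
    return [SIZE_RANK[size]] + [1 if colour == c else 0 for c in COLOURS]

# Example: encode("Medium", "Blue") -> [2, 0, 1, 0]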

3 Computational results

Before giving the results of the experiments comparing the performance of this algorithm to C4.5, the data set used has to be described.

3.1 The Hodgkin’s disease data set

The results reported in this paper come from a data set collected by the Lymphoma Cooperative Group of the European Organisation for Research and Treatment of Cancer (EORTC) and provided by Dr. M. Henry-Amar♠. The data set describes more than 3000 patients treated with various protocols. After treatment², the data set currently has 824 entries for the learning data and 701 entries for the test data³. The patients, classified as “Favourable” (369 cases) or “Unfavourable” (455 cases), are described through 16 continuous attributes and three binary attributes. The learning data set contains 330 missing values concentrated on five attributes.

3.2 Comments about the form of the results

In this experiment, the extrema were extracted from the data base as described in § 2.1 and were checked by Dr. M. Henry-Amar. The hyperplanes are given in the form explained in Table 1: for each attribute of the data base, the value obtained in the dual is the coefficient of the vector a described in § 1.2.1.

² Different protocols produce different descriptors for the patients, so choices had to be made.
³ These data sets are fixed in this way for temporal reasons.

Table 1: Results form

Coef. for age        :  0.0093584466      Coef. for sexe    : -0.005338922
Coef. for cbdfus     :  0.0014359370      Coef. for cbgfus  :  0.0007233347
Coef. for axdfus     :  0.0009392872      Coef. for axgfus  :  0.0003019576
Coef. for medfus     :  0.1363806657      Coef. for ext     :  0.0887066056
Coef. for sg         :  0.2534100762      Coef. for vs      :  0.0048398000
Coef. for hb         : -0.001241328       Coef. for gb      :  2.485281e-05
Coef. for polfus     :  0                 Coef. for lymfus  :  0
Coef. for monfus     :  0                 Coef. for plaq    :  0
Coef. for pa         : -0.000123759       Coef. for ldh     :  7.475156e-06
Coef. for histforus  :  0.0077649976
Coef. for lambda_dua : -0.411007586       Coef. for mu_dua  :  0.4219074193

In the case of Table 1, the vector a is

a = (0.0093584466, -0.005338922, …, 7.475156e-06, 0.0077649976)

and

b = (0.0077649976 - (-0.4219074193)) / 2.

A point x belongs to “Favourable” if ax > b; otherwise, it belongs to “Unfavourable”. The results are given in a form that allows a semantic interpretation. For example, the attribute polfus is not needed for classification because its coefficient is null, and the large value of the age coefficient means that younger patients are more concerned by the disease than older ones. This information is well known to medical experts, but other hypotheses are confirmed in this way.
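As a small illustration of this decision rule (a hedged sketch; the function name classify is ours, and a and b are the vector and threshold read from Table 1):

import numpy as np

def classify(x, a, b):
    """x: attribute vector of a patient, ordered as the vector a above."""
    return "Favourable" if float(np.dot(a, x)) > b else "Unfavourable"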

3.3 Results of the linear programming classifier

To obtain these results, just three oblique hyperplanes were extracted (that means the working space is divided into four regions by three linear borderlines), whereas C4.5 uses more than 20 rules.

Learning          Class 1    Class 2    Total
Linear Prog.      88.1%      89.2%      88.7%
C4.5 (Rules)      90.8%      99.3%      94.3%

Test              Class 1    Class 2    Total
Linear Prog.      85.3%      81%        82.8%
C4.5 (Rules)      87.4%      91.1%      89.6%

The results obtained are below those obtained by C4.5, but they remain very interesting: with only three hyperplanes, the results obtained on the test data base prove that the method described above is valid even on a data base extracted from the real world (the C4.5 results show the non-trivial partition of the Hodgkin’s disease data base). Notice that the speed of the linear programming algorithm, particularly well adapted to this kind of situation, makes the results very easy to obtain on a common computer.

4 Conclusions

This paper has described a linear programming method to construct linear borderlines quickly and easily. A new way to exploit linear programming in classification has been presented and an original treatment of missing values has been introduced. The proposed implementation was tested on a consistent data set extracted from the medical domain and interesting results were obtained: good hyperplanes that generalise very well. But better results may be obtained by exploring the following ideas: building complete decision trees with linear borderlines, defining a clever choice function (§ 1.3), introducing the extracted elements in another order, using pruning methods, testing the algorithm with cross-validation and other data sets, and generalising to multi-class decision trees. These extensions are currently being explored.

References

[1]. Bennet (1992). Decision tree construction via linear programming. Computer Sciences Technical Report 1067.
[2]. Celeux (1988). Le traitement des valeurs manquantes dans le logiciel SICLA. [The treatment of missing values in the SICLA software.]
[3]. Chvátal (1993). Linear Programming. W.H. Freeman and Company.
[4]. Mangasarian, Setiono & Wolberg (1990). Pattern recognition via linear programming: theory and application to medical diagnosis. In: SIAM Workshop on Optimisation.
[5]. Michie, Spiegelhalter & Taylor (1994). Machine Learning, Neural and Statistical Classification. Ellis Horwood Series in Artificial Intelligence.
[6]. Murthy, Kasif & Salzberg (1994). A system for induction of oblique decision trees. Journal of Artificial Intelligence Research 2, 1-32.
[7]. Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.
[8]. Quinlan (1989). Unknown attribute values in induction. In Segre (ed.), Proceedings of the Sixth International Workshop on Machine Learning, 164-168.
