FEATURE SPACE PARTITIONING BY NON-LINEAR AND FUZZY DECISION TREES

Andreas Ittner (a), Jens Zeidler (a), Rolf Rossius (a), Werner Dilger (a), Michael Schlosser (b)

(a) Chemnitz University of Technology, Department of Computer Science, D-09107 Chemnitz, Phone: +49 371 531-1643, Fax: +49 371 531-1465, E-mail: {ait, jzei, ros, [email protected]}

(b) FH Koblenz, Department of Electrical Engineering, Am Finkenherd 4, D-56075 Koblenz, Phone: +49 261 9528-187, Fax: +49 261 56953, E-mail: [email protected]
Abstract. This paper focuses on a unified view of the field of non-linear feature space partitioning. We present two well-known approaches to growing decision trees from data and show that these methods have a lot in common regarding non-linearity. The aim of this paper is to clarify that the application of simple mathematical operations broadens the capabilities to split the feature space in a non-linear fashion.

Keywords: Non-linear and Fuzzy Decision Trees, Feature Space Partitioning
1 Introduction

In the research field of supervised learning from examples, the accuracy of feature space partitioning is in some cases much more important than the simplicity of the splitting. One important method for a more accurate separation of examples belonging to different classes is non-linear feature space partitioning. Especially in the field of growing decision trees from data, non-linear partitioning broadens the capabilities of finding good splits in the feature space. One way to achieve non-linearity is to integrate a simple mathematical operation, namely multiplication. This can be done, for instance, within the process of generating new features from a set of given task-supplied primitive ones. Using these newly created 'non-linear' features allows for a non-linear partitioning of the original feature space. The studies in [3] and [2] showed that non-linear decision tree algorithms (NDT's) produce more accurate trees than their axis-parallel or oblique counterparts. On the other hand, multiplication is an important operation for defining logical operators in fuzzy theory and their application to fuzzy decision trees (FDT's). In the area of FDT's [8] we can find non-linear partitionings too. Here the calculation with Fuzzy-AND and Fuzzy-OR in conjunction with defuzzification creates new non-linear geometrical forms of feature space partitions (see [9]).

Section 2 of this paper is dedicated to non-linear feature space partitioning with NDT's. Section 3 deals with the fundamentals of fuzzy decision tree generation. In section 4 we develop a unified view of the partitioning problem as a synthesis of these two types of trees and demonstrate the influence of the membership functions and the fuzzy operators on the feature space partitioning. Section 5 summarizes the lessons learned from this unification and outlines an avenue for further work.
2 Non-Linear Decision Trees

The method of NDT, first introduced in [1], proceeds in two consecutive steps:

1. augmentation of the feature space,
2. application of a decision tree algorithm.

The first step is done by building all possible pairwise products and squares of the $n$ given primitive numerical features, resulting in a set of $(n^2 + 3n)/2$ features which are considered as the axes of a new feature space. These features are represented by the terms of equations describing hypersurfaces of the second degree. For example, in the two-dimensional case such an equation has the form

$0 = ax_1^2 + 2bx_1x_2 + cx_2^2 + 2dx_1 + 2ex_2 + f$. (1)

Ellipses, circles, hyperbolas, etc. are described by equations of this type. In the $m$-dimensional case ($m > 2$) we get ellipsoids, hyperboloids, paraboloids, etc.

In the second step of the NDT method a decision tree algorithm is applied to construct an oblique decision tree in the augmented feature space, which is now of higher dimension. In our experiments we used OC1 (Oblique Classifier 1) [5]. This algorithm generates hyperplanes with an oblique orientation as a test of a linear combination of primitive and newly created features at each internal node. In general, these hyperplanes correspond to non-linear hypersurfaces in the original feature space of primitive features. One of the data sets we used in our experiments [2] was an artificial set, called SPIRAL [Figure 1], which allows for an exemplary demonstration of the ability of the NDT method [Figure 2].
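To make the augmentation step concrete, the following minimal sketch (our own illustration; the function name and sample data are not from the original NDT implementation) builds the augmented feature space for m examples with n primitive features:

```python
from itertools import combinations

import numpy as np

def augment_features(X):
    """Augment n primitive features with all squares and pairwise products.

    X is an (m, n) array of m examples. The result has (n^2 + 3n) / 2
    columns: the n primitive features, the n squares, and the
    n(n-1)/2 pairwise products.
    """
    m, n = X.shape
    columns = [X]                                  # primitive features x_i
    columns.append(X ** 2)                         # squares x_i^2
    for i, j in combinations(range(n), 2):         # pairwise products x_i * x_j
        columns.append((X[:, i] * X[:, j]).reshape(m, 1))
    return np.hstack(columns)

# For n = 2 primitives (x1, x2) this yields (4 + 6) / 2 = 5 features:
# x1, x2, x1^2, x2^2, x1*x2 -- the terms of the general conic in equation (1).
X = np.array([[1.0, 2.0], [3.0, 4.0]])
assert augment_features(X).shape == (2, 5)
```

An oblique tree grown on these columns, e.g. by OC1, places hyperplanes that are linear in the augmented axes but correspond to second-degree hypersurfaces such as equation (1) in the original space of the primitive features.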
Figure 1: Spiral Data Set
Figure 2: Non-Linear Partitioning

The source of power regarding non-linear feature space partitioning with NDT's is the dualism between a linear partitioning in a skilfully constructed feature space of high dimension and the original space of the given features. In the next section we describe another approach to non-linear feature space partitioning, which is based on FDT's.
3 Fuzzy Decision Trees

Classical crisp decision trees (i.e. axis-parallel, oblique, and non-linear decision trees) are widely applied to classification tasks. However, there is also a number of fuzzy decision tree approaches [4], [6], [7], [8]. In the field of FDT's, the learning examples are labelled with membership grades. These grades represent the affiliations of the examples with the classes. Fuzzy borders for the discretisation of continuous-valued attributes [Figure 3] are used in almost all of the approaches mentioned above.

Figure 3: Trapezoidal membership functions

The resulting FDT consists of internal nodes and leaves. Every internal node corresponds to a test attribute [Figure 4]; in this example, 'weight' and 'height' are the features. The leaf nodes are labelled with class membership values for all classes. These values represent the class ratio of the examples whose attribute values match the tests on the internal nodes from the root to the leaf node. Each branch is described by a membership function according to the discretisation of the continuous-valued attribute.

To classify new, unseen examples with an FDT, we have to calculate all membership values; this is where the membership diagrams on the branches of the tree come in. The membership values along the way from the root to the leaves are combined with each other and with the class memberships at each leaf. Often the class membership of an unseen example equals the sum of products for each class (the ∗-+-method). In general, fuzzy operators (Fuzzy-AND and Fuzzy-OR) are used to calculate the class memberships of unseen examples. In the following section we deal with this part in detail and show that NDT's and FDT's have a lot in common regarding non-linear feature space partitioning.
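As a concrete illustration of the ∗-+-method, the following minimal sketch (our own simplified rendering, not code from [8] or [9]; the tree structure, names, and numbers are illustrative) classifies an unseen example with a toy FDT over a single 'weight' attribute: memberships along each root-to-leaf path are multiplied, and the weighted leaf class vectors are summed.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: rises on [a, b], equals 1 on [b, c], falls on [c, d]."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

# A toy FDT as (path, leaf) pairs: each path is a list of
# (feature index, trapezoid parameters), each leaf a class vector.
toy_fdt = [
    ([(0, (50, 55, 60, 70))], [1.0, 0.0]),   # weight "low"  -> class 1
    ([(0, (60, 70, 90, 95))], [0.0, 1.0]),   # weight "high" -> class 2
]

def classify(x, fdt):
    """*-+-method: multiply memberships along each path, sum the weighted leaves."""
    n_classes = len(fdt[0][1])
    result = [0.0] * n_classes
    for path, leaf in fdt:
        mu = 1.0
        for feature, params in path:          # Fuzzy-AND as product
            mu *= trapezoid(x[feature], *params)
        for k in range(n_classes):            # combination over leaves as sum
            result[k] += mu * leaf[k]
    return result

print(classify([65.0], toy_fdt))  # -> [0.5, 0.5]: partial membership in both classes
```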
4 Non-Linear Feature Space Partitioning: The Source of Power

NDT's and FDT's share the use of the multiplication operation for the non-linear partitioning of the feature space. However, they differ from each other with respect to the way this operation is applied. In the case of NDT's, multiplication is applied to the task-supplied primitive features to construct new ones; multiplication, so to speak, influences the whole feature space. In FDT's, by contrast, multiplication is used to create special fuzzy operations like Fuzzy-AND and Fuzzy-OR.

The crux of an FDT is the partitioning of the whole feature space into axis-parallel basis cuboids (B-cuboids) based on the given membership functions $D$ [Figure 5]. These B-cuboids contain all instances of a given training set, i.e. they correspond to the leaves of a generated FDT. The B-cuboids are bounded by the two top corner points ($a_n = c_{n-1}$ and $b_n = d_{n-1}$) of the underlying trapezoidal membership functions [Figure 3]. The vectors ($C_{1,1}, \ldots, C_{2,2}$) of membership values for each class are calculated from subsets of the training set. The class membership of an unseen test instance which is located in some B-cuboid is the vector of the corresponding B-cuboid.
Figure 4: The FDT created from an example data set, including membership functions [9]

Each B-cuboid influences its environment. This environment is determined by the legs of the trapezoidal membership functions. The environments induce the partitioning of the area between the B-cuboids into composite cuboids (C-cuboids) [Figure 5]. The class memberships (vectors $C_I, \ldots, C_{VII}$) in the C-cuboids result as compositions of the memberships in the influencing B-cuboids.

The membership in a C-cuboid is determined by the legs of two adjacent trapezoids, i.e. by linear functions. Therefore the composition of memberships along some path of an FDT combines such linear functions into non-linear ones that correspond to curves and surfaces of a higher degree. Thus it turns out that the membership functions for the elements of the C-cuboids have the same form as the hypersurfaces that are used to partition the whole feature space in the NDT method.

Figure 5 shows a possible partition of a two-dimensional feature space. The FDT algorithm has cut the axis $x_1$ once, and each of the two vertical strips once again (one cut means a pair of top corner points of consecutive trapezoids and creates two 'certain' areas with one area of uncertainty in between). In general there may be many strips $D_1, \ldots, D_n$ on the first level and more cuts of these strips, as well as a higher dimension with a deep hierarchy. Examples of explicit polynomials are given in the following; each one is defined over the C-cuboid indicated by its index, and the polynomials over the remaining cuboids are similar. For the C-cuboid I, which lies between the B-cuboids (1,1) and (1,2) of strip $D_1$, the class membership is linear in $x_2$:

$C_I(x_2) = \alpha_1 x_2 + \alpha_2$ (2)

$\alpha_1 = \frac{C_{1,2} - C_{1,1}}{d_{1,1} - c_{1,1}}$ (3)

$\alpha_2 = \frac{C_{1,1} d_{1,1} - C_{1,2} c_{1,1}}{d_{1,1} - c_{1,1}}$ (4)

For a C-cuboid influenced by both B-cuboids of strip $D_1$ and by the B-cuboid (2,2) of strip $D_2$, the class membership becomes bilinear:

$C_{VI}(x_1, x_2) = \beta_1 x_1 x_2 + \beta_2 x_1 + \beta_3 x_2 + \beta_4$ (5)

$\beta_1 = \frac{C_{1,1} - C_{1,2}}{(d_1 - c_1)(d_{1,1} - c_{1,1})}$ (6)

$\beta_2 = \frac{C_{2,2}}{d_1 - c_1} + \frac{C_{1,2} c_{1,1} - C_{1,1} d_{1,1}}{(d_1 - c_1)(d_{1,1} - c_{1,1})}$ (7)

$\beta_3 = \frac{(C_{1,2} - C_{1,1})\, d_1}{(d_1 - c_1)(d_{1,1} - c_{1,1})}$ (8)

$\beta_4 = -\frac{C_{2,2}\, c_1}{d_1 - c_1} + \frac{(C_{1,1} d_{1,1} - C_{1,2} c_{1,1})\, d_1}{(d_1 - c_1)(d_{1,1} - c_{1,1})}$ (9)
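The coefficients in equations (2)-(9) can be checked numerically. The sketch below is our own verification code (the corner points and scalar class memberships are illustrative values, not taken from the paper's figures): it composes the memberships of the influencing B-cuboids directly and compares the result with the polynomial forms above.

```python
# Illustrative corner points; any values with c < d work. c1, d1 bound the
# uncertainty strip on the x1 axis, c11, d11 the strip between B-cuboids
# (1,1) and (1,2) on the x2 axis.
c1, d1 = 2.0, 3.0
c11, d11 = 4.0, 6.0
C11, C12, C22 = 0.9, 0.2, 0.4   # class memberships of three B-cuboids

# Equations (3)-(4): linear interpolation between C11 and C12 over [c11, d11].
a1 = (C12 - C11) / (d11 - c11)
a2 = (C11 * d11 - C12 * c11) / (d11 - c11)
assert abs(a1 * c11 + a2 - C11) < 1e-12   # equals C_{1,1} at the lower leg
assert abs(a1 * d11 + a2 - C12) < 1e-12   # equals C_{1,2} at the upper leg

def composed(x1, x2):
    """Direct composition: interpolate within strip D1, blend with C22 of strip D2."""
    strip1 = ((d11 - x2) * C11 + (x2 - c11) * C12) / (d11 - c11)
    return ((d1 - x1) * strip1 + (x1 - c1) * C22) / (d1 - c1)

# Coefficients of C_VI(x1, x2) = b1*x1*x2 + b2*x1 + b3*x2 + b4, eqs (6)-(9).
b1 = (C11 - C12) / ((d1 - c1) * (d11 - c11))
b2 = C22 / (d1 - c1) + (C12 * c11 - C11 * d11) / ((d1 - c1) * (d11 - c11))
b3 = (C12 - C11) * d1 / ((d1 - c1) * (d11 - c11))
b4 = -C22 * c1 / (d1 - c1) + (C11 * d11 - C12 * c11) * d1 / ((d1 - c1) * (d11 - c11))

for x1, x2 in [(2.0, 4.0), (2.5, 5.0), (3.0, 6.0)]:
    poly = b1 * x1 * x2 + b2 * x1 + b3 * x2 + b4
    assert abs(poly - composed(x1, x2)) < 1e-12   # bilinear form matches
```

The bilinear form is simply the product of two linear interpolations; this is where the $x_1 x_2$ term, and hence the non-linearity of the partition, comes from.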
Figure 5: The $x_1$-$x_2$ feature space with the B-cuboids (1,1), ..., (2,2), the C-cuboids (I, ..., VII), and the trapezoidal membership functions $D_1, \ldots, D_{2,2}$
Finally, for the C-cuboid V, which is influenced by the cuts of both strips, the class membership is

$C_V(x_1, x_2) = \gamma_1 x_1 x_2 + \gamma_2 x_1 + \gamma_3 x_2 + \gamma_4$ (10)

$\gamma_1 = \frac{C_{1,1} - C_{1,2}}{(d_1 - c_1)(d_{1,1} - c_{1,1})} + \frac{C_{2,2} - C_{2,1}}{(d_1 - c_1)(d_{2,1} - c_{2,1})}$ (11)

$\gamma_2 = \frac{C_{1,2} c_{1,1} - C_{1,1} d_{1,1}}{(d_1 - c_1)(d_{1,1} - c_{1,1})} + \frac{C_{2,1} d_{2,1} - C_{2,2} c_{2,1}}{(d_1 - c_1)(d_{2,1} - c_{2,1})}$ (12)

$\gamma_3 = \frac{(C_{1,2} - C_{1,1})\, d_1}{(d_1 - c_1)(d_{1,1} - c_{1,1})} + \frac{(C_{2,1} - C_{2,2})\, c_1}{(d_1 - c_1)(d_{2,1} - c_{2,1})}$ (13)

$\gamma_4 = \frac{(C_{1,1} d_{1,1} - C_{1,2} c_{1,1})\, d_1}{(d_1 - c_1)(d_{1,1} - c_{1,1})} + \frac{(C_{2,2} c_{2,1} - C_{2,1} d_{2,1})\, c_1}{(d_1 - c_1)(d_{2,1} - c_{2,1})}$ (14)

5 Conclusion and Further Work

This paper has dealt with the question of a unified view of the field of non-linear feature space partitioning. We have described two approaches that reach this goal: on the one hand, non-linearity is achieved by the combination of task-supplied primitive features; on the other hand, fuzziness offers the opportunity for this kind of feature space splitting. As far as we know, no attention has been paid so far to feature space partitioning in conjunction with NDT- and FDT-classification in this way. The simple mathematical operation of multiplication, the source of power, plays an important role both in the field of NDT's and in the area of FDT's. We plan to extend our work towards a unified theory of non-linear partitioning of the feature space in general.

References

1. A. Ittner. Ermittlung von funktionalen Attributabhängigkeiten und deren Einfluß auf maschinelle Lernverfahren. Master's thesis, Dept. of Computer Science, Chemnitz University of Technology, Germany, 1995. Available in German only.

2. A. Ittner and M. Schlosser. Discovery of relevant new features by generating non-linear decision trees. In E. Simoudis, J. Han, and U. Fayyad, editors, Proc. of the 2nd International Conference on Knowledge Discovery and Data Mining, pages 108-113, Portland, Oregon, USA, 1996. AAAI Press, Menlo Park, CA. http://www.tu-chemnitz.de/~ait/publications/kdd96.ps.gz.

3. A. Ittner and M. Schlosser. Non-linear decision trees - NDT. In L. Saitta, editor, Proc. of the 13th International Machine Learning Conference, pages 252-257, Bari, Italy, 1996. Morgan Kaufmann, San Francisco, CA. http://www.tu-chemnitz.de/~ait/publications/icml96.ps.gz.

4. C. Z. Janikow. Fuzzy processing in decision trees. In Proc. of the International Symposium on Artificial Intelligence, pages 360-367, 1993.

5. S. Murthy, S. Kasif, S. Salzberg, and R. Beigel. OC1: Randomized induction of oblique decision trees. In Proc. of the 11th Nat. Conf. on AI (AAAI-93), pages 322-327, Washington, D.C., 1993.

6. M. Umano, H. Okamoto, I. Hatono, H. Tamura, F. Kawachi, S. Umedzu, and J. Kinoshita. Fuzzy decision trees by fuzzy ID3 algorithm and its application to diagnosis systems. In Proc. of the 3rd IEEE International Conference on Fuzzy Systems, pages 2113-2118, Orlando, FL, 1994.

7. X. Wu and P. Mahlen. Fuzzy interpretation of induction results. In U. M. Fayyad and R. Uthurusamy, editors, Proc. of the 1st International Conference on Knowledge Discovery and Data Mining, pages 325-330, Montreal, Quebec, Canada, 1995.

8. J. Zeidler and M. Schlosser. Fuzzy handling of continuous-valued attributes in decision trees. In Y. Kodratoff, G. Nakhaeizadeh, and Ch. Taylor, editors, Proc. of the MLNet Familiarization Workshop: Statistics, Machine Learning and Knowledge Discovery in Databases, pages 41-46, Heraklion, Crete, Greece, 1995. http://www.tu-chemnitz.de/~jzei/VEROEFF/ecws3.ps.

9. J. Zeidler and M. Schlosser. Continuous-valued attributes in fuzzy decision trees. In Proc. of the 6th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, pages 395-400, Granada, Spain, 1996. http://www.tu-chemnitz.de/~jzei/VEROEFF/ipmu.ps.