Interval and Dynamic Time Warping-based Decision Trees∗

Juan J. Rodríguez
Lenguajes y Sistemas Informáticos
Universidad de Burgos, Spain

Carlos J. Alonso
Grupo de Sistemas Inteligentes, Depto. de Informática
Universidad de Valladolid, Spain

ABSTRACT

This work presents decision trees adequate for the classification of series data. There are several methods for this task, but most of them focus only on accuracy. One of the requirements of data mining, however, is to produce comprehensible models, and decision trees are among the most comprehensible classifiers. Using decision tree methods directly on this kind of data is generally not adequate, because complex and inaccurate classifiers are obtained. Hence, instead of using the raw features, new ones are constructed. This work presents two types of trees. In interval-based trees, the decision nodes evaluate a function (e.g., the average) over an interval and compare the result to a threshold. In DTW-based trees, each decision node has a reference example; the distance from the example to classify to the reference example is calculated and compared to a threshold. The method for obtaining these trees is based on: 1) a method that, for a 2-class data set, obtains a classifier formed by a new feature (a function over an interval or the distance to a reference example) and a threshold; 2) the boosting method, used to obtain an ensemble of these classifiers; and 3) a method for constructing decision trees that uses as data set the features selected by boosting.

1. INTRODUCTION

This work deals with the data mining task of classification in the domain of time series. The models considered are decision trees, because they produce comprehensible classifiers. Two types of trees are considered, interval-based and DTW-based. In interval-based trees the decision nodes calculate a function over an interval of the series and compare the result with a threshold. Although it is possible to define many functions over intervals, in this work only the average and the deviation are considered.

∗This work has been supported by the Spanish MCyT project DPI2001-4404-E and the “Junta de Castilla y León” project VA101/01.



The format of the decision nodes in these trees is:

  average( Variable, Begin, End ) < Threshold
  deviation( Variable, Begin, End ) < Threshold

The Variable is included because multivariate data sets are also considered. If the series are of different lengths, there will be literals with intervals that fall, partially or totally, outside some of the series. In these cases, the result of evaluating the literal is neither true nor false; it is treated as a “missing” value.

For time series data, the Euclidean distance is very brittle [6]. Dynamic Time Warping (DTW) allows an elastic shifting of the time axis, to accommodate sequences that are similar but out of phase. DTW is a classic technique for time series classification. It is normally used with instance-based classifiers, although it can also be used with other methods such as boosting [9] or SVM [2]. None of these methods produce comprehensible classifiers. In this paper DTW is incorporated into decision trees. The decision nodes have the following format:

  dtw( Reference, Variable ) < Threshold

The distance between the current example and the Reference, for the given Variable, is calculated and compared to a threshold.

Other methods for constructing decision trees from time series data are [5] and [3]. The first one extracts global features (e.g., mean, maximum) and local features (e.g., increasing, local maximum) and then uses these features as the input to a method for constructing decision trees. In [3] the decision nodes are formed by extracted patterns, which can be simple or complex. A simple pattern is a time series; it is detected in a series if the Euclidean distance between the pattern and a fragment of the series is smaller than a threshold. A complex pattern is an ordered list of simple patterns. The pattern extraction process is integrated with the method for constructing decision trees.

The rest of the paper is organized as follows. The method used for the construction of these trees is described in Section 2. Section 3 presents experimental results obtained with the new method. Finally, some concluding remarks are given in Section 4.
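A minimal illustrative sketch of how such literals can be evaluated on one univariate series is given below. It is not taken from the paper: the function names are hypothetical, Begin and End are assumed to be inclusive (as the literals in Table 1 suggest), and the deviation is taken to be the standard deviation.

import math

def eval_interval_literal(series, func, begin, end, threshold):
    """Evaluate `func( x, Begin, End ) < Threshold` on one series.

    Returns True/False, or None (a "missing" value) when the interval falls,
    partially or totally, outside the series.
    """
    if begin < 0 or end >= len(series):
        return None                                   # interval outside the series
    segment = series[begin:end + 1]                   # Begin and End assumed inclusive
    if func == "average":
        value = sum(segment) / len(segment)
    elif func == "deviation":                         # taken here as the standard deviation
        mean = sum(segment) / len(segment)
        value = math.sqrt(sum((x - mean) ** 2 for x in segment) / len(segment))
    else:
        raise ValueError("unknown function: " + func)
    return value < threshold

def eval_dtw_literal(series, reference, threshold, dtw_distance):
    """Evaluate `dtw( Reference, Variable ) < Threshold` given a DTW routine."""
    return dtw_distance(reference, series) < threshold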

2. CONSTRUCTING INTERVAL AND DTW-BASED TREES

There are two approaches for constructing decision trees that do not use the raw features but new ones. One possibility is to integrate the construction of the features into the decision tree method. The second is to consider the construction of the features as a preprocessing step and then use an unmodified method for constructing decision trees. The approach used in this paper is the second one, because the considered features had already been used with boosting [8].

2.1 Boosting

At present, an active research topic is the use of ensembles of classifiers. They are obtained by generating and combining base classifiers, constructed using other machine learning methods. The aim of these ensembles is to increase the accuracy with respect to the base classifiers. One of the most popular methods for creating ensembles is boosting [12], a family of methods of which AdaBoost is the most prominent member. It works by assigning a weight to each example; initially, all the examples have the same weight. In each iteration a base (also named weak) classifier is constructed according to the distribution of weights. Afterwards, the weight of each example is readjusted, depending on whether the base classifier assigned the correct class to the example. The final result is obtained by a weighted vote of the base classifiers. AdaBoost is only for binary problems, but there are several methods for extending it to the multiclass case. The one used in this work is AdaBoost.MH [13].
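As an illustration of this weighting scheme, the sketch below shows the standard binary AdaBoost loop; the AdaBoost.MH variant actually used in this work is a multiclass extension and differs in its details. The `base_learner` interface is an assumption made for the sketch, not something specified by the paper.

import math

def adaboost(X, y, base_learner, rounds=100):
    """Binary AdaBoost with labels y[i] in {-1, +1}.

    `base_learner(X, y, w)` is an assumed interface that trains a weak
    classifier on the weighted examples and returns an object with a
    `predict(x)` method returning -1 or +1.
    """
    n = len(X)
    w = [1.0 / n] * n                       # initially all examples have the same weight
    ensemble = []                           # list of (alpha, weak classifier) pairs
    for _ in range(rounds):
        h = base_learner(X, y, w)           # train according to the weight distribution
        err = sum(wi for xi, yi, wi in zip(X, y, w) if h.predict(xi) != yi)
        err = min(max(err, 1e-10), 1.0 - 1e-10)
        alpha = 0.5 * math.log((1.0 - err) / err)
        # readjust the weights: misclassified examples become heavier
        w = [wi * math.exp(-alpha * yi * h.predict(xi)) for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
        ensemble.append((alpha, h))
    return ensemble

def ensemble_predict(ensemble, x):
    """Weighted vote of the base classifiers."""
    score = sum(alpha * h.predict(x) for alpha, h in ensemble)
    return 1 if score >= 0 else -1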

2.2 The Base Learners

The task of the base learner is to select a feature and a threshold. Considering all the possible features is not an option unless the data set is very small.

2.2.1 Selecting Interval Features

With the objective of reducing the search space, not all the intervals are explored; only those whose size is a power of 2 are considered. If the length of the series is n, the number of these intervals is O(n lg n). For each considered interval, the function must be evaluated for the v variables and the e examples. The cost of evaluating a function over an interval depends on the length of the interval but, since the function must be evaluated over many intervals, the computations can be reused: the information necessary for an interval of size 2i can be obtained from two consecutive intervals of size i. Hence, it is possible to evaluate the functions for all the intervals, variables and examples in O(ven lg n) time. Then it is necessary to select, for each variable and interval, the best threshold. This could be done by sorting the examples, but it would be too costly. The approach used is to consider only a fixed number of candidate thresholds, selected so that the range of values of the feature is divided into uniform-width intervals. This number, t, is a parameter; in the experimental validation, t = 25. The selection of the best threshold is O(e + t), and normally t ≪ e, so it can be considered O(e). Hence, the whole selection process needs O(ven lg n) time. For more details on the selection process the interested reader is referred to [8].
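The reuse of computations and the threshold discretization can be sketched as follows: sums and sums of squares for intervals of size 2i are obtained from two consecutive intervals of size i, which yields the average and deviation of every power-of-two interval, and the candidate thresholds are uniform divisions of the feature's range. This is only an illustrative sketch under those assumptions, not the implementation described in [8]; the names are hypothetical.

import math

def power_of_two_interval_stats(series):
    """Return {(begin, size): (average, deviation)} for every interval whose
    size is a power of 2, reusing the sums of the two halves of each interval."""
    n = len(series)
    sums = {1: list(series)}                 # sums[s][b] = sum of series[b : b+s]
    sqs = {1: [x * x for x in series]}       # sums of squares, for the deviation
    stats = {}
    s = 1
    while s <= n:
        for b in range(n - s + 1):
            mean = sums[s][b] / s
            var = max(sqs[s][b] / s - mean * mean, 0.0)
            stats[(b, s)] = (mean, math.sqrt(var))
        if 2 * s > n:
            break
        # an interval of size 2s is the union of two consecutive intervals of size s
        sums[2 * s] = [sums[s][b] + sums[s][b + s] for b in range(n - 2 * s + 1)]
        sqs[2 * s] = [sqs[s][b] + sqs[s][b + s] for b in range(n - 2 * s + 1)]
        s *= 2
    return stats

def candidate_thresholds(values, t=25):
    """Divide the range of the feature into t uniform-width intervals; the
    interior division points are the candidate thresholds (t = 25 in the paper)."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / t
    return [lo + i * step for i in range(1, t)]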

Literal                           Cl. 1     Cl. 2     Cl. 3
deviation( x, 63, 126 ) < 1.81   -0.399    -0.421     1.038
deviation( x, 48, 111 ) < 1.89   -0.232    -0.154     0.606
average( x, 38, 101 ) < 0.77      0.096     0.073     0.715
average( x, 30, 33 ) ≥ 3.35       0.746    -1.641     0.665
average( x, 55, 58 ) < 4.28      -0.906     0.352     0.472
deviation( x, 3, 34 ) < 1.65     -0.334     2.162    -0.114
average( x, 49, 52 ) ≥ 5.51       0.744     0.518    -0.196
deviation( x, 25, 56 ) < 1.11     0.635     0.540     0.000
average( x, 44, 47 ) ≥ 3.72       0.484    -0.932    -0.156
deviation( x, 26, 33 ) ≥ 3.36    -0.149     0.091     0.658

Table 1: Example of a classifier obtained with boosting, for a 3-class problem.

2.2.2 Selecting DTW Features

The base learner works as follows. First, several examples are selected randomly as possible references. The number of considered reference examples, r, is a parameter; in the experimental validation, r = 20. For each reference example, the distance to all the other training examples (e) is computed. DTW requires O(n²) time for series of length n, so this step needs O(ren²) time. Then the best threshold for the distances is computed, in a similar way as is done in decision tree induction. First, all the distances are sorted (in O(e lg e) time). All the values are then considered, from left to right, keeping track of the number and weight of the positive and negative examples to the left of the current value. For each distance value, the error of selecting it as the threshold is computed. This can be done in O(1) per value, because it only involves a function of the weights of the positive and negative examples to the left and to the right of the threshold. For e distances the time needed is O(e), which is smaller than O(e lg e). If the data set is multivariate, the process must consider the v variables. Hence, the base learner requires O(vre(n² + lg e)) time.
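An illustrative sketch of these two ingredients, under our own assumptions (an absolute-difference local cost for DTW, label 1 as the positive class, and a single orientation of the literal), is shown below; the names are hypothetical.

def dtw_distance(a, b):
    """Classic O(n*m) dynamic-programming DTW distance between two series."""
    n, m = len(a), len(b)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def best_threshold(distances, labels, weights):
    """Scan the sorted distances and return (threshold, weighted error) for the
    literal `distance < threshold`, treating label 1 as the positive class
    (the opposite orientation of the literal would be handled analogously)."""
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    total_pos = sum(w for y, w in zip(labels, weights) if y == 1)
    left_pos = left_neg = 0.0
    best_err = total_pos                     # empty left side: all positives misclassified
    best_thr = distances[order[0]]
    for i in order:
        if labels[i] == 1:
            left_pos += weights[i]
        else:
            left_neg += weights[i]
        err = left_neg + (total_pos - left_pos)   # wrong on the left + wrong on the right
        if err < best_err:
            best_err, best_thr = err, distances[i] + 1e-9
    return best_thr, best_err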

2.3 Ensemble Example

Table 1 shows an example classifier, obtained by boosting, for a data set with three classes. This classifier is composed of 10 base classifiers. The first column shows the literal; for each class in the data set there is another column, with the weight associated to the literal for that class. In order to classify a new example, a weight is calculated for each class, and the example is assigned to the class with the greatest weight. Initially, the weight of each class is 0. For each base classifier, the literal is evaluated. If it is true, the weight of each class is increased by the weight of the class for that literal; if it is false, that weight is subtracted from the weight of the class.
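This classification procedure can be sketched as follows; `evaluate` is an assumed helper that evaluates one literal on one example, and ties and missing values are ignored for simplicity.

def classify(ensemble, example, evaluate):
    """`ensemble` is a list of (literal, per-class weights) pairs, e.g.
    ("deviation( x, 63, 126 ) < 1.81", [-0.399, -0.421, 1.038]) for Table 1."""
    n_classes = len(ensemble[0][1])
    scores = [0.0] * n_classes                # initially the weight of each class is 0
    for literal, class_weights in ensemble:
        sign = 1.0 if evaluate(literal, example) else -1.0
        for c in range(n_classes):
            scores[c] += sign * class_weights[c]
    # the example is assigned to the class with the greatest weight
    return max(range(n_classes), key=lambda c: scores[c])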

2.4 From Ensembles to Trees

These ensembles can be used as classifiers with good accuracy, but they are not very comprehensible, especially if they are formed by many base classifiers. It is normal to use boosting with hundreds of base classifiers, even with more complex ones such as decision trees. Given the ensemble, a new data set is constructed. For each base classifier, the corresponding function is considered as a new feature.

Data Set     Variables  Classes  Length (Min–Max)  Examples (Total / Test)
CBF          1          3        128–128           768 / 10-fold CV
CBF-tr       1          3        128–128           5000 / 4000
Control      1          6        60–60             600 / 10-fold CV
2-Patterns   1          4        128–128           5000 / 4000
Trace        4          16       268–394           1600 / 800
J. Vowels    12         9        7–29              640 / 370
Auslan       22         95       45–136            2565 / 5-fold CV

Table 2: Characteristics of the data sets.

Figure 1: Examples of the CBF data set (classes cylinder, bell and funnel). Two examples of the same class are shown in each graph.

Figure 2: Examples of the Control data set (classes normal, cyclic, increasing, decreasing, upward and downward). Two examples of the same class are shown in each graph.

The threshold and the weights are discarded. The attributes of the new data set are the features that appear in the ensemble. For each training example in the original data set, there is an example in the new data set; the values of its attributes are the evaluations of the different features on the original example. Once the new data set is obtained, any method for constructing decision trees can be used with it.
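A sketch of this ensemble-to-tree step is shown below. The paper uses WEKA's J48; scikit-learn's DecisionTreeClassifier appears here only as an illustrative stand-in, and `evaluate_feature` is an assumed helper that computes one interval or DTW feature on one series.

from sklearn.tree import DecisionTreeClassifier

def ensemble_to_tree(features, train_series, train_labels, evaluate_feature):
    # one column per feature selected by boosting (thresholds and weights discarded);
    # missing values, which J48 handles natively, would need extra treatment here
    X = [[evaluate_feature(f, s) for f in features] for s in train_series]
    tree = DecisionTreeClassifier()
    return tree.fit(X, train_labels)

def predict_series(tree, features, series, evaluate_feature):
    row = [evaluate_feature(f, series) for f in features]
    return tree.predict([row])[0]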

3. EXPERIMENTAL VALIDATION

3.1 Data Sets

Table 2 summarizes the characteristics of the data sets. The first five data sets are synthetic; the last two are real.

The CBF data set is an artificial problem introduced in [11]. The learning task is to distinguish between three classes: cylinder, bell or funnel. Figure 1 shows some examples of this data set. The CBF translated (CBF-tr) data set is a modification of the previous one, introduced in [3], which emphasizes the shifting of the patterns.

In the Control Charts data set there are six different classes of control charts, synthetically generated by the process in [1]. Figure 2 shows two examples of each class. The data used was obtained from the UCI KDD Archive [4].

The 2-Patterns data set was introduced in [3]. Each class is characterized by the presence of two patterns in a definite order. Figure 3 shows examples of this data set.

The Trace data set was introduced in [10]. It is proposed as a benchmark for classification systems of temporal patterns in the process industry. This data set was generated artificially. There are four variables, and each variable has two behaviors, as shown in Figure 4. The combination of the behaviors of the variables produces 16 different classes. 1600 examples were generated, 100 of each class. Half of the examples are used for training and the other half for testing.

The Japanese Vowels data set was introduced in [7].

Figure 3: Examples of the 2-Patterns data set (classes down-down, up-down, down-up and up-up).

Figure 4: Trace data set. Each example is composed of 4 variables (Variable 1 to Variable 4), and each variable has two possible behaviors. In the graphs, two examples of each behavior are shown.

             Boosting          Decision Trees
Data Set     Error             Error            Nodes
CBF          1.13 ± 1.23       2.27 ± 1.80      16.00 ± 3.68
CBF-TR       6.65 ± 0.22       8.41 ± 0.95      72.60 ± 5.73
Control      0.83 ± 0.88       1.67 ± 1.57      12.60 ± 0.84
2-Patterns   20.95 ± 1.16      13.32 ± 0.50     118.20 ± 8.56
Trace        10.90 ± 3.70      9.58 ± 1.93      71.80 ± 5.02
J. Vowels    4.86 ± 0.81       19.24 ± 2.23     34.20 ± 1.10
Auslan       10.17 ± 1.60      13.22 ± 2.40     274.20 ± 12.70

Table 3: Experimental results using interval features.

             Boosting          Decision Trees
Data Set     Error             Error            Nodes
CBF          0.38 ± 0.61       1.62 ± 1.65      18.80 ± 2.74
CBF-TR       1.22 ± 0.16       2.97 ± 0.55      32.60 ± 1.67
Control      0.67 ± 0.86       3.17 ± 2.14      19.40 ± 1.84
2-Patterns   0.59 ± 0.08       4.90 ± 0.75      41.00 ± 1.41
Trace        2.43 ± 0.45       5.00 ± 1.26      91.04 ± 2.19
J. Vowels    6.92 ± 0.77       18.81 ± 2.28     37.00 ± 2.83
Auslan       14.72 ± 1.95      18.04 ± 1.62     308.20 ± 7.69

Table 4: Experimental results using DTW.


It is a speaker recognition problem. Nine male speakers uttered two Japanese vowels /ae/ successively. A 12-degree linear prediction analysis was applied to each utterance to obtain a discrete-time series of 12 LPC cepstrum coefficients. This means that one utterance by a speaker forms a time series whose length is in the range 7–29, and each point of the series has 12 features (the 12 coefficients).

Auslan is the Australian sign language, the language of the Australian deaf community. Instances of the signs were collected using an instrumented glove [5]. There are two versions of this data set, obtained with different equipment: in the first one the manufacturer is Nintendo, and in the second one it is Flock. According to [5], in terms of the quality of the data, the Flock system was far superior to the Nintendo system; hence, the Flock version is used in this paper.

3.2 Results

The boosting method was run for 100 iterations. This means that the transformed data set, which is used as input to the method for constructing decision trees, has at most 100 features; it can have fewer than 100 if boosting selects the same feature more than once. The method used for constructing the decision trees was J48 from the WEKA library [14], which is based on C4.5. If a partition of the examples into training and test sets is specified, the reported results are the average of 5 runs; otherwise, 10-fold stratified cross-validation is used. Nevertheless, for Auslan, 5 folds are used, because this is the norm for that data set [5].

Tables 3 and 4 show, respectively, the results obtained using interval-based features and dynamic time warping. As expected, the accuracy obtained is generally better with boosting than with decision trees. There are two exceptions when using interval features, for the data sets 2-Patterns and Trace. For these data sets interval features are not adequate: the results are much worse than those obtained with DTW.

The trees are binary, so the number of leaves is approximately half the number of nodes. In each internal node of a DTW-based tree there is a reference example, but an example can appear in several nodes. For the Auslan data set there are more than 100 internal nodes, but the number of reference examples was only 100; hence, several examples have to appear more than once in the tree.

Figure 5 shows an interval-based decision tree for the Control data set. It has the minimum size for a 6-class problem. The first number in each leaf is the number of examples; if there is a second number, it is the number of misclassified examples in the leaf.

Figure 5: An interval-based decision tree for the Control data set. Its internal nodes test average( x, 22, 53 ) and average( x, 6, 21 ) against thresholds such as 28.31, 31.90 and 32.20; its leaves include downward (90.0/1.0) and increasing (90.0/1.0).

It is possible to construct trees that use both types of features. The results obtained using these hybrid trees are shown in Table 5. The results are either better than those obtained using only one type of feature, or slightly worse than the best of the two non-hybrid types of trees. It must be noted that the hybrid trees are constructed using twice the number of features. An interesting case is the Trace data set: the results of the hybrid tree are much better than the results using only one type of feature, even when those features are used with boosting. Moreover, in each of the five executions a minimum-size tree for a 16-class problem was generated.

Table 6 shows the results reported in [3, 5] for these data sets using decision trees. The best results reported in those references (since they consider different settings) are the ones that appear in the table. These error results are worse than our results using hybrid trees; the only exception is J. Vowels, but there the results using DTW-based trees are better.

             Decision Trees
Data Set     Error             Nodes
CBF          0.87 ± 1.02       11.40 ± 3.10
CBF-TR       3.15 ± 0.39       36.20 ± 2.68
Control      1.67 ± 1.76       12.60 ± 0.84
2-Patterns   3.83 ± 0.12       41.40 ± 2.97
Trace        0.18 ± 0.26       31.00 ± 0.00
J. Vowels    19.46 ± 1.79      35.80 ± 2.68
Auslan       13.47 ± 1.17      253.40 ± 2.97

Table 5: Experimental results using interval and DTW features.

             Kadous            Geurts
Data Set     Error             Error     Nodes
CBF          1.14              1.00      15
CBF-TR       —                 3.22      11
Control      —                 2.67      22
2-Patterns   —                 4.45      15
Trace        —                 —         —
J. Vowels    —                 19.19     45
Auslan       14.20             —         —

Table 6: Other results for the data sets using decision trees.

With respect to the number of nodes, the method of [3] produces trees with fewer nodes in two of the five data sets; these two data sets are the ones introduced in that work. The complexity of the trees also depends on the complexity of the decision nodes, and it can be argued that the use of different types of features in the same tree worsens the comprehensibility.

4. CONCLUSIONS AND FUTURE WORK

A method for constructing decision trees for series classification has been presented. Two types of trees are considered, interval-based and DTW-based. Each internal node in the tree calculates either a function (average, deviation) over an interval or the distance between the example to classify and a reference example, and the result is compared to a threshold.

The results are competitive with those of other methods for constructing decision trees for this kind of data. As expected, comprehensibility has a cost, and the accuracy of the decision trees is worse than the one obtained with boosting. Hence, the presented method is adequate when the main concern is comprehensibility instead of accuracy.

The direct method for obtaining this kind of trees would be to modify a method for constructing decision trees. Instead, the approach presented in this work consists of developing a method that, given a binary problem, selects a new feature (a function over an interval or the distance to a reference example) and a threshold; using this weak method with boosting; and then using the selected features as the input to the method for constructing decision trees. This has been done for implementation convenience: it is much easier to implement a base learner for boosting that selects a feature and a threshold than to modify a decision tree algorithm. One advantage of this approach is that other learning methods can be used, without modification, with the features selected by boosting. On the other hand, one disadvantage is that, for DTW features, the examples selected as references can be counterintuitive. For instance, the example used in a node to discriminate between two classes could belong to any other class. Hence, one possible line of future work is to incorporate the selection of the reference example for each node into the method for constructing decision trees.

5. ACKNOWLEDGEMENTS

To the maintainers of the UCI KDD Archive [4]. To the donors of the different data sets [3, 5, 7, 10]. To the developers of the WEKA library [14].

6. REFERENCES

[1] R. J. Alcock and Y. Manolopoulos. Time-series similarity queries employing a feature-based approach. In Proceedings of the 7th Hellenic Conference on Informatics, Ioannina, Greece, 1999.
[2] C. Bahlmann, B. Haasdonk, and H. Burkhardt. On-line handwriting recognition with support vector machines: A kernel approach. In 8th Int. Workshop on Frontiers in Handwriting Recognition (IWFHR), pages 49–54, 2002.
[3] P. Geurts. Contributions to decision tree induction: bias/variance tradeoff and time series classification. PhD thesis, Department of Electrical Engineering and Computer Science, University of Liège, Belgium, 2002.
[4] S. Hettich and S. D. Bay. The UCI KDD archive [http://kdd.ics.uci.edu], 1999. Irvine, CA: University of California, Department of Information and Computer Science.
[5] M. W. Kadous. Temporal Classification: Extending the Classification Paradigm to Multivariate Time Series. PhD thesis, The University of New South Wales, School of Computer Science and Engineering, 2002.
[6] E. Keogh. Exact indexing of dynamic time warping. In 28th International Conference on Very Large Data Bases, 2002.
[7] M. Kudo, J. Toyama, and M. Shimbo. Multidimensional curve classification using passing-through regions. Pattern Recognition Letters, 20(11–13):1103–1111, 1999.
[8] J. J. Rodríguez, C. J. Alonso, and H. Boström. Boosting interval based literals. Intelligent Data Analysis, 5(3):245–262, 2001.
[9] J. J. Rodríguez Diez and C. J. Alonso González. Applying boosting to similarity literals for time series classification. In 1st International Workshop on Multiple Classifier Systems, MCS 2000, 2000.
[10] D. Roverso. Multivariate temporal classification by windowed wavelet decomposition and recurrent neural networks. In 3rd ANS International Topical Meeting on Nuclear Plant Instrumentation, Control and Human-Machine Interface, 2000.
[11] N. Saito. Local Feature Extraction and Its Applications Using a Library of Bases. PhD thesis, Department of Mathematics, Yale University, 1994.
[12] R. E. Schapire. The boosting approach to machine learning: An overview. In MSRI Workshop on Nonlinear Estimation and Classification, 2002.
[13] R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. In 11th Annual Conference on Computational Learning Theory (COLT 1998), pages 80–91. ACM, 1998.
[14] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 1999.