Boosting Diverse Learners for Domain Agnostic Time Series Classification

David Minnen

Peng Zang

Charles Isbell

Thad Starner

College of Computing, School of Interactive Computing, Georgia Institute of Technology, Atlanta, GA 30308 USA
{dminn,pengzang,isbell,thad}@cc.gatech.edu

ABSTRACT

Although most classification methods benefit from the incorporation of domain knowledge, some situations call for a single algorithm that applies to a wide range of diverse domains. In such cases, the techniques and biases that prove useful in one domain may be irrelevant or even harmful in another. This paper addresses the problem of constructing a domain agnostic time series classification algorithm that allows safe inclusion of domain-specific methods that may be highly effective in some domains yet detrimental in others. Our approach combines MBoost, an extension of AdaBoost that allows robust boosting of multiple weak learners, with SAMME, a multiclass extension of AdaBoost that does not rely on a reduction to a set of binary problems. The resulting algorithm allows the safe and efficient combination of multiple learning algorithms for multiclass classification.

1. INTRODUCTION

In a typical classification scenario, knowledge about the target domain and the intended application is used to design and enhance the classification algorithm. For instance, mathematical models describing the underlying system dynamics or process evolution may be built in to the classifier. For a speech or sign language recognition engine, the system designers may prefer a probabilistic temporal model that allows easy integration with higher-level language grammars. For visual recognition tasks, sophisticated preprocessing may be applied to extract informative or discriminative properties from the raw pixel values. In other cases, the availability of large training sets, such as the institutional data sets collected to aid face recognition, speech recognition, and text analysis research, may bias the designers toward statistical learning algorithms.

This paper presents an algorithm that addresses the problem of time series classification when very little is known about the target domain. This situation often arises in data mining tasks because the purpose of exploratory tools is to infer or uncover just the kind of knowledge about the target data that would be useful for deeper analysis. Our approach for domain agnostic time series classification is built around the assumption that it is preferable to include in the classifier as much knowledge as possible that might be useful and to then learn what should be ignored for a particular task. To this end, we combine two extensions to AdaBoost, MBoost and SAMME, which allow boosting over multiple weak learners and provide direct multiclass classification, respectively. Within this framework, we are able to efficiently combine many different time series classification algorithms operating at different scales and over different features. Furthermore, our algorithm allows safe inclusion of a wide range of methods that may have only narrow application by automatically detecting and ignoring those methods that are not useful for a particular classification task.

The remainder of this paper is organized as follows. The next section provides a review of the MBoost algorithm, while Section 3 explains the SAMME algorithm for multiclass AdaBoost. Section 4 details our algorithm that combines MBoost and SAMME and outlines the "weak" time series classifiers, distance metrics, and features that we use. We empirically evaluate our algorithm and compare it to baseline methods in Section 5, and then we discuss the results in Section 6.

Submission to the Knowledge Discovery and Data Mining (KDD) Workshop on Time Series Classification. San Jose, CA, Aug 12-15, 2007.

2. MBOOST

Ensemble learning methods have been empirically shown to be more powerful than any single method alone [4]. Boosting [14] is a particularly popular ensemble technique with strong theoretical and empirical support. For our time series classification algorithm, we use MBoost [18], an ensemble algorithm designed for boosting multiple weak learners. MBoost provides two primary advantages. First, it explicitly supports multiple weak learners and formalizes the notion of using the boosting framework as an arbitrator for choosing between the hypotheses produced by the weak learners. This ensures that the hypothesis selection process does not introduce any additional inductive bias into the boosting framework. Second, it controls weak learner overfitting. This control protects the overall performance of the ensemble from poor weak learners or those that are simply mismatched with the data, thus allowing the system designer to try many different weak learners without fear of degrading overall performance. Finally, empirical experiments on

several benchmark UCI data sets show that MBoost performs at least as well as any single model or any boosted single model, and sometimes outperforms them both [18].

MBoost is based on AdaBoost [15, 8], an ensemble learning technique that iteratively constructs an ensemble of hypotheses by applying a weak learner repeatedly to different distributions over the data. In AdaBoost, distributions are chosen to focus on the "hard" parts of the data space, that is, where the hypotheses generated thus far perform poorly. If h^(1), h^(2), ..., h^(M) is the set of hypotheses generated by AdaBoost, then the final boosted hypothesis is C(x) = sign(f(x)) = sign(Σ_{m=1}^M α^(m) h^(m)(x)), where α^(m) denotes the weighting coefficient for h^(m), the hypothesis from the mth round of boosting.

MBoost (see Algorithm 1) works much like AdaBoost but has two main differences. First, a hypothesis is chosen in each round, reflecting the fact that it boosts over multiple weak learners. Second, a validation set is randomly chosen per round, and hypothesis selection, hypothesis weighting, and data distribution reweighting are performed with respect to the validation set.

MBoost improves upon AdaBoost in several ways. MBoost explicitly supports multiple weak learners, so one could, for example, boost a decision tree learner and a naive Bayes learner together. While boosting multiple learners was previously possible by using a meta-learner that internally selected a learning algorithm, care had to be taken in implementing such workarounds in order to avoid introducing additional bias. MBoost makes boosting multiple weak learners explicit and uses the boosting framework itself to arbitrate, that is, to choose between the hypotheses of the weak learners to ensure that no additional inductive bias is introduced. This means the choice of which hypothesis to use in each round is decided based on the exponential loss function G(f) = Σ_{i=1}^N e^{−y_i f(x_i)}, where y_i is the true label and

f(x) = Σ_{m=1}^M α^(m) h^(m)(x)

is the ensemble classifier. In practice, we minimize Z^(m), the normalization constant in AdaBoost for the mth round, as doing so is faster and has been proven equivalent [18].

Boosting is known to be susceptible to overfitting when its weak learner overfits. Imagine, for example, that the weak learner is a rote learner such as a hash table. The training error will be zero, but there will also be no generalization. Regardless of the number of rounds of boosting, the rote learner will generate the same hypothesis (assuming there is no noise in the labels). As a result, the final boosted classifier will show the same (lack of) generalization capability. Using multiple weak learners only compounds the problem. MBoost resolves this problem by dividing the data into randomly partitioned training and validation sets for each round, using the validation set to evaluate the learned hypotheses. This approach yields a more accurate measure of the hypotheses' generalization error, which in turn is used to choose the best hypothesis and its weight. Additionally, MBoost only reweights the data distribution over the validation set, as points from the training set do not provide useful feedback, and reweighting those points may incorrectly imply that they have been learned. Note, however, that the weights of the training points may be indirectly modified due to the weight normalization step at the end of each round of boosting.

Algorithm 1 MBoost
Input: Weak learners b_1, ..., b_p; data (x_1, y_1), ..., (x_n, y_n) where x_i ∈ χ, y_i ∈ {−1, +1}; number of rounds of boosting (M)
Output: Binary ensemble classifier C(x)
1. Initialize weights: w_i = 1/n
2. for m = 1, ..., M do
3.   Split the data randomly into D_train^(m) and D_val^(m)
4.   Train learners b_1, ..., b_p on D_train^(m) to generate hypotheses h_1^(m), ..., h_p^(m)
5.   Choose the hypothesis h^(m) = arg min_{h_i^(m)} Z_i^(m)
6.   Compute α^(m) for h^(m) as usual
7.   Update weights: w_i = w_i · e^{−α^(m) y_i h^(m)(x_i)} for (x_i, y_i) ∈ D_val^(m)
8.   Normalize weights: w_i = w_i / Z^(m) such that Σ_{i=1}^n w_i = 1
9. end for
10. return C(x) = sign(Σ_{m=1}^M α^(m) h^(m)(x))

where χ is the input space and Z^(m) is the normalization constant.

MBoost has two additional useful properties. First, it provides an automatic stopping criterion by detecting when the weak learners are exhausted, which occurs when no weak learner can perform better than random, as determined by Monte Carlo cross validation. Second, it has been proven to inherit the generalization error bounds derived for AdaBoost. The net effect is that MBoost robustly handles and boosts over a variety of models, enabling us to take advantage of the empirical power of using multiple models, even those that can be brittle.
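One round of the procedure in Algorithm 1 can be sketched as follows. This is a hedged illustration, not the paper's implementation: the learner interface (`fit`/`predict`), the 50/50 split, and the use of the closed-form AdaBoost normalization constant Z = 2√(err(1 − err)) for hypothesis selection are our own assumptions.

```python
import numpy as np

def mboost_round(learners, X, y, w, rng):
    """One hypothetical MBoost round: split the data, train every weak
    learner on the training half, pick the hypothesis with the smallest
    validation-set normalization constant Z, then reweight only the
    validation points. Learners are assumed to expose
    fit(X, y, sample_weight) returning a fitted hypothesis with
    predict(X); labels are -1/+1 as in Algorithm 1."""
    n = len(y)
    idx = rng.permutation(n)
    tr, va = idx[: n // 2], idx[n // 2:]

    best = None
    for learner in learners:
        h = learner.fit(X[tr], y[tr], sample_weight=w[tr])
        err = np.sum(w[va] * (h.predict(X[va]) != y[va])) / np.sum(w[va])
        err = np.clip(err, 1e-10, 1 - 1e-10)  # guard the log below
        alpha = 0.5 * np.log((1 - err) / err)
        z = 2.0 * np.sqrt(err * (1 - err))  # AdaBoost normalization constant
        if best is None or z < best[2]:
            best = (h, alpha, z)

    h, alpha, _ = best
    # Reweight only the validation points; normalization touches all weights.
    w[va] *= np.exp(-alpha * y[va] * h.predict(X[va]))
    w /= w.sum()
    return h, alpha
```

Because selection and weighting both use held-out points, a rote learner with zero training error would still receive a realistic (poor) validation score, which is exactly the overfitting control described above.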

3. MULTICLASS ADABOOST

The original AdaBoost algorithm introduced by Freund and Schapire [7] combined multiple, weighted hypotheses from a single binary classification algorithm. Several approaches to boosting multiclass classifiers have been proposed, but these methods typically rely on a reduction from multiclass classification to multiple binary classification steps. For instance, researchers have proposed using one-vs-all and one-vs-one reductions combined using uniform voting or probabilistic voting based on the component margins. Other approaches use tournaments or error correcting codes as a distributed class representation.
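The one-vs-one reduction with uniform voting mentioned above can be sketched as follows; the function and the `pairwise` dictionary are illustrative names of our own, not from the paper.

```python
from itertools import combinations

def one_vs_one_predict(pairwise, x, k):
    """Combine the k*(k-1)/2 pairwise binary classifiers by uniform voting.
    `pairwise[(a, b)]` is assumed to be a callable that returns class a or
    class b for input x; the class with the most pairwise wins is returned."""
    votes = [0] * k
    for a, b in combinations(range(k), 2):
        votes[pairwise[(a, b)](x)] += 1
    return max(range(k), key=lambda c: votes[c])
```

The loop makes the quadratic cost in the number of classes explicit: every one of the k(k − 1)/2 component classifiers must be both trained and evaluated.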

Although many of these methods provide adequate recognition performance, a major drawback is that they often require considerable extra computational resources for both learning and classification. For instance, assuming a problem has k classes with an average of n training points per class, both the one-vs-one and one-vs-all approaches require O(n·k²) work to learn the multiclass classifier. Typically, the learning algorithm is super-linear in the number of training points, and so the one-vs-one approach, which requires learning O(k²) classifiers from O(n) data, is faster than the one-vs-all reduction, which requires learning O(k) classifiers from O(kn) data. Nonetheless, when the weak learners support multiclass classification, the preferred solution would allow boosting with only O(n) work per round.

The "stagewise additive modeling using a multiclass exponential loss function" (SAMME) algorithm developed by Zhu, Rosset, Zou, and Hastie [19] achieves the goal of direct multiclass boosting. Zhu et al. provide a statistical justification for their modification to the original AdaBoost algorithm by noting the relationship to the exponential loss function. They generalize the loss function to the multiclass case by recoding the output as a K-dimensional vector and show that SAMME minimizes this multiclass loss function. The resulting algorithm requires only a minor modification of the original, binary algorithm (see Algorithm 2). In practical terms, SAMME can be understood as relaxing the requirement on the weak learners: each need only achieve an error rate below (k − 1)/k rather than below 1/2 as in binary AdaBoost. Thus SAMME reduces to standard AdaBoost in the two-class case and maintains the intuition that the weak learners only need to perform better than random guessing for the given distribution over the training points.

Algorithm 2 SAMME: Multiclass AdaBoost
Input: Training data (x_i, y_i) with y_i ∈ {1, ..., k}, i = 1, ..., n; number of rounds of boosting (M)
Output: Multiclass ensemble classifier C(x)
1. Initialize the observation weights: w_i = 1/n
2. For m = 1 to M:
   (a) Learn classifier T^(m)(x) from the weighted training data
   (b) Compute the error rate for this round:
       err^(m) = Σ_{i=1}^n w_i I(y_i ≠ T^(m)(x_i)) / Σ_{i=1}^n w_i
   (c) Compute the weight for this classifier:
       α^(m) = log((1 − err^(m)) / err^(m)) + log(k − 1)
   (d) Update the weights: w_i ← w_i · exp(α^(m) · I(y_i ≠ T^(m)(x_i))), i = 1, ..., n
   (e) Normalize the weights: w_i ← w_i / Σ_{j=1}^n w_j
3. Return the ensemble classifier C(x) = arg max_y Σ_{m=1}^M α^(m) · I(T^(m)(x) = y)

4. TIME SERIES CLASSIFICATION

Our approach to time series classification combines SAMME and MBoost by simply altering the calculation of α^(m) in MBoost (Algorithm 1, Step 6) to include the additional log(k − 1) term found in SAMME (Algorithm 2, Step 2c). Furthermore, we ensure that the weak learners that MBoost uses produce hypotheses that classify time series, and that at least one of the weak learners will produce a hypothesis that performs better than random guessing (i.e., error < (k − 1)/k).
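The modified classifier-weight calculation can be sketched as follows. Note that the sign of the log(k − 1) term is fixed by requiring α = 0 exactly at the multiclass random-guessing error rate (k − 1)/k, so the term is added, and a classifier receives positive weight iff it beats random guessing.

```python
import math

def samme_alpha(err, k):
    """SAMME classifier weight (Algorithm 2, Step 2c):
    alpha = log((1 - err)/err) + log(k - 1).
    alpha is exactly zero when err = (k - 1)/k, i.e. at multiclass
    random guessing, and positive for any classifier that does better."""
    return math.log((1.0 - err) / err) + math.log(k - 1)
```

For k = 2 the log(k − 1) term vanishes and the weight reduces to log((1 − err)/err), matching the binary formulation of AdaBoost used by SAMME's authors.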

4.1 Weak Learners
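The first learner described in this subsection, weighted k-nearest neighbors, can be sketched as follows. This is a hypothetical implementation with names of our own; Euclidean distance stands in for whatever distance metric is in use, and k is the weight threshold described below, not a neighbor count.

```python
import numpy as np

def weighted_knn(train_X, train_y, train_w, query, k):
    """Weighted k-NN sketch: walk the training points in order of distance
    to the query, accumulating their boosting weights until the total
    reaches at least k, then return the class with the largest accumulated
    weight among those neighbors."""
    order = np.argsort(np.linalg.norm(train_X - query, axis=1))
    total, votes = 0.0, {}
    for i in order:
        votes[train_y[i]] = votes.get(train_y[i], 0.0) + train_w[i]
        total += train_w[i]
        if total >= k:
            break
    return max(votes, key=votes.get)
```

Because the stopping rule depends on accumulated weight rather than a fixed neighbor count, heavily weighted ("hard") points from earlier boosting rounds dominate the vote with fewer neighbors.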

Our approach allows the inclusion of any time series classification algorithm that produces multiclass output. Ideally, the algorithms should also consider the current weight on each training example, although this is not strictly required. For the purposes of the time series classification task posed by this workshop, we used the following classifiers:

Weighted k-Nearest Neighbors. The weighted k-nearest neighbor algorithm classifies query points by returning the class with the most (weighted) votes from the closest neighbors needed to ensure a total weight of at least k. During training, the algorithm simply stores all of the training points and their associated weights. During classification, the training points are sorted by their distance to the query point. Then the nearest m points are located such that Σ_{i=1}^m w_i ≥ k. Finally, the classifier returns arg max_ω Σ_{i=1}^m w_i I(ω_i = ω), where ω_i is the class label associated with each w_i.

Support Vector Machines. Support Vector Machines (SVMs) are large-margin classifiers based on linear classification [6]. Typically, a kernel is used to implicitly project the data into a high-dimensional space where linear classification is more likely to be effective. Our implementation is based on libSVM [5] and uses a simple grid search to find good values for the slack penalty and the RBF kernel scale parameter. Since our implementation of the SVM only supports binary classification, we adopt a one-vs-one scheme to generate a multiclass composite classifier. Voting is used to combine the results of the k(k − 1)/2 binary SVM classifiers. Note also that the SVM classifies feature vectors and not sequences per se. Although in our case we can interpret the fixed-length time series as a vector in