Unsupervised Context Switch for Classification Tasks on Data Streams with Recurrent Concepts

Denis M. dos Reis
Universidade de São Paulo, São Carlos, São Paulo, Brazil
[email protected]

André G. Maletzke
Universidade de São Paulo, São Carlos, São Paulo, Brazil
[email protected]

Gustavo E. A. P. A. Batista
Universidade de São Paulo, São Carlos, São Paulo, Brazil
[email protected]

ABSTRACT

In this paper, we propose a novel approach to deal with concept drifts in data streams. We assume we can collect labeled data for different concepts in the training phase; however, in the test phase, no labels are available. Our approach stores a limited number of classification models and identifies, in an unsupervised manner, the most suitable one for the current concept. Several real-world classification problems with extreme label latency can use this setting. One example is the identification of insect species using wing-beat data gathered by sensors in field conditions. Flying insects have their wing-beat frequency indirectly affected by temperature, among other factors. In this work, we show that we can dynamically identify the most appropriate classification model, among models induced from data gathered under different temperature conditions, without any temperature information. We then expand the use of the method to other data sets and obtain accurate results.

CCS CONCEPTS

• Information systems → Data stream mining; • Mathematics of computing → Hypothesis testing and confidence interval computation; • Computing methodologies → Machine learning approaches;

KEYWORDS

Classification, Data Streams, Concept Drift, Extreme Verification Latency, Kolmogorov-Smirnov

ACM Reference format:
Denis M. dos Reis, André G. Maletzke, and Gustavo E. A. P. A. Batista. 2018. Unsupervised Context Switch for Classification Tasks on Data Streams with Recurrent Concepts. In Proceedings of SAC 2018: Symposium on Applied Computing, Pau, France, April 9–13, 2018 (SAC 2018), 7 pages. DOI: 10.1145/3167132.3167189

1 INTRODUCTION

Learning from data streams with extreme verification latency is a challenging endeavor. Extreme verification latency means that no labels are available after the classifier deployment. Therefore, the classifier must detect and adapt to concept drifts in the absence of information about the correct classes of the examples.

This perspective is much different from most of the supervised approaches in data stream learning. Frequently, the literature assumes the total availability of labeled data even in the deployment setting. Therefore, drift detectors can use actual performance data to flag changes in data distribution, and classifiers can update themselves with correctly labeled data. However, many real-world applications fall in the extreme verification latency scenario. As a motivating example, consider the sensor described in [2, 3, 13]. The authors propose a sensor to classify insects into species using wing-beat data. Such sensors are key to scalable real-time surveillance of flying insects such as agricultural pests and disease vectors. A sensor of this kind would have to deal with concept drifts in an extreme verification latency setting. Although the authors can gather labeled data in the laboratory, obtaining actual classes after deployment is an expensive and non-scalable task. Nevertheless, the classifier would have to face concept drifts, since ambient conditions such as temperature, humidity, air pressure and other factors influence the behavior of the insects. We propose a simple and effective method to adapt classifiers in extreme label latency scenarios. Our approach has two main assumptions:

• Although a large set of latent factors can cause concept drifts, there is a smaller subset of variables responsible for most of the drifts. Once these variables are identified, data can be gathered offline while controlling them;
• Drifts are recurrent, meaning that we can group the latent factors into a discrete and relatively small number of contexts. Changes in the latent factors cause a back-and-forth switch among contexts, leading to recurrence.

There are real applications that fulfill these assumptions. For the previously mentioned sensor, although several conditions may influence the behavior of insects, including some factors that are difficult to measure, such as availability of water and food, temperature is the variable that most affects their wing-beat data. In the laboratory, specialized chambers can control the temperature artificially. Therefore, we can gather plenty of labeled insect data for different temperatures. The main contribution of this paper is an unsupervised method to identify the most likely recurrent context in streaming data. We use a recently introduced incremental variation of the non-parametric Kolmogorov-Smirnov test [12] to determine the most likely context. For each context, we induce a classifier in the training phase. Therefore, the unsupervised identification of the current context allows adapting to concept drifts by dynamically switching among the known contexts, without access to the variables that were responsible for separating the data into different contexts.

This paper is organized as follows. Section 2 reviews the literature about related problems; Section 3 presents and describes our proposal; Section 4 specifies our experimental evaluation; Section 5 analyzes our results; finally, Section 6 presents our conclusions and directions for future work.

2 RELATED WORK

In this section, we review the most relevant related work regarding label latency and unsupervised drift detection. We start summarizing the literature on label latency in Section 2.1 and later discuss the associated papers on drift detection with emphasis on statistical tests in Section 2.2.

2.1 Label Latency

Verification latency, or delay, is the period between the observation of a non-labeled event and the availability of its true label. This period depends on the application domain. For instance, if we are interested in predicting the tendency of the stock market or electricity demand, the verification latency is the prediction window, i.e., how long ahead of time the prediction targets. In other applications, verification latency can vary over time, as in the case of sensors in an external environment [2].

Null verification latency occurs when the actual labels are available just after the classifier issues an output. It is rare in real-world applications, although frequent in scientific publications. It is uncommon in the real world since either (a) there must be a period between the prediction and the actual occurrence of the event, or (b) there must be a period for the event to be correctly classified by an external oracle. In both cases, the time between a prediction and the availability of its corresponding real label is the time in which one can plan an appropriate reaction and execute it. For instance, if a classification system foretells a high electricity demand in a 30-minute window, there are 30 minutes for planning and executing actions that will lead to an increase of electricity generation to meet the predicted demand.

Different publications tackle label availability with different assumptions and different experimental setups. In extreme label latency, correct labels are observed only in a short period at the beginning of the stream. Approaches in this setting use these labeled examples to identify the classes and build an initial classification model. After that, no correct labels are further observed, and the approach needs to recognize concept drifts and adapt to them without labeled data [5, 14].

In contrast, semi-supervised learning and active learning settings assume label availability for just a portion of the data. Semi-supervised learning expects a usually small fraction of labeled events while the majority is unlabeled. Although the labeled events could occur with any latency, most of the literature assumes they arrive with null latency. Therefore, a typical setting for semi-supervised learning is to have null latency for the labeled data and extreme verification latency for the unlabeled data [9]. Active learning also assumes that only a portion of the data is labeled. The critical difference between active and semi-supervised learning is that in the first approach the algorithm can choose which events will be tagged by an oracle. Again, it is a common assumption in the literature that the oracle immediately labels all assigned examples.

Two recent papers that exclusively deal with the extreme latency scenario are [5] and [14]. Both papers make assumptions regarding the distribution of the data in the feature space and assume the concept drift to be incremental. Dyer's proposal [5] applies a technique of semi-supervised classification on consecutive batches of unlabeled data. For each batch, the predicted labels of the previous batch are assumed to be correct. Souza's approach [14] periodically performs data clustering and checks the spatial similarity between the clusters to infer a movement of the data in the feature space.

The literature also tackles label scarcity and delays with semi-supervised techniques. Pozzolo et al. [4] introduce a method to directly address delays in the availability of actual labels for credit card fraud detection. The proposal expects delayed labels for a significant portion of the data and readily available labels for a smaller part. The obtained results suggest that an ensemble with a classification model induced on the small labeled portion, and a classification model induced on the larger share, provides better results than a single model induced on the whole data. Masud and colleagues use semi-supervised approaches in two papers. In [10], they introduce a method for the true label scarcity problem, where actual labels are available for only a portion of the data. The labeled data are used in a semi-supervised fashion to classify future unlabeled events. In [9], they introduce an algorithm that is capable of both tackling delays in the availability of real labels and detecting novelty, i.e., the emergence of new classes. However, besides specific assumptions regarding the feature space, there is also the assumption of the possibility of delaying the prediction of observed events. Wu et al. [16] introduce another semi-supervised approach that tackles label scarcity in data streams. They use a decision tree that incrementally grows while keeping groups in its leaves. Such groups detect concept drifts through deviations in stored statistics. Finally, Kuncheva et al. [8] and Amir and Toshniwal [1] propose methods that directly tackle delays in the availability of true labels. However, they are limited to stationary distributions, i.e., classification problems without concept drifts.

2.2 Drift Detection

To the best of our knowledge, the work of Kifer and colleagues [7] is the first to propose the use of hypothesis tests to detect concept drifts. A fixed reference window represents the original data, and a sliding window holds the current data. The approach identifies a concept drift when the hypothesis test indicates that the data in both windows come from different probability distributions. In this case, the method updates the reference window to contain the events from the current sliding window. They evaluated five statistical tests, including the Kolmogorov-Smirnov and Wilcoxon tests. The authors described an algorithm for the fast recomputation of a variant of the Kolmogorov-Smirnov statistic, i.e., the proposed algorithm does not compute the exact Kolmogorov-Smirnov statistic according to its definition. They also measured the delay and the number of false positives and true positives for each hypothesis test in an experimental setup that only included synthetic, unidimensional datasets.

More recently, Zliobaite has shown that certain types of concept drifts are undetectable without true labels [17]. Figure 1 illustrates one example of such cases. She also proposed a method to identify detectable changes without true labels. Two non-intersecting consecutive sliding windows follow the data stream. The windows can contain either information from the feature space or the output from a classifier. After the observation of each event, the windows are updated, removing the information relative to the oldest event and inserting the information of the newest event. The method reports a concept drift if the probability distributions of both windows are different, according to a hypothesis test. When using information from the feature space, Zliobaite suggests either using the whole vector that describes an event as a multivariate variable and, therefore, applying an appropriate hypothesis test, or mapping it to a single value. In the case of using the classifier's output, Zliobaite suggests employing the score obtained for the predicted label, if the classifier can provide this information, or the label itself, otherwise. She evaluated three hypothesis tests, including the Kolmogorov-Smirnov test. Zliobaite advocated its use since it is a non-parametric test. Non-parametric tests better suit classification scenarios, where the underlying distribution is a mix of different distributions. Particularly for data stream applications, concept drifts can potentially intensify the occurrence of a mixture of distributions. Neither [7] nor [17] directly examined the impact of unsupervised concept drift detection on classification performance.

More recently, Reis et al. [12] introduced a novel algorithm for the fast and exact recomputation of the Kolmogorov-Smirnov statistic that suits data stream applications. They evaluated the impact of unsupervised concept drift detection on classification regarding both classification accuracy and the number of actual labels requested to update the classification model. The setting proposed in [12] is similar to active learning. When a concept drift is detected, the approach may request actual labels for the examples in the current sliding window to adapt the classifier to the new concept. Additionally, Reis et al. proposed different strategies to decrease the number of labels requested after the drift detection. The experimental evaluation assessed these strategies as well as the efficiency of the incremental Kolmogorov-Smirnov test on real datasets with both real and synthesized concept drifts.

Despite their differences, the majority of the related work deals with concept drifts over time. However, none of them assume recurrent concepts in extreme verification latency scenarios. In this paper, we consider that concept drifts are recurrent and that the conditions that trigger a change are known. We call these recurrent concept drifts contexts, to differentiate them from regular drifts caused by unknown sources. Contexts make for a valuable, and yet realistic, assumption if one wants to achieve entirely unsupervised classification on non-stationary data streams. In such scenarios, the key is to continually identify which known context is currently the most likely to be active. In this work, we study this scenario in depth and provide a starting point for future work.
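To make the window-based scheme above concrete, the following is a minimal sketch, in Python, of unsupervised drift detection with a fixed reference window and a sliding window compared by a two-sample Kolmogorov-Smirnov test. It is our own illustration, not the code of [7], [17] or [12]: scipy's batch ks_2samp stands in for the incremental KS test, and the window size and significance level are arbitrary choices.

    # Minimal sketch: reference window vs. sliding window compared with a
    # two-sample KS test; a drift resets the reference to the current window.
    from collections import deque
    from scipy.stats import ks_2samp

    def detect_drifts(stream_values, window_size=100, alpha=0.001):
        """Yield the indices at which a drift is flagged on a 1-D stream of values."""
        reference = list(stream_values[:window_size])
        sliding = deque(reference, maxlen=window_size)
        for i, value in enumerate(stream_values[window_size:], start=window_size):
            sliding.append(value)                    # slide the current window
            _, p_value = ks_2samp(reference, list(sliding))
            if p_value < alpha:                      # the two windows likely differ
                reference = list(sliding)            # new concept becomes the reference
                yield i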

3 PROPOSED METHOD

We start this section by analyzing a simpler dataset, WBF-Insects, to provide the intuition behind the proposed method. WBF-Insects is a real dataset, collected in the laboratory, that contains information regarding the flight of insects passing through a photosensitive sensor. This simplified version of the sensor data comprises only two features for each event: the wing-beat frequency (WBF) of the insect passing in front of the sensor and the ambient temperature at the same instant of time. There are records for three different species: Aedes aegypti, Aedes albopictus and Culex quinquefasciatus. Aedes aegypti and Aedes albopictus are additionally discriminated by sex, totaling five classes.

Temperature influences the insects' metabolism and, consequently, their wing-beat frequency [11, 15]. We consider the temperature as a possible latent variable for a classification task, to which we only have access at a moment previous to the actual use of the models for the prediction task. This assumption follows a practical application of sensors: it is possible to collect more information in the laboratory than in the field, which can be constrained by financial and scalability factors.

The dataset has two contexts, C_A and C_B. C_A contains the events observed with temperatures below 25° Celsius, while C_B contains the events observed with temperatures greater than or equal to 25° Celsius. After the split, we removed the temperature feature from all events. Table 1 quantifies the events for each class and context.

Table 1: Number of events for each class and context in the WBF-Insects dataset.

Class    C_A      C_B      C_A ∪ C_B
1         2068     4942      7010
2         4297     6001     10298
3         6921    15511     22432
4          562     1379      1941
5        36852    22867     59719
Total    50700    50700    101400

We show, in Figure 2, that both contexts present distinct, although similar, probability distributions. Hence, we expect that a non-parametric hypothesis test such as the Kolmogorov-Smirnov test is capable of distinguishing them, given enough observations. Additionally, and more importantly, we can use the statistic of a given hypothesis test as a dissimilarity measurement to identify which of these two distributions is more similar to a third one.

Identifying the current context would be devoid of usefulness if the overlap among the classes in the feature space, for at least one of the contexts, were no smaller than the class overlap for the whole data without context separation. In other words, if a classification model M_AB induced over C_A ∪ C_B performs at least as well as individual classifiers for C_A and C_B, there is no reason to identify the actual context, since it would be enough to always issue predictions with M_AB. We estimated this overlap by computing the total area of overlap for the equivalent bins in histograms of each class, with 10 bins. The histograms for C_A ∪ C_B presented a total area of superposition of 31%, which represents an upper bound for optimal classification accuracy of 69%. Histograms for the individual contexts C_A and C_B presented areas of superposition of 15% and 21%, respectively, which represent an upper bound for optimal classification accuracy of 82%. These values indicate that it is probably more beneficial to use two classification models and identify, with a hypothesis test, which one should be used, than to apply one single classifier to every event.

We executed the experiments with four different classifiers. We reserve the detailed results for Section 5 and discuss here the results obtained with Random Forest. Using the WBF attribute to select the context, the Random Forest classifier achieved a respectable accuracy of 75.70%. Training with data from both contexts reached just 66.03%. This result is more convincing when compared to a perfect classifier that always chooses the right context: such a flawless but unfeasible classifier obtained an accuracy that is only slightly better than ours, 77.78%.
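The following is a minimal sketch, in Python, of the building block behind the overlap estimate just described: a 10-bin histogram intersection between the feature values of two classes. It is our own illustration; the paper does not specify how the per-class intersections are aggregated into the reported totals, so only this kernel is shown, and the function and variable names are ours.

    # Minimal sketch: shared probability mass of two 1-D samples measured over
    # a common 10-bin histogram (e.g., WBF values of two classes).
    import numpy as np

    def histogram_intersection(values_a, values_b, n_bins=10):
        """values_a, values_b: 1-D numpy arrays with the feature values of two classes."""
        lo = min(values_a.min(), values_b.min())
        hi = max(values_a.max(), values_b.max())
        bins = np.linspace(lo, hi, n_bins + 1)
        hist_a, _ = np.histogram(values_a, bins=bins)
        hist_b, _ = np.histogram(values_b, bins=bins)
        dens_a = hist_a / hist_a.sum()      # normalize so each histogram sums to 1
        dens_b = hist_b / hist_b.sum()
        return float(np.minimum(dens_a, dens_b).sum())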

Figure 2: Estimated densities for the probability distributions of contexts C_A and C_B (x-axis: wing-beat frequency; y-axis: density).

Figure 1: A concept drift that is undetectable without true labels in a two-dimensional feature space with two classes. Red dots represent events belonging to class A, which are generated inside the red-shaded area. Blue dots represent events belonging to class B, which are generated inside the blue-shaded area. Panels: (a) before concept drift; (b) after undetectable drift; (c) perception with unlabeled data. In general, undetectable changes are those in which neither the prior probability of events nor the shape that constrains the events changes over time when observing only the unlabeled data.

In a more general setting, we have K contexts. For each context, we induce a classification model M_i with a reference event sample T_i. For simplicity, we assume the size of each reference sample T_i to be the same, w. We keep a reference value sample R_i that contains one observation value extracted from each event in T_i, and a circular buffer S_i that contains one observation value for each of the last w events in the stream. Initially, S_i = R_i. Therefore, S_i represents the current window of a sliding window over the stream. We select which model is more apt to classify an event by choosing the classification model M_j such that j = arg min_i H(R_i, S_i), where H is a function that gives the statistic of a given hypothesis test, such as the Kolmogorov-Smirnov statistic. This test should analyze whether the two samples R_i and S_i come from the same data distribution.

We note that the WBF-Insects data is overly simple, since it has only one available feature besides the context attribute. Introducing more features implies two problems. First, simply analyzing the data overlap and inducing expectations for accuracy becomes more difficult. Second, most hypothesis tests are unidimensional; therefore, we need a strategy to test for multiple features. There are different ways of circumventing this problem. We discuss some alternatives:

• Applying a multidimensional hypothesis test: although this may seem the most natural option, it carries many caveats. The most important one is the dimensionality of the feature space. Large feature spaces require bigger samples to achieve reasonable quality for the tests;
• Applying a filter to reduce the dimensionality of the feature space to one: one problem with this option is choosing the filter. The weights for different features should be accounted for so that they best reflect the classification models. However, classification models for different contexts may weight the features differently;
• Using the output from the classification models: most classifiers output a score or a probability estimate. For these classifiers, R_i can be composed of these scores for each example in T_i. A recommended approach to generate these scores is to use leave-one-out sampling;
• Choosing one specific feature from the feature space: this case requires knowledge of the domain to choose the feature and the existence of a good feature to discriminate all contexts.

The first option is too compromising in practice: we would prefer to use the smallest viable number of recent observations to be more reactive to recent changes in the data distribution. Thus we often cannot afford to gather a sample that is large enough for the multidimensional test to become reliable. The second alternative merits a study on its own. Besides, we can consider the third option as a subset of the second. In our experiments, we tested the application of the last two alternatives. The output from the classification models was the estimated probability for the predicted class, as in [17], except for the Nearest Neighbor algorithm, for which the output was the distance between the test event and the nearest training event.
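The sketch below illustrates the context-selection procedure described above. It is our own minimal illustration, not the authors' released code: it assumes scikit-learn-style classifiers with predict_proba, scipy's batch ks_2samp stands in for the incremental Kolmogorov-Smirnov test of [12], and the observe hook (which returns either a chosen feature value or the model's score for an event) as well as all class and variable names are ours.

    # Minimal sketch: each context i keeps a classifier model_i, a reference value
    # sample R_i and a circular buffer S_i of the last w observation values. For each
    # arriving event we update every S_i, pick the context whose KS statistic between
    # R_i and S_i is smallest, and classify with that context's model.
    from collections import deque
    import numpy as np
    from scipy.stats import ks_2samp

    class ContextSwitcher:
        def __init__(self, models, reference_values, window_size):
            self.models = models                                               # one fitted classifier per context
            self.references = [np.asarray(r) for r in reference_values]        # R_i
            self.buffers = [deque(r, maxlen=window_size) for r in reference_values]  # S_i

        def observe(self, model, x):
            # Observation value for the event: here the score of the predicted class;
            # a single feature value (e.g., the WBF) could be used instead.
            return model.predict_proba([x])[0].max()

        def classify(self, x):
            for model, buffer in zip(self.models, self.buffers):
                buffer.append(self.observe(model, x))          # slide every S_i
            stats = [ks_2samp(ref, list(buf)).statistic        # dissimilarity H(R_i, S_i)
                     for ref, buf in zip(self.references, self.buffers)]
            best = int(np.argmin(stats))                       # most likely current context
            return best, self.models[best].predict([x])[0]

For the two WBF-Insects contexts, one would induce one model per context on T_A and T_B and then call classify once per streaming event.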

4 EXPERIMENTAL SETUP

We carried out our experiments with four classification algorithms: Nearest Neighbor (NN), Multilayer Perceptron (MLP), Random Forest (RF) and Support Vector Classifier (SVC). For the NN, we used our own implementation. For the other algorithms, we chose the implementations in scikit-learn (v0.19.0, available at http://scikit-learn.org/) with default parameters. When performing experiments with context detection on the output of the classifiers, we chose to use the score for the predicted class, as suggested in [17], for all models except the Nearest Neighbor algorithm, for which the output was the distance between the test event and the nearest training event. All experiments employ the Kolmogorov-Smirnov statistic. For performance reasons, we chose the incremental implementation of this test [12]. Our experiments involve seven stream configurations. We executed each configuration ten times and averaged the results. Each stream configuration consists of:

• D, a data set;
• f_c, a feature that specifies which context each event belongs to;
• w, the window size that defines the size of the training event samples T_i, the reference value samples R_i and the rotating buffers S_i;
• l_c, a concept length, so that in the generated stream, events belonging to a context only appear in sequences of k * l_c consecutive events, k ∈ N;
• f_h, the feature used in the hypothesis test, if we are not using the classifier output; and
• f_t, the class feature for the classification task.

We split the dataset into sets, here called contexts, so that each set corresponds to a unique value of f_c. After the split, f_c is removed from all events. From each context C_i, we reserve a set T_i of w randomly chosen events to induce M_i and initialize R_i and S_i. We randomly split the remaining events in C_i into segments of l_c events. The stream is a random arrangement of all segments of all contexts. The task is to predict f_t. We iterate through the stream. For each event in the stream, we first update each S_i by removing the oldest entry and inserting the value obtained with the current event. Finally, we select the most appropriate M_i so that we minimize H(R_i, S_i) and predict this event's f_t.

We evaluate our proposal regarding two aspects: the correct identification of the current context and the proper classification of events. Therefore, we measure the context accuracy (Co when applying the hypothesis test on the classifier output and Cf when applying the hypothesis test on a chosen feature). We report these accuracies as the percentage of times that we selected the correct classification model M_i. We also report the classification accuracy, i.e., the percentage of times that we correctly classified each event's target feature f_t. We compare the classification accuracy of our proposal (Ao when applying the hypothesis test on the classifier output and Af when applying the hypothesis test on a feature of the feature space) against two baselines and one topline, described below:

• BR, the first baseline, classifies each event using a randomly chosen model in ∪_i M_i. Its utility is to find whether our proposal selects a model intelligently, i.e., performs better than randomly guessing which is the current context;
• BU, the second baseline, classifies each event using a model induced over ∪_i T_i, i.e., a classification model trained with all training data disregarding their contexts. Our proposal does not necessarily need to outperform this baseline, as long as it does not fall behind it; and
• T, the topline, classifies each event in the stream with the model that belongs to the same context as the event.

Cases in which the proposal, BU and T perform similarly mean that, although the proposal selects the classification model well, there is no need to separate contexts for the evaluated data. A sketch of how each stream is assembled appears at the end of this section. Descriptions of each stream configuration follow:

A In this stream, we use the data set Aedes aegypti-Culex quinquefasciatus, which contains features that describe the passage of female Aedes aegypti and female Culex quinquefasciatus mosquitoes in front of a sensor. Besides the wing-beat frequency, there are 25 other numerical features obtained from the signal power spectrum. Six ranges of temperature define the contexts, and the objective is to distinguish between the two species. The chosen feature for the hypothesis test is the WBF. w = 100, l_c = 900;
B In this stream, we use the data set Aedes aegypti-sex. It is similar to the previous experiment; however, we want to discriminate between male and female Aedes aegypti mosquitoes. The chosen feature for the hypothesis test is the WBF. w = 100, l_c = 900;
C In this stream, we use a modified version of Arabic-Digit [6], which contains a fixed number of MFCC values for human speech of Arabic digits. The sex of the speaker defines the context, and the task is to predict which digit was spoken. The chosen feature for the hypothesis test is the first MFCC. w = 150, l_c = 800;
D This stream is similar to the previous one. However, the digit defines the context, and the task is to predict the sex of the speaker. The chosen feature for the hypothesis test is the first MFCC. w = 50, l_c = 800;

E In this stream, we use the data set Handwritten, which contains features regarding the handwritten letters g, p and q. The context is given by the author (among 10), and the objective is to predict the letter. The chosen feature for the hypothesis test is the area of the shape that is drawn by the author. w = 50, l_c = 250;
F This stream is similar to the previous one. However, the letter defines the context, and the task is to predict the author. The chosen feature for the hypothesis test is the area of the shape that is drawn by the author. w = 50, l_c = 250;
G This stream uses WBF-Insects, as described at the beginning of the previous section. The chosen feature for the hypothesis test is the WBF. w = 100, l_c = 1000.

Except for WBF-Insects, all data sets are balanced for all classes considered. The online supplementary material (https://github.com/denismr/Unsupervised-Context-Switch-For-Classification-Tasks) contains all source code and datasets used in our experiments.
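The sketch below, referenced earlier in this section, illustrates how one stream configuration could be assembled under the protocol just described: split the data by the context feature f_c, reserve w training events per context, cut the remainder into segments of l_c events, and shuffle the segments into a single stream. It is our own illustration; the function name and the assumption that the data arrive as a pandas DataFrame are ours, not the authors'.

    # Minimal sketch of the stream assembly: training sets T_i per context plus a
    # shuffled list of (context, segment) pairs that forms the evaluation stream.
    import random

    def build_stream(df, f_c, w, l_c, seed=0):
        """df: pandas DataFrame; f_c: context column; w: training sample size; l_c: segment length."""
        rng = random.Random(seed)
        training_sets, segments = {}, []
        for context, group in df.groupby(f_c):
            rows = group.drop(columns=[f_c])
            order = list(range(len(rows)))
            rng.shuffle(order)
            rows = rows.iloc[order]
            training_sets[context] = rows.iloc[:w]          # T_i: used to induce M_i, R_i and S_i
            rest = rows.iloc[w:]
            for start in range(0, len(rest) - l_c + 1, l_c):
                segments.append((context, rest.iloc[start:start + l_c]))
        rng.shuffle(segments)                               # the stream is a random arrangement of segments
        return training_sets, segments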

5 EXPERIMENTAL RESULTS

Tables 3 and 4 display our experimental classification accuracies and context accuracies, respectively. Table 2 displays the p-values for the most important pairwise Wilcoxon signed-rank tests. We consider a significance level of 0.05 for the following considerations.

Table 2: p-values for Wilcoxon signed-rank tests.

Comparison    NN     MLP    RF     SVC
Co × Cf       1.00   0.31   0.50   0.24
Ao × Af       0.74   0.24   0.40   0.18
Ao × BR       0.02   0.06   0.02   0.03
Ao × BU       0.87   0.24   0.06   0.02
Ao × T        0.02   0.13   0.02   0.02
Af × BR       0.02   0.02   0.02   0.02
Af × BU       0.87   0.06   1.00   0.40
Af × T        0.02   0.06   0.02   0.02
BR × T        0.02   0.02   0.02   0.02
BU × T        0.03   0.61   0.03   0.87
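As an illustration of how one entry of Table 2 can be computed, the sketch below runs a single paired Wilcoxon signed-rank test over the seven streams with scipy. The accuracy arrays are placeholders, not the paper's numbers; with seven pairs, the smallest attainable two-sided p-value is 2/2^7 ≈ 0.016, consistent with the 0.02 entries in the table.

    # Minimal sketch: one paired comparison (e.g., Af vs. BR) across streams A-G.
    from scipy.stats import wilcoxon

    accuracy_setting_1 = [0.80, 0.95, 0.70, 0.90, 0.93, 0.68, 0.75]   # placeholder values per stream
    accuracy_setting_2 = [0.69, 0.88, 0.58, 0.81, 0.80, 0.33, 0.57]   # placeholder values per stream

    statistic, p_value = wilcoxon(accuracy_setting_1, accuracy_setting_2)
    print(f"Wilcoxon signed-rank: W={statistic:.1f}, p={p_value:.3f}")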

We cannot statistically assert whether it is better to always apply the hypothesis test to the output of the classifier or to the values of a feature in the feature space, regarding either context accuracy or classification accuracy. We note, however, that the results obtained when using a feature are biased by our choices of features, which may not be optimal. Interestingly, context accuracy and classification accuracy are uncorrelated. This can be explained by contexts that are so similar that the accuracy would be the same regardless of which model is used.

The obtained accuracies, for both the classifier's output and the feature value, were significantly greater than BR. This means that the method for selecting the classification model performs better than choosing randomly. Considering that we can either use a classifier's output or a feature in the feature space, the obtained accuracies were not significantly better or worse than BU. This fact is not especially against the proposed method of choosing the classification model, as the accuracy rates obtained for the proposed method and for BU were close to the topline accuracies (although statistically inferior). It means that, for the majority of the data sets, there was no need to induce a model for each context, since a single model would perform as well as separate models.

However, we note that having different models presents some advantages. Besides contextual models being potentially less complex than one single model, if one is interested in increasing the number of identifiable contexts over time, one can do so without having to retrain a model with an ever-increasing training set. Additionally, if the classification algorithm does not need the training events after the model is induced, they can be permanently discarded, even with the addition of new contexts later on. Finally, we note that, although it is true that separate models are not required for the majority of our streams, they are required for some. As we analyzed in Section 3, WBF-Insects is one such case, and therefore the accuracy obtained for stream G, specifically Af, was greater than both baselines by a large margin. Similar results were found for streams A and B.

As an additional relevant note, the context accuracy obtained by using the Nearest Neighbor's output significantly outperformed the context accuracy obtained by using any other classification algorithm's output. This fact indicates that the probability estimate for the predicted class can be an inadequate value to feed our proposed method.

6 CONCLUSIONS AND FUTURE WORK PROSPECTS

In this work, we presented a real-world application that has both extreme verification latency and recurrent concepts in a data stream. In other words, in the studied scenario, we cannot obtain true labels for observed test events, although we know they follow one among a limited number of known contexts. We have shown that, for this application, classification performance is higher when applying individual classification models for each context, assuming that we can identify and use the appropriate model, than when using a single model induced with all available training data. We presented a method to identify the most suitable model for each event in a stream, assuming that contexts occur in runs with a minimum number of consecutive events. The proposed method obtained a performance close to that of a topline that always uses the correct classification model.

We note that a negative aspect of the Kolmogorov-Smirnov statistic, as applied in the method, is that it is sensitive to changes in the proportions of the classes: such changes lead to an increase in the value of the statistic, which translates to a higher dissimilarity with a context, even if not necessarily accompanied by a worsening in the classification performance of that context's classifier. Future work should replace the KS statistic with alternatives that are robust against, or can even identify and report, changes in the proportions of the classes. We also reserve for future work the identification of the occurrence of yet unknown contexts, and the taking of subsequent actions, such as requesting limited label information to learn the newly discovered contexts.

Table 3: Experimental classification accuracies. All values are percentages and are accompanied by their standard deviations in parentheses. Stream A B C D E F G Stream A B C D E F G

NN

Ao 74.56 (1.31) 96.96 (0.40) 71.83 (1.41) 96.25 (0.29) 96.49 (0.28) 68.80 (1.85) 57.51 (4.36)

Af 75.84 (0.95) 97.26 (0.54) 69.02 (1.73) 89.26 (1.10) 93.07 (1.24) 68.64 (1.84) 72.21 (3.96)

Ao 71.17 (3.73) 95.61 (0.63) 67.04 (1.62) 88.18 (2.27) 97.05 (0.41) 61.86 (7.14) 65.99 (6.31)

Af

78.26 (0.79) 96.07 (0.69) 65.55 (1.43) 83.32 (2.20) 92.85 (1.22) 66.78 (2.77) 75.70 (2.52)

BR

69.20 (0.70) 88.84 (2.48) 58.48 (1.24) 81.42 (1.36) 79.73 (0.83) 33.08 (1.37) 57.47 (2.44) RF

BR

68.12 (1.40) 82.63 (2.18) 52.26 (1.47) 75.63 (1.87) 80.66 (2.09) 32.79 (1.42) 58.80 (3.08)

MLP

BU

68.17 (0.75) 93.66 (0.82) 73.10 (1.21) 96.92 (0.26) 96.99 (0.27) 69.01 (1.87) 59.00 (3.09)

T

Ao

76.62 (0.84) 97.74 (0.50) 73.24 (1.40) 96.81 (0.25) 98.18 (0.12) 70.62 (2.00) 73.65 (4.30)

49.96 (0.22) 49.91 (0.23) 78.98 (1.25) 94.63 (2.05) 96.13 (0.11) 70.26 (4.10) 35.31 (22.43)

BU

71.60 (0.92) 94.73 (0.44) 68.00 (0.76) 93.71 (0.97) 98.17 (0.45) 64.67 (1.76) 66.03 (1.73)

T

Ao

78.94 (0.76) 96.89 (0.52) 68.99 (1.51) 93.59 (0.93) 98.81 (0.33) 68.85 (3.03) 77.78 (2.50)

72.46 (2.63) 89.83 (1.46) 68.80 (1.79) 94.55 (1.73) 94.91 (1.39) 34.60 (5.74) 53.05 (8.66)

Af

50.86 (2.16) 50.10 (0.36) 77.38 (1.52) 91.47 (1.33) 92.14 (1.27) 72.45 (3.39) 30.75 (20.23)

Af

77.16 (1.43) 93.21 (0.81) 67.24 (2.43) 91.53 (1.14) 91.63 (1.00) 42.19 (3.33) 76.13 (1.42)

BR

50.28 (0.85) 49.93 (0.40) 65.15 (1.94) 86.18 (1.95) 79.51 (1.17) 33.41 (1.61) 28.31 (20.08) SVC

BR

66.75 (2.15) 75.87 (2.80) 58.74 (1.97) 85.29 (1.23) 78.34 (0.86) 22.38 (1.93) 58.23 (1.96)

BU 58.23 (10.86) 73.21 (18.38) 79.42 (0.69) 98.39 (0.19) 97.28 (0.34) 68.78 (2.87) 32.59 (23.37)

BU

74.29 (1.00) 95.09 (0.45) 75.24 (0.92) 98.22 (0.19) 96.13 (0.36) 46.63 (2.70) 65.48 (3.47)

T 50.68 (1.94) 50.02 (0.11) 80.44 (1.03) 99.23 (0.25) 97.91 (0.12) 74.56 (3.01) 31.54 (21.28)

T 78.29 (1.09) 94.50 (0.52) 69.70 (1.74) 97.89 (0.35) 97.34 (0.24) 43.32 (3.47) 77.62 (1.00)

Table 4: Experimental context accuracies. All values are percentages and are accompanied by their standard deviations in parentheses.

Co – Test on classifier's output
Stream   NN             MLP             RF              SVC
A        45.19 (9.48)   17.01 (7.04)    29.28 (5.46)    31.23 (9.06)
B        58.11 (7.61)   19.81 (8.58)    54.41 (9.96)    37.52 (9.95)
C        95.33 (1.09)   95.52 (2.48)    94.06 (0.75)    95.49 (1.96)
D        95.06 (2.27)   49.80 (10.77)   59.07 (9.53)    63.77 (5.11)
E        91.00 (0.64)   87.25 (2.52)    89.53 (2.29)    88.84 (2.32)
F        96.70 (0.75)   92.44 (5.12)    87.92 (10.90)   72.77 (18.70)
G        50.00 (0.00)   54.58 (14.46)   69.93 (15.73)   38.06 (21.48)

Cf – Test on feature value
Stream   NN             MLP             RF              SVC
A        78.80 (6.98)   80.78 (4.08)    76.55 (5.92)    75.09 (7.79)
B        81.38 (3.51)   81.98 (6.73)    81.31 (5.77)    83.53 (4.46)
C        85.70 (5.45)   90.41 (3.22)    89.59 (2.34)    87.62 (6.57)
D        47.74 (3.56)   46.35 (3.29)    45.22 (3.99)    45.36 (5.30)
E        65.30 (3.89)   63.84 (5.34)    62.62 (4.25)    63.01 (4.33)
F        96.39 (0.92)   96.39 (1.08)    96.20 (0.56)    96.27 (0.57)
G        95.72 (2.79)   88.23 (11.57)   94.54 (3.61)    96.23 (1.24)

ACKNOWLEDGMENTS

The authors would like to thank CAPES (grant PROEX-6909543/D), CNPq (grants 446330/2014-0 and 306631/2016-4), and FAPESP (grant 2016/04986-6). This material is based upon work supported by the United States Agency for International Development under Grant No. AID-OAA-F-16-00072.

REFERENCES

[1] Mohd Amir and Durga Toshniwal. 2010. Instance-Based Classification of Streaming Data Using Emerging Patterns. In ICT. 228–236.
[2] Gustavo E. A. P. A. Batista, Eamonn J. Keogh, Agenor Mafra-Neto, and Edgar Rowton. 2011. Sensors and software to allow computational entomology, an emerging application of data mining. In KDD. 761–764.
[3] Yanping Chen, Adena Why, Gustavo E. A. P. A. Batista, Agenor Mafra-Neto, and Eamonn Keogh. 2014. Flying Insect Classification with Inexpensive Sensors. J Insect Behav 27, 5 (2014), 657–677. https://doi.org/10.1007/s10905-014-9454-4
[4] Andrea Dal Pozzolo, Giacomo Boracchi, Olivier Caelen, Cesare Alippi, and Gianluca Bontempi. 2015. Credit Card Fraud Detection and Concept-Drift Adaptation with Delayed Supervised Information. In IJCNN.
[5] Karl B. Dyer, Robert Capo, and Robi Polikar. 2013. COMPOSE: A Semisupervised Learning Framework for Initially Labeled Nonstationary Streaming Data. TNNLS (2013).
[6] Nacereddine Hammami and Mouldi Bedda. 2010. Improved tree model for Arabic speech recognition. In ICCSIT, Vol. 5. 521–526.
[7] Daniel Kifer, Shai Ben-David, and Johannes Gehrke. 2004. Detecting change in data streams. In Proceedings of the Thirtieth International Conference on Very Large Data Bases, Volume 30. VLDB Endowment, 180–191.
[8] Ludmila I. Kuncheva et al. 2008. Nearest neighbour classifiers for streaming data with delayed labelling. In ICDM. IEEE, 869–874.
[9] Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham. 2011. Classification and novel class detection in concept-drifting data streams under time constraints. TKDE 23, 6 (2011), 859–874.
[10] Mohammad M. Masud, Clay Woolam, Jing Gao, Latifur Khan, Jiawei Han, Kevin W. Hamlen, and Nikunj C. Oza. 2012. Facing the reality of data stream classification: coping with scarcity of labeled data. KAIS 33, 1 (2012), 213–244.
[11] Kenneth Mellanby. 1936. Humidity and insect metabolism. Nature 138 (1936), 124–125.
[12] Denis dos Reis, Peter Flach, Stan Matwin, and Gustavo E. A. P. A. Batista. 2016. Fast unsupervised online drift detection using incremental Kolmogorov-Smirnov test. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM.
[13] Diego F. Silva, Vinícius M. A. Souza, Daniel Ellis, Eamonn Keogh, and Gustavo E. A. P. A. Batista. 2015. Exploring Low Cost Laser Sensors to Identify Flying Insect Species. J Intell Robot Syst 80, 1 (2015), 313–330. https://doi.org/10.1007/s10846-014-0168-9
[14] Vinícius M. A. Souza, Diego F. Silva, João Gama, and Gustavo E. A. P. A. Batista. 2015. Data Stream Classification Guided by Clustering on Nonstationary Environments and Extreme Verification Latency. In SDM. SIAM, 873–881.
[15] L. R. Taylor. 1963. Analysis of the effect of temperature on insects in flight. Journal of Animal Ecology 32, 1 (1963), 99–117.
[16] Xindong Wu, Peipei Li, and Xuegang Hu. 2012. Learning from concept drifting data streams with unlabeled data. Neurocomputing 92 (2012), 145–155.
[17] Indrė Žliobaitė. 2010. Change with Delayed Labeling: When is it Detectable?. In ICDMW. IEEE, 843–850.
