A FRAMEWORK FOR GENERATING DATA TO SIMULATE CHANGING ENVIRONMENTS Anand Narasimhamurthy School of Electronics and Computer Science, University of Wales, Bangor, LL57 1UT, United Kingdom
[email protected]
Ludmila I. Kuncheva School of Electronics and Computer Science, University of Wales, Bangor, LL57 1UT, United Kingdom
[email protected]
ABSTRACT

A fundamental assumption often made in supervised classification is that the problem is static, i.e. the description of the classes does not change with time. However, many practical classification tasks involve changing environments. Thus designing and testing classifiers for changing environments are of increasing interest and importance. A number of benchmark data sets are available for static classification tasks. For example, the UCI machine learning repository is extensively used by researchers to compare algorithms across various domains. No such benchmark data sets are available for changing environments. Also, while generating data for static environments is relatively straightforward, this is not so for changing environments. The reason is that an infinite number of changes can be simulated, and it is difficult to define which ones will be realistic and hence useful. In this paper we propose a general framework for generating data to simulate changing environments. The paper gives illustrations of how the framework encompasses various types of changes observed in real data, and also how the two most popular simulation models (STAGGER and moving hyperplane) are represented within it.
KEY WORDS
Machine Learning, Changing environments, Concept drift, Population drift, Artificial data, Simulated data.

1 Introduction

Online learning is increasingly occupying centre stage in data mining as more and more applications generate massive streams of data [5, 6, 29]. In these cases it is often impractical to apply conventional classification methodologies, since these are typically batch algorithms, often requiring more than one pass through the entire data set. Many online versions of conventional classification methods have been developed [23–25]. A fundamental assumption in supervised classification, including many of the online classification methods, is that the problem is static, i.e. the description of the classes does not change with time. In many practical problems this is most likely to be false. An example is detecting and filtering out spam e-mail. The descriptions of the two classes “spam” and “non-spam” evolve with time. They are user-specific, and users’ preferences change with time. Also, spammers are active opponents who devise ever newer methods to evade detection, thereby altering the definition of the class “spam” within the current feature space.

A change in the environment is referred to as concept drift in machine learning. The changes could include minor fluctuations of the underlying probability distributions, steady trends, rapid substitution of one classification task with another, and so on [12, 33]. These categories are by no means mutually exclusive, and various combinations are possible. Bespoke online classification models [5, 6, 32], as well as classifier ensembles [21, 29], have been proposed for changing environments [22].

In the light of the growing importance of classification in changing environments, it is striking how little attention has been paid to putting together a collection of benchmark data sets or formulating generic procedures for simulating data. As a result, researchers use their own simulated data sets or specific real data domains such as finance [8, 12, 32], text categorisation [17–19], spam filtering [4] and web searches [11]. In spite of some critiques [27], the creation of the UCI Machine Learning Repository [3] in 1987 has had a significant positive impact on the Machine Learning and Pattern Recognition communities [1]. A KDD (Knowledge Discovery in Databases) archive was subsequently developed in 1999 [1], containing large data sets of diverse types and domains (http://kdd.ics.uci.edu). Data with a variety of known changes are needed for evaluating classification methods in changing environments. In this study we propose a general framework for generating artificial data which simulate changing environments. The data may contain one or more types of drift, as desired by the user.
The rest of the paper is organised as follows. In Section 2 we review the terminology used in the literature on changing environments. The proposed framework is described in Section 3. Illustrations of how the proposed framework accommodates the defined types of environmental changes are given in Section 4. An implementation algorithm is given in Section 5. Section 6 concludes the study.
2 Terminology
The most used term in the ML literature on the subject is ‘concept drift’. However, there is a variety of other terms, some of them not precisely defined, taken to mean the same thing. This lack of uniformity of terminology prompted us to summarise and explain the terms found in the literature before proceeding with the general framework.

The term ‘concept’ is used to mean slightly different things. It is sometimes used to refer to the class of interest. A concept change in this case would mean a deviation of the description of the class from its original description. Consider, for example, the STAGGER data set described by Widmer and Kubat [33] (discussed further in Section 4.2). In this data set, the feature space of a simple blocks world is described by three features (attributes): size, color and shape. Examples are labelled according to a target concept (not given to the classifier explicitly) which changes over time. For example, the target concept for the first set of instances is “size = small AND color = red”, while the target concept for a subsequent set of instances is “color = green OR shape = circular”. An analogous ‘target concept’ or ‘concept of interest’ in a real-world domain such as information filtering may be the class of ‘relevant articles’ from a set of documents (e.g. Klinkenberg [14, 18]). In other articles, the authors use ‘concept’ to refer to the whole distribution of the problem at time moment t, including all classes. Then concept drift is the substitution of one concept at time t with another concept at time t+1. To avoid confusion, in this study we will try to avoid the term altogether.

Another frequently used term is population drift, which refers to changes in the underlying distributions. Kelly et al. [12] use the term population drift to mean changes either in the prior probabilities or in the probabilities of class membership conditional on the feature vectors. They point out that the term concept drift has been used in the machine learning literature both for this (population drift) and for other changes.

While in a natural system we can expect gradual drifts (e.g., seasonal, demographic, habitual, etc.), sometimes the class description may change rapidly due to so-called hidden contexts. Such contexts may be, for example, illumination in image recognition and accents in speech recognition [34]. The context might instantly become highly relevant. Consider a system trained on images with similar illumination. If images with the same type of content but a different illumination are fed to the system, the class descriptions might change so as to make the system worse than a random guess.

The types of changes can be roughly summarised as follows; they are briefly discussed in subsequent sections. These categories are by no means mutually exclusive, and various combinations are possible. For example, recurring trends and population drift are both subsumed by the other two categories, namely gradual changes and concept substitution. They are listed separately, however, since they are changes which occur in real-world applications and also because authors make distinctions.

- Gradual changes (gradual drift, evolutionary changes, concept drift) [2, 10, 13, 16, 30]
- Substitutions (abrupt changes, revolutionary changes, concept substitution, concept shift) [10, 16, 33]
- Recurring trends (recurring contexts) [31, 33]
- Population drift [12]
3 A general framework for simulating changing environments

Every classification problem, however complex it might be, may be described completely by the following. Let x ∈ R^n be an object in the n-dimensional feature space of the problem and Ω = {ω_1, ..., ω_c} be the set of class labels. The knowledge of the prior probabilities for the classes, P(ω_i), and the class-conditional probability density functions (pdf) p(x|ω_i), i = 1, ..., c, determines completely the problem and the optimal (minimum error) classifier for it. Viewed in this probabilistic sense, a classification problem may change due to changes in the P(ω_i) and/or the p(x|ω_i). The posterior probabilities P(ω_i|x) and the unconditional pdf p(x) may also change, but these changes can be re-expressed in terms of P(ω_i) and p(x|ω_i).

Without loss of generality, assume that all data live in the n-dimensional real space, R^n. Consider a set of k data sources S_1, ..., S_k with known distributions. The distribution for source S_j is characterised by the class-conditional probability density functions p_j(x|ω_i) and the prior probabilities P_j(ω_i), i = 1, ..., c, for this source, j = 1, ..., k. At any time t we have one or more “active” data sources. Let a_j(t) ∈ [0, 1] specify the extent of the influence of data source S_j at time t. We shall treat the influences as mixing proportions and the resultant distribution as a mixture, i.e.

Σ_{j=1}^{k} a_j(t) = 1, for any t.

At time t the data distribution D^(t) is characterised by prior probabilities

P^(t)(ω_i) = Σ_{j=1}^{k} a_j(t) P_j(ω_i), i = 1, ..., c,

and class-conditional pdfs

p^(t)(x|ω_i) = Σ_{j=1}^{k} a_j(t) p_j(x|ω_i), i = 1, ..., c.

As the distributions at the sources are fixed, the data distribution at moment t, D^(t), is specified through a(t) = (a_1(t), ..., a_k(t)). Thus we can equivalently specify D^(t) by the vector of mixing proportions a(t).
To be able to cater for all possible scenarios of changes, we will allow the set of sources to be as large as necessary. For example, if 1000 time instances are needed in a population drift data set, the set of sources might need to be of cardinality 1000, one source for each time instance.
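To make the definitions concrete, the following minimal Python sketch (ours, for illustration only; it requires NumPy and SciPy, and all distribution parameters and function names are assumptions rather than part of the framework) sets up k = 2 sources with Gaussian class-conditional pdfs and evaluates the mixture priors and class-conditional pdfs at time t:

import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical example: k = 2 sources, c = 2 classes, n = 2 features.
# Each source j is defined by its priors P_j(w_i) and class-conditional pdfs p_j(x|w_i).
sources = [
    {"priors": [0.5, 0.5],
     "pdfs": [multivariate_normal(mean=[-1, 0]), multivariate_normal(mean=[1, 0])]},
    {"priors": [0.5, 0.5],
     "pdfs": [multivariate_normal(mean=[0, -1]), multivariate_normal(mean=[0, 1])]},
]

def mixing_proportions(t, T=100):
    """a(t) for a linear transition over T steps; the components sum to 1."""
    a2 = min(max(t / T, 0.0), 1.0)
    return np.array([1.0 - a2, a2])

def prior_t(t, i):
    """P^(t)(w_i) = sum_j a_j(t) P_j(w_i)."""
    a = mixing_proportions(t)
    return sum(a[j] * sources[j]["priors"][i] for j in range(len(sources)))

def class_conditional_t(t, i, x):
    """p^(t)(x|w_i) = sum_j a_j(t) p_j(x|w_i), mirroring the mixture above."""
    a = mixing_proportions(t)
    return sum(a[j] * sources[j]["pdfs"][i].pdf(x) for j in range(len(sources)))

print(prior_t(50, 0), class_conditional_t(50, 0, [0.0, 0.0]))

Any drift scenario then reduces to a choice of the schedule a(t); the sections below express the standard scenarios in exactly these terms.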
Figure 1. Mixing proportions a_1 and a_2 for a linear gradual change between two data sources (horizontal axis: time; vertical axis: mixing proportion, from 0 to 1).
4 Illustration of the types of changes within the proposed framework

4.1 Gradual drift

Two examples of gradual drift are given below. The first example uses only two sources, with a gradual change from one source to the other, as proposed by Widmer and Kubat [33] and Black and Hickey [2]. This is a scenario where one concept fades gradually while the other takes over. The real-world example given by Widmer and Kubat is that of a device which begins to malfunction. At first only a small number of data points will come from the stable failure state. Finally the failure will take over completely. Figure 1 shows a_1(t) and a_2(t) as functions of t for this case. Figure 2 (a) represents the starting data, where the distribution is equivalent to that of source 1, i.e.,
the class-conditional pdfs of source 1, defined piecewise: uniform over the respective class regions visible in Figure 2 (a), and zero elsewhere.
The end distribution is that of source 2, shown in Figure 2 (d). Subplots (b) and (c) show the transition of the data distribution for a(t) equal to (0.67, 0.33) and (0.33, 0.67), respectively.

The second example of gradual change is when there are more sources, so the mixture components at any time of the transition are not just the start and the end ones. This example mimics population drift, whereby the parameters of the distributions change gradually with time. An example from real life would be the change of the category “high risk” in loan approval. The criteria for a money lender to approve a loan evolve with time to respond to new socio-economic circumstances: gross annual salaries increase; the market is getting saturated with loan providers. All this will alter the class distributions, so class “high risk” from a few years ago will have a new profile now. To model gradual changes in the parameters of the distributions, we assume that there are as many data sources as there are time instances to be recorded in the transition. At time instance t, a(t) will be a set of binary mixing proportions, with a_j(t) = 1 for the data source corresponding to t, and a_j(t) = 0 for all other sources. To make the model plausible, the consecutive data sources should be closely related. For example, the data source for time t+1 can be obtained from the data source for time t by tweaking a parameter. In this example, the starting and the final distributions are the same as in the first example. However, at each time t only one source is sampled, so the two classes in the example remain separable at any transition time moment t. Figure 3 shows the scatterplot of the data. Note that this time there are four data sources, with their corresponding scatterplots shown in Figure 3 (a) to (d).

This second example of gradual change illustrates a popular artificial data type used by many authors to test their classification methods for changing environments. This data simulation idea is often called the moving hyperplane [7, 11, 21]. The proposed framework is meant to be a versatile tool to combine the scenarios of the two examples by letting many data sources be sampled at a time, with mixing proportions evolving with time. It is hoped that in this way the generated data will represent a more realistic model of real-life changing environments.
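For illustration, the two scenarios can be sketched as schedules of mixing proportions (our Python sketch; the rotating-hyperplane parameterisation below is one plausible choice of “tweaked parameter”, not necessarily the one used to produce Figure 3):

import numpy as np

def linear_fade(t, N):
    """First example (Figure 1): a_1 fades out linearly while a_2 fades in."""
    a2 = t / (N - 1)
    return np.array([1.0 - a2, a2])            # a(t) = (a_1(t), a_2(t)), sums to 1

def population_drift(t, N):
    """Second example: one source per time instance; binary mixing proportions."""
    a = np.zeros(N)
    a[t] = 1.0                                  # only source S_t is active at time t
    return a

def moving_hyperplane_source(t, N, rng):
    """Source S_t: labels assigned by a hyperplane whose angle changes with t."""
    theta = np.pi * t / (2 * (N - 1))           # normal rotates from 0 to 90 degrees
    w = np.array([np.cos(theta), np.sin(theta)])
    x = rng.uniform(-1, 1, size=2)
    y = int(x @ w >= 0)                         # class label: side of the hyperplane
    return x, y

rng = np.random.default_rng(0)
data = [moving_hyperplane_source(t, 1000, rng) for t in range(1000)]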
4.2 Substitutions

This is the type of change most often simulated in order to assess the performance of a classifier in a changing environment. The classical example is the STAGGER concept, first proposed by Schlimmer and Granger [28]. It has received the status of a benchmark artificial data set, being used by many authors since then [4, 7, 20–22, 33]. The feature space is described by three features (attributes): size {small (1), medium (2), large (3)}, color {red (a), green (b), blue (c)} and shape {square (A), circular (B), triangular (C)}. There are three data sources (called ‘target concepts’):

Target Concept 1: size = small AND color = red
Target Concept 2: color = green OR shape = circular
Target Concept 3: size = medium OR size = large
In their experiment, Widmer and Kubat [33] generate 120 training instances randomly (from a uniform distribution across the feature space) and label each instance according to the current target concept. After processing each instance, the predictive accuracy is tested on an independent test set of 100 instances, also generated randomly. One target concept is active at a time: Concept 1 is active from instance 1 to instance 40, Concept 2 from instance 41 to 80, and Concept 3 from instance 81 to 120. This example is modelled as follows. Consider three data sources, each related to one of the target concepts. The distributions at the three data sources are given below.
Figure 2. Scatterplots of the data illustrating a gradual (linear) change from the distribution of source 1 (a) to that of source 2 (d). Class ω_1 is depicted with grey dots, and class ω_2 with black dots (panels (a) to (d); both axes range from −1.5 to 1.5).
Figure 3. Scatterplots of the data illustrating a gradual (linear) change using four data sources. Class ω_1 is depicted with grey dots, and class ω_2 with black dots.
Data source 1.
p_1(x|ω_1) = 1/3 for instances 1aA, 1aB and 1aC, and 0 for all other instances;
p_1(x|ω_2) = 0 for instances 1aA, 1aB and 1aC, and 1/24 for all other instances.

Data source 2.
p_2(x|ω_1) = 1/15 for instances xby and xzB, where x is any of 1, 2, 3, y is any of A, B, C, and z is any of a, b, c, and 0 for all other instances;
p_2(x|ω_2) = 0 for instances xby and xzB, and 1/12 for all remaining instances.

Data source 3.
p_3(x|ω_1) = 1/18 for instances 2zy and 3zy, and 0 for all remaining instances;
p_3(x|ω_2) = 0 for instances 2zy and 3zy, and 1/9 for all remaining instances.

Since the instances are drawn uniformly from the 27 possibilities, the corresponding priors are P_1(ω_1) = 3/27, P_2(ω_1) = 15/27 and P_3(ω_1) = 18/27.

For the STAGGER data,
a_1(t) = 1 for 1 ≤ t ≤ 40, and 0 for any other t;
a_2(t) = 1 for 41 ≤ t ≤ 80, and 0 for any other t;
a_3(t) = 1 for 81 ≤ t ≤ 120, and 0 for any other t.

In the substitution scenario, usually one data source is instantly replaced by another, so the a(t) again contain binary mixing proportions, as in the second example in Section 4.1. This time, however, the subsequent data sources are assumed to be significantly different from one another.
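A minimal Python sketch of the STAGGER sources (ours; drawing an instance uniformly from the 27 possibilities and labelling it with the active target concept reproduces the priors and class-conditional pdfs listed above):

import random
from itertools import product

# The 27 instances of the blocks world: size in {1,2,3}, color in {a,b,c}, shape in {A,B,C}.
INSTANCES = [s + c + sh for s, c, sh in product("123", "abc", "ABC")]

# The three target concepts, one per data source.
CONCEPTS = [
    lambda s, c, sh: s == "1" and c == "a",   # size = small AND color = red
    lambda s, c, sh: c == "b" or sh == "B",   # color = green OR shape = circular
    lambda s, c, sh: s in ("2", "3"),         # size = medium OR size = large
]

def sample(j, rng=random):
    """Sample once from data source S_j: uniform instance, label 1 iff it matches concept j."""
    x = rng.choice(INSTANCES)
    y = int(CONCEPTS[j](x[0], x[1], x[2]))    # y = 1 corresponds to class w_1
    return x, y

def a(t):
    """Binary mixing proportions for the STAGGER schedule (t = 1, ..., 120)."""
    if t <= 40:
        return (1, 0, 0)
    elif t <= 80:
        return (0, 1, 0)
    return (0, 0, 1)

stream = [sample(a(t).index(1)) for t in range(1, 121)]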
4.3 Recurring trends

In many applications contexts tend to recur, either cyclically or in an unordered fashion. Biological and economic systems tend to go through cycles of development with recurring patterns [33]. Recurring contexts may be due to irregular phenomena such as inflation rates or market mood [9], or due to cyclic phenomena such as seasonal changes. For example, Tsymbal et al. [31] deal with antibiotic resistance data for nosocomial infections. They discovered a seasonal context recurring with winter and spring models, corresponding to yearly winter and spring infection outbreaks. With respect to our framework, the phenomenon of a periodically recurring context can be expressed as

D^(t) = D^(t+T),

where T is the period of repetition.
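Within the framework, a recurring context is just a periodic schedule of mixing proportions; a minimal sketch (ours), reusing any previously defined schedule:

def recurring(a_base, T):
    """Make any mixing-proportion schedule a_base(t), t = 1, 2, ..., periodic with period T."""
    return lambda t: a_base(((t - 1) % T) + 1)

# Example: repeating the STAGGER schedule with period T = 120 gives
# a_rec(121) == a_base(1), i.e. D^(t) = D^(t+T).
# a_rec = recurring(a, 120)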
5 Implementation algorithm

Figure 4 gives an algorithm which implements the proposed framework. MATLAB code is available at http://www.informatics.bangor.ac.uk/~kuncheva/activities/epsrc2006/changing_environments.html.
Input:
N: the number of time instances required (= the number of data points);
k: the number of data sources;
sample(j): a user-defined function which samples once from data source S_j;
A: an array with the mixing proportions (column t contains a(t)).

Data generation:
Initialise the data set.
For t = 1, ..., N:
(1) j = sample once from {1, ..., k} with distribution defined by a(t);
(2) Sample a new data point from source S_j and add it to the current set (the class label is also added).
End
Return the data set.

Figure 4. Algorithm: Generating data to simulate changing environments
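A direct Python transcription of the algorithm in Figure 4 might look as follows (our sketch, not the authors' MATLAB implementation; the names generate and sample are ours):

import numpy as np

def generate(N, k, sample, A, seed=None):
    """Generate N labelled points: at each time t, pick a source j with
    probabilities given by column t of A (a k x N NumPy array whose columns
    sum to 1), then draw one labelled point from that source."""
    rng = np.random.default_rng(seed)
    Z = []
    for t in range(N):
        j = rng.choice(k, p=A[:, t])    # step (1): draw a source index from a(t)
        Z.append(sample(j))             # step (2): draw a labelled point from S_j
    return Z

For example, a k x N array built column by column from the linear fade of Section 4.1 would reproduce gradual-drift data in the style of Figure 2.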
6 Conclusion

We propose a general framework to generate data simulating changing environments. The framework is meant to channel the effort in developing a test-bed of artificial data to train and test classifiers under various modes of changing environments. The framework accommodates the currently favourite STAGGER and moving hyperplane generation strategies and offers a versatile approach towards devising new strategies. We show that the framework covers the three main types of changes formulated in the literature: gradual changes, substitutions and systematic (recurrent) trends. These can all be simulated individually or in combination. An implementation algorithm is detailed as well.

Our future work includes simulating changes on real data. The current state of the art is relatively poor in this respect, revolving around dropping classes and adding data from new classes [15, 26] or tweaking the class labels [2]. One possible way forward is to use one of the features as a hidden context and sort the data set by that feature to simulate a trend. For now we have looked into generating artificial data, but the main focus must remain on finding real data sets which exhibit a clear case of changing environments. Many such data sets, e.g. in finance, are not freely available as benchmarks due to privacy limitations. Given the pressing need for new tools and methods for problems with changing environments, it will not be long before a valuable benchmark of such data becomes publicly available, following in the footsteps of the UCI ML repository [3] and the UCI KDD archive [1].

Acknowledgments

This work was supported by EPSRC grant EP/D04040X/1.
References
[1] S. D. Bay, D. Kibler, M. J. Pazzani, and P. Smyth. The UCI KDD archive of large data sets for data mining research and experimentation. SIGKDD Explorations, 2(2):14–18, 2000.
[2] M. M. Black and R. J. Hickey. Maintaining the performance of a learned classifier under concept drift. Intelligent Data Analysis, 3(6):453–474, 1999.
[3] C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.
[4] S. J. Delany, P. Cunningham, and A. Tsymbal. A comparison of ensemble and case-based maintenance techniques for handling concept drift in spam filtering. Technical Report TCD-CS-2005-19, Trinity College Dublin, 2005.
[5] P. Domingos and G. Hulten. Mining high-speed data streams. In Knowledge Discovery and Data Mining, pages 71–80, 2000.
[6] P. Domingos and G. Hulten. A general framework for mining massive data streams. Journal of Computational and Graphical Statistics, 12:945–949, 2003.

[7] F. Ferrer-Troyano, J. S. Aguilar-Ruiz, and J. C. Riquelme. Incremental rule learning based on example nearness from numerical data streams. In 2005 ACM Symposium on Applied Computing, SAC 05, Santa Fe, New Mexico, USA, 2005. ACM Press.

[8] M. Harries and K. Horn. Detecting concept drift in financial time series prediction using symbolic machine learning. In Eighth Australian Joint Conference on Artificial Intelligence, pages 91–98, Singapore, 1995. World Scientific Publishing.

[9] M. Harries, C. Sammut, and K. Horn. Extracting hidden context. Machine Learning, 32:101–126, 1998.

[10] R. J. Hickey and M. M. Black. Refined time stamps for concept drift detection during mining for classification rules. In Proceedings of the International Workshop on Temporal, Spatial and Spatio-Temporal Data Mining (TSDM2000), Lecture Notes in Artificial Intelligence, volume 2007, pages 20–30. Springer-Verlag: Berlin, 2000.

[11] G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In Proc. 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 97–106. ACM Press, 2001.

[12] M. G. Kelly, D. J. Hand, and N. M. Adams. The impact of changing populations on classifier performance. In KDD '99: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 367–371, New York, NY, USA, 1999. ACM Press.

[13] M. Klenner and U. Hahn. Concept versioning: A methodology for tracking evolutionary concept drift in dynamic concept systems. In 11th European Conference on Artificial Intelligence, pages 473–477, Chichester, England, 1994. Wiley.

[14] R. Klinkenberg. Learning drifting concepts with partial user feedback. In P. Perner and V. Fink, editors, Beiträge zum Treffen der GI-Fachgruppe 1.1.3 Maschinelles Lernen (FGML-99), Magdeburg, Germany, 1999.

[15] R. Klinkenberg. Using labeled and unlabeled data to learn drifting concepts. In Workshop notes of the IJCAI-01 Workshop on Learning from Temporal and Spatial Data, pages 16–24, Menlo Park, CA, USA, 2001. IJCAI, AAAI Press.

[16] R. Klinkenberg. Learning drifting concepts: Example selection vs. example weighting. Intelligent Data Analysis, 8(3), 2004.

[17] R. Klinkenberg and T. Joachims. Detecting concept drift with support vector machines. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML), pages 487–494, San Francisco, CA, USA, 2000. Morgan Kaufmann.

[18] R. Klinkenberg and I. Renz. Adaptive information filtering: Learning in the presence of concept drifts. In AAAI-98/ICML-98 workshop Learning for Text Categorization, Menlo Park, CA, 1998.

[19] R. Klinkenberg and S. Rüping. Concept drift and the importance of examples. In Text Mining – Theoretical Aspects and Applications, pages 55–77. Physica-Verlag, Berlin, Germany, 2003.

[20] J. Kolter. Using additive expert ensembles to cope with concept drift. Senior thesis, April 2005.

[21] J. Z. Kolter and M. A. Maloof. Dynamic weighted majority: A new ensemble method for tracking concept drift. In Proc. 3rd International IEEE Conference on Data Mining, pages 123–130, Los Alamitos, CA, 2003. IEEE Press.

[22] L. I. Kuncheva. Classifier ensembles for changing environments. In 5th International Workshop on Multiple Classifier Systems, MCS 04, LNCS, volume 3077, pages 1–15. Springer-Verlag, 2004.

[23] H. K. H. Lee and M. A. Clyde. Lossless online Bayesian bagging. Journal of Machine Learning Research, 5:143–151, 2004.

[24] N. C. Oza. Online Ensemble Learning. PhD thesis, University of California, Berkeley, 2001.

[25] N. C. Oza and S. Russell. Online bagging and boosting. In Eighth International Workshop on Artificial Intelligence and Statistics, Key West, Florida, 2001.

[26] R. Polikar, L. Udpa, S. S. Udpa, and V. Honavar. Learn++: An incremental learning algorithm for supervised neural networks. IEEE Transactions on Systems, Man and Cybernetics, 31(4):497–508, 2001.

[27] S. Salzberg. On comparing classifiers: A critique of current research and methods. Data Mining and Knowledge Discovery, 1:1–12, 1999.

[28] J. C. Schlimmer and R. H. Granger. Incremental learning from noisy data. Machine Learning, 1(3):317–354, 1986.

[29] W. N. Street and Y. S. Kim. A streaming ensemble algorithm (SEA) for large-scale classification. In Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 377–382. ACM Press, 2001.

[30] A. Tsymbal. The problem of concept drift: definitions and related work. Technical Report TCD-CS-2004-15, Trinity College Dublin, April 2004.

[31] A. Tsymbal, M. Pechenizkiy, P. Cunningham, and S. Puuronen. Handling local concept drift with dynamic integration of classifiers: domain of antibiotic resistance in nosocomial infections. In Proc. 19th IEEE Int. Symposium on Computer-Based Medical Systems CBMS 2006, Salt Lake City, UT, June 2006. IEEE CS Press.

[32] H. Wang, W. Fan, P. S. Yu, and J. Han. Mining concept drifting data streams using ensemble classifiers. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 226–235. ACM Press, 2003.

[33] G. Widmer and M. Kubat. Learning in the presence of concept drift and hidden contexts. Machine Learning, 23:69–101, 1996.

[34] G. Widmer and M. Kubat. Special issue on context sensitivity and concept drift. Machine Learning, 32, 1998.