Incremental Classification of Nonstationary Data Streams
Lior Cohen, Gil Avrahami, Mark Last
Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel
Email: {clior,gilav,mlast}@bgu.ac.il
Abraham Kandel
Dept. of Computer Science and Engineering, University of South Florida, Tampa, Florida 33620, USA
Email: [email protected]
Oscar Kipersztok
Boeing Phantom Works, Mathematics & Computing Technology, P.O. Box 3707 MC 7L44, Seattle, WA 98124-2207, USA
Email: [email protected]
Abstract. Traditional data mining algorithms used in the process of knowledge discovery are based on the assumption that the training data is a random sample drawn from a stationary distribution. In contrast, when dealing with non-stationary data sets, most of the processes generating time-stamped data may change drastically over time. Therefore, the results of a data-mining algorithm run in the past will hardly be relevant to future data. In this paper, we present two new incremental approaches to real-time classification of continuous non-stationary data. The proposed approaches apply the data-mining algorithm repeatedly to a sliding window of examples in order to update the existing model, replace it with a former model, or construct a new model if a major concept drift is detected. The algorithms are evaluated on multi-year real-world streams of traffic data vs. the CVFDT online learning algorithm.
Keywords: Real-time data mining, data streams, incremental learning, online learning, concept drift, information networks.
1. Introduction
In recent years, there has been an ongoing demand for systems capable of processing massive and continuous streams of information. Such systems can be used in temperature monitoring, precision agriculture, urban traffic control, classification of stock data, etc. The most important requirement for such systems is the ability to deal with the noise, uncertainty, and asynchrony of real-world data [2]. The complex nature of real-world data has increased the difficulties and challenges of data mining in terms of data processing, data storage, and model storage requirements [7]. One of the main difficulties in mining non-stationary continuous data streams is to cope with the changing data concept. The fundamental processes generating most real-time data streams may change over years, months, and even seconds, at times drastically. This change, also known as concept drift [3], causes the data-mining model generated from past data to become less accurate in the classification of new data. Algorithms and methods that extract patterns from continuous data streams are known as online real-time learning [10]. Real-time data mining of high-speed and non-stationary data streams has large potential in fields such as efficient operation of machinery and vehicles, exploring ecosystem dynamics, and intrusion detection in computer networks. This paper continues our previous work on real-time data mining [8], [9], where we presented the Basic Incremental On-Line Information Network (IOLIN) algorithm, leaving the issue of a better trade-off between accuracy and run time open for future research. Here we suggest two alternative algorithms for online incremental learning of non-stationary data streams: the Multiple-Model IOLIN and the Advanced IOLIN, both aimed at maximizing the average classification rate (in records per second) while preserving the accuracy rate of the regenerative approach.
The two proposed algorithms are evaluated in comparison to the On-Line Information Network (OLIN) algorithm presented by Last in [11] and the more recent Incremental On Line Information Network (IOLIN) algorithm [8], [9]. All these incremental methods are based on the batch information-theoretic algorithm, named Information Network (IN), which
was introduced by Last & Maimon [12], [13]. The general idea of the new incremental algorithms is to keep updating an existing model as long as no major concept drift is detected in the arriving data. While most of the existing online algorithms (including OLIN [11]) that deal with the problem of concept drift build a new model from every new window of training examples, the proposed incremental techniques save expensive CPU time by performing minimal changes in the structure of the current classification model. We also compare the performance of our incremental methods to the CVFDT algorithm [4], which is likewise designed for mining non-stationary data streams.
2. Related work
In the case of non-stationary data streams, the optimal situation is to have KDD systems that operate continuously, constantly processing the data received so that potentially valuable information is never lost. In order to achieve this goal, several methods for extracting patterns from non-stationary streams of data have been developed, all under the general title of online (incremental) learning methods. Pure incremental learning methods take into account every new instance that arrives; such algorithms may be impractical when dealing with high-speed data streams. Widmer & Kubat [5] have described a series of purely incremental learning algorithms that flexibly react to concept drift and can take advantage of situations where context repeats itself. Their series of algorithms is based on a framework called FLORA. FLORA maintains a dynamically adjustable window during the learning process: whenever a concept drift seems to occur (a drop in predictive accuracy), the window shrinks (forgets old instances); when the concept seems to be stable, the window is kept fixed; otherwise, the window keeps growing until the concept seems to be stable. FLORA is a computationally expensive methodology, since it updates the classification model with every example added to or removed from the training window. Domingos & Hulten [14] have proposed the VFDT (Very Fast Decision Trees learner) system in order to overcome the long training time of the pure incremental algorithms. The VFDT system is based on a decision tree learning method, which builds the trees based on sub-sampling of a stationary data stream. To deal with non-stationary data streams, Hulten et al. [4] have proposed an improvement to the VFDT algorithm, the CVFDT (Concept-adapting Very Fast Decision Tree learner).
CVFDT applies the VFDT algorithm to a sliding window of a fixed size and builds the model in an incremental manner instead of building it from scratch whenever a new set of examples arrives. CVFDT increases the computational overhead vs. VFDT by growing alternate sub-trees at its internal nodes. The model is modified when an alternate sub-tree becomes more accurate than the original one. In [1], the authors claim that methods such as [4], [14], which use a one-pass approach on the data in order to create mining models, are limited. The problem with the one-pass approach is that it prevents the recognition of changes in the mining model while the data stream is being processed. Aggarwal et al. [1] suggested a framework for the classification of non-stationary data streams. The methodology proposed in their paper uses the notion of micro-clusters that are created only from the training data. Each micro-cluster is associated with a set of points, all of which belong to the same class. The maintenance procedure of the micro-clusters derives from the nearest-neighbor and k-means algorithms, merging multiple clusters into one cluster when needed. In the classification process, a test record is assigned to a certain micro-cluster according to its similarity to the cluster's centroid. Last in [11] describes an online classification system that uses an information network (IN) as a classification model. The system, called OLIN (On-Line Information Network), gets a continuous stream of non-stationary data and builds a network based on a sliding window of the latest examples. OLIN detects a concept drift (an unexpected rise in the classification error rate) and dynamically adjusts the size of the training window and, accordingly, the frequency of the model reconstruction. The calculations of the window size in OLIN are based on information-theoretic and statistical measures.
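The dynamic-windowing behavior shared by FLORA and OLIN (shrink on suspected drift, freeze when stable, grow otherwise) can be illustrated with a toy sketch. The accuracy thresholds, the shrink factor, and the fixed-size cap below are hypothetical illustrations, not the actual heuristics of either system:

```python
# A minimal sketch of FLORA/OLIN-style adaptive windowing.
# All numeric thresholds here are illustrative assumptions.

def adjust_window(window, new_example, accuracy,
                  stable_acc=0.9, drift_acc=0.7, max_size=100):
    """Grow, freeze, or shrink the training window based on accuracy."""
    window.append(new_example)
    if accuracy < drift_acc:
        # Suspected concept drift: forget the older half of the window.
        del window[: len(window) // 2]
    elif accuracy >= stable_acc:
        # Stable concept: keep the window at a fixed maximum size.
        if len(window) > max_size:
            del window[0]
    # Otherwise: let the window keep growing.
    return window
```

The point of the sketch is the three-way policy, which is what distinguishes these methods from fixed-window learners such as CVFDT.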
The experimental results of [11] show that in non-stationary data streams, dynamic windowing generates more accurate models than the static (fixed-size) windowing approach used by CVFDT. Cohen et al. [8], [9] extended the ideas suggested in [11] in order to enhance the performance of the OLIN system. The newly developed approach, called IOLIN (Incremental On-Line Information Network), also deals with a continuous non-stationary data stream, but instead of creating a new model for each sliding window of examples, the system updates the current model according to the new data. In the experiments of [8] and [9], the IOLIN version significantly outperformed OLIN with regard to run time, at the expense of a minor drop in the accuracy rate.
3. The Info-Fuzzy Algorithm
Many learning methods use information theory to induce classification models. One such method, developed by Last & Maimon [12], [13], is the Info-Fuzzy Network algorithm (also known as Information Network, IN). IN is an oblivious tree-like classification model, designed to minimize the total number of predicting attributes. The underlying principle of the IN-based methodology is to construct a multi-layered network that tests the Mutual Information (MI) between the input and target attributes. Each hidden layer is related to a specific input attribute and represents the interaction between this input attribute and those associated with previous layers. The IN algorithm uses a pre-pruning strategy: a node is split only if the split brings about a statistically significant decrease in the conditional entropy of the target attribute (equivalent to an increase in the mutual information). If none of the remaining input attributes provides a statistically significant increase in the mutual information, the network construction stops. The output of this algorithm is a network, which can be used as a decision tree to predict the values of the target attribute. Fig. 1 illustrates a sample structure of an information network.

Fig. 1. Info-Fuzzy Network Two-Layered Structure [13] (figure omitted: a root node at layer 0, two hidden layers for the first and second input attributes, connection weights, and a target layer)
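The pre-pruning criterion above (split only on a statistically significant gain in mutual information) can be sketched as follows. The likelihood-ratio G-statistic formulation and the caller-supplied chi-square critical value are our illustrative choices, not necessarily the exact significance test used in [12], [13]:

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Estimated mutual information (in nats) between a candidate
    attribute X and the target Y, from a list of (x, y) samples."""
    n = len(pairs)
    px, py, pxy = Counter(), Counter(), Counter(pairs)
    for x, y in pairs:
        px[x] += 1
        py[y] += 1
    mi = 0.0
    for (x, y), c in pxy.items():
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi

def split_is_significant(pairs, critical_value):
    """Pre-pruning test: under independence the likelihood-ratio
    statistic G = 2*N*MI is approximately chi-square distributed;
    split only if G exceeds a chi-square critical value supplied
    by the caller (e.g. 3.84 for 1 degree of freedom at 0.05)."""
    return 2 * len(pairs) * mutual_information(pairs) > critical_value
```

A perfectly dependent attribute passes the test, while an attribute independent of the target has MI near zero and is rejected, which is how the network construction stops.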
4. Incremental On-Line Information Network
The OLIN online classification algorithm [11] deals with the potential concept drift in a non-stationary data stream simply by generating a new model for every new sliding window. On one hand, this regenerative approach ensures accurate and relevant models over time and therefore an increase in the classification accuracy. On the other hand, OLIN's major shortcoming is the high computational cost of generating new models. In this section, we present three approaches for incremental learning, which are aimed at overcoming the high computational cost of generating new models. As shown in the evaluation section below, the incremental methods achieve almost the same, and sometimes even higher, accuracy rates than the regenerative algorithm and are significantly cheaper, since there is no need to produce a new model for every new window of examples.
4.1. Basic Incremental Approach
We start this section with an overview of the Basic IOLIN algorithm introduced by us in [8]. In general, the basic incremental method updates the current model as long as the concept is stable. If a concept drift has been detected, the algorithm creates a completely new network. The operations for updating the network include checking split validity, examining the replacement of the last layer, and adding new layers as necessary. Each of the update operations is aimed at reducing the error rate of the existing model when using it for classifying new instances. The first procedure, check splits validity, eliminates nodes that are no longer relevant to the model according to the current training window. The split validity checks start from the root and proceed downwards through the network. The general idea is to ensure that the current split of a specific node actually contributes to the mutual information calculated from the current training set. Elimination of non-relevant splits should decrease the error rate.
From experiments conducted on the regenerative OLIN [11], we discovered that when there is no major concept drift between two adjacent windows, the main differences between the newly constructed model and the current one are in the last hidden layer of the information network. For this reason, the second procedure, check the replacement of the last layer, determines which attribute is more appropriate to correspond to the last (final) hidden layer. The alternatives considered are the attribute that is already associated with the last layer and the second-best attribute for the last layer of the previous model. The attribute selected for the last layer of the new model is the one with the highest conditional mutual information based on the current training window. The last procedure, add new layers as necessary, attempts to split the nodes of the last layer on attributes that are not yet participating in the updated network. The basic intuition behind the incremental approach is to update the current classification model with the current training window concept as long as no major concept drift has been detected by a statistically significant drop in predictive accuracy, and to build a new model in case of a major concept drift. Actually, the regenerative approach can be viewed as a special case of the incremental approach: if a major concept drift is detected at each arrival of new data, the incremental approach is identical to the regenerative one. Thus, the incremental approach becomes useful only when the concept stays stable between at least one pair of adjacent windows. This condition can be expected to exist in most
data streams, especially those where the size of the training window is relatively small. Following is the pseudo-code outline of the Basic Incremental OLIN (Basic IOLIN) algorithm:

Basic IOLIN
Input: Training window, current network model
Output: Updated or re-generated network model
For each new training window
    If concept drift is detected
        Create a new network using the Info-Fuzzy algorithm
    Else Update the existing network as follows:
        1. Eliminate non-relevant nodes.
        2. Replace the last layer with a better one.
        3. Add new layers if possible.

Determining the size of the training window is extensively discussed in [11]. In general, the window is expanded if the concept appears to be stable and reduced if a concept drift has been detected. A concept drift is detected whenever there is a sudden fall in the accuracy rate. The assumption is that if the concept is stable, the examples in the training window and in the subsequent validation interval conform to the same distribution, so there is no statistically significant difference between the model's training and validation error rates. Another assumption is that the true classification of the incoming records becomes available after a short while. This true classification can then be used to start the model update or the construction of a new predictive model. If, however, the true classification is not available from the monitored system, we can keep the current network. Another alternative is to use unsupervised learning methods to detect changes in the distribution of unlabelled data and update the current network accordingly. Zeira et al. [6] present a methodology for change detection and segmentation based on a set of statistical estimators; it detects statistically significant changes in non-stationary continuous data streams. The time complexity of the suggested basic incremental algorithm depends on the number of detected concept drifts.
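The drift-detection condition above (a statistically significant gap between training and validation error rates) could be approximated, for example, by a one-sided two-proportion z-test. The test form and the 1.96 threshold are illustrative assumptions rather than the exact statistic used in [11]:

```python
import math

def drift_detected(train_err, val_err, n_train, n_val, z_crit=1.96):
    """Flag a concept drift when the validation error rate is
    significantly higher than the training error rate.
    train_err/val_err are error proportions; n_train/n_val are the
    numbers of training and validation examples."""
    # Pooled error proportion under the no-drift hypothesis.
    p = (train_err * n_train + val_err * n_val) / (n_train + n_val)
    se = math.sqrt(p * (1 - p) * (1 / n_train + 1 / n_val))
    if se == 0:
        return False
    return (val_err - train_err) / se > z_crit
```

A jump from 10% training error to 30% validation error on realistic window sizes triggers the flag, while a 1% fluctuation does not, which matches the "sudden fall in the accuracy rate" criterion.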
As long as no concept drift has been detected, the major computational cost is adding a new layer to the existing network. Checking the split validity of the nodes in the network is relatively cheap (O(n), where n is the number of nodes in the network). In case of a concept drift, a new network has to be constructed from scratch; the cost of the network construction procedure is linear in the number of records in a training window, linear in the number of distinct attribute values, and quadratic in the number of candidate input attributes [12].
4.2. Multiple-Model IOLIN
The first new incremental approach introduced in this paper also updates the current model as long as the concept is stable. However, when a concept drift is detected for the first time, the algorithm searches all the past networks for the best model representing the current data. The idea is to utilize potential periodicity in the data by reusing previously created models rather than generating a new model with every concept drift, as done by the Basic IOLIN approach. The search is made as follows: when a concept drift is detected, the algorithm calculates the target attribute's entropy and compares it to the target's entropy values of each previous training window where a completely new network was constructed. The previous network chosen is the one built from the training window whose target entropy is closest to the current target's entropy. The chosen network is used for the classification of the examples arriving in the next validation session. If the concept drift re-occurs, a new network is created as under the Basic IOLIN approach.
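The entropy-based search over past networks can be sketched as follows. The `past_models` mapping from a model identifier to the target entropy recorded when that network was built is a hypothetical data structure used for illustration:

```python
import math
from collections import Counter

def target_entropy(labels):
    """Shannon entropy (in bits) of the target attribute in a window."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(labels).values())

def choose_past_model(current_labels, past_models):
    """Pick the past network whose training window had the target
    entropy closest to the current window's target entropy, i.e.
    the network minimizing |E_current - E_former|."""
    e_now = target_entropy(current_labels)
    return min(past_models, key=lambda m: abs(e_now - past_models[m]))
```

This is the whole selection step: one entropy computation per drift plus a linear scan over the stored models, which is far cheaper than rebuilding a network from scratch.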
Following is the pseudo-code outline of the Multiple-Model IOLIN algorithm:

Multiple-Model IOLIN
Input: Training window T, current network model, past network models
Output: Updated / Former / New network model
For each new training window
    If concept drift is detected
        Search for the best network from the past by:
            1. Comparing the value of the target's entropy (based on the current window's data) to the entropy values in all windows where a totally new network was built.
            2. Choosing the network with min |Ecurrent(T) - Eformer(T)|.
            3. Adding new layers if possible.
        If a concept drift is detected again with the chosen network
            Create a totally new network using the Info-Fuzzy algorithm
        Else update the existing network (see below)
    Else Update the existing network as follows:
        1. Eliminate non-relevant nodes.
        2. Replace the last layer with a better one.
        3. Add new layers if possible.

In addition to the use of former models in the presence of a concept drift, we also examined another variation of the multiple-model approach, which uses former models even when the concept is stable. This means that when the concept is stable, a former model is used instead of updating the current model. With this approach, we expect to save additional run time, because an appropriate model will probably be found among the adjacent former windows. This approach will be referred to as the pure multiple-model approach.
4.3. Advanced IOLIN
In the second proposed approach, we enhance the update operations of the incremental algorithms by maintaining the information-theoretic quality of the current model regardless of its predictive accuracy on the new data. We recalculate the conditional mutual information of each layer using a top-down approach: we start with the first layer and replace the input attributes in that layer and all subsequent layers in case of a significant decrease in the layer's conditional mutual information with the target attribute. A reduction in the mutual information of the first layer will trigger a complete re-construction of the model. The current model will be retained only if there is no significant decrease in mutual information at any layer. In that case, we will try to add new layers representing attributes that are not yet participating in the network. This approach does not relate directly to the issue of concept drift detection: the initial network is constantly updated, and the presence of a concept drift can be inferred when all the network layers have been replaced.
For each session, the conditional mutual information (MI) value of each layer is saved. In the model's update process, a comparison is made between the former MI value of the i-th layer and its current MI value. If the current value is nearly as high as the former one (up to a 5% difference), the i-th layer is kept as is. If the current MI value is significantly lower, the i-th layer is replaced with a new one. Following is the pseudo-code outline of the Advanced IOLIN algorithm:

Advanced IOLIN
Input: Training window, current network model, conditional mutual information of each network layer i (Former_Conditional_MI(i))
Output: Updated network model
For each new training window
    For each layer i in the existing network
        Calculate Cond_MI(i)
        If Cond_MI(i) >= Former_Conditional_MI(i) * 95%
            Keep the i-th layer as is and move to the next layer
        Else
            Continue the network construction by adding new layers
        Save value: Former_Conditional_MI(i) = Cond_MI(i)
    If reached the last layer
        Try adding new layers to the network

The time complexity of the advanced approach depends on the number of layers that have been replaced. In the case of a layer replacement, the algorithm searches for the best input attribute among the input candidates. This search costs O(n), where n is the number of candidate input attributes. Replacing all the layers, which is actually the construction of a totally new network, costs O(n*l), where l is the number of network layers.
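The top-down layer check with the 5% tolerance can be sketched as follows; this simplified view only counts how many top layers survive, leaving the actual rebuilding of the failing layers to the Info-Fuzzy construction procedure:

```python
def layers_to_keep(former_mi, current_mi, tolerance=0.95):
    """Top-down layer check of Advanced IOLIN: keep the i-th layer
    while its current conditional MI is at least 95% of the stored
    value; the first failing layer and all layers below it are
    rebuilt.  Returns the number of top layers retained."""
    keep = 0
    for old, new in zip(former_mi, current_mi):
        if new >= old * tolerance:
            keep += 1
        else:
            break
    return keep
```

A return value of 0 corresponds to a drop in the first layer's MI, i.e. a complete reconstruction of the model, while a full count means the model is retained and the algorithm only tries to append new layers.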
5. Evaluation
The two proposed incremental algorithms were applied to a real-world data stream from the traffic control domain. We have also applied the Basic IOLIN [8], [9], the pure multiple-model approach (see above), and the CVFDT algorithm [15] to the same data. The objective of the experiments was to simulate real-time prediction of the traffic volume at all lanes of a signaled urban junction based on several input attributes, such as the hour of the day, the weekday, the previous volume on the same weekday last week, etc. We have examined two aspects: first, we compared the predictive accuracy of the algorithms, and second, we compared their processing rate (in records per second). From the simulation results and the classification speed (as presented in the results sub-section), we can safely assume that the incremental algorithms are capable of simultaneously processing data from a large number of sources. Furthermore, the predicted traffic volumes in each direction can be used to improve the traffic plan for the monitored intersection and consequently reduce the average waiting time of drivers at the junction.
5.1. Traffic Data Streams
The data acquisition and preprocessing of this dataset are extensively discussed in [8], [9], so here we describe them only in brief. The data streams we used include traffic flow information from under-road sensors at a signaled three-way junction where vehicles can cross the intersection in five different directions. The resulting five data streams related to these directions include the hourly incoming traffic volumes for 24 hours a day, seven days a week, during a period of more than 3 years. Each original data stream has been converted into a relational data set, where a record contains twelve candidate attributes representing the exact time (date, hour, day of week, etc.) when the traffic volume was measured and the traffic volumes at earlier points of time (the previous hour, the same hour of the previous day, etc.).
The target attribute represents the volume of traffic during a given hour. Since the target attribute is continuous, we manually discretized it into four equal-frequency intervals of traffic volume. As mentioned, the traffic data was partitioned into five separate data tables for the five directions. Each table, corresponding to a given direction, contained about 30,000 hourly records.
5.2. Results
The runs of the algorithms were carried out on a Pentium IV processor with 256 MB of RAM. In the experiments, the online learning of the traffic data starts after inducing the initial model from the first 500 records, which leaves the system about 30,000 records to process for each direction. Figures 2, 3 and 4 present the error rates (training and testing) and the processing times after applying the IOLIN-based algorithms and the CVFDT algorithm to the traffic data sets. Figure 2 shows that the IOLIN-based algorithms outperform the CVFDT algorithm in training accuracy. As expected, the Advanced IOLIN algorithm obtains the most accurate results due to its network update process, which updates each layer in the network if necessary.

Fig. 2. Training Error Rate (bar chart omitted; axes: Data Source D1-D5 vs. Error Rate; series: CVFDT, Basic Incremental, Multi-Model, Pure Multi-Model, Advanced Incremental)
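The equal-frequency discretization of the continuous target attribute mentioned in Section 5.1 can be sketched as follows; the cut-point scheme below is a simplified illustration, and the actual preprocessing in [8], [9] may differ in how ties and interval boundaries are handled:

```python
def equal_frequency_bins(values, k=4):
    """Discretize a continuous target (e.g. hourly traffic volume)
    into k equal-frequency intervals.  Returns the k-1 cut points
    and the bin index (0..k-1) assigned to each input value."""
    ordered = sorted(values)
    n = len(ordered)
    # Cut points at the 1/k, 2/k, ... quantile positions.
    cuts = [ordered[(i * n) // k] for i in range(1, k)]

    def bin_of(v):
        # Number of cut points the value reaches or exceeds.
        return sum(v >= c for c in cuts)

    return cuts, [bin_of(v) for v in values]
```

With four bins each interval holds roughly a quarter of the records, so the discretized target classes are balanced, which simplifies comparison of error rates across algorithms.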
Figure 3 shows the results on the testing data. Again, in most cases the IOLIN-based algorithms outperform the CVFDT algorithm in accuracy, with Advanced IOLIN being the most accurate algorithm on all five data streams.
Fig. 3. Testing Error Rate (bar chart omitted; axes: Data Source D1-D5 vs. Error Rate; series: CVFDT, Basic Incremental, Multi-Model, Pure Multi-Model, Advanced Incremental)
Figure 4 presents the run time results. The CVFDT algorithm consumes the least run time for processing the incoming examples. We assume that the reason is its single pass over the data, which significantly reduces the run time; in all IOLIN-based algorithms, the training windows overlap, so each example may be used several times. Figure 4 also shows that the fastest IOLIN-based algorithm is the Multiple-Model IOLIN. One can also notice that the Advanced IOLIN requires a much longer run time than the other incremental approaches, since its update process checks whether changes should be made in each layer. On one hand, this process makes the model more accurate with respect to the training window; on the other hand, the per-layer checks consume more processing time.
Fig. 4. Run Time (bar chart omitted; axes: Data Source D1-D5 vs. Run Time (sec); series: CVFDT, Basic Incremental, Multi-Model, Pure Multi-Model, Advanced Incremental)
6. Conclusions and Future Work
This paper has presented two new incremental approaches to real-time classification of continuous non-stationary data streams. The proposed approaches apply the classification algorithm repeatedly to a sliding window of examples in order to update the existing model (in the case of the Advanced IOLIN approach) or to replace it with a former model (in the case of the Multiple-Model approach). The efficiency of the proposed algorithms has been validated by experiments on real-world data streams. The experimental data included a set of five data streams recorded by traffic sensors over more than 3 years. We used the CVFDT algorithm of Hulten et al. (2001) [4] for the purpose of performance comparison. From the experimental results, we can conclude that on this dataset the CVFDT algorithm is preferable in terms of run time
but its accuracy rates are significantly lower. To improve the run time, we suggest as future work to consider a one-pass approach on the data when the concept appears to be stable; we assume that this will significantly reduce the processing time. In addition, future work includes evaluating the incremental classifiers on additional real-world data streams from other domains.
Acknowledgements. We would like to thank the Traffic Control Center of Jerusalem for granting us permission to use their traffic database. This work was partially supported under a research contract from the Israel Ministry of Defense and by the National Institute for Systems Test and Productivity at the University of South Florida under the USA Space and Naval Warfare Systems Command Grant No. N00039-01-1-2248.
References
[1] C. Aggarwal, J. Han, J. Wang, and P. S. Yu, "On Demand Classification of Data Streams", Proc. 2004 Int. Conf. on Knowledge Discovery and Data Mining (KDD'04), Seattle, WA, Aug. 2004.
[2] D. E. Culler and W. Hong, "Wireless Sensor Networks: Introduction", Communications of the ACM, Vol. 47, No. 6, pp. 30-33, 2004.
[3] D. P. Helmbold and P. M. Long, "Tracking Drifting Concepts by Minimizing Disagreements", Machine Learning, No. 14, pp. 27-45, 1994.
[4] G. Hulten, L. Spencer, and P. Domingos, "Mining Time-Changing Data Streams", Proc. of KDD 2001, pp. 97-106, ACM Press, 2001.
[5] G. Widmer and M. Kubat, "Learning in the Presence of Concept Drift and Hidden Contexts", Machine Learning, Vol. 23, No. 1, pp. 69-101, 1996.
[6] G. Zeira, O. Maimon, M. Last, and L. Rokach, "Change Detection in Classification Models Induced from Time Series Data", in Data Mining in Time Series Databases, M. Last, A. Kandel, and H. Bunke (Eds.), World Scientific, Series in Machine Perception and Artificial Intelligence, Vol. 57, pp. 101-125, 2004.
[7] H. Kargupta, "Distributed Data Mining for Sensor Networks", Tutorial, ECML/PKDD 2004.
[8] L. Cohen, M. Last, and G. Avrahami, "Incremental Info-Fuzzy Algorithm for Real Time Data Mining of Non-Stationary Data Streams", TDM Workshop, Brighton, UK, 2004.
[9] L. Cohen, G. Avrahami, M. Last, A. Kandel, and O. Kipersztok, "Real-Time Data Mining of Non-Stationary Data Streams from Sensor Networks", to appear in Information Fusion Journal, Special Issue on Information Fusion in Distributed Sensor Networks.
[10] M. Black and R. J. Hickey, "Maintaining the Performance of a Learned Classifier under Concept Drift", Intelligent Data Analysis, No. 3, pp. 453-474, 1999.
[11] M. Last, "Online Classification of Nonstationary Data Streams", Intelligent Data Analysis, Vol. 6, No. 2, pp. 129-147, 2002.
[12] M. Last and O. Maimon, "A Compact and Accurate Model for Classification", IEEE Transactions on Knowledge and Data Engineering, Vol. 16, No. 2, pp. 203-215, February 2004.
[13] O. Maimon and M. Last, Knowledge Discovery and Data Mining - The Info-Fuzzy Network (IFN) Methodology, Kluwer Academic Publishers, December 2000.
[14] P. Domingos and G. Hulten, "Mining High-Speed Data Streams", Proc. of KDD 2000, pp. 71-80, 2000.
[15] G. Hulten and P. Domingos, "VFML - A Toolkit for Mining High-Speed Time-Changing Data Streams", http://www.cs.washington.edu/dm/vfml/, 2003.