Concept Drifting Detection on Noisy Streaming Data in Random Ensemble Decision Trees

Peipei Li (1,2), Xuegang Hu (1), Qianhui Liang (2), and Yunjun Gao (2,3)

1 School of Computer Science and Information Technology, Hefei University of Technology, China, 230009
2 School of Information Systems, Singapore Management University, Singapore, 178902
3 College of Computer Science, Zhejiang University, China, 310027

Abstract. Although a vast number of inductive learning algorithms have been developed for handling concept drifting data streams, especially those built on ensemble classification models, few of them can detect the different types of concept drifts in noisy streaming data while keeping the overheads of time and space light. Motivated by this, a new classification algorithm for Concept drifting Detection based on an ensemble model of Random Decision Trees (called CDRDT) is proposed in this paper. Extensive studies with synthetic and real streaming data demonstrate that, in comparison to several representative classification algorithms for concept drifting data streams, CDRDT not only effectively and efficiently detects potential concept changes in noisy data streams, but also requires considerably less runtime and space while improving predictive accuracy. Thus, our proposed algorithm provides a significant reference for the light-weight classification of concept drifting data streams with noise.

Keywords: Data Streams, Ensemble Decision Trees, Concept Drift, Noise.

1 Introduction

As defined in [23], a data stream is an ordered sequence of tuples arriving at certain time intervals. Compared with traditional data sources, it presents various new characteristics: it is open-ended, continuous, and high-volume. Learning from such streaming data is therefore a challenge for most traditional inductive models and classification algorithms [18,19,9]. It is especially challenging when the streams exhibit concept drifts and noise contamination, as in real applications such as web search, online shopping, and stock markets. To handle these problems, numerous classification models and algorithms have been proposed. The representative ones are based on ensemble learning, including SEA [1], an early ensemble algorithm addressing concept drift in data streams; a general framework for mining concept-drifting data streams using weighted ensemble classifiers [2]; a discriminative model based on the EM framework for fast mining of noisy data streams [4]; decision tree algorithms for concept drifting data streams with noise [5,11]; and a boosting-like method for adapting to different kinds of concept drifts [6]. However, the algorithms referred to above have two main limitations. On one hand, little attention is paid to handling the various types of concept drifts in data streams affected by noise. On the other hand, they often demand heavy overheads of space and runtime without a prominent improvement in predictive accuracy. To address these issues, we present CDRDT, a light-weight ensemble classification algorithm for Concept Drifting data streams with noise. It is based on random decision trees evolved from the semi-random decision trees in [14]; namely, it adopts a random-selection strategy instead of a heuristic method to solve the split-test for nodes with numerical attributes. In comparison to other ensemble models of random decision trees for concept drifting data streams, CDRDT makes four significant contributions: i) the basic classifiers are constructed incrementally from small chunks of streaming data of various sizes; ii) the Hoeffding bound inequality [7] is adopted to specify two thresholds used in detecting concept drifts under noise, which helps distinguish the different types of concept drifts from noise; iii) the sizes of the data chunks are adjusted dynamically within bounded limits to adapt to the concept drifts, which avoids the disadvantages of too large or too small data chunks when detecting the data distribution, especially when classifying by majority class; iv) the effectiveness and efficiency of CDRDT in detecting concept drifts from noisy data streams are estimated and contrasted with other algorithms, including the state-of-the-art algorithm CVFDT [10] and the newer ensemble algorithm MSRT (Multiple Semi-Random decision Trees) [11] based on semi-random decision trees. The experimental results show that CDRDT runs with a light demand on time and space while achieving higher predictive accuracy.

The rest of the paper is organized as follows. Section 2 reviews related work on ensemble classifiers of random decision trees learning from concept drifting data streams. Our CDRDT algorithm for concept drifting detection from noisy data streams is described in detail in Section 3. Section 4 provides the experimental evaluation, and Section 5 concludes.

2 Related Work

Since the model of Random Decision Forests [12] was first proposed by Ho in 1995, the strategy of randomly selecting split-features has been widely applied in decision tree models, and many extended or new random decision trees have appeared, such as [24,25,17]. However, they are not suitable for handling data streams directly. Subsequently, a random decision tree ensembling method [3] for streaming data was proposed by Fan in 2004; it adopts cross-validation estimation for higher classification accuracy. Hu et al. designed an incremental algorithm of Semi-Random Multiple Decision Trees for Data Streams (SRMTDS) [14] in 2007, which uses the Hoeffding bound inequality with a heuristic method to implement the split-test. In the following year, the authors further introduced an extended algorithm, MSRT [11], to reduce the impact of noise on concept-drifting detection. In the same year, H. Abdulsalam et al. proposed Dynamic Streaming Random Forests [13], a stream-classification algorithm able to handle evolving data streams whose underlying class boundaries drift, using an entropy-based drift-detection technique. In contrast with the algorithms based on ensembles of decision trees mentioned above, our CDRDT classification algorithm for concept drifting data streams has four prominent characteristics. Firstly, the ensemble models of random decision trees, developed from semi-random decision trees, are generated incrementally from streaming data chunks of variable size. Secondly, to avoid over-sensitivity to concept drifts and to reduce noise contamination, two thresholds are specified from the Hoeffding bound inequality to partition the drifting bounds. Thirdly, the check period is adjusted dynamically to adapt to concept drifts. Lastly, CDRDT performs better in terms of space, time, and predictive accuracy.

3 Concept Drifting Detection Algorithm Based on Random Ensemble Decision Trees

3.1 Algorithm Description

The CDRDT classification algorithm proposed in this section detects concept drifts in data streams with noise. It first generates multiple random decision tree classifiers incrementally from variable-sized chunks of the data stream. After all streaming data in a chunk have been seen (i.e., the check period is reached), a concept drifting detection is run on the ensemble model. By means of thresholds pre-defined via the Hoeffding bound inequality, the difference of the average error rates of the Naïve Bayes or majority-class classification at the leaves is taken to measure the distribution changes of the streaming data, and the different types of concept drifts are further distinguished from noise. Once a concept drift is detected, the check period is adjusted correspondingly to adapt to it. Finally, majority-class voting or Naïve Bayes is utilized to classify the test instances. Generally, the process flow of CDRDT can be partitioned into three major components: i) the incremental generation of random decision trees in the function GenerateClassifier; ii) the concept drifting detection methods adopted in ComputeClassDistribution; iii) the adaptation strategies to concept drifts and noise in CheckConceptChange. The related details are illustrated below.

Ensemble Classifiers of Random Decision Trees. Differing from the previous algorithms in [11,14], on one hand, CDRDT utilizes streaming data chunks of various magnitudes to generate the ensemble classifiers of random decision trees.


Input: Training set: DSTR; Test set: DSTE; Attribute set: A; Initial height of tree: h0; Number of minimum split-examples: nmin; Split estimator function: H(·); Number of trees: N; Set of classifiers: CT; Memory constraint: MC; Check period: CP.
Output: The error rate of classification.

Procedure CDRDT(DSTR, DSTE, A, h0, nmin, H(·), N, CT, MC, CP)
1.  For each chunk Sj ∈ DSTR of the training data stream (|CP| = |Sj|, j ≥ 1)
2.    For each classifier CTk (1 ≤ k ≤ N)
3.      GenerateClassifier(CTk, Sj, MC, CP);
4.    If all streaming data in Sj have been observed
5.      averageError = ComputeClassDistribution();
6.      If the current chunk is the first one
7.        fError = averageError;
8.      Else
9.        sError = averageError;
10.     If (j ≥ 2)
11.       CheckConceptChange(fError, sError, CP, Sj);
12.       fError = sError;
13. For each test instance in DSTE
14.   For each classifier CTk
15.     Travel the tree of CTk from its root to a leaf;
16.     Classify with the method of majority class or Naïve Bayes in CTk;
17. Return the error rate of voting classification.
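To make the control flow above concrete, the following Python sketch mirrors the chunk-driven main loop and the random cut-point selection for numerical attributes described in the next paragraph. It is a minimal illustration under our own assumptions; the helper names (grow_incrementally, ensemble_average_error, check_concept_change, majority_vote) are hypothetical stand-ins for the paper's functions, not an implementation of them.

import random

def random_cut_point(sorted_values):
    # Pick one discretization interval from the ordered values of a
    # numerical attribute at random; the interval mean is the cut-point.
    # Assumes at least two values are available.
    i = random.randrange(len(sorted_values) - 1)
    return (sorted_values[i] + sorted_values[i + 1]) / 2.0

def cdrdt_main_loop(chunks, trees, test_set):
    # chunks: iterable of training-data chunks (check period = chunk size);
    # trees: the N incremental random decision trees of the ensemble.
    f_error = None
    for j, chunk in enumerate(chunks, start=1):
        for tree in trees:
            tree.grow_incrementally(chunk)           # GenerateClassifier
        avg_error = ensemble_average_error(trees)    # ComputeClassDistribution
        if j == 1:
            f_error = avg_error
        else:
            s_error = avg_error
            check_concept_change(f_error, s_error)   # CheckConceptChange
            f_error = s_error
    # Classify each test instance by majority vote over the ensemble.
    mistakes = sum(1 for x, y in test_set if majority_vote(trees, x) != y)
    return mistakes / len(test_set)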

Here, random indicates that the split-test method adopted in our algorithm randomly selects an index among the discretization intervals formed by the ordered values of a numerical attribute and sets the mean value of that interval as the cut-point. On the other hand, a node with a discrete attribute is not split until the count of collected instances meets a specified threshold (the default value is initialized to two). The remaining details of tree growth are similar to the descriptions in [11,14].

Concept Drifting Detection. In this subsection, we first introduce several basic concepts relevant to concept drift.

Definition 1. A concept signifies either a stationary distribution of the class labels in a set of instances of the current data streams or a similar distribution rule over the attributes of the given instances.

According to the divergence of concept drifting patterns, the change modes of a concept can be divided into three types: concept drift, concept shift, and sampling change, as described in [15].


Definition 2. Concept drift and concept shift are change patterns with distinct speeds in the attribute values or class labels of the database: the former refers to gradual change, while the latter indicates rapid change.

Definition 3. Sampling change is mostly attributed to a change in the data distribution of the class labels. (In this paper, all of these changes are referred to as concept drifts.)

In CDRDT, a detection of the distribution changes of the streaming data is run after a data chunk has traversed all of the random decision trees, and the various types of concept drifts are distinguished from noise by means of the relation between the difference of the average classification error rates at the leaves and the specified thresholds. The thresholds are specified in the Hoeffding bound inequality, whose detailed description is given below. Consider a real-valued random variable r whose range is R. Suppose we have made n independent observations of this variable and computed their mean r̄. The Hoeffding bound states that, with probability 1 − δ, the true mean of the variable is at least r̄ − ε:

P(r ≥ r̄ − ε) = 1 − δ,  ε = √(R² ln(1/δ) / 2n)    (1)

where R is defined as log(M(classes)), with M(classes) the count of the total class labels in the current database; n is the size of the current streaming data chunk; and the random variable r is the expected error rate of the Naïve Bayes or majority-class classification at the leaves over all random decision tree classifiers in CDRDT. Suppose the target object of r̄ is the historical classification result on the i-th chunk (denoted ēf) and the current observation object is the estimated classification result on the (i+1)-th chunk (denoted ēs). The detailed definition of ēf (ēs) is formalized below:

ēf (ēs) = (1/N) · Σ_{k=1}^{N} [ Σ_{i=1}^{M^k_leaf} p_ki · n_ki / Σ_{i=1}^{M^k_leaf} n_ki ]    (2)

In this formula, N signifies the number of trees, M^k_leaf is the count of leaves of the k-th classifier, n_ki is the count of instances at the i-th leaf of classifier CTk, and p_ki is the error rate estimated with the 0-1 loss function at the i-th leaf of CTk. In terms of Formula (2), we utilize the difference between ēs and ēf (i.e., Δe = ēs − ēf) to discover the distribution changes of the class labels. More specifically, if the value of Δe is non-negative, a potential concept drift is taken into account; otherwise, the case is regarded as one without any concept drift. This rests on statistical theory, which guarantees that for a stationary distribution of the instances the online error of Naïve Bayes will decrease, whereas when the distribution function of the instances changes, the online error of Naïve Bayes at the node will increase [16]. For classification by majority class, a similar rule can be concluded from the distribution changes of the class labels in small chunks of streaming data, provided the chunks contain sufficient instances (in this paper, the minimum size of a data chunk, denoted nmin, is set to 0.2k, where 1k = 1000; this is obtained from the conclusion in [22]). This is also verified by our experiments on the tracking of concept drifts in Section 4. Hence, Eq. (1) can be transformed into Eq. (3):

P(ēs − ēf ≥ ε0) = 1 − δ0,  ε0 = √(R² ln(1/δ0) / 2n)    (3)
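As an illustration of Eqs. (2) and (3), the sketch below computes the ensemble average error rate over the leaves and the Hoeffding bound ε0. It assumes each tree exposes its leaves with the statistics n_ki and p_ki as attributes n and p, which is our own convention rather than a detail given in the paper.

import math

def ensemble_average_error(trees):
    # Eq. (2): for each tree, the instance-weighted error rate over its
    # leaves (sum_i p_ki*n_ki / sum_i n_ki), averaged over the N trees.
    total = 0.0
    for tree in trees:
        weighted = sum(leaf.p * leaf.n for leaf in tree.leaves)
        count = sum(leaf.n for leaf in tree.leaves)
        total += weighted / count
    return total / len(trees)

def hoeffding_epsilon(n, num_classes, delta):
    # Eqs. (1)/(3) with R = log(M(classes)), as defined after Eq. (1).
    R = math.log(num_classes)
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

The drift statistic is then Δe, i.e., ensemble_average_error on the current chunk minus the stored value for the previous chunk, compared against ε0.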

To distinguish the diverse concept drifts from noise, it is necessary to specify different values of ε0 to partition their bounds, i.e., the tolerated bounds of deviation between the current error rate and the reference error rate. Evidently, the larger the value of ε0, the higher the drifting likelihood: in other words, it is more probable that the previous model will not adapt to the current data streams owing to a deficiency in classification accuracy. Correspondingly, the value of δ0 decreases while the confidence 1 − δ0 increases. Therefore, inspired by [8], two thresholds, Tmax and Tmin, are defined in the Hoeffding bound inequality to control the deviation of the classification error rates. Considering the demand on the predictive ability of the current models, their values are specified as follows:

P(ēs − ēf ≥ Tmax) = 1 − δmin,  Tmax = 3ε0,  δmin = 1 / exp(Tmax² · 2n / R²)    (4)

P(ēs − ēf ≥ Tmin) = 1 − δmax,  Tmin = ε0,  δmax = 1 / exp(Tmin² · 2n / R²)    (5)
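A small sketch of Eqs. (4) and (5) follows; the function name and signature are ours, and delta0 is the confidence parameter chosen for ε0 in Eq. (3).

import math

def drift_thresholds(n, num_classes, delta0):
    # Eqs. (4)-(5): Tmin = eps0 and Tmax = 3*eps0, with the matching
    # confidence parameters obtained by inverting the Hoeffding bound
    # (note that delta_max works out to the chosen delta0).
    R = math.log(num_classes)
    eps0 = math.sqrt(R * R * math.log(1.0 / delta0) / (2.0 * n))
    t_min, t_max = eps0, 3.0 * eps0
    delta_min = math.exp(-(t_max ** 2) * 2.0 * n / (R ** 2))
    delta_max = math.exp(-(t_min ** 2) * 2.0 * n / (R ** 2))
    return t_min, t_max, delta_min, delta_max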

Adaptation to Concept Drifts Contaminated by Noise. In accordance with the analysis above and the thresholds specified in Eqs. (4) and (5), four concept drifting states can be partitioned: non-concept drift, potential concept drift, plausible concept drift, and true concept drift. Namely, if the value of Δe is negative, the state is taken as a non-concept drift; otherwise, one of the other three cases holds. More precisely, if the value of Δe is less than Tmin, a potential concept drift is considered (potential indicates that a slower or much slower concept drift is probably occurring); if it is greater than Tmax, a true concept drift is taken into account, which results from either a potential concept drift or an abrupt concept drift. Otherwise, the state is attributed to a plausible concept drift, considering the effect of noise contamination; it spans the transition interval between a potential concept drift and a true concept drift. Treating this status as fuzzy helps reduce the impact of noise in the data streams and avoids over-sensitivity to the concept drifts. Correspondingly, different strategies are adopted to handle the various types of concept drifts, as sketched in the code after this paragraph. More specifically, for the case of a non-concept drift, the size of the current data chunk is maintained at a default value (e.g., nmin). For a potential concept drift, the chunk size is increased by mmin instances (e.g., mmin = nmin = 0.2k). For a plausible concept drift, the size of the streaming data chunk and the check period are each shrunk by one third, because it is necessary to observe the change of the data streams further before a deterministic type of concept drift can be assigned. For a true concept drift, the sizes are reduced to half of their original values. Regarding the disadvantages of streaming data chunks with too large or too small sizes, a maximum bound (e.g., mmax = 10·nmin) and a minimum one (e.g., mmin) are specified to control the change magnitude of a data chunk for better adaptation to the concept changes; once a bound is reached, the check period remains invariant until a new concept drift occurs. Furthermore, to improve the utility of each tree, those sub-branches whose classification error rates exceed the average level (e.g., 50%) are pruned.
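A minimal sketch of this adaptation policy, assuming the defaults mmin = 0.2k and mmax = 10·nmin from the examples above; the function signature is our own.

def adapt_check_period(delta_e, t_min, t_max, chunk_size,
                       m_min=200, m_max=2000):
    # Map delta_e = e_s - e_f to one of the four drift states and adjust
    # the chunk size (and hence the check period) accordingly.
    if delta_e < 0:                      # non-concept drift
        new_size = m_min                 # keep the default size
    elif delta_e < t_min:                # potential concept drift
        new_size = chunk_size + m_min    # enlarge to track a slow change
    elif delta_e > t_max:                # true concept drift
        new_size = chunk_size // 2       # halve for a quick reaction
    else:                                # plausible concept drift
        new_size = chunk_size * 2 // 3   # shrink by one third, re-observe
    return max(m_min, min(m_max, new_size))  # clamp to [m_min, m_max]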

3.2 Analysis

Generation Error Rate for the Concept Drifting Data Streams. According to the theorem of generation error analyzed in [17], as the number of trees increases, for almost surely all sequences Θ1, Θ2, . . ., the generation error PE converges to

P_{X,Y} ( P_Θ(h(X, Θ) = Y) − max_{j≠Y} P_Θ(h(X, Θ) = j) < 0 )
