Semi-supervised Ensemble Learning of Data Streams in the Presence of Concept Drift

Zahra Ahmadi and Hamid Beigy

Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
{z_ahmadi,beigy}@ce.sharif.com

Abstract. Increasing access to very large and non-stationary datasets in many real problems has made classical data mining algorithms impractical and has made it necessary to design new online classification algorithms. Online learning of data streams has some important features, such as sequential access to the data, limits on time and space complexity, and the occurrence of concept drift. The infinite nature of data streams makes it hard to label all observed instances, so semi-supervised approaches are a natural fit for the problem. In this paper we present a new semi-supervised ensemble learning algorithm for data streams. The algorithm uses the majority vote of the learners to label the unlabeled instances. An empirical study demonstrates that the proposed algorithm is comparable with state-of-the-art semi-supervised online algorithms.

Keywords: Stream Mining, Concept Drift, Ensemble Learning, Semi-Supervised Learning.

1   Introduction

The growing availability of data on the web has made mining and knowledge discovery from huge amounts of data both difficult and interesting. As the amount of data is very large (and ideally infinite), it cannot be stored, and therefore new algorithms are needed to process the stream of data online. This is called stream mining, and it has been a challenging problem in recent years. Data streams have some important properties [1]:

- There should be a forgetting mechanism, as the received data cannot be stored completely. The most common way of forgetting is to use a window of constant size; however, adaptive windows [2] and density-based forgetting [3, 4] have also been presented.
- Time and algorithmic complexity should be low, as the data must be processed online.
- The most important property of data streams is concept drift, which is a change in the feature or class distribution over time:

    P_t(X, C) = P_t(C|X) P_t(X)                                    (1)

E. Corchado et al. (Eds.): HAIS 2012, Part II, LNCS 7209, pp. 526-537, 2012.
© Springer-Verlag Berlin Heidelberg 2012


where X is the feature vector and C is the class label. If the drift occurs in the feature space (P(X)), it is called virtual drift, but if it occurs in the target function (P(C|X)), it is called real drift. We only consider the occurrence of drift, i.e., a change of the joint probability P(X, C) over time, no matter whether the drift is virtual or real.

Concept drift can be abrupt, gradual or recurring [5]. If the underlying distribution of the data changes suddenly at time t, an abrupt drift has occurred. If the distribution changes over a period of time (not at a specific instant), and the probability of the new distribution increases gradually, the drift is called gradual. If previously seen concepts reappear some time later, they are called recurring concepts. As data streams require a forgetting mechanism to handle drifts, previously seen concepts may be forgotten, so the correct classification of recurring concepts is an important ability of data stream algorithms.

There have been extensive studies on the supervised learning of data streams in the presence of abrupt, gradual or recurring concepts. However, semi-supervised approaches have received much less attention, and only a few studies have been done recently [6-13], so the problem remains challenging. This paper proposes an ensemble learning method to classify the instances and predict the labels of unlabeled instances. For each classifier in the ensemble, the majority vote of the other classifiers is used to label the unlabeled instances, which are then used to update that classifier. It is proven that even if the labeling process is noisy, the classification is PAC learnable. The results show the effectiveness of our algorithm in terms of accuracy in comparison to one of the promising ensemble algorithms in the literature on semi-supervised data streams.

The structure of the paper is as follows: in the next section, related work on semi-supervised data streams is discussed.
In Section 3 the proposed algorithm is presented, and in Section 4 the experimental results and evaluation of the method are presented. Section 5 concludes the paper and discusses some future work.
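As a concrete illustration of the drift taxonomy discussed above, the following sketch (our own toy example, not from the paper) generates a one-dimensional two-class stream whose target function P(C|X) first drifts gradually over a thousand instances and then changes abruptly:

```python
import random

def make_stream(n=3000, seed=0):
    """Toy stream with real drift: the decision threshold moves
    gradually in the middle segment, then jumps abruptly."""
    rng = random.Random(seed)
    stream = []
    for t in range(n):
        x = rng.random()
        if t < 1000:
            thr = 0.5                              # stable concept
        elif t < 2000:
            thr = 0.5 + 0.3 * (t - 1000) / 1000    # gradual drift: 0.5 -> 0.8
        else:
            thr = 0.2                              # abrupt drift at t = 2000
        stream.append((x, 1 if x > thr else 0))
    return stream

stream = make_stream()
# the class prior shifts across the three regimes
first = sum(c for _, c in stream[:1000]) / 1000
last = sum(c for _, c in stream[2000:]) / 1000
print(first, last)
```

A classifier trained on the first segment would degrade slowly during the gradual phase and sharply after the abrupt change, which is what drift-handling mechanisms such as windowing are designed to counter.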

2   Related Work

As data streams are infinite, arrive continuously and must be classified online, labeling all of the arriving data is impossible. So, in recent years, there has been a growing interest in the semi-supervised learning of data streams, and a few algorithms have been presented to classify scarcely labeled data streams [6-13]. The algorithms can be categorized in two groups according to the number of classifiers used in the learning process: single [6, 8-10] and ensemble [12-14]. The semi-supervised approaches fall into one of the following categories:

- Using the K-means clustering algorithm to label the unlabeled instances [6, 10, 12]. K-means is used because of its simplicity and efficiency.
- Using the expectation maximization algorithm to estimate the labels of instances [9].


The first semi-supervised learning algorithm for data streams was presented by Klinkenberg [8] and used an SVM with a window adjustment approach. Later, another algorithm based on relational k-means transfer semi-supervised support vector machines (RK-TS3VM) was proposed in [10]. The algorithm presented in [6] extends online decision trees to support recurring concepts. It uses k-means to cluster and label the instances. To cover recurring concepts, it keeps conceptual clusters in the leaves of the trees, and to avoid overfitting, pruning is done regularly.

Another algorithm, which uses ensemble learning, is presented in [12]. In each window (or batch of data), a constraint-based clustering algorithm is applied and K homogeneous clusters are created. A homogeneous cluster is a cluster which contains only unlabeled instances or only labeled instances of a single class. Some information about each micro-cluster (centroid, number of instances, etc.) is maintained as pseudo-points. Label propagation is then performed on the pseudo-points, and these points act as a classification model. The ensemble is kept up to date with the current concept and periodically refines the L classifiers to cope with drift. The refinement is done according to the accuracy of the learners on the current batch. We compared our method with this algorithm because of its similar approach.

On the other hand, there are several general approaches in the literature on semi-supervised learning: self-training, probabilistic generative models, clustering then labeling, co-training, graph-based approaches and transductive support vector machines. Our proposed algorithm can be categorized as a self-training model, but it differs from regular self-training methods. In regular self-training approaches, the learner first learns from labeled data, then uses its predictions on unlabeled instances and selects the instances with the most confident labels.
The newly labeled instances are added to the training set and the learning process is repeated. This process iterates until no new instance is added to the training set. In our proposed algorithm, however, the algorithm must act online, so the iterative process is omitted and the labeling is done once. The advantages of this approach are its simplicity and the fact that it is a wrapper method, so it can be applied to different learning algorithms.
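The one-pass variant just described can be sketched as follows. This is a simplified single-learner illustration with hypothetical names (`ThresholdLearner`, `predict_with_confidence`); the paper's actual algorithm uses an ensemble vote rather than a single learner's confidence:

```python
class ThresholdLearner:
    """Toy 1-D learner: predicts 1 if x exceeds the midpoint of class means."""
    def __init__(self):
        self.threshold = 0.5

    def fit(self, data):
        self._refit(data)

    def update(self, data):
        self._refit(data)

    def _refit(self, data):
        pos = [x for x, y in data if y == 1]
        neg = [x for x, y in data if y == 0]
        if pos and neg:
            self.threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

    def predict_with_confidence(self, x):
        y = 1 if x > self.threshold else 0
        conf = min(1.0, abs(x - self.threshold) * 2)  # distance as crude confidence
        return y, conf

def one_pass_self_training(learner, labeled, unlabeled, confidence=0.8):
    # learn from the labeled batch, pseudo-label the unlabeled batch ONCE,
    # then update once -- the iterative loop of batch self-training is omitted
    learner.fit(labeled)
    pseudo = [(x, y) for x in unlabeled
              for y, p in [learner.predict_with_confidence(x)] if p >= confidence]
    if pseudo:
        learner.update(pseudo)
    return learner

learner = one_pass_self_training(
    ThresholdLearner(),
    labeled=[(0.1, 0), (0.2, 0), (0.8, 1), (0.9, 1)],
    unlabeled=[0.05, 0.15, 0.85, 0.95])
print(learner.predict_with_confidence(0.9)[0])  # -> 1
```

The wrapper touches the learner only through `fit`/`update`/`predict_with_confidence`, which is why the approach can be applied on top of different base learning algorithms.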

3   The SSEL Algorithm

As we discussed previously, labeling all the instances in the stream is impossible. So the approach we follow here is a semi-supervised one, which assumes that among the arriving instances, some are labeled at random. An ensemble method is used to label the unlabeled instances and improve the classification performance. The proposed algorithm is presented in Table 1. When a new window of data arrives, first the labeled instances are separated from the unlabeled ones and the ensemble learners are updated with the labeled instances. Assume that the number of learners is K. Diversity is an important feature of ensemble learning, and in the learning of data streams it plays an even more


important role. If we used all the instances in the window, after some time the base learners would become identical, so we use bootstrapping to obtain diverse learners. Then the process of using unlabeled data begins. For each learner, we determine a set of labeled instances from the unlabeled data. To do this, we use the majority vote of the other learners on each instance. We use the predicted label of the K-1 learners (all learners except the one for which we are selecting the instances) if their ensemble does better than a random classifier (which is checked in line 17 of the pseudo code). If the prediction of the K-1 learners is correct, then the Kth learner receives a validly labeled instance; otherwise the label is noisy. Even in the worst case, with noisy instances, if the number of labeled instances is sufficient we can decrease the classification error rate. To do this, we used the idea from [15].

Assume the number of misclassified labeled instances in the tth window is η_L|L_t|, where L_t is the set of labeled instances in the tth window and η_L is their noise rate. Let ê_t be the upper bound on the error rate of the ensemble of base classifiers h_k, k ≠ i, in window t. If z is the number of unlabeled instances receiving the vote of more than 50% of the learners and z' is the number of correctly classified instances among them, then we can estimate ê_t by (z - z')/z. Here z can be written as |W_t| - |L_t| - |f_t|, where W_t is the tth window and f_t is the set of unlabeled instances on which the ensemble of K-1 base learners has a 50%-50% vote. So the number of misclassified instances from the unlabeled data will be ê_t|L'_t|, where L'_t is the set of instances from the unlabeled data which are labeled in window t (L'_t = W_t - L_t - f_t). Therefore we can write the noise rate in the tth window as:

    η^t = (η_L |L_t| + ê_t |L'_t|) / (|L_t| + |L'_t|)              (2)
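Equation (2) mixes the noise contributions of the labeled and pseudo-labeled instances, weighted by the sizes of the two sets. A worked instance, with illustrative numbers of our own choosing:

```python
def window_noise_rate(eta_L, n_labeled, e_hat, n_pseudo):
    """Combined noise rate of a window, per equation (2):
    labeled instances carry noise rate eta_L, pseudo-labeled instances
    carry the estimated ensemble error rate e_hat."""
    return (eta_L * n_labeled + e_hat * n_pseudo) / (n_labeled + n_pseudo)

# 200 labeled instances with 5% label noise, plus 600 pseudo-labeled
# instances whose estimated ensemble error rate is 20%
eta_t = window_noise_rate(0.05, 200, 0.20, 600)
print(round(eta_t, 4))  # -> 0.1625
```

The combined rate sits between the two component rates, pulled toward the noisier pseudo-labeled set because it is three times larger.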

From [16], if a sequence σ of m samples is drawn, the minimum number of instances should satisfy

    m ≥ (2 / (ε^2 (1 - 2η)^2)) ln(2N / δ)                          (3)

where N is the number of hypotheses, δ is the desired confidence, ε is the worst-case classification error rate and η is an upper bound on the classification noise rate. Then a hypothesis H_i that minimizes disagreement on σ will be PAC learnable (H* is the ground-truth hypothesis):

    Pr[ d(H_i, H*) ≥ ε ] ≤ δ                                       (4)

Using equation (3) in the data stream learning process, the minimum number of instances in each window should be at least:

    m_t ≥ (2 / (ε_t^2 (1 - 2η^t)^2)) ln(2N / δ)                    (5)
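The Angluin-Laird bound m ≥ (2/(ε^2 (1 - 2η)^2)) ln(2N/δ) can be evaluated directly; the parameter values below are illustrative only:

```python
import math

def min_sample_size(epsilon, eta, n_hypotheses, delta):
    """Minimum sample size m >= 2/(eps^2 (1-2*eta)^2) * ln(2N/delta)
    from the classification-noise PAC bound of Angluin and Laird."""
    if eta >= 0.5:
        raise ValueError("noise rate must be below 1/2")
    return math.ceil(2.0 / (epsilon ** 2 * (1 - 2 * eta) ** 2)
                     * math.log(2 * n_hypotheses / delta))

# e.g. 1000 hypotheses, 95% confidence, 10% target error, 20% noise
m = min_sample_size(epsilon=0.1, eta=0.2, n_hypotheses=1000, delta=0.05)
print(m)
```

Note the quadratic blow-up in 1/(1 - 2η): as the noise rate approaches 1/2, the required window size grows without bound, which is why the algorithm must keep the per-window noise rate η^t decreasing.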

Table 1. Semi-Supervised Ensemble Learning Algorithm (SSEL)

    Input: θ = 1/C, where C is the number of classes; K: number of weak
           classifiers; data stream in windows of size w; LearnIncremental:
           learning algorithm.

    1   for j = 1..K do
    2     e'_j = 0.5                      // learner's error
    3     l'_j = 0                        // size of pseudo-labeled set in the window
    4   end for
    5   while true do
    6     receive window w_i of data
    7     L_i = labeled data of w_i
    8     U_i = unlabeled data of w_i
    9     test the current hypothesis on L_i
    10    for j = 1..K do
    11      S_ji = BootstrapSample(L_i)
    12      h_ji = LearnIncremental(h_j(i-1), S_ji)
    13    end for
    14    for j = 1..K do
    15      L'_ji = ∅                     // unlabeled instances and their predicted labels
    16      e_ji = MeasureError(h_ki | k ≠ j)
    17      if (e_ji < θ) then            // if the ensemble works better than random
    18        for every x ∈ U_i do
    19          if most of h_ki(x) (k ≠ j) classify x in class c then
    20            L'_ji = L'_ji ∪ {(x, c)}
              end for
    21        if (l'_j = 0) then l'_j = ⌊e_ji / (e'_j - e_ji) + 1⌋   // in the first window
    22        if (|L'_ji| > e'_j l'_j / e_ji - 1) then               // condition of equations (8)-(9)
    23          L'_ji = Subsample(L'_ji, ⌈e'_j l'_j / e_ji - 1⌉)     // subsample if the set is too large
    24    end for
    25    for j = 1..K do
    26      if (L'_ji ≠ ∅) then
    27        h_ji = LearnIncremental(L'_ji)
    28        e'_j = e_ji
    29        l'_j = |L'_ji|
    30    end for
    31  end while
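The per-window flow of Table 1 can be sketched in Python as follows. This is a minimal illustration with a hypothetical learner interface (`MeanLearner`, `update`, `predict` are our own names); the error-bound checks of lines 21-23 are omitted for brevity, and the bootstrap is stratified per class as a practical guard so tiny toy batches do not lose a class:

```python
import random
from collections import Counter

class MeanLearner:
    """Toy incremental base learner: nearest class mean in one dimension."""
    def __init__(self):
        self.sums, self.counts = {}, {}

    def update(self, batch):
        for x, y in batch:
            self.sums[y] = self.sums.get(y, 0.0) + x
            self.counts[y] = self.counts.get(y, 0) + 1

    def predict(self, x):
        return min(self.sums, key=lambda y: abs(x - self.sums[y] / self.counts[y]))

def stratified_bootstrap(batch, rng):
    # bootstrap within each class so every class stays represented
    by_class = {}
    for x, y in batch:
        by_class.setdefault(y, []).append((x, y))
    return [rng.choice(items) for items in by_class.values() for _ in items]

def ssel_window(learners, labeled, unlabeled, theta, rng):
    # lines 10-13 of Table 1: update each learner on a bootstrap sample
    for h in learners:
        h.update(stratified_bootstrap(labeled, rng))
    # lines 14-20: pseudo-label U for each learner by the others' majority vote
    for j, h in enumerate(learners):
        others = [g for k, g in enumerate(learners) if k != j]
        vote = lambda x: Counter(g.predict(x) for g in others).most_common(1)[0]
        err = sum(vote(x)[0] != y for x, y in labeled) / len(labeled)
        if err < theta:                   # line 17: better than random guessing
            pseudo = [(x, c) for x in unlabeled
                      for c, n in [vote(x)] if n > len(others) / 2]
            h.update(pseudo)              # lines 25-30: incremental update
    return learners

rng = random.Random(1)
learners = [MeanLearner() for _ in range(3)]
labeled = [(i / 10, 0) for i in range(5)] + [(0.6 + i / 10, 1) for i in range(5)]
ssel_window(learners, labeled, [0.05, 0.95], theta=1 / 2, rng=rng)
print(learners[0].predict(0.95))  # -> 1
```

Each learner is excluded from its own labeling committee, mirroring the k ≠ j condition of line 16: the remaining K-1 learners must both beat a random guesser and reach a strict majority before an unlabeled instance is accepted.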


As the goal of an online learner is to improve over time, ε_t should decrease as t grows (ε_t < ε_{t-1}). Using the aforementioned equation, we have:

    m_t (1 - 2η^t)^2 > m_{t-1} (1 - 2η^{t-1})^2                    (6)

Having equation (2) in hand and m_t = |L_t| + |L'_t|, we can substitute into (6):

    (|L_t| + |L'_t|) (1 - 2 (η_L|L_t| + ê_t|L'_t|) / (|L_t| + |L'_t|))^2 >
    (|L_{t-1}| + |L'_{t-1}|) (1 - 2 (η_L|L_{t-1}| + ê_{t-1}|L'_{t-1}|) / (|L_{t-1}| + |L'_{t-1}|))^2      (7)

By making some simplifying assumptions, such as a fixed window length over time (|W_t| = |W_{t-1}|) and a fixed number of labeled instances in each window (|L_t| = |L_{t-1}|), and recalling that |L'_t| = |W| - |L| - |f_t|, inequality (7) reduces to a condition on the pseudo-labeled sets. The number of unlabeled instances that are labeled by the algorithm should satisfy:

    0 < ê_t / ê_{t-1} < |L'_{t-1}| / |L'_t| < 1                    (8)

By substituting ê_t = (z_t - z'_t) / z_t in (8), we conclude:

    0 < ((z_t - z'_t) z_{t-1}) / ((z_{t-1} - z'_{t-1}) z_t) < |L'_{t-1}| / |L'_t| < 1      (9)

From (8) and (9), we can find the maximum number of unlabeled instances in the tth window that may be labeled:

    |L'_t| < (ê_{t-1} |L'_{t-1}|) / ê_t - 1
           = ((z_{t-1} - z'_{t-1}) z_t |L'_{t-1}|) / ((z_t - z'_t) z_{t-1}) - 1            (10)

So if the number of unlabeled instances to be labeled in the tth window exceeds the value obtained in (10), the subsampling process is performed (lines 22 and 23 in Table 1). Finally, a set of instances is determined for each base learner to be updated with. This process is repeated on successive data windows.
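The tri-training-style cap |L'_t| < ê_{t-1}|L'_{t-1}|/ê_t - 1 and the subsampling step can be written as follows (a sketch with our own variable names):

```python
import math
import random

def max_pseudo_labels(prev_err, prev_size, cur_err):
    """Upper bound on this window's pseudo-labeled set size, so that the
    noisy-instance count keeps shrinking: |L'_t| < prev_err*prev_size/cur_err - 1."""
    return math.floor(prev_err * prev_size / cur_err - 1)

def subsample(pseudo, limit, rng):
    # keep at most `limit` pseudo-labeled instances, chosen at random
    return pseudo if len(pseudo) <= limit else rng.sample(pseudo, limit)

# previous window: 100 pseudo-labels at 20% estimated error;
# current window's estimated error is 15%
cap = max_pseudo_labels(prev_err=0.2, prev_size=100, cur_err=0.15)
rng = random.Random(0)
kept = subsample(list(range(500)), cap, rng)
print(cap, len(kept))  # -> 132 132
```

Intuitively, a lower current error rate allows more pseudo-labels to be accepted, while a rising error rate forces the set to shrink so that the total amount of injected noise still decreases window over window.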

4   Experimental Results

To evaluate the proposed algorithm, we first introduce the datasets, which contain different kinds of concept drift. Then the proposed algorithm is compared with one of the most promising semi-supervised data stream algorithms, called ReaSC [12]. The experiments show the effectiveness of the proposed algorithm.

4.1   Data Sets

We have used six datasets, two of them artificial and the others real. We tried to select datasets containing all types of concept drift: abrupt, gradual and recurring concepts.

The SEA [17] dataset is an artificial dataset which contains abrupt drift and has 2 classes and 3 attributes with values between 0 and 10. The version used here consists of 50,000 instances.

Another artificial dataset is Hyperplane [18], which has 100,000 instances with gradual drift. The concept is defined by

    f(x) = Σ_j a_j · x_j                                           (11)

where the values a_j control the shape of the decision surface and f(x) determines the class label of instance x. Concept drift is controlled through the following parameters: (1) t controls the magnitude of the concept drift; (2) p controls the number of attributes whose weights are involved in the drift; and (3) h and g ∈ {-1, 1} control the weight adjustment direction for the attributes involved in the change. For each instance x, a_j is adjusted by g·t/M, and after receiving M instances there is an h percent chance that the weight change inverses its direction, i.e., g = -g. Here the Hyperplane dataset has five classes and 10 attributes; p is set to 5, t to 0.1 with M = 2000 instances, and the weight adjustment inverses its direction with h = 20% chance.

Another dataset used in the experiments is the famous KDD Cup 99 [19], which has 23 class labels and 42 attributes. Here we used 10% of the whole dataset, so it has 492,000 instances. One important feature of this dataset is the occurrence of novelty, meaning that new classes appear over time.

The Usenet dataset used in [20] contains a stream of emails on different topics, where the user labels them as interesting or junk. The data in Usenet posts [19] has been used to construct this dataset. Three topics from 20 newsgroups are selected. The user is interested in one or two topics in each concept and labels the emails according to his/her interest. User interests can change over time, so this dataset contains recurring concepts and abrupt concept drift (Table 2). The dataset consists of 1500 instances and 913 attributes and is divided into 5 time periods, each having an equal number of instances.

The Sensor dataset [17] is a real dataset which consists of information collected from 54 sensors deployed in the Intel Berkeley Research laboratory over a two-month period. The class label is the sensor ID, so there are 54 classes, 5 attributes and 2,219,803 instances.
There are several kinds of concept drift in this dataset; for example, lighting (or the temperature at some specific sensors) during working hours is much stronger than at night.
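The gradual-drift mechanism of the Hyperplane generator described above can be sketched as follows. This is a simplified two-class variant with our own parameter names; the paper's version has five classes, and the standard generator may flip each drifting weight's direction independently rather than all together:

```python
import random

def hyperplane_stream(n=10000, dim=10, drifting=5, t=0.1, M=2000, h=0.2, seed=0):
    """Rotating-hyperplane stream: the class is 1 when sum(a_j * x_j)
    exceeds half the total weight. The first `drifting` weights move by
    g*t/M per instance; every M instances the direction g flips with
    probability h (simplified: all drifting weights flip together)."""
    rng = random.Random(seed)
    a = [rng.random() for _ in range(dim)]
    g = [rng.choice([-1, 1]) for _ in range(drifting)]
    for i in range(n):
        x = [rng.random() for _ in range(dim)]
        y = 1 if sum(aj * xj for aj, xj in zip(a, x)) > 0.5 * sum(a) else 0
        yield x, y
        for j in range(drifting):             # gradual weight adjustment
            a[j] += g[j] * t / M
        if (i + 1) % M == 0 and rng.random() < h:
            g = [-gj for gj in g]             # invert the drift direction

stream = list(hyperplane_stream(n=4000))
print(len(stream), len(stream[0][0]))
```

Because the weights change by a tiny amount per instance, the decision surface rotates slowly: this is exactly the gradual drift that distinguishes Hyperplane from the abrupt drift of SEA.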


Table 2. Usenet dataset concepts

              1-300   300-600   600-900   900-1200   1200-1500
    Medicine    +        -         +         -           +
    Space       -        +         -         +           -
    Baseball    -        +         -         +           -

Elec [21] is another dataset gathered from real data. It consists of the price and demand in an electricity market, recorded every 30 minutes. The dataset contains 45,312 instances and 8 attributes, and the label shows the change in the price compared to the mean price of the previous 24 hours.

4.2   Implementation and Parameter Setting

The proposed algorithm was implemented in Java using the WEKA [22] and MOA [23] environments. As discussed previously, an incremental classifier is needed; here we used an incremental decision tree as the base classifier of the ensemble. Decision trees are appropriate for online learning, as they are fast and accurate learners. The number of base classifiers is set to three.

The ReaSC algorithm has an additional parameter, the number of clusters. Our proposed method does not use clustering and thus has no need for such a parameter. This is an advantage, since clustering is time consuming and determining the cluster parameter is not easy, yet it has an impact on the classification accuracy. This parameter is set to 50 according to the authors' experiments in ReaSC [12]. In all datasets, we assume that 20% of the instances in each window are labeled at random, and the window size is set to 1000 instances, except for the Usenet dataset, where the window size is set to 100 (as its number of instances is small).

4.3   Results and Discussion

We compared our method with the ReaSC algorithm [12] in terms of cumulative accuracy over the windows of data; cumulative accuracy is used to obtain smoother plots. The results of our experiments on the datasets are shown in Figures 1 to 6, and the results on datasets containing different types of concept drift are promising. As can be seen, on the SEA dataset there is a difference of at least 3% in performance. On the Hyperplane dataset the difference is much larger, about 15% of improvement in accuracy, so our proposed algorithm appears to work much better on datasets with gradual drift; this may be due to the boosting nature of the algorithm. In the real datasets, the type and location of concept drift is not known, but on all of them (KDD CUP99, Usenet, Sensor and Elec) our proposed algorithm performs better. In particular, on the Sensor dataset we see a performance improvement of about 45%, and on Elec it is about 15%. These results remain approximately the same even with different parameter values (number of base learners, percentage of labeled data and window size).
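The cumulative-accuracy protocol underlying these comparisons can be sketched as a test-then-train loop over windows. The model interface and names here are illustrative, not the paper's implementation:

```python
import random

class MajorityModel:
    """Toy baseline: always predicts the most frequent class seen so far."""
    def __init__(self):
        self.counts = {}

    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else 0

    def update(self, batch):
        for _, y in batch:
            self.counts[y] = self.counts.get(y, 0) + 1

def cumulative_accuracy(stream, model, window=1000, labeled_frac=0.2, seed=0):
    """Test-then-train over windows: each window's randomly labeled 20% is
    first used to test the current model, then to update it; returns the
    cumulative accuracy after each window (the quantity plotted per window)."""
    rng = random.Random(seed)
    correct = tested = 0
    curve = []
    for start in range(0, len(stream), window):
        labeled = [xy for xy in stream[start:start + window]
                   if rng.random() < labeled_frac]
        for x, y in labeled:
            correct += (model.predict(x) == y)   # test before training
            tested += 1
        model.update(labeled)                    # then train on the same batch
        curve.append(correct / max(tested, 1))
    return curve

stream = [(i, i % 2) for i in range(5000)]
curve = cumulative_accuracy(stream, MajorityModel())
print(len(curve), round(curve[-1], 2))
```

Because the numerator and denominator accumulate over all windows, early mistakes are never forgotten, which is what smooths the plotted curves relative to per-window accuracy.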

Fig. 1. Total accuracy of SSEL and ReaSC algorithms in the SEA dataset

Fig. 2. Total accuracy of SSEL and ReaSC algorithms in the HyperPlane dataset

Fig. 3. Total accuracy of SSEL and ReaSC algorithms in the KDD CUP99 dataset

Fig. 4. Total accuracy of SSEL and ReaSC algorithms in the Usenet dataset

Fig. 5. Total accuracy of SSEL and ReaSC algorithms in the Sensor dataset

Fig. 6. Total accuracy of SSEL and ReaSC algorithms in the Elec dataset


All in all, our proposed algorithm works much better than the compared algorithm, which takes a similar semi-supervised ensemble approach.

5   Conclusion and Future Work

In this paper we proposed a new semi-supervised ensemble learning (SSEL) algorithm for the classification of streaming data. The approach can be categorized as modified self-training, but without the usual problems of self-training. In self-training methods, if the learners are weak and predict the labels of unlabeled instances incorrectly, using the unlabeled data degrades the performance of the learner. In SSEL we showed that if the number of instances in each window is sufficient (according to equation (10)), then the algorithm is PAC learnable and noise will not degrade the performance of the learner.

For future work, we could extend the experiments and examine more datasets and different parameters (the number of base classifiers, the percentage of labeled data and the window size) to obtain more reliable results. On the other hand, if the drift rate becomes fast, our method detects the changes with a delay and its performance decreases; calculating the upper bound of the tolerable drift rate is also left as future work.

References

1. Tsymbal, A.: The Problem of Concept Drift: Definitions and Related Work (2004)
2. Widmer, G., Kubat, M.: Learning in the presence of concept drift and hidden contexts. Machine Learning 23(1), 69-101 (1996)
3. Aha, D.W., Kibler, D., Albert, M.K.: Instance-Based Learning Algorithms. Machine Learning 6(1), 37-66 (1991)
4. Salganicoff, M.: Density-Adaptive Learning and Forgetting. In: Tenth International Conference on Machine Learning. Morgan Kaufmann (1993)
5. Zliobaite, I.: Learning under Concept Drift: an Overview (2010)
6. Li, P., Wu, X., Hu, X.: Mining Recurring Concept Drifts with Limited Labeled Streaming Data. In: 2nd Asian Conference on Machine Learning (ACML 2010). JMLR, Tokyo (2010)
7. Masud, M.M.: Adaptive Classification of Scarcely Labeled and Evolving Data Streams. Computer Science, p. 161. The University of Texas, Dallas (2009)
8. Klinkenberg, R.: Using Labeled and Unlabeled Data to Learn Drifting Concepts. In: IJCAI 2001 Workshop on Learning from Temporal and Spatial Data. AAAI Press, Menlo Park (2001)
9. Borchani, H., Larrañaga, P., Bielza, C.: Mining Concept-Drifting Data Streams Containing Labeled and Unlabeled Instances. In: García-Pedrajas, N., Herrera, F., Fyfe, C., Benítez, J.M., Ali, M. (eds.) IEA/AIE 2010, Part I. LNCS, vol. 6096, pp. 531-540. Springer, Heidelberg (2010)
10. Zhang, P., Zhu, X., Guo, L.: Mining Data Streams with Labeled and Unlabeled Training Examples. In: Proceedings of the Ninth IEEE International Conference on Data Mining. IEEE Computer Society (2009)
11. Widyantoro, D.H., Yen, J.: Relevant data expansion for learning concept drift from sparsely labeled data. IEEE Transactions on Knowledge and Data Engineering 17(3), 401-412 (2005)
12. Woolam, C., Masud, M.M., Khan, L.: Lacking Labels in the Stream: Classifying Evolving Stream Data with Few Labels. In: Rauch, J., Raś, Z.W., Berka, P., Elomaa, T. (eds.) ISMIS 2009. LNCS, vol. 5722, pp. 552-562. Springer, Heidelberg (2009)
13. Ditzler, G., Polikar, R.: Semi-supervised learning in nonstationary environments. IEEE
14. Kantardzic, M., Ryu, J.W., Walgampaya, C.: Building a New Classifier in an Ensemble Using Streaming Unlabeled Data. In: García-Pedrajas, N., Herrera, F., Fyfe, C., Benítez, J.M., Ali, M. (eds.) IEA/AIE 2010, Part I. LNCS, vol. 6097, pp. 77-86. Springer, Heidelberg (2010)
15. Zhou, Z.-H., Li, M.: Tri-Training: Exploiting Unlabeled Data Using Three Classifiers. IEEE Transactions on Knowledge and Data Engineering 17(11), 1529-1541 (2005)
16. Angluin, D., Laird, P.: Learning From Noisy Examples. Machine Learning 2(4), 343-370 (1988)
17. Street, W.N., Kim, Y.: A streaming ensemble algorithm (SEA) for large-scale classification. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, San Francisco (2001)
18. Zhu, X.: Stream Data Mining Repository (2010), http://www.cse.fau.edu/~xqzhu/stream.html
19. Frank, A., Asuncion, A.: UCI Machine Learning Repository (2010), http://archive.ics.uci.edu/ml
20. Katakis, I., Tsoumakas, G., Vlahavas, I.: Tracking recurring contexts using ensemble classifiers: an application to email filtering. Knowledge and Information Systems 22(3), 371-391 (2009)
21. Harries, M.B., Sammut, C., Horn, K.: Extracting hidden context. Machine Learning 32(2), 101-126 (1998)
22. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann (2005)
23. Bifet, A., et al.: MOA: Massive Online Analysis. Journal of Machine Learning Research 11, 1601-1604 (2010)
