2008 IEEE International Conference on Data Mining Workshops
One-class Classification of Text Streams with Concept Drift∗

Xue Li†
University of Queensland, Australia
[email protected]

Yang Zhang
University of Queensland, Australia; Northwest A&F University, China
[email protected]

Maria Orlowska
Polish-Japanese Institute of Information Technology, Poland
[email protected]

∗ Australian ARC Discovery Project DP0558879.
† Corresponding author.
Abstract
Research on streaming data classification has mostly been based on the assumption that the data can be fully labelled. This assumption is impractical. First, it is impossible to complete the labelling before all data has arrived. Second, it is generally very expensive to obtain fully labelled data by human effort. Third, user interests may change with time, so labels issued earlier may be inconsistent with labels issued later; this represents concept drift. In this paper, we consider the problem of one-class classification of a text stream under concept drift, where a large volume of documents arrives at high speed while user interests and the data distribution change. In this setting, only a small number of positively labelled documents is available for training. We propose a stacking-style ensemble-based approach and compare it with window-based approaches, namely the single window, fixed window, and full memory approaches. Our experimental results demonstrate that the proposed ensemble approach outperforms all the other approaches.

1. Introduction

Current research on data stream classification mostly focuses on fully labelled data streams. However, this is not practical in real-life applications, as it is generally too expensive to obtain a fully labelled data stream with limited human resources. Suppose the customer service section of a large enterprise receives thousands of user feedback emails every day. Every day, the manager of the section only wants to spend a few minutes finding the feedback emails about a newly launched product for detailed investigation; hence a text data stream classification system is expected to retrieve all the ontopic feedback in the incoming email stream. The same holds for news readers browsing online. Users may only have limited patience to label a part of the ontopic text documents from the incoming news stream, and a text classifier is expected to retrieve all the related news stories.

In the problem of one-class classification [14][13][18][23][6], a class of objects, called the target class, has to be distinguished from all other objects. The description of the target class should be constructed such that the possibility of accepting objects that do not originate from the target class is minimized. We propose to integrate one-class classification approaches with streaming data mining, so as to construct a text stream classifier without using negative training samples. The following challenges are identified.

• Concept drift. Users may change their interests from time to time, and the distribution of text categories in the stream may also change with time.
• Small number of training samples. In order to release the user from the heavy burden of labelling all samples in the incoming stream, only a small number of labelled samples is provided to the system.
• No negative training samples. The user may not be patient enough to provide negative samples to the system. For example, a news reader may just identify the news stories she/he is interested in and then expect to find all the ontopic news stories from today's newswire.
• Noisy data. With thousands of features (e.g., meaningful words in documents), text in natural language is inherently noisy because of word sense ambiguity (i.e., polysemous or homonymous words).
• Limited memory space. Only a limited amount of memory is available to the classification system.
An ensemble of one-class classifiers is proposed to meet the challenges described above. In the experiments, we compare our ensemble-based algorithm with the corresponding window-based approaches, namely the single window, fixed window, and full memory approaches. Various scenarios are simulated in our experiments, and the results demonstrate that the ensemble-based algorithm outperforms all the others.

This paper is organized as follows. Section 2 reviews the related work. Section 3 presents our framework for one-class text stream classification. The algorithms for learning under concept drift are given in Section 4. Section 5 reports our experimental results, and Section 6 concludes this paper.
2. Related Work

To the best of our knowledge, there is no work so far on one-class classification in a data stream scenario, although a few works have discussed the classification of partially labelled data streams. In [22], Wu et al. proposed a semi-supervised classifier, which uses a small number of both positive and negative samples to train an initial classifier, without any further user feedback or training samples. This means that their algorithm cannot cope with concept drift in data streams. Text classifiers built from positive and unlabelled examples are also discussed in [14][13][15], where concept drift is not considered as an issue.

Active learning on data streams is proposed in [4][5][7] and [26], which trains an initial classifier on a partially labelled data stream and requires the true labels of certain unlabelled samples for enhancing the classifier. The algorithms in [4][5][7] estimate the error of the model on new data without knowing the true class labels. As concept drift can be caused by a change of user interests, a change of data distribution, or both, the algorithms proposed in [4][5][7] cannot cope with concept drift caused by a sudden shift of user interests; their systems cannot detect this kind of concept drift without knowing the true class labels. Moreover, in real-life applications, the system is always fed with an overwhelming volume of incoming data, which makes it impractical to require human investigation of the true class labels of unlabelled samples.

Learning concept drift from both labelled and unlabelled text data is proposed in [9] and [21]. The algorithms proposed by Klinkenberg et al. in [9] need more than one scan of the dataset, which makes them inapplicable in a data stream scenario. In [21], Widyantoro et al. focused on expanding the labelled training samples using relevant unlabelled data, so as to "extend the capability of existing concept drift learning algorithms".

Topic tracking [1], a sub-task of topic detection and tracking (TDT), tries to retrieve all ontopic news stories from a stream of news stories given a few initial ontopic samples, and this is related to our work. For TDT, the concept drift is caused by the evolution of the news story itself, while in our work the concept drift is caused by changes in user interests and/or changes in the data distribution.

The task of information filtering [12][10] is to classify documents from a stream as either relevant or non-relevant according to a particular user interest, with the objective of reducing the information load. In adaptive information filtering [12], a sub-task of information filtering, it is assumed that there is an incoming stream of documents and that each user interest is represented by a persistent profile. Each incoming document is matched against each profile, and (well-)matched documents are sent to the user. The user is assumed to provide feedback on each document sent to him/her. In our work, no user feedback is supplied.

It is generally believed that an ensemble classifier can achieve better classification accuracy than a single classifier [2]. A range of ensemble classifiers has been proposed by the research community [17, 20, 11, 24, 25, 19]. The initial papers on classifying data streams by ensemble methods use static majority voting [17] and static weighted voting [20], while the current trend is towards dynamic methods: dynamic voting in [11], [24], and [19], and dynamic classifier selection in [25] and [19]. It is concluded in [19] that dynamic methods perform better than static methods. However, [17, 20, 11, 24, 25, 19] are all dedicated to fully labelled data streams.

In [23] and [6], algorithms are proposed that successfully build text classifiers from positive and unlabelled text documents. We use the idea discussed in [6] to expand the training data from positive-only to both positive and negative samples for training the base classifiers, and then ensemble stacking is used to cope with concept drift.
3. Framework for One-class Text Stream Classification

In this paper, we follow the assumption that the text stream arrives in batches of variable length [9]:

d_{1,1}, d_{1,2}, · · · , d_{1,m_1}; d_{2,1}, d_{2,2}, · · · , d_{2,m_2}; · · · ; d_{n,1}, d_{n,2}, · · · , d_{n,m_n}; · · ·    (1)

Here, d_{i,j} represents the j-th document in the i-th batch. Each text in the stream is labelled with a category. In each batch, some of the categories are considered ontopic, and others are not. We write d_{i,j} = <X_{i,j}, y_{i,j}>, where X_{i,j} ∈ R^n represents a sample text in the stream, and y_{i,j} ∈ {+1, −1} indicates whether the document is ontopic (y_{i,j} = +1) or not (y_{i,j} = −1). In each batch, only a small number of ontopic (positive) samples is given as the
training data. The task of one-class text stream classification is to find all the ontopic documents in the text stream. Algorithm 1 gives a general framework of ensemble learning for one-class text stream classification.
Algorithm 1 Framework of ensemble learning for one-class data stream classification.
Input: The set of positive samples for the current batch, P_n; the set of unlabelled samples for the current batch, U_n; the ensemble of classifiers on former batches, E_{n−1};
Output: The ensemble of classifiers on the current batch, E_n;
1: Extract the set of reliable negative and/or positive samples T_n from U_n with the help of P_n;
2: Train an ensemble of classifiers E on T_n ∪ P_n, with the help of data in former batches;
3: E_n = E_{n−1} ∪ E;
4: Classify the samples in U_n − T_n by E_n;
5: Delete some weak classifiers in E_n so as to keep the capacity of E_n;
6: return E_n;

It is shown in [21] that the learning task becomes more difficult when the learner has only a few samples to learn from. Currently, most state-of-the-art classifiers require both positive and negative samples for training. Therefore, in step 1 of Algorithm 1, we extract T_n, a set of reliable negative samples, from U_n, the set of unlabelled samples of the current batch, so as to create a training dataset T_n ∪ P_n. Any one-class text classification algorithm [23][6] could be used as a plug-in in this step for extracting samples; in this paper, we simply use the successful algorithm proposed by Fung et al. [6] for extracting reliable negative samples. In step 3, the newly learned classifiers are added to E_{n−1}, the ensemble of classifiers built on former batches of data. In step 4, the new ensemble of classifiers, E_n, is used to determine the class labels of the unlabelled samples in U_n − T_n. In step 5, some weak classifiers are deleted from the ensemble, so as to keep the population capacity of the ensemble bounded.

4. Learning under Concept Drift

Under concept drift, one of the challenges for a data stream classification system is that there are not sufficient samples for training a robust classifier for the concept in the current batch of data. Generally, there are three ways to cope with this problem.

1. Instance selection, that is, selecting from former batches the instances whose concept is similar to that of the current batch, and using these instances for training [8][3].
2. Instance weighting, which uses the ability of some learning algorithms, such as support vector machines (SVM), to process weighted instances [8].
3. Ensemble learning, which maintains a collection of concept descriptions and combines their predictions by voting, base classifier selection, etc. [17]-[19].

In this paper, we propose to use stacking-style ensemble learning to cope with concept drift.

4.1. Training Base Classifiers

On each batch of data, only a small number of positive training samples is given by the user. The positive samples of the current batch, say d_n, are not sufficient to train a good text classifier on d_n. Hence, it would be greatly helpful if we could reuse the positive samples in previous batches d_i (i < n). However, as pointed out in [3], using old data blindly is no better than "gambling". In this paper, for each batch d_n, two base classifiers are trained. One base classifier is trained with the positive samples in d_n exclusively, and the other is trained on the positive samples in d_n and some previous batches. The decision of which base classifier is more helpful for classification is left to the ensemble stacking learner. The algorithm for training base classifiers is given in Algorithm 2, which is used in step 2 of Algorithm 1.

Algorithm 2 Training base classifiers on batch d_n.
Input: Training dataset of the current batch, T_n ∪ P_n; ensemble of classifiers on the former batches, E_{n−1}; parameter M;
Output: Ensemble of newly generated base classifiers, E;
1: T = T_n ∪ P_n;
2: Train base classifier C on T;
3: for all c such that c ∈ RecentMBatch(E_{n−1}) do
4:   T = T ∪ PosSample(c);
5: end for;
6: T = RecentMBatch(T);
7: Train base classifier C' on T;
8: E = {C, C'};
9: return E;

In step 4 of Algorithm 2, the function PosSample(c) returns the set of positive samples used to train the base classifier c. The functions RecentMBatch(E_{n−1}) in step 3 and RecentMBatch(T) in step 6 return the base classifiers from the most recent M batches in E_{n−1} and the training samples from the most recent M batches in T, respectively.
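To make the per-batch procedure concrete, the following sketch illustrates how the two base classifiers of Algorithm 2 might be trained. This is our own illustration in Python with scikit-learn, not the paper's Java/WEKA implementation; T_n is assumed to have already been extracted from U_n by the plug-in of step 1 of Algorithm 1 (e.g., with the method of Fung et al. [6]).

```python
# Illustrative sketch only; the paper's system is implemented in Java with WEKA.
# P_n: feature matrix of the positive samples of the current batch.
# T_n: reliable negatives already extracted from the unlabelled samples U_n.
# history: list of positive-sample matrices of previous batches, newest last.
import numpy as np
from sklearn.svm import LinearSVC

def train_base_classifiers(P_n, T_n, history, M=4):
    """Return the two base classifiers C and C' of Algorithm 2."""
    X = np.vstack([P_n, T_n])
    y = np.hstack([np.ones(len(P_n)), -np.ones(len(T_n))])

    C = LinearSVC().fit(X, y)            # trained on the current batch only

    # Reuse the positive samples of the most recent M previous batches.
    old_pos = history[-M:]
    if old_pos:
        X2 = np.vstack([X] + old_pos)
        y2 = np.hstack([y, np.ones(sum(len(p) for p in old_pos))])
    else:
        X2, y2 = X, y
    C_prime = LinearSVC().fit(X2, y2)    # trained with borrowed old positives
    return C, C_prime
```

Which of the two classifiers should be trusted on a given batch is then left to the stacking learner described next.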
4.2. Ensemble Stacking
Various heuristic approaches have been proposed for ensembling base classifiers to cope with concept drift [17, 20, 25]. Let us write c_i for the i-th base classifier in the ensemble E_n. In previous research [17, 20, 25], the formula used to predict the class label of an unlabelled sample d by E_n can be summarized as:

E_n(d) = Σ_{i=1}^{|E_n|} α_i c_i(d)    (2)

Here, α_i ∈ R (1 ≤ i ≤ |E_n|) represents the weight (importance) of the i-th base classifier. For the majority voting used in [17], we have α_i ∈ {0, +1}. For the accuracy-weighted voting in [20], α_i represents the accuracy of c_i. For the dynamic classifier selection in [25], we have Σ_{i=1}^{|E_n|} α_i = 1, α_i ∈ {0, +1}. In this paper, we propose to learn the importance of a base classifier for classification by an ensemble stacking learner.

Algorithm 3 Ensemble stacking learner for one-class data stream classification.
Input: Training dataset of the current batch, T_n ∪ P_n; ensemble of classifiers on the current batch, E_n;
Output: Stacking classifier on the current batch, S_n;
1: Stacking training set T_s = ∅;
2: for all d = <X, y> such that d ∈ T_n ∪ P_n do
3:   for all c_i such that c_i ∈ E_n do
4:     x_i = Classify_{c_i}(X);
5:   end for;
6:   T_s = T_s ∪ {<x_1, x_2, · · · , x_{|E_n|}, y>};
7: end for;
8: S_n = Learn(T_s);
9: return S_n;

Algorithm 4 Ensemble stacking classifier for one-class data stream classification.
Input: Unlabelled sample, d; ensemble of classifiers on the current batch, E_n; stacking classifier on the current batch, S_n;
Output: The class label of d;
1: for all c_i such that c_i ∈ E_n do
2:   x_i = Classify_{c_i}(d);
3: end for;
4: l_d = Classify_{S_n}(<x_1, x_2, · · · , x_{|E_n|}>);
5: return l_d;

Algorithm 3 gives the algorithm for training the stacking learner on the ensemble of classifiers, and Algorithm 4 shows the ensemble stacking classifier for one-class data streams. These two algorithms are used in step 4 of Algorithm 1. As the RBF kernel, K(U, V) = exp(−γ||U − V||²), has good expressive ability, in this paper we use an SVM with RBF kernel for the stacking classifier in step 8 of Algorithm 3 and step 4 of Algorithm 4.
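As a concrete illustration of Algorithms 3 and 4 (a sketch in Python with scikit-learn, not the paper's Java/WEKA code), each base classifier's prediction becomes one meta-feature, and an SVM with RBF kernel (γ = 0.5, the value used in our experiments) is trained on these meta-features:

```python
# Sketch of the ensemble stacking learner (Algorithm 3) and classifier (Algorithm 4).
# Base classifiers are assumed to expose predict(), returning labels in {+1, -1}.
import numpy as np
from sklearn.svm import SVC

def meta_features(ensemble, X):
    # One column per base classifier: x_i = Classify_{c_i}(d) for each sample d.
    return np.column_stack([c.predict(X) for c in ensemble])

def train_stacking(ensemble, X_train, y_train, gamma=0.5):
    """Algorithm 3: learn the stacking classifier S_n on T_n ∪ P_n."""
    Ts = meta_features(ensemble, X_train)
    return SVC(kernel="rbf", gamma=gamma).fit(Ts, y_train)

def stacking_classify(ensemble, stacker, X):
    """Algorithm 4: label unlabelled samples from the stacked base predictions."""
    return stacker.predict(meta_features(ensemble, X))
```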
4.3. Base Classifier Selection

As new base classifiers accumulate, the population of the ensemble keeps growing. Because only limited memory is available, we have to perform base classifier selection to keep the population capacity of the ensemble bounded.

As we are more interested in ontopic documents, we measure the importance of a base classifier by its accuracy on positive training samples. Algorithm 5 shows our base classifier selection algorithm. In this paper, we argue that two common phenomena can be observed from the behaviour of a news reader.

• If the reader is very interested in a certain topic today, say sports, then there is a high probability that he is also interested in sports tomorrow.
• If the reader is interested in a topic, say sports, and for some reason his interest changes to another topic, say politics, then after some time there is a high probability that his interest will change back to sports again.

From these two observations, we draw the following two conclusions:

• The ensemble should keep some recent base classifiers. This is shown in steps 11 to 16 of Algorithm 5.
• The ensemble should keep some base classifiers that represent the reader's interests in the long run. This is shown in steps 18 to 23 of Algorithm 5.

In Algorithm 5, L1 and L2 are user-defined parameters that control the population capacity of the classifier ensemble. We measure the accuracy of the base classifiers only on positive samples, as under the one-class classification scenario we only have positive training samples. The function PosPredict(c_i, n) in step 5 returns the number of correct positive predictions made by c_i on batch d_n, and the function PosPredictRate(c_i) in step 20 is defined as:

PosPredictRate(c_i) = ( Σ_{j=1}^{n} PosPredict(c_i, j) ) / ( Σ_{j=1}^{n} |d_j| )    (3)

This formula measures the interest of the reader in the concept described by base classifier c_i. In step 5, k is the size of the positive training set, and MinAcc is the minimum accuracy tolerated, which is simply set to 0.2 in our experiments. This parameter is used to filter out base classifiers with bad performance.
Algorithm 5 Base Classifier Selection.
Input: Training dataset of the current batch of data, T_n ∪ P_n; ensemble of classifiers on the current batch, E_n; parameter: maximum population size of E_n, L1 + L2; parameter: minimum accuracy tolerated, MinAcc;
Output: The ensemble of classifiers, E_n;
1: // Selection by accuracy on positive samples.
2: Get the stacking training set T_s following Algorithm 3;
3: D = ∅;
4: for all c_i such that c_i ∈ E_n do
5:   if (PosPredict(c_i, n)/k < MinAcc) then
6:     D = D ∪ {c_i};
7:   end if
8: end for;
9: E_n = E_n − D;
10:
11: // Reusing recent base classifiers.
12: while (|E_n| ≤ L1) and (D ≠ ∅) do
13:   Select the most recent classifier c_i from D;
14:   D = D − c_i;
15:   E_n = E_n + c_i;
16: end while
17:
18: // Reusing the most often used base classifiers.
19: if (D ≠ ∅) then
20:   Sort the c_i in D according to PosPredictRate(c_i) in descending order;
21:   D = the top L2 classifiers in the list;
22:   E_n = E_n + D;
23: end if
24: return E_n;
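A compact sketch of the selection policy of Algorithm 5 follows (again our own Python illustration; pos_predict and pos_predict_rate are assumed to be supplied by the ensemble's bookkeeping and correspond to PosPredict and PosPredictRate above):

```python
# Sketch of Algorithm 5: drop base classifiers that are inaccurate on the positive
# samples of the current batch, then refill with the most recent dropped ones and
# with those that have been most often useful in the long run.
def select_base_classifiers(ensemble, pos_predict, pos_predict_rate,
                            k, min_acc=0.2, L1=6, L2=2):
    """ensemble: list of base classifiers, ordered oldest to newest.
    pos_predict(c): correct positive predictions of c on the current batch.
    pos_predict_rate(c): long-run positive prediction rate of c, Eq. (3).
    k: number of positive training samples in the current batch."""
    kept = [c for c in ensemble if pos_predict(c) / k >= min_acc]
    dropped = [c for c in ensemble if pos_predict(c) / k < min_acc]

    # Reuse the most recent dropped classifiers while capacity L1 allows.
    while len(kept) <= L1 and dropped:
        kept.append(dropped.pop())          # pop() takes the newest first

    # Reuse up to L2 of the remaining classifiers with the best long-run record.
    dropped.sort(key=pos_predict_rate, reverse=True)
    kept.extend(dropped[:L2])
    return kept
```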
5. Experiments

In this section, we report our experimental results. The algorithms are implemented in Java with the help of the WEKA¹ software package. Our experiments are run on a PC with a Pentium 4 3.0 GHz CPU and 1 GB of memory.

The 20NewsGroup² dataset is used in our experiments. The documents are messages posted to Usenet newsgroups, and the categories are the newsgroups themselves. There are 20 categories in this dataset, with almost 1000 text documents in each category. We remove the subject header from each text document, as it strongly implies the category of the document. The preprocessing of the text documents includes stemming and stop word removal. After preprocessing, each document is represented by a vector weighted by the TFIDF scheme.

F1 is widely used for measuring the classification performance of text classifiers [16], and in this paper we also report our experimental results in F1. Suppose there are n batches of text data observed so far. In order to measure the classification performance on the text stream, we define the averaged F1 for the whole text data stream as the average of F1 over the batches observed from the stream so far: F1_ave = ( Σ_{i=1}^{n} F1_i ) / n.

Following the experiment settings of [9][10], which are dedicated to learning concept drift from text streams, in our experiments we compare our ensemble algorithm with three other window management approaches:

• Single Window (SW): The classifier is built on the current batch of data.
• Full Memory (FM): The classifier is built on the current batch of data, together with the positive samples dating back to batch 0.
• Fixed Window (FW): The classifier is built on the samples from a fixed-size window. Here, we set the window size s to s = 3 (FW3) and s = 4 (FW4).
• Ensemble (EN): The classifier is built by the algorithms proposed in this paper.

With the accumulation of positive training data, the memory needed by FM will eventually exceed the maximum memory capacity of the computer, which makes this approach inapplicable in a data stream scenario. For FW, the choice of a "good" window size is a compromise between fast adaptivity (small window) and good generalization in phases without concept change (large window) [8]. Nevertheless, we use FM and FW here for performance comparison.

We set the parameters L1 = 6, L2 = 2, and M = 4; M = 10 does not show significantly different experimental results from M = 4. We set γ = 0.5 for the RBF kernel in all our experiments. Linear SVM is used as the base classifier, as it is reported in the literature [16] that linear SVM performs excellently for text classification tasks.
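For reference, the averaged F1 defined above can be maintained incrementally as the stream is processed; a minimal sketch (our own illustration, using scikit-learn's F1 implementation) is:

```python
# Per-batch F1 on the ontopic (positive) class, and the running average F1_ave
# over all batches observed so far.
from sklearn.metrics import f1_score

class StreamF1:
    def __init__(self):
        self.per_batch = []          # F1 of each batch processed so far

    def update(self, y_true, y_pred):
        """y_true, y_pred: +1 for ontopic, -1 otherwise, for one batch."""
        self.per_batch.append(f1_score(y_true, y_pred, pos_label=1))
        return self.per_batch[-1]

    def averaged_f1(self):
        # F1_ave = (1/n) * sum_i F1_i, where n is the number of batches so far
        return sum(self.per_batch) / len(self.per_batch)
```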
5.1. Concept Drift Caused by Changing User Interests

In this group of experiments, following the experiment settings in [9][10], three scenarios are simulated.

• Scenario A: The concept is TA in batches 0∼2. It drifts slowly to concept TB in batches 3∼6, and it keeps stable in batches 7∼9.
• Scenario B: The concept is TA in batches 0∼4. It drifts suddenly to TB in batch 5, and keeps stable from then on.
• Scenario C: The concept is TA in batches 0∼3. It shifts to TB in batches 4 and 5, and shifts back to TA again in batch 6.
¹ http://www.cs.waikato.ac.nz/ml/weka/
² http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.data.html
Table 1. Scenarios of changing user interests.

Batch ID   A(%)   B(%)   C(%)
0          100    100    100
1          100    100    100
2          100    100    100
3           80    100    100
4           60    100      0
5           40      0      0
6           20      0    100
7            0      0    100
8            0      0    100
9            0      0    100
Referring to Tab. 1, column 1 lists the batch IDs. Columns 2∼4 give the percentages of the training samples that come from TA, with the rest of the training samples coming from TB, for scenarios A, B, and C, respectively. Each scenario is made up of ten batches of text data, and each batch contains 2000 text documents, with 100 text documents from each of the 20 categories. In each batch, the positive dataset is composed of k text documents randomly selected from TA and/or TB, and the rest of the documents in the batch are used as unlabelled data.

Five categories, rec.motorcycles, rec.sport.baseball, sci.crypt, sci.space, and talk.politics.mideast, are selected randomly from the 20 categories of the 20NewsGroup dataset. We randomly select one of them as topic TA and another one as topic TB. Scenarios A∼C are generated simulating concept drift from TA to TB. Hence, we have C(5,2) = 10 trials. The averaged classification performance over the 10 trials is reported here as the final result.

Fig. 1 shows the experimental results with k = 20. In Fig. 1, Single Window, Fixed Window s=3, Fixed Window s=4, Full Memory, and Ensemble represent the classification results for the single window (SW), fixed window with window size 3 (FW3), fixed window with window size 4 (FW4), full memory (FM), and the ensemble algorithm (EN) proposed in this paper, respectively. To make the figures more readable, at the bottom of each figure there is a curve representing the simulated concept drift, labelled Concept Drift.

Fig. 1(A) presents the classification results for scenario A. In batches 0, 1, and 2, EN has similar performance to FW3, FW4, and FM. As all of them accumulate positive training samples, an improvement in F1 can be observed over these batches. In batches 3∼6, where there is gradual concept drift, EN outperforms all the other approaches, which shows the strong ability of EN to cope with concept drift. From batch 7 to 9, the concept is stable again, and the improvement in F1 can be observed once more for EN, FW3, and FW4, as they are able to accumulate training samples. In batch 9, FW3 is built with training samples only from TB. The classification performance of EN is similar to that of FW3, which shows the strong ability of EN to forget offtopic training samples. It should be pointed out that SW, FW3, FW4, FM, and EN should have the same F1 value in batch 0; however, slightly different F1 values are observed. This tiny difference comes from the randomness of the SVM learner.

Fig. 1(B) presents the classification results for scenario B, in which the concept shifts suddenly from TA to TB at batch 5. An obvious drop in the classification performance of FW3, FW4, and FM can be observed at batch 5. The classification result of EN is only a little worse than that of SW, and is much better than those of FW3, FW4, and FM. This shows the good ability of EN to detect concept drift, as the algorithm is only fed with 20 positive samples in each batch.

Fig. 1(C) presents the classification results for scenario C. Again, EN performs better than FW3, FW4, and FM at batch 4, where the concept shifts from TA to TB, and EN performs better than FW3 and FW4 at batch 6, where the concept shifts from TB back to TA. As linear SVM is an excellent text classifier, the additional positive samples from batches 4 and 5 do not cause a big drop for FM at batch 6. It can be observed that at batch 6 the classification performance of EN is better than that of SW, which suggests the ability of EN to borrow positive training samples from batches 0∼4.

We compare the averaged F1 of the different approaches for all of the simulated scenarios in Tab. 2. In Tab. 2, column 1 gives the simulated scenario; column 2 gives the size of the training dataset; and columns 3∼7 give the averaged F1 of SW, FW3, FW4, FM, and EN, respectively. It is obvious that EN outperforms SW, FW3, FW4, and FM on scenarios A∼C.
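For concreteness, the way the k positive samples of each batch are drawn from TA and TB according to the percentages of Tab. 1 can be sketched as follows (our own illustration, not the original experiment code):

```python
# Draw the k positive training documents of batch b for one scenario in Tab. 1.
# pct_TA[b] is the percentage of positives taken from topic TA in batch b;
# docs_TA and docs_TB are the pools of documents of the two ontopic categories.
import random

def sample_positives(b, k, pct_TA, docs_TA, docs_TB):
    n_a = round(k * pct_TA[b] / 100.0)
    return random.sample(docs_TA, n_a) + random.sample(docs_TB, k - n_a)

# Example: in scenario A with k = 20, batch 3 takes 16 positives from TA
# and the remaining 4 from TB (80% of 20).
```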
5.2. Heavy Concept Drift vs. Gradual Concept Drift

In this group of experiments, we analyse the ability of our algorithm to cope with heavy and gradual concept drift. The scenarios are given in Tab. 3. Scenarios D, E, F, and G simulate heavy concept drift, with scenario D the heaviest; scenario I simulates extremely gradual concept drift; and there is no concept drift at all in scenario H. We report the experimental results for k = 20 in Fig. 2.

The experimental result for scenario D is given in Fig. 2(A). This scenario simulates extreme concept drift, with a sudden concept change in every batch. As EN has a good ability to detect concept drift, there is no obvious drop in its result at batch 2, while for FW3, FW4, and FM there is a big drop. The performance of EN remains the best in batches 3∼7.
[Figure 1: panels (A), (B), and (C) plot F1 against Batch ID for scenarios A, B, and C, with curves for Single Window, Fixed Window s=3, Fixed Window s=4, Full Memory, Ensemble, and the simulated Concept Drift.]
Figure 1. Experiment result for changing user interests.
[Figure 2: panels (A)∼(F) plot F1 against Batch ID for scenarios D, E, F, G, H, and I, with the same curves as in Figure 1.]
Figure 2. Experiment result for heavy & gradual concept drift.
[Figure 3: panels (A), (B), and (C) plot F1 against Batch ID for scenarios J, K, and L, with the same curves as in Figure 1.]
Figure 3. Experiment result for uneven data distribution.
[Figure 4: panels (A), (B), and (C) plot F1 against Batch ID for scenarios M, N, and O, with curves for Single Window, Fixed Window s=3, Fixed Window s=4, Full Memory, and Ensemble.]
Figure 4. Experiment result for five ontopic categories.
Table 2. Averaged F1 for all scenarios.

Sce  k    Single Window  Fixed Window s=3  Fixed Window s=4  Full Memory  Ensemble
A    10   0.259          0.373             0.392             0.369        0.410
A    20   0.427          0.518             0.518             0.497        0.533
A    30   0.482          0.555             0.559             0.545        0.560
B    10   0.336          0.424             0.425             0.383        0.483
B    20   0.509          0.564             0.561             0.524        0.616
B    30   0.572          0.598             0.574             0.546        0.608
C    10   0.354          0.383             0.379             0.441        0.488
C    20   0.532          0.516             0.520             0.554        0.610
C    30   0.589          0.568             0.574             0.590        0.632
D    10   0.344          0.292             0.248             0.336        0.345
D    20   0.494          0.461             0.413             0.493        0.518
D    30   0.587          0.549             0.511             0.557        0.579
E    10   0.368          0.230             0.265             0.362        0.439
E    20   0.523          0.411             0.449             0.507        0.565
E    30   0.582          0.475             0.512             0.556        0.592
F    10   0.350          0.333             0.299             0.379        0.431
F    20   0.529          0.466             0.434             0.513        0.561
F    30   0.576          0.522             0.505             0.561        0.588
G    10   0.334          0.364             0.351             0.385        0.458
G    20   0.528          0.521             0.498             0.518        0.578
G    30   0.591          0.563             0.543             0.571        0.614
H    10   0.347          0.504             0.525             0.555        0.567
H    20   0.533          0.628             0.646             0.654        0.645
H    30   0.586          0.651             0.662             0.662        0.648
I    10   0.151          0.239             0.255             0.295        0.313
I    20   0.316          0.430             0.446             0.476        0.496
I    30   0.380          0.480             0.499             0.523        0.521
J    10   0.259          0.361             0.369             0.348        0.389
J    20   0.432          0.520             0.525             0.508        0.543
J    30   0.501          0.562             0.569             0.559        0.560
K    10   0.339          0.429             0.425             0.384        0.484
K    20   0.527          0.569             0.556             0.526        0.595
K    30   0.599          0.613             0.601             0.571        0.624
L    10   0.317          0.361             0.358             0.430        0.458
L    20   0.545          0.527             0.530             0.555        0.603
L    30   0.602          0.586             0.587             0.600        0.631
M    20   0.274          0.338             0.329             0.302        0.389
M    40   0.420          0.477             0.478             0.449        0.492
N    20   0.315          0.378             0.366             0.324        0.427
N    40   0.467          0.510             0.498             0.455        0.511
O    20   0.298          0.318             0.312             0.336        0.402
O    40   0.474          0.487             0.479             0.477        0.518
Table 3. Scenarios of heavy & gradual concept drift.

Batch ID   D(%)   E(%)   F(%)   G(%)   H(%)   I(%)
0          100    100    100    100    100    100
1            0    100    100    100    100     90
2          100      0    100    100    100     80
3            0      0      0    100    100     70
4          100    100      0      0    100     60
5            0    100      0      0    100     50
6          100      0    100      0    100     40
7            0      0    100      0    100     30
8          100    100    100    100    100     20
9            0    100      0    100    100     10
FM performs the best in batches 8 and 9, as it has enough training samples for the topic {TA, TB}.

The experimental result for scenario E is given in Fig. 2(B). There are concept shifts at batches 2, 4, 6, and 8, and hence the classification performance of FW3 fluctuates heavily. It is obvious that in most cases EN performs the best.

Fig. 2(C) and Fig. 2(D) give the experimental results for scenarios F and G, respectively. Again, the heavy performance drops at batches 3, 6, and 9 for FW3 and FW4 in Fig. 2(C) are not observed for EN, and EN performs much better than FW3 and FW4 at batches 4 and 8 in Fig. 2(D).

It can be observed from Fig. 2(A)∼Fig. 2(D) that after accumulating enough training samples, FM always achieves good classification performance. We believe this is because SVM is very robust against noise and {TA, TB} is effectively taken as the final category. However, this good performance is obtained at the sacrifice of classification performance at batch 1 in Fig. 2(A), batch 2 in Fig. 2(B), batch 3 in Fig. 2(C), and batch 4 in Fig. 2(D), where there is a sudden drift to a previously unseen concept. At these batches, EN performs much better than FM.

Fig. 2(E) gives the classification result for scenario H, in
which there is no concept drift at all. The classification performances of FW3, FW4, FM, and EN are very close to each other, with FM outperforming the others by a very small margin. This experiment shows that even when there is no concept drift, EN still performs excellently. Fig. 2(F) gives the classification result for scenario I, in which there is a very gradual concept drift. It is obvious that EN outperforms all the other approaches. On the whole, it can be concluded from Tab. 2 that EN outperforms the other approaches on scenarios D∼I.
Table 4. Scenarios of uneven data distribution.

Batch ID   GA   GB   GC   J(%)   K(%)   L(%)
0          +    –    +    100    100    100
1          +    –    +    100    100    100
2          +    –    +    100    100    100
3          +    –    +     80    100    100
4          +    –    +     60    100      0
5          +    +    –     40      0      0
6          +    +    –     20      0    100
7          +    +    –      0      0    100
8          +    +    –      0      0    100
9          +    +    –      0      0    100
5.3. Experiments on Uneven Data Distribution

Here, we study the situation where concept drift is caused by a change of user interests together with a change of the data distribution. Three scenarios are simulated in this group of experiments. There are 20 categories in the 20NewsGroup dataset, and we divide them into three groups. Group GA is made up of the 5 categories described in Section 5.1 and another 5 randomly selected categories. The remaining 10 categories are randomly divided into two groups, GB and GC, each consisting of 5 categories.

Tab. 4 describes the three scenarios simulated in this group of experiments. The first column of Tab. 4 lists the batch IDs. Columns 2∼4 show the occurrence/non-occurrence of each group of categories, where "+" denotes occurrence and "–" non-occurrence. Columns 5∼7 show the three scenarios, J, K, and L; the value in each cell gives the percentage of training samples that belong to TA, with the rest of the training samples coming from TB.

If a certain category occurs in a certain batch, the number of samples of this category in the batch is simulated by the Poisson distribution, Poi(λ): P(l) = λ^l e^(−λ) / l!, which is often used to model the number of events occurring within a given time interval. The number of samples for each category in group GA is modelled by Poi(100), and the number of samples for each category in groups GB and GC is modelled by Poi(200).

Fig. 3 shows our experimental results for training dataset size k = 20. Fig. 3(A), Fig. 3(B), and Fig. 3(C) show the results for scenarios J, K, and L, respectively. From Fig. 3(A)∼Fig. 3(C), we can observe behaviour similar to that of scenarios A, B, and C in Fig. 1(A)∼Fig. 1(C): EN outperforms the other approaches during batches 3∼6 for scenario J, outperforms FW3, FW4, and FM at batch 5 for scenario K, and outperforms FW3, FW4, and FM at batch 4 for scenario L. The results are summarized in Tab. 2. These results show the good ability of EN to cope with concept drift caused by the changing of user interests and the changing of data distribution acting together.
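A minimal sketch (our own illustration using NumPy, not the original experiment code) of how the per-batch category sizes described above can be simulated:

```python
# Simulate the per-batch category sizes of Section 5.3: each occurring category
# in group GA follows Poi(100); categories in GB and GC follow Poi(200).
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, for reproducibility of the sketch

def batch_category_sizes(present_GA, present_GB, present_GC):
    """Each argument lists the categories of that group present in this batch."""
    sizes = {c: int(rng.poisson(100)) for c in present_GA}
    sizes.update({c: int(rng.poisson(200)) for c in present_GB + present_GC})
    return sizes
```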
Table 5. Scenarios of five ontopic categories.

                  M              N              O
Batch ID   TA(%)  TB(%)   TA(%)  TB(%)   TA(%)  TB(%)  TC(%)  TD(%)  TE(%)
0          50     0       50     0       50     0      50     0      0
1          50     0       50     0       50     0      50     0      0
2          50     0       50     0       50     0      50     0      0
3          40     10      50     0       50     0      0      50     0
4          30     20      50     0       0      50     0      50     0
5          20     30      0      50      0      50     0      50     0
6          10     40      0      50      50     0      0      50     0
7          0      50      0      50      50     0      0      0      50
8          0      50      0      50      50     0      0      0      50
9          0      50      0      50      50     0      0      0      50
5.4. Experiments with Five Ontopic Categories

Here, we experiment with five ontopic categories. The five categories listed in Section 5.1 are used as the ontopic categories and are randomly assigned as TA, TB, TC, TD, and TE. Three scenarios, M, N, and O, illustrated in Tab. 5, are simulated in this group of experiments. In Tab. 5, column 1 lists the batch IDs, and the remaining columns are grouped by scenarios M, N, and O. The cells of the table show the percentage of samples of the corresponding category in the training dataset. The experimental results with k = 40 are shown in Fig. 4 and summarized in Tab. 2. They show that EN performs the best when there are five ontopic categories.
5.5. Execution Time

We generated data streams with 1000 batches of data, with 2000 documents in each batch. The experimental results show that the execution time needed by the system is linear in the number of batches processed.
6. Conclusion and Future Work
The main contribution of this paper is that, to the best of our knowledge, we are the first to tackle the problem of one-class classification on streaming data. Our proposition is that fully labelling streaming data is impractical, expensive, and sometimes unnecessary, especially for text streams. By designing a stacking-style ensemble-based classifier, and through a series of comparative studies, we have dealt with the problems of concept drift, a small number of training examples, no negative examples, noisy data, and limited memory space in streaming data classification. The experiments demonstrate that our ensemble-based algorithm outperforms the other window management approaches, namely single window, fixed window, and full memory.

The feature space of a text stream may evolve constantly, so in future work we need to study dynamic feature spaces under the one-class text stream classification scenario. Further research should also consider one-class classification on streaming data in general.
References

[1] J. Allan. Topic Detection and Tracking: Event-based Information Organization. Kluwer Academic Publishers, 2002.
[2] T. Dietterich. Ensemble methods in machine learning. In Proceedings of the First International Workshop on Multiple Classifier Systems, pages 1–15, 2000.
[3] W. Fan. Systematic data selection to mine concept-drifting data streams. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'04), pages 128–137. ACM Press, 2004.
[4] W. Fan, Y. Huang, H. Wang, and P. Yu. Active mining of data streams. In Proceedings of the Fourth SIAM International Conference on Data Mining (SDM'04), pages 457–461, 2004.
[5] W. Fan, Y. Huang, and P. Yu. Decision tree evolution using limited number of labeled data items from drifting data streams. In Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM'04), pages 379–382, 2004.
[6] G. Fung, J. Yu, H. Lu, and P. Yu. Text classification without negative examples revisit. IEEE Transactions on Knowledge and Data Engineering, 18(1):6–20, 2006.
[7] S. Huang and Y. Dong. An active learning system for mining time-changing data streams. Intelligent Data Analysis, 11(4):401–419, 2007.
[8] R. Klinkenberg. Learning drifting concepts: example selection vs. example weighting. Intelligent Data Analysis, 8(3):281–300, 2004.
[9] R. Klinkenberg and T. Joachims. Detecting concept drift with support vector machines. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML'00), pages 487–494, 2000.
[10] R. Klinkenberg and I. Renz. Adaptive information filtering: learning in the presence of concept drifts. In Workshop Notes of the ICML-98 Workshop on Learning for Text Categorization, pages 33–40, 1998.
[11] J. Kolter and M. Maloof. Dynamic weighted majority: a new ensemble method for tracking concept drift. In Proceedings of the Third IEEE International Conference on Data Mining (ICDM'03), pages 123–130, 2003.
[12] C. Lanquillon and I. Renz. Adaptive information filtering: detecting changes in text streams. In Proceedings of the Eighth International Conference on Information and Knowledge Management (CIKM'99), pages 538–544, 1999.
[13] X. Li and B. Liu. Learning from positive and unlabeled examples with different data distributions. In Proceedings of the European Conference on Machine Learning (ECML-05), pages 218–229, 2005.
[14] B. Liu, Y. Dai, X. Li, W. S. Lee, and P. Yu. Building text classifiers using positive and unlabeled examples. In Proceedings of the Third IEEE International Conference on Data Mining (ICDM'03), pages 179–186, 2003.
[15] B. Liu, X. Li, W. S. Lee, and P. Yu. Text classification by labeling words. In Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI-2004), pages 425–430, 2004.
[16] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47, 2002.
[17] W. Street and Y. Kim. A streaming ensemble algorithm (SEA) for large-scale classification. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'01), pages 377–382, 2001.
[18] D. Tax. One-class classification. Doctoral dissertation, Delft University of Technology, 2001.
[19] A. Tsymbal, M. Pechenizkiy, P. Cunningham, and S. Puuronen. Handling local concept drift with dynamic integration of classifiers: domain of antibiotic resistance in nosocomial infections. In Proceedings of the 19th IEEE International Symposium on Computer-Based Medical Systems (CBMS'06), pages 679–684, 2006.
[20] H. Wang, W. Fan, P. Yu, and J. Han. Mining concept-drifting data streams using ensemble classifiers. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'03), pages 226–235, 2003.
[21] D. Widyantoro and J. Yen. Relevant data expansion for learning concept drift from sparsely labeled data. IEEE Transactions on Knowledge and Data Engineering, 17(3):401–412, 2005.
[22] S. Wu, C. Yang, and J. Zhou. Clustering-training for data stream mining. In Proceedings of the Sixth IEEE International Conference on Data Mining Workshops, pages 653–656, 2006.
[23] H. Yu, J. Han, and K. Chang. PEBL: web page classification without negative examples. IEEE Transactions on Knowledge and Data Engineering, 16(1):70–81, 2004.
[24] Y. Zhang and X. Jin. An automatic construction and organization strategy for ensemble learning on data streams. ACM SIGMOD Record, 35(3):28–33, 2006.
[25] X. Zhu, X. Wu, and Y. Yang. Dynamic classifier selection for effective mining from noisy data streams. In Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM'04), pages 305–312, 2004.
[26] X. Zhu, P. Zhang, X. Lin, and Y. Shi. Active learning from data streams. In Proceedings of the Seventh IEEE International Conference on Data Mining (ICDM'07), 2007.