Knowl Inf Syst
DOI 10.1007/s10115-011-0469-2

REGULAR PAPER
Dynamic classifier ensemble for positive unlabeled text stream classification

Shirui Pan · Yang Zhang · Xue Li
Received: 12 August 2010 / Revised: 2 October 2011 / Accepted: 9 December 2011
© Springer-Verlag London Limited 2011

S. Pan · Y. Zhang (B)
College of Information Engineering, Northwest A&F University, Yangling, China
Y. Zhang e-mail: [email protected]
S. Pan e-mail: [email protected]

X. Li
School of Information Technology and Electrical Engineering, University of Queensland, Brisbane, Australia
e-mail: [email protected]
Abstract Most studies on streaming data classification assume that the data can be fully labeled. However, in real-life applications, it is impractical and time-consuming to manually label the entire stream for training. It is very common that only a small amount of positive data and a large amount of unlabeled data are available in data stream environments. In this case, applying traditional streaming algorithms with a straightforward adaptation to positive unlabeled streams may not work well or may lead to poor performance. In this paper, we propose a Dynamic Classifier Ensemble method for Positive and Unlabeled text stream (DCEPU) classification scenarios. We address the problem of classifying positive and unlabeled text streams with various types of concept drift by constructing an appropriate validation set and designing a novel dynamic weighting scheme in the classification phase. Experimental results on the benchmark dataset RCV1-v2 demonstrate that the proposed method DCEPU outperforms the existing LELC (Li et al. 2009b), DVS (with necessary adaptation) (Tsymbal et al. in Inf Fusion 9(1):56–68, 2008), and the Stacking style ensemble-based algorithm (Zhang et al. 2008b).

Keywords Positive unlabeled learning · Text streams · Classifier ensemble · Concept drift
1 Introduction

Current research on data stream classification [12,27,28,33,37] mostly focuses on supervised learning techniques, which require fully labeled data streams for learning purposes.
However, this is impractical in real-life applications, as it is generally too expensive to obtain a fully labeled data stream, given the limited human resources available for labeling. On the other hand, on static datasets, positive unlabeled learning (PU Learning), which utilizes a small set of labeled positive examples augmented with a large amount of unlabeled examples, has been investigated intensively in recent years [9,19,22,35]. However, not enough attention has been paid to PU Learning under data stream scenarios, although such scenarios are very prevalent in real-life applications.

Suppose the customer service section of a large enterprise receives thousands of users' feedback emails on a daily basis. It is impractical to check the emails one by one, yet the section manager wants to find the feedback emails about some newly launched product for detailed investigation, so as to make further decisions. Hence, he may spend just a few minutes labeling some emails relevant to the new product and then expect to retrieve all the on-topic feedback emails in the incoming email stream. Applications of positive unlabeled stream learning also arise in spam filtering. In such a case, a user would label a few spam emails from time to time, whereas a text classifier is expected to recognize and filter out all the junk mails over the stream.

In this paper, we study the classification task of positive unlabeled text streams. We propose a dynamic classifier ensemble approach to integrate PU Learning approaches with data stream mining. In our research, we identify the following challenges:

– Concept drift. User interest may depend on some hidden context. Changes in hidden context may be more or less radical in the target concept, which is generally known as concept drift [30]. We assume that concept drift is caused by changes in user interest or in the underlying data distribution in a text stream scenario.
– Small number of training examples. In order to relieve users of the heavy burden of labeling all the examples in the incoming stream, only a small number of labeled examples are required by the system.
– No negative training examples. Users may not be patient enough to provide negative examples to the system. For example, a manager may just identify emails relevant to a newly launched product and then expect to find all the on-topic emails in the email stream.
– Noisy data. With thousands of features (e.g., meaningful words in documents), text documents in natural language are believed to be noisy because of word sense ambiguity (e.g., polysemy or homonymy).
– Limited memory space for data stream classification. Only a limited memory space is available to the classification system, which means that the system cannot store all the historical data in physical memory.

A Dynamic Classifier Ensemble method for Positive and Unlabeled text stream (DCEPU) is proposed to meet the challenges described above. In our experimental studies, we compare the proposed method, DCEPU, with corresponding approaches, namely LELC [20], DVS (with necessary adaptation) [27], and the Stacking style ensemble-based algorithm [39]. Various scenarios are simulated in our experiments, and the experimental results demonstrate that DCEPU outperforms all the others.

The remainder of the paper is organized as follows. Section 2 reviews the related work. Section 3 gives the problem statement. Section 4 presents our dynamic classifier ensemble method for the classification of positive unlabeled text streams with concept drift. Section 5 gives our experimental results. Section 6 discusses and analyzes the advantages of our algorithm over other possible solutions to the positive unlabeled text stream classification task. We conclude the paper in Sect. 7.
2 Related work

Our work is inspired and informed by a number of areas. We briefly review the most relevant ones below.

2.1 Positive unlabeled learning

Learning from positive and unlabeled data has been widely studied in recent years. A survey on PU Learning can be found in [36]. Methods for PU Learning can be roughly divided into two categories:

– Converting the PU Learning task into a supervised classification task. This method usually follows a two-step scheme [9,18,19,21,22], i.e., it first retrieves some reliable negative examples, and then builds a supervised classifier on the positive and reliable negative examples.
– Constructing a basic classifier directly from the positive and unlabeled data. This method can simply use positive data only, such as one-class SVM [24], or estimate statistical information from both positive and unlabeled examples, such as the decision tree algorithm POSC4.5 [2] and the positive naive Bayes classifier PNB [1].

The former works excellently on text classification tasks, while the latter can be applied to general classification tasks. However, both of them are dedicated to static datasets only, rather than streaming data scenarios.

2.2 Data stream classification

Classification on streaming data has to meet the requirement that data are continuously coming into the system, and the system memory can never be large enough to hold all the data for multiple scans. Basically, there are two types of approaches for classifying streaming data. One is the incremental learning approach, such as VFDT [5] and CVFDT [12]. The other, which is becoming the mainstream in the research community, is the ensemble-based approach [10,26–28,37,38]. Ensemble-based methods decompose the streaming data into batches and train a classifier on each batch. Then, an ensemble of classifiers is constructed and used to predict the test data. The ways of integrating classifiers can be divided into two categories:

– Static classifier ensemble. The weight of each base classifier is decided before the classification phase [26,28,38].
– Dynamic classifier ensemble. The weight of each base classifier is decided dynamically in the classification phase [15,27,41].

Dynamic methods have been well studied and analyzed in past years for both static datasets [3,14,32] and data stream scenarios [15,27,41]. Generally, in the dynamic methods for a fully labeled data stream [27], the weight of each base classifier is determined by its local accuracy on a set of neighbors of the test example. It is concluded in [27] that dynamic methods are superior to static methods. However, those dynamic methods, such as DVS [27], only work well for fully labeled data, while in this paper, we propose DCEPU, a novel dynamic classifier ensemble algorithm for positive unlabeled text streams.

There are also some works available on partially labeled data streams.
Wu et al. [34] proposed a semi-supervised classifier, which uses a small number of both positive and negative examples to train an initial classifier without any further user feedback or training examples. This means that their algorithm cannot cope with concept drift in data streams. Active learning of data streams is proposed in [7,8,11] and [42], which trains an initial classifier on a partially labeled data stream and requires certain true labels for unlabeled samples in order to enhance the performance of the classifier. The algorithms in [7,8,11] estimate the error of the model on new data without utilizing the true class labels. As concept drift could be caused by changing user interest, changing data distribution, or both, the algorithms proposed in [7,8,11] cannot cope with concept drift caused by a sudden shift of user interest.

2.3 Positive unlabeled learning of text streams

There are very few studies on PU Learning problems under data stream scenarios. Li et al. [17] proposed the One-class Very Fast Decision Tree (OcVFDT) algorithm, whose classification performance is very close to that of VFDT [5], which is trained on a fully labeled data stream. However, OcVFDT cannot handle concept drift. Recently, Li et al. proposed an approach called LELC (PU Learning by Extracting Likely positive and negative micro-Clusters) for document classification with concept drift [20]. LELC tries to extract reliable negative documents from the unlabeled data, expands the training dataset by retrieving likely positive and negative examples in the form of micro-clusters, and then builds a robust classifier for the classification task.

Both OcVFDT and LELC follow a single classifier approach, while our proposed algorithm follows an ensemble learning approach. It is generally believed that ensemble learning can improve the generalization ability of a single learner [4], and it has recently been extended to PU Learning in data stream classification scenarios [39,43]. Zhu et al. [43] proposed a one-class learning and summarization (OCLS) framework, which uses a vague one-class learning (VOCL) model and a one-class concept summarization (OCCS) model to handle concept drift and summarization. While OCLS [43] learns base classifiers from vaguely labeled examples with one-class SVM, which takes instance weights into the classifier training process, our previous Stacking style ensemble algorithm [39] uses a traditional PU learner as a base classifier, which extracts reliable negative examples from the unlabeled data and converts PU Learning into a traditional two-class classification task.

In this paper, we propose a new ensemble-based algorithm, Dynamic Classifier Ensemble for Positive Unlabeled text streams with concept drift (DCEPU), by designing a dynamic weighting scheme in the classification phase. Experimental results demonstrate that DCEPU outperforms LELC [20], DVS (with some necessary adaptation) [27], and the Stacking style classifier ensemble-based method [39].
3 Problem statement

Here, we define our learning problem for positive unlabeled text stream classification. We follow the assumption that the text stream arrives in batches [13]:

$$d_{1,1}, d_{1,2}, \ldots, d_{1,m};\ d_{2,1}, d_{2,2}, \ldots, d_{2,m};\ \cdots;\ d_{n,1}, d_{n,2}, \ldots, d_{n,m};\ \cdots \qquad (1)$$
Here, $d_{i,j}$ represents the jth document in the ith batch. Each text in the stream is associated with a category. In each batch, some of the categories are considered on-topic, and the others are not. We write $d_{i,j} = \langle X_{i,j}, y_{i,j} \rangle$, where $X_{i,j} \in \mathbb{R}^m$ represents a text document in the stream, and $y_{i,j} \in \{+1, -1\}$ represents whether the document is on-topic ($y_{i,j} = +1$) or not ($y_{i,j} = -1$). In each batch, only a small number of on-topic (positive) examples are available and labeled as training data, and the remaining documents are unlabeled. The task of positive unlabeled text stream classification is to find all on-topic documents among the unlabeled text documents in the text stream.

In this paper, we consider the following types of concept drift:

– Sudden concept drift. The interest of the user may change significantly. For instance, a manager in the service section of a large enterprise may focus on the feedback emails about the product Camera on the market; then, when a new product Printer is launched, she/he may shift her/his interest to the feedback emails about Printer. In such a case, she/he would label some newly interesting examples about the Printer and expect the system to retrieve all relevant emails regarding this issue. So the system should have the ability to handle sudden changes in user interest.
– Gradual concept drift. In our research domain, this type of drift is mainly caused by gradual changes of user interest and/or the underlying data distribution over time. A change in the underlying data distribution, which is known as virtual concept drift [29], is very prevalent over the stream. For instance, in a spam filtering system (spam emails are labeled as positive documents), while the characteristics of spam emails are almost the same within a short period, the types of junk mails will vary gradually or dramatically over a long time. Thus, the user may have to provide some labeled examples to the system from time to time and expect the system to update its models to capture the real concept of the stream.
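To make this batch formulation concrete, here is a minimal Python sketch of how such a positive unlabeled stream with a drifting concept could be simulated. It is illustrative only, not the experimental pipeline of this paper: the topic labels, batch size, and drift schedule are assumed values in the spirit of the scenario tables of Sect. 5, and documents are stubbed as (topic, id) pairs rather than TF-IDF vectors.

```python
import random

def simulate_pu_stream(schedule, batch_size=2000, n_labeled=30, seed=42):
    """Yield (P_n, U_n) pairs for a drifting positive unlabeled stream.

    `schedule` gives, per batch, the percentage of labeled positives drawn
    from topic 'A'; the remaining labeled positives come from topic 'B'.
    Only the small set P_n carries labels; everything else stays unlabeled.
    """
    rng = random.Random(seed)
    topics = ["A", "B", "C", "D"]  # illustrative category pool
    for pct_a in schedule:
        batch = [(rng.choice(topics), i) for i in range(batch_size)]
        pool_a = [d for d in batch if d[0] == "A"]
        pool_b = [d for d in batch if d[0] == "B"]
        k_a = round(n_labeled * pct_a / 100)
        labeled = rng.sample(pool_a, k_a) + rng.sample(pool_b, n_labeled - k_a)
        unlabeled = [d for d in batch if d not in labeled]
        yield labeled, unlabeled

# Sudden drift: the on-topic concept flips from topic A to topic B at batch 10.
for n, (P_n, U_n) in enumerate(simulate_pu_stream([100] * 10 + [0] * 10)):
    print(f"batch {n}: {len(P_n)} labeled positives, {len(U_n)} unlabeled")
```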
4 Dynamic classifier ensemble method for positive unlabeled text stream classification

In this paper, a dynamic classifier ensemble-based algorithm, DCEPU, is proposed to cope with concept drift in positive unlabeled text stream classification scenarios.

4.1 Framework of the algorithm for positive unlabeled text stream classification

Our framework of the ensemble learning algorithm for positive unlabeled text stream classification is given in Algorithm 1 below. It is proved in [31] that the learning task is more difficult if the learner learns from only few examples, and most of the state-of-the-art classifiers require both positive and negative examples for training. Therefore, for the nth batch, in step 1 of Algorithm 1, we extract Tn, a set of reliable negative examples, from Un, the set of unlabeled examples of the current batch, so as to create a training dataset, Tn ∪ Pn. Any PU Learning algorithm [9,18,21,35] can be used as a plug-in component in this step for extracting examples; in this paper, we simply use the successful algorithm proposed by Fung et al. [9] to extract reliable negative examples. In step 2, a classifier is built on the current batch. In step 3, the newly learned classifier is added into En−1, the ensemble of classifiers built on the former batches of data. In step 5, if the population of the ensemble exceeds its maximum capacity, R, the oldest classifier is deleted from the ensemble, so as to keep the population capacity of the ensemble. An alternative way to prune a classifier is to discard the one with the least weight in the ensemble. However, this classifier may be helpful in the future, especially in situations where the concept changes frequently and periodically. Thus, we do not discard the classifier with the least weight; instead, we design a novel weighting scheme (presented in Sect. 4.2.3) that tunes the weights of the classifiers automatically (for instance, under sudden concept drift, the weight of such a classifier would be adjusted to be as small as 0, and the classifier could be reused in the future whenever necessary). In step 7, the new ensemble of classifiers, En, is used to determine the class labels of the unlabeled documents in Un − Tn with a weighting scheme.
Algorithm 1 Framework of the ensemble learning algorithm for positive unlabeled text stream classification.

Input:
  Pn: the set of positive examples for the current batch;
  Un: the set of unlabeled examples for the current batch;
  En−1: the ensemble of classifiers on the former batches;
  R: the population capacity of the classifier ensemble;
Output:
  En: the ensemble of classifiers on the current batch;

1: Extracting the set of reliable negative examples Tn from Un with the help of Pn;
2: Training a classifier Cn for the current batch on Tn ∪ Pn;
3: En = En−1 ∪ {Cn};
4: if (|En| > R) then
5:   Deleting the oldest base classifier in En;
6: end if
7: Classifying the examples in Un − Tn by En;
8: return En;
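For readers who prefer running code, the following Python sketch mirrors Algorithm 1. It is a minimal illustration under stated assumptions: `extract_reliable_negatives` stands in for whichever PU extractor is plugged into step 1 (the paper uses Fung et al. [9]), `train_classifier` stands in for the base learner (a linear SVM in the experiments of Sect. 5), and `classify` stands in for the dynamic routine developed later as Algorithm 2.

```python
from collections import deque

def process_batch(P_n, U_n, ensemble, R,
                  extract_reliable_negatives, train_classifier, classify):
    """One pass of Algorithm 1 over the nth batch (sketch).

    P_n: labeled positive documents; U_n: unlabeled documents;
    ensemble: deque of base classifiers built on former batches.
    """
    T_n = extract_reliable_negatives(P_n, U_n)            # step 1
    X = P_n + T_n                                         # step 2: train on P_n ∪ T_n
    y = [+1] * len(P_n) + [-1] * len(T_n)
    C_n = train_classifier(X, y)
    ensemble.append(C_n)                                  # step 3
    if len(ensemble) > R:                                 # steps 4-6
        ensemble.popleft()                                # drop the oldest classifier
    remaining = [d for d in U_n if d not in T_n]          # step 7: classify U_n - T_n
    predictions = [classify(ensemble, d) for d in remaining]
    return ensemble, predictions

ensemble = deque()  # holds at most R base classifiers over the stream
```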
4.2 Dynamic classifier ensemble method

The dynamic voting with selection (DVS) algorithm presented in [27] for supervised learning works in the following way: first, for the test example t, the most up-to-date batch of data is used as a validation set, from which a set of nearest neighbors of t is retrieved. Then, each base classifier in the ensemble is evaluated on these neighbors to determine the weights of the base classifiers. Finally, some base classifiers with high weights are selected to predict the example t by a weighted voting scheme.

DVS can be extended to positive unlabeled text streams straightforwardly. After extracting reliable negative documents on the most up-to-date batch, DVS can construct a base classifier by leveraging Pn ∪ Tn. Then, DVS uses Pn ∪ Tn as a validation set, on which each classifier is evaluated so as to determine its corresponding weight. Although DVS works excellently and outperforms the static classifier ensemble algorithm [28] in supervised data stream scenarios, such a direct adaptation of the DVS algorithm may not work well in positive unlabeled text stream scenarios, as we will demonstrate and discuss later in Sects. 5 and 6. On the one hand, in positive unlabeled text stream scenarios, the validation set Pn ∪ Tn for DVS is heavily noisy and extremely imbalanced, as only a small number of positive documents are available whereas a large amount of reliable negative data exists, which degrades the performance of DVS. On the other hand, the weighting scheme in DVS only considers the local performance of each classifier, which could be improved by taking the overall performance of each base classifier into account. Hence, in this paper, we propose DCEPU, a novel dynamic weighting method for positive unlabeled text stream classification.
Here, we give our distance measurement between text documents in Sect. 4.2.1. Then, we present the strategy of the DCEPU algorithm for constructing an appropriate validation set in Sect. 4.2.2, from which we can retrieve neighbors of test documents. The way to decide the weight of each classifier in our DCEPU algorithm is described in Sect. 4.2.3.

4.2.1 Distance measurement

When retrieving neighbors for the test example t, cosine similarity, a metric widely used in text retrieval systems [23], is utilized in our method to measure the similarity between text documents. For two text documents, $d_{i_1,j_1} = (q_1, q_2, \ldots, q_m)$ and $d_{i_2,j_2} = (p_1, p_2, \ldots, p_m)$, from the stream, the cosine similarity between them is defined as:

$$\mathrm{similarity}(d_{i_1,j_1}, d_{i_2,j_2}) = \frac{\sum_{i=1}^{m} q_i \cdot p_i}{\sqrt{\sum_{i=1}^{m} q_i^2}\,\sqrt{\sum_{i=1}^{m} p_i^2}} \qquad (2)$$

Then we can get the cosine distance by:

$$\mathrm{dis}(d_{i_1,j_1}, d_{i_2,j_2}) = 1 - \mathrm{similarity}(d_{i_1,j_1}, d_{i_2,j_2}) \qquad (3)$$
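As a direct transcription of formulas (2) and (3), a plain-Python version of the distance measure might look as follows (assuming dense, non-zero TF-IDF vectors of equal length):

```python
import math

def cosine_distance(q, p):
    """Cosine distance between two TF-IDF vectors, per formulas (2) and (3)."""
    dot = sum(qi * pi for qi, pi in zip(q, p))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_p = math.sqrt(sum(pi * pi for pi in p))
    similarity = dot / (norm_q * norm_p)   # formula (2)
    return 1.0 - similarity                # formula (3)

print(cosine_distance([1.0, 0.0, 2.0], [0.5, 1.0, 1.0]))  # ≈ 0.255
```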
4.2.2 Constructing the validation set

For dynamic classifier ensemble methods, a critical issue is to obtain an appropriate validation set. The closer the validation set is to the target concept, the better the dynamic method performs. In [27], the most up-to-date batch of data was used as the validation set. However, in our PU Learning scenario, only a small set of positive data is labeled in the most up-to-date batch; hence, the most up-to-date batch cannot simply be used as the validation set. In [6], Fan proposed constructing an ensemble of decision trees to "sift through" old data and combine it with new data to handle concept drift. The basic idea is to train a number of random and uncorrelated decision trees, each built by randomly choosing available attributes from the training data. However, cross-validation is used there to find the optimal model, which is rather complicated and time-consuming for data stream classification. In this paper, we utilize two simple yet effective strategies to construct a good validation set.

Selecting positive examples in recent batches into the validation set. Since the number of positive examples in the most up-to-date batch is rather small, more positive examples should be selected from recent batches. However, as pointed out in [6], using old data blindly is no better than "gambling". What we need is positive data that are close to the current target concept. Fortunately, the positive data in the current batch represent the current concept, and hence they can be helpful when retrieving relevant examples. We take the S most recent batches of data into consideration. Let Cn be the classifier built on the most up-to-date batch, Pi be the set of positive examples in the ith batch, and An,i be the classification accuracy of Cn evaluated on Pi. If An,i is lower than a predefined threshold ε, this may imply that the concept of the ith batch differs greatly from the current concept, which means that the positive data in Pi should be discarded. Otherwise, the examples in Pi are added into the validation set. Although this is a simple strategy, the examples that differ significantly from the current concept can be filtered out.

Sampling some reliable negative examples randomly. If there are no negative examples in the validation set, then for a test example t that is a negative example, it might be hard to find real neighbors of t in the validation set. Furthermore, this may lead to bad performance of the dynamic classifier ensemble method. In this paper, a set of reliable negative examples of size L is randomly sampled from Tn and added into the validation set. We do not import all the data in Tn into the validation set, for the sake of limited memory and efficiency: when predicting an example, we need to traverse the validation set exhaustively to get its nearest neighbors, so a validation set of large size is not preferred.
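A compact sketch of the two strategies, under the assumption that `accuracy_on(clf, examples)` is a hypothetical helper returning the fraction of the given positive examples that the classifier labels positive, could read:

```python
import random

def build_validation_set(recent_P, C_n, T_n, eps=0.3, L=150, accuracy_on=None):
    """Construct the validation set V (sketch of Sect. 4.2.2).

    recent_P: positive sets [P_{n-S}, ..., P_n] of the S most recent batches;
    C_n: the classifier built on the most up-to-date batch;
    T_n: reliable negatives extracted from the current batch.
    """
    V = []
    for P_i in recent_P:
        # Keep P_i only if C_n still recognizes it: a low accuracy A_{n,i}
        # signals that batch i carried a different concept.
        if accuracy_on(C_n, P_i) >= eps:
            V.extend((x, +1) for x in P_i)
    # A small random sample of reliable negatives gives negative neighbors
    # without blowing up the size of V.
    V.extend((x, -1) for x in random.sample(T_n, min(L, len(T_n))))
    return V
```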
4.2.3 Weighting base classifiers

Another novelty of our dynamic method is the weighting scheme in the classification phase. Traditionally, when a neighbor of the test example t is classified correctly by a classifier Ci, the weight of Ci is increased [14]; otherwise, it is decreased [27]. In either case, the weight of each base classifier is determined by its local classification performance on the neighbors. In our proposed DCEPU method, the weight of Ci is determined not only by its local classification performance on the neighbor examples, but also by a global weight. Let us write ϑi for the global weight of classifier Ci, which is defined as:

$$\vartheta_i = A_{i,n}\left(1 - \left|A_{n,i} - A_{i,n}\right|\right) \qquad (4)$$

Here, Ai,n denotes the accuracy of classifier Ci (built on the ith batch) evaluated on Pn, the set of positive examples of the nth batch; and An,i denotes the accuracy of classifier Cn (built on the most up-to-date batch) evaluated on Pi, the positive examples of the ith batch. Together, Ai,n and An,i indicate the similarity of the concepts of batches i and n. More specifically, there are three types of situations according to formula (4):

– Different concepts. Both Ai,n and An,i are low if there is a significant difference between the concepts of the two batches. Thus, we have Ai,n → 0 and |An,i − Ai,n| → 0, so ϑi → 0. This means that if the concept of the ith batch differs greatly from the current concept, then ϑi is low.
– Similar concepts. We have |An,i − Ai,n| → 0 if there is high similarity between the concepts of the two batches, so ϑi → Ai,n. This means that if the concept of the ith batch is close to the current concept, then ϑi is mainly determined by the accuracy of Ci on the most up-to-date batch.
– Moderate difference between the two concepts. In this situation, Ai,n and An,i differ from each other, and both of them work together to influence ϑi.

Our global weighting scheme can effectively measure the similarity of the concepts of different batches, so as to adjust the weight of each classifier accordingly. Note that the base classifier Cn is only evaluated on a small set of positive examples Pi (e.g., 30 or 50 examples), and Ci is evaluated on Pn; thus, this procedure does not take much time or memory. The experimental results in Sect. 5 show the effectiveness of this weighting scheme.

After obtaining the global weight ϑi of classifier Ci and a set of neighbors Nei for the test example t, the final weight of classifier Ci for predicting t is determined as:

$$\mathrm{weight}_i = \sum_{d \in Nei} h(d) \cdot \frac{1}{\mathrm{dis}(d, t)} \cdot \vartheta_i \qquad (5)$$

Here, h(d) ∈ {0, 1}. If classifier Ci predicts the neighbor d correctly, h(d) = 1; otherwise, h(d) = 0. The function dis(d, t) returns the cosine distance between d and t following formula (3).
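Reading formula (4) with the absolute difference, as the three cases above suggest, the global weight can be computed in a few lines; the two calls below merely illustrate the "different concepts" and "similar concepts" cases with made-up accuracies:

```python
def global_weight(A_in, A_ni):
    """Global weight ϑ_i of base classifier C_i, per formula (4).

    A_in: accuracy of C_i on P_n (positives of the current batch);
    A_ni: accuracy of C_n on P_i (positives of batch i).
    """
    return A_in * (1.0 - abs(A_ni - A_in))

print(global_weight(0.05, 0.08))  # different concepts: weight ≈ 0.05 (near 0)
print(global_weight(0.90, 0.88))  # similar concepts:   weight ≈ 0.88 (≈ A_in)
```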
Our dynamic classifier ensemble-based classification algorithm is listed in Algorithm 2 below. In step 1, the neighbor set Nei is formed by the K nearest neighbors of the test example t from the validation set V. The weight of each base classifier in the ensemble is determined in steps 2–9 following formula (5). In step 10, the weights of the classifiers are normalized. Finally, in step 11, the weighted voting scheme is used to predict the class label of t. Here, the function predict(Ci, t) returns the class label of t predicted by Ci.

Our dynamic classifier ensemble method considers two factors when predicting an example from a positive unlabeled data stream. On the one hand, the weight of each classifier is affected by the local neighbors surrounding the test example. On the other hand, although some strategies are taken to form the validation set, it may not be sufficient and accurate enough, so we use the global weight of each base classifier to adjust its weight. Hence, despite the lack of negative examples, our method still performs well and can handle both sudden concept drift and gradual concept drift.

Algorithm 2 Classification algorithm for positive unlabeled data streams.

Input:
  En: the ensemble of classifiers;
  t: an unlabeled example from the nth batch;
  V: the validation set;
Output:
  l: the class label of the unlabeled example t;

1: Retrieving the K nearest neighbors of t from V to form the set Nei;
2: for each Ci ∈ En do
3:   weighti = 0;
4:   for each d ∈ Nei do
5:     if Ci predicts d correctly then
6:       weighti = weighti + (1/dis(d, t)) · ϑi;
7:     end if
8:   end for
9: end for
10: Normalizing the weights of all the classifiers in En;
11: l = (1/|En|) · Σ_{Ci ∈ En} weighti · predict(Ci, t);
12: return l;
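The following Python sketch puts Algorithm 2 together with formulas (4) and (5). It assumes `dis` is the cosine distance of formula (3) (strictly positive for the retrieved neighbors), `predict(C, x)` returns +1 or −1, and `global_weights[i]` holds the ϑ_i of classifier `ensemble[i]`; the sign of the weighted vote in step 11 is taken as the class label.

```python
import heapq

def dcepu_classify(ensemble, global_weights, t, V, K=5, dis=None, predict=None):
    """Predict the label of t by dynamic weighted voting (Algorithm 2, sketch).

    V is a list of (document, label) pairs as built in Sect. 4.2.2.
    """
    # Step 1: the K nearest neighbors of t in the validation set V.
    Nei = heapq.nsmallest(K, V, key=lambda dl: dis(dl[0], t))
    # Steps 2-9: local accuracy on the neighbors, scaled by the global weight.
    weights = []
    for i, C_i in enumerate(ensemble):
        w = sum(global_weights[i] / dis(d, t)     # h(d) = 1 only when C_i
                for d, label in Nei               # classifies the neighbor
                if predict(C_i, d) == label)      # correctly, else 0
        weights.append(w)
    # Step 10: normalize the weights.
    total = sum(weights) or 1.0
    weights = [w / total for w in weights]
    # Step 11: weighted vote; the sign of the score decides the label.
    score = sum(w * predict(C_i, t) for w, C_i in zip(weights, ensemble))
    return +1 if score >= 0 else -1
```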
5 Experimental studies

In this section, we report our experimental results. The algorithms were implemented in Java together with the WEKA software package (http://www.cs.waikato.ac.nz/ml/weka/). Our experiments were run on a PC with a Pentium 4 3.0 GHz CPU and 1 GB of memory.

The RCV1-v2 dataset (http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm) was used in our experiments. RCV1-v2 is a text dataset with 804,414 news stories and is now a benchmark collection for text categorization [16]. All news data in RCV1-v2 are organized into four categories, i.e., CCAT (Corporate/Industrial), ECAT (Economics), GCAT (Government/Social), and MCAT (Markets). Each news story may belong to several categories, as it is a multi-label dataset. The RCV1-v2 dataset consists of two subsets, i.e., the training set with 23,149 news stories and the test set with 781,265 news stories.
In our experiments, we chose the category CCAT, which has the largest number of examples in both the training and test sets, to construct the data stream. First, the text documents that belong to multiple categories were removed. Then, eight subcategories of CCAT with similar category sizes were selected for our experiments. Table 1 gives the details of these eight subcategories; its columns give the category name, the total count of examples for the corresponding category, and the count of examples for the corresponding category in each batch, respectively.

Table 1 Categories used in experiments

Category | Total #Doc | #Doc per batch
C11      | 8,932      | 230
C12      | 6,372      | 164
C13      | 14,756     | 380
C171     | 7,809      | 201
C172     | 10,103     | 261
C21      | 12,943     | 334
C411     | 8,475      | 219
C42      | 8,187      | 211
Total    | 77,577     | 2,000

The information gain metric was used to select the 1,000 most predictive attributes. Then, each text document was represented by a vector weighted by the TF-IDF scheme. Altogether, eight scenarios were simulated in our experiments. For each scenario, 39 batches of data were constructed, with each batch consisting of 2,000 documents; readers can refer to Table 1 for the composition of each batch. Four categories, C12, C171, C21, and C11, were selected randomly as potential on-topic categories from the experimental data. Two of them were randomly used as topic TA and topic TB, on which the concept drift was simulated. In each batch, the set of positive training examples, Pi, consists of H text documents that were selected randomly from TA and/or TB, and the rest of the documents in the batch were used as unlabeled data. On these four categories, we conducted $C_4^2 = 6$ trials of experiments for each scenario, and the averaged results of the 6 trials are reported here as the final results.

In our experiments, we set the count of positive examples in each batch to H = 30, and set S = 4 and ε = 0.3 to control the import of the positive data from the recent batches. We set L = 150, which means that 150 reliable negative examples were randomly sampled from Tn and added into the validation set. A detailed analysis of these parameters is presented and discussed later in Sect. 5.4.

We compared our dynamic classifier ensemble method with the following algorithms:

– LELC algorithm. LELC [20] first employs an ensemble strategy that integrates both the Spy [21] and the Rocchio [18] extraction algorithms to extract the most reliable unbiased negatives from the historical data in the stream, and then extracts likely positive and negative micro-clusters, so as to exploit the characteristics of data streams. Finally, an SVM classifier is built to classify the incoming unlabeled data. Li et al. [20] show in their experiments that LELC outperforms traditional PU learners in handling concept drift in text streams.
– DVS (with necessary adaptation). Dynamic voting with selection (DVS) [27] uses a set of classifiers to predict the incoming data stream. The weight of each classifier is decided
by its local performance on a set of neighbors of the test document. Here, we use the most up-to-date batch data Pn ∪ Tn as the validation set, from which the neighbors of the test document are retrieved. We want to study whether our strategy of constructing an appropriate validation set and our novel weighting scheme outperform the DVS (with necessary adaptation) algorithm.
– Stacking style ensemble-based algorithm. Our previous algorithm [39] for positive unlabeled text streams utilizes a Stacking style ensemble-based strategy to classify the streaming data. Compared with the DCEPU algorithm of this paper, the weight of each base classifier is decided before predicting a document. In other words, the base classifiers use fixed weights to predict the unlabeled data in the most up-to-date batch, whereas our DCEPU algorithm updates the weights of the base classifiers according to the test document at hand.

The experimental settings for the LELC classifier were the same as those reported in [20]. To keep the comparison fair, the population capacity of the classifier ensemble (R in our paper) for the DVS, Stacking, and DCEPU algorithms was set to 15, and the number of neighbors of the test example for both DVS and DCEPU was set to 5 (K = 5). The other parameters for DVS and Stacking were set following the original algorithms in [27] and [39], respectively. A linear SVM was used as the base classifier for all three ensemble-based methods, as it is reported in the literature [25] that linear SVMs perform excellently on text classification tasks.

The F1 index is widely used for measuring the classification performance of text classifiers [25], and we also report our experimental results measured in F1. Suppose there are n batches of text data observed so far; in order to measure the classification performance on the text stream, we define the averaged F1 for the whole text data stream as the F1 averaged over the batches observed from the stream so far: $F_1^{ave} = \sum_{i=1}^{n} F_1^i / n$.

5.1 Stable concept and gradual concept drift

The interest of a user may keep stable or change gradually over a certain period; moreover, her/his interest may recur. In this group of experiments, four scenarios were generated to simulate changes in user interest, as given in Table 2. In Table 2, Column 1 lists the batch IDs; Columns 2–5 give the percentages of positive training examples that were selected randomly from TA for scenarios A, B, C, and D, respectively. The rest of the positive training examples were selected randomly from TB in each batch.
Table 2 Stable concept and gradual concept drift

Batch ID | Scenario A | Scenario B | Scenario C       | Scenario D
0–4      | 100        | 100        | 0,10,20,30,40    | 100,80,60,40,20
5–9      | 100        | 100        | 50,60,70,80,90   | 0,20,40,60,80
10–14    | 100        | 80         | 100,90,80,70,60  | 100,80,60,40,20
15–19    | 100        | 60         | 50,40,30,20,10   | 0,20,40,60,80
20–24    | 100        | 40         | 0,10,20,30,40    | 100,80,60,40,20
25–29    | 100        | 20         | 50,60,70,80,90   | 0,20,40,60,80
30–38    | 100        | 0          | 100,90,80,...,30 | 100,80,...,0,20,...,60
Scenario A: The concept was TA, and it kept stable all the time. Here, we wanted to study the performance of our DCEPU algorithm when there was no concept drift at all.
Scenario B: The concept was TA in batches 0–9. Concept drift occurred every 5 batches at a rate of 20% in batches 10–30, and the concept kept stable in batches 30–38.
Scenario C: The concept kept changing from batch to batch. In each batch, a concept drift at a rate of 10% occurred.
Scenario D: The concept kept changing. In each batch, a concept drift at a rate of 20% occurred.

Figure 1a–d gives the experimental results for scenarios A, B, C, and D, respectively. Here, LELC, DVS, Stacking, and DCEPU denote the classification performance of LELC [20], DVS (with necessary adaptation) [27], the Stacking style ensemble-based algorithm [39], and the DCEPU method proposed in this paper, respectively.

In Fig. 1a, it is obvious that our DCEPU algorithm performs best among the four algorithms over the data stream; it outperforms LELC, DVS (with necessary adaptation), and the Stacking style ensemble-based algorithm significantly. While the ensemble-based DCEPU, DVS, and Stacking algorithms perform relatively stably on each batch, the classification performance of LELC fluctuates significantly from batch to batch (e.g., batches 5 and 32).

Referring to batch 10 of Fig. 1b, when concept drift occurs, a sharp drop in classification performance is observed, with the F1 index dropping from 0.75 to 0.5 for the Stacking and DCEPU algorithms, and even further for the LELC and DVS algorithms. This is because most of the base classifiers in the Stacking, DCEPU, and DVS algorithms are built from documents representing concept TA in the previous batches of streaming data; hence, they lack the ability to "recognize" the data of topic TB in batch 10.
Fig. 1 Experiment results for stable concept and gradual concept drift. a Scenario A. b Scenario B. c Scenario C. d Scenario D
Similarly, the quality of the likely positive and negative documents extracted from previous streaming data by LELC also decreases in batch 10, leading to a drop in classification performance. In later batches, such as batch 15, it can be seen that the DVS, Stacking, and DCEPU algorithms recover from the drop continuously, because the older base classifiers in these ensemble-based methods are continuously replaced by newly learned ones, which have a better ability to "recognize" both topics TA and TB. Likewise, LELC can extract high-quality likely positive and negative documents from the former batches of data and thus can build a stronger classifier. Note that over the whole stream, DCEPU performs best and handles concept drift better than the other three algorithms.

From Fig. 1c, it can be observed that with the concept changing periodically, the F1 index of all four algorithms changes periodically as well. Although in some batches DCEPU performs worse than the other methods, over the whole data stream, DCEPU outperforms its peers. A similar result is observed in Fig. 1d for scenario D.

From this group of experiments (Fig. 1), it is obvious that our DCEPU algorithm performs best when there is a stable concept or gradual concept drift in the stream. Generally, DCEPU achieves the highest F1 index over the whole stream and performs much more stably than the other three algorithms. Second is the Stacking style ensemble-based algorithm, which even outperforms the DVS (with necessary adaptation) algorithm. The reasons are that (1) the validation set for DVS is noisy and extremely imbalanced (|Tn| ≫ |Pn|), which affects the performance of DVS, and (2) the weight of each classifier in DVS is determined only by its local performance, which is not the best way to handle concept drift in positive unlabeled stream scenarios. This result suggests that applying a traditional streaming algorithm with a straightforward adaptation to positive unlabeled streams may perform badly. Furthermore, the result also reveals that our strategy of constructing an appropriate validation set and the weighting scheme adopted in the DCEPU algorithm are workable and effective: DCEPU outperforms DVS significantly.

In these experiments, the LELC algorithm performs worst. The reasons are twofold. First, although LELC can effectively extract reliable negative documents by integrating the two successful methods Spy [21] and Rocchio [18], it is just a single classifier approach, whereas the other three algorithms also use a robust PU learner to extract reliable negative documents and adopt an ensemble-based strategy to classify the unlabeled data. It is generally believed that ensemble learning can improve the generalization ability of a single learner [4]. Second, LELC tries to extract likely positive and negative documents from micro-clusters, yet their quality cannot always be guaranteed. In the LELC algorithm, if a micro-cluster is classified as positive (negative), all the data in the micro-cluster are added into the likely positive (negative) set and used to build the final classifier. In such a case, if the micro-cluster is classified wrongly, a large number of false positive (negative) documents are added to the positive (negative) set, which greatly weakens the effectiveness of the LELC algorithm.
5.2 Sudden concept drift

In order to study the ability of our algorithm to cope with sudden concept drift, four scenarios simulating sudden drift were generated in this group of experiments. The scenarios are described in Table 3, and the experimental results are shown in Fig. 2. In Scenario E, we simulated extremely heavy concept drift, with sudden concept drift occurring in each batch. In Scenario F, the concept drift occurred every five batches. In Scenario G, the concept drift occurred every ten batches. In Scenario H, the concept drifted suddenly from TA to TB in batch 10 and drifted back from TB to TA in batch 20.
Table 3 Sudden concept drift

Batch ID | Scenario E      | Scenario F | Scenario G | Scenario H
0–4      | 100,0,100,0,100 | 100        | 100        | 100
5–9      | 0,100,0,100,0   | 0          | 100        | 100
10–14    | 100,0,100,0,100 | 100        | 0          | 0
15–19    | 0,100,0,100,0   | 0          | 0          | 0
20–24    | 100,0,100,0,100 | 100        | 100        | 100
25–29    | 0,100,0,100,0   | 0          | 100        | 100
30–34    | 100,0,100,0,100 | 100        | 0          | 100
35–38    | 0,100,0,100     | 0          | 0          | 100
Fig. 2 Experiment results for sudden concept drift. a Scenario E. b Scenario F. c Scenario G. d Scenario H
Figure 2a shows the experimental results for scenario E. When the concept keeps drifting suddenly from batch to batch, the F1 index of all four algorithms fluctuates over the stream; this is especially prominent for the LELC algorithm, as it is a single classifier approach. Note that our proposed DCEPU algorithm still surpasses the other three algorithms. In Fig. 2b for scenario F, it can be observed that once the ensemble is full, say, from batch 15 on, DCEPU is always superior to the other three algorithms. The experimental results shown in Fig. 2c, d for scenarios G and H are similar to those of scenario F. This group of experiments reveals a good capability of DCEPU to deal with sudden drift.
Table 4 Averaged F1 for all scenarios with respect to various algorithms

Scenario | LELC          | DVS           | Stacking      | DCEPU
A        | 0.669 ± 0.064 | 0.705 ± 0.019 | 0.736 ± 0.017 | 0.761 ± 0.013
B        | 0.604 ± 0.087 | 0.604 ± 0.093 | 0.649 ± 0.079 | 0.672 ± 0.073
C        | 0.568 ± 0.068 | 0.543 ± 0.071 | 0.598 ± 0.068 | 0.634 ± 0.059
D        | 0.565 ± 0.082 | 0.564 ± 0.078 | 0.623 ± 0.069 | 0.640 ± 0.053
E        | 0.628 ± 0.063 | 0.694 ± 0.024 | 0.719 ± 0.023 | 0.740 ± 0.022
F        | 0.672 ± 0.058 | 0.689 ± 0.024 | 0.712 ± 0.030 | 0.736 ± 0.023
G        | 0.663 ± 0.056 | 0.692 ± 0.023 | 0.714 ± 0.024 | 0.736 ± 0.022
H        | 0.660 ± 0.058 | 0.702 ± 0.020 | 0.721 ± 0.018 | 0.745 ± 0.016
5.3 Average classification performance

The average classification performance (the averaged F1 for the whole text data stream) for each scenario is summarized in Table 4. The results are presented in the form "mean ± SD (standard deviation)" of the F1 index. Table 4 shows that the DCEPU algorithm outperforms the LELC, DVS (with necessary adaptation), and Stacking ensemble-based algorithms in all scenarios, with the highest F1 index and the lowest standard deviation. It achieves a 9.2% improvement over LELC in Scenario A, a 6.8% improvement over DVS in Scenario B, and a 3.6% improvement over Stacking in Scenario C, respectively. We also ran pairwise t tests to evaluate the performance differences of DCEPU over LELC, DVS, and Stacking, respectively (confidence level α = 0.05 in our experiments), and the results show that DCEPU outperforms the others significantly.

5.4 Analysis of parameters

To further analyze the proposed method, in this subsection, we report our experimental results (averaged F1) for the parameters of our DCEPU method.

Experiment on the number of positive documents H. We experimented with different numbers of positive documents in each batch, i.e., H = 30, 50, and 70, for the approaches LELC, DVS, Stacking, and DCEPU. The purpose of these experiments is to investigate the relative effectiveness of the various techniques for different numbers of positive documents. The experimental results are given in Fig. 3. With an increasing number of positive documents in each batch, all four algorithms show a slow climb in the F1 index, which is consistent with the results reported in [20]. Also, our DCEPU algorithm outperforms the other algorithms in Scenarios C and D. We experimented on the other scenarios with different H, and similar results were observed; hence, we omit those results here.

Experiment on the population capacity of the ensemble R. We changed the population capacity of the ensemble R from 5 to 30, with the other parameters remaining the same. Figure 4a–c gives the experimental results for scenarios B, C, and D, respectively. From Fig. 4c, it can be observed that as R increases from 5 to 15, the performance of our DCEPU method also improves. However, when R varies from 15 to 30, there is no significant improvement in performance. Such a result is not surprising, as it is reported in [40] that it may be better to ensemble many instead of all of the classifiers at hand. Similar results are observed for the other scenarios, e.g., Fig. 4a, b. Hence, in all experiments reported in Sects. 5.1–5.3, we set R = 15.
Fig. 3 Experiment on number of positive documents H. a Scenario C. b Scenario D
Fig. 4 Experiment on R. a Varying R on Scenario B. b Varying R on Scenario C. c Varying R on Scenario D
Fig. 5 Experiment on K. a Varying K on Scenario F. b Varying K on Scenario G. c Varying K on Scenario H
Experiment on the number of neighbors K. We varied the number of neighbors K from 1 to 30, with the other parameters remaining the same. Figure 5a–c presents the results for scenarios F, G, and H, respectively. From Fig. 5a, it can be observed that with a very small number of neighbors (K = 1), the performance is not very good, due to noise in the dataset. However, when we increase K from 5 to 30, the performance does not change significantly. Similar results are obtained for Scenarios G and H. Hence, we set K = 5 in all other experiments. Note that a large value of K may increase the prediction time, because each classifier has to be evaluated on a large neighbor set.

Parameters S and L. The parameters S and L, which are used to control the import of data into the validation set, do not influence the classification performance of our algorithm much. However, large values of L and S would take more time when classifying a test example, as the algorithm needs to go through a larger validation set. The detailed experimental results for these parameters are omitted here, as the proposed algorithm is not sensitive to them.
Fig. 6 Time complexity of DCEPU: total time (seconds) with respect to the number of batches
5.5 Time and memory complexity analysis

Suppose there are w batches of data over the whole stream, with each batch consisting of m examples. The complexity of DCEPU includes three major parts: (1) extracting reliable negative examples using a successful PU learner, (2) building a base classifier on the data Pn ∪ Tn, and (3) classifying the unlabeled data in the most up-to-date batch. The first part involves training a PU learner, with cost G(m), and evaluating the unlabeled data in the most up-to-date batch with it, where G(m) is a function of the batch size m; so the time complexity of this step is O(G(m)) + O(m). The second part builds a base classifier on Pn ∪ Tn, with time complexity approximately O(Cn(m)). The third part evaluates on the validation set and classifies the unlabeled data in the most up-to-date batch, with time complexity O((L + S · H) · m) = O(m), as L, S, and H are constants. Thus, the total time complexity for the whole stream is given in formula (6):

$$T_{total} = O(w \cdot (G(m) + m + C_n(m) + m)) = O(w \cdot (G(m) + C_n(m) + 2m)) \qquad (6)$$
It can be seen from formula (6) that the time complexity of DCEPU is linear in the number of batches over the stream. This is also reflected in our experiments, as shown in Fig. 6, where the X-axis gives the number of batches processed over the stream and the Y-axis gives the total time taken by our DCEPU algorithm. It is obvious that the total time taken by the DCEPU algorithm is approximately linear in the number of batches.

The maximum memory of the JVM was set to 1 GB in our experiments to guarantee that only limited memory space is utilized by our algorithm. We do not store the entire dataset in main memory; on the contrary, only the S most recent batches of positive data are stored in memory and used by our algorithm.

6 Discussion

In this section, we further discuss the effectiveness of our DCEPU algorithm against the LELC, DVS (with necessary adaptation), and Stacking style ensemble-based approaches. Table 5 gives the details of the differences among the four methods in our experiments.
Table 5 Comparison of various algorithms

Algorithm | Extracts reliable neg. | Extracts likely pos. and neg. | Number of classifiers | Validation set | Weighting scheme
LELC      | √ | √ | 1 | – | –
DVS       | √ | – | R | 1. Positive documents Pn; 2. Reliable negatives Tn | Local weight
Stacking  | √ | – | R | – | Static
DCEPU     | √ | – | R | 1. Selected from (Pn−S, Pn−S+1, ..., Pn); 2. Sample(L, Tn) | Local weight and global weight

From Table 5, it is clear that all four algorithms extract reliable negative examples from the unlabeled set, and LELC even extracts likely positive and negative documents by constructing micro-clusters. However, in our experiments, the LELC algorithm
works worst among the four algorithms. This is mainly because: (1) when LELC extracts likely positive and negative documents in the form of micro-clusters, their quality sometimes cannot be guaranteed, since if a micro-cluster is classified wrongly, the false negative (or positive) examples in the whole micro-cluster are added to the negative (or positive) set, which is undoubtedly detrimental to building a robust final classifier; and (2) in LELC, only one classifier is employed to predict the unlabeled documents, whereas the other methods use a set of classifiers to classify the streaming data, and it is generally believed that ensemble learning outperforms the single classifier approach [4].

Table 5 also shows that DVS (with necessary adaptation) is a dynamic classifier ensemble algorithm as well, which straightforwardly utilizes Pn ∪ Tn as the validation set and employs a local weighting scheme in the classification phase. Although Tsymbal et al. show that the original DVS outperforms static classifier ensemble methods on fully labeled streams [27], in our experiments, we observe that DVS (with necessary adaptation) does not make any improvement over the Stacking style ensemble-based algorithm for positive unlabeled streams, which is a static classifier ensemble method. This is because the validation set Pn ∪ Tn is not accurate enough and is extremely imbalanced (|Tn| ≫ |Pn|), which ultimately affects the performance of DVS. Furthermore, only a local weighting scheme is applied to determine the weight of each base classifier, which sometimes cannot achieve high classification performance. In contrast, our proposed DCEPU adopts a simple yet effective strategy to construct the validation set and uses a novel weighting scheme to determine the weight of each classifier in the classification phase, which considers not only the local classification accuracy (local weight) of each classifier on the neighbor set surrounding the test example, but also the global performance (global weight) of the corresponding classifier. As a result, DCEPU outperforms the DVS algorithm significantly.

Table 5 also reveals that the major difference between the Stacking style ensemble-based algorithm and the DCEPU algorithm lies in the way the weights of the base classifiers are determined. For the Stacking style ensemble-based method, the weight of each classifier is decided before the classification phase, i.e., each classifier uses a fixed weight to predict the unlabeled data in the stream. In contrast, for the DCEPU algorithm, the weight of each classifier is decided in the classification phase, i.e., each classifier uses different weights to predict different examples; this weight is influenced by its local performance on the neighbors of the test document and by a global weight. The experimental results witness the effectiveness of DCEPU over the Stacking style ensemble-based algorithm.
7 Conclusion

Positive unlabeled data stream learning is prevalent in real-life applications; however, it has not been well studied in the research community. In this paper, we consider the problem of positive unlabeled text stream classification. We propose a novel dynamic classifier ensemble algorithm, DCEPU, to address this problem. After constructing an appropriate validation set from the most recent batches of positive data and the reliable negative examples of the most up-to-date batch, a novel weighting scheme is used to assign a weight to each base classifier in the ensemble. This weighting scheme considers not only the local weight of each base classifier on the neighbors surrounding the test example, but also a global weight of each classifier. Experimental results on the benchmark dataset RCV1-v2 demonstrate that our DCEPU algorithm significantly outperforms the existing algorithms LELC [20], DVS (with necessary adaptation) [27], and the Stacking style ensemble-based algorithm [39] in handling both gradual and sudden concept drift.

The proposed algorithm can only cope with text stream classification. In the future, we will study algorithms for general data stream classification under positive and unlabeled scenarios.

Acknowledgments This work is supported by the National Natural Science Foundation of China (60873196) and the Chinese Universities Scientific Fund (QN2009092). The authors would like to thank the anonymous referees for their very constructive comments and suggestions.
References

1. Calvo B, Larranaga P, Lozano JA (2005) Learning Bayesian classifiers from positive and unlabeled examples. Pattern Recognit Lett 28(16):2375–2384
2. Cheng R, Kalashnikov D, Prabhakar S (2005) Learning from positive and unlabeled examples. Theor Comput Sci 38(1):70–83
3. Didaci L, Giacinto G, Roli F, Marcialis GL (2005) A study on the performances of dynamic classifier selection based on local accuracy estimation. Pattern Recognit 38(11):2188–2191
4. Dietterich TG (2002) Ensemble methods in machine learning. In: Proceedings of the first international workshop on multiple classifier systems, pp 1–15
5. Domingos P, Hulten G (2000) Mining high-speed data streams. In: Proceedings of the sixth ACM SIGKDD international conference on knowledge discovery and data mining (KDD'00), Boston, pp 71–80
6. Fan W (2004) Systematic data selection to mine concept-drifting data streams. In: Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining (KDD'04), ACM Press, pp 128–137
7. Fan W, Huang YA, Wang H, Yu PS (2004a) Active mining of data streams. In: Proceedings of the fourth SIAM international conference on data mining (SDM'04), pp 457–461
8. Fan W, Huang YA, Yu PS (2004b) Decision tree evolution using limited number of labeled data items from drifting data streams. In: Proceedings of the fourth IEEE international conference on data mining (ICDM'04), pp 379–382
9. Fung GPC, Yu JX, Lu H, Yu PS (2006) Text classification without negative examples revisit. IEEE Trans Knowl Data Eng 18(1):6–20
10. Grossi V, Turini F (2010) Stream mining: a novel architecture for ensemble-based classification. Knowl Inf Syst: 1–35. doi:10.1007/s10115-011-0378-4
11. Huang S, Dong Y (2007) An active learning system for mining time-changing data streams. Intell Data Anal 11(4):401–419
12. Hulten G, Spencer L, Domingos P (2001) Mining time-changing data streams. In: Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining (KDD'01), pp 97–106
13. Klinkenberg R, Joachims T (2000) Detecting concept drift with support vector machines. In: Proceedings of the seventeenth international conference on machine learning (ICML'00), pp 487–494
14. Ko A, Sabourin R, Britto A Jr (2008) From dynamic classifier selection to dynamic ensemble selection. Pattern Recognit 41(5):1718–1731
15. Kolter JZ, Maloof MA (2003) Dynamic weighted majority: a new ensemble method for tracking concept drift. In: Proceedings of the third international conference on data mining (ICDM'03), pp 123–130
16. Lewis DD, Yang Y, Rose TG, Li F (2004) RCV1: a new benchmark collection for text categorization research. J Mach Learn Res 5:361–397
17. Li C, Zhang Y, Li X (2009a) OcVFDT: one-class very fast decision tree for one-class classification of data streams. In: Proceedings of the third international workshop on knowledge discovery from sensor data, Paris, pp 79–86
18. Li X, Liu B (2003) Learning to classify texts using positive and unlabeled data. In: International joint conference on artificial intelligence (IJCAI'03), pp 587–594
19. Li X, Liu B (2005) Learning from positive and unlabeled examples with different data distributions. In: Proceedings of the European conference on machine learning (ECML'05), pp 218–229
20. Li XL, Yu PS, Liu B, Ng SK (2009b) Positive unlabeled learning for data stream classification. In: Proceedings of the ninth SIAM international conference on data mining (SDM'09), pp 257–268
21. Liu B, Lee WS, Yu PS, Li X (2002) Partially supervised classification of text documents. In: Proceedings of the nineteenth international conference on machine learning (ICML'02)
22. Liu B, Dai Y, Li X, Lee WS, Yu PS (2003) Building text classifiers using positive and unlabeled examples. In: Proceedings of the third IEEE international conference on data mining (ICDM'03), pp 179–186
23. Salton G, Buckley C (1988) Term-weighting approaches in automatic text retrieval. Inf Process Manag 24(5):513–523
24. Schölkopf B, Platt JC, Shawe-Taylor J, Smola AJ, Williamson RC (2001) Estimating the support of a high-dimensional distribution. Neural Comput 13(7):1443–1471
25. Sebastiani F (2002) Machine learning in automated text categorization. ACM Comput Surv 34(1):1–47
26. Street W, Kim Y (2001) A streaming ensemble algorithm (SEA) for large-scale classification. In: Proceedings of the seventh international conference on knowledge discovery and data mining (KDD'01), pp 377–382
27. Tsymbal A, Pechenizkiy M, Cunningham P, Puuronen S (2008) Dynamic integration of classifiers for handling concept drift. Inf Fusion 9(1):56–68
28. Wang H, Fan W, Yu PS, Han J (2003) Mining concept-drifting data streams using ensemble classifiers. In: Proceedings of the ninth international conference on knowledge discovery and data mining (KDD'03), pp 226–235
29. Widmer G, Kubat M (1993) Effective learning in dynamic environments by explicit context tracking. In: European conference on machine learning (ECML'93). Springer, Berlin, pp 227–243
30. Widmer G, Kubat M (1996) Learning in the presence of concept drift and hidden contexts. Mach Learn 23(1):69–101
31. Widyantoro D, Yen J (2005) Relevant data expansion for learning concept drift from sparsely labeled data. IEEE Trans Knowl Data Eng 17(3):401–412
32. Woods K, Kegelmeyer WP Jr, Bowyer K (1997) Combination of multiple classifiers using local accuracy estimates. IEEE Trans Pattern Anal Mach Intell 19(4):405–410
33. Wozniak M (2010) A hybrid decision tree training method using data streams. Knowl Inf Syst: 1–13. doi:10.1007/s10115-010-0345-5
34. Wu S, Yang C, Zhou J (2006) Clustering-training for data stream mining. In: Proceedings of the sixth IEEE international conference on data mining workshops (ICDMW'06), pp 653–656
35. Yu H, Han J, Chang KCC (2004) PEBL: web page classification without negative examples. IEEE Trans Knowl Data Eng 16(1):70–81
36. Zhang B, Zuo W (2008) Learning from positive and unlabeled examples: a survey. In: International symposiums on information processing, IEEE Computer Society, Los Alamitos, pp 650–654
37. Zhang P, Zhu X, Shi Y (2008a) Categorizing and mining concept drifting data streams. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining (KDD'08), Las Vegas, pp 812–820
38. Zhang Y, Jin X (2006) An automatic construction and organization strategy for ensemble learning on data streams. ACM SIGMOD Rec 35(3):28–33
39. Zhang Y, Li X, Orlowska M (2008b) One-class classification of text streams with concept drift. In: Proceedings of the 2008 IEEE international conference on data mining workshops (ICDMW'08), pp 116–125
40. Zhou ZH, Wu J, Tang W (2002) Ensembling neural networks: many could be better than all. Artif Intell 137(1–2):239–263
41. Zhu X, Wu X, Yang Y (2006) Effective classification of noisy data streams with attribute-oriented dynamic classifier selection. Knowl Inf Syst 9(3):339–363
42. Zhu X, Zhang P, Lin X, Shi Y (2007) Active learning from data streams. In: Proceedings of the seventh international conference on data mining (ICDM'07), pp 757–762
43. Zhu X, Ding W, Yu P, Zhang C (2010) One-class learning and concept summarization for data streams. Knowl Inf Syst: 1–31. doi:10.1007/s10115-010-0331-y
Author Biographies

Shirui Pan received his M.S. degree in Computer Software and Theory from Northwest A&F University, Shaanxi, China, in 2011. He is currently a Ph.D. candidate at the Faculty of Engineering and Information Technology, University of Technology, Sydney (UTS), Australia. His research interests include data stream classification and positive unlabeled learning.

Yang Zhang received his M.S. and Ph.D. degrees in Computer Software and Theory from Northwest Polytechnical University in 2000 and 2005, respectively. He is a professor and doctoral supervisor at Northwest A&F University. His research interests include data mining and machine learning.

Xue Li received his M.S. degree in Computer Science from the University of Queensland in 1990 and his Ph.D. degree in Information Systems from Queensland University of Technology in 1997. He is an associate professor at the School of Information Technology and Electrical Engineering of the University of Queensland (UQ) in Brisbane, Queensland, Australia. His research interests include data mining, multimedia data security, database systems, and intelligent Web information systems.