2014 IEEE International Conference on Data Mining Workshop
Supervised Adaptive-transfer PLSA for Cross-Domain Text Classification
Rui Zhao and Kezhi Mao
School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore
Email: {rzhao001, ekzmao}@ntu.edu.sg

and add them to the source domain, i.e., the training data. However, this solution is impractical because data labeling is time-consuming and requires a huge amount of human labor. To address the above problems, transfer learning techniques, or more specifically transductive transfer learning approaches, have been proposed [1]. Transductive transfer learning methods aim to train a model for direct use in the target domain using all the data from the labeled source domain and the unlabeled target domain. The majority of algorithms proposed in the literature seek a unified new feature space to diminish the distributional difference between the source and target domains, so that the trained model fits the target domain well. In this paper, we propose a novel approach named Supervised Adaptive-transfer PLSA (SAtPLSA) to tackle the cross-domain text classification problem. Probabilistic Latent Semantic Analysis (PLSA) is a popular probabilistic topic modeling approach for text mining [2]. First, we modify PLSA into a supervised learning algorithm by making the latent variable observable: each latent topic is set to be the category that a document belongs to. For training documents in the source domain, P(w_j|c_i), the class-conditional probability of a specific word given a class, can be estimated directly during initialization and is then fixed in the subsequent model fitting process. For testing documents from the target domain, the word-category probabilities are randomly initialized and learned throughout the process.
Because the classification tasks in the source and target domains are the same and the class label sets of the two domains are identical, the word-category probabilities serve as a bridge connecting the two domains. A parameter is introduced to control the weighting of knowledge transfer between the source and target domains. Since we only need to boost the classification accuracy in the target domain, an intuitive idea is that the importance of knowledge extracted from the source domain should fade as training progresses. Therefore, we adjust the weighting parameter adaptively during the model fitting process. In addition, to enforce the supervision information from the source domain, the must-link and cannot-link constraints on documents, first used in semi-supervised clustering, are adopted in our algorithm. We apply the EM algorithm
Abstract—Cross-domain learning is a promising technique for improving classification in a target (testing) domain whose data distribution differs greatly from the source (training) domain. Many cross-domain text classification methods are built on topic modeling approaches. However, topic model methods are unsupervised in nature and do not fully utilize the label information of the source domain. In addition, almost all cross-domain learning approaches still utilize source-domain knowledge in the later stages of the training process, which limits knowledge transfer. In this paper, we propose a model named Supervised Adaptive-transfer Probabilistic Latent Semantic Analysis (SAtPLSA) for cross-domain text classification, aiming to address these two issues. The proposed model extends the original PLSA to a supervised learning paradigm. By sharing the labeled information carried by each term across domains, we transfer knowledge from the source domain to assist classification in the target domain. In addition, we adaptively modify the weight controlling the proportion of source-domain knowledge used during model learning. Finally, we conducted experiments on nine benchmark datasets for cross-domain text classification, comparing our proposed algorithm with two classical supervised learning methods and five state-of-the-art transfer learning approaches. The experimental results show the effectiveness and efficiency of the proposed SAtPLSA algorithm.

Keywords—PLSA, Cross-Domain Learning, Text Classification.
I. INTRODUCTION

Many existing machine learning techniques are based on the assumption that the training and testing data stem from the same distribution. However, in applications such as text classification, the training data can easily become outdated, which means that newly arriving unlabeled testing data may not follow the same distribution as the training data. For example, there are ample articles posted on the web. Some of them are pre-categorized, such as the news on websites like Yahoo! News, CNN and BBC; these are regarded as the source domain in our task. The rest, including blog articles, tweets and so on, form the unlabeled target domain. In general, a classifier trained on the source domain is unable to achieve good performance on the target domain, because the data distributions of the two domains can be very different due to different word choices, organization styles and writing skills. One possible solution is to label some of the data in the target domain

978-1-4799-4274-9/14 $31.00 © 2014 IEEE DOI 10.1109/ICDMW.2014.163
to learn the model parameters. Since the latent topic is set to be the class label, the learned parameter P(c_j|d_i), denoting a document-specific distribution over class labels, finally determines the classification results for the target domain. One major modification of our work compared with the original PLSA algorithm is that the latent topics are converted into observable class labels. Although this departs from the original PLSA's motivation of optimizing a mixture decomposition for a good representation of data in a latent space, the proposed algorithm utilizes the pre-calculated class-conditional word probability P(w_j|c_i), along with must-link and cannot-link constraints, to form strong supervision information. This information is injected into the iterative process to guide the state of the target domain to converge directly and efficiently into the pre-defined class label space, rather than a latent variable space. This modification not only speeds up the model fitting process, but also avoids selecting the number of latent topics, and the model can easily be extended to multi-class classification. At the same time, our algorithm adaptively transfers knowledge from the source domain to the target domain. The experimental results show that these procedures all improve the performance of our proposed algorithm. The rest of the paper is organized as follows. After briefly reviewing related work in Section 2, we propose SAtPLSA for cross-domain text classification in Section 3. In Section 4, we evaluate the performance of our algorithm in an experimental study. Finally, concluding remarks and possible future work are provided in Section 5.
To complete prediction tasks, supervised LDA has been proposed by adding a response variable associated with each document [6]. Labeled LDA (L-LDA) has also been proposed to solve credit attribution in documents, where each document contains multiple tags [7]. In L-LDA, a one-to-one correspondence between LDA's latent topics and user tags is introduced to learn word-tag correspondences.

B. Transductive Transfer Learning

As investigated in [1], transductive transfer learning is a type of transfer learning in which the data in the source domain are labeled, the data in the target domain are unlabeled, and the learning tasks are the same in the two domains. During the training phase, the unlabeled target-domain data must be available. Based on what is transferred, approaches for transductive transfer learning can be categorized into two classes: instance-transfer methods and feature-representation-transfer methods. Since our work is closely related to the latter, we focus only on feature-representation-transfer methods here. These methods aim at obtaining a good feature representation to improve the performance of the task in the target domain. One type of algorithm encodes application-specific knowledge to learn the feature transformation. These methods utilize pivot features to align features across domains. One famous work is Spectral Feature Alignment (SFA). In SFA, the new features are derived from spectral clustering of a bipartite graph between the pivot and remaining features. It should be noted that the performance of these methods depends heavily on the selection of pivot features [8]. However, picking optimal pivot features can only be done empirically and lacks unified theoretical guidance. In recent years, some feature-representation-transfer methods based on topic modeling have been proposed, which is also the basic motivation behind our work.
Because of topic modeling's strong power for summarizing and understanding massive amounts of information, these approaches have attracted great attention. For example, Topic Correlation Analysis (TCA), proposed in [9], extracts shared topics and domain-specific topics to perform knowledge transfer. However, this approach requires the alignment between domain-specific topics and the label information in the training domain. Collaborative Dual-PLSA (CDPLSA) is proposed based on Dual-PLSA [10], [11]. Dual-PLSA utilizes two latent variables to model word topics and document topics separately, and CDPLSA further explores the commonality and distinction of the topics across multiple domains. However, for word topics, the correlations among different domains are ignored during the modeling process. An approach called Partially Supervised Cross-Collection LDA (PSCCLDA) has been proposed to classify cross-domain text [12]. This method admits the existence of the duality of the
II. RELATED WORK

Our work builds on earlier work in topic modeling and transductive transfer learning. In the following, we briefly review valuable works related to ours in these two areas.

A. Topic Modeling

Topic modeling approaches, such as Latent Dirichlet Allocation (LDA) and Probabilistic Latent Semantic Analysis (PLSA), are derived from Latent Semantic Indexing; they reduce the high-dimensional document-term occurrence matrix into low dimensions denoted as latent topics. In nature, these approaches are based on clustering and require no label information. However, in some scenarios available label information can provide a vital performance boost, so several extensions of topic models have been proposed to integrate useful prior knowledge. Semi-supervised PLSA has been proposed to cluster documents by utilizing the must-link and cannot-link pairwise constraints provided by labeled data [3], [4]. These constraints are incorporated into the final clustering objective function, an idea first adopted in semi-supervised clustering [5].
marginal distribution of examples and the conditional distribution of class labels. Meanwhile, domain-independent and domain-specific latent features are learned to further improve classification performance. However, this approach rests on the very strong assumption that the numbers of topics in all topic sets for different domains are the same. The work most related to our proposed algorithm is Topic-bridged PLSA (TPLSA) [4]. TPLSA transfers knowledge across domains by assuming that the source and target domains jointly share the same topics, and the supervision of labeled data in the source domain is induced by pairwise constraints, as in the semi-supervised PLSA mentioned above. The assumption that topics are shared by the source and target domains has been questioned in [12]. Different from TPLSA, we let the number of latent topics equal the number of class labels to enable a one-to-one mapping between them. The class label sets are definitely shared across domains, and the alignment of the mapping is realized by the supervision of labeled data in the source domain, which consists of not only the pairwise constraints but also the pre-calculated probabilities of specific words conditioned on a class. Since the number of parameters to be estimated is reduced and more supervised information is introduced, the performance of our method, in terms of both classification accuracy and convergence speed, is improved compared with TPLSA.
Figure 1. Graphical Model for PLSA (nodes d, z, w with edges p(d|z) and p(w|z))
d_i and word w_j can be expressed by a mixture model with latent topics, given as follows:

$$P(w_j, d_i) = \sum_{z} P(w_j|z)\,P(d_i|z)\,P(z) \qquad (1)$$
where z denotes a topic from the latent topic set. Learning the parameters P(w|z), P(d|z) and P(z) is a maximum likelihood estimation problem in latent variable models, which can be solved via the Expectation Maximization (EM) algorithm [13]. The graphical representation of PLSA is given in Figure 1, where observed variables are denoted by shaded nodes and latent variables by hollow nodes [14]. Obviously, the word-document pairs are observable, while the latent topics are not.

2) Supervised Adaptive-transfer PLSA: Based on PLSA, our model makes two major extensions to solve the cross-domain classification problem. First, we regard each latent topic in PLSA as one specified, observable class label. That is, the number of latent topics equals the number of document categories in the whole corpus. In this setting, we use the label information in the source domain to calculate the class-conditional word probability P(w|c), which not only ensures alignment between the introduced class labels and the true document categories, but also speeds up the model fitting process and improves classification accuracy. The label assignment for documents in the target domain is determined by the distribution of class labels over documents P(d|c). The other extension is similar to the work in [4]. Here, we consider the class distributions in the source and target domains to be different, but the two domains are related, so the word-category probabilities P(w|c) are shared by both domains. In summary, the probability of generating a training document for a specified class, P(d^s|c), differs from the probability of generating a testing document for the same class, P(d^t|c), while the class-conditional word probabilities P(w|c) are the same across the source and target domains.
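To make the PLSA fitting summarized in the review above concrete, the following is a minimal EM sketch for vanilla PLSA (Eq. (1)) on a toy document-word count matrix. This is an illustration only, not the paper's implementation; all array and function names are mine.

```python
import numpy as np

def plsa(n_dw, K, iters=50, seed=0):
    """Minimal PLSA via EM on a document-word count matrix n_dw (D x W).
    Returns P(z), P(d|z), P(w|z). Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    D, W = n_dw.shape
    p_z = np.full(K, 1.0 / K)
    p_dz = rng.random((D, K)); p_dz /= p_dz.sum(axis=0)   # P(d|z): columns sum to 1
    p_wz = rng.random((W, K)); p_wz /= p_wz.sum(axis=0)   # P(w|z): columns sum to 1
    for _ in range(iters):
        # E-step: P(z|d,w) proportional to P(z) P(d|z) P(w|z)
        joint = p_z[None, None, :] * p_dz[:, None, :] * p_wz[None, :, :]  # D x W x K
        post = joint / joint.sum(axis=2, keepdims=True)
        # M-step: re-estimate parameters from expected counts
        nz = n_dw[:, :, None] * post
        p_dz = nz.sum(axis=1); p_dz /= p_dz.sum(axis=0)
        p_wz = nz.sum(axis=0); p_wz /= p_wz.sum(axis=0)
        p_z = nz.sum(axis=(0, 1)); p_z /= p_z.sum()
    return p_z, p_dz, p_wz
```

SAtPLSA replaces the latent z with observable class labels, but the alternating E/M structure stays the same.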
This means that knowledge is transferred from the source domain to the target domain through the word-category probabilities. Finally, the different but related source and target domains are incorporated into a joint supervised model, as shown in Figure 2. Since documents from the source domain are labeled and documents from the target domain are unlabeled, the upper category node is shaded and the lower one is hollow. To control the knowledge transfer process between the source and target domains, a weighting parameter α is adopted to adjust the relative proportions of the two domains in connecting them. After the introduction of the weighting
III. PROPOSED ALGORITHM

A. Problem Formulation

In this section, we describe the scenario to which our method applies. The source domain is a data collection containing N_s documents, denoted by D^s = {(d_1^s, c_1^s), ..., (d_{N_s}^s, c_{N_s}^s)}, where d_i^s and c_i^s are the i-th training document and its corresponding class label, respectively. The unlabeled target domain is denoted by D^t = {d_1^t, ..., d_{N_t}^t}, where d_i^t is the i-th document in the test corpus containing N_t documents. The task is to assign class labels to the documents in the target domain. In this paper, the tasks in the source and target domains are the same, which means that the label set remains unchanged across the two domains. However, the data distribution in the target domain D^t is related to but different from that in the source domain D^s.

B. Our Model

In this section, we first give a brief description of PLSA, then describe our proposed model, Supervised Adaptive-transfer PLSA, and its parameter learning process. Finally, we formulate our proposed approach for cross-domain classification.

1) A Review of PLSA: PLSA is a generative latent topic model that analyzes text data by performing a probabilistic mixture decomposition of the word-document co-occurrence matrix. In this model, the observation for a pair of document
parameter α, the log-likelihood function is formed in Eq. (2), which is maximized to produce estimates of the model parameters:

$$L_c = \sum_{j=1}^{M}\Big[\alpha \sum_{i=1}^{N_s} n(w_j, d_i^s)\log\sum_{c} P(w_j|c)\,P(d_i^s|c) + (1-\alpha)\sum_{i=1}^{N_t} n(w_j, d_i^t)\log\sum_{c} P(w_j|c)\,P(d_i^t|c)\Big] \qquad (2)$$

Figure 2. Graphical Model for SAtPLSA (shaded source-domain category node and hollow target-domain category node, with documents d^s, d^t linked to c via p(d^s|c), p(d^t|c), and words w generated via p(w|c))
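The weighted log-likelihood of Eq. (2) can be computed directly with matrix products; the sketch below assumes dense count matrices and probability tables, with all names being illustrative rather than from the paper.

```python
import numpy as np

def log_likelihood_Lc(n_s, n_t, p_wc, p_dsc, p_dtc, alpha):
    """Eq. (2): source-domain term weighted by alpha, target-domain term
    by (1 - alpha). n_s: Ns x M counts, n_t: Nt x M counts,
    p_wc: M x C word-class probs, p_dsc: Ns x C, p_dtc: Nt x C. Sketch only."""
    # sum_c P(w_j|c) P(d_i|c) for every (document, word) pair
    mix_s = p_dsc @ p_wc.T          # Ns x M
    mix_t = p_dtc @ p_wc.T          # Nt x M
    return (alpha * np.sum(n_s * np.log(mix_s))
            + (1 - alpha) * np.sum(n_t * np.log(mix_t)))
```

In practice the counts are sparse, so a real implementation would iterate only over nonzero entries.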
Then, based on the two kinds of constraints, we can derive a document-document pair log-likelihood function as follows:

$$L_d = \sum_{d_i\in D^s}\sum_{d_j\in D^s}\big[\beta_1\, t_1(d_i,d_j)\log P_s(d_i,d_j) + \beta_2\, t_2(d_i,d_j)\log P_d(d_i,d_j)\big] \qquad (5)$$

where β_1 and β_2 are weights, and t_1(d_i, d_j), t_2(d_i, d_j) are indicator functions mapping whether the two documents d_i and d_j belong to the same category into a binary value. Specifically, if the two documents belong to the same category, t_1(d_i, d_j) = 1 and t_2(d_i, d_j) = 0; otherwise t_1(d_i, d_j) = 0 and t_2(d_i, d_j) = 1. Finally, Eq. (2) and Eq. (5) are merged into the final objective function:

$$L = L_c + L_d \qquad (6)$$
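The constraint term L_d of Eq. (5) is easy to evaluate once the pairwise probabilities of Eqs. (3)-(4) are available; a small sketch follows (the matrices P_same/P_diff and the β defaults are illustrative assumptions, with the β values taken from the experiments section).

```python
import numpy as np

def log_likelihood_Ld(labels, P_same, P_diff, beta1=50.0, beta2=15.0):
    """Eq. (5): pairwise-constraint log-likelihood over source documents.
    labels: length-N class labels; P_same[i,j] ~ P_s(d_i,d_j) and
    P_diff[i,j] ~ P_d(d_i,d_j) from Eqs. (3)-(4). Illustrative sketch."""
    labels = np.asarray(labels)
    t1 = (labels[:, None] == labels[None, :]).astype(float)  # must-link indicator
    t2 = 1.0 - t1                                            # cannot-link indicator
    return np.sum(beta1 * t1 * np.log(P_same) + beta2 * t2 * np.log(P_diff))
```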
The model fitting process maximizes this final likelihood with respect to all probability mass functions, i.e., the model parameters.

3) Parameter Learning with the EM Algorithm: The EM algorithm is used to find a local optimum of the non-convex objective function L. In nature, EM is an iterative process alternating between the two following steps.

(1) The Expectation (E) step: the posterior probabilities of the class label given the observed co-occurrence of the i-th document and the j-th vocabulary word, for the source and target domains respectively, are calculated as follows:

$$P(c|d_i^s, w_j) = \frac{P(w_j|c)\,P(d_i^s|c)}{\sum_{c} P(w_j|c)\,P(d_i^s|c)} \qquad (7)$$

$$P(c|d_i^t, w_j) = \frac{P(w_j|c)\,P(d_i^t|c)}{\sum_{c} P(w_j|c)\,P(d_i^t|c)} \qquad (8)$$
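The E-step of Eqs. (7)-(8) is the same Bayes normalization for both domains; a vectorized sketch (array names are illustrative):

```python
import numpy as np

def e_step_posteriors(p_wc, p_dc):
    """E-step of Eqs. (7)-(8): P(c|d_i, w_j) proportional to P(w_j|c) P(d_i|c),
    normalized over classes. p_wc: M x C, p_dc: N x C. Returns N x M x C."""
    joint = p_dc[:, None, :] * p_wc[None, :, :]      # N x M x C
    return joint / joint.sum(axis=2, keepdims=True)  # normalize over c
```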
Because of the introduction of the pairwise constraints, we also need to consider the posterior probability of the class label given the link status of a document pair in the source domain. The link status covers two conditions: the two documents belong to the same category, or they belong to different categories. Corresponding to these two conditions, two kinds of posterior probabilities are derived as follows:
where n(w_j, d_i) represents the count of w_j in document d_i, M denotes the vocabulary size, and the other symbols are as described above.

Unlike previous work, in which knowledge is transferred in a fixed manner, we consider an adaptive scheme that combines knowledge from the source and target domains by adjusting the weight α dynamically. It is motivated by the following intuition: our algorithm originates from PLSA, which can be regarded as a clustering method. In the initial phase of training, the label distribution over the target domain is randomly set, so knowledge transferred from the source domain improves clustering performance in the target domain. However, since the two domains are different, knowledge from the source domain may hinder the label assignment process in the target domain from following its own intrinsic structure in the later phase of training. Therefore, an adaptive weight adjustment scheme is proposed: after each iteration, the training accuracy is calculated, and if it exceeds a threshold, the original weight is reduced to a low value and the iterations continue with the modified weighting parameter. In the experiments section, we provide a detailed empirical study of the adaptive transfer settings. From another viewpoint, the weight adjustment scheme can be regarded as a means of improving generalization and preventing overfitting, since with the adaptive transfer setting the training process avoids relying on the source domain too much.

To further utilize the label information in the training domain, the must-link and cannot-link constraints, first proposed in semi-supervised clustering [5], are also integrated into our model.
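The adaptive weight-adjustment scheme described above amounts to a simple threshold rule; a sketch (the parameter names mirror the algorithm's α_0, α_1 and P_t, and the defaults are the values used later in the experiments):

```python
def adapt_alpha(alpha, train_acc, threshold=0.99, alpha_adjusted=0.0):
    """Adaptive-transfer rule: once training accuracy on the source domain
    exceeds the threshold, the transfer weight alpha is dropped to a low
    value (0 in the experiments). Otherwise alpha is kept unchanged."""
    return alpha_adjusted if train_acc > threshold else alpha
```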
The must-link constraint indicates that if two documents belong to the same class, the probability that the two documents are generated by the same class label should be maximized. Similarly, the cannot-link constraint indicates that if two documents belong to different classes, the probability that the two documents are generated by different class labels should be maximized. These two probabilities are defined in Eq. (3) and Eq. (4), respectively:

$$P_s(d_i^s, d_j^s) = \sum_{c} P(d_i^s|c)\,P(d_j^s|c) \qquad (3)$$

$$P_d(d_i^s, d_j^s) = \sum_{c_i, c_j:\, c_i\neq c_j} P(d_i^s|c_i)\,P(d_j^s|c_j) \qquad (4)$$
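Both pairwise probabilities of Eqs. (3)-(4) can be obtained from the document-class table in closed form, since the sum over all (c_i, c_j) pairs minus the diagonal c_i = c_j terms gives the cannot-link sum. A sketch, with names of my choosing:

```python
import numpy as np

def pairwise_link_probs(p_dc):
    """Eqs. (3)-(4): P_s(d_i,d_j) = sum_c P(d_i|c) P(d_j|c) (must-link) and
    P_d(d_i,d_j) = sum_{c_i != c_j} P(d_i|c_i) P(d_j|c_j) (cannot-link).
    p_dc: N x C matrix of P(d|c) values for source documents. Sketch only."""
    P_same = p_dc @ p_dc.T                                 # shared class c
    total = np.outer(p_dc.sum(axis=1), p_dc.sum(axis=1))   # all (c_i, c_j) pairs
    P_diff = total - P_same                                # drop c_i == c_j terms
    return P_same, P_diff
```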
$$P(d_i^s, d_j^s, c) = \frac{P(d_i^s|c)\,P(d_j^s|c)}{P_s(d_i^s, d_j^s)} \qquad (9)$$

$$P(d_i^s, d_j^s, c_i, c_j) = \frac{P(d_i^s|c_i)\,P(d_j^s|c_j)}{P_d(d_i^s, d_j^s)} \qquad (10)$$
we can directly predict the document labels in the target domain based on the final values of P(c|d^t):

$$c = \arg\max_{c} P(c|d^t) \qquad (16)$$
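The prediction rule of Eq. (16) is a plain argmax over the per-document class posteriors; a one-line sketch (names illustrative):

```python
import numpy as np

def predict_labels(p_cd):
    """Eq. (16): assign each target document the class with the largest
    P(c|d^t). p_cd: Nt x C posterior matrix. Returns class indices."""
    return np.argmax(p_cd, axis=1)
```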
where c_i and c_j represent two different categories.

(2) The Maximization (M) step: here, we compute the model parameters that maximize the final objective function L. There are three parameters to estimate: the probability distribution of an unlabeled target-domain document over class labels, P(d^t|c); the probability distribution of a labeled source-domain document over class labels, P(d^s|c); and the generating probability of a word w conditioned on a class label c, P(w|c). For P(d^t|c), we extract the terms containing it in L, denoted L_[P(d^t|c)]. Then, the constraint Σ_c P(d^t|c) = 1 is combined with L_[P(d^t|c)] to form a convex optimization problem. Applying the method of Lagrange multipliers yields:

$$P(d^t|c) \propto \sum_{w} n(w, d^t)\,P(c|d^t, w) \qquad (11)$$
Algorithm 1 Supervised Adaptive-transfer PLSA
Input: labeled training documents in the source domain D^s; unlabeled testing documents in the target domain D^t; number of categories K; number of iterations T_max; original weight α_0; adjusted weight α_1; training accuracy threshold P_t
Output: class labels for each document in the target domain D^t
1: Initialize P(c|d^s), P(c|d^t), and P(w|c, D^t) randomly
2: for k = 1 to K do
3:   Calculate n_k: the total number of words in documents of category k
4:   for j = 1 to M do
5:     Calculate n_{j,k}: the count of the j-th word in documents of category k
6:     Calculate P(w_j|c_k, D^s) = n_{j,k} / n_k
7:   end for
8: end for
9: Calculate P(w|c) = α P(w|c, D^s) + (1 − α) P(w|c, D^t) as the initialization of P(w|c)
10: for t = 1 to T_max do
11:   Update P(c|d_i^s, w_j), P(c|d_i^t, w_j), P(d_i^s, d_j^s, c) and P(d_i^s, d_j^s, c_i, c_j) according to Eqs. (7), (8), (9) and (10)
12:   Update P(d^t|c), P(d^s|c) and P(w|c) according to Eqs. (12), (14) and (15)
13:   for each document d^s in the source domain D^s do
14:     Predict its class label based on Eq. (16)
15:   end for
16:   Compare the predictions with the known labels in the source domain and calculate the training accuracy
17:   if the training accuracy exceeds the threshold P_t then
18:     Adjust the weight α to α_1
19:   end if
20: end for
21: for each document d^t in the target domain D^t do
22:   Predict its class label based on Eq. (16)
23: end for
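Steps 2-8 of Algorithm 1 estimate P(w_j|c_k, D^s) directly from source-domain counts; a sketch of that initialization (names illustrative):

```python
import numpy as np

def init_pwc_source(n_dw, labels, K):
    """Steps 2-8 of Algorithm 1: P(w_j|c_k, D^s) = n_{j,k} / n_k, where n_{j,k}
    counts word j in source documents of class k and n_k is the total word
    count of class k. n_dw: Ns x M counts, labels: length-Ns class indices.
    Returns an M x K matrix whose columns sum to 1. Sketch only."""
    M = n_dw.shape[1]
    p_wc = np.zeros((M, K))
    for k in range(K):
        counts = n_dw[np.asarray(labels) == k].sum(axis=0)  # n_{j,k} per word j
        p_wc[:, k] = counts / counts.sum()                   # divide by n_k
    return p_wc
```

In practice this estimate would be smoothed (e.g. add-one) to avoid zero probabilities for unseen words.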
Normalizing Eq. (11) over all class labels yields the following solution:

$$P(d_i^t|c) = \frac{\sum_{w} n(w, d_i^t)\,P(c|d_i^t, w)}{\sum_{w,c} n(w, d_i^t)\,P(c|d_i^t, w)} \qquad (12)$$

Similarly, the other parameters can be estimated as follows:

$$O(d_i^s, c) = \sum_{w} n(w, d_i^s)\,P(c|d_i^s, w) + \beta_1 \sum_{d_j^s\in D^s} P(d_i^s, d_j^s, c) + \beta_2 \sum_{d_j^s\in D^s}\sum_{c_j:\, c_j\neq c} P(d_i^s, d_j^s, c, c_j) \qquad (13)$$

$$P(d_i^s|c) = \frac{O(d_i^s, c)}{\sum_{c} O(d_i^s, c)} \qquad (14)$$

$$P(w|c) = \alpha\, P(w|c, D^s) + (1-\alpha) \sum_{d^t} n(w, d^t)\,P(c|d^t, w) \qquad (15)$$
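The M-step updates for P(d^t|c) (Eq. (12)) and P(w|c) (Eq. (15)) can be sketched as below. Names are mine; note that I normalize the target-domain term of Eq. (15) over words so the columns remain distributions, which is an assumption on my part since the equation is printed without an explicit normalizer.

```python
import numpy as np

def m_step_target(n_t, post_t):
    """Eq. (12): re-estimate P(d_i^t|c) from expected counts, normalized
    over classes. n_t: Nt x M counts, post_t: Nt x M x C E-step posteriors."""
    num = np.einsum('iw,iwc->ic', n_t, post_t)       # sum_w n(w,d) P(c|d,w)
    return num / num.sum(axis=1, keepdims=True)

def m_step_pwc(n_t, post_t, p_wc_source, alpha):
    """Eq. (15): mix the fixed source estimate P(w|c, D^s) with expected
    target counts, weighted by alpha. Normalization of the target term
    over words is my assumption."""
    target = np.einsum('iw,iwc->wc', n_t, post_t)    # sum_{d^t} n(w,d^t) P(c|d^t,w)
    target = target / target.sum(axis=0, keepdims=True)
    return alpha * p_wc_source + (1 - alpha) * target
```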
4) SAtPLSA for Cross-domain Classification: The detailed procedure of our model is depicted in Algorithm 1. We next describe each step of the algorithm in detail. First, we randomly initialize the parameters P(c|d^s), P(c|d^t), and P(w|c, D^t); P(w|c, D^s) is calculated from the labels in the source domain. Then, P(w|c) is initialized as a mixture of P(w|c, D^s) and P(w|c, D^t) with weight α_0, and this mixing is repeated in each of the following M-steps. After initialization, we fit the model by running the EM algorithm for T_max iterations. During the iterations, if the training accuracy exceeds the predefined threshold, the weight is adjusted to α_1 so that the remaining training process relies less on the source domain. When the model learning process is finished,
IV. EXPERIMENTS

In this section, we evaluate the effectiveness of our algorithm for cross-domain text classification. Two baseline classifiers and five state-of-the-art algorithms are compared with our proposed algorithm.
B. Comparison Approaches
To evaluate our SAtPLSA algorithm, two different types of algorithms are adopted here: classical supervised learning algorithms and cross-domain learning approaches. The first type consists of two baseline supervised classification algorithms, Support Vector Machines (SVM) and Logistic Regression (LR). The second type consists of five state-of-the-art cross-domain classification algorithms: Topic Correlation Analysis (TCA), Spectral Feature Alignment (SFA), Topic-bridged PLSA (TPLSA), Collaborative Dual-PLSA (CDPLSA) and Partially Supervised Cross-Collection LDA (PSCCLDA). Since the datasets are constructed in the same way as in previous work [12], we directly cite the results reported for these algorithms.

Table I. Datasets generated from 20Newsgroups and Reuters-21578

| Dataset | Source Domain D^s | Target Domain D^t |
|---|---|---|
| comp vs rec | comp.graphics, comp.sys.ibm.pc.hardware, rec.motorcycles, rec.sport.baseball | comp.os.ms-windows.misc, comp.sys.mac.hardware, rec.autos, rec.sport.hockey |
| comp vs sci | comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, sci.electronics, sci.space | comp.graphics, comp.sys.mac.hardware, sci.crypt, sci.med |
| comp vs talk | comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, talk.politics.mideast, talk.politics.misc | comp.graphics, comp.sys.mac.hardware, talk.politics.guns, talk.religion.misc |
| rec vs sci | rec.autos, rec.sport.baseball, sci.crypt, sci.med | rec.motorcycles, rec.sport.hockey, sci.electronics, sci.space |
| rec vs talk | rec.autos, rec.sport.baseball, talk.politics.mideast, talk.politics.misc | rec.motorcycles, rec.sport.hockey, talk.politics.guns, talk.religion.misc |
| sci vs talk | sci.crypt, sci.med, talk.politics.misc, talk.religion.misc | sci.electronics, sci.space, talk.politics.guns, talk.politics.mideast |
| orgs vs people | orgs.{…}, people.{…} | orgs.{…}, people.{…} |
| orgs vs places | orgs.{…}, places.{…} | orgs.{…}, places.{…} |
| people vs places | people.{…}, places.{…} | people.{…}, places.{…} |
C. Evaluation Metrics and Implementation Details

The evaluation metric here is classification accuracy. Meanwhile, since we conduct experiments on several datasets, i.e., different binary classification problems, macro-averaging and micro-averaging are introduced to derive a single aggregate measure for the different approaches. These two metrics are calculated as shown in Eqs. (17) and (18):
A. Datasets

To evaluate performance, we conducted experiments on nine datasets generated from two widely used text corpora: 20Newsgroups1 and Reuters-215782. 20Newsgroups comprises approximately 18,000 newsgroup posts in 20 sub-categories. Reuters-21578 is another benchmark for text classification, which also has a hierarchical category structure suitable for cross-domain learning settings. Following the same experimental design as previous work on cross-domain text classification, we use the hierarchical structures of these two corpora to construct two-class classification problems with a data distribution difference between the training and testing domains. Thus, the training and testing datasets contain documents from different sub-categories but follow the same top-category distribution. We generate six datasets from 20Newsgroups and three datasets from Reuters-21578, as shown in Table I. Since Reuters-21578 has many sub-categories, we do not list them in detail here. To compare our work fairly with [9], [12], we followed the same experimental design: the same processed data3 for the datasets described above are used directly, and the algorithm settings remain unchanged, so the corresponding reported results can be directly cited for comparison with our proposed algorithm.
$$A_{ma} = \frac{1}{M}\sum_{i=1}^{M} f_i \qquad (17)$$

$$A_{mi} = \frac{1}{\sum_{i=1}^{M} N_i}\sum_{i=1}^{M} N_i f_i \qquad (18)$$
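The two averages of Eqs. (17)-(18) are straightforward to compute; a small helper (illustrative names):

```python
def macro_micro(accs, sizes):
    """Eqs. (17)-(18): the macro-average A_ma is the unweighted mean of
    per-dataset accuracies f_i; the micro-average A_mi weights each
    accuracy by its dataset size N_i."""
    A_ma = sum(accs) / len(accs)
    A_mi = sum(n * f for n, f in zip(sizes, accs)) / sum(sizes)
    return A_ma, A_mi
```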
where A_ma represents the macro-average value, A_mi the micro-average value, M the number of datasets, and f_i and N_i denote the accuracy on the i-th dataset and the corresponding number of samples, respectively. For our algorithm SAtPLSA, we set the parameters α_0, β_1 and β_2 to 0.5, 50 and 15, respectively. It has been shown that the classification accuracy is not sensitive to these values, so they are set as in previous work [8]. The parameters α_1 and P_t are set to 0 and 99%, respectively, indicating that once the training accuracy exceeds 99%, the subsequent model fitting process relies only on the testing data. Finally, K and T_max are set to 2 and 50, respectively.

D. Performance on Different Datasets

In this section, we compare our proposed SAtPLSA algorithm with seven benchmark approaches on the nine datasets. The results are listed in Table II. Since random initialization is used in our algorithm, we run it three times on each dataset and report the average result with its standard deviation. As shown in Table II, SAtPLSA achieves the best performance among all algorithms, and the supervised classifiers perform worse than the cross-domain learning
1 http://people.csail.mit.edu/jrennie/20Newsgroups
2 http://www.daviddlewis.com/resources/testcollections
3 The first six datasets, from 20Newsgroups, are obtained by following the same procedure described in [9]. The other three datasets, from Reuters-21578, are downloaded from http://www.cse.ust.hk/TL/dataset/Reuters.zip
Table II. Performance comparison on different datasets

| Dataset | LR | SVM | TCA | SFA | TPLSA | CDPLSA | PSCCLDA | SAtPLSA |
|---|---|---|---|---|---|---|---|---|
| comp vs rec | 0.906 | 0.895 | 0.940 | 0.939 | 0.910 | 0.914 | 0.958 | 0.970 ± 0.008 |
| comp vs sci | 0.759 | 0.719 | 0.891 | 0.830 | 0.802 | 0.877 | 0.900 | 0.954 ± 0.010 |
| comp vs talk | 0.911 | 0.898 | 0.967 | 0.971 | 0.938 | 0.955 | 0.967 | 0.988 ± 0.002 |
| rec vs sci | 0.719 | 0.696 | 0.879 | 0.885 | 0.928 | 0.872 | 0.955 | 0.980 ± 0.001 |
| rec vs talk | 0.848 | 0.827 | 0.962 | 0.935 | 0.849 | 0.912 | 0.958 | 0.986 ± 0.002 |
| sci vs talk | 0.780 | 0.747 | 0.940 | 0.854 | 0.890 | 0.862 | 0.947 | 0.977 ± 0.009 |
| orgs vs people | 0.681 | 0.670 | 0.792 | 0.671 | 0.746 | 0.808 | 0.807 | 0.822 ± 0.010 |
| orgs vs places | 0.692 | 0.669 | 0.730 | 0.683 | 0.719 | 0.714 | 0.742 | 0.758 ± 0.002 |
| people vs places | 0.513 | 0.520 | 0.626 | 0.506 | 0.623 | 0.548 | 0.690 | 0.692 ± 0.003 |
| macro-average | 0.757 | 0.738 | 0.859 | 0.808 | 0.823 | 0.829 | 0.880 | 0.893 |
| micro-average | 0.794 | 0.772 | 0.901 | 0.864 | 0.861 | 0.871 | 0.921 | 0.947 |
algorithms on all datasets. This demonstrates that it is very important to account for the distributional difference between the source domain and the target domain.

E. Convergence

Since the model fitting process behind our algorithm is the EM algorithm, it is necessary to evaluate its convergence behavior. Figure 3 presents the change of classification accuracy with respect to the number of iterations. As the figure shows, the classification accuracy increases quickly during the first 15 iterations and converges to a steady state after about 20 iterations, showing that our algorithm converges well in these experiments. Compared with TPLSA, an EM-based algorithm reported to require about 100 iterations to converge, our algorithm demonstrates satisfying efficiency. Theoretically, in our proposed model SAtPLSA, the parameters P(w|c) for the source domain are computable before training starts and fixed during training, whereas all parameters in TPLSA must be iterated until the model converges. Moreover, the number of parameters to be learned is proportional to the number of topics, so the number of parameters in SAtPLSA is much smaller than that in TPLSA.
Figure 3. Performance convergence on nine datasets
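To make the efficiency argument concrete, the following is a minimal sketch, not the authors' implementation, of an EM loop in the spirit of SAtPLSA: the word-class distributions P(w|c) are estimated once (in SAtPLSA, from the labeled source domain) and held fixed, so each iteration only re-estimates the document-class proportions P(c|d) for the target documents. All data here are random toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy target-domain data: 4 documents over a 6-word vocabulary.
X = rng.integers(1, 5, size=(4, 6)).astype(float)    # word counts n(d, w)

# P(w|c): estimated once and held fixed during training.
# Random here for illustration; in SAtPLSA it comes from source labels.
p_w_given_c = rng.random((2, 6))
p_w_given_c /= p_w_given_c.sum(axis=1, keepdims=True)

# Only the document-class proportions P(c|d) are learned.
p_c_given_d = np.full((4, 2), 0.5)

for _ in range(20):                                   # EM iterations
    # E-step: responsibilities P(c|d, w) proportional to P(c|d) * P(w|c)
    resp = p_c_given_d[:, :, None] * p_w_given_c[None, :, :]
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-estimate P(c|d) only; P(w|c) stays fixed.
    p_c_given_d = (resp * X[:, None, :]).sum(axis=2)
    p_c_given_d /= p_c_given_d.sum(axis=1, keepdims=True)

labels = p_c_given_d.argmax(axis=1)                   # predicted class per document
```

Because only the small P(c|d) table is updated, each iteration is cheap and the loop reaches a fixed point quickly, which is consistent with the fast convergence observed in Figure 3.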
F. Effects of Adaptive-transfer Setting

Different from previous transfer learning algorithms, our algorithm transfers knowledge adaptively by adjusting the weighting parameter 𝛼: the parameter is decreased from its original value once the training accuracy exceeds a certain threshold. Here, we assume that the learned parameters for the target domain are more likely to be near the stable state when the training accuracy is high. We experimentally study how performance is affected by the value of the adjusted weighting parameter, tuning it from 0 to 0.5 with a step of 0.1. A value of 0.5 means the adaptive transfer modification is not applied; decreasing the adjusted weight value increases the impact of the adaptive transfer modification on the algorithm. For each of the six adjusted weighting values, we run our proposed algorithm on the comp vs sci dataset with all other parameters kept the same. The testing accuracy of each experiment is shown in Table III. Performance increases as the adjusted weight value decreases, which indicates that the adaptive transfer setting yields better performance. Figure 4 shows the testing accuracy as the number of iterations varies: the lower the adjusted weight value, the faster the performance increases. As Table III and Figure 4 show, the optimal adjusted weight value is 0, meaning that the model learning process for the target domain needs no knowledge from the source domain once the learned model is near its stable state. We then run the algorithm with and without adaptive transfer three times on the remaining eight datasets. In the setting with adaptive transfer, the adjusted weight value is set to 0; in the setting without adaptive transfer, the weight value is fixed at 0.5 throughout training. The mean and standard deviation of the accuracy for the two settings are shown in Figure 5, which indicates that the adaptive transfer modification during training improves our model's performance further.
Table III
PERFORMANCE COMPARISON FOR DIFFERENT ADJUSTED WEIGHTING VALUES ON comp vs sci

Adjusted Weighting Value   0      0.1    0.2    0.3    0.4    0.5
Accuracy                   0.957  0.947  0.941  0.939  0.928  0.904
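The adaptive-transfer mechanism described above can be sketched as a simple schedule. This is a hypothetical illustration rather than the authors' exact rule: the source-domain weight starts at 0.5 and drops to the adjusted value once training accuracy passes a threshold (the threshold value here is an assumption).

```python
def adaptive_alpha(train_acc, threshold=0.9, initial=0.5, adjusted=0.0):
    """Return the source-domain weight for the next training iteration.

    While training accuracy is below the threshold, source knowledge is
    weighted at `initial`; once the model is assumed near its stable
    state (accuracy >= threshold), the weight drops to `adjusted`.
    The threshold value 0.9 is a hypothetical placeholder.
    """
    return adjusted if train_acc >= threshold else initial

# Early iterations lean on the source domain ...
alpha_early = adaptive_alpha(0.75)   # 0.5
# ... and once accuracy passes the threshold, transfer is switched off
# (an adjusted weight of 0 was found optimal in Table III).
alpha_late = adaptive_alpha(0.95)    # 0.0
```

In this sketch the weight changes in a single step; a gradual decay schedule would be a natural variant under the same assumption.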
Figure 4. Performance convergence for different adjusted weighting values on comp vs sci.

Figure 5. Effects of adaptive transfer on nine datasets.

V. CONCLUSION

In this paper, we proposed a new algorithm, SAtPLSA, which combines transfer learning and topic modeling to solve the cross-domain document classification problem. Experimental results have demonstrated the effectiveness and efficiency of the proposed algorithm.

REFERENCES

[1] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
[2] T. Hofmann, "Unsupervised learning by probabilistic latent semantic analysis," Machine Learning, vol. 42, no. 1-2, pp. 177–196, 2001.
[3] L. Niu and Y. Shi, "Semi-supervised PLSA for document clustering," in Proc. IEEE International Conference on Data Mining Workshops (ICDMW), 2010, pp. 1196–1203.
[4] G.-R. Xue, W. Dai, Q. Yang, and Y. Yu, "Topic-bridged PLSA for cross-domain text classification," in Proc. 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2008, pp. 627–634.
[5] D. Cohn, R. Caruana, and A. McCallum, "Semi-supervised clustering with user feedback," Constrained Clustering: Advances in Algorithms, Theory, and Applications, vol. 4, no. 1, pp. 17–32, 2003.
[6] D. M. Blei and J. D. McAuliffe, "Supervised topic models," in NIPS, vol. 7, 2007, pp. 121–128.
[7] D. Ramage, D. Hall, R. Nallapati, and C. D. Manning, "Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora," in Proc. 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2009, pp. 248–256.
[8] S. J. Pan, X. Ni, J.-T. Sun, Q. Yang, and Z. Chen, "Cross-domain sentiment classification via spectral feature alignment," in Proc. 19th International Conference on World Wide Web, 2010, pp. 751–760.
[9] L. Li, X. Jin, and M. Long, "Topic correlation analysis for cross-domain text classification," in AAAI, 2012.
[10] J. Yoo and S. Choi, "Probabilistic matrix tri-factorization," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2009, pp. 1553–1556.
[11] F. Zhuang, P. Luo, Z. Shen, Q. He, Y. Xiong, Z. Shi, and H. Xiong, "Collaborative dual-PLSA: mining distinction and commonality across multiple domains for text classification," in Proc. 19th ACM International Conference on Information and Knowledge Management (CIKM), 2010, pp. 359–368.
[12] Y. Bao, N. Collier, and A. Datta, "A partially supervised cross-collection topic model for cross-domain text classification," in Proc. 22nd ACM International Conference on Information and Knowledge Management (CIKM), 2013, pp. 239–248.
[13] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1–38, 1977.
[14] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.