Cross Domain Distribution Adaptation via Kernel Mapping
Erheng Zhong†, Wei Fan‡, Jing Peng§, Kun Zhang$, Jiangtao Ren†, Deepak Turaga‡ and Olivier Verscheure‡
† Sun Yat-Sen University, Guangzhou, China; ‡ IBM T.J. Watson Research Center, New York, USA; § Montclair State University, New Jersey, USA; $ Xavier University of Louisiana, Louisiana, USA
{sw04zheh@mail2, issrjt@mail}.sysu.edu.cn, {weifan, turaga, ov1}@us.ibm.com
[email protected],
[email protected] ABSTRACT
When labeled examples are limited and difficult to obtain, transfer learning employs knowledge from a source domain to improve learning accuracy in the target domain. However, the assumption made by existing approaches, that the marginal and conditional probabilities are directly related between source and target domains, has limited applicability in either the original space or its linear transformations. To solve this problem, we propose an adaptive kernel approach that maps the marginal distributions of target-domain and source-domain data into a common kernel space, and utilizes a sample selection strategy to draw the conditional probabilities of the two domains closer. We formally show that, in the kernel mapping space, the difference in distributions between the two domains is bounded, and that the prediction error of the proposed approach can also be bounded. Experimental results demonstrate that the proposed method outperforms both traditional inductive classifiers and state-of-the-art boosting-based transfer algorithms on most problems, including text categorization and web page ratings. In particular, it can achieve around 10% higher accuracy than other approaches on the text categorization problem. The source code and datasets are available from the authors.
Categories and Subject Descriptors H.2.8 [Database Management]: Database Applications − Data Mining
General Terms Algorithms
1. INTRODUCTION
It is expensive or impractical for many applications to obtain a large number of labeled examples. When this happens, most inductive learners perform poorly. The idea of transfer learning is to borrow labeled examples from a source domain to improve learning in the target domain. Let P_t(x, y) and P_s(x, y) denote the joint distributions of the target domain and the source domain, respectively. Supervised transfer learning uses a small number of labeled examples from P_t(x, y), but many from P_s(x, y), to build a learning model for the target domain. The main challenge is to identify regions of P(x, y), either in the original space or in some transformation of it, where P_t(x, y) and P_s(x, y) are similar and knowledge can be transferred. Much work has been proposed to solve the transfer learning problem [15]. By definition, P(x, y) = r(y|x)q(x). Some work, such as [14], assumes that the conditional probabilities r_t(y|x) and r_s(y|x) are similar in regions of the latent space where the marginal distributions q_t(x) and q_s(x) of the corresponding examples are close. Other work, such as [12], assumes that q(x) is related to r(y|x). Both implicitly assume that the marginal distribution and the conditional probability are directly related. In summary, either of the following is assumed to be true and adopted to design transfer learning strategies: (1) where the marginal distributions q(x) of the target domain and the source domain are similar, the conditional probabilities r(y|x) also ought to be similar, or (2) vice versa. However, these assumptions may be too strict to be practical. For some problems, both the marginal and the conditional distributions of the target domain and the source domain can be significantly different. When this happens, neither of the two assumptions holds, whether in the original space, in a scaled space, or in a latent space obtained by linear transformation. However, a non-linear transformation, such as kernel manipulation, can make these assumptions plausible. A suitable kernel is able to map the input space into a convenient feature space where a linear boundary can be easily found [3]. Importantly, this also sheds light on transfer learning. First, a suitable kernel, such as the Gaussian kernel [13], can make different input data form similar marginal distributions in the kernel space. Second, in the kernel space, some examples have very similar conditional probabilities and can be used to construct transfer learning models. Third, the error rate of the transfer classifier can be bounded (Section 3). Thus, under a suitable kernel mapping, two domains that are significantly different in their original space can have similar marginal and conditional distributions, leading to effective knowledge transfer. Consider the synthetic example in Figure 1, where the two marker types denote positive/negative examples. Figure 1(a) plots the target-domain data, "two circles", and the maximal margin decision boundary is the dashed ellipse. Figure 1(b) shows a source-domain data set, "two moons", where the decision boundary is the dashed curve. Obviously, two moons and two circles have significantly different distributions in the original space. However, after we map them into the kernel space (details in Section 2.1), the marginal distributions become close and samples from the two domains both take a cloud-like appearance, as shown in Figure 1(c) and (d). In the kernel space, although the conditional probabilities of the source and target domains are still different, the examples highlighted in (d) have conditional distributions similar to the target domain. In other words, these examples lie in the same regions as the target-domain data with the corresponding labels. The observation is that, if one can find a proper kernel as the bridge, the marginal distributions can be made reasonably close, and those source-domain examples with conditional probabilities similar to the target-domain data can be selected to build transfer learning models. Accordingly, we propose an iterative framework to transfer knowledge based on kernel mapping. In each iteration, we perform kernel mapping, followed by selection of source-domain examples whose distribution is similar to the target domain, and then use these examples to build a transfer classifier. Finally, we combine the classifiers from different iterations into an ensemble in order to remove the bias of any single mapping space.
2. KERNEL MAPPING AND SAMPLE SELECTION
We introduce the kernel-based feature mapping methods for transfer learning. For ease of discussion, the notations are summarized in Table 1. Suppose that the labeled target-domain data L = {X_L, Y_L} contains ℓ instances, X_L = {x_1, ..., x_ℓ}, Y_L = {y_1, ..., y_ℓ}. The number of class labels is NC. For the unlabeled data, we set U = {X_U}, which contains u instances, X_U = {x_{ℓ+1}, ..., x_{ℓ+u}}. Similarly, the source-domain data O = {X_O, Y_O} has o instances. We assume that the source-domain data has the same class labels as the target-domain data, so the number of classes is also NC. In the following, we first discuss how to perform kernel-based feature mapping through Kernel Discriminant Analysis (KDA) to make the marginal distributions q(x) of the two domains similar, and then discuss the cluster criterion used to select source-domain data whose conditional probabilities r(y|x) are likely to be similar to the target-domain data within the mapping space. These are followed by the details of the proposed algorithms, implemented in two different ways: KMapEnsemble (Kernel-based feature Mapping with Ensemble) and KMapWeighted (Kernel-based feature Mapping with Weighted criterion).

Table 1: Definition of notations
L: Labeled target-domain data
U: Unlabeled target-domain data
O: Labeled source-domain data
X: Instance space
Y: Label space
KDA_Mapping: Feature mapping through KDA
C: Base classifier
P(x, y): Joint distribution
q(x): Marginal distribution
r(y|x): Conditional probabilities
2.1 Kernel-Based Feature Mapping
We first show how to employ KDA to find a proper mapping space, perform the feature mapping, and thereby bridge the two different distributions. In practice, we center the scatter matrices before we perform discriminant analysis, so the sample mean of X is zero, i.e., $m_s = \frac{1}{\ell}\sum_{i=1}^{\ell} x_i = 0$, where ℓ is the number of samples. Now, we define the within-class scatter matrix S_W and the between-class scatter matrix S_B as follows.
Figure 1: Synthetic example. (a) Target-domain data, two circles; (b) source-domain data, two moons; (c) target-domain data in the kernel mapping space; (d) source-domain data in the kernel mapping space. The two marker types denote positive/negative examples; the highlighted points in (d) denote source-domain examples with r(y|x) similar to the target domain.
Definition 1. Scatter matrix
$$S_W = \sum_{i=1}^{NC} \sum_{j=1}^{\ell_i} (x_j - m_i)(x_j - m_i)^T \quad (1)$$

$$S_B = \sum_{i=1}^{NC} \ell_i\, (m_i - m_s)(m_i - m_s)^T \quad (2)$$

where $m_i$ is the mean of the samples of class $i$ in $X$ and $\ell_i$ is the number of instances with class $i$. The objective of discriminant analysis, $\max_v \{\lambda = \frac{v^T S_B v}{v^T S_W v}\}$, is to find a projection that minimizes the within-class scatter $S_W$ and maximizes the between-class scatter $S_B$ simultaneously. In particular, for a two-class problem, the solution of this objective is the same as that of:

$$S_W v = m_1 - m_2 \quad (3)$$
where $m_1$ and $m_2$ are the means of the two classes. As an important step, we show how to perform discriminant analysis in the Gaussian RKHS (Reproducing Kernel Hilbert Space) to find a proper kernel feature mapping between the source and target domains. Let $\phi: x \to F$ be a function that maps the data from the original space to the RKHS. We use the inner product to avoid an explicit mapping, $\kappa(x_i, x_j) = (\phi(x_i) \cdot \phi(x_j))$, so a kernel matrix $K$ can be formed with $K_{ij} = \phi(x_i) \cdot \phi(x_j)$. According to the Representer Theorem [19], let $x_{ij}$ be the jth instance of class $i$; then any solution $\phi(v) \in F$ can be represented as:
$$\phi(v) = \sum_{i=1}^{NC} \sum_{j=1}^{\ell_i} \alpha_{ij}\, \phi(x_{ij}) \quad (4)$$
where $\alpha_{ij}$ is the jth component of $\alpha_i = \{\alpha_{i1}, \ldots, \alpha_{i\ell_i}\}$, the coefficient vector of $\phi(v)$ for class $i$. The discriminant analysis objective in the kernel space can then be written as ([3]):

$$\max_{\alpha}\left\{ \lambda = \frac{\alpha^T K W K \alpha}{\alpha^T K K \alpha} \right\} \quad (5)$$
where $W = (W_i)_{i=1,\ldots,NC}$ is a block diagonal matrix and $W_i$ is an $(\ell_i \times \ell_i)$ matrix with all entries equal to $1/\ell_i$. We solve this as an eigenvalue decomposition problem. First, we use the eigenvector decomposition of the matrix $K$ and obtain $K = P \Lambda P^T$, where $\Lambda$ is a diagonal matrix of non-zero eigenvalues and $P$ is a matrix of
normalized eigenvectors associated with $\Lambda$. Thus, by substituting $K$ in Eq. (5), we get:

$$\max_{\beta}\left\{ \lambda = \frac{\beta^T P^T W P \beta}{\beta^T P^T P \beta} \right\} \quad (6)$$
where $\beta = \Lambda P^T \alpha$. As $P$ is orthogonal, Eq. (6) can be simplified, and the solutions $\beta$ are found by maximizing $\lambda$:

$$\lambda \beta = P^T W P \beta \quad (7)$$
Thus the first step of the system resolution consists in finding $\beta$ in accordance with Eq. (7), which corresponds to a classical eigenvector problem. Once $\beta$ is calculated, we compute $\alpha$. The details of the decomposition process are standard and can be found in [3]. The projection can then be used to perform feature mapping between the target domain and the source domain. When a new instance $z$ arrives, we obtain its projection as:

$$v \cdot \phi(z) = \sum_{i=1}^{NC} \sum_{j=1}^{\ell_i} \alpha_{ij}\, \kappa(x_{ij}, z) \quad (8)$$

where $x_{ij}$ is the jth instance of class $i$. In this paper, we use the Gaussian kernel and set the kernel width using a heuristic strategy defined in [24]. In Section 3, we discuss how the difference in marginal distribution between the target domain and the source domain is bounded by the kernel mapping process.
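To make the mapping concrete, the following is a minimal sketch of the KDA computation of Eqs. (4)-(8) for the two-domain setting. It is not the authors' implementation: the median-distance bandwidth stands in for the heuristic of [24], and the eigenvalue threshold eps is our own safeguard.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    # Pairwise Gaussian kernel values kappa(a, b) = exp(-||a - b||^2 / (2 sigma^2)).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kda_fit(X, y, sigma, eps=1e-8):
    """Fit a kernel discriminant direction alpha from labeled data (Eqs. 5-7)."""
    K = gaussian_kernel(X, X, sigma)
    # Block-diagonal W: within class i, all entries equal 1 / l_i.
    W = np.zeros_like(K)
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        W[np.ix_(idx, idx)] = 1.0 / len(idx)
    # Eigendecomposition K = P Lambda P^T restricted to non-zero eigenvalues.
    lam, P = np.linalg.eigh(K)
    keep = lam > eps
    lam, P = lam[keep], P[:, keep]
    # Solve lambda * beta = P^T W P beta (Eq. 7) and take the leading eigenvector.
    _, V = np.linalg.eigh(P.T @ W @ P)
    beta = V[:, -1]
    alpha = P @ (beta / lam)   # beta = Lambda P^T alpha  =>  alpha = P Lambda^{-1} beta
    return alpha

def kda_project(alpha, X_train, Z, sigma):
    # Projection of new instances z (Eq. 8): v . phi(z) = sum_ij alpha_ij kappa(x_ij, z).
    return gaussian_kernel(Z, X_train, sigma) @ alpha

# Toy usage: map target- and source-domain data into the same 1-D kernel space.
rng = np.random.default_rng(0)
Xt = rng.normal(size=(40, 2)); yt = (Xt[:, 0] > 0).astype(int)
Xs = rng.normal(loc=3.0, size=(60, 2))
sigma = np.median(np.sqrt(((Xt[:, None] - Xt[None, :]) ** 2).sum(-1)))  # heuristic bandwidth
alpha = kda_fit(Xt, yt, sigma)
zt, zs = kda_project(alpha, Xt, Xt, sigma), kda_project(alpha, Xt, Xs, sigma)
print(zt.mean(), zs.mean())  # projections of both domains in the shared kernel space
```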
2.2 Cluster-based Example Selection
KDA, as described above, learns a feature mapping space in order to make the marginal distributions of the two domains similar. However, the conditional probabilities r(y|x) of the two domains may still be different. Thus, in the following, we propose a "cluster-based criterion" to select, within the mapping space, those source-domain data whose conditional probability is similar to the target domain. To achieve this, we introduce a clustering method, Bisecting K-means, with a self-adapting modification adopted from [16]. We propose two criteria to control the self-adapting clustering process. If the sum of squared errors of the two sub-clusters is smaller than that of the whole cluster, then the cluster is split into the two sub-clusters. In addition, data in the same cluster should mostly belong to the same class, i.e., the "purity" of the cluster should be high. The formal definitions are as follows.

Definition 2. Given a cluster $C$ and its two sub-clusters $C_1, C_2$, where $C_1 \cup C_2 = C$ and $C_1 \cap C_2 = \emptyset$, then

$$\mathrm{Par}(C, C_1, C_2) = \left\langle SSE(C) - \sum_{i=1,2} SSE(C_i) \right\rangle \quad (9)$$
where $\langle x \rangle$ returns 1 if $x$ is positive and 0 otherwise, and SSE is the sum of squared errors. Given a cluster $C_i$ whose data are all labeled and whose labels are "+" and "-", then

$$\mathrm{Purity}(C_i) = \max\left( \frac{|\{x \in C_i : y = \text{"+"}\}|}{|C_i|},\ \frac{|\{x \in C_i : y = \text{"-"}\}|}{|C_i|} \right) \quad (10)$$
Algorithm 1 Self-adapting Bisecting K-means
Input: labeled data L
Output: cluster index of the data in L: idx
  C = {L}, idx = {1, ..., 1}
  for each C_i in C do
    Split C_i into two clusters C_i1 and C_i2 using K-means
    If Purity(C_i) <= 0.9 or Par(C_i, C_i1, C_i2) then replace C_i with C_i1 and C_i2, and update idx
  end for
  Return the cluster index of L

The clustering procedure is summarized in Algorithm 1. It minimizes the distance of data within each cluster and prefers clusters whose members mostly share the same label. Based on the above description, we propose cluster-based example selection. We first cluster the data (both the labeled target-domain data and the source-domain data) into clusters. Then, we set the "label of each cluster" as the majority class of the labeled target-domain examples in that cluster. We define the label of cluster $C_i$ as $CL_i$:

$$CL_i = \arg\max_{j \in [1, NC]} nc_{ij} \quad (11)$$

where $nc_{ij}$ is the number of labeled target-domain points with label $j$ in cluster $C_i$. After that, we only select those source-domain data whose labels are the same as the corresponding cluster's label. Formally, let $SO$ denote the selected source-domain data. Then we obtain $SO \subset O$ and, for every $x_k \in SO$, if $x_k \in C_i$ then $y_k = CL_i$.
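A compact sketch of this selection step is given below. It is our illustration rather than the authors' code: it uses scikit-learn's KMeans for the 2-way splits, computes Purity only over the labeled target-domain points, and adds a min_size guard (not in Algorithm 1) to stop the recursion on tiny clusters.

```python
import numpy as np
from sklearn.cluster import KMeans

def sse(X):
    # Sum of squared errors of a cluster around its mean.
    return ((X - X.mean(axis=0)) ** 2).sum()

def purity(labels):
    # Fraction of the majority label among the labeled points of a cluster (Eq. 10).
    _, counts = np.unique(labels, return_counts=True)
    return counts.max() / counts.sum()

def bisecting_clusters(X, y, labeled_mask, purity_thr=0.9, min_size=4):
    """Self-adapting bisecting 2-means in the spirit of Algorithm 1."""
    pending, final = [np.arange(len(X))], []
    while pending:
        idx = pending.pop()
        if len(idx) < min_size:                        # our safeguard, not in the paper
            final.append(idx); continue
        km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[idx])
        c1, c2 = idx[km == 0], idx[km == 1]
        split_gain = sse(X[idx]) - (sse(X[c1]) + sse(X[c2]))   # Par(C, C1, C2) > 0
        lab = idx[labeled_mask[idx]]
        low_purity = len(lab) > 0 and purity(y[lab]) <= purity_thr
        if len(c1) and len(c2) and (low_purity or split_gain > 0):
            pending += [c1, c2]
        else:
            final.append(idx)
    return final

def select_source(X_l, y_l, X_o, y_o):
    """Keep source examples whose label matches the majority target label of their cluster."""
    X = np.vstack([X_l, X_o]); y = np.concatenate([y_l, y_o])
    labeled_mask = np.concatenate([np.ones(len(X_l), bool), np.zeros(len(X_o), bool)])
    selected = []
    for idx in bisecting_clusters(X, y, labeled_mask):
        lab = idx[labeled_mask[idx]]
        if len(lab) == 0:
            continue
        vals, counts = np.unique(y[lab], return_counts=True)
        cluster_label = vals[np.argmax(counts)]        # CL_i, Eq. (11)
        selected.extend(idx[(~labeled_mask[idx]) & (y[idx] == cluster_label)].tolist())
    return np.array(selected, dtype=int) - len(X_l)    # indices into X_o
```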
2.3 Framework
Both kernel-based feature mapping and cluster-based example selection are adopted in two ensemble-based frameworks. As shown in Algorithm 2, KMapEnsemble uses an iterative procedure to generate multiple mappings and uses model averaging to compute the final prediction. In summary, the labeled target-domain data is first used to perform KDA and obtain an initial mapping space. The marginal distributions q(x) of the data from both domains are similar inside this mapping space, according to the analysis in Section 3. At later iterations, we select source-domain data on the basis of the "cluster-based criterion"; these source-domain data have conditional probabilities r(y|x) similar to the target-domain data. Afterwards, the selected source-domain data and the labeled target-domain data are used to construct a new feature space using KDA and to train a new classifier. In order to take advantage of each individual kernel space and avoid their individual biases, we combine their predictions with simple model averaging. KMapWeighted takes advantage of a re-weighting scheme to use the source-domain data, similar to that of TrAdaBoost [8]. It first employs KDA to obtain a new mapping feature space, and then adopts a weighting scheme based on the training error to control the impact of the source-domain data. The weights of wrongly predicted source-domain data are reduced to weaken their impact: since their marginal distributions are similar to the target domain in the kernel space, the prediction errors can be ascribed to the different conditional distributions. The complete process is summarized in Algorithm 3. All the labeled data from the target domain and the source domain are used to learn the initial mapping space and train a base classifier. At later iterations, the weight of each labeled instance is updated according to the training error. For the source-domain data, their contributions are controlled by multiplying the weights by $\gamma^{|C(x_i) - y_i|}$. When source-domain examples are misclassified, they are likely to conflict with the target-domain data, and their weights are reduced as a result. We re-sample instances from the labeled data set in proportion to their weights; the sampled data are used to learn a new mapping space and a new base model at the next iteration. The prediction results from the iterations are averaged. Below is an analysis of the computational complexity. Let m be the total number of instances from the source domain and the target domain, n_f the number of features, and N the number of iterations. For the kernel mapping, we need $O(m^2 n_f + m^3)$ to compute the mapping space and project the data. For the sample selection, KMapEnsemble needs $O(m^2 n_f)$, while KMapWeighted depends on the complexity of the base classifier; for example, with KNN it needs $O(m^2 n_f)$. Thus, considering the iterations, the overall complexity of each approach is $O((m^2 n_f + m^3) \cdot N)$.
Algorithm 2 KMapEnsemble
Input: L, U, O, maximum number of iterations N, base classifier C
Output: predicted labels of U: LofU
  for i = 1 to N do
    LearnKDA = L
    if i > 1 then LearnKDA = LearnKDA ∪ S_{i-1} end if
    Learn the mapping space from LearnKDA using KDA
    NO_i = KDA_Mapping(O); NL_i = KDA_Mapping(L); NU_i = KDA_Mapping(U)
    SO_i = selected data from NO_i
    S_i = corresponding data of SO_i in the original space
    Train a classifier C_i from NL_i ∪ SO_i
    For each instance x_j in NU_i, predict its label PL_{ij} and obtain its confidence conf_{ijk}, where k is the class index
  end for
  For each element j of LofU: LofU_j = arg max_k { Σ_i conf_{ijk} }
  Return LofU

Algorithm 3 KMapWeighted
Input: L, U, O, maximum number of iterations N, base classifier C
Output: predicted labels of U: LofU
  Set the weights (weight_L^0 = (w_1^0, ..., w_ℓ^0), weight_O^0 = (w_{ℓ+1}^0, ..., w_{ℓ+o}^0)) of all instances to 1.0
  for i = 1 to N do
    SL_i = Sample(L, weight_L); SO_i = Sample(O, weight_O)
    LearnKDA = SL_i ∪ SO_i
    Learn the mapping space from LearnKDA using KDA
    NU_i = KDA_Mapping(U); T = KDA_Mapping(LearnKDA)
    Train a classifier C_i from T
    ε_i = Σ_j w_j^i |C_i(x_j) - y_j| / Σ_j w_j^i
    γ_i = ε_i / (1 - ε_i), γ = 1 / (1 + sqrt(2 ln(ℓ + o) / N))
    Update weight_L, weight_O:
      w_j^i = w_j^{i-1} γ_i^{-|C_i(x_j) - y_j|}, for 1 ≤ j ≤ ℓ
      w_j^i = w_j^{i-1} γ^{|C_i(x_j) - y_j|}, for ℓ + 1 ≤ j ≤ ℓ + o
  end for
  LofU_j = 1 if Π_{i=N/2}^{N} γ_i^{-C_i(x_j)} ≥ Π_{i=N/2}^{N} γ_i^{-1/2}, and LofU_j = 0 otherwise
  Return LofU
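The re-weighting step of Algorithm 3 can be sketched as follows. This is our illustration: whether the training error ε_i is computed over all sampled instances (as done here) or only over the target-domain portion, as in TrAdaBoost, is an assumption, and the clipping of ε_i is our safeguard.

```python
import numpy as np

def kmapweighted_update(w_l, w_o, err_l, err_o, N):
    """One weight update of Algorithm 3 (sketch).
    w_l, w_o: current weights of the labeled target / source instances.
    err_l, err_o: |C_i(x_j) - y_j| in {0, 1} for each instance at this iteration."""
    w = np.concatenate([w_l, w_o])
    err = np.concatenate([err_l, err_o])
    # Weighted training error over the sampled training set (assumption: all instances).
    eps_i = (w * err).sum() / w.sum()
    eps_i = float(np.clip(eps_i, 1e-6, 0.5 - 1e-6))   # keep gamma_i in (0, 1); our safeguard
    gamma_i = eps_i / (1.0 - eps_i)
    gamma = 1.0 / (1.0 + np.sqrt(2.0 * np.log(len(w)) / N))
    # Misclassified target instances gain weight (boosting-style); misclassified
    # source instances lose weight, weakening conflicting source-domain data.
    return w_l * gamma_i ** (-err_l), w_o * gamma ** err_o

# Example: 4 target and 6 source instances, one mistake in each group.
w_l, w_o = np.ones(4), np.ones(6)
err_l = np.array([0, 1, 0, 0]); err_o = np.array([0, 0, 1, 0, 0, 0])
print(kmapweighted_update(w_l, w_o, err_l, err_o, N=10))
```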
3. FORMAL ANALYSIS
A major challenge facing transfer learning is the difference between the target and source distributions. The classic theory of learnability does not apply, since it requires training and test data to follow the same (but unknown) distribution. The difference between training and test distributions can be classified into two types: covariate shift and functional relation change [22]. Covariate shift concerns the marginal distribution q(x), while functional relation change concerns the conditional probability r(y|x). Our approach to dealing with this problem is as follows. Theorem 1 establishes the convergence of our kernel discriminant analysis. Theorem 2 and the analysis thereafter show that the selected source-domain data are similar to the target-domain data in that their marginal and conditional distributions are similar. Specifically, Theorem 2 states that data in a Gaussian reproducing kernel Hilbert space (RKHS) are distributed approximately Gaussian under some suitable conditions; thus, the source- and target-domain marginals in a Gaussian RKHS share a similar intrinsic geometry. This knowledge of the marginals can be exploited for better transfer learning by assuming that the conditionals r_s(y|x) and r_t(y|x) are not completely different, but rather related. We then state that these similar data can be exploited for transfer learning under the "cluster-manifold" assumption [4]. Theorem 3 provides an error bound for the proposed transfer learning algorithm; based on this bound, we show which situations cannot be handled by the proposed approach. Finally, Theorem 4 gives an error rate estimate for our ensemble classifier.
3.1 Error Bound for Discriminant Analysis
As stated earlier, we pool the labeled target- and source-domain data to perform KDA and train classifiers. The following analysis establishes the error bound for discriminant analysis, which is similar to [23]. Based on the notation described in Section 2, consider the target mapping function $f^*$:
$$f^* = \arg\min_f \frac{1}{\ell} \sum_{i=1}^{\ell} (y_i - f(x_i))^2 \quad (12)$$
Assume for the moment that our hypothesis space is linear, that is, $H = \{ f \mid f(x) = v^T x,\ x \in X \}$. Consequently we have
$$v^* = \arg\min_v \frac{1}{\ell} \sum_{i=1}^{\ell} (y_i - v^T x_i)^2 \quad (13)$$
where $\ell$ is the number of training instances. The lemma below is stated for two-class problems, but it can be extended to multi-class problems as well.

Lemma 1. The solution derived from Eq. (3) is equivalent to the one obtained from Eq. (13).

Proof. Let $X_L$ be the labeled data set and $Y_L$ the corresponding labels. First, we can rewrite Eq. (13) as:
$$\sum_{i=1}^{\ell} (y_i - v^T x_i)^2 = (Y_L - X_L^T v)^T (Y_L - X_L^T v) = (Y_L^T - v^T X_L)(Y_L - X_L^T v) = Y_L^T Y_L - 2 (X_L Y_L)^T v + v^T S_m v = Y_L^T Y_L - 2 (\ell_1 m_1 - \ell_2 m_2)^T v + v^T S_m v. \quad (14)$$
Optimizing this objective with respect to $v$, we obtain $S_m v = \ell_1 m_1 - \ell_2 m_2$, where $\ell_1$ is the number of labeled instances in class 1, $\ell_2$ is the number in class 2, and $S_m = S_B + S_W$. When $S_W$ is full rank, this equation has the same solution as Eq. (3) if the overall mean is 0. The details can be found in [23].
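A quick numerical check of Lemma 1 (our illustration, under the zero-overall-mean assumption stated above): the least-squares direction and the Fisher direction of Eq. (3) coincide up to scale.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two-class data, centered so the overall mean is zero (the assumption of Lemma 1).
X = np.vstack([rng.normal(loc=[2.0, 0.0], size=(30, 2)),
               rng.normal(loc=[-1.0, 1.0], size=(60, 2))])
X -= X.mean(axis=0)
y = np.concatenate([np.ones(30), -np.ones(60)])
l1, l2 = 30, 60
m1, m2 = X[:l1].mean(axis=0), X[l1:].mean(axis=0)

S_m = X.T @ X                                   # total scatter (overall mean is zero)
v_ls = np.linalg.solve(S_m, X.T @ y)            # least squares: S_m v = X Y = l1 m1 - l2 m2

S_W = (X[:l1] - m1).T @ (X[:l1] - m1) + (X[l1:] - m2).T @ (X[l1:] - m2)
v_fisher = np.linalg.solve(S_W, m1 - m2)        # Fisher direction from Eq. (3)

# The two normalized directions agree, as Lemma 1 states.
print(v_ls / np.linalg.norm(v_ls))
print(v_fisher / np.linalg.norm(v_fisher))
```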
However, solving Eq. (12) directly often leads to overfitting the data. Instead, we minimize the following regularized functional, for a positive regularization parameter $\mu$, in a hypothesis space $H$:

$$f^+ = \arg\min_{f \in H} \frac{1}{\ell} \sum_{i=1}^{\ell} (y_i - f(x_i))^2 + \mu \|f\|_H^2. \quad (15)$$
We note that the error bound for $f^+$ provides an error bound for the estimate in Eq. (3). The following theorem states the bound for $f^+$.

Theorem 1. Let $f_\rho$ be the best function that minimizes the mean squared error, that is,

$$f_\rho = \arg\min_f \int_Z (y - f(x))^2\, d\rho. \quad (16)$$

With confidence $1 - \delta$, the error bound for the solution of Eq. (15) is given by:

$$\int (f^+ - f_\rho)^2\, d\rho_X \le S(\mu) + A(\mu) \quad (17)$$
where $A(\mu) = \mu^{1/2} \| L_k^{-1/4} f_\rho \|$ and $S(\mu) = \frac{32 M^2 (\mu + C_k)^2}{\mu^2}\, v^*(\ell, \delta)$. Here $L_k$ represents a simple operator, $M$ and $C_k$ are constants, and $v^*(\ell, \delta)$ is the unique solution of an equation whose coefficients contain the sample size $\ell$ and the confidence parameter $\delta$. Note that there exists a unique $\mu$ that minimizes $S(\mu) + A(\mu)$. $A(\mu)$ represents the approximation error and $S(\mu)$ represents the sample error; they correspond to the classic bias and variance tradeoff. The details of the proof can be found in [7]. The kernelized version of the theorem can be derived similarly.
3.2 Bounding Distributions
First, we show that under some suitable conditions, the data in a Gaussian RKHS induced by a kernel function are distributed approximately Gaussian [13]. Let $\sigma$ be the kernel window width. The kernel associated with the Hilbert space $H_\kappa$ is $\kappa$; both depend on the number of examples $m$. Also, let
$$\kappa_m(x, u) = \kappa_o(x / \sigma_m, u / \sigma_m) / \sigma_m,$$
where $\kappa_o$ denotes the baseline kernel with window width 1. Suppose that there exists a constant $\tau^2 > 0$ such that for any $\epsilon > 0$,

$$\lim_{m \to \infty} \frac{1}{m}\, \mathrm{card}\{1 \le j \le m : |\sigma_m \kappa_m(x_j, x_j) - \tau^2| > \epsilon\} = 0 \quad (18)$$

$$\lim_{m \to \infty} \frac{1}{m^2}\, \mathrm{card}\{1 \le j \ne j' \le m : \sigma_m |\kappa_m(x_j, x_{j'})| > \epsilon\} = 0 \quad (19)$$

Eq. (18) states that most kernel data have $H_m$-norm near $\tau^2$, and Eq. (19) states that most kernel data are nearly orthogonal in $H_m$.
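The two conditions can be checked numerically. The sketch below is our illustration, not taken from [13]: it evaluates the scaled kernel values $\sigma_m \kappa_m(x_j, x_{j'})$ for a Gaussian baseline kernel at a few bandwidths. The diagonal entries sit at $\tau^2 = \kappa_o(0) = 1$, while the fraction of large off-diagonal entries shrinks as the bandwidth decreases.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(size=(200, 3))                 # m = 200 sample points

def scaled_kernel_matrix(x, sigma):
    # sigma_m * kappa_m(x_j, x_j') = kappa_o((x_j - x_j') / sigma_m) for a Gaussian baseline.
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

for sigma in [0.5, 0.1, 0.02]:
    G = scaled_kernel_matrix(x, sigma)
    diag = np.diag(G)                          # concentrates at tau^2 = 1   (cf. Eq. 18)
    off = G[~np.eye(len(x), dtype=bool)]       # mostly falls below small eps (cf. Eq. 19)
    print(sigma, diag.mean(), (np.abs(off) > 0.05).mean())
```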
Theorem 2. Let $h$ be a random element with zero mean function and covariance operator $\kappa_m$. If Eq. (18) and Eq. (19) hold, then as $m \to \infty$ the empirical distribution $\theta_m(h)$ converges weakly to $N(0, \tau^2)$ in probability.

The proof of this theorem can be found in [13]. Notice that the theorem is established for a one-dimensional projection; however, it can be extended to an $n_f$-dimensional projection for an arbitrary but fixed $n_f$. One can also observe that data projected to lower dimensions can be well approximated by Gaussian distributions. In our approach, we use KDA to achieve this projection. From the above theorem, we can see that both the target- and source-domain data follow Gaussian distributions in a Gaussian RKHS, that is, $q_t(x) = N(0, \Sigma_t)$ and $q_s(x) = N(0, \Sigma_s)$. Notice that our approach to transfer learning selects for training those source-domain data that are close to the target-domain data through clustering: a source-domain instance is close to a target-domain instance if they reside in the same cluster. Thus, it seems reasonable to assume that the selected source-domain data follow a Gaussian distribution with similar variance, that is, $q_{so}(x) = N(0, \Sigma_{so})$ with generalized variance $|\Sigma_{so}| \approx |\Sigma_t|$. This implies that the selected data have a marginal distribution similar to that of the target-domain data in the induced space. A recent work makes a specific assumption about the relation between the marginal and conditional distributions [4]: if two points $x_1$ and $x_2$ are close in the intrinsic geometry of $q(x)$, then the conditionals $r(y|x_1)$ and $r(y|x_2)$ are similar. That is, $r(y|x)$ is a smooth function along the geodesics in the intrinsic geometry of $q(x)$. This assumption allows the knowledge of the marginal distribution $q(x)$ to be exploited. Here, for the purpose of our analysis, we make a similar assumption: we assume that $r_t(y|x) = r_{so}(y|x)$. The rationale is that, since the selected source-domain data and the labeled target-domain data are close (i.e., they fall in the same clusters and have the same labels) and they follow closely related Gaussian distributions, their marginals share a very similar intrinsic geometry; therefore, the conditionals $r_t(y|x)$ and $r_{so}(y|x)$ are the same. These assumptions allow us to state that the difference between the source- and target-domain distributions can be attributed to covariate shift.
3.3 Error Bound for Domain Transfer
According to [6], if the difference between the two distributions can be bounded, the generalization error can also be bounded when we combine the labeled data from the target domain and the source domain to predict the unlabeled data from the target domain. Following [6], let $m = \ell + o$ denote the total number of labeled data; that is, we obtain $\ell = \beta m$ labeled data from the target domain and $o = (1 - \beta) m$ from the source domain, drawn independently. Consider the ideal labeling function $f^*: X \to Y$, where $Y = [0, 1]$. For a hypothesis $h: X \to Y$, the probability of $h$ disagreeing with $f^*$ according to the target-domain distribution $q_t(x)$ is defined as

$$\epsilon_t(h, f^*) = E_{x \sim q_t(x)}\left[\, |h(x) - f^*(x)| \,\right]. \quad (20)$$

This is also the risk of a hypothesis; we write $\epsilon_t(h)$ for short and $\hat{\epsilon}_t(h)$ for the empirical risk, and we use the parallel definitions $\epsilon_s(h, f^*)$, $\epsilon_s(h)$ and $\hat{\epsilon}_s(h)$ for the source domain. This allows us to measure the distance between two distributions using a hypothesis-class-specific criterion. Let $H$ be a hypothesis class on $X$ and let $A_H$ be the collection of subsets of $X$ such that, for every hypothesis $h \in H$, $\{x \in X : h(x) = 1\} \in A_H$. Then the distance between the target and source distributions is

$$d_H(q_t(x), q_s(x)) = 2 \sup_{A \in A_H} \left| \Pr{}_t[A] - \Pr{}_s[A] \right|. \quad (21)$$

According to [6], $d_H$ can be computed approximately from a finite sample when $H$ has a finite VC dimension. Next, for a hypothesis space $H$, we define the symmetric difference hypothesis space $H \Delta H$ as $H \Delta H = \{h(x) \otimes h'(x) : h, h' \in H\}$, where $\otimes$ is the XOR operator; that is, every hypothesis $g \in H \Delta H$ labels as 1 exactly those points on which a given pair of hypotheses in $H$ disagree. It can be shown that for any hypotheses $h, h' \in H$, the following inequality holds:

$$|\epsilon_t(h, h') - \epsilon_s(h, h')| \le \frac{1}{2}\, d_{H \Delta H}(q_t(x), q_s(x)) \quad (22)$$
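The distance $d_H$ above is defined via a supremum over the hypothesis class, but as noted it can be estimated from finite unlabeled samples [6]. The following is a minimal sketch of one common empirical proxy, a domain-discriminating classifier; this particular construction is our illustration and is not prescribed by the paper, and it assumes scikit-learn is available.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def proxy_dH(Xt, Xs, seed=0):
    """Proxy estimate of d_H(q_t, q_s) from two unlabeled samples."""
    X = np.vstack([Xt, Xs])
    d = np.concatenate([np.zeros(len(Xt)), np.ones(len(Xs))])   # domain labels
    Xtr, Xte, dtr, dte = train_test_split(
        X, d, test_size=0.5, random_state=seed, stratify=d)
    clf = LogisticRegression(max_iter=1000).fit(Xtr, dtr)
    err = 1.0 - clf.score(Xte, dte)              # domain-classification error
    return max(0.0, 2.0 * (1.0 - 2.0 * err))     # small when the marginals are similar

# Similar marginals give a value near 0; well-separated domains approach 2.
rng = np.random.default_rng(0)
print(proxy_dH(rng.normal(size=(200, 5)), rng.normal(size=(200, 5))))
print(proxy_dH(rng.normal(size=(200, 5)), rng.normal(loc=3, size=(200, 5))))
```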
Moreover, we define the ideal hypothesis $h^*$ as the one that minimizes the combined target-domain and source-domain risk:

$$h^* = \arg\min_{h \in H} \epsilon_t(h) + \epsilon_s(h) \quad (23)$$
Thus, the combined risk can be denoted as $\lambda = \epsilon_t(h^*) + \epsilon_s(h^*)$. Consider that we have a few labeled data from the target domain and a large number of labeled data from the source domain. In this situation, we train a classifier that minimizes a convex combination of the empirical target-domain and source-domain risks:

$$\hat{\epsilon}_\alpha(h) = \alpha\, \hat{\epsilon}_t(h) + (1 - \alpha)\, \hat{\epsilon}_s(h) \quad (24)$$
We use $\epsilon_\alpha(h)$ for the corresponding true risk with respect to $q_t(x)$ and $q_s(x)$. Similar to [6], our transfer algorithm can be bounded as follows. Specifically, Lemma 2 shows that the difference between the target risk $\epsilon_t(h)$ and the weighted risk $\epsilon_\alpha(h)$ can be bounded, and Lemma 3 shows that the difference between the true and empirical weighted risks can also be bounded.
Lemma 2. Let $h$ be a hypothesis in $H$. Then

$$|\epsilon_\alpha(h) - \epsilon_t(h)| \le (1 - \alpha)\left( \frac{1}{2} d_{H \Delta H} + \lambda \right) \quad (25)$$
Lemma 3. Let $H$ be a hypothesis space of VC-dimension $d$. Then, with probability at least $1 - \delta$, for every $h \in H$:

$$|\hat{\epsilon}_\alpha(h) - \epsilon_\alpha(h)| < \sqrt{\left( \frac{\alpha^2}{\beta} + \frac{(1-\alpha)^2}{1-\beta} \right) \frac{d \log(2m) - \log\delta}{2m}} \quad (26)$$
Table 2: Data Set Summary
Reuters-21578 (OrgsVsPeople, OrgsVsPlaces, PeopleVsPlaces): target-domain = documents of some sub-categories; source-domain = documents of the other sub-categories.
20-Newsgroup (ComVsRec, ComVsSci, ComVsTalk): target-domain = documents of some sub-categories; source-domain = documents of the other sub-categories.
SyskillWebert (Sheep, Biomedical, Goats as target-domain): source-domain = Bands-recording artists.
The detailed derivations can be found in [6]. From these two lemmas, we obtain the bound for our domain transfer learning. In addition, we use $m_Q(x) = |E_{h \sim Q}\, h(x)|$ to denote the unsigned margin function. From these, the error of our ensemble can be bounded based on the risk of the individual Gibbs classifiers and the margin on the unlabeled data.
Theorem 3. Let $H$ be a hypothesis space of VC-dimension $d$, and let $U_t$ and $U_s$ be unlabeled samples of size $m'$ each, drawn according to $q_t(x)$ and $q_s(x)$, respectively. Let $\hat{d}_{H \Delta H}$ be the empirical distance on $U_t$ and $U_s$. Let $S$ be a labeled set of size $m$, constructed from $\beta m$ points drawn from $q_t(x)$ and $(1 - \beta) m$ points drawn from $q_s(x)$. If $\hat{h} \in H$ is the minimizer of $\hat{\epsilon}_\alpha(h)$ on $S$ and $h_t^* = \arg\min_{h \in H} \epsilon_t(h)$ is the target-domain risk minimizer, then with probability at least $1 - \delta$ (over the choice of the samples):

$$\epsilon_t(\hat{h}) \le \epsilon_t(h_t^*) + 2 \sqrt{\left( \frac{\alpha^2}{\beta} + \frac{(1-\alpha)^2}{1-\beta} \right) \frac{d \log(2m) - \log\delta}{2m}} + 2(1 - \alpha)\left( \frac{1}{2} \hat{d}_{H \Delta H}(U_t, U_s) + 4\sqrt{\frac{2 d \log(2m') + \log(4/\delta)}{m'}} + \lambda \right)$$
In a situation where we have many labeled source-domain data and only a few labeled target-domain data, $\alpha$ will be small and $\hat{d}_{H \Delta H}$ will play an important role in the bound. As stated in Section 3.2, $P_t(x, y)$ and $P_s(x, y)$ are similar after we perform kernel mapping and cluster-based sample selection. That means it is hard to find a classifier $h \in H$ that discriminates between unlabeled instances from the target and source domains; thus, the empirical distance $\hat{d}_{H \Delta H}$ will become small and the bound will be tight. Following this and the analysis in Section 3.2, the transferability of the proposed approach relies on the selected data in the induced space. If the two domains are unrelated in the kernel space, few source-domain examples can be selected by the sample selection criterion. When this happens, the proposed approach may fail, as shown by the error bound of Theorem 3: if the difference between the two distributions is large, the bound is loose.

3.4 Error Bound for Ensemble
In Section 3.3, we established the error bound for an individual classifier. In this section, we obtain an upper bound on the generalization error of our majority-vote ensemble. The analysis is similar to [1]. Let $Q$ be a posterior distribution over $H$ such that the $Q$-weighted majority vote classifier, or Bayes classifier, is

$$B_Q(x) = \mathrm{sgn}\left[ E_{h \sim Q}\, h(x) \right] \quad \forall x \in X \quad (27)$$

where $\mathrm{sgn}[x] = 1$ if $x > 0$ and $-1$ otherwise. We denote by $G_Q$ the associated individual Gibbs classifier, which classifies any example $x \in X$ by choosing a classifier $h$ randomly according to the distribution $Q$. For a finite set of unlabeled data $U$, the risk of $G_Q$ is

$$\epsilon(G_Q) = \frac{1}{u} \sum_{x_i \in U} E_{h \sim Q}\, [\, h(x_i) \ne y_i \,] \quad (28)$$

where $[\Xi] = 1$ if the predicate $\Xi$ holds and 0 otherwise, and $y_i$ is the true (unknown) label of $x_i$. Similarly, the risk of $B_Q$ is

$$\epsilon(B_Q) = \frac{1}{u} \sum_{x_i \in U} [\, B_Q(x_i) \ne y_i \,] \quad (29)$$

Theorem 4. Let $B_Q$ be the Bayes classifier. Then for all $Q$ and all $\delta \in (0, 1]$, with probability at least $1 - \delta$:

$$\epsilon(B_Q) \le \inf_{\gamma \in (0,1]} \left\{ P_u\big(m_Q(x_i) < \gamma\big) + \frac{1}{\gamma} \left[ T_u^\delta(Q) - K_Q^{<}(\gamma) \right]_+ \right\} \quad (30)$$

where $T_u^\delta(Q) = \epsilon^\delta(Q) + \frac{1}{2}\big(E_u\, m_Q(x_i) - 1\big)$, $K_Q^{<}(\gamma) = E_u\, m_Q(x_i)\,[\, m_Q(x_i) < \gamma \,]$, $E_u z$ is the expectation of a random variable $z$ with respect to the uniform distribution over $U$, $P_u$ is the uniform probability over $U$, and $x_+$ denotes the positive part of $x$, i.e., $x_+ = x$ if $x > 0$ and 0 otherwise. This theorem states that if the individual classifier has error bound $\epsilon^\delta(Q)$ with probability at least $1 - \delta$, then the ensemble classifier can also be bounded.
4. EXPERIMENT
Three real-world data collections from two different domains are used to empirically evaluate the proposed algorithms. The performance is compared with both traditional non-transfer single classifiers and the state-of-the-art transfer learning algorithm TrAdaBoost [8]. Two sets of studies are also conducted to further examine the sensitivity of the proposed methods with respect to different numbers of iterations and varied sizes of labeled target-domain data.
4.1 Data Set and Experiment Setting
As summarized in Table 2, the data collections involved in our study are Reuters-21578 [2], 20-Newsgroups [9] and SyskillWebert [2]. Among them, Reuters-21578 and 20-Newsgroups are benchmarks for text categorization, and SyskillWebert is a standard collection used to test web page ratings. The important statistics and pre-processing procedures of these collections are presented below.
Data Set Description. The Reuters-21578 collection has a hierarchical structure and contains Reuters news wire articles organized into five top categories, each of which includes different sub-categories. Three categories, "orgs", "people" and "places", are selected in our study. From the category "places", we remove all the documents about "USA" to make the sizes of the three categories nearly even. For each category, the sub-categories are organized into two parts of approximately equal size that follow different distributions; one part is treated as the target-domain data and the other is used as the source domain. Following the method described in [8], three cross-domain learning tasks are generated as listed in Table 2, and the learning objective is to classify articles into the top categories. Similar to the Reuters-21578 data set, the 20-Newsgroups corpus contains 7 top categories, and these top categories contain 20 sub-categories
Table 3: Accuracy in data set (%). Rows: OrgsVsPeople, OrgsVsPlaces, PeopleVsPlaces (Reuters-21578); ComVsRec, ComVsSci, ComVsTalk (20-Newsgroup); Bands-Biomedical, Bands-Goats, Bands-Sheep (SyskillWebert). Column groups: KNN classifier (IBK with k=3, KE, TrAdaBoost, KW), SVM classifier (SMO, KE, TrAdaBoost, KW), and NaiveBayes classifier (NaiveBayes, KE, TrAdaBoost, KW).
which contain approximately 20,000 newsgroup documents in total. We select the four top categories "com", "rec", "talk" and "sci" in this experiment; thus, three further cross-domain tasks are formed, as listed in Table 2. The SyskillWebert database contains the HTML source of web pages plus the ratings of a user on those web pages. The web pages are on four separate subjects; associated with each web page are the HTML source as well as a user's rating of "hot", "medium" or "cold". As shown in Table 2, all four subjects are involved in our study: "Bands-recording artists" is reserved as the source domain and the others are used as the target-domain data. Compared to the "cold" pages, the total number of pages rated as "medium" or "hot" is smaller; therefore, we combine the "medium" and "hot" pages, relabel them as "non-cold", and form a binary classification problem. The learning task is to predict the user's preferences for the given web pages.
Experiment Setting. For each target-domain data set employed in the experiment, we further split it into two parts: target-domain data with labels (L) and target-domain data without labels (U). The ratio between L and U is 1:9. All of the target-domain data without labels (U) are used as the test sets, while the training sets consist of the labeled data points from both the target domain (L) and the source domain (O). Several popular algorithms, NaiveBayes, KNN and SMO with a polynomial kernel, are used not only as single classifiers by themselves, but also as the base classifiers of KMapEnsemble, KMapWeighted and TrAdaBoost, respectively. For these three ensembles, we set the number of iterations to 10. The algorithm implementations are based on Weka [21].
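For reference, here is a minimal sketch of the data preparation just described; the function and variable names are ours, and the loaders for the actual collections are not shown.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def make_transfer_split(X_target, y_target, X_source, y_source, seed=0):
    """Split the target domain 1:9 into labeled data L and unlabeled test data U,
    and form the training pool L + O, as in the experiment setting above."""
    X_L, X_U, y_L, y_U = train_test_split(
        X_target, y_target, train_size=0.1, stratify=y_target, random_state=seed)
    X_train = np.vstack([X_L, X_source])
    y_train = np.concatenate([y_L, y_source])
    return (X_L, y_L), (X_U, y_U), (X_train, y_train)
```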
4.2 Performance Study
Using accuracy as the evaluation metric, we systematically compare the proposed algorithms to the single classifiers and TrAdaBoost. All of the results reported below are averaged over 10 runs. Due to space limitations, we abbreviate KMapEnsemble as "KE" and KMapWeighted as "KW" in the tables. Table 3 summarizes the accuracies of KMapEnsemble, KMapWeighted, the single classifiers and TrAdaBoost on the three databases. For the Reuters-21578 collection, KMapEnsemble consistently outperforms TrAdaBoost and the three single classifiers on all 3 tasks. On the "OrgsVsPeople" data set with KNN as the base classifier, the average accuracy of KMapEnsemble is over 24% higher than that of TrAdaBoost and of KNN as the single classifier. If we overlook the model differences, compared to its rivals, KMapEnsemble on average achieves at least 15.4%, 7.4% and 8.6% higher accuracy on "OrgsVsPeople", "OrgsVsPlaces" and "PeopleVsPlaces", respectively. The better performance of KMapEnsemble over TrAdaBoost can be ascribed to the kernel-
based feature mapping and cluster-based example selection implemented in KMapEnsemble, which utilizes the predictions from different feature spaces to transfer similar data from the source domain to the target domain. Moreover, in 7 out of 9 cases, KMapWeighted outperforms TrAdaBoost, which conducts the cross-domain transfer only in the original feature space. This, from the empirical perspective, supports the theoretical analysis of kernel-based feature mapping in Section 3. The worse performance of the single classifiers is mainly due to the inferior strength of a single model, which cannot capture the useful information existing in the source domain. For the 20-Newsgroup data set, when the base classifier is KNN, SMO and NaiveBayes in that order, the win-lose-tie statistics of KMapEnsemble are 3-0-0, 3-0-0 and 1-0-2, respectively. The performance explanation given for KMapEnsemble on Reuters-21578 also applies here. For the data sets where KMapEnsemble fails, KMapWeighted performs best on 1 of the 2 data sets; in particular, on the data set "ComVsRec", KMapWeighted achieves the highest accuracy when NaiveBayes is used as the base classifier. This suggests that, besides the new mapping feature space integrated into KMapWeighted, the weighting scheme also helps to enhance its ability by dynamically controlling the impact of source-domain data according to the training error. On the SyskillWebert database, among the 3 tasks with respect to the different base classifiers, the win-lose-tie statistics of KMapEnsemble are 3-0-0 for KNN, 2-1-0 for SMO and 3-0-0 for NaiveBayes. For the "Bands-Sheep" task, the accuracy of KMapEnsemble is at least 13% higher than the other approaches when the base classifier is KNN. In addition, compared to TrAdaBoost, the win-lose-tie statistics of KMapWeighted are 2-1-0 for KNN, 1-2-0 for SMO and 3-0-0 for NaiveBayes. The performance explanations given for the above two collections also apply here. The superior performance of KMapEnsemble over KMapWeighted in all 9 scenarios could be due to the effectiveness of model averaging over boosting; please refer to [10] for a related discussion.
4.3 Sensitivity Study
As shown in Section 2, several inputs need to be specified before executing KMapEnsemble. Among these parameters, the number of iterations and the size of the labeled target-domain data are of critical importance for the performance of the proposed algorithms. As a result, we carried out two additional sets of experiments to test the sensitivity and adaptiveness of the proposed algorithms with respect to these parameters.
Different Numbers of Iterations. As the number of iterations varies from 1 to 10, Figure 2 plots the accuracy learning curves
of KMapEnsemble, TrAdaBoost and the traditional non-transfer single classifiers on the Reuters-21578 collection. Although the performance of the traditional non-transfer single classifiers is not affected by the number of iterations, we deliberately include those flat lines in the plots for a global comparison. As shown in Figure 2(a)-(e), KMapEnsemble outperforms its competitors in 5 out of 6 scenarios; the only exception is the data set "people vs places" when the base classifier is KNN. In general, compared to TrAdaBoost, KMapEnsemble underperforms at the first iteration, because the feature space at the first iteration is learned merely from the small set of labeled target-domain data. As more iterations proceed, the predictive ability of KMapEnsemble is significantly boosted due to the constant accumulation of source-domain data; these data are similar to the labeled data from the target domain and can therefore assist the learning. For example, KMapEnsemble consistently outperforms TrAdaBoost in accuracy by at least 12% on the data set "OrgsVsPeople" when the base classifier is KNN. Therefore, we conjecture that, if more iterations were carried out, KMapEnsemble could perform even better, in the sense that combining more feature spaces can further reduce the bias and boost the accuracy.
Varied Sizes of Labeled Target-domain Data. This study is conducted on the SyskillWebert collection. Two sets of experiments are performed with "Bands" or "Goats" as the target-domain data and the others for the source-domain purpose. The results are shown in Figure 3. It is evident that, as the size of the labeled target-domain data increases, KMapEnsemble consistently performs better than or equal to its competitors. For example, as shown in Figure 3(c), KMapEnsemble achieves at least 10% higher accuracy than the other methods for each size of labeled target-domain data. As a general trend, the accuracy of KMapEnsemble steadily improves as the number of labeled target-domain data increases from 2 to 12. Consequently, we infer that better performance can be obtained if more labeled target-domain data are provided.
Figure 2: Accuracy vs. different numbers of iterations (IBK and SMO base classifiers on the Reuters-21578 tasks).
Figure 3: Accuracy vs. different sizes of L (SMO base classifier on the SyskillWebert tasks).
Figure 4: Effectiveness analysis (kMapEnsemble compared with noKMap and noCluster, SMO base classifier, on the Reuters-21578 tasks).
4.4 Effectiveness Study
As analyzed in Section 3, kernel mapping makes the marginal distributions of the source domain and the target domain close, and cluster-based sample selection further reduces the difference between their conditional distributions. In this part, we perform an additional experiment to show that both components play an important role in the proposed approach. For comparison, we introduce two other methods. One, called "noCluster", uses the labeled target-domain data and all source-domain data to learn a mapping space and build a classifier; the other, called "noKMap", builds a classifier from the labeled target-domain data and the selected source-domain data in the original feature space. Note that noCluster only performs kernel mapping, while noKMap only performs cluster-based sample selection. Figure 4 plots the accuracies of noCluster, noKMap and kMapEnsemble on the Reuters-21578 collection as the number of iterations varies from 1 to 10. Neither noCluster nor noKMap is affected by the number of iterations. We observe that kMapEnsemble outperforms noCluster and noKMap consistently after the second iteration on all three data sets, gaining at least 5% in accuracy over the two baseline methods in all three cases. noKMap fails because it does not close the gap between the marginal distributions of the source and target domains, and noCluster performs poorly because it does not deal with the conditional distributions of the source and target domains, which remain different in the kernel mapping space.
5. RELATED WORKS
One main challenge of transfer learning is how to resolve and, at the same time, take advantage of the difference between two domains. Several methods use instance weighting to increase the weights of similar instances and reduce those of dissimilar ones ([5, 8, 11, 16]). For example, [8] adopts the boosting weight formula as the re-weighting scheme. Many other approaches attempt to change the representation of instances by mapping them into other spaces where the data from the two domains are similar (e.g., [18, 20]). Most recently, [12] proposed a locally weighted ensemble framework that combines multiple models for transfer learning by dynamically assigning model weights according to each model's predictive power on each test example. [17] uses a set of predefined kernels to find a suitable kernel for the new data. [20] improves the effectiveness of unsupervised dimension reduction with the help of related prior knowledge from other classes of the same type of concept.
6. CONCLUSION
The main challenge of transfer learning is to bridge the difference in distribution between the target domain and the source domain. The common assumption that the marginal and conditional probabilities are directly related between the source and target domains may fail in the original or linearly transformed space. In this paper, we have explored a new approach to this problem based on (1) kernel-space feature mapping to connect the two distributions, (2) cluster-criterion-based source-domain sample selection, (3) instance re-weighting, and (4) ensembles. Based on the feature spaces of the source and target domains, we use KDA to seek a third feature space that makes the marginal distributions of the two domains similar. Then, under the clustering-manifold assumption, we select those source-domain data whose conditional probability is likely to be similar to the target domain inside the kernel space. This process is repeated in an iterative manner in order to remove the bias of any single kernel mapping, and to re-select or re-weight source-domain examples based on their contribution to the expected accuracy. Our formal analysis shows that in the kernel space, both target-domain and source-domain data are approximately Gaussian, the difference in conditional distribution between the target-domain data and the selected source-domain data is bounded, and, importantly, the prediction error on target-domain data of the kernel-space ensemble trained with source-domain data is also bounded. Empirical studies show that when the two domains are different, the proposed approach improves the accuracy over state-of-the-art transfer learning methods by as much as 10%.

Acknowledgement
We would like to thank Ulrich Rückert and Stefan Kramer of Institut für Informatik/I12 at Technische Universität München for sharing the TechTC-300 data set, and Wenyuan Dai from the Department of Computer Science and Engineering at Shanghai Jiao Tong University for sharing the preprocessed Reuters-21578 data set. Jiangtao Ren is supported by the National Natural Science Foundation of China under Grant No. 60703110.

7. REFERENCES
[9] D. Davidov, E. Gabrilovich, and S. Markovitch. Parameterized generation of labeled datasets for text categorization based on a hierarchical directory. In Proceedings of The 27th Annual International ACM SIGIR Conference, pages 250–257, Sheffield, UK, 2004. ACM Press. [10] I. Davidson and W. Fan. When efficient model averaging out-performs boosting and bagging. In Knowledge Discovery in Databases: PKDD 2006, 10th European Conference on Principles and Practice of Knowledge Discovery in Databases, Berlin, Germany, September 18-22, 2006, Proceedings, pages 478–486. Springer, 2006. [11] W. Fan and I. Davidson. On sample selection bias and its efficient correction via model averaging and unlabeled examples. In Proceedings of the Seventh SIAM International Conference on Data Mining, SDM 2007, Minneapolis, Minnesota, USA, 2007. SIAM. [12] J. Gao, W. Fan, J. Jiang, and J. Han. Knowledge transfer via multiple model local structure mapping. In Y. Li, B. Liu, and S. Sarawagi, editors, Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, August 24-27, 2008, pages 283–291. ACM, 2008. [13] S. Y. Huang and C. R. Hwang. Kernel fisher’s discriminant analysis in gaussian reproducing kernel hilbert space. Technical report, Institute of Statistical Science, Academia Sinica, Taiwan, 2005. [14] S. J. Pan, J. T. Kwok, and Q. Yang. Transfer learning via dimensionality reduction. In Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, AAAI 2008, Chicago, Illinois, USA, July 13-17, 2008, pages 677–682. AAAI Press, 2008. [15] S. J. Pan and Q. Yang. A survey on transfer learning. Technical Report HKUST-CS08-08, Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China, November 2008. [16] J. Ren, X. Shi, W. Fan, and P. S. Yu. Type-independent correction of sample selection bias via structural discovery and re-balancing. In Proceedings of the Eighth SIAM International Conference on Data Mining, SDM 2008, pages 565–576, Atlanta, Georgia, USA, 2008. SIAM. [17] U. Rückert and S. Kramer. Kernel-based inductive transfer. In Machine Learning and Knowledge Discovery in Databases, European Conference, ECML/PKDD 2008, Antwerp, Belgium, September 15-19, 2008, Proceedings, Part II, pages 220–233, 2008. [18] S. Satpal and S. Sarawagi. Domain adaptation of conditional probability models via feature subsetting. In J. N. Kok, J. Koronacki, R. L. de M´l´cntaras, S. Matwin, D. Mladenic, and A. Skowron, editors, PKDD, volume 4702 of Lecture Notes in Computer Science, pages 224–235. Springer, 2007. [19] B. Schölkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In COLT ’01/EuroCOLT ’01: Proceedings of the 14th Annual Conference on Computational Learning Theory and and 5th European Conference on Computational Learning Theory, pages 416–426, London, UK, 2001. Springer-Verlag. [20] Z. Wang, Y. Song, and C. Zhang. Transferred dimensionality reduction. In Machine Learning and Knowledge Discovery in Databases, European Conference, ECML/PKDD 2008, Antwerp, Belgium, September 15-19, 2008, Proceedings, Part II, pages 550–565, 2008. [21] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems). Morgan Kaufmann, June 2005. [22] K. Yamazaki, M. Kawanabe, S. Watanabe, M. Sugiyama, and K.-R. Müller. 
Asymptotic bayesian generalization error when training and test distributions are different. In ICML ’07: Proceedings of the 24th international conference on Machine learning, pages 1079–1086, New York, NY, USA, 2007. ACM. [23] P. Zhang, J. Peng, and N. Riedel. Discriminant analysis: A unified approach. In ICDM ’05: Proceedings of the Fifth IEEE International Conference on Data Mining, pages 514–521, Washington, DC, USA, 2005. IEEE Computer Society. [24] X. Zhu. Semi-supervised learning with graphs. PhD thesis, Dept. of Computer Science, University of Carnegie Mellon, 2005.
[1] M. Amini, F. Laviolette, and N. Usunier. A transductive bound for the voted classifier with an application to semi-supervised learning. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21. 2009. [2] A. Asuncion and D. J. Newman. UCI machine learning repository, 2007. http://www.ics.uci.edu/mlearn/MLRepository.html. [3] G. Baudat and F. Anouar. Generalized discriminant analysis using a kernel approach. Neural Comput., 12(10):2385–2404, 2000. [4] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399–2434, 2006. [5] S. Bickel, M. Br´l´zckner, and T. Scheffer. Discriminative learning for differing training and test distributions. In Z. Ghahramani, editor, ICML, volume 227 of ACM International Conference Proceeding Series, pages 81–88. ACM, 2007. [6] J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. Wortman. Learning bounds for domain adaptation. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20. 2008. [7] F. Cucker and S. Smale. Best choices for regularization parameters in learning theory: On the bias-variance problem. Foundations of Computational Mathematics, 2(4):413–428, 2002. [8] W. Dai, Q. Yang, G.-R. Xue, and Y. Yu. Boosting for transfer learning. In ICML ’07: Proceedings of the 24th international conference on Machine learning, pages 193–200, New York, NY, USA, 2007. ACM.