A Risk Minimization Framework for Domain Adaptation
Bo Long, Sudarshan Lamkhede, Srinivas Vadrevu, Ya Zhang, Belle Tseng
Yahoo! Labs, 701 First Avenue, Sunnyvale, California 94089
ABSTRACT
Supervised learning algorithms usually require a large, high-quality labeled training set. It is often expensive to obtain such labeled examples in every domain of an application. Domain adaptation aims to help in such cases by utilizing data available in related domains. However, transferring knowledge from one domain to another is often nontrivial because the data distributions differ among domains. Moreover, it is usually very hard to measure and formulate these distribution differences. Hence, we introduce a new concept, the label-relation function, to transfer knowledge among different domains without explicitly formulating the data distribution differences. A novel learning framework, Domain Transfer Risk Minimization (DTRM), is proposed based on this concept. DTRM simultaneously minimizes the empirical risk for the target domain and the regularized empirical risk for the source domain. Under this framework, we further derive a generic algorithm called Domain Adaptation by Label Relation (DALR) that is applicable to various applications in both classification and regression settings. DALR iteratively updates the target hypothesis function and the outputs for the source domain until it converges. We provide an in-depth theoretical analysis of DTRM and establish fundamental error bounds. We also experimentally evaluate DALR on the task of ranking search results using real-world data. Our experimental results show that the proposed algorithm effectively and robustly utilizes data from source domains under various conditions: different sizes of source domain data, different noise levels of source domain data, and different difficulty levels of target domain data.
Categories and Subject Descriptors H.4 [Information Systems Applications]: Miscellaneous;
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CIKM’09, November 2–6, 2009, Hong Kong, China. Copyright 2009 ACM 978-1-60558-512-3/09/11 ...$10.00.
I.5.1 [Pattern Recognition]: Models-statistical
General Terms algorithms
Keywords domain adaptation, risk minimization, label-relation, source domain, target domain
1. INTRODUCTION
In most supervised learning settings, a learner is provided with some solved cases (examples and corresponding labels) and is supposed to learn how to solve new cases. We use the term ‘labels’ to refer to both class labels for classification tasks and real-valued targets for regression tasks. For example, in learning to rank search results, a learner is provided with relevance judgments for a set of query-document pairs. This labeling is usually done by human experts, which is expensive and time consuming. Moreover, we may not have enough resources and human experts in every domain to create a sufficiently large, good-quality training data set; e.g., we may not have experts for each target language for which we want to develop a ranking model. But we may have good training data available in one domain, and it can be of great value if we can utilize it in other domains of the same application. For instance, if we can use relevance judgments in English to train a model for other languages, we can overcome the scarcity of data in those languages.

Similar situations arise in various other important applications where we want to overcome the scarcity of data in one domain (the target domain) by utilizing data from related domains (source domains). In text mining applications, such as text classification, tagging, and named entity recognition, we want to use plentiful labeled data from popular domains, such as the business journal domain, to help target domains with little labeled data, such as the scientific publication domain. In recommendation systems, we are interested in using review data from source domains such as the book domain to help predict product ratings in target domains such as the DVD domain. In email spam detection, we may have sufficient labeled data for general spam detection (source domain), but not enough labeled data for a specific group of users (target domain). It is desirable to use the general domain data to help build a personalized spam detection model for that group of users.

Although domain adaptation has a significant impact on various applications, it has started to gain attention only recently [20, 31]. The main challenge of domain adaptation is to deal with different data distributions among domains. A model trained with data from a source domain or with combined multi-domain data usually does not work well for the target domain, because it violates the basic supervised learning assumption that training and test data are identically distributed. An intuitive solution to domain adaptation is to explicitly formulate the distribution difference between a source domain and the target domain and then to use this difference to adapt the model from the source domain to the target domain [19, 4, 21, 35]. However, directly formulating the distribution difference faces three big challenges. First, in the real world, the distribution difference is very hard to measure and formulate. Two domains could differ in their conditional distributions, marginal distributions, or joint distributions. For high-dimensional data such as text, web, and image data, it is very difficult to formulate the distribution in a single domain, let alone the distribution difference involving multiple domains. Second, even if we manage to formulate the differences between a single source domain and the target domain, we cannot efficiently extend them to multiple source domains, since different source domains could differ from the target domain in different ways. Third, formulating the distribution difference is usually tied to the specific data distributions, and hence the algorithm design for domain adaptation often remains heavily domain-specific, application-specific, and learning-task-specific. In this paper, we take a different approach to domain adaptation.
Instead of directly formulating the data distribution differences, we formulate the relation between the hypothesis functions of different domains in the output space. We assume that the hypothesis functions in the target and the source domain, $h^t(x)$ and $h^s(x)$, respectively, are related and that there exists a relation function for their outputs. We call this function the label-relation function. Since in most applications the outputs are scalars and the relationships between outputs are intuitively interpretable, it is relatively easy to formulate a label-relation function (Section 3.2). We can then use the label-relation function to transfer knowledge from the source domain to the target domain. Our main contributions can be summarized as follows:

1. Based on the concept of the label-relation function, we propose a new framework, Domain Transfer Risk Minimization (DTRM), for domain adaptation. DTRM provides a general domain adaptation framework for different supervised learning tasks in various applications (Section 3).

2. Under the DTRM framework, we derive an algorithm for general domain adaptation, which iteratively updates the target hypothesis function and the outputs for the source domain until it converges. The proposed algorithm has three main advantages. First, it is applicable to both classification and regression tasks in a wide range of applications; second, it can easily make use of labeled data from multiple source domains; third, as a general wrapper algorithm, it provides the flexibility to choose the best base learner for a given application (Section 4).

3. We prove that the objective function is non-increasing under the updating rules, and hence that the convergence of the algorithm is guaranteed; we also provide a theoretical analysis of the bounds on the risk for domain adaptation (Section 5).
2. RELATED WORK
Historically, special cases of domain adaptation have been explored in the literature under different names, including sample selection bias [17], class imbalance [19], and covariate shift [35]. Those studies mainly address the difference between the training and testing data distributions. For example, class imbalance assumes that the density of the input variables conditioned on the output variable is the same for training and test data, while the marginal density of the output variable differs between training and test data. Recently, domain adaptation has increasingly attracted attention from the machine learning community, mainly in the form of domain adaptation or transfer learning [20, 31]. While domain adaptation and transfer learning apply the same principle to transferring knowledge across domains, it is worth noting that transfer learning may not distinguish the target domain from the source domain.

A popular class of domain adaptation methods is based on instance re-weighting [4, 21, 10, 25, 5, 18, 13, 36, 12], which assumes that certain parts of the data in the source domain can be reused for the target domain by re-weighting. [21] proposed a heuristic method to remove “misleading” training instances from the source domain so as to include “good” instances from labeled source-domain instances and unlabeled target-domain instances. [10] introduced a boosting algorithm, TrAdaBoost, which assumes that the source and target domain data use exactly the same set of features and labels, but that the distributions of the data in the two domains are different. TrAdaBoost iteratively re-weights the source domain data and target domain data to reduce the effect of the “bad” source data while encouraging the “good” source data to contribute more to the target domain. [4] proposed a framework to simultaneously re-weight the source domain data and train models on the re-weighted data with a kernel logistic regression classifier.
The above studies have shown the promise of data re-weighting for applications such as NLP [21] and for special learning tasks such as binary classification [10, 12]. To some extent, our algorithm can also be viewed as a re-weighting style algorithm. To the best of our knowledge, DTRM is the first framework under which re-weighting is generalized to different learning tasks in various applications.

Another category of domain adaptation approaches focuses on feature representation [6, 32, 26, 9, 1, 2, 3, 24], where a feature representation is learned for the target domain and used to transfer knowledge across domains. A structural correspondence learning (SCL) algorithm [6] was proposed to use unlabeled data from the target domain to extract features so as to reduce the difference between source and target domains. A simple kernel mapping function is introduced in [11], which maps the data from both domains to a high-dimensional feature space, where standard discriminative learning methods are then used to train classifiers. [32] proposed to apply sparse coding [37], an unsupervised feature construction method, to learn higher-level features across domains. [26] proposed a spectral classification framework for the cross-domain transfer learning problem, where the objective function seeks consistency between the in-domain supervision and the out-of-domain intrinsic structure.

The third category of approaches can be viewed as parameter-sharing approaches [33, 23, 15, 8], which assume that the source tasks and the target tasks share some parameters or priors of their models. An efficient algorithm, MT-IVM [23], which is based on Gaussian Processes (GP), was proposed to handle the multi-domain learning case. MT-IVM tries to learn the parameters of a GP over multiple tasks by assigning the same GP prior to the tasks. Similarly, Hierarchical Bayes (HB) is used with GP for multi-task learning [33]. [15] borrowed the idea of [33] and used SVMs for multi-domain learning. The parameters of the SVM for each domain are assumed to be separable into two terms: a common term across tasks and a task-specific term. [27] proposed a consensus regularization framework for transfer learning from multiple source domains to a target domain.

The fourth category of approaches is based on relational knowledge transfer [29, 14, 28]. [28] proposed the algorithm TAMAR, which transfers relational knowledge with Markov Logic Networks (MLNs) across relational domains. MLNs are a powerful formalism for statistical relational learning, combining the compact expressiveness of first-order logic with the flexibility of probability. TAMAR was later extended to the single-entity-centered setting of transfer learning [29], where only one entity in the target domain is available. [14] proposed an approach to transferring relational knowledge based on a form of second-order Markov logic. In summary, those approaches assume that some relationships among the data in the source and target domains are similar.
Another related field is semi-supervised learning [7, 38, 22, 30], which addresses the problem that the labeled data are too few to build a good classifier by making use of a large amount of unlabeled data together with a small amount of labeled data. Co-training [7, 38], which trains two learners for two views by iteratively including unlabeled data, is closely related to domain adaptation. The major difference is that in co-training we have one set of instances with features partitioned into two “views”, whereas in domain adaptation we use different sets of data instances from different domains. Finally, it is worth mentioning that most work in the literature focuses on classification. Our study addresses both regression and classification for domain adaptation.
3. BACKGROUND AND MODEL FORMULATION
3.1 Notation and Background
First, a word about notation. We use $\mathbb{R}$ to denote the set of real numbers, $\mathbb{R}_{+}$ to denote the set of nonnegative real numbers, and $\mathbb{R}_{++}$ to denote the set of positive real numbers. $\mathcal{X}$ denotes the instance space and $\mathcal{Y}$ denotes the output space. In this study, we consider the most general case for both classification and regression, i.e., we take $\mathcal{X}$ to be $\mathbb{R}^d$ and $\mathcal{Y}$ to be $\mathbb{R}$. Let $h : \mathcal{X} \to \mathcal{Y}$ denote the function we want to learn and $\mathcal{H}$ denote a function space. The target function $h^t$ and the source function $h^s$ denote the functions we want to learn from the target and the source domain, respectively. We define a domain as a data distribution $D$ on the instance space $\mathcal{X}$; $D^t$ and $D^s$ denote the target and source domains, respectively. $S^t = \{(x_i^t, y_i^t)\}_{i=1}^{n} \in \mathcal{X} \times \mathcal{Y} \sim D^t$ denotes the training data sampled from the target domain; $S^s = \{(x_j^s, y_j^s)\}_{j=n+1}^{n+m} \in \mathcal{X} \times \mathcal{Y} \sim D^s$ denotes the training data sampled from the source domain. $l : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ is a loss function defined on a pair of outputs.

Let us first consider the conventional supervised learning setting. Suppose that we only have training data from the target domain, $S^t = \{(x_i^t, y_i^t)\}_{i=1}^{n} \in \mathcal{X} \times \mathcal{Y} \sim D^t$. A learning algorithm takes $S^t$ as input to learn the function $h^t : \mathcal{X} \to \mathcal{Y}$ that is the optimal solution minimizing the risk function $R : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$,

$$R^t = E_{x \sim D^t}[l(h^t(x), y)] \tag{1}$$

$$\phantom{R^t} = \int l(h^t(x), y)\, dF^t(x, y) \tag{2}$$

where $F^t(x, y)$ denotes the cumulative distribution function of $(x, y)$ on the target domain. We cannot minimize the functional $R^t$ directly, since $F^t(x, y)$ is unknown. Instead, based on the training data, we minimize the empirical risk function,

$$\tilde{R}^t = \frac{1}{n} \sum_{i=1}^{n} l(h^t(x_i^t), y_i^t). \tag{3}$$
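As a small illustration of Eq. (3), the following sketch computes the empirical risk of a fixed hypothesis on a toy target sample. The names `empirical_risk` and `squared_loss`, the toy data, and the choice of squared loss are assumptions made for this example only; they are not part of the original text.

```python
import numpy as np

def squared_loss(pred, target):
    """Example loss l(y', y) = (y' - y)^2 (an assumed choice for this sketch)."""
    return (pred - target) ** 2

def empirical_risk(h, X, y, loss):
    """Average loss of hypothesis h over the n sampled examples, as in Eq. (3)."""
    predictions = np.array([h(x) for x in X])
    return float(np.mean(loss(predictions, np.asarray(y, dtype=float))))

# Toy target sample with n = 3 one-dimensional instances.
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 1.0, 4.0])

def h(x):
    """A fixed hypothesis h^t(x) = 2x, chosen only for illustration."""
    return 2.0 * x[0]

risk = empirical_risk(h, X, y, squared_loss)
```

With only three examples the estimate is very noisy, which is the small-sample problem that motivates bringing in source domain data.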
3.2 Model Formulation
Under conventional risk minimization, if the training sample size $n$ is small, the empirical risk $\tilde{R}^t$ is not a reliable estimate of the true risk $R^t$, and we cannot expect to obtain a good estimate of the target function $h^t$. In the domain adaptation setting, we have additional training data from a source domain, $S^s = \{(x_j^s, y_j^s)\}_{j=n+1}^{n+m} \in \mathcal{X} \times \mathcal{Y} \sim D^s$ (we assume one source domain here; later we show that it is easy to extend the proposed framework and algorithm to multiple source domains). Now the problem is how to use $S^s$ to help learn the target function $h^t$. To achieve this goal, we need to transfer knowledge from $S^s$ to $S^t$. Note that since $D^t$ is different from $D^s$, simply combining $S^t$ and $S^s$ is not a justified approach. As discussed in Section 1, it is in general hard to formulate the difference between $D^t$ and $D^s$, especially when $\mathcal{X} = \mathbb{R}^d$ with large $d$.

On the other hand, we observe the following facts in the output space. First, the relation between a pair of outputs is much easier to formulate, since in most applications the output space is one-dimensional, i.e., $\mathcal{Y} \subseteq \mathbb{R}$. Second, the relation between the outputs of $h^t$ and $h^s$ is usually intuitive and underlies the basic feasibility of domain adaptation in most applications. For example, in the ranking problem, $h^t$ and $h^s$ are the optimal ranking functions for the target domain $D^t$ (the query-document distribution for one country) and the source domain $D^s$ (the query-document distribution for another country). Given any query-document instance $x$ from $D^s$, we obtain two ranking scores $y^t = h^t(x)$ and $y^s = h^s(x)$. We expect a certain correlation between $y^t$ and $y^s$; for example, $y^t$ and $y^s$ are positively correlated with significant probability.
If $y^t$ is totally independent of $y^s$, we cannot expect the training data from $S^s$ to help learn $h^t$, since that implies the two domains have totally different ranking principles, i.e., the optimal ranking function from one domain cannot do better than a random guess on instances from the other domain. Similarly, in sentiment classification, we expect that the output of the rating function $h^s$ in the book domain is correlated in a certain way with the output of the rating function $h^t$ in the DVD domain; otherwise, the training data from the book domain cannot be helpful for the DVD domain.

Therefore, we propose the concept of a label-relation function, $r : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, to formulate the relation between two labels (outputs) such that $r(y^t, y^s)$ measures the “consistency” between $y^t$ and $y^s$. There are many possible choices of function space for $r$. In this study, we use the exponential function,

$$r(y^t, y^s) = \exp(-d(y^t, y^s)), \tag{4}$$

where $d : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_{+}$ is a distance function. We choose the exponential function $\exp(-d(y^t, y^s))$ for the following reasons. First, the function provides an intuitive measure of the consistency between two labels, since its output is between 0 and 1, with 1 denoting perfect consistency of the two labels. Second, the exponential function leads to less computational effort for the algorithm (Section 4).

Next, we propose a new risk minimization framework for domain adaptation, which transfers knowledge from the source domain to the target domain through the label-relation function. We summarize the assumptions of the framework as follows.

• The target domain and the source domain have their own optimal functions $h^t$ and $h^s$, respectively.
• The relation between the outputs of the two functions is formulated as $r(h^t, h^s)$. We call the output of $h^t$ the target label, denoted by $y^t$, and the output of $h^s$ the source label, denoted by $y^s$.

In the source domain, for each instance $x$, the source label $y^s$ is observable but the target label is not. We treat the target labels for the source domain as hidden variables in order to incorporate the source domain data into the risk function,

$$R^a = E[l(h^t(x^t), y^t)] + E[\alpha\, l(h^t(x^s), y^t)\, r(y^t, y^s)]$$

$$\phantom{R^a} = \int l(h^t(x^t), y^t)\, dF^t(x^t, y^t) + \int \alpha\, l(h^t(x^s), y^t)\, r(y^t, y^s)\, dF^s(x^s, y^s, y^t), \tag{5}$$
where $\alpha \in \mathbb{R}_{++}$ is a positive normalization constant. We call the framework based on the above risk function Domain Transfer Risk Minimization (DTRM). In this framework, by introducing the hidden target label $y^t$ and the label-relation function $r$ into the source domain, the additional information from the source domain data $S^s = \{(x_j^s, y_j^s)\}_{j=n+1}^{n+m}$ is used to learn the target function $h^t$. Since $F^t(x^t, y^t)$ and $F^s(x^s, y^s, y^t)$ are unknown, we minimize the following empirical risk to learn $h^t$:

$$\tilde{R}^a = \frac{1}{n} \sum_{i=1}^{n} l(h^t(x_i^t), y_i^t) + \frac{\alpha}{m} \sum_{j=n+1}^{n+m} l(h^t(x_j^s), y_j^t)\, r(y_j^t, y_j^s). \tag{6}$$

The above empirical risk helps us understand the intuition behind DTRM. Basically, DTRM aims to “select” the training instances from the source domain according to their “usefulness”, which is measured through the label-relation function. The main difference between DTRM and the conventional re-weighting framework [21] is that DTRM does not explicitly formulate the data distribution difference in the input space, and this leads to several algorithmic advantages (Section 4).

4. ALGORITHM DERIVATION
In this section, we derive the domain adaptation algorithm under the DTRM framework. Our task is to solve the following minimization problem,

$$\arg\min_{h^t,\, \Upsilon} \tilde{R}^a, \tag{7}$$

where $\tilde{R}^a$ is defined in Eq. (6), and $\Upsilon = \{y_j^t\}_{j=n+1}^{n+m}$ denotes the hidden target labels for the source domain instances. Since this is a general minimization problem, it seems that for each specific loss function $l$ we need to derive a specific algorithm. However, we show that under a weak assumption on $l$, we can derive a general algorithm that applies to a wide range of loss functions. We assume that the loss function $l$ is positive definite, i.e., $l : \mathbb{R} \times \mathbb{R} \to \mathbb{R}_{+}$ and $l(y, y) = 0$. Positive definiteness is a weak assumption that is easily satisfied in real applications, since most popular loss functions, such as Euclidean distance, 0-1 loss, and KL-divergence, are non-negative and equal to zero if and only if their inputs are identical.

To derive a general domain adaptation algorithm, we also assume that a base learner $L$ is given to minimize the loss function $l$; this leaves domain experts free to choose the best base learner for a given application. For example, for a binary classification task, an SVM may be chosen as the input base learner to the general domain adaptation algorithm.

In the minimization problem, we need to learn both the target function $h^t$ and the hidden target labels for the source domain instances, $\{y_j^t\}_{j=n+1}^{n+m}$. In general, this is a non-convex problem. We derive an iterative algorithm that updates $h^t$ and $\{y_j^t\}_{j=n+1}^{n+m}$ alternately until it converges.

First, we fix $\{y_j^t\}_{j=n+1}^{n+m}$ to update $h^t$. Hence, our task is

$$\arg\min_{h^t} \tilde{R}^a. \tag{8}$$

We re-formulate the risk function as follows.

$$\tilde{R}^a = \frac{1}{n} \sum_{i=1}^{n} l(h^t(x_i^t), y_i^t) + \frac{\alpha}{m} \sum_{j=n+1}^{n+m} l(h^t(x_j^s), y_j^t)\, r(y_j^t, y_j^s)$$

$$\phantom{\tilde{R}^a} = \sum_{i=1}^{n+m} w_i\, l(h^t(x_i), y_i^t), \tag{9}$$

where

$$w_i = \begin{cases} \dfrac{1}{n} & \text{if } 1 \le i \le n, \\[2mm] \dfrac{\alpha\, r(y_i^t, y_i^s)}{m} & \text{if } n+1 \le i \le n+m. \end{cases} \tag{10}$$

Based on Eq. (9), when $\{y_j^t\}_{j=n+1}^{n+m}$ is fixed, the minimization of $\tilde{R}^a$ reduces to the minimization of the weighted loss function $l$, to which we can directly apply the base learner $L$. We assume that the base learner $L$ returns the following solution,

$$h^{t*} = \arg\min_{h^t} \sum_{i=1}^{n+m} w_i\, l(h^t(x_i), y_i^t) \tag{11}$$

with $w_i$ defined in Eq. (10).

Next, we update $\{y_j^t\}_{j=n+1}^{n+m}$ when $h^t$ is fixed. Now our task is

$$\arg\min_{\{y_j^t\}_{j=n+1}^{n+m}} \tilde{R}^a. \tag{12}$$
It seems that given a specific loss function $l$ and a specific label-relation function $r$, we need to derive a specific algorithm to solve this optimization problem. However, using the positive definiteness of the loss function and the properties of the exponential relation function, we can derive a general updating rule applicable to all such loss functions and relation functions. The following theorem provides the updating rule.

Theorem 1. $\{y_j^t = h^t(x_j^s)\}_{j=n+1}^{n+m}$ is a globally optimal solution to the minimization problem in (12).

Proof. Since $l(h^t(x^s), y^t) \ge 0$ and $r(y^t, y^s) \ge 0$, we obtain the following deduction,

$$\tilde{R}^a = \frac{1}{n} \sum_{i=1}^{n} l(h^t(x_i^t), y_i^t) + \frac{\alpha}{m} \sum_{j=n+1}^{n+m} l(h^t(x_j^s), y_j^t)\, r(y_j^t, y_j^s) \ge \frac{1}{n} \sum_{i=1}^{n} l(h^t(x_i^t), y_i^t). \tag{13}$$

Hence, the global minimum of $\tilde{R}^a$ w.r.t. $\{y_j^t\}_{j=n+1}^{n+m}$ is $\frac{1}{n} \sum_{i=1}^{n} l(h^t(x_i^t), y_i^t)$. Since $1 \ge r(y^t, y^s) \ge 0$, the global minimum is attained when $l(h^t(x_j^s), y_j^t) = 0$, i.e., $y_j^t = h^t(x_j^s)$ (by the positive definiteness of the loss function $l$). The proof is completed.

Theorem 1 provides a simple rule to update $\{y_j^t\}_{j=n+1}^{n+m}$ when $h^t$ is fixed,

$$y_j^t = h^t(x_j^s) \quad \text{for } n+1 \le j \le n+m. \tag{14}$$

Based on updating rules (11) and (14), we propose a general domain adaptation algorithm, Domain Adaptation by Label Relation (DALR), which is summarized in Algorithm 1. Algorithm 1 alternately updates the target function $h^t$ and the hidden target labels $\{y_j^t\}_{j=n+1}^{n+m}$, which transfers information from the source domain to $h^t$ through the label-relation function $r$. The intuition behind the algorithm is that the instances in the source domain training data $S^s = \{(x_j^s, y_j^s)\}_{j=n+1}^{n+m}$ have different “usefulness” for learning the target function $h^t$, and through the label-relation function the model iteratively makes use of the source domain instances based on their most recent “usefulness” until it converges.

Algorithm 1 Domain Adaptation by Label Relation
Input: $S^t = \{(x_i^t, y_i^t)\}_{i=1}^{n}$, $S^s = \{(x_j^s, y_j^s)\}_{j=n+1}^{n+m}$, and a base learner $L$.
Output: A target function $h^t$.
Method:
1: Initialize $h^t$.
2: repeat
3:   for $j = n+1$ to $n+m$ do
4:     let $y_j^t = h^t(x_j^s)$
5:   end for
6:   Update $w_i$ such that $w_i = \frac{1}{n}$ if $1 \le i \le n$, and $w_i = \frac{\alpha\, r(y_i^t, y_i^s)}{m}$ if $n+1 \le i \le n+m$.
7:   Call base learner $L$ to obtain $h^{t*} = \arg\min_{h^t} \sum_{i=1}^{n+m} w_i\, l(h^t(x_i), y_i^t)$
8: until convergence

The following theorem guarantees the convergence of the DALR algorithm.

Theorem 2. Algorithm 1 monotonically decreases the risk function $\tilde{R}^a$ in Eq. (6).

Proof. Let $\tilde{R}^a_e$ and $\tilde{R}^a_{e+1}$ denote the risk function at the $e$-th and $(e+1)$-th iterations, respectively. During the $(e+1)$-th iteration, we have the following updates,

$$h^t_{e+1} = \arg\min_{h^t_e} \tilde{R}^a_e \tag{15}$$

$$y^t_{j,e+1} = h^t_{e+1}(x_j^s) \quad \text{for } n+1 \le j \le n+m. \tag{16}$$

Then, we have the following deduction,

$$\tilde{R}^a_e = \frac{1}{n} \sum_{i=1}^{n} l(h^t_e(x_i^t), y_i^t) + \frac{\alpha}{m} \sum_{j=n+1}^{n+m} l(h^t_e(x_j^s), y^t_{j,e})\, r(y^t_{j,e}, y_j^s)$$

$$\ge \frac{1}{n} \sum_{i=1}^{n} l(h^t_{e+1}(x_i^t), y_i^t) + \frac{\alpha}{m} \sum_{j=n+1}^{n+m} l(h^t_{e+1}(x_j^s), y^t_{j,e})\, r(y^t_{j,e}, y_j^s)$$

$$\ge \frac{1}{n} \sum_{i=1}^{n} l(h^t_{e+1}(x_i^t), y_i^t)$$

$$= \tilde{R}^a_{e+1}.$$

In the above deduction, the first inequality results from Eq. (15), and the last equality can be obtained by substituting Eq. (16) into $\tilde{R}^a_{e+1}$. The proof is completed.

Based on Theorem 2, Algorithm 1 is guaranteed to converge, since the risk is non-increasing across iterations and bounded below by zero. In practice, we observe that the DALR algorithm converges very fast, usually in a few iterations. The DALR algorithm can also be viewed as a re-weighting style algorithm. However, DALR uses model self-adjustment and label relations to iteratively adjust the “usefulness” of the source data instances, and this avoids directly dealing with the distribution differences in the input space. Moreover, under the DTRM framework, DALR generalizes re-weighting to different learning tasks in various applications. In summary, the DALR algorithm has the following advantages.
(1) It is applicable to a wide range of applications, since it is not designed for a specific loss function.
(2) It is easy to extend to multiple source domains; we just need to combine the data sets from the multiple source domains into a single data set as the input. Since the algorithm does not make assumptions about the source domains, it will automatically select “useful” training examples from the combined input data.
(3) It is flexible, allowing domain experts to choose the most suitable base learner for a specific application.
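The alternating updates of Algorithm 1 can be sketched as follows. This is only a minimal illustration under several assumptions that do not come from the original text: the base learner is a weighted least-squares linear model (`fit_weighted_linear`, a hypothetical stand-in for whatever learner a practitioner would choose), the loss is the squared loss, the relation function is the exponential of the squared label difference, and $h^t$ is initialized on the pooled source and target data because Algorithm 1 does not prescribe an initialization.

```python
import numpy as np

def fit_weighted_linear(X, y, w):
    """Hypothetical base learner L: weighted least squares with a bias term."""
    A = np.hstack([X, np.ones((len(X), 1))])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return lambda Z: np.hstack([Z, np.ones((len(Z), 1))]) @ coef

def dalr(Xt, yt, Xs, ys, alpha=1.0, max_iter=20, tol=1e-10):
    """Sketch of Algorithm 1 with squared loss and r = exp(-(y^t - y^s)^2)."""
    n, m = len(yt), len(ys)
    X_all = np.vstack([Xt, Xs])
    # Step 1 (assumed initialization): fit on the pooled source and target data.
    h = fit_weighted_linear(X_all, np.concatenate([yt, ys]),
                            np.full(n + m, 1.0 / (n + m)))
    history = []
    for _ in range(max_iter):
        # Steps 3-5, Eq. (14): hidden target labels for the source instances.
        y_hidden = h(Xs)
        # After this update the source term of Eq. (6) vanishes, so the risk
        # equals the target empirical risk under the squared loss.
        history.append(float(np.mean((h(Xt) - yt) ** 2)))
        if len(history) > 1 and history[-2] - history[-1] < tol:
            break
        # Step 6, Eq. (10): weights from the label-relation function.
        r = np.exp(-(y_hidden - ys) ** 2)
        w = np.concatenate([np.full(n, 1.0 / n), alpha * r / m])
        # Step 7, Eq. (11): refit the base learner on the weighted data.
        h = fit_weighted_linear(X_all, np.concatenate([yt, y_hidden]), w)
    return h, history

# Toy data: a small target sample and a larger, slightly different source sample.
rng = np.random.default_rng(0)
Xt = rng.normal(size=(20, 2))
yt = Xt @ np.array([1.0, -0.5]) + 0.1 * rng.normal(size=20)
Xs = rng.normal(size=(200, 2))
ys = Xs @ np.array([0.8, -0.3]) + 0.1 * rng.normal(size=200)

h, history = dalr(Xt, yt, Xs, ys)
```

The recorded `history` is non-increasing, in line with Theorem 2, because the stand-in base learner minimizes the weighted objective exactly. One consequence worth noting: if $h^t$ were instead initialized as the exact target-only minimizer from the same hypothesis class, the iterations could not lower the risk further, so the choice of initialization and base learner matters in this sketch.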
5. FORMAL ANALYSIS
In this section, we provide a theoretical analysis of the proposed framework and algorithm. Specifically, we aim to answer the following fundamental questions.

1. Is the empirical risk $\tilde{R}^a$ an unbiased estimator of the true risk $R^a$?
2. How close is the empirical risk $\tilde{R}^a$ to the true risk $R^a$?
3. How close is the optimal function learned by the DALR algorithm to the true optimal function?

We show that the empirical risk $\tilde{R}^a$ is an unbiased estimator of the true risk $R^a$ by the following theorem.

Theorem 3. For a given function $h^t \in \mathcal{H}$, $\tilde{R}^a$ is an unbiased estimator of $R^a$,

$$E\tilde{R}^a = R^a. \tag{17}$$

Proof.

$$E\tilde{R}^a = E\Big[\frac{1}{n} \sum_{i=1}^{n} l(h^t(x_i^t), y_i^t) + \frac{\alpha}{m} \sum_{j=n+1}^{n+m} l(h^t(x_j^s), y_j^t)\, r(y_j^t, y_j^s)\Big]$$

$$= \frac{1}{n} \sum_{i=1}^{n} E[l(h^t(x_i^t), y_i^t)] + \frac{1}{m} \sum_{j=n+1}^{n+m} E[\alpha\, l(h^t(x_j^s), y_j^t)\, r(y_j^t, y_j^s)]$$

$$= E[l(h^t(x^t), y^t)] + E[\alpha\, l(h^t(x^s), y^t)\, r(y^t, y^s)]$$

$$= R^a.$$

Next, we derive the error bound between the empirical risk $\tilde{R}^a$ and the true risk $R^a$ to answer the second question. We begin with some notation for clarity.

$$z_i = \begin{cases} \dfrac{n+m}{n}\, l(h^t(x_i^t), y_i^t) & \text{if } 1 \le i \le n, \\[2mm] \dfrac{(n+m)\alpha}{m}\, l(h^t(x_i^s), y_i^t)\, r(y_i^t, y_i^s) & \text{if } n+1 \le i \le n+m. \end{cases} \tag{18}$$

Hence, we can formulate $\tilde{R}^a$ as follows,

$$\tilde{R}^a = \frac{1}{n+m} \sum_{i=1}^{n} z_i + \frac{1}{n+m} \sum_{j=n+1}^{n+m} z_j = \frac{1}{n+m} \sum_{i=1}^{n+m} z_i = \bar{z}. \tag{19}$$

The following theorem provides the error bound between $\tilde{R}^a$ and the true risk $R^a$.

Theorem 4. Let $\mathcal{H}$ be a function space with VC dimension $c$, and let $b$ be the upper bound for the loss function $l$. Then, for every $h^t \in \mathcal{H}$, with probability $1 - \delta$,

$$|\tilde{R}^a - R^a| \le b \sqrt{\frac{(m + n\alpha^2)\,(c \lg(2(n+m)/c) - \lg \delta)}{2mn}}. \tag{20}$$

Proof.

$$P[|\tilde{R}^a - R^a| \ge t] = P[|\bar{z} - E\bar{z}| \ge t]$$

$$\le 2 \exp\Big(\frac{-2(n+m)^2 t^2}{\sum_{i=1}^{n+m} (\max(z_i) - \min(z_i))^2}\Big)$$

$$= 2 \exp\Big(\frac{-2(n+m)^2 t^2}{\sum_{i=1}^{n} \big(\frac{(m+n)b}{n}\big)^2 + \sum_{i=n+1}^{n+m} \big(\frac{\alpha(m+n)b}{m}\big)^2}\Big)$$

$$= 2 \exp\Big(\frac{-2nm t^2}{(m + n\alpha^2)\, b^2}\Big).$$

In the above deduction, the first equality results from Theorem 3 and the first inequality results from Hoeffding’s inequality. The above result holds for a single $h^t \in \mathcal{H}$. For the whole function space $\mathcal{H}$, with the standard uniform bound argument with the VC dimension as our growth function, we have

$$P[|\tilde{R}^a - R^a| \ge t] \le \Big(\frac{2(n+m)}{c}\Big)^{c} \exp\Big(\frac{-2nm t^2}{(m + n\alpha^2)\, b^2}\Big). \tag{21}$$

We let the RHS equal $\delta$ and solve for $t$,

$$t = b \sqrt{\frac{(m + n\alpha^2)\,(c \lg(2(n+m)/c) - \lg \delta)}{2mn}}. \tag{22}$$

Hence, we obtain

$$P\Big[|\tilde{R}^a - R^a| \ge b \sqrt{\frac{(m + n\alpha^2)\,(c \lg(2(n+m)/c) - \lg \delta)}{2mn}}\Big] \le \delta, \tag{23}$$

and the proof is completed.

Finally, we address the third question: how close is the optimal function learned by the DALR algorithm to the true optimal function, i.e., what is the best the DALR algorithm can do? We define the following notation for clarity.

$$R^t(h) = \int l(h(x^t), y^t)\, dF^t(x^t, y^t) \tag{24}$$

$$R^s(h) = \int l(h(x^s), y^t)\, r(y^t, y^s)\, dF^s(x^s, y^s, y^t) \tag{25}$$

Then, we have

$$R^a(h) = R^t(h) + \alpha R^s(h). \tag{26}$$

Then, we propose the following definition to measure the distance between two domains.

Definition 1. Let $\mathcal{H}$ be a function space with VC dimension $c$ and $l : \mathbb{R} \times \mathbb{R} \to \mathbb{R}_{+}$ be a loss function. The distance between two domains $D^t$ and $D^s$ is defined as

$$\Delta_{\mathcal{H},l} = \sup_{h \in \mathcal{H}} |R^s(h) - R^t(h)|. \tag{27}$$
Computing $\Delta_{\mathcal{H},l}$ is NP-hard. In practice, we can use bootstrapping to approximate it. For convenience, we use $\lambda$ to denote $\Delta_{\mathcal{H},l}$. If we let $h^0$ denote the true risk minimizer for the target domain, $h^0 = \arg\min_{h \in \mathcal{H}} R^t(h)$, and $h^*$ denote the empirical minimizer for domain adaptation, $h^* = \arg\min_{h \in \mathcal{H}} \tilde{R}^a(h)$, then the third question is equivalent to asking what the error bound between $R^t(h^*)$ and $R^t(h^0)$ is. The following theorem answers this question.

Theorem 5. Let $\mathcal{H}$ be a function space with VC dimension $c$, and let $b$ be the upper bound for the loss function $l$. Then, with probability $1 - \delta$,

$$R^t(h^*) \le \frac{2b}{1+\alpha} \sqrt{\frac{(m + n\alpha^2)\,(c \lg(2(n+m)/c) - \lg \delta)}{2mn}} + R^t(h^0) + \frac{\alpha\lambda}{1+\alpha}. \tag{28}$$

Proof. The result follows from the triangle inequality and Theorem 4 (details are omitted).

Theorem 5 provides theoretical justification for the intuitions behind the DALR algorithm (which also hold for most domain adaptation algorithms). For example, when the two domains are closer to each other ($\lambda$ is smaller), the optimal function learned by the algorithm is more accurate w.r.t. the true function (the error bound is tighter). Another interesting example is that when $\alpha$ is larger, the size of the source domain data $m$ is more important and the size of the target domain data $n$ is less important for the error bound, i.e., $\alpha$ controls the relative importance of the source domain data w.r.t. the target domain data. Note that, based on Theorem 5, increasing or decreasing $\alpha$ does not necessarily improve the error bound.
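To make the role of $\alpha$ in Theorem 5 concrete, the right-hand side of the bound can be evaluated numerically. The sketch below is an illustration only: the function name `theorem5_bound` and all numeric values ($n$, $m$, $c$, $b$, $\delta$, $\lambda$) are assumed for the example, and $\lg$ is taken to be the natural logarithm, which is what solving the exponential tail bound for $t$ yields.

```python
import math

def theorem5_bound(n, m, alpha, c, b, delta, lam, target_opt_risk=0.0):
    """Right-hand side of the Theorem 5 bound; lg is taken as the natural log.

    target_opt_risk stands for R^t(h^0), which is unknown in practice and is
    set to 0 by default in this sketch.
    """
    complexity = c * math.log(2 * (n + m) / c) - math.log(delta)
    deviation = b * math.sqrt((m + n * alpha ** 2) * complexity / (2 * m * n))
    return 2 * deviation / (1 + alpha) + target_opt_risk + alpha * lam / (1 + alpha)

# Assumed setting: few target examples, many source examples.
alphas = (0.1, 1.0, 5.0, 50.0, 500.0)
bounds = {a: theorem5_bound(n=100, m=10000, alpha=a, c=10, b=1.0,
                            delta=0.05, lam=0.2)
          for a in alphas}
```

In this assumed setting the bound first shrinks as $\alpha$ grows and then rises again toward $\lambda$ plus a term that depends only on $m$, which matches the remark that changing $\alpha$ in either direction does not necessarily improve the bound.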
6. EXPERIMENTAL EVALUATION
As a general domain adaptation algorithm, DALR can be applied to a wide range of applications. In this section, we apply DALR to an important application, ranking, to demonstrate the properties and effectiveness of DALR. Under the machine-learned ranking problem setting, each query-document pair is represented as a feature vector x, and the output of the ranking function y is a real value denoting the relevance between the query and the document. Hence, in this study, we treat ranking as a regression problem. Note that most domain adaptation approaches in the literature are limited to classification. DALR does not have this limitation, since it is applied to classification and regression in the same way.

For the base learner, we use gradient boosting trees [16], which have shown great potential for ranking [37]. We use the popular L2 norm as the loss function. For the relation function, there are many choices with different distance functions. In this study, we propose the following two designs. The first design provides soft weights to measure the relation between two labels based on Euclidean distance,

$$r(y^t, y^s) = \exp(-(y^t - y^s)^2). \tag{29}$$
The second one provides 0/1 weights to measure the relation based on a threshold,

$$z_i = \begin{cases} 0 & \text{if } |y^t - y^s| > t, \\ 1 & \text{if } |y^t - y^s| \le t, \end{cases} \tag{30}$$

where t denotes a threshold. Note that this relation function can also be formulated as an exponential function such that $r(y^t, y^s) = \exp(I[|y^t - y^s| > t](-\infty) + I[|y^t - y^s| \le t] \cdot 0)$, where I denotes an indicator function. We use DALR-L2 to denote the algorithm with the relation function in Eq. (29) and DALR-Bi to denote the algorithm with the relation function in Eq. (30).

We compare our algorithms with the following three approaches. The first one is the baseline approach based on the target domain data only (called Base). The second one is an intuitive domain adaptation approach, which applies the base learner to the Source Domain data only (referred to as SD). The third one is an effective domain adaptation algorithm based on the Optimal Combination of source domain data and target domain data (referred to as OC) [5, 34]. This approach assumes that there exists an optimal convex combination of the target domain data and source domain data distributions to learn the target function. As an effective algorithm that also does not depend on specific loss functions, it is a suitable comparison for the DALR algorithms.

Table 1: Size of domain training data

Data set    Number of examples
D0          1227095
D1          6109
D2          6007
D3          6001
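The two relation functions of Eq. (29) and Eq. (30) can be sketched directly in code. This is an illustrative sketch only: the function names `relation_l2` and `relation_binary` are hypothetical, and the arguments are assumed to be scalar labels, with `y_t` standing for the target-domain label (or current target output) paired with the source label `y_s`.

```python
import math


def relation_l2(y_t, y_s):
    """Soft weight of Eq. (29): exp(-(y_t - y_s)^2), always in (0, 1]."""
    return math.exp(-((y_t - y_s) ** 2))


def relation_binary(y_t, y_s, threshold):
    """0/1 weight of Eq. (30): 1 if the labels differ by at most `threshold`."""
    return 1.0 if abs(y_t - y_s) <= threshold else 0.0


if __name__ == "__main__":
    # Relevance grades in this study take values in {0, 1, 2, 3, 4}.
    for y_t, y_s in [(2, 2), (3, 2), (4, 0)]:
        soft = relation_l2(y_t, y_s)
        hard = relation_binary(y_t, y_s, threshold=1)
        print(f"y_t={y_t} y_s={y_s} soft={soft:.4f} hard={hard}")
```

The soft weight decays smoothly as the two labels move apart, whereas the thresholded weight discards a source example entirely once the gap exceeds t.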
6.1 Data and Experimental Setting

We use data sets from a search engine. The training examples are labeled using five values, {0, 1, 2, 3, 4}, representing five levels of relevance. We use four domains corresponding to four different countries/languages. The source domain D0 is large, including about 1.2 million training examples. The target domain D1 has about 6,000 training examples, representing a typical situation in which labeled examples are scarce. There are two more data sets from target domains D2 and D3. We focus on D1 for a detailed study of the properties of the DALR algorithm, while using the other domain data for comparing different algorithms. Table 1 summarizes the size of the data sets. We use Mean Squared Error (MSE) as the evaluation metric, since it is consistent with the L2-norm loss function. Six-fold cross validation is used for testing. For the initial model, we use the model trained on the target domain data.

The constant α in the relation function controls the relative weight of source domain data against the target domain data, as Theorem 5 suggests. We test the DALR-L2 and DALR-Bi algorithms on D0 (source domain) and D1 (target domain) with different α values, which provide different relative weights of source domain data against the total weight of all data, as shown on the X axis in Figure 1. Figure 1 shows that α is not monotonically related to the performance. This is consistent with Theorem 5. From Figure 1, we observe that for both algorithms, when the weight of the source data equals the weight of the target data (the relative weight is 0.5), we achieve the best performance (lowest MSE). In all the following experiments, we use this optimal α value.
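The evaluation protocol described above (six-fold cross validation scored by MSE) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the experiments: all function names are hypothetical, `fit_mean` is a trivial stand-in for the gradient boosting base learner, and the data set is assumed to have at least k examples. The mapping w = α/(1+α) between α and the relative weight of the source data is one natural reading of Eq. (26) and of the statement that equal weights give a relative weight of 0.5; the exact normalization is an assumption.

```python
import random


def mean_squared_error(y_true, y_pred):
    """Average squared difference between labels and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def k_fold_indices(n_examples, k=6, seed=0):
    """Shuffle the example indices and deal them into k disjoint folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return [indices[fold::k] for fold in range(k)]


def cross_validated_mse(features, labels, fit, k=6, seed=0):
    """Mean held-out MSE over k folds; `fit` must return a predict(x) callable."""
    scores = []
    for held_out in k_fold_indices(len(labels), k, seed):
        held_out_set = set(held_out)
        train = [i for i in range(len(labels)) if i not in held_out_set]
        predict = fit([features[i] for i in train], [labels[i] for i in train])
        predictions = [predict(features[i]) for i in held_out]
        scores.append(mean_squared_error([labels[i] for i in held_out], predictions))
    return sum(scores) / len(scores)


def alpha_from_source_weight(w):
    """Assumed mapping w = alpha / (1 + alpha), inverted to alpha = w / (1 - w)."""
    if not 0.0 <= w < 1.0:
        raise ValueError("relative source weight must lie in [0, 1)")
    return w / (1.0 - w)


def fit_mean(train_features, train_labels):
    """Hypothetical stand-in learner: always predicts the training mean."""
    mean = sum(train_labels) / len(train_labels)
    return lambda x: mean


if __name__ == "__main__":
    rng = random.Random(1)
    labels = [rng.choice([0, 1, 2, 3, 4]) for _ in range(60)]
    features = [[float(y)] for y in labels]
    print("six-fold MSE of the mean predictor:",
          cross_validated_mse(features, labels, fit_mean))
    print("alpha for relative source weight 0.5:", alpha_from_source_weight(0.5))
```

Under the assumed mapping, the best-performing relative weight of 0.5 corresponds to α = 1, i.e., the source and target risks in Eq. (26) are weighted equally.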
Figure 1: Effect of relative weight of source domain data against the total weight of all data. (Plot of MSE versus weight of source data for DALR-L2 and DALR-Bi.)

6.2 Experiments with Different Size of Source Domain Data

Since domain adaptation aims to use sufficient labeled data from the source domain to help a target domain with limited labeled data, it is critical to see how different sizes of source data affect the performance of the algorithms. We randomly sample data from the source domain D0 with different sizes, from 50K to 200K. We use D1 as the target domain to test the five algorithms. Figure 2 shows how different sizes of source domain data affect MSE for all five algorithms. The MSE of the baseline does not change, since it only uses the target domain data. For the other four algorithms, the MSE decreases when the size increases. This is consistent with Theorem 4 and Theorem 5: when the source domain data size m becomes larger, the estimation becomes closer to the optimal solution. At the same time, we observe that when the size is larger than 100K, the performances of the algorithms tend to be "saturated".

Finally, we observe from Figure 2 that for all different sizes, the DALR algorithms perform better than the other algorithms. Base performs worst due to the limited size of the target domain data. SD does not perform well since it violates the basic assumption of supervised learning that training data and test data should have the same distribution. A possible reason that the OC approach performs worse than the DALR algorithms is that OC's assumption about the optimal convex combination of the target domain data and source domain data distributions does not hold for this data set. On the other hand, DALR makes an easily satisfied assumption based on the label relation.

Figure 2: Effect of different sizes of the source data on MSE of the five algorithms shows that the performance is positively correlated with the size of the source data. DALR performs best under different sizes. (Plot of MSE versus size of source domain data, 50K to 200K, for Base, SD, OC, DALR-L2, and DALR-Bi.)

Figure 3: Effect of different levels of noise of the source domain data on MSE of the five algorithms shows that the DALR algorithms are most robust to noise. (Plot of MSE versus percentage of noise in source domain, 10% to 50%, for Base, SD, OC, DALR-L2, and DALR-Bi.)
6.3 Experiments with Robustness to Noise

In real applications, the source domain data may contain noise, and it is always desirable for a domain adaptation algorithm to be robust to noise. We test the robustness of all five algorithms by intentionally adding noise to the source domain data D0. We add noise as follows. A training example is randomly selected, and its label value is randomly replaced by a value that is not the current one. We change different percentages of the source domain data to create different levels of noise. Figure 3 shows how different levels of noise in the source domain data affect the performances of the five algorithms. The MSE of the baseline does not change, since it only uses the target domain data. For the SD and OC algorithms, when the noise increases, their performances drop significantly. This is consistent with the intuition that noise can be transferred from the source domain to the target domain and hurt the accuracy of the learned function. On the other hand, it is surprising to observe that the DALR algorithms are affected by noise only very slightly. Even when 50% of the training examples are given noisy labels, the performances of the DALR algorithms do not change significantly. The reason that the DALR algorithms are very robust to noise is that DALR is capable of identifying the "usefulness" of the training examples through the label-relation function and hence automatically "excludes" the noisy examples. Since the size of the source domain data is large (about 1.2 million), even after the DALR algorithms "exclude" 50% noisy training examples, the rest of the data is enough for the DALR algorithms to provide good performance. In summary, DALR provides much better performance for noisy source data due to its high robustness to noise.
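The noise procedure described above can be sketched as follows. This is an illustrative sketch rather than the exact experimental code: the function name `inject_label_noise` is hypothetical, and it assumes that corrupted examples are drawn without replacement and that the replacement label is chosen uniformly among the other relevance levels.

```python
import random

RELEVANCE_LEVELS = (0, 1, 2, 3, 4)


def inject_label_noise(labels, noise_fraction, levels=RELEVANCE_LEVELS, seed=0):
    """Return a copy of `labels` with a given fraction replaced by wrong values.

    Each selected example receives a label drawn uniformly from the levels
    that differ from its current label, so every selected label changes.
    """
    if not 0.0 <= noise_fraction <= 1.0:
        raise ValueError("noise_fraction must lie in [0, 1]")
    rng = random.Random(seed)
    noisy = list(labels)
    n_noisy = round(noise_fraction * len(noisy))
    # Assumption: examples to corrupt are sampled without replacement.
    for i in rng.sample(range(len(noisy)), n_noisy):
        alternatives = [value for value in levels if value != noisy[i]]
        noisy[i] = rng.choice(alternatives)
    return noisy


if __name__ == "__main__":
    clean = [i % 5 for i in range(20)]
    for fraction in (0.1, 0.3, 0.5):
        corrupted = inject_label_noise(clean, fraction, seed=7)
        changed = sum(a != b for a, b in zip(clean, corrupted))
        print(f"noise={fraction:.0%} changed={changed}")
```

Because a corrupted label always differs from the original one, the requested percentage equals the fraction of source labels that are actually wrong.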
6.4 Experiments with Different Domains

Finally, we compare the DALR algorithms with other algorithms on different target domains. By using different target domains, we have different distances between the target domain and the source domain. Based on prior knowledge about the languages/countries associated with these domains, the difference between D1 and D0 is the smallest. From Figure 4, we observe that the DALR algorithms provide the best performance for all three target domains. For the two target domains D2 and D3, which differ more from the source domain D0, the DALR algorithms show larger improvements over the SD and OC approaches. This shows the potential of DALR to deal with more difficult problems, since the domain adaptation task usually becomes harder when two domains are more different from each other.

Figure 4: Comparing the DALR algorithms with other algorithms on different target domains shows that the DALR algorithms perform best on different target domains. (Plot of MSE for Base, SD, OC, DALR-L2, and DALR-Bi on target domains D1, D2, and D3.)

7. CONCLUSIONS AND FUTURE WORK

In this paper, based on the concept of the label-relation function, we propose a novel domain adaptation framework (DTRM). An iterative algorithm (DALR) derived under this framework updates the target hypothesis function and the outputs for the source domain until convergence. The algorithm is applicable to a wide range of applications because it does not rely on formulating data distribution differences among the domains for transferring knowledge. We can easily extend it to make use of labeled data from multiple source domains. It can work with any base learner in classification or regression settings. We hope that this flexibility will encourage its wider use. We prove that the objective function is non-increasing under the updating rules, and hence the convergence of the algorithm is guaranteed; we also provide a theoretical analysis of the bounds on the risk for domain adaptation. Our extensive experimental results show that the DALR algorithms consistently, robustly and effectively use the source domain data to improve the accuracy of models on the target domain under various conditions: different sizes of source domain data, different noise levels of source domain data, and different difficulty levels of target domain data.

There are a number of interesting directions for future work. For example, α is currently set via empirical estimation, and we are investigating how to learn it on the fly. We also found that the DALR-Bi version of the algorithm works slightly better than DALR-L2, and we would like to investigate the reasons and formulate guidelines for choosing the best label-relation function.

8. ACKNOWLEDGEMENT

The authors gratefully acknowledge the insightful comments of Dr. Byron Dom, which allowed us to significantly improve our paper.
9. REFERENCES
[1] R. Ando and T. Zhang. A high-performance semi-supervised learning method for text chunking. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 1–9. Association for Computational Linguistics Morristown, NJ, USA, 2005.
[2] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In Advances in Neural Information Processing Systems: Proceedings of the 2006 Conference, page 41. MIT Press, 2007.
[3] A. Argyriou, C. Micchelli, M. Pontil, and Y. Ying. A spectral regularization framework for multi-task structure learning. Advances in Neural Information Processing Systems, 20, 2008.
[4] S. Bickel, M. Brückner, and T. Scheffer. Discriminative learning for differing training and test distributions. In Proceedings of the 24th international conference on Machine learning, pages 81–88. ACM New York, NY, USA, 2007.
[5] J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. Wortman. Learning bounds for domain adaptation. Advances in Neural Information Processing Systems, 20, 2008.
[6] J. Blitzer, R. McDonald, and F. Pereira. Domain adaptation with structural correspondence learning. In Proceedings of the Empirical Methods in Natural Language Processing (EMNLP), 2006.
[7] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the eleventh annual conference on Computational learning theory, pages 92–100. ACM New York, NY, USA, 1998.
[8] E. Bonilla, K. Chai, and C. Williams. Multi-task gaussian process prediction. Advances in Neural Information Processing Systems, 20:153–160.
[9] W. Dai, G. Xue, Q. Yang, and Y. Yu. Co-clustering based classification for out-of-domain documents. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 210–219. ACM New York, NY, USA, 2007.
[10] W. Dai, Q. Yang, G. Xue, and Y. Yu. Boosting for transfer learning. In Proceedings of the 24th international conference on Machine learning, pages 193–200. ACM New York, NY, USA, 2007.
[11] H. Daumé. Frustratingly easy domain adaptation. In Annual meeting-association for computational linguistics, volume 45, page 256, 2007.
[12] H. Daumé. Cross-task knowledge-constrained self training. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 680–688, Honolulu, Hawaii, 2008. Association for Computational Linguistics.
[13] H. Daumé III and D. Marcu. Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, 26:101–126, 2006.
[14] J. Davis and P. Domingos. Deep transfer via second-order markov logic. In AAAI Workshop: Transfer Learning for Complex Tasks, 2008.
[15] T. Evgeniou and M. Pontil. Regularized multi-task learning. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 109–117. ACM New York, NY, USA, 2004.
[16] J. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.
[17] J. Heckman. Sample selection bias as a specification error. Econometrica: Journal of the econometric society, pages 153–161, 1979.
[18] J. Huang, A. Smola, A. Gretton, K. Borgwardt, and B. Schölkopf. Correcting sample selection bias by unlabeled data. Advances in neural information processing systems, 19:601, 2007.
[19] N. Japkowicz and S. Stephen. The class imbalance problem: A systematic study. Intelligent Data Analysis, 6(5):429–449, 2002.
[20] J. Jiang. A Literature Survey on Domain Adaptation of Statistical Classifiers. 2007.
[21] J. Jiang and C. Zhai. Instance weighting for domain adaptation in NLP. In Annual meeting-association for computational linguistics, volume 45, page 264, 2007.
[22] T. Joachims. Transductive inference for text classification using support vector machines. In Sixteenth International Conference on Machine Learning, 1999.
[23] N. Lawrence and J. Platt. Learning to learn with the informative vector machine. In Proceedings of the twenty-first international conference on Machine learning. ACM New York, NY, USA, 2004.
[24] S. Lee, V. Chatalbashev, D. Vickrey, and D. Koller. Learning a meta-level prior for feature relevance from multiple related tasks. In Proceedings of the 24th international conference on Machine learning, pages 489–496. ACM New York, NY, USA, 2007.
[25] X. Liao, Y. Xue, and L. Carin. Logistic regression with an auxiliary data source. In MACHINE LEARNING-INTERNATIONAL WORKSHOP THEN CONFERENCE-, volume 22, page 505, 2005.
[26] X. Ling, W. Dai, G.-R. Xue, Q. Yang, and Y. Yu. Spectral domain-transfer learning. In KDD '08: Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 488–496, New York, NY, USA, 2008. ACM.
[27] P. Luo, F. Zhuang, H. Xiong, Y. Xiong, and Q. He. Transfer learning from multiple source domains via consensus regularization. In CIKM '08: Proceeding of the 17th ACM conference on Information and knowledge management, pages 103–112, New York, NY, USA, 2008. ACM.
[28] L. Mihalkova, T. Huynh, and R. Mooney. Mapping and revising markov logic networks for transfer learning. In Proceedings of the national conference on artificial intelligence, volume 22, page 608, 2007.
[29] L. Mihalkova and R. Mooney. Transfer learning by mapping with minimal target data. In Proceedings of the AAAI-08 Workshop on Transfer Learning for Complex Tasks, 2008.
[30] K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine learning, 39(2):103–134, 2000.
[31] S. J. Pan and Q. Yang. A survey on transfer learning. Technical Report HKUST-CS08-08, Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China, November 2008.
[32] R. Raina, A. Battle, H. Lee, B. Packer, and A. Ng. Self-taught learning: Transfer learning from unlabeled data. In Proceedings of the 24th international conference on Machine learning, pages 759–766. ACM New York, NY, USA, 2007.
[33] A. Schwaighofer, V. Tresp, and K. Yu. Learning Gaussian process kernels via hierarchical Bayes. Advances in Neural Information Processing Systems, 17:1209–1216, 2005.
[34] G. Schweikert, C. Widmer, B. Schölkopf, and G. Rätsch. An empirical analysis of domain adaptation algorithms for genomic sequence analysis. In NIPS, pages 1433–1440, 2008.
[35] H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.
[36] M. Sugiyama, S. Nakajima, H. Kashima, P. von Bunau, and M. Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. Advances in Neural Information Processing Systems, 20, 2008.
[37] Z. Zheng, K. Chen, G. Sun, and H. Zha. A regression framework for learning ranking functions using relative relevance judgments. In SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 287–294, New York, NY, USA, 2007. ACM.
[38] X. Zhu. Semi-supervised learning literature survey. Computer Science, University of Wisconsin-Madison, 2006.