2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, August 20-24, 2018

Multi-source Domain Adaptation for Face Recognition

Haiyang Yi1, Zhi Xu1,†, Yimin Wen1,†, Zhigang Fan2
1 Guangxi Colleges and Universities Key Laboratory of Intelligent Processing of Computer Images and Graphics, Guilin University of Electronic Technology, Guilin, Guangxi, China
2 Zebra Technologies Corporation, Shanghai, China
E-Mail: [email protected], †[email protected], †[email protected], [email protected]
Abstract—For transfer learning, many research works have demonstrated that effective use of information from multiple source domains can improve classification performance. In this paper, we propose a method named Targetize Multi-source Domain Bridged by Common Subspace (TMSD) for face recognition, which transfers rich supervision knowledge from more than one labeled source domain to the unlabeled target domain. Specifically, a common subspace is learnt for several domains by keeping the maximum total correlation. In this way, the discrepancy between domains is reduced, and the structures of both the source and target domains are well preserved for classification. In the common subspace, each sample projected from the source domains is sparsely represented as a linear combination of several samples projected from the target domain, so that samples projected from different domains are well interlaced. Then, in the original image space, each source domain image can be represented as a linear combination of its neighbors in the target domain. Finally, a discriminant subspace can be obtained from the targetized multi-source domain images using a supervised learning algorithm. The experimental results illustrate the superiority of TMSD over competitive methods.

Keywords—domain adaptation; multi-source transfer learning; common subspace learning; face recognition.
I. INTRODUCTION
Many traditional machine learning algorithms, such as Fisher's Linear Discriminant (FLD) analysis [1], the Back Propagation (BP) neural network [2], and the Support Vector Machine (SVM) [3], assume that the test samples are drawn from the same distribution as the training samples. However, in many real applications this is not the case. When training data and testing data follow different distributions, model performance degrades seriously, or the model may even fail entirely. For instance, a face recognition model trained on frontal images can hardly be generalized directly to recognize non-frontal images. One intuitive solution is to collect sufficient labeled non-frontal images; however, collecting and annotating enough images is cumbersome and costly in practice. Domain adaptation was proposed to address this problem [4]. Domain adaptation is a branch of transfer learning [7] and has been applied in many fields, e.g., natural language processing [4], computer vision [9], and
speech recognition [11], etc. Theoretically, the goal of domain adaptation is to transfer the rich knowledge in a source domain to a different but related target domain. According to [9], domain adaptation can be divided into two categories: supervised and unsupervised. In supervised domain adaptation there are a few labeled data in the target domain, while in unsupervised domain adaptation all the data in the target domain are unlabeled. In this paper, we mainly focus on the unsupervised domain adaptation problem.
For unsupervised domain adaptation, most methods concentrate on single-source domain adaptation. A useful strategy is to diminish the disparity between domains [4, 13-17] by exploring the commonality between the source and target domains, e.g., a common feature representation or a common subspace for both domains. In [18], a latent common space was obtained by finding a transform that minimizes the discrepancy between the marginal distributions of the source and target domains while preserving the data structure of the original spaces.
The application of domain adaptation to face recognition has been discussed in [14, 19-21]. One type of these methods is based on dictionary learning [19-21], and another is based on subspaces [14, 22]. Dictionary learning can effectively model pose, illumination, and facial expression information, and a test sample in the target domain is collaboratively represented by the atoms of the learned dictionary. In [19], Qiu et al. proposed compositional dictionaries for domain-adaptive face recognition to compensate for the transformation of faces due to changes in viewpoint, illumination, and resolution. For the common subspace approach, Ho et al. [20] demonstrated that domain adaptation techniques are feasible for recognizing faces when the images collected from the source and target domains are under varying degrees of illumination, expression, blur, and alignment. In [22], Banerjee proposed a Bag-of-Words (BOW) based approach for face recognition combined with domain adaptation, to overcome the difficulty of face recognition under degraded conditions.
Most current methods only use the common characteristics of the source and target domains, ignoring the specific knowledge of the target domain that is beneficial to the tasks in the target domain. Kan et al. [14] projected the source and target domains into a common subspace to obtain their common characteristics, which
reduces the disparity between the source and target domains while the specific structure of the target domain is kept.
In practice, one could obtain more than one source domain for training, and many experimental results have illustrated that making effective use of multi-source data improves model performance [23]. In [14], a common subspace of a single source domain and the target domain was obtained by transformation matrices. For multi-source domains, however, this approach can only obtain a common subspace for each pair of source and target domains, and cannot find a common subspace for all domains. Moreover, the data distribution of each source domain is generally different, so single-source domain adaptation methods cannot be applied directly. In addition, the usual common subspace methods only capture the commonalities between different domains and ignore the specific information of the target domain. How can we transfer from multiple sources while preserving the discriminative information of the target domain?
Inspired by the work of Kan et al. [14], in this paper we propose a method called Targetize Multi-source Domain Bridged by Common Subspace (TMSD). Specifically, a common subspace is first learnt for the source domains and the target domain by keeping the maximum total correlation. In this way, the discrepancy between each source domain and the target domain is reduced, while the structures of all domains are well preserved. Then, in the common subspace, the samples of each source domain are sparsely represented by linear combinations of their target domain neighbors. Given the sparse reconstruction coefficients for each source domain, we can generate a "virtual labeled target domain" whose distribution is similar to that of the target domain. Finally, a discriminative model can be learnt for the target domain in a supervised way.
Our contributions are as follows: (1) a novel unsupervised domain adaptation method is proposed for multi-source domain adaptation, together with its theoretical derivation; (2) an optimization approach is proposed for solving the TMSD problem; (3) the effectiveness of TMSD is verified through experiments.
The rest of this paper is organized as follows. In Section II we present the details of TMSD. In Section III we experimentally evaluate TMSD for face recognition, and finally we draw conclusions in Section IV.
II. TARGETIZE MULTI-SOURCE DOMAINS

We denote the data matrix of the $i$-th source domain as $X_i = [x_1^i, x_2^i, \ldots, x_{n_i}^i] \in \mathbb{R}^{d_i \times n_i}$, $i = 1, 2, \ldots, v$, where $v$ denotes the number of source domains, and $n_i$ and $d_i$ respectively denote the number of training samples and the feature dimension of the $i$-th source domain, with class labels $y_i = [y_1^i, y_2^i, \ldots, y_{n_i}^i]$, in which $y_j^i$ is the class label of the $j$-th sample in the $i$-th source domain and $x_j^i \in \mathbb{R}^{d_i}$ is the feature representation of the $j$-th sample of $X_i$. Similarly, the data matrix of samples from the target domain is denoted as $X_{v+1} = [x_1^{v+1}, x_2^{v+1}, \ldots, x_{n_{v+1}}^{v+1}] \in \mathbb{R}^{d_{v+1} \times n_{v+1}}$, in which $n_{v+1}$, $d_{v+1}$, and $x_j^{v+1} \in \mathbb{R}^{d_{v+1}}$ have analogous meanings for the target domain. Next we present the details of the proposed TMSD and its optimization. This part contains three aspects: common subspace learning; targetizing the multi-source domain in image space, supervised model learning, and testing; and optimization.

A. Common subspace learning

One important component of the proposed method is finding a common subspace into which the source domains and the target domain are all projected through different transformations. The common subspace should satisfy two requirements: (1) in the common subspace, each source domain and the target domain are sufficiently interlaced, so as to reduce the discrepancy between them; (2) the structures of all domains should be well preserved, in order to keep enough discriminant information. To meet these two requirements, sparse reconstruction and total correlation are used as the constraint conditions for finding the common subspace.

Since images from different domains may have a large discrepancy, to project all domains into a common subspace we propose to learn multiple projection matrices corresponding to the different domains, denoted as $W_1, W_2, \ldots, W_v, W_{v+1}$, all with the same number of columns. The projected samples of the $i$-th source domain and of the target domain in the common subspace are denoted as $Z_i = [z_1^i, z_2^i, \ldots, z_{n_i}^i]$ and $Z_{v+1} = [z_1^{v+1}, z_2^{v+1}, \ldots, z_{n_{v+1}}^{v+1}]$ respectively, where $Z_i = W_i^T X_i$ and $Z_{v+1} = W_{v+1}^T X_{v+1}$.
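To make the notation concrete, the following minimal NumPy sketch sets up toy data matrices $X_i$, random projection matrices $W_i$ with a shared column size, and the projected samples $Z_i = W_i^T X_i$; all names and dimensions here are illustrative, not values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: v = 2 source domains plus the target domain (index v+1), each with
# its own feature dimension d_i and sample count n_i (all values are made up).
X = [rng.standard_normal((100, 60)),    # X_1 in R^{d_1 x n_1}
     rng.standard_normal((120, 80)),    # X_2 in R^{d_2 x n_2}
     rng.standard_normal((90, 150))]    # X_3 = X_{v+1}: the target domain
X = [Xi - Xi.mean(axis=1, keepdims=True) for Xi in X]   # zero-mean, as assumed later in Eq. (3)

k = 30                                                   # shared column size of W_1, ..., W_{v+1}
W = [rng.standard_normal((Xi.shape[0], k)) for Xi in X]  # projection matrices (to be learned)

Z = [Wi.T @ Xi for Wi, Xi in zip(W, X)]                  # Z_i = W_i^T X_i in the common subspace
print([Zi.shape for Zi in Z])                            # [(30, 60), (30, 80), (30, 150)]
```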
Sparse Reconstruction: Sparse representation, or sparse coding [25], has been widely used for reconstruction and classification. In the common subspace, each source domain sample should be reconstructable from only a few adjacent samples of the target domain:

$$[V_{s_i}^*, W_i^*, W_{v+1}^*] = \arg\min_{V_{s_i}, W_i, W_{v+1}} \left\| Z_i - Z_{v+1} V_{s_i} \right\|_F^2 = \arg\min_{V_{s_i}, W_i, W_{v+1}} \left\| W_i^T X_i - W_{v+1}^T X_{v+1} V_{s_i} \right\|_F^2, \tag{1}$$
$$\text{s.t. } \|v_j^{s_i}\|_0 \le \tau, \; i = 1, 2, \ldots, v; \; j = 1, 2, \ldots, n_i,$$

where $V_{s_i}^* = [v_1^{s_i*}, v_2^{s_i*}, \ldots, v_{n_i}^{s_i*}]$ are the sparse reconstruction coefficients of the samples from the $i$-th source domain using the target domain samples.

Considering the differences between source domains, in the common subspace each target domain sample is independently reconstructed using only several neighbors from the same source domain:

$$[V_{t_i}^*, W_i^*, W_{v+1}^*] = \arg\min_{V_{t_i}, W_i, W_{v+1}} \left\| Z_{v+1} - Z_i V_{t_i} \right\|_F^2 = \arg\min_{V_{t_i}, W_i, W_{v+1}} \left\| W_{v+1}^T X_{v+1} - W_i^T X_i V_{t_i} \right\|_F^2, \tag{2}$$
$$\text{s.t. } \|v_j^{t_i}\|_0 \le \tau, \; i = 1, 2, \ldots, v; \; j = 1, 2, \ldots, n_{v+1},$$

where $V_{t_i}^* = [v_1^{t_i*}, v_2^{t_i*}, \ldots, v_{n_{v+1}}^{t_i*}]$ are the reconstruction coefficients of the target domain samples using the $i$-th source domain samples. In Eqs. (1)-(2), $\tau$ is the parameter that controls the sparsity, i.e., the number of samples used for the reconstruction.
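The $\ell_0$-constrained problems in Eqs. (1)-(2) are solved with LARS in Section II-C; purely as an illustration of the constraint, the hypothetical helper below solves the same per-sample reconstruction with scikit-learn's Orthogonal Matching Pursuit, a greedy alternative that likewise limits the number of active atoms to $\tau$.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def sparse_reconstruct(Z_query, Z_dict, tau):
    """Reconstruct every column of Z_query from at most tau columns of Z_dict,
    i.e. the ||v||_0 <= tau constraint of Eqs. (1)-(2) in the common subspace."""
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=tau, fit_intercept=False)
    V = np.zeros((Z_dict.shape[1], Z_query.shape[1]))
    for j in range(Z_query.shape[1]):
        omp.fit(Z_dict, Z_query[:, j])   # dictionary atoms = projected samples of the other domain
        V[:, j] = omp.coef_
    return V

# V_si = sparse_reconstruct(Z[i], Z[-1], tau=5)   # Eq. (1): source i reconstructed from the target
# V_ti = sparse_reconstruct(Z[-1], Z[i], tau=5)   # Eq. (2): target reconstructed from source i
```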
Total Correlation: In the work [14], the structures of the source domain and the target domain are preserved simply by maximizing the variance of each domain. This yields a common subspace only for a single source domain and the target domain, and does not fit the case of multiple source domains. To overcome this deficiency, we instead maximize the total correlation [21]. In this way, the structures of all domains are well preserved, so as to keep as much information as possible for discrimination. The total correlation in the common subspace is maximized as follows:

$$\max_{W_1, \ldots, W_v, W_{v+1}} \sum_{i<j}^{v+1} \mathrm{Tr}\left( W_i^T X_i X_j^T W_j + W_i^T X_i X_i^T W_i \right), \tag{3}$$

where $\mathrm{Tr}(\cdot)$ is the trace of a matrix, and the data matrices $X_i$ and $X_{v+1}$ are assumed to have zero mean. In addition, the maximized total correlation terms are also necessary to make the sparse reconstruction feasible; without them, all the samples from all domains may converge together.
According to Eqs. (1)-(3), we propose the objective function

$$[W_1^*, \ldots, W_v^*, W_{v+1}^*, V_{s_i}^*, V_{t_i}^*] = \arg\max_{W_1, \ldots, W_v, W_{v+1}, V_{s_i}, V_{t_i}} \frac{\sum\limits_{i<j}^{v+1} \mathrm{Tr}\left( \frac{2}{n_i + n_j} W_i^T X_i X_j^T W_j + \frac{1}{n_i} W_i^T X_i X_i^T W_i \right)}{\sum\limits_{i=1}^{v} \left( \frac{1}{n_i} \left\| W_i^T X_i - W_{v+1}^T X_{v+1} V_{s_i} \right\|_F^2 + \frac{1}{n_{v+1}} \left\| W_{v+1}^T X_{v+1} - W_i^T X_i V_{t_i} \right\|_F^2 \right) - \frac{v-1}{n_{v+1}} \left\| W_{v+1}^T X_{v+1} \right\|_F^2}, \tag{4}$$
$$\text{s.t. } \|v_k^{s_i}\|_0 \le \tau, \; k = 1, 2, \ldots, n_i; \quad \|v_l^{t_i}\|_0 \le \tau, \; l = 1, 2, \ldots, n_{v+1}.$$

From this formulation, it is clear that we jointly optimize the common subspace and the sparse reconstruction coefficients. From Eq. (2) we can see that the target domain is reconstructed several times (once per source domain), so the third term of the denominator in Eq. (4) is used to remove this duplication.
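For intuition, the hypothetical helper below evaluates the Eq. (4) ratio for given projections and codes. Note that the cross-correlation terms $X_i X_j^T$ are only defined when the domains have equal, aligned sample counts; this sketch assumes that setting.

```python
import numpy as np

def tmsd_objective(W, X, Vs, Vt):
    """Evaluate the Eq. (4) ratio.  W and X are lists ordered [domain 1, ..., target];
    Vs[i] and Vt[i] are the coefficient matrices V_si and V_ti for source i.
    Cross terms X_i X_j^T assume equal, aligned sample counts across domains."""
    v = len(X) - 1
    n = [Xi.shape[1] for Xi in X]
    num = 0.0
    for i in range(v + 1):
        for j in range(i + 1, v + 1):
            num += np.trace(2.0 / (n[i] + n[j]) * W[i].T @ X[i] @ X[j].T @ W[j]
                            + 1.0 / n[i] * W[i].T @ X[i] @ X[i].T @ W[i])
    den = -(v - 1) / n[-1] * np.linalg.norm(W[-1].T @ X[-1], 'fro') ** 2
    for i in range(v):
        den += (np.linalg.norm(W[i].T @ X[i] - W[-1].T @ X[-1] @ Vs[i], 'fro') ** 2 / n[i]
                + np.linalg.norm(W[-1].T @ X[-1] - W[i].T @ X[i] @ Vt[i], 'fro') ** 2 / n[-1])
    return num / den
```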
B. Targetizing the Multi-source Domain in Image Space and Supervised Model Learning and Testing

After optimizing Eq. (4), we obtain the common subspace via $[W_1^*, \ldots, W_v^*, W_{v+1}^*]$ and $[V_{s_i}^*, V_{t_i}^*]$, $i = 1, 2, \ldots, v$, and then perform targetization in the original image space. The virtual labeled target domain is denoted as $X_{s \to t} = [X_{s_1 \to t}, X_{s_2 \to t}, \ldots, X_{s_v \to t}]$, where $X_{s_i \to t} = X_t V_{s_i}^*$. Any supervised method can then be used to learn a recognition model based on $(X_{s \to t}, y_s)$, where $y_s = [y_1, y_2, \ldots, y_v]$ denotes the sample labels of the source domains. In this paper, the FLD [1] method is employed for supervised feature extraction:

$$W_{fld}^* = \arg\max_{W} \frac{\mathrm{Tr}(W^T S_b W)}{\mathrm{Tr}(W^T S_t W)}. \tag{5}$$

It should be noted that no class label is needed to compute the total scatter matrix $S_t$, so it can be calculated from the target domain. The between-class scatter matrix is the sum of the between-class scatter matrices $S_b$ of the targetized source domains.

For testing, the feature of each gallery image $x_i^g$ is first extracted as $W_{fld}^{*T} x_i^g$. Similarly, each probe image $x^p$ is represented as $W_{fld}^{*T} x^p$. The similarity with each gallery face image is then calculated via the cosine function as $sim(x^p, x_i^g) = \cos(W_{fld}^{*T} x_i^g, W_{fld}^{*T} x^p)$. Finally, the identity of $x^p$ is determined using the Nearest Neighbor classifier.
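A minimal sketch of the targetization and testing steps described above is given below; learning $W_{fld}^*$ itself (Eq. (5)) is not shown (e.g., scikit-learn's LinearDiscriminantAnalysis could serve as a stand-in), and the helper names are hypothetical.

```python
import numpy as np

def targetize(X_t, Vs_list):
    """Section II-B targetization in the original image space:
    X_{s_i -> t} = X_t @ V_si, concatenated over all source domains."""
    return np.hstack([X_t @ Vsi for Vsi in Vs_list])

def identify(W_fld, gallery, gallery_ids, probe):
    """Project gallery and probe images with the learned discriminant projection and
    return the identity of the most cosine-similar gallery face (nearest neighbor)."""
    G = W_fld.T @ gallery                                  # one feature column per gallery image
    p = W_fld.T @ probe                                    # feature of a single probe image
    G = G / np.linalg.norm(G, axis=0, keepdims=True)
    p = p / np.linalg.norm(p)
    return gallery_ids[int(np.argmax(G.T @ p))]
```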
C. Optimization

The optimization problem of Eq. (4) is not convex in all variables, so alternating optimization is used to update the projection matrices $[W_1, \ldots, W_v, W_{v+1}]$ and the sparse reconstruction coefficients $[V_{s_i}, V_{t_i}]$.

Step 1: Fixing $[W_1, \ldots, W_v, W_{v+1}]$, update $[V_{s_i}, V_{t_i}]$.

In Eq. (4), $V_{s_i}$ and $V_{t_i}$ are independent of each other for given $W_1, \ldots, W_v$ and $W_{v+1}$, so they can be optimized independently. Furthermore, $v_1^{s_i}, v_2^{s_i}, \ldots, v_{n_i}^{s_i}$ in Eq. (1) are also independent of each other, which means that each $v_j^{s_i}$ can be solved separately as a lasso-type problem:

$$v_j^{s_i*} = \arg\min_{v_j^{s_i}} \left\| W_i^T x_j^i - W_{v+1}^T X_{v+1} v_j^{s_i} \right\|_F^2, \quad \text{s.t. } \|v_j^{s_i}\|_0 \le \tau, \; i = 1, 2, \ldots, v; \; j = 1, 2, \ldots, n_i. \tag{6}$$

To preserve the diversity of each source domain and to guarantee that each source domain and the target domain are sufficiently interlaced, an additional penalty term is added to Eq. (6) to encourage varied samples to be used for the reconstruction:

$$v_j^{s_i*} = \arg\min_{v_j^{s_i}} \left\| W_i^T x_j^i - W_{v+1}^T X_{v+1} v_j^{s_i} \right\|_F^2 + \lambda \left\| 1 - h_{t_i}^T v_j^{s_i} \right\|_F^2, \quad \text{s.t. } \|v_j^{s_i}\|_0 < \tau, \; i = 1, 2, \ldots, v; \; j = 1, 2, \ldots, n_i. \tag{7}$$

Here, the indicator $h_{t_i} \in \mathbb{R}^{n_{v+1}}$ assigns a weight to each target sample and punishes, via a small weight, those target samples that are overused to reconstruct the source samples. In the experiments, $h_{t_i}$ is initialized as an all-one vector, meaning that all samples are equally available for the subsequent reconstruction. The optimization problem (7) can easily be solved by a forward stepwise regression-like algorithm, i.e., Least Angle Regression (LARS) [26], after being reformulated as

$$v_j^{s_i*} = \arg\min_{v_j^{s_i}} \left\| \hat{z}_j^i - \hat{Z}_{t_i} v_j^{s_i} \right\|_F^2, \quad \text{s.t. } \|v_j^{s_i}\|_0 < \tau, \; i = 1, 2, \ldots, v; \; j = 1, 2, \ldots, n_i, \tag{8}$$

with $\hat{z}_j^i = \begin{bmatrix} W_i^T x_j^i \\ \lambda \end{bmatrix}$ and $\hat{Z}_{t_i} = \begin{bmatrix} W_{v+1}^T X_{v+1} \\ \lambda h_{t_i}^T \end{bmatrix}$.

After obtaining $v_j^{s_i*}$, the target samples selected for the reconstruction are punished by decreasing their weights:

$$h_{t_i} \Leftarrow h_{t_i} - \frac{0.5}{\max(|v_j^{s_i*}|)} |v_j^{s_i*}|, \tag{9}$$

where $|\cdot|$ denotes the element-wise absolute value of the vector $v_j^{s_i*}$.

Similarly, the columns $v_j^{t_i}$ of $V_{t_i}$ are independent of each other and can be optimized separately:

$$v_j^{t_i*} = \arg\min_{v_j^{t_i}} \left\| W_{v+1}^T x_j^{v+1} - W_i^T X_i v_j^{t_i} \right\|_F^2 + \lambda \left\| 1 - h_{s_i}^T v_j^{t_i} \right\|_F^2, \quad \text{s.t. } \|v_j^{t_i}\|_0 \le \tau, \; i = 1, 2, \ldots, v; \; j = 1, 2, \ldots, n_{v+1}. \tag{10}$$

Analogously to Eq. (7), Eq. (10) can be further reformulated as Eq. (11) and solved by LARS:

$$v_j^{t_i*} = \arg\min_{v_j^{t_i}} \left\| \hat{z}_j^{v+1} - \hat{Z}_{s_i} v_j^{t_i} \right\|_F^2, \quad \text{s.t. } \|v_j^{t_i}\|_0 < \tau, \; i = 1, 2, \ldots, v; \; j = 1, 2, \ldots, n_{v+1}, \tag{11}$$

with $\hat{z}_j^{v+1} = \begin{bmatrix} W_{v+1}^T x_j^{v+1} \\ \lambda \end{bmatrix}$ and $\hat{Z}_{s_i} = \begin{bmatrix} W_i^T X_i \\ \lambda h_{s_i}^T \end{bmatrix}$. The indicator $h_{s_i} \in \mathbb{R}^{n_i}$ assigns a weight to each sample of the $i$-th source domain, indicating how often the sample has been used. $h_{s_i}$ is also initialized as an all-one vector and is updated as follows:

$$h_{s_i} \Leftarrow h_{s_i} - \frac{0.5}{\max(|v_j^{t_i*}|)} |v_j^{t_i*}|. \tag{12}$$
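As a concrete illustration of Eqs. (8)-(9) (and, analogously, Eqs. (11)-(12)), the sketch below stacks the penalty row so that the $\lambda\|1 - h_{t_i}^T v\|^2$ term becomes part of an ordinary least-squares residual, runs scikit-learn's LARS with at most $\tau$ active atoms, and then applies the weight update. The function name and the small epsilon guard are illustrative additions, not details given in the paper.

```python
import numpy as np
from sklearn.linear_model import Lars

def update_source_code(Wi, Wt, x_j, X_t, h_t, lam, tau):
    """Solve Eq. (8) for one source sample x_j and apply the Eq. (9) weight update.
    h_t is the 1-D weight vector over the n_{v+1} target samples."""
    z_hat = np.concatenate([Wi.T @ x_j, [lam]])              # [W_i^T x_j^i ; lam]
    Z_hat = np.vstack([Wt.T @ X_t, lam * h_t[None, :]])      # [W_{v+1}^T X_{v+1} ; lam * h_t^T]
    v = Lars(n_nonzero_coefs=tau, fit_intercept=False).fit(Z_hat, z_hat).coef_
    h_t = h_t - 0.5 * np.abs(v) / (np.max(np.abs(v)) + 1e-12)  # Eq. (9): punish the used atoms
    return v, h_t
```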
Step 2: Fixing $[V_{s_i}, V_{t_i}]$, update $[W_1, \ldots, W_v, W_{v+1}]$.

With the sparse reconstruction coefficients fixed, Eq. (4) reduces to

$$[W_1^*, \ldots, W_v^*, W_{v+1}^*] = \arg\max_{W_1, \ldots, W_v, W_{v+1}} \frac{\sum\limits_{i<j}^{v+1} \mathrm{Tr}\left( \frac{2}{n_i + n_j} W_i^T X_i X_j^T W_j + \frac{1}{n_i} W_i^T X_i X_i^T W_i \right)}{\sum\limits_{i=1}^{v} \left( \frac{1}{n_i} \left\| W_i^T X_i - W_{v+1}^T X_{v+1} V_{s_i} \right\|_F^2 + \frac{1}{n_{v+1}} \left\| W_{v+1}^T X_{v+1} - W_i^T X_i V_{t_i} \right\|_F^2 \right) - \frac{v-1}{n_{v+1}} \left\| W_{v+1}^T X_{v+1} \right\|_F^2}. \tag{13}$$

By concatenating all $W_i$ and $W_{v+1}$ into one matrix $W = [W_1^T, W_2^T, \ldots, W_v^T, W_{v+1}^T]^T$, Eq. (13) can be reformulated as the following form with a norm-1 constraint:

$$W^* = \arg\max_{W} \frac{\mathrm{Tr}(W^T S_b W)}{\mathrm{Tr}(W^T S_w W)}, \quad \text{s.t. } \|w_i\|_2 = 1, \tag{14}$$

where $w_i$ denotes the $i$-th column of $W$, and $S_b$ and $S_w$ are defined as below:

$$S_b = \begin{bmatrix} \frac{1}{n_1} X_1 X_1^T & \cdots & \frac{2}{n_1 + n_i} X_1 X_i^T & \cdots & \frac{2}{n_1 + n_{v+1}} X_1 X_{v+1}^T \\ & \ddots & & & \vdots \\ & & \frac{1}{n_i} X_i X_i^T & \cdots & \frac{2}{n_i + n_{v+1}} X_i X_{v+1}^T \\ & & & \ddots & \vdots \\ & & & & \frac{1}{n_{v+1}} X_{v+1} X_{v+1}^T \end{bmatrix}, \tag{15}$$

$$S_w = \begin{bmatrix} X_1 \left( \frac{I}{n_1} + \frac{V_{t_1} V_{t_1}^T}{n_{v+1}} \right) X_1^T & & 0 & & -X_1 \left( \frac{V_{s_1}^T}{n_1} + \frac{V_{t_1}}{n_{v+1}} \right) X_{v+1}^T \\ & \ddots & & & \vdots \\ 0 & & X_i \left( \frac{I}{n_i} + \frac{V_{t_i} V_{t_i}^T}{n_{v+1}} \right) X_i^T & & -X_i \left( \frac{V_{s_i}^T}{n_i} + \frac{V_{t_i}}{n_{v+1}} \right) X_{v+1}^T \\ & & & \ddots & \vdots \\ -X_{v+1} \left( \frac{V_{s_1}}{n_1} + \frac{V_{t_1}^T}{n_{v+1}} \right) X_1^T & \cdots & -X_{v+1} \left( \frac{V_{s_i}}{n_i} + \frac{V_{t_i}^T}{n_{v+1}} \right) X_i^T & \cdots & X_{v+1} \left( \frac{I}{n_{v+1}} + \sum\limits_{i=1}^{v} \frac{V_{s_i} V_{s_i}^T}{n_i} \right) X_{v+1}^T \end{bmatrix}. \tag{16}$$

Here $I$ is an identity matrix. $S_b$ is a block upper triangular matrix, and $S_w$ has a block-arrow structure, with nonzero blocks only on the diagonal and in the last block row and column. Finally, Eq. (14) can be written as

$$W^* = \arg\max_{W} \mathrm{Tr}\left( \frac{W^T S_b W}{W^T S_w W} \right), \quad \text{s.t. } \|w_i\|_2 = 1. \tag{17}$$

Eq. (17) can easily be solved using the generalized eigenvalue decomposition. As the number of source domains increases, the generalized eigenvalue decomposition becomes time consuming.
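One plausible realization of the generalized eigenvalue step for Eq. (17) is sketched below (a ratio-trace approximation of the trace-ratio), assuming SciPy: $S_b$ is symmetrized, since the trace term depends only on its symmetric part, and a small ridge keeps $S_w$ positive definite. Both choices and the function name are assumptions rather than details given in the paper.

```python
import numpy as np
from scipy.linalg import eigh

def solve_projection(Sb, Sw, dim, ridge=1e-6):
    """Approximate Eq. (17) with a generalized eigendecomposition of (Sb, Sw)."""
    Sb_sym = 0.5 * (Sb + Sb.T)                                 # Tr(W^T Sb W) uses only the symmetric part
    Sw_reg = 0.5 * (Sw + Sw.T) + ridge * np.eye(Sw.shape[0])   # keep Sw symmetric positive definite
    evals, evecs = eigh(Sb_sym, Sw_reg)                        # generalized eigenpairs, ascending order
    W = evecs[:, np.argsort(evals)[::-1][:dim]]                # keep the top-`dim` directions
    return W / np.linalg.norm(W, axis=0, keepdims=True)        # unit-norm columns, cf. Eq. (14)
```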
Step 3: Repeat Step 1 and Step 2 until $W_1, \ldots, W_v, W_{v+1}$, $V_{s_i}$ and $V_{t_i}$ converge.

III. EXPERIMENTS