arXiv:1705.01314v2 [cs.CV] 7 May 2017

Learning Cross-Domain Disentangled Deep Representation with Supervision from A Single Domain

Tzu-Chien Fu∗ Academia Sinica Taipei, Taiwan [email protected]

Yen-Cheng Liu∗ National Taiwan University Taipei, Taiwan [email protected]

Sheng-De Wang National Taiwan University Taipei, Taiwan [email protected]

Wei-Chen Chiu National Chiao Tung University Hsinchu, Taiwan [email protected]

Yu-Chiang Frank Wang Academia Sinica Taipei, Taiwan [email protected]

Abstract

The recent progress and development of deep generative models have led to remarkable improvements in research topics in computer vision and machine learning. In this paper, we address the task of cross-domain feature disentanglement. We advance the idea of unsupervised domain adaptation and propose to perform joint feature disentanglement and adaptation. Based on generative adversarial networks, we present a novel deep learning architecture with disentanglement ability, which observes cross-domain image data and derives latent features with the underlying factors (e.g., attributes). As a result, our generative model is able to address cross-domain feature disentanglement with only the (attribute) supervision from the source-domain data (not the target-domain ones). In the experiments, we apply our model for generating and classifying images with particular attributes, and show that satisfactory results can be produced.

1 Introduction

With the advances of deep generative models [3, 9], image synthesis is among the applications of growing interest. In the initial attempts [3, 9], a network-based generator transforms vectors from a latent space into output images. Despite promising results, these early works make little effort to explore the semantic information in the derived latent feature spaces. For example, if one can identify the latent features associated with facial expressions or poses, one can not only synthesize face images of interest but also manipulate the image attribute of interest accordingly. To address the above concern, performing feature disentanglement during the learning of generative models becomes a practical solution. With the goal of discovering the underlying factors of image variants associated with particular attributes of interest, feature disentanglement aims at a latent feature space which factorizes the derived representation into different parts, each properly describing the corresponding information (e.g., identity, pose, and expression of facial images) during the generative process. Many works have been proposed to tackle this task, ranging from unsupervised [1, 5] and semi-supervised [8, 14] to supervised settings [10, 15]. We note that, with supervision of label information (e.g., image attributes) during learning, the corresponding feature can be disentangled, and one can produce the output image of interest accordingly.

∗ indicates equal contribution.

However, learning deep neural networks for feature disentanglement generally requires a large number of annotated data, which can be expensive to collect in practical scenarios. As an alternative, one can utilize pre-collected or existing annotated data as the source-domain training data, and adapt the learned model to handle the target-domain data of interest without observing their ground truth annotation. Consider a source-domain image data set XS containing face images with expression and illumination variations, and a target-domain one XT with pose variants only. If expression is the attribute of interest, our goal is to disentangle and adapt the associated latent feature from XS for describing both XS and XT, so that one can manipulate the output images of a particular expression with different lighting conditions or at different poses (i.e., to synthesize expression image variants in the source and target domains). Inspired by the success of transfer learning [16] and domain adaptation [17], we propose a deep neural network architecture based on generative adversarial networks (GAN) for jointly handling cross-domain data and performing disentanglement in the shared latent space (see Fig. 1 for illustration). As noted above and detailed later in Sect. 3, while our goal is to produce images with preferable attributes in both source and target domains using our network architecture, the learning of the proposed model only observes the attribute information from the source-domain data. In other words, our strategy can be viewed as unsupervised domain adaptation for learning cross-domain disentangled features. Our contributions are two-fold. First, we propose a unified framework for simultaneous feature disentanglement and domain adaptation, which requires label supervision only from the source-domain data. Second, our proposed model allows one to perform classification upon the disentangled attribute feature for cross-domain data. Compared to existing unsupervised domain adaptation techniques for cross-domain classification, we show that our model exhibits satisfactory ability in terms of both disentangling and synthesizing images with the attribute of interest.

2 Related Works

From Image Synthesis to Image-to-Image Translation Utilizing deep neural networks and generative models, research progress on image synthesis has shown impressive results. The network is trained to approximate the data distribution and generative procedure of image data, and becomes capable of synthesizing previously-unseen images that resemble samples from the real data. As for image-to-image translation, [18] approaches this task by applying pairs of images for learning deep models like generative adversarial networks (GAN) [3]. Zhu et al. [23] advance to loosen the limitation on requiring image pairs by constraining the dual domain mapping with a cycle consistency loss; similar ideas can also be found in [7, 22].

Feature Disentanglement for Image Synthesis Learning to disentangle the latent factors of image variants has led to a better understanding of the observed data [10, 15, 8, 14, 1, 5]. Kulkarni et al. [10] propose to learn interpretable and invertible graphics codes when rendering images of 3D models. However, its training process requires a sufficient amount of fully annotated data. Odena et al. [15] augment the GAN architecture with an auxiliary classifier, which enables the synthesized images to be conditioned on interpretable latent factors given during training. Kingma et al. [8] extend the variational autoencoder (VAE) [9] to achieve semi-supervised learning for disentanglement. Chen et al. [1] further tackle this task in an unsupervised manner by maximizing the mutual information between pre-specified latent factors and rendered images, yet the semantic meanings behind the disentangled factors are unknown. Despite the promising progress of the above methods on learning deep disentangled representations, most existing works only focus on handling annotated data collected from the same domain. In real-world applications, this requirement might not be easily fulfilled. This is why we aim at transferring knowledge from one domain (with full annotation) to another, while annotation for data in the latter target domain is not required for realizing joint feature disentanglement and adaptation.

Adaptation Across Visual Domains Domain adaptation aims to alleviate the domain shift by associating the same learning task across data domains. Inspired by the adversarial learning scheme [3], several methods have been proposed recently to learn the mapping of multiple domains into a common feature space. For example, Ganin et al. [2] introduce a domain classifier into a standard CNN framework, with its gradient reversal layer serving as the domain-adaptive feature extractor in the absence of labeled data in the target domain.

Figure 1: Our proposed architecture of Cross-Domain Disentanglement (CDD). The network components E_C, G_C, and D_C are shared by cross-domain data, while those with subscripts S and T are associated with data in the corresponding domain. Note that images from X_T can only be recognized as real/fake due to the lack of ground truth labels l (shown in red).

Tzeng et al. [21] untie the weight-sharing layers between feature extractors across source and target domains. This allows them to learn domain-specific feature embeddings by utilizing a domain adversarial training strategy. Liu et al. [12] adapt two parallel GANs and share the weights of the high-level layers of the networks to simultaneously render corresponding cross-domain images from joint Gaussian noise. This framework is extended in [11], in which VAE and GAN are integrated for enabling the matching between cross-domain images. Inspired by [11], we propose to perform cross-domain feature disentanglement without label supervision from target-domain data. Later in our experiments, we provide qualitative evaluation for image synthesis across different domains; moreover, we show that the derived disentangled features can be applied for classification. This verifies the use of our derived latent feature for identifying underlying image factors of interest, and confirms its robustness in describing image data across different domains.

3 Proposed Method

3.1 Problem Definition and Notation

Let X_S = {x_i^S}_{i=1}^{N_S} denote the set of N_S data samples from the source domain S, where each instance is associated with a ground truth label (or attribute) l_i ∈ L. Similarly, we have X_T = {x_i^T}_{i=1}^{N_T} as the N_T instances observed in the target domain T, while no label information is available for target-domain data during training.

The objective of our proposed model is to perform joint feature disentanglement and domain adaptation, while only label supervision from the source domain is given. As illustrated in Figure 1 and detailed in the following subsections, our proposed model of Cross-Domain Disentanglement (CDD) derives the deep disentangled feature representation z and the disentangled latent factor l for describing the cross-domain data and their labels, respectively. Thus, one can produce X̃_S and X̃_T images as the outputs of preferable latent factors l.
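To make the component layout of Figure 1 concrete, the following is a minimal PyTorch sketch of the domain-specific and shared modules. The 64×64 RGB input size, the 64-dimensional latent code z, the single binary attribute code l, and all layer choices are illustrative assumptions rather than the configuration used in the paper.

```python
import torch
import torch.nn as nn

Z_DIM, L_DIM, IMG_CH = 64, 1, 3   # latent size, number of attribute codes, RGB channels (assumed)

class DomainEncoder(nn.Module):
    """E_S or E_T: domain-specific front end mapping a 64x64 image to a feature map."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(IMG_CH, 32, 4, 2, 1), nn.LeakyReLU(0.2),   # 64 -> 32
            nn.Conv2d(32, 64, 4, 2, 1), nn.LeakyReLU(0.2))       # 32 -> 16
    def forward(self, x):
        return self.net(x)

class SharedEncoder(nn.Module):
    """E_C: shared encoder producing the mean and log-variance of q(z | x)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2),      # 16 -> 8
            nn.Flatten())
        self.mu = nn.Linear(128 * 8 * 8, Z_DIM)
        self.logvar = nn.Linear(128 * 8 * 8, Z_DIM)
    def forward(self, h):
        h = self.conv(h)
        return self.mu(h), self.logvar(h)

class SharedGenerator(nn.Module):
    """G_C: maps the latent code z concatenated with the attribute code l to a shared feature map."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(Z_DIM + L_DIM, 128 * 8 * 8)
    def forward(self, z, l):
        return self.fc(torch.cat([z, l], dim=1)).view(-1, 128, 8, 8)

class DomainGenerator(nn.Module):
    """G_S or G_T: renders the shared feature map as an image in one domain."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),     # 8 -> 16
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),      # 16 -> 32
            nn.ConvTranspose2d(32, IMG_CH, 4, 2, 1), nn.Tanh())  # 32 -> 64
    def forward(self, h):
        return self.net(h)

class DomainDiscriminator(nn.Module):
    """D_S or D_T: domain-specific discriminator front end."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(IMG_CH, 32, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, 2, 1), nn.LeakyReLU(0.2))
    def forward(self, x):
        return self.net(x)

class SharedDiscriminator(nn.Module):
    """D_C: shared head giving a real/fake logit and an attribute logit (the auxiliary classifier)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2), nn.Flatten())
        self.adv = nn.Linear(128 * 8 * 8, 1)
        self.cls = nn.Linear(128 * 8 * 8, L_DIM)
    def forward(self, h):
        h = self.conv(h)
        return self.adv(h), self.cls(h)
```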

3.2 Learning Disentangled Feature Representation in a Single Domain

We start by learning the disentangled feature representation {z, l} from labeled data in the source domain S. As depicted in Figure 1, the upper part of our proposed architecture is designed to address this task, which can be viewed as an integration of the variational autoencoder (VAE) [9] and the auxiliary classifier GAN (AC-GAN) [15].

Compared with the original version of GAN [3], the discriminator D_S + D_C in our CDD not only distinguishes between the synthesized output images X̃_S and the real ones X_S, but also predicts the attribute labels L. Thus, the objective function can be expressed as the summation of an adversarial loss L^S_adv and a disentanglement loss L^S_dis, which are described below:

L^S_adv = E[log(1 − D_C(D_S(X̃_S)))] + E[log(D_C(D_S(X_S)))],
L^S_dis = E[log P(L = l | X_S)] + E[log P(L = l | X̃_S)],    (1)

where X̃_S = G_S(G_C(z, l)) denotes the synthesized image in S conditioned on the manually assigned attribute code l, and z = E_C(E_S(X_S)) is the latent representation of the input image X_S. It can be seen that the adversarial loss L^S_adv enforces the distribution of synthesized images to approximate that of real images from the same domain. On the other hand, L^S_dis serves as an auxiliary classifier to predict the manually assigned codes l of X̃_S; for X_S, its classification output is compared with the corresponding ground truth labels. With the above learning process, the source-domain generator of CDD would gradually evolve to render the synthesized images X̃_S in accordance with l.

Since the synthesized images may contain appearance variations in accordance with l, the generator G_C + G_S in CDD is designed to preserve the semantic information. Inspired by [9], we advance the VAE loss L^S_VAE:

L^S_VAE = L^S_perc + KL(q_S(z|X_S) || p(z)),    (2)

where KL(q_S(z|X_S) || p(z)) denotes the Kullback-Leibler divergence between the prior p(z) and the auxiliary distribution q_S(z|X_S). In particular, the perceptual loss L^S_perc [6] in L^S_VAE is determined as:

L^S_perc = ||Φ(X_S) − Φ(X̃_S)||_F^2,    (3)

which calculates the reconstruction error between the synthesized output X̃_S and its original input X_S upon the network transformation Φ (we apply the VGG net [20] in our work).
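As a concrete illustration of Eqs. (1)-(3), the following is a minimal PyTorch sketch of the source-domain objectives, reusing the module interfaces assumed in the earlier sketch. The binary-cross-entropy form of the adversarial terms, the choice of VGG-16 layers for Φ, and the use of a mean-squared error in place of the squared Frobenius norm are simplifying assumptions.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

# Phi(.): a fixed VGG-16 feature extractor for the perceptual loss (the layer cut-off is an assumption).
vgg_feat = vgg16(weights="IMAGENET1K_V1").features[:16].eval()
for p in vgg_feat.parameters():
    p.requires_grad_(False)

def reparameterize(mu, logvar):
    # Draw z ~ q(z | x) with the usual VAE reparameterization trick.
    return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

def source_domain_losses(x_s, l, E_S, E_C, G_C, G_S, D_S, D_C):
    """l: float tensor of shape (batch, 1) holding the assigned binary attribute code."""
    mu, logvar = E_C(E_S(x_s))
    z = reparameterize(mu, logvar)
    x_s_fake = G_S(G_C(z, l))                                    # X~_S = G_S(G_C(z, l))

    # L^S_adv (Eq. 1), written as binary cross-entropy on the real/fake logits.
    real_logit, real_cls = D_C(D_S(x_s))
    fake_logit, fake_cls = D_C(D_S(x_s_fake))
    L_adv = (F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit)) +
             F.binary_cross_entropy_with_logits(fake_logit, torch.zeros_like(fake_logit)))

    # L^S_dis (Eq. 1): the auxiliary classifier predicts the attribute code for real and synthesized images.
    L_dis = (F.binary_cross_entropy_with_logits(real_cls, l) +
             F.binary_cross_entropy_with_logits(fake_cls, l))

    # L^S_VAE = L^S_perc + KL (Eqs. 2-3); mean-squared error stands in for the squared Frobenius norm.
    L_perc = F.mse_loss(vgg_feat(x_s), vgg_feat(x_s_fake))
    L_kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return L_adv, L_dis, L_perc + L_kl
```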

3.3 Learning Cross-Domain Disentangled Representation

The network components introduced in Section 3.2 cannot be directly applied to image data in other domains without observing their ground truth label information. In other words, with only unlabeled data X_T in the target domain T, one can only calculate the following losses:

L^T_adv = E[log(1 − D_C(D_T(X̃_T)))] + E[log(D_C(D_T(X_T)))],
L^T_VAE = L^T_perc + KL(q_T(z|X_T) || p(z)) = ||Φ(X_T) − Φ(X̃_T)||_F^2 + KL(q_T(z|X_T) || p(z)).    (4)

As a major contribution of our proposed CDD, we need to adapt the disentanglement ability across domains. However, for the target-domain disentanglement loss L^T_dis = E[log P(L = l | X̃_T)], the latent factor l to be disentangled is not explicitly interpretable due to the lack of ground truth information. While existing works for unsupervised disentanglement have been proposed [1, 5], directly combining such techniques with the aforementioned network components would not guarantee to produce feature representations with the same disentangled latent factors.

In order to address the above challenge, our CDD needs to learn a joint disentangled latent space across data domains. As depicted in Figure 1, we have network components designed for handling domain-specific or cross-domain data. To enforce the realism of the cross-domain output images, we introduce the cross-domain adversarial loss L^cd_adv at the output layer of CDD:

L^cd_adv = L^{S→T}_adv + L^{T→S}_adv,
L^{S→T}_adv = E[log(1 − D_C(D_T(X̃_{S→T})))] + E[log(D_C(D_T(X_T)))],
L^{T→S}_adv = E[log(1 − D_C(D_S(X̃_{T→S})))] + E[log(D_C(D_S(X_S)))].    (5)

To further preserve the disentanglement ability during adaptation, we tie the disentangled factor l across domains with a unique cross-domain disentanglement loss L^cd_dis:

L^cd_dis = E[log P(L = l | X̃_{S→T})] + E[log P(L = l | X̃_{T→S})].    (6)

Note that, given manually assigned codes l, X̃_{S→T} = G_T(G_C(z, l)) denotes the synthesized image in T given the input from S, i.e., z = E_C(E_S(X_S)). Similarly, we have X̃_{T→S} = G_S(G_C(z, l)), where z = E_C(E_T(X_T)).
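A corresponding sketch of the cross-domain terms in Eqs. (5)-(6) is given below. It reuses the assumed modules and the reparameterize helper from the previous sketches; only the manually assigned code l (taken from source labels or sampled) conditions the syntheses, since target-domain annotation is never observed.

```python
import torch
import torch.nn.functional as F

def cross_domain_losses(x_s, x_t, l, E_S, E_T, E_C, G_C, G_S, G_T, D_S, D_T, D_C):
    bce = F.binary_cross_entropy_with_logits

    # Encode both domains through the shared encoder E_C and sample latent codes.
    mu_s, logvar_s = E_C(E_S(x_s))
    mu_t, logvar_t = E_C(E_T(x_t))
    z_s = reparameterize(mu_s, logvar_s)
    z_t = reparameterize(mu_t, logvar_t)

    # Cross-domain syntheses: X~_{S->T} = G_T(G_C(z_s, l)) and X~_{T->S} = G_S(G_C(z_t, l)).
    x_s2t = G_T(G_C(z_s, l))
    x_t2s = G_S(G_C(z_t, l))

    # L^cd_adv (Eq. 5): cross-domain outputs must be indistinguishable from real images
    # of the rendered domain, as judged by that domain's discriminator.
    s2t_logit, s2t_cls = D_C(D_T(x_s2t))
    t2s_logit, t2s_cls = D_C(D_S(x_t2s))
    real_t_logit, _ = D_C(D_T(x_t))
    real_s_logit, _ = D_C(D_S(x_s))
    L_cd_adv = (bce(s2t_logit, torch.zeros_like(s2t_logit)) + bce(real_t_logit, torch.ones_like(real_t_logit)) +
                bce(t2s_logit, torch.zeros_like(t2s_logit)) + bce(real_s_logit, torch.ones_like(real_s_logit)))

    # L^cd_dis (Eq. 6): the assigned code l must stay recoverable from both cross-domain outputs.
    L_cd_dis = bce(s2t_cls, l) + bce(t2s_cls, l)
    return L_cd_adv, L_cd_dis
```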

Figure 2: Example results of cross-domain feature disentanglement for face image synthesis. (Top to bottom rows: source-domain input images, output images with varying attributes in source and target domains, respectively.)

3.4 Learning of CDD

Here we summarize the loss functions into several groups with respect to the main structures of the proposed method: the encoder, the generator, and the discriminator.

1. For the encoder (E_S, E_T, E_C), only the variational autoencoder part is involved:

L_E = L_VAE = L^S_VAE + L^T_VAE.    (7)

2. For the generator (G_S, G_T, G_C), it is shared by the VAE and the GAN augmented with disentanglement ability, thus:

L_G = L^S_perc + L^T_perc + L_dis + L_adv,    (8)

where L_adv = L^S_adv + L^T_adv + L^cd_adv and L_dis = L^S_dis + L^T_dis + L^cd_dis.

3. For the discriminator (D_S, D_T, D_C), it is adversarial to the generator of the GAN while trained to obtain disentanglement, thus:

L_D = L_dis − L_adv.    (9)

The learning of the proposed CDD framework is then done by jointly training all the components mentioned above, which finally derives the disentangled deep representation to describe image data across different domains. We experimentally verify the practicality of our CDD approach in the following section.
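The following sketch assembles the above grouping into one training step, reusing the hypothetical source_domain_losses and cross_domain_losses helpers from the earlier sketches. The three Adam optimizers, the unweighted loss sums, the binary-cross-entropy sign convention (the discriminator minimizes its classification loss rather than literally maximizing the log terms of Eq. (9)), and the omission of the synthesized-target term L^T_dis are simplifying assumptions.

```python
import itertools
import torch

def chain_params(modules):
    return itertools.chain.from_iterable(m.parameters() for m in modules)

def build_optimizers(E_S, E_T, E_C, G_S, G_T, G_C, D_S, D_T, D_C, lr=2e-4):
    opt_E = torch.optim.Adam(chain_params([E_S, E_T, E_C]), lr=lr)   # Eq. (7)
    opt_G = torch.optim.Adam(chain_params([G_S, G_T, G_C]), lr=lr)   # Eq. (8)
    opt_D = torch.optim.Adam(chain_params([D_S, D_T, D_C]), lr=lr)   # Eq. (9)
    return opt_E, opt_G, opt_D

def train_step(x_s, l_s, x_t, nets, opt_E, opt_G, opt_D):
    E_S, E_T, E_C, G_S, G_T, G_C, D_S, D_T, D_C = nets

    # --- Discriminator update (cf. Eq. (9)): real/fake discrimination plus attribute prediction. ---
    opt_D.zero_grad()
    L_adv_s, L_dis_s, _ = source_domain_losses(x_s, l_s, E_S, E_C, G_C, G_S, D_S, D_C)
    L_adv_t, _, _ = source_domain_losses(x_t, l_s, E_T, E_C, G_C, G_T, D_T, D_C)  # l_s only conditions synthesis
    L_cd_adv, L_cd_dis = cross_domain_losses(x_s, x_t, l_s, E_S, E_T, E_C, G_C, G_S, G_T, D_S, D_T, D_C)
    (L_adv_s + L_adv_t + L_cd_adv + L_dis_s + L_cd_dis).backward()
    opt_D.step()

    # --- Encoder and generator update (cf. Eqs. (7)-(8)), recomputed on a fresh graph. ---
    opt_E.zero_grad()
    opt_G.zero_grad()
    L_adv_s, L_dis_s, L_vae_s = source_domain_losses(x_s, l_s, E_S, E_C, G_C, G_S, D_S, D_C)
    L_adv_t, _, L_vae_t = source_domain_losses(x_t, l_s, E_T, E_C, G_C, G_T, D_T, D_C)
    L_cd_adv, L_cd_dis = cross_domain_losses(x_s, x_t, l_s, E_S, E_T, E_C, G_C, G_S, G_T, D_S, D_T, D_C)
    # The generator tries to fool the discriminators (hence the minus sign of the minimax game)
    # while keeping the attribute recoverable; the encoders minimize the VAE terms of Eq. (7).
    (L_vae_s + L_vae_t + L_dis_s + L_cd_dis - (L_adv_s + L_adv_t + L_cd_adv)).backward()
    opt_E.step()
    opt_G.step()
```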

4 Experiments

We evaluate our proposed method in two different scenarios. In Section 4.1, we qualitatively demonstrate the ability of our method in terms of both conditional image synthesis and translation. In Section 4.2, our method is quantitatively compared with recent deep learning approaches for cross-domain classification. All the images in the experiments are from the CelebFaces Attributes dataset (CelebA) [13], a large-scale face dataset including more than 200K celebrity images annotated with 40 facial attributes. We consider the following settings of data domains (i.e., source domain S and target domain T) and latent factor l to be disentangled:

I) S: faces w/o eyeglasses; T: faces w/ eyeglasses; l: attribute of smiling
II) S: real photos of faces; T: simulated sketches of faces; l: attribute of smiling
III) S: real photos of faces; T: simulated sketches of faces; l: attribute of eyeglasses

Figure 3: Example results of cross-domain feature disentanglement for face image synthesis. (Top to bottom rows: source-domain input images, output images with varying attributes in source and target domains, respectively.)

Figure 4: Example results of cross-domain feature disentanglement for face image synthesis. (Top to bottom rows: source-domain input images, output images with varying attributes in source and target domains, respectively.)

Taking the first setting as an example, we use face images without/with eyeglasses as source/target-domain data, while the latent factor to be learned/disentangled is the attribute of smiling. As for the latter two settings, which require face sketch images as target-domain data, we apply the techniques in [4, 19] to synthesize the sketch output for each face image from CelebA. It is worth repeating that, for all the aforementioned settings, only the ground truth attribute of the source-domain data is available for learning our proposed CDD.
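For reference, a minimal sketch of how the setting-I domain split and smiling codes could be derived from the CelebA attribute annotations is shown below. The file path is hypothetical, and the parsing assumes the standard list_attr_celeba.txt layout (a count line, a line of 40 attribute names, then one row of ±1 values per image).

```python
from pathlib import Path

ATTR_FILE = Path("celeba/list_attr_celeba.txt")   # hypothetical location of the CelebA attribute file

def load_setting_one(attr_file=ATTR_FILE):
    """Split CelebA into setting I: S = faces w/o eyeglasses, T = faces w/ eyeglasses."""
    lines = attr_file.read_text().splitlines()
    names = lines[1].split()                       # the 40 attribute names
    i_glasses = names.index("Eyeglasses")
    i_smile = names.index("Smiling")

    source, target = [], []                        # (filename, smiling code) pairs
    for row in lines[2:]:
        fields = row.split()
        fname, vals = fields[0], [int(v) for v in fields[1:]]
        smiling = 1 if vals[i_smile] == 1 else 0
        if vals[i_glasses] == 1:
            target.append((fname, smiling))        # T: label kept only for evaluation, never for training
        else:
            source.append((fname, smiling))        # S: label l supervises the CDD
    return source, target

if __name__ == "__main__":
    source_set, target_set = load_setting_one()
    print(len(source_set), "source images,", len(target_set), "target images")
```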

4.1 Conditional Image Synthesis and Translation

As pointed out in Section 3, the architecture of our CDD allows one to freely control the disentangled factor of a test image and produce the corresponding output. This can be viewed as the task of conditional image synthesis. More specifically, with a specified attribute value (e.g., smiling or not), we can take the learned encoder in one domain paired with the generator from the other domain, so that image-to-image translation conditioned on that attribute can be realized. The qualitative results are shown in Figure 2, Figure 3, and Figure 4 for the different settings. Taking Figure 2 as an example, given a face image without eyeglasses as the input, our CDD is able to manipulate the smiling attribute not only for the output image in the same domain as the input (i.e., without eyeglasses), but also for the face images in the target domain (i.e., with eyeglasses). The above results support the use of our CDD for learning disentangled feature representations from cross-domain data, and confirm its effectiveness in producing images of interest in either domain.

As an additional remark, for conditional image translation, one could first perform conditional image synthesis using source-domain data, followed by existing off-the-shelf image-to-image translation frameworks which convert such outputs into the images of interest in the target domain. However, such integrated approaches cannot result in a disentangled representation in a shared feature space (as ours does). Later in Section 4.2, we provide additional quantitative evaluation to support the use of our derived representation for describing cross-domain data.
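The composition described above can be written compactly: encode with one domain's encoder, assign the attribute code, and decode with the other domain's generator. The sketch below follows the module interfaces assumed in the earlier sketches; using the posterior mean as z at test time is an additional assumption.

```python
import torch

@torch.no_grad()
def translate(x, attribute_value, E_in, E_C, G_C, G_out):
    """Render image batch x in the other domain with a chosen attribute.

    E_in is the encoder of x's domain and G_out the generator of the output domain,
    e.g., E_S with G_T maps source faces into the target domain.
    """
    mu, _ = E_C(E_in(x))                                        # use the posterior mean as z
    l = torch.full((x.size(0), 1), float(attribute_value))      # e.g., 1.0 for "smiling"
    return G_out(G_C(mu, l))

# Example: render source-domain faces as smiling faces in the target domain.
# y = translate(x_src, 1, E_S, E_C, G_C, G_T)
```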

4.2 Cross-Domain Visual Classification

Table 1: Cross-domain classification results of face images with respect to the attribute of smiling. Source- and target-domain test data are faces w/o eyeglasses and w/ eyeglasses, respectively.

Accuracy (%)   CoGAN   UNIT    Ours*   Ours
Source         87.03   88.66   89.48   89.73
Target         71.92   71.82   83.69   84.43

Table 2: Cross-domain classification results of face images with respect to the attribute of smiling. Source- and target-domain test data are photo and sketch faces, respectively.

Accuracy (%)   CoGAN   UNIT    Ours*   Ours
Source         89.50   90.10   90.19   90.01
Target         78.90   81.04   87.61   88.28

Table 3: Cross-domain classification results of face images with respect to the attribute of eyeglasses. Source- and target-domain test data are photo and sketch faces, respectively.

Accuracy (%)   CoGAN   UNIT    Ours*   Ours
Source         96.63   97.65   97.06   97.19
Target         81.01   79.89   94.49   94.84

In addition to the qualitative evaluation, we now provide quantitative results to further confirm the use of our CDD for describing cross-domain data given particular attributes of interest. Recall that the discriminator in our CDD is augmented with an auxiliary classifier, which classifies images with respect to the disentangled latent factor l. With supervision only from the source-domain data, this classification task is analogous to the scenario addressed by unsupervised domain adaptation (UDA) for cross-domain visual classification, in which no label information is observed from the target-domain data during adaptation.

To perform quantitative evaluation on recognizing the disentangled attribute, we consider two recent deep learning models, CoGAN [12] and UNIT [11]. Both stem from the GAN architecture and aim to generate corresponding cross-domain image pairs by learning a joint latent space. By attaching a classifier to their discriminators (trained with labeled source-domain data only), both CoGAN and UNIT can be applied to UDA as noted in [12, 11].

From Tables 1, 2, and 3, we see that neither CoGAN nor UNIT was able to produce satisfactory performance, as the gaps in classification performance between the source and target domains were 14.6%, 15.11%, and 16.84%, respectively. In contrast, our CDD (denoted as Ours in the tables) observed much smaller performance gaps of 5.3%, 1.8%, and 2.3%. We also report the performance of a variant of our full model, denoted as Ours*. This variant removes the encoder part of the CDD framework but keeps a joint latent space of z, which is analogous to the difference between CoGAN and UNIT. Similarly, we also observe promising results with this simplified variant of our CDD. We note that the main difference between our proposed method and CoGAN/UNIT is that our classifier is learned by observing cross-domain disentangled data, even though no ground truth information from the target domain is presented. From these quantitative evaluation results, we successfully verify the use of our proposed method (over recent DNN methods) for cross-domain visual classification. This also confirms that the disentangled representation and latent features learned by our CDD exhibit excellent ability in describing images across different domains.
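For concreteness, the sketch below illustrates one way such accuracies could be computed with the auxiliary classifier head of the learned discriminator, following the interfaces assumed in the earlier sketches; thresholding a single logit at zero for a binary attribute is an assumption, not the paper's stated protocol.

```python
import torch

@torch.no_grad()
def attribute_accuracy(loader, D_domain, D_C):
    """Accuracy of the auxiliary classifier on one domain's labeled test set."""
    correct, total = 0, 0
    for x, l in loader:                       # l: ground-truth binary attribute, shape (batch, 1)
        _, cls_logit = D_C(D_domain(x))       # D_S + D_C for source data, D_T + D_C for target data
        pred = (cls_logit > 0).float()
        correct += (pred == l).sum().item()
        total += l.numel()
    return correct / total

# Example: acc_src = attribute_accuracy(source_test_loader, D_S, D_C)
#          acc_tgt = attribute_accuracy(target_test_loader, D_T, D_C)
```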

5 Conclusions

In this paper, we presented a novel deep learning framework for Cross-Domain Disentanglement (CDD), in which only supervision from the source-domain data is available. Extending generative adversarial networks, our proposed CDD uniquely performs joint feature disentanglement and adaptation by observing domain-specific and cross-domain adversarial and disentanglement losses, which allows one to manipulate output images with attributes of interest in either domain. Our experimental results qualitatively show that our model is able to control the disentangled factor for image synthesis as well as image-to-image translation. Moreover, we quantitatively show that the resulting discriminator maintains consistent classification performance on the disentangled factor across domains, compared against recent deep-learning based approaches.

References

[1] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), 2016.
[2] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In Proceedings of the International Conference on Machine Learning (ICML), 2015.
[3] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), 2014.
[4] Y. Güçlütürk, U. Güçlü, R. van Lier, and M. A. van Gerven. Convolutional sketch inversion. In Proceedings of the European Conference on Computer Vision (ECCV), 2016.
[5] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. β-VAE: Learning basic visual concepts with a constrained variational framework. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.
[6] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In Proceedings of the European Conference on Computer Vision (ECCV), 2016.
[7] T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. Learning to discover cross-domain relations with generative adversarial networks. arXiv preprint arXiv:1703.05192, 2017.
[8] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems (NIPS), 2014.
[9] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations (ICLR), 2014.
[10] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems (NIPS), 2015.
[11] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. arXiv preprint arXiv:1703.00848, 2017.
[12] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In Advances in Neural Information Processing Systems (NIPS), 2016.
[13] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
[14] A. Makhzani, J. Shlens, N. Jaitly, and I. Goodfellow. Adversarial autoencoders. In Proceedings of the International Conference on Learning Representations (ICLR), 2016.
[15] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier GANs. arXiv preprint arXiv:1610.09585, 2016.
[16] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering (TKDE), 22(10):1345–1359, 2010.
[17] V. M. Patel, R. Gopalan, R. Li, and R. Chellappa. Visual domain adaptation: A survey of recent advances. IEEE Signal Processing Magazine, 32(3):53–69, 2015.
[18] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial nets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[19] P. Sangkloy, J. Lu, C. Fang, F. Yu, and J. Hays. Scribbler: Controlling deep image synthesis with sketch and color. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[20] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[21] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[22] Z. Yi, H. Zhang, P. Tan, and M. Gong. DualGAN: Unsupervised dual learning for image-to-image translation. arXiv preprint arXiv:1704.02510, 2017.
[23] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.

