Transfer Learning for Large Scale Data using Subspace Alignment

Nassara ELHADJI-ILLE-GADO, Edith GRALL-MAËS and Malika KHAROUF
Charles Delaunay Institute (ICD)/LM2S, University of Technology of Troyes, University of Champagne, 12 rue Marie Curie, 10000 Troyes, France
e-mail:
[email protected]

Abstract—A major assumption in many machine learning algorithms is that the training and testing data must come from the same feature space or have the same distribution. However, in real applications, this strong hypothesis often does not hold. In this paper, we introduce a new framework for transfer learning where the source and target domains are represented by subspaces described by eigenvector matrices. To unify the subspace distributions of the two domains, we propose to use a fast and efficient approximate SVD for feature generation. To transfer knowledge between domains, we first use a subspace learning approach to develop a domain adaptation algorithm in which only target knowledge is transferred. Second, we use the subspace alignment trick to propose a novel transfer domain adaptation method. To evaluate the proposal, we use large scale data sets. Numerical results, based on accuracy and computational time, are provided together with a comparison with state-of-the-art methods.
I. INTRODUCTION
Traditional machine learning algorithms often fail to generalize to new input distributions, which leads to degraded accuracy. A general assumption is that training and testing samples must have the same distribution or come from the same feature space. If this assumption is not satisfied, most statistical models need to be rebuilt using new training data. However, it is difficult, or impossible under certain conditions, to recollect new data and reconstruct the models [1]. Consequently, a natural question comes to mind: if a model is learnt from a source domain, what is its ability to correctly predict new data coming from another target domain whose characteristics may be different? In such circumstances, knowledge transfer between domains, if successfully completed, can considerably improve the classification performance. Transfer learning attempts to compensate for this performance degradation by transferring and adapting source knowledge to the target domain. In particular, domain adaptation methods seek a common low-dimensional subspace between domains and attempt to align the subspace bases [2], [3]. The idea is to seek a common subspace in which the source domain $D_s$ and the target domain $D_t$ share a maximum amount of information or may have the same marginal distribution [4], [5]. Such an approach can be very useful, for example, for text or image categorization. Also, in the field of cyber-physical systems [6], transfer learning can be used to transfer data learnt on one site to another one, or to transfer information from one physical sensor in order to adapt another sensor. To establish a common feature space, some methods are based on the spectral decomposition of a well-defined kernel function. These methods seek a projection space that allows one
to approximate the marginal distributions in the two domains [7], [8]. However, this decomposition can be expensive when a fairly large number of samples is available [2], [3]. Many transfer learning techniques based on subspace domain adaptation are being developed [9], [10]. Subspace alignment is one of these methods; it uses principal component analysis (PCA) to select an intermediate subspace where the source and target domains share a common marginal distribution [9]. Inspired by this transfer subspace representation method, and aware of the limits of PCA for high dimensional data, we propose to use an approximate SVD method for fast and efficient feature extraction, which was previously used for an LDA application [11]. This method has shown its effectiveness and efficiency for feature extraction in a high dimensional context. It uses a low-rank approximation to give a representative latent space of a given data matrix. The objective is to build a prototype for transfer learning domain adaptation, with an application to large scale data classification. The major contributions of this paper are the following:

1) A new target adaptive feature representation, which shares only target knowledge with the source domain, is proposed by using a learnt low-rank feature representation.
2) To enforce the distributions of the two domains to be similar in a given embedding space, we propose another transfer domain adaptation method for large scale data that uses a subspace alignment approach and a low-rank approximation method.
3) We use real world data sets for solving classification problems in text categorization, object recognition and handwritten digit classification. Experiments show the performance of the proposed algorithms.
The paper is organized as follows. The problem formulation is introduced in Section II. Then, preliminary background is presented in Section III. The proposed methods are exposed in Section IV. Experimental settings, results and discussion are reported in Section V. Finally, we conclude our work in Section VI.

II. PROBLEM FORMULATION
In this paper, we consider the problem of transfer learning for domain adaptation in multi-class classification. We define a data space $\mathcal{X}$, a label space $\mathcal{Y}$ and a distribution $P$. We consider a domain as a pair $D = \{\mathcal{X}, P\}$ and assume that the data originate from two domains, a source ($D_s$) and a target ($D_t$).

The source data is a fully labeled set $X_s = \{(x_i, y_i)\}_{i=1}^{n_s}$, where the $x_i \in \mathcal{X}_s$ are sampled from a distribution $P_s$ and $y_i \in \mathcal{Y}_s$. The target data is a set of unlabeled samples $X_t = \{x_i\}_{i=n_s+1}^{n_s+n_t}$, where the $x_i \in \mathcal{X}_t$ are drawn from a distribution $P_t$. We assume that the source and target domains have the same feature space, say a $d$-dimensional space $\mathcal{X} = \mathbb{R}^d$, and also have the same label sets but different marginal distributions, i.e., $\mathcal{X}_s = \mathcal{X}_t$, $\mathcal{Y}_s = \mathcal{Y}_t$, but $P_s \neq P_t$. The label space $\mathcal{Y}$ can be adapted either to binary or to multi-class settings. The objective is to learn a prediction function $f$, based on the source domain, that shows a low expected error on the target domain. Such a classifier may accurately predict the labels of the unlabeled target data with high confidence once the knowledge transfer between domains is realized.
III. BACKGROUND
In this section, we present the preliminary knowledge used in our work. First we outline the subspace alignment method, and then we present the subspace representation based on an efficient approximation of the singular value decomposition.

A. Subspace alignment

The subspace alignment method (SA) [9] uses subspaces generated by PCA in order to make an adaptation between domains. The basic idea is to apply PCA on the source sample ($X_s$) and the target sample ($X_t$) separately, choosing a common space of dimension $k \ll d$. This leads to two projection matrices $G_s$ and $G_t$. Then, the projected source data is aligned with the projected target data in a common subspace by using a subspace alignment matrix $G_a = G_s G_s^T G_t$. To achieve this, the SA method reduces the discrepancy between domains by moving the source and target subspaces closer, such that:
$$G^* = \underset{G}{\arg\min} \, \|G_s G - G_t\|_F^2 = \underset{G}{\arg\min} \, \|G - G_s^T G_t\|_F^2. \quad (1)$$
Then, the optimal transformation matrix is given by $G^* = G_s^T G_t$, and the aligned source coordinate system in the target subspace is defined by $G_a = G_s G^* = G_s G_s^T G_t$.

B. Subspace representation for high dimensional data

PCA is one of the most widely used approaches for dimension reduction in high-dimensional data analysis and feature extraction [12]. It uses the eigen-decomposition of the covariance matrix to find the eigenvectors for subspace representation. However, this method cannot be used in very high-dimensional settings where the dimension of the data is comparable to, or even larger than, the sample size.

1) Fast approximate singular value decomposition: Given a matrix $X \in \mathbb{R}^{N \times d}$, the approximation of the SVD seeks a low-rank matrix $X_k$, of rank $k \ll d$, such that $X_k$ and $X$ are close in some metric sense. The optimal approximation $X_k$ of $X$ of rank at most $k$ is given by the truncated SVD [13]. The time complexity of the SVD is $O(Nd \min\{N, d\})$, which makes it infeasible if $\min\{N, d\}$ is too large. To overcome this complexity, an improvement of this method was recently developed as a fast efficient approximation of the SVD [14]. It combines random projection, approximate SVD and linear algebra operations [11]. The aim is to find a projection matrix
$G_k \in \mathbb{R}^{d \times k}$ such that $X_k^* = X G_k$ contains the new artificial data points in a reduced space. Algorithm 1 gives the main outlines.

Algorithm 1 Fast Efficient SVD - FESVD [11]
Input: $X$, $p$ and $k$
Output: $G_k$
1: Generate $R \in \mathbb{R}^{d \times p}$ where $r_{ij} \sim \mathcal{N}(0, 1)$;
2: Compute the matrix $Z = XR$;
3: Compute $Q = \mathrm{orth}(Z)$;
4: Compute $B = Q^T X \in \mathbb{R}^{p \times d}$;
5: Compute $T = B B^T$;
6: Compute the eigen-decomposition $T = H \Delta_T H^T$;
7: Compute $\Sigma_{T,ii} = \sqrt{\Delta_{T,ii}}$;
8: Compute $V = (\Sigma_T^{-1} H^T B)^T$;
9: Return $G_k = V(:, 1:k)$.
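For concreteness, the following is a minimal NumPy sketch of Algorithm 1 written directly from the steps above. It is an illustration under assumptions, not the authors' Matlab code: the function name fesvd, the seeding and the small eigenvalue floor are choices made here for the example.

```python
import numpy as np

def fesvd(X, p, k, seed=0):
    """Fast approximate SVD sketch (Algorithm 1): returns a d x k projection matrix G_k."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    R = rng.standard_normal((d, p))             # step 1: random Gaussian matrix R in R^{d x p}
    Z = X @ R                                   # step 2: sketch of the column space, N x p
    Q, _ = np.linalg.qr(Z)                      # step 3: orthonormal basis of Z (orth)
    B = Q.T @ X                                 # step 4: small p x d matrix
    T = B @ B.T                                 # step 5: p x p Gram matrix
    evals, H = np.linalg.eigh(T)                # step 6: eigen-decomposition T = H Delta_T H^T
    order = np.argsort(evals)[::-1]             # sort eigenvalues in decreasing order
    evals, H = evals[order], H[:, order]
    sigma = np.sqrt(np.maximum(evals, 1e-12))   # step 7: singular values of B (floored for stability)
    V = (np.diag(1.0 / sigma) @ H.T @ B).T      # step 8: right singular vectors, d x p
    return V[:, :k]                             # step 9: keep the k leading directions

# Illustrative usage on synthetic data:
# X = np.random.randn(1000, 500); Gk = fesvd(X, p=100, k=50); Xk = X @ Gk
```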
2) Low-level data representation: The obtained matrix $X_k^* = X G_k \in \mathbb{R}^{N \times k}$ can be considered as the reduced form of $X$ that contains the reconstructed features. The reduced subspace is spanned by the columns of the projection matrix $G_k$. Since the projection matrix is learnt with Algorithm 1, both data domains can be mapped into their respective latent spaces. This leads to a dimension reduction technique where the source and target domains are represented by their respective subspaces $G_s$ and $G_t$. Traditional dimension reduction methods consider that only the training data (source) should be exploited to construct a latent subspace in which new data is labeled. Basically, given training data, a projection matrix $G_s$ is learnt from the training samples only. This projection matrix transforms the training data into a reduced feature space by $S = X_s G_s$ and the testing data by $T = X_t G_s$. In this case, only source data are used for building the model, and no transfer is realized. Such a method is a classical dimension reduction technique. In the sequel of the paper, we refer to this approach as the no adaptation method (NA). It will be used for comparison in Section V.
IV. THE PROPOSED METHOD
In this section, we introduce the proposed approach for transfer domain adaptation. In Section IV-A, we present a target subspace domain adaptation, underlining the difference with traditional dimension reduction techniques. In Section IV-B, we propose a framework that combines subspace alignment and fast efficient subspace representation.

A. Target transfer subspace

Subspace learning techniques are often used for feature extraction, feature selection or data reconstruction to solve classification problems using optimisation techniques. To emphasize the interest behind knowledge transfer based on the subspace representation, we propose a target domain adaptation (TDA). We suggest not using both source and target data to learn the common subspace for knowledge transfer. Only the target knowledge $G_t$ is considered as a projection matrix, which can be learnt through the fast efficient approximate SVD. Then, the source data is projected into the target subspace, i.e., $S = X_s G_t$,
where it shares common artificial features with the target data ($T = X_t G_t$). Our method differs from NA, previously described in Section III-B2, in that the new source data $X_s G_t$ lie in the same space as the transformed target data $X_t G_t$, and should share the same embedding-based features. This makes it possible to exploit the target information, since we have access to data in this domain, whereas in standard feature learning the target information is assumed not to be accessible.
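As a minimal illustrative sketch (not the authors' Matlab implementation), the TDA mapping described above could look as follows; it reuses the hypothetical fesvd function from the sketch in Section III-B, and all names are assumptions.

```python
def tda_features(Xs, Xt, p, k, seed=0):
    """TDA sketch: learn the projection on the target data only and map both domains with it.
    Relies on the illustrative fesvd(X, p, k, seed) sketch given earlier."""
    Gt = fesvd(Xt, p, k, seed)   # target subspace learnt with the approximate SVD
    S = Xs @ Gt                  # source data expressed in the target subspace
    T = Xt @ Gt                  # target data in its own subspace
    return S, T

# For contrast, the NA baseline of Section III-B2 would instead learn Gs = fesvd(Xs, p, k)
# and project both domains with it (S = Xs @ Gs, T = Xt @ Gs), so no target knowledge is used.
```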
B. Approximate subspace alignment for transfer domain adaptation

Although both domains lie in the same $d$-dimensional embedding space, their marginal distributions remain different. To bring their respective basis vectors closer, we propose to use a subspace alignment approach. Hence, the source and target domain subspaces are learnt by the fast efficient approximation of the SVD. Algorithm 2 gives the main steps of the proposed approximate subspace alignment for transfer domain adaptation (ASA-DA). The idea behind ASA-DA is to achieve transfer learning for large scale data where no regularization parameter needs to be tuned in the objective function, as imposed by many other methods [15].
Algorithm 2 Approximate subspace alignment for transfer domain adaptation - ASA-DA
Input: $X_s$, $X_t$, $p$ and $k$
Output: $S$ and $T$
1: Compute the source projection $G_s = \mathrm{FESVD}(X_s, p, k)$;
2: Compute the target projection $G_t = \mathrm{FESVD}(X_t, p, k)$;
3: Compute the aligned projection $G_a = G_s G_s^T G_t$;
4: Compute the aligned source data $S = X_s G_a$;
5: Compute the new target data $T = X_t G_t$;

In this approach, the target data is mapped into its own subspace, i.e., $G_t$, whereas the source data distribution is aligned through the matrix $G_a$ in the new subspace, where the two domains share common information. Since we apply the FESVD algorithm for subspace generation, the source and target data lie in a $k$-dimensional feature subspace. Through subspace alignment, the discrepancy between their respective marginal distributions is reduced. Then, the classifier learnt from the transformed source data $S$ is used to predict the target samples $T$.
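As a hedged illustration only, Algorithm 2 could be sketched in Python as below; it again relies on the hypothetical fesvd function from the earlier sketch and is not the authors' implementation.

```python
def asa_da(Xs, Xt, p, k, seed=0):
    """ASA-DA sketch (Algorithm 2): align the source subspace with the target subspace.
    Relies on the illustrative fesvd(X, p, k, seed) sketch given earlier."""
    Gs = fesvd(Xs, p, k, seed)   # step 1: source projection
    Gt = fesvd(Xt, p, k, seed)   # step 2: target projection
    Ga = Gs @ Gs.T @ Gt          # step 3: aligned projection G_a = G_s G_s^T G_t
    S = Xs @ Ga                  # step 4: aligned source data
    T = Xt @ Gt                  # step 5: target data in its own subspace
    return S, T                  # a classifier trained on (S, ys) is then applied to T
```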
V. EXPERIMENTS
We have applied the proposed approach in a high dimensional context to document classification, object recognition and handwritten digit data. Standard, publicly available real data sets have been used for evaluation. We have also compared the proposed methods with various state-of-the-art methods.
A. Databases

We have considered four databases widely used for transfer domain adaptation in classification problems: 20Newsgroups, COIL20, MNIST and USPS¹. We have derived 10 cross domain problems for adaptation. 20Newsgroups data are used to construct several binary cross domain problems, whereas COIL20, MNIST and USPS are used for multi-class problems.
(¹All databases can be downloaded at http://www.cad.zju.edu.cn/home/dengcai/Data/TextData.html)

The USPS and MNIST datasets contain handwritten digit images with 10 classes (0-9). The USPS dataset contains 7,291 training samples and 2,007 testing samples of 16 × 16 pixels. The MNIST dataset consists of 60,000 training images and 10,000 testing images of size 28 × 28. To construct the cross domain data USPS→MNIST, 1,800 samples were randomly selected from USPS to form the source domain, and 2,000 were randomly selected from MNIST to form the target domain. In the same way, the MNIST→USPS cross domain is constructed by switching source and target domains. All images have been resized to 16 × 16 pixels, leading to a 256-dimensional feature space.

The COIL20 dataset contains 20 objects with 1,440 sample images of size 32 × 32. Images are taken at 5-degree intervals, so 72 images are available for each object. To construct the cross domain data, the database is separated into two subparts: COIL1 and COIL2. COIL1 contains images taken in the directions [0°, 85°] ∪ [180°, 265°] and COIL2 in the directions [90°, 175°] ∪ [270°, 355°]. This leads to two transfer domains, COIL1→COIL2 and COIL2→COIL1.

20Newsgroups is a text collection of nearly 20,000 documents across 20 different topics. To construct a cross domain problem, some subcategories from the top categories are selected to form the source samples. The target domain contains the rest of the subcategories. The details of the constructed transfer domains are reported in Table I.

TABLE I: 20Newsgroups datasets and task description

Task         | Ds|Dt | Class 1                                   | Class 2                                   | ns|nt     | #d
Sci vs Talk  | Ds    | sci.crypt, sci.med                        | talk.politics.misc, talk.religion.misc    | 3373|3818 | 23561
             | Dt    | sci.electronics, sci.space                | talk.politics.guns, talk.religion.mideast |           |
Rec vs Talk  | Ds    | rec.autos, rec.sport.baseball             | talk.religion.mideast, talk.politics.misc | 3690|3525 | 22737
             | Dt    | rec.motorcycles, rec.sport.hockey         | talk.politics.guns, talk.religion.misc    |           |
Rec vs Sci   | Ds    | rec.autos, rec.sport.baseball             | sci.crypt, sci.med                        | 3951|3958 | 22020
             | Dt    | rec.motorcycles, rec.sport.hockey         | sci.electronics, sci.space                |           |
Comp vs Sci  | Ds    | comp.os.mswindows.misc, comp.sys.ibm.pc.* | sci.electronics, sci.space                | 3911|3901 | 20016
             | Dt    | comp.graphics, comp.sys.mac.*             | sci.crypt, sci.med                        |           |
Comp vs Rec  | Ds    | comp.graphics, comp.sys.ibm.pc.*          | rec.motorcycles, rec.sport.baseball       | 3933|3904 | 18825
             | Dt    | comp.os.mswindows.misc, comp.sys.mac.*    | rec.autos, rec.sport.hockey               |           |
Comp vs Talk | Ds    | comp.os.mswindows.misc, comp.sys.ibm.pc.* | talk.religion.mideast, talk.politics.misc | 3911|3904 | 21264
             | Dt    | comp.graphics, comp.sys.mac.*             | talk.politics.guns, talk.religion.misc    |           |
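Purely as an illustration of how such a split might be prepared, the sketch below builds a USPS→MNIST style cross domain pair from two in-memory arrays; the array names, the sampling scheme and the assumption that images are already resized to 16 × 16 and flattened are choices made for this example, not the authors' preprocessing code.

```python
import numpy as np

def make_cross_domain(X_src_pool, y_src_pool, X_tgt_pool, y_tgt_pool,
                      n_src=1800, n_tgt=2000, seed=0):
    """Randomly subsample a source/target pair, as described for USPS->MNIST.
    The pools are assumed to be arrays of images already resized to 16x16 and
    flattened to 256-dimensional feature vectors."""
    rng = np.random.default_rng(seed)
    src_idx = rng.choice(len(X_src_pool), size=n_src, replace=False)
    tgt_idx = rng.choice(len(X_tgt_pool), size=n_tgt, replace=False)
    Xs, ys = X_src_pool[src_idx], y_src_pool[src_idx]   # labeled source domain
    Xt, yt = X_tgt_pool[tgt_idx], y_tgt_pool[tgt_idx]   # target domain (labels kept for evaluation only)
    return Xs, ys, Xt, yt
```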
B. Baseline methods

We refer to the proposed algorithms as ASA-DA (Algorithm 2), NA for no adaptation (Section III-B2) and TDA (Section IV-A). We have compared these algorithms with the following baselines:

• K-Nearest Neighbor classifier (NN) [16]: we directly apply a NN classifier built from the source data to classify the target samples.
• Transfer feature learning with joint distribution adaptation (JDA) + NN: JDA is a transfer feature learning method proposed in [4].
• Transfer component analysis (TCA) + NN: TCA is a domain adaptation method based on transfer components, proposed in [2].
• Subspace alignment method (SA) + NN: we directly apply the approach proposed in [9], where the subspace bases are generated by PCA.
Because of the large number of samples and features in 20Newsgroups, it was not possible to directly apply the SA, TCA and JDA methods, since finding their respective transfer embedding spaces is too time consuming. So, for these large scale data, we have applied only the three presented algorithms. For the COIL20, MNIST and USPS data, we performed all the methods and compared their accuracy and computational time.

C. Implementation

Two main parameters, p and k, are used for the subspace learning of the proposed approach, and the parameter KNN = 10 is the number of nearest neighbors for the NN classifier. We have chosen p = 2k and varied k ∈ {25, 50, ..., 250, 400}. TCA and JDA are performed on all data as a dimensionality reduction procedure on their kernel representation matrices before applying the NN classifier. The subspace dimension is equal to k for all methods. The regularization parameter for JDA and TCA is tuned as λ = 0.1. All methods have been implemented in Matlab. For all methods, a NN classifier was trained on the labeled source data and tested on the unlabeled target data. As the target labels are available, the accuracy has been used as the evaluation criterion. The measured time is the time spent by each method for training the source model only. All the reported results were averaged over 10 trials.
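To make the evaluation protocol concrete, here is a hedged sketch of the accuracy measurement described above, using scikit-learn's KNeighborsClassifier as a stand-in for the NN classifier; the helper names (asa_da, tda_features) refer to the illustrative sketches given earlier and are not the authors' code.

```python
from sklearn.neighbors import KNeighborsClassifier

def target_accuracy(S, ys, T, yt, n_neighbors=10):
    """Train a nearest-neighbor classifier on the transformed source data and score it on the target."""
    clf = KNeighborsClassifier(n_neighbors=n_neighbors)
    clf.fit(S, ys)            # the model is learnt from labeled source data only
    return clf.score(T, yt)   # target labels are used only to measure accuracy

# Hypothetical usage, averaging over 10 random trials as in the paper:
# accs = []
# for t in range(10):
#     S, T = asa_da(Xs, Xt, p=2 * k, k=k, seed=t)   # or tda_features / the NA projection
#     accs.append(target_accuracy(S, ys, T, yt))
# mean_acc = sum(accs) / len(accs)
```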
D. Results and discussion

The classification accuracy results of ASA-DA, NA and TDA on the 20Newsgroups data are reported in Figure 1, which shows the influence of the subspace dimension on the accuracy for a fixed nearest neighbor value KNN = 10. When the subspace dimension increases, the accuracy of ASA-DA increases slightly to reach a maximum and decreases thereafter. The maximum accuracy is reached around subspace dimension k = 50. Regarding the NA and TDA methods, the accuracy decreases gradually as the subspace dimension increases. This means that a small subspace dimension is sufficient to discriminate these data. The observed accuracy performance shows a considerable difference in favour of the ASA-DA approach and in disfavour of NA.

In Tables II and III, we report respectively the accuracy and the computational time of each method. For the transfers COIL1 → COIL2 and COIL2 → COIL1, the proposed ASA-DA has the best accuracy, whereas for the transfers USPS → MNIST and MNIST → USPS, the JDA and NN methods respectively have the best accuracy, which is very close to that of TDA. The TCA method has a high execution time, due to the decomposition of the kernel matrix, whose size is equal to the total number of samples (source plus target). TDA is the most computationally efficient method, very close to NA and slightly faster than ASA-DA. TCA is the most expensive method, followed by JDA, when compared to the execution times of the other methods.
VI. CONCLUSION
This paper proposes a transfer learning framework for high dimensional data sets that uses a subspace alignment approach and an efficient subspace representation based on low-rank representation matrices. Our approach aims to adapt both the source and target marginal distributions using a dimensionality reduction technique. By learning subspaces from the source and target domains, experimental results show that the proposed approaches ASA-DA and TDA are effective and outperform the NA method. The performance improvement of the proposed TDA approach shows the advantage of transferring target knowledge to the source domain. The obtained simulation results on ASA-DA show that, by using the fast efficient SVD for feature generation, we get good results in terms of target classification accuracy and execution time. We have applied our approach in a high dimensional context where the evaluation of the other methods is practically infeasible due to time and memory requirements.
[Fig. 1: Simulation results on 20Newsgroups data: influence of the reduced dimension k on accuracy (KNN = 10) for ASA-DA, NA and TDA. Each panel plots Accuracy versus Subspace dimension (k). Panels: (a) Sci vs. Talk, (b) Rec vs. Talk, (c) Rec vs. Sci, (d) Comp vs. Talk, (e) Comp vs. Sci, (f) Comp vs. Rec.]
TABLE II: Accuracy on image data (%). k = 20 and KNN = 10.

Domain (source → target) | NN    | JDA   | TCA   | SA    | NA    | TDA   | ASA-DA
USPS → MNIST             | 42.90 | 52.25 | 33.15 | 44.40 | 28.95 | 50.30 | 48.85
MNIST → USPS             | 68.88 | 59.66 | 46.94 | 51.38 | 54.44 | 67.77 | 60.44
COIL1 → COIL2            | 79.86 | 82.36 | 82.77 | 82.77 | 79.86 | 83.61 | 84.30
COIL2 → COIL1            | 75.97 | 79.02 | 77.50 | 80.00 | 78.61 | 78.75 | 81.25
TABLE III: Computational time for image data (s). k = 20 and KNN = 10.

Domain (source → target) | NN    | JDA  | TCA   | SA    | NA    | TDA   | ASA-DA
USPS → MNIST             | 1.11  | 1.25 | 79.71 | 0.021 | 0.022 | 0.019 | 0.075
MNIST → USPS             | 0.917 | 0.92 | 88.64 | 0.054 | 0.014 | 0.026 | 0.025
COIL1 → COIL2            | 0.730 | 1.65 | 6.65  | 0.700 | 0.025 | 0.019 | 0.037
COIL2 → COIL1            | 0.562 | 1.57 | 6.78  | 0.676 | 0.018 | 0.018 | 0.039
REFERENCES

[1] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345-1359, 2010.
[2] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, "Domain adaptation via transfer component analysis," IEEE Transactions on Neural Networks, vol. 22, no. 2, pp. 199-210, 2011.
[3] M. Gheisari and M. S. Baghshah, "Unsupervised domain adaptation via representation learning and adaptive classifier learning," Neurocomputing, vol. 165, pp. 300-311, 2015.
[4] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu, "Transfer feature learning with joint distribution adaptation," in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2200-2207.
[5] X. Shi, Q. Liu, W. Fan, and P. S. Yu, "Transfer across completely different feature spaces via spectral embedding," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 4, pp. 906-918, 2013.
[6] L. Fillatre, I. Nikiforov, P. Willett et al., "Security of SCADA systems against cyber-physical attacks," IEEE Aerospace and Electronic Systems Magazine, vol. 32, no. 5, pp. 28-45, 2017.
[7] C.-A. Hou, Y.-H. H. Tsai, Y.-R. Yeh, and Y.-C. F. Wang, "Unsupervised domain adaptation with label and structural consistency," IEEE Transactions on Image Processing, vol. 25, no. 12, pp. 5552-5562, 2016.
[8] I. Guyon, G. Dror, V. Lemaire, G. Taylor, and D. W. Aha, "Unsupervised and transfer learning challenge," in The 2011 International Joint Conference on Neural Networks (IJCNN). IEEE, 2011, pp. 793-800.
[9] B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars, "Unsupervised visual domain adaptation using subspace alignment," in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2960-2967.
[10] B. Fernando, T. Tommasi, and T. Tuytelaars, "Joint cross-domain classification and subspace learning for unsupervised adaptation," Pattern Recognition Letters, vol. 65, pp. 60-66, 2015.
[11] E. I. G. Nassara, E. Grall-Maës, and M. Kharouf, "Linear discriminant analysis for large-scale data: Application on text and image data," in 2016 15th IEEE International Conference on Machine Learning and Applications (ICMLA). IEEE, 2016, pp. 961-964.
[12] T. Hastie, R. Tibshirani, and J. Friedman, "The elements of statistical learning: data mining, inference and prediction," New York: Springer-Verlag, vol. 1, no. 8, pp. 371-406, 2001.
[13] A. K. Menon and C. Elkan, "Fast algorithms for approximating the singular value decomposition," ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 5, no. 2, p. 13, 2011.
[14] N. E. I. Gado, E. Grall-Maës, and M. Kharouf, "Linear discriminant analysis based on fast approximate SVD," in ICPRAM, 2017, pp. 359-365.
[15] J. Tao, D. Song, S. Wen, and W. Hu, "Robust multi-source adaptation visual classification using supervised low-rank representation," Pattern Recognition, vol. 61, pp. 47-65, 2017.
[16] P. Soucy and G. W. Mineau, "A simple KNN algorithm for text categorization," in Proceedings of the 2001 IEEE International Conference on Data Mining (ICDM). IEEE, 2001, pp. 647-648.