
Domain Space Transfer Extreme Learning Machine for Domain Adaptation

Yiming Chen, Shiji Song, Shuang Li, Le Yang, and Cheng Wu

Abstract—Extreme learning machine (ELM) has been applied in a wide range of classification and regression problems due to its high accuracy and efficiency. However, ELM can only deal with cases where training and testing data are drawn from an identical distribution, while in real-world situations this assumption is often violated. As a result, ELM performs poorly in domain adaptation problems, in which the training data (source domain) and testing data (target domain) are differently distributed but somehow related. In this paper, an ELM-based space learning algorithm, domain space transfer ELM (DST-ELM), is developed to deal with unsupervised domain adaptation problems. To be specific, through DST-ELM, the source and target data are reconstructed in a domain-invariant space with the target data labels unavailable. Two goals are achieved simultaneously. One is that the target data are input into an ELM-based feature space learning network whose output is supposed to approximate the input, such that the target domain structural knowledge and the intrinsic discriminative information can be preserved as much as possible. The other is that the source data are projected into the same space as the target data, and the distribution distance between the two domains is minimized in that space. This unsupervised feature transformation network is followed by an adaptive ELM classifier which is trained on the transferred labeled source samples and is used for target data label prediction. Moreover, the ELMs in the proposed method, including both the space learning ELM and the classifier, require only a small number of hidden nodes, thus maintaining low computation complexity. Extensive experiments on real-world image and text datasets are conducted and verify that our approach outperforms several existing domain adaptation methods in terms of accuracy while maintaining high efficiency.

Index Terms—Domain adaptation, extreme learning machine (ELM), maximum mean discrepancy (MMD), space learning.

I. INTRODUCTION

Traditional classification problems deal with a situation where training and testing data are identically distributed, and thus a classifier trained from training data

Manuscript received June 22, 2017; revised January 23, 2018; accepted March 7, 2018. This work was supported in part by the National Natural Science Foundation of China under Grant 41427806 and Grant 61273233, and in part by the National Key Research and Development Program under Grant 2016YFB1200203. This paper was recommended by Associate Editor X. Wang. (Corresponding author: Shiji Song.) The authors are with the Department of Automation, Tsinghua University, Beijing 100084, China (e-mail: [email protected]; [email protected]; [email protected]; yangle15@mails.tsinghua.edu.cn; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TCYB.2018.2816981

can be directly applied to the testing data. Theoretical studies on classifiers are also based on this assumption [1]–[3]. However, in real-world applications, the distributions of training and testing data are often different [4]–[8]. A typical situation is that, when acquiring labeled samples of the domain of interest (referred to as the target domain) is expensive or time-consuming, only the data of a relevant domain (source domain) are available [9], [10]. We expect the abundant labeled data from the relevant domain to be reused for the original classification task. Therefore, the domain adaptation problem has emerged, in which a reasonably accurate target domain classifier is supposed to be learned using both some unlabeled target data and a large amount of labeled source data drawn from a different but related distribution.

One strategy for domain adaptation is to find a domain-invariant feature representation to minimize the distribution divergence. A number of existing methods employing this strategy learn a shared feature subspace where the mismatch between the two domains is reduced and then apply a standard classification method in the subspace [11]–[14]. Maximum mean discrepancy (MMD) is a frequently used metric for the mismatch between domains. Pan et al. [11] proposed a domain-invariant feature learning algorithm named transfer component analysis (TCA), in which a feature representation minimizing the MMD distance between the two domains is learned based on the transfer components. Fernando et al. [15] aligned the target PCA subspace to the source one, and then projected the raw data onto the aligned subspace. However, these methods learn feature representations without explicitly preserving the data structural information. Some other methods adopt a low-rank linear reconstruction and cluster the cross-domain data into various subspaces to achieve robust adaptation [16]–[18]. Another line of work mitigates the distribution discrepancy by reweighting the samples instead of learning new features, for instance, kernel mean matching (KMM) [19] and prediction reweighting for domain adaptation [20].

Neural networks have been widely studied in terms of their intrinsic properties [21], [22], and exploited in many specific applications, such as classification [23], [24] and programming problems [25], [26]. As a single hidden layer feed-forward neural network, the extreme learning machine (ELM) proposed by Huang et al. [27]–[29] has been widely applied to pattern classification and regression problems including domain adaptation, due to its high performance and efficiency [30]. In the ELM network, the input weights between the input and hidden nodes are randomly assigned, and the output weights between the hidden and output nodes are determined by a



loss-function minimization problem, usually a regularized least square problem, which can be solved analytically. Therefore, the parameters in ELM can be directly assigned or calculated without the iterative tuning or gradient descent used in common neural networks, resulting in high efficiency. Moreover, it has been proved by Huang et al. [31] that if the hidden layer activation function is a bounded nonconstant piecewise continuous function, then ELM can asymptotically approximate any continuous function as the number of hidden layer nodes approaches infinity. A detailed review on ELM along with other neural networks with random weights can be found in [32]. In practical applications, ELM has shown comparable or even better performance in classification and regression tasks than the support vector machine (SVM). The differences and comparisons between ELM and SVM are discussed in [33] and [34]. Recent years have seen quite a number of variants of ELM, in terms of both models and algorithms, as well as its application in a wide range of fields [35]–[37], including semi-supervised learning [38] and feature extraction [39].

Similar to other conventional classification methods, a classical ELM is unable to deal with the domain adaptation problem, since it is based on the assumption that training and testing data are identically distributed. Some modifications of ELM have been proposed to address the problem. Zhang and Zhang [40] put forward a framework which leverages some labeled target data to train an adaptive model. Two forms of ELM under this framework are proposed in [40], called source domain adaptation ELM (DAELM-S) and target DAELM (DAELM-T), respectively. However, DAELM requires abundant labeled target data to yield satisfactory results, while it is usually hard to gain access to real target labels. A cross-domain network that simultaneously learns a category transformation and an ELM classifier is proposed in [41]. In the transfer learning algorithm proposed in [30], a source domain ELM is first trained and then used as a constraint while training a target domain ELM. The methods mentioned above are restricted to semi-supervised cases in which several labeled target samples are required, a condition that is often not satisfied. Instead of coping with such a scenario, we aim at the unsupervised domain adaptation problem, i.e., no labeled target data are available. Uzair and Mian [42] proposed a blind domain adaptation algorithm named augmented ELM (AELM) based on both global and class-specific ELM-based autoencoders. This algorithm provides a nonlinear ELM mechanism to deal with the situation where no target data are available during training. AELM learns a specific ELM-based autoencoder for each class of the dataset to extract class information. When the number of classes is large, this procedure may be time-consuming, which we hope to overcome.

To sum up, there are several challenges in domain adaptation problems. The first one is the dissimilarity between the source and target domains, which needs to be eliminated, since a standard classifier cannot deal with training and testing data from different distributions. To cope with this issue, finding an appropriate distribution distance metric is of top priority. The second is the loss of useful information during feature adaptation. Discriminative information or structural knowledge may be sacrificed to bring the domains closer after feature transfer.



Fig. 1. Illustration of the DST-ELM algorithm on two artificial datasets. In the plots, red/black and purple/blue points represent the source and target samples, respectively. Squares and asterisks denote the positive and negative samples. Plots (a) and (b) show the circle data before and after transfer by DST-ELM, and plots (c) and (d) show the normally distributed data. The target data are kept almost unchanged or better clustered, and the source data are transferred to better match the target data.

The third challenge is that the learning process is supposed to be efficient, whereas a complex algorithm may suffer from a high time cost. We believe that a good domain adaptation algorithm should tackle all three challenges, and we therefore aim to address them simultaneously. We aim to design a domain adaptation space transfer model which reduces the between-domain divergence without sacrificing useful information contained in the original feature space. The model should also be solved efficiently, without heavy computation or many hyper-parameters. To realize this goal, in this paper we put forward a domain transfer approach which learns a domain-invariant space by treating the source and target domains discriminatively within the ELM framework. The model is denoted as domain space transfer ELM (DST-ELM). The basic idea behind the model is illustrated in Fig. 1. In the two datasets, the source and target data follow very different distributions. A new feature representation of the data is learned such that the target domain either barely changes, as in Fig. 1(b), or is better clustered, as in Fig. 1(d), and the source data are transferred to be more similarly distributed to the target data. As a result, the two domains share almost the same classification hyperplane.

To achieve the data transfer shown in Fig. 1, an appropriate ELM objective function is adopted in the proposed approach. The objective function forces the learned target features in the new space to preserve important information as much as possible. In the meantime, the source domain-invariant features are exploited. We adopt the MMD distance to measure the difference between domains, and transfer the source data to minimize the MMD distance between domains in the learned space. In summary, the whole proposed approach: 1) carries out effective feature space learning to minimize the domain difference without destroying useful target domain information; 2) achieves accurate cross-domain classification using the learned features; and 3) maintains high efficiency as well. The contributions of this paper are summarized as follows.


1) Based on the ELM theory, we propose a DST-ELM method to learn a new feature space in which the domain shift is diminished. We transfer the source and target data in different ways. Specifically, the reconstruction is performed with the intrinsic target domain discriminative information fixed, and the source domain is adapted to the target so that the MMD distance between their distributions is minimized. The model enriches the theory of ELM and expands the application of ELM from classifier training to space learning.

2) By joining the DST-ELM and an ELM classifier together, we make it possible for a standard ELM classifier to deal with unsupervised domain adaptation problems by reconstructing the input data.

3) The DST-ELM inherits the efficiency advantage of ELM compared to several other existing MMD-based domain adaptation methods, even though the data are mapped to a space with the same dimensionality instead of a lower-dimensional subspace. Furthermore, we also show that the best ELM network width for our method is small, thus maintaining fast computation even when the number of data points is huge.

4) The effectiveness and efficiency of our method are verified by extensive experiments on real-world datasets. We have conducted experiments on both image (Office, Caltech-256, MNIST, USPS) and text (20-Newsgroups) datasets.

The rest of this paper is organized as follows. In Section II, a brief review of related works on ELM and domain adaptation is presented. In Section III, we describe the basic framework of ELM and the background of domain adaptation. The formal problem definition and a detailed description of our proposed algorithm are introduced in Section IV. In Section V, we conduct a series of experiments and present the results. Section VI gives some concluding discussions.

II. LITERATURE REVIEW

A. ELM for Domain Adaptation

Although ELM has been widely applied in various problems and tasks, the issue of ELM-based domain adaptation (EDA) has not been adequately studied. To the best of our knowledge, there exist only a few studies on ELM with cross-domain learning. Zhang and Zhang [40] proposed a unified framework, DAELM. In the framework, a limited number of labeled target data as well as the labeled source data are leveraged to learn a robust classifier. Two algorithms, DAELM-S and DAELM-T, are designed based on this framework. In [30], a transfer learning ELM (TL-ELM) is introduced. TL-ELM makes up for the inability of ELM to transfer knowledge by using a small number of labeled target data and a large number of source data to build an ELM model with high generalization capability. In [41], an ELM-based cross-domain learning framework named EDA is proposed to learn a robust classifier. It minimizes the matching error between the learned classifier and a base classifier, and incorporates manifold regularization. Li et al. [43] proposed a free sparse transfer learning algorithm based on ELM, along with its kernel extension,


which can freely transfer knowledge by using graph-Laplacian regularization and penalizing the diversity between consecutive classifiers. These frameworks are different from our method in that a small number of labeled target samples are leveraged, which makes the problem less challenging. Our method, however, deals with the unsupervised case in which no target sample is labeled.

B. Autoencoder for Domain Adaptation

To deal with the target data, our model adopts a mechanism whose desired output is the same as its input, which is similar to an autoencoder. In fact, several transfer learning algorithms based on autoencoder networks have been proposed. Chen et al. [44] utilized the marginalized stacked denoising autoencoder (mSDA), in which the outputs of all layers are concatenated with the original features to form the new representation. The mSDA learns the denoising parameters by marginalizing out the feature corruption, which avoids gradient descent for parameter optimization. Deng et al. [45] focused on speech emotion recognition and learned an adaptive denoising autoencoder (A-DAE) based on a target-prior denoising autoencoder (DAE), and the data are reconstructed through the A-DAE. The difference between the A-DAE and our method lies in that the A-DAE method aims to adapt the connection weights from one domain to another, while we directly adapt the data points through a specific metric, e.g., MMD. An ELM-based autoencoder (ELM-AE) is also a useful feature extraction network [46], [47]. In [42], a blind domain adaptation algorithm with AELM features is proposed using the ELM-AE. The algorithm learns a global ELM-AE on the source domain and specific ELM-AEs on each class of the source data. Then the target data are augmented with the reconstructed features from the global ELM-AE. Finally, the augmented features are classified by the class-specific ELM-AEs based on the minimum reconstruction error. The drawback of AELM is that learning the class-specific ELM-AEs may be time-consuming, especially when the number of classes is large. Different from the algorithm in [42], we only learn a single global network with an MMD term added to the loss function.

C. Maximum Mean Discrepancy and Domain Adaptation

Gretton et al. [48] introduced the MMD as a distribution distance measure. Based on this metric, a number of domain adaptation algorithms have been proposed. KMM [19] learns a set of weights by minimizing the MMD in a reproducing kernel Hilbert space (RKHS) to reweight the source samples such that the domain divergence is eliminated. It is different from our method in that we reconstruct the dataset by a universal nonlinear function or mechanism, instead of adapting each data point by reweighting it individually. Pan et al. [11] proposed to find a representation through TCA, which tries to learn cross-domain transfer components in an RKHS based on MMD. In the subspace spanned by the transfer components, the data variance can be preserved as much as possible, while the distance across domains is minimized. Joint distribution adaptation (JDA) [12] exploits the source label



knowledge in the TCA framework, and transfer joint matching (TJM) [13] introduces row-sparsity to the transformation matrix for instance reweighting. Recently, Jiang et al. [14] proposed the integration of global and local metrics for domain adaptation learning (IGLDA) model, which takes into consideration the local geometry of the source domain. Li et al. [49] reduced the MMD between predicted labels instead of samples. Different from the aforementioned methods, the DST-ELM reconstructs the data without dimensionality reduction and deals with the different domains in different manners in order to preserve the class information, while maintaining low computation complexity.

III. BACKGROUND

In this section, the basic framework of ELM and some notations and definitions of the domain adaptation problem will be introduced.

A. Extreme Learning Machine

Consider a training dataset {X, T} = {x_i, t_i}_{i=1}^{N} with N samples, where x_i ∈ ℝ^m and t_i ∈ ℝ^d are the input and output training samples, respectively. The output samples t_i represent either the ground truth for classification, in which case only the entry of the corresponding class in t_i equals one and the others are zero, or the desired output features for regression. In either case, a regression function mapping the input samples to the output samples is estimated. A single hidden layer feed-forward network (SLFN) is essentially a form of regression function. In an SLFN, a hidden layer of L hidden nodes fully connects the inputs and outputs. Define h(x) ∈ ℝ^{1×L} as the hidden layer nonlinear feature mapping output and g(·) as the activation function, which is a nonlinear piecewise continuous function such as the sigmoid function g(u) = 1/(1 + e^{-u}). Then the particular h_i(x) can be expressed as

$$h_i(x) = g(x; w_i, b_i), \quad w_i \in \mathbb{R}^m, \; b_i \in \mathbb{R} \tag{1}$$

where w_i and b_i are the input weights and bias, respectively. The output of the SLFN for x is

$$f(x) = h(x)\beta \tag{2}$$

where β ∈ ℝ^{L×d} is the output weight.

An ELM generates the input weights and biases randomly, and transforms the input data into a random feature space with a nonlinear mapping function g(·). This is the major difference from the above-mentioned SLFN paradigms. Since w_i and b_i are randomly initialized and never updated, β can be determined by a least square optimization problem that minimizes the sum of squared prediction errors. This is expressed in the regularized ELM formulation

$$\min_{\beta \in \mathbb{R}^{L \times d}} \; \frac{1}{2}\|\beta\|^2 + \frac{C}{2}\sum_{i=1}^{N}\|\xi_i\|^2 \quad \text{s.t.} \; h(x_i)\beta = t_i^T - \xi_i^T, \; i = 1, \dots, N \tag{3}$$

where the first term is a regularizer to mitigate over-fitting, and the second term is the sum of prediction errors. ξ_i ∈ ℝ^d is the error vector with respect to the ith training sample, and C is a tradeoff coefficient to balance the two terms. Problem (3) can be converted to an unconstrained optimization problem as

$$\min_{\beta \in \mathbb{R}^{L \times d}} \; \frac{1}{2}\|\beta\|^2 + \frac{C}{2}\|T - H\beta\|^2 \tag{4}$$

where T = [t_1, t_2, …, t_N]^T ∈ ℝ^{N×d} is the desired output matrix and H = [h(x_1)^T, h(x_2)^T, …, h(x_N)^T]^T ∈ ℝ^{N×L} is the randomized hidden layer output matrix. This regularized least square problem can be solved analytically by setting the gradient with respect to β to zero, which yields

$$\beta^* - C H^T (T - H\beta^*) = 0. \tag{5}$$

Equation (5) has the closed-form solution

$$\beta^* = \begin{cases} H^T \left( H H^T + \dfrac{I_N}{C} \right)^{-1} T & \text{if } N < L \\[6pt] \left( H^T H + \dfrac{I_L}{C} \right)^{-1} H^T T & \text{if } N \ge L \end{cases} \tag{6}$$

where I is the identity matrix with the subscript indicating its dimensionality. Since the output weights can be determined directly from (6), the parameters in ELM are never updated by iterative back propagation, which makes ELM extremely efficient and flexible.

Consider a special circumstance where the desired output is T = X. In such a case, the ELM output approximates the input data, based on which we will propose our method.
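To make the formulation concrete, the following is a minimal Python/NumPy sketch of the regularized ELM above: a random sigmoid hidden layer as in (1)-(2) and the closed-form output weights of (6). It is an illustrative sketch with our own function names and defaults, not code released with the paper.

```python
import numpy as np

def elm_fit(X, T, L=1000, C=1.0, seed=0):
    """Train a regularized ELM: random hidden layer (1)-(2), closed-form weights (6)."""
    rng = np.random.default_rng(seed)
    N, m = X.shape
    W = rng.standard_normal((m, L))               # random input weights w_i
    b = rng.standard_normal(L)                    # random biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))        # hidden layer output, sigmoid g(u)
    if N < L:
        # beta* = H^T (H H^T + I_N / C)^{-1} T
        beta = H.T @ np.linalg.solve(H @ H.T + np.eye(N) / C, T)
    else:
        # beta* = (H^T H + I_L / C)^{-1} H^T T
        beta = np.linalg.solve(H.T @ H + np.eye(L) / C, H.T @ T)
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta                                # f(x) = h(x) beta, see (2)
```

Setting T = X in `elm_fit` gives the reconstruction behavior mentioned above, which DST-ELM builds on.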



B. Maximum Mean Discrepancy

Given two distributions p and q, many metrics have been developed to estimate their mutual distance, for instance, the Kullback–Leibler (K–L) divergence [50] and the Bregman divergence [51]. However, many estimators are parametric or need an intermediate density estimation procedure. MMD is an effective nonparametric distribution distance measurement based on the maximum mean function value difference between the two distributions

$$\mathrm{MMD}^2(\mathcal{F}, p, q) := \left[ \sup_{f \in \mathcal{F}} \left( \mathbb{E}_{x \sim p} f(x) - \mathbb{E}_{y \sim q} f(y) \right) \right]^2 \tag{7}$$

where 𝓕 is a class of functions. It has been proven that when 𝓕 is universal, MMD²(𝓕, p, q) = 0 if and only if p = q. Consider two datasets X = {x_1, x_2, …, x_n} ⊂ 𝒳 and Y = {y_1, y_2, …, y_m} ⊂ 𝒳 drawn independently and identically from distributions p and q, respectively. An empirical estimate of the MMD between the two datasets in an RKHS ℋ is

$$\mathrm{MMD}^2(X, Y) = \left\| \frac{1}{n} \sum_{i=1}^{n} \phi(x_i) - \frac{1}{m} \sum_{j=1}^{m} \phi(y_j) \right\|_{\mathcal{H}}^2 \tag{8}$$

where φ is the kernel-induced feature mapping, and ‖·‖_ℋ is the RKHS norm.


C. Basic Knowledge on Domain Adaptation

In the unsupervised domain adaptation problem, the data originate from two distinct domains, source 𝒮 and target 𝒯. From the source domain we sample the source data X_S = {x_{S_i}}_{i=1}^{n_S} with label knowledge L_S = {y_{S_i}}_{i=1}^{n_S}, whereas from the target domain we sample the target data X_T = {x_{T_j}}_{j=1}^{n_T} without labels. n_S and n_T are the numbers of source and target samples, respectively. We assume that the source and target samples belong to the same feature space, i.e., x_S, x_T ∈ 𝒳, and the labels y_S ∈ 𝒴. Let P_S(X_S) and P_T(X_T) (P_S and P_T in short) be the marginal probability distributions of the source and target domains, and Q_S(Y_S|X_S) and Q_T(Y_T|X_T) (Q_S and Q_T in short) be the conditional distributions. In general, the distributions can be different, i.e., P_S ≠ P_T or Q_S ≠ Q_T. Our method aims to reconstruct new data representations X̃_S and X̃_T from the original data X_S and X_T such that: 1) P_S(X̃_S) ≈ P_T(X̃_T) and Q_S(Y_S|X̃_S) ≈ Q_T(Y_T|X̃_T) and 2) X̃_T ≈ X_T. We use the empirical MMD distance between the source and target domains as a metric to evaluate whether 1) is achieved, and the ELM output is appropriately set as mentioned in Section III-A to ensure 2). Accordingly, the marginal and conditional MMD are defined as follows:

$$\mathrm{MMD}^2_{\mathrm{mar}}(X_S, X_T) = \left\| \frac{1}{n_S}\sum_{i=1}^{n_S} x_{S_i} - \frac{1}{n_T}\sum_{j=1}^{n_T} x_{T_j} \right\|_{\mathcal{H}}^2 \tag{9}$$

$$\mathrm{MMD}^2_{\mathrm{con}}(X_S, X_T) = \sum_{k=1}^{K} \left\| \frac{1}{n_S^{(k)}}\sum_{y_{S_i}=k} x_{S_i} - \frac{1}{n_T^{(k)}}\sum_{\hat{y}_{T_j}=k} x_{T_j} \right\|_{\mathcal{H}}^2 \tag{10}$$

where K is the total number of classes, and n_S^{(k)}, n_T^{(k)} are the numbers of kth-class samples of the source and target data, respectively. In unsupervised domain adaptation, the true labels of the target samples y_{T_j} are unavailable, and thus we will use some kind of pseudo labels ŷ_{T_j}.
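As a concrete illustration of (9) and (10), the following Python/NumPy sketch computes the empirical marginal and conditional MMD with the identity feature map (i.e., a linear kernel). The function names are ours, and the pseudo labels are simply whatever the current classifier predicts for the target samples.

```python
import numpy as np

def marginal_mmd2(XS, XT):
    """Squared empirical marginal MMD (9) with the identity feature map."""
    diff = XS.mean(axis=0) - XT.mean(axis=0)
    return float(diff @ diff)

def conditional_mmd2(XS, yS, XT, yT_pseudo):
    """Squared conditional MMD (10): class-wise mean differences using target pseudo labels."""
    total = 0.0
    for k in np.unique(yS):
        Sk, Tk = XS[yS == k], XT[yT_pseudo == k]
        if len(Sk) and len(Tk):
            d = Sk.mean(axis=0) - Tk.mean(axis=0)
            total += float(d @ d)
    return total
```

A kernelized variant would replace the raw features with a kernel feature map, as in (8); the linear case above is what the matrix forms used later correspond to.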

IV. PROPOSED DOMAIN SPACE TRANSFER ELM

Our proposed DST-ELM will be introduced and described in detail in this section, and a computation complexity analysis is also presented.

A. Problem Definition

The goal of this paper is to predict the target labels y_{T_j} for classification using only the unlabeled testing samples (target samples) and labeled training samples from a different distribution (source samples). An important strategy is to eliminate the distribution difference to associate the cross-domain data. The problem of minimizing the distance between the distributions can be expressed as

$$\min_{f \in \mathcal{F}} \; \mathrm{distance}(f(X_S), f(X_T)) + \|f\|_{\mathcal{H}}^2. \tag{11}$$

The regularization term ‖f‖²_ℋ is added to avoid overfitting. Specifically, if f(·) is a linear mapping, which is a simple case, (11) can be converted to

$$\min_{W} \; \mathrm{distance}(W^T X_S, W^T X_T) + \|W\|_p^2 \tag{12}$$

where ‖·‖_p indicates the L_p norm. Obviously, the optimal solution of the unconstrained minimization problem (11) is f*(·) ≡ 0, which means the data information would be totally eliminated by the transform. To cope with this issue, some kind of modification is demanded. A widely accepted method is to add a constraint to the minimization problem. Frequently used is a variance-property-preserving constraint W^T X H X^T W = I [11], [12], where X = [X_S, X_T] is the cross-domain dataset, H = I_n − (1/n)𝟏𝟏^T is the centering matrix, n is the number of samples, I_n ∈ ℝ^{n×n} is the identity matrix, and 𝟏 ∈ ℝ^n is the column vector of all ones.

Our method tackles the information-preserving issue in a different form. We aim to design a space learning mechanism which receives data as input and behaves discriminatively according to the data domain. To be specific, when the input data are from the target domain, the mechanism is supposed to learn a feature space in which the useful target data class information is preserved after a nonlinear transformation. On the other hand, when the input data are drawn from the source domain, we desire that the distribution distance reducing process be implemented. We formally define our algorithm as the following optimization problem:

$$\min_{f \in \mathcal{F}} \; \mathrm{distance}(f(X_S), f(X_T)) + \ell(X_T, f(X_T)) + \|f\|_{\mathcal{H}}^2 \tag{13}$$

where ℓ(X_T, f(X_T)) is the information loss function. We add this term to prevent target data information from being lost during the transfer process. The key point of our algorithm is to appropriately define the functions distance(·,·) and ℓ(·,·).

In the space learning process, a nonlinear feature mapping is necessary for better generalization. We use ELM as the data reconstruction and classification algorithm to enhance the generalization capability. The whole network of our proposed domain transfer and classification algorithm is shown in Fig. 2. A random nonlinear feature mapping is first implemented for more generalized space learning. A standard ELM classifier is then trained on the transferred source samples for predicting target data labels. One advantage of ELM is its low computation complexity. As a whole, a cross-domain ELM-based classifier network is put forward. Details of our proposed method, DST-ELM, will be presented in Sections IV-B and IV-C.

B. Domain Space Transfer ELM

Based on the description above, two components should be selected in the DST-ELM objective function for the target and source data, respectively: 1) an information loss function on the target data, ℓ(X_T, f(X_T)), and 2) a distance measure between the reconstructed source and target domains. In our DST-ELM algorithm, we directly define the information loss function as the difference between the output and input target data, such that the target samples are almost fixed (or better clustered) without dimensionality reduction.


Fig. 2. Proposed DST-ELM process. The feature transfer projection is learned by an ELM-based algorithm. An ELM classifier is trained from the transferred source data to predict labels of the transferred target data.

The definition is given as follows:

$$\ell(X_T, f(X_T)) = \sum_{j=1}^{n_T} \left\| f(x_{T_j}) - x_{T_j} \right\|^2.$$

On the other hand, the nonparametric MMD is used as the distance metric function

$$\mathrm{distance}(f(X_S), f(X_T)) = \mathrm{MMD}^2(f(X_S), f(X_T)).$$

Then the DST-ELM formulation is

$$\begin{aligned} \min_{\beta \in \mathbb{R}^{L \times m}} \; & \frac{1}{2}\|\beta\|^2 + \frac{C}{2}\sum_{j=1}^{n_T}\|\xi_{T_j}\|^2 + \frac{\lambda}{2}\mathrm{MMD}^2(\tilde{S}, \tilde{T}) \\ \text{s.t.} \; & h(x_{T_j})\beta = x_{T_j}^T - \xi_{T_j}^T, \quad j = 1, \dots, n_T \\ & h(x_{S_i})\beta = \phi(x_{S_i}), \quad i = 1, \dots, n_S \\ & h(x_{T_j})\beta = \phi(x_{T_j}), \quad j = 1, \dots, n_T \end{aligned} \tag{14}$$

where m is the dimensionality of the raw data points, and β ∈ ℝ^{L×m} is the output weight between the hidden and output layers. h(x_{S_i}), h(x_{T_j}) ∈ ℝ^{1×L} are the outputs of the hidden layer with respect to the ith source sample and the jth target sample, and ξ_{T_j} ∈ ℝ^m is the jth target sample reconstruction error vector. C and λ are positive tradeoff parameters to balance the effects of the terms. The term MMD²(S̃, T̃) denotes the squared MMD distance between the reconstructed source and target domains, where S̃ = {φ(x_{S_i})}_{i=1}^{n_S} and T̃ = {φ(x_{T_j})}_{j=1}^{n_T} are the reconstructed data, and φ(x_{S_i}), φ(x_{T_j}) denote the reconstructed ith source sample and jth target sample, respectively. In fact, the function φ(·) represents the whole DST-ELM mapping. We explain the components of DST-ELM in detail in the following.

1) Target Information Loss: For the target domain, DST-ELM is supposed to reconstruct the data such that the output does not change much from the input. As mentioned in Section III-A, the ELM can achieve this by letting T = X. The formulation is

$$\min_{\beta \in \mathbb{R}^{L \times m}} \; \frac{1}{2}\|\beta\|^2 + \frac{C}{2}\sum_{j=1}^{n_T}\|\xi_{T_j}\|^2 \quad \text{s.t.} \; h(x_{T_j})\beta = x_{T_j}^T - \xi_{T_j}^T, \; j = 1, \dots, n_T. \tag{15}$$

Similarly, the above problem can be converted to an unconstrained form by replacing the error term with the constraints

$$\min_{\beta \in \mathbb{R}^{L \times m}} \; \frac{1}{2}\|\beta\|^2 + \frac{C}{2}\|X_T - H_T\beta\|_F^2 \tag{16}$$

where H_T is the target data hidden layer output matrix and ‖·‖_F denotes the Frobenius norm. We rewrite (16) in an equivalent form

$$\min_{\beta \in \mathbb{R}^{L \times m}} \; \frac{1}{2}\|\beta\|^2 + \frac{1}{2}\mathrm{Tr}\left[ (X_T - H_T\beta)^T \mathbf{C} (X_T - H_T\beta) \right] \tag{17}$$

where 𝐂 = diag(C, C, …, C) ∈ ℝ^{n_T×n_T}, and Tr[·] is the trace of a matrix, i.e., the sum of its diagonal entries. Different from a traditional autoencoder network, which learns both input and output weights by backpropagation and contains useful information in the hidden layer output, the reconstruction-error-minimizing ELM generally does not provide much more information than the original input dataset does. Compared to autoencoder models, we focus more on data reconstruction than on feature extraction in DST-ELM.
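The following is a minimal sketch of the target-only subproblem (16) (i.e., DST-ELM with λ = 0) on toy data; the closed form below follows from setting the gradient of (16) to zero. Data sizes and parameter values are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
XT = rng.standard_normal((200, 20))                # toy target data, n_T x m
L, C = 100, 0.01

W, b = rng.standard_normal((20, L)), rng.standard_normal(L)
HT = 1.0 / (1.0 + np.exp(-(XT @ W + b)))           # target hidden layer output H_T

# Minimizer of (16): beta = (I_L + C * H_T^T H_T)^{-1} (C * H_T^T X_T)
beta = np.linalg.solve(np.eye(L) + C * (HT.T @ HT), C * (HT.T @ XT))

recon = HT @ beta                                   # reconstructed target samples h(x) beta
rel_err = np.linalg.norm(recon - XT) / np.linalg.norm(XT)
print(f"relative reconstruction error: {rel_err:.3f}")
# A small C keeps beta (and the change to the data) small; a larger C fits X_T more closely.
```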


2) MMD of Reconstructed Data: As a nonparametric estimator of distribution distance, MMD is utilized to transfer knowledge from one domain to another. In DST-ELM, the empirical MMD term is added to the ELM formulation as a penalty. The MMD measure is

$$\mathrm{MMD}^2(\tilde{S}, \tilde{T}) = \left\| \frac{1}{n_S}\sum_{i=1}^{n_S} \phi(x_{S_i}) - \frac{1}{n_T}\sum_{j=1}^{n_T} \phi(x_{T_j}) \right\|^2 \tag{18}$$

where φ(x_{S_i}) = h(x_{S_i})β and φ(x_{T_j}) = h(x_{T_j})β. Define H = [H_S; H_T] as the whole randomized data matrix. Then the MMD term can be expressed in matrix form as

$$\mathrm{MMD}^2(\tilde{S}, \tilde{T}) = \mathrm{Tr}\left[ \beta^T H^T M H \beta \right] \tag{19}$$

where M ∈ ℝ^{(n_S+n_T)×(n_S+n_T)} is the MMD matrix defined as

$$M_{ij} = \begin{cases} \dfrac{1}{n_S^2} & \text{if } i, j \le n_S \\[4pt] \dfrac{1}{n_T^2} & \text{if } i, j > n_S \\[4pt] -\dfrac{1}{n_S n_T} & \text{otherwise.} \end{cases}$$

3) Overall Model and Optimization: The DST-ELM implements the data information preservation and the distribution distance reduction simultaneously, but reconstructs the data of the two domains in distinguished ways, as shown above. Substituting (17) and (19) into (14), we obtain the DST-ELM objective function

$$\min_{\beta \in \mathbb{R}^{L \times m}} \; \frac{1}{2}\|\beta\|^2 + \frac{1}{2}\mathrm{Tr}\left[ (X_T - H_T\beta)^T \mathbf{C} (X_T - H_T\beta) \right] + \frac{\lambda}{2}\mathrm{Tr}\left[ \beta^T H^T M H \beta \right]. \tag{20}$$

For convenience, we extend the matrices to consistent dimensions. Let X̃ = [0_{n_S×m}; X_T] ∈ ℝ^{(n_S+n_T)×m} and 𝐂̃ = diag(0_{n_S×n_S}, C, C, …, C) ∈ ℝ^{(n_S+n_T)×(n_S+n_T)}, where 0_{k×l} is a k by l matrix with all zero elements. Then the minimization problem (20) is rewritten as

$$\min_{\beta \in \mathbb{R}^{L \times m}} \; \frac{1}{2}\|\beta\|^2 + \frac{1}{2}\mathrm{Tr}\left[ (\tilde{X} - H\beta)^T \tilde{\mathbf{C}} (\tilde{X} - H\beta) \right] + \frac{\lambda}{2}\mathrm{Tr}\left[ \beta^T H^T M H \beta \right]. \tag{21}$$

Obviously, problem (21) is a convex quadratic minimization problem, whose optimum equals its stationary point, i.e.,

$$\beta^* + H^T \tilde{\mathbf{C}} H \beta^* - H^T \tilde{\mathbf{C}} \tilde{X} + \lambda H^T M H \beta^* = 0. \tag{22}$$

The solution of (22) is similar to (6) in that the number of hidden nodes L matters. When L > n_S + n_T, we have

$$\beta^* = H^T \left( I_{n_S+n_T} + (\tilde{\mathbf{C}} + \lambda M) H H^T \right)^{-1} \tilde{\mathbf{C}} \tilde{X}. \tag{23}$$

Otherwise, when L ≤ n_S + n_T, the alternative solution is

$$\beta^* = \left( I_L + H^T (\tilde{\mathbf{C}} + \lambda M) H \right)^{-1} H^T \tilde{\mathbf{C}} \tilde{X}. \tag{24}$$

4) Reconstruction and Classification: Finally, we reconstruct the data points {x_{S_i}}_{i=1}^{n_S} and {x_{T_j}}_{j=1}^{n_T} as {h(x_{S_i})β*}_{i=1}^{n_S} and {h(x_{T_j})β*}_{j=1}^{n_T} with the β* learned by (23) or (24). The reconstructed representation can be directly input into an ELM network to train an adaptive ELM classifier for target classification. Furthermore, the weights β* and the trained ELM classifier can be fixed and applied to new target data as long as the target domain remains unchanged.

C. Conditional Mean Matching

In the previously described DST-ELM model, the marginal distribution distance is taken into consideration, but the conditional distribution is not. In fact, algorithms that merely reduce the marginal distance between domains may not perform well because the label information of the source data is not utilized. To fully leverage the available dataset, the source class label knowledge should also be transferred to the target domain to improve the data reconstruction. For this purpose, we add a conditional MMD term to the proposed formulation, in which the initial pseudo labels of the target samples are acquired from a classifier trained on either the original source data or the source samples reconstructed with the marginal MMD measure. We denote the squared conditional MMD as MMD²_con(S̃, T̃), which is defined as

$$\mathrm{MMD}^2_{\mathrm{con}}(\tilde{S}, \tilde{T}) = \sum_{k=1}^{K} \left\| \frac{1}{n_S^{(k)}}\sum_{y_{S_i}=k} \phi(x_{S_i}) - \frac{1}{n_T^{(k)}}\sum_{\hat{y}_{T_j}=k} \phi(x_{T_j}) \right\|^2. \tag{25}$$

The notations are the same as those in (10). Converting (25) to matrix form yields

$$\mathrm{MMD}^2_{\mathrm{con}}(\tilde{S}, \tilde{T}) = \mathrm{Tr}\left[ \beta^T H^T \left( \sum_{k=1}^{K} M_k \right) H \beta \right] \tag{26}$$

where M_k ∈ ℝ^{(n_S+n_T)×(n_S+n_T)} is the kth class MMD matrix defined as

$$(M_k)_{ij} = \begin{cases} \dfrac{1}{\left(n_S^{(k)}\right)^2} & \text{if } i, j \le n_S \text{ and } y_{S_i} = y_{S_j} = k \\[6pt] \dfrac{1}{\left(n_T^{(k)}\right)^2} & \text{if } i, j > n_S \text{ and } \hat{y}_{T_{(i-n_S)}} = \hat{y}_{T_{(j-n_S)}} = k \\[6pt] -\dfrac{1}{n_S^{(k)} n_T^{(k)}} & \text{if } i \le n_S,\ j > n_S \text{ and } y_{S_i} = \hat{y}_{T_{(j-n_S)}} = k \\[6pt] -\dfrac{1}{n_S^{(k)} n_T^{(k)}} & \text{if } i > n_S,\ j \le n_S \text{ and } \hat{y}_{T_{(i-n_S)}} = y_{S_j} = k \\[6pt] 0 & \text{otherwise.} \end{cases}$$

The pseudo labels of the target samples {ŷ_{T_j}}_{j=1}^{n_T} can be updated iteratively to improve their accuracy until convergence, thus enhancing the performance. Incorporating the conditional MMD into DST-ELM, we obtain the formulation

$$\min_{\beta \in \mathbb{R}^{L \times m}} \; \frac{1}{2}\|\beta\|^2 + \frac{1}{2}\mathrm{Tr}\left[ (\tilde{X} - H\beta)^T \tilde{\mathbf{C}} (\tilde{X} - H\beta) \right] + \frac{\lambda}{2}\mathrm{Tr}\left[ \beta^T H^T \left( M + \sum_{k=1}^{K} M_k \right) H \beta \right]. \tag{27}$$

If L > n_S + n_T, the solution is

$$\beta^* = H^T \left( I_{n_S+n_T} + \left( \tilde{\mathbf{C}} + \lambda M + \lambda \sum_{k=1}^{K} M_k \right) H H^T \right)^{-1} \tilde{\mathbf{C}} \tilde{X}. \tag{28}$$
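For concreteness, the following NumPy sketch computes the marginal-only output weights via the L ≤ n_S + n_T form (24) and returns the reconstructed samples h(x)β*. It is an illustrative sketch under our own naming and defaults, not the authors' released code; the L > n_S + n_T case would use (23) instead.

```python
import numpy as np

def dst_elm_marginal(XS, XT, L=200, C=0.01, lam=1.0, seed=0):
    """One-shot DST-ELM space transfer with the marginal MMD only, following (19)-(24)."""
    rng = np.random.default_rng(seed)
    nS, nT, m = len(XS), len(XT), XS.shape[1]
    X = np.vstack([XS, XT])

    # Shared random hidden layer (sigmoid activation).
    W, b = rng.standard_normal((m, L)), rng.standard_normal(L)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))               # (nS+nT) x L

    # MMD matrix M of (19) and the padded matrices X~ and C~ of (21).
    e = np.concatenate([np.full(nS, 1.0 / nS), np.full(nT, -1.0 / nT)])
    M = np.outer(e, e)
    X_pad = np.vstack([np.zeros((nS, m)), XT])
    c_diag = np.concatenate([np.zeros(nS), np.full(nT, C)])

    # Closed form (24): beta* = (I_L + H^T (C~ + lam M) H)^{-1} H^T C~ X~.
    lhs = np.eye(L) + H.T @ (c_diag[:, None] * H + lam * (M @ H))
    beta = np.linalg.solve(lhs, H.T @ (c_diag[:, None] * X_pad))

    Z = H @ beta                                          # reconstructed samples h(x) beta*
    return Z[:nS], Z[nS:], beta
```

The reconstructed source samples, together with the source labels, can then be fed to a standard ELM classifier as described in Section IV-B4.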


Algorithm 1 DST-ELM
Input: Source samples {x_{S_i}}_{i=1}^{n_S}; target samples {x_{T_j}}_{j=1}^{n_T}; source labels {y_{S_i}}_{i=1}^{n_S}.
1: Construct the MMD matrix M, and set M_k = 0, k = 1, 2, …, K. Set the iteration number T.
2: Set parameters L, C, and λ.
3: for t = 1 : T do
4:   Initialize the random input weights and calculate the hidden layer output H.
5:   Compute the output weights using (28) or (29) according to L and n_S + n_T.
6:   Calculate the reconstructed samples {φ_t(x_{S_i})}_{i=1}^{n_S} and {φ_t(x_{T_j})}_{j=1}^{n_T}.
7:   Train an adaptive ELM classifier F_t on the reconstructed source samples.
8:   Update the target pseudo labels and the conditional matrices {M_k}_{k=1}^{K} using F_t.
9: end for
Output: F ← F_t, φ ← φ_t. The final classification network is F ∘ φ.

And if L ≤ n_S + n_T, we have

$$\beta^* = \left( I_L + H^T \left( \tilde{\mathbf{C}} + \lambda M + \lambda \sum_{k=1}^{K} M_k \right) H \right)^{-1} H^T \tilde{\mathbf{C}} \tilde{X}. \tag{29}$$
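The sketch below extends the marginal sketch above to the full iterative loop of Algorithm 1, using the L ≤ n_S + n_T closed form (29) and a small ELM classifier for the pseudo-label updates. It is an illustrative implementation under our own naming and defaults, and assumes integer class labels 0, …, K−1.

```python
import numpy as np

def hidden(X, W, b):
    """Random-feature hidden layer with sigmoid activation, h(x) = g(xW + b)."""
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

def mmd_matrix(nS, nT):
    """Marginal MMD matrix M defined after (19)."""
    e = np.concatenate([np.full(nS, 1.0 / nS), np.full(nT, -1.0 / nT)])
    return np.outer(e, e)

def class_mmd_sum(yS, yT_pseudo, K, nS, nT):
    """Sum of the class-wise MMD matrices M_k defined after (26)."""
    total = np.zeros((nS + nT, nS + nT))
    for k in range(K):
        s = np.where(yS == k)[0]
        t = np.where(yT_pseudo == k)[0] + nS
        if len(s) and len(t):
            e = np.zeros(nS + nT)
            e[s], e[t] = 1.0 / len(s), -1.0 / len(t)
            total += np.outer(e, e)
    return total

def dst_elm(XS, yS, XT, K, L=200, C=0.01, lam=1.0, C_clf=1.0, n_iter=10, seed=0):
    """Iterative DST-ELM (Algorithm 1), conditional version; an illustrative sketch."""
    rng = np.random.default_rng(seed)
    nS, nT, m = len(XS), len(XT), XS.shape[1]
    X = np.vstack([XS, XT])
    X_pad = np.vstack([np.zeros((nS, m)), XT])          # padded target matrix of (21)
    c_diag = np.concatenate([np.zeros(nS), np.full(nT, C)])
    M = mmd_matrix(nS, nT)
    Mk = np.zeros_like(M)                               # M_k = 0 in the first iteration
    Y = np.eye(K)[yS]                                    # one-hot source labels

    for _ in range(n_iter):
        # Steps 4-5: random hidden layer and closed-form output weights via (29).
        W, b = rng.standard_normal((m, L)), rng.standard_normal(L)
        H = hidden(X, W, b)
        A = H.T @ (c_diag[:, None] * H + lam * ((M + Mk) @ H))
        beta = np.linalg.solve(np.eye(L) + A, H.T @ (c_diag[:, None] * X_pad))

        # Step 6: reconstructed samples phi_t(x) = h(x) beta.
        Z = H @ beta
        ZS, ZT = Z[:nS], Z[nS:]

        # Step 7: adaptive ELM classifier trained on the reconstructed source samples.
        Wc, bc = rng.standard_normal((m, L)), rng.standard_normal(L)
        HS = hidden(ZS, Wc, bc)
        beta_c = np.linalg.solve(HS.T @ HS + np.eye(L) / C_clf, HS.T @ Y)

        # Step 8: update target pseudo labels and the conditional MMD matrices.
        yT_pseudo = np.argmax(hidden(ZT, Wc, bc) @ beta_c, axis=1)
        Mk = class_mmd_sum(yS, yT_pseudo, K, nS, nT)

    return ZS, ZT, yT_pseudo
```

The first pass uses only the marginal term (M_k = 0), matching the description below; subsequent passes add the conditional term built from the current pseudo labels.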

The DST-ELM algorithm is summarized in Algorithm 1. The first iteration simply adapts the marginal distributions, while both the marginal and conditional distribution MMD are taken into account in the following iterations. Target pseudo labels are updated in each iteration for use in the next one.

D. Computation Complexity

Based on the previous description, we now briefly discuss the efficiency of our method. We analyze the computation complexity using big-O notation. Throughout our method in Algorithm 1, the major computation cost lies in the matrix inverse and multiplication in step 5. Denote the total number of samples as N, i.e., N = n_S + n_T. For comparison, we consider the model without conditional MMD, i.e., T = 1. According to (28) and (29), step 5 costs at most O(min{N, L}^3); the computational cost of training the ELM in steps 4 and 6 is O(2NLd); and the construction of the MMD matrices in step 1 costs O(N^2). Here, d is the dimensionality of the samples. The overall computational cost is O(min{N, L}^3 + 2NLd + N^2). Note that the training of a standard ELM classifier in step 7 is not considered, because it is a relatively independent stage of the whole DST-ELM procedure. This is fair when comparing DST-ELM with other feature transfer algorithms. On the other hand, traditional feature representation reconstruction algorithms based on MMD, such as TCA, deal with a generalized eigen-decomposition problem, which costs O(d^2 p), where p is the dimensionality of the learned subspace. The computation complexity of the matrix multiplication in TCA is O(Npd), and the construction of the MMD matrix costs O(N^2). The overall computational cost

of TCA is O(d^2 p + Npd + N^2). It can be seen that DST-ELM is particularly beneficial for cases where the number of features, i.e., the parameter d, is large. Moreover, the experience on real-world data shows that the number of nodes in the hidden layer, L, need not be very large. Hence, in most cases we have L ≪ N, and the method can handle datasets with a huge number of samples.

V. EXPERIMENTS

In this section, experiments are performed on cross-domain datasets including image and text data for classification. We compare our approach with several related methods.

A. Dataset Specifications

The image datasets used in our experiments are Office,1 Caltech-256,2 USPS,3 and MNIST.4 These are all widely used benchmark datasets in visual domain adaptation problems. Office [52], [53] is a popular benchmark for domain adaptation which consists of 4652 images belonging to 31 distinct object categories. These images are from three different domains: AMAZON (images downloaded from the online merchant www.amazon.com), Webcam (low-resolution images taken by a simple Web camera), and DSLR (high-resolution images taken by a digital SLR camera). Caltech-256 [54] is a well-known standard object recognition dataset. It contains 30 607 images of 256 object classes. In our experiments, we use the Office and Caltech-256 datasets provided by Gong et al. [53]. Ten object categories shared by all four domains are extracted, and the datasets are preprocessed by the feature extraction method and experimental protocols in [53]. The domains are denoted as C (Caltech-256), A (AMAZON), W (Webcam), and D (DSLR). We can select two different domains in order as the source and target domains, respectively, and construct 12 cross-domain datasets, e.g., C → A, C → W, …, W → D.

USPS and MNIST are standard hand-written digit datasets. USPS contains 7291 training images and 2007 testing images of size 16 × 16. MNIST consists of 60 000 training images and 10 000 testing images of size 28 × 28. The two datasets share the same ten digit classes but follow different distributions. As in [12], 1800 samples from USPS and 2000 samples from MNIST are randomly chosen to form the two domains. All the images are uniformly resized to 16 × 16 and vectorized. Details of the image datasets are described in Table I. Fig. 3 shows example images of these two datasets.

We use 20-Newsgroups5 as the text dataset. This dataset is a collection of approximately 20 000 newsgroup documents organized into six main categories and 20 subcategories. Following [55], we select four categories and construct four datasets, denoted as COMP, REC, SCI, and TALK, for binary classification. Similarly, 12 cross-domain datasets can be constructed. The 20-Newsgroups dataset is summarized in Table II.

1 http://www-scf.usc.edu/boqinggo/domainadaptation.html
2 Same as 1.
3 http://www.cad.zju.edu.cn/home/dengcai/Data/MLData.html
4 http://yann.lecun.com/exdb/mnist/
5 http://qwone.com/jason/20Newsgroups


TABLE I
DESCRIPTION OF THE SIX BENCHMARK IMAGE DATASETS


Fig. 3. Example images of the Caltech-256 + Office datasets and the MNIST + USPS datasets.

TABLE II
DESCRIPTION OF THE 20-NEWSGROUPS DATASET

B. Experimental Settings

We compare DST-ELM with two standard methods and five existing domain adaptation methods. The two standard classifiers are SVM6 [56] and ELM [27]. The domain adaptation methods are TCA [11], IGLDA [14], mSDA [44], ELM with feature augmentation (AELM) [42], and DAELM [40], including DAELM-S and DAELM-T. For TCA and mSDA, the output data with the learned new feature representation are classified by a standard SVM.

It is worth noting that, since the target (test) data are unlabeled, parameters cannot be automatically tuned via cross validation. Therefore, we manually select the parameters and report the best accuracy results. For TCA, we set the tradeoff parameter to 0.1. For IGLDA, both parameters μ and λ are set to 1, and the linear kernel is used. The dimensionality of the subspace in the two dimension-reduction methods is searched in {10, 20, 50, 100, 200, 500, 1000}. The mSDA results for the image datasets are reported from [42]. For the mSDA algorithm on the text datasets, we adopt the extension for high-dimensional data proposed in [44], in which we set the dimensionality of the shortened vector to r = 5000 and search the optimal corruption probability in the range {0.5, 0.6, 0.7, 0.8, 0.9}. In all the ELM-based methods, the number of hidden nodes is set to L = 1000 for fair comparison. For AELM, the two tradeoff parameters (C_1, C_2) in the learning of the input and output weights are decided by searching {10^-3, 10^-2, …, 10^2} × {10^-3, 10^-2, …, 10^2}, where × indicates the Cartesian product. In DAELM-S, the labeled target samples are chosen randomly from each class, and the number of labeled samples per class is set to 1 and 2. We select the penalty coefficients C_S and C_T in {10^-3, 10^-2, …, 10^2}. DAELM-T is similar to DAELM-S, and its three penalty coefficients C_S, C_T, and C_Tu are all selected by searching {10^-3, 10^-2, …, 10^2}. In our method, we set the MMD distance tradeoff parameter λ = 1 and the number of hidden nodes L = 1000 throughout the experiments. The target data reconstruction parameter C is set to 0.01 or 0.001 for each dataset. For the model involving conditional mean matching, the number of iterations is fixed to 10.

6 https://www.csie.ntu.edu.tw/cjlin/libsvm/

C. Results and Analysis

1) Image Datasets: Table III summarizes the accuracies of the above methods for different pairings of source and target domains within the image datasets. The best result of each pairing among all the methods, as well as among the unsupervised methods, is presented in bold. The DST-ELM model outperforms the other methods in most of the tasks (9 out of 14). The average accuracy of our method on the 14 tasks is 51.74%, a relative improvement of 3.19% over the best baseline. Moreover, the relative improvement over the best unsupervised method is 6.59%. This verifies that the proposed DST-ELM algorithm is effective. It is worth noting that DST-ELM performs particularly well on the digit datasets. A possible explanation is that the digit data are naturally quite discriminative, and our method works well at preserving the discriminative information.

The two standard classification algorithms, SVM and ELM, fail to achieve satisfactory performance due to their assumption that the training and testing data are identically distributed, which does not hold in the domain adaptation problem. It can be noticed that the traditional ELM and AELM algorithms do not perform very well compared to the other methods. In fact, the two ELM-based algorithms both suffer from the small number of hidden nodes. Generally, the width of the ELM network, i.e., the number of hidden nodes L, should be appropriately set such that useful features can be extracted while keeping the computation complexity low. A small L may result in poor performance of the ELM algorithm, because an ELM with few hidden nodes may be unable to accurately approximate the objective function. The effect of L will be discussed in the analytic experiments.

DAELM is actually a semi-supervised method which requires some labeled target data. It is obvious that the available labeled target samples succeed in mitigating the negative effect of the small network width in both DAELM-S and DAELM-T, and good performance is achieved. It can also be seen that the more labeled target samples are available, the better the algorithm works, and the improvement is quite large.


TABLE III
AVERAGE ACCURACY (%) ON OBJECT AND HAND-WRITTEN DIGIT DATASETS

TABLE IV
AVERAGE ACCURACY (%) ON 20-NEWSGROUPS DATASETS

However, the DAELM method still suffers from the absence of distribution distance reduction. The number of labeled target samples is actually small compared to the number of source samples, and the DAELM learning cannot essentially remove the domain difference. The feature transformation methods TCA and IGLDA achieve good results compared to the other unsupervised algorithms. It is worth noting that IGLDA has the best performance among all the unsupervised baseline methods. However, these methods project the target and source domains jointly and ignore the inherent target discriminative knowledge. In comparison, our method performs better classification, especially on the digit datasets, in which the discriminative knowledge of the data is more salient and useful. The mSDA method achieves results comparable to TCA on the cross-domain visual tasks, although the mSDA algorithm was originally designed for sentiment analysis.

The results show that DST-ELM with conditional mean matching performs much better than the marginal version. In fact, the conditional MMD can be seen as a term for preserving the source data class information, such that the clustering property of the source data will not be severely destroyed while

reducing the domain distance. The few exceptions may be attributed to incorrect pseudo labels, which is a limitation of using label conditional distributions.

2) Text Datasets: The accuracies of the methods on the 20-Newsgroups text datasets are presented in Table IV. The average accuracy of DST-ELM is 82.73%, which is the best result. The DST-ELM variant that does not reduce the conditional distribution divergence is comparable to TCA. A possible explanation for the high accuracy of DST-ELM is that, in such a binary classification problem, the conditional mean matching works quite well, thus improving the domain distance reduction. The standard ELM only achieves an accuracy of 64.51%, much worse than SVM. In fact, all the ELM-based methods, including DAELM, fail to achieve high performance. This is also due to the small number of hidden nodes L. Note that the dimensionality of the data points in 20-Newsgroups is 25 804, much larger than 1000, which means that it is quite difficult to obtain an accurate prediction or reconstruction.

For DAELM, the performance improvement with the increase of labeled target data is small compared to that for the object and digit data. This phenomenon suggests that, as the


TABLE V
AVERAGE ACCURACY (%) ON DeCAF6 OFFICE+CALTECH256 DATASETS

Fig. 4. Accuracies of DST-ELM on Office+Caltech256 datasets based on K–L divergence and MMD distance.

dataset scale becomes larger, more labeled target samples may accordingly be needed for semi-supervised methods.

Different from the results in the image data experiments, the accuracies of mSDA are lower than those of the MMD-based IGLDA on the text datasets. The adopted mSDA on 20-Newsgroups is actually an extended version designed for high-dimensional data, in which only a limited number of the most frequent "pivot features" are reconstructed with randomly ordered nonoverlapping subsets of the input features. The eventual classification accuracy is affected by the selection of pivot features and may suffer from the loss of information.

D. Extended Discussion and Evaluation

1) Feature Effectiveness: In fact, DST-ELM can be seen as a generalized cross-domain subspace learning algorithm. To verify the effectiveness of the features learned by DST-ELM, we conduct an extended experiment on the DeCAF6 features of the Office+Caltech256 datasets [57] and utilize the nonparametric 1-nearest neighbor as the baseline classifier. The DeCAF6 features are extracted from a deep convolutional neural network trained on ImageNet [23]. The object datasets are input into the well-trained CNN and the outputs of the sixth layer are used. The feature dimensionality is 4096. We compare the accuracies on DST-ELM features with those of several state-of-the-art MMD-based feature transformation methods, i.e., TCA, JDA, and TJM. The IGLDA model is also evaluated. Results are listed in Table V. From the results, we observe that: 1) high-quality features have been extracted from the CNN, which results in clearly improved accuracies compared to the results on the 800-D data; 2) the poor performance of the baseline 1-NN demonstrates that the domain shift is still likely to exist; and 3) DST-ELM significantly outperforms the other transfer algorithms in all the tasks, which indicates that the features learned by DST-ELM are effectively clustered.

2) Distance Function: We also conduct experiments with the MMD distance replaced by the K–L divergence. Without loss of generality, we assume that the datasets follow normal distributions. We first centralize the output source and target samples such that both have zero means. Then we calculate the covariance matrices, denoted by Σ_S and Σ_T, respectively. The K–L divergence between domains is defined as

$$d(\Sigma_S \,\|\, \Sigma_T) = \mathrm{Tr}\left( \Sigma_T^{-1} \Sigma_S \right) - \log\det \Sigma_S + \log\det \Sigma_T.$$

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination. 12

IEEE TRANSACTIONS ON CYBERNETICS

TABLE VI AVERAGE RUNNING T IMES ON 20-N EWSGROUPS DATASETS

F. Execution Time

Fig. 5. Accuracies of DST-ELM with respect to the variation of λ under different values of L.

We check the execution time of the methods using 20-Newsgroups dataset, and report the average running times of the 12 tasks in Table VI. The ELM-based methods enjoy higher efficiency, including our proposed DST-ELM. The time cost is significantly reduced compared with TCA, which proves that the DST-ELM is beneficial for the datasets of high dimensionality. The DST-ELM is slower than the traditional ELM because a standard ELM classifier is a part of the DST-ELM procedure in the experiments, and thus more time is cost by the feature representation learning process than a standard ELM training. And DST-ELM does not reduce the data dimensionality, which also results in more time being spent on weights learning. The marginal version of DST-ELM is much faster than mSDA and AELM. However, the conditional version of DST-ELM costs more time than mSDA and AELM in that an iterative process is implemented. The results verify that our proposed method is effective and efficient for domain adaptation. VI. C ONCLUSION

Fig. 6. of L.

Accuracies of ELM-based algorithms with respect to the variation

matrix calculation in traditional ELM algorithms, particularly when the sample number is huge. It can be observed that the performances will be boosted when the number of hidden nodes increases for ELM and semi-supervised DAELM, but the other ELM-based methods only reach the highest point at a particular L. Note that the classical ELM and DAELM directly utilize labeled data to train a classifier, whereas the other algorithms focus more on feature or classifier adaptation. In the former case, a larger network is helpful to the exploration and exploitation of the invariant features of both domains. In contrast, in the latter case, a huge amount of hidden nodes may not be beneficial because they may force the ELM network to behave better on output function approximation and degrade the performance on the adaptation, for instance, reducing MMD distance.

VI. CONCLUSION

In this paper, we have proposed a space transfer method based on the ELM network. Unlike other domain transfer algorithms, our goal is to preserve the target domain discriminative information and to transfer the source domain so that it better matches the target distribution by reducing the MMD distance. The objective functions are incorporated into the ELM framework, and the algorithm inherits the efficiency advantage of ELM. The proposed DST-ELM performs well on object, digit, and text recognition tasks. In the future, we plan to investigate subspace learning through ELM more deeply. At present, subspace learning through ELM suffers from the randomization of the hidden node values, which restricts the performance of knowledge transfer. Better information preservation techniques may also be developed for this problem to improve feature transfer within the ELM framework.

REFERENCES

[1] I. Kushchu, “Genetic programming and evolutionary generalization,” IEEE Trans. Evol. Comput., vol. 6, no. 5, pp. 431–442, Oct. 2002. [2] G. C. Cawley and N. L. C. Talbot, “On over-fitting in model selection and subsequent selection bias in performance evaluation,” J. Mach. Learn. Res., vol. 11, pp. 2079–2107, Jul. 2010.


[3] X.-Z. Wang et al., “A study on relationship between generalization abilities and fuzziness of base classifiers in ensemble learning,” IEEE Trans. Fuzzy Syst., vol. 23, no. 5, pp. 1638–1654, Oct. 2015. [4] J. Blitzer, R. McDonald, and F. Pereira, “Domain adaptation with structural correspondence learning,” in Proc. Conf. Empirical Methods Nat. Lang. Process., Sydney, NSW, Australia, 2006, pp. 120–128. [5] J. Blitzer, M. Dredze, and F. Pereira, “Biographies, bollywood, boomboxes and blenders: Domain adaptation for sentiment classification,” in Proc. ACL, vol. 7. Prague, Czech Republic, 2007, pp. 440–447. [6] L. Duan, I. W. Tsang, D. Xu, and S. J. Maybank, “Domain transfer SVM for video concept detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Miami, FL, USA, 2009, pp. 1375–1381. [7] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010. [8] A. Torralba and A. A. Efros, “Unbiased look at dataset bias,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Colorado Springs, CO, USA, 2011, pp. 1521–1528. [9] S.-H. Fang and T.-N. Lin, “Indoor location system based on discriminant-adaptive neural network in IEEE 802.11 environments,” IEEE Trans. Neural Netw., vol. 19, no. 11, pp. 1973–1978, Nov. 2008. [10] A. Bergamo and L. Torresani, “Exploiting weakly-labeled Web images to improve object classification: A domain adaptation approach,” in Proc. Adv. Neural Inf. Process. Syst., 2010, pp. 181–189. [11] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, “Domain adaptation via transfer component analysis,” IEEE Trans. Neural Netw., vol. 22, no. 2, pp. 199–210, Feb. 2011. [12] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu, “Transfer feature learning with joint distribution adaptation,” in Proc. IEEE Int. Conf. Comput. Vis., Sydney, NSW, Australia, 2013, pp. 2200–2207. [13] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu, “Transfer joint matching for unsupervised domain adaptation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Columbus, OH, USA, 2014, pp. 1410–1417. [14] M. Jiang, W. Huang, Z. Huang, and G. G. Yen, “Integration of global and local metrics for domain adaptation learning via dimensionality reduction,” IEEE Trans. Cybern., vol. 47, no. 1, pp. 38–51, Jan. 2017. [15] B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars, “Unsupervised visual domain adaptation using subspace alignment,” in Proc. IEEE Int. Conf. Comput. Vis., Sydney, NSW, Australia, 2013, pp. 2960–2967. [16] I.-H. Jhuo, D. Liu, D. T. Lee, and S.-F. Chang, “Robust visual domain adaptation with low-rank reconstruction,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Providence, RI, USA, 2012, pp. 2168–2175. [17] M. Shao, D. Kit, and Y. Fu, “Generalized transfer subspace learning through low-rank constraint,” Int. J. Comput. Vis., vol. 109, nos. 1–2, pp. 74–93, 2014. [18] L. Zhang, W. Zuo, and D. Zhang, “LSDT: Latent sparse domain transfer learning for visual adaptation,” IEEE Trans. Image Process., vol. 25, no. 3, pp. 1177–1191, Mar. 2016. [19] J. Huang et al., “Correcting sample selection bias by unlabeled data,” in Proc. Adv. Neural Inf. Process. Syst., vol. 19, 2007, pp. 601–608. [20] S. Li, S. Song, and G. Huang, “Prediction reweighting for domain adaptation,” IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 7, pp. 1682–1695, Jul. 2017. [21] H.-B. Zeng, Y. He, M. Wu, and H.-Q. Xiao, “Improved conditions for passivity of neural networks with a time-varying delay,” IEEE Trans. Cybern., vol. 44, no. 
6, pp. 785–792, Jun. 2014. [22] J. Xiao, S. Zhong, Y. Li, and F. Xu, “Relaxed exponential passivity criteria for memristor-based neural networks with leakage and time-varying delays,” Int. J. Mach. Learn. Cybern., vol. 8, no. 6, pp. 1875–1886, 2017. [23] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105. [24] C.-T. Lin, M. Prasad, and A. Saxena, “An improved polynomial neural network classifier using real-coded genetic algorithm,” IEEE Trans. Syst., Man, Cybern., Syst., vol. 45, no. 11, pp. 1389–1401, Nov. 2015. [25] H. Chen et al., “Standard plane localization in fetal ultrasound via domain transferred deep neural networks,” IEEE J. Biomed. Health Inform., vol. 19, no. 5, pp. 1627–1636, Sep. 2015. [26] Y. Zhang, “A projected-based neural network method for second-order cone programming,” Int. J. Mach. Learn. Cybern., vol. 8, no. 6, pp. 1907–1914, 2017. [27] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, “Extreme learning machine: A new learning scheme of feedforward neural networks,” in Proc. IEEE Int. Joint Conf. Neural Netw., vol. 2. Budapest, Hungary, 2004, pp. 985–990.


[28] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, “Extreme learning machine: Theory and applications,” Neurocomputing, vol. 70, nos. 1–3, pp. 489–501, 2006. [29] G. Huang, G.-B. Huang, S. Song, and K. You, “Trends in extreme learning machines: A review,” Neural Netw., vol. 61, pp. 32–48, Jan. 2015. [30] X. Li, W. Mao, and W. Jiang, “Extreme learning machine based transfer learning for data classification,” Neurocomputing, vol. 174, pp. 203–210, Jan. 2016. [31] G.-B. Huang, L. Chen, and C. K. Siew, “Universal approximation using incremental constructive feedforward networks with random hidden nodes,” IEEE Trans. Neural Netw., vol. 17, no. 4, pp. 879–892, Jul. 2006. [32] W. Cao, X. Wang, Z. Ming, and J. Gao, “A review on neural networks with random weights,” Neurocomputing, vol. 275, pp. 278–287, Jan. 2018. [33] G.-B. Huang, H. Zhou, X. Ding, and R. Zhang, “Extreme learning machine for regression and multiclass classification,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 42, no. 2, pp. 513–529, Apr. 2012. [34] L. Zhang, D. Zhang, and F. Tian, “SVM and ELM: Who wins? Object recognition with deep convolutional features from ImageNet,” in Proceedings of ELM-2015 Volume 1. Cham, Switzerland: Springer, 2016, pp. 249–263. [35] J. Tang, C. Deng, and G.-B. Huang, “Extreme learning machine for multilayer perceptron,” IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 4, pp. 809–821, Apr. 2016. [36] Y. Yang and Q. M. J. Wu, “Extreme learning machine with subnetwork hidden nodes for regression and classification,” IEEE Trans. Cybern., vol. 46, no. 12, pp. 2885–2898, Dec. 2016. [37] W. Mao, J. Wang, and Z. Xue, “An ELM-based model with sparseweighting strategy for sequential data imbalance problem,” Int. J. Mach. Learn. Cybern., vol. 8, no. 4, pp. 1333–1345, 2017. [38] G. Huang, S. Song, J. N. D. Gupta, and C. Wu, “Semi-supervised and unsupervised extreme learning machines,” IEEE Trans. Cybern., vol. 44, no. 12, pp. 2405–2417, Dec. 2014. [39] Y. Peng and B.-L. Lu, “Discriminative extreme learning machine with supervised sparsity preserving for image classification,” Neurocomputing, vol. 261, pp. 242–252, Oct. 2017. [40] L. Zhang and D. Zhang, “Domain adaptation extreme learning machines for drift compensation in e-nose systems,” IEEE Trans. Instrum. Meas., vol. 64, no. 7, pp. 1790–1801, Jul. 2015. [41] L. Zhang and D. Zhang, “Robust visual knowledge transfer via extreme learning machine-based domain adaptation,” IEEE Trans Image Process., vol. 25, no. 10, pp. 4959–4973, Oct. 2016. [42] M. Uzair and A. Mian, “Blind domain adaptation with augmented extreme learning machine features,” IEEE Trans. Cybern., vol. 47, no. 3, pp. 651–660, Mar. 2017. [43] X. Li, W. Mao, W. Jiang, and Y. Yao, “Extreme learning machine via free sparse transfer representation optimization,” Memetic Comput., vol. 8, no. 2, pp. 85–95, 2016. [44] M. Chen, K. Q. Weinberger, Z. E. Xu, and F. Sha, “Marginalizing stacked linear denoising autoencoders,” J. Mach. Learn. Res., vol. 16, no. 1, pp. 3849–3875, 2015. [45] J. Deng, Z. Zhang, E. Marchi, and B. Schuller, “Sparse autoencoderbased feature transfer learning for speech emotion recognition,” in Proc. Humaine Assoc. Conf. Affect. Comput. Intell. Interact. (ACII), Geneva, Switzerland, 2013, pp. 511–516. [46] L. L. C. Kasun, H. Zhou, G.-B. Huang, and C. M. Vong, “Representational learning with ELMs for big data,” IEEE Intell. Syst., vol. 28, no. 6, pp. 31–34, Dec. 2013. [47] S. Ding, N. Zhang, J. Zhang, X. Xu, and Z. 
Shi, “Unsupervised extreme learning machine with representational features,” Int. J. Mach. Learn. Cybern., vol. 8, no. 2, pp. 587–595, 2017. [48] A. Gretton et al., “A kernel method for the two-sample-problem,” in Proc. Adv. Neural Inf. Process. Syst., vol. 19. Vancouver, BC, Canada, 2007, pp. 513–520. [49] S. Li, S. Song, G. Huang, and C. Wu, “Cross-domain extreme learning machines for domain adaptation,” IEEE Trans. Syst., Man, Cybern., Syst., to be published. [50] X. Cao, D. Wipf, F. Wen, G. Duan, and J. Sun, “A practical transfer learning algorithm for face verification,” in Proc. IEEE Int. Conf. Comput. Vis., Sydney, NSW, Australia, 2013, pp. 3208–3215. [51] S. Si, D. Tao, and B. Geng, “Bregman divergence-based regularization for transfer subspace learning,” IEEE Trans. Knowl. Data Eng., vol. 22, no. 7, pp. 929–942, Jul. 2010. [52] K. Saenko, B. Kulis, M. Fritz, and T. Darrell, “Adapting visual category models to new domains,” in Computer Vision—ECCV 2010. Heidelberg, Germany: Springer, 2010, pp. 213–226.



[53] B. Gong, Y. Shi, F. Sha, and K. Grauman, “Geodesic flow kernel for unsupervised domain adaptation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Providence, RI, USA, 2012, pp. 2066–2073. [54] G. Griffin, A. Holub, and P. Perona, “Caltech-256 object category dataset,” California Inst. Technol., Pasadena, CA, USA, Rep. CNS-TR-2007-001, 2007. [55] M. Long, J. Wang, J. Sun, and P. S. Yu, “Domain invariant transfer kernel learning,” IEEE Trans. Knowl. Data Eng., vol. 27, no. 6, pp. 1519–1532, Jun. 2015. [56] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, p. 27, 2011. [57] J. Donahue et al., “DeCAF: A deep convolutional activation feature for generic visual recognition,” in Proc. ICML, vol. 32. Beijing, China, 2014, pp. 647–655.

Yiming Chen received the B.S. degree from the Department of Automation, Tsinghua University, Beijing, China, in 2015, where he is currently pursuing the Ph.D. degree with the Institute of System Integration, Department of Automation. His current research interests include pattern recognition and machine learning, especially in transfer learning and domain adaptation.

Shiji Song received the Ph.D. degree from the Department of Mathematics, Harbin Institute of Technology, Harbin, China, in 1996. He is a Professor with the Department of Automation, Tsinghua University, Beijing, China. His current research interests include system modeling, control and optimization, computational intelligence, and pattern recognition.

Shuang Li received the B.S. degree from the Department of Automation, Northeastern University, Shenyang, China, in 2012. He is currently pursuing the Ph.D. degree with the Institute of System Integration, Department of Automation, Tsinghua University, Beijing, China. He was a Visiting Research Scholar with the Department of Computer Science, Cornell University, Ithaca, NY, USA, from 2015 to 2016. His current research interests include machine learning and pattern recognition, especially in transfer learning and domain adaptation.

Le Yang received the B.S. degree from the Department of Automation, Northwestern Polytechnical University, Xi’an, China, in 2015. He is currently pursuing the Ph.D. degree with the Institute of System Integration, Department of Automation, Tsinghua University, Beijing, China. His current research interests include machine learning and pattern recognition, especially in dimension reduction and feature extraction.

Cheng Wu received the B.S. and M.S. degrees in electrical engineering from Tsinghua University, Beijing, China. Since 1967, he has been with Tsinghua University, where he is currently a Professor with the Department of Automation. His current research interests include system integration, modeling, scheduling, and optimization of complex industrial systems. Mr. Wu is a member of the Chinese Academy of Engineering.