On Covariate Shift Adaptation via Sparse Filtering


arXiv:1607.06781v1 [cs.LG] 22 Jul 2016

Fabio Massimo Zennaro School of Computer Science The University of Manchester Manchester, M13 9PL, UK [email protected]

Ke Chen School of Computer Science The University of Manchester Manchester, M13 9PL, UK [email protected]

Abstract

A major challenge in machine learning is covariate shift, i.e., the problem of training data and test data coming from different distributions. This paper studies the feasibility of tackling this problem by means of sparse filtering. We show that the sparse filtering algorithm intrinsically addresses this problem, but that it has limited capacity for covariate shift adaptation. To overcome this limit, we propose a novel semi-supervised sparse filtering algorithm, named periodic sparse filtering. Our proposed algorithm is formally analyzed and empirically evaluated on a purpose-built synthetic data set and on real speech emotion data sets. As a result, we argue that, as an alternative methodology, feature distribution learning has great potential for carrying out covariate shift adaptation.

1 Introduction

A common assumption in machine learning is that the data samples are independent and identically distributed (i.i.d.), that is, training and test samples are independent and drawn from the same probability distribution. On the basis of this assumption, the model learned from training data is expected to generalize to test data. Unfortunately, this assumption often does not hold in the real world. A more realistic assumption is covariate shift [14], which states that training data $X^{tr}$ and test data $X^{tst}$ are sampled from two multivariate random variables $\mathcal{X}^{tr}$ and $\mathcal{X}^{tst}$ with different distributions $p(\mathcal{X}^{tr})$ and $p(\mathcal{X}^{tst})$, while the conditional distribution of labels given the data, $p(Y|\mathcal{X})$, is assumed to be the same over training and test data, i.e., $p(Y|\mathcal{X}^{tr}) = p(Y|\mathcal{X}^{tst})$.

Learning under covariate shift requires modeling $p(Y|\mathcal{X})$ while taking into account the fact that $p(\mathcal{X}^{tr}) \neq p(\mathcal{X}^{tst})$. One solution is offered by representation learning: if we can learn a mapping $f$ that projects the training and test data to new representations $Z^{tr}$ and $Z^{tst}$ having similar distributions $p(\mathcal{Z}^{tr})$ and $p(\mathcal{Z}^{tst})$, then we can learn $p(Y|\mathcal{Z})$ using standard machine learning algorithms [2].

Sparse filtering (SF) is a feature distribution learning algorithm that learns a maximally sparse representation of the data instead of modelling the true data distribution [9]. Although SF has exhibited good performance in real applications [9] and is backed by some theoretical justification [17], this algorithm has never been related to covariate shift adaptation. Hence, it is still unclear whether and how SF can address the issue of covariate shift.

In this paper, we investigate the connection between SF and covariate shift adaptation and subsequently propose a novel variant of SF for covariate shift adaptation. We formulate and answer the following research questions. 1) Does SF address covariate shift adaptation? And if so, how? 2) Can we extend SF to deal with more generic covariate shift problems? By addressing the issues stated in these two research questions, we show the strength and the limitation of SF in performing covariate shift adaptation. Motivated by the above formal analysis, we further propose an alternative semi-supervised SF algorithm for covariate shift adaptation. Our algorithm retains the favorable properties of the original SF algorithm: it is effective in achieving sparse representations, computationally efficient, and easy to implement.

2 On the property of covariate shift adaptation in sparse filtering

In this section, we investigate whether SF can address covariate shift. We first review the SF algorithm briefly to facilitate our presentation. Next, we discuss statistical properties of SF that we discovered in our study. Finally, we show the limit of SF in covariate shift adaptation.

2.1 SF algorithm

The SF algorithm [9] was designed to learn a mapping $f: \mathbb{R}^O \to \mathbb{R}^L$ from unlabeled data to sparse representations. Let $X_j^{(i)}$ denote the $j$-th feature of the $i$-th sample of the data matrix $X$. Then, SF defines the following transformation of $X$:
$$Z = \ell_{2,col}\left(\ell_{2,row}\left(\left|WX\right|\right)\right),$$
where $W$ is a weight matrix, $|\cdot|$ is the element-wise absolute-value function, $\ell_{2,row}$ is the $\ell_2$-normalization along the rows,
$$\left[\ell_{2,row}(F)\right]_j^{(i)} = \frac{F_j^{(i)}}{\sqrt{\sum_{i=1}^{N}\left(F_j^{(i)}\right)^2}},$$
and $\ell_{2,col}$ is the $\ell_2$-normalization along the columns,
$$\left[\ell_{2,col}(F)\right]_j^{(i)} = \frac{F_j^{(i)}}{\sqrt{\sum_{j=1}^{L}\left(F_j^{(i)}\right)^2}}.$$

During learning, the weight matrix $W$ is trained by gradient descent in order to minimize the loss function $\mathcal{L}(W; X) = \sum_{i=1}^{N}\sum_{j=1}^{L} Z_j^{(i)}$, which acts as a proxy to maximize the sparsity of the learned representation matrix $Z$.
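For concreteness, the transformation and the loss above can be sketched in a few lines of NumPy. This is an illustrative re-implementation rather than the authors' released code; the small constant added inside the square roots is an assumption made here purely to avoid division by zero.

```python
import numpy as np

def sf_forward(W, X, eps=1e-8):
    """Sparse filtering forward pass: Z = l2_col(l2_row(|W X|)) and the l1 loss.

    W: (L, O) weight matrix; X: (O, N) data matrix with samples as columns.
    """
    F = np.abs(W @ X)                                              # element-wise |WX|
    F = F / np.sqrt((F ** 2).sum(axis=1, keepdims=True) + eps)     # l2-normalize each row (feature)
    Z = F / np.sqrt((F ** 2).sum(axis=0, keepdims=True) + eps)     # l2-normalize each column (sample)
    return Z, Z.sum()                                              # loss L(W; X) = sum_{i,j} Z_j^(i)

# toy usage with random data and weights
rng = np.random.default_rng(0)
O, L, N = 10, 6, 100
Z, loss = sf_forward(rng.normal(size=(L, O)), rng.normal(size=(O, N)))
print(Z.shape, loss)
```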

2.2 Covariate shift adaptation in SF

To deal with covariate shift, the SF algorithm is trained on all the unlabeled data denoted by the matrix $X$ and given by the concatenation of $X^{tr}$ and $X^{tst}$, as is done in most covariate shift adaptation methods. After learning, we obtain a new representation $Z$ that can be partitioned back into two parts, $Z^{tr}$ and $Z^{tst}$, corresponding to training and test samples respectively. To assess whether SF performs covariate shift adaptation, we need to evaluate the distributions of the training and test representations, $p(\mathcal{Z}^{tr})$ and $p(\mathcal{Z}^{tst})$, to check if they are close to each other. In order to examine this issue, we first present a lemma used to prove an important proposition.

Lemma 1. For each learned feature $Z_j$, SF limits the domain of $Z_j$ to $[0,1]$, moves the expected value of $p(Z_j)$ within $\left[\frac{1}{N^2L^2}, 1\right]$, and bounds the variance of $p(Z_j)$ within $\left[0, \frac{N^4L^4-1}{N^4L^4}\right]$. Moreover, if we make the assumption that learned representations are $k$-sparse, with $1 < k \leq L$, then the expected value of $p(Z_j)$ is within $\left[\frac{1}{k^2L^2}, \frac{1}{\sqrt{N}}\right]$, and the variance of $p(Z_j)$ is within $\left[0, \frac{N^3k^4-1}{N^4k^4}\right]$. (See Sect. A.2.1 in the appendix for the proof.)

Based on this lemma, we have the following proposition:

Proposition 1. Given the data $X$, the maximization of sparsity of the learned representation $Z$ by SF acts as a proxy for the minimization of the distance $D\left[p(\mathcal{Z}^{tr}), p(\mathcal{Z}^{tst})\right]$.

Proof. The minimization of the $\ell_1$-norm of the samples $Z^{(i)}$ is a well-known proxy for the minimization of the entropy of the distribution $p(\mathcal{Z})$ [10], i.e., $\min \ell_1\left(Z^{(i)}\right) \equiv \min H\left(Z^{(i)}\right)$, where $\ell_1\left(Z^{(i)}\right) = \sum_{i,j} Z_j^{(i)}$ and $H\left(Z^{(i)}\right) = -\sum_i p\left(Z^{(i)}\right)\log p\left(Z^{(i)}\right)$ is the sample entropy. The practical effect of minimizing the $\ell_1$-norm of $Z$ is the learning of a multivariate pdf $p(\mathcal{Z})$ whose mass is mainly localized around zero, independently from the original pdf $p(\mathcal{X})$. Next, we observe that, as shown by Lemma 1, SF moves the center of mass and the variance of the pdf of each feature $p(Z_j)$ towards zero as well. In this way, by shaping the pdf of each learned feature $Z_j$ within fixed bounds, SF minimizes the entropy of the distribution of each learned feature $p(Z_j)$. Thanks to the enforcement of the properties of population sparsity, lifetime sparsity, and high dispersal, which are crucial to Lemma 1, SF minimizes not only the entropy of the multivariate pdf $p(\mathcal{Z})$ but also the entropy of the marginal pdfs $p(Z_j)$ for each learned feature.

Finally, it is observed that entropy is a very powerful domain-independent descriptor for pdfs: if two pdfs have the same entropy, they have the same shape and they are identical up to a translation over the domain [11]. Hence, by minimizing the $\ell_1$-norm of $Z_j^{tr}$ and $Z_j^{tst}$, we force the two distributions to have a similar shape (determined by the small entropy) and a similar location (determined by the high sparsity). Thus, no matter what the pdfs $p(\mathcal{X}^{tr})$ and $p(\mathcal{X}^{tst})$ are, SF will learn new representations such that the pdf of each feature will have minimal entropy and the pdfs $p(\mathcal{Z}^{tr})$ and $p(\mathcal{Z}^{tst})$ will have most of their mass close to zero. The pdfs $p(\mathcal{Z}^{tr})$ and $p(\mathcal{Z}^{tst})$ will be as close as possible to a Dirac delta function centered on the origin, and, by transitivity, they will be close to each other, minimizing the distance $D\left[p(\mathcal{Z}^{tr}), p(\mathcal{Z}^{tst})\right]$. ∎

This proposition confirms that SF implicitly performs covariate shift adaptation by trying to move both learned pdfs $p(\mathcal{Z}^{tr})$ and $p(\mathcal{Z}^{tst})$ towards a common entropy-minimizing pdf.
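The $\ell_1$/entropy argument in the proof can be illustrated numerically. The sketch below is a toy comparison of our own, not part of the paper's analysis: it contrasts a dense and a sparse unit-norm vector, using a simple histogram entropy of the entries as a stand-in for the sample entropy of [10].

```python
import numpy as np

def l1_and_entropy(z, bins=20):
    """l1-norm of a unit-l2-norm vector and the entropy of a histogram of its entries."""
    hist, _ = np.histogram(np.abs(z), bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    return np.abs(z).sum(), float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
L = 100
dense = rng.random(L)
dense /= np.linalg.norm(dense)            # mass spread over all entries
sparse = np.zeros(L)
sparse[:3] = rng.random(3)
sparse /= np.linalg.norm(sparse)          # mass concentrated on three entries

print("dense :", l1_and_entropy(dense))   # larger l1-norm, larger histogram entropy
print("sparse:", l1_and_entropy(sparse))  # smaller l1-norm, smaller histogram entropy
```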

2.3 Properties of representations learned by SF

We can summarize the properties of the representation $Z$ learned by SF as follows:

Proposition 2. Given two data sets $X^{tr}$ and $X^{tst}$ made up of independent samples and affected by covariate shift, the representations $Z^{tr}$ and $Z^{tst}$ learned by SF are made up of non-independent samples and they are quasi-identically distributed.

Proof. The non-independence of the samples of $Z^{tr}$ and $Z^{tst}$ follows from the fact that each sample is transformed and rescaled with respect to all other samples. The quasi-identical distribution property follows from Proposition 1. ∎

This proposition clearly states that while the original samples in $X^{tr}$ and $X^{tst}$ are independent and affected by covariate shift, the samples of the learned representations $Z^{tr}$ and $Z^{tst}$ are non-independent and quasi-identically distributed. Thus, the SF algorithm gives up the property of independence for covariate shift adaptation. This happens because, given data $X^{tr}$ and $X^{tst}$ from different pdfs, SF projects each sample $X^{(i)}$ with respect to all other samples so that the learned distributions $p(\mathcal{Z}^{tr})$ and $p(\mathcal{Z}^{tst})$ are both entropy-minimizing pdfs.

2.4 Limit of SF in covariate shift adaptation

Having shown that SF is capable of dealing with covariate shift by using both training and test data in learning, we now explore the limit of SF in tackling covariate shift. It is shown in [17] that the SF algorithm works well under the tacit assumption that the structure of the data with respect to a given set of labels is best explained in terms of the cosine metric. This assumption is also valid for SF in the case of covariate shift adaptation, as described in the following proposition:

Proposition 3. SF successfully performs covariate shift adaptation only if the training and test data, $X^{tr}$ and $X^{tst}$, exhibit a common radial structure.

Proof. Let $\bar{X}^{tr}$ be a sample from $X^{tr}$ and $\bar{X}^{tst}$ a sample from $X^{tst}$. Let us assume that, in the original space $\mathbb{R}^O$, these points are far apart under the Euclidean metric, $D_{Euclid}\left(\bar{X}^{tr}, \bar{X}^{tst}\right) \gg 0$, but close under the cosine metric, $D_{cosine}\left(\bar{X}^{tr}, \bar{X}^{tst}\right) \approx 0$ [17]. As such, $\bar{X}^{tr}$ and $\bar{X}^{tst}$ will have similar angular coordinates $\theta_j$, for $j = 1, \ldots, O-1$, but a different radial coordinate $\rho$. Since SF defines conical filters that map collinear points in the original space onto the same learned representation [17], $\bar{X}^{tr}$ and $\bar{X}^{tst}$ will be mapped together onto the same representation $Z^*$ in $\mathbb{R}^L$. Now, in the original representation space, $\bar{X}^{tr}$ and $\bar{X}^{tst}$ are modeled as two samples coming from two different regions of the pdfs $p(\mathcal{X}^{tr})$ and $p(\mathcal{X}^{tst})$. However, in the learned representation space, $\bar{X}^{tr}$ and $\bar{X}^{tst}$ are modeled as two samples coming from an identical region of the pdfs $p(\mathcal{Z}^{tr})$ and $p(\mathcal{Z}^{tst})$, since they are mapped onto the same representation $Z^*$. Now, it is evident that mapping the points $\bar{X}^{tr}$ and $\bar{X}^{tst}$ onto the same representation $Z^*$ makes sense only if the structure of the data with respect to a given set of labels is explained by the cosine metric. ∎

This proposition clearly shows a limit of SF in learning good representations while performing covariate shift adaptation. SF works only if there is a common radial structure underlying training and test data for given labels. Thus, the condition for the success of SF in performing covariate shift adaptation is intimately connected to that of SF in processing the data, i.e., there must be a radial structure underlying the original data set.
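The collinearity property invoked in the proof of Proposition 3 can be checked numerically with an untrained, randomly initialized weight matrix, since it follows from the two normalizations alone. The sketch below is our own illustration, not taken from the paper: two samples that are far apart in Euclidean distance but collinear (zero cosine distance) receive the same representation.

```python
import numpy as np

def sf_forward(W, X, eps=1e-8):
    F = np.abs(W @ X)
    F = F / np.sqrt((F ** 2).sum(axis=1, keepdims=True) + eps)     # row (feature) normalization
    return F / np.sqrt((F ** 2).sum(axis=0, keepdims=True) + eps)  # column (sample) normalization

rng = np.random.default_rng(0)
O, L, N = 4, 6, 50
W = rng.normal(size=(L, O))
X = rng.normal(size=(O, N))
X[:, 1] = 3.0 * X[:, 0]                   # sample 1 is collinear with sample 0

Z = sf_forward(W, X)
print(np.linalg.norm(X[:, 0] - X[:, 1]))  # large Euclidean distance in the original space
print(np.allclose(Z[:, 0], Z[:, 1]))      # True: both points land on the same learned representation
```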

3 Semi-supervised periodic sparse filtering

In Sect. 2, we have shown that the SF algorithm successfully performs covariate shift adaptation only if the data exhibit a radial structure. To overcome this limit, we introduce a more generic assumption, i.e., that the data exhibit some form of periodic structure. Thus, we need to adapt the original SF algorithm to this new scenario via two additional properties: (i) the ability to generate periodic filters; and (ii) the ability to capture the periodic structure underlying the data.

First of all, we want the new algorithm to yield generic filters. The instantiation of conical filters in SF is determined by the absolute-value non-linear transformation encoded in the algorithm. By substituting this non-linearity with a sinusoidal function, it becomes possible to generate periodic filters that regularly tile the whole original data space. Next, we need the new algorithm to shape the periodic filters so as to capture the structure of the data. To do so, we consider learning in a semi-supervised scenario by exploiting the label information provided with the training data.

Putting together these two properties, we propose a novel algorithm, called periodic sparse filtering (PSF), for covariate shift adaptation when the data exhibit any kind of periodic structure (see Sect. A.3.1 in the appendix for the pseudo-code of PSF). PSF defines the overall transformation of the data $f: \mathbb{R}^O \to \mathbb{R}^L$ as:
$$Z = \ell_{2,col}\left(\ell_{2,row}\left(g\left(WX\right)\right)\right),$$
where $g(x)$ is a positive element-wise sinusoidal function, such as $1 + \epsilon + \sin(x)$ or $1 + \epsilon + \cos(x)$, that is, a sinusoidal function shifted by $1 + \epsilon$, with $\epsilon = 10^{-8}$, in order to guarantee the strict positivity of the output.

PSF is trained in a semi-supervised setting on the training data $X^{tr}$, the corresponding labels $Y^{tr}$, and the test data $X^{tst}$. Let $C$ be the number of classes defined in $Y^{tr}$ and let the number of learned features $L$ be partitioned into $C+1$ groups with arbitrarily defined cardinality. We can then re-define the learned representation matrix $Z$ using the following block matrix notation:
$$Z = \begin{bmatrix} A_{11} & A_{12} & \cdots & A_{1C} & A_{1(C+1)} \\ A_{21} & A_{22} & \cdots & A_{2C} & A_{2(C+1)} \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ A_{C1} & A_{C2} & \cdots & A_{CC} & A_{C(C+1)} \\ A_{(C+1)1} & A_{(C+1)2} & \cdots & A_{(C+1)C} & A_{(C+1)(C+1)} \end{bmatrix},$$
where $A_{ij}$ is the block matrix containing the $i$-th group of learned features from the samples belonging to the $j$-th class, with class $C+1$ containing the unlabeled samples $X^{tst}$. The loss function of PSF is defined by:
$$\mathcal{L}\left(W; X, Y^{tr}\right) = \sum_{i=1}^{N}\sum_{j=1}^{L} Z_j^{(i)} - \sum_{k=1}^{C} \lambda_k \left\lVert A_{kk} \right\rVert_1, \qquad (1)$$

where $\lambda_k$ is a scaling factor. The first term of the loss function pushes for learning sparse representations, while the second term pushes the mass of the sparse representations onto the sub-matrices $A_{kk}$, for $1 \leq k \leq C$. The practical effect of optimizing Eq. (1) is then to generate sparse representations and to drive the algorithm to activate the $k$-th group of learned features for the samples belonging to the $k$-th class. Gradient descent is employed to minimize the loss defined in Eq. (1).

Below we show that the PSF algorithm produces a periodic tiling of the original data space.

Proposition 4. Let $X^{(1)} \in \mathbb{R}^O$ be a point in the original space $\mathbb{R}^O$ and let $Z^{(1)} \in \mathbb{R}^L$ be the representation learned by PSF. Then, there are infinitely many points $X^{(i)} \in \mathbb{R}^O$, built from $X^{(1)}$ with period $W^{-1}k\pi$, where $W$ is the weight matrix of PSF and $k$ is a vector of integer constants in $\mathbb{Z}$, that map to the same representation $Z^{(1)}$. (See Sect. A.3.2 in the appendix for the proof.)

This proposition highlights that the PSF algorithm defines a specific frequency for each dimension, so that all the points that are integer multiples of these frequencies are mapped to the same representation. Using various frequencies on different dimensions, PSF can flexibly define diverse tilings of the original data space, thus overcoming the limit of the original SF algorithm in covariate shift adaptation.

The source code is available at: https://github.com/EldarFeatel/PSF
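As an illustration of the PSF transformation and of the loss in Eq. (1), the following NumPy sketch is our own minimal re-implementation under stated assumptions: unlabeled (test) samples are marked with the label -1, the feature groups are passed as index arrays, and a small constant is added inside the square roots for numerical stability. The gradient step and anything else specific to the released code at the URL above are omitted here.

```python
import numpy as np

def psf_forward(W, X, eps=1e-8):
    """PSF transformation: Z = l2_col(l2_row(1 + eps + sin(W X)))."""
    F = 1.0 + 1e-8 + np.sin(W @ X)                                  # strictly positive sinusoid g
    F = F / np.sqrt((F ** 2).sum(axis=1, keepdims=True) + eps)      # l2-normalize rows (features)
    return F / np.sqrt((F ** 2).sum(axis=0, keepdims=True) + eps)   # l2-normalize columns (samples)

def psf_loss(Z, y, groups, lambdas):
    """Eq. (1): overall l1 sparsity minus the weighted mass of the diagonal blocks A_kk.

    y      : per-sample labels in {0, ..., C-1}; unlabeled test samples are marked -1
    groups : list of C index arrays, groups[k] = learned features assigned to class k
    lambdas: list of C scaling factors lambda_k
    """
    loss = Z.sum()
    for k, (rows, lam) in enumerate(zip(groups, lambdas)):
        cols = np.where(y == k)[0]
        loss -= lam * Z[np.ix_(rows, cols)].sum()   # reward mass placed on block A_kk
    return loss

# toy usage: C = 2 classes, L = 6 features split into groups of size 2 / 2 / 2
rng = np.random.default_rng(0)
O, L, N_tr, N_tst = 5, 6, 30, 30
X = rng.normal(size=(O, N_tr + N_tst))
y = np.concatenate([rng.integers(0, 2, N_tr), -np.ones(N_tst, dtype=int)])
W = rng.normal(size=(L, O))
Z = psf_forward(W, X)
print(psf_loss(Z, y, [np.arange(0, 2), np.arange(2, 4)], [1.0, 1.0]))
```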


Figure 1: Synthetic data set: (a) Data: training samples in blue, test samples in red; positive-labeled samples as crosses, negative-labeled samples as dots. (b) Illustration of a filter instantiated by SF. (c) Illustration of a filter instantiated by PSF.

4 Empirical validation

In this section, we empirically validate our theoretical results on SF for covariate shift adaptation and we evaluate our proposed PSF algorithm with synthetic and real data sets.

4.1 Synthetic data set

We generate a synthetic data set that reflects a typical covariate shift effect encountered in machine learning, as illustrated in Fig. 1(a). The details on data generation can be found in Sect. A.4.1 in the appendix. The generated data set has the following properties: (i) the data are easily visualizable; (ii) covariate shift affects one dimension of the input; (iii) the labeling function is a periodic function defined over the dimension exhibiting the covariate shift.

In our experiment, we train an SF and a PSF module, which yield representations $Z_{SF}^{tr}, Z_{SF}^{tst}$ and $Z_{PSF}^{tr}, Z_{PSF}^{tst}$, respectively. The dimensionality $L$ of the learned representations is set to two for visualization, while the scaling factors $\lambda_1, \lambda_2$ are set to one to keep all the loss terms in the same order of magnitude. We run off-the-shelf classification algorithms (linear SVM and RBF kernel SVM) on the original data $(X^{tr}, X^{tst})$ as a baseline, and on the SF representations $(Z_{SF}^{tr}, Z_{SF}^{tst})$ and the PSF representations $(Z_{PSF}^{tr}, Z_{PSF}^{tst})$ for comparison.
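The classification protocol can be sketched with scikit-learn as below; this is our reading of the setup, not the authors' evaluation script. Samples are assumed to be stored as columns of the representation matrices (as in the notation of Sect. 2), hence the transposes.

```python
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def fit_and_score(train_Z, train_y, test_Z, test_y, kernel="linear"):
    """Train an off-the-shelf SVM on (raw or learned) features and report accuracies.

    train_Z, test_Z: (dim, N) matrices with samples as columns.
    """
    clf = SVC(kernel=kernel)            # kernel="linear" or "rbf"
    clf.fit(train_Z.T, train_y)
    acc_train = accuracy_score(train_y, clf.predict(train_Z.T))
    acc_test = accuracy_score(test_y, clf.predict(test_Z.T))
    return acc_train, acc_test
```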

To evaluate the results, we use the following methods. (i) We visualize example filters learned by SF and PSF. (ii) We employ a Kolmogorov-Smirnov (KS) test to analyze the distribution of each feature [7], testing the hypotheses (p-value 0.05) that the features of training and test data come from different distributions, and that the features of positive-labeled and negative-labeled data come from different distributions. (iii) We report the classification accuracy on the training and the test data, and we estimate cross-domain generalization with the percentage drop metric suggested in [15]. We report the mean and the standard error computed over ten simulations with randomly re-sampled data.

Figures 1(b) and 1(c) show prototypical filters learned by SF and PSF, respectively. For each point $X^{(i)}$ in the original data space, the plot shows the distance of the learned representation $Z^{(i)}$ from a perfect 1-sparse representation. SF instantiates filters with a conical shape centered in the origin of the original data space. PSF, instead, generates periodic filters with arbitrary shapes: by changing the values in the weight matrix, the filters learned by PSF may vary from parallel stripes with a chosen orientation to squares tiling the whole original data space.

Table 1 reports the proportion of features exhibiting a different pdf according to the KS test. In general, the KS test evaluates only the pdf of individual features $p(Z_j)$, and it cannot assess the pdf of the learned representation $p(\mathcal{Z})$. Due to the independence copula, however, we can actually assess the joint pdf of the learned representations $p(\mathcal{Z})$ from the marginal distribution of each feature $p(Z_j)$ in our experiment. For the raw data, the KS test easily detects covariate shift on one dimension ($X_1$) and a difference in the distributions of positive-labeled and negative-labeled data via a single dimension ($X_1$). For the SF representations, the KS test reveals that covariate shift adaptation takes place, as the feature distributions of the training and the test data appear to be identical.
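A sketch of how the quantities in Tables 1 and 2 could be computed follows; the per-feature layout and the reading of the percentage drop of [15] as 100 · (train − test) / train are assumptions on our part.

```python
import numpy as np
from scipy.stats import ks_2samp

def fraction_of_shifted_features(Z_a, Z_b, alpha=0.05):
    """Proportion of features whose distributions differ between two sample sets
    according to a two-sample KS test (features on rows, samples on columns)."""
    rejected = [ks_2samp(Z_a[j], Z_b[j]).pvalue < alpha for j in range(Z_a.shape[0])]
    return float(np.mean(rejected))

def percentage_drop(acc_train, acc_test):
    """Cross-domain generalization metric: relative drop from training to test accuracy."""
    return 100.0 * (acc_train - acc_test) / acc_train
```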

Table 1: Synthetic data set: Kolmogorov-Smirnov statistical tests.

Data    (train, test)    (positive, negative)
Raw     0.5              0.5
SF      0.0              0.0
PSF     0.3 ± 0.15       0.7 ± 0.15

Table 2: Synthetic data set: classification accuracy on the training and test data set, and percentage drop using a linear SVM classifier and an RBF kernel SVM classifier.

                    Linear SVM                                    RBF SVM
Data    Training       Test           P-drop          Training       Test           P-drop
Raw     0.71 ± 0.03    0.51 ± 0.02    28.32 ± 1.39    0.98 ± 0.01    0.51 ± 0.02    48.02 ± 1.74
SF      0.60 ± 0.01    0.52 ± 0.01    13.21 ± 2.66    0.59 ± 0.01    0.51 ± 0.02    12.75 ± 3.10
PSF     0.75 ± 0.04    0.72 ± 0.05    4.10 ± 3.75     0.75 ± 0.04    0.70 ± 0.05    6.56 ± 4.13

Unfortunately, differences in the distribution of positive-labeled and negative-labeled data are lost. For the PSF representations, the results of the KS test suggest that our proposed PSF algorithm leads to sparse features that simultaneously carry out covariate shift adaptation and retain the discriminative information defined in the labels.

Table 2 reports the classification performance. For the raw data, the linear SVM returns a low performance because the data are not linearly separable. Although the kernel SVM reaches almost perfect discrimination on the training data, it clearly overfits, as its performance on the test data is reduced to chance level due to covariate shift. This illustrates the failure of standard classifiers to learn under covariate shift. For the SF representations, classification accuracy always drops to chance level, since any difference between data of different labels has vanished. This confirms that SF can perform covariate shift adaptation, but only within certain limits: covariate shift adaptation happened because the input data are symmetric with respect to the origin; however, since the structure of the labeled data is not radial, SF produced a representation in which useful discriminating information is missing. For the PSF representations, the classification accuracy is significantly higher than chance level, and the difference between the performance on the training data set and the test data set is remarkably small, as is evident from the percentage drop. The experimental results suggest that PSF is able to provide representations that allow for learning the conditional distribution of the labels from the training data and for generalizing it to the test data.

4.2 Real data set

Emotional speech is widely regarded as a class of data typically affected by covariate shift [13]. In this experiment we chose two emotional speech data sets widely used in the affective computing community: Berlin Emotional (EMODB) [3] and Vera am Mittag (VAM) [6]. EMODB consists of recordings of 10 German speakers, while VAM contains emotional utterances of 47 speakers participating in a German talk show. All the recordings are pre-processed into a standard representation made up of 72-dimensional Mel-frequency cepstrum coefficient (MFCC) feature vectors [4]. Samples from EMODB are associated with a binary label denoting the presence or the absence of emotional content (1065 emotional samples, 146 non-emotional samples), while samples from VAM are provided with a binary label denoting a state of high or low arousal (1091 high-arousal samples, 1404 low-arousal samples). We selected these data sets based on the assumption that acoustic samples from each user are described by different distributions, but that each distribution shows regularities in relation to the emotional labeling.

We follow a protocol similar to the one described in Sect. 4.1. First, we train an SF and a PSF module, and then we run the off-the-shelf linear SVM classifier on the raw data and on the learned features. We set the learned dimensionality $L$ to 80, following the conservative decision of preserving the approximate dimensionality of the original samples; we also set the dimensions of the block matrices $A_{11}$ and $A_{22}$ to 35, in order to balance the learned features between the two classes.

Table 3: Real data set: UAR on the training and test data set, and percentage drop using a linear SVM classifier for the EMODB and the VAM data sets.

                                 Linear SVM
Data set    Data             Train           Test           P-drop
EMODB       Raw              0.60 ± 0.03     0.52 ± 0.01    11.86 ± 3.60
            SF               0.65 ± 0.01     0.52 ± 0.01    20.42 ± 2.03
            PSF              0.96 ± 0.002    0.52 ± 0.04    44.94 ± 3.93
            PSF + KS crit.   0.85 ± 0.01     0.53 ± 0.05    37.09 ± 5.22
VAM         Raw              0.55 ± 0.02     0.52 ± 0.02    5.96 ± 1.11
            SF               0.66 ± 0.003    0.57 ± 0.01    14.45 ± 1.96
            PSF              0.94 ± 0.001    0.53 ± 0.03    43.50 ± 1.83
            PSF + KS crit.   0.76 ± 0.005    0.58 ± 0.02    23.88 ± 4.88

We set the other hyper-parameters ($\lambda_1$ and $\lambda_2$) to the optimal values found via cross-validation. For each data set, we keep all the samples of one speaker for validation and all the samples of another speaker for testing. Ten trials on each cross-validation configuration are run to increase the reliability of our results. Because of the high variance exhibited by PSF in cross-validation, we additionally devise a simple unsupervised criterion to retain those trials in which the distance between the distributions of training and test data is minimal; in particular, we retain half of the trials according to the rule that trial $t'$ is retained if its average per-feature KS distance between the learned training and test features is smaller than the median of this average distance computed over all the trials $t$, i.e.,
$$\left\{\frac{1}{L}\sum_{j=1}^{L} KS\left(Z_j^{tr}; Z_j^{tst}\right)\right\}_{t'} < \underset{t}{\mathrm{Median}}\left\{\frac{1}{L}\sum_{j=1}^{L} KS\left(Z_j^{tr}; Z_j^{tst}\right)\right\}_{t}.$$
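A sketch of the retention rule above is given below; it assumes that KS(·;·) denotes the two-sample Kolmogorov-Smirnov statistic and that representations are stored with features on rows, both of which are our interpretation rather than details stated in the paper.

```python
import numpy as np
from scipy.stats import ks_2samp

def retained_trials(Z_tr_trials, Z_tst_trials):
    """Indices of trials whose average per-feature KS distance between training and
    test representations is below the median over all trials."""
    avg_ks = []
    for Z_tr, Z_tst in zip(Z_tr_trials, Z_tst_trials):
        stats = [ks_2samp(Z_tr[j], Z_tst[j]).statistic for j in range(Z_tr.shape[0])]
        avg_ks.append(np.mean(stats))
    avg_ks = np.array(avg_ks)
    return np.where(avg_ks < np.median(avg_ks))[0]
```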

Because of the high class imbalance, we use the unweighted average recall (UAR) metric instead of the accuracy to evaluate our results [1]. The UAR metric is defined as $\frac{1}{K}\sum_{k=1}^{K} \mathrm{recall}(k)$, where $\mathrm{recall}(k)$ denotes the recall for class $k$ and $K$ is the total number of classes. We report the mean and the standard error computed over all the trials.

Table 3 contains the results of classification using a linear SVM. For the raw data, covariate shift reduces the performance of the linear SVM on the test data to close to chance level on both data sets. Once again, this confirms the limits of standard classifiers in the presence of covariate shift. For the SF representations, the performance on the test data is close to chance level for EMODB, but higher for VAM. This may suggest the presence of different structures underlying the two data sets. For the PSF representations, classification performance is close to or even lower than the performance of the standard SF algorithm. However, PSF representations exhibit higher variance, as shown by the standard error. This suggests that PSF may find good solutions as well as unsatisfactory ones. Adopting the criterion based on the KS test to select those trials in which the distance between the training and the test distributions is lower, we reduce the variance and improve the final performance. The standard error is constant, but, since it was computed on half of the trials, it reveals a decrease in the variance. In the case of EMODB, the overall increase in performance is limited compared to SF, probably due to the difficulty of learning a periodic structure from a highly unbalanced data set. In the case of the VAM data set, PSF provides a significant improvement over the raw representations and a moderate improvement over SF. In summary, the above experimental results empirically verify our formal analysis and demonstrate that our PSF algorithm may be effective in learning sparse features for covariate shift adaptation.
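For reference, the UAR metric can be computed as follows; the scikit-learn call is equivalent to the manual average of per-class recalls.

```python
import numpy as np
from sklearn.metrics import recall_score

def uar(y_true, y_pred):
    """Unweighted average recall: the unweighted mean of the per-class recalls."""
    return recall_score(y_true, y_pred, average="macro")

def uar_manual(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == k] == k) for k in np.unique(y_true)]
    return float(np.mean(recalls))
```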

5 Discussion

In this section, we discuss issues arising from our studies and relate our research to previous works. While experimental results on PSF empirically verify our theoretical statements, it has been observed that this algorithm is quite sensitive to initialization, as a high variance appears among multiple trials initialized differently. Hence, this is an issue to be addressed in our ongoing work. We implemented a simple criterion based on the KS test to select those trials where the distance between the training and test data is minimized, but more refined criteria may be designed to provide better solutions.

In addition, an alternative model selection procedure may be implemented to explore more thoroughly the space of the hyper-parameters; better values for the dimensionality of the learned space and for the scaling parameters may be found, thus improving the final performance of PSF.

The idea behind our novel PSF algorithm bears a resemblance to traditional sinusoid filters and Walsh filters, which are often used as bases for sets of coding neurons [16]. The main difference is that, while these traditional filters are trained in order to model the pdf of the data, we train our filters to generate sparse feature representations that, at the same time, perform covariate shift adaptation.

In our theoretical study, we have argued that SF implicitly defines a trade-off between the statistical independence of the samples and covariate shift adaptation. This exchange between independence and covariate shift adaptation takes place because the representation of each sample is affected by the representations of all the other samples. It may be interesting to investigate whether this trade-off is a general property of covariate shift adaptation algorithms based on representation learning and to consider the ensuing implications.

Our research suggests that SF and PSF may be suitable algorithms for performing covariate shift adaptation. This conclusion, drawn from our study of SF and PSF in this paper, may be extended to the whole class of feature distribution learning algorithms [9], to which SF and PSF belong. Differently from standard data distribution learning, which aims at performing unsupervised learning by modeling the true pdf $p(\mathcal{X})$ of the data, feature distribution learning performs learning by focusing only on shaping a useful learned pdf $p(\mathcal{Z})$. We suggest that the insensitivity of feature distribution learning to the original pdfs $p(\mathcal{X}^{tr})$ and $p(\mathcal{X}^{tst})$ may make such algorithms naturally robust to covariate shift. In general, the work presented in this paper suggests that feature distribution learning has great potential to be an effective yet efficient framework for covariate shift adaptation.

Covariate shift adaptation has been studied in previous works [8]. Our research presents the closest affinity with works on covariate shift adaptation through representation learning, such as mean matching [12, 5], where the distance between learned training and test distributions is explicitly minimized during learning. In addition, our work is related to importance weighting [14, 7], where the importance of each training sample is scaled proportionally to the ratio between the original training distribution and the original test distribution. While these works address covariate shift in a data distribution learning framework, SF and PSF offer a more efficient solution to covariate shift that avoids confronting the challenging problem of learning the true pdf of the data.

In conclusion, we have shown that the SF algorithm is able to implicitly perform covariate shift adaptation under certain assumptions. Motivated by our formal analysis, we have proposed the novel PSF algorithm, which relies on less restrictive assumptions. Experimental results clearly support our formal analysis of SF in terms of covariate shift and our theoretical justification for the PSF algorithm. Hence, we expect that the results presented in this paper may be extended to the whole class of feature distribution learning algorithms and thus provide insight for developing a novel framework for covariate shift adaptation.


Appendices

A.1 Summary of notation

$X^{tr}$, $\mathcal{X}^{tr}$, $p(\mathcal{X}^{tr})$: Matrix of training data with domain $\mathbb{R}^{O \times N^{tr}}$, random variable for the training samples, pdf for the training samples.

$X^{tst}$, $\mathcal{X}^{tst}$, $p(\mathcal{X}^{tst})$: Matrix of test data with domain $\mathbb{R}^{O \times N^{tst}}$, random variable for the test samples, pdf for the test samples.

$X$, $\mathcal{X}$, $p(\mathcal{X})$: Matrix of the data with domain $\mathbb{R}^{O \times N}$, random variable for the samples, pdf for the samples. The matrix $X$ is given by the concatenation of $X^{tr}$ and $X^{tst}$.

$X_j^{(i)}$: $j$-th feature of the $i$-th sample of $X$.

$N$, $N^{tr}$, $N^{tst}$: Number of all samples, training samples and test samples. The total number $N$ is given by the sum of $N^{tr}$ and $N^{tst}$.

$O$: Original dimensionality of the samples.

$Y^{tr}$, $\mathcal{Y}$, $p(\mathcal{Y})$: Vector of labels associated with the training data with domain $\mathbb{R}^{1 \times N^{tr}}$, random variable of the labels, pdf for the labels.

$Z$, $\mathcal{Z}$, $p(\mathcal{Z})$: Matrix of the learned representations with domain $\mathbb{R}^{L \times N}$, random variable of the representations, pdf for the representations. Analogously, we define the variables for the representations of training data $Z^{tr}$, $\mathcal{Z}^{tr}$, $p(\mathcal{Z}^{tr})$ and test data $Z^{tst}$, $\mathcal{Z}^{tst}$, $p(\mathcal{Z}^{tst})$.

$L$: Learned dimensionality of the samples.

$f$: Representation learning function defined on $\mathbb{R}^O \to \mathbb{R}^L$.

$W$: Matrix of weights with domain $\mathbb{R}^{L \times O}$.

$D[\cdot]$: Distance between pdfs.

A.2 Proof of lemma in Section 2

A.2.1 Proof of Lemma 1

Lemma 1. For each learned feature $Z_j$, SF limits the domain of $Z_j$ to $[0,1]$, moves the expected value of $p(Z_j)$ within $\left[\frac{1}{N^2L^2}, 1\right]$ and bounds the variance of $p(Z_j)$ within $\left[0, \frac{N^4L^4-1}{N^4L^4}\right]$. Moreover, if we make the assumption that learned representations are $k$-sparse, with $1 < k \leq L$, then the expected value of $p(Z_j)$ is within $\left[\frac{1}{k^2L^2}, \frac{1}{\sqrt{N}}\right]$ and the variance of $p(Z_j)$ is within $\left[0, \frac{N^3k^4-1}{N^4k^4}\right]$.

Proof. The proof of this lemma is based on the following logical steps: (a) rigorous definition of the SF computation; (b) analysis of the properties of the intermediate representations and of the learned representations; (c) estimation of the second moment, the expected value and the variance of the distribution of a feature in the intermediate representation; (d) estimation of the second moment of the distribution of a feature in the learned representation; (e) estimation of the expected value of the distribution of a feature in the learned representation (with and without the assumption of k-sparsity); (f) estimation of the variance of the distribution of a feature in the learned representation (with and without the assumption of k-sparsity).

(a) Let $\tilde{F}$ be the intermediate representation and $Z$ be the learned representation produced by the SF algorithm:
$$\tilde{F} = \ell_{2,row}\left(\left|WX\right|\right), \qquad Z = \ell_{2,col}\left(\tilde{F}\right),$$
where $|\cdot|$ is the element-wise absolute-value function, $\ell_{2,row}$ is the $\ell_2$-normalization along the rows, $\left[\ell_{2,row}(F)\right]_j^{(i)} = F_j^{(i)} \big/ \sqrt{\sum_{i=1}^{N}\left(F_j^{(i)}\right)^2}$, and $\ell_{2,col}$ is the $\ell_2$-normalization along the columns, $\left[\ell_{2,col}(F)\right]_j^{(i)} = F_j^{(i)} \big/ \sqrt{\sum_{j=1}^{L}\left(F_j^{(i)}\right)^2}$.

(b) The $\ell_2$-normalization steps defining $\tilde{F}$ and $Z$ have two main effects: they constrain all the values in $\tilde{F}$ and $Z$ to be within 0 and 1, and they force features or samples to have a squared total activation of 1. Formally:
$$0 \leq \tilde{F}_j^{(i)} \leq 1, \quad 0 \leq Z_j^{(i)} \leq 1, \quad \forall\, 1 \leq j \leq L,\ 1 \leq i \leq N,$$
$$\sum_{i=1}^{N}\left(\tilde{F}_j^{(i)}\right)^2 = 1, \quad 1 \leq j \leq L, \qquad \sum_{j=1}^{L}\left(Z_j^{(i)}\right)^2 = 1, \quad 1 \leq i \leq N.$$

(c) Let us now consider a given feature and, for clarity, let us denote this fixed feature as $\bar{j}$ to underline the fact that it is not going to change in the following analysis. We can now analyze the distribution of $\tilde{F}_{\bar{j}}$ by considering its main statistical moments. Let us start from the estimation of the second moment. From the properties stated in (b), it follows that:
$$M_2\left[\tilde{F}_{\bar{j}}\right] = E\left[\tilde{F}_{\bar{j}}^2\right] \mathrel{\hat{=}} \frac{1}{N}\sum_{i=1}^{N}\left(\tilde{F}_{\bar{j}}^{(i)}\right)^2 = \frac{1}{N},$$
where $\mathrel{\hat{=}}$ denotes the statistical estimation from the samples.

Next, let us move to the estimation of the expected value:
$$E\left[\tilde{F}_{\bar{j}}\right] \mathrel{\hat{=}} \frac{1}{N}\sum_{i=1}^{N}\tilde{F}_{\bar{j}}^{(i)}.$$
From the properties stated in (b), we can easily bound the sum $\sum_{i=1}^{N}\tilde{F}_{\bar{j}}^{(i)}$ as follows:
$$\sum_{i=1}^{N}\left(\tilde{F}_{\bar{j}}^{(i)}\right)^2 \leq \sum_{i=1}^{N}\tilde{F}_{\bar{j}}^{(i)} \leq \sqrt{N}, \qquad 1 \leq \sum_{i=1}^{N}\tilde{F}_{\bar{j}}^{(i)} \leq \sqrt{N}.$$
We can then bound the expected value as:
$$\frac{1}{N} \leq \frac{1}{N}\sum_{i=1}^{N}\tilde{F}_{\bar{j}}^{(i)} \leq \frac{1}{\sqrt{N}}.$$

Finally, let us estimate the variance:
$$Var\left[\tilde{F}_{\bar{j}}\right] = E\left[\tilde{F}_{\bar{j}}^2\right] - \left(E\left[\tilde{F}_{\bar{j}}\right]\right)^2.$$
Using the value we computed for the second moment and the interval we estimated for the expected value, we can define the following bound for the variance:
$$\frac{1}{N} - \left(\frac{1}{\sqrt{N}}\right)^2 \leq Var\left[\tilde{F}_{\bar{j}}\right] \leq \frac{1}{N} - \left(\frac{1}{N}\right)^2, \qquad 0 \leq Var\left[\tilde{F}_{\bar{j}}\right] \leq \frac{N-1}{N^2}.$$

(d) Let us now analyze the distribution of $Z_{\bar{j}}$, starting again from the second moment:
$$M_2\left[Z_{\bar{j}}\right] = E\left[Z_{\bar{j}}^2\right] \mathrel{\hat{=}} \frac{1}{N}\sum_{i=1}^{N}\left(Z_{\bar{j}}^{(i)}\right)^2.$$
During the normalization along the columns, the $\bar{j}$-th feature of each sample $Z^{(i)}$ is divided by a sample-specific factor $d^{(i)} = \sqrt{\sum_{j=1}^{L}\left(\tilde{F}_j^{(i)}\right)^2}$. We can then rewrite the estimator of the second moment as:
$$\frac{1}{N}\sum_{i=1}^{N}\left(Z_{\bar{j}}^{(i)}\right)^2 = \frac{1}{N}\sum_{i=1}^{N}\left[\frac{1}{\left(d^{(i)}\right)^2}\cdot\left(\tilde{F}_{\bar{j}}^{(i)}\right)^2\right].$$
In order to define a bound for the second moment, we need to define a bound on $\left(d^{(i)}\right)^2 = \sum_{j=1}^{L}\left(\tilde{F}_j^{(i)}\right)^2$. Now, we know that each element $\tilde{F}_j^{(i)}$ must be bounded between zero and one. Therefore the sum $\sum_{j=1}^{L}\left(\tilde{F}_j^{(i)}\right)^2$ must be lower bounded by $\left(\tilde{F}_{\bar{j}}^{(i)}\right)^2$ (if all the values except $Z_{\bar{j}}^{(i)}$ itself are zero) and it must be upper bounded by $L^2$ (if all the values are one). Thus, the second moment must be bounded in:
$$\frac{1}{N}\sum_{i=1}^{N}\left[\frac{1}{L^2}\cdot\left(\tilde{F}_{\bar{j}}^{(i)}\right)^2\right] \leq \frac{1}{N}\sum_{i=1}^{N}\left(Z_{\bar{j}}^{(i)}\right)^2 \leq \frac{1}{N}\sum_{i=1}^{N}\left[\frac{1}{\left(\tilde{F}_{\bar{j}}^{(i)}\right)^2}\cdot\left(\tilde{F}_{\bar{j}}^{(i)}\right)^2\right], \qquad \frac{1}{NL^2} \leq E\left[\left(Z_{\bar{j}}^{(i)}\right)^2\right] \leq 1.$$

Let us now make the assumption that the samples $\tilde{F}^{(i)}$ are $k$-sparse, with $1 < k \leq L$, that is, each sample has a number $k$ of active elements greater than 1 and smaller than $L$. This assumption is justified considering the properties of population sparsity and lifetime sparsity of SF [9]. In this case, we could redefine the lower bound of the sum $\sum_{j=1}^{L}\left(\tilde{F}_j^{(i)}\right)^2$ as 1 and the upper bound as $k^2$. Consequently the bounds of the second moment would be:
$$\frac{1}{Nk^2} \leq E\left[\left(Z_{\bar{j}}^{(i)}\right)^2\right] < \frac{1}{N}.$$

(e) Let us now consider the estimation of the expected value of $Z_{\bar{j}}$:
$$E\left[Z_{\bar{j}}\right] \mathrel{\hat{=}} \frac{1}{N}\sum_{i=1}^{N}Z_{\bar{j}}^{(i)}.$$
As before, we can rewrite the value of $Z_{\bar{j}}^{(i)}$ as the ratio of $\tilde{F}_{\bar{j}}^{(i)}$ and $d^{(i)} = \sqrt{\sum_{j=1}^{L}\left(\tilde{F}_j^{(i)}\right)^2}$:
$$\frac{1}{N}\sum_{i=1}^{N}Z_{\bar{j}}^{(i)} = \frac{1}{N}\sum_{i=1}^{N}\left[\frac{1}{d^{(i)}}\cdot\tilde{F}_{\bar{j}}^{(i)}\right].$$
Because of the monotonicity of the square root and the positivity of the arguments, the bounds for $\sqrt{\sum_{j=1}^{L}\left(\tilde{F}_j^{(i)}\right)^2}$ can be easily derived from the ones computed for $\sum_{j=1}^{L}\left(\tilde{F}_j^{(i)}\right)^2$. Specifically, $\sqrt{\sum_{j=1}^{L}\left(\tilde{F}_j^{(i)}\right)^2}$ must be lower bounded by $\tilde{F}_{\bar{j}}^{(i)}$ (if all the values except $Z_{\bar{j}}^{(i)}$ itself are zero) and it must be upper bounded by $L$ (if all the values are one). Thus, the average must be bounded in:
$$\frac{1}{N}\sum_{i=1}^{N}\left[\frac{1}{L}\cdot\tilde{F}_{\bar{j}}^{(i)}\right] \leq \frac{1}{N}\sum_{i=1}^{N}Z_{\bar{j}}^{(i)} \leq \frac{1}{N}\sum_{i=1}^{N}\left[\frac{1}{\tilde{F}_{\bar{j}}^{(i)}}\cdot\tilde{F}_{\bar{j}}^{(i)}\right], \qquad \frac{1}{N^2L^2} \leq \frac{1}{N}\sum_{i=1}^{N}Z_{\bar{j}}^{(i)} \leq 1.$$
Under the assumption of $k$-sparsity of $\tilde{F}^{(i)}$, with $1 < k \leq L$, we can get the tighter bounds:
$$\frac{1}{k^2L^2} \leq \hat{E}\left[Z_{\bar{j}}^{(i)}\right] < \frac{1}{\sqrt{N}}.$$
This proves the first part of our statement.

(f) Finally, let us consider the estimation of the variance of $Z_{\bar{j}}$:
$$Var\left[Z_{\bar{j}}\right] = E\left[Z_{\bar{j}}^2\right] - \left(E\left[Z_{\bar{j}}\right]\right)^2.$$
Again, using the value we computed for the second moment and the interval we estimated for the expected value, we can define the following bound for the variance:
$$\frac{1}{NL^2} - 1^2 \leq Var\left[Z_{\bar{j}}\right] \leq 1 - \left(\frac{1}{N^2L^2}\right)^2, \qquad 0 \leq Var\left[Z_{\bar{j}}\right] \leq \frac{N^4L^4-1}{N^4L^4}.$$
Under the assumption of $k$-sparsity of $\tilde{F}^{(i)}$, with $1 < k \leq L$, we can get the tighter bounds:
$$\frac{1}{Nk^2} - \left(\frac{1}{\sqrt{N}}\right)^2 \leq Var\left[Z_{\bar{j}}\right] \leq \frac{1}{N} - \left(\frac{1}{N^2k^2}\right)^2, \qquad 0 \leq Var\left[Z_{\bar{j}}\right] \leq \frac{N^3k^4-1}{N^4k^4}.$$
This proves the last part of our statement. ∎
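As a quick sanity check, the bounds above can be verified empirically; they depend only on the two normalizations, so any weight matrix (here a random, untrained one) will do. This check is ours and is not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
O, L, N = 8, 5, 200
F = np.abs(rng.normal(size=(L, O)) @ rng.normal(size=(O, N)))
F = F / np.sqrt((F ** 2).sum(axis=1, keepdims=True))      # each feature (row) has unit l2-norm
Z = F / np.sqrt((F ** 2).sum(axis=0, keepdims=True))      # each sample (column) has unit l2-norm

means, variances = Z.mean(axis=1), Z.var(axis=1)          # empirical E[Z_j] and Var[Z_j]
print(bool((Z >= 0).all() and (Z <= 1).all()))                             # domain [0, 1]
print(bool(means.min() >= 1 / (N ** 2 * L ** 2) and means.max() <= 1.0))   # expected-value bounds
print(bool(variances.max() <= (N ** 4 * L ** 4 - 1) / (N ** 4 * L ** 4)))  # variance bound
```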

A.3 Pseudo-code of the PSF algorithm and proof of the proposition in Section 3

A.3.1 Pseudo-code of the PSF algorithm

Below we provide the pseudo-code for the PSF algorithm (the source code is available at: https://github.com/EldarFeatel/PSF).

Algorithm 1 Periodic Sparse Filtering (PSF)
1: Input: training data $X^{tr}$, test data $X^{tst}$, training labels $Y^{tr}$
2: Hyper-params: learned dimensionality $L$, partitioning vector $V$, lambda vector $\lambda$
3: $X \leftarrow X^{tr} \cup X^{tst}$
4: $W \leftarrow \mathcal{N}(0, 1)$
5: $C \leftarrow$ #classes in $Y^{tr}$
6: repeat
7:   $H \leftarrow W \cdot X$
8:   $F \leftarrow 1 + \epsilon + \sin(H)$
9:   $\tilde{F}_j^{(i)} \leftarrow F_j^{(i)} \big/ \sqrt{\sum_{i=1}^{N}\left(F_j^{(i)}\right)^2}$
10:  $Z_j^{(i)} \leftarrow \tilde{F}_j^{(i)} \big/ \sqrt{\sum_{j=1}^{L}\left(\tilde{F}_j^{(i)}\right)^2}$
11:  $\mathcal{L} \leftarrow \sum_{i=1}^{N}\sum_{j=1}^{L} Z_j^{(i)} - \sum_{k=1}^{C}\ \sum_{i:\, X^{(i)} \in X^{tr} \wedge Y^{(i)} = k}\ \sum_{j \in V_k} \lambda_k \cdot Z_j^{(i)}$
12:  $W \leftarrow W - \eta\nabla\mathcal{L}$
13: until termination condition for gradient descent is met
return $Z$

Algorithm 2 Derivative for PSF
1: Input: PSF output $Z$
2: $\dfrac{\partial Z}{\partial \tilde{F}_j^{(i)}} \leftarrow \dfrac{\sqrt{\sum_{j=1}^{L}\left(\tilde{F}_j^{(i)}\right)^2} - Z_j^{(i)} \cdot \sum_{j=1}^{L}\tilde{F}_j^{(i)}}{\sum_{j=1}^{L}\left(\tilde{F}_j^{(i)}\right)^2}$
3: $\dfrac{\partial Z}{\partial F_j^{(i)}} \leftarrow \dfrac{\dfrac{\partial Z}{\partial \tilde{F}_j^{(i)}} \cdot \sqrt{\sum_{i=1}^{N}\left(F_j^{(i)}\right)^2} - \tilde{F}_j^{(i)} \cdot \sum_{i=1}^{N}\dfrac{\partial Z}{\partial \tilde{F}_j^{(i)}} \cdot F_j^{(i)}}{\sum_{i=1}^{N}\left(F_j^{(i)}\right)^2}$
4: $\dfrac{\partial Z}{\partial H} \leftarrow \dfrac{\partial Z}{\partial F} \cdot \cos(H)$
5: $\dfrac{\partial Z}{\partial W} \leftarrow \lambda\dfrac{\partial Z}{\partial H} \cdot X$
return $\dfrac{\partial Z}{\partial W}$

A.3.2 Proof of Proposition 4

Proposition 4. Let $X^{(1)} \in \mathbb{R}^O$ be a point in the original space $\mathbb{R}^O$ and let $Z^{(1)} \in \mathbb{R}^L$ be the representation learned by PSF. Then, there are infinitely many points $X^{(i)} \in \mathbb{R}^O$, built from $X^{(1)}$ with period $W^{-1}k\pi$, where $W$ is the weight matrix of PSF and $k$ is a vector of integer constants in $\mathbb{Z}$, that map to the same representation $Z^{(1)}$.

Proof. The proof of this proposition is based on the following logical steps: (a) rigorous definition of the PSF computation; (b-e) back-computation through all the steps of PSF up to the input ($\ell_2$-normalization along the columns, $\ell_2$-normalization along the rows, non-linearity, linear projection).

(a) Let $X^{(1)} = \left[x_1^{(1)}, x_2^{(1)}, \ldots, x_O^{(1)}\right]^T$ and $X^{(2)} = \left[x_1^{(2)}, x_2^{(2)}, \ldots, x_O^{(2)}\right]^T$ be two points in the original space $\mathbb{R}^O$ and let $Z^{(1)} = PSF\left(X^{(1)}\right) = \left[z_1^{(1)}, z_2^{(1)}, \ldots, z_L^{(1)}\right]^T$ and $Z^{(2)} = PSF\left(X^{(2)}\right) = \left[z_1^{(2)}, z_2^{(2)}, \ldots, z_L^{(2)}\right]^T$ be the two representations in $\mathbb{R}^L$ learned by PSF. Recall that PSF is defined as $PSF(X) = \ell_{2,col}\left(\ell_{2,row}\left(g\left(WX\right)\right)\right)$, where we take the non-linearity to be an element-wise positive sinusoidal function, that is, $g(x) = 1 + \epsilon + \sin(x)$. Now, let us suppose that the two learned representations are identical, that is, $Z^{(1)} = Z^{(2)}$.

(b) By definition of PSF, $Z^{(1)} = Z^{(2)}$ implies:
$$\ell_{2,col}\left(\tilde{F}^{(1)}\right) = \ell_{2,col}\left(\tilde{F}^{(2)}\right), \qquad \frac{\tilde{F}_j^{(1)}}{\sqrt{\sum_{j=1}^{L}\left(\tilde{F}_j^{(1)}\right)^2}} = \frac{\tilde{F}_j^{(2)}}{\sqrt{\sum_{j=1}^{L}\left(\tilde{F}_j^{(2)}\right)^2}},$$
where $\tilde{F}^{(i)} = \left[\tilde{f}_1^{(i)}, \tilde{f}_2^{(i)}, \ldots, \tilde{f}_L^{(i)}\right]^T$ is the intermediate output of PSF defined as $\tilde{F} = \ell_{2,row}\left(g\left(WX\right)\right)$. Now, for the $\ell_2$-normalizations along the columns to be equal, it must hold that:
$$\left[\frac{\tilde{f}_1^{(1)}}{d^{(1)}}, \frac{\tilde{f}_2^{(1)}}{d^{(1)}}, \ldots, \frac{\tilde{f}_L^{(1)}}{d^{(1)}}\right]^T = \left[\frac{\tilde{f}_1^{(2)}}{d^{(2)}}, \frac{\tilde{f}_2^{(2)}}{d^{(2)}}, \ldots, \frac{\tilde{f}_L^{(2)}}{d^{(2)}}\right]^T,$$
where $d^{(i)} = \sqrt{\sum_{j=1}^{L}\left(\tilde{f}_j^{(i)}\right)^2}$ is a sample-dependent scaling factor. Therefore, it follows that $Z^{(1)} = Z^{(2)}$ if and only if $\tilde{F}^{(1)} = \lambda\tilde{F}^{(2)}$ for any $\lambda \in \mathbb{R}$.

(c) By definition of PSF, $\tilde{F}^{(1)} = \lambda\tilde{F}^{(2)}$ implies:
$$\ell_{2,row}\left(F^{(1)}\right) = \lambda\,\ell_{2,row}\left(F^{(2)}\right), \qquad \frac{F_j^{(1)}}{\sqrt{\sum_{i=1}^{N}\left(F_j^{(i)}\right)^2}} = \lambda\frac{F_j^{(2)}}{\sqrt{\sum_{i=1}^{N}\left(F_j^{(i)}\right)^2}},$$
where $F^{(i)} = \left[f_1^{(i)}, f_2^{(i)}, \ldots, f_L^{(i)}\right]^T$ is the intermediate output of PSF defined as $F = g\left(WX\right)$. Now, for the $\ell_2$-normalizations along the rows to be equal, it must hold that:
$$\left[\frac{f_1^{(1)}}{d_1}, \frac{f_2^{(1)}}{d_2}, \ldots, \frac{f_L^{(1)}}{d_L}\right]^T = \left[\frac{f_1^{(2)}}{d_1}, \frac{f_2^{(2)}}{d_2}, \ldots, \frac{f_L^{(2)}}{d_L}\right]^T,$$
where $d_j = \sqrt{\sum_{i=1}^{N}\left(F_j^{(i)}\right)^2}$ is a feature-dependent scaling factor. Therefore, it follows that $\tilde{F}^{(1)} = \lambda\tilde{F}^{(2)}$ if and only if $F^{(1)} = F^{(2)}$.

(d) By definition of PSF, $F^{(1)} = F^{(2)}$ implies:
$$g\left(H^{(1)}\right) = g\left(H^{(2)}\right), \qquad 1 + \epsilon + \sin\left(H^{(1)}\right) = 1 + \epsilon + \sin\left(H^{(2)}\right), \qquad \sin\left(H^{(1)}\right) = \sin\left(H^{(2)}\right),$$
where $H^{(i)} = \left[h_1^{(i)}, h_2^{(i)}, \ldots, h_L^{(i)}\right]^T$ is the intermediate output of PSF defined as $H = WX$. Now, for the applications of the sinusoidal function to be equal, it must hold that:
$$\sin\left(H^{(1)}\right) = \sin\left(H^{(2)}\right), \qquad H^{(1)} = \arcsin\left(\sin\left(H^{(2)}\right)\right), \qquad H^{(1)} = H^{(2)} + k\pi,$$
where $k$ is a vector of feature-dependent periodic factors in $\mathbb{Z}$.

(e) By definition of PSF, $H^{(1)} = H^{(2)} + k\pi$ implies:
$$WX^{(1)} = WX^{(2)} + k\pi, \qquad X^{(1)} = X^{(2)} + W^{-1}k\pi.$$
Thus, there are infinitely many points $X^{(i)} \in \mathbb{R}^O$ built from $X^{(1)}$ with period $W^{-1}k\pi$ that map to the same representation $Z^{(1)}$. ∎
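The periodic-tiling property can be checked numerically for even multiples of π, for which the sinusoid is exactly periodic. The sketch below is our own; it assumes a square, invertible weight matrix so that W⁻¹ is well defined, a simplification made only for this illustration.

```python
import numpy as np

def psf_forward(W, X, eps=1e-8):
    F = 1.0 + 1e-8 + np.sin(W @ X)
    F = F / np.sqrt((F ** 2).sum(axis=1, keepdims=True) + eps)
    return F / np.sqrt((F ** 2).sum(axis=0, keepdims=True) + eps)

rng = np.random.default_rng(0)
O = L = 4                                              # square W, so W^{-1} exists (assumption)
W = rng.normal(size=(L, O))
X = rng.normal(size=(O, 20))
k = np.array([2, -4, 0, 6])                            # even integer entries
X[:, 1] = X[:, 0] + np.linalg.inv(W) @ (k * np.pi)     # a periodic copy of sample 0

Z = psf_forward(W, X)
print(np.allclose(Z[:, 0], Z[:, 1]))                   # True: both points share one representation
```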

A.4 Detailed experimental results in Section 4

A.4.1 Synthetic data set in Section 4.1

In our synthetic data set simulation we generated a data set that captures, in a simplified way, the case of user-dependent data: different users have distributions located far apart in the feature space, but their data exhibit local regularities. Training data and test data are then sampled from two bivariate Gaussian pdfs and they are labeled in a binary way by a deterministic function. For the training data set we generated 50 samples from
$$X^{tr} \sim \mathcal{N}\left(\begin{bmatrix} 2\pi \\ 2 \end{bmatrix}, \begin{bmatrix} 2 & 0 \\ 0 & 0.5 \end{bmatrix}\right),$$
and for the test data set we generated 50 samples from
$$X^{tst} \sim \mathcal{N}\left(\begin{bmatrix} -2\pi \\ 2 \end{bmatrix}, \begin{bmatrix} 2 & 0 \\ 0 & 0.5 \end{bmatrix}\right).$$
Binary labels over the training data and the test data are defined by a deterministic square wave function with period 1 on the domain of the first feature.

Figure 2 shows the synthetic data in two dimensions. Figure 3 shows the projection of the synthetic data along the first dimension, that is, the dimension affected by covariate shift. Each figure separately shows positive-labeled training data (blue crosses), negative-labeled training data (blue dots), positive-labeled test data (red crosses), and negative-labeled test data (red dots); moreover, it also shows the empirical and the real (where possible) distributions of the training, test, positive-labeled and negative-labeled data. Figure 4 shows actual filters instantiated by SF and PSF in relation to the data in the synthetic data set simulation. The figure confirms that PSF is able to learn filters that better explain the data.
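A minimal sketch of this data-generation procedure is given below; the phase of the square-wave labeling is not specified in the text and is therefore an assumption of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
cov = np.array([[2.0, 0.0], [0.0, 0.5]])
X_tr = rng.multivariate_normal(mean=[2 * np.pi, 2.0], cov=cov, size=50)    # training cluster
X_tst = rng.multivariate_normal(mean=[-2 * np.pi, 2.0], cov=cov, size=50)  # test cluster, shifted along x1

def square_wave_label(X, period=1.0):
    """Deterministic binary labels alternating every half period along the first feature."""
    return (np.floor(X[:, 0] / (period / 2.0)) % 2).astype(int)

y_tr, y_tst = square_wave_label(X_tr), square_wave_label(X_tst)
print(np.bincount(y_tr), np.bincount(y_tst))
```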

Figure 2: Synthetic data X used in the simulation.


Figure 3: Projection of the synthetic data X used in the simulation along the first dimension.

Figure 4: (a) Illustration of a filter instantiated by SF. (b) Illustration of a filter instantiated by PSF.

References

[1] A. Batliner, S. Steidl, D. Seppi, and B. Schuller. Segmenting into adequate units for automatic recognition of emotion-related episodes: a speech-based approach. Advances in Human-Computer Interaction, 2010(1):3–18, 2010.

[2] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems, pages 137–144, 2007.

[3] F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, and B. Weiss. A database of German emotional speech. In Proceedings of the Interspeech Conference, pages 1517–1520, 2005.

[4] F. Eyben, M. Wollmer, and B. Schuller. openSMILE - the Munich versatile and fast open-source audio feature extractor. In Proceedings of the ACM International Conference on Multimedia, 2010.

[5] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.

[6] M. Grimm, K. Kroschel, and S. Narayanan. The Vera am Mittag German audio-visual emotional speech database. In Proceedings of the IEEE International Conference on Multimedia and Expo, 2008.

[7] A. Hassan, R. Damper, and M. Niranjan. On acoustic emotion recognition: compensating for covariate shift. IEEE Trans. Audio, Speech, Language Process., 21(7):1458–1468, 2013.

[8] J. Jiang. A literature survey on domain adaptation of statistical classifiers. Technical report, University of Illinois at Urbana-Champaign, 2008.

[9] J. Ngiam, Z. Chen, S. A. Bhaskar, P. W. Koh, and A. Y. Ng. Sparse filtering. In Advances in Neural Information Processing Systems, pages 1125–1133, 2011.

[10] G. Pastor, I. Mora-Jiménez, R. Jäntti, and A. J. Caamaño. Mathematics of sparsity and entropy: axioms, core functions and sparse recovery. arXiv preprint arXiv:1501.05126, 2015.

[11] J. C. Principe. Information Theoretic Learning: Rényi's Entropy and Kernel Perspectives. Springer, 2010.

[12] N. Quadrianto, J. Petterson, and A. J. Smola. Distribution matching for transduction. In Advances in Neural Information Processing Systems, pages 1500–1508, 2009.

[13] B. Schuller, B. Vlasenko, F. Eyben, M. Wollmer, A. Stuhlsatz, A. Wendemuth, and G. Rigoll. Cross-corpus acoustic emotion recognition: variances and strategies. IEEE Trans. Affect. Comput., 1:119–131, 2010.

[14] M. Sugiyama and M. Kawanabe. Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation. MIT Press, 2012.

[15] A. Torralba and A. A. Efros. Unbiased look at dataset bias. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1521–1528, 2011.

[16] B. Willmore and D. J. Tolhurst. Characterizing the sparseness of neural codes. Network: Computation in Neural Systems, 12(3):255–270, 2001.

[17] F. M. Zennaro and K. Chen. Towards understanding sparse filtering: A theoretical perspective. arXiv preprint arXiv:1603.08831, 2016.

