Feature Selection for Transfer Learning

Selen Uguroglu, Jaime Carbonell
[email protected], [email protected]
Language Technologies Institute, Carnegie Mellon University

Overview
- Transfer Learning Motivation
- Related Work
- Our Approach
- Results
  - Feature selection
  - Prediction performance
- Conclusion & Future Directions
- References

Motivation
Supervised machine learning: a model is trained on labeled data and then used to make predictions on unlabeled data.

Assumption: Training data and test data are drawn from the same distribution and same feature space

Many real world tasks violate this assumption!

Motivation  Sampling bias  Clinical Trials: Sicker patients may volunteer for a clinical trial  Not having sufficient labeled data for the prediction task  Leveraging another related labeled dataset perform prediction on the current dataset  Sentiment analysis: Using computer reviews to detect sentiment on camera reviews  Change in distribution over space  Surveys: Characteristics of geographical regions may differ  Change in distribution over time  WIFI Localization: Data collected in a further time point may be differently distributed than the data from earlier time points

We can’t expect to have good results if we train our model on pears and apply it on apples

Related Work
Prior methods that aim to reduce the distance between domains:
- Instance reweighting
  - Incorporate the ratio of source and target data density distributions into the loss function
  - Learn the weights of instances to reduce the distance between domains
- Feature representation methods
  - Find pivot features
  - Project both domains into a lower-dimensional latent space

In this work we propose a different approach to the problem

Which features create the discrepancy between source and target distributions?

(Figure: source data vs. target data, in which features may be identically distributed, mean-shifted, or mean-and-variance-shifted across the two domains.)

Our Approach
- A novel approach for domain adaptation
- Identifies variant features across source and target domains
- Presents a convex optimization problem
- Outperforms other methods in prediction accuracy

  

- How we measure the distance between source and target distributions
- How we identify the differentially distributed dimensions
- How we use this information in the prediction task

  


Measuring the Distance
- Kullback-Leibler divergence requires an expensive density estimation step
- Maximum mean discrepancy (MMD) is a nonparametric kernel method that can be used to measure the distance between two distributions [1][2]

MMD
Definition: Let $X = \{x_1, \ldots, x_{n_S}\}$ and $Y = \{y_1, \ldots, y_{n_T}\}$ be the two sets of observations drawn from Borel probability distributions $p$ and $q$. Let $F$ be a class of functions $f: \mathcal{X} \to \mathbb{R}$; then the empirical estimate is [1]:

$\mathrm{MMD}[F, X, Y] = \sup_{f \in F} \left( \frac{1}{n_S} \sum_{i=1}^{n_S} f(x_i) - \frac{1}{n_T} \sum_{j=1}^{n_T} f(y_j) \right)$

Let $\phi: \mathcal{X} \to \mathcal{H}$ be the feature map, where $\mathcal{H}$ is a RKHS. Then we can write the function evaluation as $f(x) = \langle \phi(x), f \rangle_{\mathcal{H}}$, so the squared empirical MMD becomes

$\mathrm{MMD}^2[F, X, Y] = \left\| \frac{1}{n_S} \sum_{i=1}^{n_S} \phi(x_i) - \frac{1}{n_T} \sum_{j=1}^{n_T} \phi(y_j) \right\|_{\mathcal{H}}^2$

The two samples are different if there is at least one function whose empirical mean is different!

MMD
This equation can also be written as [5]:

$\mathrm{MMD}^2[X, Y] = \mathrm{tr}(KL)$

where

$K = \begin{bmatrix} K_{xx} & K_{xy} \\ K_{xy}^{\top} & K_{yy} \end{bmatrix}$

is a composite kernel matrix; $K_{xx}$, $K_{yy}$ and $K_{xy}$ are kernel matrices defined by the positive definite kernel $k(\cdot, \cdot)$ on the source, target and cross domains respectively; and

$L_{ij} = \begin{cases} 1/n_S^2 & \text{if } x_i, x_j \in X_S \\ 1/n_T^2 & \text{if } x_i, x_j \in X_T \\ -1/(n_S n_T) & \text{otherwise} \end{cases}$

$n_S$ and $n_T$ refer to the number of instances in the source and target domains. This is a different representation of the previous equation.
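As an illustrative sketch of the $\mathrm{tr}(KL)$ computation, the snippet below builds the composite kernel matrix $K$ and the coefficient matrix $L$ and checks that, for a linear kernel, the resulting MMD² equals the squared distance between the source and target sample means. The kernel choice and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def mmd_squared(Xs, Xt, kernel):
    """Empirical MMD^2 between source Xs (nS x d) and target Xt (nT x d),
    computed as tr(KL) with the composite kernel matrix K and the
    coefficient matrix L defined above."""
    nS, nT = len(Xs), len(Xt)
    Z = np.vstack([Xs, Xt])                        # stack source and target samples
    K = kernel(Z, Z)                               # composite kernel: [[Kxx, Kxy], [Kyx, Kyy]]
    L = np.empty((nS + nT, nS + nT))
    L[:nS, :nS] = 1.0 / nS**2                      # source-source block
    L[nS:, nS:] = 1.0 / nT**2                      # target-target block
    L[:nS, nS:] = L[nS:, :nS] = -1.0 / (nS * nT)   # cross-domain blocks
    return np.trace(K @ L)

# Linear kernel example: MMD^2 reduces to || mean(Xs) - mean(Xt) ||^2.
linear = lambda A, B: A @ B.T
rng = np.random.default_rng(0)
Xs = rng.normal(0.0, 1.0, size=(100, 5))
Xt = rng.normal(0.5, 1.0, size=(120, 5))           # target mean is shifted
print(mmd_squared(Xs, Xt, linear))
print(np.sum((Xs.mean(0) - Xt.mean(0)) ** 2))      # matches for a linear kernel
```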

  

- How we measure the distance between source and target distributions
- How we identify the differentially distributed dimensions
- How we use this information in the prediction task

Goal: Minimize the distance between source and target domains.

Intuition: Differentially distributed features add to this distance more than features from the same distribution.

Solution: We'll weight each feature and solve for the weights while minimizing the distance.

New kernel function  Let be the weight matrix whose diagonal will give the weights for each dimension.  The new feature map:

where the new kernel mapping would be:

 For a polynomial kernel of degree d, kernel matrix on source domain can be written as:
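A minimal sketch of the weighted source-domain kernel matrix, assuming (as above) that weighting amounts to applying $W = \mathrm{diag}(w)$ to every instance before the polynomial kernel is evaluated; the offset, the degree, the toy data and the weight values are illustrative assumptions.

```python
import numpy as np

def weighted_poly_kernel(A, B, w, degree=2, c=1.0):
    """k_W(a, b) = (<W a, W b> + c)^degree with W = diag(w):
    each feature (column) is scaled by its weight before the inner product."""
    return ((A * w) @ (B * w).T + c) ** degree

# Example: Kxx on a toy source domain with a degree-2 polynomial kernel.
rng = np.random.default_rng(0)
Xs = rng.normal(size=(50, 4))
w = np.array([1.0, 0.2, 0.9, 0.1])     # hypothetical per-feature weights (diag of W)
Kxx = weighted_poly_kernel(Xs, Xs, w)  # 50 x 50 source-domain kernel matrix
```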

Optimization Problem
To solve for the matrix $W$, we present the following optimization problem:
- Equation 1: an objective over the diagonal weight matrix $W$, built from the distance between the weighted source and target distributions, $\mathrm{tr}(K_W L)$ (where $K_W$ is the composite kernel matrix computed with the new weighted kernel), with a ridge penalty that constrains the size of the weights.
- Equation 2: once $W^*$ is solved, assign its diagonal values to the weight vector $w$.
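Since Equation 1 is only outlined above, the sketch below is a rough stand-in rather than the exact convex program: it searches for non-negative, bounded per-feature weights that make the weighted distance $\mathrm{tr}(K_W L)$ large while a ridge term keeps the weights small, which is consistent with variant features receiving the largest weights in the results that follow. The function name, the weight bounds, the ridge coefficient and the polynomial degree are all assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def learn_feature_weights(Xs, Xt, ridge=0.1, degree=2):
    """Illustrative stand-in for Equations 1-2 (not the exact formulation):
    find per-feature weights w (the diagonal of W), kept in [0, 1], that make
    the weighted domain distance tr(K_W L) large, with a ridge penalty on w."""
    nS, nT, d = len(Xs), len(Xt), Xs.shape[1]
    # Coefficient matrix L from the tr(KL) formulation above.
    L = np.empty((nS + nT, nS + nT))
    L[:nS, :nS] = 1.0 / nS**2
    L[nS:, nS:] = 1.0 / nT**2
    L[:nS, nS:] = L[nS:, :nS] = -1.0 / (nS * nT)
    Z = np.vstack([Xs, Xt])

    def objective(w):
        Zw = Z * w                                        # apply diag(W) to every instance
        K = (Zw @ Zw.T + 1.0) ** degree                   # weighted polynomial kernel
        return -np.trace(K @ L) + ridge * np.sum(w**2)    # large distance, small weights

    res = minimize(objective, x0=np.full(d, 0.5), bounds=[(0.0, 1.0)] * d)
    return res.x                                          # learned diagonal of W

# Toy example: feature 2 is mean-shifted in the target domain and should get a large weight.
rng = np.random.default_rng(0)
Xs = rng.normal(size=(80, 4))
Xt = rng.normal(size=(80, 4))
Xt[:, 2] += 2.0
print(learn_feature_weights(Xs, Xt).round(2))
```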

Algorithm
INPUT: samples from the source domain XS, samples from the target domain XT, weight threshold λ

OUTPUT: variant feature set V, invariant feature set N

1. Compute the new kernel function
2. Compute Kxx, Kyy, Kxy, Lxx, Lyy, Lxy
3. Solve Equations 1 and 2
4. FOR i = 1 : d, where d is the number of features
   - IF wi ≥ λ: V ← V ∪ fi
   - ELSE (wi < λ): N ← N ∪ fi
   - END IF
5. END FOR
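To make the FOR loop concrete, here is a minimal sketch of the thresholding step; the weight vector w is assumed to be the output of Equations 1 and 2, and features are identified by their column indices.

```python
def split_features(w, lam):
    """Threshold the learned per-feature weights: indices with w_i >= lam go to
    the variant set V, the rest to the invariant set N (the algorithm's FOR loop)."""
    V = [i for i, wi in enumerate(w) if wi >= lam]
    N = [i for i, wi in enumerate(w) if wi < lam]
    return V, N

# Example: with weights w = [0.05, 0.9, 0.1, 0.7] and threshold lam = 0.5,
# split_features(w, 0.5) returns V = [1, 3] and N = [0, 2].
```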

Experiments
Datasets:
1. Synthetic dataset
   - d-dimensional dataset
   - Indices of variant dimensions are randomly picked
   - Distributions of variant dimensions are randomly picked from a list of distributions (Gaussian, exponential, uniform, ...)
2. Real-world datasets
   - USPS handwritten digit images dataset
   - WiFi localization dataset [4]

Results  Figures show the weights assigned by the algorithm to each dimension  Variant dimensions are shown in red  As illustrated below, variant dimensions have the highest weights as expected 20 variant dimensions

10 variant dimensions

Results
Weights learned by the algorithm when the number of differentially distributed dimensions is varied from 1 to 10 are shown below:

Note the gap in the weights of variant and invariant features

  

- How we measure the distance between source and target distributions
- How we identify the differentially distributed dimensions
- How we use this information in the prediction task

Results
What happens if we train on all the features, regardless of the distribution?

On the synthetic dataset we trained logistic regression and a linear SVM, once with all the features and once with only the invariant features.

Accuracy of linear SVM:

# Samples | Invariant Features | All Features
400       | 86%                | 61%
450       | 86.7%              | 55.3%
500       | 86%                | 55.5%
550       | 87.2%              | 57.2%
600       | 87%                | 56.7%

Accuracy of logistic regression:

# Samples | Invariant Features | All Features
400       | 82%                | 62%
450       | 84.7%              | 58%
500       | 83.5%              | 59.5%
550       | 85.2%              | 60.9%
600       | 84%                | 61.3%

Over 30% improvement in the prediction accuracy!
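A hedged scikit-learn sketch of the comparison behind the tables: train on the source domain once with all features and once with only the invariant subset, then evaluate on the target domain. The data arrays, the label vectors and the invariant_idx list are placeholders, and the two models simply mirror the linear SVM and logistic regression used above.

```python
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def compare_feature_sets(Xs, ys, Xt, yt, invariant_idx):
    """Train on the source domain and test on the target domain, once with all
    features and once restricted to the invariant features."""
    results = {}
    for name, model in [("linear SVM", LinearSVC()),
                        ("logistic regression", LogisticRegression(max_iter=1000))]:
        acc_all = accuracy_score(yt, model.fit(Xs, ys).predict(Xt))
        acc_inv = accuracy_score(
            yt, model.fit(Xs[:, invariant_idx], ys).predict(Xt[:, invariant_idx]))
        results[name] = {"all features": acc_all, "invariant features": acc_inv}
    return results
```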

Results
We compared our method with Transfer Component Analysis (TCA) [3] and Kernel PCA (KPCA).
(Figures: classification accuracy of linear SVM after f-MMD, TCA and KPCA on the USPS dataset; average absolute ridge regression error after f-MMD, TCA and KPCA on the WiFi localization dataset)

Conclusion  We showed that our technique can identify all variant dimensions across source and target domains  We showed that using this information significantly improves the prediction accuracy in the following supervised learning task, compared to using all dimensions  We showed that our method outperforms comparable feature reduction techniques in transfer learning

Future Directions  In the future we’ll explore how to incorporate variant features in to the model to increase the prediction accuracy  We’ll select the variant dimensions with L1 (Lasso) penalty enforced on the objective function  We’ll apply this method to the clinical data, to see which dimensions are distributed differently across sick patients

THANK YOU!!!

References
1. Borgwardt, K., Gretton, A., Rasch, M., Kriegel, H.P., Schölkopf, B., Smola, A.J.: Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics 22(14), 49-57 (2006).
2. Gretton, A., Borgwardt, K.M., Rasch, M., Schölkopf, B., Smola, A.J.: A kernel method for the two-sample problem. In: Advances in Neural Information Processing Systems 19, pp. 513-520. MIT Press, Cambridge, MA (2007).
3. Pan, S.J., Tsang, I.W., Kwok, J.T., Yang, Q.: Domain adaptation via transfer component analysis. In: Proceedings of IJCAI 2009, pp. 1187-1192.
4. Yang, Q., Pan, S.J., Zheng, V.W.: Estimating location using Wi-Fi. IEEE Intelligent Systems 23(1), 8-13 (Jan/Feb 2008).
5. Pan, S.J., Kwok, J.T., Yang, Q.: Transfer learning via dimensionality reduction. In: Proceedings of AAAI 2008, pp. 677-682.