Feature Selection for Transfer Learning
Selen Uguroglu, Jaime Carbonell
[email protected], [email protected]
Language Technologies Institute, Carnegie Mellon University
Overview
• Transfer Learning Motivation
• Related Work
• Our Approach
• Results
  o Feature selection
  o Prediction performance
• Conclusion & Future Directions
• References
Motivation
Supervised machine learning: a model is trained on labeled data and then used to make predictions on unlabeled data.
[Figure: labeled data → model → prediction on unlabeled data]
Assumption: Training data and test data are drawn from the same distribution and same feature space
Many real world tasks violate this assumption!
Motivation
• Sampling bias
  o Clinical trials: sicker patients may volunteer for a clinical trial
• Not having sufficient labeled data for the prediction task: leverage another related labeled dataset to perform prediction on the current dataset
  o Sentiment analysis: using computer reviews to detect sentiment in camera reviews
• Change in distribution over space
  o Surveys: characteristics of geographical regions may differ
• Change in distribution over time
  o WiFi localization: data collected at a later time point may be distributed differently than data from earlier time points
We can't expect good results if we train our model on pears and apply it to apples.
Related Work
Prior methods that aim to reduce the distance between domains:
• Instance reweighting
  o Incorporate the ratio of source and target data densities into the loss function
  o Learn instance weights that reduce the distance between domains
• Feature representation methods
  o Find pivot features
  o Project both domains onto a lower-dimensional latent space
In this work we propose a different approach to the problem
Which features create the discrepancy between source and target distributions?
[Figure: source vs. target data under three scenarios: identical distributions, mean shift, and mean-and-variance shift]
Our contributions:
• A novel approach for domain adaptation
• Identify the variant features across source and target domains
• Present a convex optimization problem
• Outperform other methods in prediction accuracy
Our Approach
1. How we measure the distance between source and target distributions
2. How we identify the differentially distributed dimensions
3. How we use this information in the prediction task
Measuring the Distance
• Kullback-Leibler divergence requires an expensive density estimation step.
• Maximum mean discrepancy (MMD) is a nonparametric kernel method that can be used to measure the distance between two distributions [1][2].
MMD Definition
Let $X = \{x_1, \ldots, x_{n_S}\}$ and $Y = \{y_1, \ldots, y_{n_T}\}$ be two sets of observations drawn from Borel probability distributions $p$ and $q$, and let $\mathcal{F}$ be a class of functions $f : \mathcal{X} \to \mathbb{R}$. The empirical estimate of MMD is [1]:

$$\mathrm{MMD}[\mathcal{F}, X, Y] = \sup_{f \in \mathcal{F}} \left( \frac{1}{n_S} \sum_{i=1}^{n_S} f(x_i) - \frac{1}{n_T} \sum_{j=1}^{n_T} f(y_j) \right)$$

Let $\phi : \mathcal{X} \to \mathcal{H}$ be the feature map, where $\mathcal{H}$ is an RKHS. Writing the function evaluation as $f(x) = \langle \phi(x), f \rangle_{\mathcal{H}}$ and taking $\mathcal{F}$ to be the unit ball in $\mathcal{H}$, the empirical MMD becomes:

$$\mathrm{MMD}[\mathcal{F}, X, Y] = \left\| \frac{1}{n_S} \sum_{i=1}^{n_S} \phi(x_i) - \frac{1}{n_T} \sum_{j=1}^{n_T} \phi(y_j) \right\|_{\mathcal{H}}$$
The two samples are different if there is at least one function whose empirical mean is different!
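To make the estimator concrete, here is a minimal numerical sketch of the (biased) empirical estimate, expanded with the kernel trick as MMD^2 = mean(Kxx) - 2 mean(Kxy) + mean(Kyy). The RBF kernel, its bandwidth gamma, the helper names, and the toy data are illustrative assumptions, not part of the poster.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian RBF kernel matrix between the rows of A and the rows of B."""
    sq_dists = (np.sum(A ** 2, axis=1)[:, None]
                + np.sum(B ** 2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-gamma * sq_dists)

def mmd2(Xs, Xt, kernel=rbf_kernel):
    """Biased empirical estimate of the squared MMD between two samples."""
    Kxx = kernel(Xs, Xs)
    Kyy = kernel(Xt, Xt)
    Kxy = kernel(Xs, Xt)
    return Kxx.mean() - 2.0 * Kxy.mean() + Kyy.mean()

# Toy check: a mean shift in a single feature increases the estimated MMD.
rng = np.random.default_rng(0)
Xs = rng.normal(size=(200, 5))
Xt_same = rng.normal(size=(200, 5))
Xt_shift = Xt_same.copy()
Xt_shift[:, 0] += 2.0          # shift only the first feature
print(mmd2(Xs, Xt_same))       # close to zero
print(mmd2(Xs, Xt_shift))      # clearly larger
```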
MMD
This equation can also be written as [5]:

$$\mathrm{MMD}[\mathcal{F}, X, Y]^2 = \mathrm{tr}(KL)$$

where

$$K = \begin{bmatrix} K_{xx} & K_{xy} \\ K_{xy}^\top & K_{yy} \end{bmatrix}$$

is a composite kernel matrix; $K_{xx}$, $K_{yy}$, and $K_{xy}$ are kernel matrices defined by a positive definite kernel on the source domain, the target domain, and across domains, respectively. Stacking the source and target instances into a single sample $z_1, \ldots, z_{n_S + n_T}$, the coefficient matrix $L$ is

$$L_{ij} = \begin{cases} \dfrac{1}{n_S^2} & z_i, z_j \in X_S \\[4pt] \dfrac{1}{n_T^2} & z_i, z_j \in X_T \\[4pt] -\dfrac{1}{n_S n_T} & \text{otherwise} \end{cases}$$

where $n_S$ and $n_T$ are the numbers of instances in the source and target domains. This is simply a different representation of the previous equation.
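The trace form can be checked with a short sketch that reuses `rbf_kernel` and `mmd2` from the previous snippet; the helper name `mmd2_trace` is mine, but K and L are built exactly as defined above.

```python
import numpy as np

def mmd2_trace(Xs, Xt, kernel):
    """Squared MMD written as tr(K L), with K the composite kernel matrix
    over the stacked source/target sample and L the coefficient matrix."""
    nS, nT = len(Xs), len(Xt)
    Z = np.vstack([Xs, Xt])
    K = kernel(Z, Z)                                # [[Kxx, Kxy], [Kxy^T, Kyy]]
    L = np.empty((nS + nT, nS + nT))
    L[:nS, :nS] = 1.0 / nS ** 2                     # source-source block
    L[nS:, nS:] = 1.0 / nT ** 2                     # target-target block
    L[:nS, nS:] = L[nS:, :nS] = -1.0 / (nS * nT)    # cross-domain blocks
    return np.trace(K @ L)

# mmd2_trace(Xs, Xt, rbf_kernel) matches mmd2(Xs, Xt) from the snippet above.
```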
Our Approach (step 2): How we identify the differentially distributed dimensions
Goal: Reduce the distance between source and target domains by identifying the features that cause it.
Intuition: Differentially distributed (variant) features add more to this distance than features drawn from the same distribution.
Solution: Weight each feature and solve for the weights that maximize the weighted distance; the variant features then receive the highest weights.
New kernel function
Let $W$ be the diagonal weight matrix whose diagonal entries give the weight of each dimension. The new feature map is

$$\varphi(x) = \phi(W x)$$

so the new kernel mapping is

$$k_W(x_i, x_j) = \langle \phi(W x_i), \phi(W x_j) \rangle_{\mathcal{H}}$$

For a polynomial kernel of degree $d$, the kernel matrix on the source domain can be written as

$$[K_{xx}]_{ij} = \left( x_i^\top W^\top W\, x_j + 1 \right)^d$$
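A small sketch of the weighted kernel for the polynomial case. Because $W$ is diagonal, multiplying by $W$ reduces to an elementwise scaling of the features; the offset c = 1 and the default degree are illustrative assumptions.

```python
import numpy as np

def weighted_poly_kernel(A, B, w, degree=2, c=1.0):
    """Polynomial kernel on feature-weighted inputs:
    k_W(a, b) = ((W a) . (W b) + c)^degree with W = diag(w)."""
    AW = A * w            # equivalent to A @ np.diag(w) for diagonal W
    BW = B * w
    return (AW @ BW.T + c) ** degree
```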
Optimization Problem
To solve for the matrix $W$, we pose the following optimization problem:

$$W^* = \arg\max_{W} \; \mathrm{tr}(K_W L) - \mu \, \mathrm{tr}(W^\top W) \qquad (1)$$

where $W$ is the diagonal weight matrix and $K_W$ is the composite kernel matrix built from the weighted kernel $k_W$. The first term is the distance between the weighted source and target distributions; the second term constrains the size of the weights by applying a ridge penalty. Once $W^*$ is solved for, the feature weights are read off its diagonal:

$$w = \mathrm{diag}(W^*) \qquad (2)$$
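One rough way to approximate a solution numerically, assuming equation (1) is maximized as described above: scale each feature by a nonnegative weight, build the weighted polynomial kernel over the stacked sample, and maximize tr(K_W L) minus the ridge penalty with a box-constrained quasi-Newton solver. The solver (SciPy's L-BFGS-B), the [0, 1] box that keeps the sketch bounded, the constant mu, and the kernel parameters are my assumptions, not the poster's exact procedure.

```python
import numpy as np
from scipy.optimize import minimize

def build_L(nS, nT):
    """Coefficient matrix L from the MMD trace formulation above."""
    L = np.empty((nS + nT, nS + nT))
    L[:nS, :nS] = 1.0 / nS ** 2
    L[nS:, nS:] = 1.0 / nT ** 2
    L[:nS, nS:] = L[nS:, :nS] = -1.0 / (nS * nT)
    return L

def solve_feature_weights(Xs, Xt, degree=2, c=1.0, mu=1.0):
    """Approximate equations (1)-(2): nonnegative per-feature weights that
    maximize the weighted MMD, tr(K_W L), minus a ridge penalty on w."""
    nS, nT = len(Xs), len(Xt)
    Z = np.vstack([Xs, Xt])
    L = build_L(nS, nT)

    def neg_objective(w):
        ZW = Z * w                            # apply the diagonal weights W
        K = (ZW @ ZW.T + c) ** degree         # weighted polynomial kernel
        return -(np.trace(K @ L) - mu * np.dot(w, w))

    w0 = np.full(Z.shape[1], 0.5)
    # the [0, 1] box is an extra assumption made here to keep the sketch bounded
    res = minimize(neg_objective, w0, bounds=[(0.0, 1.0)] * Z.shape[1])
    return res.x                              # equation (2): w = diag(W*)
```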
Algorithm
INPUT: samples from the source domain XS, samples from the target domain XT, weight threshold λ
OUTPUT: variant feature set V, invariant feature set N
• Compute the new (weighted) kernel function
• Compute Kxx, Kyy, Kxy and Lxx, Lyy, Lxy
• Solve equations (1) and (2) for the weight vector w
• FOR i = 1 : d, where d is the number of features
  • IF wi ≥ λ THEN V ← V ∪ {fi}
  • ELSE (wi < λ) N ← N ∪ {fi}
  • END IF
• END FOR
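A tiny sketch of the thresholding loop at the end of the algorithm; `solve_feature_weights` is the hypothetical helper from the previous sketch, and the threshold value in the usage line is arbitrary.

```python
def split_features(w, lam):
    """Split feature indices into the variant set V and the invariant set N
    by thresholding the learned weights, as in the loop above."""
    V = [i for i, wi in enumerate(w) if wi >= lam]   # variant features
    N = [i for i, wi in enumerate(w) if wi < lam]    # invariant features
    return V, N

# Hypothetical usage: w = solve_feature_weights(Xs, Xt); V, N = split_features(w, 0.5)
```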
Experiments
Datasets:
1. Synthetic dataset (a generation sketch follows this list)
  o d-dimensional dataset
  o The indices of the variant dimensions are picked at random
  o The distributions of the variant dimensions are drawn at random from a list of distributions (Gaussian, exponential, uniform, ...)
2. Real-world datasets
  o USPS handwritten digit images dataset
  o WiFi localization dataset [4]
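A rough sketch of how such a synthetic dataset could be generated; the particular target-domain distribution, its parameters, and the helper name are illustrative, since the poster only states that the variant dimensions and their distributions are picked at random.

```python
import numpy as np

def make_synthetic(n=300, d=20, n_variant=5, seed=0):
    """Source/target samples in which a random subset of dimensions is
    variant, i.e. drawn from different distributions in the two domains."""
    rng = np.random.default_rng(seed)
    variant = rng.choice(d, size=n_variant, replace=False)
    Xs = rng.normal(size=(n, d))
    Xt = rng.normal(size=(n, d))
    for j in variant:
        # e.g. replace the target marginal of a variant dimension
        Xt[:, j] = rng.exponential(scale=2.0, size=n)
    return Xs, Xt, sorted(variant)
```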
Results
[Figure: weights assigned by the algorithm to each dimension, for 20 variant dimensions and for 10 variant dimensions; variant dimensions are shown in red]
As illustrated in the figure, the variant dimensions receive the highest weights, as expected.
[Figure: weights learned by the algorithm as the number of differentially distributed dimensions is varied from 1 to 10]
Note the gap between the weights of the variant and invariant features.
Our Approach (step 3): How we use this information in the prediction task
Results
What happens if we train on all the features, regardless of their distribution?
On the synthetic dataset we trained logistic regression and a linear SVM, once with all the features and once with only the invariant features (a code sketch of this comparison follows the tables).

Accuracy of linear SVM
# Samples | Invariant Features | All Features
400       | 86%                | 61%
450       | 86.7%              | 55.3%
500       | 86%                | 55.5%
550       | 87.2%              | 57.2%
600       | 87%                | 56.7%

Accuracy of Logistic Regression
# Samples | Invariant Features | All Features
400       | 82%                | 62%
450       | 84.7%              | 58%
500       | 83.5%              | 59.5%
550       | 85.2%              | 60.9%
600       | 84%                | 61.3%
Over 30% improvement in the prediction accuracy!
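To make the comparison in the tables concrete, here is a hedged sketch of the downstream experiment: train on the source domain and evaluate on the target domain, once with all features and once restricted to the invariant feature set N. The helper name, the scikit-learn estimators, and their hyperparameters are illustrative assumptions, and `ys`/`yt` denote source and target labels.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

def compare_feature_sets(Xs, ys, Xt, yt, invariant_idx):
    """Target-domain accuracy when training on all features vs. only the
    invariant features identified by the weight-thresholding step."""
    invariant_idx = np.asarray(invariant_idx)
    results = {}
    for name, make_clf in [("linear SVM", LinearSVC),
                           ("logistic regression",
                            lambda: LogisticRegression(max_iter=1000))]:
        acc_all = make_clf().fit(Xs, ys).score(Xt, yt)
        acc_inv = make_clf().fit(Xs[:, invariant_idx], ys).score(
            Xt[:, invariant_idx], yt)
        results[name] = {"all features": acc_all,
                         "invariant features": acc_inv}
    return results
```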
Results
We compared our method (f-MMD) with Transfer Component Analysis (TCA) [3] and Kernel PCA (KPCA).
[Figure: classification accuracy of linear SVM after f-MMD, TCA and KPCA on the USPS dataset]
[Figure: average absolute ridge regression error after f-MMD, TCA and KPCA on the WiFi localization dataset]
Conclusion
• Our technique identifies all variant dimensions across the source and target domains.
• Using this information significantly improves the prediction accuracy of the subsequent supervised learning task, compared to using all dimensions.
• Our method outperforms comparable feature reduction techniques in transfer learning.
Future Directions
• Explore how to incorporate the variant features into the model to further increase prediction accuracy.
• Select the variant dimensions with an L1 (lasso) penalty enforced on the objective function.
• Apply the method to clinical data, to see which dimensions are distributed differently across patient populations.
THANK YOU!!!
References
1. Borgwardt, K., Gretton, A., Rasch, M., Kriegel, H.-P., Schölkopf, B., Smola, A.J.: Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics 22(14), 49–57 (2006)
2. Gretton, A., Borgwardt, K.M., Rasch, M., Schölkopf, B., Smola, A.J.: A kernel method for the two-sample problem. In: Advances in Neural Information Processing Systems 19, pp. 513–520. MIT Press, Cambridge, MA (2007)
3. Pan, S.J., Tsang, I.W., Kwok, J.T., Yang, Q.: Domain adaptation via transfer component analysis. In: Proceedings of IJCAI 2009, pp. 1187–1192 (2009)
4. Yang, Q., Pan, S.J., Zheng, V.W.: Estimating location using WiFi. IEEE Intelligent Systems 23(1), 8–13 (2008)
5. Pan, S.J., Kwok, J.T., Yang, Q.: Transfer learning via dimensionality reduction. In: AAAI 2008, pp. 677–682 (2008)