Making Generative Classifiers Robust to Selection Bias Andrew Smith
Charles Elkan
November 30th, 2007
Outline
◮ What is selection bias?
◮ Types of selection bias.
◮ Overcoming learnable bias with weighting.
◮ Overcoming bias with maximum likelihood (ML).
◮ Experiment 1: ADULT dataset.
◮ Experiment 2: CA-housing dataset.
◮ Future work & conclusions.
What is selection bias?
Traditional semi-supervised learning assumes:
◮ Some samples are labeled, some are not.
◮ Labeled and unlabeled examples are identically distributed.
Semi-supervised learning under selection bias:
◮ Labeled examples are not selected at random from the general population.
◮ Labeled and unlabeled examples may be differently distributed.
Examples
◮ Loan application approval
  ◮ Goal is to model repay/default behavior of all applicants.
  ◮ But the training set only includes labels for people who were approved for a loan.
◮ Spam filtering
  ◮ Goal is an up-to-date spam filter.
  ◮ But, while up-to-date unlabeled emails are available, hand-labeled data sets are expensive and may be rarely updated.
Framework
Types of selection bias are distinguished by conditional independence assumptions between:
◮ x, the feature vector.
◮ y, the class label. If y is binary, y ∈ {1, 0}.
◮ s, the binary selection variable. If y_i is observable then s_i = 1, otherwise s_i = 0.
Types of selection bias – No bias
[Graphical model over x, y, s.]
s ⊥ x, s ⊥ y
◮ The standard semi-supervised learning scenario.
◮ Labeled examples are selected completely at random from the general population.
◮ The missing labels are said to be "missing completely at random" (MCAR) in the literature.
Types of selection bias – Learnable bias
[Graphical model over x, y, s.]
s ⊥ y | x
◮ Labeled examples are selected from the general population depending only on the features x.
◮ A model p(s|x) is learnable.
◮ The missing labels are said to be "missing at random" (MAR), or "ignorable bias," in the literature.
◮ p(y | x, s = 1) = p(y | x).
Model mis-specification under learnable bias
p(y | x, s = 1) = p(y | x) implies decision boundaries are the same in the labeled and general populations. But suppose the model is mis-specified? Then a sub-optimal decision boundary may be learned under MAR bias.
[Figure: synthetic data, two panels. "Viewing hidden labels": the best mis-specified boundary vs. the true boundary. "Ignoring samples without labels": the estimated mis-specified boundary vs. the estimated well-specified boundary.]
Types of selection bias – Arbitrary bias
[Graphical model over x, y, s.]
◮ Labeled examples are selected from the general population possibly depending on the label itself.
◮ No independence assumptions can be made.
◮ The missing labels are said to be "missing not at random" (MNAR) in the literature.
Overcoming bias – Two alternate goals
The training data consist of {(x_i, y_i) | s_i = 1} and {x_i | s_i = 0}. Two goals are possible:
◮ General population modeling: learn p(y | x), e.g. loan application approval.
◮ Unlabeled population modeling: learn p(y | x, s = 0), e.g. spam filtering.
Overcoming learnable bias – General population modeling
Lemma 1. Under MAR bias in the labeling,
\[ p(x, y) = \frac{p(s=1)}{p(s=1 \mid x)}\, p(x, y \mid s=1) \]
if all probabilities are non-zero.
The distribution of samples in the general population is a weighted version of the distribution of labeled samples. Since p(s|x) is learnable, we can estimate the weights.
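As an illustration of how Lemma 1 could be applied in practice, here is a minimal sketch (not from the talk): fit a probabilistic model of p(s = 1 | x) on all examples, then weight each labeled example by p(s = 1) / p(s = 1 | x) when training a classifier for the general population. The sklearn usage, variable names, and clipping are assumptions made for illustration.

```python
# Minimal sketch of Lemma 1 weighting (illustrative; not the authors' code).
# Assumed inputs: X, an (n, d) feature matrix for all examples; s, an (n,)
# binary selection indicator; y_lab, labels for the examples with s == 1.
import numpy as np
from sklearn.linear_model import LogisticRegression

def lemma1_weights(X, s):
    """Weights p(s=1) / p(s=1|x) for the labeled examples (those with s == 1)."""
    selector = LogisticRegression(max_iter=1000).fit(X, s)  # model of p(s=1|x)
    p_s1_given_x = selector.predict_proba(X[s == 1])[:, 1]
    p_s1 = s.mean()                                         # marginal p(s=1)
    return p_s1 / np.clip(p_s1_given_x, 1e-6, None)

# A classifier aimed at the *general* population, trained on labeled data only:
# w = lemma1_weights(X, s)
# clf = LogisticRegression(max_iter=1000).fit(X[s == 1], y_lab, sample_weight=w)
```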
General population modeling – Application
Lemma 1 can be used:
◮ to estimate class-conditional density models p(x|y).
◮ to improve the loss of mis-specified discriminative classifiers in the general population.
[Figure: synthetic data, two panels. "Viewing hidden labels": the best general-population boundary. "Using Lemma 1": the estimated general-population boundary.]
The weighted logistic regression finds a decision boundary close to the best possible for the general population.
Overcoming learnable bias – Unlabeled population modeling
Lemma 2. Under MAR bias in the labeling,
\[ p(x, y \mid s=0) = \frac{1 - p(s=1 \mid x)}{p(s=1 \mid x)} \cdot \frac{p(s=1)}{1 - p(s=1)}\, p(x, y \mid s=1) \]
if all probabilities are non-zero.
Similarly, the distribution of samples in the unlabeled population is a weighted version of the distribution of labeled samples. Since p(s|x) is learnable, we can estimate the weights.
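The same idea gives a sketch for Lemma 2; only the weight formula changes. Again this is illustrative, reusing the numpy/sklearn imports and the assumed inputs from the Lemma 1 snippet.

```python
def lemma2_weights(X, s):
    """Weights that make the labeled examples mimic the *unlabeled* population,
    per Lemma 2: [(1 - p(s=1|x)) / p(s=1|x)] * [p(s=1) / (1 - p(s=1))]."""
    selector = LogisticRegression(max_iter=1000).fit(X, s)  # p(s=1|x), as before
    p_s1_given_x = selector.predict_proba(X[s == 1])[:, 1]
    p_s1 = s.mean()
    odds_x = (1.0 - p_s1_given_x) / np.clip(p_s1_given_x, 1e-6, None)
    return odds_x * p_s1 / (1.0 - p_s1)
```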
Unlabeled population modeling – Application
Lemma 2 can be used:
◮ to estimate class-conditional density models p(x|y, s = 0).
◮ to improve the loss of mis-specified discriminative classifiers in the unlabeled population.
[Figure: synthetic data, two panels. "Viewing hidden labels": the best unlabeled-population boundary. "Using Lemma 2": the estimated unlabeled-population boundary.]
The weighted logistic regression finds a decision boundary close to the best possible for the unlabeled population.
Overcoming arbitrary bias – Maximum Likelihood
The log likelihood of parameters Θ over semi-labeled dataset X is:
\[ \ell(\Theta, X) = \sum_{i=1}^{m} \log p_\Theta(x_i, y_i, s_i{=}1) + \sum_{i=m+1}^{m+n} \log \sum_{y} p_\Theta(x_i, y, s_i{=}0) \]
for labeled data i = 1, ..., m and unlabeled data i = m+1, ..., m+n.
◮ Under the assumption of learnable bias this reduces to a traditional semi-supervised likelihood equation...
Log-likelihood assuming learnable bias
◮ Factor p(x, y, s) = p(s|x, y) p(x|y) p(y).
◮ Assume learnable bias (MAR): p(s|x, y) = p(s|x).
\begin{align*}
\ell(\Theta, X) &= \sum_{i=1}^{m} \log p(x_i, y_i, s_i{=}1) + \sum_{i=m+1}^{m+n} \log \sum_{y} p(x_i, y, s_i{=}0) \\
&= \sum_{i=1}^{m} \log p(s_i \mid x_i, y_i)\, p(x_i \mid y_i)\, p(y_i) + \sum_{i=m+1}^{m+n} \log \sum_{y} p(s_i \mid x_i, y)\, p(x_i \mid y)\, p(y) \\
&= \sum_{i=1}^{m} \log p(s_i \mid x_i)\, p(x_i \mid y_i)\, p(y_i) + \sum_{i=m+1}^{m+n} \log \Big\{ p(s_i \mid x_i) \sum_{y} p(x_i \mid y)\, p(y) \Big\} \\
&= \sum_{i=1}^{m} \log p(x_i \mid y_i)\, p(y_i) + \sum_{i=m+1}^{m+n} \log \sum_{y} p(x_i \mid y)\, p(y) + \sum_{i=1}^{m+n} \log p(s_i \mid x_i)
\end{align*}
The last term depends only on p(s|x), not on the class model, so maximizing over the class-model parameters reduces to the traditional semi-supervised likelihood maximization.
Log-likelihood making no assumptions about bias
Use a different factoring: p(x, y, s) = p(x|y, s) p(y|s) p(s).
\[ \ell(\Theta_L, \Theta_U, X) = \sum_{i=1}^{m} \log p_{\Theta_L}(x_i \mid y_i, s_i{=}1)\, p_{\Theta_L}(y_i \mid s_i{=}1)\, p(s_i{=}1) + \sum_{i=m+1}^{m+n} \log \sum_{y} p_{\Theta_U}(x_i \mid y, s_i{=}0)\, p_{\Theta_U}(y \mid s_i{=}0)\, p(s_i{=}0) \]
◮ For |C| classes, this has 2|C| class-conditional density models: p(x|y = c, s) for each c ∈ C.
◮ This simplifies to two independent maximizations.
◮ But how do we maximize the likelihood in a sensible way for the unlabeled data?
The "Shifted Mixture-Model" (SMM) approach to improving the likelihood for unlabeled samples
Solution: Assume the model parameters for the labeled data, p_{Θ_L}(x|y, s = 1) and p_{Θ_L}(y|s = 1), and the parameters for the unlabeled data, p_{Θ_U}(x|y, s = 0) and p_{Θ_U}(y|s = 0), are "close."
◮ Learn density models p_{Θ_L}(x_i|y_i, s_i = 1) and priors p_{Θ_L}(y_i|s_i = 1) from labeled data.
◮ Initialize the parameters Θ_U for the unlabeled data with the parameters Θ_L learned from labeled data, or possibly using Lemma 2.
◮ Improve the likelihood of Θ_U given the unlabeled data (unsupervised learning).
The SMM approach is useful for both general and unlabeled population modeling, as it produces an explicit generative model of both populations.
Example with synthetic data – Concept Drift
Application to EM
Application of the SMM approach to EM:
◮ Let Θ_U^0 be the parameters for p_{Θ_U}(x|y, s = 0), initialized with the labeled data.
◮ Limit to a few (5) iterations to limit parameter changes.
◮ Use an inertia parameter α = 0.99 to slow parameter evolution: given Θ_U^t and EM update Θ′, use Θ_U^{t+1} ← α Θ_U^t + (1 − α) Θ′.
◮ The final Θ_U^5 gives the parameters for the unlabeled data.
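A minimal sketch of the inertia-damped EM step, assuming one Gaussian class-conditional density per class; the model structure, function name, and helper details are illustrative assumptions, not the talk's implementation.

```python
# Illustrative sketch of the SMM inertia-damped EM update (not the authors' code).
# Parameters Theta_U = (priors, means, covs): one Gaussian per class.
import numpy as np
from scipy.stats import multivariate_normal

def em_with_inertia(X_unlab, priors, means, covs, alpha=0.99, n_iter=5):
    """Run a few EM iterations on the unlabeled data, damping each update:
    Theta_U^{t+1} = alpha * Theta_U^t + (1 - alpha) * Theta'  (as on the slide)."""
    K = len(priors)
    for _ in range(n_iter):
        # E-step: responsibilities p(y = k | x) under the current parameters.
        resp = np.column_stack([
            priors[k] * multivariate_normal.pdf(X_unlab, means[k], covs[k])
            for k in range(K)
        ])
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: standard mixture updates -- this is the EM update Theta'.
        Nk = resp.sum(axis=0)
        new_priors = Nk / Nk.sum()
        new_means = (resp.T @ X_unlab) / Nk[:, None]
        new_covs = []
        for k in range(K):
            d = X_unlab - new_means[k]
            new_covs.append((resp[:, k, None] * d).T @ d / Nk[k])
        # Inertia: move only a small fraction of the way toward Theta'.
        priors = alpha * priors + (1 - alpha) * new_priors
        means = alpha * means + (1 - alpha) * new_means
        covs = [alpha * covs[k] + (1 - alpha) * new_covs[k] for k in range(K)]
    return priors, means, covs
```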
Experiment 1 – the ADULT data set
Features:
◮ x: (AGE, EDUCATION, CAPITAL GAIN, CAPITAL LOSS, HOURS PER WEEK, SEX, NATIVE TO US, ETHNICITY, FULL TIME)
◮ y: INCOME > $50,000?
◮ s: MARRIED?
Most of the information in this dataset could be used in a loan approval system. The target is analogous to a repay/default behavior label. Marital status is analogous to an unquantifiable measure of responsibility that wouldn't be recorded in a bank's records but might influence the label.
Determining the type of bias
Is it MCAR? No:
◮ p(y = 1|s = 1) = 0.4556
◮ p(y = 1|s = 0) = 0.0692
Is it MAR? Not as far as logistic regression can detect:
◮ Accuracy in gen. pop. based on labeled data = 74.2%
◮ Accuracy in gen. pop. based on all data = 80.7%
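These two checks are easy to script. A hedged sketch follows; the exact evaluation protocol (train/test split, classifier settings) is an assumption, since the talk does not spell it out, and in these benchmark experiments the hidden labels are known for all examples.

```python
# Sketch of the MCAR / MAR diagnostics above (illustrative, protocol assumed).
# Inputs: X (features), y (labels, known for every example here), s (selection).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def bias_diagnostics(X, y, s):
    # MCAR check: does the class prior differ between labeled and unlabeled data?
    print("p(y=1|s=1) =", y[s == 1].mean())
    print("p(y=1|s=0) =", y[s == 0].mean())
    # MAR check (as far as logistic regression can detect): compare
    # general-population accuracy of a model trained on labeled data only
    # with one trained on all data.
    X_tr, X_te, y_tr, y_te, s_tr, _ = train_test_split(X, y, s, test_size=0.3)
    lab = LogisticRegression(max_iter=1000).fit(X_tr[s_tr == 1], y_tr[s_tr == 1])
    allm = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print("acc, trained on labeled only:", lab.score(X_te, y_te))
    print("acc, trained on all data:   ", allm.score(X_te, y_te))
```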
Results
[Boxplots over 10 random trials: general-population accuracy for LR, LR + Lemma 1, GMM, GMM + Lemma 1, and SMM; unlabeled-population accuracy for LR, LR + Lemma 2, GMM, GMM + Lemma 2, and SMM.]
Analysis in the loan-application context:
◮ We can achieve better accuracy than logistic regression.
◮ We can do "reject inference."
◮ We can improve accuracy by not assuming MAR bias.
Experiment 2 – the CA-HOUSING data set
Features:
◮ x: CA census tract data: MEDIAN INCOME, MEDIAN HOUSE AGE, TOTAL ROOMS, TOTAL BEDROOMS, POPULATION, HOUSEHOLDS
◮ y: house VALUE > California median?
◮ s: LATITUDE > 36 and within 0.4 degrees of coast?
[Figure: map of California (latitude 34–42, longitude −124 to −116) marking the selected census tracts.]
The goal is to learn a model of housing prices throughout California when price information is available only for the northern California coast.
Is it MAR?
Determining the type of bias
Is it MCAR? No: ◮
p(y = 1|s = 1) = 0.751
◮
p(y = 1|s = 0) = 0.443
Is it MAR? Not as far as logistic regression can detect: ◮
Accuracy in gen. pop. based on labeled data = 74.8%
◮
Accuracy in gen. pop. based on all data = 80.5%
Results
[Boxplots over 10 random trials: general-population accuracy for LR, LR + Lemma 1, GMM, GMM + Lemma 1, and SMM; unlabeled-population accuracy for LR, LR + Lemma 2, GMM, GMM + Lemma 2, and SMM.]
Analysis:
◮ We can achieve better accuracy than logistic regression.
◮ Isolated areas constitute a biased sample of housing prices throughout California.
◮ We can improve accuracy by not assuming MAR bias.
Future Work
More "integrated" ways to keep Θ_L and Θ_U close:
◮ Specify a Bayesian prior.
◮ Add a term to the likelihood equation: +λ KL(Θ_L, Θ_U).
Other factorizations:
◮ We only explored one factoring, p(x, y, s) = p(x|y, s) p(y|s) p(s), resulting in independent maximizations.
◮ Other factorizations yield models with more coupled parameters.
Conclusions
◮ Model mis-specification, the reality in real-world applications, means that even under MAR bias, re-weighting can help discriminative and generative classifiers.
◮ Improving the likelihood of the model parameters for the unlabeled data yields better classifiers in the unlabeled and general populations.
◮ The SMM approach is a practical way of letting the unlabeled-model parameters better represent the unlabeled samples while remaining close to the parameters that yield the best predictions for the labeled data.