Making Generative Classifiers Robust to Selection Bias Andrew Smith
Charles Elkan
November 30th, 2007
Outline
◮ What is selection bias?
◮ Types of selection bias.
◮ Overcoming learnable bias with weighting.
◮ Overcoming bias with maximum likelihood (ML).
◮ Experiment 1: ADULT dataset.
◮ Experiment 2: CA-housing dataset.
◮ Future work & conclusions.
What is selection bias?
Traditional semi-supervised learning assumes:
◮ Some samples are labeled, some are not.
◮ Labeled and unlabeled examples are identically distributed.
Semi-supervised learning under selection bias:
◮ Labeled examples are not selected at random from the general population.
◮ Labeled and unlabeled examples may be differently distributed.
Examples
◮ Loan application approval
  ◮ Goal is to model repay/default behavior of all applicants.
  ◮ But the training set only includes labels for people who were approved for a loan.
◮ Spam filtering
  ◮ Goal is an up-to-date spam filter.
  ◮ But, while up-to-date unlabeled emails are available, hand-labeled data sets are expensive and may be rarely updated.
Framework
Types of selection bias are distinguished by conditional independence assumptions between:
◮ x, the feature vector.
◮ y, the class label. If y is binary, y ∈ {1, 0}.
◮ s, the binary selection variable. If y_i is observable then s_i = 1, otherwise s_i = 0.
Types of selection bias – No bias
[Graphical model over x, y, s.]
s ⊥ x, s ⊥ y
◮ The standard semi-supervised learning scenario.
◮ Labeled examples are selected completely at random from the general population.
◮ The missing labels are said to be "missing completely at random" (MCAR) in the literature.
Types of selection bias – Learnable bias
[Graphical model over x, y, s.]
s ⊥ y | x
◮ Labeled examples are selected from the general population depending only on the features x.
◮ A model p(s|x) is learnable.
◮ The missing labels are said to be "missing at random" (MAR), or "ignorable bias," in the literature.
◮ p(y | x, s = 1) = p(y | x).
Model mis-specification under learnable bias
p(y | x, s = 1) = p(y | x) implies decision boundaries are the same in the labeled and general populations. But suppose the model is mis-specified? Then a sub-optimal decision boundary may be learned under MAR bias.
[Figure: synthetic data, two panels. "Viewing hidden labels": the best mis-specified boundary vs. the true boundary. "Ignoring samples without labels": the estimated mis-specified boundary vs. the estimated well-specified boundary.]
Types of selection bias – Arbitrary bias
[Graphical model over x, y, s.]
◮ Labeled examples are selected from the general population possibly depending on the label itself.
◮ No independence assumptions can be made.
◮ The missing labels are said to be "missing not at random" (MNAR) in the literature.
Overcoming bias – Two alternate goals
The training data consist of {(x_i, y_i) | s_i = 1} and {x_i | s_i = 0}. Two goals are possible:
◮ General population modeling: learn p(y | x), e.g. loan application approval.
◮ Unlabeled population modeling: learn p(y | x, s = 0), e.g. spam filtering.
Overcoming learnable bias – General population modeling
Lemma 1. Under MAR bias in the labeling,
\[ p(x, y) = \frac{p(s=1)}{p(s=1 \mid x)}\, p(x, y \mid s=1) \]
if all probabilities are non-zero.
The distribution of samples in the general population is a weighted version of the distribution of labeled samples. Since p(s|x) is learnable, we can estimate the weights.
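As an illustration of how Lemma 1 could be applied in practice, here is a minimal sketch (not from the talk): fit a probabilistic model of p(s = 1 | x) on all examples, then weight each labeled example by p(s = 1) / p(s = 1 | x) when training a classifier for the general population. The sklearn usage, variable names, and clipping are assumptions made for illustration.

```python
# Minimal sketch of Lemma 1 weighting (illustrative; not the authors' code).
# Assumed inputs: X, an (n, d) feature matrix for all examples; s, an (n,)
# binary selection indicator; y_lab, labels for the examples with s == 1.
import numpy as np
from sklearn.linear_model import LogisticRegression

def lemma1_weights(X, s):
    """Weights p(s=1) / p(s=1|x) for the labeled examples (those with s == 1)."""
    selector = LogisticRegression(max_iter=1000).fit(X, s)  # model of p(s=1|x)
    p_s1_given_x = selector.predict_proba(X[s == 1])[:, 1]
    p_s1 = s.mean()                                         # marginal p(s=1)
    return p_s1 / np.clip(p_s1_given_x, 1e-6, None)

# A classifier aimed at the *general* population, trained on labeled data only:
# w = lemma1_weights(X, s)
# clf = LogisticRegression(max_iter=1000).fit(X[s == 1], y_lab, sample_weight=w)
```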
General population modeling – Application
Lemma 1 can be used:
◮ to estimate class-conditional density models p(x|y).
◮ to improve the loss of mis-specified discriminative classifiers in the general population.
[Figure: synthetic data, two panels. "Viewing hidden labels": the best general-population boundary. "Using Lemma 1": the estimated general-population boundary.]
The weighted logistic regression finds a decision boundary close to the best possible for the general population.
Overcoming learnable bias – Unlabeled population modeling
Lemma 2. Under MAR bias in the labeling,
\[ p(x, y \mid s=0) = \frac{1 - p(s=1 \mid x)}{p(s=1 \mid x)} \cdot \frac{p(s=1)}{1 - p(s=1)}\, p(x, y \mid s=1) \]
if all probabilities are non-zero.
Similarly, the distribution of samples in the unlabeled population is a weighted version of the distribution of labeled samples. Since p(s|x) is learnable, we can estimate the weights.
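The same idea gives a sketch for Lemma 2; only the weight formula changes. Again this is illustrative, reusing the numpy/sklearn imports and the assumed inputs from the Lemma 1 snippet.

```python
def lemma2_weights(X, s):
    """Weights that make the labeled examples mimic the *unlabeled* population,
    per Lemma 2: [(1 - p(s=1|x)) / p(s=1|x)] * [p(s=1) / (1 - p(s=1))]."""
    selector = LogisticRegression(max_iter=1000).fit(X, s)  # p(s=1|x), as before
    p_s1_given_x = selector.predict_proba(X[s == 1])[:, 1]
    p_s1 = s.mean()
    odds_x = (1.0 - p_s1_given_x) / np.clip(p_s1_given_x, 1e-6, None)
    return odds_x * p_s1 / (1.0 - p_s1)
```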
Unlabeled population modeling – Application
Lemma 2 can be used:
◮ to estimate class-conditional density models p(x|y, s = 0).
◮ to improve the loss of mis-specified discriminative classifiers in the unlabeled population.
[Figure: synthetic data, two panels. "Viewing hidden labels": the best unlabeled-population boundary. "Using Lemma 2": the estimated unlabeled-population boundary.]
The weighted logistic regression finds a decision boundary close to the best possible for the unlabeled population.
Overcoming arbitrary bias – Maximum Likelihood
The log likelihood of parameters Θ over semi-labeled dataset X is:
\[ \ell(\Theta, X) = \sum_{i=1}^{m} \log p_\Theta(x_i, y_i, s_i{=}1) + \sum_{i=m+1}^{m+n} \log \sum_{y} p_\Theta(x_i, y, s_i{=}0) \]
for labeled data i = 1, ..., m and unlabeled data i = m+1, ..., m+n.
◮ Under the assumption of learnable bias this reduces to a traditional semi-supervised likelihood equation...
Log-likelihood assuming learnable bias
◮ Factor p(x, y, s) = p(s|x, y) p(x|y) p(y).
◮ Assume learnable bias (MAR): p(s|x, y) = p(s|x).
\begin{align*}
\ell(\Theta, X) &= \sum_{i=1}^{m} \log p(x_i, y_i, s_i{=}1) + \sum_{i=m+1}^{m+n} \log \sum_{y} p(x_i, y, s_i{=}0) \\
&= \sum_{i=1}^{m} \log p(s_i \mid x_i, y_i)\, p(x_i \mid y_i)\, p(y_i) + \sum_{i=m+1}^{m+n} \log \sum_{y} p(s_i \mid x_i, y)\, p(x_i \mid y)\, p(y) \\
&= \sum_{i=1}^{m} \log p(s_i \mid x_i)\, p(x_i \mid y_i)\, p(y_i) + \sum_{i=m+1}^{m+n} \log \Big\{ p(s_i \mid x_i) \sum_{y} p(x_i \mid y)\, p(y) \Big\} \\
&= \sum_{i=1}^{m} \log p(x_i \mid y_i)\, p(y_i) + \sum_{i=m+1}^{m+n} \log \sum_{y} p(x_i \mid y)\, p(y) + \sum_{i=1}^{m+n} \log p(s_i \mid x_i)
\end{align*}
The last term depends only on p(s|x), not on the class model, so maximizing over the class-model parameters reduces to the traditional semi-supervised likelihood maximization.
Log-likelihood making no assumptions about bias
Use a different factoring: p(x, y, s) = p(x|y, s) p(y|s) p(s).
\[ \ell(\Theta_L, \Theta_U, X) = \sum_{i=1}^{m} \log p_{\Theta_L}(x_i \mid y_i, s_i{=}1)\, p_{\Theta_L}(y_i \mid s_i{=}1)\, p(s_i{=}1) + \sum_{i=m+1}^{m+n} \log \sum_{y} p_{\Theta_U}(x_i \mid y, s_i{=}0)\, p_{\Theta_U}(y \mid s_i{=}0)\, p(s_i{=}0) \]
◮ For |C| classes, this has 2|C| class-conditional density models: p(x|y = c, s) for each c ∈ C.
◮ This simplifies to two independent maximizations.
◮ But how do we maximize the likelihood in a sensible way for the unlabeled data?
The "Shifted Mixture-Model" (SMM) approach to improving the likelihood for unlabeled samples
Solution: Assume the model parameters for the labeled data, p_{Θ_L}(x|y, s = 1) and p_{Θ_L}(y|s = 1), and the parameters for the unlabeled data, p_{Θ_U}(x|y, s = 0) and p_{Θ_U}(y|s = 0), are "close."
◮ Learn density models p_{Θ_L}(x_i|y_i, s_i = 1) and priors p_{Θ_L}(y_i|s_i = 1) from labeled data.
◮ Initialize the parameters Θ_U for the unlabeled data with the parameters Θ_L learned from labeled data, or possibly using Lemma 2.
◮ Improve the likelihood of Θ_U given the unlabeled data (unsupervised learning).
The SMM approach is useful for both general and unlabeled population modeling, as it produces an explicit generative model of both populations.
Example with synthetic data – Concept Drift
Application to EM
Application of the SMM approach to EM:
◮ Let Θ_U^0 be the parameters for p_{Θ_U}(x|y, s = 0), initialized with the labeled data.
◮ Limit to a few (5) iterations to limit parameter changes.
◮ Use an inertia parameter α = 0.99 to slow parameter evolution: given Θ_U^t and EM update Θ′, use Θ_U^{t+1} ← α Θ_U^t + (1 − α) Θ′.
◮ The final Θ_U^5 gives the parameters for the unlabeled data.
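A minimal sketch of the inertia-damped EM step, assuming one Gaussian class-conditional density per class; the model structure, function name, and helper details are illustrative assumptions, not the talk's implementation.

```python
# Illustrative sketch of the SMM inertia-damped EM update (not the authors' code).
# Parameters Theta_U = (priors, means, covs): one Gaussian per class.
import numpy as np
from scipy.stats import multivariate_normal

def em_with_inertia(X_unlab, priors, means, covs, alpha=0.99, n_iter=5):
    """Run a few EM iterations on the unlabeled data, damping each update:
    Theta_U^{t+1} = alpha * Theta_U^t + (1 - alpha) * Theta'  (as on the slide)."""
    K = len(priors)
    for _ in range(n_iter):
        # E-step: responsibilities p(y = k | x) under the current parameters.
        resp = np.column_stack([
            priors[k] * multivariate_normal.pdf(X_unlab, means[k], covs[k])
            for k in range(K)
        ])
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: standard mixture updates -- this is the EM update Theta'.
        Nk = resp.sum(axis=0)
        new_priors = Nk / Nk.sum()
        new_means = (resp.T @ X_unlab) / Nk[:, None]
        new_covs = []
        for k in range(K):
            d = X_unlab - new_means[k]
            new_covs.append((resp[:, k, None] * d).T @ d / Nk[k])
        # Inertia: move only a small fraction of the way toward Theta'.
        priors = alpha * priors + (1 - alpha) * new_priors
        means = alpha * means + (1 - alpha) * new_means
        covs = [alpha * covs[k] + (1 - alpha) * new_covs[k] for k in range(K)]
    return priors, means, covs
```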
Experiment 1 – the ADULT data set
Features:
◮ x: (AGE, EDUCATION, CAPITAL GAIN, CAPITAL LOSS, HOURS PER WEEK, SEX, NATIVE TO US, ETHNICITY, FULL TIME)
◮ y: INCOME > $50,000?
◮ s: MARRIED?
Most of the information in this dataset could be used in a loan approval system. The target is analogous to a repay/default behavior label. Marital status is analogous to an unquantifiable measure of responsibility that wouldn't be recorded in a bank's records but might influence the label.
Determining the type of bias
Is it MCAR? No:
◮ p(y = 1|s = 1) = 0.4556
◮ p(y = 1|s = 0) = 0.0692
Is it MAR? Not as far as logistic regression can detect:
◮ Accuracy in gen. pop. based on labeled data = 74.2%
◮ Accuracy in gen. pop. based on all data = 80.7%
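These two checks are easy to script. A hedged sketch follows; the exact evaluation protocol (train/test split, classifier settings) is an assumption, since the talk does not spell it out, and in these benchmark experiments the hidden labels are known for all examples.

```python
# Sketch of the MCAR / MAR diagnostics above (illustrative, protocol assumed).
# Inputs: X (features), y (labels, known for every example here), s (selection).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def bias_diagnostics(X, y, s):
    # MCAR check: does the class prior differ between labeled and unlabeled data?
    print("p(y=1|s=1) =", y[s == 1].mean())
    print("p(y=1|s=0) =", y[s == 0].mean())
    # MAR check (as far as logistic regression can detect): compare
    # general-population accuracy of a model trained on labeled data only
    # with one trained on all data.
    X_tr, X_te, y_tr, y_te, s_tr, _ = train_test_split(X, y, s, test_size=0.3)
    lab = LogisticRegression(max_iter=1000).fit(X_tr[s_tr == 1], y_tr[s_tr == 1])
    allm = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print("acc, trained on labeled only:", lab.score(X_te, y_te))
    print("acc, trained on all data:   ", allm.score(X_te, y_te))
```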
Results
[Boxplots over 10 random trials: general-population accuracy for LR, LR + Lemma 1, GMM, GMM + Lemma 1, and SMM; unlabeled-population accuracy for LR, LR + Lemma 2, GMM, GMM + Lemma 2, and SMM.]
Analysis in the loan-application context:
◮ We can achieve better accuracy than logistic regression.
◮ We can do "reject inference."
◮ We can improve accuracy by not assuming MAR bias.
Experiment 2 – the CA-HOUSING data set
Features:
◮ x: CA census tract data: MEDIAN INCOME, MEDIAN HOUSE AGE, TOTAL ROOMS, TOTAL BEDROOMS, POPULATION, HOUSEHOLDS
◮ y: house VALUE > California median?
◮ s: LATITUDE > 36 and within 0.4 degrees of coast?
[Figure: map of California (latitude 34–42, longitude −124 to −116) marking the selected census tracts.]
The goal is to learn a model of housing prices throughout California when price information is available only for the northern California coast.
Is it MAR?
Determining the type of bias
Is it MCAR? No: ◮
p(y = 1|s = 1) = 0.751
◮
p(y = 1|s = 0) = 0.443
Is it MAR? Not as far as logistic regression can detect: ◮
Accuracy in gen. pop. based on labeled data = 74.8%
◮
Accuracy in gen. pop. based on all data = 80.5%
Results
[Boxplots over 10 random trials: general-population accuracy for LR, LR + Lemma 1, GMM, GMM + Lemma 1, and SMM; unlabeled-population accuracy for LR, LR + Lemma 2, GMM, GMM + Lemma 2, and SMM.]
Analysis:
◮ We can achieve better accuracy than logistic regression.
◮ Isolated areas constitute a biased sample of housing prices throughout California.
◮ We can improve accuracy by not assuming MAR bias.
Future Work
More "integrated" ways to keep Θ_L and Θ_U close:
◮ Specify a Bayesian prior.
◮ Add a term to the likelihood equation: +λ KL(Θ_L, Θ_U).
Other factorizations:
◮ We only explored one factoring, p(x, y, s) = p(x|y, s) p(y|s) p(s), resulting in independent maximizations.
◮ Other factorizations yield models with more coupled parameters.
Conclusions
◮ Model mis-specification, the reality in real-world applications, means that even under MAR bias, re-weighting can help discriminative and generative classifiers.
◮ Improving the likelihood of the model parameters for the unlabeled data yields better classifiers in the unlabeled and general populations.
◮ The SMM approach is a practical way of letting the unlabeled-model parameters better represent the unlabeled samples while remaining close to the parameters that yield the best predictions for the labeled data.