Learning from Multiple Partially Observed Views

Massih R. Amini, Institute for Information Technology, National Research Council

Joint work with Nicolas Usunier and Cyril Goutte

UdeM-McGill ML seminar, September 18, 2009


Motivation - classical setting

In the standard classification setting, we consider an input space $\mathcal{X} \subseteq \mathbb{R}^d$ and an output space $\mathcal{Y}$.

Hypothesis: the pairs $(x, y) \in \mathcal{X} \times \mathcal{Y}$ are distributed according to an unknown distribution $D$.

Samples: we observe a sequence of $m$ i.i.d. pairs $(x_i, y_i)$ generated according to $D$.

Goal: construct a function $g : \mathcal{X} \to \mathcal{Y}$ which predicts $y$ from $x$ such that $P(g(x) \neq y)$ is as low as possible.


Motivation - world is multi-view

Many applications now involve multiple feature sets, or views, such as audio, image, and text.

Motivation - Learning from multiple sources

- Multi-view approaches exploit view redundancy to learn: "Two views of an example that are redundant but not completely correlated are complementary [2]."
- Multi-view learning can be advantageous compared to learning with only a single view [2, 3]: "Strengths of one view complement the weaknesses of the other."
- Another key assumption, view agreement, makes it possible to learn from partially labeled data.
- Previous work
  - relies on two views, and most of the theory has been developed in this setting;
  - assumes that all views are observed.


Our Approach

- We consider binary classification problems, where each multi-view observation $x$ is defined as
$$x = (x^1, \ldots, x^V) \in \mathcal{X} \stackrel{\text{def}}{=} (\mathcal{X}_1 \cup \{\perp\}) \times \ldots \times (\mathcal{X}_V \cup \{\perp\})$$
where $x^v = \perp$ means that the $v$-th view is not observed, and $V \geq 2$.
- We assume that there exist view generating functions $\Psi_{v \to v'} : \mathcal{X}_v \to \mathcal{X}_{v'}$.
- For a given partially observed $x$, the completed observation is obtained by filling in each missing view from an observed one (a sketch follows below):
$$\forall v,\ x^v = \begin{cases} x^v & \text{if } x^v \neq \perp \\ \Psi_{v' \to v}(x^{v'}) & \text{otherwise, for an observed view } v' \end{cases}$$
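As a concrete illustration, here is a minimal Python sketch of this completion step; the names (`complete_views`, `psi`) are illustrative and not from the talk, and `psi` stands for the view generating functions $\Psi_{v' \to v}$.

```python
# Minimal sketch of completing a partially observed multi-view example.
# `psi` maps a pair (source_view, target_view) to a view generating
# function, e.g. a machine translation system; names are illustrative.
def complete_views(x, psi):
    """x: list of length V; None stands for an unobserved view (⊥).
    Returns a copy of x in which every missing view has been generated
    from one of the observed views."""
    observed = [v for v, xv in enumerate(x) if xv is not None]
    if not observed:
        raise ValueError("at least one view must be observed")
    completed = list(x)
    for v, xv in enumerate(x):
        if xv is None:
            src = observed[0]                  # any observed view can be used
            completed[v] = psi[(src, v)](x[src])
    return completed
```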


Domain Application - Multi-lingual document classification

- $\Psi_{\cdot \to \cdot}$ are Machine Translation systems,
- $\forall v,\ S^v = \{(x_i^v, y_i) \mid i = 1..m\}$, where $m > m_v$ ($m_v$ being the number of training documents originally available in view $v$).


Outline

- Learning objective
- Supervised Learning tasks
- Trade-off with the ERM principle
- Agreement-Based semi-supervised learning
- Experimental results


Learning objective

- We assume that we are given $V$ sets of deterministic classifiers $(H_v)_{v=1}^V$:
$$\forall v \in [\![1, V]\!],\ H_v = \{h_v : \mathcal{X}_v \to \{0, 1\}\}$$
- The final set of classifiers $\mathcal{C}$ contains stochastic classifiers:
$$\mathcal{C} = \{x \mapsto \Phi_{\mathcal{C}}(h_1, \ldots, h_V, x) \mid \forall v,\ h_v \in H_v\}$$
where, for all $(h_1, \ldots, h_V) \in H_1 \times \ldots \times H_V$ and all $x$, $\Phi_{\mathcal{C}}(h_1, \ldots, h_V, x) \in [0, 1]$. For convenience, write $c_{h_1,\ldots,h_V} : x \mapsto \Phi_{\mathcal{C}}(h_1, \ldots, h_V, x)$.
- The overall objective of learning is therefore to find $c \in \mathcal{C}$ with low generalization error:
$$\epsilon(c) = \mathop{\mathbb{E}}_{(x,y) \sim D}\, e(c, (x, y))$$



Supervised Learning tasks - Simple view

- Train: $\forall v,\ h_v \in \arg\min_{h \in H_v} \sum_{(x,y) \in S \,:\, x^v \neq \perp} e(h, (x^v, y))$
- Test: $\forall x,\ c^{\mathrm{b}}_{h_1,\ldots,h_V}(x) = h_v(x^v)$, where $x^v \neq \perp$ (a sketch follows below)
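A minimal sketch of this single-view baseline, assuming scikit-learn-style classifiers with fit/predict; the helper names are hypothetical:

```python
# Baseline (single-view) sketch: each view-specific classifier is trained
# only on the examples where that view is observed; at test time the
# prediction comes from a classifier whose view is available.
def train_baseline(S, n_views, make_classifier):
    """S: list of (x, y) pairs, x being a list of n_views views (None = ⊥)."""
    classifiers = []
    for v in range(n_views):
        X_v = [x[v] for x, _ in S if x[v] is not None]
        y_v = [y for x, y in S if x[v] is not None]
        classifiers.append(make_classifier().fit(X_v, y_v))  # ERM on view v
    return classifiers

def predict_baseline(classifiers, x):
    v = next(v for v, xv in enumerate(x) if xv is not None)  # an observed view
    return classifiers[v].predict([x[v]])[0]
```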

Supervised Learning tasks - Multiple view

- Train: $\forall v,\ h_v \in \arg\min_{h \in H_v} \sum_{(x,y) \in S} e(h, (x^v, y))$
- Test - Gibbs: $\forall x,\ c^{\mathrm{mg}}_{h_1,\ldots,h_V}(x) = \frac{1}{V} \sum_{v=1}^{V} h_v(x^v)$
- Test - Majority vote: $\forall x,\ c^{\mathrm{mv}}_{h_1,\ldots,h_V}(x) = \mathbb{I}\left[\sum_{v=1}^{V} h_v(x^v) > \frac{V}{2}\right]$ (both rules are sketched below)
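A sketch of the two prediction rules, again with illustrative names; `x_completed` stands for an observation whose missing views have already been filled in with the view generating functions:

```python
import numpy as np

def gibbs_score(classifiers, x_completed):
    """Stochastic (Gibbs) classifier: average of the V view-specific votes, in [0, 1]."""
    votes = np.array([h.predict([xv])[0] for h, xv in zip(classifiers, x_completed)])
    return votes.mean()

def majority_vote(classifiers, x_completed):
    """Deterministic majority vote: predict 1 iff more than V/2 views vote 1."""
    votes = np.array([h.predict([xv])[0] for h, xv in zip(classifiers, x_completed)])
    return int(votes.sum() > len(votes) / 2)
```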


Theorem - Generalization bounds

Fix $\delta \in (0, 1)$, let $S = ((x_i, y_i))_{i=1}^m$ be a dataset of $m$ examples drawn i.i.d. according to $D$, let $e$ be the 0/1 loss, and let $(H_v)_{v=1}^V$ be the view-specific deterministic classifier sets. Then, with probability at least $1 - \delta$ over $S$, we have:

Baseline setting:
$$\epsilon(c^{\mathrm{b}}_{h_1,\ldots,h_V}) \leq \inf_{h'_v \in H_v} \left[\epsilon(c^{\mathrm{b}}_{h'_1,\ldots,h'_V})\right] + 2 \sum_{v=1}^{V} \frac{m_v}{m} \hat{R}_{m_v}(e \circ H_v, S^v) + 6 \sqrt{\frac{\ln(2/\delta)}{2m}}$$

Multi-view Gibbs classification setting:
$$\epsilon(c^{\mathrm{mg}}_{h_1,\ldots,h_V}) \leq \inf_{h'_v \in H_v} \left[\epsilon(c^{\mathrm{b}}_{h'_1,\ldots,h'_V})\right] + \frac{2}{V} \sum_{v=1}^{V} \hat{R}_{m}(e \circ H_v, S^v) + 6 \sqrt{\frac{\ln(2/\delta)}{2m}} + \eta$$

where, for all $v$, $S^v \stackrel{\text{def}}{=} \{(x_i^v, y_i) \mid i = 1..m\}$, $h_v \in H_v$ is the classifier minimizing the empirical risk on $S^v$, and
$$\eta = \inf_{h'_v \in H_v} \left[\epsilon(c^{\mathrm{mg}}_{h'_1,\ldots,h'_V})\right] - \inf_{h'_v \in H_v} \left[\epsilon(c^{\mathrm{b}}_{h'_1,\ldots,h'_V})\right]$$
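The bounds are stated in terms of the empirical Rademacher complexity $\hat{R}_m(e \circ H_v, S^v)$. As a side note, here is a minimal Monte Carlo sketch of how such a quantity can be estimated for a finite class of classifiers; the names are illustrative and the talk does not prescribe this procedure:

```python
import numpy as np

def empirical_rademacher(loss_matrix, n_draws=1000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of a finite
    function class.  loss_matrix[j, i] = e(h_j, (x_i, y_i)) in {0, 1}, one row
    per classifier, one column per training example."""
    rng = np.random.default_rng(seed)
    _, m = loss_matrix.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=m)       # Rademacher signs
        total += np.max(loss_matrix @ sigma) / m      # sup over the class
    return total / n_draws
```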




Trade-off

- The Rademacher complexity for a sample of size $n$ is $O\!\left(\frac{1}{\sqrt{n}}\right)$.
- Assume that all proportional factors are equal to $d$, and that $\forall v,\ m_v = m/V$.
- Choose the multi-view Gibbs classifier when (a short derivation follows below):
$$d \left(\sqrt{\frac{V}{m}} - \frac{1}{\sqrt{m}}\right) > \eta$$
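A sketch of how this condition can be obtained from the bounds above under the stated assumptions (with the multiplicative constants absorbed into $d$): the complexity terms of the two bounds become
$$2 \sum_{v=1}^{V} \frac{m_v}{m} \hat{R}_{m_v}(e \circ H_v, S^v) \approx 2 \sum_{v=1}^{V} \frac{1}{V} \cdot \frac{d}{\sqrt{m/V}} = 2d \sqrt{\frac{V}{m}}, \qquad \frac{2}{V} \sum_{v=1}^{V} \hat{R}_{m}(e \circ H_v, S^v) \approx \frac{2}{V} \cdot V \cdot \frac{d}{\sqrt{m}} = \frac{2d}{\sqrt{m}}.$$
The multi-view Gibbs bound also pays the extra term $\eta$, so its right-hand side is the smaller of the two precisely when the saving on the complexity term exceeds $\eta$, which gives the condition displayed above.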


Agreement-Based semi-supervised learning

- How can the result be affected if the chosen predictors give roughly similar predictions on the unlabeled data?
- We hence define the notion of disagreement between classifiers as (an estimation sketch follows below):
$$\mathcal{V}(h_1, \ldots, h_V) \stackrel{\text{def}}{=} \mathbb{E}\left[\frac{1}{V} \sum_{v} h_v(x^v)^2 - \left(\frac{1}{V} \sum_{v} h_v(x^v)\right)^2\right]$$
- And the set of classifiers for which the disagreement is at most $\mu$:
$$H_v^*(\mu) \stackrel{\text{def}}{=} \left\{ h_v \in H_v \,\middle|\, \forall v' \neq v,\ \exists h'_{v'} \in H_{v'},\ \mathcal{V}(h'_1, \ldots, h'_V) \leq \mu \right\}$$
- We can then find the minimum number of unlabeled examples $B(\epsilon, \delta)$ required to estimate the disagreement with precision $\epsilon$ and with probability at least $1 - \delta$.
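A minimal sketch of estimating this disagreement on an unlabeled sample; the names are illustrative, and each unlabeled observation is assumed to have all of its views completed:

```python
import numpy as np

def empirical_disagreement(classifiers, unlabeled_completed):
    """Average, over the unlabeled examples, of the variance of the V
    view-specific votes, i.e. an empirical estimate of V(h_1, ..., h_V)."""
    per_example = []
    for x in unlabeled_completed:
        votes = np.array([h.predict([xv])[0] for h, xv in zip(classifiers, x)],
                         dtype=float)
        per_example.append((votes ** 2).mean() - votes.mean() ** 2)
    return float(np.mean(per_example))
```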


Agreement-Based semi-supervised learning

Let $0 \leq \mu \leq 1$ and $0 < \delta < 1$, and assume that we have access to $u \geq B(\mu/2, \delta/2)$ unlabeled examples drawn i.i.d. according to the marginal distribution of $D$ on $\mathcal{X}$. Then, with probability at least $1 - \delta$, if the empirical risk minimizers $h_v \in \arg\min_{h \in H_v} \sum_{(x^v, y) \in S^v} e(h, (x^v, y))$ have a disagreement less than $\mu/2$ on the unlabeled set, we have:
$$\epsilon(c^{\mathrm{mg}}_{h_1,\ldots,h_V}) \leq \inf_{h'_v \in H_v} \left[\epsilon(c^{\mathrm{b}}_{h'_1,\ldots,h'_V})\right] + \frac{2}{V} \sum_{v=1}^{V} \hat{R}_{m}(e \circ H_v^*(\mu), S^v) + 6 \sqrt{\frac{\ln(4/\delta)}{2m}} + \eta$$


where $\hat{R}_{m}(e \circ H_v^*(\mu), S^v) \propto O\!\left(\frac{d_u}{\sqrt{m}}\right)$, with $d_u$