Learning from Multiple Partially Observed Views

Massih R. Amini, Institute for Information Technology, National Research Council

Joint work with Nicolas Usunier and Cyril Goutte

UdeM-McGill ML seminar, September 18, 2009


Motivation - classical setting

In the standard classification setting, we consider an input space $\mathcal{X} \subseteq \mathbb{R}^d$ and an output space $\mathcal{Y}$.

Hypothesis: the pairs $(x, y) \in \mathcal{X} \times \mathcal{Y}$ are distributed according to an unknown distribution $D$.

Samples: we observe a sequence of $m$ i.i.d. pairs $(x_i, y_i)$ generated according to $D$.

Goal: construct a function $g : \mathcal{X} \to \mathcal{Y}$ which predicts $y$ from $x$ such that $P(g(x) \neq y)$ is as low as possible.


Motivation - world is multi-view

Many applications now involve multiple feature sets, or views, such as audio, image, and text.

Motivation - Learning from multiple sources

- Multi-view approaches exploit view redundancy to learn: "Two views of an example that are redundant but not completely correlated are complementary [2]."
- Multi-view learning can be advantageous compared to learning with only a single view [2, 3]: "Strengths of one view complement the weaknesses of the other."
- Another key assumption, view agreement, makes it possible to learn from partially labeled data.
- Previous work
  - relies on two views, and most of the theory has been developed in this setting;
  - assumes that all views are observed.


Our Approach

- We consider binary classification problems, where each multi-view observation $x$ is defined as
$$x = (x^1, \ldots, x^V) \in \mathcal{X} \stackrel{\text{def}}{=} (\mathcal{X}_1 \cup \{\perp\}) \times \ldots \times (\mathcal{X}_V \cup \{\perp\})$$
where $x^v = \perp$ means that the $v$-th view is not observed, and $V \geq 2$.
- We assume that there exist view generating functions $\Psi_{v \to v'} : \mathcal{X}_v \to \mathcal{X}_{v'}$.
- For a given partially observed $x$, the completed observation is obtained by filling in each missing view from an observed one (a sketch follows below):
$$\forall v,\ x^v = \begin{cases} x^v & \text{if } x^v \neq \perp \\ \Psi_{v' \to v}(x^{v'}) & \text{otherwise, for an observed view } v' \end{cases}$$
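As a concrete illustration, here is a minimal Python sketch of this completion step; the names (`complete_views`, `psi`) are illustrative and not from the talk, and `psi` stands for the view generating functions $\Psi_{v' \to v}$.

```python
# Minimal sketch of completing a partially observed multi-view example.
# `psi` maps a pair (source_view, target_view) to a view generating
# function, e.g. a machine translation system; names are illustrative.
def complete_views(x, psi):
    """x: list of length V; None stands for an unobserved view (⊥).
    Returns a copy of x in which every missing view has been generated
    from one of the observed views."""
    observed = [v for v, xv in enumerate(x) if xv is not None]
    if not observed:
        raise ValueError("at least one view must be observed")
    completed = list(x)
    for v, xv in enumerate(x):
        if xv is None:
            src = observed[0]                  # any observed view can be used
            completed[v] = psi[(src, v)](x[src])
    return completed
```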


Domain Application - Multi-lingual document classification

- $\Psi_{\cdot \to \cdot}$ are Machine Translation systems,
- $\forall v,\ S^v = \{(x_i^v, y_i) \mid i = 1..m\}$, where $m > m_v$ ($m_v$ being the number of training documents originally available in view $v$).


Outline

- Learning objective
- Supervised Learning tasks
- Trade-off with the ERM principle
- Agreement-Based semi-supervised learning
- Experimental results


Learning objective

- We assume that we are given $V$ sets of deterministic classifiers $(H_v)_{v=1}^V$:
$$\forall v \in [\![1, V]\!],\ H_v = \{h_v : \mathcal{X}_v \to \{0, 1\}\}$$
- The final set of classifiers $\mathcal{C}$ contains stochastic classifiers:
$$\mathcal{C} = \{x \mapsto \Phi_{\mathcal{C}}(h_1, \ldots, h_V, x) \mid \forall v,\ h_v \in H_v\}$$
where, for all $(h_1, \ldots, h_V) \in H_1 \times \ldots \times H_V$ and all $x$, $\Phi_{\mathcal{C}}(h_1, \ldots, h_V, x) \in [0, 1]$. For convenience, write $c_{h_1,\ldots,h_V} : x \mapsto \Phi_{\mathcal{C}}(h_1, \ldots, h_V, x)$.
- The overall objective of learning is therefore to find $c \in \mathcal{C}$ with low generalization error:
$$\epsilon(c) = \mathop{\mathbb{E}}_{(x,y) \sim D}\, e(c, (x, y))$$



Supervised Learning tasks - Simple view

- Train: $\forall v,\ h_v \in \arg\min_{h \in H_v} \sum_{(x,y) \in S \,:\, x^v \neq \perp} e(h, (x^v, y))$
- Test: $\forall x,\ c^{\mathrm{b}}_{h_1,\ldots,h_V}(x) = h_v(x^v)$, where $x^v \neq \perp$ (a sketch follows below)
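A minimal sketch of this single-view baseline, assuming scikit-learn-style classifiers with fit/predict; the helper names are hypothetical:

```python
# Baseline (single-view) sketch: each view-specific classifier is trained
# only on the examples where that view is observed; at test time the
# prediction comes from a classifier whose view is available.
def train_baseline(S, n_views, make_classifier):
    """S: list of (x, y) pairs, x being a list of n_views views (None = ⊥)."""
    classifiers = []
    for v in range(n_views):
        X_v = [x[v] for x, _ in S if x[v] is not None]
        y_v = [y for x, y in S if x[v] is not None]
        classifiers.append(make_classifier().fit(X_v, y_v))  # ERM on view v
    return classifiers

def predict_baseline(classifiers, x):
    v = next(v for v, xv in enumerate(x) if xv is not None)  # an observed view
    return classifiers[v].predict([x[v]])[0]
```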

Supervised Learning tasks - Multiple view

- Train: $\forall v,\ h_v \in \arg\min_{h \in H_v} \sum_{(x,y) \in S} e(h, (x^v, y))$
- Test - Gibbs: $\forall x,\ c^{\mathrm{mg}}_{h_1,\ldots,h_V}(x) = \frac{1}{V} \sum_{v=1}^{V} h_v(x^v)$
- Test - Majority vote: $\forall x,\ c^{\mathrm{mv}}_{h_1,\ldots,h_V}(x) = \mathbb{I}\left[\sum_{v=1}^{V} h_v(x^v) > \frac{V}{2}\right]$ (both rules are sketched below)
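A sketch of the two prediction rules, again with illustrative names; `x_completed` stands for an observation whose missing views have already been filled in with the view generating functions:

```python
import numpy as np

def gibbs_score(classifiers, x_completed):
    """Stochastic (Gibbs) classifier: average of the V view-specific votes, in [0, 1]."""
    votes = np.array([h.predict([xv])[0] for h, xv in zip(classifiers, x_completed)])
    return votes.mean()

def majority_vote(classifiers, x_completed):
    """Deterministic majority vote: predict 1 iff more than V/2 views vote 1."""
    votes = np.array([h.predict([xv])[0] for h, xv in zip(classifiers, x_completed)])
    return int(votes.sum() > len(votes) / 2)
```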


Theorem - Generalization bounds

Fix $\delta \in (0, 1)$, let $S = ((x_i, y_i))_{i=1}^m$ be a dataset of $m$ examples drawn i.i.d. according to $D$, let $e$ be the 0/1 loss, and let $(H_v)_{v=1}^V$ be the view-specific deterministic classifier sets. Then, with probability at least $1 - \delta$ over $S$, we have:

Baseline setting:
$$\epsilon(c^{\mathrm{b}}_{h_1,\ldots,h_V}) \leq \inf_{h'_v \in H_v} \left[\epsilon(c^{\mathrm{b}}_{h'_1,\ldots,h'_V})\right] + 2 \sum_{v=1}^{V} \frac{m_v}{m} \hat{R}_{m_v}(e \circ H_v, S^v) + 6 \sqrt{\frac{\ln(2/\delta)}{2m}}$$

Multi-view Gibbs classification setting:
$$\epsilon(c^{\mathrm{mg}}_{h_1,\ldots,h_V}) \leq \inf_{h'_v \in H_v} \left[\epsilon(c^{\mathrm{b}}_{h'_1,\ldots,h'_V})\right] + \frac{2}{V} \sum_{v=1}^{V} \hat{R}_{m}(e \circ H_v, S^v) + 6 \sqrt{\frac{\ln(2/\delta)}{2m}} + \eta$$

where, for all $v$, $S^v \stackrel{\text{def}}{=} \{(x_i^v, y_i) \mid i = 1..m\}$, $h_v \in H_v$ is the classifier minimizing the empirical risk on $S^v$, and
$$\eta = \inf_{h'_v \in H_v} \left[\epsilon(c^{\mathrm{mg}}_{h'_1,\ldots,h'_V})\right] - \inf_{h'_v \in H_v} \left[\epsilon(c^{\mathrm{b}}_{h'_1,\ldots,h'_V})\right]$$
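The bounds are stated in terms of the empirical Rademacher complexity $\hat{R}_m(e \circ H_v, S^v)$. As a side note, here is a minimal Monte Carlo sketch of how such a quantity can be estimated for a finite class of classifiers; the names are illustrative and the talk does not prescribe this procedure:

```python
import numpy as np

def empirical_rademacher(loss_matrix, n_draws=1000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of a finite
    function class.  loss_matrix[j, i] = e(h_j, (x_i, y_i)) in {0, 1}, one row
    per classifier, one column per training example."""
    rng = np.random.default_rng(seed)
    _, m = loss_matrix.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=m)       # Rademacher signs
        total += np.max(loss_matrix @ sigma) / m      # sup over the class
    return total / n_draws
```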




Trade-off

- The Rademacher complexity for a sample of size $n$ is $O\!\left(\frac{1}{\sqrt{n}}\right)$.
- Assume that all proportional factors are equal to $d$, and that $\forall v,\ m_v = m/V$.
- Choose the multi-view Gibbs classifier when (a short derivation follows below):
$$d \left(\sqrt{\frac{V}{m}} - \frac{1}{\sqrt{m}}\right) > \eta$$
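A sketch of how this condition can be obtained from the bounds above under the stated assumptions (with the multiplicative constants absorbed into $d$): the complexity terms of the two bounds become
$$2 \sum_{v=1}^{V} \frac{m_v}{m} \hat{R}_{m_v}(e \circ H_v, S^v) \approx 2 \sum_{v=1}^{V} \frac{1}{V} \cdot \frac{d}{\sqrt{m/V}} = 2d \sqrt{\frac{V}{m}}, \qquad \frac{2}{V} \sum_{v=1}^{V} \hat{R}_{m}(e \circ H_v, S^v) \approx \frac{2}{V} \cdot V \cdot \frac{d}{\sqrt{m}} = \frac{2d}{\sqrt{m}}.$$
The multi-view Gibbs bound also pays the extra term $\eta$, so its right-hand side is the smaller of the two precisely when the saving on the complexity term exceeds $\eta$, which gives the condition displayed above.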


Agreement-Based semi-supervised learning

- How can the result be affected if the chosen predictors give roughly similar predictions on the unlabeled data?
- We hence define the notion of disagreement between classifiers as (an estimation sketch follows below):
$$\mathcal{V}(h_1, \ldots, h_V) \stackrel{\text{def}}{=} \mathbb{E}\left[\frac{1}{V} \sum_{v} h_v(x^v)^2 - \left(\frac{1}{V} \sum_{v} h_v(x^v)\right)^2\right]$$
- And the set of classifiers for which the disagreement is at most $\mu$:
$$H_v^*(\mu) \stackrel{\text{def}}{=} \left\{ h_v \in H_v \,\middle|\, \forall v' \neq v,\ \exists h'_{v'} \in H_{v'},\ \mathcal{V}(h'_1, \ldots, h'_V) \leq \mu \right\}$$
- We can then find the minimum number of unlabeled examples $B(\epsilon, \delta)$ required to estimate the disagreement with precision $\epsilon$ and with probability at least $1 - \delta$.
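A minimal sketch of estimating this disagreement on an unlabeled sample; the names are illustrative, and each unlabeled observation is assumed to have all of its views completed:

```python
import numpy as np

def empirical_disagreement(classifiers, unlabeled_completed):
    """Average, over the unlabeled examples, of the variance of the V
    view-specific votes, i.e. an empirical estimate of V(h_1, ..., h_V)."""
    per_example = []
    for x in unlabeled_completed:
        votes = np.array([h.predict([xv])[0] for h, xv in zip(classifiers, x)],
                         dtype=float)
        per_example.append((votes ** 2).mean() - votes.mean() ** 2)
    return float(np.mean(per_example))
```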


Agreement-Based semi-supervised learning

Let $0 \leq \mu \leq 1$ and $0 < \delta < 1$, and assume that we have access to $u \geq B(\mu/2, \delta/2)$ unlabeled examples drawn i.i.d. according to the marginal distribution of $D$ on $\mathcal{X}$. Then, with probability at least $1 - \delta$, if the empirical risk minimizers $h_v \in \arg\min_{h \in H_v} \sum_{(x^v, y) \in S^v} e(h, (x^v, y))$ have a disagreement less than $\mu/2$ on the unlabeled set, we have:
$$\epsilon(c^{\mathrm{mg}}_{h_1,\ldots,h_V}) \leq \inf_{h'_v \in H_v} \left[\epsilon(c^{\mathrm{b}}_{h'_1,\ldots,h'_V})\right] + \frac{2}{V} \sum_{v=1}^{V} \hat{R}_{m}(e \circ H_v^*(\mu), S^v) + 6 \sqrt{\frac{\ln(4/\delta)}{2m}} + \eta$$


where $\hat{R}_{m}(e \circ H_v^*(\mu), S^v) \propto O\!\left(\frac{d_u}{\sqrt{m}}\right)$, with $d_u$