Lecture 2: Feasibility of learning

Alaa Tharwat

Outline:
- Review of Lecture 1
- Hoeffding’s inequality
- Connection to the learning problem
- Multiple hypotheses

Review of Lecture 1

Definition of machine learning: machine learning is a subfield of artificial intelligence that uses statistical techniques to give computers the ability to learn a specific task without being explicitly programmed.

Examples of machine learning: credit approval and classifying fish types.

Components of learning:
- Input: x ∈ R^d
- Output: y ∈ {−1, +1}
- Target function: f : X → Y
- Data: D = {(x1, y1), (x2, y2), ..., (xN, yN)}, where yi = f(xi)
- Hypothesis: g : X → Y

[Figure: the learning diagram. The unknown target function f : X → Y generates the training samples (x1, y1), ..., (xN, yN); the learning algorithm searches the hypothesis set H (h1, h2, h3, ...) and outputs a final hypothesis g ≈ f, illustrated by candidate hypotheses fitted to f(x) on [−1, 1].]

A simple model (changing w generates many hypotheses): h(x) = sign(w^T x)
- We want to select g ∈ H so that g ≈ f on the data D
- The weights are updated as w(t+1) ← w(t) + y_n x_n, using a misclassified sample (x_n, y_n) (see the code sketch below)

[Figure: the perceptron update. If y_n = +1 the update moves w toward x_n (w(t+1) = w(t) + x_n); if y_n = −1 it moves w away from x_n (w(t+1) = w(t) − x_n).]

Types of learning:
- Supervised learning
- Unsupervised learning
- Reinforcement learning
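To make the weight update concrete, here is a minimal sketch of the perceptron learning algorithm in Python; the function name, the stopping rule (a fixed iteration cap), and the choice of the first misclassified sample are illustrative assumptions, not the lecture's reference implementation.

```python
import numpy as np

def perceptron_learning(X, y, max_iter=1000):
    """Minimal perceptron learning algorithm (PLA) sketch.

    X: (N, d) array of inputs, y: (N,) array of labels in {-1, +1}.
    Returns a weight vector w defining the hypothesis h(x) = sign(w^T x).
    """
    X = np.c_[np.ones(len(X)), X]      # prepend x0 = 1 (bias coordinate)
    w = np.zeros(X.shape[1])           # start from the zero weight vector
    for _ in range(max_iter):
        misclassified = np.where(np.sign(X @ w) != y)[0]
        if len(misclassified) == 0:    # every sample is classified correctly
            break
        n = misclassified[0]           # pick one misclassified sample (x_n, y_n)
        w = w + y[n] * X[n]            # the update w(t+1) = w(t) + y_n x_n
    return w
```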


Hoeffding’s inequality

Given a box of balls (red and green):
- The probability of picking a red ball is P[picking red ball] = µ
- The probability of picking a green ball is P[picking green ball] = 1 − µ
- The value of µ is unknown (e.g. the proportion of males in a country)
- We cannot predict µ, but we can pick N balls/samples from the box independently; the fraction of red balls in the sample is v. Are v and µ similar or not?
- It is possible that the sample is mostly green while the box is mostly red (a biased sample)
- Still, the sample frequency v is likely to be close to µ

In a big sample of size N, v is probably close to µ, within the following constraint:

P[|v − µ| > ε] ≤ 2e^(−2ε²N)    (1)

where ε is a small tolerance. This is Hoeffding’s inequality, and it means that:
- The probability that v is far from µ by more than ε (a bad event) is bounded by 2e^(−2ε²N), for all N and ε
- This bound (the right-hand side) does not depend on µ or on the number of balls in the box. Good news, why? Because µ and the number of balls in the box are unknown
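A quick simulation makes the inequality concrete; the values of µ, N, ε and the number of trials below are illustrative assumptions. The estimated probability of a large deviation stays below the bound, and the bound itself is the same for every µ.

```python
import numpy as np

rng = np.random.default_rng(0)

def deviation_probability(mu, N, eps, trials=100_000):
    """Estimate P[|v - mu| > eps] by repeatedly drawing N balls from the box."""
    v = rng.binomial(N, mu, size=trials) / N   # v = fraction of red balls per sample
    return np.mean(np.abs(v - mu) > eps)

N, eps = 1000, 0.05
bound = 2 * np.exp(-2 * eps**2 * N)            # right-hand side of Hoeffding's inequality
for mu in (0.1, 0.5, 0.9):                     # the bound does not depend on mu
    print(mu, deviation_probability(mu, N, eps), bound)
```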

Equivalently,

P[|v − µ| > ε] ≤ 2e^(−2ε²N)  ⇔  P[|v − µ| ≤ ε] ≥ 1 − 2e^(−2ε²N)    (2)

- As N → ∞, 2e^(−2ε²N) → 0; hence P[|v − µ| ≤ ε] ≥ 1 − 2e^(−2ε²N) → 1, so v ≈ µ with high probability (but not with certainty)
- On the other hand, for small N, e^(−2ε²N) ≈ 1, so the lower bound 1 − 2e^(−2ε²N) is close to −1 and the inequality gives no guarantee; v may be far from µ
- A small ε demands that v be very close to µ (a strict requirement); a large ε tolerates v being farther from µ

How far is v from µ?

P[|v − µ| > ε] ≤ 2e^(−2ε²N)    (3)

- Given N = 1000 and ε = 0.05: P[|v − µ| > 0.05] ≤ 2e^(−2×0.05²×1000) ≈ 0.013, so µ − 0.05 ≤ v ≤ µ + 0.05 with probability ≈ 98.7%
- Given N = 1000 and ε = 0.1: P[|v − µ| > 0.1] ≤ 2e^(−2×0.1²×1000) ≈ 4 × 10⁻⁹, so µ − 0.1 ≤ v ≤ µ + 0.1 with probability ≈ 99.9999996%
- Given N = 100 and ε = 0.05: 2e^(−2×0.05²×100) = 2e^(−0.5) ≈ 1.21 > 1, so the bound is vacuous and gives no guarantee that v is within 0.05 of µ
- Given N = 50 and ε = 0.05: 2e^(−2×0.05²×50) = 2e^(−0.25) ≈ 1.56 > 1, again no guarantee

Tradeoff between N and ε: a larger sample size N, or a looser gap ε. The computation below reproduces these numbers.
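The numbers above follow directly from the bound. The short computation below (a sketch; the helper name is ours) reproduces them and shows when the guarantee becomes vacuous.

```python
import math

def hoeffding_confidence(N, eps):
    """Lower bound on P[|v - mu| <= eps], namely 1 - 2*exp(-2*eps**2*N)."""
    return 1 - 2 * math.exp(-2 * eps**2 * N)

for N, eps in [(1000, 0.05), (1000, 0.1), (100, 0.05), (50, 0.05)]:
    c = hoeffding_confidence(N, eps)
    # A non-positive value means the bound is vacuous: no guarantee at all.
    print(f"N={N:5d}, eps={eps}: confidence >= {c:.10f}")
```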

The figure below refers to a binomial distribution (B):
- In a coin-flip example, the probability of getting heads (or tails) is P = 0.5, E[X] = 0.5, and let ε = 0.25
- With a small sample size, the difference between the sample mean and the true mean can be big
- The sample mean converges to the true mean as the sample size increases

[Figure: P(|true mean − sample mean| > ε) versus the sample size n (from 0 to 100), comparing the true probability with the Hoeffding bound for the fair-coin example with ε = 0.25.]
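The two curves in the figure can be recomputed exactly for the fair coin: the true probability P[|v − 0.5| > 0.25] follows from the binomial distribution, and the Hoeffding bound is 2e^(−2·0.25²·n). A small sketch with a few illustrative values of n:

```python
import math

def true_deviation_prob(n, p=0.5, eps=0.25):
    """Exact P[|k/n - p| > eps] for k ~ Binomial(n, p)."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n + 1)
               if abs(k / n - p) > eps)

for n in (5, 10, 20, 50, 100):
    bound = 2 * math.exp(-2 * 0.25**2 * n)     # Hoeffding bound for eps = 0.25
    print(n, round(true_deviation_prob(n), 4), round(bound, 4))
```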


Connection to the learning problem

In the learning framework:
- The target function f is unknown, just like µ. In other words, f is not only defined on the given data D; f also exists outside the data
- Each ball corresponds to a point x ∈ X, and the sample of picked balls corresponds to the dataset D

- Some samples are drawn and labeled by f; these samples form the dataset D
- The learning model uses the data D and a set of hypotheses H, where h ∈ H

- A green ball means the hypothesis is right: h(x) = f(x)
- A red ball means the hypothesis is wrong: h(x) ≠ f(x)
- The error is the size of the red region (the error function): E(h) = P_x[h(x) ≠ f(x)]

There are two types of errors:
- In-sample error E_in(h): the fraction of samples in the dataset that are misclassified, E_in(h) = (1/m) Σ_{i=1}^{m} [h(x_i) ≠ f(x_i)], where m is the number of samples in the dataset (in the figure, E_in(h) = 4/14)
- Out-of-sample error E_out(h): the size of the red region, E_out(h) = P_x[h(x) ≠ f(x)] (we cannot calculate it)
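In code, E_in is simply the misclassification rate on the dataset, while E_out cannot be computed because f and the input distribution are unknown. A minimal sketch; the hypothesis h and the toy data below are illustrative placeholders.

```python
import numpy as np

def in_sample_error(h, X, y):
    """E_in(h) = (1/m) * number of samples with h(x_i) != y_i."""
    predictions = np.array([h(x) for x in X])
    return np.mean(predictions != y)

# Illustrative toy data and a hypothetical threshold hypothesis.
X = np.array([[0.2], [0.7], [0.4], [0.9]])
y = np.array([-1, 1, -1, -1])
h = lambda x: 1 if x[0] > 0.5 else -1
print(in_sample_error(h, X, y))   # here E_in(h) = 1/4 (one of four samples misclassified)
```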

Correspondence between the learning model and the box model:
- Input space X ↔ box of balls
- x where h(x) = f(x) (h is right) ↔ green ball
- x where h(x) ≠ f(x) (h is wrong) ↔ red ball
- Probability distribution P(x) ↔ randomly picking balls
- Dataset D = {(x1, y1), ..., (xN, yN)} with i.i.d. x_n ↔ sample of N i.i.d. balls
- Out-of-sample error E_out ↔ µ, the probability of picking a red ball from the box
- In-sample error E_in ↔ v, the fraction of red balls in the sample

[Figure: the learning diagram with a probability distribution P on X added. The unknown target function f : X → Y and the distribution P generate the i.i.d. training samples (x1, y1), ..., (xN, yN); the learning algorithm searches the hypothesis set H and outputs the final hypothesis g ≈ f.]

Back to learning:

P[|E_in(h) − E_out(h)| > ε] ≤ 2e^(−2ε²N)    (4)

- We can calculate E_in, and it is random; we cannot calculate E_out, and it is fixed (for a given h)
- If E_in ≈ 0, then E_out ≈ 0 with high probability. This means P_x[h(x) ≠ f(x)] ≈ 0; hence we have learned something about the entire f: f ≈ h over X (outside D)
- If E_in is large (out of luck), we have still learned something about the entire f, namely that h is a poor approximation: f ≠ h
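Inequality (4) can also be read the other way: fixing a confidence level 1 − δ and solving 2e^(−2ε²N) = δ gives ε = sqrt(ln(2/δ) / (2N)), so |E_in(h) − E_out(h)| ≤ ε with probability at least 1 − δ for a single, fixed h. A small sketch with illustrative values of N and δ:

```python
import math

def generalization_tolerance(N, delta):
    """Solve 2*exp(-2*eps**2*N) = delta for eps."""
    return math.sqrt(math.log(2 / delta) / (2 * N))

for N in (100, 1000, 10000):
    eps = generalization_tolerance(N, delta=0.05)
    # With probability >= 95%, |E_in - E_out| <= eps for one fixed hypothesis.
    print(N, round(eps, 4))
```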

Given h ∈ H, the samples (i.e. the dataset) can verify whether h is good (E_in is small) or bad (E_in is large).

Another example: X = {0, 1}³, Y = {O, X}. The dataset D contains five labeled points,

x_n ∈ {000, 001, 010, 011, 100},

each with its given label y_n ∈ {O, X}.

The shaded region is D: we can achieve g ≈ f inside D (perfectly), but not outside D.

The table enumerates all eight points x_n ∈ {000, 001, 010, 011, 100, 101, 110, 111}. On the five points inside D the labels y_n are given and g agrees with them; on the three remaining points (101, 110, 111) the labels are unknown ('?'). The columns f1, ..., f8 list the 2³ = 8 possible target functions that all agree with the data on D but differ outside D, so the data alone cannot tell which of them is the true f (see the enumeration sketch below).
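The point of the table can be checked by brute force: every labelling of the three points outside D yields a target function that agrees with the data on D, so the data alone cannot distinguish among them. A sketch; the five training labels below are arbitrary placeholders, since the slide shows them only as symbols.

```python
from itertools import product

X_all = list(product([0, 1], repeat=3))        # all eight points of X = {0,1}^3

# Five points inside D with placeholder labels (the slide's O/X symbols).
D = {(0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 0, (1, 0, 0): 1}
outside = [x for x in X_all if x not in D]     # (1,0,1), (1,1,0), (1,1,1)

# Every labelling of the outside points gives a target f consistent with D.
candidates = []
for labels in product([0, 1], repeat=len(outside)):
    f = dict(D)
    f.update(zip(outside, labels))
    candidates.append(f)

print(len(candidates))                         # 2^3 = 8 candidate target functions
```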


Multiple hypotheses

- v and µ depend on the hypothesis h
- The learning model considers multiple hypotheses H (not only one), i.e. one box of balls per hypothesis
- Some balls are green in one box and red in another box. Why? In box i, a green ball means that hypothesis h_i classifies that sample correctly, and a red ball means that h_i misclassifies it
- The learning model scans all hypotheses to find g ≈ f, i.e. the hypothesis with minimum E_in(h_i)
- Hoeffding does not apply directly once we choose among multiple hypotheses!

- Given a fair coin, what is the probability of getting 10 heads if we toss the coin ten times? Answer: P = 1/2^10 ≈ 0.1%. The coin is fair, and the ten heads are no indication of the real probability
- Given 1000 fair coins, each tossed ten times, what is the probability that some coin comes up heads all ten times? Answer: 1 − (1 − 1/2^10)^1000 ≈ 63%. Why? Simply because we tried many times (Hoeffding applies to each individual coin, each time)
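Both answers can be checked numerically. The sketch below computes the two closed-form probabilities and confirms the second with a quick Monte Carlo estimate (the number of simulation trials is an arbitrary choice).

```python
import numpy as np

p_one_coin = 1 / 2**10                          # P[10 heads] for a single fair coin
p_some_coin = 1 - (1 - 1 / 2**10) ** 1000       # P[some coin out of 1000 shows 10 heads]
print(p_one_coin, p_some_coin)                  # ~0.001 and ~0.63

rng = np.random.default_rng(0)
trials = 5000
heads = rng.binomial(10, 0.5, size=(trials, 1000))   # heads out of 10 tosses for each coin
print(np.mean((heads == 10).any(axis=1)))            # empirical estimate, ~0.63
```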

- Bad sample: E_in is far from E_out (remember Hoeffding). For example, E_in = 0 while E_out = 0.5; getting 10 heads from a fair coin; or picking only green balls from a box that contains both red and green balls
- Bad data for one h means E_out is big (h is far from the target function f) while E_in is small (h(x_n) = f(x_n) on most samples)
- Given M hypotheses (many h's):
  P[bad D] = P[bad D for h1 or bad D for h2 or ... or bad D for hM]
           ≤ P[bad D for h1] + P[bad D for h2] + ... + P[bad D for hM]
- Some basic probability: P(A or B) = P(A ∪ B) ≤ P(A) + P(B); if A ⊆ B then P(A) ≤ P(B)

Since g ∈ H, g is one of h1, h2, ..., hM; hence

P[|E_in(g) − E_out(g)| > ε] ≤ P[ |E_in(h1) − E_out(h1)| > ε
                              or |E_in(h2) − E_out(h2)| > ε
                              or ...
                              or |E_in(hM) − E_out(hM)| > ε ]    (5)

The worst case is the union bound; hence

P[|E_in(g) − E_out(g)| > ε] ≤ Σ_{i=1}^{M} P[|E_in(h_i) − E_out(h_i)| > ε]
                            ≤ Σ_{i=1}^{M} 2e^(−2ε²N)
                            = 2M e^(−2ε²N)    (6)
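The effect of M is easy to tabulate: the bound 2M e^(−2ε²N) grows linearly in M, so choosing among many hypotheses requires a much larger N for the same guarantee. A short sketch with illustrative values of ε, M and N:

```python
import math

def union_bound(M, N, eps):
    """Upper bound 2*M*exp(-2*eps**2*N) on P[|E_in(g) - E_out(g)| > eps]."""
    return 2 * M * math.exp(-2 * eps**2 * N)

eps = 0.05
for M in (1, 10, 100, 1000):
    for N in (1000, 5000):
        # Values above 1 mean the bound is vacuous for that (M, N).
        print(f"M={M:5d}, N={N:5d}: bound = {union_bound(M, N, eps):.4f}")
```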