Lecture 2: Feasibility of learning
Alaa Tharwat

Outline:
- Review of Lecture 1
- Hoeffding's inequality
- Connection to the learning problem
- Multiple hypotheses
Review of Lecture 1
Definition of machine learning: Machine learning is a subfield of artificial intelligence that uses statistical techniques to give computers the ability to learn a specific task without being explicitly programmed.
Examples of machine learning: credit approval and classifying fish types.
Components of learning:
- Input: x ∈ R^d
- Output: y ∈ {−1, +1}
- Target function: f : X → Y
- Data: D = {(x1, y1), (x2, y2), …, (xN, yN)}, where y_i = f(x_i)
- Hypothesis: g : X → Y

[Figure: the learning diagram — the unknown target function f : X → Y generates the training samples (x1, y1), …, (xN, yN); the learning algorithm searches the hypothesis set H (h1, h2, h3, …) and outputs the final hypothesis g ≈ f.]
Simple model (changing w generates many hypotheses): h(x) = sign(w^T x). We want to select g ∈ H so that g ≈ f on the data D. The weights are updated as w(t + 1) ← w(t) + y_n x_n (a sketch follows below).

[Figure: the perceptron update — for a misclassified x_n with y_n = +1, w(t+1) = w(t) + x_n rotates w toward x_n; for y_n = −1, w(t+1) = w(t) − x_n rotates w away from x_n.]

Types of learning: supervised learning, unsupervised learning, reinforcement learning.
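A minimal sketch of this update rule in Python (picking a random misclassified point and prepending a bias column are assumptions for the demo, not fixed by the slide):

```python
import numpy as np

def pla(X, y, max_iters=1000):
    """Perceptron learning algorithm with h(x) = sign(w^T x).

    X: (N, d) array of inputs (prepend a constant 1 for a bias term).
    y: (N,) array of labels in {-1, +1}.
    """
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):
        misclassified = np.where(np.sign(X @ w) != y)[0]
        if misclassified.size == 0:       # g classifies all of D correctly
            break
        n = np.random.choice(misclassified)
        w = w + y[n] * X[n]               # w(t+1) <- w(t) + y_n x_n
    return w
```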
Hoeffding’s inequality
Given a box of balls (red and green):
- Probability of picking a red ball: P[picking red ball] = µ
- Probability of picking a green ball: P[picking green ball] = 1 − µ

The value of µ is unknown (e.g., the proportion of males in a country). We cannot compute µ directly, but we can pick N balls/samples from the box independently; the fraction of red balls in the sample is ν. Are ν and µ close? The sample could be mostly green while the box is mostly red (a biased sample), but the sample frequency ν is likely to be close to µ.
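A quick simulation sketch of the box (µ is fixed here only for the demo; in reality it is unknown):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, N = 0.7, 1000              # true (normally unknown) red fraction; sample size

sample = rng.random(N) < mu    # True = red ball, drawn independently
nu = sample.mean()             # sample frequency of red balls
print(f"mu = {mu}, nu = {nu:.3f}, |nu - mu| = {abs(nu - mu):.3f}")
```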
In a big sample (N), ν is probably close to µ, BUT within constraints as follows:

P[|ν − µ| > ε] ≤ 2e^(−2ε²N)    (1)

where ε is a small number. This is called Hoeffding's inequality, and it means:
- The probability of ν being far from µ by more than ε (a bad event) must be within the limit 2e^(−2ε²N), for all N and ε.
- This bound/constraint (the right-hand side) does not depend on µ or the number of balls in the box (good news, why? Because µ and the number of balls in the box are unknown).
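One handy consequence: solving the bound for N gives the sample size needed for a desired guarantee. A sketch (the tolerance ε = 0.05 and confidence level are arbitrary example values):

```python
import math

def samples_needed(eps, delta):
    """Smallest N with 2*exp(-2*eps**2*N) <= delta,
    i.e. P[|nu - mu| <= eps] >= 1 - delta."""
    return math.ceil(math.log(2 / delta) / (2 * eps**2))

print(samples_needed(0.05, 0.05))   # 738 samples suffice for eps=0.05 at 95% confidence
```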
P[|ν − µ| > ε] ≤ 2e^(−2ε²N)  ⟺  P[|ν − µ| ≤ ε] ≥ 1 − 2e^(−2ε²N)    (2)

- As N → ∞, 2e^(−2ε²N) ≈ 0; hence P[|ν − µ| ≤ ε] ≥ 1 − 2e^(−2ε²N) ≈ 1, so ν ≈ µ with high probability (but not with certainty).
- On the other hand, as N ↓, e^(−2ε²N) ≈ 1, and the guarantee P[|ν − µ| ≤ ε] ≥ 1 − 2e^(−2ε²N) becomes vacuous (the right-hand side drops to 0 and below); hence ν may be far from µ.
- A small ε demands ν ≈ µ tightly; a large ε allows ν to be farther from µ.
How far is ν from µ? Using P[|ν − µ| > ε] ≤ 2e^(−2ε²N)    (3):
- N = 1000 and ε = 0.05: P[|ν − µ| > 0.05] ≤ 2e^(−2×0.05²×1000) ≈ 0.0135, so µ − 0.05 ≤ ν ≤ µ + 0.05 with probability ≈ 99%.
- N = 1000 and ε = 0.1: P[|ν − µ| > 0.1] ≤ 2e^(−2×0.1²×1000) ≈ 4×10⁻⁹, so µ − 0.1 ≤ ν ≤ µ + 0.1 with probability ≈ 99.9999996%.
- N = 100 and ε = 0.05: 2e^(−2×0.05²×100) = 2e^(−0.5) ≈ 1.21 > 1, so the bound is vacuous: Hoeffding gives no nontrivial guarantee that µ − 0.05 ≤ ν ≤ µ + 0.05.
- N = 50 and ε = 0.05: 2e^(−2×0.05²×50) = 2e^(−0.25) ≈ 1.56 > 1, again no nontrivial guarantee.

Tradeoff between N and ε: a large sample size N or a looser gap ε (see the sketch below).
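The numbers above can be reproduced directly from the bound; a sketch (the guarantee is clipped at 0 where the bound exceeds 1):

```python
import math

for N, eps in [(1000, 0.05), (1000, 0.1), (100, 0.05), (50, 0.05)]:
    bound = 2 * math.exp(-2 * eps**2 * N)   # bound on P[|nu - mu| > eps]
    guarantee = max(0.0, 1 - bound)         # guaranteed P[|nu - mu| <= eps]
    print(f"N={N:4d}, eps={eps}: bound={bound:.3g}, guarantee >= {guarantee:.7%}")
```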
The figure below shows a binomial distribution (B). In a coin-flip example, the probability of getting heads or tails is P = 0.5, E[X] = 0.5, and let ε = 0.25. With a small sample size, the difference between the sample mean and the true mean can be big. Such distributions converge to their true means as the sample size increases.

[Figure: P(|true mean − sample mean| > ε) versus the sample size n (0 to 100), plotting the true probability against the Hoeffding bound; both decay toward 0 as n grows, with the bound (starting near 2) always lying above the true curve.]
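A simulation sketch along the lines of the figure (fair coin and ε = 0.25 as on the slide; the sample sizes and trial count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
eps, trials = 0.25, 20_000

for n in [5, 10, 20, 50, 100]:
    means = (rng.random((trials, n)) < 0.5).mean(axis=1)  # sample means of n flips
    p_dev = np.mean(np.abs(means - 0.5) > eps)            # true deviation probability
    bound = 2 * np.exp(-2 * eps**2 * n)                   # Hoeffding bound
    print(f"n={n:3d}: P ~ {p_dev:.4f}, bound = {bound:.4f}")
```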
Connection to the learning problem
In the learning framework:
- The target function f is unknown, just like µ. In other words, f is not only the given data D; f also extends outside the data.
- Each ball is a point x ∈ X, and the sample of balls corresponds to the dataset D.
Some samples are generated/drawn from f; these samples represent D. The learning model uses the data D for generating a set of hypotheses H, where h ∈ H.
- A green ball means the hypothesis is right: h(x) = f(x).
- A red ball means the hypothesis is wrong: h(x) ≠ f(x).
- The error is the size of the red region (error function): E(h) = P_x[h(x) ≠ f(x)].
There are two types of errors:
- In-sample error E_in(h): how many samples in the dataset are misclassified, E_in(h) = (1/m) Σ_{i=1}^{m} [h(x_i) ≠ f(x_i)], where m is the number of samples in the dataset (in the figure, E_in(h) = 4/14).
- Out-of-sample error E_out(h): the red region, E_out(h) = P_x[h(x) ≠ f(x)] (we cannot calculate it).
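A sketch contrasting the two errors on a toy problem (the target f, hypothesis h, and distribution P(x) below are invented for illustration; in a real learning problem f and P are unknown, so E_out cannot be computed this way):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sign(x[:, 0] + x[:, 1])    # toy target (unknown in practice)
h = lambda x: np.sign(x[:, 0])              # one fixed hypothesis

X_train = rng.uniform(-1, 1, size=(14, 2))  # dataset D with m = 14 points
E_in = np.mean(h(X_train) != f(X_train))    # fraction of D misclassified

X_fresh = rng.uniform(-1, 1, size=(100_000, 2))   # fresh draws from P(x)
E_out = np.mean(h(X_fresh) != f(X_fresh))         # Monte-Carlo estimate of E_out
print(f"E_in = {E_in:.3f}, E_out ~ {E_out:.3f}")
```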
The box model maps onto the learning model as follows:
- Box of balls ↔ input space X
- Green ball ↔ x where h(x) = f(x) (h is right)
- Red ball ↔ x where h(x) ≠ f(x) (h is wrong)
- Randomly picking balls ↔ the probability distribution P(x)
- Sample of N i.i.d. balls ↔ dataset D = {(x1, y1), …, (xN, yN)} with i.i.d. x_n
- µ, the probability of picking a red ball from the box ↔ out-of-sample error E_out
- ν, the fraction of red balls in a sample ↔ in-sample error E_in
[Figure: the learning diagram, now including a probability distribution P on X — the unknown target function f : X → Y together with P(x) generates the training samples (x1, y1), …, (xN, yN); the learning algorithm searches the hypothesis set H and outputs the final hypothesis g ≈ f.]
Back to learning:

P[|E_in(h) − E_out(h)| > ε] ≤ 2e^(−2ε²N)    (4)

- We can calculate E_in, and it is random (it depends on the sampled dataset); we cannot calculate E_out, and it is fixed for a given h.
- If E_in ≈ 0, then E_out ≈ 0 with high probability. This means P_x[h(x) ≠ f(x)] ≈ 0; hence we have learned something about the entire f, and f ≈ h over X (outside D as well).
- If E_in is large (out of luck), we have still learned something about the entire f; but f ≠ h.
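For a single fixed h, the bound turns an observed E_in into an interval for E_out; a sketch with assumed example numbers:

```python
import math

N, eps, E_in = 1000, 0.05, 0.05          # assumed values for illustration
bound = 2 * math.exp(-2 * eps**2 * N)    # P[|E_in - E_out| > eps] <= bound
print(f"P[E_out in [{E_in - eps:.2f}, {E_in + eps:.2f}]] >= {1 - bound:.4f}")
```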
Given h ∈ H, the samples (i.e. dataset) can verify whether h is good (i.e. Ein is small) or bad (i.e. Ein is large).
Another example: X = {0, 1}³, Y = {O, X}. The dataset D consists of five labeled points:

x_n     y_n = h(x_n)
0 0 0   O
0 0 1   X
0 1 0   O
0 1 1   X
1 0 0   O
The shaded region is D: g ≈ f inside D (perfectly), but not necessarily outside D. All eight possible target functions f1, …, f8 agree with g on D, yet differ on the three unseen points (marked ?):

x_n     y_n  g   f1  f2  f3  f4  f5  f6  f7  f8
0 0 0   O    O   O   O   O   O   O   O   O   O
0 0 1   X    X   X   X   X   X   X   X   X   X
0 1 0   O    O   O   O   O   O   O   O   O   O
0 1 1   X    X   X   X   X   X   X   X   X   X
1 0 0   O    O   O   O   O   O   O   O   O   O
1 0 1   ?    ?   O   O   O   O   X   X   X   X
1 1 0   ?    ?   O   O   X   X   O   O   X   X
1 1 1   ?    ?   O   X   O   X   O   X   O   X
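The point of the table can be checked by brute force: every labeling of the three unseen points yields a target function consistent with D. A sketch (the D labels follow the table above):

```python
from itertools import product

X_all = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
D = dict(zip(X_all[:5], "OXOXO"))       # the five labeled points (shaded region)
outside = X_all[5:]                      # 101, 110, 111: never observed

# Every labeling of the three unseen points gives a target function
# f1..f8 that agrees perfectly with D -- D alone cannot distinguish them.
consistent = [{**D, **dict(zip(outside, labels))}
              for labels in product("OX", repeat=len(outside))]
print(len(consistent), "target functions agree with D")   # 8
```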
Multiple hypotheses
- ν and µ depend on the hypothesis h: the learning model generates multiple hypotheses H (not only one), so there is one box per hypothesis.
- Some balls are green in one box and red in another, why? Answer: in box i, a green ball means that hypothesis h_i correctly classifies that sample, and a red ball means that h_i misclassifies it.
- The learning model scans all hypotheses to find g ≈ f, i.e., the minimum E_in(h_i).
- Hoeffding does not apply to multiple hypotheses!
Given a fair coin, what is the probability of getting 10 heads if you toss the coin ten times? Answer: P = 1/2^N = 1/2^10 ≈ 0.1%. The coin is fair, and the ten heads are no indication of the real probability.
Given 1000 fair coins, what is the probability that some coin gets 10 heads if you toss each coin ten times? Answer: 1 − (1 − 1/2^N)^1000 = 1 − (1 − 1/1024)^1000 ≈ 63%. Why? Simply, you tried many times (Hoeffding is applied to each coin individually).
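Both answers are easy to check numerically; a simulation sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
flips, coins, trials = 10, 1000, 2000

heads = rng.binomial(flips, 0.5, size=(trials, coins))  # heads per coin, per trial
one_coin = (heads[:, 0] == flips).mean()                # ~ 1/2**10 ~ 0.001
any_coin = (heads == flips).any(axis=1).mean()          # ~ 1 - (1 - 1/2**10)**1000 ~ 0.63
print(f"one coin: {one_coin:.4f}, best of {coins} coins: {any_coin:.4f}")
```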
- Bad sample: E_in is far from E_out (remember Hoeffding). For example: E_in = 0 and E_out = 0.5; getting 10 heads from a fair coin; or picking only green balls from a box that contains red and green balls.
- Bad data for one h means E_out is big (h is far from the target function f) while E_in is small (h(x_n) = f(x_n) on most samples).
- Given M hypotheses (many h's):
  P[bad D] = P[bad D for h1 or bad D for h2 or … or bad D for hM]
           ≤ P[bad D for h1] + P[bad D for h2] + … + P[bad D for hM]
- Some basic probability: P(A or B) = P(A ∪ B) ≤ P(A) + P(B); if A ⊆ B, then P(A) ≤ P(B).
g ∈ H, and hence g = h1, or h2, …, or hM:

P[|E_in(g) − E_out(g)| > ε] ≤ P[ |E_in(h1) − E_out(h1)| > ε
                                 or |E_in(h2) − E_out(h2)| > ε
                                 or …
                                 or |E_in(hM) − E_out(hM)| > ε ]    (5)

The worst case is the union bound; hence,

P[|E_in(g) − E_out(g)| > ε] ≤ Σ_{i=1}^{M} P[|E_in(h_i) − E_out(h_i)| > ε]
                            ≤ Σ_{i=1}^{M} 2e^(−2ε²N)
                            = 2M e^(−2ε²N)    (6)
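A sketch of how the bound degrades with M (N and ε are assumed example values):

```python
import math

N, eps = 1000, 0.05                      # assumed values for illustration
for M in [1, 10, 100, 1000, 10_000]:
    bound = min(1.0, 2 * M * math.exp(-2 * eps**2 * N))
    print(f"M = {M:6d}: P[|E_in(g) - E_out(g)| > {eps}] <= {bound:.4g}")
```

For M = 1 the guarantee is strong (bound ≈ 0.0135), but by M = 100 the bound already exceeds 1 and becomes vacuous, which is exactly why multiple hypotheses require the factor M in (6).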