Robustness of Bayesian Pool-based Active Learning Against Prior Misspecification: Supplementary Material

Nguyen Viet Cuong¹, Nan Ye², Wee Sun Lee³

¹ Department of Mechanical Engineering, National University of Singapore, Singapore, [email protected]
² Mathematical Sciences School & ACEMS, Queensland University of Technology, Australia, [email protected]
³ Department of Computer Science, National University of Singapore, Singapore, [email protected]

Copyright © 2016, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Proof of Corollary 1

Cuong et al. (2013) showed that the maximum Gibbs error algorithm provides a constant factor approximation to the optimal policy Gibbs error, which is equivalent to the expected version space reduction $f^{\mathrm{avg}}_p(\pi)$. Formally, they showed that, for any prior $p$,
$$ f^{\mathrm{avg}}_p(A(p)) \ge \left(1 - \frac{1}{e}\right) \max_\pi f^{\mathrm{avg}}_p(\pi), $$
where $A$ is the maximum Gibbs error algorithm. That is, the algorithm is average-case $(1 - 1/e)$-approximate.

Furthermore, the version space reduction utility is upper bounded by $M = 1$; and for any priors $p, p'$, we also have
\begin{align*}
|f_p(S, h) - f_{p'}(S, h)| &= |p'[h(S); S] - p[h(S); S]| \\
&= \Big| \sum_{h'} p'[h'] \, \mathbb{P}[h'(S) = h(S) \mid h'] - \sum_{h'} p[h'] \, \mathbb{P}[h'(S) = h(S) \mid h'] \Big| \\
&\le \|p - p'\|.
\end{align*}
Thus, the version space reduction utility is Lipschitz continuous with $L = 1$ and is upper bounded by $M = 1$. Hence, Corollary 1 follows from Theorem 1.
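The following Python sketch is a small numerical illustration of this Lipschitz property; it is not part of the original proof. It assumes $\|\cdot\|$ is the $\ell_1$ distance between priors (consistent with the term-by-term bound above) and takes $f_p(S, h) = 1 - p[h(S); S]$ for deterministic hypotheses over a toy example set; the hypothesis class and the random priors are arbitrary choices made only for illustration.

```python
# Illustrative sanity check (not from the paper): version space reduction utility
# f_p(S, h) = 1 - p[h(S); S], where p[h(S); S] is the prior mass of hypotheses
# agreeing with h on S.  We verify |f_p(S,h) - f_p'(S,h)| <= ||p - p'||_1.
import itertools
import random

hypotheses = list(itertools.product([0, 1], repeat=3))   # all labelings of 3 examples
S = [0, 2]                                               # indices of the queried examples

def agree_mass(prior, labels):
    """Prior mass of hypotheses consistent with `labels` on S."""
    return sum(q for hyp, q in zip(hypotheses, prior)
               if all(hyp[i] == labels[i] for i in S))

def f(prior, h):
    return 1.0 - agree_mass(prior, h)

def random_prior():
    w = [random.random() for _ in hypotheses]
    s = sum(w)
    return [x / s for x in w]

for _ in range(1000):
    p, p2 = random_prior(), random_prior()
    h = random.choice(hypotheses)
    lhs = abs(f(p, h) - f(p2, h))
    rhs = sum(abs(a - b) for a, b in zip(p, p2))          # ||p - p'||_1
    assert lhs <= rhs + 1e-12
```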
Proof of Theorem 2

Let $\pi_0 = \arg\max_\pi f^{\mathrm{worst}}_{p_0}(\pi)$ and $\pi_1 = \arg\max_\pi f^{\mathrm{worst}}_{p_1}(\pi)$. We have $f^{\mathrm{worst}}_{p_1}(\pi_1) \ge f^{\mathrm{worst}}_{p_1}(\pi_0) = f_{p_1}(x^{\pi_0}_{h_0}, h_0)$, where $h_0 = \arg\min_h f_{p_1}(x^{\pi_0}_h, h)$. Using the Lipschitz continuity of $f_p$ and the definition of $f^{\mathrm{worst}}_p$, we have
\begin{align*}
f_{p_1}(x^{\pi_0}_{h_0}, h_0) &\ge f_{p_0}(x^{\pi_0}_{h_0}, h_0) - L\|p_0 - p_1\| \\
&\ge \min_h f_{p_0}(x^{\pi_0}_h, h) - L\|p_0 - p_1\| \\
&= f^{\mathrm{worst}}_{p_0}(\pi_0) - L\|p_0 - p_1\|.
\end{align*}
Thus, $f^{\mathrm{worst}}_{p_1}(\pi_1) \ge f^{\mathrm{worst}}_{p_0}(\pi_0) - L\|p_0 - p_1\|$.

Let $\pi = A(p_1)$ and $h^* = \arg\min_h f_{p_0}(x^{\pi}_h, h)$. We have $f^{\mathrm{worst}}_{p_0}(\pi) = \min_h f_{p_0}(x^{\pi}_h, h) = f_{p_0}(x^{\pi}_{h^*}, h^*)$. By the Lipschitz continuity of $f_p$, we have
\begin{align*}
f_{p_0}(x^{\pi}_{h^*}, h^*) &\ge f_{p_1}(x^{\pi}_{h^*}, h^*) - L\|p_0 - p_1\| \\
&\ge \min_h f_{p_1}(x^{\pi}_h, h) - L\|p_0 - p_1\| \\
&= f^{\mathrm{worst}}_{p_1}(\pi) - L\|p_0 - p_1\| \\
&\ge \alpha \max_\pi f^{\mathrm{worst}}_{p_1}(\pi) - L\|p_0 - p_1\| \\
&= \alpha f^{\mathrm{worst}}_{p_1}(\pi_1) - L\|p_0 - p_1\|,
\end{align*}
where the last inequality holds as $A$ is $\alpha$-approximate. Using the inequality relating $f^{\mathrm{worst}}_{p_1}(\pi_1)$ and $f^{\mathrm{worst}}_{p_0}(\pi_0)$ above, we now have
\begin{align*}
f^{\mathrm{worst}}_{p_0}(\pi) &\ge \alpha \big( f^{\mathrm{worst}}_{p_0}(\pi_0) - L\|p_0 - p_1\| \big) - L\|p_0 - p_1\| \\
&= \alpha \max_\pi f^{\mathrm{worst}}_{p_0}(\pi) - (\alpha + 1) L\|p_0 - p_1\|.
\end{align*}
Proof of Corollary 3

Cuong, Lee, and Ye (2014) have shown that the least confidence algorithm achieves a constant factor approximation to the optimal worst-case version space reduction. Formally, if $f_p(S, h)$ is the version space reduction utility (considered previously for the maximum Gibbs error algorithm), then $f^{\mathrm{worst}}_p(\pi)$ is the worst-case version space reduction of $\pi$, and it was shown (Cuong, Lee, and Ye 2014) that, for any prior $p$,
$$ f^{\mathrm{worst}}_p(A(p)) \ge \left(1 - \frac{1}{e}\right) \max_\pi f^{\mathrm{worst}}_p(\pi), $$
where $A$ is the least confidence algorithm. That is, the least confidence algorithm is worst-case $(1 - 1/e)$-approximate. Since the version space reduction utility is Lipschitz continuous with $L = 1$ as shown in the proof of Corollary 1, Corollary 3 follows from Theorem 2.
Proof of Corollary 4

It was shown by Cuong, Lee, and Ye (2014) that, for any prior $p$,
$$ t^{\mathrm{worst}}_p(A(p)) \ge \left(1 - \frac{1}{e}\right) \max_\pi t^{\mathrm{worst}}_p(\pi), $$
where $A$ is the worst-case generalized Gibbs error algorithm. That is, the worst-case generalized Gibbs error algorithm is worst-case $(1 - 1/e)$-approximate.

If we assume the loss function $L$ is upper bounded by a constant $m$, then $t_p$ is Lipschitz continuous with $L = 2m$. Indeed, for any $S$, $h$, $p$, and $p'$, we have
\begin{align*}
|t_p(S, h) - t_{p'}(S, h)| &= \Big| \sum_{h'(S) \ne h(S) \text{ or } h''(S) \ne h(S)} L(h', h'') \big( p[h'] p[h''] - p'[h'] p'[h''] \big) \Big| \\
&\le m \sum_{h'(S) \ne h(S) \text{ or } h''(S) \ne h(S)} \big| p[h'] p[h''] - p'[h'] p'[h''] \big| \\
&= m \sum_{h'(S) \ne h(S) \text{ or } h''(S) \ne h(S)} \big| (p[h'] - p'[h']) p[h''] + p'[h'] (p[h''] - p'[h'']) \big| \\
&\le m \sum_{h', h''} \big( |p[h'] - p'[h']| \, p[h''] + p'[h'] \, |p[h''] - p'[h'']| \big) \\
&= 2m \|p - p'\|.
\end{align*}
Thus, Corollary 4 follows from Theorem 2.
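As a numerical illustration (not part of the original proof), the sketch below checks the bound $|t_p(S, h) - t_{p'}(S, h)| \le 2m\|p - p'\|_1$ for random priors and a random loss bounded by $m$, taking $t_p(S, h)$ to be the pairwise sum that appears in the equality above; the toy hypothesis class, the choice $m = 1$, and the $\ell_1$ norm are assumptions made only for the example.

```python
# Illustrative sanity check (not from the paper): generalized Gibbs error utility
# t_p(S, h) = sum over pairs (h', h'') with h'(S) != h(S) or h''(S) != h(S) of
# L(h', h'') p[h'] p[h''].  We verify |t_p - t_p'| <= 2m ||p - p'||_1 for a loss
# bounded by m.
import itertools
import random

hypotheses = list(itertools.product([0, 1], repeat=3))
S = [1]
m = 1.0
loss = {(a, b): random.uniform(0, m) for a in hypotheses for b in hypotheses}

def t(prior, h):
    total = 0.0
    for (h1, q1), (h2, q2) in itertools.product(zip(hypotheses, prior), repeat=2):
        if any(h1[i] != h[i] for i in S) or any(h2[i] != h[i] for i in S):
            total += loss[(h1, h2)] * q1 * q2
    return total

def random_prior():
    w = [random.random() for _ in hypotheses]
    s = sum(w)
    return [x / s for x in w]

for _ in range(200):
    p, p2 = random_prior(), random_prior()
    h = random.choice(hypotheses)
    lhs = abs(t(p, h) - t(p2, h))
    rhs = 2 * m * sum(abs(a - b) for a, b in zip(p, p2))
    assert lhs <= rhs + 1e-9
```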
Proof of Theorem 3

For both the average and worst cases, consider the AL problem with budget $k = 1$ and the utility
$$ f_p(S, h) = |\{ h' : p[h'] > \mu \text{ and } h'(S) \ne h(S) \}|, $$
for some very small $\mu > 0$ in the worst case and $\mu = 0$ in the average case. This utility returns the number of hypotheses that have a significant probability (greater than $\mu$) and are not consistent with $h$ on $S$. When $\mu = 0$, it is the number of hypotheses pruned from the version space. So, this is a reasonable utility to maximize for AL. It is easy to see that this utility is non-Lipschitz.

Consider the case where there are two examples $x_0, x_1$ and four hypotheses $h_1, \ldots, h_4$ with binary labels given according to the following table.

Hypothesis   $x_0$   $x_1$
$h_1$          0       0
$h_2$          1       0
$h_3$          0       1
$h_4$          1       1

Consider the true prior $p_0$ where $p_0[h_1] = p_0[h_2] = \frac{1}{2} - \mu$ and $p_0[h_3] = p_0[h_4] = \mu$, and a perturbed prior $p_1$ where $p_1[h_1] = p_1[h_2] = \frac{1}{2} - \mu - \delta$ and $p_1[h_3] = p_1[h_4] = \mu + \delta$, for some small $\delta > 0$.

With budget $k = 1$, there are two possible policies: the policy $\pi_0$, which chooses $x_0$, and the policy $\pi_1$, which chooses $x_1$. Let $A^*(p_1) = \pi_1$. Note that $f^{\mathrm{avg}}_{p_1}(\pi_1) = 2(\frac{1}{2} - \mu - \delta) + 2(\frac{1}{2} - \mu - \delta) + 2(\mu + \delta) + 2(\mu + \delta) = 2$, and $f^{\mathrm{avg}}_{p_1}(\pi_0) = 2(\frac{1}{2} - \mu - \delta) + 2(\frac{1}{2} - \mu - \delta) + 2(\mu + \delta) + 2(\mu + \delta) = 2$. Thus, $\pi_1$ is an average-case optimal policy for $p_1$ and $A^*$ is an exact algorithm for $p_1$ in the average case. Similarly, $f^{\mathrm{worst}}_{p_1}(\pi_1) = 2 = f^{\mathrm{worst}}_{p_1}(\pi_0)$. Thus, $\pi_1$ is a worst-case optimal policy for $p_1$ and $A^*$ is an exact algorithm for $p_1$ in the worst case. Hence, $A^*$ is an exact algorithm for $p_1$ in both the average and worst cases.

Considering $p_0$, we have $f^{\mathrm{avg}}_{p_0}(\pi_1) = 0(\frac{1}{2} - \mu) + 0(\frac{1}{2} - \mu) + 2\mu + 2\mu = 0$ since $\mu = 0$ in the average case. On the other hand, $f^{\mathrm{avg}}_{p_0}(\pi_0) = 1(\frac{1}{2} - \mu) + 1(\frac{1}{2} - \mu) + 1\mu + 1\mu = 1$. Similarly, in the worst case, we also have $f^{\mathrm{worst}}_{p_0}(\pi_1) = 0$ and $f^{\mathrm{worst}}_{p_0}(\pi_0) = 1$. Thus, $\pi_0$ is the optimal policy for $p_0$ in both the average and worst cases.

Now given any $C, \alpha, \epsilon > 0$, we can choose a small enough $\delta$ such that $\|p_1 - p_0\| < \epsilon$ and $\alpha - C\|p_1 - p_0\| > 0$. Hence, Theorem 3 holds.
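The arithmetic in this counterexample can be re-checked mechanically. The Python sketch below (an illustration, not part of the original proof) recomputes $f^{\mathrm{avg}}$ and $f^{\mathrm{worst}}$ for $\pi_0$ and $\pi_1$ under $p_0$ and $p_1$; the concrete values $\mu = 0.01$ (worst case) and $\delta = 0.01$ are arbitrary small choices.

```python
# Illustrative recomputation of the Theorem 3 counterexample (the values of mu
# and delta below are arbitrary small choices, not from the paper).
hyps = {'h1': (0, 0), 'h2': (1, 0), 'h3': (0, 1), 'h4': (1, 1)}  # labels on (x0, x1)

def utility(prior, x, h, mu):
    """Number of hypotheses with prior mass > mu that disagree with h on example x."""
    return sum(1 for h2, lab in hyps.items()
               if prior[h2] > mu and lab[x] != hyps[h][x])

def f_avg(prior, x, mu):
    return sum(prior[h] * utility(prior, x, h, mu) for h in hyps)

def f_worst(prior, x, mu):
    return min(utility(prior, x, h, mu) for h in hyps)

mu, delta = 0.0, 0.01                         # average case uses mu = 0
p0 = {'h1': 0.5 - mu, 'h2': 0.5 - mu, 'h3': mu, 'h4': mu}
p1 = {'h1': 0.5 - mu - delta, 'h2': 0.5 - mu - delta,
      'h3': mu + delta, 'h4': mu + delta}
print(f_avg(p1, 1, mu), f_avg(p1, 0, mu))     # both 2: pi_1 is (tied) optimal for p1
print(f_avg(p0, 1, mu), f_avg(p0, 0, mu))     # 0 and 1: pi_0 is optimal for p0

mu_w = 0.01                                   # small mu > 0 for the worst case
p0w = {'h1': 0.5 - mu_w, 'h2': 0.5 - mu_w, 'h3': mu_w, 'h4': mu_w}
p1w = {'h1': 0.5 - mu_w - delta, 'h2': 0.5 - mu_w - delta,
       'h3': mu_w + delta, 'h4': mu_w + delta}
print(f_worst(p1w, 1, mu_w), f_worst(p1w, 0, mu_w))   # both 2
print(f_worst(p0w, 1, mu_w), f_worst(p0w, 0, mu_w))   # 0 and 1
```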
Proof of Theorem 4

For any policy $\pi$, note that
\begin{align*}
|c^{\mathrm{avg}}_{p_0}(\pi) - c^{\mathrm{avg}}_{p_1}(\pi)| &= \Big| \sum_h p_0[h] c(\pi, h) - \sum_h p_1[h] c(\pi, h) \Big| \\
&= \Big| \sum_h (p_0[h] - p_1[h]) c(\pi, h) \Big| \\
&\le K \|p_0 - p_1\|,
\end{align*}
where the last inequality holds as $c(\pi, h)$ is upper bounded by $K$. Thus,
$$ c^{\mathrm{avg}}_{p_0}(\pi) \le c^{\mathrm{avg}}_{p_1}(\pi) + K\|p_0 - p_1\|, \quad \text{for all } \pi, p_0, p_1. $$
Let $\pi_0 = \arg\min_\pi c^{\mathrm{avg}}_{p_0}(\pi)$. We have
\begin{align*}
c^{\mathrm{avg}}_{p_0}(A(p_1)) &\le c^{\mathrm{avg}}_{p_1}(A(p_1)) + K\|p_0 - p_1\| \\
&\le \alpha(p_1) \min_\pi c^{\mathrm{avg}}_{p_1}(\pi) + K\|p_0 - p_1\| \\
&\le \alpha(p_1) \, c^{\mathrm{avg}}_{p_1}(\pi_0) + K\|p_0 - p_1\| \\
&\le \alpha(p_1) \big( c^{\mathrm{avg}}_{p_0}(\pi_0) + K\|p_0 - p_1\| \big) + K\|p_0 - p_1\| \\
&= \alpha(p_1) \, c^{\mathrm{avg}}_{p_0}(\pi_0) + (\alpha(p_1) + 1) K\|p_0 - p_1\| \\
&= \alpha(p_1) \min_\pi c^{\mathrm{avg}}_{p_0}(\pi) + (\alpha(p_1) + 1) K\|p_0 - p_1\|,
\end{align*}
where the first and fourth inequalities are from the discussion above, and the second inequality is from the fact that $A$ is $\alpha(p)$-approximate.
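A minimal numerical check (not from the paper) of the first display above: for any fixed cost vector bounded by $K$, the average costs under two priors differ by at most $K\|p_0 - p_1\|_1$. Abstracting the policy to a fixed cost per hypothesis and using the $\ell_1$ norm are assumptions made only for this sketch.

```python
# Illustrative check (not from the paper) of the perturbation bound
# |c_p0^avg(pi) - c_p1^avg(pi)| <= K ||p0 - p1||_1 when the cost c(pi, h) <= K.
import random

n, K = 6, 5.0
cost = [random.uniform(0, K) for _ in range(n)]   # c(pi, h) for a fixed policy pi

def random_prior(n):
    w = [random.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

for _ in range(1000):
    p0, p1 = random_prior(n), random_prior(n)
    lhs = abs(sum(a * c for a, c in zip(p0, cost)) - sum(b * c for b, c in zip(p1, cost)))
    rhs = K * sum(abs(a - b) for a, b in zip(p0, p1))
    assert lhs <= rhs + 1e-12
```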
Proof of Theorem 5
If $k = 1$, then $p_1 = p_0$ and Theorem 5 trivially holds. Consider $k \ge 2$. For any $h$, since
$$ \frac{p_0[h]}{p_1[h]} = \frac{p_0[h]}{\sum_{i=1}^k \frac{1}{k} p_{1,i}[h]} \le \frac{p_0[h]}{\frac{1}{k} p_0[h]} = k, $$
where the inequality holds because one of the mixture components $p_{1,i}$ is the true prior $p_0$, we have $k - 1 \ge 1 - \frac{p_0[h]}{p_1[h]} \ge 1 - k$. Thus, $\big|1 - \frac{p_0[h]}{p_1[h]}\big| \le k - 1$. Hence, for any policy $\pi$,
\begin{align*}
|c^{\mathrm{avg}}_{p_1}(\pi) - c^{\mathrm{avg}}_{p_0}(\pi)| &= \Big| \sum_h p_1[h] \Big(1 - \frac{p_0[h]}{p_1[h]}\Big) c(\pi, h) \Big| \\
&\le (k - 1) \sum_h p_1[h] c(\pi, h) \\
&= (k - 1) \, c^{\mathrm{avg}}_{p_1}(\pi).
\end{align*}
Therefore, $c^{\mathrm{avg}}_{p_0}(\pi) \le k \, c^{\mathrm{avg}}_{p_1}(\pi)$.
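The following sketch (an illustration, not part of the original proof) checks the ratio bound $p_0[h]/p_1[h] \le k$ for a uniform mixture in which one component equals the true prior $p_0$; the component count and the random components are arbitrary choices.

```python
# Illustrative check (not from the paper): for a uniform mixture
# p1 = (1/k) * sum_i p1_i with one component equal to p0, the ratio
# p0[h] / p1[h] is at most k.
import random

def random_prior(n):
    w = [random.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

n, k = 8, 4
for _ in range(1000):
    p0 = random_prior(n)
    components = [p0] + [random_prior(n) for _ in range(k - 1)]   # p1,1 = p0
    p1 = [sum(c[h] for c in components) / k for h in range(n)]
    assert all(p0[h] / p1[h] <= k + 1e-9 for h in range(n))
```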
On the other hand, for any $h$, we have
\begin{align*}
\frac{p_1[h]}{p_0[h]} &= \frac{\sum_{i=1}^k \frac{1}{k} p_{1,i}[h]}{p_0[h]} = \frac{\frac{1}{k} p_0[h] + \sum_{i : p_{1,i} \ne p_0} \frac{1}{k} p_{1,i}[h]}{p_0[h]} \\
&\le \frac{1}{k} + \frac{k-1}{k} \cdot \frac{1}{\min_h p_0[h]} = \frac{1}{k} + \frac{k-1}{k \min_h p_0[h]}.
\end{align*}
Thus, $1 - \frac{1}{k} \ge 1 - \frac{p_1[h]}{p_0[h]} \ge 1 - \frac{1}{k} - \frac{k-1}{k \min_h p_0[h]}$. When $\mathcal{H}$ contains at least 2 hypotheses, $\min_h p_0[h] \le 1/2$, and $\frac{1}{k} + \frac{k-1}{k \min_h p_0[h]} - 1 \ge 1 - \frac{1}{k}$ (the case when $\mathcal{H}$ is a singleton is equivalent to $k = 1$). Hence,
$$ \Big|1 - \frac{p_1[h]}{p_0[h]}\Big| \le \frac{1}{k} + \frac{k-1}{k \min_h p_0[h]} - 1. $$
We have
\begin{align*}
|c^{\mathrm{avg}}_{p_0}(\pi) - c^{\mathrm{avg}}_{p_1}(\pi)| &= \Big| \sum_h p_0[h] \Big(1 - \frac{p_1[h]}{p_0[h]}\Big) c(\pi, h) \Big| \\
&\le \Big( \frac{1}{k} + \frac{k-1}{k \min_h p_0[h]} - 1 \Big) c^{\mathrm{avg}}_{p_0}(\pi).
\end{align*}
Therefore, $c^{\mathrm{avg}}_{p_1}(\pi) \le \big( \frac{1}{k} + \frac{k-1}{k \min_h p_0[h]} \big) c^{\mathrm{avg}}_{p_0}(\pi)$.

Now let $\pi_0 = \arg\min_\pi c^{\mathrm{avg}}_{p_0}(\pi)$. We have
\begin{align*}
c^{\mathrm{avg}}_{p_0}(A(p_1)) &\le k \, c^{\mathrm{avg}}_{p_1}(A(p_1)) && \text{(first part)} \\
&\le k \, \alpha(p_1) \min_\pi c^{\mathrm{avg}}_{p_1}(\pi) \\
&\le k \, \alpha(p_1) \, c^{\mathrm{avg}}_{p_1}(\pi_0) \\
&\le k \, \alpha(p_1) \Big( \frac{1}{k} + \frac{k-1}{k \min_h p_0[h]} \Big) c^{\mathrm{avg}}_{p_0}(\pi_0) \\
&= \alpha(p_1) \Big( 1 + \frac{k-1}{\min_h p_0[h]} \Big) \min_\pi c^{\mathrm{avg}}_{p_0}(\pi),
\end{align*}
where the first and fourth inequalities are from the discussions above, and the second inequality is from the fact that $A$ is $\alpha(p)$-approximate.

If $A$ is the generalized binary search algorithm, then $\alpha(p_1) = \ln \frac{1}{\min_h p_1[h]} + 1$. Note that $\min_h p_1[h] = \min_h \sum_{i=1}^k \frac{1}{k} p_{1,i}[h] \ge \min_h \frac{1}{k} p_0[h]$. Thus, $\alpha(p_1) \le \ln \frac{k}{\min_h p_0[h]} + 1$. Therefore,
$$ c^{\mathrm{avg}}_{p_0}(A(p_1)) \le \Big( \ln \frac{k}{\min_h p_0[h]} + 1 \Big) \Big( \frac{k-1}{\min_h p_0[h]} + 1 \Big) \min_\pi c^{\mathrm{avg}}_{p_0}(\pi). $$
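A companion sketch to the previous one (again an illustration, not part of the original proof), checking the other ratio bound used above, $p_1[h]/p_0[h] \le \frac{1}{k} + \frac{k-1}{k \min_h p_0[h]}$; the mixture construction with one component equal to $p_0$ is an illustrative assumption.

```python
# Illustrative check (not from the paper) of the bound
# p1[h] / p0[h] <= 1/k + (k - 1) / (k * min_h p0[h])
# for a uniform mixture p1 with one component equal to p0.
import random

def random_prior(n):
    w = [random.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

n, k = 8, 4
for _ in range(1000):
    p0 = random_prior(n)
    components = [p0] + [random_prior(n) for _ in range(k - 1)]
    p1 = [sum(c[h] for c in components) / k for h in range(n)]
    bound = 1.0 / k + (k - 1) / (k * min(p0))
    assert all(p1[h] / p0[h] <= bound + 1e-9 for h in range(n))
```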
References

Cuong, N. V.; Lee, W. S.; Ye, N.; Chai, K. M. A.; and Chieu, H. L. 2013. Active learning for probabilistic hypotheses using the maximum Gibbs error criterion. In NIPS.

Cuong, N. V.; Lee, W. S.; and Ye, N. 2014. Near-optimal adaptive pool-based active learning with general loss. In UAI.