PAC models in stochastic multi-objective multi-armed bandits
Supplementary material

Madalina M. Drugan
Technical University of Eindhoven
[email protected]

ACM Reference format: Madalina M. Drugan. 2017. PAC models in stochastic multiobjective multi-armed bandits. In Proceedings of GECCO ’17, Berlin, Germany, July 15-19, 2017, 5 pages. DOI: http://dx.doi.org/10.1145/3071178.3071337

suboptimal arms deemed Pareto optimal when considering confidence regions. Thus, by definition

1 PROOF OF THEOREM 4.1

An arm $u$ is considered needy, $u \in \mathrm{Needy}(n)$, if $u$ is in the middle and its confidence parameter $\beta_u(n)$ is larger than the targeted confidence value $\epsilon/2$, thus

$$\mathrm{Needy}(n) = \{u \in \mathrm{Middle}(n) \mid \beta_u(n) > \epsilon/2\}$$

The algorithm stops when there are no crossing arms, $\mathrm{Cross}(n) = \emptyset$, or their confidence values are small enough to correctly distinguish between them. If there are no crossing arms and the algorithm does not terminate, then one of the critical arms is needy.

Theorem 4.1. Pareto LUCB is $(\epsilon, \delta)$ Pareto PAC. With probability at least $1 - \delta$, Pareto LUCB terminates after $O\left(H_{\epsilon/2} \ln \frac{H_{\epsilon/2}}{\delta}\right)$ rounds.

The Hoeffding inequality, combined with the union bound over the number of objectives $m$, bounds the probability that the estimated mean vector is not accurate:

$$\Pr\left(\hat{\mu}_j + \tilde{a} \prec \mu_j \;\vee\; \mu_j \prec \hat{\mu}_j - \tilde{a}\right) \le 2m \cdot \exp(-2a^2 n_j) \tag{1}$$

where $a$ is some positive real number. Recall that the middle vector of an arm $u$ is defined as

$$c_u = \begin{cases} (\mu_u + \mu_{i(u)})/2 & u \in \hat{A}^* \\ (\mu_{i^*(u)} + \mu_u)/2 & u \in (A \setminus \hat{A}^*) \end{cases}$$

To reuse part of the sample complexity analysis of LUCB, we need to redefine the sets it operates on.

Above contains the arms $u \in A$ for which all values of the lower confidence bound are above the corresponding middle vector $c_u$:

$$\mathrm{Above}(n) = \{u \in A \mid \hat{\mu}_u - \tilde{\beta}_u(n) \succ c_u\}$$

Below contains the arms $u$ for which all values of the upper confidence bound are below the middle vector $c_u$:

$$\mathrm{Below}(n) = \{u \in A \mid \hat{\mu}_u + \tilde{\beta}_u(n) \prec c_u\}$$

Middle contains the remaining arms from $A$, which are neither in Above nor in Below:

$$\mathrm{Middle}(n) = A \setminus \mathrm{Above}(n) \setminus \mathrm{Below}(n)$$

Let $\mathrm{Cross}(n)$ be the set of arms that are misclassified after $n$ arm pulls: 1) the empirical Pareto optimal arms $\hat{A}^*$ deemed suboptimal given their confidence regions, and 2) the suboptimal arms deemed Pareto optimal given their confidence regions. Thus, by definition,

$$\mathrm{Cross}(n) = \{j \in (A \setminus \hat{A}^*) \mid j \in \mathrm{Above}(n)\} \cup \{i^* \in \hat{A}^* \mid i^* \in \mathrm{Below}(n)\} \tag{2}$$

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. GECCO ’17, Berlin, Germany © 2017 ACM. 978-1-4503-4920-8/17/07. . . $15.00 DOI: http://dx.doi.org/10.1145/3071178.3071337
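The set definitions above can be sketched in a few lines of Python. This is only an illustrative sketch, not the author's implementation: the function name `classify_arms`, the dictionary layout, and the reading of $\succ$ and $\prec$ as strict component-wise comparisons are assumptions made for the example. The middle vectors and the empirical Pareto front are taken as given inputs.

```python
from typing import Dict, Sequence, Set

Vector = Sequence[float]


def all_greater(x: Vector, y: Vector) -> bool:
    """True if x exceeds y in every objective (assumed meaning of 'x succ y')."""
    return all(a > b for a, b in zip(x, y))


def all_smaller(x: Vector, y: Vector) -> bool:
    """True if x is below y in every objective (assumed meaning of 'x prec y')."""
    return all(a < b for a, b in zip(x, y))


def classify_arms(
    mu_hat: Dict[str, Vector],   # empirical mean vector of each arm
    beta: Dict[str, float],      # confidence radius beta_u(n) of each arm
    middle: Dict[str, Vector],   # middle vector c_u of each arm (given)
    empirical_front: Set[str],   # empirical Pareto optimal arms (given)
    epsilon: float,              # targeted accuracy
) -> Dict[str, Set[str]]:
    """Hypothetical helper: build Above, Below, Middle, Needy and Cross."""
    above: Set[str] = set()
    below: Set[str] = set()
    for u, mean in mu_hat.items():
        lower = [v - beta[u] for v in mean]  # lower confidence bound
        upper = [v + beta[u] for v in mean]  # upper confidence bound
        if all_greater(lower, middle[u]):
            above.add(u)
        elif all_smaller(upper, middle[u]):
            below.add(u)
    mid = set(mu_hat) - above - below
    # Needy arms: in the middle with a radius still larger than epsilon / 2.
    needy = {u for u in mid if beta[u] > epsilon / 2}
    # Crossing arms: suboptimal arms in Above, empirically optimal arms in Below.
    cross = {u for u in above if u not in empirical_front} | {
        u for u in below if u in empirical_front
    }
    return {"above": above, "below": below, "middle": mid,
            "needy": needy, "cross": cross}


if __name__ == "__main__":
    means = {"a": (0.8, 0.7), "b": (0.3, 0.2), "c": (0.55, 0.5)}
    radii = {"a": 0.1, "b": 0.1, "c": 0.1}
    mids = {u: (0.55, 0.45) for u in means}
    print(classify_arms(means, radii, mids, {"a"}, epsilon=0.1))
```

With these toy numbers no arm is crossing when the empirical front is `{"a"}`. Declaring `{"b"}` as the front instead makes both `a` and `b` crossing arms.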

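Lemma 1.1. $\neg\mathrm{Cross}(n) \wedge \neg\mathrm{Term}(n) \Rightarrow \exists i^* \in (L(n) \cap \mathrm{Needy}(n)) \vee j \in (U(n) \cap \mathrm{Needy}(n))$

Proof. We prove the following four statements:

$$\begin{aligned}
\neg\mathrm{Cross}(n) \wedge \neg\mathrm{Term}(n) &\Rightarrow (i^* \in \mathrm{Middle}(n)) \vee (j \in \mathrm{Middle}(n)) \\
\neg\mathrm{Term}(n) \wedge (i^* \in \mathrm{Middle}(n)) \wedge (j \notin \mathrm{Middle}(n)) &\Rightarrow \beta_{i^*} > \epsilon/2 \\
\neg\mathrm{Term}(n) \wedge (i^* \notin \mathrm{Middle}(n)) \wedge (j \in \mathrm{Middle}(n)) &\Rightarrow \beta_j > \epsilon/2 \\
\neg\mathrm{Term}(n) \wedge (i^* \in \mathrm{Middle}(n)) \wedge (j \in \mathrm{Middle}(n)) &\Rightarrow (\beta_{i^*} > \epsilon/2) \vee (\beta_j > \epsilon/2)
\end{aligned}$$

We now prove the first statement and leave the proofs of the other three statements as an exercise. In each objective $o$, we have that $\hat{\mu}^o_{i^*} \ge \hat{\mu}^o_j$. If there are no arms in $\mathrm{Cross}(n)$, then all arms are in the $\mathrm{Above}(n)$ or $\mathrm{Below}(n)$ sets. We distinguish between the following four exclusive cases.

(Case 1):

$$\begin{aligned}
&(i^* \in \mathrm{Above}(n)) \wedge (j \in \mathrm{Above}(n)) \wedge \neg\mathrm{Term}(n) \\
&\Rightarrow \exists j \in A \setminus \hat{A}^*,\ j \in \mathrm{Above}(n) \\
&\Rightarrow \mathrm{Cross}(n)
\end{aligned} \tag{3}$$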

GECCO ’17, July 15-19, 2017, Berlin, Germany

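(Case 2):

$$\begin{aligned}
&(i^* \in \mathrm{Above}(n)) \wedge (j \in \mathrm{Below}(n)) \wedge \neg\mathrm{Term}(n) \\
&\Rightarrow (\forall r,\ \hat{\mu}^r_{i^*} - \beta_{i^*}(n) > c^r_{i^*}) \wedge (\forall p,\ \hat{\mu}^p_j + \beta_j(n) < c^p_j) \wedge (\exists o,\ \hat{\mu}^o_j + \beta_j(n) > \hat{\mu}^o_{i^*} - \beta_{i^*}(n) + \epsilon) \\
&\Rightarrow \exists o,\ (\hat{\mu}^o_j + \beta_j(n) - \hat{\mu}^o_{i^*} + \beta_{i^*}(n) < 0) \wedge (\hat{\mu}^o_j + \beta_j(n) > \hat{\mu}^o_{i^*} - \beta_{i^*}(n) + \epsilon) \\
&\Rightarrow \emptyset
\end{aligned} \tag{4}$$

where $\hat{\mu}^o_j + \beta_j(n) - \hat{\mu}^o_{i^*} + \beta_{i^*}(n) < c^o_j - c^o_{i^*} < \epsilon$ since, by construction, the difference in all objectives between the two middle vectors of $j$ and $i^*$ is smaller than $\epsilon$.

(Case 3):

$$\begin{aligned}
&(i^* \in \mathrm{Below}(n)) \wedge (j \in \mathrm{Above}(n)) \wedge \neg\mathrm{Term}(n) \\
&\Rightarrow (\forall r,\ \hat{\mu}^r_{i^*} + \beta_{i^*}(n) < c^r_{i^*}) \wedge (\forall p,\ \hat{\mu}^p_j - \beta_j(n) > c^p_j) \\
&\Rightarrow \hat{\mu}^r_{i^*} \le \hat{\mu}^r_j \\
&\Rightarrow \emptyset
\end{aligned} \tag{5}$$

(Case 4):

$$\begin{aligned}
&(i^* \in \mathrm{Below}(n)) \wedge (j \in \mathrm{Below}(n)) \wedge \neg\mathrm{Term}(n) \\
&\Rightarrow \exists i^* \in \hat{A}^*,\ i^* \in \mathrm{Below}(n) \\
&\Rightarrow \mathrm{Cross}(n)
\end{aligned} \tag{6}$$

In a similar way, we prove the remaining statements. □

For large $n$, any arm $u$ should have been pulled a sufficiently large number of times such that the corresponding confidence parameter $\beta_u(n)$ is no larger than $\epsilon/2$. The condition for the choice of confidence parameters is given in the following lemma.

Lemma 1.2. Let $\beta_u(n)$ be confidence values such that

$$\sum_{n=1}^{\infty} \sum_{u=1}^{n} \exp(-2u\beta_u^2(n)) \le \frac{\delta}{mK}$$

when the algorithm terminates in round $n$. The probability of misclassifying Pareto optimal arms is at most $\delta$.

Proof. Let $\mathrm{WB}_u(n)$ denote the event that the arm $u$ is within the confidence intervals, thus

$$\mathrm{WB}_u(n) \leftarrow (\forall o,\ |\hat{\mu}^o_u - \mu^o_u| < \beta_u(n))$$

Applying the Hoeffding inequality and the union bound over the number of dimensions and the set of arms, we bound the probability that the arm $u$ is not within these bounds during round $n$:

$$\Pr\{\neg\mathrm{WB}_u(n)\} \le m \sum_{t=1}^{n-1} \exp(-2t\beta_u(n)^2)$$

If the algorithm terminates in round $n$ where a suboptimal arm $j$ is wrongly classified as Pareto optimal, $j \in \mathrm{Above}(n)$, then there exists an objective $r$ where the order of their estimated mean reward vectors is reversed. This means that there exists a Pareto optimal arm $i^*$ for which $\exists r$ such that $\hat{\mu}^r_j > \hat{\mu}^r_{i^*}$.

Let $\mathrm{Mistake}_j(n)$ be the event that the algorithm returns a suboptimal arm $j$ after $n$ rounds. If the termination condition holds, then there is at least one objective $r$ where at least one of the following conditions holds:

$$\hat{\mu}^r_j > \mu^r_j + \beta_j(n), \qquad \hat{\mu}^r_{i^*} < \mu^r_{i^*} - \beta_{i^*}(n)$$

We have

$$\begin{aligned}
\mathrm{Mistake}_j(n) &\Rightarrow \exists i^* \in \hat{A}^*,\ \exists j \in A \setminus \hat{A}^*,\ (\hat{\mu}^r_{i^*} + \beta_{i^*}(n)) - (\hat{\mu}^r_j - \beta_j(n)) < \epsilon \\
&\Rightarrow \exists i^* \in \hat{A}^*,\ \exists j \in A \setminus \hat{A}^*,\ (\hat{\mu}^r_{i^*} + \beta_{i^*}(n)) - (\hat{\mu}^r_j - \beta_j(n)) < \mu^r_j + \mu^r_{i^*} \\
&\Rightarrow \left(\exists i^* \in \hat{A}^*,\ \hat{\mu}^r_{i^*} < \mu^r_{i^*} - \beta_{i^*}(n)\right) \vee \left(\exists j \in A \setminus \hat{A}^*,\ \hat{\mu}^r_j > \beta_j(n) + \mu^r_j\right) \\
&\Rightarrow \exists j \in A,\ \neg\mathrm{WB}_j(n)
\end{aligned} \tag{7}$$

Thus, a mistake occurs if the confidence interval for the arm $j$ is not within the expected bounds. Then,

$$\begin{aligned}
\Pr\{\cup_{n=1}^{T} \mathrm{Mistake}_j(n)\} &\le \Pr\{\cup_{j \in A} \cup_{n=1}^{T} \neg\mathrm{WB}_j(n)\} \\
&\le \sum_{j \in A} \sum_{n=1}^{T} \Pr\{\neg\mathrm{WB}_j(n)\} \\
&\le m \sum_{j \in A} \sum_{n=1}^{T} \sum_{t=1}^{n} \exp(-2t\beta_j(n)^2) \\
&\le \frac{\delta}{2}
\end{aligned} \tag{8}$$

□

Intuitively, when the pLUCB algorithm terminates, the upper confidence bound of the lowest lower confidence vector is lower than the highest upper confidence bound by at most $\epsilon$. Thus, we prove that with high probability the true means of the arms stay within their confidence bounds, such that a mistake cannot occur. Note that $\beta$ fulfils the condition from Lemma 1.2.

Lemma 1.3. For pLUCB, cf. Algorithm 1, we have that

$$\Pr\left\{\exists i \in A \;\middle|\; \left(\ell_i > 4\,\frac{m}{2[\Delta_i \vee \epsilon/2]^2} \ln\frac{a_1 K m n^4}{\delta}\right) \wedge \mathrm{Needy}(n)\right\} \le \frac{3\delta H_{\epsilon/2}}{4 a_1 K m n^4} \tag{9}$$

where $a_1$ is a positive constant.

Proof. Let us denote by

$$u_i(n) = \frac{m}{2[\Delta_i \vee \epsilon/2]^2} \ln\frac{a_1 K m n^4}{\delta}$$

an appropriate number of samples for $i$. Consider an arm $i \in A$. If $\Delta_i \le \epsilon/2$ and the arm is sufficiently sampled, $4 \cdot u_i(n) < \ell_i$, then $\beta_i(n) < \epsilon/4$.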

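Let $\Delta_i \ge \epsilon/2$ and let $i$ be a Pareto optimal arm. Then, by substituting $\beta_i(n) = \sqrt{\frac{1}{2n_i} \ln\frac{a_1 K m n^4}{\delta}}$, we have

$$\begin{aligned}
&\Pr\{(\ell_i > 4u_i(n)) \wedge \mathrm{Needy}(n)\} \\
&\le \Pr\{(\ell_i > 4u_i(n)) \wedge (i \in \mathrm{Middle}(n))\} \\
&\le \Pr\{(\ell_i > 4u_i(n)) \wedge (\exists p,\ \hat{\mu}^p_i - \beta_i(n) < c^p_{i^*})\} \\
&\le \sum_{u=4u_i(n)+1}^{\infty} \sum_{p=1}^{m} \exp\left(-2u \cdot (\mu^p_i - c^p_{i^*} - \beta_i(n))^2\right) \\
&\le \sum_{u=4u_i(n)+1}^{\infty} m \exp\left(-2u \cdot \left(\frac{\Delta_i}{2\sqrt{m}} - \sqrt{\frac{1}{2n_i} \ln\frac{a_1 K m n^4}{\delta}}\right)^2\right) \\
&\le m \sum_{u=4u_i(n)+1}^{\infty} \exp\left(-2\frac{\Delta_i^2}{m} \cdot \left(\sqrt{u} - \sqrt{u_i(n)}\right)^2\right) \\
&\le m \int_{x_1 = 4u_i(n)}^{\infty} \exp\left(-2\frac{\Delta_i^2}{m} \cdot \left(\sqrt{x_1} - \sqrt{u_i(n)}\right)^2\right) dx_1 \\
&= 2m \int_{x_2 = \sqrt{u_i(n)}}^{\infty} x_2 \exp\left(-2\frac{\Delta_i^2}{m} x_2^2\right) dx_2 + 2m\sqrt{u_i(n)} \int_{x_2 = \sqrt{u_i(n)}}^{\infty} \exp\left(-2\frac{\Delta_i^2}{m} x_2^2\right) dx_2 \\
&= \frac{2m^2}{\Delta_i^2} \int_{x_3 = 2\frac{\Delta_i^2}{m} u_i(n)}^{\infty} \exp(-x_3)\, dx_3 + \frac{m^{3/2}\sqrt{2\pi u_i(n)}}{\Delta_i} \int_{x_4 = 2\frac{\Delta_i}{\sqrt{m}}\sqrt{u_i(n)}}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x_4^2}{2}\right) dx_4 \\
&\le \frac{m^2}{\Delta_i^2} \exp\left(-2\frac{\Delta_i^2}{m} u_i(n)\right) + \frac{2m^2}{\Delta_i^2} \exp\left(-2\frac{\Delta_i^2}{m} u_i(n)\right) \\
&= \frac{3m^2}{4\Delta_i^2} \exp\left(-2\frac{\Delta_i^2}{m} \cdot \frac{2m}{\Delta_i^2} \ln\frac{a_1 K m n^4}{\delta}\right) \\
&\le \frac{3m\delta}{4 a_1 \Delta_i^2 K n^4}
\end{aligned}$$

We further have that

$$\Pr\left\{\exists i \in A \;\middle|\; \left(\ell_i > 4\,\frac{m}{2[\Delta_i \vee \epsilon/2]^2} \ln\frac{a_1 K m n^4}{\delta}\right) \wedge \mathrm{Needy}(n)\right\} \le \sum_{i \in A \wedge \Delta_i > \epsilon/2} \frac{m}{\Delta_i^2} \cdot \frac{3\delta}{4 a_1 K n^4} \le \frac{3\delta H_{\epsilon/2}}{4 a_1 K n^4}$$

which concludes our proof. □

Lemma 1.4. $\Pr\{\mathrm{Cross}(n)\} \le \frac{\delta}{a_1 n^3}$

Proof. According to the definition of $\mathrm{Cross}(n)$, we have

$$\Pr\{\mathrm{Cross}(n)\} = \Pr\{(\exists i^* \in \hat{A}^*,\ \forall r,\ \hat{\mu}^r_{i^*} + \beta_{i^*}(n) < c^r_{i^*}) \vee (\exists j \in A \setminus \hat{A}^*,\ \forall r,\ \hat{\mu}^r_j - \beta_j(n) > c^r_j)\} \tag{10}$$

Without loss of generality, let arm $i^*$ be Pareto optimal, $i^* \in \hat{A}^*$. We consider further the inequality $\hat{\mu}^r_{i^*} + \beta_{i^*}(n) < c^r_{i^*}$ for each objective $r$. Then, for any Pareto optimal arm, we have

$$\begin{aligned}
\Pr\{\mathrm{Cross}_{i^*}(n)\} &= \Pr\{(\exists i^* \in \hat{A}^*,\ \forall r,\ \hat{\mu}^r_{i^*} + \beta_{i^*}(n) < c^r_{i^*})\} \\
&\le \sum_{u=1}^{n} \sum_{r=1}^{m} \exp\left(-2u \cdot (\hat{\mu}^r_{i^*} - c^r_{i^*} + \beta_{i^*}(n))^2\right) \\
&\le m \sum_{u=1}^{n} \exp\left(-2u \cdot \beta_{i^*}(n)^2\right) \\
&= m \sum_{u=1}^{n} \frac{\delta}{a_1 K m n^4} \\
&= \frac{\delta}{a_1 K n^3}
\end{aligned} \tag{11}$$

The same inequality applies for any suboptimal arm. We obtain the inequality of the lemma by summing over all arms. □

We now upper bound the probability that pLUCB does not terminate after a certain number of arm pulls.

Lemma 1.5. The probability that pLUCB does not terminate after $T \ge 146\, H_{\epsilon/2} \ln\frac{H_{\epsilon/2}}{\delta}$ rounds is at most $\frac{4\delta}{T^2}$.

Proof. Let $\bar{T} = \lceil T/2 \rceil$, and let $E_1$ and $E_2$ be two events over the interval $\{\bar{T}, \bar{T}+1, \ldots, T-1\}$, where

$$E_1 = \{\exists t \in \{\bar{T}, \bar{T}+1, \ldots, T-1\} \mid \mathrm{Cross}(t)\}$$

$$E_2 = \{\exists t \in \{\bar{T}, \bar{T}+1, \ldots, T-1\} \mid \exists i,\ (\ell_i > 4u_i(t)) \wedge \mathrm{Needy}(t)\}$$

We show that if neither $E_1$ nor $E_2$ occurs, then pLUCB terminates after at most $T$ rounds. Assume that the algorithm does not terminate after $\bar{T}$ rounds, and let $\Delta(t)$ be the number of extra rounds from $\bar{T}$ until it terminates. From the preceding lemmas, and following the same line of reasoning as in the proof of Lemma 5 from [1], we have that $\Delta(t) \le 4 \sum_{i \in A} u_i(T)$. Let $T = 146\, H_{\epsilon/2} \ln\frac{H_{\epsilon/2}}{\delta}$. Then

$$\begin{aligned}
2 + 8 \sum_{i \in A} u_i(T) &= 2 + 8 \sum_{i \in A} \frac{m}{2[\Delta_i \vee \epsilon/2]^2} \ln\frac{a_1 K m T^4}{\delta} \\
&\le 2 + 8K + 4 H_{\epsilon/2} \ln\frac{a_1 K m T^4}{\delta} \\
&\le (10 + 4 \ln a_1) H_{\epsilon/2} + 4 H_{\epsilon/2} \ln\frac{Km}{\delta} + 16 H_{\epsilon/2} \ln(T) \\
&\le (10 + 4 \ln a_1) H_{\epsilon/2} + 4 H_{\epsilon/2} \ln\frac{Km}{\delta} + 32 H_{\epsilon/2} \ln(H_{\epsilon/2}) + 16 H_{\epsilon/2} \ln(146) \\
&\le (66 + 16 \ln(146))\, H_{\epsilon/2} \ln(H_{\epsilon/2}/\delta)
\end{aligned}$$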
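To get a feel for the quantities in Lemma 1.5, the following Python sketch evaluates the per-arm sample count $u_i(n)$, the horizon $146\, H_{\epsilon/2} \ln(H_{\epsilon/2}/\delta)$ and the failure bound $4\delta/T^2$. It is a rough illustration under stated assumptions. The hardness $H_{\epsilon/2}$ is not defined in this supplementary material, so the code assumes $H_{\epsilon/2} = \sum_i m / [\Delta_i \vee \epsilon/2]^2$, which matches how the sums over arms are collapsed in the derivation above. The function names, the gap values and the choice $a_1 = 1.25$ are hypothetical.

```python
import math
from typing import Sequence


def hardness(gaps: Sequence[float], m: int, epsilon: float) -> float:
    """Assumed hardness H_{eps/2} = sum_i m / max(gap_i, eps/2)^2."""
    return sum(m / max(g, epsilon / 2) ** 2 for g in gaps)


def samples_u(gap: float, m: int, epsilon: float, n: int,
              num_arms: int, delta: float, a1: float) -> float:
    """u_i(n) = m / (2 [gap v eps/2]^2) * ln(a1 K m n^4 / delta)."""
    scale = m / (2 * max(gap, epsilon / 2) ** 2)
    return scale * math.log(a1 * num_arms * m * n ** 4 / delta)


def horizon(h: float, delta: float) -> float:
    """Number of rounds T = 146 H ln(H / delta) from Lemma 1.5."""
    return 146 * h * math.log(h / delta)


def non_termination_bound(t: float, delta: float) -> float:
    """Bound 4 delta / T^2 on the probability of not terminating."""
    return 4 * delta / t ** 2


if __name__ == "__main__":
    gaps = [0.5, 0.2, 0.01]   # illustrative gaps Delta_i, one per arm
    m, eps, delta = 2, 0.1, 0.05
    h = hardness(gaps, m, eps)
    t = horizon(h, delta)
    print("H =", h)
    print("T =", t)
    print("bound =", non_termination_bound(t, delta))
    print("u_i =", samples_u(0.5, m, eps, n=10, num_arms=3, delta=delta, a1=1.25))
```

Arms with a gap below $\epsilon/2$ are clipped to $\epsilon/2$, so they dominate the hardness term and therefore the horizon.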