Journal of Machine Learning Research 0 (2011) 0-00

Submitted 10/11; Published : No, Pending Review

Control of Sample Complexity and Regret in Bandits using Fractional Moments

Ananda Narayanan B

[email protected]

Department of Electrical Engineering, Indian Institute of Technology, Madras

Balaraman Ravindran

[email protected]

Department of Computer Science and Engineering, Indian Institute of Technology, Madras

Editor: Review Pending; Submitted to the Journal of Machine Learning Research, Oct 2011

Abstract

One key facet of learning through reinforcement is the dilemma between exploration, to find profitable actions, and exploitation, to act optimally according to the observations already made. We analyze this explore/exploit situation in bandit problems in stateless environments. We propose a family of learning algorithms for bandit problems based on fractional expectations of the rewards acquired. The algorithms can be controlled, through a single parameter, to behave optimally with respect to either sample complexity or regret. The family is theoretically shown to contain algorithms that converge on an ε-optimal arm and achieve O(n) sample complexity, a theoretical minimum. The family is also shown to include algorithms that achieve the optimal logarithmic regret proved by Lai & Robbins (1985), again a theoretical minimum. We also propose a specific algorithm from the family that can achieve this optimal regret. Experimental results support these theoretical findings, and the algorithms perform substantially better than state-of-the-art techniques in regret reduction such as UCB-Revisited and other standard approaches.

1. Introduction

The multi-arm bandit problem captures general aspects of learning in an unknown environment [Berry & Fristedt (1985)]. Decision-theoretic issues such as how to minimize learning time, and when to stop learning and start exploiting the knowledge acquired, are well captured by the multi-arm bandit problem. The problem is of interest in different areas of artificial intelligence such as reinforcement learning [Sutton & Barto (1998)] and evolutionary programming [Holland (1992)]. It also has applications in many fields including industrial engineering, simulation and evolutionary computation. For instance, Kim & Nelson (2001) discuss applications with regard to ranking, selection and multiple-comparison procedures in statistics, and Schmidt et al. (2006) describe applications in evolutionary computation. An n-arm bandit problem is to learn to preferentially select a particular action (or pull a particular arm) from a set of n actions (arms) numbered 1, 2, 3, …, n. Each selection of an action i results in a random reward R_i, following a probability distribution with mean µ_i, usually stationary. Arms are pulled and rewards


acquired until learning converges on the arm with the highest rewards. If µ* = max_i {µ_i}, then arm j is defined to be ε-optimal if µ_j > µ* − ε. An arm whose mean equals µ* is considered the best arm, as it is expected to give the maximum reward per pull. Let Z_t be the total reward acquired over t successive pulls. Define the regret η as the expected loss in total reward compared to choosing the best arm repeatedly from the first pull, that is, η(t) = tµ* − E[Z_t], sometimes referred to as cumulative regret. The traditional objective of the bandit problem is to maximize the total reward given a specified number of pulls, l, to be made, hence minimizing regret. Lai & Robbins (1985) showed that the regret must grow at least logarithmically and provided policies that attain this lower bound for specific probability distributions. Agrawal (1995) provided policies achieving the logarithmic bounds using sample means that are computationally more efficient, and Auer et al. (2002) described policies that achieve the bounds uniformly over time rather than only asymptotically. Meanwhile, Even-Dar et al. (2006) provided another quantification of the objective, measuring quickness in determining the best arm: a Probably Approximately Correct (PAC) framework is used, quantifying the identification of an ε-optimal arm with probability 1 − δ. The objective is then to minimize the sample complexity l, the number of samples required for such an arm identification. Even-Dar et al. (2006) describe the Median Elimination algorithm achieving O(n) sample complexity. This is further extended to finding m arms that are ε-optimal with high probability by Kalyanakrishnan & Stone (2010).

We propose a family of learning algorithms for the bandit problem and show that the family contains not only algorithms attaining O(n) sample complexity but also algorithms that achieve the theoretically lowest regret of O(log(t)). We address the n-arm bandit problem with generic probability distributions, without any restrictions on the means, variances or any higher moments that the distributions may possess. Experiments show the proposed algorithms perform substantially better compared to state-of-the-art algorithms like the Median Elimination algorithm [Even-Dar et al. (2006)] and UCB-Revisited [Auer & Ortner (2010)]. While Even-Dar et al. (2006) and Auer & Ortner (2010) provide algorithms that are not parameterized for tunability, we propose a single-parameter algorithm that is based on fractional moments of the rewards acquired (the i-th moment of a random variable R is defined as E[R^i]; fractional moments occur when the exponent i is fractional, rational or irrational). To the best of our knowledge ours is the first work to use fractional moments in bandit problems (a part of this work appeared recently in UAI 2011 [Ananda & Ravindran (2011)]), though we have since discovered their usage in the literature in other contexts. Min & Chrysostomos (1993) describe applications of such fractional (lower order) moments in signal processing; they have been employed in many areas of signal processing, including image processing [Achim et al. (2005)] and communication systems [Xinyu & Nikias (1996); Liu & Mendel (2001)], mainly with regard to achieving more stable estimators. We theoretically show that the proposed family contains algorithms that attain O(n) sample complexity in finding an ε-optimal arm and algorithms that incur the optimal logarithmic regret, O(log(t)). In addition, we propose a specific algorithm that achieves the optimal regret. Experiments support these theoretical findings, showing the algorithms incurring substantially low regret while learning.
A brief overview of what is presented: Section 2 describes the motivation for the proposed algorithms, followed by a theoretical analysis of optimality and sample complexity in Section 3. We then prove that the proposed algorithms incur the theoretical lower bound for regret shown by Lai & Robbins (1985) in Section 4. Section 5 provides a detailed analysis of the algorithms with regard to their properties, performance and tunability. Finally, experimental results and observations are presented in Section 6, followed by conclusions and scope for future work.

2. Proposed Algorithm & Motivation

Consider a bandit problem with n arms, with action a_i denoting the choice of pulling the i-th arm. An experiment involves a finite number of arm-pulls in succession. In a particular such experiment, let r_{i,k} be a sample of the reward acquired when the i-th arm was pulled for the k-th time, R_i being the associated random variable (with bounded support) for the reward while taking action a_i. Then we have estimated means and variances for the various actions:

$$\hat{\mu}_i = \frac{\sum_k r_{i,k}}{\sum_k 1} \qquad \text{and} \qquad \hat{\sigma}_i^2 = \frac{\sum_k (r_{i,k} - \hat{\mu}_i)^2}{\sum_k 1}$$

When deciding about selecting action a_i over any other action a_j, we are concerned about the rewards we would receive. Though E(R_i) and E(R_j) are indicative of the rewards for the respective actions, the variances E[(R_i − µ_i)²] and E[(R_j − µ_j)²] give more information in the beginning stages of learning, when confidence in the estimates of the expectations is low. In other words, we would not have explored enough for the estimated expectations to reflect the true means. Though mean and variance together provide full knowledge of the stochasticity for some distributions, for instance the Gaussian, we want to handle generic distributions and hence need to consider additional higher moments. It is the distribution, after all, that gives rise to all the moments, completely specifying the random variable. Thus we look at a generic probability distribution to model our exploration policy. Consider selection of action a_i over all other actions. The probability that action a_i yields a higher reward than every other action is

$$A_i = P\Big(\bigcap_{j \neq i} \{R_i > R_j\}\Big) = \prod_{j \neq i} P(R_i > R_j) \qquad (1)$$

where the independence of the R_j's is used. Thus, we can perform action selection based on the quantities A_i provided we know the probability distributions, for which we propose the following discrete approximation. After selecting actions a_i, a_j for n_i, n_j times respectively, we can, in general, compute the probability estimate

$$\hat{P}(R_i > R_j) = \sum_{k \in N_i} \hat{P}(R_i = r_{i,k}) \Big( \sum_{l \in L^j_{i,k}} \hat{P}(R_j = r_{j,l}) \Big) \qquad (2)$$

where the sets N_i and L^j_{i,k} are given by

$$N_i = \{k : 1 \le k \le n_i \ \text{and} \ r_{i,k} \ \text{are unique}\}, \qquad L^j_{i,k} = \{l : r_{j,l} < r_{i,k} \ \text{and} \ r_{j,l} \ \text{are unique}\}$$

with random estimates P̂(R_i = r_{i,k}) of the true probability P(R_i = r_{i,k}) calculated by

$$\hat{P}(R_i = r_{i,k}) = \frac{|\{l : r_{i,l} = r_{i,k}\}|}{n_i} \qquad (3)$$
Thus, with (1), (2) and (3) we can use the A_i's as preferences for choosing the actions a_i. Note that thus far we are only taking the probability into account, ignoring the magnitude of the rewards. That is, if we have two instances of rewards, r_{j,l} and r_{m,n}, with P(R_j = r_{j,l}) = P(R_m = r_{m,n}), then they contribute equally to the probabilities P(R_i > R_j) and P(R_i > R_m), though one of the rewards could be much larger than the other. For fair action selection, let us then formulate the preference function for action a_i over action a_j as

$$A_{ij} = \sum_{k \in N_i} \hat{P}(R_i = r_{i,k}) \Big( \sum_{l \in L^j_{i,k}} (r_{i,k} - r_{j,l})^{\beta} \, \hat{P}(R_j = r_{j,l}) \Big)$$

where β determines how far we want to distinguish the preference functions with regard to the magnitude of the rewards. This way, for instance, arms with higher variance are given more preference. For β = 1, A_{ij} would be proportional to E[R_i − R_j | R_i > R_j] when the estimates P̂ approach the true probabilities. Hence, our preference function for choosing a_i over all other actions becomes

$$A_i = \prod_{j \neq i} A_{ij}$$

The proposed class of algorithms (Algorithm 1) is based on conditional fractional expectations of the rewards acquired. A specific instance that is greedy after the exploratory initiation phase (Algorithm 2), henceforth referred to as Fractional Moments on Bandits (FMB), picks the arm i that has the highest A_i. All analysis presented in this work is with reference to the algorithm FMB. A probabilistic variant, picking arm i with probability A_i (normalized), is also proposed but not analyzed; we call this probabilistic action selection on the quantities A_i, or simply pFMB. The family of algorithms is based on the set of rewards previously acquired but can be implemented incrementally (results discussed in Section 6 use an incremental implementation). Also, the algorithms are invariant to the type of reward distribution. Besides incremental implementation, the algorithms and the associated computations simplify greatly when the underlying reward is known to follow a discrete probability distribution. The family is shown to contain algorithms that achieve a sample complexity of O(n) in Section 3. Algorithm 2 is also shown to incur the asymptotic theoretical lower bound for regret in Section 4. Further, the algorithm is analyzed in detail in Section 5.

Algorithm 1: Class of policy learning algorithms based on fractional moments of rewards acquired

Initialization: Choose each arm l times.
Define: r_{i,k}, the reward acquired for the k-th selection of arm i; the sets N_i = {k : r_{i,k} are unique} and L^j_{i,k} = {t : r_{j,t} < r_{i,k} and r_{j,t} are unique}; and m_i, the number of selections made for arm i.
Loop:
1. p̂_{ik} = P̂(R_i = r_{i,k}) = |{t : r_{i,t} = r_{i,k}}| / m_i  for 1 ≤ i ≤ n, 1 ≤ k ≤ m_i
2. A_{ij} = Σ_{k∈N_i} p̂_{ik} { Σ_{t∈L^j_{i,k}} (r_{i,k} − r_{j,t})^β p̂_{jt} }  for 1 ≤ i, j ≤ n, i ≠ j
3. A_i = Π_{j≠i} A_{ij}  for all 1 ≤ i ≤ n
4. Perform action selection based on the quantities A_i

Algorithm 2: Fractional Moments on Bandits (FMB)

Initialization: Choose each arm l times.
Define: as in Algorithm 1.
Loop:
1. p̂_{ik} = P̂(R_i = r_{i,k}) = |{t : r_{i,t} = r_{i,k}}| / m_i  for 1 ≤ i ≤ n, 1 ≤ k ≤ m_i
2. A_{ij} = Σ_{k∈N_i} p̂_{ik} { Σ_{t∈L^j_{i,k}} (r_{i,k} − r_{j,t})^β p̂_{jt} }  for 1 ≤ i, j ≤ n, i ≠ j
3. A_i = Π_{j≠i} A_{ij}  for all 1 ≤ i ≤ n
4. Pull the arm i that has the highest A_i
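To make the procedure concrete, here is a minimal Python sketch of Algorithm 2 (FMB) written directly from the definitions above. It is not the authors' implementation: it recomputes the preferences from scratch at every play rather than incrementally, and the function names, the default β = 0.85 and the `pull` callback are illustrative choices.

```python
import numpy as np
from collections import Counter

def preferences(rewards, beta):
    """Compute A_i for every arm from the per-arm reward histories.

    rewards: list of lists; rewards[i] holds all rewards observed for arm i.
    beta: fractional exponent applied to reward differences.
    """
    # Empirical pmf of each arm's observed rewards: P_hat(R_i = r), eq. (3).
    pmfs = []
    for hist in rewards:
        counts = Counter(hist)
        pmfs.append({r: c / len(hist) for r, c in counts.items()})

    n = len(rewards)
    A = np.ones(n)
    for i in range(n):
        for j in range(n):
            if j == i:
                continue
            # A_ij = sum_k p_ik * sum_{r_j < r_ik} (r_ik - r_j)^beta * p_jt
            a_ij = sum(p_i * sum((r_i - r_j) ** beta * p_j
                                 for r_j, p_j in pmfs[j].items() if r_j < r_i)
                       for r_i, p_i in pmfs[i].items())
            A[i] *= a_ij          # A_i = prod_{j != i} A_ij
    return A

def fmb(pull, n_arms, horizon, beta=0.85, init_pulls=1):
    """Greedy FMB: pull each arm init_pulls times, then always argmax A_i."""
    rewards = [[pull(i) for _ in range(init_pulls)] for i in range(n_arms)]
    for _ in range(horizon - n_arms * init_pulls):
        arm = int(np.argmax(preferences(rewards, beta)))
        rewards[arm].append(pull(arm))
    return rewards
```

A usage example with Gaussian arms would be `fmb(lambda i: np.random.normal(mu[i], 1.0), n_arms=10, horizon=1000)`. The probabilistic variant pFMB would instead normalize the A_i into a distribution and sample the arm from it.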

3. Sample Complexity

In this section, we present a theoretical analysis of the sample complexity of the proposed algorithms. We show that, given ε and δ, a sample complexity of order O(n) is attainable for the proposed algorithm FMB. Consider a set of n bandit arms with R_i representing the random reward on pulling arm i. We assume that the reward is binary, R_i ∈ {0, r_i} for all i. Following usual practice, we present the proof for the binary case only; the results extend to the case of general rewards with little additional work. Also, denote p_i = P{R_i = r_i} and µ_i = E[R_i] = r_i p_i, and let µ_1 > µ_2 > ⋯ > µ_n for simplicity. Define

$$I_{ij} = \begin{cases} 1 & \text{if } r_i > r_j \\ 0 & \text{otherwise} \end{cases} \qquad \delta_{ij} = \begin{cases} \dfrac{r_j}{r_i} & \text{if } r_i > r_j \\ 1 & \text{otherwise} \end{cases}$$

We then have

$$A_{ij} = I_{ij}\,\hat{p}_i \hat{p}_j (r_i - r_j)^{\beta} + \hat{p}_i (1 - \hat{p}_j)(r_i - 0)^{\beta} = \hat{p}_i \hat{p}_j (r_i - \delta_{ij} r_i)^{\beta} + \hat{p}_i (1 - \hat{p}_j) r_i^{\beta} = \hat{p}_i r_i^{\beta} \big(1 + \hat{p}_j [(1 - \delta_{ij})^{\beta} - 1]\big)$$

The action selection function becomes

$$A_i = (\hat{p}_i r_i^{\beta})^{n-1} \prod_{j \neq i} \big(1 + \hat{p}_j [(1 - \delta_{ij})^{\beta} - 1]\big)$$

To prove PAC-correctness of the proposed algorithm, we need to bound, for all i, the probability of selecting a non-ε-optimal arm,

$$P(A_i > A_j : i \neq 1,\ \mu_i < \mu_1 - \epsilon,\ j \in \{1, 2, 3, \ldots, n\})$$

This is in turn bounded as

$$P(A_i > A_j : i \neq 1,\ \mu_i < \mu_1 - \epsilon,\ j \in \{1, \ldots, n\}) \ \le\ P(A_i > A_1 : i \neq 1,\ \mu_i < \mu_1 - \epsilon)$$

since arm 1 is optimal. Let i ≠ 1 be an arm which is not ε-optimal, µ_i < µ_1 − ε. The policy would choose arm i instead of the best arm 1 if A_i > A_1. We now discuss how Chernoff bounds, and their extension to dependent random variables, help in the analysis of the proposed algorithm.
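As a quick sanity check on this closed form (an illustration, not part of the paper's analysis), one can compare it numerically against the generic preference computation for binary rewards; the `preferences` helper is the illustrative sketch given after Algorithm 2 and is assumed to be in scope.

```python
import numpy as np

def a_i_closed_form(i, p_hat, r, beta):
    """A_i = (p_i r_i^beta)^(n-1) * prod_{j!=i} (1 + p_j[(1 - delta_ij)^beta - 1])."""
    n = len(r)
    prod = 1.0
    for j in range(n):
        if j == i:
            continue
        delta_ij = r[j] / r[i] if r[i] > r[j] else 1.0
        prod *= 1.0 + p_hat[j] * ((1.0 - delta_ij) ** beta - 1.0)
    return (p_hat[i] * r[i] ** beta) ** (n - 1) * prod

# Three binary-reward arms, 20 pulls each; ones[i] pulls paid r[i], the rest 0.
r = np.array([1.0, 1.5, 2.0])
ones = [12, 10, 6]
hists = [[ri] * k + [0.0] * (20 - k) for ri, k in zip(r, ones)]
p_hat = np.array(ones) / 20.0
beta = 0.85

generic = preferences(hists, beta)   # the sketch given after Algorithm 2
closed = [a_i_closed_form(i, p_hat, r, beta) for i in range(len(r))]
print(np.allclose(generic, closed))  # expect True
```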

3.1 Chernoff–Hoeffding Bounds [Hoeffding (1963)]

Let X_1, X_2, …, X_n be random variables with common range [0, 1] such that E[X_t | X_1, …, X_{t−1}] = µ for 1 ≤ t ≤ n. Let S_n = (X_1 + ⋯ + X_n)/n. Then for all a ≥ 0 we have

$$P\{S_n \ge \mu + a\} \le e^{-2a^2 n}, \qquad P\{S_n \le \mu - a\} \le e^{-2a^2 n}$$

3.2 Modified Chernoff–Hoeffding Bounds

Chernoff–Hoeffding bounds for a set of dependent random variables are discussed here for the analysis of the proposed algorithm.

Chromatic and fractional chromatic numbers of graphs. A proper vertex coloring of a graph G(V, E) is an assignment of colors to vertices such that no two vertices connected by an edge have the same color. The chromatic number χ(G) is the minimum number of distinct colors required for a proper coloring of the graph. A k-fold proper vertex coloring is an assignment of k distinct colors to each vertex such that no two vertices connected by an edge share any of their assigned colors, and the k-fold chromatic number χ_k(G) is the minimum number of distinct colors required for a k-fold proper coloring. The fractional chromatic number is then defined as

$$\chi'(G) = \lim_{k \to \infty} \frac{\chi_k(G)}{k}$$

Clearly, χ'(G) ≤ χ(G).

Chernoff–Hoeffding bounds with dependence [Dubhashi & Panconesi (2009)]. Let X_1, X_2, …, X_n be random variables, some of which are independent, with common range [0, 1] and mean µ. Define S_n = (X_1 + ⋯ + X_n)/n and a graph G(V, E) with V = {1, 2, …, n}, where edges connect vertices i and j, i ≠ j, if and only if X_i is dependent on X_j. Let χ'(G) denote the fractional chromatic number of G. Then for all a ≥ 0 we have

$$P\{S_n \ge \mu + a\} \le e^{-2a^2 n / \chi'(G)}, \qquad P\{S_n \le \mu - a\} \le e^{-2a^2 n / \chi'(G)}$$

Bounds on the proposed formulation. Consider n arms where the k-th arm is sampled m_k times. Let X_u^{(k)} be a random variable related to the reward acquired on the u-th pull of the k-th arm. Then, for {m_k : 1 ≤ k ≤ n}, X_1^{(k)}, X_2^{(k)}, …, X_{m_k}^{(k)} are random variables (with common range within [0, 1], say) such that

$$E[X_t^{(k)} \mid X_1^{(k)}, \ldots, X_{t-1}^{(k)}] = \mu^{(k)} \qquad \forall k, t : 1 \le t \le m_k$$

and X_u^{(k)} is independent of X_v^{(j)} for j ≠ k. Define the random variables S_{m_k}^{(k)} = (X_1^{(k)} + ⋯ + X_{m_k}^{(k)})/m_k, T = Π_k S_{m_k}^{(k)} and U = Π_k X_{i_k}^{(k)} where 1 ≤ i_k ≤ m_k. Let U_1, U_2, …, U_q be the realizations of U for the different permutations of the i_k in their respective ranges, where q = Π_k m_k. Then we have

$$T = \frac{\sum_{l=1}^{q} U_l}{q}$$

Also, µ_T = E[T] = Π_k µ^{(k)}. Construct a graph G(V, E) on V = {U_i : 1 ≤ i ≤ q} with edges connecting every pair of dependent vertices. Let χ(G) and χ'(G) denote the chromatic and fractional chromatic numbers of G respectively. Applying the Chernoff–Hoeffding bounds to T, we have for all a ≥ 0

$$P\{T \ge \mu_T + a\} \le e^{-2a^2 q / \chi'(G)} \le e^{-2a^2 q / \chi(G)}$$

$$P\{T \le \mu_T - a\} \le e^{-2a^2 q / \chi'(G)} \le e^{-2a^2 q / \chi(G)}$$
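The following small simulation (an illustration, not from the paper) estimates the upper-tail probability of T = Π_k S_{m_k}^{(k)} for Bernoulli pulls and compares it with the exponential bound; the χ(G) term is replaced here by the bound 1 + √(n(q − Π_k(m_k − 1))) derived in Section 3.3, so the reported value is the looser bound actually used in the analysis. The arm count, means and a are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, l = 3, 4                      # three arms, each sampled l times
m = [l] * n
q = int(np.prod(m))
mu = [0.6, 0.5, 0.4]             # Bernoulli means mu^(k)
mu_T = float(np.prod(mu))
a = 0.2

# Monte Carlo estimate of P(T >= mu_T + a), with T = prod_k S_mk^(k).
trials = 200_000
samples = [rng.binomial(m[k], mu[k], size=trials) / m[k] for k in range(n)]
T = np.prod(samples, axis=0)
p_emp = np.mean(T >= mu_T + a)

# chi(G) <= 1 + sqrt(n (q - prod_k (m_k - 1))), as derived in Section 3.3.
chi_bound = 1 + np.sqrt(n * (q - np.prod([mk - 1 for mk in m])))
p_bound = np.exp(-2 * a**2 * q / chi_bound)
print(f"empirical {p_emp:.4f} <= bound {p_bound:.4f}")
```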

3.3 Bounds on Sample Complexity

We need to bound the probability of the event A_i > A_1, where A_i = (p̂_i r_i^β)^{n−1} Π_{j≠i} {1 + p̂_j [(1 − δ_{ij})^β − 1]} and i is any non-ε-optimal arm. Define B_{ij} = 1 + p̂_j [(1 − δ_{ij})^β − 1] and C_i = (p̂_i r_i^β)^{n−1}. The event of concern then is

$$C_i \prod_{j \neq i} B_{ij} > C_1 \prod_{j \neq 1} B_{1j}$$

The C_i and B_{ij}, for 1 ≤ j ≠ i ≤ n, are estimates of different random variables that are independent of one another. Let arm k be chosen m_k times, so that C_i and B_{ij} have m_i and m_j realizations respectively. Assuming arms i and 1 are chosen an equal number of times, we define the random variable

$$T_i = C_i \prod_{j \neq i} B_{ij} - C_1 \prod_{j \neq 1} B_{1j}$$

which has q = Π_j m_j realizations and true mean µ_{T_i} = E[T_i] = E[A_i] − E[A_1]. Further, as seen in Subsection 3.2, T_i can be decomposed as

$$T_i = \frac{\sum_{l=1}^{q} U_l}{q}$$

with each U_l being a product of n independent random variables X_{i_k}^{(k)}, 1 ≤ k ≤ n, for different permutations of the i_k. The event to be bounded in probability is then T_i ≥ 0. Using the modified Chernoff–Hoeffding bounds,

$$P(T_i \ge 0) = P(T_i \ge \mu_{T_i} - \mu_{T_i}) \le e^{-2\mu_{T_i}^2 q / \chi(G)} \qquad (4)$$

where the graph G(V, E) is constructed on V = {U_i : 1 ≤ i ≤ q} with edges connecting every pair of dependent vertices. While |V| = q, we see that each vertex is connected to q − Π_k (m_k − 1) other vertices. The total number of edges in the graph then becomes |E| = ½ n (q − Π_k (m_k − 1)). Consider the number of ways a combination of two colors can be picked, C(χ(G), 2) = ½ χ(G)(χ(G) − 1). For an optimal coloring, there must exist at least one edge connecting any such pair of colors. So

$$\tfrac{1}{2} \chi(G)(\chi(G) - 1) \le |E| \ \implies\ \chi(G)(\chi(G) - 1) \le n \Big( q - \prod_k (m_k - 1) \Big)$$

Since (χ(G) − 1)² ≤ χ(G)(χ(G) − 1), we have

$$(\chi(G) - 1)^2 \le n \Big( q - \prod_k (m_k - 1) \Big) \ \implies\ \chi(G) \le 1 + \sqrt{n \Big( q - \prod_k (m_k - 1) \Big)}$$

Hence from (4),

$$P(T_i \ge 0) \le \exp\!\left( \frac{-2\mu_{T_i}^2 q}{1 + \sqrt{n \big( q - \prod_k (m_k - 1) \big)}} \right) \qquad (5)$$

In the simple case when each arm is sampled l times, m_k = l for all k and

$$P(T_i \ge 0) \le \exp\!\left( \frac{-2\mu_{T_i}^2 l^n}{1 + \sqrt{n (l^n - (l-1)^n)}} \right) \le \exp\!\left( \frac{-2\mu_{T_i}^2 l^n}{2\sqrt{n l^n}} \right) \le \exp\!\left( \frac{-\mu_{T_i}^2 l^{n/2}}{\sqrt{n}} \right)$$

To obtain the sample complexity, sample each arm so that

$$\exp\!\left( \frac{-\mu_T^2 l^{n/2}}{\sqrt{n}} \right) = \frac{\delta}{n} \ \implies\ l = \left( \frac{n}{\mu_T^4} \ln^2 \frac{n}{\delta} \right)^{1/n} \qquad (6)$$

where

$$\mu_T = \min_{i : \mu_i < \mu_1 - \epsilon} |\mu_{T_i}| = \min_{i : \mu_i < \mu_1 - \epsilon} |E[A_i] - E[A_1]| \qquad (7)$$

The total number of samples in the initiation phase is therefore nl = n (n ln²(n/δ) / µ_T⁴)^{1/n}, which, treating µ_T and δ as problem-dependent constants, is of the order O(n (n ln² n)^{1/n}). Consider the function g(n) = (n ln² n)^{1/n}; g is decreasing for all n > n_0 ∈ ℕ, with the limit at infinity governed by

$$\lim_{n \to \infty} \ln(g(n)) = \lim_{n \to \infty} \frac{\ln n + 2 \ln \ln n}{n} = \lim_{n \to \infty} \frac{\frac{\partial}{\partial n}(\ln n + 2 \ln \ln n)}{\frac{\partial}{\partial n} n} = \lim_{n \to \infty} \left( \frac{1}{n} + \frac{2}{n \ln n} \right) = 0 \ \implies\ \lim_{n \to \infty} g(n) = 1$$

With some numerical analysis, we see that g′(n) < 0 for all n > 5 and that g attains its maximum at n = 5, with g(5) = 1.669; a plot of g versus n is shown in Figure 1. So g(n) < 1.67 for all n ∈ ℕ. Thus we have O(n (n ln² n)^{1/n}) = O(ρn) where ρ = 1.67, and the sample complexity of the proposed algorithms is essentially O(n).
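The constant ρ = 1.67 and the per-arm count l from (6) are easy to check numerically; the snippet below (illustrative, with an arbitrary choice of µ_T and δ) does both.

```python
import numpy as np

def g(n):
    """g(n) = (n ln^2 n)^(1/n); the claim is g(n) < 1.67 for all n."""
    return (n * np.log(n) ** 2) ** (1.0 / n)

ns = np.arange(2, 10_000)
print(max(g(n) for n in ns))            # ~1.669, attained at n = 5

def init_pulls(n, mu_T, delta):
    """l from eq. (6): l = (n ln^2(n/delta) / mu_T^4)^(1/n)."""
    return (n * np.log(n / delta) ** 2 / mu_T ** 4) ** (1.0 / n)

# Example: 10 arms, delta = 0.05, a gap-dependent mu_T of 0.2 (arbitrary here).
print(init_pulls(10, mu_T=0.2, delta=0.05))
```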

4. Regret Analysis

In this section, we show that FMB (Algorithm 2) incurs a regret of O(log(t)), the theoretical lower bound shown by Lai & Robbins (1985). Define the random variable F_i(t) to be the number of times arm i was chosen in t plays. The expected regret η is then given by

$$\eta(t) = (t - E[F_1(t)])\,\mu_1 - \sum_{i \neq 1} \mu_i E[F_i(t)]$$

which is

$$\eta(t) = \mu_1 \sum_{i \neq 1} E[F_i(t)] - \sum_{i \neq 1} \mu_i E[F_i(t)] = \sum_{i \neq 1} (\mu_1 - \mu_i)\, E[F_i(t)] \qquad (8)$$
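Equation (8) is easy to verify empirically: the regret measured as tµ* minus the collected reward matches, in expectation, the gap-weighted pull counts. The sketch below (an illustration, not from the paper) does this for the hypothetical `fmb` implementation given after Algorithm 2; Bernoulli arms, horizon and run counts are arbitrary and chosen to keep the naive implementation fast.

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.9, 0.7, 0.5, 0.2])      # Bernoulli arms, mu_i = p_i
t = 500

def run_once():
    hists = fmb(lambda i: float(rng.random() < p[i]), n_arms=len(p), horizon=t)
    pulls = np.array([len(h) for h in hists])
    return t * p.max() - sum(map(sum, hists)), pulls

runs = [run_once() for _ in range(200)]
emp_regret = np.mean([reg for reg, _ in runs])
mean_pulls = np.mean([pl for _, pl in runs], axis=0)
gap_form = float(np.sum((p.max() - p) * mean_pulls))   # right-hand side of (8)
print(round(emp_regret, 1), round(gap_form, 1))        # approximately equal
```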

Thus, to bound the expected regret it suffices to bound E[F_i(t)] for all i ≠ 1. Let us denote the A_i computed for the τ-th play by A_i(τ), and the corresponding m_k and T_i by m_k(τ) and T_i(τ). Then the event {arm i selected at t} is {A_i(t) > A_j(t) ∀j}, with j ≠ i. Consider

$$E[F_i(t)] = E_y[E[F_i(t) \mid F_i(t-1) = y]] = E\big[(y+1)\, P\{A_i(t) > A_j(t)\ \forall j \mid F_i(t-1) = y\} + y\,(1 - P\{A_i(t) > A_j(t)\ \forall j \mid F_i(t-1) = y\})\big] \le E\big[(y+1)\, P\{A_i(t) > A_j(t)\ \forall j \mid F_i(t-1) = y\} + y\big] \qquad (9)$$

Consider

$$P\{\text{arm } i \text{ selected at } t \mid F_i(t-1) = y\} = P\{A_i(t) > A_j(t)\ \forall j \mid F_i(t-1) = y\} \le P\{A_i(t) > A_1(t) \mid F_i(t-1) = y\} \qquad (10)$$

Observe that P{A_i(t) > A_1(t) | F_i(t−1) = y} is nothing but P(T_i(t) ≥ 0) for the specific case when the arms pulled adhere to the set of n-tuples S_{m(t)} = {(m_1(t), m_2(t), …, m_n(t)) | m_i(t) = y, Σ_k m_k(t) = t − 1}. In other words, P(T_i(t) ≥ 0) is the probability that an arm i ≠ 1 that is non-ε-optimal will be selected at play t given that the first t − 1 plays had the k-th arm chosen m_k(t) times with t − 1 = Σ_k m_k(t); P{A_i(t) > A_1(t) | F_i(t−1) = y} is that probability with the additional constraint m_i(t) = y. Choose ε such that there exists only one ε-optimal arm; then the summation in (8) runs over only non-ε-optimal arms and we can use the bounds on P(T_i(t) ≥ 0) for P{arm i selected at t | F_i(t−1) = y} for all such i. This is possible using the bound

$$P\{A_i(t) > A_1(t) \mid F_i(t-1) = y\} \le \max_{S_{m(t)}} \{P(T_i(t) \ge 0)\} \qquad (11)$$
Now, P(T_i(t) ≥ 0) is maximum when Π_k m_k(t) is minimum. This is observed from (5) as

$$P(T_i(t) \ge 0) \le \exp\!\left( \frac{-2\mu_{T_i}^2 q}{1 + \sqrt{n \big( q - \prod_k (m_k(t) - 1) \big)}} \right) \le \exp\!\left( \frac{-2\mu_{T_i}^2 q}{1 + \sqrt{n q}} \right) \le \exp\!\left( \frac{-\mu_{T_i}^2 \sqrt{q}}{\sqrt{n}} \right) = \exp\!\left( \frac{-\mu_{T_i}^2 \sqrt{\prod_k m_k(t)}}{\sqrt{n}} \right) \qquad (12)$$

The minimum that Π_k m_k(t) can take for a given Σ_k m_k(t) = t − 1 is attained when the m_i(t) are most unevenly distributed. When each arm is chosen l times in the initiation phase of the algorithm, we would have

$$\prod_k m_k(t) \ \ge\ y\,\big(t - 1 - l(n-2) - y\big)\, l^{n-2} \qquad (13)$$

which occurs when all the arm choices (other than the m_i(t) = y pulls of arm i and the initiation phase) correspond to a single particular arm other than i, say j; that is,

$$m_k(t) = \begin{cases} y & \text{if } k = i \\ t - 1 - l(n-2) - y & \text{if } k = j \\ l & \text{otherwise} \end{cases}$$

From (10), (11), (12) and (13), we have

$$P\{\text{arm } i \text{ selected at } t \mid F_i(t-1) = y\} \le P\{A_i(t) > A_1(t) \mid F_i(t-1) = y\} \le P\big(T_i(t) \ge 0 \mid m_k(t) \text{ as above}\big) \le \exp\!\left( \frac{-\mu_{T_i}^2 \sqrt{y (t - y - 1 - l(n-2))\, l^{n-2}}}{\sqrt{n}} \right)$$
and (9) then turns into

$$E[F_i(t)] \le E\!\left[ (y+1) \exp\!\left( \frac{-\mu_{T_i}^2 \sqrt{y (t - y - 1 - l(n-2))\, l^{n-2}}}{\sqrt{n}} \right) + y \right]$$

Denoting µ²_{T_i}/√n as a positive constant λ_i and using E[y] = E[F_i(t − 1)], we have the relation

$$E[F_i(t)] \le E[F_i(t-1)] + E\!\left[ (y+1) \exp\!\left( -\lambda_i \sqrt{y (t - y - 1 - l(n-2))\, l^{n-2}} \right) \right]$$

Unrolling the above recurrence gives

$$E[F_i(t)] \le E[F_i(nl)] + \sum_{w=nl+1}^{t} E\!\left[ (y_w+1) \exp\!\left( -\lambda_i \sqrt{y_w (w - y_w - 1 - l(n-2))\, l^{n-2}} \right) \right] \le l + \sum_{w=nl+1}^{t} E\!\left[ (y_w+1) \exp\!\left( -\lambda_i \sqrt{y_w (w - y_w - 1 - l(n-2))\, l^{n-2}} \right) \right]$$
where y_w corresponds to the number of times the arm under consideration was chosen in w − 1 plays. Defining

$$G_i(w) = E\!\left[ (y_w+1) \exp\!\left( -\lambda_i \sqrt{y_w (w - y_w - 1 - l(n-2))\, l^{n-2}} \right) \right]$$

we see the regret bounded as

$$\eta(t) \le \sum_{i \neq 1} (\mu_1 - \mu_i) \left( l + \sum_{w=nl+1}^{t} G_i(w) \right) \qquad (14)$$
Now,

$$G_i(w) = \sum_{d=l}^{w-1-l(n-1)} \Big[ (d+1) \exp\!\left( -\lambda_i \sqrt{d (w - d - 1 - l(n-2))\, l^{n-2}} \right) \cdot P\{\|\tau : A_i(\tau) > A_j(\tau)\ \forall j,\ nl+1 \le \tau \le w-1\| = d - l\} \Big] \qquad (15)$$
where the probability is the one corresponding to arm i being selected d − l times between the (nl+1)-th and (w−1)-th plays, inclusive of the ends, with ‖S‖ denoting the cardinality of the set S. Denoting the plays in which arm i was to be selected by the indicator variable I(τ), there are $\binom{w-nl-1}{d-l}$ possible sets $S_I = \{I(\tau) : nl+1 \le \tau \le w-1,\ \sum_\tau I(\tau) = d-l\}$. Defining

$$P_{\max} = \max_{I \in S_I} P\{\|\tau : A_i(\tau) > A_j(\tau)\ \forall j,\ nl+1 \le \tau \le w-1\| = d-l \mid I(\tau)\}$$

and

$$I_{\max} = \arg\max_{I \in S_I} P\{\|\tau : A_i(\tau) > A_j(\tau)\ \forall j,\ nl+1 \le \tau \le w-1\| = d-l \mid I(\tau)\}$$
we see that P{‖τ : A_i(τ) > A_j(τ) ∀j, nl+1 ≤ τ ≤ w−1‖ = d − l} is bounded by

$$P\{\|\tau : A_i(\tau) > A_j(\tau)\ \forall j,\ nl+1 \le \tau \le w-1\| = d-l\} \le \binom{w-nl-1}{d-l}\, P_{\max} \qquad (16)$$

with

$$P_{\max} = \prod_{\tau : I_{\max}(\tau)=1} P\{A_i(\tau) > A_j(\tau)\ \forall j\} \qquad (17)$$
where P{A_i(τ) > A_j(τ) ∀j} = P(T_i(τ) ≥ 0) can be bounded using (12) as follows. At I_max, the term

$$\prod_{\tau : I_{\max}(\tau)=1} P\{A_i(\tau) > A_j(\tau)\ \forall j\} \ \le\ \prod_{\tau : I_{\max}(\tau)=1} P\{A_i(\tau) > A_1(\tau)\} \ =\ \exp\!\left( -\lambda_i \sum_{\tau : I_{\max}(\tau)=1} \sqrt{\prod_k m_k(\tau)} \right)$$

will be at its maximum, and hence Π_k m_k(τ) will be at its minimum, for all τ. This occurs when I_max(τ) = 1 for all τ : nl+1 ≤ τ ≤ nl+(d−l), that is, when arm i is chosen repeatedly in the first d − l turns after the initiation phase. Then Π_k m_k(τ) during these turns is given by

$$\prod_k m_k(\tau) = (l + \tau - nl)\, l^{n-1} \qquad \forall \tau : nl+1 \le \tau \le nl+(d-l)$$
So,

$$\prod_{\tau : I_{\max}(\tau)=1} P\{A_i(\tau) > A_j(\tau)\ \forall j\} \ \le\ \exp\!\left( -\lambda_i \sum_{\tau=nl+1}^{nl+(d-l)} \sqrt{(l + \tau - nl)\, l^{n-1}} \right) \ \le\ \exp\!\left( -\lambda_i\, l^{\frac{n-1}{2}} \sum_{k=l+1}^{d} \sqrt{k} \right) \qquad (18)$$

which can be refined further using

$$\sum_{h=a}^{b} \sqrt{h} \ \ge\ \int_{a-1}^{b} \sqrt{\upsilon}\, d\upsilon \ =\ \frac{2}{3}\big(b^{3/2} - (a-1)^{3/2}\big) \qquad (19)$$
  p w − nl − 1 2 n−1 (d+1) exp(−λi ( d(w − d − 1 − l(n − 2))ln−2 + l 2 (d3/2 −l3/2 ))) d−l 3 d=l (20) Back to the original problem, if we show that E[Fi (t)] grows slower than log(t) asymptotically, for all i, then it is sufficient to prove the regret is O(log(t)). For this, it is sufficient to show that Gi (t), the derivative of E[Fi (t)] at t, is bounded by 1t asymptotically, which means to say the regret grows slower than log(t). Consider Γd = (d + p  n−1 exp(−λ ( 1) w−nl−1 d(w − d − 1 − l(n − 2))ln−2 + 23 l 2 (d3/2 −l3/2 ))). ∃d∗ : Γd∗ ≥ Γd ∀d 6= i d−l d∗ , and so we bound Gi (w) ≤ (w − ln)Γd∗ X

Gi (w) ≤

Now Lt

t→∞

Gi (t) 1 t

= ≤

Lt tGi (t)

t→∞

Lt

t→∞ exp(λ

t−nl−1 d∗ −l n−1 2))ln−2 + 32 l 2 (d∗3/2

t(t − ln)(d∗ + 1) i(

p d∗ (t − d∗ − 1 − l(n −



− l3/2 )))

Since each term in the numerator is bounded by a polynomial while the exponential in the denominator is not, Gi (t) Lt 1 = 0 t→∞

t

Since growth of E[Fi (t)] is bounded by O(log(t)) for all i, we see the proposed algorithm FMB has an optimal regret as characterized by Lai & Robbins (1985). 14

5. Analysis of the Algorithm FMB In this Section, we analyze the proposed algorithm FMB, with regard to its properties, performance and tunability. 5.1 Simple Vs. Cumulative Regrets While cumulative regret, referred to regret in general, is defined in the expected sense, η(t) = tµ∗ − E[Zt ] where Zt is the total reward acquired for t pulls and µ∗ = maxi µi , there exists another quantification for regret that is related to the sample complexity in some sense. Bubeck et al. (2009) discuss links between cumulative regrets and this new quantification, called simple regret, defined as φ(t) = µ∗ − µψt P where µψt = i µi ψi,t with ψi,t , the probability of choosing arm i as ouput by the policy learning algorithm for the t + 1th pull. Essentially, ψt is the policy learnt by the algorithm after t pulls. Note that φ(t) after an , δ-PAC guaranteeing exploration is essentially related to , in the sense that φ(t) < (1 − δ) + δ(µ∗ − min µi ) i

Bubeck et al. (2009) find dependencies between φ(t) and η(t) and state that the smaller φ(t) can get, the larger η(t) would have to be. We now analyze how FMB attains this relation. Consider improving the policy learning algorithm with respect to φ(t). This could be achieved by keeping β constant, and increasing l. This may either reduce µT as seen from (6) and hence  as seen from (7), or simply reduce δ as seen from (6), both ways improving the simple regret. On the other hand, regret η(t), otherwise specifically called cumulative regret, directly depends on l as seen from (14). For a constant number of pulls, t, the summation of Gi (w) in (14) runs over lesser number of terms for an increasing l. As cumulative regret is dominated by regret incurred during the initiation phase, it is seen evidently that, by increasing l we find worse bounds on regret. Thus, we evidence a Simple vs. Cumulative Regret trade-off in the proposed algorithm FMB. 5.2 Control of Complexity or Regret with β From (6), consider the sample complexity,  nl = n

n 2 n ln δ µ4T

1

where µT is given by µT

=

min

i:µi E[Aj ] for every µi > µj . Let us call the set adhering to this restriction, βS . Since E[Ai ] is monotonic in β, we see βS must be an interval on the real line. Now consider the bounds on regret, given together by (14) and (20). Given sample µ2

complexity, or equivalently given l, regret can still be improved as it depends on λi = √Tni that the algorithm has control on, through β. So, β can be increased to improve the bounds on regret through an increase in λi , keeping l constant. But with increase in λi or equivalently |µTi |, µT is expected to increase as well. In this respect, we believe β for the best regret, given an l, would be fractional, and be that value which when increased by the smallest amount will result in a decrease in sample complexity l by 1. While it has been observed experimentally that the best regret occurs for a fractional value of β, we cannot be sure whether ‘The β’ for the best regret was indeed achieved. To summarize, we can control the algorithm FMB with regard to sample complexity or regret. But this tuning will inevitably have trade-offs between simple and cumulative regrets, φ(t) and η(t) respectively, consistent with the findings of Bubeck et al. (2009). Nevertheless, there is a definite interval of restriction βS , defined by the problem at hand, on the tuning of β. 16

Optimal Selections with plays

Average Reward with plays 80 Greedy Ai Exploring Ai Temp. =0.24

1.2 1

% Optimal Selections

Average Reward

1.4

0.8 0.6 0.4

70 Greedy Ai Exploring Ai Temp. =0.24

60 50 40 30 20

0.2 0

200

400

600

800

10 0

1000

200

400

600

800

1000

Plays

Plays

Figure 2: Two variants of the proposed algorithm, Greedy and Probabilistic action selections on Ai , are compared against the SoftMax algorithm
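To illustrate the tunability discussed in Section 5.2, one can sweep β on a fixed task and record the cumulative regret of the greedy variant. The grid, task and horizon below are arbitrary and the `fmb` function is the illustrative sketch from Section 2, so this only demonstrates the mechanics of choosing β empirically, not the paper's tuning procedure.

```python
import numpy as np

rng = np.random.default_rng(7)
p = np.array([0.8, 0.6, 0.5, 0.4, 0.3])        # Bernoulli task
horizon, runs = 500, 100

def regret(beta):
    total = 0.0
    for _ in range(runs):
        hists = fmb(lambda i: float(rng.random() < p[i]),
                    n_arms=len(p), horizon=horizon, beta=beta)
        total += horizon * p.max() - sum(map(sum, hists))
    return total / runs

for beta in (0.25, 0.5, 0.85, 1.0, 2.0):
    print(f"beta={beta:4}  avg cumulative regret={regret(beta):6.1f}")
```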

5.3 How Low Can l Get?

Is it possible to get l as low as 1, and still obtain an algorithm with the complexity or regret guarantees above, so as to reduce the cumulative regret incurred in the initiation phase? While this was accomplished experimentally, we examine from a theoretical perspective when this would be possible. We have

$$l = \left( \frac{n}{\mu_T^4} \ln^2 \frac{n}{\delta} \right)^{1/n}$$

For l = 1, we would have

$$\mu_T^4 = n \ln^2 \frac{n}{\delta} \ \implies\ \delta = n\, e^{-\mu_T^2 / \sqrt{n}}$$

For δ < 1, we require

$$e^{\mu_T^2 / \sqrt{n}} > n \ \iff\ |\mu_T| > \sqrt{\sqrt{n}\, \ln n}$$

Thus, if some β ∈ β_S could achieve |µ_T| > √(√n ln n), then we would have an algorithm achieving O(n) sample complexity for l = 1.
6. Experiments & Results

A 10-arm bandit test bed with rewards formulated as Gaussian with varying means and a variance of 1 was developed for the experiments. Probabilistic and greedy action selections (pFMB and FMB respectively) were performed on the quantities A_i. Figure 2 shows the average reward and the cumulative percentage of optimal selections with plays, averaged over 2000 tasks on the test bed; Greedy A_i and Exploring A_i are the curves corresponding to the performance of FMB and pFMB respectively. β = 0.85 was chosen empirically without complete parameter optimization (though 4 trials of different β were made). The temperature of the SoftMax algorithm, τ = 0.24, was observed to be the best among the 13 different temperatures that were attempted for parameter optimization of the SoftMax procedure. This temperature value was also seen to outperform ε-greedy action selection with ε = 0.1.

Figure 3: Asymptotic performance showing low regret.

To see more asymptotic performance, the number of plays was further increased to 5000, with Gaussian rewards incorporating varying means as well as variances; the corresponding plots are shown in Figure 3. As can be seen, though the proposed algorithms could not keep up in optimal selections, they are still reaping higher cumulative rewards even 3000 turns after ε-greedy overtakes them in optimal selections (at around the 1500th turn).

The algorithm FMB was then compared with state-of-the-art algorithms that guarantee either sample complexity or regret. Comparisons of FMB with UCB-Revisited, henceforth called UCB-Rev, which was shown to incur low regret [Auer & Ortner (2010)], are depicted in Figure 4. The experiments were conducted on specific random seeds and averaged over 200 different tasks or trials. Note that UCB-Rev was provided with knowledge of the horizon (the number of plays T), which aids its optimal performance. Furthermore, no parameter optimization was performed for FMB, and the experiments were conducted with β = 0.85. We see that FMB performs substantially better in terms of cumulative regret and its growth, even with the knowledge of the horizon provided to UCB-Rev.

Comparisons of FMB with the Median Elimination Algorithm (MEA) [Even-Dar et al. (2006)], which was also shown to achieve O(n) sample complexity, are shown in Figure 5. Here we compare the performance of the two algorithms at the same sample complexity. The parameters for MEA's performance depicted (ε = 0.95, δ = 0.95) performed best with respect to regret among the 34 different uniformly varied instances tested, while no parameter optimization for FMB was performed. To achieve O(n) guarantees at ε = 0.95, δ = 0.95, it was observed that l = 2 and l = 3 were required as initial arm-picks of FMB for the Bernoulli and Gaussian experiments respectively. Though this may not be a fair comparison, as the appropriate values of l were computed empirically to attain the (ε, δ) confidences, we observe that relatively very low values of l suffice to ensure O(n) sample complexity. We conclude that the proposed class of algorithms, in addition to substantially reducing regret while learning, seems to perform well with respect to sample complexity as well.
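A minimal version of such a test bed can be scripted as below; it reuses the illustrative `fmb` sketch from Section 2 and compares it against a standard sample-mean ε-greedy baseline, averaging over freshly drawn tasks as in Figure 2. Bernoulli arms, the task count and the horizon are used here purely to keep the naive implementation fast; the paper's test bed uses Gaussian rewards and tuned baselines, so the numbers printed are only indicative of the protocol, not reproductions of the reported results.

```python
import numpy as np

def eps_greedy(pull, n_arms, horizon, eps=0.1, rng=None):
    """Standard sample-mean epsilon-greedy baseline (chronological reward list)."""
    rng = rng or np.random.default_rng()
    counts, sums, out = np.zeros(n_arms), np.zeros(n_arms), []
    for t in range(horizon):
        if t < n_arms:
            arm = t                                    # one initial pull per arm
        elif rng.random() < eps:
            arm = int(rng.integers(n_arms))
        else:
            arm = int(np.argmax(sums / np.maximum(counts, 1)))
        r = pull(arm)
        counts[arm] += 1; sums[arm] += r; out.append(r)
    return out

def mean_reward(agent, tasks=200, horizon=1000, n_arms=10, seed=0):
    """Average per-play reward over freshly drawn Bernoulli tasks."""
    rng = np.random.default_rng(seed)
    totals = []
    for _ in range(tasks):
        p = rng.uniform(0.1, 0.9, n_arms)              # one task = one mean vector
        pull = lambda i: float(rng.random() < p[i])
        hist = agent(pull, n_arms, horizon)
        # fmb returns per-arm histories, eps_greedy a chronological list.
        flat = sum(hist, []) if isinstance(hist[0], list) else hist
        totals.append(np.mean(flat))
    return float(np.mean(totals))

print("FMB       :", mean_reward(lambda pl, n, h: fmb(pl, n, h, beta=0.85)))
print("eps-greedy:", mean_reward(eps_greedy))
```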

Figure 4: Comparison of FMB with UCB-Rev [Auer & Ortner (2010)] on Bernoulli and Gaussian rewards.

Figure 5: Comparison of FMB with the Median Elimination Algorithm (MEA) [Even-Dar et al. (2006)] on Bernoulli and Gaussian rewards.

7. Conclusions & Future Work

The proposed class of algorithms is, to the best of our knowledge, the first to use fractional moments in the bandit literature. Specifically, the class is shown to possess algorithms that provide PAC guarantees with O(n) complexity in finding an ε-optimal arm, in addition to algorithms incurring the theoretical lowest regret of O(log(t)). Experimental results support this, showing that the algorithm achieves substantially lower regret not only when compared with parameter-optimized ε-greedy and SoftMax methods but also with state-of-the-art algorithms like MEA [Even-Dar et al. (2006)] (when compared at the same sample complexity) and UCB-Rev [Auer & Ortner (2010)]. Minimizing regret is a crucial factor in various applications; for instance, Agarwal et al. (2009) describe its relevance to content publishing systems that select articles to serve hundreds of millions of user visits per day.

We note that as the reward distributions are relaxed from R_i ∈ {0, r_i} to continuous probability distributions, the sample complexity in (6) improves further. To see this, consider a gradual blurring of the Bernoulli distribution into a continuous distribution: the probabilities P{A_i > A_1} would increase due to the inclusion of new reward possibilities in the event space. So we expect even lower sample complexities (in terms of the constants involved) with continuous reward distributions. But the trade-off is really in the computations involved. The algorithm presented can be implemented incrementally, but requires that the set of rewards observed so far be stored. On the other hand, the algorithm simplifies computationally to a much faster approach in the case of Bernoulli distributions (the essential requirement is rather a low-cardinality reward support), as the mean estimates can be used directly. As the cardinality of the reward set increases, so will the complexity of the computation. Since most rewards are encoded manually, specific to the problem at hand, we expect low-cardinality reward supports, where the low sample complexity achieved would greatly help without a major increase in computation.

The theoretical analysis assumes greedy action selection on the quantities A_i, and hence any exploration performed by the algorithm pFMB is unaccounted for. Bounds on the sample complexity or regret of the exploratory algorithm would help in better understanding the explore/exploit situation, so as to determine whether a greedy FMB could beat an exploratory FMB. While FMB incurs a sample complexity of O(n), determination of an appropriate l given ε and δ is another direction to pursue. In addition, note from (6) and (7) that we sample all arms with the worst l_i, where l_i is the number of pulls of arm i necessary to ensure PAC guarantees with O(n); we could reduce the sample complexity further if we formulated the initiation phase with l_i pulls of arm i, which requires further theoretical footing on the PAC analysis presented. The method of tuning the parameter β, and the estimation of β_S, are aspects to pursue for better use of the algorithm in unknown environments with no knowledge of the reward support. A further variant of the proposed algorithms, allowing β to change while learning, is an interesting possibility to look at.

Acknowledgements

We thank Shivaram Kalyanakrishnan for valuable insights on related work in the literature. We would also like to thank Arun Chaganty and the UAI reviewers for their informative and helpful reviews.


References

Achim A. M., Canagarajah C. N. and Bull D. R. (2005). Complex wavelet domain image fusion based on fractional lower order moments. 8th International Conference on Information Fusion, Vol. 1, 25–28.

Agarwal D., Chen B., Elango P., Motgi N., Park S., Ramakrishnan R., Roy S. and Zachariah J. (2009). Online models for content optimization. In Advances in Neural Information Processing Systems 21.

Agrawal R. (1995). Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27, 1054–1078.

Ananda Narayanan B. and Ravindran B. (2011). Fractional moments on bandit problems. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence (UAI 2011), pp. 531–538. AUAI Press.

Auer P., Cesa-Bianchi N. and Fischer P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47, 235–256.

Auer P. and Ortner R. (2010). UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1–2), 55–65.

Berry D. A. and Fristedt B. (1985). Bandit Problems. Chapman and Hall Ltd.

Bubeck S., Munos R. and Stoltz G. (2009). Pure exploration in multi-armed bandit problems. In 20th International Conference on Algorithmic Learning Theory (ALT).

Dubhashi D. P. and Panconesi A. (2009). Concentration of Measure for the Analysis of Randomized Algorithms. Cambridge University Press, New York.

Even-Dar E., Mannor S. and Mansour Y. (2006). Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7, 1079–1105.

Hoeffding W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301), 13–30.

Holland J. (1992). Adaptation in Natural and Artificial Systems. Cambridge: MIT Press/Bradford Books.

Kalyanakrishnan S. and Stone P. (2010). Efficient selection of multiple bandit arms: Theory and practice. In Proceedings of the 27th International Conference on Machine Learning, 511–518.

Kim S. and Nelson B. (2001). A fully sequential procedure for indifference-zone selection in simulation. ACM Transactions on Modeling and Computer Simulation, 11(3), 251–273.

Lai T. and Robbins H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6, 4–22.

Liu T. and Mendel J. M. (2001). A subspace-based direction finding algorithm using fractional lower order statistics. IEEE Transactions on Signal Processing, 49(8), 1605–1613.

Min S. and Chrysostomos L. N. (1993). Signal processing with fractional lower order moments: Stable processes and their applications. Proceedings of the IEEE, 81(7), 986–1010.

Schmidt C., Branke J. and Chick S. E. (2006). Integrating techniques from statistical ranking into evolutionary algorithms. In Applications of Evolutionary Computations, volume 3907 of LNCS, pp. 752–763. Springer.

Sutton R. and Barto A. (1998). Reinforcement Learning: An Introduction. MIT Press, Cambridge.

Xinyu M. and Nikias C. L. (1996). Joint estimation of time delay and frequency delay in impulsive noise using fractional lower order statistics. IEEE Transactions on Signal Processing, 44(11), 2669–2687.