Optimal Adaptive Learning in Uncontrolled Restless Bandit Problems
arXiv:1107.4042v2 [math.OC] 17 Oct 2012
Cem Tekin, Student Member, IEEE, Mingyan Liu, Senior Member, IEEE
Abstract—In this paper we consider the problem of learning the optimal policy for uncontrolled restless bandit problems. In an uncontrolled restless bandit problem, there is a finite set of arms, each of which when pulled yields a positive reward. There is a player who sequentially selects one of the arms at each time step. The goal of the player is to maximize its undiscounted reward over a time horizon T. The reward process of each arm is a finite state Markov chain, whose transition probabilities are unknown to the player. The state transitions of each arm are independent of the selections of the player. We propose a learning algorithm with logarithmic regret uniformly over time with respect to the optimal finite horizon policy. Our results extend the optimal adaptive learning of MDPs to POMDPs.
Index Terms—Online learning, restless bandits, POMDPs, regret, exploration-exploitation tradeoff
I. INTRODUCTION
In an uncontrolled restless bandit problem (URBP) there is a set of arms indexed by 1, 2, . . . , K, whose state processes are discrete and evolve according to discrete-time Markov rules, independently of each other. There is a user/player who chooses one arm at each discrete time step, receives the reward, and observes the current state of the selected arm. The control action, i.e., the arm selection, does not affect the state transitions; therefore the system dynamics are uncontrolled. However, the arm selection is used both to exploit the instantaneous reward and to decrease the uncertainty about the current state of the system by exploring. Thus, the optimal policy should balance the tradeoff between exploration and exploitation.
A preliminary version of this work appeared in Allerton 2011. C. Tekin and M. Liu are with the Electrical Engineering and Computer Science Department, University of Michigan, Ann Arbor, MI 48105, USA, {cmtkn,mingyan}@eecs.umich.edu
If the structure of the system, i.e., the state transition probabilities and the rewards of the arms, is known, then the optimal policy can be found by dynamic programming for any finite horizon. In the case of an infinite horizon, stationary optimal policies can be found for the discounted problem by using the contraction properties of the dynamic programming operator. For the infinite horizon average reward problem, stationary optimal policies can be found under some assumptions on the transition probabilities [1], [2]. In this paper, we assume that initially the player has no knowledge about the transition probabilities of the arms. Therefore, the problem we consider is a learning problem rather than an optimization problem given the structure of the system. Our assumption is consistent with the fact that in most systems the player does not have a perfect model of the system at the beginning but learns the model over time. For example, in a cognitive radio network, initially the player (secondary user) may not know how the primary user activity evolves over time, or in a target tracking system, initially the statistics of the movement of the target may be unknown. Thus, our goal is to design learning algorithms with the fastest convergence rate, i.e., minimum regret, where the regret of a learning policy at time t is defined as the difference between the reward of the optimal policy for the undiscounted t-horizon problem with full information about the system model and the undiscounted reward of the learning policy up to time t. In this paper, we show that when the transition probability between any two states of the same arm is always positive, an algorithm with logarithmic regret uniform in time with respect to the optimal policy for the finite-time undiscounted problem with known transition probabilities exists. We also claim that logarithmic order is the best achievable order for URBP. To the best of our knowledge this paper is the first attempt to extend optimal adaptive learning in MDPs to partially observable Markov decision processes (POMDPs), in which there are countably many information states. The organization of the remainder of this paper is as follows. Related work is given in Section II. In Section III, we give the problem formulation, notation and some lemmas that will be used throughout the proofs in the paper. In Section IV, we give sufficient conditions under which the average reward optimality equation has a continuous solution. In Section V, we give an equivalent countable representation of the information state and an assumption under which the regret of a policy can be related to the expected number of times a suboptimal action is taken. In Section VI, an adaptive learning algorithm is given. Then, under an assumption on the structure of the optimal policy, we give an upper bound for the regret of the adaptive learning algorithm in Section VII. In Section VIII, we prove that this upper bound is logarithmic in time. In Section IX, we provide extensions to the adaptive learning algorithm with near-logarithmic regret, and relax the assumption on the structure of the optimal policy. Section X concludes
the paper.
II. RELATED WORK
Related work in optimal adaptive learning started with the paper of Lai and Robbins [3], where asymptotically optimal adaptive policies for the multi-armed bandit problem with an i.i.d. reward process for each arm were constructed. These are index policies and it is shown that they achieve the optimal regret both in terms of the constant and the order. Later, Agrawal [4] considered the i.i.d. problem and provided sample-mean-based index policies which are easier to compute and order optimal, but not optimal in terms of the constant in general. Anantharam et al. [5], [6] proposed asymptotically optimal policies with multiple plays at each time for i.i.d. and Markovian arms, respectively. However, all the above work assumed parametrized distributions for the reward processes of the arms. Auer et al. [7] considered the i.i.d. multi-armed bandit problem and proposed sample-mean-based index policies with logarithmic regret when the reward processes have bounded support. Their upper bound holds uniformly over time rather than asymptotically, but this bound is not asymptotically optimal. Following this approach, Tekin and Liu [8], [9] provided policies with uniformly logarithmic regret bounds with respect to the best single-arm policy for restless and rested multi-armed bandit problems, and extended the results to single-player multiple plays and decentralized multi-player models in [10]. Decentralized multi-player versions of the i.i.d. multi-armed bandit problem under different collision models were considered in [11], [12], [13], [14]. Other research on adaptive learning focused on Markov decision processes (MDPs) with finite state and action spaces. Burnetas and Katehakis [15] proposed index policies with asymptotically logarithmic regret, where the indices are inflations of the right-hand sides of the estimated average reward optimality equations based on the Kullback-Leibler (KL) divergence, and showed that these are asymptotically optimal both in terms of the order and the constant. However, they assumed that the support of the transition probabilities is known. Tewari and Bartlett [16] proposed a learning algorithm that uses the $l_1$ distance instead of the KL divergence, with the same order of regret but a larger constant. Their proof is simpler than the proof in [15] and does not require the support of the transition probabilities to be known. Auer and Ortner [17] proposed another algorithm with logarithmic regret and reduced computation for the MDP problem, which solves the average reward optimality equations only when a confidence interval is halved. In all the above work the MDPs are assumed to be irreducible. Contrary to the related work on Markovian restless bandits [10], which compares the performance of the learning algorithm with the best static policy, in this paper we compare our learning algorithm with the best dynamic policy.
III. PROBLEM FORMULATION AND ASSUMPTIONS
In this paper we study strong regret algorithms for uncontrolled restless Markovian bandit problems (URBP). Consider $K$ mutually independent uncontrolled restless Markovian arms, indexed by the set $K = \{1, 2, \ldots, K\}$, whose states evolve in discrete time steps $t = 1, 2, \ldots$ according to a finite-state Markov chain with unknown transition probabilities. Let $S^k$ be the state space of arm $k$. For simplicity of presentation, and without loss of generality, assume that for state $x \in S^k$, $r^k_x = x$, i.e., the state of an arm also represents the reward from that arm. Then, the state space of the system is the Cartesian product of the state spaces of the individual arms, which is denoted by $S = S^1 \times \ldots \times S^K$. Let $p^k_{ij}$ denote the transition probability from state $i$ to $j$ of arm $k$. Then the transition probability matrix of arm $k$ is $P^k$, whose $ij$th element is $p^k_{ij}$. The set of transition probability matrices is denoted by $P = (P^1, \ldots, P^K)$. We assume that the $P^k$'s are such that the induced Markov chains are ergodic. Therefore for each arm there exists a unique stationary distribution, which is given by $\pi^k = (\pi^k_x)_{x\in S^k}$. At each time step, the state of the system is a $K$-dimensional vector of the states of the arms, which is given by $x = (x^1, \ldots, x^K) \in S$. Next, we define notation that will frequently be used in the following sections. Let $e^k_x$ represent the unit vector of dimension $|S^k|$ whose $x$th element is 1 and all other elements are 0, $\mathbb{N} = \{1, 2, \ldots\}$ represent the set of natural numbers, $\mathbb{Z}_+ = \{0, 1, \ldots\}$ represent the set of non-negative integers, $(v \bullet w)$ represent the standard inner product of vectors $v, w$, $\|v\|_1, \|v\|_\infty$ represent respectively the $l_1$ and $l_\infty$ norms of vector $v$, and $\|P\|_1$ represent the induced maximum row sum norm for matrices. For a vector $v$, $(v^{-u}, v')$ represents the vector whose $u$th element is $v'$, while all other elements are the same as the elements of $v$. For a vector of matrices $P$, $(P^{-u}, P')$ represents the vector of matrices whose $u$th matrix is $P'$, while all other matrices are the same as the matrices of $P$. The transpose of a vector $v$ or matrix $P$ is denoted by $v^T$ and $P^T$, respectively. The list below gives some of the quantities that frequently
appear in the results in this paper.
Notation 1:
• $\beta = \sum_{t=1}^{\infty} 1/t^2$
• $\pi^k_{\min} = \min_{x\in S^k} \pi^k_x$, $\pi_{\min} = \min_{k\in K} \pi^k_{\min}$
• $r_{\max} = \max_{x\in S^k, k\in K} r^k_x$
• $S_{\max} = \max_{k\in K} |S^k|$
There is a player who selects one of the K arms at each time step t, and gets the reward from that arm depending on the state of that arm. The objective of the player is to maximize the undiscounted sum of the rewards for any finite horizon. However, the player does not know the set of transition probability
matrices $P$. Moreover, the player can only observe the state of the arm it chooses, but not the states of the other arms. Intuitively, in order to maximize its reward, the player should explore the arms to estimate their transition probabilities and to reduce the uncertainty about the current state $x \in S$ of the system, while exploiting the information it has acquired about the system to select arms that yield high rewards. This process should be carefully balanced to yield the maximum reward for the player. In a more general sense, the player should learn to play optimally in an uncontrolled POMDP. We denote the set of $|S^k| \times |S^k|$ stochastic matrices by $\Xi^k$. The set of vectors of $K$ stochastic matrices is denoted by $\Xi = (\Xi^1, \Xi^2, \ldots, \Xi^K)$. Since $P$ is not known by the player, at time $t$ the player has an estimate of $P$, which is given by $\hat P_t \in \Xi$. For two sets of transition probability matrices $P$ and $\tilde P$, the distance between them is defined as $\|P - \tilde P\|_1 := \sum_{k=1}^{K} \|P^k - \tilde P^k\|_1$. At any time step, the state of the system in the next time step is uncertain due to the stochastically changing states. Therefore, let $X^k_t$ be the random variable representing the state of arm $k$ at time $t$. Then, the random vector $X_t = (X^1_t, X^2_t, \ldots, X^K_t)$ represents the state of the system at time $t$. Since the player chooses an arm in $K$ at each time step, $K$ can be seen as the action space of the player. Since the player can observe the state of the arm it selects at each time step, $Y = \cup_{k=1}^{K} S^k$ is the observation space of the player. Note that even if two states of different arms have the same reward, the player will label these states differently; thus $S^k \cap S^l = \emptyset$ for $k \ne l$. Let $u_t \in U$ be the arm selected by the player at time $t$, where $U = K$, and $y_t \in Y$ be the state/reward observed by the player at time $t$. Then, the history at time $t$ is $z^t = (u_0, y_1, u_1, y_2, \ldots, u_{t-1}, y_t)$. Usually, for an algorithm, $u_t$ and $y_t$ depend on the history and the stochastic evolution of the arms. Therefore, we denote by $U_t$ and $Y_t$ the random variables representing the action and the observation at time $t$, respectively. We let $Q_P(y|u)$ be the sub-stochastic transition probability matrix such that $(Q_P(y|u))_{x x'} = P_P(X_{t+1} = x', Y_{t+1} = y \mid X_t = x, U_t = u)$. For the URBP, $Q_P(y|u)$ is the zero matrix for $y \notin S^u$. For $y \in S^u$, the only nonzero entries of $Q_P(y|u)$ are those $(x, x')$ with $x'^u = y$. Let $\alpha$ denote a learning algorithm used by the player. In order to decide which arm to select at time $t$, $\alpha$ can only use all past observations and selections by the player up to $t$. We denote by $\alpha(t)$ the arm
selected by algorithm α at time t. The performance of algorithm α can be measured by regret which is the difference between performance of the algorithm and performance of the optimal policy with known transition probabilities up to t. For any algorithm α, the regret with respect to the optimal T horizon
policy is
$$R^{\alpha}(T) = \sup_{\gamma' \in \Gamma}\left( E^P_{\psi_0,\gamma'}\left[\sum_{t=1}^{T} r^{\gamma'(t)}(t)\right] \right) - E^P_{\psi_0,\alpha}\left[\sum_{t=1}^{T} r^{\alpha(t)}(t)\right], \qquad (1)$$
where $\Gamma$ is the set of admissible policies. In Section VI, we will propose an algorithm whose regret grows logarithmically in time. Therefore this algorithm converges to the optimal policy in terms of the average reward. In the following sections we will frequently use results from large deviations theory. We will relate the accuracy of the player's transition probability estimates to the probability of deviating from the optimal action. First, we give the definition of a uniformly ergodic Markov chain.
Definition 1: [18] A Markov chain $X = \{X_t, t \in \mathbb{Z}_+\}$ on a measurable space $(S, \mathcal{B})$, with transition kernel $P(x, G)$, is uniformly ergodic if there exist constants $\rho < 1$, $C < \infty$ such that for all $x \in S$,
$$\|e_x P^t - \pi\| \le C\rho^t, \quad t \in \mathbb{Z}_+. \qquad (2)$$
The norm used in the above definition is the total variation norm. For finite and countable vectors this corresponds to the $l_1$ norm, and the induced matrix norm corresponds to the maximum absolute row sum norm. Clearly, for a finite state Markov chain, uniform ergodicity is equivalent to ergodicity. Next, we give a large deviation bound for a perturbation of a uniformly ergodic Markov chain.
Lemma 1: ([18] Theorem 3.1.) Let $X = \{X_t, t \in \mathbb{Z}_+\}$ be a uniformly ergodic Markov chain for which (2) holds. Let $\hat X = \{\hat X_t, t \in \mathbb{Z}_+\}$ be the perturbed chain with transition kernel $\hat P$. Given that the two chains have the same initial distribution, let $\psi_t, \hat\psi_t$ be the distributions of $X, \hat X$ at time $t$, respectively. Then,
$$\|\psi_t - \hat\psi_t\| \le C_1(P, t)\,\|\hat P - P\|, \qquad (3)$$
where $C_1(P, t) = \hat t + C\,\dfrac{\rho^{\hat t} - \rho^{t}}{1 - \rho}$ and $\hat t = \log_\rho C^{-1}$.
We will frequently use the Chernoff-Hoeffding bound in our proofs, which bounds the difference between the sample mean and the expected reward for distributions with bounded support.
Lemma 2: (Chernoff-Hoeffding bound) Let $X_1, \ldots, X_T$ be random variables with common range $[0,1]$, such that $E[X_t \mid X_{t-1}, \ldots, X_1] = \mu$. Let $S_T = X_1 + \ldots + X_T$. Then for all $\epsilon \ge 0$,
$$P(|S_T - T\mu| \ge \epsilon) \le 2 e^{-2\epsilon^2/T}.$$
The following lemma, which will be used in the following sections, gives an upper bound on the difference between the products of two equal-sized sets of numbers in the unit interval, in terms of the sum
of the absolute values of the pairwise differences between the numbers taken from the two sets.
Lemma 3: For $\rho_k, \rho'_k \in [0,1]$ we have
$$|\rho_1 \cdots \rho_K - \rho'_1 \cdots \rho'_K| \le \sum_{k=1}^{K} |\rho_k - \rho'_k|. \qquad (4)$$
Proof: First consider $|\rho_1\rho_2 - \rho'_1\rho'_2|$ where $\rho_1, \rho_2, \rho'_1, \rho'_2 \in [0,1]$. Let $\epsilon = \rho'_2 - \rho_2$. Then
$$|\rho_1\rho_2 - \rho'_1\rho'_2| = |\rho_1\rho_2 - \rho'_1(\rho_2 + \epsilon)| = |\rho_2(\rho_1 - \rho'_1) - \rho'_1\epsilon| \le \rho_2|\rho_1 - \rho'_1| + \rho'_1|\epsilon|.$$
But we have
$$|\rho_1 - \rho'_1| + |\rho_2 - \rho'_2| = |\rho_1 - \rho'_1| + |\epsilon| \ge \rho_2|\rho_1 - \rho'_1| + \rho'_1|\epsilon|.$$
Thus $|\rho_1\rho_2 - \rho'_1\rho'_2| \le |\rho_1 - \rho'_1| + |\rho_2 - \rho'_2|$. Now, we prove (4) by induction. Clearly (4) holds for $K = 1$. Assume it holds for some $K \ge 1$. Then
$$|\rho_1\cdots\rho_{K+1} - \rho'_1\cdots\rho'_{K+1}| \le |\rho_1\cdots\rho_K - \rho'_1\cdots\rho'_K| + |\rho_{K+1} - \rho'_{K+1}| \le \sum_{k=1}^{K+1} |\rho_k - \rho'_k|.$$
IV. SOLUTIONS OF THE AVERAGE REWARD OPTIMALITY EQUATION
Assume that the transition probability matrices of the arms are known by the player. Then, the URBP turns into an optimization problem (a POMDP) rather than a learning problem. In its general form this problem is intractable [19], but heuristics, approximations and exact solutions under different assumptions on the arms are studied in [20], [21], [22] and many others. One way to represent a POMDP problem is to use the belief space, i.e., the set of probability distributions over the state space. For the URBP with the set of transition probability matrices $P$, the belief space is $\Psi = \{\psi : \psi^T \in \mathbb{R}^{|S|},\ \psi_x \ge 0\ \forall x \in S,\ \sum_{x\in S}\psi_x = 1\}$, which is the unit simplex in $\mathbb{R}^{|S|}$. Let $\psi_0$ denote the initial belief and $\psi_t$ denote the belief at time $t$. Then, the probability that the player observes $y$ given that it selects arm $u$ when the belief is $\psi$ is $V_P(\psi, y, u) := \psi Q_P(y|u)\mathbf{1}$, where $\mathbf{1}$ is the $|S|$-dimensional column vector of ones. Given that arm $u$ is chosen at belief $\psi$ and state $y$ is observed, the next belief is $T_P(\psi, y, u) = \psi Q_P(y|u)/V_P(\psi, y, u)$. Let $\Gamma$ be the set of admissible policies, i.e., any policy for which the action at $t$ is a function of $\psi_0$ and $z^t$. When $P$ is known by the player, the average reward
optimality equation (AROE) is
$$g + h(\psi) = \max_{u\in U}\Big\{\bar r(\psi, u) + \sum_{y\in S^u} V_P(\psi, y, u)\, h(T_P(\psi, y, u))\Big\}, \qquad (5)$$
where $g$ is a constant, $h$ is a function from $\Psi$ to $\mathbb{R}$, and $\bar r(\psi, u) = (\psi \bullet r(u)) = \sum_{x^u\in S^u} x^u \phi_{u,x^u}(\psi)$ is
the expected reward of action $u$ at belief $\psi$, $\phi_{u,x^u}(\psi)$ is the probability that arm $u$ is in state $x^u$ given belief $\psi$, $r(u) = (r(x, u))_{x\in S}$, and $r(x, u) = x^u$ is the reward when arm $u$ is chosen at state $x$. The bound on regret, which we will derive in the following sections, will hold under the following assumption.
Assumption 1: $p^k_{ij} > 0$, $\forall k \in K$, $i, j \in S^k$.
Under Assumption 1, the existence of a bounded, convex, continuous solution to (5) is guaranteed. Let $E^P_{\psi,\gamma}[\cdot]$ denote the expectation operator when policy $\gamma$ is used with initial state $\psi$ and the set of transition probability matrices is $P$. Let $V$ denote the space of bounded real-valued functions on $\Psi$. Next, we define the undiscounted dynamic programming operator $F : V \to V$. For $v \in V$ we have
$$(F v)(\psi) = \max_{u\in U}\Big\{\bar r(\psi, u) + \sum_{y\in S^u} V_P(\psi, y, u)\, v(T_P(\psi, y, u))\Big\}. \qquad (6)$$
Lemma 4: Let $h^+ = h - \inf_{\psi\in\Psi} h(\psi)$, $h^- = h - \sup_{\psi\in\Psi} h(\psi)$, and
$$h_{T,P}(\psi) = \sup_{\gamma\in\Gamma}\left( E^P_{\psi,\gamma}\left[\sum_{t=1}^{T} r^{\gamma(t)}(t)\right] \right).$$
Under Assumption 1 the following hold:
S-1: Consider a sequence of functions $v_0, v_1, v_2, \ldots$ in $V$ such that $v_0 = 0$ and $v_l = F v_{l-1}$, $l = 1, 2, \ldots$. This sequence converges uniformly to a convex continuous function $v^*$ for which $F v^* = v^* + g$, where $g$ is a finite constant. In terms of (5), this result means that there exists a finite constant $g_P$ and a bounded convex continuous function $h_P : \Psi \to \mathbb{R}$ which solve (5).
S-2: $h^-_P(\psi) \le h_{T,P}(\psi) - T g_P \le h^+_P(\psi)$, $\forall \psi \in \Psi$.
S-3: $h_{T,P}(\psi) = T g_P + h_P(\psi) + O(1)$ as $T \to \infty$.
Proof: Sufficient conditions for the existence of a bounded convex continuous solution to the AROE are investigated in [1]. According to Theorem 4 of [1], if the reachability and detectability conditions are satisfied then S-1 holds. Below, we directly prove that the reachability condition in [1] is satisfied. To prove
that the detectability condition is satisfied, we show that another condition, i.e., subrectangularity of the substochastic matrices, holds, which implies the detectability condition. We note that $P(X_{t+1} = x' \mid X_t = x) > 0$, $\forall x, x' \in S$, since by Assumption 1, $p^k_{ij} > 0$ $\forall i, j \in S^k$, $\forall k \in K$.
Condition 1: (Reachability) There is a $\rho < 1$ and an integer $\xi$ such that
$$\sup_{\gamma\in\Gamma}\ \max_{0\le t\le \xi} P(X_t = x \mid \psi_0) \ge 1 - \rho, \quad \forall \psi_0 \in \Psi.$$
Set $\rho = 1 - \min_{x,x'} P(X_{t+1} = x' \mid X_t = x)$ and $\xi = 1$. Since the system is uncontrolled, state transitions are independent of the arm selected by the player. Therefore,
$$\sup_{\gamma\in\Gamma} P(X_1 = x \mid \psi_0) = P(X_1 = x \mid \psi_0) \ge \min_{x,x'} P(X_{t+1} = x' \mid X_t = x) = 1 - \rho.$$
Condition 2: (Subrectangular matrices) For any substochastic matrix $Q(y|u)$, $y \in Y$, $u \in U$, and for any $i, i', j, j' \in S$, $(Q(y|u))_{ij} > 0$ and $(Q(y|u))_{i'j'} > 0$ imply $(Q(y|u))_{ij'} > 0$ and $(Q(y|u))_{i'j} > 0$.
$Q(y|u)$ is subrectangular for $y \notin S^u$ since it is the zero matrix. For $y \in S^u$ all entries of $Q(y|u)$ are positive since $P(X_{t+1} = x' \mid X_t = x) > 0$, $\forall x, x' \in S$. S-2 holds by Lemma 1 in [1], and S-3 is a consequence of S-2 and the boundedness property in S-1.
V. COUNTABLE REPRESENTATION OF THE INFORMATION STATE
Assume that in the initial $K$ steps the player selects arm $k$ at the $k$th step. Then, the POMDP for the player can be written as a countable-state MDP. In this way, the information state at time $t$ can be represented by $(s_t, \tau_t) = ((s^1_t, s^2_t, \ldots, s^K_t), (\tau^1_t, \tau^2_t, \ldots, \tau^K_t))$, where $s^k_t$ and $\tau^k_t$ are the last observed state of arm $k$ and the time elapsed since the last observation of arm $k$, respectively. The contribution of the initial $K$ steps to the regret is at most $K r_{\max}$. Therefore we only analyze the time steps after this initialization, i.e., $t = 0$ corresponds to the time when the initialization phase is completed. This way, the initial information state of the player can be written as $(s_0, \tau_0)$. Let $C$ be the set of all possible information states that the player can be in. Since the player selects a single arm at each time step, at any time $\tau^k = 1$ for the last selected arm $k$.
The player can compute its belief state $\psi \in \Psi$ by using its transition probability estimates $\hat P$ together with the information state $(s_t, \tau_t)$. We let $\psi_P(s_t, \tau_t)$ be the belief that corresponds to information state $(s_t, \tau_t)$ when the set of transition probability matrices is $P$. The player knows the information state exactly, but it only has an estimate of the belief that corresponds to the information state, because it does not know the transition probabilities. The true belief, computed with the knowledge of the exact transition probabilities and the information state at time $t$, is denoted by $\psi_t$, while the estimated belief, computed with the estimated transition probabilities and the information state at time $t$, is denoted by $\hat\psi_t$. When the belief is $\psi$ and the set of transition probability matrices is $P$, the set of optimal actions, i.e., the set of maximizers of (5), is denoted by $O(\psi; P)$. Therefore, when the information state is $(s_t, \tau_t)$ and the set of transition probability matrices is $P$, the set of optimal actions is $O((s, \tau); P) := O(\psi_P((s, \tau)); P)$. Note that even when the agent is given the optimal policy for the infinite horizon average reward problem, it may not be able to play optimally because it does not know the exact belief $\psi_t$ at time $t$. In this case, in order for the agent to play optimally, there should be an $\epsilon > 0$ such that if $\|\psi_t - \hat\psi_t\|_1 < \epsilon$, the set of actions that are optimal at $\hat\psi_t$ is a subset of the set of actions that are optimal at $\psi_t$. This is indeed the case, and we prove it by exploiting the continuity of the solution to (5) under
Assumption 1. We start by defining finite partitions of the set of information states $C$.
Definition 2: Let $\tau' > 0$ be an integer. Consider a vector $i = (i^1, \ldots, i^K)$ such that either $i^k = \tau'$ or $i^k = (s^k, \tau^k)$ with $\tau^k < \tau'$, $s^k \in S^k$. Each such vector defines a set in the partition of $C$, so we call $i$ a partition vector. Let $G_{\tau'}$ denote the partition formed by $\tau'$. Let $s'(i^1, \ldots, i^K) = \{s^k : i^k \ne \tau'\}$ and $\tau'(i^1, \ldots, i^K) = \{\tau^k : i^k \ne \tau'\}$. Let $M(i^1, \ldots, i^K) := \{k : i^k = \tau'\}$ be the set of arms that were played at least $\tau'$ time steps ago, and let $\bar M(i^1, \ldots, i^K) := K - M(i^1, \ldots, i^K)$. The vector $(i^1, \ldots, i^K)$ forms the following set in the partition $G_{\tau'}$:
$$G_{i^1,\ldots,i^K} = \{(s, \tau) \in C : (s_{\bar M(i^1,\ldots,i^K)} = s',\ \tau_{\bar M(i^1,\ldots,i^K)} = \tau'),\ s^k \in S^k,\ \tau^k \ge \tau',\ \forall k \in M(i^1,\ldots,i^K)\}.$$
Let $A(\tau')$ be the number of sets in the partition $G_{\tau'}$. Re-index the sets in $G_{\tau'}$ as $G_1, G_2, \ldots, G_{A(\tau')}$. These sets form a partition of $C$ such that $G_l$ either consists of a single information state $(s, \tau)$ for which $\tau^k < \tau'$, $\forall k \in K$, or it includes infinitely many information states of the arms in $M(i^1, \ldots, i^K)$. Consider a set of transition probability matrices $\tilde P$ for which Assumption 1 holds, so all arms have a stationary distribution. If we map a set $G_l$ with infinitely many information states to the belief space using $\psi_{\tilde P}$, then for any $\delta > 0$ only a finite number of information states in $G_l$ will lie outside the radius-$\delta$ ball centered at the stationary distribution of the arms for which $i^k = \tau'$.
Fig. 1. Partition of $C$ on $\Psi$ based on $P$. $G_l$ is a set with a single information state and $G_{l'}$ is a set with infinitely many information states.
For a set $G_l \in G_{\tau'}$, given a set of transition probability matrices $P$, we define its center as follows. If $G_l$ contains only a single information state, then the belief corresponding to that information state is the center of $G_l$. If $G_l$ contains infinitely many information states, then the center of $G_l$ is the belief in which all arms with $i^k = \tau'$ are in their stationary distribution based on $P$. In both cases the belief which is the center of $G_l$ is denoted by $\psi^*(G_l; P)$. Let $O^*(G_l; P)$ be the set of optimal actions at this belief. Note that for any $\tau' > 0$, the number of sets with infinitely many elements is the same, and each of these sets is centered around the stationary distribution. However, as $\tau'$ increases, the number of sets with a single information state increases. The points in the belief space corresponding to these sets are shown in Figure 1. Next, we define extensions of the sets $G_l$ on the belief space. For a set $A \subset \Psi$ let $A(\epsilon)$ be the $\epsilon$-extension of that set, i.e., $A(\epsilon) = \{\psi \in \Psi : \psi \in A \text{ or } d_1(\psi, A) < \epsilon\}$, where $d_1(\psi, A)$ is the minimum $l_1$ distance between $\psi$ and any element of $A$. The $\epsilon$-extension of $G_l \in G_{\tau'}$ corresponding to $P$ is the $\epsilon$-extension of the convex hull of the points $\psi_P(s, \tau)$ such that $(s, \tau) \in G_l$. Let $J_{l,\epsilon}$ denote this $\epsilon$-extension of $G_l$. Examples of $J_{l,\epsilon}$ on the belief space are given in Figure 2. Let the diameter of a set $A$ be the maximum distance between any two elements of the set $A$. Another observation is that when $\tau'$ increases, the diameter of the convex hull of the points of the sets in $G_{\tau'}$ that contain infinitely many elements decreases. In the following lemma, we show that when $\tau'$ is chosen large enough, there exists $\epsilon > 0$ such that for all $G_l \in G_{\tau'}$ we have non-overlapping $\epsilon$-extensions in which only a subset of the actions in $O^*(G_l; P)$ is optimal.
Fig. 2. $\epsilon$-extensions of the sets in $G_{\tau'}$ on the belief space.
Lemma 5: For any $P$, there exist $\tau' > 0$ and $\epsilon > 0$ such that for all $G_l \in G_{\tau'}$, its $\epsilon$-extension $J_{l,\epsilon}$ has the following properties:
(i) For any $\psi \in J_{l,\epsilon}$, $O(\psi; P) \subset O^*(G_l; P)$.
(ii) For $l \ne l'$, $J_{l,\epsilon} \cap J_{l',\epsilon} = \emptyset$.
Proof: For $G_l \in G_{\tau'}$ consider $\psi^*(G_l; P)$. For any $\psi \in \Psi$ the suboptimality gap is defined as
$$\Delta(\psi, P) = \max_{u\in U}\Big\{\bar r(\psi, u) + \sum_{y\in S^u} V_P(\psi, y, u) h(T_P(\psi, y, u))\Big\} - \max_{u\in U - O(\psi;P)}\Big\{\bar r(\psi, u) + \sum_{y\in S^u} V_P(\psi, y, u) h(T_P(\psi, y, u))\Big\}. \qquad (7)$$
Since $\bar r$, $h$, $V_P$ and $T_P$ are continuous in $\psi$, we can find an $\epsilon > 0$ such that for any $\psi \in B_{2\epsilon}(\psi^*(G_l;P))$ and for all $u \in U$,
$$\Big|\bar r(\psi^*(G_l;P), u) + \sum_{y\in S^u} V_P(\psi^*(G_l;P), y, u) h(T_P(\psi^*(G_l;P), y, u)) - \bar r(\psi, u) - \sum_{y\in S^u} V_P(\psi, y, u) h(T_P(\psi, y, u))\Big| < \Delta(\psi^*(G_l;P), P)/2, \qquad (8)$$
and $B_{2\epsilon}(\psi^*(G_l;P)) \cap B_{2\epsilon}(\psi^*(G_{l'};P)) = \emptyset$ for $l \ne l'$. Therefore, any action $u$ which is not in $O^*(G_l;P)$ cannot be optimal for any $\psi \in B_{2\epsilon}(\psi^*(G_l;P))$. Since the diameter of the convex hull of the sets that contain infinitely many information states decreases with $\tau'$, there exists $\tau' > 0$ such that for any
$G_l \in G_{\tau'}$, the diameter of the convex hull $J_{l,0}$ is less than $\epsilon$. Let $\tau'$ be the smallest integer such that this holds. Then, the $\epsilon$-extension $J_{l,\epsilon}$ of the convex hull is included in the ball $B_{2\epsilon}(\psi^*(G_l; P))$ for all $G_l \in G_{\tau'}$. This concludes the proof.
Remark 1: According to Lemma 5, although we can find an $\epsilon$-extension in which a subset of $O^*(G_l; P)$ is optimal, for two beliefs $\psi, \psi' \in J_{l,\epsilon}$ the set of optimal actions at $\psi$ may be different from the set of optimal actions at $\psi'$. Note that the player's estimated belief $\hat\psi_t$ is different from the true belief $\psi_t$. If, no matter how close $\hat\psi_t$ is to $\psi_t$, the sets of optimal actions at the two beliefs differ, then the player will make a suboptimal decision even if it knows the optimal policy. It appears that this is a serious problem in the design of an efficient learning algorithm. We present two different approaches. The first is an assumption on the structure of the optimal policy, and the second is a modification to the learning algorithm so that it can control the loss due to the suboptimal decisions resulting from the difference between the estimated and the true belief. This loss can be controlled in a better way if the player knows the smoothness of the $h$ function.
Assumption 2: There exists $\tau' \in \mathbb{N}$ such that for any $G_l \in G_{\tau'}$, there exists $\epsilon > 0$ such that the same subset of $O^*(G_l; P)$ is optimal for any $\psi \in J_{l,\epsilon} \setminus \{\psi^*(G_l; P)\}$.
When this assumption holds, if $\psi_t$ and $\hat\psi_t$ are sufficiently close to each other, then the player will always choose an optimal arm. Assume that this assumption is false, and consider the stationary information states for which $\tau^k = \infty$ for some arm $k$. Then for any $\tau' > 0$, there exists a set $G_l \in G_{\tau'}$ and a sequence of information states $(s, \tau)_n$, $n = 1, 2, \ldots$, such that $\psi_P((s, \tau)_n)$ converges to $\psi^*(G_l; P)$ but there exist infinitely many $n$'s for which $O((s, \tau)_n; P) \ne O((s, \tau)_{n+1}; P)$. For simplicity of analysis, we focus on the following version of Assumption 2, although our results in Section VIII will also hold when Assumption 2 is true.
Assumption 3: There exists $\tau' \in \mathbb{N}$ such that for any $G_l \in G_{\tau'}$, there exists $\epsilon > 0$ such that for any $\psi \in J_{l,\epsilon}$, a single action is optimal.
Remark 2: Although we do not know a way to check whether Assumption 3 holds for a given set of transition probability matrices $P$, we claim that it holds for a large set of $P$'s. The player's selection does not affect the state transitions of the arms; it only affects the player's reward by changing the information state. Moreover, each arm evolves independently of the others. Assume that $P$ is arbitrarily selected from $\Xi$, and the state reward $r^k_x$, $x \in S^k$, is arbitrarily selected from $(0, r_{\max}]$. Then at any information state $(s, \tau) \in C$, the probability that the reward distributions of two arms are the same will be zero. Based on this, we claim that Assumption 3 holds with probability one if the arm rewards and $P$ are chosen from the uniform distribution on $\Psi \times (0, r_{\max}]$. In other words, the set of arm rewards and transition
probabilities for which Assumption 3 does not hold is a measure zero subset of $\Psi \times (0, r_{\max}]$.
VI. AN ADAPTIVE LEARNING ALGORITHM (ALA)
Adaptive Learning Algorithm
1: Initialize: $f(t)$ given for $t \in \{1, 2, \ldots\}$, $t = 1$, $N^k = 0$, $N^k_{i,j} = 0$, $C^k_i = 0$, $\forall k \in K$, $i, j \in S^k$. Play each arm once to set the initial information state $(s, \tau)_0$. Pick $u(0)$ randomly.
2: while $t \ge 1$ do
3:   $\bar p^k_{ij} = \dfrac{I(N^k_{i,j} = 0) + N^k_{i,j}}{|S^k| I(C^k_i = 0) + C^k_i}$,  $\hat p^k_{ij} = \dfrac{\bar p^k_{ij}}{\sum_{l\in S^k} \bar p^k_{il}}$
4:   $W = \{(k, i) : k \in K, i \in S^k, C^k_i < f(t)\}$.
5:   if $W \ne \emptyset$ then
6:     EXPLORE
7:     if $u(t-1) \in W$ then
8:       $u(t) = u(t-1)$
9:     else
10:      select $u(t) \in W$ arbitrarily
11:    end if
12:  else
13:    EXPLOIT
14:    Let $\hat\psi_t$ be the estimate of the belief at time $t$ based on $(s_t, \tau_t)$ and $\hat P_t$.
15:    solve $\hat g_t + \hat h_t(\psi) = \max_{u\in U}\{\bar r(\psi, u) + \sum_{y\in S^u} V(\psi, y, u)\,\hat h_t(T_{\hat P_t}(\psi, y, u))\}$, $\forall \psi \in \Psi$.
16:    compute the indices of all actions at $\hat\psi_t$:
17:    $\forall u \in U$, $I_t(\hat\psi_t, u) = \sup_{\tilde P \in \Xi}\{\bar r(\hat\psi_t, u) + \sum_{y\in S^u} V(\hat\psi_t, y, u)\,\hat h_t(T_{\tilde P}(\hat\psi_t, y, u))\}$ such that
18:    $\|\tilde P - \hat P_t\|_1 \le \sqrt{2\log t / N^u}$.
19:    Let $u^*$ be the arm with the highest index (arbitrarily select one if there is more than one such arm).
20:    $u(t) = u^*$.
21:  end if
22:  Receive reward $r^{u(t)}(t)$, i.e., the state of $u(t)$ at $t$.
23:  Compute $(s_{t+1}, \tau_{t+1})$.
24:  $N^{u(t)} = N^{u(t)} + 1$
25:  if $u(t-1) = u(t)$ then
26:    for $i, j \in S^{u(t)}$ do
27:      if state $j$ is observed at $t$ and state $i$ is observed at $t-1$ then
28:        $N^{u(t)}_{i,j} = N^{u(t)}_{i,j} + 1$, $C^{u(t)}_i = C^{u(t)}_i + 1$.
29:      end if
30:    end for
31:  end if
32:  $t := t + 1$
33: end while
Fig. 3. Pseudocode for the Adaptive Learning Algorithm (ALA).
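The following is a schematic Python sketch of the loop in Fig. 3. It keeps the explore/exploit switch and the counting logic of the pseudocode, while the AROE solution and index computation of steps 14-18 are abstracted behind a placeholder callable; the sampling of arm states is likewise supplied by the caller. Everything named here is an assumption of the sketch rather than part of the paper.

    import numpy as np
    from collections import defaultdict

    def ala(draw_state, S_sizes, T, f, compute_indices):
        """Schematic sketch of the ALA loop. `draw_state(k)` samples the next state of arm k when it
        is pulled, `f(t)` is the exploration threshold, and `compute_indices(p_hat, s, tau, t)`
        stands in for steps 14-18 (solve the estimated AROE, return one index per arm)."""
        K = len(S_sizes)
        N_trans = defaultdict(int)                 # N^k_{i,j}: observed i -> j transitions of arm k
        C = defaultdict(int)                       # C^k_i: observed transitions out of state i
        s = [draw_state(k) for k in range(K)]      # initialization: play each arm once
        tau = [K - k for k in range(K)]
        u_prev, reward = K - 1, 0.0
        for t in range(1, T + 1):
            # step 3: normalized sample-mean estimates (the common denominator of p_bar cancels)
            p_hat = []
            for k in range(K):
                M = np.array([[1.0 if N_trans[(k, i, j)] == 0 else N_trans[(k, i, j)]
                               for j in range(S_sizes[k])] for i in range(S_sizes[k])])
                p_hat.append(M / M.sum(axis=1, keepdims=True))
            W = [(k, i) for k in range(K) for i in range(S_sizes[k]) if C[(k, i)] < f(t)]
            if W:                                  # EXPLORE: stay on the previous arm if it still
                u = u_prev if any(k == u_prev for k, _ in W) else W[0][0]   # needs exploration
            else:                                  # EXPLOIT: pick the arm with the highest index
                u = int(np.argmax(compute_indices(p_hat, s, tau, t)))
            y = draw_state(u)                      # observed state = reward
            reward += y
            if u == u_prev:                        # steps 25-31: update counts on consecutive plays
                N_trans[(u, s[u], y)] += 1
                C[(u, s[u])] += 1
            tau = [1 if k == u else tau[k] + 1 for k in range(K)]
            s[u], u_prev = y, u
        return reward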
In this section we propose the Adaptive Learning Algorithm (ALA) given in Figure 3, as a learning
algorithm for the player. ALA consists of exploration and exploitation phases. In the exploration phase the player selects each arm for a certain time to form estimates of the transition probabilities, while in the exploitation phase the player selects an arm according to the optimal policy based on the estimated transition probabilities. At each time step, the player decides if it is an exploration phase or exploitation phase based on the accuracy of transition probability estimates. Let Cik (t) be the number of times a transition from state i of arm k to any state of arm k is observed by time t. Let f (t) be a non-negative, increasing function which sets a condition on the accuracy of estimates. If Cik (t) < f (t) for some k ∈ K, i ∈ S k , the player explores at time t. Otherwise, the player exploits at time t. In other words, the player
concludes that the sample mean estimates are accurate enough to compute the optimal action correctly when $C^k_i(t) \ge f(t)$, $\forall k \in K$, $i \in S^k$. In an exploration step, in order to update the estimate of $p^k_{ij}$, $j \in S^k$, the player does the following. It selects arm $k$ until state $i$ is observed, then selects arm $k$ again to observe the next state after $i$. It forms a sample mean estimate $\bar p^k_{ij}$ for each $j \in S^k$. In order for these estimates to form a probability distribution, the player should have $\sum_{j\in S^k} \bar p^k_{ij} = 1$. Therefore, instead of the estimates $\bar p^k_{ij}$, the player uses the normalized estimates $\hat p^k_{ij} = \bar p^k_{ij}/(\sum_{l\in S^k} \bar p^k_{il})$. If ALA is in the exploitation phase at time $t$, the player first computes $\hat\psi_t$, the estimated belief at
compute the solution at every time step, independent of the complexity of the problem. This solution is used to compute the indices It (ψˆt , u) for each action u ∈ U at estimated belief ψˆt . It (ψˆt , u) represents the advantage of choosing action u starting from information state ψˆt , i.e, the sum of gain and bias, inflated by the uncertainty about the transition probability estimates based on the number of times arm u is played. After computing the indices for each action, the player selects the action with the highest index. In case of a tie, one of the actions with the highest index is randomly selected. Note that it is possible to update the state transition probabilities even in the exploitation phase given that the arm selected at times t − 1 and t are the same. Thus Cik (t) may also increase in an exploitation phase, and the number of explorations may be smaller than the number of explorations needed in the worst case, in which the transition probability estimates are only updated at exploration steps. In the following sections we will denote ALA by α.
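As an illustration of how the indices $I_t(\hat\psi_t, u)$ of steps 16-18 could be evaluated in practice, the sketch below approximates the supremum over the confidence ball $\|\tilde P - \hat P_t\|_1 \le \sqrt{2\log t/N^u}$ by sampling random candidates inside the ball. This sampling approximation, and all of the callable arguments, are assumptions of the sketch; the paper only defines the index as a supremum and does not prescribe a computational procedure.

    import numpy as np

    def optimistic_index(psi_hat, u, P_hat, N_u, t, r_bar, V, T_op, h_hat, S_u, n_samples=200, seed=0):
        """Approximate I_t(psi_hat, u): the largest one-step value achievable by a transition-matrix
        vector P_tilde within l1 distance sqrt(2 log t / N_u) of the estimate P_hat."""
        rng = np.random.default_rng(seed)
        radius = np.sqrt(2.0 * np.log(t) / max(N_u, 1))
        best = -np.inf
        for _ in range(n_samples):
            # candidate: mix each estimated matrix with a random row-stochastic matrix, with the
            # mixing weight chosen so the total l1 perturbation stays inside the ball
            rand = [rng.dirichlet(np.ones(Pk.shape[1]), size=Pk.shape[0]) for Pk in P_hat]
            dist = sum(np.abs(Rk - Pk).sum(axis=1).max() for Rk, Pk in zip(rand, P_hat))
            w = min(1.0, radius / max(dist, 1e-12))
            P_tilde = [(1 - w) * Pk + w * Rk for Pk, Rk in zip(P_hat, rand)]
            val = r_bar(psi_hat, u) + sum(V(psi_hat, y, u, P_tilde) * h_hat(T_op(psi_hat, y, u, P_tilde))
                                          for y in S_u)
            best = max(best, val)
        return best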
VII. AN UPPER BOUND FOR REGRET
For any admissible policy $\gamma$, the regret with respect to the optimal $T$-horizon policy is given in (1), which we restate below:
$$\sup_{\gamma' \in \Gamma}\left( E^P_{\psi_0,\gamma'}\left[\sum_{t=1}^{T} r^{\gamma'(t)}(t)\right] \right) - E^P_{\psi_0,\gamma}\left[\sum_{t=1}^{T} r^{\gamma(t)}(t)\right].$$
First, we derive the regret with respect to the optimal policy as a function of the number of suboptimal plays. Before proceeding, we define expressions to compactly represent the right-hand side of the AROE. Let
$$L(\psi, u, h, P) = \bar r(\psi, u) + (V(\psi, \cdot, u) \bullet h(T_P(\psi, \cdot, u))), \qquad L^*(\psi, P) = \max_{u\in U} L(\psi, u, h_P, P).$$
Let
$$\Delta(\psi, u; P) = L^*(\psi, P) - L(\psi, u, h_P, P) \qquad (9)$$
denote the degree of suboptimality of action $u$ at information state $\psi$ when the set of transition probability matrices is $P$. From Proposition 1 of [15] we have, for all $\gamma \in \Gamma$,
$$R^\gamma_{(\psi_0;P)}(T) = \sum_{t=0}^{T-1} E^P_{\psi_0,\gamma}[\Delta(\psi_t, U_t; P)], \qquad (10)$$
where we have used the subscript $(\psi_0; P)$ to denote the dependence of the regret on the initial belief and the transition probabilities. We assume that initially all the arms are sampled once, thus the initial belief is $\psi_0 = \psi_P((s_0, \tau_0))$. For the true set of transition probability matrices $P$, let $\tau'$ and $\epsilon$ be such that Assumption 3 holds. Specifically, let $\tau'$ be the minimum over all possible values so that Assumption 3 holds, and $\epsilon$ be the maximum over all possible values given $\tau'$ so that Assumption 3 holds. Then the $\epsilon$-extensions of the sets $G_l \in G_{\tau'}$ are $J_{l,\epsilon}$. Note that at any $t$ the belief satisfies $\psi_t \in J_{l,\epsilon}$ for some $l$. When $\epsilon$ is clear from the context, we simply write $J_{l,\epsilon}$ as $J_l$. Let
$$\bar\Delta(J_l, u; P) = \sup_{\psi\in J_l} \Delta(\psi, u; P).$$
Note that if $U_t \in O(\psi_t; P)$ then $\Delta(\psi_t, U_t; P) = 0$; otherwise, if $U_t \notin O(\psi_t; P)$, then $\Delta(\psi_t, U_t; P) \le \bar\Delta(J_l, U_t; P)$
w.p.1. Let
$$N_T(J_l, u) = \sum_{t=0}^{T-1} I(\psi_t \in J_l, U_t = u).$$
We have
$$R^\gamma_{(\psi_0;P)}(T) \le \sum_{t=0}^{T-1} E^P_{\psi_0,\gamma}\Big[\sum_{l=1}^{A} \sum_{u\notin O(J_l;P)} I(\psi_t \in J_l, U_t = u)\,\bar\Delta(J_l, u; P)\Big] = \sum_{l=1}^{A} \sum_{u\notin O(J_l;P)} E^P_{\psi_0,\gamma}\Big[\sum_{t=0}^{T-1} I(\psi_t \in J_l, U_t = u)\Big]\,\bar\Delta(J_l, u; P) = \sum_{l=1}^{A} \sum_{u\notin O(J_l;P)} E^P_{\psi_0,\gamma}[N_T(J_l, u)]\,\bar\Delta(J_l, u; P). \qquad (11)$$
Now consider ALA, which is denoted by $\alpha$. We will upper bound $N_T(J_l, u)$ for suboptimal actions $u$ under $\alpha$ by a sum of expressions which we will bound individually. Let $E_t$ be the event that ALA exploits at time $t$ and $F_t = \{\|\hat h_t - h_P\|_\infty \le \epsilon\}$. Define
$$D_{1,1}(T,\epsilon,J_l,u) = \sum_{t=0}^{T-1} I(\hat\psi_t \in J_l, U_t = u, E_t, F_t, I_t(\hat\psi_t, u) \ge L^*(\hat\psi_t, P) - 2\epsilon),$$
$$D_{1,2}(T,\epsilon,J_l,u) = \sum_{t=0}^{T-1} I(\hat\psi_t \in J_l, U_t = u, E_t, F_t, I_t(\hat\psi_t, u) < L^*(\hat\psi_t, P) - 2\epsilon),$$
$$D_{1,3}(T,\epsilon) = \sum_{t=0}^{T-1} I(E_t, F_t^C),$$
$$D_1(T,\epsilon,J_l,u) = D_{1,1}(T,\epsilon,J_l,u) + D_{1,2}(T,\epsilon,J_l,u) + D_{1,3}(T,\epsilon),$$
$$D_{2,1}(T,\epsilon) = \sum_{t=0}^{T-1} I(\|\psi_t - \hat\psi_t\|_1 > \epsilon, E_t),$$
$$D_{2,2}(T,\epsilon,J_l) = \sum_{t=0}^{T-1} I(\|\psi_t - \hat\psi_t\|_1 \le \epsilon, \hat\psi_t \notin J_l, \psi_t \in J_l, E_t),$$
$$D_2(T,\epsilon,J_l) = D_{2,1}(T,\epsilon) + D_{2,2}(T,\epsilon,J_l).$$
Lemma 6: For any $P$ satisfying Assumption 2,
$$E^P_{\psi_0,\gamma}[N_T(J_l, u)] \le E^P_{\psi_0,\gamma}[D_1(T,\epsilon,J_l,u)] + E^P_{\psi_0,\gamma}[D_2(T,\epsilon,J_l)] + E^P_{\psi_0,\gamma}\Big[\sum_{t=0}^{T-1} I(E_t^C)\Big]. \qquad (12)$$
Proof:
$$N_T(J_l, u) = \sum_{t=0}^{T-1} \big(I(\psi_t \in J_l, U_t = u, E_t) + I(\psi_t \in J_l, U_t = u, E_t^C)\big)$$
$$\le \sum_{t=0}^{T-1} I(\psi_t \in J_l, \hat\psi_t \in J_l, U_t = u, E_t) + \sum_{t=0}^{T-1} I(\psi_t \in J_l, \hat\psi_t \notin J_l, U_t = u, E_t) + \sum_{t=0}^{T-1} I(E_t^C)$$
$$\le \sum_{t=0}^{T-1} I(\hat\psi_t \in J_l, U_t = u, E_t) + \sum_{t=0}^{T-1} I(\psi_t \in J_l, \hat\psi_t \notin J_l, E_t) + \sum_{t=0}^{T-1} I(E_t^C)$$
$$\le D_{1,1}(T,\epsilon,J_l,u) + D_{1,2}(T,\epsilon,J_l,u) + D_{1,3}(T,\epsilon) + D_{2,1}(T,\epsilon) + D_{2,2}(T,\epsilon,J_l) + \sum_{t=0}^{T-1} I(E_t^C).$$
The result follows from taking the expectation of both sides.
VIII. ANALYSIS OF THE REGRET OF ALA
In this section we show that when $P$ is such that Assumptions 1 and 3 hold, if the player uses ALA with $f(t) = L\log t$ with $L$ sufficiently large, i.e., the exploration constant $L \ge C(P)$, where $C(P)$ is a constant that depends on $P$, then the regret due to explorations is logarithmic in time, while the regret due to all other terms is finite, independent of $t$. Note that since the player does not know $P$, it cannot know how large it should choose $L$. For simplicity we assume that the player starts with an $L$ that is large enough without knowing $C(P)$. We also prove a near-logarithmic regret result when the player sets $f(t) = L(t)\log t$, where $L(t)$ is a positive increasing function of time such that $\lim_{t\to\infty} L(t) = \infty$. Using Lemma 2, we will show that the probability that an estimated transition probability is significantly different from the true transition probability, given that ALA is in an exploitation phase, is very small.
Lemma 7: For any $\epsilon > 0$, for an agent using ALA with constant $L \ge 3/(2\epsilon^2)$,
$$P\big(|\bar p^k_{ij}(t) - p^k_{ij}| > \epsilon, E_t\big) \le \frac{2}{t^2},$$
for all $t > 0$, $i, j \in S^k$, $k \in K$.
Proof: See Appendix A.
Lemma 8: For any $\epsilon > 0$, for an agent using ALA with constant $L \ge C_P(\epsilon)$, we have
$$P\big(|\hat p^k_{ij,t} - p^k_{ij}| > \epsilon, E_t\big) \le \frac{2}{t^2},$$
for all $t > 0$, $i, j \in S^k$, $k \in K$, where $C_P(\epsilon)$ is a constant that depends on $P$ and $\epsilon$.
Proof: See Appendix B.
A. Bounding the Expected Number of Explorations
Lemma 9:
$$E^P_{\psi_0,\alpha}\Big[\sum_{t=0}^{T-1} I(E_t^C)\Big] \le \Big(\sum_{k=1}^{K} |S^k|\Big) L \log T\, (1 + T_{\max}), \qquad (13)$$
where Tmax = maxk∈K,i,j∈S k E[Tijk ] + 1, Tijk is the hitting time of state j of arm k starting from state i of arm k . Since all arms are ergodic E[Tijk ] is finite for all k ∈ K, i, j ∈ S k . Proof: The number of transition probability updates that results from explorations up to time T − 1 P P is at most K k=1 i∈S k L log T . The expected time spent in exploration during a single update is at most (1 + Tmax ).
B. Bounding $E^P_{\psi_0,\alpha}[D_2(T,\epsilon,J_l)]$
Lemma 10: For $L \ge C_P(\epsilon/(K S_{\max}^2 |S^1|\cdots|S^K| C_1(P)))$ we have
$$E^P_{\psi_0,\alpha}[D_{2,1}(T,\epsilon)] \le 2 K S_{\max}^2 \beta, \qquad (14)$$
where C1 (P ) = maxk∈K C1 (P k , ∞) and C1 (P k , t) is given in Lemma 1. Proof: Consider t > 0 K K Y Y k |(ψˆt )x − (ψt )x | = (Pˆtk )τ eksk k − (P k )τk eksk k x x k=1
≤
k=1
K X
ˆk τ k k (Pt ) esk
k=1
xk
− (P k )τk eksk k x
K
X
ˆk τ k k k τk k ≤
(Pt ) esk − (P ) esk
1
k=1 K
X
ˆk k ≤ C1 (P )
Pt − P , k=1
(15)
1
where last inequality follows from Lemma 1. By (15) K
X
ˆ
ˆk
Pt − P k .
ψt − ψt ≤ |S 1 | . . . |S K |C1 (P ) 1
k=1
1
Thus we have
P ψˆt − ψt > , Et ≤ P 1
K
X
ˆk
Pt − P k > /(|S 1 | . . . |S K |C1 (P )), Et 1
k=1
≤
K X
ˆk k P Pt − P > /(K|S 1 | . . . |S K |C1 (P )), Et 1
k=1
≤
!
K X
X
P
|ˆ pkij,t
−
pkij |
k=1 (i,j)∈S k ×S k 2 ≤ 2KSmax
, Et > 2 |S 1 | . . . |S K |C (P )) (KSmax 1
1 , t2
where last inequality follows from Lemma 8. Then, EψP0 ,α [D2,1 (T, )]
=
T −1 X
2 Pψ0 ,α ( ψt − ψˆt > , Et ) ≤ 2KSmax β. 1
t=0
Next we will bound EψP0 ,α [D2,2 (T, , Jl )]. Lemma 11: Let τ 0 be such that Assumption 2 holds. Then for < ξ/2, EψP0 ,α [D2,2 (T, , Jl )] = 0, l = 1, . . . , A.
Proof: By Assumption 3, any ψt ∈ Jl is at least ξ away from the boundary of Jl . Thus given ψˆt is at most away from ψt , it is at least ξ/2 away from the boundary of Jl . C. Bounding EψP0 ,α [D1 (T, , Jl , u)] Define the following function: n o ˜ = (P˜ 1 , . . . , P˜ K ) : P˜ k ∈ Ξk , L(ψ, u, hP , P ˜ ) ≥ L∗ (ψ, P ) − , MakeOpt(ψ, u; P , ) := P For any u and ψ , if MakeOpt(ψ, u; P , ) 6= ∅ for all > 0, then the pair (ψ, u) is called a critical information state-action pair. Let Crit(P ) := {(ψ, u) : u is suboptimal at ψ, MakeOpt(ψ, u; P , ) 6= ∅, ∀ > 0} . Let
2 ˆ ; P , ) := inf{ ˆ −P ˜ ˜ ∈ MakeOpt (ψ, u; P , )}. Jψ,u (P
P
:P 1
Lemma 12: Let τ 0 , δ > 0 be such that and δ < Jψ,u (P ; P , 3)/2, for any ψ ∈ Jl , l = 1, . . . , A(τ 0 ), 2 ) we have (ψ, u) ∈ Crit(P ). Then for L ≥ CP (fP ,3 (δ)/(KSmax 2 EψP0 ,α [D1,1 (T, , Jl , u)] ≤ (2KSmax + 4/δ)β,
(16)
where fP , is a function such that fP , (δ) > 0 for δ > 0 and limδ−→0 fP , (δ) = 0. Proof: See Appendix C 2 β. Lemma 13: For L large enough, EψP0 ,α [D1,2 (T, , Jl , u)] ≤ 2KSmax
Proof: If suboptimal action u is chosen at information state ψˆt this means that for the optimal action u∗ ∈ O(ψˆt ; P ) It (ψˆt , u∗ ) ≤ It (ψˆt , u) < L∗ (ψˆt ; P ) − 2 .
This implies
˜ ∈ Ξ, ˜ ˆ ∀P P − P
t ≤ 1
s
2 log t Nt (u)
ˆ t (T ˜ (ψˆt , ., u∗ ))) < (V (ψˆt , ., u∗ ) • hP (TP (ψˆt , ., u∗ ))) − 2. ⇒ (V (ψˆt , ., u∗ ) • h P
Since on Ft (33) holds, ( ∗
∗
{It (ψˆt , u ) ≤ L (ψˆt ; P ) − 2} ⊂
˜ ∈ Ξ, ˜ ˆ ∀P P − P
t ≤ 1
s
2 log t ⇒ Nt (u)
o (V (ψˆt , ., u∗ ) • hP (TP˜ (ψˆt , ., u∗ ))) < (V (ψˆt , ., u∗ ) • hP (TP (ψˆt , ., u∗ ))) − . But since hP is continuous there exists δ1 > 0 such that the event in (17) implies
∗
TP˜ (ψˆt , y, u∗ ) − TP (ψˆt , y, u∗ ) > δ1 , ∀y ∈ S u . 1
Again since TP (ψ, y, u) is continuous in P , there exists δ2 > 0 such that above equation implies
˜
P − P > δ2 . 1
(17)
Thus (
s
2 log t
˜
⇒ P {It (ψˆt , u∗ ) ≤ L∗ (ψˆt ; P ) − 2} ⊂ − P > δ2 Nt (u) 1 1 s ) (
2 log t
ˆ
⊂ P − P
> δ2 + t Nt (u) 1
n t o
ˆ
⊂ P − P > δ2 .
˜ ∈ Ξ, ˜ −P ˆ t ∀P
P
≤
)
1
Therefore EψP0 ,α [D1,2 (T, , Jl , u)] ≤
T −1 X
ˆ
P P − P > δ , E
t 2 t 1
t=0
≤
T −1 X K X
X
P
t=0 k=1 (i,j)∈S k ×S k
|ˆ pkij,t
−
pkij |
δ2 ≥ , Et 2 KSmax
2 ≤ 2KSmax β, 2 )). for L ≥ CP (δ2 /(KSmax
Next, we consider bound EψP0 ,α D1,3 (T, ).
Lemma 14: For any > 0, there exists ς > 0 depending on such that if P k − Pˆ k < ς, ∀k ∈ K 1
then hP − hPˆ ∞ < . Proof: ˜ such that Assumption 1 holds, and r¯(ψ), V ˜ , T ˜ Since hP˜ is continuous in ψ by lemma 4 for any P P P ˜ , for any ψ ∈ Ψ. are continuous in P gPˆ + hPˆ (ψ) = arg max u∈U
r¯(ψ, u) +
VPˆ (ψ, y, u)hPˆ (TPˆ (ψ, y, u)) u
X y∈S
X ˆ , ψ, u) , VP (ψ, y, u)hPˆ (TP (ψ, y, u)) + q(P , P = arg max r¯(ψ, u) + u∈U u
(18)
y∈S
ˆ , ψ, u) = 0, ∀ψ ∈ Ψ, u ∈ U . for some function q such that limPˆ →P q(P , P ˆ , ψ, u) = r¯(ψ, u) + q(P , P ˆ , ψ, u). We can write (18) as Let r¯(P , P X ˆ gPˆ + hPˆ (ψ) = arg max r¯(P , P , ψ, u) + VP (ψ, y, u)hPˆ (TP (ψ, y, u)) u∈U u y∈S
(19)
Note that (19) is the average reward optimality equation for a system with set of transition probability ˆ , ψ, u). Since lim ˆ ˆ matrices P , and perturbed rewards r¯(P , P ¯(ψ, u), ∀ψ ∈ Ψ, u ∈ P →P r(P , P , ψ, u) = r U , we expect hPˆ to converge to hP . Next, we prove that this is true. Let FPˆ denote the dynamic ˆ , ψ, u). Then, programming operator defined in (6), with transition probabilities P and rewards r(P , P
by S-1 of Lemma (4), there exists a sequence of functions v0,Pˆ , v1,Pˆ , v2,Pˆ , . . . such that v0,Pˆ = 0, vl,Pˆ = FPˆ vl−1,Pˆ and another sequence of functions v0,P , v1,P , v2,P , . . . such that v0,P = 0, vl,P = FP vl−1,P ,
for which lim v ˆ l→∞ l,P
= hPˆ ,
(20)
lim vl,P = hP ,
(21)
l→∞
uniformly in ψ . ˆ) = Next, we prove that for any l ∈ {1, 2, . . .}, limPˆ →P vl,Pˆ = vl,P uniformly in ψ . Let qmax (P , P ˆ , ψ, u)|. By Equation 2.27 of [1], we have supu∈U,ψ∈Ψ |q(P , P
n o n o sup |vl,Pˆ (ψ) − vl,P (ψ)| = sup |F l−1 v1,Pˆ (ψ) − F l−1 vl,P (ψ)|
ψ∈Ψ
ψ∈Ψ
n o ≤ sup |v1,Pˆ (ψ) − v1,P (ψ)| ψ∈Ψ
ˆ ), ≤ 2qmax (P , P
(22)
where the last inequality follows form v0,P = 0, v0,Pˆ = 0, and v1,P (ψ) = max {¯ r(ψ, u)} u∈U n o ˆ , ψ, u) . v1,Pˆ (ψ) = max r¯(ψ, u) + q(P , P u∈U
ˆ n }∞ which converges to P . Since limn→∞ qmax (P , P ˆ n ) = 0, for any > 0, Consider a sequence {P n=1 ˆ n ) < /2, which implies by (22) that there exists N0 such that for all n > N0 we have qmax (P , P
n o sup |vl,Pˆ n (ψ) − vl,P (ψ)| < ,
ψ∈Ψ
for all ψ ∈ Ψ. Therefore, for any l ∈ {1, 2, . . .}, we have lim vl,Pˆ = vl,P ,
ˆ →P P
(23)
24
uniformly in ψ . Using (20) and (21), for any > 0 and any n ∈ {1, 2, . . .}, there exists N1 (n) such that for any l > N1 (n) and ψ ∈ Ψ, we have |vl,Pˆ n (ψ) − hPˆ n (ψ)| < /3, |vl,P (ψ) − hP (ψ)| < /3.
Similarly using (23), for any > 0, there exists N0 such that for all n > N0 and ψ ∈ Ψ, we have |vl,Pˆ n (ψ) − vl,P (ψ)| ≤ /3.
(24)
These imply that for any > 0, there exists N2 ≥ N0 and such that for all n > N2 , such that for all ψ ∈ Ψ, we have |hP (ψ) − hPˆ n (ψ)| ≤ |hP (ψ) − vl,P (ψ)| + |vl,Pˆ n (ψ) − vl,P (ψ)| + |vl,Pˆ n (ψ) − hPˆ n (ψ)| < ,
since there exists some l > N1 (n) such that (24) holds. Therefore, for any > 0 there exists some η > 0 ˆ | < η implies |hP (ψ) − h ˆ (ψ)|∞ ≤ . such that |P − P P
Lemma 15: For any > 0, let ς > 0 be such that Lemma 14 holds. Then for an agent using ALA 2 ), we have with L ≥ CP (ς/Smax 2 EψP0 ,α [D1,3 (T, )] ≤ 2KSmax β .
Proof: We have by Lemma 14,
n o
k
P − Pˆtk < ς, ∀k ∈ K} ⊂ {khP − ht k∞ < } , 1
which implies
n o
k k ˆ
P − Pt ≥ ς, for some k ∈ K ⊃ {khP − ht k∞ ≥ } . 1
(25)
25
Then "T −1 X
EψP0 ,α D1,3 (T, ) = EψP0 ,α
# I(Et , FtC )
t=0
≤
T −1 X
k k ˆ P ( P − Pt ≥ ς, for some k ∈ K, Et ) 1
t=0
≤
K X
X
T −1 X
P
ς
|pkij − pˆkij,t | >
2 Smax
k=1 (i,j)∈S k ×S k t=0
, Et
2 ≤ 2KSmax β .
D. Logarithmic regret upper bound
Theorem 1: Let $\tau'$ be a mixing time such that Assumptions 1, 2 and 3 are true. Under these assumptions, for an agent using ALA with $L$ sufficiently large, for any action $u \in U$ which is suboptimal for the belief vectors in $J_l$,
$$E^P_{\psi_0,\alpha}[N_T(J_l, u)] \le \Big(\sum_{k=1}^{K} |S^k|\Big) L \log T\,(1 + T_{\max}) + (8 K S_{\max}^2 + 4/\delta)\beta,$$
for some $\delta > 0$ depending on $L$. Therefore
$$R^\alpha_{\psi_0;P}(T) \le \big(L \log T\,(1 + T_{\max}) + (8 K S_{\max}^2 + 4/\delta)\beta\big) \times \sum_{l=1}^{A(\tau')} \sum_{u\notin O(J_l;P)} \bar\Delta(J_l, u; P).$$
Proof: The result follows from Lemmas 9, 10, 11, 12, 13, 15 and (11). IX. E XTENSIONS TO THE A DAPTIVE L EARNING A LGORITHM (ALA) In this section we propose several extensions to ALA and relaxation of certain assumptions. Firstly, we consider an adaptive exploration function for ALA, by which the player can achieve near-logarithmic regret without knowing a sufficient bound on the distance between the true and estimated transition probabilities (such that the exploration constant L can be chosen large enough). Secondly, we present a modified algorithm for which Assumption 2 can be relaxed. We prove that the modified algorithm can achieve logarithmic regret without Assumption 2. A. An adaptive exploration function We note that the analysis in Section VIII holds when ALA is run with a sufficiently large exploration ˆ is close enough constant L such that at each exploitation step the estimated transition probabilities P
to P to guarantee that all regret terms in (12) is finite except the regret due to explorations which is logarithmic in time. In other words, there is an L0 (P ) > 0 such that when ALA is run with L ≥ L0 (P ) ˆ − P ||1 ≤ δ(P ), where δ(P ) > 0 is a constant for , at the end of each exploration step we have ||P
which Theorem 1 holds. However, in our learning model we assumed that the player does not know the transition probabilities ˜ ⊂ Ξ be the set of transition initially, therefore it is impossible for the player to check if L ≥ L0 (P ). Let Ξ ˜ , then it can compute L ˜ ), ˜ 0 = sup ˜ ˜ L0 (P probability matrices where P lies in. If the player knows Ξ P ∈Ξ ˜ 0. and choose L > L
In this section, we present another exploration function for ALA such that the player can achieve near˜ 0 . Let f (t) = L(t) log t where L(t) is an increasing logarithmic regret even without knowing L0 (P ) or L
function such that L(1) = 1 and limt→∞ L(t) = ∞. The intuition behind this exploration function is that after some time T0 , L(t) will be large enough so that the estimated transition probabilities are sufficiently accurate, and the regret due to incorrect calculations is a constant independent of time. Theorem 2: When P is such that Assumptions 1 and 3 hold. If the player uses ALA with f (t) = L(t) log t, for some L(t) such that L(1) = 1 and limt→∞ L(t) = ∞, then there exists constant δ(L) > 0, τ 0 (P ) > 0, T0 (L, P ) > 0 such that the regret is upper bounded by ! K X α k Rψ0 ;P (T ) ≤ rmax T0 (L, P ) + |S | L(T ) log T (1 + Tmax ) k=1
A(τ 0 )
2 +(8KSmax + 4/δ(L))β
X
X
¯ l , u; P ) ∆(J
l=1 u∈O(J / l ;P )
≤ rmax
T0 (L, P ) +
K X
! k
|S | L(T ) log T (1 + Tmax )
k=1 2 +(8KSmax
0 M
+ 4/δ(L))β(τ )
K X k=1
! k
|S |
! max
l∈{1,...,A(τ 0 )}
¯ l , u; P ) ∆(J
Proof: The regret up to T0 (L, P ) can be at most rmax T0 (L, P ). After T0 (L, P ), since L(t) ≥ L0 (P ), transition probabilities at exploitation steps sufficiently accurate so that all regret terms in (12) except the regret due to explorations is finite. Since time t is an exploration step whenever Cik (t) < L(t) log t, the regret due to explorations is rmax
K X k=1
! k
|S | L(T ) log T (1 + Tmax )
Remark 3: There is a tradeoff between choosing a rapidly increasing L or a slowly increasing L. The regret of ALA up to time T0 (L, P ) is linear. Since T0 (L, P ) is decreasing in function L, a rapidly increasing L will have better performance when the considered time horizon is small. However, in terms of asymptotic performance, i.e., as T → ∞, L should be a slowly diverging sequence. For example if L = log(log t), then the asymptotic regret will be O(log(log t) log t).
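A minimal sketch of the adaptive exploration threshold discussed above (the small offsets inside the logarithms only keep the toy expression defined for small t and are our assumption, not part of the paper):

    import math

    def exploration_threshold(t, L=lambda t: math.log(math.log(t + 3.0))):
        """Adaptive exploration function f(t) = L(t) log t of Theorem 2: L(t) is an increasing
        function diverging to infinity; log(log t) is the example discussed in Remark 3."""
        return L(t) * math.log(t + 1.0)

    # ALA explores at time t whenever some pair (k, i) has C^k_i(t) < exploration_threshold(t).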
B. Adaptive Learning Algorithm with Finite Partitions In this section, we consider a variant of ALA which is called adaptive learning algorithm with finite partitions (ALA-FP). We show that ALA-FP achieves logarithmic regret even when Assumption 2 does not hold. Basically, ALA-FP takes as input the mixing time τ 0 , then forms Gτ 0 partition of the set of information states C . At each exploitation step ALA-FP solves the estimated AROE based on the ˆ , and if the belief (s, τ ) is in Gl , the player arbitrarily picks an arm transition probability estimated P ˆ ). When time t is an exploitation step, ˆ ), instead of picking an arm in O(ψ ˆ ((s, τ )); P in O∗ (Gl ; P P
the player plays optimally by choosing an arm in O(ψP ((s, τ )); P ), or near-optimally by choosing an arm in O∗ (Gl ; P ) different from O(ψP ((s, τ )); P ) such that (s, τ ) ∈ Gl , or suboptimally by choosing an arm that is neither in O(ψP ((s, τ )); P ) nor in O∗ (Gl ; P ). By Lemma 5 we know that when τ 0 is chosen large enough, for any Gl ∈ Gτ 0 , and (s, τ ), O(ψP ((s, τ )); P ) is a subset of O∗ (Gl ; P ). Since the solution to the AROE is a continuous function, by choosing a large enough τ 0 , we can control the regret due to near-optimal actions. The regret due to suboptimal actions can be bounded the same way as in Theorem 1. The following theorem gives a logarithmic upper bound on the regret of ALA-FP. Theorem 3: When the true set of transition probabilities P is such that Assumption 1 is true, for a player using ALA-FP with exploration constant L, and mixing time τ 0 sufficiently large such that for any (s, τ ) ∈ Gl , Gl ∈ Gτ 0 , we have |hP (ψP ((s, τ ))) − hP (ψ ∗ (Gl ; P ))| < C/2T , where C is a constant and T is the time horizon, the regret of ALA-FP is upper bounded by 2 Rψα0 ;P (T ) ≤ C + (L log T (1 + Tmax ) + (8KSmax + 4/δ)β) ×
PA(τ 0 ) P l=1
¯
∆(Jl , u; P ), u∈O(J / l ;P )
for some δ > 0 which depends on L and τ 0 . Proof: The regret at time T is upper bounded by (10). Consider any t which is an exploitation step. Let l be such that (st , τ t ) ∈ Gl . If the selected arm α(t) ∈ O(ψP ((s, τ )); P ), then an optimal decision is made at t, so the contribution to regret in time step t is zero. Next, we consider the case when α(t) ∈ / O(ψP ((s, τ )); P ). In this case there are two possibilities. Either α(t) ∈ O∗ (Gl ; P ) or not. ˆ t ) ⊂ O∗ (Gl ; P ) we have α(t) ∈ O∗ (Gl ; P ). Since |hP (ψP ((s, τ ))) − We know that when O∗ (Gl ; P
28
hP (ψ ∗ (Gl ; P ))| < C/2T for all (s, τ ) ∈ Gl , we have by (9), ∆(ψt , α(t); P ) = L∗ (ψt , P ) − L(ψt , α(t), hP , P ) ≤ C/T.
(26)
Therefore, contribution of a near-optimal action to regret is at most C/T . Finally, consider the case when α(t) ∈ / O∗ (Gl ; P ). This implies that either the estimated belief ψˆt is ˆ t is not close enough to hP . Due to the not close enough to ψt or the estimated solution to the AROE h
non-vanishing suboptimality gap at any belief vector ψ ∗ (Gl ; P ), and since decisions of ALA-FP is only based on belief vectors corresponding to (s, τ ) ∈ C , the regret due to suboptimal actions can be bounded by Theorem 1. We get the regret bound by combining all these results. Note that the regret bound in Theorem 3 depends on τ 0 which depends on T since τ 0 should be chosen large enough so that for every Gl in the partition created by τ 0 , function hP should vary by at most C/2T . Clearly since hP is a continuous function, the variation of hP on Gl decreases with the diameter of Gl on the belief space. Note that there is a term in regret that depends linearly on the number of sets A(τ 0 ) in the partition generated by τ 0 , and A(τ 0 ) increases proportional to (τ 0 )M . This tradeoff is not taken into account in Theorem 3. For example, if (τ 0 )M ≥ T then the regret bound in Theorem 3 is useless. Another approach is to jointly optimize the regret due to suboptimal and near-optimal actions by balancing the number of sets A(τ 0 ) and the variation of hP on sets in partition Gτ 0 . For example given 0 < θ ≤ 1, we can find a τ 0 (θ) such that for any (s, τ ) ∈ Gl , Gl ∈ Gτ 0 (θ) we have |hP (ψP ((s, τ ))) − hP (ψ ∗ (Gl ; P ))| < C/(2T θ ). Then, the regret due to near-
optimal decisions will be proportional to CT 1−θ , and the regret due to suboptimal decision will be proportional to (τ 0 (θ))M . Let C = supψ∈Ψ hP (ψ) − inf ψ∈Ψ hP (ψ). Since T 1−θ is decreasing in θ and τ 0 (θ) is increasing in θ, there exists θ ∈ [0, 1], such that θ = arg minθ0 ∈[0,1] |T 1−θ − (τ 0 (θ))M | and |T 1−θ − (τ 0 (θ))M | ≤ (τ 0 (θ) + 1)M − (τ 0 (θ))M . If the optimal value of θ is in (0, 1), then given θ, the
However, since the player does not know P initially, it may not know the optimal value of θ. Designing online learning algorithms by which the player can estimate the optimal value of θ is a future research direction.

X. CONCLUSION

In this paper we proved that, when the transition probabilities of the arms are positive between any pair of states, there exist index policies whose regret with respect to the optimal finite horizon policy is logarithmic uniformly in time. Our future research includes finding approximately optimal policies with polynomial computation time and studying the short-term performance of the proposed algorithms.
APPENDIX A
PROOF OF LEMMA 7

Let t(l) be the time at which C^k_{t(l)}(i) = l. We have
$$\bar{p}^k_{ij}(t) = \frac{N^k_t(i,j)}{C^k_t(i)} = \frac{\sum_{l=1}^{C^k_t(i)} I\big(X^k_{t(l)-1}=i,\ X^k_{t(l)}=j\big)}{C^k_t(i)}.$$
Note that I(X^k_{t(l)-1} = i, X^k_{t(l)} = j), l = 1, 2, …, C^k_t(i), are IID random variables with mean p^k_{ij}. Then
$$P\big(|\bar{p}^k_{ij}(t) - p^k_{ij}| > \epsilon,\ E_t\big) = P\left(\left|\frac{\sum_{l=1}^{C^k_t(i)} I\big(X^k_{t(l)-1}=i,\ X^k_{t(l)}=j\big)}{C^k_t(i)} - p^k_{ij}\right| \geq \epsilon,\ E_t\right)$$
$$= \sum_{b=1}^{t} P\left(\left|\sum_{l=1}^{C^k_t(i)} I\big(X^k_{t(l)-1}=i,\ X^k_{t(l)}=j\big) - C^k_t(i)\,p^k_{ij}\right| \geq C^k_t(i)\,\epsilon,\ C^k_t(i)=b,\ E_t\right)$$
$$\leq \sum_{b=1}^{t} 2 e^{-\frac{2(a\epsilon \log t)^2}{a \log t}} = 2\sum_{b=1}^{t} \frac{1}{t^{2a\epsilon^2}} = \frac{2}{t^{2a\epsilon^2 - 1}} \leq \frac{2}{t^2}, \qquad (27)$$
where we used Lemma 2 and the fact that C^k_t(i) ≥ L log t w.p.1 on the event E_t.
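For concreteness, the following minimal sketch (our own illustration; the function and variable names are not from the paper) computes the empirical transition-probability estimate p̄^k_{ij}(t) = N^k_t(i,j)/C^k_t(i) from an observed state trajectory of a single arm, taking C(i) to be the number of observed departures from state i, which is one simple instantiation of the paper's counters.

```python
import numpy as np

def empirical_transition_matrix(states, num_states):
    """Estimate p_bar[i, j] = N(i, j) / C(i) from an observed state sequence.

    states     : list/array of observed states x_0, x_1, ..., x_t of one arm
    num_states : size of the arm's state space S^k
    Rows with no observed departures are left as zero.
    """
    counts = np.zeros((num_states, num_states))   # N(i, j): observed i -> j transitions
    for i, j in zip(states[:-1], states[1:]):
        counts[i, j] += 1
    visits = counts.sum(axis=1, keepdims=True)     # C(i): observed departures from i
    with np.errstate(invalid="ignore", divide="ignore"):
        p_bar = np.where(visits > 0, counts / visits, 0.0)
    return p_bar

# Example: a short trajectory of a two-state arm
traj = [0, 1, 1, 0, 0, 1, 0, 1, 1, 1]
print(empirical_transition_matrix(traj, 2))
```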
APPENDIX B
PROOF OF LEMMA 8

From symmetry we have
$$P\big(|\hat{p}^k_{ij,t} - p^k_{ij}| > \epsilon,\ E_t\big) = P\big(\hat{p}^k_{ij,t} - p^k_{ij} > \epsilon,\ E_t\big) + P\big(\hat{p}^k_{ij,t} - p^k_{ij} < -\epsilon,\ E_t\big) = 2P\big(\hat{p}^k_{ij,t} - p^k_{ij} > \epsilon,\ E_t\big). \qquad (28)$$
Then
$$P\big(\hat{p}^k_{ij,t} - p^k_{ij} > \epsilon,\ E_t\big) = P\left(\frac{\bar{p}^k_{ij}}{\sum_{l\in S^k}\bar{p}^k_{il}} - p^k_{ij} > \epsilon,\ E_t\right)$$
$$= P\left(\frac{\bar{p}^k_{ij}}{\sum_{l\in S^k}\bar{p}^k_{il}} - p^k_{ij} > \epsilon,\ \Big|\sum_{l\in S^k}\bar{p}^k_{il} - 1\Big| < \delta,\ E_t\right) + P\left(\frac{\bar{p}^k_{ij}}{\sum_{l\in S^k}\bar{p}^k_{il}} - p^k_{ij} > \epsilon,\ \Big|\sum_{l\in S^k}\bar{p}^k_{il} - 1\Big| \geq \delta,\ E_t\right)$$
$$\leq P\left(\frac{\bar{p}^k_{ij}}{1-\delta} - p^k_{ij} > \epsilon,\ E_t\right) + P\left(\Big|\sum_{l\in S^k}\bar{p}^k_{il} - 1\Big| \geq \delta,\ E_t\right). \qquad (29)$$
We have
$$P\left(\frac{\bar{p}^k_{ij}}{1-\delta} - p^k_{ij} > \epsilon,\ E_t\right) = P\big(\bar{p}^k_{ij} - p^k_{ij} > (1-\delta)\epsilon - \delta p^k_{ij},\ E_t\big) \leq P\big(\bar{p}^k_{ij} - p^k_{ij} > (1-\delta)\epsilon - \delta,\ E_t\big), \qquad (30)$$
since p^k_{ij} ≤ 1. Note that (1−δ)ε − δ is decreasing in δ, so we can choose δ small enough that (1−δ)ε − δ > ε/2. Then
$$P\big(\bar{p}^k_{ij} - p^k_{ij} > (1-\delta)\epsilon - \delta,\ E_t\big) \leq P\Big(\bar{p}^k_{ij} - p^k_{ij} > \frac{\epsilon}{2},\ E_t\Big) \leq \frac{2}{t^2}, \qquad (31)$$
for L ≥ 6/ε².
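The qualitative requirement on δ above can be made explicit; the following rearrangement is our own algebra, not a statement from the paper:
$$(1-\delta)\epsilon - \delta \;=\; \epsilon - \delta(1+\epsilon) \;>\; \frac{\epsilon}{2} \quad\Longleftrightarrow\quad \delta \;<\; \frac{\epsilon}{2(1+\epsilon)},$$
so any δ < ε/(2(1+ε)) is small enough for (31).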
We also have
$$P\left(\Big|\sum_{l\in S^k}\bar{p}^k_{il} - 1\Big| \geq \delta,\ E_t\right) \leq P\left(\sum_{l\in S^k}\big|\bar{p}^k_{il} - p^k_{il}\big| \geq \delta,\ E_t\right).$$
Consider the events
$$A = \Big\{ |\bar{p}^k_{il} - p^k_{il}| < \delta/|S^k|,\ \forall k \in K \Big\}, \qquad B = \Big\{ \sum_{l\in S^k} |\bar{p}^k_{il} - p^k_{il}| < \delta \Big\}.$$
If ω ∈ A then ω ∈ B. Thus A ⊂ B and B^C ⊂ A^C. Then
$$P\left(\sum_{l\in S^k}|\bar{p}^k_{il} - p^k_{il}| \geq \delta,\ E_t\right) = P(B^C, E_t) \leq P(A^C, E_t) = P\left(\bigcup_{k=1}^{K}\Big\{|\bar{p}^k_{il} - p^k_{il}| \geq \delta/|S^k|\Big\},\ E_t\right)$$
$$\leq \sum_{k=1}^{K} P\big(|\bar{p}^k_{il} - p^k_{il}| \geq \delta/S_{\max},\ E_t\big) \leq \frac{2K}{t^2}, \qquad (32)$$
for L ≥ S²_max/(2δ²). Combining (28), (31) and (32) we get
$$P\big(|\hat{p}^k_{ij,t} - p^k_{ij}| > \epsilon,\ E_t\big) \leq \frac{2K + 2}{t^2},$$
for L ≥ max{6/ε², S²_max/(2δ²)} = C_P(ε).
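A minimal sketch of the normalized estimate p̂^k_{ij,t} that appears in (29), formed by dividing p̄^k_{ij} by the row sum Σ_{l∈S^k} p̄^k_{il}, together with the row-sum tolerance check corresponding to the event |Σ_l p̄^k_{il} − 1| < δ. The function and variable names are ours, and the example input is synthetic; the row sum may differ from 1 when, for instance, the transition and visit counters are collected over slightly different sets of time steps.

```python
import numpy as np

def normalized_estimate(p_bar_row, delta):
    """Form a row of p_hat from a row of raw estimates p_bar, as analyzed in Appendix B.

    p_bar_row : raw estimates (p_bar[i, l] for l in S^k); their sum need not be exactly 1
    delta     : row-sum tolerance; |sum_l p_bar[i, l] - 1| < delta is the 'good' event
                used to split the probability in (29)
    """
    row_sum = p_bar_row.sum()
    good_row = abs(row_sum - 1.0) < delta
    p_hat_row = p_bar_row / row_sum if row_sum > 0 else np.full_like(p_bar_row, np.nan)
    return p_hat_row, good_row

# Example: a slightly unnormalized raw row estimate
p_bar = np.array([0.48, 0.55])
print(normalized_estimate(p_bar, delta=0.1))
```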
APPENDIX C
PROOF OF LEMMA 12

First consider the case (ψ̂_t, u) ∉ Crit(P). Then there exists ε_0 > 0 such that MakeOpt(ψ, u; P, ε_0) = ∅. On the event F_t we have
$$\left|\sum_{y\in S^u} V(\hat{\psi}_t, y, u)\Big(\hat{h}_t\big(T_{\tilde{P}}(\hat{\psi}_t, y, u)\big) - h_P\big(T_{\tilde{P}}(\hat{\psi}_t, y, u)\big)\Big)\right| \leq \epsilon, \quad \forall u \in U,\ \tilde{P} \in \Xi. \qquad (33)$$
Thus for ε < ε_0/3,
$$I_t(\hat{\psi}_t, u) \leq \sup_{\tilde{P}\in\Xi}\Big\{\bar{r}(\hat{\psi}_t, u) + \sum_{y\in S^u} V(\hat{\psi}_t, y, u)\,\hat{h}_t\big(T_{\tilde{P}}(\hat{\psi}_t, y, u)\big)\Big\}$$
$$\leq \sup_{\tilde{P}\in\Xi}\Big\{\bar{r}(\hat{\psi}_t, u) + \sum_{y\in S^u} V(\hat{\psi}_t, y, u)\,h_P\big(T_{\tilde{P}}(\hat{\psi}_t, y, u)\big)\Big\} + \epsilon$$
$$< L^*(\hat{\psi}_t, P) - \epsilon_0 + \epsilon < L^*(\hat{\psi}_t, P) - 2\epsilon,$$
which implies that I(ψ̂_t ∈ J_l, U_t = u, E_t, F_t, I_t(ψ̂_t, u) ≥ L*(ψ̂_t, P) − 2ε) = 0.
Next, we consider the case (ψ̂_t, u) ∈ Crit(P). Note that by Lemma ?? τ′ can be selected such that the suboptimality gap at any belief in J_l, l ∈ {1, …, A(τ′)}, is at least δ″ > 0. Consider (ψ, u) ∈ Crit(P), ψ ∈ J_l for some l ∈ {1, …, A(τ′)}. Let 3ε = Cδ″, where 0 < C < 1 is some constant. We have L(ψ, u, h_P, P) ≤ L*(ψ, P) − δ″. Since MakeOpt(ψ, u; P, Cδ″) ≠ ∅, for any P̃ such that
$$L(\psi, u, h_P, \tilde{P}) \geq L^*(\psi, P) - C\delta'',$$
we have
$$L(\psi, u, h_P, \tilde{P}) - L(\psi, u, h_P, P) \geq (1-C)\delta''$$
$$\Rightarrow\ \sum_{y\in S^u} V(\psi, y, u)\Big(h_P\big(T_{\tilde{P}}(\psi, y, u)\big) - h_P\big(T_{P}(\psi, y, u)\big)\Big) \geq (1-C)\delta''$$
$$\Rightarrow\ \max_{y\in S^u} \Big|h_P\big(T_{\tilde{P}}(\psi, y, u)\big) - h_P\big(T_{P}(\psi, y, u)\big)\Big| \geq (1-C)\delta''$$
$$\Rightarrow\ \|\tilde{P} - P\|_1 \geq 2\delta,$$
for some δ > 0 by continuity of h_P and T_P. Therefore J_{ψ,u}(P; P, 3ε) > 2δ for any (ψ, u) ∈ Crit(P) with ψ ∈ J_l for some l ∈ {1, …, A(τ′)}. Before proceeding, we also note that J_{ψ,u}(P̂; P, ε) is continuous in its first argument. Therefore there exists a function f_{P,ε} with f_{P,ε}(δ) > 0 for δ > 0 and lim_{δ→0} f_{P,ε}(δ) = 0, such that |J_{ψ,u}(P̂; P, ε) − J_{ψ,u}(P; P, ε)| > δ implies ‖P̂ − P‖₁ > f_{P,ε}(δ).
The event {I_t(ψ̂_t, u) ≥ L*(ψ̂_t, P) − 2ε} is equivalent to
$$\exists \tilde{P} \in \Xi:\ \|\hat{P}_t - \tilde{P}\|_1^2 \leq \frac{2\log t}{N_t(u)},\quad \bar{r}(\hat{\psi}_t, u) + V(\hat{\psi}_t, \cdot, u) \bullet \hat{h}_t\big(T_{\tilde{P}}(\hat{\psi}_t, \cdot, u)\big) \geq L^*(\hat{\psi}_t, P) - 2\epsilon. \qquad (34)$$
On the event F_t we have
$$\left|\sum_{y\in S^u} V(\hat{\psi}_t, y, u)\Big(\hat{h}_t\big(T_{\tilde{P}}(\hat{\psi}_t, y, u)\big) - h_P\big(T_{\tilde{P}}(\hat{\psi}_t, y, u)\big)\Big)\right| \leq \epsilon, \quad \forall u \in U,\ \tilde{P} \in \Xi.$$
Thus (34) implies
$$\exists \tilde{P} \in \Xi:\ \|\hat{P}_t - \tilde{P}\|_1^2 \leq \frac{2\log t}{N_t(u)},\quad \bar{r}(\hat{\psi}_t, u) + V(\hat{\psi}_t, \cdot, u) \bullet h_P\big(T_{\tilde{P}}(\hat{\psi}_t, \cdot, u)\big) \geq L^*(\hat{\psi}_t, P) - 3\epsilon. \qquad (35)$$
From the definition of J_{ψ,u}(P̂; P, ε), (35) implies
$$J_{\hat{\psi}_t, u}(\hat{P}_t; P, 3\epsilon) \leq \frac{2\log t}{N_t(u)}.$$
Thus we have
$$D_{1,1}(T, \epsilon, J_l, u) \leq \sum_{t=0}^{T-1} I\Big(\hat{\psi}_t \in J_l,\ U_t = u,\ J_{\hat{\psi}_t,u}(\hat{P}_t; P, 3\epsilon) \leq \frac{2\log t}{N_t(u)},\ (\hat{\psi}_t, u) \in \mathrm{Crit}(P),\ E_t\Big)$$
$$\leq \sum_{t=0}^{T-1} I\Big(\hat{\psi}_t \in J_l,\ U_t = u,\ E_t,\ (\hat{\psi}_t, u) \in \mathrm{Crit}(P),\ J_{\hat{\psi}_t,u}(P; P, 3\epsilon) \leq \frac{2\log t}{N_t(u)} + \delta\Big) \qquad (36)$$
$$+ \sum_{t=0}^{T-1} I\Big(\hat{\psi}_t \in J_l,\ U_t = u,\ E_t,\ (\hat{\psi}_t, u) \in \mathrm{Crit}(P),\ J_{\hat{\psi}_t,u}(P; P, 3\epsilon) > J_{\hat{\psi}_t,u}(\hat{P}_t; P, 3\epsilon) + \delta\Big). \qquad (37)$$
Note that (36) is less than or equal to
$$\frac{4\log T}{\delta}. \qquad (38)$$
By the continuity argument mentioned above, J_{ψ̂_t,u}(P; P, 3ε) > J_{ψ̂_t,u}(P̂_t; P, 3ε) + δ implies ‖P̂_t − P‖₁ > f_{P,3ε}(δ). Thus (37) is upper bounded by
$$\sum_{t=0}^{T-1} I\Big(\|\hat{P}_t - P\|_1 > f_{P,3\epsilon}(\delta),\ E_t\Big).$$
Taking the expectation we have
$$\sum_{t=0}^{T-1} P\Big(\|\hat{P}_t - P\|_1 > f_{P,3\epsilon}(\delta),\ E_t\Big) \leq \sum_{t=0}^{T-1}\sum_{k=1}^{K}\sum_{(i,j)\in S^k\times S^k} P\left(|\hat{p}^k_{ij,t} - p^k_{ij}| \geq \frac{f_{P,3\epsilon}(\delta)}{K S_{\max}^2}\right) \leq K S_{\max}^2 \beta. \qquad (39)$$
Combining (38) and (39) we have
$$E^{P,\alpha}_{\psi_0}\big[D_{1,1}(T, \epsilon, J_l, u)\big] \leq (K S_{\max}^2 + 4/\delta)\beta.$$