A Generalised Gittins Index for a class of Multi-Armed Bandits with General Resource Requirements

K. D. Glazebrook and R. Minty
Department of Mathematics and Statistics, Lancaster University, Lancaster, LA1 4YF, UK.

Abstract

We generalise classical multi-armed and restless bandits to allow for the distribution of a (fixed amount of a) divisible resource among the constituent bandits at each decision point. Bandit activation consumes amounts of the available resource which may vary by bandit and state. Any collection of bandits may be activated at any decision epoch provided they do not consume more resource than is available. We propose suitable bandit indices which reduce to those proposed by Gittins (1979) and Whittle (1988) for the classical models. In multi-armed bandit cases in which bandits do not change state when passive, the index which emerges is an elegant generalisation of the Gittins index which measures in a natural way the reward earnable from a bandit per unit of resource consumed. The paper discusses both how such indices may be computed and how they may be used to construct heuristics for resource distribution. We also describe how to develop bounds on the closeness to optimality of index heuristics and demonstrate a form of asymptotic optimality for a greedy index heuristic in a class of simple models. A numerical study testifies to the strong performance of a weighted index heuristic.
1 Introduction
The paper concerns developments of the classical work of Gittins (1979, 1989) who elucidated index-based solutions to a family of Markov decision processes (MDPs) called multi-armed bandit problems (MABs). This work concerned the optimal allocation of a single indivisible resource among a collection of stochastic projects (or bandits as they are sometimes called). Gittins' original contribution, namely to demonstrate that optimal project choices are those of largest index, has given rise to a substantial literature describing a range of extensions to and reformulations of his result. Gittins (1989) gives an extensive bibliography of early contributions. A more recent survey is due to Mahajan and Teneketzis (2007). An important limitation on the modelling power of Gittins' MABs is the critical assumption that bandits are frozen while not in receipt of resource. In response to this, Whittle (1988) introduced a class of restless bandit problems (RBPs) in which projects may change state whether active or passive, though according to different dynamics. In contrast to Gittins' MABs, RBPs are almost certainly intractable and have been shown to be PSPACE-hard by Papadimitriou and Tsitsiklis (1999). Whittle used a Lagrangian relaxation to develop an index heuristic (which reduces to Gittins' index policy in the MAB case) for those RBPs which pass an indexability test. Weber and Weiss (1990, 1991) demonstrated a form of asymptotic optimality for Whittle's index
policy but otherwise the model's perceived difficulty inhibited further substantial progress for some time. More recently Niño-Mora (2001, 2002) and Glazebrook, Niño-Mora and Ansell (2002) have explored indexability issues from a polyhedral perspective. Further, a range of empirical studies have demonstrated the power and practicability of Whittle's approach in a range of application contexts. See, for example, Ansell et al. (2003), Opp, Glazebrook and Kulkarni (2005), Glazebrook, Kirkbride and Ruiz-Hernandez (2006a,b) and Glazebrook and Kirkbride (2004, 2007).

A significant limitation on the applicability of the above classical models is the very simple view they take of the resource to be allocated. In Gittins' MABs a single indivisible resource is to be allocated en bloc to a single bandit at each decision epoch. Whittle's RBP formulation contemplates parallel server versions of this. While such views of the resource may sometimes be appropriate, many applications concern a divisible resource (for example, money or manpower) in situations where its over-concentration would usually be far from optimal. Indeed, this was an issue for Gittins in his work on the planning of new product pharmaceutical research which provided the motivation for his pioneering contribution. See, for example, Chapter 5 of Gittins (1989).

Stimulated by such considerations, we develop in Section 2 a family of restless bandit models in which a fixed amount of some divisible resource is available for distribution among a collection of projects at each decision epoch. Activation of a bandit consumes a quantity of resource which is in general both bandit and state dependent. Any collection of bandits may be activated at any decision epoch provided they do not consume more resource than is available. Plainly, the development of good policies must now take serious account not only of the rewards which the bandits are capable of securing, but also of the amount of resource they will consume in the process. Notions of bandit indexability are developed via a Lagrangian relaxation approach which suitably extends that of Whittle (1988) and index heuristics are proposed for indexable systems.

From Section 3 we specialise the model of Section 2 to a general class of MABs with switching costs. Formally, we extend the work of Glazebrook, Kirkbride and Ruiz-Hernandez (2006b) to embrace the much more general view of resource allocation described in the preceding paragraph. We demonstrate indexability for this class under mild conditions. The index which results is an elegant generalisation of the Gittins index which measures in a natural way the reward earnable from a bandit per unit of resource consumed. Succeeding sections discuss how these indices are computed (Section 4) and how polyhedral ideas may be exploited to analyse how close index-based heuristics are to optimality (Section 5). A form of asymptotic optimality is established for a greedy index heuristic in a class of simple models. The paper concludes with a numerical study (Section 6).
2 Indexability and Index Heuristics
We shall consider a restless bandit model with general resource requirements, as outlined in the Introduction. More specifically, we focus on a family of reward discounted Markov decision processes whose main features are as follows:

(i) The state space of the process is ×_{n=1}^N Ω_n with Ω_n the (finite or countable) state space of bandit n. The state of the process is observed at each time t ∈ N. At time t, we write X̄(t) = {X_1(t), X_2(t), . . . , X_N(t)} for the process state, with X_n(t) ∈ Ω_n the state of bandit n.
(ii) At each time t ∈ N an action ᾱ = {α_1, α_2, . . . , α_N} is applied to the process, with α_n ∈ {a, b} the action applied to bandit n. The action a is active and calls for the positive commitment of resource. The alternative action b is passive. Under action ᾱ the N bandits evolve independently, each according to its own Markov law. We write

P{X̄(t + 1) = ȳ | X̄(t) = x̄, ᾱ} = ∏_{n=1}^{N} P{X_n(t + 1) = y_n | X_n(t) = x_n, α_n}.   (1)
(iii) For all n, r_n^a and r_n^b are bounded reward functions which map the state space Ω_n into the positive reals. The expected reward earned when action ᾱ is applied to the process in state x̄ is written

r(ᾱ, x̄) = Σ_{n: α_n = a} r_n^a(x_n) + Σ_{n: α_n = b} r_n^b(x_n).
It is natural to assume that expected rewards earned under the active action a uniformly exceed those earned under the passive action b, namely

r_n^a(x_n) ≥ r_n^b(x_n), x_n ∈ Ω_n, 1 ≤ n ≤ N,   (2)
but we do not impose this as a general requirement. All rewards and costs are discounted according to rate β ∈ (0, 1).

(iv) For all n, c_n^a and c_n^b are bounded consumption functions which map the state space Ω_n into the non-negative reals. The amount of resource consumed when action ᾱ is applied to the process in state x̄ is written

c(ᾱ, x̄) = Σ_{n: α_n = a} c_n^a(x_n) + Σ_{n: α_n = b} c_n^b(x_n).

We shall require that the amount of resource consumed under the active action a uniformly exceeds that under the passive action b, namely

c_n^a(x_n) > c_n^b(x_n), x_n ∈ Ω_n, 1 ≤ n ≤ N.   (3)
The set of admissible actions in process state x̄ is given by A(x̄) = {ᾱ; c(ᾱ, x̄) ≤ C} where C is the total resource available at each decision epoch. We shall suppose that A(x̄) ≠ ∅, x̄ ∈ ×_{n=1}^N Ω_n. We also require that ∃ x̄ ∈ ×_{n=1}^N Ω_n for which A(x̄) ≠ {a, b}^N, namely that resource availability strictly constrains project activation.

(v) A policy u is a rule for choosing an admissible action at each decision epoch. Such a rule can in principle be a function of the entire history of the process to date. We shall seek a policy to maximise the total expected reward earned over an infinite horizon. Standard theory (see, for example, Puterman (1994)) asserts the existence of an optimal policy which is stationary (makes decisions in the light of the current process state only) and which satisfies the optimality equations of dynamic programming (DP). That said, a pure DP approach to the above problem is unlikely to yield insight and will be computationally intractable for problems of realistic size. Hence the primary requirement is for strongly performing heuristic policies.
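As a small illustration of (i)–(v), the sketch below (Python; the container `Bandit` and the helpers `total_consumption` and `is_admissible` are hypothetical names of ours, not from the paper) encodes the model primitives and the admissibility check c(ᾱ, x̄) ≤ C which defines A(x̄).

```python
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, Sequence

State = Hashable
Action = str  # 'a' (active) or 'b' (passive)

@dataclass
class Bandit:
    """Primitives of a single restless bandit (hypothetical container)."""
    reward: Dict[Action, Callable[[State], float]]            # r^a, r^b
    consumption: Dict[Action, Callable[[State], float]]       # c^a, c^b
    transition: Dict[Action, Callable[[State, State], float]] # P{y | x, action}

def total_consumption(bandits: Sequence[Bandit],
                      actions: Sequence[Action],
                      states: Sequence[State]) -> float:
    """c(alpha_bar, x_bar): resource consumed by the joint action."""
    return sum(b.consumption[act](x)
               for b, act, x in zip(bandits, actions, states))

def is_admissible(bandits, actions, states, C: float) -> bool:
    """Membership of A(x_bar) = {alpha_bar : c(alpha_bar, x_bar) <= C}."""
    return total_consumption(bandits, actions, states) <= C
```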
Following the classical work of Gittins (1979, 1989) and Whittle (1988) we shall seek heuristics in the form of index policies. Hence, we shall seek calibrating functions ν_n : Ω_n → R, 1 ≤ n ≤ N, which will guide the construction of good actions in each process state. In order to develop such indices, we shall seek a decomposition of (a relaxation of) our optimization problem into N individual problems, one for each bandit. It is these individual problems which, when suitably structured, will yield the calibrating functions required. We proceed via a series of relaxations as follows: we state the optimization problem of interest, expressed in (i)–(v) above, as

R^opt(x̄) = sup_{u∈U} { Σ_{n=1}^N R_n^u(x̄) }.   (4)
In (4), U is the class of stationary policies for the problem and R_n^u(x̄) the total expected reward yielded by bandit n under policy u from initial state x̄. We relax (4) by extending to the class of stationary policies U(x̄) such that

Σ_{n=1}^N C_n^u(x̄) ≤ C(1 − β)^{-1}, u ∈ U(x̄).   (5)
In (5), C_n^u(x̄) is the total expected resource consumed by bandit n under policy u, with successive resource consumptions discounted at rate β. Explicitly, we have

C_n^u(x̄) = E_u[ Σ_{t=0}^∞ β^t ( c_n^a{X_n(t)} I[u{X̄(t)} ∈ A_n] + c_n^b{X_n(t)} I[u{X̄(t)} ∈ B_n] ) | X̄(0) = x̄ ],
   x̄ ∈ ×_{n=1}^N Ω_n, 1 ≤ n ≤ N,   (6)
where, in (6), E_u denotes an expectation taken over realisations of the process evolving under stationary policy u, A_n (respectively B_n) is the subset of {a, b}^N consisting of actions whose nth component is a (respectively b) and I is an indicator. Plainly, U(x̄) ⊇ U, x̄ ∈ ×_{n=1}^N Ω_n. Hence we relax (4) to

R̄^opt(x̄) = sup_{u∈U(x̄)} { Σ_{n=1}^N R_n^u(x̄) }.   (7)
We now further relax the problem by firstly extending to the class of stationary policies U′ which are allowed a free choice of action from {a, b}^N in every state and then incorporating the constraint (5) into the objective in a Lagrangian fashion. We write

R^opt(x̄, ν) = sup_{u∈U′} [ Σ_{n=1}^N {R_n^u(x̄) − νC_n^u(x̄)} ] + νC(1 − β)^{-1}.   (8)

Plainly,

R^opt(x̄, ν) ≥ R̄^opt(x̄) ≥ R^opt(x̄), x̄ ∈ ×_{n=1}^N Ω_n, ν ∈ R^+.
It is clear that the construction of the Lagrangian relaxation (8) permits an additive decomposition by individual bandit. We write

R^opt(x̄, ν) = Σ_{n=1}^N R_n^opt(x_n, ν) + νC(1 − β)^{-1}.   (9)
In (9), R_n^opt(x_n, ν) is the value of a stochastic DP involving bandit n alone in which we seek a policy u_n to choose from actions {a, b} in each state to maximise the total expected reward (R_n^{u_n}(x_n), say) net of penalties imposed from resource consumption (νC_n^{u_n}(x_n), say). Note that the Lagrange multiplier ν ≥ 0 has an economic interpretation as a charge imposed per unit of resource consumed. Denote this optimization problem for bandit n by p(n, ν). Plainly, from (9), an optimal policy for the Lagrangian relaxation in (8) simply runs optimal policies for the p(n, ν), 1 ≤ n ≤ N, in parallel.

We now follow Whittle (1988) in requiring structure in optimal policies for the p(n, ν) which will permit development of an appropriate calibrating function ν_n : Ω_n → R^+ for bandit n. In order to do this, we write b_n(u_n) for the passive set corresponding to policy u_n which is stationary for p(n, ν), namely b_n(u_n) := {x_n ∈ Ω_n; u_n(x_n) = b}. Definition 1 describes both the structure required and the calibrating function which results from it.

Definition 1 Bandit n is indexable if there exists a family of policies {u_n(ν), ν ∈ R^+} such that (I) u_n(ν) is optimal for p(n, ν); and (II) b_n{u_n(ν)} is non-decreasing in ν. Under (I) and (II), the corresponding index for bandit n, ν_n : Ω_n → R^+, is given by

ν_n(x_n) = inf { ν ∈ R^+; x_n ∈ b_n{u_n(ν)} }.   (10)
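As an illustration of how Definition 1 might be used numerically, the following sketch (illustrative Python/numpy; the function names, the upper bound on ν and the tolerances are our own choices, not part of the paper) solves the single-bandit problem p(n, ν) by value iteration for a given charge ν and then bisects on ν to approximate the index (10) at a chosen state. It presumes the bandit is indexable, so that the active/passive advantage at the state changes sign monotonically in ν.

```python
import numpy as np

def solve_p_n_nu(Pa, Pb, ra, rb, ca, cb, beta, nu, tol=1e-10):
    """Value iteration for the single-bandit problem p(n, nu):
    maximise expected reward net of a charge nu per unit of resource consumed.
    Returns the value function V and the advantage Q_a - Q_b in each state."""
    S = len(ra)
    V = np.zeros(S)
    while True:
        Qa = ra - nu * ca + beta * Pa @ V   # active action
        Qb = rb - nu * cb + beta * Pb @ V   # passive action
        V_new = np.maximum(Qa, Qb)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Qa - Qb
        V = V_new

def index_by_bisection(Pa, Pb, ra, rb, ca, cb, beta, state,
                       nu_hi=1e4, iters=60):
    """nu_n(x) = inf{nu >= 0 : passive optimal in x for p(n, nu)} (eq. (10)),
    located by bisection; assumes indexability, so that the advantage at the
    given state is eventually non-positive as the charge nu grows."""
    lo, hi = 0.0, nu_hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        _, adv = solve_p_n_nu(Pa, Pb, ra, rb, ca, cb, beta, mid)
        if adv[state] > 0.0:   # active still strictly preferred: charge too low
            lo = mid
        else:                  # passive optimal: charge at or above the index
            hi = mid
    return hi                  # approximates the fair charge nu_n(state)
```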
From (10), ν_n(x_n) has an interpretation as a fair charge (per unit of resource) for the additional resource consumed in going from passive (b) to active (a) when bandit n is in state x_n. Further, this index generalises those of Gittins (1979) and Whittle (1988) to our more complex resource consumption set up. Hence, if we take all the c_n^a to be identically one and all the c_n^b to be identically zero then (10) is the index developed by Whittle (1988). If we further impose the requirement on each bandit of no state evolution under the passive action then (10) is the Gittins index. Note that, if all bandits are indexable, there exists an optimal policy for the Lagrangian relaxation in (8) which, in state x̄, chooses the active action for each bandit n if and only if the fair charge ν_n(x_n) is no less than the prevailing charge ν ≥ 0.

When all bandits are indexable we propose a greedy index heuristic (GI) for the optimization problem (4) which in each process state x̄ constructs an action by adding bandits to the active set in decreasing order of the indices ν_n(x_n) until the point is reached where any further such addition is either of a zero index bandit and/or violates the resource constraint. This heuristic corresponds to the index policies of Gittins (1979) and Whittle (1988) in the simpler settings considered by them. In our context, GI runs the risk of poor utilisation of the available resource, for example, in cases where the marginal resource allocations c_n^a(x_n) − c_n^b(x_n) are large and irregular. In analyses of some simple models in Section 5 we shall see that this emerges as the central concern regarding the performance of GI. One suggested approach in such cases is to use the weighted index heuristic (WI) which in each process state x̄ chooses an action to maximise the sum

Σ_n {c_n^a(x_n) − c_n^b(x_n)} ν_n(x_n)   (11)

within the resource constraint. The sum in (11) is taken over all active bandits. These issues and policies are discussed further in the upcoming sections in the context of a class of multi-armed bandits (MABs) with general resource requirements.
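To make the two heuristics concrete, here is a hedged sketch (Python; not from the paper) of the action construction in a given process state: GI adds bandits in decreasing index order, while WI searches feasible active sets for one maximising (11). The brute-force subset search for WI is for exposition only; (11) defines a knapsack-type problem for which standard methods could be substituted for larger N.

```python
from itertools import combinations

def greedy_index_action(indices, marginal_consumption, base_consumption, C):
    """GI: add bandits to the active set in decreasing order of nu_n(x_n)
    until the next addition has zero index or would violate the constraint.
    indices[n]               ~ nu_n(x_n)
    marginal_consumption[n]  ~ c_n^a(x_n) - c_n^b(x_n)
    base_consumption         ~ sum_n c_n^b(x_n) (consumed if all are passive)."""
    active, used = set(), base_consumption
    for n in sorted(range(len(indices)), key=lambda m: -indices[m]):
        if indices[n] <= 0.0 or used + marginal_consumption[n] > C:
            break
        active.add(n)
        used += marginal_consumption[n]
    return active

def weighted_index_action(indices, marginal_consumption, base_consumption, C):
    """WI: choose a feasible active set maximising
    sum over active n of {c_n^a(x_n) - c_n^b(x_n)} * nu_n(x_n)  (cf. (11))."""
    N = len(indices)
    best, best_value = set(), 0.0
    for k in range(N + 1):
        for subset in combinations(range(N), k):
            cost = base_consumption + sum(marginal_consumption[n] for n in subset)
            if cost > C:
                continue
            value = sum(marginal_consumption[n] * indices[n] for n in subset)
            if value > best_value:
                best, best_value = set(subset), value
    return best
```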
3 Multi-Armed Bandits with general resource requirements
We proceed to illustrate the ideas of the preceding section by specialising (i)–(v) above to a family of multi-armed bandit problems (MABs) with switching penalties. Hence bandits do not undergo state transitions when passive and so (1) is specialised to cases where

P{X_n(t + 1) = y_n | X_n(t) = x_n, b} = δ(x_n, y_n), (x_n, y_n) ∈ Ω_n^2, 1 ≤ n ≤ N   (12)
with δ(·, ·) the Kronecker delta. Equation (12) notwithstanding, the incorporation of switching penalties in the model necessitates extension of the "natural" state of each bandit to incorporate information concerning the action which was applied to it at the preceding decision epoch. When the state of each bandit is extended in this way, changes of bandit state will occur whenever a switch of actions happens and a simple form of restlessness is induced. From the discussion in Section 2 up to and around Definition 1, indexability is a characteristic of individual bandits and it is these which become the focus of our discussion in this section and the next. We are thus able to drop the bandit identifier n and use B to denote an individual bandit within our MAB model with switching penalties. Its prime characteristics are as follows:

(i)′ The bandit's enhanced state space is {a, b} × Ω where Ω is finite or countable. If we denote by X(t) the enhanced state of the bandit at time t ∈ Z^+, then X(t) = (·, x), · ∈ {a, b}, x ∈ Ω, means that the state of the bandit is x and that action · was applied to the bandit at time t − 1. When needed we use X(t) for the (non-enhanced) bandit state, i.e. X(t) = (·, x) ⟹ X(t) = x.

(ii)′ Should action a be applied to the bandit at time t, where X(t) = (·, x), then the new state will be X(t + 1) = (a, y), with y determined according to the Markov law P. We write

P{X(t + 1) = (a, y) | X(t) = (·, x), a} = P_{xy}, x, y ∈ Ω.   (13)

Should action b be applied to the bandit at time t, where X(t) = (·, x), then

P{X(t + 1) = (b, x) | X(t) = (·, x), b} = 1, x ∈ Ω.   (14)
(iii)′ The reward functions r^a, r^b, S^a, S^b all map the state space Ω into the non-negative reals R^+ and are bounded. The expected reward earned by the transition in (13) is r^a(x) when · = a and is r^a(x) − S^a(x) when · = b. Hence S^a(x) is a cost incurred whenever the bandit is switched on (goes from passive to active) in state x. The expected reward earned by the transition in (14) is r^b(x) − S^b(x) when · = a and is r^b(x) when · = b. Hence S^b(x) is a cost incurred when the bandit is switched off in state x.

(iv)′ The consumption functions c^a, c^b, Ŝ^a, Ŝ^b map the state space Ω into the non-negative reals R^+ and are bounded. The resource consumed when action a is taken in enhanced state (·, x) is c^a(x) when · = a and c^a(x) + Ŝ^a(x) when · = b. When action b is taken in enhanced state (·, x) the resource consumed is c^b(x) + Ŝ^b(x) when · = a and c^b(x) when · = b. Hence additional resource may be consumed both when the bandit is switched on or off.

The following technical assumption will be used in the analysis.

Condition 1 (Resource consumption) The functions c^a, c^b, Ŝ^b and the Markov law P are such that

c^a(x) + β Σ_{y∈Ω} P_{xy} {Ŝ^b(y) + c^b(y)(1 − β)^{-1}} > Ŝ^b(x) + c^b(x)(1 − β)^{-1}, x ∈ Ω.   (15)
In words, the above resource consumption condition states that, starting from a position in which the passive action is applied to the bandit for all times from initial enhanced state (a, x), the expected total (discounted) amount of resource consumed is increased if the action taken at time 0 is changed from passive to active. The condition is trivially satisfied when c^a is positive-valued and c^b ≡ 0, Ŝ^b ≡ 0, namely that the bandit consumes no resource when passive. Inequality (15) can be rendered non-strict in what follows at the cost of some additional complexity.

From Section 2, exploration of the indexability of B involves consideration of structural properties of optimal policies for the associated problem p(ν) in which (discounted) charges are levied for resource consumption at a rate of ν per unit. We write V : {a, b} × Ω × R^+ → R for the value function for p(ν), with V{(·, x), ν} the maximal return from the bandit (net of charges for resource consumption) when X(0) = (·, x) and the resource charge is ν ≥ 0. By standard theory, value function V satisfies the optimality equations

V{(a, x), ν} = max[ r^a(x) − νc^a(x) + β Σ_{y∈Ω} P_{xy} V{(a, y), ν} ;
   r^b(x) − S^b(x) − ν{c^b(x) + Ŝ^b(x)} + βV{(b, x), ν} ]   (16)

and

V{(b, x), ν} = max[ r^a(x) − S^a(x) − ν{c^a(x) + Ŝ^a(x)} + β Σ_{y∈Ω} P_{xy} V{(a, y), ν} ;
   r^b(x) − νc^b(x) + βV{(b, x), ν} ]   (17)
where both equations hold ∀ x ∈ Ω. In equation (16) the two terms within the maximisation on the right hand side are the expected returns when action a, respectively b, is applied at time 0 to the bandit when in enhanced state (a, x) and when actions are taken optimally thereafter. Equation (17) is constructed similarly for the enhanced state (b, x).

The next result describes an important property of optimal policies for p(ν). It arises because a switch of processing away from an active bandit followed immediately by a transfer of processing back to it can only introduce unnecessarily incurred switching costs and unnecessarily consumed additional resource. Such action sequences can therefore be eliminated from consideration.

Lemma 1 If action b is optimal for (a, x) then it is also optimal for (b, x) and will be strictly optimal when S^a(x) + S^b(x) + ν{Ŝ^a(x) + Ŝ^b(x)} > 0.

Proof If action b is optimal for (a, x) then by standard results, the maximisation on the r.h.s. of (16) is achieved by the second term, namely

r^b(x) − S^b(x) − ν{c^b(x) + Ŝ^b(x)} + βV{(b, x), ν} ≥ r^a(x) − νc^a(x) + β Σ_{y∈Ω} P_{xy} V{(a, y), ν}.

It must then follow that

r^b(x) − νc^b(x) + βV{(b, x), ν} ≥ r^a(x) − νc^a(x) + S^b(x) + νŜ^b(x) + β Σ_{y∈Ω} P_{xy} V{(a, y), ν}
   ≥ r^a(x) − νc^a(x) − S^a(x) − νŜ^a(x) + β Σ_{y∈Ω} P_{xy} V{(a, y), ν}

with the second inequality strict when S^a(x) + S^b(x) + ν{Ŝ^a(x) + Ŝ^b(x)} > 0. It follows that the maximisation on the r.h.s. of (17) is achieved by the second term, uniquely so if the inequality in the statement of the lemma holds. The result now follows from standard results. □

It follows from the preceding lemma that in the search for solutions to p(ν), we may restrict attention to stationary policies u in the class Φ, where

Φ := {u; (x; x ∈ Ω and u(a, x) = b) ⊆ (x; x ∈ Ω and u(b, x) = b)}.

Suppose now that X(0) = (·, x), x ∈ Ω, and consider evolution of the bandit under some u ∈ Φ. There are two possibilities. Either u{X(0)} = b, in which case u's stationarity, its membership of Φ and the nature of its evolution under b guarantee that u{X(t)} = b, t ≥ 0. Alternatively, u{X(0)} = a, in which case u will continue to choose action a until time τ, where

τ = inf{t; t ≥ 1 and X(t) ∈ (y; y ∈ Ω and u(a, y) = b)},   (18)

and then u's stationarity, its membership of Φ and the nature of its evolution under b guarantee that u{X(t)} = b, t ≥ τ. Hence, the determination of optimal actions for p(ν) may be reduced to a collection of stopping problems, one for each (initial) enhanced state, where stopping connotes first application of the passive action b. In order to develop the necessary ideas, consider the bandit B evolving from some initial enhanced state under continuous application of the active action a. We now introduce the class of stationary positive-valued stopping times Ť. Generic class member τ(Θ) is defined by

τ(Θ) := inf{t; t ≥ 1 and X(t) ∈ Θ}   (19)
for some subset Θ ⊆ Ω. Plainly the time τ in (18) is in Ť. It follows from this discussion that the optimality equations (16) and (17) may be replaced by the following:

V{(a, x), ν} = max[ sup_τ { r^a(x) − νc^a(x) + β Σ_{y∈Ω} P_{xy} V_τ{(a, y), ν} } ;
   {r^b(x) − νc^b(x)}(1 − β)^{-1} − S^b(x) − νŜ^b(x) ]   (20)

and

V{(b, x), ν} = max[ sup_τ { r^a(x) − S^a(x) − ν{c^a(x) + Ŝ^a(x)} + β Σ_{y∈Ω} P_{xy} V_τ{(a, y), ν} } ;
   {r^b(x) − νc^b(x)}(1 − β)^{-1} ].   (21)
In (20) and (21), the function V_τ is the total return earned net of charges for resource consumption from the identified initial state under a policy which chooses the active action up to stopping time τ and the passive action beyond. The second term within the maximum on the right hand side of (20) (respectively (21)) is the total return earned from initial state (a, x) (respectively (b, x)) when the passive action is taken at all decision epochs. In both (20) and (21) the suprema are taken over τ ∈ Ť.

In order to progress with an economy of notation, when τ ∈ Ť we shall use the notations r^a(x, τ), c^a(x, τ) for, respectively, the total discounted reward earned and resource consumed during [0, τ) under continuous application of the active action from initial enhanced state X(0) = (a, x). We write

r^a(x, τ) := E[ { Σ_{t=0}^{τ−1} β^t r^a{X(t)} } | x ],   (22)

and similarly for c^a(x, τ), where in (22) and throughout we shall use |x as a notational shorthand for the conditioning on the initial state more fully expressed by |X(0) = x. We can now introduce an appropriate index for the bandit in the form of a real-valued function on the state space {a, b} × Ω. Optimal policies for p(ν) will be describable in terms of this index.

Definition 2 The return/consumption index for B is a function ν : {a, b} × Ω → R given by

ν(a, x) := sup_τ {∆r(a, x, τ)/∆c(a, x, τ)}, x ∈ Ω,

where

∆r(a, x, τ) := r^a(x, τ) − {r^b(x) − E([β^τ r^b{X(τ)}] | x)}(1 − β)^{-1} + S^b(x) − E([β^τ S^b{X(τ)}] | x)

and

∆c(a, x, τ) := c^a(x, τ) − {c^b(x) − E([β^τ c^b{X(τ)}] | x)}(1 − β)^{-1} − {Ŝ^b(x) − E([β^τ Ŝ^b{X(τ)}] | x)};

and

ν(b, x) := sup_τ {∆r(b, x, τ)/∆c(b, x, τ)}, x ∈ Ω,

where

∆r(b, x, τ) := ∆r(a, x, τ) − S^a(x) − S^b(x) and ∆c(b, x, τ) := ∆c(a, x, τ) + Ŝ^a(x) + Ŝ^b(x).

Both of the above suprema are over τ ∈ Ť. We shall write ν^+ : {a, b} × Ω → R^+ for the positive part of function ν, namely

ν^+(·, x) = max{ν(·, x), 0}, · ∈ {a, b}, x ∈ Ω.
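For a finite natural state space, all of the quantities entering Definition 2 can be computed exactly for any stopping time τ(Θ) ∈ Ť, since the discounted sums accrued over [0, τ(Θ)) and the terms E[β^{τ(Θ)} f{X(τ(Θ))} | x] satisfy first-step linear equations under the active law P. The sketch below (illustrative Python/numpy, not taken from the paper; inputs are arrays indexed by the natural states, and a brute-force enumeration of stopping sets is used purely for exposition) evaluates ∆r(a, x, τ(Θ)) and ∆c(a, x, τ(Θ)) and hence ν(a, x).

```python
import numpy as np
from itertools import combinations

def stopped_quantities(P, beta, g, f, Theta):
    """For tau = tau(Theta) = inf{t >= 1 : X(t) in Theta} under the active law P:
    returns  v(x) = E[ sum_{t<tau} beta^t g(X(t)) | x ]  and
             w(x) = E[ beta^tau f(X(tau)) | x ]  for every state x."""
    S = len(g)
    cont = np.array([0.0 if s in Theta else 1.0 for s in range(S)])  # continue?
    M = np.eye(S) - beta * P * cont        # (I - beta P D), D kills stopped states
    v = np.linalg.solve(M, g)              # accumulate g until stopping
    w = np.linalg.solve(M, beta * P @ (f * (1.0 - cont)))  # collect f at stopping
    return v, w

def index_a(P, beta, ra, rb, Sb, ca, cb, Sb_hat, x):
    """nu(a, x) = sup over stopping sets Theta of Delta_r / Delta_c (Definition 2)."""
    S = len(ra)
    best = -np.inf
    for k in range(1, S + 1):              # enumerate nonempty stopping sets
        for Theta in combinations(range(S), k):
            Theta = set(Theta)
            r_acc, _ = stopped_quantities(P, beta, ra, ra, Theta)
            c_acc, _ = stopped_quantities(P, beta, ca, ca, Theta)
            _, Erb = stopped_quantities(P, beta, ra, rb, Theta)
            _, ESb = stopped_quantities(P, beta, ra, Sb, Theta)
            _, Ecb = stopped_quantities(P, beta, ca, cb, Theta)
            _, ESbh = stopped_quantities(P, beta, ca, Sb_hat, Theta)
            dr = r_acc[x] - (rb[x] - Erb[x]) / (1 - beta) + Sb[x] - ESb[x]
            dc = c_acc[x] - (cb[x] - Ecb[x]) / (1 - beta) - (Sb_hat[x] - ESbh[x])
            best = max(best, dr / dc)      # dc > 0 under Condition 1
    return best
```

The component ν(b, x) then follows from the same quantities via ∆r(b, x, τ) = ∆r(a, x, τ) − S^a(x) − S^b(x) and ∆c(b, x, τ) = ∆c(a, x, τ) + Ŝ^a(x) + Ŝ^b(x).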
Theorem 1 (Indexability and indices) The bandit B is indexable with index given by the positive part of the return/consumption index above. Further, ν^+(a, x) ≥ ν^+(b, x), ∀ x ∈ Ω.

Proof Consider enhanced state (a, x), x ∈ Ω, and stopping time τ ∈ Ť. The expression within the supremum on the right hand side of (20) may be written

r^a(x) − νc^a(x) + β Σ_{y∈Ω} P_{xy} V_τ{(a, y), ν}
   = r^a(x, τ) − νc^a(x, τ) − E(β^τ S^b{X(τ)} | x) − νE(β^τ Ŝ^b{X(τ)} | x)
   + E(β^τ [r^b{X(τ)} − νc^b{X(τ)}] | x)(1 − β)^{-1}.   (23)

Reading the expression on the r.h.s. of (23) from left to right, we first account for the rewards earned net of charges for resource consumption under action a during [0, τ), then the switching costs incurred at τ when the choice of action changes from a to b and finally the net rewards earned under action b during [τ, ∞). From (20) the active action is strictly optimal in (a, x) when ∃ τ ∈ Ť for which the quantity in (23) exceeds

{r^b(x) − νc^b(x)}(1 − β)^{-1} − S^b(x) − νŜ^b(x).

Straightforward algebra yields the conclusion that this is equivalent to the requirement that ∃ τ ∈ Ť for which

∆r(a, x, τ) > ν∆c(a, x, τ).

Condition 1 above guarantees that ∆c(·, x, τ) > 0 for all enhanced states (·, x) ∈ {a, b} × Ω and stationary positive-valued stopping times τ ∈ Ť. It now follows that action a is optimal in (a, x) whenever there exists τ ∈ Ť for which

∆r(a, x, τ)/∆c(a, x, τ) > ν.
This will be the case if ν^+(a, x) = ν(a, x) > ν. By continuity of (optimised) returns as ν varies, action a must also be optimal in state (a, x) when ν^+(a, x) = ν. However, if ν(a, x) < ν then ∆r(a, x, τ)/∆c(a, x, τ) < ν ∀ τ ∈ Ť, from which it follows that action b is strictly optimal in state (a, x). We conclude that action a is optimal in state (a, x) if and only if ν^+(a, x) ≥ ν. Further, action b is optimal in state (a, x) if and only if ν^+(a, x) ≤ ν. A similar argument using (21) yields the conclusion that action a is optimal in (b, x) if and only if ν^+(b, x) ≥ ν with action b optimal if and only if ν^+(b, x) ≤ ν. We infer the existence of a family of optimal policies {u(ν), ν ∈ R^+} whose associated passive sets

b{u(ν)} = {(·, x); · ∈ {a, b}, x ∈ Ω, ν^+(·, x) ≤ ν}

are non-decreasing in ν. This establishes the indexability of B from Definition 1. That the index is indeed the positive part of the return/consumption index of Definition 2 is now trivial, as is the fact that ν^+(a, x) ≥ ν^+(b, x) ∀ x ∈ Ω. The latter is an immediate consequence of the definitions of the quantities concerned. □

Remark The return/consumption indices above are guaranteed finite and have a natural interpretation. The quantity ∆r(a, x, τ) (respectively ∆r(b, x, τ)) is the increase in expected return achieved when the active action a is taken during [0, τ) instead of the passive action b from enhanced state (a, x) (respectively (b, x)). Similarly, ∆c(a, x, τ) (respectively ∆c(b, x, τ)) is the increased resource consumed. Hence the ratio may be understood as a measure of additional return achieved from such processing per unit of additional resource consumed. The return/consumption index may be understood as a generalised form of Gittins index which is able to take varying patterns of resource consumption into account. As we shall see, it shares key structural properties with its predecessor and can be computed efficiently.

We shall now proceed to show that, in the above definition of the return/consumption index, the supremum is always achieved by stopping time(s) which have a simple characterisation. To facilitate the discussion we introduce subsets of state space Ω as follows:

Λ(ν) := {y; y ∈ Ω, ν(a, y) < ν}, ν ∈ R^+, and Ξ(ν) := {y; y ∈ Ω, ν(a, y) = ν}, ν ∈ R^+.

Further, we use Ť(ν) for the collection of stationary positive-valued stopping times given by

Ť(ν) = {τ(Θ); Θ = Λ(ν) ∪ Ξ for some Ξ ⊆ Ξ(ν)}, ν ∈ R^+.
Proposition 1 If ν(·, x) ≥ 0 then any stopping time in Ť{ν(·, x)} achieves the supremum over τ ∈ Ť in the definition of ν(·, x), · ∈ {a, b}, x ∈ Ω.
Proof (i) Consider the problem p{ν(a, x)} in which the charge per unit of resource consumed is fixed at ν(a, x), assumed non-negative. By the argument in the proof of the preceding theorem, in problem p{ν(a, x)} action a (respectively b) is optimal in enhanced state (·, y) if and only if ν(·, y) ≥ ν(a, x) (respectively ν(·, y) ≤ ν(a, x)). In particular, both actions a and b are optimal at time 0 when X(0) = (a, x). Should action a be taken at 0, then it is optimal to continue taking it through [0, t) provided that ν(a, x) ≤ ν{a, X(s)}, 0 ≤ s ≤ t − 1. Should that prove the case, action b will then be optimal at t if ν(a, x) ≥ ν{a, X(t)}. Hence, assuming that we make decisions in a stationary fashion, we deduce that only the following sequences of actions are optimal for p{ν(a, x)} when X(0) = (a, x): (1) take action b at all decision epochs; and (2) take action a throughout [0, τ) and action b from τ onwards, where τ ∈ Ť{ν(a, x)}. Equating the expected returns from the policies outlined in (1) and (2) yields the conclusion that

ν(a, x) = ∆r(a, x, τ)/∆c(a, x, τ), τ ∈ Ť{ν(a, x)}, x ∈ Ω,

which establishes the result when · = a. The case · = b is dealt with similarly. □
Remark The proof of Proposition 1 also reveals that the stopping times in Ť{ν(·, x)} are essentially the only stationary ones achieving the suprema concerned. Should stopping time τ(Θ) in (19) be such that the stopping set Θ contains some y for which ν(a, y) > ν(·, x) and, moreover, P[X{τ(Θ)} = (a, y)] > 0 when the bandit is subject to the continuous application of action a from initial state (·, x), then it will follow from the argument in the above proof that

ν(·, x) > ∆r{·, x, τ(Θ)}/∆c{·, x, τ(Θ)}.
4 Computation of the return/consumption indices
We have noted above the Gittins index-like nature of the return/consumption indices developed in the preceding section for our multi-armed bandit model with general resource requirements. We shall now proceed to give an algorithm for their computation in finite state space cases which is a suitably modified version of the adaptive greedy algorithm first given by Robinson (1982) and subsequently further developed by Bertsimas and Niño-Mora (1996).

Consider now the bandit B. We produce from it a derived Gittins-type bandit DB (i.e., one which does not change state and earns no rewards under the passive action) as follows:

(i)′′ DB's state space is the enhanced state space of B, namely {a, b} × Ω. This is assumed finite.

(ii)′′ The stochastic dynamics of DB under action a are exactly as for B and are given in (13) above. However, DB does not change state under action b. We use X(t) for the state of DB at time t ∈ N.
(iii)′′ The reward function r̃ : {a, b} × Ω → R yields expected rewards earned by DB under action a. Specifically, when action a is taken in state (a, x), the expected reward is

r̃(a, x) := r^a(x) − {r^b(x) − β Σ_{y∈Ω} P_{xy} r^b(y)}(1 − β)^{-1} + {S^b(x) − β Σ_{y∈Ω} P_{xy} S^b(y)}, x ∈ Ω,
and when action a is taken in state (b, x), the expected reward is

r̃(b, x) := r̃(a, x) − S^a(x) − S^b(x), x ∈ Ω.

Derived bandit DB earns no rewards under action b.

(iv)′′ From the consumption functions c^a, c^b, Ŝ^a, Ŝ^b associated with B we derive a function C̃ : {a, b} × Ω → R^+ for DB. We first write C^a(x) for the total expected discounted resource consumed by B when action a is taken at all decision epochs from time 0 at which point it is in enhanced state (a, x). Hence

C^a(x) := E[ ( Σ_{t=0}^∞ β^t c^a{X(t)} ) | (a, x) ],

the expectation being taken over realisations of B under continuous application of the active action. We now have

C̃(a, x) := C^a(x) − c^b(x)(1 − β)^{-1}, x ∈ Ω,

and

C̃(b, x) := C̃(a, x) + Ŝ^a(x) + Ŝ^b(x), x ∈ Ω.

Condition 1 and (3) guarantee that C̃ is positive-valued.

With the above in place we can re-express the return/consumption indices for B in terms of the quantities associated with the derived bandit DB as follows. Suppose DB to be in some state (·, x) at time 0 (i.e., X(0) = (·, x)) and to be subject to action a up to stopping time τ ∈ Ť. Write r̃{(·, x), τ} for the expected reward earned by DB during [0, τ), namely

r̃{(·, x), τ} := E[ { Σ_{t=0}^{τ−1} β^t r̃{X(t)} } | (·, x) ], · ∈ {a, b}, x ∈ Ω, τ ∈ Ť.

From Definition 2 we have

ν(·, x) = sup_τ { r̃{(·, x), τ} / [ C̃(·, x) − E([β^τ C̃{X(τ)}] | (·, x)) ] }, · ∈ {a, b}, x ∈ Ω,   (24)

the supremum being taken over all τ ∈ Ť. The expression in (24) yields an index which is of modified Gittins type. The form of modification is precisely that required by a class of generalised bandit problems first studied by Nash (1980). An account of this class of processes has been given by Crosbie and Glazebrook (2000) using polyhedral methods. From that analysis we are able to infer the following algorithm for computing the return/consumption indices for B when the latter has finite state space.
´ ≡ {a, b}×Ω for its state For ease of notation, we now use x, y for generic states of DB and Ω Θ space. We require the matrix of constants A(B) := {Ax }x∈Ω,Θ⊆ ´ ´ defined as follows: Suppose Ω c that X(0) = x and that DB evolves from 0 under the active action. We use TxΘ for the first time at or after time 1 for which the state of DB lies in subset Θ. We write c
TxΘ = inf{t; t ≥ 1 and X(t) ∈ Θ}. We then have AΘ x
h Θc i Θc Tx ˜ ˜ ´ Θ ⊆ Ω, ´ = I(x ∈ Θ) C(x) − E β C X Tx x , x ∈ Ω,
´ → R+ is as defined in (iv)′′ above. The following adaptive greedy algorithm yields where C˜ : Ω ´ the return/consumption indices {ν(x), x ∈ Ω}. Inputs to the algorithm are the matrix A(B) ′′ ´ and the rewards {˜ r(x), x ∈ Ω} defined in (iii) above. The algorithm operates as follows: ´ Step 1. Set Θ|Ω| ´ = Ω and δ(Θ|Ω| ´ ) = max
r˜(x) ´ ;x ∈ Ω . ´ AΩ x
(25)
Take any maximising state from (25) and call it state |Ώ|. State |Ώ| has bandit B's largest return/consumption index. Set ν(|Ώ|) = δ(Θ_{|Ώ|}) and Θ_{|Ώ|−1} = Θ_{|Ώ|} \ {|Ώ|}.

Step k. For k = 2, 3, . . . , |Ώ|, set Θ_{|Ώ|−k+1} = Θ_{|Ώ|−k+2} \ {|Ώ| − k + 2} and

δ(Θ_{|Ώ|−k+1}) = max { [ r̃(x) − Σ_{j=1}^{k−1} A_x^{Θ_{|Ώ|−j+1}} δ(Θ_{|Ώ|−j+1}) ] / A_x^{Θ_{|Ώ|−k+1}} ; x ∈ Θ_{|Ώ|−k+1} }.   (26)
Take any maximising state from (26) and call it state |Ώ| − k + 1. State |Ώ| − k + 1 has bandit B's kth largest return/consumption index. Set ν(|Ώ| − k + 1) = ν(|Ώ| − k + 2) + δ(Θ_{|Ώ|−k+1}).

The collection {ν(k), 1 ≤ k ≤ |Ώ|} gives the return/consumption indices of all (enhanced) states of the bandit B, with these states now numbered in decreasing order of their index values, with state |Ώ| having the highest return/consumption index. We now use i for a generic state within this numbering. With this new state representation subsets Θ are members of 2^{{1,2,...,|Ώ|}}. Each subset Θ_j = {j, j − 1, . . . , 1} contains the j states within Ώ with the smallest return/consumption indices, 1 ≤ j ≤ |Ώ|. It is an immediate consequence of the structure of the above algorithm that the rewards {r̃(i), 1 ≤ i ≤ |Ώ|} may be re-expressed as

r̃(i) = ν(|Ώ|) A_i^{Ώ} − Σ_{j=i}^{|Ώ|−1} {ν(j + 1) − ν(j)} A_i^{Θ_j}, 1 ≤ i ≤ |Ώ|,   (27)
and this will be used in the development of the suboptimality bounds discussed in the next section.
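To make the algorithm concrete, the following sketch (illustrative Python/numpy, not reproduced from the paper) runs the adaptive greedy recursion (25)–(26) in the simplified case in which the bandit earns no reward and consumes no resource when passive and incurs no switching penalties, so that the derived quantities reduce to r̃ = r^a and C̃ = C^a = (I − βP)^{-1} c^a and the state space is just Ω (the simplification also used in Section 5). The entries A_x^Θ are obtained from linear solves for the hitting-time functionals appearing in their definition.

```python
import numpy as np

def A_entry(P, beta, C_tilde, Theta, x):
    """A_x^Theta = I(x in Theta) [ C~(x) - E( beta^T C~(X(T)) | x ) ],
    with T the first time at or after 1 at which the state lies in Theta,
    under continuous application of the active action."""
    if x not in Theta:
        return 0.0
    S = len(C_tilde)
    cont = np.array([0.0 if s in Theta else 1.0 for s in range(S)])
    M = np.eye(S) - beta * P * cont
    w = np.linalg.solve(M, beta * P @ (C_tilde * (1.0 - cont)))
    return C_tilde[x] - w[x]

def adaptive_greedy_indices(P, beta, r_a, c_a):
    """Adaptive greedy recursion (25)-(26) for the simplified case
    r~ = r^a, C~ = C^a = (I - beta P)^{-1} c^a (no passive rewards or
    consumption, no switching penalties). Returns {state: index}."""
    S = len(r_a)
    C_tilde = np.linalg.solve(np.eye(S) - beta * P, c_a)
    Theta = set(range(S))                    # current working set
    history = []                             # [(Theta used at step j, delta_j)]
    nu, nu_prev = {}, None
    while Theta:
        best_x, best_delta = None, -np.inf
        for x in Theta:
            num = r_a[x] - sum(A_entry(P, beta, C_tilde, Th, x) * d
                               for Th, d in history)
            delta = num / A_entry(P, beta, C_tilde, Theta, x)
            if delta > best_delta:
                best_x, best_delta = x, delta
        history.append((frozenset(Theta), best_delta))
        nu[best_x] = best_delta if nu_prev is None else nu_prev + best_delta
        nu_prev = nu[best_x]
        Theta = Theta - {best_x}
    return nu
```

In the classical case c^a ≡ 1 the recursion should reproduce the usual Gittins indices (in the reward-per-unit-of-discounted-time normalisation), which provides a convenient sanity check on an implementation.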
5 On the closeness to optimality of index policies
In the preceding two sections we have been able to focus on individual bandits as we have demonstrated indexability and developed the return/consumption indices. We now return to the full multi-armed bandit version of the model presented in Section 2, and continue to require its state space to be finite. This model features N bandits, with Ώ_n the enhanced state space for bandit n, 1 ≤ n ≤ N, and Ώ := ∪_{n=1}^N Ώ_n their union. In what follows we shall also use Ω_n^a := {a} × Ω_n for the subset of Ώ_n consisting of the currently active states. We now proceed to define the matrix A := {A_i^Θ}_{i∈Ώ, Θ⊆Ώ} from the corresponding matrices A(B_n) for individual bandits B_n given in the preceding section by a simple process of extension. We firstly create a numerical representation of the members of Ώ by numbering them in decreasing order of their return/consumption index values such that

ν(|Ώ|) ≥ ν(|Ώ| − 1) ≥ . . . ≥ ν(2) ≥ ν(1).   (28)

As before, we use i for a generic state within this numbering. With this state representation subsets Θ of Ώ are members of 2^{{1,2,...,|Ώ|}}. We use Θ_j = {j, j − 1, . . . , 1} for a subset of Ώ containing j states with smallest return/consumption indices. With this state representation in place we obtain all entries in A as follows:

A_i^Θ = A_i^{Θ∩Ώ_n}, i ∈ Ώ_n ⊆ Ώ, Θ ⊆ Ώ.   (29)
It is a straightforward matter to check that the expressions for the rewards given in (27) in terms of quantities associated with individual bandits yield the equations

r̃(i) = ν(|Ώ|) A_i^{Ώ} − Σ_{j=i}^{|Ώ|−1} {ν(j + 1) − ν(j)} A_i^{Θ_j}, 1 ≤ i ≤ |Ώ|.   (30)
We now use the representation of rewards given in (30) to develop expressions for the quantities {R^u(j), u ∈ U, j ∈ ×_{n=1}^N Ώ_n}, namely the total expected discounted rewards earned by the multi-armed bandit under admissible stationary policies u ∈ U from initial states j ≡ {j_1, j_2, . . . , j_N} ∈ ×_{n=1}^N Ώ_n. To do this we need to develop performance measures {π^u(i|j), u ∈ U, i ∈ Ώ, j ∈ ×_{n=1}^N Ώ_n} defined by

π^u(i|j) := E_u{ Σ_{t=0}^∞ β^t I(i, t) | j },   (31)

where in (31) the expectation is taken over all realisations of the multi-armed bandit under policy u from initial state j and I(i, t) is an indicator which takes the value 1 if a bandit with enhanced state i is active at epoch t ∈ N and is zero otherwise. Proposition 2 describes how to deploy these performance measures to compute R^u(j).

Proposition 2 (Computation of total expected rewards) For any u ∈ U and j ∈ ×_{n=1}^N Ώ_n,

R^u(j) = Σ_{i∈Ώ} r̃(i)π^u(i|j) + Σ_{n=1}^N [ −S^b(j_n)I(j_n ∈ Ω_n^a) + r^b(j_n)(1 − β)^{-1} ],   (32)

where I(·) is an indicator.
Proof Following the usage in (4) above, we write

R^u(j) = Σ_{n=1}^N R_n^u(j),   (33)

where in (33), R_n^u(j) is the total expected reward (net of switching penalties) yielded by bandit n under policy u from initial state j. In order to demonstrate (32) it is enough to show that

R_n^u(j) = Σ_{i∈Ώ_n} r̃(i)π(i|j) − S^b(j_n)I(j_n ∈ Ω_n^a) + r^b(j_n)(1 − β)^{-1}, 1 ≤ n ≤ N.   (34)
With policy u and initial state j fixed, we define the sequence of random times {ψ_m^n, m ≥ 1} as the collection of (successive) epochs at which processing is switched into bandit n, with {φ_m^n, m ≥ 1} as the collection of (successive) epochs at which processing is switched away from bandit n. In what follows we shall use the notational shorthand u_n(t) for the action applied by policy u to bandit n at epoch t. Hence we have

ψ_1^n = inf{t ≥ 0; u_n(t) = a}.

If ψ_1^n = ∞ then φ_m^n = ∞, m ≥ 1, and ψ_m^n = ∞, m ≥ 2. Otherwise,

φ_1^n = inf{t > ψ_1^n; u_n(t) = b}.

We continue inductively. If m ≥ 2 and φ_{m−1}^n < ∞ we have

ψ_m^n = inf{t > φ_{m−1}^n; u_n(t) = a},

while if ψ_m^n < ∞ we have

φ_m^n = inf{t > ψ_m^n; u_n(t) = b}.
We now utilise the form of r̃ given in (iii)′′ above and the nature of the performance measures in (31) to infer that when j_n ∈ Ω_n^b (bandit n is deemed passive at time 0)

Σ_{i∈Ώ_n} r̃(i)π(i|j) − S^b(j_n)I(j_n ∈ Ω_n^a) + r^b(j_n)(1 − β)^{-1}
= Σ_{i∈Ώ_n} r̃(i)π(i|j) + r^b(j_n)(1 − β)^{-1}
= E[ Σ_{m=1}^∞ ( −β^{ψ_m^n} S_n^a{X_n(ψ_m^n)} + Σ_{t=ψ_m^n}^{φ_m^n − 1} β^t r_n^a{X_n(t)} − β^{φ_m^n} S_n^b{X_n(φ_m^n)} ) | j ]
+ E[ Σ_{m=1}^∞ ( −β^{ψ_m^n} r_n^b{X_n(ψ_m^n)} + β^{φ_m^n} r_n^b{X_n(φ_m^n)} ) | j ] (1 − β)^{-1} + r^b(j_n)(1 − β)^{-1}.   (35)
If we now use the fact that the natural state of bandit n is frozen under application of the passive action we have that, with probability one,

X_n(t) = X_n(φ_m^n), φ_m^n ≤ t ≤ ψ_{m+1}^n − 1, m ≥ 0,   (36)
where we take φ_0^n = 0. Utilising (36) within (35) we see that the latter expression may be rewritten

E[ Σ_{m=1}^∞ ( −β^{ψ_m^n} S_n^a{X_n(ψ_m^n)} + Σ_{t=ψ_m^n}^{φ_m^n − 1} β^t r_n^a{X_n(t)} − β^{φ_m^n} S_n^b{X_n(φ_m^n)} ) | j ]
+ E[ Σ_{t=0}^{ψ_1^n − 1} β^t r_n^b{X_n(t)} + Σ_{m=1}^∞ Σ_{t=φ_m^n}^{ψ_{m+1}^n − 1} β^t r_n^b{X_n(t)} | j ].   (37)
However, the expression (37) accounts for all rewards earned by and switching penalties exacted from bandit n when policy u is applied to it from initial state j. Hence the quantity in (37) is R_n^u(j), as required. The case j_n ∈ Ω_n^a is dealt with similarly. We therefore have (34) and the result is proved. □

We observe that the second term on the r.h.s. of (32) is policy independent and are able to conclude that, for any u, v ∈ U, and any initial state j,

R^u(j) − R^v(j) = Σ_{i∈Ώ} r̃(i){π^u(i|j) − π^v(i|j)}.   (38)
In order to deploy (30) within (38) we now write

A^u(Θ|j) := Σ_{i∈Θ} A_i^Θ π^u(i|j), Θ ⊆ Ώ, j ∈ ×_{n=1}^N Ώ_n.   (39)
In the following result we use u* for an optimal stationary policy for our multi-armed bandit and R^opt(j) for the maximal expected return from initial state j, as in (4). Theorem 2 is a straightforward consequence of (30), (32) and (39).

Theorem 2 (Policy evaluation) For any stationary policy u ∈ U and initial state j ∈ ×_{n=1}^N Ώ_n,

R^opt(j) − R^u(j) = ν(|Ώ|) { A^{u*}(Ώ|j) − A^u(Ώ|j) }
   + Σ_{i=1}^{|Ώ|−1} {ν(i + 1) − ν(i)} { A^u(Θ_i|j) − A^{u*}(Θ_i|j) }.   (40)
As an illustration of the application of Theorem 2, we shall now use it to analyse the closeness to optimality of the greedy index heuristic GI introduced in Section 2 in simple cases in which the following hold:

1. There are no switching costs. Hence the functions S_n^a, S_n^b, Ŝ_n^a and Ŝ_n^b are all identically zero for all bandits n;

2. No rewards are earned or resource consumed under the passive action. Hence the functions r_n^b and c_n^b are identically zero for all bandits n;

3. The consumption functions c_n^a are constant for each bandit n. We use c_n^a, 1 ≤ n ≤ N, for the constant values.

4. Each Markov law P_n determining transitions in bandit n under the active action is irreducible and hence positive recurrent.

5. The total resource available, C, and the resource requirements of the individual bandits, c_n^a, are all multiples of resource quantum δ > 0.

In light of 1. and 2. above we can work with the (non-enhanced) state space Ω_n for each n and the corresponding union Ω := ∪_{n=1}^N Ω_n. We continue to write i, j for generic members of Ω. These continue to be numbered in decreasing order of their indices. Note that 1. and 2. also guarantee that Condition 1 is satisfied and that all return/consumption indices are non-negative and hence equal to their positive parts. In light of 5. we write

C = Mδ, c_n^a = m_n δ, 1 ≤ n ≤ N.

We now extend Ω to Ω^M := Ω ∪ {δ, 2δ, . . . , Mδ}, where activation of mδ earns no reward and consumes mδ of resource. Trivially, each mδ may be thought of as a single state bandit with return/consumption index equal to zero. It will assist in what follows if we use the notational shorthand M^# ≡ {δ, 2δ, . . . , Mδ}. In the multi-armed bandit problem, the activation of mδ ∈ M^# at some time t ∈ N will indicate that the amount of resource NOT consumed during [t, t + 1) by the N conventional bandits is exactly mδ. Hence at most one member of M^# may be activated at any epoch. By definition, the resource consumed by the activated members of Ω^M at each decision epoch is exactly C under any policy. Note that, trivially, and in an obvious notation, we have that the matrix A := {A_i^Θ}_{i∈Ω^M, Θ⊆Ω^M} appropriate for our extended state listing Ω^M has, under the above conditions, the entries

A_i^Θ = A_i^{Θ∩Ω_n} = c_n^a (1 − β)^{-1} [ 1 − E( β^{T_i^{(Θ∩Ω_n)^c}} | i ) ], i ∈ Ω_n, 1 ≤ n ≤ N,

and
A_{mδ}^Θ = mδ I(mδ ∈ Θ), mδ ∈ M^#.

We now write Θ_j^M = {j, j − 1, . . . , 1} ∪ M^# for the union of the states from Ω with the j lowest return/consumption indices together with M^# or, equivalently, the j + M members of Ω^M of lowest index. Now, adapting Theorem 2 to the current set up we have, in an obvious notation,

R^opt(j) − R^u(j) = ν(|Ω|) { A^{u*}(Ω^M|j) − A^u(Ω^M|j) }
   + Σ_{i=0}^{|Ω|−1} {ν(i + 1) − ν(i)} { A^u(Θ_i^M|j) − A^{u*}(Θ_i^M|j) }
   = Σ_{i=0}^{|Ω|−1} {ν(i + 1) − ν(i)} { A^u(Θ_i^M|j) − A^{u*}(Θ_i^M|j) },   (41)

where ν(0) = 0 and Θ_0^M = M^# in (41). The equality in (41) results from the fact (which follows easily from the above definitions) that

A^u(Ω^M|j) = C(1 − β)^{-1}, u ∈ U.
Before we apply (41) to the evaluation of the greedy index heuristic GI determined by the state ordering in (28) we require the following additional notation. We shall write Γ(t) for the amount of resource unused at time t (i.e., by the N conventional bandits) and

Γ(u|j) := E_u{ Σ_{t=0}^∞ β^t Γ(t) | j }

for the expected total discounted amount of resource unused when policy u ∈ U is implemented from initial state j. We further use N_i for the subset of the N conventional bandits which have no intersection with Θ_i, namely

N_i := {n; 1 ≤ n ≤ N and Ω_n ∩ Θ_i = ∅}.

Further, we write

C(N_i) := Σ_{n∈N_i} c_n^a

for the total resource required by the bandits in N_i. Note that N_{|Ω|} = ∅, N_0 = {1, 2, . . . , N} and hence C(N_{|Ω|}) = 0, C(N_0) > C, with C(N_i) decreasing in i. For Theorem 3, we define I* := min{i; C(N_i) < C} and use O(1) to denote a quantity which remains bounded in the limit β → 1.

Theorem 3 (Closeness to optimality of GI) Under the conditions 1.–5. above, for any initial state j ∈ ×_{n=1}^N Ω_n,

R^opt(j) − R^GI(j) ≤ ν(I*) {Γ(GI|j) − Γ(u*|j)} + O(1).

Proof First note that it is plain from the definitions of the quantities concerned that

A^u(Θ_i^M|j) = A^u(Θ_i|j) + Γ(u|j)   (42)
for all choices of i, u and j. Further, it is clear that if C(N_i) ≥ C then the greedy index policy GI will never activate any members of Θ_i. We then conclude from (42) that ∀ j and i < I*, A^{GI}(Θ_i|j) = 0 and hence that

A^{GI}(Θ_i^M|j) − A^{u*}(Θ_i^M|j) ≤ Γ(GI|j) − Γ(u*|j).   (43)
Now suppose that C(N_i) < C or, equivalently, that i ≥ I*. Write T_i(GI, j) for the first time at which a member of Θ_i is scheduled for processing when the greedy index policy is implemented from state j at time 0. The random time T_i(GI, j) may be infinite with positive probability. It must be true, invoking the structure of GI, that at epoch T_i(GI, j) all bandits in N_i are processed and that all bandits not being processed must have their current states in Θ_i. It is clear that under GI all bandits in N_i will continue to be processed at each t ≥ T_i(GI, j) since they can never be displaced by any of the bandits unprocessed at T_i(GI, j). By definition of the quantities concerned, the processing of bandits in N_i contributes nothing to the measure A^{GI}(Θ_i^M|j) when i ≥ I*. It now follows that at any decision epoch t ∈ [T_i(GI, j), ∞) the residual resource C − C(N_i) remaining once the bandits in N_i have been processed will either not be used (equivalently, will be allocated to members of M^#) and/or will be used to process bandits in {1, 2, . . . , N} \ N_i which have previously entered states in Θ_i. This, together with the fact that, from assumption 4. above, prior to time T_i(GI, j), the bandits in {1, 2, . . . , N} \ N_i can only have been processed a finite number of times almost surely, yields the conclusion that when C(N_i) < C

A^{GI}(Θ_i^M|j) = {C − C(N_i)}(1 − β)^{-1} + O(1).   (44)
An argument very close to that used by Glazebrook and Garbe (1999) in their analysis of the performance of parallel processor versions of Gittins index policies for conventional multi-armed bandits, and which utilised a single machine relaxation of the optimization problem

min_{u∈U} A^u(Θ_i^M|j),

yields the conclusion that when C(N_i) < C,

A^u(Θ_i^M|j) ≥ {C − C(N_i)}(1 − β)^{-1} + O(1), u ∈ U.   (45)
From (44) and (45) we conclude that when C(N_i) < C,

A^{GI}(Θ_i^M|j) − A^{u*}(Θ_i^M|j) ≤ O(1).   (46)

The theorem now follows simply from (41), (43) and (46). □
The fact that the suboptimality bound for GI given in Theorem 3 must be non-negative yields the following result.

Corollary 1 (Resource consumption of GI) Under the conditions 1. to 5. above, for any initial state j ∈ ×_{n=1}^N Ω_n,

Γ(u*|j) − Γ(GI|j) ≤ O(1).
Remark We conclude that for this simple family of multi-armed bandits, the degree of reward suboptimality of the greedy index heuristic GI is, to within an O(1) quantity, bounded above by a multiple of the difference between the amount of available resource left unused by GI and the equivalent quantity for any optimal policy. Should this difference be zero (or, indeed, negative) then the greedy index policy must be within O(1) of optimality. It in turn must then follow that the total amount of resource left unused by an optimal policy (as measured by Γ(u*|j)) must, for all initial states and to within an O(1) quantity, be bounded above by the total amount of resource left unused by GI. This is Corollary 1. One way of seeing why this is so is to note that, by its construction, GI makes the best use (as measured by the return/consumption indices and to within O(1)) of the resource it consumes. Hence the only way that a policy can outperform GI is to use more resource. For this class of models, then, it is precisely any inability of GI to use the available resource as effectively as other good policies which might cause it to perform poorly. In examples where this is not a serious concern, GI will perform well.

We formalize some of the above ideas by developing a form of asymptotic optimality for GI not unlike that proposed for restless bandits by Whittle (1988). We develop a sequence of MABs, each structured as in 1.–5. above and sharing a common discount rate β. The sequence is indexed by N, the number of competing bandits in the Nth problem, where N ≥ 2. Objects related to problem N are so identified by adding N as an additional subscript. Hence we have reward functions {r_{nN}^a, 1 ≤ n ≤ N, N ≥ 2}, consumption values {c_{nN}^a, 1 ≤ n ≤ N, N ≥ 2} and total resources {C_N, N ≥ 2}. We require the following additional conditions:

6. The reward functions {r_{nN}^a, 1 ≤ n ≤ N, N ≥ 2} are uniformly bounded above and away from zero;

7. The collection of consumptions {c_{nN}^a, 1 ≤ n ≤ N, N ≥ 2} is bounded above and away from zero;

8. The total resources {C_N, N ≥ 2} form a non-decreasing divergent sequence with C_N