Optimal Admission Control for Tandem Queues with Loss
Bo Zhang and Hayriye Ayhan
H. Milton Stewart School of Industrial and Systems Engineering
Georgia Institute of Technology, Atlanta, GA 30332-0205, U.S.A.
[email protected], [email protected]
Abstract
We consider a two-station tandem queue loss model where customers arrive to station 1 according to a Poisson process. A gatekeeper who has complete knowledge of the number of customers at both stations decides to accept or reject each arrival. A cost c1 is incurred if a customer is rejected, while if an admitted customer finds that station 2 is full at the time of his service completion at station 1, he leaves the system and a cost c2 is incurred. Assuming exponential service times at both stations, an arbitrary but finite buffer size at station 1 and a buffer size of one at station 2, we show that the optimal admission control policy for minimizing the long-run average cost per unit time has a simple structure. Depending on the value of c2 compared to a threshold value c∗, it is optimal to admit a customer at the time of his arrival either only if the system is empty or as long as there is space at station 1. We also provide a closed-form expression for c∗, which depends on the service rates at both stations, the arrival rate and c1.

Index Terms
admission control, tandem queues, loss models, Markov decision processes, dynamic programming, optimal control.
I. INTRODUCTION
Consider two stations (or nodes) in series. There is one server at each station, and customers arrive to station 1 in accordance with a Poisson process with rate λ. Customer service times at station i are exponentially distributed with rate µi, i = 1, 2. The interarrival and service times
are independent of one another. The size of the buffer (which includes the customer in service as well as those waiting) at station 1 is B, where 1 ≤ B < ∞, whereas the buffer size at station 2 is one (i.e., customers are not allowed to wait for service at station 2). A gatekeeper decides whether to accept or refuse service to an arriving customer (to station 1) and he has complete knowledge of the number of customers at each station at all times. If accepted, an arriving customer moves into the buffer at station 1; otherwise he leaves the system and a cost c1 is incurred. If station 1 is full upon a customer’s arrival, the gatekeeper has to reject him. The service discipline at station 1 is FCFS (First Come First Served). Once a customer completes service at station 1, he proceeds to station 2. If he finds station 2 full, he immediately leaves the system and a cost c2 is incurred; otherwise he receives service from server 2, and eventually leaves the system. The objective for the gatekeeper is to make optimal admission decisions in order to minimize the long-run average cost per unit time. Note that if c1 = c2, this is equivalent to maximizing the long-run fraction of customers that successfully complete service at both stations. Figure 1 illustrates the model with B = 4.
Fig. 1: Tandem queue loss model with B = 4.
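To make the dynamics concrete, the following simulation sketch (not part of the original analysis) estimates the long-run average cost per unit time under an arbitrary state-dependent admission rule; the names simulate, greedy and prudent are illustrative choices rather than notation from the paper.

```python
# A minimal simulation sketch (not from the paper) of the tandem loss model above,
# under an arbitrary state-dependent admission rule. The names simulate, greedy and
# prudent are illustrative; the function estimates the long-run average cost per unit time.
import random

def simulate(lam, mu1, mu2, B, c1, c2, admit, horizon=100_000.0, seed=1):
    """admit is a function (q1, q2) -> bool applied at each arrival epoch."""
    rng = random.Random(seed)
    q1 = q2 = 0           # queue lengths at station 1 (buffer B) and station 2 (buffer 1)
    t = cost = 0.0
    while t < horizon:
        rates = (lam, mu1 if q1 > 0 else 0.0, mu2 if q2 > 0 else 0.0)
        total = sum(rates)
        t += rng.expovariate(total)
        u = rng.uniform(0.0, total)
        if u < rates[0]:                          # arrival at station 1
            if q1 < B and admit(q1, q2):
                q1 += 1
            else:
                cost += c1                        # forced or voluntary rejection
        elif u < rates[0] + rates[1]:             # service completion at station 1
            q1 -= 1
            if q2 == 0:
                q2 = 1                            # served customer moves to station 2
            else:
                cost += c2                        # station 2 full: customer is lost
        else:                                     # service completion at station 2
            q2 -= 1
    return cost / t

greedy  = lambda q1, q2: True                     # accept whenever station 1 has room
prudent = lambda q1, q2: q1 == 0 and q2 == 0      # accept only into an empty system

if __name__ == "__main__":
    for name, policy in (("greedy", greedy), ("prudent", prudent)):
        print(name, simulate(lam=1.0, mu1=1.2, mu2=0.8, B=4, c1=1.0, c2=3.0, admit=policy))
```

Running this with c2 well above or below c1 already hints at the threshold behaviour established below.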
Tandem queues with finite buffers and the loss feature are appropriate models for communication networks such as the Internet (see, for example, Bertsekas and Gallager [1], Spicer and Ziedins [2], and the references therein) but there are few analytical results on optimal admission control for such models in the literature. To the best of our knowledge, this paper provides the first analytical result on the long-run average reward/cost optimal admission control policy for tandem queues with loss. Several researchers have studied control problems for tandem queues. Spicer and Ziedins [2] consider a system of parallel tandem queues with loss, where each queue consists
of two single-server queues in tandem with finite capacities, and they show that it is sometimes optimal for an arriving customer to select queues with more customers already present and/or with greater residual service requirements in order to minimize his individual loss probability. Sheu and Ziedins [3] consider admission and routing controls in a system of N parallel tandem queues with the objective of minimizing loss. They develop an asymptotically optimal policy as N → ∞. Their approach involves obtaining the fluid limit as N → ∞ and solving a related optimization problem. Ku and Jordan [4] consider admission control policies for two multi-server Markovian loss queues (with no waiting space) in series with two types of customers, where type 1 (which is more valuable) requires service at the first station and with a positive probability at the second station and type 2 requires service only at the second station. Under appropriate conditions, Ku and Jordan [4] show that, to maximize the expected total discounted reward over an infinite horizon, type 1 customers are always accepted and the optimal admission policy for type 2 customers is of threshold type. Ku and Jordan [5] generalize these results to n multiserver loss queues in series with n customer classes and develop admission control heuristics which yield near optimal performance. Ku and Jordan [6] consider access control in a target multi-server loss queue (with no waiting space) fed by a set of upstream parallel multi-server loss queues (with no waiting space) and by a stream of new customers. The target queue faces a choice of how many servers to reserve for each stream. Revenue is gained by each station when it serves a customer, but the amount of revenue at the target queue depends on the source of the customer. They prove that the policy that maximizes total discounted revenue consists of a set of monotonically decreasing thresholds as functions of the occupancy of each queue and show monotonicity properties of these thresholds with respect to system parameters. Chang and Chen [7] consider a two-stage no-wait tandem queueing system, in which any customer who finds all servers busy at his destination stage will be lost, and they compare the loss rate under various admission control policies. Ghoneim and Stidham [8] consider two infinite queues in series with input to each queue, which can be controlled by accepting or rejecting arriving customers. Their objective is to maximize the discounted or average expected net benefit over a finite or infinite horizon, where net benefit is composed of (random) rewards for entering customers minus holding costs assessed against the customers at each queue. Hordijk and Koole [9] also consider two infinite queues in tandem. They assume that at each node there are two servers with the same service rate and each with its own queue. Their goal is to find a policy
which stochastically minimizes the number of customers in the system at any time point. The remainder of this paper is organized as follows. In Section II, we state our main result as Theorem 1. Section III introduces the notation and provides the Markov decision process formulation. Section IV is devoted to the proof of Theorem 1.
II. MAIN RESULT
In this section, we present the optimal admission policy for the tandem queue described in Section I. We start by defining two policies which are used throughout the paper. The first policy, denoted by π0, is what we refer to as the prudent policy, under which the gatekeeper accepts a customer only if both buffers are empty at the arrival epoch and rejects otherwise. The second policy, denoted by π1, is to accept a customer whenever there is space in the buffer at station 1 and reject otherwise. We call this policy the greedy policy. We denote the long-run average cost under a policy π as Cπ and use C∗ for the optimal long-run average cost. Intuitively, if c2 is large compared to c1, it makes sense to prefer the prudent policy because this policy evades the risk of incurring a large cost of c2; if c2 is comparable to or smaller than c1, the greedy policy should be more reasonable, because each admission saves c1 only at the cost of incurring c2 with a certain probability. We make this intuition rigorous by showing a simple structure of the optimal policy: there is a threshold value of c2, above which the prudent policy is optimal and below which the greedy policy is optimal; moreover, this threshold value does not depend on B. This simple structure is not entirely obvious, because one may well expect that a policy in between, in the sense that it prescribes the acceptance action when the system is nonempty (unlike π0) and also the rejection action when the buffer at station 1 is not full (unlike π1), is optimal for some moderate c2 values, and also it is not apparent that such a threshold is independent of the value of B. We now state this structural result as Theorem 1.
Theorem 1: Let
c∗ ≡ c1 (1 + µ2^2 / (µ1 λ + µ2 λ + µ1 µ2)).
(i) If c2 ≥ c∗, the prudent policy is optimal, and it is the unique stationary optimal policy when the inequality is strict. Furthermore, in this case
C∗ = Cπ0 = c1 λ^2 (µ1 + µ2) / (µ1 λ + µ2 λ + µ1 µ2). (1)
(ii) If c2 ≤ c∗ , the greedy policy is optimal, and it is the unique stationary optimal policy when
the inequality is strict. Furthermore, in this case
C∗ = Cπ1 = c1 λ^(B+1) (µ1 − λ) / (µ1^(B+1) − λ^(B+1)) + c2 λ^2 µ1^2 [µ1^B − λ^B + µ2 (µ1^(B−1) − λ^(B−1))] / [(λ + µ2)(µ1 + µ2)(µ1^(B+1) − λ^(B+1))]. (2)
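As a quick numerical illustration (with illustrative function names, and assuming λ ≠ µ1 so that the denominators in (2) do not vanish), the following sketch evaluates c∗ and the closed-form costs (1) and (2) for sample parameters and reports which of the two policies Theorem 1 selects.

```python
# A small numerical sketch of Theorem 1 (illustrative function names): it evaluates
# c*, the prudent-policy cost (1) and the greedy-policy cost (2), assuming lam != mu1.
def c_star(lam, mu1, mu2, c1):
    return c1 * (1.0 + mu2 ** 2 / (mu1 * lam + mu2 * lam + mu1 * mu2))

def cost_prudent(lam, mu1, mu2, c1):              # equation (1)
    return c1 * lam ** 2 * (mu1 + mu2) / (mu1 * lam + mu2 * lam + mu1 * mu2)

def cost_greedy(lam, mu1, mu2, B, c1, c2):        # equation (2)
    den = mu1 ** (B + 1) - lam ** (B + 1)
    return (c1 * lam ** (B + 1) * (mu1 - lam) / den
            + c2 * lam ** 2 * mu1 ** 2
              * (mu1 ** B - lam ** B + mu2 * (mu1 ** (B - 1) - lam ** (B - 1)))
              / ((lam + mu2) * (mu1 + mu2) * den))

if __name__ == "__main__":
    lam, mu1, mu2, B, c1 = 1.0, 1.2, 0.8, 4, 1.0
    thr = c_star(lam, mu1, mu2, c1)
    for c2 in (0.5 * thr, thr, 2.0 * thr):
        best = "prudent" if c2 >= thr else "greedy"   # both policies are optimal at c2 == c*
        print(f"c2={c2:.3f}  C_pi0={cost_prudent(lam, mu1, mu2, c1):.4f}  "
              f"C_pi1={cost_greedy(lam, mu1, mu2, B, c1, c2):.4f}  optimal={best}")
```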
The proof of Theorem 1 is given in Section IV. Note that Theorem 1 illustrates that a policy that is individually optimal is not necessarily socially optimal (see, for example, Stidham [10] for a discussion on this issue). Indeed, the intuition described above suggests that the individually optimal threshold value for c2 should be c4 ≡ c1 (1 + µ2/µ1), because a customer that joins a nonempty system is lost at station 2 with probability µ1/(µ1 + µ2) and thus his decision depends on the comparison of c1 and c2 µ1/(µ1 + µ2). The difference between c∗, as defined in Theorem 1, and c4 confirms the distinction between individual optimality and social optimality for this model. As suggested in Lin and Ross [11], by charging each accepted customer a toll δ (equivalent to letting the rejection cost be c1 − δ) such that
(c1 − δ)(1 + µ2/µ1) = c∗, (3)
the gatekeeper can enforce the social optimum among rational customers (i.e., provide incentives for customers to act according to the optimal policy described in Theorem 1).
III. DISCRETE-TIME MARKOV DECISION PROCESS (MDP) FORMULATION
In this section, using uniformization (see Lippman [12]), we formulate the admission control problem as a discrete-time Markov decision process (MDP). Furthermore, we show that we have a unichain model. Suppose that the two servers work at all times. The service at a station when there is no customer at that station is referred to as fictitious service. Then, we let the gatekeeper make admission decisions at each arrival epoch and at each (real or fictitious) service completion epoch, but the only allowable action at a service completion epoch is “rejection”. Thus, the times between consecutive decision epochs are independent exponential random variables with rate Λ ≡ µ1 + µ2 + λ, and minimizing the continuous-time long-run average cost is equivalent to minimizing the long-run average cost of the discrete-time MDP over this new set of decision epochs with proper reformulation. Specifically, we give the detailed formulation of the equivalent discrete-time MDP as follows. For technical convenience, throughout the rest of this paper, we consider “reward” instead of “cost”, i.e., each customer loss generates a reward of −c1 or −c2, and the objective is to maximize the long-run average reward (which has a negative value).
Let p1 ≡ µ1/Λ, p2 ≡ µ2/Λ, and p3 ≡ λ/Λ.
Let 1A = 1 if statement A holds, and 1A = 0 otherwise. We
then define the following discrete-time Markov decision process problem. We have a discrete-time Markov chain with state space S = {(q1, q2, a) ∈ Z^3 : 0 ≤ q1 ≤ B, q2 = 0 or 1, a = 0 or 1} \ {(B, 1, 0)}, where qi denotes the number of customers at station i (including those waiting and in service), i = 1, 2, and a equals 1 if the decision epoch is the time of an arrival and equals 0 otherwise. Note that the state (B, 1, 0) is not possible and the size of the state space is |S| = (B + 1) × 2 × 2 − 1. Let 0 denote rejecting an arrival and 1 denote accepting an arrival. Then the sets of allowable actions are A(q1,q2,a) = {0} if a = 0 or q1 = B, and A(q1,q2,1) = {0, 1} for any other (q1, q2, 1) ∈ S. Let r(s, d) denote the reward received when action d is taken in state s. Specifically,
r((q1, q2, a), 0) = −c1 1{a=1} − c2 p1 1{q1>0 and q2=1}, ∀(q1, q2, a) ∈ S,
r((q1, q2, 1), 1) = −c2 p1 1{q2=1}, ∀(q1, q2, 1) ∈ S.
Let p(·|(q1, q2, a), d) denote the transition probability when action d is taken in state (q1, q2, a). We have
p(s|(q1, q2, a), 0) =
p1, if s = ((q1 − 1)+, [(q2 + 1) ∧ 1] 1{q1>0} + q2 1{q1=0}, 0),
p2, if s = (q1, (q2 − 1)+, 0),
p3, if s = (q1, q2, 1),
and
p(s|(q1, q2, 1), 1) =
p1, if s = (q1, [(q2 + 1) ∧ 1], 0),
p2, if s = (q1 + 1, (q2 − 1)+, 0),
p3, if s = (q1 + 1, q2, 1).
As introduced in Puterman [13], a discrete-time MDP is unichain if the resulting Markov chain under every deterministic stationary policy has a single recurrent class plus a possibly empty set of transient states. In fact, the discrete-time MDP described above for our model is unichain. This can be seen from the following argument. Regardless of the policy and the initial state, state (0, 0, 0) can be reached with probability 1, due to a long enough interarrival time. Therefore, (0, 0, 0) is recurrent, and all the states accessible from (0, 0, 0), together with (0, 0, 0), form a recurrent class; other states form a set of transient states. The existence of a deterministic stationary optimal policy for our model is guaranteed by the finiteness of the state space and the action space (see Theorem 9.1.8 in Puterman [13]).
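To make the formulation concrete, the sketch below (with hypothetical helper names such as build_mdp) instantiates the state space, allowable actions, rewards and transition probabilities defined above; it is an illustration of the construction, not code from the paper.

```python
# A sketch (helper names such as build_mdp are illustrative, not from the paper) that
# instantiates the state space, allowable actions, rewards and transition probabilities
# of the uniformized discrete-time MDP defined above.
def build_mdp(lam, mu1, mu2, B, c1, c2):
    Lam = lam + mu1 + mu2
    p1, p2, p3 = mu1 / Lam, mu2 / Lam, lam / Lam
    S = [(q1, q2, a) for q1 in range(B + 1) for q2 in (0, 1) for a in (0, 1)
         if (q1, q2, a) != (B, 1, 0)]

    def actions(s):
        q1, q2, a = s
        return (0, 1) if a == 1 and q1 < B else (0,)

    def reward(s, d):
        q1, q2, a = s
        if d == 0:
            return -c1 * (a == 1) - c2 * p1 * (q1 > 0 and q2 == 1)
        return -c2 * p1 * (q2 == 1)               # acceptance, only allowed when a = 1, q1 < B

    def trans(s, d):
        q1, q2, a = s
        if d == 1:
            q1 += 1                               # admit the arriving customer
        out = {}
        # station-1 completion (fictitious if q1 = 0); the served customer moves to
        # station 2 if it is empty and is lost otherwise
        s1 = (max(q1 - 1, 0), min(q2 + 1, 1) if q1 > 0 else q2, 0)
        out[s1] = out.get(s1, 0.0) + p1
        s2 = (q1, max(q2 - 1, 0), 0)              # station-2 completion (possibly fictitious)
        out[s2] = out.get(s2, 0.0) + p2
        s3 = (q1, q2, 1)                          # next decision epoch is an arrival
        out[s3] = out.get(s3, 0.0) + p3
        return out

    return S, actions, reward, trans
```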
Therefore, throughout the remainder of this paper, a policy π is identified with a binary-valued function π : S → {0, 1}. Let Sd ≡ {(q1, q2, a) ∈ S : 0 ≤ q1 ≤ B − 1, a = 1} be the set of states at which a decision of acceptance or rejection needs to be made. Moreover, let gπ and hπ(·) denote the long-run average reward and relative value function, respectively, under policy π. Also, for any policy π, we define
∆π(s) ≡ (r(s, 1) + Σ_{s'∈S} p(s'|s, 1) hπ(s')) − (r(s, 0) + Σ_{s'∈S} p(s'|s, 0) hπ(s')), ∀s ∈ Sd. (4)
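As an illustration of how (4) can be used computationally, the following sketch (reusing the hypothetical build_mdp helper from the previous sketch) solves the evaluation equations for a given stationary policy with the normalization hπ(0, 0, 0) = 0 and then evaluates ∆π(s); checking the sign of ∆π over Sd is exactly the policy-improvement step used in Section IV.

```python
# A sketch of how (4) can be evaluated numerically. It assumes the illustrative
# build_mdp helper from the previous sketch; evaluate_policy solves the evaluation
# equations  g + h(s) = r(s, pi(s)) + sum_{s'} p(s'|s, pi(s)) h(s')  with the
# normalization h(0, 0, 0) = 0, and delta computes (4) for states in Sd.
import numpy as np

def evaluate_policy(S, reward, trans, policy):
    idx = {s: i for i, s in enumerate(S)}
    n = len(S)
    A = np.zeros((n, n + 1))      # columns: h(s) for s in S, plus one column for g
    b = np.zeros(n)
    for s in S:
        i, d = idx[s], policy(s)
        A[i, i] += 1.0
        A[i, n] += 1.0
        for s_next, p in trans(s, d).items():
            A[i, idx[s_next]] -= p
        b[i] = reward(s, d)
    A[:, idx[(0, 0, 0)]] = 0.0    # impose the reference value h(0, 0, 0) = 0
    x = np.linalg.lstsq(A, b, rcond=None)[0]
    h = {s: x[idx[s]] for s in S}
    h[(0, 0, 0)] = 0.0
    return x[n], h                # (g_pi, h_pi)

def delta(reward, trans, h, s):
    """Equation (4); only meaningful for s in Sd (a = 1 and q1 < B)."""
    value = lambda d: reward(s, d) + sum(p * h[sp] for sp, p in trans(s, d).items())
    return value(1) - value(0)
```

For instance, starting from π0, checking the sign of ∆π0 over Sd is precisely the improvement step carried out in Section IV-A.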
For notational simplicity, for a function only having the state argument s = (q1, q2, a), e.g., hπ(s), we omit a second set of parentheses and write hπ(q1, q2, a) instead of hπ((q1, q2, a)).
IV. PROOF OF THE MAIN RESULT
In this section, we prove our main result given in Theorem 1. Section IV-A establishes part (i) of Theorem 1 and Section IV-B contains the proof of part (ii) of Theorem 1.
A. Optimality of the prudent policy when c2 ≥ c∗
The prudent policy π0, under the discrete MDP framework, prescribes that π0(0, 0, 1) = 1, and π0(s) = 0, ∀s ∈ S \ {(0, 0, 1)}. We set hπ0(0, 0, 0) = 0 throughout. First note that the long-run average reward and relative value function are independent of B under π0. Intuitively, this follows from the fact that π0 only utilizes one unit of capacity of the buffer at station 1 (except possibly for an initial transient period). This observation is formally stated and proved in Lemma 1. Lemma 1: The long-run average reward gπ0 is the same for all B ≥ 1. Moreover, the relative value function hπ0(s) for any s = (q1, q2, a) ∈ S is independent of B as long as B ≥ q1. Proof: For any B ≥ 1, under π0, the set of recurrent states consists of the states (q1, q2, a) satisfying q1 + q2 = 0 or q1 + q2 = 1, and all other states in the state space are transient. Also, the transition probabilities among recurrent states are the same for any B ≥ 1. Therefore, gπ0 is independent of B. Under any policy π, hπ(s) − hπ(0, 0, 0) measures the asymptotic difference in total expected reward that results from starting the system in state s versus the reference state (0,0,0) (see page
339 of Puterman [13]). Specifically, here we have that
hπ0(s) − hπ0(0, 0, 0) = lim_{N→∞} [vN(s) − vN(0, 0, 0)], (5)
where vN (s) denotes the total expected reward over N periods under π0 when the system starts in state s. For a fixed s = (q1 , q2 , a) ∈ S, vN (s) and vN (0, 0, 0) are independent of B as long as B ≥ q1 . Indeed, a stronger result holds. The total reward over N periods starting at s, as a random variable, is independent of B along any sample path due to the prudent nature of π0 (i.e., always rejecting except in state (0,0,1)). This, combined with the assumption that hπ0 (0, 0, 0) = 0 for any B, yields that hπ0 (s) is the same for all B ≥ q1 . We know from equation (8.6.1) of Puterman [13] that the following evaluation equations hold: gπ0 + hπ0 (q, 0, 0) = p1 hπ0 (q − 1, 1, 0) + p2 hπ0 (q, 0, 0) + p3 hπ0 (q, 0, 1), 1 ≤ q ≤ B,
(6)
gπ0 + hπ0(q, 0, 1) = −c1 + p1 hπ0(q − 1, 1, 0) + p2 hπ0(q, 0, 0) + p3 hπ0(q, 0, 1), 1 ≤ q ≤ B, (7)
gπ0 + hπ0(q, 1, 1) = −c1 − c2 p1 + p1 hπ0(q − 1, 1, 0) + p2 hπ0(q, 0, 0) + p3 hπ0(q, 1, 1), 1 ≤ q ≤ B, (8)
gπ0 + hπ0(q, 1, 0) = −c2 p1 + p1 hπ0(q − 1, 1, 0) + p2 hπ0(q, 0, 0) + p3 hπ0(q, 1, 1), 1 ≤ q ≤ B. (9)
These equations are used to show some properties of the relative value function under π0 as stated in Lemmas 2 and 3.
Lemma 2:
hπ0(q, 0, 0) − c1 = hπ0(q, 0, 1), 1 ≤ q ≤ B, (10)
hπ0(q, 1, 1) − hπ0(q, 0, 1) = −c2 p1 / (p1 + p2), 1 ≤ q ≤ B, (11)
hπ0(q + 1, 0, 0) − hπ0(q, 0, 0) = hπ0(q, 1, 0) − hπ0(q − 1, 1, 0), 1 ≤ q ≤ B − 1, (12)
hπ0(q + 1, 1, 1) − hπ0(q, 1, 1) = hπ0(q, 1, 0) − hπ0(q − 1, 1, 0), 1 ≤ q ≤ B − 1. (13)
Proof: Equation (10) holds by subtracting (7) from (6), and (11) follows by subtracting (7) from (8). To prove (12), we substitute hπ0 (q, 0, 1) in (6) with the left-hand side of (10), rearrange the terms, and obtain that
hπ0(q, 0, 0) = hπ0(q − 1, 1, 0) − (p3/p1) c1 − gπ0/p1, 1 ≤ q ≤ B. (14)
Equation (12) follows from (14). For the proof of (13), replacing hπ0(q, 0, 0) in (8) with (14) and rearranging the terms yield
hπ0(q, 1, 1) − hπ0(q − 1, 1, 0) = −c1 (p1 + p2 p3) / (p1 (p1 + p2)) − c2 p1 / (p1 + p2) − gπ0/p1, 1 ≤ q ≤ B. (15)
Equation (13) then follows from (15). Lemma 3: hπ0 (q, 1, 0) − hπ0 (q − 1, 1, 0) = hπ0 (q − 1, 1, 0) − hπ0 (q − 2, 1, 0), 2 ≤ q ≤ B.
(16)
Proof: Replacing q with q − 1 in (9) yields, for any 2 ≤ q ≤ B, gπ0 + hπ0 (q − 1, 1, 0) = −c2 p1 + p1 hπ0 (q − 2, 1, 0) + p2 hπ0 (q − 1, 0, 0) + p3 hπ0 (q − 1, 1, 1). (17) Subtracting (17) from (9), we obtain that, for any 2 ≤ q ≤ B, hπ0 (q, 1, 0) − hπ0 (q − 1, 1, 0) = p1 [hπ0 (q − 1, 1, 0) − hπ0 (q − 2, 1, 0)] + p2 [hπ0 (q, 0, 0) − hπ0 (q − 1, 0, 0)] + p3 [hπ0 (q, 1, 1) − hπ0 (q − 1, 1, 1)].
(18)
Finally, (16) follows by applying (12) and (13) (with q replaced by q − 1) to the terms in the second and third square brackets of (18), respectively. Next, using Lemmas 2 and 3, we show that ∆π0(s) is the same for any s ∈ Sd \ {(0, 0, 1)}.
Proposition 1:
∆π0(q, 1, 1) = ∆π0(q − 1, 1, 1), 2 ≤ q ≤ B − 1,
(19)
∆π0 (q, 1, 1) = ∆π0 (q, 0, 1), 1 ≤ q ≤ B − 1.
(20)
Proof: First, to prove (19), we note that, for any 1 ≤ q ≤ B − 1, ∆π0 (q, 1, 1) = −c2 p1 + p1 hπ0 (q, 1, 0) + p2 hπ0 (q + 1, 0, 0) + p3 hπ0 (q + 1, 1, 1) − [−c1 − c2 p1 + p1 hπ0 (q − 1, 1, 0) + p2 hπ0 (q, 0, 0) + p3 hπ0 (q, 1, 1)] = c1 + p1 [hπ0 (q, 1, 0) − hπ0 (q − 1, 1, 0)] + p2 [hπ0 (q + 1, 0, 0) − hπ0 (q, 0, 0)] + p3 [hπ0 (q + 1, 1, 1) − hπ0 (q, 1, 1)].
(21)
Applying (12) and (13) to the second and third square brackets of (21), respectively, we obtain ∆π0 (q, 1, 1) = c1 + hπ0 (q, 1, 0) − hπ0 (q − 1, 1, 0).
(22)
Equation (19) then follows from (16) and (22).
Next, to show (20), we first obtain that, for any 1 ≤ q ≤ B − 1, ∆π0 (q, 0, 1) = p1 hπ0 (q, 1, 0) + p2 hπ0 (q + 1, 0, 0) + p3 hπ0 (q + 1, 0, 1) − [−c1 + p1 hπ0 (q − 1, 1, 0) + p2 hπ0 (q, 0, 0) + p3 hπ0 (q, 0, 1)].
(23)
Subtracting (23) from (21) and rearranging the terms yield, for any 1 ≤ q ≤ B − 1, ∆π0 (q, 1, 1) − ∆π0 (q, 0, 1) = p3 ([hπ0 (q + 1, 1, 1) − hπ0 (q + 1, 0, 1)] − [hπ0 (q, 1, 1) − hπ0 (q, 0, 1)]) = 0.
(24)
where the last equality follows from (11). We are now ready to prove the optimality of the prudent policy when c2 ≥ c∗. Proof of part (i) of Theorem 1: For the B = 1 case, the optimality of π0 can be verified by enumerating and comparing all the deterministic stationary policies. Specifically, there are three other policies, namely, the greedy policy π1, policy π2 with π2(s) = 0, ∀s ∈ S, and policy π3 with π3(0, 1, 1) = 1 and π3(s) = 0, ∀s ∈ S \ {(0, 1, 1)}. The fact that gπ0 > gπ2 = gπ3 follows from the simple argument that under π2 or π3, all customers (in the π3 case, except for possibly some arriving during an initial transient period) are lost, each incurring c1, while under π0 only a fraction of customers are lost in the long run, also each incurring c1. In fact, gπ2 = gπ3 = −c1 λ/Λ < −c1 λ^2 (µ1 + µ2) / (Λ(µ1 λ + µ2 λ + µ1 µ2)) = gπ0 (see (28) below). The fact that gπ0 ≥ (>) gπ1 when
c2 ≥ (>)c∗ is proved in Lemma 4 (which establishes this result for any B ≥ 1); therefore, we omit detailed expressions. This also shows the uniqueness of the optimal policy when c2 > c∗ . We next consider B ≥ 2. Letting B = 2, we apply the standard policy iteration algorithm (see page 378 of Puterman [13]) starting with π0 as the initial policy and obtain that ∆π0 (0, 0, 1) =
c1 µ1 µ2 / (µ1 λ + µ2 λ + µ1 µ2) > 0, (25)
∆π0(0, 1, 1) = ∆π0(1, 1, 1) = ∆, (26)
where
∆ = µ1 [(c1 − c2)(µ1 λ + µ2 λ + µ1 µ2) + c1 µ2^2] / [(µ1 + µ2)(µ1 λ + µ2 λ + µ1 µ2)].
As a consequence of Lemma 1, for a fixed state (q1, q2, 1) ∈ Sd, ∆π0(q1, q2, 1) remains the same for any B ≥ q1 + 1. Therefore, (25) and (26) in fact hold for any value of B ≥ 2. Note that since c2 ≥ c∗, ∆ ≤ 0. Then, from Proposition 1 and (26), for any B ≥ 2, we have ∆π0(0, 1, 1) = ∆π0(1, 0, 1) = ∆π0(1, 1, 1) = ... = ∆π0(B − 1, 0, 1) = ∆π0(B − 1, 1, 1) = ∆ ≤ 0,
or
∆π0(s) ≤ 0, ∀s ∈ Sd \ {(0, 0, 1)}.
(27)
Combining (25) and (27), we conclude that the policy π0 is optimal for all B ≥ 2. The uniqueness of the optimal policy when c2 > c∗ follows from Proposition 8.5.10 of Puterman [13] because in this case (25) and (27) hold as strict inequalities. Since gπ0 is the same for all B ≥ 1 (see Lemma 1), setting B = 1 (which results in a 7-state irreducible Markov chain), one can easily calculate gπ0 as
gπ0 = − c1 λ^2 (µ1 + µ2) / (Λ(µ1 λ + µ2 λ + µ1 µ2)). (28)
Furthermore, multiplying (28) with the uniformization factor Λ and reversing its sign, we obtain (1) as the optimal long-run average cost per unit time when c2 ≥ c∗.
B. Optimality of the greedy policy when c2 ≤ c∗
In the discrete MDP formulation, the greedy policy prescribes that π1(s) = 1, ∀s ∈ Sd, and π1(s) = 0, ∀s ∈ S \ Sd. Let
β1 ≡ − c2 λ µ1 / ((µ2 + λ)(µ1 + µ2)), β2 ≡ − c2 µ1 / (µ1 + µ2). (29)
The following definitions of the long-run average reward and relative value function (i.e., (30) to (35)) satisfy the evaluation equations (equation (8.6.1) in Puterman [13]) under π1:
gπ1 = − c1 λ^(B+1) (µ1 − λ) / (Λ(µ1^(B+1) − λ^(B+1))) − c2 λ^2 µ1^2 [µ1^B − λ^B + µ2 (µ1^(B−1) − λ^(B−1))] / (Λ(λ + µ2)(µ1 + µ2)(µ1^(B+1) − λ^(B+1))), (30)
hπ1(0, 0, 0) = 0, hπ1(0, 0, 1) = gπ1/p3, hπ1(0, 1, 0) = β1, hπ1(0, 1, 1) = gπ1/p3 + β2, (31)
hπ1(1, 0, 1) = (gπ1/p3)(p1 + p3)/p3 + gπ1/p3 − (p1/p3) β1. (32)
For 2 ≤ q ≤ B,
hπ1(q, 0, 1) = gπ1/p3 + (p1/p3)[hπ1(q − 1, 0, 1) − hπ1(q − 2, 0, 1)] − (p1/p3) β2 + hπ1(q − 1, 0, 1). (33)
For 1 ≤ q ≤ B,
hπ1(q, 0, 0) = hπ1(q − 1, 0, 1), hπ1(q, 1, 1) = hπ1(q, 0, 1) + β2. (34)
For 1 ≤ q ≤ B − 1,
hπ1(q, 1, 0) = hπ1(q − 1, 0, 1) + β2. (35)
In particular, the recursive definitions in (34) and (35) have simple explanations. The first equation in (34) holds because under π1 the acceptance action is taken at state (q − 1, 0, 1) and the state transitions and cost incurrence afterwards are exactly the same as if starting in state (q, 0, 0). The second equation in (34) follows from the following argument. Consider system I starting with state (q, 0, 1) and system II starting with (q, 1, 1) both under π1 in the same probability space. If the customer initially in service at station 1 completes service before the one in service at station 2 in system II, which occurs with probability
µ1/(µ1 + µ2), then an extra cost
of c2 is incurred in system II due to a customer loss at station 2 and after that both systems evolve identically; otherwise, both systems have the same cost incurrence throughout. Therefore, the asymptotic difference in total expected reward between system I and system II is exactly c2 µ1/(µ1 + µ2) = −β2. Note that equation (35) has a similar interpretation.
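As a numerical sanity check (not part of the proof, and assuming λ ≠ µ1), the sketch below builds gπ1 and hπ1 from (29) to (35) and compares them with the values obtained by solving the evaluation equations directly, reusing the hypothetical build_mdp and evaluate_policy helpers from the earlier sketches.

```python
# Assumes lam != mu1 (so the powers in (30) do not cancel) and reuses the
# hypothetical build_mdp and evaluate_policy helpers from the sketches above.
def h_greedy(lam, mu1, mu2, B, c1, c2):
    Lam = lam + mu1 + mu2
    p1, p3 = mu1 / Lam, lam / Lam
    beta1 = -c2 * lam * mu1 / ((mu2 + lam) * (mu1 + mu2))           # equation (29)
    beta2 = -c2 * mu1 / (mu1 + mu2)
    den = mu1 ** (B + 1) - lam ** (B + 1)
    g = (-c1 * lam ** (B + 1) * (mu1 - lam) / (Lam * den)           # equation (30)
         - c2 * lam ** 2 * mu1 ** 2
           * (mu1 ** B - lam ** B + mu2 * (mu1 ** (B - 1) - lam ** (B - 1)))
           / (Lam * (lam + mu2) * (mu1 + mu2) * den))
    h = {(0, 0, 0): 0.0, (0, 0, 1): g / p3,                          # equation (31)
         (0, 1, 0): beta1, (0, 1, 1): g / p3 + beta2}
    h[(1, 0, 1)] = (g / p3) * (p1 + p3) / p3 + g / p3 - (p1 / p3) * beta1   # (32)
    for q in range(2, B + 1):                                        # equation (33)
        h[(q, 0, 1)] = (g / p3 + (p1 / p3) * (h[(q - 1, 0, 1)] - h[(q - 2, 0, 1)])
                        - (p1 / p3) * beta2 + h[(q - 1, 0, 1)])
    for q in range(1, B + 1):                                        # equation (34)
        h[(q, 0, 0)] = h[(q - 1, 0, 1)]
        h[(q, 1, 1)] = h[(q, 0, 1)] + beta2
    for q in range(1, B):                                            # equation (35)
        h[(q, 1, 0)] = h[(q - 1, 0, 1)] + beta2
    return g, h

if __name__ == "__main__":
    lam, mu1, mu2, B, c1, c2 = 1.0, 1.2, 0.8, 4, 1.0, 1.3
    g, h = h_greedy(lam, mu1, mu2, B, c1, c2)
    S, actions, reward, trans = build_mdp(lam, mu1, mu2, B, c1, c2)
    pi1 = lambda s: 1 if (s[2] == 1 and s[0] < B) else 0             # greedy policy
    g_num, h_num = evaluate_policy(S, reward, trans, pi1)
    print(abs(g - g_num), max(abs(h[s] - h_num[s]) for s in S))      # expected ~0 up to rounding
```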
First, we note the following relation between gπ1 and gπ0, which is repeatedly used in the proofs of the later results.
Lemma 4: If c2 = c∗, then gπ1 = gπ0; if c2 < (>) c∗, then gπ1 > (<) gπ0.
Proposition 2:
∆π1(0, 0, 1) > ∆π1(0, 1, 1), (44)
∆π1(0, 1, 1) ≥ 0, (45)
∆π1(q, 1, 1) = ∆π1(q, 0, 1), 1 ≤ q ≤ B − 1, (46)
∆π1(q, 1, 1) ≥ 0, 1 ≤ q ≤ B − 1. (47)
Proof: First, we prove (44). Using hπ1 (0, 0, 0) = 0, we obtain that ∆π1 (0, 0, 1) = p1 hπ1 (0, 1, 0) + p2 hπ1 (1, 0, 0) + p3 hπ1 (1, 0, 1) − [−c1 + p3 hπ1 (0, 0, 1)]
(48)
and ∆π1 (0, 1, 1) = −c2 p1 + p1 hπ1 (0, 1, 0) + p2 hπ1 (1, 0, 0) + p3 hπ1 (1, 1, 1) − [−c1 + p1 hπ1 (0, 1, 0) + p3 hπ1 (0, 1, 1)] = −c2 p1 + c1 + p2 hπ1 (1, 0, 0) + p3 [hπ1 (1, 1, 1) − hπ1 (0, 1, 1)].
(49)
Subtracting (49) from (48) yields ∆π1 (0, 0, 1)−∆π1 (0, 1, 1) = c2 p1 +p1 hπ1 (0, 1, 0)+p3 [hπ1 (1, 0, 1)−hπ1 (1, 1, 1)+hπ1 (0, 1, 1)−hπ1 (0, 0, 1)]. (50) From (31) and (34), we have that hπ1 (1, 1, 1) − hπ1 (1, 0, 1) = hπ1 (0, 1, 1) − hπ1 (0, 0, 1) = β2 ,
(51)
which implies that the last square bracket of (50) equals 0. Equation (50) then becomes ∆π1 (0, 0, 1) − ∆π1 (0, 1, 1) = c2 p1 + p1 hπ1 (0, 1, 0) = c2 p1 + p1 β1 > 0,
(52)
where the last inequality follows since c2 + β1 > 0. This establishes (44). Next we show (45). It follows from (31), (32), and (34) that
hπ1(1, 1, 1) − hπ1(0, 1, 1) = ((p1 + p3)/p3)(gπ1/p3) − (p1/p3) β1. (53)
Applying (53) and the definition of hπ1(1, 0, 0) (which is the same as hπ1(0, 0, 1), as defined in (31)) to (49) yields
∆π1(0, 1, 1) = −c2 p1 + c1 − p1 β1 + gπ1/p3 ≥ −c2 p1 + c1 − p1 β1 + gπ0/p3, (54)
where the last inequality follows from Lemma 4. From (54), with gπ0 replaced by (28) and β1 by (29), it can be verified by straightforward calculation that (45) indeed holds if c2 ≤ c∗ . The proof of (46) is exactly the same as that for (20) in Proposition 1 (with π0 replaced by π1 ). To prove (47), we first note that, with π0 replaced by π1 , (21) still holds and becomes ∆π1 (q, 1, 1) = c1 + p1 [hπ1 (q, 1, 0) − hπ1 (q − 1, 1, 0)] + p2 [hπ1 (q + 1, 0, 0) − hπ1 (q, 0, 0)] + p3 [hπ1 (q + 1, 1, 1) − hπ1 (q, 1, 1)], 1 ≤ q ≤ B − 1.
(55)
Applying Corollary 1 to (55) yields ∆π1 (q, 1, 1) ≥ 0, for 2 ≤ q ≤ B − 1. Due to Corollary 1, in order to show ∆π1 (1, 1, 1) ≥ 0, it suffices to prove that hπ1 (1, 1, 0) − hπ1 (0, 1, 0) ≥ −c1 .
(56)
From the definitions of hπ1(1, 1, 0) and hπ1(0, 1, 0), we obtain that
hπ1(1, 1, 0) − hπ1(0, 1, 0) = gπ1/p3 + β2 − β1 ≥ gπ0/p3 + β2 − β1, (57)
where the last inequality follows from Lemma 4. Then, (56) can be algebraically verified by applying c2 ≤ c∗ to (57). This completes the proof of (47). Finally, we prove the optimality of π1 when c2 ≤ c∗ . Proof of part (ii) of Theorem 1: Proposition 2 implies that, if c2 ≤ c∗ , then ∆π1 (s) ≥ 0, ∀s ∈ Sd .
(58)
This establishes the optimality of π1. If c2 < c∗, the uniqueness of the optimal policy follows from Proposition 8.5.10 of Puterman [13] and the fact that (58) holds with strict inequality. Multiplying (30) with the uniformization factor and reversing its sign, we obtain (2) as the optimal long-run average cost when c2 ≤ c∗. It follows from Lemma 4 that when c2 = c∗ both π0 and π1 are optimal and Cπ0 = Cπ1.
REFERENCES
[1] D. Bertsekas and R. Gallager, Data Networks (2nd Edition). Prentice Hall, 1991.
[2] S. Spicer and I. Ziedins, “User-optimal state-dependent routeing in parallel tandem queues with loss,” J. Appl. Prob., vol. 43, pp. 274–281, 2006.
[3] R. Sheu and I. Ziedins, “Asymptotically optimal control of parallel tandem queues with loss,” preprint, 2009.
[4] C.-Y. Ku and S. Jordan, “Access control to two multiserver loss queues in series,” IEEE Trans. Autom. Control, vol. 42, no. 7, pp. 1017–1023, 1997.
[5] ——, “Near optimal admission control for multiserver loss queues in series,” Eur. J. Oper. Res., vol. 144, no. 1, pp. 166–178, January 2003.
[6] ——, “Access control of parallel multiserver loss queues,” Perf. Eval., vol. 50, no. 4, pp. 219–231, 2002.
[7] K.-H. Chang and W.-F. Chen, “Admission control policies for two-stage tandem queues with no waiting spaces,” Comput. Oper. Res., vol. 30, no. 4, pp. 589–601, 2003.
[8] H. A. Ghoneim and S. Stidham, “Control of arrivals to two queues in series,” Eur. J. Oper. Res., vol. 21, no. 3, pp. 399–409, 1985.
[9] A. Hordijk and G. Koole, “On the shortest queue policy for the tandem parallel queue,” Probability in the Engineering and Informational Sciences, vol. 6, pp. 63–79, 1992.
[10] S. Stidham, Jr., “Optimal control of admission to a queueing system,” IEEE Trans. Autom. Control, vol. 30, no. 8, pp. 705–713, Aug 1985.
[11] K. Y. Lin and S. M. Ross, “Admission control with incomplete information of a queueing system,” Oper. Res., vol. 51, no. 4, pp. 645–654, 2003.
[12] S. A. Lippman, “Applying a new device in the optimization of exponential queuing systems,” Oper. Res., vol. 23, no. 4, pp. 687–710, 1975.
[13] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience, April 1994.