pattern recognition, biped and human dynamics and control, and robotics. Dr. Hemami was Chairman of the Columbus Section of the IEEE, and a member of the Education Committee of the IEEE Control Systems Society.
Frank Carlin Weimer (M'42-SM'51) received the B.S.E.E. degree from Ohio University, Athens, and the M.Sc. and Ph.D. degrees from The Ohio State University, Columbus. He is currently Professor and Vice Chairman of Electrical Engineering, The Ohio State University. He has had industrial experience in engineering at the Delco Products and A. C. Spark Plug Divisions of the General Motors Corporation, and research experience at the Massachusetts Institute of Technology and the University of Illinois, as well as The Ohio State University Research Foundation. He has also served as a consultant to several industrial and research organizations.
Dr. Weimer is a member of Eta Kappa Nu, Tau Beta Pi, Sigma Xi, the American Society for Engineering Education, the American Association for the Advancement of Science, and the National Society of Professional Engineers. He was Chairman of the Linear Systems Theory Committee of the Automatic Control Group of the IEEE from 1963 to 1966.
Optimal Decentralized Control in a Multiaccess Channel with Partial Information

ZVI ROSBERG
Abstract - We consider two transmission stations sharing a single communication channel. For different values of the input message rates $r_i$, $i = 1, 2$, a simple open-loop control policy is shown to be optimal for the long-run average throughput criterion.
I. INTRODUCTION
CONSIDER $I$ transmission stations sharing a single communication channel. (Although the formulation is given for any $I$, this paper deals only with $I = 2$.) The channel is assumed to be slotted, that is, the channel time is divided into equal segments called slots. The length of each slot equals that of a packet. Furthermore, it is assumed that the stations are synchronized with the channel in the sense that attempts to transmit packets are exactly time-aligned with the slots, which are labeled $t = 1, 2, \dots$. Every station $i$ contains a buffer which can store one packet; $X_i(t) = 1$ or $0$, depending on whether the buffer is full or empty at slot $t$; $Y_i(t) = 1$ or $0$, depending on whether a new packet does or does not arrive at slot $t$. The buffer content at time $t$ is either transmitted or not, via the common channel, depending on whether $u_i(t) = 1$ or $0$. Channel operation is described by the following dynamic system:

$$X_i(t+1) = Y_i(t) + X_i(t)\bigl(1 - u_i(t)\bigr)\bigl(1 - Y_i(t)\bigr), \qquad 1 \le i \le I, \quad t = 0, 1, 2, \dots$$

where $\{Y_i(t), X_i(0) \mid 1 \le i \le I,\ t = 0, 1, 2, \dots\}$ are independent Bernoulli r.v.'s and

$$P\bigl(Y_i(t) = 1\bigr) = P\bigl(X_i(0) = 1\bigr) = r_i.$$
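As a concrete illustration, the one-slot transition above is easy to render in code. The sketch below is ours, not the paper's: the function name is invented, and the rule that a slot yields a delivered packet only when exactly one full buffer transmits is our reading of the reward structure used later in Section III (a conflict earns nothing).

```python
import random

def slot(x, u, r, rng=random):
    """One slot of the channel: buffers x=(x1,x2), actions u=(u1,u2), rates r.

    Returns (next buffers, packets delivered this slot).
    """
    y = [1 if rng.random() < ri else 0 for ri in r]      # arrivals Y_i(t)
    attempts = sum(xi * ui for xi, ui in zip(x, u))      # full buffers that transmit
    delivered = 1 if attempts == 1 else 0                # a conflict wastes the slot (assumption)
    # X_i(t+1) = Y_i(t) + X_i(t) (1 - u_i(t)) (1 - Y_i(t))
    x_next = [yi + xi * (1 - ui) * (1 - yi) for xi, yi, ui in zip(x, y, u)]
    return x_next, delivered
```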
...it has been shown that these two particular policies are person-by-person optimal for certain intervals of $r$. Another closely related paper is the recent study of Hajek and Van Loon [3].

In this paper, we only consider the case $I = 2$. In Section II, we formulate the dynamic programming model, and in Section III, we derive the optimal control policy for the long-run average throughput. We show that the optimal control does not permit the two stations to transmit at the same time. Moreover, the optimal policy permits one of the stations to transmit every $k$ slots and the other to transmit at the rest of the slots.

Let $V_T(\pi)$ be the expected number of packets transmitted during the first $T$ slots (hereinafter $T$ steps) using policy $\pi$. We have

$$V_T(\pi) = E_\pi \sum_{t=0}^{T-1} r(t) = E_\pi \sum_{t=0}^{T-1} E\bigl(r(t) \mid W^{t-1}, u^{t-1}\bigr) \tag{1.4}$$

where $E(r(t) \mid W^{t-1}, u^{t-1})$ is the expected immediate reward at step $t$ under policy $\pi$. Also, let

$$V(\pi) = \liminf_{T \to \infty} V_T(\pi)/T. \tag{1.5}$$
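Criterion (1.5) can be estimated by straightforward Monte Carlo: simulate $V_T(\pi)/T$ for a large $T$. The sketch below is ours (all names and parameter values are illustrative) and uses an open-loop schedule of the shape the paper proves optimal in Section III: station 1 transmits every $k$-th slot and station 2 in the remaining slots.

```python
import random

def average_throughput(r, k, T=200_000, seed=1):
    """Monte Carlo estimate of V(pi) = lim V_T(pi)/T for a periodic schedule."""
    rng = random.Random(seed)
    x = [1 if rng.random() < ri else 0 for ri in r]    # X_i(0) ~ Bernoulli(r_i)
    delivered = 0
    for t in range(T):
        u = (1, 0) if t % k == 0 else (0, 1)           # open loop: no feedback used
        if sum(xi * ui for xi, ui in zip(x, u)) == 1:  # exactly one transmission succeeds
            delivered += 1
        y = [1 if rng.random() < ri else 0 for ri in r]
        x = [yi + xi * (1 - ui) * (1 - yi) for xi, yi, ui in zip(x, y, u)]
    return delivered / T

print(average_throughput(r=(0.1, 0.3), k=5))           # compare with (3.12) below
```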
When the limit in (1.5) exists, $V(\pi)$ is the long-run average throughput achieved by $\pi$.

For every $t$, $X(t) = (X_1(t), X_2(t))$ is an r.v. whose probability distribution depends on the information $(W^{t-1}, u^{t-1})$. Had $X(t)$ been known, it could serve as a state. However, $X(t)$ is an r.v., so the best one can do is to take as the state its probability distribution given $(W^{t-1}, u^{t-1})$. From Lemmas 2.1 and 2.2, this distribution is completely defined by the parameters $k(t) = (k_1(t), k_2(t))$. Therefore, the state space is

$$S = \{k = (k_1, k_2) \mid k_i = 1, 2, \dots,\ i = 1, 2\}.$$

We consider the following policies, each defined by a regenerative cycle:

1) The policy $\pi_1(k_1, k_2)$, $k_1 \ge 2$, $k_2 \ge 1$, which is defined by the regenerative cycle given in Fig. 1. The notation $(n_1, n_2) \xrightarrow{u_1, u_2} (n_1', n_2')$ in the figures means that at state $(n_1, n_2)$ the control action is $(u_1, u_2)$, after which the system moves on to state $(n_1', n_2')$.

2) The policy $\pi_2(k_1, k_2)$, $k_1 \ge 1$, $k_2 \ge 2$, which is defined by the regenerative cycle given in Fig. 2.

3) The policy $\pi(1, 1)$, which is defined by Fig. 3.

4) The policy $\pi(k_1, k_2)$, $k_i \ge 2$, $i = 1, 2$, which is defined by Fig. 4.

[Figs. 1-4: the regenerative cycles defining $\pi_1(k_1,k_2)$, $\pi_2(k_1,k_2)$, $\pi(1,1)$, and $\pi(k_1,k_2)$; only state labels such as $(1,2)$, $(1,k_2)$, $(k_1,1)$ and transition labels survive in the source scan.]

Remark 3.1: From the mean ergodic theorem, it follows that for every policy $\pi$ above, the limit in (1.5) exists and $V(\pi)$ equals the average of the expected immediate rewards over one regenerative cycle.

States $(k_1, k_2)$ with $k_1 > K_1 + 1$ or $k_2 > K_2 + 1$ are transient under these policies. Let

$$S' = \{(k_1, k_2) \mid 1 \le k_i \le K_i + 1,\ i = 1, 2\},$$

$$A_s' = \{(1,1), (0,1)\} \quad \text{for } s = (k_1, K_2+1),\ k_1 \le K_1+1,$$

$$A_s' = \{(1,1), (1,0)\} \quad \text{for } s = (K_1+1, k_2),\ k_2 \le K_2+1,$$

$$A_s' = A_s \quad \text{otherwise}, \tag{3.2}$$

$$q' = q, \qquad \bar{r}' = \bar{r}.$$
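In code, the information state is just a pair of counters. The sketch below is an assumption-labeled reading of Lemmas 2.1-2.2 (which this excerpt does not reproduce): $k_i(t)$ counts the slots since station $i$'s buffer was last emptied, a transmission resets the counter to 1, and the buffer is then full with probability $p_i(k) = 1 - (1 - r_i)^k$. This form of $p_i$ is consistent with the values used later, e.g., $p_2(1) = r_2$ in the proof of Lemma 3.4.

```python
def p(k, r):
    """P(buffer full | k slots since it was last emptied); assumed form."""
    return 1.0 - (1.0 - r) ** k

def next_state(k, u):
    """Information-state update: u_i = 1 empties buffer i and resets its counter."""
    return tuple(1 if ui else ki + 1 for ki, ui in zip(k, u))

assert next_state((1, 1), (0, 0)) == (2, 2)   # matches T^1_{pi*}(k1,k2) = (k1+1, k2+1)
assert next_state((3, 1), (1, 1)) == (1, 1)
assert p(1, 0.3) == 0.3                       # p_i(1) = r_i
```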
Lemma 3.1: There exists a policy $\pi_0$ which takes only a finite number of consecutive control actions $(u_1, 0)$ or $(0, u_2)$, $u_i = 0, 1$, such that $V(\pi_0) > V(\pi)$ for every $\pi$ which takes infinitely many consecutive control actions with $u_1 = 0$ ($u_2 = 0$).

Proof: Suppose that $r_2 \ge r_1$. Let $k_0$ be the first integer such that $p_1(k_0) > r_2$, and let $\pi_0 = \pi(k_0, 2)$. From Remark 3.1 and the definition of $\pi(k_1, k_2)$ in Fig. 4, we have

$$V(\pi_0) = \bigl((k_0 - 2)\, p_2(1) + p_1(k_0) + p_2(2)\bigr)/k_0 > \max\{r_1, r_2\}. \tag{3.1}$$

Now, let $\pi$ be a policy which takes infinitely many consecutive control actions, say $(0, u_2)$. From (3.1), (1.3), (2.2), and (2.5), it follows that

$$V(\pi) \le r_2 < V(\pi_0).$$

For the control actions $(u_1, 0)$, we have $V(\pi) \le r_1 < V(\pi_0)$. □

Corollary 3.1: For every $(r_1, r_2)$, there exists a dynamic programming problem (d.p.p.) with finite state and action spaces such that its value function $V'$ is not smaller than $V$, the value function of the d.p.p. in Section II.

Proof: Consider the d.p.p. given in (3.2). From Lemma 3.1, its value function $V'$ is not smaller than $V$. □

Henceforth, we consider only the d.p.p. given in (3.2).

Theorem 3.1: There exists a nonrandomized stationary control policy $\pi^*$ such that

$$V = \sup_\pi V(\pi) = V(\pi^*).$$

Proof: From Corollary 3.1, we can reduce the d.p.p. given in Section II to a d.p.p. with finite state and action spaces in which a nonrandomized stationary optimal control policy exists. (See, e.g., Derman [2].) □

Hereinafter, we consider only nonrandomized stationary control policies. Let $u^*(k_1, k_2)$ be the control action taken at state $(k_1, k_2)$ by the optimal policy $\pi^*$. In the following lemmas and theorem, we show that attention can be restricted to policies of the type $\pi(k_1, k_2)$, $k_i \ge 1$.

Lemma 3.2: $u^*(k_1, k_2) \ne (0, 0)$ for any state $(k_1, k_2)$.

Proof: Suppose, in contradiction, that under the optimal stationary policy $\pi^*$, $u^*(k_1, k_2) = (0, 0)$ for some state $(k_1, k_2)$. Let $N$ be the number of states in one regenerative cycle of the process under policy $\pi^*$. ($N$ is also the number of steps in one regenerative cycle.) For every policy $\pi$, let $T_\pi^n(k_1, k_2)$, $n = 0, 1, 2, \dots$, be the state of the process under policy $\pi$, $n$ steps after the first time that the process enters state $(k_1, k_2)$ [e.g., $T_\pi^0(k_1, k_2) = (k_1, k_2)$ and $T_{\pi^*}^1(k_1, k_2) = (k_1 + 1, k_2 + 1)$]. Propositions A and B below follow directly from the definition of $N$ and the fact that $N < \infty$.

Proposition A: $T_{\pi^*}^N(k_1, k_2) = (k_1, k_2)$.

Proposition B: There exists an $N' \le N$ such that $u_i^*\bigl(T_{\pi^*}^{N'}(k_1, k_2)\bigr) = 1$ for $i = 1$ ($i = 2$) and $u_j^*\bigl(T_{\pi^*}^{n}(k_1, k_2)\bigr) \ne 1$ for $j = 2$ ($j = 1$) and $n \le N'$.

Let $N_0$ be the minimum of the numbers $N'$ satisfying the conditions of Proposition B.
Now, define recursively a control policy $\bar{\pi}$ as follows:

$$\bar{\pi}\bigl(T_{\bar{\pi}}^{n}(k_1, k_2)\bigr) = u^*\bigl(T_{\pi^*}^{n+1}(k_1, k_2)\bigr), \qquad n = 0, 1, 2, \dots, N - 2.$$

Clearly, $\bar{\pi}$ is well defined, since for a given $\pi^*$, the control actions of $\bar{\pi}$ at the states $T_{\bar{\pi}}^{i}(k_1, k_2)$, $1 \le i \le n - 1$, define $T_{\bar{\pi}}^{n}(k_1, k_2)$.

Proposition C: $T_{\bar{\pi}}^{N-1}(k_1, k_2) = (k_1, k_2)$ and $T_{\bar{\pi}}^{n}(k_1, k_2) \ne (k_1, k_2)$ for $1 \le n < N - 1$ (i.e., $N - 1$ is the number of states in one regenerative cycle of the process under policy $\bar{\pi}$).

Proof of Proposition C: From Proposition B, we have

$$T_{\bar{\pi}}^{n-1}(k_1, k_2) = T_{\pi^*}^{n}(k_1, k_2) \qquad \text{for } n \ge N_0.$$

Thus, Proposition A implies that $T_{\bar{\pi}}^{N-1}(k_1, k_2) = T_{\pi^*}^{N}(k_1, k_2) = (k_1, k_2)$, and the first part is proven. For $n < N_0$, $(k_1', k_2') := T_{\bar{\pi}}^{n-1}(k_1, k_2) \ne (k_1, k_2)$, since $k_i' > k_i$ for at least one $i$. For $N - 1 > n \ge N_0$, $T_{\bar{\pi}}^{n-1}(k_1, k_2) \ne (k_1, k_2)$ since $T_{\pi^*}^{n}(k_1, k_2) \ne (k_1, k_2)$, and the second part is proven.

To show that $V(\bar{\pi}) > V(\pi^*)$, we first evaluate the difference $N V(\pi^*) - (N-1) V(\bar{\pi})$. From (2.5), Remark 3.1, Propositions A and C, the monotonicity of $p_i(k)$, and the fact that $p_i(k+1) - p_i(k)$ decreases in $k$, it follows (and can be easily verified) that

$$N V(\pi^*) - (N-1) V(\bar{\pi}) \le p_1(k_1+1) - p_1(k_1) + p_2(k_2+1) - p_2(k_2).$$

Therefore,

$$(N-1)\bigl(V(\bar{\pi}) - V(\pi^*)\bigr) \ge V(\pi^*) - \bigl[p_1(k_1+1) - p_1(k_1) + p_2(k_2+1) - p_2(k_2)\bigr]$$
$$= V(\pi^*) - r_1(1-r_1)^{k_1} - r_2(1-r_2)^{k_2}$$
$$\ge V\bigl(\pi(1,1)\bigr) - \bigl[r_1(1-r_1) + r_2(1-r_2)\bigr] \ge 0. \tag{3.5}$$

Thus, $V(\bar{\pi}) > V(\pi^*)$, which is a contradiction. □

Lemma 3.3: For every state $(k_1, k_2)$, if $p_i(k_i) > 1/2$ for some $i$, then $u^*(k_1, k_2) \ne (1, 1)$.

Proof: Suppose, in contradiction, that under $\pi^*$, $u^*(k_1', k_2') = (1, 1)$ for some state $(k_1', k_2')$ and $p_1(k_1') > 1/2$. (If $p_2(k_2') > 1/2$, then the proof is similar.) From Lemma 3.2, it follows that $\pi^*$ can only be a policy of type $\pi_1(k_1, k_2)$, $\pi_2(k_1, k_2)$, or $\pi(1, 1)$. Define $\bar{\pi}$ as follows. If $\pi^* = \pi_2(1, k_2)$ or $\pi(1, 1)$, then let $\bar{\pi}$ be the policy which always takes $u = (1, 0)$. If $\pi^* = \pi_1(k_1, k_2)$ or $\pi_2(k_1, k_2)$, $k_1 \ge 2$, then let $\bar{\pi} = \pi_1(k_1, 1)$. From Remark 3.1, it is easy to check that, in any case, $V(\bar{\pi}) > V(\pi^*)$, which is a contradiction. □

Corollary 3.2: If $r_i > 1/2$ for some $i$, then $u^*(k_1, k_2) \ne (1, 1)$ for every state $(k_1, k_2)$.

Proof: The corollary follows from Lemma 3.3 since $p_i(k)$ is increasing in $k$. □

Lemma 3.4: If $u^*(k_1, k_2) = (1, 1)$ for some state $(k_1, k_2)$, then $\pi^* = \pi(1, 1)$.

Proof: From Corollary 3.2, we may assume that $r_i < 1/2$, $i = 1, 2$. From Lemma 3.2, it follows that any nonrandomized stationary policy is of type $\pi_1(k_1, k_2)$, $\pi_2(k_1, k_2)$, or $\pi(k_1, k_2)$, which are given in Figs. 1-4. It also follows that any nontransient state is of the form $(1, k)$ or $(k, 1)$. Suppose that $u^*(k, 1) = (1, 1)$. (For $u^*(1, k) = (1, 1)$, the proof is similar.) If $k = 1$, then $\pi^* = \pi(1, 1)$ and the lemma is proved. If $k > 1$, then the process $k(t)$, $t = 0, 1, 2, \dots$, under $\pi^*$ enters state $(k, 1)$ only from state $(k-1, 1)$, and the control action at state $(k-1, 1)$ is $u^*(k-1, 1) = (0, 1)$. From state $(k, 1)$, the process under $\pi^*$ moves on to state $(1, 1)$. The reward under $\pi^*$ during the transitions from $(k-1, 1)$ to $(k, 1)$ and then to $(1, 1)$ is

$$r^* = 2 p_2(1) + p_1(k)\bigl(1 - 2 p_2(1)\bigr). \tag{3.3}$$

Now, let $\bar{\pi}$ be the nonstationary policy which does the same as $\pi^*$, except when the process enters state $(k-1, 1)$. At this state, $\bar{\pi}$ takes the control action $u = (1, 1)$. Then the process moves on to state $(1, 1)$, in which $\bar{\pi}$ again takes the control action $u = (1, 1)$. At this point, $\bar{\pi}$ proceeds as $\pi^*$. The reward during these two consecutive $(1,1)$-control actions is

$$\bar{r} = 2 p_2(1) + \bigl(p_1(k-1) + p_1(1)\bigr)\bigl(1 - 2 p_2(1)\bigr). \tag{3.4}$$

Since $p_2(1) = r_2 < 1/2$ and $p_1(k-1) + p_1(1) > p_1(k)$, it follows from (3.3) and (3.4) that

$$\bar{r} > r^*.$$

All the other immediate rewards remain unchanged under $\bar{\pi}$. Thus, from the mean ergodic theorem, we obtain $V(\bar{\pi}) > V(\pi^*)$, which is a contradiction. □
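The proof above leans on two elementary properties of the assumed form $p(k) = 1 - (1-r)^k$: its increments decrease in $k$, and $p(k-1) + p(1) > p(k)$. A quick numerical check (purely illustrative, not from the paper):

```python
def p(k, r):
    return 1.0 - (1.0 - r) ** k

for r in (0.05, 0.25, 0.49):
    for k in range(2, 40):
        # increments p(k+1) - p(k) = r (1-r)^k strictly decrease in k
        assert p(k + 1, r) - p(k, r) < p(k, r) - p(k - 1, r)
        # p(k-1) + p(1) - p(k) = r - r (1-r)^(k-1) > 0
        assert p(k - 1, r) + p(1, r) > p(k, r)
print("increment and superadditivity checks passed")
```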
The following theorem is a direct consequence of Lemma 3.4 and Corollary 3.2.

Theorem 3.2: For every $0 < r_1, r_2 < 1$, $\pi^* = \pi(k_1, k_2)$ for some $k_i \ge 1$, $i = 1, 2$.

Next, we shall find the values $(k_1, k_2)$ which maximize $V(\pi(k_1, k_2))$, $k_i \ge 2$, $i = 1, 2$, and then we shall show that this maximum is larger than $V(\pi(1,1))$. From (2.5), Figs. 3 and 4, and Remark 3.1, we have

$$V\bigl(\pi(1,1)\bigr) = r_1 + r_2 - 2 r_1 r_2 \tag{3.6}$$

and

$$V\bigl(\pi(k_1, k_2)\bigr) = \frac{(k_1-2)\, p_2(1) + (k_2-2)\, p_1(1) + p_1(k_1) + p_2(k_2)}{k_1 + k_2 - 2}. \tag{3.7}$$
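The closed forms (3.6) and (3.7) are easy to evaluate directly. In the sketch below (ours), the coefficient placement in (3.7) is the reconstruction given above; as a consistency check, the code verifies the recursion (3.9) derived next.

```python
def p(k, r):
    return 1.0 - (1.0 - r) ** k

def v11(r1, r2):
    return r1 + r2 - 2 * r1 * r2                      # (3.6)

def v(k1, k2, r1, r2):
    """(3.7), valid for k1, k2 >= 2 (coefficient placement as reconstructed)."""
    num = (k1 - 2) * p(1, r2) + (k2 - 2) * p(1, r1) + p(k1, r1) + p(k2, r2)
    return num / (k1 + k2 - 2)

r1, r2, k1, k2 = 0.2, 0.3, 4, 3
lhs = v(k1, k2 + 1, r1, r2)                           # left side of (3.9)
rhs = ((k1 + k2 - 2) * v(k1, k2, r1, r2)
       + p(1, r1) + p(k2 + 1, r2) - p(k2, r2)) / (k1 + k2 - 1)
assert abs(lhs - rhs) < 1e-12
```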
From (3.7), we have

$$V\bigl(\pi(k_1+1, k_2)\bigr) = \frac{k_1+k_2-2}{k_1+k_2-1}\, V\bigl(\pi(k_1, k_2)\bigr) + \frac{p_2(1) + p_1(k_1+1) - p_1(k_1)}{k_1+k_2-1} \tag{3.8}$$

and

$$V\bigl(\pi(k_1, k_2+1)\bigr) = \frac{k_1+k_2-2}{k_1+k_2-1}\, V\bigl(\pi(k_1, k_2)\bigr) + \frac{p_1(1) + p_2(k_2+1) - p_2(k_2)}{k_1+k_2-1}. \tag{3.9}$$

Lemma 3.5:
a) If $r_2 \ge r_1$, then $V(\pi(2,2)) \ge p_2(3) - p_2(2) + p_1(1)$.
b) If $r_1 \ge r_2$, then $V(\pi(2,2)) \ge p_1(3) - p_1(2) + p_2(1)$.

Proof: Since $p_i(k+1) - p_i(k)$ decreases in $k$, the lemma follows by a straightforward computation using the definition of $V(\pi(2,2))$ given in (3.7). □

Theorem 3.3: For finding the optimal control policies when $r_2 \ge r_1$ ($r_1 \ge r_2$), the only policies which have to be considered are $\pi(1,1)$ and $\pi(k,2)$ ($\pi(2,k)$), $k = 2, 3, \dots$.

Proof: Let $k_1 \ge 2$. If $V(\pi(k_1,2)) > p_2(3) - p_2(2) + p_1(1)$, then from (3.7) and (3.9), it follows that

$$\max_{k_2} V\bigl(\pi(k_1, k_2)\bigr) = V\bigl(\pi(k_1, 2)\bigr). \tag{3.10}$$

If $V(\pi(k_1,2)) \le p_2(3) - p_2(2) + p_1(1)$, then since $p_1(k+1) - p_1(k)$ decreases in $k$, it follows from (3.7), (3.9), and Lemma 3.5 that

$$\max_{k_2} V\bigl(\pi(k_1, k_2)\bigr) \le p_2(3) - p_2(2) + p_1(1) \le V\bigl(\pi(2,2)\bigr). \tag{3.11}$$

Now, the theorem follows from (3.10) and (3.11). □

In Theorem 3.4, it will be shown that $\pi(1,1)$ is not optimal.

Lemma 3.6: For every $k = 0, 1, 2, \dots$ and $0 \le r \le 1$,

$$(1-r)^k \le 1 - kr + \tfrac{1}{2} k^2 r^2.$$

Theorem 3.4: For every $r_2 \ge r_1$ ($r_1 \ge r_2$), the policy $\pi(k_0, 2)$, where $k_0 = \lceil r_2/r_1 \rceil$ ($\pi(2, k_0)$, where $k_0 = \lceil r_1/r_2 \rceil$), is better than $\pi(1,1)$. ($\lceil a \rceil$ is the smallest integer not smaller than $a$.)

Proof: We consider only the case $r_2 \ge r_1$. From (3.6) and (3.7), we have

$$V\bigl(\pi(k,2)\bigr) = p_2(1) + \bigl(p_1(k) - r_2^2\bigr)/k = r_2 + \bigl(1 - (1-r_1)^k - r_2^2\bigr)/k \tag{3.12}$$

and

$$V\bigl(\pi(1,1)\bigr) = r_1 + r_2 - 2 r_1 r_2. \tag{3.13}$$

From (3.12) and (3.13), it is sufficient to show that

$$1 - (1-r_1)^{k_0} - r_2^2 > k_0 r_1 (1 - 2 r_2),$$

where $k_0 = \lceil r_2/r_1 \rceil$. From Lemma 3.6 and the definition of $k_0$ (which gives $r_2 \le k_0 r_1 < r_1 + r_2 \le 2 r_2$), we obtain

$$1 - (1-r_1)^{k_0} - r_2^2 \ge k_0 r_1 - \tfrac{1}{2}(k_0 r_1)^2 - r_2^2 > k_0 r_1 (1 - 2 r_2),$$

since the quadratic $2 x r_2 - x^2/2 - r_2^2$ is positive for $r_2 \le x \le 2 r_2$. □

Finally, we present the conclusive theorem, which determines the optimal control policy.

Theorem 3.5:
a) If $r_2 \ge r_1$, then $\pi^* = \pi(k^*, 2)$, where $k^*$ is the smallest integer satisfying

$$r_2^2 \le p_1(k^*+1) - (k^*+1)\, r_1 (1-r_1)^{k^*}. \tag{3.14}$$

Moreover, $V(\pi^*) = r_2 + \bigl(p_1(k^*) - r_2^2\bigr)/k^*$.

b) If $r_1 \ge r_2$, then $\pi^* = \pi(2, k^*)$, where $k^*$ is the smallest integer satisfying

$$r_1^2 \le p_2(k^*+1) - (k^*+1)\, r_2 (1-r_2)^{k^*}. \tag{3.15}$$

Moreover, $V(\pi^*) = r_1 + \bigl(p_2(k^*) - r_1^2\bigr)/k^*$.

Proof: We only consider a). From (3.12), it follows that

$$V\bigl(\pi(k+1,2)\bigr) \le V\bigl(\pi(k,2)\bigr) \quad \text{if and only if} \quad r_2^2 \le p_1(k+1) - (k+1)\, r_1 (1-r_1)^{k}. \tag{3.16}$$

Since $p_1(k+1) - (k+1) r_1 (1-r_1)^k$ increases in $k$, $V(\pi(k,2))$ increases in $k$ up to $k^*$ and decreases thereafter. The theorem now follows from (3.12) and (3.16). □
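Theorem 3.5(a) translates into a one-loop search for $k^*$. The sketch below is ours (the brute-force comparison against (3.12) is an added sanity check, not part of the paper) and computes $k^*$ and $V(\pi^*)$ for the case $r_2 \ge r_1$.

```python
def p(k, r):
    return 1.0 - (1.0 - r) ** k

def k_star(r1, r2):
    """Smallest k with r2^2 <= p1(k+1) - (k+1) r1 (1-r1)^k, per (3.14)."""
    k = 1
    while r2 ** 2 > p(k + 1, r1) - (k + 1) * r1 * (1 - r1) ** k:
        k += 1
    return k

def v_k2(k, r1, r2):
    return r2 + (p(k, r1) - r2 ** 2) / k               # (3.12)

r1, r2 = 0.1, 0.3                                      # example with r2 >= r1
ks = k_star(r1, r2)
assert v_k2(ks, r1, r2) == max(v_k2(k, r1, r2) for k in range(2, 500))
print(ks, v_k2(ks, r1, r2))                            # k* = 5 here; V(pi*) ~ 0.364
```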
IV. CONCLUSIONS

1) The optimal control policies $\pi(k^*, 2)$ and $\pi(2, k^*)$ are open loop and do not allow conflicts. Furthermore, they can easily be implemented as decentralized policies.

2) The probabilistic interpretation of the optimal value $k^*$ in condition (3.14) is as follows. Let $X_k^i \sim B(k, r_i)$, $i = 1, 2$, $k = 1, 2, \dots$. The integer $k^*$ is the smallest integer $k$ which satisfies

$$P\bigl(X_{k+1}^1 \ge 2\bigr) \ge P\bigl(X_2^2 \ge 2\bigr).$$

This condition seems to be a technical one which follows from (3.16), and it is hard to relate it to the optimality of the rule. On the other hand, note that the condition on $k_0$ in Lemma 3.1, $p_1(k_0) > r_2$, defines the myopic policy (which allows transmission of the station which is more likely to have a message). Although the myopic policy seems intuitively optimal, as well as the policy in Theorem 3.4, they are not.

3) If $r_1 = r_2$, the optimal control policy is the "round-robin" policy.
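The identity behind this reading, namely $p_1(k+1) - (k+1) r_1 (1-r_1)^k = P(X_{k+1}^1 \ge 2)$ and $r_2^2 = P(X_2^2 \ge 2)$, can be confirmed numerically; the check below is an illustration of ours, not part of the paper.

```python
from math import comb

def p(k, r):
    return 1.0 - (1.0 - r) ** k

def tail_ge2(n, q):
    """P(X >= 2) for X ~ B(n, q)."""
    return 1.0 - sum(comb(n, j) * q ** j * (1 - q) ** (n - j) for j in (0, 1))

r1, r2 = 0.15, 0.35
for k in range(1, 12):
    lhs = p(k + 1, r1) - (k + 1) * r1 * (1 - r1) ** k
    assert abs(lhs - tail_ge2(k + 1, r1)) < 1e-12
assert abs(r2 ** 2 - tail_ge2(2, r2)) < 1e-12
```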
ACKNOWLEDGMENT
I am indebted to the referees for their constructive remarks, especially for pointing out a mistake in the proof of Lemma 3.2 in an earlier version of this paper.
REFERENCES

[1] Z. Berman, "Maximal traffic in radio networks with two transmitters" (in Hebrew), Master's thesis, Dep. Elec. Eng., Technion-Israel Inst. Technol., Haifa, Israel, 1981.
[2] C. Derman, Finite State Markovian Decision Processes. New York: Academic, 1970.
[3] B. Hajek and T. Van Loon, "Decentralized dynamic control of a multiaccess broadcast channel," IEEE Trans. Automat. Contr., vol. AC-27, pp. 559-569, June 1982.
[4] F. C. Schoute, "Decentralized control in packet switched satellite communication," IEEE Trans. Automat. Contr., vol. AC-23, pp. 362-371, Apr. 1978.
[5] C. Striebel, "Sufficient statistics in the optimum control of stochastic systems," J. Math. Anal. Appl., vol. 12, pp. 576-592, 1965.
[6] P. Varaiya and J. Walrand, "Decentralized control in packet switched satellite communication," IEEE Trans. Automat. Contr., vol. AC-24, pp. 794-796, Oct. 1979.

Zvi Rosberg was born in Germany on July 25, 1947. He received the B.Sc., M.A., and Ph.D. degrees from the Hebrew University, Jerusalem, Israel, in 1971, 1974, and 1978, respectively.
From 1972 to 1978 he was a Senior System Analyst in the General Computer Bureau of the Israeli Government. From 1978 to 1979 he held a Research Fellowship at the Center for Operations Research and Econometrics (CORE), Catholic University of Louvain, Belgium. From 1979 to 1980 he was a Visiting Assistant Professor at the University of Illinois, Urbana. Since 1980 he has been with the Faculty of Computer Science, Technion-Israel Institute of Technology, Haifa, Israel, where he has held the position of Senior Lecturer.
Dynamic Equilibria in Multigeneration Stochastic Games

ABDERRAHMANE ALJ AND ALAIN HAURIE, MEMBER, IEEE

Abstract - Stochastic sequential games are considered, with the assumption that to each stage there corresponds a generation of players. Introducing the concept of imperfect altruism, a class of solution concepts called intergenerational equilibria is investigated. Complete intergenerational equilibria, where all players adopt a noncooperative mood of play, are first characterized. Existence of stationary equilibria is proved for infinite horizon games. The concept of the intergenerational game is then extended to allow for variable moods of play in each generation.

I. INTRODUCTION

THIS paper addresses the problem of characterizing intergenerational equilibria in dynamic stochastic games. The games considered herein have the same dynamical structure as the stochastic games introduced first by Shapley [10] and then used by Sobel [11], Rogers [9], and, more recently, Whitt [13]. However, instead of considering a finite set of players who control this dynamical

Manuscript received November 24, 1980; revised March 10, 1982. Paper recommended by F. N. Bailey, Past Chairman of the Large Scale Systems and Differential Games Committee. This work was supported by a CIDA-NSERC Canada Grant, and by SSHRC Canada under Grants 410-78-0603 and 410-81-0722.
A. Alj is with the Ecole Mohammedia d'Ingenieurs, Universite Mohammed V, Rabat, Morocco.
A. Haurie is with l'Ecole des Hautes Etudes Commerciales, Montreal, P.Q. H3T 1V6, Canada.