
FA7 8:50

A METHODOLOGY FOR THE ADAPTIVE CONTROL OF MARKOV CHAINS UNDER PARTIAL STATE INFORMATION

Emmanuel Fernández-Gaucherand†, Aristotle Arapostathis‡, and Steven I. Marcus§

SUMMARY

We consider a stochastic adaptive control problem where complete state information is not available to the controller. The system is modelled as a finite stochastic automaton (FSA) [PAZ], [DOB]. These models are a slight generalization of the more common partially observable controlled Markov chain models as presented in, e.g., [BE], [KV]. A controlled FSA is described by the quintuplet $(X, Y, U, \{P(y \mid u) : (y, u) \in Y \times U\}, c)$; here $X = \{1, 2, \ldots, N_X\}$ is the finite set of internal states, $Y = \{1, 2, \ldots, N_Y\}$ is the set of observations (or messages), $U = \{1, 2, \ldots, N_U\}$ is the set of decisions (or controls), and $c(\cdot, \cdot)$ is the one-stage cost function. For each pair $(y, u) \in Y \times U$, we have that $P(y \mid u) := [p_{ij}(y \mid u)]$ is an $N_X \times N_X$ matrix, such that
$$p_{ij}(y \mid u) \ge 0, \qquad \sum_{y=1}^{N_Y} \sum_{j=1}^{N_X} p_{ij}(y \mid u) = 1, \qquad \forall\, i \in X,\ u \in U.$$
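To make the model concrete, the following is a minimal numerical sketch (our own illustration, not from the paper): it stores the matrices $P(y \mid u) = [p_{ij}(y \mid u)]$ in a NumPy array indexed by $(y, u, i, j)$, normalized exactly as in the display above, and samples one transition. All sizes and values are arbitrary.

```python
# Illustrative FSA model (hypothetical sizes and values), with P[y, u] the
# N_X x N_X matrix [p_ij(y|u)] and c[i, u] the one-stage cost c(i, u).
import numpy as np

rng = np.random.default_rng(0)
N_X, N_Y, N_U = 3, 2, 2

P = rng.random((N_Y, N_U, N_X, N_X))      # indexed (y, u, i, j)
P /= P.sum(axis=(0, 3), keepdims=True)    # enforce sum_{y,j} p_ij(y|u) = 1 per (i, u)

c = rng.random((N_X, N_U))                # one-stage cost c(i, u)

def step(i, u):
    """Sample (X_{t+1}, Y_{t+1}) = (j, y) given X_t = i and U_t = u."""
    probs = P[:, u, i, :].reshape(-1)     # joint law of (y, j); sums to 1
    k = rng.choice(N_Y * N_X, p=probs)
    y, j = divmod(k, N_X)
    return j, y
```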

If at time $t$ the automaton is in state $X_t = i$ and decision $U_t = u$ is made, then by the beginning of the next decision time the automaton will have evolved to state $X_{t+1} = j$ and output a message $Y_{t+1} = y$, with probability $p_{ij}(y \mid u)$. The cost incurred in this process is $c(i, u)$. We refer to [ABFGM], [DOB], [PAZ] for more details. At decision time $t$, the information available to the decision-maker is the history $I_t$, consisting of the initial state distribution $p_0$ together with all past decisions and observed messages, where $p_0 \in S_{N_X} := \{p \in \mathbb{R}^{N_X} \mid p^{(i)} \ge 0,\ \sum_i p^{(i)} = 1\}$. It is well known that the partially observable optimal control problem for an FSA, under several optimality criteria, can be transformed into an equivalent completely observable problem, in terms of an information state process [ABFGM], [BE], [KV], as follows. Given $p_0 \in S_{N_X}$, compute recursively
$$p_{t+1} = T(y_{t+1}, p_t, u_t), \qquad t \in \mathbb{N},$$
where
$$T(y, p, u) := \frac{p^{\top} P(y \mid u)}{p^{\top} P(y \mid u)\, \mathbf{1}};$$
here we have $\mathbf{1} = (1, 1, \ldots, 1)^{\top}$. The process $\{p_t\}$ is a controlled Markov chain and equals the conditional distribution of the internal state $X_t$ given $I_t$ [ABFGM], [BE]. The "new" state is then taken as the process $\{p_t\}$.
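Continuing the NumPy illustration above, a minimal sketch of the information-state update follows (the guard against zero-probability observations is our own addition). Iterating it along the observed messages produces the controlled chain $\{p_t\}$ on the simplex.

```python
def T(y, p, u):
    """Information-state update: T(y, p, u) = p'P(y|u) / (p'P(y|u) 1)."""
    numer = p @ P[y, u]                   # row vector p' P(y|u)
    denom = numer.sum()                   # p' P(y|u) 1: probability of seeing y
    if denom == 0.0:
        raise ValueError("observation y has probability zero under (p, u)")
    return numer / denom

p0 = np.full(N_X, 1.0 / N_X)              # an initial distribution in S_{N_X}
p1 = T(0, p0, 1)                           # belief after decision u_0 = 1, message y_1 = 0
```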

† Systems and Industrial Engineering Department, The University of Arizona, Tucson, Arizona 85721 ([email protected]).
‡ Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, Texas 78712-1084 ([email protected]).
§ Systems Research Center & Electrical Engineering Department, The University of Maryland, College Park, Maryland 20742 ([email protected]).
CH3229-2/92/0000-2750 $1.00 © 1992 IEEE

A (stationary separated) policy $\pi$ is a rule for making decisions based on $\{p_t\}$, i.e. $\pi : S_{N_X} \to U$, and $U_t = \pi(p_t)$. The stochastic optimal control problem of interest to us is that of finding a policy $\pi^*$, optimal with respect to the long-run expected average cost (AC) performance criterion, which for a given policy $\pi$ and initial state distribution $p_0 \in S_{N_X}$ is given as
$$J(\pi, p_0) := \limsup_{T \to \infty} \frac{1}{T}\, E^{\pi}_{p_0}\!\left[\sum_{t=0}^{T-1} \tilde{c}(p_t, u_t)\right],$$

where $\tilde{c}(p, u) := p^{\top}(c(1, u), \ldots, c(N_X, u))^{\top}$. The infinite horizon optimal control problem for FSA under an AC criterion has been studied by the authors and others [ABFGM]. Since the state space $S_{N_X}$ is a general (Borel) space, this problem can be thought of as falling into the realm of completely observable controlled Markov processes (CMP) with general (Borel) state space, cf. [ABFGM]. However, the problem with partial information has a very rich structure which is not fully utilized by following the latter approach [ABFGM], [BE], [FAM1], [FG]. Furthermore, many of the assumptions used in the literature on the AC control problem for general state space CMP require some form of strong ergodicity for the controlled process $\{p_t\}$, under all stationary control policies. This is not satisfied in many applications of much interest, cf. [FAM1]. Hence, for many purposes, viewing the (equivalent) FSA problem as a general state space completely observable CMP is often not advantageous at all. This is especially true for the case of parametric adaptive control of FSA. In this situation, the model depends on some unknown parameter $\theta_0$, which we denote as $P_{\theta_0}(y \mid u)$; the parameter takes values in some (Borel) parameter space $\Theta$. Hence, the true conditional probability depends on this parameter, i.e.:
$$p_{t+1} = T(y_{t+1}, p_t, u_t; \theta_0), \qquad t \in \mathbb{N}.$$

Therefore, since the true value of the parameter is unknown to the controller, $\{p_t\}$ cannot be computed, and thus the equivalent problem is not completely observable anymore. Although this is a very interesting problem with much potential for applications [BDO], [BTE], [FAM2], [WAK], there is very little available in the literature concerning the adaptive control of FSA. Recently, the above adaptive control problem has been studied by the authors [FAM2], [FG]. In [FAM2], a complete analysis for a particular case study has been reported; the methodology used has been generalized in [FG] as follows: we adopt an "enforced certainty equivalence" approach which involves recursively computing estimates $\{\hat{\theta}_t\}$ of the unknown parameter, and using at each decision time the latest available estimate to compute
$$\hat{p}_{t+1} = T(y_{t+1}, \hat{p}_t, u_t; \hat{\theta}_t), \qquad t \in \mathbb{N}. \tag{1}$$
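As a hedged illustration of recursion (1), suppose the unknown parameter indexes a family of models; the family `P_theta` below is a hypothetical parameterization (a convex mixture of two known models, not something the paper specifies), and the update simply runs the filter with the current estimate plugged in.

```python
P2 = rng.random((N_Y, N_U, N_X, N_X))     # a second known model, for illustration only
P2 /= P2.sum(axis=(0, 3), keepdims=True)

def P_theta(theta):
    """Hypothetical parametric family: convex mixture of two normalized models."""
    return theta * P + (1.0 - theta) * P2  # each (i, u)-slice still sums to 1

def T_hat(y, p_hat, u, theta_hat):
    """Recursion (1): the information-state filter run under P_{theta^_t}."""
    numer = p_hat @ P_theta(theta_hat)[y, u]
    return numer / numer.sum()
```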

We assume that the solution to the stochastic optimal control problem is known for each $\theta \in \Theta$, which is expressed as follows; see [ABFGM], [BE], [KV].


Assumption A.1: For each $\theta$, there is a bounded solution $(\rho^*_{\theta}, h_{\theta})$, with $\rho^*_{\theta} \in \mathbb{R}$, to the corresponding average cost optimality equation (ACOE)
$$\rho^*_{\theta} + h_{\theta}(p) = \min_{u \in U}\left\{\tilde{c}(p, u) + \sum_{y \in Y} p^{\top} P_{\theta}(y \mid u)\, \mathbf{1}\; h_{\theta}\big(T(y, p, u; \theta)\big)\right\}, \qquad p \in S_{N_X}.$$

Under the above assumption, there exists a set of optimal policies $OP = \{\pi^*(\cdot\,; \theta)\}_{\theta \in \Theta}$; see [ABFGM].
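Assumption A.1 and the family $OP$ are existence statements; computing $(\rho^*_{\theta}, h_{\theta})$ is a separate matter. The following is a rough heuristic sketch (our own, with no convergence claim intended): relative value iteration for the ACOE on a finite sample of beliefs, projecting updated beliefs back to the nearest grid point. Grid size and iteration count are arbitrary; the minimizing $u$ at each grid point then serves as a tabulated stand-in for $\pi^*(\cdot\,; \theta)$.

```python
# Heuristic relative value iteration for the ACOE, on a sampled belief grid,
# for one fixed model P (i.e., one value of theta).
grid = rng.dirichlet(np.ones(N_X), size=200)          # sample points of S_{N_X}

def nearest(p):
    return int(np.argmin(((grid - p) ** 2).sum(axis=1)))

def c_tilde(p, u):
    return p @ c[:, u]                                 # c~(p,u) = p'(c(1,u),...,c(N_X,u))'

h = np.zeros(len(grid))
rho_star = 0.0
for _ in range(100):
    h_new = np.empty_like(h)
    for k, p in enumerate(grid):
        best = np.inf
        for u in range(N_U):
            val = c_tilde(p, u)
            for y in range(N_Y):
                numer = p @ P[y, u]
                pr_y = numer.sum()                     # p' P(y|u) 1
                if pr_y > 0.0:
                    val += pr_y * h[nearest(numer / pr_y)]
            best = min(best, val)
        h_new[k] = best
    rho_star = h_new[0]            # approx rho*: (Th)(p_ref) - h(p_ref), with h(p_ref) = 0
    h = h_new - h_new[0]           # renormalize so h(p_ref) = 0
```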

The certainty equivalent adaptive policy is given as follows.

Adaptive Policy: Given a sequence of estimates $\{\hat{\theta}_t\}$ of $\theta_0$, compute the control action at each time $t$ by
$$u_t = \pi^*(\hat{p}_t; \hat{\theta}_t),$$
where $\hat{p}_t$ is computed recursively using (1).

The above adaptive policy will be denoted by $\pi^a$. Under a set of assumptions, it was shown in [FG] that the adaptive policy $\pi^a$ is self-optimizing with respect to the AC criterion, i.e. it achieves the same asymptotic average performance as the optimal policy $\pi^*(\cdot\,; \theta_0) \in OP$ corresponding to the true parameter. The other assumptions used in [FG] are the following.

Assumption A.2: The parameter set $\Theta$ is compact; $(\rho^*_{\theta}, h_{\theta})$ are continuous and bounded, both in $p$ and in $\theta$.

Assumption A.3: $P_{\theta}(y \mid u)$ is continuous in $\theta$, for each $(y, u) \in Y \times U$.

Assumption A.4: The parameter estimates converge to the true value, i.e. $\hat{\theta}_t \to \theta_0$, in probability, as $t \to \infty$.

Assumption A.5: We have that $\hat{p}_t \to p_t$, in probability, as $t \to \infty$, for all $p_0$.
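Putting the pieces together, here is a schematic sketch of $\pi^a$ (again our own illustration): `estimate_theta` is a placeholder for whichever estimation scheme Assumption A.4 refers to, and `pi_star` stands for the optimal policy map of Assumption A.1; neither is specified by the paper. With consistent estimates and Assumption A.5, the Theorem below asserts that this loop attains the optimal average cost.

```python
def run_adaptive(p0, theta0_true, theta_hat, pi_star, estimate_theta, T_horizon):
    """Simulate the certainty-equivalent adaptive policy pi^a."""
    i = rng.choice(N_X, p=p0)                # hidden internal state, true model P_{theta_0}
    p_hat = p0.copy()
    history, avg_cost = [], 0.0
    for t in range(T_horizon):
        u = pi_star(p_hat, theta_hat)        # u_t = pi*(p^_t; theta^_t)
        avg_cost += c[i, u] / T_horizon
        # the true system evolves under P_{theta_0}; the controller sees only y
        probs = P_theta(theta0_true)[:, u, i, :].reshape(-1)
        k = rng.choice(N_Y * N_X, p=probs)
        y, i = divmod(k, N_X)
        history.append((u, y))
        theta_hat = estimate_theta(history, theta_hat)   # placeholder update
        p_hat = T_hat(y, p_hat, u, theta_hat)            # recursion (1)
    return avg_cost, theta_hat
```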

We can now prove our main result.

Theorem: Under Assumptions A.1-A.5, $\pi^a$ is self-optimizing with respect to the AC criterion.

Proof: Let $\Phi_{\theta}(\cdot, \cdot)$ denote Mandl's discrepancy function corresponding to the parameter value $\theta \in \Theta$, i.e. for $p \in S_{N_X}$ and $u \in U$ (see [ABFGM], [FAM1])
$$\Phi_{\theta}(p, u) := \tilde{c}(p, u) + \sum_{y \in Y} p^{\top} P_{\theta}(y \mid u)\, \mathbf{1}\; h_{\theta}\big(T(y, p, u; \theta)\big) - \rho^*_{\theta} - h_{\theta}(p).$$

Then by the assumptions made, $\Phi_{\theta}(p, u)$ is continuous in both $p \in S_{N_X}$ and $\theta \in \Theta$. Furthermore, since $\Theta$ is compact, $\Phi_{\theta}(p, u)$ is uniformly continuous and bounded in $(p, \theta) \in S_{N_X} \times \Theta$, and thus $\Phi_{\hat{\theta}_t}(\hat{p}_t, u)$ is uniformly integrable, for each $u \in U$. Therefore, for each $u \in U$, we have
$$E\left[\left|\Phi_{\hat{\theta}_t}(\hat{p}_t, u) - \Phi_{\theta_0}(p_t, u)\right|\right] \xrightarrow[t \to \infty]{} 0,$$
and since $U$ is finite, then
$$E\left[\Phi_{\theta_0}(p_t, u_t)\right] = E\left[\Phi_{\theta_0}(p_t, u_t) - \Phi_{\hat{\theta}_t}(\hat{p}_t, u_t)\right] \xrightarrow[t \to \infty]{} 0, \tag{2}$$
where we used the fact that $\Phi_{\hat{\theta}_t}\big(\hat{p}_t, \pi^*(\hat{p}_t; \hat{\theta}_t)\big) = 0$, since $\pi^*(\cdot\,; \theta) \in OP$ minimizes the corresponding ACOE, for the parameter value $\theta \in \Theta$. The result then follows from (2); see [ABFGM, Theorem 6.3]. □
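The proof turns entirely on evaluating $\Phi$ along the two belief processes. For concreteness, $\Phi$ transcribes directly into the running sketch, reusing `c_tilde`, `nearest`, `h`, and `rho_star` from the grid approximation above (so the numbers are only as good as that heuristic); under Assumption A.1, $\min_u \Phi_{\theta}(p, u) = 0$ at every belief $p$.

```python
def Phi(p, u, h_fn, rho):
    """Mandl's discrepancy: c~(p,u) + sum_y p'P(y|u)1 * h(T(y,p,u)) - rho - h(p)."""
    val = c_tilde(p, u) - rho - h_fn(p)
    for y in range(N_Y):
        numer = p @ P[y, u]
        pr_y = numer.sum()
        if pr_y > 0.0:
            val += pr_y * h_fn(numer / pr_y)
    return val

h_fn = lambda p: h[nearest(p)]                  # piecewise-constant read-out of h
gaps = [Phi(p0, u, h_fn, rho_star) for u in range(N_U)]   # min(gaps) should be near 0
```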

Let us briefly examine the assumptions used in deriving the result above. Verifiable conditions on the model specifications exist in the literature that imply Assumption A.1 holds [ABFGM], [FAM1]. Assumption A.3 is easy to verify, and holds trivially if the parameterization of the model is taken in terms of the entire matrices $P_{\theta}(y \mid u)$. Assumption A.4 depends on the parameter estimation scheme used, and is very problem-specific [FAM2]. The continuity required in Assumption A.2 depends to a large extent on the continuity required in Assumption A.3, and on some ergodic properties of the model [FAM2]. Finally, it is clear that even if Assumptions A.1-A.4 hold, it may be the case that Assumption A.5 does not. Under continuity with respect to the parameterization, and in the presence of converging parameter estimates, this last assumption will hold if there is some type of, e.g., regenerative behavior for the processes $\{p_t\}$ and $\{\hat{p}_t\}$, such that at some times both processes are reset to the same value. This type of behavior occurs naturally in some inventory, queueing and machine replacement problems [BE], [ABFGM], [FAM2].

ACKNOWLEDGEMENTS: This work was supported in part by the Texas Advanced Technology Program under Grants No. 003658-093 and No. 003658-186, in part by the Air Force Office of Scientific Research under Grants AFOSR-91-0033, F49620-92-J-0045 and F49620-92-J-0083, and in part by the National Science Foundation under Grants CDR-8803012 and INT-9201430.

REFERENCES

[ABFGM] A. Arapostathis, V. Borkar, E. Fernández-Gaucherand, M.K. Ghosh and S.I. Marcus, Discrete-Time Controlled Markov Processes with Average Cost Criterion: A Survey, to appear in SIAM Journal on Control & Optimization.
[BE] D.P. Bertsekas, Dynamic Programming: Deterministic and Stochastic Models, Prentice-Hall, Englewood Cliffs, 1987.
[BDO] J.S. Baras and A.J. Dorsey, Stochastic Control of Two Partially Observed Competing Queues, IEEE Transactions on Automatic Control, AC-26 (1981) 1106-1117.
[BTE] F.J. Beutler and D. Teneketzis, Routing in Queueing Networks Under Imperfect Information: Stochastic Dominance and Thresholds, Stochastics & Stochastics Reports, 26 (1989) 81-100.
[DOB] E.-E. Doberkat, Stochastic Automata: Stability, Nondeterminism, and Prediction, Springer-Verlag, Berlin, 1981.
[FAM1] E. Fernández-Gaucherand, A. Arapostathis and S.I. Marcus, On the Average Cost Optimality Equation and the Structure of Optimal Policies for Partially Observable Markov Decision Processes, Annals of Operations Research, 29 (1991) 439-470.
[FAM2] E. Fernández-Gaucherand, A. Arapostathis and S.I. Marcus, Analysis of an Adaptive Control Scheme for a Partially Observed Controlled Markov Chain, to appear in IEEE Transactions on Automatic Control.
[FG] E. Fernández-Gaucherand, Controlled Markov Processes on the Infinite Planning Horizon: Optimal & Adaptive Control, Ph.D. Dissertation, The University of Texas at Austin, August 1991.
[KV] P.R. Kumar and P. Varaiya, Stochastic Systems: Estimation, Identification and Adaptive Control, Prentice-Hall, Englewood Cliffs, 1986.
[PAZ] A. Paz, Introduction to Probabilistic Automata, Academic Press, New York, 1971.
[WAK] K. Wakuta, Optimal Control of an M/G/1 Queue with Imperfectly Observed Queue Length when the Input Source is Finite, Journal of Applied Probability, 28 (1991) 210-220.
