Risk Sensitive Control of Finite State Machines on an Infinite Horizon I

W. H. Fleming$^{1}$
Division of Applied Mathematics
Brown University
Providence, Rhode Island 02912, U.S.A.

D. Hernandez-Hernandez$^{2}$
Departamento de Matematicas
CINVESTAV-IPN
Apartado Postal 14-740
Mexico D.F. 07000, MEXICO

August 1995

$^{1}$ Partially supported by AFOSR F49620-92-J-0081, ARO DAAL03-92-G-0115 and NSF DMS-9301048.
$^{2}$ Supported by the Consejo Nacional de Ciencia y Tecnologia (CONACYT), and by the Centro de Investigacion y de Estudios Avanzados (CINVESTAV), Mexico. This research was done while the author visited the Division of Applied Mathematics, Brown University.

Abstract

In this paper we consider robust and risk sensitive control of discrete time finite state systems on an infinite horizon. The solution of the state feedback robust control problem is characterized in terms of the value of an average cost dynamic game. The risk sensitive stochastic optimal control problem is solved using the policy iteration algorithm, and the optimal rate is expressed in terms of the value of a stochastic dynamic game with average cost per unit time criterion. By taking a small noise limit a deterministic dynamic game is obtained, which is closely related to the robust control problem.

1 Introduction.

There are various approaches to treating disturbances in control systems. In stochastic control, disturbances are modelled as stochastic processes (random noise). On the other hand, in $H_\infty$/robust control theory disturbances are modelled deterministically. The theory of risk sensitive optimal control provides a link between stochastic and deterministic approaches. The link is made by considering small noise limits for stochastic control problems with exponential of cost criteria. For continuous variable, finite time horizon problems this idea was introduced by Whittle [18], [19]. For the state feedback (complete state observation) case, Whittle's idea was put on a mathematically rigorous basis in [14], [9] using viscosity solution methods. Discrete time, output feedback (partial state information) problems on a finite time horizon were treated in [15]. See also [7], where a solution approach for risk sensitive control problems for hidden Markov models is given. In [1] robust and risk sensitive control of discrete time finite state systems on a finite horizon is considered.

The purpose of the present paper is to study such systems on an infinite time horizon. We consider here only the state feedback case. In a sequel, output feedback control on an infinite horizon will be considered. Our approach is similar in spirit to [10], [11], where continuous variable systems modelled by differential equations are considered. However, the technical details are quite different for the discrete (finite state machine) case.

To illustrate the ideas in a simple setting, we begin in Section 2 with uncontrolled finite state machines described by the difference equation (2.1). In the robust/$H_\infty$ formulation, deterministic perturbations are described by (2.2). The $H_\infty$-norm is characterized in terms of a deterministic optimal control problem, in which the perturbations (or disturbances) are chosen to maximize some long run average cost per unit time criterion. The maximum average cost $\rho_0$ is nonnegative, and is zero if and only if $H_\infty$-control is achieved. The corresponding cost potential function $W_0(x)$ has the role of a "storage function" in $H_\infty$-control terminology. It is unique, provided that the $H_\infty$-norm parameter $\gamma$ exceeds the critical value $\gamma^*$. Following [1], stochastic perturbations of the finite state machine model are introduced in Section 2.2. The strength of the perturbations is described through a parameter $\varepsilon$. Stochastic analogues $\rho_\varepsilon, W_\varepsilon$ of $\rho_0, W_0$ are introduced, with $\rho_\varepsilon$ the maximal expected average cost per unit time for a corresponding

ergodic stochastic control problem. In Section 3, convergence of $\rho_\varepsilon, W_\varepsilon$ to $\rho_0, W_0$ as $\varepsilon \to 0$ is proved. In Section 4, we consider controlled finite state machines. With the deterministic robust/$H_\infty$ formulation, the optimal $H_\infty$-norm is characterized in terms of a dynamic game with average cost per unit time. In the stochastic risk sensitive formulation, a corresponding stochastic difference game is introduced, with payoff involving a relative entropy function. Results like those in Section 3 are obtained as the noise intensity parameter $\varepsilon \to 0$.

2 Risk sensitive analysis.

In this section we consider both deterministic and stochastic perturbations of a discrete-time finite state system. In order to measure the size of the deterministic perturbations, a definition analogous to that of the $H_\infty$-norm for nonlinear continuous variable systems is given. This norm is characterized in terms of the value of a long run average cost deterministic control problem. The risk sensitive index is used to measure the effect of the stochastic perturbations, and it is expressed as the value of an average cost stochastic optimal control problem.

2.1 The $H_\infty$-norm.

Consider the deterministic finite state machine

(2.1)    $x_{t+1} = f(x_t), \quad t = 0, 1, \dots, \quad x_0 = x,$

where $x_t$ takes values in the finite set $X$ and $f$ is a function from $X$ into itself that defines the dynamics of the system. The set $X$ has $N$ elements, $X = \{x_1, \dots, x_N\}$. Let us now define a perturbed system $\Sigma$,

(2.2)    $x_{t+1} = b(x_t, w_t), \quad t = 0, 1, \dots, \quad x_0 = x.$

Here the exogenous inputs (disturbances) $w_t$ take values in a finite set $W$, the state variable $x_t$ evolves in $X$ and $b : X \times W \to X$ is a given function.

Remark 2.1. Notation. Throughout this paper we denote by $[0, T]$ the time interval $\{0, 1, \dots, T\}$. If $Z$ is a generic set, then $Z[0, T]$ denotes the set of functions $z : [0, T] \to Z$. Moreover, given any function $v : Z \to \mathbb{R}$, $\|v\|$ stands for the supremum norm of $v$, i.e. $\|v\| = \sup_{z \in Z} |v(z)|$.

We assume the following:

(A1) There exist a null state $x^*$ and a null disturbance $w^* \in W$ such that
(i) $f(x^*) = x^*$, and
(ii) $b(x, w^*) = f(x)$ for all $x \in X$.

(A2) Given $x, x'' \in X$ there exist $T_1$, $0 < T_1 \le N$, and $w \in W[0, T_1]$ such that for the initial condition $x_0 = x$ and input $w$, the system $\Sigma$ reaches $x''$ after $T_1$ steps.

(A3) There exists a positive integer $N_0$ such that for any initial condition $x_0 = x$, the unperturbed system (2.1) reaches the null state $x^*$ after $N_0$ steps.

Let us introduce a pair of functions $\phi : W \to \mathbb{R}$ and $\ell : X \to \mathbb{R}$ such that

(2.3)    $\phi(w^*) = 0$, $\phi(w) > 0$ if $w^* \ne w \in W$, and $\ell(x^*) = 0$, $\ell(x) > 0$ if $x^* \ne x \in X$.

The values $\phi(w)$ and $\ell(x)$ represent the magnitude of disturbance $w$ and the cost per stage generated by the system (2.2), respectively. Now, let us give the definition of the $H_\infty$-norm of the discrete system $\Sigma$; this definition is analogous to the one given for nonlinear continuous variable systems; see e.g. [17, 2].

Definition 2.2. We say that the $H_\infty$-norm $\|\Sigma\|_{H_\infty}$ is less than or equal to a positive number $\gamma$ if and only if for every initial condition $x_0 = x$, there exists a nonnegative constant $K(x)$, with $K(x^*) = 0$, such that

(2.4)    $K(x) + \sum_{t=0}^{T} \left[ \gamma^2 \phi(w_t) - \ell(x_t) \right] \ge 0 \quad \text{for all } w \in W[0, T], \; T \ge 0.$

Then, $\|\Sigma\|_{H_\infty}$ is the smallest $\gamma$ such that $\|\Sigma\|_{H_\infty} \le \gamma$. Straightforward calculations show that $\|\Sigma\|_{H_\infty} \le \gamma$ if and only if there exists a nonnegative function $W_0$ defined on $X$, called a storage function,

such that

(2.5)    $W_0(x) \ge \sup_{T > 0} \; \sup_{w \in W[0,T]} \left\{ W_0(x_{T+1}) - \sum_{t=0}^{T} \left[ \gamma^2 \phi(w_t) - \ell(x_t) \right] \right\}, \qquad W_0(x^*) = 0.$

If there exists a storage function, then the system $\Sigma$ is called dissipative with respect to the supply rate $(x, w) \to \gamma^2 \phi(w) - \ell(x)$. Actually, the inequality (2.5) can be rewritten as

(2.5')    $W_0(x) \ge \max_{w \in W} \left\{ W_0(b(x, w)) + \ell(x) - \gamma^2 \phi(w) \right\}, \qquad W_0(x^*) = 0.$
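To make the dissipation inequality (2.5') concrete, the following sketch checks whether a candidate function satisfies it on a small example. The three-state machine, the disturbance set, the cost functions and the candidate $W_0$ are all invented for illustration; they are not taken from the paper.

```python
# A minimal check of the dissipation inequality (2.5') on a hypothetical example.
X = [0, 1, 2]                                # states; 0 plays the role of the null state x*
W = [0, 1]                                   # disturbances; 0 is the null disturbance w*
f = {0: 0, 1: 0, 2: 1}                       # unperturbed dynamics (2.1), f(x*) = x*
b = {(x, 0): f[x] for x in X}                # b(x, w*) = f(x), as required by (A1)
b.update({(0, 1): 1, (1, 1): 2, (2, 1): 2})  # transitions under the nonzero disturbance
ell = {0: 0.0, 1: 1.0, 2: 2.0}               # cost per stage, ell(x*) = 0
phi = {0: 0.0, 1: 1.0}                       # disturbance magnitude, phi(w*) = 0

def is_storage_function(W0, gamma):
    """Check W0 >= 0, W0(x*) = 0 and the one-step inequality (2.5')."""
    if W0[0] != 0.0 or any(v < 0 for v in W0.values()):
        return False
    return all(
        W0[x] >= max(W0[b[x, w]] + ell[x] - gamma**2 * phi[w] for w in W) - 1e-9
        for x in X
    )

print(is_storage_function({0: 0.0, 1: 2.0, 2: 4.0}, gamma=2.0))   # True for this toy example
```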

The $H_\infty$-norm shall be characterized in terms of the value of an average cost optimal control problem, but first we introduce some preliminary results.

Proposition 2.3. Assume (A2). Then there exist a nonnegative number $\rho_0$ and a function $W_0 : X \to \mathbb{R}$ such that

(2.6)    $\rho_0 + W_0(x) = \max_{w \in W} \left[ W_0(b(x, w)) + \ell(x) - \gamma^2 \phi(w) \right].$

Proof. The proof is based on the standard vanishing discount approach. Define the value function

(2.7)    $V_\beta(x) := \sup_{w \in W[0,\infty)} \sum_{t=0}^{\infty} \beta^t \left[ \ell(x_t) - \gamma^2 \phi(w_t) \right],$

where $\beta \in (0, 1)$ and $x_t$ obeys the dynamics described by (2.2). This function satisfies the dynamic programming equation

(2.8)    $V_\beta(x) = \sup_{w \in W} \left[ \beta V_\beta(b(x, w)) + \ell(x) - \gamma^2 \phi(w) \right] \quad \text{for all } x \in X.$

On the other hand, from (2.7) we have

(2.9)    $0 \le (1 - \beta) V_\beta(x) \le \|\ell\|.$

Then, for each $x \in X$, (2.8)-(2.9) yield

(2.10)    $V_\beta(x) \ge \beta V_\beta(b(x, w)) - \gamma^2 \|\phi\| = V_\beta(b(x, w)) - \left[ (1 - \beta) V_\beta(b(x, w)) + \gamma^2 \|\phi\| \right] \ge V_\beta(b(x, w)) - C_1 \quad \text{for all } w \in W,$

with $C_1 = \|\ell\| + \gamma^2 \|\phi\|$. Thus, given $x, x'' \in X$, from (A2) and iterating (2.10) $T_1$ times, we have $V_\beta(x) \ge V_\beta(x'') - T_1 C_1$. Thus, interchanging the roles of $x$ and $x''$, for some constant $T'$ depending on $x$ and $x''$,

(2.11)    $|V_\beta(x) - V_\beta(x'')| \le T' C_1.$

Let $\beta_n \to 1$ be a sequence in $(0, 1)$. Then, (2.9)-(2.11) imply that, by a suitable diagonalization, we may pick a subsequence $\{\beta_n\}$ (denoting it again by $\{\beta_n\}$) along which $V_{\beta_n}(x) - V_{\beta_n}(x^*)$, $x \in X$, and $(1 - \beta_n) V_{\beta_n}(x^*)$ converge to some limits $W_0(x)$ and $\rho_0$, respectively. Set $\bar V_\beta(x) := V_\beta(x) - V_\beta(x^*)$, $x \in X$. Then, $\bar V_\beta(x^*) = 0$ and

$(1 - \beta) V_\beta(x^*) + \bar V_\beta(x) = \max_{w \in W} \left[ \beta \bar V_\beta(b(x, w)) + \ell(x) - \gamma^2 \phi(w) \right].$

Letting $n \to \infty$, we get

$\rho_0 + W_0(x) = \max_{w \in W} \left[ W_0(b(x, w)) + \ell(x) - \gamma^2 \phi(w) \right]. \qquad \Box$
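The vanishing discount argument also suggests a simple numerical scheme for approximating the pair $(\rho_0, W_0)$ of Proposition 2.3: solve the discounted equation (2.8) by value iteration for $\beta$ close to 1, and use $(1-\beta)V_\beta(x^*) \approx \rho_0$ and $V_\beta - V_\beta(x^*) \approx W_0$. The sketch below does this on a hypothetical toy model; the dynamics, the costs and the choice $\gamma^2 = 4$ are illustrative assumptions, and the quality of the approximation as $\beta \to 1$ is taken from the proof rather than re-derived here.

```python
import numpy as np

# Hypothetical toy model of the form (2.2): 3 states, 2 disturbances.
X, W = [0, 1, 2], [0, 1]
b = {(0, 0): 0, (1, 0): 0, (2, 0): 1, (0, 1): 1, (1, 1): 2, (2, 1): 2}
ell = np.array([0.0, 1.0, 2.0])
phi = np.array([0.0, 1.0])
gamma2 = 4.0                                        # gamma^2

def discounted_value(beta, iters=20000):
    """Value iteration for (2.8): V(x) = max_w [beta*V(b(x,w)) + ell(x) - gamma^2*phi(w)]."""
    V = np.zeros(len(X))
    for _ in range(iters):
        V = np.array([max(beta * V[b[x, w]] + ell[x] - gamma2 * phi[w] for w in W)
                      for x in X])
    return V

beta = 0.999
V = discounted_value(beta)
print((1 - beta) * V[0])                            # approximates rho_0
print(V - V[0])                                     # approximates W_0, normalized at x* = 0
```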

Remark 2.4. The number $\rho_0$ in the above proposition is unique, as follows from the next theorem. Regarding the uniqueness of the function $W_0$ (up to an additive constant), at the end of this subsection we prove that this holds if $\gamma > \gamma^*$, where $\gamma^*$ is the $H_\infty$-norm. For $\gamma \le \gamma^*$ we still do not have any uniqueness result.

The equation (2.6) corresponds to the dynamic programming equation of the following average cost deterministic optimal control problem. The dynamics are given by

(2.12)    $x_{t+1} = b(x_t, w_t), \quad t = 0, 1, \dots, \quad x_0 = x,$

where the disturbances $w = \{w_t\} \in W[0, \infty)$ play the role of a maximizing control. The cost per stage is $(x, w) \to \ell(x) - \gamma^2 \phi(w)$, and the cost functional we try to maximize is given by

$J^w(x) = \limsup_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} \left[ \ell(x_t) - \gamma^2 \phi(w_t) \right].$

The next theorem is a straightforward application of standard dynamic programming arguments.

Theorem 2.5. For any $x \in X$,

$\rho_0 = \sup_{w \in W[0,\infty)} J^w(x),$

and an optimal control is $w_t = \bar w(x_t)$, where $\bar w(x)$ achieves the maximum in (2.6).

The link between the above optimal control problem and the $H_\infty$-norm of the system $\Sigma$ is given in the next theorem.

Theorem 2.6. Assume that (A1)-(A3) hold. Then $\rho_0 = 0$ if and only if $\|\Sigma\|_{H_\infty} \le \gamma$.

Proof. Assume $\rho_0 = 0$. Let $W_0$ be the function defined in Proposition 2.3. Since this function satisfies (2.6) with $\rho_0 = 0$, in particular it solves the first part of (2.5'), and the second part follows from the construction of $W_0$. To prove that $W_0$ is nonnegative, let $T > 0$ and $w \in W[0, T]$ with $w_t = w^*$ for all $t = 0, 1, \dots, T$. Then, in view of (2.6), it follows that

$W_0(x) \ge \sum_{t=0}^{T} \ell(x_t) + W_0(x_{T+1}),$

where $x_t$ evolves according to the dynamics (2.12) (or equivalently (2.1)) with initial condition $x_0 = x$. However, from (A3), $x_T = x^*$ for $T \ge N_0$. Hence

$W_0(x) \ge W_0(x^*) = 0.$

Conversely, assume (2.5'). Then, for any $T > 0$ and $w \in W[0, T]$,

$\sum_{t=0}^{T} \left[ \ell(x_t) - \gamma^2 \phi(w_t) \right] \le W_0(x) - W_0(x_{T+1}) \le W_0(x),$

where the state dynamics (2.12) start at the initial condition $x_0 = x$. This implies that

$\limsup_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} \left[ \ell(x_t) - \gamma^2 \phi(w_t) \right] \le 0.$

Then, from Theorem 2.5, $\rho_0 \le 0$. However, in view of Proposition 2.3, $\rho_0 \ge 0$. Hence, $\rho_0 = 0$. $\Box$

For the rest of this subsection let us make explicit the dependence on $\gamma$ of $\rho_0$ and $W_0$ in Proposition 2.3, denoting them by $\rho_0^\gamma$ and $W_0^\gamma$ respectively. Define the set

$\Gamma = \{\gamma : \rho_0^\gamma = 0\}.$

Note that this set is not empty, since for $\gamma$ great enough the optimal control $\bar w$ in Theorem 2.5 is such that $\bar w(x) = w^*$, and in view of (A1), $\rho_0^\gamma = 0$. Note also that, from Theorem 2.5, $\rho_0^\gamma$ is a nonincreasing function of $\gamma$. Let $\gamma^* = \inf \Gamma$. Then, in view of the above facts, $\Gamma = [\gamma^*, \infty)$, and by Theorem 2.6,

(2.13)    $\gamma^* = \|\Sigma\|_{H_\infty}.$

Proposition 2.7. For any $\gamma > \gamma^*$ the function $W_0^\gamma$, with $W_0^\gamma(x^*) = 0$, is the unique solution of (2.6).

Proof. Let $\gamma > \gamma^*$ be arbitrary but fixed, and define the value function

$\tilde W_0(x) = \sup_{w \in \tilde W[0,\infty)} \left\{ \sum_{t=0}^{\infty} \left[ \ell(x_t) - \gamma^2 \phi(w_t) \right] \right\}, \quad x \in X.$

Here $x_t$ obeys the dynamics (2.12), and $\tilde W[0,\infty) = \{ w \in W[0,\infty) : w_t = w^* \text{ except for finitely many } t \}$. Then, by well-known dynamic programming methods (see e.g. [3]), it follows that $\tilde W_0$ is a solution of (2.6) with $\rho_0 = 0$. We claim that $\tilde W_0 = W_0^\gamma$. Actually, by (2.6) with $\rho_0 = 0$ and the definition of $\tilde W_0$, it follows immediately that $W_0^\gamma \ge \tilde W_0$. Thus, we shall just prove the reverse inequality, i.e., that $W_0^\gamma(x) \le \tilde W_0(x)$ for all $x \in X$. Let $x \in X$ and $\gamma_1 \in (\gamma^*, \gamma)$. Then, (2.6) yields

$W_0^{\gamma_1}(x) \ge \sum_{t=0}^{T-1} \left[ \ell(x_t) - \gamma_1^2 \phi(w_t) \right] + W_0^{\gamma_1}(x_T) \quad \text{for all } w \in W[0,\infty), \; T > 0.$

On the other hand, if $w$ is the optimal control defined in Theorem 2.5, i.e. $w_t = \bar w(x_t)$, then

(2.14)    $W_0^\gamma(x) = \sum_{t=0}^{T-1} \left[ \ell(x_t) - \gamma^2 \phi(w_t) \right] + W_0^\gamma(x_T) = \sum_{t=0}^{T-1} \left[ \ell(x_t) - \gamma_1^2 \phi(w_t) \right] - (\gamma^2 - \gamma_1^2) \sum_{t=0}^{T-1} \phi(w_t) + W_0^\gamma(x_T)$

for all $T > 0$. Therefore,

(2.15)    $\sum_{t=0}^{T-1} \phi(w_t) \le \frac{W_0^{\gamma_1}(x) - W_0^\gamma(x) + W_0^\gamma(x_T)}{\gamma^2 - \gamma_1^2} \le \frac{C}{\gamma^2 - \gamma_1^2},$

for some suitable constant $C$. Thus, in view of (2.3), (2.15) implies that $w \in \tilde W[0,\infty)$, and by (A3), if $T$ is large enough, $x_t = x^*$ for $t \ge T$. Finally, the above facts and (2.14) imply that $W_0^\gamma(x) \le \tilde W_0(x)$. This completes the proof. $\Box$

2.2 Stochastic Perturbation.

In this subsection we define a finite state Markov chain, which represents a stochastic perturbation of the system (2.1). This model has been described in [1]. Throughout this subsection we assume (A2). However, we will need (A1) and (A3) for the small noise limit analysis in Section 3.

Let $V : X \times X \to \mathbb{R} \cup \{+\infty\}$ be the function defined by

$V(x, x'') = \min\{ \phi(w) : x'' = b(x, w) \},$

with the standard convention that the minimum over an empty set equals $+\infty$. The value $V(x, x'')$ represents the minimum "magnitude" associated with the disturbances (see (2.3)) needed to go from $x$ to $x''$ in one time step. Let us define the following stochastic matrix $\pi_\varepsilon$: given $x, x'' \in X$,

$\pi_\varepsilon(x, x'') = \frac{1}{Z_\varepsilon(x)}\, e^{-\frac{V(x, x'')}{\varepsilon}},$

where $\varepsilon > 0$ is a small noise parameter and $Z_\varepsilon(x)$ is a normalizing constant satisfying $\sum_{x'' \in X} \pi_\varepsilon(x, x'') = 1$. This stochastic matrix satisfies the consistency condition

$\lim_{\varepsilon \to 0} \pi_\varepsilon(x, x'') = \begin{cases} 1 & \text{if } x'' = f(x) \\ 0 & \text{otherwise.} \end{cases}$
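The construction of $\pi_\varepsilon$ is easy to reproduce numerically. The sketch below (the same kind of hypothetical toy model as before) computes $V(x, x'')$, builds the Gibbs-type matrix, and illustrates the consistency condition as $\varepsilon \to 0$.

```python
import numpy as np

# Hypothetical toy model of the form (2.2): 3 states, 2 disturbances.
X, W = [0, 1, 2], [0, 1]
b = {(0, 0): 0, (1, 0): 0, (2, 0): 1, (0, 1): 1, (1, 1): 2, (2, 1): 2}
phi = {0: 0.0, 1: 1.0}

# V(x, x'') = min{ phi(w) : x'' = b(x, w) }, with min over the empty set = +inf.
Vmat = np.full((len(X), len(X)), np.inf)
for (x, w), xn in b.items():
    Vmat[x, xn] = min(Vmat[x, xn], phi[w])

def pi_eps(eps):
    """Stochastic matrix pi_eps(x, x'') proportional to exp(-V(x, x'')/eps)."""
    P = np.exp(-Vmat / eps)            # exp(-inf) = 0: unreachable states get zero mass
    return P / P.sum(axis=1, keepdims=True)

print(pi_eps(0.5))
print(pi_eps(0.01))                    # nearly the 0/1 transition matrix of (2.1)
```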

Remark 2.8. Note that (A2) implies that $\pi_\varepsilon$ is irreducible. Recall that an $N \times N$ nonnegative matrix $M$ is irreducible if, for every $x, x'' \in X$, there exists $T > 0$ such that $M^T(x, x'') > 0$, where $M^T$ denotes the $T$-th power of $M$.

Definition 2.9. The risk sensitive index for $\pi_\varepsilon$ is defined by

(2.15)    $\rho_\varepsilon = \lim_{T \to \infty} \frac{\varepsilon}{\theta} \frac{1}{T} \log E_x \exp\left\{ \frac{\theta}{\varepsilon} \sum_{t=0}^{T-1} \ell(x_t) \right\},$

where $\theta > 0$ is the risk averse factor.

The existence of the limit in (2.15) is implied by Sanov's Theorem (see e.g. [4]), and it coincides with the optimal value of an average cost infinite horizon optimal control problem, which we will define later in this section. Keeping this in mind, we shall prove that $e^{\frac{\theta}{\varepsilon}\rho_\varepsilon}$ is the dominant eigenvalue of the nonnegative matrix defined by

$L(x, x'') = e^{\frac{\theta}{\varepsilon}\ell(x)}\, \pi_\varepsilon(x, x'') \quad \text{for } x, x'' \in X.$

Note that, since $\pi_\varepsilon$ is irreducible, so is $L$.

Remark 2.10. Notation. If $M$ is any matrix on $X$ and $h : X \to \mathbb{R}$ is any function, we denote by $Mh$ their product, that is

$Mh(x) = \sum_{x'' \in X} M(x, x'')\, h(x'') \quad \text{for each } x \in X.$

On the other hand, given $x \in X$, the $x$-th row vector of $M$ is denoted by $M(x)$, i.e., $M(x) = (M(x, x_1), \dots, M(x, x_N))$.

Theorem 2.11. There exist $\lambda_\varepsilon > 0$ and a unique strictly positive function $\psi_\varepsilon : X \to \mathbb{R}$, with $\psi_\varepsilon(x^*) = 1$, such that

(2.16)    $\lambda_\varepsilon \psi_\varepsilon = L \psi_\varepsilon.$

Further,

(2.17)    $\rho_\varepsilon = \frac{\varepsilon}{\theta} \log \lambda_\varepsilon.$

Proof. The first part follows from the Perron-Frobenius Theorem, see e.g. [16]. To prove the rest, let $x_t$ be the Markov chain governed by $\pi_\varepsilon$ with initial condition $x_0 = x$. Thus, (2.16) yields

(2.18)    $E_x \exp\left\{ \frac{\theta}{\varepsilon} \sum_{t=0}^{T-1} \ell(x_t) \right\} = E_x \prod_{t=0}^{T-1} \left[ \lambda_\varepsilon \frac{\psi_\varepsilon(x_t)}{\pi_\varepsilon\psi_\varepsilon(x_t)} \right] = \lambda_\varepsilon^T\, E_x\left[ \prod_{t=0}^{T-1} \frac{\psi_\varepsilon(x_t)}{\pi_\varepsilon\psi_\varepsilon(x_t)} \right].$

Using the Markov property of $x_t$, we have

(2.19)    $E_x\left[ \prod_{t=0}^{T-1} \frac{\psi_\varepsilon(x_t)}{\pi_\varepsilon\psi_\varepsilon(x_t)}\, \pi_\varepsilon\psi_\varepsilon(x_{T-1}) \right] = \psi_\varepsilon(x).$

Since $\psi_\varepsilon$ is strictly positive, so is $\pi_\varepsilon\psi_\varepsilon$, and in view of (2.19), there follows the existence of suitable positive constants $K_1$ and $K_2$ such that

(2.20)    $K_1 \le \frac{\psi_\varepsilon(x)}{\max_{x \in X} \pi_\varepsilon\psi_\varepsilon(x)} \le E_x\left[ \prod_{t=0}^{T-1} \frac{\psi_\varepsilon(x_t)}{\pi_\varepsilon\psi_\varepsilon(x_t)} \right] \le \frac{\psi_\varepsilon(x)}{\min_{x \in X} \pi_\varepsilon\psi_\varepsilon(x)} \le K_2.$

Therefore, from (2.18)-(2.20) we have

$\frac{\varepsilon}{\theta}\log\lambda_\varepsilon = \lim_{T \to \infty} \frac{\varepsilon}{\theta}\frac{1}{T} \log E_x \exp\left\{ \frac{\theta}{\varepsilon} \sum_{t=0}^{T-1} \ell(x_t) \right\}.$

This completes the proof. $\Box$

Let us define $W_\varepsilon(x) = \frac{\varepsilon}{\theta}\log\psi_\varepsilon(x)$, $x \in X$, and rewrite (2.16) as

(2.21)    $\rho_\varepsilon + W_\varepsilon(x) = \frac{\varepsilon}{\theta}\log\left[ \pi_\varepsilon e^{\frac{\theta}{\varepsilon}W_\varepsilon}(x) \right] + \ell(x) \quad \text{for all } x \in X.$

In order to transform (2.21) into the dynamic programming equation of some ergodic cost optimal control problem, we introduce now the relative entropy function. Let $P(X)$ be the set of probability vectors on $X$, i.e.

$P(X) = \left\{ \nu = (\nu_1, \dots, \nu_N) : \nu_i \ge 0, \; \sum_{i=1}^{N} \nu_i = 1 \right\}.$

Let us fix $\mu \in P(X)$. We define the relative entropy function $I(\cdot\,\|\,\mu) : P(X) \to \mathbb{R} \cup \{+\infty\}$ by

$I(\nu\,\|\,\mu) = \begin{cases} \displaystyle\sum_{x'' \in X} \log[r(x'')]\, \nu(x'') & \text{if } \nu \ll \mu \\ +\infty & \text{otherwise,} \end{cases}$

where

$r(x'') = \begin{cases} \dfrac{\nu(x'')}{\mu(x'')} & \text{if } \mu(x'') > 0 \\ 1 & \text{otherwise.} \end{cases}$

The next lemma is proved in an Appendix at the end of the paper. Actually, it is a particular case of Proposition II.4.2 in [5].
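The role of the relative entropy is easiest to see through the variational (Gibbs) equality that underlies the passage from (2.21) to the equation in Lemma 2.12 (and later from (4.9) to (4.9')): for a probability vector $\mu$ and a function $h$ on $X$, $\frac{\varepsilon}{\theta}\log\sum_{x''} e^{\frac{\theta}{\varepsilon}h(x'')}\mu(x'') = \max_{\nu \in P(X)}\{\sum_{x''} h(x'')\nu(x'') - \frac{\varepsilon}{\theta}I(\nu\|\mu)\}$, the maximum being attained at the tilted distribution $\nu^*(x'') \propto \mu(x'')e^{\frac{\theta}{\varepsilon}h(x'')}$. The sketch below checks the two sides numerically on hypothetical data; it illustrates the identity, not the paper's Appendix proof.

```python
import numpy as np

def relative_entropy(nu, mu):
    """I(nu || mu) = sum_x nu(x) log(nu(x)/mu(x)), with +inf if nu is not dominated by mu."""
    if np.any((mu == 0) & (nu > 0)):
        return np.inf
    mask = nu > 0
    return float(np.sum(nu[mask] * np.log(nu[mask] / mu[mask])))

theta, eps = 1.0, 0.2
mu = np.array([0.5, 0.3, 0.2])               # hypothetical reference row, e.g. pi_eps(x, .)
h = np.array([0.0, 1.0, 2.0])                # hypothetical function, e.g. W_eps

lhs = (eps / theta) * np.log(np.sum(np.exp(theta * h / eps) * mu))

nu_star = mu * np.exp(theta * h / eps)       # tilted (Gibbs) distribution: the maximizer
nu_star /= nu_star.sum()
rhs = float(np.sum(h * nu_star)) - (eps / theta) * relative_entropy(nu_star, mu)

print(lhs, rhs)                              # the two values coincide
```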

Lemma 2.12. The pair $(\rho_\varepsilon, W_\varepsilon)$ satisfies the equation

(2.22)    $\rho_\varepsilon + W_\varepsilon(x) = \ell(x) + \max_{\nu \in P(X)} \left\{ \sum_{x'' \in X} W_\varepsilon(x'')\, \nu(x'') - \frac{\varepsilon}{\theta}\, I(\nu\,\|\,\pi_\varepsilon(x)) \right\} \quad \text{for all } x \in X.$

3 Small noise limit.

In this section we show that there exist a sequence $\varepsilon_n \to 0$ and limits $\rho_0 \ge 0$, $W_0 : X \to \mathbb{R}$ such that

(3.1)    $\lim_{n \to \infty} \rho_{\varepsilon_n} = \rho_0, \qquad \lim_{n \to \infty} W_{\varepsilon_n}(x) = W_0(x) \quad \text{for all } x \in X.$

Since $X$ is finite, in order to get (3.1) we just need to prove that the family $\{\rho_\varepsilon, W_\varepsilon\}$ is uniformly bounded. By (2.15), we have

(3.2)    $0 \le \rho_\varepsilon \le \|\ell\|.$

So, it just remains to prove that $\{W_\varepsilon\}$ is uniformly bounded. Throughout this section we assume the following condition.

(A4) There exists a positive integer $T_2$ such that $\pi_\varepsilon^{T_2}(x, x'') > 0$ for all $x, x'' \in X$.

Remark 3.1. Note that (A1)-(A2) imply that the stochastic matrix $\pi_\varepsilon$ is irreducible and aperiodic. In particular, (A1)-(A2) imply (A4).

Theorem 3.2. Let $W_\varepsilon$ be the solution of (2.22) satisfying the normalizing condition $W_\varepsilon(x^*) = 0$. Then, for any $\varepsilon_0 > 0$ there exists a constant $K_1$ such that for $\varepsilon < \varepsilon_0$,

$\|W_\varepsilon\| \le K_1.$

Proof. Let $x_t$ be the Markov chain with stochastic matrix $\pi_\varepsilon$ and initial condition $x_0 = x$, and let $T > 0$ be arbitrary. Let us define

$e^{\frac{\theta}{\varepsilon}V_\varepsilon(x,T)} := E_x\, e^{\frac{\theta}{\varepsilon}\left[ \sum_{t=0}^{T-1} \ell(x_t) + W_\varepsilon(x_T) \right]}.$

Then, one can prove by induction that

(3.3)    $e^{\frac{\theta}{\varepsilon}V_\varepsilon(x,T)} = L^T e^{\frac{\theta}{\varepsilon}W_\varepsilon}(x),$

where $L^T$ is the $T$-th power of the matrix $L$; see Remark 2.10. Indeed, by (2.16),

$L^T e^{\frac{\theta}{\varepsilon}W_\varepsilon}(x) = e^{\frac{\theta}{\varepsilon}\left[ T\rho_\varepsilon + W_\varepsilon(x) \right]}.$

Substituting this in (3.3), and making the logarithmic transformation on both sides, we get

$V_\varepsilon(x, T) - T\rho_\varepsilon = W_\varepsilon(x).$

Since $x$ was chosen arbitrarily, in particular we have

(3.4)    $V_\varepsilon(x, T) - V_\varepsilon(x^*, T) = W_\varepsilon(x).$

Now we shall get a uniform bound for $x \to V_\varepsilon(x, T) - V_\varepsilon(x^*, T)$. Let $x, x'' \in X$ and let $\mu_0$ ($\tilde\mu_0$) be the distribution of $x_{T_2}$ with initial condition $x_0 = x$ ($x_0 = x''$, respectively), with $T_2$ as in (A4). Then, in view of the definition of the stochastic matrix $\pi_\varepsilon$,

$\frac{1}{N^{T_2}}\, e^{-\frac{T_2}{\varepsilon}\|\phi\|} \le \tilde\mu_0(y) \le 1 \quad \text{for all } y \in X,$

and therefore,

$\frac{\mu_0(y)}{\tilde\mu_0(y)} \le e^{\frac{T_2}{\varepsilon}\|\phi\|}\, N^{T_2} \quad \text{for all } y \in X.$

Thus,

$e^{\frac{\theta}{\varepsilon}V_\varepsilon(x,T)} \le e^{\frac{\theta}{\varepsilon}T_2\|\ell\|}\, E_{\tilde\mu_0}\!\left[ e^{\frac{\theta}{\varepsilon}\left( \sum_{t=T_2}^{T-1} \ell(x_t) + W_\varepsilon(x_T) \right)}\, \frac{\mu_0(x_{T_2})}{\tilde\mu_0(x_{T_2})} \right] \le N^{T_2}\, e^{\frac{\theta}{\varepsilon}T_2\left( \|\ell\| + \frac{1}{\theta}\|\phi\| \right)}\, e^{\frac{\theta}{\varepsilon}V_\varepsilon(x'',T)} \quad \text{for all } T > T_2.$

Then,

$V_\varepsilon(x, T) \le V_\varepsilon(x'', T) + T_2\left( \|\ell\| + \frac{1}{\theta}\|\phi\| \right) + \frac{\varepsilon}{\theta}\, T_2 \log N,$

and therefore, given any $\varepsilon_0 > 0$ arbitrary but fixed,

$V_\varepsilon(x, T) - V_\varepsilon(x'', T) \le T_2\left( \|\ell\| + \frac{1}{\theta}\|\phi\| + \frac{\varepsilon_0}{\theta}\log N \right)$

for $\varepsilon < \varepsilon_0$. Hence, interchanging the roles of $x$ and $x''$, we get

(3.5)    $|V_\varepsilon(x, T) - V_\varepsilon(x'', T)| \le T_2\left( \|\ell\| + \frac{1}{\theta}\|\phi\| + \frac{\varepsilon_0}{\theta}\log N \right).$

Therefore, in view of (3.4)-(3.5), taking $K_1 = T_2(\|\ell\| + \frac{1}{\theta}\|\phi\| + \frac{\varepsilon_0}{\theta}\log N)$ and $x'' = x^*$, the theorem follows. $\Box$

Theorem 3.3. There exist a number $\rho_0 \ge 0$ and a function $W_0 : X \to \mathbb{R}$, limit point of the family $\{(\rho_\varepsilon, W_\varepsilon)\}$, such that

(3.6)    $\rho_0 + W_0(x) = \max_{w \in W}\left\{ W_0(b(x, w)) + \ell(x) - \frac{1}{\theta}\,\phi(w) \right\}.$

Proof. Estimate (3.2) and Theorem 3.2 imply the existence of a limit point $(\rho_0, W_0)$ of the family $(\rho_\varepsilon, W_\varepsilon)$ through a sequence $\varepsilon_n \to 0$. Now we rewrite (2.21) as follows:

(3.7)    $\rho_\varepsilon + W_\varepsilon(x) = \frac{\varepsilon}{\theta}\log\left[ \sum_{x'' \in X} e^{\frac{\theta}{\varepsilon}W_\varepsilon(x'')}\, \frac{e^{-\frac{1}{\varepsilon}V(x, x'')}}{Z_\varepsilon(x)} \right] + \ell(x).$

Using a version of the Laplace-Varadhan Lemma (see the Appendix), it follows that the r.h.s. of (3.7) converges to

$\sup_{w \in W}\left\{ W_0(b(x, w)) + \ell(x) - \frac{1}{\theta}\,\phi(w) \right\}$ as $\varepsilon_n \to 0$.

Thus, letting $\varepsilon_n \to 0$ in (3.7), we get that the pair $(\rho_0, W_0)$ solves (3.6). $\Box$

Remark 3.4. Note that (3.6) is the equation (2.6) we introduced in Proposition 2.3, with $\gamma^2 = \frac{1}{\theta}$. Actually, assuming (A1)-(A3), $W_0$ is the same as in Section 2.1 for $\theta$ small enough (by uniqueness); see Proposition 2.7. Note also that uniqueness of $\rho_0$ ($= 0$) and $W_0$ implies convergence of $\rho_\varepsilon$ to $\rho_0$ and of $W_\varepsilon(x)$ to $W_0(x)$ as $\varepsilon \to 0$, not just convergence along sequences $\varepsilon_n \to 0$.
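The Laplace-type limit used in the proof of Theorem 3.3 can also be observed numerically: for fixed $x$ and a fixed function in place of $W_\varepsilon$, the soft maximum $\frac{\varepsilon}{\theta}\log\sum_{x''} e^{\frac{\theta}{\varepsilon}W_0(x'')}\pi_\varepsilon(x, x'')$ approaches $\max_{w}\{W_0(b(x,w)) - \frac{1}{\theta}\phi(w)\}$ as $\varepsilon \to 0$. The sketch below does this on a hypothetical toy model; freezing $W_0$ is a simplification made for the illustration (in the proof $W_{\varepsilon_n} \to W_0$ along the sequence).

```python
import numpy as np

# Hypothetical toy model of the form (2.2).
X, W = [0, 1, 2], [0, 1]
b = {(0, 0): 0, (1, 0): 0, (2, 0): 1, (0, 1): 1, (1, 1): 2, (2, 1): 2}
phi = {0: 0.0, 1: 1.0}
theta = 1.0
W0 = np.array([0.0, 0.4, 0.9])               # a fixed candidate limit function (illustrative)

Vmat = np.full((3, 3), np.inf)
for (x, w), xn in b.items():
    Vmat[x, xn] = min(Vmat[x, xn], phi[w])

def soft_max(x, eps):
    """(eps/theta)*log of the x-row sum appearing in (3.7), with W0 frozen in place of W_eps."""
    row = np.exp(-Vmat[x] / eps)
    row = row / row.sum()                    # this is pi_eps(x, .)
    return (eps / theta) * np.log(np.sum(np.exp(theta * W0 / eps) * row))

hard_max = max(W0[b[0, w]] - phi[w] / theta for w in W)     # deterministic limit at x = 0
for eps in [1.0, 0.1, 0.01]:
    print(eps, soft_max(0, eps), hard_max)                  # soft_max(0, eps) -> hard_max
```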

4 Risk sensitive control problem.

In this section we set up the state feedback robust control problem. Paralleling the approach of the previous sections, we introduce an infinite horizon risk sensitive control problem that is solved using the policy iteration algorithm. The optimal rate is interpreted as the upper value of a stochastic dynamic game with average cost per unit time criterion.

4.1 State feedback control problem.

Consider the finite state controlled machine defined by

(4.1)    $x_{t+1} = f(x_t, u_t), \quad t = 0, 1, \dots, \quad x_0 = x,$

where the state $x_t$ takes values in the finite set $X$, the control $u_t$ evolves in the finite set $U$ and $f : X \times U \to X$ is a given function. We recall that $N$ is the number of states in $X$. We now define a deterministic perturbation of the system (4.1). Let $b : X \times U \times W \to X$ be the function that defines the dynamics of the system $\Sigma_u$ given by

(4.2)    $x_{t+1} = b(x_t, u_t, w_t),$

where $x_t$ and $u_t$ take values in $X$ and $U$ respectively, and the disturbance $w_t$ takes values in a finite set $W$. We assume the following:

(H1) There exist a null control $u^* \in U$, an equilibrium state $x^* \in X$, and a null disturbance $w^* \in W$ such that
(i) $f(x^*, u^*) = x^*$, and
(ii) $b(x, u, w^*) = f(x, u)$ for all $x \in X$, $u \in U$.

(H2) Let $\mathcal{U}$ be the finite set of all stationary control policies $\tilde u : X \to U$. Given $\tilde u \in \mathcal{U}$ and $x, x'' \in X$, there exist $T_1$, $0 < T_1 \le N$, and $w \in W[0, T_1]$ such that for the initial condition $x_0 = x$, the closed-loop system $\Sigma^{\tilde u}$ reaches $x''$ after $T_1$ steps.

Let $\phi : W \to \mathbb{R}$ and $\ell : X \times U \to \mathbb{R}$ be functions such that $\phi(w^*) = 0$, $\phi(w) > 0$ for $w^* \ne w \in W$, and

(4.3)    $\ell(x^*, u^*) = 0, \quad \ell(x, u) > 0 \text{ for } (x, u) \ne (x^*, u^*) \in X \times U.$

The functions $\phi$ and $\ell$ play the same role as in the previous sections. Let $\mathcal{U}_1 \subset \mathcal{U}$ be the subset of stationary policies $\tilde u$ such that the following condition is satisfied: for each initial condition $x_0 = x$, there exists a positive integer $N_0$ such that the system (4.1) with $u_t = \tilde u(x_t)$ reaches the equilibrium state $x^*$ after $N_0$ steps, and $\tilde u(x^*) = u^*$. Note that given $\tilde u \in \mathcal{U}$ ($\tilde u \in \mathcal{U}_1$), letting $f^{\tilde u}(x) = f(x, \tilde u(x))$, with $b^{\tilde u}(x, w)$ and $\ell^{\tilde u}(x)$ defined similarly, then (A2) ((A1) and (A3), respectively) of Section 2 holds, with $f$, $b$ replaced by $f^{\tilde u}$, $b^{\tilde u}$. Indeed, for $\tilde u \in \mathcal{U}_1$ the $H_\infty$-norm $\|\Sigma^{\tilde u}\|_{H_\infty}$ is defined (see (2.13)).

The state feedback robust control problem is the following (see e.g. [2]). Given $\gamma > 0$, find a control $\tilde u \in \mathcal{U}_1$ such that for each initial condition $x_0 = x$, there exists a nonnegative constant $K(x)$, with $K(x^*) = 0$, satisfying

$K(x) + \sum_{t=0}^{T}\left[ \gamma^2 \phi(w_t) - \ell^{\tilde u}(x_t) \right] \ge 0 \quad \text{for all } w \in W[0, T], \; T > 0.$

In other words, given $\gamma > 0$ we want to find $\tilde u \in \mathcal{U}_1$ such that $\|\Sigma^{\tilde u}\|_{H_\infty} \le \gamma$. Following the same arguments as in Section 2, we deduce that the existence of a nonnegative function $W_0 : X \to \mathbb{R}$ such that

$W_0(x) \ge \min_{u \in U}\max_{w \in W}\left\{ W_0(b(x, u, w)) + \ell(x, u) - \gamma^2\phi(w) \right\}, \qquad W_0(x^*) = 0,$

is a necessary and sufficient condition for the existence of a solution to the state feedback robust control problem. Actually, the solution can be characterized in terms of the value of an average cost dynamic game, as we shall see later in this section. The proof of the next proposition is structurally similar to the one given for Proposition 2.3, and we sketch it in the Appendix at the end of the paper.

Proposition 4.1. If (H2) holds, then there exist a nonnegative number $\rho_0$ and a function $W_0 : X \to \mathbb{R}$ such that

(4.4)    $\rho_0 + W_0(x) = \min_{u \in U}\max_{w \in W}\left[ W_0(b(x, u, w)) + \ell(x, u) - \gamma^2\phi(w) \right].$
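A crude way to approximate the pair $(\rho_0, W_0)$ of Proposition 4.1 is relative value iteration on the min-max operator in (4.4), normalizing at the equilibrium state after each sweep. The sketch below runs this on a hypothetical controlled toy model; both the model data and the convergence of this particular iterative scheme are assumptions made for the illustration, not claims taken from the paper.

```python
import numpy as np

# Hypothetical controlled toy model of the form (4.2): 3 states, 2 controls, 2 disturbances.
X, U, W = [0, 1, 2], [0, 1], [0, 1]
b = {(x, u, w): min(2, max(0, x - u + w)) for x in X for u in U for w in W}
ell = lambda x, u: float(x) + 0.5 * u        # running cost with ell(x*, u*) = ell(0, 0) = 0
phi = {0: 0.0, 1: 1.0}
gamma2 = 4.0                                 # gamma^2

def isaacs_sweep(W0):
    """One application of the operator in (4.4): min over u of max over w."""
    return np.array([
        min(max(W0[b[x, u, w]] + ell(x, u) - gamma2 * phi[w] for w in W) for u in U)
        for x in X
    ])

W0 = np.zeros(3)
for _ in range(500):
    TW = isaacs_sweep(W0)
    rho0, W0 = TW[0], TW - TW[0]             # normalize so that W0(x*) = 0
print(rho0, W0)                              # rho0 approximates the value of the game
```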

Corollary 4.2. Let $\tilde u \in \mathcal{U}$ be a control achieving the minimum in the r.h.s. of (4.4). If (H1)-(H2) hold, then the following conditions are equivalent:
(i) $\rho_0 = 0$;
(ii) $\|\Sigma^{\tilde u}\|_{H_\infty} \le \gamma$.

Proof. The proof of this corollary is the same as in Theorem 2.6, noting that $\rho_0 = 0$ implies that $\tilde u \in \mathcal{U}_1$. To see this, note that, for any $x \in X$ and $T > 0$,

$\sum_{t=0}^{T} \ell(x_t, \tilde u(x_t)) \le W_0(x) - W_0(x_{T+1}),$

where $x_t$ obeys the dynamics (4.1) with $u_t = \tilde u(x_t)$. Thus, for $T$ great enough $\ell(x_T, \tilde u(x_T)) = 0$, and in view of (4.3), $x_T = x^*$ and $\tilde u(x_T) = u^*$. $\Box$

The equation (4.4) is the Isaacs equation of the following average cost zero-sum dynamic game.

Dynamic game. Consider the difference equation

$x_{t+1} = b(x_t, u_t, w_t), \quad t = 0, 1, \dots, \quad x_0 = x.$

Here $u = \{u_t\} \in U[0,\infty)$ and $w = \{w_t\} \in W[0,\infty)$, and they play the role of controls for Player 1 (minimizer) and Player 2 (maximizer), respectively. The associated cost functional is

(4.5)    $J(x; u, w) = \limsup_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1}\left[ \ell(x_t, u_t) - \gamma^2\phi(w_t) \right].$

We use the following definition of the value of a dynamic game [8], which is given in terms of strategies. A strategy $\vec u$ for Player 1 consists of a sequence of functions $u_0, u_1, \dots$, with values in $U$, such that $u_t$ is a function of $x_s, w_s$, $0 \le s < t$, and of the current state $x_t$. A strategy $\vec w$ for Player 2 is a sequence of functions $w_0, w_1, \dots$, with values in $W$, such that $w_t$ is a function of $x_s, u_s$, $0 \le s \le t$. We say that a strategy $\vec u$ is stationary feedback if $u_t$ depends only on the current state $x_t$, i.e., $u_t : X \to U$, and $u_t = \tilde u$ is independent of $t$. Analogously, $\vec w$ is stationary feedback if $w_t$ depends only on the current state $x_t$ and the current control $u_t$ of Player 1, and $w_t = \tilde w : X \times U \to W$ is independent of $t$. Given a pair of strategies $(\vec u, \vec w)$ and the initial condition $x_0 = x$, the controls for Player 1 and Player 2 are generated recursively as

$u_0 = u_0(x), \quad w_0 = w_0(x, u_0), \quad u_1 = u_1(x, x_1, w_0), \quad w_1 = w_1(x, x_1, u_0, u_1), \dots$

Definition 4.3. When there exists a pair of strategies $(\vec u^{\,*}, \vec w^{\,*})$ such that

$J(x; \vec u^{\,*}, \vec w) \le J(x; \vec u^{\,*}, \vec w^{\,*}) \le J(x; \vec u, \vec w^{\,*}) \quad \text{for all } \vec u, \vec w,$

the value $V(x) = J(x; \vec u^{\,*}, \vec w^{\,*})$ is called the value of the game, and $(\vec u^{\,*}, \vec w^{\,*})$ are referred to as optimal strategies.

Remark 4.4. $V(x)$ is often called the upper value of this infinite horizon dynamic game, since the maximizing Player 2 has the advantage of knowing Player 1's choice $u_t$ before choosing $w_t$. This is also reflected in the order min max (rather than max min) in the Isaacs equation (4.4). For continuous variable differential games, an alternative definition due to Elliott and Kalton is often used; see e.g. [6]. The discrete time version of the Elliott-Kalton definition is as follows. The minimizing Player 1 chooses any control sequence $u \in U[0,\infty)$, while the maximizing Player 2 chooses a map $\alpha : U[0,\infty) \to W[0,\infty)$ such that $\alpha[u]_t$ depends only on $u_0, u_1, \dots, u_t$. The Elliott-Kalton upper value of our discrete-time dynamic game can be easily shown to be the same as the one defined above.

Proposition 4.5. For every $x \in X$,

$\rho_0 = V(x).$

Furthermore, the stationary strategies



$\tilde u(x) \in \arg\min_{u \in U}\max_{w \in W}\left[ W_0(b(x, u, w)) + \ell(x, u) - \gamma^2\phi(w) \right]$

and

$\tilde w(x, u) \in \arg\max_{w \in W}\left\{ W_0(b(x, u, w)) + \ell(x, u) - \gamma^2\phi(w) \right\}$

are optimal. The proof of this proposition is an immediate application of (4.4) and we omit it.

4.2 Risk sensitive optimal control problem.

We regard system (4.1) as a deterministic controlled Markov chain and define a random perturbation as follows. Given $x, x'' \in X$ and $u \in U$ we define

$V(x, u, x'') = \min\left\{ \phi(w) : x'' = b(x, u, w) \right\}.$

Here the minimum over an empty set is defined as $+\infty$. For each $u \in U$, we define the stochastic matrix

(4.6)    $\pi_\varepsilon^u(x, x'') = \frac{1}{Z_\varepsilon(x, u)} \exp\left[ -\frac{1}{\varepsilon} V(x, u, x'') \right],$

where $\varepsilon > 0$ is a noise parameter and $Z_\varepsilon(x, u)$ is a normalizing constant satisfying the condition $\sum_{x'' \in X} \pi_\varepsilon^u(x, x'') = 1$. Throughout this subsection we assume (H2).

For each $\tilde u \in \mathcal{U}$ the cost functional (to be minimized) is the infinite horizon exponential growth criterion

(4.7)    $\rho_\varepsilon(\tilde u) = \lim_{T \to \infty} \frac{\varepsilon}{\theta}\frac{1}{T} \log E_x \exp\left\{ \frac{\theta}{\varepsilon} \sum_{t=0}^{T-1} \ell(x_t, \tilde u(x_t)) \right\},$

where $\theta > 0$ is given. The risk sensitive optimal control problem is to find a control $\tilde u \in \mathcal{U}$ that minimizes $\rho_\varepsilon(\tilde u)$. Let

$\rho_\varepsilon := \inf_{\tilde u \in \mathcal{U}} \rho_\varepsilon(\tilde u).$

Next we have a verification theorem.

Theorem 4.6. Suppose that there exist a number $\lambda > 0$ and a strictly positive function $\psi : X \to \mathbb{R}$ such that

$\lambda\,\psi(x) = \min_{u \in U}\left\{ e^{\frac{\theta}{\varepsilon}\ell(x, u)}\, \pi_\varepsilon^u \psi(x) \right\} \quad \text{for all } x \in X.$

Then $\rho_\varepsilon = \frac{\varepsilon}{\theta}\log\lambda$, and the control $\tilde u \in \mathcal{U}$, with $\tilde u(x)$ achieving the minimum on the r.h.s., is optimal.

Proof. Let $\tilde u \in \mathcal{U}$ be arbitrary. Following the same arguments as in the proof of Theorem 2.11, we have

$\frac{\varepsilon}{\theta}\log\lambda \le \rho_\varepsilon(\tilde u),$

with equality for the control achieving the minimum in the statement of the theorem. $\Box$

Now, in order to get an optimal policy, we use the policy iteration algorithm, which is described as follows. Given an arbitrary policy $\tilde u_0 \in \mathcal{U}$, we have already proved the existence of a number $\lambda_0 > 0$ and a strictly positive function $\psi_0 : X \to \mathbb{R}$ (see Theorem 2.11) such that for all $x \in X$

$\lambda_0\,\psi_0(x) = e^{\frac{\theta}{\varepsilon}\ell(x, \tilde u_0(x))}\, \pi_\varepsilon^{\tilde u_0}\psi_0(x) \equiv T_0\psi_0(x),$

where, for a stationary policy $\tilde u_k$, $T_k$ denotes the positive operator $T_k h(x) = e^{\frac{\theta}{\varepsilon}\ell(x, \tilde u_k(x))}\, \pi_\varepsilon^{\tilde u_k} h(x)$. Let $\tilde u_1 \in \mathcal{U}$ be defined by

$\tilde u_1(x) \in \arg\min_{u \in U}\left\{ e^{\frac{\theta}{\varepsilon}\ell(x, u)}\, \pi_\varepsilon^u\psi_0(x) \right\}.$

Calculate $\lambda_1$ and $\psi_1$, and repeat the process. If we reach a point where

$T_k\psi_k(x) = \min_{u \in U}\left[ e^{\frac{\theta}{\varepsilon}\ell(x, u)}\, \pi_\varepsilon^u\psi_k(x) \right] \quad \text{for all } x \in X,$

then, according to Theorem 4.6, $\tilde u_k$ is optimal, and we stop.

Theorem 4.7. The policy iteration algorithm generates a finite sequence of controls $\tilde u_0, \tilde u_1, \dots, \tilde u_m$, with strictly monotonically decreasing $\rho_\varepsilon(\tilde u_k)$, until the iteration reaches a stopping point (at which, by Theorem 4.6, $\tilde u_m$ is optimal).

Proof. Let $\tilde u_k$ and $\tilde u_{k+1}$ be control policies generated by the policy iteration algorithm, and let $\lambda_k, \psi_k$ ($\lambda_{k+1}, \psi_{k+1}$) be the dominant eigenvalue and eigenfunction of $T_k$ ($T_{k+1}$, respectively). Thus,

(4.8)    $T_{k+1}\psi_k \le \lambda_k\psi_k \; (= T_k\psi_k).$

If $T_{k+1}\psi_k = \lambda_k\psi_k$, then

$T_k\psi_k(x) = \min_{u \in U}\left\{ e^{\frac{\theta}{\varepsilon}\ell(x, u)}\, \pi_\varepsilon^u\psi_k(x) \right\} \quad \text{for all } x \in X,$

and the iteration terminates. In this case, $\tilde u_k$ is optimal by Theorem 4.6. So, assume that there exists some component $x_0 \in X$ such that the inequality (4.8) is strict, i.e.

$T_{k+1}\psi_k(x_0) < \lambda_k\psi_k(x_0).$

Then, Theorem 1.6 in [16] implies that $\lambda_{k+1} < \lambda_k$, and therefore

$\rho_\varepsilon(\tilde u_{k+1}) < \rho_\varepsilon(\tilde u_k).$

Thus, since there exist just a finite number of policies, the iteration will stop after a finite number of steps. $\Box$
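The policy iteration loop described above is short to implement: for the current stationary policy compute the Perron eigenpair of $T_k$, improve the policy pointwise, and stop when the dominant eigenvalue no longer strictly decreases. The sketch below does this on a hypothetical controlled toy model; the dense eigensolver stands in for the Perron-Frobenius computation, the model data are invented for the example, and the eigenvector is only used through its ratios, so its normalization is immaterial.

```python
import numpy as np

nX, nU, nW = 3, 2, 2
theta, eps = 1.0, 0.2
b = {(x, u, w): min(nX - 1, max(0, x - u + w))
     for x in range(nX) for u in range(nU) for w in range(nW)}     # hypothetical dynamics (4.2)
phi = [0.0, 1.0]
ell = lambda x, u: float(x) + 0.5 * u

def pi_u(x, u):
    """Row x of the controlled stochastic matrix (4.6), proportional to exp(-V(x,u,.)/eps)."""
    V = np.full(nX, np.inf)
    for w in range(nW):
        V[b[x, u, w]] = min(V[b[x, u, w]], phi[w])
    row = np.exp(-V / eps)
    return row / row.sum()

def eig_of_policy(pol):
    """Dominant eigenvalue/eigenvector of T_k h(x) = e^{theta*ell(x,pol(x))/eps} (pi^pol h)(x)."""
    T = np.array([np.exp(theta * ell(x, pol[x]) / eps) * pi_u(x, pol[x]) for x in range(nX)])
    vals, vecs = np.linalg.eig(T)
    k = np.argmax(vals.real)
    psi = np.abs(vecs[:, k].real)
    return vals[k].real, psi / psi.sum()     # only the ratios of psi matter below

pol = [0] * nX                               # arbitrary initial stationary policy
lam, psi = eig_of_policy(pol)
for _ in range(50):                          # terminates after finitely many steps (Theorem 4.7)
    new_pol = [int(np.argmin([np.exp(theta * ell(x, u) / eps) * (pi_u(x, u) @ psi)
                              for u in range(nU)])) for x in range(nX)]
    new_lam, new_psi = eig_of_policy(new_pol)
    if new_lam >= lam - 1e-12:               # no strict improvement: current policy is kept
        break
    pol, lam, psi = new_pol, new_lam, new_psi

print(pol, (eps / theta) * np.log(lam))      # policy and its risk sensitive index
```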

Corollary 4.8. There exist $\rho_\varepsilon > 0$ and a function $W_\varepsilon : X \to \mathbb{R}$ such that

(4.9)    $\exp\left[ \frac{\theta}{\varepsilon}\left( \rho_\varepsilon + W_\varepsilon(x) \right) \right] = \min_{u \in U}\left[ e^{\frac{\theta}{\varepsilon}\ell(x, u)}\, \pi_\varepsilon^u e^{\frac{\theta}{\varepsilon}W_\varepsilon}(x) \right].$

Proof. The proof follows immediately taking $e^{\frac{\theta}{\varepsilon}\rho_\varepsilon} = \lambda_m$ and $e^{\frac{\theta}{\varepsilon}W_\varepsilon} = \psi_m$,

where $m$ is the step at which the iteration finishes. $\Box$

Using the variational equality (A.1) in the Appendix, we rewrite (4.9) as

(4.9')    $\rho_\varepsilon + W_\varepsilon(x) = \min_{u \in U}\max_{\nu \in P(X)}\left\{ \ell(x, u) + \sum_{x'' \in X} W_\varepsilon(x'')\,\nu(x'') - \frac{\varepsilon}{\theta}\, I(\nu\,\|\,\pi_\varepsilon^u(x)) \right\}.$
