ROBUST ADAPTIVE DYNAMIC PROGRAMMING - MTNS 2012

ROBUST ADAPTIVE DYNAMIC PROGRAMMING: AN OVERVIEW OF RECENT RESULTS

YU JIANG AND ZHONG-PING JIANG∗

∗ Y. Jiang and Z. P. Jiang are with the Control and Networks Lab, Department of Electrical and Computer Engineering, Polytechnic Institute of New York University, Brooklyn, NY 11201, USA. Email: [email protected], [email protected]. This work has been supported in part by NSF grants DMS-0906659 and ECCS-1101401.

Abstract. This paper gives an overview of our recent progress on robust adaptive dynamic programming (for short, robust-ADP) for continuous-time dynamic systems with unknown system parameters or system order. First, a novel computational adaptive control method based on robust-ADP is proposed for linear systems with completely unknown dynamics. Then, robust-ADP for nonlinear systems is developed by integrating tools from nonlinear control theory, such as the small-gain and backstepping techniques. Finally, an application of robust-ADP to the decentralized optimal stabilization of large-scale systems is studied. An example from power systems is numerically simulated to validate the efficiency of the robust-ADP-based optimal control design.

Key words. Adaptive dynamic programming, robust optimal control, power systems

1. Introduction. As a computational method inspired by biological learning, approximate/adaptive dynamic programming (ADP) (e.g., [19, 27, 28, 29]) has been applied to the optimal feedback control design of dynamic systems in recent years (e.g., [1, 4, 18, 24, 25, 30, 31]). Using ADP, the optimal control policy can be directly approximated without identifying the unknown system parameters. In the setting of continuous-time systems [9, 23, 25], ADP-based policy iteration algorithms have been developed to find online the optimal control policies for linear systems with partially or completely unknown dynamics.

It is noteworthy that biological systems learn to achieve enhanced robustness, or a greater chance of survival, through interacting with the unknown environment, and may only be able to make decisions based on partial-state information. In order to capture and model these two features of biological learning, a new framework of robust adaptive dynamic programming (robust-ADP) is proposed [8, 10, 11]. Integrated with tools from nonlinear control theory, such as Lyapunov designs, input-to-state stability theory [22], and nonlinear small-gain techniques [13], robust-ADP is a natural extension of ADP to uncertain dynamic systems with incomplete state information and unknown system order.

Most recently, robust-ADP has been applied to large-scale dynamic systems [12]. In practice, the presence of unknown parameters and/or dynamic uncertainties, together with limited information about the state variables, gives rise to challenges in the controller design of large-scale systems. By integrating a simple version of the cyclic-small-gain theorem [17], asymptotic stability can be achieved by assigning appropriate weighting matrices for each subsystem. Further, a certain suboptimality property can be obtained. The proofs of the presented results are available from the authors upon request; also see [9, 10, 11].

2. Robust ADP.

2.1. ADP for uncertain continuous-time linear systems. Consider the stabilizable system

(2.1)    $\dot{x} = Ax + Bu$


where $x \in \mathbb{R}^n$ is the state vector; $u \in \mathbb{R}^m$ is the control input; $A \in \mathbb{R}^{n \times n}$ and $B \in \mathbb{R}^{n \times m}$ are unknown system matrices. The objective is to find online an optimal control policy $u^* = -K^* x$ such that the following quadratic cost is minimized:

(2.2)    $J = \int_0^\infty \left( x^T Q x + u^T R u \right) d\tau$

where $Q \ge 0$ and $R > 0$ are symmetric matrices with $(A, Q^{1/2})$ observable. To begin with, we select a stabilizing initial gain matrix $K_0$ such that $A - BK_0$ is Hurwitz. Next, we apply $u_0 = -K_0 x + e$ as the control input, with $e$ an exploration noise, and record the state and input information on $[t_i, t_{i+1}]$, $i = 0, 1, \cdots, l-1$, where $l > 0$ is a sufficiently large integer. Then, for $k = 0, 1, 2, \cdots$, the following iterative equation was derived in [9]:

(2.3)    $\left. x^T P_k x \right|_{t_i}^{t_{i+1}} = \int_{t_i}^{t_{i+1}} \left[ -x^T \left( Q + K_k^T R K_k \right) x + 2 \left( u_0 + K_k x \right)^T R K_{k+1} x \right] d\tau.$
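Once the trajectory data are recorded, (2.3) is linear in the unknowns $P_k$ and $K_{k+1}$. One standard way to set up the computation (cf. [9]) is to vectorize both sides using the identities $x^T P_k x = (x \otimes x)^T \mathrm{vec}(P_k)$ and $2(u_0 + K_k x)^T R K_{k+1} x = 2\left( x \otimes R(u_0 + K_k x) \right)^T \mathrm{vec}(K_{k+1})$, so that each data interval $[t_i, t_{i+1}]$ contributes one scalar equation:

\[
\left. (x \otimes x)^T \right|_{t_i}^{t_{i+1}} \mathrm{vec}(P_k)
- 2 \int_{t_i}^{t_{i+1}} \left( x \otimes R(u_0 + K_k x) \right)^T d\tau \, \mathrm{vec}(K_{k+1})
= - \int_{t_i}^{t_{i+1}} (x \otimes x)^T d\tau \, \mathrm{vec}\left( Q + K_k^T R K_k \right).
\]

Stacking these equations over the $l$ intervals yields a linear system that can be solved for $\mathrm{vec}(P_k)$ and $\mathrm{vec}(K_{k+1})$ by least squares.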

Theorem 2.1 ([9]). Under the PE condition on $e(t)$, assume that $P_k = P_k^T$ and $K_{k+1}$ can be uniquely solved from (2.3) for all $k = 0, 1, 2, \cdots$. Then, $\lim_{k \to \infty} P_k = P^*$ and $\lim_{k \to \infty} K_k = K^*$, where $K^* = R^{-1} B^T P^*$ and $P^* > 0$ is the symmetric solution of the following ARE:

$A^T P + P A + Q - P B R^{-1} B^T P = 0.$

A practical online learning algorithm is summarized as follows:

Algorithm 2.2. Online policy iteration algorithm
1. Let $k \leftarrow 0$.
2. Solve $P_k$ and $K_{k+1}$ from (2.3).
3. Let $k \leftarrow k + 1$, and repeat Step 2 until $\|P_k - P_{k-1}\| \le \epsilon$ for $k \ge 1$, where the constant $\epsilon > 0$ can be any predefined small threshold.
4. Finally, use $u = -K_k x$ as the approximated optimal control policy.

Remark 2.3. In order to satisfy the PE condition in [9], the exploration noise $e$, comprised of a sum of sinusoidal signals with different frequencies, is used. In addition, the number of time intervals for the collection of online data should be sufficiently large. See [9] for a more detailed analysis of the PE condition.
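A minimal numerical sketch of Algorithm 2.2, using the vectorized least-squares form of (2.3) described above, is given below. All specifics are hypothetical: the plant matrices $A$, $B$ serve only to generate trajectory data (the learning loop never reads them), and the initial gain, weighting matrices, exploration frequencies, and interval lengths are illustrative choices rather than values from [9].

```python
# Minimal sketch of Algorithm 2.2 via the vectorized least-squares form of (2.3).
# The plant matrices A, B are used ONLY to generate data; the learner never reads them.
import numpy as np

# --- hypothetical second-order plant (data generator only) ---
A = np.array([[0.0, 1.0],
              [-1.0, -0.5]])
B = np.array([[0.0],
              [1.0]])
n, m = B.shape

Q = np.eye(n)                    # state weighting, Q >= 0, (A, Q^{1/2}) observable
R = np.eye(m)                    # control weighting, R > 0
K0 = np.array([[1.0, 1.0]])      # initial stabilizing gain (A - B K0 Hurwitz)

dt = 1e-3                        # integration step used to generate the data
steps = 50                       # steps per collection interval [t_i, t_{i+1}]
l = 60                           # number of intervals (should be sufficiently large)

def vec(M):
    # column-major vectorization, so that x^T M x = kron(x, x)^T vec(M)
    return M.flatten(order="F")

# --- data collection with exploration noise (sum of sinusoids, Remark 2.3) ---
x = np.array([1.0, -1.0])
t = 0.0
d_xx, I_xx, I_xu = [], [], []    # per-interval data for the least-squares problem
for i in range(l):
    xx_int = np.zeros(n * n)
    xu_int = np.zeros(n * m)
    x_start = x.copy()
    for _ in range(steps):
        e = np.array([sum(np.sin(w * t) for w in (1.0, 3.7, 7.3, 11.1))])
        u0 = -K0 @ x + e                    # behavior policy u0 = -K0 x + e
        xx_int += np.kron(x, x) * dt        # integral of kron(x, x) over the interval
        xu_int += np.kron(x, u0) * dt       # integral of kron(x, u0) over the interval
        x = x + (A @ x + B @ u0) * dt       # Euler step of the (unknown) plant
        t += dt
    d_xx.append(np.kron(x, x) - np.kron(x_start, x_start))
    I_xx.append(xx_int)
    I_xu.append(xu_int)
d_xx, I_xx, I_xu = map(np.array, (d_xx, I_xx, I_xu))

# --- policy iteration on the recorded data (equation (2.3)) ---
K = K0.copy()
P_old = np.zeros((n, n))
for k in range(30):
    # Each interval i gives one scalar equation:
    #   d_xx[i] . vec(P_k)
    #     - 2 (I_xx[i] kron(I_n, K^T R) + I_xu[i] kron(I_n, R)) . vec(K_{k+1})
    #   = - I_xx[i] . vec(Q + K^T R K)
    Theta = np.hstack([d_xx,
                       -2.0 * (I_xx @ np.kron(np.eye(n), K.T @ R)
                               + I_xu @ np.kron(np.eye(n), R))])
    Xi = -I_xx @ vec(Q + K.T @ R @ K)
    z, *_ = np.linalg.lstsq(Theta, Xi, rcond=None)
    P = z[:n * n].reshape((n, n), order="F")
    P = 0.5 * (P + P.T)                     # keep the symmetric part
    K = z[n * n:].reshape((m, n), order="F")
    if np.linalg.norm(P - P_old) < 1e-6:    # stopping rule of Algorithm 2.2
        break
    P_old = P

print("learned P:\n", P)
print("learned K:\n", K)
```

If the collected data are sufficiently exciting, the iterates $(P_k, K_k)$ produced this way should approach the solution $P^*$ and $K^* = R^{-1} B^T P^*$ characterized in Theorem 2.1.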

2.2. Robust-ADP with matched dynamic uncertainties. Consider the following continuous-time system, which is a linear model interconnected with nonlinear dynamic uncertainties characterized by the $w$-subsystem:

(2.4)    $\dot{w} = q(w, y),$
(2.5)    $\dot{x} = Ax + B \left[ u + \Delta(w, y) \right],$
(2.6)    $y = Cx$

where $x \in \mathbb{R}^n$ is the measured component of the state available for feedback control; $w \in \mathbb{R}^{n_w}$ is the unmeasurable part of the state with unknown order $n_w$; $u \in \mathbb{R}^m$ is the control input; $y \in \mathbb{R}^p$ is the system output; $A \in \mathbb{R}^{n \times n}$, $B \in \mathbb{R}^{n \times m}$, and $C \in \mathbb{R}^{p \times n}$ are unknown constant matrices with $(A, B)$ stabilizable and $(A, C)$ observable; $q: \mathbb{R}^{n_w} \times \mathbb{R}^p \to \mathbb{R}^{n_w}$ and $\Delta: \mathbb{R}^{n_w} \times \mathbb{R}^p \to \mathbb{R}^m$ are two unknown locally Lipschitz functions satisfying $q(0,0) = 0$ and $\Delta(0,0) = 0$. Assume that the open-loop system is forward-complete.

To study the robust stabilization problem of (2.4)-(2.6), let us consider the following control system having $x \in \mathbb{R}^n$ as the state, $u \in \mathbb{R}^m$ as the input, and $y \in \mathbb{R}^p$ as the output:

(2.7)    $\dot{x} = f(x, u), \quad y = h(x, u)$

where $f$ is a locally Lipschitz function and $h$ is a continuous function. The following definitions are taken from [13]; also see [22].

Definition 2.4. System (2.7) is said to be input-to-output stable (IOS) with gain $\gamma$ if, for any measurable, locally essentially bounded input $u$ and any initial condition $x(0)$, the solution $x(t)$ exists for every $t \ge 0$ and satisfies

(2.8)    $|y(t)| \le \beta\left( |x(0)|, t \right) + \gamma\left( \|u\| \right)$

where $\beta$ and $\gamma$ are of class $\mathcal{KL}$ and of class $\mathcal{K}$, respectively.

Definition 2.5. System (2.7) is said to have the strong unboundedness observability (SUO) property with zero offset if there exist a class $\mathcal{KL}$ function $\beta^0$ and a class $\mathcal{K}$ function $\gamma^0$ such that, for each measurable control $u(t)$ defined on $[0, T)$ with $0 < T \le \infty$, the solution $x(t)$ of (2.7), right maximally defined on $[0, T')$ ($0 < T \le T'$), satisfies

(2.9)    $|x(t)| \le \beta^0\left( |x(0)|, t \right) + \gamma^0\left( \| (u, y)_t \| \right), \quad \forall t \in [0, T').$

In order to achieve global asymptotic stability, let us make a few assumptions about (2.4), which are often required in the literature of nonlinear control design [7].

Assumption 2.6. The $w$-subsystem has the SUO property with zero offset and is IOS with respect to $y$ as the input and $\Delta$ as the output.

Assumption 2.7. There exist a continuously differentiable, positive definite, radially unbounded function $W: \mathbb{R}^{n_w} \to \mathbb{R}_+$ and two constants $c_1 > 0$, $c_2 \ge 0$ such that

(2.10)    $\dot{W} = \dfrac{\partial W(w)}{\partial w} q(w, y) \le -c_1 |\Delta|^2 + c_2 |y|^2$

for all $w \in \mathbb{R}^{n_w}$ and $y \in \mathbb{R}^p$.

Online learning is conducted following the same procedure as in Algorithm 2.2, with $u_0$ replaced by $u_0 + \Delta$.

Theorem 2.8 ([10]). Under Assumptions 2.6 and 2.7, suppose the exploration noise $e(t)$ satisfies the PE condition such that $P_k = P_k^T$ and $K_{k+1}$ can be uniquely solved from (2.3) for all $k = 0, 1, 2, \cdots$. Also, let $Q > \frac{c_2}{c_1} C^T C$ and $R = I_m$. Then, we have $\lim_{k \to \infty} P_k = P^*$, $\lim_{k \to \infty} K_k = K^*$, and $u = -K^* x$ globally asymptotically stabilizes (2.4)-(2.6) at the origin.
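As a simple illustration of Assumptions 2.6 and 2.7, consider, for instance, the scalar dynamic uncertainty $\dot{w} = -w + y$ with output $\Delta(w, y) = w$. This $w$-subsystem is a stable linear system, hence it is IOS from $y$ to $\Delta$ and has the SUO property with zero offset. Taking $W(w) = \frac{1}{2} w^2$ gives

\[
\dot{W} = w(-w + y) \le -\tfrac{1}{2} w^2 + \tfrac{1}{2} y^2 = -\tfrac{1}{2} |\Delta|^2 + \tfrac{1}{2} |y|^2,
\]

so Assumption 2.7 holds with $c_1 = c_2 = \frac{1}{2}$, and the gain condition of Theorem 2.8 reduces to $Q > C^T C$.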

2.3. Robust-ADP with unmatched dynamic uncertainties. Consider the following interconnected system:

(2.11)    $\dot{w} = q(w, y),$
(2.12)    $\dot{x} = Ax + B \left[ z + \Delta_1(w, y) \right],$
(2.13)    $\dot{z} = Ex + Fz + G \left[ u + \Delta_2(w, y) \right],$
(2.14)    $y = Cx$

where $[x^T, z^T]^T \in \mathbb{R}^{n+m}$ is the vector of system states; $A \in \mathbb{R}^{n \times n}$, $B \in \mathbb{R}^{n \times m}$, $C \in \mathbb{R}^{p \times n}$, $E \in \mathbb{R}^{m \times n}$, $F \in \mathbb{R}^{m \times m}$, and $G \in \mathbb{R}^{m \times m}$ are unknown constant matrices with the pair $(A, B)$ stabilizable and $G$ nonsingular; $w \in \mathbb{R}^{n_w}$ is the state of the dynamic uncertainty; $\Delta_1 = D\Delta(w, y)$ and $\Delta_2 = H\Delta(w, y)$ are the outputs of the dynamic uncertainty, with $D, H \in \mathbb{R}^{m \times n_\Delta}$ unknown constant matrices; $q: \mathbb{R}^{n_w} \times \mathbb{R}^p \to \mathbb{R}^{n_w}$ and $\Delta: \mathbb{R}^{n_w} \times \mathbb{R}^p \to \mathbb{R}^{n_\Delta}$ are two unknown locally Lipschitz functions vanishing at the origin. The open-loop system is assumed to be forward-complete.

Letting $u_0 = e(t)$ for $t \in \bigcup_{i=0}^{l-1} [t_i, t_{i+1}]$, the following learning strategy is derived in [10]:

Phase-one learning:

(2.15)    $\left. x^T P_{1,k} x \right|_{t_i}^{t_{i+1}} = \int_{t_i}^{t_{i+1}} \left[ -x^T \left( Q_1 + K_{1,k}^T R_1 K_{1,k} \right) x + 2 \left( z + \Delta_1 + K_{1,k} x \right)^T R_1 K_{1,k+1} x \right] d\tau.$

For the matrix $K_{1,k}$ obtained from phase-one learning, we define $\hat{\xi} = z + K_{1,k} x$.

Phase-two learning:

(2.16)    $\left. \hat{\xi}^T P_{2,j} \hat{\xi} \right|_{t_i}^{t_{i+1}} = \int_{t_i}^{t_{i+1}} \left[ -\hat{\xi}^T \left( Q_2 + K_{2,j}^T R_2 K_{2,j} \right) \hat{\xi} + 2 \left( u_0 + \Delta_2 + K_{2,j} \hat{\xi} \right)^T R_2 K_{2,j+1} \hat{\xi} + 2 \hat{\xi}^T N_j x + 2 \hat{\xi}^T L_j \Delta_1 \right] d\tau.$
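The cross terms $2\hat{\xi}^T N_j x$ and $2\hat{\xi}^T L_j \Delta_1$ in (2.16) can be traced back to the dynamics of the transformed variable; a quick, informal computation along (2.12)-(2.13), using $z = \hat{\xi} - K_{1,k} x$, gives

\[
\dot{\hat{\xi}} = \dot{z} + K_{1,k} \dot{x}
= \left( E + K_{1,k} A - (F + K_{1,k} B) K_{1,k} \right) x + (F + K_{1,k} B)\, \hat{\xi} + G \left( u + \Delta_2 \right) + K_{1,k} B \Delta_1,
\]

so the $x$- and $\Delta_1$-dependent terms enter the phase-two equation through $N_j$ and $L_j$. This is consistent with the limits of $N_j$ and $L_j$ stated in Theorem 2.9 below.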

Theorem 2.9 ([10]). Under Assumptions 2.6 and 2.7, suppose the exploration noise $e(t)$ satisfies the PE condition such that $P_{1,k} = P_{1,k}^T$ and $K_{1,k+1}$ can be uniquely solved from (2.15) for all $k = 0, 1, 2, \cdots$, and $P_{2,j} = P_{2,j}^T$, $K_{2,j+1}$, $N_j$, $L_j$ can be uniquely solved from (2.16) for all $j = 0, 1, 2, \cdots$. Also, let $Q_1 > \frac{2c_2}{c_1} C^T C$, $R_1^{-1} \ge D D^T$, $Q_2 > 0$, and $R_2^{-1} > (H + G^{-1} K^* B D)(H + G^{-1} K^* B D)^T$. Then, we have $\lim_{k \to \infty} P_{1,k} = P^*$, $\lim_{k \to \infty} K_{1,k} = K^*$, $\lim_{k,j \to \infty} P_{2,j} = P_2^*$, $\lim_{k,j \to \infty} K_{2,j} = K_2^* = R_2^{-1} G^T P_2^*$, $\lim_{k,j \to \infty} N_j = P_2^* \left( E + K^*(A - BK^*) - FK^* \right)$, and $\lim_{k,j \to \infty} L_j = P_2^* K^* B$, where $P_2^*$ is the symmetric positive definite solution of the ARE

(2.17)    $P_2^* (F + K^* B) + (F + K^* B)^T P_2^* + Q_2 - P_2^* G R_2^{-1} G^T P_2^* = 0.$

Moreover, for sufficiently large integers $k > 0$ and $j > 0$, the selected control policy

(2.18)    $u = -\left[ \left( K_{2,j}^T R_2 \right)^{-1} \left( N_j + R_1 K_{1,k} \right) + K_{2,j} K_{1,k} \right] x - K_{2,j} z$

globally asymptotically stabilizes (2.11)-(2.14) at the origin.
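Note that (2.17) is a standard continuous-time algebraic Riccati equation in the transformed coordinates, with $F + K^* B$ playing the role of the system matrix and $G$ that of the input matrix; its solution can therefore be cross-checked numerically whenever a model happens to be available. The snippet below is a minimal sketch of such a model-based check; the numerical values stand in for $F + K^* B$, $G$, $Q_2$, $R_2$, are purely hypothetical, and are never needed by the robust-ADP scheme itself.

```python
# Hypothetical model-based cross-check of the ARE (2.17):
#   P2 (F + K*B) + (F + K*B)^T P2 + Q2 - P2 G R2^{-1} G^T P2 = 0,
# followed by K2* = R2^{-1} G^T P2*.
import numpy as np
from scipy.linalg import solve_continuous_are

F_plus_KB = np.array([[-1.0]])   # stands in for F + K* B (assumed known here)
G = np.array([[1.0]])
Q2 = np.array([[1.0]])
R2 = np.array([[1.0]])

P2 = solve_continuous_are(F_plus_KB, G, Q2, R2)   # solves A^T P + P A + Q - P B R^{-1} B^T P = 0
K2 = np.linalg.solve(R2, G.T @ P2)                # K2* = R2^{-1} G^T P2*
print("P2* =", P2, " K2* =", K2)                  # scalar case: P2* = sqrt(2) - 1
```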

2.4. Robust-ADP for large-scale systems. Consider the large-scale system

(2.19)    $\dot{x}_i = A_i x_i + B_i \left[ u_i + D_i(y) \right],$
(2.20)    $y_i = C_i x_i, \quad 1 \le i \le N,$

where $x_i \in \mathbb{R}^{n_i}$, $y_i \in \mathbb{R}^{p_i}$, and $u_i \in \mathbb{R}^{m_i}$ are the state, the output, and the input of the $i$-th subsystem; $y = \left[ y_1^T, y_2^T, \cdots, y_N^T \right]^T$; $A_i \in \mathbb{R}^{n_i \times n_i}$ and $B_i \in \mathbb{R}^{n_i \times m_i}$ are unknown system matrices with $(A_i, B_i)$ stabilizable; $D_i(\cdot): \mathbb{R}^p \to \mathbb{R}^{m_i}$ are unknown functions satisfying $|D_i(y)| \le d_i |y|$ for all $y \in \mathbb{R}^p$, with $d_i > 0$, $\sum_{i=1}^N n_i = n$, $\sum_{i=1}^N p_i = p$, and $\sum_{i=1}^N m_i = m$.

For the $i$-th subsystem, select $K_i^{(0)}$ such that $A_i - B_i K_i^{(0)}$ is Hurwitz. Then, with $u_i = -K_i^{(0)} x_i(t) + e_i(t)$, along the solutions of (2.19) it follows that [12]

(2.21)    $\left. x_i^T P_i^{(k)} x_i \right|_{t_j}^{t_{j+1}} = \int_{t_j}^{t_{j+1}} \left[ -x_i^T \left( Q_i + (K_i^{(k)})^T R_i K_i^{(k)} \right) x_i + 2 \left( u_i + D_i + K_i^{(k)} x_i \right)^T R_i K_i^{(k+1)} x_i \right] d\tau.$

Theorem 2.10 ([12]). For any $1 \le i \le N$, suppose $e_i(t)$ satisfies the PE condition such that (2.21) has a unique solution $P_i^{(k)} = (P_i^{(k)})^T$ and $K_i^{(k+1)}$, with $Q_i \ge \left( \gamma_i^{-1} + 1 \right) C_i^T C_i + \gamma_i^{-1} \epsilon_i I_{n_i}$ and $R_i^{-1} > d_i^2 I_{m_i}$. Also assume

(2.22)    $\sum_{j=1}^{N-1} \; \sum_{1 \le i_1 < i_2 < \cdots < i_{j+1} \le N} \gamma_{i_1} \gamma_{i_2} \cdots \gamma_{i_{j+1}} < 1.$

Then, $\lim_{k \to \infty} K_i^{(k)} = K_i^*$, and the decentralized control policies $u_i = -K_i^* x_i$, $1 \le i \le N$, globally asymptotically stabilize the closed-loop large-scale system (2.19)-(2.20) at the origin.
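Condition (2.22) sums, over every group of two or more subsystems, the product of the corresponding gains $\gamma_i$. A direct numerical check of this condition, for hypothetical gain values, might look as follows:

```python
# Minimal sketch: check the cyclic-small-gain-type condition (2.22)
# for hypothetical subsystem gains gamma_1, ..., gamma_N.
from itertools import combinations
from math import prod

def condition_2_22_holds(gammas):
    """True if the sum, over all index subsets of size >= 2, of the product
    of the selected gains is strictly less than 1."""
    N = len(gammas)
    total = sum(prod(gammas[i] for i in idx)
                for size in range(2, N + 1)
                for idx in combinations(range(N), size))
    return total < 1.0

print(condition_2_22_holds([0.2, 0.3, 0.25]))   # True:  0.06 + 0.05 + 0.075 + 0.015 = 0.2 < 1
print(condition_2_22_holds([0.9, 0.8, 0.7]))    # False: 0.72 + 0.63 + 0.56 + 0.504 > 1
```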
