Tractable Stochastic Predictive Control for Partially Observable Markov Decision Processes with Time-Joint Chance Constraints

Nan Li, Ilya Kolmanovsky, and Anouck Girard*

Abstract— In this paper, we describe a stochastic model predictive control algorithm for finite-space partially observable Markov decision process problems with time-joint chance constraints. We discuss theoretical properties of the algorithm, including an approach to ensuring its recursive feasibility. An example representing path planning for an autonomous car is considered to illustrate the computational tractability of the algorithm.

*This research has been supported by the National Science Foundation under award CNS 1544844 and by NASA under cooperative agreement NNX16AH81A. The authors are with the Department of Aerospace Engineering, University of Michigan, Ann Arbor, MI 48109, USA. {nanli,ilya,anouck}@umich.edu
I. INTRODUCTION

Model predictive control (MPC) has been one of the most popular advanced control techniques. It is typically defined based on the first move of the solution to a finite-horizon open-loop optimal control problem parameterized by the current state as the initial condition. One of the defining features of MPC is its capability to explicitly handle constraints, which has stimulated broad interest in MPC in applications, including in the automotive industry [1].

Stochastic MPC (SMPC) [2] is a variant of MPC that integrates the ideas of receding-horizon optimal control with stochastic modeling and stochastic optimal control. For a dynamic system subject to uncertainties, the solution to a finite-horizon stochastic optimal control problem in the form of an open-loop control input sequence is typically different from that given by a closed-loop feedback policy. The former, in general, provides worse control performance than the latter, but is usually computationally tractable for implementation. The latter assigns possibly different controls to different realizations of the uncertainties over the horizon, but inherits the curse of dimensionality related to the need to solve a stochastic dynamic programming (DP) equation. The former is considered, for instance, in [3], [4], [5]; the latter in [6], [7], [8]. Note that an open-loop optimal control-based SMPC algorithm also implicitly defines a feedback policy, as the finite-horizon optimal control problem is solved at each time step with the current state as the initial condition.

On the other hand, Markov decision processes (MDPs) [9] provide a mathematical framework for solving optimal decision-making problems and have also been rapidly gaining popularity. Typical approaches to solving MDP problems include value iteration, policy iteration, and reinforcement learning. All such approaches attempt to construct optimal feedback policies. Although not required by the theory of MDPs, the input space and state space of an MDP problem are typically assumed to be finite
(otherwise, function approximation techniques are usually exploited). When the state cannot be perfectly observed, an MDP problem generalizes to a partially observable MDP (POMDP) problem. When input/state constraints are included, an MDP problem becomes a constrained MDP (CMDP) problem [10]. The latter two problems are usually considered harder to solve than an MDP problem with perfect observation and without constraints. For example, solving a POMDP typically involves estimating the probability distribution of the state (called the "information state"). Even if the state space is finite, the space of information states may be uncountably infinite. As another example, CMDPs, in general, cannot be solved using iterative algorithms such as value iteration or policy iteration [10], [11].
In this paper, we exploit the computational tractability of open-loop optimal control-based SMPC to treat a class of POMDP problems, namely, POMDPs with finite input and state spaces and with time-joint chance constraints (see (8)). We consider randomized decision rules to obtain an optimization problem with a finite number of continuous decision variables, thereby reducing the computational difficulty inherited from a discrete optimization problem with a finite decision set. Based on randomized decision rules, we explicitly evaluate and tightly enforce a time-joint chance constraint, as opposed to approximately evaluating and conservatively enforcing it, e.g., using risk allocation and Boole's inequality [3], [4], [5], [6]. Although applying randomized decision rules to POMDPs has been discussed, for instance, in [12], [13], we study in this paper the theoretical properties specific to our setting, in particular, when they are applied to a POMDP with a time-joint chance constraint, including the extent to which the resulting continuous optimization problem is equivalent to the original discrete optimization problem; we also develop an approach to ensure recursive feasibility.
This paper is organized as follows. In Section II, we state the problem treated in this paper. In Section III, we present an algorithm to solve the problem. In Section IV, we discuss theoretical properties of the algorithm, including an approach to guaranteeing its recursive feasibility. In Section V, we illustrate the algorithm, and in particular its computational tractability, using an example representing path planning for an autonomous car overtaking a car in front of it. The paper is concluded in Section VI.
II. PROBLEM STATEMENT

We consider optimal control/decision-making based on a time-invariant discrete-time model of the system given by

x^{t+1} = f(x^t, u^t, w^t),   (1)
y^t = g(x^t, v^t),   (2)

where x^t ∈ X = {x_1, ..., x_{n_x}} denotes the state at step t; u^t ∈ U = {u_1, ..., u_{n_u}} denotes the control input at step t; y^t ∈ Y = {y_1, ..., y_{n_y}} denotes the observation at step t; w^t ∈ W = {w_1, ..., w_{n_w}} denotes an unmeasured disturbance at step t; v^t ∈ V = {v_1, ..., v_{n_v}} denotes a measurement noise at step t; and f : X × U × W → X and g : X × V → Y are deterministic maps. We make the following assumptions:

A1: The prior distribution of the initial state x^0,

π^{0|-1} := [π^{0|-1}(x_1), ..., π^{0|-1}(x_{n_x})]^⊤,  π^{0|-1}(x_i) := P(x^0 = x_i),   (3)

is known.

A2: The w^t and v^t are randomly distributed, respectively, over the set W and over the set V, and the conditional distributions

P(w^t | x^t, u^t),  ∀ w^t ∈ W, ∀ x^t ∈ X, ∀ u^t ∈ U,   (4)
P(v^t | x^t),  ∀ v^t ∈ V, ∀ x^t ∈ X,   (5)

are known. In particular, w^t is conditionally independent of all other variables given (x^t, u^t); v^t is conditionally independent of all other variables given x^t. This also yields that the conditional distributions (4) and (5) are independent of the time t.

A3: At step t ∈ N, all historical observations y^τ, τ ∈ {0, ..., t}, and all previously applied controls u^τ, τ ∈ {0, ..., t-1}, are known. We denote all available data at step t by

ξ^t := (y^0, ..., y^t, u^0, ..., u^{t-1}).   (6)

We consider the following receding-horizon optimal control (RHOC) problem:

(P): At step t ∈ N, minimize the cost function

J̄^t := E{ J(x^{t+1}, ..., x^{t+N}, u^t, ..., u^{t+N-1}) | ξ^t },   (7)

with respect to an open-loop control input sequence (u^t, ..., u^{t+N-1}) ∈ U^N, subject to i) the dynamic model given by (1) and (2), and ii) the time-joint chance constraint

P( ∧_{τ=t+1}^{t+N} x^τ ∈ X^τ | ξ^t ) ≥ 1 − ε,  ε ∈ [0, 1],   (8)

where J : X^N × U^N → R is a deterministic function, and X^τ ⊂ X, τ ∈ {t+1, ..., t+N}, are state constraint sets. After the sequence (u^t, ..., u^{t+N-1}) is obtained, we apply the first element u^t as the control input, update the state x^t → x^{t+1}, make the new observation y^{t+1}, and then solve (P) at step t+1.

The optimization problem (P) is well-defined, meaning that either no feasible solution exists, or a minimizer exists. This holds trivially since the decision set for (u^t, ..., u^{t+N-1}), namely U^N, is finite.

We note that a constraint in the form of a set of individual chance constraints,

P( x^τ ∈ X^τ | ξ^t ) ≥ 1 − ε^τ,  ∀ τ ∈ {t+1, ..., t+N},   (9)

is often considered, for instance, in [3], [4], [5], [8], [14]. The constraint (8) is different from (9). In particular, (8) requires the probability that the entire trajectory over {t+1, ..., t+N} satisfies all constraints x^{t+1} ∈ X^{t+1}, ..., x^{t+N} ∈ X^{t+N} to be no less than a threshold 1 − ε, while (9) requires the probability measure of the trajectories that satisfy the constraint x^τ ∈ X^τ to be no less than a threshold at each time step τ ∈ {t+1, ..., t+N}. As a motivating example from a process control standpoint, (8) requires that the success rate of the entire production process be no less than a threshold. Although risk allocation and Boole's inequality may be used to robustly enforce (8) through enforcing (9) with Σ_{τ=t+1}^{t+N} ε^τ ≤ ε ([15], [16]), this approach involves over-bounding and may therefore lead to conservative solutions.

We also note that although the dynamics (1) are Markov, because of the imperfect/partial observation of the state (2), the solution to (P) depends on ξ^t.

III. PROBLEM REFORMULATION

The dynamic system (1) and (2) can be represented by the state transition probabilities

P(x_i | x_j, u_k) := P(x^{t+1} = x_i | x^t = x_j, u^t = u_k) = Σ_{l=1}^{n_w} P(x^{t+1} = x_i | x^t = x_j, u^t = u_k, w^t = w_l) P(w^t = w_l | x^t = x_j, u^t = u_k),   (10)

and the observation probabilities

P(y_m | x_j) := P(y^t = y_m | x^t = x_j) = Σ_{n=1}^{n_v} P(y^t = y_m | x^t = x_j, v^t = v_n) P(v^t = v_n | x^t = x_j),   (11)

defined for all x_i, x_j ∈ X, u_k ∈ U, w_l ∈ W, y_m ∈ Y, and v_n ∈ V, where P(x^{t+1} = x_i | x^t = x_j, u^t = u_k, w^t = w_l) = I(x_i = f(x_j, u_k, w_l)) and P(y^t = y_m | x^t = x_j, v^t = v_n) = I(y_m = g(x_j, v_n)), in which I(·) is the indicator function of (·), taking 1 if (·) is true and 0 otherwise.

A. Posterior state distributions
We can compute the posterior distribution of the state x^t conditioned on the available data ξ^t, namely, the "information state" at step t,

π^{t|t} := [π^{t|t}(x_1), ..., π^{t|t}(x_{n_x})]^⊤,  π^{t|t}(x_i) := P(x^t = x_i | ξ^t),   (12)

using a Bayesian filter [17] in a recursive way:

π^{t|t}(x_i) = [ P(y^t | x_i) Σ_{j=1}^{n_x} P(x_i | x_j, u^{t-1}) π^{t-1|t-1}(x_j) ] / [ Σ_{k=1}^{n_x} P(y^t | x_k) Σ_{j=1}^{n_x} P(x_k | x_j, u^{t-1}) π^{t-1|t-1}(x_j) ],   (13)

where

π^{0|0}(x_i) = P(y^0 | x_i) π^{0|-1}(x_i) / Σ_{k=1}^{n_x} P(y^0 | x_k) π^{0|-1}(x_k).   (14)
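To make the filter concrete, the following is a minimal sketch of the update (13) and the initialization (14) in Python/NumPy. It is ours, not the authors' implementation; the array conventions T[k, j, i] = P(x_i | x_j, u_k) and O[j, m] = P(y_m | x_j) are assumptions used throughout the sketches in this paper.

```python
import numpy as np

def bayes_update(pi_prev, u_idx, y_idx, T, O):
    """One step of the Bayesian filter (13).

    pi_prev : (n_x,) posterior pi^{t-1|t-1}
    u_idx   : index of the control u^{t-1} applied at the previous step
    y_idx   : index of the new observation y^t
    T       : (n_u, n_x, n_x) transitions, T[k, j, i] = P(x_i | x_j, u_k)
    O       : (n_x, n_y) observations, O[j, m] = P(y_m | x_j)
    """
    predicted = T[u_idx].T @ pi_prev  # sum_j P(x_i | x_j, u^{t-1}) pi^{t-1|t-1}(x_j)
    unnorm = O[:, y_idx] * predicted  # multiply by the likelihood P(y^t | x_i)
    return unnorm / unnorm.sum()      # normalize: denominator of (13)

def bayes_init(pi_prior, y_idx, O):
    """Initialization (14) from the prior pi^{0|-1} and the first observation y^0."""
    unnorm = O[:, y_idx] * pi_prior
    return unnorm / unnorm.sum()
```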
B. Randomized decision rules

As the decision set for (u^t, ..., u^{t+N-1}), namely U^N, is finite, problem (P) is a discrete optimization problem. Exhaustive-search approaches, e.g., tree search, may be computationally intractable when the cardinality of U^N is large. Instead, we consider randomized decision rules to obtain an optimization problem with a finite number of continuous decision variables, as follows. Define

γ^t := [γ^t(u_1), ..., γ^t(u_{n_u})]^⊤,  γ^t(u_k) := P(u^t = u_k),   (15)

that is, a probability distribution of selecting inputs over the set U at step t. Given the available data ξ^t and the sequence of control input distributions γ^t, ..., γ^{t+N-1} over the horizon, the predicted state distributions, denoted by π^{τ|t}(x_i) = P(x^τ = x_i | ξ^t), over the horizon can be computed recursively by
π^{τ+1|t}(x_i) = Σ_{j=1}^{n_x} P(x^{τ+1} = x_i | x^τ = x_j, ξ^t) P(x^τ = x_j | ξ^t)   (16)
 = Σ_{j=1}^{n_x} P(x^{τ+1} = x_i | x^τ = x_j) π^{τ|t}(x_j)   (17)
 = Σ_{j=1}^{n_x} Σ_{k=1}^{n_u} P(x_i | x_j, u_k) P(u^τ = u_k | x^τ = x_j) π^{τ|t}(x_j)   (18)
 = Σ_{j=1}^{n_x} Σ_{k=1}^{n_u} P(x_i | x_j, u_k) γ^τ(u_k) π^{τ|t}(x_j).   (19)

Note that we use the Markov property to obtain (17) from (16), and the independence of u^τ from x^τ given γ^τ to obtain (19) from (18). We can write (19) in the compact form

π^{τ+1|t}(x_i) = φ_i(π^{τ|t}, γ^τ) := (γ^τ)^⊤ P(x_i) π^{τ|t},   (20)

or

π^{τ+1|t} = φ(π^{τ|t}, γ^τ) := [φ_1(π^{τ|t}, γ^τ), ..., φ_{n_x}(π^{τ|t}, γ^τ)]^⊤ = ( I_{n_x} ⊗ (γ^τ)^⊤ ) diag( P(x_1), ..., P(x_{n_x}) ) ( 1_{n_x} ⊗ π^{τ|t} ),   (21)

where

P(x_i) := [ P(x_i | x_1, u_1)     ···  P(x_i | x_{n_x}, u_1)
                  ⋮               ⋱            ⋮
            P(x_i | x_1, u_{n_u}) ···  P(x_i | x_{n_x}, u_{n_u}) ].   (22)

It can be seen from (20) or (21) that the predicted state probability distribution π^{τ|t} is propagated through a time-invariant system φ, and the function φ is bilinear in (π^{τ|t}, γ^τ).
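As an illustration of (19)-(21), here is a minimal sketch of the propagation (ours; same array conventions as the filter sketch above):

```python
import numpy as np

def propagate(pi, gamma, T):
    """Open-loop prediction (19): pi^{tau+1|t} = phi(pi^{tau|t}, gamma^tau).

    pi    : (n_x,) predicted distribution pi^{tau|t}
    gamma : (n_u,) randomized decision rule gamma^tau
    T     : (n_u, n_x, n_x), T[k, j, i] = P(x_i | x_j, u_k)
    """
    # Mix the transition matrices under gamma^tau, then push pi through.
    # The result is linear in gamma (through T_avg) and linear in pi:
    # the bilinearity of phi noted after (21).
    T_avg = np.tensordot(gamma, T, axes=1)  # (n_x, n_x): sum_k gamma(u_k) T[k]
    return T_avg.T @ pi
```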
C. Cost function evaluation

The cost function (7) can be represented as

J̄^t = Σ_{x^{τ+1} ∈ X, u^τ ∈ U, τ ∈ {t, ..., t+N-1}} J(x^{t+1}, ..., x^{t+N}, u^t, ..., u^{t+N-1}) P(x^{t+1}, ..., x^{t+N}, u^t, ..., u^{t+N-1} | ξ^t),   (23)

where

P(x^{t+1}, ..., x^{t+N}, u^t, ..., u^{t+N-1} | ξ^t) = P(x^{t+N} | x^{t+N-1}, u^{t+N-1}) · ... · P(x^{t+2} | x^{t+1}, u^{t+1}) [ Σ_{x^t ∈ X} P(x^{t+1} | x^t, u^t) π^{t|t}(x^t) ] · Π_{τ=t}^{t+N-1} γ^τ(u^τ),   (24)

where we use the Markov property, the causality property, and the independence of u^τ from x^τ given γ^τ to obtain (24). It can be observed from (23), (24) that J̄^t is separable [18], i.e., it depends on ξ^t only through the information state π^{t|t}.

Proposition 1: Given ξ^t, J̄^t is a smooth function of (γ^t, ..., γ^{t+N-1}).

Proof: The proof follows from (23) and (24). ∎
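For intuition, consider the special case of a stage-additive, state-only cost (as in the example of Section V). Then (23) reduces to a sum of expectations under the predicted distributions (19), as in the following sketch (ours, not the authors' code; the general trajectory-dependent cost of (23) would instead require enumeration):

```python
import numpy as np

def expected_additive_cost(pi_tt, gammas, T, stage_costs):
    """Expected cost (23) when J = sum_{tau=t+1}^{t+N} c(x^tau);
    then (23) collapses to sum_tau <stage cost, pi^{tau|t}>.

    pi_tt       : (n_x,) information state pi^{t|t}
    gammas      : list of N arrays (n_u,), the decision rules gamma^tau
    T           : (n_u, n_x, n_x), T[k, j, i] = P(x_i | x_j, u_k)
    stage_costs : list of N arrays (n_x,), cost of each state at steps t+1..t+N
    """
    J, pi = 0.0, np.asarray(pi_tt, dtype=float)
    for gamma, c in zip(gammas, stage_costs):
        T_avg = np.tensordot(gamma, T, axes=1)  # mix transitions under gamma^tau
        pi = T_avg.T @ pi                       # pi^{tau+1|t} via (19)
        J += float(c @ pi)
    return J
```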
D. Time-joint chance constraint evaluation

To explicitly evaluate the time-joint chance constraint (8), we denote by

V^{τ|t} := { ∨_{σ=t+1}^{τ} x^σ ∉ X^σ },  τ ∈ {t+1, ..., t+N},   (25)

the event that at least one constraint violation occurs between step t+1 and step τ. The event V̄^{τ|t} indicates the negation/complement of the event V^{τ|t}, i.e.,

V̄^{τ|t} = { ∧_{σ=t+1}^{τ} x^σ ∈ X^σ }.   (26)

Similarly, we use X̄^τ to denote the complement of the state constraint set X^τ, i.e., X̄^τ = X \ X^τ. The constraint (8) can be represented as

P(V^{t+N|t} | ξ^t) ≤ ε.   (27)

Theorem 1: The probabilities P(V^{τ|t} | ξ^t), τ ∈ {t+1, ..., t+N}, can be computed through the recursion

P(V^{τ|t} | ξ^t) = P(V^{τ-1|t} | ξ^t) + Σ_{x^τ ∈ X̄^τ} P(x^τ ∧ V̄^{τ-1|t} | ξ^t),   (28)

P(x^τ ∧ V̄^{τ-1|t} | ξ^t) = Σ_{x^{τ-1} ∈ X} P_{τ-1}(x^τ | x^{τ-1}) P(x^{τ-1} ∧ V̄^{τ-1|t} | ξ^t),   (29)

P(x^τ ∧ V̄^{τ|t} | ξ^t) = { P(x^τ ∧ V̄^{τ-1|t} | ξ^t), if x^τ ∈ X^τ,
                           0,                         if x^τ ∈ X̄^τ,   (30)

with P(V^{t|t} | ξ^t) = 0 and P(x^t ∧ V̄^{t|t} | ξ^t) = P(x^t | ξ^t) = π^{t|t}(x^t), where

P_{τ-1}(x^τ | x^{τ-1}) := Σ_{k=1}^{n_u} γ^{τ-1}(u_k) P(x^τ | x^{τ-1}, u_k).   (31)
τ τ |t shown above; for each xτ ∈ X , P xτ ∧ V ξ t = 0 is trivially affine in γ σ . This completes the induction step, and in turn completes the proof of (i). Then, (ii) follows from (i).
Proof: P V τ |t ξ t = τ −1|t t ξ P(V τ |t V τ −1|t , ξ t P V τ −1|t ξ t + P V τ |t ∧ V | {z } =1 X τ −1|t t τ −1|t t ξ P xτ ∧ V τ |t ∧ V =P V ξ +
E. Problem reformulation in probability space
xτ ∈X
=P V
τ −1|t t
ξ
+
X
xτ ∈X
P xτ ∧ V
τ −1|t t
ξ
=
τ
P x ∧V
τ −1|t t
ξ ,
τ
X
P xτ ∧ xτ −1 ∧ V
τ −1|t t
ξ
xτ −1 ∈X
=
X
τ −1|t τ −1|t t ξ , ξ t P xτ −1 ∧ V P xτ xτ −1 , V
(32)
xτ −1 ∈X
=
X
τ −1|t t ξ , Pτ −1 xτ xτ −1 P xτ −1 ∧ V
(33)
xτ −1 ∈X τ
P x ∧V
τ |t t
ξ
τ |t τ −1|t t ξ ∧ V τ −1|t ξ t +P xτ ∧ V ∧V =P x ∧V {z } | =0 τ −1|t t ξ if xτ ∈ X τ , P xτ ∧ V = τ 0, if xτ ∈ X , τ
τ |t
where we use the Markov property to obtain (33) from (32). t+N |t t It can be observed from Theorem 1 that P V ξ is t+N |t t|t t+N |t t π . separable, i.e., P V ξ =P V Proposition 2: (i) Given ξ t , P V t+N |t ξ t is a multiσ affine function of γ t , · · · , γ t+N −1 , i.e., affine in γ , for t+N |t t ξ is a smooth each σ ∈ {t,· · · , t + N − 1}. (ii) P V function of γ t , · · · , γ t+N −1 . Proof: We prove (i) by induction: t|t Basis of induction: P V t|t ξ t = 0 and P xt ∧ V ξ t = π t|t (xt ) are independent thus trivially affine in γ σ , σ ∈ {t, · · · , t + N − 1}. Induction hypothesis: P V τ −1|t ξ t and P xτ −1 ∧ τ −1|t t ξ for all xτ −1 ∈ X are affine in γ σ , σ ∈ {t, · · · , t+ V N − 1}. Induction step: To show that P V τ |t ξ t and P xτ ∧ τ |t V ξ t for all xτ ∈ X are affine in γ σ , σ ∈ {t, · · · , t + N − 1}, it suffices to consider σ ∈ {t, · · · , τ − 1} as, by τ |t causality, P V τ |t ξ t and P xτ ∧ V ξ t are independent of γ σ , σ ∈ {τ, · · · , t + N − 1}. τ τ −1 For each ∈ X , if σ = τ − 1, by (31), pair of x , x τ τ −1 τ −1 is linear in γ σ , and by causality, P xτ −1 ∧ x x P τ −1|t t ξ is independent of γ σ , thus, by (29), P xτ ∧ V τ −1|t t ξ is linear in γ σ ; if σ < τ − 1, Pτ −1 xτ xτ −1 is V independent of γ σ , and by induction hypothesis, P xτ −1 ∧ τ −1|t t ξ is affine in γ σ , thus, by (29), P xτ ∧ V τ −1|t ξ t V is affine in γ σ . in γ σ , By induction hypothesis, P V τ −1|t ξ t is affine τ |t t σ ∈ {t, · · · , τ − 1}. Then, by (28), P V ξ is affine in γ σ , σ ∈ {t, · · · , τ − 1}. τ |t For each xτ ∈ X τ , by (30), P xτ ∧ V ξ t = P xτ ∧ τ −1|t t ξ is affine in γ σ , σ ∈ {t, · · · , τ − 1}, as has been V
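For illustration, the recursion of Theorem 1 can be implemented in a few lines (a sketch, ours; safe_masks is an assumed encoding of membership in X^{t+1}, ..., X^{t+N}):

```python
import numpy as np

def time_joint_risk(pi_tt, gammas, T, safe_masks):
    """Evaluate P(V^{t+N|t} | xi^t) by the recursion (28)-(31).

    pi_tt      : (n_x,) information state pi^{t|t}
    gammas     : list of N arrays (n_u,), gamma^t, ..., gamma^{t+N-1}
    T          : (n_u, n_x, n_x), T[k, j, i] = P(x_i | x_j, u_k)
    safe_masks : list of N boolean arrays (n_x,); True where x_i is in the
                 constraint set X^{t+1}, ..., X^{t+N}, respectively
    """
    risk = 0.0                                      # P(V^{t|t} | xi^t) = 0
    joint = np.asarray(pi_tt, dtype=float).copy()   # P(x^t and no violation yet)
    for gamma, safe in zip(gammas, safe_masks):
        T_avg = np.tensordot(gamma, T, axes=1)      # P_tau(. | .) of (31)
        joint = T_avg.T @ joint                     # eq. (29)
        risk += joint[~safe].sum()                  # eq. (28): first violations
        joint[~safe] = 0.0                          # eq. (30): keep survivors only
    return risk
```

Note that each pass through the loop is affine in the corresponding γ^τ, which is exactly the multi-affinity asserted in Proposition 2(i).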
E. Problem reformulation in probability space

We reformulate the RHOC problem as:

(Q): At step t ∈ N, minimize the cost function

J̄^t = E{ J(x^{t+1}, ..., x^{t+N}, u^t, ..., u^{t+N-1}) | ξ^t },   (34)

with respect to Γ^t := [γ^t, ..., γ^{t+N-1}] ∈ R^{n_u × N}, subject to i) the dynamic model (21), and ii) the constraints

P(V^{t+N|t} | ξ^t) ≤ ε,   (35a)
0 ≤ Γ^t_{ij} ≤ 1,  ∀ i ∈ {1, ..., n_u}, ∀ j ∈ {1, ..., N},   (35b)
Σ_{i=1}^{n_u} Γ^t_{ij} = 1,  ∀ j ∈ {1, ..., N}.   (35c)
Hence, we transform the discrete optimization problem (P) into the continuous optimization problem (Q). We can write the constraints (35) in the compact form

g^t := [ P(V^{t+N|t} | ξ^t); vec(Γ^t); −vec(Γ^t) ] ≤ [ ε; 1_{n_u N}; 0_{n_u N} ],
h^t := [ Σ_{i=1}^{n_u} Γ^t_{i1}; ...; Σ_{i=1}^{n_u} Γ^t_{iN} ] = 1_N.   (36)

IV. THEORETICAL PROPERTIES OF (Q)

We now discuss theoretical properties of problem (Q), including the relationship between problems (P) and (Q).

A. General properties

Proposition 3: The optimization problem (Q) is well-defined, meaning that it either does not have a feasible solution, or attains its infimum.

Proof: By Proposition 2(ii), P(V^{t+N|t} | ξ^t) is a continuous function of Γ^t. Then, the preimage of [0, ε] under P(V^{t+N|t} | ξ^t) is closed. Thus, the set defined by (35), denoted by Λ^t ⊂ R^{n_u N}, is closed and bounded, and thus compact. By Proposition 1, J̄^t is a continuous function of Γ^t. Then, by the Weierstrass extreme value theorem, J̄^t attains its infimum on Λ^t if Λ^t ≠ ∅. ∎

Propositions 1-3 assert that the optimization problem (Q) is a standard nonlinear programming (NLP) problem with differentiable cost and constraint functions. Therefore, standard NLP solvers can be used to solve (Q). In the numerical example reported in this paper, we use the Matlab fmincon() function; a solver-agnostic sketch is given below.
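As a sketch of how (Q) can be posed to such a solver (ours, using SciPy's SLSQP as a stand-in for fmincon; solve_Q, cost_fn, and risk_fn are illustrative names):

```python
import numpy as np
from scipy.optimize import minimize

def solve_Q(cost_fn, risk_fn, n_u, N, eps):
    """Solve (Q) with a generic NLP solver. cost_fn and risk_fn evaluate the
    expected cost (34) and the time-joint risk P(V^{t+N|t} | xi^t) for a
    matrix Gamma of shape (n_u, N), e.g., using the sketches above."""
    unpack = lambda z: z.reshape(n_u, N)
    cons = [
        {"type": "ineq", "fun": lambda z: eps - risk_fn(unpack(z))},   # (35a)
        {"type": "eq",   "fun": lambda z: unpack(z).sum(axis=0) - 1},  # (35c)
    ]
    bounds = [(0.0, 1.0)] * (n_u * N)                                  # (35b)
    z0 = np.full(n_u * N, 1.0 / n_u)    # uniform randomized rules as a start
    res = minimize(lambda z: cost_fn(unpack(z)), z0, method="SLSQP",
                   bounds=bounds, constraints=cons)
    return unpack(res.x), res
```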
We now discuss the relationship between (P) and (Q):

Proposition 4: (i) Problem (P) has feasible solutions if and only if (Q) has feasible solutions. (ii) A feasible solution to one can be obtained from a feasible solution to the other in polynomial time.

Proof: If (P) has a feasible solution (u^t, ..., u^{t+N-1}), then Γ̂^t = [γ̂^t, ..., γ̂^{t+N-1}], where γ̂^τ, τ ∈ {t, ..., t+N-1}, is such that γ̂^τ(u^τ) = 1 and all other entries are 0 (we call such a distribution a zero-one distribution), is a feasible solution to (Q).

If (Q) has a feasible solution Γ^t = [γ^t, ..., γ^{t+N-1}], then it can be shown that there exists Γ̂^t = [γ̂^t, ..., γ̂^{t+N-1}], where γ̂^τ, τ ∈ {t, ..., t+N-1}, is a zero-one distribution, such that Γ̂^t is also a feasible solution to (Q) (we call such a solution a zero-one solution). We show this by constructing a polynomial-time algorithm, relying on Proposition 2(i), that finds a zero-one feasible solution Γ̂^t from an arbitrary feasible solution Γ^t.

Let Γ^t = [γ^t, ..., γ^{t+N-1}] be a feasible solution to (Q). We denote by P̄(V^{t+N|t} | ξ^t) the value of P(V^{t+N|t} | ξ^t) evaluated at Γ^t and define the function P̃_σ(V^{t+N|t} | ξ^t) of γ^σ induced from P(V^{t+N|t} | ξ^t) by substituting in γ^t, ..., γ^{σ-1}, γ^{σ+1}, ..., γ^{t+N-1}. We have

ε ≥ P̄(V^{t+N|t} | ξ^t) ≥ min_{γ^σ} P̃_σ(V^{t+N|t} | ξ^t).   (37)

By Proposition 2(i), P̃_σ(V^{t+N|t} | ξ^t) is an affine function of γ^σ. In turn, the minimization problem

min_{γ^σ} P̃_σ(V^{t+N|t} | ξ^t),   (38)

subject to

Σ_{i=1}^{n_u} γ^σ(u_i) = 1,  0 ≤ γ^σ(u_i) ≤ 1,  ∀ i ∈ {1, ..., n_u},   (39)

is a linear programming (LP) problem. The admissible set defined by (39) is a probability (n_u − 1)-simplex [19]. In such a case, there exists a zero-one distribution, denoted by γ̂^σ, such that [20]

γ̂^σ ∈ argmin_{γ^σ} P̃_σ(V^{t+N|t} | ξ^t).   (40)

By (37), Γ̂^t_σ := [γ^t, ..., γ^{σ-1}, γ̂^σ, γ^{σ+1}, ..., γ^{t+N-1}] is a feasible solution to (Q).

Then we can treat Γ̂^t_σ as an arbitrary feasible solution to (Q) and repeat the above procedure with a different selection of σ. For instance, we can first let σ = t and generate the feasible solution Γ̂^t_t to (Q) with γ̂^t being a zero-one distribution. We then let σ ← σ + 1 and generate Γ̂^t_{t,t+1} := [γ̂^t, γ̂^{t+1}, γ^{t+2}, ..., γ^{t+N-1}]. We recursively execute this procedure for all σ ∈ {t, ..., t+N-1}, solving N LP problems ((38) subject to (39)) sequentially, after which we obtain a zero-one feasible solution Γ̂^t = Γ̂^t_{t,...,t+N-1} to (Q).

This algorithm finds a zero-one feasible solution Γ̂^t from an arbitrary feasible solution Γ^t. Then, Γ̂^t corresponds to a control input sequence (u^t, ..., u^{t+N-1}) with probability 1, and (u^t, ..., u^{t+N-1}) is a feasible solution to (P). This completes the proof of (i). Then, (ii) follows from the fact that LP problems can be solved in polynomial time, for instance, using Karmarkar's algorithm [21], and therefore the above algorithm is a polynomial-time algorithm. ∎
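The constructive argument above can be sketched as follows (ours, not the authors' code; since the feasible set (39) is the probability simplex, each LP (38)-(39) attains its minimum at a vertex, so the sketch evaluates the n_u zero-one vertices directly instead of calling an LP solver):

```python
import numpy as np

def round_to_zero_one(Gamma, risk_fn, eps):
    """Sequential rounding from the proof of Proposition 4.

    Gamma   : (n_u, N) feasible solution to (Q)
    risk_fn : evaluates P(V^{t+N|t} | xi^t) for a (n_u, N) matrix,
              e.g., time_joint_risk above with the other data fixed
    """
    n_u, N = Gamma.shape
    G = Gamma.astype(float).copy()
    for sigma in range(N):
        risks = []
        for k in range(n_u):                # candidate vertex e_k, cf. (40)
            G[:, sigma] = 0.0
            G[k, sigma] = 1.0
            risks.append(risk_fn(G))
        k_best = int(np.argmin(risks))
        G[:, sigma] = 0.0
        G[k_best, sigma] = 1.0              # fix gamma^sigma at the best vertex
        assert risks[k_best] <= eps + 1e-9  # feasibility preserved, by (37)
    return G
```

Each σ requires n_u risk evaluations, so the rounding costs n_u·N evaluations in total, consistent with the polynomial-time claim; note that it preserves feasibility, not optimality, exactly as in Proposition 4.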
B. Recursive feasibility

The optimization problem (Q) is solved in a receding-horizon manner. Recursive feasibility is a desired property and is discussed in this section. For problem (Q), feasibility at t means that for the "initial condition" π^{t|t}, there exists Γ^t such that the constraints (35) are satisfied over the horizon of length N. The difficulty in establishing recursive feasibility lies in the fact that the "initial condition" at the next step, π^{t+1|t+1}, is stochastic. Recursive feasibility requires that (Q) be feasible at t+1 for all possible realizations of π^{t+1|t+1}. Guaranteeing recursive feasibility usually involves introducing additional constraints into the optimization problem. Similar to [22], [23], we augment (Q) with a first-step constraint:

(π^{t|t}, γ^t) ∈ Λ^0(ε),   (41)

where

Λ^0(ε) := { (π^{0|0}, γ^0) : ∃ γ^1, ..., γ^{N-1} s.t. P( ∧_{τ=1}^{N} (x^τ ∈ X^∞) | π^{0|0} ) ≥ 1 − ε; and ∀ u^0 ∈ U with γ^0(u^0) > 0, ∀ y^1 ∈ Y with P(y^1 | π^{0|0}, γ^0) > 0, π^{1|1}(π^{0|0}, u^0, y^1) ∈ Proj_π(Λ^0(ε)) },   (42)

in which π^{0|0} and γ^τ, τ ∈ {0, ..., N-1}, are probability distributions, i.e., they satisfy, respectively, π^{0|0}(x_i) ∈ [0, 1], i ∈ {1, ..., n_x}, Σ_{i=1}^{n_x} π^{0|0}(x_i) = 1, and γ^τ(u_k) ∈ [0, 1], k ∈ {1, ..., n_u}, Σ_{k=1}^{n_u} γ^τ(u_k) = 1; X^∞ ⊆ ∩_{t=1}^{∞} X^t; and Proj_π(·) represents the projection onto the first n_x coordinates. Furthermore,

P(y^1 | π^{0|0}, γ^0) = Σ_{x^1 ∈ X} Σ_{x^0 ∈ X} Σ_{u^0 ∈ U} P(y^1, x^1, x^0, u^0 | π^{0|0}, γ^0) = Σ_{x^1 ∈ X} Σ_{x^0 ∈ X} Σ_{u^0 ∈ U} P(y^1 | x^1) P(x^1 | x^0, u^0) π^{0|0}(x^0) γ^0(u^0),

where P(x^1 | x^0, u^0) and P(y^1 | x^1) are defined in (10) and (11); the function π^{1|1}(π^{0|0}, u^0, y^1) is defined by (13).

Theorem 2: If π^{0|0} ∈ Proj_π(Λ^0(ε)), then (Q), augmented with (41), is feasible for all t ∈ N almost surely.

Proof: Suppose π^{t|t} ∈ Proj_π(Λ^0(ε)). Then, there exists (γ^t, ..., γ^{t+N-1}) such that i) the augmented constraint (41) is satisfied by (π^{t|t}, γ^t), and ii) the time-joint chance constraint (35a) is satisfied by (γ^t, ..., γ^{t+N-1}). Thus, (Q) is feasible at t. Furthermore, any γ^t satisfying (41) guarantees that for every realization of u^t with positive probability based on γ^t, and for every realization of y^{t+1} with positive probability based on (π^{t|t}, γ^t), the resulting π^{t+1|t+1} ∈ Proj_π(Λ^0(ε)). Thus, by induction, (Q) is feasible for all τ ≥ t almost surely. Theorem 2 follows. ∎

Similar to the algorithm for determining the disturbance invariant sets of discrete-time linear systems in [24] and the algorithm for approximating the robust controlled invariant sets of linear systems in [25], the set Λ^0(ε) can be outer-approximated through the following conceptual algorithm:

Algorithm 1 (Offline):
1) Set t = 0 and
Λ_0(ε) = { (π^{0|0}, γ^0) : ∃ γ^1, ..., γ^{N-1} s.t. P( ∧_{τ=1}^{N} (x^τ ∈ X^∞) | π^{0|0} ) ≥ 1 − ε }.   (43)
2) Set t ← t + 1.
3) Set
Λ̄_t(ε) = { (π^{0|0}, γ^0) : ∀ (u^0, y^1) ∈ U × Y with P(u^0, y^1 | π^{0|0}, γ^0) > 0, π^{1|1}(π^{0|0}, u^0, y^1) ∈ Proj_π(Λ_{t-1}(ε)) }.   (44)
4) Set Λ_t(ε) = Λ_0(ε) ∩ Λ̄_t(ε).
5) If t = t_max, output Λ̃^0(ε) = Λ_{t_max}(ε) and stop; otherwise, return to Step 2.

Theorem 3: The sequence of sets Λ_t(ε) generated by Algorithm 1 converges to Λ^0(ε) from above.

Proof: We first show, by induction, that the sequence of sets Λ_t(ε) generated by Algorithm 1 is nonincreasing.

Basis of induction: Λ_1(ε) = Λ_0(ε) ∩ Λ̄_1(ε) ⊆ Λ_0(ε).

Induction hypothesis: Λ_t(ε) ⊆ Λ_{t-1}(ε).

Induction step: We show that Λ_{t+1}(ε) ⊆ Λ_t(ε). By the induction hypothesis, Λ_t(ε) ⊆ Λ_{t-1}(ε); then Proj_π(Λ_t(ε)) ⊆ Proj_π(Λ_{t-1}(ε)). Thus, by (44), Λ̄_{t+1}(ε) ⊆ Λ̄_t(ε). Then, Λ_{t+1}(ε) = Λ_0(ε) ∩ Λ̄_{t+1}(ε) ⊆ Λ_0(ε) ∩ Λ̄_t(ε) = Λ_t(ε).

Therefore, the sequence of sets Λ_t(ε) is nonincreasing and thus converges to Λ_∞(ε) := ∩_{t∈N} Λ_t(ε). We now show that Λ_∞(ε) = Λ^0(ε). The set Λ_∞(ε) satisfies Λ_∞(ε) = Λ_0(ε) ∩ Λ̄_∞(ε), where

Λ̄_∞(ε) = { (π^{0|0}, γ^0) : ∀ (u^0, y^1) ∈ U × Y with P(u^0, y^1 | π^{0|0}, γ^0) > 0, π^{1|1}(π^{0|0}, u^0, y^1) ∈ Proj_π(Λ_∞(ε)) }.   (45)

Then, it can be seen that Λ_∞(ε) = Λ_0(ε) ∩ Λ̄_∞(ε) meets the definition of Λ^0(ε) in (42). ∎

The computational difficulty of Algorithm 1 lies mainly in the construction of Λ_0(ε). In practice, a subset Λ̃^0(ε) ⊆ Λ^0(ε) that is easier to compute for the specific problem under consideration may be used to start the algorithm.

V. PATH PLANNING FOR AUTONOMOUS OVERTAKING

To illustrate the algorithm, we consider an example representing path planning for an autonomous car driving on a highway (car 2) to overtake a car in front of it (car 1). We use SI units in this example. We take the model from [26] to represent the car dynamics. We let ∆x(t) = x^2(t) − x^1(t) designate the relative position and ∆v_x(t) = v_x^2(t) − v_x^1(t) the relative velocity in the longitudinal direction, and let ∆y(t) = y^2(t) − y^1(t) designate the relative position in the lateral direction. We assume that car 1, which is to be overtaken, stays in its own lane over the entire overtaking process, i.e., v_y^1(t) ≡ 0. Then, we obtain the model (46) in the relative coordinates,

[ ∆x(t+1)   ]   [ 1  ∆t  0 ] [ ∆x(t)   ]   [ 0  ]             [ 0  ]             [ 0  ]
[ ∆v_x(t+1) ] = [ 0  1   0 ] [ ∆v_x(t) ] + [ ∆t ] a_x^2(t)  −  [ ∆t ] a_x^1(t)  + [ 0  ] v_y^2(t),   (46)
[ ∆y(t+1)   ]   [ 0  0   1 ] [ ∆y(t)   ]   [ 0  ]             [ 0  ]             [ ∆t ]
where ∆t = 1 represents the sampling period. We model car 1's longitudinal acceleration a_x^1(t) by the probabilities P(a_x^1(t) = {2, 0, −2}) = {0.2, 0.6, 0.2} for all t ∈ N. Alternatively, the profile of a_x^1(t) may be modeled using Markov chains [27]. Car 2 takes maneuver actions based on (a_x^2(t), v_y^2(t)) ∈ U = { (0, 0) [maintain], (2, 0) [accelerate], (−2, 0) [decelerate], (0, 1.6) [move left], (0, −1.6) [move right] } for all t ∈ N. We assume that car 2 can measure the states of (46), but the measurements are corrupted by noise. In particular, (∆x̃(t), ∆ṽ_x(t), ∆ỹ(t)) = (∆x(t) + δ_x(t), ∆v_x(t) + δ_v(t), ∆y(t)), where (δ_x(t), δ_v(t)) ∈ {1, 0, −1} × {1, 0, −1}, taking the value (0, 0) with probability 0.6 and each of the other eight values with probability 0.05.

The objective of car 2 is to overtake car 1 and then return to its original lane as quickly as it can. This objective is represented by the following cost function to be minimized,

J̄^t = E{ Σ_{τ=t+1}^{t+N} ( −8 ∆x(τ) + ∆y(τ) ) | ξ^t },

where we choose the planning horizon N = 5. Furthermore, car 2 should keep a reasonable distance from car 1 to maintain safety. Thus, we consider the safety set

Ω := { (∆x, ∆v_x, ∆y) : |∆x| > 10 ∨ ∆y > 1.6 }.

In particular, we enforce a time-joint chance constraint over the planning horizon,

P( ∧_{τ=t+1}^{t+N} (∆x(τ), ∆v_x(τ), ∆y(τ)) ∈ Ω | ξ^t ) ≥ 1 − ε.

The initial condition is (∆x(0), ∆v_x(0), ∆y(0)) = (−20, 0, 0) and is assumed to be known, and the terminal condition indicating the completion of the overtaking is ∆x(t) ≥ 20. We bound the state variables by ∆x ∈ [−20, 20], ∆v_x ∈ [−8, 8], and ∆y ∈ [0, 3.2], and thus have a finite state space with n_x = ((20 − (−20))/2 + 1) × ((8 − (−8))/2 + 1) × (3.2/1.6 + 1) = 567. Note that card(U^N) = 5^5 = 3125, which makes an exhaustive search computationally demanding. Instead, we consider randomized decision rules and solve the problem in the form of (Q) using our algorithm; a sketch of the corresponding finite-state model construction is given below.
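The following is a minimal sketch (ours, with illustrative helper names) of casting (46) with ∆t = 1 as a finite-state model of the form (10). The grid, the saturation of the state at its bounds, and the lane-index encoding of ∆y are our modeling assumptions, not specified in this form in the text.

```python
import itertools
import numpy as np

DX = list(range(-20, 21, 2))            # Delta x in [-20, 20]
DVX = list(range(-8, 9, 2))             # Delta v_x in [-8, 8]
LANE = [0, 1, 2]                        # Delta y = 1.6 * lane index, in [0, 3.2]
STATES = list(itertools.product(DX, DVX, LANE))   # n_x = 21 * 9 * 3 = 567
IDX = {s: i for i, s in enumerate(STATES)}
ACTIONS = [(0, 0), (2, 0), (-2, 0), (0, +1), (0, -1)]  # (a_x^2, lane step of 1.6 m)
A1 = [(2, 0.2), (0, 0.6), (-2, 0.2)]    # car 1's acceleration a_x^1(t) and its pmf

def clip(v, lo, hi):
    return max(lo, min(hi, v))

n_x, n_u = len(STATES), len(ACTIONS)
T = np.zeros((n_u, n_x, n_x))           # T[k, j, i] = P(x_i | x_j, u_k), cf. (10)
for k, (ax2, dlane) in enumerate(ACTIONS):
    for j, (dx, dvx, lane) in enumerate(STATES):
        for a1, p in A1:                # marginalize the disturbance, as in (10)
            nxt = (clip(dx + dvx, -20, 20),
                   clip(dvx + ax2 - a1, -8, 8),
                   clip(lane + dlane, 0, 2))
            T[k, j, IDX[nxt]] += p
```

The observation probabilities (11) can be assembled analogously from the noise distribution of (δ_x, δ_v) given above.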
We test two different cases, ε = 0.05 and ε = 0. The simulation results are shown in Fig. 1 and summarized in Table I. The simulations are performed on the Matlab R2016a platform using an Intel Core i7-4790 3.60 GHz PC with Windows 10 and 16.0 GB of RAM. The computation times are calculated using the Matlab tic-toc command. Our algorithm is computationally tractable in solving this path planning problem although the formulated POMDP model has a large state space (n_x = 567).

[Fig. 1 about here: two panels, (a) and (b), plotting ∆x, ∆v_x, and ∆y versus t.]

Fig. 1: Simulation results corresponding to (a) ε = 0.05 and (b) ε = 0 over 50 simulation runs. The red rectangles represent the constraints (the complement of the safety set Ω). The blue curves represent the trajectories of (46). In (a), as the constraints are enforced with probability less than 1, there are a few constraint violations. In (b), as the constraints are enforced with probability 1, there are no constraint violations.

TABLE I: Summary of results

                                             ε = 0.05    ε = 0
Aver. computation time per step [s]          0.9574      2.3236
Worst computation time per step [s]          1.2764      4.5329
Aver. # of constraint violations per run     0.048       0
VI. CONCLUSIONS

In this paper, we presented an SMPC algorithm for finite-space POMDP problems with time-joint chance constraints and discussed several of its theoretical properties. We used an example representing path planning for an autonomous car to illustrate the algorithm. Constraint satisfaction properties in closed-loop operation remain to be investigated, and implementation methods for computing the first-step constraint set that guarantees recursive feasibility remain to be developed.

REFERENCES

[1] D. Hrovat, S. Di Cairano, H. E. Tseng, and I. V. Kolmanovsky, "The development of model predictive control in automotive industry: A survey," in Control Applications (CCA), IEEE International Conference on, 2012, pp. 295–302.
[2] A. Mesbah, "Stochastic model predictive control: An overview and perspectives for future research," IEEE Control Systems, vol. 36, no. 6, pp. 30–44, 2016.
[3] M. Cannon, B. Kouvaritakis, S. V. Raković, and Q. Cheng, "Stochastic tubes in model predictive control with probabilistic constraints," IEEE Transactions on Automatic Control, vol. 56, no. 1, pp. 194–200, 2011.
[4] M. Cannon, Q. Cheng, B. Kouvaritakis, and S. V. Raković, "Stochastic tube MPC with state estimation," Automatica, vol. 48, no. 3, pp. 536–541, 2012.
[5] G. Schildbach, L. Fagiano, C. Frei, and M. Morari, "The scenario approach for stochastic model predictive control with bounds on closed-loop constraint violations," Automatica, vol. 50, no. 12, pp. 3009–3018, 2014.
[6] F. Oldewurtel, C. N. Jones, and M. Morari, "A tractable approximation of chance constrained stochastic MPC based on affine disturbance feedback," in Decision and Control (CDC), 47th IEEE Conference on, 2008, pp. 4731–4736.
[7] M. Ono, M. Pavone, Y. Kuwata, and J. Balaram, "Chance-constrained dynamic programming with application to risk-aware robotic space exploration," Autonomous Robots, vol. 39, no. 4, pp. 555–571, 2015.
[8] M. A. Sehr and R. R. Bitmead, "Stochastic model predictive control: Output-feedback, duality and guaranteed performance," arXiv preprint arXiv:1706.00733, 2017.
[9] E. A. Feinberg and A. Shwartz, Handbook of Markov Decision Processes: Methods and Applications. Springer Science & Business Media, 2012, vol. 40.
[10] E. Altman, Constrained Markov Decision Processes. CRC Press, 1999, vol. 7.
[11] S. Feyzabadi and S. Carpin, "Risk-aware path planning using hierarchical constrained Markov decision processes," in Automation Science and Engineering (CASE), IEEE International Conference on, 2014, pp. 297–303.
[12] Z. Sunberg, S. Chakravorty, and R. S. Erwin, "Information space receding horizon control," IEEE Transactions on Cybernetics, vol. 43, no. 6, pp. 2255–2260, 2013.
[13] ——, "Information space receding horizon control for multisensor tasking problems," IEEE Transactions on Cybernetics, vol. 46, no. 6, pp. 1325–1336, 2016.
[14] A. T. Schwarm and M. Nikolaou, "Chance-constrained model predictive control," AIChE Journal, vol. 45, no. 8, pp. 1743–1752, 1999.
[15] L. Blackmore and M. Ono, "Convex chance constrained predictive control without sampling," in Proceedings of the AIAA Guidance, Navigation and Control Conference, 2009, pp. 7–21.
[16] J. A. Paulson, E. A. Buehler, R. D. Braatz, and A. Mesbah, "Stochastic model predictive control with joint chance constraints," International Journal of Control, pp. 1–14, 2017.
[17] S. Särkkä, Bayesian Filtering and Smoothing. Cambridge University Press, 2013, vol. 3.
[18] P. R. Kumar and P. Varaiya, Stochastic Systems: Estimation, Identification, and Adaptive Control. SIAM, 2015.
[19] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[20] J. Lee, A First Course in Combinatorial Optimization. Cambridge University Press, 2004, vol. 36.
[21] I. Adler, M. G. Resende, G. Veiga, and N. Karmarkar, "An implementation of Karmarkar's algorithm for linear programming," Mathematical Programming, vol. 44, no. 1, pp. 297–335, 1989.
[22] M. Lorenzen, F. Allgöwer, F. Dabbene, and R. Tempo, "An improved constraint-tightening approach for stochastic MPC," in American Control Conference (ACC), 2015, pp. 944–949.
[23] M. Lorenzen, F. Allgöwer, F. Dabbene, and R. Tempo, "Scenario-based stochastic MPC with guaranteed recursive feasibility," in Decision and Control (CDC), 54th IEEE Conference on, 2015, pp. 4958–4963.
[24] I. Kolmanovsky and E. G. Gilbert, "Theory and computation of disturbance invariant sets for discrete-time linear systems," Mathematical Problems in Engineering, vol. 4, no. 4, pp. 317–367, 1998.
[25] M. Rungger and P. Tabuada, "Computing robust controlled invariant sets of linear systems," IEEE Transactions on Automatic Control, 2017.
[26] N. Li, D. W. Oyler, M. Zhang, Y. Yildiz, I. Kolmanovsky, and A. R. Girard, "Game theoretic modeling of driver and vehicle interactions for verification and validation of autonomous vehicle control systems," IEEE Transactions on Control Systems Technology, vol. 26, no. 5, pp. 1782–1797, 2018.
[27] S. Di Cairano, D. Bernardini, A. Bemporad, and I. V. Kolmanovsky, "Stochastic MPC with learning for driver-predictive vehicle control and its application to HEV energy management," IEEE Transactions on Control Systems Technology, vol. 22, no. 3, pp. 1018–1031, 2014.