Dynamic Policy Programming


arXiv:1004.2027v1 [cs.LG] 12 Apr 2010

Mohammad Gheshlaghi Azar Hilbert J. Kappen

[email protected] [email protected]

Department of Biophysics Radboud University Nijmegen 6525 EZ Nijmegen, The Netherlands

Editor: TBA

Abstract

In this paper, we consider the problem of planning and learning in infinite-horizon discounted-reward Markov decision problems. We propose a novel iterative direct policy-search approach, called dynamic policy programming (DPP). DPP is, to the best of our knowledge, the first convergent direct policy-search method that uses a Bellman-like iteration technique and at the same time is compatible with function approximation. For the tabular case, we prove that DPP converges asymptotically to the optimal policy. We numerically compare the performance of DPP to other state-of-the-art approximate dynamic programming methods on the mountain-car problem with linear function approximation and Gaussian basis functions. We observe that, unlike other approximate dynamic programming methods, DPP converges to a near-optimal policy even when the basis functions are randomly placed. We conclude that DPP, combined with function approximation, asymptotically outperforms other approximate dynamic programming methods on the mountain-car problem.

Keywords: Approximate dynamic programming, reinforcement learning, Markov decision processes, direct policy search, KL divergence.

1. Introduction

Many problems in robotics, operations research and process control can be posed as dynamic programming (DP) problems. DP is based on estimating some measure of the value of a state, or of state-action pairs, through the Bellman equation. For high-dimensional or continuous systems the state space is huge and computing the value function by DP is intractable. A common way to make the computation tractable is function approximation, where the value function is parameterized in terms of a number of fixed basis functions, which reduces the Bellman equation to the estimation of these parameters (Bertsekas and Tsitsiklis, 1996, chap. 6). There are many algorithms for approximating the optimal value functions through DP (Szepesvari, 2009). Among them, approximate policy iteration algorithms (Bertsekas, 2007b, chap. 6.3) with greedy or ε-greedy policy updates usually fail to converge to a stable near-optimal policy. The reason is that these algorithms switch the control policy without enforcing continuity. In the absence of an accurate estimate of the value function, this can drastically deteriorate the quality of the control policy and, when applied repeatedly, can eventually result in oscillations or divergence of the algorithms (Bartlett, 2003; Bertsekas and Tsitsiklis, 1996, chap. 6.4.2).


One way to guarantee the stability of policy improvement is to replace the unstable greedy policy update by an incremental change of the policy. One of the most well-known algorithms of this kind is the actor-critic method (AC), which has a separate update rule for the policy (actor) as well as for the value function (critic) (Sutton and Barto, 1998, chap. 6.6). The actor uses the value function computed by the critic to guide the policy search. An important extension of AC, the policy-gradient actor-critic (PGAC), first introduced by Sutton et al. (1999), has been applied to large-scale problems (Peters and Schaal, 2008). In PGAC, the actor updates the parameters of the policy in the direction of the (natural) gradient of performance estimated by the critic. However, PGAC suffers from local maxima due to the non-linear nature of the optimization problem.

The method proposed in this paper, called dynamic policy programming (DPP), includes some of the features of AC. Like AC, DPP incrementally updates the parametrized policy. The difference is that DPP uses the parameters of the policy (action preferences) to guide the policy search with a Bellman-like recursion. In other words, the DPP recursion bootstraps the action preferences to improve the current estimate of the optimal policy, as opposed to the standard Bellman recursion, which backs up the value functions. In this sense, DPP is a direct policy-search approach (i.e., it searches for the optimal policy without any reference to the value functions). The basic idea is to turn the maximization over actions on the right-hand side of the Bellman equation into a convex optimization problem by adding a penalty term, the KL divergence between some baseline policy and the control policy. One can solve this maximization problem analytically, and the solution for the control policy is directly expressed in terms of the baseline policy and the specification of the environment. We then show that iterating this process, where the new baseline policy becomes the just-computed control policy, pushes the policy towards the deterministic optimal control.

This article is organized as follows. In section 2, we present the notation used in this paper. We introduce DPP in section 3, where algorithm 1 implements the direct policy search in discrete state-action MDPs. In section 4, we investigate the convergence properties of DPP: it is proven to achieve optimality in the tabular case and we provide an upper bound on its convergence rate. In section 5, we demonstrate the compatibility of our method with approximation techniques and propose a method to learn an approximation of the optimal policy by Monte-Carlo simulation. Section 6 presents numerical experiments on a grid-world problem and the mountain-car problem, where we compare approximate DPP with several approximate-DP methods and show that for the mountain-car problem approximate DPP asymptotically outperforms the other methods. Section 7 contains a brief review of other related methods. Finally, we discuss some of the implications of our work and future extensions in section 8.

2. Preliminaries

A stationary MDP is a 5-tuple $(S, A, R, T, \gamma)$, where $S$, $A$ and $R$ are, respectively, the set of all system states, the set of actions that can be taken and the set of rewards that may be issued, such that $r^a_{ss'}$ denotes the reward of the next state $s'$ given that the current state is $s$ and the action is $a$.


$T$ is a set of matrices of dimension $|S| \times |S|$, one for each $a \in A$, such that $T^a_{ss'}$ denotes the probability of the next state $s'$ given that the current state is $s$ and the action is $a$. $\gamma \in (0, 1)$ denotes the discount factor.

Assumption 1 We assume that for every 3-tuple $(s, a, s') \in S \times A \times S$, the magnitude of the immediate reward, $|r^a_{ss'}|$, is bounded from above by $R_{\max}$, and the magnitude of the expected immediate reward for $(s, a) \in S \times A$, $|\bar r(s, a)| = \left|\sum_{s' \in S} T^a_{ss'} r^a_{ss'}\right|$, is bounded from above by $\bar R_{\max}$.

A stationary policy is a mapping $\pi$ that assigns to each state $s$ a probability distribution over the action space $A$, such that $\pi_s(a)$ denotes the probability of action $a$ given the current state $s$. Given the policy $\pi$, its corresponding value function $V^\pi$ denotes the expected value of the long-term discounted sum of rewards in each state $s$, when the actions are chosen by the policy $\pi$. The goal is to find a policy $\pi^*$ that attains the optimal value function $V^*(s)$, which satisfies the Bellman equation:
$$V^*(s) = \max_{\pi_s} \sum_{a \in A} \pi_s(a) \sum_{s' \in S} T^a_{ss'}\left(r^a_{ss'} + \gamma V^*(s')\right), \qquad \forall s \in S. \tag{1}$$

3. Dynamic Policy Programming

Standard methods for solving MDPs are often expressed in terms of the value function. In this section, we demonstrate that it is possible to formulate the MDP problem entirely in terms of the policy distribution and that estimating the value function is not essential for solving MDPs. Our approach, DPP, uses a Bellman-like update rule to iterate the control policy towards the optimal policy. We then present an iterative algorithm based on the DPP update rule for discrete state-action problems.

As a first step towards formulating the MDP problem in terms of a policy distribution, we consider a slightly modified version of the MDP problem, in which the optimal policy $\bar\pi^*$ maximizes the expected sum of future rewards while minimizing a distance to some baseline policy $\bar\pi$. Differences between probability distributions are most naturally measured by the KL divergence:
$$g^\pi(s) = \mathrm{KL}\left(\pi_s \,\|\, \bar\pi_s\right) = \sum_{a \in A} \pi_s(a)\log\frac{\pi_s(a)}{\bar\pi_s(a)}, \qquad \forall s \in S.$$

One can incorporate the penalty $g^\pi(s)$ and the reward under the policy $\pi$ to define a new value function $V^\pi_{\bar\pi}$ given by:
$$V^\pi_{\bar\pi}(s) = \lim_{n \to \infty} \mathbb{E}_\pi\left[\sum_{k=1}^{n}\gamma^{k-1}\left(r_{s_{t+k}} - \frac{1}{\eta}\, g^\pi(s_{t+k-1})\right) \,\middle|\, s_t = s\right], \qquad \forall s \in S,$$

where $\eta$ is a positive constant and $\mathbb{E}_\pi$ denotes expectation w.r.t. the state transition probability distribution $T$ and the policy $\pi$. The optimal value function $V^*_{\bar\pi}(s) = \max_\pi V^\pi_{\bar\pi}(s)$ then satisfies the following Bellman equation:
$$V^*_{\bar\pi}(s) = \max_{\pi_s}\sum_{a \in A}\pi_s(a)\left[\sum_{s' \in S} T^a_{ss'}\left(r^a_{ss'} + \gamma V^*_{\bar\pi}(s')\right) - \frac{1}{\eta}\log\frac{\pi_s(a)}{\bar\pi_s(a)}\right], \qquad \forall s \in S. \tag{2}$$


The maximization in (2) can be performed in closed form using Lagrange multipliers. Following Todorov (2006), we state the following lemma:

Lemma 2 Let $\eta$ be a positive constant. Then the optimal value function $V^*_{\bar\pi}(s)$ and the optimal policy $\bar\pi^*_s(a)$, respectively, satisfy the following:
$$V^*_{\bar\pi}(s) = \frac{1}{\eta}\log\left[\sum_{a \in A}\bar\pi_s(a)\exp\left(\eta\sum_{s' \in S} T^a_{ss'}\left(r^a_{ss'} + \gamma V^*_{\bar\pi}(s')\right)\right)\right], \qquad \forall s \in S, \tag{3}$$
$$\bar\pi^*_s(a) = \frac{\bar\pi_s(a)\exp\left(\eta\sum_{s' \in S} T^a_{ss'}\left(r^a_{ss'} + \gamma V^*_{\bar\pi}(s')\right)\right)}{\exp\left(\eta V^*_{\bar\pi}(s)\right)}, \qquad \forall (s, a) \in S \times A. \tag{4}$$

Proof See appendix A.

Equations (3) and (4) can be combined into one formula by presenting the policy in terms of action preferences (Sutton and Barto, 1998, chap. 6.6). For any arbitrary policy distribution $\pi$, the action preferences $P : S \times A \mapsto \Re$ satisfy:
$$\pi_s(a) = \exp\left(\eta\left(P(s, a) - L_\eta P(s)\right)\right), \qquad \forall (s, a) \in S \times A. \tag{5}$$
Here, we define $L_\eta P(s) = \frac{1}{\eta}\log\sum_{a \in A}\exp\left(\eta P(s, a)\right)$ to ensure normalization of $\pi$. If $\bar P$ and $\bar P^*$ denote the action preferences corresponding to the policies $\bar\pi$ and $\bar\pi^*$, respectively, then (3) and (4) reduce to:
$$\bar P^*(s, a) = \bar P(s, a) + \sum_{s' \in S} T^a_{ss'}\left(r^a_{ss'} + \gamma L_\eta \bar P^*(s')\right) - L_\eta \bar P(s), \qquad \forall (s, a) \in S \times A, \tag{6}$$

with $\bar P^*(s, a) = \frac{1}{\eta}\log\bar\pi_s(a) + \sum_{s' \in S} T^a_{ss'}\left(r^a_{ss'} + \gamma V^*_{\bar\pi}(s')\right)$. In other words, we have transformed the optimal control problem (2) into a system of non-linear equations in terms of the action preferences. Equation (6) can be solved by fixed-point iteration, and the solution gives $\bar P^*$ in terms of $\bar P$. So one can iterate some initial action preferences $P_0$ through (6) indefinitely, until it asymptotically converges to $\bar P^*$. However, we are not in principle interested in the optimal solution of (2) for fixed $\bar\pi$, but in the optimal solution of (1). The idea to further improve the policy towards this optimal solution is to replace $\bar P$ with the newly computed $\bar P^*$ after each iteration of (6). This results in a Bellman-like update rule given by the operator $O_L$:
$$O_L P(s, a) = P(s, a) + \sum_{s' \in S} T^a_{ss'}\left(r^a_{ss'} + \gamma L_\eta P(s')\right) - L_\eta P(s), \qquad \forall (s, a) \in S \times A. \tag{7}$$

Equation (7) differs from (6) in that it no longer depends on $\bar\pi$ but only on the specification of the underlying MDP. It is important to point out that $O_L$ is one of several possible proposals for manipulating the policy towards optimality. Instead of $O_L$, one may consider the slightly different operator $O_M$, obtained by replacing the operator $L_\eta$ in (7) with the soft-max operator $M_\eta$:
$$O_M P(s, a) = P(s, a) + \sum_{s' \in S} T^a_{ss'}\left(r^a_{ss'} + \gamma M_\eta P(s')\right) - M_\eta P(s), \qquad \forall (s, a) \in S \times A, \tag{8}$$


where $M_\eta P(s)$ denotes the expected value of the action preferences under the policy $\pi$ given by (5):
$$M_\eta P(s) = \frac{\sum_{a \in A}\exp\left(\eta P(s, a)\right) P(s, a)}{\sum_{a \in A}\exp\left(\eta P(s, a)\right)}. \tag{9}$$

According to MacKay (2003, chap. 31, p. 402), one can easily obtain the following inequality:
$$\max_{(s, a) \in S \times A}\left|O_L P(s, a) - O_M P(s, a)\right| \le \frac{1}{\eta}(1 + \gamma)\max_{s \in S} H^\pi(s) \le \frac{1}{\eta}(1 + \gamma)\log|A|, \tag{10}$$

where $H^\pi(s)$ is the entropy of the policy $\pi_s$ and $|A|$ is the cardinality of the action set $A$. According to (10), the difference between $O_M P(s, a)$ and $O_L P(s, a)$ is no bigger than $\frac{1}{\eta}(1 + \gamma)\log|A|$, and it goes to zero for deterministic policies. Therefore, for deterministic policies, we have $O_L P(s, a) = O_M P(s, a)$.

Equations (7) and (8) provide the basis for the DPP algorithm: at each iteration, the action preferences of all states and actions are updated according to (7) or (8). An algorithmic view of this iterative process is provided in algorithm 1.

Algorithm 1: (DPP) Dynamic Policy Programming
  Input: Randomized action preferences P0(·, ·) and η
  for n = 1, 2, 3, . . . do
      for (s, a) ∈ S × A do
          update P(s, a) by (8) or (7);
      end
  end
  Compute π by (5);
  return π

We should mention at this point that the above derivations only serve to motivate DPP and do not provide any formal justification for the steps that we have taken. We will prove its convergence to the optimal policy in section 4.
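For illustration, a minimal Python sketch of algorithm 1 with the soft-max update rule (8) is given below. The array layout and the helper name `dpp` are our own choices and are not part of the algorithm specification.

```python
import numpy as np

def dpp(T, r, gamma=0.99, eta=1.0, n_iter=1000, seed=0):
    """Tabular DPP (algorithm 1) with the soft-max update rule (8).

    T : array (S, A, S), T[s, a, s'] = transition probability.
    r : array (S, A, S), r[s, a, s'] = immediate reward.
    Returns the policy pi[s, a] induced by the final action preferences, cf. (5).
    """
    S, A, _ = T.shape
    rng = np.random.default_rng(seed)
    P = rng.uniform(-0.5, 0.5, size=(S, A))        # randomized action preferences P_0
    r_bar = np.einsum('sap,sap->sa', T, r)         # expected immediate rewards r_bar(s, a)

    def soft_max_policy(P):
        w = np.exp(eta * (P - P.max(axis=1, keepdims=True)))   # numerically stable soft-max
        return w / w.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        pi = soft_max_policy(P)
        M = (pi * P).sum(axis=1)                   # M_eta P(s), the expectation in (9)
        # update (8): P(s,a) <- P(s,a) + r_bar(s,a) + gamma * sum_s' T[s,a,s'] M(s') - M(s)
        P = P + r_bar + gamma * np.einsum('sap,p->sa', T, M) - M[:, None]

    return soft_max_policy(P)
```

For the variant (7), one would replace the expectation `M` by the log-sum-exp $L_\eta P(s) = \frac{1}{\eta}\log\sum_{a}\exp(\eta P(s, a))$.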

4. Convergence Analysis

This section provides a formal analysis of the convergence behavior of algorithm 1 (DPP). Our first objective is to establish a rate of convergence for the value function of the policy induced by DPP. Then, we prove that the action preferences produced by DPP always converge asymptotically to a unique limit, and we compute this limit. To keep the presentation succinct, we henceforth concentrate on the update rule (8). The convergence of DPP using (7) can be proven along similar lines.

4.1 Definitions and Matrix Notation

Before we proceed with proving the convergence of DPP, we find it convenient to use matrix and vector notation. We begin by re-defining the variables of the MDP.


$T$ and $r$ are, respectively, a $|S||A| \times |S|$ transition matrix with entries $T(sa, s') = T^a_{ss'}$ and a $|S||A| \times |S|$ reward matrix with entries $r(sa, s') = r^a_{ss'}$. $\bar r$ is a $|S||A| \times 1$ column vector of expected rewards with entries $\bar r(sa) = \sum_{s' \in S} T^a_{ss'} r^a_{ss'}$. The policy can also be re-expressed as a $|S||A| \times 1$ vector $\pi$ with entries $\pi(sa) = \pi_s(a)$. In addition, it will be convenient to introduce a $|S| \times |S||A|$ matrix $\Pi$, given by:
$$\Pi = \begin{bmatrix} \pi_{s_1}(a_1)\cdots\pi_{s_1}(a_{|A|}) & & & \\ & \pi_{s_2}(a_1)\cdots\pi_{s_2}(a_{|A|}) & & \\ & & \ddots & \\ & & & \pi_{s_{|S|}}(a_1)\cdots\pi_{s_{|S|}}(a_{|A|}) \end{bmatrix}, \tag{11}$$

where $\Pi$ consists of $|S|$ row blocks, each of length $|A|$, arranged diagonally.¹ The policy matrix $\Pi$ is related to the policy vector $\pi$ by $\pi = \Pi^T \mathbf{1}$, where $\mathbf{1}$ denotes a $|S| \times 1$ vector of all 1s. Further, one can easily verify that the matrix product $\Pi T$ gives the state-to-state transition matrix for the policy $\pi$.

One can also represent the value function in vector form. $v^\pi$ is a $|S| \times 1$ vector with entries $v^\pi(s) = V^\pi(s)$. The Bellman equation (1) can now be re-expressed in matrix notation as:
$$v^* = M_\infty\left(\bar r + \gamma T v^*\right), \tag{12}$$
where $M_\infty$ is the max operator on the $|S||A| \times 1$ vector $\bar r + \gamma T v^*$, such that $M_\infty(\bar r + \gamma T v^*)(s) = \max_{a \in A}\left(\bar r(sa) + \gamma T v^*(sa)\right)$. Often it is convenient to associate value functions not with states but with state-action pairs. Therefore, we introduce a $|S||A| \times 1$ vector of action-values $q^\pi$, whose entries $q^\pi(sa)$ denote the expected value of the sum of future rewards for each state-action pair $(s, a) \in S \times A$, provided the future actions are chosen by the policy $\pi$. $q^\pi$ satisfies the following Bellman equation:
$$q^\pi = \mathcal{T}^\pi q^\pi = \bar r + \gamma T\Pi q^\pi, \tag{13}$$

where we introduce the Bellman operator $\mathcal{T}^\pi$, whose fixed point is $q^\pi$. The optimal $q$, $q^*$, also satisfies a Bellman equation:
$$q^* = \mathcal{T} q^* = \bar r + \gamma T M_\infty q^*, \tag{14}$$
where we introduce the Bellman operator $\mathcal{T}$, whose fixed point is $q^*$. Both $\mathcal{T}$ and $\mathcal{T}^\pi$ are contraction mappings with factor $\gamma$ (Bertsekas, 2007b, chap. 1). In other words, for any two vectors $q$ and $q'$, we have:
$$\left\|\mathcal{T} q - \mathcal{T} q'\right\|_\infty \le \gamma\left\|q - q'\right\|_\infty, \qquad \left\|\mathcal{T}^\pi q - \mathcal{T}^\pi q'\right\|_\infty \le \gamma\left\|q - q'\right\|_\infty, \tag{15}$$
where $\|\cdot\|_\infty$ denotes the $L_\infty$-norm on $\Re^{|S||A|}$. Furthermore, we can re-express our results of section 3 in vector form. The operator $O_M$ defined by (8) can be written in matrix notation as:
$$O p = p + \bar r + \gamma T M_\eta p - \Xi M_\eta p. \tag{16}$$

1. The same notation has been used by Wang et al. (2007); Lagoudakis and Parr (2003).


Here, $p$ denotes a $|S||A| \times 1$ vector of action preferences with $p(sa) = P(s, a)$, where $P(s, a)$ satisfies (5). $M_\eta$ is the soft-max operator on the vector $p$, with $M_\eta p(s) = M_\eta P(s)$ and $M_\eta P(s)$ given by (9). $\Xi$ is a $|S||A| \times |S|$ matrix given by:
$$\Xi = \begin{bmatrix} 1\cdots 1 & & & \\ & 1\cdots 1 & & \\ & & \ddots & \\ & & & 1\cdots 1 \end{bmatrix}^{T}, \tag{17}$$

where $\Xi$ consists of $|S|$ column blocks of 1s, each of length $|A|$, arranged diagonally. $\Xi$ transforms the $|S| \times 1$ vector $M_\eta p$ into the $|S||A| \times 1$ vector $\Xi M_\eta p$, with $\Xi M_\eta p(sa) = M_\eta p(s)$. Further, one can easily verify that for any policy matrix defined by (11):
$$\Pi\Xi = \begin{bmatrix} \sum_{a \in A}\pi(s_1 a) & & & \\ & \sum_{a \in A}\pi(s_2 a) & & \\ & & \ddots & \\ & & & \sum_{a \in A}\pi(s_{|S|} a) \end{bmatrix} = I, \tag{18}$$

where $I$ is a $|S| \times |S|$ identity matrix. We know that $\lim_{\eta \to \infty} M_\eta p = M_\infty p$. Further, it is not difficult to show that the following inequality holds for $\|M_\infty p - M_\eta p\|_\infty$:

Lemma 3 Let $p$ be a $|S||A| \times 1$ vector of action preferences and $\eta$ be a positive constant. Then an upper bound for $\|M_\eta p - M_\infty p\|_\infty$ is given by:
$$\left\|M_\eta p - M_\infty p\right\|_\infty \le \frac{1}{\eta}\log(|A|).$$

Proof See appendix B.

Without any loss of generality, we set $\eta = 1$ in our analysis² and use $M$ as a shorthand notation for $M_1$. Defining $p_n$ as the action preferences resulting from the $n$th iteration of (16), we have:
$$p_n = p_{n-1} + \bar r + \gamma T M p_{n-1} - \Xi M p_{n-1} = p_{n-1} + \bar r + \gamma T\Pi_{n-1}p_{n-1} - \Xi\Pi_{n-1}p_{n-1}, \qquad n = 1, 2, 3, \dots, \tag{19}$$

where $p_0 = p$ and $\Pi_n$ is the policy distribution matrix associated with the policy distribution vector $\pi_n$ given by:
$$\pi_n(sa) = \frac{\exp\left(p_n(sa)\right)}{Z(s)}, \qquad \forall (s, a) \in S \times A, \tag{20}$$
and $Z(s) = \sum_{a \in A}\exp\left(p_n(sa)\right)$ is the normalization factor.

2. Any other value of $\eta$ can be absorbed by redefining the reward and the action preferences: $\bar r_{\mathrm{new}} = \eta\bar r$ and $p_{\mathrm{new}} = \eta p$.
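The matrix notation above is easy to verify numerically. Below is a small Python sketch, our own illustration with assumed helper names, that builds Π of (11) and Ξ of (17) for a random policy and checks the identity ΠΞ = I of (18).

```python
import numpy as np

def make_Pi(pi):
    """|S| x |S||A| block-diagonal policy matrix of equation (11); pi has shape (S, A)."""
    S, A = pi.shape
    Pi = np.zeros((S, S * A))
    for s in range(S):
        Pi[s, s * A:(s + 1) * A] = pi[s]
    return Pi

def make_Xi(S, A):
    """|S||A| x |S| matrix of equation (17): (Xi @ v)(sa) = v(s)."""
    return np.kron(np.eye(S), np.ones((A, 1)))

S, A = 4, 3
rng = np.random.default_rng(0)
pi = rng.random((S, A))
pi /= pi.sum(axis=1, keepdims=True)

Pi, Xi = make_Pi(pi), make_Xi(S, A)
assert np.allclose(Pi @ Xi, np.eye(S))    # identity (18)
```

With the transition matrix reshaped to $|S||A| \times |S|$, the vectorized operator (16) is then simply `p + r_bar + gamma * T @ M - Xi @ M`, where `M` is the $|S|$-vector of soft-max expectations (9).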


4.2 A Sub-Linear Convergence Rate for DPP

In this subsection, we prove the convergence of DPP to the optimal policy. Our main result (theorem 5) is that at iteration $n$ of DPP we have:
$$\left\|q^{\pi_n} - q^*\right\|_\infty < \epsilon_n, \qquad \text{where } \epsilon_n = O\!\left(\frac{1}{n}\right). \tag{21}$$

Here, $q^{\pi_n}$ is the vector of action-values under the policy $\pi_n$, and $\pi_n$ is the policy induced by DPP at iteration $n$. Equation (21) implies that, asymptotically, the policy induced by DPP, $\pi = \lim_{n \to \infty}\pi_n$, converges to the optimal policy $\pi^*$.

To derive (21) one needs to relate $q^{\pi_n}$ to the optimal $q^*$. Unfortunately, finding a direct relation between $q^{\pi_n}$ and $q^*$ is not an easy task. Instead, we relate $q^{\pi_n}$ to $q^*$ via an auxiliary $q^A_n$, which we define later in this section. In the remainder of this section, we first express $\pi_n$ in terms of $q^A_n$. Then, we obtain an upper bound on the normed error $\|q^A_n - q^*\|_\infty$ in lemma 4. Finally, we use these two results to derive a sub-linear bound on the normed error $\|q^{\pi_n} - q^*\|_\infty$ in theorem 5.

In order to express $\pi_n$ in terms of $q^A_n$, we expand the corresponding action preferences $p_n$ by recursive substitution of (19):
$$\begin{aligned}
p_n &= p_{n-1} + \bar r + \gamma T\Pi_{n-1}p_{n-1} - \Xi\Pi_{n-1}p_{n-1}\\
&= p_{n-2} + 2\bar r + \gamma T\Pi_{n-2}p_{n-2} - \Xi\Pi_{n-2}p_{n-2} + \gamma T\Pi_{n-1}p_{n-2} + \gamma T\Pi_{n-1}\bar r + \gamma^2 T\Pi_{n-1}T\Pi_{n-2}p_{n-2}\\
&\quad - \gamma T\Pi_{n-1}\Xi\Pi_{n-2}p_{n-2} - \Xi\Pi_{n-1}p_{n-2} - \Xi\Pi_{n-1}\bar r - \gamma\Xi\Pi_{n-1}T\Pi_{n-2}p_{n-2} + \Xi\Pi_{n-1}\Xi\Pi_{n-2}p_{n-2}\\
&= 2\bar r + \gamma T\Pi_{n-1}\bar r + p_{n-2} + \gamma T\Pi_{n-1}p_{n-2} + \gamma^2 T\Pi_{n-1}T\Pi_{n-2}p_{n-2} - \Xi\Pi_{n-1}\bar r - \Xi\Pi_{n-1}p_{n-2} - \gamma\Xi\Pi_{n-1}T\Pi_{n-2}p_{n-2} && \text{by (18).}
\end{aligned} \tag{22}$$

Equation (22) expresses $p_n$ in terms of $p_{n-2}$, where in the last step we have canceled some terms using (18). To express $p_n$ in terms of $p_0$, we proceed with the expansion of (19):
$$\begin{aligned}
p_n &= n\bar r + \sum_{k=1}^{n-1}(n-k)\gamma^k\left(\prod_{j=1}^{k}T\Pi_{n-j}\right)\bar r + p_0 + \sum_{k=1}^{n}\gamma^k\left(\prod_{j=1}^{k}T\Pi_{n-j}\right)p_0\\
&\quad - \Xi\Pi_{n-1}\left[(n-1)\bar r + \sum_{k=1}^{n-2}(n-k-1)\gamma^k\left(\prod_{j=1}^{k}T\Pi_{n-j-1}\right)\bar r\right] - \Xi\Pi_{n-1}\left[p_0 + \sum_{k=1}^{n-1}\gamma^k\left(\prod_{j=1}^{k}T\Pi_{n-j-1}\right)p_0\right]\\
&= n\left[\bar r + \sum_{k=1}^{n-1}\gamma^k\left(\prod_{j=1}^{k}T\Pi_{n-j}\right)\bar r\right] - \gamma\sum_{k=1}^{n-1}k\gamma^{k-1}\left(\prod_{j=1}^{k}T\Pi_{n-j}\right)\bar r + \left[p_0 + \sum_{k=1}^{n}\gamma^k\left(\prod_{j=1}^{k}T\Pi_{n-j}\right)p_0\right]\\
&\quad - (n-1)\Xi\Pi_{n-1}\left[\bar r + \sum_{k=1}^{n-2}\gamma^k\left(\prod_{j=1}^{k}T\Pi_{n-j-1}\right)\bar r\right] + \Xi\Pi_{n-1}\left[\gamma\sum_{k=1}^{n-2}k\gamma^{k-1}\left(\prod_{j=1}^{k}T\Pi_{n-j-1}\right)\bar r - p_0 - \sum_{k=1}^{n-1}\gamma^k\left(\prod_{j=1}^{k}T\Pi_{n-j-1}\right)p_0\right].
\end{aligned} \tag{23}$$
We define the auxiliary action-values $q^A_n = \bar r + \sum_{k=1}^{n-1}\gamma^k\left(\prod_{j=1}^{k}T\Pi_{n-j}\right)\bar r$ and $q^p_n = p_0 + \sum_{k=1}^{n}\gamma^k\left(\prod_{j=1}^{k}T\Pi_{n-j}\right)p_0$, and write (23) as:
$$p_n = n q^A_n - \gamma\frac{\partial q^A_n}{\partial\gamma} + q^p_n - \Xi\Pi_{n-1}\left[(n-1)q^A_{n-1} - \gamma\frac{\partial q^A_{n-1}}{\partial\gamma} + q^p_{n-1}\right] = n q^A_n + c_n - \Xi d_n. \tag{24}$$


Here, $c_n = q^p_n - \gamma\,\partial q^A_n/\partial\gamma$ is a $|S||A| \times 1$ vector bounded by:
$$\left\|c_n\right\|_\infty \le c_{\max} = \frac{\gamma}{(1-\gamma)^2}\bar R_{\max} + \frac{1}{1-\gamma}\left\|p_0\right\|_\infty, \qquad n = 1, 2, 3, \dots, \tag{25}$$
and $d_n = \Pi_{n-1}\left[(n-1)q^A_{n-1} - \gamma\,\partial q^A_{n-1}/\partial\gamma + q^p_{n-1}\right]$ is a $|S| \times 1$ vector. From its definition, it is easy to see that $q^A_n$ satisfies the following Bellman equation:
$$q^A_n = \bar r + \gamma T\Pi_{n-1}q^A_{n-1}. \tag{26}$$

Note the difference between (26) and (13): whereas $q^\pi$ evolves according to a fixed policy $\pi$, $q^A_n$ evolves according to a policy that itself evolves according to DPP. Finally, we express $\pi_n$ in terms of $q^A_n$. By plugging (24) into (20) and taking into account that $\Xi d_n(sa) = d_n(s)$, we obtain:
$$\pi_n(sa) = \frac{\exp\left(n q^A_n(sa) + c_n(sa) - d_n(s)\right)}{Z(s)} = \frac{\exp\left(n q^A_n(sa) + c_n(sa)\right)}{Z'(s)}, \qquad (s, a) \in S \times A, \tag{27}$$
where $Z'(s) = Z(s)\exp\left(d_n(s)\right)$ is the normalization factor. Equation (27) establishes the relation between $\pi_n$ and $q^A_n$. To relate $q^A_n$ and $q^*$, we state the following lemma, which establishes a sub-linear bound on $\|q^A_n - q^*\|_\infty$:

Lemma 4 (A sub-linear bound on $\|q^A_n - q^*\|_\infty$) Let assumption 1 hold, let $|A|$ denote the cardinality of $A$ and let $n$ be a positive integer. Also, to keep the presentation succinct, assume that both $\|\bar r\|_\infty$ and $\|p_0\|_\infty$ are bounded from above by some constant $L > 0$. Then the following inequality holds:
$$\left\|q^A_n - q^*\right\|_\infty \le \epsilon^1_n,$$
where $\epsilon^1_n$ is given by:
$$\epsilon^1_n = 2\gamma\frac{(1-\gamma)^2\log(|A|) + 2L}{n(1-\gamma)^4} + 2\gamma^{n-1}\frac{L}{1-\gamma}. \tag{28}$$

Proof See appendix C.

Equation (27) expresses the policy $\pi_n$ in terms of the auxiliary $q^A_n$. Lemma 4 provides a sub-linear upper bound on the normed error $\|q^A_n - q^*\|_\infty$. We use these two results to derive a sub-linear bound on the normed error $\|q^{\pi_n} - q^*\|_\infty$:

Theorem 5 (A sub-linear bound on $\|q^{\pi_n} - q^*\|_\infty$) Let assumption 1 hold and let $|A|$ be the cardinality of the action space $A$. Further, to keep the presentation succinct, assume that both $\|\bar r\|_\infty$ and $\|p_0\|_\infty$ are bounded from above by some constant $L > 0$, and for any positive integer $n$ let $\epsilon^1_n$ be defined according to (28). Then the following inequality holds:
$$\left\|q^{\pi_n} - q^*\right\|_\infty \le \epsilon_n, \qquad \text{where } \epsilon_n = \frac{\gamma}{1-\gamma}\left[\frac{(1-\gamma)^2\log(|A|) + 2L}{n(1-\gamma)^2} + 2\epsilon^1_n\right].$$


Proof We start by noting that $q^{\pi_n}$ is obtained by infinite application of the operator $\mathcal{T}^{\pi_n}$ to an arbitrary initial $q$-vector, for which we take $q^*$. Then:
$$\begin{aligned}
\left\|q^{\pi_n} - q^*\right\|_\infty &= \lim_{m \to \infty}\left\|\sum_{k=1}^{m}\left(\mathcal{T}^k_{\pi_n}q^* - \mathcal{T}^{k-1}_{\pi_n}\mathcal{T}q^*\right)\right\|_\infty\\
&\le \lim_{m \to \infty}\sum_{k=1}^{m}\left\|\mathcal{T}^k_{\pi_n}q^* - \mathcal{T}^{k-1}_{\pi_n}\mathcal{T}q^*\right\|_\infty\\
&\le \lim_{m \to \infty}\sum_{k=1}^{m}\gamma^{k-1}\left\|\mathcal{T}^{\pi_n}q^* - \mathcal{T}q^*\right\|_\infty && \text{by (15)}\\
&\le \frac{1}{1-\gamma}\left\|\mathcal{T}^{\pi_n}q^* - \mathcal{T}q^*\right\|_\infty\\
&= \frac{\gamma}{1-\gamma}\left\|T\Pi_n q^* - T M_\infty q^*\right\|_\infty\\
&\le \frac{\gamma}{1-\gamma}\left\|\Pi_n q^* - M_\infty q^*\right\|_\infty && \text{by H\"older's inequality.}
\end{aligned} \tag{29}$$
Along similar lines to the proof of lemma 4, we obtain:
$$\begin{aligned}
\left\|q^{\pi_n} - q^*\right\|_\infty &\le \frac{\gamma}{1-\gamma}\left\|\Pi_n q^* - M_\infty q^*\right\|_\infty\\
&= \frac{\gamma}{1-\gamma}\max_{s \in S}\left(M_\infty q^*(s) - \Pi_n q^*(s)\right)\\
&\le \frac{\gamma}{1-\gamma}\left[\max_{s \in S}\left(M_\infty q^A_n(s) - \Pi_n q^*(s)\right) + \epsilon^1_n\right] && \text{by lemma 4}\\
&\le \frac{\gamma}{1-\gamma}\left[\max_{s \in S}\left(M_\infty\!\left(q^A_n + \frac{c_n}{n}\right)(s) - \Pi_n q^*(s)\right) + \epsilon^1_n + \frac{c_{\max}}{n}\right] && \text{by (25)}\\
&\le \frac{\gamma}{1-\gamma}\left[\left\|M_\infty\!\left(q^A_n + \frac{c_n}{n}\right) - \Pi_n\!\left(q^A_n + \frac{c_n}{n}\right)\right\|_\infty + 2\epsilon^1_n + \frac{2c_{\max}}{n}\right]\\
&= \frac{\gamma}{1-\gamma}\left[\left\|M_\infty\!\left(q^A_n + \frac{c_n}{n}\right) - M_n\!\left(q^A_n + \frac{c_n}{n}\right)\right\|_\infty + 2\epsilon^1_n + \frac{2c_{\max}}{n}\right] && \text{by (9)}\\
&\le \frac{\gamma}{1-\gamma}\left[\frac{\log(|A|)}{n} + 2\epsilon^1_n + \frac{2c_{\max}}{n}\right] && \text{by lemma 3}\\
&\le \frac{\gamma}{1-\gamma}\left[\frac{(1-\gamma)^2\log(|A|) + 2L}{n(1-\gamma)^2} + 2\epsilon^1_n\right] && \text{by (25).}
\end{aligned}$$

As an immediate corollary of theorem 5, we obtain the following result:


Corollary 6 The following relation holds in the limit:
$$\lim_{n \to +\infty} q^{\pi_n} = q^*.$$

Corollary 6, which guarantees the asymptotic convergence of the policy induced by DPP to the optimal policy $\pi^*$, completes the convergence analysis of DPP.

4.3 Asymptotic Behavior of DPP

In this subsection, we show that, under some mild conditions, $p_n$ has a unique limit as $n \to \infty$, which only depends on the optimal value $v^*$ and the optimal control action $a^*$.

Assumption 7 We assume that the MDP has a unique deterministic optimal policy $\pi^*$ given by:
$$\pi^*(sa) = \begin{cases} 1 & a = a^*(s)\\ 0 & \text{otherwise,} \end{cases} \qquad \forall s \in S,$$
where $a^*(s) = \arg\max_{a \in A} q^*(sa)$.

Theorem 8 Let assumption 7 hold, let $n$ be a positive integer and let $p_n$ be the action preferences after $n$ iterations of DPP. Then, for an arbitrary vector of action preferences $p_0$:
$$\lim_{n \to +\infty} p_n(sa) = \begin{cases} v^*(s) & a = a^*(s)\\ -\infty & \text{otherwise,} \end{cases} \qquad \forall s \in S,$$
where $a^*(s)$ is the optimal action associated with the state $s$.

Proof First, we show that $p_n$ has a limit; then, we compute this limit. The convergence of $p_n$ to a limit can be established by proving the convergence of all the terms on the right-hand side of (24). We have already proven the convergence of $q^A_n$ to $q^*$ in lemma 4. The convergence of $\partial q^A_n/\partial\gamma$ to $\partial q^*/\partial\gamma$ is then immediate. Corollary 6 together with assumption 7 yields the convergence of $\pi_n$ to $\pi^*$. The convergence of $q^p_n$ to some limit $q^p$ then follows. Accordingly, $p_n$ has a limit. Now, we compute the limit of $p_n$. Combining corollary 6 with (24) and taking into account that $v^* = \Pi^* q^*$ yields:
$$\begin{aligned}
\lim_{n \to \infty} p_n(sa) &= \lim_{n \to \infty}\left[n q^*(sa) - \gamma\frac{\partial q^*(sa)}{\partial\gamma} + q^p(sa) - \left((n-1)v^*(s) - \gamma\frac{\partial v^*(s)}{\partial\gamma} + q^p(sa^*)\right)\right]\\
&= \lim_{n \to \infty}\left[n\left(q^*(sa) - v^*(s)\right) + \gamma\left(\frac{\partial v^*(s)}{\partial\gamma} - \frac{\partial q^*(sa)}{\partial\gamma}\right) + q^p(sa) - q^p(sa^*) + v^*(s)\right]\\
&= \begin{cases} v^*(s) & a = a^*(s)\\ -\infty & \text{otherwise.} \end{cases}
\end{aligned}$$


5. Dynamic Policy Programming with Approximation

In this section, we discuss the possibility of combining DPP with approximation techniques. Algorithm 1 (DPP) can only be applied to small problems with few states and actions. We scale DPP up to large-scale (or continuous) state problems by using function approximation, where the action preferences are approximated by a weighted combination of basis functions (linear function approximation). We call the new algorithm approximate dynamic policy programming (ADPP).

Let us define $\Phi$ as a $|S||A| \times k$ matrix where, for each $(s, a) \in S \times A$, the row vector $\Phi(sa, \cdot)$ denotes the output of the set of basis functions $F_\phi = \{\phi_1, \dots, \phi_k\}$ given the state-action pair $(s, a)$ as input, with each basis function $\phi_i : S \times A \to \Re$. A parametric approximation $\hat p$ of the action preferences can then be expressed as a projection onto the column space spanned by $\Phi$, $\hat p = \Phi\theta$, where $\theta \in \Re^k$ is a $k \times 1$ vector of adjustable parameters.

There is no guarantee that $O\hat p$ defined by (16) stays in the column span of $\Phi$. Instead, a common approach is to find a vector $\theta$ that projects $O\hat p$ onto the column space spanned by $\Phi$ while minimizing the pseudo-normed error $J = \|O\hat p - \Phi\theta\|^2_\Upsilon = \left(O\hat p - \Phi\theta\right)^T\Upsilon\left(O\hat p - \Phi\theta\right)$, where $\Upsilon$ is a $|S||A| \times |S||A|$ diagonal matrix with $\sum_{sa}\operatorname{diag}(\Upsilon)(sa) = 1$ (Bertsekas, 2007b, chap. 6).³ The solution that minimizes $J$ is given by the least-squares projection:
$$\theta^* = \arg\min_{\theta \in \Re^k} J(\theta) = \left(\Phi^T\Upsilon\Phi\right)^{-1}\Phi^T\Upsilon O\hat p. \tag{30}$$

Equation (30) requires the computation of $O\hat p$ for all states and actions. For large-scale problems this becomes infeasible. However, we can simulate the state trajectory and then make an empirical estimate of the least-squares projection. The key observation in deriving an unbiased sample estimate of (30) is that the pseudo-normed error $J$ can be written as an expectation:
$$J = \mathbb{E}_{(sa)\sim\Upsilon}\left[\left(O\hat p(sa) - \Phi\theta(sa)\right)^T\left(O\hat p(sa) - \Phi\theta(sa)\right)\right].$$
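The projection (30) itself is an ordinary weighted least-squares problem. A minimal numpy sketch, our own illustration with `Phi`, `upsilon` and `Op_hat` assumed to be given, reads:

```python
import numpy as np

def project(Phi, upsilon, Op_hat):
    """Weighted least-squares projection of equation (30).

    Phi     : (|S||A|, k) basis matrix.
    upsilon : (|S||A|,) diagonal of Upsilon (a state-action distribution, summing to one).
    Op_hat  : (|S||A|,) the vector O p_hat.
    Returns theta* = (Phi^T Upsilon Phi)^{-1} Phi^T Upsilon O p_hat.
    """
    A = Phi.T @ (Phi * upsilon[:, None])      # Phi^T Upsilon Phi
    b = Phi.T @ (upsilon * Op_hat)            # Phi^T Upsilon O p_hat
    theta, *_ = np.linalg.lstsq(A, b, rcond=None)
    return theta
```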

Therefore, one can simulate the state trajectory according to a policy $\pi$ (to be specified) and then empirically estimate $J$ and its least-squares solution for the corresponding visit-distribution matrix $\Upsilon$. In analogy with Bertsekas (2007b, chap. 6.3), we can derive a sample estimate of (30). We assume that a trajectory of $m + 1$ state-action pairs, $G^\pi = \{(s_1, a_1), (s_2, a_2), \dots, (s_{m+1}, a_{m+1})\}$, is available. The trajectory $G^\pi$ is generated by simulating the policy $\pi$ for $m + 1$ steps and has stationary distribution $\Upsilon$. We then construct the $m \times k$ matrix $\tilde\Phi$ with $\tilde\Phi(i, j) = \phi_j(s_i, a_i)$, where $(s_i, a_i)$ is the $i$th component of $G^\pi$. An unbiased estimate of (30) under the stationary distribution $\Upsilon$ can then be identified as follows:
$$\tilde\theta^* = \left(\tilde\Phi^T\tilde\Phi\right)^{-1}\tilde\Phi^T\tilde{\hat p}, \tag{31}$$

where $\tilde{\hat p}$ is an $m \times 1$ vector with entries:
$$\tilde{\hat p}(i) = \hat p(s_i a_i) + r(s_i a_i, s_{i+1}) + \gamma M_\eta\hat p(s_{i+1}) - M_\eta\hat p(s_i), \qquad i = 1, 2, \dots, m. \tag{32}$$

3. One can associate diag(Υ) with the stationary distribution of state-action pairs (s, a) for some policy π and the transition matrix T.


By comparing (32) with (16), we observe that $O\hat p(s_i a_i) = \mathbb{E}_{s_{i+1}\sim T}\left[\tilde{\hat p}(i)\right]$.

So far, we have not specified the policy $\pi$ that is used to generate $G^\pi$. There are several possibilities (Sutton and Barto, 1998, chap. 5). Here, we choose the on-policy approach, where the state-action trajectory $G^\pi$ is generated by the policy induced by $\hat p$. The optimal policy can then be approximated by iterating the parametrized policy through (32), using the just-computed policy to sample the next trajectory. Algorithm 2 shows the on-policy ADPP method.

Algorithm 2: (ADPP) On-policy approximate dynamic policy programming
  Input: Randomized parametrized p̂(θ), η and m
  for n = 1, 2, 3, . . . do
      Construct the policy π̂ from p̂(θ);
      Make an (m + 1)-step sequence G of (s, a) ∈ S × A by simulating π̂;
      Construct the targets (32) and Φ̃ from the trajectory G;
      Compute θ̃* by (31);
      θ ← θ̃*;
  end
  return θ

In order to estimate the optimal policy by ADPP, we only need some trajectories of states, actions and rewards generated by simulation (i.e., knowledge of the transition probabilities is not required). In other words, ADPP directly learns an approximation of the optimal policy by Monte-Carlo simulation.
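A minimal Python sketch of one ADPP iteration under these conventions follows. It is our own illustration: the trajectory sampler `simulate` and the feature map `phi` are placeholders the user must supply, and only the empirical projection (31)-(32) is spelled out.

```python
import numpy as np

def adpp_iteration(theta, phi, actions, simulate, gamma=0.99, eta=0.1, m=1000):
    """One iteration of algorithm 2 (ADPP).

    theta    : (k,) current weights, so that p_hat(s, a) = phi(s, a) @ theta.
    phi      : callable (s, a) -> (k,) feature vector.
    actions  : list of the discrete actions.
    simulate : callable (theta, n) -> list of n tuples (s, a, r) generated by the
               policy induced by theta, where r is the reward of the observed transition.
    """
    traj = simulate(theta, m + 1)

    def M_eta(s):
        # soft-max expectation of the approximate preferences, cf. equation (9)
        p = np.array([phi(s, a) @ theta for a in actions])
        w = np.exp(eta * (p - p.max()))
        return (w / w.sum()) @ p

    Phi_tilde = np.array([phi(s, a) for (s, a, _) in traj[:m]])
    # empirical targets of equation (32)
    targets = np.array([
        phi(s, a) @ theta + r + gamma * M_eta(s_next) - M_eta(s)
        for (s, a, r), (s_next, _, _) in zip(traj[:m], traj[1:])
    ])
    theta_new, *_ = np.linalg.lstsq(Phi_tilde, targets, rcond=None)   # equation (31)
    return theta_new
```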

6. Numerical Results

In this section, we investigate the effectiveness of the DPP algorithms proposed in section 3 and section 5. We consider two distinct problem domains: a discrete state-action grid-world problem and the mountain-car problem (Sutton, 1996). The grid-world problem (subsection 6.1) allows us to test the convergence properties of algorithm 1. The mountain-car problem (subsection 6.2) shows the effectiveness of DPP in the presence of function approximation; there, we also compare ADPP with several approximate-DP algorithms.

Both the mountain car and the grid world are episodic tasks (i.e., they eventually finish at some end state). In order to apply the theory of infinite-horizon MDPs to these problems, we introduce an absorbing reward-free state, which occurs at the end of an episode; once the agent reaches that state, it remains there at no further reward (Bertsekas, 2007a, chap. 2). The action-value functions and the action preferences (in the grid-world problem) or their corresponding parameters (in the mountain-car problem) are initialized randomly within the interval [−0.5, 0.5]. For both problems, we set the discount factor γ to 0.99.

In order to evaluate the performance of an algorithm, we consider the error between the value function under the policy induced by that algorithm and the optimal value function. We use the L∞-normed error E∞ to evaluate all the algorithms in the grid-world problem:
$$E_\infty = \left\|v^{\pi_n} - v^*\right\|_\infty,$$


where $v^{\pi_n}$ denotes the value function associated with the policy of the algorithm at iteration $n$. For the mountain-car problem, we use the root mean-squared error (RMSE) between the value function under the policy induced by the algorithm at iteration $n$ and the optimal value function, since the $L_2$-norm is usually a better measure than the $L_\infty$-norm for evaluating approximate-DP approaches (Munos, 2005):
$$\mathrm{RMSE} = \frac{1}{|S|^{0.5}}\left\|v^{\hat\pi_n} - v^*\right\|_2.$$

6.1 Grid-World Problem

Our first benchmark is a four-action episodic grid-world problem, in which the agent begins at some random point and must find the shortest path to the goal, which ends the episode, while avoiding the T-shaped obstacle and the surrounding walls, hitting which also terminates the episode (figure 1). The agent receives a reward of 1 for reaching the goal and a reward of 0 for ending up in any other state. Four actions (up, down, left, right) are associated with each of the 44 states (the only exception is the goal state). Each action succeeds with probability 0.85; with probability 0.15, the agent makes a random move.

Figure 1: Illustration of the grid-world problem. The goal is to find the shortest path to the upper-right corner, while avoiding the obstacle and the surrounding walls. The agent receives a reward 1 by reaching the goal and 0 by ending up in any other state. Each action succeeds with probability 0.85 and with probability 0.15, the agent makes a random move. The arrows show the direction of the optimal control for γ = 0.99.


We verify the convergence of algorithm 1 (DPP) on the grid-world problem in figure 2.⁴ We observe that the normed error E∞, as a function of the number of iterations n, decreases at an exponential rate (i.e., the policy induced by DPP asymptotically converges to the optimal policy). In addition, we observe that the convergence accelerates substantially as η increases.

[Figure 2 plot: E∞ (log scale) vs. the number of iterations n on the grid world, for η = 0.1, 1 and 10.]

Figure 2: The convergence rate of algorithm 1 (DPP) on the grid-world problem. The normed error E∞ vs. the number of iterations n. The results are averages over 50 runs.

6.2 Mountain-Car Problem

To evaluate the effectiveness of our method in large-scale problems, we choose the well-known mountain-car problem. The mountain-car domain has been reported to diverge in the presence of approximation (Boyan and Moore, 1995), although successful results have been obtained by carefully crafting the basis functions (Sutton, 1996). The specification of the mountain-car domain is described by Sutton and Barto (1998, chap. 8.4). The problem has a continuous two-dimensional state space and three discrete actions A = {±1, 0}. To obtain an accurate estimate of the optimal value function, we discretize the state space with a 1751×151 grid and solve the resulting infinite-horizon MDP using value iteration (Bertsekas, 2007b, chap. 1). The resulting value function (figure 3) is only used to compute the RMSE of the approximate algorithms and the optimal policy of the discretized MDP.

To evaluate the relative performance of algorithm 2 (ADPP), we compare it with approximate Q-iteration (AQI) (Bertsekas, 2007b, chap. 6.4.1) and optimistic approximate policy iteration (OAPI) (Bertsekas and Tsitsiklis, 1996, chap. 6.4) on the mountain-car problem.⁵ We also report results on the natural actor-critic with least-squares temporal difference, NAC-LSTD(0) and NAC-LSTD(1) (Peters and Schaal, 2008).

4. We only plot E∞ for update rule (8), since the results show no significant difference between using (7) and (8).
5. In addition to the ε-greedy policy (ε = 0.01), we also implement OAPI with a soft-max policy. Further, to facilitate the computation of the control policy, we approximate the action-value functions as opposed to the value functions in OAPI.


The step size $\beta_n$ for the actor update rule of NAC-LSTD is defined by:
$$\beta_n = \frac{\beta_0}{t_\beta n + 1}, \qquad n = 1, 2, 3, \dots, \tag{33}$$

where the free parameters $\beta_0$ and $t_\beta$ are positive constants. Also, the temperature factor $\tau_n$ for the soft-max operator in OAPI with a soft-max policy is given by:
$$\tau_n = \frac{\tau_0}{t_\tau\log(n + 1) + 1}, \qquad n = 1, 2, 3, \dots, \tag{34}$$
where the free parameters $\tau_0$ and $t_\tau$ are positive constants. All the free parameters are optimized for the best asymptotic performance.⁶


Figure 3: The optimal value function for the mountain-car problem obtained by value iteration. The vertical axis indicates the position of the car in the x direction. The horizontal axis indicates the difference between the current and the last position in the x direction. For better visualization the figure should be viewed in color.

In addition to the standard approximate-DP algorithms, we compare our results with the performance of the best linear function approximator (optimal FA). The best linear function approximator is specified as the projection of the optimal deterministic policy $\pi^*$ of the discretized MDP onto the column space spanned by the $\Phi$ defined in section 5, $\hat\pi^* = \Phi w^*$. The approximate optimal control $\hat a^* : S \mapsto A$ is then identified by $\hat a^*(s) = \arg\max_{a \in A}\Phi(sa, \cdot)w^*$. The RMSE between the value function associated with $\hat a^*$ and the optimal value function serves as a baseline for the performance of the approximate algorithms.

6. The asymptotic performance of each method is specified by the average of its RMSE over the last 400 of the total 500 iterations. For ADPP this results in the choice η = 0.1. For OAPI (soft-max) this results in the choice τ0 = 10 and tτ = 1. For NAC-LSTD(0) this results in the choice β0 = 1 and tβ = 1. For NAC-LSTD(1) this results in the choice β0 = 0.1 and tβ = 0.1.


Here, we consider k = 27 random radial basis functions to approximate the state-action dependent quantities, i.e., the action-value functions, the action preferences (NAC-LSTD and ADPP) and the approximate optimal policy $\hat\pi^*$ (the optimal FA), and k = 9 radial basis functions⁷ to approximate the state-dependent value functions in NAC-LSTD. We terminate each iteration either when the projection algorithm has converged (i.e., $\frac{1}{k^{0.5}}\|\tilde\theta^*_{i+1} - \tilde\theta^*_i\|_2 < 0.05$, where $\tilde\theta^*_i$ is the least-squares estimate of $\tilde\theta^*$ for a trajectory of $i$ samples) or when the maximum number of samples, m = 500,000, is reached.

We first examine the effect of η on the convergence of ADPP. Figure 4 shows the RMSE between the value function of the policy induced by ADPP and the optimal value function as a function of the number of iterations for different values of η. We observe that ADPP converges to a near-optimal policy for small η, but fails to converge for large values of η. Further, we observe that ADPP (η = 1) improves much faster than the convergent ADPP (η = 0.1) in the early stages of optimization.

Figure 5.a shows that ADPP asymptotically outperforms the other methods and substantially narrows the gap with the optimal-FA performance. We also observe a decline in the performance of both ADPP and OAPI (soft-max) after the corresponding RMSEs are minimized at iteration 150 and iteration 350, respectively. In order to compare the transient behavior of these algorithms, we plot the RMSE in terms of the computational cost for the first 100 seconds of simulation (figure 5.b). We observe that some DP methods, in particular NAC-LSTD(0), improve faster than ADPP in the early stages of optimization.

We provide the error surfaces between the optimal value function and the value function of the policy induced by the algorithms in figure 6. We observe that ADPP achieves a small error almost everywhere, except on a narrow spiral curve extending from (0, 0.5) to (0, −0.65) and its neighborhood. The same pattern is observed for the optimal FA (figure 6.g). Figure 6.a.3 shows that the performance of ADPP around (0, −0.65) slightly deteriorates in comparison to figure 6.a.2, while it does not change significantly in the rest of the state space. Figure 6.e shows that NAC-LSTD(0) rapidly improves the performance over the whole state space; however, it asymptotically fails to achieve optimality. We observe that the standard DP algorithms, in general, do not achieve optimality, except for the high-value states (i.e., the goal and its neighboring states).

Overall, we conclude that DPP, when combined with function approximation, converges to near-optimal performance in situations where other methods fail to converge. ADPP can be substantially slower than NAC-LSTD(0) in the early stages of the optimization process, but it asymptotically outperforms NAC-LSTD(0) by a wide margin. We also observe that in the grid-world problem DPP converges to the optimal solution at an exponential rate.

7. These basis functions are randomly placed within the range of their inputs. The variance matrix is identical for all the basis functions:
$$\Sigma = \begin{bmatrix}\sigma_x^2 & 0 & 0\\ 0 & \sigma_{\dot x}^2 & 0\\ 0 & 0 & \sigma_u^2\end{bmatrix} = \begin{bmatrix}0.1606 & 0 & 0\\ 0 & 0.0011 & 0\\ 0 & 0 & 0.2222\end{bmatrix},$$
where σx, σẋ and σu are set to a fixed fraction of the admissible range of the position of the car x, the difference between the current and last position ẋ, and the control action u, respectively. For the value functions: Σv = Σ({1, 2}, {1, 2}).
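For concreteness, a sketch of random Gaussian radial-basis features over (x, ẋ, u) in the spirit of footnote 7 follows. The Gaussian form, the admissible ranges and the function names are our assumptions for illustration; only the diagonal variances are taken from the footnote.

```python
import numpy as np

def make_rbf_features(k=27, seed=0):
    """k random Gaussian RBFs over (x, x_dot, u) for the mountain car (a sketch)."""
    lo = np.array([-1.2, -0.07, -1.0])            # assumed admissible ranges
    hi = np.array([ 0.5,  0.07,  1.0])
    rng = np.random.default_rng(seed)
    centers = rng.uniform(lo, hi, size=(k, 3))    # randomly placed centers
    sigma2 = np.array([0.1606, 0.0011, 0.2222])   # diag(Sigma) from footnote 7

    def phi(x, x_dot, u):
        z = np.array([x, x_dot, u])
        return np.exp(-0.5 * (((z - centers) ** 2) / sigma2).sum(axis=1))

    return phi

phi = make_rbf_features()
print(phi(-0.5, 0.0, 1.0).shape)    # (27,)
```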


[Figure 4 plot: RMSE (log scale) vs. the number of iterations n on the mountain car, for ADPP with η = 0.1, 1 and 10, together with the optimal-FA baseline.]

Figure 4: RMSE between the value function of the policy induced by ADPP and the optimal value function obtained by discretization of the state space as a function of the number of iterations. The results are averages over 50 runs.

[Figure 5 plots, ADPP vs. approximate DP on the mountain-car problem: (a) RMSE vs. the number of iterations n and (b) RMSE vs. CPU time (sec), for ADPP, AQI, OAPI (ε-greedy), OAPI (soft-max), NAC-LSTD(0), NAC-LSTD(1) and the optimal FA.]

Figure 5: The root mean squared error (RMSE) as a function of (a) the number of iterations and (b) elapsed CPU time. The RMSE between the value function of the policy induced by the approximate algorithms and the optimal value function obtained by discretization of the state space. The results are averages over 50 runs.


[Figure 6 panels: error surfaces for the methods (a)-(g) listed in the caption, shown at n = 10, 100 and 500; axes x vs. ẋ, color scale roughly from 20 to 80.]

Figure 6: The error between the optimal value function and the value function under the policy induced by (a) ADPP, (b) AVI, (c) OAPI (ε-greedy), (d) OAPI (soft-max), (e) NAC-LSTD(0), (f) NAC-LSTD(1) and (g) the optimal FA. The vertical axis indicates the position of the car in the x direction. The horizontal axis indicates the difference between the current and last position in the x direction. The results are averages over 50 runs. For better visualization the figure should be viewed in color.


7. Related Work

There are several other approaches based on a direct search for the optimal policy. The most popular is gradient-based policy search, which directly estimates the gradient of the performance with respect to the parameters of the control policy by Monte-Carlo simulation (Kakade, 2001; Baxter and Bartlett, 2001). Although gradient-based policy-search methods guarantee convergence to a local maximum, they suffer from high variance of the estimates and from local maxima, which makes it difficult to attain optimality in practice. Here, we did not compare DPP with gradient-based policy-search methods since their performance has been reported to be inferior to actor-critic policy-gradient methods (Peters and Schaal, 2008).

Recently, Wang et al. (2007) introduced a dual representation technique called dual dynamic programming (dual-DP) based on manipulating the state-action visit distributions instead of the value function. Wang et al. (2007) report better convergence results than value-based methods when dual-DP is combined with function approximation: since the state-action visit distribution is a normalized probability distribution, the dual-DP solution cannot diverge. The drawback of the dual approach is that it needs more memory than the primal representation.

DPP has a superficial similarity to using a soft-max policy with value iteration (Farias and Roy, 2000). The main difference is that such a value-iteration scheme will not converge to the optimal solution unless the temperature factor of the soft-max is lowered to 0, but in that case the global convergence is affected in the presence of function approximation. Instead, DPP combined with function approximation converges to the optimal solution for a constant inverse temperature η.

The work proposed in this paper is related to recent work by Kappen et al. (2009); Kappen (2005) and Todorov (2006), who formulate a stochastic optimal control problem to find a conditional probability distribution p(s′|s) given an uncontrolled dynamics p̄(s′|s). The control cost is the KL divergence between p(s′|s) and p̄(s′|s) exp(r(s)). The difference is that in their work a restricted class of control problems is considered, for which the optimal solution p can be computed directly in terms of p̄ without requiring Bellman-like iterations. The present approach is more general, but does require Bellman-like iterations.

8. Discussion

We have presented a novel direct policy-search method, dynamic policy programming (DPP), which computes the optimal policy through the DPP operator. We have proven the convergence of our method to the optimal policy for the tabular case and have provided a sub-linear convergence rate for DPP. We have reason to believe that a tighter bound for DPP is possible. The main reason can be traced back to the sub-linear error bound on the soft-max operator in lemma 3. We suspect that this error bound is not tight, since it was originally a bound on the difference between the soft-max operator and the log-partition function. It may be possible to obtain an exponential convergence rate based on the approach proposed by Farias and Roy (2000). This idea is supported by our numerical results in figure 2.


Experimental results for both the grid world and the mountain car show rapid convergence of the DPP-based algorithms towards a near-optimal policy, and ADPP outperforms the standard approximate-DP methods. The reason for the success of ADPP is that it does not suffer from the kind of policy degradation caused by inaccurate approximation of the value function in greedy methods, or from local maxima as in NAC-LSTD.⁸ On the other hand, ADPP makes little progress towards the optimal performance in the early stages of optimization due to our choice of η. We know that the transient behavior of DPP strongly depends on the value of η (see figure 2). At the same time, in order to guarantee asymptotic convergence, we have chosen η rather small, which causes relatively slow convergence. In particular, we observe that NAC-LSTD(0) improves the performance at a much faster rate than ADPP in the early stages of the optimization process. To improve the transient behavior of ADPP, one could combine NAC-LSTD(0) and ADPP into one algorithm: the key idea is to begin with the natural-gradient policy update and then gradually switch to ADPP.

In section 6.2 we observed that the performance of ADPP and OAPI, in terms of the RMSE, slightly deteriorates in the final stage of optimization. We also observed that for ADPP this happens in a neighborhood of the state with the lowest value. The main reason is that both ADPP and OAPI are on-policy algorithms and, as the control policy approaches the optimal policy, they tend to draw more samples from the high-value states than from the low-value ones. This leads to a more accurate approximation of the optimal policy for high-value states than for low-value ones, and eventually the error may increase for low-value states. On the other hand, it allows approximate on-policy algorithms (e.g., ADPP and OAPI) to improve their overall performance by specializing in high-value states at the expense of the less frequently visited low-value states.

The main theoretical problem of combining DPP with function approximation is that it is prone to divergence. On the other hand, the experiment in the mountain-car domain shows that one can avoid divergence by setting the parameter η to a small value. Overall, we think that the role of η as a stabilizer for ADPP algorithms needs to be investigated further, both theoretically and numerically. This will be a subject for a future study.

In this work, we have not considered the case where we have limited access to the environment through simulation and need to learn the optimal policy incrementally from each observation. However, in analogy with Borkar (1997), we think that it is possible to extend DPP (or ADPP) to such problems by using stochastic approximation techniques. This may be another line of future work.

An important extension of our results would be to apply DPP-based algorithms to large-scale or continuous-action problems. In that case, we need an efficient way to estimate Mη P(s) in update rule (8) or Lη P(s) in (7). We can estimate Mη P(s) efficiently using Monte-Carlo sampling methods (MacKay, 2003, chap. 29 and chap. 30), since Mη P(s) is the expected value of P(s, a) under the soft-max policy π. In order to estimate Lη P(s), which is the logarithm of a partition sum, one may require more sophisticated sampling methods, such as thermodynamic integration (Gelman and Meng, 1998).
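For example, a minimal Monte-Carlo estimate of Mη P(s) could look like the sketch below (our own illustration; here the action set is still enumerated so that the soft-max can be sampled exactly, whereas with a very large action set one would replace `rng.choice` by an approximate sampler):

```python
import numpy as np

def mc_soft_max_value(P_s, eta=1.0, n_samples=1000, seed=0):
    """Monte-Carlo estimate of M_eta P(s) = E_{a ~ softmax(eta P(s, .))}[P(s, a)]."""
    rng = np.random.default_rng(seed)
    w = np.exp(eta * (P_s - P_s.max()))
    pi = w / w.sum()
    a = rng.choice(len(P_s), size=n_samples, p=pi)   # sample actions from the soft-max policy
    return P_s[a].mean()
```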

8. Although OAPI with a soft-max policy does not use a greedy update rule, it behaves rather similarly to greedy methods for low temperatures.


Acknowledgments

We acknowledge N. Vlassis, W. Wiegerinck and N. de Freitas for interesting discussions and V. Gómez for help with software development.

Appendix A.

The maximization in (2) can be performed in closed form using Lagrange multipliers:

L (s, λs ) =

X

πs (a)

X

a a Tss ′ rss′

s′ ∈S

a∈A

# " X  1 πs (a) − 1 . π s ) − λs + γVπ¯∗ (s′ ) − KL (πs k¯ η a∈A

The necessary condition for the extremum with respect to πs is: ∂L (s, λs ) X a a 1 1 0= = Tss′ (rss′ + γVπ¯∗ (s)) − − log ∂πs (a) η η s∈S



πs (a) π ¯s (a)



− λs .

So, the solution is: "

πs (a) = π ¯s (a) exp (−ηλs − 1) exp η

X

a a Tss ′ rss′

s′ ∈S

#  + γVπ¯∗ (s′ ) ,

∀s ∈ S.

(35)

The Lagrange multipliers can be solved from the constraints: 1=

X

πs (a) = exp (−ηλs − 1)

a∈A

X

"

π ¯s (a) exp η

a∈A

X

s′ ∈S

a a Tss ′ rss′

#  + γVπ¯∗ (s′ ) ,

" # X X  1 1 a a ∗ ′ π ¯s (a) exp η Tss − . λs = log ′ rss′ + γVπ ¯ (s ) η η ′

(36)

s ∈S

a∈A

By plugging (36) in to (35) we have: "

π ¯s (a) exp η πs (a) = P

P

s′ ∈S

"

π ¯s (a) exp η

a∈A

a Tss ′

P

s′ ∈S

a rss ′

a Tss ′

+ γVπ¯∗ (s′ )

a rss ′



+

#

γVπ¯∗ (s′ )



#,

∀(s, a) ∈ S × A.

Substituting policy (37) in (2), we obtain: Vπ¯∗ (s)

" # X X  1 a a ∗ ′ π ¯s (a) exp η = log Tss′ rss′ + γVπ¯ (s ) . η ′ s ∈S

a∈A

22

(37)

Dynamic Policy Programming

Appendix B.  kMη p − M∞ pk∞ = max M∞ p(s) − Mη p(s) s∈S

 = max M∞ p(s) − Lη p(s) + Lη p(s) − Mη p(s) s∈S

where the first step follows because Mη p(s) < M∞ p(s). One can show that, M∞ p(s) ≤ Lη p(s). Therefore, we have: kMη p − M∞ pk∞ ≤ max|Lη p(s) − Mη p(s)| s∈S

1 = maxHπ (s), η s∈S

(38)

where π(sa) ∝ exp(p(sa)) and Hπ (s) denotes its entropy for the state s. According to Cover and Thomas (1991, chap. 16) for all s ∈ S, we have Hπ (s) ≤ log(|A|) that by comparison with (38) implies: kMη p − M∞ pk∞ ≤

1 log(|A|). η

Appendix C.

n−1

  X

A

n−1 A ∗ k−1 A k A

qn − q∗ = + T q − q T q − T q

1 n−k+1 n−k ∞





k=1 n−1 X k=1 n−1 X k=1





k−1 A qn−k+1 − T k qA

T n−k



∗ + T n−1 qA 1 −q





A n−1 A γ k−1 qA q1 (s, a) − q∗ ∞ n−k+1 − T qn−k ∞ + γ

by (15). (39)

For 1 ≤ k ≤ n − 1 we obtain:



A

A A A

q

n−k+1 − T qn−k ∞ = γ TΠn−k qn−k − TM∞ qn−k ∞ by (26) and (14)

(40) A ≤ γ Πn−k qA by H¨older’s inequality. n−k − M∞ qn−k ∞

By monotonicity property for any action-value vector q and distribution matrix Π we have: Πq(s) =

X

π(sa)q(sa)≤M∞ q(s),

a∈A

23

∀s ∈ S.

(41)

Gheshlaghi Azar and Kappen

By comparing (41) with (40), we obtain:

A

 A A A

q n−k+1 − T qn−k ∞ ≤ γmax M∞ qn−k (s) − Πn−k qn−k (s) s∈S    cn−k  cmax A ≤ γmax M∞ qA + (s) + − Π q (s) n−k n−k n−k s∈S n−k n − k



γcmax cn−k  A A − Π q =γ q + M ∞ n−k n−1 n−k

+ n−k

n−k ∞

   

c c 2γcmax n−k n−k A A

≤γ

M∞ qn−k + n − k − Πn−k qn−k + n − k + n − k ∞

   

c 2γcmax c n−k n−k A A =γ

Mn−k qn−k + n − k − M∞ qn−k + n − k + n − k ∞ log(|A|) + 2cmax ≤ γ by lemma 3. n−k (42) By substitution of (42) in (39), we obtain:

A

qn − q∗



≤ (log(|A|) + 2cmax )

It is not difficult to show that yields:

n−1 X k=1

γk k=1 n−k

Pn−1

γk 2γ n−1 L + . n−k 1−γ

(43)

 ≤ 2γ (n(1 − γ)2 ). Combining this with (43)

A

q − q∗ ≤ 2γ log(|A|) + 2cmax + 2γ n−1 L n ∞ n(1 − γ)2 1−γ 2 L (1 − γ) log(|A|) + 2L + 2γ n−1 ≤ 2γ 4 n(1 − γ) 1−γ

by (25).

References P. L. Bartlett. An introduction to reinforcement learning theory: Value function methods. Lecture Notes in Artificial Intelligence, 2600/2003:184–202, 2003. J. Baxter and P. L. Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319–350, 2001. D. P. Bertsekas. Dynamic Programming and Optimal Control, volume I. Athena Scientific, Belmount, Massachusetts, third edition, 2007a. D. P. Bertsekas. Dynamic Programming and Optimal Control, volume II. Athena Scientific, Belmount, Massachusetts, third edition, 2007b. D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, Massachusetts, 1996. V. S. Borkar. The o.d.e method for convergence of stochastic approximation and reinforcement learning. SIAM Journal of Control and Optimization, 38(2):447–469, 1997. 24


J. A. Boyan and A. W. Moore. Generalization in reinforcement learning: Safely approximating the value function. In Advances in Neural Information Processing Systems, pages 369–376, 1995. T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley and Sons, New York, N.Y., first edition, 1991. D. P. Farias and B. Van Roy. On the existence of fixed points for approximate value iteration and temporal-difference learning. Journal of Optimization Theory and Applications, 105 (3):589–608, 2000. A. Gelman and X. Meng. Simulating normalizing constants: from importance sampling to bridge sampling to path sampling. Statistical Science, 13(2):163–185, May 1998. S. Kakade. Natural policy gradient. In Advances in Neural Information Processing Systems 14, pages 1531–1538, Vancouver, British Columbia, Canada, 2001. H. J. Kappen. Path integrals and symmetry breaking for optimal control theory. Statistical Mechanics, 2005(11):P11011, 2005. H. J. Kappen, V. G´ omez, and M. Opper. Optimal control as a graphical model inference problem, 2009. URL http://arxiv.org/pdf/0901.0633v2. M. G. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003. D. J. C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, Cambridge, United Kingdom, first edition, 2003. R. Munos. Error bounds for approximate value iteration. In Proceedings of the 20th National Conference on Artificial Intelligence, volume II, pages 1006–1011, Pittsburgh, Pennsylvania, 2005. J. Peters and S. Schaal. Natural actor-critic. Neurocomputing, 71(7–9):1180–1190, 2008. R. S. Sutton. Generalization in reinforcement learning: succesful examples using sparse coarse coding. In Advances in Neural Information Processing Systems 9, pages 1038– 1044, 1996. R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, Massachusetts, 1998. R. S. Sutton, D. McAllester, S. Singh, and Y Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, pages 1057–1063, Denver, Colorado, USA, 1999. Cs. Szepesvari. Reinforcement learning algorithms for mdps – a survey. Technical Report TR09–13, Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada, 2009. 25


E. Todorov. Linearly-solvable markov decision problems. In Proceedings of the 20th Annual Conference on Neural Information Processing Systems, pages 1369–1376, Vancouver, British Columbia, Canada, 2006. T. Wang, D. Lizotte, M. Bowling, and D. Schuurmans. Stable dual dynamic programming. In Proceedings of the 21st Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, 2007.
