
Efficient Sampling in Approximate Dynamic Programming Algorithms

Cristiano Cervellera∗

Marco Muselli†

Abstract

Dynamic Programming (DP) is known to be a standard optimization tool for solving Stochastic Optimal Control (SOC) problems, either over a finite or an infinite horizon of stages. Under very general assumptions, commonly employed numerical algorithms are based on approximations of the cost-to-go functions, by means of suitable parametric models built from a set of sampling points in the d-dimensional state space. Here we discuss the problem of sample complexity, i.e., how "fast" the number of points must grow with the input dimension in order to obtain an accurate estimate of the cost-to-go functions in typical DP approaches such as value iteration and policy iteration. It is shown that choosing the sampling points from low-discrepancy sequences, commonly used for efficient numerical integration, makes it possible to achieve, under suitable hypotheses, an almost linear sample complexity, thus helping to mitigate the curse of dimensionality of the approximate DP procedure.

Keywords: stochastic optimal control problem, dynamic programming, sample complexity, deterministic learning, low-discrepancy sequences.

1 Introduction

Consider the following special case of Markovian decision process, formulated under very general hypotheses. A dynamic system evolves through discrete temporal stages according to the general stochastic state equation

x_{t+1} = f(x_t, u_t, θ_t),  t = 0, 1, ...

where x_t ∈ X_t ⊂ R^d is a state vector, u_t ∈ U_t ⊂ R^m is a control vector and θ_t ∈ Θ_t ⊂ R^q is a random vector. Suppose the random vectors θ_t are characterized by a probability measure P(θ_t) with density p(θ_t), defined on the Borel σ-algebra of R^q. The purpose of the control is to minimize a cost function which has an additive form over the various stages. We consider two versions of this optimization problem, namely the T-stage stochastic optimal control (T-SOC) problem and the discounted infinite-horizon stochastic optimal control (∞-SOC) problem. In both cases, as the decision problem is Markovian, we want to derive control functions in closed-loop form, i.e., the control vector at each stage must be a function µ_t (usually called policy) of the current state vector:

u_t = µ_t(x_t),  t = 0, 1, ...

∗ Istituto di Studi sui Sistemi Intelligenti per l'Automazione - Consiglio Nazionale delle Ricerche - Via de Marini 6, 16149 Genova, Italy - Email: [email protected]
† Istituto di Elettronica e di Ingegneria dell'Informazione e delle Telecomunicazioni - Consiglio Nazionale delle Ricerche - Via de Marini 6, 16149 Genova, Italy - Email: [email protected]
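Before the two problems are stated formally, a minimal sketch of how such a system evolves in closed loop may help fix ideas. The dynamics f, stage cost h, noise model and policy µ_t below are hypothetical placeholders, not taken from the paper; only the structure x_{t+1} = f(x_t, µ_t(x_t), θ_t) mirrors the formulation above.

```python
import numpy as np

# Illustrative 2-dimensional linear system with additive noise (placeholder
# choices, not from the paper): x_{t+1} = A x_t + B u_t + theta_t.
A = np.array([[1.0, 0.1], [0.0, 0.9]])
B = np.array([[0.0], [0.1]])

def f(x, u, theta):
    """Stochastic state equation x_{t+1} = f(x_t, u_t, theta_t)."""
    return A @ x + B @ u + theta

def h(x, u, theta):
    """Stage cost (hypothetical quadratic cost)."""
    return float(x @ x + 0.1 * u @ u)

def mu(t, x):
    """Closed-loop policy u_t = mu_t(x_t); here a simple linear feedback."""
    return np.array([-0.5 * x[1]])

rng = np.random.default_rng(0)
T = 10
x = np.array([1.0, 0.0])                     # initial state x_0
total_cost = 0.0
for t in range(T):
    theta = 0.01 * rng.standard_normal(2)    # sample theta_t from its density
    u = mu(t, x)
    total_cost += h(x, u, theta)
    x = f(x, u, theta)                       # state transition
print("cost of one simulated trajectory:", total_cost)
```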


T-stage Stochastic Optimal Control (T-SOC) problem. We want to find the optimal control law u^◦ = col(µ^◦_0(x_0), ..., µ^◦_{T-1}(x_{T-1})) that minimizes

F(u) = E_θ [ \sum_{t=0}^{T-1} h(x_t, u_t, θ_t) + h_T(x_T) ]

subject to the constraints

µ_t(x_t) ∈ U_t,  t = 0, ..., T-1

and

x_{t+1} = f(x_t, µ_t(x_t), θ_t),  t = 0, ..., T-1

where x_0 is a given initial state, θ = col(θ_0, ..., θ_{T-1}), u = col(u_0, ..., u_{T-1}), h(x_t, u_t, θ_t) is the cost paid at the single stage t and h_T(x_T) is the cost associated with the final stage^1. □

Discounted Infinite-horizon Stochastic Optimal Control (∞-SOC) problem. When the number of stages T is not limited, we usually look for stationary policies, i.e., policies that do not change from stage to stage. We consider here discounted problems, where the effect of future costs is weighted by a parameter β ∈ [0, 1). Therefore, the cost to be minimized takes on the following form

lim_{T→∞} E_θ [ \sum_{t=0}^{T} β^t h(x_t, u_t, θ_t) ]

subject to

u_t = µ(x_t),  t = 0, 1, ...

and

x_{t+1} = f(x_t, u_t, θ_t),  t = 0, 1, ... □
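Both problems only require the ability to simulate the system and accumulate stage costs, which suggests a simple Monte Carlo check. The sketch below estimates the discounted cost of a fixed stationary policy µ by averaging truncated simulated trajectories (the neglected tail is small since β < 1); the dynamics, cost and policy are the same hypothetical placeholders used in the previous sketch, and nothing here computes an optimal policy.

```python
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 0.9]])
B = np.array([[0.0], [0.1]])

def f(x, u, theta):
    return A @ x + B @ u + theta

def h(x, u, theta):
    return float(x @ x + 0.1 * u @ u)

def mu(x):
    # stationary policy u = mu(x)
    return np.array([-0.5 * x[1]])

def discounted_cost(x0, beta=0.9, horizon=200, n_runs=500, seed=0):
    """Monte Carlo estimate of E_theta[ sum_t beta^t h(x_t, u_t, theta_t) ]."""
    rng = np.random.default_rng(seed)
    costs = np.empty(n_runs)
    for r in range(n_runs):
        x, c = np.array(x0, dtype=float), 0.0
        for t in range(horizon):                  # truncate the infinite horizon
            theta = 0.01 * rng.standard_normal(2)
            u = mu(x)
            c += beta**t * h(x, u, theta)
            x = f(x, u, theta)
        costs[r] = c
    return costs.mean()

print("estimated discounted cost:", discounted_cost([1.0, 0.0]))
```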

The Dynamic Programming (DP) algorithm, introduced by Bellman [1], is the standard tool for the solution of SOC problems, as is documented by the large amount of studies devoted to this method through the years. The basic idea underlying the DP procedure is to define, at each stage t, a function, commonly named cost-to-go or value function^2, which quantifies the cost that has to be paid from that stage on to the end of the time horizon. The basics of the recursive solution for SOC problems are introduced and discussed in several classic references on DP methods and applications (see, for example, [1, 2, 3]). Among the most recent surveys on DP techniques and Markov decision processes in general, two excellent monographs are [4, 5]. Although efficient variations of the DP procedure exist for the deterministic version of the SOC problem, such as Differential Dynamic Programming [6], the presence of the random vectors θ_t makes the DP equations analytically solvable only when certain assumptions on the dynamic system and on the cost function are satisfied^3. In the general case we must look for approximate numerical solutions, i.e., we must accept sub-optimal policies based on an approximation of the cost and possibly of the control functions. Several numerical algorithms have been proposed for the approximate solution of the DP procedure (see, e.g., [7, 8, 9, 10, 11]). However, if the problem is stated under very general hypotheses, any method based on discretization suffers from an exponential growth of the computational requirements (usually called curse of dimensionality), which prevents finding accurate solutions for nontrivial dimensions d (see, e.g., [11]).

^1 In many cases h_T(x_T) ≡ 0.
^2 In the following, we will use the term "cost-to-go".
^3 Typically, these assumptions are the classic "LQ hypotheses" (linear system equation and quadratic cost).

Still, despite the unavoidable curse of dimensionality, there is the need to find computationally tractable methods that can be effectively applied to the SOC context, possibly introducing hypotheses on the regularity of the functions involved. A very general algorithm is based on the approximation of the cost-to-go function by means of some fixed-structure parametric architecture, which is "trained" on the basis of sample points coming from the discretization of the state space. There are many examples of this approach, employing different structures: among others, polynomial approximators [7], splines [10], multivariate adaptive regression splines [12] and neural networks [13] (in the last case the term neuro-dynamic programming is often used). Once a suitably "rich" class of approximating architectures is chosen, the choice of the sample points is the most critical issue of the procedure. In general, finding the best function (i.e., the one that is closest to the true unknown cost-to-go function) inside the class of models corresponds to an estimation process that is usually performed by adopting a local or global optimization algorithm, which aims at finding the point of minimum (i.e., the approximator parameters) of a nonlinear function measuring the error between the actual cost-to-go function and its current approximation.

The present paper deals with the curse of dimensionality related to sample complexity, i.e., the rate at which the number of sampling points must grow in order to achieve a desired accuracy of the estimation. The most common sampling technique used in the literature is the "full uniform" grid, i.e., the uniform discretization of each component of the state space into a fixed number of values. This clearly leads to a curse of dimensionality: if each of the d components of the state space is discretized by means of q equally spaced values, the number of points of the grid is equal to q^d. Therefore, in order to prove that the estimation problem can be solved with non-exponential sample complexity, finer sampling schemes have to be investigated.

Concerning deterministic sampling, a promising approach is based on the use of Orthogonal Arrays [12], where Multivariate Adaptive Regression Splines (MARS) are employed as approximating architectures. Orthogonal Arrays are a family of subsets of the full uniform grid whose size needs to grow only polynomially with the dimension d. However, theoretical results on the convergence of the estimation process are not currently available. Concerning random sampling, interesting theoretical results on functional estimation come from the field of Statistical Learning Theory (SLT) [14], which deals with the general problem of learning functional dependences from empirical data. In a typical learning problem, the data are generated randomly, according to some probability, by an external source. Under suitable hypotheses on the structure of the class of models, it has been proven that the sample complexity of the estimation is quadratic, almost independently of the dimension d. This is consistent with the typical quadratic convergence of various algorithms based on Monte Carlo discretization techniques, such as the integration of multivariate functions [15].
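As an illustration of the general scheme just described (a sketch under purely hypothetical modelling choices, not the algorithm analyzed in this paper), the code below performs approximate value iteration for a one-dimensional discounted problem: it samples states, computes Bellman backup targets by averaging over the noise and minimizing over a discretized control set, and fits a fixed-structure parametric model (here a simple polynomial standing in for the architectures cited above) to those targets by least squares.

```python
import numpy as np

beta = 0.9
controls = np.linspace(-1.0, 1.0, 21)          # discretized control set U
thetas = np.linspace(-0.1, 0.1, 5)             # quadrature points for the noise
theta_w = np.full(thetas.size, 1.0 / thetas.size)

def f(x, u, theta):                            # placeholder dynamics
    return np.clip(0.9 * x + 0.5 * u + theta, 0.0, 1.0)

def h(x, u):                                   # placeholder stage cost
    return (x - 0.5) ** 2 + 0.05 * u ** 2

def features(x, degree=4):                     # fixed-structure parametric model
    return np.vander(np.atleast_1d(x), degree + 1)

def model(x, alpha):
    return features(x) @ alpha

def bellman_backup(x_samples, alpha):
    """Targets J_new(x) = min_u h(x,u) + beta * E_theta[ J(f(x,u,theta)) ]."""
    targets = np.empty(x_samples.size)
    for i, x in enumerate(x_samples):
        q = [h(x, u) + beta * np.sum(theta_w * model(f(x, u, thetas), alpha))
             for u in controls]
        targets[i] = min(q)
    return targets

x_samples = np.random.default_rng(0).random(64)   # sampled states in [0,1)
alpha = np.zeros(5)                                # initial cost-to-go approximation
for _ in range(30):                                # approximate value iteration
    y = bellman_backup(x_samples, alpha)
    # fit the parametric model to the backup targets (least-squares training)
    alpha, *_ = np.linalg.lstsq(features(x_samples), y, rcond=None)
print("fitted cost-to-go at x = 0.5:", model(0.5, alpha)[0])
```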
In the present work the sample complexity issue is addressed by employing a deterministic version of learning theory, first developed in its general context in [16]. Applied to dynamic programming, this approach leads to the estimation of the cost-to-go functions on the basis of quasi-random sampling of the state space. It is possible to prove that, under mild regularity conditions, an almost linear convergence of the estimation error can be achieved. We point out that the method has already proved successful in practice for the solution of high-dimensional problems, such as optimal reservoir planning and inventory forecasting [17, 18, 19].

The work is organized as follows. In Section 2 the theory of deterministic learning is reported, and bounds on the estimation error are derived. In Section 3 algorithms for the finite-horizon and the discounted infinite-horizon cases, based on the learning of the cost-to-go functions, are presented. Section 4 contains results on the application of deterministic learning to approximate DP. In Section 5 simulation results are presented. Finally, the Appendix contains proofs, figures and tables.

2 Deterministic learning

We briefly summarize the main results of the deterministic learning framework that will be used in the following for the context of SOC problems. A detailed treatment can be found in [16], together with proofs of the theorems.

Consider the following problem of function estimation: a functional dependence of the form y = g(x), where x ∈ X ⊂ R^d and y ∈ Y ⊂ R, has to be learnt from a set of samples (x^L, y^L) ∈ (X^L × Y^L), where x^L = {x_0, ..., x_{L-1}}, y^L = {y_0, ..., y_{L-1}}, y_l = g(x_l). In particular, we define:

1. A family of parameterized functions Γ = {ψ(x, α) : α ∈ Λ ⊂ R^k}, which are the models used for learning.

2. A risk functional R(α) which measures the difference between the true function and the model over X

   R(α) = \int_X ℓ(g(x), ψ(x, α)) dx     (1)

   where ℓ : (Y × Y) → R is a loss^4 function that measures the difference between the function g and its approximation at any point of X.

3. A deterministic algorithm by which a sequence of points x^L ∈ X^L, x^L = {x_0, ..., x_{L-1}}, is generated.

If the class of models Γ is sufficiently rich, we can drive R(α) to zero. As previously noted, we will not discuss the approximation problem related to the choice of Γ (a discussion of different nonlinear models and their advantageous approximation properties can be found, e.g., in [20]). Also note that the results presented in the following can be easily extended to Y ⊂ R^k by simply considering the k single components independently.

The target of the estimation problem is to find α^* ∈ Λ such that R(α^*) = min_{α∈Λ} R(α). If the minimum does not exist, the problem consists in finding α^* ∈ Λ such that

   R(α^*) < inf_{α∈Λ} R(α) + ε

for a given ε > 0. As we know g only at the points of the sequence x^L, we try to minimize R on the basis of the available data. In particular, we choose a training algorithm that corresponds to minimizing the empirical risk given L observation samples

   R_emp(α, x^L) = \frac{1}{L} \sum_{l=0}^{L-1} ℓ(y_l, ψ(x_l, α))

and define α_L as the parameter vector obtained after the minimization of R_emp(α, x^L). In order to measure the difference between the actual value of the risk R(α_L) after the training and the best achievable risk R(α^*), we define

   r(α_L) = R(α_L) − R(α^*).

Definition 2.1 We say that the training algorithm A is deterministically consistent if r(α_L) → 0 as L → ∞.

^4 The loss function must be symmetric and satisfy ℓ(z, z) = 0 and ℓ(z_1, z_2) > 0 for z_1 ≠ z_2.

The term sample complexity will be used to denote the rate of convergence of r(α_L) as L grows. For a given class of models Γ, we can adopt this rate as an efficiency measure of the chosen sequence x^L.
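The following minimal sketch illustrates these definitions numerically, with hypothetical choices of g, Γ and the loss: the empirical risk with squared loss is minimized by least squares for increasing L, and the excess risk r(α_L) = R(α_L) − R(α^*) is approximated on a fine reference grid to observe its decay.

```python
import numpy as np

def g(x):                       # true (unknown) function, placeholder choice
    return np.sin(2 * np.pi * x)

def psi_features(x, k=6):       # model class: psi(x, alpha) = features(x) @ alpha
    return np.vander(x, k)

x_ref = (np.arange(20000) + 0.5) / 20000        # fine grid to approximate R(alpha)

def risk(alpha):                # R(alpha): squared loss integrated over X = [0,1)
    return np.mean((g(x_ref) - psi_features(x_ref) @ alpha) ** 2)

# best achievable risk R(alpha*) within the model class (least squares on the grid)
alpha_star, *_ = np.linalg.lstsq(psi_features(x_ref), g(x_ref), rcond=None)
R_star = risk(alpha_star)

for L in [16, 64, 256, 1024]:
    # deterministic sampling sequence x^L (here a simple equispaced sequence;
    # Section 2.1 motivates low-discrepancy sequences instead)
    xL = (np.arange(L) + 0.5) / L
    yL = g(xL)
    # empirical risk minimization over the samples
    alpha_L, *_ = np.linalg.lstsq(psi_features(xL), yL, rcond=None)
    print(L, "excess risk r(alpha_L) ≈", risk(alpha_L) - R_star)
```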

2.1 Deterministic learning rates based on discrepancy

We present some theoretical results based on a measure of the spread of points called discrepancy, commonly employed in numerical analysis [21] and probability [22]. In particular, it can be proved [16] that the sample complexity is directly related to how uniformly the deterministic sequence covers the input space as the number of points L grows. For this reason, a special family of deterministic sequences which yield an almost linear convergence of the discrepancy is considered. Such sequences are usually referred to as low-discrepancy sequences, and are commonly used in numerical integration methods. A detailed description of their construction can be found in [23].
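A sketch of one classical construction, the Halton sequence, is given below (see [23] for rigorous treatments and for other families such as Sobol' and Niederreiter sequences; the paper does not prescribe a specific family here): each coordinate is a van der Corput radical inverse in a distinct prime base. In contrast with a full uniform grid of q^d points, the number of points L can be chosen freely and independently of d.

```python
import numpy as np

def radical_inverse(n, base):
    """Van der Corput radical inverse of the integer n in the given base."""
    inv, denom = 0.0, 1.0
    while n > 0:
        n, digit = divmod(n, base)
        denom *= base
        inv += digit / denom
    return inv

def halton(L, d):
    """First L points of the d-dimensional Halton sequence in [0,1)^d."""
    # one prime base per coordinate (enough primes for moderate d)
    primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47]
    assert d <= len(primes)
    return np.array([[radical_inverse(n, primes[i]) for i in range(d)]
                     for n in range(1, L + 1)])

points = halton(L=200, d=4)   # 200 well-spread points in the 4-dimensional unit cube
print(points.shape, points.min(), points.max())
```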

In the following we will assume that X = [0, 1)^d (i.e., the d-dimensional semi-closed unit cube). The results can be extended to other intervals of R^d, or to more complex input spaces, by suitable transformations [21].

For each vertex of a given subinterval B = \prod_{i=1}^{d} [a_i, b_i] of X, we can define a binary label by assigning '0' to every a_i and '1' to every b_i. For every function ϕ : X → R we define ∆(ϕ, B) as the alternating sum of ϕ computed at the vertices of B, i.e.,

   ∆(ϕ, B) = \sum_{x ∈ eB} ϕ(x) − \sum_{x ∈ oB} ϕ(x)

where eB is the set of vertices with an even number of '1's in their label, and oB is the set of vertices with an odd number of '1's.

Definition 2.2 Let ℘ be any partition of X into subintervals. The variation of ϕ on X in the sense of Vitali is defined by

   V^{(d)}(ϕ) = \sup_℘ \sum_{B ∈ ℘} |∆(ϕ, B)|     (2)
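The quantity ∆(ϕ, B) can be computed directly from its definition by enumerating the 2^d vertices of the box. In the sketch below the test function ϕ is an arbitrary placeholder; the sign of each vertex is determined by the parity of its binary label, as above.

```python
import itertools
import numpy as np

def delta(phi, a, b):
    """Alternating sum of phi over the vertices of B = prod_i [a_i, b_i]:
    sum over even-labelled vertices minus sum over odd-labelled vertices."""
    d = len(a)
    total = 0.0
    for label in itertools.product((0, 1), repeat=d):      # binary vertex labels
        vertex = np.where(np.array(label) == 1, b, a)       # pick a_i or b_i
        sign = 1.0 if sum(label) % 2 == 0 else -1.0         # parity of the label
        total += sign * phi(vertex)
    return total

phi = lambda x: float(np.prod(x))     # placeholder test function phi(x) = x_1 * ... * x_d
print(delta(phi, a=np.array([0.0, 0.0]), b=np.array([1.0, 1.0])))
# For phi(x1, x2) = x1*x2 on [0,1]^2 this returns
# phi(0,0) - phi(1,0) - phi(0,1) + phi(1,1) = 1.0
```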

If the partial derivatives of ϕ are continuous on X, it is possible to write V^{(d)}(ϕ) in an easier way as

   V^{(d)}(ϕ) = \int_0^1 \cdots \int_0^1 \left| \frac{\partial^d ϕ}{\partial x_1 \cdots \partial x_d} \right| dx_1 \cdots dx_d     (3)

where x_i is the i-th component of x.
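When the mixed partial derivative is available, formula (3) can be checked numerically. The sketch below uses a hypothetical smooth test function in d = 2 and compares the midpoint-rule approximation of the integral in (3) with the sum of |∆(ϕ, B)| over a fine uniform partition into subintervals, as in Definition 2.2.

```python
import numpy as np

# Placeholder test function and its mixed partial derivative (d = 2)
phi = lambda x1, x2: np.sin(np.pi * x1) * x2**2
d2phi = lambda x1, x2: np.pi * np.cos(np.pi * x1) * 2 * x2   # d^2 phi / (dx1 dx2)

n = 400
mid = (np.arange(n) + 0.5) / n                       # midpoints of a uniform grid
X1, X2 = np.meshgrid(mid, mid, indexing="ij")

# Right-hand side of (3): integral of |mixed partial| by the midpoint rule
v_integral = np.abs(d2phi(X1, X2)).sum() / n**2

# Definition 2.2 approximated on the uniform partition into n^2 subintervals:
# sum of |Delta(phi, B)| with Delta computed from the four vertices of each cell
edges = np.arange(n + 1) / n
P = phi(*np.meshgrid(edges, edges, indexing="ij"))   # phi on all grid vertices
delta_cells = P[1:, 1:] - P[1:, :-1] - P[:-1, 1:] + P[:-1, :-1]
v_partition = np.abs(delta_cells).sum()

print("integral form (3):", v_integral)
print("partition form (2):", v_partition)
```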

For 1 ≤ k ≤ d and 1 ≤ i_1 < i_2 < ··· < i_k ≤ d, let V^{(k)}(ϕ, i_1, ..., i_k) be the variation in the sense of Vitali of the restriction of ϕ to the k-dimensional face {(x_1, ..., x_d) ∈ X : x_i = 1 for i ≠ i_1, ..., i_k}.

Definition 2.3 The variation of ϕ on X in the sense of Hardy and Krause is defined by

   V_{HK}(ϕ) = \sum_{k=1}^{d} \sum_{1 ≤ i_1 < ··· < i_k ≤ d} V^{(k)}(ϕ, i_1, ..., i_k)     (4)
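For d = 2, the sum in (4) has three terms: the Vitali variations of the two one-dimensional restrictions (to the faces x_2 = 1 and x_1 = 1) plus the two-dimensional Vitali variation. The sketch below evaluates each term numerically for the same hypothetical test function used in the previous sketch.

```python
import numpy as np

phi = lambda x1, x2: np.sin(np.pi * x1) * x2**2      # placeholder test function
n = 2000
edges = np.arange(n + 1) / n

# V^(1)(phi, 1): Vitali variation of the restriction x2 = 1 (total variation in x1)
v1_1 = np.abs(np.diff(phi(edges, 1.0))).sum()

# V^(1)(phi, 2): Vitali variation of the restriction x1 = 1 (total variation in x2)
v1_2 = np.abs(np.diff(phi(1.0, edges))).sum()

# V^(2)(phi, 1, 2): two-dimensional Vitali variation via the partition sum of |Delta|
P = phi(*np.meshgrid(edges, edges, indexing="ij"))
v2 = np.abs(P[1:, 1:] - P[1:, :-1] - P[:-1, 1:] + P[:-1, :-1]).sum()

print("V_HK(phi) ≈", v1_1 + v1_2 + v2)
```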
