© IEEE. Appeared in Proceedings of the 10th MMAR Int. Conf., Międzyzdroje, Poland, 2004, pp. 1143-1149.
A SIMPLE ACTOR-CRITIC ALGORITHM FOR CONTINUOUS ENVIRONMENTS
PAWEL WAWRZYNSKI† and ANDRZEJ PACUT‡, Senior Member IEEE
Institute of Control and Computation Engineering, Warsaw University of Technology, 00-665 Warsaw, Poland
† [email protected], http://home.elka.pw.edu.pl/~pwawrzyn
‡ [email protected], http://www.ia.pw.edu.pl/~pacut

Abstract. In reference to the methods analyzed recently by Sutton et al. and by Konda & Tsitsiklis, we propose their modification, called the Randomized Policy Optimizer (RPO). The algorithm has a modular structure and is based on the value function rather than on the action-value function. The modules include neural approximators and a parameterized distribution of control actions. The distribution must belong to a family of smoothly exploring distributions, which enables sampling from the control action set to approximate a certain gradient. A pre-action-value function is introduced analogously to the action-value function, with the first action replaced by the first action distribution parameter. The paper contains an experimental comparison of this approach to reinforcement learning with model-free Adaptive Critic Designs, specifically with the Action-Dependent Adaptive Heuristic Critic. The comparison is favorable for our algorithm.

Key Words. reinforcement learning, actor-critic, Adaptive Critic Designs, Cart-Pole Swing-Up

1. INTRODUCTION
Reinforcement learning belongs to the intensively investigated areas of artificial intelligence. While the theory of reinforcement learning in discrete state and control spaces is well developed [8], the case of continuous state/control spaces is still challenging [2, 9, 4]. In this paper we investigate the case of generic (possibly continuous) controls and states and introduce an algorithm whose behavior seems promising.

Reinforcement learning algorithms can be divided into two groups: methods philosophically based on value iteration and methods based on policy iteration. The first group gathers Q-Learning [11] and its modifications, such as Q(λ), SARSA, and SARSA(λ). Methods in this group do not use any explicit representation of the policy. The lack of a structure that may be directly employed to advise the controller is not a problem in the case of a finite set of controls, where each control can be evaluated in some way. In the case of an infinite, possibly uncountable control space, this approach seems questionable.

The second group, namely Actor-Critic algorithms, does employ an explicit representation of the policy. This group includes the Adaptive Heuristic Critic [1], the family of Adaptive Critic Designs [6], the algorithms analyzed recently in [9, 4], as well as the algorithm presented in this paper. Each algorithm based on policy iteration must solve two problems. The first one is critic training, which concerns the evaluation of the policy. To the best of our knowledge, it is always solved by the method of temporal differences [8]. The second one is actor training, i.e. the policy adjustment, which is equivalent to optimizing the selection of the first control action along each trajectory. In this paper, we propose a method that solves the second problem with the use of the methodology introduced by Williams in [12].

The paper is organized as follows. In Section 2 we discuss the actor and critic components of the proposed algorithm and formulate the problem the algorithm solves. In Section 3 we discuss some probabilistic preliminaries necessary to understand the algorithm, which is described in detail in Section 4. In the next section we extend the algorithm in the same way as Sutton's method of temporal differences is extended to TD(λ). Section 6 is devoted to certain implementation details. In Section 7 our algorithms and ADHDP are applied to control of the Cart-Pole Swing-Up. We conclude in Section 8.
2. RANDOMIZED POLICY OPTIMIZER

The discounted Reinforcement Learning problem [8] will be analyzed here for continuous (possibly multi-dimensional) states x ∈ X, continuous (possibly multi-dimensional) controls u ∈ U, rewards r ∈ R, and discrete time t ∈ {1, 2, ...}. The RPO algorithm proposed here is comprised of the actor and critic components introduced briefly below.

2.1. Actor

At each state x the control u is drawn according to a certain density ϕ(·; θ). The density is parameterized by the vector θ ∈ Θ ⊂ R^m, whose value is in turn determined by a parametric approximator θ̃(x; w_θ) parameterized by the weight vector w_θ. For example, ϕ(u; θ) can be the normal density with mean equal to θ and a constant variance, whereas θ̃(x; w_θ) can be a neural network. In this example, the output of the network determines the center of the distribution that the control is drawn from.

For given ϕ and θ̃, the discussed control selection mechanism forms a policy π(w_θ) dependent only on w_θ. For a fixed w_θ, the sequence of states {x_t} forms a Markov chain. Suppose {x_t} has the stationary distribution η(·, w_θ). To determine the control objective, let us define the value function of a generic policy π:

    V^π(x) = E( Σ_{i≥0} γ^i r_{t+1+i} | x_t = x; π )

The ideal, while not necessarily realistic, objective is to find w_θ that maximizes V^{π(w_θ)}(x) for each x. We approach a more realistic objective: maximization of the averaged value function, namely

    Φ(w_θ) = ∫_{x∈X} V^{π(w_θ)}(x) dη(x, w_θ)

2.2. Critic

Our algorithm uses a certain estimator of ∇Φ(w_θ) to maximize Φ(w_θ). In order to construct such estimators, the algorithm employs the approximator Ṽ(x; w_V) of the value function V^{π(w_θ)}(x) of the current policy. The approximator (e.g., a neural network) is parameterized by the weight vector w_V. For the approximator Ṽ to be useful for policy improvement, it should minimize the mean-square error

    Ψ(w_V, w_θ) = ∫_{x∈X} ( V^{π(w_θ)}(x) − Ṽ(x; w_V) )^2 dη(x, w_θ)

with respect to w_V.

The action-value function Q^π : X × U → R is typically defined as the expected value of future discounted rewards the controller may expect starting from the state x, performing the control u, and following the policy π afterwards [11]:

    Q^π(x, u) = E( r_{t+1} + γ V^π(x_{t+1}) | x_t = x, u_t = u )                              (1)

We are interested here in the parameter that governs the control selection rather than in particular controls. Let us define the pre-action-value function U^π : X × Θ → R as the expected value of future discounted rewards the agent may expect starting from the state x, performing a control drawn from the distribution characterized by the parameter θ, and following the policy π afterwards:

    U^π(x, θ) = E( r_{t+1} + γ V^π(x_{t+1}) | x_t = x; u_t ∼ ϕ(·; θ) ) = E_θ Q^π(x, Y)         (2)

where E_θ denotes the expected value calculated for a random vector Y drawn from ϕ(·; θ). Note that by definition V^{π(w_θ)}(x) = U^{π(w_θ)}(x, θ̃(x; w_θ)).

Summing up, the considered problem is to find w_θ that maximizes

    Φ(w_θ) = ∫_{x∈X} U^{π(w_θ)}(x, θ̃(x; w_θ)) dη(x, w_θ)                                      (3)

To deal with this problem, we utilize an approximate solution of the auxiliary problem of minimization of

    Ψ(w_V, w_θ) = ∫_{x∈X} ( U^{π(w_θ)}(x, θ̃(x; w_θ)) − Ṽ(x; w_V) )^2 dη(x, w_θ)                (4)

with respect to w_V. Note that since both η(·, w_θ) and U^{π(w_θ)} are unknown, the problems cannot be solved directly. To derive solutions, we first need certain probabilistic preliminaries based on [12].
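The modular structure described above is straightforward to prototype. A minimal Python/NumPy sketch follows, instantiating the two modules as two-layer perceptrons of the kind used later in Section 7; the network sizes, the five-dimensional state, and all names are illustrative assumptions made only for this sketch, not part of the original implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_init(n_in, n_hidden, n_out, scale=0.2):
    """Two-layer perceptron weights: arctan hidden layer, linear output layer."""
    return {
        "W1": rng.normal(0.0, scale, (n_hidden, n_in + 1)),  # +1 for the bias input
        "W2": np.zeros((n_out, n_hidden + 1)),               # output layer starts at zero
    }

def mlp_forward(w, x):
    h = np.arctan(w["W1"] @ np.append(x, 1.0))
    return w["W2"] @ np.append(h, 1.0)

# Actor module: theta_tilde(x; w_theta) gives the centre of the control distribution.
w_theta = mlp_init(n_in=5, n_hidden=20, n_out=1)
# Critic module: V_tilde(x; w_V) approximates the value function of the current policy.
w_V = mlp_init(n_in=5, n_hidden=40, n_out=1)

sigma = 2.0                      # exploration spread (sigma^2 = 4 in Section 7)
x = rng.normal(size=5)           # some (normalized) state
theta = mlp_forward(w_theta, x)  # distribution parameter for this state
u = rng.normal(theta, sigma)     # control drawn from phi(.; theta) = N(theta, sigma^2)
v = mlp_forward(w_V, x)          # critic's assessment of the state
```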
3. PROBABILITY DISTRIBUTION OPTIMIZATION

In this section we analyze the following generic problem. As before, let ϕ : R^n × Θ → R_+ be a function that, for each fixed value θ of its second argument, is a density of a certain random vector Y in R^n. Obviously, ϕ defines a family of distributions in R^n. Let f : R^n → R be an unknown function and denote by E_θ f(Y) the expected value of f(Y) for Y drawn from the distribution selected from ϕ by setting the parameter to θ. We look for a way of maximizing E_θ f(Y) with respect to θ. We assume that f is initially unknown and one can only repeatedly draw Y and check the value of f(Y).

Let ϕ satisfy the following conditions:

(A) ϕ(z; θ) > 0 for z ∈ U, θ ∈ Θ;
(B) for every z ∈ U, the mapping θ → ϕ(z; θ) is continuous and differentiable;
(C) the Fisher information I(θ) = E_θ( ∇_θ ln ϕ(Y; θ) (∇_θ ln ϕ(Y; θ))^T ) is a bounded function of θ.

We will call families of distributions satisfying these conditions smoothly exploring. For example, the family of normal distributions with constant nonsingular variance, parameterized by the expected value, is smoothly exploring. However, the family of normal distributions with a constant expected value, parameterized by the variance, is not smoothly exploring because it does not satisfy Condition (B).

In the discussion below the convention holds that whenever Y and θ occur in the same formula, Y is drawn from the distribution of density ϕ(·; θ). We now outline some properties of E_θ f(Y) as a function of θ. Proofs of the propositions are presented in [10].

Proposition 1. If
(a) f is a bounded function R^n → R,
(b) the distributions of Y are smoothly exploring,
then E_θ f(Y) is uniformly continuous in Θ.

Corollary 1. f(Y) is an unbiased estimator of E_θ f(Y).

The next Proposition enables us to form an unbiased estimator of the gradient of E_θ f(Y) as a function of θ.

Proposition 2. Under the assumptions of Proposition 1, the following relation holds:

    d/dθ E_θ f(Y) = E_θ( f(Y) ∇_θ ln ϕ(Y; θ) )                                                 (5)

Proposition 2 leads to the main result:

Proposition 3 (Unbiased estimator). Under the assumptions of Proposition 1, the random vector

    ( f(Y) − c ) ∇_θ ln ϕ(Y; θ)                                                                 (6)

is an unbiased estimator of d/dθ E_θ f(Y), regardless of c.

It is not clear how to choose the scalar c to obtain the best estimator of the form (6) in the sense of variance minimization. Note, however, that taking c equal to an approximation of E_θ f(Y) decreases the absolute value of the differences f(Y) − c and appears to constrain the variance. This is not, however, the optimal solution.
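Proposition 3 suggests a simple Monte Carlo recipe: draw Y, evaluate f(Y), and weight the score ∇_θ ln ϕ(Y; θ) by f(Y) − c. The sketch below is only an illustration, assuming a scalar Gaussian family and an arbitrary bounded test function f (both assumptions are ours); it compares estimator (6) with a finite-difference approximation of d/dθ E_θ f(Y) and shows that a reference point c close to E_θ f(Y) leaves the mean unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(y):
    # Unknown "black box" being optimized; here an arbitrary bounded test function.
    return np.exp(-(y - 1.5) ** 2)

def grad_estimate(theta, sigma, c, n=200_000):
    """Average of (f(Y) - c) * d/dtheta ln phi(Y; theta) over n draws,
    where phi(.; theta) = N(theta, sigma^2), whose score is (Y - theta) / sigma^2."""
    y = rng.normal(theta, sigma, size=n)
    score = (y - theta) / sigma ** 2
    return np.mean((f(y) - c) * score)

theta, sigma = 0.0, 1.0
# Finite-difference reference for d/dtheta E_theta f(Y).
eps, n = 1e-2, 2_000_000
e_plus = np.mean(f(rng.normal(theta + eps, sigma, size=n)))
e_minus = np.mean(f(rng.normal(theta - eps, sigma, size=n)))
c_hat = np.mean(f(rng.normal(theta, sigma, size=10_000)))  # rough approximation of E_theta f(Y)
print("finite difference :", (e_plus - e_minus) / (2 * eps))
print("estimator, c = 0  :", grad_estimate(theta, sigma, c=0.0))
print("estimator, c ~ E_f:", grad_estimate(theta, sigma, c=c_hat))
```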
4. ACTOR AND CRITIC ADAPTATION

At each step t, the Randomized Policy Optimizer (RPO) algorithm performs the control action drawn from ϕ,

    u_t ∼ ϕ(·; θ̃(x_t; w_θ)),

and observes the reward r_{t+1} and the next state x_{t+1}. The reward and the next state are then utilized to estimate the gradients of Φ(w_θ) and Ψ(w_V, w_θ). The gradient estimators are then used for, respectively, the actor adjustment and the critic adjustment. The functions Φ(w_θ) and Ψ(w_V, w_θ) are hence optimized with the use of stochastic approximation. In order to construct the appropriate gradient estimators we employ the derivations of the previous section. Let us denote

    q_t = r_{t+1} + γ Ṽ(x_{t+1}; w_V)                                                          (7)

Let us now establish the relations between the generic terms of the previous section and the terms of our problem:

1. Y ↔ u_t, the drawing;
2. θ ↔ θ̃(x_t; w_θ), the parameter used for the drawing;
3. f(Y) ↔ Q^{π(w_θ)}(x_t, u_t), the return; Q^{π(w_θ)}(x_t, u_t) is estimated by q_t;
4. E_θ f(Y) ↔ U^{π(w_θ)}(x_t, θ̃(x_t; w_θ)), the value we want to estimate and maximize;
5. c ↔ Ṽ(x_t; w_V), the reference point.

4.1. Actor adjustment

The very idea of the control policy adjustment utilized in RPO is as follows. At step t the control u_t is applied. Its selection is governed by the parameter θ = θ̃(x_t; w_θ). The return q_t is confronted with the expected return Ṽ(x_t; w_V). If the control u_t turns out to be "good" (q_t is larger), θ̃(x_t; w_θ) is modified to make the control u_t more plausible in state x_t. If the control turns out to be "bad", θ̃(x_t; w_θ) is modified to make it less plausible. More formally, the return expected in state x_t is equal to U^{π(w_θ)}(x_t, θ̃(x_t; w_θ)). An estimator of the gradient ∇_θ U^{π(w_θ)}(x_t, θ) at θ = θ̃(x_t; w_θ) is employed to modify θ̃(x_t; w_θ) so as to maximize the expected return.

Yet more formally, it is assumed that the state x_t has been drawn from η(·, w_θ). We hence employ (3) to estimate ∇_{w_θ} Φ(w_θ) with an estimator of

    d U^π(x_t, θ̃(x_t; w_θ)) / d w_θ                                                            (8)

for a fixed π = π(w_θ). Eq. (8) may be written as

    ( dθ̃(x_t; w_θ) / dw_θ ) g_t

where g_t is an estimator of

    d U^π(x_t, θ̃(x_t; w_θ)) / d θ̃(x_t; w_θ)

To construct g_t we use an estimator similar to (6), namely

    g_t = ( q_t − Ṽ(x_t; w_V) ) d ln ϕ(u_t; θ̃(x_t; w_θ)) / d θ̃(x_t; w_θ)                       (9)

where Ṽ(x_t; w_V) is a non-random assessment of q_t taken as a reference point. Summing up, the actor adjustment in RPO takes the form

    w_θ := w_θ + β_t^θ ( r_{t+1} + γ Ṽ(x_{t+1}; w_V) − Ṽ(x_t; w_V) )
           × ( dθ̃(x_t; w_θ) / dw_θ ) × ( d ln ϕ(u_t; θ̃(x_t; w_θ)) / d θ̃(x_t; w_θ) )            (10)

where the step parameters β_t^θ must satisfy the standard stochastic approximation requirements, namely Σ_{t≥1} β_t = ∞ and Σ_{t≥1} β_t^2 < ∞.
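As an illustration, one update of the form (10) reduces to a few lines once ϕ and θ̃ are fixed. The sketch below assumes, only for the illustration, a linear parameterization θ̃(x; w_θ) = w_θᵀx and the Gaussian family of Section 6.1, so that dθ̃/dw_θ = x and d ln ϕ(u; θ)/dθ = (u − θ)/σ²; it is a sketch under those assumptions, not the implementation used in the experiments.

```python
import numpy as np

def actor_step(w_theta, x_t, u_t, td_error, sigma, beta_theta):
    """One actor adjustment of the form (10).
    td_error is r_{t+1} + gamma*V(x_{t+1}) - V(x_t), supplied by the critic."""
    theta = w_theta @ x_t                       # theta_tilde(x_t; w_theta), linear case
    dlog_dtheta = (u_t - theta) / sigma ** 2    # score of the Gaussian family
    return w_theta + beta_theta * td_error * dlog_dtheta * x_t
```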
4.2. Critic adjustment

Critic adjustment is based on exactly the same assumptions and is much simpler. Based on (4), we treat x_t as drawn from η(·, w_θ) and employ an estimator of

    d/dw_V ( U^{π(w_θ)}(x, θ̃(x; w_θ)) − Ṽ(x; w_V) )^2                                          (11)

to estimate ∇_{w_V} Ψ(w_V, w_θ). As above, we treat q_t as an estimator of Q^{π(w_θ)}(x_t, u_t) and consequently as an estimator of U^{π(w_θ)}(x_t, θ̃(x_t; w_θ)). This leads us easily to the standard [8] critic adjustment scheme

    w_V := w_V + β_t^V ( r_{t+1} + γ Ṽ(x_{t+1}; w_V) − Ṽ(x_t; w_V) ) × dṼ(x_t; w_V) / dw_V      (12)

The step parameters β_t^V satisfy the same requirements as β_t^θ.
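A matching critic update (12) for a linear value approximator Ṽ(x; w_V) = w_Vᵀx (again a simplifying assumption made only for the sketch) is the familiar TD(0) rule:

```python
import numpy as np

def critic_step(w_V, x_t, x_next, r_next, gamma, beta_V):
    """One critic adjustment of the form (12); for a linear V_tilde the
    gradient dV_tilde/dw_V is simply the state vector x_t."""
    td_error = r_next + gamma * (w_V @ x_next) - (w_V @ x_t)
    return w_V + beta_V * td_error * x_t
```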
5. RPO(λ)

RPO can be extended in the same way as Sutton's method of temporal differences is extended to TD(λ). The algorithm employs q_t = r_{t+1} + γ Ṽ(x_{t+1}; w_V) as an estimator of Q^{π(w_θ)}(x_t, u_t). Since Ṽ is an approximation of V^{π(w_θ)} of only limited precision, this estimator is biased. An unbiased estimator is

    q_t^1 = Σ_{i≥0} γ^i r_{t+i+1},

since V^{π(w_θ)}(x_t) is in fact defined as the expected value of q_t^1. The variance of q_t^1 is, however, large. Note that

    q_t^1 = r_{t+1} + γ v_{t+1} + Σ_{i≥1} γ^i ( r_{t+i+1} + γ v_{t+i+1} − v_{t+i} )

for any bounded sequence {v_i, i ∈ Z}, because the terms v_{t+i} for consecutive i cancel each other out. Taking v_{t+i} = Ṽ(x_{t+i}; w_V) and λ ∈ [0, 1] we obtain

    q_t^λ = r_{t+1} + γ Ṽ(x_{t+1}; w_V) + Σ_{i≥1} (γλ)^i ( r_{t+i+1} + γ Ṽ(x_{t+i+1}; w_V) − Ṽ(x_{t+i}; w_V) )    (15)

which is the well known TD(λ) estimator of future rewards. By varying the value of λ, we balance its bias and variance. To construct an algorithm based on q_t^λ as the estimate of future rewards, one must replace (10) and (12) with

    w_θ := w_θ + β_t^θ ( q_t^λ − Ṽ(x_t; w_V) ) × ( dθ̃(x_t; w_θ) / dw_θ ) × ( d ln ϕ(u_t; θ̃(x_t; w_θ)) / d θ̃(x_t; w_θ) )    (16)

and

    w_V := w_V + β_t^V ( q_t^λ − Ṽ(x_t; w_V) ) × dṼ(x_t; w_V) / dw_V                            (17)

While this cannot be done directly, because there is no particular moment at which q_t^λ can be calculated, the relation

    q_t^λ − Ṽ(x_t; w_V) = Σ_{i≥0} (γλ)^i d_{t+i}

enables (16) and (17) to be implemented incrementally. This is exactly what the RPO(λ) algorithm presented in Table 1 does. The algorithm uses auxiliary vectors m_θ and m_V; their dimensions are equal to the dimensions of w_θ and w_V, respectively.

TABLE 1. The RPO(λ) algorithm.

1. Draw the control u_t:
       u_t ∼ ϕ(·; θ̃(x_t; w_θ))
2. Perform the control u_t, observe the next state x_{t+1} and the reinforcement r_{t+1}.
3. Calculate the temporal difference as
       d_t = r_{t+1} + γ Ṽ(x_{t+1}; w_V) − Ṽ(x_t; w_V)
4. Adjust θ̃(x_t; w_θ):
       m_θ := (γλ) m_θ + ( dθ̃(x_t; w_θ) / dw_θ ) ( d ln ϕ(u_t; θ̃(x_t; w_θ)) / d θ̃(x_t; w_θ) )   (13)
       w_θ := w_θ + β_t^θ d_t m_θ
5. Adjust Ṽ(x_t; w_V):
       m_V := (γλ) m_V + dṼ(x_t; w_V) / dw_V                                                    (14)
       w_V := w_V + β_t^V d_t m_V
6. Set t := t + 1 and repeat from Step 1.
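For completeness, the sketch below spells out one pass through Table 1 under the same simplifying assumptions as in the earlier sketches (linear θ̃ and Ṽ, Gaussian control family); the eligibility vectors m_θ and m_V accumulate the discounted gradients that implement (16) and (17) incrementally. It is an illustration, not the authors' implementation.

```python
import numpy as np

def rpo_lambda_step(w_theta, w_V, m_theta, m_V, x_t, x_next, u_t, r_next,
                    gamma, lam, sigma, beta_theta, beta_V):
    """One iteration of the RPO(lambda) loop of Table 1, assuming linear
    approximators theta_tilde(x) = w_theta @ x and V_tilde(x) = w_V @ x,
    and the Gaussian control family N(theta_tilde(x), sigma^2)."""
    # Step 3: temporal difference.
    d_t = r_next + gamma * (w_V @ x_next) - (w_V @ x_t)
    # Step 4: actor trace (13) and actor weights; the Gaussian score is
    # (u - theta) / sigma^2, and d theta_tilde / d w_theta = x for the linear case.
    m_theta = gamma * lam * m_theta + ((u_t - w_theta @ x_t) / sigma ** 2) * x_t
    w_theta = w_theta + beta_theta * d_t * m_theta
    # Step 5: critic trace (14) and critic weights; d V_tilde / d w_V = x.
    m_V = gamma * lam * m_V + x_t
    w_V = w_V + beta_V * d_t * m_V
    return w_theta, w_V, m_theta, m_V
```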
6. EXAMPLES OF DISTRIBUTIONS

We will now translate the family of smoothly exploring distributions ϕ into more intuitive terms. In the case of continuous controls it may look as follows: there is a deterministic mapping θ : X → U utilized in such a way that the control is the sum of θ̃(x; w_θ) and some zero-mean noise. The noise is necessary for exploration and for the optimization of θ. Generally speaking, the choice of the family of densities ϕ that governs the control selection should depend on what kind of randomization is acceptable and what kind of exploration seems fruitful in a particular reinforcement learning problem. We present a number of examples illustrating possible choices.

6.1. Continuous control actions, absolute accuracy requirements

Suppose we know that there is a single optimal control for each state, yet controls within some distance σ of it still give acceptable performance. Suppose σ is independent of the unknown optimum. In this case we may employ the family of normal distributions N(θ, σ^2) of constant variance equal to σ^2, parameterized by the expected value θ. In this case

    ϕ(u; θ) = ( 1 / (√(2π) σ) ) exp( −(u − θ)^2 / (2σ^2) )                                      (18)

The larger σ we take, the faster the convergence and the poorer the final performance. By manipulating σ, we control the balance between exploration and exploitation of the environment. The actor adjustment in the case of the normal smoothly exploring distribution takes the form

    w_θ := w_θ + β_t^θ ( r_{t+1} + γ Ṽ(x_{t+1}; w_V) − Ṽ(x_t; w_V) ) × σ^{−2} ( u_t − θ̃(x_t; w_θ) ) dθ̃(x_t; w_θ) / dw_θ

The formula above has a very intuitive interpretation: θ̃(x_t) approaches u_t if the control turned out to be "good", i.e. r_{t+1} + γ Ṽ(x_{t+1}) > Ṽ(x_t), and moves away from u_t otherwise. Extending the discussion above to the case of multidimensional controls and θ is straightforward.

6.2. Continuous positive control actions, relative accuracy requirements

Suppose again that there is a single optimal control for each state. Conversely to the previous example, suppose that the controls are positive real numbers that should be kept within a small relative distance of their optimal values. Suppose σ (equal to, say, 0.01) is the required relative accuracy. In this case we may use the family of log-normal distributions

    ϕ(u; θ) = ( 1 / (√(2π) σ u) ) exp( −(ln u − θ)^2 / (2σ^2) )

To sample according to this density, one may simply take u = exp Y, where Y is drawn from the normal distribution N(θ, σ^2) with density (18).
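A short sketch of how such controls can be generated and differentiated follows (the function names are ours): the control is the exponential of a Gaussian draw, and the log-density gradient with respect to θ has the same simple form as in the normal case, with ln u in place of u.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_lognormal_control(theta, sigma):
    """Draw u from the log-normal family above: u = exp(Y), Y ~ N(theta, sigma^2)."""
    return np.exp(rng.normal(theta, sigma))

def lognormal_score(u, theta, sigma):
    """d/dtheta ln phi(u; theta) = (ln u - theta) / sigma^2, used in the actor adjustment."""
    return (np.log(u) - theta) / sigma ** 2
```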
6.3. Discrete ordered control actions

The case of discrete controls has not been discussed in this paper so far. The only thing that changes is that ϕ should be understood as a probability rather than a density. Sometimes the controls are discrete, yet there is a natural order among them. We show in an extreme example how to exploit this order. Let the control space U be the set of all integers, and let us define a family of distributions parameterized by a single real value θ that assigns higher probabilities to integers closer to θ, namely

    ϕ(u; θ) = exp( −(u − θ)^2 / σ^2 ) / Σ_{j=−∞}^{∞} exp( −(j − θ)^2 / σ^2 )

where σ is a parameter that determines the amount of exploration. Fortunately, the infinite sum above has very few components greater than the numerical error.
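In practice the normalizing sum can therefore be truncated to a window of integers around θ. A possible implementation, in which the window width is an assumption of ours, is:

```python
import numpy as np

rng = np.random.default_rng(3)

def discrete_ordered_probs(theta, sigma, width=8):
    """Probabilities of the discrete ordered family above, truncated to integers
    within `width` units of theta; the omitted terms are numerically negligible."""
    support = np.arange(int(np.floor(theta)) - width, int(np.ceil(theta)) + width + 1)
    weights = np.exp(-(support - theta) ** 2 / sigma ** 2)
    return support, weights / weights.sum()

support, probs = discrete_ordered_probs(theta=0.3, sigma=1.5)
u = rng.choice(support, p=probs)   # a control drawn from phi(.; theta)
```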
7. ILLUSTRATION: CART-POLE SWING-UP

In this section we present an experimental study of the Randomized Policy Optimizer applied to a problem of reinforcement learning with continuous state and control spaces. The algorithm is compared to Action-Dependent Heuristic Dynamic Programming (ADHDP), a method that seems to be becoming quite popular [7, 6, 5]. As a platform for the illustration, we chose the Cart-Pole Swing-Up [3], which is a modification of the inverted pendulum frequently used as a benchmark for reinforcement learning algorithms.
There are four state variables: the position of the cart z, the arc of the pole ω, and the time derivatives ż, ω̇. A single control variable F is the force applied to the cart. The dynamics of the Cart-Pole Swing-Up are the same as those of the inverted pendulum described e.g. in [1]. The reward in this problem is typically equal to the elevation of the pole top. Initially, the waving pole hovers, and the rewards are close to −1. The goal of control is to avoid hitting the track bounds, swing the pole, turn it up, and stabilize it upwards. The rewards are then close to 1.

The controller's interventions take place every 0.1 s. Note that τ here denotes the continuous time of the plant, which remains in a simple relation with the discrete time t of the controller. The force F(τ) is calculated from the action u_t as

    F(τ) = −10 if u_t < −10;  u_t if u_t ∈ [−10, 10];  10 otherwise.

The action u_t is determined every 0.1 second. The state of the plant is acceptable if and only if z(τ) ∈ [−2.4, 2.4]. If the next state is not acceptable, then the reinforcement is set to −30 − 0.2 |u_t − F(τ)|. Otherwise, the reinforcement is determined as

    r(τ) = −0.2 |u_t − F(τ)| + ( cos ω(τ) if |ω̇(τ)| < 2π; −1 otherwise ).

The learning process consists of a sequence of trials. A trial may end in one of two possible situations. First, the pendulum's state may become unacceptable; this is then associated with an appropriately low reinforcement. Otherwise, the trial lasts for a random number of steps drawn from the geometric distribution with expected value equal to 300 (corresponding to 30 s of real time). A trial begins with a state reset, which consists of drawing z(τ) and ω(τ) from the uniform distributions U(−2.4, 2.4) and U(0, 2π), respectively, and setting ż(τ) and ω̇(τ) to zero.

Each approximator employed by the algorithms was implemented as a two-layer perceptron with a sigmoidal (arctan) activation function in the hidden layer and a linear output layer. Each neuron had a constant input (bias). The initial weights of the hidden layer were drawn randomly from the normal distribution N(0, 0.2) and the initial weights of the output layer were set to zero. The discount factor γ was set to 0.95. The state vector x_t feeding the approximators was normalized, namely

    x_t = [ z(τ)/2, ż(τ)/3, sin ω(τ)/0.8, cos ω(τ)/0.8, ω̇(τ)/4 ]^T

Additionally, the ADHDP algorithm uses controls as inputs to the critic network. They were normalized as 0.1 u(τ).
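The plant-side quantities described above translate directly into code. The sketch below is ours, not the authors' simulator; it implements the force saturation, the reinforcement, and the state normalization. Whether the individual terms are evaluated at the current or the next plant instant is an implementation detail the text leaves open.

```python
import numpy as np

def force(u):
    """Saturation of the action into the admissible force range [-10, 10]."""
    return float(np.clip(u, -10.0, 10.0))

def reinforcement(u, z, omega, omega_dot):
    """Reinforcement described above: a large penalty for leaving the track,
    otherwise the pole elevation (or -1 if the pole spins too fast),
    minus a penalty for the saturated part of the action."""
    penalty = 0.2 * abs(u - force(u))
    if not (-2.4 <= z <= 2.4):
        return -30.0 - penalty
    base = np.cos(omega) if abs(omega_dot) < 2.0 * np.pi else -1.0
    return base - penalty

def normalize_state(z, z_dot, omega, omega_dot):
    """State vector fed to the approximators, normalized as above."""
    return np.array([z / 2.0, z_dot / 3.0,
                     np.sin(omega) / 0.8, np.cos(omega) / 0.8,
                     omega_dot / 4.0])
```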
This is the exact formulation of the ADHDP in use. The algorithm employs two approximators, namely the critic network Q̃(x, u; w_Q) and the action network µ̃(x; w_µ):

1. Set u_t = µ̃(x_t; w_µ).
2. Perform the control u_t, observe the next state x_{t+1} and the reinforcement r_{t+1}.
3. Calculate the temporal difference as
       d_t = r_{t+1} + γ Q̃(x_{t+1}, µ̃(x_{t+1}; w_µ); w_Q) − Q̃(x_t, u_t; w_Q)
4. Adjust µ̃(x_t; w_µ) along dQ̃(x_t, u_t; w_Q)/du_t, namely
       w_µ := w_µ + β_t^µ ( dµ̃(x_t; w_µ) / dw_µ ) ( dQ̃(x_t, u_t; w_Q) / du_t )
5. Adjust Q̃(x_t, u_t; w_Q) towards r_{t+1} + γ Q̃(x_{t+1}, u_{t+1}; w_Q):
       w_Q := w_Q + β_t^Q d_t dQ̃(x_t, u_t; w_Q) / dw_Q
6. Set t := t + 1 and repeat from Step 1.

The ADHDP achieved its results for the following parameters:
N^µ = 20 — number of hidden neurons of µ̃,
N^Q = 60 — number of hidden neurons of Q̃,
β_t^µ ≡ 0.001 — learning rate of µ̃,
β_t^Q ≡ 0.01 — learning rate of Q̃.

We must stress that it was difficult to select parameters that led to convergence. For example, a β^Q that is too small makes the algorithm diverge. Furthermore, even for the best parameters, the algorithm did not converge at all in almost 30% of the runs.
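For comparison with the earlier RPO sketches, the listing above can be condensed as follows. This is a simplification of ours: the two-layer perceptrons of the experiment are replaced by linear stand-ins, a critic Q̃(x, u; w_Q) = w_Qᵀ[x, u] and an actor µ̃(x; w_µ) = w_µᵀx, so that dQ̃/du is simply the last critic weight and the chain rule in Step 4 becomes a product of plain vectors.

```python
import numpy as np

def adhdp_action(w_mu, x):
    """Step 1: the action network; here a linear stand-in mu_tilde(x) = w_mu @ x."""
    return w_mu @ x

def adhdp_update(w_mu, w_Q, x_t, u_t, x_next, r_next, gamma, beta_mu, beta_Q):
    """Steps 3-5 for the linear stand-ins: critic Q_tilde(x, u) = w_Q @ concat(x, [u]),
    so dQ/du is the last critic weight and dQ/dw_Q is the feature vector."""
    feat_t = np.append(x_t, u_t)
    u_next = adhdp_action(w_mu, x_next)
    d_t = r_next + gamma * (w_Q @ np.append(x_next, u_next)) - w_Q @ feat_t
    w_mu = w_mu + beta_mu * w_Q[-1] * x_t    # step 4: ascend dQ/du; d mu / d w_mu = x
    w_Q = w_Q + beta_Q * d_t * feat_t        # step 5: move Q toward r + gamma * Q'
    return w_mu, w_Q
```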
The RPO used here utilized the normal family of control distributions. The exact formulation of the algorithm is as follows:

1. Draw the control u_t:
       u_t ∼ N(θ̃(x_t; w_θ), σ^2)
2. Perform the control u_t, observe the next state x_{t+1} and the reinforcement r_{t+1}.
3. Calculate the temporal difference as
       d_t = r_{t+1} + γ Ṽ(x_{t+1}; w_V) − Ṽ(x_t; w_V)
4. Adjust θ̃(x_t; w_θ):
       w_θ := w_θ + β_t^θ d_t σ^{−2} ( u_t − θ̃(x_t; w_θ) ) dθ̃(x_t; w_θ) / dw_θ
5. Adjust Ṽ(x_t; w_V):
       w_V := w_V + β_t^V d_t dṼ(x_t; w_V) / dw_V
6. Set t := t + 1 and repeat from Step 1.

The algorithm achieved the results depicted below for the following parameters:
N^θ = 20 — number of hidden neurons of θ̃,
N^V = 40 — number of hidden neurons of Ṽ,
β_t^θ = β_t^V ≡ 0.003 — learning rates of θ̃ and Ṽ,
σ^2 = 4 — variance of the smoothly exploring distribution.

In this case, choosing the parameters was quite easy. If the parameters are too large, the algorithm becomes unstable; decreasing them slows the method down. This is in fact what one may expect from an algorithm based on stochastic approximation. Satisfying behavior was obtained (in all runs) over a quite wide range of parameter values that does not seem difficult to find.

Another algorithm we tested was RPO(λ). Its exact formulation is easy to derive on the basis of Table 1. We applied the same parameter values as for RPO (additionally, λ = 0.5). The burden of parameter selection was the same as in the case of RPO, i.e. small.
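The trial protocol of this section can be summarized in a short skeleton (ours; `learning_step` is a placeholder for one iteration of RPO, RPO(λ), or ADHDP together with the plant simulation, which is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(4)

def reset_state():
    """Trial reset as described above: cart position and pole angle drawn uniformly,
    velocities set to zero; state layout is [z, z_dot, omega, omega_dot]."""
    return np.array([rng.uniform(-2.4, 2.4), 0.0, rng.uniform(0.0, 2.0 * np.pi), 0.0])

def run_trial(learning_step, mean_steps=300):
    """Skeleton of one trial: a random horizon with expected length 300 steps
    (30 s of plant time at 0.1 s per step), ended early if the state becomes
    unacceptable. `learning_step` must return (next_state, reinforcement, acceptable)."""
    state = reset_state()
    horizon = rng.geometric(1.0 / mean_steps)
    for _ in range(horizon):
        state, r, acceptable = learning_step(state)
        if not acceptable:
            break
```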
[Figure 1 about here: learning curves for RPO(λ), RPO, and ADHDP; the average reinforcement (vertical axis, from −2.0 to 1.0) plotted against the trial number divided by 1000 (horizontal axis, from 2 to 40).]

FIGURE 1. Comparison of ADHDP and RPO — the average reinforcement as a function of the trial number (divided by 1000). Each point averages the reinforcements in 100 consecutive trials and each curve averages 10 runs.

The learning curves for the discussed algorithms are presented in Fig. 1. Note that RPO is always convergent, while ADHDP does not converge in almost 30% of the runs. For the runs in which ADHDP does converge, it is over 10 times slower than RPO. For a fair comparison, both algorithms used just one approximator adjustment per simulation step. Such computational thrift gives slow convergence and is not a recommended way to train a controller in real time. A version of RPO that is more computationally intensive is beyond the scope of this paper.

8. CONCLUSIONS AND FURTHER WORK

In this paper we introduced the notion of smoothly exploring distributions as a basis for the optimization of control selection in Actor-Critic algorithms. Our Randomized Policy Optimizer directly utilizes this notion. The usage of smoothly exploring distributions in reinforcement learning may take a variety of forms. It is possible, among others, to simply add noise to a policy that would otherwise be deterministic. Note that randomization (or some other form of non-greediness) is necessary for a reinforcement learning algorithm to explore the environment. In RPO, the randomization may take almost any form, which gives the designer an opportunity to pick the most suitable one. Most reinforcement learning algorithms suppress the randomization as the learning goes on. This is of course also possible for RPO. An extension of the framework of smoothly exploring distributions that incorporates decreasing exploration will be a topic of our future research.

REFERENCES

1. A. G. Barto, R. S. Sutton, and C. W. Anderson, "Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems," IEEE Trans. Syst., Man, Cybern., vol. SMC-13, pp. 834-846, Sept.-Oct. 1983.
2. D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, Belmont, Massachusetts, 1997.
3. K. Doya, "Reinforcement learning in continuous time and space," Neural Computation, vol. 12, pp. 243-269, 2000.
4. V. R. Konda and J. N. Tsitsiklis, "Actor-Critic Algorithms," SIAM Journal on Control and Optimization, vol. 42, no. 4, pp. 1143-1166, 2003.
5. D. Liu, X. Xiong, and Y. Zhang, "Action-Dependent Adaptive Critic Designs," Proceedings of the INNS-IEEE International Joint Conference on Neural Networks, Washington, DC, July 2001, pp. 990-995.
6. D. V. Prokhorov and D. C. Wunsch, "Adaptive critic designs," IEEE Trans. Neural Networks, vol. 8, pp. 997-1007, Sept. 1997.
7. J. Si and Y.-T. Wang, "On-line learning control by association and reinforcement," IEEE Transactions on Neural Networks, vol. 12, pp. 264-276, Mar. 2001.
8. R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, Massachusetts, 1998.
9. R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, "Policy Gradient Methods for Reinforcement Learning with Function Approximation," Advances in Neural Information Processing Systems 12, pp. 1057-1063, MIT Press, 2000.
10. P. Wawrzynski, "Reinforcement Learning in Control Systems," PhD Thesis, Institute of Control and Computation Engineering, Warsaw University of Technology, forthcoming.
11. C. Watkins and P. Dayan, "Q-Learning," Machine Learning, vol. 8, pp. 279-292, 1992.
12. R. Williams, "Simple statistical gradient following algorithms for connectionist reinforcement learning," Machine Learning, vol. 8, pp. 229-256, 1992.