LEARNING TO CONTROL A JOINT DRIVEN DOUBLE INVERTED PENDULUM USING NESTED ACTOR/CRITIC ALGORITHM

Norimasa Kobori, Kenji Suzuki, Pitoyo Hartono and Shuji Hashimoto
Department of Applied Physics, School of Science and Engineering, Waseda University
Ohkubo 3-4-1, Shinjuku-ku, Tokyo 169-8555
E-mail: {kobo,kenji,hartono,shuji}@shalab.phys.waseda.ac.jp

ABSTRACT

In recent years, 'Reinforcement Learning', which can acquire reflective and adaptive actions, has been attracting attention as a learning method for robot control. However, many unsolved problems remain before the method can be put into practical use. One of these problems is the handling of the state space and the action space. Most existing reinforcement learning algorithms deal with discrete state and action spaces. When the unit of the search space is coarse, subtle control cannot be achieved (imperfect perception). On the contrary, when the unit of the search space is too fine, the search space is enlarged accordingly and stable convergence of learning cannot be obtained (curse of dimensionality). In this paper, we propose a nested actor/critic algorithm that can deal with continuous state and action spaces. The proposed method inserts a child actor/critic into the actor part of a parent actor/critic algorithm. We examined the proposed algorithm on a stabilization control problem with both a simulation and a prototype model of a joint-driven double inverted pendulum.

1. INTRODUCTION

Reinforcement learning (RL) is a machine learning method in which an evaluation from the environment (reward) of the learner's performance is utilized. The system (learner) reinforces its strategy so that its actions generate a maximum amount of reward. RL can be distinguished from the widely used supervised training methods by the information that has to be provided for training the learner. RL requires only an evaluation of the generated actions, while conventional supervised training methods require the ideal action for a particular input. This characteristic is beneficial for applying machine learning to real-world problems, because in some problems the ideal actions for given inputs (environments) are unknown, while the quality of the system's performance in the given environment is easier to evaluate.
In the past, RL was mostly applied to tasks with discrete state spaces, typified by 2-dimensional grid maze learning. Some studies on dynamic control systems with continuous states have been reported. For example, there are studies on the stabilization of an inverted pendulum [1] and of an acrobot [2], which has two links, with the continuous state quantized into discrete conditions by using BOXES [1] or CMAC [3]. Moreover, there is a successful case of the cart-pole problem in which the continuous state space is represented by a radial basis function (RBF) network [4]. However, there are few works dealing with a continuous action space.

Our RL method is built on an architecture based on the actor/critic (A/C) model proposed by Barto et al. [5]. The actor part outputs control signals for the current condition, and the critic part predicts the cumulative amount of future reward. In the A/C model, the temporal-difference error between the predicted and the actually obtained amount of reward is called the TD-error. By using the TD-error, the actor approximates the selection probability of each possible action while the critic approximates the state value function. This learning method is based on the assumption that the space of the control variable is discrete and that the number of possible control signals output by the actor is limited. Therefore, it is difficult to apply such RL methods to problems that require continuous control values, e.g. an inverted pendulum.
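As a point of reference for the nested architecture proposed later, the following is a minimal sketch of a conventional actor/critic update with a discrete action set, in the spirit of the A/C model described above. The feature vector phi, the learning rates alpha_c and alpha_a, and the two-torque action set are illustrative assumptions, not the exact formulation used in this paper.

```python
import numpy as np

# Minimal conventional actor/critic with a discrete action set.
# phi: feature vector of the state (e.g. RBF activations), assumed given.
# The critic keeps weights w for the state value V(x) = w . phi;
# the actor keeps one weight vector per discrete action as action preferences.

class DiscreteActorCritic:
    def __init__(self, n_features, actions, gamma=0.95, alpha_c=0.1, alpha_a=0.05):
        self.actions = actions                              # e.g. [-0.1, 0.1] (torque values)
        self.w = np.zeros(n_features)                       # critic weights
        self.theta = np.zeros((len(actions), n_features))   # actor preferences
        self.gamma, self.alpha_c, self.alpha_a = gamma, alpha_c, alpha_a

    def policy(self, phi):
        prefs = self.theta @ phi
        p = np.exp(prefs - prefs.max())
        return p / p.sum()                                  # softmax selection probabilities

    def act(self, phi):
        return np.random.choice(len(self.actions), p=self.policy(phi))

    def update(self, phi, a_idx, reward, phi_next, done):
        v = self.w @ phi
        v_next = 0.0 if done else self.w @ phi_next
        td_error = reward + self.gamma * v_next - v         # TD-error drives both parts
        self.w += self.alpha_c * td_error * phi              # critic: value estimate
        self.theta[a_idx] += self.alpha_a * td_error * phi   # actor: reinforce chosen action
```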
Fig.1 The joint-driven double inverted pendulum (Link1 with angle x1, Link2 with angle x2, actuator at the joint between the links) and its parameters:

  x1, x2 : angle from vertical direction (rad)
  m1, m2 : link mass (kg)            0.08, 0.1
  l1, l2 : link length (m)           0.252, 0.142
  S1, S2 : center of mass            0.5, 0.5
  d1, d2 : coefficient of friction   0.08, 0.3
In this paper, we propose a 'nested actor/critic algorithm' in which an actor/critic agent is additionally inserted into the actor part of another actor/critic agent to realize continuous control (detailed in Section 4). A stabilization control of a double inverted pendulum with a joint-driven structure (refer to Fig.1) was conducted to prove the effectiveness of the proposed algorithm. The experiments were conducted with both a simulation and a prototype. The parameters of the double inverted pendulum are shown in Fig.1, and the equation of motion is given in the appendix. Sections 2 and 3 review the conventional RL method. The details of the proposed method are given in Section 4. Finally, the results of the control learning with the simulation and the prototype are shown.
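The structural idea can be pictured as follows. This is only an illustrative sketch of how a child actor/critic might sit inside the actor of a parent actor/critic in order to emit a continuous action: the Gaussian child policy, the reuse of the parent's TD-error, and the omission of the child's own internal critic are assumptions made for the sketch, not the algorithm defined in Section 4.

```python
import numpy as np

class ChildActorCritic:
    """Embedded in the parent's actor; maps features to a continuous action.
    (Its own internal critic part is omitted in this simplified sketch.)"""
    def __init__(self, n_features, sigma=0.05, alpha=0.05):
        self.w_mean = np.zeros(n_features)   # mean of a Gaussian action policy
        self.sigma, self.alpha = sigma, alpha

    def act(self, phi):
        mean = self.w_mean @ phi
        return mean + self.sigma * np.random.randn()   # exploratory continuous action

    def learn(self, phi, action, td_error):
        mean = self.w_mean @ phi
        # move the mean toward actions that produced a positive TD-error
        self.w_mean += self.alpha * td_error * (action - mean) * phi

class NestedActorCritic:
    def __init__(self, n_features, gamma=0.95, alpha_c=0.1):
        self.w = np.zeros(n_features)               # parent critic: state value V(x)
        self.child = ChildActorCritic(n_features)   # child A/C inside the actor part
        self.gamma, self.alpha_c = gamma, alpha_c

    def act(self, phi):
        return self.child.act(phi)                  # continuous control signal

    def update(self, phi, action, reward, phi_next, done):
        v = self.w @ phi
        v_next = 0.0 if done else self.w @ phi_next
        td_error = reward + self.gamma * v_next - v
        self.w += self.alpha_c * td_error * phi     # parent critic update
        self.child.learn(phi, action, td_error)     # child learns from the same signal
```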
Fig.2 An ordinary A/C architecture that uses RBF network function approximators: the critic RBF and the actor RBF. The computational learning agent receives the state vector xt and the reward rt from the environment; the critic RBF produces the reinforcement signal (TD-error) and the actor RBF outputs the action at to the environment.
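As shown in Fig.2, both the critic and the actor operate on features produced by an RBF network over the continuous state. The following is a minimal sketch of Gaussian RBF features; the grid of centers and the width are illustrative assumptions, not the settings used in the experiments.

```python
import numpy as np

# Gaussian radial basis function (RBF) features over a continuous state vector.
# Both the critic and the actor can share the same feature vector phi(x).

def make_rbf(centers, width):
    """centers: (n_features, state_dim) array of RBF centers; width: common width."""
    def phi(x):
        d2 = np.sum((centers - x) ** 2, axis=1)   # squared distance to each center
        return np.exp(-d2 / (2.0 * width ** 2))   # Gaussian activations
    return phi

# Example: a coarse grid of centers over the two link angles (x1, x2) in radians.
angles = np.linspace(-0.5, 0.5, 5)
centers = np.array([[a1, a2] for a1 in angles for a2 in angles])
phi = make_rbf(centers, width=0.25)
features = phi(np.array([0.05, -0.02]))           # feature vector for one state
```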
2. REINFORCEMENT LEARNING

At each discrete time t, the agent observes a state vector xt containing information about its current state, selects an action at, and then receives an instantaneous reward rt resulting from the state transition in the environment. In general, the reward and the next state may be random, but their probability distributions are assumed to depend only on xt and at in Markov decision processes (MDPs), in which many RL algorithms are studied. The objective of RL is to construct a policy that maximizes the agent's performance. The performance measure for tasks of nonspecific duration is the cumulative discounted reward defined as:

$$ V_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k} \tag{1} $$
where the discount factor γ (0 ≤ γ < 1) determines the weight of future rewards.
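As a quick illustration of Eq. (1), a finite reward sequence can be folded into the discounted return as follows, truncating the infinite sum at the end of the episode; the sample rewards and γ = 0.95 are arbitrary.

```python
# Discounted return of Eq. (1), truncated to a finite episode.
def discounted_return(rewards, gamma=0.95):
    v = 0.0
    for r in reversed(rewards):   # fold from the last reward backwards
        v = r + gamma * v
    return v

print(discounted_return([1.0, 1.0, 0.0, -1.0]))  # V_t for a short reward sequence
```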
3. ACTOR/CRITIC ALGORITHM
Fig.8 The relation between the accumulated number of failures and the number of learning steps in controlling the inverted pendulum (x-axis: number of learning steps, 0-15000; y-axis: accumulated number of failures, 0-200), comparing the nested A/C agent with A/C agents using discrete outputs (-0.1, 0.1) and (-0.1, -0.05, 0.05, 0.1).
5.3 Prototype model
We performed a preliminary experiment using a prototype mechanical system, shown in Fig.9(a). The results are shown in Fig.9 and Fig.10. The proposed algorithm also performed well in controlling the actual mechanical system, as shown in Fig.9(b).
Fig.9 (a) The prototype model. (b) The relation between the number of accumulated failures and the number of learning steps in the control of the inverted pendulum (x-axis: number of learning steps, 0-900; y-axis: number of accumulated failures, 0-35).

Fig.10 The performance of the nested A/C agent on the prototype after 1000 learning steps. (a) The angles x1 and x2 (rad) over time (sec). (b) The voltage (V) output to the joint motor over time (sec).

6. DISCUSSION

In the simulation, it can be observed that after a number of learning iterations the system acquired the ability to generate a periodic movement to stabilize its upright position. The nested A/C algorithm is more powerful than the conventional A/C algorithm whose output is discrete: in the nested A/C there is no need to tune the discrete output values, which is important for applying the system to real-world problems. Compared with the simulation, the result of the prototype model is inferior with regard to the duration of pendulum stability. Delay in the control system is believed to be one of the reasons. In both the simulation and the prototype, the number of failures decreases as the learning progresses. The stability of learning was affected by ΔP (the correction of the weight of the perceptron).

7. CONCLUSIONS
In this paper, we proposed the nested actor/critic algorithm, which is able to deal with continuous state and action spaces. In future work, we will investigate the proposed algorithm from a theoretical standpoint, e.g. how the weights are adapted independently of the observed state, and consider applying it to robot control with multiple degrees of freedom.
8. REFERENCES
[1] A. G. Barto, R. S. Sutton, and C. W. Anderson, "Neuronlike adaptive elements that can solve difficult learning control problems," IEEE Trans. Systems, Man, and Cybernetics, vol. 13, pp. 834-846, 1983.
[2] R. S. Sutton, "Generalization in reinforcement learning: Successful examples using sparse coarse coding," in Advances in Neural Information Processing Systems 8, eds. D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, pp. 1038-1044, MIT Press, Cambridge, MA, USA, 1996.
[3] J. S. Albus, Brain, Behavior, and Robotics, Byte Books, 1981.
[4] K. Doya, "Reinforcement learning in continuous time and space," Neural Computation, vol. 12, pp. 243-269, 1999.
[5] R. H. Crites and A. G. Barto, "An Actor/Critic Algorithm that is Equivalent to Q-Learning," Advances in Neural Information Processing Systems 7, pp. 401-408, 1994.
[6] J. Morimoto and K. Doya, "Learning Dynamic Motor Sequence in High-Dimensional State Space by Reinforcement Learning: Learning to Stand Up," Trans. IEICE, vol. 82-D-2, no. 11, pp. 2118-2131, 1999.
9. APPENDIX

The equation of motion of the double inverted pendulum shown in Fig.1 is as follows:

$$
\begin{bmatrix}
I_1 + m_1 (l_1 S_1)^2 & m_1 (l_1 S_1) l_2 \cos(x_1 - x_2) \\
m_1 (l_1 S_1) l_2 \cos(x_1 - x_2) & m_1 l_2^2 + I_2
\end{bmatrix}
\begin{bmatrix} \ddot{x}_1 \\ \ddot{x}_2 \end{bmatrix}
+
\begin{bmatrix}
m_1 (l_1 S_1) l_2 \dot{x}_2^{\,2} \sin(x_1 - x_2) + D_1 \dot{x}_1 - m_1 g (l_1 S_1) \sin x_1 \\
- m_1 (l_1 S_1) l_2 \dot{x}_1^{\,2} \sin(x_1 - x_2) + D_2 \dot{x}_2 - m_1 g l_2 \sin x_2 - m_2 g (l_2 S_2) \sin x_2
\end{bmatrix}
=
\begin{bmatrix} \tau_1 \\ \tau_2 \end{bmatrix}
\tag{7}
$$

where D1 and D2 are the energy dissipation terms of the two links,

$$ D_1 = (d_1 \dot{x}_1)/2, \qquad D_2 = (d_2 \dot{x}_2)/2, $$

and I1 and I2 are the moments of inertia of the two links,

$$ I_1 = \frac{m_1 l_1^2 \,(1 + 3 S_1^2 - 3 S_1)}{3}, \qquad I_2 = m_2 l_2^2 S_2^2. $$
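For readers who want to reproduce the simulation environment, the following is a minimal sketch that integrates the equation of motion (7) with the parameters of Fig.1. The explicit Euler step, the 1 ms time step, and the mapping of the joint torque to (tau, -tau) on the two absolute-angle coordinates are illustrative assumptions, not the exact simulation settings used in the paper.

```python
import numpy as np

# Parameters from Fig.1 (link 1 is the upper link, link 2 the lower link).
m1, m2 = 0.08, 0.1        # link mass (kg)
l1, l2 = 0.252, 0.142     # link length (m)
S1, S2 = 0.5, 0.5         # center-of-mass position (fraction of link length)
d1, d2 = 0.08, 0.3        # coefficient of friction
g = 9.81

I1 = m1 * l1**2 * (1 + 3*S1**2 - 3*S1) / 3   # appendix definitions
I2 = m2 * l2**2 * S2**2

def step(x1, x2, dx1, dx2, tau, dt=0.001):
    """One explicit Euler step of Eq. (7); tau = (tau1, tau2)."""
    D1, D2 = d1 * dx1 / 2.0, d2 * dx2 / 2.0   # dissipation terms as defined above
    c12 = m1 * (l1 * S1) * l2 * np.cos(x1 - x2)
    M = np.array([[I1 + m1 * (l1 * S1)**2, c12],
                  [c12, m1 * l2**2 + I2]])
    c = np.array([
        m1*(l1*S1)*l2 * dx2**2 * np.sin(x1 - x2) + D1*dx1 - m1*g*(l1*S1)*np.sin(x1),
        -m1*(l1*S1)*l2 * dx1**2 * np.sin(x1 - x2) + D2*dx2
            - m1*g*l2*np.sin(x2) - m2*g*(l2*S2)*np.sin(x2),
    ])
    ddx = np.linalg.solve(M, np.asarray(tau) - c)   # M [ddx1, ddx2]^T = tau - c
    return x1 + dt*dx1, x2 + dt*dx2, dx1 + dt*ddx[0], dx2 + dt*ddx[1]

# Example: apply a small joint torque for 0.1 s starting near the upright position.
x1, x2, dx1, dx2 = 0.01, -0.01, 0.0, 0.0
for _ in range(100):
    x1, x2, dx1, dx2 = step(x1, x2, dx1, dx2, tau=(0.02, -0.02))
```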