Helicopter Velocity Tracking Control by Adaptive Actor-Critic Reinforcement Method Yang Hu, Yang Chen, Jianda Han, Yuechao Wang, Juntong Qi
Abstract—A robotic helicopter is an aircraft equipped with a sensing, computing, actuation, and communication infrastructure that allows it to execute a variety of tasks autonomously. In this paper, we present an adaptive actor-critic reinforcement learning method to obtain a near-optimal controller for a small autonomous helicopter. A Q-value look-up table serves as the critic and is trained by the SARSA algorithm. A BP neural network serves as the actor and generates the control signals for the helicopter dynamics. We first introduce the proposed actor-critic reinforcement controller, then apply the algorithm to an unmanned helicopter, a highly nonlinear and complex system, and present the simulation results.
I. INTRODUCTION
The control of an autonomous helicopter is a challenging problem because its dynamics are high-dimensional, asymmetric, noisy, and nonlinear. To obtain optimal controllers, machine learning methods have been widely applied. A. Y. Ng and Pieter Abbeel [1-2] proposed a reinforcement learning (RL) approach for helicopter control called apprenticeship learning. Their helicopter performs very difficult aerobatic maneuvers that are well beyond the abilities of all but the best pilots, and a reinforcement learning competition has been held to design the best reinforcement learning algorithm for the helicopter control task. Recently, much research on helicopter control has used reinforcement learning. In [7], a reinforcement-learning-based neuro-control system is proposed for stable helicopter hovering. Engel [6] also designed a controller that can be used for hovering but not for velocity tracking. It is well known that the real helicopter system is nonlinear, time-varying, and coupled. Consequently, it is difficult to describe the real system with an accurate mathematical model. In this paper, an actor-critic architecture is embedded in a velocity-tracking system based on the known dynamics of the helicopter. In order to realize precise tracking, a simplified, linear, time-invariant model is derived with the help of a system identification method [3]. The controller studied in this paper adopts a kind of reinforcement learning method named adaptive heuristic critic (AHC). The helicopter dynamics identified by Song in [3] is used for comparison. To be more specific, our method is based on the actor-critic reinforcement learning framework. The helicopter can learn near-optimal behaviors in an unknown environment through trial-and-error learning once certain conditions are satisfied [4]. The reinforcement learning agent interacts with the environment in real time through a sequence of steps: selecting an action, transitioning to the next state, and then receiving a reinforcement signal for that state. Through this process, the agent learns the policy that gives the optimal action in each state. This paper discusses a controller design method for the velocity tracking problem using actor-critic reinforcement learning. In the next section, we first introduce the architecture of the application system, and then introduce the actor-critic reinforcement learning method for designing the velocity tracking controller. In addition, we describe the identified helicopter dynamics that serve as the control plant. Finally, simulation results are presented to verify our approach.

Manuscript received July 15, 2011. This work was supported in part by the National Natural Science Foundation of China under grants 61035005 and 61005086. Yang Hu is with the State Key Laboratory of Robotics, Shenyang Institute of Automation (SIA), Chinese Academy of Sciences (CAS), 110016, Shenyang, China, and with the Graduate School, CAS, 100039, Beijing, China (email: [email protected]). Yang Chen is with the State Key Laboratory of Robotics, SIA, CAS, Shenyang, China; with the School of Information Science and Engineering, Wuhan University of Science and Technology, 430081, Wuhan, China; and with the Graduate School, CAS, Beijing, China. Jianda Han, Yuechao Wang, and Juntong Qi are with the State Key Laboratory of Robotics, SIA, CAS, 110016, Shenyang, China (corresponding email: [email protected]).

II. THE CONTROLLER BASED ON ACTOR-CRITIC REINFORCEMENT
A. The whole control system

In order to obtain a controller for the velocity tracking problem, a feedback control mechanism is adopted in our method. The errors between the desired reference signals and the real outputs are fed back to the reinforcement learning controller. As far as the dynamics are concerned, the helicopter has 12 state variables and 4 input signals, which correspond to 4 error signals. Fig. 1 shows the whole architecture of our application.
Fig. 1. The overall control architecture: the reference R and the fed-back state produce the error E, which drives the RL controller; the controller output U is applied to the helicopter dynamics. The reference signals are the longitudinal, lateral, and vertical linear velocities and the yaw angular velocity, R = [r_u, r_v, r_w, r_r]'. E = [e_u, e_v, e_w, e_r]' is the error vector between the outputs and the reference signals. U = [U_lon, U_lat, U_col, U_ped]' contains the control signals for the helicopter: the longitudinal cyclic pitch, the lateral cyclic pitch, the main rotor collective pitch, and the tail rotor collective pitch. The state is characterized by 12 variables: position (x, y, z), orientation (roll, pitch, and yaw angles), velocity (u, v, w), and angular velocity (p, q, r).
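For concreteness, the sketch below shows how the signals in Fig. 1 can be assembled in code; the state ordering, the sign convention of the error, and all variable names are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

# Illustrative wiring of the Fig. 1 signals (assumed state ordering and sign convention).
state = np.zeros(12)                 # [x, y, z, roll, pitch, yaw, u, v, w, p, q, r] (assumed order)
R = np.array([0.1, 0.0, 0.0, 0.0])   # reference [r_u, r_v, r_w, r_r]
y = state[[6, 7, 8, 11]]             # tracked outputs [u, v, w, r]
E = R - y                            # error vector fed to the RL controller (assumed sign)
U = np.zeros(4)                      # controller output [U_lon, U_lat, U_col, U_ped]
```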
B. The architecture of the actor-critic reinforcement learning controller

In the reinforcement learning architecture, the agent learns to maximize the cumulative reward in an unknown environment through trial-and-error exploration. The cumulative reward increases in every episode as the critic and actor are updated and the current action approaches the best action in the Q-Table. The architecture of the actor-critic reinforcement learning controller is illustrated in Fig. 2.
Fig. 2. The actor-critic controller structure: the error e between the reference signal and the output is fed to the actor with weights (W, V); the actor outputs the current action a, which is scaled by the factor k to give the control signal u; the critic stores the values Q(e, a), from which the best action a* is selected. (W, V) are the weight matrices of the BP neural network.
C. Actor network

We use a BP neural network as the actor network (Fig. 3). The network is trained by the back-propagation algorithm based on the error between the current action and the best action:

$$\Phi_t = \tanh(W_t e_t),\qquad a_t = V_t\,\Phi_t,\qquad u_t = k\,a_t,\qquad s_{t+1} = g(u_t, s_t)\tag{1}$$

where $W_t$ and $V_t$ are the current weights of the neural network, $k$ is a scaling factor from action to control signal, and $g(\cdot)$ is the plant function. The activation function is the hyperbolic tangent, as shown in (2):

$$\tanh(x)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}\tag{2}$$

Fig. 3. The actor network with an input layer, a hidden layer, and an output layer. $e_t$ is the input, $a_t$ is the output, and $W$ and $V$ are the weight matrices of the BP neural network.

The weights are updated according to

$$V_{t+1}=V_t+\beta\,(a_t^{*}-a_t)\,\Phi_t^{T},\qquad W_{t+1}=W_t+\beta\,V^{T}(a_t^{*}-a_t)\,(1-\Phi_t^{T}\Phi_t)\,e_t\tag{3}$$

where $\beta$ is the learning rate and $a_t^{*}$ is the best action selected from the Q-Table by the greedy policy. Here 'greedy' means selecting the action with the largest value for the current state.

D. Critic network

The critic network is a look-up table of Q values (Fig. 4), updated by the on-policy SARSA algorithm; it is similar to the critic of Kretchmar et al. [5]. The rows of the Q-Table represent the states (errors) and the columns represent the actions, so with n states and m actions the Q-Table is an n × m matrix. The Q-Table is updated by the SARSA rule

$$Q_{t+1}(s_t,a_t)=Q_t(s_t,a_t)+\alpha\big[r_{t+1}+\gamma\,Q_t(s_{t+1},a_{t+1})-Q_t(s_t,a_t)\big],\qquad r_{t+1}=1-1000\,s_t^{2}\tag{4}$$

where $\alpha$ is the learning rate of the critic, $r_{t+1}$ is the reward obtained by executing the action in the current state, and $\gamma$ is the discount factor. $s_t$ is the current state, which in our application is the tracking error, and $Q(s,a)$ is the value of the state-action pair; the subscript $t$ denotes the current step and $t+1$ the next step.

Fig. 4. The critic look-up table $Q(e,a)$. $e$ is the current state and $a_t$ is the current action; by greedy selection over the current state we obtain the value $Q(e,a)$.
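As a concrete illustration of (1)-(4), the following NumPy sketch implements the actor forward pass, the actor weight update, and the tabular SARSA update; the function names, array shapes, default hyper-parameters, and the element-wise reading of $(1-\Phi_t^{T}\Phi_t)$ as the tanh derivative are our own assumptions, not the authors' code.

```python
import numpy as np

def actor_forward(W, V, e, k=1.0):
    """Eq. (1): hidden activation Phi = tanh(W e), action a = V Phi, control u = k a."""
    Phi = np.tanh(W @ e)          # e: (n_in,), W: (n_hidden, n_in)
    a = V @ Phi                   # V: (n_out, n_hidden)
    return Phi, a, k * a

def actor_update(W, V, Phi, e, a, a_star, beta=0.01):
    """Eq. (3): pull the actor output toward the best action a* from the Q-Table.
    (1 - Phi^T Phi) is read element-wise as the tanh derivative (an assumption)."""
    delta = a_star - a
    hidden = (V.T @ delta) * (1.0 - Phi * Phi)
    return W + beta * np.outer(hidden, e), V + beta * np.outer(delta, Phi)

def sarsa_update(Q, s_idx, a_idx, r, s_next_idx, a_next_idx, alpha=0.1, gamma=0.9):
    """Eq. (4): on-policy SARSA update of the tabular critic (rows = states, cols = actions)."""
    td = r + gamma * Q[s_next_idx, a_next_idx] - Q[s_idx, a_idx]
    Q[s_idx, a_idx] += alpha * td
    return Q
```

With a discretized action grid `actions`, the greedy selection of the best action then corresponds to `a_star = actions[np.argmax(Q[s_idx])]` (again an illustrative choice).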
We now summarize the steps of our actor-critic reinforcement learning algorithm as the following pseudo-code:
1) Initialize the weight matrices W and V of the actor network with random values in [0, 1].
2) Initialize the critic Q-Table as an n × m zero matrix.
3) Present the current state $s_t$ to the actor and compute the current action $a_t$.
4) Apply the action $a_t$ to the plant.
5) Observe the new state $s_{t+1}$ and obtain the reinforcement signal $r_{t+1}$.
6) Present the new state $s_{t+1}$ to the actor and compute the next action $a_{t+1}$.
7) Obtain the best action $a^{*}$ for the current state and compute the error between $a^{*}$ and $a_t$.
8) Update the critic Q-Table: $Q_{t+1}(s_t,a_t)=Q_t(s_t,a_t)+\alpha\big[r_{t+1}+\gamma\,Q_t(s_{t+1},a_{t+1})-Q_t(s_t,a_t)\big]$.
9) Update the actor weights of the BP network: $V_{t+1}=V_t+\beta\,(a_t^{*}-a_t)\,\Phi_t^{T}$, $W_{t+1}=W_t+\beta\,V^{T}(a_t^{*}-a_t)\,(1-\Phi_t^{T}\Phi_t)\,e_t$.
10) The new state becomes the current state, and the loop continues from step 3); a runnable sketch of this loop is given below.
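Putting the steps together, here is a compact, self-contained sketch of the training loop on a toy scalar plant; the plant, the discretization grids, the discount factor, and all hyper-parameter values are assumptions made only so the loop runs end to end, and should not be read as the paper's actual simulation setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Critic look-up table: rows = discretized error states, columns = discretized actions.
states  = np.arange(-3.0, 3.0 + 1e-9, 0.01)
actions = np.arange(-1.0, 1.0 + 1e-9, 0.01)
Q = np.zeros((len(states), len(actions)))        # step 2): n x m zero matrix

# Step 1): 1-3-1 BP actor with random weights in [0, 1].
W, V = rng.random((3, 1)), rng.random((1, 3))
alpha, gamma, beta, k = 0.1, 0.9, 0.01, 1.0      # gamma and k are toy choices here

def idx(grid, x):
    return int(np.clip(np.searchsorted(grid, x), 0, len(grid) - 1))

def actor(e):
    Phi = np.tanh(W @ np.array([e]))
    return Phi, (V @ Phi).item()

def plant(u, s):                                  # toy first-order plant standing in for g(u, s)
    return 0.9 * s + 0.02 * u

s = 0.5
Phi, a = actor(s)                                 # step 3)
for t in range(200):
    u = k * a                                     # scale action to control signal
    s_next = plant(u, s)                          # step 4)
    r = 1.0 - 1000.0 * s * s                      # step 5): reward from eq. (4)
    Phi_next, a_next = actor(s_next)              # step 6)
    si, ai = idx(states, s), idx(actions, a)
    sj, aj = idx(states, s_next), idx(actions, a_next)
    a_star = actions[int(np.argmax(Q[si]))]       # step 7): greedy best action for the current state
    Q[si, ai] += alpha * (r + gamma * Q[sj, aj] - Q[si, ai])   # step 8): SARSA update
    delta = a_star - a                            # step 9): actor update, eq. (3)
    hidden = (V.T * delta).ravel() * (1.0 - Phi.ravel() ** 2)
    V += beta * delta * Phi.T
    W += beta * np.outer(hidden, [s])
    s, Phi, a = s_next, Phi_next, a_next          # step 10)
```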
III. HELICOPTER DYNAMICS

Our control system is based on the concept of a feedback controller. The error between the reference signal and the output of the plant is used as the state of the RL controller. There are many states in the dynamic equations of the helicopter, but we use only the three linear velocities and the yaw angular velocity as reference signals. The four control signals generated by the controllers are the inputs of the plant and drive the four outputs to track the references. By simple reasoning, each channel is used to control its corresponding state variable. As a result, four RL controllers are derived to control the helicopter: a lateral controller, a longitudinal controller, an altitude controller, and a yaw controller. We use the dynamic equations from Song's experiments [3] with small changes, and the dynamic model is valid for cruise flight. The real parameters may differ slightly when the helicopter executes different maneuvers, but we ignore this effect.

We now select the actions and states used in the actor-critic reinforcement learning method. The action space is a 4-dimensional continuous vector:
U_lon: longitudinal (front-back) cyclic pitch, [-1, 1].
U_lat: lateral (left-right) cyclic pitch, [-1, 1].
U_col: main rotor collective pitch, [-1, 1].
U_ped: tail rotor collective pitch, [-1, 1].
The state space is defined as follows:
u: forward velocity.
v: sideways velocity (to the right).
w: downward velocity.
r: angular velocity around the helicopter's z axis.

The decoupled state equations of the helicopter for the four channels are given below, and the corresponding parameters are listed in Table I.

$$\begin{pmatrix}\Delta\dot u\\ \Delta\dot q\\ \Delta\dot\theta\\ \dot a\\ \dot c\end{pmatrix}=\Delta\dot X_{lon}=\begin{pmatrix}X_u & 0 & -g & X_a & 0\\ M_u & 0 & 0 & M_a & 0\\ 0 & 1 & 0 & 0 & 0\\ 0 & -1 & 0 & -1/\tau_f & A_c/\tau_f\\ 0 & -1 & 0 & 0 & -1/\tau_s\end{pmatrix}\begin{pmatrix}\Delta u\\ \Delta q\\ \Delta\theta\\ a\\ c\end{pmatrix}+\begin{pmatrix}X_{lon} & X_{lat}\\ M_{lon} & M_{lat}\\ 0 & 0\\ A_{lon} & A_{lat}\\ C_{lon} & C_{lat}\end{pmatrix}\begin{pmatrix}U_{lon}\\ U_{lat}\end{pmatrix}\tag{5}$$

$$\begin{pmatrix}\Delta\dot v\\ \Delta\dot p\\ \Delta\dot\varphi\\ \dot b\\ \dot d\end{pmatrix}=\Delta\dot X_{lat}=\begin{pmatrix}Y_v & 0 & g & Y_b & 0\\ L_v & 0 & 0 & L_b & 0\\ 0 & 1 & 0 & 0 & 0\\ 0 & -1 & 0 & -1/\tau_f & B_d/\tau_f\\ 0 & -1 & 0 & 0 & -1/\tau_s\end{pmatrix}\begin{pmatrix}\Delta v\\ \Delta p\\ \Delta\varphi\\ b\\ d\end{pmatrix}+\begin{pmatrix}Y_{lon} & Y_{lat}\\ L_{lon} & L_{lat}\\ 0 & 0\\ B_{lon} & B_{lat}\\ D_{lon} & D_{lat}\end{pmatrix}\begin{pmatrix}U_{lon}\\ U_{lat}\end{pmatrix}\tag{6}$$

$$\Delta\dot w=\Delta\dot X_{ver}=\begin{pmatrix}Z_w & Z_r & 0\end{pmatrix}\begin{pmatrix}\Delta w\\ \Delta r\\ \Delta r_{fb}\end{pmatrix}+Z_{col}U_{col}\tag{7}$$

$$\begin{pmatrix}\Delta\dot r\\ \Delta\dot r_{fb}\end{pmatrix}=\Delta\dot X_{yaw}=\begin{pmatrix}N_w & N_r & N_{rfb}\\ 0 & K_r & K_{rfb}\end{pmatrix}\begin{pmatrix}\Delta w\\ \Delta r\\ \Delta r_{fb}\end{pmatrix}+\begin{pmatrix}N_{ped}\\ 0\end{pmatrix}U_{ped}\tag{8}$$

where u, v, and w are the longitudinal, lateral, and vertical velocities, respectively; p, q, and r are the roll, pitch, and yaw angular rates; ϕ and θ are the roll and pitch angles; a and b are the first-harmonic flapping angles of the main rotor; c and d are the first-harmonic flapping angles of the stabilizer bar; and r_fb is the feedback value of the angular rate gyro. U_lat is the lateral control input, U_lon is the longitudinal control input, U_ped is the yaw control input, and U_col is the vertical control input.

IV. SIMULATION BY CONTROLLING THE HELICOPTER DYNAMICS MODEL

We now demonstrate our actor-critic reinforcement learning algorithm on the velocity tracking control of the helicopter. The reference signal changes with time. During the tracking process, the learning agent receives a large negative reward if the output of the plant cannot track the reference signal; in that case we simply terminate the current episode and jump to the next one. The sample interval of our system is 0.02 s. The learning rate of the critic network is 0.1 and the learning rate of the actor is 0.01, so the actor network learns more slowly than the critic network. Every episode takes 10000 steps, so the simulation time per episode is 200 s. Actions lie in [-1, 1], the scaling factor k is set to 1000, and the action interval is chosen as 0.01. States lie in [-3, 3], and the state interval is also 0.01. We select a 1×3×1 BP neural network and initialize its weights randomly in [0, 1] at the start of training. In order to test our controller, we change the reference signal over time: the reference signal is set to 0.1 sin(kt) and changed to 0.1 sgn(sin(kt)) during the simulation, and k is slowly increased to change the period of the reference signal. The reference signal is applied to the longitudinal channel. Fig. 5 shows the longitudinal channel velocity, control signal, and tracking error in the first episode, and Fig. 6 shows the 38th episode, in which the training is complete. By comparison we can see the evolution of our controller: over the 38 training episodes, the error between the reference signal and the output decreases significantly.
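The time-varying reference described above can be generated as follows; the rate at which k grows and the moment the sine is replaced by the square wave are not given numerically in the paper, so the schedule below is purely illustrative.

```python
import numpy as np

dt, steps = 0.02, 10000                       # 0.02 s sample interval, 10000 steps = 200 s
t = np.arange(steps) * dt
k = 0.5 + 0.002 * t                           # slowly increasing frequency factor (assumed rate)
ref_sine   = 0.1 * np.sin(k * t)              # 0.1 sin(kt)
ref_square = 0.1 * np.sign(np.sin(k * t))     # 0.1 sgn(sin(kt))
r_u = np.where(t < 100.0, ref_sine, ref_square)   # assumed switch halfway through the episode
```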
TABLE I
PARAMETERS OF HELICOPTER DYNAMICS

Longitudinal parameters: Xu = 0.2446, Xa = -4.962, Xlat = -0.686, Xlon = 0.089, Mu = -1.258, Ma = 46.06, Mlat = -0.626, Mlon = 3.394, Ac = 0.162, Alat = -0.017, Alon = -0.258, Clat = 2.238, Clon = -4.144, τf = 0.502.
Lateral parameters: Yv = -0.057, Yb = 9.812, Ylat = -1.823, Ylon = 2.191, Lv = 15.84, Lb = 126.6, Llat = -4.875, Llon = 28.64, Bd = -1.654, Blat = 0.047, Blon = -9.288, Dlat = -0.779, Dlon = -5.726, τs = 0.505.
Vertical parameters: Zw = 1.666, Zr = -3.784, Zcol = -11.11.
Yawing parameters: Nw = -0.027, Nr = -1.807, Nrfb = -1.845, Ncol = -0.972, Kr = -0.040, Krfb = -2.174.
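As a quick consistency check of the identified longitudinal dynamics, the sketch below assembles the matrices of (5) from the Table I values and integrates the subsystem at the 0.02 s sample interval; the forward-Euler scheme, the constant test input, and the use of τs on the stabilizer-bar row follow our reconstruction of (5) and are assumptions, not the authors' code.

```python
import numpy as np

g = 9.81
Xu, Xa, Mu, Ma = 0.2446, -4.962, -1.258, 46.06          # Table I, longitudinal parameters
Xlon, Xlat, Mlon, Mlat = 0.089, -0.686, 3.394, -0.626
Alon, Alat, Clon, Clat = -0.258, -0.017, -4.144, 2.238
Ac, tau_f, tau_s = 0.162, 0.502, 0.505

# State x = [du, dq, dtheta, a, c], inputs [U_lon, U_lat], following the structure of (5).
A = np.array([
    [Xu,   0.0, -g,  Xa,          0.0],
    [Mu,   0.0, 0.0, Ma,          0.0],
    [0.0,  1.0, 0.0, 0.0,         0.0],
    [0.0, -1.0, 0.0, -1.0/tau_f,  Ac/tau_f],
    [0.0, -1.0, 0.0, 0.0,         -1.0/tau_s],
])
B = np.array([
    [Xlon, Xlat],
    [Mlon, Mlat],
    [0.0,  0.0],
    [Alon, Alat],
    [Clon, Clat],
])

dt = 0.02
x = np.zeros(5)
u_ctrl = np.array([0.1, 0.0])                 # small constant cyclic input (test signal)
for _ in range(500):                          # 10 s of open-loop response
    x = x + dt * (A @ x + B @ u_ctrl)         # forward-Euler step of dx/dt = A x + B u
```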
Fig. 5. The first training episode. In the top plot, the green curve is the longitudinal reference signal and the blue curve is the output signal. The middle plot shows the control signal. The bottom plot shows the error between the reference signal and the output.

Fig. 6. The 38th training episode. In the top plot, the green curve is the longitudinal reference signal and the blue curve is the output signal. The middle plot shows the control signal. The bottom plot shows the error between the reference signal and the output.

Because the longitudinal channel and the altitude channel are coupled, the altitude velocity also changes when the helicopter's longitudinal velocity changes periodically. In order to keep the helicopter flying stably, this vibration must be very small. The altitude reference signal is set to 0, which means the helicopter is expected to fly in the horizontal plane. Fig. 7 shows the altitude-channel vibration during longitudinal velocity tracking in the first episode, and Fig. 8 shows it in the 38th episode. During training, the error of the altitude channel also decreases.

Fig. 7. The 1st training episode. In the top plot, the altitude reference signal is set at 0, and the output of the altitude channel approaches the reference slowly. The middle plot shows the control signal. The bottom plot shows the error between the reference signal and the output.

Fig. 8. The 38th training episode. In the top plot, the altitude reference signal is set at 0, and the output of the altitude channel approaches the reference quickly. The middle plot shows the control signal. The bottom plot shows the error between the reference signal and the output.

When the training is complete, we visualize the Q-Table in three dimensions. Fig. 9 shows the Q-Table of the longitudinal channel and Fig. 10 shows the Q-Table of the altitude channel. In the longitudinal channel, the reference signal changes with time, so more state-action pairs are visited and the Q-Table is filled with more non-zero entries.
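A Q-Table of this kind can be rendered as a 3-D surface with matplotlib, in the spirit of Figs. 9 and 10; the random placeholder table and the axis orientation below are illustrative only.

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401  (registers the 3-D projection)

# Placeholder table: rows = state (error) indices, columns = action indices.
Q = np.random.default_rng(1).normal(size=(120, 25)) * 100
action_idx, state_idx = np.meshgrid(np.arange(Q.shape[1]), np.arange(Q.shape[0]))

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
ax.plot_surface(state_idx, action_idx, Q, cmap="viridis")
ax.set_xlabel("state (error) index")
ax.set_ylabel("action index")
ax.set_zlabel("Q value")
plt.show()
```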
Fig. 9. The Q-Table of the longitudinal channel displayed in three dimensions.

Fig. 10. The Q-Table of the altitude channel displayed in three dimensions.

The total reward increases during training. Fig. 11 shows the total rewards of the longitudinal and altitude channels over 12 episodes; both total rewards increase with every episode.

Fig. 11. The top plot is the longitudinal channel reward and the bottom plot is the altitude channel reward.

V. CONCLUSION

We apply the actor-critic reinforcement learning architecture to velocity tracking control of a helicopter dynamic model. With this method, satisfactory control performance is achieved after training. The actor-critic reinforcement learning method can obtain a near-optimal controller automatically without much prior knowledge, and this approach can also be applied to control tasks such as the Inverted Pendulum and Mountain Car problems, and even to more complicated control challenges.

REFERENCES
[1] A. Y. Ng, H. J. Kim, M. I. Jordan, and S. Sastry, "Autonomous helicopter flight via reinforcement learning," in Advances in Neural Information Processing Systems (NIPS), S. Thrun, L. K. Saul, and B. Scholkopf, Eds. MIT Press, 2003.
[2] P. Abbeel, V. Ganapathi, and A. Y. Ng, "Learning vehicular dynamics, with application to modeling helicopters," in NIPS, 2005.
[3] D. Song, J. Qi, and L. Dai, "Modelling a small-size unmanned helicopter using optimal estimation in the frequency domain," Int. J. Intelligent Systems Technologies and Applications, vol. 8, nos. 1-4, 2010.
[4] R. Sutton and A. Barto, Reinforcement Learning: An Introduction. The MIT Press, 1998.
[5] R. M. Kretchmar, P. M. Young, and C. W. Anderson, Robust Reinforcement Learning Control with Static and Dynamic Stability. Colorado State University, May 30, 2001.
[6] J. M. Engel, "Reinforcement learning applied to UAV helicopter control," Master's thesis, Delft University of Technology, 2005.
[7] D. J. Lee and H. Bang, "Reinforcement learning based neuro-control systems for an unmanned helicopter," in Proc. International Conference on Control, Automation and Systems, 2010.
[8] Y. Chen and J. Han, "LP-based path planning for target pursuit and obstacle avoidance in 3D relative coordinates," in Proc. American Control Conference, 2010, pp. 5394-5399.
[9] A. Coates, P. Abbeel, and A. Y. Ng, "Learning for control from multiple demonstrations," in Proc. 25th International Conference on Machine Learning, Helsinki, Finland, 2008, pp. 144-151.
[10] B. D. Argall, S. Chernova, M. Veloso, and B. Browning, "A survey of robot learning from demonstration," Robotics and Autonomous Systems, vol. 57, no. 3, pp. 469-483, May 2009.
[11] B. Mettler, C. Dever, and E. Feron, "Scaling effects and dynamic characteristics of miniature rotorcraft," Journal of Guidance, Control, and Dynamics, vol. 27, no. 3, May-June 2004.
[12] D. Lee, H. J. Kim, and S. Sastry, "Feedback linearization vs. adaptive sliding mode control for a quadrotor helicopter," International Journal of Control, Automation, and Systems, vol. 7, no. 3, pp. 419-428, June 2009.