Article

Control Strategy of Speed Servo Systems Based on Deep Reinforcement Learning

Pengzhan Chen *, Zhiqiang He, Chuanxi Chen and Jiahong Xu

School of Electrical Engineering and Automation, East China Jiaotong University, Nanchang 330013, China; [email protected] (Z.H.); [email protected] (C.C.); [email protected] (J.X.)
* Correspondence: [email protected]; Tel.: +86-139-7095-9308

Received: 8 March 2018; Accepted: 3 May 2018; Published: 5 May 2018
Abstract: We developed a novel control strategy for speed servo systems based on deep reinforcement learning. The control parameters of speed servo systems are difficult to regulate in practical applications, and moment disturbances and inertia mutations occur during operation. A class of reinforcement learning agents for speed servo systems is designed on the basis of the deep deterministic policy gradient algorithm. The agents are trained with a large amount of system data. After training, they can automatically adjust the control parameters of the servo system and compensate the current online, so that the servo system always maintains good control performance. Numerous experiments were conducted to verify the proposed control strategy. The results show that the proposed method can achieve proportional–integral–derivative (PID) automatic tuning and effectively overcome the effects of inertia mutation and torque disturbance.

Keywords: servo system; deep reinforcement learning; PID parameter tuning; torque disturbance; inertia change
1. Introduction

Servo systems are widely used in robots, aerospace, computer numerical control (CNC) machine tools, and other fields [1–3]. Their internal electrical characteristics often change, namely, the inertia of the mechanical structure connected to them changes and torque disturbances occur during the working process. Servo systems generally adopt a fixed-parameter-structure proportional–integral–derivative (PID) controller to complete the system adjustment process. The control performance of a PID controller with a fixed parameter structure is closely related to its control parameters; when the state of the control loop changes, it cannot achieve good control performance [4–6]. Servo systems therefore often need to retune their parameters and seek new control strategies to meet the needs of high-performance control in various situations.

Many researchers have proposed solutions, such as fuzzy logic, sliding mode, and adaptive control, for the problems of inertia variation and torque disturbance in servo system applications, aiming to improve control accuracy. Dong, Vu, and Han et al. [7] proposed a fuzzy neural network adaptive hybrid control system, which obtained high robustness when the torque disturbance and servo motor parameters were uncertain. However, a systematic fuzzy control theory is difficult to establish, and a large number of fuzzy rules must be defined to improve control accuracy, which greatly increases the algorithm complexity. Jezernik, Korelič, and Horvat [8] proposed a sliding mode control method for permanent magnet synchronous motors, which has a fast dynamic response and strong robustness when the torque changes. However, when the system parameters change, the controller output easily produces chattering. In [9], an adaptive PID control method based on the gradient descent method was proposed. The control method presented a fast dynamic response and a small steady-state error when the internal parameters
of the servo system changed. However, the algorithm was slow and difficult to converge. Although defects remain in existing PID control methods, the PID control structure is still an important way to improve the control performance of systems.

Deep reinforcement learning is an important branch of artificial intelligence and has made great progress in recent years. In 2015, the Google DeepMind team proposed a deep reinforcement learning algorithm, the deep Q-network (DQN) [10,11], and verified its universality and superiority in the games of Atari 2600, StarCraft, and Go [12]. This research progress has attracted the attention of many researchers: not only have many improvement strategies been proposed, but researchers have also begun to apply deep reinforcement learning to practical engineering applications. Lillicrap et al. proposed the deep deterministic policy gradient (DDPG) algorithm, which uses a replay buffer and target networks to address the convergence and slow-update problems of neural networks in continuous action spaces [13]. Jiang, Zhang, and Luo et al. [14] adopted a reinforcement learning method to realize optimized tracking control of completely unknown nonlinear Markov jump systems. Carlucho, Paula, and Villar et al. [15] proposed an incremental Q-learning adaptive PID control algorithm and applied it to mobile robots. Yu, Shi, and Huang et al. [16] proposed an optimal tracking control of an underwater vehicle based on the DDPG algorithm and achieved better control accuracy than the traditional PID algorithm. Other studies, such as [17,18], applied continuous-state-space reinforcement learning to energy management. The aforementioned results lay a good theoretical foundation; however, three shortcomings still exist in the control process of a servo system:
(1) Traditional control methods such as PID, fuzzy logic control, and model predictive control require complex mathematical models and expert knowledge, which are very difficult to obtain.
(2) The optimal tracking curves obtained by particle swarm optimization, genetic algorithms, and neural network algorithms [19,20] are usually effective only for specific cycle periods, lack online learning capability, and have limited generalization ability.
(3) Classic reinforcement learning methods such as Q-learning are prone to the curse of dimensionality, do not generalize well, and are usually only useful for specific tasks.

More than 90% of industrial control processes adopt the PID algorithm because its simple structure and robust performance remain stable under complex conditions. Unfortunately, it is very difficult to correctly adjust the gains of a PID controller, because the servo system is often disturbed during operation and suffers from time delays, inertia changes, and nonlinearity. DDPG is a data-driven control method that can learn the mathematical model of the system from its input and output data and realize optimal control of the system according to a given reward. The core idea of deep reinforcement learning is to evaluate decision-making behavior with an evaluation index and to train a class of agents through a large amount of system data generated in the decision-making process.
Therefore, the reinforcement learning method can be used to automatically adjust the control parameters of servo systems and, on the basis of the tuned control system, to provide compensation adjustments so that the servo system obtains satisfactory control performance. To the best of our knowledge, this is the first time deep reinforcement learning has been applied to a speed servo system. Deep reinforcement learning has good knowledge transfer ability, which is necessary for a servo system to track signals with different amplitudes or frequencies.

In this study, an adaptive compensation PID control strategy for a speed servo system based on the DDPG algorithm is proposed. Two types of reinforcement learning agents with the same structure are designed. The environment interaction object of the first class of agents is a servo system controlled by a PID controller. The actor–critic framework and four neural networks are adopted to construct the reinforcement learning agents. A reward function is constructed on the basis of the absolute value of the tracking error of the speed servo system, and a state set is constructed from the output angle and error of the speed servo system. A PI control structure is adopted in most practical
application processes of the servo system. Hence, the Kd of the PID controller is not considered in this paper, and the PID parameters Kp and Ki construct the action set. The second agent class changes its action set to the current that needs to be compensated after the PID output current, on the basis of the first agent class; the adaptive compensation of the PID output current is then realized. The convergence speed of the neural network is enhanced through a dynamic adjustment of the noise strategy [21,22], and L2 regularization is used to prevent neural network overfitting [23,24]. The first class of agents can automatically adjust the PID parameters. The second class can solve the control performance degradation caused by changes in the internal structural parameters of the servo system under the traditional PID control structure. By learning the input and output data of the system, the second kind of reinforcement learning agent can take the corresponding action according to its previous learning experience in the case of an unknown disturbance and an unknown tracking curve, and its effect is still better than that of the fixed-parameter PID algorithm.
2. Model Analysis of a Speed Servo System

The speed servo system consists of a current control loop, servo motor, load, and feedback system. According to the analysis of the components of the speed loop, the current closed loop, motor, and speed detection are assumed to be first-order systems. Using the PI control algorithm as the speed regulator, the speed servo system structure diagram can be obtained as shown in Figure 1.
Figure 1. Diagram of the speed servo system structure.
where $K_s$ is the proportional gain of the current loop, $T_s$ is the integral time constant, and $K_t$ is the torque current coefficient of the servo system. $K_{sf} = 60m/(p_f T)$ represents the speed feedback coefficient, where $m$ is the number of detected pulses, $p_f$ is the resolution of the encoder, and $T$ is the speed detection cycle. $2T_\Sigma$ is the time constant of the current closed-loop approximate system, $T_{sf}$ is the time constant of the speed feedback approximate system, $K$ is the stiffness coefficient of the mechanical transmission, and $B$ is the conversion coefficient of load friction.

When the impact of resonance on the servo system is ignored, the motor rotor and the load can be assumed to form a rigid body, and the inertia of the servo system can be expressed as $J = J_m + J_L$, where $J_m$ is the motor inertia and $J_L$ is the load inertia. The external load torque of the servo system is assumed to be $T_L$, and the speed-loop-controlled model can be expressed as:

$$G_{so}(s) = \frac{K_t K_{sf}}{((J_m + J_L)s + B)(T_{\Sigma n} s + 1)} \quad (1)$$

where $T_{\Sigma n} = 2T_\Sigma + T_{sf}$. Equation (1) can be rearranged into:

$$G_{so}(s) = \frac{B_1}{A_1 s^2 + A_2 s + A_3} \quad (2)$$

where $A_1 = (J_m + J_L)T_{\Sigma n}$, $A_2 = BT_{\Sigma n} + (J_m + J_L)$, $A_3 = B$, and $B_1 = K_t K_{sf}$.

The control time interval is used to discretize Equation (2), and the z-domain transfer function of the controlled object of the speed servo system is obtained as follows:
$$G_{so}(z) = \frac{b_1 z^{-1} + b_2 z^{-2}}{1 + a_1 z^{-1} + a_2 z^{-2}}. \quad (3)$$
Equation (3) can be rewritten as the following system difference equation for convenience:

$$\omega(t) = -a_1 \omega(t-1) - a_2 \omega(t-2) + b_1 i_q(t-1) + b_2 i_q(t-2) \quad (4)$$
where $\omega(t)$ is the speed at time $t$ and $i_q$ represents the q-axis component of the current. The load torque disturbance can also be described by:

$$\Delta\omega(k) = \Delta t \cdot \Delta T_L / (J_m + \Delta J_L). \quad (5)$$
Equation (5) shows that the speed fluctuation caused by changes in the load torque is associated with the system inertia and the load torque: a small inertia means that the same load torque fluctuation has a larger impact on the speed control process. The system inertia and load torque should therefore be carefully considered during servo system operation to obtain a stable velocity response. For example, the inertia of a robot arm changes constantly during its movement, and in the CNC machining process the mass of the processed part is reduced, which changes the inertia converted to the feeding axis. When the characteristics of the load object change, the characteristics of the entire servo system change. If the system inertia increases, the response of the system slows down, which is likely to cause system instability and crawling. On the contrary, if the system inertia decreases, the dynamic response speeds up, with speed overshoot and oscillation.

According to Equations (3)–(5), the relevant parameters of a specific type of servo system are given as follows: $T_{\Sigma n}$ of 1.25 × 10⁻⁴ s, motor inertia of 6.329 × 10⁻⁴ kg·m², and friction coefficient of 3.0 × 10⁻⁴ N·m·s/rad. We construct system 1 with a torque of 1 N·m and a load inertia of 2.5 × 10⁻³ kg·m², system 2 with a torque of 1 N·m and a load inertia of 6.3 × 10⁻³ kg·m², and system 3 with a torque of 3 N·m and a load inertia of 2.5 × 10⁻³ kg·m² to simulate the torque change and disturbance during the operation of the servo system. The rotation inertia change and external torque disturbance of the servo system during operation can then be regarded as jump switches among systems 1, 2, and 3 at different times. The transfer function of each system model can be obtained by substituting the parameters of systems 1, 2, and 3 into Equation (1). Using the control time interval ∆t = 100 µs to discretize Equation (1), the mathematical models of the servo systems are obtained according to Equation (4), as shown in Table 1.

Table 1. Servo system model.

System   | Differential Equation Expression of the Mathematical Model
System 1 | ω(k) = ω(k−1) − 3.478 × 10⁻⁴ ω(k−2) + 1.388 iq(k−1) + 0.1986 iq(k−2)
System 2 | ω(k) = ω(k−1) − 3.366 × 10⁻⁴ ω(k−2) + 0.1263 iq(k−1) + 0.01799 iq(k−2)
System 3 | ω(k) = ω(k−1) − 3.478 × 10⁻⁴ ω(k−2) + 1.388 iq(k−1) + 0.1986 iq(k−2) + 0.9148
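As a minimal illustration (not part of the original paper), the difference equations in Table 1 can be simulated directly; the snippet below steps system 1 under a constant q-axis current, assuming zero initial conditions and a hypothetical test input.

```python
import numpy as np

def step_system(a1, a2, b1, b2, offset, iq, n_steps):
    """Simulate w(k) = -a1*w(k-1) - a2*w(k-2) + b1*iq(k-1) + b2*iq(k-2) + offset."""
    w = np.zeros(n_steps)
    for k in range(2, n_steps):
        w[k] = (-a1 * w[k - 1] - a2 * w[k - 2]
                + b1 * iq[k - 1] + b2 * iq[k - 2] + offset)
    return w

# System 1 coefficients from Table 1 (in the form above, a1 = -1 and a2 = 3.478e-4).
n = 5000                      # 5000 steps of 100 us = 0.5 s
iq = np.ones(n) * 0.1         # constant q-axis current command (hypothetical test input)
w1 = step_system(a1=-1.0, a2=3.478e-4, b1=1.388, b2=0.1986, offset=0.0, iq=iq, n_steps=n)
print(w1[-1])                 # speed reached at the end of the run
```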
3. Servo System Control Strategy Based on Reinforcement Learning

3.1. Strategic Process of Reinforcement Learning

The reinforcement learning framework can be described as a Markov decision process [25]. A finite-horizon, discounted, discrete-time Markov decision process can be expressed as $M = (S, A, P, r, \rho_0, \gamma, T)$, where $S$ is the state set, $A$ is the action set, $P : S \times A \times S \to \mathbb{R}$ is the transition probability, $r : S \times A \to [-R_{max}, R_{max}]$ is the immediate reward function, $\rho_0 : S \to \mathbb{R}$ is the initial state distribution, $\gamma \in [0, 1]$ is the discount factor, and $\tau = (s_0, a_0, s_1, a_1, \dots)$ is a trajectory.
The cumulative return is defined as $R = \sum_{t=0}^{T} \gamma^t r_t$. The goal of reinforcement learning is to find an optimal strategy $\pi$ that maximizes the expected reward under this strategy, defined as $\max_\pi \int R(\tau) P_\pi(\tau) d\tau$.
3.1.1. DQN Algorithms

The Monte Carlo method is used to calculate the expectation of the return value in the state value function, and the expectation is estimated from random samples; in calculating the value function, the expectation of the random variable is replaced with the empirical average. Combining Monte Carlo sampling with dynamic programming, the one-step prediction method is used to calculate the current state-value function, which yields the update formula of Q-learning [26]:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right] \quad (6)$$

where $\alpha \in [0, 1]$ is the learning rate and $\gamma \in [0, 1]$ is the discount factor.
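For reference, Equation (6) corresponds to the standard tabular Q-learning update; the short sketch below is an illustration (not code from the paper) and assumes a discretized state and action space.

```python
import numpy as np

n_states, n_actions = 100, 5
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9            # learning rate and discount factor

def q_update(s, a, r, s_next):
    """One application of Equation (6) on the Q table."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

q_update(s=0, a=2, r=1.0, s_next=1)   # example transition (hypothetical values)
```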
However, the state of the servo control system is continuous. The original Q-learning is a learning algorithm based on a Q table; therefore, the curse of dimensionality easily occurs when processing continuous states. DQN learns the value function using a neural network. The network is trained off-policy with samples from a replay buffer to minimize the correlation among samples, and a target neural network is adopted to give consistent targets during temporal-difference backups. The target network is expressed as $\theta^{Q^-}$, the network that approximates the value function is expressed as $\theta^Q$, and the loss function is:

$$L(\theta) = \mathbb{E}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^{Q^-}) - Q(s, a; \theta^Q) \right)^2 \right]. \quad (7)$$
The network parameters of the action value function are updated in real time, while the target network uses the frozen-parameter method and is updated every fixed number of steps. The gradient descent update formula of DQN is obtained as [10,11]:

$$\theta^Q_{t+1} = \theta^Q_t + \alpha \left[ r + \gamma \max_{a'} Q(s', a'; \theta^{Q^-}) - Q(s, a; \theta^Q) \right] \nabla Q(s, a; \theta^Q). \quad (8)$$
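A minimal PyTorch-style sketch of the loss in Equation (7) with a frozen target network is given below; the network sizes and variable names are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 5))
target_net = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 5))
target_net.load_state_dict(q_net.state_dict())          # frozen copy, refreshed periodically
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)

def dqn_loss(s, a, r, s_next, gamma=0.9):
    """Equation (7): squared TD error against the target network."""
    with torch.no_grad():
        y = r + gamma * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    return ((y - q_sa) ** 2).mean()

# one update on a dummy minibatch
s = torch.randn(32, 6); s_next = torch.randn(32, 6)
a = torch.randint(0, 5, (32,)); r = torch.randn(32)
loss = dqn_loss(s, a, r, s_next)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```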
3.1.2. Deterministic Policy Gradient Algorithms

The output actions of DQN are discretized, which leads to a failure to find the optimal parameters or to violent oscillation of the output current. On the basis of the Markov decision process, $R(\tau) = \sum_{t=0}^{H} R(s_t, u_t)$ represents the cumulative reward of the trajectory $\tau$, and $P(\tau; \theta)$ is defined as the probability of the occurrence of a trajectory $\tau$. The objective function of reinforcement learning can then be expressed as $U(\theta) = \mathbb{E}\left[\sum_{t=0}^{H} R(s_t, u_t); \pi_\theta\right] = \sum_\tau P(\tau; \theta) R(\tau)$. The goal is to learn an optimal parameter $\theta$ that maximizes the expected return from the start distribution, defined as $\max_\theta U(\theta) = \max_\theta \sum_\tau P(\tau; \theta) R(\tau)$. The objective function is optimized using the gradient update $\theta_{new} = \theta_{old} + \alpha \nabla_\theta U(\theta)$, where $\nabla_\theta U(\theta) = \nabla_\theta \sum_\tau P(\tau; \theta) R(\tau)$. Further consolidation gives:

$$\nabla_\theta U(\theta) = \sum_\tau P(\tau; \theta) \nabla_\theta \log P(\tau; \theta) R(\tau). \quad (9)$$
The policy gradient algorithm has achieved good results in dealing with continuous action spaces. According to the stochastic gradient descent [27], the performance gradient $\nabla_\theta J(\pi_\theta)$ can be expressed as:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s \sim \rho^\pi, a \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a|s) Q^\pi(s, a) \right]. \quad (10)$$
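As an illustrative aside (not from the paper), the score-function form of Equations (9) and (10) can be estimated from sampled trajectories as follows; the toy linear-softmax policy and the trajectory format are assumptions.

```python
import numpy as np

theta = np.zeros(4)                       # parameters of a toy linear-softmax policy (2 actions)

def policy_probs(s, theta):
    logits = np.array([s @ theta[:2], s @ theta[2:]])
    e = np.exp(logits - logits.max())
    return e / e.sum()

def grad_log_pi(s, a, theta):
    """Gradient of log pi(a|s) for the linear-softmax policy."""
    p = policy_probs(s, theta)
    g = np.zeros_like(theta)
    g[2 * a:2 * a + 2] = s
    g[:2] -= p[0] * s
    g[2:] -= p[1] * s
    return g

def reinforce_gradient(trajectory, theta):
    """Monte Carlo estimate of Eqs. (9)/(10): sum_t grad log pi(a_t|s_t) * R(tau)."""
    R = sum(r for (_, _, r) in trajectory)
    return sum(grad_log_pi(s, a, theta) for (s, a, _) in trajectory) * R

traj = [(np.array([1.0, 0.5]), 0, 1.0), (np.array([0.2, -0.1]), 1, 0.5)]  # dummy (s, a, r) tuples
theta += 0.01 * reinforce_gradient(traj, theta)
```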
The framework of the actor–critic algorithm, based on the idea of the stochastic gradient descent, has been proposed and is widely used [28]. The actor is used to adjust the parameter $\theta$. The critic approximates the value function, $Q^\omega(s, a) \approx Q^\pi(s, a)$, where $\omega$ is the parameter to be approximated. The algorithm structure is shown in Figure 2.

Figure 2. Diagram of the actor–critic algorithm structure.
Lillicrap, Hunt, and Pritzel et al. [13] presented an actor–critic, model-free algorithm based on the deterministic policy gradient [27]; the formula of the deterministic policy gradient is:

$$\nabla_\theta J_\beta(\mu_\theta) = \mathbb{E}_{s \sim \rho^\beta}\left[ \nabla_\theta \mu_\theta(s) \nabla_a Q^\mu(s, a)|_{a = \mu_\theta(s)} \right] \quad (11)$$

where $\beta(a|s) \neq \pi_\theta(a|s)$ represents the behavior strategy. Neural networks are employed to approximate the action-value function $Q^\omega(s, a)$ and the deterministic policy $\mu^\theta(s)$. The update formulas of the DDPG algorithm [13] are obtained as:

$$\begin{aligned}
\delta_t &= r_t + \gamma Q^{\omega^-}\left(s_{t+1}, \mu^{\theta^-}(s_{t+1})\right) - Q^\omega(s_t, a_t) \\
\omega_{t+1} &= \omega_t + \alpha_\omega \delta_t \nabla_\omega Q^\omega(s_t, a_t) \\
\theta_{t+1} &= \theta_t + \alpha_\theta \nabla_\theta \mu^\theta(s_t) \nabla_a Q^\omega(s_t, a_t)|_{a = \mu^\theta(s)} \\
\theta^- &= \tau\theta + (1 - \tau)\theta^- \\
\omega^- &= \tau\omega + (1 - \tau)\omega^-
\end{aligned} \quad (12)$$
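The updates in Equation (12) can be sketched in PyTorch as follows; this is an illustrative outline under assumed network definitions (`actor`, `critic` and their target copies), not the authors' code.

```python
import torch

def ddpg_update(batch, actor, critic, actor_t, critic_t, opt_a, opt_c, gamma=0.99, tau=0.001):
    s, a, r, s_next = batch                                   # minibatch tensors
    # critic update (TD error delta_t in Equation (12))
    with torch.no_grad():
        y = r + gamma * critic_t(s_next, actor_t(s_next))
    critic_loss = ((y - critic(s, a)) ** 2).mean()
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()
    # actor update (deterministic policy gradient)
    actor_loss = -critic(s, actor(s)).mean()
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
    # soft update of the target networks
    for tgt, src in ((actor_t, actor), (critic_t, critic)):
        for p_t, p in zip(tgt.parameters(), src.parameters()):
            p_t.data.copy_(tau * p.data + (1.0 - tau) * p_t.data)
```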
3.2. PID Servo System Control Scheme Based on Reinforcement Learning

3.2.1. PID Parameter Tuning Method Based on Reinforcement Learning

This study uses the actor–critic framework to construct reinforcement learning agent 1, with a PID-controlled servo system as its environment object. The tracking error curve used in the incentive function is acquired, the deterministic policy gradient algorithm is used to design the action network, and the DQN algorithm is used to design the evaluation network, so that PID parameter self-tuning is realized. The structure diagram of the adaptive PID control algorithm based on reinforcement learning is shown in Figure 3.
Figure 3. Adaptive proportional–integral–derivative (PID) control structure diagram based on reinforcement learning agent 1.
In Figure 3, the upper dotted frame is the adaptive PID parameter regulator, composed of a reinforcement learning agent; the lower dashed box contains the PID servo controller and the speed servo system as the environment interaction objects of the agent. [X(t − 1), e(t − 1), X(t), e(t), H(t + 1), H(t + 2)] is the set of reinforcement learning agent states st, and [X(t), e(t), X(t + 1), e(t + 1), H(t + 2), H(t + 3)] is the set of states st+1 at the next moment, which is acquired through the interaction of agent 1 and the servo environment. X(t − 1), X(t), and X(t + 1) denote the angular velocity of the servo system at the last moment, the current moment, and the next moment, respectively. H(t + 1), H(t + 2), and H(t + 3) refer to the signals that need to be tracked at the next moment, after two time steps, and after three time steps, respectively. e(t − 1), e(t), and e(t + 1) are the output error values at the previous moment, the current time, and the next moment, respectively. µ(st|θµ) indicates the action selected by the actor evaluation network according to the state st, and µ(si+1|θµ−) indicates the action selected by the actor target network according to the state si+1. The PID parameters Kp and Ki form the set of action variables µ of the reinforcement learning agent. The reward function is constructed from the absolute value of the tracking error of the output angle of the servo system. Data are obtained through the interaction of the agent and the servo environment and stored in the memory M. The network update data are obtained by memory playback, and the neural networks are trained using Equation (14). The reinforcement learning agent thereby selects the appropriate Kp and Ki on the basis of the current state st, and automatic tuning of the parameters of the PID servo system is realized.

3.2.2. Adaptive PID Current Compensation Method Based on Reinforcement Learning

The adaptive PID current compensation agent 2 adopts the same control structure and algorithm as agent 1, and the adaptive compensation of the PID output current is realized. The structure diagram of the adaptive PID current compensation control algorithm based on reinforcement learning is shown in Figure 4.
Figure 4. Adaptive PID current compensation control structure diagram based on reinforcement learning agent 2.
In Figure 4, the upper dotted box is the adaptive current compensator based on reinforcement learning agent 2, and the lower dotted box comprises the PID servo controller and the speed servo system as the environment interaction objects of the reinforcement learning agent. The state definition is the same as that of agent 1. The current [It], as the set of action variables µ of the reinforcement learning agent, is the compensation quantity. The reward function is constructed from the absolute value of the tracking error of the output angle of the servo system. The reinforcement learning agent selects the appropriate [It] on the basis of the current state st, and the adaptive compensation of the PID output current is realized.

3.3. Design Scheme of Reinforcement Learning Agent

3.3.1. Network Design for Actor

Two four-layer neural networks are established on the basis of the deterministic policy gradient algorithm: the actor evaluation network and the actor target network, which have the same structure but different functions. The structure of the actor network is shown in Figure 5.

As illustrated in Figure 5, the actor evaluation network has exactly the same number of neurons and structure as the actor target network. The input layer of the actor evaluation network has six neurons, which correspond to six input nodes, namely, the angular velocity X(t − 1) at the last moment, the output error value e(t − 1) of the previous moment, the angular velocity X(t) at the current moment, the output error value e(t) at the current time, the signal H(t + 1) to be tracked at the next moment, and the signal H(t + 2) to be tracked after two time steps. [X(t − 1), e(t − 1), X(t), e(t), H(t + 1), H(t + 2)] is the set of reinforcement learning agent states st. The action value µ(st|θµ) of the evaluation network is output. The inputs of the actor target network are the current angular velocity X(t), the output error value e(t) at the current time, the angular velocity X(t + 1) at the next moment after the interaction between the action µ(st|θµ) and the environment, the output error value e(t + 1) at the next moment, the signal H(t + 2) tracked after two time steps, and the signal H(t + 3) tracked after three time steps. [X(t), e(t), X(t + 1), e(t + 1), H(t + 2), H(t + 3)] is the set of states st+1 at the next moment. The action value µ(st+1|θµ−) of the target network is output. The actor network contains three hidden layers. The first hidden layer comprises 200 neurons, and its activation function is ReLU6. The second hidden layer also consists of 200 neurons, and its activation function is ReLU6. The third hidden layer is composed of 10 neurons, and its activation function is ReLU. Each layer of neurons uses L2 regularization to prevent overfitting of the neural network.
Figure 5. Structure diagram of the actor network.
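A minimal PyTorch sketch of an actor network with the layer sizes described above (200, 200, and 10 hidden units with ReLU6/ReLU activations, and L2 regularization via weight decay) is given below; the output scaling to the action range is an illustrative assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Actor: 6 state inputs -> 2 actions (Kp, Ki), bounded to [0, action_max]."""
    def __init__(self, state_dim=6, action_dim=2, action_max=100.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 200), nn.ReLU6(),
            nn.Linear(200, 200), nn.ReLU6(),
            nn.Linear(200, 10), nn.ReLU(),
            nn.Linear(10, action_dim), nn.Sigmoid(),   # squash, then scale to the action range
        )
        self.action_max = action_max

    def forward(self, state):
        return self.action_max * self.net(state)

actor = Actor()
# weight_decay plays the role of the L2 regularization mentioned in the text
optimizer = torch.optim.Adam(actor.parameters(), lr=1e-4, weight_decay=1e-4)
print(actor(torch.zeros(1, 6)))        # action for an all-zero initial state
```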
3.3.2. Network Design for Critic

Two four-layer neural networks are established on the basis of the DQN algorithm: the critic evaluation network and the critic target network, which have the same structure but different functions. The structure of the critic network is shown in Figure 6.

As shown in Figure 6, the critic evaluation network has exactly the same number of neurons and structure as the critic target network. The input layer of the critic evaluation network has seven neurons, which correspond to seven input nodes, namely, the angular velocity X(t − 1) at the last moment, the output error value e(t − 1) of the previous moment, the angular velocity X(t) at the current moment, the output error value e(t) at the current time, the signal H(t + 1) to be tracked at the next moment, the signal H(t + 2) to be tracked after two time steps, and the output value µ(st|θµ) of the actor evaluation network. The output layer contains one neuron, which outputs the action-value function Q(st, µ(st|θµ)|θQ). [X(t − 1), e(t − 1), X(t), e(t), H(t + 1), H(t + 2)] is the set of reinforcement learning agent states st. The critic target network also has seven input nodes, namely, the angular velocity X(t) at the current moment, the output error value e(t) at the current time, the angular velocity X(t + 1) at the next moment after the interaction between the action µ(st|θµ) and the environment, the output error value e(t + 1) at the next moment, the signal H(t + 2) to be tracked after two time steps, the signal H(t + 3) to be tracked after three time steps, and the output value µ(st+1|θµ−) of the actor target network. The critic target network outputs the state-action value function Q(st+1, µ(st+1|θµ−)|θQ−). [X(t), e(t), X(t + 1), e(t + 1), H(t + 2), H(t + 3)] is the set of states st+1 at the next moment. The critic network contains three hidden layers. The first hidden layer comprises 200 neurons, and its activation function is ReLU6. The second hidden layer also consists of 200 neurons, and its activation function is ReLU6. The third hidden layer is composed of 10 neurons, and its activation function is ReLU. L2 regularization is adopted to prevent overfitting of the neural network.
Figure 6. Structure diagram of the critic network.
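A corresponding PyTorch sketch of the critic (seven inputs: six state components plus the actor's action, with one Q-value output) is shown below; as with the actor, this is an assumed illustration of the described layer sizes rather than the authors' code, and the action dimension can be adjusted for agent 1 or agent 2.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Critic: (state, action) -> scalar Q value, using the 200-200-10 hidden structure."""
    def __init__(self, state_dim=6, action_dim=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 200), nn.ReLU6(),
            nn.Linear(200, 200), nn.ReLU6(),
            nn.Linear(200, 10), nn.ReLU(),
            nn.Linear(10, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=1))

critic = Critic()
q = critic(torch.zeros(1, 6), torch.zeros(1, 1))   # Q(s, a) for a dummy state-action pair
```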
3.4. Implementation Process of the Adaptive PID Algorithm Based on the DDPG Algorithm

In this paper, two types of agents are designed with the same neural network structure. Agent 1 searches for the optimal PID parameters within the range 0 to 100 in order to realize the adaptive tuning of the PID parameters of the speed servo system. The adaptive compensation of the PID output current is realized by agent 2, which further reduces the steady-state error.

Agent 1 is designed on the basis of the actor–critic network structure in Section 3.3. The critic Q(st, µ(st|θµ)|θQ) and the actor µ(st|θµ), with weights θQ and θµ, respectively, are randomly initialized. The target networks Q− and µ−, with weights θQ− ← θQ and θµ− ← θµ, respectively, are initialized. A memory library M1 with capacity C1 is built. The first state s1 = [0, 0, 0, 0, 0, 0] is initialized. The action at = [Kp, Ki] = µ(st|θµ) + Noise is selected on the basis of the actor evaluation network. The action at is performed in the PID servo controller to obtain the reward rt and the next state st+1, and the transition (st, at, rt, st+1) is stored in M1. A random minibatch of N transitions (si, ai, ri, si+1) is sampled from M1. yi = ri + γQ−(si+1, µ−(si+1|θµ−)|θQ−) is computed, and the critic is updated by minimizing the loss L = (1/N) Σi (yi − Q(si, ai|θQ))², in accordance with Equation (8). The actor network is updated with ∇θµ J ≈ (1/N) Σi ∇a Q(s, a|θQ)|s=si, a=µ(si) ∇θµ µ(s|θµ)|si on the basis of Equation (12). The target networks are updated with θQ− ← τθQ + (1 − τ)θQ− and θµ− ← τθµ + (1 − τ)θµ−.

Agent 2 is designed with the same neural network structure and state set as agent 1. By changing the action space of agent 1, the current compensation of the PID controller is realized and the control performance is further improved. The action at = [It] = µ(st|θµ) + Noise is selected on the basis of the actor evaluation network, and the same DDPG algorithm as for agent 1 is adopted to train agent 2.
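Putting the pieces together, the training procedure described above can be outlined as follows; `env_step` (applying [Kp, Ki] to the PID-controlled servo model and returning the reward and next state) and the `select_action`/`update_networks` callables are assumed placeholders, not the paper's code.

```python
import random
import numpy as np

def train_agent1(env_step, select_action, update_networks,
                 episodes=5000, steps_per_episode=5000,
                 capacity=50000, batch_size=256, var_init=10.0, var_final=0.01):
    """Outline of the agent 1 training loop (PID parameter tuning with DDPG)."""
    memory, var = [], var_init                      # replay memory M1 and exploration noise
    for _ in range(episodes):
        s = np.zeros(6)                             # first state s1 = [0, 0, 0, 0, 0, 0]
        for _ in range(steps_per_episode):
            a = np.clip(select_action(s) + np.random.normal(0.0, var, size=2), 0.0, 100.0)
            r, s_next = env_step(a)                 # run the PID servo loop with Kp, Ki = a
            memory.append((s, a, r, s_next))
            if len(memory) > capacity:
                memory.pop(0)
            if len(memory) >= batch_size:
                update_networks(random.sample(memory, batch_size))   # Eqs. (8) and (12)
            s = s_next
        var = max(var * 0.999, var_final)           # dynamically shrink the noise
    return memory
```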
4. Results and Discussion

4.1. Experimental Setup

In practical applications, the speed servo system instruction is generally trapezoidal or sinusoidal. Therefore, only the performance of the servo control system under these two types of instruction is analyzed and compared in this study, on the basis of system control experiments under instructions of different amplitudes, frequencies, and rise and fall rates. In the control process of the servo system, the PID parameters play a key role in the control performance of the speed servo system. The classic PID (Classic_PID) controller is adopted for comparison with the algorithm proposed in this paper; its control parameters, tuned on the basis of empirical expert knowledge, are Kp = 0.5 and Ki = 2.985.

Reinforcement learning agent 1 is constructed on the basis of the method in Section 3.4 to realize automatic tuning of the PID parameters of the servo system. The relevant parameters are set as follows: the capacity of the memory M1 is C1 = 50,000, the actor target network is updated every 11,000 steps, and
the critic target network is updated every 10,000 steps. The action range is [0, 100]. The learning rates of the actor and critic are αa = αc = 0.0001, and the batch size is N = 256. A Gaussian distribution is used to add exploration noise, with the output value of the neural network as the distribution mean and the parameter var1 as the standard deviation; the initial action noise is var1 = 10. The noise strategy is adjusted dynamically as learning proceeds, and finally var1 = 0.01. In this paper, the control precision of the controller is adopted as the control target. Considering the limitations of the servo motor current and power consumption in practical engineering, the reward r is defined as:

$$r_t = e^{-\frac{1}{2}\frac{[0.9|e(t)| + 0.1|I_t|]^2}{\sigma^2}} \quad (13)$$
where σ = 10. For controlled servo system 1 in Table 1, the tracking signal is an isosceles trapezoid of amplitude 1000 with a slope absolute value of 13,333.3 over 0.5 s, the loop is iterated 5000 times, and the environment interaction object is reinforcement learning agent 1. The trained agent model is called MODEL_1.

Reinforcement learning agent 2_1 is constructed on the basis of the method in Section 3.4 to realize the adaptive compensation of the PID output current. The optimal control parameters learned by the adaptive PID control algorithm in MODEL_1 are Kp = 0.9318 and Ki = 1.4824. The system performance of the fixed PID control structure under these parameters is compared with the control performance of the adaptive current compensation method proposed in this study. The relevant parameters of reinforcement learning agent 2_1 are set as follows: the capacity of the memory M2 is C2 = 50,000, the actor target network is updated every 11,000 steps, and the critic target network is updated every 10,000 steps. The action range is [−2, 2]. The learning rates of the actor and critic are αa = αc = 0.0001, the batch size is N = 256, and the initial action noise is var2 = 2 (finally, var2 = 0.01). The control output of agent 2_1 compensates the PID current; therefore, the angular velocity error of the PID control is used to build the reward function r, which is defined as:

$$r_t = 0.9\, e^{-\frac{1}{2}\frac{e(t+1)^2}{\sigma^2}} + 0.1\, e^{-|I_t|} \quad (14)$$

where σ = 2.
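The two reward functions, Equations (13) and (14), translate directly into code; the small sketch below is illustrative and uses the σ values stated in the text.

```python
import numpy as np

def reward_agent1(e_t, i_t, sigma=10.0):
    """Equation (13): reward for the PID parameter-tuning agent."""
    x = 0.9 * abs(e_t) + 0.1 * abs(i_t)
    return np.exp(-0.5 * x ** 2 / sigma ** 2)

def reward_agent2(e_next, i_t, sigma=2.0):
    """Equation (14): reward for the current-compensation agent."""
    return 0.9 * np.exp(-0.5 * e_next ** 2 / sigma ** 2) + 0.1 * np.exp(-abs(i_t))

print(reward_agent1(e_t=5.0, i_t=0.2), reward_agent2(e_next=0.5, i_t=0.2))
```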
The environment interaction object of agent 1 is also adopted for reinforcement learning agent 2_1. Eight hundred episodes are trained to acquire the agent model, called MODEL_2.

Agent 2_2 is constructed on the basis of the structure of agent 2_1, changing the relevant parameters as follows: the memory M2 is changed to M3 with capacity C3 = 2000, the actor target network is updated every 1100 steps, the critic target network is updated every 1000 steps, and the remaining settings are unchanged. The compensation tracking signal is a sinusoid of amplitude 10 and frequency 100 Hz. Four thousand episodes are trained to acquire the agent model MODEL_3.

System models with inertia mutation and external torque disturbance are tested to verify the adaptability of the proposed reinforcement-learning-based adaptive control strategy for the speed servo system. The performance of the proposed method is quantitatively analyzed with the velocity tracking curves, the velocity tracking error values, and the Integral of Absolute Error (IAE) index.

4.2. Experimental Results

Test case #1

The following figures show the data of the successful training of each model. The performance test of MODEL_1 is shown in Figure 7a, in which the tracking amplitude of agent 1 is 1000 and the absolute value of the slope is 13,333.3; the isosceles trapezoid tracking effect presents an inconsiderable error. The reinforcement learning output action values of Kp and Ki present straight lines, and the optimal control parameters are Kp = 0.9318 and Ki = 1.4824. Figure 7b shows that the PID controller with the improved parameters (Improved_PID) achieves better control performance than Classic_PID. Figure 7c indicates that 800 episodes are trained to acquire the average reward of agent 2_1. Training starts with a large degree of exploration, so the average reward value is very low at first; as the training steps increase, the average reward increases gradually, and MODEL_2 eventually converges. Figure 7d depicts the average reward of agent 2_2 when training 4000 episodes; the average reward eventually increases and tends to be stable. The correctness and effectiveness of the agent models of the proposed method are verified by the above results.
Figure 7. Result of test case #1. (a) Agent 1 tracking renderings; (b) Integral of absolute error (IAE) index comparison diagram; (c) Average reward value of agent 2_1; (d) Average reward value of agent 2_2.
Test case #2
The tracking command signal of the system is set as an isosceles trapezoid with an amplitude of 1000 and a slope absolute value of 13,333.3, plus random white Gaussian noise with an amplitude of 0.1, zero mean, and variance of 1. When tracking for 0.5 s, system 1 is selected as the servo system in the tracking process. The reinforcement learning adaptive PID current compensation method is compared with PID control. The MODEL_2 control and the performance comparison of the PID approaches are shown in Figure 8.

Figure 8a presents the tracking effect diagram of 0.50 s for the reinforcement learning adaptive PID current compensation method. A good tracking performance is obtained; the circled part is shown enlarged in the figure. The agent 2_1 tracking effect (RL_PID) is compared with Classic_PID and Improved_PID. In the absence of external disturbances, the proposed method can rapidly and accurately track the command signal, whereas the Classic_PID control structure produces a large tracking error; the control performance of Improved_PID lies between Classic_PID and RL_PID. The parameter tuning method proposed in this paper can therefore achieve better control performance than Classic_PID. Figure 8b depicts the IAE index for RL_PID, Improved_PID, and Classic_PID. The IAE of the reinforcement learning adaptive PID control algorithm is smaller than that of the PID algorithms under the condition of the same system. The results prove that the proposed control strategy is feasible in tracking a trapezoidal signal and can achieve a good control performance.
Figure 8. Result of test case #2. (a) Tracking signal comparison diagram; (b) IAE index comparison diagram.
Test case #3

The tracking command signal of the system is set as an isosceles trapezoid with an amplitude of 2000 and a slope absolute value of 26,666.6, plus random white Gaussian noise with an amplitude of 0.1, zero mean, and variance of 1. When tracking for 0.5 s, system 1 is selected as the servo system, and random inertia and moment disturbances are added between 0.075 and 0.375 s. The reinforcement learning adaptive PID current compensation method (RL_PID) is compared with Improved_PID control and Classic_PID control. The MODEL_2 control and the performance comparison of the PID approaches are shown in Figure 9.
Figure 9. Result of test case #3. (a) Tracking signal comparison diagram; (b) System selection chart; (c) IAE index comparison diagram; (d) Agent 2_1 output current value and tracking error graph.
Figure 9a presents the tracking effect diagram over 0.5 s for the reinforcement learning adaptive PID current compensation method, and a good tracking performance is observed; the circled part is shown enlarged in the figure. The tracking effect of agent 2_1 (RL_PID) is compared with those of Improved_PID and Classic_PID. The proposed method follows the signal amplitude, and under the changed slope it shows a better tracking effect than Improved_PID and Classic_PID, which verifies its good adaptability. Figure 9b shows the system selection chart for the adaptive PID and PID control algorithms, where 0 represents system 1, 1 represents system 2, and 2 represents system 3. Figure 9c depicts the IAE comparison of the reinforcement learning adaptive PID control algorithm and the PID control algorithms; the IAE value of the reinforcement learning adaptive PID control algorithm is always less than that of the PID algorithms under the same system. The upper plot of Figure 9d shows the tracking error comparison of RL_PID, Improved_PID, and Classic_PID. When the system changes, the tracking error of the reinforcement learning method is less than that of Improved_PID and Classic_PID. The figure also shows that the PID controller with parameters tuned by agent 1 reduces the tracking error caused by the inertia mutation more effectively than the classical PID controller, and that current compensation on top of the tuned PID controller reduces the error further. The lower plot of Figure 9d shows the output current of the RL_PID method. When the system undergoes a mutation, the reinforcement learning output current (RL_PID_RL) and the PID output current (RL_PID_PID) of the adaptive PID both change to adapt to the new system. The results prove that, when tracking a trapezoidal signal with changed amplitude and slope and under random mutations of the system inertia and torque, the proposed control strategy is still better than Improved_PID and Classic_PID. The applicability and superiority of the proposed method are verified.
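To make the structure discussed above concrete, the sketch below shows one way the compensated control action could be formed: a discrete PID term whose gains are supplied by agent 1 plus a compensation current supplied by agent 2. The class name, the discrete PID form, and the interface are assumptions for illustration, not the authors' implementation.

```python
class RLCompensatedPID:
    """Speed-loop PID whose gains are tuned by one RL agent and whose output
    is augmented by a compensation current from a second RL agent (sketch)."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd = kp, ki, kd    # gains proposed by agent 1
        self.dt = dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, reference, measured_speed, i_compensation=0.0):
        error = reference - measured_speed
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        # RL_PID_PID term (tuned PID output) plus RL_PID_RL term (agent 2's
        # compensation current); with i_compensation = 0 this reduces to a
        # tuned-gain PID controller.
        i_pid = self.kp * error + self.ki * self.integral + self.kd * derivative
        return i_pid + i_compensation
```

Under this reading, Classic_PID corresponds to fixed default gains, Improved_PID to the agent-tuned gains with zero compensation, and RL_PID to the full sum.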
Test case #4

The tracking command signal is a sinusoidal signal with an amplitude of 10 and a frequency of 100 Hz, superimposed with a white Gaussian noise signal of amplitude 0.1, zero mean, and unit variance. The tracking duration is 0.02 s, system 1 is selected as the servo system, and agent 2_2 is tested. The reinforcement learning adaptive PID current compensation method is compared with Improved_PID control and Classic_PID control. The MODEL_3 control and the performance comparison of Improved_PID and Classic_PID are shown in Figure 10.
Figure 10. Result of test case #4. (a) Tracking signal comparison diagram; (b) IAE index comparison diagram.
Figure 10a is the tracking effect diagram over 0.02 s for the reinforcement learning adaptive PID current compensation method. A good tracking performance is obtained, and the circled part is shown enlarged in the figure. The tracking effect of agent 2_2 is compared with those of Improved_PID and Classic_PID. In the absence of external disturbances, the proposed method can rapidly and accurately track the command signal, whereas the Classic_PID control structure produces a large tracking error. The control performance of Improved_PID with good parameters is better than that of Classic_PID, which verifies that the parameters found by reinforcement learning agent 1 can still be applied when the type of tracking signal changes. Figure 10b shows the IAE index for the reinforcement learning adaptive PID, Improved_PID, and Classic_PID control algorithms. The IAE of the reinforcement learning adaptive PID control algorithm is smaller than those of the Improved_PID and Classic_PID algorithms under the same system. The results confirm that the control strategy proposed in this study is feasible for tracking a sinusoidal signal and can obtain a good control performance.

Test case #5

The tracking command signal is a sinusoidal signal with an amplitude of 20 and a frequency of 200 Hz, superimposed with a white Gaussian noise signal of amplitude 0.1, zero mean, and unit variance. The tracking duration is 0.02 s, and system 1 is initially selected as the servo system. The torque mutation system is switched to system 3 at 0.01 s, and agent 2_2 is tested. The reinforcement learning adaptive PID current compensation method is compared with Improved_PID and Classic_PID. The MODEL_3 control and the performance comparison of the PID approaches are shown in Figure 11.
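A minimal sketch of how such a mid-run switch of the servo model can be scripted in a discrete-time simulation is given below; the plant interface, the controller object (the hypothetical RLCompensatedPID above), and the sampling period are assumptions made only for illustration.

```python
import numpy as np

def run_switching_test(controller, plant_nominal, plant_mutated,
                       switch_time=0.01, t_end=0.02, dt=1e-5):
    """Track a 200 Hz, amplitude-20 sinusoid; swap the plant model at switch_time."""
    speed = 0.0
    errors = []
    for k in range(int(t_end / dt)):
        t = k * dt
        plant = plant_nominal if t < switch_time else plant_mutated   # system 1 -> system 3
        reference = 20.0 * np.sin(2.0 * np.pi * 200.0 * t)
        current = controller.step(reference, speed, i_compensation=0.0)
        speed = plant.step(current)        # plant objects expose step(current) -> speed
        errors.append(reference - speed)
    return np.asarray(errors)
```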
Figure 11. Result of test case #5. (a) Tracking signal comparison diagram; (b) Tracking error comparison diagram.
Figure 11a presents the tracking effect diagram over 0.02 s for the reinforcement learning adaptive PID current compensation method. A good tracking performance is obtained, and the circled part is shown enlarged in the figure. From the comparison between the tracking effects of RL_PID, Improved_PID, and Classic_PID, the tracking effect of the proposed method is still better than that of the PID control structures without current compensation. Figure 11b is the error graph of the adaptive PID, Improved_PID, and Classic_PID control algorithms. The error value of the adaptive PID control algorithm is significantly smaller than those of the Improved_PID and Classic_PID algorithms under the same system. Improved_PID, with its better PID parameters, achieves a smaller tracking error than Classic_PID when the load undergoes a mutation. The results prove that the presented control strategy has a better tracking performance under torque mutation than Improved_PID and Classic_PID. The applicability and superiority of the proposed method are verified.

Test case #6

The proposed control strategy based on reinforcement learning is compared with the fixed-parameter PID control of the servo system under different signal types, signal amplitudes, torque mutations, and inertia mutations, with the IAE as the evaluation index. For the system with random mutation, random system inertia is added and the torque disturbance ranges from 0.075 s to 0.375 s. For the system with inertia mutation, a switch to system 2 at 0.01 s is applied. For the system with load mutation, a switch to system 3 at 0.01 s is applied. The results are shown in Tables 2–5.
Table 2. IAE indices of RL_PID, Improved_PID, and Classic_PID for the isosceles trapezoid tracking signal with an amplitude of 1000 and a slope absolute value of 13,333.3 over 0.5 s.

System                       | RL_PID  | Improved_PID | Classic_PID
System 1                     | 1199.56 | 2087.81      | 3883.61
System with random mutation  | 4000.01 | 5444.85      | 8017.54
Table 3. IAE indices of RL_PID, Improved_PID, and Classic_PID for the isosceles trapezoid tracking signal with an amplitude of 2000 and a slope absolute value of 26,666.6 over 0.5 s.

System                       | RL_PID  | Improved_PID | Classic_PID
System 1                     | 2937.40 | 4144.64      | 7705.06
System with random mutation  | 7144.29 | 8307.17      | 14,608.11
Table 4. IAE indices of RL_PID, Improved_PID, and Classic_PID for the sinusoidal tracking signal with an amplitude of 10 and a frequency of 100 Hz over 0.02 s.

System                       | RL_PID  | Improved_PID | Classic_PID
System 1                     | 17.126  | 54.412       | 99.760
System with inertia mutation | 199.045 | 271.327      | 427.420
System with load mutation    | 60.300  | 89.819       | 166.420
Table 5. IAE indices of RL_PID, Improved_PID, and Classic_PID for the sinusoidal tracking signal with an amplitude of 20 and a frequency of 200 Hz over 0.02 s.

System                       | RL_PID  | Improved_PID | Classic_PID
System 1                     | 112.418 | 217.766      | 403.680
System with inertia mutation | 827.467 | 952.202      | 1278.677
System with load mutation    | 129.000 | 226.029      | 417.880
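For reference, the IAE values reported in Tables 2–5 can be read as the accumulated absolute tracking error over each run. A minimal sketch of such an index computed from logged reference and measured-speed samples is shown below; whether the sum is scaled by the sampling period is an assumption, since the paper does not state the convention at this point.

```python
import numpy as np

def iae(reference, measured, dt=None):
    """Integral of Absolute Error for sampled signals.

    With dt given, the sum approximates the continuous-time integral of |e(t)|;
    without it, the raw sum of |e_k| over samples is returned.
    """
    err = np.abs(np.asarray(reference, dtype=float) - np.asarray(measured, dtype=float))
    return float(err.sum() * dt) if dt is not None else float(err.sum())
```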
According to the experimental results under different velocity commands and under complex disturbances and inertia changes, the IAE index of the method proposed in this study is smaller than that of the fixed-parameter PID control system. The proposed method therefore achieves better control performance than the PID control structure with fixed parameters and shows good adaptability. In addition, the proposed PID parameter tuning method based on deep reinforcement learning can find better control parameters, so that even a fixed-parameter PID controller tuned in this way obtains better control performance than classic PID under inertia and load mutations.

5. Conclusions

In this study, we design a reinforcement learning agent to automatically adjust the parameters of a speed servo system. The agent establishes the action (actor) and critic networks on the basis of the DDPG algorithm: the action network realizes the optimal approximation of the policy, and the critic network realizes the optimal approximation of the value function. The strategies of memory replay, parameter freezing, and dynamic noise adjustment are adopted to improve the convergence speed of the neural networks. The tuned parameters of the servo system are obtained, and the same reinforcement learning structure is adopted to further enhance the control accuracy.
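As a compact illustration of the DDPG machinery summarized above, the sketch below shows the core critic regression against frozen target networks, the deterministic policy-gradient update of the actor, and the soft update that implements parameter freezing. Network definitions, optimizers, the replay-buffer sampling, and all names are assumptions; this is a generic DDPG step, not the authors' code.

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
    """One DDPG training step on a replay-buffer minibatch (terminal flags omitted)."""
    s, a, r, s_next = batch                       # tensors sampled from the replay memory

    with torch.no_grad():                         # frozen targets give the bootstrap value
        y = r + gamma * critic_target(s_next, actor_target(s_next))

    critic_loss = F.mse_loss(critic(s, a), y)     # value-function regression
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    actor_loss = -critic(s, actor(s)).mean()      # ascend Q along the deterministic policy
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    for net, target in ((actor, actor_target), (critic, critic_target)):
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)   # soft "parameter freezing" update
```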
The use of reinforcement learning not only enables fast and accurate tuning of the control parameters of the servo system, but also allows online current compensation on the basis of the state of the servo system, so that the system can effectively overcome the influence of external factors. This study proposes a servo system control strategy based on reinforcement learning for different types of commands, torque disturbances, and inertia changes, and a satisfactory control performance is obtained. The experimental results demonstrate the effectiveness and robustness of the proposed method. In this paper, the tuning of the PID parameters of the servo system and the compensation of the current are handled by two separate agents; multi-task learning may unify these tasks, and this will be the topic of future work. In addition, we find that L2 regularization can improve the generalization ability of reinforcement learning agents. The contributions of this paper will be beneficial to the application of deep reinforcement learning in servo systems.

Author Contributions: P.C. contributed to the conception of the reported research and helped revise the manuscript. Z.H. contributed significantly to the design and conduct of the experiment and the analysis of results, as well as contributed to the writing of the manuscript. C.C. and J.X. helped design the experiments and perform the analysis with constructive discussions.

Acknowledgments: This research is supported by the National Natural Science Foundation of China (61663011) and the postdoctoral fund of Jiangxi Province (2015KY19).

Conflicts of Interest: The authors declare no conflicts of interest.