AVEC’18

Fuel Saving Control for Hybrid Electric Vehicle Using Driving Cycles Prediction and Reinforcement Learning

Teng Liu University of Waterloo

Xiaosong Hu Chongqing University

Yuan Zou Beijing Institute of Technology

Dongpu Cao University of Waterloo

E-mail: [email protected]

Hybrid electric vehicles (HEVs) and electric vehicles (EVs) have shown large potential to save energy and reduce emissions in recent years. A novel energy management strategy based on driving cycles prediction and reinforcement learning (RL) is proposed in this paper. First, a Markov chain (MC)-based driving cycles prediction approach is formulated. Then, the modeling of the hybrid electric powertrain is introduced and the optimal control problem is formulated. Building on these elements, a real-time predictive energy management strategy under the RL framework is finally proposed. The proposed velocity prediction method achieves high accuracy when the current driving cycle is present in the historical driving data. Furthermore, the presented control strategy is compared with the conventional RL-based strategy to demonstrate its effectiveness. The simulation results indicate that the novel technique is superior to the benchmarking method in terms of fuel economy improvement.

Design and Control of (Plug-in) Hybrid Electric Vehicles

1. INTRODUCTION

Air pollution and petroleum scarcity have become increasingly serious concerns in recent years. This tendency has encouraged the development of hybrid electric vehicles (HEVs), which have large potential to improve fuel economy and reduce pollutant emissions [1]. Energy management strategies have been researched in depth for HEVs to maximize the overall powertrain efficiency and minimize fuel consumption [2].

Energy management strategies for HEVs can be mainly classified into two types: rule-based and optimization-based methods [3]. Rule-based energy management strategies are often determined by engineering experience and are capable of operating steadily. However, these traditional rule-based schemes are highly susceptible to the heuristics and arbitrariness of the design criteria and experience, and thus cannot guarantee optimality. Optimization-based energy management strategies can be further classified into global optimization and real-time optimization. Dynamic programming (DP), Pontryagin's minimum principle (PMP) and stochastic dynamic programming (SDP) are representative methods for making globally optimal control decisions when the driving cycle is presumed known in advance [4]. The equivalent consumption minimization strategy (ECMS), model predictive control (MPC) and reinforcement learning (RL) are the most representative approaches for real-time optimization. The control performance of MPC is largely determined by the precision of the future velocity or power prediction. RL is a machine learning approach in which the agent interacts with the environment to provide the optimal control in real time [5]. However, it is noted that the fuel economy of HEVs may even be degraded if the associated control strategy is unsuitable for future driving conditions [6]. Hence, future driving information needs to be carefully considered when deriving an energy management strategy for HEVs.

This paper proposes a Markov chain (MC)-based driving cycles prediction approach, named the nearest neighbor predictor (NNP). Combining the NNP with the RL algorithm, a real-time predictive energy management strategy is finally presented for a hybrid tracked vehicle (HTV). The NNP makes the controls adapt to changing future driving conditions, and RL enables the controls to be implementable in real time. The influence of the number of prediction passes on the prediction performance is described by comparing the velocity trajectories and the MC-based transition probability matrices (TPMs). The optimality of the proposed predictive control is evaluated through comparison with the conventional RL-based strategy. Simulation results underline that the proposed strategy leads to noticeably improved fuel economy and computational speed. These merits make it feasible for online application.

The rest of this paper is organized as follows. Section 2 introduces the HTV powertrain model and the cost function of the energy management problem. Section 3 describes the velocity prediction approach and the RL technique. In Section 4, tests are designed to evaluate the proposed approach and the simulation results are analyzed. Finally, conclusions and future work are given in Section 5.

2. POWERTRAIN MODELING AND PROBLEM FORMULATION

2.1 Powertrain of a Hybrid Tracked Vehicle

Fig. 1 Powertrain of the HTV [7].

The architecture of the series HTV and its power flows are shown in Fig. 1 [7]. The tracked vehicle is propelled by two electric machines, which are treated as power conversion devices with the same average efficiency. An engine-generator set and a battery pack constitute the main power sources of the HTV. The diesel engine delivers a rated power of 300 kW at 3100 rpm and a rated output torque of 2200 Nm over the speed range from 650 rpm to 2100 rpm. The generator delivers a rated power of 270 kW over the speed range from 2500 rpm to 3100 rpm and a rated torque of 960 Nm over the speed range from 0 to 2500 rpm. The 50 Ah lithium-ion battery pack has a rated voltage of 470 V. The solid arrow lines in Fig. 1 indicate the directions of the power flows. The generator speed is selected as the first state variable and can be calculated according to the torque equilibrium constraint [8]

$$
\begin{cases}
\dfrac{dn_g}{dt} = \left( \dfrac{T_e}{i_{e\text{-}g}} - T_g \right) \Bigg/ \left[ 0.1047\left( \dfrac{J_e}{i_{e\text{-}g}^{2}} + J_g \right) \right] \\[2mm]
n_g = n_e / i_{e\text{-}g}
\end{cases}
\tag{1}
$$

where $n_g$ and $n_e$ are the rotational speeds and $T_g$ and $T_e$ are the torques of the generator and engine, respectively; $T_e$ is the sole control action in this work. $J_e$ and $J_g$ are the rotational moments of inertia of the engine and generator, respectively, $i_{e\text{-}g}$ is the gear ratio between the engine and generator, and 0.1047 is the transformation factor from r/min to rad/s (1 r/min = 0.1047 rad/s). The torque and output voltage of the generator can be derived as follows [8]:

$$
\begin{cases}
T_g = K_e I_g - K_x I_g^{2} \\
U_g = K_e n_g - K_x n_g I_g
\end{cases}
\tag{2}
$$

where $K_e$ is the electromotive force coefficient, and $U_g$ and $I_g$ are the generator voltage and current, respectively. Furthermore, $K_e n_g$ is the electromotive force, and $K_x = 3 P L_g / \pi$, in which $L_g$ is the armature synchronous inductance and $P$ is the number of poles.
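For illustration, a minimal numerical sketch of the generator-speed dynamics in (1) and the generator torque/voltage relations in (2) is given below (Python, forward-Euler integration over one control step). The parameter values are placeholders for illustration only, not the HTV's actual data.

```python
# Placeholder powertrain parameters (illustrative only, not the HTV's actual data)
J_e, J_g = 2.0, 1.5        # rotational inertias of engine and generator (kg*m^2)
i_eg = 1.2                 # gear ratio between engine and generator
K_e, K_x = 1.6, 0.002      # EMF coefficient and K_x = 3*P*L_g/pi
RPM2RAD = 0.1047           # 1 r/min = 0.1047 rad/s

def generator_speed_derivative(T_e, T_g):
    """Right-hand side of Eq. (1): dn_g/dt."""
    return (T_e / i_eg - T_g) / (RPM2RAD * (J_e / i_eg**2 + J_g))

def generator_outputs(n_g, I_g):
    """Eq. (2): generator torque and terminal voltage for a given current."""
    T_g = K_e * I_g - K_x * I_g**2
    U_g = K_e * n_g - K_x * n_g * I_g
    return T_g, U_g

def step(n_g, T_e, I_g, dt=1.0):
    """Advance the generator speed by one sample time with forward Euler."""
    T_g, U_g = generator_outputs(n_g, I_g)
    n_g_next = n_g + dt * generator_speed_derivative(T_e, T_g)
    n_e_next = n_g_next * i_eg     # from n_g = n_e / i_eg in Eq. (1)
    return n_g_next, n_e_next, U_g

# Example call with arbitrary operating values
n_g, n_e, U_g = step(n_g=2600.0, T_e=800.0, I_g=150.0)
```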

For the HTV, the state of charge (SOC) of the battery is chosen as the other state variable, which is computed by

$$
\frac{dSoC}{dt} = -\frac{I_{bat}(t)}{C_{bat}},
\tag{3}
$$

where $I_{bat}$ and $C_{bat}$ denote the current and rated capacity of the battery, respectively. According to the internal resistance model [9], the derivative of the SOC and the battery output voltage can be computed by

$$
\begin{cases}
\dfrac{dSoC}{dt} = -\dfrac{V_{oc} - \sqrt{V_{oc}^{2} - 4\, r_{ch}(r_{dis})\, P_{bat}(t)}}{2\, C_{bat}\, r_{ch}(r_{dis})} \\[2mm]
U_{bat} = \begin{cases} V_{oc} - I_{bat}\, r_{ch}(SoC) & (I_{bat} < 0) \\ V_{oc} - I_{bat}\, r_{dis}(SoC) & (I_{bat} \ge 0) \end{cases}
\end{cases}
\tag{4}
$$

where $V_{oc}$ is the open-circuit voltage and $P_{bat}$ is the battery power. Furthermore, $U_{bat}$ is the battery output voltage, and $r_{dis}(SoC)$ and $r_{ch}(SoC)$ denote the internal resistances during discharging and charging, respectively.

2.2 Energy Management Modeling

The cost function $J$ to be minimized in the energy management problem is a trade-off between fuel consumption and charge sustainability in the battery, formulated as follows:

$$
J = \int_{0}^{T} \left[ \dot{m}_f(t) + \beta \left( SOC(t) - SOC(0) \right)^{2} \right] dt
\tag{5}
$$

where $\beta$ is a positive weighting factor, $SOC$ is the state of charge of the battery, and $\dot{m}_f$ is the fuel consumption rate over the entire time span $[0, T]$. To ensure the safety and reliability of the components, the following inequality constraints should be satisfied:

( Nm) 0  Te (t )  2200 0.5  SOC (t )  0.9    −180  Pbat (t )  180 ( kW )  650  ne  3100 ( rpm)

(6)

where $T_e$, the engine torque, is the control variable, and $SOC$ is the state variable.
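As a concrete illustration of the optimization problem in (3)-(6), the sketch below evaluates the instantaneous cost of (5) and checks the constraints of (6) for one candidate engine torque. The fuel-rate map, weighting factor, and battery parameters are placeholder assumptions, not values taken from the paper, and the SOC dependence of the internal resistances is omitted.

```python
import math

# Placeholder model data (illustrative assumptions, not the HTV's actual parameters)
BETA = 100.0              # weighting factor beta in Eq. (5)
SOC_REF = 0.70            # reference SOC, i.e. SOC(0)
C_BAT = 50.0 * 3600.0     # rated capacity: 50 Ah expressed in coulombs
V_OC = 470.0              # open-circuit voltage (V)
R_CH, R_DIS = 0.05, 0.06  # charging / discharging resistances (ohm), SOC dependence omitted

def fuel_rate(T_e, n_e):
    """Hypothetical fuel-rate map (g/s); a real map would be interpolated from engine data."""
    return 1e-4 * T_e * n_e / 60.0

def soc_derivative(P_bat):
    """Eq. (4): dSOC/dt from the battery power using the internal-resistance model."""
    r = R_CH if P_bat < 0 else R_DIS
    return -(V_OC - math.sqrt(V_OC**2 - 4.0 * r * P_bat)) / (2.0 * C_BAT * r)

def stage_cost(T_e, n_e, soc, dt=1.0):
    """Instantaneous cost of Eq. (5) accumulated over one sample time."""
    return (fuel_rate(T_e, n_e) + BETA * (soc - SOC_REF) ** 2) * dt

def feasible(T_e, soc, P_bat, n_e):
    """Inequality constraints of Eq. (6); P_bat in watts."""
    return (0.0 <= T_e <= 2200.0 and 0.5 <= soc <= 0.9
            and -180e3 <= P_bat <= 180e3 and 650.0 <= n_e <= 3100.0)

# Example: evaluate one candidate control action
soc, T_e, n_e, P_bat = 0.68, 900.0, 1800.0, 40e3
if feasible(T_e, soc, P_bat, n_e):
    cost = stage_cost(T_e, n_e, soc)
    soc_next = soc + soc_derivative(P_bat)
```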

3. VELOCITY PREDICTION AND RL

3.1 Velocity Prediction Based on a Markov Chain

The vehicle velocity $v$ is modeled as a finite-state MC and denoted as $V = \{ v_j \mid j = 1, \dots, M \} \subset X$ in this paper, where $X \subset \mathbb{R}$ is bounded. The maximum likelihood estimator is applied to estimate the transition probability of the vehicle speed as [10]

$$
\begin{cases}
p_{ij} = P(v^{+} = v_j \mid v = v_i) = \dfrac{N_{ij}}{N_i} \\[1mm]
N_i = \sum_{j=1}^{M} N_{ij}
\end{cases}
\tag{7}
$$

where $v$ and $v^{+}$ are the current velocity and the next (one-step-ahead) velocity, respectively, and $p_{ij}$ is the transition probability from $v_i$ to $v_j$. Furthermore, $N_{ij}$ is the count of transitions from $v_i$ to $v_j$, $N_i$ is the total count of transitions initiated from $v_i$, and the transition probability matrix (TPM) $\Pi$ is filled with the elements $p_{ij}$. The one-step-ahead probability vector of $v$ taking one of the finite values $v_j$ evolves as

$$
(p^{+})^{T} = p^{T} \Pi
\tag{8}
$$

and for $n > 1$ steps ahead as

$$
(p^{+n})^{T} = p^{T} \Pi^{n}.
\tag{9}
$$
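A compact sketch of the maximum-likelihood TPM estimation in (7) and the probability propagation in (8)-(9) could look like the following (Python; the velocity grid and example trace are invented for illustration):

```python
import numpy as np

def estimate_tpm(velocity_trace, grid):
    """Maximum-likelihood TPM of Eq. (7) from a discretized velocity trace."""
    M = len(grid)
    counts = np.zeros((M, M))                        # N_ij: transition counts
    # Map each continuous speed to its nearest grid state (interval midpoint)
    states = np.argmin(np.abs(velocity_trace[:, None] - grid[None, :]), axis=1)
    for i, j in zip(states[:-1], states[1:]):
        counts[i, j] += 1
    row_sums = counts.sum(axis=1, keepdims=True)     # N_i
    tpm = np.divide(counts, row_sums, out=np.full((M, M), 1.0 / M),
                    where=row_sums > 0)              # unvisited rows -> uniform
    return tpm

def propagate(p, tpm, n=1):
    """Eqs. (8)-(9): n-step-ahead probability vector (p^{+n})^T = p^T Pi^n."""
    return p @ np.linalg.matrix_power(tpm, n)

# Toy usage with an assumed 0-60 km/h grid in 5 km/h steps
grid = np.arange(0.0, 65.0, 5.0)
trace = np.array([0, 4, 9, 14, 18, 22, 27, 31, 28, 24, 19, 15, 10, 5, 0], float)
tpm = estimate_tpm(trace, grid)
p0 = np.eye(len(grid))[2]          # currently in the state closest to 10 km/h
p3 = propagate(p0, tpm, n=3)       # three-step-ahead probability distribution
```

Because the TPM is built up from observed transitions, a second pass over the same cycle, having already seen the high-speed transitions, predicts more accurately than the first pass, which is the effect discussed in Section 4.1.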

In the nearest neighbor method, $X$ is divided into a finite set of disjoint intervals $I_j$, $j = 1, \dots, M$, and each interval is assigned a Markov chain state $v_j \in I_j$, which is typically the midpoint of the interval $I_j$. Based on this partitioning, a continuous state $v \in I_j$ corresponds to the discrete state $v_j$ and is associated with an $M$-dimensional probability vector $\alpha^{T}(v) = [0 \cdots 1 \cdots 0]$, whose $j$-th element is 1 and whose other elements are 0. Motivated by (8) and $\alpha(v)$, the probability vector of the next state is determined as

$$
(\alpha^{+}(v))^{T} = (\alpha(v))^{T} \Pi = \Pi_j^{T}
\tag{10}
$$

where $\Pi_j^{T}$ denotes the $j$-th row of the TPM $\Pi$. In the nearest neighbor predictor (NNP), the next one-step-ahead speed is then predicted as an expectation over the interval midpoints:

$$
v^{+} = \sum_{j=1}^{M} p_{ij} v_j \quad \text{if } v \in I_i.
\tag{11}
$$

3.2 Q-Learning Algorithm

The interaction between the agent and the environment in RL is modeled as a discrete discounted Markov decision process (MDP). The MDP is a quintuple $(S, A, \Pi, R, \gamma)$, where $S$ and $A$ are the sets of states and actions, $\Pi$ is the TPM, $R$ is the reward function, and $\gamma \in (0, 1)$ is a discount factor. The transition probability from state $s$ to the next state $s'$ using action $a$ and the corresponding reward are denoted as $p_{sa,s'}$ and $r(s, a)$, respectively. The control policy $\pi$ is the distribution over the control actions $a$, given the current state $s$. The optimal value function is the minimum of the finite expected discounted sum of the rewards:

$$
V^{*}(s) = \min_{\pi} E\Big( \sum_{t=0}^{T} \gamma^{t} r \Big)
         = \min_{a} \Big( r(s, a) + \gamma \sum_{s' \in S} p_{sa,s'} V^{*}(s') \Big), \quad \forall s \in S
\tag{12}
$$

Since the optimal value function is given, the optimal control policy is determined as follows:

$$
\pi^{*}(s) = \arg\min_{a} \Big( r(s, a) + \gamma \sum_{s' \in S} p_{sa,s'} V^{*}(s') \Big).
\tag{13}
$$

In addition, the action-value function $Q(s, a)$ and its optimal value $Q^{*}(s, a)$ are expressed as follows:

$$
\begin{cases}
Q(s, a) = r(s, a) + \gamma \sum_{s' \in S} p_{sa,s'} Q(s', a) \\[1mm]
Q^{*}(s, a) = r(s, a) + \gamma \sum_{s' \in S} p_{sa,s'} \min_{a'} Q^{*}(s', a').
\end{cases}
\tag{14}
$$

The variable $V^{*}(s)$ is the value of $s$ assuming that an optimal action is taken initially; therefore, $V^{*}(s) = \min_a Q^{*}(s, a)$ and $\pi^{*}(s) = \arg\min_{a} Q^{*}(s, a)$. The update rule of the action-value function in the Q-learning algorithm is expressed as [11]

$$
Q(s, a) \leftarrow Q(s, a) + \eta \Big( r(s, a) + \gamma \min_{a'} Q(s', a') - Q(s, a) \Big)
\tag{15}
$$

where $\eta \in [0, 1]$ is a decaying factor (learning rate) of the Q-learning algorithm. As the vehicle velocity is predicted using the NNP, (15) is used to acquire the RL-based predictive energy management strategy. The pseudo-code of the Q-learning algorithm is given in Table 1 [5].

Table 1 Pseudo-Code of the Q-Learning Algorithm [5]
Algorithm: Q-learning
1. Initialize Q(s, a), s, number of iterations N
2. Repeat for each step k = 1, 2, 3, ...
3.   Choose a based on Q(s, ·) (ε-greedy)
4.   Take action a, observe r, s'
5.   Define a* = arg max_a Q(s', a)
6.   Q(s, a) ← Q(s, a) + η(r(s, a) + γ max_{a'} Q(s', a') − Q(s, a))
7.   s ← s'
8. until s is terminal
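To make Table 1 concrete, a minimal tabular Q-learning sketch is given below (Python). The state/action discretization, exploration rate, reward definition, and environment step are placeholder assumptions standing in for the SOC/engine-torque MDP described above; the min-form matches the cost-minimization convention of (12)-(15).

```python
import numpy as np

rng = np.random.default_rng(0)

N_S, N_A = 21, 11                 # assumed discretized SOC states and engine-torque actions
GAMMA, N_ITER = 0.95, 10000       # discount factor and number of iterations, as in the paper
Q = np.zeros((N_S, N_A))          # action-value table Q(s, a)

def env_step(s, a):
    """Placeholder environment returning (cost, next_state).
    In the paper this role is played by the powertrain model of Section 2 driven by the
    NNP-predicted velocity, with the cost of Eq. (5) as the reward signal."""
    cost = abs(s - N_S // 2) * 0.1 + 0.01 * a               # dummy stage cost
    s_next = int(np.clip(s + rng.integers(-1, 2), 0, N_S - 1))
    return cost, s_next

s = N_S // 2                      # e.g. initial SOC of 0.70 mapped to the middle state
for k in range(1, N_ITER + 1):
    eta = 1.0 / np.sqrt(k + 2)    # decaying factor; assumed here to be 1/sqrt(k+2)
    eps = 0.1                     # epsilon-greedy exploration rate (value assumed)
    if rng.random() < eps:
        a = int(rng.integers(N_A))
    else:
        a = int(np.argmin(Q[s]))  # greedy w.r.t. cost, i.e. min instead of max
    cost, s_next = env_step(s, a)
    # Q-learning update, cost-minimization form of Eq. (15)
    Q[s, a] += eta * (cost + GAMMA * Q[s_next].min() - Q[s, a])
    s = s_next

policy = Q.argmin(axis=1)         # greedy policy: best action index for each state
```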

The computational process of the predictive energy management strategy is implemented in Matlab using the Markov decision process (MDP) toolbox introduced in [12]. The decaying factor $\eta$ depends on the time step $k$ and is taken as $1/\sqrt{k+2}$, the discount factor $\gamma$ is 0.95, the number of iterations $N$ is 10000, and the sample time is 1 second.

4. RESULTS AND DISCUSSION

4.1 Velocity Prediction for Different Times

The NNP is utilized to predict the vehicle velocity for different prediction times. Fig. 2 [5] illustrates the realistic driving cycle used for simulation in this paper. The first-time one-step-ahead velocity prediction is depicted in Fig. 3. It is apparent that the accuracy is not very high in some regions. Fig. 4 shows the second-time one-step-ahead velocity prediction. Comparing Fig. 3 with Fig. 4, it can be seen that the second-time prediction achieves excellent accuracy, which is quantified by the mean square error (MSE). The MSE in Fig. 4 is 1.0481, which is lower than that in Fig. 3 (MSE = 1.9459). This can be explained by the real-time transition probabilities shown in Fig. 5. In the first-time prediction, the transition events at v > 25 km/h are not yet complete. However, these transition events have been experienced by the second-time prediction, which results in the higher accuracy.
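The MSE values quoted above can be reproduced from the recorded traces with a one-line computation; a sketch under the assumption that the real and one-step-ahead predicted velocity series are available as arrays (the array names are hypothetical):

```python
import numpy as np

def prediction_mse(v_real, v_pred):
    """Mean square error between real and predicted velocity, in (km/h)^2."""
    v_real, v_pred = np.asarray(v_real, float), np.asarray(v_pred, float)
    return float(np.mean((v_real - v_pred) ** 2))

# e.g. prediction_mse(cycle_speed[1:], nnp_predictions[:-1]) would return values
# comparable to the 1.9459 (first pass) and 1.0481 (second pass) reported above.
```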

Fig. 2 Realistic Driving Cycle for Simulation [5].


Fig. 3 The First-Time Velocity Prediction (MSE = 1.9459).

Fig. 4 The Second-Time Velocity Prediction (MSE = 1.0481).

Fig. 5 Transition Probability Matrices for the Two Predictions (in the first-time prediction the transitions at v > 25 km/h have not yet been experienced; in the second-time prediction they have).

4.2 Comparison of Different RL Controls

The NNP-based RL-enabled energy management strategy is further compared with the conventional RL-based control strategy in this section. Taking the driving cycle in Fig. 2 as an example, the state variable is the battery SOC and the control variable is the engine torque. The initial value of the SOC is 0.70. Fig. 6 illustrates the SOC evolution and the power split over the simulation cycle. It can be discerned that the SOC trajectory under the NNP-based predictive control strategy differs from that of the conventional RL-based control strategy, and an analogous result can be observed in the power split trajectory. Also, the convergence processes of the Q values in the proposed predictive and the conventional RL-based controls are illustrated in Fig. 7. The mean discrepancy of the Q values in the proposed control is always lower than that in the conventional RL control, which means that the proposed control is superior to the conventional RL control in calculation speed. These results indicate that the proposed energy management strategy adapts to real-time driving conditions more suitably than the conventional RL control, which demonstrates its adaptability. The working points of the engine with the two control strategies are shown in Fig. 8. The engine working points under the predictive control strategy lie in the lower fuel-consumption region more frequently than under the conventional control. Table 2 lists the fuel consumption after SOC correction for the two control strategies. The fuel consumption under the NNP-based predictive control strategy is 5.6% lower than that of the conventional control. Consequently, it can be concluded that the proposed energy management strategy is superior to the conventional RL-based control strategy in fuel economy.

Fig. 6 SOC Trajectories and Power Split.

Fig. 7 The convergence processes of Q values.


Fig. 8 Engine Working Points for the Two Strategies (the lower-fuel-consumption area is indicated).

Table 2 Fuel Consumption for the Two Control Strategies

Control strategy      Fuel consumption (g)      Relative increase (%)
NNP-RL                2713.5                    -
RL                    2865.9                    5.62

5. CONCLUSION

This paper develops a reinforcement learning (RL)-enabled predictive control strategy for a series HTV. First, the powertrain of the HTV is introduced. Then, the novel velocity predictor is presented to predict the future velocity profile within the RL control framework. The predictive control scheme is compared with the conventional RL strategy to demonstrate its optimality in fuel economy. The simulation results show the influence of the predicted times on the prediction accuracy and illustrate that the proposed energy management strategy is superior to the conventional RL-based control strategy in fuel economy. Future research involves developing novel prediction approaches that achieve high accuracy in the first prediction pass, and applying the prediction approach and the RL method in many other fields requiring flexible model-free control.

ACKNOWLEDGEMENT

This work is supported by the Fundamental Research Funds for the Central Universities (106112017CDJQJ338811).

REFERENCES

[1] Martinez C, Hu XS, and Cao DP, "Energy Management in Plug-in Hybrid Electric Vehicles: Recent Progress and a Connected Vehicles Perspective", IEEE Trans. Veh. Technol., Vol. 66, No. 6, 2017, pp. 4534–4549.
[2] Trovao JPF, Santos VDN, Henggeler Antunes C, Pereirinha PG, and Jorge HM, "A real-time energy management architecture for multisource electric vehicles", IEEE Trans. Ind. Electron., Vol. 62, No. 5, 2017, pp. 3223–3233.

[3] Sciarretta A and Guzzella L, "Control of hybrid electric vehicles", IEEE Contr. Syst. Mag., Vol. 27, No. 2, 2007, pp. 60–67.
[4] Serrao L, Onori S, and Rizzoni G, "A comparative analysis of energy management strategies for hybrid electric vehicles", J. Dyn. Sys. Meas. Control, Vol. 133, No. 3, 2011, pp. 1–9.
[5] Liu T, Zou Y, Liu DX, and Sun FC, "Reinforcement learning of adaptive energy management with transition probability for a hybrid electric tracked vehicle", IEEE Trans. Ind. Electron., Vol. 62, No. 12, 2015, pp. 7837–7846.
[6] Stephens ER, Smith DB, and Mahanti A, "Game theoretic model predictive control for distributed energy demand-side management", IEEE Trans. Smart Grid, Vol. 6, No. 3, 2015, pp. 1394–1402.
[7] Zou Y, Liu T, Liu DX, and Sun FC, "Reinforcement learning-based real-time energy management for a hybrid tracked vehicle", Appl. Energy, Vol. 171, 2016, pp. 372–382.
[8] Liu T, Zou Y, Liu D, and Sun FC, "Reinforcement Learning-Based Energy Management Strategy for a Hybrid Electric Tracked Vehicle", Energies, Vol. 8, No. 7, 2015, pp. 7243–7260.
[9] Hu X, Moura S, Murgovski N, Egardt B, and Cao D, "Integrated optimization of battery sizing, charging, and power management in plug-in hybrid electric vehicles", IEEE Trans. Control Syst. Technol., Vol. 24, No. 3, 2015, pp. 1036–1043.
[10] Filev DP and Kolmanovsky I, "Generalized Markov models for real-time modeling of continuous systems", IEEE Trans. Fuzzy Syst., Vol. 22, 2014, pp. 983–998.
[11] Kaelbling L, Littman M, and Moore A, "Reinforcement learning: A survey", Journal of Artificial Intelligence Research, Vol. 4, 1996, pp. 237–285.
[12] Chadès I, Cros M, and Garcia F, "Markov decision process (MDP) toolbox v2.0 for Matlab", INRA Toulouse, INRA, France, http://www.inra.fr/internet/Departements/MIA/T/MDPtoolbox, 2005.