2013 International Conference on Process Control (PC) June 18–21, 2013, Štrbské Pleso, Slovakia
Evolutionary Algorithms and Reinforcement Learning in Experiments with Slot Cars

Dan Martinec and Marek Bundzel
Department of Control Engineering, Faculty of Electrical Engineering, Czech Technical University in Prague, 16627 Prague, Czech Republic
Email: [email protected], [email protected]

Abstract—Some control systems are difficult or impossible to tune by means other than automatic ones. We present examples of the optimization of the parameters of a PID controller regulating the velocity of a slot car to a given setpoint, using evolutionary optimization and reinforcement learning. Both methods are implemented on the microcontroller of the slot car. Experimental results and a comparison are provided.

I. INTRODUCTION

The experiments presented here were performed on a slot car platform used at the Czech Technical University for testing various autonomous control algorithms and vehicle platooning. The goal was to find optimal parameters of a PID velocity controller automatically and thus prove the ability of the methods used to solve optimization problems when implemented on the microcontroller of the slot car. Evolutionary optimization and reinforcement learning were used to minimize the deviation of the actual velocity from the given setpoint at any time. Brief surveys of these two types of algorithms are given in [1] and [2]; a more thorough description of reinforcement learning is given in [3]. Depending on the controller and the velocity setpoint, the slot car usually slows down in turns due to the increased friction and then overshoots the given velocity setpoint after entering the straight section. Maintaining a constant speed is important for some applications; e.g., when minimizing lap times it is important not to exceed the given maximal velocity when entering a turn, since the centrifugal force would otherwise eject the slot car from the track.

Tuning the PID controller manually, e.g. using a Ziegler–Nichols method, is time consuming and fatiguing. The slot car records the velocity (and other) measurements on an SD card. This card must be removed and the data read on a PC. The performance of the setting is analyzed, the controller is adjusted, and the process is repeated until the desired setting is found. When the automatic optimization methods are employed, time may not be saved, but the process does not require the operator's supervision.

The task was inspired by the Freescale Race Challenge 2012 competition described in [4] (unfortunately only in Czech), where the cars are supposed to race through 10 laps of a track as fast as possible. The shape of the track is not known in advance. Therefore, the competitors either have to use feedback control from accelerometers, or they can identify the track online and then use an optimal control approach. We have shown yet another way to go. The results of this research can be further extended so that the car could iteratively learn the optimal set of commands to race through the track.

II. SLOT CAR PLATFORM

The experimental slot car platform was introduced in [5] for experiments with platoons of vehicles. It is shortly introduced here to keep the paper self-contained.

The slot car platform consists of the slot car track and the slot car. The car shown in Fig. 1 is equipped with various electronic devices enabling it to drive itself, measure its own velocity and provide other features which were not utilized in these experiments.

Fig. 1. Slot car used in our experiments. Part of the PCB is revealed by a hole in the side, but the velocity sensor is hidden under the bodywork. The IR distance sensor mounted at the front is not used in the experiments.

The electronics of the car is based on a printed circuit board (PCB) with a Freescale 64-pin 32-bit MCF51JM64 microcontroller from the ColdFire series. It runs at 3.3 V and 48 MHz with 64 KB of Flash memory and 16 KB of RAM. The DC motor is driven by an MC33931 H-bridge operating at 14 V and 8 kHz. The velocity is measured using an incremental sensor based on a QRE1113 IR reflectance sensor from Fairchild Semiconductor. The incremental sensor not only measures the velocity but also the total distance driven by the car, supposing the wheels do not slip on the surface. The track has the shape of an oval with a length of 4283 mm.

The velocity controller is a discretized proportional-integral-derivative (PID) controller, controlling the velocity by adjusting the motor voltage in a simple control loop. The controller output u(t_k) at time t_k (the motor voltage) is calculated as follows:

\[
u(t_k) = u(t_{k-1}) + K_p\left[\left(1 + \frac{\Delta t}{T_i} + \frac{T_d}{\Delta t}\right)e(t_k) + \left(-1 - \frac{2T_d}{\Delta t}\right)e(t_{k-1}) + \frac{T_d}{\Delta t}\,e(t_{k-2})\right], \tag{1}
\]

where K_p, T_i, T_d are the coefficients to be found (from the PID controller standard form: overall gain, integral time and derivative time), Δt is the sampling time (0.001 s in our case) and e is the error calculated as the difference between the process value and the setpoint. The controller output was limited by conditions to stay within the allowed motor voltage range.
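As an illustration, a minimal sketch of Eq. (1) in C (the language used for the on-board implementations below) might look as follows; the names, the voltage limits and the error sign convention are our assumptions, not the authors' exact firmware:

```c
/*
 * Minimal sketch of the incremental PID of Eq. (1). Names, voltage
 * limits and the error sign convention are illustrative assumptions.
 */
#define DT     0.001f   /* sampling time Delta_t [s] */
#define U_MIN  0.0f     /* assumed lower motor-voltage limit [V] */
#define U_MAX 14.0f     /* assumed upper motor-voltage limit [V] */

typedef struct {
    float Kp, Ti, Td;        /* standard-form PID coefficients */
    float u_prev;            /* u(t_{k-1}) */
    float e_prev1, e_prev2;  /* e(t_{k-1}), e(t_{k-2}) */
} pid_state_t;

float pid_step(pid_state_t *p, float setpoint, float velocity)
{
    /* Sign chosen so that positive gains push the velocity toward
     * the setpoint (an assumption on the implemented convention). */
    float e = setpoint - velocity;

    /* Ti = 0 is taken here to disable the integral action (our
     * assumption; some solutions in the paper have Ti = 0). */
    float ki = (p->Ti > 0.0f) ? DT / p->Ti : 0.0f;

    /* Eq. (1): u(t_k) = u(t_{k-1}) + Kp[(1 + DT/Ti + Td/DT) e(t_k)
     *          + (-1 - 2Td/DT) e(t_{k-1}) + (Td/DT) e(t_{k-2})]  */
    float u = p->u_prev
            + p->Kp * ((1.0f + ki + p->Td / DT) * e
                       + (-1.0f - 2.0f * p->Td / DT) * p->e_prev1
                       + (p->Td / DT) * p->e_prev2);

    if (u > U_MAX) u = U_MAX;  /* keep u within the allowed range */
    if (u < U_MIN) u = U_MIN;

    p->e_prev2 = p->e_prev1;
    p->e_prev1 = e;
    p->u_prev  = u;
    return u;
}
```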
III. EVOLUTIONARY OPTIMIZATION
The evolutionary optimization method used in the experiments was a genetic algorithm implemented in C on the microcontroller of the slot car. The implementation had to take into account the computational and memory resources of the microcontroller. Every individual comprised three unsigned short variables encoding the PID parameters. Before being tested, the individual was decoded by a mapping function transforming the unsigned shorts to floating-point numbers from a user-predefined range. Each individual had a certain lifetime during which the car ran on the track and the individual was evaluated according to

\[
J(n) = \int_0^{T_{ev}} \left| v_{ref}(t) - v(t) \right| \, dt, \tag{2}
\]
where v_ref(t) and v(t) are the reference velocity and the actual velocity of the car, respectively. The variable T_ev is the time period of the PID controller evaluation, which was set to T_ev = 2 s; i.e., J(n) is computed as an integral of the velocity error over this time period. The car was then brought to a stop so that an evolutionary (dis)advantage could not be inherited indirectly. The lower the performance index, the higher the fitness of the individual.

The initial population of N individuals was created randomly so that the decoded values were within the user-defined range, with a certain reserve left for the parameters to evolve outside the initial range. Every individual was evaluated as described above and the stopping condition was tested (number of generations and target fitness). If the stopping condition was not met, a candidate generation of N − 1 individuals was created. Tournament selection was used to select the parents of the individuals, and crossover and mutation were applied. Crossover was implemented so that three random numbers were generated from the ranges given by the parents' parameters; this way, the offspring lay somewhere "in between" the parents. Mutation was applied with a predefined probability by adding white noise of a given amplitude to the individual. The mutation amplitude gradually decreased throughout the evolution: this supports exploration of the solution space at the start of the evolution and enhances exploitation of the best solution found towards its end. The individuals of the candidate generation were evaluated and the new generation was created by adding the strongest individual of the current generation to the candidate generation (elitism). Sketches of the individual decoding and of one generation step are given below.
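The encoding and the fitness evaluation described above might be sketched as follows; the parameter ranges and all names are illustrative assumptions, since the paper does not state the exact GA ranges:

```c
/*
 * Sketch of the individual encoding and of the performance index of
 * Eq. (2). Ranges and names are illustrative assumptions.
 */
#include <limits.h>

typedef struct { unsigned short kp, ti, td; } individual_t;

/* Linearly map a 16-bit gene onto the user-defined range [lo, hi]. */
static float decode_gene(unsigned short g, float lo, float hi)
{
    return lo + (hi - lo) * ((float)g / (float)USHRT_MAX);
}

/* Eq. (2) accumulated once per 1 ms control tick over the 2 s
 * evaluation window: J += |v_ref - v| * dt. Lower J = fitter. */
static float j_accumulate(float J, float v_ref, float v)
{
    float e = v_ref - v;
    return J + (e < 0.0f ? -e : e) * 0.001f;
}
```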
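One generation step, with tournament selection, the "in between" crossover and the decaying mutation, could then look like this sketch (again with assumed names and constants):

```c
/*
 * Sketch of one GA generation step: binary tournament selection,
 * "in between" crossover and decaying mutation. rand_f() and the
 * mutation constants are illustrative assumptions.
 */
#include <stdlib.h>

#define N 10   /* population size used in the experiments */

static float rand_f(void)   /* uniform pseudo-random number in [0,1) */
{
    return (float)rand() / ((float)RAND_MAX + 1.0f);
}

/* Binary tournament: the fitter (lower J) of two random individuals. */
static int tournament(const float J[N])
{
    int a = rand() % N, b = rand() % N;
    return (J[a] < J[b]) ? a : b;
}

/* One offspring parameter: a random value between the parents'
 * values, mutated with probability 0.2 by noise whose amplitude
 * decays over the generations. */
static float breed(float pa, float pb, float mut_amp)
{
    float lo = (pa < pb) ? pa : pb;
    float hi = (pa < pb) ? pb : pa;
    float child = lo + (hi - lo) * rand_f();          /* crossover */
    if (rand_f() < 0.2f)                              /* mutation  */
        child += mut_amp * (2.0f * rand_f() - 1.0f);
    return child;
}
```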
IV. CARLA ALGORITHM AND IMPLEMENTATION
The acronym CARLA stands for Continuous Action Reinforcement Learning Automata, a learning algorithm introduced in [6]. It operates on a random environment. The action taken on this environment is selected according to a probability density function. Successful actions are rewarded in the learning process and the probability of their future selection is increased via a Gaussian neighborhood function. The CARLA is shortly introduced here for the convenience of the reader. The first step is the initialization of the probability density function to a uniform distribution. The following steps are then repeated until a stopping condition is met:

1) Select and execute an action based on the density function.
2) Evaluate the performance of the action.
3) Evaluate the reward of the action.
4) Update the probability density function.
5) Return to step 1).
An action x(n) (in this case one of the PID constants) at iteration n is selected as a random variable distributed according to the density function f(x, n), i.e.

\[
\int_{x_{min}}^{x(n)} f(x, n)\, dx = z(n), \tag{3}
\]

where z(n) is drawn from the uniform distribution z(n) ∼ U[0, 1].
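On the microcontroller the density is stored as a finite number of samples (150 per constant, as described later in this section), so the selection of Eq. (3) amounts to an inverse-CDF walk over the sampled density. A sketch, with illustrative names:

```c
/*
 * Sketch of the action selection of Eq. (3) for a density stored as
 * NS discrete samples over [XMIN, XMAX]. Names are illustrative.
 */
#define NS    150     /* samples per density, as in the paper */
#define XMIN  0.0f    /* minimal value of a PID constant */
#define XMAX 10.0f    /* maximal value of a PID constant */

/* Walk the sampled density until the accumulated probability reaches
 * z ~ U[0,1]; return the corresponding action value (inverse CDF). */
float carla_select(const float f[NS], float z)
{
    float dx  = (XMAX - XMIN) / (float)NS;
    float cdf = 0.0f;
    for (int i = 0; i < NS; i++) {
        cdf += f[i] * dx;
        if (cdf >= z)
            return XMIN + ((float)i + 0.5f) * dx;  /* bin centre */
    }
    return XMAX;  /* guards against rounding in the normalization */
}
```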
The performance was evaluated according to Eq. (2). A lower J(n) means a better performance; hence, we are aiming for the lowest possible performance index. The reward β(n) of the action is equal to

\[
\beta(n) = \max\left\{0,\; \frac{J_{med} - J(n)}{J_{max} - J_{med}}\right\}, \tag{4}
\]
where J_med and J_max are the median and maximum values computed from the performance history, respectively. It is not possible to keep the whole performance history in the memory of the processor, nor is it desirable, since a shorter history allows the algorithm to react more flexibly to changes of the environment. It was found experimentally that a history window of 8 samples is large enough in these experiments. The probability density function f(x, n) is updated as follows:

\[
f(x, n + 1) = \alpha \left[ f(x, n) + \beta(n) H(x, r) \right], \tag{5}
\]
where α is a normalization parameter chosen to satisfy the condition

\[
\int_{x_{min}}^{x_{max}} f(x, n + 1)\, dx = 1 \tag{6}
\]
and H(x, r) is a symmetric Gaussian function centered on r = x(n):

\[
H(x, r) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - r)^2}{2\sigma^2}\right), \tag{7}
\]
where the standard deviation σ is a tuning parameter of the algorithm, which was set to σ = 0.046. The variables x_min = 0 and x_max = 10 stand for the minimal and the maximal value of a PID constant, respectively.
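Under the same sampled-density representation, the update of Eqs. (5)-(7) could be sketched as follows (NS, XMIN and XMAX as in the selection sketch above):

```c
/*
 * Sketch of the update of Eqs. (5)-(7): add the reward spread by the
 * Gaussian neighbourhood H(x, r) around the chosen action r = x(n),
 * then renormalize so that Eq. (6) holds (alpha = 1/integral).
 */
#include <math.h>

#define SIGMA    0.046f      /* tuning parameter used in the paper */
#define SQRT_2PI 2.5066283f  /* sqrt(2*pi) */

void carla_update(float f[NS], float r, float beta)
{
    float dx  = (XMAX - XMIN) / (float)NS;
    float sum = 0.0f;
    for (int i = 0; i < NS; i++) {
        float x = XMIN + ((float)i + 0.5f) * dx;
        float h = expf(-(x - r) * (x - r) / (2.0f * SIGMA * SIGMA))
                  / (SIGMA * SQRT_2PI);       /* Eq. (7) */
        f[i] += beta * h;                     /* Eq. (5), pre-norm */
        sum  += f[i] * dx;
    }
    for (int i = 0; i < NS; i++)              /* enforce Eq. (6) */
        f[i] /= sum;
}
```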
The algorithm just described iteratively learns only one parameter x. It can, however, be generalized to learn several parameters simultaneously: instead of one automaton, several CARLA automata run at the same time. Each of them has its own density function, but they may or may not share the evaluation of performance and reward. The trick of the method lies in the interaction of the automata through the environment, as the sketch below illustrates.
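A sketch of this interaction, assuming the two sketches above and hypothetical helpers run_car_for_2s() and reward_from_history() standing in for Eqs. (2) and (4):

```c
/*
 * Sketch of three automata (Kp, Ti, Td) coupled only through the
 * environment: each samples its own density, the combined action is
 * tried on the car, and one shared reward updates all three
 * densities. The helpers below are hypothetical; NS, carla_select()
 * and carla_update() are from the sketches above.
 */
extern float run_car_for_2s(float Kp, float Ti, float Td); /* Eq. (2) */
extern float reward_from_history(float J);                 /* Eq. (4) */
extern float rand_f(void);        /* uniform [0,1), as in the GA sketch */

float f_kp[NS], f_ti[NS], f_td[NS];   /* one sampled density each */

void carla_iteration(void)
{
    float Kp = carla_select(f_kp, rand_f());
    float Ti = carla_select(f_ti, rand_f());
    float Td = carla_select(f_td, rand_f());

    float J    = run_car_for_2s(Kp, Ti, Td);  /* performance index */
    float beta = reward_from_history(J);      /* shared reward     */

    carla_update(f_kp, Kp, beta);  /* identical reward for all three */
    carla_update(f_ti, Ti, beta);
    carla_update(f_td, Td, beta);
}
```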
The CARLA was also implemented in C on the microcontroller of the slot car, to iteratively learn the three constants of the PID controller, Kp, Ti and Td. The PID parameters were encoded in the same manner as for the evolutionary algorithm. Furthermore, for each constant there existed one probability density function, which had to be sampled into the microcontroller's memory. It turned out that 150 samples per density function were sufficient. The performance and reward evaluations were identical for all three constants. The resulting action was a combination of the three PID constants.

V. EXPERIMENTS

A. Experiments with Genetic Algorithm

The slot car top speed was limited in the experiments to 2.4 m/s to avoid the slot car being ejected from the track. Every individual was evaluated for 2 seconds and then the slot car was brought to a stop before testing the next individual. The training setpoint was either 1.8 m/s or 2.2 m/s, and the solutions found belonged to two families, with either Ti or Td converging to zero. Table I provides an example of the progress and convergence of the GA optimization; FitW and FitAve stand for the fitness of the winning individual and the average fitness of the individuals in a given generation. In the experiments, the population size was 10 individuals, the mutation probability was 0.2 and the magnitude of mutation was gradually decreased. The time needed to evaluate one generation and to create a new one was approximately 45 seconds; the solution was thus found in less than 8 minutes.

TABLE I. PROGRESS OF THE EVOLUTION, EXAMPLE

Gen    Kp       Ti       Td       FitW    FitAve
init   1.8738   1.0580   0.0903   8840    28319
1      3.4709   0.7995   0.2381   5052    16313
2      2.8393   0.8940   0.1893   5690    10972
3      3.4555   0.8450   0.2487   4916    7023
4      3.3681   0.8804   0.1951   4059    7262
5      3.4502   0.8696   0.2084   5066    7979
6      3.3353   0.8300   0.1765   3526    5694
7      3.3112   0.8368   0.0001   3491    5451
8      3.3112   0.8368   0.0001   4032    4930
9      3.3112   0.8368   0.0001   3404    4703
10     3.2887   0.8369   0.0001   3645    4592

TABLE II. MEASURED AVERAGE VELOCITIES OVER 10 LAPS WITH THE GENETIC ALGORITHM

SP                                           1.8 m/s  2.0 m/s  2.2 m/s  2.3 m/s  2.4 m/s
Solutions found by SP = 1.8 m/s:
K = 3.29, Ti = 0.84, Td = 0.0    Meas. ave.  1.83     2.01     2.17     2.27     2.30
K = 2.90, Ti = 0.0, Td = 3.59    Meas. ave.  1.70     1.89     2.06     2.16     —
Solutions found by SP = 2.2 m/s:
K = 3.11, Ti = 1.10, Td = 0.00   Meas. ave.  1.83     2.02     2.20     2.27     2.34
K = 2.57, Ti = 0.0, Td = 2.82    Meas. ave.  1.61     1.86     2.04     2.13     2.185
The solutions that were found were evaluated by measuring the time the slot car needed to make 10 laps on the track; among other quantities, the average velocity was calculated. Table II summarizes the measured values. The average velocity of the car using the solutions with Ti = 0 was always lower than the required setpoint. When the slot car could not stay on the track for 10 laps, the average velocity was not calculated. Note, however, that the average velocity does not reflect well the ability of the controller to maintain the required setpoint.

B. Experiments with CARLA
The CARLA was trained in 30 iterations in a similar way as the genetic algorithm. In each iteration, the car drove for 2 seconds to evaluate the performance of the selected PID parameters and was then stopped so that the run could be evaluated and the parameters changed. Each iteration took approximately 5 seconds; the solution was therefore found in two and a half minutes.

The process of learning the parameters is depicted in Figs. 2, 3 and 4, which show how the probability density functions were updated during the iteration process. Some minor off-peaks occurred during the learning process (one of them is visible at iteration step 10 in Fig. 3); these are caused by the mechanical imperfections of the slot car. This also demonstrates that the CARLA is capable of reacting to changes of the environment in which it operates.

Fig. 2. Comparison of the probability density function for parameter Kp after various iteration steps (iterations 1, 10, 20 and 30).

Fig. 3. Comparison of the probability density function for parameter Ti after various iteration steps (iterations 1, 10, 20 and 30).

Fig. 4. Comparison of the probability density function for parameter Td after various iteration steps (iterations 1, 10, 20 and 30).

After the proper PID parameters were found, they were evaluated. The evaluation process was the same as for the genetic algorithm solution, i.e. the car drove 10 laps at a given reference velocity. The obtained results are summarized in Table III. Though the average performance is similar to that of the GA, it will be concluded in the final section that there are differences between the solutions.

TABLE III. MEASURED AVERAGE VELOCITIES OVER 10 LAPS WITH THE CARLA

SP                                           1.8 m/s  2.0 m/s  2.2 m/s  2.3 m/s  2.35 m/s
Solution found by SP = 1.8 m/s:
K = 7.60, Ti = 1.07, Td = 0      Meas. ave.  1.82     2.00     2.18     2.26     2.30
Solution found by SP = 2.2 m/s:
K = 5.55, Ti = 1.20, Td = 0      Meas. ave.  1.83     2.01     2.20     2.25     2.29

C. Comparison

The ability of the controller to maintain the required setpoint is reflected by measures considering the deviations of the actual velocity from the required setpoint at any time.
We have used the mean square error (MSE) and the mean absolute error (MAE), calculated from 3000 samples measured every 10 ms (unit: m/s) while the slot car was driving on the oval track. The two solutions considered the best for each method were selected for comparison: CARLA trained at SP 1.8 and 2.2 m/s, respectively, and GA trained at SP 1.8 and 2.2 m/s, respectively. For the comparison, the ratios of the particular calculated errors of the two methods were computed and expressed as percentages. Percentages above 100% indicate that the CARLA method was superior; percentages below 100% indicate that the GA was superior. Table IV summarizes the results. The quality of the solutions found by both methods was very similar.

TABLE IV. COMPARISON GA VS. CARLA

                            SP = 1.8   SP = 2.0   SP = 2.2   SP = 2.3
MSE
GA at 1.8                    0.025      0.059      0.187      0.270
GA at 2.2                    0.019      0.062      0.182      0.273
CARLA at 1.8                 0.026      0.055      0.181      0.265
CARLA at 2.2                 0.029      0.066      0.173      0.282
MAE
GA at 1.8                   -0.011      0.190      0.387      0.490
GA at 2.2                   -0.007      0.192      0.387      0.487
CARLA at 1.8                -0.017      0.184      0.381      0.480
CARLA at 2.2                -0.019      0.178      0.391      0.474
MSE GA / MSE CARLA * 100%
at 1.8                      100.22%    107.76%    103.79%    101.73%
at 2.2                       64.82%     94.48%    105.47%     96.93%
MAE GA / MAE CARLA * 100%
at 1.8                       63.25%    103.28%    101.67%    102.09%
at 2.2                       38.47%    108.36%     99.21%    102.68%
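For reference, the metrics of Table IV could be computed as in the following sketch; note that the table lists negative "MAE" values, so the mean error is taken here with its sign, which is our interpretation:

```c
/*
 * Sketch of the comparison metrics over the 3000 samples taken every
 * 10 ms. The signed mean error (rather than the mean of |e|) is an
 * assumption made to match the negative entries in Table IV.
 */
#define N_SAMPLES 3000

void error_metrics(const float v[N_SAMPLES], float sp,
                   float *mse, float *mae)
{
    float se = 0.0f, e_sum = 0.0f;
    for (int i = 0; i < N_SAMPLES; i++) {
        float e = sp - v[i];   /* deviation from the setpoint [m/s] */
        se    += e * e;
        e_sum += e;
    }
    *mse = se / (float)N_SAMPLES;
    *mae = e_sum / (float)N_SAMPLES;
}
```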
VI. CONCLUSION
Two methods for the automatic optimization of the parameters of a PID controller for the velocity control of a slot car were used: CARLA and a GA. The experimental results indicate that both methods are able to deliver viable solutions when implemented on the slot car's microcontroller. The solutions found were not identical, but they were nearly identical in performance.
ACKNOWLEDGMENT
This work was supported by the Ministry of Education of the Czech Republic under the Centralized Project for University Development CSM 100 TALENT, by the Grant Agency of the Czech Republic within the project GACR P103/12/1794, and is partially the result of the project implementation Development of the Center of Information and Communication Technologies for Knowledge Systems (ITMS project code: 26220120030), supported by the Research & Development Operational Program funded by the ERDF (the second affiliation of Marek Bundzel is the Technical University of Košice, Department of Cybernetics and Artificial Intelligence, Letná 9, 04001 Košice, Slovakia).

REFERENCES
[1] Y. Jin and J. Branke, "Evolutionary optimization in uncertain environments—a survey," IEEE Transactions on Evolutionary Computation, vol. 9, no. 3, pp. 303–317, 2005.
[2] L. Kaelbling, M. Littman, and A. Moore, "Reinforcement learning: A survey," Journal of Artificial Intelligence Research, vol. 4, pp. 237–285, 1996. [Online]. Available: http://arxiv.org/abs/cs/9605103
[3] F. Lewis, D. Vrabie, and K. Vamvoudakis, "Reinforcement Learning and Feedback Control: Using Natural Decision Methods to Design Optimal Adaptive Controllers," IEEE Control Systems, vol. 32, no. 6, pp. 76–105, Dec. 2012.
[4] HW.cz. (2012) Freescale Race Challenge 2012. [Online]. Available: http://www.hw.cz/teorie-a-praxe/mimochodem/freescale-race-challenge-2012-soutez-samoridicich-auticek-na-autodrahu
[5] D. Martinec, M. Sebek, and Z. Hurak, "Vehicular platooning experiments with racing slot cars," in Proc. 2012 IEEE International Conference on Control Applications, pp. 166–171, Oct. 2012.
[6] M. Howell, G. Frost, T. Gordon, and Q. Wu, "Continuous action reinforcement learning applied to vehicle suspension control," Mechatronics, vol. 7, no. 3, pp. 263–276, Apr. 1997.