Research Article

Memory-based reinforcement learning algorithm for autonomous exploration in unknown environment

International Journal of Advanced Robotic Systems May-June 2018: 1–10 © The Author(s) 2018 DOI: 10.1177/1729881418775849 journals.sagepub.com/home/arx

Amir Ramezani Dooraki and Deok Jin Lee

Abstract
In the near future, robots will be seen in almost every area of our lives, in different shapes and with different objectives such as entertainment, surveillance, rescue, and navigation. In any shape and with any objective, they must be capable of successful exploration: they should explore efficiently and be able to adapt to changes in their environment. For successful navigation, it is necessary to recognize the difference between similar places in an environment, and achieving this without increasing the capability of the sensors requires a memory. In this article, an algorithm for autonomous exploration and obstacle avoidance in an unknown environment is proposed. To make the algorithm self-learning, a memory-based reinforcement learning method using a multilayer neural network is used, with the aim of creating an agent with an efficient exploration and obstacle avoidance policy. Furthermore, this agent can automatically adapt itself to changes in its environment. Finally, in order to test the capability of the algorithm, we implemented it on a robot modeled after a real platform and simulated in the robust physics-engine simulator Gazebo.

Keywords
Reinforcement learning, autonomous exploration, adaptive agent, depth map, artificial neural network, sensor fusion, memory-based, obstacle avoidance

Date received: 21 December 2017; accepted: 8 April 2018

Topic: Mobile Robots and Multi-Robot Systems
Topic Editor: Mohsen Shahinpoor
Associate Editor: Genci Capi

Introduction

Learning is crucial for intelligence, and a robot that can learn by itself can adapt to its environment and its changes. Like natural intelligent creatures, such robots can interact with their environment autonomously and increase their intelligence through their own experiences, within the limits of their capabilities. With the goal of achieving such an ability, work has already started on reproducing computational analogues of the complex processes happening inside the brains of intelligent animals. One example of such reproduction is reinforcement learning algorithms,1 which can be considered a cognitive structure that serves as a substrate for intelligent behaviors such as exploration, obstacle avoidance, and navigation.

As another example, the artificial neural network (ANN), the very basic building block of learning in intelligent creatures, can shape the infrastructure of a highly efficient and expandable learner. ANNs have been developed for years and have recently made great advances in their training capability and feasibility, especially with the emergence of deep learning technology (deep neural networks).

Smart Autonomous System Lab, Kunsan National University, Gunsan, Republic of Korea

Corresponding author:
Lee Deok Jin, Smart Autonomous System Lab, Kunsan National University, Gunsan 54150, Republic of Korea.
Email: [email protected]

Creative Commons CC BY: This article is distributed under the terms of the Creative Commons Attribution 4.0 License (http://www.creativecommons.org/licenses/by/4.0/) which permits any use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access pages (https://us.sagepub.com/en-us/nam/open-access-at-sage).

On the other hand, the computing power of robots and machines has increased greatly in recent years, largely due to progress in computer vision driven by graphics processing unit (GPU) computing technology. As a result, the development of sophisticated intelligent algorithms capable of online learning has become possible. For example, Zhang et al.2 used an online reinforcement learning method to plan an optimal traffic policy; Subbarao et al.3 used reinforcement learning for adaptive control and system identification; Riedmiller et al.4 used reinforcement learning to train a soccer robot; and finally, in the study by Mnih et al.,5 which is very well known in the area of reinforcement learning and deep learning, a deep Q-network (DQN) agent learns to play Atari 2600 games and scores at the level of a human player in most of them. Given these recent advances in artificial intelligence, people can now believe in the possibility of having robots not only on industrial lines, such as in a factory, but also as entities in their everyday social life. Robots that interact with humans in everyday life must be able to adapt to their environments autonomously and learn from their own experience, in contrast to industrial robots, whose environment does not change and which perform exactly the same procedure for years. Thus, two crucial factors must be considered. First, given the wide range of tasks social robots are expected to perform, it is not easy to train such a robot with a classical machine learning method such as supervised learning alone, because preparing and labeling the training samples is very hard, if not impossible. Second, a robot expected to operate in an everyday human environment should be able to adapt itself to these ever-changing environments in an autonomous and continuous way. Considering these factors, in this article we define and implement an algorithm capable of autonomous and adaptive exploration and obstacle avoidance in a social environment, capabilities that are crucial for any kind of mobile social robot. Looking at the literature, different methods have been used to achieve these goals, such as ant colony optimization,6 genetic algorithms,7 particle swarm optimization,8 fuzzy neural networks,9 learning from demonstration,10 and reinforcement learning.11 For example, Lee et al.12 used a trained network in their reinforcement learning-based algorithm to make a quadruped robot able to avoid obstacles. Kominami et al.13 used a reinforcement learning-based technique in combination with a virtual repulsive method to let a multilegged robot with tactile sensors negotiate obstacles. Zhang et al.10 used learning from demonstration with a Gaussian mixture model and Gaussian mixture regression to let a robot learn how to avoid obstacles. Tai et al.11 and Lei and Ming14 used a method similar to DQN to train their robot to explore and avoid obstacles; however, their approach differs from DQN in that they initialized their network weights using a network previously trained with supervised learning.


Figure 1. Memory-based multilayer Q-network algorithm work flow diagram.

In this article, reinforcement learning and a multilayer neural network are used to build an algorithm capable of learning by itself, from its own experiences, to explore and avoid obstacles; we call it the memory-based multilayer Q-network (MMQN). It differs from the reinforcement learning-based works mentioned above in two respects: first, it learns from scratch, in contrast to Tai et al.11 and Lei and Ming,14 who initialized their network weights with a previously trained network; second, it can adapt itself to an unknown environment, or to changes in its environment, autonomously and continuously. Furthermore, we use a sensor fusion technique, combining a depth sensor and infrared range sensors, to increase the accuracy of our depth measurement. Finally, our experiments are performed using a simulated robot, modeled after a real robot, in a realistic physics-engine simulator using the robot operating system (ROS). In the following text, we first explain our main algorithm and the different aspects of its implementation. After that, we describe our testing environment, testing scenarios, and results. Finally, we discuss our results and conclude our work.

Memory-based multilayer Q-network

The work flow of our algorithm is shown in Figure 1 and its pseudocode in Algorithm 1. The input to the algorithm is the preprocessed sensory input, which contains the depth sensor and range sensor signals. These input data shape the state of the MMQN reinforcement learning module, which makes a decision based on this state and its policy to find the best action. After an action is taken and executed, the state of the algorithm changes to a new state. In MMQN, we use an epsilon-greedy policy; as a result, the decision of which action is best is made according to the value of each action in the particular state, as calculated by the MMQN ANN module.


Algorithm 1. Memory-based multilayer Q-network algorithm.

The MMQN ANN module acts as a function approximator for the MMQN reinforcement learning module: the state of the reinforcement learning module is the input to the neural network module, and the output is the value of each action in that particular state. After selecting an action and moving to a new state, a reward is calculated based on the sensor inputs, which can be interpreted as the response of the environment to the action selected by the MMQN algorithm.

This reward is then used to update the value of the selected action in the MMQN ANN module. In the following sections, we explain each process in detail.
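The published pseudocode of Algorithm 1 is not reproduced here, so as a rough illustration of the work flow just described, the following Python-style sketch shows one possible structure of the MMQN loop. All names, placeholder functions, and the specific hyperparameter values are our own assumptions, not the authors' implementation; only the three actions, the 5 × 80 super state, the replay memory of size 10,000, and the epsilon schedule come from the text.

```python
# Hypothetical sketch of the MMQN work flow (Figure 1 / Algorithm 1).
# Placeholder names and hyperparameters are assumptions, not the authors' code.
import random
from collections import deque

import numpy as np

N_ACTIONS = 3            # left, right, forward (as described in the paper)
STATE_DIM = 5 * 80       # super state: 5-frame short-term memory x 80-element depth vector
GAMMA = 0.99             # discount factor (assumed value)
EPS_START, EPS_END, EPS_STEPS = 1.0, 0.1, 5000

replay = deque(maxlen=10000)          # long-term memory (experience replay)

def q_values(state):
    """Placeholder for the MMQN ANN module: returns one value per action."""
    return np.random.rand(N_ACTIONS)  # stand-in for a neural-network forward pass

def select_action(state, step):
    """Annealed epsilon-greedy policy."""
    eps = max(EPS_END, EPS_START - (EPS_START - EPS_END) * step / EPS_STEPS)
    if random.random() < eps:
        return random.randrange(N_ACTIONS)
    return int(np.argmax(q_values(state)))

def train_step(batch):
    """Placeholder for one gradient-descent update of the Q-network on a replay batch."""
    pass

def environment_step(action):
    """Placeholder: execute the action on the robot and return (next_state, reward)."""
    return np.random.rand(STATE_DIM), random.uniform(-1.0, 1.0)

state = np.random.rand(STATE_DIM)     # initial super state from the sensor-fusion step
for step in range(100):               # shortened loop for illustration
    action = select_action(state, step)
    next_state, reward = environment_step(action)
    replay.append((state, action, reward, next_state))
    if len(replay) >= 32:
        train_step(random.sample(list(replay), 32))
    state = next_state
```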

Reinforcement learning

For a robot to adapt itself to an unknown environment, it needs to be able to learn autonomously. The reinforcement learning framework realizes this idea of autonomous learning by taking actions autonomously and measuring the value of each action based on a predefined reward function.


The more actions the agent takes, the more information it gains about the best action in each state of the environment. For the implementation of our reinforcement learning agent, we used the ideas described by Sutton and Barto.1 We considered the autonomous learning problem for our algorithm as a Markov decision process, in which our agent interacts with its environment at discrete time steps and, at each time step, is located in a unique state determined by its sensory input s_t ∈ S. In each s_t, the agent can take an action a_t ∈ A based on its current policy π(s_t). In the next time step, the state of the agent changes to s_{t+1} ∈ S as a result of this action, and the agent receives a reward R (Figure 2).

Figure 2. Reinforcement learning framework.

Q-learning. In the MMQN algorithm, we use a model-free, off-policy method, Q-learning,15 mainly because we do not have a model of the environment and because, as explained later, we want to use the experience replay method16 to train our multilayer neural network (Q-network), which requires a model-free method. Q-learning is a temporal-difference reinforcement learning method that works with an epsilon-greedy policy for action selection. The goal of such a policy is to achieve the maximum discounted reward over time; it does so iteratively by selecting the action with the maximum value in each state and updating this value using the reward received from the environment and the maximum action value in the next state. We use the standard assumption that future rewards are discounted by a factor γ per time step; thus, the future discounted return at time t is

R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}    (1)

where T is the final time step (or infinity). We also define the optimal action value function as the maximum expected return achievable by following any policy after taking action a in state s:

Q^*(s, a) = \max_{\pi} \mathbb{E}[R_t \mid s_t = s, a_t = a, \pi]    (2)

where the policy is an epsilon-greedy policy mapping states to actions. The optimal action value function is updated according to the Bellman equation

Q^*(S_t, a_t) = Q(S_t, a_t) + \alpha \big[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, a_t) \big]    (3)
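As a concrete illustration of the update in equation (3), the short snippet below applies one tabular Q-learning step. The states, actions, and numerical values are invented purely for illustration; they are not taken from the paper.

```python
# Minimal, self-contained illustration of the Q-learning update in equation (3).
import numpy as np

n_states, n_actions = 4, 3
Q = np.zeros((n_states, n_actions))

alpha, gamma = 0.1, 0.99      # learning rate and discount factor (assumed values)
s, a, r, s_next = 0, 2, 0.5, 1

# Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
print(Q[s, a])                # 0.05 after the first update (Q was all zeros)
```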

Many reinforcement learning algorithms use the update in equation (3) to find the optimal action values. However, there are two main problems: the lack of generalization and the sheer number of states. To tackle these problems, as explained later in the section "Multilayer neural network," we use a function approximator to find the optimal Q-value of each action in each state. We denote this Q-value function approximator Q(s_t, a_t; θ_i). Furthermore, we optimize this function approximator online using the rewards we obtain. We define a loss function L_i to optimize the function approximator

L_i(\theta_i) = \mathbb{E}_{(s,a)} \big[ (y_i - Q(s_t, a_t; \theta_i))^2 \big]    (4)

where the target y_i is

y_i = \mathbb{E}_{s_{t+1}} \big[ Q(s_t, a_t; \theta_{i-1}) + \alpha \big( R_t + \gamma \max_{a} Q(s_{t+1}, a; \theta_{i-1}) - Q(s_t, a_t; \theta_{i-1}) \big) \big]    (5)

Differentiating the loss function with respect to the weights yields the following gradient, which we use in gradient descent to reduce the loss:

\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{(s,a)} \Big[ \big( Q(s_t, a_t; \theta_{i-1}) + \alpha \big( R_t + \gamma \max_{a} Q(s_{t+1}, a; \theta_{i-1}) - Q(s_t, a_t; \theta_{i-1}) \big) - Q(s_t, a_t; \theta_i) \big) \nabla_{\theta_i} Q(s_t, a_t; \theta_i) \Big]    (6)

As can be seen, taking actions is directly related to state perception. It is therefore crucial to provide our agent with an optimal perception of its state so that it can perform optimal obstacle avoidance. To achieve this, we use more than one type of sensor in this article. In the following sections, we first explain our sensor fusion method, then the memory-based technique, and finally the perception of the state and the reward calculation.

Sensor fusion. A typical depth sensor can measure depths from 50 cm to 5 m, so an extra type of sensor is necessary to cover its blind spot and measure short distances to the robot correctly. To solve this issue, we obtain the depth sensor input first and then correct it with the range sensor values. Our design contains three range sensors that measure the distance between the robot and an object in front of it at three different angles (directions): −30°, 0°, and +30°. If any one of the range sensors detects an object at a distance of less than 50 cm, we replace the values in the corresponding region of the depth image with the value of that range sensor, taking into account the region of the depth image that the particular range sensor covers. This mechanism replaces the unreliable depth sensor values at distances of less than 50 cm with the reliable values of the range sensors (Figure 3).


Figure 3. The left and right scenarios are different only if the robot considers both states 1 and 2 in each scenario. If the robot considers only state 2 of the left and right scenarios, the two cases look like the same state.

Figure 4. To enhance our state, we combine several states into a super state, which is in effect a short-term memory containing the most recent n states the robot has visited.

Memory-based method. Having a memory is one of the foremost issues for an intelligent agent or animal with the ability to learn. One of the main reasons for having a memory is to remember what happened in the past in order to choose the best action in the future. Our robot also needs a memory mechanism in order to be able to distinguish between similar states. In this work, we used a linear, constant-size short-term memory of size 3, 5, 8, or 16, of which the memory of size 3 is shown in Figure 4. Further, we used a long-term memory of size 10,000 based on the work by Lin.16 Figure 3 shows our robot in two different scenarios. In both cases, the robot chooses to turn left, which ends up putting the robot in a similar situation considering only state 2.


Figure 5. Reward calculation using range sensors.

If the robot has no short-term memory, it considers only the current state when deciding its next action, and that state is the same in the left and right cases. However, if the robot uses a memory of size 1, it can remember its previous state and therefore recognize the difference between the two scenarios: in the left case, the robot has a wall on its right-hand side, whereas in the right scenario it does not, so it can turn right. In this article, we use this idea to enhance our simple state into a super state containing several simpler states (Figure 5). The effect of the short-term memory is shown in Figure 6, where a robot with short-term memory increases its average reward over time, whereas a robot with no short-term memory does not. We discuss the reason for this in detail in the "Results" section.

State of agent. Following what has already been described regarding the sensor fusion and memory-based method, our algorithm perceives the state of the robot in a robust way: it can measure depths from 0.0 cm to 5 m and distinguishes between similar states based on its previous states. In addition, with the aim of reducing processing time, we resize the depth image to 80 × 60 pixels and transform it into a vector of 80 dimensions before applying our sensor fusion technique and using it in a memory-based fashion to generate the state. For the translation from 80 × 60 pixels to a vector of 80 dimensions, we take each column of the resized depth image, consider the values between rows 40 and 60, and select the minimum value, which represents the distance of the nearest object at that point. Furthermore, from left to right, we divide the 80-dimensional vector into three parts: elements 0 to 20 form section 1, elements 20 to 60 form section 2, and the rest form section 3, where each section is considered to be covered by one range sensor (Figure 7). If the output measurement of a range sensor is less than 50 cm, we conclude that our depth sensor cannot measure that part of the depth image correctly, so we replace the values of that section with the values of the corresponding range sensor.


Figure 6. Left: all samples for the 0 (no short-term memory), 3, 5, 8, and 16-state short-term memory algorithms. Right: the result of testing with no memory compared with several runs using a five-state short-term memory.

As a result, our final state is defined by a variable n, which denotes the size (number of frames) of our short-term memory as illustrated in Figure 4 (where n is 3), and by the size of each perception. Each perception of the robot is the output of the "Sensor fusion" step, which, as explained in that section, is a vector of 80 elements. The final state of our simulated robot, and the input to our MMQN, is therefore a vector of n × 80 elements. In Algorithm 1, n is defined to be 8; either 5 or 8 can be selected because they achieve good accuracy while keeping the algorithm running in real time, considering our results in Figure 6 (left), where n is tested at 0 (no memory), 3, 5, 8, and 16.

Figure 7. Sensor fusion and state generating.

Reward calculation. A reinforcement learning agent can estimate how good or bad its action was only from the reward it receives. Thus, it is crucial to define correct, and only the necessary, rewards. As can be seen in Figure 5, we use our range sensors to measure the distance to nearby obstacles and generate the appropriate reward. When the agent is very near an obstacle, at a distance of less than 50 cm, it receives a negative reward. If the agent rotates left or right, it receives a small negative reward, and if it moves forward when there is no obstacle, it receives a small positive reward, which can be considered an intrinsic reward that gives the agent enough motivation to move forward whenever possible. The reward calculation is shown in detail in Algorithm 1. The average reward obtained with different short-term memory sizes is discussed in the "Results" section.
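To make the state generation and reward scheme described above more concrete, the sketch below assembles the fused 80-element depth vector, stacks it into a super state, and computes a reward. The section boundaries (0–20, 20–60, 60–80), the 50 cm threshold, the row 40–60 column minimum, and the 5-frame memory follow the text; the function names, the padding of an initially empty memory, and the exact reward magnitudes are our assumptions.

```python
# Sketch of state generation (sensor fusion + short-term memory) and reward
# calculation as described above. Reward values and helper names are assumed.
from collections import deque

import numpy as np

SECTIONS = [(0, 20), (20, 60), (60, 80)]   # regions covered by the -30, 0, +30 degree range sensors
NEAR_THRESHOLD = 0.5                        # 50 cm, in meters
N_FRAMES = 5                                # short-term memory size used in the experiments

def depth_to_vector(depth_image):
    """Reduce an 80x60 depth image to an 80-element vector: for each column,
    take the minimum over rows 40-60 (the nearest object in that column)."""
    return depth_image[40:60, :].min(axis=0)

def fuse(depth_vector, range_readings):
    """Replace unreliable depth values with range-sensor readings when an
    obstacle is closer than 50 cm in the section that sensor covers."""
    fused = depth_vector.copy()
    for (lo, hi), r in zip(SECTIONS, range_readings):
        if r < NEAR_THRESHOLD:
            fused[lo:hi] = r
    return fused

memory = deque(maxlen=N_FRAMES)

def super_state(depth_image, range_readings):
    """Stack the last N_FRAMES fused perceptions into one n x 80 state vector."""
    memory.append(fuse(depth_to_vector(depth_image), range_readings))
    while len(memory) < N_FRAMES:            # pad until the memory is full (assumption)
        memory.append(memory[-1])
    return np.concatenate(list(memory))

def reward(range_readings, action):
    """Reward scheme described in the text; the numeric values are assumed."""
    if min(range_readings) < NEAR_THRESHOLD:
        return -1.0                           # too close to an obstacle
    if action in ("left", "right"):
        return -0.05                          # small penalty for rotating
    return 0.05                               # small bonus for moving forward freely

# Example call with synthetic data (depths in meters)
state = super_state(np.random.rand(60, 80) * 5.0, [1.2, 0.4, 2.0])
print(state.shape)                            # (400,) -> 5 frames x 80 elements
```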

Multilayer neural network

As mentioned earlier, when the number of states increases, the old method of keeping action values in a table simply does not work. Imagine that our state is a vector of 80 dimensions, where each dimension can take a scalar value from 0 to 255. The total number of states would then be about 255^80, which is far too large to be tractable in practice. We therefore use an ANN as a function approximator in place of the Q-table of Q-learning, so our algorithm can also generalize. Our ANN is a multilayer neural network with 400 inputs, corresponding to our super state vector (with a memory of size 5), and three outputs corresponding to the values of the three possible actions in each state: left, right, and forward. It has two hidden layers, with 512 neurons in the first and 256 in the second. The activation functions are ReLU in the first and second hidden


Figure 8. Our simulated robot in the Gazebo simulation environment.

layers and linear in the output layer. The network is fully connected.
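As a sketch of the architecture just described (400 inputs, hidden layers of 512 and 256 ReLU units, three linear outputs), the snippet below builds such a network in Keras. The paper does not state which framework was used, and the optimizer and loss are our assumptions.

```python
# Sketch of the described fully connected Q-network: 400 inputs (5 x 80 super
# state), hidden layers of 512 and 256 ReLU units, 3 linear outputs (left,
# right, forward). Keras, optimizer, and loss are our assumptions.
from tensorflow import keras
from tensorflow.keras import layers

def build_q_network(state_dim=400, n_actions=3):
    model = keras.Sequential([
        layers.Input(shape=(state_dim,)),
        layers.Dense(512, activation="relu"),
        layers.Dense(256, activation="relu"),
        layers.Dense(n_actions, activation="linear"),  # one Q-value per action
    ])
    model.compile(optimizer=keras.optimizers.Adam(1e-4), loss="mse")
    return model

q_net = build_q_network()
q_net.summary()
```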

Simulation environment

To test our algorithm, we used a simulated version of a real robot17 in the Gazebo simulation environment,18 which is able to closely simulate the real environment. Furthermore, we added three range sensors to our simulated robot, as can be seen in Figure 8. We also used ROS19 for the connectivity of the different modules of our robot, such as the range sensors, the depth sensor, the Kobuki base (wheels), and the main algorithm. Using ROS gives us the reusability benefit that we can use our code, with minor changes, on a real robot.
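A minimal rospy sketch of this node wiring is shown below. The topic names are assumptions (they depend on the robot description used), but the message types are the standard ROS ones for a depth camera, range sensors, and the Kobuki velocity command.

```python
# Sketch of the ROS wiring described above; topic names are assumed.
import rospy
from sensor_msgs.msg import Image, Range
from geometry_msgs.msg import Twist

latest = {"depth": None, "ranges": [float("inf")] * 3}

def on_depth(msg):
    latest["depth"] = msg

def make_on_range(i):
    def cb(msg):
        latest["ranges"][i] = msg.range
    return cb

rospy.init_node("mmqn_agent")
rospy.Subscriber("/camera/depth/image_raw", Image, on_depth)
for i, topic in enumerate(["/range_left", "/range_center", "/range_right"]):
    rospy.Subscriber(topic, Range, make_on_range(i))
cmd_pub = rospy.Publisher("/mobile_base/commands/velocity", Twist, queue_size=1)

rate = rospy.Rate(10)
while not rospy.is_shutdown():
    cmd = Twist()
    cmd.linear.x = 0.2        # placeholder "move forward"; the real command would
                              # come from the MMQN policy
    cmd_pub.publish(cmd)
    rate.sleep()
```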

Results

To fully test the capability of our algorithm, we performed several experiments with our simulated robot (Figure 8) in Gazebo, using different short-term memory sizes: 0, 3, 5, 8, and 16. A short-term memory of size 0 is, in fact, a scenario without short-term memory, in which we do not use the super state shown in Figure 7 but only our normal state (the output of the "Sensor fusion" step). Figure 6 (left) shows the results of testing with different short-term memory sizes and their effect. As can be seen, when the short-term memory size is 0, the agent cannot increase its average reward over time. Even though its average reward temporarily increases at the beginning of the curve, it then decreases gradually and never rises again. The main reason for this initial increase is that the actions are selected randomly: we use an annealed epsilon-greedy policy that starts with epsilon = 1.0, decreases epsilon gradually to 0.1 over the first 5000 steps, and then continues with a constant epsilon of 0.1. In contrast, in the cases with a short-term memory (3, 5, 8, and 16), the increase in the average reward is robust over time, which shows that the agent has learned to choose the best actions in order to maximize its average reward.

Figure 6 (right) compares our algorithm with a short-term memory of size 5 and the case without a short-term memory (size 0). As can be seen, increasing the short-term memory size beyond 5 contributes very little to the growth or robustness of the average reward over time. We therefore selected a short-term memory of size 5 for the comparison with size 0 in the figure, considering that a larger short-term memory requires more processing capability. The main cause of the difference in the average reward obtained with different short-term memory sizes is that, without short-term memory, the robot gets stuck in the corners of the environment and accumulates large negative rewards. A situation similar to the one explained in Figure 3 happens to the robot: without short-term memory, it cannot recognize that a single turn to the right or left will not get it out of the corner. It has learned that it is better not to face the wall completely, since that yields the highest negative reward, but each move still gives the robot a negative reward, and this continues for a long time; only the exploration part of the epsilon-greedy policy may eventually get the robot out of the corner. As explained previously, we also define a long-term memory of size 10,000 for the robot, which stores its latest 10,000 moves, whether taken by exploration or exploitation, in order to increase the training accuracy of MMQN. As a result, when the robot uses a short-term memory, it learns through exploration that, when stuck in a corner, the better move is to turn successively to the left or right in order to get out of the corner and increase its reward again. The robot can learn this because it has a memory and can recognize that it is in a corner. Further, having gotten out of the corner, its action values for the corner state are updated, which in turn helps the robot escape corners and maximize its average reward over time. Figure 9 shows different aspects of the robot's interaction with its environment. Although the figure shows that the robot's movement becomes smoother as the number of steps it spends in the environment increases, the real evaluation of the robot's movement can only be made by looking at the average reward it achieves over time. In fact, when the robot increases its average reward over time, it means that it has learned to choose the best actions, or has converged to an optimal policy shaped by our reward function, which is the goal of a reinforcement learning-based agent, as can be seen in Figure 6. To clarify the robot's movement shown in Figure 9, we describe the different stages it goes through. (1) It tries to learn from the environment using the epsilon-greedy policy (epsilon starts at 0.99), so most of its moves are random (Figure 9, left image). (2) The robot has learned a good policy and can use it to explore the environment (Figure 9, center image).


Figure 9. The first image from the left shows the robot path while it is learning, especially in the first 5000 moves in which the actions executed are mostly random based on the epsilon greedy policy. The second image from the left shows the robot path when it is trained.

(3) If a change happens in the environment, the robot tries to adapt itself to this change with the goal of choosing the best action in the new situation. Figure 9 (right image) shows the robot's difficulty in choosing the correct action while it is still learning, or when learning a proper policy has failed. At each step of its movement in the environment, the robot receives a representation of the state and tries to take the best action based on the model it has learned so far. The result is a reward that updates the robot's model. Gradually, the robot's policy converges toward the optimal one, and the robot becomes able to autonomously learn the best action to take in each state according to its experiences. As our results show, and as explained in the section "Memory-based method," memory has an important effect on the learning process of the robot; however, it increases the dimensionality of the state and, as a result, the processing time of the algorithm and its learning time. To test the effect of the short-term memory size, we performed separate experiments in the same environment using different sizes of short-term memory. As can be seen in Figure 6 (left), the maximum average reward is obtained with a short-term memory of size 16. However, since the difference from the curves for sizes 8 and 5 is small, we selected a short-term memory of size 5. Figure 10 shows the exploration and obstacle avoidance of our robot in a new environment. The robot was originally trained in the environment shown in Figure 9 and was then moved to a new, unknown environment, where it successfully explored and avoided the obstacles. As mentioned previously, we defined an epsilon-greedy policy for our robot to balance exploration and exploitation. When the robot starts its training, epsilon is 1.0, which helps the robot to do as much exploration as it can; however, as the number of steps increases over time, epsilon decreases to 0.1, and this epsilon is also used when our trained robot moves to a new environment.

Figure 10. The robot moving in a new, unknown environment, where it is able to move correctly.

As a result, in the new environment, the robot uses the policy it learned previously most of the time, while also adjusting that policy according to the new experiences it gathers there. It is important to note that while the robot is learning, it learns from its own experiences. As a result, an environment that can provide more experiences for the robot will let it learn more, and this makes the robot more intelligent.
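The annealed epsilon schedule referred to above (1.0 decreasing to 0.1 over the first 5000 steps, then constant) can be written compactly as follows; the values come from the text, while the linear form of the decay is an assumption.

```python
# Annealed epsilon schedule: 1.0 -> 0.1 over the first 5000 steps, then constant.
def epsilon(step, eps_start=1.0, eps_end=0.1, anneal_steps=5000):
    if step >= anneal_steps:
        return eps_end
    return eps_start - (eps_start - eps_end) * step / anneal_steps

print(epsilon(0), epsilon(2500), epsilon(10000))   # 1.0 0.55 0.1
```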


Implementation in a real robot

There are two aspects to implementing our algorithm on a real robot: software and hardware. Noise is perhaps the biggest difference between a simulated and a real-world scenario. In a real scenario, a depth sensor can be noisy and limited in the range it can measure; in a simulated scenario, by default, a depth sensor is neither noisy nor range limited. Because we kept in mind that a real-world depth sensor would have these kinds of problems, we added range sensors to help our algorithm overcome the limitations of a real-world depth sensor (as explained in the "Sensor fusion" section). Further, in a real scenario, a filter such as a Kalman filter can be used to reduce the noise. The second issue in a real-world scenario is the training time required by the robot. The robot's battery is a limitation compared with a simulated environment. In a real-robot scenario, the robot should be able to pause its training process programmatically and resume it after recharging its battery. For example, the implemented algorithm can be extended to detect a low battery and run a predefined procedure to connect the robot to a power supply. In the case of the Kobuki, an automatic docking procedure can be used to connect the robot to the charging station, provided there is no obstacle between the robot and the infrared sensor of the station; alternatively, the Kobuki can be programmed to play a sound so that a human assistant can connect it to the power supply. Later, when the algorithm detects that the battery level is sufficient, it can resume the training process. Another way to tackle the low-battery problem is to keep the robot connected to a power supply for the whole training process. In addition, to speed up the training procedure, the trained files generated by MMQN (the trained ANN weights) can be used as a starting point for the real-world robot. The software part, in contrast, is much easier and needs almost no changes, mainly because we deliberately used ROS to connect our nodes, such as the Kobuki base, depth sensor, and range sensors, to the main algorithm with future implementation on a real robot in mind. As a result, by defining the same nodes and topics used in simulation, our algorithm can easily run on a different hardware platform, such as an Nvidia Tegra K1 or Tegra X1, a laptop, or any other platform that supports ROS and is powerful enough to handle the calculations in real time.
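Since the trained MMQN weights are proposed above as a starting point for the real robot, a small sketch of pausing and resuming training by persisting the network is given below. It assumes a Keras model like the one sketched earlier; the stand-in model and the file name are arbitrary.

```python
# Sketch of pausing/resuming training by persisting the trained network, as
# suggested above. The stand-in model and the file name are our assumptions.
from tensorflow import keras

q_net = keras.Sequential([keras.layers.Dense(3, input_shape=(400,))])  # stand-in model
q_net.save("mmqn_checkpoint.h5")                 # e.g. before the robot pauses to recharge

q_net_resumed = keras.models.load_model("mmqn_checkpoint.h5")  # resume training or deploy
```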

Conclusion

In this article, we designed and developed MMQN, an algorithm that uses ANNs and reinforcement learning in a memory-based fashion for autonomous exploration and obstacle avoidance in an unknown environment.

In order to have a robust obstacle detection method, we used a sensor fusion technique that combines depth and range sensors. Finally, we implemented our algorithm on a simulated robot in the Gazebo simulator and demonstrated that it successfully learns over time to shape an optimal policy (according to our results shown in Figure 6) and, as a result, performs autonomous exploration and obstacle avoidance in an unknown environment. In addition, our method learns online from its own experiences; it therefore learns autonomously and continuously and, as we demonstrated, can adapt itself to an environment different from its original one, using its epsilon-greedy policy to balance exploration and exploitation. Finally, we discussed the implementation of our algorithm on a real robot, where the fact that we used ROS eases its use in a real-world scenario.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the Unmanned Vehicles Advanced Core Technology Research and Development Program through the National Research Foundation of Korea (NRF), Unmanned Vehicle Advanced Research Center (UVARC) funded by the Ministry of Science, ICT and Future Planning, the Republic of Korea (NRF-2016M1B3A1A01937245), and was also supported by the Ministry of Trade, Industry & Energy (MOTIE) under the R&D program (Educating Future-Car R&D Expert), project number N0002428.

References
1. Sutton RS and Barto AG. Introduction to reinforcement learning. 1st ed. Cambridge, MA: MIT Press, 1998. ISBN: 0262193981.
2. Zhang Z, Baek SJ, Lee DJ, et al. Study of reinforcement learning based dynamic traffic control mechanism. Dordrecht: Springer Netherlands, 2013, pp. 1047–1056. ISBN: 978-94-007-6738-6.
3. Subbarao K, Nuthi P, and Atmeh G. Reinforcement learning based computational adaptive optimal control and system identification for linear systems. Annu Rev Control 2016; 42: 319–331. DOI: 10.1016/j.arcontrol.2016.09.021.
4. Riedmiller M, Gabel T, Hafner R, et al. Reinforcement learning for robot soccer. Auton Robots 2009; 27(1): 55–73. DOI: 10.1007/s10514-009-9120-4.
5. Mnih V, Kavukcuoglu K, Silver D, et al. Human-level control through deep reinforcement learning. Nature 2015; 518(7540): 529–533.
6. Du R, Zhang X, Chen C, et al. Path planning with obstacle avoidance in PEGs: ant colony optimization method. In: Proceedings of the IEEE/ACM Int'l Conference on Green Computing and Communications & Int'l Conference on Cyber, Physical and Social Computing (GreenCom-CPSCom '10), Washington, DC, USA, 18–20 December 2010, pp. 768–773. IEEE Computer Society. DOI: 10.1109/GreenCom-CPSCom.2010.124.
7. Pauplin O, Louchet J, Lutton E, et al. Applying evolutionary optimisation to robot obstacle avoidance. arXiv preprint cs/0510076, 2004.
8. Lin CJ, Li THS, Kuo PH, et al. Integrated particle swarm optimization algorithm based obstacle avoidance control design for home service robot. Comput Electr Eng 2016; 56(C): 748–762. DOI: 10.1016/j.compeleceng.2015.05.019.
9. Kim CJ and Chwa D. Obstacle avoidance method for wheeled mobile robots using interval type-2 fuzzy neural network. IEEE Trans Fuzzy Syst 2015; 23(3): 677–687. DOI: 10.1109/TFUZZ.2014.2321771.
10. Zhang H, Han X, Fu M, et al. Robot obstacle avoidance learning based on mixture models. J Robot 2016; 2016: 1. DOI: 10.1155/2016/7840580.
11. Tai L, Li S, and Liu M. A deep-network solution towards model-less obstacle avoidance. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Daejeon, Republic of Korea, 9–14 October 2016, pp. 2759–2764. DOI: 10.1109/IROS.2016.7759428.
12. Lee H, Shen Y, Yu CH, et al. Quadruped robot obstacle negotiation via reinforcement learning. In: Proceedings of the 2006 IEEE International Conference on Robotics and Automation (ICRA), Orlando, FL, USA, 15–19 May 2006, pp. 3003–3010. IEEE. DOI: 10.1109/ROBOT.2006.1642158.
13. Kominami K, Takubo T, Ohara K, et al. Optimization of obstacle avoidance using reinforcement learning. In: IEEE/SICE International Symposium on System Integration (SII), Fukuoka, Japan, 16–18 December 2012, pp. 67–72. IEEE. DOI: 10.1109/SII.2012.6426933.
14. Lei T and Ming L. A robot exploration strategy based on Q-learning network. In: IEEE International Conference on Real-time Computing and Robotics (RCAR), Angkor Wat, Cambodia, 6–10 June 2016, pp. 57–62. IEEE. DOI: 10.1109/RCAR.2016.7784001.
15. Watkins CJ and Dayan P. Technical note: Q-learning. Mach Learn 1992; 8(3): 279–292. DOI: 10.1023/A:1022676722315.
16. Lin LJ. Reinforcement learning for robots using neural networks. PhD Thesis, Carnegie Mellon University, Pittsburgh, PA, USA, 1992. UMI Order No: GAX93-22750.
17. Autonomous deep learning robot, 2017. https://www.autonomous.ai/deep-learning-robot
18. Gazebo simulator, 2017. http://gazebosim.org/
19. ROS.org — powering the world's robots, 2017. http://www.ros.org/
