Neural Information Processing – Letters and Reviews

Vol. 11, No. 7, July 2007

LETTER

Robot Task Learning based on Reinforcement Learning in Virtual Space

Tadashi Tsubone, Koichi Sugiyama, and Yasuhiro Wada
Department of Electrical Engineering, Nagaoka University of Technology
1603-1 Kamitomioka, Nagaoka, Niigata 940-2188, Japan
E-mail: [email protected]

(Submitted on September 15, 2006)

Abstract − As a novel learning method, reinforcement learning, by which a robot acquires control rules through trial and error, has received a great deal of attention. However, it is quite difficult for a robot to acquire control rules by reinforcement learning in real space: many learning trials are needed, the robot itself may lose control, and there may be safety problems with the controlled objects. In this paper, we propose a method in which a robot in real space first learns a virtual task; the task is then transferred from virtual to real space, so that the robot eventually acquires the task in the real environment. Using an inverted pendulum as an example, we show that a real robot can acquire a task in virtual space through an input device. Next, we verify that the task acquired in virtual space can be applied to the corresponding real world task. We emphasize the utilization of virtual space for obtaining real world tasks effectively.

Keywords − Task learning, reinforcement learning, virtual space, robot control

1. Introduction

A robot can perform complicated operations using control rules designed and planned by engineers. However, for an intelligent robot to coexist with humans in daily life and perform assigned tasks with sufficient accuracy, it needs to adapt its manipulation to dynamic alterations of the environment. Recently, reinforcement learning [1] has received a great deal of attention as a learning method by which a robot can autonomously obtain information from the environment and its own actions. Reinforcement learning has been applied to various robot control studies [2][3]. In reinforcement learning, the robot tries to acquire optimal control rules through trial and error. However, in a real world environment, robots have difficulty learning a task by trial-and-error methods such as reinforcement learning. For example: 1) Restoring the environment to the initial state for each learning trial is difficult, because the number of trials needed to acquire adequate actions may exceed several thousand. 2) During learning, there are safety problems for the robot itself, as well as concerns about damage to operational objects and harm to humans, since the robot's movement is not yet stable. To solve these problems we propose a novel method for acquiring a task by trial-and-error learning, in which a robot in the real world learns a task in virtual space through an input device that connects the real world and virtual space. Afterward, the robot can achieve the real task almost without learning in the real world. Since various tasks can be realized in virtual space, the robot can obtain them by changing the virtual environment with relatively little effort, so the first problem above is easily solved by this approach. Moreover, the second problem can be tolerated because the robot handles only virtual control objects. In this paper, after reviewing related previous work, we explain the proposed approach. First, we show that a real robot can learn to control an inverted pendulum in virtual space. Next, we show that a robot that has acquired the virtual task can successfully control the inverted pendulum in the real world with far fewer learning trials. Virtual space learning thus allows the real task to be acquired effectively by reinforcement learning.


Figure 1. Task learning system in virtual space

2. Utilization of Virtual Space to Acquire Real World Tasks

The following are typical uses of virtual space in a robot system:
1) engineering of the robot mechanism (e.g., consideration of layout or component parts);
2) evaluation of robot control software;
3) teaching tools for robot control.
Several simulation software packages have been proposed for humanoid or mobile robots to serve the second purpose. Simulation software can support the development and verification of the robot itself and its control rules by using virtual space. OpenHRP (Open Architecture Humanoid Robotics Platform) [4][5] and Dynamic Simulation [6] are typical examples. OpenHRP is a distributed architecture that can simulate in real time by operating a scheduling client; it is composed of the scheduling client and various other clients that provide functions such as dynamics calculation and control. Such systems are effective because bugs in the control program can be found in the simulator without involving the real robot, which minimizes danger to people or to the robot's surroundings if it loses control. For the third purpose, research has been done on teaching assembly operations [7][8]. Computer programs for such operations are produced automatically after teaching data are extracted from actions performed by human operators in virtual space; the operators can then edit and modify the extracted teaching data on a computer display to adapt them to the real world. A virtual environment is thus effective for developing a robot control system. In most previous research on the utilization of virtual space, the robot system itself existed in virtual space, where it manipulated the task. Some studies have also investigated a real robot (a robot in the real world is called a real robot) that interacts with a virtual environment to achieve a virtual task [9][10]. In contrast, we propose a method in which a real robot utilizes a virtual environment to obtain a real task by reinforcement learning.

3. Method for Using Virtual Environments to Obtain Real Tasks

The proposed method has two stages. In the first stage, the real robot learns the virtual task through an input device connected to virtual space (Figure 1), which provides an opportunity to learn the task as if in the real world. In the second stage, the real robot learns the real task based on the control rules acquired in the first stage (Figure 2). As mentioned above, there are several problems with trial-and-error learning such as reinforcement learning in real space. The utilization of virtual space allows the real robot to obtain the real task while avoiding these problems. Moreover, we expect the following merits from this approach. First, the robot can learn from a huge variety of environmental alterations, because the parameters of the virtual environment can be changed arbitrarily and easily. The robot must experience various environments in order to adapt to them autonomously. In virtual space, the robot can learn iteratively and easily in situations whose realization is difficult in the real world.


Figure 2. Task learning system in real space

A simulation learning system, in which both the robot and the task exist in virtual space, is another possible approach. However, if the robot mechanism is changed, the robot model must be rebuilt. At present, electric servomotors are the most common robot actuators, but the development of soft actuators has recently progressed [11], because soft robot systems, which pose no threat to humans, are required in welfare and caregiving fields. Robot systems with soft actuators are hard to simulate on a computer: modeling a robot with a soft actuator is very difficult because of the hysteresis of the pneumatic actuator [12] and its nonlinear spring property. In the proposed method, the robot itself does not need to be modeled in virtual space, so the differences between the real and virtual robots can be ignored, and a real robot that acquires the task rules can be realized quickly. The proposed approach has the following advantages:
• flexible environment setting
• safe task learning
• smooth transfer (easy realization) to the real world

1) Task learning in virtual space

Figure 1 shows the system in which a real robot learns a virtual task. The real robot is a general-purpose arm robot, PA10-7C (Mitsubishi Heavy Industries, LTD.). We used PHANToM (SensAble Technologies, LTD.) as the device that connects real and virtual space. PHANToM is a haptic interface device that can input a position to virtual space and feed force back to real space. Although PHANToM is used only as an input device in this paper, in the future the system has the potential to be applied to tasks that require interaction through the haptic sense. The robot manipulates the input device as shown in Figure 1. A computer for the virtual space receives the arm robot's hand position as robot information through the input device. The environment information in virtual space is updated from the hand position using mathematical models of the task and the environment, and the virtual space is visualized on a monitor using OpenGL [13]. The updated environment information is then transferred to a computer for learning and control, where the virtual task is learned by reinforcement learning in the task learning part.

2) Task learning system in real space

Figure 2 shows the system in which the real robot learns the real task. The environment information in real space is measured by a computer for state measurement with a three-dimensional position measurement system, OPTOTRAK (Northern Digital Inc.), which gauges three-dimensional positions by tracking markers attached to the hand of the robot and to the control object. The measured data are transferred to the computer for learning and control through a TCP/IP network, and the real task is learned by reinforcement learning in the task learning part of that computer. By this stage, however, the robot has already learned the virtual task and can almost complete the real task using the control rules acquired in virtual space.


Figure 3. Pole balance task

Figure 4. Actor-critic architecture

Learning in real space therefore only fine-tunes the acquired control rules according to the difference between the real task and the mathematical model in virtual space. Finally, the robot is fully controlled by the tuned control rules.
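To make the two-stage procedure concrete, the following Python sketch outlines the trial-and-error loop described above. It is a minimal sketch under our own assumptions: the environment and agent classes named in the usage comments are hypothetical placeholders, and the trial budgets are illustrative, not taken from the experiments.

```python
def train(env, agent, max_trials, success_goal=5):
    """Generic trial-and-error loop: stop after `success_goal` consecutive successful trials."""
    consecutive = 0
    for trial in range(max_trials):
        state = env.reset()                              # restore the initial state for each trial
        done = False
        while not done:
            action = agent.act(state)                    # actor output (arm displacement)
            next_state, reward, done = env.step(action)  # advance the (virtual or real) pendulum
            agent.learn(state, action, reward, next_state)  # actor-critic update
            state = next_state
        consecutive = consecutive + 1 if env.succeeded() else 0
        if consecutive >= success_goal:
            return trial + 1                             # trials needed to meet the termination condition
    return max_trials

# Hypothetical usage (environment/agent classes are placeholders, not the paper's code):
#   agent = ActorCritic()
#   train(VirtualPendulum(), agent, max_trials=3000)  # stage 1: learning in virtual space
#   train(RealPendulum(), agent, max_trials=50)       # stage 2: fine-tuning in real space
```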

4. Experiment Results for an Inverted Pendulum

4.1 An inverted pendulum

We show control results for an inverted pendulum obtained with the proposed approach. The goal of the inverted pendulum task is to maintain the inverted state of the pendulum by moving a bogie along the X-axis (Figure 3). The mathematical model in virtual space is given as follows [14]:

\dot{\theta}_{i+1} = (1 - \alpha_1)\dot{\theta}_i + \frac{g \Delta t}{l}\left( \sin\theta_i + \frac{\ddot{x}_i \cos\theta_i}{g} \right),   (1)

where \theta_i, \dot{\theta}_i, and \ddot{x}_i denote the angle of the pendulum at time step i, its angular velocity, and the acceleration of the bogie, respectively. We use the following parameter values: viscosity \alpha_1 = 0.01, time step \Delta t = 0.02 s, gravitational acceleration g = 9.81 m/s^2, and pendulum length l = 0.9 m.
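As a concrete reading of equation (1), the following Python sketch implements one update step of the virtual pendulum model with the parameter values listed above. The function name and interface are ours, and the angle-integration line is an assumption, since the paper only states the angular-velocity update.

```python
import math

ALPHA1, DT, G, L = 0.01, 0.02, 9.81, 0.9  # viscosity, time step [s], gravity [m/s^2], pendulum length [m]

def pendulum_step(theta, theta_dot, x_ddot):
    """One update of the virtual pendulum model of equation (1).

    theta     : pendulum angle [rad]
    theta_dot : angular velocity [rad/s]
    x_ddot    : bogie acceleration [m/s^2]
    """
    theta_dot_next = (1 - ALPHA1) * theta_dot + (G * DT / L) * (
        math.sin(theta) + x_ddot * math.cos(theta) / G)
    theta_next = theta + DT * theta_dot_next  # simple Euler integration of the angle (assumed)
    return theta_next, theta_dot_next
```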

4.2 Learning control of the inverted pendulum

In this paper the inverted pendulum is controlled by reinforcement learning with the actor-critic algorithm (shown in Figure 4) [15]. The critic learns the state value so that the prediction error \hat{r} in equation (2) becomes zero, and the actor learns to select actions so that the reward increases:

\hat{r} = r(s(t)) + \gamma V(s(t)) - V(s(t-1)),   (2)

where r(s(t)) is the reward at time t, \gamma is a discount factor, and V(s(t)) is the value estimated by the critic. The state variable s has four dimensions: \theta, \dot{\theta}, x, \dot{x}. The reward is defined as

r(s(t)) = \begin{cases} 1 & |x| \le 0.15 \text{ and } |\theta| \le 12\,\mathrm{deg} \\ 0 & \text{otherwise} \end{cases}.   (3)
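The following minimal sketch shows how the TD error of equation (2) and the reward of equation (3) could be computed. Variable names are ours, and the discount factor value is an assumption, since the paper does not report it.

```python
import math

GAMMA = 0.95  # discount factor (assumed; the paper does not give its value)

def reward(x, theta):
    """Equation (3): reward 1 while the bogie and pendulum stay inside the target region."""
    return 1.0 if abs(x) <= 0.15 and abs(theta) <= math.radians(12) else 0.0

def td_error(r_t, v_t, v_prev):
    """Equation (2): prediction error r_hat used to train both the critic and the actor."""
    return r_t + GAMMA * v_t - v_prev
```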

1) Critic

The critic and actor are represented by Adaptive Gaussian Softmax Basis Functions (AGSBF) [2]. Figure 5 shows the critic network using AGSBF. The k-th activation function is

a_k(s) = \exp\left( -\tfrac{1}{2} \left\| M_k (s - c_k) \right\|^2 \right),

where c_k and M_k are the center and the variance matrix of the activation function, respectively. A basis function is given as

b_k(s) = a_k(s) \Big/ \sum_{l=1}^{K} a_l(s),   (4)

where K is the number of basis functions.


Figure 5. Actor-critic network representation

The critic's output is computed using

V(s) = \sum_k v_k b_k^C(s),   (5)

The basis functions are allocated successively during learning: a new basis function is added when the prediction error exceeds a threshold e_{max} and the activation values of all units are below a threshold a_{min}. The new unit is initialized as c_k = s(t_n), M_k = \mathrm{diag}(\mu_i), and v_k = 0. The network's weight parameters are updated by the following equations:

\Delta v_k = \beta \hat{r}(t_n) e_k^C(t_n),   (6)

e_k^C(t_n) = \begin{cases} 1 & b_k^C(t_n) \ge 0.8 \\ \lambda e_k^C(t_{n-1}) + b_k^C(t_n) & \text{otherwise} \end{cases},   (7)

where \beta and e_k^C denote a learning coefficient and an eligibility trace [16], respectively. The following parameter values were used in the experiment: \beta = 0.3, \lambda = 0.8, a_{min} = 0.5, e_{max} = 0.0001, and \mu_i = (1/0.02, 1/0.1, 1/0.02, 1/0.05).
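To illustrate equations (4)-(7), here is a small sketch of how the AGSBF critic could be evaluated and updated. The data layout (one row of `centers` per basis unit, a diagonal variance vector `M_diag`) and the function names are our own assumptions, not the paper's implementation.

```python
import numpy as np

BETA, LAMBDA = 0.3, 0.8  # learning coefficient and trace decay used in the experiments

def basis(s, centers, M_diag):
    """Equation (4): normalized Gaussian (softmax) basis functions b_k(s)."""
    diff = (s - centers) * M_diag                 # one row per basis unit
    a = np.exp(-0.5 * np.sum(diff ** 2, axis=1))  # activation a_k(s)
    return a / np.sum(a)

def critic_value(s, centers, M_diag, v):
    """Equation (5): value estimate V(s) = sum_k v_k b_k(s)."""
    return float(v @ basis(s, centers, M_diag))

def critic_update(td_err, b, v, trace):
    """Equations (6)-(7): weight update with a replacing-style eligibility trace."""
    trace = np.where(b >= 0.8, 1.0, LAMBDA * trace + b)
    v = v + BETA * td_err * trace
    return v, trace
```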

2) Actor

The output of the actor, u, which is the distance of the robot arm movement, is computed by

u(s) = u_{max} \, g\!\left( \sum_k w_k b_k^A(s) + \sigma \varepsilon(t) \right) + u_b,   (8)

where b_k^A is a basis function, g(\cdot) is a sigmoidal function that saturates the output at the maximum value u_{max}, u_b is a bias output, and \varepsilon(t) is exploration noise. The weight values are updated by the following equations:

\Delta w_k = \alpha \hat{r}(t_n) e_k^A(t_n),   (9)

e_k^A(t_n) = \begin{cases} 1 & b_k^A(t_n) \ge 0.8 \\ \lambda e_k^A(t_{n-1}) + b_k^A(t_n) & \text{otherwise} \end{cases},   (10)

where \alpha and e_k^A denote a learning coefficient and an eligibility trace, respectively. Gaussian noise is used as the exploration noise in equation (8), and its magnitude \sigma(t) is determined from the estimated value V(t) so that exploration becomes smaller in regions where the estimated value is high:

\sigma(t) = \min[1, \max[0, -V(t)]].   (11)

The following parameter values were used in the experiment: \alpha = 0.3, \lambda = 0.8, a_{min} = 0.5, e_{max} = 0.0001, and \mu_i = (1/0.02, 1/0.1, 1/0.02, 1/0.05).
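A matching sketch for the actor of equations (8)-(11) is given below; it reuses the `basis()` helper from the critic sketch above. The names are hypothetical, `np.tanh` stands in for the unspecified sigmoidal function g, and the value of u_max is an illustrative assumption.

```python
import numpy as np

ALPHA, U_MAX = 0.3, 0.05  # learning coefficient; U_MAX (arm displacement bound) is an assumed value

def actor_output(s, centers, M_diag, w, V_s, u_bias=0.0, rng=None):
    """Equations (8) and (11): saturated action with value-dependent exploration noise."""
    rng = rng or np.random.default_rng()
    b = basis(s, centers, M_diag)              # same AGSBF basis as the critic
    sigma = min(1.0, max(0.0, -V_s))           # equation (11): explore less where V is high
    pre = w @ b + sigma * rng.standard_normal()
    return U_MAX * np.tanh(pre) + u_bias       # tanh used here as the sigmoidal squashing g(.)

def actor_update(td_err, b, w, trace, lam=0.8):
    """Equations (9)-(10): actor weight update with a replacing-style eligibility trace."""
    trace = np.where(b >= 0.8, 1.0, lam * trace + b)
    w = w + ALPHA * td_err * trace
    return w, trace
```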


Figure 6. Learning results in virtual space

Figure 7. Control results of hand positions and pendulum angles in virtual space

3) Trajectory planning

The robot hand trajectory is planned based on the minimum jerk criterion [17] as follows:

x(t) = x_0 + (x_0 - x_f)\left( 15\tau^4 - 6\tau^5 - 10\tau^3 \right),

where \tau = t / t_f, and x_0, x_f, and t_f denote the starting point, the final point, and the movement duration, respectively. The final point x_f is obtained as the sum of the movement distance u and the starting point x_0. In this paper, the duration of each point-to-point movement is 100 msec, and the robot arm is controlled every 20 msec according to the planned point-to-point movement.
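For reference, the minimum-jerk point-to-point movement described above can be generated as in the sketch below, sampled at the 20 msec control period. The function name and the example movement distance are ours.

```python
def min_jerk_position(t, x0, u, t_f=0.1):
    """Minimum-jerk position at time t for a movement of distance u starting at x0.

    Equivalent to x(t) = x0 + (x0 - x_f)(15 tau^4 - 6 tau^5 - 10 tau^3) with x_f = x0 + u.
    """
    tau = min(max(t / t_f, 0.0), 1.0)  # clamp to the movement duration
    x_f = x0 + u
    return x0 + (x_f - x0) * (10 * tau**3 - 15 * tau**4 + 6 * tau**5)

# Control points every 20 msec over a 100 msec movement, as used in the experiments.
waypoints = [min_jerk_position(k * 0.02, x0=0.0, u=0.05) for k in range(6)]
```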

4.3 Task learning in virtual space

The task is first learned in virtual space.

1) Experimental conditions

The initial angle \theta of the pendulum is drawn from a Gaussian distribution with mean 0° and standard deviation 1°, and the initial angular velocity \dot{\theta} is set to 0°/sec. A successful trial is defined as maintaining the inverted pendulum for 10 minutes. When the number of consecutive successful trials exceeds five, the experiment is terminated.

2) Experimental results

We performed the experiment three times. In all three experiments the pendulum was eventually kept inverted for more than 10 minutes. The average number of learning trials needed to reach the termination condition was 1552. We show an example of the results. Figure 6 shows the relation between the holding time of the inverted pendulum and the number of learning trials; the x- and y-axes denote the number of learning trials and the holding time, respectively. After about 1400 trials, the robot arm could maintain the inverted pendulum for 10 minutes. The control results for the first minute of a successful trial are shown in Figure 7. The upper panel shows the transition of the robot arm position, and the lower panel shows the transition of the pendulum angle. The tilted pendulum was brought to the vertical during the first 15 seconds and was then controlled periodically to keep it near the origin with a small amplitude of less than 1 cm. These results show that a real robot can learn a virtual task through the input device.


Figure 8. Learning results by simulation and in real space


Figure 9. Control results of hand positions and pendulum angles in real space

Table 1. Number of learning trials needed to acquire the task in real space.

4.4 Task acquisition in real space

Task acquisition in real space was performed using the actor and critic obtained in the virtual task learning.

1) Experimental conditions

A successful trial is defined as one in which the inverted pendulum is maintained for 10 minutes. When the number of consecutive successful trials exceeds five, the experiment is terminated.

2) Experimental results

We performed the experiment three times. In all three experiments the pendulum was kept inverted for more than 10 minutes. The average number of learning trials needed before the termination condition was satisfied was 11. Figure 8 shows an example of the relation between the holding time of the inverted pendulum and the number of learning trials; the x- and y-axes denote the same data as in Figure 6. The solid line (virtual) in Figure 8 shows the holding time of the pendulum when the actor and critic obtained in the virtual task learning are used as initial states. Just after starting the real task learning, the robot arm could already maintain the inverted pendulum for 10 minutes. The control results for the first minute of a successful trial are shown in Figure 9. The upper and lower parts of the figure show the transitions of the robot arm position and the pendulum angle, respectively. The tilted pendulum is recovered just after starting and is then controlled periodically to keep it near the origin at a constant amplitude. These results show that a task learned in virtual space can be effectively transferred to the real world when the actions learned in virtual space are applied in real space using the proposed approach. Very few learning trials were needed for this transfer, probably because the mathematical model in virtual space closely resembles the real dynamics.

5. Discussion: Comparison When Virtual Space Is Not Used

To verify the effectiveness of using virtual space for real task learning by a real robot, we compared the proposed method with two cases that do not use virtual space:


1) Only real space is used for task learning. 2) The task is first learned in computer simulation and then in real space; both the robot system and the task environment exist in the computer simulation.

In a sense, the first method is conventional: the task is acquired in real space from the first trial to the last, and the pendulum is manually returned to the initial state. In the second method, we ignore the position error and the time lag of the robot control, because we assume that control is performed perfectly; the learned actor and critic are then applied to the real task.

1) Experimental conditions

The parameter values for the actor and critic are identical to those in the virtual task learning experiment. The initial angle \theta is set according to a Gaussian distribution with mean 0° and standard deviation 1°, and the initial angular velocity is set to 0°/sec. A successful trial is defined as maintaining the inverted pendulum for 10 minutes. When the number of consecutive successful trials exceeds five, the experiment is terminated.

2) Experimental results

The above experiments were performed three times, and the task was achieved every time. In the first method, the average number of learning trials was 1021; in the second method, it was 401. The dashed line (real) in Figure 8 shows the holding time of the pendulum in the real world when learning was performed only in real space; about 1,000 trials were needed to achieve the real task. The dotted line (sim) in the figure shows the holding time of the pendulum in the real world after learning was performed in the computer simulation; about 400 trials were required. Table 1 summarizes the number of learning trials needed until the robot obtains the task in the real world for each approach. About 1,000 trials were needed when learning was performed only in real space, whereas the proposed virtual space approach needed essentially no additional learning trials. The computer simulation approach required less than about half the number of learning trials of the real-space-only case, but it was still inferior to the proposed approach. Since the task rules had already been found in the computer simulation before the real task was learned, a nearly appropriate actor and critic were available before learning in real space, so faster learning than in the real-space-only case was naturally expected. We infer that robot dynamics is responsible for the difference between the proposed approach using virtual space and the computer simulation approach. The same mathematical inverted pendulum model was used for virtual space and for the simulation, so the difference does not depend on the pendulum dynamics. The simulation assumes an ideal robot that can be controlled completely and without time lag; in fact, the robot cannot be controlled perfectly, because of the rise time of the electric motors and other factors. In the simulation approach, it therefore takes additional trials to learn the real characteristics of the robot dynamics. These results suggest that the proposed approach, which dramatically reduces the number of learning trials in real space, allows a smooth shift to real space because the differences between the real and virtual robots need not be considered.

6. Conclusion

In this paper we proposed a novel approach to task learning by reinforcement learning that uses virtual space so that a real task can be learned effectively and safely. We showed an example in which a real robot controls an inverted pendulum in virtual space using an input device, and then showed that the real task can be obtained effectively by applying the rules acquired in virtual task learning to real task learning. This means that the proposed approach, which first learns the virtual task and then shifts to real space, is quite useful when a real robot has to learn and acquire tasks by trial and error. The approach also makes it possible to use force feedback, since the system includes a haptic interface device such as PHANToM. In the future, we will discuss the effectiveness of a system that includes force feedback control.

References

[1] A.G. Barto, R.S. Sutton, C.W. Anderson, "Neuronlike adaptive elements that can solve difficult learning control problems," IEEE Transactions on Systems, Man, and Cybernetics, Vol. 13, No. 5, pp. 834-846, 1983.
[2] J. Morimoto, K. Doya, "Reinforcement learning of dynamic motor sequence: Learning to stand up," IEEE International Conference on Intelligent Robots and Systems, Vol. 3, pp. 1721-1726, 1998.


[3] J. Morimoto, G. Cheng, C.G. Atkeson, G. Zeglin, "A Simple Reinforcement Learning Algorithm For Biped Walking," IEEE International Conference on Robotics and Automation, Vol. 3, pp. 3030-3035, 2004.
[4] F. Kanehiro, N. Miyata, S. Kajita, K. Fujiwara, H. Hirukawa, Y. Nakamura, K. Yamane, I. Kohara, Y. Kawamura, Y. Sankai, "Virtual humanoid robot platform to develop controllers of real humanoid robots without porting," IEEE International Conference on Intelligent Robots and Systems, Vol. 2, pp. 1093-1099, 2001.
[5] F. Kanehiro, H. Hirukawa, S. Kajita, "OpenHRP: Open Architecture Humanoid Robot Platform," International Journal of Robotics Research, Vol. 23, No. 2, pp. 155-165, 2004.
[6] O. Khatib, O. Brock, K.S. Chang, F. Conti, D. Ruspini, L. Sentis, "Robotics and interactive simulation," Communications of the ACM, Vol. 45, No. 3, pp. 46-51, 2002.
[7] H. Ogata, T. Takahashi, "Robotic assembly operation teaching in a virtual environment," IEEE Transactions on Robotics and Automation, Vol. 10, No. 3, pp. 391-399, 1994.
[8] H. Onda, H. Hirukawa, K. Takase, "Assembly motion teaching system using position/force simulator — extracting a sequence of contact state transition," IEEE International Conference on Intelligent Robots and Systems, Vol. 1, pp. 9-16, 1995.
[9] A.M. Martinez, J. Vitria, "Designing and Implementing Real Walking Agents using Virtual Environments," in Applications of Artificial Intelligence, N. Mamede, C. Pinto-Ferreira (eds.), Scitec Pub., pp. 105-114, 1997.
[10] X. Hu, B.P. Zeigler, "A Simulation-Based Virtual Environment to Study Cooperative Robotic Systems," Integrated Computer-Aided Engineering (ICAE), Vol. 12, No. 4, pp. 353-367, 2005.
[11] M. Konyo, S. Tadokoro, T. Takamori, K. Oguro, "Artificial tactile feel display using soft gel actuators," IEEE International Conference on Robotics and Automation, Vol. 4, pp. 3416-3421, 2000.
[12] D.G. Caldwell, N. Tsagarakis, D. Badihi, G.A. Medrano-Cerda, "Pneumatic muscle actuator technology: a light weight power system for a humanoid robot," IEEE International Conference on Robotics and Automation, Vol. 4, pp. 3053-3058, 1998.
[13] OpenGL, http://www.opengl.org/
[14] C.G. Atkeson, S. Schaal, "Robot learning from demonstration," 14th International Conference on Machine Learning, pp. 12-20, 1997.
[15] R.S. Sutton, A.G. Barto, "Reinforcement Learning: An Introduction," A Bradford Book, MIT Press, 1998.
[16] S.P. Singh, R.S. Sutton, "Reinforcement learning with replacing eligibility traces," Machine Learning, Vol. 22, No. 1-3, pp. 123-158, 1996.
[17] T. Flash, N. Hogan, "The coordination of arm movements: an experimentally confirmed mathematical model," Journal of Neuroscience, Vol. 5, No. 7, pp. 1688-1703, 1985.

Tadashi Tsubone received the B.E., M.E., and Ph.D. degrees in electrical engineering from Hosei University, Tokyo, Japan, in 1996, 1998 and 2001, respectively. He is currently a Research Assistant with the Department of Electrical Engineering, Nagaoka University of Technology, Niigata, Japan. His research interests are in neural networks, chaos, and bifurcation. Dr. Tsubone is a member of IEEE and IEICE.



Koichi Sugiyama received the B.E. and M.E. degrees in electrical engineering from Nagaoka University of Technology, Niigata, Japan, in 2004 and 2006, respectively. He joined TOSHIBA TEC CORPORATION, Tokyo, Japan in April 2006. His research interests are in motion controls and neural networks.

Yasuhiro Wada received his B.S. degree from the Department of Control Engineering, Tokyo Institute of Technology, in 1980, completed the M.E. program (system science) in 1982, and joined Kawasaki Steel Company. He moved to ATR Visual and Auditory Function Research Laboratory in 1989, and to ATR Human Information Processing Research Laboratories in 1992. He has been a professor at Nagaoka University of Technology, and is engaged in research on motion control and neural networks. He holds a D. Eng. degree, and is a member of the Neural Network Society and Society of Instrument and Control Engineers. (Home page: http://gallery.nagaokaut.ac.jp/)

