IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2006), Beijing, China, Oct. 9–15, 2006
Q-RAN: A Constructive Reinforcement Learning Approach for Robot Behavior Learning
Jun Li, Achim Lilienthal (Department of Technology, Örebro University, Sweden)
Tomás Martínez-Marín (Department of Physics, System Engineering and Signal Theory, University of Alicante, Spain)
Tom Duckett (Department of Computing and Informatics, University of Lincoln, UK)
Outline
- Background – acquiring a robot behavior
  - by engineering design
  - by learning from the robot's own experiences
- A layered learning system – QRAN
  - Main ideas of our learning system
  - Architecture of our learning system
  - Implementation of QRAN learning
  - Comments on QRAN learning
- Experimental results and analysis
  - Docking behavior
  - Learning by the QRAN system
- Conclusion and future work
Background – acquiring behaviors
- A reactive behavior
  - a sequence of sensory states and their corresponding motor actions for different tasks
  - some example behaviors: robot docking, moving-object following, doorway crossing
Background – acquiring behaviors
- Engineering design
  - linear control [1], fuzzy control [2], and symbolic planning [3]

[1] R. Siegwart and I. R. Nourbakhsh. Introduction to Autonomous Mobile Robots. The MIT Press, Cambridge, Massachusetts, 2004.
[2] A. Saffiotti. Autonomous Robot Navigation: A Fuzzy Logic Approach. PhD thesis, Université Libre de Bruxelles, 1998.
[3] N. J. Nilsson. Artificial Intelligence: A New Synthesis. Morgan Kaufmann, CA, USA, 1998.
- Learning from experiences
  - learning by demonstration [1], shaping [2], and development [3]

[1] A. Billard and R. Siegwart. Robot learning from demonstration. Robotics and Autonomous Systems, 47:65–67, 2004.
[2] M. Dorigo and M. Colombetti. Robot Shaping: An Experiment in Behavior Engineering. MIT Press/Bradford Books, 1998.
[3] J. Weng. Developmental robotics: Theory and experiments. International Journal of Humanoid Robotics, 1(2):199–236, 2004.
A layered learning system
Main ideas of our learning system (see the sketch after this list)
- A prior-knowledge controller (rough controller)
  - derived from the engineering design, or from the demonstration of a "teacher"
- Lower layer with supervised learning
  - improves the prior-knowledge controller in the sense of smooth control
- Upper layer with reinforcement learning
  - improves the lower layer's controller in the sense of optimal control
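To make the layering concrete, below is a minimal Python sketch of how the three controllers could be composed. All names, gains, and the refinement rules are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the layered-controller idea; gains and the refinement
# rules are illustrative assumptions, not the authors' code.

def prior_controller(alpha, beta, P):
    # Rough prior-knowledge controller (e.g. a hand-tuned linear law).
    k_alpha, k_beta = 1.0, -0.5
    return k_alpha * alpha + k_beta * beta      # rotational velocity command

def lower_layer(alpha, beta, P):
    # Supervised layer (an RAN network in the paper): imitates the prior
    # controller while smoothing its output; here just a placeholder scaling.
    return 0.8 * prior_controller(alpha, beta, P)

def upper_layer(alpha, beta, P, q_value):
    # RL layer (Q-RAN in the paper): refines the lower layer's suggestion by
    # picking the nearby action with the best learned Q-value.
    candidates = [lower_layer(alpha, beta, P) + d for d in (-5.0, 0.0, 5.0)]
    return max(candidates, key=lambda a: q_value((alpha, beta, P), a))

if __name__ == "__main__":
    dummy_q = lambda s, a: -abs(a)              # stand-in Q-function for the demo
    print(upper_layer(10.0, -3.0, 40.0, dummy_q))
```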
A layered learning system
Learning system’s architecture
A layered learning system
QRAN learning update rule (Q-learning):

    Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]

where s_t ∈ S and a_t ∈ A.
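For illustration, a minimal tabular version of this update is sketched below; Q-RAN itself replaces the table with an RAN function approximator, and the states, actions, and parameter values here are assumptions.

```python
# Minimal tabular Q-learning update, for illustration only.  Q-RAN replaces
# the table with a resource-allocating network; the states, actions and
# parameters below are illustrative assumptions.
from collections import defaultdict

alpha, gamma = 0.1, 0.9                    # learning rate and discount factor
actions = [-10.0, 0.0, 10.0]               # example discrete rotational velocities
Q = defaultdict(float)                     # Q[(state, action)] -> value

def q_update(s, a, r, s_next):
    """Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]"""
    td_target = r + gamma * max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

# Example transition: states encoded as discretised (alpha, beta) tuples.
q_update(s=(2, -1), a=10.0, r=-1.0, s_next=(1, -1))
print(Q[((2, -1), 10.0)])
```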
A layered learning system
Comments on QRAN learning
- Applicability constraints of Q-learning
  - a discrete state-action representation
  - infinitely many visits to each (s, a) pair guarantee the optimal mapping
- Related work
  - Rivest et al., Combining TD-learning with cascade-correlation networks, ICML 2003
  - Smart et al., Effective reinforcement learning for mobile robots, ICRA 2002
  - Santos et al., Exploration tuned reinforcement learning for mobile robot, Neurocomputing, 1999
  - Martínez et al., Fast reinforcement learning for vision-guided mobile robots, ICRA 2005
  - ...
- What is new in QRAN learning
  - RAN: a constructive ANN for continuous state representation (see the sketch below)
  - the off-policy nature of Q-learning speeds up the learning process on real robots
  - easy to use and simple to implement due to its simple structure
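The constructive-network idea can be sketched as follows, in the spirit of Platt's resource-allocating network: a Gaussian RBF net that adds a hidden unit whenever an input is both novel and poorly predicted. The thresholds, width rule, and learning rate below are illustrative assumptions, not the values used in the paper.

```python
# Minimal sketch of a resource-allocating network (RAN); thresholds and the
# learning rate are illustrative assumptions.
import numpy as np

class RAN:
    def __init__(self, err_thresh=0.5, dist_thresh=0.3, width_scale=0.8, lr=0.05):
        self.centers, self.widths, self.weights = [], [], []
        self.bias = 0.0
        self.err_thresh, self.dist_thresh = err_thresh, dist_thresh
        self.width_scale, self.lr = width_scale, lr

    def predict(self, x):
        x = np.asarray(x, dtype=float)
        y = self.bias
        for c, w, h in zip(self.centers, self.widths, self.weights):
            y += h * np.exp(-np.sum((x - c) ** 2) / (2 * w ** 2))
        return y

    def train(self, x, target):
        x = np.asarray(x, dtype=float)
        error = target - self.predict(x)
        dists = [np.linalg.norm(x - c) for c in self.centers]
        nearest = min(dists) if dists else np.inf
        if abs(error) > self.err_thresh and nearest > self.dist_thresh:
            # Novelty criteria met: allocate a new hidden unit centred on x.
            self.centers.append(x.copy())
            self.widths.append(self.width_scale * (nearest if dists else 1.0))
            self.weights.append(error)
        else:
            # Otherwise adapt the existing weights with an LMS-style update.
            for i, (c, w) in enumerate(zip(self.centers, self.widths)):
                phi = np.exp(-np.sum((x - c) ** 2) / (2 * w ** 2))
                self.weights[i] += self.lr * error * phi
            self.bias += self.lr * error

# Tiny usage example: learn a 1-D mapping and watch the network grow.
net = RAN()
for x in [0.0, 1.0, 2.0, 1.0, 0.5]:
    net.train([x], target=2.0 * x)
print(len(net.centers), net.predict([1.0]))
```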
Experimental results and analysis
Docking behavior (photo sequence): a start position, tracing a green can, approaching the table, picking up the can.
Experimental results and analysis
LC controller solution

    v_trans = k_P · P
    v_rot   = k_α · α + k_β · β

where:
    v_trans – translational velocity
    v_rot – rotational velocity
    α, β, P – state variables
    k_α, k_β, k_P – gains
    (x_G, y_G) – the global coordinates
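A minimal sketch of this linear control law is given below; the gain values are illustrative assumptions, not the gains tuned for the real robot.

```python
# Minimal sketch of the linear (LC) docking control law; the gain values are
# illustrative assumptions, not the ones used in the paper.
K_P, K_ALPHA, K_BETA = 0.3, 4.0, -1.5      # assumed gains

def lc_controller(alpha_deg, beta_deg, P_mm):
    """Return (v_trans, v_rot) from the state variables (alpha, beta, P)."""
    v_trans = K_P * P_mm                              # translational velocity
    v_rot = K_ALPHA * alpha_deg + K_BETA * beta_deg   # rotational velocity
    return v_trans, v_rot

print(lc_controller(alpha_deg=10.0, beta_deg=-5.0, P_mm=500.0))
```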
Experimental results and analysis
Docking becomes a complex behavior:
1. {a_tilt, a_pan, a_edge} are given in local coordinates (clip: LC_chattering)
   a. the LC controller is not directly applicable (overshooting and not robust)
   b. it depends on the time lag and momentum of the robot and camera
2. a fully reactive docking behavior
   a. visual servoing – stabilizing and synchronizing
   b. object tracking – must be robust
3. precise positioning at the goal pose
4. a time-optimal trajectory
Experimental results and analysis
Object tracking and visual servoing

Estimating the table edge's angle a_edge:
1. compute the edge slope b_r with a least-squares (LS) model
2. a_edge = arctan(b_r)

Estimating a_pan and a_tilt with PD controllers:

    ΔPan  = k_pp (x_o^cur − x_I) + k_dp · dx
    ΔTilt = k_pt (y_o^cur − y_I) + k_dt · dy

The state variables (α, β, P) are estimated from the visual-servoing variables:

    α = a_pan,  β = a_edge,  P = 80 − a_tilt
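A minimal sketch of the PD pan/tilt update and the state-variable mapping follows; the gains, image-centre coordinates, and units are illustrative assumptions.

```python
# Minimal sketch of the visual-servoing step: PD control of pan/tilt to keep
# the tracked object centred, then mapping camera angles to (alpha, beta, P).
# Gains and the image-centre coordinates are illustrative assumptions.
import math

K_PP, K_DP = 0.05, 0.01       # pan PD gains (assumed)
K_PT, K_DT = 0.05, 0.01       # tilt PD gains (assumed)
X_I, Y_I = 160.0, 120.0       # assumed image centre (pixels)

def pan_tilt_update(x_obj, y_obj, dx, dy):
    """PD corrections for the camera pan and tilt angles."""
    d_pan = K_PP * (x_obj - X_I) + K_DP * dx
    d_tilt = K_PT * (y_obj - Y_I) + K_DT * dy
    return d_pan, d_tilt

def edge_angle(slope_br):
    """Table-edge angle from the least-squares edge slope."""
    return math.degrees(math.atan(slope_br))

def state_from_servoing(a_pan, a_tilt, slope_br):
    """Map visual-servoing variables to the controller state (alpha, beta, P)."""
    alpha = a_pan
    beta = edge_angle(slope_br)
    P = 80.0 - a_tilt             # as in the slides: P = 80 - a_tilt
    return alpha, beta, P

print(state_from_servoing(a_pan=12.0, a_tilt=35.0, slope_br=0.2))
```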
Experimental results and analysis
Learning with QRAN
- State inputs: x = [α, β, u]^T
- Action output:
  - rotational velocity v_rot, learned by QRAN
  - translational velocity, determined by v_trans = k_P · P

Training the QRAN network (a loop sketch follows below):
1. estimate the control variables {α, β, P} by visual servoing
2. if a goal or failure state is reached, end this episode, move the robot to a new starting position, and go to step 1 to start a new episode
3. otherwise train the QRAN network and go to step 1
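The episodic training loop can be sketched as below; the robot interface and the qran_train_step call are hypothetical placeholders, not the authors' code.

```python
# Minimal sketch of the episodic training loop from the slide.  The robot
# interface (get_servoing_state, is_goal, is_failure) and qran_train_step are
# hypothetical placeholders.
import random

def get_servoing_state():
    # Placeholder: would return (alpha, beta, P) estimated by visual servoing.
    return random.uniform(-30, 30), random.uniform(-30, 30), random.uniform(0, 80)

def is_goal(state):    return state[2] < 1.0          # placeholder goal test
def is_failure(state): return abs(state[0]) > 25.0    # placeholder failure test

def qran_train_step(state):
    # Placeholder: would pick v_rot from the QRAN network, execute it, observe
    # the reward and next state, and apply the Q-learning update.
    pass

def run_episodes(n_episodes=10, max_steps=500):
    for episode in range(n_episodes):
        # Each episode starts from a new starting position (step 2 of the slide).
        for step in range(max_steps):
            state = get_servoing_state()              # step 1: visual servoing
            if is_goal(state) or is_failure(state):   # step 2: end the episode
                break
            qran_train_step(state)                    # step 3: train QRAN
        print(f"episode {episode} finished after {step + 1} steps")

run_episodes()
```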
Experimental results and analysis

Comparison: QRAN and LC controllers (figures)
- Trajectories from the start position to the goal position, plotted in X (mm) vs. Y (mm): LC trajectory vs. Q-RAN trajectory.
- Growth of the QRAN network over the training episodes: number of neurons and number of training examples.
- State variables α, β (degrees) and rotational velocity v_rot (degrees/s) over time steps for the LC controller and the Q-RAN controller: the LC controller is still in a "chattering" state, whereas Q-RAN avoids chattering significantly.
Experimental results and analysis
Comparison: QRAN and LC controllers – learning with only the upper layer (approx. 2 m away from the goal)

Controller | Successful trials | Average steps | Number of neurons | Training episodes
QRAN       | 10 of 10          | 405 ± 12      | 263               | 23
LC         | 8 of 10           | 458 ± 14      | *                 | *
Experimental results and analysis
Learning with the layered learning architecture – learning with the lower and upper layers (approx. 4 m away from the goal)

Controller | Successful trials | Average steps | Number of neurons | Training episodes
QRAN       | 10 of 10          | 518 ± 19      | 181               | 18
RAN        | 9 of 10           | 618 ± 28      | 126               | 100
LC         | 7 of 10           | 685 ± 30      | *                 | *

Figure: trajectories of the linear controller, the RAN controller, and the Q-RAN controller from the starting position to the goal position, plotted in X (mm) vs. Y (mm). (clip: layered learning)
Experimental results and analysis

Some example trajectories of layered learning
Figure: several example docking trajectories plotted in X (mm) vs. Y (mm), with the goal position marked.
Conclusion and future work
Conclusion
- A layered learning architecture is proposed
  - the LC controller is used as prior knowledge
  - the lower layer, with an RAN network, improves the LC controller in a supervised-learning fashion
  - the upper layer, with QRAN, improves the RAN controller in a reinforcement-learning fashion
- The QRAN learning algorithm is proposed
  - off-policy learning: incorporation of prior knowledge
  - constructive ANN: dynamic representation of the state space

Future work
- automatic design of the reward function