IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2006), Beijing, China, Oct. 9-15, 2006

Q-RAN: A Constructive Reinforcement Learning Approach for Robot Behavior Learning

Jun Li, Achim Lilienthal
Department of Technology, Örebro University, Sweden

Tomás Martínez-Marín
Department of Physics, System Engineering and Signal Theory, University of Alicante, Spain

Tom Duckett
Department of Computing and Informatics, University of Lincoln, UK

Outline

- Background – acquiring a robot behavior
  - by engineering design
  - by learning from the robot's own experiences
- A layered learning system – QRAN
  - Main ideas of our learning system
  - Architecture of our learning system
  - Implementation of QRAN learning
  - Comments on QRAN learning
- Experimental results and analysis
  - Docking behavior
  - Learning by the QRAN system
- Conclusion and future work

Background – acquiring behaviors

- A reactive behavior
  - a sequence of sensory states and their corresponding motor actions
  - some example behaviors for different tasks: robot docking, moving-object following, doorway crossing

Background – acquiring behaviors

- Engineering design
  - linear control [1], fuzzy control [2], and symbolic-based planning [3]

    [1] R. Siegwart and I. R. Nourbakhsh. Introduction to Autonomous Mobile Robots. The MIT Press, Cambridge, Massachusetts, 2004.
    [2] A. Saffiotti. Autonomous Robot Navigation: A Fuzzy Logic Approach. PhD thesis, Université Libre de Bruxelles, 1998.
    [3] N. J. Nilsson. Artificial Intelligence: A New Synthesis. Morgan Kaufmann, San Francisco, CA, 1998.

- Learning from experiences
  - learning by demonstration [1], shaping [2], and development [3]

    [1] A. Billard and R. Siegwart. Robot learning from demonstration. Robotics and Autonomous Systems, 47:65–67, 2004.
    [2] M. Dorigo and M. Colombetti. Robot Shaping: An Experiment in Behavior Engineering. MIT Press/Bradford Books, 1998.
    [3] J. Weng. Developmental robotics: Theory and experiments. International Journal of Humanoid Robotics, 1(2):199–236, 2004.

A layered learning system

- Main ideas of our learning system
  - A prior knowledge controller (rough controller)
    - derived from the engineering design, or
    - derived from the demonstration of a "teacher"
  - Lower layer with supervised learning
    - improving the prior knowledge controller in the sense of smooth control
  - Upper layer with reinforcement learning
    - improving the lower layer's controller in the sense of optimal control
  (A minimal code sketch of this layered structure follows below.)
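To make the interplay between the three layers concrete, here is a minimal Python sketch under assumed interfaces: the class name LayeredController and the prior/RAN/Q-RAN objects with their act/update methods are illustrative stand-ins, not the authors' implementation.

```python
# Illustrative sketch of the layered structure; interfaces are assumptions, not the authors' code.
class LayeredController:
    def __init__(self, prior, ran, qran):
        self.prior = prior   # prior knowledge controller (rough controller, e.g. the LC design)
        self.ran = ran       # lower layer: RAN trained by supervised learning
        self.qran = qran     # upper layer: Q-RAN trained by reinforcement learning

    def train_lower_layer(self, state):
        # The lower layer learns a smoothed version of the prior controller's action.
        target_action = self.prior.act(state)
        self.ran.update(state, target_action)

    def act(self, state):
        # The upper layer refines the lower layer's suggestion towards optimal control.
        suggested_action = self.ran.act(state)
        return self.qran.act(state, suggested_action)
```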

A layered learning system

- Learning system's architecture
  [Figure: architecture of the layered learning system]

A layered learning system

- QRAN learning:

  Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]

  where s_t ∈ S and a_t ∈ A
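Read as code, the update is one temporal-difference step. The sketch below assumes a generic `q_net` object with `predict` and `update` methods standing in for the Q-RAN function approximator; it is a minimal illustration, not the authors' implementation.

```python
def q_learning_step(q_net, s_t, a_t, r_next, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning update: Q(s,a) <- Q(s,a) + alpha*(r + gamma*max_a' Q(s',a') - Q(s,a))."""
    q_current = q_net.predict(s_t, a_t)
    q_next_max = max(q_net.predict(s_next, a) for a in actions)
    td_error = r_next + gamma * q_next_max - q_current
    # Move the approximator's output for (s_t, a_t) towards the TD target.
    q_net.update(s_t, a_t, q_current + alpha * td_error)
    return td_error
```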

A layered learning system

- Comments on QRAN learning
  - Applicability constraints of Q-learning
    - discrete state-action representation
    - infinitely many visits to each (s, a) pair are required to guarantee convergence to the optimal mapping
  - Related work
    - Rivest et al. Combining TD-learning with cascade-correlation networks. ICML 2003.
    - Smart et al. Effective reinforcement learning for mobile robots. ICRA 2002.
    - Santos et al. Exploration tuned reinforcement learning for mobile robots. Neurocomputing, 1999.
    - Martínez et al. Fast reinforcement learning for vision-guided mobile robots. ICRA 2005.
    - ...
  - What is new in QRAN learning
    - RAN – a constructive ANN for continuous state representation (see the sketch after this list)
    - the off-policy nature of Q-learning speeds up the learning process on real robots
    - easy to use and simple to implement due to the simple structure
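For the constructive part, the sketch below follows the standard resource-allocating network (RAN) growth rule of Platt (1991), which we take to be the mechanism behind the dynamic state representation: a new Gaussian unit is allocated only when the input is far from all existing centres and the prediction error is large; otherwise the output weights are adjusted. All thresholds and learning rates are illustrative placeholders, not the paper's settings.

```python
import numpy as np

class RAN:
    """Minimal resource-allocating network sketch (Gaussian RBF units, scalar output)."""
    def __init__(self, dist_thresh=0.5, err_thresh=0.1, overlap=0.8, lr=0.05):
        self.centers, self.widths, self.weights = [], [], []
        self.dist_thresh, self.err_thresh = dist_thresh, err_thresh
        self.overlap, self.lr = overlap, lr

    def predict(self, x):
        x = np.asarray(x, dtype=float)
        return sum(w * np.exp(-np.sum((x - c) ** 2) / s ** 2)
                   for w, c, s in zip(self.weights, self.centers, self.widths))

    def update(self, x, y):
        x = np.asarray(x, dtype=float)
        error = y - self.predict(x)
        dist = min((np.linalg.norm(x - c) for c in self.centers), default=np.inf)
        if dist > self.dist_thresh and abs(error) > self.err_thresh:
            # Novel and poorly predicted: allocate a new unit centred on x.
            self.centers.append(x.copy())
            self.widths.append(self.overlap * dist if np.isfinite(dist) else 1.0)
            self.weights.append(error)
        elif self.centers:
            # Otherwise adjust the existing output weights with a simple LMS step.
            for i, (c, s) in enumerate(zip(self.centers, self.widths)):
                phi = np.exp(-np.sum((x - c) ** 2) / s ** 2)
                self.weights[i] += self.lr * error * phi
```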

Experimental results and analysis

- Docking behavior
  [Figure: a start position; tracing a green can; approaching the table; picking up the can]

Experimental results and analysis

- LC controller solution

  v_trans = k_P · P
  v_rot = k_α · α + k_β · β

  where
    v_trans – translational velocity
    v_rot – rotational velocity
    α, β, and P – state variables
    k_α, k_β, and k_P – gains
    (x_G, y_G) – the global coordinates
  (A code transcription of this control law follows below.)
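A direct Python transcription of the LC control law might look as follows; the gain values are placeholders rather than the ones used in the experiments.

```python
def lc_control(alpha, beta, p, k_alpha=3.0, k_beta=-1.5, k_p=0.3):
    """Linear controller: v_trans from the distance P, v_rot from the angles alpha and beta.
    The gains are illustrative placeholders."""
    v_trans = k_p * p
    v_rot = k_alpha * alpha + k_beta * beta
    return v_trans, v_rot
```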

Experimental results and analysis

- Docking becomes a complex behavior
  1. {a_tilt, a_pan, a_edge} are expressed in local coordinates (clip: LC_chattering)
     a. the LC controller is not applicable (overshooting and not robust)
     b. it depends on the time lag and momentum of the robot and camera
  2. fully reactive docking behavior
     a. visual servoing – stabilizing and synchronizing
     b. object tracking – must be robust
  3. precise positioning at the goal pose
  4. time-optimal trajectory

Experimental results and analysis

- Object tracking and visual servoing
  - Estimating the table edge's angle a_edge
    1. compute the edge slope b_r with a least-squares (LS) model
    2. a_edge = arctan(b_r)
  - Estimating a_pan and a_tilt by PD controllers

    ΔPan = k_pp (x_o^cur − x_I) + k_dp · dx
    ΔTilt = k_pt (y_o^cur − y_I) + k_dt · dy

  - The state variables (α, β, P) are estimated from the visual servoing variables:

    α = a_pan,  β = a_edge,  P = 80 − a_tilt
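A compact reading of these estimation steps is sketched below; the image reference point (x_I, y_I), the gains, and the helper names are assumptions, while the least-squares slope fit and the offset 80 are taken from the slide.

```python
import numpy as np

def edge_angle(xs, ys):
    """Estimate the table edge angle a_edge (degrees) from edge pixels via a least-squares line fit."""
    b_r = np.polyfit(xs, ys, 1)[0]        # LS estimate of the edge slope
    return np.degrees(np.arctan(b_r))

def pd_pan_tilt(x_obj, y_obj, x_ref, y_ref, dx, dy,
                k_pp=0.4, k_dp=0.1, k_pt=0.4, k_dt=0.1):
    """PD corrections that keep the tracked object at the image reference point (x_ref, y_ref)."""
    delta_pan = k_pp * (x_obj - x_ref) + k_dp * dx
    delta_tilt = k_pt * (y_obj - y_ref) + k_dt * dy
    return delta_pan, delta_tilt

def servo_state(a_pan, a_edge, a_tilt):
    """Map the visual servoing variables to the controller state (alpha, beta, P)."""
    return a_pan, a_edge, 80 - a_tilt
```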

Experimental results and analysis

- Learning with QRAN
  - State inputs: x = [α, β, u]^T
  - Action output:
    - rotational velocity v_rot, learned by QRAN
    - translational velocity, determined by v_trans = k_P · P
  - Training the QRAN network
    1. estimate the control variables {α, β, P} by visual servoing
    2. if a goal or failure state is reached, end this episode, move the robot to a new starting position, and go to step 1 to start a new episode
    3. otherwise, train the QRAN network and go to step 1
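These three training steps amount to an episodic loop like the following sketch; the robot interface (`reset_to_new_start`, `estimate_state`, `execute`) and the learner methods (`select_action`, `train_step`) are hypothetical stand-ins, not the authors' code.

```python
def train_qran(qran, robot, num_episodes=25):
    """Episodic training loop following the three steps on the slide (illustrative interfaces)."""
    for episode in range(num_episodes):
        robot.reset_to_new_start()              # move the robot to a new starting position
        state = robot.estimate_state()          # step 1: (alpha, beta, P) from visual servoing
        done = False
        while not done:
            action = qran.select_action(state)  # e.g. choose v_rot (epsilon-greedy)
            reward, done = robot.execute(action)
            next_state = robot.estimate_state()
            qran.train_step(state, action, reward, next_state)  # step 3: update the QRAN network
            state = next_state                  # step 2: goal/failure ends the episode (done=True)
```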

Experimental results and analysis

- Comparison: QRAN and LC controllers
  [Figure: robot trajectories from the start position to the goal position in the X (mm) – Y (mm) plane for the LC and Q-RAN controllers; number of neurons and number of training examples plotted against the training episode; state variables (α, β) (degree) and rotational velocity v_rot (degree/s) over time steps for the LC controller, which is still in a "chattering" state, and for the Q-RAN controller, which avoids "chattering" significantly.]

Experimental results and analysis

- Comparison: QRAN and LC controllers
  Learning with only the upper layer (approx. 2 m away from the goal)

  Controller | Successful trials | Average steps | Number of neurons | Training episodes
  QRAN       | 10 of 10          | 405 ± 12      | 263               | 23
  LC         | 8 of 10           | 458 ± 14      | *                 | *

Experimental results and analysis

- Learning with the layered learning architecture
  Learning with lower and upper layers (approx. 4 m away from the goal)

  Controller | Successful trials | Average steps | Number of neurons | Training episodes
  QRAN       | 10 of 10          | 518 ± 19      | 181               | 18
  RAN        | 9 of 10           | 618 ± 28      | 126               | 100
  LC         | 7 of 10           | 685 ± 30      | *                 | *

  [Figure: trajectories from the starting position to the goal position in the X (mm) – Y (mm) plane for the linear controller, the RAN controller, and the Q-RAN controller] (clip: layered learning)

Experimental results and analysis

- Some example trajectories of layered learning
  [Figure: example trajectories in the X (mm) – Y (mm) plane, converging to the goal position]

Conclusion and future work

- Conclusion
  - A layered learning architecture is proposed
    - the LC controller is used as prior knowledge
    - the lower layer with a RAN network improves the LC controller in a supervised learning fashion
    - the upper layer with QRAN improves the RAN controller in a reinforcement learning fashion
  - The QRAN learning algorithm is proposed
    - off-policy: incorporation of prior knowledge
    - constructive ANN: dynamic representation of the state space
- Future work
  - Automatic design of the reward function
