Emergent Proximo-Distal Maturation through Adaptive Exploration
Freek Stulp and Pierre-Yves Oudeyer
FLOWERS team, ENSTA-ParisTech and Inria, France
http://flowers.inria.fr

Maturational skill learning: freeing and freezing of motor DOFs in humans

[Figure panels: Reaching, Skiing, Control]
(Bjorklund, 1997; Turkewitz and Kenny, 1985)
(Bernstein, 1967; Vereijken et al., 1992)

Motor maturation for efficient skill learning in robots

One task: swinging (Berthouze and Lungarella, 2004; see also Ivanchenko and Jacobs, 2003)

MATURATION: freeing and freezing of DOFs (fixed or random schedule)
OPTIMIZATION: reinforcement learning (learn one skill, simple RL/optimization)

Adaptive maturation for skill learning in robots

Many tasks: omnidirectional locomotion; McSAGG-RIAC (Baranes and Oudeyer, 2011)

MATURATION: adaptive freeing of DOFs (adaptive clock)
INTRINSIC MOTIVATION: competence progress (learn a field of skills)
OPTIMIZATION: reinforcement learning (simple local RL/optimization)

Initial goal: adaptive maturation controlled by PI²-CMAES

One task: reaching to a goal in the workspace with a 10-DOF kinematically simulated 'arm' in a 2-D plane

MATURATION: freeing and freezing of DOFs (adaptive clock)
OPTIMIZATION: reinforcement learning (state-of-the-art RL: PI²-CMAES)

What we got: emergent maturation from PI²-CMAES

One task: reaching to a goal in the workspace with a 10-DOF kinematically simulated 'arm' in a 2-D plane

MATURATION: freeing and freezing of DOFs (emergent schedule)
OPTIMIZATION: reinforcement learning (state-of-the-art episodic RL: PI²-CMAES)

PI²-CMAES: Policy Improvement with Path Integrals and Covariance Matrix Adaptation

PI²: Policy Improvement with Path Integrals (Theodorou et al., 2010)
CMA-ES: Covariance Matrix Adaptation (Hansen and Ostermeier, 2001)

PI²-CMAES: Policy Improvement with Path Integrals and Covariance Matrix Adaptation (Stulp and Sigaud, 2012)

Up next:
• Explain reward-weighted averaging
• Explain covariance matrix adaptation

Application to Reaching

Task: reach to a goal in the workspace with a 10-DOF kinematically simulated 'arm' in a 2-D plane. Duration of the movement is 0.5 s; initially θ = 0 (no movement).

[Fig. 3: visualization of the reaching motion (after learning) for 'goal 1' and 'goal 2'; the learned movements end at the goals as depicted and have null speed.]

Cost function. The terminal cost (1) penalizes the distance between the 2-D Cartesian coordinates of the end-effector x_{t_N} and the goal (g_1 = [0.0 0.5] or g_2 = [0.0 0.25]) at the end of the movement t_N. It also penalizes the joint with the largest angle, max(q_{t_N}), expressing a comfort effect, with maximum comfort at the initial position. The immediate cost (2) at each time step r_t penalizes joint accelerations; the weighting term (M + 1 − m) penalizes DOFs closer to the origin, the underlying motivation being that wrist movements are less costly than shoulder movements for humans, cf. [21].

$$\phi_{t_N} = 10^4\,\|x_{t_N} - g\|^2 + \max(q_{t_N}) \qquad \text{(terminal cost, 1)}$$

$$r_t = 10^{-5}\,\frac{\sum_{m=1}^{M}(M+1-m)\,\ddot{q}_{t,m}^2}{\sum_{m=1}^{M}(M+1-m)} \qquad \text{(immediate cost, 2)}$$

Policy representation. The acceleration \(\ddot{q}_{m,t}\) of the m-th joint at time t is a linear combination of normalized basis functions, where the parameter vector θ_m holds the weights of joint m:

$$\ddot{q}_{m,t} = \mathbf{g}(t)^\top \boldsymbol{\theta}_m, \qquad [\mathbf{g}(t)]_b = \frac{\Psi_b(t)}{\sum_{b=1}^{B}\Psi_b(t)}, \qquad \Psi_b(t) = \exp\!\big(-(t-c_b)^2/w^2\big)$$
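To make the policy representation and the two cost terms concrete, here is a minimal Python sketch under stated assumptions: the number of basis functions B, the centers c_b, and the width w are illustrative choices not given on this slide, and the forward kinematics of the arm are omitted.

```python
import numpy as np

# Minimal sketch of the policy representation and cost terms above.
# B, the centers c_b, and the width w are illustrative assumptions;
# the slide only specifies the functional forms.

M = 10                                  # degrees of freedom of the planar arm
B = 5                                   # basis functions per joint (assumption)
T = 0.5                                 # movement duration in seconds
centers = np.linspace(0.0, T, B)        # basis-function centers c_b (assumption)
width = T / B                           # shared width w (assumption)

def basis(t):
    """Normalized Gaussian basis vector g(t)."""
    psi = np.exp(-(t - centers) ** 2 / width ** 2)
    return psi / psi.sum()

def joint_accelerations(theta, t):
    """q''_{m,t} = g(t)^T theta_m for all joints; theta has shape (M, B)."""
    return theta @ basis(t)

def terminal_cost(x_tN, goal, q_tN):
    """phi_{t_N} = 1e4 * ||x_{t_N} - g||^2 + max(q_{t_N})."""
    return 1e4 * np.sum((x_tN - goal) ** 2) + np.max(q_tN)

def immediate_cost(qdd_t):
    """r_t: acceleration penalty, proximal joints (small m) weighted more."""
    w = M + 1 - np.arange(1, M + 1)     # weights (M + 1 - m) for m = 1..M
    return 1e-5 * np.sum(w * qdd_t ** 2) / np.sum(w)
```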

PI²-CMAES: Reward-Weighted Averaging

Sample K parameter vectors from the current search distribution, evaluate their costs, map costs to probabilities, and update the mean by reward-weighted averaging:

$$\boldsymbol{\theta}_{k=1\ldots K} \sim \mathcal{N}(\boldsymbol{\theta}, \boldsymbol{\Sigma})$$

$$\forall k: \quad J_k = J(\boldsymbol{\theta}_k)$$

$$\forall k: \quad P_k = \frac{\exp\!\left(-h\,\frac{J_k - \min(J)}{\max(J) - \min(J)}\right)}{\sum_{l=1}^{K} \exp\!\left(-h\,\frac{J_l - \min(J)}{\max(J) - \min(J)}\right)}$$

$$\boldsymbol{\theta}^{\text{new}} = \sum_{k=1}^{K} P_k\, \boldsymbol{\theta}_k$$

(min(J) and max(J) are taken over the K sampled costs.)
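A hedged Python sketch of this reward-weighted averaging update; the sample count K, the eliteness parameter h, and the cost function J are placeholders for whatever the task provides, not values given on this slide.

```python
import numpy as np

def reward_weighted_averaging(theta, Sigma, J, K=10, h=10.0, rng=None):
    """One update of the slide's reward-weighted averaging:
    sample K parameter vectors, evaluate their costs J_k, map costs
    to probabilities P_k, and return the weighted mean theta_new."""
    rng = np.random.default_rng() if rng is None else rng
    thetas = rng.multivariate_normal(theta, Sigma, size=K)   # theta_k ~ N(theta, Sigma)
    costs = np.array([J(th) for th in thetas])               # J_k = J(theta_k)
    span = costs.max() - costs.min()
    P = np.exp(-h * (costs - costs.min()) / (span + 1e-12))  # low cost -> high weight
    P /= P.sum()                                             # normalize over k
    theta_new = P @ thetas                                   # sum_k P_k theta_k
    return theta_new, thetas, P
```

Lower-cost samples receive exponentially larger weights, and h controls how greedy the averaging is.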


PI²-CMAES: Reward-Weighted Averaging and Covariance Matrix Adaptation

In addition to the mean, the covariance matrix of the search distribution is updated by reward-weighted averaging:

$$\boldsymbol{\theta}_{k=1\ldots K} \sim \mathcal{N}(\boldsymbol{\theta}, \boldsymbol{\Sigma}), \qquad \forall k: \; J_k = J(\boldsymbol{\theta}_k), \qquad \forall k: \; P_k = \frac{\exp\!\left(-h\,\frac{J_k - \min(J)}{\max(J) - \min(J)}\right)}{\sum_{l=1}^{K} \exp\!\left(-h\,\frac{J_l - \min(J)}{\max(J) - \min(J)}\right)}$$

$$\boldsymbol{\theta}^{\text{new}} = \sum_{k=1}^{K} P_k\, \boldsymbol{\theta}_k \qquad\qquad \boldsymbol{\Sigma}^{\text{new}} = \sum_{k=1}^{K} P_k\, (\boldsymbol{\theta}_k - \boldsymbol{\theta})(\boldsymbol{\theta}_k - \boldsymbol{\theta})^\top$$

[Figure panels: 'Without CMA' and 'With CMA'.]
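The covariance update shown above, as a small extension of the same sketch. Note this is only the reward-weighted covariance estimate from the slide, not the full CMA-ES machinery (step-size control, evolution paths).

```python
import numpy as np

def reward_weighted_update(theta, thetas, P):
    """Mean and covariance updates from the slide:
    theta_new = sum_k P_k theta_k
    Sigma_new = sum_k P_k (theta_k - theta)(theta_k - theta)^T."""
    theta_new = P @ thetas
    diffs = thetas - theta                                   # deviations from the old mean
    Sigma_new = np.einsum('k,ki,kj->ij', P, diffs, diffs)    # weighted outer products
    return theta_new, Sigma_new
```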




Application to Reaching (continued)

Task: reach to a goal in the workspace with a 10-DOF kinematically simulated 'arm' in a 2-D plane, using the cost function and policy representation given above. Duration of the movement is 0.5 s; initially θ = 0 (no movement).

PI²-CMAES parameters: K = 20
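Putting the pieces together, a minimal sketch of the episodic PI²-CMAES-style loop with K = 20 rollouts per update, as on this slide; the number of updates, the initial exploration magnitude, and the cost function J are assumptions.

```python
import numpy as np

def run_pi2_cmaes(J, n_params, n_updates=100, K=20, h=10.0, init_std=0.1, rng=None):
    """Episodic loop: sample K rollouts, evaluate costs, and apply the
    reward-weighted mean and covariance updates from the previous slides.
    K = 20 matches the slide; the other parameters are assumptions."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.zeros(n_params)                   # "initially theta = 0 (no movement)"
    Sigma = init_std ** 2 * np.eye(n_params)     # isotropic initial exploration
    for _ in range(n_updates):
        thetas = rng.multivariate_normal(theta, Sigma, size=K)
        costs = np.array([J(th) for th in thetas])
        span = costs.max() - costs.min()
        P = np.exp(-h * (costs - costs.min()) / (span + 1e-12))
        P /= P.sum()
        diffs = thetas - theta
        theta = P @ thetas                                   # mean update
        Sigma = np.einsum('k,ki,kj->ij', P, diffs, diffs)    # covariance update
    return theta, Sigma
```

One possible layout for the reaching task flattens the M × B basis-function weights of all joints into a single parameter vector, with J simulating the arm and summing the terminal and immediate costs of the rollout.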

Results: Re-adaptation to Changing Tasks

⇒ Life-long continual reinforcement learning with an automatic exploration/exploitation trade-off

Results: Emergent Proximo-Distal Maturation

[Figure: exploration magnitude of the individual DOFs over time, compared to a control condition.]

⇒ Emergent proximo-distal maturation
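The per-DOF exploration magnitudes plotted in these results can be read off the diagonal of Σ; a hedged sketch, assuming the parameter vector stacks the B basis-function weights of each of the M joints.

```python
import numpy as np

def exploration_per_dof(Sigma, M, B):
    """One exploration magnitude per joint: the mean standard deviation of
    the B parameters belonging to that joint, taken from Sigma's diagonal.
    Assumes the parameter vector is the (M, B) weight matrix flattened row-wise."""
    stds = np.sqrt(np.diag(Sigma)).reshape(M, B)
    return stds.mean(axis=1)
```

Tracking this quantity over the learning updates yields per-DOF curves of the kind shown in the figure above.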

Interpretation

•  Stochastic optimization directs itself by following an approximated, smoothed gradient/curvature, i.e. by fostering exploration in directions where the impact on the cost function is large, and ignoring directions where the impact is small.
•  The arm structure is such that
   1) initially, proximal joints have more impact on the cost function than distal ones;
   2) this relative impact changes as one gets closer to the optimum of the cost function.

⇒ Emergent maturation is a property of the combination of the structure of the cost function (which depends on the body structure) and adaptive exploration in stochastic optimization.

References

Baranes, A. and Oudeyer, P.-Y. (2011). The interaction of maturational constraints and intrinsic motivations in active motor development. In Proceedings of the IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob), Frankfurt, Germany.
Berthouze, L. and Lungarella, M. (2004). Motor skill acquisition under environmental perturbations: On the necessity of alternate freezing and freeing degrees of freedom. Adaptive Behavior, 12(1):47–63.
Hansen, N. and Ostermeier, A. (2001). Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159–195.
Ivanchenko, V. and Jacobs, R. A. (2003). A developmental approach aids motor learning. Neural Computation, 15:2051–2065.
Stulp, F. and Oudeyer, P.-Y. (2012). Emergent proximo-distal maturation through adaptive exploration. In International Conference on Development and Learning (ICDL). Accepted for publication.
Stulp, F. and Sigaud, O. (2012). Path integral policy improvement with covariance matrix adaptation. In Proceedings of the 29th International Conference on Machine Learning (ICML).
Theodorou, E., Buchli, J., and Schaal, S. (2010). A generalized path integral control approach to reinforcement learning. Journal of Machine Learning Research, 11:3137–3181.
