Emergent Proximo-Distal Maturation through Adaptive Exploration
Freek Stulp and Pierre-Yves Oudeyer
FLOWERS team, ENSTA-ParisTech and Inria, France
http://flowers.inria.fr
Maturational skill learning: freeing and freezing of motor DOFs in humans
Reaching
Skiing
Control
(Bjorklund, 1997; Turkewitz and Kenny, 1985)
(Bernstein, 1967; Vereijken et al., 1992)
Motor maturation for efficient skill learning in robots
One task: swinging (Berthouze and Lungarella, 2004; see also Ivanchenko and Jacobs, 2003)
MATURATION: freeing and freezing of DOFs, with a fixed or random schedule
OPTIMIZATION: reinforcement learning (simple RL/optimization), to learn one skill
Adaptive maturation for skill learning in robots
Many tasks: omnidirectional locomotion
MATURATION: adaptive freeing of DOFs, driven by an adaptive clock
INTRINSIC MOTIVATION: competence progress, to learn a field of skills
OPTIMIZATION: reinforcement learning (simple local RL/optimization)
McSAGG-RIAC (Baranes and Oudeyer, 2011)
Initial goal: adaptive maturation controlled by PI2-CMAES
Task: reach to a goal in the workspace
10-DOF kinematically simulated 'arm' in a 2-D plane
One task: reaching
MATURATION: freeing and freezing of DOFs, driven by an adaptive clock
OPTIMIZATION: reinforcement learning (state-of-the-art RL: PI2-CMAES)
What we got: emergent maturation from PI2-CMAES
Task: reach to a goal in the workspace
10-DOF kinematically simulated 'arm' in a 2-D plane
One task: reaching
MATURATION: freeing and freezing of DOFs, with an emergent schedule
OPTIMIZATION: reinforcement learning (state-of-the-art episodic RL: PI2-CMAES)
PI2-CMAES: Policy Improvement with Path Integrals and Covariance Matrix Adaptation
PI2: Policy Improvement with Path Integrals (Theodorou et al., 2010)
CMA-ES: Covariance Matrix Adaptation (Hansen and Ostermeier, 2001)
PI2-CMAES (Stulp and Sigaud, 2012): Policy Improvement with Path Integrals and Covariance Matrix Adaptation, combining PI2 with CMA-ES
Up next: explain reward-weighted averaging, then covariance matrix adaptation
Application to Reaching
Task: reach to a goal in the workspace with a 10-DOF kinematically simulated 'arm' in a 2-D plane. The initial joint angles are as depicted in Fig. 3, with zero speed.
Fig. 3: Visualization of the reaching motion (after learning) for 'goal 1' and 'goal 2'.

Policy representation: the acceleration of the m-th joint at time t is a linear combination of normalized basis functions, where the parameter vector $\boldsymbol{\theta}_m$ holds the weights of joint m:
$$\ddot{q}_{m,t} = \mathbf{g}(t)^\top \boldsymbol{\theta}_m \qquad \text{(acceleration of joint } m\text{)}$$
$$[\mathbf{g}(t)]_b = \frac{\Psi_b(t)}{\sum_{b'=1}^{B}\Psi_{b'}(t)} \quad \text{with} \quad \Psi_b(t) = \exp\!\left(-(t-c_b)^2/w^2\right) \qquad \text{(basis functions)}$$

Cost function: the terminal cost (1) penalizes the distance $\|x_{t_N} - g\|$ between the 2-D Cartesian coordinates of the end-effector $x_{t_N}$ and the goal ($g_1 = [0.0\ 0.5]$ or $g_2 = [0.0\ 0.25]$) at the end of the movement $t_N$. It also penalizes the joint with the largest angle, $\max(q_{t_N})$, expressing a comfort effect, with maximum comfort at the initial position. The immediate cost $r_t$ at each time step (2) penalizes joint accelerations; the weighting term $(M+1-m)$ penalizes DOFs closer to the origin, the underlying motivation being that wrist movements are less costly than shoulder movements for humans, cf. [21].
$$\phi_{t_N} = 10^4\,\|x_{t_N} - g\|^2 + \max(q_{t_N}) \qquad \text{(1, terminal cost)}$$
$$r_t = 10^{-5}\,\frac{\sum_{m=1}^{M}(M+1-m)\,\ddot{q}_{t,m}^2}{\sum_{m=1}^{M}(M+1-m)} \qquad \text{(2, immediate cost)}$$

Duration of the movement: 0.5 s. Initially, $\theta = 0$ (no movement).
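The policy representation lends itself to a short sketch. The Python below is an illustration rather than the authors' implementation; the 10 DOFs, the 0.5 s duration, and the zero initialization come from the slide, while the number of basis functions B = 5 and the basis width are assumed values chosen only for the example.

```python
import numpy as np

def basis_activations(t, centers, width):
    """Normalized Gaussian basis functions g(t): Psi_b(t) = exp(-(t - c_b)^2 / w^2),
    divided by the sum over all basis functions so the activations sum to 1."""
    psi = np.exp(-((t - centers) ** 2) / width ** 2)
    return psi / psi.sum()

def joint_accelerations(t, theta, centers, width):
    """Acceleration of every joint at time t: qdd_{m,t} = g(t)^T theta_m.
    theta has shape (M joints, B basis functions)."""
    return theta @ basis_activations(t, centers, width)

# Assumed setup: 10-DOF arm, 0.5 s movement (from the slide); B and width are
# illustrative choices, not values from the presentation.
M, B, duration = 10, 5, 0.5
centers = np.linspace(0.0, duration, B)
width = duration / B
theta = np.zeros((M, B))        # theta = 0: no movement, as in the slide
print(joint_accelerations(0.25, theta, centers, width))   # all zeros initially
```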
PI2-CMAES: Reward-Weighted Averaging

Sample K parameter vectors around the current parameters, evaluate their costs, map the costs to probabilities, and update the parameters by reward-weighted averaging:
$$\theta_{k=1\ldots K} \sim \mathcal{N}(\theta, \Sigma)$$
$$\forall k:\quad J_k = J(\theta_k)$$
$$\forall k:\quad P_k = \frac{\exp\!\left(-h\,\dfrac{J_k - \min_k J_k}{\max_k J_k - \min_k J_k}\right)}{\sum_{l=1}^{K}\exp\!\left(-h\,\dfrac{J_l - \min_k J_k}{\max_k J_k - \min_k J_k}\right)}$$
$$\theta^{\text{new}} = \sum_{k=1}^{K} P_k\,\theta_k$$
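As a rough illustration of these update equations (not the authors' implementation), the sketch below performs one reward-weighted averaging step in Python. The eliteness parameter h = 10, the sample count K, the small epsilon guarding against a zero cost range, and the toy quadratic cost are assumptions made only for the example.

```python
import numpy as np

def reward_weighted_averaging(theta, Sigma, cost_fn, K=20, h=10.0, rng=None):
    """One update following the slide: sample K parameter vectors, evaluate their
    costs, convert costs to probabilities P_k, and return the weighted average."""
    rng = np.random.default_rng() if rng is None else rng
    samples = rng.multivariate_normal(theta, Sigma, size=K)   # theta_k ~ N(theta, Sigma)
    J = np.array([cost_fn(th) for th in samples])             # J_k = J(theta_k)
    J_norm = (J - J.min()) / (J.max() - J.min() + 1e-10)      # normalize costs to [0, 1]
    P = np.exp(-h * J_norm)                                   # low cost -> high weight
    P /= P.sum()
    return P @ samples                                        # theta_new = sum_k P_k theta_k

# Toy usage on an assumed quadratic cost: the mean moves towards the optimum.
theta, Sigma = np.array([2.0, -1.0]), 0.5 * np.eye(2)
for _ in range(30):
    theta = reward_weighted_averaging(theta, Sigma, lambda th: float(np.sum(th ** 2)))
print(theta)   # approaches [0, 0]
```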
PI2-CMAES: Reward-Weighted Averaging and Covariance Matrix Adaptation

The same probabilities P_k also update the exploration covariance ("with CMA"), instead of keeping Σ fixed ("without CMA"):
$$\theta_{k=1\ldots K} \sim \mathcal{N}(\theta, \Sigma)$$
$$\forall k:\quad J_k = J(\theta_k)$$
$$\forall k:\quad P_k = \frac{\exp\!\left(-h\,\dfrac{J_k - \min_k J_k}{\max_k J_k - \min_k J_k}\right)}{\sum_{l=1}^{K}\exp\!\left(-h\,\dfrac{J_l - \min_k J_k}{\max_k J_k - \min_k J_k}\right)}$$
$$\theta^{\text{new}} = \sum_{k=1}^{K} P_k\,\theta_k$$
$$\Sigma^{\text{new}} = \sum_{k=1}^{K} P_k\,(\theta_k - \theta)(\theta_k - \theta)^\top$$
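A minimal sketch of the covariance update on this slide, assumed rather than taken from the authors' code: the same probabilities P_k that average the parameters also re-estimate the exploration covariance from the weighted deviations. The full PI2-CMAES algorithm described in Stulp and Sigaud (2012) contains further details not shown here.

```python
import numpy as np

def rwa_update(theta, Sigma, cost_fn, K=20, h=10.0, rng=None):
    """Reward-weighted averaging of both mean and covariance, as on the slide:
    theta_new = sum_k P_k theta_k,
    Sigma_new = sum_k P_k (theta_k - theta)(theta_k - theta)^T."""
    rng = np.random.default_rng() if rng is None else rng
    samples = rng.multivariate_normal(theta, Sigma, size=K)
    J = np.array([cost_fn(th) for th in samples])
    J_norm = (J - J.min()) / (J.max() - J.min() + 1e-10)
    P = np.exp(-h * J_norm)
    P /= P.sum()
    theta_new = P @ samples
    diffs = samples - theta                       # deviations from the *old* mean
    Sigma_new = (P[:, None] * diffs).T @ diffs    # weighted sum of outer products
    return theta_new, Sigma_new
```

Because the weights P_k favour low-cost samples, the new covariance stretches along parameter directions whose perturbation lowered the cost and shrinks along directions that made little difference; this adaptive exploration is what produces the emergent maturation schedule shown in the results.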
Application to Reaching (recap)
Task: reach to a goal in the workspace with the 10-DOF kinematically simulated 'arm' in the 2-D plane, using the policy representation and the cost function (terminal cost (1), immediate cost (2)) defined above. The initial joint angles are as depicted in Fig. 3, with zero speed.
Duration of the movement: 0.5 s. Initially, θ = 0 (no movement).
PI2-CMAES parameters: K = 20.
Fig. 3: Visualization of the reaching motion (after learning) for 'goal 1' and 'goal 2'.
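For concreteness, a small sketch (assumed, not the authors' code) of the terminal cost (1) and the immediate cost (2) from the earlier slide. The example joint angles, accelerations, and end-effector position are made up; the goals g1 and g2 are the ones from the slide.

```python
import numpy as np

def terminal_cost(x_tN, q_tN, goal):
    """phi_tN = 1e4 * ||x_tN - g||^2 + max(q_tN): distance to the goal plus a
    comfort term penalizing the largest joint angle at the end of the movement."""
    return 1e4 * float(np.sum((x_tN - goal) ** 2)) + float(np.max(q_tN))

def immediate_cost(qdd_t):
    """r_t = 1e-5 * sum_m (M+1-m) qdd_{t,m}^2 / sum_m (M+1-m): acceleration
    penalty weighting proximal joints (small m) more heavily than distal ones."""
    M = len(qdd_t)
    weights = (M + 1) - np.arange(1, M + 1)     # (M+1-m) for m = 1..M
    return 1e-5 * float(np.sum(weights * qdd_t ** 2) / np.sum(weights))

g1, g2 = np.array([0.0, 0.5]), np.array([0.0, 0.25])            # goals from the slide
print(terminal_cost(np.array([0.05, 0.45]), np.zeros(10), g1))  # mostly distance cost
print(immediate_cost(np.ones(10)))                              # small acceleration cost
```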
Results: Re-adaptation to Changing Tasks
⇒ Life-long continual reinforcement learning with automatic exploration/exploitation trade-off
Results: Emergent Proximo-Distal Maturation
[Figure: exploration magnitude of individual DOFs over time, with a control condition for comparison]
⇒ Emergent proximo-distal maturation
Interpretation
• Stochastic optimization directs itself by following an approximate, smoothed gradient/curvature, i.e. it fosters exploration in directions with a large impact on the cost function and reduces exploration in directions with a small impact.
• The arm structure is such that (1) initially the proximal joints have more impact on the cost function than the distal ones, and (2) this relative impact changes as the optimization approaches the optimum of the cost function.
⇒ Emergent maturation is a property of the combination of the structure of the cost function (which depends on the body structure) and adaptive exploration in stochastic optimization.
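The per-DOF exploration curves in the results can be read off the exploration covariance. The slides do not state which statistic is plotted, so the sketch below is a hypothetical choice: assuming all joints' basis-function weights are stacked into one parameter vector with a single covariance matrix, it reports the largest eigenvalue of each joint's block as that joint's exploration magnitude.

```python
import numpy as np

def exploration_magnitude_per_dof(Sigma, M, B):
    """Hypothetical per-joint exploration magnitude: the largest eigenvalue of the
    B x B block of Sigma belonging to joint m (parameters stacked joint by joint)."""
    mags = np.empty(M)
    for m in range(M):
        block = Sigma[m * B:(m + 1) * B, m * B:(m + 1) * B]
        mags[m] = np.linalg.eigvalsh(block).max()
    return mags

# Example: with an identity covariance, every joint starts with the same magnitude.
M, B = 10, 5
print(exploration_magnitude_per_dof(np.eye(M * B), M, B))   # ten ones
```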
References
Baranes, A. and Oudeyer, P.-Y. (2011). The interaction of maturational constraints and intrinsic motivations in active motor development. In Proceedings of the IEEE International Conference on Development and Learning (ICDL-Epirob), Frankfurt, Germany.
Berthouze, L. and Lungarella, M. (2004). Motor skill acquisition under environmental perturbations: On the necessity of alternate freezing and freeing degrees of freedom. Adaptive Behavior, 12(1):47-63.
Hansen, N. and Ostermeier, A. (2001). Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159-195.
Ivanchenko, V. and Jacobs, R. A. (2003). A developmental approach aids motor learning. Neural Computation, 15:2051-2065.
Stulp, F. and Oudeyer, P.-Y. (2012). Emergent proximo-distal maturation through adaptive exploration. In International Conference on Development and Learning (ICDL). Accepted for publication.
Stulp, F. and Sigaud, O. (2012). Path integral policy improvement with covariance matrix adaptation. In Proceedings of the 29th International Conference on Machine Learning (ICML).
Theodorou, E., Buchli, J., and Schaal, S. (2010). A generalized path integral control approach to reinforcement learning. Journal of Machine Learning Research, 11:3137-3181.