Emergent Proximo-Distal Maturation through Adaptive Exploration
Freek Stulp and Pierre-Yves Oudeyer
FLOWERS team, ENSTA-ParisTech and Inria, France
http://flowers.inria.fr
Maturational skill learning: freeing and freezing of motor DOFs in humans
Reaching
Skiing
Control
(Bjorklund, 1997; Turkewitz and Kenny, 1985)
(Bernstein, 1967; Vereijken et al., 1992)
Motor maturation for efficient skill learning in robots
One task: swinging (Berthouze and Lungarella, 2004; see also Ivanchenko and Jacobs, 2003)
MATURATION: freeing and freezing of DOFs, with a fixed or random schedule
OPTIMIZATION: reinforcement learning (simple RL/optimization), to learn one skill
Adaptive maturation for skill learning in robots
Many tasks: omnidirectional locomotion
MATURATION: adaptive freeing of DOFs, driven by an adaptive clock
INTRINSIC MOTIVATION: competence progress, to learn a field of skills
OPTIMIZATION: reinforcement learning (simple local RL/optimization)
McSAGG-RIAC (Baranes and Oudeyer, 2011)
Initial goal: adaptive maturation controlled by PI2-CMAES
Task: reach to a goal in the workspace
10-DOF kinematically simulated 'arm' in a 2-D plane
One task: reaching
MATURATION: freeing and freezing of DOFs, driven by an adaptive clock
OPTIMIZATION: reinforcement learning (state-of-the-art RL: PI2-CMAES)
What we got: emergent maturation from PI2-CMAES
Task: reach to a goal in the workspace
10-DOF kinematically simulated 'arm' in a 2-D plane
One task: reaching
MATURATION: freeing and freezing of DOFs, with an emergent schedule
OPTIMIZATION: reinforcement learning (state-of-the-art episodic RL: PI2-CMAES)
PI2-CMAES: Policy Improvement with Path Integrals and Covariance Matrix Adaptation
PI2: Policy Improvement with Path Integrals (Theodorou et al., 2010)
CMA-ES: Covariance Matrix Adaptation (Hansen and Ostermeier, 2001)
PI2-CMAES (Stulp and Sigaud, 2012): Policy Improvement with Path Integrals and Covariance Matrix Adaptation, combining PI2 with CMA-ES
Up next: explain reward-weighted averaging, then covariance matrix adaptation
Application to Reaching
Task: reach to a goal in the workspace with a 10-DOF kinematically simulated 'arm' in a 2-D plane. The initial joint angles are as depicted in Fig. 3, with zero speed.
Fig. 3: Visualization of the reaching motion (after learning) for 'goal 1' and 'goal 2'.

Policy representation: the acceleration of the m-th joint at time t is a linear combination of normalized basis functions, where the parameter vector $\boldsymbol{\theta}_m$ holds the weights of joint m:
$$\ddot{q}_{m,t} = \mathbf{g}(t)^\top \boldsymbol{\theta}_m \qquad \text{(acceleration of joint } m\text{)}$$
$$[\mathbf{g}(t)]_b = \frac{\Psi_b(t)}{\sum_{b'=1}^{B}\Psi_{b'}(t)} \quad \text{with} \quad \Psi_b(t) = \exp\!\left(-(t-c_b)^2/w^2\right) \qquad \text{(basis functions)}$$

Cost function: the terminal cost (1) penalizes the distance $\|x_{t_N} - g\|$ between the 2-D Cartesian coordinates of the end-effector $x_{t_N}$ and the goal ($g_1 = [0.0\ 0.5]$ or $g_2 = [0.0\ 0.25]$) at the end of the movement $t_N$. It also penalizes the joint with the largest angle, $\max(q_{t_N})$, expressing a comfort effect, with maximum comfort at the initial position. The immediate cost $r_t$ at each time step (2) penalizes joint accelerations; the weighting term $(M+1-m)$ penalizes DOFs closer to the origin, the underlying motivation being that wrist movements are less costly than shoulder movements for humans, cf. [21].
$$\phi_{t_N} = 10^4\,\|x_{t_N} - g\|^2 + \max(q_{t_N}) \qquad \text{(1, terminal cost)}$$
$$r_t = 10^{-5}\,\frac{\sum_{m=1}^{M}(M+1-m)\,\ddot{q}_{t,m}^2}{\sum_{m=1}^{M}(M+1-m)} \qquad \text{(2, immediate cost)}$$

Duration of the movement: 0.5 s. Initially, $\theta = 0$ (no movement).
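The policy representation lends itself to a short sketch. The Python below is an illustration rather than the authors' implementation; the 10 DOFs, the 0.5 s duration, and the zero initialization come from the slide, while the number of basis functions B = 5 and the basis width are assumed values chosen only for the example.

```python
import numpy as np

def basis_activations(t, centers, width):
    """Normalized Gaussian basis functions g(t): Psi_b(t) = exp(-(t - c_b)^2 / w^2),
    divided by the sum over all basis functions so the activations sum to 1."""
    psi = np.exp(-((t - centers) ** 2) / width ** 2)
    return psi / psi.sum()

def joint_accelerations(t, theta, centers, width):
    """Acceleration of every joint at time t: qdd_{m,t} = g(t)^T theta_m.
    theta has shape (M joints, B basis functions)."""
    return theta @ basis_activations(t, centers, width)

# Assumed setup: 10-DOF arm, 0.5 s movement (from the slide); B and width are
# illustrative choices, not values from the presentation.
M, B, duration = 10, 5, 0.5
centers = np.linspace(0.0, duration, B)
width = duration / B
theta = np.zeros((M, B))        # theta = 0: no movement, as in the slide
print(joint_accelerations(0.25, theta, centers, width))   # all zeros initially
```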
PI2-CMAES: Reward-Weighted Averaging

Sample K parameter vectors around the current parameters, evaluate their costs, map the costs to probabilities, and update the parameters by reward-weighted averaging:
$$\theta_{k=1\ldots K} \sim \mathcal{N}(\theta, \Sigma)$$
$$\forall k:\quad J_k = J(\theta_k)$$
$$\forall k:\quad P_k = \frac{\exp\!\left(-h\,\dfrac{J_k - \min_k J_k}{\max_k J_k - \min_k J_k}\right)}{\sum_{l=1}^{K}\exp\!\left(-h\,\dfrac{J_l - \min_k J_k}{\max_k J_k - \min_k J_k}\right)}$$
$$\theta^{\text{new}} = \sum_{k=1}^{K} P_k\,\theta_k$$
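As a rough illustration of these update equations (not the authors' implementation), the sketch below performs one reward-weighted averaging step in Python. The eliteness parameter h = 10, the sample count K, the small epsilon guarding against a zero cost range, and the toy quadratic cost are assumptions made only for the example.

```python
import numpy as np

def reward_weighted_averaging(theta, Sigma, cost_fn, K=20, h=10.0, rng=None):
    """One update following the slide: sample K parameter vectors, evaluate their
    costs, convert costs to probabilities P_k, and return the weighted average."""
    rng = np.random.default_rng() if rng is None else rng
    samples = rng.multivariate_normal(theta, Sigma, size=K)   # theta_k ~ N(theta, Sigma)
    J = np.array([cost_fn(th) for th in samples])             # J_k = J(theta_k)
    J_norm = (J - J.min()) / (J.max() - J.min() + 1e-10)      # normalize costs to [0, 1]
    P = np.exp(-h * J_norm)                                   # low cost -> high weight
    P /= P.sum()
    return P @ samples                                        # theta_new = sum_k P_k theta_k

# Toy usage on an assumed quadratic cost: the mean moves towards the optimum.
theta, Sigma = np.array([2.0, -1.0]), 0.5 * np.eye(2)
for _ in range(30):
    theta = reward_weighted_averaging(theta, Sigma, lambda th: float(np.sum(th ** 2)))
print(theta)   # approaches [0, 0]
```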
PI2-CMAES: Reward-Weighted Averaging and Covariance Matrix Adaptation

The same probabilities P_k also update the exploration covariance ("with CMA"), instead of keeping Σ fixed ("without CMA"):
$$\theta_{k=1\ldots K} \sim \mathcal{N}(\theta, \Sigma)$$
$$\forall k:\quad J_k = J(\theta_k)$$
$$\forall k:\quad P_k = \frac{\exp\!\left(-h\,\dfrac{J_k - \min_k J_k}{\max_k J_k - \min_k J_k}\right)}{\sum_{l=1}^{K}\exp\!\left(-h\,\dfrac{J_l - \min_k J_k}{\max_k J_k - \min_k J_k}\right)}$$
$$\theta^{\text{new}} = \sum_{k=1}^{K} P_k\,\theta_k$$
$$\Sigma^{\text{new}} = \sum_{k=1}^{K} P_k\,(\theta_k - \theta)(\theta_k - \theta)^\top$$
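A minimal sketch of the covariance update on this slide, assumed rather than taken from the authors' code: the same probabilities P_k that average the parameters also re-estimate the exploration covariance from the weighted deviations. The full PI2-CMAES algorithm described in Stulp and Sigaud (2012) contains further details not shown here.

```python
import numpy as np

def rwa_update(theta, Sigma, cost_fn, K=20, h=10.0, rng=None):
    """Reward-weighted averaging of both mean and covariance, as on the slide:
    theta_new = sum_k P_k theta_k,
    Sigma_new = sum_k P_k (theta_k - theta)(theta_k - theta)^T."""
    rng = np.random.default_rng() if rng is None else rng
    samples = rng.multivariate_normal(theta, Sigma, size=K)
    J = np.array([cost_fn(th) for th in samples])
    J_norm = (J - J.min()) / (J.max() - J.min() + 1e-10)
    P = np.exp(-h * J_norm)
    P /= P.sum()
    theta_new = P @ samples
    diffs = samples - theta                       # deviations from the *old* mean
    Sigma_new = (P[:, None] * diffs).T @ diffs    # weighted sum of outer products
    return theta_new, Sigma_new
```

Because the weights P_k favour low-cost samples, the new covariance stretches along parameter directions whose perturbation lowered the cost and shrinks along directions that made little difference; this adaptive exploration is what produces the emergent maturation schedule shown in the results.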
Application to Reaching (recap)
Task: reach to a goal in the workspace with the 10-DOF kinematically simulated 'arm' in the 2-D plane, using the policy representation and the cost function (terminal cost (1), immediate cost (2)) defined above. The initial joint angles are as depicted in Fig. 3, with zero speed.
Duration of the movement: 0.5 s. Initially, θ = 0 (no movement).
PI2-CMAES parameters: K = 20.
Fig. 3: Visualization of the reaching motion (after learning) for 'goal 1' and 'goal 2'.
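For concreteness, a small sketch (assumed, not the authors' code) of the terminal cost (1) and the immediate cost (2) from the earlier slide. The example joint angles, accelerations, and end-effector position are made up; the goals g1 and g2 are the ones from the slide.

```python
import numpy as np

def terminal_cost(x_tN, q_tN, goal):
    """phi_tN = 1e4 * ||x_tN - g||^2 + max(q_tN): distance to the goal plus a
    comfort term penalizing the largest joint angle at the end of the movement."""
    return 1e4 * float(np.sum((x_tN - goal) ** 2)) + float(np.max(q_tN))

def immediate_cost(qdd_t):
    """r_t = 1e-5 * sum_m (M+1-m) qdd_{t,m}^2 / sum_m (M+1-m): acceleration
    penalty weighting proximal joints (small m) more heavily than distal ones."""
    M = len(qdd_t)
    weights = (M + 1) - np.arange(1, M + 1)     # (M+1-m) for m = 1..M
    return 1e-5 * float(np.sum(weights * qdd_t ** 2) / np.sum(weights))

g1, g2 = np.array([0.0, 0.5]), np.array([0.0, 0.25])            # goals from the slide
print(terminal_cost(np.array([0.05, 0.45]), np.zeros(10), g1))  # mostly distance cost
print(immediate_cost(np.ones(10)))                              # small acceleration cost
```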
Results: Re-adaptation to Changing Tasks
⇒ Life-long continual reinforcement learning with automatic exploration/exploitation trade-off
Results: Emergent Proximo-Distal Maturation
[Figure: exploration magnitude of individual DOFs over time, with a control condition for comparison]
⇒ Emergent proximo-distal maturation
Interpretation
• Stochastic optimization directs itself by following an approximate, smoothed gradient/curvature, i.e. it fosters exploration in directions with a large impact on the cost function and reduces exploration in directions with a small impact.
• The arm structure is such that (1) initially the proximal joints have more impact on the cost function than the distal ones, and (2) this relative impact changes as the optimization approaches the optimum of the cost function.
⇒ Emergent maturation is a property of the combination of the structure of the cost function (which depends on the body structure) and adaptive exploration in stochastic optimization.
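The per-DOF exploration curves in the results can be read off the exploration covariance. The slides do not state which statistic is plotted, so the sketch below is a hypothetical choice: assuming all joints' basis-function weights are stacked into one parameter vector with a single covariance matrix, it reports the largest eigenvalue of each joint's block as that joint's exploration magnitude.

```python
import numpy as np

def exploration_magnitude_per_dof(Sigma, M, B):
    """Hypothetical per-joint exploration magnitude: the largest eigenvalue of the
    B x B block of Sigma belonging to joint m (parameters stacked joint by joint)."""
    mags = np.empty(M)
    for m in range(M):
        block = Sigma[m * B:(m + 1) * B, m * B:(m + 1) * B]
        mags[m] = np.linalg.eigvalsh(block).max()
    return mags

# Example: with an identity covariance, every joint starts with the same magnitude.
M, B = 10, 5
print(exploration_magnitude_per_dof(np.eye(M * B), M, B))   # ten ones
```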
References
Baranes, A. and Oudeyer, P.-Y. (2011). The interaction of maturational constraints and intrinsic motivations in active motor development. In Proceedings of the IEEE International Conference on Development and Learning (ICDL-Epirob), Frankfurt, Germany.
Berthouze, L. and Lungarella, M. (2004). Motor skill acquisition under environmental perturbations: On the necessity of alternate freezing and freeing degrees of freedom. Adaptive Behavior, 12(1):47-63.
Hansen, N. and Ostermeier, A. (2001). Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159-195.
Ivanchenko, V. and Jacobs, R. A. (2003). A developmental approach aids motor learning. Neural Computation, 15:2051-2065.
Stulp, F. and Oudeyer, P.-Y. (2012). Emergent proximo-distal maturation through adaptive exploration. In International Conference on Development and Learning (ICDL). Accepted for publication.
Stulp, F. and Sigaud, O. (2012). Path integral policy improvement with covariance matrix adaptation. In Proceedings of the 29th International Conference on Machine Learning (ICML).
Theodorou, E., Buchli, J., and Schaal, S. (2010). A generalized path integral control approach to reinforcement learning. Journal of Machine Learning Research, 11:3137-3181.