Recurrent neural associative learning of forward and inverse kinematics for movement generation of the redundant PA-10 robot

R. Felix Reinhart and Jochen J. Steil
Research Institute for Cognition and Robotics - CoR-Lab, Bielefeld University
[email protected], [email protected], www.cor-lab.de
Abstract— We present a connectionist approach to learning forward and redundant inverse kinematics in a single recurrent network. The network architecture extends the reservoir computing idea, i.e. reading out the state of a fixed dynamic system, to an associative setting that learns the forward and backward mapping simultaneously. For output learning we use efficient Backpropagation-Decorrelation learning, while the recurrent dynamics is adjusted by an unsupervised, biologically inspired learning rule based on intrinsic plasticity. Including linear connections between input and output allows training the network for autonomous movement generation. We show results for the 7-DOF redundant PA-10 robot arm in simulation.
I. INTRODUCTION

The computational power and dynamical properties of artificial recurrent neural networks continuously trigger new enthusiasm for neurally inspired learning techniques, despite the fact that recurrent learning is an algorithmically and computationally challenging task. One important field of application is sophisticated intelligent robots behaving in the real world. Interactive settings involving humans demand real-time and online learning in a priori unknown situations to cope with dynamic and uncertain environments [14]. Online acquisition of new skills is a crucial requirement for such systems. In this context, online learning of time-variant patterns, e.g. motions of a robot arm, by imitation and in interaction is of special interest, and there are many neural network based approaches to this problem ([10], [24], [20], [19]), in particular those exploring similarities to the cerebellar cortex [25]. Motion generation is a challenging task for artificial neural networks, although the brain generates such movement patterns without any difficulty. It is well known that the cerebellum is used to store such movement models, and there are efforts to model its capabilities. We do not aim at directly modelling a cerebellar network here, but this biological background motivates the search for unified models for learning kinematic transformations. We argue that the recent approach to recurrent networks known as reservoir computing offers an appealing way to model such motor learning. Reservoir computing uses a fixed network of analog or spiking neurons together with special readout neurons and has been introduced under the notions of the liquid state machine (LSM) [12], echo state
network (ESN) [7], [8], or, in an online learning setting, backpropagation-decorrelation learning [17]. The fundamental idea is to create a "kernel-like" transformation in the temporal domain by means of a fixed recurrent reservoir network, which provides a nonlinear temporal transformation of its input by mapping it into the high-dimensional network state space. We show in this paper that a reservoir network can learn complex forward and inverse models simultaneously in the same network by extending a standard ESN-like architecture with a backward learning path that reconstructs the input from the output. This creates an associative capability to connect input and output in both directions. Note that our approach differs from Hopfield networks and their attractor-based schemes for content addressable memory in targeting dynamic network behavior instead. The potential of our approach obviously depends on the properties and the quality of the encoding of the input signal in the dynamic reservoir. To improve the coding of temporal memory in the network, we also use a biologically inspired learning rule based on the principle of neural intrinsic plasticity first introduced in [21]. It changes the neurons' gains and biases in order to optimize information transmission in the inner reservoir neurons [18]. Learning of inverse models, and in particular of movements, has been recognized as a key ingredient towards more intelligent robots and learning by imitation [14]. Learning of inverse models as a combination of locally linear low-dimensional models has been demonstrated very successfully in [2], [23]. A similar weighted regression scheme has also been used in [4] to learn goal-directed movements based on parametrizing a predefined dynamical system, or nonlinear oscillators for rhythmic movements [5]. A particularly interesting distributed approach to learn sensori-motor mappings has been developed in [6], [19] under the notion of the parametric bias recurrent neural network. To qualitatively recognize learned movements or to autonomously generate movements, outputs of the network are recursively connected to the input, which is also the standard approach for generating complex dynamical behavior in recursive time series prediction ([8], [18], [10]). All these approaches do not consider inverting the input-output mapping
Fig. 1. The extended BPDC network forms a forward and backward associative architecture.
in a generative fashion, a much discussed concept in the visual learning domain [3]. We show that our approach is capable of learning both directions of a mapping and of generating movement trajectories in a single recurrent model.

II. RESERVOIR LEARNING AND INTRINSIC PLASTICITY

In the following, we describe the network learning architecture as depicted in Fig. 1. It comprises a recurrent reservoir network of nonlinear Fermi neurons and a linear subsystem directly interconnecting input and output nodes. We consider the recurrent network dynamics

x(k+1) = (1 − Δt) x(k) + Δt W^net y(k),   (1)
y(k) = f(x(k)),   (2)

where for small Δt the continuous-time dynamics is approximated¹. x_i, i = 1 ... N, are the neural activations. To emphasize the functional division into output and internal neurons, we split up the state vector x such that

x(k) = (x_1(k) ... x_l(k), x_{l+1}(k) ... x_{l+m}(k), x_{l+m+1}(k) ... x_N(k))^T

for l input neurons, m reservoir neurons, and n output neurons, where N = l + m + n is the network size. Further define I = {1, ..., l} and O = {l+m+1, ..., N} as the sets of neuron indices for input and output nodes, respectively. y = f(x) is the vector of neuron activities obtained by applying parametrized activation functions f component-wise to the vector x. The input and output nodes have the identity as activation function, i.e. they are linear nodes. The internal reservoir neurons are activated by Fermi functions

y_i = f_i(x_i, a_i, b_i) = (1 + exp(−a_i x_i − b_i))^{−1}.   (3)
¹ With a small abuse of notation we also interpret time arguments and indices (k+1) as (k+1)Δt.
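As a minimal illustration of the discrete-time dynamics (1)-(3), consider the following Python sketch; the helper names, network sizes, and random initialization are our own assumptions and only serve to fix the notation, they are not the authors' implementation.

import numpy as np

def activities(x, a, b, lin_idx):
    # Eqs. (2), (3): Fermi activities; input/output nodes are linear (identity)
    y = 1.0 / (1.0 + np.exp(-a * x - b))
    y[lin_idx] = x[lin_idx]
    return y

def network_step(x, W_net, a, b, lin_idx, dt=0.05):
    # Eq. (1): leaky state update driven by the full network matrix W_net
    return (1.0 - dt) * x + dt * (W_net @ activities(x, a, b, lin_idx))

# Illustrative sizes: l = 3 input, m = 200 reservoir, n = 7 output neurons
l, m, n = 3, 200, 7
N = l + m + n
rng = np.random.default_rng(0)
W_net = rng.uniform(-0.05, 0.05, (N, N))
lin_idx = np.r_[0:l, l + m:N]       # indices of the linear input/output nodes
x = np.zeros(N)
a, b = np.ones(N), np.zeros(N)      # Fermi gains and biases
x = network_step(x, W_net, a, b, lin_idx)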
Respecting the network structure, we obtain a subdivision of the network matrix W^net into sub-matrices

          ( Ŵ^inp_inp   Ŵ^inp_res   Ŵ^inp_out )
W^net =   ( W^res_inp    W^res_res    W^res_out  ),   (4)
          ( Ŵ^out_inp   Ŵ^out_res   Ŵ^out_out )
where we denote by W^⋆_? the connections from ? to ⋆, using inp for input, out for output, and res for inner reservoir neurons. Note that this approach differs from standard reservoir methods by including recurrent connections from the output neurons into the reservoir and to the input nodes. Further, the direct connections from input to output and vice versa constitute a bidirectional linear mapping between input and output neurons. This enables simultaneous learning of a mapping and its inverse. It is important that only the connections Ŵ^inp_⋆ and Ŵ^out_⋆ in (4) projecting to the input and output neurons are trained by error correction. All nine submatrices can have different connectivity and be subject to varying initializations. We use a sparse reservoir matrix W^res_res keeping 20% randomly selected connections, which are initialized in [−0.05, 0.05]. All other matrices are fully occupied.

A. Intrinsic plasticity reservoir adaptation

Originally, LSM, echo state, and BPDC networks assume that the reservoir is fixed beforehand, i.e. W^res_res is constant and not changed during training. Thus performance depends on the initialization of the reservoir, and a recently much discussed topic is how to optimize a reservoir for a given task [9], [13], [16], [18]. To address this problem and improve the reservoir online, we use an adaptation rule introduced by Triesch ([21]) and adopted for reservoir optimization in [18]. It is an unsupervised, efficient self-adaptation rule that optimizes information transmission of the reservoir neurons by adapting the parameters a_i, b_i of the Fermi transfer functions (3) used in the network equation. The idea is to adjust the parameters a_i, b_i in order to minimize the Kullback-Leibler divergence of the neuron's output distribution with respect to an exponential distribution with a desired mean activity level. More formally, denote by F_x(x) and F_y(y) the input and output distributions of the Fermi neuron with y = f(x, a, b), where F_x(x) is determined by the statistical properties of the net input the neuron receives in the operating network. Let F_exp(y) = 1/µ exp(−y/µ) be the exponential distribution with average activity rate µ, which later appears as a global parameter of the learning process. Then the Kullback-Leibler divergence D between F_y() and F_exp() can be written as ([21])

D(F_y, F_exp) = −H(y) + (1/µ) E(y) + log(µ),

where H(y) is the entropy of y and E(y) its expected value. Further calculation shows that we can obtain the following
online gradient rule ([21]) to adapt the parameters a, b with learning rate ζ at time step k:

Δb(k) = ζ (1 − (2 + 1/µ) y(k) + (1/µ) y(k)²),   (5)
Δa(k) = ζ/a(k) + x(k) Δb(k).   (6)

We call the joint application of rules (5), (6) to the reservoir neurons intrinsic plasticity (IP) learning. It is local in time and space and therefore efficient to compute, and it is applied at each time step k after iteration of the network dynamics (1), (2), as shown in Algorithm 1.

B. Output learning by Backpropagation-Decorrelation

The input and output weights are trained by the backpropagation-decorrelation (BPDC) learning rule, a supervised online training scheme which has been introduced to reservoir computing in [17]. BPDC copes with the feedback from the output to the internal neurons as well as with the linear connections by means of a unified adaptation rule. BPDC roots in a constrained optimization formalism first introduced in [1], which has been denoted virtual teacher forcing in [15]. The basic idea is to start with the teacher signal d(k) = (d_1(k), ..., d_l(k), ..., d_{l+m+1}(k), ..., d_{l+m+n}(k))^T, which needs to be defined for input and output nodes only, and to accumulate the standard quadratic error E over T time steps k = 1 ... T as

E = Σ_{k=1}^{T} Σ_{s∈I∪O} (d_s(k) − x_s(k))².   (7)
First, the gradient of E with respect to the states is computed to obtain a desired change in the state variables, Δx(k) ∼ −∂E/∂x(k). Then the network equations are used as constraints to compute an actual weight change Δw(k) that realizes this Δx(k), which can be understood as a virtual teacher signal. A formalism to derive the corresponding Δw(k) was given in [17] and leads to the following BPDC rule [17] for all adaptive connections (Ŵ^inp_⋆ and Ŵ^out_⋆):

Δw_ij(k) = η̄(k) y_j(k−1) γ_i(k),   (8)

where i ∈ I ∪ O, j ∈ {1, ..., N}. The γ-term

γ_i(k) = e_i(k) − Σ_s ((1 − Δt) δ_is + Δt w_is f′(x_s(k))) e_s(k−1)
Fig. 2. PA-10 multipurpose robot arm. Redundant positions lie on a redundancy circle for the elbow (left). Circular motion of the PA-10 robot arm in simulation (right).
with s ∈ I if i ∈ I, or s ∈ O if i ∈ O, consists of the local error e_i(k) = d_i(k) − x_i(k), i ∈ I ∪ O, and a one-step error backpropagation term. The time-dependent parameter

η̄(k) = η / (‖y(k−1)‖² + ε)

can be interpreted as a time- and context-dependent learning rate. It scales the BPDC learning rate η by a factor dependent on the overall network activity and a small regularization constant ε, e.g. ε = 0.002. Conceptually, BPDC learning constitutes an error correction rule with time-dependent learning rate η̄(k), input predicate y_j(k) = f_j(x_j(k)), and modified error γ_i(k). BPDC learning is a supervised learning technique and has linear complexity O(N) in the number of neurons N. Thus, the combination of IP reservoir adaptation and BPDC learning stays in the O(N) complexity range, such that the forward iteration of the recurrent network (1), (2) becomes the computationally limiting factor (O(N²) for a fully connected network, i.e. W^res_res with 100% connectivity). In fact, our implementation can use sparse matrix multiplication, such that the overall execution and training of sparse networks with a fixed number of connections per node also scales linearly in the number of nodes. This efficiency renders the scheme well suited for implementation on robots for execution and learning in real time.

Algorithm 1 Online reservoir learning algorithm
Require: set network size, sparsity, initial weights and states
Require: set learn rates µ = 0.2, ζ = 0.001, η = 0.03
 1: for k < iTrainIterations do
 2:   get inputs and targets d(k)
 3:   execute network iteration (1), (2)
 4:   compute errors e_i(k) = d_i(k) − x_i(k)
 5:   apply BPDC rule (8)
 6:   apply IP rules (5), (6)
 7: end for
 8: compute MSQE
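To make Algorithm 1 concrete, here is a minimal Python sketch of the two learning updates, building on the activities/network_step sketch above; the function names, the index handling, and the simplification f′(x_s) = 1 at the linear nodes are our own illustrative choices, not the authors' code.

def bpdc_update(W_net, y_prev, e, e_prev, groups, dt=0.05, eta=0.03, eps=0.002):
    # BPDC rule (8): Delta w_ij(k) = eta_bar(k) y_j(k-1) gamma_i(k),
    # applied to all rows i in I and O (the hatted sub-matrices of Eq. (4))
    eta_bar = eta / (np.dot(y_prev, y_prev) + eps)   # context-dependent rate
    for idx in groups:   # groups = (input indices I, output indices O)
        # gamma_i(k) = e_i(k) - sum_s ((1-dt) delta_is + dt w_is f'(x_s)) e_s(k-1),
        # with s restricted to the same group as i; f'(x_s) = 1 for linear nodes
        M = (1.0 - dt) * np.eye(len(idx)) + dt * W_net[np.ix_(idx, idx)]
        gamma = e[idx] - M @ e_prev[idx]
        W_net[idx, :] += eta_bar * np.outer(gamma, y_prev)
    return W_net

def ip_update(a, b, x, res_idx, zeta=0.001, mu=0.2):
    # IP rules (5), (6), applied to the reservoir neurons only
    xr = x[res_idx]
    y = 1.0 / (1.0 + np.exp(-a[res_idx] * xr - b[res_idx]))
    db = zeta * (1.0 - (2.0 + 1.0 / mu) * y + y**2 / mu)
    a[res_idx] += zeta / a[res_idx] + xr * db
    b[res_idx] += db
    return a, b

Both updates touch only O(N) weights and parameters per step, which is the complexity argument made above.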
III. LEARNING OF FORWARD AND INVERSE MODELS

The proposed network is applied to learn observed motion in joint and task space, featuring completely data-driven learning. We generate training data by projecting a task trajectory specified in Cartesian end-effector coordinates into joint space by means of the inverse kinematics of the redundant Mitsubishi PA-10 (shown in Fig. 2), which is a 7-DOF general purpose robot arm. Starting from the base of the arm, the joints form three virtual spherical joints at locations 2, 4
and 6 in Fig. 2. Since position 2 is tied to the base of the arm and the length of each limb is fixed, the spherical joint 4 has to lie on a redundancy manifold described by a circle of the PA-10's elbow in Cartesian space. We can calculate the corresponding joint angles analytically [11].

Fig. 3. Comparison of standard BPDC and extended BPDC network performance on the forward and inverse kinematics of the PA-10 arm for the circular motion defined by (9), (10) with τ = 0.

Fig. 5. Mean square errors of the EBPDC network performance on the training and test set defined by (9), (10). The test error indicates excellent generalization.

For learning the PA-10 inverse kinematics, we compute for each task space input d_1(k), d_2(k), d_3(k) (with fixed orientation pointing upwards along the z-axis) the 7-dimensional target vector (d_{l+m+1}(k), ..., d_{l+m+7}(k)). Thus the BPDC network in Fig. 1 is set up with 3 input and 7 output neurons.
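To illustrate the elbow redundancy, the following sketch parametrizes the circle on which spherical joint 4 must lie; it is plain triangle geometry rather than the analytic PA-10 solution of [11], and the link lengths and positions are made-up example values.

def elbow_circle(shoulder, wrist, l_upper, l_fore, n_samples=8):
    # All elbow positions at distance l_upper from the shoulder and
    # l_fore from the wrist lie on a circle perpendicular to the
    # shoulder-wrist axis (assumed here not to be exactly vertical).
    u = wrist - shoulder
    d = np.linalg.norm(u)
    u /= d
    a = (l_upper**2 - l_fore**2 + d**2) / (2.0 * d)   # law of cosines
    r = np.sqrt(l_upper**2 - a**2)                    # circle radius
    center = shoulder + a * u
    n1 = np.cross(u, [0.0, 0.0, 1.0])                 # basis of circle plane
    n1 /= np.linalg.norm(n1)
    n2 = np.cross(u, n1)
    phi = np.linspace(0.0, 2.0 * np.pi, n_samples, endpoint=False)
    return center + r * (np.outer(np.cos(phi), n1) + np.outer(np.sin(phi), n2))

# Example values (not the true PA-10 dimensions)
pts = elbow_circle(np.array([0.0, 0.0, 0.3]), np.array([0.3, 0.2, 0.8]), 0.45, 0.45)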
Fig. 4. The training set consists of 101 cyclic sequences morphing from a circle (τ = 0) into a figure eight (τ = 100). Every 10th sequence is shown.
A. Training setup

The training set comprises 101 periodic motions (see Fig. 4) for the PA-10 arm in task and target space. In task space, end-effector positions with fixed orientation are defined by

d_1(k) = 0.15 cos(ω̄k),   (9)
d_2(k) = 0.15 ((100 − τ)/100 sin(ω̄k) + τ/100 sin(2ω̄k)),   (10)

and constant d_3(k) = 0.9. These equations describe a circular motion for τ = 0 that is morphed into a motion with a figure-eight-like shape (τ = 100) as τ increases from 0 to 100. For ω̄ = 2π/100 we obtain patterns with a period of 100 samples. Though simple in task space, the target trajectories involve all joints of the PA-10. Note that it is also particularly difficult for the network to keep d_3(k) constant when reconstructing the input from the output.
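A minimal sketch of how the task-space training set (9), (10) can be generated; the function and variable names are our own.

def task_trajectory(tau, n_samples=100):
    # End-effector positions for morphing parameter tau in [0, 100], Eqs. (9), (10)
    omega = 2.0 * np.pi / n_samples
    k = np.arange(n_samples)
    d1 = 0.15 * np.cos(omega * k)
    d2 = 0.15 * ((100.0 - tau) / 100.0 * np.sin(omega * k)
                 + tau / 100.0 * np.sin(2.0 * omega * k))
    d3 = np.full(n_samples, 0.9)              # constant height
    return np.stack([d1, d2, d3], axis=1)     # shape (n_samples, 3)

# 101 cyclic sequences morphing from circle (tau = 0) to figure eight (tau = 100)
training_set = [task_trajectory(tau) for tau in range(101)]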
B. Results for learning kinematics

We first compare the performance of the extended BPDC (EBPDC) network, which learns the forward and inverse model simultaneously, with two standard networks (Ŵ^inp_⋆ ≡ 0), each learning one model separately. All networks have the same size N = 210 and parameters Δt = 1, η = 0.003, ζ = 0.0001, µ = 0.2. For training we use teacher forcing, i.e. we clamp the input and output nodes to the target values before iterating the network. Training and test errors are computed with respect to the network-generated input and output values. In each learning epoch, all 100 sampled inputs for a single training trajectory are presented. Fig. 3 shows the mean square errors of three networks learning the circular motion, which is defined by equations (9), (10) for τ = 0. Fig. 3 shows an error offset between the forward and inverse kinematics, which reflects the different difficulty of the two tasks, but the errors for the standard and extended architecture differ only slightly from each other. In summary, the EBPDC model learns to associate the motion in task and target space within a single network with the same quality as two separate networks for the two tasks. We further investigate network generalization to unlearned patterns in a 10-fold cross-validation experiment, each time randomly selecting 30 of the 101 training sequences. The remaining 71 sequences serve as test data. The mean square errors are averaged over the 10 trials for training and test sets and plotted in Fig. 5. Excellent generalization is indicated by test errors being almost equal to training errors.

IV. LEARNING OF AUTONOMOUS PATTERN GENERATION

The standard approach to pattern generation in time series prediction networks is to learn to predict the next time step of a pattern. For pattern generation, the trained network is operated in a closed loop by recursively feeding the output back to the input [18], [8], [10]. We show that, in contrast, the EBPDC model enables autonomous pattern generation without recursive prediction, only based on the dynamic properties of
Fig. 6. A simple performance criterion for pattern generation: The network should generate the desired pattern (solid graph) as long as possible within an error bound (dotted graphs). The dash-dotted line shows a fictitious trajectory.
the network and the mutual association of inputs and outputs. This pattern generation ability is therefore more a by-product of the kinematics learning. It is based on the presentation order of the training data rather than being an explicit goal as for predictive methods. We consider three different patterns: an up-and-down motion along the z-axis (d_1(k) = d_2(k) = 0.1, d_3(k) = 0.2 sin(ω̄k)) and the circular and figure-eight-like motions defined in equations (9), (10) for τ = 0 and τ = 100, respectively.

A. Performance criterion for pattern generation

For evaluating stability and accuracy of generated patterns, we use a straightforward performance criterion. The network is required to generate the desired pattern as long as possible with an instantaneous error smaller than ε:

√(Σ_{s∈I} (d_s(k) − x_s(k))²) + √(Σ_{s∈O} (d_s(k) − x_s(k))²) < ε.

Note that here ε bounds the 10-dimensional combined motion. When the network-generated pattern violates the ε-bound, the current time step T is returned. T ∈ ℕ+ is the number of actual network updates, not the virtually elapsed time defined by Δt steps. This testing method is illustrated in Fig. 6.
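A sketch of this stopping criterion under the naming assumptions of the previous snippets; states is assumed to be an iterable of free-running network states x(k), and the targets are full-length vectors with entries at the input/output indices.

def generation_horizon(states, targets, I, O, eps=0.4):
    # Return the step T at which the combined error first violates the bound
    T = 0
    for T, (x, d) in enumerate(zip(states, targets), start=1):
        err = np.linalg.norm(d[I] - x[I]) + np.linalg.norm(d[O] - x[O])
        if err >= eps:
            break
    return T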
B. Pattern generation results

We use EBPDC networks of 500 reservoir neurons with continuous-time dynamics (Δt = 0.05, roughly matching the ω̄ sampling time) and learning parameters η = 0.03, ζ = 0.001. The error bound is set to ε = 0.4, requiring a precision of 4 cm in task space and approximately 3 degrees in target space. In each learning epoch we reset the network state and wash out initial transients by teacher forcing the network for 100 steps. Then the network is trained for 300 steps of the target pattern without injection of the teacher signal, such that purely error-driven learning is accomplished.
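The epoch protocol could look as follows, reusing the hypothetical helpers sketched in Section II (network_step, activities, bpdc_update, ip_update); the array shapes and loop bounds are our own illustrative assumptions.

def train_epoch(W_net, a, b, targets, I, O, res_idx, dt=0.05,
                eta=0.03, zeta=0.001):
    # targets: (400, N) array with teacher values at the input/output indices
    io = np.r_[I, O]
    x = np.zeros(W_net.shape[0])
    for k in range(100):                 # wash-out by teacher forcing
        x[io] = targets[k][io]
        x = network_step(x, W_net, a, b, io, dt)
    e_prev = np.zeros_like(x)
    for k in range(100, 400):            # purely error-driven learning
        y_prev = activities(x, a, b, io)            # y(k-1) for rule (8)
        x = network_step(x, W_net, a, b, io, dt)    # no teacher injection
        e = np.zeros_like(x)
        e[io] = targets[k][io] - x[io]              # local errors e_i(k)
        W_net = bpdc_update(W_net, y_prev, e, e_prev, (I, O), dt, eta)
        a, b = ip_update(a, b, x, res_idx, zeta)
        e_prev = e
    return W_net, a, b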
Fig. 7. Stability results T for different motions: circular, figure-eight like and up-and-down pattern (logarithmic scale).
The time horizon T of the generated patterns is averaged over 200 trials and plotted in Fig. 7. The networks generate the patterns for up to 100 periods for the circular motion and up to 45 periods for the more complex figure-eight and up-and-down motions. This ability is surprising, as it relies neither on explicit prediction forward in time (like recursive prediction models) nor on propagation of errors backward in time (as when learning with backpropagation through time or real-time recurrent learning). In contrast, BPDC learning propagates errors only to the input and output connections and only one step backward in time. It rather relies on a suitable distributed representation of the signals in the reservoir. Control experiments showed that it is in fact essential to use the intrinsic plasticity optimization to achieve the necessary reservoir coding quality. Fig. 8 also displays some evidence for self-stabilizing effects of the intrinsic plasticity learning. When the learning drives the network dynamics to new regions of the state space, where average performance quickly increases (peaks), the variance across trials also increases. In the further development we see an interaction of the different time scales of BPDC and IP learning. The latter seems to re-attract network states on a slower time scale to a more predictable dynamic regime, which results in a clear trade-off between performance and reproducibility.

C. Analysis by system decomposition

Further insight into the pattern generation ability can be gained by decomposing the network into the linear subsystem, present by virtue of the input-output connections, and the nonlinear reservoir (compare Fig. 1). We analyze the stability of the linear subsystem by investigating the eigenvalues of the connection matrix W^lin between all nodes with linear activation functions:

W^lin = ( W^inp_inp   W^inp_out
          W^out_inp   W^out_out ).
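This analysis amounts to slicing the linear blocks out of W^net and computing their spectrum; a sketch with the index conventions of the earlier snippets:

def linear_subsystem_eigs(W_net, I, O):
    io = np.r_[I, O]                      # all nodes with linear activations
    W_lin = W_net[np.ix_(io, io)]         # the four blocks of W_lin
    return np.linalg.eigvals(W_lin)

# e.g. with l = 3 inputs and n = 7 outputs in a network of size N = 210
eigs = linear_subsystem_eigs(W_net, np.arange(3), np.arange(203, 210))
print(eigs.real.max())                    # Re > 0 signals the driving mode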
Fig. 8. Standard deviation of performance over 200 trials.
Fig. 9 shows the complex eigenvalues for all three presented motions after training. The linear subsystem of the EBPDC setup is unstable in each case, because there is at least one large eigenvalue in the unstable region Re(λ) > 0. There are also complex conjugate eigenvalues introducing oscillations with small amplitudes. The linear subsystem therefore acts as a driving force, while the overall network dynamics is damped by the reservoir. The periodic dynamics of the learned input and output signals is then a result of the complex interplay between the linear and nonlinear parts and is adjusted by the largest eigenvalue of the linear subsystem.

V. CONCLUSION

The presented EBPDC model is capable of learning a mapping and its inverse in a single network with excellent generalization properties. As a byproduct, it also enables autonomous pattern generation without explicit learning of trajectory prediction. Though the presented results are encouraging, the complex interplay between linear and nonlinear dynamics and the different learning approaches for reservoir and input/output nodes deserve further investigation. Directions of future work include investigating the effects of disturbances on the pattern generation ability of the presented setup, further stability analysis considering perturbations at the input and output nodes, and training multiple patterns in a single network for pattern generation.

REFERENCES

[1] A. F. Atiya and A. G. Parlos. New results on recurrent network training: unifying the algorithms and accelerating convergence. IEEE Trans. Neural Networks, 11(3):697–709, 2000.
[2] A. D'Souza, S. Vijayakumar, and S. Schaal. Learning inverse kinematics. In Proc. IROS, pages 298–303, 2001.
[3] G. E. Hinton. To recognize shapes, first learn to generate images. Technical Report UTML TR 2006-004, University of Toronto, 2006.
[4] A. J. Ijspeert, J. Nakanishi, and S. Schaal. Learning rhythmic movements by demonstration using nonlinear oscillators. In Proc. IROS, pages 958–963, 2002.
[5] A. J. Ijspeert, J. Nakanishi, and S. Schaal. Movement imitation with nonlinear dynamical systems in humanoid robots. In Proc. ICRA, pages 1398–1403, 2002.
Fig. 9. Eigenvalues of the linear subsystem for the pattern generation task.
[6] M. Ito and J. Tani. On-line imitative interaction with a humanoid robot using a dynamic neural network model of a mirror system. Adaptive Behavior, 12(2):93–115, 2004.
[7] H. Jaeger. Adaptive nonlinear system identification with echo state networks. In NIPS, pages 593–600, 2002.
[8] H. Jaeger and H. Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304:78–80, 2004.
[9] H. Jaeger, M. Lukosevicius, D. Popovici, and U. Siewert. Optimization and applications of echo state networks with leaky-integrator neurons. Neural Networks, 20(3):335–352, 2007.
[10] P. Joshi and W. Maass. Movement generation with circuits of spiking neurons. Neural Computation, 17(8):1715–1738, 2005.
[11] D. Lebedev, S. Klanke, R. Haschke, J. J. Steil, and H. Ritter. Dynamic path planning for a 7-DOF robot arm. In Proc. IROS, pages 3879–3884, 2006.
[12] W. Maass, T. Natschläger, and H. Markram. Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11):2531–2560, 2002.
[13] M. C. Ozturk, D. Xu, and J. C. Principe. Analysis and design of echo state networks. Neural Computation, 19(1):111–138, 2007.
[14] S. Schaal, A. J. Ijspeert, and A. Billard. Computational approaches to motor learning by imitation. Philosophical Transactions of the Royal Society of London, Series B, 358(1431):537–547, 2003.
[15] U. D. Schiller and J. J. Steil. Analyzing the weight dynamics of recurrent learning algorithms. Neurocomputing, 63C:5–23, 2005.
[16] B. Schrauwen, M. Wardermann, D. Verstraeten, J. J. Steil, and D. Stroobandt. Improving reservoirs using intrinsic plasticity. Neurocomputing, 71(7-9):1159–1171, 2008.
[17] J. J. Steil. Backpropagation-decorrelation: recurrent learning with O(N) complexity. In Proc. IJCNN, volume 1, pages 843–848, 2004.
[18] J. J. Steil. Online reservoir adaptation by intrinsic plasticity for backpropagation-decorrelation and echo state learning. Neural Networks, 20(3):353–364, 2007.
[19] J. Tani, M. Ito, and Y. Sugita. Self-organization of distributedly represented multiple behavior schemata in a mirror system: reviews of robot experiments using RNNPB. Neural Networks, 17:1273–1289, 2004.
[20] J. Tani, R. Nishimoto, J. Namikawa, and M. Ito. Codevelopmental learning between human and humanoid robot using a dynamic neural-network model. IEEE Trans. Systems, Man, and Cybernetics, Part B, 38(1):43–59, 2008.
[21] J. Triesch. A gradient rule for the plasticity of a neuron's intrinsic excitability. In Proc. ICANN, pages 65–79, 2005.
[22] D. Verstraeten, B. Schrauwen, and J. Van Campenhout. Recognition of isolated digits using a liquid state machine. In Proc. SPS-DARTS, pages 135–138, 2005.
[23] S. Vijayakumar, A. D'Souza, and S. Schaal. Incremental online learning in high dimensions. Neural Computation, 17(12):2602–2634, 2005.
[24] D. Wolpert and M. Kawato. Multiple paired forward and inverse models for motor control. Neural Networks, 11:1317–1329, 1998.
[25] T. Yamazaki and S. Tanaka. The cerebellum as a liquid state machine. Neural Networks, 20(3):290–297, 2007.