Proceedings of the 2002 IEEE International Conference on Robotics & Automation Washington, DC • May 2002
Biological Robot Arm Motion through Reinforcement Learning

Jun Izawa, Toshiyuki Kondo, Koji Ito

Abstract— The present paper discusses an optimal control method for a biological robot arm that has redundancy in the mapping from the control input to the task goal. The control input space is divided into subspaces according to a priority order that depends on the progress and stability of learning. In the proposed method, the search noise required for reinforcement learning is restricted to the first-priority subspace. The constraint is then relaxed as learning progresses, and the search space extends to the second-priority subspace in accordance with the learning history. The method was applied to a musculoskeletal system as an example of a biological control system. Dynamic manipulation is obtained through reinforcement learning with no prior knowledge of the arm's dynamics. The effectiveness of the proposed method is shown by computational simulation.
Keywords— learning control, bio-mimetic robot, reinforcement learning, neural network, over-actuated system

Fig. 1. Motor hierarchy of the biological control system (neural command, muscle activations (D.O.F.), joint kinematics, hand path, task goal)
I. Introduction

We can move our body parts skillfully and dynamically, even though the body itself is a complex controlled object. Such skillful actions are acquired through interaction with the environment. For example, when an infant tries to reach out and grasp something, rather random motions are generated at first; after the motion is repeated, the performance gradually becomes smooth and rapid. Our study of robot skill design is inspired by this motor learning process. We propose a bio-mimetic approach that acquires dynamic manipulation based only on information coming from the robot's sensors. An artificial neural network offers a way to implement such a reactive system automatically, and reinforcement learning allows the neural network to acquire reaction rules by interacting with the environment, using only the success or failure of each trial [2][5][1][8]. Furthermore, a biological system such as the musculoskeletal system has much redundancy, which allows the arm to achieve flexible and robust motion. However, it is difficult to apply reinforcement learning to a redundant system, because reinforcement learning involves many random searches, i.e., a process of trial and error, and the number of random searches increases as the search domain extends.

The redundancy of a biological control system can be represented as the hierarchical structure shown in Fig. 1. For instance, there are an infinite number of possible paths along which the hand can reach the task goal, and even after a path is determined, it can be realized by various muscle activations. N. Bernstein [3] pointed out that the coordination of these elements, such as co-contraction of muscles, can resolve these redundancies; this framework is called Bernstein's synergy concept. On the other hand, we co-activate our muscles and stiffen our joints when we perform a new motion or encounter a new environment. For example, when we begin riding a bicycle, we hold our body rigidly at first, but the stiffness decreases as learning progresses. As this phenomenon suggests, Bernstein's synergy concept extends not only to the control problem but also to the learning problem.

In the present paper, we propose a bio-mimetic learning control system for a redundant controlled object. The proposed method reduces the degrees of freedom of the control space by restricting the search noise to a subspace in the initial stage of learning, which makes the learning system robust. The constraint is then relaxed as learning progresses. The effectiveness is shown by computational experiments.

J. Izawa, T. Kondo and K. Ito are with the Department of Computational Intelligence and Systems Science, Tokyo Institute of Technology (TIT), Tokyo, Japan. Email: [email protected], {kon,ito}@dis.titech.ac.jp
Fig. 2. Outline of the bio-mimetic motor learning system (learning controller, body dynamics, environment, external dynamics, and learning history)
II. Bio-mimetic Motor Learning System

Fig. 2 shows the outline of the bio-mimetic motor learning system. Optimizing the controller through the learning process embodies the external dynamics. The external dynamics is formed by the interaction between the body dynamics and the environmental dynamics, so adjusting the body dynamics changes the external dynamics. Learning for a redundant system can therefore be divided into at least a two-stage process: the first stage is the optimization for the task, and the second stage is the optimization of the body dynamics. In the proposed method, we reduce the degrees of freedom of the control space at the initial stage by restricting the search noise, which corresponds to maintaining a certain body dynamics. The constraint is then relaxed with the progress of learning so that the relaxation makes the body dynamics optimal. In other words, we adopt a learning approach in which the stability of convergence is maintained in the early stage of learning, while the resulting controller makes active use of the redundancy in the final stage of learning.

First, it is assumed that the control input space is composed of two hierarchical levels. The upper level is specified by the variable u ∈ R^n, and the lower level is specified by the variable u' ∈ R^m, where m ≥ n. The transformation between u and u' is written as

u' = W(u).   (1)
Fig. 3. Search space for the upper and lower parts of the hierarchical system: (a) early stage of learning, (b) middle stage of learning
We define the range space of the transformation W as I(W) and the null space of W as N(W). Fig. 3(a) shows the relationship between the upper and lower parts of the hierarchical system. In Fig. 3(a), the subspace Γ is defined as

Γ = { u1 | u1 ∉ N(W) }.   (2)
Next, let n1 ∈ Γ and n2 ∈ N(W). The search noise n ∈ R^n for reinforcement learning can then be divided as

n = n1 + n2.   (3)

In the early stage of learning, the search noise for reinforcement learning is restricted to the first-priority subspace. In the middle stage of learning, the constraint is relaxed, and the search space is extended to the second-priority subspace as learning progresses. The search noise for the proposed bio-mimetic learning method is therefore defined as

ñ = n1 + c · n2.   (4)

The control rule of the parameter c is defined as

c = 1 / (1 + a exp(−(V̄ − b))),
τ (dV̄/dt) = −V̄ + V,   (5)
where V is the value of reinforcement learning, a and b are constant parameters, V̄ represents the time history of learning, and τ is a time constant.
Note that the control parameter c (0 ≤ c ≤ 1), which constrains the search noise, increases from 0 to 1 with the progress of learning (Fig. 3(b)). The time constant τ must be tuned: too large a time constant causes a long learning time, while too small a value causes instability of learning.
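As an illustration of Eqs. (4)-(5), the following is a minimal Python sketch of the gating schedule. The constants a, b, tau, and dt and the function names are our own illustrative assumptions, not values taken from the paper.

```python
import numpy as np

# Illustrative constants (assumed, not from the paper): sigmoid gain/offset, filter time constant, step size.
A, B, TAU, DT = 20.0, 0.5, 50.0, 1.0

def update_history(v_bar, v, dt=DT, tau=TAU):
    """Low-pass filter of the value V (Eq. 5): tau * dV_bar/dt = -V_bar + V."""
    return v_bar + dt / tau * (-v_bar + v)

def gate(v_bar, a=A, b=B):
    """Sigmoid gate c in [0, 1] (Eq. 5): c = 1 / (1 + a * exp(-(V_bar - b)))."""
    return 1.0 / (1.0 + a * np.exp(-(v_bar - b)))

def gated_noise(n1, n2, c):
    """Priority-gated search noise (Eq. 4): n_tilde = n1 + c * n2."""
    return n1 + c * n2
```

With these definitions, c starts near zero while the filtered value V̄ is small and approaches one as learning accumulates reward, which is the behavior described above.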
III. Application to a Biological Robot Arm

As an example application, we adopt a musculoskeletal arm consisting of a two-link arm with two joints and six muscles, in which mono-articular and bi-articular muscles are embedded as shown in Fig. 4.

Fig. 4. Two joints and six muscles model (joint angles θ1, θ2)

Generally, a musculoskeletal arm is actuated by many muscles; in other words, it has an over-actuated mechanism. This redundancy contributes to making the motion control flexible and robust. Now, we assume that the muscle force is T ∈ R^n and the joint torque is τ ∈ R^m, where m ≤ n. The relationship between the muscle force and the joint torque can then be written as

τ = −G^T T,   (6)

where the linear transformation G is the moment arm matrix from the joints to the muscle insertion points. Although the inverse of −G^T cannot be determined uniquely, the muscle force can be expressed as

T = −(G^T)^* τ + (I_n − (G^T)^* G^T) z,   (7)

where (G^T)^* is a generalized inverse matrix of G^T and z ∈ R^n is an arbitrary vector. Note that the second term satisfies (I_n − (G^T)^* G^T) z ∈ N(−G^T). Suppose (G^T)^* is the generalized inverse solved under the condition that the joint stiffness is kept constant. Then it can be expressed as

(G^T)^* = K_m G (G^T K_m G)^{−1},   (8)

where K_m ∈ R^{n×n} is the elastic coefficient matrix of the muscles.

In general, the search noise for reinforcement learning is set to isotropic, zero-mean Gaussian noise. Here we assume that n ∈ R^n is isotropic Gaussian noise with unit variance. The noise n is divided into the subspace in which the joint stiffness remains constant and its orthogonal complement, which is expressed as

n = (I_n − (G^T)^* G^T) n + (G^T)^* G^T n = n1 + n2,
n1 = (I_n − (G^T)^* G^T) n,
n2 = (G^T)^* G^T n.   (9)

Finally, learning is performed with n1 and n2 by using Eqs. (4) and (5).
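The subspace projection of Eq. (9) can be sketched in Python as follows. The moment arm matrix G and the muscle elasticity K_m below are placeholder values chosen only so the code runs; they are not the parameters used in the paper.

```python
import numpy as np

n_muscles, n_joints = 6, 2

# Placeholder moment arm matrix G (n_muscles x n_joints) and muscle elasticity K_m (diagonal).
G = np.random.uniform(-0.05, 0.05, size=(n_muscles, n_joints))
K_m = np.diag(np.full(n_muscles, 100.0))

# Stiffness-consistent generalized inverse of G^T (Eq. 8): (G^T)* = K_m G (G^T K_m G)^{-1}.
GT_star = K_m @ G @ np.linalg.inv(G.T @ K_m @ G)

# Projectors of Eq. (9): P_range keeps the component realized through (G^T)*,
# P_null keeps the component in the null space of G^T (it produces no torque change).
P_range = GT_star @ G.T
P_null = np.eye(n_muscles) - P_range

noise = np.random.randn(n_muscles)        # isotropic unit-variance Gaussian noise
n1, n2 = P_null @ noise, P_range @ noise  # decomposition of Eq. (9)
n_tilde = n1 + 0.3 * n2                   # gated noise of Eq. (4) with an example c = 0.3
```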
IV. Computational Experiment

A. Dynamic Equation

The musculoskeletal model of the arm is shown in Fig. 4. The dynamic equation of the arm is given by

M(θ) θ̈ + H(θ, θ̇) = τ,   (10)

where M is the inertia matrix, H is the vector of centrifugal and Coriolis forces, and τ is the joint torque vector, which is given by Eq. (6). The muscle tension T is modeled as follows [6]:

T(l, l̇, u) = K(u)(l − l_r(u)) + B(u) l̇,   (11)

where K and B are the coefficient matrices of elasticity and viscosity, u is the muscle activation vector, l is the muscle length vector, and l_r is the equilibrium muscle length vector.
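A minimal sketch of the muscle model in Eq. (11) follows, assuming (the paper does not specify this) that elasticity, viscosity, and rest length depend linearly on the activation u; all coefficient values are illustrative.

```python
import numpy as np

# Illustrative per-muscle coefficients (assumed, not from the paper).
k0, k1 = 50.0, 100.0    # elasticity:  K(u) = k0 + k1 * u
b0, b1 = 5.0, 10.0      # viscosity:   B(u) = b0 + b1 * u
l0, dl = 0.10, 0.02     # rest length: l_r(u) = l0 - dl * u

def muscle_tension(l, l_dot, u):
    """Muscle tension of Eq. (11): T = K(u) (l - l_r(u)) + B(u) l_dot, element-wise per muscle."""
    K = k0 + k1 * u
    B = b0 + b1 * u
    l_r = l0 - dl * u
    return K * (l - l_r) + B * l_dot

# Example: six muscles at a uniform co-contraction level and fixed length.
T = muscle_tension(l=np.full(6, 0.12), l_dot=np.zeros(6), u=np.full(6, 0.3))
```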
B. Reward Definition for the Reaching Task

The reaching motion is selected as an example to show the effectiveness of the proposed architecture.

Fig. 5. Schematic image of the reaching task

The arm is moved from a start point to the target with minimum energy consumption, as shown in Fig. 5. The reward is defined as

r = 1 − r_E   for x_H ∈ S and x_H ∈ G,
r = −r_E      for x_H ∈ S and x_H ∉ G,
r = −1        for x_H ∉ S,   (12)

where x_H, S, and G are the hand position, the working area, and the goal area, respectively. The performance function r_E concerning the energy consumption of the muscles is defined as

r_E = Σ_{i=1}^{6} u_i^2.   (13)

In addition, the total energy is defined as

Energy = Σ_{t=t_start}^{t_final} r_E(t).   (14)
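A hedged Python sketch of the reward in Eqs. (12)-(13) is given below; the membership tests for the working area S and the goal area G are placeholder implementations, since the paper does not give their geometry.

```python
import numpy as np

def energy_term(u):
    """r_E of Eq. (13): sum of squared muscle activations."""
    return float(np.sum(np.asarray(u) ** 2))

def reward(x_hand, u, in_workspace, in_goal):
    """Reward of Eq. (12); `in_workspace` and `in_goal` are predicates for S and G."""
    r_e = energy_term(u)
    if not in_workspace(x_hand):
        return -1.0
    return (1.0 - r_e) if in_goal(x_hand) else -r_e

# Placeholder regions: a rectangular workspace and a small circular goal around a target point.
in_S = lambda x: abs(x[0]) < 0.5 and 0.0 < x[1] < 0.6
in_G = lambda x: np.linalg.norm(np.asarray(x) - np.array([0.1, 0.4])) < 0.02

r = reward(x_hand=[0.1, 0.41], u=np.full(6, 0.2), in_workspace=in_S, in_goal=in_G)
```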
C. Reinforcement Learning

In the present paper, we use the actor-critic architecture [7] shown in Fig. 6. With the approximated value function

V_t = Σ_{i=1}^{∞} γ^{i−1} r_{t+i},   (15)

the networks are trained so that the following TD error is decreased:

δ_t = r_t + γ V_{t+1} − V_t,   (16)

where γ is the discount rate.

Fig. 6. Architecture of the proposed learning system (reward, critic, TD error, actor, and musculoskeletal system; the search noise ñ = n1 + c·n2, with c given by a sigmoid of the value history, is added to the motor command u_t)

The critic network monitors the state of the musculoskeletal system and estimates the value function. According to the reward definition above, the reward is a function of the arm state θ_{t+1}, θ̇_{t+1} and the motor command u_t. Since the two-joint, six-muscle model is a redundant system, u_t cannot be determined from θ_{t+1}, θ̇_{t+1} alone. For this reason, not only the arm state but also the motor command is fed into the actor-critic networks; that is, the agent must sense the state q_{t+1} = {θ_{t+1}, θ̇_{t+1}, u_t}^T to estimate the value V_{t+1} and produce the action u_{t+1}. The teacher signal for the critic network is given by

V_t^teacher = V_t + α [r + γ V_{t+1} − V_t],   (17)

and for the actor network it is given by

u_t^teacher = u_t + β · n · δ_t,   (18)

where n is the random noise used for searching. The actor-critic architecture requires that the search noise n be added to the motor command u.
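The following is a minimal sketch of one learning step combining the gated noise of Eq. (4) with the updates of Eqs. (16)-(18). It is written under simplifying assumptions: linear function approximators are used in place of the paper's neural networks, the gated noise ñ plays the role of n in Eq. (18), and `P_null`/`P_range` are the projectors from the earlier sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, act_dim = 10, 6                 # q = {theta, theta_dot, u} flattened (2 + 2 + 6)
W_critic = np.zeros(state_dim)             # linear critic: V(q) = W_critic . q
W_actor = np.zeros((act_dim, state_dim))   # linear actor:  u(q) = W_actor q
alpha, beta, gamma = 0.1, 0.05, 0.99       # illustrative learning rates and discount

def step(q_t, q_next, r, c, P_null, P_range):
    """One update: gated exploration (Eq. 4), TD error (Eq. 16),
    critic target (Eq. 17), and actor target (Eq. 18) on linear models."""
    global W_critic, W_actor
    noise = rng.standard_normal(act_dim)
    n1, n2 = P_null @ noise, P_range @ noise
    n_tilde = n1 + c * n2                          # Eq. (4)
    u_t = W_actor @ q_t + n_tilde                  # noisy motor command actually executed

    V_t, V_next = W_critic @ q_t, W_critic @ q_next
    delta = r + gamma * V_next - V_t               # TD error, Eq. (16)

    W_critic += alpha * delta * q_t                        # move V_t toward the target of Eq. (17)
    W_actor += beta * np.outer(n_tilde * delta, q_t)       # move u_t toward the target of Eq. (18)
    return u_t, delta
```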
D. Impedance Adjustment

The impedance adjustment is implemented as follows.
1. The weights of the actor network between the second and third layers are preset to zero, except for the bias unit, as shown in Fig. 7.
2. The bias unit is tuned so that the musculoskeletal system has a high impedance property while keeping the joint torque zero.

Here we define the joint stiffness matrix R by

δτ = R δθ,   (19)
where

R = [ R11  R12
      R21  R22 ].   (20)

The bias unit is tuned so that the stiffness R takes the following value:

R = [ 9.5  3.4
      3.4  5.8 ] [N·m/rad].   (21)

Fig. 7. The actor network is preset so that the impedance of the arm is high
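As a hedged illustration of step 2, the sketch below estimates the joint stiffness produced by a uniform co-contraction level, assuming (our assumption, not stated in the paper) that muscle stiffness grows linearly with activation and that the joint stiffness follows R = G^T K(u) G; the matrix G and the coefficients are placeholders.

```python
import numpy as np

n_muscles = 6
G = np.random.uniform(-0.05, 0.05, size=(n_muscles, 2))  # placeholder moment arm matrix
k0, k1 = 50.0, 2000.0                                     # assumed per-muscle K(u) = k0 + k1 * u

def joint_stiffness(u):
    """Joint stiffness R = G^T K(u) G under the assumed activation-dependent muscle stiffness."""
    K = np.diag(k0 + k1 * u)
    return G.T @ K @ G

# Search over uniform co-contraction levels for the one whose stiffness best matches Eq. (21).
R_target = np.array([[9.5, 3.4], [3.4, 5.8]])
levels = np.linspace(0.0, 1.0, 101)
errors = [np.linalg.norm(joint_stiffness(np.full(n_muscles, c)) - R_target) for c in levels]
best_level = levels[int(np.argmin(errors))]
```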
V. Simulation

Fig. 8 (left) shows a reaching motion in the early stage of learning; the motion appears almost random. In contrast, Fig. 8 (right) shows that a rapid and smooth motion is acquired in the final stage of learning, with the hand trajectory roughly following a straight line.

Fig. 8. Reaching motion (left: early stage of learning; right: final stage of learning)

Fig. 9(a) shows the difference in cumulative reward between the proposed method and standard reinforcement learning. The proposed method appears to reach the target reward at about 1000 trials and continues to gain reward after that, whereas with standard reinforcement learning the cumulative reward converges to zero. Fig. 9(b) shows the transition of the energy consumption. With the proposed method, the energy does not change in the early stage and begins to decrease only once the system has reached the target reward. In the standard case, by contrast, the optimization appears to emphasize only the energy term, and the solution is trapped in a local minimum.

Fig. 9. The difference between the proposed method and standard reinforcement learning: (a) cumulative reward against trial number; (b) energy consumption against trial number

Next, we performed a simulation under another condition in which a novel external force is applied to the arm's hand. The load is applied when the hand position p(X, Y) satisfies 1.0 ≤ Y ≤ 1.5, and the direction of the force is changed randomly at every trial. We tested whether the system can acquire a controller that is robust against this disturbance. Fig. 10 shows the reaching motion acquired when the external random force is added; the trajectory is slightly curved. Fig. 11 shows the acquired joint stiffness. The peak of the stiffness shifts toward the loaded area compared with the standard case. Thus the robust control appears to have been acquired by adapting to the environmental change.

Fig. 10. Acquired reaching motion under the disturbed condition

Fig. 11. Acquired joint stiffness against the Y-coordinate, (a) R11 and (b) R12,21: dashed line, standard condition; dotted line, disturbed condition
VI. Conclusion

In the present paper, we proposed a bio-mimetic learning method based on reinforcement learning that can be applied to redundant systems. The redundant system is represented as a hierarchical structure. In the proposed method, the control input space is divided into subspaces according to a priority order that depends on the progress and stability of learning. In the early stage of learning, the search noise for reinforcement learning is restricted to the first-priority subspace. The constraint is then relaxed, and the search space is extended to the second-priority subspace as learning progresses. The method was applied to a musculoskeletal system, and its effectiveness was shown by computational experiments. First, the proposed method can acquire the reaching motion using only the success or failure of each trial; no trajectory needs to be designed before learning, and trajectory generation and trajectory following are integrated into a single step. The method was successfully applied to an over-actuated mechanism. Second, the combination of the proposed method and the musculoskeletal system allows the system to obtain a controller that is robust against the external disturbance force.

Another natural candidate for a stage in the motor hierarchy is the kinematic relationship between the joint angles and the hand position; the proposed method can be applied to systems that have redundancy not only at the force level but also at the kinematic level. The next stage of our study will address such kinematic redundancy, e.g., an n-link arm [4].
Acknowledgment

A part of this research was supported by the Takahashi Industrial and Economic Research Foundation and the Mitutoyo Association for Science and Technology.

References

[1] K. Althoefer, B. Krekelberg, and D. Husmeier. Reinforcement learning in a rule-based navigator for robotic manipulators. Neurocomputing, 37:51-70, 2001.
[2] A. G. Barto, R. S. Sutton, and C. W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13:834-846, 1983.
[3] N. Bernstein. The Coordination and Regulation of Movement. Pergamon Press, 1967.
[4] F. Matsuno. Optimal path planning for flexible plate handling using an n-link manipulator. Robotics and Mechatronics, 10(3):178-183, 1998.
[5] P. Martin and J. R. Millan. Robot arm reaching through neural inversions and reinforcement learning. Robotics and Autonomous Systems, 31:227-246, 2000.
[6] J. McIntyre and E. Bizzi. Servo hypotheses for the biological control of movement. Journal of Motor Behavior, 25(3):193-202, 1993.
[7] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998.
[8] K. Shibata, M. Sugisaka, and K. Ito. Hand reaching movement acquired through reinforcement learning. In Proc. of 2000 KACC (Korea Automatic Control Conference), CD-ROM, 2000.