2nd International Conference on Autonomous Robots and Agents December 13-15, 2004 Palmerston North, New Zealand
Stabilization of Biped Robot based on Two mode Q-learning

Kui-Hong Park, Jun Jo and *Jong-Hwan Kim
School of Information Technology, Gold Coast campus, Griffith University, PMG 50, Gold Coast Mail Centre, Australia
{k.park, junjo}@griffith.edu.au
*[email protected]
*Department of EECS, Korea Advanced Institute of Science and Technology (KAIST), Guseong-dong, Yuseong-gu, Daejeon, 305-701, Republic of Korea
Abstract

In this paper, two mode Q-learning, an extension of Q-learning, is used to stabilize the Zero Moment Point (ZMP) of a biped robot in the standing posture. In two mode Q-learning, the experiences of both success and failure of an agent are used for fast convergence. To demonstrate the effectiveness of two mode Q-learning against conventional Q-learning, its convergence property is investigated through simulation in a grid world. This paper also presents experimental results of ZMP compensation in the standing posture of a biped robot. An external force generated in the sagittal plane impacts the biped robot, and a ZMP compensation scheme based on two mode Q-learning is employed. The effectiveness of two mode Q-learning is verified through a real experiment.
Keywords: biped robot, balance control, ZMP, Q-learning, two mode Q-learning

1 Introduction

Q-learning, a well-known method in reinforcement learning [1, 2, 3], is easy to implement, and its convergence to the optimal Q values does not depend on the learning policy [4, 5, 6]. Hence, it has been used in many applications [7, 8, 9, 10]. Once the Q values have converged to the optimal values, the agent selects only the action with the maximum Q value in a given state. To improve the speed of Q-learning, several modifications of conventional Q-learning have been presented [11, 12, 13]. In this paper, two mode Q-learning [14, 15] is employed to improve the performance of Q-learning by using the failure experience of the agent more effectively, as well as its success experience. Two mode Q-learning bases the selection of an action in the state-action space on both the normal and the failure Q values. To determine the failure Q value from the agent's previous failure experience, it employs a failure Q value module. To investigate the convergence property of two mode Q-learning, a grid world is employed.

A humanoid robot is a biped (i.e., two-legged) intelligent robot that is expected to eventually take the form of a human-like body. Recently, many researchers have focused on developing humanoid robots which are similar in many aspects to human beings [16, 17, 18, 19, 20]. A biped robot is inherently susceptible to instability and is therefore constantly at risk of falling down. Therefore, ensuring stability must be
considered one of the most important goals from the perspective of locomotion. To measure stability, the ZMP concept proposed by Vukobratovic [21] is used. The ZMP is defined as the point on the ground plane at which the total moment of the forces due to ground contacts becomes zero. It is important to recognize that the ZMP must always reside in the convex hull of all contact points on the ground plane, since the moments on the ground plane are caused only by the normal forces at the contact points, which are always positive in the upward vertical direction. In this paper, a ZMP compensation scheme based on two mode Q-learning is applied to a biped robot.

Section 2 presents two mode Q-learning and its simulation results. In Section 3, the experimental results of a biped robot comparing the performances of Q-learning and two mode Q-learning are presented. Concluding remarks follow in Section 4.
2 Two mode Q-learning

2.1 Description of two mode Q-learning

In two mode Q-learning, two kinds of Q values are considered. One is called the normal Q value (Q^N), which is the Q value from conventional Q-learning. The other is called the failure Q value (Q^F), which is obtained from failure experience. Since this algorithm uses both the normal Q value and the failure Q value, it is called two mode Q-learning.
The architecture of two mode Q-learning consists of the action selection module, the normal Q value module and the failure Q value module, as shown in Figure 1.

Figure 1: Architecture of two mode Q-learning

In two mode Q-learning, the action selection depends on the total value of the normal Q value Q^N(s_i, a_j) and the failure Q value Q^F(s_i, a_j). For the action selection, the following Boltzmann equation is employed, where for simple notation Q^N(s_i, a_j), Q^F(s_i, a_j) and Q^T(s_i, a_j) are written as Q^N_{ij}, Q^F_{ij} and Q^T_{ij}, respectively:

    p(s_i, a_j) = \frac{e^{Q^T_{ij}/\tau}}{\sum_{a_j \in A} e^{Q^T_{ij}/\tau}},  \qquad  Q^T_{ij} = Q^N_{ij} + Q^F_{ij},    (1)

where Q^N_{ij} is the Q value of action j in state i obtained by conventional Q-learning, Q^F_{ij} is the Q value of action j in state i obtained from the failure Q value module, and Q^T_{ij} is the total of these two Q values. Q^T_{ij} decreases if the value of Q^F_{ij} decreases, and hence the probability of action a_j being selected in state s_i is lowered.

The failure Q value of action a_j in a given state s_i is calculated by equating the numerator of the Boltzmann equation to 1 - p^F_{ij}, i.e., e^{Q^F_{ij}/\tau} = 1 - p^F_{ij}:

    Q^F_{ij} = \tau \ln(1 - p^F_{ij}),    (2)

where \tau is the temperature value used in (1), i is the index of the state, and j is the index of the action. p^F_{ij} is the failure probability of the actions included in the failure trial, which is obtained from the following equation:

    p^F_{ij} = f(p^N_{min,ij}, p^N_{max,ij}(k), l(c), j),    (3)

where N is the index of the failure trial (the state-action trace that led the agent to the fail state), i and j are the indices of the state and action, respectively, included in the failure trial N, p^N_{min,ij} and p^N_{max,ij}(k) are minimum and maximum probability values, respectively, and k is the trial number. l(c) is the step length of the state-action trace counted back from the fail state, where c is a constant value. Failure probabilities are calculated only for the actions included in the step length l(c); depending on the environment, the number of steps that led the agent to the fail state varies.

The failure probability applied to these actions is calculated as follows:

    p^F_{ij} = f(p^N_{min,ij}, p^N_{max,ij}(k), l(c), j)
             = p^N_{min,ij} + \frac{p^N_{max,ij}(k) - p^N_{min,ij}}{l(c)} \, (j - (m - l(c))),  \qquad  m - l(c) \le j \le m,    (4)

where i and j are the indices of the state and action, respectively, included in the step length l(c), and m is the number of steps of the state-action trace leading the agent to the fail state. Since p^N_{max,ij} = 1 cannot be used in (2), we introduce the notation 1^- for a value just less than and close to 1. It should be noted that if the failure probability of an action is 1^-, the action almost surely leads the agent to the fail state. However, the action which led the agent to a fail state in the previous trial may not lead the agent to the same fail state again. Setting the value of p^F_{ij} to 1^- permanently would inhibit the corresponding action throughout the progress of learning. Thus, the failure probability of the action selected just before the fail state should be decreased as the trials go on, until the agent reaches the fail state again by the same action. For this purpose, the following scheme for decreasing p^N_{max,ij}(k) is proposed:

    p^N_{max,ij}(k) = \eta^{k - k_f},    (5)

where \eta is a constant between 0 and 1, k is the current trial number, and k_f is the failure trial number. In this paper, p^N_{min,ij} is fixed as a constant value irrespective of N.
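To make the rules above concrete, here is a minimal sketch of two mode Q-learning for a tabular state-action space. It simply combines the conventional Q update with the failure Q value of equations (1)-(5); the class, its default parameter values and the helper names are illustrative assumptions, not the authors' implementation.

```python
import math
import random
from collections import defaultdict

class TwoModeQ:
    """Sketch of two mode Q-learning: Boltzmann selection over Q^T = Q^N + Q^F,
    with Q^F derived from failure probabilities (equations (1)-(5))."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, tau=1.0,
                 p_min=0.1, eta=0.9, c=2, one_minus=0.99):
        self.actions = list(actions)
        self.alpha, self.gamma, self.tau = alpha, gamma, tau
        self.p_min, self.eta, self.c, self.one_minus = p_min, eta, c, one_minus
        self.q_n = defaultdict(float)   # normal Q values Q^N_ij
        self.fail_info = {}             # (state, action) -> (j, m, l, k_f)

    def q_f(self, s, a, k):
        """Failure Q value Q^F_ij = tau * ln(1 - p^F_ij), eqs. (2), (4), (5)."""
        if (s, a) not in self.fail_info:
            return 0.0
        j, m, l, k_f = self.fail_info[(s, a)]
        p_max = self.eta ** (k - k_f)                              # eq. (5)
        p = self.p_min + (p_max - self.p_min) / l * (j - (m - l))  # eq. (4)
        p = min(max(p, 0.0), self.one_minus)                       # cap at 1^-
        return self.tau * math.log(1.0 - p)                        # eq. (2)

    def select(self, s, k):
        """Boltzmann action selection over Q^T = Q^N + Q^F, eq. (1)."""
        q_t = [self.q_n[(s, a)] + self.q_f(s, a, k) for a in self.actions]
        weights = [math.exp(q / self.tau) for q in q_t]
        r, acc = random.random() * sum(weights), 0.0
        for a, w in zip(self.actions, weights):
            acc += w
            if r <= acc:
                return a
        return self.actions[-1]

    def update_normal(self, s, a, reward, s_next):
        """Conventional one-step Q-learning update for Q^N."""
        best_next = max(self.q_n[(s_next, b)] for b in self.actions)
        self.q_n[(s, a)] += self.alpha * (reward + self.gamma * best_next
                                          - self.q_n[(s, a)])

    def record_failure(self, trace, k):
        """On reaching a fail state, remember the last l(c) state-action pairs
        of the trace; their failure probabilities follow eqs. (3)-(4)."""
        m = len(trace)
        if m == 0:
            return
        l = min(self.c, m)          # step length l(c)
        for idx in range(m - l, m):
            s, a = trace[idx]
            j = idx + 1             # 1-indexed step within m - l(c) <= j <= m
            self.fail_info[(s, a)] = (j, m, l, k)
```

At the trial in which a failure is recorded (k = k_f), the action just before the fail state receives a failure probability close to 1^-, so its Q^F is strongly negative and its selection probability drops; as k grows, eq. (5) relaxes this inhibition.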
2.2 Simulation Result

To investigate the convergence property of two mode Q-learning, a grid world was used. The purpose of the agent is to find the shortest path from the start state to the success state. In most states, there are four possible actions, moving in the four directions (north, south, east and west); in the edge states, only three of these four actions are possible. When the agent enters the success state, it receives 100 points as a reward. When the agent reaches a fail state, it receives -100 as a penalty. In this grid world, there are four start states and 16 fail states, and after each trial the start state is changed by its index number. Figure 2 shows the simulation results after converging to the optimal Q values by Q-learning and two mode Q-learning, for a grid world of size 17 × 17 with several fail and start states. The arrows depict the optimal action of the agent in each state after the learning phase.

Figure 2: 17 × 17 grid world with fail and start states

In this environment, more than 3,000 trials are needed to converge to the optimal Q values. However, to compare the convergence property during the progress of learning, series of 300, 1000 and 3000 trials were executed. Figure 3 shows the simulation results for the convergence ratio of conventional Q-learning and two mode Q-learning, where the convergence ratio is defined as the ratio of the Q value to the maximum optimal Q value of each state-action pair. The z-axes of these plots depict the average convergence ratio. To obtain the average convergence ratio, the simulation program was run for 20 iterations, where one iteration corresponds to 300 trials, 1000 trials and 3000 trials, respectively. The values c = 2, p^N_{min} = 0.1, η = 0.9 and 1^- = 0.99 were used. For the purpose of comparison, Boltzmann action selection was also employed in conventional Q-learning. In Figure 3, Q and TMQ denote Q-learning and two mode Q-learning, respectively. As the figures show, the convergence ratio of two mode Q-learning is better than that of conventional Q-learning at each number of trials. Also, two mode Q-learning explores more of the state-action space than conventional Q-learning does.

Figure 3: Simulation results of the convergence property in the state-action space ((a) Q: 300 trials, (b) TMQ: 300 trials, (c) Q: 1000 trials, (d) TMQ: 1000 trials, (e) Q: 3000 trials, (f) TMQ: 3000 trials)
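As a rough illustration of how the convergence ratio plotted in Figure 3 can be computed, the sketch below compares a learned Q table against the optimal Q values of each state-action pair. The function and variable names are illustrative assumptions rather than the authors' simulation code.

```python
def convergence_ratio(q_learned, q_optimal):
    """Convergence ratio per state-action pair: the learned Q value divided by
    the optimal Q value of that pair, expressed as a percentage.

    q_learned, q_optimal : dict mapping (state, action) -> Q value
    Returns (per-pair ratios, average ratio over the state-action space).
    """
    ratios = {}
    for sa, q_opt in q_optimal.items():
        if q_opt != 0.0:
            ratios[sa] = 100.0 * q_learned.get(sa, 0.0) / q_opt
        else:
            ratios[sa] = 100.0 if q_learned.get(sa, 0.0) == 0.0 else 0.0
    average = sum(ratios.values()) / len(ratios)
    return ratios, average

# Example with two state-action pairs of a toy grid world:
q_opt = {((0, 0), 'east'): 81.0, ((0, 1), 'east'): 90.0}
q_now = {((0, 0), 'east'): 60.0, ((0, 1), 'east'): 90.0}
_, avg = convergence_ratio(q_now, q_opt)
print(f"average convergence ratio: {avg:.1f}%")   # ~87.0%
```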
3 Experimental Result

3.1 HSR-IV

To apply Q-learning and two mode Q-learning to a biped robot, HanSaRam (HSR) [20], developed by the Robot Intelligence Technology Lab., KAIST, was used. Figure 4 shows the HSR-IV. For the ZMP measurement, eight FSRs were installed in the soles of HSR-IV. For mimicking the joints of the lower body, 12 RC servo motors were used; there is no upper body in HSR-IV. For obtaining a linear response to the force measured by the FSRs, OP amp circuits were employed.

Figure 4: HSR-IV ((a) front, (b) side)

The ZMP can be easily calculated from the FSR measurements as follows [22, 23]:

    x_{ZMP} = \frac{\sum_{i=1}^{4} f_i x_i}{\sum_{i=1}^{4} f_i},  \qquad  y_{ZMP} = \frac{\sum_{i=1}^{4} f_i y_i}{\sum_{i=1}^{4} f_i},    (6)

where p_1, p_2, p_3 and p_4 are the four corner positions of the FSRs, with coordinates (x_i, y_i), and f_1, f_2, f_3 and f_4 are the forces measured by each FSR.
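The ZMP computation of equation (6) is a force-weighted average of the sensor positions. A minimal sketch is shown below; the sensor coordinates used in the example are illustrative, not the actual FSR layout of HSR-IV.

```python
def zmp_from_fsr(positions, forces):
    """Compute the ZMP as the force-weighted average of FSR positions (eq. (6)).

    positions : list of (x_i, y_i) corner positions of the force sensors [m]
    forces    : list of forces f_i measured by the corresponding sensors [N]
    Returns (x_zmp, y_zmp).
    """
    total = sum(forces)
    if total <= 0.0:
        raise ValueError("no contact force measured")
    x_zmp = sum(f * x for (x, _), f in zip(positions, forces)) / total
    y_zmp = sum(f * y for (_, y), f in zip(positions, forces)) / total
    return x_zmp, y_zmp

# Example: four sensors at the corners of a 10 cm x 6 cm sole (illustrative),
# loaded more heavily on the front edge, so the ZMP shifts towards +x.
corners = [(0.05, 0.03), (0.05, -0.03), (-0.05, 0.03), (-0.05, -0.03)]
loads = [12.0, 12.0, 6.0, 6.0]
print(zmp_from_fsr(corners, loads))   # (~0.017, 0.0)
```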
Two different types of micro-controllers were used in the HSR-IV. One is the master micro-controller, which handles communication between the host PC and the other micro-controller. The other is the slave micro-controller, which controls the 12 RC servo motors of the 12 joints of both legs. In each 20 ms interval, the 12 RC servo motors are controlled by the slave micro-controller.
3.2 Implementation

The experiment for Q-learning and two mode Q-learning on HSR-IV deals with an external force that is generated in the sagittal plane and applied to the HSR-IV in its standing posture. In this case, the ZMP is calculated by equation (6). The purpose of the learning is to select an action that resists the external force within the angle and ZMP constraints. Figure 5 shows the situation in which an external force generated in the sagittal plane impacts the HSR-IV in the standing posture, and the desired motion of the HSR-IV for resisting the external force.

Figure 5: External force in the sagittal plane

For Q-learning and two mode Q-learning, the time duration of a trial is set to four seconds. This means that the slave micro-controller controls the RC servo motors 200 times during this period. The interval between trials is also set to four seconds; in this time, the slave micro-controller drives the RC servo motors back to the initial posture, i.e., the initial angle of each actuator before learning. In order to reduce the complexity of the problem, it is assumed that the left leg has the same motion as the right leg in the parallel direction and that the hip rolling is in the opposite direction to the ankle rolling. Under these assumptions, only one actuator (ankle rolling) reacts to the external force generated in the sagittal plane. For applying Q-learning and two mode Q-learning, 297 (9 × 3 × 11) states, 3 actions per state and the rewards are defined as follows:

• States:
  1. State 1: Variation of the ankle rolling (9 states)
  2. State 2: Angular velocity of the ankle rolling (3 states)
  3. State 3: Variation of the Y component of the ZMP (11 states)
• Actions: Velocity of the ankle rolling during 20 ms: ±0.5°, 0°
• Rewards:
  1. Positive reward: r = 100 in case of selecting an action in the direction opposite to the external force
  2. Negative reward: r = -100 in case of exceeding the angle or ZMP constraints
  3. Instant reward: r = 50 / (10 + |ZMP_Y|), where ZMP_Y is the Y component of the ZMP

Figure 6 shows the discretization of the states for Q-learning and two mode Q-learning.

Figure 6: States for the experiment (state component 1: ankle rolling; state component 2: angular velocity; state component 3: variation of ZMP_Y)
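As a concrete illustration of the state and reward definitions above, the following sketch encodes the 9 × 3 × 11 state index and the three reward cases. The bin boundaries and function names are illustrative assumptions (the actual thresholds are those of Figure 6), not the authors' implementation.

```python
def bin_index(value, boundaries):
    """Return the index of the bin that `value` falls into, given sorted boundaries."""
    for i, b in enumerate(boundaries):
        if value < b:
            return i
    return len(boundaries)

def state_index(ankle_var_deg, ankle_vel_deg, zmp_y_var, ankle_bins, vel_bins, zmp_bins):
    """Combine the three discretized state components into one index (9 x 3 x 11 = 297 states)."""
    s1 = bin_index(ankle_var_deg, ankle_bins)   # 9 ankle-rolling variation states
    s2 = bin_index(ankle_vel_deg, vel_bins)     # 3 angular-velocity states
    s3 = bin_index(zmp_y_var, zmp_bins)         # 11 ZMP_Y variation states
    return (s1 * 3 + s2) * 11 + s3

def reward(action_sign, force_sign, constraint_violated, zmp_y):
    """Reward cases used in the experiment: +100 for opposing the external force,
    -100 for violating the angle/ZMP constraints, otherwise 50 / (10 + |ZMP_Y|)."""
    if constraint_violated:
        return -100.0
    if action_sign != 0 and action_sign == -force_sign:
        return 100.0
    return 50.0 / (10.0 + abs(zmp_y))

# Illustrative bin boundaries (8, 2 and 10 boundaries -> 9, 3 and 11 bins):
ankle_bins = [-3, -2, -1, -0.5, 0.5, 1, 2, 3]
vel_bins = [-0.25, 0.25]
zmp_bins = [-70, -45, -25, -12, -4, 4, 12, 25, 45, 70]
print(state_index(0.7, 0.0, -10.0, ankle_bins, vel_bins, zmp_bins))   # 180
```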
To compare the learning performance of Q-learning and two mode Q-learning, 800 trials were executed. In the real experiment, the learning rate α = 0.1, the discount rate γ = 0.9, the minimum failure probability p^N_{min} = 0.1, the parameter of the maximum failure probability η = 0.9, and the constant value of the step length c = 4 were used.
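For orientation, the sketch below shows how the trial structure described above (trials of 200 control ticks of 20 ms, with the conventional Q^N update using α = 0.1 and γ = 0.9) could be organized around the TwoModeQ sketch of Section 2. The robot interface (read_state, apply_action, trial_failed, move_to_initial_posture) is an assumed placeholder, not the actual HSR-IV software.

```python
def run_trials(agent, robot, n_trials=800, ticks_per_trial=200):
    """Organize learning into trials of 200 control ticks (20 ms each).

    `agent` is the TwoModeQ sketch from Section 2; `robot` is an assumed
    interface exposing read_state(), apply_action(a) and trial_failed().
    """
    for k in range(n_trials):
        robot.move_to_initial_posture()        # rest interval between trials
        trace = []
        s = robot.read_state()
        for _ in range(ticks_per_trial):
            a = agent.select(s, k)             # Boltzmann selection over Q^T
            r, s_next = robot.apply_action(a)  # one 20 ms control tick
            agent.update_normal(s, a, r, s_next)
            trace.append((s, a))
            s = s_next
            if robot.trial_failed():           # angle or ZMP constraint exceeded
                agent.record_failure(trace, k) # update the failure Q value module
                break
```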
3.3 Result
The experiment took one and a half hours to complete 800 trials. To sense the external force, one FSR was used with the micro-controller. When the HSR-IV selects an action which resists the external force, the external force decreases gradually. After the learning phase, the performances of Q-learning and two mode Q-learning were compared over 80 test trials. The number of successes within the angle and ZMP constraints was used to measure performance. Table 1 shows the result of the 80 test trials. With Q-learning, the number of successes was 19 out of 80 test trials; with two mode Q-learning, it was 33. Of the four seconds (200 control ticks) of each trial, the number of trials lasting longer than 100 ticks was 30 with Q-learning and 49 with two mode Q-learning. The table thus compares the performance of the two algorithms and illustrates that the performance of two mode Q-learning is superior to that of Q-learning in this experiment.

Table 1: Experimental result in the sagittal plane (80 test trials in four sets)

                                    Q-learning    Two mode Q-learning
    Number of successes                 19                 33
    Trials longer than 100 ticks        30                 49

A human finger was also used in place of the external force apparatus. Figures 7(a) and 7(b) show snapshots of the experimental result of two mode Q-learning with the mechanically applied external force and with the finger force, respectively.

Figure 7: Snapshots of the experimental results in the sagittal plane ((a) experimental result with external force, (b) experimental result with finger force)

4 Summary and conclusion

In this paper, two mode Q-learning was proposed. Two mode Q-learning utilizes the failure experience of the agent. It consists of a normal Q value module, a failure Q value module and an action selection module; the action selection module and the normal Q value module together constitute conventional Q-learning. Its effectiveness compared with Q-learning was demonstrated through simulation. Also, HSR-IV, a biped robot developed in the RIT Lab., KAIST, was used for implementing Q-learning and two mode Q-learning. The purpose of the learning was to select an action to resist an external force within the angle and ZMP constraints. After the learning phase, 80 test trials were executed to compare the performance of Q-learning and two mode Q-learning. The number of successes within the angle and ZMP constraints was used as the performance measure. The effectiveness of the proposed two mode Q-learning method was verified through the actual experiment.
Acknowledgements

This work was supported by the ITRC-IRRC (Intelligent Robot Research Center) of the Korea Ministry of Information and Communication in 2004.

References

[1] L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement Learning: A Survey," Journal of Artificial Intelligence Research, pp. 237-285, 1996.

[2] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, Bradford Books/MIT Press, 1998.

[3] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, pp. 279-292, 1992.

[4] M. A. Wiering, R. Salustowicz, and J. Schmidhuber, "Reinforcement Learning Soccer Teams with Incomplete World Models," Autonomous Robots, Special issue on Neural Networks for Robot Learning, vol. 7, no. 1, pp. 77-88, 1999.

[5] W. D. Smart, "Making Reinforcement Learning Work on Real Robots," Ph.D. thesis, Brown University, 2002.

[6] M. N. Ahmadabadi and M. Asadpour, "Expertness Based Cooperative Q-Learning," IEEE Trans. Systems, Man and Cybernetics, Part B, vol. 32, no. 1, pp. 66-76, 2002.

[7] M. Asada, E. Uchibe, and K. Hosoda, "Cooperative behavior acquisition for mobile robots in dynamically changing real worlds via vision-based reinforcement learning and development," Artificial Intelligence, vol. 110, pp. 275-292, 1999.

[8] O. Abul, F. Polat, and R. Alhajj, "Multiagent Reinforcement Learning Using Function Approximation," IEEE Trans. Systems, Man and Cybernetics, Part C, vol. 30, no. 4, pp. 485-497, 2000.

[9] D.-H. Kim, Y.-J. Kim, K.-C. Kim, J.-H. Kim, and P. Vadakkepat, "Vector Field Based Path Planning and Petri-net Based Role Selection Mechanism with Q-learning for the Soccer Robot System," Int. Journal of Intelligent Automation and Soft Computing, vol. 6, no. 1, pp. 75-88, 2000.

[10] K.-H. Park, Y.-J. Kim, and J.-H. Kim, "Modular Q-learning based multi-agent cooperation for robot soccer," Robotics and Autonomous Systems, vol. 31, no. 2, pp. 109-122, 2001.

[11] S. P. Singh and R. S. Sutton, "Reinforcement learning with replacing eligibility traces," Machine Learning, vol. 22, pp. 123-158, 1996.

[12] J. Peng and R. J. Williams, "Incremental multi-step Q-learning," Machine Learning, vol. 22, pp. 283-290, 1996.

[13] M. A. Wiering, "Fast on-line Q(λ)," Machine Learning, vol. 33, no. 1, pp. 105-115, 1998.

[14] K.-H. Park and J.-H. Kim, "Two mode Q-learning," Proc. of IEEE Int. Conf. on Evolutionary Computation, pp. 3404-3410, 2003.

[15] K.-H. Park, "Two Mode Q-learning using Failure Experience of the Agent and its Application to Biped Robot," Ph.D. thesis, Korea Advanced Institute of Science and Technology, 2004.

[16] K. Hirai, M. Hirose, Y. Haikawa, and T. Takenaka, "The Development of Honda Humanoid Robot," Proc. of IEEE Int. Conf. on Robotics and Automation, pp. 1321-1326, 1998.

[17] J. Yamaguchi, A. Takanishi, and I. Kato, "Development of a biped walking robot compensating for three-axis moment by trunk motion," Proc. of IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, pp. 561-566, 1993.

[18] Y. Sakagami, R. Watanabe, C. Aoyama, S. Matsunaga, N. Higaki, and K. Fujimura, "The intelligent ASIMO: System overview and integration," Proc. of IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, pp. 2478-2483, 2002.

[19] K. Nishiwaki, T. Sugihara, S. Kagami, F. Kanehiro, M. Inaba, and H. Inoue, "Design and Development of Research Platform for Perception-Action Integration in Humanoid Robot: H6," Proc. of IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, pp. 1559-1564, 2000.

[20] J.-H. Kim, D.-H. Kim, Y.-J. Kim, K.-H. Park, J.-H. Park, C.-K. Moon, K. T. Seow, and K.-C. Koh, "Humanoid Robot HanSaRam: Recent Progress and Development," Journal of Advanced Computational Intelligence & Intelligent Informatics, Fuji Technology Press Ltd., accepted, 2003.

[21] M. Vukobratovic and D. Juricic, "Contribution to the Synthesis of Biped Gait," IEEE Trans. on Bio-Medical Engineering, vol. BME-16, no. 1, pp. 1-6, 1969.

[22] Q. Li, A. Takanishi, and I. Kato, "A Biped Walking Robot Having A ZMP Measurement System Using Universal Force-Moment Sensors," Proc. of IEEE/RSJ Int. Workshop on Intelligent Robots and Systems (IROS '91), pp. 1568-1573, 1991.

[23] J.-H. Kim, S.-W. Park, I.-W. Park, and J.-H. Oh, "Development of a Humanoid Biped Walking Robot Platform KHR-1 - Initial Design and Its Performance Evaluation," Proc. of 3rd IARP Int. Workshop on Humanoid and Human Friendly Robotics, pp. 14-21, 2002.