2nd International Conference on Autonomous Robots and Agents, December 13-15, 2004, Palmerston North, New Zealand

Stabilization of Biped Robot based on Two mode Q-learning

Kui-Hong Park, Jun Jo and *Jong-Hwan Kim

School of Information Technology, Gold Coast campus, Griffith University, PMG 50, Gold Coast Mail Centre, Australia
{k.park, junjo}@griffith.edu.au

*Department of EECS, Korea Advanced Institute of Science and Technology (KAIST), Guseong-dong, Yuseong-gu, Daejeon, 305-701, Republic of Korea
*[email protected]

Abstract

In this paper, two mode Q-learning, an extension of Q-learning, is used to stabilize the Zero Moment Point (ZMP) of a biped robot in the standing posture. In two mode Q-learning, the experiences of both success and failure of an agent are used for fast convergence. To demonstrate the effectiveness of two mode Q-learning against conventional Q-learning, its convergence property is investigated through simulation in a grid world. This paper also presents experimental results of ZMP compensation in the standing posture of a biped robot. An external force generated in the sagittal plane is applied to the biped robot, and a ZMP compensation scheme based on two mode Q-learning is employed. The effectiveness of two mode Q-learning is verified through a real experiment.

Keywords: biped robot, balance control, ZMP, Q-learning, two mode Q-learning

1 Introduction

Q-learning, a well-known method in reinforcement learning [1, 2, 3], is easy to implement, and its Q value convergence is not affected by the learning policy [4, 5, 6]. Hence, it has been used in many applications [7, 8, 9, 10]. Once the Q values have converged to the optimal values, the agent selects only the action with the maximum Q value in a given state. To improve the speed of Q-learning, several modifications of conventional Q-learning have been presented [11, 12, 13]. In this paper, two mode Q-learning [14, 15] is employed to improve the performance of Q-learning by using the agent's failure experience as well as its success experience. Two mode Q-learning bases the selection of an action in the state-action space on both a normal and a failure Q value. To determine the failure Q value from the agent's previous failure experience, it employs a failure Q value module. To investigate the convergence property of two mode Q-learning, a grid world is employed.

A humanoid robot is a biped (i.e., two-legged) intelligent robot that is expected to eventually take the form of a human-like body. Recently, many researchers have focused on developing humanoid robots which are similar in many aspects to human beings [16, 17, 18, 19, 20]. A biped robot is inherently susceptible to instability and is therefore constantly at risk of falling down. Ensuring stability must thus be considered one of the most important goals from the perspective of locomotion. To measure stability, the ZMP concept proposed by Vukobratovic [21] is used. The ZMP is defined as the point on the ground plane at which the total moment of the forces due to ground contacts becomes zero. It is important to recognize that the ZMP must always reside in the convex hull of all contact points on the ground plane, since the moments on the ground plane are caused only by the normal forces at the contact points, which always point in the upward vertical direction. In this paper, a ZMP compensation scheme based on two mode Q-learning is applied to a biped robot.

Section 2 presents two mode Q-learning and its simulation results. In section 3, experimental results of a biped robot comparing the performances of Q-learning and two mode Q-learning are presented. Concluding remarks follow in section 4.

2 Two mode Q-learning

2.1 Description of two mode Q-learning

In two mode Q-learning, two kinds of Q values are considered. One is called the normal Q value (QN), which is the Q value from conventional Q-learning. The other is called the failure Q value (QF), which is obtained from failure experience. The algorithm uses both the normal Q value and the failure Q value, and hence it is called two mode Q-learning.
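For reference, the normal Q value QN is simply the standard tabular Q-learning estimate. The following is a minimal sketch of that update rule; the function name and table layout are illustrative assumptions rather than the authors' code:

```python
from collections import defaultdict

def q_learning_update(q_n, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9):
    """One conventional Q-learning backup:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(q_n[(next_state, a)] for a in actions)
    q_n[(state, action)] += alpha * (reward + gamma * best_next - q_n[(state, action)])

# Example: a Q table keyed by (state, action) pairs, initialised to zero.
q_n = defaultdict(float)
q_learning_update(q_n, state=0, action="east", reward=-1.0, next_state=1,
                  actions=("north", "south", "east", "west"))
```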


The architecture of two mode Q-learning consists of the action selection module, the normal Q value module and the failure Q value module, as shown in Figure 1.

Figure 1: Architecture of two mode Q-learning

In two mode Q-learning, the action selection depends on the total value of the normal Q value $Q^N(s_i, a_j)$ and the failure Q value $Q^F(s_i, a_j)$. For the action selection, the following Boltzmann equation is employed, where for simplicity of notation $Q^N(s_i, a_j)$, $Q^F(s_i, a_j)$ and $Q^T(s_i, a_j)$ are written as $Q^N_{ij}$, $Q^F_{ij}$ and $Q^T_{ij}$, respectively:

$$p(s_i, a_j) = \frac{e^{Q^T_{ij}/\tau}}{\sum_{a_j \in A} e^{Q^T_{ij}/\tau}}, \qquad Q^T_{ij} = Q^N_{ij} + Q^F_{ij} \qquad (1)$$

where $Q^N_{ij}$ is the Q value of action $j$ in state $i$ obtained by conventional Q-learning, $Q^F_{ij}$ is the Q value of action $j$ in state $i$ obtained from the failure Q value module, and $Q^T_{ij}$ is the total of these two Q values. $Q^T_{ij}$ decreases if the value of $Q^F_{ij}$ decreases, and hence the probability of action $a_j$ being selected in state $s_i$ is lowered.

The failure Q value of action $a_j$ in a given state $s_i$ is calculated from the numerator of the Boltzmann equation by setting $e^{Q^F_{ij}/\tau} = 1 - p^F_{ij}$:

$$Q^F_{ij} = \tau \ln(1 - p^F_{ij}) \qquad (2)$$

where $\tau$ is the temperature value used in (1), $i$ is the index of the state, and $j$ is the index of the action. Here $p^F_{ij}$ is the failure probability of the actions included in the failure trial, which is obtained from the following equation:

$$p^F_{ij} = f(p^N_{\min_{ij}}, p^N_{\max_{ij}}(k), l(c), j) \qquad (3)$$

where $N$ is the index of the failure trial (the state-action trace that led the agent to the fail state), $i$ and $j$ are the indices of the state and action, respectively, included in the failure trial $N$, $p^N_{\min_{ij}}$ and $p^N_{\max_{ij}}(k)$ are the minimum and maximum probability values, respectively, and $k$ is the trial number. $l(c)$ is the step length of the state-action trace counted back from the fail state, where $c$ is a constant value. Failure probabilities are calculated only for the actions included in the step length $l(c)$; depending on the environment, the number of steps that led the agent to the fail state varies.

The failure probability applied to these actions is calculated as follows:

$$p^F_{ij} = f(p^N_{\min_{ij}}, p^N_{\max_{ij}}(k), l(c), j) = p^N_{\min_{ij}} + \frac{p^N_{\max_{ij}}(k) - p^N_{\min_{ij}}}{l(c)}\,\bigl(j - (m - l(c))\bigr), \qquad m - l(c) \le j \le m \qquad (4)$$

where $i$ and $j$ are the indices of the state and action, respectively, included in the step length $l(c)$, and $m$ is the number of steps of the state-action trace leading the agent to the fail state. Since $p^N_{\max_{ij}} = 1$ cannot be used in (2), we introduce the notation $1^-$ for a value just less than and close to 1. It should be noted that if the failure probability of an action is $1^-$, the action almost surely leads the agent to the fail state.

The action which led the agent to a fail state in the previous trial may not lead the agent to the same fail state again, so permanently setting the value of $p^F_{ij}$ to $1^-$ would inhibit the corresponding action for the rest of the learning. Thus, the failure probability of the action selected just before the fail state should be decreased as the trials go on, until the agent reaches the fail state again by the same action. For this purpose, the following scheme for decreasing $p^N_{\max_{ij}}(k)$ is proposed:

$$p^N_{\max_{ij}}(k) = \eta^{\,k - k_f} \qquad (5)$$

where $\eta$ is a constant between 0 and 1, $k$ is the current trial number, and $k_f$ is the failure trial number. In this paper, $p^N_{\min_{ij}}$ is fixed as a constant value irrespective of $N$.
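The following sketch puts equations (1)-(5) together in one place. The parameter names (τ, η, pNmin, c, and the value 1−) come from the text; the class layout, method names and the bookkeeping dictionary `fail_info` are illustrative assumptions, not the authors' implementation:

```python
import math
import random
from collections import defaultdict

class TwoModeQLearning:
    """Minimal tabular sketch of two mode Q-learning (equations (1)-(5))."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, tau=1.0,
                 p_min=0.1, eta=0.9, c=2, one_minus=0.99):
        self.actions = tuple(actions)
        self.alpha, self.gamma, self.tau = alpha, gamma, tau
        self.p_min, self.eta, self.c, self.one_minus = p_min, eta, c, one_minus
        self.q_n = defaultdict(float)   # normal Q values  Q^N_ij
        self.q_f = defaultdict(float)   # failure Q values Q^F_ij
        self.fail_info = {}             # (s, a) -> (k_f, position factor of eq. (4))

    def select_action(self, state):
        # Eq. (1): Boltzmann selection over Q^T = Q^N + Q^F.
        q_t = [(self.q_n[(state, a)] + self.q_f[(state, a)]) / self.tau
               for a in self.actions]
        top = max(q_t)                                   # subtract max for numerical stability
        weights = [math.exp(q - top) for q in q_t]
        return random.choices(self.actions, weights)[0]

    def update_normal(self, s, a, r, s_next):
        # Conventional Q-learning backup for Q^N.
        best = max(self.q_n[(s_next, b)] for b in self.actions)
        self.q_n[(s, a)] += self.alpha * (r + self.gamma * best - self.q_n[(s, a)])

    def record_failure(self, trace, k):
        """trace = [(s_1, a_1), ..., (s_m, a_m)] ended in the fail state at trial k."""
        m = len(trace)
        step_len = min(self.c, m)                        # step length l(c), counted from the fail state
        for j in range(max(1, m - step_len), m + 1):     # m - l(c) <= j <= m
            s, a = trace[j - 1]
            frac = (j - (m - step_len)) / step_len       # linear factor from eq. (4)
            self.fail_info[(s, a)] = (k, frac)           # remember failure trial k_f and position

    def refresh_failure_q(self, k):
        """Recompute Q^F at the start of trial k, so p^N_max(k) = eta**(k - k_f) decays (eq. (5))."""
        for (s, a), (k_f, frac) in self.fail_info.items():
            p_max = min(self.eta ** (k - k_f), self.one_minus)   # eq. (5), capped below 1 by "1-"
            p_f = self.p_min + (p_max - self.p_min) * frac       # eq. (4)
            self.q_f[(s, a)] = self.tau * math.log(1.0 - p_f)    # eq. (2)
```

In a training loop one would call `refresh_failure_q(k)` once at the start of each trial, `select_action` and `update_normal` at every step, and `record_failure` whenever a trial ends in the fail state; conventional Q-learning is recovered by never recording failures, so that QF stays zero.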

2.2 Simulation Result


To investigate the convergence property of two mode Q-learning, a grid world was used. The goal of the agent is to find the shortest path from the start state to the success state. In most states, there are four possible actions, moving in the four directions (north, south, east and west); in the edge states, only three of these four actions are available. When the agent enters the success state, it receives 100 points as a reward; when it reaches a fail state, it receives -100 as a penalty. In the grid world used here, there are four start states and 16 fail states, and after each trial the start state is changed according to its index number. Figure 2 shows the simulation results after converging to the optimal Q values by Q-learning and two mode Q-learning for a grid world of size 17 × 17 with several fail and start states. The arrows in the figure depict the optimal action of the agent in each state after the learning phase.

Figure 2: 17 × 17 grid world with fail and start states

In this environment, more than 3,000 trials are needed to converge to the optimal Q values. However, to compare the convergence property during the progress of learning, series of 300, 1000 and 3000 trials were executed. Figure 3 shows the simulation results for the convergence ratio of conventional Q-learning and two mode Q-learning, where the convergence ratio is defined as the ratio of the Q value to the maximum optimal Q value of each state-action pair. The z-axes of these plots depict the average convergence ratio. To obtain the average convergence ratio, the simulation program was run for 20 iterations, where one iteration corresponds to 300, 1000 and 3000 trials, respectively. The values c = 2, p^N_min = 0.1, η = 0.9 and 1− = 0.99 were used. For the purpose of comparison, Boltzmann action selection was also employed in conventional Q-learning. In these plots, Q and TMQ denote Q-learning and two mode Q-learning, respectively. As the figures show, the convergence ratio of two mode Q-learning is better than that of conventional Q-learning for each number of trials. Two mode Q-learning also explores more of the state-action space than conventional Q-learning.

Figure 3: Simulation results of the convergence property in the state-action space: (a) Q, 300 trials; (b) TMQ, 300 trials; (c) Q, 1000 trials; (d) TMQ, 1000 trials; (e) Q, 3000 trials; (f) TMQ, 3000 trials
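For concreteness, the following is a minimal sketch of a grid world of the kind used above (size 17 × 17, reward +100 at the success state and -100 at fail states). The specific success, fail and start coordinates are placeholders, the per-step reward of 0 is an assumption, and edge moves are simply clipped instead of removing the corresponding actions:

```python
class GridWorld:
    """Toy grid world: reach the success state, avoid the fail states."""
    ACTIONS = ("north", "south", "east", "west")
    MOVES = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}

    def __init__(self, size=17, success=(16, 16),
                 fails=((4, 4), (8, 2), (12, 6)),
                 starts=((0, 0), (0, 8), (8, 0), (16, 0))):
        self.size, self.success = size, success
        self.fails, self.starts = set(fails), tuple(starts)

    def reset(self, trial):
        # The start state changes after each trial, cycling through the start list.
        return self.starts[trial % len(self.starts)]

    def step(self, state, action):
        """Return (next_state, reward, done)."""
        dx, dy = self.MOVES[action]
        nxt = (min(max(state[0] + dx, 0), self.size - 1),
               min(max(state[1] + dy, 0), self.size - 1))
        if nxt == self.success:
            return nxt, 100.0, True
        if nxt in self.fails:
            return nxt, -100.0, True
        return nxt, 0.0, False
```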

3 Experimental Result

3.1 HSR-IV

To apply Q-learning and two mode Q-learning to a biped robot, HanSaRam (HSR) [20], developed by the Robot Intelligence Technology Lab., KAIST, was used. Figure 4 shows the HSR-IV.

Figure 4: HSR-IV: (a) front, (b) side

For the ZMP measurement, eight FSRs were installed in the soles of the HSR-IV. For the joints of the lower body, 12 RC servo motors were used; there is no upper body in the HSR-IV. To obtain a linear response to the force measured by the FSRs, OP amp circuits were employed. The ZMP can easily be calculated from the FSR measurements as follows [22, 23]:

$$x_{ZMP} = \frac{\sum_{i=1}^{4} f_i x_i}{\sum_{i=1}^{4} f_i}, \qquad y_{ZMP} = \frac{\sum_{i=1}^{4} f_i y_i}{\sum_{i=1}^{4} f_i} \qquad (6)$$

where $p_i = (x_i, y_i)$, $i = 1, \ldots, 4$, are the four corner positions of the FSRs and $f_1$, $f_2$, $f_3$ and $f_4$ are the forces measured by each FSR.
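A small sketch of the computation in equation (6), a force-weighted average of the FSR corner positions; the corner coordinates in the example are placeholders:

```python
def zmp_from_fsr(positions, forces):
    """positions: [(x_i, y_i)] corner positions of the FSRs; forces: [f_i] FSR readings.
    Implements equation (6): ZMP = sum(f_i * p_i) / sum(f_i)."""
    total = sum(forces)
    if total <= 0.0:
        raise ValueError("no contact force measured")
    x_zmp = sum(f * x for (x, _), f in zip(positions, forces)) / total
    y_zmp = sum(f * y for (_, y), f in zip(positions, forces)) / total
    return x_zmp, y_zmp

# Example with four placeholder corner positions of one sole (in metres):
corners = [(0.00, 0.00), (0.10, 0.00), (0.10, 0.06), (0.00, 0.06)]
print(zmp_from_fsr(corners, forces=[5.0, 4.0, 3.0, 2.0]))
```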


Two different types of micro-controllers were used in the HSR-IV. One is the master micro-controller, which handles the communication between the host PC and the other micro-controller. The other is the slave micro-controller, which controls the 12 RC servo motors of the 12 joints of both legs. In each 20 ms interval, the 12 RC servo motors are controlled by the slave micro-controller.

3.2 Implementation

The experiment for Q-learning and two mode Q-learning on the HSR-IV deals with an external force that is generated in the sagittal plane and applied to the HSR-IV in its standing posture. In this case, the ZMP is calculated by equation (6). The purpose of learning is to select an action that resists the external force within the angle and ZMP constraints. Figure 5 shows the situation in which an external force generated in the sagittal plane impacts the HSR-IV in the standing posture, together with the desired motion of the HSR-IV for resisting the external force.

Figure 5: External force in the sagittal plane

For Q-learning and two mode Q-learning, the time duration of a trial is set to four seconds, which means that the slave micro-controller controls the RC servo motors 200 times during each trial. The interval between trials is also set to four seconds; during this time, the slave micro-controller moves the RC servo motors back to the initial posture, i.e. the initial angle of each actuator before learning. In order to reduce the complexity of the problem, it is assumed that the left leg performs the same motion as the right leg in the parallel direction and that the hip rolling is in the direction opposite to the ankle rolling. Under these assumptions, only one actuator (ankle rolling) reacts to the external force generated in the sagittal plane. For applying Q-learning and two mode Q-learning, 297 (9 × 3 × 11) states, 3 actions per state and the rewards are defined as follows (one possible encoding is sketched at the end of this subsection):

• States:
  1. State 1: variation of the ankle rolling (9 states)
  2. State 2: angular velocity of the ankle rolling (3 states)
  3. State 3: variation of the Y component of the ZMP (11 states)

• Actions: velocity of the ankle rolling during 20 ms: -0.5°, 0°, +0.5°

• Rewards:
  1. Positive reward: if the selected action is in the direction opposite to the external force, r = 100
  2. Negative reward: if the angle or ZMP constraints are exceeded, r = -100
  3. Instant reward: r = 50 / (10 + |ZMP_Y|), where ZMP_Y is the Y component of the ZMP

Figure 6 shows the states for Q-learning and two mode Q-learning.

Figure 6: States for the experiment (state component 1: ankle rolling; state component 2: angular velocity; state component 3: variation of ZMP_Y)

To compare the learning performance of Q-learning and two mode Q-learning, 800 trials were executed. In the real experiment, a learning rate α = 0.1, a discount rate γ = 0.9, the minimum failure probability pNmin = 0.1, the parameter of the maximum failure probability η = 0.9, and the constant value of the step length c = 4 were used.
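As referenced above, the following is one possible encoding of the 297-state discretisation and the reward. The bin edges are placeholders (the actual thresholds appear only in Figure 6, which is not reproduced here), and the precedence of the three reward cases is an assumption:

```python
import bisect

# Sketch of the state discretisation and reward: 9 x 3 x 11 = 297 states,
# three ankle-rolling velocity actions, and the rewards listed above.
ANKLE_BINS = [-10, -6, -3, -1, 1, 3, 6, 10]                # degrees -> 9 bins (assumed edges)
VEL_BINS   = [-0.25, 0.25]                                 # deg per 20 ms -> 3 bins (assumed edges)
ZMP_BINS   = [-70, -45, -25, -12, -4, 4, 12, 25, 45, 70]   # mm -> 11 bins (assumed edges)
ACTIONS    = (-0.5, 0.0, +0.5)                             # ankle rolling velocity during 20 ms (deg)

def discretise(ankle_var, ankle_vel, zmp_var):
    """Map continuous measurements to one of the 297 discrete states (0..296)."""
    s1 = bisect.bisect(ANKLE_BINS, ankle_var)   # 0..8
    s2 = bisect.bisect(VEL_BINS, ankle_vel)     # 0..2
    s3 = bisect.bisect(ZMP_BINS, zmp_var)       # 0..10
    return (s1 * 3 + s2) * 11 + s3

def reward(action, force_direction, constraints_exceeded, zmp_y):
    """+100 for opposing the external force, -100 for violating the angle/ZMP
    constraints, otherwise the instant reward r = 50 / (10 + |ZMP_Y|)."""
    if constraints_exceeded:
        return -100.0
    if action * force_direction < 0:            # action opposes the external force
        return 100.0
    return 50.0 / (10.0 + abs(zmp_y))
```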

3.3 Result

The experiment took one and a half hours to complete the 800 trials. To sense the external force, one FSR was used with the micro-controller. When the HSR-IV selects an action which resists the external force, the external force decreases gradually. After the learning phase, the performances of Q-learning and two mode Q-learning were compared over 80 test trials. The number of successes within the angle and ZMP constraints was used to measure performance. Table 1 shows the result of the 80 test trials. With Q-learning, the number of successes was 19 out of 80 test trials; with two mode Q-learning, it was 33. Out of the four seconds of each trial (200 control ticks), the number of trials lasting longer than 100 ticks was 30 with Q-learning and 49 with two mode Q-learning. The table compares the performance of the two algorithms and illustrates that the performance of two mode Q-learning is superior to that of Q-learning in this experiment.

Table 1: Experimental result in the sagittal plane (80 test trials in four sets of 20; per-trial markers omitted)

                                 Q-learning    Two mode Q-learning
  Number of successes                19                 33
  Trials longer than 100 ticks       30                 49

A human finger was also used in place of the external force apparatus. Figures 7(a) and 7(b) show snapshots of the experimental result of two mode Q-learning with a mechanically applied external force and with finger force, respectively.

Figure 7: Snapshots of the experimental results in the sagittal plane: (a) experimental result with external force, (b) experimental result with finger force

4 Summary and conclusion

In this paper, two mode Q-learning was proposed. Two mode Q-learning utilizes the failure experience of the agent. It consists of a normal Q value module, a failure Q value module and an action selection module; the action selection module and the normal Q value module together constitute conventional Q-learning. The effectiveness of two mode Q-learning against Q-learning was demonstrated through simulation. Also, HSR-IV, a biped robot developed in the RIT Lab., KAIST, was used for implementing Q-learning and two mode Q-learning. The purpose of the learning is to select an action that resists an external force within the angle and ZMP constraints. After the learning phase, 80 test trials were executed to compare the performance of Q-learning and two mode Q-learning. The number of successes within the angle and ZMP constraints was used as the performance measure. The effectiveness of the proposed two mode Q-learning method was verified through the actual experiment.

Acknowledgements

This work was supported by the ITRC-IRRC (Intelligent Robot Research Center) of the Korea Ministry of Information and Communication in 2004.

References

[1] L. P. Kaelbling, M. L. Littman and A. W. Moore, "Reinforcement Learning: A Survey," Journal of Artificial Intelligence Research, pp. 237-285, 1996.

[2] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, Bradford Books/MIT Press, 1998.

[3] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, pp. 279-292, 1992.

[4] M. A. Wiering, R. Salustowicz and J. Schmidhuber, "Reinforcement Learning Soccer Teams with Incomplete World Models," Autonomous Robots, Special issue on Neural Networks for Robot Learning, vol. 7, no. 1, pp. 77-88, 1999.

[5] W. D. Smart, "Making Reinforcement Learning Work on Real Robots," Ph.D. thesis, Brown University, 2002.

[6] M. N. Ahmadabadi and M. Asadpour, "Expertness Based Cooperative Q-Learning," IEEE Trans. Systems, Man and Cybernetics, Part B, vol. 32, no. 1, pp. 66-76, 2002.

[7] M. Asada, E. Uchibe and K. Hosoda, "Cooperative behavior acquisition for mobile robots in dynamically changing real worlds via vision-based reinforcement learning and development," Artificial Intelligence, vol. 110, pp. 275-292, 1999.

[8] O. Abul, F. Polat and R. Alhajj, "Multiagent Reinforcement Learning Using Function Approximation," IEEE Trans. Systems, Man and Cybernetics, Part C, vol. 30, no. 4, pp. 485-497, 2000.

[9] D.-H. Kim, Y.-J. Kim, K.-C. Kim, J.-H. Kim and P. Vadakkepat, "Vector Field Based Path Planning and Petri-net Based Role Selection Mechanism with Q-learning for the Soccer Robot System," Int. Journal of Intelligent Automation and Soft Computing, vol. 6, no. 1, pp. 75-88, 2000.

[10] K.-H. Park, Y.-J. Kim and J.-H. Kim, "Modular Q-learning based multi-agent cooperation for robot soccer," Robotics and Autonomous Systems, vol. 31, no. 2, pp. 109-122, 2001.

[11] S. P. Singh and R. S. Sutton, "Reinforcement learning with replacing eligibility traces," Machine Learning, vol. 22, pp. 123-158, 1996.

[12] J. Peng and R. J. Williams, "Incremental multi-step Q-learning," Machine Learning, vol. 22, pp. 283-290, 1996.

[13] M. A. Wiering, "Fast online Q(λ)," Machine Learning, vol. 33, no. 1, pp. 105-115, 1998.

[14] K.-H. Park and J.-H. Kim, "Two mode Q-learning," Proc. of IEEE Int. Conf. on Evolutionary Computation, pp. 3404-3410, 2003.

[15] K.-H. Park, "Two Mode Q-learning using Failure Experience of the Agent and its Application to Biped Robot," Ph.D. thesis, Korea Advanced Institute of Science and Technology, 2004.

[16] K. Hirai, M. Hirose, Y. Haikawa and T. Takenaka, "The Development of Honda Humanoid Robot," Proc. of IEEE Int. Conf. on Robotics and Automation, pp. 1321-1326, 1998.

[17] J. Yamaguchi, A. Takanishi and I. Kato, "Development of a biped walking robot compensating for three-axis moment by trunk motion," Proc. of IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, pp. 561-566, 1993.

[18] Y. Sakagami, R. Watanabe, C. Aoyama, S. Matsunaga, N. Higaki and K. Fujimura, "The intelligent ASIMO: System overview and integration," Proc. of IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, pp. 2478-2483, 2002.

[19] K. Nishiwaki, T. Sugihara, S. Kagami, F. Kanehiro, M. Inaba and H. Inoue, "Design and Development of Research Platform for Perception-Action Integration in Humanoid Robot: H6," Proc. of IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, pp. 1559-1564, 2000.

[20] J.-H. Kim, D.-H. Kim, Y.-J. Kim, K.-H. Park, J.-H. Park, C.-K. Moon, K. T. Seow and K.-C. Koh, "Humanoid Robot HanSaRam: Recent Progress and Development," Journal of Advanced Computational Intelligence & Intelligent Informatics, Fuji Technology Press Ltd., accepted, 2003.

[21] M. Vukobratovic and D. Juricic, "Contribution to the Synthesis of Biped Gait," IEEE Trans. on Bio-Medical Engineering, vol. BME-16, no. 1, pp. 1-6, 1969.

[22] Q. Li, A. Takanishi and I. Kato, "A Biped Walking Robot Having a ZMP Measurement System Using Universal Force-Moment Sensors," Proc. of IEEE/RSJ Int. Workshop on Intelligent Robots and Systems (IROS '91), pp. 1568-1573, 1991.

[23] J.-H. Kim, S.-W. Park, I.-W. Park and J.-H. Oh, "Development of a Humanoid Biped Walking Robot Platform KHR-1 - Initial Design and Its Performance Evaluation," Proc. of 3rd IARP Int. Workshop on Humanoid and Human Friendly Robotics, pp. 14-21, 2002.