Using a controller based on reinforcement learning for a passive dynamic walking robot

E. Schuitema (+), D. G. E. Hobbelen (*), P. P. Jonker (+), M. Wisse (*), J. G. D. Karssen (*)

The authors are with the Faculties of (+) Applied Sciences, Lorentzweg 1, 2628 CJ, and (*) Mechanical Engineering, Mekelweg 2, 2628 CD, Delft University of Technology, Delft, The Netherlands. E-mail: [email protected].
Abstract— One of the difficulties with passive dynamic walking is the stability of the walking motion. In our robot, small uneven or tilted parts of the floor disturb the locomotion and must be dealt with by the feedback controller of the hip actuation mechanism. This paper presents a solution to this problem in the form of a controller based on reinforcement learning. The control mechanism is studied using a simulation model based on a mechanical prototype of a passive dynamic walking robot with a conventional feedback controller. The successful walking results of our simulated robot with a reinforcement learning controller show that, in addition to preserving the prime principle of our mechanical prototype, new possibilities open up: optimization towards various goals, such as maximum speed or minimal cost of transport, and adaptation to unknown situations can be achieved quickly.

Index Terms— Passive Dynamic Walking, Reinforcement Learning, Hip Controller, Biped.
I. INTRODUCTION

TWO-LEGGED WALKING ROBOTS have a strong attractive appeal due to their resemblance to human beings. Consequently, several major research institutions and private companies have started to develop bipedal (two-legged) robots, which has led to sophisticated machines [14], [8]. To enable economically viable commercialization (e.g. for entertainment), the challenge is now to reduce the design complexity of these early successes, in search of the ideal set of characteristics: stability, simplicity, and energy efficiency. A promising idea for the simultaneous reduction of complexity and energy consumption, while maintaining or even increasing the stability, is McGeer’s concept of ‘passive dynamic walking’ [9]. On a shallow slope, a system consisting of two legs with well-chosen mass properties can already show stable and sustained walking [6]. No actuators or controls are necessary, as the swing leg moves at its natural frequency. Using McGeer’s concept as a starting point, we realized a number of 2D and 3D mechanical prototypes of increasing complexity [24], [22], [5]. These prototypes are all powered by hip actuation, and their control is extremely simple: a foot switch per leg triggers a change in the desired hip angle, resulting in a swing of the opposite leg. Although the passive dynamics combined with this simple controller already reject small disturbances, larger disturbances, such as an uneven floor, quickly lead to failures [15]. Also, the simple controller does not guarantee optimal efficiency or speed. Consequently, in this paper we elaborate on the introduction of more complex controllers
based on learning. A learning controller has several advantages:
• It is model free, so no model of the biped’s dynamic system nor of the environment is needed.
• It uses a fully result driven optimization method.
• It learns on-line, which is in principle possible on a real robot.
• It is adaptive, in the sense that when the robot or its environment changes without notice, the controller can adapt until performance is again maximal.
• It can optimize relatively easily towards several goals, such as minimum cost of transport, largest average forward speed, or both.
Section II gives an overview of the concept of passive dynamic walking, our mechanical prototype (Fig. 1) and the 2D simulation model describing the dynamics of this prototype. This simulation model is used for our learning controller studies. Section III describes the principles of reinforcement learning, their application in a controller for walking, and our measurements. In Section IV we conclude that a reinforcement learning based controller provides an elegant and simple control solution for stable and efficient biped walking.
Fig. 1. ‘Meta’; a 2D robot based on the principle of passive dynamic walking. This study is based on the simulation model of this prototype.
II. PASSIVE DYNAMIC WALKING

A. Basic Principles

Since McGeer’s work, the idea of passive dynamic walking has gained in popularity. The most advanced fully passive walker, constructed at Cornell University, has two legs, knees, counter-swinging arms, and stable three-dimensional dynamics [6]. The purely passive walking prototypes demonstrate convincing walking patterns; however, they require a slope as well as a smooth and well adjusted walking surface. A small disturbance (e.g. introduced by the manual launch) can still be handled, but larger disturbances quickly lead to a failure [15]. One way to power passive dynamic walkers on a level floor and make them more robust to large disturbances is hip actuation. This type of actuation can supply the necessary energy for maintaining a walking motion and keeps the robot from falling forward [23]. The faster the swing leg is swung forward (and then kept there), the more robust the walker is against disturbances. This creates a trade-off between energy consumption and robustness in the amount of hip actuation that is applied.
B. Mechanical prototype

The combination of passive dynamics and hip actuation has resulted in multiple prototypes made at the Delft Biorobotics Laboratory. The most recent 2D model is Meta (Fig. 1), which is the subject of this study. This prototype is a 2D walker consisting of 7 body parts (an upper body, two upper legs, two lower legs and two feet). It has a total of 5 Degrees of Freedom, located in the hip joint, two knee joints and two ankle joints. The upper body is connected to the upper legs by a bisecting hip mechanism, which passively keeps the upper body at the intermediate angle of the two legs [22]. The system is powered by a DC motor located at the hip. This actuator is connected to the hip joint through a compliant element, based on the concept of Series Elastic Actuation first introduced by the MIT Leg Lab [13]. Measuring the elongation of this compliant element allows the hip joint to be force controlled. The compliance ensures that the actuator’s output impedance is low, which makes it possible to replicate passive dynamic motions. It also ensures that the actuator performs well in the presence of impacts. This actuator construction allows us to apply a desired torque pattern up to a maximum torque of around 10 Nm with a bandwidth of around 20 Hz. These properties should allow the reinforcement learning based controller to be implemented in practice in the near future. The prototype is fully autonomous, running on lithium ion polymer batteries. The control platform is a PC/104 stack with a 400 MHz processor, and the controllers are implemented through the Matlab Simulink xPC Target environment. The angles of all 5 joints as well as the elongation of the actuator’s compliant element are measured in real time using incremental encoders. In addition to these sensors, there are two switches underneath the feet to detect foot contact.
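To illustrate how measuring the elongation of the compliant element enables force (torque) control of the hip joint, the following is a minimal sketch of a series elastic torque loop. The stiffness value, the PI gains and the class name are illustrative assumptions, not values or code taken from the prototype.

```python
# Minimal sketch of torque control with a series elastic actuator: the hip torque
# is estimated from the measured elongation of the compliant element and driven
# to the desired value with a PI loop. Stiffness and gains are assumed values.

class SeriesElasticTorqueController:
    def __init__(self, spring_stiffness=300.0, kp=2.0, ki=10.0, dt=0.001):
        self.k = spring_stiffness   # [Nm/rad] assumed stiffness of the compliant element
        self.kp, self.ki = kp, ki   # assumed PI gains of the torque loop
        self.dt = dt                # [s] control period
        self.integral = 0.0

    def update(self, tau_desired, spring_elongation):
        """Return a motor command that drives the measured hip torque to tau_desired."""
        tau_measured = self.k * spring_elongation         # torque from spring deflection
        error = tau_desired - tau_measured
        self.integral += error * self.dt
        return self.kp * error + self.ki * self.integral  # e.g. a motor current setpoint
```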
The knee and ankle joints are both fully passive, but the knee joint can be locked to keep the knee extended whenever the robot is standing on the corresponding leg. The prototype can walk using a fairly simple control algorithm. The hip angle is PD controlled towards a constant reference hip angle. When the foot switch of the current swing leg makes contact (so that this leg becomes the new stance leg), the reference angle is inverted, effectively pulling the new swing leg forward. Simultaneously, the knee latches of the new swing leg are released briefly. The system then simply waits for the foot switch of the new swing leg to make contact, assuming that knee extension takes place before heel contact. A minimal sketch of this baseline controller is given below.
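The sketch below illustrates the baseline PD hip controller with reference switching on foot contact, as described above. The reference angle, gains and names are illustrative assumptions and do not correspond to the actual prototype implementation.

```python
# Sketch of the baseline gait controller: the hip angle is PD controlled towards a
# constant reference angle whose sign is inverted whenever the swing foot switch
# makes contact; the new swing leg's knee latches are released briefly at that moment.
# The reference angle and gains are illustrative, not the prototype's values.

class BaselineHipController:
    def __init__(self, reference=0.6, kp=30.0, kd=1.0):
        self.reference = reference   # [rad] assumed desired hip (inter-leg) angle
        self.kp, self.kd = kp, kd    # assumed PD gains
        self.release_knee_latch = False

    def step(self, hip_angle, hip_velocity, swing_foot_contact):
        """Return the desired hip torque for the current control step."""
        if swing_foot_contact:                  # swing leg becomes the new stance leg
            self.reference = -self.reference    # pull the new swing leg forward
            self.release_knee_latch = True      # briefly unlock the new swing leg's knee
        else:
            self.release_knee_latch = False
        error = self.reference - hip_angle
        return self.kp * error - self.kd * hip_velocity
```

The returned torque could serve as the setpoint for an inner torque loop such as the series elastic loop sketched above.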
C. Dynamic system model

The dynamic simulation model used in this study was made using the Open Dynamics Engine physics simulator [16]. The model consists of the same 7 body parts as the prototype, modeled as rigid links, each with an associated mass and moment of inertia (Fig. 2). The joints are modeled by stiff spring-damper combinations. The knees are provided with a hyperextension stop and a locking mechanism that is released just after the start of the swing phase. The hip bisecting mechanism that keeps the upper body upright is modeled by introducing a kinematic chain through two added bodies with negligible mass. The floor is provisionally assumed to be a rigid, flat, and level surface. Contact between the foot and the ground is also modeled by a tuned spring-damper combination, which is active whenever part of the foot is below the ground. The model of the foot mainly consists of two cylinders at the back and the front of the foot. The spring-damper combination is tuned such that the qualitative motion of the model is similar to that of a rigid contact model made in Matlab, which has been validated using measurements from a former prototype [22]. A thorough validation of our ODE model against the prototype will be performed in the near future.
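As a sketch of the penalty-based foot-ground contact described above, the code below computes a vertical contact force from a spring-damper law whenever a contact point penetrates the floor. The stiffness and damping constants are illustrative assumptions, not the tuned values of the ODE model.

```python
# Penalty-based ground contact: a spring-damper force acts whenever a point on
# the foot is below the floor. Stiffness/damping values are illustrative only.

K_CONTACT = 1.0e5   # [N/m] assumed contact stiffness
D_CONTACT = 1.0e3   # [Ns/m] assumed contact damping

def contact_force(height, vertical_velocity):
    """Vertical ground reaction force for a single contact point on the foot."""
    penetration = -height                     # positive when the point is below the floor
    if penetration <= 0.0:
        return 0.0                            # no contact
    force = K_CONTACT * penetration - D_CONTACT * vertical_velocity
    return max(force, 0.0)                    # the floor can only push, never pull
```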
Fig. 2. Two-dimensional 7-link model. Left: the parameter definitions; right: the Degrees of Freedom (DoFs). Only the DoFs of the swing leg are given; they are identical to the DoFs of the other leg.
Fig. 3. Learning in a simulator first and downloading the result: the Trainer and Learning modules are coupled first to the Simulator and subsequently to the Robot.
A set of physically realistic parameter values was derived from the prototype; see Table I. These values were used throughout this study.

TABLE I
DEFAULT PARAMETER VALUES FOR THE SIMULATION MODEL
Parameter                    Body     Upper leg   Lower leg   Foot
mass m [kg]                  8        0.7         0.7         0.1
mom. of inertia I [kgm2]     0.11     0.005       0.005       0.0001
length l [m]                 0.45     0.3         0.3         0.06
vert. dist. CoM c [m]        0.2      0.15        0.15        0
hor. offset CoM w [m]        0.02     0           0           0.015
foot radius fr [m]           -        -           -           0.02
foot hor. offset fh [m]      -        -           -           0.015
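For use in a simulation script, the Table I values can be collected in a simple data structure. The sketch below is merely one possible representation (the field names are our own), not the format used in the authors' ODE model.

```python
# Table I parameters as a plain dictionary, one entry per body part.
# Units: mass [kg], inertia [kg m^2], lengths and offsets [m]; None where not applicable.

MODEL_PARAMETERS = {
    "body":      {"mass": 8.0, "inertia": 0.11,   "length": 0.45, "com_vertical": 0.20,
                  "com_horizontal": 0.02,  "foot_radius": None, "foot_offset": None},
    "upper_leg": {"mass": 0.7, "inertia": 0.005,  "length": 0.30, "com_vertical": 0.15,
                  "com_horizontal": 0.0,   "foot_radius": None, "foot_offset": None},
    "lower_leg": {"mass": 0.7, "inertia": 0.005,  "length": 0.30, "com_vertical": 0.15,
                  "com_horizontal": 0.0,   "foot_radius": None, "foot_offset": None},
    "foot":      {"mass": 0.1, "inertia": 0.0001, "length": 0.06, "com_vertical": 0.0,
                  "com_horizontal": 0.015, "foot_radius": 0.02, "foot_offset": 0.015},
}
```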
III. REINFORCEMENT LEARNING BASED CONTROL

A. Simulation versus on-line learning

A learning controller has several advantages over a normal PID controller. It is model free, so no model of the biped’s dynamic system nor of the environment is needed. It uses result driven optimization. It is adaptive, in the sense that when the robot or its environment changes without notice, the controller can adapt until performance is again maximal. It can optimize relatively easily towards several goals, such as minimum cost of transport, largest average forward speed, or both. In principle, learning can be performed on-line on the real robot itself. However, the problem with learning robot control through trial and error from scratch is that the robot will fall down quite a few times, that the robot needs to be re-initialized in initial states over and over again, and that its behavior needs to be monitored adequately. With a good simulator that adequately describes the real robot, an adaptive and optimizing controller can be learned without tedious human labor to ”coach” the robot and without the robot damaging itself. Moreover, learning occurs at the computer’s calculation speed, which usually means several times real-time. The final result can be downloaded into the controller of the real robot, after which learning can be continued. Fig. 3 shows the learning controller that first learns on a simulator of the robot, after which its result can be downloaded to the controller of the real robot. Note that the controller is divided into the controller itself and a trainer (internal to the controller on a meta level) that controls the reward assignments.

B. State of the art

Using reinforcement learning techniques for the control of walking bipeds is not new [3], [4], [12], [20]. Especially
interesting for passive dynamics based walkers is Poincaré based reinforcement learning as discussed in [11], [7], [10]. Other promising current work in the field of simultaneous learning and execution is found in [18], [19]. Due to the mechanical design of their robot, it is able to acquire a robust policy for dynamic bipedal walking from scratch. The trials are implemented on the physical robot itself; a simulator for offline pre-learning is not necessary. The robot begins walking within a minute and learning converges in approximately 20 minutes. It quickly and continually adapts to the terrain with every step it takes. Our approach is based on experiences from various successful mechanical prototypes and is similar to the approach of e.g. [11]. Although we aim for a solution as found in [19], and although our simulated robot also often converges quickly to walking (see Fig. 5), until now we feel more comfortable with the approach of learning a number of controllers from random initialization and downloading the best of their results into the physical robot; see Section III-I. Not found in the literature is the optimization that we applied towards various goals, such as speed and efficiency; see Section III-H. Unlike methods based on Poincaré mapping, our method does not require periodic solutions with a one-footstep period.

C. Reinforcement learning principles

Reinforcement learning is learning what to do - how to map situations to actions - so as to maximize a numerical reward signal [17]. In principle it performs a trial-and-error search through a state-action space to optimize the cumulative discounted sum of rewards. This may include rewards delayed over several time steps. In reinforcement learning for control problems, we are trying to find an optimal action selection method or policy π, which gives us the optimal action-value function defined by:

Q*(s, a) = max_π Q^π(s, a)   ∀ s ∈ S and a ∈ A(s),   (1)

which may be shared by several optimal policies π*. Q-learning [21] is an off-policy temporal difference (TD) control algorithm that approximates the optimal action-value function independent of the policy being followed, in our case the ε-greedy policy. The update rule for Q-learning, which is applied after every state transition, is:

Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ max_{a'} Q(s_{t+1}, a') − Q(s_t, a_t)]   (2)

in which s_t is our state signal, a_t is the chosen action, s_{t+1} is the new state after action a_t has been performed, and a' is an iterator to find the action that gives the maximum Q(s_{t+1}, a').
• α is the learning rate, constant in our tests, which defines how much of the new estimate is blended with the old estimate.
• r is the reward received after taking action a in state s.
• γ is the rate at which delayed rewards are discounted every time step.
During learning, actions are selected according to the ε-greedy policy: with probability (1 − ε) the greedy action is chosen, i.e. the action with the maximum Q-value in the current state (exploitation), and with probability ε a random action is chosen (exploration).
When the state signal succeeds in retaining all relevant information about the current learning situation, it is said to have the Markov property. A standard technique often combined with Q-learning is the use of an eligibility trace: by keeping a record of the state-action pairs visited over the last few time steps, all state-action pairs in the trace are updated, with decaying importance. In this paper we used Q-learning combined with an eligibility trace in the way Watkins [21] first proposed: Watkins’ Q(λ). To approximate our action-value function, CMAC tile coding [17], [21], [1], [2], a linear function approximator, was used. For each input and output dimension, values within the dimension dependent tile width are discretized to one state, creating a tiling. By constructing several randomly shifted tilings, each real valued state-action pair falls into one tile in each of the tilings. The Q-value of a certain state-action pair is then approximated by averaging all tile values that the state-action pair falls into. Throughout this research, we used 10 tilings. The Q-values are all initialized with a random value between 0 and 1. Depending on the random number generator, the initial values can be favorable or unfavorable for finding an actuation pattern that yields a walking motion.

D. Learning with a dynamic system simulator

The state space of the walking biped problem consists of six dimensions: the angle and angular velocity of the upper stance leg, the upper swing leg, and the lower swing leg. In order not to learn the same thing twice, symmetry between the left and right leg is exploited by mirroring the left and right leg state information when the stance leg changes. In the mirrored case, the chosen hip torque is also mirrored by negation. This defines the state of the robot except for the feet, thereby not fully complying with the Markov property, but coming very close once walking cycles are found. There is one output dimension: the torque to be applied to the hip joint, which was given a range between -8 and 8 Nm, divided into 60 discrete torques to be evaluated in the function approximator when choosing the best action. All dimensions (input and output) were given approximately the same number of discrete states within their range of occurrence during a walking cycle. This boils down to about 100,000 discrete states, or estimating 1,000,000 Q-values when using 10 tilings. The parameters of Q-learning were set to α=0.25, γ=1.0, ε=0.05 and λ=0.92, while ε decays with time with a discount rate of 0.9999 per second. The values for α, ε and λ are very common in Q-learning. The choice of γ will be explained for each learning problem. A test run was performed after every 20 learning runs, measuring the average hip speed, the cost of transport and the number of footsteps taken. A minimal sketch of this learning setup is given below.
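The following sketch illustrates the combination described above: tile coding over the six state dimensions plus the discretized hip torque, ε-greedy action selection, and a Watkins'-style Q(λ) update. The tile widths, the evenly shifted tilings, the replacing-trace variant and the simplified trace cutting are assumptions of this sketch; the authors' actual implementation may differ.

```python
import random
from collections import defaultdict

# Tile-coded Q-function over 6 state dimensions plus the discretized hip torque,
# epsilon-greedy action selection, and a simplified Watkins'-style Q(lambda) update.

N_TILINGS = 10
TILE_WIDTHS = [0.2, 0.2, 0.2, 1.0, 1.0, 1.0, 0.54]  # 3 angles, 3 velocities, torque (assumed)
ACTIONS = [-8.0 + 16.0 * i / 59 for i in range(60)]  # 60 discrete hip torques in [-8, 8] Nm

weights = defaultdict(random.random)   # tile values (Q-values) initialized in [0, 1)
traces = defaultdict(float)            # eligibility trace per tile

def active_tiles(state, action):
    """One tile index per tiling for this state-action pair (evenly shifted tilings)."""
    return [(t,) + tuple(int((x + t * w / N_TILINGS) // w)
                         for x, w in zip(list(state) + [action], TILE_WIDTHS))
            for t in range(N_TILINGS)]

def q_value(state, action):
    """Q-value approximated by averaging the values of the active tiles."""
    return sum(weights[tile] for tile in active_tiles(state, action)) / N_TILINGS

def select_action(state, epsilon=0.05):
    """Epsilon-greedy: random action with probability epsilon, otherwise greedy."""
    if random.random() < epsilon:
        return random.choice(ACTIONS), False
    return max(ACTIONS, key=lambda a: q_value(state, a)), True

def q_lambda_update(s, a, reward, s_next, was_greedy,
                    alpha=0.25, gamma=1.0, lam=0.92):
    """One learning step; traces are cut after a non-greedy (exploratory) action,
    which is a simplified variant of Watkins' trace cutting."""
    delta = reward + gamma * max(q_value(s_next, a2) for a2 in ACTIONS) - q_value(s, a)
    for tile in active_tiles(s, a):
        traces[tile] = 1.0                         # replacing traces (assumption)
    for tile in list(traces):
        weights[tile] += alpha * delta * traces[tile] / N_TILINGS
        traces[tile] = traces[tile] * gamma * lam if was_greedy else 0.0
        if traces[tile] < 1e-4:
            del traces[tile]
```

In a learning run, `select_action` and `q_lambda_update` would be called once per simulation time step, with the state vector built from the mirrored leg angles and angular velocities as described above.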
Fig. 4. The simulated robot performing steps.

E. Learning to walk

At the start of a learning run, the robot is placed in an initial condition which is known to lead to a stable walking motion with the PD controlled hip actuation: a left leg angle of 0.17 rad, a right leg angle of -0.5 rad (both with respect to the absolute vertical) and an angular velocity of 0.55 rad/s for all body parts. This places the robot in such a state that the first footstep can hardly be missed (see Fig. 5).
The learning run ends when either the robot falls (ground contact of the head, knees or hip) or when it has made 16 footsteps. The discount factor γ was set to 1.0, since time does not play a role in this learning problem. In order to keep the expected total (undiscounted) sum of rewards bounded, the maximum number of footsteps is limited to 16. To learn a stable walking motion, the following reward scheme was chosen: a positive reward is given when a footstep occurs, while a negative reward is given per time step when the hip moves backward. A footstep does not count if the hip angle exceeds 1.2 rad, to avoid rewarding overly stretched steps. This scheme leaves a large freedom in the actual actuation pattern, since there is not one best way to finish 16 footsteps when disturbances are small or zero. This kind of reward structure leads to a walking motion very quickly, often within 30 minutes of simulation time, sometimes within 3 minutes, depending on the random initialization of the Q-values and the amount of exploration that was set. Inherently, in all learning problems in this paper, a trade-off is made between robustness against disturbances and the goal set by the rewards, simply because the total expected return will be higher when the full run of 16 footsteps is finished. Although the disturbances are self-induced by exploration, an irregular gait and/or the initial condition, the states outside the optimal walking motion may equally well occur because of external disturbances.

F. Minimizing cost of transport

To minimize the specific cost of transport (C.o.T.), defined as the amount of energy used per unit transported system weight (m.g) per distance traveled, the following reward scheme was chosen: a reward of +200/m proportional to the length of the footstep when a footstep occurs, a reward of -8.3/J.s proportional to the motor work done per time step, and a reward of -333/s for every time step in which the hip moves backward. The first reward is largely deterministic, because the angles of both upper legs define the length of the footstep, provided that both feet are touching the floor and that the length of both legs is constant. The second reward is completely deterministic, being calculated from the angular velocities of both upper legs (which are part of the state space) and the hip motor torque (chosen as action). Again no discounting is used (γ = 1.0). The optimal policy will be the one that maximizes the trade-off between making large footsteps and spending energy.
Fig. 5. Learning curves: average number of footsteps taken versus learning time [min], averaged over 50 learning episodes for each optimization problem (efficient, fast, and fast and efficient).
The negative reward for backward movement of the hip should not occur once a walking cycle has been found, and thus will mostly play a role at the start of the learning process. However, when walking slowly and accidentally becoming unstable on the brink of falling backward, the robot often keeps the leg with the unlocked knee straight and stiff, standing still. Fig. 5 shows the average learning curve of 50 learning episodes (different random seeds) when optimizing for minimum cost of transport. The average and minimum cost of transport can be found in Table II; a sketch of this reward computation is given after the table.

TABLE II
AVERAGE AND BEST VALUES FOR HIP SPEED AND COST OF TRANSPORT (COT) FOR ALL THREE OPTIMIZATION PROBLEMS
                        Optimization    Optimization    Optimization on
                        on speed        on CoT          speed and CoT
Average speed [m/s]     0.554           0.526           0.540
Maximum speed [m/s]     0.582           0.549           0.566
Average CoT [-]         0.175           0.102           0.121
Minimum CoT [-]         0.120           0.078           0.090
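As an illustration of the cost-of-transport reward scheme described above, the following sketch reconstructs the per-time-step reward from the magnitudes quoted in the text (+200/m of footstep length, -8.3 per unit of motor work per time step, -333/s while the hip moves backward). The time step and the exact unit interpretation of the penalties are assumptions of this sketch.

```python
# Illustrative reconstruction of the cost-of-transport reward of Section III-F.
# The reward magnitudes are taken from the text; the time step and the detection
# of footsteps and backward hip motion are assumptions.

DT = 0.01  # [s] assumed learning time step

def cot_reward(footstep_occurred, footstep_length, motor_work, hip_moves_backward):
    """Reward for one time step when optimizing for minimum cost of transport.

    footstep_length  [m] length of the footstep (used only when one occurred)
    motor_work       [J] magnitude of the hip motor work done during this time step
    """
    reward = 0.0
    if footstep_occurred:
        reward += 200.0 * footstep_length   # reward proportional to step length
    reward -= 8.3 * motor_work              # penalize energy spent by the hip motor
    if hip_moves_backward:
        reward -= 333.0 * DT                # penalty rate while the hip moves backward
    return reward
```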
G. Maximizing speed

To maximize the forward speed, the following reward scheme was chosen: a reward of 150/m proportional to the length of the footstep when a footstep occurs, a reward of -56/s every time step, and a reward of -333/s for every time step in which the hip moves backward. Again, no discounting is used, although time does play a role in this optimization problem: our reward should decrease linearly with time, not exponentially as is the case with discounting. Fig. 5 shows the average learning curve of 50 learning episodes (different random seeds) when optimizing for maximum forward speed of the hip. The average and maximum forward speed can be found in Table II.

H. Minimizing C.o.T. and maximizing speed

Both previous reward structures can be blended. All rewards together (proportional footstep length reward, motor work penalty, time step penalty, backward movement penalty) produce a trade-off between minimum C.o.T. and maximum forward speed. This trade-off depends on the exact magnitudes of the rewards for motor work, time step and footstep length. In our test, we used the following reward scheme: a reward of 350/m proportional to the length of the footstep when a footstep occurs, a reward of -8.3/J.s proportional to the motor work done every time step, a reward of -56/s every time step, and a reward of -333/s for every time step in which the hip moves backward. Fig. 5 shows the average learning curve of 50 learning episodes (different random seeds) when optimizing for minimum cost of transport as well as maximum average forward speed. The average and maximum forward velocity as well as the average and minimum cost of transport can be found in Table II.

I. Learning curve, random initialization and ranking

In general the robot learns to walk very quickly, as Fig. 5 shows. A stable walking motion is often found within 20 minutes. In order to verify that the robot is not stuck in a local minimum (i.e. the C.o.T. might still suddenly drop at some point), the simulations need to be run for quite some time. We performed tests with simulation times of 15 hours, showing no performance drop, which indicates convergence. Due to the random initialization, not all learning attempts are equally successful; some random seeds never lead to a result. Optimizing for minimum cost of transport failed to converge once in our 50 test runs. Walkers even develop their own ”character”. For example, initially some walkers might develop a preference for a short step with the left leg and a large step with the right leg. Some dribble, and some tend to walk like Russian elite soldiers. Due to the built-in exploration and the optimization (e.g. towards efficiency), odd behaviors mostly disappear in the long run. A ranking on performance of all results makes it possible to select the best walkers as download candidates for the real robot; one possible selection procedure is sketched below.
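The selection procedure just described can be sketched as follows: train a number of independently seeded learners and rank the resulting controllers by a chosen performance measure. The function names, the number of seeds and the ranking criterion are illustrative assumptions; `train_controller` and `evaluate` stand in for the simulated learning episodes and test runs described in the text.

```python
# Sketch of learning several controllers from different random seeds and ranking
# the results to select download candidates for the real robot.
# train_controller(seed) -> controller and evaluate(controller) -> (avg_speed, cot, steps)
# are hypothetical placeholders passed in by the caller.

def select_best_walkers(train_controller, evaluate, n_seeds=50, n_candidates=3):
    ranked = []
    for seed in range(n_seeds):
        controller = train_controller(seed)
        avg_speed, cost_of_transport, steps = evaluate(controller)
        if steps < 16:                      # discard seeds that never learn to finish a run
            continue
        ranked.append((cost_of_transport, -avg_speed, controller))
    ranked.sort(key=lambda entry: entry[:2])  # lowest C.o.T. first, ties broken by speed
    return [controller for _, _, controller in ranked[:n_candidates]]
```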
J. Robustness and adaptivity

To test how robust the controller is against disturbances, we set up a simulation in which, before each run of 16 footsteps, we randomly changed the height of the tiles of the floor. In the worst case, each step encounters a different height. The system appears to be able to learn to cope with height disturbances in the floor of up to 1.0 cm, which is slightly better than its real mechanical counterpart with a PD controller. To illustrate the adaptive behavior, the robot was first placed on a level surface and learned to walk. After some time, a change in the environment was introduced by means of a small ramp. At first, performance drops. After a relatively short time, performance recovers to its maximum again. Especially when trying to walk with minimum cost of transport, this behavior is desirable. A learning controller that is not notified of the angle of the ramp will find a new optimum
after some time, purely result driven: a desirable feature for autonomously operating robots.

IV. CONCLUSION

Using a generic learning algorithm, stable walking motions can be found for a passive dynamic walking robot with hip actuation, by learning to control the torque applied at the hip to the upper legs. To test the learning algorithm, a two-dimensional model was used of a passive dynamic walking biped whose mechanical counterpart is known to walk stably with a PD controller for hip actuation. A dynamic system model of the robot was used to train the learning controller; none of the body dynamics of the mechanical robot were provided to the learning algorithm itself. Using a single learning module, simple ways of optimizing the walking motion towards goals such as minimum cost of transport and maximum forward velocity were demonstrated. Convergence times proved to be acceptable even when optimizing on difficult criteria such as minimum cost of transport. By means of standard and easy-to-implement Q(λ)-learning, problems are solved that are very difficult to tackle with conventional analysis. We have verified the robustness of the system against disturbances, leading to the observation that height differences of 1.0 cm can be dealt with. The system can adapt itself quickly to a change in the environment such as a gentle ramp. Q-learning proves to operate as a very efficient search algorithm for finding the optimal path through a large state-action space with simple rewards, when these are chosen carefully.

REFERENCES

[1] James S. Albus. A theory of cerebellar function. Mathematical Biosciences, 10:25–61, 1971.
[2] James S. Albus. Brains, Behavior, and Robotics. BYTE Books, McGraw-Hill, Peterborough, NH, Nov 1981.
[3] H. Benbrahim and J. Franklin. Biped dynamic walking using reinforcement learning. Robotics and Autonomous Systems, (22):283–302, 1997.
[4] C.-M. Chew and G. A. Pratt. Dynamic bipedal walking assisted by learning. Robotica, 20(5):477–491, 2002.
[5] S. H. Collins, A. Ruina, R. Tedrake, and M. Wisse. Efficient bipedal robots based on passive-dynamic walkers. Science, 307(5712):1082–1085, 2005.
[6] S. H. Collins, M. Wisse, and A. Ruina. A two legged kneed passive dynamic walking robot. Int. J. of Robotics Research, 20(7):607–615, July 2001.
[7] G. Endo, J. Morimoto, J. Nakanishi, and G.M.W. Cheng. An empirical exploration of a neural oscillator for biped locomotion control. In Proc. 4th IEEE Int. Conf. on Robotics and Automation, pages 3030–3035, Vol. 3, Apr 26–May 1 2004.
[8] Y. Kuroki, M. Fujita, T. Ishida, K. Nagasaka, and J. Yamaguchi. A small biped entertainment robot exploring attractive applications. In Proc. IEEE Int. Conf. on Robotics and Automation, pages 471–476, 2003.
[9] T. McGeer. Passive dynamic walking. Int. J. Robot. Res., 9(2):62–82, April 1990.
[10] J. Morimoto, J. Cheng, C. G. Atkeson, and G. Zeglin. A simple reinforcement learning algorithm for biped walking. In Proc. 4th IEEE Int. Conf. on Robotics and Automation, pages 3030–3035, Vol. 3, Apr 26–May 1 2004.
[11] J. Morimoto, J. Nakanishi, G. Endo, and G.M.W. Cheng. Acquisition of a biped walking pattern using a Poincaré map. In Proc. 4th IEEE/RAS Int. Conf. on Humanoid Robots, pages 912–924, Vol. 2, Nov. 10–12 2004.
[12] Y. Nakamura, M. Sato, and S. Ishii. Reinforcement learning for biped robot. In Proc. 2nd Int. Symp. on Adaptive Motion of Animals and Machines. www.kimura.is.uec.ac.jp/amam2003/onlineproceedings.html, 2003.
[13] G. A. Pratt and M. M. Williamson. Series elastic actuators. IEEE International Conference on Intelligent Robots and Systems, pages 399–406, 1995.
[14] Y. Sakagami, R. Watanabe, C. Aoyama, S. Matsunaga, N. Higaki, and M. Fujita. The intelligent ASIMO: system overview and integration. In Proc. Int. Conf. on Intelligent Robots and Systems, pages 2478–2483, 2002.
[15] A. L. Schwab and M. Wisse. Basin of attraction of the simplest walking model. In Proc. ASME Design Engineering Technical Conferences, Pennsylvania, 2001. ASME. Paper number DETC2001/VIB-21363.
[16] R. Smith. Open Dynamics Engine. Electronic citation, 2005.
[17] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 1998. ISBN 0-262-19398-1.
[18] R. Tedrake, T. W. Zhang, M.-F. Fong, and H. S. Seung. Actuating a simple 3D passive dynamic walker. In Proc. IEEE Int. Conf. on Robotics and Automation, 2004.
[19] Russ Tedrake, Teresa Weirui Zhang, and H. Sebastian Seung. Learning to walk in 20 minutes. In Proc. 14th Yale Workshop on Adaptive and Learning Systems. Yale University, New Haven, CT, 2005.
[20] E. Vaughan, E. Di Paolo, and I. Harvey. The evolution of control and adaptation in a 3D powered passive dynamic walker. In Proc. 9th Int. Conf. on the Simulation and Synthesis of Living Systems, pages 2849–2854, Boston, September 12–15 2004. MIT Press.
[21] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, Cambridge University, Cambridge, UK, 1989.
[22] M. Wisse, D. G. E. Hobbelen, and A. L. Schwab. Adding the upper body to passive dynamic walking robots by means of a bisecting hip mechanism. IEEE Transactions on Robotics, (submitted), 2005.
[23] M. Wisse, A. L. Schwab, R. Q. v. d. Linde, and F. C. T. v. d. Helm. How to keep from falling forward; elementary swing leg action for passive dynamic walkers. IEEE Transactions on Robotics, 21(3):393–401, 2005.
[24] M. Wisse and J. v. Frankenhuyzen. Design and construction of Mike; a 2D autonomous biped based on passive dynamic walking. In Proc. 2nd Int. Symp. on Adaptive Motion of Animals and Machines, 2003.