Implementation and Optimization of an Open Loop Gait Controller on a Mono Pedal Robot
Research Thesis
Submitted in partial fulfillment of the requirements for the degree of Master of Science in Mechanical Engineering
Israel Schallheim
Submitted to the Senate of the Technion - Israel Institute of Technology Shvat, 5778 Haifa February 2018
ACKNOWLEDGEMENTS The research thesis was done under the supervision of Associate Professor Miriam Zacksenhouse in the faculty of Mechanical engineering. The generous financial help of the Technion is gratefully acknowledged.
TABLE OF CONTENTS

List of Symbols and Abbreviations .......... 2
    Chapter 2 .......... 2
    Chapters 3, 4 and 5 .......... 3
    Chapters 6 and 7 .......... 6
    Chapter 8 .......... 7
1. Introduction .......... 9
    1.1 Dynamic walking of biped robots .......... 9
    1.2 Gait controller .......... 11
    1.3 Optimization using Reinforcement Learning .......... 12
    1.4 The Mono-ped robot .......... 13
    1.5 Thesis objectives .......... 14
2. Literature Survey .......... 16
    2.1 Limit cycle stability and region of attraction .......... 16
        2.1.1 Limit cycles .......... 16
        2.1.2 Poincaré map .......... 17
        2.1.3 Poincaré map linearization .......... 18
        2.1.4 Region of attraction .......... 20
    2.2 Central Pattern Generator (CPG) .......... 21
        2.2.1 CPG in robotics .......... 22
        2.2.2 Minimal Feedback CPG .......... 23
    2.3 Reinforcement Learning .......... 24
        2.3.1 Framework .......... 24
        2.3.2 Policy gradient methods .......... 26
3. Mathematical Model of Mono-Pedal Robot .......... 32
    3.1 Swing phase .......... 32
    3.2 Stance phase .......... 34
    3.3 Swing to stance transition at impact .......... 36
    3.4 Leg contraction/extension .......... 36
    3.5 Gait controller .......... 37
    3.7 Poincaré section definition .......... 39
4. Series Elastic Actuation .......... 40
    4.1 Series elastic actuation (SEA) .......... 40
    4.2 SEA mechanical design .......... 42
    4.3 SEA controller design .......... 45
        4.3.1 Theoretical model .......... 45
        4.3.2 Measured frequency response .......... 51
        4.3.3 Controller design .......... 55
    4.4 Controller hardware and code implementation .......... 57
    4.5 SEA controller performance .......... 57
5. Model Identification .......... 60
    5.1 Parameters identified from direct experiment .......... 60
    5.2 Parameters identified by fit .......... 62
        5.2.1 Fit of cart friction model .......... 63
        5.2.2 Fit of impact model .......... 65
        5.2.3 Genetic algorithm for full model fit .......... 66
        5.2.4 Conclusions from Genetic algorithm fit .......... 71
6. Simulation Analysis .......... 72
    6.1 Nominal gait limit cycles .......... 72
    6.2 Substituting phase activation by reflexes .......... 74
    6.3 Bifurcation analysis .......... 74
    6.4 Region of Attraction Analysis .......... 77
        6.4.1 Effect of gait speed on the RoA .......... 78
        6.4.2 Effect of reflexes on the RoA .......... 79
    6.5 Effect of phase reset .......... 81
    6.6 Conclusions of simulation analysis .......... 83
7. Robot Performance .......... 84
    7.1 SEA performance .......... 84
        7.1.1 Achieving natural dynamics .......... 84
        7.1.2 Torque control performance .......... 86
    7.2 Hand tuned CPG walking .......... 86
        7.2.1 The choice of CPG period (T) .......... 86
        7.2.2 The choice of CPG phase .......... 88
        7.2.3 The choice of CPG torque amplitude .......... 88
        7.2.4 Gait time plots .......... 89
        7.2.5 Gait limit cycle .......... 90
8. Optimization Using Reinforcement Learning .......... 93
    8.1 Optimization of CPG parameters using REINFORCE .......... 93
        8.1.1 Deriving the policy gradient .......... 93
        8.1.2 Demonstration on a pendulum model .......... 96
    8.2 Robot Learning: in Simulation .......... 102
        8.2.1 Comparing action vs. parameter perturbation .......... 102
    8.3 Robot Learning: from Simulation to Hardware .......... 106
        8.3.1 Grounded Action Transformation (GAT) .......... 106
        8.3.2 Training forward and inverse models .......... 107
        8.3.3 GAT Performance .......... 108
9. Conclusions .......... 115
    9.1 Mono-ped prototype .......... 115
    9.2 Simulation .......... 116
    9.3 Reinforcement learning .......... 116
    9.4 Future Work .......... 117
Appendix A – Deriving the Policy Gradient .......... 119
Appendix B – Calculation of SEA Constant .......... 124
Appendix C – Development of Dynamic Equations of the Mono-Ped Model .......... 126
    Swing phase .......... 126
    Stance phase .......... 129
    Cart friction model .......... 132
    Impact equation .......... 135
Appendix D – Matching Model Parameters .......... 136
References .......... 140
LIST OF FIGURES Figure 1-1 Passive walking biped robots (from [2]) .................................................................... 10 Figure 1-2 Dynamic walking actuated robots. Upper: Ranger [3], Lower left: Marlo [10] Lower right: AMBER [11]................................................................................................................ 11 Figure 1-3 Mono-pedal scooter robot. Left: Mounted on cart. Right: On treadmill. ................... 14 Figure 2-1 Limit cycle of the Van der Pol oscillator .................................................................... 16 Figure 2-2 Left: Limit cycle of the compass biped robot (Simulation [23]), Right: Model of compass biped robot ............................................................................................................. 17 Figure 2-3 Illustration of Poincaré map ........................................................................................ 18 Figure 2-4 Region of attraction of compass biped projected on the phase plane [25].................. 20 Figure 2-5 Region of attraction of compass biped on Poincaré surface [26]................................ 21 Figure 2-6 Possible models of CPG control Hierarchy in human locomotion [29]...................... 22 Figure 2-7 CPG torque pattern proposed at [20] .......................................................................... 23 Figure 2-8 – Schematic view of the reinforcement learning framework ...................................... 24 Figure 2-9 – The actor critic architecture. The value function is separated from the policy (the ‘actor’), and is used only as a ‘critic’ to correct the policy. .................................................. 26 Figure 2-10 – Policy improvement loop. At each iteration, M trials are performed with injected perturbations at each time step. The observed trajectories from each trial that include the states, actions and rewards are passed to the gradient estimator. The parameters are updated when the gradient estimation converges, and the updated parameters are passed to another iteration. ................................................................................................................................ 27 Figure 3-1 mono-pedal robot state machine consists of 3 states. From each state, the system can enter a failed state. ................................................................................................................ 32
Figure 3-2 Swing phase model. (a) ankle joint (b) hip joint (c) linear track on which the leg is mounted (d) linear motor in contracted position................................................................... 33 Figure 3-3 Stance phase model. (a) linear motor is extended (b) hip joint is free to move vertically. (c) Foot is attached to the track with no slip ........................................................ 35 Figure 3-4 Central pattern generator used for the mono-pedal robot .......................................... 37 Figure 4-1 Left: The mono - pedal robot. Upper right: Hip SEA motor and springs. Lower right: Ankle SEA motor and springs. ............................................................................................. 40 Figure 4-2 Classic torque control vs. SEA torque control. Left: SEA control can use a smaller geared motor, and requires position measurements before and after the elastic component. Right: Classic torque control uses bigger motors and requires expensive torque sensors.... 41 Figure 4-3 Actuator’s output impedance with or without an elastic component (from [60]) ...... 42 Figure 4-4 schematic drawing of SEA. ........................................................................................ 43 Figure 4-5 Schematic model of the mono-ped system and the SEA ......................................... 45 Figure 4-6 Block diagram of the theoretical linearized model of the SEA-actuated mono-ped in stick mode due to stick friction (without cart dynamics, upper panel) and slip mode (with cart dynamics, lower panel). ..................................................................................... 48 Figure 4-7 Frequency responses of the theoretical linearized model of the SEA-actuated mono-ped from voltage input to motor angle (upper) and from voltage input to leg joint angle. .................................................................................................................................... 49 Figure 4-8 Measured frequency responses. Comparison between sine sweep and discrete sine inputs, along with an identified parametric model. Upper) From input to motor angle output. Lower) From input to joint angle output. ................................................................. 53
Figure 4-9 Comparison between measured and theoretical frequency responses. Upper) With cart dynamics. Lower) Without cart dynamics .......... 54 Figure 4-10 SEA control scheme. The input to the controller is the error in the desired angle difference between the motor and leg joint. .......... 55 Figure 4-11 Measured open loop transfer functions from control input u=Vin to the outputs: θ1=θm (blue), θ2=θl (red), dθ=θ1-θ2 (yellow) .......... 55 Figure 4-12 Response to step and disturbance inputs using SEA controller with inverse notch filter .......... 58 Figure 4-13 Response to step and disturbance inputs using SEA controller with lead-lag filters .......... 58 Figure 4-14 Response to a smoothed step input (logistic function with κ=1/250) using SEA controller with lead-lag filters .......... 59 Figure 5-1 Initial condition experiment to match the inertia and damping of the leg with respect to: Right) Hip axis, Left) Ankle axis .......... 61 Figure 5-2 Cart position and velocity during initial condition tests to match the stick-slip friction model .......... 63 Figure 5-3 Initial condition experiment with cart dynamics and friction during swing .......... 64 Figure 5-4 Hip and cart velocities as recorded during a gait before and after the impact. The red circles approximate the time of the impact .......... 66 Figure 5-5 Genetic Algorithm progress along generations .......... 68 Figure 5-6 Model fit before and after using the genetic algorithm. The simulation with parameters before the GA fit ends after taking one step due to a failure caused by the leg extending into the ground ..........
Figure 6-1 Phase-plane trajectories of the nominal limit cycles of the slow, medium and fast gaits. Right: θ, θ̇ phase plane. Left: φ, θ phase plane .......... 73 Figure 6-2 Temporal trajectories of the nominal limit cycles of slow, medium and fast gaits .......... 73 Figure 6-3 Magnitude of the eigenvalues of the linearized Poincaré map as a function of the CPG period T. The arrows mark the nominal value for each gait speed. In addition, the gait velocity of the controller with reflexes is plotted as a function of T. .......... 76 Figure 6-4 RoA comparison between slow and medium gait speeds achieved with reflexes. The stable fixed point of the limit cycle is marked by a star and its RoA is marked in yellow. Different green shades mark different failure modes: Failed state I – maximum angle reached (dark green). Failed state II – leg extension into the ground (medium dark green). Failed state III – insufficient energy to complete the step (light green). .......... 78 Figure 6-5 RoA of the fast gait. Comparison between (a) No reflexes (b) Only extension reflex (c) Only contraction reflex (d) Extension and contraction reflexes. The stable fixed point of the limit cycle is marked by a star and its RoA is marked in yellow. Different green shades mark different failure modes: Failed state I – maximum angle reached (dark green). Failed state II – leg extension into the ground (medium dark green). Failed state III – insufficient energy to complete the step (light green). .......... 80 Figure 6-6 One dimensional Poincaré map θ_k+1 = f(θ_k), for CPG controllers with reflexes and phase reset applied at the Poincaré section. The local RoA is obtained graphically and is bounded on the right by the nearest non-stable fixed point (see enlarged), and bounded on the left by either: I) the first point with the same height as the non-stable fixed point (slow gait, upper panel), or II) the left end of the plotted map, i.e. a failed state (medium and fast gaits, upper panel). .......... 82
Figure 7-1 Comparison between a free swinging leg (red) and an SEA actuated swinging leg with zero-torque command (blue). The leg was released to swing from an initial position with (upper panel) and without the cart dynamics (lower panel). .......... 85 Figure 7-2 Commanded and applied torque to the mono-ped robot during walking .......... 86 Figure 7-3 Test to determine the CPG period .......... 87 Figure 7-4 Time plot of the mono-ped gait with walking speed of 1.8 km/h. The circles mark the impact and extraction times, where (red) indicates the impact and (green) indicates the leg contraction. .......... 89 Figure 7-5 An enlargement of one gait cycle from the time plot of the mono-ped gait .......... 90 Figure 7-6 Phase plane plot of stable gaits with walking speeds ranging from 1.2 to 2 km/h. The circles mark the impact and extraction times, where (red) indicates the impact and (green) indicates the leg contraction. .......... 91 Figure 7-7 Comparison of the mono-ped limit cycle trajectory (Left) to the compass biped's (Right) .......... 92 Figure 8-1 – Cost function of the pendulum problem as a function of the period T and impulse amplitude τ. .......... 98 Figure 8-2 – Estimation of the cost function gradient of the pendulum problem. Blue arrows: Real gradient, Green arrows: REINFORCE method, Red arrows: Finite difference method .......... 99 Figure 8-3 – Left: absolute value of cost during the learning process. Right: Parameter update .......... 100 Figure 8-4 Cost during the learning process. A comparison between I) Constant variance (no update of sigma) II) Updating a single variance value for all parameters III) Updating the variance for all the parameters .......... 101
Figure 8-5 Action (left) vs Parameter (right) perturbation setup ............................................. 102 Figure 8-6 Performance of Reinforcement Learning with Action (left, green) vs Parameter (right, blue) perturbations. ................................................................................................. 103 Figure 8-7 Grounded Action Transformation (GAT) framework .............................................. 107 Figure 8-8 Training forward and inverse models for GAT ........................................................ 108 Figure 8-9 Effect of grounded action transformation (GAT) on simulated leg trajectories. Phase plane trajectories derived from experiments with the robot and from simulations before and after GAT. States on the Poincaré section are denoted by symbols ................................... 109 Figure 8-10 Effect of grounded action transformation (GAT) on the simulated position of the cart. ..................................................................................................................................... 110 Figure 8-11 Learning progress of Open loop CPG parameters using GAT .............................. 111 Figure 8-12 Performance test in initial learning stage. The robot fails to resume walking after a disturbance is applied. CPG parameters: τ= [3,-3] , ϕ=[0.05, 0.3, 0.45, 0.9], T=1.15 ... 113 Figure 8-13 Performance test of final policy. The robot resumes walking after disturbances are applied. In some cases the leg swings once before it resumes walking (marked with dashed circles). CPG parameters: τ= [5,-5] , ϕ=[0.07, 0.3, 0.45, 0.95], T=1.015 ....................... 114
ABSTRACT Dynamically walking robots exploit their natural dynamics by 'falling' forward onto the swing leg at each step. In contrast to the classic quasi-statically walking robots that remain stable along the gait cycle, dynamic walking robots maintain only orbital stability. Stability and robustness of dynamic robots are investigated by analysing the Poincaré maps and Region of Attraction (RoA) of the resulting limit cycles. Our proposed gait controller is based on a central pattern generator (CPG) that produces a periodic set of torque pulses applied to the robot joints. While the open loop CPG controller produces stable walking, we demonstrate that incorporating reflexes greatly enhances its robustness as quantified by the region of attraction. A prototype mono-pedal robot was designed and built to demonstrate the CPG controller. The implementation involves Series Elastic Actuation (SEA), which facilitates the control of the applied torque, while exploiting the natural dynamics of the system and providing inherent compliance. Low-level motor control was implemented using standard feedback control and loop shaping design techniques. The mono-pedal robot was mounted on a treadmill to investigate its steady state performance and perturbation rejection. A Reinforcement Learning algorithm was applied for optimizing the CPG parameters to maximize robustness while minimizing control effort. The algorithm applies a 'Policy Gradient' method to modify the controller parameters in the direction that maximizes an objective function. We show how the well-known REINFORCE algorithm can be modified to match our unique case, by using a parameter perturbation approach rather than the traditional action perturbation approach. To enhance safety and reduce hardware wear during the learning process, we applied a technique that facilitates transfer of the learned policy from simulation to hardware, known as Grounded Action Transformation. We demonstrate that the performance of the mono-pedal robot improves as learning progresses.
LIST OF SYMBOLS AND ABBREVIATIONS

Symbols

Chapter 2
x_t - The state of the system at time t
Σ - Poincaré section
P - Poincaré mapping
x* - A fixed point of the Poincaré mapping
ε - Perturbations from the fixed point
X* - Matrix whose columns are made of the fixed point
X+ - Matrix of resulting intersections with the Poincaré section after perturbations
A_p - Linearized Poincaré map matrix
T - Trajectory length in a reinforcement learning episode/trial
s_t - State in the Markov Decision Process framework
a_t - Action in the Markov Decision Process framework
r_t - Reward in the Markov Decision Process framework
π - Policy of action selection in the Markov Decision Process framework
τ - Trajectory gathered along a reinforcement learning episode/trial
R - Return, the sum of rewards accumulated over a learning episode
γ - Discounted reward parameter
θ - Policy parameterization
J - Objective function
∇_θ - Gradient operator with respect to the policy parameters
V^π - State value function
Q^π - State-action value function
α - Policy update rate
p_θ(τ) - Distribution of trajectory τ while using parameters θ
M - Number of learning episodes/trials
b - Policy gradient baseline

Chapters 3, 4 and 5
x - Cart's position
x_ankle - Ankle's position
θ_m - Motor's angle
θ'_m - Motor's angle at the gear output
θ, θ_l - Load/leg's angle
Δθ, dθ - Difference between motor gear output angle and load/leg angle
m_leg - Leg's weight
m_cart - Cart's weight
l_cg,hip - Leg's center of gravity w.r.t. hip axis
T_load, T_SEA, τ_CPG - Torque applied on the load/leg by the SEA
T_des - Desired torque of the SEA
f_slip - Slip friction between the cart and its track
f_stick - Stick friction between the cart and its track
J_ankle - Leg's inertia w.r.t. ankle axis
l_cg,ankle - Leg's center of gravity w.r.t. ankle axis
f_ankle - Ankle axis viscous friction
f_z - Hip's vertical track viscous friction
l - Leg's full length
e - Coefficient of restitution after impact
θ_extract - Threshold angle to extract the leg
θ_contract - Threshold angle to contract the leg
φ - Oscillator phase
ω - Oscillator frequency
T - Oscillator period
κ - Logistic function smoothing parameter
w - Feedback weights
ψ_e - Features extracted from the environment
k - Linear spring stiffness
r - Pulley radius
ω_m - Motor angular velocity
ω_l - Load/leg angular velocity
T_m - Motor torque at the gear input
i_a - Motor's armature current
V_in - Motor's input voltage
V_back-emf - Motor's back e.m.f. voltage
K_t - Motor torque constant
K_b - Motor speed constant
R_a - Motor's armature resistance
L_a - Motor's armature inductance
J_m - Motor inertia (including gear inertia)
f_m - Motor viscous friction
N_r - Gear ratio
J_hip - Leg's inertia w.r.t. hip axis
f_hip - Hip axis friction
k_SEA - SEA stiffness
c_SEA - SEA damping
m_leg - Leg's weight
m_cart - Cart's weight
l_cg, l_cg,hip - Leg's center of gravity w.r.t. hip axis
T_uθ'm - Transfer function from control input to motor's gear output angle
T_uθl - Transfer function from control input to leg angle
P_uθ'm - Cross power spectral density of control input and motor's gear output angle
P_uθl - Cross power spectral density of control input and leg angle
P_uu - Power spectral density of control input
C_ui - Coherence function
f - Frequency
K_p - Proportional gain of PI controller
K_I - Integral gain of PI controller
a_lead/lag - Zero of lead/lag controller
b_lead/lag - Pole of lead/lag controller
ζ_n/d - Numerator/denominator damping of notch filter
ω_n - Natural frequency of notch filter
κ - Logistic function smoothing constant
m_leg,swing - Leg weight with the foot
m_leg,stance - Leg weight without the foot
l - Leg total length from ankle to hip
μ_stick - Stick friction coefficient
μ_slip - Slip friction coefficient
·_impact - Impact equation component related to leg rotation
·_impact,x - Impact equation component related to cart movement

Chapters 6 and 7
θ - Leg angle
x - Cart position
v - Cart speed
φ - Vector of phases for CPG pulse activation
φ - CPG phase
τ - Vector of CPG torques
τ - CPG torque
T - CPG period
A_p - Linearized Poincaré map matrix
λ - Eigenvalue

Chapter 8
s - State in the Markov Decision Process framework
r - Reward in the Markov Decision Process framework
π_θ - Policy with parameterization θ
θ - n-dimensional policy parameters vector
a_CPG - d-dimensional action in the Markov Decision Process framework (CPG parameters)
μ_i - Mean of policy parameter or action component i
σ_i - Standard deviation of policy parameter or action component i
ψ_i - n-dimensional feature vector that corresponds to action component i
â - Action after Grounded Action Transformation
u - Area under CPG pulse

Abbreviations
ZMP - Zero Moment Point
HZD - Hybrid Zero Dynamics
MPC - Model Predictive Control
CPG - Central Pattern Generator
SEA - Series Elastic Actuation
RL - Reinforcement Learning
H.O.T. - Higher Order Terms
RoA - Region of Attraction
IC - Initial Conditions
MDP - Markov Decision Process
GA - Genetic Algorithm
GAT - Grounded Action Transformation
1. INTRODUCTION 1.1 Dynamic walking of biped robots A significant feature of biped robots is the ability to walk passively (with no actuators) down a slope [1] [2] (see Figure 1-1). Actuated bipedal robots can exploit this feature for power reduction by mimicking the passive gait and adding only the amount of energy needed to compensate for the energy lost to impact and friction (when walking on non-negative sloped terrain) and to maintain balance when encountering disturbances [3]. Making use of this principle allows the robot's inertia to be exploited rather than 'fought'. Furthermore, the use of simple dynamic walking methods can spare the need for rigorous mathematical calculations and reduces online computation time, since there is no need to calculate the forward and inverse kinematics of the robot. Classic methods for robotic gait control such as Zero Moment Point (ZMP) [4] make use of classic robotics concepts, such as the calculation of forward and inverse kinematics, following well-defined trajectories, and using high power actuators to accurately track these trajectories [5][6][7][8]. In contrast, much of the research on biped robots in recent years has focused on natural-dynamics based algorithms. Natural-dynamics control methods exploit the natural dynamics of the robot, as seen in passive walking. Those methods do not assure stability at every given point in time, but rather maintain orbital stability, as will be explained below. Some examples of dynamic walking robots are shown in Figure 1-2. In many applications it is noticeable that the dynamic approach resembles the way humans walk. In addition, the dynamic approach uses much less energy [3]. The stability of walking robots can be assessed by looking at periodic trajectories of the joint angles. Such trajectories plotted in state space yield the limit cycle of the system [9]. Therefore, methods to examine the limit cycle stability using the Poincaré map were developed. In addition, the region of attraction of the limit cycle is of great interest, since it defines how robust the system is to perturbations.
Figure 1-1 Passive walking biped robots (from [2])
Figure 1-2 Dynamic walking actuated robots. Upper: Ranger [3], Lower left: Marlo [10] Lower right: AMBER [11]
1.2 Gait controller Gait control methods vary from ones that rely on accurate mathematical models and heavy computing to simple and computationally inexpensive methods. We will briefly review several approaches that exist in the literature:
Hybrid zero dynamics (HZD) The method is based on writing the robot's full equations, and then using methods such as feedback linearization to make the robot behave as a simpler model, for which the movement can be designed analytically. These methods have achieved very good performance; the main drawback is the need for full modeling of the robot [12][13]. Trajectory Optimization / Model Predictive Control (MPC) This method can be divided into online and offline approaches. The online approach, also known as MPC, calculates the robot's possible movements and finds the controller parameters that yield the optimal movement in real time. This approach requires heavy computing resources [14]. The offline approach does not involve any real-time optimization, so it cannot adapt to changes. However, this approach has been used to develop the most energy efficient robot - the Cornell robot Ranger that walked 65 miles without recharging the battery [15]. Central Pattern Generator (CPG) This method is inspired by biological mechanisms that control cyclic movements such as chewing, breathing, swimming, and walking. A biological CPG is composed of a number of coupled neurons and produces oscillatory signals, which, when coupled to the mechanics of the body, generate a cyclic movement. Similarly, in robotics, a mathematical oscillator model produces rhythmic signals that are applied to the robot's joints to form the desired gait. This method has the advantage of requiring relatively few real-time calculations. However, determining the structure of the CPG and tuning its parameters is challenging [16] [17][18]. The gait controller developed in our laboratory is based on a simplified CPG [19][20]. A linear oscillator acts as a clock, which determines the timing at which pulse-like torques are applied to the robot's joints. The magnitude and duration of the torque pulses can be kept constant (open loop) or adapted to the environment (e.g., the slope) in closed loop.
1.3 Optimization using Reinforcement Learning Reinforcement Learning is a field of machine learning that especially suits robotics and control applications. A Reinforcement Learning algorithm learns the control policy that maximizes an objective function while the system interacts with the environment and receives appropriate rewards,
as explained in section 2.3. In particular, policy gradient methods suit robotics applications, since they use the reinforcement-learning framework to optimize existing control structures. We applied a variant of the REINFORCE algorithm [21] to optimize the CPG gait controller parameters. The full derivation of the equations is detailed in Section 8.1. The open loop gait controller parameters were initially hand tuned to achieve a stable gait. We then applied the Reinforcement Learning algorithm to optimize the open loop controller parameters to achieve a more energy efficient gait. We show how the algorithm can be applied to learn a closed loop controller and demonstrate the results in simulations. Since differences between the simulation and the actual robot are inevitable, we must transfer the policy that is learned in simulation to a policy that is optimal also on the real robot. We will demonstrate the use of a method called Grounded Action Transformation, which transforms the actions during training of the policy to match the behavior of the actual robot. We will demonstrate that the method works in practice, and that we are able: I) to approximate well the behavior of the real robot, and II) to conduct the learning in simulation and find a policy that optimizes the behavior of the actual robot.
1.4 The Mono-ped robot A Mono-ped robot was designed to test the gait controller that has been developed in our laboratory. The robot, shown in Figure 1-3, acts as the leg of a rider on a scooter, where the leg pushes the ground to propel the rider forward. Although the robot only has one leg, its main dynamics are similar to the compass biped. The leg's dynamics alternate between swing and stance phase just as in the compass biped, and the same gait controller architecture is applied in both cases. The rotating joints at the hip and ankle are actuated using Series Elastic Actuation (SEA) [22]. This facilitates torque control using only inexpensive components, without the need for high performance torque sensors. The scooting leg has a linear knee that contracts the leg to allow swinging without scraping the ground. An earlier version was mounted on a cart, and was controlled by a portable Arduino controller. After applying the gait controller, the leg on the cart was able to perform a few steps until it reached the wall or the end of its electrical cable. To be able to perform longer tests, the leg was mounted on a treadmill. In addition, control and data gathering were done using a high performance dSPACE controller. However, the system dynamics were slightly altered by adding the treadmill, as will be elaborated further.
Figure 1-3 Mono-pedal scooter robot. Left: Mounted on cart. Right: On treadmill.
1.5 Thesis objectives Our laboratory has previously developed an algorithm for gait generation using a Central Pattern Generator (CPG) in an open loop form, as well as a closed loop form with minimal feedback [19][20]. But so far, no experiments have been conducted on a real robot to validate the feasibility of the proposed architecture. For this purpose, a prototype of the mono-ped robot was built in the laboratory to test the algorithm. Therefore, the research presented henceforth has the following objectives: I. Implementing the gait control algorithm on the mono-ped prototype and examining its performance using a Series Elastic Actuation (SEA) mechanism. For that purpose, an overview of the hardware, with emphasis on the SEA system, design considerations and low-level control design is shown in section 4. The gait controller performance on hardware, along with parameter choice considerations, will be presented in section 7. II. Developing the mathematical model of the mono-ped robot as will be shown in section 3, and then examining methods for system identification to match the model parameters, as discussed in section 5. III. Performing simulation analysis to assess the robustness of the proposed gait controller, as will be presented in section 6.
IV. Developing methods for optimization of the gait controller using reinforcement learning, as shown in section 8.
2. LITERATURE SURVEY 2.1 Limit cycle stability and region of attraction
2.1.1 Limit cycles For systems that exhibit periodic motion such as biped robots, it is common to examine the system’s dynamics by looking at its closed trajectory in the phase plane, rather than temporal plots. Isolated closed trajectories are referred to as limit cycles of the system. A limit cycle is stable if all neighboring trajectories approach the limit cycle as time approaches infinity. A limit cycle is unstable if all neighboring trajectories approach the limit cycle as time approaches minus infinity. This can be illustrated for the well-known Van der Pol oscillator (equation (2-1)), defined by:
q̈ + μ(q² - 1)q̇ + q = 0        (2-1)
The dynamics of the Van der Pol oscillator are similar to the dynamics of a mass-spring-damper system, except for a non-linear damping coefficient. This non-linearity creates higher damping when the oscillations grow, and adds energy when the oscillations decay. As a result, the Van der Pol oscillator settles into a stable limit cycle from any starting initial conditions (other than zero). Figure 2-1 depicts the limit cycle of the Van der Pol oscillator (left panel) and the temporal plots (right panel), and demonstrates the convergence to a unique limit cycle from several starting points.
Figure 2-1 Limit cycle of the Van der Pol oscillator
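The convergence illustrated in Figure 2-1 can be reproduced with a few lines of numerical integration. The following is a minimal sketch, assuming a damping coefficient μ = 1; the time step, horizon, and initial conditions are illustrative choices rather than values used in this thesis.

import numpy as np

def van_der_pol(state, mu=1.0):
    # dq/dt = qdot,  dqdot/dt = -mu*(q^2 - 1)*qdot - q   (equation 2-1)
    q, qdot = state
    return np.array([qdot, -mu * (q**2 - 1.0) * qdot - q])

def simulate(q0, qdot0, dt=1e-3, t_end=30.0, mu=1.0):
    # Fixed-step 4th-order Runge-Kutta integration of the oscillator
    state = np.array([q0, qdot0], dtype=float)
    trajectory = [state.copy()]
    for _ in range(int(t_end / dt)):
        k1 = van_der_pol(state, mu)
        k2 = van_der_pol(state + 0.5 * dt * k1, mu)
        k3 = van_der_pol(state + 0.5 * dt * k2, mu)
        k4 = van_der_pol(state + dt * k3, mu)
        state = state + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        trajectory.append(state.copy())
    return np.array(trajectory)

# Trajectories from different nonzero initial conditions approach the same closed orbit
for ic in [(0.1, 0.0), (3.0, 0.0), (0.0, -4.0)]:
    traj = simulate(*ic)
    print(ic, "steady-state amplitude ~", round(float(np.max(np.abs(traj[-5000:, 0]))), 2))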
A biped robot's gait creates a limit cycle in the joints' phase plane, given that neither the environment nor the control input (if any is applied, since we also consider passive gaits) changes. The left panel of Figure 2-2 depicts the closed trajectory of the limit cycle of the compass biped shown in the right panel under open loop control [23]. The limit cycle includes discrete jumps in velocity due to the impact of the legs with the ground: a large jump when the leg hits the ground and a smaller jump when the other leg hits the ground.
Figure 2-2 Left: Limit cycle of the compass biped robot (Simulation [23]), Right: Model of compass biped robot
2.1.2 Poincaré map
For an n-dimensional dynamic system ẋ = f(x), we define the Poincaré section Σ as an n-1 (or lower) dimensional surface that is transverse to the flow f. The Poincaré map (or return map) is the mapping from Σ to itself as f(x) is integrated:

x_{t+1} = P(x_t)        (2-2)
Figure 2-3 Illustration of Poincaré map
A point x* is said to be a fixed point if P(x*) = x*. See Figure 2-3 for illustration.
2.1.3 Poincaré map linearization The Poincaré map P can be estimated numerically by perturbing the fixed-point x* and integrating f(x) until the next intersection with the Poincaré section. Using perturbations of size ε along the different axes, and denoting by X+ the matrix whose ith column is the resulting intersection with the Poincaré section after a perturbation along the ith axis, and by X* the matrix whose columns are made of the fixed-point x*, the linearized Poincaré map is given by [24]:

A_p = (1/ε)(X+ - X*)        (2-3)
The approach above works well in cases where repeated samples of the fixed-point are identical up to the accuracy of the numeric integration. However, when the numerical integration is not very accurate or repeatable, or in actual measurements from hardware that include process and sensor noise, a statistical approach is suggested. Random perturbations sampled from a uniform distribution in the range [0, ε] are applied to the system along all the axes of the Poincaré section. The system is simulated or actuated starting from a perturbed initial state, and the following intersections with the Poincaré section are recorded until convergence to the nominal limit cycle (within the measurement or simulation resolution). The process is repeated until enough data is gathered. The data is organized in pairs of consecutive intersections with the Poincaré section and the fixed-point x* is subtracted, so that the following relation holds for each pair:

x_{n+1} = A_p x_n + H.O.T.        (2-4)

where H.O.T. denotes higher order terms. The consecutive points are arranged in the matrices X_n and X_{n+1} with dimension m×N, where m is the Poincaré section dimension and N is the number of samples. The linearized Poincaré map is computed by calculating the pseudo-inverse of X_n:

A_p = X_{n+1} X_n^T (X_n X_n^T)^{-1}        (2-5)
Stability of the limit cycle can therefore be inferred from the matrix A_p by using discrete dynamic systems stability analysis; i.e., the limit cycle is stable if all eigenvalues of A_p are within the unit circle. In the rare cases where eigenvalues lie on the unit circle, stability cannot be determined using this technique, and other methods such as Lyapunov stability criteria can be applied.
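As a minimal sketch of the statistical estimate (2-5) and the eigenvalue test above, assume the perturbed Poincaré-section crossings (with the fixed point already subtracted) have been collected into matrices X_n and X_{n+1}; the synthetic data generated below is purely illustrative.

import numpy as np

def linearized_poincare_map(X_n, X_np1):
    # Least-squares fit of x_{n+1} = A_p x_n, i.e. A_p = X_{n+1} X_n^T (X_n X_n^T)^{-1}  (eq. 2-5)
    return X_np1 @ X_n.T @ np.linalg.inv(X_n @ X_n.T)

def is_orbitally_stable(A_p):
    # Discrete-time stability test: all eigenvalues strictly inside the unit circle
    return bool(np.all(np.abs(np.linalg.eigvals(A_p)) < 1.0))

# Illustrative synthetic data: a 2-dimensional section with a made-up contracting map plus noise
rng = np.random.default_rng(0)
A_true = np.array([[0.6, 0.1],
                   [0.0, 0.4]])
X_n = rng.uniform(-0.05, 0.05, size=(2, 200))        # perturbed crossings (fixed point subtracted)
X_np1 = A_true @ X_n + 1e-4 * rng.standard_normal((2, 200))
A_hat = linearized_poincare_map(X_n, X_np1)
print(np.round(A_hat, 3), is_orbitally_stable(A_hat))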
2.1.4 Region of attraction The region of attraction (RoA) of an attractor (e.g., a fixed point or limit cycle) is the region of initial conditions (ICs) from which the system converges to that attractor. The size (or volume) of the RoA is indicative of the robustness of the solution to ICs. The robustness of a limit cycle can be characterized by the RoA in the phase plane, as shown in Figure 2-4 for the limit cycle of the compass biped investigated in [26]. Alternatively, the robustness can be characterized by the RoA of the fixed point of the Poincaré map in the Poincaré section, as shown in Figure 2-5 for the compass biped studied in [27]. Both studies use the same model for the compass biped, however, not necessarily the same parameters. The RoA is the dark area marked with A, and the fixed point is the point marked by x. Starting from an IC outside the RoA, the robot will fall backward when the IC is in the region marked by B, and forward when the IC is in the region marked by F.
Figure 2-4 Region of attraction of compass biped projected on the phase plane [25]
x – fixed point A – Region of Attraction F – falls forward B – falls backwards
Figure 2-5 Region of attraction of compass biped on Poincaré surface [26]
2.2 Central Pattern Generator (CPG) Animal research has shown that networks of neurons called Central Pattern Generators (CPGs) control locomotion and other repetitive movements such as chewing and swimming. The CPG sends coordinated rhythmic patterns of electrical activity to the muscles. Afferents deliver feedback information related to the movement of the muscles, which is used to adapt the CPG signal to environmental changes. CPGs are believed to also control locomotion in humans [27] [28]. Possible hierarchies for CPG control of human locomotion are shown in Figure 2-6 [29].
Figure 2-6 Possible models of CPG control Hierarchy in human locomotion [29]
2.2.1 CPG in robotics Many biomimetic robots that use CPGs to generate locomotion have been developed, including quadrupeds [30] and bipeds [31][17][32]. This is part of a growing trend in robotics to use basic motion primitives for the design of more complex tasks. Different CPG models are used in the literature for biped locomotion, with varying complexity of feedback. A common method is to use CPGs based on networks of oscillators, e.g. Matsuoka oscillators, as first introduced by Taga [17]. Another method applies adaptive frequency Hopf oscillators as introduced by Righetti and Ijspeert [33]. It is widely hypothesized that feedback is not required for the generation of the rhythmic signals in biological systems [27]. This is a key concept in the development of CPGs that require no feedback to maintain a stable limit cycle of bipedal walking. Stability is achieved by mutual entrainment between the CPG and the mechanical dynamics. However, small disturbances can cause the system to leave the Region of Attraction and lose stability. Therefore, feedback is essential in most practical implementations.
2.2.2 Minimal Feedback CPG The traditional approach to CPG design is to use each oscillator output of the CPG as the reference trajectory for the respective joint. However, tracking joint trajectories does not allow the robot to exploit its natural dynamics. Instead, previous work developed in our lab suggested a 'natural dynamics' approach [20], in which joint trajectories are determined by the natural response of the robot rather than forced to follow a predefined reference. Actuation is provided by predefined torque patterns triggered by the proposed CPG. The CPG is made of four coupled neurons that produce torque pulses at neuron activation. The proposed method was implemented on a compass biped robot in simulation, and achieved stable walking on slopes between -0.9º and +0.7º. A further work [19] showed how adaptation to slope variations can be extended by incorporating minimal (once per cycle) feedback. Stability was maintained in the presence of varying slopes of up to -8º and +7º. In the latest proposed CPG architecture, torque pulses are activated according to the timing of a linear phase clock.
Figure 2-7 CPG torque pattern proposed in [20]
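To make the phase-clock idea concrete, the following is a minimal sketch of a CPG whose linear phase clock triggers rectangular torque pulses. The grouping of activation phases into on/off windows, the parameter names, and the single-output structure are illustrative assumptions and not necessarily the exact architecture of [19][20].

import numpy as np

class PhaseClockCPG:
    def __init__(self, period, pulse_phases, pulse_torques):
        # period: clock period T [s]
        # pulse_phases: [on_1, off_1, on_2, off_2, ...] as fractions of the cycle
        # pulse_torques: amplitude of each pulse
        self.T = period
        self.windows = np.reshape(pulse_phases, (-1, 2))
        self.torques = np.asarray(pulse_torques, dtype=float)
        self.phase = 0.0

    def step(self, dt):
        # Advance the linear phase clock and return the commanded torque
        self.phase = (self.phase + dt / self.T) % 1.0
        torque = 0.0
        for (on, off), amplitude in zip(self.windows, self.torques):
            if on <= self.phase < off:
                torque += amplitude
        return torque

    def reset(self):
        # Phase reset, e.g. triggered once per cycle by a foot-impact event
        self.phase = 0.0

cpg = PhaseClockCPG(period=1.0, pulse_phases=[0.05, 0.3, 0.45, 0.9], pulse_torques=[3.0, -3.0])
torque_trace = [cpg.step(dt=0.01) for _ in range(200)]   # two cycles of the commanded torque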
2.3 Reinforcement Learning Reinforcement learning (RL) is used in this thesis to optimize the gait controller parameters. RL is based on learning from interaction of an agent with its environment. The basic assumption of RL is that there are no explicit examples of a good or a right solution for each situation and thus it differs from standard supervised learning.
2.3.1 Framework The Reinforcement Learning framework is illustrated in Figure 2-8 [35]. An agent tries to maximize the reward that it receives while interacting with the environment. The learning framework is based on Markov Decision Processes (MDP) [36]. Learning is done along trials (or episodes) of length T. At each time step t, the agent receives from the environment the state st ∈ S. On that basis, the agent selects an action at ∈ A using a policy π. Consequently, the agent receives a reward rt, and the state is updated to st+1. We define the trajectory τ as all the information the agent receives along the trial, so at each time step t, τt = {st, at, rt}.

Figure 2-8 – Schematic view of the reinforcement learning framework

The agent's policy π(a|s) defines the probability of selecting an action, given the state s. The RL method specifies how the agent changes its policy along the learning process. We will proceed to define some important notations widely used in reinforcement learning. First, we define the return, R, which is the sum of rewards accumulated over a learning episode:
R = Σ_{t=0}^{T} γ^t r_{t+1}        (2-6)

where γ is the discount parameter used to determine the importance of future rewards. When γ is close to zero, the agent accounts only for the immediate reward and neglects future rewards. When γ approaches one, the agent also accounts for rewards that are far in the future. The goal of the learning process is to maximize the expected return by maximizing the objective function J while following the policy π, where

J = E[R]        (2-7)
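As a small worked example of the return (2-6) and the Monte Carlo view of the objective (2-7), the snippet below computes a discounted return for one episode and averages returns over several episodes; the reward values and discount factors are arbitrary illustrations.

import numpy as np

def discounted_return(rewards, gamma):
    # rewards[t] holds r_{t+1}, the reward received after step t, as in (2-6)
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def estimate_objective(episode_rewards, gamma):
    # Monte Carlo estimate of J = E[R] from a list of episodes (2-7)
    return float(np.mean([discounted_return(r, gamma) for r in episode_rewards]))

print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))            # 1 + 0.9 + 0.81 = 2.71
print(estimate_objective([[1.0, 0.0], [0.0, 1.0]], gamma=0.5))  # (1.0 + 0.5) / 2 = 0.75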
The value function is defined as the expected return when starting from state s0, while following a policy π:

V^π(s) = E[R | s0 = s]        (2-8)
In some cases the action-value function is more useful for evaluating one policy over another. It is defined as the expected return when starting from state s0, choosing action a, and thereafter following a policy π:

Q^π(s, a) = E[R | s0 = s, a0 = a]        (2-9)
Reinforcement Learning algorithms can be classified as either value function methods or policy search methods. Value function methods are based on learning the state or state-action value function. The action is selected based on the learned function, using, for example, a greedy policy (choose the action that yields the highest value). Value function methods include the SARSA algorithm [37], Q-learning [38] and more [35]. In contrast, policy search algorithms [39] use a pre-defined control structure, and try to optimize its parameters using explicit or non-explicit knowledge of the state or state-action value functions. Algorithms that make explicit use of the value function are known as actor-critic algorithms (Figure 2-9). In this architecture, the value function and the policy are separated, and the value function is used as a critic to modify the policy in a way that will maximize the value. The next section will discuss in particular the policy gradient methods, which make use of an approximation of the gradient of the objective function with respect to the policy parameters to maximize the accumulated reward.
Figure 2-9 – The actor critic architecture. The value function is separated from the policy (the ‘actor’), and is used only as a ‘critic’ to correct the policy.
2.3.2 Policy gradient methods Policy gradient methods are common in robotic applications for many reasons [40]. Among the important ones are that they I) use policies with pre-defined control structures, thus incorporating prior knowledge into the system, II) have fewer parameters that need to be learned than in the value function based methods, and III) have better convergence properties and a strong theoretical background [41][42]. The major disadvantage of policy gradient methods is that they typically converge to a local rather than global optimum. Policy gradient methods have been applied to optimize spline-based trajectories for robots with many degrees of freedom [43], and demonstrated good results when learning complex tasks such as playing table tennis [44] and hitting a baseball with a bat [45]. The policy gradient approach has been previously applied to biped robots [46] [47] and CPG controlled biped robots [48] [49].
Policy update We assume that the policy is parameterized with a vector of parameters θ. The policy gradient algorithm relies on the approximation of the gradient of the objective function (2-7), ∇_θJ. Given the gradient, steepest ascent yields the following update rule:

θ_{k+1} = θ_k + α_k ∇_θJ        (2-10)
where α denotes the update rate, and k is the update step. The policy improvement loop [50] is illustrated in Figure 2-10.
Figure 2-10 – Policy improvement loop. At each iteration, M trials are performed with injected perturbations at each time step. The observed trajectories from each trial that include the states, actions and rewards are passed to the gradient estimator. The parameters are updated when the gradient estimation converges, and the updated parameters are passed to another iteration.
The main challenge of policy gradient methods is the estimation of the gradient. In simpler problems, it is possible to derive the gradient analytically. However, in more realistic problems it is difficult to model each detail of the environment. Therefore, model-free methods were developed to grant more autonomy to the learning process. The following describes various approaches for estimating the gradient.
Finite Difference The most straightforward and simple approach is to estimate the gradient directly using finite differences [45]. At each update step k, the objective function Jref is estimated for the policy parametrization θk. Then, small variations of the policy parameters Δθi are applied to estimate the corresponding change in the objective function ΔJi ≈ J(θk + Δθi) - Jref. The update is performed once enough samples are gathered, so the gradient can be estimated by the least squares formula:

∇_θJ = (ΔΘ^T ΔΘ)^{-1} ΔΘ^T ΔJ        (2-11)

where ΔΘ = [Δθ_1, ..., Δθ_I]^T, ΔJ = [ΔJ_1, ..., ΔJ_I]^T, and I is the number of samples.
This approach was successful in several cases, such as learning the gait controller parameters of the AIBO robot [51]. However, the Finite Difference method results in noisy gradient estimates. In addition, the magnitude of the variations Δθi must be chosen very carefully, since in some applications different parameters should vary by different orders of magnitude. An implementation of the Finite Difference method is shown in Table 2-1.

input: policy parameterization θk
1  initialize ∇θJ ← 0
2  estimate Ĵref = J(θk)
3  repeat
4      generate policy variation Δθi
5      perform a trial, observe the trajectory's return and estimate Ĵi = J(θk + Δθi) ≈ Ri
6      compute ΔĴi = Ĵi − Ĵref
7      compute the gradient ∇θJ = (ΔΘ^T ΔΘ)^-1 ΔΘ^T ΔĴ
   until the gradient estimation ∇θJ converges
return: gradient estimation ∇θJ

Table 2-1 Finite Difference method for estimating the gradient of the cost function with respect to the policy parameters [45]
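The following Python sketch mirrors the steps of Table 2-1, using a least-squares solve for Eq. (2-11). The Gaussian choice of the variations Δθi and all default constants are assumptions made for illustration only.

import numpy as np

def finite_difference_gradient(J, theta_k, delta=1e-2, n_samples=None, rng=None):
    # Finite-difference policy-gradient estimate following Table 2-1.
    # J(theta): (noisy) estimate of the return of a trial run with parameters theta.
    rng = np.random.default_rng() if rng is None else rng
    theta_k = np.asarray(theta_k, dtype=float)
    if n_samples is None:
        n_samples = 2 * theta_k.size
    J_ref = J(theta_k)                                                  # reference return
    d_theta = delta * rng.standard_normal((n_samples, theta_k.size))   # policy variations
    d_J = np.array([J(theta_k + dth) - J_ref for dth in d_theta])      # return variations
    # Eq. (2-11): least-squares gradient estimate
    grad, *_ = np.linalg.lstsq(d_theta, d_J, rcond=None)
    return grad

# Example with a known objective: the gradient of -||theta||^2 at [1, -2] is [-2, 4]
print(finite_difference_gradient(lambda th: -np.sum(th**2), np.array([1.0, -2.0])))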
REINFORCE
The REINFORCE algorithm was first developed by Williams in 1992 [21], showing that the policy gradient can be estimated using the likelihood ratio [52][53]. Williams's approach was later given a strong theoretical foundation with the presentation of the policy gradient theorem [42]. Consider the objective function (2-7). Over an episode consisting of a trajectory τ with distribution $p_\theta(\tau) = p(\tau \mid \theta)$ and return $R(\tau) = \sum_{t=0}^{T}\gamma^{t} r_{t+1}$, the expected return can be written as:

$$J(\theta) = E\left[R(\tau)\right] = \int p_\theta(\tau)\, R(\tau)\, d\tau \tag{2-12}$$
The gradient of the objective function is therefore obtained by differentiating (2-12):

$$\nabla_\theta J = \int \nabla_\theta p_\theta(\tau)\, R(\tau)\, d\tau \tag{2-13}$$

Using the log-likelihood 'trick', we can replace $\nabla_\theta p_\theta(\tau)$ with $\nabla_\theta \log p_\theta(\tau)\, p_\theta(\tau)$, since

$$\nabla_\theta \log p_\theta(\tau) = \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)} \tag{2-14}$$
This replacement is very useful when using normally distributed policies, and makes their differentiation very simple. Due to the Markov assumption, the distribution pθ(τ) can be written as:

$$p_\theta(\tau) = p(s_0) \prod_{t=0}^{T} p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t) \tag{2-15}$$
This results in the formula that is the basis of the policy gradient theorem:

$$\nabla_\theta J = E\left[\nabla_\theta \log p_\theta(\tau)\, R(\tau)\right] = E\left[\sum_{t=0}^{T}\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right] \tag{2-16}$$
The last transition relies on the fact that, of all the components of pθ(τ) in (2-15), only the policy πθ(at|st) depends on θ. This is a critical step, which obviates the need for explicit knowledge of the model pθ(τ). Eq. (2-16) yields an approximation of the gradient that can be applied after gathering sufficient data from M trials:
$$\nabla_\theta J \approx \frac{1}{M}\sum_{m=1}^{M}\sum_{t=0}^{T}\nabla_\theta \log \pi_\theta\left(a_t^m \mid s_t^m\right) R^m \tag{2-17}$$
In order to reduce the variance of the gradient estimate, a constant baseline can be subtracted from the return:
$$\nabla_\theta J = E\left[\nabla_\theta \log p_\theta(\tau)\left(R(\tau) - b\right)\right] \tag{2-18}$$

where b can be chosen arbitrarily, since it does not introduce bias into the gradient estimate: $\int p_\theta(\tau)\, d\tau = 1$ implies that $\int \nabla_\theta p_\theta(\tau)\, d\tau = 0$ [21]. The optimal baseline, which minimizes the variance, was derived in [45] in the form:
$$b = \frac{\sum_{m=1}^{M}\left(\sum_{t=0}^{T}\nabla_\theta \log \pi_\theta\left(a_t^m \mid s_t^m\right)\right)^{2} R^m}{\sum_{m=1}^{M}\left(\sum_{t=0}^{T}\nabla_\theta \log \pi_\theta\left(a_t^m \mid s_t^m\right)\right)^{2}} \tag{2-19}$$

where the expression is evaluated separately for each component of θ. An implementation of the REINFORCE method with the optimal baseline is shown in Table 2-2.
input: policy parameterization θk
1  initialize ∇θJ ← 0, b ← 0
2  repeat
3      perform trial m out of M and obtain the trajectory (s0:T, a0:T, r0:T)
4      compute dLog^m = Σ_{t=0}^{T} ∇θ log πθk(a_t^m | s_t^m) and R^m = Σ_{t=0}^{T} γ^t r_{t+1}
5      estimate the optimal baseline b = Σ_{m=1}^{M} (dLog^m)^2 R^m / Σ_{m=1}^{M} (dLog^m)^2
6      estimate the gradient ∇θJ = (1/M) Σ_{m=1}^{M} dLog^m (R^m − b)
7  until the gradient estimation ∇θJ converges
return: gradient estimation ∇θJ
Table 2-2 REINFORCE algorithm with optimal baseline for estimating the gradient of the cost function with respect to the policy parameters [45]
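A compact Python sketch of the gradient estimate in Table 2-2 is given below. It assumes the per-trial sums of ∇θ log πθ and the returns R^m have already been collected; the element-wise optimal baseline of Eq. (2-19) is computed per parameter. The function name and array layout are illustrative assumptions.

import numpy as np

def reinforce_gradient(dlog, returns):
    # REINFORCE gradient estimate with the optimal baseline, following Table 2-2.
    # dlog: array of shape (M, n_params); row m holds sum_t grad_theta log pi(a_t^m | s_t^m).
    # returns: array of shape (M,); element m is the return R^m of trial m.
    dlog = np.asarray(dlog, dtype=float)
    R = np.asarray(returns, dtype=float)
    # Optimal per-parameter baseline, Eq. (2-19)
    num = np.sum(dlog**2 * R[:, None], axis=0)
    den = np.sum(dlog**2, axis=0)
    b = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    # Gradient estimate, Eq. (2-17) with baseline, averaged over the M trials
    return np.mean(dlog * (R[:, None] - b), axis=0)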
3. MATHEMATICAL MODEL OF MONO-PEDAL ROBOT
The following section describes the mathematical model that was used for simulating the mono-pedal robot. The model's dynamics consist of three phases, as shown in Figure 3-1: I) Stance phase, during which the foot is assumed to be attached to the ground with no slip. II) Swing phase with contracted leg. III) Swing phase with extended leg. Different failure modes may occur depending on the phase of the system: I) Reaching the maximal leg angle during the swing phase. This failure mode reflects the mechanical limitation of the system, and prevents the leg from reaching its limit and causing a mechanical failure. For a compass biped robot, kicking too high would also cause the robot to fall on its back. II) Extension of the leg into the ground. Here too, for a compass biped robot, extending into the ground would cause the robot to trip and fall. III) The system does not have enough energy to complete the step. Failure III is detected in simulations when the hip falls below a certain height, although in hardware there is a mechanical limit. The transitions between the phases and the possible failure modes are illustrated in the state machine shown in Figure 3-1.
[Figure 3-1 diagram: states Stance, Swing contracted, and Swing extended, connected by the Contract, Extend, and Impact transitions; failure states Fail I (max angle), Fail II (extend into ground), and Fail III (insufficient energy).]
Figure 3-1 mono-pedal robot state machine consists of 3 states. From each state, the system can enter a failed state.
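A minimal Python sketch of this state machine is given below. The event names, and the decision to map every failure event to a single FAILED state, are illustrative assumptions and are not taken from the simulation code.

from enum import Enum, auto

class Phase(Enum):
    STANCE = auto()
    SWING_CONTRACTED = auto()
    SWING_EXTENDED = auto()
    FAILED = auto()

# Failure events (Fail I-III) and the phase transitions of Figure 3-1
FAILURE_EVENTS = {"max_angle", "extend_into_ground", "insufficient_energy"}
TRANSITIONS = {
    (Phase.STANCE, "contract"): Phase.SWING_CONTRACTED,
    (Phase.SWING_CONTRACTED, "extend"): Phase.SWING_EXTENDED,
    (Phase.SWING_EXTENDED, "impact"): Phase.STANCE,
}

def next_phase(phase, event):
    # Any failure event, or an event that is not legal in the current phase, ends in FAILED
    if event in FAILURE_EVENTS:
        return Phase.FAILED
    return TRANSITIONS.get((phase, event), Phase.FAILED)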
3.1 Swing phase
During the swing phase, the leg is mounted by its hip joint at the bottom of the linear track, marked as (c) in Figure 3-2, and the system behaves as the well-known pendulum-on-a-cart model. The friction between the cart and its track is modeled as Coulomb friction, so the system can be in one of two modes: I) Slip mode, where the cart's speed is non-zero and the cart-pendulum model is considered.
II) Stick mode, where the cart's velocity is zero, and only the pendulum-like motion of the leg is considered. During slip mode, the system has two degrees of freedom: the angle of the leg relative to the vertical, denoted θ, and the horizontal displacement of the cart, denoted x. During stick mode, the system has only one degree of freedom, the leg angle θ. The equations of motion were derived using Lagrangian mechanics, by accounting for: I) the kinetic energy stored in the inertia of the leg and the cart, II) the potential energy associated with the vertical height of the leg, and III) the energy dissipated due to friction of the leg at the hip joint. The switching between slip and stick mode is detected by computing the horizontal reaction force applied on the cart due to the dynamic movement of the leg, using Lagrange multipliers (see Appendix C – Development of Dynamic Equations of the Mono-Ped Model). If the force exceeds the stick friction threshold, then we enter slip mode. If the cart's initial velocity is zero, or its velocity reaches zero and the computed reaction force is less than the stick friction threshold, then we enter stick mode.
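As an illustration of the switching rule described above, the following sketch decides the friction mode from the cart velocity and the computed reaction force. The function and argument names are assumptions for illustration only.

def cart_friction_mode(x_dot_cart, reaction_force, f_stick, eps=1e-6):
    # Decide the cart friction mode during the swing phase.
    # reaction_force: horizontal force on the cart computed from the leg dynamics
    # (via the Lagrange multiplier); f_stick: stick-friction threshold.
    if abs(x_dot_cart) < eps:
        # Cart (momentarily) at rest: it sticks unless the reaction force exceeds the threshold
        return "slip" if abs(reaction_force) > f_stick else "stick"
    return "slip"  # cart already moving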
Figure 3-2 Swing phase model. (a) ankle joint (b) hip joint (c) linear track on which the leg is mounted (d) linear motor in contracted position
The forces that are applied to the system during the swing phase are: I) Torque input τCPG at the hip joint, produced by the SEA and governed by the CPG gait controller. II) Friction force fslip on the cart during slip mode. The equations of motion during the swing phase are developed in Appendix C – Development of Dynamic Equations of the Mono-Ped Model. The resulting equations of motion during slip mode are:
$$J_{hip}\ddot{\theta} + m_{leg} l_{cg,hip}\cos\theta\,\ddot{x} + m_{leg}\, g\, l_{cg,hip}\sin\theta = -f_{hip}\dot{\theta} + \tau_{CPG} \tag{3-1}$$

$$\left(m_{cart} + m_{leg}\right)\ddot{x} + m_{leg} l_{cg,hip}\left(\cos\theta\,\ddot{\theta} - \sin\theta\,\dot{\theta}^{2}\right) = -f_{slip}\,\mathrm{sign}\left(\dot{x}\right) \tag{3-2}$$
and during stick mode:
$$J_{hip}\ddot{\theta} + m_{leg}\, g\, l_{cg,hip}\sin\theta = -f_{hip}\dot{\theta} + \tau_{CPG} \tag{3-3}$$
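Assuming the reconstructed form of Eqs. (3-1)-(3-2), the slip-mode accelerations can be obtained by solving the coupled cart-pendulum equations for θ̈ and ẍ, as in the following sketch; the parameter names and dictionary layout are illustrative.

import numpy as np

def swing_slip_accelerations(theta, theta_dot, x_dot, tau_cpg, p):
    # Solve the coupled equations (3-1)-(3-2) for (theta_ddot, x_ddot) in slip mode.
    # p: dict of model parameters J_hip, m_leg, m_cart, l_cg_hip, f_hip, f_slip, g.
    c, s = np.cos(theta), np.sin(theta)
    a = p["m_leg"] * p["l_cg_hip"]
    # Mass matrix and right-hand side of the cart-pendulum equations
    M = np.array([[p["J_hip"], a * c],
                  [a * c, p["m_cart"] + p["m_leg"]]])
    rhs = np.array([tau_cpg - p["f_hip"] * theta_dot - a * p["g"] * s,
                    a * s * theta_dot**2 - p["f_slip"] * np.sign(x_dot)])
    theta_ddot, x_ddot = np.linalg.solve(M, rhs)
    return theta_ddot, x_ddot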
3.2 Stance phase
During the stance phase, the foot is assumed to be attached to the treadmill surface with no slip. The hip joint is free to move vertically in its linear track, as illustrated in Figure 3-3(b). The leg is modeled as a rotating link attached to the horizontally moving cart at its upper hip axis, and to the horizontally moving treadmill track at its lower ankle axis. During stick friction mode of the cart, the treadmill speed determines the rotational velocity of the leg through the kinematic relation (3-5). During slip friction mode of the cart, the system kinematics have one degree of freedom, which is chosen to be the leg angle θ. The cart's position and velocity are derived from their kinematic relation to the leg angle. The forces that are applied to the system during the stance phase are: I) Torque input at the hip and ankle joints, produced by the SEA and governed by the CPG gait controller. The input torque at both joints has the same effect on the system, and is therefore summed and denoted by τCPG. II) Friction force on the cart during slip mode, denoted as fslip. During stick friction mode of the cart, the movement is governed only by the kinematic relation with the treadmill.
The equations of motion are derived using Lagrangian mechanics as detailed in Appendix C – Development of Dynamic Equations of the Mono-Ped Model. Energy dissipation occurs due to the rotational friction at the hip and ankle joints and the friction between the hip and the linear track.
Figure 3-3 Stance phase model. (a) linear motor is extended (b) hip joint is free to move vertically. (c) Foot is attached to the track with no slip
The resulting equations of motion during the stance phase while in slip mode are:
$$\left(m_{cart} l^{2}\cos^{2}\theta + J_{ankle}\right)\ddot{\theta} - m_{cart} l^{2}\cos\theta\sin\theta\,\dot{\theta}^{2} - m_{leg}\, g\, l_{cg,ankle}\sin\theta = -\left(f_{hip} + f_{ankle} + f_{z} l^{2}\sin^{2}\theta\right)\dot{\theta} + \tau_{CPG} - f_{slip}\,\mathrm{sign}\left(\dot{x}_{cart}\right) l\cos\theta \tag{3-4}$$
During the stick mode, the angular velocity is related to the linear velocity by:

$$\dot{\theta} = \frac{\dot{x}_{ankle} - \dot{x}_{cart}}{l\cos\theta} \tag{3-5}$$
3.3 Swing to stance transition at impact
The transition from the swing phase to the stance phase occurs when the foot contacts the ground. We note that the transition is successful only if the contact occurs after the leg has been extended. If the contact occurs at leg extension, the system enters a failed state. At contact with the ground, the reaction force is assumed to act at a single point located at the ankle joint, so conservation of angular momentum can be assumed. From this assumption, the angular velocity of the leg after the impact, θ̇⁺, is derived, given the angular velocity of the leg θ̇⁻ and the horizontal velocity of the cart ẋ⁻ prior to the impact. The impact equation is derived as (see Appendix C – Development of Dynamic Equations of the Mono-Ped Model):
$$\dot{\theta}^{+} = \frac{J_{ankle}\,\dot{\theta}^{-} + m_{cart}\, l\cos\theta\,\dot{x}^{-}}{J_{ankle} + m_{cart}\, l^{2}\cos^{2}\theta} \tag{3-6}$$
Since the impact angle is constant (with small changes in the presence of disturbances), the impact law can be written in the general form:

$$\dot{\theta}^{+} = \alpha_{impact,\theta}\,\dot{\theta}^{-} + \alpha_{impact,x}\,\dot{x}_{cart}^{-} \tag{3-7}$$
αimpact,θ and αimpact,x can be approximated from experiments, as will be shown in section 5.
3.4 Leg contraction/extension
The leg is contracted in order to clear the ground during the swing phase. The leg's contraction and extension are assumed to happen instantaneously, and to have no effect on the dynamics of the system. However, the timing of the contraction and extension has an important role in maintaining a limit cycle that propels the cart forward. In particular, if extension occurs when the leg angle is too low, it will extend into the ground, which would lead to a failed state. Two methods to time the extension/contraction were used. Note that each method can be implemented on the extension and/or contraction independently. I) CPG phase activation: extension and/or contraction are timed at specific phases of the CPG. II) Reflex mechanism: extension and/or contraction occur in response to specific events of the mechanical system. Specifically, the extension reflex is activated during the swing phase when the leg reaches a threshold angle θextract. The contraction reflex is activated during the stance phase, when the hip joint reaches a threshold angle θcontract.
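A schematic sketch of the reflex-based triggers (method II) is shown below. The phase labels, threshold names, and return values are illustrative assumptions, not the controller's actual interface.

def reflex_command(phase, leg_angle, hip_angle, theta_extend, theta_contract):
    # Reflex-based timing (method II): returns 'extend', 'contract', or None.
    if phase == "swing" and leg_angle >= theta_extend:
        return "extend"      # extension reflex during the swing phase
    if phase == "stance" and hip_angle >= theta_contract:
        return "contract"    # contraction reflex during the stance phase
    return None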
3.5 Gait controller
The CPG controller is based on a linear phase oscillator, as suggested in [19]. The phase is calculated according to the following expression, depending on the oscillator frequency ω:
$$\varphi(t) = \omega t \tag{3-8}$$
Two torque pulses are generated as a function of the phase as shown in Figure 3-4. During the swing phase, torque is used to activate the hip only, while during the stance phase the torque can be distributed between the hip and ankle joints. Each pulse m (m=1,2) is defined by three parameters specifying the activation phase φ2m-1, deactivation phase φ2m and torque magnitude τm, for a total of 6 parameters for the two pulses. An additional global parameter T defines the period of the CPG cycle. We note that when the extension and/or contraction of the leg are determined by the CPG phase, one or two additional timing parameters are added accordingly.
Figure 3-4 Central pattern generator used for the mono-pedal robot

For practical implementation, it is better to use smooth signals than square pulses, and therefore Sigmoid (or Logistic) functions are used. The Sigmoid function determines the pulse activation time per cycle, with a smoothing parameter κ.
The amplitude of each pulse m is controlled by the feedback weights wτ,m, which determine the change in pulse intensity from the nominal value τm according to feedback from the environment ψe. To sum up the above, each torque pulse is defined by the following expression:

$$u_m(\varphi) = \left(\tau_m + \delta\tau_m\right)\left[\frac{1}{1 + \exp\left(-\kappa\left(\varphi - \varphi_{2m-1}\right)\right)} - \frac{1}{1 + \exp\left(-\kappa\left(\varphi - \varphi_{2m}\right)\right)}\right] \tag{3-9}$$

$$\delta\tau_m = f_{\tau,m}\left(\mathbf{w}_{\tau,m}, \boldsymbol{\psi}_e\right) \tag{3-10}$$
where fτ can be any function that maps the features ψe to the torque correction δτm, depending on the parameter vector wτ,m. In the simplest case, fτ defines a linear relation, so that

$$\delta\tau_m = \mathbf{w}_{\tau,m}^{T}\,\boldsymbol{\psi}_e \tag{3-11}$$
However, fτ,m can also be a more complex relation, such as a neural network with weights wτ,m. The nominal cycle time T is modified by feedback features ψe, extracted from the environment, and the feedback weights wT:

$$\tilde{T} = T\left(1 + \delta T\right) \tag{3-12}$$

$$\delta T = f_T\left(\mathbf{w}_T, \boldsymbol{\psi}_e\right) \tag{3-13}$$
As before, fT can be any function that maps the features ψe to the period correction δT, depending on the parameter vector wT. Here too, fT can be defined either as a linear relation or as a neural network. Therefore, changes in the environment can cause a change in the pulse intensity (3-9) as well as a change in the clock frequency (3-12).
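The following sketch assembles Eqs. (3-8)-(3-11) into a smooth two-pulse torque profile. It assumes the activation/deactivation phases are specified as fractions of the cycle and that the feedback correction is the linear map of Eq. (3-11); the function name, parameter names, and example values are illustrative.

import numpy as np

def cpg_torque(t, T, phi, tau, kappa=50.0, psi_e=None, w_tau=None):
    # Smooth two-pulse CPG torque, Eqs. (3-8)-(3-11).
    # T: cycle period; phi: four activation/deactivation phases as fractions of a cycle;
    # tau: two nominal pulse magnitudes; psi_e, w_tau: optional feedback features/weights.
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))
    phase = (2.0 * np.pi * t / T) % (2.0 * np.pi)        # Eq. (3-8), wrapped to one cycle
    u = 0.0
    for m in range(2):
        phi_on = 2.0 * np.pi * phi[2 * m]                # activation phase of pulse m
        phi_off = 2.0 * np.pi * phi[2 * m + 1]           # deactivation phase of pulse m
        d_tau = 0.0
        if psi_e is not None and w_tau is not None:
            d_tau = float(np.dot(w_tau[m], psi_e))       # linear feedback, Eq. (3-11)
        # Eq. (3-9): the difference of two sigmoids produces a smooth pulse
        u += (tau[m] + d_tau) * (sigmoid(kappa * (phase - phi_on))
                                 - sigmoid(kappa * (phase - phi_off)))
    return u

# Example: two pulses, no feedback
print(cpg_torque(t=0.3, T=1.0, phi=[0.1, 0.3, 0.55, 0.75], tau=[1.5, -1.0]))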
3.6 Phase reset
The open loop CPG controller runs independently of the dynamic model. Synchronization between the CPG and the robot dynamics occurs naturally within a limited set of initial conditions. This is due to mutual entrainment between the system dynamics and the CPG oscillator, as described in [17], resulting in a unique limit cycle. However, in the presence of disturbances, the synchronization between the system dynamics and the CPG may deteriorate and eventually lead to failure. Phase resetting is one way of enforcing synchronization, and has been successfully implemented in the past to enhance robustness [54][32]. In our implementation, the performance of phase resetting at the Poincaré section is assessed in simulations in section 6.
3.7 Poincaré section definition
As detailed in the sections above, the state space is 5-dimensional during the swing phase under slip mode, i.e., xswing = [θ, θ̇, x, ẋ, φ]ᵀ, and 3-dimensional during the stance phase under slip mode, i.e., xstance = [θ, θ̇, φ]ᵀ. Hence, in order to obtain a low-dimensional Poincaré map, we can select a Poincaré section during the stance phase, in particular at the state immediately after the impact. During swing, the hip rests on the bottom of the linear track and is thus located at a fixed height above the ground. Hence, the angle of the leg when it hits the ground is fixed, and the Poincaré section is 2-dimensional: xp = [θ̇, φ]ᵀ.
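A minimal sketch of mapping the pre-impact state onto this Poincaré section, using the linearized impact law (3-7), could look as follows; the function and argument names are illustrative.

def poincare_state(theta_dot_minus, x_dot_minus, phase, alpha_theta, alpha_x):
    # State on the Poincare section, taken immediately after impact.
    # Applies the linearized impact law (3-7): theta_dot_plus = a_theta*theta_dot_minus + a_x*x_dot_minus
    theta_dot_plus = alpha_theta * theta_dot_minus + alpha_x * x_dot_minus
    return [theta_dot_plus, phase]    # x_p = [theta_dot, phi]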
4. SERIES ELASTIC ACTUATION
The mono-pedal robot (Figure 4-1) was designed to demonstrate the feasibility of CPG gait control on hardware. The robot design enables the following features: I) the ability to produce a torque signal at its hip and ankle joints, and II) the ability to shorten and extend the leg to clear the ground during its swing. To achieve the first feature, the robot's hip and ankle joints are actuated by the Series Elastic Actuation (SEA) method detailed in this section. To achieve the second feature, a linear motor is used to extend and shorten the leg. In addition, two pressure sensors are positioned under the foot to detect ground contact. The sensors are positioned at the front and back of the foot.
Figure 4-1 Left: The mono - pedal robot. Upper right: Hip SEA motor and springs. Lower right: Ankle SEA motor and springs.
4.1 Series elastic actuation (SEA)
Series elastic actuation is a widely used method in robotics when designing torque-controlled robots. It was first introduced by Gill Pratt [22] and implemented on the Troody [55] and Spring Turkey [56]
robots, and on many other robotic implementations since then [57][58][59][60][61]. The SEA method was developed to address the need for torque-actuated, compliant joints that suit the nature of robots that interact with a sensitive environment and/or with humans. Figure 4-2 illustrates the difference between classic torque control and SEA torque control. Classic torque control requires a high-performance motor that can produce the desired torque with high bandwidth in order to respond to environmental changes. In addition, a torque sensor is required; however, torque sensors are usually quite costly. In contrast, SEA torque control requires less powerful motors thanks to the use of a high gear reduction, at the expense of reduced bandwidth. The reduced bandwidth is compensated by the inherent compliance of the system: sudden environmental changes are met by the soft compliance of the spring. In return, position sensors must be positioned before and after the elastic component of the system to measure the deflection.
Figure 4-2 Classic torque control vs. SEA torque control. Left: SEA control can use a smaller geared motor, and requires position measurements before and after the elastic component. Right: Classic torque control uses bigger motors and requires expensive torque sensors.

The SEA compliance results in reduced output impedance (defined as the change in actuator torque caused by a change in joint angular displacement), as demonstrated in Figure 4-3 [60]. The actuator impedance is reduced by at least 30 dB at all frequencies (dotted and solid lines). Reducing the spring stiffness reduced the impedance at the higher (uncontrolled) frequencies. In addition, the decrease in phase lead implies that the actuator applies less damping.
Figure 4-3 Actuator’s output impedance with or without an elastic component (from [60])
4.2 SEA mechanical design
As explained in the previous section, the series elastic actuation system designed in our lab achieves two goals. First, it facilitates torque control without the need for expensive torque sensors and high-performance motors. Second, its inherent compliance serves the goal of mimicking the biomechanics of an actual human leg. Figure 4-4 illustrates the mechanics of the SEA system we use for our robot. The desired torque signal is produced by controlling the relative angle between the motor angle at the gear output θ'm and the joint angle θl. The output torque (neglecting dynamics) is given by:
$$T_{load} = k_{SEA}\left(\theta'_m - \theta_l\right) = k_{SEA}\,\Delta\theta \tag{4-1}$$

kSEA is the equivalent elastic constant of the SEA springs, and is given by:

$$k_{SEA} = 2 k r^{2} \tag{4-2}$$
k is the linear spring constant, and r is the pulley radius as illustrated in Figure 4-4. See (Appendix B – Calculation of SEA Constant) for a detailed calculation.
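The two relations above can be summarized in a short sketch; the numerical values in the example are illustrative only and are not the hardware values reported below.

import numpy as np

def sea_stiffness(k_spring, r_pulley):
    # Equivalent rotational stiffness of the SEA, Eq. (4-2): k_SEA = 2 k r^2
    return 2.0 * k_spring * r_pulley**2

def sea_torque(theta_gear_out, theta_joint, k_sea):
    # Estimated joint torque from the measured deflection, Eq. (4-1)
    return k_sea * (theta_gear_out - theta_joint)

# Illustrative numbers only (not this robot's hardware values):
k_sea = sea_stiffness(k_spring=2000.0, r_pulley=0.05)      # [N/m] and [m] give [Nm/rad]
print(k_sea, sea_torque(np.deg2rad(10.0), 0.0, k_sea))     # 10 deg deflection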
Figure 4-4 Schematic drawing of the SEA. (Diagram labels: motor, gear, pulleys of radius r, SEA springs with constant k, motor angle θm, gear output angle θ'm, joint angle θl.)
Each joint is powered by a MAXON RE35 90 W brushed DC motor with an 86:1 gear head. After gear reduction, the motor is able to produce a maximal torque of 75 Nm (stall torque) and a nominal torque of 6.2 Nm (continuous torque). The angular positions of the motor shaft at the gear output and of the joint are measured by two AVAGO HEDR 55L2-BY09 incremental encoders, with a resolution of 14400 counts per revolution, resulting in an angular resolution of Δθmin = 0.025°. The gear exhibits a maximal backlash of 1°, but it is not taken into account since the encoder is mounted on the gear output. Equations (4-1), (4-2) and Figure 4-4 provide the following three insights:
I. The maximal torque that can be applied (Tload,max) is bounded by the maximal angular difference that can be applied between the motor shaft and the joint (Δθmax).
II. The minimal torque that can be measured (Tload,min) is limited by the encoder resolution (Δθmin).
III. The maximal linear deflection of the springs cannot exceed the space between the two pulleys.
The spring constant was selected to satisfy the following requirements: (i) a torque resolution of Tload,min = 0.01 Nm, (ii) a maximal torque of Tload,max = 5 Nm, (iii) a maximal angle difference of Δθmax = 40° (0.69 rad), and (iv) a linear spring length of 50 mm after deflection. Given the maximal angular difference Δθmax, the minimal SEA stiffness that can meet the required Tload,max is:

kSEA,min = Tload,max / Δθmax = 7.15 [Nm/rad]

Given the encoder resolution Δθmin, the maximal SEA stiffness that can meet the required Tload,min is:

kSEA,max = Tload,min / Δθmin = 22.5 [Nm/rad]

Given those bounds, and the constraints on the deflection of the linear springs (satisfying x0 + Δxmax = x0 + rΔθmax = x0 + rTload,max/kSEA