Autonomous Helicopter Flight via Reinforcement Learning

Andrew Y. Ng
Stanford University, Stanford, CA 94305

H. Jin Kim, Michael I. Jordan, and Shankar Sastry
University of California, Berkeley, CA 94720
Abstract

Autonomous helicopter flight represents a challenging control problem, with complex, noisy dynamics. In this paper, we describe a successful application of reinforcement learning to autonomous helicopter flight. We first fit a stochastic, nonlinear model of the helicopter dynamics. We then use the model to learn to hover in place, and to fly a number of maneuvers taken from an RC helicopter competition.
1 Introduction

Helicopters represent a challenging control problem with high-dimensional, complex, asymmetric, noisy, non-linear dynamics, and are widely regarded as significantly more difficult to control than fixed-wing aircraft [7]. Consider, for instance, the problem of designing a helicopter that hovers in place. We begin with a single, horizontally-oriented main rotor attached to the helicopter via the rotor shaft. Suppose the main rotor rotates clockwise (viewed from above), blowing air downwards and hence generating upward thrust. By applying clockwise torque to the main rotor to make it rotate, our helicopter experiences an anti-torque that tends to cause the main chassis to spin anti-clockwise. Thus, in the invention of the helicopter, it was necessary to add a tail rotor, which blows air sideways/rightwards to generate an appropriate moment to counteract the spin. But this sideways force now causes the helicopter to drift leftwards. So, for a helicopter to hover in place, it must actually be tilted slightly to the right, so that the main rotor blows air downwards and slightly to the left, producing a small rightward thrust component that counteracts this tendency to drift sideways. The history of helicopters is rife with such tales of ingenious solutions to problems caused by solutions to other problems, and of complex, nonintuitive dynamics that make helicopters challenging to control.

In this paper, we describe the successful application of reinforcement learning to designing a controller for autonomous helicopter flight. Due to space constraints, our description of this work is necessarily brief; a detailed treatment is provided in [9]. For a discussion of related work on autonomous flight, also see [9, 12].
2 Autonomous Helicopter

The helicopter used in this work was a Yamaha R-50 helicopter, which is approximately 3.6m long, carries up to a 20kg payload, and is shown in Figure 1a. A detailed description of the design and construction of its instrumentation is in [12]. The helicopter carries an Inertial Navigation System (INS) consisting of 3 accelerometers and 3 rate gyroscopes installed in exactly orthogonal $x$, $y$, $z$ directions, and a differential GPS system, which, with the assistance of a ground station, gives position estimates with a resolution of 2cm. An onboard navigation computer runs a Kalman filter which integrates the sensor information from the GPS, INS, and a digital compass, and reports (at 50Hz) 12 numbers corresponding to the estimates of the helicopter's position $(x, y, z)$, orientation (roll $\phi$, pitch $\theta$, yaw $\omega$), velocity $(\dot{x}, \dot{y}, \dot{z})$ and angular velocities $(\dot{\phi}, \dot{\theta}, \dot{\omega})$.
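For concreteness, the 12 reported numbers can be bundled into a small record type. The sketch below is our illustration only; the field names and units are assumptions, not taken from the actual flight software:

```python
from dataclasses import dataclass

@dataclass
class HelicopterState:
    """12-D Kalman filter state estimate, reported at 50 Hz.

    Illustrative field names; units assumed to be m, rad, m/s, rad/s.
    """
    x: float
    y: float
    z: float        # position
    roll: float     # phi
    pitch: float    # theta
    yaw: float      # omega
    dx: float
    dy: float
    dz: float       # velocity
    droll: float
    dpitch: float
    dyaw: float     # angular velocities
```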
Figure 1: (a) Autonomous helicopter. (b) Helicopter hovering under control of learned policy.
Most helicopters are controlled via a 4-dimensional action space:

$a_1, a_2$: The longitudinal (front-back) and latitudinal (left-right) cyclic pitch controls. The rotor plane is the plane in which the helicopter's rotors rotate. By tilting this plane either forwards/backwards or sideways, these controls cause the helicopter to accelerate forwards/backwards or sideways.

$a_3$: The (main rotor) collective pitch control. As the helicopter main rotor's blades sweep through the air, they generate an amount of upward thrust that (generally) increases with the angle at which the rotor blades are tilted. By varying the tilt angle of the rotor blades, the collective pitch control affects the main rotor's thrust.

$a_4$: The tail rotor collective pitch control. Using a mechanism similar to the main rotor collective pitch control, this controls the tail rotor's thrust.

Using the position estimates given by the Kalman filter, our task is to pick good control actions every 50th of a second.
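Putting the state and action interfaces together, a 50 Hz decision loop of the kind this task requires might look like the following sketch. All names here are illustrative; the paper does not describe its software structure:

```python
import time
from typing import Callable, NamedTuple

class Controls(NamedTuple):
    a1: float  # longitudinal cyclic pitch
    a2: float  # latitudinal cyclic pitch
    a3: float  # main rotor collective pitch
    a4: float  # tail rotor collective pitch

def control_loop(get_state: Callable, policy: Callable,
                 send: Callable, hz: int = 50) -> None:
    """Pick a control action every 1/50 s from the latest state estimate."""
    period = 1.0 / hz
    while True:
        state = get_state()   # latest 12-D Kalman filter estimate
        send(policy(state))   # policy maps state -> Controls(a1, a2, a3, a4)
        time.sleep(period)    # crude pacing; a real autopilot uses a hard real-time clock
```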
3 Model identification

To fit a model of the helicopter's dynamics, we began by asking a human pilot to fly the helicopter for several minutes, and recorded the 12-dimensional helicopter state and 4-dimensional helicopter control inputs as it was flown. In what follows, we used 339 seconds of flight data for model fitting, and another 140 seconds of data for hold-out testing.

There are many natural symmetries in helicopter flight. For instance, a helicopter at (0,0,0) facing east behaves in a way related only by a translation and rotation to one at (10,10,50) facing north, if we command each to accelerate forwards. We would like to encode these symmetries directly into the model rather than force an algorithm to learn them from scratch. Thus, model identification is typically done not in the spatial (world) coordinates, but instead in the helicopter body coordinates, in which the $x$, $y$, and $z$ axes are forwards, sideways, and down relative to the current orientation of the helicopter. Where there is risk of confusion, we will use superscripts $s$ and $b$ to distinguish between spatial and body coordinates; thus, $\dot{x}^b$ is forward velocity, regardless of orientation. Our model is identified in the body coordinates $(\phi, \theta, \dot{x}^b, \dot{y}^b, \dot{z}^b, \dot{\phi}^b, \dot{\theta}^b, \dot{\omega}^b)$, which have four fewer variables than the spatial coordinates. Note that once this model is built, it is easily converted back, using simple geometry, to one in terms of spatial coordinates.

Our main tool for model fitting was locally weighted linear regression (e.g., [11, 3]). Given a dataset $\{(x_i, y_i)\}$ where the $x_i$'s are vector-valued inputs and the $y_i$'s are the real-valued outputs to be predicted, we let $X$ be the design matrix whose $i$-th row is $x_i$, and let $\bar{y}$ be the vector of $y_i$'s. In response to a query at $x$, locally weighted linear regression makes the prediction $y = \beta^T x$, where $\beta = (X^T W X)^{-1} X^T W \bar{y}$, and $W$ is a diagonal matrix with (say) $W_{ii} = \exp(-(x_i - x)^T \Sigma^{-1} (x_i - x)/2)$, so that the regression gives datapoints near $x$ a larger weight. Here, $\Sigma$ determines how weights fall off with distance from $x$, and was picked in our experiments via leave-one-out cross validation.¹ Using the estimator for the noise $\sigma^2$ given in [3], this gives a model $y = \beta^T x + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2)$.
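The following is a minimal numpy sketch of this estimator (our illustration, not the authors' code); an intercept term can be included by appending a constant-1 feature to each input:

```python
import numpy as np

def lwr_predict(X, y_bar, x_query, Sigma_inv):
    """Locally weighted linear regression prediction at x_query.

    Implements y = beta^T x with beta = (X^T W X)^{-1} X^T W y_bar and
    Gaussian weights W_ii = exp(-(x_i - x)^T Sigma^{-1} (x_i - x) / 2).
    X: (m, d) design matrix; y_bar: (m,) targets; Sigma_inv: (d, d).
    """
    diffs = X - x_query  # (m, d) offsets of each datapoint from the query
    # Diagonal of W: datapoints near x_query receive larger weight.
    w = np.exp(-0.5 * np.einsum("id,de,ie->i", diffs, Sigma_inv, diffs))
    XtW = X.T * w        # equals X^T W, since W is diagonal
    beta = np.linalg.solve(XtW @ X, XtW @ y_bar)  # weighted least squares
    return float(x_query @ beta)
```

In practice the bandwidth matrix $\Sigma$ would be chosen by the cross-validation procedure described above.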
¹Actually, since we were fitting a model to a time-series, samples tend to be correlated in time, ...
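The conversion between spatial and body coordinates mentioned above is plain rigid-body geometry. For example, a velocity can be rotated into the body frame as follows; this is a sketch assuming a ZYX Euler-angle convention, which the paper does not specify:

```python
import numpy as np

def spatial_to_body_velocity(v_s, roll, pitch, yaw):
    """Rotate a spatial-frame velocity vector into the body frame.

    Illustrative geometry only. v_s is a (3,) world-frame velocity;
    returns (xdot_b, ydot_b, zdot_b): forward, sideways, down.
    """
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    # World-from-body rotation R = Rz(yaw) @ Ry(pitch) @ Rx(roll);
    # its transpose maps world-frame vectors into the body frame.
    R = np.array([
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp,     cp * sr,                cp * cr],
    ])
    return R.T @ v_s
```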
[Figure 2 plots appeared here: mean-squared-error-vs-seconds panels for $\dot{x}$, $\dot{\theta}$, and $\dot{y}$; a 10-second state track; and the policy-class diagram, in which $\Sigma$ nodes map inputs including $\mathrm{err}_x$, $x$, $\theta$, $\mathrm{err}_y$, $y$, $\phi$, $\mathrm{err}_z$, $z$, $\omega$, and $\mathrm{err}_\omega$ to the outputs $a_1, \ldots, a_4$.]
Figure 2: (a) Examples of plots comparing a model fit using the parameterization described in the text (solid lines) to some other models (dash-dot lines). Each point plotted shows the mean-squared error between the predicted value of a state variable (when a model is used to simulate the helicopter's dynamics for a certain duration, indicated on the $x$-axis) and the true value of that state variable (as measured on test data) after the same duration. Top left: Comparison of $\dot{x}$-error to a model not using the ..., etc. terms. Top right: Comparison of $\dot{\theta}$-error to a model omitting the intercept (bias) term. Bottom: Comparison of $\dot{x}$ and $\dot{y}$ to the linear deterministic model identified by [12]. (b) The solid line is the true helicopter state on 10s of test data. The dash-dot line is the helicopter state predicted by our model, given the initial state at time 0 and all the intermediate control inputs. The dotted lines show two standard deviations in the estimated state. Every two seconds, the estimated state is "reset" to the true state, and the track restarts with zero error. Note that the estimate is of the full, high-dimensional state of the helicopter, but only one state variable is shown here. (c) Policy class. The picture inside each circle indicates whether the node outputs the sum of its inputs, or the tanh of the sum of its inputs. Each edge with an arrow in the picture denotes a tunable parameter. The solid lines show the hovering policy class (Section 5). The dashed lines show the extra weights added for trajectory following (Section 6).
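As an illustration of this policy class, one output node might be computed as below. The wiring and the choice of inputs here are read loosely off the figure and are a sketch only, not the paper's exact graph:

```python
import numpy as np

def policy_a1(err_x, xdot, theta, w):
    """One output (a1, longitudinal cyclic) of a hovering policy in this class.

    Sigma nodes compute weighted sums of their inputs; a tanh node squashes
    its sum. w holds the tunable parameters (the arrows in the diagram);
    the wiring here is illustrative.
    """
    s = w[0] * err_x + w[1] * xdot + w[2] * theta  # Sigma node: weighted sum
    return w[3] * np.tanh(s) + w[4]                # tanh node feeding the output
```

Learning a controller in this class then amounts to tuning the small number of weights $w$ (and their analogues for $a_2$, $a_3$, $a_4$), as described for hovering in Section 5.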