A Reinforcement Learning Approach to Predictive Control Design: Autonomous Vehicle Applications
by
P. Travis Jardine
A thesis submitted to the Department of Electrical and Computer Engineering in conformity with the requirements for the degree of Doctor of Philosophy
Queen’s University Kingston, Ontario, Canada May 2018
Copyright © P. Travis Jardine, 2018
Abstract
This research investigates the use of learning techniques to select control parameters in the Model Predictive Control (MPC) of autonomous vehicles. The general problem of having a vehicle track a target while adhering to constraints and minimizing control effort is defined. We further expand the problem to consider a vehicle for which the underlying dynamics are not well known. A game of Finite Action-Set Learning Automata (FALA) is used to select the weighting parameters in the MPC cost function. Fast Orthogonal Search (FOS) is combined with a Kalman Filter to simultaneously identify the model while estimating the system states. Planar inequality constraints are used to avoid spherical obstacles. The performance of these techniques is assessed for applications involving ground and aerial vehicles. Simulation and experimental results demonstrate that the combined FOS-FALA architecture reduces the overall number of design parameters that must be selected. The amount of reduction depends on the specific application. For the differential drive robot case considered here, the number of parameters was reduced from six to one. Furthermore, the learning strategy links the selection of these parameters to the desired performance. This is a significant improvement over the typical approach of trial and error.
Acknowledgments
It would be a dissertation unto itself were I to properly acknowledge all of those to whom I owe gratitude for this work. Completing a PhD part-time while serving in the Canadian Forces requires, above all, access to exceptionally focused and knowledgeable supervision. For this I owe Dr. Sidney Givigi, whose mentorship has had a profound impact on my intellectual and moral development over the past five years. This, combined with the perspective and insight provided by Dr. Shahram Yousefi at every step of the way, has made me the beneficiary of an unparalleled supervisory team. I also relied heavily on the guidance of Dr. Michael Korenberg on the topic of vehicle modelling, particularly at the early stages of problem formulation. To all the family, friends, teachers, roommates, coworkers, strangers, and connoisseurs of the midnight hour that have helped me formulate my ideas and find my way down this very long road - thank you. Finally, I would like to dedicate this work to the memory of the venerable Dr. Alain Beaulieu.
List of Acronyms

ANN : Artificial Neural Network
D-FOKF : Dual Fast Orthogonal Kalman Filter
FALA : Finite Action-Set Learning Automata
FOS : Fast Orthogonal Search
MIMO : Multiple-Input-Multiple-Output
MPC : Model Predictive Control
MPMP : Model Predictive Mission Planning
MSE : Mean Squared Error
PID : Proportional-Integral-Derivative
R-FOS : Recursive Fast Orthogonal Search
RHC : Receding Horizon Control
SISO : Single-Input-Single-Output
UAV : Unmanned Aerial Vehicle
Nomenclature

x : bolded lower-case is a vector
X : upper-case is a matrix
X : bolded upper-case is a vertical concatenation or stacking of vectors
X : is a concatenation of matrices as defined in the text
X : is a set
x(t) : is the value of x at time t in continuous time
x(n) : is the value of x at discrete timestep n
ẋ : is the time rate of change of x
x̄ : is the time average of x
x̂− and x̂+ : are the a priori and a posteriori estimates of x
s(x), c(x), and t(x) : are the sine, cosine, and tangent functions of x
⊕_{i=1}^{b} a_i : is the diagonal concatenation of b terms
kp, ki, and kd : are the gains of the proportional, integral, and derivative error
A, B, and C : are the state-transition, input, and output matrices
x(n) and u(n) : are the general form for state and input vectors at time n
y : is a scalar output and y (bolded) is a vector of outputs
g : is gravity
ζ : is the vector of Cartesian coordinates in the Earth frame: [x, y, z]T
ξ : is the vector of Euler angles roll, pitch, and yaw in the Earth frame: [φ, θ, ψ]T
ζb : is the vector of Cartesian coordinates in the Body frame: [xb, yb, zb]T
ξ̇b : are the rates of roll, pitch, and yaw in the Body frame: [p, q, r]T
uc : is the set of high-level commands in the roll, pitch, yaw, and vertical directions: [uc,φ, uc,θ, uc,ψ, uc,z]T
um : is the set of low-level motor inputs in the roll, pitch, yaw, and vertical directions: [um,φ, um,θ, um,ψ, um,z]T
λ : is the learning rate
κ : is the discount factor
r(k) : is the reward received at time k
A = {α1, α2, · · ·, αηr} : is a finite set of ηr possible actions in a Finite Action-set Learning Automata scheme
τ : is an algorithm that updates action probabilities
p(k) = [p1(k), p2(k), · · ·, pηr(k)]T : is the action probability vector
k : denotes a discrete learning iteration
E[·] : denotes the expected value of a random variable
ε²(n) : denotes the mean squared error
Contents

Abstract
Acknowledgments
List of Acronyms
Nomenclature
Contents
List of Tables
List of Figures

Chapter 1: Introduction
1.1 Problem
1.2 Motivation
1.3 Objective
1.4 Contributions
1.5 Publications arising from this thesis
1.6 Organization of Thesis

Chapter 2: Vehicle Control
2.1 Traditional Control Techniques
2.1.1 Selecting PID Control Parameters
2.1.2 Limitations of traditional techniques
2.2 MPC Basics
2.2.1 MPC History
2.2.2 Benefits of MPC
2.2.3 Computational Considerations
2.2.4 Stability and Feasibility Considerations
2.2.5 Obstacle Considerations
2.3 MPC Formulation
2.3.1 MPC design parameters
2.4 Modelling Techniques

Chapter 3: Learning Techniques
3.1 Supervised Learning
3.1.1 Artificial Neural Networks
3.1.2 Support Vector Machines
3.1.3 Limitations of Supervised Learning
3.2 Unsupervised Learning
3.3 Reinforcement Learning
3.3.1 Markov Decision Process
3.3.2 Q-learning
3.3.3 SARSA-learning
3.3.4 Exploration versus Exploitation
3.3.5 Learning Automata

Chapter 4: Vehicle Modelling using Fast Orthogonal Search
4.1 Adaptive Modelling via R-FOS
4.1.1 Structure of Approximate Linear Model
4.1.2 Recursive Fast Orthogonal Search
4.1.3 Link between State Space and R-FOS
4.2 Dual Fast Orthogonal Kalman Filter
4.2.1 Dual Kalman Filter
4.2.2 Dual Fast Orthogonal Kalman Filter
4.3 Experimental Setup
4.3.1 Basis for Model of Altitude Dynamics
4.3.2 Methods
4.4 Results
4.4.1 Experiment #1 - Offline modelling and analysis
4.4.2 Experiments #2 and #3 - Online versus offline models
4.4.3 Experiments #4 and #5 - Partial Modelling and Forgetting Factor
4.4.4 Experiment #6 - State Estimation
4.5 Discussion

Chapter 5: Learning Automata
5.1 General description of FALA
5.2 FALA for MPC
5.3 Training Process
5.3.1 Convergence
5.4 Feedback Linearization
5.5 Methods
5.6 Learning Environments
5.6.1 Simulated Learning Environment
5.6.2 Experimental Environment
5.7 Results
5.7.1 Rate Comparison
5.7.2 Learning Progress
5.7.3 Statistical Analysis
5.7.4 Experimental Data

Chapter 6: Integrated FOS FALA MPC
6.1 Translational and Rotational Dynamics
6.2 Decoupled Bare-frame Dynamics
6.3 Planar Inequality Constraints
6.4 Experimental Setup
6.5 Results
6.5.1 Experiment 1 - Altitude Control Learning
6.5.2 Experiment 2 - Lat/Long Control Learning
6.5.3 Experiment 3 - Trajectory Tracking with Fixed Obstacles
6.5.4 Experiment 4 - Trajectory Tracking with Moving Obstacles
6.6 Summary

Chapter 7: Conclusions
7.1 Modelling using R-FOS
7.2 Control design using FALA
7.3 Obstacle Avoidance
7.4 Future Work

Bibliography
List of Tables

4.1 FOS Experiment #1 Model Data
4.2 Comparison of Accumulated Error under Identical Conditions
5.1 Comparison of slow and fast learning rates
6.1 Comparison of slow and fast learning rates
List of Figures

1.1 Problem overview
1.2 Domains of this Research
2.1 PID control block diagram
2.2 Illustration of PID position error
2.3 Illustration of PID integral error
2.4 Illustration of PID derivative error
2.5 Illustration of MPC prediction horizon
3.1 Illustration of Artificial Neural Network
4.1 Architecture for unmanned aerial system with unknown dynamics
4.2 Dual Fast Orthogonal Kalman Filter
4.3 Illustration of Experimental Setup
4.4 Illustration of reference altitudes
4.5 Comparison of response using various modelling techniques
4.6 Accumulated error using various modelling techniques
4.7 Accumulated error compared to online R-FOS model with forgetting factor
4.8 Performance D-FOKF
5.1 FALA training process
5.2 Simulated Environment used for Learning
5.3 TurtleBot 2 robot used in the experiments
5.4 Experimental system with camera system and ground stations
5.5 Illustration of probability of chosen parameter during one experiment
5.6 Comparison of fast and slow learning rates
5.7 Comparison of performance before learning
5.8 Comparison of performance part way through learning
5.9 Comparison of performance after learning
5.10 Boxplot of 162 experiments
5.11 Comparison of performance of controllers in experiments
5.12 Commands generated by the learned controller
6.1 Quadcopter mission with obstacles
6.2 Quadcopter reference frames
6.3 Illustration of Planar Inequality Constraint
6.4 Effect of Learning Rate on Convergence Time for Altitude Control
6.5 Comparison of Slow and Fast Learning Rates
6.6 Normal probability plot for slow learning rate
6.7 Normal probability plot for fast learning rate
6.8 Error during Longitudinal Control Learning
6.9 Error during Latitudinal Control Learning
6.10 Vehicle trajectory with two spherical obstacles
6.11 Vehicle trajectory with two spherical obstacles - overhead view
6.12 Moving obstacle avoidance
Chapter 1 Introduction
By the late nineteenth century, the Industrial Revolution had generated widespread improvements in the speed, capacity, and interconnectedness of global transportation systems [1]. A new network of roads, waterways, and railways across Europe and North America helped transform small, agricultural communities into modern, thriving cities [2]. The emergence of this network and the technological advances that accompanied it have been called a transportation revolution. The unprecedented freedom of movement afforded by the personal automobile could be considered a second revolution in the global transportation network. The automobile allowed people to live further away from their place of work. As the technology matured throughout the early twentieth century, vehicles became more affordable and communities continued to spread out. As a result, the cities of post-World War Two America were shaped to accommodate the ubiquitous automobile; large, sprawling highways now dominate much of the contemporary landscape [3]. The emergence of intelligent, autonomous vehicles represents yet another major shift in transportation systems [4]. The impact of these new technologies reaches far beyond the automobile market; they have the potential to disrupt the entire industry
and fundamentally transform the global economy. Early evidence of this impact has already become apparent with the controversy surrounding information technology-driven transportation companies like Uber. At the March 31, 2016 unveiling of the Model 3 sedan, the Chief Executive Officer (CEO) of Tesla Motors indicated their new model would be equipped with an auto-pilot system [5]. Motivated by promises of higher speeds and improved reliability, the more traditional vehicle manufacturers such as Nissan and Volvo predict that fully autonomous cars will be driving on North American highways by 2020 [6]. Some predict autonomous capabilities will dramatically reduce the number of vehicles required, impact jobs, and alter the very nature of vehicle ownership. The growth in autonomous vehicles is not constrained to automobiles. Many countries are investing considerable resources into military Unmanned Aerial Vehicle (UAV) research. Remotely-piloted military UAV operations currently involve a significant amount of human control, particularly for high-level decision making [7]. For example, UAVs deployed in combat operations in Afghanistan were routinely controlled by operators stationed at Creech Air Force Base, Nevada. In a rapidly changing environment, these decisions are subject to latency effects. This has motivated research into autonomous guidance and control strategies that guarantee performance in the presence of communications delays [8]. A fully autonomous UAV could operate without a human in the loop. A number of prominent experts from the military, intelligence, legal, and academic communities published a report in 2014 outlining a way-forward for military UAVs [9]. Their report predicts a likely increase in the use of unmanned systems for weapons delivery and the development of autonomous UAV capabilities. This desire for increased autonomy
has led to the investigation of optimal control techniques for military UAV tactics [10]. Commercially, companies like Airbus have started developing autonomous flying vehicles for personal transportation services [11]. There are many technical challenges that must be overcome before autonomous vehicles are ubiquitous in public life. This research is concerned with the challenge of decision-making, specifically the guidance and control of autonomous vehicles.
1.1 Problem
As illustrated in Fig. 1.1, we consider the general problem of having a vehicle track a target while adhering to constraints on the position, orientation, velocity, and inputs. We wish to optimize the tracking performance while also minimizing the control effort. We further expand the problem to consider the case when the underlying dynamics are not well known. This is a realistic assumption when the guidance and control are applied to a vehicle equipped with commercial flight controllers or when the flight conditions change throughout the mission.
Figure 1.1: Problem overview

Since the system makes use of multiple levels of control (high-level and low-level), we describe the system as using hierarchical control. Other terms used to describe this type of control strategy include cascaded; nested; and inner or outer control loops. In general, we use the term inputs to describe lower-level control policies and
commands to describe higher-level control policies.
Note: In the context of a vehicle tracking problem, it is helpful to draw a distinction between guidance and control. Guidance is related to the definition of a desired vehicle trajectory. Control is related to the actuation of components on the actual vehicle in order to maintain a desired trajectory. As will be formulated in Chapter 2 and demonstrated in the results that follow, MPC serves as both a guidance and control system by linking the optimal control policy with the predicted trajectory.
1.2 Motivation
This research is motivated by the desire to control the vehicle described in Fig. 1.1. Specifically, we would like to obtain accurate plant models and select control parameters that will provide optimal tracking performance while adhering to constraints. As will be demonstrated by the applications examined in Chapters 4, 5, and 6, the problem defined in Fig. 1.1 is suitable for a wide range of ground and aerial vehicle scenarios. The simulations and experiments are based on scenarios that might be encountered in real-world applications. For example, the figure-8 and circular trajectories considered in Chapter 5 are applicable to precision agriculture, surveillance, and reconnaissance applications. The techniques described in this research could be used to provide the trajectory planning and control required for autonomous vehicles employed in these domains. By minimizing the control effort, we make efficient use of energy resources such as battery life. This could potentially result in cost savings. Similarly, the obstacle avoidance scenario developed in Chapter 6 could help avoid collisions with other vehicles in a drone delivery service. We consider the plant to be a vehicle for which we understand the basic relationship between the evolution of the states and inputs, but for which we do not have
precise values such as mass, moments, aerodynamic effects, and any low-level control parameters. The transient behaviour of any control system is affected by the values of specific design parameters. For the reasons that will be described in Chapter 2, we have chosen to focus our research on Model Predictive Control (MPC). The design parameters in an MPC framework take the form of cost function weights. Selecting these parameters based on a desired transient response is nontrivial and typically accomplished through trial and error. This is due to the fact that the precise, underlying dynamics are difficult to relate to the performance derived from the constrained optimization required in MPC. This makes such a problem ideal for the application of learning techniques. Furthermore, in addition to the weighting parameters, MPC requires an accurate plant model (typically linear). Since we are considering the case when the basic structure of the dynamics is understood but not precisely defined, we investigate novel methods to produce these models.
1.3 Objective
Considering the problem and motivation described above, our objective is to investigate the use of learning techniques to reduce the number of design parameters in a predictive control architecture. The applications considered are the guidance and control of autonomous ground and air vehicles. We also explore novel applications of system identification techniques for modelling the vehicle dynamics. As illustrated in Fig. 1.2, this involves the intersection of three domains:
Figure 1.2: Domains of this Research

The objective of this research is met through the investigation of the following research goals:
• Develop a technique for producing accurate, linear models and state estimates for use in the control of ground and aerial vehicles. Specifically, we propose using Fast Orthogonal Search (FOS) to determine the matrix elements in a linear, state space model from the observed response over time.
• Develop a technique for producing a cost function which encourages mission accomplishment and minimizes control effort. Specifically, we propose using Finite Action-set Learning Automata (FALA) to select weights that minimize tracking error.
• Develop an MPC-based guidance and control architecture to produce stable, optimal control policies in real-time that adhere to constraints and avoid obstacles. Specifically, we propose combining FOS and FALA with time-varying
planar inequality constraints for collision-free target tracking.
• Demonstrate the utility of the proposed techniques for application on both ground and air vehicles.
1.4 Contributions
The contributions of this research are listed below. The combined effect of these contributions is an overall reduction in the number of parameters that must be selected by the designer when implementing an MPC-based guidance and control architecture:
• The primary contribution of this research is a novel application of FALA to select the cost function weights in an MPC framework in order to obtain a desired time response. This contribution links MPC design with desired performance using reinforcement learning, the effect of which reduces the overall number of design parameters that must be selected. Prior to this contribution, the typical way to select these parameters had been trial and error.
• The development of the Dual Fast Orthogonal Kalman Filter (D-FOKF) for simultaneous modelling and state estimation of a vehicle with unknown dynamics. The D-FOKF provides capability similar to that of the Dual KF, with the added benefit of producing a time-varying estimate of the noise covariance, which would normally require additional knowledge of the system. This contribution represents the first use of FOS in the context of MPC for autonomous vehicles.
• A novel approach for achieving obstacle avoidance in a convex, MPC control architecture using time-varying planar inequality constraints.
• The formulation and validation of a systems-level architecture for designing and testing MPC-based guidance and control techniques for vehicles with unknown dynamics.
1.5 Publications arising from this thesis
The following is a list of papers published in direct support of this PhD thesis research and for which the candidate is primary author:
P. T. Jardine, M. Kogan, S. Givigi, S. Yousefi. Adaptive Predictive Control of a Differential Drive Robot Tuned with Reinforcement Learning, International Journal of Adaptive Control and Signal Processing, 2018.
P. T. Jardine, S. Givigi, S. Yousefi. Adaptive MPC Using a Dual Fast Orthogonal Kalman Filter: Application to Quadcopter Altitude Control, IEEE Systems Journal, 2017.
P. T. Jardine, S. Givigi, S. Yousefi. Parameter Tuning for Prediction-based Quadcopter Trajectory Planning using Learning Automata, IFAC-PapersOnLine, Volume 50, Issue 1, 2017, Pages 2341-2346.
P. T. Jardine, S. Givigi, S. Yousefi. Planar Inequality Constraints for Stable, Collision-free Model Predictive Control of a Quadcopter, IFAC-PapersOnLine, Volume 50, Issue 1, 2017, Pages 9095-9100.
P. T. Jardine, S. N. Givigi, S. Yousefi and M. J. Korenberg. Adaptive State-space Model Approximation for Quadcopter using Fast Orthogonal Search, 2017 IEEE 30th Canadian Conference on Electrical and Computer Engineering (CCECE), Windsor, ON, 2017, Pages 1-6.
P. T. Jardine, S. N. Givigi, S. Yousefi and M. J. Korenberg. Recursive Fast Orthogonal Search for Real-time Adaptive Modelling of a Quadcopter, 2017 International Conference on Unmanned Aircraft Systems (ICUAS), Miami, FL, USA, 2017, Pages 791-796.
P. T. Jardine, S. N. Givigi and S. Yousefi, Experimental Results for autonomous model-predictive trajectory planning tuned with machine learning, 2017 Annual IEEE International Systems Conference (SysCon), Montreal, QC, 2017, Pages 1-7.
1.6 Organization of Thesis
The remainder of this thesis is organized as follows:
• Chapter 2 presents background information related to vehicle control, including a literature review focusing on the application of MPC to vehicle guidance and control.
• Chapter 3 provides a brief overview of learning techniques relevant to this research.
• Chapter 4 describes how FOS can be used in conjunction with MPC. Experimental results are presented for an application to quadcopter altitude control.
• Chapter 5 describes how FALA can be used in conjunction with MPC. Experimental results are presented for an application to a differential drive ground robot.
• Chapter 6 describes an integrated FOS-FALA MPC architecture with time-varying planar inequality constraints for obstacle avoidance. Experimental results are presented for application to a quadcopter trajectory planning mission.
• Chapter 7 concludes and outlines future work.
Chapter 2 Vehicle Control
This chapter presents background information related to vehicle control. It begins with a review of traditional control techniques, with a focus on Proportional Integral Derivative control design. This is followed by a detailed description of MPC, including a general formulation that will be useful for the specific applications in the chapters that follow. Finally, the chapter concludes with a brief discussion of modelling techniques relevant to the application of MPC.
2.1 Traditional Control Techniques
Proportional-plus-Integral-plus-Derivative or simply, Proportional Integral Derivative (PID) control is by far the most common feedback control strategy in use today. It has been nearly universally adopted as the standard throughout industry [12]. This is due to the simplicity and robustness of PID controllers in a wide range of industrial, electrical, mechanical, and biological applications. As the name suggests, the technique is composed of three error components: proportional, integral, and derivative. The interaction between these components is shown in Fig. 2.1, where gains kp, ki, and kd are design parameters.

Figure 2.1: PID control block diagram

Assuming the goal is for the plant output (yplant(t)) to track a reference signal (yref(t)), we compute the control input upid(t) at time t as follows:

$$u_{pid}(t) = k_p\, e(t) + k_i \int_0^{t} e(\tau)\, d\tau + k_d\, \frac{d e(t)}{d t} \qquad (2.1)$$
where the error e(t) = yplant(t) − yref(t). As illustrated in Fig. 2.2, the proportional component is related to the error at time t.
Figure 2.2: Illustration of PID position error
The effect of applying the proportional component is to provide a corrective control input that is related to the size of the measured error. As illustrated in Fig. 2.3, the integral component is related to the error accumulated up to time t.
Figure 2.3: Illustration of PID integral error
The primary goal of the integral component is to remove steady-state error. As illustrated in Fig. 2.4, the derivative component is related to the rate of change of the error around time t.
Figure 2.4: Illustration of PID derivative error
Since the rate of change of error is being used to offset the contribution of the proportional component, this has the effect of reducing excessive overshoot. Together, the three PID components provide robust performance for a system governed by linear (or approximately linear) dynamics. In order to achieve a desired performance, we must relate the design parameters (gains kp, ki, and kd) to the system dynamics.
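For illustration only, a minimal discrete-time sketch of the control law in (2.1) is given below, with the integral approximated by a running sum and the derivative by a backward difference. The gains and sample time are arbitrary placeholders rather than values used in this research, and the error is taken here as reference minus output (the usual negative-feedback convention).

```python
# Minimal discrete-time PID sketch of (2.1); gains and sample time are
# illustrative placeholders, not values used in this research.
class PID:
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0      # running sum approximating the integral term
        self.prev_error = 0.0    # previous error, used for the derivative term

    def update(self, y_ref, y_plant):
        error = y_ref - y_plant                     # reference minus output
        self.integral += error * self.dt            # approximate integral of the error
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# One control step with placeholder gains and a 50 Hz sample time
controller = PID(kp=1.0, ki=0.1, kd=0.05, dt=0.02)
u = controller.update(y_ref=1.0, y_plant=0.8)
```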
2.1.1 Selecting PID Control Parameters
There are many techniques for selecting PID control parameters. The most common approach is to represent the plant as a transfer function, either through first principles or system identification, and select gains that give desired performance. Here we describe another common technique called Ziegler-Nichols tuning. Ziegler-Nichols tuning is a heuristic method developed in the 1940s [13] and is still in use today. The method is as follows:
1. The values of kp, ki, and kd are set to zero;
2. kp is increased until the output begins to oscillate;
3. The values of kp at the onset of oscillation (critical gain kc) and the frequency of oscillation are recorded; and
4. Based on the values of kc and the period of oscillation, the values of kp, ki, and kd are selected using tables.
The tables used in step 4 above are derived from a combination of simulations and experiments generally representative of conditions encountered in industrial processes [14]. There are many variations and improvements on the Ziegler-Nichols method, such as the Tyreus-Luyben method, which are surveyed in [15].
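As a concrete illustration of step 4, the classic closed-loop Ziegler-Nichols rules map the critical gain kc and the oscillation period Tu to PID gains. The sketch below uses the standard textbook values for a full PID controller; the measured kc and Tu shown are placeholders.

```python
# Classic closed-loop Ziegler-Nichols rules (a sketch of step 4 above).
# kc is the critical gain and Tu the oscillation period from steps 2 and 3.
def ziegler_nichols_pid(kc, Tu):
    kp = 0.6 * kc            # proportional gain
    Ti = 0.5 * Tu            # integral time constant
    Td = 0.125 * Tu          # derivative time constant
    return kp, kp / Ti, kp * Td   # (kp, ki, kd)

kp, ki, kd = ziegler_nichols_pid(kc=4.0, Tu=1.6)   # placeholder measurements
```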
2.1.2 Limitations of traditional techniques
PID control has several limitations that have motivated research into more advanced techniques. The primary limitation of PID control in the context of vehicle control is that it is a linear control technique. Most vehicle applications involve nonlinearities,
particularly when we consider the dynamics of a UAV. There are several techniques for handling nonlinearities, which will be discussed at length in Section 2.4. A common technique used for PID control applications is cascaded control. In cascaded control, multiple PID controllers are used to track intermediate references, such as velocities and angular rates. This has been demonstrated to be very effective at minimizing the effects of plant nonlinearities when the architecture is properly designed [16]. In fact, this is the underlying low-level control technique used for the application investigated in Chapter 6. PID control is also limited by the fact that it does not satisfy any criterion of optimality [17]. This is particularly true for systems subjected to constraints, which are typically managed by saturating the inputs. While it is possible to change the control parameters for different operating conditions (commonly referred to as gain scheduling), this technique normally involves compromises on performance by selecting certain operating points [18]. Finally, PID control is applicable mainly to Single-Input-Single-Output (SISO) systems. In some cases, PID has been applied to Multiple-Input-Multiple-Output (MIMO) systems. These typically take the form of decoupling the system dynamics and decentralizing the control architecture in various control loops [19]. However, there appear to be no effective techniques for defining gains to achieve desired overall performance. This is due mainly to the difficulty in modelling the interactions between the control loops. For a detailed survey of PID control in the context of MIMO systems, see [20].
2.2 MPC Basics
In Section 2.1.2 we described the limitations of traditional control techniques. In short, traditional techniques assume linear plant dynamics, do not incorporate criteria for optimality in the presence of constraints, and are applicable mainly to SISO systems. In this section we present a technique which overcomes many of these limitations and represents one of the most popular and exciting topics of current control research: MPC.
2.2.1 MPC History
The history of MPC began in the chemical processing industry and, in fact, much of the seminal theoretical development can be found in journals from this application domain. Throughout the 1980s, a number of predictive model-based control techniques emerged that relied upon empirical and theoretical process models. For a detailed summary of these early developments, refer to [21]. Some common examples of MPC-like methods include:
• Identification and Command Method (IDCOM) [22]
• Dynamic Matrix Control (DMC) [23]
• Model Algorithmic Control (MAC)
• Inferential Control (IC)
• Internal Model Control (IMC) [24]
The industrial success of these techniques inspired more significant academic attention by the mid 1990s. Subsequent research has focused on expanding the applications of MPC while formally exploring the characteristics of stability [25] and robustness [26]. This research has converged into a comprehensive framework now nearly universally recognized as MPC [27]. Though less common, the term Receding Horizon Control (RHC) is also used to emphasize the fact that MPC is typically implemented using successive optimizations [28].
2.2.2 Benefits of MPC
MPC treats control as an optimization problem. An MPC-derived policy is a sequence of control actions over a finite prediction horizon. In order to analyze the evolution of states over this prediction horizon, MPC relies on a model of the plant dynamics. In this sense, since it links the control actions to the states over a prediction horizon, MPC can also be considered a path planner. For this reason, it is sometimes referred to as Model Predictive Motion Planning (MPMP)[29]. Constraints on the system dynamics are enforced as constraints on the optimization, which typically takes the form of the minimization of some cost function. Given certain assumptions, constraint satisfaction can be guaranteed in the presence of uncertain dynamics and disturbances [30]. While traditional techniques can be informed by the vehicle dynamics (for example, in gain scheduling), only MPC considers the system constraints at each step in the optimization [31]. It is through this synthesis of constrained planning and control that MPC is able to achieve a truly optimal solution.
2.2.3 Computational Considerations
The primary drawback of MPC is the high computation required for the optimization. This is why, as described above, early MPC developments were considered in the context of industrial and chemical process control. The relatively slow dynamics of such systems meant real-time constraints were not as much of a concern. Recent advances in computing have allowed MPC to be applied to systems with more demanding real-time constraints [32]. Some examples include the optimization of fuel economy for hybrid electric vehicles [33], directional stability of self-driving cars [34] and autonomous traffic management [35]. In addition to faster computing technology, tractable solutions are also achieved through improved computational methods or by structuring the problem such that it can be solved efficiently [36],[37]. By far the most common approach is to formulate a convex optimization, which normally requires a linear dynamic model for the vehicle [38], [39]. For this reason, we focus our research on convex MPC formulations.
2.2.4 Stability and Feasibility Considerations
Stability in MPC is typically guaranteed by enforcing a terminal constraint. Specifically, we constrain the terminal state such that it must fall into some region in the neighbourhood of the target [40]. More robust guarantees involve redefining the control in terms of perturbations on a stabilizing feedback law and through a maximization of bounded uncertain disturbances. Stability and feasibility are covered at great length in [27], [21], and [41]. A more recent development in managing uncertainty is tube-based MPC, which is described in [42]. While significant attention has been paid to formal analysis of robustness, in practice MPC is inherently robust
when implemented as a receding horizon. This receding horizon approach is well-suited for dealing with unexpected changes in the environment, noise, and modelling errors. For this reason, we have chosen to implement MPC as a receding horizon. An alternative approach would be to rely on the sequence of control inputs computed at initialization.
2.2.5 Obstacle Considerations
Collision and obstacle avoidance can be incorporated into the MPC framework in a number of ways. One approach is to discourage collisions as a penalty in the cost function [43]. Other techniques redefine the problem in terms of a safety region around obstacles [44]. Finally, obstacle avoidance can be enforced through constraints on the states [45]. In order to preserve an optimization that can be executed in real-time, the challenge is in defining obstacles such that the search space is convex [46]. A convex space is one in which, for any two points in the space, all points along the line drawn between those two points are also in the space. A popular approach for defining such a space is to use linear inequality constraints [38]. In [47], the authors construct linear inequality constraints tangent to ellipsoidal obstacles in the environment. Their work assumes the vehicles travel in a 2D plane. Additional work is required to extend this to the 3D environment required for most UAV applications.
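To make the idea of a linear (planar) inequality constraint concrete, the sketch below computes a half-space that keeps a point outside a spherical obstacle by placing a plane tangent to the sphere on the side facing the vehicle. This is only an illustration of the general idea; the specific time-varying constraint construction used in this research is developed in Chapter 6.

```python
import numpy as np

# Sketch: build a planar inequality constraint a^T x <= b that keeps a point x
# outside a spherical obstacle of radius r centred at c. The plane is tangent
# to the sphere on the side facing the current vehicle position p.
def tangent_half_space(p, c, r):
    n = (p - c) / np.linalg.norm(p - c)   # unit normal pointing toward the vehicle
    tangent_point = c + r * n             # point where the plane touches the sphere
    # Keep the vehicle on the far side of the plane: n^T x >= n^T tangent_point,
    # rewritten in the a^T x <= b form used for linear inequality constraints.
    a = -n
    b = -np.dot(n, tangent_point)
    return a, b

a, b = tangent_half_space(p=np.array([2.0, 1.0, 1.5]),
                          c=np.array([0.0, 0.0, 1.0]), r=0.5)
```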
2.3 MPC Formulation
Fig. 2.5 presents an illustration of MPC for a SISO system [48]. The sequence of control inputs (blue line) over the prediction horizon is related to the states (green line) by
a dynamic model. Here the control input sequence forces the states to converge to zero. The size of the prediction horizon could be based on the range of sensors or, as described in [21], driven by stability considerations. Typically, only the first control action in this sequence is executed. At the next time-step, a whole new control sequence is computed and the process repeats itself. This implementation is referred to as receding horizon and provides an inherent level of robustness. This approach also allows the system to respond to changes in the environment.
Figure 2.5: Illustration of MPC prediction horizon
This sequence of control inputs could, for example, be voltage signals sent to a series of motors on a vehicle. When the inputs are evolved through the plant model, we obtain a trajectory plan. Of course the actual system would be subjected to uncertain disturbances, which would cause the states to diverge from the predicted states. This uncertainty would be partially accounted for when implemented as a receding horizon, since a new control sequence is computed at each timestep. More rigorous methods for managing this uncertainty are described in [27].
As described in Section 2.2.3, implementation of MPC for vehicles normally requires a convex optimization. As described in [48] and [38], the most common formulation for convex MPC involves the following conditions:
• A linear dynamic model that approximates the vehicle dynamics over the prediction horizon;
• A quadratic cost function;
• Linear inequality constraints; and
• A finite prediction horizon.
Let us assume the combined dynamics of a vehicle (which presumably have some inherent nonlinearities) can be approximated by the following discrete, time-invariant, linear state-space equation:
$$x(n + 1) = A x(n) + B u(n) \qquad (2.2)$$
where A and B are the state-transition and input matrices; x(n) are the vehicle states; and u(n) are the inputs. The measurement function is defined as:
$$y(n) = C x(n) \qquad (2.3)$$
where y(n) are the outputs and C is the measurement matrix.
Note: Here we present the state, input, and output vectors in their general form. These will be more precisely defined for the specific applications examined in Chapters 4, 5, and 6.
Let us consider the system described by (2.2) and (2.3). By expanding these dynamics as an evolution over finite prediction horizon h, we obtain the stacked vector of predicted states (X) and corresponding measurements (Y ) as a function of the sequences of input (U ) and initial state (x(0)) as follows:
$$\mathbf{Y} = \mathbf{C}\left(\mathbf{A}\, x(0) + \mathbf{B}\, \mathbf{U}\right) = \mathbf{C}\, \mathbf{X} \qquad (2.4)$$
where the state-transition, input, and output matrices are propagated through the horizon as:
0 B A AB A2 B A= . ,B= . .. .. .. . Ah−1 B Ah−2 B Ah
... 0 ... 0 , C = ⊕hi=1 C .. .. . . ... B
(2.5)
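As a small numerical sketch of how the stacked matrices in (2.5) can be assembled from A, B, and C, the snippet below builds them for an arbitrary horizon; the system matrices themselves are whatever model is in use and are not specified here.

```python
import numpy as np

# Sketch: assemble the stacked prediction matrices of (2.5) for horizon h.
def prediction_matrices(A, B, C, h):
    nx, nu = B.shape
    A_stack = np.vstack([np.linalg.matrix_power(A, i + 1) for i in range(h)])
    B_stack = np.zeros((h * nx, h * nu))
    for i in range(h):            # block row i corresponds to prediction step i + 1
        for j in range(i + 1):    # block column j corresponds to input u(j)
            B_stack[i*nx:(i+1)*nx, j*nu:(j+1)*nu] = np.linalg.matrix_power(A, i - j) @ B
    C_stack = np.kron(np.eye(h), C)   # diagonal concatenation of C, h times
    return A_stack, B_stack, C_stack
```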
We also define the following target tracking cost function:
$$j(\mathbf{Y}, \mathbf{U}, y(h)) = (\mathbf{Y} - \mathbf{Y}_r)^T \mathbf{Q}\, (\mathbf{Y} - \mathbf{Y}_r) + \mathbf{U}^T \mathbf{R}\, \mathbf{U} + y(h)^T P\, y(h) \qquad (2.6)$$

where Yr is a vector of the target outputs (yr) stacked over the prediction horizon:

$$\mathbf{Y}_r = \begin{bmatrix} y_r(1) \\ y_r(2) \\ \vdots \\ y_r(h) \end{bmatrix} \qquad (2.7)$$
and Q = ⊕_{i=1}^{h} Qi and R = ⊕_{i=1}^{h} Ri such that Qi and Ri weight the importance of target tracking and control effort; P weighs the cost of the output at the end of the
prediction horizon; and y(h) is this terminal output. Assuming the objective is to track the target as closely as possible while minimizing control effort and adhering to constraints, the optimal sequence of inputs (U∗) is determined by the following convex minimization:

$$\begin{aligned} \mathbf{U}^* = \underset{\mathbf{U}}{\operatorname{argmin}}\;\; & j(\mathbf{Y}, \mathbf{U}, y(h)) \\ \text{subject to}\;\; & \mathbf{Y} = \mathbf{C}\left(\mathbf{A}\, x(0) + \mathbf{B}\, \mathbf{U}\right) \\ & M_x \mathbf{X} \leq f_x \\ & M_u \mathbf{U} \leq f_u \end{aligned} \qquad (2.8)$$

where Mx, Mu, fx, and fu are appropriately constructed linear inequality constraints applied to the states and inputs. As mentioned above, MPC is normally implemented as a receding horizon, such that only the first input from the sequence in (2.8) is actually executed. At each successive timestep, a new control sequence is computed, which allows the system to adapt to changes in the environment.
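For illustration, the minimization in (2.8) can be posed directly with a generic convex solver. The sketch below assumes the cvxpy package and the stacked matrices of (2.5), for example built as in the earlier snippet; it uses the cost of (2.6) without the terminal term and a simple symmetric input bound, and is not the implementation used in the experiments of this thesis.

```python
import numpy as np
import cvxpy as cp

# Sketch of the convex minimization (2.8). A_s, B_s, C_s are the stacked
# matrices of (2.5); Yr is the stacked reference of (2.7); Q and R are the
# block-diagonal weights of (2.6). The terminal cost P is omitted for brevity.
def solve_mpc_step(A_s, B_s, C_s, x0, Yr, Q, R, u_max, h, nu):
    U = cp.Variable(h * nu)                    # stacked input sequence over the horizon
    Y = C_s @ (A_s @ x0 + B_s @ U)             # predicted outputs, as in (2.4)
    cost = cp.quad_form(Y - Yr, Q) + cp.quad_form(U, R)
    constraints = [cp.abs(U) <= u_max]         # a simple instance of Mu U <= fu
    cp.Problem(cp.Minimize(cost), constraints).solve()
    return U.value[:nu]                        # receding horizon: apply only the first input
```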
2.3.1 MPC design parameters
From the formulation above, we see the application of MPC involves the selection of several design parameters, including h, P , Qi , and Ri . The choice of h is related to the computational complexity of the solution of (2.8). This is due to the fact that the dimensions of A , B and C grow with the value of h. As described above, the terminal cost P is linked to the stability and desired steady-state of the system. While the choice of Qi and Ri are related to the transient response of the system, their selection is nontrivial. This is particularly true for systems with complex dynamics, such as
the vehicles and hierarchical control strategies considered in Section 1.1. A novel approach to selecting these parameters is the primary contribution of this research.
2.4 Modelling Techniques
As described above, the convex MPC formulation considered in this research requires a linear model for the vehicle dynamics. As will be shown in the examples described in Chapter 4, Chapter 5, and Chapter 6, vehicles are typically governed by nonlinear dynamics. Therefore, we are required to linearize the model for incorporation into an MPC formulation. Linearization techniques can be considered in terms of two general approaches:
• First principles; and
• System identification.
First principles can be further subdivided into two categories: linearization about an equilibrium point and feedback linearization. The first involves linearizing the theoretical model about a fixed equilibrium point. When state-space representations are used, this involves the computation of partial derivatives about each state and input and construction of the Jacobian matrix [49]. For an example of linearization about an equilibrium point applied to UAVs, refer to [50]. A second option is feedback linearization. In feedback linearization, the theoretical model is separated into linear and nonlinear components [51]. Using a substitution of variables, the nonlinear components form the new basis of an exact linear model. The control policy is developed using this new model, with the nonlinear component pushed into the control signal itself. This is accomplished by redefining the inputs such
that they encapsulate the nonlinear terms. This control signal is then decomposed into the form accepted by the plant. An example of feedback linearization is provided in Chapter 5. First-principles techniques require knowledge of the underlying theoretical model and precise values of mass, moments, friction, drag, and aerodynamic parameters. As described in Section 1.1, precise values for these parameters may not be known. When the theoretical model cannot be precisely defined, system identification techniques can be used to develop a model based on observations alone. For linear models, one of the most common system identification techniques is Least Squares (LS). Let us consider a system that can be described by a set of ηe linear equations with ηa unknown parameters:
$$y_i = \sum_{j=1}^{\eta_a} X_{i,j}\, a_j \qquad \forall\; 1 \leq i \leq \eta_e \qquad (2.9)$$
where yi is a measured output, aj is one of the unknown parameters, and Xi,j is a matrix of collected data points (which we will consider as basis terms). The subscripts i and j in Xi,j denote the element row and column, respectively. The vector of parameters a = [a1 a2 . . . aηa ]T describes the relationship between measured outputs and the basis. We further define the output vector y = [y1 y2 . . . yηe ]T . We consider the least squares fit of parameters a as that which minimizes the following cost function:
$$j_{LS} = \left\| y - Xa \right\|^2 \qquad (2.10)$$
There are many tools available for solving (2.10) efficiently for application in many
different domains. LS will provide the best fit using all of the provided basis terms. In certain cases, appropriate basis terms may not be available and it is desirable to select from a large set of candidate terms. For these types of scenarios, more advanced system identification techniques such as FOS can be used. FOS was initially developed in [52] and is described in detail in Chapter 4. FOS is a numerically efficient technique to fit candidate basis functions and coefficients with a record of past data. Furthermore, the recursive variant of FOS (R-FOS) can be updated online in the presence of noise without significant additional computational load [53],[54],[55]. FOS has been used for nonlinear system identification in a wide range of applications [56], [57], [58], [59], [60]. However, these previous works have not considered the use of FOS in conjunction with a model-based control technique such as MPC. When compared to LS, FOS has the added benefit of being able to search through a finite list of candidate basis functions in order to determine which terms contribute the most to reducing modelling error. A standard LS analysis does not incorporate any such evaluation.
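As a simple illustration of the least-squares fit in (2.10), the sketch below recovers a parameter vector a from a data matrix X and output vector y using NumPy's solver; the data here are synthetic placeholders. A FOS-style search would instead evaluate a larger set of candidate basis columns and retain only those that most reduce the modelling error, as described in Chapter 4.

```python
import numpy as np

# Sketch of the least-squares fit (2.10): find a minimizing ||y - X a||^2.
rng = np.random.default_rng(0)
a_true = np.array([2.0, -1.0, 0.5])                 # "unknown" parameters (for illustration)
X = rng.standard_normal((100, 3))                   # basis terms X_{i,j} (eta_e = 100, eta_a = 3)
y = X @ a_true + 0.01 * rng.standard_normal(100)    # noisy measured outputs

a_fit, residual, rank, _ = np.linalg.lstsq(X, y, rcond=None)
```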
Chapter 3 Learning Techniques
This chapter provides a brief overview of learning techniques relevant to this research. It begins with an explanation of supervised and unsupervised learning, followed by a description of reinforcement learning techniques. This provides the necessary foundation for a more detailed discussion of the FALA technique investigated in Chapter 5.
3.1 Supervised Learning
Supervised learning is a category of machine learning used to determine the relationship between the inputs and outputs of a system. The term supervised comes from the fact that the training data is labeled using previous knowledge about the nature of the data. For example, if the goal is to recognize images of cats, the training data would be populated with images of cats and images not containing cats. Each image containing a cat would be labelled as such. The inputs consist of certain features, such as size, shape, and texture. The relationship between these features and the presence of a cat in the image is determined by comparing the images to the labeled dataset. Two common examples of supervised learning include Artificial Neural Networks (ANNs) and Support Vector Machines (SVMs) [61].
3.1.1 Artificial Neural Networks
ANNs were inspired by the natural learning process used in animal brains. The learning is based on a collection of interconnected nodes called neurons. These neurons represent functions on the input and output data. The information about the system is contained in the connections between the neurons, called synapses. During a supervised learning process, the weights of the synapses are modified to reflect stronger or weaker synaptic connections between the neurons. Neurons are typically organized into layers, which perform different kinds of functions. By understanding the connections between the neurons, we gain insight into the nature of the system [61].
Figure 3.1: Illustration of Artificial Neural Network

Figure 3.1 illustrates a simple ANN with three layers. The synapses of the middle layer are not visible as outputs of the overall network. For this reason, we refer
to layers between the input and output layers as hidden layers. Depending on the application, there could be multiple hidden layers. Also labelled on Figure 3.1 are examples of weights of the synapses. These weights are named according to their respective neurons. For example, w41(2) represents the weight of a synapse from the second (2) layer connecting the fourth (4) neuron in the hidden layer to the first (1) neuron in the output layer. The values of these weights typically take the form of real numbers, which are adjusted to represent the strength of the connections between neurons as observed in repeated examples.
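As a small illustration of this notation, the sketch below performs a forward pass through a three-layer network of the kind shown in Figure 3.1. The layer sizes and the sigmoid activation are arbitrary choices for the example; in the weight matrix W2 of the second layer, the entry W2[0, 3] plays the role of w41(2) described above.

```python
import numpy as np

# Sketch of a forward pass through a three-layer network as in Figure 3.1.
# W1 connects the input layer to the hidden layer; W2 connects the hidden
# layer to the output layer. Sizes and the sigmoid activation are arbitrary.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
W1 = rng.standard_normal((4, 3))   # 3 inputs -> 4 hidden neurons
W2 = rng.standard_normal((1, 4))   # 4 hidden neurons -> 1 output neuron

x = np.array([0.2, -0.5, 1.0])     # example input features
hidden = sigmoid(W1 @ x)           # hidden-layer activations
output = sigmoid(W2 @ hidden)      # network output

# W2[0, 3] corresponds to w41(2): the second-layer synapse from the fourth
# hidden neuron to the first output neuron.
```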
3.1.2 Support Vector Machines
SVMs are similar to ANNs in that they use a supervised learning process. In an SVM model, the training data is represented by points in space separated by a hyperplane. Using a supervised learning process, an SVM constructs this hyperplane to separate the data into different categories. The goal of the SVM learning process is to maximize the margin between these categories. New data points are then mapped into these categories based on a prediction of which side of the hyperplane they fall on [62].
3.1.3 Limitations of Supervised Learning
The primary limitation of supervised learning schemes is the requirement to label the data. This makes it well suited to applications such as image recognition, when we are able to obtain examples of the desired behaviour. However, supervised learning techniques may not be well suited for applications in which we have no previous knowledge of the desired outputs. In the context of the problem discussed in Section 1.1, a supervised learning approach would require us to have datasets from a vehicle
with ideal control parameters. As this is not the case for our problem, we investigate other methods.
3.2 Unsupervised Learning
When information about the desired outputs is not available, unsupervised learning techniques can be used. Since unsupervised learning does not make use of labelled data, it is used mainly for organizing or clustering data into groups and identifying hidden patterns. One example is self-organizing maps (SOMs), which transform inputs of arbitrary dimensions into low-dimensional maps [61]. The benefit of SOMs over traditional supervised techniques is that they can identify patterns that would not have been otherwise considered. However, as with supervised learning, unsupervised techniques are not well suited for the problem considered in this research. Our goal is to determine the best set of control parameters to achieve a desired performance, not to order a dataset. For our application, reinforcement learning techniques will be used.
3.3 Reinforcement Learning
The primary benefit of reinforcement learning techniques, and why they are of such value to solving our problem, is that we are able to encourage desired behaviour without specifying how it must be achieved [63]. Unlike supervised learning, where we must have specific examples of the desired output, we simply reward desired behaviour. The desired behaviour is reinforced through repeated interactions with the environment in the form of trials.
3.3.1 Markov Decision Process
An important mathematical framework to consider in the history of reinforcement learning is that of the Markov Decision Process (MDP). MDPs involve a sequence of decisions about actions and states [64]. At each step in the process, the MDP begins in a state from which certain actions can be taken. The decision maker chooses an action and the process responds by randomly moving to a new state. The transformation to this state produces a reward. An important assumption in any Markov process is that the transformation to the new state depends only on the current state and action being taken. This evolution is independent of all previous combinations of states and actions. The early work of [65] demonstrated that the relatively large computational burden of solving MDPs could be mitigated by Dynamic Programming (DP) [64]. In DP, complex problems are broken into smaller, more manageable subproblems. In the context of reinforcement learning, this means that all information gained from the past states and actions is stored in a parameter for which only the current value matters. This parameter takes various forms, such as a Q-table when using Q-learning or a probability distribution when using Learning Automata. MDPs could alternatively be described as discrete, stochastic optimal control problems and form the underlying theory of nearly all reinforcement learning techniques [66]. Reinforcement learning assumes an underlying MDP and learns through successive interaction with the environment. A powerful quality of reinforcement learning is the capacity to generalize, in that not all combinations of states and actions have to be visited. The learning can be accomplished in a number of ways. Here we discuss three techniques: Q-learning, SARSA-learning, and Learning Automata.
3.3.2 Q-learning
Q-learning is a very popular reinforcement learning technique [67], [68]. Q-learning consists of the following components:
• A Q-table which stores the learned quality of all possible state (s(n)) and action (α(n)) combinations at time n;
• An estimate of the maximum future value of the quality of a state and action combination;
• A reward r(n) resulting from the transformation from s(n) to s(n + 1);
• A discount factor κ that defines the importance of future rewards; and
• A learning rate λ that defines the weight of new information when updating the Q-table.
The Q-table (QQL) is updated as follows [69]:

$$Q_{QL}(s(n), \alpha(n)) = (1 - \lambda)\, Q_{QL}(s(n), \alpha(n)) + \lambda \Big( r(n) + \kappa \max_{\alpha(n)} Q_{QL}(s(n+1), \alpha(n)) \Big) \qquad (3.1)$$
where 0 ≤ λ ≤ 1. In (3.1) we see the Q-table value for a state and action combination is adjusted based on the observed reward (r(n)) and an estimate of the maximum future value. The degree to which the reward impacts the Q-table value is determined by the learning rate (λ). Since Q-learning updates the Q-table using an estimate of the maximum future value, independent of the actions actually taken, it is considered an off-policy technique. The process is summarized below:
1. The system starts at state s(n);
2. The system takes action α(n) and transitions to state s(n + 1);
3. The system receives reward r(n);
4. The maximum possible reward for any action from s(n + 1) is determined; and
5. The Q-table is updated using (3.1).
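The off-policy update (3.1) can be summarized in a short sketch. The episodic environment interface (reset/step) and all numerical settings below are assumptions made only to keep the example self-contained.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               lam=0.1, kappa=0.95, eps=0.1, seed=0):
    """Tabular Q-learning following (3.1); env.reset()/env.step(a) are assumed."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy exploration over the current Q-table
            a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
            s_next, r, done = env.step(a)
            # off-policy update: bootstrap on the maximum future value
            Q[s, a] = (1 - lam) * Q[s, a] + lam * (r + kappa * Q[s_next].max())
            s = s_next
    return Q
```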
3.3.3 SARSA-learning
State-Action-Reward-State-Action (SARSA) learning is similar to Q-learning except that the Q-table is updated based on the actions actually carried out during exploration, rather than an estimate of the maximum future value. For this reason, SARSA learning is called an on-policy learning technique. The SARSA Q-table is updated as follows [69]:

Q_{SA}(s(n), \alpha(n)) = (1 - \lambda)\, Q_{SA}(s(n), \alpha(n)) + \lambda \left( r(n) + \kappa\, Q_{SA}(s(n+1), \alpha(n+1)) \right)   (3.2)

The mechanism by which this update occurs is reflected in the name of the algorithm. The process is summarized below:
1. The system starts at state s(n);
2. The system takes action α(n) and transitions to state s(n + 1);
3. The system receives reward r(n);
4. The system takes action α(n + 1) and transitions to state s(n + 2);
5. The system receives reward r(n + 1); and
6. The Q-table for α(n) at s(n) is updated using (3.2).
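For comparison, a hedged sketch of the on-policy update (3.2) is given below; it differs from the Q-learning sketch only in that the bootstrap term uses the action actually selected at s(n + 1). The same assumed environment interface applies.

```python
import numpy as np

def sarsa(env, n_states, n_actions, episodes=500,
          lam=0.1, kappa=0.95, eps=0.1, seed=0):
    """Tabular SARSA following (3.2); env.reset()/env.step(a) are assumed."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))

    def select(s):
        return int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())

    for _ in range(episodes):
        s, done = env.reset(), False
        a = select(s)
        while not done:
            s_next, r, done = env.step(a)
            a_next = select(s_next)
            # on-policy update: bootstrap on the action actually taken next
            Q[s, a] = (1 - lam) * Q[s, a] + lam * (r + kappa * Q[s_next, a_next])
            s, a = s_next, a_next
    return Q
```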
3.3.4 Exploration versus Exploitation
The difference between Q-learning and SARSA-learning highlights an important characteristic of reinforcement learning. Since reinforcement learning relies on repeated interactions with the environment, there is a trade-off between exploiting what has been learned (and therefore following an optimal policy) and exploring new state and action combinations. In Q-learning, we see the Q-table is updated assuming an exploitation of the optimal policy will be followed (hence the maximization). In reality, the system would also be exploring new state and action combinations. SARSA-learning updates the Q-table based on the actual policy being followed and therefore reflects the true exploitation and exploration trade-off. A benefit of Q-learning is that the Q-table can be updated with data generated using any policy.
3.3.5 Learning Automata
A common theme of the reinforcement learning techniques described above is the use of a Q-table to store the quality of various state and action combinations. While the values of the Q-table provide important information about which state and action combinations should be followed, their interpretation is not obvious. An alternate approach for representing this information was introduced as Learning Automata in [70] and further developed in [71]. In Learning Automata, specifically Finite Action-set Learning Automata (FALA),
the Q-table is replaced with a probability distribution. In the context of control design problems, this probability distribution provides an intuitive representation of the value of the parameter options being considered. This intuition is provided throughout the learning process, as we can observe the probability of various parameters shrink and grow online. During exploration, it is from this probability distribution that options are randomly selected. Ultimately, the probability of one parameter converges to 1, which presents a deterministic choice of which parameter is best. Alternatively, we can influence the exploitation and exploration trade-off directly by placing a bound on the maximum value of the probability for any parameter. Finally, FALA has been investigated extensively for use in a game of multiple learned parameters. This is particularly valuable for the problem considered in this research, as there is a coupling of the various MPC design parameters in terms of performance. We provide a detailed description and application of FALA for selecting MPC design parameters in Chapter 5.
Chapter 4 Vehicle Modelling using Fast Orthogonal Search
In this chapter we present a novel application of FOS for use in autonomous vehicles. We also investigate a variant known as Recursive FOS (R-FOS), which permits adaptive online modelling. As described in Section 2.4, there are many options available for deriving a model using first principles or system identification. Here we examine the use of FOS for MPC-based vehicle guidance and control problems. While FOS is a popular system identification technique in a wide range of applications, this research represents the first investigation of FOS in the context of predictive control of autonomous vehicles. The formulation focuses on application to MPC which, as described in Section 2.3, requires an accurate, linear plant model approximating the vehicle dynamics. However, as was discussed when the problem was introduced in Section 1.1, we do not have precise values for certain parameters in the underlying dynamics. Therefore, the first principles linearization techniques described in Section 2.4 would be costly and difficult to use. We use FOS to develop an approximate linear model for the vehicle dynamics from observations of the states and inputs over time. For the experiments that follow, we apply the proposed technique to the altitude
Figure 4.1: Architecture for unmanned aerial system with unknown dynamics

control of a simulated Parrot AR.drone quadcopter like the one investigated in [72]. Fig. 4.1 illustrates the relationship between the guidance, control, and modelling components considered in this chapter. As shown in the upper-right of Fig. 4.1, we assume the low-level dynamics of the quadcopter frame, motors, and flight controller are unknown. Motor speeds are controlled by a controller modelled after the commercially available Pixhawk Flight Controller [73]. The Pixhawk uses a set of cascaded PID controllers like those described in Section 2.1. We assume we do not have access to the parameters of the low-level controller, including the values of the proportional, derivative, and integral gains. We use R-FOS to produce a time-varying, linear state-space model to approximate the full dynamics of the quadcopter frame and these low-level controllers. Let us define how the Pixhawk flight controller converts commands (u_c) at time n into the motor inputs (u_m) required to actuate a Parrot AR.drone quadcopter:
um (n) = fq (uc (n), x(n))
(4.1)
which is a calculation that requires state feedback (x). The function fq represents the combined dynamics of the flight controller and plant, for which we do not have a reliable model. The motor inputs take the form:
um = [um,f , um,φ , um,θ , um,ψ ]T
(4.2)
where um,f is the total force generated by the rotors; um,φ generates a rolling action, um,θ generates a pitching action, um,ψ generates a yawing action; and km is a constant that relates the force and torque produced by each rotor. The motor inputs are related to the force generated by the front (1), right (2), rear (3), and left (4) rotors:
fm = [f1 , f2 , f3 , f4 ]T
(4.3)
as follows:
u_m = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & -1 & 0 & 1 \\ -1 & 0 & 1 & 0 \\ -k_m & k_m & -k_m & k_m \end{bmatrix} f_m   (4.4)
The set of high-level commands has a similar form as the motor inputs:
uc = [uc,φ , uc,θ , uc,ψ , uc,z ]T
(4.5)
which corresponds to changes in roll, pitch, yaw, and altitude. For the formulations that follow in this chapter, we assume the vehicle is commanded to stay in the hover. Therefore, we assume no roll, pitch, or yaw commands:
u_{c,\phi} = 0, \quad u_{c,\theta} = 0, \quad u_{c,\psi} = 0   (4.6)
Altitude changes are commanded using uc,z . Here, we allow the Pixhawk flight controller to provide minor corrections in roll, pitch, and yaw to stabilize the vehicle while it performs the commanded altitude changes. In Chapter 6, we investigate the same vehicle in more complex manoeuvres. Therefore, we present FOS in its general form so that it can be applied to the applications later in this document. We also introduce the Dual Fast Orthogonal Kalman Filter (D-FOKF).
4.1 Adaptive Modelling via R-FOS
This section shows how the FOS algorithm developed in [52] and [54] can be used to identify time-varying state-space models for a system with unknown dynamics. We present a special, recursive variant (R-FOS) that allows for efficient model updates online. The model is constrained to a linear state-space form for incorporation into an adaptive convex MPC formulation. For a more detailed explanation refer to [55] and [59]. In subsequent experiments, we demonstrate the following features of the R-FOS technique:
• The ability to evaluate, from a list of candidate basis functions, the contribution of each term to reducing modelling error;
• The ability to implicitly search orthogonal basis functions without having to actually solve them;
• The ability to quickly solve for the basis coefficients;
• The ability to incrementally improve the basis coefficients online without significant additional computational load; and • The ability to give greater weight to new data when implemented online. Furthermore, while not considered here, R-FOS has the ability to incorporate nonlinear model terms without any significant change in structure. While R-FOS fits the model terms using least squares, a standard least squares regression analysis does not incorporate any evaluation of the contribution of each term.
4.1.1 Structure of Approximate Linear Model
Let us assume the dynamics of the vehicle described above can be approximated by a discrete linear state-space similar to the one introduced in (2.2). We slightly modify the model to incorporate the commands described in (4.1) and a time-varying model:
x(n+1) = \hat{A}(n)\, x(n) + \hat{B}(n)\, u_c(n)   (4.7)

where \hat{A}(n) and \hat{B}(n) are the state-transition and input matrices that vary with n. For now we keep the general expression for the vehicle states (x(n)), but this will be modified for specific applications of FOS later. The system and input matrices are composed of unknown elements with the following structure:
\hat{A}(n) = \begin{bmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,\eta_x} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,\eta_x} \\ \vdots & \vdots & \ddots & \vdots \\ a_{\eta_x,1} & a_{\eta_x,2} & \cdots & a_{\eta_x,\eta_x} \end{bmatrix}   (4.8)

\hat{B}(n) = \begin{bmatrix} b_{1,1} & b_{1,2} & \cdots & b_{1,\eta_u} \\ b_{2,1} & b_{2,2} & \cdots & b_{2,\eta_u} \\ \vdots & \vdots & \ddots & \vdots \\ b_{\eta_x,1} & b_{\eta_x,2} & \cdots & b_{\eta_x,\eta_u} \end{bmatrix}   (4.9)
where \eta_x and \eta_u are the number of states and inputs; a_{i,j} are elements of \hat{A}(n); and b_{i,j} are elements of \hat{B}(n). The elements of \hat{A}(n) and \hat{B}(n) are chosen such that they approximate the underlying (presumably nonlinear) dynamics as closely as possible.
4.1.2 Recursive Fast Orthogonal Search
Consider a system described by:

y(n) = \sum_{m=1}^{\eta_m} \beta_m\, p_m(n) + \epsilon(n)   (4.10)
where y(n) is the output at time n; p_m(n) is a basis function; \beta_m is a basis coefficient; \eta_m is the total number of model terms; and \epsilon(n) describes the modelling error. R-FOS selects the best candidate functions and coefficients to model the system dynamics. We define the mean squared error (\epsilon^2(n)) as:
\epsilon^2(n) = \left( y(n) - \sum_{m=1}^{\eta_m} \beta_m\, p_m(n) \right)^2   (4.11)
R-FOS implicitly searches the following orthogonal basis functions:
y(n) = \sum_{m=1}^{\eta_m} g_m\, w_m(n) + \epsilon(n)   (4.12)
where wm (n) are orthogonal basis functions derived from pm (n) using a Gram-Schmidt
process with coefficients g_m and weights \kappa(m, r) computed as follows:

g_m = \frac{f_C(m)}{f_D(m, m)}   (4.13)

\kappa(m, r) = \frac{f_D(m, r)}{f_D(m, m)}   (4.14)
where fD (m, r) and fC (m) are functions computed recursively using the time average of products of the basis functions and outputs:
f_D(m, r) = \overline{p_m(n)\, p_r(n)} - \sum_{i=1}^{r-1} \kappa(r, i)\, f_D(m, i)   (4.15)

f_C(m) = \overline{p_m(n)\, y(n)} - \sum_{r=1}^{m-1} \kappa(m, r)\, f_C(r)   (4.16)

where the overbar denotes the time average.
The coefficients for the selected candidates are computed as follows:
\beta_m = \sum_{i=m}^{\eta_m} g_i\, v_i   (4.17)

where v_i = 1 when i = m, and:

v_i = - \sum_{r=m}^{i-1} \kappa(i, r)\, v_r   (4.18)
for i = m + 1, . . . , ηm . The contribution in reduced mean squared error by adding the mth model term is determined by computing µm :
\mu_m = g_m^2\, f_D(m, m)   (4.19)
The candidate with the maximum \mu_1 is selected as the first term in the model. Additional terms are successively added until a stopping criterion is met. Examples of stopping criteria are [59]:
• the reduction in mean squared error reaches a threshold;
• the maximum number of permissible model terms is reached; or
• all available candidates have been included in the model.
Clearly, the time averages in (4.15) and (4.16) would become more costly to compute as the record of data grows with time. In order to mitigate this growth in computational load, one could use a sliding window of fixed size. Alternatively, one could incrementally modify the terms online using the following general technique [54]:
\bar{d}(n) = f_f\, \bar{d}(n-1) + \frac{d(n) - f_f\, \bar{d}(n-1)}{n}   (4.20)
where \bar{d}(n) denotes the running time average, with d(n) = p_m(n)\,p_r(n) for use in (4.15) and d(n) = p_m(n)\,y(n) for use in (4.16). The optional term f_f is a forgetting factor that gives exponentially less weight to older data points when 0 < f_f ≤ 1. This approach significantly reduces computation time when compared to repeatedly recomputing the time average over the growing dataset [54], [55].
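A minimal sketch of the incremental update (4.20) is given below; the class name and the sample data are assumptions, and the same update would be applied to every correlation term needed by (4.15) and (4.16).

```python
class RunningAverage:
    """Running time average with optional forgetting factor, per (4.20)."""

    def __init__(self, ff=1.0):
        self.ff = ff        # forgetting factor, 0 < ff <= 1
        self.value = 0.0    # current time average, d_bar(n)
        self.n = 0          # number of samples seen

    def update(self, d):
        self.n += 1
        # d_bar(n) = ff*d_bar(n-1) + (d(n) - ff*d_bar(n-1)) / n
        self.value = self.ff * self.value + (d - self.ff * self.value) / self.n
        return self.value

# Example: maintain the average of p_m(n)*y(n) used by f_C(m) in (4.16)
avg_py = RunningAverage(ff=0.9999)
for p_m, y in [(0.10, 0.20), (0.30, 0.10), (0.20, 0.40)]:   # placeholder samples
    avg_py.update(p_m * y)
```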
4.1.3 Link between State Space and R-FOS
This section links the R-FOS identification technique described in Section 4.1.2 with the state-space structure presented in Section 4.1. Recall the goal is to select
elements in (4.8) and (4.9) that best approximate the dynamics of the nonlinear system. Given full-state feedback, let us consider the outputs in (4.10) to represent each output of the system. Then we can construct a vector that represents all of the outputs of the system:
\mathbf{y}(n) = \begin{bmatrix} y_1(n) \\ y_2(n) \\ \vdots \\ y_{\eta_x}(n) \end{bmatrix}   (4.21)
We define the candidate functions as the outputs (y(n−1)) and commands (uc (n−1)) from the previous step. This links the familiar R-FOS model structure in (4.10) with the state-space structure in (2.2):
y_i(n) = \sum_{j=1}^{\eta_x} a_{i,j}\, y_j(n-1) + \sum_{k=1}^{\eta_u} b_{i,k}\, u_k(n-1)   (4.22)
where yi (n) is the ith output from (4.21). The coefficients for the basis computed in (4.17) form the matrix elements for the ith row in (4.8) and (4.9):
\beta_m = a_{i,m} \quad \text{for } 1 \le m \le \eta_x
\beta_m = b_{i,m-\eta_x} \quad \text{for } \eta_x < m \le \eta_m

where the total number of model terms is equal to the sum of the number of outputs and inputs (\eta_m = \eta_x + \eta_u). Little knowledge is required as to which states and inputs affect the outputs, since R-FOS will evaluate each term with respect to its contribution to the mean squared error using (4.19).
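The bookkeeping implied by (4.22) can be sketched as follows; the coefficient vectors and dimensions below are placeholders, and in practice the per-row coefficients would come from the R-FOS fit.

```python
import numpy as np

def assemble_state_space(beta_rows, eta_x, eta_u):
    """Fill row i of A_hat and B_hat from the FOS coefficients for output i."""
    A_hat = np.zeros((eta_x, eta_x))
    B_hat = np.zeros((eta_x, eta_u))
    for i, beta in enumerate(beta_rows):
        A_hat[i, :] = beta[:eta_x]                # beta_m = a_{i,m} for m <= eta_x
        B_hat[i, :] = beta[eta_x:eta_x + eta_u]   # beta_m = b_{i,m-eta_x} otherwise
    return A_hat, B_hat

# Usage with a 2-state, 2-input model like the one in Section 4.3 (placeholder values)
beta_rows = [np.array([1.000, 0.001, 0.000, 0.000]),
             np.array([0.000, 0.999, -0.003, 0.001])]
A_hat, B_hat = assemble_state_space(beta_rows, eta_x=2, eta_u=2)
```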
4.2 Dual Fast Orthogonal Kalman Filter
While the formulation above assumes full state feedback, this may not always be possible. The KF is a popular technique for state and parameter estimation [59]. When two separate KFs are used to simultaneously estimate the states and model parameters, the configuration is known as a Dual KF [74]. We present a novel application of R-FOS in conjunction with a KF to similarly produce state estimates from the available outputs. The Dual Fast Orthogonal Kalman Filter (D-FOKF) differs from the Dual KF in that no knowledge is required of the system except a record of past inputs and outputs and an estimate of the measurement covariance.
4.2.1 Dual Kalman Filter
We first present the Dual KF as described in [74] and [75]. Here we define \zeta_p(n) as the vector composed of the parameters in the state-space model (i.e. the matrix elements of \hat{A}(n) and \hat{B}(n)); P_p(n) as the covariance of the estimate of these parameters; and Q_p as the user-defined noise covariance for the model parameters (assumed constant). The process begins with a prediction of the model parameters and corresponding covariance:

\hat{\zeta}_p^-(n) = \hat{\zeta}_p^+(n-1)
P_p^-(n) = P_p^+(n-1) + Q_p   (4.23)

These parameters are used to construct the estimated state-transition and input matrices \hat{A}(n) and \hat{B}(n), which are then used to produce an a priori state estimate:
\hat{\zeta}_s^-(n) = \hat{A}(n)\, \hat{\zeta}_s^+(n-1) + \hat{B}(n)\, u(n)
P_s^-(n) = \hat{A}(n)\, P_s^+(n-1)\, \hat{A}^T(n) + Q_s   (4.24)
where \zeta_s(n) is the vector of states, and P_s(n) and Q_s are the covariance of the state estimate and the user-defined noise covariance of the states, respectively. The innovation \tilde{y}(n) is computed using the a priori state estimate, the output matrix C_s (assumed constant), and the measured output y(n) as follows:
\tilde{y}(n) = y(n) - C_s\, \hat{\zeta}_s^-(n)   (4.25)
A corrected a posteriori state estimate and updated covariance are produced using the innovation and the Kalman gain L(n) as follows:

L(n) = P_s^-(n)\, C_s^T \left[ C_s\, P_s^-(n)\, C_s^T + R_s \right]^{-1}
\hat{\zeta}_s^+(n) = \hat{\zeta}_s^-(n) + L(n)\, \tilde{y}(n)
P_s^+(n) = \left[ I - L(n)\, C_s \right] P_s^-(n)   (4.26)

where R_s is the user-defined noise covariance for the measurement. Similarly, an update for the parameters is produced using a separate Kalman gain for the parameter correction L_p(n); a matrix C_p describing the relationship between the states and the model parameters (assumed constant); and the user-defined noise covariance for the parameter measurement R_p:

L_p(n) = P_p^-(n)\, C_p^T \left[ C_p\, P_p^-(n)\, C_p^T + R_p \right]^{-1}
\hat{\zeta}_p^+(n) = \hat{\zeta}_p^-(n) + L_p(n)\, \tilde{y}(n)
P_p^+(n) = \left[ I - L_p(n)\, C_p \right] P_p^-(n)   (4.27)
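The state-filter half of the Dual KF, equations (4.24)-(4.26), reduces to the familiar predict/correct cycle sketched below; all matrices are assumed given (in the Dual KF they come from the parameter filter, in the D-FOKF introduced later they come from R-FOS), and the function name is an assumption.

```python
import numpy as np

def kf_state_step(zeta_s, P_s, u, y, A_hat, B_hat, C_s, Q_s, R_s):
    """One predict/correct cycle of the state filter per (4.24)-(4.26)."""
    # a priori prediction (4.24)
    zeta_pred = A_hat @ zeta_s + B_hat @ u
    P_pred = A_hat @ P_s @ A_hat.T + Q_s
    # innovation (4.25)
    y_tilde = y - C_s @ zeta_pred
    # Kalman gain and a posteriori correction (4.26)
    S = C_s @ P_pred @ C_s.T + R_s
    L = P_pred @ C_s.T @ np.linalg.inv(S)
    zeta_post = zeta_pred + L @ y_tilde
    P_post = (np.eye(len(zeta_s)) - L @ C_s) @ P_pred
    return zeta_post, P_post
```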
The advantage of the Dual KF is that it provides updated state estimates while simultaneously improving the model based on measured outputs. In order to implement the Dual KF, the user must have sufficient knowledge of the system. Of particular interest for this research is the requirement to provide the following parameters:
• The correct model basis (no ability is given to evaluate candidate basis terms);
• An initial estimate of \hat{A}(n) and \hat{B}(n); and
• An accurate approximation of the noise covariance (Q_s).

4.2.2 Dual Fast Orthogonal Kalman Filter
As shown in Fig. 4.2, the D-FOKF uses the R-FOS generated linear model to feed the KF. Rather than relying on a second KF to estimate the model parameters, the D-FOKF only requires a record of past inputs and outputs. Since R-FOS incorporates the evaluation and selection of candidate basis functions, this means that less information is required about the system than with the Dual KF. Furthermore, we use the estimate of MSE from (4.11) and (4.19) to generate an estimate of the noise covariance required by the KF. To preserve conventional KF notation, this noise covariance matrix is shown as Q_s in Fig. 4.2, which is not related to the objective function weight Q_i. As described earlier, typically an estimate of the noise covariance would have to be developed based on knowledge of the system. In summary, the D-FOKF provides the following benefits when compared to the Dual KF:
• The model terms can be selected from a set of candidate bases using only a record of past inputs and outputs;
Figure 4.2: Dual Fast Orthogonal Kalman Filter

• An initial estimate of the state-transition and input matrices is generated by R-FOS, rather than by the user; and
• There is no requirement for an approximation of the noise covariance, since R-FOS produces its own estimate of MSE.
4.3 Experimental Setup
An adaptive MPC informed by the R-FOS modelling algorithm was implemented in real time on the simulation testbed developed in [72]. The testbed uses a Simulink® model that takes into account the aerodynamic drag forces (identified experimentally for low-speed manoeuvres), coupling forces, gyroscopic effects, and dynamics of a Parrot AR.drone quadcopter. Noise was assumed to be white Gaussian. All processing was accomplished on an Apple MacBook Pro using a 2.6 GHz Intel Core i7 processor. Low-level PID controllers simulated an embedded Pixhawk flight controller. The flight controller actuated the four motors based on a desired upwards
climb command [73]. These commands were provided by the adaptive MPC controller to track a series of reference altitudes. The full setup is presented in Fig. 4.3.
Figure 4.3: Illustration of Experimental Setup
4.3.1 Basis for Model of Altitude Dynamics
Based on the work of [76], [77], and [49], we know the dynamics of the quadcopter altitude can be theoretically described in the continuous domain as:
\ddot{z}(t) = -\frac{k_d}{m}\, \dot{z}(t) + \frac{c(\phi)\, c(\theta)}{m}\, u_{m,f}(t) - g   (4.28)
where k_d is the drag constant; m is the mass; g is the acceleration due to gravity; \theta and \phi are the angles of pitch and bank; z(t), \dot{z}(t), and \ddot{z}(t) are the vertical position, velocity, and acceleration; and u_{m,f}(t) is the total force generated by the four motors at time t. While this experiment was concerned with vertical motion only, small fluctuations in bank and pitch angles were caused by noise. This noise was simulated by adding uncertain disturbances with mean 0.04 rad for banking and pitching, and 0.03 m in the vertical. The Pixhawk flight controller was responsible for making the small corrections required to compensate for this uncertainty while also responding to the climb and descent commands generated by MPC [72]. We propose the following linear state-space structure to encompass the combined dynamics of the plant (4.28) and the low-level controller (4.1):

\begin{bmatrix} z(n+1) \\ \dot{z}(n+1) \end{bmatrix} = \begin{bmatrix} a_{1,1} & a_{1,2} \\ a_{2,1} & a_{2,2} \end{bmatrix} \begin{bmatrix} z(n) \\ \dot{z}(n) \end{bmatrix} + \begin{bmatrix} b_{1,1} & b_{1,2} \\ b_{2,1} & b_{2,2} \end{bmatrix} \begin{bmatrix} u_{c,z}(n) \\ g \end{bmatrix}

where z(n) and \dot{z}(n) are the altitude and the time rate of change in altitude at time n; u_{c,z}(n) is the altitude command generated by MPC and sent to the Pixhawk flight controller; and g is the acceleration due to gravity. While gravity was assumed to be constant throughout the simulation, this term could feasibly serve as a model term reflecting other external sources of vertical acceleration, such as ground effect.
Note: We see that the structure above was informed by some knowledge of the structure of the plant model presented in (4.28) and an assumption that the effect of the underlying PID controller can be approximated as a linear function. This is in line with the motivation of this research described in Section 1.2, where we understand the basic relationship between the evolution of the states and inputs but do not have precise values such as the mass, aerodynamic effects, and parameters of the low-level controller.
In order to illustrate the use of constraints in (2.8), the inputs were constrained to values between uc,z,min = 0 and uc,z,max = 1000 by the following inequality constraints:
\begin{bmatrix} 1 \\ -1 \end{bmatrix} u_{c,z}(n) \le \begin{bmatrix} u_{c,z,max} \\ -u_{c,z,min} \end{bmatrix}   (4.29)
which was enforced at each step in the prediction horizon.
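As an illustration, the scalar bounds in (4.29) can be stacked over the control horizon into a single linear inequality of the form M u ≤ b suitable for a QP solver; the helper below is a hedged sketch using the limits of this experiment, and the function name and matrix layout are assumptions.

```python
import numpy as np

def climb_command_constraints(n_c, u_min=0.0, u_max=1000.0):
    """Stack [1; -1] u_c,z(j) <= [u_max; -u_min] for each step j of the control horizon."""
    M = np.kron(np.eye(n_c), np.array([[1.0], [-1.0]]))
    b = np.tile(np.array([u_max, -u_min]), n_c)
    return M, b

M, b = climb_command_constraints(n_c=2)   # control horizon of 2 (see Section 4.3.2)
```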
4.3.2 Methods
The modelling was accomplished using R-FOS, which relied exclusively on a record of past inputs and measurements to build the state-space model. This model was used to make predictions of the quadcopter's future evolution in the MPC optimization. Furthermore, the model was updated online such that it could adjust the model for dynamics in different flight regimes. The quadcopter motion was constrained to vertical displacement for the purpose of tracking a reference altitude. As illustrated in Fig. 4.4, these reference altitudes changed every 20 s. A quadratic objective function as described in (2.6) was selected with the following weighting matrices:

Q_i = \begin{bmatrix} 2.0 & 0 \\ 0 & 1.5 \end{bmatrix} \qquad R_i = \begin{bmatrix} 55 & 0 \\ 0 & 0 \end{bmatrix}   (4.30)
where Qi acted on the outputs (assuming full state feedback) and Ri acted on the change in input between timesteps. These parameters were selected by trial and error to produce the desired accumulated error performance described in (4.32). When expanded over the prediction horizon, this yields an objective function of
Figure 4.4: Illustration of reference altitudes

the form described in (2.6), tailored for this specific experiment:
j(\mathbf{Z}, \mathbf{U}_{c,z}) = \mathbf{Z}^T \mathbf{Q}\, \mathbf{Z} + \mathbf{U}_{c,z}^T \mathbf{R}\, \mathbf{U}_{c,z}   (4.31)
where \mathbf{Q} = \oplus_{i=1}^{h} Q_i and \mathbf{R} = \oplus_{i=1}^{h} R_i; \mathbf{Z} is the stacked set of altitude and time rate of change in altitude over the prediction horizon; and \mathbf{U}_{c,z} is the stacked set of climb commands over the prediction horizon. A prediction horizon of h = 100 was used. To save computation time, only the first 2 inputs in the horizon were computed (i.e. a control horizon of 2). The R-FOS model identification was executed at a rate of 1000 Hz while the MPC was executed
at a rate of 100 Hz. A total of five experiments were conducted to investigate the behaviour when adaptive MPC was combined with R-FOS: 1. Experiment #1: A preliminary experiment was conducted to select a model offline. This involved altitude changes (2 m, 5 m, 3 m, 6 m, 3 m, and 5 m) every 20 s over a 110 s period. 2. Experiment #2: The model from Experiment #1 was used in an MPC implementation to follow a separate set of reference altitude changes (2 m, 4 m, 6 m, 3 m, 5 m, and 4 m). 3. Experiment #3: The identical conditions from Experiment #2 were repeated except the model was updated online. All model parameters were included in the model and no forgetting factor was used. 4. Experiment #4: The identical conditions from Experiment #2 were repeated except the model was updated online. The b1,1 and b1,2 terms were not included in the model; a1,1 and a1,2 were fixed; and no forgetting factor was used. 5. Experiment #5: The identical conditions from Experiment #2 were repeated except the model was updated online. The b1,1 and b1,2 terms were not included in the model; a1,1 and a1,2 were fixed; and a forgetting factor of 0.9999 was used. This forgetting factor was selected through trial and error to give greater weight to more recent observations. 6. Experiment #6: The performance of MPC and R-FOS was investigated when used in conjunction with the D-FOKF. It was assumed that only the altitude (z(n)) could be measured directly.
The purpose of leaving out certain model terms in Experiments #4 and #5 was to investigate the effect on error when using the reduced model update. The benefit of the reduced model update is that less time would be required to compute the updated model. The b1,1 and b1,2 terms were selected based on their expected contribution to the reduction in MSE (µ) in accordance with (4.19). Furthermore, a1,1 and a1,2 were fixed due to the very low variance observed in Experiment #1. For the purpose of the analysis that follows, these models are referred to simply as partial, in that only part of the model is updated online. Accumulated error e_a(n) was computed at each timestep as follows:

e_a(n) = \sum_{i=1}^{n} \left( z(i) - z_r(i) \right)^2   (4.32)
where zr (n) is the reference altitude at time n.
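The metric in (4.32) is a straightforward running sum; a one-line sketch is included for completeness, with z and z_ref assumed to be equal-length records of the measured and reference altitudes.

```python
import numpy as np

def accumulated_error(z, z_ref):
    """Running accumulated squared tracking error, per (4.32)."""
    return np.cumsum((np.asarray(z) - np.asarray(z_ref)) ** 2)
```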
4.4 Results
The experimental results are presented in four subsections. In Experiment #1, a model was developed offline for comparison of subsequent online models. In Experiments #2 and #3, the performance of online and offline modelling was compared. In Experiments #4 and #5, the effects of updating only part of the model and including a forgetting factor were investigated. Finally, in Experiment #6, the performance of the D-FOKF for state estimation is investigated.
4.4.1 Experiment #1 - Offline modelling and analysis
Table 4.1 presents the results from Experiment #1. The mean value, standard deviation (σ), and individual contribution to reducing MSE (µ) for each matrix element
is provided. Certain items are shown in bold or with strikethrough in the original table for reasons described later in the text. The purpose of this experiment was to develop an offline model that would be used for comparison with the models computed online. To investigate the variation in the model at different flight regimes, 11000 FOS models were developed, each using an increasingly large record of data as the quadcopter changed altitudes as shown in Fig. 4.4.

Table 4.1: FOS Experiment #1 Model Data

Element    mean value    σ ×10³     mean µ
a1,1       1.0000        0.0001     13.2588
a1,2       0.0010        0.0001     8.6053 ×10⁻⁸
a2,1       0.0000        0.1231     0.0088
a2,2       0.9994        0.2203     0.0861
b1,1       0.0000        0.00003    2.6636 ×10⁻¹⁴
b1,2       0.0000        0.00002    8.0749 ×10⁻¹⁴
b2,1       -0.0025       0.06861    0.1067 ×10⁻⁶
b2,2       0.0009        0.04546    0.3231 ×10⁻⁶
The most important observation from the modelling data is the fact that the model terms do in fact vary with time. This suggests that at least some of the nonlinearities in the system are being reflected in the linear approximation. Also, from Table 4.1 we see b1,1 and b1,2 have a negligible effect on mean squared error, since their µ values are very close to zero. This was the rationale for not including these terms in subsequent experiments (hence the strikethrough). Furthermore, while
a1,1 and a1,2 are important terms, their standard deviation is very small. Therefore, these values were fixed in subsequent experiments (hence the bolded text).
4.4.2 Experiments #2 and #3 - Online versus offline models
The purpose of Experiment #2 was to demonstrate the benefits of modelling online. For Experiment #2, a simple MPC controller used the model determined in Experiment #1 to make predictions over the horizon during the entire simulation. For Experiment #3, the model was updated online (with no forgetting factor). The full model was used (i.e. a1,1 , a1,2 , b1,1 , and b1,2 terms were included and updated). The tracking results are provided in Fig. 4.5.
Figure 4.5: Comparison of response using various modelling techniques
The online model provided better tracking performance than the offline (linear time-invariant) model. The online linear approximation is more accurate because it
was developed based on data from regimes actually being experienced by the quadcopter.
4.4.3 Experiments #4 and #5 - Partial Modelling and Forgetting Factor
Experiment #4 was nearly identical to Experiment #3 except that b1,1 and b1,2 were left out of the model and a1,1 and a1,2 were not updated online. This constituted what is called the partial model. For Experiment #5, the forgetting factor was also used. Fig. 4.6 presents the accumulated error for all experiments. The error accumulated by each technique over the entire simulation is summarized in Table 4.2. It should be noted that this error is due to both the modelling and the control method.
Figure 4.6: Accumulated error using various modelling techniques
Fig. 4.7 compares the accumulated error from each experiment with that of the
Online R-FOS with forgetting factor.
Figure 4.7: Accumulated error compared to online R-FOS model with forgetting factor
Notice that the accumulated error for the full and partial R-FOS models is nearly identical. This demonstrates that the R-FOS µ parameter is an effective tool for deciding which terms to include in the model.

Table 4.2: Comparison of Accumulated Error under Identical Conditions

Modelling Type                          Accumulated Error [m²]
Offline FOS                             4600.0
Online R-FOS (full)                     3695.5
Online R-FOS (partial)                  3698.9
Online R-FOS (partial) with forget      3225.3
Adding the forgetting factor further improved the performance, since the adaptive MPC was able to make predictions based more heavily on recent experience.
4.4.4 Experiment #6 - State Estimation
In Experiment #6, it was assumed that only the altitude (z(n)) could be measured directly. The D-FOKF was used to develop an estimate of the rate of change (\dot{z}). The R-FOS model and the MSE computed in (4.11) and (4.19) were used to derive a time-varying estimate of the noise covariance. The noise covariance was given a minimum value of 0.05 but allowed to fluctuate with the R-FOS model updates.
Figure 4.8: Performance of the D-FOKF
Fig. 4.8 compares the performance of the D-FOKF with that of full state feedback (FSF). Initially, there were small deviations in the path being followed. However,
these deviations became negligible as the D-FOKF converged on the correct estimate of \dot{z}. Also notice there is a small overshoot when levelling off following a climb but not when levelling off following a descent. This illustrates how the dynamics of the quadcopter differ when climbing and descending. This difference can be attributed to the effect of gravity.
4.5 Discussion
The above experiments demonstrate that R-FOS is an effective technique for approximating the dynamics of a quadcopter during altitude changes. As described earlier, this model represents the full plant dynamics, including the low-level controller used by the motors and the closed-loop response of the higher-level flight controller. Even with knowledge of the equations of motion that describe the quadcopter, precise values for these parameters (including gains) could be difficult to obtain. Furthermore, by executing R-FOS online, we obtain an updated model when parameters change. Some examples of cases where the model might have to change significantly include fluctuations in the behaviour of the battery and when flying near walls or in ground effect. Since R-FOS provides a technique for evaluating the importance of each candidate term, computational savings can be made by ignoring less significant terms. While the models considered here are relatively small (2 × 2), these savings could be more beneficial for more complex systems. Furthermore, if the overall size of the model is reduced, this would increase the MPC prediction horizons that could be considered for real-time implementation. Finally, by combining R-FOS with a KF, we can develop estimates for the states
when limited measurements are available. This is similar to a dual KF setup, with the added benefits afforded by the ability of R-FOS to evaluate the candidate terms. Furthermore, R-FOS provides a time-varying estimate of the noise covariance, which would normally have to be selected based on additional knowledge of the system.
Chapter 5 Learning Automata
This chapter describes a novel approach to designing certain parameters in MPC using reinforcement learning. Specifically, we show how the tracking error and control effort weights (Qi and Ri) in the cost function (2.6) can be selected using Finite Action-Set Learning Automata (FALA). In Chapter 6, we apply this FALA technique in combination with the FOS modelling technique described in Chapter 4 to a quadcopter trajectory tracking mission. In order to demonstrate that the technique can be applied to other types of vehicles, the latter part of this chapter presents an experiment of MPC control on a differential drive ground robot. The ground robot's model is linearized using feedback linearization. For a similar application using linearization about an equilibrium point, the reader is directed to [50], which presents additional findings of FALA used in conjunction with MPC-based guidance of a Quanser Qball2 quadcopter. Throughout the formulations that follow, the timestep k denotes one learning trial. This may not operate at the same frequency as the MPC. Typically, it is desirable to update the learning at a much slower rate than the controller sample rate. For example, the MPC may execute at a rate of 10 Hz while the learning updates at
0.05 Hz. This allows the performance of the control parameters to be evaluated over a significant timeframe.
5.1 General description of FALA
FALA is normally described by the quadruple Γ = (A, r(k), τ, p(k)) [78], where:
• A = {α1, α2, · · ·, αηr} is a finite set of ηr possible actions. In the context of our problem, these actions represent options for the control design parameters. If there are multiple parameters that must be designed, each parameter will have its own quadruple.
• r(k) is a reinforcement function that specifies rewards after executing the action αi ∈ A selected at time k.
• τ is an algorithm that updates action probabilities to be used at time k + 1;
• p(k) = [p1(k), p2(k), · · ·, pηr(k)]T is the action probability vector, where pi(k) corresponds to the probability at which action αi is selected at time k.
The algorithm that updates action probabilities p(k) is generally of the form,
p(k + 1) = \tau(p(k), \alpha(k), r(k))   (5.1)
where p(k + 1) denotes the updated action probabilities, which depend on α(k) and r(k), and {p(k), k = 0, 1, 2, · · ·} is a random process whose evolution is governed by the learning. Therefore, a FALA strategy uses a process whereby we randomly select actions αi ∈ A in accordance with the probability vector p. Based on the observed performance
of the selected action, its probability is updated using the reinforcement function r(k). The objective is to maximize the expected value of r(k).
5.2 FALA for MPC
Note that the process described in Section 5.1 applies to selecting the best value for a single parameter from a set of ηr possible options. However, for multiple-input-multiple-output (MIMO) systems controlled using MPC, the elements in Qi and Ri are composed of several parameters. In this case we must consider a FALA game defined as follows [70].
Γ = (Γ1 , Γ2 , Γ3 , Γ4 , · · · , Γηa )
(5.2)
where Γ^i = (A^i, r^i(k), τ^i, p^i(k)), ∀i ∈ {1, 2, · · ·, ηa} is an individual FALA corresponding to one of the diagonal elements in the Qi and Ri matrices (which in this context is a control design parameter) used in (2.6). Only the diagonal elements are considered design parameters; all other elements in the matrices are set to zero. Furthermore, all values are assumed to be positive. By constraining the matrix elements in this way, we ensure Qi and Ri are symmetric with positive eigenvalues (i.e. positive definite), which is a necessary condition for the minimization in (2.6) to be convex.
5.3 Training Process
Figure 5.1 illustrates the FALA training process used here. Let us start by defining the action set for all actions taken by each FALA as \tilde{A} = A^1 × A^2 × · · · × A^{η_a}. The actions, i.e. the actual control parameters used, are:
\alpha(k) = [\alpha^1(k), \alpha^2(k), \cdots, \alpha^{\eta_a}(k)] \in \tilde{A}

Here the superscript denotes the index for each of the η_a parameters. The MPC design parameters are considered the actions α(k) ∈ \tilde{A}. Each FALA Γ^i for i ∈ {1, 2, · · ·, η_a} has a finite set of η_r^i possible values that it can assume. While not necessary, we assume each of the η_a parameters has an equal number of possible values η_r.
Figure 5.1: FALA training process

We assume each of the η_a diagonal elements in Qi and Ri is independent and therefore treat each automaton Γ^i separately. However, while each automaton is treated separately, they are evaluated based on how they interact with each other.
Effectively, we select the best balance of values for the parameters using a FALA game. Therefore, we expect several possible solutions to produce roughly similar performance characteristics for the overall system. Let us define a training procedure with successive iterations (k) during which different combinations of control parameters α(k) are selected at random according to the probability distribution pij (k), where i ∈ {1, 2, ..., ηa } and j ∈ {1, 2, ..., ηr }. Initially, all parameters are given equal probability:
p_j^i(1) = \frac{1}{\eta_r} \quad \forall\, i, j   (5.3)
where the superscript is the parameter being tuned and the subscript is the option being considered. The process is as follows:
1. The first candidate actions are randomly selected according to (5.3);
2. The selected actions are used as the diagonal elements of Qi and Ri in a full, receding horizon MPC implementation;
3. An MPC target tracking implementation as described in Section 2.3 is executed over a fixed timeframe (η_t);
4. Throughout the mission, the tracking error (e(t)) is computed using the Euclidean distance from the target;
5. The total cost for the kth iteration is computed according to the following cost function:

j_{LA}(k) = \sum_{t=1}^{\eta_t} \left( k_p\, e(t) + k_d\, \dot{e}(t) \right)   (5.4)
where k_p and k_d define the relative importance of the proportional and derivative error components discussed in Section 2.1.
6. The minimum and mean costs (j_{LA,min} and j_{LA,med}) are used to derive a reinforcement signal:

r_c(k) = \min\left( \max\left( 0,\ \frac{j_{LA,med} - j_{LA}(k)}{j_{LA}(k) - j_{LA,min}} \right),\ 1 \right)   (5.5)
We constrain the reinforcement to positive values between 0 and 1. Refer to [78] and [70] for a discussion of negative reinforcement signals. 7. We define an upper bound rb and the reward for the kth iteration as:
r(k) = rc (k)rb
(5.6)
8. We define a probability distribution vector (pi (k)), which is composed of the probabilities for each candidate action selected at iteration k. The reinforcement signal is used to produce a normalized probability update according to the following equation:
p^i(k + 1) = p^i(k) + \lambda\, r(k)\, (e_j - p^i(k))   (5.7)
where λ defines the learning rate and ej is a unit vector with jth component unity such that the index j corresponds to the option selected at k [71]. 9. Convergence is considered to have occurred when the probability of one action
reaches some minimum threshold (normally 1). If the system has not yet converged, new actions are selected at random according to the updated probability (5.7) and the training continues. Note that while the probability of each action is updated separately, the cost (5.4) reflects the performance of the overall system, which is affected by the coupling of all parameters.
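A hedged sketch of one learning iteration for a single parameter, following (5.5)-(5.7), is shown below; the cost values, bound r_b, and learning rate are placeholders, and j_LA(k) would in practice come from running the MPC over one trial.

```python
import numpy as np

def fala_update(p, j_k, j_min, j_med, selected_j, lam=0.1, r_b=1.0):
    """Update one action-probability vector from the observed trial cost."""
    if j_k <= j_min:
        r_c = 1.0                  # guard against the degenerate case j_LA(k) = j_LA,min
    else:
        # reinforcement signal (5.5), clipped to [0, 1]
        r_c = min(max(0.0, (j_med - j_k) / (j_k - j_min)), 1.0)
    r = r_c * r_b                  # reward (5.6)
    e_j = np.zeros_like(p)
    e_j[selected_j] = 1.0          # unit vector for the option chosen at iteration k
    return p + lam * r * (e_j - p) # normalized probability update (5.7)

# Example: ten equally likely options for one weighting parameter, as in (5.3)
p = np.full(10, 0.1)
p = fala_update(p, j_k=4.2, j_min=3.0, j_med=5.0, selected_j=3)
```

Note that the update preserves the sum of the probabilities, so the vector remains a valid distribution after each trial.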
5.3.1 Convergence
Since the goal of the FALA technique described above is to eventually converge to a set of parameters, it is important for us to investigate the convergence properties of the algorithm. This topic is discussed at length for various applications of FALA in [71]. Of interest to us is the common payoff game of FALA, since in our application the reward defined in (5.6) is shared by all the FALA. In this section we describe a guarantee of convergence in terms of possible maximal points for the expected value of the accumulated rewards r(k). For simplicity, let us remove the index k and define a common payoff function for all automata:
f_r(\alpha) = E[\, r \mid \alpha \in \tilde{A}\, ]   (5.8)
Recall that α ∈ \tilde{A} are the parameters used for the weights of the objective function for our MPC controller. Let us denote a possible set of maximal points:
α∗ = [α1∗ , · · · , αηa ∗ ]
(5.9)
According to Theorem 2.4 of [71], the game of FALA using the update (5.7) will
converge to one of the maximal points where:
fr (α∗ ) ≥ fr (α)
(5.10)
for all α = [α^{1*}, · · ·, α^{(i-1)*}, α^i, α^{(i+1)*}, · · ·, α^{η_a *}] ∈ \tilde{A} [71]. The above conditions imply that α* is a Nash equilibrium of the game. This result shows that the game of FALA converges to a set of parameters that minimizes the cost function (5.4). However, notice that (5.10) provides no guarantee of convergence to a single maximal point. This means that there may be more than one controller with similar performance derived using a FALA game.
5.4 Feedback Linearization
The dynamics of a differential drive robot can be approximated as a linear state space using feedback linearization [79]. This section briefly describes how this can be achieved. Consider a differential drive robot with position xx , xy in the inertial frame and heading xψ . The control commands are the forward speed uv and angular speed uω . Therefore, the kinematics of the vehicle are described as follows:
\begin{bmatrix} \dot{x}_x \\ \dot{x}_y \\ \dot{x}_\psi \end{bmatrix} = \begin{bmatrix} u_v \cos(x_\psi) \\ u_v \sin(x_\psi) \\ u_\omega \end{bmatrix}   (5.11)

where \dot{x}_x and \dot{x}_y are the time rates of change in position and \dot{x}_\psi is the time rate of change in heading. Let us define a vector:
\rho = \begin{bmatrix} x_x \\ x_y \end{bmatrix}   (5.12)
which we want to drive to a desired \rho_d such that \rho - \rho_d approaches 0 as t → ∞. The second derivative, \ddot{\rho}, is then computed as:

\ddot{\rho} = \begin{bmatrix} \cos(x_\psi) & -u_v \sin(x_\psi) \\ \sin(x_\psi) & u_v \cos(x_\psi) \end{bmatrix} \begin{bmatrix} \dot{u}_v \\ u_\omega \end{bmatrix}   (5.13)

which is invertible if we assume u_v \neq 0. We define an input vector:

u_\rho = \begin{bmatrix} u_{\rho,1} \\ u_{\rho,2} \end{bmatrix} = \ddot{\rho}   (5.14)
rearrange (5.13), and combine with (5.11) to obtain a new, extended model for the vehicle as:
\begin{bmatrix} \dot{x}_x \\ \dot{x}_y \\ \dot{x}_\psi \\ \dot{u}_v \end{bmatrix} = \begin{bmatrix} u_v \cos(x_\psi) \\ u_v \sin(x_\psi) \\ \dfrac{-u_{\rho,1}\sin(x_\psi) + u_{\rho,2}\cos(x_\psi)}{u_v} \\ u_{\rho,1}\cos(x_\psi) + u_{\rho,2}\sin(x_\psi) \end{bmatrix}   (5.15)
Given a change of variables such that \gamma_1 = x_x, \gamma_2 = x_y, \gamma_3 = \dot{x}_x, and \gamma_4 = \dot{x}_y, we express (5.15) as a linear state-space model:
\begin{bmatrix} \dot{\gamma}_1 \\ \dot{\gamma}_2 \\ \dot{\gamma}_3 \\ \dot{\gamma}_4 \end{bmatrix} = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} \gamma_1 \\ \gamma_2 \\ \gamma_3 \\ \gamma_4 \end{bmatrix} + \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} u_{\rho,1} \\ u_{\rho,2} \end{bmatrix}   (5.16)
which is an exact linearization of (5.11) valid only when the vehicle is in motion. The discrete form of (5.16) with sample time dt is as follows:
\underbrace{\begin{bmatrix} \gamma_1(n+1) \\ \gamma_2(n+1) \\ \gamma_3(n+1) \\ \gamma_4(n+1) \end{bmatrix}}_{\gamma(n+1)} = \underbrace{\begin{bmatrix} 1 & 0 & dt & 0 \\ 0 & 1 & 0 & dt \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}}_{A_{tbot}} \underbrace{\begin{bmatrix} \gamma_1(n) \\ \gamma_2(n) \\ \gamma_3(n) \\ \gamma_4(n) \end{bmatrix}}_{\gamma(n)} + \underbrace{\begin{bmatrix} \tfrac{1}{2}dt^2 & 0 \\ 0 & \tfrac{1}{2}dt^2 \\ dt & 0 \\ 0 & dt \end{bmatrix}}_{B_{tbot}} \underbrace{\begin{bmatrix} u_{\rho,1}(n) \\ u_{\rho,2}(n) \end{bmatrix}}_{u_\rho(n)}   (5.17)
where A_tbot and B_tbot are the state-transition and input matrices, and \gamma(n) and u_\rho(n) are the state and input vectors at discrete timestep n. Here we see the nonlinearities have been pushed downstream into the control signal, which allows us to consider the evolution of the states as a linear function of the previous states and the inputs. We also define the output of the system as follows:
y(n) = C_{tbot}\, \gamma(n)   (5.18)

where C_{tbot} is the output matrix and y(n) is the output vector.
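The two pieces of bookkeeping used in this section — recovering the robot commands from the MPC accelerations by inverting (5.13), and building the discrete matrices of (5.17) — are sketched below; the sample time and test values are placeholder assumptions, as are the function names.

```python
import numpy as np

def accel_to_robot_commands(u_rho, x_psi, u_v):
    """Invert (5.13): map accelerations u_rho to (u_v_dot, u_omega); valid while u_v != 0."""
    M = np.array([[np.cos(x_psi), -u_v * np.sin(x_psi)],
                  [np.sin(x_psi),  u_v * np.cos(x_psi)]])
    u_v_dot, u_omega = np.linalg.solve(M, u_rho)
    return u_v_dot, u_omega

def discrete_double_integrator(dt):
    """A_tbot and B_tbot of (5.17) for sample time dt."""
    A = np.block([[np.eye(2), dt * np.eye(2)],
                  [np.zeros((2, 2)), np.eye(2)]])
    B = np.vstack([0.5 * dt**2 * np.eye(2), dt * np.eye(2)])
    return A, B

A_tbot, B_tbot = discrete_double_integrator(dt=0.1)                      # placeholder dt
u_v_dot, u_omega = accel_to_robot_commands(np.array([0.2, -0.1]), 0.3, 0.2)
```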
5.5 Methods
The MPC technique described in (2.8) was used to drive the robot. The state-transition matrix (A_tbot) and input matrix (B_tbot) derived using feedback linearization were used. The inputs produced by MPC, u_\rho(n), were converted to the u_v and u_\omega used by the vehicle using (5.13). The robot velocity was constrained by saturation to maximum (v_max) and minimum (v_min) velocities of 0.5 m/s and 0.05 m/s, respectively. In order to enforce these constraints in the controller using the linear inequalities required in (2.8), the following time-varying linear approximations were used:
\dot{x}_{x,min}(n) = u_{v,min} \cos(x_\psi(n)) \qquad \dot{x}_{x,max}(n) = u_{v,max} \cos(x_\psi(n))
\dot{x}_{y,min}(n) = u_{v,min} \sin(x_\psi(n)) \qquad \dot{x}_{y,max}(n) = u_{v,max} \sin(x_\psi(n))   (5.19)

where \dot{x}_{x,min}(n), \dot{x}_{x,max}(n), \dot{x}_{y,min}(n), and \dot{x}_{y,max}(n) approximate the robot velocity constraints computed at time n. Since these constraints varied with time, an adaptive MPC scheme was required. The controller was executed at a rate of 10 Hz.
5.6 Learning Environments
FALA can be used to select design parameters offline, in simulation, or during the execution of the system. Typically, it is desirable to accomplish a significant portion of the training before the controller is executed on the real platform in order to avoid instability of the vehicle. For the robot considered here, the learning was accomplished in a simulated MATLAB/Simulink® environment. The design parameters were
then ported to the actual vehicle. Offline learning facilitated a statistical analysis of the learning process. The number of trials required for meaningful statistical analysis would have been impractical to accomplish on the actual vehicle given time and laboratory resource constraints. Relying on online learning data would have reduced the number of trials available for analysis. The remainder of this section is organized as follows. Section 5.6.1 presents the simulation environment and learning procedure for the differential drive robot. Section 5.6.2 describes the environment and the robot used in the validation of the learning process.
5.6.1 Simulated Learning Environment
A MATLAB/Simulink® environment was developed to accomplish the learning phase.
The robot dynamics were simulated by integrating the inputs through the extended nonlinear model given in (5.15). As shown in Figure 5.2, the simulated TurtleBot robot was required to move between a sequence of 8 waypoints separated by 1 m (direction of change indicated by green arrows). The waypoints changed every 20 s. Since the robot was constrained to a minimum velocity of 0.05 m/s, the robot did not stop moving throughout the entire learning process. When a waypoint target was reached, the robot would maintain a trajectory in a region near the target until the waypoint changed. During the learning process, new weighting parameters (Qi and Ri) were selected at random (according to the probability distribution of each action) every 20 s. The options for the weighting parameters were constrained to integer values between 1 and
Figure 5.2: Simulated Environment used for Learning

10, inclusive. This means that the game of FALA in (5.2) is composed of six FALA:
\Gamma = (\Gamma^1, \Gamma^2, \Gamma^3, \Gamma^4, \Gamma^5, \Gamma^6)

and each set of actions A^i, i ∈ {1, · · ·, 6} is

A^i = \{1, 2, 3, \cdots, 8, 9, 10\}

The probability distributions were updated as described in Equation (5.7) based on the observed cost computed using (5.4). The values of k_p and k_d in (5.4) were selected as 5 and 1, respectively. A total of 162 experiments were run, with each experiment representing 1000 separate trials. Therefore, the results that follow represent a total of 162,000 separate trials spanning 900 hours of learning. These experiments were
executed for statistical analysis. However, notice that in an actual application of the technique, only a single experiment would be necessary and this could be achieved in a matter of minutes.
5.6.2 Experimental Environment
In order to test the controller derived through the learning process, a real differential drive robot was used. The robot used, the TurtleBot 2, is shown in Figure 5.3.
Figure 5.3: TurtleBot 2 robot used in the experiments.
The experimental architecture, shown in Figure 5.4, included the following components:
• TurtleBot Mobile Station, which includes the robot and a laptop running the Robot Operating System (ROS);
• The Optitrack localization system, which measures position and velocity using optical sensors;
• Localization Ground Station, which processes the information from the Optitrack localization system; and
• The Controller Ground Station, which includes the MPC-based guidance and control system.
Figure 5.4: Experimental system with camera system and ground stations.
The Optitrack localization system uses 24 camera sensors. These sensors are able to detect the pose of the robot (xx , xy , xψ ) using the reflective markers visible on the top surface of the robot in Figure 5.3. The pose is captured by the Localization Ground Station, which transmits information over a UDP protocol to the Controller Ground Station.
The Controller Ground Station implements an MPC-based controller using the vehicle model presented in Section 5.4. The controller is programmed in MATLAB/Simulink® and generates the forward and angular speed commands (u_v, u_\omega) used by the TurtleBot Mobile Station. The connection between MATLAB/Simulink® and ROS is facilitated using the Robotics System Toolbox. The TurtleBot Mobile Station runs Ubuntu and ROS and applies the appropriate control signals to the robot actuators. As in the simulation, the frequency of the control system was 10 Hz, while the measurements through the Optitrack system operated at 100 Hz. Therefore, a Kalman filter was implemented in order to provide measurements at the required rate.
5.7 Results
The results are presented in four subsections: Section 5.7.1 compares the learning performance using different learning rates (λ); Section 5.7.2 demonstrates the improvements in performance achieved when LA is used to tune the MPC weighting parameters; Section 5.7.3 analyses the results from a statistical perspective; and Section 5.7.4 presents the results of a full implementation on a real robot.
5.7.1 Rate Comparison
Figure 5.5 illustrates the probability of the chosen parameters during one experiment with a learning rate of λ = 0.1. Here we see all six parameters converge to a probability of 1, which indicates that these parameters should be selected as the weighting parameters in (2.6). As will be discussed shortly, these parameters are not guaranteed to be globally optimal. The optimality of the solution is affected by the learning rate and
the parameter options considered during the learning.
Figure 5.5: Illustration of probability of chosen parameter during one experiment
Figure 5.6 illustrates the faster convergence time with a learning rate of 0.8 compared to 0.1 over 1000 trials. However, notice that a slower rate eventually leads to a lower cost. Recall that the cost is computed by summing the squared error over the entire trial using (5.4). The system takes more time to converge on the selected parameters because the slow learning rate gives less weight to the observed rewards at the end of each trial. This explains the extended period of large fluctuations with the slow learning rate, as more time is spent evaluating randomly selected parameters. The trade-off is lower MSE at the end of the learning process.
Figure 5.6: Comparison of fast and slow learning rates
The cost over 10 experiments was compared for the two learning rates in Figure 5.6. The results are compared in Table 5.1.

Table 5.1: Comparison of slow and fast learning rates

        λ      MSE [m²]   σ
Slow    0.1    0.2054     0.0686
Fast    0.8    0.3125     0.3235
A lower average final cost and standard deviation was observed when the slower learning rate was used over the 10 experiments. The mean final error with learning rate of 0.1 was 0.2054 with standard deviation 0.0686. The mean final error with learning rate of 0.8 was 0.3125 with standard deviation 0.3253. Furthermore, running an ANOVA test on the final errors for λ = 0.1 and λ = 0.8 demonstrates that the
two distributions are different with a significance of 0.01%. At times the faster learning rate resulted in low costs that were comparable to that of the slower rate. However, as indicated by the very large standard deviation, sometimes the final cost was much higher. This reflects the fact that the slower learning rate takes more time to consider the possible options and reduce the likelihood of prematurely converging on a set of poorly performing parameters.
5.7.2 Learning Progress
While MPC always produces a solution which adheres to the constraints, the quality of the solution is greatly impacted by the selection of the control parameters (see the analysis in Section 5.7.3). Figure 5.7, Figure 5.8 and Figure 5.9 illustrate the improvement in target tracking when a learning rate of 0.1 was used (i.e., different controllers, with different matrices Qi and Ri). Figure 5.7 shows the robot prior to any learning.
Figure 5.7: Comparison of performance before learning
Figure 5.8 shows the robot tracking approximately 50% through the learning experiment in Figure 5.5.
Figure 5.8: Comparison of performance part way through learning
Figure 5.9 shows the robot tracking after all of the weighting parameters converged.
Figure 5.9: Comparison of performance after learning

Here we see a clear improvement in tracking performance. The figure-eight patterns at the corners were caused by the fact that the robot was constrained to a minimum forward velocity of 0.05 m/s. Therefore, it was required to drive in a region around the target after it was reached, rather than just stopping at it. Finally, it is important to notice that the learning is not particular to the layout in Figure 5.2. The intention is that the learning observed in Figure 5.9 can be extended to other tracks. This will also be discussed in Section 5.7.4.
5.7.3 Statistical Analysis
Figure 5.10 presents a box plot for the learning of the weighting parameters. In the box plot, the tops and bottoms of each blue rectangular box represent the 25th and 75th percentiles, respectively. The red horizontal line represents the median. The black dashed lines represent the furthest ranges of values observed, excluding outliers. Outliers are represented as red crosses. The box plot presents the results for the state weighting parameters (x_x, x_y, \dot{x}_x, \dot{x}_y) and input weighting parameters (u_{ρ,1}, u_{ρ,2}).
Recall that (uρ,1 ,uρ,2 ) are inputs derived from feedback linearization and represent the accelerations in the x and y directions.
Figure 5.10: Boxplot of 162 experiments

The figure shows that the selection of parameters with LA is not random. The values for each parameter cluster around certain values. The medians for the Qi matrix are [8, 9, 4, 4] for [x_x, x_y, \dot{x}_x, \dot{x}_y]. For the matrix Ri, the medians are [3, 4] for [u_{ρ,1}, u_{ρ,2}]. However, a certain variability is observed for each parameter. Below is the matrix of correlation coefficients for each weighting parameter and the observed cost, rounded to two decimal places. Here we see the weighting parameters for the states x_x and \dot{x}_x and the input u_{ρ,1} have the strongest correlation with the error e.
          x_x     x_y     ẋ_x     ẋ_y     u_ρ,1   u_ρ,2   e
x_x       1       0.12    0.10    −0.05   0.08    −0.20   −0.37
x_y       0.12    1       0.01    0.03    −0.19   0.07    −0.20
ẋ_x       0.10    0.01    1       0.15    0.13    −0.02   0.25
ẋ_y       −0.05   0.03    0.15    1       −0.03   0.045   0.02
u_ρ,1     0.08    −0.19   0.13    −0.03   1       0.17    0.25
u_ρ,2     −0.20   0.07    −0.02   0.05    0.17    1       0.00
e         −0.37   −0.20   0.25    0.024   0.25    0.00    1
Since the weighting parameter for xx is negatively correlated with the error, it means that generally we can expect that larger values for the weighting parameter for xx will result in reduced error, which is consistent with the larger learned xx parameter values presented in Figure 5.10. The weights for x˙ x and uρ,1 , which correspond to the velocity and acceleration in the x-direction, are required to achieve the desired transient behaviour of the controller. This balance would normally be achieved through trial and error. However, in this case it was the FALA training that determined a region of parameters that achieves the desired performance.
5.7.4 Experimental Data
In order to validate the learning process, an experiment was executed on the actual Turtlebot. The MPC approach discussed in Section 5.6.2 was implemented in MATLAB®/Simulink.
The performance characteristics of two sets of parameters were compared. The first set of parameters was as follows:
$$Q = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \quad \text{and} \quad R = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \tag{5.20}$$
which reflects an equal weight on each parameter. This is a reasonable first guess and we expect it to produce stable performance. The second set of parameters was derived in the learning process discussed in Section 5.7.3, with the following weights:
$$Q = \begin{bmatrix} 9 & 0 & 0 & 0 \\ 0 & 8 & 0 & 0 \\ 0 & 0 & 5 & 0 \\ 0 & 0 & 0 & 4 \end{bmatrix} \quad \text{and} \quad R = \begin{bmatrix} 2 & 0 \\ 0 & 8 \end{bmatrix}. \tag{5.21}$$
The learned weights in (5.21) were selected because this was the set that produced the lowest error on average across all 162 experiments discussed in Section 5.6.1. The robot was tested on an ellipsoidal path and the results for the runs are shown in Figure 5.11. A path different from the one used for the learning was chosen in order to test the robustness of the controller to different inputs. For the designed (unlearned) controller, the robot was started close to position (−2, 0, π/2); for the learned controller, the robot was started close to (0, −0.8, 0). As can be seen in Figure 5.11(a), the robot is not able to closely follow the path. The reason is that all weights are the same and no preference is given to any of the
states or commands. With the learned parameters, the robot was able to follow the path more closely due to the fact that the gains better represent the tracking task (Figure 5.11(b)). The commands generated for the path of Figure 5.11(b) are shown in Figure 5.12.

Figure 5.11: Comparison of performance of controllers in experiments: (a) unlearned parameters; (b) learned parameters
It is important to notice that the speed command respects the constraints discussed in Section 5.6.1. Furthermore, the angular speed increases when the curvature of the path increases.
Figure 5.12: Commands generated by the learned controller

Overall, the experiments confirm that the controller derived through the LA approach is able to adapt to the tracking task. It is also better suited to the task, as the tracking error with the learned parameters (Figure 5.11(b)) is smaller than the error observed with the initial controller (Figure 5.11(a)). Notice that instead of having six parameters to choose, the only parameter that a designer needs to select is the learning rate λ. This potentially makes the design phase of MPC faster and more reliable.
Chapter 6

Integrated FOS FALA MPC
This chapter formulates a UAV target tracking mission similar to the more general problem defined in Fig. 1.1. The problem is similar to the altitude control problem described in Chapter 4, but now we consider movement in the longitudinal and lateral directions as well. Furthermore, we demonstrate how time-varying planar inequality constraints can be used to achieve obstacle avoidance. As illustrated in Fig. 6.1, this scenario involves a quadcopter controlled by an embedded low-level Pixhawk flight controller. As we saw in Chapter 4, the flight controller uses a nested PID control architecture that converts commands (uc) at time n into motor speeds (um) required to actuate a Parrot AR.drone quadcopter as follows:
$$u_m(n) = f_q(u_c(n), \zeta(n)) \tag{6.1}$$
which is a calculation that requires state feedback (ζ). We assume that a precise model of the combined dynamics of the flight controller and plant (fq) is unavailable. We consider the commands as the following vector: uc = [uc,φ, uc,θ, uc,ψ, uc,z]T for
commanded roll, pitch, yaw, and altitude.

Figure 6.1: Quadcopter mission with obstacles

These commands are provided by an adaptive MPC, which requires five inputs:
• A reference (target) position;
• Weighting matrices for the cost function (2.6), which are selected offline using the FALA learning process described by (5.2);
• A linear, time-invariant model of the form (2.2) that approximates the underlying (presumably nonlinear) dynamics fq, which we compute using the FOS algorithm from Chapter 4;
• Time-varying planar inequality constraints, described in Section 6.3, to ensure obstacle avoidance; and
• State feedback.
6.1 Translational and Rotational Dynamics
While the precise dynamics of the quadcopter in Fig. 6.1 are unknown, we can draw from previous work [76], [77], and [49] to develop an appropriate structure for the model prior to identification. We first define the following reference frames: the Earth (inertial), translated vehicle, and body frames illustrated in Fig. 6.2. The Earth frame (x, y, z) is fixed in space, often defined in the East, North and Up directions. The translated vehicle frame (xv, yv, zv) is the Earth frame translated to the center of mass of the quadcopter. The body frame (xb, yb, zb) also has its origin at the center of mass of the quadcopter, but the xb and yb axes are oriented towards the front and left motors, respectively. The zb axis is directed upwards, perpendicular to xb and yb. The motors are numbered clockwise from the top, starting with the front one.
Figure 6.2: Quadcopter reference frames
Vectors representing the position (ζe ) and orientation (ξe ) in the Earth frame are as follows:
$$\zeta_e = \begin{bmatrix} x & y & z \end{bmatrix}^T, \qquad \xi_e = \begin{bmatrix} \phi & \theta & \psi \end{bmatrix}^T \tag{6.2}$$
where x, y, and z are Cartesian coordinates with respect to the Earth frame and φ, θ, and ψ are the Euler roll, pitch, and yaw angles with respect to the translated vehicle frame. The translational velocity and the Euler angle rates are expressed as ζ̇e and ξ̇e, respectively. The angular velocities in roll (p), pitch (q), and yaw (r) about the body frame axes are expressed as follows:

$$\dot{\zeta}_b = \begin{bmatrix} u & v & w \end{bmatrix}^T, \qquad \dot{\xi}_b = \begin{bmatrix} p & q & r \end{bmatrix}^T \tag{6.3}$$
where u, v, and w are velocities in the xb, yb, and zb directions; p is a rotation rate about xb, q is a rotation rate about yb, and r is a rotation rate about zb. The angular velocities are related as follows:
$$\dot{\xi}_e = R_{EB}\,\dot{\xi}_b \tag{6.4}$$
where REB is the following transformation:
$$R_{EB} = \begin{bmatrix} 1 & s(\phi)t(\theta) & c(\phi)t(\theta) \\ 0 & c(\phi) & -s(\phi) \\ 0 & s(\phi)/c(\theta) & c(\phi)/c(\theta) \end{bmatrix} \tag{6.5}$$
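For illustration, the following Python sketch (not part of this work, which used MATLAB/Simulink) evaluates this transformation, where s(·), c(·), and t(·) denote sin, cos, and tan, and uses it to map body angular rates to Euler angle rates as in (6.4). The numerical rates chosen are arbitrary.

```python
import numpy as np

def euler_rate_matrix(phi: float, theta: float) -> np.ndarray:
    """Transformation R_EB of (6.5), mapping body rates [p, q, r] to Euler rates."""
    s, c, t = np.sin, np.cos, np.tan
    return np.array([
        [1.0, s(phi) * t(theta),  c(phi) * t(theta)],
        [0.0, c(phi),            -s(phi)],
        [0.0, s(phi) / c(theta),  c(phi) / c(theta)],
    ])

# Example: body angular rates (rad/s) at a small roll/pitch attitude.
phi, theta = 0.05, -0.02
xi_dot_b = np.array([0.1, 0.02, 0.0])                  # [p, q, r]
xi_dot_e = euler_rate_matrix(phi, theta) @ xi_dot_b    # [phi_dot, theta_dot, psi_dot] per (6.4)
print(xi_dot_e)
```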
Let us consider the force generated by the front, right, rear, and left rotors as fm = [f1 , f2 , f3 , f4 ]T . We relate motor speed um = [um,f , um,φ , um,θ , um,ψ ]T to these
forces as follows:
$$u_m = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & -1 & 0 & 1 \\ -1 & 0 & 1 & 0 \\ -k_m & k_m & -k_m & k_m \end{bmatrix} f_m \tag{6.6}$$
where um,f is the total force generated by the rotors; um,φ generates a rolling action, um,θ generates a pitching action, um,ψ generates a yawing action; and km is a constant that relates the force and torque produced by each rotor. Recall that the motor speed inputs (um ) are not the same as the commands (uc ). The relationship between um and uc is defined by (6.1), which incorporates the unknown low-level dynamics fq . Based on the equations of motion presented in [76] and [77], we define the translational dynamics of the quadcopter in the Earth frame:
$$\ddot{\zeta}_e = T_{EB}\begin{bmatrix} 0 \\ 0 \\ \dfrac{u_{m,f}}{m} \end{bmatrix} - \begin{bmatrix} 0 \\ 0 \\ g \end{bmatrix} - \dfrac{K_d}{m}\dot{\zeta}_e \tag{6.7}$$
where g is gravitational acceleration; Kd is a matrix of drag coefficients; and the transformation from the body to Earth frame is as follows:

$$T_{EB} = \begin{bmatrix} c(\theta)c(\psi) & s(\phi)s(\theta)c(\psi) - c(\phi)s(\psi) & c(\phi)s(\theta)c(\psi) + s(\phi)s(\psi) \\ c(\theta)s(\psi) & s(\phi)s(\theta)s(\psi) + c(\phi)c(\psi) & c(\phi)s(\theta)s(\psi) - s(\phi)c(\psi) \\ -s(\theta) & s(\phi)c(\theta) & c(\phi)c(\theta) \end{bmatrix} \tag{6.8}$$
We define the rotational dynamics in the body frame as described in [49] and [80]:
$$\ddot{\xi}_b = \begin{bmatrix} \frac{1}{j_x}\left[(j_y - j_z)qr + l_x u_{m,\phi}\right] \\ \frac{1}{j_y}\left[(j_z - j_x)pr + l_y u_{m,\theta}\right] \\ \frac{1}{j_z}\left[(j_x - j_y)pq + u_{m,\psi}\right] \end{bmatrix} \tag{6.9}$$
where jx , jy , and jz are moments of inertia around the respective axes; and lx and ly are the arm lengths.
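To make the structure of (6.7)-(6.9) concrete, here is a small Python sketch that evaluates both accelerations. The mass, drag, inertia, and arm-length values are placeholders chosen for illustration only; they are not identified values from this work.

```python
import numpy as np

def T_EB(phi, theta, psi):
    """Body-to-Earth rotation matrix of (6.8)."""
    s, c = np.sin, np.cos
    return np.array([
        [c(theta)*c(psi), s(phi)*s(theta)*c(psi) - c(phi)*s(psi), c(phi)*s(theta)*c(psi) + s(phi)*s(psi)],
        [c(theta)*s(psi), s(phi)*s(theta)*s(psi) + c(phi)*c(psi), c(phi)*s(theta)*s(psi) - s(phi)*c(psi)],
        [-s(theta),       s(phi)*c(theta),                        c(phi)*c(theta)],
    ])

# Placeholder physical parameters (illustrative only).
m, g = 0.45, 9.81                    # mass [kg], gravity [m/s^2]
Kd = np.diag([0.1, 0.1, 0.15])       # drag coefficient matrix
jx, jy, jz = 2.4e-3, 2.4e-3, 3.2e-3  # moments of inertia [kg m^2]
lx = ly = 0.17                       # arm lengths [m]

def translational_accel(zeta_dot_e, attitude, u_mf):
    """Earth-frame acceleration from (6.7)."""
    phi, theta, psi = attitude
    thrust = T_EB(phi, theta, psi) @ np.array([0.0, 0.0, u_mf / m])
    return thrust - np.array([0.0, 0.0, g]) - (Kd / m) @ zeta_dot_e

def rotational_accel(pqr, u_mphi, u_mtheta, u_mpsi):
    """Body-frame angular acceleration from (6.9)."""
    p, q, r = pqr
    return np.array([
        ((jy - jz) * q * r + lx * u_mphi) / jx,
        ((jz - jx) * p * r + ly * u_mtheta) / jy,
        ((jx - jy) * p * q + u_mpsi) / jz,
    ])

# Hover check: thrust equal to weight, level attitude, no motion -> near-zero accelerations.
print(translational_accel(np.zeros(3), (0.0, 0.0, 0.0), m * g))
print(rotational_accel(np.zeros(3), 0.0, 0.0, 0.0))
```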
6.2 Decoupled Bare-frame Dynamics
This section decouples the longitudinal, lateral, and vertical components of the model developed in Section 6.1. This approach, also known as the bare-frame model, is a common technique for helicopter control [81]. Here we formulate the bare-frame model for use in a quadcopter target tracking mission. Longitudinal motion is considered to be motion along the xb axis in the body frame caused by pitching; lateral motion is considered to be motion along the yb axis in the body frame caused by rolling. The benefit of this modelling technique is that it directly relates motion in the body frame to commands in the body frame. Since no transformation to the Earth frame is required, we are able to make a time-invariant linear approximation of the dynamics in the longitudinal and lateral directions. The longitudinal dynamics are described as follows:

$$\begin{bmatrix} x_b(n+1) \\ \dot{x}_b(n+1) \\ q(n+1) \\ \theta(n+1) \end{bmatrix} = A_{lon}\begin{bmatrix} x_b(n) \\ \dot{x}_b(n) \\ q(n) \\ \theta(n) \end{bmatrix} + B_{lon}\,u_{c,\theta}(n) \tag{6.10}$$
where Alon and Blon are the state-transition and input matrices for the longitudinal dynamics; and xb(n + 1) is used to predict the evolution of the states in the xb direction. The lateral dynamics are described similarly:

$$\begin{bmatrix} y_b(n+1) \\ \dot{y}_b(n+1) \\ p(n+1) \\ \phi(n+1) \end{bmatrix} = A_{lat}\begin{bmatrix} y_b(n) \\ \dot{y}_b(n) \\ p(n) \\ \phi(n) \end{bmatrix} + B_{lat}\,u_{c,\phi}(n) \tag{6.11}$$
where Alat and Blat are the state-transition and input matrices for the lateral dynamics; and yb(n + 1) is used to predict the evolution of the states in the yb direction. The vertical dynamics are expressed in the Earth frame as follows:

$$\begin{bmatrix} z(n+1) \\ \dot{z}(n+1) \end{bmatrix} = A_{alt}\begin{bmatrix} z(n) \\ \dot{z}(n) \end{bmatrix} + B_{alt}\begin{bmatrix} u_{c,z}(n) \\ g \end{bmatrix} \tag{6.12}$$
where Aalt and Balt are the state-transition and input matrices for the vertical dynamics. Recall that g is the acceleration due to gravity.
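Since the MPC uses these decoupled models to predict the state evolution over its horizon, a minimal Python sketch of that prediction step is shown below. The matrices used here are illustrative placeholders (the identified values appear later, in Section 6.4), and the function name is hypothetical.

```python
import numpy as np

def predict_horizon(A: np.ndarray, B: np.ndarray, x0: np.ndarray,
                    u_seq: np.ndarray) -> np.ndarray:
    """Propagate x(n+1) = A x(n) + B u(n) over a horizon of len(u_seq) steps.

    Returns the stacked predicted states, one row per step, as an MPC
    prediction of the decoupled models (6.10)-(6.12) would require.
    """
    states = []
    x = x0.copy()
    for u in u_seq:
        x = A @ x + B @ np.atleast_1d(u)
        states.append(x)
    return np.vstack(states)

# Illustrative placeholder matrices for the vertical model (6.12); state [z, z_dot],
# input [u_cz, g].  The identified matrices appear in Section 6.4.
dt = 0.1
A_alt = np.array([[1.0, dt], [0.0, 0.95]])
B_alt = np.array([[0.0, 0.0], [-0.25, 0.09]])

x0 = np.array([0.0, 0.0])
g = 9.81
u_seq = np.array([[1.0, g]] * 10)      # hold a constant altitude command for 10 steps
print(predict_horizon(A_alt, B_alt, x0, u_seq))
```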
6.3 Planar Inequality Constraints
This section describes how time-varying inequality constraints can be used to avoid obstacles in the MPC formulation described in (2.8). The constraints are presented in a general form and can be applied in either of the reference frames presented in Fig. 6.2. The technique was inspired by the work of [47] and is extended for 3D obstacle avoidance. Let us define the following parameters:
• rq is the spherical radius of the quadcopter;
• robs is the radius of a sphere that encompasses an obstacle;
• rb is an optional safety buffer;
• ζobs is the position of the center-of-mass of the obstacle;
• ζ is the position of the quadcopter; and
• d is the Euclidean distance between the centers-of-mass of the quadcopter and the obstacle.

We also define a point (ζp) along the line connecting the center-of-mass of the quadcopter and the obstacle as follows:
$$\zeta_p = \zeta_{obs} + \frac{r_{obs} + r_q + r_b}{d}\left(\zeta - \zeta_{obs}\right) \tag{6.13}$$
We define the admissible region using the plane that is tangent to the sphere centred at ζobs with radius (robs + rq + rb) and that passes through the point ζp, as follows:
$$-(\zeta_p - \zeta_{obs})^T \zeta \leq \zeta_p^T\left(\zeta_{obs} - \zeta_p\right) \tag{6.14}$$
This PIC is illustrated in Fig. 6.3 for an obstacle at position ζobs = [−2, −1, −0.2]T and a quadcopter located at ζ = [0, 0, 0]T. By constraining the evolution of the quadcopter states to one side of this PIC, we avoid collisions with the obstacle while maintaining a permissible search space that is convex. When recomputed at each timestep in a receding horizon implementation, the PIC is updated and adapts to changes in the environment.
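The following Python sketch computes the half-space of (6.13)-(6.14) as a pair (a, b) such that a·ζ ≤ b, which is the form in which it would be appended to the MPC constraint set. The function name is hypothetical; the radii match those used in Section 6.5.3 and the positions are those of the illustration in Fig. 6.3.

```python
import numpy as np

def planar_inequality_constraint(zeta, zeta_obs, r_obs, r_q, r_b=0.0):
    """Return (a, b) such that a @ zeta_future <= b keeps the quadcopter on the
    safe side of the plane defined by (6.13)-(6.14)."""
    d = np.linalg.norm(zeta - zeta_obs)                               # centre-to-centre distance
    zeta_p = zeta_obs + (r_obs + r_q + r_b) / d * (zeta - zeta_obs)   # tangent point (6.13)
    a = -(zeta_p - zeta_obs)                                          # half-space normal (6.14)
    b = zeta_p @ (zeta_obs - zeta_p)
    return a, b

# Values from the illustration in Fig. 6.3.
zeta = np.array([0.0, 0.0, 0.0])
zeta_obs = np.array([-2.0, -1.0, -0.2])
a, b = planar_inequality_constraint(zeta, zeta_obs, r_obs=0.5, r_q=0.5, r_b=0.0)

print(a @ zeta <= b)          # True: the current position is admissible
print(a @ zeta_obs <= b)      # False: the obstacle centre is excluded
```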
Figure 6.3: Illustration of Planar Inequality Constraint

6.4 Experimental Setup
The FALA technique for tuning MPC was implemented on the real-time MATLAB®/Simulink simulation testbed developed in [72]. The training process described in
Section 5.3 was used to select the Qi and Ri matrices in (2.6). For the cases when obstacle avoidance was necessary, the PICs described in (6.14) were used. The testbed uses a series of very accurate plant models that incorporate coupling forces, gyroscopic effects, noise, and aerodynamic drag forces (identified experimentally for low-speed manoeuvres) for the Parrot AR.drone quadcopter with an embedded Pixhawk flight controller. Processing was accomplished on an Apple MacBook Pro with a 2.6 GHz Intel Core i7 processor. The flight controller actuated the four motors
based on desired commands uc from the MPC [73]. Since the MPC required a linear model of the system dynamics, the FOS technique described in Chapter 4 was used to identify parameters to approximate the decoupled bare-frame dynamics. For the purpose of presentation, all values are rounded to two significant digits in the results that follow. The longitudinal and lateral state-transition matrices, which were equivalent due to the symmetry of the vehicle, were identified as follows:
$$A_{lon} = A_{lat} = \begin{bmatrix} 1 & 0.10 & 0 & 0 \\ 0 & 1 & 0.15 & 1.39 \\ 0 & 0 & 0.78 & -2.10 \\ 0 & 0 & 0 & 0.92 \end{bmatrix} \tag{6.15}$$
Similarly, the longitudinal and lateral input matrices were identified as follows:
$$B_{lon} = B_{lat} = \begin{bmatrix} 0 \\ -1.31 \\ 2.10 \\ 0.086 \end{bmatrix} \tag{6.16}$$
Finally, the state-transition and input matrices for the vertical dynamics were identified as follows:
$$A_{alt} = \begin{bmatrix} 1 & 0.10 \\ 0 & 0.934 \end{bmatrix} \tag{6.17}$$
$$B_{alt} = \begin{bmatrix} 0 & 0 \\ -0.25 & 0.092 \end{bmatrix} \tag{6.18}$$
The following output feedback was used:
$$C_{lon} = C_{lat} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}, \qquad C_{alt} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \tag{6.19}$$
where Clon, Clat, and Calt are the output matrices for the longitudinal, lateral, and altitude dynamics, respectively. These models were fixed for all subsequent experiments and used in three parallel MPC implementations. Each MPC was executed at a rate of 10 Hz with a prediction horizon of 100. The roll and pitch angles were constrained to −0.1 rad ≤ φ ≤ 0.1 rad and −0.1 rad ≤ θ ≤ 0.1 rad. This range of angles permits a rough small-angle approximation, which helps preserve the validity of the linearization technique being used. The heading was fixed to ψ = 0 using the low-level Pixhawk flight controller, which operated at 100 Hz. (A short sketch following the experiment list below exercises these identified models in simulation.) Learning for each controller was accomplished separately and the results were integrated in a number of subsequent trajectory tracking missions.

In the results that follow, the term trial denotes a fixed period of time over which a specific set of parameters is fixed and evaluated; the term session denotes a set of successive trials during which the system is trained to converge on a set of optimal parameters; and the term experiment denotes one or more sessions carried out for a specific purpose. A summary of the experiments carried out is as follows:

1. Experiment 1 (Altitude Control Learning) - FALA was used to identify MPC
weight parameters at various learning rates to analyze the performance of the learned parameters;

2. Experiment 2 (Lat/Long Control Learning) - FALA was used to identify MPC weight parameters for longitudinal and lateral controllers in two separate, decoupled sessions;

3. Experiment 3 (Trajectory Tracking with Fixed Obstacles) - The parameters learned in previous experiments were used in two trajectory tracking missions with fixed, spherical obstacles; and

4. Experiment 4 (Trajectory Tracking with Moving Obstacles) - The parameters learned in previous experiments were used in a trajectory tracking mission with moving, spherical obstacles.
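As noted above, the following Python sketch (illustrative only; the testbed itself was MATLAB/Simulink-based) assembles the identified matrices from (6.15)-(6.18) and propagates the decoupled longitudinal and vertical models for a couple of seconds under constant commands.

```python
import numpy as np

# Identified state-transition and input matrices from (6.15)-(6.18).
A_lon = np.array([[1, 0.10, 0,    0],
                  [0, 1,    0.15, 1.39],
                  [0, 0,    0.78, -2.10],
                  [0, 0,    0,    0.92]])
B_lon = np.array([[0], [-1.31], [2.10], [0.086]])

A_alt = np.array([[1, 0.10],
                  [0, 0.934]])
B_alt = np.array([[0, 0],
                  [-0.25, 0.092]])

g = 9.81
x_lon = np.zeros(4)            # [x_b, x_b_dot, q, theta]
x_alt = np.zeros(2)            # [z, z_dot]

# Apply a small constant pitch command and a constant altitude command for
# 20 steps (2 s at 10 Hz) and propagate each decoupled model independently,
# as the parallel MPCs would during prediction.
for n in range(20):
    x_lon = A_lon @ x_lon + (B_lon * 0.05).ravel()          # u_c,theta = 0.05 rad
    x_alt = A_alt @ x_alt + B_alt @ np.array([1.0, g])      # u_c,z = 1.0

print("longitudinal state after 2 s:", np.round(x_lon, 3))
print("vertical state after 2 s:", np.round(x_alt, 3))
```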
6.5 Results
The results are presented as follows: Section 6.5.1 presents the learning results for the altitude controller, including a statistical analysis of the improvement observed at different learning rates; Section 6.5.2 presents the learning results for the decoupled longitudinal and lateral controllers; Section 6.5.3 presents the results for trajectory tracking with two spherical obstacles; and Section 6.5.4 presents the results for moving obstacles.
6.5.1 Experiment 1 - Altitude Control Learning
A total of 65 separate FALA sessions were used to identify MPC weight parameters at various learning rates. Recalling the dynamics described in Section 6.2, the following
finite set of options was considered for each parameter:
$$A_{alt,Q1} = A_{alt,Q2} = A_{alt,R1} = \{10, 20, \ldots, 100\} \tag{6.20}$$
where Aalt,Q1, Aalt,Q2, and Aalt,R1 are options for the MPC cost function weights. This learning was accomplished by evaluating the tracking performance during successive 1 m altitude changes (upwards and downwards) over a trial period of 10 s. The desired vertical speed was set to zero. Fig. 6.4 presents the convergence time for 27 sessions at the learning rates indicated. There is an exponential increase in convergence time (expressed as the number of trials required) at lower learning rates.
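The FALA game itself is defined by (5.2), which is not reproduced in this chapter; the Python sketch below therefore assumes the standard linear reward-inaction (L_R-I) update commonly used for finite action-set learning automata, applied to one automaton choosing a single weight from the set in (6.20). The toy environment that rewards a weight of 70 is purely hypothetical.

```python
import numpy as np

def lri_update(p: np.ndarray, chosen: int, beta: float, lam: float) -> np.ndarray:
    """One linear reward-inaction (L_R-I) update of an action-probability vector.

    p      : current probabilities over the finite action set (sums to 1)
    chosen : index of the action played this trial
    beta   : normalized reinforcement in [0, 1] (1 = best observed performance)
    lam    : learning rate
    """
    p = p.copy()
    p -= lam * beta * p                 # shrink all probabilities...
    p[chosen] += lam * beta             # ...and move the freed mass to the chosen action
    return p

# Toy game: a single automaton choosing one weight from the set in (6.20).
actions = np.arange(10, 110, 10)        # {10, 20, ..., 100}
rng = np.random.default_rng(1)
p = np.full(len(actions), 1.0 / len(actions))

for trial in range(2000):
    a = rng.choice(len(actions), p=p)
    # Hypothetical environment: pretend a weight of 70 gives the lowest MSE.
    mse = (actions[a] - 70) ** 2 / 900.0 + rng.normal(0, 0.01)
    beta = np.clip(1.0 - mse, 0.0, 1.0)
    p = lri_update(p, a, beta, lam=0.1)

# Most runs concentrate probability on or near the best weight.
print(actions[np.argmax(p)], np.round(p, 2))
```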
Figure 6.4: Effect of Learning Rate on Convergence Time for Altitude Control

In order to investigate the impact of learning rate on performance, the learning rates γ = 0.1 and γ = 0.8 were selected for further analysis. The remaining sessions in
Experiment 1 were used to compare the performance of these two rates over multiple training sessions. Fig. 6.5 presents the results from two representative sessions (one at γ = 0.1 and one at γ = 0.8). Here we see that the mean squared error (MSE) decreases much faster at γ = 0.8; however, although the MSE decreases more slowly at γ = 0.1, the final result is better.
Figure 6.5: Comparison of Slow and Fast Learning Rates
The training sessions did not always converge to the same parameters. Therefore, it was necessary to analyze the mean MSE and standard deviation (σ) across all sessions. These results are presented in Table 6.1.
        λ     mean MSE [m²]   σ
Slow    0.1   0.9247          0.3518
Fast    0.8   1.0607          0.5372

Table 6.1: Comparison of slow and fast learning rates

For the slow learning rate, the mean MSE after the parameters converged was 0.9247 m²
with a standard deviation of 0.3518. For the fast learning rate, the mean was 1.0607 m² with a standard deviation of 0.5372. Before performing a statistical analysis, normal probability plots were produced and are presented in Fig. 6.6 and Fig. 6.7. These allow us to visually inspect the data to determine if the distributions can be approximated as normal.
Figure 6.6: Normal probability plot for slow learning rate

A perfectly normal distribution would fall along the red line in each plot. From inspection, we see that the distributions can be roughly approximated as normal, which permits an analysis of the variance in the performance. A two-sample F-test for equal variances rejects the null hypothesis that the two sets come from normal distributions with the same variance (p = 0.0317).
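Such an F-test can be reproduced as sketched below in Python. The per-session MSE samples here are randomly generated stand-ins that merely match the reported means and standard deviations, and the sample sizes are hypothetical, so the printed p-value will not reproduce the value reported above.

```python
import numpy as np
from scipy import stats

def two_sample_f_test(x: np.ndarray, y: np.ndarray):
    """Two-sample F-test for equality of variances (two-sided).

    Returns the F statistic and p-value; assumes both samples are
    approximately normal, as argued from the probability plots.
    """
    f = np.var(x, ddof=1) / np.var(y, ddof=1)
    dfx, dfy = len(x) - 1, len(y) - 1
    p_one_sided = stats.f.sf(f, dfx, dfy) if f > 1 else stats.f.cdf(f, dfx, dfy)
    return f, 2 * p_one_sided

# Hypothetical per-session converged MSE values (m^2) for the two learning rates;
# the reported mean/std are 0.9247/0.3518 (slow) and 1.0607/0.5372 (fast).
rng = np.random.default_rng(3)
mse_slow = rng.normal(0.9247, 0.3518, size=19)
mse_fast = rng.normal(1.0607, 0.5372, size=19)

f_stat, p_value = two_sample_f_test(mse_fast, mse_slow)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```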
Figure 6.7: Normal probability plot for fast learning rate

This suggests that, while FALA is not guaranteed to converge to a single set of optimal parameters, the variance in performance is reduced with longer learning times. The top-performing set of parameters was selected for subsequent experiments:

$$Q_{alt} = \begin{bmatrix} 70 & 0 \\ 0 & 80 \end{bmatrix}, \qquad R_{alt} = 50 \tag{6.21}$$

6.5.2 Experiment 2 - Lat/Long Control Learning
FALA was used to identify MPC weight parameters for longitudinal and lateral controllers in two separate, decoupled sessions. The following finite set of options was considered for each parameter:
$$A_{lon,Q1} = A_{lon,Q2} = A_{lon,R1} = \{1, 5, 10, 15, \ldots, 100\} \tag{6.22}$$
where Alon,Q1, Alon,Q2, and Alon,R1 are options for the cost function weights on the longitudinal outputs and input, respectively.
$$A_{lat,Q1} = A_{lat,Q2} = A_{lat,R1} = \{1, 5, 10, 15, \ldots, 100\} \tag{6.23}$$
where Alat,Q1, Alat,Q2, and Alat,R1 are options for the cost function weights on the lateral outputs and input, respectively. This learning was accomplished by evaluating the tracking performance during successive 5 m changes along the respective body axes over a trial period of 20 s. The desired speed in each direction was set to zero. A learning rate of γ = 0.1 was selected based on the results from Experiment 1. The MSE for the reinforcement signal was composed of proportional and velocity components with kp = 1 and kd = 0.2, as defined in (5.4). The rationale for selecting these parameters was to reduce overshoot in the transient response by adding a small component related to the rate of change of error (hence, kd). Fig. 6.8 and Fig. 6.9 present the MSE reduction during the learning for both sessions.
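Equation (5.4) is not reproduced in this chapter, so the Python sketch below only assumes that it combines the position error (weighted by kp) and the velocity error (weighted by kd) before the squared error is averaged over the trial; the exact form may differ, and the trial data here are synthetic.

```python
import numpy as np

def trial_mse(pos_err: np.ndarray, vel_err: np.ndarray,
              kp: float = 1.0, kd: float = 0.2) -> float:
    """MSE over one trial of a combined error signal (assumed form of (5.4))."""
    e = kp * pos_err + kd * vel_err
    return float(np.mean(e ** 2))

# Hypothetical 20 s trial sampled at 10 Hz: a 5 m step tracked with a first-order
# response, so the position error decays while the vehicle speed (the velocity
# error, since the desired speed is zero) peaks early.
t = np.arange(0.0, 20.0, 0.1)
pos_err = 5.0 * np.exp(-t / 3.0)
vel_err = np.gradient(5.0 - pos_err, t)

print(f"trial MSE = {trial_mse(pos_err, vel_err):.3f} m^2")
```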
Figure 6.8: Error during Longitudinal Control Learning

Figure 6.9: Error during Latitudinal Control Learning
The FALA game converged on the following parameters for the longitudinal controller:
$$Q_{lon} = \begin{bmatrix} 45 & 0 \\ 0 & 90 \end{bmatrix}, \qquad R_{lon} = 60 \tag{6.24}$$
and converged on the following parameters for the latitudinal controller:
$$Q_{lat} = \begin{bmatrix} 45 & 0 \\ 0 & 90 \end{bmatrix}, \qquad R_{lat} = 55 \tag{6.25}$$

6.5.3 Experiment 3 - Trajectory Tracking with Fixed Obstacles
The parameters learned in the previous experiments were used in two trajectory tracking missions with fixed, spherical obstacles. The quadcopter was treated as a sphere with radius rq = 0.5 m and each obstacle with radius robs = 0.5 m. PICs were used to avoid the obstacles. In Fig. 6.10, we see that the quadcopter follows a smooth trajectory while avoiding the two obstacles in its path. For the obstacle on the right, the quadcopter follows a trajectory around the obstacle; for the obstacle on the left, the quadcopter passes under it. Notice that the vehicle avoids the obstacles while moving in all three dimensions. Fig. 6.11 presents an overhead view from a similar mission to provide a clearer illustration of the separation between the quadcopter (in blue) and the obstacles (in grey).
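To show how the PIC of Section 6.3 enters the convex MPC problem that produces such trajectories, the sketch below uses cvxpy (a generic convex-optimization library, not the toolchain used in this work) with a simplified 3D double-integrator stand-in for the decoupled models; the weights, horizon, and input bounds are illustrative rather than the learned or identified values. In the receding-horizon implementation the constraint would be recomputed at every timestep.

```python
import numpy as np
import cvxpy as cp

# Simplified stand-in model: a discrete-time 3D double integrator (position, velocity),
# used only to show how a planar inequality constraint (6.14) enters a convex MPC problem.
dt, N = 0.1, 20
A = np.block([[np.eye(3), dt * np.eye(3)], [np.zeros((3, 3)), np.eye(3)]])
B = np.block([[0.5 * dt**2 * np.eye(3)], [dt * np.eye(3)]])

x0 = np.zeros(6)                                            # start at the origin, at rest
x_ref = np.concatenate([[-4.0, -2.0, 0.0], np.zeros(3)])    # target beyond the obstacle

# Planar inequality constraint a^T p <= b computed as in Section 6.3.
zeta, zeta_obs = x0[:3], np.array([-2.0, -1.0, -0.2])
d = np.linalg.norm(zeta - zeta_obs)
zeta_p = zeta_obs + (0.5 + 0.5 + 0.0) / d * (zeta - zeta_obs)
a, b = -(zeta_p - zeta_obs), zeta_p @ (zeta_obs - zeta_p)

Q = np.diag([10, 10, 10, 1, 1, 1])       # illustrative state weights
R = 0.1 * np.eye(3)                      # illustrative input weights

x = cp.Variable((6, N + 1))
u = cp.Variable((3, N))
cost, constraints = 0, [x[:, 0] == x0]
for k in range(N):
    cost += cp.quad_form(x[:, k + 1] - x_ref, Q) + cp.quad_form(u[:, k], R)
    constraints += [x[:, k + 1] == A @ x[:, k] + B @ u[:, k],
                    a @ x[:3, k + 1] <= b,                   # stay on the safe side of the PIC
                    cp.abs(u[:, k]) <= 2.0]                  # crude input bounds
cp.Problem(cp.Minimize(cost), constraints).solve()
print(np.round(x.value[:3, -1], 2))                          # terminal position of the plan
```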
Figure 6.10: Vehicle trajectory with two spherical obstacles
Figure 6.11: Vehicle trajectory with two spherical obstacles - overhead view
These results demonstrate that time-varying planar inequality constraints can be
used in conjunction with MPC to achieve obstacle avoidance for a vehicle travelling in 3D space. The above results, however, assume the obstacle is stationary.
6.5.4 Experiment 4 - Trajectory Tracking with Moving Obstacles
The parameters learned in previous experiments were used in a trajectory tracking mission with moving, spherical obstacles. A buffer of rb = 1 m was used to provide an added level of safety due to the obstacle motion. Fig. 6.12 provides an illustration of the quadcopter (in blue) avoiding the moving obstacle (in grey). In each frame, we see the quadcopter tracks a path as close as possible to the reference trajectory (in
red) while maintaining a safe distance from the moving obstacle.
Figure 6.12: Moving obstacle avoidance
At no time did the quadcopter trajectory collide with that of the obstacle. These results demonstrate that time-varying planar inequality constraints can be used in conjunction with MPC to avoid moving obstacles.
6.6 Summary
This chapter presented an integrated FOS-FALA approach for modelling and controlling a quadcopter UAV. The learning was accomplished in a decoupled manner, with parameters for separate longitudinal, latitudinal, and vertical controllers selected offline. An analysis of the performance during the learning process demonstrated a statistically significant reduction in the variability of the MSE with slower learning rates. The learned parameters were used as weights in a parallel MPC implementation for a simulated quadcopter trajectory tracking mission involving stationary and dynamic spherical obstacles. A time-invariant, linear state-space model for the system dynamics was identified using FOS. Obstacle avoidance was achieved using time-varying planar inequality constraints, which proved effective in avoiding collisions by deflecting the quadcopter around the obstacles while following the desired 3D trajectory as closely as possible.
Chapter 7

Conclusions
This chapter concludes the work presented above. The use of reinforcement learning to link MPC cost function weights to system performance is a marked improvement over trial and error and represents a significant contribution to the field. It is particularly useful for systems for which we have a basic understanding of the underlying dynamics but lack precise quantities such as masses, moments, and aerodynamic effects. Generally, we have seen that by combining R-FOS and FALA we are able to reduce the number of design parameters that must be selected for MPC-based guidance and control. We have also provided a novel technique for obstacle avoidance using planar inequality constraints. These conclusions are summarized in three parts: modelling using R-FOS, control design using FALA, and obstacle avoidance using planar inequality constraints. The chapter ends with recommendations for future work.
7.1 Modelling using R-FOS
The results of Chapter 4 and Chapter 6 demonstrate that R-FOS is an effective technique for approximating vehicle dynamics. For the purpose of this research, the
problem was constrained to the case where the overall structure of a linear model approximating the vehicle dynamics was available. Precise values for certain parameters of the vehicle were unknown, such as masses, moments, and certain aerodynamic quantities. We found that FOS could be used to compute a model offline from a set of observed inputs and outputs. Furthermore, by executing FOS recursively (R-FOS), we demonstrated that an updated model could be obtained online. This could be useful for situations in which the dynamics of the vehicle change with time or are affected by the environment. R-FOS is uniquely capable of evaluating the importance of candidate terms in any form. We found that this could be used to reduce computational requirements by updating only the most significant terms. These savings grow with the size of the model, which could be potentially useful for more complex systems. Finally, by combining R-FOS with a KF, we developed estimates for the states when limited measurements were available. This was similar to a dual KF setup, with the added benefit afforded by the R-FOS ability to evaluate candidate terms. Since R-FOS provides a time-varying estimate of the noise covariance, we used this estimate for the KF. Typically, an estimate of the noise covariance would have to be selected by the designer based on knowledge of the system.
7.2 Control design using FALA
Chapter 5 and Chapter 6 demonstrated that reinforcement learning, specifically FALA, could be used to select design parameters in an MPC-based guidance and control strategy. This was demonstrated for both ground and aerial vehicles. The effect of these results is two-fold.
First, in the cases considered, the use of FALA reduced the overall number of parameters that needed to be selected by the designer. For the differential drive robot, the reduction was from six weights in the cost function to the two parameters required to define the error (position and velocity). This reduction was the same for the aerial vehicle scenario considered in Chapter 6. However, it is important to note that only six weights were required in the cost function for the aerial vehicle because of the decoupled bare-frame dynamics and the output feedback selected. For some applications, such as those in [49] and [82], the cost function could require up to 16 weights. Regardless of the number of weights, the number of parameters required by FALA is fixed at the number required to define the error. Secondly, the FALA technique developed in this research links the selection of the design parameters (weights in the cost function) to the desired performance. For MPC, no formal technique to accomplish this has yet been developed; typically, these parameters are selected through trial and error.
7.3 Obstacle Avoidance
Chapter 6 presented a fully integrated FOS-FALA solution with obstacle avoidance for a quadcopter UAV. Decoupled longitudinal, latitudinal, and vertical models were identified using FOS, and MPC parameters were selected using FALA. The FOS-derived models and FALA-derived control parameters were used in a parallel MPC implementation for a simulated quadcopter trajectory tracking mission involving stationary and dynamic spherical obstacles. Results demonstrated that obstacle avoidance could be achieved using time-varying
planar inequality constraints. These proved effective in avoiding collisions by deflecting the quadcopter around the obstacle while following the desired 3D trajectory as closely as possible. This provides a novel, simple, and intuitive strategy for obstacle avoidance that fits nicely into the MPC framework.
7.4 Future Work
While FOS proved useful for the problems considered in this research, its full potential has not been explored. Specifically, the accuracy of FOS is best observed when nonlinear models are used. Also, the evaluation of candidate model terms should be further investigated in the context of MPC-based vehicle control; this would be useful for scenarios in which even the structure of the model is not known. The planar inequality technique developed for obstacle avoidance required the assumption that obstacles were static throughout the prediction horizon. This was necessary because no model for the obstacle dynamics was available. Future work should focus on incorporating the dynamics of moving obstacles into the prediction horizon, possibly using system identification techniques such as FOS. The use of reinforcement learning techniques to select control parameters is potentially useful in a broad range of applications. Future development should focus on tailoring this approach to the unique constraints of other domains. For example, in addition to autonomous vehicles, this framework could be applied to other areas that use MPC, such as power systems and chemical process control. This would necessarily require the development of specialized models, something for which FOS has proven well suited in a diverse range of domains. While this research focused on convex MPC
formulations that use linear models, there is potential for using FALA or similar techniques for nonlinear MPC applications. Finally, this work focused on the guidance and control of a single vehicle. While the games of FALA investigated in this research selected parameters for a single vehicle, there is no reason the game could not be extended to incorporate multiple vehicles. Future work should investigate how the learning techniques can be incorporated into multi-vehicle applications, such as swarming or cooperative construction tasks. This would require the development of appropriate models and cost functions to describe the interactions between the vehicles.
Bibliography
[1] D.Keeling. The transportation revolution and transatlantic migration 1850 to 1914. Research in Economic History, 19:39–74, 1999. [2] Rick Szostak. Role of Transportation in the Industrial Revolution: A Comparison of England and France. McGill-Queen’s Press, Quebec, Canada, 1991. [3] J. Zimmer. The Third Transportation Revolution. https://medium.com/. Accessed: 2018-02-26. [4] F. Y. Wang. Toward a revolution in transportation operations: AI for complex systems. IEEE Intelligent Systems, 23(6):8–13, Nov 2008. [5] Tesla Motors. Model 3 specifications, 2016. [6] K. Bimbraw. Autonomous cars: Past, present and future a review of the developments in the last century, the present scenario and the expected future of autonomous vehicle technology. In 2015 12th International Conference on Informatics in Control, Automation and Robotics (ICINCO), volume 01, pages 191–198, Jul 2015.
[7] S. Giese, D. Carr, and J. Chahl. Implications for unmanned systems research of military UAV mishap statistics. In Proceedings of the IEEE Intelligent Vehicles Symposium (IV), pages 1191–1196, Jun 2013. [8] S. El-Ferik, B. A. Siddiqui, and F. L. Lewis. Distributed nonlinear MPC of multiagent systems with data compression and random delays. IEEE Transactions on Automatic Control, 61(3):817–822, Mar 2016. [9] J. Abizaid and R. Brooks. Recommendations and Report of the Task Force on US Drone Policy. Washington, DC: The Stimson Center, 2nd ed. edition, 2014. [10] A.T. Hafez, A.J. Marasco, S.N. Givigi, M. Iskandarani, S. Yousefi, and C.A. Rabbath. Solving multi-UAV dynamic encirclement via model predictive control. IEEE Transactions on Control Systems Technology, PP(99):1–1, 2015. [11] Vahana. Vahana. https://vahana.aero. Accessed: 2017-12-14. [12] National Instruments. PID Theory Explained. Technical report, National Instruments, Austin, Texas, 03 2011. [13] J. G. Ziegler and N. B. Nichols. Optimum Settings for Automatic Controllers. Transactions of ASME, 64:759–768, 1942. [14] Richard M. Murray, S. Shankar Sastry, and Li Zexiang. A Mathematical Introduction to Robotic Manipulation. CRC Press, Inc., Boca Raton, FL, USA, 1st edition, 1994. [15] M. W. Foley, R. H. Julien, and B. R. Copeland. Proportional-integral-derivative tuning for integrating processes with deadtime. IET Control Theory Applications, 4(3):425–436, March 2010.
[16] G. Zhong, H. Deng, G. Xin, and H. Wang. Dynamic hybrid control of a hexapod walking robot: Experimental verification. IEEE Transactions on Industrial Electronics, 63(8):5001–5011, Aug 2016. [17] D. P. Atherton and S. Majhi. Limitations of pid controllers. In Proceedings of the 1999 American Control Conference (Cat. No. 99CH36251), volume 6, pages 3843–3847 vol.6, 1999. [18] T. Chaiyatham and I. Ngamroo. Improvement of power system transient stability by pv farm with fuzzy gain scheduling of pid controller. IEEE Systems Journal, 11(3):1684–1691, Sept 2017. [19] K. J. Astrom, K. H. Johansson, and Qing-Guo Wang. Design of decoupled pid controllers for mimo systems. In Proceedings of the 2001 American Control Conference. (Cat. No.01CH37148), volume 3, pages 2015–2020 vol.3, 2001. [20] Qing-Guo Wang and Zhuo-Yun Nie. PID Control for MIMO Processes, pages 177–204. Springer London, London, 2012. [21] J. A. Rossiter. Model-Based Predictive Control: A Practical Approach. CRC Press LLC, Boca Raton, FL, first edition, 2004. [22] G Martin. IDCOM hierarchical control of an oil refinery reactor. In Proc. of the American Control Conf., pages 675 – 678, Jun 1984. [23] Zhou Jun-zhe, Li Xing, and Zhu Hai-tang. Application and simulation of DMC controller in time delay inertial system. In Proc. of the Control and Decision Conf., pages 649–654, Jul 2008.
[24] S.Joe Qin and Thomas A. Badgwell. A survey of industrial model predictive control technology. Control Engineering Practice, 11(7):733 – 764, Dec 2003. [25] S. M. Lee, H. Kim, H. Myung, and X. Yao.
Cooperative coevolutionary
algorithm-based model predictive control guaranteeing stability of multirobot formation. IEEE Transactions on Control Systems Technology, 23(1):37–51, Jan 2015. [26] P. Falugi and D. Q. Mayne. Getting robustness against unstructured uncertainty: A tube-based MPC approach. IEEE Transactions on Automatic Control, 59(5):1290–1295, May 2014. [27] D.Q. Mayne, J.B. Rawlings, C.V. Rao, and P.O.M. Scokaert. Constrained model predictive control: Stability and optimality. Automatica, 36(6):789 – 814, Jun 2000. [28] N.E. Du Toit and J.W. Burdick. Robot motion planning in dynamic, uncertain environments. IEEE Trans. on Robotics, 28(1):101–115, Feb 2012. [29] T.M. Howard, M. Pivtoraiko, R.A Knepper, and A Kelly. Model-predictive motion planning: Several key developments for autonomous mobile robots. IEEE Robotics Automation Magazine, 21(1):64–73, Mar 2014. [30] Z. Yan and J. Wang. Robust model predictive control of nonlinear systems with unmodeled dynamics and bounded uncertainties based on neural networks. IEEE Transactions on Neural Networks and Learning Systems, 25(3):457–469, Mar 2014.
[31] D. Fan and P. Shi. Improvement of Dijkstras algorithm and its application in route planning. In Proc. of the 7th Int. Conf. Fuzzy Systems and Knowledge Discovery, pages 1901–1904, Aug 2010. [32] G. Franz and W. Lucia. A receding horizon control strategy for autonomous vehicles in dynamic environments. IEEE Transactions on Control Systems Technology, 24(2):695–702, Mar 2016. [33] J. Zhao and J. Wang. Integrated model predictive control of hybrid electric vehicle coupled with aftertreatment systems. Vehicular Technology, IEEE Transactions on, PP(99):1–1, 2015. [34] Mooryong Choi and S.B. Choi. Model predictive control for vehicle yaw stability with practical concerns. Vehicular Technology, IEEE Transactions on, 63(8):3539–3548, Oct 2014. [35] K. D. Kim and P. R. Kumar. An MPC-based approach to provable systemwide safety and liveness of autonomous ground traffic. IEEE Transactions on Automatic Control, 59(12):3341–3356, Dec 2014. [36] David Muoz de la Pea, Daniel Limn, D. Kouzoupis, R. Quirynen, J.V. Frasch, and M. Diehl. Block condensing for fast nonlinear MPC with the dual newton strategy. In 5th IFAC Conference on Nonlinear Model Predictive Control, pages 26 – 31, September 2015. [37] S. E. Li, Z. Jia, K. Li, and B. Cheng. Fast online computation of a model predictive controller and its application to fuel economy - oriented adaptive cruise
control. IEEE Transactions on Intelligent Transportation Systems, 16(3):1199– 1209, Jun 2015. [38] A. Bemporad. A quadratic programming algorithm based on nonnegative least squares with applications to embedded model predictive control. IEEE Transactions on Automatic Control, 61(4):1111–1116, Apr 2016. [39] M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming. http://cvxr.com/cvx/. Accessed: 2018-05-22. [40] B. Kouvaritakis, J.A. Rossiter, and A.O.T. Chang. Stable generalised predictive control: an algorithm with guaranteed stability. Control Theory and Applications, IEEE Proceedings D, 139(4):349–362, Jul 1992. [41] M. N. Zeilinger, M. Morari, and C. N. Jones. Soft constrained model predictive control with robust stability guarantees. IEEE Transactions on Automatic Control, 59(5):1190–1202, May 2014. [42] S. Jafari Fesharaki, M. Kamali, and F. Sheikholeslam. Adaptive tube-based model predictive control for linear systems with parametric uncertainty. IET Control Theory Applications, 11(17):2947–2953, 2017. [43] Zhou Chao, Lei Ming, Zhou Shaolei, and Zhang Wenguang. Collision-free UAV formation flight control based on nonlinearMPC. In International Conference on Electronics, Communications and Control (ICECC), pages 1951–1956, Sep 2011. [44] Y. Kuriki and T. Namerikawa. Formation control with collision avoidance for a multi-UAV system using decentralized MPC and consensus-based control. In Control Conference (ECC), 2015 European, pages 3079–3084, July 2015.
[45] J. Park, S. Karumanchi, and K. Iagnemma. Homotopy-based divide-and-conquer strategy for optimal trajectory planning via mixed-integer programming. IEEE Transactions on Robotics, 31(5):1101–1115, Oct 2015. [46] R. Deits and R. Tedrake. Efficient mixed-integer planning for UAVs in cluttered environments. In IEEE International Conference on Robotics and Automation (ICRA), pages 42–49, May 2015. [47] M.A Mousavi, Z. Heshmati, and B. Moshiri. LTV-MPC based path planning of an autonomous vehicle via convex optimization. In Proc. of the 21st Iranian Conf. on Electrical Engineering (ICEE), pages 1–7, May 2013. [48] A. Bemporad. Model predictive control: Basic concepts, 2009. [49] E. C. Suicmez and A. T. Kutay. Optimal path tracking control of a quadrotor UAV. In Unmanned Aircraft Systems (ICUAS), 2014 International Conference on, pages 115–125, May 2014. [50] P. T. Jardine, S. N. Givigi, and S. Yousefi. Experimental results for autonomous model-predictive trajectory planning tuned with machine learning. In 2017 Annual IEEE International Systems Conference (SysCon), pages 1–7, April 2017. [51] AT. Hafez, M. Iskandarani, S.N. Givigi, S. Yousefi, Camille Alain Rabbath, and A Beaulieu. Using linear model predictive control via feedback linearization for dynamic encirclement. In Proc. of the American Control Conf., pages 3868–3873, Jun 2014.
[52] Michael J. Korenberg. Fast orthogonal identification of nonlinear difference equation models. Proceedings of the 30th Midwest Symposium on Circuits and Systems, 1:90–95, 1987. [53] K.M. Adeney and M.J. Korenberg. Iterative fast orthogonal search algorithm for mdl-based training of generalized single-layer networks. Neural Networks, 13(7):787 – 799, 2000. [54] M. J. Korenberg. Functional expansions, parallel cascades and nonlinear difference equations. In V.Z. Marmarelis, editor, Advanced methods of physiological system modelling, pages 221–240. USC Biomedical Simulations Resource, Los Angeles, 1987. [55] Michael J. Korenberg and Larry D. Paarmann. Applications of fast orthogonal search: Time-series analysis and resolution of signals in noise. Annals of Biomedical Engineering, 17(3):219–231, 1989. [56] G. Johns, E. Morin, and K. Hashtrudi-Zaad. Force modelling of upper limb biomechanics using ensemble fast orthogonal search on high-density electromyography. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 24(10):1041–1050, Oct 2016. [57] L. I. Nahlawi and P. Mousavi. Fast orthogonal search for genetic feature selection. In 2010 Annual International Conference of the IEEE Engineering in Medicine and Biology, pages 1077–1080, Aug 2010.
[58] D. R. McGaughey, V. Dagenais, and S. P. Pecknold. Improved torpedo range estimation using the fast orthogonal search. IEEE Journal of Oceanic Engineering, 35(3):595–602, July 2010. [59] Zhi Shen, Jacques Georgy, Michael J. Korenberg, and Aboelmagd Noureldin. Low cost two dimension navigation using an augmented kalman filter/fast orthogonal search module for the integration of reduced inertial sensor system and global positioning system. Transportation Research Part C: Emerging Technologies, 19(6):1111 – 1132, 2011. [60] M. Atia, C. Donnelly, A. Noureldin, and M. Korenberg. A novel systems integration approach for multi-sensor integrated navigation systems. In 2014 IEEE International Systems Conference Proceedings, pages 554–558, March 2014. [61] R. Sathya and A. Abraham. Comparison of supervised and unsupervised learning algorithms for pattern classification. International Journal of Advanced Research in Artificial Intelligence, 2:2–2, 2013. [62] Simon S. Haykin. Neural networks and learning machines. Pearson Education, Upper Saddle River, NJ, third edition, 2009. [63] Leslie Pack Kaelbling, Michael L. Littman, and Andrew P. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996. [64] Abhijit Gosavi. Reinforcement learning: A tutorial survey and recent advances. INFORMS J. on Computing, 21(2):178–192, April 2009. [65] Richard Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, USA, 1 edition, 1957.
[66] Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998. [67] C. Watkins and P. Dayan. Q-learning. Machine Learning, 1:279–292, 1992. [68] S. M. Hung, S. N. Givigi, and A. Noureldin. A dyna-q (lambda) approach to flocking with fixed-wing uavs in a stochastic environment. In 2015 IEEE International Conference on Systems, Man, and Cybernetics, pages 1918–1923, Oct 2015. [69] M. A. Wiering and H. van Hasselt. Two novel on-policy reinforcement learning algorithms based on td(gamma)-methods. In 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, pages 280–287, April 2007. [70] Kumpati S. Narendra and Mandayam A. L. Thathachar. Learning Automata: An Introduction. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1989. [71] M. A. L. Thathachar and P. S. Sastry. Networks of Learning Automata: Techniques for Online Stochastic Optimization. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2003. [72] Maxime Lecointe, Caroline Ponzoni Carvalho Chanel, and Francois Defay. Backstepping control law application to path tracking with an indoor quadrotor. In Proceedings of European Aerospace Guidance Navigation and Control Conference (EuroGNC), pages pp.1–19, Toulouse, FR, April 2015. [73] Lorenz Meier, Petri Tanskanen, Lionel Heng, Gim Hee Lee, Friedrich Fraundorfer, and Marc Pollefeys. Pixhawk: A micro aerial vehicle design for autonomous
flight using onboard computer vision. Auton. Robots, 33(1-2):21–39, August 2012. [74] S. Hong, C. Lee, F. Borrelli, and J. K. Hedrick. A novel approach for vehicle inertial parameter identification using a dual Kalman filter. IEEE Transactions on Intelligent Transportation Systems, 16(1):151–161, Feb 2015. [75] T. A. Wenzel, K. J. Burnham, M. V. Blundell, and R. A. Williams. Dual extended Kalman filter for vehicle state and parameter estimation. Vehicle System Dynamics, 44(2):153–171, 2006. [76] M. Abdolhosseini, Y.M. Zhang, and C.A. Rabbath. Trajectory tracking with model predictive control for an unmanned quad-rotor helicopter: Theory and flight test results. In Proceedings of the 5th International Conference on Intelligent Robotics and Applications - Volume Part I, pages 411–420, Berlin, Heidelberg, 2012. Springer-Verlag. [77] A. Nemati and M. Kumar. Modeling and control of a single axis tilting quadcopter. In 2014 American Control Conference, pages 3077–3082, June 2014. [78] S. R. Barros dos Santos, S. N. Givigi, and C. L. Nascimento. Autonomous construction of multiple structures using learning automata: Description and experimental validation. IEEE Systems Journal, 9(4):1376–1387, Dec 2015. [79] Shouling He. Feedback control design of differential-drive wheeled mobile robots. In ICAR ’05. Proceedings., 12th International Conference on Advanced Robotics, 2005., pages 135–140, July 2005.
[80] G. Cao, E. M. K. Lai, and F. Alam. Gaussian process model predictive control of unmanned quadrotors. In 2016 2nd International Conference on Control, Automation and Robotics (ICCAR), pages 200–206, April 2016. [81] M. K. Samal, M. Garratt, H. Pota, and H. T. Sangani. Model predictive flight controller for longitudinal and lateral cyclic control of an unmanned helicopter. In 2012 2nd Australian Control Conference, pages 386–391, Nov 2012. [82] Peter T. Jardine, Sidney Givigi, and Shahram Yousefi. Parameter tuning for prediction-based quadcopter trajectory planning using learning automata. IFACPapersOnLine, 50(1):2341 – 2346, 2017. 20th IFAC World Congress.