Optimization of Wavelet Basis Controllers for Nonlinear Systems with applications to Learning Control Systems

by Harry George Direen Jr. B.S.E.E. University of California at Irvine, 1982

A thesis submitted to the Faculty of the Graduate School of the University of Colorado at Colorado Springs in partial fulfillment of the requirements for the degree of Doctor of Philosophy Department of Electrical Engineering 1996

This thesis for the Doctor of Philosophy degree by Harry G. Direen Jr. has been approved for the Department of Electrical Engineering by

Charles E. Fosha

Josh Alspector

Mark Wickert

Date ________________


Direen, Harry G. Jr. (Ph.D., Electrical Engineering) Optimization of Wavelet Basis Controllers for Nonlinear Systems with applications to Learning Control Systems. Thesis directed by Professor Pieter A. Frick with continued support by Associate Professor Adjunct Charles E. Fosha. A learning control system may be defined as a system that improves its performance with experience. This is a common trait of systems that use humans as the main control element: automobiles, aircraft, children balancing a broom on their hand, and so on. Ascribing this characteristic to inanimate controllers is a nontrivial task. This thesis examines methods of improving control of general nonlinear systems with experience, where "experience" is gained through monitoring normal operation of the given system. The primary contribution of this thesis is the development of local conditions for improvement of the control law based on a given objective function. The application of these conditions leads to an optimized control law in a limited sense. Improvements to the control law are made only in regions of the state space where the system is known to be asymptotically stable, and the conditions for improvement include conditions that guarantee the "improved" system remains asymptotically stable. The learning control system relies on being able to make local changes to the control law.

Because of their localized support and orthogonal structure, the basis functions used to represent the control law are drawn from a wavelet multiresolution analysis. The multiresolution structure provides a mechanism for starting with a coarse level of control and adding detail to the control law as required. Wavelet multiresolution analysis is reviewed and shown to map into a popular neural network topology, which provides a massively parallel structure for implementing the controller. Practical methods of applying the control law improvement theory are addressed and examples provided. Finally, methods of expanding the region of asymptotic stability are proposed.


This work is dedicated to my wife Susan, my two sons Randy and James, and to our Creator, who makes all things possible.


Acknowledgments

I would like to express my deepest appreciation to all of those who have supported me on the long road towards my doctorate. I thank the management of ETO, Dick Ehrhorn, Tim Coutts, Don Fowler, and Steve Christensen, for financial and moral support without which this endeavor would not have been possible. I would like to thank my coworkers at ETO for their patience, support, and for covering my hide over the years it has taken to reach this point. Special thanks to David Leupp for his careful reading of the thesis and helpful hints. I would like to thank my advisory committee: Professors Pieter Frick, Charles Fosha, John Hauser, Miloje Radenkovic, Mark Wickert, Keith Philips, Marijke Augusteijn, and Josh Alspector, for their help and support. I am indebted to Dr. Frick for taking me under his wing early on and helping me get started on this thesis. I wish to thank Dr. Fosha for taking over for Dr. Frick when Dr. Frick moved on to become Dean at San Diego State. I wish to thank Dr. Alspector for volunteering at the last moment to join my committee. Special thanks to Dr. Radenkovic for his support at a critical time in moving the thesis forward. I would like to thank my parents and my brothers and sisters for their never-ending support and encouragement, along with many friends who have been there for me. My greatest appreciation goes to my family. Without the undying support and love of my wife, Susan, and sons Randy and James, I would have given up a long time ago. They are the ones who make it all worthwhile. Finally, it is only by the grace of God that any of this is possible. The only talents I have are those which have been handed down by the grace and love of our heavenly Father.

Contents

CHAPTER

1 INTRODUCTION
  1.1 Types of Control Systems
  1.2 Neural Network and Fuzzy Logic Control
  1.3 A Human Approach to Control
  1.4 Wavelet Basis Functions
  1.5 Thesis Overview

2 PLANT DEFINITIONS, STABILITY, AND OBJECTIVE FUNCTIONS
  2.1 Plant Definition, Region of Operation, and General Control Law
  2.2 Feedback Control Law
  2.3 Stability Considerations
  2.4 The Objective Function

3 FUNCTION APPROXIMATION AND SYNTHESIS USING WAVELET BASIS FUNCTIONS AND MULTIRESOLUTION ANALYSIS
  3.1 Preliminary Notations
  3.2 Wavelets and the Continuous Wavelet Transform
  3.3 Multiresolution Analysis
  3.4 The Haar Multiresolution Analysis
  3.5 Daubechies' Orthonormal Wavelets
  3.6 Calculating the Scaling Function and Wavelet from the Filter Coefficients
  3.7 Example
  3.8 Higher Order Wavelet Bases and Multiresolution Analysis
  3.9 Summary

4 THE SCALING FUNCTION AND NEURAL NETWORKS
  4.1 Function Approximation by a Finite Combination of Basis Functions
  4.2 Scaling Function Example
  4.3 Neural Network Topology
  4.4 Neural Network Training
  4.5 Summary

5 IMPROVING CONTROL ON A KNOWN REGION OF ASYMPTOTIC STABILITY
  5.1 Plant, Control Law, Regions of Operation, and Objective Function
  5.2 Conditions for Pointwise Improvement of the Cost-to-Go
  5.3 Conditions for the L1 Average Improvement of the Cost-to-Go
  5.4 Optimization of the Wavelet Basis Controller on the Known RAS
  5.5 Summary

6 THE LEARNING CONTROL PROCESS
  6.1 The Basic Process
    6.1.1 Plant Model
    6.1.2 Region of Operation
    6.1.3 Initial Control Law
    6.1.4 Determining the RAS
    6.1.5 Performance Monitor (Approximation of the Cost-to-Go)
    6.1.6 Learning Algorithm
  6.2 Example Problem
  6.3 Expanding the RAS

7 CONCLUSIONS AND FUTURE DIRECTIONS

BIBLIOGRAPHY


List of Figures

Figure 2.1 Full State Feedback Control System
Figure 3.1 Wavelet
Figure 3.2 Wavelets with different Scales and Translations
Figure 3.3 Scaling Function
Figure 3.4 Haar Scaling Function and Wavelet
Figure 3.5 Scaling Function as Calculated from the Low-pass Filter Coefficients
Figure 3.6 Function Approximation using Scaling Function and First Two Wavelet Scales
Figure 4.1 Scaling Function Fourier Transform, φ̂(ν)
Figure 4.2 Wavelet Fourier Transform, ψ̂(ν)
Figure 4.3 Spatial Frequency Spectrum of f(x)
Figure 4.4 Sampled Spatial Frequency Spectrum, ĉ_s(ν)
Figure 4.5 Example Approximation in R²
Figure 4.6 Unit Lattice
Figure 4.7 Three Layer Neural Network
Figure 4.8 Hidden Layer Node Structure
Figure 5.1 Regions of Operation
Figure 6.1 Regions of Operation
Figure 6.2 System Block Diagram
Figure 6.3 Inverted Pendulum
Figure 6.4 Initial Control Surface
Figure 6.5 Wavelet Approximation of the Initial Control Surface
Figure 6.6 Inverted Pendulum RAS
Figure 6.7 Initial Cost-to-Go
Figure 6.8 Intermediate Control Surface
Figure 6.9 Final Control Surface
Figure 6.10 Intermediate Cost-to-Go
Figure 6.11 Final Cost-to-Go
Figure 6.12 Initial System Trajectory
Figure 6.13 Intermediate System Trajectory
Figure 6.14 Final System Trajectory
Figure 6.15 Control Along Trajectories
Figure 6.16 Cost-to-Go along Trajectories

Chapter 1 Introduction

It is fascinating to watch a child, or an adult for that matter, learn to do some relatively complex task, such as ice skating, gymnastics, or a wide selection of other tasks. As human beings we have the designed-in capacity to learn to do an incredible variety of things, including control of complex systems, processes, machinery, and so on. Taking even a cursory look at the control of robotic arms and manipulators gives a great appreciation of the "simple" ability to pick up a cup of coffee and get it to our lips without spilling it. It has often been the frustration of a control system designer to try to implement, in hardware or software, a controller for a task which a human operator seems able to handle with relative ease. It is this recognition of the innate capability of humans to learn to control complex, nonlinear systems (our own appendages being a prime example) which raises the question of whether this capability, in some limited sense, can be imparted to inanimate (human designed) control systems. This dissertation embarks on a study of potential methods of endowing a controller with the capacity to learn to control in some limited sense. It would be foolish to think that in the course of one dissertation project, or even with the extended research of many, endowing a control system with the capacity to learn to control could even approach the God-given capacity of a human to learn to control. It would be almost equally foolish to ignore the great body of knowledge amassed over the last fifty to one hundred years concerning the design of control systems.


This thesis presents research and ideas into what may be described as learning control systems. Most of the concepts and theories employed are well known and well founded in the controls arena. It is primarily in the application of these concepts and theories that the novelty of the thesis is found.

1.1 Types of Control Systems

Control systems may be broken down into what I will define as four major classes: fixed control, adaptive control, learning control, and intelligent control. A fixed control system is by far the most widely used. As its name implies, once a fixed controller is designed and implemented for a given system, it remains unchanged, apart from parameter drifts, for the life of the system. No attempt is made with a fixed controller to change its characteristics to adapt in some way to changing parameters or characteristics of the system (plant¹) being controlled. An adaptive control system is: a control system which maintains consistent control performance in the presence of system uncertainties, parameter variations, or other changes in the system's characteristics². The controller has parameters which may be modified to accomplish this task. Adaptive control systems are typically further broken down into indirect or direct adaptive control. In indirect adaptive control, there is a mechanism for estimating the system (or plant) parameters on-line. Once the plant parameters are estimated, new controller parameters, which give the desired system performance, are computed and implemented based on the estimated plant parameters. This process of estimating plant parameters and computing new controller parameters is carried out continuously. An indirect adaptive controller is also called a self-tuning controller [51].

¹ In control systems, the system to be controlled is typically called the plant.

² In practice, an adaptive controller will attempt to maintain consistent control. It is rarely, if ever, possible to maintain the same control performance under all system variations. The definition is due to Slotine [51].

A direct adaptive controller, or model-reference adaptive controller (MRAC) [51], contains four key elements: a plant to be controlled with unknown parameters; a controller with variable parameters; a reference model which describes the desired plant output; and an adaptation law used to vary the controller parameters. In MRAC, the output of the plant is compared to the reference model output, generating an error signal. The error signal is fed to the adaptation law, which updates the controller parameters in such a fashion as to force the plant output to more closely match the model output. The direct adaptive controller, or MRAC, continuously makes adjustments to the controller to compensate for parameter changes in the plant based on a given, fixed, adaptation law. Refer to Slotine and Li [51], or other texts, for an introduction to adaptive control.

Learning control systems are relatively new and as yet not well developed or defined in the field of control. It would take little effort to start a lively debate among control system practitioners and theoreticians as to what the definition of a learning control system is or should be, and whether such a beast exists or is in any way different from an already defined adaptive controller. I will define a learning control system as: a control system in which the performance of the system is improved with experience. I will give justification for this definition a little later, along with defining what is meant by "experience". For the learning control system it is assumed that the plant remains constant, or relatively constant, during the "learning" process. An intelligent control system is: a control system which has some capacity to handle a new control situation based on a reasoning process. By a reasoning process I mean that the approach to control of a new situation is based on experience with previous situations that the control system "knows" how to control. Intelligent control systems are beyond the scope of this dissertation.
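To make the four MRAC elements described above concrete, the following is a minimal sketch for a first-order plant using the classic gradient ("MIT rule") adaptation law. The plant, reference model, gains, and signals are illustrative assumptions, not systems treated in this thesis.

```python
import numpy as np

# Minimal MRAC sketch for a scalar plant  x' = a*x + b*u  (a, b "unknown"),
# reference model  xm' = -am*xm + bm*r,  controller  u = k_r*r + k_x*x,
# with a gradient ("MIT rule") adaptation law driven by e = x - xm.
a, b = 1.0, 2.0          # plant parameters (used only to simulate the plant)
am, bm = 4.0, 4.0        # stable reference model with unit DC gain
gamma = 5.0              # adaptation gain
dt, T = 1e-3, 10.0

x = xm = 0.0
k_r = k_x = 0.0          # adjustable controller parameters, start at zero

for k in range(int(T / dt)):
    r = 1.0 if (k * dt) % 4.0 < 2.0 else -1.0   # square-wave reference input
    u = k_r * r + k_x * x                        # variable-parameter controller
    e = x - xm                                   # error w.r.t. model output
    # Adaptation law: nudge k_r, k_x to force the plant toward the model
    k_r -= gamma * e * r * dt
    k_x -= gamma * e * x * dt
    # Euler integration of plant and reference model
    x  += (a * x + b * u) * dt
    xm += (-am * xm + bm * r) * dt

print(f"final error {e:.4f}, k_r={k_r:.3f}, k_x={k_x:.3f}")
```

The adaptation law continuously adjusts the controller parameters in the direction that reduces the model-following error, which is exactly the loop described in the paragraph above.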


1.2 Neural Network and Fuzzy Logic Control

Before expounding on my definition of learning control systems, I would like to digress a bit and discuss linear versus nonlinear control and present some recent developments and trends in control systems. This is not an attempt at a detailed review, just some observations and notes. From a control practitioner's point of view, design of controllers for real-world systems can be a very challenging task. The overwhelming majority of control system design techniques assume that there is a valid, finite dimensional, linear model of the system, for which the linear control system design methodology is very mature. Unfortunately, virtually all real-world systems are nonlinear. Fortunately, though, many systems to be controlled can be reasonably approximated by a linear model over a desired region of operation. As long as the system remains in the linear region of operation modeled, and within the bandwidth modeled, the linear control system design works quite well. Linear control systems have been the cornerstone of the controls community since its inception. For systems where a linear model is not valid over the desired region of operation, other control techniques may be required. While much theoretical underpinning exists, nonlinear control system design is not nearly as well developed, or generally applicable, as linear control system design. Depending on the nature of the nonlinearities in the system to be controlled, different techniques may be employed or required. For instance, some nonlinear systems may be mapped into a linear system through feedback linearization over a region of operation, at which point linear design principles may be applied to the transformed system [51][21]. Gain scheduling is another technique whereby the system to be controlled is linearized around certain operating points.

Linear controllers are designed for each of these linearized systems, and then a method of smoothly switching between the different linear control laws as the system changes its operating point is employed. Other design techniques include sliding mode control and Lyapunov-based techniques [51][25].


It is perhaps the difficulties and the lack of generally applicable design methodologies associated with nonlinear control systems which have opened the door to the application of neural networks and Zadehan¹ logic to nonlinear control. Striving towards "learning" or "intelligent" control systems is also a strong motivation which fuels research in the area of artificial neural networks and Zadehan logic. Since the early eighties there has been an absolutely explosive growth in interest and research surrounding both. One conference alone, the 1994 IEEE International Conference on Neural Networks held in Orlando, Florida, produced seven volumes, nearly 5000 pages, of research papers covering neural networks. A substantial portion of the papers presented involved, either directly or indirectly, issues of control using neural networks. Manolis & Christodoulou [34] present a review of neural networks in control. Narendra and Parthasarathy at Yale have published a number of papers in this area. Reference [38] presents a review of their work. The interest, research, and application of Zadehan logic to control is on par with, and in terms of actual application ahead of, neural networks [57][5][26][56]. Much research has also been done in tying Zadehan logic and neural networks together for control applications [28][6].

¹ In honor of the founder of fuzzy logic, Professor Lotfi Zadeh, David Brubaker at the Fuzzy Logic 95 seminar in Burlingame, CA proposed changing the name "Fuzzy Logic" to "Zadehan Logic". In this dissertation, the two terms will be used interchangeably with preference given to Zadehan logic.

It is in the field of artificial neural networks, and to some extent Zadehan logic, where the term "learning" is ubiquitous. Neural networks are massively parallel structures which in some loose fashion are modeled after, or inspired by, what is known about the human brain. Key features of neural networks include a massively parallel, highly interconnected structure which contains interconnection weights. The interconnection weights are modified by an algorithm whereby the structure "learns" some useful task. The useful task is often to learn a nonlinear mapping from a set of input variables to a set of output variables. The network, in a training phase, is presented


a set of input and output pairs of data (exemplars). A training algorithm (often backpropagation, which is a form of gradient descent) modifies the neural network's internal connection weights in such a fashion as to reduce an error measure of the difference between the network's output and the desired output given in the data set. The neural network then generalizes, or forms a mapping between all possible input variables and the output space. Depending on the neural network topology, the number of elements or nodes in the network, the training data, and the training algorithm, the neural network may or may not "learn" the desired mapping. Reference [14] is a good introductory text on the subject of neural networks. The application of neural networks to control systems has to date been largely ad hoc. Typically a neural network is used in place of the feedback controller. A training algorithm, often a variant of gradient descent, is used to adjust the neural network's connection weights until the closed loop control system performs the desired task. The approach is usually met with some success. Unfortunately these systems are analytically intractable. Even if the plant being controlled is linear, the introduction of a nonlinear neural network makes the overall system mathematically intractable, especially in the area of stability analysis. This is very undesirable from the control practitioner's standpoint. Sanner and Slotine [48] were among the first to present a mathematically tractable approach to the use of neural networks in control. It was their work that inspired some of the directions taken in this thesis.
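To make the training phase described above concrete, here is a minimal sketch of gradient-descent (backpropagation) weight updates for a one-hidden-layer network fit to exemplar pairs. The architecture, learning rate, and target function are illustrative assumptions, not the networks developed later in this thesis.

```python
import numpy as np

# Tiny one-hidden-layer network trained by gradient descent (backpropagation)
# on exemplar pairs (x, y). Everything here is illustrative.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, (200, 1))        # input exemplars
Y = np.sin(np.pi * X)                        # desired outputs (example target)

W1, b1 = rng.normal(0, 0.5, (1, 16)), np.zeros(16)   # input -> hidden weights
W2, b2 = rng.normal(0, 0.5, (16, 1)), np.zeros(1)    # hidden -> output weights
lr = 0.05

for epoch in range(2000):
    H = np.tanh(X @ W1 + b1)                 # hidden layer activations
    Yhat = H @ W2 + b2                       # network output (linear output node)
    E = Yhat - Y                             # error w.r.t. desired output
    # Backpropagate the squared-error gradient through each layer
    gW2 = H.T @ E / len(X)
    gb2 = E.mean(0)
    dH = (E @ W2.T) * (1.0 - H**2)           # tanh' = 1 - tanh^2
    gW1 = X.T @ dH / len(X)
    gb1 = dH.mean(0)
    # Gradient-descent update of the interconnection weights
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

print("final mse:", float((E**2).mean()))
```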

Zadehan logic provides a means of mapping a linguistic description of a control process to the actual control of a system. Zadehan logic is claimed to be particularly attractive for difficult systems where a good model of the system is not obtainable, yet an experienced human operator can control the system. Fuzzy logic control has been applied to a wide variety of products and systems in Japan and, more recently, in the US and throughout the world. Yamakawa [57] presents a good tutorial on fuzzy logic control. Kosko [28] and Yegar [56] present good treatments of fuzzy set theory and fuzzy control. References [5] and [26] present methods of training or adapting fuzzy controllers. Zadehan or fuzzy logic controllers suffer the same basic problem of mathematical intractability for a feedback control system as do neural networks. This, coupled with the ad hoc nature of the application of fuzzy logic controllers to control systems, is (I suspect) the reason this technology is met with such disdain in a large portion of the controls community. Another key issue with fuzzy logic controllers is that it is unclear how to adjust their parameters to optimize overall system performance. The methods of optimizing system performance tend to be cut-and-try. This goes back to the mathematical tractability of these control systems.

1.3 A Human Approach to Control

Often one of the key reasons for applying both neural network and Zadehan logic techniques to control systems is an attempt to mimic in some fashion a human approach to control. So, just what is a human approach to control? In control theory, the first requirement is to have a mathematical model of the plant to be controlled. This model is in the form of a set of differential equations. From the model a control strategy or law may be developed. A human does not use a set of differential equations, at least not explicitly. A child learning to balance a broom on his hand has no concept of what a set of differential equations is. But he will start to develop a mental model of the broom balancing problem. He learns through trial and error, or watching someone else, that if the broom starts falling to his right, he must quickly move his hand in the same direction to bring it back into balance. He also learns that if he moves his hand too fast to the right, the broom will swing past upright and off to the left. With practice, he will learn that if he has the broom in balance and wants to move several feet to the right, he actually needs to first move his hand to the left. This starts the broom falling to the right, whereby he can move to the right several feet, bringing the broom back into balance as he reaches


his desired location. In effect, a mental model of the broom balancing problem is formulated, along with a control scheme for balancing the broom. The mental model does not have to be highly exact in order to balance the broom, but it is nevertheless formed. In the course of learning to balance the broom, the child starts with a rough model of the plant (broom on hand) and a rough control strategy. The child has awareness of the angular position and angular speed of the broom, along with the relative position and speed of his hand. In control terminology, we call the angular position and angular speed of the broom, along with the horizontal position and speed of the hand, the states of the plant. The child is in effect able to measure these states to some limited degree. His control input is the force of his hand to the left or right. The mental model is an awareness, given the current state of the system, of what the state of the system will be a short time later. In mathematical language we write:

x(t + Δt) = f[x(t), u]                                        (1.1)

where

x = [ broom angular position, broom angular velocity, hand position, hand velocity ]ᵀ and u = hand force (left or right).

The control strategy is a plan of action whereby, given the broom's angular position and velocity along with hand position and velocity (the state of the system), the child knows that he must move his hand left or right with a certain force to try and bring the broom back into balance. Mathematically we say:

u = g(x)                                        (1.2)


We will say that when the broom is in balance, x = 0. So what the child effectively uses to balance the broom is a full state feedback control system. His initial plant model, f₀(x, u), and control strategy, g₀(x), may be rather crude. With practice, the child will refine and improve his model and control law such that the balancing becomes "easier" and "smoother". Inherent to improving the control is some concept of what better control is. The child is making improvements in his ability to balance the broom as measured against some concept of what better is, i.e. easier and smoother. The assignment of better or worse is typically based on bringing the broom back to balance from some starting point. In other words, there is some inherent objective or cost function in which a cost is assigned to each initial starting state. The child then works to minimize this cost, which is an optimization process. In this example of a child learning to balance a broom, it is assumed that the child has watched someone else balance a broom, so that he has some initial mental model of the broom balancing problem along with an initial control strategy for balancing the broom. Initially he may only be able to bring the broom back to balance from a limited range of starting points, and the control will be rough. With experience, he will learn to bring the broom back to balance with more finesse and from a larger range of starting points. Based on the above example of a human approach to control, the definition of a learning control system as a control system in which the performance is improved with experience is justified. The sketch below casts this broom-balancing loop in the notation of equations (1.1) and (1.2).
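As a minimal illustration of equations (1.1) and (1.2), the following sketch simulates a crude full state feedback law u = g(x) on a simplified broom-on-hand model. The dynamics, gains, and constants are illustrative assumptions, not the plant model used later in the thesis.

```python
import numpy as np

# State x = [broom angle, broom angular velocity, hand position, hand velocity];
# input u = hand acceleration (force per unit mass). Simplified broom-on-hand
# dynamics with illustrative constants.
g0, L, dt = 9.81, 1.0, 0.001

def f(x, u):
    """x(t+dt) = f[x(t), u]: one Euler step of the plant, cf. equation (1.1)."""
    th, w, p, v = x
    th_dd = (g0 / L) * np.sin(th) - (u / L) * np.cos(th)  # broom angular accel.
    return np.array([th + w * dt, w + th_dd * dt, p + v * dt, v + u * dt])

def g(x):
    """A crude initial full state feedback law u = g(x), cf. equation (1.2)."""
    K = np.array([30.0, 8.0, 1.0, 2.0])                   # hand-tuned gains
    return float(K @ x)

x = np.array([0.1, 0.0, 0.0, 0.0])   # broom starts 0.1 rad off balance
for _ in range(5000):
    x = f(x, g(x))
print("state after 5 s:", np.round(x, 4))
```

Refining g(x) (and the child's internal f) so that such trajectories return to x = 0 faster and more smoothly is exactly the "improvement with experience" the definition above captures.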

1.4 Wavelet Basis Functions

In the example of a human approach to control, a rough model and control law are started with, and then the model and control law are improved with experience. It can be said that "detail" is added to the model and control law with experience. Wavelets and multiresolution analysis are relatively new fields in applied mathematics which support


the concept of starting with a rough model and adding more and more detail to that model. Mallat [33][32] and Meyer [35] formulated the concept of multiresolution analysis. The idea came out of the need to describe mathematically the increments of information, in image analysis, of going from a coarse approximation to one of higher resolution. Wavelets provide a method of breaking a function up into component parts using translations and dilations of a base function (mother wavelet). A wavelet is localized in both spatial region and spatial frequency. A linear combination of translations and dilations of a mother wavelet can be used to represent given functions, i.e. plant models and control laws. A set of compactly supported, orthonormal wavelets are the basis functions used in a multiresolution analysis. Introductions to wavelets and multiresolution analysis can be found in references [13][27] & [58]. Wavelets map cleanly into the massively parallel structure of neural networks. The importance of this lies in the efficiency of computation. A wavelet decomposition of a function typically results in a large number of parameters. The large number of parameters gives great flexibility in synthesizing control laws but, on the other hand, can mean a long, slow computational process if the computation is carried out in serial fashion. The use of a massively parallel structure alleviates this problem. The practicality of the approaches presented in this thesis may well have to await the availability of neural network hardware in which to implement the wavelet basis functions. The coarse-to-fine character of a multiresolution approximation is illustrated in the sketch below.
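To illustrate the coarse-to-fine idea with the simplest orthonormal wavelet family, the following sketch approximates a function with Haar scaling functions at a coarse scale and then refines the approximation scale by scale; the target function and scales are illustrative assumptions.

```python
import numpy as np

# Coarse-to-fine Haar multiresolution approximation of a function on [0, 1).
# The projection at scale j equals the coarse average plus all Haar wavelet
# detail up to scale j (illustrative example only).
x = np.linspace(0.0, 1.0, 1024, endpoint=False)
f = np.sin(2 * np.pi * x) + 0.3 * np.sign(x - 0.6)   # function to approximate

def haar_approx(f, j):
    """Project f onto the span of Haar scaling functions at scale j,
    i.e. piecewise-constant averages over 2**j dyadic intervals."""
    bins = f.reshape(2**j, -1)
    return np.repeat(bins.mean(axis=1), f.size // 2**j)

for j in (2, 4, 6):
    err = np.sqrt(np.mean((f - haar_approx(f, j))**2))
    print(f"scale j={j}: {2**j:3d} coefficients, rms error {err:.4f}")
# The error shrinks as wavelet detail is added, mirroring the idea of a
# rough model that "learns" more detail with experience.
```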

1.5 Thesis Overview

This dissertation embarks on a study of potential methods of endowing a system with the capacity to learn to control. Using the restricted definition given above, this means that we will be looking at methods of providing a control system with the capacity to improve the system performance with experience. Experience for a control system may be gained by running trajectories from various starting points within the given


region of operation. To further limit the scope, we will be looking at regulator problems as opposed to tracking control problems. Regulator problems are fundamental in the field of control and include the broom balancing problem given above. The primary contribution of this thesis lies in providing a method for improving, or optimizing, a controller on a region of the state space where the system is initially known to be stable. Control laws parameterized with wavelet basis functions are used because of the multiresolution structure and the orthogonal properties of the wavelet basis functions. A key feature of the optimization method is the guaranteed stability of all of the intermediate control functions. This property allows the optimization method to be used in a learning control system environment. The outline of the dissertation is as follows. Chapter 2 provides the initial discussion and definition of the systems or plants, along with the basic form of the control law, which will be addressed in this thesis. Next, the very important issue of system stability and stability analysis techniques is taken up. Constraints on changes to a stabilizing control law are identified which ensure the new system remains stable. Finally, an objective function which defines the system performance is defined and characterized. The objective function is shown to be useful for defining system performance and for establishing system stability. Chapters 3 and 4 address function approximation and synthesis capabilities using wavelet basis functions. Function approximation and synthesis is at the foundation of any learning control system architecture. These chapters provide an overview of wavelet based multiresolution analysis. Chapter 4 also shows that the wavelet based multiresolution analysis may be mapped into a popular neural network topology. While this is not required from a theoretical basis, from a practical point of view neural networks provide a massively parallel structure in which to implement the wavelet basis functions.


Chapter 5 presents the theoretical foundations for improving the control law on a known region of asymptotic stability. The chapter develops the conditions for improvement while ensuring each iteration of the control law will remain asymptotically stable. This chapter contains the prime area of original research for this thesis. Chapter 6 describes the overall process of the learning control system. This chapter brings together the theoretical work and walks through the basic process of a system which improves its control law with experience. An example system is worked through to help solidify the concepts and provide credence to the methods. Chapter 7 summarizes the thesis and provides ideas and directions for further research.

Chapter 2

Plant Definitions, Stability, and Objective Functions

This chapter will define: the types of nonlinear systems (plants) which will be addressed in the thesis; the region of operation of the systems; the general form of the feedback control law; and the objective function, which defines the performance level of the chosen control law. Time will also be spent on the very important issue of stability of the control system.

2.1 Plant Definition, Region of Operation, and General Control Law

For this thesis we will be considering general nonlinear, autonomous¹, plants, P, of the form

P : ẋ = f(x, u)                                        (2.1)

where x ∈ R^N, u ∈ R^M, and ẋ ≡ (d/dt)x. It will be assumed that f : R^N × R^M → R^N is continuous, that the first partial derivatives with respect to x and u exist and are continuous, and that f is U asymptotically controllable on Ω_P, a compact², simply connected subset of R^N containing a neighborhood of the origin. By U asymptotically controllable on Ω_P it is meant that for each initial starting point x ∈ Ω_P there exists a control law, u_x ∈ U, which takes the plant P from x to the origin, where U is a suitably defined set of control laws. The class of functions, U, will be defined below and in more detail in the ensuing chapters. A more precise definition of U asymptotic controllability will be given in section 2.3.

¹ A system is said to be autonomous if it does not depend explicitly on time. This is equivalent, for a linear system, to saying that the system is time invariant [51].

² A subset of R^N is compact if it is closed and bounded [55].

Trajectories (solutions) of P will be denoted:

x(t) = φ_t^u(x)                                        (2.2)

where x is the starting point of the trajectory at t = 0, t ∈ R, and u ∈ U. Properties of φ_t^u(x) include:

• φ_0^u(x) = x

• φ_t2^u ∘ φ_t1^u (x) = φ_t^u(x)  ∀ t = t1 + t2

A couple of notes concerning the plant and the region of operation are in order. First, equation (2.1) is the general form of a large class of nonlinear systems which includes linear systems as a subclass. The restriction that the plants of concern are autonomous will alleviate concerns that the plant is somehow changing over time while a control law is being "learned". The restriction that the system is relatively smooth (first partial derivatives with respect to x and u exist and are continuous) will help to reduce some of the more esoteric mathematical details. I will leave it to the more mathematically astute to expand the class of functions, provided of course that this work proves useful. The restriction that the plant is asymptotically controllable on a compact, simply connected subset of R^N containing a neighborhood of the origin can hardly be considered a restriction. If the plant of interest is not controllable using a "reasonable" class of control functions, U, then there is not much point in proceeding. It should also be apparent that for any practical system, the states will be bounded. In the broom balancing example given in chapter 1, there are clearly bounds on all states of the system (angular position and velocity, and relative hand position and velocity). In the control of a robot arm, there are clear limits on the arm position along with limits on joint velocities and available torque in actuators. Even in considering the position control of an interstellar spaceship, there are finite bounds on the relative position of concern along with finite bounds on attainable velocities (the speed of light). Bounds of a system, on both state and control effort, are often ignored in control system theory. But it is the bounds of a system which are often primarily responsible for introducing nonlinearities into an otherwise linear system. It is typically best, and will be required in this development, to include these bounds in the design considerations up front. This thesis will only be concerned with bounded, controllable systems. Though somewhat rudimentary, it is worth considering what equation (2.1) is telling us about the plant to be controlled. The equation tells us, given the current state of the plant, x, along with the control input to the plant, u, what magnitude and direction the states of the plant will take in the next time increment. From the controls perspective, the control input, u, at state position x, is what may be used to effect the magnitude and direction the states of the plant will take in the next time increment. Picking up the broom balancing example of chapter 1, the force of the hand left or right is used to effect the next time instance of the broom's angular position and velocity. Of course, the next angular position and velocity are constrained by the physics of the problem and the available force of the hand. The physics of the problem is contained in the description of the plant, (2.1). The control problem being considered is: given any starting point x₀ ∈ Ω_P, what control effort, u = g(x), u ∈ U, is required to drive the plant states back to the origin? This type of problem is known as a regulator problem and is fundamental in the field of controls. Note that the control effort is posed as a function of the current state position. It is generally known, and was shown by Kreisselmeier and Birkhölzer [29], that if a system is asymptotically controllable to the origin then a stabilizing controller exists in the form of state feedback. While their proof was developed for discrete systems, it is readily extended to the continuous case. Therefore, by choosing the control effort as a function of all of the states of the system, we are able to control the largest possible class of systems. Figure 2.1 shows a diagram of this system.

Figure 2.1 Full State Feedback Control System. (The controller u = g(x) drives the plant ẋ = f(x, u); the full state x is fed back to the controller.)

A full state feedback controller imposes the additional requirement that the states of the plant are measurable or that a suitable observer for reconstructing the states is available. We will assume that the states of the plant are available. The closed loop feedback system, P^c, may be described by:

P^c : ẋ = f(x, g(x))                                        (2.3)

with trajectories:

x(t) = φ_t^g(x)                                        (2.4)

The same notes for equations (2.1) and (2.2) apply to the closed feedback system. For a regulator problem, there is normally a reference input which defines the desired resting, final, or equilibrium state of the system. For a linear system, an offset or translation in one or more state variables does not affect the dynamics of the closed loop system, i.e. a feedback controller designed for a zero reference input will provide the same system dynamics when used with a non-zero reference input. For a nonlinear system, a translation in one or more state variables will in general affect the dynamics of the closed loop system. It will be assumed, without loss of generality, that x = 0, u = 0 is the equilibrium point of the system, i.e. f(0, 0) = 0. A reference (or other exogenous) input to the system may be handled by adding one or more zero-dynamic states to the system. For example, the equilibrium point, x_e, of the plant will in general be affected by the steady state control input, u_e = g(x_e), such that f(x_e, g(x_e)) = 0, where only a limited number of states are potential equilibrium states. Using the state and control transformation y = x − x_e and w = u − u_e, where ẏ = ẋ, gives the equivalent system ẏ = f(y + x_e, w + u_e), where y = 0, w = 0 is an equilibrium point for the new system. Considering x_e a state of the system, we can write:

ż = f̃(z, w)                                        (2.5)

where

z = [ y ; x_e ]  and  f̃(z, w) = [ f(y + x_e, w + g(x_e)) ; 0 ]                                        (2.6)

By equating a reference input with x_e, it is seen how a nonlinear regulator with reference input is handled by the addition of one or more zero-dynamic states to the system. Clearly the additional state(s) will not be controllable via the new plant input w.
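A minimal sketch of the state stacking in equations (2.5) and (2.6); the scalar plant and control law here are hypothetical placeholders chosen only to make the shapes concrete.

```python
import numpy as np

# Augment the plant state with the (zero-dynamic) equilibrium state x_e,
# per equations (2.5)-(2.6). f and g are illustrative placeholders.
def f(x, u):                       # example plant dynamics, x' = f(x, u)
    return -x + u

def f_tilde(z, w, g):
    """z = [y, x_e]; returns [f(y + x_e, w + g(x_e)), 0]."""
    n = z.size // 2
    y, x_e = z[:n], z[n:]
    return np.concatenate([f(y + x_e, w + g(x_e)), np.zeros(n)])

g = lambda x: -2.0 * x             # example steady-state control law
z = np.array([0.5, 1.0])           # y = 0.5 offset around equilibrium x_e = 1.0
print(f_tilde(z, w=np.array([0.0]), g=g))   # x_e component has zero dynamics
```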

2.2 Feedback Control Law

For linear plants, a linear feedback control law is typically used, where

u = g(x) = Σ_{i=1}^{N} a_i x_i                                        (2.7)

For u ∈ R¹, the graph of g(x) ∈ R^{N+1} is a hyperplane. Each of the coefficients, a_i, may be used to adjust the slope of the hyperplane along the x_i axis. The graph of g(x) ∈ R^{N+1} will be called the control surface. For u ∈ R^M, there will be M control surfaces. The linear feedback control law has N degrees of freedom for adjusting the control law or the control surface. Linear feedback is sufficient for linear plants. It may also be sufficient, and is often used, for systems with mild nonlinearities. For general nonlinear systems, linear feedback is not sufficient: either the system will not be stabilizable using linear feedback over the desired region of operation, or the system performance with linear feedback will be poor. For the learning control paradigm, linear feedback does not provide a method of making local adjustments to the control surface. Adjusting a single coefficient affects the whole control surface. It will be seen that the learning control paradigm rests on being able to make local variations to the control surface. Chapters 3 and 4 will take up the issue of function approximation and synthesis using a set of orthogonal wavelet basis functions. It will be seen that a very large class of control surfaces can be generated using these functions, and that the wavelet basis functions have some very nice properties for representing the control surface. One of the key properties will be that the wavelet basis functions considered have compact support¹, which implies that each wavelet function will only affect a limited region of the control surface. The general form of the control law, using wavelets, will be

u = g(x, α) = Σ_{I ∈ I_o} α_I ψ_I(x)                                        (2.8)

where I_o is a finite subset of Z^N with Q components, α_I ∈ R^M, α ∈ R^M × R^Q, and ψ_I(x) is a suitable set of wavelet basis functions covering Ω_P. The set of all g(x, α) such that ‖α_I‖ ≤ α_max, with α_max a finite, positive number, will be denoted U. Chapter 3 will discuss in detail the class of basis functions which will be used for ψ_I(x). Chapter 4 will take up the details of choosing parameters for the wavelet basis functions, ψ_I(x), in order to represent various classes of control laws. Also taken up in chapter 4 will be methods of determining an initial set of coefficients, α_0, in order to represent a given initial control law with equation (2.8). During the learning process, the control law will be changed via the coefficients of (2.8), α. Each change to one or more coefficients will represent a new control law (or control surface). With a slight abuse of notation, the various control laws will be defined

¹ The support of a function is defined as the closure of the set where the function is non-zero [55].

as g(x, α_k) = g_k(x), where "k" represents a particular choice of α. With this notation, the closed loop system becomes:

P_k^c : ẋ = f(x, g_k(x))                                        (2.9)

Trajectories of P_k^c will be denoted:

x(t) = φ_t^k(x)                                        (2.10)
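A minimal sketch of evaluating a control law of the form (2.8) with compactly supported basis functions; the tent-shaped basis and coefficients below are illustrative stand-ins for the wavelet families developed in chapters 3 and 4.

```python
import numpy as np

# Evaluate u = g(x, alpha) = sum over I of alpha_I * psi_I(x) on a 1-D state
# space. A compactly supported "tent" basis stands in for a wavelet basis;
# only basis functions whose support contains x contribute to the sum.
def psi(j, k, x):
    """Translate/dilate of a tent function supported on [k/2**j, (k+2)/2**j]."""
    t = (2.0**j) * x - k
    return np.maximum(0.0, 1.0 - np.abs(t - 1.0))

index_set = [(2, k) for k in range(-8, 8)]          # finite index set I_o
alpha = {I: np.sin(0.5 * I[1]) for I in index_set}  # illustrative coefficients

def g(x):
    # Compact support: each psi_I affects only a limited region of the control
    # surface, so changing one alpha_I changes the control law only locally.
    return sum(alpha[I] * psi(I[0], I[1], x) for I in index_set)

print(g(0.3), g(0.31))   # nearby states: only a few overlapping terms differ
```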

2.3 Stability Considerations

Of primary concern in the design of any feedback controller is the stability of the system. If the system becomes unstable, unpredictable results, including destruction of the plant, can occur. The problem becomes even more acute when considering any form of adaptive controller. The stability of the system must be guaranteed at each step of the adaptation (learning) process; otherwise, a system which starts out stable may be changed such that it becomes unstable. At this point the system may very well destroy itself before the controller can be changed to a new stable region. For stability analysis of nonlinear systems, the methods of Lyapunov are extremely powerful and widely used. Alexandr Mikhailovich Lyapunov was a Russian mathematician of the late 19th century. He published his work, The General Problem of Motion Stability, in 1892; it contains two methods of stability analysis: the so-called linearization method and the direct method. The linearization method studies the stability properties of a nonlinear system in local neighborhoods around an equilibrium point using a locally linearized system. The direct method is not restricted to a local neighborhood of an equilibrium point. The direct method studies the stability properties of the nonlinear system by constructing an "energy-like" function (Lyapunov function) for the system and then evaluating the time variations of this function. It is Lyapunov's direct method of stability analysis which will play a key role in the optimization (or improvement) of the control law on bounded regions of the state space.

stability properties of a nonlinear system in local neighborhoods around an equilibrium point using a locally linearized system. The direct method is not restricted to a local neighborhood of an equilibrium point. The direct method studies the stability properties of the nonlinear system by constructing an “energy-like” function (Lyapunov function) for the system and then evaluates the time variations of this function. It is Lyapunov’s direct method of stability analysis which will play a key role in the optimization (or improving) of the control law on bounded regions of the state space.

20 Due to the key role that stability analysis will play in the optimization process, it is worth reviewing stability definitions and Lyapunov stability analysis for nonlinear systems. The review follows Slotine & Li [51]. Integral to Lyapunov stability analysis is the concern for the region over which the system is stable. This region of stability will be expanded upon in the ensuing material. With this material in hand, we will find, given a Lyapunov function for a system along with a known region of stability, constraints on changes to the control system which will result in a stable system on the same region. Then, in section 2.4 we will show that the objective function defined for the performance analysis of the system will also serve as a Lyapunov function for the system. The stability of a system, P c , is defined with respect to an equilibrium state. It is about an equilibrium state in which we may speak of stability, asymptotic stability, and other forms of stability.

The next three definitions formalize the concepts of an

equilibrium state, stability in the sense of Lyapunov, and asymptotic stability. Definition 2.1 A state, xe , is an equilibrium state of a system if once the state equals xe ,

the system state remains equal to xe for all future times [51].

For P c , xe is an equilibrium state if f ( xe , g ( xe )) = 0 . In this dissertation, unless otherwise noted, it will be assumed without loss of generality that xe = 0 . Definition 2.2 An equilibrium state, xe = 0 , is said to be stable (or stable in the sense of

Lyapunov) if, for any ε > 0 , there exists a δ > 0, such that if

x(0) < δ , then

x ( t ) < ε for all t > 0 . Otherwise the equilibrium point is unstable. Definition 2.3 An equilibrium state, xe = 0 , is said to be asymptotically stable if: •

it is stable,



there exists some r > 0 such that x ( 0) < r implies that x ( t ) → 0 as t → ∞ .

In general, the region of asymptotic stability for P c will not be a symmetric ball around the equilibrium point. It is therefore said that P c is asymptotically stable on the

21 set Ω if P c is stable, and in addition, for all x ∈Ω we have φtg ( x ) → 0 as t → ∞ . Ω will be called a domain of attraction for P c . A second, and possibly more useful definition, of asymptotic stability, along with a precise definition for asymptotic controllability, has been provided by Kreisselmeier & Birkholzer [29]. For conceptual purposes it is worth reviewing these definitions. First, two required function classes are defined. Definition 2.4 A function ϕ : R + → R + is said to belong to the class Kϕ if it is

continuous, strictly increasing, and satisfies ϕ ( 0 ) = 0 and lim r→∞ ϕ ( r ) = ∞ . Definition 2.5 A function σ : R + × R + → R + is said to belong to class Kσ if

a) σ (r , t ) ∈ Kϕ for each fixed t ∈ R + , b) σ ( r , t1 ) > σ ( r , t 2 ) for all t 2 > t1 and r ≠ 0 , c) σ ( 0, t ) = 0 for all t ∈ R + , and d) limt →∞ σ ( r , t ) = 0 for all r ∈ R + .

Based on these two function classes, asymptotic controllability may be defined along with a second definition of asymptotic stability. Definition 2.6 Asymptotic Controllability

A plant, P, is said to be asymptotically

controllable on Ω (or from Ω to the origin), if there exists a σ ( r , t ) ∈ Kσ such that for all x ∈ Ω there exists a control function ux ( t ) such that

φtu ( x , t ) ≤ σ ( x , t ) ∀ t ∈ R + x

The plant will be said to be U asymptotically controllable if the additional restriction ux (t ) ∈ U , where U is a suitably defined set of control functions, is made. Definition 2.7 Asymptotic Stability (second definition) A system, P c : x = f ( x , g ( x )) , is

said to be asymptotically stable on Ω (or from Ω to the origin), if there exists a

σ ( r , t ) ∈ Kσ such that for all x ∈ Ω

22

φtg ( x ) ≤ σ ( x , t ) ∀ t ∈ R + This second definition of asymptotic stability also implies that the equilibrium point xe = 0 is stable in the sense of Lyapunov and that all trajectories starting in Ω converge to the equilibrium point, i.e. φtg ( x ) → 0 as t → ∞ ∀ x ∈ Ω . We now come to the pivotal theorem in the study of stability of non-linear systems which is Lyapunov's stability theorem: Theorem 2.1 Lyapunov's stability theorem: If in a ball Br = { x | x < r } there exists a

scalar function V(x) with continuous first partial derivatives such that •

V(x) is positive definite in Br



V ( x ) is negative semi-definite in Br

then the equilibrium point, x = 0 , of the system P c is stable. If V ( x ) is negative definite in Br then the system is asymptotically stable. V(x) is called a Lyapunov function.

A function, V(x), is positive definite if V(0) = 0 and V(x) > 0 ∀ x ≠ 0; the function is positive semi-definite if V(0) = 0 and V(x) ≥ 0 ∀ x. Negative definite and negative semi-definite are defined in a similar fashion. It should be clear that x in V(x) refers to the states of the system P^c, and

V̇(x)|_{ẋ = f(x, g(x))} ≡ (d/dt) V(φ_t^g(x))|_{t=0}

V̇(x) is a directional derivative of V(x), with the direction being the tangent of the system P^c trajectory at x. Refer to [51] for a proof of this theorem along with a more complete development of Lyapunov's stability methods. That Lyapunov's stability theorem is pivotal to the study of nonlinear systems is demonstrated by the work of Eduardo Sontag. Sontag [52] showed that a system is asymptotically controllable to the origin if and only if there exists a Lyapunov function for the system, and further [53], that if the Lyapunov function for the system is smooth, a smooth stabilizing controller exists.
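As a practical, non-rigorous companion to Theorem 2.1, the following sketch checks the sign conditions on V and V̇ = ∇V · f(x, g(x)) numerically on a grid of sample states. The plant, feedback law, and quadratic V are illustrative assumptions, and grid sampling is only evidence, not a proof.

```python
import numpy as np

# Numerically spot-check the Lyapunov conditions of Theorem 2.1 on a grid:
# V(x) positive definite and Vdot(x) = grad V(x) . f(x, g(x)) negative definite.
def f(x, u):                                   # example plant: damped pendulum
    return np.array([x[1], -np.sin(x[0]) - 0.5 * x[1] + u])

g = lambda x: -1.5 * x[0] - 1.0 * x[1]         # example stabilizing feedback

P = np.array([[2.0, 0.5], [0.5, 1.0]])         # V(x) = x' P x, with P > 0
V     = lambda x: x @ P @ x
gradV = lambda x: 2.0 * (P @ x)

worst = -np.inf
for x1 in np.linspace(-1.0, 1.0, 41):
    for x2 in np.linspace(-1.0, 1.0, 41):
        x = np.array([x1, x2])
        if np.linalg.norm(x) < 1e-9:
            continue                            # conditions hold trivially at 0
        Vdot = gradV(x) @ f(x, g(x))            # directional derivative of V
        worst = max(worst, Vdot)
print("max Vdot on grid:", worst, "(negative is consistent with asymptotic stability)")
```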

In Theorem 2.1, the domain of attraction is a symmetric ball around the origin. As already noted above, the domain of attraction for the systems we will be concerned with will in general not be symmetric. It is therefore important to define precisely what the domain of attraction of P^c will be, and to extend Lyapunov's stability theorem to this region of the state space. The closed loop systems, P^c, we are concerned with are by definition continuous. If P^c is asymptotically stable on Ω, a compact, simply connected set containing a neighborhood of the origin, then all trajectories starting from a point x₀ ∈ Ω will eventually terminate at the origin, which by definition is also in Ω. This does not, however, imply that the entire trajectory will remain in Ω. Define

Ω̄ = { x | x = φ_t^g(x₀), x₀ ∈ Ω, t ∈ R⁺ }                                        (2.11)

Clearly Ω̄ is a well defined compact set and Ω ⊂ Ω̄. This can be seen in that every point of Ω̄ is part of a trajectory originating in Ω (a compact set), where the trajectories are continuous, bounded, and terminate at 0 ∈ Ω. We therefore have that P^c is asymptotically stable on Ω̄ and that all trajectories which start in Ω̄ remain in Ω̄. Ω̄ will be called an invariant set for P^c. The sketch below approximates Ω̄ for an example system by sampling trajectories.
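A minimal numerical sketch of the construction in equation (2.11): sample starting points x₀ in Ω, integrate the closed loop system forward, and collect all visited states to approximate Ω̄. The closed-loop plant and sampling resolution are illustrative assumptions.

```python
import numpy as np

# Approximate Omega_bar = { phi_t^g(x0) : x0 in Omega, t >= 0 } by forward-
# simulating trajectories of an example closed-loop system x' = f(x, g(x)).
def f_cl(x):
    return np.array([x[1], -np.sin(x[0]) - 1.5 * x[0] - 1.5 * x[1]])

dt, steps = 0.01, 3000
omega_samples = [np.array([a, b]) for a in np.linspace(-1, 1, 9)
                                  for b in np.linspace(-1, 1, 9)]
visited = []
for x in omega_samples:
    for _ in range(steps):                    # Euler-integrate phi_t^g(x0)
        visited.append(x.copy())
        x = x + dt * f_cl(x)

visited = np.array(visited)
print("bounding box of sampled Omega_bar:",
      visited.min(axis=0).round(2), "to", visited.max(axis=0).round(2))
# Trajectories can leave Omega = [-1,1]^2 transiently; Omega_bar captures that.
```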

Definition 2.8 Invariant Set A set, Ω̄, is an invariant set for the system P^c if for every x ∈ Ω̄ the trajectory φ_t^g(x) ∈ Ω̄ for all t ≥ 0.

Invariant sets are a key concept in this thesis, in that we must know in what region of the state space our system will be operating or, conversely, on what region of the state space it is safe to operate the system. The following definition of the radius of a bounded set which contains a neighborhood of the origin will prove useful.

Definition 2.9 Radius of a set For simply connected, bounded sets, Ω, which contain a neighborhood of the origin, define:

ρ_R(Ω) = sup_{x ∈ Ω} ‖x‖                                        (2.12)

where "sup" refers to the supremum and ‖·‖ is the Euclidean norm. For further definitions of set theoretic and metric space concepts, refer to the books by Naylor & Sell [40], Bell [2], and Wheeden & Zygmund [55]. The following corollary expands Lyapunov's stability theorem to domains of attraction which are simply connected, bounded, invariant sets containing a neighborhood of the origin. These are the sets (regions of operation) with which we will be primarily concerned.

Corollary 2.1 Let V(x) be a positive definite, scalar, function with continuous first partial derivatives on the simply connected, bounded set Ω̄ which contains a neighborhood of the origin. In addition, let V̇(x) be negative definite on Ω̄. If Ω̄ is an invariant set for P^c then P^c is asymptotically stable on Ω̄.

Proof: (Follows Slotine & Li [51]) First, stability is shown. Given any positive number R, we must show that there exists a positive number r such that x ∈ B_r ∩ Ω̄ implies

φtg ( x ) ∈ BR ∩ Ω for all t > 0.

If R ≥ ρR ( Ω ) then clearly r = ρR ( Ω ) meets the

requirement. For R < ρR ( Ω ) , let m be the minimum of V(x) on the boundary of BR ∩ Ω . This minimum exits because V(x) is continuous and positive definite, and the boundary of BR ∩ Ω is a bounded set. Since V(0) = 0, there exists an r such that V ( x ) < m for all x ∈ Br ∩ Ω . For x ∈ Br ∩ Ω we have that V ( x ) is negative definite which implies V (φtg ( x )) < m therefore φtg ( x ) ∈ BR ∩ Ω for all t ≥ 0.

Asymptotic stability may be shown by contradiction.

Consider a trajectory

starting in Ω . Since Ω is an invariant set, the trajectory will remain in Ω . Along the trajectory φtg ( x ), V (φtg ( x )) decreases continually and since V ( x ) is lower bounded tends towards a limit L. Assume that this limit is not zero. Then since V is continuous and V ( 0) = 0 there exists a ball Br 0 in which the trajectory never enters. But since V ( x ) is

negative definite, V ( x ) must remain less than some strictly negative number, m. This is

25 a contradiction because it implies that V (φtg ( x )) decreases from an initial bounded value

V ( x0 ) to a value less than L in a finite time less than [V ( x0 ) − L] / ( − m) . Therefore all trajectories starting in Ω converge to the origin., We now have conditions for the system P c to be asymptotically stable on an invariant set Ω . Suppose we have an asymptotically stable system on an invariant set together with a valid Lyapunov function, what we would like to know is: what are the constraints on the changes to the control law which will ensure the new system remains asymptotically stable on the same invariant set. The following propositions attack this question. First another tool for working with sets is required. Definition 2.10 For compact, simply connected sets Ω1 and Ω2 with boundaries Γ1 and

    ρ_B(Ω1, Ω2) = min_{x ∈ Γ1, y ∈ Γ2} ‖x − y‖    (2.13)

Refer to [55] for a formal definition of the boundary of a set.

Proposition 2.1  Suppose that Ω1 is a compact, simply connected, invariant set for the continuous system P1^c: ẋ = f(x, g1(x)) with unique solutions. Let Ω2 be a compact, simply connected set such that Ω2 ⊂ Ω1 and δ = ρ_B(Ω1, Ω2) > 0. Let P2^c: ẋ = f(x, g2(x)) be a similar system to P1^c with unique solutions, where g2(x) = g1(x) ∀ x ∈ Ω1 − Ω2, and where g2(x) is bounded, continuous, and arbitrary on Ω2; then Ω1 is an invariant set for P2^c.

Proof: On the set Ω1 − Ω2, P2^c = P1^c, and for P1^c all trajectories which start in Ω1 − Ω2 remain in Ω1. For P2^c, any trajectory which leaves Ω2 must enter Ω1 − Ω2. This is true because the trajectories of P2^c are continuous and there is a minimum distance δ which must be crossed before the boundary of Ω1 could be reached. Since trajectories in Ω1 − Ω2 cannot leave Ω1, we have that Ω1 is an invariant set for P2^c. ∎

Proposition 2.2  Let Ω0 and Ω1 be simply connected, compact sets containing a neighborhood of the origin, such that: Ω1 ⊂ Ω0; δ = ρ_B(Ω1, Ω0) > 0; and Ω0 is an invariant set for P^c_{g0}. Define Ω0B = Ω0 − Ω1 as the boundary set between Ω0 and Ω1. Let V(x) be a positive definite, scalar function with continuous first partial derivatives on Ω0. In addition, let

    V̇_{g0}(x) ≡ V̇(x)|_{ẋ = f(x, g0(x))}

be negative definite on Ω0. If g1 ∈ U (a set of bounded, continuous functions on Ω0) is chosen such that g1(x) = g0(x) ∀ x ∈ Ω0B, and such that

    V̇_{g1}(x) ≡ V̇(x)|_{ẋ = f(x, g1(x))}

is negative definite on Ω1, then P^c_{g1} will be asymptotically stable on Ω0.

Proof: By proposition 2.1, Ω0 is an invariant set for P^c_{g1}. V(x) does not depend on the trajectories of the given system, so it remains positive definite. On Ω0B, φ_t^{g1}(x) = φ_t^{g0}(x), which implies V̇_{g1}(x) < 0 on Ω0B. By hypothesis, V̇_{g1}(x) < 0 on Ω1. Therefore V̇_{g1}(x) is negative definite on Ω0, which by corollary 2.1 implies that P^c_{g1} is asymptotically stable on Ω0. ∎

Proposition 2.2 provides a very interesting and important result. If we have a valid Lyapunov function for a system on an invariant set, we may change the control law on the interior of the set, leaving a boundary region unchanged, and, provided the control is changed such that the Lyapunov function remains valid, the new system will be asymptotically stable. This result will be a key building block, providing the fundamental constraint, for optimizing (or improving) control on a bounded region of the state space.

2.4  The Objective Function

To improve performance of a system we must have a method of measuring system performance; otherwise there is no way of knowing whether a given change in the control law makes the system better or worse. For the regulator systems being dealt with, we are concerned with how well the control law drives the system back to the origin given some starting point. In other words, the performance measure should take into account the whole system trajectory from a given starting point back to the origin. In the neural network literature, a critic subsystem is often used to evaluate system performance; the control neural network is trained based on the output of the critic network [36][54][5]. In optimal control theory an objective function is used to define system performance [23][8][31]. It is to the mature field of optimal control that we will turn for our performance analyzer (which is the same place critic networks often turn). Given an asymptotically stable system P^c (2.3), on the invariant set Ω̄, a somewhat arbitrary objective (or cost) function

    J(x₀) = ∫₀^∞ L(x, u) dt    (2.14)

can be created which defines the performance of P^c on Ω̄. The integral is taken along trajectories of P^c starting at x(0) = x₀ and ending at x(∞) = 0. The following restrictions will be placed on the objective function, (2.14), and a numerical sketch of evaluating such a function follows the list:

•  L(x, u) must be positive definite in x and positive semi-definite in u.
•  The partial derivatives of L(x, u) with respect to x and u must exist and be continuous.
•  J(x₀) < ∞ ∀ x₀ ∈ Ω̄.
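The sketch below is a minimal, hypothetical illustration (the plant f, control law g, and weighting L are invented for this example, not taken from the thesis): it estimates J(x₀) by integrating L along a simulated closed-loop trajectory, truncating the infinite horizon once the state has effectively reached the origin.

```python
import numpy as np

# Estimate J(x0) = int_0^inf L(x, u) dt for a toy scalar regulator by
# forward-Euler simulation of the closed-loop trajectory.

def f(x, u):                 # plant:  x' = f(x, u)  (hypothetical)
    return u

def g(x):                    # state-feedback control law u = g(x)
    return -2.0 * x

def L(x, u):                 # positive definite in x, semi-definite in u
    return x**2 + 0.1 * u**2

def J(x0, dt=1e-3, t_max=20.0, tol=1e-8):
    x, cost = x0, 0.0
    for _ in range(int(t_max / dt)):
        u = g(x)
        cost += L(x, u) * dt      # accumulate the running cost
        x += f(x, u) * dt         # Euler step along the trajectory
        if abs(x) < tol:          # trajectory has terminated at the origin
            break
    return cost

print(J(1.0))   # cost-to-go from x0 = 1 under this control law (about 0.35)
```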

As stated above, the cost function, J(x₀), is used to define the performance of the closed loop system P^c on the state space of interest. A system designer will shape the terms in L(x, u) to try to drive the system response to meet desired objectives. To help the system avoid, or cause the system to leave quickly, certain regions of the state space, larger values of L(x, u) would be assigned to those regions. To place a penalty against large control efforts, larger values of L(x, u) would be assigned to larger values of u. For instance, suppose our control problem is maintaining a car's position inside a lane going down a highway. As long as the car is inside the lane, and reasonably centered, low values would be assigned to L(x, u), where x represents the position in the lane relative to center and u is the control effort on the steering mechanism. As the car gets closer to the edge of the lane, or outside the lane, larger values would be assigned to L(x, u). Even larger values would be assigned to L(x, u) for leaving the road. To prevent excessive control effort, which could cause an uncomfortable ride, larger values would be assigned to L(x, u) for larger values of u, unless the car is well outside of the lane or off the road, in which case the priority is to get the car back into the desired lane. It is this type of logic that the control system design engineer will use to shape L(x, u). The integral of L(x, u) over a trajectory is our performance measure.

Note: The control effort will be in the form of state feedback, u = g(x), so L could have been defined to be simply a function of the state, x. While in fact L is simply a function of the state x, it is more intuitive to separate the state position requirements and the control effort requirements of L. The ensuing development will also require the separation of these terms.

The control law optimization process will work to adjust the control parameters, α (2.8), in such a fashion as to reduce or minimize the performance measure. Therefore the performance measure will be a function of the parameters α: J(x₀, α). Assuming that for a given set of control law parameters, α, the system P^c is asymptotically stable on the invariant set Ω̄, the following observations concerning J(x₀, α) can be made:

•  For a fixed control law (fixed α) the objective function, J(x₀, α), is a positive definite mapping from Ω̄ to R¹. The graph of this mapping can be viewed as a cost-to-go surface in R^{N+1}. For each x₀ ∈ Ω̄, a cost is associated with running the system P^c from x₀ back to the origin. This mental picture will be helpful in the ensuing development.

•  J(x₀, α) is continuous on Ω̄ with respect to x₀, and the partial derivatives with respect to x₀ exist and are continuous on the interior of Ω̄.

•  J(x₀, α) is a Lyapunov function for the system P^c.

This can be seen by

following J ( x0 , α ) along trajectories of the system. To clarify the situation, rewrite J ( x0 , α ) in the form ∞

J (φtg ( x ), α ) = ∫ L[φτg ( xt ), g (φτg ( xt ), α )]dτ t

(2.15)

where φτg ( xt ) is a trajectory of the system P c starting at xt = φtg ( x ) . It is clear

that J (φtg ( x ),α ) is positive definite in x for fixed t. The directional derivative of the objective function along a system trajectory is defined as J ( x ,α ) ≡

d dt

J (φtg ( x ),α ) t = 0

(2.16)

In the event that confusion may arise as to which control law is in effect for the directional derivative, a subscript will be used to clarify the situation: J g1 ( x ,α ) ≡ J ( x ,α ) x = f ( x , g

1 ( x ,α ))



d dt

J (φt g1 ( x ),α ) t = 0

(2.17)

From (2.15) it can be seen that directional derivative evaluates to J ( x , α ) = − L( x , g ( x , α ))

(2.18)

which is negative definite. Therefore J ( x , α ) is a Lyapunov function for the system P c .

The primary difficulty in applying Lyapunov stability analysis to a nonlinear system is coming up with a valid Lyapunov function. Here a Lyapunov function has fallen into our laps: the function which will be used to monitor system performance will serve the dual purpose of being the Lyapunov function for the system. Lest anyone be deceived, the fact that an objective function of the form given above, (2.14), is a Lyapunov function for the system is well established in the controls community [23][29]. We may now extend proposition 2.2 using our newfound Lyapunov function.

Theorem 2.1  Suppose P_k^c = f[x, g(x, α_k)] is asymptotically stable on the compact invariant set Ω̄, where g(x, α_k) is defined in (2.8) and f(x, u) meets the conditions of (2.1). The trajectories of P_k^c will be denoted φ_t^k(x). Let

    J(x, α_k) = ∫₀^∞ L[φ_τ^k(x), g(φ_τ^k(x), α_k)] dτ

be an objective function as defined in (2.14), meeting its conditions. Suppose that the control law g(x, α_k) is changed via its parameters to g(x, α_{k+1}) such that:

a) g(x, α_{k+1}) = g(x, α_k) on the boundary set Ω̄_B, where Ω̄_B is defined in proposition 2.2;
b) g(0, α_{k+1}) = g(0, α_k) = 0;
c) J̇_{g_{k+1}}(x, α_k) < 0 ∀ x ∈ Ω̄, x ≠ 0;

then J(x, α_k) is a Lyapunov function for P_{k+1}^c = f[x, g(x, α_{k+1})], and P_{k+1}^c is asymptotically stable on Ω̄.

Proof: By definition, J(x, α_k) is positive definite in x. By condition c, J̇_{g_{k+1}}(x, α_k) is negative definite for trajectories of P_{k+1}^c on Ω̄. Therefore J(x, α_k) is a Lyapunov function for P_{k+1}^c on Ω̄. By proposition 2.1, Ω̄ is an invariant set for P_{k+1}^c; therefore, by corollary 2.1, P_{k+1}^c is asymptotically stable on Ω̄. ∎
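A minimal sketch of checking condition (c) numerically, continuing the hypothetical scalar example used earlier (the closed-form J below is specific to that toy problem, and only condition (c) — not the boundary conditions (a) and (b) — is tested): the cost-to-go under the old parameters is differentiated by finite differences and followed along the flow of the updated closed-loop system.

```python
import numpy as np

# Condition (c) of theorem 2.1 for the toy regulator: J(., a_k) must
# decrease along trajectories of the NEW closed-loop system.

def J_old(x):                          # cost-to-go under g(x, a_k) = -2x
    return 0.35 * x**2                 # known in closed form for this toy case

def f_new(x):                          # new closed loop x' = f(x, g(x, a_k+1))
    return -2.5 * x                    # slightly higher gain, still 0 at x = 0

def J_dot_new(x, h=1e-6):
    dJdx = (J_old(x + h) - J_old(x - h)) / (2.0 * h)   # finite-difference dJ/dx
    return dJdx * f_new(x)             # directional derivative along new flow

xs = np.linspace(-1.0, 1.0, 1001)
xs = xs[np.abs(xs) > 1e-9]
assert np.all(J_dot_new(xs) < 0.0)     # condition (c) holds on the sample grid
print("J(., a_k) decreases along trajectories of the updated system.")
```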

Chapter 3

Function Approximation and Synthesis using Wavelet Basis Functions and Multiresolution Analysis

A learning control paradigm requires function approximation and synthesis techniques. There are three primary functions to be concerned with approximating or synthesizing: the plant, f(x, u); the control law, g(x); and the objective function, J(x₀) = ∫ L(x, u) dt. The plant is the physical system we are attempting to control; it is assumed it can be modeled by (2.1). Often a reasonable model of the plant can be determined from the basic physics of the plant, in which case function approximation techniques are not required. When this is not true, a method of approximating the plant model will be required. For the control law, a particular control function is not inherent; the objective will be to determine, or synthesize, a control function which will provide reasonable performance as defined by the objective function. The form of the objective function makes it difficult to evaluate for any given starting point x₀. Partial derivatives of the objective function with respect to the states of the system will be required as part of the algorithm to improve system performance. For this reason, it will be convenient to approximate the objective function with a function for which the evaluation of the partial derivatives is easier.

The learning control paradigm will typically start with an initial approximation of a function (plant equation, control law equation, or cost-to-go equation) and then improve the approximation with experience. Invoking our human approach to control analogy, we think of starting with crude approximations and improving these approximations over time. The "detail" of the approximations is filled in locally with experience. To make adjustments to an approximation, the approximating function must be parameterized; it will be the approximating function's parameters which will be adjusted to improve it. In general, only local regions of the state space will be looked at at any given time. Therefore, it will be convenient (actually, required) to have the function parameters affect only local regions of the state space. It goes without saying that, from a practical standpoint, the number of parameters in the approximating function must be finite. The fact that the region over which function approximation/synthesis will be made is bounded goes a long way towards keeping the number of parameters in the approximating function finite.

Due to the inherent function approximation capabilities of neural networks, there has been considerable research in the area of using neural networks for function approximation, system identification, and control of nonlinear systems [24][28][30][34][36][38][45][50][48]. Typically feed-forward neural networks with one or more hidden layers, using "sigmoid" activation functions, are used (refer to [14][28][43] and [44] for an introduction to neural networks).

Feedforward neural networks using sigmoid activation functions are capable of uniformly approximating continuous functions to any specified degree of accuracy given enough nodes [17][22][41]. The problem with using feedforward neural networks with sigmoidal activation functions is multifaceted. For function approximation, the usual method of adjusting the network's parameters (connection weights) is via backpropagation, which is a form of gradient descent. This method gives no guarantee that a reasonable approximation will be made of the desired function. The sigmoid functions have global support, so adjusting one parameter affects the entire support of the function being approximated. This implies that improving the function approximation in one region of the state space, via a parameter change, may well make the approximation worse at another, possibly distant, location of the state space. The use of activation functions with global support precludes making fine adjustments to the approximating function in local regions of the state space. Finally, there is no known relationship between the desired degree of accuracy of approximation and the required number of nodes in the neural network, based on the class of function to be approximated.

Sanner and Slotine [48] were among the first to use a neural network in a feedback control system in an analytically tractable fashion with guaranteed stability and convergence principles. They used a linear combination of Gaussian radial basis functions, which maps into a single-layer feedforward neural network (the Gaussian functions replace the sigmoidal functions). Gaussian radial basis functions decay rapidly to zero away from their centers, which gives them local support. Through the use of multidimensional sampling theory, Sanner and Slotine demonstrated the uniform approximation capability of Gaussian neural networks. They went on to provide equations for computing the number of Gaussian nodes required to approximate bandlimited functions on compact sets to a desired level of accuracy. It was Sanner and Slotine's work which inspired key directions in this thesis.

A linear combination of Gaussian radial basis functions is sufficient for the function approximation and synthesis requirements of this thesis. But there are some shortcomings of the Gaussian network which can be improved upon by turning to multiresolution analysis using a set of orthonormal, compactly supported wavelets. The key problem with using Gaussian radial basis functions is that the number of Gaussians which must be used to approximate or synthesize a given class of functions is based on the worst-case spatial frequency component of the class of functions [48] to be approximated. This can result in a much larger number of parameters than is actually required to make a reasonable approximation of a given function. A multiresolution analysis, using a set of orthonormal, compactly supported wavelets, allows starting from a rough or crude approximation and then adding detail with experience. This is precisely the characteristic being sought. The wavelet basis functions are orthogonal, which implies that the coefficients in the wavelet expansion are unique and independent. The wavelets have true compact support, so only a limited number of coefficients in the wavelet expansion have to be taken into account at any given time.

Wavelet analysis is relatively new to the field of applied mathematics and growing at an explosive rate. Wavelet transforms provide a method of breaking a function up into a set of component parts. These components are a set of translations and dilations of a primary function (mother wavelet). It is beyond the scope of this dissertation to provide a thorough development of wavelets and multiresolution analysis. Kaiser [27] presents a thorough introduction to wavelets and multiresolution analysis at a level accessible to engineers, scientists, and applied researchers. Young [58] provides a less thorough but more intuitive introduction to wavelets. Daubechies' book [13] is a cornerstone in the field; it is more thorough in development and assumes a higher level of mathematical maturity, but is still accessible. The next few sections provide an overview of wavelets, wavelet transforms, multiresolution analysis, and Daubechies orthonormal wavelets. The wavelet theory will be developed in R¹ and then expanded to R^N. It should be noted that a function f: R^N → R^M may be broken down into a set of component functions:

    f(x) = [f₁(x), f₂(x), …, f_M(x)]ᵀ    (3.1)

where

    f_i: R^N → R¹ for i = 1, …, M    (3.2)

Wavelet transforms may be applied to functions f: R^N → R^M by applying the transforms to each of the component functions, (3.2), one at a time.

3.1  Preliminary Notations

The ensuing development of wavelets follows Kaiser [27]. Kaiser uses a "star" notation to represent the adjoint of an operator. Before going into an overview of wavelets, it will be useful to quickly review adjoints, develop the star notation, and establish other requisite notation. A measurable function f: R^N → R¹ is said to be in L²(R^N) if

    ∫_{R^N} [f(x)]² dx < ∞    (3.3)

where Lebesgue integration is used.¹ When it is not likely to cause confusion, L² will be used in place of L²(R^N). The inner product of two functions f, g ∈ L²(R^N) is defined as

    ⟨f, g⟩ = ∫_{R^N} f(x) g(x) dx    (3.4)

The L² norm of f ∈ L²(R^N) is ‖f‖ = ⟨f, f⟩^{1/2}.

¹ For general functions in L²(R^N), Lebesgue integration must be used in evaluating the integral. From a practical standpoint, all functions considered in this dissertation will be continuous or piecewise continuous, so standard Riemann integration may be used. In either case, the appropriate integration technique will be assumed, and I will not take up topics that occur on sets of measure zero. Refer to Wheeden and Zygmund's book [55], or most standard texts on real analysis, for a development of Lebesgue integration.

Let U and V be Hilbert² spaces and let F be a bounded, linear operator such that F: U → V. Then there exists a unique operator F*: V → U which satisfies:

    ⟨u, F*v⟩_U = ⟨Fu, v⟩_V  ∀ u ∈ U, v ∈ V    (3.5)

The subscripts on the inner products remind us in which Hilbert space the inner product is being taken. The operator F* is referred to as the adjoint of F. Properties of the adjoint include:

a) F** ≡ (F*)* = F
b) The adjoint of the composition of two operators GF is: (GF)* = F*G*

Proposition 3.1  Let f ∈ L²(R^N) be regarded as the operator f: R → L²(R^N) defined by fa = af, where a ∈ R. Then the adjoint operator f*: L²(R^N) → R is given by f*g = ⟨f, g⟩.

Proof: By definition of the adjoint, f* is determined by ⟨a, f*g⟩_R = ⟨fa, g⟩_{L²}. But ⟨a, f*g⟩_R = a f*g and ⟨fa, g⟩_{L²} = a⟨f, g⟩_{L²}, which is true for all a ∈ R; therefore f*g = ⟨f, g⟩. ∎

² A Hilbert space is a complete inner product space [40].

Proposition 3.1 is nothing more than a statement of the Riesz Representation Theorem [40]. This proposition, along with the definition of the adjoint, forms the basis of the star notation. Let the set {ψ_n}_{n∈Z} be an orthonormal basis for L². Then for any f ∈ L² there exists a unique set of numbers {a_n}_{n∈Z} such that

    f = Σ_n a_n ψ_n  where  a_n = ⟨ψ_n, f⟩ = ψ_n* f

We may now write:

    f = Σ_n a_n ψ_n = Σ_n ψ_n a_n = Σ_n ψ_n ψ_n* f = ( Σ_n ψ_n ψ_n* ) f

therefore

    Σ_n ψ_n ψ_n* = I    (3.6)

We call (3.6) a resolution of unity (or resolution of the identity). There is a more general setting for this very important concept; it forms the basis for generalized frames and the notion of reciprocal bases. This dissertation will only be concerned with orthonormal basis sets, so we will not need the more generalized setting.
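A small finite-dimensional analogue makes (3.6) concrete. The sketch below (an illustration only; the orthogonal matrix is arbitrary) verifies that the rank-one operators ψ_n ψ_n* built from any orthonormal basis sum to the identity.

```python
import numpy as np

# Resolution of unity (3.6) in R^8: the columns of an orthogonal matrix
# form an orthonormal basis {psi_n}; sum_n psi_n psi_n* = I.

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))    # orthonormal columns psi_n

resolution = sum(np.outer(Q[:, n], Q[:, n]) for n in range(8))
print(np.allclose(resolution, np.eye(8)))            # True
```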

The Fourier transform of f ∈ L²(R¹) is defined as

    (Ff)(ν) = f^(ν) = ∫_{−∞}^{∞} e^{−j2πνx} f(x) dx    (3.7)

The inverse Fourier transform is defined as

    (F⁻¹f^)(x) = f(x) = ∫_{−∞}^{∞} e^{j2πνx} f^(ν) dν    (3.8)

3.2  Wavelets and the Continuous Wavelet Transform

Wavelet theory provides a method of breaking up data, functions, or operators into component parts. Translations and dilations of a primary function, the "mother wavelet", are used as the basic building blocks. Figure 3.1 shows one of Daubechies' [13] orthonormal wavelets.

38 1.5 1 0.5 0 -0.5 -1 -5

-4

-3

-2

-1

0

1

2

3

4

5

Figure Chapter 3 .1 Wavelet Key features for a function, ψ ( x ) , to be a wavelet include: •

it is localized in space, i.e. ψ ( x ) decays rapidly to zero outside of a bounded region;



it is localized in spatial frequency, i.e. the Fourier transform of ψ ( x ) decays rapidly to zero outside of a bounded region;







−∞

ψ ( x ) dx = 0 which implies that ψ ( x ) must oscillate. This fact along with the

localized support of ψ ( x ) is where the term wavelet (small wave) was derived. For the purposes of this thesis, it will be assumed that ψ ( x ) ∈ L2 ( R N ) (complex wavelets can be defined). A function f : R N → R1 is a member of L2 ( R N ) if



RN

[ f ( x )]2 dx < ∞ .

For now we will set N = 1; later the theory will be expanded to higher dimensional spaces. The wavelet transform uses translations and dilations of a mother wavelet. The function

    ψ_{s,r}(x) = s^{−1/2} ψ((x − r)/s)    (3.9)

defines these translations and dilations. The term r shifts the wavelet along the x axis. The term s, which is most often strictly greater than zero, causes the wavelet to expand or contract. If s is greater than one, then the wavelet will be stretched, or dilated, as compared to the mother wavelet. If s is between zero and one, then the wavelet will be compressed. The factor s scales the wavelet and is therefore called the scaling factor. The square root of s in front of the wavelet keeps the energy of the scaled wavelet the same as that of the mother wavelet, i.e., the L² norm of the scaled wavelet remains constant. Figure 3.2 shows wavelets which have been scaled and translated. The center wavelet (solid line) is the mother wavelet. The left wavelet has s = 2 and r = −3. The right wavelet has s = 0.5 and r = 3. The functions ψ_{s,r} are the wavelets generated by ψ, and since ψ ∈ L², ψ_{s,r} ∈ L² as well, which can be seen by:

    ‖ψ_{s,r}‖² = ∫_{−∞}^{∞} [ s^{−1/2} ψ((x − r)/s) ]² dx = ‖ψ‖²

[Figure 3.2 Wavelets with different Scales and Translations]

The continuous wavelet transform of a function f(x) is defined as

    (Wf)(s, r) = f̃(s, r) ≡ ∫_{−∞}^{∞} ψ_{s,r}(x) f(x) dx = ⟨ψ_{s,r}, f⟩    (3.10)

This inner product exists because both ψ_{s,r} and f are in L². The wavelet transform carries the interpretation: as a function of r at a fixed value of s, the wavelet transform represents the detail contained in f(x) at the scale s. This concept will become clearer in the next section when multiresolution analysis is taken up. The wavelet reconstruction formula, given the wavelet transform of a function, is

    f(x) = ∫₀^∞ ∫_{−∞}^{∞} s^{−2} ψ̃_{s,r}(x) f̃(s, r) dr ds    (3.11)

where ψ̃_{s,r} is the reciprocal wavelet family to ψ_{s,r}. If ψ_{s,r} is an orthonormal family of wavelets, then ψ̃_{s,r} = ψ_{s,r}. The family of wavelets with which this thesis is concerned will be orthonormal, so I will not take up the details of computing a reciprocal wavelet set.
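For the reader who wants to experiment with (3.10), the sketch below uses the PyWavelets package (assumed available; the Mexican-hat wavelet is chosen simply because it has a closed form — it is not one of the orthonormal families used later in this thesis) to evaluate f̃(s, r) over a grid of scales and shifts.

```python
import numpy as np
import pywt  # PyWavelets, assumed available; any CWT routine would do

# Evaluate the continuous wavelet transform (3.10) of a two-tone test signal.
t = np.linspace(0.0, 1.0, 1024)
f = np.sin(2 * np.pi * 5 * t) + np.sin(2 * np.pi * 40 * t * t)

scales = np.arange(1, 64)
coef, freqs = pywt.cwt(f, scales, 'mexh', sampling_period=t[1] - t[0])
print(coef.shape)   # (63, 1024): detail in f at each scale s and shift r
```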

3.3  Multiresolution Analysis

Multiresolution analysis (MRA) was formulated in 1986 by Mallat [33][32] and Meyer [35], and is one of the primary reasons wavelet analysis has become so popular. The idea came out of image analysis and the need to describe mathematically the increments in information needed to go from a coarse approximation to one of higher resolution. The existence of wavelets that form an orthonormal basis came as somewhat of a surprise, and as a fallout of multiresolution analysis.

To start, the wavelet scale factor and translation term will be discretized. Set s = 2^m and r = 2^m n Δx, where m and n are integers. The reason for including the 2^m factor in the translation is to normalize the translation based on scale. This normalization gives the same relative translation with respect to the wavelet scale (size) at any scale m. Without loss of generality, the translation step size, Δx, will be set to one. The discretized family of wavelets becomes:

    ψ_{m,n}(x) = 2^{−m/2} ψ(2^{−m} x − n)    (3.12)

Translation and dilation operators, T, D: L²(R) → L²(R), may be defined by

    (Tf)(x) = f(x − 1)    (3.13)

    (Df)(x) = 2^{−1/2} f(2^{−1} x)    (3.14)

These two operators are invertible and have the following properties:

a) (Tⁿf)(x) = f(x − n),  (D^m f)(x) = 2^{−m/2} f(2^{−m} x),  n, m ∈ Z
b) ⟨Tf, Tg⟩ = ⟨f, g⟩,  ⟨Df, Dg⟩ = ⟨f, g⟩
c) the adjoint of T is T* = T⁻¹; the adjoint of D is D* = D⁻¹

It can be readily verified that DTf = T²Df for all f ∈ L²(R), which states that translating a function by Δx = 1 and then stretching it by a factor of 2 is the same as first stretching it and then translating it by Δx = 2. The operator equations can be written

    DT = T²D,  and  D⁻¹T = T^{1/2} D⁻¹    (3.15)

These equations will be used to "square" and take the "square root" of T.
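A quick numerical check of the first identity in (3.15), treating T and D as higher-order functions acting on callables (an illustration only):

```python
import numpy as np

# Verify DT = T^2 D pointwise for a test function in L^2.
T = lambda f: (lambda x: f(x - 1.0))                 # (Tf)(x) = f(x - 1)
D = lambda f: (lambda x: 2**-0.5 * f(x / 2.0))       # (Df)(x) = 2^(-1/2) f(x/2)

f = lambda x: np.exp(-x**2)                          # any test function
x = np.linspace(-5.0, 5.0, 101)

lhs = D(T(f))(x)              # translate by 1, then stretch by 2
rhs = T(T(D(f)))(x)           # stretch by 2, then translate by 2
print(np.allclose(lhs, rhs))  # True
```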

Multiresolution analysis starts with a scaling function, φ(x) ∈ L²(R), rather than a wavelet. The scaling function and the wavelet generated from it will be intimately tied together. The scaling function shares some properties of the wavelet: one, it is localized in space; and two, it is localized in spatial frequency. The scaling function, though, will have a non-zero average (the wavelet has a zero average). The wavelet will be used to define the "detail" of a function, while the scaling function will provide a sample of the function averaged over the support of the scaling function. Figure 3.3 shows the scaling function which corresponds to the wavelet shown in Figure 3.1.

[Figure 3.3 Scaling Function]

Translations and dilations of the scaling function, φ(x), will be defined by

    φ_{m,n}(x) ≡ (D^m Tⁿ φ)(x) = 2^{−m/2} φ(2^{−m} x − n)    (3.16)

In order for the scaling function to determine a multiresolution analysis, translations of the scaling function within a given scale will be required to be orthonormal. That is,

    ⟨φ_{m,n}, φ_{m,k}⟩ = δ_{nk}    (3.17)

where δ_{nk} is the Kronecker delta (δ_{nk} = 0 for k ≠ n, 1 otherwise). It can be readily shown, due to the properties of the dilation operator, that if (3.17) holds for m = 0, it will hold for all values of m. It should be noted that the φ_{m,n}(x)'s do not have to be orthonormal across scales. The second requirement which will be placed on the scaling function is:

    ∫_{−∞}^{∞} φ(x) dx = 1    (3.18)

This requirement leads to the averaging (or sampling) property of the scaling function. To see this, suppose f(x) equals a constant, c, over the support of φ_{m,n}(x). Then

    φ*_{m,n} f = ⟨φ_{m,n}, f⟩ = ∫_{−∞}^{∞} φ_{m,n}(x) f(x) dx = c·2^{m/2}

The scaling functions are used to define subspaces of L²(R). For any fixed m ∈ Z, let V_m be the closed subspace of L²(R) spanned by {φ_{m,n} : n ∈ Z}, i.e.,

    V_m = { f = Σ_n φ_{m,n} u_n : ‖f‖² = Σ_n u_n² < ∞ }    (3.19)

where u_n ∈ R. Clearly {φ_{m,n} : n ∈ Z} forms an orthonormal basis for V_m. A relationship between the different spaces V_m may be formed by:

    Σ_n φ_{m,n} u_n = D^m Σ_n φ_{0,n} u_n    (3.20)

Since Σ_n φ_{0,n} u_n ∈ V₀, it follows that

    V_m = { D^m f : f ∈ V₀ } ≡ D^m V₀    (3.21)

The orthogonal projection operator P_m: L²(R) → V_m is defined as

    P_m = Σ_n φ_{m,n} φ*_{m,n}    (3.22)

The projection operator P_m is related to P₀ by

    P_m = Σ_n D^m φ_{0,n} (D^m φ_{0,n})* = D^m Σ_n φ_{0,n} φ*_{0,n} D^{−m} = D^m P₀ D^{−m}    (3.23)

The projection operator projects a function f ∈ L² onto the space V_m. The larger the scale m, the larger the support of φ_{m,n}(x), resulting in more detail of f being lost in the projection. We say that V_m contains information down to the scale Δx = 2^m. A partial reconstruction of f (f at scale m) from its samples φ*_{m,n} f is given by

    f_m(x) ≡ (P_m f)(x) = Σ_n φ_{m,n}(x) φ*_{m,n} f    (3.24)

A natural extension to the idea of the spaces V_m containing less information the larger the scale is to impose the requirement

    V_{m+1} ⊂ V_m  ∀ m ∈ Z    (3.25)

This requirement imposes a new restriction on φ. Since Dφ ∈ V₁ and V₁ ⊂ V₀, we must have Dφ ∈ V₀. Thus

    Dφ = Σ_n h_n φ_{0,n} = Σ_n h_n Tⁿ φ ≡ h(T)φ    (3.26)

for some set of coefficients h_n. Equation (3.26) is called a dilation equation, or two-scale relation, for φ. It may be used to determine φ up to normalization. A new operator has been introduced,

    h(T) = Σ_n h_n Tⁿ    (3.27)

on V₀ (or L²). h(T) may be interpreted as an averaging that represents a stretched φ as a superposition of unstretched φ's. This notation will be useful in general for all vectors in V₀:

    Σ_n u_n φ_{0,n} = Σ_n u_n Tⁿ φ = u(T)φ    (3.28)

Summarizing the above and adding a couple of final details, a multiresolution analysis can be defined.

Definition 3.1  A nested sequence of closed subspaces V_m of L² is said to form a multiresolution analysis if the following properties hold:

a) φ_n(x) ≡ Tⁿφ(x) forms an orthonormal basis for V₀,
b) V_m ≡ D^m V₀,
c) V_{m+1} ⊂ V_m,

and in addition, for every f ∈ L²,

d) lim_{m→∞} P_m f = 0,
e) lim_{m→−∞} ‖f − P_m f‖ = 0.

Properties d and e come about from the interpretation of P_m f. P_m f gives a blurred, or averaged, version of f at the scale Δx = 2^m. As m → ∞ it is expected that P_m f will go to a constant, and the only constant function in L² is zero. As m → −∞ it is expected that P_m f will approach f, if the scaling function is reasonable. It can be shown that if the scaling function, φ, is integrable and satisfies the orthonormality condition, (3.17), and the averaging condition, (3.18), then properties d and e hold [27][13].

Wavelets in a multiresolution analysis are related to the orthogonal complements of the subspaces V_m. The orthogonal complement of V_{m+1} in V_m is defined as

    W_{m+1} = { f ∈ V_m : ⟨f, g⟩ = 0 ∀ g ∈ V_{m+1} }    (3.29)

The subspace V_m may now be written as V_m = V_{m+1} ⊕ W_{m+1}. In other words, every f_m ∈ V_m has a unique decomposition f_m = f_{m+1} + d_{m+1}, where f_{m+1} ∈ V_{m+1} and d_{m+1} ∈ W_{m+1}. We say that f_m is broken down into a "blurred" part, f_{m+1}, and its "detail", d_{m+1}. Now, since W_{m+1} ⊂ V_m and W_m is by definition orthogonal to V_m, W_{m+1} is orthogonal to W_m. This implies that all the W_m are mutually orthogonal. In addition, since D^m preserves orthogonality, the following relationship between the subspaces W_m may be established:

    W_m = { D^m d : d ∈ W₀ } ≡ D^m W₀    (3.30)

The orthogonal projection onto W_m will be defined as Q_m: L² → L². It should come as no surprise that

    Q_m = Σ_n ψ_{m,n} ψ*_{m,n}    (3.31)

where {ψ_{m,n}}_{n∈Z} is a set of orthonormal wavelets which span W_m. The mother wavelet is related to the scaling function by

    ψ = D⁻¹ g(T)φ    (3.32)

where

    g(T) = −T^q h(−T)* = −Σ_n (−1)ⁿ h̄_n T^{q−n}    (3.33)

These equations may be rewritten as

    ψ(x) = √2 Σ_n g_n φ(2x − n)    (3.34)

where

    g_n = (−1)ⁿ h̄_{q−n}    (3.35)

The h̄_{q−n} constants are the complex conjugates of the h_n constants used in the dilation equation (3.26), offset by an arbitrary odd integer q. Refer to Kaiser [27] for a complete development of these equations. The projection operator Q_m is related to Q₀ by

    Q_m = Σ_n D^m ψ_{0,n} (D^m ψ_{0,n})* = D^m Σ_n ψ_{0,n} ψ*_{0,n} D^{−m} = D^m Q₀ D^{−m}    (3.36)

The following relationships hold for the projection operators P_m and Q_m:

a) V_m ⊥ W_m ⇒ P_m Q_m = Q_m P_m = 0
b) P_k P_m = P_m P_k = P_k for k > m, since V_k ⊂ V_m
c) Q_k P_m = P_m Q_k = Q_k for k > m, since W_k ⊂ V_m
d) V_m = V_{m+1} ⊕ W_{m+1} ⇒ P_m = P_{m+1} + Q_{m+1}

The last equation may be iterated by replacing the P's with their decompositions, giving

    P_m = P_M + Σ_{k=m+1}^{M} Q_k,  M > m    (3.37)

This leads to a decomposition of any f ∈ L² at resolution m, of

    f_m ≡ P_m f = P_M f + Σ_{k=m+1}^{M} Q_k f,  M > m.    (3.38)

P_M f is a blurred, or coarse, approximation of f at resolution M. To this coarse approximation, detail Q_k f is added at resolution k, for k = M down to k = m + 1 (k is decreasing). The function f may be "built" with this process to any level of resolution desired, starting from a coarse approximation P_M f. For a given resolution, m, the

sampling interval is Δx = 2^m, so f_m = P_m f will be called the 2^m resolution of the function f. In practice, only finite resolutions, m, are used. It is worth noting, though, that if m → −∞, by definition 3.1 item e,

    f = P_M f + Σ_{k=−∞}^{M} Q_k f,  f ∈ L²    (3.39)

This gives an orthogonal decomposition of L², of

    L² = V_M ⊕ ( ⊕_{k=−∞}^{M} W_k )    (3.40)

By definition 3.1 item d,

    f = Σ_{k=−∞}^{∞} Q_k f,  f ∈ L²    (3.41)

which gives the orthogonal decomposition

    L² = ⊕_{k=−∞}^{∞} W_k    (3.42)

M

∑ ∑b

k = m +1 n ∈Z

ψ k ,n ( x ), M > m .

k ,n

(3.43)

where an = ϕ M ,n , f

and bk ,n = ψ k ,n , f

(3.44)

It should be clear that if ϕ and ψ have compact support, and the region where f is to be approximated is compact, then the range of the index n will be finite. Equations (3.43) and (3.44) provide a finite decomposition of functions whereby detail to any desired level may be added to an initial coarse approximation of a function. The added detail comes in

48 orthonormal sets so there will be independence in the approximation between sets of coefficients.

This is precisely the type of decomposition needed for a “learning” control

system.

3.4

The Haar Multiresolution Analysis The simplest function which meets the conditions of the multiresolution analysis

is the characteristic function of the interval I ≡ [ 0,1) : ⎧1 for 0 ≤ x < 1 ⎩ 0 otherwise

ϕ ( x ) = χ[ 0,1) = ⎨

(3.45)

Clearly ⎧0 for n ≠ k ⎩1 for n = k

ϕ m,n , ϕ m, k = ⎨

i.e., {ϕ m,n }n∈Z is an orthonormal set of functions meeting the requirement of (3.17). It is also clear that this scaling function meets the averaging property, (3.18). The projection of f ∈ L2 to Vm is Pm f = ∑ ϕm,n ( x ) ϕm,n , f

(3.46)

n

where

ϕm,n , f = 2−

m 2



( n + 1) 2 m

n 2m

f ( x )dx

(3.47)

and ⎧⎪2− m2 for n ≤ x < n + 1 ϕm,n ( x ) = ⎨ 2m 2m ⎪⎩ 0 otherwise

(3.48)

The dilation equation, (3.26), can be verified by Dϕ = 1 χ[ 0,2 ) = 1 ( χ[ 0,1) + χ[1,2 ) ) = 1 (ϕ + Tϕ ) 2 2 2

(3.49)

49 Here we see that for the Haar scaling function h(T ) = 1 (1 + T ) 2

(3.50)

The wavelet which corresponds to the Haar scaling function may be computed from equations (3.34) and (3.35). Using these equations, the wavelet is computed as

ψ ( x) = χ

[ q 2−1, q2 )

− χ q q +1 [2, 2 )

(3.51)

The parameter q, simply establishes the offset of the wavelet. Choosing q = 1, will center the Haar wavelet in the interval [0,1]. Figure Chapter 3 .1 shows the Haar scaling function and wavelet. It should be clear from the figure that the Haar wavelet forms an orthonormal set, and the wavelets have a zero average. The Haar scaling function and wavelet have good space localization. This is readily seen in Figure Chapter 3 .4. The Fourier transform of the scaling function is

ϕ (ν ) = e− jπν

sin(πν )

(3.52)

πν

The spatial frequency decays slowly, so the Haar system does not have good spatial frequency localization.

This hurts the usefulness of the Haar system.

The ideal

multiresolution analysis has both good spatial localization and good spatial frequency localization.

2 1.5 1 0.5 0 -0.5 -1 -1.5 -2

[Figure 3.4 Haar Scaling Function and Wavelet]

3.5  Daubechies' Orthonormal Wavelets

A multiresolution analysis is determined by the scaling function. The scaling function in turn is determined (up to normalization) by the dilation equation, (3.26). The dilation equation operator, h(T), may be considered a low-pass filter. The corresponding wavelet-generating coefficient equation, g(T) (3.33), may be considered a high-pass filter. When there are a finite number of filter coefficients in h(T), there will be a finite number of coefficients in g(T), which implies that both the scaling function and the wavelet generated by h(T) and g(T) will have compact support. In this case, h(T) and g(T) are called finite impulse response (FIR) filters. It is therefore appropriate to start with the FIR filter coefficients for h(T) in order to develop the multiresolution analysis.

Daubechies [13] developed a whole sequence of low-pass FIR filters, h^N(T) for N = 1, 2, …, which may be used to generate compactly supported scaling functions and their associated wavelets. For N = 1, the FIR filter h¹(T) generates the Haar scaling function; the high-pass filter, g¹(T), is determined from h¹(T), and the Haar wavelet is generated from g¹(T). For N > 1, the Daubechies filters are generalizations of the Haar system, possessing more and more regularity. For example, the scaling function generated from h²(T) is continuous, and the scaling function generated from h³(T) has a continuous first derivative. The scaling function and wavelet generated from the Daubechies filters are supported on [0, 2N − 1]. It is beyond the scope of this dissertation to go through the development of the low-pass filter coefficients for h^N(T); refer to Daubechies chapter 6 [13], or Kaiser chapter 8 [27], for a complete development. Daubechies, in chapter 6, table 6.3, gives filter coefficients, h_n, for N = 4 through 10, along with plots of a sampling of the corresponding scaling functions and wavelets. Figure 3.1 and Figure 3.3 are the wavelet and scaling function based on the filter coefficients from Daubechies' table 6.3 with N = 6.

3.6  Calculating the Scaling Function and Wavelet from the Filter Coefficients

Closed form solutions do not exist for the Daubechies orthonormal wavelets. There are several techniques for calculating the scaling function and the corresponding wavelet given the filter coefficients in h^N(T). The method presented here is a recursive method which gives exact values of the scaling function, φ, at evenly spaced points; interpolation methods are then used to calculate φ at all other points. The scaling function, φ, may be calculated at the integers by noting that for x = n the dilation equation gives

    φ(n) = D⁻¹ Σ_{k=0}^{2N−1} h_k T^k φ(n) = √2 Σ_{k=0}^{2N−1} h_k φ(2n − k) = √2 Σ_{k=0}^{2N−1} h_{2n−k} φ(k)    (3.53)

where h_j = 0 for j < 0 or j > 2N − 1. This gives a linear set of equations for the unknowns φ(n):

    [φ(1), φ(2), …, φ(2N−2)]ᵀ = √2 M [φ(1), φ(2), …, φ(2N−2)]ᵀ,  M_{n,k} = h_{2n−k}    (3.54)

Since φ(x) is supported on the interval [0, 2N−1] and is continuous, φ(0) = φ(2N−1) = 0. It is clear that φ⃗ = [φ(1) ⋯ φ(2N−2)]ᵀ is an eigenvector of the linear system (3.54), and may be solved for by standard techniques. This determines φ(n) up to a normalization factor. φ(n) may be normalized by requiring

    Σ_{n=1}^{2N−2} φ(n) = 1    (3.55)

Once φ(n) for n = 1 through 2N − 2 has been found, the dilation equation may be applied once again to solve for φ(x) at x = n + 0.5:

    φ(n + 0.5) = √2 Σ_{k=0}^{2N−1} h_k φ(2n + 1 − k)    (3.56)

Once φ(x) at x = n + 0.5 has been solved, φ(x) at x = n + 0.25 may be solved for. This process may be carried out indefinitely. After φ(x) has been determined at a finite set of evenly spaced points, the corresponding wavelet, ψ(x), may be determined at the same resolution of the x axis via equations (3.34) and (3.35).

The above gives a method of computing φ(x) and ψ(x) at evenly distributed points within the support of the functions. The method is not practical or efficient for computing φ(x) and ψ(x) at arbitrary points. For φ(x) and ψ(x) generated from h^N(T) with N ≥ 3, the functions are continuous and have continuous first derivatives. Therefore φ(x) and ψ(x) may be approximated to any level of accuracy desired by use of interpolating functions such as cubic splines. A cubic spline fits a set of third-order polynomials to a set of data points generated by the function to be approximated. This gives an approximation to the function with the properties that: 1) the approximation matches the function exactly at the given data points; 2) the approximation is continuous and has continuous first and second derivatives; 3) a third-order polynomial is very efficient to compute. Refer to any good book on numerical analysis for a treatise on cubic splines; Burden and Faires' book [9] gives a good presentation with easy-to-follow algorithms.

3.7  Example

A quick example will help to clarify things. The first step in building the scaling function for a multiresolution analysis is to choose a set of low-pass filter coefficients. Daubechies has several tables of filter coefficients. I chose a set (see Table 3.1) for this example which gives a relatively smooth and symmetric scaling function and wavelet.

Table 3.1 Low-pass filter coefficients for the "least asymmetric" compactly supported wavelets for N = 6. Coefficients are taken from Daubechies Table 6.3 [13].

    h0  =  0.021784700327      h6  =  0.477904371333
    h1  =  0.004936612372      h7  = -0.102724969862
    h2  = -0.166863215412      h8  = -0.029783751299
    h3  = -0.068323121587      h9  =  0.063250562660
    h4  =  0.694457972958      h10 =  0.002499922093
    h5  =  1.113892783926      h11 = -0.011031867509

The filter coefficients are put into the matrix given in equation (3.54), and the eigenvector corresponding to eigenvalue 1 is computed. (A number of software packages on the market will compute the eigenvector directly; Mathcad was used in this case.) A normalization constant is calculated from (3.55). The scaling function, φ(n), for n = 1 to 2N − 2 is the eigenvector divided by the normalization constant; of course φ(n) equals zero for n outside the range 1 to 2N − 2. The first graph in Figure 3.5 shows the scaling function as calculated from the Daubechies low-pass filter coefficients. The next three graphs show how additional points are filled in using the recursion formula (3.56). Figure 3.3 shows the scaling function after applying a cubic spline; the scaling function in Figure 3.3 was re-centered around zero. Figure 3.1 shows the corresponding wavelet, ψ(x), as calculated from equations (3.34) and (3.35) and after applying a cubic spline.
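A sketch of this procedure in software follows. One caveat on normalization: the Table 3.1 coefficients sum to 2, which absorbs the √2 factors of (3.53)-(3.56); with this scaling the refinement relation used below is φ(x) = Σ_k h_k φ(2x − k), and the relevant eigenvalue of (3.54) is exactly 1, as stated above. The interpolation step (e.g., scipy.interpolate.CubicSpline) is omitted here.

```python
import numpy as np

# Scaling function from the Table 3.1 low-pass coefficients (sum = 2).
h = np.array([0.021784700327, 0.004936612372, -0.166863215412,
              -0.068323121587, 0.694457972958, 1.113892783926,
              0.477904371333, -0.102724969862, -0.029783751299,
              0.063250562660, 0.002499922093, -0.011031867509])
N2 = len(h)                       # 2N = 12, support [0, 2N - 1]

def h_at(j):
    return h[j] if 0 <= j < N2 else 0.0

# Build M[n, k] = h_{2n-k}, n, k = 1 .. 2N-2, and take the eigenvector
# with eigenvalue 1, as in (3.54).
idx = np.arange(1, N2 - 1)
M = np.array([[h_at(2 * n - k) for k in idx] for n in idx])
w, V = np.linalg.eig(M)
phi = np.real(V[:, np.argmin(np.abs(w - 1.0))])
phi /= phi.sum()                  # normalization (3.55)

# One level of dyadic refinement, as in (3.56): phi at the half-integers.
x = dict(zip(idx.astype(float), phi))
phi_half = {n + 0.5: sum(h_at(k) * x.get(2 * n + 1 - k, 0.0)
                         for k in range(N2)) for n in range(N2 - 1)}
print(sorted(x.items())[:3], sorted(phi_half.items())[:2])
```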

[Figure 3.5 Scaling Function as Calculated from the Low-pass Filter Coefficients — four panels showing the computed points at spacings 1, 1/2, 1/4, and 1/8]

To demonstrate the function approximation capabilities of the multiresolution analysis, I chose the somewhat arbitrary function:

f(x) = sin(0.36x²) on the interval [−10, 10]. This function is shown as the solid line in Figure 3.6. In Figure 3.6-a, the dashed line represents f₀(x) = P₀f(x), where P₀ is defined by (3.22); for P₀ the step size is Δx = 1. The dashed line in Figure 3.6-b represents f₋₁(x) = P₀f(x) + Q₀f(x), and the dashed line in Figure 3.6-c represents f₋₂(x) = P₀f(x) + Q₀f(x) + Q₋₁f(x). Here Q_m, for m = 0 and −1, is given by (3.31). In Figure 3.6-a it can be seen that the scaling functions give a reasonably good approximation to the function f(x) where the function is changing slowly relative to the scale of the scaling function. When f(x) starts changing more rapidly, the averaging effect of the scaling function can be seen. Going from Figure 3.6-a to b, the first layer of detail is added by the wavelets with step size Δx = 1. Going from Figure 3.6-b to c, the second layer of detail is added by the wavelets with step size Δx = 1/2. The approximation becomes better with each additional level of detail added.
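The same experiment can be reproduced with a discrete wavelet transform. The sketch below assumes PyWavelets is available; its 'sym6' filter is the "least asymmetric" Daubechies family for N = 6, i.e. the Table 3.1 coefficients (up to normalization). Zeroing all detail coefficients and reconstructing gives the coarse approximation P_M f; restoring them one scale at a time, coarsest first, adds the Q_k f layers of (3.38).

```python
import numpy as np
import pywt  # PyWavelets, assumed available

x = np.linspace(-10.0, 10.0, 1024)
f = np.sin(0.36 * x**2)

coeffs = pywt.wavedec(f, 'sym6', level=2)   # [cA2, cD2, cD1]

levels = []
for keep in range(len(coeffs)):             # keep 0, 1, then 2 detail scales
    c = [coeffs[0]] + [d if i <= keep else np.zeros_like(d)
                       for i, d in enumerate(coeffs[1:], start=1)]
    levels.append(pywt.waverec(c, 'sym6'))

for k, approx in enumerate(levels):
    print(k, np.max(np.abs(approx[:len(f)] - f)))   # error shrinks with detail
```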

[Figure 3.6 Function Approximation using Scaling Function and First Two Wavelet Scales — panels a, b, and c]

3.8  Higher Order Wavelet Bases and Multiresolution Analysis

The above development of wavelets and multiresolution analysis was done for L²(R¹). The extension to L²(R^N) is fairly straightforward through the use of tensor products. If the wavelets ψ_{m,n}(x) form an orthonormal basis for L²(R¹), then

    ψ_{m,n}(x) = ψ_{m1,n1}(x₁) ψ_{m2,n2}(x₂) ⋯ ψ_{mN,nN}(x_N)    (3.57)

where

    x = [x₁, …, x_N]ᵀ,  m = [m₁, …, m_N]ᵀ,  and  n = [n₁, …, n_N]ᵀ    (3.58)

forms an orthonormal basis for L²(R^N) [13]. This extension gives independent scales in each direction. Using independent scales in each direction does not lend itself well to multiresolution analysis.

Another construct, resulting in an orthonormal wavelet basis, controls the scale in each direction simultaneously. The expansion to R² will be shown; the expansion to R^N follows suit. The construction starts with the tensor product of two one-dimensional multiresolution analyses. Define the spaces (boldface denotes the two-dimensional objects):

    𝐕_m = V_m ⊗ V_m = the closure of Span{ F(x, y) = f(x)g(y); f, g ∈ V_m }    (3.59)

The 𝐕_m form a sequence of subspaces in L²(R²) given by

    ⋯ 𝐕₂ ⊂ 𝐕₁ ⊂ 𝐕₀ ⊂ 𝐕₋₁ ⊂ 𝐕₋₂ ⋯    (3.60)

If {φ_{m,n}}_{n∈Z} constitutes an orthonormal basis for V_m, then {φ_{m,n}}_{n∈Z²}, where

    φ_{m,n}(x, y) = φ_{m,n1}(x) φ_{m,n2}(y) = 2^{−m} φ(2^{−m}x − n1) φ(2^{−m}y − n2)    (3.61)

constitutes an orthonormal basis for 𝐕_m = V_m ⊗ V_m. The orthogonal projection operator 𝐏_m: L²(R²) → 𝐕_m is

    𝐏_m f = Σ_n φ_{m,n} φ*_{m,n} f = Σ_n φ_{m,n} ⟨φ_{m,n}, f⟩    (3.62)

where obviously

    ⟨φ_{m,n}, f⟩ = ∫_{−∞}^{∞} ∫_{−∞}^{∞} φ_{m,n}(x, y) f(x, y) dx dy    (3.63)

As before, 𝐖_m is the orthogonal complement of 𝐕_m in 𝐕_{m−1}. 𝐕_m may now be broken down into its component spaces:

    𝐕_m = V_m ⊗ V_m = (V_{m+1} ⊕ W_{m+1}) ⊗ (V_{m+1} ⊕ W_{m+1})
        = (V_{m+1} ⊗ V_{m+1}) ⊕ [(V_{m+1} ⊗ W_{m+1}) ⊕ (W_{m+1} ⊗ V_{m+1}) ⊕ (W_{m+1} ⊗ W_{m+1})]    (3.64)
        = 𝐕_{m+1} ⊕ 𝐖_{m+1}

The orthogonal complement space, 𝐖_m, is broken down into three mutually orthogonal spaces

    𝐖_m = W_m^a ⊕ W_m^b ⊕ W_m^c    (3.65)

where W_m^a = V_m ⊗ W_m, W_m^b = W_m ⊗ V_m, and W_m^c = W_m ⊗ W_m. The bases for the orthogonal complement spaces are

    ψ^a_{m,n}(x, y) = φ_{m,n1}(x) ψ_{m,n2}(y)  for W_m^a
    ψ^b_{m,n}(x, y) = ψ_{m,n1}(x) φ_{m,n2}(y)  for W_m^b    (3.66)
    ψ^c_{m,n}(x, y) = ψ_{m,n1}(x) ψ_{m,n2}(y)  for W_m^c

As a note, the bases for the orthogonal complement spaces in R³ are

    ψ^a_{m,n}(x, y, z) = φ_{m,n1}(x) φ_{m,n2}(y) ψ_{m,n3}(z)
    ψ^b_{m,n}(x, y, z) = φ_{m,n1}(x) ψ_{m,n2}(y) φ_{m,n3}(z)
    ψ^c_{m,n}(x, y, z) = ψ_{m,n1}(x) φ_{m,n2}(y) φ_{m,n3}(z)
    ψ^d_{m,n}(x, y, z) = φ_{m,n1}(x) ψ_{m,n2}(y) ψ_{m,n3}(z)    (3.67)
    ψ^e_{m,n}(x, y, z) = ψ_{m,n1}(x) φ_{m,n2}(y) ψ_{m,n3}(z)
    ψ^f_{m,n}(x, y, z) = ψ_{m,n1}(x) ψ_{m,n2}(y) φ_{m,n3}(z)
    ψ^g_{m,n}(x, y, z) = ψ_{m,n1}(x) ψ_{m,n2}(y) ψ_{m,n3}(z)

The same process continues for higher dimensional spaces. The breakup of the orthogonal complement space into several subspaces has an interesting interpretation. For R², W_m^a gives detail along the horizontal direction, W_m^b gives detail in the vertical direction, and W_m^c gives detail in the diagonal direction.

In R³, if you picture standing at the corner of a cube, the same breakup of detail along the three connecting faces of the cube and the principal diagonal of the cube accounts for the seven subspaces (three of the directions are redundant, which accounts for seven subspaces instead of ten). The orthogonal projection operator 𝐐_m: L²(R²) → 𝐖_m is

    𝐐_m = Q_m^a + Q_m^b + Q_m^c
    𝐐_m f = Σ_n ψ^a_{m,n} ψ^{a*}_{m,n} f + Σ_n ψ^b_{m,n} ψ^{b*}_{m,n} f + Σ_n ψ^c_{m,n} ψ^{c*}_{m,n} f    (3.68)

This leads to a decomposition of any f ∈ L²(R²) at resolution m, of

    f_m ≡ 𝐏_m f = 𝐏_M f + Σ_{k=m+1}^{M} 𝐐_k f,  M > m.    (3.69)

The decomposition of a function in L²(R^N) is the same; the only difference is the number of components in the orthogonal complement space. It is possible to define a multiresolution analysis in higher dimensions directly, without using tensor products of L²(R¹) [13]. The tensor product approach, though, gives a direct extension, in a straightforward manner, to the vast work being done with wavelets in L²(R¹).
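The two-dimensional decomposition is directly available in software. The sketch below (PyWavelets assumed available; the correspondence of its horizontal/vertical/diagonal detail blocks to W^a, W^b, W^c is up to that library's labeling convention) computes one level of (3.64)-(3.68) and verifies exact reconstruction.

```python
import numpy as np
import pywt  # PyWavelets, assumed available

# One level of the 2-D decomposition: a coarse block plus the three
# detail blocks (horizontal, vertical, diagonal) of (3.65).
xx, yy = np.meshgrid(np.linspace(-1, 1, 64), np.linspace(-1, 1, 64))
f = np.exp(-4.0 * (xx**2 + yy**2))          # a smooth 2-D test surface

cA, (cH, cV, cD) = pywt.wavedec2(f, 'haar', level=1)
print(cA.shape, cH.shape, cV.shape, cD.shape)

# Reconstruction from the coarse block plus all three detail blocks is exact:
f_rec = pywt.waverec2([cA, (cH, cV, cD)], 'haar')
print(np.allclose(f_rec, f))
```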

3.9  Summary

This chapter develops an approach for function approximation and synthesis using a set of compactly supported, orthonormal scaling functions and wavelets. The multiresolution analysis provides the means of building the function approximation or function synthesis from a coarse starting point and then adding detail as required. This approach is ideal from the standpoint of learning control systems. The development of wavelet theory in this chapter is brief in nature; there are a variety of good books on the market which I encourage the reader to turn to for a more thorough development of this very important and fascinating field.

Chapter 4

The Scaling Function and Neural Networks

Chapter 3 took up the issue of function approximation and synthesis using the structure of multiresolution analysis. Multiresolution analysis starts with a coarse approximation to a function using a set of orthogonal scaling functions. More and more detail may be added to the function approximation by adding layers of wavelets which have finer and finer resolution. This chapter will take up the issue of what resolution to start a function approximation or synthesis at, along with methods for making an initial approximation. We will also look at how the wavelet basis functions map into a popular neural network structure. While the association of wavelets with neural networks is not required, the massively parallel structure of neural networks provides a practical computational platform.

4.1  Function Approximation by a Finite Combination of Basis Functions

Sanner and Slotine [48] addressed the problem of what class of functions can be represented by a class of Gaussian radial basis functions. Their development, based on multidimensional sampling theory, applies directly to the class of functions which can be represented by a finite combination of scaling functions. This will be the space V₀ of the multiresolution analysis. It is well worth reviewing Sanner and Slotine's development for the insight it gives into choosing an initial set of scaling functions (which determine a multiresolution analysis), along with a starting resolution.

Sampling theory deals with the representation of a function by a countable set of regularly spaced samples of that function. The development is largely performed in the frequency domain. The multidimensional, spatial Fourier transform is defined as

    f^(ν) = (Ff)(ν) = ∫_{R^N} f(x) e^{−j2πνᵀx} dx    (4.1)

and the inversion formula is defined as

    f(x) = (F⁻¹f^)(x) = ∫_{R^N} f^(ν) e^{j2πνᵀx} dν    (4.2)

where x, ν ∈ R^N and f, f^: R^N → R¹. The Fourier transform pair is valid for functions that are absolutely integrable (i.e., ∫|f| exists and is finite) and whose spatial Fourier transform is also absolutely integrable [7][59]. Further, if the spatial Fourier transform has compact support, f admits an exact expansion in terms of its samples on an appropriately defined lattice in R^N [42]. A set of lattice points for R^N can be defined as

    ξ_I = i₁Δ₁e₁ + i₂Δ₂e₂ + ⋯ + i_N Δ_N e_N    (4.3)

where I = (i₁, i₂, …, i_N)ᵀ ∈ Z^N; e₁, e₂, …, e_N are the standard basis vectors for R^N; and Δ₁, Δ₂, …, Δ_N are the sample spacings. To ease some of the math details it will be assumed that Δ₁ = Δ₂ = ⋯ = Δ_N = Δ, i.e., the lattice is square. In practice the state space will be normalized before the approximation, creating the V₀ space of the multiresolution analysis; therefore a square lattice where Δ = 1 is reasonable. A sampling operation can be viewed as a modulation of the hypersurface, f, with a field of Dirac distributions centered at the lattice points, i.e.,

    f_s(x) = λ(x) f(x)    (4.4)

where

∑ δ( x − ξ

)

(4.5)

f s^ (ν ) = f ^ (ν )∗ Λ (ν )

(4.6)

I

I ∈Z N

Fourier transforming (4.4) gives

where ∗ indicates convolution and Λ ( ν) =

1 ΔN

∑ δ( ν − ζ

I

)

(4.7)

I ∈Z N

which gives

f s^ (ν ) = 1N Δ

∑f

I ∈Z

^

(ν − ζ I )

(4.8)

N

It can now be seen that the sampled spectrum, f s^ (ν ) , consists of copies of the original spectrum, f ^ (ν ) , centered at the lattice points,

ζ I = 1 Δ (i1e1 + i2 e 2 +

+i N e N )

(4.9)

Let K( γ ) be the smallest N-cube, centered at the origin, which completely encloses the support of f ^ (ν ) . If Δ is chosen less than or equal to 1 ( 2 γ ) then copies of f ^ (ν ) in the sampled spectrum will not overlap. The canonical interpolating function qc ( x ) , whose spectrum, qc^ (ν ) = ΔN on K( γ ) and zero elsewhere, can be used to reconstruct f ( x ) , i.e., f ( x) =

∑ f (ξ )q ( x − ξ ) I

c

I

(4.10)

I ∈Z N

Equation (4.10) shows that under the conditions that f ( x ) is absolutely integrable and the support of its Fourier transform is compact, f ( x ) may be represented by a countable set of parameters, f ( ξ I ) . If f ( x ) is over sampled other interpolating functions, q ( x ) , may be used. The requirements on the interpolating function are: 1) it must be Fourier transformable; 2) its spectrum, q ^ (ν ) , must be bounded, real valued, and strictly positive on K( γ ) ; 3) its spectrum must vanish outside K( 1 Δ ) . To see how this works let us define the function

$$ c(x) = F^{-1} c^\wedge(\nu) \tag{4.11} $$

where

$$ c^\wedge(\nu) = f^\wedge(\nu)\, q^{\wedge -1}(\nu) \tag{4.12} $$

As before, modulate $c(x)$ with a field of Dirac distributions centered at the lattice points given in (4.3), which gives

$$ c_s(x) = \lambda(x)\, c(x) \tag{4.13} $$

where $\lambda(x)$ is defined in (4.5). Fourier transforming (4.13) gives

$$ c_s^\wedge(\nu) = \frac{1}{\Delta^N} \sum_{I \in Z^N} c^\wedge(\nu - \zeta_I) = \frac{1}{\Delta^N} \sum_{I \in Z^N} f^\wedge(\nu - \zeta_I)\, q^{\wedge -1}(\nu - \zeta_I) \tag{4.14} $$

where $\zeta_I$ is defined in (4.9). Applying the reconstruction filter, $\Delta^N q^\wedge(\nu)$, to (4.14) and taking the inverse Fourier transform leads to

$$ f(x) = \Delta^N \sum_{I \in Z^N} c(\xi_I)\, q(x - \xi_I) \tag{4.15} $$

By restricting the region of interest for approximating $f(x)$ to a compact subset $\Omega$ of $R^N$, and allowing quantifiably small errors in the approximation, a larger class of functions $f(x)$ may be represented by an expansion with a finite number of terms, using a larger class of interpolating functions. The principal result demonstrated by Sanner and Slotine [48] is that if the function $f(x)$ can be smoothly truncated outside a compact set $\Omega$ in such a way that the resulting spatial Fourier transform is absolutely integrable, then the function itself can be uniformly approximated on $\Omega$ with a finite linear combination of appropriately chosen interpolating functions. Proceeding in this vein, if $f(x)$ is not absolutely integrable itself, a new function $f_F(x)$ (read "f Fourier") may be defined which is exactly equal to $f$ on $\Omega$ by

$$ f_F(x) = m(x)\, f(x) \tag{4.16} $$

where $m(x)$ is an infinitely smooth function (partial derivatives of $m(x)$ of all orders exist and are continuous) which is unity on $\Omega$ and which decays to zero faster than $f$ grows. While the Fourier transform of $f_F(x)$ exists, in general it will not have compact support; instead, it will approach zero asymptotically. In order to extend the range of interpolating functions, $q(x)$, it will also be assumed that its spatial Fourier transform will in general not have compact support, but instead will approach zero asymptotically. If, however, the rates of decrease of $f_F^\wedge(\nu)$ and $q^\wedge(\nu)$ are sufficiently fast as $\nu \to \infty$, a choice of a small enough sampling mesh on $R^N$ will result in an expansion $f_\Omega(x)$ in terms of $q(x)$ which can uniformly approximate $f(x)$ to a specified degree of accuracy on $\Omega$, i.e., $|f - f_\Omega| < \varepsilon_f\ \forall x \in \Omega$.

To show this assertion, define a new function $f_{BL}(x)$ ("f bandlimited"), obtained by truncating the spectrum of $f_F(x)$ at a particular radius $\gamma$. It can then be shown that

$$ \left| f_F(x) - f_{BL}(x) \right| \le \int_{K^c(\gamma)} \left| f_F^\wedge(\nu) \right| d\nu \equiv \varepsilon_1 \tag{4.17} $$

where $K^c(\gamma)$ is the complement of the set $K(\gamma)$. Because $f_F^\wedge(\nu)$ is absolutely integrable, it is possible to choose a spectral truncation radius $\gamma$ so that the error $\varepsilon_1$ is as small as desired. Taking this value of $\gamma$, choose an interpolating function, $q(x)$, such that $q^\wedge(\nu)$ is bounded, real, and positive for $\nu \in K(\gamma)$, but perhaps approaches zero only asymptotically outside of $K(\gamma)$. Define $c^\wedge(\nu) = f_{BL}^\wedge(\nu)\, q^{\wedge -1}(\nu)$. Sampling the resulting $c(x)$ with a mesh size $\Delta \le 1/(2\gamma)$ produces an expansion $f_R(x)$ ("f reconstructed") as

$$ f_R(x) = \Delta^N \sum_{I \in Z^N} c(\xi_I)\, q(x - \xi_I) \tag{4.18} $$

where

$$ \Delta^N \sum_{I \in Z^N} c(\xi_I)\, q(x - \xi_I) = F^{-1}\left[ \Delta^N q^\wedge(\nu)\, c_s^\wedge(\nu) \right] $$

and

$$ F^{-1}\left[ \Delta^N q^\wedge(\nu)\, c_s^\wedge(\nu) \right] = \int_{K(1/\Delta)} q^\wedge(\nu)\, q^{\wedge -1}(\nu)\, f_{BL}^\wedge(\nu)\, e^{j2\pi\nu^T x}\, d\nu + \int_{K(1/\Delta)} q^\wedge(\nu) \left[ \sum_{I \ne 0} c^\wedge(\nu - \zeta_I) \right] e^{j2\pi\nu^T x}\, d\nu + \int_{K^c(1/\Delta)} q^\wedge(\nu) \left[ \sum_{I \in Z^N} c^\wedge(\nu - \zeta_I) \right] e^{j2\pi\nu^T x}\, d\nu \tag{4.19} $$

By construction $f_{BL}^\wedge(\nu)$ is zero outside of $K(\gamma)$, and by the choice of sampling rate the second integral in (4.19) is zero; therefore (4.19) reduces to

$$ f_R(x) = \int_{K(1/\Delta)} f_{BL}^\wedge(\nu)\, e^{j2\pi\nu^T x}\, d\nu + \int_{K^c(1/\Delta)} q^\wedge(\nu) \left[ \sum_{I \in Z^N} c^\wedge(\nu - \zeta_I) \right] e^{j2\pi\nu^T x}\, d\nu \tag{4.20} $$

The last integral in (4.20) represents the error in approximating the bandlimited function, $f_{BL}(x)$, with an interpolating function which is not an ideal low-pass filter. In that $q^\wedge(\nu)$ is bounded away from zero on $K(\gamma)$, and by the choice of mesh size the repeating spectra in the summation are non-overlapping, one has

$$ \left| \sum_{I \in Z^N} c^\wedge(\nu - \zeta_I) \right| \le \sup_{\nu \in K(\gamma)} \left| f^\wedge(\nu)\, q^{\wedge -1}(\nu) \right| \equiv \kappa_C \tag{4.21} $$

for each frequency point $\nu$. Using the fact that $q^\wedge(\nu)$ is always positive produces the upper bound

$$ \left| f_{BL}(x) - f_R(x) \right| \le \kappa_C \int_{K^c(1/\Delta)} q^\wedge(\nu)\, d\nu \equiv \varepsilon_2 \tag{4.22} $$

which makes it possible to choose a $\Delta$ such that $\varepsilon_2$ is as small as desired.

The last issue is the truncation error of the series in (4.18), created by limiting the number of terms in the series to those which correspond to samples $\xi_I$ which fall within a radius $\rho$ of $x$. Define

$$ I_\rho(x) = \{ I \in Z^N \mid \xi_I \in B_\rho(x) \} $$

where $B_\rho(x) = \{ y \in R^N \mid \|y - x\| < \rho \}$, and

$$ f_\rho(x) = \Delta^N \sum_{I \in I_\rho} c(\xi_I)\, q(x - \xi_I) \tag{4.23} $$

The truncation error can be written as

$$ \varepsilon_\rho(x) = f_R(x) - f_\rho(x) = \Delta^N \sum_{I \in I_\rho^c} c(\xi_I)\, q(x - \xi_I) \tag{4.24} $$

and the maximum truncation error for $x \in \Omega$ as

$$ \varepsilon_3 = \sup_{x \in \Omega} \left| \varepsilon_\rho(x) \right| \tag{4.25} $$

If $q(x)$ is chosen to decay rapidly outside a given radius, then a $\rho$ may be chosen which will make the truncation error as small as desired. Of course, if $q(x)$ has compact support, then the error $\varepsilon_3$ will be zero for an appropriately chosen radius $\rho$. A second method of describing the truncation error is to define the sets

$$ \Omega_\rho = \bigcup_{x \in \Omega} B_\rho(x) \tag{4.26} $$

and

$$ I_o = \{ I \in Z^N \mid \xi_I \in \Omega_\rho \} \tag{4.27} $$

along with the function

$$ f_\Omega(x) = \Delta^N \sum_{I \in I_o} c(\xi_I)\, q(x - \xi_I) \tag{4.28} $$

From the above definitions it is clear that $I_\rho \subset I_o$ and

$$ \left| f_R(x) - f_\Omega(x) \right| \le \left| f_R(x) - f_\rho(x) \right| \le \varepsilon_3 \tag{4.29} $$

The advantage of the first truncation method over the second is that many fewer elements in the series expansion must be calculated at a given position $x$. For interpolating functions which have compact support, there is no advantage in calculating terms in the expansion for which $x$ lies outside of the support of $q(x - \xi_I)$; this can result in significant computational savings. Using (4.17), (4.22), and (4.25) or (4.29) gives a total approximation error of

$$ \left| f(x) - f_\Omega(x) \right| \le \left| f(x) - f_\rho(x) \right| \le \varepsilon_1 + \varepsilon_2 + \varepsilon_3 = \varepsilon_f \quad \forall x \in \Omega \tag{4.30} $$

Here, error $\varepsilon_1$ is due to bandlimiting $f(x)$; error $\varepsilon_2$ is caused by interpolating functions which have overlapping supports in the frequency domain; and error $\varepsilon_3$ is due to truncation of the series expansion of $f(x)$. For the purposes of this thesis, the interpolating function $q(x)$ will be the scaling function $\varphi(x)$ of the multiresolution analysis (see chapter 3). In that scaling functions with compact support are being considered, the truncation error term $\varepsilon_3$ may be ignored unless the chosen radius $\rho$ is smaller than the support radius of $\varphi(x)$.
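Before moving on, it is worth exercising the error budget (4.30) numerically. The following one-dimensional sketch reconstructs a test function from lattice samples in the manner of (4.10) and (4.15). It is illustrative only: the Gaussian interpolator and the cosine test function are stand-ins chosen for brevity (not the scaling functions of chapter 3), and the coefficients are taken directly as samples, skipping the $q^{\wedge -1}$ correction of (4.12), so a small $\varepsilon_2$-type residual remains.

```python
import numpy as np

# Illustrative 1-D reconstruction in the manner of (4.10)/(4.15).
# The Gaussian interpolator q is a stand-in; its spectrum is not an
# ideal low-pass filter, so a small aliasing/smoothing residual remains.

def f(x):
    return np.cos(x)                      # test function

def q(x, d):
    # unit-area Gaussian of width d (stand-in interpolating function)
    return np.exp(-0.5 * (x / d) ** 2) / (d * np.sqrt(2.0 * np.pi))

d = 0.25                                  # lattice spacing Delta
lattice = np.arange(-8.0, 8.0 + d, d)     # sample points xi_I (margin past Omega)

x = np.linspace(-3.0, 3.0, 601)           # Omega, well inside the lattice
# f_R(x) = Delta * sum_I c(xi_I) q(x - xi_I), with c(xi_I) ~ f(xi_I)
# as in (4.31), i.e., the q^{-1} correction of (4.12) is skipped here.
f_R = d * sum(f(xi) * q(x - xi, d) for xi in lattice)

print("max |f - f_R| on Omega:", float(np.max(np.abs(f(x) - f_R))))
# Shrinking Delta shrinks this residual, mirroring the roles of
# epsilon_1 and epsilon_2 in (4.30).
```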

4.2 Scaling Function Example

The primary interest, from the standpoint of this dissertation, is not to choose an interpolating function $q(x)$ which will be used for representing a given function $f(x)$ to some desired level of accuracy. Instead, the interest lies in making an intelligent choice of an initial resolution for the scaling function with which to build a multiresolution approximation of, or synthesis for, $f(x)$.

Picking up the example scaling function from chapter three, Figure 4.1 shows the magnitude of the Fourier transform of this function, $\varphi^\wedge(\nu)$. The frequency response is based on a translation step size, $\Delta$, of one. As expected, the frequency response of the scaling function, $\varphi^\wedge(\nu)$, is that of a low-pass filter; the cutoff frequency will be proportional to $1/\Delta$. Refer to Daubechies [13] or Kaiser [27] for details on the frequency response of other scaling functions. As a point of interest, Figure 4.2 shows the magnitude of the Fourier transform of the wavelet, $\psi^\wedge(\nu)$, from chapter 3.

[Figure 4.1: Scaling Function Fourier Transform, $\varphi^\wedge(\nu)$; magnitude versus $\nu$ (cycles/unit distance).]

[Figure 4.2: Wavelet Fourier Transform, $\psi^\wedge(\nu)$; magnitude versus $\nu$ (cycles/unit distance).]

This frequency response has a bandpass filter characteristic, consistent with the concept that the wavelets carry the detail information.

To choose an initial resolution for the scaling functions, or equivalently an initial lattice spacing, first define the upper spatial frequency bandwidth, $\gamma$, to be used at the coarse approximation level. Choosing too high a bandwidth will cause an excessive number of scaling functions to be used; choosing too low a bandwidth will give an initial approximation which does not resemble the function being approximated very well. Figure 4.3 shows an example frequency spectrum for a function $f(x)$. The upper frequency limits, $\pm\gamma$, have been normalized to $1/\Delta$; equivalently, $1/\Delta$ will be proportional to $\gamma$.

[Figure 4.3: Spatial Frequency Spectrum of $f(x)$; $f^\wedge(\nu)$ versus $\nu$ (cycles/unit distance, normalized to $1/\Delta$), with band edges at $\pm\gamma$.]

Following the development above, set $c^\wedge(\nu) = f^\wedge(\nu)\,\varphi^{\wedge -1}(\nu)$. Figure 4.4 shows part of the sampled spectrum, $c_s^\wedge(\nu)$, along with the reconstruction filter, $\varphi^\wedge(\nu)$. Because $\varphi^\wedge(\nu)$ is approximately unity over the bandwidth of $f^\wedge(\nu)$, we have $c^\wedge(\nu) \approx f^\wedge(\nu)$, which implies that $c(x) \approx f(x)$. For this example, the decomposition of $f(x)$ can be approximated by

$$ f(x) \approx \Delta^N \sum_{I \in I_0} c(\xi_I)\,\varphi(x - \xi_I) \approx \Delta^N \sum_{I \in I_0} f(\xi_I)\,\varphi(x - \xi_I) \tag{4.31} $$

where $\xi_I$ is defined in (4.3), and for this example N = 1.

[Figure 4.4: Sampled Spatial Frequency Spectrum, $c_s^\wedge(\nu)$, shown with the reconstruction filter $\varphi^\wedge(\nu)$, versus $\nu$ (cycles/unit distance, normalized to $1/\Delta$).]

The frequency spectrum of $\varphi^\wedge(\nu)$ is not ideal in that it overlaps the adjacent sampled spectra. This will cause aliasing errors as defined in (4.22). This is not a big concern in that the scaling functions are being used as the first-level approximation; wavelets will then fill in detail, reducing this aliasing error.

A second-order example should help to solidify some of this material. Suppose we choose $f(x) = \cos(\|x\|)$, where $\|x\| = \sqrt{x_1^2 + x_2^2}$. Suppose also that the region of interest is $\Omega = \{ x \mid \|x\| \le 3\pi/2 \}$. The function $f(x)$ may be truncated with the function

$$ m(x) = \exp\left[ -\left( sg(\|x\| - 3\pi/2) \right)^2 \right] \tag{4.32} $$

where

$$ sg(r) = \begin{cases} 0 & \text{for } r \le 0 \\ r & \text{for } r > 0 \end{cases} $$

It can be seen that $m(x)$ is unity on $\Omega$ and decays rapidly to zero outside of $\Omega$. Along each axis, the primary spatial frequency of $f(x)$ is clearly $1/2\pi$, so a choice of $\Delta_1 = \Delta_2 = 1$ is quite reasonable. Therefore the scaling functions will be on a 1-by-1 lattice which must cover $\Omega$, where 1 is the translation step size for the $V_0$ space in a multiresolution analysis (reference the development in chapter 3). For cases where the upper spatial frequency limit along each axis is substantially different, causing $\Delta_1 \ne \Delta_2$, the easiest route is to first normalize each $x$ axis according to $x_{ni} = x_i / \Delta_i$; the approximation may then be carried out in the normalized space.

The scaling function used for this example will be the one from chapter 3. Since we are working in $R^2$, the scaling function will be defined as

$$ \varphi(x_1, x_2) = \varphi(x_1)\,\varphi(x_2) \tag{4.33} $$

The practical support of $\varphi(x)$ is $[-3, 3]$. Therefore, to approximate $f(x)$ on $\Omega$, we will need a lattice covering $\Omega$ of at least $\{-8, -7, \ldots, 7, 8\} \times \{-8, -7, \ldots, 7, 8\}$, for a total of 289 scaling functions. Figure 4.5a shows $f(x)$; Figure 4.5b shows a single scaling function, $\varphi(x)$; and Figure 4.5c shows the approximation to $f(x)$ using a linear combination of scaling functions according to the decomposition given in (4.31). It can be seen from the plots that the approximation is quite reasonable for a first-level approximation. There are some edge effects where the function $f(x)$ is being truncated; this is primarily due to the higher spatial frequency content there.

[Figure 4.5: Example Approximation in $R^2$; (a) $f(x)$, (b) $\varphi(x)$, (c) the $f(x)$ approximation.]
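The example is straightforward to prototype. The sketch below is a minimal stand-in version: a cubic B-spline replaces the chapter-3 scaling function (it shares the compact support and low-pass character, but is not the same function), the tensor-product form (4.33) builds the two-dimensional basis, and the coefficients are taken as samples per (4.31).

```python
import numpy as np

# Stand-in version of the R^2 example: approximate f(x) = cos(||x||)
# with a first-level decomposition (4.31) on the 17x17 unit lattice.

def bspline3(t):
    """Cubic B-spline, support [-2, 2]; a stand-in scaling function."""
    t = np.abs(t)
    out = np.zeros_like(t)
    m1 = t < 1.0
    m2 = (t >= 1.0) & (t < 2.0)
    out[m1] = 2.0 / 3.0 - t[m1] ** 2 + 0.5 * t[m1] ** 3
    out[m2] = (2.0 - t[m2]) ** 3 / 6.0
    return out

def f(x1, x2):
    return np.cos(np.sqrt(x1 ** 2 + x2 ** 2))

centers = np.arange(-8, 9)                   # lattice covering Omega with margin
g1, g2 = np.meshgrid(np.linspace(-4.7, 4.7, 95), np.linspace(-4.7, 4.7, 95))

approx = np.zeros_like(g1)
for n1 in centers:                           # tensor-product sum, (4.31)/(4.33)
    for n2 in centers:
        approx += f(n1, n2) * bspline3(g1 - n1) * bspline3(g2 - n2)

print("max error on Omega:", float(np.abs(f(g1, g2) - approx).max()))
# The residual here is mostly an amplitude droop: this stand-in phi has
# phi^ < 1 over the band, and the phi^{-1} correction of (4.12) was
# skipped.  Wavelet detail at finer scales would supply the difference.
```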

4.3 Neural Network Topology

This section combines the multiresolution analysis of chapter three with the results of this chapter and maps the wavelet decomposition into a popular neural network structure. The use of neural networks in this application is purely utilitarian: the neural network topology is an ideal, massively parallel structure which may be used for computing a wavelet decomposition.

A multiresolution analysis is formed by a nested sequence of closed subspaces, $V_m$, with certain properties between the subspaces (refer to definition 3.1). From the property

$$ \cdots \subset V_2 \subset V_1 \subset V_0 \subset V_{-1} \subset V_{-2} \subset \cdots \tag{4.34} $$

it can be seen that the smaller the subscript m, the higher the resolution of the subspace. By property "a" of definition 3.1, the set of scaling functions $\{\varphi(x - n)\}_{n \in Z^N}$ forms an orthonormal basis for $V_0$. Here, it is assumed that $V_m$, a subspace of $L^2(R^N)$, is formed by a tensor product of the subspaces $V_m^i \subset L^2(R^1)$, for $i = 1, \ldots, N$. The scaling functions, $\bar\varphi : R^N \to R^1$, for the N-dimensional space are given by

$$ \bar\varphi(x - n) = \varphi(x_1 - n_1)\,\varphi(x_2 - n_2) \cdots \varphi(x_N - n_N) \tag{4.35} $$

In the ensuing material, the bar over the scaling function will be dropped, and it will be assumed that $\varphi : R^N \to R^1$. The function space $V_0$ will be considered the starting resolution for function approximation or synthesis. Recall from the development in chapter 3 that in $V_0$ the scaling functions are on a unit lattice covering $R^N$. Figure 4.6 depicts this case for $R^2$, where each dot represents the center location of a single scaling function.

[Figure 4.6: Unit Lattice for Scaling Functions.]

$V_0$ may be used as the starting resolution through the use of a normalization. Let $\gamma_i$, for i = 1 to N, be the upper spatial frequency bandwidth, for each spatial direction, at the coarse approximation level. The non-normalized sample spacing will be

$$ \Delta_i = \beta\, \frac{1}{\gamma_i} \tag{4.36} $$

where $0 < \beta < 0.5$. By letting

$$ \bar x_i = \frac{1}{\Delta_i}\, x_i \quad \text{for } i = 1 \text{ to } N \tag{4.37} $$

the sample spacing (or, equivalently, the lattice spacing on which the scaling functions will be placed) for the normalized function space will be unity. In the normalized space, the lattice points $\xi_I$ of (4.3) are simply $\xi_I = I$, where $I \in Z^N$. The scaling function of chapter 3 may now be equated with the interpolating function of this chapter:

$$ \varphi(\bar x - n) = q(\bar x - \xi_I) \tag{4.38} $$

where

$$ \xi_I = I = n, \quad I, n \in Z^N \tag{4.39} $$

Note: the normalization is done as a matter of convenience; a non-normalized space could be used just as well, with appropriate adjustments to scale factors. To avoid additional complexity in notation, it will be assumed that all function approximation or synthesis is being carried out in the normalized space, and the bar over the state variable x will be dropped.

Studying the orthogonal projection operator $P_m : L^2(R^N) \to V_m$ from chapter 3, for m = 0 we have

$$ P_0 f = \sum_{n \in Z^N} \varphi_n\, \varphi_n^* f = \sum_{n \in Z^N} \varphi_n\, \langle \varphi_n, f \rangle = \sum_{n \in Z^N} \alpha_n\, \varphi_n \tag{4.40} $$

where

$$ \alpha_n = \langle \varphi_n, f \rangle = \int_{\Omega_{\varphi_n}} \varphi(x - n)\, f(x)\, dx \tag{4.41} $$

and

$$ \Omega_{\varphi_n} = \{ x \in R^N \mid x \in \text{support of } \varphi_n \} \tag{4.42} $$

Recall that $\varphi_n(x) = \varphi(x - n)$, and that the support of a function is the closure of the set of points where the function is non-zero. The projection operator $P_0$ projects a function $f \in L^2(R^N)$ (in our case a function in the normalized $L^2(R^N)$ space) onto the space $V_0$. We say that $P_0 f$ is the approximation of f at the $V_0$ resolution. Comparing equations (4.18) and (4.40), it can be seen that sampling theory and wavelet theory give similar decompositions when using the scaling function as the interpolating function. It is in general not true that

$$ \alpha_n = \langle \varphi_n, f \rangle = \Delta^N c(n) = \Delta^N F^{-1}\left[ \varphi^{\wedge -1}(\nu)\, f^\wedge(\nu) \right]\Big|_{x=n} \tag{4.43} $$

for functions $f \in L^2(R^N)$. But for the case where the spatial bandwidth of f is much less than the spatial bandwidth of $\varphi$, the relationship (4.43) will be approximately true. The wavelet analysis minimizes an $L^2$ norm, and the sampling theory analysis minimizes an infinity norm (4.17). In the case where f is slowly varying in relation to the support of $\varphi$, the $L^2$ norm and the infinity norm will be approximately the same. This can be seen in that both the left-hand side and the right-hand side of (4.43) give

$$ \alpha_n = \langle \varphi_n, f \rangle \approx f(n) \approx \Delta^N c(n) = \Delta^N F^{-1}\left[ \varphi^{\wedge -1}(\nu)\, f^\wedge(\nu) \right]\Big|_{x=n}, \quad n \in Z^N \tag{4.44} $$

This gives justification for calling $V_0$ a bandlimited function approximation space. The next higher resolution space is obtained by adding the orthogonal complement to the current space:

$$ V_{m-1} = V_m \oplus W_m \tag{4.45} $$

Wavelet decompositions, (3.68), are used to describe the orthogonal complement space. In $R^N$ the orthogonal complement space will have a number of components depending on N (refer to chapter 3, section 8). Equation (3.69) gives the breakdown of a function $f \in L^2(R^N)$ at the resolution m, where in this case M = 0 and $m \le 0$. Additional resolution only needs to be added in regions of the state space where the function being approximated is changing more rapidly. The directional nature of the components of the orthogonal complement space implies that only those components which line up with the direction of change of the function being approximated will typically be required (this would be a good area for further research). As noted, the multiresolution decomposition of a function $f \in L^2(R^N)$ at the resolution m is given by (3.69):

$$ f_m \equiv P_0 f + \sum_{k=0}^{m} Q_k f, \quad m < 0 \tag{4.46} $$

Here the assumption has been made that the starting resolution is $V_0$. The definitions for $P_0$ and $Q_k$ are given by (3.62) and (3.68), respectively. With a slight abuse of notation, and substituting the definitions for $P_0$ and $Q_k$, we may write

$$ f_R(x) = \sum_{m \in I_R} \sum_{I \in Z^N} \alpha_{m,I}\, \psi_{m,I}(x) \tag{4.47} $$

Here, m is used to index the resolution and the wavelet type, including the scaling function, and R is used to represent the resolution level. The dimension of m will depend on the spatial dimension; i.e., for $R^2$, $m \in I_R \subset Z^4$, and for $R^3$, $m \in I_R \subset Z^8$. The dimension of m will be the number of components in the orthogonal complement space plus one. Refer to chapter 3, section 8 for a description of the orthogonal complement spaces versus spatial dimension. As an example, for $R^2$: let $m = (1,0,0,0)^T$ be the index for the scaling functions at resolution zero; let $m = (0,1,0,0)^T$ be the index for the wavelets in $W_0^a$; let $m = (0,0,1,0)^T$ be the index for the wavelets in $W_0^b$; and let $m = (0,0,0,3)^T$ be the index for the wavelets in $W_{-2}^c$. It is easy to see the pattern here, and it is clear that $I_R$ will be a very sparse subset of $Z^4$ (only one component of m can be nonzero at a time).

The functions of concern in this thesis are defined on a compact subset $\Omega$ of $R^N$, i.e., $f \in L^2(\Omega)$. Outside of $\Omega$ the functions are undefined; from a mathematical purity standpoint, it will be assumed that the functions are truncated smoothly to zero outside of $\Omega$. Functions defined on a compact subset of $R^N$ may be represented by a wavelet decomposition with a finite number of wavelets. As such, the sets

$$ \bar I_m = \{ I \in Z^N \mid \Omega_{\psi_{m,I}} \cap \Omega \ne \emptyset \} \tag{4.48} $$

represent the indexes of the wavelets (and scaling functions) which cover $\Omega$. As in (4.42),

$$ \Omega_{\psi_{m,I}} = \{ x \in R^N \mid x \in \text{support of } \psi_{m,I} \} \tag{4.49} $$

In general, the complete set of wavelets at a given resolution will not be used. The set $I_m \subset \bar I_m$ will be used to represent the set of indexes, at resolution m, of the wavelets used in the decomposition of a given function. The multiresolution decomposition of a function $f \in L^2(\Omega)$ at resolution R may now be written as

$$ f_R(x) = \sum_{m \in I_R,\, I \in I_m} \alpha_{m,I}\, \psi_{m,I}(x) \tag{4.50} $$

If $Q_m$ is the number of components in $I_m$, then

$$ Q_R = \sum_{m \in I_R} Q_m \tag{4.51} $$

will be the total number of wavelets and scaling functions used in the decomposition of the function f at resolution R.

The structure of the multiresolution analysis maps nicely into the structure of a popular neural network. Neural networks are massively parallel, highly interconnected computing structures which have been modeled, or inspired in some sense, after the structure of the human brain. There are a variety of topologies for neural networks, along with a wide variety of methods for "training" neural networks to perform a desired function.

Refer to [14], [28], [43] & [44] for an introduction to various neural network architectures and training methods. Radial basis function networks [47][37][11] have the basic neural network structure required to represent a wavelet decomposition. Wavelet-based neural networks have been proposed by Bakshi and Stephanopoulos [1], and Zhang and Leigh [60].

The basic structure of a three-layer, fully connected neural network is shown in Figure 4.7. This network has an input layer, a single hidden layer, and an output layer. The input layer is merely a distribution point for the input variables $x = (x_1, x_2, \ldots, x_N)^T$. The hidden-layer nodes compute a difference between the input variable and the input weight, $x_i - \xi_{ji}$, and apply this difference to a function $\psi(\cdot)$. The output-layer nodes compute a linear weighted sum of the outputs of the hidden-layer nodes. Each output is therefore a mapping $f_j : R^N \to R^1$. The neural network computes the function

$$ f_i(x) = \sum_j \alpha_{j,i}\, \psi(x - \xi_j) \tag{4.52} $$

It should be quite clear that this neural network will compute a linear combination of scaling, wavelet, or other functions.

[Figure 4.7: Three-Layer Neural Network; an input layer distributing $x_1, \ldots, x_N$, hidden-layer nodes computing $\psi(x - \xi_j)$, and output-layer nodes forming the weighted sums $f_1(x), \ldots, f_M(x)$ with output weights $\alpha_{j,i}$.]
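A minimal software sketch of this topology follows. The hat-shaped basis function, the lattice of centers, and the weights are all placeholder assumptions; the point is the structure itself: fixed input weights $\xi_j$ defining the lattice, tensor-product hidden nodes (Figure 4.8), and adjustable output weights only.

```python
import numpy as np

# Sketch of the network of Figure 4.7 computing (4.52):
#   f_i(x) = sum_j alpha_{j,i} psi(x - xi_j)
# Input weights xi_j are fixed lattice centers; only the output
# weights alpha are trainable.  psi is a placeholder bump function.

def psi(r):
    """Stand-in basis function of one variable, support [-2, 2]."""
    return np.maximum(0.0, 1.0 - np.abs(r) / 2.0)

class WaveletNet:
    def __init__(self, centers, n_outputs):
        self.centers = np.asarray(centers, dtype=float)        # (Q, N)
        self.alpha = np.zeros((len(self.centers), n_outputs))  # output weights

    def hidden(self, x):
        # Tensor-product node of Figure 4.8: product over coordinates.
        return np.prod(psi(np.asarray(x)[None, :] - self.centers), axis=1)

    def __call__(self, x):
        return self.hidden(x) @ self.alpha                     # (n_outputs,)

# Example: 2 inputs, a 5x5 unit lattice of centers, 1 output.
cs = [(i, j) for i in range(-2, 3) for j in range(-2, 3)]
net = WaveletNet(cs, n_outputs=1)
net.alpha[:, 0] = 1.0                                          # arbitrary weights
print(net([0.3, -0.4]))
```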

The multiresolution analysis uses a linear combination of scaling functions at a fixed scale, plus a variety of wavelets (depending on the dimension N of the inputs) at different scales. In mapping the multiresolution analysis to the above neural network, the input weights, $\xi_j$, establish the lattice on which the scaling or wavelet functions are placed, and as such will be fixed. By normalizing the input vector x before applying it to the neural network, the input weights can be used to define a unit-spaced lattice structure. The output weights will be the only variables in the network. The wavelet scale factor, $2^{-m/2}$ (3.12), may be incorporated in with the output weights.

Figure 4.8 shows the basic hidden-layer node structure which may be used with the N-dimensional, tensor-product multiresolution analysis. The $\psi_i(\cdot)$ functions represent the individual scaling functions or wavelet functions required to build the N-dimensional scaling function or wavelet component. Combining Figure 4.8's node structure into the neural network of Figure 4.7 allows the multiresolution analysis to be implemented by a neural network. The primary importance of this implementation is the efficient computational structure.

[Figure 4.8: Hidden-Layer Node Structure; each coordinate difference $x_i - \xi_i$ is passed through its own $\psi_i(\cdot)$, and the N outputs are multiplied to form the node output.]

4.4 Neural Network Training

One of the primary uses of neural networks is function approximation: attempting to "learn" a mapping from $R^N \to R^M$ based on a finite set of exemplars (example input versus output points). The neural network then "generalizes" (or interpolates) for all other points in the input space $R^N$. In other words, suppose $f : R^N \to R^M$ is an unknown mapping representing some measurable process, and let $\bar f : R^N \to R^M$ represent the neural network approximation to f. Let

$$ X_P = \{ x_p \in R^N,\ y_p \in R^M \mid y_p = f(x_p),\ p = 0, 1, \ldots, P \} \tag{4.53} $$

be a set of exemplars: a set of example input versus output data points, typically based on actual measurements from the system or process. Using the exemplars, the internal weights of the neural network are adjusted, or trained, to minimize a measure of $[\bar f(x_p) - f(x_p)]$ over the complete set of exemplars. The usual error measure is

$$ E_P = \frac{1}{2} \sum_{p=0}^{P} \left[ \bar f(x_p) - f(x_p) \right]^2 \tag{4.54} $$

The most common training method is a form of gradient descent called backpropagation, first proposed by Rumelhart, Hinton, and Williams [46]. Most texts on neural networks cover this training algorithm; Freeman and Skapura's book [14] is one example. For the case of the wavelet-based neural network, the backpropagation algorithm becomes particularly simple in that only the output connection weights, or coefficients, are being trained (adjusted). Let

$$ \bar f(x) = \sum_{m \in I_R,\, I \in I_m} \alpha_{m,I}\, \psi_{m,I}(x) \tag{4.55} $$

be the neural network approximation to $f(x)$. To simplify notation, let $i = (m, I)$. The partial derivative of the error function (4.54) with respect to the wavelet coefficient $\alpha_i$ is

$$ \frac{\partial}{\partial \alpha_i} E_P = \sum_{p=0}^{P} \left[ \bar f(x_p) - f(x_p) \right] \psi_i(x_p) \tag{4.56} $$

A gradient vector $\nabla\alpha$ can be defined where $\nabla\alpha_i = \frac{\partial}{\partial \alpha_i} E_P$. The standard gradient descent algorithm would then update the vector of all coefficients, $\alpha$, at iteration k + 1 by $\alpha_{k+1} = \alpha_k - \eta \nabla\alpha$, where the training factor $\eta$ is a small positive number. This approach is called block training in the neural network literature. Due to the typically large number of exemplars and the large number of coefficients, this method is rarely used. More typically, the neural network weights are updated based on the error seen at each exemplar. In this case, define

$$ e_p = \tfrac{1}{2} \left[ \bar f(x_p) - f(x_p) \right]^2 \tag{4.57} $$

and

$$ \nabla_p \alpha_i = \frac{\partial}{\partial \alpha_i} e_p = \left[ \bar f(x_p) - f(x_p) \right] \psi_i(x_p) \tag{4.58} $$

Due to the local nature of the wavelets, at a single exemplar only a limited number of coefficients need to be updated, because $\psi_i(x_p) = 0$ for all wavelets $\psi_i$ for which $x_p$ is outside the support of $\psi_i$. The wavelet coefficients $\alpha_i$ at training iteration k + 1 will be updated as $\alpha_i^{k+1} = \alpha_i^k - \eta \nabla_p \alpha_i$, where the training factor $\eta$ is a small positive number. For small training factors, this method of updating the weights works as well as the block training method [14].

Assuming a wavelet neural network has been established to approximate a function $f \in L^2(\Omega)$, the basic backpropagation training algorithm is:

1) Initialize the output weights, $\alpha_{m,I}$. The output weights for the scaling-function part of the neural network may be initialized to $\alpha_{0,I} = f(\xi_I)$, where $\xi_I$ represents the center points of the scaling functions; all other $\alpha_{m,I}$'s may be initialized to zero. Note: most neural network training algorithms start by randomizing the connection weights. Based on the previous material, this method of initializing the weights should provide a neural network with a reasonable first approximation.

2) Select a training exemplar pair $(x_p, y_p = f(x_p))$. Typically, for purposes of this thesis, $x_p$ will be a randomly chosen input, or an input chosen from some active process, and $y_p$ will be the measured output of that process (as opposed to having a predefined set of exemplars). Apply the input $x_p$ to the neural network to obtain the output $\bar y_p = \bar f(x_p)$.

3) Compute the error term $e_p = \tfrac{1}{2}[\bar f(x_p) - f(x_p)]^2$. If the error term is small, start back at step two with the next exemplar. If the error term is small for all exemplars, the training may be stopped.

4) Determine the set of coefficients, $\alpha_i$, for which $x_p$ is in the support of $\psi_i$. Compute the terms $\nabla_p \alpha_i = \frac{\partial}{\partial \alpha_i} e_p = [\bar f(x_p) - f(x_p)]\, \psi_i(x_p)$.

5) Update the weights according to $\alpha_i^{k+1} = \alpha_i^k - \eta \nabla_p \alpha_i$. Continue the process by returning to step 2.
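A minimal sketch of steps 1 through 5 in one dimension follows. The target mapping, the hat-function basis (standing in for the scaling-function level only; no wavelet levels are included), and the training factor are illustrative assumptions.

```python
import numpy as np

# Per-exemplar training, steps 1-5, for a 1-D target.  The basis, the
# target f, and eta are placeholder choices.

def psi(r):
    return np.maximum(0.0, 1.0 - np.abs(r))   # hat function, support [-1, 1]

centers = np.arange(-6, 7, dtype=float)       # unit lattice (scaling level)
f = np.cos                                    # "unknown" mapping to learn

alpha = f(centers)                            # step 1: alpha_{0,I} = f(xi_I)
eta = 0.2                                     # training factor

rng = np.random.default_rng(0)
for _ in range(2000):
    xp = rng.uniform(-5.0, 5.0)               # step 2: exemplar input
    yp = f(xp)                                #         measured output
    live = np.abs(xp - centers) < 1.0         # step 4: supports containing xp
    fbar = alpha[live] @ psi(xp - centers[live])
    # steps 3/5: gradient of e_p = 0.5*(fbar - yp)^2 and the update
    alpha[live] -= eta * (fbar - yp) * psi(xp - centers[live])

xs = np.linspace(-4.0, 4.0, 201)
fit = np.array([alpha @ psi(x - centers) for x in xs])
print("max error on [-4, 4]:", float(np.max(np.abs(fit - f(xs)))))
# The residual is limited by the coarse unit lattice; adding wavelet
# detail per the multiresolution structure would reduce it.
```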

The above neural network training method is an example of one of the more popular methods of training neural networks of this type. Refer to the neural network literature for additional training methods, along with treatises on the best choices for the training factor.

Due to the orthogonal structure of the wavelet neural network, minimizing the error function (4.54) over enough points in $\Omega$ will result in the wavelet coefficients approaching their optimum. This can be seen by noting that (4.54) is an approximation to the error function

$$ E_T = \frac{1}{2} \int_\Omega \left[ \bar f(x) - f(x) \right]^2 dx \tag{4.59} $$

The optimum set of coefficients, $\alpha_i^*$, is that set of wavelet coefficients which minimizes (4.59). At the minimum, the coefficients $\alpha_i^*$ satisfy

$$ \frac{\partial}{\partial \alpha_i} E_T \Big|_{\alpha_i^*} = 0 \quad \forall i \tag{4.60} $$

The partial derivative of $E_T$ with respect to $\alpha_i$ is

$$ \frac{\partial}{\partial \alpha_i} E_T = \int_\Omega \left[ \bar f(x) - f(x) \right] \psi_i(x)\, dx = \int_\Omega \bar f(x)\, \psi_i(x)\, dx - \int_\Omega f(x)\, \psi_i(x)\, dx \tag{4.61} $$

Noting that $\bar f(x)$ is defined by (4.55), and taking into account the orthogonality properties of the wavelets, we have that

$$ \int_\Omega \bar f(x)\, \psi_i(x)\, dx = \alpha_i \tag{4.62} $$

On the other hand, the multiresolution analysis of chapter 3 gives the coefficient of f itself as

$$ \int_\Omega f(x)\, \psi_i(x)\, dx = \langle \psi_i, f \rangle \tag{4.63} $$

Therefore, using the minimization condition (4.60), we have that

$$ \alpha_i = \langle \psi_i, f \rangle = \alpha_i^* \tag{4.64} $$

This is one advantage of using an orthogonal set of basis functions.
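This optimality property is easy to verify numerically. The sketch below uses a Haar-type family of box functions, a stand-in orthonormal basis rather than the wavelets of chapter 3, and checks that per-exemplar training drives the trained coefficients toward the analytic inner products $\langle \psi_i, f \rangle$ of (4.62)-(4.64).

```python
import numpy as np

# Check of (4.59)-(4.64): with an orthonormal basis, minimizing the
# sampled error drives the trained coefficients to <psi_i, f>.
# Box functions on [0,1) are a stand-in orthonormal family.

K = 8
def psi(i, x):
    # orthonormal boxes: sqrt(K) * indicator of [i/K, (i+1)/K)
    return np.sqrt(K) * ((i / K <= x) & (x < (i + 1) / K))

f = lambda x: np.sin(2.0 * np.pi * x)

alpha = np.zeros(K)
rng = np.random.default_rng(1)
for _ in range(30000):
    xp = rng.uniform(0.0, 1.0)
    vals = np.array([psi(i, xp) for i in range(K)])
    alpha -= 0.01 * (alpha @ vals - f(xp)) * vals   # per-exemplar update

xs = np.linspace(0.0, 1.0, 4001)
exact = np.array([np.trapz(f(xs) * psi(i, xs), xs) for i in range(K)])
print("max |alpha_i - <psi_i, f>|:", float(np.max(np.abs(alpha - exact))))
# Agreement is to within the stochastic-gradient noise floor set by eta.
```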

4.5 Summary

This chapter took a step back and looked at the requirements for function approximation using multidimensional sampling theory. This analysis provides intuitive insight into choosing a starting resolution for multiresolution analysis. Thoroughly understanding the function approximation and synthesis capabilities of a chosen set of basis functions is critical in the control systems arena; it is very difficult to predict system stability characteristics and performance without this up-front knowledge. The control systems engineer cannot live with approximation capability statements like "given enough nodes in a neural network, the network…", a statement which is ubiquitous in the neural network literature.

Next, the wavelet multiresolution analysis was mapped to a popular neural network topology. While this association is not necessary from the theoretical point of view, it may prove essential from the practical point of view. A large number of wavelets and their respective coefficients are associated with a multiresolution analysis decomposition of a given function, and computing outputs from this structure in a serial fashion can be slow, even given today's high-speed computer technology. Using the massively parallel structure of neural networks provides an efficient mechanism for computing the wavelet decomposition of a function. In fact, I would go so far as to say that the practicality of the methods presented in this thesis may await the availability of low-cost, hardware-implemented neural networks.

The neural network training method presented was intended to give the flavor of one of the popular methods of training neural networks. The literature abounds with training methods and more detailed descriptions; I refer the interested reader to this more complete material.

Chapter 5

Improving Control on a Known Region of Asymptotic Stability

This chapter concerns itself with the steady improvement of control on a known region of asymptotic stability. This has been the prime area of concern and work of this thesis. It is assumed from the onset that an asymptotically stable control law is known for the given plant, along with a closed and bounded region of attraction. It is on this region of attraction that we wish to improve the control law based on a given objective function. In that this step of improving the control on a known region of asymptotic stability may be considered only one part of the process of "learning to control", it is highly desirable to ensure that, once a region of asymptotic stability is found, the control improvement process will guarantee the region remains asymptotically stable. Also, in that the control improvement process does not optimize the controller over the entire region of interest in one large step, but instead uses a slow, steady, progressive approach, it is a requirement that all intermediate control laws generated must guarantee the region remains asymptotically stable.

The approach taken for improving control on a known region of asymptotic stability is based on the well-founded theory of dynamic programming. It is known in the controls community that dynamic programming will provide an optimal control law for a general nonlinear system, and that this control law will be in the form of state feedback [3][4][23] & [29]. The development presented here is unique in its approach to the optimization problem. The usual approach is to derive the conditions of optimality and then find a control law which meets the conditions; intermediate results, and the stability of those results, are not considered. The approach taken here starts from a non-optimal, stabilizing control law and works towards an optimal control law in an iterative fashion, where system stability is guaranteed with each intermediate control law. Werbos [54][36] proposed a similar method for "reinforcement learning", which he called "Heuristic Dynamic Programming (HDP)". Werbos did not thoroughly address the issue of system stability, or present methods of incremental improvement of the control law which ensure system stability.

5.1 Plant, Control Law, Regions of Operation, and Objective Function

Returning to the plant definition in chapter 2, this thesis considers general nonlinear, autonomous plants, P, of the form

$$ P : \dot x = f(x, u) \tag{5.1} $$

where $x \in R^N$, $u \in R^M$, and $\dot x \equiv \frac{d}{dt} x$.

It is assumed that $f : R^N \times R^M \to R^N$ is continuous, that the first partial derivatives with respect to x and u exist and are continuous, and that f is U-asymptotically controllable on $\Omega_P$, a compact, simply connected subset of $R^N$ containing a neighborhood of the origin. To avoid additional notational complexity which would only serve to cloud key issues, M (the control input dimension) will be set to one. The development is readily extended to higher control input dimensions; where appropriate, notes will be added to indicate how higher-dimensional control inputs could be handled.

For the plant P, it is assumed that an initial feedback control law, $u_0 = g_0(x)$, is known such that the closed-loop system

$$ P^c : \dot x = f(x, g_0(x)) \tag{5.2} $$

is asymptotically stable on a known, possibly limited, invariant, compact set $\Omega_K \subset \Omega_P$. The general form of the control law will be

$$ u = g(x, \alpha) = \sum_{m \in I_R,\, I \in I_m} \alpha_{m,I}\, \psi_{m,I}(x) \tag{5.3} $$

where $I_R$ is a finite subset of $Z^{N_\psi}$ used to index the wavelet resolution, and R refers to the resolution level. Refer to chapter 4, section 3 for a complete definition of the index m, along with the dimension size $N_\psi$ and other elements of this function decomposition. $I_m$ is a finite subset of $Z^N$ with $Q_m$ components; $\alpha_{m,I} \in R^1$ and $\alpha \in R^{Q_R}$, where $Q_R$ is the total number of components used in the decomposition at resolution R. $\psi_{m,I}(x)$ is a suitable set of wavelet basis functions covering $\Omega_P$; in this case, $\psi_{m,I}(x)$ is taken to be both the scaling functions and the different forms of wavelets required for a multiresolution analysis. In general, $\psi_{m,I}(x)$ may represent any compactly supported set of basis functions.

In general, ψ m, I ( x ) , may represent any compactly supported set of basis

functions. In that QR is a finite set, an index “i” or “j” will often be used to indicate an individual coefficient αm,I or wavelet ψ m, I ( x ) rather than carrying around the extra “m,I” indexing overhead. The initial control law may be approximated using the methods in chapters 3 and 4 by u0 = g ( x , α 0 ) =

∑ αI I

m∈

R

,I ∈

0 m, I

ψ m, I ( x )

(5.4)

m

For higher dimension input functions, M > 1, the control input would be defined as: ⎡ α m1 , Iψ m1 , I ( x ) ⎤ ⎡ g1 ( x , α 1 ) ⎤ ⎢ m∈I∑ ⎥ 1 , I ∈I 1 ⎢ ⎥ ⎢ R m ⎥ u=⎢ ⎥=⎢ ⎥ ⎢⎣ g M ( x , α M ) ⎥⎦ ⎢ ∑ α mM, Iψ mM, I ( x ) ⎥ ⎢⎣m∈IRM , I ∈I mM ⎥⎦

where the decomposition of each function g j ( x , α j ) is handled independently. The set of allowable wavelet coefficients, A, will be defined as

(5.5)

{

A = α ∈ R Q α m, I ≤ α max , m ∈IR , I ∈Io

88

}

(5.6)

where α max is a finite, positive number. The set of all u = g ( x , α ) such that α ∈A , will be denoted U. Different control laws represented by a different set of coefficients, α, will be indexed by k, i.e., uk = g ( x , α k ) . The closed loop system, (5.2), with u = g ( x , α k ) will be denoted: Pkc . The trajectories of Pkc starting at x0 will be denoted x (t ) = φtk ( x0 )

(5.7)

Given a U asymptotically controllable plant P, a number of regions in the state space of operation can be defined. Figure Chapter 5 .1 illustrates these regions.

U

Ω A

Ω

P

Ω

K

Ω

Kb

Ω Kc

Ω

Figure Chapter 5 .1 Regions of Operation

At the top, or the largest of these regions is ΩU . ΩU , the universe of discourse, is a bounded region of the state space where all trajectories of the system must lie. Any practical system will have bounds on its states, most often defined by the physics of the

89 system. This is the largest region of the state space that we will have to concern ourselves with. The next region of interest is Ω P . Ω P is the potential Region of Asymptotic Stability (RAS) given the right choice of control law coefficients, g ( x , α ) ∈U . This region is typically unknown from the onset. It is a goal of the "learning" process to expand the known region of asymptotic stability to as much of this region as possible. Once a stabilizing control law, g ( x , α ) ∈U , is determined for P, two additional regions of operation become apparent: the known RAS, Ω K and the actual RAS, Ω A . In the diagram, the known RAS, Ω K , includes both the boundary region, Ω Kb , and the region where the control law will be changed, Ω Kc . The known RAS is just that, the region of ΩU where P is known to be asymptotically stable with the current control. This region is assumed known well enough such that it is invariant. The actual RAS, Ω A , will typically be larger than the known RAS and by definition is invariant. Clearly, the following relationship exists between the sets: Ω K ⊂ Ω A ⊂ Ω P ⊂ ΩU . These sets are assumed to be simply connected, compact sets containing a neighborhood of the origin. For regions of asymptotic stability, the interior of sets Ω A and Ω K will be assumed to be used without making specific distinctions. Issues of stability on the boundary of these sets will not be taken up. This is a reasonable approach in that all changes considered will be made on the interior of either Ω A or Ω K . The region of interest in this chapter, from the aspect of control law improvement, will be Ω K . To help ensure that a stable system remains stable during the adaptation process, and in particular to ensure that Ω K remains invariant, a boundary region, Ω Kb , will be defined where the control law must remain fixed (reference proposition 2.1 and 2.2). Ω Kc , the set on which the control law will be changed, is a compact set such that Ω Kc ⊂ Ω K and δ = ρb (Ω K , Ω Kc ) > 0 (refer to definition 2.10 for the meaning of ρb (⋅,⋅) ) The boundary set will then be defined as Ω Kb = Ω K − Ω Kc . The control law will only be allowed to change on the set Ω Kc . This is possible due to the compact support and

90 multiresolution structure of the basis functions used to represent the control law. By proposition 2.1, if Ω K is an invariant set for Pkc then Ω K will remain an invariant set for Pkc+1 if changes to g ( x , α k ) resulting in g ( x , α k +1 ) are constrained to the set Ω K − Ω Kb . One method for establishing Ω Kc and Ω Kb is as follows. Let Ωψ i be the support of ψ i ( x) . Define

{

IδK = m ∈IR , I ∈Im Ωψ

m ,I

ψ m ,I

⊂ Ω K and ρb (Ω

, ΩK ) ≥ δ

}

(5.8)

where δ is a number greater than zero. Using the index “i” to represent the combination “m,I”, ΩδKc is then defined as ΩδKc = ∪K Ωψ i

(5.9)

ΩδKb = Ω K − ΩδKc

(5.10)

i ∈I δ

and ΩδKb is defined as above:
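A sketch of this bookkeeping follows, under simplifying assumptions: $\Omega_K$ is taken to be a one-dimensional interval and each basis function is assigned a known support half-width. Both are placeholders for the real sets, but the selection logic of (5.8)-(5.10) is the same.

```python
# Sketch of (5.8)-(5.10): select the basis functions whose supports lie
# inside Omega_K with margin delta; only their coefficients may change.
import numpy as np

omega_K = (-4.0, 4.0)     # known RAS, taken here as an interval (stand-in)
delta = 0.5               # required margin to the boundary, per (5.8)
rad = 1.0                 # assumed support half-width of each psi_i

centers = np.arange(-5.0, 6.0)            # lattice of basis-function centers

# I_delta^K: supports contained in Omega_K, at least delta from its boundary.
adjustable = [c for c in centers
              if c - rad >= omega_K[0] + delta and c + rad <= omega_K[1] - delta]
frozen = [c for c in centers if c not in adjustable]

print("Omega_Kc centers (may change):", adjustable)   # union of supports = (5.9)
print("Omega_Kb centers (held fixed):", frozen)       # boundary region, (5.10)
```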

The objective (or cost) function

$$ J(x_0) = \int_0^\infty L(x, u)\, dt \tag{5.11} $$

will define the performance of $P^c$ on $\Omega_P$. The integral is taken along trajectories of $P^c$ starting at $x(0) = x_0$ and ending at $x(\infty) = 0$. The following restrictions will be placed on the objective function (5.11):

• $L(x, u)$ must be positive definite in x and positive semi-definite in u;
• the partial derivatives of $L(x, u)$ with respect to x and u must exist and be continuous; and
• $J(x_0) < \infty\ \forall x_0 \in \Omega_A$.

The objective function is a function of the control law. To show this dependence, the objective function may be written as

$$ J(x_0, \alpha_k) = \int_0^\infty L\left[ \phi_\tau^k(x_0),\, g(\phi_\tau^k(x_0), \alpha_k) \right] d\tau \tag{5.12} $$

Equation (5.12) also shows the explicit dependence of the objective function on the actual trajectory of the closed-loop system. To help ease equation complexity, the following definitions will be made:

$$ L_k(x) \equiv L(x, g(x, \alpha_k)) \tag{5.13} $$

and

$$ f_k(x) \equiv f(x, g(x, \alpha_k)) \tag{5.14} $$

Using this notation, equation (5.12) may be rewritten as

$$ J(x_0, \alpha_k) = \int_0^\infty L_k(\phi_\tau^k(x_0))\, d\tau \tag{5.15} $$

The graph of $J(x, \alpha)$ versus the state vector x in $R^{N+1}$, for a constant $\alpha$, will be referred to as the cost-to-go surface. Note that the objective function may be written as

$$ J(x_0, \alpha_k) = \int_0^t L_k(\phi_\tau^k(x_0))\, d\tau + J(x_t, \alpha_k) \tag{5.16} $$

where

$$ x_t = \phi_t^k(x_0) \tag{5.17} $$

and

$$ J(x_t, \alpha_k) = \int_0^\infty L_k(\phi_\tau^k(x_t))\, d\tau \tag{5.18} $$
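Numerically, the cost-to-go (5.15) can be evaluated by integrating the running cost along a simulated closed-loop trajectory. The sketch below does this for a placeholder scalar plant and a fixed stabilizing linear control law (not a wavelet controller); for these choices $J(x_0) = 0.7\,x_0^2$ in closed form, which provides a check on the integration.

```python
# Sketch of evaluating J(x0, alpha_k) per (5.15) along a simulated
# trajectory.  The plant, control law, cost, and Euler integration are
# placeholder choices; here J(x0) = 0.7*x0^2 analytically.

def f(x, u):                 # stand-in plant: xdot = x + u
    return x + u

def g(x):                    # stand-in stabilizing control law
    return -2.0 * x

def L(x, u):                 # running cost, positive definite in x
    return x ** 2 + 0.1 * u ** 2

def cost_to_go(x0, dt=1e-3, t_max=20.0):
    x, J = x0, 0.0
    for _ in range(int(t_max / dt)):
        u = g(x)
        J += L(x, u) * dt    # accumulate the integrand of (5.15)
        x += f(x, u) * dt    # Euler step along phi_t^k(x0)
        if abs(x) < 1e-9:    # trajectory has effectively reached the origin
            break
    return J

print("J(1.0) numeric:", cost_to_go(1.0), " analytic: 0.7")
# Splitting the integral at any intermediate time reproduces (5.16):
# J(x0) = int_0^t L dtau + J(x_t).
```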

5.2 Conditions for Pointwise Improvement of the Cost-to-Go

A first step towards improving the control on a known region of asymptotic stability is to find conditions on changes to the control law which will result in the objective function being reduced. The first approach considered is a pointwise condition, without regard for practicality. The region of the state space considered is $\Omega_K$. Restating the problem, we would like to determine conditions on changes to the control $g(x, \alpha)$ such that

$$ J(x, \alpha_{k+1}) \le J(x, \alpha_k) \quad \forall x \in \Omega_K \tag{5.19} $$

First, we will look at a proposition which gives the flavor of the conditions; then the conditions will be formalized in a theorem.

Proposition 5.1   Suppose we can find an $\alpha_{k+1}$ and some $\Delta t > 0$ such that

$$ \int_0^{\Delta t} L_{k+1}(\phi_\tau^{k+1}(x_0))\, d\tau + J(x_{\Delta t}, \alpha_k) \le J(x_0, \alpha_k) \quad \forall x_0 \in \Omega_K \tag{5.20} $$

and such that

a) $g(x, \alpha_{k+1}) = g(x, \alpha_k)\ \forall x \in \Omega_{Kb}$;
b) $g(0, \alpha_{k+1}) = g(0, \alpha_k) = 0$;

then $J(x_0, \alpha_{k+1}) \le J(x_0, \alpha_k)\ \forall x_0 \in \Omega_K$.

Argument: Conditions "a" and "b" are required to help ensure the system remains stable; these conditions will be discussed in more detail in theorem 5.1. Per the assumption,

$$ \int_0^{\Delta t} L_{k+1}(\phi_\tau^{k+1}(x_0))\, d\tau + J(\phi_{\Delta t}^{k+1}(x_0), \alpha_k) \le J(x_0, \alpha_k) \quad \forall x_0 \in \Omega_K \tag{5.21} $$

The fact that (5.21) holds for all x in $\Omega_K$ implies that

$$ \int_{\Delta t}^{2\Delta t} L_{k+1}(\phi_\tau^{k+1}(x_0))\, d\tau + J(\phi_{2\Delta t}^{k+1}(x_0), \alpha_k) \le J(\phi_{\Delta t}^{k+1}(x_0), \alpha_k) \tag{5.22} $$

Adding equations (5.21) and (5.22) and simplifying gives

$$ \int_0^{2\Delta t} L_{k+1}(\phi_\tau^{k+1}(x_0))\, d\tau + J(\phi_{2\Delta t}^{k+1}(x_0), \alpha_k) \le J(x_0, \alpha_k) \tag{5.23} $$

Continuing this process, we have

$$ \int_0^{n\Delta t} L_{k+1}(\phi_\tau^{k+1}(x_0))\, d\tau + J(\phi_{n\Delta t}^{k+1}(x_0), \alpha_k) \le J(x_0, \alpha_k) \tag{5.24} $$

Letting $n \to \infty$ and noting that $J(\phi_\infty^{k+1}(x_0), \alpha_k) = 0$ gives

$$ \int_0^\infty L_{k+1}(\phi_\tau^{k+1}(x_0))\, d\tau = J(x_0, \alpha_{k+1}) \le J(x_0, \alpha_k) \tag{5.25} $$

which is what we set out to show.

Theorem 5.1   Given the closed-loop system $P_k^c$ along with the objective function $J(x, \alpha_k)$, both defined and under the constraints of section 5.1 above; assume that $P_k^c$ is asymptotically stable under the control law $g(x, \alpha_k)$ on the compact invariant set $\Omega_K$, such that $J(x, \alpha_k)$ is defined and finite for all $x \in \Omega_K$; suppose that the control law is changed via its parameters to $g(x, \alpha_{k+1})$ such that:

a) $g(x, \alpha_{k+1}) = g(x, \alpha_k)\ \forall x \in \Omega_{Kb}$;
b) $g(0, \alpha_{k+1}) = g(0, \alpha_k) = 0$;
c) $L_{k+1}(x) + \langle \nabla_x J(x, \alpha_k), f_{k+1}(x) \rangle \le 0\ \forall x \in \Omega_K$;

then $J(x, \alpha_{k+1}) \le J(x, \alpha_k)\ \forall x \in \Omega_K$. In addition, the new system $P_{k+1}^c$ will be asymptotically stable on $\Omega_K$.

Proof: First note that $\langle \nabla_x J(x, \alpha_k), f_{k+1}(x) \rangle$ is a directional derivative of the system $P_{k+1}^c$ along the trajectory $\phi_t^{k+1}(x)$, i.e.,

$$ \langle \nabla_x J(x, \alpha_k), f_{k+1}(x) \rangle = \frac{d}{dt} J(\phi_t^{k+1}(x), \alpha_k) \Big|_{t=0} \equiv \dot J_{k+1}(x, \alpha_k) \tag{5.26} $$

Condition "a" ensures that $\Omega_K$ remains an invariant set for $P_{k+1}^c$. Condition "b" ensures that 0 remains an equilibrium point for $P_{k+1}^c$. Noting that $J(x, \alpha_k)$ is a positive definite function on $\Omega_K$, condition "c" ensures that $J(x, \alpha_k)$ is a valid Lyapunov function for $P_{k+1}^c$. Therefore, by theorem 2.1, $P_{k+1}^c$ will be asymptotically stable on $\Omega_K$, which implies that $\phi_\infty^{k+1}(x) = 0\ \forall x \in \Omega_K$. Integrating condition "c" with respect to time along the trajectory $\phi_t^{k+1}(x)$ gives

$$ \int_0^t L_{k+1}(\phi_\tau^{k+1}(x))\, d\tau + \int_0^t \dot J_{k+1}(\phi_\tau^{k+1}(x), \alpha_k)\, d\tau \le 0 \tag{5.27} $$

which is equivalent to

$$ \int_0^t L_{k+1}(\phi_\tau^{k+1}(x))\, d\tau + J(\phi_t^{k+1}(x), \alpha_k) - J(x, \alpha_k) \le 0 \quad \forall t > 0 \tag{5.28} $$

Taking the limit as $t \to \infty$, and noting that $J(\phi_\infty^{k+1}(x), \alpha_k) = J(0, \alpha_k) = 0$, gives

$$ \int_0^\infty L_{k+1}(\phi_\tau^{k+1}(x))\, d\tau \le J(x, \alpha_k) \tag{5.29} $$

Noting that $\int_0^\infty L_{k+1}(\phi_\tau^{k+1}(x))\, d\tau = J(x, \alpha_{k+1})$ completes the proof.

Condition "c" of theorem 5.1,

$$ L_{k+1}(x) + \langle \nabla_x J(x, \alpha_k), f_{k+1}(x) \rangle \le 0 \quad \forall x \in \Omega_K \tag{5.30} $$

is very similar to Bellman's functional equation [3][4][23], but it has been reached via a slightly different approach. Bellman's functional equation was derived by finding the conditions of optimality; equation (5.30) has been derived from an iterative approach for reaching optimality from a non-optimal starting point.

Equation (5.30) gives the local nature of the system $P^c$ with respect to the current cost-to-go surface. The quantity $-\nabla_x J(x, \alpha_k)$ is a vector which points towards the maximum decrease in the current cost-to-go surface. The quantity $f_{k+1}(x)$ is a vector pointing in the direction of the next time increment of the plant trajectory. The quantity $L_{k+1}(x)$ gives the incremental cost of taking the next time step. At a given position x in the state space $\Omega_K$, equation (5.30) indicates that, to reduce the cost-to-go, the control function $g(x, \alpha_{k+1})$ should be changed via $\alpha$ in such a fashion as to reduce or minimize (5.30). Visually, we would like to change $g(x, \alpha_{k+1})$ so as to swing the trajectory, $f(x, g(x, \alpha_{k+1}))$, as much as possible in the direction of steepest descent of the current cost-to-go surface, $-\nabla_x J(x, \alpha_k)$, while not adding substantially to the incremental cost, $L(x, g(x, \alpha_{k+1}))$. Note that it is not difficult to show that if $\alpha_{k+1} = \alpha_k$ then (5.30) is identically equal to zero, i.e., we have the identity

$$ L_k(x) + \langle \nabla_x J(x, \alpha_k), f_k(x) \rangle = 0 \quad \forall x \in \Omega_K \tag{5.31} $$

Theorem 5.1 gives a pointwise condition for improving the control. The problem is that, in general, it is not clear that a change in the control law parameters $\alpha$, from $\alpha_k$ to $\alpha_{k+1}$, can be found such that condition "c" in theorem 5.1 will hold for all $x \in \Omega_K$. Arguments could be made that if, at the current resolution of $g(x, \alpha)$, an $\alpha_{k+1}$ cannot be found such that condition "c" in theorem 5.1 holds for all $x \in \Omega_K$, one could simply add wavelets of higher resolution until the condition can be met. The problem with this argument is that the number of wavelets in the decomposition of $g(x, \alpha)$ may grow very fast trying to satisfy the condition on a pointwise basis. What is needed are conditions for improvement which allow for the possibility that the cost may go up in some locations but, on the average, the overall cost is improved. This will result, in general, in far fewer wavelets being used in the decomposition of $g(x, \alpha)$, while still providing "reasonable" control performance. The next section deals with this issue.

5.3 Conditions for the L1 Average Improvement of the Cost-to-Go

In order to discuss an average improvement in the control law or cost-to-go, a precise definition needs to be established of just what is meant by "average cost-to-go" and "average improvement".

Definition 5.1  L1 Average Cost-to-Go   The L1 average cost-to-go on a given region $\Omega$ of the state space will be defined as

$$ \Gamma(\alpha_k, \Omega) = \int_\Omega J(x, \alpha_k)\, dx $$

In that the region $\Omega$ over which the L1 average is being taken is specified, the L1 average will not be divided by the N-dimensional volume of $\Omega$.

Definition 5.2  Average Improvement   An average improvement is considered to be made when changing the control law from $g(x, \alpha_k)$ to $g(x, \alpha_{k+1})$ if

$$ \Gamma(\alpha_{k+1}, \Omega) < \Gamma(\alpha_k, \Omega) $$

Based on the definition of the average cost-to-go, an optimal control law on the region $\Omega_K$ may be defined. First, there are a couple of issues which come up in this definition which do not normally arise in the classical definition. This chapter is concerned with improving, or optimizing, the control law on a known, fixed region of the state space. The fact that the region of operation is fixed from the onset (in terms of present concern) places a restriction on the control laws which may be used. The control law will not be allowed to change in any fashion which would drive a system trajectory outside of $\Omega_K$, even though this might lead to a lower-cost control law. The reason for the restriction is concern for system stability: if a trajectory is driven outside of $\Omega_K$, stability of the system cannot be guaranteed. The method for guaranteeing that new control laws do not drive system trajectories outside of $\Omega_K$ is to define a boundary region, $\Omega_{Kb}$, in which the current, stabilizing control law is not allowed to change. This implies that, whatever optimization is done on the set $\Omega_{Kc}$, the final, "optimal" control law on $\Omega_K$ will depend to some degree, at least in the boundary region, on the initial stabilizing control law for $\Omega_K$. As methods are employed to expand the known RAS, $\Omega_K$, the boundary region will move out, and so will these "fringe" effects.

Once a known RAS, $\Omega_K$, is determined, with an associated stabilizing control law $g(x, \alpha^{K_0})$, a set of coefficients can be identified which may be changed to improve the control on $\Omega_K$. The set of coefficients $\alpha^{K_0} \in A$ (5.6) defines $g(x, \alpha^{K_0})$. Let $\Omega_{Kc}^\delta$ be defined as in (5.9), with the set of coefficient indexes $I_\delta^K$ as defined in (5.8). The coefficient space for $\Omega_K$ will then be defined as

$$ A_\delta^{K_0} = \left\{ \alpha \in A \mid \alpha_{m,I} = \alpha_{m,I}^{K_0},\ (m, I) \in I_R \times I_m - I_\delta^K \right\} \tag{5.32} $$

The control law space on $\Omega_K$ may then be defined as

$$ U_\delta^{K_0} = \left\{ g(x, \alpha) \in U \mid \alpha \in A_\delta^{K_0} \right\} \tag{5.33} $$

Definition 5.3  Optimal Control Law on $\Omega_K$   The optimal control law, $g(x, \alpha^{*K})$, on $\Omega_K$ is the $g(x, \alpha) \in U_\delta^{K_0}$, with the optimal coefficients $\alpha^{*K}$, which minimizes $\Gamma(\alpha, \Omega_K)$, i.e.,

$$ \Gamma^{*K} = \min_{\alpha \in A_\delta^{K_0}} \Gamma(\alpha, \Omega_K) $$

The question now becomes: what are the conditions on changes to the control law, via the parameters $\alpha$, which cause the average cost-to-go to be reduced,

$$ \Gamma(\alpha_{k+1}, \Omega_K) \le \Gamma(\alpha_k, \Omega_K) \tag{5.34} $$

and such that the system $P_{k+1}^c$ remains asymptotically stable on $\Omega_K$? Clearly the pointwise conditions of the previous section are sufficient to ensure a reduction in the average cost-to-go. What we would like are less stringent conditions, allowing more freedom in adjusting the parameters $\alpha$, which ensure that on the average the cost-to-go is reduced and that the system remains asymptotically stable.

The stability question is the most straightforward, so it will be attacked first. Theorem 2.1 provides the condition which will be used to ensure the continued asymptotic stability of the system $P_{k+1}^c$ when the control law is changed from $g(x, \alpha_k)$ to $g(x, \alpha_{k+1})$. Using the current cost-to-go surface, $J(x, \alpha_k)$, as a Lyapunov function, arbitrary changes to the control law may be made which will result in the system remaining asymptotically stable, provided the following conditions are met:

a) $g(x, \alpha_{k+1}) = g(x, \alpha_k)\ \forall x \in \Omega_{Kb}$;
b) $g(0, \alpha_{k+1}) = g(0, \alpha_k) = 0$;
c) $\dot J_{k+1}(x, \alpha_k) = \frac{d}{dt} J(\phi_t^{k+1}(x), \alpha_k) \big|_{t=0} < 0\ \forall x \in \Omega_K,\ x \ne 0$.

Condition "c" may be rewritten as

$$ \langle \nabla_x J(x, \alpha_k), f_{k+1}(x) \rangle < 0 \quad \forall x \in \Omega_K,\ x \ne 0 \tag{5.35} $$

If the control law is not changed, we have from (5.31)

$$ \langle \nabla_x J(x, \alpha_k), f_k(x) \rangle = -L_k(x) \tag{5.36} $$

In that $L_k(x) = L(x, g(x, \alpha_k))$ is a positive definite function by definition, the time derivative of our Lyapunov function is strictly negative away from the origin. This fact provides "room" for the control law to change while maintaining an asymptotically stable system. A measure of robustness can be added to the stability conditions by changing the requirement of item "c" to

$$ \langle \nabla_x J(x, \alpha_k), f_{k+1}(x) \rangle < -a\, L_k(x) \quad \forall x \in \Omega_K,\ x \ne 0 \tag{5.37} $$

where "a" is a constant between zero and one; a typical choice might be $a = \frac{1}{2}$. The measure of robustness gained here allows room for imprecise plant models, approximations to the cost-to-go function, and noise. This advantage should not be taken lightly.

The ensuing theorem for average improvement of the control law requires the definition of the following family of sets.

Definition 5.4  Trajectory Sets

For an asymptotically stable system $P_k^c$ on the compact, invariant set $\Omega_K$, with trajectories $x(t) = \phi_t^k(x)$, define for all $t \ge 0$:

$$ \Omega_K^k(t) = \phi_t^k(\Omega_K) \equiv \left\{ x \mid x = \phi_t^k(x_0),\ x_0 \in \Omega_K \right\} \tag{5.38} $$

The trajectory sets are sets which become smaller and smaller as time increases. The sets collapse, or shrink inwards from the outer edges, as starting points on the edges of the sets move in along system trajectories. Properties of the trajectory sets include:

1) $\Omega_K^k(t) \subset \Omega_K\ \forall t \ge 0$, and $\Omega_K^k(0) = \Omega_K$. By definition, $\Omega_K$ is an invariant set for $P_k^c$, which implies that $\phi_t^k(x_0) \in \Omega_K\ \forall x_0 \in \Omega_K$ and $t \ge 0$. The equivalence at t = 0 comes from the fact that $\phi_0^k(x_0) = x_0$.

2) $\Omega_K^k(t_2) \subset \Omega_K^k(t_1)\ \forall t_2 \ge t_1$. To see this, let $t_2 = t_1 + \Delta t$. We have that $\phi_{t_2}^k(x_0) = \phi_{\Delta t}^k(\phi_{t_1}^k(x_0))$, and since $x' = \phi_{t_1}^k(x_0) \in \Omega_K^k(t_1)$, then $\phi_{\Delta t}^k(x') \in \Omega_K^k(t_1)$ and $\phi_{\Delta t}^k(x') \in \Omega_K^k(t_2)$; therefore $\Omega_K^k(t_2) \subset \Omega_K^k(t_1)$.

3) $\Omega_K^k(t)$ is an invariant set for $P_k^c$ for all $t \ge 0$. This property follows from properties 1 and 2.

4) $\rho_R(\Omega_K^k(t)) \to 0$ as $t \to \infty$. This property follows directly from the fact that $P_k^c$ is asymptotically stable on the compact, invariant set $\Omega_K$, and from the definition of $\Omega_K^k(t)$.

Theorem 5.2   Given the closed-loop system $P_k^c$ along with the objective function $J(x, \alpha_k)$, both defined and under the constraints of section 5.1 above; assume that $P_k^c$ is asymptotically stable under the control law $g(x, \alpha_k)$ on the compact, invariant set $\Omega_K$, such that $J(x, \alpha_k)$ is defined and finite for all $x \in \Omega_K$; let $\Omega_{Kb}$ be a boundary set as defined in (5.8), (5.9) & (5.10); suppose that the control law is changed via its parameters to $g(x, \alpha_{k+1})$ such that:

a) $g(x, \alpha_{k+1}) = g(x, \alpha_k)\ \forall x \in \Omega_{Kb}$;
b) $g(0, \alpha_{k+1}) = g(0, \alpha_k) = 0$;
c) $\langle \nabla_x J(x, \alpha_k), f_{k+1}(x) \rangle < -a\, L_k(x)\ \forall x \in \Omega_K,\ x \ne 0,\ 0 < a < 1$;
d) $\displaystyle \int_{\Omega_K^{k+1}(s)} \left[ L_{k+1}(x) + \langle \nabla_x J(x, \alpha_k), f_{k+1}(x) \rangle \right] dx \le 0\ \forall s \ge 0$;

then $\Gamma(\alpha_{k+1}, \Omega_K) \le \Gamma(\alpha_k, \Omega_K)$. In addition, the new system $P_{k+1}^c$ will be asymptotically stable on $\Omega_K$.

The proof is similar in structure to theorem 5.1 and is a consequence of the next proposition; the stability issue was handled above.

Proposition 5.2   Suppose an $\alpha_{k+1}$ and some $\Delta t > 0$ can be found such that

$$ \int_{\Omega_K^{k+1}(s)} \left[ \int_0^{\Delta t} L_{k+1}(\phi_\tau^{k+1}(x))\, d\tau + J(\phi_{\Delta t}^{k+1}(x), \alpha_k) \right] dx \le \int_{\Omega_K^{k+1}(s)} J(x, \alpha_k)\, dx \quad \forall s \ge 0 \tag{5.39} $$

and such that conditions a - c of theorem 5.2 are met; then

$$ \Gamma(\alpha_{k+1}, \Omega_K) \le \Gamma(\alpha_k, \Omega_K) \tag{5.40} $$

Proof: Per the proposition, equation (5.39) is true, which implies the following is also true:

$$ \int_{\Omega_K^{k+1}(s + \Delta t)} \left[ \int_0^{\Delta t} L_{k+1}(\phi_\tau^{k+1}(x))\, d\tau + J(\phi_{\Delta t}^{k+1}(x), \alpha_k) \right] dx \le \int_{\Omega_K^{k+1}(s + \Delta t)} J(x, \alpha_k)\, dx \tag{5.41} $$

Re-arranging the integration regions gives

$$ \int_{\Omega_K^{k+1}(s)} \left[ \int_{\Delta t}^{2\Delta t} L_{k+1}(\phi_\tau^{k+1}(x))\, d\tau + J(\phi_{2\Delta t}^{k+1}(x), \alpha_k) \right] dx \le \int_{\Omega_K^{k+1}(s)} J(\phi_{\Delta t}^{k+1}(x), \alpha_k)\, dx \tag{5.42} $$

Adding (5.39) to (5.42) and re-arranging terms gives

$$ \int_{\Omega_K^{k+1}(s)} \left[ \int_0^{2\Delta t} L_{k+1}(\phi_\tau^{k+1}(x))\, d\tau + J(\phi_{2\Delta t}^{k+1}(x), \alpha_k) \right] dx \le \int_{\Omega_K^{k+1}(s)} J(x, \alpha_k)\, dx \tag{5.43} $$

Proceeding in this vein, we have

$$ \int_{\Omega_K^{k+1}(s)} \left[ \int_0^{n\Delta t} L_{k+1}(\phi_\tau^{k+1}(x))\, d\tau + J(\phi_{n\Delta t}^{k+1}(x), \alpha_k) \right] dx \le \int_{\Omega_K^{k+1}(s)} J(x, \alpha_k)\, dx \tag{5.44} $$

By letting $n \to \infty$ and noting that $J(\phi_\infty^{k+1}(x), \alpha_k) = 0$, we then have

$$ \int_{\Omega_K^{k+1}(s)} \int_0^\infty L_{k+1}(\phi_\tau^{k+1}(x))\, d\tau\, dx = \int_{\Omega_K^{k+1}(s)} J(x, \alpha_{k+1})\, dx \le \int_{\Omega_K^{k+1}(s)} J(x, \alpha_k)\, dx \tag{5.45} $$

By choosing s = 0 we have $\Gamma(\alpha_{k+1}, \Omega_K) \le \Gamma(\alpha_k, \Omega_K)$, which completes the proof of the proposition.

Equation (5.39) may be rewritten as

$$ \int_{\Omega_K^{k+1}(s)} \left[ \int_0^{\Delta t} L_{k+1}(\phi_\tau^{k+1}(x))\, d\tau \right] dx + \int_{\Omega_K^{k+1}(s)} \left[ J(\phi_{\Delta t}^{k+1}(x), \alpha_k) - J(x, \alpha_k) \right] dx \le 0 \tag{5.46} $$

Dividing by Δt and taking the limit ⎡ Δt Lk +1 (φτk +1 ( x )) ⎤ lim dτ ⎥dx + ⎢ ∫0 Δt Δt → 0 K∫ ⎦ Ω k +1 ( s ) ⎣ ⎡ J (φΔkt+1 ( x ), α k ) − J ( x , α k ) ⎤ lim ⎢ ⎥dx ≤ 0 Δt Δt → 0 K∫ ⎦ Ω k +1 ( s ) ⎣

(5.47)

gives

∫_{Ω_{k+1}^K(s)} [ L_{k+1}(x) + J̇_{k+1}(x, α_k) ] dx ≤ 0    (5.48)

where once again J̇_{k+1}(x, α_k) ≡ (d/dt) J(φ_t^{k+1}(x), α_k)|_{t=0} is the directional derivative of J(x, α_k) along the trajectory φ_t^{k+1}(x). The partial derivatives of J(x, α_k) exist and are continuous, so we may write:

J̇_{k+1}(x, α_k) = 〈∇_x J(x, α_k), f_{k+1}(x)〉    (5.49)

where 〈·, ·〉 is the inner product of two vectors, ∇_x J(x, α_k) is the gradient of J(x, α_k) with respect to x, and f_{k+1}(x) = f(x, g(x, α_{k+1})). Substituting (5.49) into (5.48) gives condition "d" of theorem 5.2, which completes the proof.
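As a quick numerical sanity check on (5.49), the inner-product form of J̇_{k+1} can be compared against a finite-difference quotient of J along the flow. The functions J, grad_J, and f below are hypothetical placeholders (a quadratic surface and a stable linear field), not an actual cost-to-go from this thesis; they simply illustrate that the two computations agree for smooth J.

```python
import numpy as np

def jdot_inner_product(grad_J, f, x):
    """Directional derivative per (5.49): <grad_x J(x, alpha_k), f_{k+1}(x)>."""
    return float(np.dot(grad_J(x), f(x)))

def jdot_finite_difference(J, f, x, dt=1e-6):
    """(J(phi_dt(x)) - J(x)) / dt, with one explicit Euler step standing in for phi_dt."""
    return (J(x + dt * f(x)) - J(x)) / dt

# Hypothetical smooth cost-to-go and closed-loop vector field.
J = lambda x: float(x @ x)
grad_J = lambda x: 2.0 * x
f = lambda x: np.array([x[1], -x[0] - x[1]])

x = np.array([0.5, -0.3])
print(jdot_inner_product(grad_J, f, x))   # exact value here: -2 * x2**2 = -0.18
print(jdot_finite_difference(J, f, x))    # should agree to O(dt)
```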

Theorem 5.2 can be strengthened to provide conditions which yield a strictly decreasing average cost-to-go: Γ(α_{k+1}, Ω^K) < Γ(α_k, Ω^K). The key to this lies in condition "d" of theorem 5.2. Condition "d" is also the key to finding algorithms for changing the control law parameters, α, such that the average cost-to-go is improved. It is therefore worth some time to study this condition.

Before proceeding, some notational simplification will be helpful. First, define:

h_k(x, α_{k+1}) = L_{k+1}(x) + 〈∇_x J(x, α_k), f_{k+1}(x)〉    (5.50)

where h_k(x, α_{k+1}) is the pointwise condition for improving the cost-to-go. The subscript "k" denotes the cost-to-go surface, J(x, α_k), against which the improvement is measured. By the definitions of L_{k+1}(x), f_{k+1}(x), and J(x, α), it is readily seen that h_k(x, α_{k+1}) is a continuous function of x, and its first partial derivatives with respect to x exist and are continuous. From equation (5.31) we have that h_k(x, α_{k+1}) = 0 on regions of the state space where g(x, α_{k+1}) = g(x, α_k). The integral of h_k(x, α_{k+1}) over a given set Ω will be denoted:

H_k(Ω, α_{k+1}) = ∫_Ω h_k(x, α_{k+1}) dx    (5.51)

Once again, the subscript "k" denotes the cost-to-go surface, J(x, α_k), against which the improvement is measured.

The fact that h_k(x, α_{k+1}) is zero everywhere except where there is an actual change in going from g(x, α_k) to g(x, α_{k+1}) is of great advantage. It means that H_k(Ω, α_{k+1}) only needs to be evaluated on a set Ω where the control law actually changes. By changing only a limited number of coefficients in the wavelet expansion of g(x, α_k) at a time, and recognizing that the wavelets chosen have compact support, H_k(Ω, α_{k+1}) only needs to be evaluated over the support of the wavelets whose coefficients are being changed. A typical algorithm for optimizing the control law will change only a limited number of coefficients, α_i, at a time; therefore H_k(Ω, α_{k+1}) only needs to be evaluated over the support of those wavelets, ψ_i(x), whose coefficients are being changed.
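The following sketch makes the pointwise test (5.50) and the sampled integral (5.51) concrete. It is a minimal illustration under assumed ingredients: L_new, grad_J, and f_new are hypothetical callables standing in for L_{k+1}, an approximation of ∇_x J(x, α_k), and f_{k+1}, and the support is taken as a square in a two-dimensional state space.

```python
import numpy as np

def h_k(x, L_new, grad_J, f_new):
    """Pointwise improvement measure (5.50):
    h_k(x, alpha_{k+1}) = L_{k+1}(x) + <grad_x J(x, alpha_k), f_{k+1}(x)>."""
    return L_new(x) + float(np.dot(grad_J(x), f_new(x)))

def H_k_over_support(lo, hi, n, L_new, grad_J, f_new):
    """Riemann-sum estimate of (5.51) over a square wavelet support [lo, hi]^2.
    h_k vanishes wherever the control law is unchanged, so only the supports of
    the wavelets whose coefficients changed need to be sampled."""
    xs = np.linspace(lo, hi, n, endpoint=False) + 0.5 * (hi - lo) / n  # cell centers
    dV = ((hi - lo) / n) ** 2  # one sample per volume element
    return sum(h_k(np.array([x1, x2]), L_new, grad_J, f_new)
               for x1 in xs for x2 in xs) * dV
```

A negative value of H_k_over_support on the changed supports is the sampled analogue of condition "d".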

Since h_k(x, α_{k+1}) is a relatively nice function (it is a continuous function of x, and its first partial derivatives with respect to x exist and are continuous), numerical methods of approximating H_k(Ω, α_{k+1}) may readily be employed. If basis functions with global support were used in the decomposition of g(x, α_k), then H_k(Ω, α_{k+1}) would have to be evaluated over the entire set Ω^K each time a single coefficient was changed. Global basis functions would also violate the requirement that the control law remain fixed on a boundary region.

Condition "d" of theorem 5.2 carries the (rather unfortunate) requirement that it must be satisfied on all of the trajectory sets, i.e.

H_k(Ω_{k+1}^K(s), α_{k+1}) ≤ 0 ∀ s ≥ 0    (5.52)

Recall that the trajectory sets are sets which collapse inward as the extremity points move inward along trajectories of the system P_{k+1}^c; the parameter s represents the time point along the trajectories. Provided (5.52) is true, it is also true that

H_k(Ω_{k+1}^K(s₁), α_{k+1}) ≤ H_k(Ω_{k+1}^K(s₂), α_{k+1}) for s₁ ≤ s₂    (5.53)

From this it follows directly that if (5.52) is true and if for some s ≥ 0 we have H_k(Ω_{k+1}^K(s), α_{k+1}) < 0, then H_k(Ω^K, α_{k+1}) will be strictly less than zero. Following the arguments from the proof of theorem 5.2, if the additional requirement H_k(Ω^K, α_{k+1}) < 0 is added to theorem 5.2, then the stronger result Γ(α_{k+1}, Ω^K) < Γ(α_k, Ω^K) will apply. This is formalized in the following corollary.

Corollary 5.1

Given the closed loop system P_k^c along with the objective function J(x, α_k), both defined and under the constraints of section 5.1 above; assume that P_k^c is asymptotically stable under the control law g(x, α_k) on the compact, invariant set Ω^K, such that J(x, α_k) is defined and finite for all x ∈ Ω^K; let Ω_b^K be a boundary set as defined in (5.8), (5.9) & (5.10); and suppose that the control law is changed via its parameters to g(x, α_{k+1}) such that:

a) g(x, α_{k+1}) = g(x, α_k) ∀ x ∈ Ω_b^K;

b) g(0, α_{k+1}) = g(0, α_k) = 0;

c) 〈∇_x J(x, α_k), f_{k+1}(x)〉 < −aL_k(x) ∀ x ∈ Ω^K, x ≠ 0, 0 < a < 1;

d) H_k(Ω_{k+1}^K(s), α_{k+1}) ≤ 0 ∀ s ≥ 0;

e) H_k(Ω^K, α_{k+1}) < 0;

then Γ(α_{k+1}, Ω^K) < Γ(α_k, Ω^K). In addition, the new system P_{k+1}^c will be asymptotically stable on Ω^K.

The proof of the corollary follows directly from the proof of theorem 5.2 and the discussion above concerning the properties of H_k(Ω, α_{k+1}).

Any algorithm which ensures H_k(Ω^K, α_{k+1}) < 0 for all k > 0 will force

Γ(α_k, Ω^K) → Γ^{*K} as k → ∞. The validity of this statement follows from noting that Γ(α_k, Ω^K) is a strictly positive function of the bounded parameters α, is bounded below by Γ^{*K}, and by construction is monotonically decreasing with increasing k; a monotonically decreasing sequence that is bounded below must converge.

One last key point needs to be made on the requirement that condition "d" of theorem 5.2 (and corollary 5.1) be satisfied on all of the trajectory sets, i.e.:

H_k(Ω_{k+1}^K(s), α_{k+1}) ≤ 0 ∀ s ≥ 0. This requirement embodies Bellman's Principle of Optimality, which states [23]: "An optimal policy (control law) has the property that no matter what the previous decisions (control inputs) have been, the remaining decisions must constitute an optimal policy with regard to the state resulting from those previous decisions." In other words, an optimal control law has the property that no matter how a system reaches its current state, the remaining control effort to drive the system back to the origin must be optimal. This principle implies that the control law needs to be optimized closest to the final trajectory point (the origin) first. The optimization should then proceed backwards, out from the origin, along valid system trajectories. The penalty for not following this plan is the additional effort, or additional iterations, required to optimize the control law.
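In software, one crude way to respect this ordering is to rank the candidate wavelet-support regions by distance from the origin and optimize the nearest ones first. The sketch below is only a heuristic stand-in: the region centers are hypothetical, and a more faithful scheme would order the regions backwards along actual system trajectories rather than by Euclidean distance.

```python
import numpy as np

def bellman_ordering(region_centers):
    """Indices of candidate regions sorted by distance from the origin, so
    optimization proceeds outward from the equilibrium point first."""
    centers = np.asarray(region_centers, dtype=float)
    return np.argsort(np.linalg.norm(centers, axis=1))

# Hypothetical centers of wavelet supports in a two-dimensional state space.
centers = [(0.9, -0.4), (0.1, 0.0), (-0.7, 0.8), (0.3, 0.3)]
print(bellman_ordering(centers))  # [1 3 0 2]: nearest-to-origin regions first
```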

5.4 Optimization of the Wavelet Basis Controller on the Known RAS

Theorem 5.2 and its corollary 5.1 reduce the problem of optimizing a wavelet based feedback controller for the system P^c (5.2), on a known RAS, to that of minimizing the function H_k(Ω, α_{k+1}) subject to a number of constraints. This is a basic problem in the field of nonlinear programming [8][12][15]. A number of methods could readily be proposed for this optimization; one of the most straightforward is gradient descent, which I outline below. Gradient descent is a standard method for training neural networks and for minimizing (or maximizing) nonlinear functions of one or many variables. It is not necessarily the best or the fastest optimization method, but it is one of the most straightforward and intuitive to apply.

In studying corollary 5.1 for application, a number of practical issues immediately arise. First, the exact plant model, f(x, u), is assumed known, which is rarely the case to a high level of precision; normally only a rough approximation to f(x, u) is available. Second, the gradient with respect to the state vector x of the current cost-to-go, J(x, α_k), is assumed known or readily computable. Direct computation of this gradient would be a numerical nightmare, so an approximation of this term must be used. While these are two key issues from a practical application standpoint, to avoid additional complications at this stage I will hold off addressing them until chapter 6. Note that the function L(x, u) is defined by the control or systems engineer and is therefore known.
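Anticipating the approximations of chapter 6, one pragmatic stand-in for ∇_x J(x, α_k) is a central-difference gradient taken from a tabulated cost-to-go surface. The sketch below assumes a hypothetical uniform two-dimensional grid and a quadratic test surface; nothing here is claimed as the estimator actually developed later in the thesis.

```python
import numpy as np

def grad_J_from_grid(J_grid, dx):
    """Central-difference approximation of grad_x J on a uniform 2-D grid.
    J_grid[i, j] holds J at node (i, j); dx is the node spacing on both axes."""
    dJ_dx1 = np.gradient(J_grid, dx, axis=0)  # partial derivative w.r.t. x1
    dJ_dx2 = np.gradient(J_grid, dx, axis=1)  # partial derivative w.r.t. x2
    return dJ_dx1, dJ_dx2

# Hypothetical tabulated cost-to-go J(x) = x1^2 + x2^2 on [-1, 1]^2.
dx = 0.05
axis = np.arange(-1.0, 1.0 + 0.5 * dx, dx)
X1, X2 = np.meshgrid(axis, axis, indexing="ij")
dJ_dx1, dJ_dx2 = grad_J_from_grid(X1**2 + X2**2, dx)  # approx (2*x1, 2*x2)
```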

The next practical issue is requirement "d" of corollary 5.1: H_k(Ω_{k+1}^K(s), α_{k+1}) ≤ 0 ∀ s ≥ 0. Meeting this condition on all of the trajectory sets may not be possible, and it would certainly be difficult to demonstrate in a practical application. The penalty for not meeting this condition is the possibility that the average cost-to-go increases instead of decreases, i.e. Γ(α_{k+1}, Ω^K) > Γ(α_k, Ω^K). Even if the average cost-to-go does increase somewhat on a given iteration (going from k to k + 1), the system will remain asymptotically stable on Ω^K provided conditions a, b and c of corollary 5.1 are met.

As noted above, though, the condition H_k(Ω_{k+1}^K(s), α_{k+1}) ≤ 0 ∀ s ≥ 0 embodies Bellman's Principle of Optimality. If the optimization process is started near the origin (the equilibrium point of the system) and works outward along valid system trajectories, then this condition will in general be met, and on average (over the majority of optimization iterations k) the average cost-to-go will be improved. Even if the optimization process is not started at the origin, and local minima are fallen into, once the region around the origin is optimized the optimization process will eventually work its way back out along system trajectories and a global minimum on Ω^K will be reached. The process will take longer, but Bellman's principle still implies that the global optimum will be reached. These arguments may be formalized in the following conjecture:

Conjecture 5.1

Given the closed loop system P_k^c along with the objective function J(x, α_k), both defined and under the constraints of section 5.1 above; assume that P_k^c is asymptotically stable under the control law g(x, α_k) on the compact, invariant set Ω^K, such that J(x, α_k) is defined and finite for all x ∈ Ω^K; let Ω_b^K be a boundary set as defined in (5.8), (5.9) & (5.10); and suppose that the control law is changed via its parameters to g(x, α_{k+1}) such that:

a) g(x, α_{k+1}) = g(x, α_k) ∀ x ∈ Ω_b^K;

b) g(0, α_{k+1}) = g(0, α_k) = 0;

c) 〈∇_x J(x, α_k), f_{k+1}(x)〉 < −aL_k(x) ∀ x ∈ Ω^K, x ≠ 0, 0 < a < 1;

d) H_k(Ω_k, α_{k+1}) ≤ 0, Ω_k ⊂ Ω^K;

then Γ(α_l, Ω^K) < Γ(α_k, Ω^K) for most l > k, and Γ(α_k, Ω^K) → Γ^{*K} as k → ∞, which implies α_k → α^{*K} and g(x, α_k) → g(x, α^{*K}) as k → ∞.

In addition, each new system P_{k+1}^c will be asymptotically stable on Ω^K. The asymptotic stability of the system follows from previous arguments; the argument for convergence to the restricted optimum is given above.

The gradient descent optimization for the wavelet basis controller on the known RAS, Ω^K, is as follows (a consolidated sketch of one iteration appears after step 6):

1) Choose a small region Ω_k in Ω^K on which to make a control law improvement. The natural choice for the size of this region is the support of a wavelet, Ω_{ψ_i}. In choosing the region, Bellman's principle of optimality should be kept in mind: the regions closest to the origin (the system equilibrium point) should be chosen for optimization first, then move outward along system trajectories.

2) Define

I_k^ψ = { i ∈ I_δ^K | Ω_{ψ_i} ⊂ Ω_k }    (5.54)

Here, I_k^ψ is the index set of the wavelet coefficients which may be changed during the kth optimization iteration.

3) The term H_k(Ω_k, α_{k+1}) = ∫_{Ω_k} h_k(x, α_{k+1}) dx may be approximated using a finite sum. Let X_k^P be a finite, evenly distributed sampling of Ω_k with P samples, and let Δx be the N-dimensional volume element with one sample in each volume element. Then H_k(Ω_k, α_{k+1}) may be approximated by

H_k(Ω_k, α_{k+1}) ≈ Σ_{p∈P} h_k(x_p, α_{k+1}) Δx    (5.55)

4) Let ∇α be a gradient vector defined by

∇α_i = ∂H_k(Ω_k, α_k)/∂α_i for i ∈ I_k^ψ, and ∇α_i = 0 otherwise    (5.56)

The partial derivatives with respect to α_i may be calculated directly, or numerical methods may be used to determine each one. If the gradient vector is near zero, i.e. ‖∇α‖ is small, then the coefficients indexed by I_k^ψ are near optimal (at least at this point in the optimization cycle), so go to step one. When the coefficients on all of Ω^K are near optimal, the process may stop, or higher resolution wavelets may be added to try to obtain better performance.

5) Let

α_{k+1} = α_k − η∇α, for some η ≥ 0    (5.57)

Perform the minimization

min_{η≥0} H_k(Ω_k, α_k − η∇α) = min_{η≥0} Σ_{p∈P} h_k(x_p, α_k − η∇α) Δx    (5.58)

subject to:

〈∇_x J(x_p, α_k), f_{k+1}(x_p)〉 < −aL_k(x_p) ∀ x_p ∈ X_k^P, x_p ≠ 0, 0 < a < 1    (5.59)

and

g(0, α_k − η∇α) = 0    (5.60)

Conditions (5.59) and (5.60) are met with η = 0. Since h_k(x, α_{k+1}) = L_{k+1}(x) + 〈∇_x J(x, α_k), f_{k+1}(x)〉, little additional effort is required to compute condition (5.59), and by the structure of the condition there will always be an η > 0 such that (5.59) is met. For regions Ω_k which contain the origin, extra effort will be required to ensure condition (5.60) is met. One approach is to use one or more of the coefficients in the set I_k^ψ to maintain condition (5.60): these coefficients are held fixed while small changes are made to the other coefficients in the direction of −∇α, and the held coefficients are then adjusted to meet requirement (5.60).

The optimization process may be carried out indefinitely or until satisfactory system performance is obtained. Step four provides one possible condition for stopping. A condition for adding higher resolution wavelets is still required. The equation

A condition for adding higher resolution wavelets is still required. The equation

h_k(x, u_x) = L(x, u_x) + 〈∇_x J(x, α_k), f(x, u_x)〉 ≤ 0    (5.61)

is the pointwise condition for improving the control law. If, on a region Ω_k, for the set of points X_k^P there exist u_{x_p} such that each u_{x_p} is bounded by the limits of the control law and such that

Σ_{p∈P} h_k(x_p, u_{x_p}) Δx
