Controller Design via Adaptive Critic and Model Reference Methods

George G. Lendaris1, Roberto Santiago2, Jay McCarthy3, Michael Carroll4

1. Faculty, Systems Science Ph.D. Program and Electrical & Computer Engineering Department; 2. Graduate Student, Systems Science Ph.D. Program; 3. Graduate Student, Electrical & Computer Engineering Department; Portland State University, Portland, OR 97201. 4. Graduate Student, Computational Neuroscience, University of Chicago (formerly at PSU). [email protected], [email protected], [email protected], [email protected]

Abstract. Dynamic Programming (DP) is a principled way to design optimal controllers for certain classes of nonlinear systems; unfortunately, DP is computationally very expensive. The Reinforcement Learning methods known as Adaptive Critics (AC) provide computationally feasible means of performing approximate Dynamic Programming (ADP). The term 'adaptive' in AC refers to the critic's improving estimates of the Value Function used by DP. To apply DP, the user must craft a Utility function that embodies all the problem-specific design specifications/criteria. Model Reference Adaptive Control methods have been used successfully in the control community to effect on-line redesign of a controller in response to variations in plant parameters, with the idea that the resulting closed-loop system dynamics will mimic those of a Reference Model. The work reported here 1) uses a reference model in ADP as the key information input to the Utility function, and 2) uses ADP off-line to design the desired controller; future work will extend this to on-line application. The method is demonstrated for a hypersonic-shaped airplane called LoFLYTE®, whose handling characteristics are natively a little "hotter" than a pilot would desire. A control augmentation subsystem is designed using ADP to make the plane "feel like" a better-behaved one, as specified by a Reference Model. The number of inputs to the successfully designed controller is among the largest seen in the literature to date.

1. Introduction

Model Reference techniques have been used extensively in the context of adaptive control, dating back at least two decades [11][12][3][17]. The term 'adaptive' in this context refers to the modification of a controller's design based on information obtained about unknown aspects of the controlled plant, the latter being determined via comparison with a reference model that embodies "desired" plant characteristics. Descriptive phrases often used in this context include 'model predictive control' and 'model reference adaptive control' (MRAC).

(This work was partially supported by the National Science Foundation, Grant ECS-9904378, and by NASA, SBIR Phase III Contract NAS1-01070.)


Dynamic Programming (DP) [2] "... remains the only general approach for sequential optimization applications under very broad conditions, including uncertainty..." [18] (pg. 68). Because of this capability, strong motivation exists for using DP to design controllers for nonlinear systems for which optimal control methods developed for linear systems do not yield satisfactory designs. Unfortunately, the computations required for exact application of DP grow exponentially with the problem dimension (Bellman's "curse of dimensionality"). Thus, for real-world problems, approximate methods have been required since the method's inception (e.g., see [9]).

In the early 1990s, the emerging Reinforcement Learning methodology [16] known as Adaptive Critics was observed to be capable of implementing a useful approximation to Dynamic Programming. In [18], a variety of alternative implementations were formulated, depending on the specific role performed by the key component called the critic. The version in which the critic approximates the J function of DP is called Heuristic Dynamic Programming (HDP); the one in which it approximates the derivative of J is called Dual Heuristic Programming (DHP); and the version in which the critic performs both approximations is called Generalized (or Global) DHP (GDHP). Each of these has a version that also involves the output of the controller (the "action"); the phrase action dependent (AD) is prepended to each of the above names: ADHDP, ADDHP, and ADGDHP. A detailed presentation and analysis of the various AC (or equivalently, ADP) methods is given in [13], which includes a mathematical model of the AC process relating it to MRAC. The work reported in the current paper focuses on DHP.

In the neural network context, the concept of reinforcement learning contrasts with supervised learning in that, instead of a desired output vector being available at each training step, only a general assessment of the overall performance of the system in response to the current action is provided. In the Adaptive Critic structures, the critic provides this qualitative information.

In any controller design situation, the "customer" must always provide design specifications to the control engineer. Once the design specs are provided, the engineer proceeds to ply his/her skills to develop a controller design that satisfies the specs. When using DP as the design tool, the engineer is obliged to craft a Utility Function whose role is to embody the stipulated design specifications, in a manner appropriate to the DP process. In those cases where the "specs" are given in terms of a reference model, Utility may be defined in terms of the errors between the states of the plant and those of the reference model. Often, it is also useful for the control engineer to craft other terms to include in the definition of the Utility function (e.g., [19][15][6]). Because of the latter, it is fair to say that crafting the Utility Function is one of the key creative tasks the control engineer performs when applying Dynamic Programming.

In the present work, the Utility Function is based on the reference model, and is used during an off-line DHP process to design an (approximately) optimal controller that meets the specifications embodied in the Utility Function, for a given plant, over a stipulated range of operating conditions. The resulting controller is to be used for on-line operation, with no adaptation involved (because of stability considerations), over this stipulated range of operating conditions. Being a non-linear controller, it is inherently capable of providing the desired control over a broader range of operating conditions than would be possible for a linear controller (though covering such a broad range with linear control laws can be accommodated via a multiple-models approach [4][5][17]). As mentioned in a previous paper [8], our future work will also perform on-line adaptation using the DHP method.

The particular application reported here relates to the design of a controller for the hypersonic-shaped aircraft called LoFLYTE®, developed by Accurate Automation Corporation (AAC) for NASA. The intended operating scenario entails a human pilot remotely flying the aircraft, and the design task is to develop a control augmentation subsystem that will make the LoFLYTE® respond to the pilot's commands as though its dynamics were those defined in a Reference Model.

To accomplish the present work, a large "down payment" effort was required to build a software platform for a) carrying out the DHP process, and b) simulating the LoFLYTE® aircraft and the Reference Model. The 6-dof (degree of freedom) LoFLYTE® simulator was developed by AAC, and refined with user input from our team over about a 1.5-year period; the DHP platform was built at our laboratory (NWCIL) over the same period. This 6-dof simulator represents a non-linear plant of significant complexity, equivalent to that of the aircraft described in [20], without the thermal components.

Another key creative task of the control engineers (in this case, the authors) in successfully applying the DP method is the crafting of a "lesson plan" for training/designing the controller via the DHP method. A key consideration in devising such a "lesson plan" is how to use the Reference Model during the controller design process. In early work, the LoFLYTE® and the Reference Model were initialized together, the training was carried out for a specified amount of time (an "epoch"), and then restarted. During training with this version, before the controller design would converge to an approximate solution, the Reference aircraft could end up in a significantly different location in state space (and in simulated physical space as well) than the LoFLYTE®. Since some of the aircraft's non-linearities depend on its state-space location, a reset mechanism was implemented: the current training subsystem monitors various parameters of the two aircraft, and if certain threshold criteria are exceeded by any of a selected subset of variables, the states of the Reference Model are reset to those of the LoFLYTE®, and the training proceeds (a sketch of this epoch/reset loop is given at the end of this section). This is similar to a technique previously used for designing a steering controller for a 4-wheeled terrestrial vehicle [6][7]. In that case, a command was given to change lanes; if the car went off the road, rather than continuing the training in an attempt to get it back on the road, the training session was aborted and restarted at the original location. In this way, during each training pass, the car was able to get further toward the goal of being centered in the next lane, in the correct orientation.

As mentioned above, the human pilot actually flies the aircraft (e.g., does the navigation task, etc.), while the controller to be designed has the task of making the LoFLYTE® "feel" to the pilot like the Reference Model. Accordingly, the training focuses on the dynamic response of the LoFLYTE® to the pilot's x-stick (roll command) and y-stick (pitch command) deflections, as well as pedal and throttle commands. A key element of the "lesson plan" used for training comprised sequences of pilot commands designed specifically to elicit a range of dynamic responses from the Reference Model and the LoFLYTE® simulations.
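To make the reset mechanism concrete, the following is a minimal sketch of the epoch/reset loop just described. The names `step_plant`, `step_reference`, and `train_on` are illustrative callables, not the NWCIL platform's API, and `watch_idx`/`thresholds` stand in for the selected subset of monitored variables and their criteria.

```python
import numpy as np

def run_epoch(R0, pilot_commands, step_plant, step_reference, train_on,
              watch_idx, thresholds):
    """One training epoch with the reference-reset mechanism (a sketch)."""
    R, R_star = R0.copy(), R0.copy()        # initialize both aircraft together
    for u in pilot_commands:                # scripted "lesson plan" commands
        R = step_plant(R, u)                # LoFLYTE state update
        R_star = step_reference(R_star, u)  # Reference Model state update
        train_on(R, R_star, u)              # one DHP design/training update
        # If any monitored variable diverges past its threshold, snap the
        # Reference Model back onto the LoFLYTE state and continue training.
        if np.any(np.abs(R[watch_idx] - R_star[watch_idx]) > thresholds):
            R_star = R.copy()
    return R, R_star
```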

2. The DHP Process

There are a number of recent contributions to the neural network literature that describe DHP (e.g., [18][14][6]-[8]), so space is conserved here by not repeating that description. For the purposes of this paper, however, it is important to reiterate three central requirements of the DHP process: a differentiable model of the "plant," a mathematically defined Utility Function, and a "lesson plan" for carrying out the training (design) process. To make the roles of these ingredients concrete, a schematic sketch of one DHP iteration is given below.
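The following is a minimal, illustrative sketch only, not the authors' implementation: a toy linear plant and linear critic/controller stand in for the 6-dof and the MLPs (the real critic also receives u and e; see Section 3), with all names and dimensions chosen purely for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (NOT the 6-dof): a linear plant and linear critic/controller,
# chosen only to make the DHP bookkeeping concrete and executable.
n, m = 4, 2                                  # state / command dimensions
Ap = np.eye(n) + 0.01 * rng.standard_normal((n, n))  # plant state matrix
Bp = 0.1 * rng.standard_normal((n, m))               # plant command matrix

def S(R, u):                                 # plant step: R(t+1) = S(R(t), u(t))
    return Ap @ R + Bp @ u

Wc = 0.01 * rng.standard_normal((n, n))      # critic:     lambda_hat = Wc @ R
Wa = 0.01 * rng.standard_normal((m, n))      # controller: a = Wa @ R

gamma, lr = 0.95, 1e-3                       # discount and learning rate
K = np.ones(n)                               # relative state weightings
R, R_star, u = rng.standard_normal(n), np.zeros(n), np.zeros(m)

for t in range(100):
    a = Wa @ R                               # action from current policy
    R_next = S(R, u + a)                     # one plant step
    lam_next = Wc @ R_next                   # critic estimate at t+1
    # Critic target: dU/dR plus the discounted value derivative propagated
    # back through both the state path (Ap) and the policy path (Bp @ Wa).
    lam_target = K * (R - R_star) + gamma * (Ap + Bp @ Wa).T @ lam_next
    Wc += lr * np.outer(lam_target - Wc @ R, R)   # move critic toward target
    # Controller update: descend dJ/da through the command Jacobian.
    grad_a = Bp.T @ (gamma * lam_next)       # dU/da = 0 here since U = U(R)
    Wa -= lr * np.outer(grad_a, R)
    R = R_next
```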

3. Experimental Setup

The six-degree-of-freedom (6-dof) simulator used in the project can be configured to simulate a variety of plane designs, including LoFLYTE®. The 6-dof was written in FORTRAN and is callable from C++ as well as MATLAB. Using the 6-dof, two planes were simulated: LoFLYTE®, and LoFLYTE®* (a version of LoFLYTE® with more desirable handling characteristics, which serves as the Reference Model). Given the flexibility of the 6-dof simulator, we had the advantage of being able to easily define/implement a LoFLYTE®*. In the following text, R(t) and R*(t) refer to the state vectors of LoFLYTE® and LoFLYTE®*, respectively, at time t.

In the configuration outlined in this paper, pilot commands are given to both LoFLYTE® and LoFLYTE®*, and the goal assigned to the DHP process was to design an augmentation controller that makes LoFLYTE® "feel" to the pilot the same as LoFLYTE®* does. Fig. 1 provides a general schematic of the Reference Model / DHP design configuration; the shaded boxes in Fig. 1 indicate the "active" role players in the design process. The Utility function, U, was defined as the (squared) two-norm of the error between the plant states and the reference states:

$$U(t) = \lVert R(t) - R^*(t) \rVert^2$$   (eq. 1)

[Figure 1. System configuration during DHP Design of Controller Augmentation System (CAS). Signal flow: the Pilot's command u drives the Reference Airplane (producing R*) and, summed with the augmentation a to form u^a, the Actual Airplane (including various stabilization loops, etc., producing R); the error e between R and R* feeds the Augmenter (CAS), the Utility Fn., and the DHP Design Process.]

We comment that the above layout is similar to Model Reference Adaptive Control (MRAC) configurations (e.g., Fig. 14.5 in [1]), but in the present work it is used only during the controller design phase (which is done off-line), not during the actual functioning of the controller. While functioning as a Control Augmentation System during flight (CAS in Fig. 1), no adaptation is involved; thus, the resulting control system is more properly labeled Model Reference Control. The term 'adaptive' in the AC method used to design the controller refers to what the critic is doing, not to what the resulting controller is doing.

A corollary key objective in this research was to develop an extended training protocol for the DHP process such that the resulting non-linear controller would have near-optimal performance in extended regions of the airplane's state space corresponding to significantly shifted parameter values, such as the location of the center of gravity. The results of that work will be reported elsewhere.

The stability and response characteristics of an airplane are typically provided by at least two subsystems: 1) an "inner-loop stability augmentation" system (ILSA) -- alternatively, a "stability augmentation system" (SAS) -- and 2) an "outer-loop control" system (OLC). In Fig. 1, for simplicity, these subsystems are embedded inside the box labeled "Actual Airplane." The box labeled "Reference Airplane" embodies response characteristics that are deemed desirable by the pilot, and provides the basis for guiding the design of the "Augmenter." When fully designed, the Augmenter box (Control Augmentation System -- hereafter called CAS) receives the pilot's commands and the error between the actual and reference airplane states, and based on these data generates signals that (additively) augment the pilot's commands to the actual airplane. During such flying operations, the bottom paths in Fig. 1, comprising the Utility Fn. and DHP Design Process boxes, are not present. Note that while the CAS operates on the pilot commands plus plane states, providing modifiers to the pilot commands, the OLC operates on the modified pilot commands and plane states, returning the plant commands that are actually used to manipulate the actuators. The ILSA operates on the states of the plane, with stability as its focus, and returns modifications to the plant commands.

The DHP method was used to create a non-linear CAS, optimized to cause LoFLYTE® to emulate the flight characteristics of LoFLYTE®* through the Utility function defined in (eq. 1). An MLP neural network paradigm is used for the CAS: a sigmoidal feedforward neural network, trained via the Backpropagation algorithm, with 28 inputs (12 states, 12 errors, and 4 controls), 20 elements in the hidden layer, and 4 (control) outputs. For the present research, the Actual Plant and the Reference Plant operations were implemented on the 6-dof simulator. In the equations below, the 6-dof operations are designated S; the 6-dof parameters corresponding to LoFLYTE® and LoFLYTE®* are designated P and P*, respectively; and the CAS equations are designated C. Thus,

$$a(t) = C(u(t),\, R(t),\, e(t),\, W(t))$$   (eq. 2)
$$u^a(t) = u(t) + a(t)$$   (eq. 3)
$$R(t+1) = S(R(t),\, u^a(t),\, P)$$   (eq. 4)
$$R^*(t+1) = S(R^*(t),\, u(t),\, P^*)$$   (eq. 5)

Here u represents the command vector coming from the pilot; e is the difference between R and R* (i.e., $e(t) = R(t) - R^*(t)$); a represents the augmentations produced by the CAS; and $u^a$ is the CAS-augmented pilot command vector. Note that the CAS, C, used in this research was a neural network with weight parameters W; indeed, any parameterized function approximator, or any other parameterized CAS, could have been used (with the requirement that the selected form be differentiable). With these equations, the optimization problem can be stated more clearly as:

Find $W$ that minimizes $\sum_{t=0}^{k} U(t)$, where $k$ is some large integer.
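Read as pseudocode, (eq. 2)-(eq. 5) prescribe one closed-loop simulation step, scored by the running sum of (eq. 1). A literal transcription (with C, S, P, and P* taken as opaque callables; a minimal sketch, not the project code) looks like:

```python
import numpy as np

def closed_loop_step(R, R_star, u, C, S, W, P, P_star):
    """One simulation step per (eq. 2)-(eq. 5), returning U(t) of (eq. 1)."""
    e = R - R_star                        # state error fed to the CAS
    a = C(u, R, e, W)                     # (eq. 2): augmentation from the CAS
    u_a = u + a                           # (eq. 3): augmented pilot command
    R_next = S(R, u_a, P)                 # (eq. 4): LoFLYTE receives u + a
    R_star_next = S(R_star, u, P_star)    # (eq. 5): reference sees raw u only
    U = float(np.sum((R_next - R_star_next) ** 2))   # (eq. 1) at t+1
    return R_next, R_star_next, U         # sum U over t to score a candidate W
```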

This restatement puts our problem in a form that is compatible with Dynamic Programming (to minimize the error between the Reference Model and the actual airplane), hence with the DHP implementation.

In order to apply DHP, a differentiable model of the plant is necessary to calculate the local linear relationship between the state of LoFLYTE® at time t+1, R(t+1), and both the state of LoFLYTE® at time t, R(t), and the augmented pilot command at time t, $u^a(t)$. The first of these, referred to as the model state Jacobian (MSJ), is denoted $\nabla S_R(t+1)$ and has elements $\partial R_j(t+1) / \partial R_i(t)$. The second, referred to as the model command Jacobian (MCJ), is denoted $\nabla S_{u^a}(t+1)$ and has elements $\partial R_j(t+1) / \partial u_i^a(t)$. Unfortunately, the 6-dof is quite complex, making explicit analytical differentiation of its equations very difficult. To avoid this labor, a finite-differences method was employed. This method is a numerical application of the basic limiting definition of the derivative:

$$\frac{df}{dx} = \lim_{\varepsilon \to 0} \frac{f(x+\varepsilon) - f(x)}{\varepsilon}$$   (eq. 6)

In our case, we used the more balanced, central-difference version of this linearization:

$$\frac{df}{dx} \approx \frac{f(x+\varepsilon) - f(x-\varepsilon)}{2\varepsilon}, \quad \varepsilon \text{ very small}$$   (eq. 7)
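As a minimal numerical illustration of the difference between (eq. 6) and (eq. 7) (function and names here are ours, purely for illustration):

```python
def forward_diff(f, x, eps=1e-5):
    return (f(x + eps) - f(x)) / eps              # (eq. 6): O(eps) error

def central_diff(f, x, eps=1e-5):
    return (f(x + eps) - f(x - eps)) / (2 * eps)  # (eq. 7): O(eps**2) error

# e.g., for f(x) = x**3 at x = 2 the true derivative is 12:
# forward_diff(lambda x: x**3, 2.0) -> ~12.00006
# central_diff(lambda x: x**3, 2.0) -> ~12.0000000001
```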

Using S, each component of the input vectors R(t) and $u^a(t)$ is increased and decreased by some small value $\varepsilon$, run through the 6-dof, and the differences between the resulting output vectors, $R^+(t+1)$ and $R^-(t+1)$ respectively, are used to build the desired Jacobians. For clarity, let $\nabla S_R^i(t)$ designate the ith row of the MSJ at time t, and designate the ith state of LoFLYTE® at time t, increased and decreased by $\varepsilon$, by $R_i^{+\varepsilon}(t)$ and $R_i^{-\varepsilon}(t)$, respectively. Thus, for i = 1, ..., n, where n is the number of components in the state vector, we have

$$R_i^+(t+1) = S(R_i^{+\varepsilon}(t),\, u^a(t),\, P)$$   (eq. 8)
$$R_i^-(t+1) = S(R_i^{-\varepsilon}(t),\, u^a(t),\, P)$$   (eq. 9)
$$\nabla S_R^i(t) = \left( R_i^+(t+1) - R_i^-(t+1) \right)^T A$$   (eq. 10)

where A is a square matrix of size n with diagonal elements equal to $1/(2\varepsilon)$ and zeros everywhere else. Similarly, for the MCJ, let $\nabla S_{u^a}^i(t)$ designate its ith row. Thus, for i = 1, ..., m, where m is the number of components in the command vector,

$$R_i^+(t+1) = S(R(t),\, u_i^{a+\varepsilon}(t),\, P)$$   (eq. 11)
$$R_i^-(t+1) = S(R(t),\, u_i^{a-\varepsilon}(t),\, P)$$   (eq. 12)
$$\nabla S_{u^a}^i(t) = \left( R_i^+(t+1) - R_i^-(t+1) \right)^T B$$   (eq. 13)
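A sketch of building the MSJ and MCJ by central differences, following (eq. 8)-(eq. 13): here S(R, u_a, P) is an opaque callable standing in for the 6-dof, and the division by 2*eps plays the role of the diagonal matrices A and B. Note the cost: 2(n+m) plant evaluations per time step.

```python
import numpy as np

def model_jacobians(S, R, u_a, P, eps=1e-5):
    """MSJ (n x n) and MCJ (m x n), one row per perturbed component."""
    n, m = len(R), len(u_a)
    MSJ = np.zeros((n, n))                 # row i built per (eq. 8)-(eq. 10)
    MCJ = np.zeros((m, n))                 # row i built per (eq. 11)-(eq. 13)
    for i in range(n):
        Rp, Rm = R.copy(), R.copy()
        Rp[i] += eps; Rm[i] -= eps         # R_i^{+eps}(t), R_i^{-eps}(t)
        MSJ[i] = (S(Rp, u_a, P) - S(Rm, u_a, P)) / (2 * eps)
    for i in range(m):
        up, um = u_a.copy(), u_a.copy()
        up[i] += eps; um[i] -= eps         # u_i^{a+eps}(t), u_i^{a-eps}(t)
        MCJ[i] = (S(R, up, P) - S(R, um, P)) / (2 * eps)
    return MSJ, MCJ
```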

where B is a square matrix of size m, with diagonal elements equal to $1/(2\varepsilon)$ and zeros everywhere else.

DHP also requires that the utility function and the controller (action module), in this case the augmenter, be differentiable. The Utility is defined in (eq. 1) as the two-norm between the plant states and the reference states. As such, the derivative of utility with respect to state can be defined as

$$\nabla_R U(t) = \left[ \frac{\partial U(t)}{\partial R_1(t)}, \ldots, \frac{\partial U(t)}{\partial R_n(t)} \right]$$   (eq. 14)

where n is the number of state components and $\partial U(t)/\partial R_i(t) = K_i (R_i - R_i^*)$, with the $K_i$ being relative weightings of the components of U. For the CAS, it was mentioned earlier that any functional form could have been used as long as it was differentiable. In this way, the Jacobian of the CAS with respect to the state is definable. This Jacobian, here referred to as the augmenter Jacobian (AJ), is denoted $\nabla_R a(t)$ and has elements $\partial a_i(t) / \partial R_j(t)$.

The above comprise the needed equations for applying DHP to this problem, except for the critic component, which is implemented here using an MLP neural network. Recall that for DHP, the critic is responsible for estimating $\lambda^*(t) = \nabla_R J^*(t)$, the derivative of the optimal secondary cost surface (J*) with respect to the state R. The estimate produced by the critic is designated $\hat{\lambda}(t)$. Since the ADP structure used here entails a dynamic reference model, the formulation specifies that its states $R^*(t)$ be input to the critic as well [13]; in the present case, we use e(t) as a proxy for this. Thus, the critic received u(t), R(t), and e(t) as inputs.

4. Experimental Results

A LoFLYTE®* simulation was created by changing some of the stability coefficients of the LoFLYTE® 6-dof simulation in a direction that provides "nicer" uncompensated flight characteristics. Example responses (to a stick-x doublet) from both LoFLYTE® and LoFLYTE®* are shown in Fig. 2; neither has an inner-loop stabilization subsystem (ILSA, or SAS) in place. The objective is to demonstrate the ability of the DHP Adaptive Critic method to design a controller (off-line) that will make LoFLYTE® "feel" like LoFLYTE®* to a pilot. The present experiments are done without a SAS in place, as it was deemed more "dramatic" to demonstrate an augmenter design that makes the uncompensated dynamics of an airplane look like the uncompensated dynamics of a Reference Model (this is also closer to the requirements of the original problem specification). In practice, however, one would more likely have a SAS subsystem in place, at least in the LoFLYTE®* in its role as Reference Model, and a controller would be developed for LoFLYTE® accordingly.

The training "syllabus" that was crafted focused first on stick-x commands, since roll rate is the key dynamic that differs between LoFLYTE® and LoFLYTE®*. After this roll-rate dynamic was learned, stick-y (pitch), pedal (yaw), and throttle commands were slowly introduced. For the purposes of the present demonstration, responses to a stick-x "doublet" are shown. The stick-x doublet was of magnitude 2 (out of a nominal range of ±5 units): plus stick-x for 1 second; zero stick-x for 1 second; minus stick-x for 1 second; zero stick-x thereafter (a script for this signal is sketched below).

Fig. 2 shows the roll-rate responses of three different aircraft: 1) LoFLYTE®*, 2) LoFLYTE® WITHOUT control augmentation, and 3) LoFLYTE® WITH control augmentation; in addition, an indication of the pilot's stick command is overlaid on the responses (arbitrary scale in the figure). The response of LoFLYTE® WITHOUT control augmentation is seen to be significantly different from that of LoFLYTE®* (the Reference Model), while the response of LoFLYTE® WITH control augmentation is seen to lie virtually on top of the desired response. To help visualize these results, Fig. 4 plots the respective differences (roll-rate errors); the significant improvement provided by the neural network augmentation controller is obvious in this figure. We note that the 0.6 radians/sec maximum deviation shown is significant. Fig. 3 shows the pilot's stick-x signal and the corresponding augmented signal generated by the controller; the augmented signal is what commands the appropriate actuators.

There is a fair amount of coupling between the various modes in aircraft such as this; therefore, pitch and yaw responses are also shown here, in Fig. 5 and Fig. 6, respectively. As in Fig. 2, we see that the errors are virtually eliminated. Figures 7a and 7b show the augmentation commands for stick-y and pedal that the controller learned to provide to make the induced pitch and yaw responses of LoFLYTE® match those of LoFLYTE®*.
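For reference, the doublet command described above can be scripted as follows (the 50 Hz sample rate is an assumption for illustration; the paper does not state the simulation step size):

```python
def stick_x_doublet(t, magnitude=2.0):
    """Stick-x doublet: +2 units for 1 s, 0 for 1 s, -2 for 1 s, then 0."""
    if t < 1.0:
        return magnitude
    if t < 2.0:
        return 0.0
    if t < 3.0:
        return -magnitude
    return 0.0

dt = 0.02                                            # assumed sample period
signal = [stick_x_doublet(k * dt) for k in range(int(5.0 / dt))]
```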


Figure 2. Pilot stick-x doublet signal (arbitrary scale in the Figure), and roll-rate responses of 3 aircraft: LoFLYTE® w/ Unaugmented control, LoFLYTE® w/Augmented Control, and LoFLYTE®*. Note: Responses of latter two essentially coincide.

Figure 3. Stick-x doublet: pilot’s stick signal vs. augmented signal (the latter is sent to aircraft actuators).

5. Conclusions

The basic motivation behind the present work is the desire to ultimately generate a non-linear controller that has control capabilities equivalent to those of an "experienced pilot." Once the ability to accomplish DHP design based on a (dynamic) Reference Model is demonstrated for a nominal airplane configuration (as was done above), we can move on to exploring the ability of this design process to operate effectively while modifications are made to plant parameters (still in an off-line mode), and thus expand the controller's capabilities to make the modified-parameter LoFLYTE® fly like the (original) Reference Model. A first step in this direction is reported in a related paper [8]. To the authors' knowledge, the number of inputs for the controller successfully designed here using the DHP method is among the largest reported during the first decade of this method's evolution. It is a real-world problem, and if all goes well with the rest of the design process (and funding), the controller is in line to be actually test flown on the LoFLYTE®.

Figure 4. Roll-rate error (for above stick-x signal) between LoFLYTE®* and LoFLYTE® w/Unaugmented Control, and between LoFLYTE®* and LoFLYTE® w/Augmented Control signals.



Figure 5. Pitch-rate error (for above stick-x signal) between LoFLYTE®* and LoFLYTE® w/Unaugmented Control, and between LoFLYTE®* and LoFLYTE® w/ Augmented Control signals.

Figure 6. Yaw-rate error (for above stick-x signal) between LoFLYTE®* and LoFLYTE® w/Unaugmented Control, and between LoFLYTE®* and LoFLYTE® w/ Augmented Control signals.

Figure 7. Augmentation commands for a) stick-y and b) pedal that the controller learned to provide to make the induced pitch and yaw responses of LoFLYTE® match those of LoFLYTE®*.


6. References

[1] Astrom, K.J. & B. Wittenmark (1984), Computer Controlled Systems: Theory and Design, Prentice Hall.
[2] Bellman, R.E. (1957), Dynamic Programming, Princeton Univ. Press.
[3] Butler, H. (1992), Model Reference Adaptive Control -- From Theory to Practice, Prentice Hall (Series in Systems & Control Engineering).
[4] Chen, L. & K. Narendra (2001), "Nonlinear Adaptive Control Using Neural Networks and Multiple Models," Automatica, Special Issue on Neural Network Feedback Control, 37(8), pp. 1245-1255.
[5] Lainiotis, D.G. (1976), "Partitioning: A Unifying Framework for Adaptive Systems, Part I: Estimation, & Part II: Control," Proceedings of the IEEE, vol. 64, Part I: pp. 1126-1143; Part II: pp. 1182-1197, August.
[6] Lendaris, G. & L. Schultz (2000), "Controller Design (from Scratch) Using Approximate Dynamic Programming," Proceedings of the IEEE International Symposium on Intelligent Control (IEEE-ISIC'2000), Patras, Greece, July.
[7] Lendaris, G.G., L. Schultz & T.T. Shannon (2000), "Adaptive Critic Design for Intelligent Steering and Speed Control of a 2-Axle Vehicle," Proceedings of the International Joint Conference on Neural Networks (IJCNN'2000), Italy, Jan.
[8] Lendaris, G.G., R.A. Santiago & M.S. Carroll (2002), "Proposed Framework for Applying Adaptive Critics in Real-Time Realm," Proceedings of the International Joint Conference on Neural Networks (IJCNN'2002), Hawaii, May.
[9] Luus, R. (2000), Iterative Dynamic Programming, CRC Press, Jan.
[10] Narendra, K.S. & L.S. Valavani (1978), "Stable Adaptive Controller Design -- Direct Control," IEEE Transactions on Automatic Control, vol. AC-23, no. 4, pp. 570-583, Aug.
[11] Narendra, K.S. & K.H. Lin (1980), "A New Error Model for Adaptive Systems," IEEE Transactions on Automatic Control, vol. 19, pp. 474-484.
[12] Narendra, K.S. & A.M. Annaswamy (1989), Stable Adaptive Systems, Prentice Hall.
[13] Prokhorov, D.V. (1997), Adaptive Critic Designs and Their Applications, Ph.D. Dissertation, Texas Tech University, Oct.
[14] Prokhorov, D.V. & D.C. Wunsch II (1997), "Adaptive Critic Designs," IEEE Transactions on Neural Networks, vol. 8, no. 5, pp. 997-1007, Sept.
[15] Prokhorov, D.V., R.A. Santiago & D.C. Wunsch II (1995), "Adaptive Critic Designs: A Case Study for Neurocontrol," Neural Networks, vol. 8, no. 9, pp. 1367-1372.
[16] Sutton, R.S. & A.G. Barto (1999), Reinforcement Learning, MIT Press.
[17] Thampi, G.K., J.C. Principe, M.A. Motter, J.H. Cho & J. Lan (2002), "Multiple Model Based Flight Control Design," Proceedings of the Midwest Symposium on Circuits and Systems, Tulsa, Oklahoma, Aug.
[18] Werbos, P.J. (1990), "A Menu of Designs for Reinforcement Learning Over Time," Ch. 3 in Miller, W.T. III, R.S. Sutton & P.J. Werbos (eds.), Neural Networks for Control, MIT Press.
[19] Werbos, P.J. (1992), "Neurocontrol and Supervised Learning: An Overview and Evaluation," Ch. 3 in White, D.A. & D.A. Sofge (eds.), Handbook of Intelligent Control, Van Nostrand Reinhold (p. 78).
[20] White, D., et al. (1992), "Flight, Propulsion, and Thermal Control of Advanced Aircraft and Hypersonic Vehicles," Ch. 11 in White, D.A. & D.A. Sofge (eds.), Handbook of Intelligent Control, Van Nostrand Reinhold.

