Observations on the Practical Use of Adaptive Critics

Lee A. Feldkamp
Ford Research Laboratory, P.O. Box 2053, MD 1170 SRL, Dearborn, MI 48121-2053
[email protected]

Danil V. Prokhorov
Applied Computational Intelligence Laboratory, Department of Electrical Engineering, Box 43102, Texas Tech University, Lubbock, TX 79409-3102
[email protected]

Abstract

By studying adaptive critic designs (ACD) from the standpoint of practical use in training neural networks, we expect to establish the types of problems for which ACD might be preferable to more established methods. To restrict the scope, we have chosen to concentrate on applying ACD, specifically derivative critics, to the training of recurrent networks [1]. This is actually less restrictive than it may appear; many problems, including controller training, can be posed as optimizing some or all of the weights of a recurrent network. An immediate benefit of this focus has been to clarify the relationship between the derivatives that result from backpropagation through time (BPTT) and the quantities that derivative critics are expected to deliver. At the same time, many questions have been raised, such as that of the critic representation that best balances accuracy against the number of time steps required for adaptation. Because our formulation permits BPTT and derivative critics to be used together or separately, we expect that experience with a variety of problems will further clarify the various tradeoffs and suggest situations in which critics may be used to particular advantage.

1 Introduction

The use of approximate dynamic programming (ADP) takes on many forms and has produced many impressive results. Most such results are for control problems in which the states and actions are discrete (see summary in [2]). For this reason, it has been difficult to construct side-by-side comparisons of ADP methods with other approaches. In a recent paper [1], we presented a formulation that places on a common footing the computation of derivatives for time-lagged recurrent networks by two

methods: 1) backpropagation through time and 2) derivative adaptive critics. We did this by recognizing that the so-called differential Bellman equation, upon which derivative critics are based, arises naturally in the course of BPTT. Consequently, training a critic may be regarded simply as an estimation problem. Further, we argue that controller training may be viewed as the training of a subset of the weights of a heterogeneous recurrent network. Applied to derivative adaptive critics, this perspective clarifies several issues that have been somewhat obscure.

2 Context of the Formulation

We deal with the training of a time-lagged recurrent network with continuous variables, whether inputs, outputs, or state variables. The recurrent network represents an entire dynamical system. For controller training, the network typically consists of blocks representing the plant (and/or identification network) and controller. For model reference control, the network also contains the reference model. At any given time, parts of the overall network are regarded as fixed, others as trainable. Although the network may be autonomous (isolated), more commonly it receives exogenous input. In control problems, inputs are often assumed to follow some convenient noise process. In network training problems, such an assumption seldom holds, so we must be able to deal with whatever input stream is appropriate to the problem. From the standpoint of the whole system, expressed as a recurrent network, we generally assume that we are dealing with supervised learning. In particular, we assume that we are provided with or can construct a short-term cost or utility function in terms of one or more network variables, so that derivatives of that function with respect to the output variable can be computed. In typical network training problems (e.g.,

training from a file of inputs and desired outputs), the target outputs relate to present or past values of the inputs. In controller training problems, on the other hand, the targets are typically desired system outputs that might be constants (stabilization problems) or known functions of reference inputs. Not every problem falls neatly into this form, however, and we pause to list the major categories of performance feedback.

1) Explicit targets for network outputs are provided at every step.

2) An explicit differentiable cost or utility function is provided at every step and expressed in terms of network variables. This case differs from case 1 in that target values may not be known.

3) A graded cost is provided at every step, but its relation to network variables is not explicit. This case arises frequently in practical problems.

4) Ungraded reinforcement, e.g., win or lose, is provided when appropriate, such as at the end of a game or when some major event (success or failure) has occurred. Reinforcement might be represented numerically as costs of 1 or -1, with the absence of reinforcement represented by 0.

Several intermediate cases or modifications may also be identified. Examples: a) a short-term cost or utility exists but at some time steps is unavailable to the learning process; b) the cost function, while expressed in terms of network variables, is not differentiable.

Supervised learning, with or without adaptive critics, may be applied immediately to cases 1 and 2. If we wish to apply the mechanics of supervised learning to the other cases, a reasonable approach is to convert the available information to the form of case 2. This is most conveniently done as a separate training problem in which we attempt to establish at least an approximate mapping of relevant network variables to the cost or reinforcement values provided in cases 3 and 4. For example, for the classic cart-pole problem described in [7], one could approximate with a simple neural network the negative reinforcement associated with failure (the pole falling or the cart exceeding its allowed region of operation) in terms of the observed system variables. Once such a network has been trained, it provides a differentiable cost function (case 2) and supervised learning can be employed. Of course, the process of approximating the provided pattern of cost or reinforcement can in many cases be arranged to take place while the primary training process is underway. Coordinating such processes will vary with the problem, and further discussion is beyond the scope of this paper.

In a pragmatic sense, it is important to note that in the training process we are not restricted to using the actual cost or reinforcement provided. On the basis of prior knowledge, we may choose to employ a surrogate cost function that expresses our understanding of the problem. For example, the cart-pole problem becomes much simpler if we follow our intuition and impose a cost function that in a graded fashion penalizes pole angles that deviate from the vertical and cart positions that are not in the middle of the track; see, e.g., [6]. Of course, an evaluation of the efficacy of training should follow the original definition of cost or utility, not the surrogate. It should also be obvious that since this approach alters the original statement of the training problem, comparisons with procedures that make less use of prior knowledge should be stated with great care.
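To make this concrete, a minimal sketch of such a surrogate cost for the cart-pole setting follows; the quadratic form and the weighting constants w_theta and w_x are illustrative assumptions, not values prescribed in [6] or [7].

```python
import numpy as np

def surrogate_cartpole_cost(theta, x, w_theta=1.0, w_x=0.1):
    """Graded, differentiable surrogate for the sparse cart-pole failure signal.

    Penalizes pole angle theta away from vertical and cart position x away
    from the track center; the weights are illustrative assumptions."""
    return w_theta * theta ** 2 + w_x * x ** 2

def surrogate_cartpole_cost_gradient(theta, x, w_theta=1.0, w_x=0.1):
    """Derivatives of the surrogate cost with respect to the observed
    variables, as required for gradient-based (supervised) training."""
    return np.array([2.0 * w_theta * theta, 2.0 * w_x * x])
```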

3 Role of a Derivative Critic

The task of a conventional adaptive (J) critic is that of a state evaluation function; i.e., it is intended to provide an estimate of the discounted sum of future one-step costs, called the cost-to-go, which we denote at time step k as J(k). A derivative critic, on the other hand, is intended to provide derivatives of J with respect to variables of the system. In the case of recurrent network training, we ultimately need derivatives with respect to each trainable weight. To get these we require derivatives with respect to the outputs of the various nodes. Derivative adaptive critics used in such processes are usually pictured as structures (e.g., neural networks) with several inputs and outputs, a separate output of the structure providing estimates of derivatives of J with respect to the outputs of particular (typically, output) nodes of the network [3]. However, it is certainly not inconsistent to consider a decoupled critic structure, so that a separate derivative critic is defined for every node and input of the network. In practice, various considerations may reduce the number of nodes for which a critic is needed. For example, we need a critic for only those nodes or inputs which lie, in the sense of signal propagation, between the cost function and an intended decision (e.g., weight adjustment or input change). As discussed in [1], even some of these can be excluded because the corresponding critic can be recognized to remain null. The specific function of a derivative critic for a given variable y is to estimate the change in J(k) that would result if y(k) were to be changed from the value it would otherwise have on the basis of current operation of the system. In [1, 5] we noted that the derivative being estimated is very closely related to the derivatives that result from a particular form of truncated backpropagation through time (BPTT(h)). In this form of BPTT, one computes the change in summed error, over a specific chain of time steps, that would result from a change in a given variable at the first step in the chain. Specifically,

let $J(k) = \frac{1}{2}\sum_{t=k}^{T} \epsilon(t)^2$, where $\epsilon(t)$ is an error (one-step cost) at time step $t$ and $T = k + h$. Then

$$\frac{\partial^{+} J(k)}{\partial y(k)} \;=\; \sum_{t=k}^{T} \epsilon(t)\,\frac{\partial^{+} \epsilon(t)}{\partial y(k)}\,.$$

We note that $\frac{\partial^{+} \epsilon(t)}{\partial y(k)}$ is supposed to account for both explicit and implicit dependencies of $\epsilon(t)$ on $y(k)$ at the first step. Such derivatives are called ordered in [4].

Given the similarity just mentioned, let us contrast derivatives computed by truncated BPTT with those produced by derivative critics.

1) BPTT derivatives are computed directly, while critic derivatives are computed from a representation such as a linear function or a neural network; further, the parameters of this representation must be learned.

2) BPTT(h) derivatives generally involve a finite time horizon (equal to the chosen truncation depth h), while critic derivatives are estimates for an infinite horizon; it is often necessary, however, to employ a discount factor, which may be interpreted as a gentle truncation.

3) BPTT derivatives necessarily compute the effect of changing a variable in the past, while a derivative critic may be used to estimate the effect of a change at the present time. If critics are used only to adjust controller parameters, this distinction is essentially irrelevant. On the other hand, recognizing it poses interesting possibilities for alternative or supplementary methods of control, as in Section 6.

4) A BPTT derivative is essentially exact for the specific trajectory for which it is computed, while a critic derivative is expected to estimate an average over trajectories that begin with a given system state. Such an estimate may be quite accurate (if the critic has been well adapted or trained and if exogenous inputs to the system are either small or statistically well behaved) or may be essentially worthless (if future operation is completely unpredictable due to arbitrary inputs or targets).
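A minimal sketch of the quantity computed by BPTT(h) may help fix ideas. The toy recurrence y(t+1) = tanh(a*y(t)), the coefficient a, and the target sequence below are assumptions made purely for illustration; the code accumulates the ordered derivative of J(k) with respect to y(k) over a horizon of h steps.

```python
import numpy as np

def ordered_derivative_bptth(y_k, targets, h, a=0.9):
    """Compute d+J(k)/dy(k) for J(k) = 0.5 * sum_{t=k}^{k+h} eps(t)^2
    on the toy recurrence y(t+1) = tanh(a * y(t)), eps(t) = y(t) - targets[t-k].

    Forward accumulation: track both y(t) and dy(t)/dy(k) along the
    trajectory, so every explicit and implicit dependence of eps(t) on y(k)
    is captured (an 'ordered' derivative in the sense of [4])."""
    y = y_k
    dy_dyk = 1.0                      # dy(k)/dy(k)
    dJ_dyk = 0.0
    for t in range(h + 1):            # t runs over k, k+1, ..., k+h
        eps = y - targets[t]
        dJ_dyk += eps * dy_dyk        # eps(t) * d+eps(t)/dy(k)
        y_next = np.tanh(a * y)       # step the recurrence ...
        dy_dyk *= a * (1.0 - y_next ** 2)   # ... and its sensitivity to y(k)
        y = y_next
    return dJ_dyk
```

A derivative critic would replace this finite-horizon, trajectory-specific sum by a learned estimate of its (discounted) infinite-horizon counterpart.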

4 Critic Issues

4.1 Inputs to the Critic

A critic executes a static mapping based on available input information. It must, of course, base its estimate for step k only upon information that is available at k or earlier. Since the critic is essentially being asked to encapsulate the effect of a change in some variable on the future operation of the system, as much as is known about the system state should be provided. In the case of a recurrent network, the set of node values subsumes the available information and could be used as critic input variables. In most cases, it is desirable to exclude redundant variables from the critic input list. For example, nodes which have constant values may always be excluded if the critic has a bias input, as should variables that are linear combinations of other variables in the

input set. Nodes whose outputs can be obtained from others by a nonlinear but static mapping can be excluded, though retaining them may simplify the critic learning task.

Let us consider the training process for a neural network controller in a model reference scheme. We define a system recurrent network that represents the overall closed-loop system. For simplicity, we assume that the plant (the system to be controlled) is known and can be represented exactly by a recurrent network. The reference model that produces target values for the plant is represented similarly. (In a stabilization problem, the reference model is merely a constant.) The overall recurrent network thus consists of three subnetworks: 1) the plant network, 2) the controller network, and 3) the reference model network. Our remarks above then imply that inputs to derivative critics should include at least the state nodes of each subnetwork, including those of the controller and reference model.

Traditionally, a distinction has been made between critics that are `action dependent' and those that are not. Here, the distinction is not really useful. In particular, we contend that if the controller has memory (e.g., is recurrent or has an embedded delay line), then the state variables of the controller should be included in the critic input representation in the interests of critic accuracy. Inputs from the controller can be excluded if the controller is a static function of other critic inputs. We note that inclusion of controller variables in the critic input set does not directly influence the training of the controller. This is in contrast to Q-learning, where including controller variables as critic inputs is intrinsic to the method.

In previous treatments of adaptive critics, the importance of including states of the reference model as critic inputs does not seem to have been recognized explicitly, though it is implicit in the common prescription that the critic should be an adapted function of the system state. This may be understood by noting that if states of the reference model are excluded from the critic representation, we effectively turn targets from the reference model into disturbances, which tend to decrease the likelihood of an accurate critic.

An extension of the principle just discussed may be used in standard recurrent network training problems, such as the training of a recurrent network to act as a system model. Here we typically have a file of inputs and target outputs. Suppose we regard this problem as that of training a controller for a trivial identity plant to follow the target outputs. Since we lack a reference model, it seems necessary to treat the targets as disturbances. However, we can still approximate the states of the missing reference model by its time-lagged outputs (using some estimate of the

order of the system that is producing the data) and provide these as inputs to the critic.

Our conclusions regarding critic inputs may be demonstrated through simple linear examples. It must be understood, however, that they do not imply that any given problem cannot be solved if a simpler input set is used. Indeed, many issues remain to be understood regarding the various tradeoffs involved in practical use. Among these is the conflict between rapid critic learning and ultimate critic accuracy. Extreme simplifications of the derivative critic representation, called primitive (bias-weight-only) critics, can provide the fastest learning, and they are obviously attractive for on-line implementations [5].
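A minimal sketch of the input-assembly principle discussed in this subsection follows; the function and argument names are hypothetical, and the use of time-lagged targets corresponds to the case in which no explicit reference model is available.

```python
import numpy as np

def critic_inputs(plant_state, controller_state, ref_model_state=None,
                  lagged_targets=None):
    """Assemble the input vector for a derivative critic.

    plant_state, controller_state: current node values of those subnetworks.
    ref_model_state: states of the reference model, if one exists.
    lagged_targets: recent target outputs, used as a surrogate for the
    reference-model state when none is available (the number of lags is an
    estimate of the order of the system producing the data)."""
    parts = [np.asarray(plant_state), np.asarray(controller_state)]
    if ref_model_state is not None:
        parts.append(np.asarray(ref_model_state))
    elif lagged_targets is not None:
        parts.append(np.asarray(lagged_targets))
    return np.concatenate(parts)
```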

4.2 Form of the Critic

The computational form of the critic can, in principle, be anything that is capable of a reasonable mapping from its inputs to its target output. As might be expected, a tradeoff exists between mapping accuracy and training or adaptation time. Moreover, because critic training usually must coexist with the process of training the network weights, the tradeoff is more complex. The accuracy attainable by a critic depends on the complexity of the system. (In the case of control, it is the complexity of the entire closed-loop system that matters; the closed-loop system often becomes less complex as control improves.) It also may be limited by the degree to which future system evolution cannot be predicted, perhaps because of disturbances. If network training is in progress, critic accuracy will also depend on how much the controller is being changed, because every controller change results in a change to the system to which the critic is being adapted. If network weight updates are to be closely interleaved with critic updates, a relatively simple critic representational structure might be warranted, so that the critic can be quickly adapted to changes in the system.

Any critic-based training scheme operates best when the dynamics of the closed-loop system are predictable. For example, a critic can be accurate when the system is subject to a constant disturbance. However, the same critic will always be inaccurate if the system is perturbed by a disturbance whose behavior in the immediate future has changed from its behavior in the recent past. An illustrative example of a system which is hard to treat using critics is a time-delay neural network (TDNN) and, more specifically, a finite impulse response (FIR) network. The training capabilities of a critic-based method for an FIR network must be compared with the known procedure called temporal backpropagation (TB) [8]. TB is the most computationally efficient method to train FIR networks since it handles arbitrarily long delays with

the efficiency of BPTT(0). We contend that critics should not be used to train a pure TDNN when dealing with arbitrarily changing disturbances or inputs. Unfortunately, TB is not applicable to training recurrent networks. However, it appears possible to develop a hybrid of TB, BPTT, and derivative critics which would combine the benefits of both TB and BPTT while enabling the use of critics when necessary. Work on such a hybrid is currently underway.

4.3 Training of the Critic

The proper coordination of critic training with network or controller training remains a research topic. We have found that, in some cases, it is possible to update both network weights and critic weights at every training step, although such a strategy may not work well in the presence of network weight updates of larger size, as frequently result from second-order training procedures. Alternating the training processes in blocks is a reasonable option, since holding the network fixed while adapting the critic generally leads to greater critic accuracy. The drawback is that once the critic is held fixed and the network is changed, the critic may become very inaccurate and compromise training with poor derivatives. Hence a better approach might be to carry out the training processes concurrently but to monitor the critic error (the error used to update the critic) and to suspend network training for a specified number of steps if a specified critic error is exceeded.

In the course of exploring coordination strategies, it is well to keep in mind that pure BPTT(h) generally permits network weight updates at every step. Though the use of derivative critics may save computation (depending on the required depth of BPTT), the saving may exact a cost in the number of time steps required for network training. In on-line training on physical systems, the number of steps is often a more important metric than the mere number of computations.

It turns out that the derivative critic framework is also applicable when quantified targets for nodes of the network (i.e., gradients of the immediate cost) are not available at some time steps. Such an occasional loss of information from the environment does not appear to preclude training derivative critics. Indeed, we can readily perform critic updates based on the Bellman recursion whenever immediate gradients are available. In this case, critics are not updated at each time step, as in our usual critic training, but only when gradients are supplied from the environment.
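One way the monitoring strategy suggested above might be realized is sketched below; the threshold, the suspension length, and the callable interfaces are assumptions rather than a prescribed procedure.

```python
def train_with_critic_monitoring(net_update, critic_update, stream,
                                 err_threshold=0.5, suspend_steps=20):
    """Interleave network and critic training, suspending network updates
    when the critic error (the error used to update the critic) is large.

    net_update(sample): performs one network/controller weight update.
    critic_update(sample): performs one critic update and returns its error.
    stream: iterable of training samples (placeholder for the data source)."""
    suspend = 0
    for sample in stream:
        critic_err = critic_update(sample)   # critic is adapted at every step
        if abs(critic_err) > err_threshold:
            suspend = suspend_steps           # critic judged unreliable
        if suspend > 0:
            suspend -= 1                      # hold the network weights fixed
            continue
        net_update(sample)                    # update uses critic-supplied derivatives
```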

5 λ and C Critics

As usually formulated, derivative critics, designated by λ, are trained to estimate derivatives of J, including the contribution from the current step, k. In [1], we

proposed separating the estimate of the future from that of the present. The difference between the values of the two critic forms is precisely the quantity that results when the derivative of the error at each output node, $\partial J(k)/\partial y_{\mathrm{out}}(k)$, is backpropagated to $y(k)$. (In the simple case of a single-node network, the new critic $C_y(k)$ is related to the usual critic as follows: $\lambda(k) = \partial J(k)/\partial y(k) + C_y(k)$.) The C critic is not required to estimate quantities which can be computed exactly. Limited experimentation suggests that the use of C critics may lead to faster training than that of λ critics.
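Written out for the single-node case, the separation may be summarized as follows; the recursion shown for $C_y(k)$ is a sketch of the differential Bellman construction under a discount factor $\gamma$, assuming the simple single-node setting, and is not quoted from [1].

```latex
\lambda(k) \;=\; \frac{\partial^{+} J(k)}{\partial y(k)}
          \;=\; \underbrace{\frac{\partial J(k)}{\partial y(k)}}_{\text{current-step (direct) term}}
          \;+\; \underbrace{C_y(k)}_{\text{future contribution}},
\qquad
C_y(k) \;\approx\; \gamma\,\lambda(k+1)\,\frac{\partial y(k+1)}{\partial y(k)}\,.
```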

6 Direct Critic-Based Control

Beyond their use together with or in addition to BPTT(h) for training network or controller weights, derivative adaptive critics may be used in situations where the usual BPTT(h) cannot be used. A generic example is the use of derivative critics to assess the efficacy of control action. Let us assume that a critic $C_u(k)$ has been trained, with the parameters of the existing controller held constant. Directly from its definition, this critic estimates the sensitivity of future costs to changing the control action from that specified by the controller. This permits us to judge the adequacy of the present control: a controller which is optimal with respect to the cost function used to train the critic should result in very small values of $C_u(k)$. In the case of a slowly time-varying system, $C_u(k)$ could be used to trigger further controller adaptation.

Interestingly, control actions may be derived directly from the critic. In particular, a control change at time step k may be taken as $\Delta u(k) = -\eta\, C_u(k)$, where $\eta$ is a constant (unfortunately, the derivative critic formalism does not tell us what $\eta$ should be, but we are currently exploring ways of quantifying it). It is important to recognize that such supplementary control actions appear as disturbances to the critic. Hence, to maintain the integrity of the critic, these control changes should be used sparingly. (If frequent changes are required, they should be embedded into the base controller by the usual training process.) It is not difficult to construct problem statements in which occasional control actions are appropriate. For example, suppose that each control action exacts a fixed cost. In such a case the obvious strategy would be to make a limited number of highly effective control actions. The critic $C_u(k)$ is precisely what is needed to decide when to exert control.
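A small sketch of the adequacy check just described is given below; the window length and tolerance are illustrative assumptions.

```python
import numpy as np

def controller_needs_adaptation(c_u_history, window=100, tol=0.01):
    """Judge controller adequacy from recent derivative-critic outputs.

    c_u_history: sequence of C_u(k) values produced by a critic trained with
    the current controller held fixed.  If the recent mean magnitude exceeds
    'tol', the controller is judged suboptimal with respect to the cost
    function used to train the critic, and further adaptation is triggered."""
    recent = np.abs(np.asarray(c_u_history[-window:]))
    return recent.mean() > tol
```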

7 Example and Discussion

We consider the following simple control problem, which involves a plant resembling a slightly damped

pendulum. The plant is stable and has two state variables. The plant and a simple controller are described by the equations

$y_1(k) = 2\tanh\!\big(\tfrac{1}{2}(0.89\,y_1(k-1) + 0.1\,y_2(k-1) + 0.185\,u(k-1))\big)$,
$y_2(k) = -0.935\,y_1(k-1) + 0.89\,y_2(k-1) + 0.193\,u(k-1)$,
$u(k) = 2\tanh\!\big(\tfrac{1}{2}(w_0 + w_1 y_1(k) + w_2 y_2(k))\big)$.

We take the cost function to be $-(y_1(k) - 0.45)^2$ for $y_1(k) > 0.45$, $+5(y_1(k) + 0.75)^2$ for $y_1(k) < -0.75$, and zero otherwise. Although this is a control problem, it may be regarded as a recurrent network training problem in which only the weights $w_i$ can be varied; the available performance feedback corresponds to case 2 of Section 2. Even applied continuously, the maximum control magnitude is not sufficient to drive $y_1$ into the region of negative cost and hold it there. Hence, to minimize the cost function over time, the control must cause the pendulum to swing back and forth. This implies that the controller must at times drive $y_1$ away from the region of negative cost and toward the region of positive cost. Since meaningful short-term performance feedback is supplied only when $y_1$ enters the specified regions, the critic $C_u(k)$ is useful for evaluating the proper sign of the control action and estimating its efficacy in maximizing long-term performance. Here each of the critics $C_{y_1}(k)$, $C_{y_2}(k)$, and $C_u(k)$ is a linear function (including bias) of $y_1(k)$ and $y_2(k)$.

We treat this problem as an on-line training process in which the trajectory is initialized at $y_1 = -2$ and $y_2 = 0$ and controller training is carried out without resetting the plant state. For the above system parameters and the specified cost function, however, the system "dies out" before the controller has learned to keep the pendulum swinging. The difficulty is easily seen to be the relatively high threshold for performance feedback, $y_1 = 0.45$. Hence, in the absence of prior knowledge about the problem, it may be impossible to train a controller. A pragmatic solution is to exploit our qualitative knowledge and encourage the controller to learn to swing the pendulum. We do this by temporarily replacing the negative term in the actual cost function with $-y_1(k)^2$. This term serves to provide feedback early in the training process. After the first 100 time steps, the original cost function is restored.

We carried out the training process using pure BPTT(h) with several values of h, using pure derivative critics, and with several hybrids. In all cases, the learning rates of the controller and critics were fixed at 0.1 and the discount factor was set to 0.95. Both controller and critic weights were updated every time

step. We summarize our results as follows. Neither BPTT(0) nor BPTT(1) was capable of training a good controller. Larger values of h are able to get the pendulum to swing and to produce substantial negative values of accumulated cost. Using C critics with h = 0 (C-BPTT(0)) also produces a good controller, though not quite as good as that from BPTT(2). Controllers from hybrid training such as C-BPTT(1) are quite good, generally much better than with pure BPTT. It is important not to overgeneralize these results, as they depend on the details of the problem as well as on details of the training procedure.

Let us alter the problem statement to allow larger control actions but require that control be applied on no more than one time step in a block of n steps. Here we take the approach of direct control, i.e., holding $w_0 = w_1 = w_2 = 0$ and calculating controls directly from the critic $C_u(k)$. We take $n = 10$ and allow controls in the range $[-4, 4]$. The cost function strategy discussed above and the critic learning rate remain unchanged. To decide when to apply control, we adopt the following simple logic. For each of the first nine steps of every ten-step block, we evaluate $C_u(k)$ and compare its magnitude with a specified fraction $\rho = 0.95$ of the largest magnitude in the previous block. If $|C_u(k)|$ is greater than or equal to this value, then we apply the control $u = -4\tanh(\eta\, C_u(k))$; here we take $\eta = 50$. If the test has not been passed before the last step of the block is reached, then $u$ is computed and applied at that step, so that precisely one control action occurs in every block. If $\rho$ is reasonably chosen, then control will tend to be applied when it is more effective than average.

In Figure 1 we show the first 1000 steps of the system under direct control. Note that $y_1(k)$ largely avoids the region of positive cost. We also show values of the critic $C_u(k)$ and its error according to the differential Bellman equation used in critic training (see [1]). Note the considerable variation in $C_u$ and its relatively small error. Control is frequently, but not always, applied at the peaks of this critic, because the period of oscillation is not a multiple of our chosen block size. In this example of direct control, no controller is embedded as part of the system, so the magnitude of $C_u$ has no tendency to decrease over time. On the other hand, when a controller is being trained, $|C_u|$ should decrease. In the absence of disturbances, a residual irreducible $C_u$ suggests that augmenting the controller architecture may be beneficial.
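For concreteness, the following sketch re-implements the example plant, the cost function, and the block-based trigger logic for direct control as described above; the function names are hypothetical, and the critic is assumed to be supplied as an already-trained function of $y_1$ and $y_2$.

```python
import numpy as np

def plant_step(y1, y2, u):
    """Plant resembling a slightly damped pendulum (two state variables)."""
    y1_next = 2.0 * np.tanh(0.5 * (0.89 * y1 + 0.1 * y2 + 0.185 * u))
    y2_next = -0.935 * y1 + 0.89 * y2 + 0.193 * u
    return y1_next, y2_next

def cost(y1):
    """Negative cost (reward) above 0.45, quadratic penalty below -0.75."""
    if y1 > 0.45:
        return -(y1 - 0.45) ** 2
    if y1 < -0.75:
        return 5.0 * (y1 + 0.75) ** 2
    return 0.0

def direct_control_run(critic_u, steps=1000, block=10, frac=0.95, gain=50.0):
    """Direct critic-based control: at most one control action per block of
    'block' steps, applied when |C_u(k)| reaches 'frac' of the largest
    magnitude seen in the previous block (or forced at the block's last step).
    critic_u(y1, y2): already-trained derivative critic estimate of C_u(k)."""
    y1, y2 = -2.0, 0.0
    prev_block_max = 1e-6            # small value so control can fire in block 1
    this_block_max = 0.0
    applied_in_block = False
    history = []
    for k in range(steps):
        c_u = critic_u(y1, y2)
        this_block_max = max(this_block_max, abs(c_u))
        last_step_of_block = (k % block == block - 1)
        u = 0.0
        if not applied_in_block and (abs(c_u) >= frac * prev_block_max
                                     or last_step_of_block):
            u = -4.0 * np.tanh(gain * c_u)   # control rule from Section 7
            applied_in_block = True
        y1, y2 = plant_step(y1, y2, u)
        history.append((k, y1, y2, u, cost(y1)))
        if last_step_of_block:
            prev_block_max, this_block_max = this_block_max, 0.0
            applied_in_block = False
    return history
```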

[Figure 1: Illustration of direct control for the example system. Four stacked panels versus time step k (0 to 1000): error in C_u(k), C_u(k), y1(k), and u(k). The cost function is changed at step 100.]

References

[1] L. A. Feldkamp, G. V. Puskorius, and D. V. Prokhorov, "Unified Formulation for Training Recurrent Networks with Derivative Adaptive Critics," in Proceedings of the 1997 International Conference on Neural Networks, Houston, TX, 1997, pp. IV-2268-2272.

[2] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Belmont, MA: Athena Scientific, 1996.

[3] P. J. Werbos, "Approximate dynamic programming for real-time control and neural modeling," in Handbook of Intelligent Control: Neural, Fuzzy and Adaptive Approaches, D. A. White and D. A. Sofge, Eds., pp. 493-525, 1992.

[4] P. J. Werbos, "Backpropagation through time: What it does and how to do it," Proceedings of the IEEE, vol. 78, no. 10, pp. 1550-1560, 1990.

[5] D. V. Prokhorov and L. A. Feldkamp, "Primitive Adaptive Critics," in Proceedings of the 1997 IEEE International Conference on Neural Networks, Houston, TX, 1997, pp. IV-2263-2267.

[6] G. V. Puskorius and L. A. Feldkamp, "Neurocontrol of Nonlinear Dynamical Systems with Kalman Filter-Trained Recurrent Networks," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 279-297, 1994.

[7] A. G. Barto, R. S. Sutton, and C. W. Anderson, "Neuronlike Adaptive Elements that Can Solve Difficult Learning Control Problems," IEEE Transactions on Systems, Man, and Cybernetics, vol. 13, pp. 834-846, 1983.

[8] E. A. Wan, "Temporal Backpropagation for FIR Neural Networks," in Proceedings of the 1990 International Joint Conference on Neural Networks, San Diego, CA, 1990, pp. I-575-580.
