Adaptive Critic Based Adaptation of a Fuzzy Policy Manager for a Logistic System

Stephen Shervais, Eastern Washington University, Cheney, WA 99004, USA
Thaddeus T. Shannon¹, Portland State University, Portland, OR 97207, USA

Abstract -- We show that a reinforcement learning method, adaptive critic based approximate dynamic programming, can be used to create fuzzy policy managers for adaptive control of a logistic system. Two different architectures are used for the policy manager: a feedforward neural network and a fuzzy rule base. For both architectures, policy managers are trained that outperform LP- and GA-derived fixed policies in stochastic and non-stationary demand environments. In all cases the fuzzy system initialized with expert information outperforms the neural network.

Index terms -- applications, neural networks, reinforcement learning, genetic algorithms, qualitative reasoning, rule learning
1. Introduction

This paper demonstrates the application of adaptive-critic-based approximate dynamic programming techniques to the training of a fuzzy-logic policy manager for a logistic system. The task of the policy manager is to adjust inventory control and transportation policies for a physical distribution system. Adaptive critic control is a kind of reinforcement learning, wherein the structure of a controller is periodically modified, based on estimated improvements in the long-term cost. Figure 1 shows a high-level view of the process. We start with an existing policy manager (A) and an existing set of control policies for the distribution system (B). In the current study these policies were found via global search using a genetic algorithm (GA). Based on system state information, the policies are periodically modified by the policy manager with the goal of minimizing overall cost. We then implement the dual heuristic programming (DHP) training method to improve the policy manager and adapt it to a changing environment. DHP uses observations of the system together with information from a model of the coupled policy manager-distribution system to train a critic function (C). This critic function produces estimates of the partial derivatives of the long-term cost associated with the policy manager's operation. The training module uses a neural net model (D) of the distribution system to predict its response to policy changes. This information is then used to adjust the structure of the policy manager, reducing the long-term cost. As the state and control variables in our problem context are discrete valued, numerous approximations are used to implement the differentiable functions needed by DHP. Past work in this area [10] has concentrated on the use of a neural-network-based critic to tune the weights in a neural-network-based policy manager. This paper uses a neural-network-based critic to tune linguistic variable centroids in a fuzzy-logic-based policy manager.

¹ This work was supported in part by the National Science Foundation under grant ECS-9904378.
Figure 1. Adaptive Critic Reinforcement Learning. (Block diagram: Distribution System, Training Module, Policy Manager (A), Critic, and Plant Model; the labels (A)-(D) are referenced in the text.)

2. The Problem

The practical task addressed by this paper is that of adjusting the control policies for the physical distribution system shown in Figure 2 so that it can continue to operate effectively in a stochastic, non-stationary environment. Such distribution systems feature warehouses/depots where inventory can be cheaply stored and which feed higher-cost retail outlets that satisfy final customer demand. Transportation resources are multimodal, but may be limited or difficult to change. Final demand may fluctuate, and average demand can change over time. Most prior studies in this application area have concentrated on inventory (economic quantity) [5][6], or on transportation (solid transportation problem) [1][4] allocations.
This work addresses both: selection of an optimal set of policies for a multi-product, multi-echelon, multimodal physical distribution system, in a nonstationary environment. The problem is highly multidimensional, even with a small system. Both state and control variables may be discrete valued and demand is often characterized by a random variable. The cost surface in policy space for such systems tends to be quite discontinuous, with low penalty and high penalty regions separated by no more than a single transport unit.

Figure 2. The Physical Distribution Problem. Inventory is held at S1 and distributed to inventories at D0 and D1 using limited transport resources. (Goal: minimize total cost; controls: inventory policies and transportation resources.)

The underlying problem is usually presented as a cost minimization problem. The function to be minimized is total cost C_{Tot}, which consists of the initial and final costs, plus the incremental costs, summed over a planning time horizon T:

C_{Tot} = C_{Init} + C_{Incr} + C_{Final},    (1)

C_{Incr} = \sum_{t=0}^{T} (C_H(t) + C_P(t) + C_T(t) + C_X(t)),    (2)

where C_H is holding cost, C_P is purchase cost, C_T transport cost, and C_X stockout penalties, and

C_H(t) = \sum_{n=1}^{N} \sum_{k=0}^{K} C_H(t, n, k),    (3)

summed over N nodes and K stocks. If Q_H(t, n, k) ≥ 0,

C_H(t, n, k) = P_H(n, k) Q_H(t, n, k),    (4)

where P_H is the holding price per unit quantity on hand Q_H;

C_P(t) = \sum_{n=1}^{N} \sum_{k=0}^{K} C_P(t, n, k),    (5)

with

C_P(t, n, k) = P_P(k) Q_P(t, n, k)    (6)

if Q_P(t, n, k) ≥ 0, where P_P(k) is the purchase price per unit quantity purchased Q_P;

C_T(t) = C_{TF}(t) + C_{TO}(t) + C_{TX}(t),    (7)

where C_{TF} is fixed transport cost (the cost of owning the transport resource over the decision period), C_{TO} is operating cost (only for transport units actually employed), and C_{TX} is the penalty for transport shortfalls, and

C_{TF}(t) = \sum_{a=0}^{A} \sum_{m=0}^{M} C_{TF}(t, a, m),    (8)

summed over A arcs and M modes of transport, where

C_{TF}(t, a, m) = P_{TF} T_{Cap}(t, a, m)    (9)

if T_{Cap}(t, a, m) > 0, where P_{TF} is the price per T_{Cap}, the unit capacity of transport hired during the decision period;

C_{TO}(t) = \sum_{a=0}^{A} \sum_{m=0}^{M} C_{TO}(t, a, m),    (10)

where

C_{TO}(t, a, m) = P_{TO} (T_{Cap}(t, a, m) - T_{Cav}(t, a, m))    (11)

if T_{Cap}(t, a, m) > T_{Cav}(t, a, m), where P_{TO} is the cost of operating the units which are on the road (the difference between capacity T_{Cap} and capacity available T_{Cav});

C_{TX}(t) = \sum_{a=0}^{A} \sum_{m=0}^{M} C_{TX}(t, a, m),    (12)

C_{TX}(t, a, m) = P_{TX} (Q_S(t, a, m) - T_{Cav}(t, a, m))    (13)

if Q_S(t, a, m) > T_{Cav}(t, a, m), where Q_S is the quantity provided by a supply node n_s, which is related to Q_R, the quantity requested by a demand node n_d, by

Q_S(t, a, m) = min(Q_R(t, n_d, k), Q_H(t, n_s, k)).    (14)
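For readers who want the cost model in procedural form, the sketch below evaluates the per-period incremental cost of equations (2)-(13). The data layout (dicts keyed by node-stock or arc-mode pairs) and all names are illustrative assumptions rather than the simulator actually used; the stockout penalty C_X of equation (2) is handled analogously and omitted for brevity.

```python
def period_cost(q_on_hand, q_purchased, t_cap, t_cav, q_supplied,
                p_hold, p_purch, p_tf, p_to, p_tx):
    """Per-period incremental cost, following eqs. (3)-(13).

    q_on_hand, q_purchased and p_hold are keyed by (node, stock);
    t_cap, t_cav, q_supplied are keyed by (arc, mode);
    p_purch is keyed by stock; p_tf, p_to, p_tx are scalar prices.
    """
    # Holding cost, eqs. (3)-(4): charged on non-negative quantity on hand.
    c_h = sum(p_hold[n, k] * q for (n, k), q in q_on_hand.items() if q >= 0)

    # Purchase cost, eqs. (5)-(6): price per unit actually purchased.
    c_p = sum(p_purch[k] * q for (n, k), q in q_purchased.items() if q >= 0)

    # Fixed transport cost, eqs. (8)-(9): owning capacity during the period.
    c_tf = sum(p_tf * cap for cap in t_cap.values() if cap > 0)

    # Operating cost, eqs. (10)-(11): only for units actually on the road.
    c_to = sum(p_to * (t_cap[am] - t_cav[am])
               for am in t_cap if t_cap[am] > t_cav[am])

    # Shortfall penalty, eqs. (12)-(13): shipments beyond available capacity.
    c_tx = sum(p_tx * (q_supplied[am] - t_cav[am])
               for am in q_supplied if q_supplied[am] > t_cav[am])

    # Eq. (7) groups the transport terms; eq. (2) sums the periods over the horizon.
    return c_h + c_p + c_tf + c_to + c_tx
```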
For small problems, minimizing the cost function is often done using mixed-integer linear programming (LP) techniques. Our study uses a Genetic Algorithm as a tool for finding an initial policy set that will minimize costs, reserving the use of the LP as a comparison tool. Because the fitness terrain is spiky, the search is difficult, and the GA solution is only quasi-optimal. The evaluation function for the GA is a discrete event simulation. Business constraints necessary to the operation of the simulation are handled by repairing the chromosome as it is being created. Other business rules are enforced by adjustment of the penalties associated with breaking them.

3. Methodology

Small-scale and artificially-constrained examples of our inventory and transportation problem can be solved exactly using Dynamic Programming [3]. Unfortunately, very few supply nodes, stock levels and
transport arcs can be included before the classical approach becomes intractable due to the "curse of dimensionality". Over the last decade, a family of approximate dynamic programming techniques utilizing adaptive critics has been developed that does not suffer from dimensional blow-up. In Dynamic Programming one develops an optimal policy by comparing the costs of all alternative actions at all accessible points in state space through time. This search is made efficient by limiting the options using the principle of optimality: an optimal trajectory has the property that no matter how an intermediate point is reached, the rest of the trajectory must coincide with an optimal trajectory calculated with the intermediate point as the starting point.
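To make the scaling issue concrete, the sketch below shows a classical tabular backward dynamic programming pass of the kind ruled out above. It is purely illustrative: the state/action enumeration, the helper functions step_cost and next_state, and the toy sizes are assumptions, not taken from the paper.

```python
# Illustrative tabular backward DP pass (Bellman, 1957). The cost-to-go table J
# embodies the principle of optimality: the optimal continuation from a state
# does not depend on how that state was reached.

def backward_dp(states, actions, step_cost, next_state, horizon):
    J = {s: 0.0 for s in states}          # terminal cost-to-go
    policy = {}
    for t in reversed(range(horizon)):
        J_new = {}
        for s in states:
            # Compare the cost of every alternative action at this state and time.
            a_star = min(actions, key=lambda a: step_cost(t, s, a) + J[next_state(t, s, a)])
            J_new[s] = step_cost(t, s, a_star) + J[next_state(t, s, a_star)]
            policy[(t, s)] = a_star
        J = J_new
    return J, policy

# The table is the problem: with 20 possible stock levels at only four node-stock
# pairs, the state space already has 20**4 = 160,000 entries per timestep, before
# transport capacities, arcs, and modes are even counted.
```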
Adaptive critic based approximate dynamic programming methods start with the assumption that the optimal policy can be written as a continuously differentiable function of the state variables and some number of policy parameters. A critic function is then constructed that estimates the value of the secondary utility function (cost to go) at any accessible point in state space. Under the assumption that the critic is accurately estimating the long term cost of the policy specified by the control function's parameter values, the gradient of the critic function can be used to adjust the policy parameters so as to arrive at a local optimum in the parameterized policy space. This process has been successfully operationalized using artificial neural networks for both the control and critic functions [2][7][8] and more recently using fuzzy systems as controllers for continuous systems [9].

The method is applied by formulating a "primary" utility function U(t) that embodies a control objective for a particular context in one or more measurable variables. A secondary utility function is then formed

J(t) = \sum_{k=0}^{\infty} \gamma^k U(t + k),    (15)

which embodies the desired control objective through time. This is Bellman's equation, and a useful identity that follows from it is

J(t) = U(t) + \gamma J(t + 1).    (16)

A promising collection of approximation techniques based on estimating the function J(t) using this identity with neural networks as function approximators was proposed by Werbos [11][12]. As the gradient of the estimated J(t) is used to train or tune the control policy, some techniques use critics that estimate the derivatives of J(t) instead of the function value itself. In Dual Heuristic Programming (DHP) the critic's outputs are estimates of the derivatives of J(t).

What approximate dynamic programming techniques offer is a tractable method for local hill climbing on the J(t) landscape of policy parameter space. Initialized at a random point in parameter space, these methods may be trapped by a local optimum at an unsatisfactory control law. One can attempt to avoid this by applying problem specific knowledge to the choice of initial policy manager parameters, in the hope of being near a satisfactorily high hill (or deep valley). In this paper we seek to avoid this problem by starting with a quasi-optimal initial policy that is already in some sense satisfactory.
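For concreteness, the following sketch shows the quantities a DHP critic update needs, obtained by differentiating identity (16) with respect to the state R(t). The array names, shapes, and the separate policy_gradient helper are illustrative assumptions; they describe the general DHP scheme rather than the specific implementation used in this study.

```python
def dhp_critic_target(dU_dR, dU_du, du_dR, dRnext_dR, dRnext_du, lam_next, gamma=0.95):
    """Target for a critic that estimates lambda(t) = dJ(t)/dR(t).

    All arguments are NumPy arrays of the indicated shapes:
    dU_dR     : (n,)   derivatives of the primary utility w.r.t. the state R(t)
    dU_du     : (m,)   derivatives of the primary utility w.r.t. the controls u(t)
    du_dR     : (m, n) Jacobian of the policy manager's output w.r.t. the state
    dRnext_dR : (n, n) plant-model Jacobian dR(t+1)/dR(t) (via backprop through the model)
    dRnext_du : (n, m) plant-model Jacobian dR(t+1)/du(t)
    lam_next  : (n,)   critic's estimate of dJ(t+1)/dR(t+1)
    """
    dRnext_total = dRnext_dR + dRnext_du @ du_dR        # chain rule through the policy
    return dU_dR + du_dR.T @ dU_du + gamma * dRnext_total.T @ lam_next

def policy_gradient(dU_du, dRnext_du, lam_next, gamma=0.95):
    # dJ(t)/du(t): the signal backpropagated into the policy manager's adjustable
    # parameters (NN weights, or fuzzy consequent singletons as in Section 4.3).
    return dU_du + gamma * dRnext_du.T @ lam_next
```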
4. Implementation

4.1 Strategic Utility Function

The strategic utility function U(t) defines the effectiveness of the policy manager. In many control situations this is a simple error function that measures how far the system is from a desired state. The current problem is a bit more complex. The utility function is set equal to the cost equation (1), and the objective is to set policies so as to minimize the total cost of executing those policies. The utility function therefore describes the impact on costs of each aspect of that execution. Complicating the situation further is the fact that policy execution is discontinuous: if an order is created, then purchase costs are incremented; if no order is created, they are not. However, in order to apply the DHP methodology we must have a differentiable J(t), and therefore a differentiable U(t), which requires the approximations described below. Rather than detail all elements of U(t), we will limit this presentation to selected elements of the cost equations presented above, the first example being holding cost, C_H. The curve representing holding cost has a 'knee' at Q_H = 0, so the derivative is not continuous at that point:

∂C_H/∂Q_H = P_H if Q_H > 0, and 0 otherwise,    (17)

so, for the purpose of calculating derivatives only, we use the approximation

C_H = P_H Q_H sig(Q_H)    (18)

(where sig(x) is the logistic sigmoid function sig(x) = 1/(1 + e^{-x})), which has the partial derivative

∂C_H/∂Q_H = P_H Q_H sig(Q_H)(1 - sig(Q_H)).    (19)

Similar approximations are used for purchase cost:
C_P = P_P Q_P sig(UT - (Q_H + Q_O)) sig(RP - (Q_H + Q_O)),    (20)

and transport shortfall penalty:

C_{TX} = P_{TX} (UT - (Q_H + Q_O)) (sig(RP - (Q_H + Q_O)) - T_{Cav}) (sig(UT - (Q_H + Q_O)) - T_{Cav}).    (21)
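As a small illustration of these smoothed surrogates, the sketch below codes the holding-cost approximation (18) and the derivative form (19). The function names are assumptions, and in practice the sigmoid argument would likely need scaling to control how sharply the surrogate turns on around zero.

```python
import math

def sig(x):
    # Logistic sigmoid used to smooth the 'knee' in the piecewise cost curves.
    return 1.0 / (1.0 + math.exp(-x))

def holding_cost_smooth(p_h, q_h):
    # Differentiable surrogate for the holding cost, as in eq. (18).
    return p_h * q_h * sig(q_h)

def d_holding_cost_dq(p_h, q_h):
    # Derivative form used for DHP training, as given in eq. (19).
    return p_h * q_h * sig(q_h) * (1.0 - sig(q_h))
```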
4.2 System Identification

Creation of a function which describes the responses of the plant to control and state inputs is a task called System Identification. In this study we train a simple MLP Neural Net as a model of the plant responses, then use the net as an ordered array of derivatives. Training the Plant Model NN is a straightforward task, provided one has a suitable collection of I/O pairs. There are a number of ways of obtaining such pairs -- all involve providing a set of inputs which span the search space, processing the plant response, and collecting the resultant output. The method used here is to save the output generated by the GA during its search process. If all output is saved, even that generated by less-than-optimal inputs, then the data set obtained not only spans the search space, but also provides a higher proportion of I/O pairs in the vicinity of good solutions. Once the trained plant model NN is in place, it may be used to provide information on the partial derivatives ∂R(t+1)/∂u(t) and ∂R(t+1)/∂R(t) via standard backpropagation.

4.3 Fuzzy Policy Manager Architecture

The fuzzy antecedents were based on five linguistic values – Very Low, Low, OK, High, and Very High. Antecedent variables were stock on hand, transport capacity available, and demand. Fuzzy consequents were also based on five linguistic values: Large Negative, Small Negative, Zero, Small Positive, and Large Positive. For antecedents, the membership functions were overlapping trapezoids with maximum membership values of 1.0, while singleton values were used for the consequents. The membership functions for the antecedents were adapted during the training runs. For example, the centerpoint for linguistic variable OK for Demand was first set at the fixed demand level used by the GA in its policy search. As the training session continued, the centerpoint for OK was adjusted to reflect the average demand over the previous control period (nine days). Centerpoints for related linguistic variables were then adjusted relative to the OK variable. For example, the centerpoint for Very High was set at five times the OK value.

There were three control variables – ReorderPoint, UpTo Point, and Transport Capacity. The value of each was the consequent of two antecedents. Both the Reorder Point and the UpTo point were driven by the stock on hand and the demand, while changes to Capacity were based on Demand and Capacity Available. A typical rule was:
IF StockOnHand is High, and Demand is Low, THEN the ReorderPoint should be changed by a Small Negative amount.

This resulted in twenty-five rules for each control variable. Tuning the fuzzy policy manager was a matter of using DHP to adapt the singleton values of the consequent. Interestingly, the improvements in cost came not because the DHP process improved the fuzzy policy manager, but because the fuzzy policy manager was able to improve the control policies. For example, at the start of one test, the singleton definition of a “large positive” increase in the Order Up To Point for Demand Node 1, Stock 1 was set by hand at 30.00. This represented the application of knowledge about the domain of interest, namely, how demand was likely to force stock usage. At the end of the training and testing process, the singleton value for “large positive” had dropped only slightly, to 26.5, indicating that the original estimate of what constituted a “large positive” increase was a reasonable one. However, over the course of the training, the fuzzy policy manager changed the Order Up To Point to 453.0 from 165.0 by applying a number of “large positive” increases in that value.

One of the advantages of a fuzzy policy manager is that, unlike neural net or other adaptive controllers, the rules can be explained in plain English. In the case of Demand Node 1 Stock 1, the rule for making a large positive increase in the Reorder Point might be explained to a business manager this way: “If the stock on hand is at what we previously considered to be acceptable levels (that is, enough to meet projected demand while a new ground shipment arrives), but the average demand over the last control period (nine days) has been very high (about five times what we expected it to be), or if demand has been merely high (about twice what we expected), but stocks are very low (that is, less than half what we need to supply our customers until an air shipment arrives), then we need to make a large positive increase (bump it up by about thirty) in the Reorder Point for this stock at this node.” The phrases in parentheses reflect the definitions of the centerpoints taken from the program code.
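To make the architecture concrete, here is a minimal sketch of how such a rule base could be evaluated and tuned, assuming product inference over the two antecedents, weighted-average defuzzification of the singleton consequents, and a gradient step driven by the dJ/du signal from DHP. The function names, the learning rate, and the example membership values are illustrative assumptions, not the values used in the experiments.

```python
def trap(x, a, b, c, d):
    # Trapezoidal membership: ramps up on [a, b], flat on [b, c], ramps down on [c, d].
    # mu_stock / mu_demand below would be produced by trap(...) evaluations of the
    # antecedent sets (Very Low .. Very High) around their adapted centerpoints.
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

def infer_delta(singletons, mu_stock, mu_demand):
    # One control variable (e.g. the change to a Reorder Point): product of the two
    # antecedent memberships, weighted average of the 25 consequent singletons.
    num = den = 0.0
    for s_label, mu_s in mu_stock.items():
        for d_label, mu_d in mu_demand.items():
            w = mu_s * mu_d
            num += w * singletons[s_label, d_label]
            den += w
    return num / den if den > 0.0 else 0.0

def dhp_tune_singletons(singletons, mu_stock, mu_demand, dJ_du, lr=0.01):
    # The defuzzified output is linear in each singleton, with coefficient equal to
    # the rule's normalized firing strength, so the long-term-cost derivative dJ/du
    # from the DHP critic distributes across the rules that fired.
    den = sum(ms * md for ms in mu_stock.values() for md in mu_demand.values())
    if den == 0.0:
        return
    for s_label, mu_s in mu_stock.items():
        for d_label, mu_d in mu_demand.items():
            singletons[s_label, d_label] -= lr * dJ_du * (mu_s * mu_d) / den

# Hypothetical fuzzified inputs and a fragment of the 25-entry rule table:
mu_stock = {"High": 0.7, "OK": 0.3}
mu_demand = {"Low": 0.6, "OK": 0.4}
singletons = {("High", "Low"): -5.0, ("High", "OK"): 0.0,
              ("OK", "Low"): 0.0, ("OK", "OK"): 2.0}
delta_reorder_point = infer_delta(singletons, mu_stock, mu_demand)
```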
5. Results

The experiments described below tested the adaptive policy managers against the fixed policies found by the LP and the GA. The LP and GA policies were developed using a fixed demand schedule based on the expected value of the demand for a specific stock at a specific node. Demand node 1, for example, required 2.0 units of stock 0 every timestep for 90 days. The fuzzy and neural policy managers require persistent excitation to train well. They were both trained using Poisson-distributed demand with a stationary mean equal to the value used by the fixed policies.

The cost structure of the simulation was designed to maintain pressure on both the GA evolutionary search and the learning process, by penalizing severely any failure to maintain stocks or transport resources at a level appropriate to the scenario. Costs reported, therefore, are not directly comparable to normal business operating costs.

All four policies were then tested using three new demand schedules that ran for 360 simulation days. All demand schedules had Poisson-distributed noise laid on top of the underlying trend. The results are summarized in Figure 3. Note that the comparison to be made is really between fixed policies (however arrived at) and adaptive policies. The results can be summarized as follows: the NN policy manager was able to improve on all fixed policies, and the fuzzy policy manager was able to improve on the NN, in all scenarios.

Figure 3. Summary of Cost comparisons of fixed (LP and GA) policies with (NN and Fuzzy) policy managers for three different demand schedules: stationary (Baseline), Increasing Average Demand (IAD), and Delta Demand (DD). All demand schedules exhibit Poisson-distributed noise.

5.1 Baseline

This set of experiments (Figure 4) was designed to establish a reference behavior for the system. It used a fixed demand schedule, identical to the training set, but with added, Poisson-distributed noise.

Figure 4. Cost of fixed policies (light dashed and dotted lines) versus policies adjusted by the NN (heavy dashed line) and fuzzy (solid line) policy managers, baseline demand.

5.2 Increasing Average Demand

The second series of tests (Figure 5) compared the effectiveness of the various approaches when operating in a continuously changing environment. The perturbation was a fixed increase in demand at every timestep throughout the period, sufficient to increase demand by 20% over the 360-day test period.

5.3 Delta Demand

The final test series (Figure 6) compared the effectiveness of the various policies when operating in a stationary environment different from the one they were trained on. Specifically, demand was stationary but 10% higher than the training examples.

6. Conclusions
We have demonstrated the effectiveness of a reinforcement learning approach to designing a fuzzy policy manager for the control of physical inventory systems in both stationary and non-stationary demand conditions. Improvements over a fixed, quasi-optimal policy found using a genetic algorithm average 66% under both stationary and non-stationary demand conditions in a high-penalty environment. While both the neural network based and fuzzy policy managers easily beat the fixed policies, the fuzzy manager was significantly superior to the neural network manager.
We hypothesize that this is due in part to the inclusion of domain-specific expert information in the initialization of the rule base prior to learning, and perhaps due in part to the regularization effect of having fewer adjustable parameters in the rule base. An additional advantage of the fuzzy approach is the natural transparency of its decision-making process. Each of its rules can easily be interpreted in the business context, its validity open to common-sense verification.

Figure 5. Costs of fixed policies (LP and GA) and policies adjusted by the NN and Fuzzy policy managers, increasing average demand.
Figure 6. Cost of fixed (LP and GA) policies and policies adjusted by the NN and Fuzzy policy managers, baseline demand plus 10% step increase.
7. References

[1] Aneja, Y. and K. Nair (1979). "Bicriteria Transportation Problem." Management Science 25: 73-78.
[2] Barto, A., Sutton, R. and Anderson, C. (1983). "Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems." IEEE Transactions on Systems, Man, and Cybernetics SMC-13(5): 834-846.
[3] Bellman, R. (1957). Dynamic Programming. Princeton: Princeton University Press.
[4] Bit, A., M. Biswal, et al. (1993). "Fuzzy programming approach to multiobjective solid transportation problem." Fuzzy Sets and Systems 57: 183-194.
[5] Gullu, R. and N. Erkip (1996). "Optimal allocation policies in a two-echelon inventory problem with fixed shipment costs." International Journal of Production Research 46-47: 311-321.
[6] Jonsson, H., E. Silver, et al. (1986). "Overview of a stock allocation model for a two-echelon push system having identical units at the lower echelon." In Multi-Stage Production Planning and Inventory Control. New York: Springer-Verlag.
[7] Prokhorov, D. (1997). Adaptive Critic Designs and their Application. Ph.D. Dissertation, Department of Electrical Engineering, Texas Tech University.
[8] Prokhorov, D. and D. Wunsch (1997). "Adaptive Critic Designs." IEEE Transactions on Neural Networks 8(5): 997-1007.
[9] Shannon, T.T. and G.G. Lendaris (2000). "Adaptive Critic Based Approximate Dynamic Programming for Tuning Fuzzy Controllers." In Proceedings of IEEE-FUZZ 2000. IEEE.
[10] Shervais, S. (2000). "Developing Improved Inventory and Transportation Policies for Distribution Systems Using Genetic Algorithm and Neural Network Methods." Proceedings of the World Conference on the Systems Sciences, Toronto, Canada, 16-22 July 2000, pp. 200059-1 to 200059-17.
[11] Werbos, P. (1995). "Optimization Methods for Brain-like Intelligent Control." Proceedings of the 34th Conference on Decision and Control. IEEE Press, pp. 579-584.
[12] Werbos, P. (1990). "Neurocontrol and related techniques." In Maren, A., Harston, C., and Pap, R. (eds.), Handbook of Neural Computing Applications. Academic Press, New York, pp. 345-380.