MODELING AND CONTROL OF DISCRETE EVENT DYNAMIC SYSTEMS: A SIMULATOR-BASED REINFORCEMENT-LEARNING PARADIGM

PAOLO DADONE, HUGH VANLANDINGHAM
The Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, 340 Whittemore Hall, Blacksburg, Virginia 24061-0111, U.S.A.

BRUNO MAIONE
Dipartimento di Elettrotecnica ed Elettronica, Politecnico di Bari, Via E. Orabona 4, Bari, 70125, Italy

A general reinforcement-learning approach for controlling discrete event systems is presented. A machine-repair example is formulated: (1) to describe and explain the DEVS formulation, and (2) to illustrate the general control method. Modified gradient learning methods and evolutionary programming methods are compared for the purpose of optimizing the controller. An on-line adaptation method is presented, and fuzzy logic and artificial neural networks are compared as adaptation mechanisms. Evolutionary programming proves to be the most robust type of optimization for the controller. Moreover, the fuzzy and neural adaptation approaches are successful in improving the performance of the static controller under dynamic operating conditions.

Keywords: Intelligent control; DEDS; DEVS; Reinforcement learning; Evolutionary programming; Fuzzy logic; Artificial neural networks.

1. Introduction

Classical dynamic system models, i.e., differential equation based models, were developed in large part to describe natural physical systems. In contrast, the widespread deployment of man-made systems, such as manufacturing systems of various types, has given rise to more general models called discrete event dynamic system (DEDS) models. DEDS can be used to model the "event-driven" behavior common to man-made systems as well as subsystems that are "time-driven" natural processes, such as temperature decay and the dynamics and kinematics of mass motion, making DEDS a more powerful paradigm.1 In response to industrial demand for DEDS models, several simulation packages have been developed, and their use has greatly helped the understanding of these systems and the development of control policies for them. A discrete event system specification, called the DEVS formulation, was instrumental in the development of such software. DEVS is a system theoretic method for modeling DEDS which provides a descriptive framework for such systems.2,3,4,5

Both the complexity and the stochastic nature of discrete event systems, along with the non-homogeneity of relations, contribute greatly to their analytical intractability. It is for this reason that we assume the availability of DEDS simulation software as a means of representing the plant, i.e., the system to be controlled. Suitable control policies for DEDS are generally difficult to derive using classical control techniques. The rise of "intelligent" methods, such as fuzzy logic system (FLS) and artificial neural network (ANN) control, has helped to some extent; however, a truly optimizing controller for DEDS continues to be a challenging area of research.

The mathematical foundation of optimal control for general systems, including DEDS, has been available for several decades in the form of Bellman's dynamic programming (DP) algorithm.6,7 In essence, DP seeks to find a control policy that minimizes a total average cost functional which, in turn, can be partitioned into the cost of the current control action and the remaining "cost-to-goal". The basic drawback of DP is that for significant problems the search space expands exponentially. As an alternative to the exact calculation of the optimal control policy, one might calculate a limited lookahead policy with the cost-to-goal generated by an approximating function. If, for example, an artificial neural network is used as the approximating function, it can be included as part of a performance evaluation for selecting the proper control action. In addition, its parameters can be updated on-line as information about the system is collected. Figure 1 illustrates the generic control system structure.

The difficulty in using the reinforcement-learning concept is in finding the approximating function for the cost-to-goal. To cope with this problem, it will be assumed that, even though no analytical model of the plant is available, a plant simulator is accessible. (This assumption is reasonable given the wide availability of specialized simulation programs for different types of systems.) The approximation problem is the key to using the DP approach; however, here the problem can take several different paths. First of all, there are many approximating paradigms, including ANNs and FLSs. Secondly, there are many approaches to approximating the complex system. For example, if the state-input (configuration) space can be partitioned

appropriately, then some type of modular approach may be taken, where "simple" modules, used to approximate the function in different regions, can be combined to form the overall approximation.8 Up to this point, this approach could be called heuristic dynamic programming (HDP).7 The option of improving the cost-to-goal approximation on-line involves the "reinforcement-learning" (RL) concept. Referring to Fig. 1, this function is implicitly performed in the block entitled "Performance Criterion" (PC).

Figure 1. Reinforcement Learning Control (reference, disturbance, adaptive controller, plant, response and performance criterion blocks).

A direct approach can be used to determine improvements in the control action. With this method the configuration space is "explored" by modifying the control actions and observing performance changes. This type of trial-and-error learning has been called reinforcement learning because of its similarity to the animal behavior modification used by psychologists. Assuming that the controller is a parameterized structure, such as a neural network, the exploration of the parameter space can be performed either on-line with the actual system, or off-line with the use of a simulator. The on-line mode of adaptation amounts to a judicious perturbation of the controller parameters while noting the change in the PC. Given enough time to "gently" explore the space, the PC can be modified to include some "hard limit" terms that will prevent instability. One might think of this more general PC as a heuristic for the cost-to-goal in the DP interpretation of determining the optimal control. The off-line mode can be used effectively if the plant model is accurate. In this case many "trials" can be run as a practice exercise for applying the next control input. This off-line search is one way in which the judicious choice of perturbations can be made.

In this paper a simple two-parameter control policy is formulated for the example of a shop running 50 machines which fail probabilistically. It is the authors' opinion that a specific example is beneficial for understanding both the DEVS modeling process and the details of the adaptive control methods. The control algorithm is based on a performance index that is a measure of the shop profit.

The DEVS formulation for the machine-repair problem is presented in Section 2. A preliminary discussion of the characteristics of the problem is given in Section 3. The reinforcement learning paradigm, which uses gradient ascent methods and evolutionary programming for the design phase, and fuzzy and neural adaptation for the on-line operation, is presented in Section 4 and then applied in Section 5, where simulation results are given. Concluding comments are given in Section 6.


2. DEVS Formulation of the Machine Repair Plant

2.1. Informal description of the plant

It will be assumed that there are N machines working in a plant which are subject to individual failure according to random failure times. When a failure occurs, a machine can change state from correct functioning (state 1) to "minor-repair-needed" (state 2), up to a complete malfunction (the "non-operable" state F). This change of state occurs at failure times according to some state transition probabilities, as shown in Fig. 2. At a failure time a machine in state i will either stay in the same functioning state with probability t_ii or transition to a worse functioning state j (j > i) with probability t_ij.

Figure 2. Machine Repair Example (functioning states 1, 2, 3, ..., F with transition probabilities t_ij, efficiencies η(i), repair costs c(i) and the repair queue).

When a machine is in state i, it produces a profit of η(i)·P $/time unit, where η(i) is the efficiency of the machine in state i (0 ≤ η(i) ≤ 1) and P is the profit due to a correctly functioning machine (therefore, η(1) = 1). A machine's efficiency reflects the fact that when it is not working properly it consumes more fuel, produces pieces of lower quality and even takes more time, all of which impacts the system profit. When a machine is in a state i ≠ 1 (i.e., a failure has occurred), it is eligible to be sent to repair. If so, the machine returns to state 1 (Fig. 2) at a cost of c(i) per unit time, having kept one of the M repair teams busy for a random repair time. If all M repair teams are busy, the machine will wait in a FIFO (first in, first out) queue. Sending a machine to repair also has another, indirect, cost: since the machine is down, the plant is not profiting from that machine. When a machine is repaired, it goes back to state 1 and starts to work, as explained before. In this model the control action is triggered by the failure of a machine and, based on the actual value of the output of the system (the instantaneous profit that the plant is making), decides whether or not to send some machines to repair.

2.2. DEVS formulation of the plant

Our system is a discrete event system; indeed, state transitions are due to the occurrence of three types of events:
• e1: a possible failure of a machine;
• e2: the end of repair of a machine;
• e3: sending n machines to repair.
The first two events are internal events. The last one is an external event and corresponds to the actual input to the system. Using a system theoretic (DEVS) formalism the system can be described as S = < X, Σ, Y, δ, λ, ta >. In the following there is a description of each of the six components of the system, S.

The input set (X): The input corresponds to the last event (e3: send n machines to repair); thus it is completely specified by the integer number n. It can be any integer between 0 (send no machine to repair, i.e., the non-event) and N (send all the machines to repair). Therefore:

X = {0, 1, 2, ..., N}    (2.1)

The sequential state set (Σ): The sequential state is comprised of four "macro" logic components. The first is the SERVICE LIST, i.e., the list of machines that are actually working. This is a FIFO list but, as we will see, its order is not important. Every record in this list has three components: (1) the identification number (ID) of the machine, (2) the service time left (σ) for the machine and (3) the current functioning state (s) of the machine; therefore, the ith record can be written as (ID_i^s, σ_i^s, s_i^s). The list of working machines has I records, where I ∈ N and I ≤ N. The second component is the REPAIR QUEUE, the queue of machines waiting for a free repair team. This is a FIFO queue. Every record in this queue has two components: (1) the ID of the machine and (2) its functioning state. Therefore, the jth record in the queue will be (ID_j^q, s_j^q). This queue has J records, where J ∈ N and J ≤ N. The third component is the ON-REPAIR LIST, the list of machines that are currently under repair. This is a FIFO list but, in this case, the order is not important. Every record in this list has three components: (1) the ID of the machine, (2) the repair time left for the machine and (3) the functioning state the machine was in before starting the repair; therefore, the kth record will be denoted (ID_k^r, σ_k^r, s_k^r). This list has K records, where K ∈ N and K ≤ M (M is the number of repair teams). The fourth and last component of the state is the number of free repair teams (FRT). This can be any integer between 0 (all the repair teams are busy) and M (all the repair teams are idle). The machine IDs can be any integer between 1 and N, and the σs are positive (real) numbers. The functioning state of a machine can be any integer between 1 (correct functioning) and F (complete malfunction), where F is the number of functioning states of the machines. The state s will look like:

s = { ( (ID_1^s, σ_1^s, s_1^s), (ID_2^s, σ_2^s, s_2^s), ..., (ID_I^s, σ_I^s, s_I^s) ),
      ( (ID_1^q, s_1^q), (ID_2^q, s_2^q), ..., (ID_J^q, s_J^q) ),
      ( (ID_1^r, σ_1^r, s_1^r), (ID_2^r, σ_2^r, s_2^r), ..., (ID_K^r, σ_K^r, s_K^r) ),
      FRT }

In the following, N_T = {1, 2, ..., T} ∀ T ∈ N, N_{T,0} = N_T ∪ {0}, and A* is the set of all the possible lists made of elements of A. If we define L = N_N × R+ × N_F, then the SEQUENTIAL STATE SET is:

Σ = L* × {N_N × N_F}* × L* × N_{M,0}    (2.2)
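The set notation above can also be read as a data structure. The following minimal Python sketch (type and field names are ours, not the paper's) shows one possible representation of an element of Σ:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SequentialState:
    """One element of the sequential state set Sigma (illustrative names only)."""
    # SERVICE LIST: (machine ID, service time left, functioning state), FIFO
    service: List[Tuple[int, float, int]] = field(default_factory=list)
    # REPAIR QUEUE: (machine ID, functioning state), FIFO
    repair_queue: List[Tuple[int, int]] = field(default_factory=list)
    # ON-REPAIR LIST: (machine ID, repair time left, state before repair), FIFO
    on_repair: List[Tuple[int, float, int]] = field(default_factory=list)
    # number of free repair teams, between 0 and M
    free_repair_teams: int = 0
```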

The output set (Y): In this system the output will be a number indicating the instantaneous profit the plant is making; therefore, it can be any real number, with negative numbers indicating loss.

Y = R    (2.3)

The output function (λ): As explained in the informal model, the output (i.e., the plant profit) depends on the functioning state of all the working machines, less the cost incurred for the machines being repaired. Therefore, we need to define two functions that will be part of the output function. The first is an efficiency function, η : {1, 2, ..., F} → [0, 1], where η(i) is the efficiency of a machine in state i. The second is a cost-to-repair function, c : {2, ..., F} → R+, where c(i) is the cost (per time unit) to repair a machine that is in state i (c is defined only for i > 1 since there is no sense in repairing a correctly functioning machine, state 1). Thus, we can define the output function as:

λ : Σ → Y
λ(s) = Σ_{i=1}^{I} η(s_i^s) · P − Σ_{k=1}^{K} c(s_k^r)    (2.4)
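As a minimal sketch of Eq. (2.4) in code, assuming the SequentialState structure sketched earlier and hypothetical efficiency/cost tables (all names are ours):

```python
def output(state, eta, cost, P):
    """Instantaneous profit lambda(s) as in Eq. (2.4).

    eta  : dict mapping functioning state i -> efficiency eta(i) in [0, 1]
    cost : dict mapping functioning state i (i > 1) -> repair cost c(i) per time unit
    P    : profit per time unit of a correctly functioning machine
    """
    earned = sum(eta[s] * P for (_mid, _sigma, s) in state.service)   # working machines
    spent = sum(cost[s] for (_mid, _sigma, s) in state.on_repair)     # machines under repair
    return earned - spent
```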

The time advance function (ta): The time advance function gives us the next hatching time, that is, the next internal event occurrence time in the absence of external events. If we call σ_min^s the minimum of σ_i^s over all i, and σ_min^r the minimum of σ_k^r over all k, then the time advance function is given by:

ta : Σ → R+
ta(s) = min{σ_min^s, σ_min^r}   if σ_min^s ≠ σ_min^r
ta(s) = σ_min^r                 otherwise    (2.5)

If ta(s) = σ_min^s, the next event is a possible machine failure; if ta(s) = σ_min^r, the next event is an end-of-repair event. The second line of the definition of the time advance function in Eq. (2.5) embodies the following implicit tie-breaking rule: if a possible failure event and an end-of-repair event occur together, precedence is given to the end-of-repair. Another possible situation is that σ_min^s corresponds to two elements in the service list; in this case precedence is given to the first record. The same holds for the on-repair list. This is the only case where the FIFO nature of the lists affects our model.

The transition-specifying function (δ): Characterizing the transition-specifying function is the most difficult job in a DEVS model. The main reason is that in most cases there is no closed form for this function, in which case it must be defined with tables, algorithms, etc. For present purposes we will define this function with a sort of "algorithm" for every possible state transition. This function is divided into two parts, the internal transition function (δΦ) and the external transition function (δex). The former specifies the behavior of the system after an internal event occurs, while the latter specifies the behavior of the system after an external event. The internal transition function is:

δΦ : Σ → Σ,   ∋ s+ = δΦ(s−)    (2.6)

An internal transition can take place if event e1 or e2 happens. If the event is e1 (possible failure of a machine), this means that ∃ m ∈ N_I such that σ_m^s = ta(s) = σ_min^s. In this case the actions to take are:
(1) Determine the new functioning state of machine ID_m (s_m^{s+}) according to a given state transition matrix T, i.e., t_ij is the probability of a transition from state i to state j. More precisely, the new functioning state for machine ID_m, presently in state s_m^{s−} = i, will be sampled from the following discrete probability function: {(1, t_i1), (2, t_i2), ..., (F, t_iF)}. (In our case the matrix T, whose rows sum to one since each of them represents a discrete probability function, is upper triangular, since every machine can only transition to a worse functioning state.)
(2) Sample a new service time from the corresponding distribution.
(3) If s_m^{s+} ≠ s_m^{s−} (a failure occurred), send a signal to the controller.
If the event that takes place is e2 (end-of-repair of a machine), this means that ∃ m ∈ N_K such that σ_m^r = ta(s) = σ_min^r. In this case the actions to take are:
(1) Remove the mth record from the on-repair list (i.e., the machine just finished being repaired) and add it at the end of the service list. (From now on, for the sake of simplicity, we assume the basic operations of removing from and adding to a list are understood.)
(2) Sample a service time from the appropriate service time distribution.
(3) Set the functioning state of the newly added record to correct functioning (s = 1).
(4) If the repair queue is empty, FRT = FRT + 1; else remove the first record from the repair queue, add it at the end of the on-repair list and sample a repair time using the appropriate repair time distribution.
This completely specifies the internal transition function. The external transition function is:

δex : Q × Ω → Σ,   ∋ s+ = δex(s−, e, ω)    (2.7)

where Q = {(s, e) | s ∈ Σ ∧ 0 ≤ e ≤ ta(s)} is the state set, and Ω is the input segment set, i.e., a subset of the set of all the input segments. If, at e time units since the last event, there is an external event (e3) of value n (i.e., send n machines to be repaired), the actions to take are:
(1) Evaluate the number of machines in the service list having s_i^{s−} ≠ 1, i ∈ {1, 2, ..., I}; if there are b such machines, let ν = min{n, b}.
(2) Take the ν "worst" machines in the service list and send them to repair. This means that we must remove the entries corresponding to these machines from the service list and add them at the end of the on-repair list or of the repair queue, depending on the availability of repair teams. More precisely, for each of the ν machines on the service list we have to: (a) remove the corresponding record (call it the pth record) from the service list; (b) if FRT ≠ 0, add the pth record to the on-repair list, sample a repair time from the appropriate distribution and decrement FRT; else add the pth record at the end of the repair queue.
The state transition specifying function is now completely defined, completing the DEVS formulation.

2.3. Controller

The controller is a relatively simple one. Namely, when there is a failure of a machine in the system, a control action is required. The controller takes as input the output of the system and determines whether or not to send R machines to repair (the control action), applying the following control rule: if the input is less than w·P·N, then send R machines to repair. This controller is defined by the two parameters w and R. The w parameter can be any real number in [0, 1] and the R parameter can be any non-negative real number. If w = 0 the "threshold" for the input is zero, i.e., the controller will never send a machine to repair. The opposite extreme corresponds to a value of w = 1, which means that the threshold is P·N, i.e., the maximum profit that the plant can realize with all the machines working at the same time in a correct functioning state; in this case, at every failure R machines will be sent to repair. A value of R = 0 means that we never send a machine to be repaired. A value of R that is not an integer has a special meaning: we send to repair as many machines as the integer part of R (IR), plus one more machine with a probability given by the decimal part of R (DR). Therefore, we send IR machines to repair and decide whether to send another machine or not according to the "probability" DR. Recalling what was written for the δex function, in our case the response of the controller is immediate; therefore e = 0. This means that immediately after a failure we have a send-to-repair action.
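A minimal sketch of this two-parameter rule in code (function and variable names are ours, not the paper's); the fractional part of R is treated as the probability of sending one extra machine:

```python
import random

def control_action(profit, w, R, P, N, rng=random):
    """Number of machines to send to repair when a machine failure is signalled.

    Rule: if the instantaneous profit is below the threshold w*P*N, send
    floor(R) machines plus one more with probability frac(R); otherwise send none.
    """
    if profit >= w * P * N:
        return 0
    n = int(R)                    # integer part IR
    if rng.random() < R - n:      # decimal part DR used as a probability
        n += 1
    return n
```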

3. Problem Formulation

In our simulations the following parameters were chosen:
• N = 50 machines, to establish a reasonably sized plant;
• P = 10 $/time unit;
• M = 10 repair teams, in order not to have too much of a constraint on R;
• F = 5 functioning states for each machine;
• η(1) = 1, η(2) = 0.85, η(3) = 0.6, η(4) = 0.25, η(5) = 0 (efficiency function);
• c(2) = 30 $, c(3) = 45 $, c(4) = 60 $, c(5) = 90 $ (cost-to-repair function);
• service times are exponentially distributed with a mean time between failures (MTBF) of 5 time units;
• repair times are exponentially distributed with a mean time to repair (MTTR) of 1 time unit;
• functioning state transition probability matrix (T):

0.05   0  T= 0   0   0 

0.005  0.05 0.8 0 .1 0.05   0 0.05 0.8 0.15   0 0 0.05 0.95   0 0 0 1  The estimated expected profit per unit time is used as a performance index (PI) for the control actions. The estimate is computed by averaging (over time) on one simulation 1000-time units long. Transients are excluded from the averaging process by deleting the observations corresponding to the first 100-time units. Through an extensive number of simulations the behavior of the PI as a function of w and R has been determined and then plotted in Fig. 3. The following observations can be made: (1) The response surface is “corrugated” due to the stochastic nature of the system; even increasing the length of the simulation or using several replications does not show a significant improvement. (2) There is a strong discontinuity at R = 1 for every value of w∈[0,1]. This means that if the repair is a probabilistic action (instead of deterministic as for R ≥ 1), the PI is strongly affected; that is, the system really needs the repair actions. (3) There is a global maximum of approximately 200 for R=1 and w between 0.6 and 0.8. (4) There are an infinite number of local minima for w=1 and every R. These local minima also have the same value (PI=119.6). The presence of these local minima is easily explained, indeed w=1 means that the threshold for the repair action is the maximum profit we can ever make with the plant, that is, we will send a machine to repair every time there is a failure. This means that as soon as a machine breaks, we send it to repair. Therefore, there will always be no more than one machine broken at a time and this creates the insensitivity of the minima to the R parameter. The preceding considerations show that, although a relatively simple problem, determination of the optimal (or even a "good" sub-optimal) controller for this system is not simple. 0.8

0 .1

0.045
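The PI estimate described above (a time average over one run, with the first 100 time units discarded) could be computed as in the following minimal sketch; the piecewise-constant trace format and all names are our own assumptions:

```python
def estimate_pi(profit_trace, t_end=1000.0, t_warmup=100.0):
    """Estimate the PI as the time-average of the instantaneous profit.

    profit_trace : list of (event_time, profit) pairs, sorted by time; the
                   profit is assumed piecewise constant between events.
    """
    total = 0.0
    for (t0, p), (t1, _) in zip(profit_trace, profit_trace[1:] + [(t_end, None)]):
        lo, hi = max(t0, t_warmup), min(t1, t_end)
        if hi > lo:
            total += p * (hi - lo)     # integrate the profit over the retained interval
    return total / (t_end - t_warmup)
```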

Figure 3. Response Surface (estimated PI as a function of w and R).

4. The Reinforcement Learning Paradigm

The reinforcement learning paradigm consists of an off-line (design) phase and an on-line (adaptation) phase. The design phase consists in finding, by means of simulation, values for the parameters of a given parametric controller that optimize some performance index. Therefore, this phase can be regarded as a simulation response optimization problem, where several approaches can be taken.9,10 Even though the field of simulation optimization is quite mature, there is no general "good" algorithm for simulation optimization, but there are several techniques which give good results.9 Therefore, there is no fixed method for the controller design, but a set of methods to choose from by trial and error. This makes the design phase, even though apparently simple, a task that is far from simple. In the following we will use path search methods, that is, we will try to find the optimal parameters for the controller by employing a path search in the direction given by an estimate of the gradient of the PI with respect to the controller parameters. An estimate of this

gradient will be determined using finite differences or response surface methodologies and some modifications thereof. Finally, we will also employ a random search algorithm, namely evolutionary algorithms, which is the most computationally expensive, but also the most robust and reliable. The adaptation phase (on-line reinforcement learning) is a very complicated task. In the following we will approach the problem with an "exploration" of the optimal solutions. This approach was developed and applied by the authors for the on-line adaptation of an inventory system policy and proves to be very promising.

4.1. Off-line (design) phase

The off-line reinforcement learning is a simulation optimization problem that we will first approach with path search methods. If x_k is the kth iteration point of a path-search algorithm in a p-dimensional search space, the next point will be given by:

x_{k+1} = x_k + η d_k    (4.1)

where η is called the learning coefficient and d_k is the search direction at iteration k, given by:

d_k = ∇_x PI(x_k) + α d_{k−1}    (4.2)

where α is called the momentum coefficient and ∇_x PI(x_k) is the (estimated) gradient of the PI with respect to x evaluated at x_k. The terminology used above is common in the field of neural networks, while in mathematical programming terms (4.1) and (4.2) amount to a conjugate gradient method with step length η and deflection parameter α8,11 (even though the deflection parameter will be fixed arbitrarily and not determined in the usual conjugate gradient ways). The algorithm generates new points in the search space and finally stops when a stopping condition (such as a small relative change in the PI, a small gradient, etc.) is met.
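A minimal sketch of the resulting search loop is given below (all names are ours; the pi callable stands in for one simulation run of the plant returning the estimated PI for a given parameter vector). It combines the update (4.1)-(4.2) with the forward-finite-difference gradient estimate discussed next:

```python
import numpy as np

def ffd_gradient(pi, x, deltas):
    """Forward finite-difference estimate of the gradient of PI at x."""
    base = pi(x)
    grad = np.zeros(len(x))
    for i, delta in enumerate(deltas):
        x_pert = x.copy()
        x_pert[i] += delta
        grad[i] = (pi(x_pert) - base) / delta   # one extra simulation per parameter
    return grad

def path_search(pi, x0, deltas, eta=0.5, alpha=0.05, max_iter=50, tol=1e-3):
    """Gradient-ascent path search with momentum, Eqs. (4.1)-(4.2)."""
    x = np.asarray(x0, dtype=float)
    d = np.zeros_like(x)
    prev = pi(x)
    for _ in range(max_iter):
        d = ffd_gradient(pi, x, deltas) + alpha * d   # Eq. (4.2)
        x = x + eta * d                               # Eq. (4.1)
        current = pi(x)
        converged = abs(current - prev) <= tol * max(abs(prev), 1.0)
        prev = current
        if converged:                                 # small relative change in the PI
            break
    return x, prev
```

For the machine-repair example, x would be (w, R) and deltas would be (δw, δR); the decreasing learning rate and the constraint R ≥ 1 used later in Section 5.1 are straightforward additions to this loop.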

The critical term in (4.2) is the (estimated) gradient. The gradient can be estimated using several approaches.9,10 In the forward finite differences (FFD) approach, the gradient is estimated by perturbing separately each parameter in the search space and observing the consequent change in the PI. In the particular case we are considering there are two search parameters (w and R); therefore the PI can be regarded as a function of these two parameters, PI = PI(w, R). Thus, the estimated gradient at a point (w_0, R_0) in the FFD approach is given by:

∇_x PI(w_0, R_0) = [ ∂PI/∂w , ∂PI/∂R ]^T |_(w_0, R_0) ≅ [ (PI(w_0 + δw, R_0) − PI(w_0, R_0)) / δw , (PI(w_0, R_0 + δR) − PI(w_0, R_0)) / δR ]^T    (4.3)

where δw and δR are perturbations small enough to give an accurate estimate of the gradient and big enough not to confuse the stochastic component of the system with the actual gradient. Note that the FFD approach requires three simulations to estimate the gradient at each point (for a p-dimensional search space it requires p + 1 simulations), and it is better to use common random numbers. We also propose a modification of the FFD approach that can be regarded as a cyclic coordinate forward finite difference approach (CFFD). In this approach the derivatives of the PI with respect to the search parameters are still estimated in the same way, but the actual points at which they are estimated change. Indeed, we first estimate the first component of the gradient, move accordingly, and then estimate the second component of the gradient at the new point, continuing in a cyclic way. This method is a kind of hybrid between the FFD and the cyclic coordinate method.11 Another paradigm to estimate the gradient lies in the field of response surface methodologies (RSM). This approach fits a meta model to the data coming from the simulation and extracts the gradient according to this fit. The easiest way to do this is by using a linear meta model where the PI is a first-order polynomial in the search space parameters. In our case the meta model would be:

PI = β_0 + β_1 w + β_2 R + ε    (4.4)

where ε is an error term that, under certain assumptions, is normally distributed with zero mean and finite variance.9,10 After running some simulations for different values of w and R, the corresponding PI is obtained and the meta model of (4.4) is fitted to the data, thus obtaining least-squares estimates for the βs. It is easy to see that [β_1, β_2]^T is an estimate of the gradient. Of course, values for w and R cannot be arbitrarily chosen, but some kind of experimental design has to be followed. The most common type of experimental design is a full factorial design, which is quite "reliable" for

gradient estimation, but unfortunately requires 2^p simulations in a p-dimensional search space, generating an explosion of complexity with increasing dimension of the search space. In our case a gradient estimation using a full factorial design requires 4 simulation runs and is thus feasible to use.

An alternative class of search methods is that of random search methods, e.g., genetic algorithms (GAs) and evolutionary programming (EP). In evolutionary programming the algorithm starts with an initial (random) population of k individuals, i.e., points in the search space.12 Each individual has a fitness, i.e., the value of the PI corresponding to that point (note that the fitness estimation requires one simulation). According to the fitness of each individual, a new generation is formed by: retaining the "healthier" individuals (elitism); generating "offspring" of the old individuals (applying some perturbation to "parents" chosen through some kind of probabilistic selection process); and creating some new random individuals. The algorithm finally ends after a certain number of generations or when there is some kind of uniformity in the population. It is important to notice that EP is quite different from GAs in that it does not require an additional binary encoding and mainly works on perturbations of parents to generate offspring, rather than using crossover operators.

4.2. On-line (adaptation) phase

The adaptation-through-exploration paradigm is strongly based on the reinforcement learning approach described above and depicted in Fig. 1. Given a DEDS, its operating conditions will generally have random variations, according to the influence of the environment on the system and to changes of the system itself. However, we can reasonably identify the possible causes of changes in operating conditions (perturbing elements), as well as their range of variation, which we call the perturbing element space (PES), through an understanding of the environment-system interactions. Through a sensitivity analysis of the performance of the system with respect to such perturbing elements, the most relevant ones can be determined. Typically their number will be limited, and their range can be bounded through a "common sense", practical approach, thus leading to a subset of the critical perturbing element (CPE) space. Unfortunately, in most practical cases the CPE subset still has a large cardinality (since it represents the environment/system interaction); therefore, it is not feasible to completely "explore" it (with an off-line simulator-based reinforcement learning approach) to determine the optimal controller parameters to use in each operating condition. However, a quick "exploration" can be done by considering some "well chosen" points, that is, simulating some significant changes due to the environment (or the plant). If this step is done properly, we should then be able to interpolate what happens for other critical points that were not explored. This process can be summarized as follows. First, the CPE subspace is explored (off-line) by stimulating selected changes in operating conditions and recording the consequent variation in the optimal controller parameters. Second, these data are routed to an adaptation module that learns and generalizes from them. Finally, the controller should be able to work on-line, adapting itself to random environmental changes by means of the adaptation module.
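As an illustration only, the explore-then-generalize procedure just summarized might look as follows in code; optimize and fit are placeholders for the simulation optimizer and for the FLS/ANN training routine, and all names are our own assumptions:

```python
def explore_cpe_and_fit(optimize, fit, mtbf_grid, mttr_grid):
    """Off-line exploration of the CPE subspace followed by fitting an adaptation module.

    optimize(mtbf, mttr) -> best controller parameter found by simulation optimization
                            for those operating conditions (e.g. an EP search).
    fit(inputs, targets) -> trained interpolator (e.g. an FLS or ANN) mapping
                            (MTBF, MTTR) estimates to the controller parameter w.
    """
    inputs, targets = [], []
    for mtbf in mtbf_grid:
        for mttr in mttr_grid:
            w_opt = optimize(mtbf, mttr)   # one off-line optimization per exploration point
            inputs.append((mtbf, mttr))
            targets.append(w_opt)
    return fit(inputs, targets)
```

In the experiments of Section 5.2 the exploration grid is {4, 6, 8, 10} × {0.5, 1, 1.5}.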
In the following, the CPE subset is defined for the machine repair example, and both a FLS and an ANN are used as adaptation modules and their performances compared.

5. Experimental Results

5.1. Off-line reinforcement learning

In the following, the learning schemes described in Section 4.1 are implemented. As a stopping criterion for all the path search algorithms the relative change in the PI is used, i.e., the algorithm stops when the PI has had a relative change of 0.1% or less. Such a percentage was used since it is useless to have a more stringent (smaller percentage) stopping criterion in the case of a stochastic plant. Values of δw = 0.1 and δR = 0.5 and random initial points were used.

Figure 4. Learning Curves: (a) FFD1 and RSM1; (b) CFFD1, CFFD2 and CFFD3 (PI versus iteration).

In Fig. 4 some learning curves are shown for different values of the learning coefficient and for different starting points for the FFD, RSM and CFFD approaches. In Fig. 4-(a) we see the learning curves for FFD learning (FFD1) and for RSM learning (RSM1). The curve corresponding to FFD1 (*) is characterized by η = 0.2, α = 0 and final values w = 1.0, R = 4.14 and PI = 119.6, i.e., FFD1 "fell" into one of the local minima. The same happens with RSM1 (o) (η = 0.2, α = 0.01), which ends at w = 1.0, R = 1.5 and PI = 119.6. In Fig. 4-(b) we can see the learning curves for the cyclic finite differences (CFFD) for three cases. The first curve, CFFD1 (*), was obtained for η = 0.5 and α = 0.05 and converges to w = 1.3, R = 10.6, PI = 119.6. A value of w larger than 1 and the value of PI show that this learning has converged to one of the local minima. The second curve, CFFD2 (o), was obtained for η = 0.8 and α = 0.1 and converges to w = 1.0, R = 4.51 and PI = 119.6. Even in this case we are trapped in a local minimum.

In the third case, CFFD3 (+), lower values for η and α were used, i.e., η = 0.2 and α = 0. With these values the algorithm converges again to a local minimum, given by w = 1.8, R = 1.2 and PI = 119.6. By observing these curves and the corresponding values of η and α, we can conclude that:
• η and α must not be too low at the beginning of the learning, as the learning procedure needs to start moving quickly towards the optimum;
• η and α cannot be too high because of irregularities in the PI function; it can occur (CFFD1 and CFFD3) that R becomes smaller than 1, thus causing a large decrement in the PI. A subsequent recovery is possible, but not assured.
We can summarize these two considerations by saying that we need to learn quickly while avoiding dangerous instabilities. Finally, RSM seems to be really slow and oscillatory in convergence, even though it requires a high number of simulation runs at each iteration. Thus, RSM is not considered further in the following.

The preceding considerations suggest that we modify the FFD and CFFD with the standard assumption made when using stochastic gradient algorithms, i.e., using a scheduled decrease in the learning rate (the learning rate at iteration k is η/k). This provides a high learning rate initially, and a decreasing, smaller rate while approaching the maximum.

Figure 5. Learning Curves with decreasing learning rate: (a) FFD2; (b) CFFD4 and CFFD5 (PI versus iteration).

In Fig. 5-(a) we see the learning curve for a FFD with decreasing learning rate, FFD2, characterized by η = 0.8, α = 0.01 and final point w = 0.76, R = 0.9 and PI = 76. It is obvious that FFD2 does not learn successfully, but it is interesting to note the oscillation in the learning curve. Indeed, the learning is successful in terms of the "envelope" of the learning curve. The oscillation is caused by repeated crossing of the value R = 1; this causes quick performance degradation that the FFD is eventually able to recover from. Figure 5-(b) shows the learning curves for CFFD modified with decreasing learning rate. The first curve, CFFD4 (*), is obtained with η = 0.3 and α = 0.05 and leads to w = 0.62, R = 1.3 and PI = 137.6. The second curve, CFFD5 (o), is obtained with η = 0.7 and α = 0.05 and leads to w = 0.66, R = 1.2 and PI = 149.5.

In both cases the situation is slightly better, but there still is a sort of instability that causes the algorithm to discard good values of PI. Moreover, the learning procedure reaches a point where it cannot go further, since the learning coefficient has decreased so much that there is no significant learning. To help the learning process we can also use some knowledge of the system, that is, we know that we always want to send at least one machine to repair if the threshold condition is met. This means that R should not be allowed to fall below unity. By imposing such a constraint on R as a barrier function that blocks R at 1 if there is any attempt (from the learning algorithm) to decrease it further, we obtain the last series of path search methods (still using a decreasing learning coefficient), whose learning curves are shown in Fig. 6.

Figure 6. Learning Curves with constraint on R: (a) FFD3, FFD4 and FFD5; (b) CFFD6 and CFFD7 (PI versus iteration).

Figure 6-(a) shows the learning curves for three different FFD runs, characterized by: FFD3 (*), η = 0.8, α = 0.01, final w = 0.65, R = 1.0, PI = 184; FFD4 (o), η = 0.6, α = 0.01, final w = 0.7, R = 1.0, PI = 199; FFD5 (+), η = 0.3, α = 0.01, final w = 0.68, R = 1.0, PI = 190. Figure 6-(b) shows two learning curves obtained using the CFFD method. The first curve, CFFD6 (*), was obtained for η = 0.6 and α = 0.1, and converges to w = 0.68, R = 1.0 and PI = 189.2. The second curve, CFFD7 (o), was obtained for η = 0.8 and α = 0.2, and converges to w = 0.72, R = 1.0 and PI = 187.6. This final modification definitely gives better results than the two previous ones. There is still the problem of reaching the maximum and stopping there, as we can see from the learning curves. These three methods give good results, but it is difficult to approach the optimum in a satisfactory way, since many irregularities remain even when using longer simulations or more runs. Therefore, assuming that the irregularity is part of the statistical nature of the problem, a final method has been used: evolutionary programming.

Using a standard EP approach, a population of 10 individuals was used. With a deterministic process the best four were chosen, and then four offspring were generated from them. The remaining two individuals were randomly chosen. The perturbations used to determine the offspring were assumed to be additive terms, uniformly distributed in [-0.1, 0.1] for w and in [-0.5, 0.5] for R. New random individuals are sampled from uniform distributions in [0, 1] and [0, 5] for w and R, respectively. The initial population is chosen randomly. In Fig. 7 the best and worst case individuals among the four best in each generation are plotted versus the generation number for three differently "seeded" evolutionary algorithms (EAs). The final four best individuals for these three different EAs are given in Table 1.
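A minimal sketch of this evolutionary loop under the settings just described (10 individuals, the best four retained, four perturbed offspring, two new random individuals); the pi callable again stands in for one simulation run and all names are ours:

```python
import random

def evolve(pi, generations=50, rng=random):
    """Evolutionary-programming search over (w, R) as described above."""
    population = [(rng.uniform(0, 1), rng.uniform(0, 5)) for _ in range(10)]
    for _ in range(generations):
        ranked = sorted(population, key=pi, reverse=True)   # one simulation per fitness evaluation
        parents = ranked[:4]                                # elitism: keep the four best individuals
        offspring = [(w + rng.uniform(-0.1, 0.1),           # additive uniform perturbations of the parents
                      r + rng.uniform(-0.5, 0.5))
                     for (w, r) in parents]
        newcomers = [(rng.uniform(0, 1), rng.uniform(0, 5)) for _ in range(2)]
        population = parents + offspring + newcomers
    return max(population, key=pi)
```

In practice fitness values would be cached between generations to avoid re-simulating the retained parents; the sketch omits this for brevity.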

Figure 7. Learning Curves - Evolutionary Algorithm (best and worst of the four best individuals in each generation, versus generation number, for the three differently seeded EAs (a), (b) and (c)).
Table 1 - Evolutionary Algorithm - best individuals

First EA (a):   (w=0.7203, R=1.0000, PI=196)  (w=0.7217, R=1.0068, PI=194)  (w=0.7223, R=1.0099, PI=193)  (w=0.7218, R=1.0072, PI=192)
Second EA (b):  (w=0.6808, R=1.0008, PI=191)  (w=0.6862, R=1.0278, PI=189)  (w=0.6818, R=1.0054, PI=187)  (w=0.6834, R=1.0138, PI=187)
Third EA (c):   (w=0.6830, R=1.0050, PI=194)  (w=0.6858, R=1.0192, PI=193)  (w=0.6826, R=1.0031, PI=192)  (w=0.6847, R=1.0136, PI=188)

Though this kind of learning algorithm is dependent on initialization, and offers no guarantee that a certain number of generations will give good results, the results
are fairly consistent as well as almost optimal. The procedure is not fast (but not excessively time consuming), and it is an automatic off-line method.

5.2. On-line reinforcement learning

For the on-line reinforcement learning it has been assumed that the sources of variability in the operating conditions (critical perturbing elements, CPE) of the plant are the mean time between failures (MTBF) and the mean time to repair (MTTR) of the machines. The CPE subspace is considered to be [4,10]×[0.5,1.5], that is, it is assumed (from some knowledge of the plant) that the MTBF will vary in [4,10] and the MTTR in [0.5,1.5]. The CPE subspace is therefore "explored" by trying to find the optimal w and R for some points in it. An EA like the one described in the preceding section was used, and the points that are explored lie on a uniformly spaced 4×3 grid on the CPE subspace. This means that the EA is run on all the points in {4, 6, 8, 10}×{0.5, 1, 1.5} and the "optimal" solutions are found for each point. In every solution the optimal value for R turns out to be 1.0 (even though the EA does not assume any kind of constraint on R), thus suggesting that it is better to always send only one machine to repair and to adjust the threshold implicitly defined by w. For this reason, from now on R will be considered to be set to 1 without further discussion. Table 2 shows the optimal results in the explored points along with the optimal PI.

Table 2 - Performance on exploration points: optimal controller, fuzzy and neural adaptation modules

MTBF = 4             MTTR = 0.5       MTTR = 1.0       MTTR = 1.5
  Optimal            w=0.86, PI=310   w=0.70, PI=114   w=0.06, PI=-73
  FL10               311              117              -191
  FL50               311              68               -233
  FL200              309              91               -109
  NN10               315              108              -158
  NN50               312              108              -388
  NN200              308              94               -213

MTBF = 6             MTTR = 0.5       MTTR = 1.0       MTTR = 1.5
  Optimal            w=0.94, PI=370   w=0.86, PI=256   w=0.66, PI=127
  FL10               364              255              156
  FL50               365              246              110
  FL200              369              254              76
  NN10               367              244              106
  NN50               367              250              111
  NN200              367              247              105

MTBF = 8             MTTR = 0.5       MTTR = 1.0       MTTR = 1.5
  Optimal            w=0.94, PI=393   w=0.87, PI=313   w=0.79, PI=232
  FL10               392              315              239
  FL50               393              313              213
  FL200              394              312              213
  NN10               393              311              215
  NN50               393              306              219
  NN200              393              313              232

MTBF = 10            MTTR = 0.5       MTTR = 1.0       MTTR = 1.5
  Optimal            w=0.95, PI=409   w=0.92, PI=347   w=0.85, PI=284
  FL10               408              346              289
  FL50               409              345              281
  FL200              409              349              282
  NN10               407              345              261
  NN50               410              349              276
  NN200              410              346              285

The next step is to efficiently interpolate (generalize) between these optimal points. Well known approximating paradigms have been used, namely fuzzy logic systems (FLS) and artificial neural networks (ANN).8,13 A fuzzy logic system with two antecedents (MTBF and MTTR) and one consequent (w) was designed from the "exploration" data in Table 2. The corresponding membership functions are shown in Fig. 8 and the set of 12 rules is given in Table 3. The FLS uses singleton fuzzification, max-product inference, max rule composition and modified height defuzzification. Artificial neural networks constitute a second interpolating paradigm. In this case a 2-10-1 ANN was trained on the "exploration" data with an error goal of 10^-4, a learning coefficient of 0.8 and a momentum coefficient of 0.3. This network has two inputs (MTBF and MTTR) and one output (w), and it uses sigmoidal squashing functions. Both the FLS and the ANN have the MTBF and the MTTR as inputs. Obviously those values are not known a priori, and some kind of estimate is needed. Therefore, the times between failures and the times to repair are collected during the simulation and their averages serve as estimates for the MTBF and the MTTR. Since the plant is dynamic and MTBF and MTTR will change over time, the above mentioned average is taken over the data collected in some sliding window; the dimension of this window becomes a design issue.
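A minimal sketch of this sliding-window estimation step (all names are ours); the adaptation module is any trained mapping from the (MTBF, MTTR) estimates to the threshold parameter w, such as the FLS or the 2-10-1 ANN described above:

```python
from collections import deque

class SlidingWindowAdapter:
    """Keep the last `window` observations of TBF and TTR and adapt w on-line."""

    def __init__(self, adaptation_module, window=10):
        self.module = adaptation_module          # trained interpolator: (MTBF, MTTR) -> w
        self.tbf = deque(maxlen=window)
        self.ttr = deque(maxlen=window)

    def observe_failure(self, time_between_failures):
        self.tbf.append(time_between_failures)

    def observe_repair(self, time_to_repair):
        self.ttr.append(time_to_repair)

    def current_w(self, default_w=0.7):
        if not self.tbf or not self.ttr:
            return default_w                     # arbitrary fallback until estimates exist
        mtbf = sum(self.tbf) / len(self.tbf)
        mttr = sum(self.ttr) / len(self.ttr)
        return self.module(mtbf, mttr)
```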

Table 3. Rule base (consequent: w)

MTBF ↓ \ MTTR →    S     M     L
S                  L     S     VS
M                  VL    L     S
L                  VL    L     M
VL                 VL    VL    L

Figure 8. Membership functions for MTBF (a), MTTR (b) and w (c): MTBF with labels S, M, L, VL over 4, 6, 8, 10; MTTR with labels S, M, L over 0.5, 1, 1.5; w with labels VS, S, M, L, VL at 0.06, 0.68, 0.79, 0.86, 0.94.

The FLS and ANN are now ready to operate on-line as adaptation modules; they will monitor the evolution of the plant through the sliding window and impose changes to the controller parameter accordingly. A first test of the adaptation module can be done by checking its performance on the exploration points from which it was created. Table 2 summarizes the results for this first test, where FLx (NNx) stands for a fuzzy logic (neural network) adaptation module with a sliding window of dimension x. Both ANN and FLS give good overall results. More specifically, from Table 2 we can see that there is a major performance degradation in (4, 1.5) with respect to the optimal performance, with both FLS and ANN. The only case where this degradation is contained is for a FLS using a sliding window of 200 observations. The problem in this case seems to be the large performance degradation corresponding to such values of MTBF and MTTR. Indeed, a high MTTR and a low MTBF correspond to frequently failing machines that take a long time to repair. This gives substantial additional costs that make the situation difficult to manage; indeed, small variations in w cause big variations (decreases) in the PI. There is some smaller performance degradation (for some values of the sliding window dimension) also in (4, 1.0), (6, 1.5) and (8, 1.5) in Table 2. A sliding window of 10 seems to give good results for both the FLS and the ANN. The latter is 10% sub-optimal in (6, 1.5), regardless of the sliding window size. All other data show performances very close to the optimal ones. The general trend suggested by the data is that a small sliding window works well with the fuzzy adaptation module, while a big sliding window works well (except in a few cases) with the neural adaptation module.

It is interesting to note that in some instances the performances of the system with adaptation modules seem to be better than optimal. Though this might seem strange, there are two reasons that explain those results. First of all, in the comparison we have to take into account the variability inherent to the plant, which makes small differences negligible. Moreover, what we are calling the "optimal performances" or the "optimal controller" are only optimal solutions for that particular type of controller, and not for the problem itself. Since the static controller works with fixed parameters for the whole simulation, while the dynamic controller (with the adaptation) may change the value of its parameters at any time during the simulation, the optimal performances of these two controllers will in general be different.

Another test of the adaptation modules consists in checking their extrapolation (even though we are assuming we will not need any, it is interesting to check) and interpolation capabilities. Table 4 gives the results of simulations for the optimal controller (determined with an EA) and for the neural and fuzzy adaptation in (2, 0.5) and (10, 2) (i.e., extrapolation) and (5, 1) (i.e., interpolation). The neural adaptation seems to have fairly good extrapolation capabilities and good interpolation ones. On the other hand, the fuzzy adaptation works well in terms of interpolation (regardless of the window size) and of extrapolation at (10, 2) (with a small window size), but it gives unsatisfactory performances for (2, 0.5). This is probably caused by the type of membership functions that were used: the "saturation" of the membership functions could be the cause of the bad extrapolating capabilities.
These bad extrapolation capabilities are not a problem in our case, since one of the initial assumptions in the method is that the CPEs are going to vary in some subspace that we explore; therefore, we are practically only interested in interpolation properties. Nonetheless, it is interesting to compare the fuzzy and the neural extrapolation capabilities.

Table 4. Extrapolation and interpolation of the fuzzy and neural modules

           MTBF=2, MTTR=0.5   MTBF=10, MTTR=2   MTBF=5, MTTR=1
Optimal    w=0.66, PI=110     w=0.79, PI=215    w=0.72, PI=196
FL10       -2                 225               199
FL50       -2                 150               193
FL200      4                  158               199
NN10       113                160               203
NN50       99                 195               187
NN200      96                 194               199

The final test is one that simulates a real operating situation. It is assumed that every 1000 time units new values of MTBF and MTTR are uniformly sampled from the CPE subspace ([4,10]×[0.5,1.5]). The simulation was run for 50,000 time units and fixed controllers were compared to dynamic controllers using the adaptation module. In particular, fixed controllers with w = 0.6, 0.7, 0.8 and 0.9 were used, since those values of w are in the range of the optimal values found previously. The results for this last test are shown in Table 5.

Here we see that the fuzzy or neural adaptation is effective in giving performances that are more than 20% better than those with the fixed controllers. Moreover, we note that a small sliding window seems to be the best choice for both the fuzzy and the neural adaptation modules.

Table 5. Adaptation

Controller   PI
w=0.6        174
w=0.7        197
w=0.8        192
w=0.9        183
FL10         252
FL50         241
FL200        244
NN10         243
NN50         240
NN200        235

6. Conclusions

A general reinforcement learning paradigm for both off-line and on-line adaptation of a parametric controller for a DEDS was presented. To demonstrate its use, a machine repair example was formulated and its DEVS model was developed. The off-line paradigm used standard simulation optimization techniques (finite differences, response surface methodologies) and advanced random search (evolutionary programming) to determine the optimal controller. The on-line reinforcement learning used the approximation paradigms of fuzzy logic systems and artificial neural networks to provide adaptation. Both adaptation approaches (FLS and ANN) showed good approximation and adaptation properties. The neural adaptation also showed better extrapolation capabilities, even though extrapolation is not an issue. The proposed method is still in its early phase, and needs further study and experimentation to prove its general validity and applicability. Furthermore, some theoretical considerations need to be addressed, e.g., regarding the type of "exploration" of the critical-perturbing-elements subspace to perform.

ACKNOWLEDGMENT

The authors would like to thank DuPont for support on the development of this paper and the anonymous reviewers for their important suggestions.

REFERENCES

[1] C.G. Cassandras, Discrete Event Systems: Modeling and Performance Analysis, Richard D. Irwin and Aksen Associates Inc. Publishers, Boston, MA, 1993.
[2] B.P. Zeigler, Theory of Modelling and Simulation, John Wiley and Sons, 1976.
[3] B.P. Zeigler, Multifacetted Modelling and Discrete Event Simulation, Academic Press, 1984.
[4] B.P. Zeigler, "DEVS Representation of Dynamic Systems: Event-Based Intelligent Control," Proceedings of the IEEE, vol. 77, no. 1, pp. 72-80, January 1989.
[5] B.P. Zeigler, Object Oriented Simulation with Hierarchical Modular Models, Academic Press, 1990.
[6] D.E. Kirk, Optimal Control Theory: An Introduction, Prentice-Hall Inc., Englewood Cliffs, NJ, 1970.
[7] P.J. Werbos, "Approximate dynamic programming for real-time control and neural modeling," in Handbook of Intelligent Control: Neural, Fuzzy and Adaptive Approaches (D.A. White and D.A. Sofge, eds.), Van Nostrand Reinhold, New York, 1992.
[8] S. Haykin, Neural Networks: A Comprehensive Foundation, Macmillan College Publishing and IEEE Press, 1994.
[9] S.H. Jacobson and L.W. Schruben, "Techniques for Simulation Response Optimization," Operations Research Letters, vol. 8, no. 1, pp. 1-9, February 1989.
[10] M.H. Safizadeh, "Optimization in Simulation: Current Issues and the Future Outlook," Naval Research Logistics, vol. 37, pp. 807-825, 1990.
[11] M.S. Bazaraa, H.D. Sherali and C.M. Shetty, Nonlinear Programming: Theory and Algorithms, John Wiley and Sons Inc., NY, 1993.
[12] D.B. Fogel, Evolutionary Computation: Toward a New Philosophy of Machine Intelligence, IEEE Press, NY, 1995.
[13] J.M. Mendel, "Fuzzy logic systems for engineering: a tutorial," Proceedings of the IEEE, vol. 83, no. 3, pp. 345-377, March 1995.

PAOLO DADONE received the "laurea" degree with honors in electronic engineering in 1995 from the Politecnico di Bari, Italy, and the M.S. degree in electrical engineering in 1997 from the Virginia Polytechnic Institute and State University, USA, where he is currently a Ph.D. student. His research interests are in intelligent control algorithm development and implementation, discrete event dynamic systems and manufacturing systems. Mr. Dadone is a member of several honor societies as well as IEEE technical societies. He was the winner of the 1996 IEEE VMS graduate student paper contest and the recipient of the 1996 Politecnico di Bari fellowship for studying abroad.

HUGH VANLANDINGHAM is a professor in the Bradley Department of Electrical and Computer Engineering, where he has served since September 1966. He received his B.E.E. degree from N.C. State University in 1957, his M.E.E. from New York University in 1959 and a Ph.D. in Electrical Engineering from Cornell University in 1967. From 1957 to 1962 Dr. VanLandingham worked at the Bell Telephone Labs, Whippany, NJ. His research has been, or is being, supported by NASA, ONR, NSF, NSWC, Lockheed, DuPont, and Eastman Chemical. He is the author of three textbooks in the areas of signal processing and control and has published more than 70 technical papers in journals and international conferences. Dr. VanLandingham has taught courses in virtually all areas of the undergraduate curriculum. At the graduate level his teaching concentration has been in the areas of random processes, signal processing and automatic control systems. Whereas his earlier research was focused on conventional digital methods, primarily digital control systems, his more recent areas of interest are in applications of "soft computing." This broad area is a subset of artificial intelligence that has to do with paradigms that mimic Nature. Included in this soft computing area are the studies of artificial neural networks, fuzzy logic systems and evolutionary computation.

BRUNO MAIONE received the laurea degree in electrical engineering with honors from the University of Naples. Currently, he is full professor of Automatic Control at the Department of Electrical and Electronic Engineering of the Polytechnic of Bari. In 1983 and 1985 he was a visiting professor at The University of Florida, Gainesville. He held the position of Dean of the Faculty of Engineering from 1986 to 1992. His primary areas of research are discrete event dynamical systems and intelligent control.