
ESIT 2000, 14-15 September 2000, Aachen, Germany

Fuzzy Model-Based Reinforcement Learning
Martin Appl¹, Wilfried Brauer²
¹ Siemens AG, Corporate Technology, Information and Communications, D-81730 Munich, Germany
Phone: +49-89-636-45377, Fax: +49-89-636-45456, email: [email protected]
² Technical University of Munich, Department of Computer Science, D-80290 Munich, Germany
email: [email protected]

ABSTRACT: Model-based reinforcement learning methods are known to be highly efficient with respect to the number of trials required for learning optimal policies. In this article, a novel fuzzy model-based reinforcement learning approach, fuzzy prioritized sweeping (F-PS), is presented. The approach is capable of learning strategies for Markov decision problems with continuous state and action spaces. The output of the algorithm is a Takagi-Sugeno fuzzy system with linear terms in the consequents of the rules. From the Q-function approximated by this fuzzy system, an optimal control strategy can be easily derived. The proposed method is applied to the problem of selecting optimal framework signal plans in urban traffic networks. It is shown that the method outperforms existing model-based approaches.
KEYWORDS: reinforcement learning, model-based learning, fuzzy prioritized sweeping, Takagi-Sugeno fuzzy systems, framework signal plans

INTRODUCTION
Reinforcement learning means learning from experience (Sutton and Barto (1998), Bertsekas and Tsitsiklis (1996)). A reinforcement learning agent perceives certain characteristics of its environment, influences the environment by performing actions, and finally receives rewards according to the appropriateness of the selected actions. One can distinguish between indirect and direct reinforcement learning methods. Indirect methods, such as prioritized sweeping (Moore and Atkeson (1993)), build an internal model of the environment and calculate the optimal policy based on this model, whereas direct methods, such as Q-learning (Watkins (1989)), do not use an explicit model but learn directly from experience. In many settings, indirect reinforcement learning methods learn much faster than direct methods, since they can reuse information stored in their internal model.

Learning models of discrete environments is much easier than learning models of continuous ones. This may be the reason why most publications on model-based reinforcement learning deal with discrete Markov decision problems. Discrete methods can, of course, also be applied to continuous problems by discretizing the state and action spaces of these problems. The main challenge of this approach, however, is to define a partition of reasonable granularity, since fine partitions lead to a high number of states and thus to complex problems, whereas approximations based on coarse crisp partitions can be highly imprecise. Model-based learning in continuous state spaces was previously discussed by Davies (1997), who suggested defining a coarse grid on the state space and approximating the continuous value function by interpolation based on this grid. This approximation approach is comparable to a Takagi-Sugeno fuzzy system with triangular membership functions and constant terms in the consequents of the rules. Davies, however, used a crisp partition for the training of the transition probabilities and the corresponding rewards, which seems to be inconsistent with the idea of interpolating. Besides, he did not consider continuous actions.

In this article, a fuzzy model-based reinforcement learning approach, fuzzy prioritized sweeping (F-PS), is considered. The approach is capable of learning strategies for problems with continuous state and action spaces. The output of the F-PS approach is a Takagi-Sugeno fuzzy system with linear rules (Takagi and Sugeno (1985)). With such fuzzy systems, continuous value functions can be approximated much more precisely than with approximation architectures based on crisp partitions; alternatively, the number of partitioning subsets can be reduced.

The proposed method is applied to the problem of selecting framework signal plans depending on traffic conditions. Several approaches applying reinforcement learning to problems from traffic signal control can be found in the literature (e.g. Thorpe (1997), Bingham (1998), Appl and Palm (1999)). To the authors' knowledge, however, no publication exists on the selection of framework signal plans by means of reinforcement learning methods. In the following section, the basic Markov decision problem on which the further considerations are based is introduced. Afterwards, the fuzzy model-based reinforcement learning approach is presented. Finally, the effectiveness of the proposed algorithm is demonstrated on the task of selecting framework signal plans.

BASIC MODEL
In the following it is assumed that the reinforcement learning agent receives inputs from a continuous state space $\mathcal{X}$ of dimension $N^X$ and may perform actions taken from a continuous action space $\mathcal{A}$ of dimension $N^A$. The sets of dimensions of the state space and the action space will be denoted by $D^X := \{1, \ldots, N^X\}$ and $D^A := \{1, \ldots, N^A\}$, respectively. Let, for each state $x \in \mathcal{X}$ and each action $a \in \mathcal{A}$, $\tilde p(y; x, a)$ be a probability density function giving the distribution of the successor state $y$ if action $a$ is executed in state $x$. Furthermore, let $\tilde g(x, a, y) \in \mathbb{R}$ be the (unknown) reward the agent receives for executing action $a$ in state $x$ if the action causes a transition to state $y$. The agent is supposed to select actions at discrete points in time. The goal of the learning task then is to find a stationary policy $\mu : \mathcal{X} \to \mathcal{A}$, i.e. a mapping from states to actions, such that the expected sum of discounted future rewards

$$\tilde J^{\mu}(x) := \lim_{N \to \infty} E\!\left[\left.\sum_{\kappa=0}^{N} \alpha^{\kappa}\, \tilde g\big(x_\kappa, \mu(x_\kappa), x_{\kappa+1}\big) \,\right|\, x_0 = x\right], \qquad \alpha \in [0, 1) \tag{1}$$

is maximized for each $x \in \mathcal{X}$, where $x_{\kappa+1}$ is determined from $x_\kappa$ using $\tilde p(x_{\kappa+1}; x_\kappa, \mu(x_\kappa))$. Let

$$\tilde Q^{\mu}(x, a) := \int_{y \in \mathcal{X}} \tilde p(y; x, a)\left[\tilde g(x, a, y) + \alpha \tilde J^{\mu}(y)\right] dy \tag{2}$$

be the sum of discounted future rewards the agent may expect if it executes action $a$ in state $x$ and behaves according to the policy $\mu$ afterwards. Then, the optimal Q-values $\tilde Q^{\mu^*}(x, a)$ are given by the fixed-point solution of the Bellman equation

$$\tilde Q^{\mu^*}(x, a) = \int_{y \in \mathcal{X}} \tilde p(y; x, a)\left[\tilde g(x, a, y) + \alpha \max_{b \in \mathcal{A}} \tilde Q^{\mu^*}(y, b)\right] dy, \tag{3}$$

and the optimal policy $\mu^*$ is to execute in each state $x$ the action $a$ that maximizes these Q-values:

$$\mu^*(x) := \arg\max_{a \in \mathcal{A}} \tilde Q^{\mu^*}(x, a). \tag{4}$$

The F-PS approach described in the following approximates the continuous Q-function $\tilde Q^{\mu^*}$ by a Takagi-Sugeno fuzzy system. To this end, it is assumed that a fuzzy partition $\{\mu^X_i\}_{i \in \mathcal{I}}$ of the state space is defined, where the subscripts of the $N^{\mu^X}$ membership functions are given by $\mathcal{I} = \{1, \ldots, N^{\mu^X}\}$ and the labels and centers of the partitioning subsets are given by $\{X_i\}_{i \in \mathcal{I}}$ and $\{\tilde x_i\}_{i \in \mathcal{I}}$, respectively. Likewise, it is assumed that the action space is partitioned by $\{\mu^A_u\}_{u \in \mathcal{U}}$, where $\mathcal{U} = \{1, \ldots, N^{\mu^A}\}$ is the set of subscripts, $\{A_u\}_{u \in \mathcal{U}}$ gives the labels and $\{\tilde a_u\}_{u \in \mathcal{U}}$ the centers of the $N^{\mu^A}$ subsets of the partition.
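The fuzzy partitions of the state and action spaces are the only structural input of the method. The following is a minimal sketch, not taken from the paper, of a one-dimensional strong (sum-to-one) fuzzy partition with triangular membership functions; the class name and the example centers are illustrative assumptions.

```python
import numpy as np

class TriangularPartition:
    """Strong fuzzy partition: triangular membership functions that sum to one."""

    def __init__(self, centers):
        self.centers = np.asarray(centers, dtype=float)  # sorted centers x~_i

    def memberships(self, x):
        """Return the membership vector (mu_i(x))_i for a scalar input x."""
        c = self.centers
        mu = np.zeros_like(c)
        x = np.clip(x, c[0], c[-1])        # saturate outside the partition
        j = np.searchsorted(c, x)          # index of the right neighbouring center
        if j == 0:
            mu[0] = 1.0
        else:
            w = (x - c[j - 1]) / (c[j] - c[j - 1])
            mu[j - 1], mu[j] = 1.0 - w, w  # only two neighbouring sets are active
        return mu

# Example: five fuzzy sets over a normalized traffic density in [0, 1].
state_partition = TriangularPartition([0.0, 0.25, 0.5, 0.75, 1.0])
print(state_partition.memberships(0.6))    # -> approximately [0., 0., 0.6, 0.4, 0.]
```

Multi-dimensional partitions can be built, for instance, as products of such one-dimensional partitions; the paper only assumes that the membership functions, labels and centers are given.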

FUZZY MODEL-BASED LEARNING
The basic idea of the F-PS approach presented in the following is to learn an approximation of the unknown continuous Q-function $\tilde Q^{\mu^*}$, from which the optimal strategy can be easily derived (cf. eqn. (4)). The Q-function will be approximated by the Takagi-Sugeno fuzzy system (Takagi and Sugeno (1985), Sugeno (1985))

$$\text{if } x \text{ is } X_i \text{ and } a \text{ is } A_u \text{ then } \tilde Q^{\mu^*}(x, a) = \hat Q_{iu} + \sum_{l \in D^X} \hat Q^{x_l}_{iu}(x_l - \tilde x_{i,l}) + \sum_{l \in D^A} \hat Q^{a_l}_{iu}(a_l - \tilde a_{u,l}), \qquad i \in \mathcal{I},\ u \in \mathcal{U}, \tag{5}$$

where $\hat Q_{iu}$ is an estimate of the average Q-value in $(X_i, A_u)$, and $\hat Q^{x_l}_{iu}$ and $\hat Q^{a_l}_{iu}$ are estimates of the local partial derivatives $\frac{\partial \tilde Q^{\mu^*}}{\partial x_l}(\tilde x_i, \tilde a_u)$ and $\frac{\partial \tilde Q^{\mu^*}}{\partial a_l}(\tilde x_i, \tilde a_u)$, respectively. The estimation of the average Q-values and of the average partial derivatives will be considered in the following subsections.
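As a rough illustration of how a Q-function of form (5) could be evaluated and used for action selection, consider the sketch below. It is not from the paper: it assumes the standard Takagi-Sugeno inference (rule consequents weighted by the product of state and action memberships of a strong partition), assumes the array names and shapes given in the docstring, and restricts the maximization in (4) to the action centers, which is a simplification of the continuous argmax.

```python
import numpy as np

def q_value(x, a, mu_X, mu_A, x_centers, a_centers, Q0, Qx, Qa):
    """Evaluate a Takagi-Sugeno Q-function of form (5).

    mu_X(x), mu_A(a) : membership vectors, shapes (I,) and (U,)
    x_centers, a_centers : rule centers, shapes (I, N_X) and (U, N_A)
    Q0 : (I, U)        average Q-values  Q^_iu
    Qx : (I, U, N_X)   state derivatives Q^{x_l}_iu
    Qa : (I, U, N_A)   action derivatives Q^{a_l}_iu
    """
    wx, wa = mu_X(x), mu_A(a)
    q = 0.0
    for i in np.nonzero(wx)[0]:
        for u in np.nonzero(wa)[0]:
            # local linear consequent around the rule center (x~_i, a~_u)
            local = (Q0[i, u]
                     + Qx[i, u] @ (x - x_centers[i])
                     + Qa[i, u] @ (a - a_centers[u]))
            q += wx[i] * wa[u] * local   # memberships of a strong partition sum to 1
    return q

def greedy_action(x, mu_X, mu_A, x_centers, a_centers, Q0, Qx, Qa):
    """Crude stand-in for (4): pick the best action center a~_u."""
    scores = [q_value(x, a_u, mu_X, mu_A, x_centers, a_centers, Q0, Qx, Qa)
              for a_u in a_centers]
    return a_centers[int(np.argmax(scores))]
```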

ESTIMATION OF AVERAGE Q-VALUES
Let $N_{iu,k}$ be counters giving the number of executions of (fuzzy) action $A_u$ in (fuzzy) state $X_i$ until iteration $k$ ($i \in \mathcal{I}$, $u \in \mathcal{U}$). Likewise, let $M_{iuj,k}$ be counters giving the number of times that the execution of action $A_u$ in state $X_i$ caused a transition to $X_j$ ($i, j \in \mathcal{I}$, $u \in \mathcal{U}$). On the observation of a transition $(x_k, a_k, x_{k+1})$, $x_k \in \mathcal{X}$, $x_{k+1} \in \mathcal{X}$, $a_k \in \mathcal{A}$, with reward $\tilde g_k \in \mathbb{R}$, these counters are increased according to the degrees of membership of the transition in the corresponding centers:

$$N_{iu,k+1} \leftarrow N_{iu,k} + \mu^X_i(x_k)\,\mu^A_u(a_k), \qquad \forall i \in \mathcal{I},\ u \in \mathcal{U}, \tag{6}$$
$$M_{iuj,k+1} \leftarrow M_{iuj,k} + \mu^X_i(x_k)\,\mu^A_u(a_k)\,\mu^X_j(x_{k+1}), \qquad \forall i \in \mathcal{I},\ u \in \mathcal{U},\ j \in \mathcal{I}. \tag{7}$$

Based on these counters one can estimate the probability

$$p_{ij}(u) := \frac{\int_{x \in \mathcal{X}} \int_{a \in \mathcal{A}} \int_{y \in \mathcal{X}} \mu^X_i(x)\,\mu^A_u(a)\,\mu^X_j(y)\,\tilde p(y; x, a)\, dy\, da\, dx}{\int_{x \in \mathcal{X}} \int_{a \in \mathcal{A}} \mu^X_i(x)\,\mu^A_u(a)\, da\, dx} \tag{8}$$

that the execution of action $A_u$ in state $X_i$ causes a transition to state $X_j$:

$$\hat p_{ij,k+1}(u) := \frac{M_{iuj,k+1}}{N_{iu,k+1}}. \tag{9}$$
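The counter updates (6), (7) and the estimate (9) could be realized, for instance, as in the following sketch. The class and field names are assumptions for illustration; the membership vectors are assumed to come from partitions such as the one sketched earlier.

```python
import numpy as np

class FuzzyModel:
    """Fuzzy transition counters for updates (6), (7) and estimate (9)."""

    def __init__(self, n_states, n_actions):
        self.N = np.zeros((n_states, n_actions))            # N_iu
        self.M = np.zeros((n_states, n_actions, n_states))  # M_iuj

    def observe(self, wx, wa, wy):
        """wx, wa, wy: membership vectors of x_k, a_k and x_{k+1}."""
        self.N += np.outer(wx, wa)                           # update (6)
        self.M += wx[:, None, None] * wa[None, :, None] * wy[None, None, :]  # update (7)

    def p_hat(self):
        """Transition-probability estimates (9), shape (I, U, J)."""
        with np.errstate(invalid='ignore', divide='ignore'):
            p = self.M / self.N[:, :, None]
        return np.nan_to_num(p)   # unvisited (i, u) pairs default to zero
```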

Let $g_{iuj}$ be the average reward the agent may expect if it executes action $A_u$ in state $X_i$ and the action causes a transition to state $X_j$:

$$g_{iuj} := \frac{\int_{x \in \mathcal{X}} \int_{a \in \mathcal{A}} \int_{y \in \mathcal{X}} \mu^X_i(x)\,\mu^A_u(a)\,\mu^X_j(y)\,\tilde p(y; x, a)\,\tilde g(x, a, y)\, dy\, da\, dx}{\int_{x \in \mathcal{X}} \int_{a \in \mathcal{A}} \int_{y \in \mathcal{X}} \mu^X_i(x)\,\mu^A_u(a)\,\mu^X_j(y)\,\tilde p(y; x, a)\, dy\, da\, dx}. \tag{10}$$

Then, an estimate $\hat g_{iuj}$ of these average rewards can be gained by performing the update

$$\hat g_{iuj,k+1} \leftarrow \hat g_{iuj,k} + \frac{\mu^X_i(x_k)\,\mu^A_u(a_k)\,\mu^X_j(x_{k+1})}{M_{iuj,k+1}}\,\big[\tilde g_k - \hat g_{iuj,k}\big], \qquad \forall i \in \mathcal{I},\ u \in \mathcal{U},\ j \in \mathcal{I} \tag{11}$$

on the observation of transitions $(x_k, a_k, x_{k+1})$, $x_k \in \mathcal{X}$, $x_{k+1} \in \mathcal{X}$, $a_k \in \mathcal{A}$, with rewards $\tilde g_k \in \mathbb{R}$.

Based on the discrete model $(\hat p_{ij,k+1}(u), \hat g_{iuj,k+1})$, one can now calculate average Q-values. It can be shown that the solution of the fixed-point equation

$$\hat Q_{iu,k+1} = \sum_{j \in \mathcal{I}} \hat p_{ij,k+1}(u)\left[\hat g_{iuj,k+1} + \alpha \max_{v \in \mathcal{U}} \hat Q_{jv,k+1}\right] \tag{12}$$

gives estimates of the average Q-values

$$Q_{iu} := \frac{\int_{x \in \mathcal{X}} \int_{a \in \mathcal{A}} \mu^X_i(x)\,\mu^A_u(a)\,\tilde Q^{\mu^*}(x, a)\, da\, dx}{\int_{x \in \mathcal{X}} \int_{a \in \mathcal{A}} \mu^X_i(x)\,\mu^A_u(a)\, da\, dx}. \tag{13}$$

These estimates can be used in the representation (5) of the Q-function. The system (12) can be advantageously solved by discrete prioritized sweeping (Moore and Atkeson (1993)).
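The paper solves (12) with discrete prioritized sweeping. The sketch below computes the same fixed point with a plain synchronous value-iteration sweep over the fuzzy model, i.e. it deliberately omits the priority-queue bookkeeping that makes prioritized sweeping efficient after each observed transition; function and parameter names are assumptions.

```python
import numpy as np

def solve_average_q(p_hat, g_hat, alpha=0.9, tol=1e-6, max_iter=10_000):
    """Fixed point of (12) by synchronous value iteration.

    p_hat : (I, U, J) transition estimates  p^_ij(u)
    g_hat : (I, U, J) average-reward estimates g^_iuj
    Returns the average Q-value estimates Q^_iu with shape (I, U).
    """
    n_states, n_actions, _ = p_hat.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(max_iter):
        V = Q.max(axis=1)                                # max_v Q_jv
        target = g_hat + alpha * V[None, None, :]        # g_iuj + alpha * max_v Q_jv
        Q_new = np.einsum('iuj,iuj->iu', p_hat, target)  # sum_j p_ij(u) [...]
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
    return Q
```

Prioritized sweeping would instead, after every observed transition, push the affected (fuzzy) states onto a priority queue ordered by the magnitude of their Bellman error and back up only the most urgent entries, which is what makes the indirect approach sample-efficient in practice.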

ESTIMATION OF AVERAGE PARTIAL DERIVATIVES
The partial derivatives $Q^{x_l}_{iu}$ and $Q^{a_l}_{iu}$ of the Q-function can be derived from average values and partial derivatives of the reward function and of the transition probabilities. It can be shown that the following is satisfied for the partial derivatives with respect to the dimensions of the state space:

$$Q^{x_l}_{iu} = \frac{\partial \tilde Q^{\mu^*}}{\partial x_l}(\tilde x_i, \tilde a_u) \overset{(3)}{=} \frac{\partial}{\partial x_l}\!\left[\int_{y \in \mathcal{X}} \tilde p(y; x, \tilde a_u)\left[\tilde g(x, \tilde a_u, y) + \alpha \max_{b \in \mathcal{A}} \tilde Q^{\mu^*}(y, b)\right] dy\right]_{x = \tilde x_i} \tag{14}$$

$$\approx \sum_{j \in \mathcal{I}} \left[p^{x_l}_{ij}(u)\left(g_{iuj} + \alpha \max_{v \in \mathcal{U}} Q_{jv}\right) + p_{ij}(u)\,g^{x_l}_{iuj}\right], \tag{15}$$

where the average rewards $g_{iuj}$ and transition probabilities $p_{ij}(u)$ were defined in the preceding section and the average derivatives $p^{x_l}_{ij}(u)$ and $g^{x_l}_{iuj}$ are given by

$$p^{x_l}_{ij}(u) := \frac{\int_{x \in \mathcal{X}} \int_{a \in \mathcal{A}} \int_{y \in \mathcal{X}} \mu^X_i(x)\,\mu^A_u(a)\,\mu^X_j(y)\,\frac{\partial}{\partial x_l}\tilde p(y; x, a)\, dy\, da\, dx}{\int_{x \in \mathcal{X}} \int_{a \in \mathcal{A}} \mu^X_i(x)\,\mu^A_u(a)\, da\, dx}, \tag{16}$$

$$g^{x_l}_{iuj} := \frac{\int_{x \in \mathcal{X}} \int_{a \in \mathcal{A}} \int_{y \in \mathcal{X}} \mu^X_i(x)\,\mu^A_u(a)\,\mu^X_j(y)\,\frac{\partial}{\partial x_l}\tilde g(x, a, y)\,\tilde p(y; x, a)\, dy\, da\, dx}{\int_{x \in \mathcal{X}} \int_{a \in \mathcal{A}} \int_{y \in \mathcal{X}} \mu^X_i(x)\,\mu^A_u(a)\,\mu^X_j(y)\,\tilde p(y; x, a)\, dy\, da\, dx}. \tag{17}$$

Likewise, the partial derivatives with respect to the dimensions of the action space can be approximated as follows:

$$Q^{a_l}_{iu} = \frac{\partial \tilde Q^{\mu^*}}{\partial a_l}(\tilde x_i, \tilde a_u) \approx \sum_{j \in \mathcal{I}} \left[p^{a_l}_{ij}(u)\left(g_{iuj} + \alpha \max_{v \in \mathcal{U}} Q_{jv}\right) + p_{ij}(u)\,g^{a_l}_{iuj}\right], \tag{18}$$

where the abbreviations

$$p^{a_l}_{ij}(u) := \frac{\int_{x \in \mathcal{X}} \int_{a \in \mathcal{A}} \int_{y \in \mathcal{X}} \mu^X_i(x)\,\mu^A_u(a)\,\mu^X_j(y)\,\frac{\partial}{\partial a_l}\tilde p(y; x, a)\, dy\, da\, dx}{\int_{x \in \mathcal{X}} \int_{a \in \mathcal{A}} \mu^X_i(x)\,\mu^A_u(a)\, da\, dx}, \tag{19}$$

$$g^{a_l}_{iuj} := \frac{\int_{x \in \mathcal{X}} \int_{a \in \mathcal{A}} \int_{y \in \mathcal{X}} \mu^X_i(x)\,\mu^A_u(a)\,\mu^X_j(y)\,\frac{\partial}{\partial a_l}\tilde g(x, a, y)\,\tilde p(y; x, a)\, dy\, da\, dx}{\int_{x \in \mathcal{X}} \int_{a \in \mathcal{A}} \int_{y \in \mathcal{X}} \mu^X_i(x)\,\mu^A_u(a)\,\mu^X_j(y)\,\tilde p(y; x, a)\, dy\, da\, dx} \tag{20}$$

were introduced. In the following subsections, it will be shown how the average partial derivatives of the reward function and the conditional probability density function can be estimated from observed transitions. Then, the partial derivatives of the Q-function can be estimated using the approximations (15) and (18).
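Assuming all model averages and their derivative estimates are already available as arrays, assembling the Q-derivative estimates according to (15) and (18) is a direct summation. The following sketch illustrates this for either the state or the action dimensions; names and shapes are assumptions.

```python
import numpy as np

def q_derivatives(p_hat, g_hat, dp, dg, Q, alpha=0.9):
    """Approximate the Q-derivative estimates following (15) / (18).

    p_hat, g_hat : (I, U, J)     model averages p_ij(u), g_iuj
    dp, dg       : (I, U, J, L)  derivative estimates p^l_ij(u), g^l_iuj
                                 (L = number of state or action dimensions)
    Q            : (I, U)        average Q-values from (12)
    Returns an array of shape (I, U, L) with the estimates Q^l_iu.
    """
    V = Q.max(axis=1)                                  # max_v Q_jv, shape (J,)
    backup = g_hat + alpha * V[None, None, :]          # g_iuj + alpha * max_v Q_jv
    term1 = np.einsum('iujl,iuj->iul', dp, backup)     # sum_j p^l_ij(u) (...)
    term2 = np.einsum('iuj,iujl->iul', p_hat, dg)      # sum_j p_ij(u) g^l_iuj
    return term1 + term2
```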

Partial Derivatives of the Reward Function
The average local reward $g_{iuj}$ and the average local derivatives $g^{x_l}_{iuj}$ and $g^{a_l}_{iuj}$ of the reward function $\tilde g$ can be estimated by adapting the parameters $\hat g_{iuj}$, $\hat g^{x_l}_{iuj}$, $\hat g^{a_l}_{iuj}$ and $\hat g^{y_l}_{iuj}$ of the following linear function to experiences in the vicinity of the center $(\tilde x_i, \tilde a_u, \tilde x_j)$:

$$\check g(x, a, y) := \hat g_{iuj} + \sum_{l \in D^X} \hat g^{x_l}_{iuj}(x_l - \tilde x_{i,l}) + \sum_{l \in D^A} \hat g^{a_l}_{iuj}(a_l - \tilde a_{u,l}) + \sum_{l \in D^X} \hat g^{y_l}_{iuj}(y_l - \tilde x_{j,l}). \tag{21}$$

On the observation of a transition $(x_k, a_k, x_{k+1})$ with reward $\tilde g_k$, the parameters can be adapted by performing a gradient descent with respect to the following error measure:

$$E := \tfrac{1}{2}\big(\tilde g_k - \check g(x_k, a_k, x_{k+1})\big)^2. \tag{22}$$

Let

$$\eta_{iuj,k} := \frac{\mu^X_i(x_k)\,\mu^A_u(a_k)\,\mu^X_j(x_{k+1})}{M_{iuj,k+1}} \tag{23}$$

be the stepsizes for the gradient descent, such that the stepsize for a given center is weighted by the membership of observed transitions in this center and decreases gradually. Based on (22) and (23), the following update rules can be derived ($\forall i, j \in \mathcal{I}$, $u \in \mathcal{U}$):

$$\hat g_{iuj,k+1} = \hat g_{iuj,k} + \eta_{iuj,k}\,\big(\tilde g_k - \check g(x_k, a_k, x_{k+1})\big), \tag{24}$$
$$\hat g^{x_l}_{iuj,k+1} = \hat g^{x_l}_{iuj,k} + \eta_{iuj,k}\,(x_{k,l} - \tilde x_{i,l})\,\big(\tilde g_k - \check g(x_k, a_k, x_{k+1})\big), \qquad \forall l \in D^X, \tag{25}$$
$$\hat g^{a_l}_{iuj,k+1} = \hat g^{a_l}_{iuj,k} + \eta_{iuj,k}\,(a_{k,l} - \tilde a_{u,l})\,\big(\tilde g_k - \check g(x_k, a_k, x_{k+1})\big), \qquad \forall l \in D^A, \tag{26}$$
$$\hat g^{y_l}_{iuj,k+1} = \hat g^{y_l}_{iuj,k} + \eta_{iuj,k}\,(x_{k+1,l} - \tilde x_{j,l})\,\big(\tilde g_k - \check g(x_k, a_k, x_{k+1})\big), \qquad \forall l \in D^X. \tag{27}$$

Note that an alternative update rule for $\hat g_{iuj,k+1}$ was defined in (11).
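The gradient steps (24) to (27) for a single rule $(i, u, j)$ could look as follows. This is a minimal sketch under the assumption that the parameters are stored in a plain dictionary of NumPy arrays; the names are illustrative, not from the paper.

```python
import numpy as np

def update_reward_model(theta, x, a, y, g_obs, centers, eta):
    """One gradient step (24)-(27) on the local linear reward model (21)
    for a single rule (i, u, j).

    theta   : dict with keys 'g0' (scalar) and 'gx', 'ga', 'gy' (vectors)
    centers : tuple (x~_i, a~_u, x~_j) of the rule center
    eta     : stepsize eta_iuj,k from (23)
    """
    x_c, a_c, y_c = centers
    dx, da, dy = x - x_c, a - a_c, y - y_c
    # prediction of the local linear model (21)
    g_check = theta['g0'] + theta['gx'] @ dx + theta['ga'] @ da + theta['gy'] @ dy
    err = g_obs - g_check                     # prediction error driving (22)
    theta['g0'] += eta * err                  # update (24)
    theta['gx'] += eta * err * dx             # update (25)
    theta['ga'] += eta * err * da             # update (26)
    theta['gy'] += eta * err * dy             # update (27)
    return theta
```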

Partial Derivatives of the Conditional Probability Density Function
The average partial derivatives of the conditional probability density function can be approximated as follows:

$$p^{x_l}_{ij}(u) \approx \frac{\int_{x \in \mathcal{X}} \int_{a \in \mathcal{A}} \mu^X_i(x)\,\mu^A_u(a) \int_{y \in \mathcal{X}} \mu^X_j(y)\,\dfrac{\tilde p(y; x + \epsilon e^{N^X}_l, a) - \tilde p(y; x - \epsilon e^{N^X}_l, a)}{2\epsilon}\, dy\, da\, dx}{\int_{x \in \mathcal{X}} \int_{a \in \mathcal{A}} \mu^X_i(x)\,\mu^A_u(a)\, da\, dx}, \tag{28}$$

$$p^{a_l}_{ij}(u) \approx \frac{\int_{x \in \mathcal{X}} \int_{a \in \mathcal{A}} \mu^X_i(x)\,\mu^A_u(a) \int_{y \in \mathcal{X}} \mu^X_j(y)\,\dfrac{\tilde p(y; x, a + \epsilon e^{N^A}_l) - \tilde p(y; x, a - \epsilon e^{N^A}_l)}{2\epsilon}\, dy\, da\, dx}{\int_{x \in \mathcal{X}} \int_{a \in \mathcal{A}} \mu^X_i(x)\,\mu^A_u(a)\, da\, dx}, \tag{29}$$

where $e^d_l$ is a vector of dimension $d$ with components $e^d_{l,i} = \delta_{il}$, $i = 1, \ldots, d$, $\delta$ is the Kronecker symbol and $\epsilon$ is a small constant. Let $L^{x_l,+}_{iu}$ count the number of executions of action $A_u$ in a (fuzzy) state that results from shifting state $X_i$ along dimension $l$ by $\epsilon$, and let $M^{x_l,+}_{iuj}$ count the number of times that action $A_u$ caused a transition from this state to $X_j$. Likewise, let $L^{x_l,-}_{iu}$ be a counter for the number of executions of action $A_u$ in a state that results from shifting state $X_i$ along dimension $l$ by $-\epsilon$, and let $M^{x_l,-}_{iuj}$ count the number of times that $A_u$ caused a transition from this state to $X_j$. On the observation of a transition $(x_k, a_k, x_{k+1}, \tilde g_k)$, these counters can be updated as follows ($\forall i \in \mathcal{I}$, $u \in \mathcal{U}$):

$$L^{x_l,+}_{iu,k+1} \leftarrow L^{x_l,+}_{iu,k} + \mu^X_i(x_k - \epsilon e^{N^X}_l)\,\mu^A_u(a_k), \tag{30}$$
$$M^{x_l,+}_{iuj,k+1} \leftarrow M^{x_l,+}_{iuj,k} + \mu^X_i(x_k - \epsilon e^{N^X}_l)\,\mu^A_u(a_k)\,\mu^X_j(x_{k+1}), \qquad j \in \mathcal{I}, \tag{31}$$
$$L^{x_l,-}_{iu,k+1} \leftarrow L^{x_l,-}_{iu,k} + \mu^X_i(x_k + \epsilon e^{N^X}_l)\,\mu^A_u(a_k), \tag{32}$$
$$M^{x_l,-}_{iuj,k+1} \leftarrow M^{x_l,-}_{iuj,k} + \mu^X_i(x_k + \epsilon e^{N^X}_l)\,\mu^A_u(a_k)\,\mu^X_j(x_{k+1}), \qquad j \in \mathcal{I}. \tag{33}$$

In a similar way, counters $L^{a_l,+}_{iu}$, $M^{a_l,+}_{iuj}$, $L^{a_l,-}_{iu}$ and $M^{a_l,-}_{iuj}$ with the following update rules can be defined ($\forall i \in \mathcal{I}$, $u \in \mathcal{U}$):

$$L^{a_l,+}_{iu,k+1} \leftarrow L^{a_l,+}_{iu,k} + \mu^X_i(x_k)\,\mu^A_u(a_k - \epsilon e^{N^A}_l), \tag{34}$$
$$M^{a_l,+}_{iuj,k+1} \leftarrow M^{a_l,+}_{iuj,k} + \mu^X_i(x_k)\,\mu^A_u(a_k - \epsilon e^{N^A}_l)\,\mu^X_j(x_{k+1}), \qquad j \in \mathcal{I}, \tag{35}$$
$$L^{a_l,-}_{iu,k+1} \leftarrow L^{a_l,-}_{iu,k} + \mu^X_i(x_k)\,\mu^A_u(a_k + \epsilon e^{N^A}_l), \tag{36}$$
$$M^{a_l,-}_{iuj,k+1} \leftarrow M^{a_l,-}_{iuj,k} + \mu^X_i(x_k)\,\mu^A_u(a_k + \epsilon e^{N^A}_l)\,\mu^X_j(x_{k+1}), \qquad j \in \mathcal{I}. \tag{37}$$

Then, the average partial derivatives (28) and (29) can be estimated as follows ($\forall i, j \in \mathcal{I}$, $u \in \mathcal{U}$):

$$\hat p^{x_l}_{ij,k+1}(u) := \frac{1}{2\epsilon}\left[\frac{M^{x_l,+}_{iuj,k+1}}{L^{x_l,+}_{iu,k+1}} - \frac{M^{x_l,-}_{iuj,k+1}}{L^{x_l,-}_{iu,k+1}}\right], \tag{38}$$

$$\hat p^{a_l}_{ij,k+1}(u) := \frac{1}{2\epsilon}\left[\frac{M^{a_l,+}_{iuj,k+1}}{L^{a_l,+}_{iu,k+1}} - \frac{M^{a_l,-}_{iuj,k+1}}{L^{a_l,-}_{iu,k+1}}\right]. \tag{39}$$
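The shifted counters and the finite-difference estimate could be implemented as in the sketch below, here only for the state-space counters (30) to (33) and the estimate (38); the action-space case is analogous with the shift applied to the action. The function names, the in-place array arguments and the assumption of NumPy float arrays are illustrative choices, not from the paper.

```python
import numpy as np

def update_shift_counters(L_plus, M_plus, L_minus, M_minus,
                          mu_X, mu_A, x, a, y, eps, dim):
    """Updates (30)-(33) for one state dimension `dim` (modifies arrays in place)."""
    e = np.zeros(len(x))
    e[dim] = eps                               # epsilon * e^{N_X}_l
    wx_p, wx_m = mu_X(x - e), mu_X(x + e)      # memberships of the shifted state
    wa, wy = mu_A(a), mu_X(y)
    L_plus  += np.outer(wx_p, wa)
    M_plus  += wx_p[:, None, None] * wa[None, :, None] * wy[None, None, :]
    L_minus += np.outer(wx_m, wa)
    M_minus += wx_m[:, None, None] * wa[None, :, None] * wy[None, None, :]

def dp_estimate(L_plus, M_plus, L_minus, M_minus, eps):
    """Finite-difference estimate (38)/(39) of the probability derivatives."""
    with np.errstate(invalid='ignore', divide='ignore'):
        d = (M_plus / L_plus[:, :, None] - M_minus / L_minus[:, :, None]) / (2 * eps)
    return np.nan_to_num(d)   # rules without observations contribute zero
```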

[Figure 1 appears here. Recoverable labels: signal-plan plots of q over t (t up to 20, q up to 50) with the markers 'cycle time', 'request' and 'extension'; a test network with a residential area, an industrial area, a shopping center north, a shopping center south with cinema, and the labels A, B, C, D.]

Figure 1: Example framework signal plan and test scenario.

OPTIMAL SELECTION OF FRAMEWORK SIGNAL PLANS
Framework signal plans define constraints on signal control strategies in traffic networks. A framework signal plan usually comprises individual signal plans for all traffic signals it controls. In the left part of figure 1 an example signal plan is depicted. Green phases of the traffic signal controlled according to this signal plan have to start within the 'request' interval and have to end within the 'extension' interval. Within the leeway given by the signal plans, traffic-dependent optimization may be performed or public transportation may be prioritized.

Sophisticated traffic control systems are able to choose between different framework signal plans depending on traffic conditions. The rules controlling this selection are usually tuned by hand, which is not trivial in complex traffic networks. The task of selecting framework signal plans depending on traffic conditions, however, can be considered as a Markov decision problem, where the state is composed of measurements made on the traffic network and the framework signal plans are the available actions.

In the following, the scenario shown in the right part of figure 1 will be considered. The traffic density is measured at the three points indicated by arrows. It is assumed that three framework signal plans are given. Plan 1 favors 'horizontal' traffic streams and should therefore be used in the morning when people go to work. In plan 2, 'horizontal' and 'vertical' phases have the same length, such that this plan is suitable at noon and in the afternoon when people go shopping and return from work. The third plan favors traffic flows between the residential area and the cinema and should therefore be selected in the evening. During learning the controller receives the rewards

$$\tilde g := -\sum_l \left(\frac{\bar\rho_l}{\rho_{l,\max}}\right)^2, \tag{40}$$

where $\bar\rho_l$ and $\rho_{l,\max}$ give the average and maximum density, respectively, of vehicles in link $l$. The basic idea behind this definition is that the average density in the road network is to be minimized, where homogeneous states in which all roads have a similar density result in larger rewards than inhomogeneous states.
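The reward (40) is a simple function of the measured link densities. A minimal sketch, with illustrative example values that are not from the paper:

```python
import numpy as np

def reward(avg_density, max_density):
    """Reward (40): negative sum of squared normalized link densities."""
    rho = np.asarray(avg_density) / np.asarray(max_density)
    return -np.sum(rho ** 2)

# Example: three measured links, the second one close to saturation.
print(reward([10.0, 45.0, 20.0], [50.0, 50.0, 50.0]))   # -> -1.01
```

Squaring the normalized densities is what penalizes inhomogeneous states: for a fixed total density, the sum of squares is smallest when all links carry a similar load.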

[Figure 2 appears here. Recoverable labels: membership functions is_vs, is_s, is_m, is_h, is_vh over the normalized density ρ/ρ_max ∈ [0, 1]; vertical axis µ_Xi(x) ∈ [0, 1].]

Figure 2: Partitions of sensor signals for PS (left) and F-PS approach (right).

[Figure 3 appears here. Recoverable labels: vertical axis '− (total average density per day)' ranging from −130 to −60, horizontal axis 'number of simulated days' ranging from 0 to 50; one curve for F-PS and one for PS.]

Figure 3: Progress of framework signal plan selection with prioritized sweeping (PS) and fuzzy prioritized sweeping (F-PS).

Two algorithms were applied to this Markov decision problem: training with prioritized sweeping (Moore and Atkeson (1993)), where the state space was discretized by the crisp partition shown in the left part of figure 2 (PS), and training with the fuzzy prioritized sweeping approach proposed in this article (F-PS), where the fuzzy partition shown in the right part of figure 2 was used. The progress of these algorithms is shown in figure 3. For the plot, training was interrupted every two simulated days and the strategy learned until then was applied to the network for one further simulated day. The total rewards gained in the course of these evaluation days are shown in figure 3, where averages over 10 runs are plotted in order to reduce statistical effects. The learning task is obviously solved much faster by the fuzzy model-based approach than by the crisp approach. Moreover, the strategy learned by F-PS is superior to the strategy learned by PS, i.e. the continuous Q-function evidently cannot be approximated sufficiently well by an architecture based on the crisp partition shown in figure 2.
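For completeness, the evaluation protocol described above could be scripted roughly as follows. The agent and simulator interfaces (reset, train, evaluate) are hypothetical placeholders and are not defined in the paper.

```python
import numpy as np

def learning_curve(agent, simulator, total_days=50, eval_every=2, runs=10):
    """Interrupt training every `eval_every` simulated days, evaluate the
    current strategy for one day with learning frozen, and average over runs."""
    curves = []
    for _ in range(runs):
        agent.reset()
        simulator.reset()
        rewards = []
        for _ in range(0, total_days, eval_every):
            simulator.train(agent, days=eval_every)            # learning enabled
            rewards.append(simulator.evaluate(agent, days=1))  # learning frozen
        curves.append(rewards)
    return np.mean(curves, axis=0)   # averaged learning curve, as in figure 3
```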

CONCLUSIONS
In this article a novel fuzzy model-based reinforcement learning approach was presented. The approach represents continuous Q-functions by Takagi-Sugeno models with linear consequents. As such models approximate continuous Q-functions more accurately than architectures based on crisp partitions, and as Q-functions directly represent control knowledge, control strategies learned by the F-PS approach can be expected to be superior to strategies learned by methods based on crisp partitions. The proposed method was applied to the task of selecting optimal framework signal plans depending on traffic conditions. As expected, the proposed method outperforms the crisp PS approach when both are used with partitions of similar granularity.
In the application example presented in this article the actions were discrete. The proposed algorithm, however, also performs well in environments with continuous action spaces, as can easily be verified on small toy examples. Real-world problems with continuous action spaces will be considered in future publications.

REFERENCES
Appl, M.; Palm, R., 1999, "Fuzzy Q-learning in nonstationary environments", Proceedings of the 7th European Congress on Intelligent Techniques and Soft Computing.
Bertsekas, D. P.; Tsitsiklis, J. N., 1996, "Neuro-Dynamic Programming", Athena Scientific.
Bingham, E., 1998, "Neurofuzzy traffic signal control", Master's thesis, Helsinki University of Technology.
Davies, S., 1997, "Multidimensional triangulation and interpolation for reinforcement learning", Advances in Neural Information Processing Systems, Volume 9, pp. 1005-1011, The MIT Press.
Moore, A. W.; Atkeson, C. G., 1993, "Memory-based reinforcement learning: Converging with less data and less time", Robot Learning, pp. 79-103.
Sugeno, M., 1985, "An introductory survey of fuzzy control", Information Sciences 36, pp. 59-83.
Sutton, R. S.; Barto, A. G., 1998, "Reinforcement Learning: An Introduction", The MIT Press.
Takagi, T.; Sugeno, M., 1985, "Fuzzy identification of systems and its application to modeling and control", IEEE Transactions on Systems, Man and Cybernetics, Volume 15, pp. 116-132.
Thorpe, T., 1997, "Vehicle Traffic Light Control Using SARSA", Ph.D. thesis, Department of Computer Science, Colorado State University.
Watkins, C. J. C. H., 1989, "Learning from Delayed Rewards", Ph.D. thesis, Cambridge University.
