Proceedings of the 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems October 9 - 15, 2006, Beijing, China

The Fuzzy Sarsa(λ) Learning Approach Applied to a Strategic Route Learning Robot Behaviour

Theodoros Theodoridis and Huosheng Hu
Department of Computer Science, University of Essex, Wivenhoe Park, Colchester CO4 3SQ, U.K.
Email: [email protected], [email protected]

Abstract - This paper presents a novel Fuzzy Sarsa(λ) Learning (FSλL) approach applied to a strategic route learning task of a mobile robot. FSλL is a hybrid architecture that combines Reinforcement Learning and Fuzzy Logic control. The Sarsa(λ) learning algorithm is used to tune the rule-base of a Fuzzy Logic controller, which has been tested in a route learning task. The robot explores its environment using the fixed experience provided by a discretized Fuzzy Logic controller, and then learns optimal policies to achieve goals in less time and with less error.

Index Terms - Reinforcement Learning, Fuzzy Q-Learning, Fuzzy Logic Controllers, Sarsa(λ), Robot Autonomy.

I. INTRODUCTION

Reinforcement Learning (RL) is widely used for understanding and automating goal-directed learning as well as decision-making [1]. According to R.S. Sutton et al., the interaction between a learning agent and its environment, in terms of states, actions, rewards, and value functions, aims to achieve an efficient search in the space of policies. Kaelbling and Littman presented the RL problem faced by agents that learn behaviours through trial and error within dynamic environments [2]. They discuss central issues of reinforcement learning, including the trade-off between exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, and constructing empirical models to accelerate learning. A related approach, Sarsa(λ), is considered one of the most powerful methods in RL. Agents with built-in Sarsa(λ) algorithms initially explore the state space by obtaining instant experience. Learning with an on-policy method like Sarsa is promising, since it can be more effective than off-policy methods like Q-learning. An Adaptive Heuristic Critic (AHC) RL engine has been used to refine an FLC engine, as presented in [3][7]. The actor part of the AHC used in this project is a conventional FLC in which the parameters of the input membership functions are learnt from an immediate internal reinforcement signal. This internal reinforcement signal derives from a prediction of the evaluation value of a policy and the external reinforcement signal. The evaluation value of a policy is learnt by temporal difference (TD) learning in the critic part, which is also represented by an FLC. Furthermore, Glorennec has attempted to fuse two model-free algorithms, Q-Learning and Fuzzy Q-Learning, for a broom balancing task [8]. Inspired by [2][3][7][8][9], this paper combines Reinforcement Sarsa(λ) Learning (RSL)


with FLCs to form the FSλL hybrid architecture in order to achieve faster learning. The RSL algorithm was chosen for the route learning task because it allows a robot agent, under unsupervised control, to passively observe the state-action space and the immediate rewards encountered, checking the updated experience drawn from the fuzzy rule-bases and integrating this information so as to achieve predefined goals successfully. Once the learnt policy, along with the acquired experience, is "good" enough to control the robot agent, the learning error and the time steps required to achieve goals are reduced.

The rest of the paper is organized as follows. Section II presents the route learning task and the proposed hybrid learning architecture. Section III describes the FLC architecture and shows how the learning process is accelerated by fetching artificial experience. Section IV presents the FSλL controller along with the core FSλL algorithm. The experimental results and analysis are presented in Section V to demonstrate the learning performance of the proposed hybrid approach. Finally, conclusions and future improvements are briefly given in Section VI.


II. HYBRID LEARNING ARCHITECTURE

A. Strategic Route Learning Task

The FSλL approach presented in this paper is mainly used to accelerate the learning process of a robot agent. It has been tested on a route learning task in which the robotic agent has to learn how to find an optimal route toward a goal point, as shown in Fig. 1. While searching the environment, the only information given to the robot agent is the coordinates of the goal. Immediate rewards are collected to define the agent's experience, which is learnt by trial. From this artificial experience, built from immediate rewards and the pre-given fuzzy experience on how to perform discretized actions under certain situations, the robot agent constructs learning strategies which help it to cope with tasks in general and the route learning task in particular.

Fig. 1 The route learning task performed by a strategic optimal policy.
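For concreteness, the reward scheme used later in the experiments (Table III: goal 1, obstacles -1, otherwise 0) can be sketched as below; the two boolean flags are hypothetical inputs assumed to come from the robot's own sensing, not part of the authors' code.

```python
def immediate_reward(at_goal: bool, collided: bool) -> float:
    """Immediate reward following the scheme listed in Table III:
    +1 at the goal area, -1 on hitting a wall/obstacle, 0 otherwise."""
    if at_goal:
        return 1.0
    if collided:
        return -1.0
    return 0.0
```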

B. Fuzzy Sarsa(λ) Learning Architecture

Inspired by [2][3][7][8][9], the FSλL model is described by the learning architecture in Fig. 2, in which the FSλL is connected to a static environment that feeds the robot agent with state information. According to the perceived states, the agent applies actions to the environment, which responds with rewards. This is the state-action-reward cycle that takes place in all RL algorithms. The FSλL architecture is based on the conventional Sarsa(λ) algorithm and FLCs. Initially the Sarsa(λ) algorithm prepares the quintuple s, a, r, s', a' and passes all this information, through the δt error, to the defuzzifier engine. The defuzzifier engine undertakes the updating of the Q-function. The updated Q-function is propagated back to the Sarsa(λ) algorithm and then, according to the updated Q(s, a) values, the Sarsa(λ) algorithm selects an action with probability 1−ε (ε-greedy policy) and the agent performs that action in the environment. Immediately after that, a reward r evaluates the action performed and the algorithm receives the next state s'. The next action a' is not selected again by the ε-greedy policy; an FLC generates the next action according to the state s' perceived from the environment. This last step ends the FSλL algorithm's cycle. In the following algorithmic cycles the ε-greedy policy selects among fuzzy actions by tuning the rule-base.

C. Reinforcement Sarsa(λ) Learning

The learning controller for the route learning task is based on Sarsa(λ) learning, which can explore the state space and exploit it at the same time by obtaining instant experience. Thus it does not have to wait until the goal area is reached in order to propagate back the collected rewards and then build a "delayed" experience. The idea in Sarsa(λ) is to apply the TD(λ) prediction method to state-action pairs rather than to states only. Obviously, an eligibility trace is needed for each state-action pair, $e_t(s, a)$, described by the following update equation:

$$Q_{t+1}(s, a) = Q_t(s, a) + \alpha\,\delta_t\,e_t(s, a) \qquad (1)$$

[Fig. 2: block diagram — within the agent, the Reinforcement Sarsa(λ) Learning algorithm, the Sarsa(λ) defuzzifier and the FLC action selector exchange δt, Q(s, a) and a'(t), while the agent and the environment exchange s(t), a(t) and r(t).]

Fig. 2 Fuzzy Sarsa(λ) Learning architecture.

where α is the learning gain (0 ≤ α ≤ 1). If α decreases sufficiently slowly towards zero, then the value function converges toward the optimal policy [8]. δt is the TD error, which is described by:

$$\delta_t = \begin{cases} (1-\alpha)\,Q_t(s_t, a_t) + \alpha\,r_t & \text{if } r_t = -1 \\ r_{t+1} + \gamma\,Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t) & \text{otherwise} \end{cases} \qquad (2)$$

whereas for the replacing eligibility traces we have:

$$e_t(s, a) = \begin{cases} 1 & \text{if } s = s_t \text{ and } a = a_t \\ \gamma\lambda\,e_{t-1}(s, a) & \text{otherwise} \end{cases} \qquad (3)$$
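As an illustration only, a minimal tabular sketch of one Sarsa(λ) step following Eqs. (1)-(3), with the ε-greedy selection described in the text, might look as follows; the dictionary-based tables and the direct use of the r = -1 branch of Eq. (2) are assumptions, not the authors' implementation.

```python
import random
from collections import defaultdict

alpha, gamma, lam, epsilon = 0.01, 0.95, 0.9, 0.003   # values taken from Table III
Q = defaultdict(float)    # Q[(s, a)] action values
E = defaultdict(float)    # replacing eligibility traces e[(s, a)]

def epsilon_greedy(s, actions):
    """Greedy action with probability 1 - epsilon, random action otherwise."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa_lambda_step(s, a, r, s_next, a_next):
    """One on-policy update over all traced state-action pairs, Eqs. (1)-(3)."""
    if r == -1:                                     # special case of Eq. (2)
        delta = (1 - alpha) * Q[(s, a)] + alpha * r
    else:                                           # ordinary TD error
        delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    E[(s, a)] = 1.0                                 # replacing trace, Eq. (3)
    for key in list(E.keys()):
        Q[key] += alpha * delta * E[key]            # Eq. (1)
        E[key] *= gamma * lam                       # trace decay, Eq. (3)
```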

Sarsa(λ) is an on-policy algorithm, which approximates Q^π(s, a) [3], i.e. the action values for the current policy π. In our case the ε-greedy method is used for action selection: with probability 1−ε the action with the maximum Q(s, a) value is chosen, otherwise the selection mechanism switches to a random action [4]. The use of eligibility traces can improve the efficiency of control algorithms such as Sarsa [1]. One of the most significant characteristics of the Sarsa algorithm is that it credits each action taken during the agent's locomotion, so that on-line learning can be achieved. Each eligibility trace constitutes a short-term memory of how frequently a state-action pair is visited, or of the triggered rules in the case of Reinforcement Fuzzy Learning [2]. Each trace decays exponentially until the next update, which can occur through a new visit to the corresponding state [8].

III. FUZZY LOGIC CONTROLLER

Two different FLC systems are used in this project. The first FLC classifier is used as a mechanism providing fixed experience based on discretized input and output spaces, which helps the robot agent to achieve better action performance. The second FLC classifier is used to refine the updated Q-function (1) so that the learning is accelerated. The input state space of the discretized FLC algorithm uses five discrete laser vectors: LLV, LFLV, CLV, RFLV and RLV, each of which is assigned one of five fuzzy variables: Left, LFrt, Front, RFrt and Right. Each variable is divided into three areas: Ner (near), Med (medium) and Far (far). These distinct areas represent the distance from the agent's body to the detected object/obstacle. The output action space of the FLC has been divided into five discrete angle actions allocated at the central point of each laser vector; Act0 to Act4 correspond to the five action variables Lws, LForw, Forw, RForw and Rws. Fig. 3 presents the five laser vectors, each of which is divided into the three range areas Far, Med and Ner. Each active laser vector can generate one of the state weights; if more than one laser vector detects an object, all the generated state weights are summed.
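To make the discretized state concrete, the sketch below sums the weights of the laser vectors that currently detect an object and classifies each reading into the Ner/Med/Far areas; the range thresholds and the "detection = Ner or Med" criterion are illustrative assumptions, while the weights are those listed in Fig. 3.

```python
# State weights of the five laser vectors, as listed in the legend of Fig. 3.
VECTOR_WEIGHTS = {"LLV": 1, "LFLV": 2, "CLV": 4, "RFLV": 6, "RLV": 8}

# Illustrative range boundaries in mm; the paper does not give the exact values.
NER_MAX_MM, MED_MAX_MM = 500.0, 1500.0

def range_area(distance_mm: float) -> str:
    """Classify a laser-vector reading into one of the three range areas."""
    if distance_mm <= NER_MAX_MM:
        return "Ner"
    if distance_mm <= MED_MAX_MM:
        return "Med"
    return "Far"

def static_state(readings: dict) -> int:
    """Sum the weights of every vector that detects an object (assumed: Ner or Med)."""
    return sum(VECTOR_WEIGHTS[name]
               for name, dist in readings.items()
               if range_area(dist) != "Far")

# Example: obstacle close on the right front and at medium range ahead -> 4 + 6 = 10.
print(static_state({"LLV": 3000, "LFLV": 2500, "CLV": 1200, "RFLV": 400, "RLV": 2800}))
```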



[Fig. 3: the robot's five laser vectors — LLV: Left Laser Vector, LFLV: Left Front Laser Vector, CLV: Central Laser Vector, RFLV: Right Front Laser Vector, RLV: Right Laser Vector — with state weights 1, 2, 4, 6 and 8 respectively, covering the angular sectors [90.0, 67.5], [67.5, 22.5], [22.5, -22.5], [-22.5, -67.5] and [-67.5, -90.0] deg. Each vector is divided into the Far, Med and Ner range areas, and the actions Act0 to Act4 are allocated at the central points of the vectors.]

Fig. 3 Agent's laser vectors in 5 different areas.

The FLC's Fuzzy Inference Engine (FIE) selects each discrete action according to the fuzzy antecedents of the input space by consulting a fuzzy rule-base, which processes the input laser data to generate an output fuzzy consequent data set.

A. Membership Functions

Here, two types of membership functions are used: the trapezoid µf(x) for the fuzzy input space and the singleton µf(x) for the fuzzy output space. The maximum value of the membership functions, for both the trapezoid and the singleton µfs, is defined by the height (4), which equals 1:

$$h^i_{F_j} := \max_x \{\mu_{F^i_j}(x_j)\}, \quad i = 1, \ldots, N \qquad (4)$$

Fuzzy sets $F_{j=1,\ldots,n}$ are used to describe elements of uncertainty, where N is the number of rules and n is the dimension of the input space [5].

1) Trapezoid µf(x): The support area of a trapezoid µf is defined by the section whose values are greater than zero:

$$s^i_{F_j,\mathrm{trapezoid}} := \{x_j : \mu_{F^i_j,\mathrm{trapezoid}}(x_j) > 0\} \quad \forall\, a \le x \le d \qquad (5)$$

The core area and the geometric representation of a trapezoid µf are defined by the section whose values have the maximum degree of membership to the fuzzy set $F^i_{j,\mathrm{trapezoid}}$, obtained by taking the centre of gravity of the fuzzy set $F_j$; this is described by the quantities:

$$c^i_{F_j,\mathrm{trapezoid}} := \{x_j : \mu_{F^i_j,\mathrm{trapezoid}}(x_j) = 1\} \quad \forall\, b \le x \le c \qquad (6)$$

$$\mu_F(x) = \begin{cases} 0, & x \le a \\ (x-a)/(b-a), & x \in (a, b) \\ 1, & x \in (b, c) \\ (d-x)/(d-c), & x \in (c, d) \\ 0, & x \ge d \end{cases} \qquad (7)$$

2) Singleton µf(x): The singleton µf does not have a support area; it can only be characterized by its core point a:

$$c^i_{F_j,\mathrm{singleton}} := \{x_j : \mu_{F^i_j,\mathrm{singleton}}(x_j) = 1\} \quad \forall\, a = x \qquad (8)$$
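A small sketch of the two membership-function types of Eqs. (4)-(8), a trapezoid for the input space and a singleton for the output space, is given below; the breakpoints in the example are placeholders, not values from the paper.

```python
def trapezoid_mu(x: float, a: float, b: float, c: float, d: float) -> float:
    """Trapezoidal membership function of Eq. (7): support (a, d), core [b, c]."""
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0                      # core area of Eq. (6), height of Eq. (4)
    return (d - x) / (d - c)

def singleton_mu(x: float, apex: float, tol: float = 1e-9) -> float:
    """Singleton membership function of Eq. (8): 1 only at its core point."""
    return 1.0 if abs(x - apex) < tol else 0.0

# Example: a "Med" distance set with placeholder breakpoints in mm.
print(trapezoid_mu(900.0, a=400.0, b=600.0, c=1200.0, d=1600.0))   # -> 1.0 (inside the core)
```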

B. Fuzzification and Fuzzy Logical Operators

The total number of fuzzy rules, µ, is the product of the number of fuzzy sets of all input state variables [5][7]:

$$\mu_i(x) = \prod_{j=1}^{n} \mu_{F^i_j}(x_j) \qquad (9)$$

The truth value for the i-th rule activated by the input vector x is calculated by Mamdani's minimum fuzzy implication:

$$\mu_{F_A \cap F_B}(x) = \min[\mu_{F_A}(x), \mu_{F_B}(x)] \qquad (10)$$

For this implication, one type of T-norm operator is adopted, the probabilistic PAND:

$$\mathrm{PAND}(a, b) = a \cdot b \qquad (11)$$

C. Fuzzy Inference System

The Fuzzy Inference Engine (FIE), according to the fuzzy rule-base, uses the membership functions in order to map some fuzzy sets onto other fuzzy sets [6]. A sample fuzzy rule is given below [5]:

$$\Re_i: \ \text{if } x_j \text{ is } A^i_j \text{ and } y_j \text{ is } B^i_j \text{ then } z = c_i$$

The above fuzzy rule is implemented by a fuzzy implication $R_i$ as follows:

$$R^i_j(x_j, y_j, z) = [A^i_j(x_j) \text{ and } B^i_j(y_j)] \rightarrow C_i(z) \qquad (12)$$

D. Height Singleton Defuzzification

Height singleton defuzzification is used here because it reduces the computation cost and is useful for symmetric consequents, which makes the defuzzification process much faster. Because of the height singleton method, however, the actions performed by the robot agent correspond to five very distinct angle values, so the action space is limited; on the other hand, the output action responses are faster. The height defuzzifier evaluates µi(x), takes the apexes of each µf and computes the output of the FLC, as shown in the following general form:

$$z(x) = \left(\sum_{i=1}^{n} \mu_{F^i_j}(x_j) \cdot \mathrm{apex}_i\right) \Big/ \left(\sum_{i=1}^{n} \mu_{F^i_j}(x_j)\right) \qquad (13)$$

where $\mathrm{apex}_i$ denotes the output recommended by rule i and associated with the activation of the rule $R^{(i)}$ [5], while $\mu_{F^i_j}(x_j)$ denotes the weight of rule i.
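Combining Eqs. (9) and (13), the sketch below computes each rule's firing strength as the product of its antecedent memberships and then applies the height (weighted-average) defuzzifier over the rule apexes; the (antecedent functions, apex) rule encoding is an assumed layout, and in this task each apex would be one of the five discrete steering angles Act0-Act4.

```python
from math import prod

def firing_strength(memberships):
    """Degree of firing of a rule as the product over its antecedents, Eq. (9)."""
    return prod(memberships)

def height_defuzzify(rules, inputs):
    """Height/singleton defuzzification of Eq. (13).

    `rules` is a list of (antecedent_mfs, apex) pairs, where antecedent_mfs holds
    one membership function per input dimension (an assumed encoding)."""
    numerator, denominator = 0.0, 0.0
    for antecedent_mfs, apex in rules:
        mu_i = firing_strength([mf(x) for mf, x in zip(antecedent_mfs, inputs)])
        numerator += mu_i * apex
        denominator += mu_i
    return numerator / denominator if denominator > 0.0 else 0.0
```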

IV. FUZZY SARSA(λ) LEARNING (FSλL) CONTROLLER

A. Fuzzy Rule-Bases in Sarsa(λ) Learning

Similarly to Q-learning, a typical RFL system represents its rule-base by connecting the action selection and the associated q-values to form a new set of fuzzy rules, i.e. the rules have the form [2][7][8][9]:

$$\Re^i_{FLC}: \ \text{if } s_j \text{ is } F^i_j \text{ then } a = c_i \qquad \rightarrow \text{for action selection} \qquad (14)$$

$$\Re^i_{RFL}: \ \text{if } s_j \text{ is } F^i_j \text{ and } a = c_i \text{ then } Q(s, a) = q(s, a) \qquad \rightarrow \text{for updating the } Q\text{-function} \qquad (15)$$

The FLC form represents a typical fuzzy rule: for a given state s in a fuzzy set F, the output a becomes c. The RFL form, on the other hand, represents a fuzzy rule in which the learning agent can choose an action a for each rule i from the action set A by a[i, k[i]], while q[i, k[i]] denotes its corresponding q-value [10]. The learning process is undertaken to determine the best set of rules by optimising the future reinforcements. The initial rule-base has N rules as follows:

$$\Re^i_{RFL}: \ \text{if } s_j \text{ is } F^i_j \text{ then } a[i, 1] \text{ with } q[i, 1], \ \text{or}$$
$$\text{if } s_j \text{ is } F^i_j \text{ then } a[i, 2] \text{ with } q[i, 2], \ \ldots, \ \text{or}$$
$$\text{if } s_j \text{ is } F^i_j \text{ then } a[i, k[i]] \text{ with } q[i, k[i]]$$

For every rule i, let $k[i] \in \{1, \ldots, |A|\}$ be the subscript of the available action chosen by an EEP [8][9].

B. Fuzzy Value Function Approach

The greedy policy is the one used by an FIS to select the value of a state vector that can obtain better results. Let q[i, max k[i]] be the maximum q-value for rule i; the value function then becomes:

$$V_t(s_{t+1}) = \left(\sum_{i=1}^{n} \mu_{F^i_j}(s_j) \times q[i, \max k[i]]\right) \Big/ \left(\sum_{i=1}^{n} \mu_{F^i_j}(s_j)\right) \qquad (16)$$

C. Updating the q-Values

Recall that the TD error δt in Q-learning describes the difference between the Q value of the next state-action pair and the Q value of the current state-action pair. Thus, the error signal used to update the action q-values is given by:

$$\delta_t = \Delta Q(s_t, a_t) = r_{t+1} + \gamma\,Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t) \qquad (17)$$

An effective method for appropriate action selection is to perform gradient descent on the expected reward and to integrate the rule-base with back-propagation so that the immediate rewards are maximized. This class is called the REINFORCE algorithm, including linear reward-inaction as a special case [2]. The gradient algorithm is used to update the q-values incrementally [8] by the following equation:

$$\Delta q[i, k[i]] = \alpha \times \Delta Q \left[\frac{\partial Q(s_t, a_t)}{\partial q[i, k[i]]}\right] = \alpha \times \Delta Q \left[\mu_{F^i_j}(s_j) \Big/ \sum_{i=1}^{n} \mu_{F^i_j}(s_j)\right] \qquad (18)$$

where α is the learning rate.

D. Updating the e-Values

RFL algorithms obtain more effective and faster learning when they are combined with eligibility traces (λ) [8]. For the replacing eligibility traces used in our analysis, we have:

$$e[i, k[i]] = \begin{cases} 1 & \text{if } k[i] = a_t \\ \gamma\lambda\,e[i, k[i]] & \text{otherwise} \end{cases} \qquad (19)$$

Thus, the updating formula for the q-value becomes:

$$q[i, k[i]] = q[i, k[i]] + \Delta q[i, k[i]] \times e[i, k[i]] \qquad (20)$$
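A possible per-rule bookkeeping for Eqs. (18)-(20) is sketched below: the TD error of Eq. (17) is shared among the rules in proportion to their normalised firing strengths and gated by the replacing traces of Eq. (19); the q[i][k] / e[i][k] list-of-lists layout is an assumed encoding, not the authors' data structure.

```python
def update_rule_q_values(q, e, mu, k_i, a_t, delta,
                         alpha=0.01, gamma=0.95, lam=0.9):
    """Per-rule q-value update following Eqs. (18)-(20).

    q, e  : q[i][k] and e[i][k] for rule i and local action index k
    mu    : firing strength mu[i] of each rule for the current state, Eq. (9)
    k_i   : action index k[i] chosen in each rule by the EEP
    a_t   : index of the action actually executed at time t
    delta : TD error of Eq. (17)
    """
    total_mu = sum(mu) or 1.0
    for i in range(len(q)):
        k = k_i[i]
        # Replacing eligibility trace, Eq. (19)
        e[i][k] = 1.0 if k == a_t else gamma * lam * e[i][k]
        # Eq. (18): TD error weighted by the rule's normalised firing strength
        dq = alpha * delta * (mu[i] / total_mu)
        # Eq. (20): q-value update gated by the trace
        q[i][k] += dq * e[i][k]
```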

E. Q-Function and FIS Representation

A FIS is used to represent the set of actions A(st) associated with the Q-function:

$$s_t \rightarrow z = Q(s, a) = FIS(s, a), \quad \forall\, s \in S_t \mid s = x \qquad (21)$$

In an RFL system the evaluation of an action is obtained by defuzzifying the input state vector s (where s = x). The Takagi-Sugeno defuzzification is used for the calculation of the action selection (the a(s) function) as well as for updating the Q-function (the Q(s, a) function), as shown below [8][9]:

$$a(s_t) = \left(\sum_{i=1}^{n} \mu_{F^i_j}(s_j) \times \mathrm{apex}_i\right) \Big/ \left(\sum_{i=1}^{n} \mu_{F^i_j}(s_j)\right)$$

$$Q(s_t, a_t) = \left(\sum_{i=1}^{n} \mu_{F^i_j}(s_j) \times q[i, k[i]]\right) \Big/ \left(\sum_{i=1}^{n} \mu_{F^i_j}(s_j)\right) \qquad (22)$$

Recall that both the a(s) and Q(st, at) functions use singleton-height defuzzification to calculate the selection of each action.

F. FSλL Algorithm

Table I shows the Fuzzy Sarsa(λ) Learning algorithm, which is influenced by [2][3][7][8][9]. According to Fig. 2, the q-values are zeroed to avoid distorting the training, since they are not considered essential for the first stage of the learning process, as also happens in Q-learning [8][10]. Exploration can be achieved by performing a set of arbitrary actions at each time step t. These actions are automatically generated through reinforcement signals, and the resulting experience can be exploited by the agent in future steps. An Exploration/Exploitation Policy (EEP) is then used to select actions. After the Q-function and the eligibility traces (the e-function) are initialized to zero, the episode counter starts the learning procedure by executing a number of functions. A subsumption architecture which supervises the status of the FSλL algorithm decides whether to finalize the currently running episode under the following circumstances: (i) the number of steps exceeds a predefined threshold, (ii) the goal is achieved, or (iii) an unexpected event comes into the foreground.

TABLE I
FSλL ALGORITHM

Initialize Q(s, a) to zero, for all s, a
Initialize e(s, a) to zero, for all s, a
Repeat for each episode:
{
    Init a static and a fuzzy state
    Repeat for each step of episode:
    {
        Select a static action a using the e-greedy policy
        Take the static action a
        Receive a reward r
        Get the next state s'
        Compute the next action a' derived from the FLC
        Update Q-function:
        {
            dt error
            Δq gradient
            e traces
            defuzzify Q(st, at)
        }
        Restore the pairs: (s, a) ← (s', a')
    }
} Until s is terminal (goal area)

The episode's body performs the state capturing by taking two state snapshots: (1) the static state, denoting the current discretized sensory value of the environment, evaluated with fixed weights, at the agent's current position; and (2) the fuzzy state, denoting the fuzzy evaluation of the weights taken by the static consequent. Both states are captured using the laser scanner. Thereafter, the ε-greedy policy is used to select the action with the maximum Q-value according to the current static state. The selected action is executed by the TakeAction( ) engine, which is a low-level discretized controller function. The performance of the action taken, i.e. whether the goal is achieved or not, is evaluated by a reward, and according to that the next state is computed. Immediately after, the next action is calculated by the action-defuzzifier engine, which chooses an action according to the fuzzified states as well as the rule-base. At the end of this basic procedure, which is a preparation for the Q-function, all the data derived from the states and actions are evaluated by the Q-function. At the same time, the Q-function evaluates itself through the Q-defuzzifier engine, which updates the Q-content. Before the value is defuzzified, the Q-function calculates the TD error, which is an evaluation of the difference between the current and the next state-action pair, i.e. the q-gradient. This is a back-propagation reinforce algorithm [8], and the eligibility traces are snapshots of each state visited. Finally, the current states and actions are replaced by the next states and actions.
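The per-episode control flow of Table I, as just described, can be summarised in the sketch below; `env`, `policy`, `flc` and `update_q` are hypothetical stand-ins for the engines named in the text (laser-scanner state capture, ε-greedy selector, TakeAction( ), action defuzzifier and Q-defuzzifier), so this is an outline under those assumptions rather than the authors' code.

```python
def run_episode(env, policy, flc, update_q, max_steps=500):
    """Skeleton of one FSλL episode, mirroring Table I."""
    s = env.static_state()                      # static state snapshot (laser scanner)
    fuzzy_s = env.fuzzy_state()                 # fuzzy state snapshot
    a = policy(s)                               # static action via the e-greedy policy
    for _ in range(max_steps):                  # (i) predefined step threshold
        env.take_action(a)                      # low-level discretized controller
        r = env.reward()                        # evaluate the action taken
        s_next, fuzzy_s_next = env.static_state(), env.fuzzy_state()
        a_next = flc(fuzzy_s_next)              # next action from the FLC rule-base
        update_q(s, a, r, s_next, a_next)       # dt error, q-gradient, e traces, defuzzify
        s, a, fuzzy_s = s_next, a_next, fuzzy_s_next   # restore the pairs
        if env.goal_reached() or env.unexpected_event():   # (ii) goal, (iii) unexpected event
            break
```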

V. EXPERIMENTAL RESULTS AND ANALYSIS

The FSλL route learning task has been tested in a simulated environment: arena A1, shown in Fig. 1, without any training courses, while its I/O parameters are shown in Table II. In this arena a triangular obstacle is located in the middle and there is only one path out to the goal area. Testing the route learning task, it has been observed that the agent's general behaviour was stable, with the learning performance improving by around ~90%: from ~120 time steps in the 1st episode to time steps fluctuating around ~20 by the 24th episode. Ultimately, the agent adopted an optimal policy from the 3rd episode, where it was able to find the goal area in ~20 time steps (plays). The same experiment was repeated with three training episodes, in which a human tutor showed the agent the way to the goal area. From the 4th episode onwards, the agent's behaviour was improving and stable, with practically the same learning performance as in the previous experiment. The next experiment took place in the simulated arena A2, in which a square obstacle was placed in the middle of the arena and two paths to the goal area were accessible. The agent's learning performance observed in this experiment was almost the same as in the two previous experiments. Finally, the conventional Reinforcement Sarsa(λ) Learning algorithm was tested in arena A1 without training courses. The agent's behaviour was unstable but improving, with the learning rate from the 1st to the 23rd episode fluctuating around ~50%, while an optimal policy was adopted from the 15th episode with plays around ~45.

A. Learnt Trajectories

The robot agent's performance in the route learning task is presented in Fig. 4, including the 1st episode and three more randomly selected episodic tasks. In this test the robot agent performed 24 episodes with the FSλL controller, without external training by an expert tutor (user). Table III presents the learning parameters of the FSλL controller. Looking at the 1st episode of Fig. 4, it can be seen that the robot's behaviour is unstable, which means that the robot agent initially explores the arena until it finds a path to the goal area. In the 6th episode the robot spends fewer time steps to reach the goal area because of the experience obtained in the previous five episodes. As can be seen on the left side of the arena, the learning performance has improved and the uncertainty which had produced incorrect actions in previous episodes has been reduced dramatically. In the 11th episode it is clear that the robot agent has formed a path, which shows that learning has clearly improved; the uncertain actions have been reduced and the agent has obtained precise knowledge about where the goal area is located.

TABLE II
RULE TABLE OF THE ROUTE LEARNING TASK (I/O VARIABLE CORRESPONDENCE)
Rule   State Weights   Angle
1      Left            Lws
2      LFrt            LForw
3      Front           Forw
4      RFrt            RForw
5      Right           Rws

TABLE III
LEARNING PARAMETERS FOR THE ROUTE LEARNING TESTING
No.   Parameter Name   Explanation                    Value
1     ALPHA            Learning rate                  0.01
2     EPSILON          Percentage of exploration      0.003
3     GAMMA            Discount factor                0.95
4     LAMBDA           Lambda parameter               0.9
5     DISCFACTOR       Environment discrete factor    500
6     Reward1          Goal                           1
7     Reward2          Obstacles                      -1
8     Reward3          Otherwise                      0

[Fig. 4: four X[mm]-Y[mm] trajectory plots for episodes 1, 6, 11 and 22, each showing the start point area and the goal point area in arena A1.]

Fig. 4 FSλL algorithm performance (24 episodes in arena A1).


B. Collected Rewards

Fig. 5 shows the collected rewards in a 2D space, taken over the agent's performance in 24 episodes in arena A1. In this reward representation, the rewards are allocated according to the arena's formation (the locations of the walls and obstacles). This can be verified by comparing the trajectories of the 11th and 22nd episodes in Fig. 4. The rewards reveal the most visited states and the exact locations the agent interacted with. The rewards form a curved trajectory which starts from the left-side area and ends at the right.

[Fig. 5: scatter of the collected rewards in the X[mm]-Y[mm] plane.]

Fig. 5 Collected rewards in the 23rd episode − arena A1.

C. Learning Performances

As shown in the play performance graph, Fig. 6(a), the number of plays was reduced dramatically during the first three episodes. Over the next 17 episodic tasks the plays remained stable, fluctuating around 20 per episode, while in the 19th episode there was a sharp drop to fewer than 10. Fig. 6(b) shows the reward performance (learning error) graph. The rewards collected in the 1st episode were close to 50, meaning that bumps into walls and obstacles were frequent. The optimal policy can now drive the robot agent to the goal area without meeting walls or obstacles. If a new dynamic obstacle were placed on the agent's learnt path, the FSλL algorithm would reconfigure its parameters by adapting itself to the new condition. Gradually, the agent would be able to learn about the new obstacle and consequently find the goal area, a fact which demonstrates the system's adaptability. Finally, the average performance of the system over the 24 episodes fluctuates around 20 plays, while the collected rewards were approximately 12.

[Fig. 6: (a) plays per episode and (b) collected rewards per episode, over the 24 episodes.]

Fig. 6 Agent's learning performance: (a) play performance, (b) reward performance.

VI. CONCLUSION AND FUTURE WORK

The Fuzzy Sarsa(λ) Learning algorithm presented in this paper improves the learning performance by about 40% compared with the conventional Sarsa(λ) algorithm on the route learning task, which has been tested in different arenas, both with and without training courses. Furthermore, the FSλL approach can also be applied to other robot tasks such as visual tracking, emotional behaviours, motion control, etc. It has been found that an FSλL controller can drive a robot agent to achieve goals by accelerating the learning rate and by producing optimal control policies, since it learns control policies more quickly than moderately experienced programmers can hand-code them. However, there are still many open questions, e.g. how an FSλL system would behave if a more effective policy method than ε-greedy were established. Instead of the ε-greedy policy for action selection, the core FSλL algorithm can adopt more effective policy methods such as softmax, Genetic Algorithms (GA), or Artificial Neural Networks (ANN) in order to improve the quality of the actions selected for particular state cases. Further improvements can be made, especially to the low-level discretized controller, in order to maintain more stable learning behaviours.
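As one concrete example of the alternative policies mentioned above, a softmax (Boltzmann) selector could replace ε-greedy; the temperature value below is an illustrative assumption and the sketch is not tied to the authors' implementation.

```python
import math
import random

def softmax_action(q_values, temperature=0.5):
    """Boltzmann/softmax selection: higher-valued actions are chosen more often,
    but every action keeps a non-zero probability (temperature is assumed)."""
    preferences = [math.exp(q / temperature) for q in q_values]
    total = sum(preferences)
    threshold, cumulative = random.random() * total, 0.0
    for action, p in enumerate(preferences):
        cumulative += p
        if cumulative >= threshold:
            return action
    return len(q_values) - 1

print(softmax_action([0.1, 0.4, 0.2, 0.0, 0.3]))   # usually returns action 1
```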

REFERENCES

[1] R.S. Sutton and A.G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998, pp. 3-83, 133-158, 163-191.
[2] L.P. Kaelbling and M.L. Littman, Reinforcement Learning: A Survey, Journal of Artificial Intelligence Research 4, 1996, pp. 237-285.
[3] W.D. Smart and L.P. Kaelbling, Effective Reinforcement Learning for Mobile Robots, Proc. of IEEE Int. Conf. on Robotics and Automation, 2002, Vol. 4, pp. 3404-3410.
[4] D. Gu and H. Hu, Reinforcement Learning of Fuzzy Logic Controller for Quadruped Walking Robots, Proc. of 15th IFAC World Congress, Barcelona, Spain, July 21-26, 2002.
[5] T.M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997, pp. 367-388.
[6] J.M. Mendel, Fuzzy Logic Systems for Engineering: A Tutorial, Proc. of the IEEE, Vol. 83, No. 3, March 1995, pp. 345-377.
[7] R.C. Arkin, Behavior-Based Robotics, The MIT Press, 1998, pp. 310-320, 342-349.
[8] D. Gu, H. Hu, and L. Spacek, Learning Fuzzy Logic Controller for Reactive Robot Behaviours, Proc. of IEEE/ASME Int. Conf. on Advanced Intelligent Mechatronics, Kobe, Japan, July 20-24, 2003, pp. 46-51.
[9] P.Y. Glorennec, Reinforcement Learning: An Overview, INSA de Rennes/IRISA, Aachen, Germany, September 2000, pp. 17-33.
[10] P.Y. Glorennec and L. Jouffe, Fuzzy Q-Learning, Proc. of the 6th IEEE International Conference on Fuzzy Systems, 1997 (www.esiea-recherche.esiea.fr).

