In: Mobile Robots: New Research Editor: John X. Liu, pp. 223-260
ISBN: 1-59454-359-3 © 2005 Nova Science Publishers, Inc.
Chapter 6
BEHAVIOR LEARNING AND BEHAVIOR FUSION IN AUTONOMOUS NAVIGATION OF MOBILE ROBOTS
Cang Ye*, Department of Applied Science, University of Arkansas at Little Rock, Little Rock, AR 72204, USA
Danwei Wang+, School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798
Abstract
Behavior-based robotics has demonstrated its success in the autonomous navigation of mobile robots. It has produced intelligent autonomous systems for a wide variety of areas, including factory automation, military applications and space exploration. The behavior-based approach decomposes a navigation system into task-specific behaviors, each of which has a direct connection between sensors and actuators. Behaviors operate in parallel and may produce conflicting actions while attempting to control the same actuator simultaneously. In order to resolve the conflicts, an arbitration mechanism is required to integrate the independent behaviors seamlessly. In previous research, this was accomplished by the high-priority-take-all command selection scheme or by the winner-take-all approach, which votes for the actions of individual behaviors and selects the winner as the final control command. The former may cause failure in navigation, as important navigational information may be lost when it completely switches control to a single behavior. The disadvantage of the latter is that it uses a constant weight for each behavior. This contradicts the fact that the desirability of each behavior may vary from situation to situation; e.g., the weight of obstacle avoidance should be large when an obstacle is located in the middle of the path, while it should be small if the obstacle only slightly obstructs the path. For a smooth behavior transition, an innovative behavior arbitration method is desired. This paper presents a complete navigation method employing our novel behavior fusion method for autonomous navigation in unknown environments. The proposed navigator
* Tel: 1-501-569-8043, Fax: 1-501-569-8020, Email: [email protected]; [email protected]
+ Tel: 65-6790-5376, Fax: 65-6792-0415, Email: [email protected]
consists of an Obstacle Avoider (OA), a Goal Seeker (GS), a Navigation Supervisor (NS) and an Environment Evaluator (EE). Fuzzy Logic Control (FLC) is employed in the OA and the GS to map perception to action. The NS, also employing FLC, determines the weights for the OA and the GS based on the local and global environmental information. The fuzzy actions inferred by the OA and the GS are then weighted and fused through fuzzy set aggregation. The resulting command actions, represented by fuzzy sets, are defuzzified to produce the crisp actions that control the robot. A Reinforcement Learning method is employed to learn the fuzzy rules for the OA. To enhance the capability of the navigator, the EE is used to tune the supports of the fuzzy sets of the OA and the NS in such a way that the robot is able to navigate adaptively in various environments. Simulation results reveal that the navigator has the following features: (1) it is able to guide the robot in unknown environments; (2) it has smooth behavior transitions and thus smooth actions; and (3) it has exceptionally good robustness to sensor noise.
Keywords: Behavior Control, Behavior Fusion, Fuzzy Logic Control, Reinforcement Learning, Obstacle Avoidance, Path Planning, Virtual Environment.
1 Introduction
Navigation is an important issue in the research of Autonomous Mobile Vehicles/Robots (AMVs). The navigation of an AMV is to determine a collision-free path that allows the AMV to travel through an obstacle course from an initial configuration to a goal configuration, where configuration refers to the spatial coordinates and the heading angle of the AMV. The process of finding such a path is also known as path planning, which can be classified into two categories: Global Path Planning and Local Path Planning. Global Path Planning methods, such as geometry algorithms [1, 2], the Potential Field Method (PFM) [3, 4, 5] as well as heuristic or approximating approaches [6, 7, 8], are usually conducted off-line in a completely known environment. As a pre-specified environment is required for these methods to plan the path, they may fail when the environment is not fully known. On the other hand, Local Path Planning techniques, also known as obstacle avoidance, are more efficient for navigation in an unknown or partially known environment. They utilize on-line sensory information to tackle the uncertainty. Among the proposed methods, geometry algorithms assume that the local obstacles are fully recognized via visual sensors [9, 10] or can be learnt by on-line acquisition via distance sensors [11, 12, 13]. The former assumption may not be applicable to a real environment, while in the latter case it is time consuming to explore an unknown environment. A number of studies [14, 15] have demonstrated that the PFM is well suited for fast obstacle avoidance as it does not require detailed obstacle information. However, it has three disadvantages: First, local minima may cause the AMV to become stuck. Second, it tends to cause unstable motion in the presence of obstacles [16]. Third, it is difficult to find the exact force coefficients influencing the AMV's velocity and direction in an unknown environment. The practicality of the PFM therefore hinges on how well these problems can be resolved. In 1985, Brooks [17] introduced the first Behavior Control paradigm. The Behavior Control method decomposes the problem of autonomous control by task rather than by function. Traditional functional decomposition used a SMPA (Sense-Model-Plan-Act) framework. The SMPA approach [18, 19] consists of modules, each of which senses the world, builds a world model, and plans actions for the AMV. It connects the modules in a serial fashion. A serious
problem in this approach is reliability. If any module fails, the entire system will break down. In contrast, the Behavior Control method decomposes the system into task-specific modules, each of which is connected directly to sensors and actuators and operates in parallel. These modules are generally called behaviors, and they are connected together by an arbitrator to determine the control action of the entire system. A large number of behavior-based methods [20, 21, 22, 23, 24] have emerged since Brooks introduced the approach. In a behavior-based system, a complex task is achieved by composing a number of behaviors to produce an emergent complex behavior. The composition is accomplished by a behavior arbitrator, which resolves any conflict among behavior modules arising from their attempt to control the same actuator simultaneously. Such an architecture has at least four advantages: (1) it is able to react in real time due to the parallel operation; (2) it performs well in unknown environments, as each module determines its action in an on-line manner; (3) behavior modules may be simple and thus easily and flexibly managed; and (4) it has good robustness, i.e., the system still functions even if one or more of the behavior modules fails. There are two main issues in employing the Behavior Control method: behavior construction and behavior fusion. A behavior module is usually designed to be a reactive system [20, 21], which maps a perceived situation to an action. Fuzzy Logic is an efficient way of representing this mapping relationship, as it is able to represent a human expert's knowledge and requires no mathematical model. In addition, the fuzzy set representation of the input space is continuous. Because of these advantages, Fuzzy Logic Control has been widely used in autonomous navigation [25, 26, 27]. However, in much of this work, the fuzzy rules were compiled and tuned manually by human experts. If a behavior module involves a large number of input variables (e.g., the obstacle avoidance behavior), it is intractable to maintain the correctness, consistency and completeness of the fuzzy rule base, because the number of fuzzy rules is large in this case. Therefore, a fuzzy system with the capability of learning the rules is highly desired. A number of learning algorithms, such as Evolutionary Algorithms (EA) [28, 29] and Supervised Learning (SL) [30, 31], can be used to learn the fuzzy rules for navigation. However, EA usually results in a very long learning process and it is not easy to design the fitness function for performance evaluation. In contrast, SL has the advantage of fast convergence because of the direct gradient information provided by the training data. However, it is very difficult to obtain sufficient training data that contain no conflicting input/output pairs. Insufficient training data may result in an incomplete fuzzy rule base, while conflicts among the training data may cause incorrect fuzzy rules and thus lead to collisions in navigation. Reinforcement Learning (RL) [24, 32, 33] is quite promising for learning fuzzy rules for autonomous navigation, as it only requires a simple scalar reinforcement signal rather than a large training data set. The limitation is that it may result in an insufficiently learnt rule base due to the conflict between exploration and exploitation [32]. This paper presents an RL-based learning method for automatic generation of fuzzy rules and introduces a training method to alleviate the effect of the conflict between exploration and exploitation.
Figure 1 Obstacle avoidance does not know which way is the correct turn [35]
Although behavior modules can be constructed by learning methods, it remains difficult to seamlessly fuse behavior components developed independently, as there are conflicts among multiple behaviors when they attempt to control the same actuator simultaneously. Much effort has been devoted to the design of behavior fusion methods. However, there is no completely satisfactory solution. Most of the existing behavior arbitrators [22, 23, 24, 34] adopt the strategy of high-priority-take-all. Payton and Rosenblatt [35] identified a serious problem with this priority-based arbitration scheme, which is illustrated by the case of an AMV encountering an obstacle while following a path (see Fig. 1). In this example, the path following behavior suggests turning left while the obstacle avoidance behavior suggests turning either left or right. The priority-based arbitration scheme has only a one-in-two chance of making the correct turn (left), since the information regarding path following is no longer available once the arbitrator selects the obstacle avoidance behavior. To resolve this limitation, they developed a command fusion architecture that allows control recommendations from different behaviors to be directly combined into multiple control recommendations with different activation levels, from which a final control command is chosen based on a winner-take-all selection strategy. In the Payton-Rosenblatt (P-R) command fusion method, the output of each behavior is a set of nodes, each of which represents a possible control decision. The desirability of each control decision is represented by the activation level of the node. Figure 2 describes the P-R method fusing two behaviors, Turn-For-Obstacle and Track-Road-Edges, for the situation given in Figure 1. The size and color of each node in Figure 2 represent the behavior activation for that command. The size of a node represents the magnitude of the activation. Solid black represents positive activation, while white represents negative activation. For example, the Turn-For-Obstacle behavior in Figure 2 has a large positive activation level for the "hard left" node, and a large negative activation level for the "straight ahead" node. To fuse these two behaviors, the activation strengths of the corresponding nodes are combined using a weighted sum, as illustrated in Figure 2. The weight associated with a behavior reflects the degree of importance of its command recommendation. For instance, the Turn-For-Obstacle behavior has a higher weight than the Track-Road-Edges behavior because it is more important to avoid obstacles than to follow the road edges. The final control command is the node with the largest positive activation. In the case illustrated in Figure 2, the final control command is to turn "hard left."
Figure 2 Payton-Rosenblatt network for fusing turn command for two behaviors [35]
Langer and Rosenblatt [36] applied a similar architecture to off-road navigation to arbitrate the commands from Avoid-Obstacle, Road-Following, Goal-Seeking, Maintain-Heading and Path-Tracking behaviors. Yen and Pfluger [37] (Y-P) extended Payton and Rosenblatt's work by using Fuzzy Logic. Their command arbitrator fused the allowed directions recommended by the Fuzzy-Obstacle-Avoidance behavior and the Fuzzy-Path-Following behavior to determine the final control command. A problem remaining in these two command fusion methods is that the weight for each behavior is constant. As a matter of fact, the desirability of each behavior may vary from situation to situation. For example, the weight of the Turn-For-Obstacle behavior should be large when an obstacle is lying in the middle of the path, and it should be small when the obstacle only slightly obstructs the path. This suggests that the weights of behaviors should vary with the environment in order to obtain better behavior fusion. In this paper, we propose a new behavior fusion method. It employs FLC to weight the behaviors using local and global environmental information and fuses the fuzzy actions recommended by each behavior. Based on this behavior fusion method, we develop a navigator for the navigation of an AMV in a floor space. It consists of an Obstacle Avoider (OA), a Goal Seeker (GS), a Navigation Supervisor (NS) and an Environment Evaluator (EE). The fuzzy actions inferred by the OA and the GS are weighted by the NS and fused through fuzzy set operations to produce the command actions, from which the final crisp actions are determined by defuzzification. This paper is organized as follows: In Section 2, the vehicle model, the navigation task and the overall architecture of the fuzzy navigator are described. In Section 3, the OA is presented and a Reinforcement Learning method is introduced to automatically construct the module. In Section 4, the GS is introduced and its function is verified by simulation. In Section 5, the NS, which fuses the obstacle avoidance behavior and the goal seeking behavior, is presented. In Section 6, the function of the EE is briefly described. In Section 7, the navigator's ability is extended to an indoor floor space. In Section 8, the performance of the navigator is analyzed in a virtual indoor environment. Finally, the paper is concluded in Section 9.
2 Overview of the Navigator
As we intend to develop a platform-independent navigation method, the only assumptions we make regarding the platform are that the AMV is able to turn on the spot1 and can detect obstacles with its distance sensors. This places no particular shape or size constraint on the AMV. However, for the convenience of simulation, we use a cylindrical AMV model with a radius of R_v and assume there are N distance sensors evenly distributed along a ring. In this study, we assume that the AMV's radius is 20 cm and that there are 24 ultrasonic sensors arranged in a ring. Each sensor s_i, for i = 1, ..., 24, gives a distance l_i (4 cm <= l_i <= 400 cm) to the obstacle in its field of view, and each sensor covers an angular view of 15°. For obstacle avoidance, the sensors in the front of the AMV are divided into five sensor groups, sg_i for i = 1, ..., 5, as depicted in Figure 3, which give the AMV a view angle of 225°. Each sensor group comprises three neighboring sensors.
Figure 3 Diagram of the AMV and the arrangement of the distance sensors
With this sensor arrangement, the distance d_i measured by the ith sensor group from the center of the AMV to the obstacle is expressed as:

d_i = R_v + min{ l_j | j = 3i-2, 3i-1, 3i },  for i = 1, ..., 5.    (1)
The remaining sensors are used to detect the distance to the obstacle along the goal direction for behavior fusion. The reason for arranging the sensors into groups is to reduce the input dimension of the navigation method.
1 A robot with differential drive wheels has this capability.
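As a concrete illustration of Eq. (1), the following Python sketch computes the five sensor-group readings from the 24 raw sonar ranges; the zero-based list indexing and the hard-coded 20 cm radius are conveniences of this sketch rather than code from the chapter.

```python
R_V = 20.0          # vehicle radius R_v in cm (value used in this study)

def group_readings(l):
    """Five sensor-group distances d_1..d_5 of Eq. (1).

    l : list of the 24 raw sonar ranges l_1..l_24 in cm, with l[0] holding l_1.
    Group i (i = 1..5) uses sensors 3i-2, 3i-1 and 3i, i.e. the 15 front sensors.
    """
    d = []
    for i in range(1, 6):
        group = l[3 * i - 3: 3 * i]      # zero-based slice for sensors 3i-2..3i
        d.append(R_V + min(group))
    return d
```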
2.1 Coordinate System and Navigation Task
The coordinate systems and the control variables of the AMV are depicted in Figure 4. The two coordinate systems are the world coordinate frame, denoted XOY, and the AMV coordinate frame, xoy. A navigation task is defined as moving the AMV from a start configuration to its goal coordinates without collision with the obstacles in between. A navigation task is specified in the world coordinate frame. The AMV configuration is represented by S = (X_o, Y_o, θ)^T, where X_o and Y_o are the coordinates of the AMV's center, and θ stands for the heading angle of the AMV. The control variables are its linear velocity v and the change in the heading angle, Δθ, which is referred to as the steering angle hereafter.
[Figure 4 legend: d_i: sensor reading; v: AMV velocity vector; xoy: AMV coordinate frame; XOY: world coordinate frame; P_o(X_o, Y_o): vector of the AMV's center in XOY; P_g(X_g, Y_g): goal vector in XOY; p_g(x_g, y_g): goal vector in xoy; d_og: obstacle distance along the goal vector p_g; θ: vehicle heading angle; φ: angle between the goal vector and the y axis; ψ: angle between the goal vector and the X axis]
Figure 4 Diagram of the coordinate system and the control variables
Suppose a navigation task from a start configuration s to a goal coordinate g is specified as in Figure 5. It can be achieved through an iterative process as follows:
1. At time step t, t = 0, 1, ..., k, ..., acquire the environment information, d_i and P_g(X_g, Y_g), and the AMV's configuration S(t) = (X_o(t), Y_o(t), θ(t))^T.
2. Determine the output variables v(t+1) and θ(t+1), where θ(t+1) = θ(t) + Δθ(t); then update the AMV's configuration S(t+1) by the following equations:

θ(t+1) = θ(t) + Δθ(t),
X_o(t+1) = X_o(t) + v(t+1) T cos(θ(t) + Δθ(t)),
Y_o(t+1) = Y_o(t) + v(t+1) T sin(θ(t) + Δθ(t)),    (2)

where T is the time interval between two time steps.
3. Iterate this situation-action mapping process (steps 1 and 2) until the goal is achieved.
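A minimal Python sketch of the update in Eq. (2) is given below; it assumes the heading θ is measured from the world X axis, which is the convention used in the reconstruction of Eq. (2) above.

```python
import math

def update_configuration(state, v_next, dtheta, T):
    """One step of the kinematic update of Eq. (2).

    state  : (X_o, Y_o, theta) at time step t (cm, cm, rad)
    v_next : commanded linear velocity v(t+1) in cm/s
    dtheta : commanded steering angle (change in heading) in rad
    T      : time interval between two time steps in s
    """
    X, Y, theta = state
    theta_new = theta + dtheta
    X_new = X + v_next * T * math.cos(theta_new)
    Y_new = Y + v_next * T * math.sin(theta_new)
    return (X_new, Y_new, theta_new)
```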
Figure 5 Process of achieving a navigation task
2.2 Architecture of the Navigator
The architecture of the proposed navigator is depicted in Figure 6. It consists of four main modules: an Obstacle Avoider (OA), a Goal Seeker (GS), a Navigation Supervisor (NS) and an Environment Evaluator (EE). FLC is used in the OA, GS and NS. The function of the OA is to determine the actions to avoid obstacles without considering whether they will cause deviation from the goal. Its input variables are d_i, for i = 1, ..., 5, where d_i are the sensor-group readings, and W is the variable determining the fuzzy sets for the input space. The value of W is determined by the EE. For a specific value of W, the sensor readings d_i, for i = 1, ..., 5, are fuzzified and the fuzzy inference is made to produce the fuzzy control actions, V_a and Θ_a. The function of the GS is to infer the actions for goal seeking without considering obstacle avoidance. The input variables of this module are d_g and φ, where d_g is the magnitude of the goal vector p_g(x_g, y_g) and φ is the angle between the goal vector and the y axis. The input variables are fuzzified, then the fuzzy inference is made and the fuzzy control actions, V_g and Θ_g, are given. The role of the NS is to resolve the conflicts between the actions recommended by the OA and the GS and to determine the command actions for collision-free navigation. It consists of three sections: the Behavior Weighting Section (BWS), the Behavior Fusion Section (BFS) and the State Evaluation Section (SES). The input variables of the BWS are d_og and d_min, where d_og is the distance to the obstacle located along the goal vector p_g(x_g, y_g), and d_min = min(d_i) is the minimum value of the
five sensor readings. These two variables are fuzzified based on the fuzzy sets determined by W, and the fuzzy inference is made. Finally, the behavior coefficient α is given by defuzzification. The behavior fusion is implemented in the BFS. The actions inferred by the OA and the GS are first weighted by (1-α) and α, respectively. Then they are fused to produce the fuzzy command actions, V and Θ. Finally, the crisp control actions of the AMV are determined by defuzzification.
(The input variables of each module are defined in Figure 4.)
Figure 6 Diagram of the proposed navigator
Each behavior module is developed independently. To construct the OA module, the GS is simply ignored (i.e., its weight α is set to zero) such that V = V_a and Θ = Θ_a. This means the defuzzification section in the BFS is merged into the OA to form a complete fuzzy controller. The GS is constructed in the same way. As can be seen from this navigation method, the OA utilizes the local environmental information d_i to infer the local actions for obstacle avoidance, while the GS uses the global environmental information p_g(x_g, y_g) to determine the global actions that move the robot towards the goal. The NS weights the local and global actions based on one local and one global piece of information, d_min and d_og, so as to maintain a balance between the local and the global actions. Through this behavior fusion scheme, the local and global navigation targets, avoiding obstacles and reaching the goal, are both achieved.
3 Obstacle Avoider
The OA is a reactive fuzzy controller. Its fuzzy rule base determines the vehicle actions, v_a and θ_a, based on the sensor readings d_i, for i = 1, ..., 5. As for any fuzzy controller, the main issue is to construct the fuzzy rule base.
3.1 Fuzzy Control of Obstacle Avoidance
As depicted in Figure 6, the input variables of the OA are the sensor readings d_i, for i = 1, ..., 5, and the control outputs are v_a and θ_a. The membership functions of the input and output variables are depicted in Figure 7. In Figure 7(a), the crisp value of each input variable d_i is fuzzified and expressed by the fuzzy sets VN, NR and FR, which represent very near, near and far, respectively. The universe of discourse of each fuzzy set is determined by R and W. Here R = R_v + l_min, where l_min = 8 cm is the minimum distance that the ultrasonic sensor can detect and R_v = 20 cm is the radius of the vehicle. Therefore, the fuzzy sets are fully determined by W. Each sensor reading d_i is mapped to a set of different membership function values, and further to different actions, for a different value of W. In accordance with the fuzzification of the input variables d_i, the fuzzy rule base consists of 243 rules, and it requires 243 fuzzy sets V_j (j = 1, ..., 243) to represent the velocity v_a and 243 fuzzy sets Θ_j (j = 1, ..., 243) to represent the steering angle θ_a. The fuzzy sets of the output variables, v_a and θ_a, take the triangular membership functions shown in Figures 7(b) and 7(c), respectively, while their center positions, b_1j and b_2j for j = 1, ..., 243, are determined by Reinforcement Learning. The upper bound of the velocity is V_amax.
[Figure 7: (a) input fuzzy sets VN, NR, FR over d_i with breakpoints at R, R+W and R+2W; (b) output fuzzy sets V_j over v_a in [0, V_amax] with centers b_1j; (c) output fuzzy sets Θ_j over θ_a in [-π/2, π/2] with centers b_2j]
Figure 7 Membership functions of the input/output variables for the OA
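The exact membership shapes of Figure 7(a) cannot be read from the extracted text alone; the sketch below assumes the usual trapezoid/triangle/trapezoid partition over the breakpoints R, R+W and R+2W, with the default values of R and W taken from Table 1.

```python
def mu_vn(d, R=28.0, W=20.0):
    """Membership of 'very near' (VN): 1 below R, falling linearly to 0 at R+W."""
    if d <= R:
        return 1.0
    if d >= R + W:
        return 0.0
    return (R + W - d) / W

def mu_nr(d, R=28.0, W=20.0):
    """Membership of 'near' (NR): triangular, peaked at R+W."""
    if d <= R or d >= R + 2 * W:
        return 0.0
    return (d - R) / W if d <= R + W else (R + 2 * W - d) / W

def mu_fr(d, R=28.0, W=20.0):
    """Membership of 'far' (FR): 0 below R+W, rising linearly to 1 at R+2W."""
    if d <= R + W:
        return 0.0
    return 1.0 if d >= R + 2 * W else (d - R - W) / W
```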
The fuzzy rule base plays a central role in mapping the sensor inputs d_i to the mobile vehicle's actions, v_a and θ_a. It consists of 243 fuzzy rules, each of which is denoted by

Rule j: IF d_1 is D_j1 AND ... AND d_5 is D_j5 THEN v_a is V_j, θ_a is Θ_j;  for j = 1, ..., 243,

where D_ji, for i = 1, ..., 5, are the fuzzy sets for d_i in the jth rule. They take the linguistic value VN, NR or FR; v_a and θ_a denote the output variables; and V_j and Θ_j
are the fuzzy sets for v_a and θ_a in the jth rule, respectively. Let the fire strength of the jth rule be denoted by μ_j. For the inputs d_i = d_i', μ_j can be calculated by

μ_j = min[ μ_Dj1(d_1'), μ_Dj2(d_2'), μ_Dj3(d_3'), μ_Dj4(d_4'), μ_Dj5(d_5') ];  for j = 1, ..., 243.    (3)
If Mamdani's minimum operation [38] is used for fuzzy implication, the memberships of the inferred fuzzy control actions, V and Θ, are calculated by

μ_V(v_a) = max_{j=1..243} min[ μ_j, μ_Vj(v_a) ]   and   μ_Θ(θ_a) = max_{j=1..243} min[ μ_j, μ_Θj(θ_a) ].    (4)
To limit the computing cost, the method of height defuzzification is used. The crisp control actions are then given by

v_a = Σ_{j=1..243} μ_j b_1j / Σ_{j=1..243} μ_j   and   θ_a = Σ_{j=1..243} μ_j b_2j / Σ_{j=1..243} μ_j.    (5)
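Combining Eqs. (3) and (5), the sketch below enumerates the 3^5 = 243 rules and applies height defuzzification, reusing the membership functions sketched after Figure 7; keying the rule base by label tuples (rather than by the rule index j) is a convenience of this sketch.

```python
from itertools import product

LABELS = ('VN', 'NR', 'FR')

def oa_infer(d, b1, b2, R=28.0, W=20.0):
    """Crisp OA actions (v_a, theta_a) via Eqs. (3) and (5).

    d      : the five sensor-group readings d_1..d_5 in cm
    b1, b2 : dicts mapping each rule (a 5-tuple of labels) to the learned
             consequent centers b_1j (velocity) and b_2j (steering angle)
    """
    mu_fun = {'VN': mu_vn, 'NR': mu_nr, 'FR': mu_fr}
    v_num = th_num = den = 0.0
    for rule in product(LABELS, repeat=5):              # 243 rules
        # Eq. (3): fire strength is the minimum of the antecedent memberships
        mu = min(mu_fun[lab](di, R, W) for lab, di in zip(rule, d))
        if mu == 0.0:
            continue
        v_num += mu * b1[rule]                          # Eq. (5)
        th_num += mu * b2[rule]
        den += mu
    return v_num / den, th_num / den
```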
3.2 Rule Learning for Obstacle Avoidance
The problem in constructing the rule base is to determine the values of b_1j and b_2j. In this paper, this is accomplished by the reinforcement learning method [24] using Sutton and Barto's model. The network of the learning algorithm is depicted in Figure 8. The vehicle begins the learning with an initial v and θ at time step t = 0, then moves to a new position at time step t = 1, and so on, until a collision occurs at time step t = k. The whole process until a collision is called a trial, and the time step t is called the tth learning step. For instance, if a trial ends with a collision at t = k, then a failure signal is fed back to the learning network. The rules that were used at the previous time steps k, k-1, k-2, ..., and that have contributed to this failure, should be tuned in order to improve the vehicle's performance. This task is accomplished by two adaptive neuron-like elements, one for updating the rules concerning v and the other for updating the rules concerning θ. Each element consists of an Associative Search Element (ASE) and an Associative Critic Element (ACE) [24]. After the rules are updated, a new trial begins at the (k+1)th learning step. The process is iterated until no more collisions occur.
Figure 8 Diagram of the neural network to learn obstacle avoidance
Suppose that the current vehicle configuration is S(t) = (X_o(t), Y_o(t), θ(t))^T and the five sensor-group readings are encoded into the rule fire strengths μ_j. In order to provide associativity in learning the rules, the trace μ̄_j(t) of the fired jth rule is used. The trace at time step t+1 is given by

μ̄_j(t+1) = λ μ̄_j(t) + (1-λ) μ_j(t),    (6)

where λ, 0 ≤ λ < 1, is the trace decay rate. Each ACE receives the external reinforcement signal r_m(t) (m = 1, 2) as a performance feedback from the environment, and generates the internal reinforcement signal r̂_m(t) (m = 1, 2), which is fed into the ASE to update its weights. The external reinforcement signal is determined by

r_m(t) = -1  if  min(d_i | i = 1, 2, ..., 5) ≤ R_v + V_amax T,  and  r_m(t) = 0  otherwise,  for m = 1, 2,    (7)

where T is the time interval between two learning steps. Let p_m(t) be the predicted value of the following discounted sum:

z_m(t) = Σ_{t'=t..∞} γ^(t'-t) r_m(t'+1),    (8)

where γ, 0 ≤ γ ≤ 1, is used to weight near-term reinforcement more heavily than distant future reinforcement. If the predictions are accurate, from Eq. (8) we have:

p_m(t-1) = r_m(t) + γ p_m(t).    (9)
p_m(t) is computed by the ACE as a weighted sum of μ̄_j(t) and is given by

p_m(t) = G( Σ_{j=1..243} v_mj(t) μ̄_j(t) ),    (10)

where

G(x) = 2 / (1 + e^(-x)) - 1,    (11)

and v_mj, for m = 1, 2 and j = 1, ..., 243, are the weights of the ACE. As the ACE learns to make the prediction, the mismatch, or temporal difference error [39], between the two sides of Eq. (9) is defined as the internal reinforcement signal and expressed as

r̂_m(t) = r_m(t) + γ p_m(t) - p_m(t-1).    (12)

This internal reinforcement is then used to train the ACE and the ASE. The weights of the ACE are updated by

v_mj(t+1) = v_mj(t) + β r̂_m(t) μ̄_j(t),    (13)

where β is a positive constant determining the rate of change of v_mj. Similarly, the weights of the ASE, ω_mj for m = 1, 2 and j = 1, ..., 243, are updated by

ω_mj(t+1) = ω_mj(t) + η r̂_m(t) e_mj(t),    (14)

where η, 0 < η < 1, determines the learning rate, and e_mj(t) is the eligibility trace of the jth rule at time t, which is updated by

e_mj(t+1) = δ e_mj(t) + (1-δ) y_m(t) μ_j(t),    (15)

where δ, 0 ≤ δ < 1, is the decay rate of the eligibility. The eligibility trace is a record of which rules have been used and which control actions have been applied to the vehicle. Here y_m(t) are the control actions, defined as (y_1(t), y_2(t))^T = (v_a, θ_a)^T. The center positions of the fuzzy sets at each time step are determined by

b_mj(t) = b_m + f_m ω_mj(t) / ( k max_j|ω_mj(t)| + |ω_mj(t)| ),  for m = 1, 2; j = 1, ..., 243,    (16)

where b_1 is the initial center position of the fuzzy sets V_j; b_2 is the initial center position of the fuzzy sets Θ_j; f_m is a positive constant that determines the range of b_mj(t); and the positive constant k is used to guarantee that the fuzzy sets of the output variables stay within their universe of discourse. According to Eq. (16), b_mj(t) ∈ [b_m - f_m/(1+k), b_m + f_m/(1+k)], and V_amax is b_1 + f_1. With the control actions of the current time step known, the vehicle configuration at the next time step is updated by Eq. (2). Eventually, if the rules are sufficiently learnt in a specific environment, the weights of the ASE converge to a set of fixed values. When the learning process is terminated, the learned set of b_mj is used as the rule base for the OA.
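The following sketch condenses Eqs. (10)-(16) into one update for a single output channel (velocity or steering); the rule and eligibility traces of Eqs. (6) and (15) are assumed to be maintained by the caller. The symbol names and default constants are assumptions, since the pairing of the numerical values in Table 1 with the individual learning and decay constants cannot be recovered from the source.

```python
import math

N_RULES = 243

def ace_ase_update(r, mu_bar, e, v_w, omega, p_prev,
                   gamma=0.85, beta=1.5, eta=0.8,
                   b0=15.0, f=15.0, k=0.2):
    """One learning step of the ACE/ASE pair for one output channel.

    r      : external reinforcement r_m(t), Eq. (7)
    mu_bar : rule traces of Eq. (6), list of length 243
    e      : eligibility traces of Eq. (15), list of length 243
    v_w    : ACE weights v_mj (modified in place), Eq. (13)
    omega  : ASE weights omega_mj (modified in place), Eq. (14)
    p_prev : previous prediction p_m(t-1)
    Returns the new prediction p_m(t) and the consequent centers b_mj of Eq. (16).
    """
    # Eqs. (10)-(11): squashed weighted sum of the rule traces
    p = 2.0 / (1.0 + math.exp(-sum(v * mb for v, mb in zip(v_w, mu_bar)))) - 1.0
    r_hat = r + gamma * p - p_prev                       # Eq. (12): internal reinforcement
    for j in range(N_RULES):
        v_w[j] += beta * r_hat * mu_bar[j]               # Eq. (13): ACE weight update
        omega[j] += eta * r_hat * e[j]                   # Eq. (14): ASE weight update
    w_max = max(abs(w) for w in omega) or 1.0            # avoid a zero denominator at start-up
    b = [b0 + f * w / (k * w_max + abs(w)) for w in omega]   # Eq. (16)
    return p, b
```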
3.3 Simulation of Rule Learning
A fundamental problem in Reinforcement Learning is the conflict between exploration and exploitation [32]. In the case of learning obstacle avoidance, exploration is the desire to acquire knowledge about the consequences of actions not considered optimal, so as to improve the rule base, while exploitation is the desire to use the rule base already learned. For efficient learning, a tradeoff between exploration and exploitation should be achieved to maximize the effect of learning and minimize its cost. The above learning method is a pure-exploitation approach. In this section, the limitation of the method is discussed and a new training method is introduced to alleviate the problem. The parameters for rule learning are tabulated in Table 1. With the values of b_1 and f_1, the upper bound of the vehicle's velocity is 30 cm/s and b_1j is within the range (2.5 cm/s, 27.5 cm/s), while b_2j is within the range (-75°, 75°). At the beginning, the v_mj(t) are set to some small non-zero values, while ω_mj(t), μ̄_j(t), p_m(t-1) and e_mj(t) are set to zero. The vehicle begins learning at an arbitrary initial configuration S(0) with non-zero initial control actions.

Table 1: Parameters used for simulation
R = 28 cm    T = 0.3 s    0.5     b_1 = 15 cm/s    b_2 = 0       0.8        0.8
W = 20 cm    0.85         0.95    f_1 = 15 cm/s    f_2 = π/2     k = 0.2    1.5
(The unlabeled entries are the learning-rate and decay constants of Section 3.2.)
A. Environment Exploration Method
The Environment Exploration Method (EEM) [24] is a training approach that iterates as follows:
1. The current distance readings from the five sensor groups are fed into the Fuzzy Quantization module in Figure 8 and encoded into μ_j(t).
2. If max(|ω_mj(t)|) = 0, the initial control actions are used to control the robot; otherwise the control outputs are determined by Eq. (5).
3. The external reinforcement signal is calculated by Eq. (7), the current prediction value p_m(t) is calculated by Eq. (10), and the internal reinforcement signal is computed by Eq. (12).
4. The weights of the ACE and ASE are updated by Eqs. (13) and (14), while the trace of the rule and the eligibility trace are updated by Eqs. (6) and (15), respectively.
5. If there is no collision, the configuration of the vehicle is updated by Eq. (2) and the learning process goes to Step 1. If a collision occurs, then v_mj(t), μ̄_j(t), p_m(t-1) and e_mj(t) are reset to zero, the vehicle is backtracked 4 steps and its heading direction is reversed. The next trial begins by repeating Steps 1 to 5 again.

In order to facilitate the learning process, a pre-specified rule, IF d_i = FR for i = 1, ..., 5 THEN b_1,243 = 27.5 cm/s and b_2,243 = 0, is used. As the learning continues, the center positions of the membership functions of the output fuzzy sets are tuned from the initial values, b_1 and b_2, to the new values, b_1j and b_2j, respectively. Figure 9(a) is an example of learning in a computer generated environment. The vehicle started learning with an initial configuration (280 cm, 196 cm, -45°)^T. After a number of learning steps, it went into a trap situation (loop 1) and could not get out. The vehicle was then moved to a new configuration (520 cm, 240 cm, 0°)^T manually and a new trial began. The weights of the ASE were tuned as the learning process continued. We can form a vector Δω_m = (Δω_m1, ..., Δω_mj, ..., Δω_m243)^T, where Δω_mj(t) is the change in the ASE's weights. The norms of the vectors ||Δω_m|| describe the extent of the learning; in some learning methods they are continuous values and can be used to determine convergence. However, in learning obstacle avoidance using Reinforcement Learning, the ||Δω_m|| are not continuous. Specifically, the values are large upon a collision and may be extremely small otherwise. Therefore, it is not viable to determine convergence using these values. Nor is it reliable to set a large number of learning steps, as extensive simulation shows that the EEM still produces collisions at up to 100,000 steps. As it is impossible to set a criterion to judge convergence, the EEM is not able to guarantee a sufficiently learned rule base. Another frustrating problem with the EEM is that the vehicle may go into a dead loop and stop exploring the environment. In this case, it has to be manually removed from the trap to start a new trial. However, it may go into a dead loop once again. As depicted in Fig. 9(a), the vehicle started a new trial at s2 with configuration (520 cm, 240 cm, 90°)^T after the removal from loop 1. It got into a new trap (loop 2) again. If the vehicle stays in the loop for a while and forms the looping behavior, it will always go into a loop when we move it to a new configuration. This means we have to restart the learning from scratch. A similar scenario is shown in Fig. 9(b). The vehicle started the learning at s1 with configuration (340 cm, 360 cm, 90°). It collided with obstacle O1 first and then with O2 and O1 again; it then circumnavigated slowly and closely around obstacle O3 and was trapped by loop 1. It went into new
traps (loop 2 and loop 3) again, starting from s2 and s3. As there was no collision, the learning algorithm assumed convergence had been achieved although the rule base had not been sufficiently learnt. Figure 9(c) demonstrates the worst trap situation, in which the vehicle learnt almost nothing. The robot began the learning at s1 with an initial heading of 90°. After a collision with the neighboring border, it moved along a straight-line trajectory back and forth, with collisions with the obstacles at both ends, and could not get out of the trap. The same result occurred as it was moved to s2, s3, s4 and s5 (all with 90° initial heading) manually one after another. The learning had to be restarted from scratch. We believe all of these problems are induced by the pure-exploitation nature of this learning method. Once a reactive behavior to obstacles has been formed, the robot uses the associated rules over and over again; hence it repeats the same behavior and refuses to learn further.
Figure 9 Training by the EEM
B. New Training Method
The new training method was conducted in a simple corridor-like environment as depicted in Figure 10. As the environment is regular, the vehicle trajectory remains unchanged once the learning process converges, and the criterion for terminating the training process can be based upon this. This is one of the major advantages of the new method over the EEM. The new training method is divided into two phases according to the moving direction of the vehicle. Firstly, the vehicle begins the training from an arbitrarily chosen start configuration. The vehicle moves in a clockwise (CW) or counter-clockwise (CCW) direction. In this phase, the learning is iterated in the same way as in the EEM. This training phase is completed when the vehicle maintains a constant trajectory without collision. In the second phase, the vehicle learns to navigate in the opposite direction from a new start configuration.
Upon collision, if the vehicle is being trained in the CW direction, it backtracks 40 steps and turns an additional steering angle of -π/30; otherwise, it backtracks 40 steps and turns an additional steering angle of π/30. This training phase is completed when the vehicle keeps a constant trajectory without collision. A training example is depicted in Fig. 10. The vehicle began the first training phase (Fig. 10(a)) from the start point 'A' at (60 cm, 70 cm, 0°). After 11 collisions, it navigated successfully in the CCW direction. When the learning method converged, the vehicle moved almost in the middle of the corridor and turned smoothly around the corners. The CW-direction training phase began with a new start point 'B' at (226 cm, 60 cm, 90°), as depicted in Fig. 10(b). After two collisions, the vehicle navigated successfully in the CW direction. It kept a constant trajectory in the middle of the corridor and turned smoothly around corners. The whole training was then completed. The learning process converged within 360 learning steps and 13 collisions, while the convergence of the EEM method remains uncertain at up to 100,000 learning steps with 267 collisions. This means the new training method is able to construct the rule base at less cost and in less time. The rules were almost sufficiently learned (1.2% of the rules were blank) in our method after 13 trials. In contrast, the rules of the EEM were far from sufficiently learnt (30% of the rules were blank) after the 76th trial, and the vehicle collided in fewer than 6000 steps. The three blank rules (1.2% of the rules) in the new training method can be obtained by manually placing the AMV in the situations associated with these rules and re-training the AMV in those situations.
[(a) CCW training in a 500 cm x 600 cm environment, corridor width 72 cm, starting point A; (b) CW training in the same environment, starting point B; collision locations are marked]
Figure 10 Learning in a simple environment (W=20cm)
Figure 11 depicts the control surfaces of the fuzzy rules learnt in the environment of Fig. 10. The control surfaces show how the fuzzy controller reacts to the sensor readings. Taking Fig. 11(a) as an example, with the three sensor inputs d2 = d4 = d5 = 38 cm (i.e., all of them VN), if d3 is VN (e.g., d3 = 30 cm) and d1 is FR, then from the control surface v_a is very small and θ_a is negative big. This turns the robot away from the obstacles in the directions of d2, d3, d4 and d5, and moves it toward the direction of d1, where there is no obstacle. Also, the robot speed is very small, as four of the sensor readings indicate very near obstacle(s). Obviously the actions are correct. The learnt rule base was tested in various
computer generated environments and it was found that there was no collision. This set of rules is then used in the OA module.
(a) d2=d4=d5=R+W/2=38 cm
(b) d2=d4=d5=R+3W/2=58 cm
(c) d2=d4=d5=R+2W=68 cm
Figure 11 Control surfaces of the rules learnt by the new training method
4 Goal Seeker
The input variables of the GS are d_g and φ, and the control outputs are v_g and θ_g. As depicted in Figure 4, the goal vector with respect to the AMV's coordinate frame can be calculated by

p_g(x_g, y_g) = P_g(X_g, Y_g) - P_o(X_o, Y_o).    (17)
The magnitude of this vector and the deviation of the AMV's heading direction are denoted by

d_g = ||p_g(x_g, y_g)||,    (18)

and

φ = ψ - θ,    (19)

where ψ is the angle between the relative goal vector p_g(x_g, y_g) and the horizontal axis, and is given by

ψ = tan^-1(y_g / x_g).    (20)
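A short Python sketch of Eqs. (17)-(20) follows; the use of atan2 and the wrapping of φ into [-π, π) are practical substitutions for the arctangent written above and are not stated in the chapter.

```python
import math

def goal_distance_and_bearing(pose, goal):
    """Goal distance d_g and heading deviation phi, per Eqs. (17)-(20).

    pose : (X_o, Y_o, theta), the AMV configuration in world coordinates (rad)
    goal : (X_g, Y_g), the goal position in world coordinates
    """
    X_o, Y_o, theta = pose
    x_g, y_g = goal[0] - X_o, goal[1] - Y_o                    # Eq. (17)
    d_g = math.hypot(x_g, y_g)                                 # Eq. (18)
    psi = math.atan2(y_g, x_g)                                 # Eq. (20), quadrant-safe
    phi = (psi - theta + math.pi) % (2 * math.pi) - math.pi    # Eq. (19), wrapped
    return d_g, phi
```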
The membership functions of these four variables are shown in Fig. 12. As the maximum velocity inferred by the OA is 16 cm/s, V_gmax is also set to 16 cm/s for consistency. In accordance with the membership functions of the input variables, the rule base consists of 28 rules.
[Figure 12: d_g takes the fuzzy sets VN, NR, FR, VFR with breakpoints at 0, 4R_v, 8R_v and 12R_v; φ takes the fuzzy sets NB, NM, NS, ZZ, PS, PM, PB with breakpoints at -2π/3, -π/3, 0, π/3 and 2π/3; v_g takes the fuzzy sets VS, SL, FS, VF with centers 0, V_gmax/2, 3V_gmax/4 and V_gmax; θ_g takes the fuzzy sets NB, NM, NS, ZZ, PS, PM, PB with centers -π/4, -π/6, -π/12, 0, π/12, π/6 and π/4]
VN: very near; NR: near; FR: far; VFR: very far; VS: very slow; SL: slow; FS: fast; VF: very fast; NB: negative big; NM: negative middle; NS: negative small; ZZ: zero; PS: positive small; PM: positive middle; PB: positive big; V_gmax: maximum velocity for this behavior
Figure 12 Membership functions of the input and output variables for the GS
Each rule is denoted by

Rule j: IF φ is Φ_j AND d_g is D_j THEN v_g is V_j, θ_g is Θ_j;  for j = 1, ..., 28,

where Φ_j is the fuzzy set for φ in the jth rule, which takes the linguistic value NB, NM, NS, ZZ, PS, PM, or PB; D_j is the fuzzy set for d_g in the jth rule and takes the value VN, NR, FR or VFR; V_j is the fuzzy set for v_g in the jth rule, which takes the value VS, SL, FS or VF; and Θ_j is the fuzzy set for θ_g in the jth rule, which takes the value NB, NM, NS, ZZ, PS, PM, or PB. These rules are constructed and tuned by a human expert, since the number of rules is small. The fuzzy rules are given in Table 2. Taking the first rule as an example, it reads: IF φ is NB and d_g is VN THEN v_g is VS and θ_g is PB.

Table 2 Fuzzy rules for the GS (cell entries are the control actions (v_g, θ_g))

d_g \ φ:   NB        NM        NS        ZZ        PS        PM        PB
VN         (VS,PB)   (VS,PM)   (VS,PS)   (VS,ZZ)   (VS,NS)   (VS,NM)   (VS,NB)
NR         (VS,PB)   (VS,PM)   (SL,PS)   (SL,ZZ)   (SL,NS)   (VS,NM)   (VS,NB)
FR         (SL,PB)   (SL,PM)   (FS,PS)   (FS,ZZ)   (FS,NS)   (SL,NM)   (SL,NB)
VFR        (SL,PB)   (FS,PM)   (VF,PS)   (VF,ZZ)   (VF,NS)   (FS,NM)   (SL,NB)

Similar to the OA, for the inputs φ = φ' and d_g = d_g', the fire strength of the jth rule is

μ_gj = min[ μ_Φj(φ'), μ_Dj(d_g') ];  for j = 1, ..., 28.    (21)
As Mamdani's minimum operation is used for fuzzy implication and the height defuzzification method is used, the crisp control actions for this module are calculated by

v_g = Σ_{j=1..28} μ_gj p_j / Σ_{j=1..28} μ_gj   and   θ_g = Σ_{j=1..28} μ_gj q_j / Σ_{j=1..28} μ_gj,    (22)

where p_j is the center of fuzzy set V_j and q_j is the center of fuzzy set Θ_j. For the membership functions defined for v_g and θ_g, p_j takes a crisp value of 0, V_gmax/2, 3V_gmax/4 or V_gmax, while q_j takes a value of -π/4, -π/6, -π/12, 0, π/12, π/6 or π/4.
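Table 2 together with Eqs. (21) and (22) amounts to a small table-lookup controller. The sketch below encodes the table as a dictionary of rows and assumes the caller supplies the memberships of d_g and φ; both choices are conveniences of this sketch.

```python
import math

V_GMAX = 16.0   # cm/s, maximum velocity of the GS (from the text)

# Output centers p_j and q_j of the GS fuzzy sets (Figure 12 / Eq. (22))
V_CENTERS = {'VS': 0.0, 'SL': V_GMAX / 2, 'FS': 3 * V_GMAX / 4, 'VF': V_GMAX}
TH_CENTERS = {'NB': -math.pi / 4, 'NM': -math.pi / 6, 'NS': -math.pi / 12, 'ZZ': 0.0,
              'PS': math.pi / 12, 'PM': math.pi / 6, 'PB': math.pi / 4}

PHI_LABELS = ('NB', 'NM', 'NS', 'ZZ', 'PS', 'PM', 'PB')
GS_TABLE = {   # Table 2, one row per d_g label, one column per phi label
    'VN':  [('VS', 'PB'), ('VS', 'PM'), ('VS', 'PS'), ('VS', 'ZZ'), ('VS', 'NS'), ('VS', 'NM'), ('VS', 'NB')],
    'NR':  [('VS', 'PB'), ('VS', 'PM'), ('SL', 'PS'), ('SL', 'ZZ'), ('SL', 'NS'), ('VS', 'NM'), ('VS', 'NB')],
    'FR':  [('SL', 'PB'), ('SL', 'PM'), ('FS', 'PS'), ('FS', 'ZZ'), ('FS', 'NS'), ('SL', 'NM'), ('SL', 'NB')],
    'VFR': [('SL', 'PB'), ('FS', 'PM'), ('VF', 'PS'), ('VF', 'ZZ'), ('VF', 'NS'), ('FS', 'NM'), ('SL', 'NB')],
}

def gs_defuzzify(mu_dg, mu_phi):
    """Crisp GS actions (v_g, theta_g) via Eqs. (21)-(22).

    mu_dg  : dict of memberships of d_g in VN, NR, FR, VFR
    mu_phi : dict of memberships of phi in NB .. PB
    """
    v_num = th_num = den = 0.0
    for dg_lab, row in GS_TABLE.items():
        for phi_lab, (v_lab, th_lab) in zip(PHI_LABELS, row):
            mu = min(mu_dg[dg_lab], mu_phi[phi_lab])        # Eq. (21)
            v_num += mu * V_CENTERS[v_lab]
            th_num += mu * TH_CENTERS[th_lab]
            den += mu
    return (v_num / den, th_num / den) if den else (0.0, 0.0)   # Eq. (22)
```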
Figure 13 Verification of the GS from three start points to a goal
After the construction of the GS, its goal seeking function is verified by simulation as illustrated in Fig. 13. Given the three start configurations (with different locations and different heading directions) and the goal location, the AMV turned to the goal direction and moved toward the goal. When it was approaching the goal, it reduced its velocity gradually and finally stopped at the goal location.
5 Navigation Supervisor
Since the OA and the GS work independently, the individually inferred control actions, v and θ, have to be fused in order to produce the command actions that control the robot. The original NS performing this role was proposed in [32]; it used IF-THEN rules to switch between the obstacle avoidance and goal seeking behaviors. For a smooth behavior transition, a new NS based on fuzzy control is proposed here. It consists of three sections: the Behavior Weighting Section (BWS), the Behavior Fusion Section (BFS) and the State Evaluation Section (SES).
5.1 Behavior Weighting Section
The BWS is a fuzzy controller that takes d_min and d_og as inputs and infers the behavior coefficient α. Here, d_min stands for the minimum value of the sensor readings and is given by

d_min = min( d_i | i = 1, 2, ..., 5 ),    (23)

and d_og represents the minimum distance to the obstacle located along the relative goal vector p_g(x_g, y_g). In order to detect this distance, a sensor group is dynamically configured (depicted in Fig. 14) as s_i for i = k-1, k, k+1, where k is calculated based on the deviation of the AMV's heading direction from the goal direction. In accordance with the sensor arrangement in Fig. 14, we have
k = 8 + int(12φ/π),    (24)

where φ is determined by Eq. (19) and int() denotes rounding to the nearest integer. If i is zero or negative, the modulo-24 value of i is used instead. The distance d_og detected by the sensor group is given by

d_og = R_v + min{ l_i | i = k-1, k, k+1 }.    (25)
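A sketch of the dynamic sensor group of Eqs. (24) and (25) is shown below; it assumes sensor 8 faces straight ahead and that each sensor spans π/12 rad (15°), which is how Eq. (24) is reconstructed above.

```python
import math

R_V = 20.0   # vehicle radius in cm

def obstacle_distance_along_goal(l, phi):
    """Distance d_og along the goal direction, per Eqs. (24)-(25).

    l   : the 24 raw sonar ranges l_1..l_24 in cm, with l[0] holding l_1
    phi : heading deviation from the goal direction (rad), Eq. (19)
    """
    k = 8 + int(round(12.0 * phi / math.pi))     # Eq. (24): each sensor spans pi/12 rad
    picked = []
    for i in (k - 1, k, k + 1):
        i = ((i - 1) % 24) + 1                   # wrap indices outside 1..24 modulo 24
        picked.append(l[i - 1])
    return R_V + min(picked)                     # Eq. (25)
```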
The membership functions of the input and output variables are depicted in Fig. 15. The membership functions of the input variables take exactly the same form as those of the OA. In accordance with the membership functions of the input variables, the rule base of the NS consists of nine rules only. They are given in Table 3. At each time step, the input variables are fuzzified and the fuzzy inference is made. Finally, the crisp value of the behavior coefficient α is determined by defuzzification. As the process of constructing the BWS is similar to that of the GS, we omit it for conciseness.
Figure 14 Configuration of the dynamic sensor group
[(a) d_min and d_og take the fuzzy sets VN, NR, FR with breakpoints at R, R+W and R+2W; (b) α takes the fuzzy sets VS, SM, BG, VB with centers 0, 1/4, 1/2 and 1]
VN: very near; NR: near; FR: far; VS: very small; SM: small; BG: big; VB: very big
Figure 15 Membership functions of the input and output variables for the BWS
Table 3 Fuzzy rules for the BWS (cell entries are α)

d_min \ d_og:   VN    NR    FR
VN              VS    SM    SM
NR              SM    SM    BG
FR              SM    BG    VB
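Since the BWS has only nine rules, its inference and defuzzification can be written directly from Table 3; the output centers 0, 1/4, 1/2 and 1 are read from Figure 15(b), and the dictionary encoding is a convenience of this sketch.

```python
def bws_alpha(mu_dmin, mu_dog):
    """Behavior coefficient alpha from the nine BWS rules of Table 3.

    mu_dmin, mu_dog : dicts of memberships of d_min and d_og in VN, NR, FR
    """
    centers = {'VS': 0.0, 'SM': 0.25, 'BG': 0.5, 'VB': 1.0}    # Figure 15(b)
    table = {('VN', 'VN'): 'VS', ('VN', 'NR'): 'SM', ('VN', 'FR'): 'SM',
             ('NR', 'VN'): 'SM', ('NR', 'NR'): 'SM', ('NR', 'FR'): 'BG',
             ('FR', 'VN'): 'SM', ('FR', 'NR'): 'BG', ('FR', 'FR'): 'VB'}
    num = den = 0.0
    for (dmin_lab, dog_lab), out_lab in table.items():
        mu = min(mu_dmin[dmin_lab], mu_dog[dog_lab])
        num += mu * centers[out_lab]
        den += mu
    return num / den if den else 1.0    # full coverage of the inputs keeps den > 0
```

When both d_min and d_og are far, α approaches 1 and the goal seeking behavior dominates; when either is very near, α stays small and obstacle avoidance dominates, as intended by Table 3.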
5.2 Behavior Fusion Section
The implementation of the behavior fusion is depicted in Fig. 16. As shown in Fig. 16(a), the fuzzy sets of the steering angles inferred by the OA and the GS, Θ_a and Θ_g, are weighted by (1-α) and α, respectively, yielding Θ'_a and Θ'_g. Then Θ'_a and Θ'_g are aggregated to yield the fuzzy set Θ. In the same way, the fuzzy sets of the velocity, V_g and V_a, are fused to produce the fuzzy set V, as depicted in Fig. 16(b). To infer the crisp command actions of the AMV, a defuzzification process is required. As we employ the method of height defuzzification, the result is given by

v = [ α Σ_{j=1..28} μ_gj(φ', d_g') p_j + (1-α) Σ_{m=1..243} μ_m(d') b_1m ] / [ α Σ_{j=1..28} μ_gj(φ', d_g') + (1-α) Σ_{m=1..243} μ_m(d') ]    (26a)

and

Δθ = [ α Σ_{j=1..28} μ_gj(φ', d_g') q_j + (1-α) Σ_{m=1..243} μ_m(d') b_2m ] / [ α Σ_{j=1..28} μ_gj(φ', d_g') + (1-α) Σ_{m=1..243} μ_m(d') ],    (26b)

where μ_m(d') is the fire strength of the mth rule of the OA for the input d' = (d_1', d_2', d_3', d_4', d_5'), while b_1m and b_2m are the centers of the mth fuzzy sets for v_a and θ_a, respectively.
Cang Ye and Danwei Wang g qj
a g
b2m
Vg
a
1-
'g
vg
pj
1-
Vg'
a
va
b1m
'a
g
Va
Va' vg
va
V
(a)
v
(b) Figure 16 Behavior fusion of the navigator
5.3 State Evaluation Section
A State Evaluator inside the NS is employed to correctly guide the AMV in the case that the goal is located very near to an obstacle. Its control flow is as follows: Firstly, if the condition dg