Adaptive Behavior Navigation of a Mobile Robot
Eduardo Zalama, Jaime Gomez, Mariano Paul, Jose Ramon Peran
Manuscript received. E. Zalama, Jaime Gomez and Jose Ramon Peran are with the Instituto de las Tecnologías de la Producción, University of Valladolid, Spain. E-mail: [email protected]. This work is supported by the Junta de Castilla y León and the Comisión Interministerial de Ciencia y Tecnología, project TYP 1FD97-1580-C02-01.
Abstract
This paper describes a neural network model for the reactive behavioral navigation of a mobile robot. From the information received through its sensors, the robot can elicit one of several behaviors (e.g., stop, avoid, stroll, wall following) through a competitive neural network. The robot is able to develop a control strategy depending on sensor information and on the learning process. Reinforcement learning improves the navigation of the robot by adapting the eligibility of the behaviors and determining the linear and angular robot velocities.
Keywords: mobile robots, obstacle avoidance, learning control, neural networks, robot navigation, adaptive behavior
I. INTRODUCTION

Learning in robotics has been one of the challenges faced by researchers in recent years. Some of the advantages of learning, among others, are the following:
- The robot can acquire new knowledge by interacting with the environment and can adapt or change its behavior over the course of its life.
- Learning allows concepts from past experience to be generalized from multiple examples.
- Actuators and sensors do not need costly calibrations, since they can be adjusted depending on the environment conditions.
- The system improves its performance and incorporates new knowledge.

Reinforcement learning [1] has been one of the methods used to adapt robotic control systems. It is rooted in psychology and ethology, whereby humans and animals operating in unknown environments must learn to predict the consequences of their own actions. By learning the causality of environmental events, it becomes possible to predict future and new events. With this type of learning, also known as operant conditioning, an animal learns to exhibit more frequently a behavior that has led to a reward, and less frequently a behavior that has led to punishment [2]. This notion of reward and punishment reinforces the causal behavior and provides a useful model for behavioral modification in robotics. The appeal of these learning models is that they do not need an external supervisor to tell the system the objective, since the evaluation of its own behavior permits the system to adjust and improve its future response.
One of the problems associated with reinforcement learning is the credit assignment problem. For example, a hungry cat placed in a cage from which it can see some food will learn to press a lever that allows it to escape the cage and reach the food. In this situation the animal cannot simply wait for things to happen; it must generate different behaviors and learn which are effective. The main problem in modeling this type of behavior is to determine, among all the behaviors, which one produces the reward.

Several authors have used reinforcement learning for the behavioral adaptation of mobile robots. Maes and Brooks [3] applied learning to improve the movement of a hexapod robot through positive reinforcement (progress of the robot) and negative reinforcement (contact sensors). Mahadevan and Connell [4] used Q-learning to teach a behavior-based robot how to push a box. Asada et al. [5] applied a vision system and Q-learning to shooting balls into a goal, so that positive reinforcement is given when a ball reaches the goal.

In this paper, a neural network model for the control of a mobile robot in an environment with obstacles is presented. Through sensor information (infrared, ultrasonic and bumper) the robot is able to learn different behaviors (avoid, wall following, goal reaching, etc.) and to improve its navigation in an unstructured environment. Initially the robot navigates inefficiently through the environment using a set of reflex behaviors; positive and negative reinforcement signals then improve the robot's behavior through learning during operation. The system has been tested in simulation and on a Nomad 200 mobile robot equipped with ultrasonic, infrared and contact sensors.

The rest of this paper is structured as follows: the proposed model is explained in section II, describing every module of the model, its dynamic equations, and the interactions between the modules. Experimental results are presented in section III, while related work and conclusions are presented in sections IV and V respectively.
II. MODEL DESCRIPTION

The proposed neural model is made up of several interconnected modules which, through a perception-action cycle, allow the robot to obtain its linear and angular velocities (see figure 1). The modules of the model are the following:
- Intermodal Map.
- Behaviors Map.
- Sensorial Map.
- Velocity Vector.
- VITE (Vector Integration To Endpoint).
In general terms, the operation of the model is as follows. The intermodal map receives information from the infrared sensors, the ultrasonic sensors and the robot bumper. This information is processed through contrast and reinforcement mechanisms that allow the robot to obtain signals less exposed to noise and uncertainty, and a unified representation of the near environment is obtained. Moreover, the intermodal map forms a representation of the environment based on spatial directions, in which all dependency on the type of sensor involved in the readings has been eliminated. Once the information is contrasted, it flows to the behaviors map, where a behavior is elicited, and additionally to the sensorial map, where a behavioral representation of the environment is obtained. Seven different behaviors have been defined in the behaviors map:
- Collision: the situation in which the robot crashes against an obstacle and some of the bumper sensors are activated. The robot must react with a reflex behavior that moves it away from the obstacle.
- Stroll: if an obstacle is detected very close in the direction of movement, the robot stops and turns in the direction opposite to the obstacle. This situation only arises in extreme circumstances, in which the robot is very close to crashing into an obstacle. The behavior must be activated immediately when a collision danger is detected.
- Avoid: upon detecting an obstacle at a moderate distance, the robot turns, avoiding the obstacle, and continues on its course.
- Wall Following: once a wall (or an obstacle) is detected to one side of the robot, the robot follows it at a given distance. This distance depends on the behavior definition, but it is also modified by the number of obstacles detected in the environment and by the ease with which the robot can lose sensorial contact with the wall.
- Goal Reaching: if there is no obstacle in the direction of a destination point, the robot goes towards it.
- Exploration: the robot wanders freely, obtaining information from the environment.
- Correct: the robot keeps its current direction without varying it, until a new behavior is elicited.
All these behaviors can be classified into two groups, depending on the type of response and learning: reflex and adaptive. The reflex behaviors, by analogy with living beings, are the behaviors that exist from the beginning and need not be learned. They are activated as a result of intense stimuli (when the hand gets too close to a fire, it produces sudden pain or a panic reaction) or instinctive needs for life (a newborn does not need to learn to suck), and the response is immediate and intuitive. As a rule, the fundamental factor for these kinds of stimuli is response time, since it is necessary to act with maximum speed to avoid possible irreversible damage. In the case of the mobile robot, an example of instinctive behavior is the reaction to a collision. This behavior is activated when the contact sensors perceive the pressure of an obstacle. In this situation the robot stops and backs away from the obstacle to avoid damage, since, if the situation persists, the motors could be damaged; thus, a rapid and not too complex response is necessary to reverse the situation. The following have been defined as instinctive behaviors: collision, stroll, correct and wandering.

The adaptive behaviors require a previous study of the process and a correct selection of the performance commands. Furthermore, the learning and development of these behaviors is possible, based on the use of evaluation techniques. These characteristics permit the robot to adapt to changing situations through a reinforcement learning process, in which the commands that allow the robot to operate correctly (absence of collisions, goal reaching in the shortest possible time) are primed, allowing the system to evolve toward accurate operation. The following have been defined as adaptive behaviors: avoid, wall following and goal reaching.

The behaviors map can also receive external or motivational signals that can inhibit or artificially activate any behavior. This can be very useful in program debugging. Furthermore, it allows the system to elicit a behavior depending not only on the information from the sensors, but also on the internal motivational state. For example, the robot can go to a destination to charge its batteries if these are running out.
The behaviors map sends signals to two other modules. On the one hand, the behaviors map is connected to the velocity vector and sends performance commands for the instinctive behaviors, which need immediate intervention but do not need any type of learning. On the other hand, the behaviors map sends activation signals to the sensorial map for the more complex behaviors, which need continuous learning. The sensorial map is made up of several self-organizing networks [6], each one representing one of the behaviors, and is connected to the velocity vector through adaptive connections which define the appropriate linear and angular velocities that the robot must implement to produce the elicited behavior. These connections are adapted through reinforcement learning by means of a set of critical functions defined for each behavior. The purpose of these critical functions is to modify the weights in the direction that improves the execution of each behavior. Finally, the VITE module receives signals related to the linear and angular reference velocities, and produces soft speed profiles avoiding jumps or discontinuities.

Learning takes place in three modules: in the behaviors map, where the robot learns to select a behavior depending on sensor information; in the velocity vector, where the robot learns the velocity distribution that is appropriate for each behavior; and in the sensorial map, where an adaptive process codifies, by means of self-organizing networks, the sensorial states that the robot detects in each behavior. Below, a more detailed description of the modules and equations that make up the model, as well as of the connections and interactions between the different modules, is given.
A. Intermodal Map

This map has two different modules, depending on the information it processes:
- Exteroceptive module.
- Proprioceptive module.

The exteroceptive module receives and processes the information from different types of sensors (infrared, ultrasonic and bumper), obtaining information from the environment that is updated continuously. Two types of interaction are produced. On the one hand, the information from the different types of sensors is contrasted, selecting the most significant readings that can produce risk situations for the safety of the robot. On the other hand, adjacent sensors, or those in nearby positions, are mutually reinforced to guarantee that the environment they define is more realistic. These two kinds of contrast allow the robot to obtain a more realistic perception of the environment, less affected by noise. As can be seen in figure 2, this module receives information from different types of sensors and in different spatial directions, and integrates the information in a competitive neural network. The exteroceptive module is modeled by a recurrent competitive network [7], which produces a normalization process that weights the activity of each neuron within the total neural activity; moreover, its dynamics act as a low-pass filter that suppresses noise. The equations that define this network are:
$$I_i = \sum_M f_M(J_i^M)\, J_i^M \qquad (1)$$

$$\frac{dS_i}{dt} = -A S_i + (B - S_i)\, I_i - S_i \sum_{j \neq i} I_j \qquad (2)$$
where $J_i^M$ is the $i$th sensor reading for modality $M$ (e.g., infrared, ultrasonic, bumper), $I_i$ represents the sensory fusion for position $i$, and $f_M$ weights the influence of each kind of sensor in the data fusion process. Figure 3 shows the relative importance of the sensor readings as a function of the detection distance to obstacles, characterized by the sensorial reading $J_i$. For small distances, the information that prevails is that of the infrared sensors; since these are unable to detect objects at moderate distances, priority is then given to the ultrasonic sensors. Equation 2 performs a normalization of the sensor information that represents the distance to obstacles in every direction around the robot. The activity $S_i$ represents the normalized information on the presence of obstacles in direction $i$. The constant $A$ is the passive decay that makes the activity tend towards zero in the absence of excitation or input, while the constant $B$ determines the saturation level of each neuron. To see the normalization effect of equation 2, the equation can be solved at steady state, that is $dS_i/dt = 0$, which yields
$$S_i = \frac{B I_i}{A + I_i + \sum_{j \neq i} I_j} \qquad (3)$$

where the activity of each neuron is normalized by the total neural activity. Typical values for this equation are $A = 10$, $B = 1$, which correspond to a short-term dynamic equation in comparison with the long-term dynamic equations that involve weight adaptation.

The proprioceptive module maintains an internal representation of the position and orientation of the robot. This is accomplished by an odometry process in which the robot integrates the information originating from the motors (linear and angular velocity), allowing it to obtain the distance and turning angle needed to reach a destination.
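As a concrete illustration of the exteroceptive processing, the following Python sketch implements the sensor fusion of equation 1 and the steady-state normalization of equation 3 for a ring of directions. It is only a sketch: the weighting curves, the parameter values and the example readings are assumptions made for this example, not the curves or data used by the authors.

```python
import numpy as np

def fuse_and_normalize(J_ir, J_us, A=10.0, B=1.0):
    """Sensor fusion (Eq. 1) followed by the steady-state shunting
    normalization (Eq. 3) over a ring of spatial directions.
    J_ir and J_us are per-direction infrared and ultrasonic readings;
    the weighting curves below only mimic the qualitative shape of
    figure 3 (infrared dominant at short range, ultrasound beyond)."""
    J_ir = np.asarray(J_ir, dtype=float)
    J_us = np.asarray(J_us, dtype=float)

    f_ir = np.exp(-J_ir / 0.5)          # assumed short-range weighting
    f_us = 1.0 - np.exp(-J_us / 0.5)    # assumed long-range weighting

    I = f_ir * J_ir + f_us * J_us       # Eq. 1: per-direction fusion

    # Eq. 3: S_i = B*I_i / (A + I_i + sum_{j != i} I_j) = B*I_i / (A + sum_j I_j)
    return B * I / (A + I.sum())

if __name__ == "__main__":
    ir = np.full(16, 0.1)
    us = np.full(16, 3.0)
    ir[7:9], us[7:9] = 1.0, 0.4         # something detected ahead (illustrative)
    print(np.round(fuse_and_normalize(ir, us), 4))
```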
B. Behaviors map

This module determines the behavior, or behaviors, to be activated depending on the exteroceptive information (e.g., ultrasonic, infrared, laser) and the proprioceptive information (which represents the state of the robot itself: position, speed, battery charge, etc.). The behaviors, mentioned above, are: collision, stroll, avoid, wall following, goal reaching, exploration and correct. The behaviors map is implemented through a neuron layer in which every neuron represents a behavior. This map receives signals from the intermodal map and sends signals for the active behavior to the sensorial map and the velocity vector. The behaviors map is described by the shunting on-center off-surround equation [7]:
$$\frac{dC_i}{dt} = -A C_i + (B - C_i)\left(E_i K_i \sum_j [w_{ij} - S_j]^+ + f(C_i)\right) - C_i \sum_{j \neq i} f(C_j) \qquad (4)$$
where $C_i$ represents the activity of behavior $i$, $S_j$ is the activity of neuron $j$ of the intermodal map obtained from equation 2, $A$ and $B$ have meanings similar to those in equation 2, $E_i$ characterizes the internal motivational state of the robot (e.g., low batteries), $K_i$ expresses the relative importance of the $i$th behavior (for instance, stroll has priority with respect to avoid or wall following, e.g., $K_i = 0.6, 0.2, 0.2$ for the stroll, avoid and wall following behaviors respectively), and $w_{ij}$ is the adaptive weight between node $j$ of the intermodal map and the node that represents behavior $i$ in the behaviors map. Depending on the shape of the function $f(C_i)$ (linear, faster than linear, slower than linear or sigmoid), different dynamics are obtained. In particular, with a faster-than-linear function $f(C_i) = C_i^2$ the competition results in only one behavior being active (winner takes all). Finally, $[\cdot]^+$ denotes rectification, in which only positive values are considered and negative values are set to zero. Again, this equation can be solved at steady state; in the winner-takes-all case, the active or winner behavior is the node that receives the most input from the intermodal map.

The internal motivational state can be used to prime the activation of a certain behavior. For example, a planner can decide to go along a corridor to reach a certain destination; in this situation, it could be interesting to force the activation of the wall following behavior (e.g., $E_i = 1$ for the wall following behavior and zero otherwise).

As shown in figure 4, the adaptive behaviors send signals to the sensorial map, while the reflex behaviors send signals directly to the velocity vector (without any intermediate step). Some behaviors are represented by two nodes. This is the case when a behavior possesses two symmetrical reactions, for example, when the robot should follow a wall to the left or to the right. Figure 4 also gives a detailed description of the behaviors map. The difference between instinctive and adaptive behaviors can be observed, characterized by the module which receives their signals: the sensorial map or the velocity vector. Finally, the special character of the exploration behavior, whose activation results in the inhibition of the goal reaching behavior, can also be observed.
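To make the winner-take-all reading of equation 4 concrete, the sketch below integrates a simplified version of the shunting competition with the faster-than-linear signal $f(C) = C^2$. The parameter values, the initial activities and the omission of the motivational and importance gains $E_i$, $K_i$ are assumptions of this example only.

```python
import numpy as np

def compete(drive, A=1.0, B=1.0, dt=0.01, steps=2000):
    """Simplified recurrent competition in the spirit of Eq. 4.  'drive' is
    the net excitatory input each behavior receives from the intermodal map
    (the K_i * sum_j [w_ij - S_j]^+ term in the paper).  With the
    faster-than-linear feedback f(C) = C**2 the activity pattern is
    contrast-enhanced, so the behavior with the strongest drive dominates."""
    C = np.full(len(drive), 0.01)           # small initial activities
    I = np.asarray(drive, dtype=float)
    for _ in range(steps):
        f = C ** 2                          # faster-than-linear signal
        inhibition = f.sum() - f            # off-surround: sum over j != i
        dC = -A * C + (B - C) * (I + f) - C * inhibition
        C = np.clip(C + dt * dC, 0.0, B)
    return C

if __name__ == "__main__":
    # Assumed drives for (stroll, avoid, wall following, goal reaching, exploration).
    print(np.round(compete([0.05, 0.30, 0.20, 0.10, 0.02]), 3))
```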
The relationship between the sensorial information and the activation of the different behaviors is learned through a process that depends on the active behavior. For example, reinforcement learning permits the firing of the stroll behavior after several crashes detected by the bumper; this learning allows the robot to anticipate future collisions. The discrete equation that describes this learning is:

$$w_i^{ST}(t) = w_i^{ST}(t-1) + \alpha\, (B - w_i^{ST})\, BP(t)\, f(d_{I,i}) \qquad (5)$$
where $w_i^{ST}$ is the adaptive weight between the $i$th sensor and the behavior "stroll", and $BP(t)$ is a temporal function (representing the unconditional stimulus, UCS, of the negative reinforcement) that stays inactive (zero) until a crash occurs; it is then activated for a short period and quickly returns to its initial null value. This allows the robot to learn only at the onset of a crash, and disables learning if the situation persists. $d_{I,i}$ is the normalized distance between the winner neuron $I$ and neuron $i$ of the intermodal map, and $f$ is a function that decreases with this distance (figure 5.a), so that only the neurons closer to the winner are adapted. This kind of function is biologically inspired and characterizes the lateral interactions between certain neurons of the cerebral cortex. Figure 5.b shows the discrete approximation used in the proposed architecture. Finally, $\alpha$ is the learning rate and $B$ is the maximum value that the weights can reach (e.g., $B = 1$).

The behavior "avoid" is learned by an analogous method. In this case, the equation is:
$$w_i^{AV}(t) = w_i^{AV}(t-1) + \alpha\, (B - w_i^{AV})\, ST(t)\, f(d_{I,i}) \qquad (6)$$
where all the variables have meanings similar to those in the previous equation, and the $ST$ function, like $BP$, is quickly inhibited a short time after activation.
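A minimal sketch of the crash-triggered weight update of equations 5 and 6 is given below, assuming a ring of intermodal nodes and a discrete neighborhood function in the spirit of figure 5.b; the learning rate and the neighborhood values are illustrative, not the authors' settings.

```python
import numpy as np

def reinforce_stroll_weights(w, winner, bp, alpha=0.01, B=1.0):
    """Discrete update of Eq. 5.  'bp' is the temporal crash signal BP(t)
    (close to 1 just after a bumper event, 0 otherwise); the weights that
    link intermodal nodes near the winner direction to the stroll behavior
    are pushed towards B, so the behavior fires earlier next time.  The
    neighborhood values (1, 0.5, 0) are an assumed discrete approximation
    of figure 5.b; a ring topology is also assumed."""
    n = len(w)
    d = np.abs(np.arange(n) - winner)       # node distance to the winner
    d = np.minimum(d, n - d)                # wrap around the sensor ring
    f = np.maximum(0.0, 1.0 - 0.5 * d)      # neighborhood function f(d)
    return w + alpha * (B - w) * bp * f     # Eq. 5 (Eq. 6 is analogous with ST(t))

if __name__ == "__main__":
    w = np.zeros(16)
    for _ in range(200):                    # repeated frontal crashes at node 8
        w = reinforce_stroll_weights(w, winner=8, bp=1.0)
    print(np.round(w[6:11], 3))             # frontal weights grow towards B
```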
C. Sensorial Map

In this module, the sensorial information coming from the intermodal map is modulated as a function of the active behavior. A self-organizing network is employed for each behavior, as shown in figure 6. Each layer of the sensorial map receives signals through adaptive connections from the intermodal map. Each behavior is entrusted to the activation of one of the self-organizing networks, in such a way that if there is only one active behavior, then only the corresponding network will be activated and will send control commands to the velocity vector. At the beginning, no relationship between the neurons of each net exists, so each one represents an arbitrary sensorial state; thus, there is no topological relationship between the sensorial information and the activation of each neuron. While the robot is moving, the net evolves through a learning process, so that the different sensorial states that take part in each behavior are incorporated. These sensorial states form an ordered structure, in which adjacent states are represented by neighboring neurons within the net, producing a topologically arranged network. This is accomplished by a Kohonen self-organizing network [6]:

1. The weights of each self-organizing network are initialized with null or random values.

2. Each neuron $(i,j)$ in the map computes the similarity between its synaptic weight $w_{ij}$ and the input vector from the intermodal map. This can be done by computing the Euclidean distance:
$$x_{ij}^c = \| \vec{w}_{ij}^{\,c} - \vec{s}\, \| = \sqrt{\sum_k (w_{ijk}^c - S_k)^2} \qquad (7)$$
where $x_{ij}^c$ represents the neuron activity for behavior $c$ at row $i$ and column $j$, $S_k$ is signal $k$ coming from the intermodal map, and $w_{ijk}^c$ is the weight of the adaptive connection between node $k$ of the intermodal map and node $(i,j)$ of the sensorial map for behavior $c$. Therefore, the activity level of each neuron depends on the degree of coincidence between the weight vector and the sensorial input vector, as can be seen in figure 7.

3. The winner neuron in the map is selected according to a minimal distance criterion. If $T_{ij}^c$ is the output of neuron $(i,j)$ corresponding to behavior $c$, the following equation defines the network:
$$T_{ij}^c = \begin{cases} 1 & \text{if } x_{ij}^c = \min_{(k,l)} \{x_{kl}^c\} \text{ and } c \text{ is active (with } k, l \text{ ranging over the whole map)} \\ 0 & \text{otherwise} \end{cases} \qquad (8)$$
4. The weights of the winner neuron $(I,J)$ and of the neighboring neurons are adapted according to the following discrete learning law:

$$w_{ijk}^c(t) = w_{ijk}^c(t-1) + \beta\, f(d_{ij,IJ})\, (S_k - w_{ijk}^c) \qquad (9)$$
Similarly to equation 5, $d_{ij,IJ}$ is the normalized distance between neuron $(i,j)$ and the winner neuron $(I,J)$, $f$ is a function that decreases with this distance, and $\beta$ is the learning rate. At each iteration, the learning law moves the weight vector of the winner neuron (and of its vicinity) towards the input vector.
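The following sketch condenses steps 2-4 above for the self-organizing layer of one behavior. The grid size, the Gaussian neighborhood and the learning rate are assumptions chosen for the example; they are not the authors' exact settings.

```python
import numpy as np

def som_step(W, s, beta=0.01, sigma=1.0):
    """One iteration of the sensorial-map learning (Eqs. 7-9) for the active
    behavior.  W has shape (rows, cols, k) and holds the weight vector of
    each node; s is the input vector from the intermodal map (length k)."""
    # Eq. 7: activity = Euclidean distance between each weight vector and s.
    x = np.linalg.norm(W - s, axis=2)
    # Eq. 8: the winner is the node with minimal distance.
    I, J = np.unravel_index(np.argmin(x), x.shape)
    # Eq. 9: move the winner and its neighbours towards the input.
    rows, cols, _ = W.shape
    ii, jj = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    d = np.sqrt((ii - I) ** 2 + (jj - J) ** 2)       # distance to the winner
    f = np.exp(-(d ** 2) / (2 * sigma ** 2))         # assumed neighborhood f(d)
    W += beta * f[..., None] * (s - W)
    return W, (I, J)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.random((16, 16, 8))                      # 16x16 map, 8 intermodal inputs
    for _ in range(1000):
        W, winner = som_step(W, rng.random(8))
    print("last winner:", winner)
```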
The influence of several nodes of the intermodal map is represented in figure 8. The intermodal map is made up of the information from the sensors on the periphery of the robot ($S$ in the figure) and of the nodes that encode the distance and direction towards the destination ($D$ and $\theta$ in the figure). For the avoid behavior, the corresponding self-organizing map receives information from the central nodes of the intermodal map, which characterize the frontal sensors of the robot (fig. 8.a). For the wall following behavior, the map receives information from the lateral sensors (fig. 8.b). Finally, the goal reaching behavior receives information from the nodes that encode the distance and the angle to the destination (fig. 8.c).

D. Velocity Vector
The velocity vector $\vec{V}_t = (v_t, w_t)$ represents the linear and angular velocities that the robot must develop depending on the selected behavior and the sensor information. In [8] a distributed model is defined in which a vector of neurons stores the information on the optimal speed for each situation: the activation of the nodes generates a Gaussian whose maximum defines the speed to be implemented, and reinforcement learning of the weights of each neuron displaces the maximum of the Gaussian in one direction or the other, varying the robot speeds accordingly. In this work we have simplified that model, and the velocity vector is made up of only one neuron whose output can take a null (straight travel), positive (right turn) or negative (left turn) value. The absolute value quantifies the turning radius of the path, producing a null linear speed at maximum angular velocity. Figure 9 shows the transformation that yields the angular and linear velocities of the robot. The velocity is described by the following equation:
$$u = \sum_{i,j,c} w_{ij}^c\, T_{ij}^c + \sum_d w_d\, C^d \qquad (10)$$
where $u$ is the value of the neuron, which receives information corresponding to two terms: the first is the information coming from the sensorial map through adaptive connections, and the second corresponds to the contribution of the instinctive behaviors defined in the behaviors map. Here, $w_{ij}^c$ is the weight of the connection between node $(i,j)$ of the self-organizing map corresponding to behavior $c$ within the sensorial map and the velocity vector, and $T_{ij}^c$ is the output of the sensorial map defined in equation 8. $C^d$ represents the instinctive behaviors of the behaviors map, collision, stroll and correct ($C^d = 1$ if instinctive behavior $d$ is active and zero otherwise), and $w_d$ are the corresponding weights. The output $u$ is modulated by the following transformation functions (figure 9):
$$v_t = K_v\, f_u(u) \qquad w_t = K_w\, u \qquad (11)$$
where $K_v$ and $K_w$ are gain constants (e.g., $K_v = 0.5$ m/s, $K_w = 1.5$ rad/s) that scale the function $f_u(u)$ (defined as $f_u(u) = \exp(-u^2/0.2)$) and $u$ to the velocity range that the robot can develop. Figure 9 shows how the transformation functions give a maximum positive value to the linear velocity and zero to the angular velocity when $u$ is null. Conversely, the robot performs zero linear velocity and maximum angular velocity (positive or negative) for the extreme values of $u$. The weights $w_d$ are chosen specifically to generate the desired velocities for each instinctive behavior; for example, the weight of the correct behavior should be zero to produce straight movements, and the weight of a right (left) turn should be positive (negative).
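A minimal sketch of the transformation of equation 11 is given below, using the gain values quoted above; the scalar command u is assumed to be the output of equation 10.

```python
import numpy as np

def velocities_from_u(u, Kv=0.5, Kw=1.5):
    """Transformation of Eq. 11: the scalar command u (Eq. 10) is mapped to a
    linear velocity that is maximal when u = 0 (straight travel) and an
    angular velocity proportional to u (the sign gives the turning side)."""
    fu = np.exp(-u ** 2 / 0.2)       # bell-shaped gating of the linear speed
    return Kv * fu, Kw * u           # (v_t in m/s, w_t in rad/s)

if __name__ == "__main__":
    for u in (-1.0, -0.3, 0.0, 0.3, 1.0):
        v, w = velocities_from_u(u)
        print(f"u = {u:+.1f}  v = {v:.3f} m/s  w = {w:+.2f} rad/s")
```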
The velocity signal that is sent to the VITE module is learned through the modification of the adaptive connections. This modification depends on a different function for each behavior, which defines the degree of satisfaction in the fulfillment of the objective of that behavior. This function is known as the critical function and depends on the sensorial information of the robot (see figure 10). To determine the weight connections to be adapted, it is necessary to keep a trace of the last active nodes for each behavior. These weight connections are adapted depending on the critical function and on how recently the nodes were active, so that recently activated nodes are more affected by the learning than older ones. The equation is:

$$w_{ij}^c(t) = w_{ij}^c(t-1) + \gamma\left([Q^c(S_1 \ldots S_N)\, act(i,j,t)]^+ (w_{max} - w_{ij}^c) + [Q^c(S_1 \ldots S_N)\, act(i,j,t)]^- (w_{min} - w_{ij}^c)\right) \qquad (12)$$
The constant $\gamma$ is the learning rate, while the function $act(i,j,t)$ defines which weights should be adapted and depends on the nodes that have recently been activated in the intermodal map. In this way the nodes learn in proportion to their nearness in time to the most recently activated node, as shown in figure 11. $Q^c$ is the critical function for behavior $c$, and is chosen heuristically (see below). Finally, $w_{max}$ and $w_{min}$ are the extreme values of the weights (e.g., $w_{max} = 1$ and $w_{min} = -1$), and $[\cdot]^+$ and $[\cdot]^-$ denote positive and negative rectification respectively (negative and positive values, respectively, are filtered to zero), which permit the weights to increase or decrease according to the sign of the critical function.

The critical function for the right wall following behavior is defined as follows:
$$Q^{WR} = 0.7\,\big[0.8(S_{12} - d) + 0.1(S_{13} - d) + 0.1(S_{14} - d)\big] + 0.3\,(S_{12} - S_{14}) \qquad (13)$$
where $S_{12}$, $S_{13}$ and $S_{14}$ are the activities of the intermodal map corresponding to the three right lateral sensors and $d$ is the desired distance to the wall. The first three terms in the equation take into consideration the distance to the wall, while the last one evaluates the orientation with respect to the wall. A similar function is defined when the wall is detected to the left. The critical function for the right avoiding behavior is defined as follows:
$$Q^{AR} = -\left(0.4\,[w_9^{AV} - S_9]^+ + 0.3\,[w_{10}^{AV} - S_{10}]^+ + 0.2\,[w_{11}^{AV} - S_{11}]^+ + 0.1\,[w_{12}^{AV} - S_{12}]^+\right) \qquad (14)$$

where $S_9$, $S_{10}$, $S_{11}$ and $S_{12}$ are again the activities of the intermodal map corresponding to the frontal-right sensors, and $w_9^{AV}$, $w_{10}^{AV}$, $w_{11}^{AV}$, $w_{12}^{AV}$ are the corresponding weights of equation 6. This critical function takes a negative value when an obstacle is detected to the right closer than a certain distance defined by the adaptive weights. The equation also reflects that the most frontal sensors make a higher contribution to the evaluation of the critical function, as the corresponding terms are multiplied by larger constants.
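The sketch below illustrates the reinforcement update of equation 12 driven by a critical function value. The eligibility trace act(i,j,t) is approximated here by an assumed exponential decay (cf. figure 11), the negative rectification is interpreted as the magnitude of the negative part so that a negative critical value pulls weights towards w_min, and all numerical values are illustrative.

```python
import numpy as np

def update_velocity_weights(w, trace, Q, gamma=0.01, w_max=1.0, w_min=-1.0):
    """Reinforcement update of Eq. 12: each weight moves in proportion to its
    eligibility trace act(i,j,t) and to the sign of the critical function Q.
    A positive Q pushes the eligible weights towards w_max, a negative Q
    towards w_min (negative rectification taken here as the magnitude of the
    negative part)."""
    drive = Q * trace
    dw = (np.maximum(drive, 0.0) * (w_max - w) +
          np.maximum(-drive, 0.0) * (w_min - w))
    return w + gamma * dw

def decay_trace(trace, active, lam=0.7):
    """Assumed eligibility trace: the node activated now gets value 1 and
    older activations fade geometrically (cf. figure 11)."""
    trace = lam * trace
    trace[active] = 1.0
    return trace

if __name__ == "__main__":
    w, trace = np.zeros(10), np.zeros(10)
    for node in range(5):                       # nodes 0..4 activated in sequence
        trace = decay_trace(trace, node)
    Q = -0.5                                    # e.g. the robot got too close to a wall
    print(np.round(update_velocity_weights(w, trace, Q), 4))
```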
E. VITE module

The VITE module (Vector Integration To Endpoint) [9] generates the reference velocity profiles (bell shaped) to be carried out by the motors of the robot, thus avoiding fast accelerations during operation. The equations that describe the VITE module for the velocity are:
$$\frac{d\vec{V}}{dt} = -\vec{V} + \vec{V}_t - \vec{V}_r \qquad (15)$$

$$\frac{d\vec{V}_r}{dt} = G_0\, \frac{t^2}{1 + t^2}\, \vec{V} \qquad (16)$$

where $\vec{V}_t$ is the velocity reference vector (linear $v_t$ and angular $w_t$) obtained in the previous section, $\vec{V}_r$ is the real velocity vector developed by the robot, and $\vec{V}$ is the difference vector. Finally, $G(t) = G_0\, t^2/(1+t^2)$ is a gain function that guarantees soft accelerations at the beginning of the movements. The structure of the VITE module is represented in figure 12. The difference vector $\vec{V}$ computes the difference between the reference velocity $\vec{V}_t$ and the real velocity $\vec{V}_r$, and is integrated into $\vec{V}_r$ after multiplication by the $G(t)$ function. The main characteristic of $G(t)$ is that it controls the integration speed of $\vec{V}_r$: it starts from zero at the beginning of the movement and increases over time to a maximum value (according to the gain $G_0$), letting the linear and angular velocities change synchronously according to a bell-shaped profile. A complete description of the VITE module is presented in [9].
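As an illustration of the velocity smoothing, the following sketch integrates equations 15 and 16 with a forward Euler step; the step size, gain and duration are assumptions chosen for the example.

```python
import numpy as np

def vite_profile(v_target, w_target, G0=1.0, dt=0.2, steps=60):
    """Forward Euler integration of the VITE equations (Eqs. 15-16).  The
    difference vector V is low-pass filtered towards (Vt - Vr), and the real
    velocity Vr integrates G(t) * V, where G(t) = G0 * t^2 / (1 + t^2) grows
    from zero so that the movement starts with a soft acceleration."""
    Vt = np.array([v_target, w_target], dtype=float)   # reference velocities
    Vr = np.zeros(2)                                   # real velocities
    V = np.zeros(2)                                    # difference vector
    profile = []
    for k in range(steps):
        t = k * dt
        G = G0 * t ** 2 / (1.0 + t ** 2)
        V += dt * (-V + Vt - Vr)        # Eq. 15
        Vr += dt * G * V                # Eq. 16
        profile.append(Vr.copy())
    return np.array(profile)

if __name__ == "__main__":
    prof = vite_profile(0.5, 0.0)                  # ramp up to 0.5 m/s, no rotation
    print(np.round(prof[::10, 0], 3))              # sampled linear-velocity profile
```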
III. EXPERIMENTAL RESULTS

The architecture explained above has been tested on the experimental mobile robot Nomad 200. This robot is driven by a synchro-drive mechanism (one motor to drive the wheels and a second motor to steer all the wheels), and is equipped with pressure-sensitive sensors to detect collisions, sonar and infrared sensors to detect and determine the distance to obstacles, and a 2D laser system to obtain the profile of objects. The Nomad 200 has onboard computers for sensor and motor control and for communication with a host computer. The mobile base keeps track of its position and orientation (integrated over time). The Nomad 200 includes a complete software package for the host computer with a graphic interface and a robot simulator.
The architecture has first been implemented in the simulator in order to speed up the learning process and to test the correct performance of the model and its ability to adapt to different characteristics of the environment. It has then been validated on the real robot, where the weights can be initialized to those learned in simulation in order to avoid multiple collisions with obstacles in the initial learning phase. In this section a description of the results obtained is given. All the experiments have been performed with a time step of 0.2 s, learning rates $\alpha = \beta = \gamma = 0.01$, and $G_0 = 1$.

In figure 13 the evolution of the avoid behavior is shown without initial learning, and after 5000, 10000 and 20000 learning iterations. In this experiment only the avoid and stroll behaviors have been allowed to be activated, in order to achieve greater clarity in the representation. At the beginning (figure 13.a) the connections toward the velocity map have null values. The robot follows a straight path until a collision with an obstacle occurs; at that moment the robot turns and continues in a straight line until a new crash occurs before the stroll behavior can be activated. While these situations are happening, the connections to the velocity vector are adapted. After 5000 iterations (figure 13.b) the robot starts to turn before the obstacle is reached, but the turn is still insufficient and finally the stroll behavior is activated. After 10000 iterations (figure 13.c) the robot is capable of avoiding some obstacles, though it has still not thoroughly learned all the connections between the sensorial map and the velocity vector. Finally, after 20000 iterations (figure 13.d), the robot is capable of avoiding any obstacle in its way without difficulty, since the weights have been correctly learned. The continuous learning while the robot is in operation allows the robot to adapt to the characteristics of the environment; for example, in cluttered environments the robot develops greater angular velocities than in spacious environments.

The learning evolution of the wall following behavior is presented in figure 14. In this case the avoid learning has been inhibited for a clearer representation. To permit a rapid adjustment of the weights, the robot has been located in a specially prepared environment, so that this behavior can be activated a greater number of times, allowing the following of walls and corners. It is observed that at the beginning, without learning (figure 14.a),
the stroll behavior is activated and the robot is not able to follow the wall. After 5000 iterations (figure 14.b) the robot acquires the capacity to follow the wall but still follows a very irregular path, because not all the weights have learned adequate values. After 10000 iterations (figure 14.c), the robot is capable of performing an acceptable path but has difficulty turning corners. In figure 14.d the robot is capable of avoiding the central obstacle, overcoming the difficulty that the corners pose. In the lower part of figure 14.c an over-learning of the weights can be seen: the robot has learned to approach the wall but does so at an excessive speed, is unable to stop in time, and the stroll behavior is activated close to the wall. After several iterations the robot is capable of stabilizing its weights and thus of operating adequately.

Figure 15 shows the distribution of one self-organizing network of the sensorial map. The figure represents the weight distribution, where regions of similar gray level correspond to similar weight values. At the beginning the weights have random values. After a training period the weights have become distributed so that they cover the whole sensorial space. It can be seen that neighboring nodes have similar values, due to the distributed learning of the self-organizing networks. The first panel shows an untrained map, and the others show the maps after 1000, 2000 and 4000 learning iterations. The vicinity concept in the learning of the sensorial space corresponding to each behavior produces the smooth distribution and grouping of the weights.

Figure 16 shows the ability of the robot to reach and pass through several consecutive destinations that could be determined by a planner. In this situation it is not necessary to stop at each intermediate point; the robot can continue until the final destination is reached. Once the robot reaches an intermediate point, it receives the position and direction of the following point, which prevents the robot from stopping.

The following paragraphs explain the development of a complete path from an initial position until the destination is reached, through the activation of the different behaviors. Figure 17 shows the path executed by the robot. The robot starts to move from position 0. Although there is an obstacle that prevents the robot from directly reaching its destination, the distance is still not sufficiently small to activate the avoid behavior, and the robot moves in a straight line between point 0 and point
1. When the robot is close to the obstacle, the avoid behavior is activated and the obstacle is passed to the right. In this case the stroll behavior is not activated (activation of the stroll behavior would perhaps have been necessary in the presence of a corner or of obstacles with a more complex configuration). As the obstacle is passed, it lies to the left of the robot and is detected by the lateral sensors, whose signals activate the wall following behavior (this occurs at point 2). Since the robot still does not detect free space, it continues with this behavior, turning the corner and maintaining an adequate distance from the obstacle. The robot continues along the opposite side of the obstacle until it detects free space toward its destination, and the goal reaching behavior is activated (point 3) until point 4 is reached.

One of the characteristics of the model developed is the capacity to navigate in complex environments with obstacles placed in arbitrary ways, and to reach any proposed destination. Figure 18 shows several environments with different obstacle configurations and how the robot reaches its destination. Because the model is reactive and no general map of the environment exists, the destination is not always reached by following a minimal-length path. In fact, in figure 18.b it can be observed that the robot has a free path behind it that would permit it to reach its destination quickly. To detect this situation it would be necessary to complete the model with a map of the environment and path planning procedures to determine the intermediate points that the robot must reach in order to follow an optimal path.

Figure 19 shows the robot wandering the environment. In this case, the goal reaching behavior is inhibited, permitting the activation of the rest of the behaviors depending on the sensor information. It can be seen that the entire path is free of collisions and reasonably smooth. In some circumstances (particularly at corners) the avoid behavior is insufficient to turn in time and avoid the collision, and the stroll behavior is then activated, thus avoiding a collision between the robot and an obstacle.

After verifying the correct operation of the neural controller in simulation, it has been tested on the real robot by sending movement commands in response to the sensor information. During the parameter adjustment process, several differences were observed between the
simulated model and the real model:
- Although a certain degree of uncertainty has been considered in the sensor readings, this uncertainty is higher in the real robot, where bad readings are frequently produced.
- The simulation considers a two-dimensional environment, while the world in which the real robot must act has three dimensions. This produces some difficulties in the detection of objects at different heights (the Nomad 200 has its sonar ring at 1 m height, while the infrared ring is only 10 cm from the floor). This situation is corrected by integrating the information from the different types of sensors, located at different heights on the robot.
- There are some limitations in the detection of obstacles which are intrinsic to the nature of the sensors. The infrared sensors are unable to detect transparent objects, are very dependent on the environmental light, and are also affected by the color and the inclination of the obstacle. Sonar readings, on the other hand, depend on the texture and absorptive capacity of the obstacle and present difficulties associated with the existence of echoes.
- The presence of people and mobile objects generates a non-stationary environment, subject to rapid structural changes (movement of people, opening and closing of doors, movement of objects, etc.).
The only source of information the mobile robot has for perceiving the environment is its sensorial system. As mentioned above, the Nomad 200 is equipped with contact, sonar and infrared sensors. While the contact sensors can only take two values (which represent collision or no collision with an object), the other sensors permit a wider range of values. Furthermore, infrared sensors can only detect close objects, while ultrasonic sensors are capable of detecting obstacles at several meters, but with great uncertainty.

For the experiments, a typical office environment has been used, with the presence of people and objects such as tables, chairs or closets. The representation of the office environment is shown in figure 20. In (a) the approximate plan of the test room is represented. The perception that the robot has of the environment through the ultrasonic sensors after an exploratory journey is shown in (b), while in (c) the infrared sensor information is presented for the same exploratory path. A clear difference can be observed between
the two figures: while the ultrasonic sensors permit a more general representation of the environment, the infrared sensors achieve greater precision on nearby obstacles but leave large unexplored zones in the places far from the path of the robot. Finally, in (d) the three types of information are integrated in the same figure.

Within the environment defined in the previous paragraph, corresponding to a room with office furniture and the presence of people, different tests of learning and navigation were performed. In figure 21 the evolution of the learning in the real robot is shown. The learning adapts the weights and, in successive iterations, allows smoother trajectories to be performed. In figure 21.a the robot has no experience, and when it approaches the obstacles only instinctive behaviors such as collision or stroll can be activated. After 10000 iterations the robot reacts close to the obstacles, but the velocity is still not adequate to avoid the obstacles without activating the stroll behavior. After 20000 iterations, the robot performs smooth trajectories but still reacts slowly. Finally, after 40000 iterations, the robot is able to avoid obstacles and to skirt them without difficulty.

Different goal reaching tests are shown in figure 22, where the initial position is labeled 0 and the destination 1. In figure 22.a the robot does not find any obstacle in its path, so it turns in the direction of the goal and reaches it. In figure 22.b an obstacle is located in the robot's path; the robot reacts by activating the avoid and wall following behaviors and, after skirting the obstacle, detects a free path to the destination and goes toward it. In (c) the robot avoids several obstacles detected by the ultrasonic and infrared sensors. In (d) several non-stationary obstacles are present: initially the robot activates the avoid behavior and attempts to skirt the obstacle, but finally the obstacle or person moves away and the robot goes directly towards its destination.

The architecture can easily be combined with other planning or control models. For example, a sequence of intermediate points that the robot must reach in order to carry out a certain task can be defined. In figure 23 the robot is shown passing through several successive destinations. The robot crosses a narrow corridor to reach the first intermediate destination, then heads toward the following destination without finding obstacles in its path, and finally reaches the last destination after avoiding an obstacle.
When no destination to be reached is defined, the robot can develop an exploration behavior. In this situation, the robot activates the different behaviors depending on the sensor information. A collision-free path is thus obtained which, to a certain degree, is random, and which could be used to generate or update possible environment maps. The greatest difficulty for the robot is the presence of corners and other similar situations; in these situations it is necessary to activate the stroll behavior or even, in extreme situations, the collision behavior, which makes the robot back up. Several tours in exploration mode are shown in figure 24. It can be observed that the activation of the different behaviors endows the robot with a great capacity to recognize important characteristics of the environment. It is possible to recognize convex and concave corners, narrow corridors, obstacle-free space, walls, people or other mobile obstacles, etc. This capacity would make the generation of environment maps and their updating and maintenance possible.
IV. RELATED WORK

Different classical conditioning and neural network models have been applied to behavior learning in mobile robots. Baloch and Waxman [10] used a variant of Grossberg's conditioning circuit for the visual navigation of the MAVIN robot. Similar conditioning models have been applied in [11], [8], [12] for learning the avoid behavior through the association of a conditional stimulus (CS) (information from the ultrasonic sensors) and an unconditional stimulus (UCS) (a collision detected by the bumper). In a similar way, Verschure and colleagues [13], [14], [15] have proposed Distributed Adaptive Control (DAC), an architecture that learns to generate approach and avoidance behaviors in response to appetitive and aversive stimuli. Nehmzow and colleagues [16] have also used a neural network modulated by a reinforcement signal to train robots to perform approach and avoidance behaviors. However, all these approaches have centered on developing a biologically inspired methodology for avoidance behavior learning, and different aspects of sensor integration, sensor-based navigation and the combination of different behaviors are not considered.

Kurz [17] uses Kohonen's self-organizing feature maps and RCE classifiers (Restricted Coulomb Energy classifiers) for recognizing situations from the information obtained from
a ring of ultrasonic sensors. However, the approach is different, since the main objective of Kurz's work is to learn maps of the environment for position estimation, whereas our work is focused on the reactive behavioral navigation of the robot.

Beom and Cho [18] propose a sensor-based navigation method which uses fuzzy logic and reinforcement learning for the navigation of a mobile robot. The navigator consists of an avoiding behavior and a goal seeking behavior and is tested in simulated environments. These behaviors are designed independently and combined by a bistable switching function. However, aspects of sensor configuration such as orientation, disposition around the robot, calibration, etc., still have to be considered. In contrast, the proposed model develops an emergent navigation behavior from the conjunction of several behaviors; it is not necessary to consider aspects of sensor configuration because they are learned in a perception-action cycle.

In the Adaptive Heuristic Critic (AHC) reinforcement learning methods, a critic learns the utility of particular states through reinforcement. This type of learning is employed by Gachet et al. [19] for the coordination of a robot with different behaviors (goal attraction, perimeter following, free space, etc.). However, the performance is restricted to missions for each behavior and a discrete reinforcement signal (0/-1), and the robot is returned to 30 steps before the point where a collision takes place. In contrast, the proposed model performs learning and operation in real time, developing soft transitions between the behaviors. Another drawback pointed out by the authors is the poor resolution of the sensorial input.

Millan [20] used a similar method with temporal differencing [21] to avoid unfruitful paths on the way to achieving navigational paths. Similarly, Fagg et al. [22] have used temporal differencing as a form of connectionist AHC learning to permit a robot to learn collision avoidance, wall following and environmental exploration. But these approaches also suffer from poor input resolution, where the sensory inputs are of the type "mid range up sonar", "far range forward sonar", etc., which involves sensor preprocessing. Besides, they do not cope with real robots, where kinematic aspects such as soft velocity profiles should be generated.

The main contribution of the proposed system, among others, is the real-time learning
of a model which combines a set of different behaviors without external supervision and which is carried out continuously during the normal operation of the robot. Besides, the calibration or configuration of the sensors is not necessary, because the robot learns to generate adequate velocity profiles according to the sensor information of the real robot.
V. CONCLUSIONS

In this paper a neural architecture has been proposed for the control of a mobile robot that permits it to reach different goals in the presence of obstacles through the control of its linear and angular velocities. The main characteristics of the model are:
- Reactivity: the system permits a mobile robot to navigate in the environment and to develop different behaviors depending on sensor information. Thus, it is not necessary to define the complete trajectory to be followed by the robot, which would oblige it to perform replanning tasks in the presence of unforeseen circumstances.
- Decomposition into simple behaviors: this step supposes a great simplification in the development of the architecture. The conjunction and interaction of the behaviors give the robot reliable and flexible operation. The use of competitive neural networks in the intermodal and behaviors maps, and of independent self-organizing networks for every behavior in the sensorial map, permits the model to be easily extended by adding new behaviors.
- Sensorial fusion: the use of several types of sensors, each one especially efficient in particular circumstances, permits safer navigation, less affected by uncertainties or noise.
- Learning: the neural network provides the model with an adjustment capacity to obtain the optimum operation of each behavior. Furthermore, the system is capable of adapting to changes in the environment or in the robot without the need for external supervision, which gives the model a high degree of autonomy. Besides, learning is performed during normal robot operation, which permits continuous adjustment to changes in the environment.

The reliability and flexibility of the proposed architecture have been tested in simulation and on the Nomad 200 mobile robot. The operation of the robot has been verified in different environments and circumstances (U-shaped obstacles, mobile objects,
non-visible destinations, etc.), obtaining satisfactory performance. Future work will focus on the integration of the system into a complete architecture which includes map building, robot localization and planning.
References

[1] R. S. Sutton and A. G. Barto, "Toward a modern theory of adaptive networks: Expectation and prediction," Psychological Review, vol. 88, pp. 135-170, 1981.
[2] E. Thorndike, Animal Intelligence, Hafner, Darien, CT, 1911.
[3] P. Maes, "Situated agents can have goals," Robotics and Autonomous Systems, vol. 6, pp. 49-70, 1990.
[4] S. Mahadevan and J. Connell, "Automatic programming of behavior-based robots using reinforcement learning," in Proceedings of the Ninth National Conference on Artificial Intelligence (AAAI'91), pp. 768-773, Anaheim, CA, 1991.
[5] M. Asada, S. Noda, S. Tawaratsumida, and K. Hosoda, "Vision-based reinforcement learning for purposive behavior acquisition," in Proceedings of the IEEE International Conference on Robotics and Automation, pp. 146-153, 1995.
[6] T. Kohonen, "The self-organizing map," Proceedings of the IEEE, vol. 78, no. 9, pp. 1464-1480, 1990.
[7] S. Grossberg, "Contour enhancement, short-term memory, and constancies in reverberating neural networks," Studies in Applied Mathematics, vol. 52, pp. 217-257, 1973.
[8] P. Gaudiano, E. Zalama, C. Chang, and J. Lopez Coronado, "A model of operant conditioning for adaptive obstacle avoidance," in From Animals to Animats 4: Proceedings of the Fourth International Conference on Simulation of Adaptive Behavior, The MIT Press/Bradford Books, Cape Cod, Massachusetts, USA, 1996.
[9] D. Bullock and S. Grossberg, "The VITE model: A neural command circuit for generating arm and articulator trajectories," Dynamic Patterns in Complex Systems, pp. 305-326, 1988.
[10] A. Baloch and A. Waxman, "Visual learning, adaptive expectations, and behavioral conditioning of the mobile robot MAVIN," Neural Networks, vol. 4, pp. 271-302, 1991.
[11] E. Zalama, P. Gaudiano, and J. Lopez-Coronado, "A real-time, unsupervised neural network for the low-level control of a mobile robot in a nonstationary environment," Neural Networks, vol. 8, pp. 103-123, 1995.
[12] A. Buhlmeier, H. R. Everett, and L. Feng, "Operant conditioning in robotics," in Neural Systems for Robotics, O. Omidvar and P. van der Smagt, Eds., pp. 195-225, 1997.
[13] P. F. M. J. Verschure and A. C. C. Coolen, "Adaptive fields: Distributed representations of classically conditioned associations," Network, vol. 2, pp. 189-217, 1991.
[14] P. F. M. J. Verschure and R. Pfeifer, "Categorization, representations, and the dynamics of system-environment interaction," in Proceedings of the Second International Conference on Simulation of Adaptive Behavior, J. A. Meyer, H. Roitblat, and S. Wilson, Eds., pp. 210-217, Cambridge, MA: MIT Press, 1992.
[15] R. Pfeifer and P. Verschure, "Distributed adaptive control: a paradigm for designing autonomous agents," in Toward a Practice of Autonomous Systems, F. J. Varela and P. Bourgine, Eds., pp. 21-30, 1992.
[16] U. Nehmzow, "Flexible control of mobile robots through autonomous competence acquisition," Measurement and Control, vol. 28, no. 2, 1995.
[17] A. Kurz, "Constructing maps for mobile robot navigation based on ultrasonic range data," IEEE Transactions on Systems, Man, and Cybernetics, vol. 26, no. 2, pp. 433-441, 1996.
[18] H. R. Beom and H. S. Cho, "A sensor-based navigation for a mobile robot using fuzzy logic and reinforcement learning," IEEE Transactions on Systems, Man, and Cybernetics, vol. 25, no. 3, pp. 464-477, 1995.
[19] D. Gachet, M. Moreno, and J. Pimentel, "Learning emergent tasks for an autonomous mobile robot," in Proceedings of the International Conference on Intelligent Robots and Systems (IROS'94), pp. 290-297, Munich, Germany, 1994.
[20] J. Millan, "Learning efficient reactive behavioral sequences from basic reflexes in a goal-directed autonomous robot," in Proceedings of the Third Conference on Simulation of Adaptive Behavior, pp. 266-274, 1994.
[21] R. Sutton, "Learning to predict by the methods of temporal differences," Machine Learning, vol. 3, pp. 9-44, 1988.
[22] A. Fagg, D. Lotspeich, and G. Bekey, "A reinforcement-learning approach to reactive control policy design for autonomous robots," in Proceedings of the IEEE International Conference on Robotics and Automation, pp. 39-44, 1994.
Figure captions:

Fig. 1. General scheme of the proposed model.

Fig. 2. Representation of the exteroceptive intermodal map modeled as a competitive neural network. Each neuron of the map receives signals from the bumper, ultrasonic and infrared sensors and corresponds to one spatial direction.

Fig. 3. Functions that weight the influence of each type of sensor. The infrared sensors (fIR) prevail at short distances, while the ultrasonic sensors (fUS) take more importance at long distances.

Fig. 4. Behaviors map. The adaptive behaviors project their signals onto the sensorial map and the reflex or instinctive behaviors onto the velocity vector. Some behaviors are represented by two neurons (left and right).

Fig. 5. Representation of the function f: (a) biological function, (b) discrete approximation.

Fig. 6. Representation of the sensorial map composed of self-organizing networks. The intermodal map projects signals onto the sensorial map through adaptive connections. The velocity vector receives signals from the active self-organizing network, which is activated by the winner behavior in the behaviors map.

Fig. 7. Activity of a neuron as the difference between the weight vector wij and the input s: (a) the weight vector is similar to the input (small activity), (b) weight vector and input have different orientations (large activity).

Fig. 8. Projection of the intermodal map onto the different self-organizing maps of the sensorial map depending on the active behavior: (a) avoid, (b) wall following, (c) goal reaching.

Fig. 9. The projection u of the sensorial map onto the velocity vector is transformed into the linear (vt) and angular (wt) velocities to be developed by the robot.

Fig. 10. Adaptive connections of the velocity vector. Weights are adapted according to the critical function Q.

Fig. 11. Definition of the learning level as a function of the node activation at any given instant.

Fig. 12. The VITE model. The difference vector V continuously computes the difference between the reference velocity Vt and the real velocity Vr, and is integrated into Vr after multiplication by the G(t) function that modulates the integration speed.

Fig. 13. Evolution of the learning for the avoid behavior: (a) without learning, (b) after 5000 iterations, (c) after 10000 iterations, and (d) after 20000 iterations. As a reference, the paths followed by the robot in each figure take approximately 500 iterations.

Fig. 14. Evolution of the wall following learning: (a) without learning, (b) after 5000 iterations, (c) after 10000 iterations and (d) after 20000 iterations.

Fig. 15. Evolution of one self-organizing network of the sensorial map after 0, 1000, 2000 and 4000 learning iterations. Contour levels represent similar weight values.

Fig. 16. Consecutive reaching of different destinations (from the position labeled 0 to position 5).

Fig. 17. Development of a path through the activation of different behaviors.

Fig. 18. Reaching destinations in complex environments. The robot arrives at its destination in any type of environment configuration, but not always following an optimal path. Point 0 represents the original position and point 1 the destination.

Fig. 19. The robot wandering the environment.

Fig. 20. Real environment and its perception by the robot. In (a) an approximate map of the environment is shown; in (b) and (c) the maps based on ultrasonic and infrared sensor readings respectively are presented; in (d) the integration of the different types of sensor information is presented.

Fig. 21. Evolution of the learning in the real robot: (a) without learning, (b) after 10000 iterations, (c) after 20000 iterations and (d) after 40000 iterations.

Fig. 22. Goal reaching in the real environment. In (a) the path is free of obstacles, in (b) the robot has to avoid a convex obstacle, in (c) and (d) the robot must avoid several stationary and mobile obstacles.

Fig. 23. Reaching of several successive destinations in a real environment.

Fig. 24. Different tours of the robot in a real environment in exploration mode.