A Modular Agent Architecture for an Autonomous Robot

Kao-Shing Hwang, Senior Member, IEEE, Chia-Yue Lo, and Wen-Long Liu
Abstract—The architecture of a modular behavioral agent (MBA) with learning ability for hardware realization is proposed to implement a multilevel behavioral robot such that it can autonomously complete a complex task. The architecture is composed of similar modules, primitive actions, and composition behaviors. These modules, which are derived from a basic template, are capable of learning and cooperating to cope with a variety of tasks. The infrastructure of a template embeds a reinforcement learning mechanism with an adaptable receptive module (ARM)-based critic-actor model. Each template executes one specified behavior and also cooperates with other templates to form a more dexterous composed behavior. In other words, a composed behavior is constructed from several primitives with a similar modular architecture. The learning and cooperation abilities of the modules are based on a reinforcement learning technique built on a critic-actor model. The proposed architecture is implemented in a field-programmable gate array (FPGA) chip with a CPU core such that the computing device can fully utilize the merits of parallel processing of neural networks in the ARM scheme. The study is demonstrated on a mobile robot for goal-seeking, cruise, and safety ensurance tasks in an unstructured environment with obstacles such as walls and blocks. The results show that the robot with the modular architecture can perform well in unstructured environments.

Index Terms—Control systems, learning systems, mobile robot motion planning, mobile robots, modular behavioral agent (MBA), multiagent system, neural network, reinforcement learning.
I. INTRODUCTION
AUTONOMOUS robots are a significant research topic in the fields of advanced robotics and intelligent systems [1]. When autonomous mobile robots operate in unstructured and dynamic environments, problems arise, such as noisy measurements due to the lack of appropriate sensors or the difficulty of constructing an appropriate environmental model. To solve such problems, robotics researchers have developed reactive architectures in which all robot tasks are performed in terms of a set of elementary tasks referred to as primitive behaviors. In other words, a robot mission or task is always regarded as a set of primitive behaviors executed in a successive or parallel fashion.
In terms of reactive architectures, we may draw a distinction between purely reactive and behavior-based systems. Purely reactive systems, which are also called reflexive systems, can be represented by a set of mappings between input and output, for example, from sensors to actuators [2]. They have no internal memory and, instead, use the state of the environment as a form of memory. Behavior-based strategies are akin to purely reactive systems and may have reactive elements and primitive behaviors, but they can also have an internal state and an internal model of the world. Many examples of behavior-based systems have been introduced and implemented in robot control, such as the subsumption architecture [3]. The subsumption architecture, which is a competitive architecture, decomposes an architecture into a series of layers, each of which contains a number of behaviors. The main difficulty of competitive approaches lies in mechanized behavior arbitration, since it has to decide which behavior should be active and obtain control over the actuators at a particular time [4]. This problem has been solved in different ways, including hand precompilation of the control flow and priorities, and automatic recompilation using a description of the desired behavior selection. Instead of coordinating the behaviors by arbitration, some architectures allow appropriate simultaneous activation of defined behaviors to execute more complex tasks. These techniques have been implemented simultaneously or sequentially within architectures, thus allowing the execution of a variety of behaviors. Such architectures follow a multibehavioral approach, where a set of primitive actions reacting to environmental stimuli builds up complex behaviors by some kind of mechanism, such as independent sum, combination, suppression, sequence, and so on [5].

Based on the concept of autonomous agents with learning capabilities for a robot [6], the objective of this paper is to propose a hardware template that is capable of executing one primitive action with the characteristics of autonomy, flexibility, and learning ability. Furthermore, the code of the template can easily be realized in a field-programmable gate array (FPGA) system such that the advantages of a parallel processing mechanism can fully be utilized in circuits instead of in software programs.

This paper is arranged as follows. Sections II and III describe the proposed modular behavioral architecture and the adaptable receptive module (ARM)-based reinforcement learning for an autonomous mobile robot. Sections IV and V, respectively, present the details of all primitives and compositions used in the architecture. In addition, both simulation and physical experiments are presented in Section VI, and, finally, a conclusion is drawn in Section VII.
Fig. 1. MBA architecture.
II. MBA ARCHITECTURE

The modular behavioral agent (MBA) architecture for an autonomous mobile robot, which is illustrated in Fig. 1, is composed of three primitive actions and two composition behaviors. The three primitive actions are goal seeking, obstacle avoiding, and vacancy pursuing. Goal seeking drives the robot toward a specific goal. Vacancy pursuing leads the robot to free itself from objects or obstacles by taking sensory information and choosing the direction with the longest distance among adjacent measures to ensure safe motion. Obstacle avoiding drives the robot away from obstacles in its proximity by choosing a direction that ensures no collision with nearby obstacles.

The main function of a composition template is to subsume actions from the preceding primitive or composition modules. Two compositions are implemented: the cruise behavior and the safety ensurance behavior. A robot cruises in an unstructured environment under the cruise behavior and navigates toward a goal under the safety ensurance behavior. These compositions subsume different actions by means of a linear combination to form an efferent behavior such that the robot can behave with different degrees of complexity based on modular primitives or other compositions.

In Fig. 1, obstacle avoiding produces an action y_o for collision avoidance. At the same time, vacancy pursuing determines an action y_v, which drives the robot toward the most vacant region. A composition module, such as the cruise behavior, receives the actions produced by obstacle avoiding and vacancy pursuing and outputs a fused action out_cb.
Fig. 2. Structure of a primitive template with reinforcement learning.
Simultaneously, goal seeking produces an action y_g from the information coming from the sensors to guide the robot toward the goal. Because the actions y_g and out_cb always drive the robot in different directions, the other composition, i.e., safety ensurance, is used to resolve conflicts through reinforcement learning. It makes a decision based on y_g, out_cb, and the states sensed from the environment to determine the command that currently drives the robot actuators. It is noted that both primitives and compositions are constructed from a similar template.

In general, a task is executed by considering the dynamic nature of the environment perceived by the robot. For example, safety ensurance can be performed only in terms of goal seeking, i.e., as long as there are no objects (static or moving) blocking the direction of motion. To account for the dynamic nature of the environment, safety ensurance performs in terms of goal seeking and obstacle avoiding. Some missions require a small set of primitives, whereas other missions require a larger set. Due to modularity, the system easily handles the addition or removal of primitive behaviors, and, on the other hand, behaviors with a higher degree of dexterity can be developed by adding more layers to the hierarchical architecture.

III. ARM-BASED REINFORCEMENT LEARNING

The applicability of a particular learning method to a mobile robot depends on the type of information that the mobile robot needs to learn to respond to. Robot-learning approaches can be divided into four categories: learning for calibration or parameter adjustments, learning about the world, learning new behaviors for the accomplishment of specified tasks, and learning to coordinate behaviors. This paper considers the problems of the last two types, where a robot learns a set of new primitives proliferated from a template and learns to coordinate the primitives to achieve a more complex mission by a composition template akin to the primitive ones. Although the problem of coordination or arbitration can be solved by simply setting proper weightings for the activated actions and summing them up, it is difficult to provide a systematic evaluation of potential learning methodologies for weighting-parameter settings because of the various criteria and contexts used. Therefore, an architecture for the MBA with ARM-based reinforcement learning is proposed to cope with this problem instead.

Fig. 2 shows the paradigm of a template that can proliferate a variety of primitives and compositions. As the figure shows, there are three data flows between the modules: sensation, action, and reward.
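As a purely illustrative aid (not the authors' code), the following Python sketch outlines the common template interface implied by Fig. 2: every module, whether primitive or composition, exchanges the same three data flows per control cycle, sensing a state, emitting an action, and receiving a reward used for learning. All class and method names here are assumptions.

```python
# Illustrative sketch of the MBA template interface; names are assumed.
from abc import ABC, abstractmethod
from typing import Sequence

class BehaviorModule(ABC):
    """Common behavioral template proliferated into primitives and compositions."""

    @abstractmethod
    def act(self, state: Sequence[float]) -> float:
        """Map the sensed state to an action, e.g., a value in [-1.0, 1.0]."""

    @abstractmethod
    def learn(self, reward: float, next_state: Sequence[float]) -> None:
        """Adapt the internal critic-actor model from the reward feedback."""

class CompositionModule(BehaviorModule):
    """A composition adds a mixer that fuses the opinions of subordinate modules."""

    def __init__(self, children: Sequence[BehaviorModule]):
        self.children = list(children)

    def mix(self, own_action: float, child_actions: Sequence[float]) -> float:
        """Fuse the children's actions; concrete rules appear in Section V."""
        raise NotImplementedError("supplied by each composition instance")
```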
The module senses the condition of the environment through a state s_t, which is, in general, regarded as a representation of the sensory information at time t, and a reward r_t, which is an estimate of the degree of success after an action is taken. Aiming at a better reward, the module performs an action y_t to affect the environment. As a result, the environment then responds with a new reward r_{t+1} and a new state s_{t+1}. There are three elements in a critic-actor model: a policy learner; a reward function, which represents the abstract model of the environment responding to the stimulus of an immediate action; and a value function for the expected accumulative performance [7], [8]. The policy describes and maps the behaviors of a learning object from the sensed states of the environment to the responding actions. The reward function is actually a measure of the achievement of the learner; it is an estimate of the degree of success after an action is taken in the environment. The objective of a reinforcement learning mechanism is to maximize the long-term cumulative rewards. In other words, the reward function evaluates not only what is immediately good according to the immediate reward but also what is good in the long run according to the state value.

The critic-actor model is a popular model of reinforcement learning, whose decoder divides the input state variables into several regions according to experience or expertise. A drawback of this approach is that it cannot modify the region boundaries after an environmental change or a poor division. To solve this problem, we propose a decoder structure, i.e., the ARM, which holds a varying vigilance value associated with each cluster, to automatically divide the state space in terms of scatter partitioning. The design of the ARM was inspired by the concept of the adaptive resonance theory (ART) [9], [10] algorithm but with a dynamic categorization function. By merging the dynamic vigilance parameter with an online pruning mechanism, the ARM is able to realize the clustering efficiently.

A. ARM Decoder

The proposed ARM-based critic-actor model is actually based on a structured lookup-table approach. It is composed of three main parts: the ARM decoder for decoding states, the critic element for predicting the accumulative performance, and the policy learner (actor) element for action learning. The decoder maps the sensed inputs from the environment and the system states to a single cluster or a set of corresponding clusters as an internal state. In the design, each ARM neuron j, which is the index number of a category of the internal states U_j, maintains an n-dimensional reference vector C_j and a threshold T_j that determines its sensitivity region. At the beginning, input patterns are presented, and the network adapts through application of the ARM algorithm. Fig. 3 shows the ARM decoder diagram. If a new input vector S does not lie within the sensitivity region of any neuron, then a new neuron is generated by setting its reference vector C_j to be centered on the input pattern S. On the other hand, the C_j and T_j of a neuron are dynamically updated if S lies within its sensitivity region. There is also another scenario, wherein S may activate more than one neuron; this implies that the ARM network has redundant neurons with overlapped sensitivity regions.
Fig. 3. ARM decoder diagram.
The pruning function is then executed to delete one of these neurons to decrease the network size. The operations can be described in the following steps.

Step 1) Initialize the system.
Step 2) Present an input pattern S and compute d(S, C_j) for every operational neuron j, where d(S, C_j) is the Manhattan distance, i.e., a distance between two points measured along axes at right angles; e.g., on a plane with two points s1 at (x1, y1) and s2 at (x2, y2), their Manhattan distance is |x1 − x2| + |y1 − y2|.
Step 3) If d(S, C_j) > T_j for all j, then activate a new neuron j* by initializing its reference vector to C_j* = S, its threshold to T_j* = T_ini, and its deactivation parameter to Pr_j* = 1.
Step 4) If d(S, C_j) < T_j, where j is an active neuron and T_j is its threshold, then update C_j and T_j as

C_j(t + 1) = C_j(t) + αT_j(S − C_j(t))    (1)
T_j(t + 1) = T_j(t) − γ(T_j(t) − T_min)    (2)

where α, γ, and T_min are learning constants. Update the global T_ini parameter as

T_ini(t + 1) = T_ini(t) − γ(T_ini(t) − T_min).    (3)

Step 5) If several neurons are activated in Step 3), then deactivate one of the neurons if rnd( ) > Pr_j, where Pr_j is the deactivation parameter and rnd( ) is a uniformly distributed random number in the range (0, 1). Increase the activated neurons' probability of deactivation as

Pr_j(t + 1) = Pr_j(t) − η(Pr_j(t) − Pr_min)    (4)

where η and Pr_min are pruning constants.
Step 6) Go to Step 2).
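A minimal Python sketch of the decoder steps above is given below; the class name, the parameter defaults, and the choice of which overlapping neuron to prune are illustrative assumptions rather than the paper's implementation.

```python
# Illustrative ARM decoder sketch (Steps 1-6); defaults are assumed values.
import random
import numpy as np

class ARMDecoder:
    """Scatter-partitions the state space with adaptive thresholds and pruning."""

    def __init__(self, t_ini=1.0, t_min=0.1, alpha=0.1, gamma=0.01,
                 eta=0.05, pr_min=0.2):
        self.t_ini, self.t_min = t_ini, t_min
        self.alpha, self.gamma, self.eta, self.pr_min = alpha, gamma, eta, pr_min
        self.centers, self.thresholds, self.prune_probs = [], [], []

    def step(self, s):
        """Present an input pattern s and return the index of the matched neuron."""
        s = np.asarray(s, dtype=float)
        # Step 2: Manhattan distance from s to every operational neuron.
        dists = [np.abs(s - c).sum() for c in self.centers]
        active = [j for j, d in enumerate(dists) if d < self.thresholds[j]]
        if not active:
            # Step 3: no sensitivity region covers s, so create a neuron on s.
            self.centers.append(s.copy())
            self.thresholds.append(self.t_ini)
            self.prune_probs.append(1.0)
            return len(self.centers) - 1
        # Step 4: pull matched centers toward s and shrink thresholds, (1)-(2).
        for j in active:
            self.centers[j] += self.alpha * self.thresholds[j] * (s - self.centers[j])
            self.thresholds[j] -= self.gamma * (self.thresholds[j] - self.t_min)
        # Shrink the global initial threshold as in (3).
        self.t_ini -= self.gamma * (self.t_ini - self.t_min)
        # Step 5: overlapping neurons; possibly delete one (the first, by choice
        # here) and raise the survivors' chance of future deactivation as in (4).
        if len(active) > 1 and random.random() > self.prune_probs[active[0]]:
            victim = active.pop(0)
            for lst in (self.centers, self.thresholds, self.prune_probs):
                lst.pop(victim)
            active = [j - 1 for j in active]  # remaining indices shift down by one
        for j in active:
            self.prune_probs[j] -= self.eta * (self.prune_probs[j] - self.pr_min)
        return active[0]
```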
Since the essence of a neural network is a structure of processing units operating in parallel, it is hard to realize in a program, which actually executes a process in sequence. Therefore, to utilize the merits of neural networks in parallel processing, a design down to the circuit level is essential. Based on the preceding algorithm, an xc2v2000 FPGA chip of the Virtex II series from Xilinx is used to carry out the decoder. The ARM architecture is composed of five parts: the parameter and input vector register, the normalization unit, the neuron unit, the memory unit, and the control unit. The parameter and input vector register stores the parameters and the input vector of the ARM. Due to the limited number of I/O pins, the register for the input vector needs two cycles for each input. The normalization unit is used to normalize the input vector. The memory unit is used to store the various data of each neuron. The control unit is used to control the whole operation process of the ARM. The architecture of the ARM is shown in Fig. 4.

Fig. 4. ARM architecture diagram.

The neuron unit is composed of five submodules: the counter, the pruning mechanism unit, the Manhattan distance calculation unit, the weight update unit, and the critical value/pruning rate update unit. The counter is used to read the data from memory. The pruning mechanism unit starts when the usage of the neuron reaches a certain upper limit. The Manhattan distance calculation unit calculates the distance between each input vector and the central point. The weight update unit updates the weight value according to whether the input belongs to this neuron after comparing the Manhattan distance with the critical value, and the critical value/pruning rate update unit operates likewise.

B. Critic Element

The critic element in the critic-actor model produces an internal implication signal for the policy learner to modify or tune its current policy toward an optimal policy and to improve the capability of the critic. The implication signal results from the long-term reward, referred to as the state value, which is evaluated based on the current reward and the long-term reward at the last time step as

r̂(t) = r_{t+1} + γV(s_{t+1}) − V(s_t).    (5)

In addition, the long-term reward is evaluated based on the relationship described as

V(s_t) = V(s_t) + βr̂(t)    (6)

where r̂(t) is the implication resulting from (5), V(s) is the estimated state value of s, s_t indicates that state s occurs at time step t, 0 < γ ≤ 1 is a positive constant called the discount rate, and the constant β is a learning rate with a range of [0, 1]. Through the neural network architecture, the estimate of V(s), which is represented as p(t), can be computed as

p(t) = Σ_{i=1}^{n} v_i(t)U_i(t)    (7)
where v_i(t) is the weight of link i. Therefore, the internal reinforcement signal is calculated as r̂(t) = r_t + γp(t) − p(t − 1), and the critic ability can be improved by modifying the weights v_i as

v_i(t + 1) = v_i(t) + βr̂(t)ū_i(t)    (8)

where

ū_i(t + 1) = λū_i(t) + (1 − λ)U_i(t)    (9)

in which λ ∈ [0, 1].

C. Policy Learner

The policy learner empirically searches for an appropriate action based on the current state, the active clusters, and the implication from the critic element. Simply speaking, it maps the states into an action based on the current situation as

y(t) = f[Σ_{i=1}^{n} w_i(t)U_i(t) + noise(t)]    (10)
where y(t) represents the selected action, w_i(t) is the weight of link i, U_i(t) is the decoded internal state, noise(t) is a random value for exploring new actions, and f[·] is a function that maps negative values to −1, positive values to 1, and zero to 0. In addition, the policy learner has the ability to improve its selection of an appropriate action according to the last situation. The improvement capability is based on the implication, which also expresses the trend of the selected action converging to an optimal action. According to the implication signal, the policy learner improves its policy through the weights w_i as

w_i(t + 1) = w_i(t) + αr̂(t)ê_i(t)    (11)

where α is a constant that represents a linear relationship between the action code and the state value of the emerging state, together with the eligibility function

ê_i(t + 1) = δê_i(t) + (1 − δ)y(t)U_i(t)    (12)

where δ is a constant value between 0 and 1.
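The following Python sketch collects the critic-actor updates in (5)-(12) in one place, assuming the ARM decoder supplies a binary activation vector U(t); the class and the parameter defaults are illustrative assumptions, not the authors' FPGA implementation.

```python
# Illustrative critic-actor over ARM clusters; parameter values are assumed.
import numpy as np

class CriticActor:
    """Critic-actor learner driven by the decoded internal state vector U."""

    def __init__(self, n_clusters, alpha=0.5, beta=0.1, gamma=0.95,
                 lam=0.8, delta=0.8, noise_std=0.1):
        self.v = np.zeros(n_clusters)      # critic weights v_i
        self.w = np.zeros(n_clusters)      # actor weights  w_i
        self.u_bar = np.zeros(n_clusters)  # critic trace   u_bar_i
        self.e_hat = np.zeros(n_clusters)  # eligibility    e_hat_i
        self.alpha, self.beta, self.gamma = alpha, beta, gamma
        self.lam, self.delta, self.noise_std = lam, delta, noise_std

    def value(self, U):
        # (7): p(t) = sum_i v_i(t) U_i(t)
        return float(self.v @ np.asarray(U, dtype=float))

    def act(self, U):
        # (10): y(t) = f[sum_i w_i(t) U_i(t) + noise(t)], with f the sign mapping
        s = self.w @ np.asarray(U, dtype=float) + np.random.normal(0.0, self.noise_std)
        return float(np.sign(s))

    def learn(self, U, y, r, U_next):
        U = np.asarray(U, dtype=float)
        # (5): internal reinforcement r_hat(t) = r + gamma * p(t+1) - p(t)
        r_hat = r + self.gamma * self.value(U_next) - self.value(U)
        # (8)-(9): critic weight update with its trace
        self.v += self.beta * r_hat * self.u_bar
        self.u_bar = self.lam * self.u_bar + (1.0 - self.lam) * U
        # (11)-(12): actor weight update with the eligibility trace
        self.w += self.alpha * r_hat * self.e_hat
        self.e_hat = self.delta * self.e_hat + (1.0 - self.delta) * y * U
        return r_hat
```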
D. Templates

As aforementioned, two modules are designed based on a similar paradigm: one for a primitive action, as shown in Fig. 2, and the other for a composition behavior, as shown in Fig. 5. The primitive templates are constructed with the ARM-based reinforcement learning architecture described in the previous section, and the primitives produce and determine the primitive actions. In the MBA, goal seeking, obstacle avoiding, and vacancy pursuing are all instances of the primitive template. Each instance uses different input signals and rewards to embody a different primitive action; in other words, the differences among these primitives lie in the input pattern S (signal) and the reward function r. The composition template has the capability of combining two or more module opinions (signals). The inner architecture of a composition is similar to that of a primitive, except for a mixer that receives the opinions from other primitives. Thus, a composition template can be constructed from a primitive template and a mixer, as shown in Fig. 5. After these two kinds of templates are designed, it is possible to use a variety of primitives and compositions to construct an MBA. For example, in the overview of the MBA shown in Fig. 1, the two compositions, i.e., cruise and safety ensurance, produce output signals by combining the signals from the primitives. Hence, the overall capability of the system can flexibly be increased by adding new primitives with different abilities and adding new compositions as connectors to process their opinions.

Fig. 5. Schematic of a composition template.

IV. PRIMITIVE ACTIONS

Three primitive actions are implemented in this paper: goal seeking, vacancy pursuing, and obstacle avoiding, which are embodied by the same code of the primitive template in terms of hardware realization. In other words, these modules have the same architecture shown in Fig. 2.

A. Goal-Seeking (GS)

To design goal seeking, we let the robot emulate the driving behavior of human beings. In the process of human driving, the direction of the front wheel is always toward the goal when no obstacles appear, and we design goal seeking based on this concept. We define two parameters, α and ϕ, where α is the angle between the goal direction and the vehicle frame, and ϕ is the angle of the front wheel. When the wheel is turned to the right, the front-wheel angle ϕ is positive; when the goal is at the left side of the car, the angle α is positive. Thus, the angle between the front wheel and the goal, which is referred to as θ, is defined by the following function:

θ(t) = α_t + ϕ_t,   α_t ∈ [−40, 40],   ϕ_t ∈ [−180, 180]    (13)
where θ(t) is the angle between the front wheel and the goal at time t, and α_t and ϕ_t are the values of α and ϕ at time t. In other words, if we want to set the direction of the front wheel toward the goal, then the front wheel should rotate to the angle θ(t). The output of the action y ∈ [−1.0, 1.0] controls the front wheel of the robot by means of

ϕ_{t+1} = ϕ_t + 5 ∗ y_t    (14)
where ϕ_{t+1} and ϕ_t are the values of ϕ at time t + 1 and time t, respectively, and y_t is the value of y at time t. We limit each maximal change of the front wheel to 5° to keep the robot from overturning when the amount of rotation is too large. The external reinforcement reward r ∈ [−1.0, 0.0] is produced by the function θ and its derivative θ'. If |θ| becomes smaller, then the front-wheel direction is closer to the goal direction; in this situation, the reward function gives a reward of r = 0. Otherwise, if |θ| becomes larger, then the reward function gives a punishment of r = −1. We use this concept to define r as follows:

θ'(t) = θ(t) − θ(t − 1)
if θ(t) < 2° then
  r = 0
else
  r = (θ'(t)/5)(u(θ'(t)) − u(θ'(t) + 5)) − u(θ'(t))
end if

where the function u is the unit step function.
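The next sketch illustrates the goal-seeking steering update (13)-(14) together with a simplified version of the reward rule above (reward 0 when the front wheel stays aligned with or closes on the goal direction, punishment −1 otherwise); angles are in degrees, and the goal bearing α and the action y are assumed to be supplied externally. It is illustrative only, not the authors' implementation.

```python
# Simplified goal-seeking sketch; the reward is a binary variant of the rule above.
def front_wheel_goal_angle(alpha: float, phi: float) -> float:
    # (13): theta(t) = alpha_t + phi_t
    return alpha + phi

def steer(phi: float, y: float) -> float:
    # (14): each change of the front-wheel angle is limited to 5 degrees.
    return phi + 5.0 * y

def goal_seeking_reward(theta_now: float, theta_prev: float) -> float:
    # No punishment when nearly aligned or improving; -1 when |theta| grows.
    if abs(theta_now) < 2.0 or abs(theta_now) <= abs(theta_prev):
        return 0.0
    return -1.0
```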
B. Vacancy-Pursuing (VP)

When the mobile robot is in an unstructured environment, goal seeking leads the robot to its goal. However, the robot may bump into something if it does not have the capability of obstacle avoidance. Therefore, we have developed two modules for obstacle avoidance: one is vacancy pursuing, which moves toward a vacant place, and the other is obstacle avoiding. To implement vacancy pursuing, we locate the vacant places using ultrarange finders (URFs); in other words, the robot moves in the heading direction of the URF with the longest detected distance. Therefore, we need to know how to set the ultra range sonars (URSs) on the robot. The locations of the URSs on the mobile robot are shown in Fig. 6, with eight URSs labeled from URF A to URF H. Each URF has its own location and heading direction.

Fig. 6. URF locations of the mobile robot.

The input parameters S1 and S2 of vacancy pursuing are A and ϕ. The input variable A is the expected steering angle, A ∈ {−40°, −20°, 0°, 20°, 40°}, which is mapped from the heading angle of the URF with the longest detected distance among the front five URSs, URF A to URF E. The corresponding relationships between the heading angles of the URSs and the expected steering angles are shown in Table I. In Table I, the situation "No obstacle in front" is activated in this simulation when the detected distance of URF C is larger than 200 cm and the detected distances of URF B and URF D are both larger than 150 cm. The parameter ϕ is the front-wheel angle. The output action y and the steering control function are the same as those of goal seeking.

TABLE I. EXPECTED STEERING ANGLES OF THE VACANCY-PURSUING AGENT AND OBSTACLE AVOIDING

The reward function of vacancy pursuing is based on a function θ, which indicates where the vacant place is located. It is defined as

θ(t) = |expected steering angle − ϕ_t|    (15)

where the expected steering angle is given in Table I. The derivative θ' and the reward function r are the same as the corresponding functions in goal seeking.

C. Obstacle-Avoiding (OA)

Obstacle avoiding is another primitive action, whose only function is enabling the robot to avoid obstacles. Obstacle avoiding is similar to vacancy pursuing but differs in the input variable A and the expected steering angle of the function θ. In vacancy pursuing, the action leads the robot toward a vacant place, so the input variable A is the expected steering angle corresponding to the heading angle of the URF with the longest detected distance. In obstacle avoiding, however, the action leads the robot away from obstacles, so the input variable A is the steering angle corresponding to the heading angle of the URF with the shortest detected distance, as shown in Table I. When the URF with the shortest detected distance is URF C, the expected steering angle is one of the boundary values of {−40°, −20°, 0°, 20°, 40°}, i.e., ±40°. The angle is determined according to the vacancy of the two sides of the robot using the following equations:

Left vacancy = dis(B) + a ∗ dis(A) + b ∗ dis(H)
Right vacancy = dis(D) + a ∗ dis(E) + b ∗ dis(F)

where dis(i) is the detected distance of the ith URS, and a and b are vacancy parameters whose values are between 0 and 1; we use the default values a = 0.7 and b = 0.5. The expected steering angle is then chosen as

if Left vacancy > Right vacancy then
  expected steering angle = −40°
else
  expected steering angle = +40°
end if

As for the function θ, the equation for θ(t) is the same as that of vacancy pursuing in (15); only the mapping method for determining the expected steering angle is different, as shown in Table I. In Table I, the condition "No obstacle in front" is the same as that of vacancy pursuing.
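As an illustration of the obstacle-avoiding choice when the nearest obstacle is detected by URF C, the following sketch evaluates the weighted left/right vacancy defined above and returns ±40°; the dictionary of per-URF distances is an assumed interface, and the full Table I mapping for the other sensors is not reproduced here.

```python
# Illustrative sketch of the obstacle-avoiding steering choice for URF C.
def expected_steering_angle(dis: dict, a: float = 0.7, b: float = 0.5) -> float:
    """Steer toward the freer side using the weighted vacancy of each side."""
    left_vacancy = dis["B"] + a * dis["A"] + b * dis["H"]
    right_vacancy = dis["D"] + a * dis["E"] + b * dis["F"]
    return -40.0 if left_vacancy > right_vacancy else 40.0
```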
V. COMPOSITIONS

The three primitive actions described in the previous section can individually control a mobile robot, but they cannot control a mobile robot together. In this section, two compositions, i.e., cruise and safety ensurance, are presented to create a more sophisticated system. The architecture of these two compositions is shown in Fig. 5.

A. Cruise

Obstacle avoiding performs better when the robot is close to obstacles, whereas vacancy pursuing leads the robot far away from obstacles. Instead of arbitrating between these two actions, the cruise combines obstacle avoiding and vacancy pursuing. This combination enables the primitives to work with, instead of mutually excluding, each other in terms of a set of learned weighting parameters. The two input variables S1 and S2 of the cruise are LD and SD, where LD is the longest detected distance among the front five URSs and SD is the shortest. In addition, r is the external reinforcement signal. The output of the associative search element (ASE), y ∈ [−1.0, 1.0], controls the mixer, which produces the value out_cb by combining the behaviors of obstacle avoiding and vacancy pursuing. To distinguish between the outputs y of these modules, the outputs of vacancy pursuing, obstacle avoiding, and cruise are denoted by y_v, y_o, and y_cb, respectively. The mixer combines the outputs to produce the steering-angle output out_cb by means of the following equation:

out_cb = 5 ∗ [(y_cb/2 + 0.5) ∗ y_o + (0.5 − y_cb/2) ∗ y_v].    (16)
The constant 5 determines each maximal change of the steering angle. The value y_cb determines the relative influence of the two primitives. When y_cb is close to 1, the steering angle is affected by obstacle avoiding rather than vacancy pursuing; on the contrary, when y_cb is close to −1, vacancy pursuing affects the steering angle more. The reward r is defined by the following rule:

if ε(t) < 0.2 then
  r = 0
else
  r = −u(ε'(t))
end if

where the function u is the unit step function, and the functions ε(t) and ε'(t) are defined as

ε(t) = |sign(ρ1 − SD) − y_cb(t)|    (17)
ε'(t) = ε(t) − ε(t − 1)    (18)

where ρ1 is the threshold of the dangerous distance, and sign(x) = u(x) − u(−x). ε is the difference between the expected output of y_cb and the real output y_cb. This means that when SD is smaller than ρ1, the output of the sign function is 1, so the value of y_cb is expected to be 1, and the mixer is expected to choose the behavior of obstacle avoiding. Otherwise, when SD is larger than ρ1, y_cb is expected to choose vacancy pursuing. In addition, ε'(t) reflects the change of ε(t). If ε'(t) is negative, then y_cb is approaching the expected output, and the reward rule gives a reward of r = 0 (r = −u(ε'(t)) = 0). Otherwise, it assigns a punishment of r = −1 when ε'(t) is positive (r = −u(ε'(t)) = −1).
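A compact sketch of the cruise mixer (16) and its reward terms (17) and (18) is shown below; the function signatures and the treatment of the unit step at zero are assumptions.

```python
# Illustrative sketch of the cruise mixer and reward; value of u(0) is assumed.
def sign(x: float) -> float:
    return 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)

def unit_step(x: float) -> float:
    return 1.0 if x >= 0 else 0.0

def cruise_mixer(y_cb: float, y_o: float, y_v: float) -> float:
    # (16): out_cb = 5 * [(y_cb/2 + 0.5) * y_o + (0.5 - y_cb/2) * y_v]
    return 5.0 * ((y_cb / 2.0 + 0.5) * y_o + (0.5 - y_cb / 2.0) * y_v)

def cruise_reward(y_cb: float, sd: float, rho1: float, eps_prev: float):
    # (17): eps(t) = |sign(rho1 - SD) - y_cb(t)|;  (18): eps'(t) = eps(t) - eps(t-1)
    eps = abs(sign(rho1 - sd) - y_cb)
    d_eps = eps - eps_prev
    r = 0.0 if eps < 0.2 else -unit_step(d_eps)
    return r, eps
```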
B. Safety Ensurance

The cruise can lead the robot to avoid obstacles and move to vacant places, but it cannot guide the robot to the goal. Hence, we use safety ensurance to organize the cruise and goal seeking into an evolving MBA so that the system has the functions and advantages of all three primitives. The two input parameters of safety ensurance are GD and SD, where GD is the distance from the current position of the robot to the goal. The following equation gives the value out_sb of the mixer in safety ensurance:

out_sb = 5 ∗ [(y_sb/2 + 0.5) ∗ y_g + (0.5 − y_sb/2) ∗ out_cb]    (19)

where y_g is the output of goal seeking, and out_cb is the output of the cruise in (16). The value y_sb determines whether to choose the behavior of goal seeking or the behavior of the cruise: goal seeking is chosen when y_sb = 1, and the cruise is chosen when y_sb = −1. The reward function of safety ensurance is the same as that of the cruise, but the definition of ε(t) is different. The value of ε(t) is defined by the following equation:

ε(t) = |E[Y] − y_sb(t)|    (20)

where E[Y] is the expected value of y_sb(t); E[Y] determines whether the cruise or goal seeking is chosen.
If SD is larger than ρ2, where ρ2 is the dangerous distance, then it implies that the sensed obstacle is far away from the robot. Thus, E[Y] is set to 1 in the expectation that the behavior of goal seeking has more effect on the output than its counterpart. On the contrary, if SD is smaller than ρ2 and GD is larger than ρ2, i.e., the obstacle is nearby but the goal is far away, then the behavior of the cruise dominates the behavior of safety ensurance. If SD and GD are both smaller than ρ2, then the behavior is chosen based on the sign function.
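Analogously, the following sketch covers the safety-ensurance mixer (19) and the expected value E[Y] used in (20). Since the text does not spell out the argument of the sign function when both SD and GD fall below ρ2, the fallback sign(SD − GD) used here is only one plausible reading and is flagged as an assumption in the code.

```python
# Illustrative sketch of the safety-ensurance mixer and E[Y]; see assumptions below.
def sign(x: float) -> float:
    return 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)

def safety_mixer(y_sb: float, y_g: float, out_cb: float) -> float:
    # (19): out_sb = 5 * [(y_sb/2 + 0.5) * y_g + (0.5 - y_sb/2) * out_cb]
    return 5.0 * ((y_sb / 2.0 + 0.5) * y_g + (0.5 - y_sb / 2.0) * out_cb)

def expected_y(sd: float, gd: float, rho2: float) -> float:
    if sd > rho2:                # obstacle far away: favor goal seeking
        return 1.0
    if gd > rho2:                # obstacle near, goal far: favor the cruise
        return -1.0
    # Both SD and GD below rho2: the text defers to "the sign function";
    # sign(sd - gd) is only an assumed reading of that rule.
    return sign(sd - gd)
```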
Fig. 7. Photo of the mobile robot.

TABLE II. PARAMETERS OF ARM-BASED REINFORCEMENT LEARNING

VI. EXPERIMENTS

The mobile robot shown in Fig. 7 is a modified elderly assistance vehicle that is 120 cm long and 45 cm wide. There are two control cards: one is a servo motor control card, and the other is an 8255 control card. The servo motor control card controls the actuators of the back wheels, and the 8255 control card controls the other hardware devices, such as the steering wheel and the URFs. One A/B phase encoder is mounted on the axle of the back wheels. The servo motor control card receives the output of the encoder and translates its pulses into the values of the moving distance and the moving speed. The mobile robot is equipped with eight URFs, from URF A to URF H; the position and heading direction of each URF is shown in Fig. 6. To determine the steering angle of the front wheels, we define the steering angle as positive when the steering gear is rotated to the right and negative when rotated to the left. The steering angle ranges from −40° to 40°.

The parameters and weightings in the neural networks for learning, which are used in the physical experiments, are obtained beforehand by simulations. The parameters of the ARM-based reinforcement learning are shown in Table II.
Fig. 8. Experiments. (a) Goal seeking between two points. (b) Cruise in a narrow corridor. (c) Safety ensurance in a simple workspace. (d) Safety ensurance in a complex workspace.
In other words, we copied these parameters and weightings from the simulator to the physical robot, and the ARM-based reinforcement learning architecture in the robot initially used these settings to control the physical robot. The robot still keeps learning while operating in these physical experiments. The MBA is implemented on an FPGA board, where the software of the reinforcement learning is executed in the embedded MicroBlaze CPU with the ARM-based decoder hardwired in the xc2v2000 FPGA chip of the Virtex II series [11]. Due to the limited number of pages for presentation, the simulation outcomes and some experimental results of the primitive actions and composition behaviors introduced in the previous sections are not fully listed in this paper. Only the experimental results of one typical primitive action, i.e., goal seeking, and two compositions, i.e., cruise and safety ensurance, are demonstrated and illustrated.

A. Goal-Seeking

To test the performance of goal seeking, we put the robot in an open place, and an experiment was performed with the initial position (700, 100) and the goal position (300, 600). During execution, the software in the robot periodically and automatically drew its tracks, as shown in Fig. 8(a). This figure shows that the robot could smoothly approach its goal and stopped when it arrived at its goal.

B. Cruise

The value of the dangerous distance, i.e., ρ1 in (17), was 100 cm. This meant that if the shortest distance between an obstacle and the robot was smaller than 100 cm, then the cruise was expected to choose the behavior of obstacle avoiding; otherwise, it chose vacancy pursuing.
In the simulations executed in the previous work, when the robot bumped into an obstacle, it would retreat to a position on its path a distance of 100 control cycles back and restart learning. However, in the physical experiment, it is difficult to move the robot back to the previous position without any positioning error. To solve this problem, we used a forward model and a backward model to control the robot. In the forward model, the robot executed the progressing and learning programs, which are the same as those in the simulator. The accurate detection range of the URSs was from 60 to 660 cm, so when a URF detected that the distance to a front obstacle was less than 60 cm, we considered this condition as bumping into an obstacle. Hence, if one of the front URSs, i.e., URF B to URF D in Fig. 6, detected that the obstacle distance was less than 60 cm, the robot switched to the backward model. In the backward model, the learning algorithms were stopped, and the robot only moved straight backward. If the detected distances of the front URSs were all larger than 100 cm, or the detected distance of URF G was less than 60 cm, then the robot switched back to the forward model. With this simple algorithm, the robot could learn by itself without human interaction.

The condition of "No obstacle in front" in Table I was defined as when the detected distance of URF C in Fig. 6 was larger than 300 cm and the detected distances of URF B and URF D were both larger than 250 cm. The value of the dangerous distance, i.e., ρ1 in (17), was 150 cm, so that if the detected obstacle distance was larger than 150 cm, then the cruise was expected to choose the behavior of vacancy pursuing; otherwise, it was expected to choose obstacle avoiding.
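The forward/backward switching described above reduces to a small state machine; the sketch below follows the stated thresholds (60 cm treated as a virtual bump, more than 100 cm in front or URF G closer than 60 cm to resume forward motion), with the dictionary of URF distances as an assumed interface. It is illustrative only.

```python
# Illustrative forward/backward mode switching; distances in centimeters.
FORWARD, BACKWARD = "forward", "backward"

def next_mode(mode: str, dis: dict) -> str:
    front = min(dis["B"], dis["C"], dis["D"])
    if mode == FORWARD and front < 60:
        return BACKWARD                    # virtual bump: stop learning, back up
    if mode == BACKWARD and (front > 100 or dis["G"] < 60):
        return FORWARD                     # clear ahead, or URF G closer than 60 cm
    return mode
```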
In Fig. 8(b), the test location was the corridor outside our laboratory, and the long black solid boxes indicate obstacles. The tracks of the robot while executing the cruise are indicated by small hollow black squares. At location A in the figure, since the distance detected by URF C was smaller than 60 cm, the robot would have bumped into the obstacle, and so it switched to the backward model. The robot passed the long obstacle in the next run and then smoothly passed another obstacle.

C. Safety Ensurance

In this experiment, we show the outcomes of the experiments on safety ensurance, which is a combination of goal seeking and cruise. The value of the dangerous distance ρ2 was set to 150 cm. The experiment shown in Fig. 8(c) was executed in an outdoor vacant space. The big black solid rectangle was an obstacle consisting of two chairs, and the tracks of the mobile robot are drawn as small hollow black squares. The initial position was at (400, 50), and the goal was at (500, 800). In the figure, when the robot met the obstacle, it indeed proceeded under the control of the cruise to avoid the obstacle. After passing the obstacle, the robot smoothly approached the goal. The behavioral fusion was successful in this experiment.

Fig. 8(d) demonstrates the motion in which the robot is originally located at (500, 100) in a room with an exit at the right-bottom corner and drives to a target point at (600, 1100). The target resides in a room with an opening at the left-top corner. At the beginning, the robot tries to directly approach the target but immediately takes a sharp right turn as it detects the wall. After passing the exit and driving along the short corridor, the robot soon returns to a path toward the target. However, the robot soon takes a proper reactive movement upon detecting the wall residing on the path to the target. Avoiding approaching the wall too closely, the robot finally finds a way to the target point. From the path shown in Fig. 8(d), it is observed that safety ensurance subsumes the actions of goal seeking and collision avoidance and the behavior of cruise, and directs the robot to the desired position, as expected. The ARM decoder classifies the sensory data into only 75 categories, each of which corresponds to the receptive field of a neuron.

VII. CONCLUSION

Reinforcement learning with an ARM decoder can continue in a stable fashion, with familiar inputs directly accessing their categories and novel inputs triggering adaptive searches for better categories, until the network fully uses its memory capacity. However, the quality of categorization attained by the algorithm is critically dependent on the static vigilance parameter, which is fixed a priori. Two templates of core structures are designed in this paper, for primitive actions and composition behaviors, respectively, both of which actually proliferated from a similar structure for code reusability in FPGA programming. Each action is used to instantiate a primitive behavior. Three actions are designed in this paper: obstacle avoiding, vacancy pursuing, and goal seeking. Composition templates are used to produce composed actions, which are referred to as behaviors, such as cruise, which subsumes obstacle avoiding and vacancy pursuing, and safety ensurance, which further subsumes the cruise behavior and the goal-seeking action.
To pursue autonomous or semiautonomous control, the inner structure of each module uses a novel reinforcement learning architecture, i.e., ARM-based reinforcement learning, so that the robot can learn by itself how to control its route, which reduces human operations in the control loop. Another advantage of the agent architecture is that it is easy to implement in hardware: one hardware element is designed to perform one module, such as a primitive, and the other primitives are performed by duplicates of the same hardware element. The compositions are produced as hardware elements in a similar way.

REFERENCES

[1] J. O. Gray, "Recent developments in advanced robotics and intelligent systems," Comput. Control Eng. J., vol. 7, no. 6, pp. 267–276, Dec. 1996.
[2] J. P. Brusey, "Learning behaviours for robot soccer," Ph.D. dissertation, RMIT Univ., Melbourne, Australia, 2002.
[3] R. A. Brooks, "A robust layered control system for a mobile robot," IEEE J. Robot. Autom., vol. RA-2, no. 1, pp. 14–23, Mar. 1986.
[4] R. C. Arkin, Behavior-Based Robotics. Cambridge, MA: MIT Press, 1999.
[5] M. Dorigo and M. Colombetti, Robot Shaping: An Experiment in Behavior Engineering. Cambridge, MA: MIT Press, 1998.
[6] G. A. Bekey, Autonomous Robots. Cambridge, MA: MIT Press, 2005.
[7] R. S. Sutton and A. G. Barto, Reinforcement Learning. Cambridge, MA: MIT Press, 1998.
[8] C. Watkins and P. Dayan, "Q-learning," Mach. Learn., vol. 8, no. 3, pp. 279–292, 1992.
[9] A. Pérez-Uribe, "Structure-adaptable digital neural networks," Ph.D. dissertation, Swiss Federal Inst. Technol.-Lausanne, Lausanne, Switzerland, 1999.
[10] G. A. Carpenter and S. Grossberg, "The ART of adaptive pattern recognition by a self-organizing neural network," Computer, vol. 21, no. 3, pp. 77–88, Mar. 1988.
[11] K. S. Hwang, Y. P. Hsu, H. W. Hsieh, and H. Y. Lin, "Hardware implementation of FAST-based reinforcement learning algorithm," in Proc. IEEE Int. Conf. Signal Process. VLSI Des., 2005, pp. 435–438.
Kao-Shing Hwang (M'93–SM'08) received the B.S. degree in industrial design from the National Cheng Kung University, Tainan, Taiwan, in 1981 and the M.M.E. and Ph.D. degrees in electrical engineering and computer science from Northwestern University, Evanston, IL, in 1989 and 1993, respectively. He was a Design Engineer with Sanyo Electric Company, Taipei, Taiwan, from 1983 to 1985 and was a System Programmer with the C&D Microsystem Company, Plastow, NH, in 1989. Since August 1993, he has been with the National Chung Cheng University, Chiayi, Taiwan, where he was the Chairman of the Department of Electrical Engineering from 2003 to 2006 and is currently a Professor. He was with the evaluation board and steering committee of the Automatic Control Group, National Science Council of Taiwan, between 2002 and 2005. He is in charge of developing an exchange platform of heterogeneous educational resources for educational web sites under the Ministry of Education of Taiwan. In addition, he is assigned to a government steering committee for addressing the nationwide digital divide. His areas of interest are neural networks, learning control in robotics, cooperative learning agents, and e-learning by IT. Dr. Hwang is the Chairman of the IEEE Robotics and Automation Society, Taipei Chapter.
Chia-Yue Lo received the B.S. degree in mathematics from Soochow University, Taipei, Taiwan, in 1987 and the M.S. degree in mathematics from the National Tsing-Hua University, Hsinchu, Taiwan, in 1989. He is currently working toward the Ph.D. degree with the Department of Electrical Engineering, National Chung Cheng University, Chiayi, Taiwan. His current research interests are neural networks, learning systems, fuzzy control, mobile robots, control systems, and multiagent systems.
Wen-Long Liu received the B.S. degree in electrical engineering from the National Taipei University of Technology, Taipei, Taiwan, in 1999 and the M.S. degree in electrical engineering from the National Chung Cheng University, Chiayi, Taiwan, in 2001. He is currently with the Department of Electrical Engineering, National Chung Cheng University. His research interests are neural networks, learning systems, and mobile robots.