
Proceedings of the 2003 IEEE/RSJ Intl. Conference on Intelligent Robots and Systems, Las Vegas, Nevada, October 2003

Mobile Robot Navigation based on Expected State Value under Uncertainty of Self-localization

Ryuichi UEDA¹, Tamio ARAI¹, Kazunori ASANUMA², Shogo KAMIYA², Toshifumi KIKUCHI², and Kazunori UMEDA²

¹ Department of Precision Engineering, School of Engineering, The University of Tokyo
² Department of Precision Mechanics, School of Science and Engineering, Chuo University
{ueda, arai}@prince.pe.u-tokyo.ac.jp

Abstract — Uncertainty of self-localization is one of the most serious problems in the navigation of autonomous mobile robots. Several studies deal with this problem, but most of them assume that the extent of the uncertainty is known or has been measured in advance, even though it changes easily with trivial changes in the environment. To avoid this assumption, we take the following approach: 1) the robot uses a self-localization method that is robust against sensing noise and environmental change, 2) the robot's behavior is planned under the assumption that the robot can recognize the environment perfectly, and 3) a novel decision-making algorithm computes proper behavior in real time from the planning result and the self-localization results. These algorithms were implemented on a soccer robot of the RoboCup four-legged robot league, and their efficiency was verified in experiments.

1 Introduction

The RoboCup four-legged robot league [1] is a good subject for developing ingenious algorithms that enable robots without expensive sensors or computing resources to carry out difficult tasks. A robot must play soccer intelligently even though its body has various constraints on its motion and the field of view of its camera is narrower than that of our eyes. Moreover, the robot should decide on its actions and move as fast as possible.

A pre-computed mapping from a state of a soccer game onto an appropriate action is suitable for such real-time decisions. For example, when a legged robot navigates in a game, its position and orientation (hereafter, pose) are referred to as its state on the field, and the direction in which it is walking is referred to as its action. If the robot has a mapping that gives an optimal walking direction for each state, it does not need to compute a path during the game; it can obtain an appropriate action from the mapping at any time, as long as it can estimate its pose.

However, estimates of the state of the game are always uncertain to some degree, and if the robot cannot obtain enough information, it cannot estimate its state precisely. This uncertainty is a critical problem because a pre-computed mapping usually requires a definitive state as an input. A permanent shortage of information for estimation can arise after a trivial change in the environment. In particular, most robots that rely on cameras are not robust against a change of lighting from adequate to dark; such a change may reduce the chances of obtaining reliable image-processing results. In this case, a robot cannot move effectively even if the mapping it uses is optimal. In this paper, we deal with this problem, which occurs whenever we want to utilize such mappings.

We have already proposed a method that takes the uncertainty of self-localization results into account [5]. In that method, dynamic programming (DP) [3] is used for path planning. In the planning phase, DP determines, for each state, an optimal action that minimizes the time to a destination and records it in a look-up table, which is one form of mapping. The robot then uses this table: when it inputs a state, an optimal action is returned. The uncertainty of the self-localization results was considered in the planning phase: uncertainty was included in the state, and a mapping was obtained whose input was a candidate area of the robot's pose (position and orientation) in xyθ-space. This idea belongs to planning in an information space [6], and other methods based on the information space have also been proposed [2, 8]. To use DP for planning, however, all state transitions caused by any action of the robot at any state must be estimated stochastically. To use the above methods, therefore, all transition probabilities concerning the increase and decrease of uncertainty need to be known, but these probabilities are not fixed under changing lighting conditions. Another approach is therefore needed.

Thrun proposes Monte Carlo POMDPs [10], with which a mapping can be obtained by learning in the information-space approach. Since the robot obtains a mapping that reflects its real experience, it can adjust the mapping to changing lighting conditions; however, it takes time for the robot to accumulate enough experience to adapt. QMDP [7], on the other hand, is unaffected by such changes of condition. This method uses a planning result that does not consider uncertainty; instead, the state estimation result has to be represented as a probability density function (pdf).


This pdf and the planning result are used to select an action that maximizes the expected efficiency of the action. If the pdf is correct (it does not even need to have converged), a robot can determine an appropriate action under any conditions with QMDP. Obtaining such an estimate is difficult, however, and real-time operation is also required if this algorithm is to be used in an environment as dynamic as RoboCup.

In this paper, a novel decision-making method is proposed that enables the robot to choose a feasible action for navigation tasks in real time, even when conditions change to some extent. A Monte Carlo method is used for state estimation, which makes the calculation of QMDP very concise. Section 2 explains the formulation of the method, Section 3 describes its implementation for a navigation task, Section 4 examines it in an environment of the RoboCup four-legged robot league, and Section 5 concludes.

2 Decision-making with an Expected State Value

Our method consists of an off-line planning phase and an on-line decision-making phase. The off-line phase treats a partially observable Markov decision process (POMDP) as if it were a Markov decision process (MDP): it assumes, unrealistically, that the robot knows its state perfectly. The uncertainty of state estimation is handled in the on-line phase.

2.1 Off-line Planning Phase

Dynamic programming (DP), which is used for computing a mapping, is briefly explained here. A state is represented by a state vector x = (x_1, x_2, ..., x_n) that belongs to a state space X, and the state space is discretized into S. The robot motion is also assumed to be discretized in advance; each discretized output is called an action, written a, and belongs to a set of actions A. The Bellman equation [3] in discrete time can be formulated as follows:

    V*(s) = max_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + V*(s') ],    (1)

where s represents a discrete state in the state space S, P^a_{ss'} denotes the transition probability from a state s to another state s' under action a, and R^a_{ss'} denotes the immediate evaluation given to that transition. The optimal state-value function V*(s) is defined as the expected sum of the immediate evaluations from the state s to a destination under the optimal policy π*, which is formulated as

    π*(s) = arg max_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + V*(s') ].    (2)

The destinations are called final states. The purpose of dynamic programming is to obtain the optimal policy π*, which is the mapping from a state to an optimal action.

To solve the Bellman equation, a value iteration algorithm [9] is frequently applied. It repeatedly applies the local operation of Eq. (1) to every state s, estimating V*(s) from the state values of the neighboring states; V*(s) is guaranteed to converge after some iterations.
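To make the planning step concrete, the following is a minimal sketch of value iteration over a discretized state space; the transition and reward arrays, the terminal-state mask, and the convergence threshold are hypothetical placeholders rather than the paper's actual implementation, which works on xyθ cuboids.

```python
import numpy as np

def value_iteration(P, R, terminal, eps=1e-6):
    """Sketch of value iteration for Eqs. (1) and (2).

    P[a, s, s'] : transition probability from s to s' under action a
    R[a, s, s'] : immediate evaluation of that transition (e.g. -1 per step)
    terminal    : boolean mask of final states (their value stays 0)
    Returns the converged V*(s) and the greedy policy pi*(s).
    """
    num_actions, num_states, _ = P.shape
    V = np.zeros(num_states)
    while True:
        # One sweep of the local operation of Eq. (1) over every state s.
        Q = np.einsum('ast,ast->as', P, R + V)   # sum over s' of P * (R + V[s'])
        V_new = Q.max(axis=0)
        V_new[terminal] = 0.0
        if np.max(np.abs(V_new - V)) < eps:
            return V_new, Q.argmax(axis=0)       # V*(s) and pi*(s) as in Eq. (2)
        V = V_new
```

With an evaluation of −1 per step, as used later in Section 3.2, V*(s) roughly corresponds to minus the expected number of steps from s to the final states.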

2.2 On-line Decision-making Phase

Although the optimal policy π*(s) needs x or s as an input, state estimation always contains a degree of uncertainty. We assume that this uncertainty is represented as a probability density function bel(x), which gives the probability density that the actual state vector is x. To obtain an appropriate action, an expected state value is calculated from bel(x) and V*(s), which approximates the continuous optimal state-value function V*(x):

    V̄ = ∫_X V*(x) bel(x) dx.    (3)

The robot chooses an action that effectively increases V̄. The expected value of an action a is defined as the difference between the expected state value before and after the action:

    Q̄(a) = ∫_X { V*(f_a(x)) − V*(x) } bel(x) dx,    (4)

where f_a is a function that transfers a state x to another state by action a. Eventually, a mapping from the probability density function onto an action is obtained as

    Π[bel(x)] = arg max_a Q̄(a).    (5)

This formulation is referred to as QMDP. We implement it in a novel way that allows it to be calculated quickly.
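Before the sample-based implementation of Section 3, it may help to see Eqs. (3)-(5) written out directly for a belief over the discrete states; the names below (belief, V, next_state) are assumptions for illustration, and the state transition is treated as deterministic for brevity.

```python
import numpy as np

def qmdp_select(belief, V, next_state, actions):
    """Naive QMDP selection over a discrete belief (Eqs. 3-5).

    belief[s]        : probability that the robot is in discrete state s
    V[s]             : optimal state value V*(s) from the off-line phase
    next_state(s, a) : state reached from s by action a (deterministic stand-in for f_a)
    """
    v_bar = float(np.dot(belief, V))                             # Eq. (3)
    q_bar = {a: sum(belief[s] * (V[next_state(s, a)] - V[s])     # Eq. (4)
                    for s in range(len(V)))
             for a in actions}
    best = max(q_bar, key=q_bar.get)                             # Eq. (5)
    return best, v_bar
```

Note that this direct form sums over every discrete state for every action, which is exactly the cost that the sample-based reformulation in Section 3.3.2 avoids.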

3 Implementation

The method is applied to the navigation task of a goalkeeper on the field of the RoboCup four-legged robot league.

3.1 RoboCup Four-legged Robot League

In the RoboCup four-legged robot league [1], a team uses four lion-like pet robots, ERS-210, as soccer players. Each robot has three DoF in each leg and in the head, as shown in Fig. 1. The CPU of the robot is a 192 MHz MIPS with 32 MB of RAM, and the robot has a CMOS camera on its nose with a resolution of 176 by 144 pixels. The soccer field is illustrated in Fig. 2.

3.1.1 Task and Conditions

One of the four robots is a goalkeeper. The goalkeeper should return to the goal as soon as possible after it clears a ball away from the goal. During the task, the robot should avoid collisions with walls and goal posts, because a collision makes the robot slip and self-localization becomes difficult. The task can be described as sequences of elemental actions. Here, 40 walking actions a_i ∈ A (i = 1, 2, ..., 40) are defined, such as Forward, Backward, and Turn. Each action changes the pose of the robot by a fixed amount, and all actions take the same time: 0.6 [s]. The pose is represented by (x, y, θ) in the field coordinate system shown in Fig. 3. The state space X of this task is the xyθ-space.
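As a sketch of how such a fixed-displacement action set can be modelled, the following applies a constant per-action pose change in the robot's local frame. The four actions and their displacement values are invented placeholders (the real robot has 40 gait actions whose amounts are not given here); only the 0.6 s duration is taken from the text.

```python
import math
from dataclasses import dataclass

ACTION_DURATION_S = 0.6  # every action takes the same time (from the paper)

@dataclass
class Pose:
    x: float       # [mm], field frame
    y: float       # [mm]
    theta: float   # [deg]

# Hypothetical per-action displacements: (dx, dy) in the robot frame [mm], dtheta [deg].
ACTIONS = {
    "forward":    (100.0,  0.0,   0.0),
    "backward":   (-80.0,  0.0,   0.0),
    "turn_left":  (  0.0,  0.0,  20.0),
    "turn_right": (  0.0,  0.0, -20.0),
}

def apply_action(pose: Pose, action: str) -> Pose:
    """Deterministic pose update f_a: rotate the fixed displacement into the field frame."""
    dx, dy, dth = ACTIONS[action]
    th = math.radians(pose.theta)
    return Pose(
        x=pose.x + dx * math.cos(th) - dy * math.sin(th),
        y=pose.y + dx * math.sin(th) + dy * math.cos(th),
        theta=(pose.theta + dth + 180.0) % 360.0 - 180.0,   # wrap to [-180, 180)
    )
```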

Fig. 1: ERS-210 (made by SONY)

Fig. 2: The field of the four-legged robot league

Fig. 3: Field coordinate system

3.1.2 Change of Lighting Condition

Our field is illuminated with eight fluorescent lights. All of the lights are switched on during the calibration of the thresholds for color extraction. Therefore, if some of the lights are turned off, color extraction becomes incomplete, as shown in Fig. 4. The image at the far left of Fig. 4 is a CMOS camera image taken by a robot; at its center is a landmark, a column painted with two colors. When all the lights are switched on, the surface of the landmark is extracted well enough to identify it. The identification becomes difficult when some of the lights are turned off, because fewer pixels on the landmark surface are extracted, as shown in the three images on the right-hand side of Fig. 4. Therefore, the chance of successful landmark recognition decreases in a dark environment, and estimation of the robot's pose becomes difficult.

Fig. 4: Difference of color extraction results

3.2 The Planning Phase

A mapping for the task is obtained in the following way. First, the final states are fixed in the state space X as 2100 ≤ x < 2300 [mm], −100 ≤ y < 100 [mm], and |θ| > 160 [deg]. This region of the xy-plane is the center of the goal, and the robot faces the other goal if its orientation θ is within this interval. Secondly, the state space X is discretized: the xyθ-space is divided into 100 [mm] × 100 [mm] × 20 [deg] cuboids. Thirdly, the immediate reward R^a_{ss'} is determined. The robot receives −1 for each step (one action) so that the goalkeeper finishes the task in the minimum number of steps; since all actions take the same amount of time, the minimum number of steps and the minimum time are equivalent. Moreover, an infinite penalty is given when an action that could make the robot hit a wall is chosen. Finally, P^a_{ss'} must be obtained; we calculate this probability by random sampling of state transitions from points within each discrete state.

The optimal state-value function V*(s) is obtained along with the optimal policy π*(s) under these settings. The computation of P^a_{ss'} and V*(s) finished within one minute on a Celeron 1.3 GHz PC. Figure 5 shows part of the state-value function for a robot orientation of θ = 180 [deg]; the final state is at the center of the goal area. Trajectories produced by the optimal policy π*(s) are shown in Fig. 6. Each action in these trajectories is optimal for increasing the state value.

Fig. 5: State-value function (θ = 180 [deg]; the axis of state value is reversed)

Fig. 6: Examples of simulated trajectories
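A minimal sketch of this discretization and of the final-state test might look as follows; the cell sizes and the goal region are the numbers quoted above, while the field origin offsets are assumed placeholders.

```python
CELL_XY_MM  = 100.0   # 100 mm x 100 mm cells in x and y
CELL_TH_DEG = 20.0    # 20 deg cells in theta
X_MIN_MM, Y_MIN_MM = -2300.0, -1650.0   # assumed field origin offsets, not from the paper

def discretize(x_mm, y_mm, theta_deg):
    """Map a continuous pose to the index of its cuboid, i.e. the discrete state s."""
    ix = int((x_mm - X_MIN_MM) // CELL_XY_MM)
    iy = int((y_mm - Y_MIN_MM) // CELL_XY_MM)
    it = int((theta_deg % 360.0) // CELL_TH_DEG)
    return ix, iy, it

def is_final_state(x_mm, y_mm, theta_deg):
    """Goal region from Section 3.2: centre of the own goal, facing the opposite goal."""
    theta = (theta_deg + 180.0) % 360.0 - 180.0     # wrap to [-180, 180)
    return (2100.0 <= x_mm < 2300.0
            and -100.0 <= y_mm < 100.0
            and abs(theta) > 160.0)
```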

3.3 The On-line Phase

3.3.1 Uniform Monte Carlo Localization

To estimate its pose accurately under changing conditions, the robot uses Uniform Monte Carlo Localization (Uniform MCL) [11] as its self-localization method. Uniform MCL approximates bel(x) with a large set of candidates ξ_i (i = 1, 2, ..., N) of the robot's pose, which we call samples. All samples share the same probability 1/N and are distributed in the xyθ-space. The original Monte Carlo Localization (MCL) [4] varies the probability of each sample to incorporate an observation into bel(x). In contrast, Uniform MCL refines bel(x) by erasing the samples that are inconsistent with an observation whenever one is obtained.

An example of the samples is illustrated in Fig. 7. The large arrow denotes the actual pose of the robot; the small arrows denote the poses of samples. This distribution was obtained after an observation of the upper-right landmark in the figure. Through image processing, which measures the position and width of the landmark in the image, the robot knows the direction of the landmark and the maximum possible distance to it. Samples that are inconsistent with this information are erased by Uniform MCL; the remaining samples are shown in Fig. 7.

Fig. 7: Distribution of samples after an observation

The maximum distance increases when color extraction is incomplete, since the measured width of a landmark in the image becomes small. Therefore, the samples spread widely when the lighting changes. However, this situation is better than one in which the samples gather at a wrong state, which happens if the change of lighting is not considered. Figure 8 shows the sample distributions after 30 [s] of observation at a fixed pose under the lighting conditions of Fig. 4. With six or eight lights, the robot can estimate its pose accurately; with two or four lights, uncertainty about its pose remains. Though the distributions of samples are very coarse in this figure, new samples are supplied around the surviving samples after every action of the robot.

Fig. 8: Sample distributions under the different conditions
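The observation update of Uniform MCL can be sketched as a simple filter over the sample set; the consistency test below (a bearing tolerance plus the maximum-distance bound) is a simplified assumption, and the sample format and landmark coordinates are invented for illustration.

```python
import math

def prune_samples(samples, landmark_xy, measured_bearing_deg, max_distance_mm,
                  bearing_tol_deg=10.0):
    """Erase samples that are inconsistent with one landmark observation.

    samples              : list of (x, y, theta) poses, all weighted equally (1/N)
    landmark_xy          : known field position of the observed landmark [mm]
    measured_bearing_deg : bearing of the landmark in the robot frame [deg]
    max_distance_mm      : upper bound on the distance, derived from the landmark's
                           apparent width in the image (grows when extraction is poor)
    """
    lx, ly = landmark_xy
    survivors = []
    for (x, y, theta) in samples:
        distance = math.hypot(lx - x, ly - y)
        bearing = math.degrees(math.atan2(ly - y, lx - x)) - theta
        diff = (bearing - measured_bearing_deg + 180.0) % 360.0 - 180.0
        if distance <= max_distance_mm and abs(diff) <= bearing_tol_deg:
            survivors.append((x, y, theta))
    # In the paper, new samples are supplied around the survivors after the robot's
    # next action; that resampling step is omitted from this sketch.
    return survivors
```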

3.3.2 Calculation of the Expected State Value with Uniform MCL

The number of samples in an area X represents the probability that the robot is in X. This probability Bel(X) is formulated as

    Bel(X) = (1/N) Σ_{i=1}^{N} δ[x_{ξ_i}, X],    (6)

    where δ[x_{ξ_i}, X] = 1 if x_{ξ_i} ∈ X and 0 otherwise,

and x_{ξ_i} is the state of the sample ξ_i. To calculate the expected state value with the samples, we rewrite Eq. (3) with Eq. (6) as

    V̄ = (1/N) Σ_{i=1}^{N} V*(x_{ξ_i}).    (7)

Equation (4) is likewise rewritten as

    Q̄(a) = (1/N) Σ_{i=1}^{N} V*(f_a(x_{ξ_i})) − (1/N) Σ_{i=1}^{N} V*(x_{ξ_i}).    (8)
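Written against the Uniform MCL samples, Eqs. (7) and (8) reduce to plain averages. In the sketch below, V_star stands for a lookup of V*(s) at the cuboid containing a pose and apply_action stands for f_a; both are hypothetical helper names.

```python
def expected_state_value(samples, V_star):
    """Eq. (7): mean state value over the equally weighted samples."""
    return sum(V_star(pose) for pose in samples) / len(samples)

def expected_action_value(samples, V_star, apply_action, action):
    """Eq. (8): expected gain in state value obtained by executing `action`."""
    n = len(samples)
    after = sum(V_star(apply_action(pose, action)) for pose in samples) / n
    now = sum(V_star(pose) for pose in samples) / n
    return after - now
```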

An appropriate action is then chosen with Eq. (5). Calculating Eq. (8) for every action would waste computation time, so we implement not only V*(s) but also π*(s), and use π*(s) to reduce the computation in the following way.

1. The optimal actions π*(s_{ξ_i}) ∈ A (i = 1, 2, ..., N) are listed, where s_{ξ_i} denotes the discrete state that contains sample ξ_i.
2. Equation (8) is calculated only for the actions that are listed at least once.
3. An appropriate action is chosen with Eq. (5) from the listed actions.

Though there is a probability that a non-listed action attains the maximum Q̄(a), this probability is likely to be close to zero. The amount of computation of this method is proportional to the number of samples times the number of listed actions. Since the original QMDP requires computation proportional to the number of discrete states times the number of actions, our method is effective when the number of states is extremely large.

This on-line method was implemented on the ERS-210, and its time consumption was measured with 1000 samples; Uniform MCL needs more than 1000 samples to represent bel(x) faithfully on the field. The computation times of 628 decision-making opportunities were recorded: the average was 18.6 [ms] and the maximum was 58.3 [ms]. When the worst time was recorded, 20 actions were listed. Incidentally, the computing time of Uniform MCL with 1000 samples was 7.5 [ms].
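The three-step selection above can be sketched as follows, reusing the hypothetical helpers from the previous listing together with a policy lookup pi_star and the discretizer; all of these names are assumptions, not the authors' code.

```python
def choose_action(samples, V_star, pi_star, discretize, apply_action):
    """Reduced QMDP selection: evaluate Eq. (8) only for actions suggested by pi*."""
    # Step 1: collect the optimal actions of the discrete states containing the samples.
    candidates = {pi_star(discretize(*pose)) for pose in samples}
    # Step 2: evaluate Eq. (8) for each listed action only.
    n = len(samples)
    value_now = sum(V_star(pose) for pose in samples) / n
    q_bar = {a: sum(V_star(apply_action(pose, a)) for pose in samples) / n - value_now
             for a in candidates}
    # Step 3: choose the listed action with the largest expected gain (Eq. 5).
    return max(q_bar, key=q_bar.get)
```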

4 Experiment

We implemented our method on the ERS-210 and verified that a robot can return to its goal more quickly than with methods that ignore, or only partially consider, the uncertainty of self-localization. One trial of the experiment proceeds as follows.

• A robot is placed away from the goal. The robot is not given its initial pose.
• The robot returns to the goal, observing landmarks and goals with its camera throughout the trial.
• When it is judged that the robot has reached the final state, the trial is finished. Whether the robot actually stops at the final state is not important, since only the time consumption is compared.
• When the robot itself judges that it is at the final state, it chooses an action randomly. This rule prevents the robot from freezing at a non-final state.
• The elapsed time of the trial is recorded.
• If the robot cannot reach the final state within three minutes, the trial is regarded as a failure and is stopped.

The following methods are implemented and compared.

• ESV (expected state value): our method, which uses the expected state value.
• EP-1 (expected pose): a method that ignores the uncertainty of self-localization. The average pose of all samples is calculated, and this pose and π*(s) are used for decision-making.
• EP-2: EP-1 with thresholds. The width of the interval in which the samples exist is calculated for each axis. If one of them exceeds a threshold, the robot stops walking. The thresholds of the x-axis and y-axis are fixed at 1000 [mm], and the threshold of the θ-axis at 90 [deg]. When all the intervals fall below the thresholds, the robot moves again immediately; the duration of the stop is not fixed. This behavior is expected to reduce wasteful actions that move the robot away from the final state when it has not localized itself well.

These methods were examined under the four kinds of lighting explained with Fig. 4 and Fig. 8. The initial poses of the robot were (x, y, θ) = (1000, 0, 45), (1000, 0, 135), (1000, 500, ±45), (1000, 500, ±135), (1000, 1000, ±45), and (1000, 1000, ±135) [mm], [deg]. Two trials took place for every combination of a method, a condition, and an initial pose (240 trials in total). The results are shown in Fig. 9 and Fig. 10: Fig. 9 illustrates the relationship between the number of lights and the average time consumption of the trials (failed trials are excluded from the average), and Fig. 10 shows the numbers of failed trials.

Through the trials, the ESV method got the robot to the final state 1.3-1.8 times as fast as the EP methods in all conditions, as shown in Fig. 9. With the EP methods, the robot often collided with a goal post when it tried to enter the goal area, since it did not consider the difference between the average sample pose and its real pose. When the robot slipped after such a collision, its real pose and estimated pose drifted apart, and the robot then moved in the wrong direction for a long time, until enough information to correct the difference was obtained. The robot with the ESV method also occasionally collided with a goal post, and such trials increased the average time in Fig. 9.

Fig. 9: Time consumption

Fig. 10: Number of failures (per 20 trials)

However, the frequency of collisions with the ESV method was lower than with the EP methods, since the robot could estimate the danger of collision through the expected state value. As shown in Fig. 10, the EP-2 method was more effective than the EP-1 method under ideal lighting, but it led to more failed trials than the other methods under worse conditions: the robot sometimes could not estimate its pose to within the thresholds and stopped for a long time. With fixed thresholds, the robot could not behave flexibly in the worse conditions.

Typical behavior with the ESV method is shown in Fig. 11. The robot walked randomly around the initial position (1 and 2 in Fig. 11), went near the goal quickly (3-5), walked randomly again in front of the goal (6 and 7), and suddenly entered the goal area (8). From 1 to 3, the robot waited until it had determined its rough orientation so that it could move toward the goal. From 6 to 7, the robot delayed entering the goal area until the uncertainty of self-localization was reduced enough to avoid a collision. Throughout this experiment, the method listed two or more actions for the calculation of Eq. (8) in 75% of the decision-making opportunities, and it could choose an appropriate action from the listed actions.

Fig. 11: Behavior of the robot in a trial

5 Conclusion and Future Work

We proposed a novel navigation method for use under self-localization uncertainty. The method uses Uniform MCL as the self-localization algorithm and performs the QMDP calculation directly on the samples of Uniform MCL. The following is possible with this method.

• The behavior of a robot can be planned in an off-line phase without consideration of the estimation uncertainty.
• Uniform MCL can estimate the robot's pose without gross errors under the uncertainty.
• The on-line method deals with the vagueness of self-localization results and chooses an appropriate action in real time.

A robot can therefore consider the uncertainty in real time when it chooses an appropriate action. Our method was applied to a navigation task in the RoboCup four-legged robot league. The off-line planning phase of our method took one minute with a 1.3 GHz CPU, and the planning result was implemented on an ERS-210. The following points were confirmed in the experiment, verifying the effectiveness of the method.

• The calculation time of the on-line method was 18.6 [ms] on average and at most 58.3 [ms], which was short enough for the robot to move in real time.
• Our method led the robot to the final state 1.3-1.8 times faster, and more reliably, than the methods that do not consider the self-localization uncertainty.
• With our method, the robot moves prudently when it obtains too little information to execute its task.
• The robot can execute the task dependably with our method under both ideal and poor conditions.
• Our method enables the robot to consider the uncertainty of self-localization.

Our next project will focus on the robot's ability to consider the risk of an action. In the experiment, the robot did not risk entering the goal area when it could not obtain enough information, and this cautious strategy worked well. However, if the robot cannot obtain sufficient information for a long time, a risky action may occasionally yield a good result. If some samples with low state values are excluded from the calculation of the expected state value after such a long wait, the robot may choose a risky action. In this way, the robot's behavior can be changed slightly through variations in the calculation, and we will study the characteristics of these variations in the future.

References

[1] M. Asada et al.: "RoboCup: Today and Tomorrow – What We Have Learned," Artificial Intelligence, Vol. 110, pp. 193–214, 1999.
[2] J. Barraquand and P. Ferbach: "Motion Planning with Uncertainty: The Information Space Approach," in Proc. of IEEE ICRA, 1995.
[3] R. Bellman: Dynamic Programming, Princeton University Press, 1957.
[4] D. Fox et al.: "Monte Carlo Localization: Efficient Position Estimation for Mobile Robots," in Proc. of AAAI, 1999.
[5] T. Fukase et al.: "Real-time Decision Making under Uncertainty of Self-Localization Results," in Proc. of 2002 International RoboCup Symposium, pp. 327–379, 2002.
[6] S. M. LaValle: "Robot Motion Planning: A Game-Theoretic Foundation," Algorithmica, 26:430–465, 2000.
[7] M. L. Littman et al.: "Learning Policies for Partially Observable Environments: Scaling Up," in Proc. of International Conference on Machine Learning, pp. 362–370, 1995.
[8] N. Roy and S. Thrun: "Coastal Navigation with Mobile Robots," in Advances in Neural Information Processing Systems, 1999.
[9] R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, The MIT Press, 1998.
[10] S. Thrun: "Monte Carlo POMDPs," Neural Information Processing Systems, 12:1064–1070, 2000.
[11] R. Ueda et al.: "Uniform Monte Carlo Localization – Fast and Robust Self-localization Method for Mobile Robots," in Proc. of IEEE ICRA, pp. 1353–1358, 2002.

