Incremental Markov-Model Planning

Richard Washington
Philadelphia VAMC and University of Pennsylvania
Department of Computer and Information Science
200 South 33rd Street, 5th Floor
Philadelphia, PA 19104-6389

Abstract

This paper presents an approach to building plans using partially observable Markov decision processes. The approach begins with a base solution that assumes full observability. The partially observable solution is incrementally constructed by considering increasing amounts of information from observations. The base solution directs the expansion of the plan by providing an evaluation function for the search fringe. We show that incremental observation moves from the base solution towards the complete solution, allowing the planner to model the uncertainty about action outcomes and observations that is present in real domains.

1 Introduction

For domains with uncertainty about action outcomes and incomplete information about the current state, partially observable Markov decision processes (POMDPs) are appropriate for capturing the dynamics of the domain. The difficulty is that complete, precise solutions to POMDPs are available only for very small problems. On the other hand, fully observable Markov decision processes (FOMDPs) are amenable to tractable solutions for problems of practical size. They, however, fail to completely capture the interesting and important features of the domains, and thus remain an imperfect approach.

The approach presented in this paper uses a FOMDP solution as a "base" solution, which is then improved. The results of observations are taken into account incrementally, moving from an imperfect FOMDP solution towards a more accurate POMDP solution.

Consider two domains to which Markov models have already been applied: medicine and mobile robotics.

Both share the features that make POMDPs attractive and FOMDPs problematic. In medical diagnosis, the underlying patient state can be modeled as a Markov process, with medical procedures and natural processes moving the patient from state to state with some probability. These probabilities are reported in the medical literature (e.g., see [4]) and can be used to produce Markov models of disease progression, modeling such features as survival rate [4] and cost [7]. However, in these models the patient's state is modeled probabilistically based on reported occurrences in the general population or a specific patient population. The role of diagnostic tests for identifying the current state of the patient is absent from these models; when used, it appears in a decision tree, with the leaves being Markov models of the outcomes based on the results of a diagnostic procedure [1]. However, actual medical practice interleaves diagnostic and therapeutic actions so that the most appropriate procedure is done at each time (given the available knowledge). Given two possible conditions, each of which has a different treatment, the goal of a diagnostic action is to rule in or out the conditions so that (given the test error) an effective therapy can be found. In terms of Markov models, the model of the patient state is not a single state, but rather a distribution over a set of possible states. The role of the diagnostic tests is to provide observations that will imply a different distribution over the states. The best action given the observation will likely be different than the best action given no observation (after all, that is the reason for performing a diagnostic test). POMDPs model this behavior naturally, while FOMDPs model only the underlying physics of the body.

In mobile robotics, a similar situation exists. A mobile robot relies on its sensor information to infer its location in the world, and then uses that location to calculate the best route to take.

Given perfect information, there is a wide array of approaches that work well for navigation [15, 9, 8], but in reality very few work well when the robot is unsure of its location. Those that do work often rely on specific features or models of the robot or environment that rarely generalize [12, 5]. In fact, what the robot needs to do is calculate the optimal action to take in the face of this uncertainty, given the available sensor information. This is exactly what POMDPs are designed to do. Existing research on Markov approaches relies either on FOMDPs that assume perfect sensor knowledge [6, 2], or on 0th-order approximations to POMDPs that ignore information-gathering actions [17]. The POMDP literature, on the other hand, makes it obvious that precise solutions to POMDPs are available for only trivially small models [13, 3]. Available approximations are only powerful enough to scale up to problems with tens of states [18, 11, 16]. However, even simple tasks can require thousands of states (see [17]).

The approach presented here shows that a FOMDP solution can produce estimates that lead to an incremental POMDP solution. First, we briefly review POMDPs. Then we show how observations can be incrementally added to improve a plan. We then discuss the use of the FOMDP base solution as an estimate for state evaluations. Finally, we discuss our initial results on a simple robot navigation problem and some other problems from the literature.

2 Partially Observable Markov Decision Processes

In this section we briefly review Markov processes, and in particular POMDPs. We will borrow the notation of [13], adding or changing only as required for the problem at hand; the reader can refer there for a more complete explanation of the framework.

We assume that the underlying process, the core process, is described by a finite-state, stationary Markov chain. The core process is captured by the following information:

- a finite set $\mathcal{N} = \{1, \ldots, N\}$, representing the possible states of the process
- a variable $X_t \in \mathcal{N}$ representing the state of the core process at time $t$
- a finite set $\mathcal{A}$ of available actions
- a matrix $P = [p_{ij}]$, $i, j \in \mathcal{N}$, specifying transition probabilities of the core process: $P(a) = [p_{ij}(a)]$ specifies the transition probabilities when action $a \in \mathcal{A}$ is chosen
- a reward matrix $R = [r_{ij}]$, $i, j \in \mathcal{N}$, specifying the immediate rewards of the core process: $R(a) = [r_{ij}(a)]$ specifies the reward received when action $a \in \mathcal{A}$ is executed.

We will use the shorthand
$$\rho_i(a) = \sum_{j \in \mathcal{N}} r_{ij}(a) \, p_{ij}(a)$$
to denote the reward of taking action $a$ when in state $i$, and $\rho(a) = \{\rho_1(a), \ldots, \rho_N(a)\}$.

So at time $t$, the core process is in state $X_t = i$, and if an action $a \in \mathcal{A}$ is taken, the core process transitions to state $X_{t+1} = j$ with probability $p_{ij}(a)$, receiving immediate reward $r_{ij}(a)$.

Actions are chosen by a policy that maps states to actions. The optimal policy is the policy that maximizes the utility of each state. The value of a state under the optimal policy (given full observability) is defined as:
$$v(i) = \max_{a \in \mathcal{A}} \left\{ \rho_i(a) + \beta \sum_{k \in \mathcal{N}} v(k) \, p_{ik}(a) \right\}$$
where $0 \le \beta < 1$ is a discount factor (this ensures a bounded value function).

However, in a partially observable MDP, the progress of the core process is not known, but can only be inferred through a finite set of observations. The observations are captured with the following information:

- a finite set $\mathcal{M} = \{1, \ldots, M\}$ representing the possible observations
- a variable $Y_t \in \mathcal{M}$ representing the observation at time $t$
- a matrix $Q = [q_{ij}]$, $i \in \mathcal{N}$, $j \in \mathcal{M}$, specifying the probability of seeing observations in given states: $Q(a) = [q_{ij}(a)]$, where $q_{ij}(a)$ denotes the probability of observing $j$ from state $i$ when action $a \in \mathcal{A}$ has been taken
- a state distribution variable $\pi(t) = \{\pi_1(t), \ldots, \pi_N(t)\}$, where $\pi_i(t)$ is the probability of $X_t = i$ given the information about actions and observations
- an initial state distribution $\pi(0)$.

At time $t$, the observation of the core process will be $Y_t$. If action $a \in \mathcal{A}$ is taken, we can define a function to determine $Y_{t+1}$. In particular, we define
$$\gamma(j \mid \pi(t), a) = \sum_{i \in \mathcal{N}} q_{ij}(a) \sum_{k \in \mathcal{N}} p_{ki}(a) \, \pi_k(t) \qquad (1)$$
as the probability that $Y_{t+1} = j$ given that action $a \in \mathcal{A}$ is taken at time $t$ and the state distribution at that time is $\pi(t)$.

To determine the state distribution variable $\pi(t+1)$, we define the transformation $T$ as follows:
$$\pi(t+1) = T(\pi(t) \mid j, a) = \{T_1(\pi(t) \mid j, a), \ldots, T_N(\pi(t) \mid j, a)\}$$
where
$$T_i(\pi(t) \mid j, a) = \frac{q_{ij}(a) \sum_{k \in \mathcal{N}} p_{ki}(a) \, \pi_k(t)}{\sum_{l \in \mathcal{N}} q_{lj}(a) \sum_{k \in \mathcal{N}} p_{kl}(a) \, \pi_k(t)} \qquad (2)$$
for $i \in \mathcal{N}$, and where $\pi(t)$ is the state distribution at time $t$, and $a \in \mathcal{A}$ is the action taken at that time, resulting in observation $j \in \mathcal{M}$.

Actions are chosen by a decision rule (or plan) that maps state distributions to actions. The utility of a state distribution $\pi$ under the optimal decision rule can be computed by the value function:
$$V_P(\pi) = \max_{a \in \mathcal{A}} \left\{ \pi \cdot \rho(a) + \beta \sum_{j \in \mathcal{M}} V_P[T(\pi \mid j, a)] \, \gamma(j \mid \pi, a) \right\} \qquad (3)$$
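To make Equations 1 and 2 concrete, here is a minimal Python sketch (ours, not from the paper) of the observation probability and the belief transformation $T$; the array layout and names are assumptions made for illustration.

```python
import numpy as np

def observation_prob(pi, a, P, Q):
    """Equation (1): gamma(j | pi, a), the probability of each observation j
    after taking action a from state distribution pi.
    Conventions (assumed): P[a][k, i] = p_ki(a), Q[a][i, j] = q_ij(a)."""
    predicted = pi @ P[a]      # predicted[i] = sum_k p_ki(a) * pi_k
    return predicted @ Q[a]    # result[j]    = sum_i q_ij(a) * predicted[i]

def belief_update(pi, a, j, P, Q):
    """Equation (2): T(pi | j, a), the new state distribution after taking
    action a and then observing j."""
    predicted = pi @ P[a]
    unnormalized = Q[a][:, j] * predicted
    return unnormalized / unnormalized.sum()

# Tiny two-state, one-action, two-observation model (made-up numbers).
P = {0: np.array([[0.9, 0.1],
                  [0.2, 0.8]])}
Q = {0: np.array([[0.7, 0.3],
                  [0.4, 0.6]])}
pi = np.array([0.5, 0.5])
print(observation_prob(pi, 0, P, Q))   # distribution over observations j
print(belief_update(pi, 0, 1, P, Q))   # posterior over states given j = 1
```

Note that the normalizing denominator in belief_update is exactly the observation probability of Equation 1, which is why the two quantities appear together in Equation 3.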

3 Incremental Observation

This section introduces a method for incrementally incorporating observations into a plan. The observations are incorporated if the cost of the action that provides the observation is outweighed by the gain in the overall plan utility. In the limiting case, the plan will be the plan prescribed by a full POMDP solution.

As described in Section 2, given a set of states $\mathcal{N}$, a set of actions $\mathcal{A}$, and an initial distribution $\pi(0)$ over $\mathcal{N}$, the optimal action is the action that maximizes the expected utility of the resulting state distribution. In reality, to know the precise expected utility of the resulting states, a full POMDP solution is required. However, we will work with an estimate of the utility of a state distribution. How this estimate is calculated will be shown in Section 4.

If the actions and observations are expanded in a search-tree form, they form an AND-OR tree. The actions form the OR branches, since the optimal action is a choice among the set of actions. The observations form the AND branches, since the utility of an action is a sum of the utility of the state distribution implied by each of the possible observations multiplied by the probability of that observation (see Equation 3).

Given the AND-OR tree of actions and observations and an evaluation function, with the goal of maximizing the utility of the chosen plan of actions, we can invert the utilities (making them disutilities). This presents the goal of minimizing the disutility of the plan, so the AO* algorithm [14] can be used to expand the tree incrementally. The AO* algorithm still involves a non-deterministic choice of which AND branch along the current best path to expand further (since that does not fall out directly from the evaluations, unlike in A*). We have chosen in our approach to expand AND branches to equal depths, to avoid long searches down paths that will do little to change the overall utility.¹ However, from a purely theoretical standpoint, a stochastic approach or any other choice would be equally likely to work in general.

¹ An evaluation estimate that is too good at an AND node following a bad evaluation at an OR node is more likely to occur in a depth-first expansion, since the search will continue to expand that AND branch without changing the overall utility and thus the best path.
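For illustration, the sketch below (our own; the class and field names are hypothetical, not the paper's implementation) shows one way the AND-OR tree could be represented and how values are backed up: OR nodes branch on actions, AND nodes branch on observations, and the FOMDP estimate stands in at unexpanded fringe nodes.

```python
from dataclasses import dataclass, field

BETA = 0.95  # discount factor (assumed value)

@dataclass
class OrNode:
    """A belief state; children are the candidate actions (OR choice)."""
    belief: tuple                                   # state distribution pi
    estimate: float                                 # FOMDP-based estimate of V(pi)
    actions: dict = field(default_factory=dict)     # action -> AndNode

@dataclass
class AndNode:
    """An action applied to a belief; children are the possible observations
    (AND branches), each weighted by its probability gamma(j | pi, a)."""
    action: int
    expected_reward: float                          # pi . rho(a)
    outcomes: dict = field(default_factory=dict)    # obs j -> (prob, OrNode)

def backed_up_value(node: OrNode) -> float:
    """Best-action value of an OR node: Equation (3) over the expanded part
    of the tree, falling back to the FOMDP estimate at fringe nodes."""
    if not node.actions:                            # unexpanded fringe node
        return node.estimate
    return max(
        and_node.expected_reward
        + BETA * sum(prob * backed_up_value(child)
                     for prob, child in and_node.outcomes.values())
        for and_node in node.actions.values()
    )
```

AO* then repeatedly expands a fringe node on the current best path and re-propagates these backed-up values (negating them if the search is run as cost minimization, as described above).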

4 FOMDP-based Evaluation

In this section, we show that a FOMDP solution to the core process can be used as an estimate of the POMDP value function, and in particular that it provides an overestimate of the value function (thus ensuring admissibility of the AO* algorithm described above).

We assume that a solution exists to an underlying FOMDP model of the problem domain. The solution describes the optimal policy and expected state values if full observation were possible. Since tractable solution methods exist for FOMDPs, and this computation can take place off-line, this assumption appears reasonable to us. See Section 6 for how this may be relaxed. What this corresponds to in the domains discussed is an off-line computation of the optimal treatment for each possible disease state, or an off-line computation of an optimal robot action at each distinguished location in its known environment (the locations must be discretized in order to use any of the existing Markov approaches).

Define a value function of a state distribution $\pi$ as a weighted sum of the FOMDP value functions:
$$V_F(\pi) = \sum_{i \in \mathcal{N}} \pi_i \, v(i)$$
(this is the approach suggested in [10] and [17]). Now, using the value functions for the finite-horizon case, where the value is computed for $n$ steps rather than an infinite number, we have the following equations:
$$V_P^0(\pi) = \pi \cdot \rho(0)$$
$$V_P^n(\pi) = \max_{a \in \mathcal{A}} \left\{ \pi \cdot \rho(a) + \beta \sum_{j \in \mathcal{M}} V_P^{n-1}[T(\pi \mid j, a)] \, \gamma(j \mid \pi, a) \right\}$$
$$v^0(i) = \rho_i(0)$$
$$v^n(i) = \max_{a \in \mathcal{A}} \left\{ \rho_i(a) + \beta \sum_{k \in \mathcal{N}} v^{n-1}(k) \, p_{ik}(a) \right\}$$
$$V_F^n(\pi) = \sum_{i \in \mathcal{N}} \pi_i \, v^n(i).$$

Note that $V_P^0(\pi) = V_F^0(\pi)$. Also, the infinite-horizon value functions are simply the limits of the finite-horizon value functions as $n$ approaches infinity. Now, we can show $V_P^n(\pi) \le V_F^n(\pi)$ inductively. Assuming $V_P^{n-1}(\pi) \le V_F^{n-1}(\pi)$, the terms can be rearranged using Equations 1 and 2 to show:
$$\begin{aligned}
V_P^n(\pi) &= \max_{a \in \mathcal{A}} \left\{ \pi \cdot \rho(a) + \beta \sum_{j \in \mathcal{M}} V_P^{n-1}[T(\pi \mid j, a)] \, \gamma(j \mid \pi, a) \right\} \\
&\le \max_{a \in \mathcal{A}} \left\{ \pi \cdot \rho(a) + \beta \sum_{j \in \mathcal{M}} V_F^{n-1}[T(\pi \mid j, a)] \, \gamma(j \mid \pi, a) \right\} \\
&= \max_{a \in \mathcal{A}} \left\{ \pi \cdot \rho(a) + \beta \sum_{j \in \mathcal{M}} \sum_{k \in \mathcal{N}} v^{n-1}(k) \, T_k(\pi \mid j, a) \, \gamma(j \mid \pi, a) \right\} \\
&= \max_{a \in \mathcal{A}} \left\{ \pi \cdot \rho(a) + \beta \sum_{j \in \mathcal{M}} \sum_{k \in \mathcal{N}} v^{n-1}(k) \, q_{kj}(a) \sum_{i \in \mathcal{N}} p_{ik}(a) \, \pi_i \right\} \\
&= \max_{a \in \mathcal{A}} \left\{ \pi \cdot \rho(a) + \beta \sum_{k \in \mathcal{N}} v^{n-1}(k) \sum_{i \in \mathcal{N}} p_{ik}(a) \, \pi_i \right\} \\
&= \max_{a \in \mathcal{A}} \left\{ \sum_{i \in \mathcal{N}} \pi_i \, \rho_i(a) + \beta \sum_{i \in \mathcal{N}} \pi_i \sum_{k \in \mathcal{N}} v^{n-1}(k) \, p_{ik}(a) \right\} \\
&\le \sum_{i \in \mathcal{N}} \pi_i \max_{a \in \mathcal{A}} \left[ \rho_i(a) + \beta \sum_{k \in \mathcal{N}} v^{n-1}(k) \, p_{ik}(a) \right] \\
&= \sum_{i \in \mathcal{N}} \pi_i \, v^n(i) \\
&= V_F^n(\pi).
\end{aligned}$$

Then we can see that
$$V_P(\pi) = \lim_{n \to \infty} V_P^n(\pi) \le \lim_{n \to \infty} V_F^n(\pi) = V_F(\pi).$$
So we have shown that the FOMDP-based value function can be used as an overestimate of the POMDP value function.

At any point in expanding the AND-OR tree of actions and observations, there will be a partially-expanded search tree, with estimates produced from the FOMDP model. In essence, the FOMDP model "fills in" the unexplored area of the search space with its precomputed policy (see Figure 1). In fact, the FOMDP solution can be used if necessary to decide on an action in the absence of an expanded POMDP path, using a weighting to decide on the best action, as in [10]. As the AND-OR tree is expanded, the best action is increasingly influenced by the observations and decreasingly by the FOMDP, thus incrementally shifting from a FOMDP model to a POMDP model.

[Figure 1 omitted: a search tree whose expanded upper portion is labeled "POMDP AO*" and whose unexpanded remainder is labeled "FOMDP fixed policy".]
Figure 1. Incremental observation using an FOMDP-based evaluation.
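The base values $v(i)$ used above come from a standard FOMDP solution; as a sketch (ours, with assumed array conventions and a made-up model), value iteration on the core process and the resulting fringe estimate $V_F(\pi)$ might look like this:

```python
import numpy as np

def fomdp_values(P, rho, beta=0.95, tol=1e-8):
    """Value iteration for the fully observable core process.
    Conventions (assumed): P[a][i, k] = p_ik(a), rho[a][i] = rho_i(a),
    beta = discount factor."""
    n_states = next(iter(rho.values())).shape[0]
    v = np.zeros(n_states)
    while True:
        # One backup: v_new(i) = max_a { rho_i(a) + beta * sum_k p_ik(a) v(k) }
        q = np.array([rho[a] + beta * P[a] @ v for a in P])
        v_new = q.max(axis=0)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

def fomdp_estimate(pi, v):
    """V_F(pi) = sum_i pi_i v(i): the overestimate used at fringe nodes."""
    return float(np.dot(pi, v))

# Made-up two-state, two-action core process.
P = {0: np.array([[0.9, 0.1], [0.3, 0.7]]),
     1: np.array([[0.5, 0.5], [0.1, 0.9]])}
rho = {0: np.array([1.0, 0.0]),
       1: np.array([0.0, 2.0])}
v = fomdp_values(P, rho)
print(fomdp_estimate(np.array([0.5, 0.5]), v))   # never below the true POMDP value
```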

[Figure 2 omitted: a grid-world map annotated with the FOMDP evaluation of each square/heading pair.]
Figure 2. Example with FOMDP state evaluations. States S1, S5, S6, and S7 are sinks. State S8 is the goal state. The numbers are the FOMDP evaluations for the state corresponding to the robot being in the square and facing that direction.

5 Results

Consider the simple robot-navigation problem in Figure 2. There are 32 states, corresponding to the robot being in each of the squares facing right, left, up, or down. The robot has the choice of turning left or right, moving forward, or doing a laser scan of the environment. While moving or turning, less reliable sonar information is available, so that observations can confound corridors and junctions, and especially different types of junctions. In addition, motions are slightly inaccurate. The FOMDP policy will route the robot towards the goal state, but it assumes perfect information about the state of the robot. In addition, the assumption of perfect information will obviate the need for laser scans, since these are purely information-gathering actions.

Suppose that the robot started with uncertainty about its starting state: either S2 or S4, facing to the right. In this case the FOMDP-derived value would assume taking the optimal action for each, producing a value estimate of -63. In fact, the more accurate POMDP value produced by the algorithm, -115, is much more costly, because of the necessity of performing additional steps such as laser scans to verify the location.
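As a worked instance of the weighted estimate (our arithmetic; the two per-state values are assumed purely for illustration, chosen to be consistent with the -63 estimate above):
$$V_F(\pi_0) = 0.5\,v(\mathrm{S2,\ right}) + 0.5\,v(\mathrm{S4,\ right}) = 0.5(-42) + 0.5(-84) = -63,$$
whereas the search, which must account for the additional sensing steps, settles on the POMDP value of -115.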

Each iteration of the AO* algorithm involves expanding the best path one step, followed by a reassessment of the best path (possibly changing the best path if the evaluations have changed). After 5 iterations, the initial action is fixed as a laser scan to gather more information. After 3 more iterations, that plan is extended: if the laser scan shows a T intersection, the robot turns right and moves forward; if the laser scan shows an X, the robot moves forward two squares, turns right, and moves forward. These particular results rely on the particular probabilities and costs assigned to motion and sensing; given different initial probabilities and costs, the plan produced will spend more or less time sensing and moving to optimize its overall utility.

On a SPARCserver 690MP Model 612 (Solaris 5.4), the initial 5 iterations take 0.7 seconds of execution time and produce an AND-OR "best path" of 4 states with maximum depth 2, and the 8 iterations to further expand the plan take a total of 1.2 seconds of execution time, producing a best path of 188 states with maximum depth 4. The search tree and time increase exponentially, as would be expected with the AO* algorithm, with 40 iterations requiring 137 seconds of execution time and producing a "best path" through the AND-OR space with 40098 states with maximum depth 16.

To show the performance of the algorithm over time, results from four example problems are shown in Figures 3-6. The utility estimate of the plan begins with the FOMDP-based estimate and approaches the POMDP value with each iteration, with each run terminated after 40 iterations. The problem in Figure 6 was terminated prematurely due to a lack of memory for the AO* expansion, which suggests the need for a bounded-memory version of the algorithm. In all cases, the value function quickly changes from the initial FOMDP-based estimate and then more slowly approaches the true POMDP value. This is true whether the absolute change in utility is large (Figures 3 and 4) or small (Figures 5 and 6).

[Figure 3 plot omitted: utility estimate versus AO* iterations.]
Figure 3. Refinement of state utility estimate as the AO* search is expanded: robot-navigation problem from this paper.

[Figure 4 plot omitted: utility estimate versus AO* iterations.]
Figure 4. Refinement of state utility estimate as the AO* search is expanded: tiger problem [3].

[Figure 6 plot omitted: utility estimate versus AO* iterations.]
Figure 6. Refinement of state utility estimate as the AO* search is expanded: large navigation problem [10].

[Figure 5 plot omitted: utility estimate versus AO* iterations.]
Figure 5. Refinement of state utility estimate as the AO* search is expanded: tiny navigation problem [10].

6 Discussion

We have shown how a POMDP decision rule can be incrementally constructed using a FOMDP solution of the core process. As the POMDP decision rule is expanded, the plan incorporates increasing amounts of information from observations. The incremental nature of the algorithm lends itself to larger problems; the memory requirements of the AO* algorithm trade off number of iterations against problem size, but problems with over 100 states have been tested (although analysis of the results is a problem simply because of the enormous search trees explored). Further refinement of the algorithm to an iterative-deepening style is planned, which should alleviate the memory limitations.

Since the FOMDP evaluations are overestimates, these evaluations only need to be refined at fringe nodes of the best path, since the refinement can only serve to make a path look worse in relation to others. Thus ignoring paths with high FOMDP-estimated disutilities will never lead to a suboptimal plan.

The speed at which the algorithm finds a stable solution depends on the underlying numbers. If the cost of a sink is too high, the utility of the goal is overwhelmed by the risk of hitting a sink, and the "optimal" plan will end up with the robot doing nothing. In real problems, the cost of not reaching the goal within a deadline may exceed the risk of the sinks; this needs to be modeled.

On larger problems, the AO* expansion fails due to the exponential memory needed to store the AND branches. We are working on a bounded-memory version that will overcome this problem.

The admissibility of AO*, guaranteeing an optimal solution, relies not only on the value estimate being an underestimate (of disutility), but also on the monotonicity of the value estimate. This is not strictly the case for the FOMDP-based estimates (see Figure 5, where increases in the utility estimate correspond to non-monotonicities). However, searches are only terminated when the best path loops along all branches (at which point the estimate is exactly the POMDP value). Otherwise the search is infinite, so the admissibility of AO* is of secondary interest to the overall behavior.

An assumption of a full FOMDP solution was made above in Section 4. In fact, given the incremental nature of the approach, an incremental approximation to FOMDP values would also be usable (just adding yet another layer of incremental computation). A guaranteed underestimating approximation would preserve the properties described in the paper.

7 Acknowledgments

This research was supported by a VA Medical Informatics Fellowship, administered through the Philadelphia VAMC. Thanks to Michael Littman and anonymous referees for helpful comments on various versions of the paper. Many thanks also to Michael Littman for supplying a suite of example problems.

References

[1] J. R. Beck and S. G. Pauker. The Markov process in medical prognosis. Medical Decision Making, 3(4):419-458, 1983.
[2] C. Boutilier and R. Dearden. Using abstractions for decision-theoretic planning with time constraints. In Proceedings of AAAI-94, 1994.
[3] A. R. Cassandra, L. P. Kaelbling, and M. L. Littman. Acting optimally in partially observable stochastic domains. In Proceedings of AAAI-94, 1994.
[4] M. E. Cowen, M. Chartrand, and W. F. Weitzel. A Markov model of the natural history of prostate cancer. Journal of Clinical Epidemiology, pages 3-21, 1994.
[5] A. Curran and K. J. Kyriakopolous. Sensor-based localization for wheeled mobile robots. In Proceedings, IEEE International Conference on Robotics and Automation, 1993.
[6] T. Dean, L. P. Kaelbling, J. Kirman, and A. Nicholson. Planning under time constraints in stochastic domains. Artificial Intelligence, 76:35-74, 1995.
[7] M. C. Fahs, J. Mandelblatt, C. Schechter, and C. Muller. Cost effectiveness of cervical cancer screening for the elderly. Annals of Internal Medicine, 117(6):520-527, 1992.
[8] Y. K. Hwang and N. Ahuja. Gross motion planning -- a survey. ACM Computing Surveys, 24(3), September 1992.
[9] K. Kedem and M. Sharir. An efficient motion planning algorithm for a convex rigid polygonal object in 2-dimensional polygonal space. Discrete Computational Geometry, 5:43-75, 1990.
[10] M. L. Littman, A. Cassandra, and L. P. Kaelbling. Learning policies for partially observable environments: Scaling up. In A. Prieditis and S. Russell, editors, Proceedings of the Twelfth International Conference on Machine Learning, pages 362-370, San Francisco, CA, 1995. Morgan Kaufmann.
[11] W. S. Lovejoy. A survey of algorithmic methods for partially observed Markov decision processes. Annals of Operations Research, 28:47-65, 1991.
[12] R. Mandelbaum and M. Mintz. A confidence set approach to mobile robot localization. In Proceedings of the 1996 Conference on Multisensor Fusion and Integration, 1996.
[13] G. E. Monahan. A survey of partially observable Markov decision processes: Theory, models, and algorithms. Management Science, 28(1):1-16, 1982.
[14] N. J. Nilsson. Principles of Artificial Intelligence. Tioga Publishing Company, 1980.
[15] C. O'Dunlaing and C. K. Yap. A "retraction" method for planning the motion of a disc. Journal of Algorithms, 6:104-111, 1985.
[16] R. Parr and S. Russell. Approximating optimal policies for partially observable stochastic domains. In Proceedings of IJCAI-95, 1995.
[17] R. Simmons and S. Koenig. Probabilistic robot navigation in partially observable environments. In Proceedings of IJCAI-95, 1995.
[18] C. C. White, III. Partially observed Markov decision processes: A survey. Annals of Operations Research, 32, 1991.