Incremental Observation: Tractable Markov-Model Planning

Richard Washington
University of Pennsylvania
Department of Computer and Information Science
200 South 33rd Street, 5th Floor
Philadelphia, PA 19104-6389
[email protected]

Abstract
This paper presents an approach to building plans using partially observable Markov decision processes. The approach begins with a base solution that assumes full observability. The partially observable solution is incrementally constructed by considering increasing amounts of information from observations. The base solution directs the expansion of the plan by providing an evaluation function for the search fringe. In addition, the evaluations are iteratively refined towards a solution of the non-observable case, at the same time extending the plan. We show that incremental observation and iterative state evaluation combine to move from the base solution towards the complete solution, allowing the planner to model the uncertainty about action outcomes and observations that is present in real domains.
Submitted for review
1 Introduction

For domains with uncertainty about action outcomes and incomplete information about the current state, partially observable Markov decision processes (POMDPs) are appropriate for capturing the dynamics of the domain. The difficulty is that complete, precise solutions to POMDPs are available only for very small problems. On the other hand, fully observable Markov decision processes (FOMDPs) are amenable to tractable solutions for problems of practical size. They, however, fail to completely capture the interesting and important features of the domains, and thus remain an imperfect approach. The approach presented in this paper uses a FOMDP solution as a "base" solution, which is then improved in two ways: first, the results of observations are taken into account incrementally; and second, the evaluations of the states are iteratively improved from the FOMDP estimates. We will show how the two methods, incremental observation and iterative state evaluation, combine to move incrementally from an imperfect FOMDP solution towards a more accurate POMDP solution.

Consider two domains to which Markov models have already been applied: medicine and mobile robotics. Both share the features that make POMDPs attractive and FOMDPs problematic. In medical diagnosis, the underlying patient state can be modeled as a Markov process, with medical procedures and natural processes moving the patient from state to state with some probability. These probabilities are reported in the medical literature (e.g., see [Cowen et al., 1994]) and can be used to produce Markov models of disease progression, modeling such features as survival rate [Cowen et al., 1994] and cost [Fahs et al., 1992]. However, in these models the patient's state is modeled probabilistically based on reported occurrences in the general population or a specific patient population. The role of diagnostic tests for identifying the current state of the patient is absent from these models; when used, it appears in a decision tree, with the leaves being Markov models of the outcomes based on the results of a diagnostic procedure [Beck and Pauker, 1983]. However, actual medical practice interleaves diagnostic and therapeutic actions so that the most appropriate procedure is done at each time (given the available knowledge). Given two possible conditions, each of which has a different treatment, the goal of a diagnostic action is to rule in or out the conditions so that (given the test
error), an effective therapy can be found. In terms of Markov models, the model of the patient state is not a single state, but rather a distribution over a set of possible states. The role of the diagnostic tests is to provide observations that will imply a different distribution over the states. The best action given the observation will likely be different than given no observation (after all, that is the reason for performing a diagnostic test). POMDPs model this behavior naturally, while FOMDPs model only the underlying physics of the body.

In mobile robotics, a similar situation exists. A mobile robot relies on its sensor information to infer its location in the world, and then uses that location to calculate the best route to take. Given perfect information, there is a wide array of approaches that work well for navigation [O'Dunlaing and Yap, 1985; Kedem and Sharir, 1990; Hwang and Ahuja, 1992], but in reality very few work well when the robot is unsure of its location. Those that do work often rely on specific features or models of the robot or environment that rarely generalize [Mandelbaum and Mintz, 1996; Curran and Kyriakopolous, 1993]. In fact, what the robot needs to do is calculate the optimal action to take in the face of this uncertainty, given the sensor information. This is exactly what POMDPs are designed to do.

Existing research on Markov approaches relies either on FOMDPs that assume perfect sensor knowledge [Dean et al., 1995; Boutilier and Dearden, 1994], or on 0th-order approximations to POMDPs that ignore observations [Simmons and Koenig, 1995]. The POMDP literature, on the other hand, makes it obvious that precise solutions to POMDPs are available for only trivially small models [Monahan, 1982; Cassandra et al., 1994]. The approximations that have been offered are not powerful enough to scale up to practical problems [White, 1991; Lovejoy, 1991; Parr and Russell, 1995]. So, given the attractiveness and necessity of POMDPs in real-world domains, but their computational intractability, we are left in a bit of limbo for these domains.

The approach presented here rests on two observations: information-seeking actions are designed to discriminate among states, and FOMDP solutions can be used to produce estimates, which can be iteratively refined, of the worth of an action given a POMDP state distribution. These observations will be expanded on later. First, we briefly review POMDPs. Then we show how observations can be incrementally added to improve a plan. We then discuss the use of the FOMDP base solution as an estimate for state evaluations. Then we show how to iteratively refine the state evaluations
from the FOMDP base solution. Finally we discuss our initial results and conclude.
2 Partially Observable Markov Decision Processes

In this section we briefly review Markov processes, and in particular POMDPs. We will borrow the notation of [Monahan, 1982], adding or changing only as required for the problem at hand; the reader can refer there for a more complete explanation of the framework.

We assume that the underlying process, the core process, is described by a finite-state, stationary Markov chain. The core process is captured by the following information:

- a finite set N = {1, ..., N}, representing the possible states of the process
- a variable X_t ∈ N representing the state of the core process at time t
- a finite set A of available actions
- a matrix P = [p_ij], i, j ∈ N specifying transition probabilities of the core process: P(a) = [p_ij(a)] specifies the transition probabilities when action a ∈ A is chosen
- a reward matrix R = [r_ij], i, j ∈ N specifying the immediate rewards of the core process: R(a) = [r_ij(a)] specifies the reward received when action a ∈ A is executed.

We will use the shorthand

$$\rho_i(a) = \sum_{j \in N} r_{ij}(a) \, p_{ij}(a)$$

to denote the reward of taking action a when in state i, and ρ(a) = {ρ_1(a), ..., ρ_N(a)}. So at time t, the core process is in state X_t = i, and if an action a ∈ A is taken, the core process transitions to state X_{t+1} = j with probability p_ij(a), receiving immediate reward r_ij(a).
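As a concrete illustration, the core process can be encoded as transition and reward matrices indexed by action, from which the expected immediate reward ρ_i(a) follows directly. The sketch below is ours, not part of the formulation above; the dictionaries P and R, the function rho, and the two-state numbers are all made up for illustration.

```python
import numpy as np

# A hypothetical two-state, two-action core process (numbers are made up).
# P[a][i, j] = p_ij(a): probability of moving from state i to state j under action a.
# R[a][i, j] = r_ij(a): immediate reward received for that transition.
P = {"stay": np.array([[0.9, 0.1],
                       [0.2, 0.8]]),
     "go":   np.array([[0.5, 0.5],
                       [0.1, 0.9]])}
R = {"stay": np.array([[ 0.0, -1.0],
                       [-1.0,  0.0]]),
     "go":   np.array([[-2.0,  5.0],
                       [-2.0,  5.0]])}

def rho(a):
    """Expected immediate reward rho_i(a) = sum_j r_ij(a) * p_ij(a), one entry per state."""
    return np.sum(R[a] * P[a], axis=1)

if __name__ == "__main__":
    for a in P:
        print(a, rho(a))
```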
Given a policy that maps states to actions, we can define the value of a state (given full observability) as:

$$v(i) = \max_{a \in A} \left\{ \rho_i(a) + \beta \sum_{k \in N} v(k) \, p_{ik}(a) \right\}$$

where 0 < β < 1 is a discount factor (this ensures convergence).

However, in a partially observable MDP, the progress of the core process is not known, but can only be inferred through a finite set of observations. The observations are captured with the following information:

- a finite set M = {1, ..., M} representing the possible observations
- a variable Y_t ∈ M representing the observation at time t
- a matrix Q = [q_ij], i ∈ N, j ∈ M specifying the probability of seeing observations in given states: Q(a) = [q_ij(a)], where q_ij(a) denotes the probability of observing j from state i when action a ∈ A has been taken
- a state distribution variable π(t) = {π_1(t), ..., π_N(t)}, where π_i(t) is the probability of X_t = i given the information about actions and observations
- an initial state distribution π(0).

At time t, the observation of the core process will be Y_t. If action a ∈ A is taken, we can define a function γ to determine Y_{t+1}. In particular, we define

$$\gamma(j \mid \pi(t), a) = \sum_{i \in N} q_{ij}(a) \sum_{k \in N} p_{ki}(a) \, \pi_k(t) \qquad (1)$$

as the probability that Y_{t+1} = j given that action a ∈ A is taken at time t and the state distribution at that time is π(t). To determine the state distribution variable π(t+1), we define the transformation T as follows:

$$\pi(t+1) = T(\pi(t) \mid j, a) = \{T_1(\pi(t) \mid j, a), \ldots, T_N(\pi(t) \mid j, a)\}$$

where

$$T_i(\pi(t) \mid j, a) = \frac{q_{ij}(a) \sum_{k \in N} p_{ki}(a) \, \pi_k(t)}{\sum_{l \in N} q_{lj}(a) \sum_{k \in N} p_{kl}(a) \, \pi_k(t)}, \quad \text{for } i \in N \qquad (2)$$
and where π(t) is the state distribution at time t, a ∈ A is the action taken at that time, resulting in observation j ∈ M.

Given a decision rule (or plan) that maps state distributions to actions, the utility of a state distribution can be computed by the value function:

$$V_P(\pi) = \max_{a \in A} \left\{ \pi \rho(a) + \beta \sum_{j \in M} V_P[T(\pi \mid j, a)] \, \gamma(j \mid \pi, a) \right\} \qquad (3)$$
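A minimal sketch of Equations 1 and 2: given the transition matrix P(a) and observation matrix Q(a) for an action, the observation probability γ(j | π, a) and the updated distribution T(π | j, a) are a few matrix operations. The function names, the matrix layout (rows indexed by state), and the two-state numbers below are our own illustrative choices.

```python
import numpy as np

def obs_prob(pi, P_a, Q_a, j):
    """gamma(j | pi, a): probability of observing j after taking action a in distribution pi (Eq. 1)."""
    predicted = pi @ P_a              # predicted state distribution: sum_k p_ki(a) pi_k
    return float(predicted @ Q_a[:, j])

def belief_update(pi, P_a, Q_a, j):
    """T(pi | j, a): state distribution after action a and observation j (Eq. 2)."""
    predicted = pi @ P_a
    unnormalized = Q_a[:, j] * predicted
    return unnormalized / unnormalized.sum()

if __name__ == "__main__":
    # Hypothetical two-state model: P_a is the transition matrix for some action a,
    # Q_a[i, j] is the probability of observing j from state i after that action.
    P_a = np.array([[0.7, 0.3],
                    [0.2, 0.8]])
    Q_a = np.array([[0.9, 0.1],
                    [0.2, 0.8]])
    pi = np.array([0.5, 0.5])
    print(obs_prob(pi, P_a, Q_a, 0), belief_update(pi, P_a, Q_a, 0))
```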
The optimal decision rule is that which achieves the supremum of the value function for all possible state distributions, or in other words, chooses the best possible action given the state distribution.

We will contrast the POMDP model with a non-observable MDP (NOMDP) in Section 5, so we briefly describe that here. A NOMDP is in essence a POMDP where observations are not used to make decisions. So actions are taken that maximize the expected utility given a state distribution, without taking into account the effects of observations. In this case we have a transformation function τ:

$$\pi(t+1) = \tau(\pi(t) \mid a) = \{\tau_1(\pi(t) \mid a), \ldots, \tau_N(\pi(t) \mid a)\}$$

where

$$\tau_i(\pi(t) \mid a) = \sum_{k \in N} p_{ki}(a) \, \pi_k(t).$$

The utility of a state distribution is given by the value function:

$$V_N(\pi) = \max_{a \in A} \left\{ \pi \rho(a) + \beta \, V_N[\tau(\pi \mid a)] \right\} \qquad (4)$$

As with the POMDP case, the optimal decision rule is that which achieves the supremum of the value function for all possible state distributions.
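In the NOMDP the distribution is simply pushed through the transition matrix, with no conditioning on an observation. Below is a small sketch of τ and of one application of Equation 4, with the successor value supplied by an arbitrary estimate V; the model numbers and names are again illustrative only.

```python
import numpy as np

def nomdp_propagate(pi, P_a):
    """tau(pi | a): the state distribution after action a when no observation is used."""
    return pi @ P_a          # tau_i = sum_k p_ki(a) pi_k

def nomdp_backup(pi, actions, P, rho, V, beta=0.95):
    """One application of Eq. 4, with V an estimate of successor-distribution values."""
    return max(float(pi @ rho[a]) + beta * V(nomdp_propagate(pi, P[a]))
               for a in actions)

if __name__ == "__main__":
    # Hypothetical two-state, one-action model, with a trivial zero estimate.
    P = {"a": np.array([[0.8, 0.2], [0.3, 0.7]])}
    rho = {"a": np.array([1.0, -1.0])}
    print(nomdp_backup(np.array([0.5, 0.5]), ["a"], P, rho, lambda b: 0.0))
```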
3 Incremental Observation

This section introduces a method for incrementally incorporating observations into a plan. The observations are incorporated if the cost of the action that provides the observation is outweighed by the gain in the overall plan utility. In the limiting case, the plan will be the plan prescribed by a full POMDP solution.
Figure 1: Expansion of Markov state distributions and actions with observations. The observations are the AND branches, and the actions are the OR branches.

As described in Section 2, given a set of states N, a set of actions A, and an initial distribution π(0) over N, the optimal action is the action that maximizes the expected utility of the resulting state distribution. In reality, to know the precise expected utility of the resulting states, a full POMDP solution is required. However, we will work with an estimate of the utility of a state distribution. How this estimate is calculated will be shown in Sections 4 and 5.

If the actions and observations are expanded in a search-tree form, they form an AND-OR tree (see Figure 1). The actions form the OR branches, since the optimal action is a choice among the set of actions. The observations form the AND branches, since the utility of an action is a sum of the utilities of the state distributions implied by each of the possible observations, each multiplied by the probability of that observation (see Equation 3). In contrast, if observations are ignored, then the tree reverts to an OR tree, the standard search tree (see Figure 2). The only branches are actions, which lead from one state distribution to another. In this case the utility of an action is simply the utility of the resulting state distribution.

Figure 2: Expansion of Markov state distributions and actions, with no observations.

Given the AND-OR tree of actions and observations and an evaluation function, with the goal of maximizing the utility of the chosen plan of actions, we can invert the utilities (making them disutilities). This now presents the goal of minimizing the disutility of the plan, so the AO* algorithm [Nilsson, 1980] can be used to expand the tree incrementally. Moreover, if the evaluation function gives an overestimate of the state utilities (thus an underestimate of the disutilities), the algorithm is admissible.
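One possible in-memory layout for this AND-OR tree keeps a node per state distribution, with OR branches grouped by action and, under each action, AND branches grouped by observation. The classes below are only an illustrative encoding of that structure; the field names are ours, not the paper's.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ObservationBranch:          # AND branch: one branch per possible observation
    observation: int
    probability: float            # gamma(j | pi, a)
    child: "BeliefNode"

@dataclass
class ActionBranch:               # OR branch: one branch per candidate action
    action: str
    immediate_reward: float       # pi . rho(a)
    outcomes: List[ObservationBranch] = field(default_factory=list)

@dataclass
class BeliefNode:                 # a state distribution pi in the search tree
    belief: List[float]
    estimate: float               # heuristic value, e.g. the FOMDP-based estimate of Section 4
    expansions: List[ActionBranch] = field(default_factory=list)
    best_action: Optional[str] = None
```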
The AO* algorithm still involves a non-deterministic choice of which AND branch along the current best path to expand further (since that does not fall out directly from the evaluations, unlike in A*). We have chosen in our approach to expand AND branches to equal depths, to avoid long searches down paths that will do little to change the overall utility.[1] However, from a purely theoretical standpoint, a stochastic choice or any other strategy would be equally valid in general.

[1] An evaluation estimate that looks too good at an AND node following a bad evaluation at an OR node causes more trouble in a depth-first expansion, since the search will continue to expand that AND branch without changing the overall utility, and thus the best path.
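The sketch below gives the flavor of the expansion: a depth-limited AND-OR lookahead that expands AND branches to equal depths and backs up values through the OR (max) and AND (probability-weighted sum) levels, with a heuristic at the fringe. It expands every action rather than maintaining AO*'s open list and best-partial-solution bookkeeping, so it illustrates the tree and the backup rather than the full algorithm of [Nilsson, 1980]; all names and the model format are assumptions of ours.

```python
import numpy as np

def heuristic(pi, v_fomdp):
    """Fringe estimate: the FOMDP-based value V_F(pi) = sum_i pi_i v(i) (see Section 4)."""
    return float(pi @ v_fomdp)

def expand(pi, P, Q, rho, v_fomdp, depth, beta=0.95):
    """Depth-limited AND-OR lookahead; returns (backed-up value, best first action)."""
    if depth == 0:
        return heuristic(pi, v_fomdp), None
    best_value, best_action = -np.inf, None
    for a in P:                                       # OR level: choose among actions
        predicted = pi @ P[a]
        value = float(pi @ rho[a])
        for j in range(Q[a].shape[1]):                # AND level: sum over observations
            gamma_j = float(predicted @ Q[a][:, j])   # Eq. 1
            if gamma_j > 0.0:
                child = Q[a][:, j] * predicted / gamma_j    # Eq. 2
                child_value, _ = expand(child, P, Q, rho, v_fomdp, depth - 1, beta)
                value += beta * gamma_j * child_value
        if value > best_value:
            best_value, best_action = value, a
    return best_value, best_action

if __name__ == "__main__":
    # Hypothetical two-state, two-action, two-observation model.
    P = {"a1": np.array([[0.9, 0.1], [0.4, 0.6]]),
         "a2": np.array([[0.5, 0.5], [0.2, 0.8]])}
    Q = {"a1": np.array([[0.8, 0.2], [0.3, 0.7]]),
         "a2": np.array([[0.8, 0.2], [0.3, 0.7]])}
    rho = {"a1": np.array([0.0, 1.0]), "a2": np.array([0.5, 0.5])}
    v = np.array([8.0, 12.0])                         # stand-in for precomputed FOMDP values
    print(expand(np.array([0.6, 0.4]), P, Q, rho, v, depth=2))
```

Searching all AND branches to equal depth matches the choice discussed above, at the price of ignoring AO*'s pruning of non-best partial solutions.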
4 FOMDP-based Evaluation

In this section, we show that a FOMDP solution to the core process can be used as an estimate of the POMDP value function, and in particular that it provides an overestimate of the value function (thus ensuring admissibility of the AO* algorithm described above).

We assume that a solution exists to an underlying FOMDP model of the problem domain. The solution describes the optimal policy and expected state values if full observation were possible. Since tractable solution methods exist for FOMDPs, and this computation can take place off-line, this assumption appears reasonable to us; see Section 6 for how it may be relaxed. In the domains discussed, this corresponds to an off-line computation of the optimal treatment for each possible disease state, or of the optimal robot action at each distinguished location in its known environment.

Define a value function of a state distribution as a weighted sum of the FOMDP state values:

$$V_F(\pi) = \sum_{i \in N} \pi_i \, v(i)$$

(this is the approach suggested but not used in [Simmons and Koenig, 1995]). Also, define a selection function σ_i(π) = {0, ..., π_i, ..., 0} that constructs a vector with all but the i-th element set to 0. We can see that

$$T(\pi \mid j, a) = \sum_{i \in N} \sigma_i(T(\pi \mid j, a)).$$
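Putting the base solution and the estimate together: the FOMDP values v(i) can be computed off-line by standard value iteration on the core process, after which V_F(π) is a dot product. The sketch below uses a hypothetical two-state model; the function names and numbers are ours.

```python
import numpy as np

def fomdp_values(P, R, beta=0.95, iterations=1000):
    """Value iteration on the core process: v(i) = max_a [rho_i(a) + beta * sum_k p_ik(a) v(k)]."""
    n = next(iter(P.values())).shape[0]
    v = np.zeros(n)
    for _ in range(iterations):
        v = np.max([np.sum(R[a] * P[a], axis=1) + beta * P[a] @ v for a in P], axis=0)
    return v

def fomdp_estimate(pi, v):
    """V_F(pi) = sum_i pi_i v(i): the FOMDP-based overestimate of the POMDP value."""
    return float(pi @ v)

if __name__ == "__main__":
    # Hypothetical two-state, two-action core process.
    P = {"stay": np.array([[0.9, 0.1], [0.2, 0.8]]),
         "go":   np.array([[0.5, 0.5], [0.1, 0.9]])}
    R = {"stay": np.array([[0.0, -1.0], [-1.0, 0.0]]),
         "go":   np.array([[-2.0, 5.0], [-2.0, 5.0]])}
    v = fomdp_values(P, R)
    print(v, fomdp_estimate(np.array([0.6, 0.4]), v))
```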
Now, using the finite-horizon formulation of the value functions:

$$V_P^0(\pi) = \pi \rho(0)$$
$$V_P^n(\pi) = \max_{a \in A} \left\{ \pi \rho(a) + \beta \sum_{j \in M} V_P^{n-1}[T(\pi \mid j, a)] \, \gamma(j \mid \pi, a) \right\}$$
$$v^0(i) = \rho_i(0)$$
$$v^n(i) = \max_{a \in A} \left\{ \rho_i(a) + \beta \sum_{k \in N} v^{n-1}(k) \, p_{ik}(a) \right\}$$
$$V_F^n(\pi) = \sum_{i \in N} \pi_i \, v^n(i).$$

Note that

$$V_F^n(\pi) = \sum_{i \in N} V_F^n(\sigma_i(\pi)) \qquad (5)$$

and that $V_P^0(\pi) = V_F^0(\pi)$. Now, we can show $V_P^n(\pi) \le V_F^n(\pi)$ inductively. Assuming $V_P^{n-1}(\pi) \le V_F^{n-1}(\pi)$, the terms can be rearranged using Equations 1, 2, and 5 to show:

$$\begin{aligned}
V_P^n(\pi) &= \max_{a \in A} \Big\{ \pi \rho(a) + \beta \sum_{j \in M} V_P^{n-1}[T(\pi \mid j, a)] \, \gamma(j \mid \pi, a) \Big\} \\
&\le \max_{a \in A} \Big\{ \pi \rho(a) + \beta \sum_{j \in M} V_F^{n-1}[T(\pi \mid j, a)] \, \gamma(j \mid \pi, a) \Big\} \\
&= \max_{a \in A} \Big\{ \pi \rho(a) + \beta \sum_{j \in M} \sum_{k \in N} V_F^{n-1}[\sigma_k(T(\pi \mid j, a))] \, \gamma(j \mid \pi, a) \Big\} \\
&= \max_{a \in A} \Big\{ \pi \rho(a) + \beta \sum_{j \in M} \sum_{k \in N} v^{n-1}(k) \, T_k(\pi \mid j, a) \, \gamma(j \mid \pi, a) \Big\} \\
&= \max_{a \in A} \Big\{ \pi \rho(a) + \beta \sum_{j \in M} \sum_{k \in N} v^{n-1}(k) \, q_{kj}(a) \sum_{i \in N} p_{ik}(a) \, \pi_i \Big\} \\
&= \max_{a \in A} \Big\{ \pi \rho(a) + \beta \sum_{i \in N} \pi_i \sum_{k \in N} v^{n-1}(k) \, p_{ik}(a) \Big\} \\
&= \max_{a \in A} \sum_{i \in N} \pi_i \Big\{ \rho_i(a) + \beta \sum_{k \in N} v^{n-1}(k) \, p_{ik}(a) \Big\} \\
&\le \sum_{i \in N} \pi_i \max_{a \in A} \Big\{ \rho_i(a) + \beta \sum_{k \in N} v^{n-1}(k) \, p_{ik}(a) \Big\} \\
&= \sum_{i \in N} \pi_i \, v^n(i) \\
&= V_F^n(\pi).
\end{aligned}$$

Then we can see that

$$V_P(\pi) = \lim_{n \to \infty} V_P^n(\pi) \le \lim_{n \to \infty} V_F^n(\pi) = V_F(\pi).$$

So we have shown that the FOMDP-based value function can be used as an overestimate of the POMDP value function.
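The bound can also be checked numerically. The sketch below draws a small random model, computes the finite-horizon FOMDP values and the exact finite-horizon POMDP value by enumerating observation sequences, and verifies V_P^n(π) ≤ V_F^n(π). Both horizons start from a zero value, a simplification of the base case above; the random model is, of course, not the example used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, O, beta, horizon = 3, 2, 2, 0.9, 4

# Random model: each row of P[a] and Q[a] is a probability distribution.
P = [rng.dirichlet(np.ones(S), size=S) for _ in range(A)]
Q = [rng.dirichlet(np.ones(O), size=S) for _ in range(A)]
rho = [rng.uniform(-1, 1, size=S) for _ in range(A)]   # expected immediate rewards rho_i(a)

def v_fomdp(n):
    """Finite-horizon FOMDP values v^n(i), taking v^0 = 0."""
    v = np.zeros(S)
    for _ in range(n):
        v = np.max([rho[a] + beta * P[a] @ v for a in range(A)], axis=0)
    return v

def v_pomdp(pi, n):
    """Finite-horizon POMDP value V_P^n(pi), taking V_P^0 = 0."""
    if n == 0:
        return 0.0
    best = -np.inf
    for a in range(A):
        predicted = pi @ P[a]
        total = float(pi @ rho[a])
        for j in range(O):
            gamma = float(predicted @ Q[a][:, j])
            if gamma > 0.0:
                total += beta * gamma * v_pomdp(Q[a][:, j] * predicted / gamma, n - 1)
        best = max(best, total)
    return best

pi0 = rng.dirichlet(np.ones(S))
vp, vf = v_pomdp(pi0, horizon), float(pi0 @ v_fomdp(horizon))
print(vp, vf)
assert vp <= vf + 1e-9          # the FOMDP-based value is an overestimate
```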
To illustrate the use of the estimate, consider the example in Figure 3. The FOMDP value of each state is shown, as well as the preferred action to take according to the FOMDP policy. Each action is assumed to have a utility of -5, except in the final states s4 and s5, which function as sinks, with all actions from each of these states transitioning only to that state with no cost.[2] Given that the diagnostic action Dx has sensitivity 0.98 and specificity 0.97 for discriminating between states s2 and s3, we set q21(Dx) = 0.98 (the sensitivity of Dx), q22(Dx) = 0.02, q32(Dx) = 0.97 (the specificity of Dx), q31(Dx) = 0.03, and qij(Dx) = 0.5 for i ∈ {1, 4, 5}, j ∈ {1, 2}. The initial state distribution is π(0) = {1, 0, 0, 0, 0}.

[2] For clarity, only the most relevant actions are shown in the figure.

[Figure 3 shows five states with FOMDP values s1: -34.35, s2: -44.55, s3: -14.85, and the sink states s4: 0 and s5: -100, connected by the actions Rx1, Rx2, Rx3, and Dx with their transition probabilities.]

Figure 3: Example with FOMDP state evaluations. The action preferred by the FOMDP solution is marked with a bold line.

The steps of expansion are as follows (the estimated cost sums the discounted weight of the expanded path and the discounted FOMDP-based estimate):

    best path                                                      est. cost
    π(0)                                                           -34.35
    π(0) Rx1 → π(1) = {0, 0.5, 0.5, 0, 0}                          -34.40
    π(0) Rx1 π(1) Dx(1) → π(2a) = {0, 0.97, 0.03, 0, 0}            -52.74
    π(0) Rx1 π(1) Dx(2) → π(2b) = {0, 0.02, 0.98, 0, 0}            -25.09
    π(0) Rx1 π(1) Dx(1) π(2a) Rx3 → π(3a) = {0, 0, 0, 0.6, 0.4}    -53.66
    π(0) Rx1 π(1) Dx(2) π(2b) Rx2 → π(3b) = {0, 0, 0, 0.88, 0.12}  -26.11

to reach the POMDP true value of -39.88. In the steps, Dx(i) refers to action Dx with observation i.
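The distributions in the table follow from Equation 2. The sketch below reproduces π(2a) and π(2b) from π(1), under the assumption (ours, consistent with the table but not stated explicitly above) that Dx leaves the underlying state unchanged, i.e., that its transition matrix is the identity.

```python
import numpy as np

# States s1..s5 (indices 0-4).  Observation matrix for the diagnostic action Dx
# over observations {1, 2}, from the text: sensitivity 0.98 for s2, specificity
# 0.97 for s3, and 0.5/0.5 for the remaining states.
Q_dx = np.array([[0.50, 0.50],
                 [0.98, 0.02],
                 [0.03, 0.97],
                 [0.50, 0.50],
                 [0.50, 0.50]])
P_dx = np.eye(5)      # assumption: Dx does not change the underlying state

def belief_update(pi, P_a, Q_a, j):
    """T(pi | j, a) from Eq. 2."""
    predicted = pi @ P_a
    unnormalized = Q_a[:, j] * predicted
    return unnormalized / unnormalized.sum()

pi1 = np.array([0.0, 0.5, 0.5, 0.0, 0.0])      # pi(1): after Rx1 from s1
pi2a = belief_update(pi1, P_dx, Q_dx, 0)        # observation 1 -> roughly {0, 0.97, 0.03, 0, 0}
pi2b = belief_update(pi1, P_dx, Q_dx, 1)        # observation 2 -> roughly {0, 0.02, 0.98, 0, 0}
print(np.round(pi2a, 2), np.round(pi2b, 2))
```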
At any point in expanding the AND-OR tree of actions and observations, there will be a partially expanded search tree, with estimates produced from the FOMDP model. In essence, the FOMDP model "fills in" the unexplored area of the search space with its precomputed policy (see Figure 4).

Figure 4: Incremental observation using an FOMDP-based evaluation.

In fact, the FOMDP solution can be used if necessary to decide on an action in the absence of an expanded POMDP path, using a weighting to decide on the best action, as in [Simmons and Koenig, 1995]. As the AND-OR tree is expanded, the best action is increasingly influenced by the observations and decreasingly by the FOMDP, thus incrementally shifting from a FOMDP model to a POMDP model.
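One plausible reading of that weighting (ours, not necessarily the exact scheme of [Simmons and Koenig, 1995]) is to score each action by its one-step FOMDP value averaged over the current state distribution, as sketched below.

```python
import numpy as np

def fomdp_fallback_action(pi, P, rho, v, beta=0.95):
    """Pick the action with the best FOMDP one-step value, weighted by the current
    distribution: argmax_a sum_i pi_i [rho_i(a) + beta * sum_k p_ik(a) v(k)]."""
    return max(P, key=lambda a: float(pi @ (rho[a] + beta * P[a] @ v)))

if __name__ == "__main__":
    # Hypothetical model and values, for illustration only.
    P = {"a1": np.array([[0.9, 0.1], [0.4, 0.6]]),
         "a2": np.array([[0.5, 0.5], [0.2, 0.8]])}
    rho = {"a1": np.array([0.0, 1.0]), "a2": np.array([0.5, 0.5])}
    v = np.array([2.0, 3.0])
    print(fomdp_fallback_action(np.array([0.7, 0.3]), P, rho, v))
```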
5 Iterative State Evaluation

The drawback of using FOMDP state estimates is that they can be overly optimistic, since they choose the best action for each state rather than considering the best action for an entire state distribution. If there are two possible states requiring nearly contradictory actions, the FOMDP evaluation would provide an estimate based on taking the best action in each state, although, given the lack of information, this could well lead to a suboptimal solution. One possible estimate that is more conservative is the NOMDP value function, since it considers the best action given a state distribution. In this section we show how this estimate relates to the others and how to compute it efficiently.

Recall the definition of the NOMDP value function (Equation 4). Using the same inductive technique used for comparing the FOMDP and POMDP value functions, we can see that

$$V_N(\pi) \le V_P(\pi) \le V_F(\pi)$$

for any state distribution π. This means that V_N underestimates the POMDP utility of a state distribution, so the admissibility of the AO* algorithm is lost. However, remember that the entire reason for incrementally computing a POMDP decision rule is that computing the entire decision rule is intractable. We do not expect the POMDP incremental procedure to terminate in those cases where a standard POMDP solution procedure is ineffective, so the more important goal is to arrive at a good partial decision rule quickly. A good evaluation function, albeit inadmissible, will still achieve that goal better than an admissible but inaccurate evaluation function.

Computing the NOMDP value function is still non-trivial, but its recursive definition (Equation 4) leads directly to the refinement method: the NOMDP value function is expanded to increasing depth, and the FOMDP value function is used as an estimate beyond that depth. The NOMDP problem omits observations, so it becomes a problem of choosing an action at each step that maximizes the expected utility of the resulting path. Expanding the NOMDP actions thus produces an OR tree, as described in Section 3. Since the FOMDP-based value function provides an overestimate of the NOMDP value function, we can again invert utilities to disutilities and use an IDA* algorithm [Korf, 1990] to iteratively approximate the NOMDP value function. This gives us not only an estimate of the NOMDP value function, but also a suggested best path along which to extend the POMDP decision rule.
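A sketch of one refinement step: evaluate the NOMDP value to a fixed depth, using the FOMDP-based estimate at the frontier, and deepen on later iterations. The IDA*-style disutility thresholds of [Korf, 1990] are omitted, so this shows only the depth-limited evaluation and the action sequence it suggests; all names and the model format are our own.

```python
import numpy as np

def nomdp_refine(pi, P, rho, v_fomdp, depth, beta=0.95):
    """Depth-limited NOMDP evaluation: choose the action maximizing the expected
    utility of the averaged successor distribution, using the FOMDP-based estimate
    V_F beyond the given depth.  Returns (value, suggested action sequence)."""
    if depth == 0:
        return float(pi @ v_fomdp), []
    best_value, best_path = -np.inf, []
    for a in P:
        successor = pi @ P[a]                       # tau(pi | a): no observation used
        value, path = nomdp_refine(successor, P, rho, v_fomdp, depth - 1, beta)
        value = float(pi @ rho[a]) + beta * value
        if value > best_value:
            best_value, best_path = value, [a] + path
    return best_value, best_path

if __name__ == "__main__":
    # Hypothetical two-state, two-action model.
    P = {"a1": np.array([[0.9, 0.1], [0.4, 0.6]]),
         "a2": np.array([[0.5, 0.5], [0.2, 0.8]])}
    rho = {"a1": np.array([0.0, 1.0]), "a2": np.array([0.5, 0.5])}
    v = np.zeros(2)                                 # FOMDP values by value iteration
    for _ in range(2000):
        v = np.max([rho[a] + 0.95 * P[a] @ v for a in P], axis=0)
    # With the converged FOMDP values at the frontier, deeper evaluation can only
    # lower the estimate, illustrating the anytime behavior discussed in Section 6.
    for d in range(4):
        print(d, nomdp_refine(np.array([0.7, 0.3]), P, rho, v, d))
```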
To illustrate the use of iterative state evaluation, consider again the example in Figure 3. We perform two evaluation iterations per expansion of the POMDP search. The steps of expansion are as follows (the path suggested by iterative state evaluation is in parentheses, and the estimated cost is the sum of the discounted cost of the expanded POMDP path, the discounted cost of the NOMDP path extension, and the discounted FOMDP-based estimate):

    best path                                                      est. cost
    π(0)                                                           -34.35
    π(0) (Rx1)                                                     -34.40
    π(0) (Rx1 Rx3)                                                 -49.55
    π(0) Rx1 → π(1) = {0, 0.5, 0.5, 0, 0}                          -34.40
    π(0) Rx1 π(1) (Dx)                                             -39.06
    π(0) Rx1 π(1) (Dx Dx)                                          -43.67
    π(0) Rx1 π(1) Dx(1) → π(2a) = {0, 0.97, 0.03, 0, 0}            -52.74
    π(0) Rx1 π(1) Dx(2) → π(2b) = {0, 0.02, 0.98, 0, 0}            -25.09
    π(0) Rx1 π(1) Dx(1) π(2a) (Rx3)                                -53.66
    π(0) Rx1 π(1) Dx(2) π(2b) (Rx2)                                -26.11
    π(0) Rx1 π(1) Dx(1) π(2a) Rx3 → π(3a) = {0, 0, 0, 0.6, 0.4}    -53.66
    π(0) Rx1 π(1) Dx(2) π(2b) Rx2 → π(3b) = {0, 0, 0, 0.88, 0.12}  -26.11

to again reach the POMDP true value of -39.88.

Now we have a method for iteratively refining state evaluation estimates, giving both direction to the search for expanding the POMDP decision rule and a conservative extension to the current decision rule (see Figure 5). The state evaluation estimates move from FOMDP-based to NOMDP to POMDP, and the plan expansion incorporates increasing information from observations.

Figure 5: Incremental observation using iterative state evaluation.
6 Discussion

We have shown how a POMDP decision rule can be incrementally constructed using a FOMDP solution of the core process. In addition, the FOMDP-based evaluation can be iteratively refined towards a more conservative NOMDP solution, providing at the same time an extension to the POMDP decision rule. As the POMDP decision rule is expanded, the plan incorporates increasing amounts of information from observations.

As described above, the incremental observation method is implemented as an AO* algorithm, while the iterative state evaluation method is implemented as an IDA* algorithm. In its current form, the two methods run synchronously and interleaved (some plan expansion followed by some state evaluation). Since any control of the interleaving is arbitrary, an asynchronous version is planned. At this point the program works on small examples such as the one described in this paper, reproducing the results shown; more complicated examples will be tested to explore the incremental change in utility over time, as well as to compare the pure FOMDP-based evaluation with iterative state evaluation. [such results will be included in a final version]

Since the FOMDP evaluations are overestimates, these evaluations need to be refined only at fringe nodes of the best path, since refinement can only serve to make a path look worse in relation to others. Thus ignoring paths with high FOMDP-estimated disutilities will never lead to a suboptimal plan. This focuses the iterative state evaluation on those parts of the tree where it will expand the current best path. Iterative state evaluation is a true anytime algorithm [Dean and Boddy, 1988], in that each refinement monotonically decreases the estimated utility measure, with the limit being the true NOMDP value. The incremental observation algorithm does not share this strict property, since it relies on estimated utilities.

An assumption of a full FOMDP solution was made above in Section 4. In fact, given the incremental nature of the approach, an incremental approximation to the FOMDP values would also be usable (just adding yet another layer of incremental computation). A guaranteed underestimating approximation (in terms of disutility, i.e., one that still overestimates the state utilities) would preserve the properties described in the paper.

We expect the contribution of the iterative evaluation method to depend on the reliability of the observations. If the observations are reliable, in the sense that they accurately discriminate among states, then the FOMDP-based estimate may prove good enough, since the planning agent will be able to determine its state with reasonable certainty. If, however, the observations are unable to discriminate accurately among states, then the more conservative iterative evaluation will likely be better, since it assumes a lack of observational information.
7 Acknowledgements

This research was supported by a VA Medical Informatics Fellowship, administered through the Philadelphia VAMC.
References

[Beck and Pauker, 1983] J. R. Beck and S. G. Pauker. The Markov process in medical prognosis. Medical Decision Making, 3(4):419–458, 1983.

[Boutilier and Dearden, 1994] C. Boutilier and R. Dearden. Using abstractions for decision-theoretic planning with time constraints. In Proceedings of AAAI-94, 1994.

[Cassandra et al., 1994] A. R. Cassandra, L. P. Kaelbling, and M. L. Littman. Acting optimally in partially observable stochastic domains. In Proceedings of AAAI-94, 1994.

[Cowen et al., 1994] M. E. Cowen, M. Chartrand, and W. F. Weitzel. A Markov model of the natural history of prostate cancer. Journal of Clinical Epidemiology, pages 3–21, 1994.

[Curran and Kyriakopolous, 1993] A. Curran and K. J. Kyriakopolous. Sensor-based localization for wheeled mobile robots. In Proceedings, IEEE International Conference on Robotics and Automation, 1993.

[Dean and Boddy, 1988] T. Dean and M. Boddy. An analysis of time-dependent planning. In Proceedings of AAAI-88. AAAI, 1988.

[Dean et al., 1995] T. Dean, L. P. Kaelbling, J. Kirman, and A. Nicholson. Planning under time constraints in stochastic domains. Artificial Intelligence, 76:35–74, 1995.

[Fahs et al., 1992] M. C. Fahs, J. Mandelblatt, C. Schechter, and C. Muller. Cost effectiveness of cervical cancer screening for the elderly. Annals of Internal Medicine, 117(6):520–527, 1992.

[Hwang and Ahuja, 1992] Y. K. Hwang and N. Ahuja. Gross motion planning – a survey. ACM Computing Surveys, 24(3), September 1992.

[Kedem and Sharir, 1990] K. Kedem and M. Sharir. An efficient motion planning algorithm for a convex rigid polygonal object in 2-dimensional polygonal space. Discrete Computational Geometry, 5:43–75, 1990.

[Korf, 1990] R. E. Korf. Real-time heuristic search. Artificial Intelligence, 42(2-3):189–211, 1990.

[Lovejoy, 1991] W. S. Lovejoy. A survey of algorithmic methods for partially observed Markov decision processes. Annals of Operations Research, 28:47–65, 1991.

[Mandelbaum and Mintz, 1996] R. Mandelbaum and M. Mintz. A confidence set approach to mobile robot localization. In Proceedings of the 1996 Conference on Multisensor Fusion and Integration, 1996.

[Monahan, 1982] G. E. Monahan. A survey of partially observable Markov decision processes: Theory, models, and algorithms. Management Science, 28(1):1–16, 1982.

[Nilsson, 1980] N. J. Nilsson. Principles of Artificial Intelligence. Tioga Publishing Company, 1980.

[O'Dunlaing and Yap, 1985] C. O'Dunlaing and C. K. Yap. A "retraction" method for planning the motion of a disc. Journal of Algorithms, 6:104–111, 1985.

[Parr and Russell, 1995] R. Parr and S. Russell. Approximating optimal policies for partially observable stochastic domains. In Proceedings of IJCAI-95, 1995.

[Simmons and Koenig, 1995] R. Simmons and S. Koenig. Probabilistic robot navigation in partially observable environments. In Proceedings of IJCAI-95, 1995.

[White, 1991] C. C. White, III. A survey of solution techniques for the partially observed Markov decision process. Annals of Operations Research, 32:215–230, 1991.