
An Alternative Formulation of Dynamic Programming Updates for POMDPs

Weihong Zhang
Department of Computer Science
Washington University, PO Box 1045
St. Louis, MO 63130-4899
[email protected]

Nevin L. Zhang
Department of Computer Science
Hong Kong University of Science and Technology
Clear Water Bay, Kowloon, Hong Kong
[email protected]

1 Introduction

Partially Observable Markov Decision Processes (POMDPs) are a general model for sequential decision problems where the effects of actions are nondeterministic and the state is not known with certainty. However, solving POMDPs is computationally difficult: existing algorithms can solve only problems with small state spaces.

Value iteration (VI) is a standard method for solving POMDPs (Sondik 1971, Puterman 1990). It starts from an arbitrary initial value function and performs a sequence of dynamic programming (DP) updates. A DP update computes a finite representation of the value function at the next step from that at the current step. Value iteration terminates when the difference between two consecutive value functions falls below a predetermined threshold.

To compute the value function at the next step, several existing algorithms construct an intermediate value function for each action, i.e. the Q-function (Monahan 1982, Cheng 1988, Kaelbling et al. 1998). It is known that a Q-function can be broken into smaller fractions, each corresponding to one possible observation, and that each of these fractions has a finite representation. In this paper, we formulate a DP update as a procedure that computes the finite representations of the fractions of value functions for all action and observation pairs. Based on this formulation, we describe a value iteration algorithm. We argue that the alternative algorithm scales better than the standard one, and empirical studies provide evidence supporting this argument.

2 POMDPs

In a POMDP, the interaction between an agent and its environment proceeds as follows. At any point in time, the agent is in one of a finite set of states S. It does not know the identity of the state; rather, it receives one observation from a finite set Z. When performing an action a from a finite set A in state s, the agent receives an immediate reward r(s,a). The environment evolves into a next state s' with probability P(s'|s,a), and the agent receives an observation z with probability P(z|s',a). The process then repeats.
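For concreteness, the model can be written down as a small data structure. The following Python sketch is our own illustration (the field names are hypothetical, not from the paper); it simply records the ingredients S, A, Z, P(s'|s,a), P(z|s',a), r(s,a) and the discount factor λ used below.

```python
# A minimal sketch of a POMDP specification (illustration only; field names
# are hypothetical).  States are assumed to be indexed 0..|S|-1.
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class POMDP:
    states: List[int]                               # S
    actions: List[str]                              # A
    observations: List[str]                         # Z
    trans: Dict[Tuple[int, str, int], float]        # P(s'|s,a), keyed by (s, a, s')
    obs: Dict[Tuple[str, int, str], float]          # P(z|s',a), keyed by (z, s', a)
    reward: Dict[Tuple[int, str], float]            # r(s,a),    keyed by (s, a)
    discount: float                                 # discount factor, 0 <= discount < 1
```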

2.1 Policies and value functions

In a partially observable domain, all the information useful for the agent's decision making can be summarized by a probability distribution over the state space (Astrom 1965). The probability distribution is called a belief state and denoted by b. The set of all possible belief states is called the belief space and denoted by B. A policy prescribes an action for each possible belief state; in other words, it is a mapping from B to A. Associated with a policy π is its value function V^π. For each belief state b, V^π(b) is the expected total discounted reward that the agent receives by following the policy starting from b, i.e.

    V^π(b) = E_{π,b} [ Σ_{t=0}^{∞} λ^t r_t ],

where r_t is the reward received at time t and λ (0 ≤ λ < 1) is the discount factor. It is known that there exists a policy π* such that V^{π*}(b) ≥ V^π(b) for any other policy π and any belief state b (Puterman 1990). Such a policy is called an optimal policy, and its value function is called the optimal value function, denoted by V*. For a positive number ε, a policy π is ε-optimal if V^π(b) + ε ≥ V*(b) for every belief state b ∈ B.

2.2 Value iteration

To explain value iteration, we need to consider how the belief state evolves over time. Let b be the current belief state. The belief state at the next point in time is determined by the current belief state b, the current action a, and the next observation z. We denote it by b_a^z. For any state s', b_a^z(s') is given by

    b_a^z(s') = k Σ_s P(s'|s,a) P(z|s',a) b(s),

where k is a renormalization constant. Value iteration is an algorithm for finding ε-optimal policies. It starts with an initial value function V_0 and computes the value function at the next step as follows:

    V_{n+1}(b) = max_a [ r(b,a) + λ Σ_z P(z|b,a) V_n(b_a^z) ],    (1)

where r(b,a) = Σ_s r(s,a) b(s) is the expected immediate reward for taking action a in belief state b. The step of computing V_{n+1} from V_n is referred to as a DP update. Value iteration terminates when the Bellman residual max_b |V_n(b) - V_{n-1}(b)| falls below the predetermined threshold ε(1-λ)/2λ.

However, since there are uncountably many belief states in the belief space, DP updates (and therefore value iteration) cannot be carried out explicitly. Fortunately, value functions can be represented by finite sets of |S|-dimensional vectors. The minimal representation of a value function is the representation (set of vectors) that represents the same value function with the fewest vectors. An implicit DP update refers to the process of obtaining the minimal representation 𝒱_{n+1} of V_{n+1} from that of V_n. For convenience, we use lower-case Greek letters α and β to refer to vectors and script letters 𝒱 and 𝒰 to refer to sets of vectors. In contrast, the upper-case letters V and U always refer to value functions over the belief space B.
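As an illustration of the belief update and of Equation (1) evaluated at a single belief state, here is a short Python sketch. It assumes the hypothetical POMDP structure sketched in Section 2 and a callable V_n that returns the previous value at a belief; it is not the implicit DP update discussed below, which works on vector representations rather than individual belief states.

```python
# Sketch: belief update b_a^z and the right-hand side of Equation (1) at one
# belief state (illustration only; assumes the POMDP sketch above).
from typing import Callable, Dict, Optional, Tuple

Belief = Dict[int, float]          # b(s) for each state s

def belief_update(m: POMDP, b: Belief, a: str, z: str) -> Tuple[Optional[Belief], float]:
    """Return (b_a^z, P(z|b,a)), where b_a^z(s') = k * sum_s P(s'|s,a) P(z|s',a) b(s)."""
    unnorm = {s2: m.obs.get((z, s2, a), 0.0)
                  * sum(m.trans.get((s, a, s2), 0.0) * b[s] for s in m.states)
              for s2 in m.states}
    p_z = sum(unnorm.values())     # P(z|b,a); equals 1/k in the text's notation
    if p_z == 0.0:
        return None, 0.0           # observation z impossible under (b, a)
    return {s2: p / p_z for s2, p in unnorm.items()}, p_z

def backup_at(m: POMDP, b: Belief, V_n: Callable[[Belief], float]) -> float:
    """V_{n+1}(b) = max_a [ r(b,a) + lambda * sum_z P(z|b,a) * V_n(b_a^z) ]."""
    best = float("-inf")
    for a in m.actions:
        value = sum(m.reward.get((s, a), 0.0) * b[s] for s in m.states)   # r(b,a)
        for z in m.observations:
            b_az, p_z = belief_update(m, b, a, z)
            if p_z > 0.0:
                value += m.discount * p_z * V_n(b_az)
        best = max(best, value)
    return best
```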

3 An Alternative DP Formulation

In this section, we propose an alternative way to formulate the DP update. Next, we show how to carry out the DP update implicitly. Finally, we briefly discuss the stopping criterion when the DP update is used in value iteration.

3.1 Introducing a DP Formulation

Value function V_{n+1} in Equation (1) can be split into combinations of simpler value functions:

    V^{a,z}_{n+1}(b) = λ P(z|b,a) V_n(b_a^z)            (2)
    V^{a}_{n+1}(b)   = r(b,a) + Σ_z V^{a,z}_{n+1}(b)    (3)
    V_{n+1}(b)       = max_a V^{a}_{n+1}(b)             (4)

These notations can be interpreted as follows. The value function V^{a,z}_{n+1} denotes one portion of the future reward: it is the discounted previous value at the next belief state b_a^z, weighted by the probability of observing z if action a is executed. We refer to such functions as a/z fractions, or simply fractions. The value function V^{a}_{n+1} is the expected reward the agent receives if it starts from b, performs action a at the first step, and behaves according to V_n afterwards; it is the sum of the immediate fraction r(b,a) and all the fractions V^{a,z}_{n+1} for the performed action a. The last equation says that the value function V_{n+1} is the maximization of V^{a}_{n+1} over all actions.

It can be seen that the fractions {V^{a,z}_{n+1} | a ∈ A, z ∈ Z} are central to a DP step: once they are known, the next-step value function V_{n+1} can be computed by combining them. We are interested in the relationship between the fractions of two consecutive value functions V_n and V_{n+1}. For this purpose, we formulate the alternative DP step as follows: given the fractions {V^{a,z}_n} at step n, compute the fractions {V^{a,z}_{n+1}} for step n+1.
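As a consistency check (a worked step we add here; it is implicit in the paper), substituting (2) and (3) into (4) recovers the standard DP update (1):

```latex
V_{n+1}(b) = \max_a V^{a}_{n+1}(b)
           = \max_a \Big[\, r(b,a) + \sum_z V^{a,z}_{n+1}(b) \,\Big]
           = \max_a \Big[\, r(b,a) + \lambda \sum_z P(z \mid b,a)\, V_n(b_a^z) \,\Big],
```

which is exactly the right-hand side of Equation (1).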

3.2 Implicit DP Updates

It is known that each fraction V^{a,z}_{n+1} can be represented by a set of vectors if each V^{a',z'}_n can. Therefore, the implicit version of our DP update can be stated as: given an array of sets representing the fractions V^{a',z'}_n, compute a representation of V^{a,z}_{n+1} for each a/z pair. The following three operators are useful when we show how to carry out this DP update.

• The matrix-multiplication operator ⊗ takes an a/z pair and a set 𝒱 of vectors as input and returns a new set. The a/z pair determines a |S| × |S| matrix P_{a,z} whose entry at (s,s') is the probability P(s'|s,a)P(z|s',a). The resulting set is denoted by 𝒱 ⊗ P_{a,z} and defined as

      𝒱 ⊗ P_{a,z} = { α' | α'(s) = λ Σ_{s'} α(s') P(s'|s,a) P(z|s',a), α ∈ 𝒱 }.

• Both the cross-sum operator ⊕ and the union operator ∪ take two sets of vectors as input and return a new set. Given two sets 𝒰 and 𝒱, their cross-sum 𝒰 ⊕ 𝒱 is the set { α + β | α ∈ 𝒰, β ∈ 𝒱 }, and their union 𝒰 ∪ 𝒱 is defined in the obvious way.
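To make the operators concrete, here is a Python sketch (our own illustration, reusing the hypothetical POMDP structure above). A vector is a tuple of length |S| and a set of vectors is a list of such tuples; following the definition above, the discount factor is folded into the matrix-multiplication operator, and no pruning of dominated vectors is performed.

```python
# Sketch of the three set operators (illustration only; assumes the POMDP
# sketch above, with states indexed 0..|S|-1).
from itertools import product
from typing import List, Tuple

Vector = Tuple[float, ...]         # one |S|-dimensional vector

def matmul_op(m: POMDP, vs: List[Vector], a: str, z: str) -> List[Vector]:
    """V (x) P_{a,z}: alpha'(s) = lambda * sum_{s'} alpha(s') P(s'|s,a) P(z|s',a)."""
    return [tuple(m.discount * sum(alpha[s2] * m.trans.get((s, a, s2), 0.0)
                                   * m.obs.get((z, s2, a), 0.0)
                                   for s2 in m.states)
                  for s in m.states)
            for alpha in vs]

def cross_sum(us: List[Vector], vs: List[Vector]) -> List[Vector]:
    """U (+) V = { alpha + beta | alpha in U, beta in V }."""
    return [tuple(x + y for x, y in zip(alpha, beta))
            for alpha, beta in product(us, vs)]

def union(us: List[Vector], vs: List[Vector]) -> List[Vector]:
    """U u V, removing exact duplicates only (no LP-based pruning)."""
    return list(dict.fromkeys(us + vs))
```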

With the above three operators, we can represent the fractions at the next step given the fractions at the current step. This is summarized in the following theorem.

Theorem 1  Suppose the sets of vectors {𝒱^{a',z'}_n} representing the fractions at step n are given. For any a and z,

    𝒱^{a,z}_{n+1} = ∪_{a'} [ ({r(·,a')} ⊗ P_{a,z}) ⊕ ⊕_{z'} (𝒱^{a',z'}_n ⊗ P_{a,z}) ].    (5)

The theorem implies that 𝒱^{a,z}_{n+1} for an a/z pair can be constructed in three steps.

• For each possible action a', the immediate fraction r(·,a') and the a'/z' fractions associated with a' are transformed by the matrix P_{a,z}.

• For each action a', all the transformed sets for a' are cross-summed into a new set.

• The sets obtained for the different actions are pooled in a union to form the set 𝒱^{a,z}_{n+1}.

Finally, for a complete step of our DP update, we have to compute |A|·|Z| such sets, one for each a/z combination.
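Putting the three steps together, a complete alternative DP update can be sketched as follows. This is again an illustration rather than the authors' implementation: it reuses the operator sketches above and omits the pruning of dominated vectors that a practical implementation would interleave with the cross-sums.

```python
# Sketch of one alternative DP update (Theorem 1): frac[(a, z)] is a vector-set
# representation of the fraction V^{a,z}_n; the result maps (a, z) to a
# representation of V^{a,z}_{n+1}.  Illustration only; no pruning.
from functools import reduce
from typing import Dict, List, Tuple

def alt_dp_update(m: POMDP,
                  frac: Dict[Tuple[str, str], List[Vector]]
                  ) -> Dict[Tuple[str, str], List[Vector]]:
    new_frac: Dict[Tuple[str, str], List[Vector]] = {}
    for a in m.actions:
        for z in m.observations:
            pooled: List[Vector] = []
            for a2 in m.actions:                       # a2 plays the role of a'
                # Step 1: transform r(., a') and every a'/z' fraction by P_{a,z}.
                r_vec = tuple(m.reward.get((s, a2), 0.0) for s in m.states)
                pieces = [matmul_op(m, [r_vec], a, z)]
                pieces += [matmul_op(m, frac[(a2, z2)], a, z)
                           for z2 in m.observations]
                # Step 2: cross-sum all transformed sets for this a'.
                summed = reduce(cross_sum, pieces)
                # Step 3: pool the per-action sets in a union.
                pooled = union(pooled, summed)
            new_frac[(a, z)] = pooled
    return new_frac
```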

3.3 Comparison With Standard DP Update

In this subsection, we compare the alternative DP update with the standard DP update and argue that the alternative scales better.

We base our discussion on a comparison with incremental pruning, the most efficient algorithm to date for the standard DP update (Cassandra et al. 1997, Zhang and Liu 1997). It computes 𝒱_{n+1} using the following equation:

    𝒱_{n+1} = ∪_{a'} [ {r(·,a')} ⊕ ⊕_{z'} 𝒱^{a',z'}_{n+1} ].    (6)

Comparing the above equation with (5), we see that incremental pruning uses the fractions 𝒱^{a',z'}_{n+1} directly in its cross-sum operations, while the alternative DP update uses the sets transformed by the matrix P_{a,z}. For simplicity, we denote the transformed set 𝒱^{a',z'}_{n+1} ⊗ P_{a,z} by 𝒰^{a',z'}_{n+1}. In terms of minimal representations, 𝒰^{a',z'}_{n+1} usually contains fewer vectors than 𝒱^{a',z'}_{n+1}. This means that in the alternative DP update, the operands of the cross-sums contain fewer vectors. Consequently, computing the fraction 𝒱^{a,z}_{n+1} for a pair [a,z] should be more efficient than computing the set 𝒱_{n+1}.

The alternative DP update needs to account for |A|·|Z| fractions. Let T(·) denote the time cost of computing a set, and let [a,z] be the pair whose fraction 𝒱^{a,z}_{n+1} is the most time-consuming to compute. The total time cost of the alternative DP update is then upper bounded by |A|·|Z|·T(𝒱^{a,z}_{n+1}). Since |A|·|Z| is a constant, we conclude that the alternative DP update is more efficient than the standard DP update in the asymptotic sense.
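The effect of operand size on the cross-sums is easy to quantify: before pruning, the cross-sum of sets U_1, ..., U_k contains Π_i |U_i| vectors, so smaller operands shrink the work multiplicatively. The following toy computation, with made-up set sizes (purely illustrative, not measurements from the paper), shows the difference.

```python
# Toy illustration with hypothetical set sizes: an unpruned cross-sum of k sets
# has prod_i |U_i| vectors, so smaller operands pay off multiplicatively.
from math import prod

untransformed = [30, 30, 30, 30]   # hypothetical |V^{a',z'}_{n+1}| per observation
transformed   = [12, 12, 12, 12]   # hypothetical |U^{a',z'}_{n+1}| after transformation

print(prod(untransformed), prod(transformed))   # 810000 versus 20736
```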

3.4 Stopping Criterion

When the alternative DP step is used in value iteration, we need a stopping criterion. A natural criterion is the following: when the quantity max_{a,z} max_b |V^{a,z}_n(b) - V^{a,z}_{n-1}(b)| falls below a threshold, value iteration terminates. In our experiments, we set the threshold to ε(1-λ)/2λ, the same as in standard VI.
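A minimal sketch of this stopping test follows, under an assumption the paper does not make: the maximum over the belief space is approximated by a maximum over a finite sample of belief states (an exact test would instead compare the vector-set representations, e.g. via linear programs). It reuses the vector-set representation from the sketches above.

```python
# Sketch of the stopping test for VI1 (approximation, illustration only): stop once
#   max_{a,z} max_b |V^{a,z}_n(b) - V^{a,z}_{n-1}(b)| < epsilon*(1-lambda)/(2*lambda),
# with the max over b taken over a finite list of sampled beliefs.
from typing import Dict, List, Tuple

def set_value(vs: List[Vector], b: List[float]) -> float:
    """Value at belief b (a list of probabilities indexed by state) of the
    function represented by the vector set vs."""
    return max(sum(a_i * b_i for a_i, b_i in zip(alpha, b)) for alpha in vs)

def should_stop(frac_n: Dict[Tuple[str, str], List[Vector]],
                frac_prev: Dict[Tuple[str, str], List[Vector]],
                beliefs: List[List[float]],
                epsilon: float, discount: float) -> bool:
    threshold = epsilon * (1.0 - discount) / (2.0 * discount)
    residual = max(abs(set_value(frac_n[key], b) - set_value(frac_prev[key], b))
                   for key in frac_n for b in beliefs)
    return residual < threshold
```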

4 Experiments

Experiments have been conducted to compare the performance of our alternative VI and the standard VI. For the standard VI, we choose incremental pruning. For convenience, we refer to it as VI and refer to value iteration with the alternative DP update as VI1. Our main observation is that VI1 significantly outperforms VI for medium-size and large POMDPs. In other words, VI1 scales better than VI.

4.1 Setups

In our experiments, the round-off precision is set to 10^{-6}, the discount factor is 0.95, and the policies computed are 0.01-optimal. All experiments are conducted on an UltraSPARC II machine.

We report our experiments on three problems, namely the Ejs1, Network and Elevator problems. Their parameters are summarized in the following table; they are representative examples of small, medium-size and large POMDPs. The first two problems are downloaded from Cassandra's POMDP page [1] and the last one was created by our AI group (Choi 2000).

    Problem    |S|   |Z|   |A|
    Ejs1         3     2     4
    Network      7     2     4
    Elevator    96    32     3

For each problem, we collect the CPU seconds each algorithm takes to converge. To provide insight into the performance differences, we also collect the sizes of the sets constructed in each DP update: at step n, we collect |𝒱_n| for VI and Σ_{a,z} |𝒱^{a,z}_n| for VI1.

4.2 Ejs1 problem

VI solves the problem in 60 seconds, while VI1 takes 110 seconds; VI1 is less efficient than VI on this problem. This means that VI1 incurs overhead for small problems. Results from VI show that the 0.01-optimal value function is represented by 30 vectors. For most iterations, a value function in VI is represented by 30 vectors and the total size of the sets representing the fractions in VI1 is 60. Our data show that one of the fractions is represented by 29 vectors, which is close to |𝒱_n|; therefore computing this particular fraction takes almost as much time as computing the set 𝒱_n. Since VI1 also has to account for the other fractions, this leads to computational overhead, and hence VI1 is less efficient than VI.

[1] http://www.cs.brown.edu/research/ai/pomdp/index.html


4.3 Network problem

To achieve 0.01-optimality, VI takes 13,021 seconds while VI1 takes 1,489 seconds; VI1 is much faster than VI. Both algorithms happen to take 213 iterations before terminating. For both VI and VI1, the sizes of the sets constructed increase sharply over the first 30 iterations and remain stable afterwards. In the later iterations, the sets 𝒱_n contain 491 vectors, and the sets representing the fractions contain 201 vectors in total.

4.4 Elevator problem

Neither VI nor VI1 is able to solve this problem within a reasonable time limit, but the available data show that VI1 significantly outperforms VI. In our experiments, VI1 takes 57,091 seconds for the first 28 iterations, while VI takes 98,257 seconds for only the first 5 iterations. For the 5th iteration (the last iteration observed for VI), VI1 takes 3,848 seconds and VI takes 97,547 seconds. The performance difference is evident. To make VI1 converge faster, we have also combined it with a point-based technique (Zhang and Zhang 2001). With this technique, VI1 terminates in 11,659 seconds, while VI still cannot solve the problem. This suggests that VI1 works very well for this problem.

At the last iteration observed for VI1, the total size of the sets representing the fractions is 2,754. It is evident that the corresponding value function contains more vectors than this number. Our data show that, for the pairs of one particular action and all the observations, the sizes of the fractions are: [14 11 20 19 12 16 11 13 12 15 19 14 29 47 24 41 28 24 43 42 17 25 13 20 49 50 63 69 42 67 45 66]. VI has to conduct cross-sums over all of these sets: if VI were used to construct the value function from these fractions, it would have to solve more than 10^{32} linear programs in the worst case. This is a huge number, and it is not surprising that VI cannot solve this problem.

References

Astrom, K. J. (1965). Optimal control of Markov processes with incomplete state information. Journal of Mathematical Analysis and Applications, 10, 174-205.

Cassandra, A. R., Littman, M. L. and Zhang, N. L. (1997). Incremental pruning: a simple, fast, exact method for partially observable Markov decision processes. In Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, 54-61.

Cheng, H. T. (1988). Algorithms for Partially Observable Markov Decision Processes. PhD thesis, University of British Columbia, Vancouver, BC, Canada.

Choi, P. M. (2000). Reinforcement Learning in Nonstationary Environments. PhD thesis, Department of Computer Science, Hong Kong University of Science and Technology.

Kaelbling, L. P., Littman, M. L. and Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2), 99-134.

Monahan, G. E. (1982). A survey of partially observable Markov decision processes: theory, models, and algorithms. Management Science, 28(1), 1-16.

Puterman, M. L. (1990). Markov decision processes. In D. P. Heyman and M. J. Sobel (eds.), Handbooks in Operations Research and Management Science, Vol. 2, 331-434. Elsevier Science Publishers.

Sondik, E. J. (1971). The Optimal Control of Partially Observable Markov Processes. PhD thesis, Stanford University.

Zhang, N. L. and Liu, W. (1997). A model approximation scheme for planning in partially observable stochastic domains. Journal of Artificial Intelligence Research, 7, 199-230.

Zhang, N. L. and Zhang, W. (2001). Speeding up the convergence of value iteration in partially observable Markov decision processes. Journal of Artificial Intelligence Research, 14, 29-51.
