POMDP Compression and Decomposition via Belief State Analysis

Xin Li

William Cheung

Department of Computer Science Hong Kong Baptist University Kowloon Tong, Hong Kong

Department of Computer Science Hong Kong Baptist University Kowloon Tong, Hong Kong

[email protected]

[email protected]

ABSTRACT

The partially observable Markov decision process (POMDP) is a commonly adopted mathematical framework for solving planning problems in stochastic environments. However, computing the optimal policy of a POMDP for large-scale problems is known to be intractable, and the high dimensionality of the underlying belief state space is one of the major causes. Our research studies two paradigms, POMDP compression and POMDP decomposition, for addressing the POMDP's tractability issue. For POMDP compression, we propose a novel orthogonal NMF hybrid approach. For POMDP decomposition, we propose a clustering criterion function that takes into account the temporal difference of belief sample points when partitioning the belief space. We evaluated the proposed approaches on a set of benchmark problems and demonstrated that belief compression and belief clustering are both effective in reducing the cost of computing policies while largely retaining policy quality.

1. INTRODUCTION

The partially observable Markov decision process (POMDP) is commonly used to model stochastic environments in support of optimal decision making, in applications such as robotics [8] and web service composition [2]. A POMDP model is mathematically characterized by a tuple $\langle S, A, Z, T, O, R \rangle$, which contains a finite set of true states $S$, a finite set of agent actions $A$, state transition probabilities $T : S \times A \to \Pi(S)$, a reward function that depends on the action and the state, $R : S \times A \to \mathbb{R}$, a finite set of observations $Z$, and the corresponding observation probabilities $O : S \times A \to \Pi(Z)$. Solving POMDP problems typically makes use of the belief state, which is defined as a probability mass function over the possible current states, given as $b = (b(s_1), b(s_2), \ldots, b(s_{|S|}))$. However, solving a large-scale POMDP problem is known to be computationally intractable, even with the computational shortcuts that exploit the piecewise linear and convex (PWLC) property of the value function over the belief space [1].

In the literature, a number of approximation methods have been proposed to address the tractability issue. Belief compression and value-directed compression are two representative approaches recently proposed to reduce a POMDP's complexity via compression (cf. dimension reduction). However, both have limitations in terms of policy quality and computational efficiency. We propose the use of orthogonal nonnegative matrix factorization (NMF) for belief compression, and further integrate it with a value-directed compression framework to preserve the quality of the computed policies as far as possible. It has been empirically demonstrated that this orthogonal NMF is effective in making a trade-off among (1) reducing the POMDP's dimension, (2) maintaining the orthogonality of the NMF projection matrix, and (3) ensuring the optimality of the computed policies.

In addition to POMDP compression, decomposing a POMDP problem is another direction, in which a divide-and-conquer approach is taken. We propose applying data clustering techniques to the POMDP's belief state space to "decompose" the POMDP. The clustering criterion function is designed so that the transition probabilities between clusters are minimized as far as possible, which reduces the loss in formulation accuracy incurred by the decomposition. Experiments show that such a belief clustering technique can readily be combined with non-linear and linear belief compression methods to tackle the POMDP's tractability.

To further study the scalability of our proposed compression framework and the compressibility of different POMDP problems, we propose applying interior-point gradient acceleration to the proposed orthogonal NMF and using an eigenvalue analysis. Again via experiments, the former is shown to be effective in further reducing the NMF overhead needed for the compression, and the latter is shown to be broadly consistent with the best compression ratio that can be obtained empirically for different benchmark problems.

Copyright is held by the author/owner(s). ACM-HK Student Research and Career Day, 2009
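To make the role of the belief state concrete, the following is a minimal sketch of the standard Bayesian belief update for a POMDP. It is not taken from the paper. The array layouts `T[a, s, s']` and `O[a, s', z]`, the function name `belief_update`, and the convention that the observation depends on the resulting state are all assumptions made for illustration.

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """Return the belief after taking action a and observing z.

    Assumed (hypothetical) layouts:
      T[a, s, s2] = P(s2 | s, a), shape (|A|, |S|, |S|)
      O[a, s2, z] = P(z | s2, a), shape (|A|, |S|, |Z|)
    """
    predicted = b @ T[a]                    # sum_s b(s) * T(s, a, s2)
    unnormalized = O[a][:, z] * predicted   # weight by observation likelihood
    total = unnormalized.sum()
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief and action")
    return unnormalized / total

# Two-state example: a "listen" action that leaves the state unchanged
# and reports the true state with probability 0.85.
T = np.array([[[1.0, 0.0], [0.0, 1.0]]])
O = np.array([[[0.85, 0.15], [0.15, 0.85]]])
b_next = belief_update(np.array([0.5, 0.5]), a=0, z=0, T=T, O=O)
print(b_next)
```

Each belief is a point in an $|S|$-dimensional simplex, which is why the dimensionality of the belief space grows with the number of states.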

2. POMDP COMPRESSION - A HYBRID APPROACH

In this section, we propose to use an orthogonal non-negative matrix factorization for belief compression and integrate it with a value-directed framework to reduce the dimensionality of a POMDP. The goal of value-directed compression is to keep the value (expected reward) of a belief state unchanged when the original problem is converted to its compressed version. The key is to find reward and transition functions in the low-dimensional compressed belief space such that $V^{\pi}_{t+1}(b) = \tilde{V}^{\pi}_{t+1}(\tilde{b})$ holds throughout the whole horizon.

2.1 Proposed Formulation

Let $B$ denote an $n \times |S|$ matrix defined as $[b_1 | b_2 | \ldots | b_n]^{\top}$, where $n$ is the number of belief states in the training sample and $b_i(s_j) \geq 0$ is the $j$-th element of the belief state $b_i$. Also, let $F$ denote a $|S| \times l$ transformation matrix computed using NMF [4], which "compresses" the belief state matrix $B$ to $\tilde{B}$. With the orthogonality constraint added, our compression problem becomes finding an $F$ such that

$$B^{\top} = F^{\top} \tilde{B}^{\top} \quad \text{s.t.} \quad F^{\top} F = I. \tag{1}$$

The low-dimensional reward function and transition function are then given by

$$\tilde{R} \approx F R, \tag{2}$$

$$\tilde{G} = F G F^{\top}. \tag{3}$$

In the conventional NMF notation, the problem with the orthogonality constraint can be rewritten as

$$V = W H \quad \text{s.t.} \quad W W^{\top} = I. \tag{4}$$

We propose the following updating rules to iteratively compute the projection matrix $F$ and the low-dimensional beliefs $\tilde{B}$ (setting $V = B^{\top}$, $W = F^{\top}$, $H = \tilde{B}^{\top}$):

$$W_{ik} \leftarrow W_{ik} \sqrt{\frac{(V H^{\top})_{ik}}{(W H H^{\top} + V H^{\top} W^{\top} W - W H H^{\top} W^{\top} W)_{ik}}}, \tag{5}$$

$$H_{kj} \leftarrow H_{kj} \frac{(W^{\top} V)_{kj}}{(W^{\top} W H)_{kj}}, \tag{6}$$

which minimize $\|V - W H\|_F^2$ as well as $\|W W^{\top} - I\|_F^2$.
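The updating rules (5) and (6) can be sketched in NumPy as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation. The shapes are taken from $V = WH$, so $V$ is $|S| \times n$, $W$ is $|S| \times l$ and $H$ is $l \times n$. The names `onmf_step` and `onmf`, the random initialization, and the flooring of denominators at a small `eps` (the denominator of rule (5) contains a subtraction and may become non-positive) are assumptions introduced here.

```python
import numpy as np

def onmf_step(V, W, H, eps=1e-12):
    """Apply rule (5) to W, then rule (6) to H, once each."""
    VHt = V @ H.T                       # |S| x l
    WtW = W.T @ W                       # l x l
    WHHt = W @ (H @ H.T)                # |S| x l
    denom_w = WHHt + VHt @ WtW - WHHt @ WtW
    # Assumption: floor the denominator so the square root stays defined.
    W = W * np.sqrt(VHt / np.maximum(denom_w, eps))
    # Rule (6), using the freshly updated W.
    H = H * (W.T @ V) / np.maximum(W.T @ W @ H, eps)
    return W, H

def onmf(V, l, iters=100, seed=0):
    """Run the two rules from a random positive start (hypothetical driver)."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], l)) + 0.1
    H = rng.random((l, V.shape[1])) + 0.1
    for _ in range(iters):
        W, H = onmf_step(V, W, H)
    return W, H

# Usage on 20 sampled beliefs over 6 states, compressed to l = 3.
rng = np.random.default_rng(1)
B = rng.random((20, 6))
B /= B.sum(axis=1, keepdims=True)       # each row is a belief
W, H = onmf(B.T, l=3, iters=10)
F, B_tilde = W.T, H.T                   # per the setting V = B^T, W = F^T, H = B~^T
```

One property that is easy to check by hand: if $V = WH$ exactly and $W^{\top}W = I$, the denominator of rule (5) collapses to $VH^{\top}$ and both multiplicative factors equal 1, so such a pair $(W, H)$ is a fixed point of the updates.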

3. AN ADAPTATION OF ORTHOGONAL NONNEGATIVE MATRIX FACTORIZATION

The scheme proposed in the last section induces a closed-form solution for POMDP compression. However, when the problem is scaled up further, the cost of computing the compressed space and the projection matrix $F$ also increases. In some cases, this overhead was found to be so large that the compression step provides no saving in overall computational cost. In this section, we first present an eigenvalue-based analysis for estimating the extent to which a particular POMDP problem can be compressed. We then study the use of the interior-point gradient method to speed up the convergence of the proposed O-NMF. We also found that a graph-based clustering approach is effective for POMDP decomposition. With the overhead of O-NMF further reduced, we show that large-scale navigation problems (about 1000 states) can be solved in considerably less time.

3.1 An Eigenvalue-based Analysis on POMDP's Compressibility

In this section, we illustrate that an eigenvalue-based analysis of the original generalized transition function $G$ can be used to estimate a reasonable reduced dimension.

Lemma 1. There exists an $n \times m$ matrix $F$ that induces a lossless value-directed compression $G F = F \tilde{G}$ for all action and observation pairs $\langle a, z \rangle$, where $F^{\top} F = I$ and $m$ is no less than the rank of $\sum_{a,z} G$.

According to Lemma 1, we define $\mathrm{rank}(\sum_{a,z} G)$ to be the estimated reduced dimension for the lossless compression setting. Unfortunately, for some problems $\mathrm{rank}(\sum_{a,z} G)$ is quite close to the problem's original dimension: it is 88, 604, and 1069 for the 92-dimensional Hallway2, 612-dimensional Hall68by9, and 1088-dimensional Hall68by16 problems, respectively. This means that dimension reduction is not effective under lossless compression. Inspired by PCA, we sorted the eigenvalues of $\sum_{a,z} G$ for Hallway2, Tag [6], Hall68by9, and Hall68by16 in Figure 1. We observed that Hallway2, Hall68by9, and Hall68by16 have sharp drops at dimensions 30, 200, and 250, respectively. This hints that those sharp drops could be considered the compressibility limits of the problems. In Section 4, we will show that the corresponding dimensions are in fact quite close to the effective compression ratios found empirically. In addition, a careful comparison among the sub-figures in Figure 1 indicates that the Hall68by$n^2$ problems of higher resolution can be compressed more effectively. Also, referring to Figure 1, the sum of the eigenvalues of the truncated dimensions is about 10 percent of the overall sum of all the eigenvalues for the Hallway2, Hall68by9, and Hall68by16 problems. For the Tag problem, however, the reduced dimension must be no less than 620 to guarantee a "loss" of less than 10 percent. This implies that the Tag problem may not benefit from the value-directed compression method.
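The analysis above can be sketched in a few lines of NumPy. The sketch assumes the per-pair matrices are available as a list `G_list` of $|S| \times |S|$ arrays, one for each pair $\langle a, z \rangle$ (a hypothetical input, since the paper does not spell out how $G$ is stored). Sorting eigenvalues by magnitude, because a non-symmetric sum may have complex eigenvalues, is also an assumption, as are the function names.

```python
import numpy as np

def lossless_dim(G_list):
    """Rank of the summed matrices: the lossless estimate from Lemma 1."""
    return int(np.linalg.matrix_rank(np.sum(G_list, axis=0)))

def estimate_reduced_dim(G_list, max_loss=0.10):
    """Smallest m whose truncated eigenvalue mass is at most max_loss of the total."""
    G_sum = np.sum(G_list, axis=0)
    mags = np.sort(np.abs(np.linalg.eigvals(G_sum)))[::-1]   # descending magnitudes
    kept = np.cumsum(mags)
    if kept[-1] == 0.0:
        return 0, mags
    tail_fraction = 1.0 - kept / kept[-1]
    m = int(np.argmax(tail_fraction <= max_loss)) + 1         # first index that qualifies
    return m, mags

# Toy example: the summed matrix has eigenvalues 6, 3.2, 0.4, 0.3, 0.1.
G_half = np.diag([3.0, 1.6, 0.2, 0.15, 0.05])
m, spectrum = estimate_reduced_dim([G_half, G_half], max_loss=0.10)
print(lossless_dim([G_half, G_half]), m)
```

In the toy example the matrix has full rank, so the lossless estimate gives no reduction, while keeping the two largest eigenvalues already leaves less than 10 percent of the eigenvalue mass truncated. This mirrors the gap between the rank-based and the spectrum-based estimates reported for the Hallway-type problems.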

Figure 1: An illustration of the sorted eigenvalues for Hallway2, Tag, Hall68by9, Hall68by16 problems.

3.2 IPG Acceleration for O-NMF

In this section, we derive a new set of updating rules that speed up the previously proposed O-NMF by incorporating interior-point gradient (IPG) acceleration [5]. By carefully setting the convergence parameter $\alpha$ to a value larger than 1, the accelerated O-NMF can achieve a greater decrease in the objective function within a single step.

3.3 A Graph-Based Belief Clustering

In general, graph-based clustering algorithms have the advantage of finding clusters that span different dimensions of the underlying feature space. Intuitively, we can take the beliefs as the vertices of the graph. For the edge weights, we adopt the spatio-temporal formulation

$$W(i,j) = \sqrt{2} - \min\left(\|b_i - b_j\|, \left\|\frac{i-j}{\lambda}\right\|\right), \tag{7}$$

where $W(i,j)$ can be taken as the similarity matrix [7]. Here $\min(\|b_i - b_j\|, \|\frac{i-j}{\lambda}\|)$ measures the "spatio-temporal" distance between two belief points, and $\sqrt{2}$ is the maximum of $\|b_i - b_j\|$. $\lambda$ should be set larger than $N/\sqrt{2}$, where $N$ is the number of beliefs, to guarantee the same maximum for the temporal term $\|\frac{i-j}{\lambda}\|$. Thus, Eq. (7) reflects the chance that a belief state $b_j$ could have evolved from $b_i$. Clustering on this criterion has the implicit advantage of minimizing inter-cluster belief transitions while maximizing intra-cluster belief transitions. In our experiment, we applied the graph-based clustering tool CLUTO [3] to the designed weight matrix to cluster the belief samples. A combination of O-NMF with the graph-based decomposition is expected to further speed up policy computation for POMDPs.
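As an illustration of Eq. (7), the weight matrix can be built as below. The sketch assumes that the row index of a belief sample is its time step along the sampled trajectory and that $\|\cdot\|$ is the Euclidean norm. The function name `spatio_temporal_similarity` is hypothetical. Passing the resulting matrix to a clustering tool such as CLUTO is not shown.

```python
import numpy as np

def spatio_temporal_similarity(beliefs, lam):
    """Similarity W(i, j) = sqrt(2) - min(||b_i - b_j||, |i - j| / lam)."""
    B = np.asarray(beliefs, dtype=float)
    n = B.shape[0]
    if lam <= n / np.sqrt(2.0):
        raise ValueError("lam should exceed N / sqrt(2)")
    # Pairwise Euclidean distances between belief points (spatial part).
    spatial = np.linalg.norm(B[:, None, :] - B[None, :, :], axis=2)
    # Scaled distance between sample indices (temporal part).
    idx = np.arange(n)
    temporal = np.abs(idx[:, None] - idx[None, :]) / lam
    return np.sqrt(2.0) - np.minimum(spatial, temporal)

# Four belief samples over three states; N / sqrt(2) is about 2.83, so lam = 4 is allowed.
samples = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.5, 0.5, 0.0],
                    [0.0, 0.0, 1.0]])
S_mat = spatio_temporal_similarity(samples, lam=4.0)
print(np.round(S_mat, 3))
```

In this example the first two samples are as far apart as two beliefs can be, yet they are consecutive in time, so the temporal term (0.25) is the smaller one and they still receive a high similarity.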

4. PERFORMANCE EVALUATION

We conducted a rigorous performance comparison among O-NMF compression, O-NMF-S compression, O-NMF based compression with belief clustering, and truncated Krylov compression. Table 1 tabulates the results in terms of (a) running time (broken down into the time needed by the compression approach and by Perseus) and (b) the policy's average reward. For Tiger-grid, Hallway, Hallway2, and RockSample, O-NMF based compression gives policies of almost the same quality as those computed without it, at a reasonable reduced dimension. The computational speedup brought by O-NMF was also significant in most cases, except for Hallway2. Compared with truncated Krylov compression, the proposed O-NMF based belief compression is clearly more effective in reducing the POMDP's computational complexity. O-NMF-S likewise gives policies of almost the same quality as those computed without it, at a reasonable reduced dimension, for both Hall68by9 and Hall68by16, and the speedup brought by O-NMF-S, with and without graph-based clustering, was significant compared with Perseus.

More specifically, Figure 2 compares, at reduced dimension 250, the policy performance of our proposed compression tools (with their respective overheads taken into account) and of the original Perseus without compression. We observe that O-NMF-S is the fastest approach to converge while achieving a nearly optimal policy. Graph-based clustering with λ = 300 has the least overhead among all the compression approaches but achieves an average reward of only around 0.40, whereas the optimal policy achieves 0.47. This indicates that selecting a proper λ remains a crucial problem, even though the graph-based approach can beat the K-means clustering introduced previously.

Table 1: Performance comparison for different problems.

Problem (states / actions / obs.)        Reward          Time (sec.), Tp + Tc    Reduced dim.

Tiger-grid (500 samples) (36s 5a 17o)
  Perseus                                0.63 ±0.014     104                     N/A
  Perseus+Trunc.Kry.                     -VE             -                       -
  Perseus+O-NMF                          0.65            10.26+0.2               30

Hallway (500 samples) (60s 5a 21o)
  Perseus                                0.50 ±0.005     53                      N/A
  Perseus+Trunc.Kry.                     0.50            13.78+202               48
  Perseus+O-NMF                          0.50            2.46+5.49               45

Hallway2 (500 samples) (92s 5a 17o)
  Perseus                                0.31 ±0.014     69.96                   N/A
  Perseus+Trunc.Kry.                     0.29            36+52.7                 48
  Perseus+O-NMF                          0.29            49+8.398                60

RockSample (464 samples) (257s 9a 2o)
  Perseus                                17.99 ±1.22     500                     N/A
  Perseus+Trunc.Kry.
  Perseus+O-NMF                          11.81           2.7+0.3                 80

Hall68by9 (500 samples) (612s 5a 17o)
  Perseus                                0.4576          5399                    N/A
  Perseus+O-NMF                          0.4355          500+967                 250
  Perseus+O-NMF-S                        0.4403          500+688                 250
  Perseus+O-NMF-S+GCL+λ300               0.4017          500+464                 250

Hall68by16 (500 samples) (1088s 5a 21o)
  Perseus                                0.3079          20000                   N/A
  Perseus+O-NMF                          0.3895          1000+2477               300
  Perseus+O-NMF-S                        0.3918          1000+1883               300
  Perseus+O-NMF-S+GCL+λ150               0.2501          500+293                 200

(Tp - policy computing time; Tc - compression time)

[Figure 2 plot: average reward versus time (sec.); curves: O-NMF Perf. with 967 right-shift, O-NMF-S Perf. with 688 right-shift, Perseus Perf. over VI, GCL-λ300 with 464 right-shift]

Figure 2: A comparison of compression efficiency, with overhead right-shift, between O-NMF-S and O-NMF for the Hall68by9 problem with reduced dimension 250.
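As a small worked example of how to read the Tp + Tc column, the overall speedup over plain Perseus can be computed from the Hall68by9 entries of Table 1 (5399 seconds for Perseus alone, and 500+967, 500+688, and 500+464 seconds for the three compressed variants). The helper name `speedup` is introduced here only for illustration.

```python
def speedup(baseline_sec, policy_sec, compression_sec):
    """Ratio of the uncompressed running time to the total time Tp + Tc."""
    return baseline_sec / (policy_sec + compression_sec)

# (Tp, Tc) pairs for Hall68by9, as listed in Table 1.
hall68by9 = {
    "O-NMF": (500, 967),
    "O-NMF-S": (500, 688),
    "O-NMF-S+GCL (lambda=300)": (500, 464),
}
speedups = {name: speedup(5399, tp, tc) for name, (tp, tc) in hall68by9.items()}
for name, value in speedups.items():
    print(f"{name}: {value:.2f}x")
```

The ratios come out at roughly 3.7, 4.5, and 5.6, so the ordering of the three variants by total time matches the ordering of their compression overheads.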

5. FUTURE WORK

Future extensions of this work include at least the following directions: (1) optimizing the belief clustering quality and extending it to support hierarchical decomposition, (2) deriving alternative POMDP decomposition techniques from the eigenvalue-based analysis of the generalized transition functions, (3) extending the decomposition to a multi-agent setting, with the hope of obtaining a better and more dynamic belief clustering algorithm, and (4) extending our current approaches to support online learning of POMDPs.

6. REFERENCES

[1] A. R. Cassandra. Exact and approximate algorithms for partially observable Markov decision processes. PhD thesis, 1998.
[2] P. Doshi, R. Goodwin, R. Akkiraju, and K. Verma. Dynamic workflow composition using Markov decision processes. Journal of Web Services Research (JWSR), 2:1–17, 2005.
[3] G. Karypis. CLUTO - a clustering toolkit. Technical Report #02-017, Nov. 2003.
[4] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 1999.
[5] M. Merritt and Y. Zhang. An interior-point gradient method for large-scale totally nonnegative least squares problems. Technical Report TR04-08, Department of Computational and Applied Mathematics, Rice University.
[6] J. Pineau, N. Roy, and S. Thrun. A hierarchical approach to POMDP planning and execution. In Workshop on Hierarchy and Memory in Reinforcement Learning (ICML), June 2001.
[7] J. Puzicha, T. Hofmann, and J. M. Buhmann. A theory of proximity based clustering: structure detection by optimization. Pattern Recognition, 33:617–634, 2000.
[8] R. Simmons and S. Koenig. Probabilistic robot navigation in partially observable environments. Artificial Intelligence Journal, 1997.