On the Computability of Infinite-Horizon Partially Observable Markov Decision Processes

Omid Madani

Abstract

We investigate the computability of infinite-horizon partially observable Markov decision processes under discounted and undiscounted optimality criteria. The undecidability of the emptiness problem for probabilistic finite automata is used to show that a few technical problems, such as the isolation of a threshold, and closely related undiscounted problems, such as probabilistic planning, are undecidable. The decidability of the corresponding problems under the discounted criterion remains largely open, but we provide evidence for the decidability of several, while we also give evidence of hardness, as there may be no closed forms for describing optimal sequences of actions. The research sheds light on some interesting structural properties of these problems.

We investigate the computability of infinite-horizon partially observable Markov decision processes (POMDPs). These problems form the basic model for closely related problems in the area of probabilistic planning. Their computability had been questioned or conjectured about before (see, for example, [PT87] and [Lit96]). To simplify and focus on the important properties of the problems, we will concentrate on unobservable MDPs, or UMDPs. Of course, any hardness result shown applies to the more general class of POMDPs as well. In Section 1, we give a brief introduction to the models, several infinite-horizon discounted and undiscounted optimality criteria, notions of optimal policies, values, and action sequences, and the computational problems of interest. In Section 2, we describe the emptiness problem for probabilistic finite automata (PFA's) and its significance for computational problems of UMDPs under undiscounted optimality criteria. Surprisingly, the emptiness problem for PFA's is undecidable [CL89, Paz71]. We explain in some detail how the undecidability result in [CL89] is established, and explore consequences of the result, including the undecidability of a few related technical problems such as the isolation of a threshold, in addition to the undecidability of probabilistic planning in its general form. Section 3 concerns the UMDP model in the presence of discounting, which adds an interesting twist to the problems. One view of discounting is that in its presence, the dynamics of the model terminate with probability one. Hence, while the horizon is still infinite, these models lie semantically in between finite-horizon models and undiscounted infinite-horizon models. The decision problems also appear to be easier computationally than the corresponding ones in the

undiscounted case, while some may be undecidable. The notion of periodicity has a significant role in the analysis of the computational problems, and it comes up at different levels of the analysis. We define optimality and periodicity for action sequences, and give an example to show that optimal sequences of actions are not always periodic, providing strong evidence that seeking closed forms for action sequences or optimal values (rational values) is not always possible. However, as we explain later, the notion of periodicity may still simplify the problems to some extent and render some of the decision problems decidable. We explore a few of these questions when we investigate structural properties of UMDPs under the discounting criterion. We conclude in Section 4.

1 Preliminaries

This section gives a brief introduction to the concepts and models in the paper and collects some of the terminology and notation that we will use later. The reader may prefer to skip this section and refer back to it when needed. Some of the following notions are defined at a level intended for this report and are not necessarily in their most widely used or most general form. We make the standard assumption that all the input numbers (rewards, transition probabilities, etc.) are rational. Σ denotes an alphabet or a finite set of actions, depending on the context, and Σ∗ denotes the set of all strings (finite sequences of actions) over this alphabet. ε and δ denote positive real numbers throughout.

1.1 Unobservable Markov Decision Processes

In an unobservable Markov decision process (UMDP), a decision maker is faced with a dynamic system modeled by a tuple S = (Q, Σ, s), where Q is a set of n states and Σ is a finite set of actions. The system occupies exactly one state at any given time, and s ∈ Q is the starting state of the system. The system may change state when an action is executed, as described below. Associated with each action a is an n × n stochastic matrix Ma which specifies the state transition probabilities for action a. Ma[i, j] denotes the entry in row i, column j of the matrix, with the semantics that if the state of the system is i and action a is executed, with probability Ma[i, j] the next state of the system is j. Also associated with each action a is a vector of rewards Ra. Ra[i] denotes the ith component of the reward vector, with the semantics that the decision maker gains Ra[i] if the state of the system is i and action a is executed. The problem faced by the decision maker, informally, is to execute actions that result in a high long-term accumulated reward. In unobservable models, the system state is not observable by the decision maker as actions are executed. We next explain common optimality criteria and the computational questions that can be posed.
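Before turning to the criteria, here is a minimal data-structure sketch of the model, assuming Python with numpy; the class name and the small 2-state example are illustrative only, not from the paper:

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class UMDP:
        M: dict          # action -> n x n stochastic matrix M_a
        R: dict          # action -> length-n reward vector R_a
        start: int = 0   # index of the starting state s

    # A 2-state, 2-action example (made up for illustration).
    umdp = UMDP(
        M={'a': np.array([[0.9, 0.1], [0.2, 0.8]]),
           'b': np.array([[0.5, 0.5], [0.0, 1.0]])},
        R={'a': np.array([1.0, 0.0]),
           'b': np.array([0.0, 2.0])})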

1.2 Optimality Criteria

The optimality criteria we will discuss are for the infinite horizon; that is, we assume that we are planning for an indefinite number of action executions. In Section 2 we show the hardness of a special case of the undiscounted total reward and average reward criteria. In undiscounted total reward problems, the objective is to maximize the total expected reward. This criterion is well defined for the special case where a reward is obtained at most once, upon entering a specially designated goal state. Probabilistic planning problems, with traditional goal-oriented models, fall in this area. The hardness results of the next section also apply to the average reward criterion, where the objective is to maximize the expected reward per step (action execution). A widely used criterion is the infinite-horizon discounted criterion, where there is also a discount factor β < 1 as part of the problem statement. The optimality criterion is called the discounted total reward criterion, in which the expected reward from the ith future action execution is discounted, i.e., multiplied by β^{i−1} (see eq. (1) below).

1.3 Computational Problems

We will consider several computational problems in this report. While all of these problems turn out to be undecidable for undiscounted models, we need to differentiate, as the outcome can vary from one question to another in the case of discounted models. Given a UMDP, a natural problem is to ask for the optimal (maximum) value according to one of the criteria above. For some criteria, computing closed forms for optimal values may not be possible, but the question of, given k as input, whether the kth bit of the binary representation of the optimal value is 0 or 1 may be decidable. A related and possibly easier problem is that of approximation, for example, determining whether the optimal value lies inside an interval [ε, 1 − ε]. In Section 2, the first question proved undecidable is whether the optimal value exceeds a certain threshold. An important class of problems are those related to computing optimal actions. Our models are unobservable, hence it is not hard to see that the optimal value can be achieved by mere sequences of actions, i.e., the decision maker need not change its choice of action sequence from one episode of system interaction to another (as long as the system characteristics do not change). Optimal sequences might be infinite, however. Hence, one question is whether the sequence has some structure so that it can be represented "compactly," so that it would be easy to compute finite prefixes of the sequence. Even if that's not possible, there may still exist an algorithm that can return the first bit (in general, the kth) of an optimal sequence. In fact, this is the case for the example in Section 3. It can be shown that for optimal behavior (under any of the criteria above), it suffices to have a mapping from the distribution over system states (called a belief state in [Lit96]) to a set of one or more optimal actions (see, for example, [Str65]). The set of distributions over n states

forms the n-dimensional unit simplex. A 3-dimensional unit simplex, with a mapping to two actions, is shown in Figure 2. The distribution over system states can be updated after each action execution, as we assume that the starting state and the state transition probabilities for each action are known. We denote a point of the simplex by an n-dimensional row vector. For example, x0 = [0 0 1] represents being in state 3 with probability 1.0, and the next belief state, if an action a with dynamics

             | 2/3  0    1/3 |
        Ma = | 5/9  1/3  1/9 |
             | 1/4  1/4  1/2 |

is used at this state, would be x0 Ma = [1/4 1/4 1/2]. Similarly, xRa, the inner product of the column vector Ra with the belief state (row vector) x, is the expected reward if the current distribution over states is x and action a is executed. We briefly give expressions for the total expected reward and the average reward of an action sequence. Let w denote an infinite sequence of actions, and let w_i denote the ith action, 1 ≤ i < ∞, so that R_{w_i} and M_{w_i} denote the reward vector and the dynamics matrix of the ith action respectively. Let x_i denote the probability distribution after the ith action of the sequence is executed. We have x_{i+1} = x_i M_{w_i}, i ≥ 1, where x_1 denotes the initial distribution. Then the total discounted expected reward from executing action sequence w with initial distribution x_1 is

    ∑_{i=1}^∞ β^{i−1} x_i R_{w_i},    (1)

and the sum is undiscounted if β = 1. For an average reward criterion, let S_k = ∑_{i=1}^k x_i R_{w_i}. Then the average reward per step for sequence w can be defined as:

    lim inf_{k→∞} S_k/k.    (2)

We seek action sequences that maximize (1) or (2) depending on the optimality criterion. We denote by V∗ the optimal value function, mapping each distribution point to the maximum attainable value. Section 3.1 and Appendix B delve deeper into some structural properties of UMDPs.
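As a concrete sketch of evaluating eqs. (1) and (2) for a finitely described (here, periodic) action sequence, assuming Python with numpy; the function names and the truncation horizon T are ours, and the truncation error of the discounted sum is at most β^T max_a ‖Ra‖_∞ / (1 − β):

    import numpy as np

    def discounted_value(M, R, x1, period, beta, T=1000):
        """Approximate eq. (1), sum_{i>=1} beta^(i-1) (x_i R_{w_i}), truncating
        at T terms; the infinite sequence w is described by cycling `period`."""
        x, total = np.asarray(x1, dtype=float), 0.0
        for i in range(T):
            a = period[i % len(period)]
            total += (beta ** i) * float(x @ R[a])  # here i plays the paper's i - 1
            x = x @ M[a]                            # belief update x_{i+1} = x_i M_{w_i}
        return total

    def average_reward(M, R, x1, period, k):
        """S_k / k from eq. (2): the average reward over the first k steps."""
        x, S = np.asarray(x1, dtype=float), 0.0
        for i in range(k):
            a = period[i % len(period)]
            S += float(x @ R[a])
            x = x @ M[a]
        return S / k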

1.4 Probabilistic Finite Automata

Formally, a probabilistic finite automaton (PFA) M is denoted by a quadruple M = (Q, Σ, s, f), where Q is a finite set of states, Σ is the input alphabet, s ∈ Q is the initial state, and f ∈ Q is an accepting state. (A PFA is, in other words, a one-way randomized Turing machine with a read-only input tape and finite memory.) As in the case of UMDPs, associated with each letter a ∈ Σ is a stochastic matrix Ma with the same semantics described above. The automaton starts in the initial state and may change state depending on subsequent input. The model of the PFA used in the next section assumes that the accepting state f is absorbing: Ma[f, f] = 1.0, ∀a ∈ Σ.

If the automaton ends in the accepting state upon reading a string w ∈ Σ∗, we say the automaton has accepted the string; otherwise we say it has rejected the string. We denote by pM(w) the acceptance probability of string w by PFA M.
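A minimal sketch of computing pM(w), assuming Python with numpy (the names are ours): the acceptance probability is obtained by propagating the state distribution through the matrices of the letters read.

    import numpy as np

    def acceptance_probability(M, start, accept, word):
        """p_M(w): the probability that the PFA ends in the accepting state
        after reading word w. M maps each letter to its stochastic matrix."""
        n = next(iter(M.values())).shape[0]
        x = np.zeros(n)
        x[start] = 1.0          # all mass on the initial state s
        for letter in word:
            x = x @ M[letter]   # one input letter = one matrix product
        return float(x[accept])

Because the accepting state is assumed absorbing, pM is nondecreasing over the prefixes of w.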

2 Undecidability in the Undiscounted Criterion

The emptiness problem for probabilistic finite state automata (PFA's) is surprisingly undecidable, considering that the same question, even for nondeterministic finite automata, is a simple reachability question. This problem is important as it is basically equivalent to a core computational question that arises in a POMDP or probabilistic planning problem: Is there a policy or a plan that achieves a desired expected reward or success probability? The undecidability result was originally established in [Paz71] by Azaria Paz. Later, Condon and Lipton proved undecidability as an application of proof techniques they developed for space-bounded interactive proofs [CL89]. Condon and Lipton's proof is conceptually simpler and more illuminating, and, in addition, answers several open problems, a few of which were posed in [Paz71]. We begin with a formal definition of the emptiness problem, then give sufficient exposure to the properties of the reduction in [CL89] in order to derive the consequences that follow.

Definition 1 The emptiness problem for PFA's is the problem of deciding whether or not there is some input string that the given PFA accepts with probability exceeding a given threshold τ.

Both [CL89] and [Paz71] show that the emptiness problem for PFA's is very hard:

Theorem 2.1 [CL89][Paz71] The emptiness problem for PFA's is undecidable.

In [CL89], the question of whether a Turing machine (TM) accepts the empty string (an undecidable decision problem) is reduced to the question of whether a PFA accepts any string with probability exceeding a threshold. The reduction roughly works as follows. The PFA constructed by the reduction "tests" whether the input is a concatenation of accepting sequences, where an accepting sequence is the sequence of TM configurations which represents the accepting computation of the TM. The reduction has the property that if the TM is accepting, strings representing concatenations of accepting sequences exist, and the PFA accepts those strings with high probability. If the TM is not accepting, no accepting sequence exists, and the PFA accepts all strings with low probability. We next formalize these properties and use them in subsequent undecidability results. An explanation of the reduction, showing how it has these properties, is given in Appendix A. In what follows, the reduction can be tuned so that ε is made as small as desired, independent of the TM given as input to the reduction. The constructed PFA M has the following properties:

• If the TM does not accept the empty string (is not accepting), then no valid computation of the TM can end in its accepting state, and the PFA M accepts every string with probability at most ε.

• If the TM is accepting, then let string w represent the accepting computation, and let w^k denote w concatenated k times. We have lim_{k→∞} pM(w^k) = 1 − ε.

As described, the reduction has immediate consequences for the possibility of approximation:

Theorem 2.2 For any fixed ε > 0, the following problem is undecidable: Given is a PFA M for which one of the two cases holds:

• The PFA accepts some string with probability greater than 1 − ε.

• The PFA accepts no string with probability greater than ε.

Decide whether case 1 holds.

Hence it follows that approximations, such as computing a string which the PFA accepts with probability within any additive factor ε < 1, or within any multiplicative factor, of the maximum probability, are also noncomputable. We now address an open problem posed in [Paz71]. Define the language accepted by a PFA as follows. Given a threshold 0 ≤ τ < 1 (also called a cutpoint in the literature), any string that takes the PFA to the accepting state with probability greater than τ is in the language:

    L(M, τ) = {w ∈ Σ∗ : pM(w) > τ},    (3)

where L(M, τ) denotes the language accepted by PFA M given threshold τ, and pM(w) denotes the probability that M accepts w. In general, PFA's are powerful enough to accept even non-context-free languages (see the next section). However, Rabin [Rab63] showed that PFA's with isolated thresholds accept regular languages. An isolated threshold of a PFA is defined as follows:

Definition 2 Let M be a PFA. The threshold τ is ε-isolated with respect to M if |pM(x) − τ| ≥ ε for all x ∈ Σ∗, for some ε > 0.

The motivation behind the notion of isolation is that if a threshold τ is ε-isolated for a PFA M, and ε is known, then given a desirable error probability δ, one can determine a priori the number of times M has to be run on a given string w to decide whether w ∈ L(M, τ), with the error probability of the decision not exceeding δ (see [Paz71] or [Rab63]). A natural question then is, given a PFA and a threshold, whether the threshold is isolated for the PFA. If we can compute the answer and it is positive, then we can presumably compute the regular language accepted by the PFA, and see whether it is empty or not. This question was raised in [Paz71]. Unfortunately, the description of the reduction shows that the problem of deciding whether a given threshold is isolated is undecidable as well. As noted, we can design the reduction so that ε is, say, 1/3. Looking at the proof, this translates into the two cases where if the TM doesn't accept, then there is no string that the PFA accepts with probability greater than 1/3, while if the TM accepts, there are (finite) strings that the PFA accepts with probability arbitrarily close to 2/3. In other words, the threshold 2/3 is isolated iff the TM is not accepting.

Corollary 2.3 It is undecidable to answer whether a given threshold for a given PFA is isolated.
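To make the motivation for isolation concrete: one standard way to obtain the a priori run count is a Hoeffding bound (this is our choice of bound for illustration; [Paz71] and [Rab63] may argue differently). If τ is ε-isolated, comparing the empirical acceptance frequency over N independent runs against τ errs with probability at most δ once N ≥ ln(2/δ)/(2ε²):

    import math

    def runs_needed(eps, delta):
        """Hoeffding: P(|empirical - p_M(w)| >= eps) <= 2 exp(-2 N eps^2),
        so N >= ln(2/delta) / (2 eps^2) suffices when tau is eps-isolated."""
        return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

    # e.g. runs_needed(1/3, 0.01) -> 24 runs for the 1/3-isolated case above.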

2.1 Undecidability Consequences for Related Problems

Here we touch on the consequences of the above hardness results for PFA's. As alluded to earlier, PFA models can be viewed as UMDP models with a special reward structure. In particular, the problem of maximizing the acceptance probability of a PFA can be modeled as an undiscounted infinite-horizon total reward criterion problem. Figure 1(a) shows the basic transitional step from a goal-oriented or PFA model of the previous section to an expected total reward criterion. The assumption that the goal state is absorbing makes the remodeling easy. It can be verified that the maximum probability of reaching the goal is equal to the maximum total expected reward in the transformed UMDP model. UMDPs are a special case of POMDPs, and the above results show the undecidability of many undiscounted expected total reward infinite-horizon POMDP problems. In addition, average reward criterion problems are also undecidable, which can be seen by making the goal state absorbing with every action having a reward of 1.0 at it, and all other rewards 0.0, as shown in Figure 1(b). Then the maximum probability of reaching the goal is equal to the maximum expected reward per step of an optimal sequence of actions.

Figure 1: (a) A goal-oriented criterion modeled as a total reward criterion. The old absorbing goal state (labeled goal) now, on any action, has a transition to an extra absorbing state, with a reward of 1.0. All other rewards are zero. (b) Similarly, a goal-oriented problem (with an absorbing goal) modeled as an average reward problem.

Consider the probabilistic planning problem in its general form: we have uncertainty in the effects of actions and uncertainty in the state of the world. It is not hard to see that the model described, for example, in [KHW95], with propositions representing states, and actions having uncertain effects and limited observations, can model a PFA. In other words, a problem such as PFA emptiness can be reduced to plan existence for probabilistic planning. We conclude that problems associated with probabilistic planning in its general form and criteria (e.g., with no specified limit on the size of the plan sought) are also undecidable. Finally, we note from the above reductions that the problems of computing the first bit of an optimal action sequence, or of the optimal value, are undecidable as well. The major open problem of the next section is whether these problems remain undecidable in the presence of discounting.
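The remodeling of Figure 1(a) is mechanical; the following is a small sketch of it, assuming Python with numpy (the function name and the state indexing are ours): given the transition matrices of a goal-oriented UMDP with an absorbing goal state, it adds the extra absorbing state and the one-time reward.

    import numpy as np

    def goal_to_total_reward(M, goal):
        """Figure 1(a): on any action, the old absorbing goal state now moves
        to a new extra absorbing state with reward 1.0; all other rewards 0."""
        n = next(iter(M.values())).shape[0]
        out_M, out_R = {}, {}
        for a, Ma in M.items():
            T = np.zeros((n + 1, n + 1))
            T[:n, :n] = Ma
            T[goal, :] = 0.0
            T[goal, n] = 1.0      # goal -> extra on any action
            T[n, n] = 1.0         # the extra state is absorbing
            R = np.zeros(n + 1)
            R[goal] = 1.0         # collected exactly once, upon leaving goal
            out_M[a], out_R[a] = T, R
        return out_M, out_R

The maximum total expected reward of the transformed UMDP equals the maximum probability of reaching the goal, since the reward of 1.0 is collected exactly once.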


3 The Discounted Case

As mentioned in the introduction, we can view an MDP model with discounting as a case of a more general model in which the dynamics of the problem end with probability one. If β is the discount factor, then 1 − β is the probability of termination at each stage (see, for example, [Put94], Chapter 5.3, for this view). As such, discounted infinite-horizon models are closer to finite-horizon models than undiscounted models. For finite-horizon models, the decision problems of the previous section are decidable. Where do discounted UMDP problems stand with regard to such problems? To begin with, approximation of the optimal value, to within any additive factor ε > 0, is computable, and this follows directly from the presence of the discounting factor. As we saw, this was not the case for undiscounted problems (Theorem 2.2). We next describe an example UMDP that provides evidence for the hardness of answering some decision problems on UMDPs. It is plausible that in an unobservable model, optimal sequences of actions, if not finite, always become repetitive, ending in a regular structure. This can be verified in the case of deterministic unobservable models (with uncertainty over the system state). A quote from Rodney Brooks on sensorless robots [Eve95] expresses a similar intuition: "Without sensors, robots would be nothing more than fixed automation, going through the same repetitive task again and again in a carefully controlled environment." However, while this repetitive property of optimal action sequences might hold in most cases in random environments too, we will next see that it does not hold all the time, even with discounting. We first formalize our notion of a repetitive sequence of actions:

Definition 3 A finite or an infinite sequence w is periodic if w = uṽ, where u, v ∈ Σ∗, and ṽ denotes the concatenation of v with itself infinitely many times: ṽ = vvv···

The existence of optimal periodic action sequences would be evidence for the decidability of decision problems such as emptiness, or for the computability of optimal values. While this property might hold for two-state UMDPs (see the next section), we adapt an example from Paz [Paz71] to show that, unfortunately, it is not true for UMDPs with three or more states, even in the presence of discounting. Consider the following 3-state, 2-action UMDP. The two actions a and b have identical dynamics, denoted by the matrix M below, but their reward vectors Ra and Rb are different:

        | 2/3  0    1/3 |          |  0  |          | 1 |
    M = | 5/9  1/3  1/9 | ,   Ra = |  0  | ,   Rb = | 1 |        (4)
        | 1/4  1/4  1/2 |          | 7/4 |          | 0 |

The unique stationary point of the matrix is at [1/2 3/22 4/11], and the matrix is ergodic, that is, all points of the simplex converge to the stationary point under repeated applications of the matrix. Due to the identical dynamics, the action that yields the higher immediate reward is the optimal action at any given point. Hence, we note that the reward vectors are designed so that if the third component of a point is greater than 4/11 then action a is to be applied, and if it is below 4/11 then action b is to be applied, while either action is optimal at 4/11. Notice that this is also irrespective of the discount factor β (of course we assume 0 < β < 1). Figure 2(a) shows the three-dimensional unit simplex with the optimal regions for actions a and b, the stationary point, and the line corresponding to points with third component equal to 4/11.

Figure 2: (a) The 3-dimensional unit simplex, which represents possible distributions over 3 states, partitioned into two regions. Action a is optimal in the top region. (b) The evolution of a point toward the stationary point [1/2 3/22 4/11]. The pattern of the points of the sequence being above or below the 4/11 line is irregular.

In order to show the nonperiodicity of optimal policies, we need only show that the convergence of some point of the simplex to the stationary point visits the a and b regions in an irregular manner. This nonregularity is shown in [Paz71], which we will describe below. Paz shows that the language accepted by the single-letter, 3-state PFA M = ({1, 2, 3}, {a}, 3, 3) with dynamics Ma given as above, and given threshold 4/11, is not regular, and in fact not context-free, as all single-letter context-free languages are regular (see eq. (3) for the definition of language acceptance for PFA's). From the nonregularity of this language, it immediately follows that the pattern of a's and b's cannot form a periodic sequence. In order to show nonregularity, we investigate the third component of [0 0 1]M^k, with k ≥ 0, which is the probability of being in the 3rd state after reading k letters. But the third component of [0 0 1]M^k is simply the entry in the third row and third column of the matrix M^k, which we denote by m^k_{3,3}. We can write m^k_{3,3} in terms of the eigenvalues of M, which are 1, 1/4 + (√7/12)i, and 1/4 − (√7/12)i. Let λ = 1/4 + (√7/12)i (one of the eigenvalues), and let u = 7/22 + (3√7/154)i. We get:

    m^k_{3,3} = 4/11 + uλ^k + ūλ̄^k,    (5)

where ū and λ̄ are the conjugates of u and λ respectively. We are interested in when m^k_{3,3} exceeds 4/11; rearranging equation (5), rewriting it in trigonometric form, and simplifying, we get:

    m^k_{3,3} − 4/11 = 2|u| |λ|^k cos(k arg(λ) + arg(u)).    (6)

Hence m^k_{3,3} > 4/11 iff −π/2 < k arg(λ) + arg(u) < π/2 (mod 2π). The following lemmas appear in [Paz71], which references [Niv56] for proofs.

Lemma 3.1 If an angle θ is rational in degrees, i.e., θ = 2πr, where r is a rational number, then the only rational values of cos θ are 0, ±1/2, and ±1.

Since cos(arg(λ)) = Re λ/|λ| = 3/4, we conclude that arg(λ) is irrational in degrees.

Lemma 3.2 If θ is irrational in degrees, then any subinterval of (0, 2π) contains values of the form (kθ mod 2π).

We use Lemma 3.2 to show that Nerode's right-invariant equivalence relation has infinite index for this language. Briefly, two strings w and u are in the same equivalence class for a language L if ∀v ∈ Σ∗, wv ∈ L ⇔ uv ∈ L. A characterization of regular languages is that Nerode's equivalence relation has finite index (for example, see [Paz71]). Hence the language of the PFA cannot be regular. Assume a^{k1} and a^{k2} are in the language, and let k1 arg(λ) + arg(u) = α and k2 arg(λ) + arg(u) = β. We have −π/2 < α, β < π/2, and pick k1 and k2 so that α < β. By Lemma 3.2, there is some k3 such that π/2 < (k2 + k3) arg(λ) + arg(u) < π/2 + (β − α)/2, so that (k1 + k3) arg(λ) + arg(u) < π/2 − (β − α)/2. We conclude that a^{k2+k3} is not in the language, but a^{k1+k3} is. Hence any two such words in the language are in different equivalence classes; since there are infinitely many such words, Nerode's equivalence relation for this language has infinite index.

A few noteworthy points are in order.

1. Small perturbations in, say, one of the reward vectors change the optimal regions so that the stationary point falls inside one region. When the stationary point is inside a single region, it is not hard to see that at any point periodic optimal sequences exist. This provides some evidence that nonperiodic sequences are rare. Of course, in this example the two transition matrices were the same. It would be interesting to show whether or not a conjecture such as the following holds: given a natural distribution over the inputs of the problems, with probability one at every point an optimal periodic sequence exists.

2. In this example, apparently there is no closed form for the optimal sequence, at least no form independent of the information contained in the simplex partition. However, using the simplex partition and tracking the evolution of a point, it is not hard to see that any bit or any prefix of an optimal sequence can be computed (see the sketch below). This situation might hold true for UMDPs in general.

3. Very likely, at least for some values of β, the optimal value is irrational for the above problem. We haven't explored ways of proving this. A related problem to point 2 is whether the kth bit of the optimal value can be computed, for any k, at least for examples such as the above, where all actions have the same dynamics but the reward vectors are different. An argument using the convergence of slightly perturbed reward vectors (point 1 above) might be used to show that the first (and therefore the kth) bit is always computable.

In the next subsection, we explore a few properties of discounted UMDPs in some detail.
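But first, a small numerical companion to point 2 above, assuming Python with numpy (the printout is exact up to floating-point rounding near the 4/11 boundary), tracking the third component of [0 0 1]M^k and the action it dictates:

    import numpy as np

    M = np.array([[2/3, 0,   1/3],
                  [5/9, 1/3, 1/9],
                  [1/4, 1/4, 1/2]])
    x = np.array([0.0, 0.0, 1.0])    # start at [0 0 1]
    pattern = []
    for k in range(40):
        pattern.append('a' if x[2] > 4/11 else 'b')
        x = x @ M                    # x becomes [0 0 1] M^(k+1)
    print(''.join(pattern))          # the a/b pattern settles into no period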

3.1 Structural Properties

Here we mostly consider the discounted model. Consider the n-dimensional simplex, denoted by X. The optimal value function V∗ : X → R is the function that maps every point x to the maximum value obtainable at that point, V∗(x). In the discounted infinite-horizon criterion, this value is the maximum total discounted reward obtainable if the starting point is the distribution x. That V∗ is bounded and convex can be shown using contraction mapping properties. Proofs of these are standard and also appear in Appendix B. An action a is optimal at a point x of the simplex if some optimal sequence of actions for point x begins with a. In the presence of discounting, a ∈ arg max_{σ∈Σ} [xRσ + βV∗(xMσ)] implies a is optimal at x (and vice versa), but when there is no discounting, the implication does not hold (see the appendix). Define the optimal region for action a, denoted by region∗(a), to be all the points where action a is an optimal action.

Lemma 3.3 (closedness) In the discounting model, ∀a ∈ Σ, region∗(a) is closed.

Proof. The lemma follows from the continuity of V∗ and the fact that if a ∈ arg max_{σ∈Σ} [xRσ + βV∗(xMσ)] then x ∈ region∗(a) (a is optimal at x). Let x be on the border of region∗(a), where there are points arbitrarily close to x at which another action, say action b, is optimal. We show that action b must be optimal at point x as well. Take a sequence {x_k} of points converging to x at which action b is optimal. We have:

    x_k Ra + βV∗(x_k Ma) ≤ x_k Rb + βV∗(x_k Mb), ∀x_k.

But, from the continuity of V∗, lim_{k→∞} x_k Rb + βV∗(x_k Mb) = xRb + βV∗(xMb), which shows that action b is at least as good as action a at x:

    xRa + βV∗(xMa) ≤ xRb + βV∗(xMb),

hence action b is also optimal at x. □

Figure 3 shows that optimal regions in the absence of discounting may not be closed. The example also shows that optimal sequences in undiscounted goal-oriented models may not be periodic in the sense of Definition 3, as the optimal sequence of actions at state 1 is ab̃. A maximal connected subset of region∗(a) is a subset of region∗(a) that is connected and is not a proper subset of any other connected subset of region∗(a).


Figure 3: (a) A simple instance of the goal model where closedness of optimal regions does not hold. The numbers on the transitions are the probabilities of the transitions. (b) The optimal policy.

An important conjecture is:

Conjecture 3.4 (1) ∀a ∈ Σ, region∗(a) contains finitely many maximally connected subregions, and (2) these regions have piecewise linear boundaries (in the sense of [SS73]).

This property certainly holds for finite-horizon problems, but it is not clear that the properties still stand in the limit to the infinite horizon. Also, the presence or absence of discounting may make a difference. The significance of this conjecture in general is described in the next section. We next show that in two dimensions, if Conjecture 3.4 holds, optimal sequences are periodic, from which the computability of the computational problems follows. The two-dimensional simplex may be mapped to the interval X = [0, 1], in which case a point x ∈ X represents the probability of being in one of the two states, state 1 say. The stochastic matrices become linear functions from X to X (see Figure 4(a)). Let L denote the set of linear functions mapping X to X. Conjecture 3.4 then states that for each action a, region∗(a) is composed of a finite number of subintervals of X. As the number of actions is finite, assuming Conjecture 3.4 holds, it is easily seen that the interval X can be partitioned into a finite number of nontrivial (nonzero-length) subintervals over each of which a single action is optimal. In this case, we have a finite number of border points in X. Border points are the endpoints of two adjacent subintervals. Over each subinterval Ii, one linear function fi ∈ L is defined, where fi corresponds to the transition matrix of the optimal action chosen over Ii. See Figure 4(b).

Figure 4: (a) Several linear functions from [0, 1] to [0, 1]. l1 and l2 correspond to the identity and permutation matrices respectively. (b) A partition of [0, 1] into three regions, over each of which a linear function is defined. There are two border points.

A few definitions follow. Let f : X → X. We define f^k(x), k ≥ 0, inductively: f^0(x) = x, and f^k(x) = f(f^{k−1}(x)). Extend the notation f to sets: f(S) = {f(x) | x ∈ S}. A point x ∈ X is called cyclic (for f) if for some k > 0, f^k(x) = x, and it is called ultimately cyclic if for some j, k > 0 with k ≠ j, f^k(x) = f^j(x). Define the orbit, or the set of images, of x, denoted by O(x), to be the set O(x) = {f^k(x), k ≥ 0}. A point y is a cluster point of a set S if y ∉ S and ∀ε > 0, ∃x ∈ S such that ‖y − x‖ < ε, where ‖·‖ denotes the Euclidean norm. Denote by cl(S) the set of cluster points of a set S. Note that for an ultimately cyclic point x, O(x) is finite and cl(O(x)) = ∅. A point y is a kth-order preimage of a point x if f^k(y) = x, where k ≥ 0 is the smallest such power. Denote the set of preimages of a point x by pre(x): pre(x) = {y | f^k(y) = x, for some k ≥ 0}. We say pre(x) has finite order if ∃K, ∀y ∈ pre(x), f^j(y) = x for some j < K. Basically, except for the possibility of a constant function over some interval, when we say pre(x) has finite order, it follows that pre(x) is finite. The following lemmas basically say that, given a function f with a finite number of discontinuities, each continuous piece being a function in L, the images of a point eventually visit the same regions in a periodic manner. Each point will either be cyclic, or its images will converge to a finite number of cluster points; the cluster points, if not themselves cyclic, become cyclic with an appropriate redefinition of f at the border points. The next lemma makes connections between some of the notions just defined:

Lemma 3.5 If, for every border point b ∈ X (there are finitely many), pre(b) has finite order, then ∀x, |cl(O(x))| < ∞.

Proof. Consider any x ∈ X, and the set of cluster points of O(x); we show this set is finite. If O(x) has no cluster points (O(x) is finite), we are done. Otherwise, for each cluster point c, consider the open interval on the side of c where O(x) has infinitely many points. There must exist an ε > 0 such that this interval, say (c − ε, c), contains no preimage of any border point; otherwise, we would get infinitely many preimages of the same order, and we would have to conclude that x is cyclic. Now extend ε so that the other end of the interval is an endpoint of X, or a point in pre(b) for some border point b. If c is not in pre(b) for any border point b, then extend this interval to the right as well, up to the first preimage of some border point or an endpoint of X. Let the interval be I = (c − ε, c + δ), where δ ≥ 0. As |O(x) ∩ (c − ε, c + δ)| = ∞, and over the images of the interval the same function is applied each time, we must conclude that for some k, f^k(I) ⊂ I, i.e., f^k acts as a contraction mapping with a unique fixed point, which must be c, from which it follows that the number of cluster points is at most k. □

Theorem 3.6 Assume the interval [0, 1] is partitioned into a finite number of subintervals, and there exists a function f : [0, 1] → [0, 1] such that, restricted to each subinterval, f is a linear function from L. Then ∀x ∈ [0, 1], O(x) has a finite number of cluster points.

Proof. If there is only one subinterval the statement holds: if f(x) = x or f(x) = 1 − x, all points are cyclic, and otherwise f is a contraction mapping with a unique cyclic point (fixed point) x0, and cl(O(x)) = {x0}, ∀x ≠ x0. Below we use the fact that whether, for any border point b, pre(b) is finite or not is independent of the function definition at b.

Assume there are two subintervals, with a border point b. We show that pre(b) has finite order and then use Lemma 3.5. For convenience, assume that over neither interval a constant function is defined, so that pre(x) having finite order is equivalent to pre(x) being finite. Let S = X − ∩_{k≥0} f^k(X). Using induction on k ≥ 1, x ∈ S ⇒ |pre(x)| < ∞, as one can show that for x ∈ ∩_{k≤K} f^k(X) − ∩_{k≤K+1} f^k(X), pre(x) ⊂ ∩_{k≤K} f^k(X). It follows that O(x) ∩ S ≠ ∅ ⇒ x ∈ S. If x ∉ S but pre(x) is finite, x must be cyclic. It is not hard to see that either S is empty or it contains some interval. If it is empty, all the pieces of f must be the identity (y = x) or the permutation (y = 1 − x), and all points are cyclic. Otherwise, consider the endpoints of an interval of S. If they are the endpoints of X, so that closure(S) = X, we are done again, as all points must be in S or possibly cyclic (the endpoints). So assume one endpoint x of an interval is in the interior of X, with either x ∈ S or not. We will conclude from this that the border point b must be an endpoint of an interval in S, from which we show that the result follows.

Say [x, x + ε) ⊂ S but (x − δ, x) is not, and assume b ∉ S. We can conclude that b maps to x (x ∈ O(b)) if the function definition at b is changed to the other function, the one defined on the interval siding b. This basically follows because the inverse image of the open set of points (x − δ, x) must be open, and when at some K, x was added to S (x ∈ ∩_{k≤K} f^k(X) − ∩_{k≤K+1} f^k(X)) while the open interval siding it was not, it must be that b sides with (i.e., is an endpoint of) the inverse of the open interval (so the function f is discontinuous there). Now, if we change the definition of the function at b, it follows that x ∉ S under such a change, as b ∉ S by assumption. We show that x is still the endpoint of an interval I belonging to S (this time, an open interval). We have O(x) ⊂ S̄; the possibility is that x ∈ cl(O(x)), so that x may no longer side with an interval in S. However, if this is the case, either there are arbitrarily close preimages of b in I, or x must be cyclic (take the largest open interval containing x and no preimage of b and apply f). x can't be cyclic, as we assumed x ∈ cl(O(x)), so assume x maps to a point y back in (x, x + ε/3) after k mappings. If there is no preimage of order k or fewer between x and its kth image y, again we must conclude that either x is in fact cyclic (and no preimage of b can exist), a contradiction, or b sides with S: Take the lowest-ordered preimage p, and map the interval [x, p], where y ∈ [x, p], until p maps to b. The image of y is between x and b, and with a few mappings x maps to y, and either all points between y and x map inside [x, y], or all points, including possibly b, between y and b map to [y, p]; therefore b sides with S. Finally, if in [x, y] there is a preimage of lower order than k, pick the closest one to x (there are finitely many), call it p, and assume p has order m ≤ k. Then the interval [x, p] maps to [f^m(x), b], and since [f^m(x), b) falls back in I, as f^m(x) maps to y before any preimage of b breaks [f^m(x), b), we must conclude that b sides with S.

Assume (x, x + ε) ⊂ S but x ∉ S. This is the case where the endpoint of an open interval of S is not in S. We can use a 3-preimage argument to conclude that x is cyclic or, again, that b sides with S. If there are no 3 preimages of x within ε/3, x must have a finite number of preimages, or x is cyclic, and we are done. Pick the three so that no preimage of lower order exists between them. If there are no preimages of b of lower order than these 3 preimages between any two of them, we must conclude that both sides of x belong to S, as each preimage has a side that belongs to S. If there are such preimages of b, we must conclude that b sides with S. If both sides of x belong to S for some width ε, say, images of x cannot fall arbitrarily close to x, or x would belong to S. Again, x cannot have arbitrarily close preimages (pick any consecutive two that are less than ε apart), or b must side with S.

Finally, assume b sides with S, say on its right side, i.e., (b, b + ε) ⊂ S. We show that either |pre(b)| < ∞, or b must side with S on the other side too. This would show that b ∉ cl(O(b)), as b ∉ S, which would again show |pre(b)| < ∞. Let ε be such that (b, b + ε) ⊂ S, and pick three consecutive preimage points x, y, z ∈ pre(b), x < y < z, z − x < ε/2, with (O(x) ∪ O(y) ∪ O(z)) ∩ (b, b + ε) = ∅. If no such points exist, |pre(b)| < ∞. We can't have any point of pre(b) with a lower order which maps first into (b, b + ε) between any two of these either; otherwise at least one of x, y, z would also map into (b, b + ε), and hence would be in S. Since each of the three points carries points of S on at least one side, we have to conclude that on both sides of one of these points, for some δ > 0, and hence around b, all points belong to S. So if b ∉ S, then b ∉ cl(O(b)), and we are done by Lemma 3.7. The case of more than 2 border points can be proved by induction on the number of border points. □

Lemma 3.7 b ∉ cl(O(b)) ⇒ pre(b) has finite order.

Proof. One can easily verify this if one of the functions on a side of b is the constant function: on such a side, all points have the same order or do not map to b, and, if necessary, we can change the function at b so that at b the same constant function is defined, and the result follows. Otherwise, consider the side of b where the same function is defined at b and on the siding interval. We can't have arbitrarily close preimages there; otherwise applying the function to both b and the closest preimage of some order implies that images of b fall arbitrarily close to b, contradicting b ∉ cl(O(b)). Change the function definition at b to conclude the same on the other side. So on both sides of b, the preimages of b keep a distance from b, from which it follows that the preimages of b are not closer than some ε to one another, or pre(b) is finite. □

Note that for the function f(x) = x + θ mod π, which can be considered as two linear functions mapping [0, π] to [0, π], where θ is irrational in degrees, finiteness of cl(O(x)) does not hold for all x. The class of functions for which we proved the above (those in L) are all contraction mappings, except for the identity and permutation functions, and this fact was crucial in the proof of Theorem 3.6 when we considered the emptiness of the set S defined in the proof. Interestingly, recall that a function of the form f(x) = x + θ mod π was, in effect, what produced the nonperiodicity of action sequences in 3 dimensions in the last section.

The following corollary is a direct consequence of Theorem 3.6 and Lemma 3.5, and its proof is very similar to the proof of Lemma 3.5.

Corollary 3.8 Assume the conditions of Theorem 3.6 hold. Then ∀x ∈ [0, 1], f^k(x) eventually visits some subset of the intervals in a periodic order, i.e., ∃N, ∃ period p, ∀k ≥ N, f^k(x) ∈ I ⇔ f^{k+p}(x) ∈ I, where I is a subinterval of the partition.

Note that while we can have similar conditions hold on the three-dimensional simplex, i.e., a finite number of regions and the orbit of every point having a finite number of cluster points, periodicity does not follow, as the example of the previous section shows. In that example, the simplex is partitioned into 2 regions, the stationary point is cyclic, and all other orbits have the stationary point of the matrix as their only cluster point. A consequence of the above results is that, assuming Conjecture 3.4 holds, the optimal value function in two dimensions is piecewise linear, i.e., it consists of a finite number of linear pieces.

We sketch an algorithm for computing an optimal policy and the optimal value function given Conjecture 3.4. Basically, the algorithm "dovetails" through the possibilities. Because there exists a policy with a finite number of intervals, periods are finite, and a potential optimal value function can be tested. An algorithm that systematically tries the possibilities, for example the length and the contents of the period, terminates: Given a candidate period, the points that are cyclic for that period can be computed. For example, if ab is a candidate for a period, the two cyclic points can be identified by crossing the linear functions corresponding to MaMb and MbMa with the identity line y = x (a code sketch follows). The values at these points can be computed readily. The algorithm can find out whether the values are indeed optimal by applying the maximization operator to the value function created a finite number of times. Note that there can be multiple cyclic classes, but, excepting the degenerate cases of identity and permutation matrices (easily handled), there can only be a finite number of cyclic classes covering the whole interval, and each one can be found in a finite number of steps. Another algorithm would use finite-horizon approximations to get sufficiently close to the infinite-horizon optimal partition, and then guess the cyclic classes and construct the value function from there.
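A minimal sketch of the cyclic-point computation in the two-state setting, assuming Python with numpy (the names and the 0-based indexing are ours). A 2×2 stochastic matrix Ma acts on x = Pr[state 1] as the linear function f_a(x) = Ma[1,0] + (Ma[0,0] − Ma[1,0])x; for a candidate period such as ab, the cyclic points are the fixed points of the cyclic compositions f_b∘f_a and f_a∘f_b, found by crossing them with y = x:

    import numpy as np

    def as_line(Ma):
        """Reduce a 2x2 stochastic matrix to f(x) = c + s*x on [0, 1],
        where x is the probability of being in state 1 (index 0)."""
        p, q = Ma[0, 0], Ma[1, 0]
        return q, p - q

    def cyclic_points(Ms):
        """Fixed points of each cyclic rotation of the candidate period,
        e.g. for Ms = [Ma, Mb] those of f_b(f_a(x)) and f_a(f_b(x))."""
        points = []
        for r in range(len(Ms)):
            c, s = 0.0, 1.0                       # start from the identity map
            for Mat in Ms[r:] + Ms[:r]:
                q, slope = as_line(Mat)
                c, s = q + slope * c, slope * s   # compose the affine maps
            if abs(1.0 - s) > 1e-12:              # cross with the line y = x
                points.append(c / (1.0 - s))
        return points

The candidate values at the cyclic points can then be checked for optimality by applying the maximization operator a finite number of times, as described above.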

3.2 Open Problems

Besides being a statement of a basic property of UMDP models, Conjecture 3.4, together with Conjecture 3.9 (below), would show that the only source of noncomputability in UMDPs is the possibility of nonperiodicity of optimal action sequences at some points. The nonperiodicity would happen only when the cluster points of some orbits fall on a boundary point. Even in cases of nonperiodicity, the boundaries between optimal regions may be computable, thereby showing that prefixes of optimal action sequences are computable. Hence, establishing these conjectures should shed light on the decidability of the open problems listed at the end of the section.

Conjecture 3.9 Consider a finite partitioning of the simplex X with piecewise linear boundary regions, where over each partition a linear function mapping X to X is defined. Then ∀x ∈ X, |cl(O(x))| < ∞.

We briefly address ways of proving Conjecture 3.4. One possible approach is to use the fact that for finite-horizon problems the conjecture holds true, and use a limiting argument. However, there are properties, such as the finiteness of the number of linear pieces of the optimal value function under the finite-horizon criterion, that fail for the infinite horizon. The following conjecture makes use of the closedness of the optimal regions and is expressed for the two-state case only. The truth of the conjecture implies the existence of points such that, if a proper choice of optimal actions is executed at them, such points become cyclic. The existence of such points would then impose a constraint on the optimal value function, to show that the interval [0, 1] can be partitioned into a finite number of regions over each of which a single action is optimal.

Conjecture 3.10 Consider a finite set of functions f1, ···, fk ∈ L, and any assignment a : [0, 1] → {c1, ···, ck}, k finite. Then there exists an assignment a′, where a′(x) = a(x), or a′(x) = cj if there exists a sequence {x_i}_{i=1}^∞ in [0, 1] converging to x with a(x_i) = cj (all assigned the same class cj), such that the function f : [0, 1] → [0, 1], defined by f(x) = f_i(x) when a′(x) = c_i, has a cyclic point.

The generalization of the conjecture to the simplex in n dimensions might also hold, and would have the same consequences. We close the section with a few open problems for discounted UMDPs.

• Are the following problems decidable: computing the first bit of the optimal value, or the first bit of an optimal sequence? Or the emptiness problem for discounted UMDPs: given a rational threshold τ, is the optimal value greater than τ?

• Goal-oriented discounted models: The example for the nonperiodicity of optimal action sequences made use of the reward vectors. Periodicity may actually hold in the case of discounted goal-oriented models. This model would be a PFA that terminates with probability one (a terminating PFA), i.e., with each action execution, the PFA has a nonzero probability of ending in an absorbing rejecting state. A related problem is whether the languages accepted by terminating PFA's are regular.

4 Summary and Conclusions

We investigated several computational problems for infinite-horizon UMDPs. The reduction in [CL89] shows that many natural questions on undiscounted UMDPs are undecidable. These results apply to related problems, such as probabilistic planning in its general form, as well. In the discounted model, the status of several decision problems is not settled. Some problems, such as approximation of the optimal value, are clearly computable. However, an adapted example from [Paz71] shows that noncomputability still lurks in this model. Problems such as computing the first bit of the optimal value or of an optimal action sequence may be decidable. We explored the area to some extent and suggested a few avenues of research and open problems.

The focus of this investigation was on computability. Investigators have shown that the computational complexity of solving POMDPs exactly, and/or without restrictions on partial observability, is very high [PT87, Lit96, MGA97]. A promising research direction is to look at these models under restrictions on the degree of unobservability, and with the less ambitious goal of approximating the optimal value.

Acknowledgements

Many thanks to Steve Hanks for motivation and support in this research, and to Steve Hanks, Anne Condon, Dick Karp, and Jim Burke for the meetings and discussions. The research on the undiscounted problems was done in collaboration with Anne Condon, and Jim Burke gave valuable advice on the discounting model.

References

[Ber87] D. P. Bertsekas. Dynamic Programming: Deterministic and Stochastic Models. Prentice Hall, 1987.

[CL89] Anne Condon and Richard Lipton. On the complexity of space bounded interactive proofs. In 30th Annual Symposium on Foundations of Computer Science, 1989.

[Eve95] H. R. Everett. Sensors for Mobile Robots. 1995.

[Fre81] R. Freivalds. Probabilistic two-way machines. In Proc. International Symposium on Mathematical Foundations of Computer Science, volume 118, pages 33–45. Springer-Verlag, 1981.

[KHW95] N. Kushmerick, S. Hanks, and D. S. Weld. An algorithm for probabilistic planning. Artificial Intelligence, 76:239–286, 1995.

[Lit96] Michael Littman. Algorithms for Sequential Decision Making. PhD thesis, Brown University, 1996.

[MGA97] M. Mundhenk, J. Goldsmith, and Eric Allender. The complexity of the policy existence problem for partially-observable finite-horizon Markov decision processes. In Mathematical Foundations of Computer Science, pages 129–138, 1997.

[Niv56] I. Niven. Irrational Numbers. 1956.

[Paz71] Azaria Paz. Introduction to Probabilistic Automata. Academic Press, 1971.

[PT87] Christos H. Papadimitriou and John N. Tsitsiklis. The complexity of Markov decision processes. Mathematics of Operations Research, 12(3):441–450, August 1987.

[Put94] Martin L. Puterman. Markov Decision Processes. Wiley Interscience, 1994.

[Rab63] M. O. Rabin. Probabilistic automata. Information and Control, pages 230–245, 1963.

[SS73] R. Smallwood and E. Sondik. The optimal control of partially observable Markov processes over a finite horizon. Operations Research, 21:1071–1088, 1973.

[Str65] Charlotte Striebel. Sufficient statistics in the optimum control of stochastic systems. Journal of Mathematical Analysis and Applications, 12:576–592, 1965.

A The Emptiness Reduction

Here, we sketch the reduction of [CL89] in some more detail. Both the technique for the weak-equality test, which is based on Freivalds' [Fre81] method, and the construction of the reduction around this test are elegant. The class of TMs used in the reduction in [CL89] is 2-counter TMs, which are as powerful as general TMs. The constructed PFA has the task of detecting whether a sequence of computations represents a valid accepting computation (called an accepting sequence) of the TM. This task reduces to the problem of checking the legality of transitions from one configuration of the TM to the next, i.e., according to the TM state transition rules, and whether the first and last configurations are respectively the start and the accepting configurations of the TM. It is not hard to verify that all these checks can be carried out by a deterministic finite state automaton, except the check of whether the counter contents of the TM remain valid across consecutive configurations of the TM. The PFA rejects immediately if any of the easily verifiable transition rules are violated. That leaves the checking of the counter contents. The counter contents stay the same or get incremented or decremented by 1 on each computation step of a 2-counter TM. Hence the testing of the counter contents from one configuration to the next reduces to checking whether two strings have the same length, as we may assume the counter contents are represented in unary, and the problem reduces to the following equality problem: given a^n b^m, does m = n? We next explain, at a high level, the properties of a weak-equality test which can be carried out by a PFA to attempt to answer the equality question in a very limited sense. However, this limited capability is sufficient, and we then describe how the properties of the weak-equality test are utilized in the reduction. In the next subsection, for completeness, we explain the weak-equality test in more detail to show how it has these properties. The technique was adapted in [CL89] from Freivalds [Fre81], who used it to show the power of 2-way PFA's. The weak-equality test has the following property: In a scan of a^n b^m, with high probability, the PFA enters an "indecision" state, and we say the outcome of the test is "indecision". However, with some (small) probability the PFA conducting the test may make a decision: In this case, if the strings have equal length, the PFA, with equal probability, ends in either a "correct" state (correct test outcome) or a "suspect" state (suspect outcome). However, if the strings have unequal length (and given that the PFA is not in the "indecision" state), the probability that the PFA ends in the "suspect" state is k times higher than the probability of ending in a "correct" state. The quantity k can be made as large as desired by increasing the size of the PFA.

We go back to the reduction. The reduction constructs a PFA that, given a candidate accepting sequence of the TM, and utilizing the weak-equality test, carries out a global test of its own. The PFA applies the weak-equality test, properly adjusted for detecting increments or decrements, to the counter contents of consecutive configurations. Given a candidate accepting string, if the outcomes of all the tests were decisive and "correct" (and the other criteria are met, for example the last configuration is an accepting one), the PFA goes into an accepting state. If the outcomes of all the tests were "suspect", the PFA goes into a rejecting state. Otherwise, the PFA remains in the "global-indecision" state until it detects the start of the next candidate accepting string (the start configuration of the TM), or it rejects if the end of the input is reached. In this case, if the TM accepts the empty string, we observe that the probability of acceptance of the PFA can reach the upper limit 0.5 by concatenating accepting sequences. If the TM does not accept, it follows from the properties of the weak-equality test that the probability of the PFA accepting any string is no larger than 1/k. By making a minor adjustment to the PFA, the upper limit can be raised to any desired probability close to 1: Instead of the PFA rejecting when it sees all-"suspect" outcomes on a completed sequence of configurations, it may increment an all-suspect counter, one with a finite upper limit K. The PFA accepts if and only if it sees all-"correct" outcomes on a candidate sequence before the upper limit of its all-suspect counter is reached. The acceptance probability of the PFA still doesn't exceed 1/k if the TM is not accepting, while it accepts some strings with probability arbitrarily close to 1 − (0.5)^K in case the TM is accepting.

A.1 The Weak Equality Test

Here we describe the algorithm for the weak-equality test in some detail. Given is the input a^n b^m, and the question is whether m = n. The outputs of the algorithm (the state the PFA ends in) are "indecision", "suspect", or "correct", as described in the previous section. The algorithm performs the following as it scans the input:

1. For each letter a, the algorithm flips two (fair) coins.

2. For each letter b, the algorithm flips two coins.

3. For each letter, a or b, the algorithm flips one coin.

4. For each letter, a or b, the algorithm flips one coin.

5. The algorithm checks whether m ≡ n (mod p), where p is a sufficiently large constant.

It can be easily verified that the algorithm can be carried out by a PFA. If m ≢ n (mod p), then the outcome is "suspect". Otherwise, let event A be true if step 1 or step 2 gets only heads, and let event B be true if step 3 or step 4 gets only heads. The algorithm outputs "indecision" (the PFA goes into the indecision state) if both A and B are false (the common case), or both are true. Otherwise, the algorithm outputs "suspect" if A is true and B is false, and it outputs "correct" if A is false and B is true. It is easily seen that if m = n, then A and B have the same probability of being true. If m ≠ n, given that the algorithm does not output "indecision", and with p reasonably large, the probability of outputting "suspect" is much higher than that of outputting "correct".
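A small simulation of the test, assuming Python (all names are ours; the modulus p is taken unrealistically small here just so that decisive outcomes are visible):

    import random
    from collections import Counter

    def weak_equality_test(n, m, p):
        """One run of the weak-equality test on a^n b^m."""
        if n % p != m % p:
            return 'suspect'                       # step 5 fails outright
        heads = lambda k: all(random.random() < 0.5 for _ in range(k))
        A = heads(2 * n) or heads(2 * m)           # steps 1, 2: two coins per letter
        B = heads(n + m) or heads(n + m)           # steps 3, 4: one coin per letter
        if A == B:
            return 'indecision'
        return 'suspect' if A else 'correct'

    # Equal lengths: 'suspect' and 'correct' come out roughly equally often.
    print(Counter(weak_equality_test(3, 3, p=4) for _ in range(100000)))
    # Unequal lengths (3 = 7 mod 4): 'suspect' dominates 'correct' heavily.
    print(Counter(weak_equality_test(3, 7, p=4) for _ in range(100000)))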

B Optimal Value Functions and Optimal Policies

Here, we include proofs for a few properties of the optimal value functions V ∗ and optimal policies p∗ . These properties include convexity of V ∗ and closure of optimal action regions. For completeness, we include the notion of contraction mappings and fixed-points, and some of the associated proofs. Most of the derivations are standard and can be found for example in [Ber87].

B.1 Contraction Mappings and the Fixed-Point Property

For a real-valued function V on a set X, let $\|V\| = \sup_{x \in X} |V(x)|$. Let B(X) denote the set of bounded real-valued functions on X: for $V \in B(X)$, $\|V\| < M$ for some $M \in \mathbb{R}$. A mapping $H : B(X) \to B(X)$ is said to be a contraction mapping if $\exists \beta \in \mathbb{R}$, $0 \le \beta < 1$, such that $\forall V, V' \in B(X)$: $\|HV - HV'\| \le \beta \|V - V'\|$. Let $H^k$ be the mapping H composed with itself k times: $H^0 V = V$, and $H^k V = H(H^{k-1} V)$ for $k \ge 1$. The following theorem is a version of the Banach fixed-point theorem for contraction mappings:

Theorem B.1 If $H : B(X) \to B(X)$ is a contraction mapping, then $\exists V^* \in B(X)$ such that $HV^* = V^*$ and $\forall V \in B(X)$, $\lim_{k\to\infty} \|H^k V - V^*\| = 0$.

Proof. Let $V \in B(X)$. For $k \ge 0$, let $d_k = \|H^{k+1} V - H^k V\|$. We have $d_0 \in \mathbb{R}$, and from the contraction assumption, with $\beta$ the contraction factor, $d_k \le \beta^k d_0$. Therefore $\lim_{k\to\infty} d_k = 0$. We next show that $\forall x \in X$ the sequence $\{(H^k V)(x)\}$ is bounded; since in addition $\sum_k d_k \le d_0/(1-\beta)$ converges, each sequence $\{(H^k V)(x)\}$ is Cauchy, hence convergent, and the limit function is bounded. We have (from contraction):

$$\forall k \ge 0,\ \|H^{k+1} V - HV\| \le \beta \|H^k V - V\| \;\Rightarrow\; \forall k \ge 1,\ \|H^{k+1} V\| \le \beta(\|H^k V\| + \|V\|) + \|HV\| \;\Rightarrow$$
$$\forall k \ge 1,\ \|H^{k+1} V\| \le \Big(\sum_{0 \le i \le k} \beta^i\Big) \|HV\| + \Big(\sum_{1 \le i \le k} \beta^i\Big) \|V\| \;\Rightarrow\; \forall k \ge 0,\ \|H^k V\| \le \frac{\|HV\| + \|V\|}{1 - \beta},$$

and V and HV are bounded. Next we show that the function $V^* \in B(X)$, defined by $V^*(x) = \lim_{k\to\infty} (H^k V)(x)$, is a fixed-point of H. But first, we have $\forall \epsilon > 0$, $\exists N$, such that:

$$\|H^{N+1} V - H^N V\| \le \epsilon \;\Rightarrow\; \forall k \ge 0,\ \|H^{N+k+1} V - H^{N+k} V\| \le \beta^k \epsilon \;\Rightarrow$$
$$\forall k \ge 0, \forall l \ge 0,\ \|H^{N+k+l} V - H^{N+k} V\| \le \Big(\sum_{0 \le i < l} \beta^{k+i}\Big) \epsilon \le \frac{\beta^k \epsilon}{1 - \beta},$$

so the convergence of $H^k V$ to $V^*$ is uniform, and given any $\epsilon > 0$ we can find an integer N such that $\forall k \ge N$, $\|V^* - H^k V\| \le \epsilon$ and $\|H^{k+1} V - H^k V\| < \epsilon$. For such k, $\|H^{k+1} V - HV^*\| \le \beta \|V^* - H^k V\| \le \epsilon$, and we have $\|HV^* - V^*\| \le \|H^k V - V^*\| + \|H^{k+1} V - H^k V\| + \|H^{k+1} V - HV^*\| \le 3\epsilon$. Therefore $\forall \epsilon > 0$, $\|HV^* - V^*\| \le 3\epsilon$, and hence $HV^* = V^*$. We have shown that for arbitrary $V \in B(X)$, $H^k V$ converges uniformly to a fixed-point $V^*$ of H. The last statement of the theorem follows if there is a unique fixed-point, which can be shown using what else but the contraction property: let $V_1^*$ and $V_2^*$ be two fixed-points of H. Then $\|HV_1^* - HV_2^*\| \le \beta \|V_1^* - V_2^*\|$, but $HV_1^* = V_1^*$ and $HV_2^* = V_2^*$, so $\|HV_1^* - HV_2^*\| = \|V_1^* - V_2^*\| \le \beta \|V_1^* - V_2^*\| \Rightarrow \|V_1^* - V_2^*\| = 0$. □
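As a concrete illustration of Theorem B.1, the sketch below (our own; the operator $HV = R + \beta M V$, for a row-stochastic matrix M, is just one convenient contraction on B(X) with X finite) iterates a β-contraction from the zero function, shows the distances $d_k$ shrinking geometrically, and compares against the closed-form fixed-point:

    import numpy as np

    beta = 0.9
    rng = np.random.default_rng(0)
    M = rng.random((4, 4))
    M /= M.sum(axis=1, keepdims=True)   # rows sum to 1, so ||M V||_inf <= ||V||_inf
    R = rng.random(4)

    H = lambda V: R + beta * (M @ V)    # a beta-contraction in the sup norm

    V = np.zeros(4)
    for k in range(200):
        V_next = H(V)
        if k % 40 == 0:
            print(k, np.max(np.abs(V_next - V)))       # d_k <= beta^k * d_0
        V = V_next

    V_star = np.linalg.solve(np.eye(4) - beta * M, R)  # closed-form fixed-point
    print(np.max(np.abs(V - V_star)))                  # ~1e-9: H^k V has converged to V*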

B.2 Policies and Value Functions

We are concerned with the set P of policies of the form $p : X \to \Sigma$.² Define the operators $H_p : B(X) \to B(X)$, where $p \in P$, and $H^* : B(X) \to B(X)$ (the Bellman operator) as:

$$H_p V(x) = x R_{p(x)} + \beta V(x M_{p(x)}), \qquad (7)$$
$$H^* V(x) = \max_{a \in \Sigma}\, [x R_a + \beta V(x M_a)]. \qquad (8)$$

These operators originate from expressing value functions for a horizon of k (for a policy p or the optimal policy) in terms of the value functions for the horizon of $k-1$. The operators are a shorthand and eliminate writing pointwise equations. They also allow us to make connections with the contraction mappings of the previous section.

Lemma B.2 (contraction property) $H_p$ and $H^*$ are contraction mappings.

Proof. Let V and V′ be in B(X). $\forall x \in X$, $|H_p V(x) - H_p V'(x)| = \beta |V(x M_{p(x)}) - V'(x M_{p(x)})| \le \beta \|V - V'\| \Rightarrow \|H_p V - H_p V'\| \le \beta \|V - V'\|$. A similar argument works for the case of the operator $H^*$. □

It follows from Theorem B.1 that both $H_p$ and $H^*$ have unique fixed-points, which we denote by $V_p$ and $V^*$ respectively, and that $\forall V \in B(X)$, $H_p^k V$ converges to $V_p$ and $H^{*k} V$ converges to $V^*$. For $V, V' \in B(X)$, we write $V \le V'$ when $\forall x \in X$, $V(x) \le V'(x)$. The next lemma shows that $\forall p \in P$, $V_p \le V^*$. Hence the ≤ relation induces a partial ordering on the space of fixed-point (and finite-horizon) value functions, where $V^*$ is the top point.
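In code, one application of these operators at a belief point is immediate. The sketch below (our illustration; the dictionaries M and R, holding one transition matrix and one reward vector per action, and the toy numbers are assumed, not taken from the text) evaluates eq. 8 with V supplied as a callable:

    import numpy as np

    def H_star(x, V, R, M, beta):
        """One application of the Bellman operator H* (eq. 8) at belief vector x.
        Returns (H*V(x), an action attaining the max). Fixing the action to p(x)
        instead of maximizing gives the operator H_p of eq. 7."""
        vals = {a: float(x @ R[a]) + beta * V(x @ M[a]) for a in M}
        a_best = max(vals, key=vals.get)
        return vals[a_best], a_best

    # Toy two-state UMDP with actions "a" and "b" (illustrative numbers only).
    M = {"a": np.array([[0.9, 0.1], [0.2, 0.8]]),
         "b": np.array([[0.5, 0.5], [0.5, 0.5]])}
    R = {"a": np.array([1.0, 0.0]), "b": np.array([0.3, 0.3])}
    x = np.array([0.6, 0.4])
    print(H_star(x, V=lambda y: 0.0, R=R, M=M, beta=0.95))  # one-stage value from V = zero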

² It can be shown that an optimal policy need only be a (stationary) function of the distribution over states, and need not be randomized.


Lemma B.3 (monotonicity) Let $V, V' \in B(X)$, $p \in P$. Then $V \le V' \Rightarrow H_p V \le H_p V'$ and $H^* V \le H^* V'$.

Proof. Take any $x \in X$. Say action a maximizes the right side of eq. 8. Then $H^* V(x) = x R_a + \beta V(x M_a) \le x R_a + \beta V'(x M_a)$, since $V(x M_a) \le V'(x M_a)$, and $x R_a + \beta V'(x M_a) \le \max_{a \in \Sigma} [x R_a + \beta V'(x M_a)] = (H^* V')(x)$. The argument for $H_p$ is simpler: $H_p V(x) = x R_{p(x)} + \beta V(x M_{p(x)}) \le x R_{p(x)} + \beta V'(x M_{p(x)}) = (H_p V')(x)$. □

Corollary B.4 $\forall V \in B(X)$, $\forall p \in P$, $\forall k > 0$, $H_p^k V \le H^{*k} V$.

Proof. By induction. The claim is not hard to see for $k \le 1$. If for some $k \ge 0$, $H_p^k V \le H^{*k} V$, then $H_p^{k+1} V = H_p(H_p^k V) \le H_p(H^{*k} V)$ from the monotonicity of $H_p$. But by the definitions of $H_p$ and $H^*$, $H_p(H^{*k} V) \le H^*(H^{*k} V) = H^{*(k+1)} V$. □

Corollary B.5 $\forall p \in P$, $V_p \le V^*$.

Proof. Let $V \in B(X)$. The corollary follows from the convergence of $H_p^k V$ to $V_p$ and of $H^{*k} V$ to $V^*$: let $x \in X$; then, for any $k \ge 0$,

$$V_p(x) - V^*(x) = [V_p(x) - H_p^k V(x)] + [H_p^k V(x) - H^{*k} V(x)] + [H^{*k} V(x) - V^*(x)].$$

But $\lim_{k\to\infty} [V_p(x) - H_p^k V(x)] = \lim_{k\to\infty} [H^{*k} V(x) - V^*(x)] = 0$, while $H_p^k V \le H^{*k} V$. □

It follows that $V^*(x)$ is the best value that a policy can achieve at a point x over the infinite horizon. Notice that the action that the operator $H^*$ prescribes at a point (the action maximizing eq. 8) can change with different functions $H^{*k} V$ (as k increases). Can any policy in P achieve $V^*$? The answer is positive, and the function $V^*$ tells us how: any mapping $p^* \in P$ such that $\forall x \in X$, $p^*(x) \in \arg\max_{a \in \Sigma} [x R_a + \beta V^*(x M_a)]$ is well defined, and $H_{p^*} V^*(x) = H^* V^*(x) = V^*(x)$, i.e., the unique fixed-point of $H_{p^*}$ is $V^*$; hence the execution of policy $p^*$ over the infinite horizon is expected to yield the value function $V^*$. $V^*$ is called the optimal value function for the infinite horizon. We next elaborate on the properties of $V^*$ and $p^*$.

Lemma B.6 $H^*$ preserves continuity.

Proof. Let $V \in B(X)$ be continuous. Then for $a \in \Sigma$, the function $x R_a + \beta V(x M_a)$ is also continuous: the composition of continuous functions and their summation preserve continuity. Therefore $H^* V$, the pointwise maximum of these functions (one function for each action), is also continuous. □

A piecewise linear function is a function that is linear over each part of a finite partitioning of X.

Lemma B.7 If V is piecewise linear, then $H^* V$ is piecewise linear. If in addition V is convex, then $H^* V$ is also convex.

Proof. Let $V \in B(X)$ be piecewise linear and convex. Then for $a \in \Sigma$, the function $x R_a + \beta V(x M_a)$ is also piecewise linear and convex: the composition of piecewise linear functions and their summation preserve piecewise linearity, and it is not hard to verify that, if V is convex,

then, because $x M_a$ is a linear function, $V(x M_a)$ is convex. Therefore $H^* V$, the pointwise maximum of these functions (one function for each action), is also piecewise linear and convex. □

Lemma B.7 shows that the finite-horizon optimal value function is piecewise linear and convex: for a finite horizon of k stages, the optimal value function is simply $H^{*k}$ applied to the zero function, and the zero function is piecewise linear, continuous, and convex.

Corollary B.8 $V^*$ is continuous and convex.

Proof. Let $V \in B(X)$ be continuous, piecewise linear, and convex (e.g., the zero function). Then the functions $H^{*k} V$, $\forall k \ge 0$, are continuous, piecewise linear, and convex, and the convexity and continuity of $V^*$ follow by the uniform convergence of $H^{*k} V$ to $V^*$. □

The monotonicity and continuity preservation would hold even if $H^*$ were not a contraction mapping. Using these facts, one can show that $V^*$ is convex in the absence of discounting as well (for a goal model, $V^*$ would denote the maximum probability of reaching the goal, or we may consider the average-reward criterion). What follows are properties that bring out some differences between the presence and absence of discounting.

Recall that (in the discounting case) $V^*$ tells us how to construct an optimal policy: any mapping $p^* \in P$ defined by $\forall x \in X$, $p^*(x) \in \arg\max_{a \in \Sigma} [x R_a + \beta V^*(x M_a)]$ would do. We are not specific about the choice of actions when the set $\arg\max_{a \in \Sigma} [x R_a + \beta V^*(x M_a)]$ has more than one action, since any choice would give us an optimal policy. This is due to the fact that any such policy $p^*$ corresponds to an operator $H_{p^*}$ whose unique fixed-point is $V^*$, by the definition of $p^*$ (and from the contraction property of $H_{p^*}$). Properties such as uniqueness of fixed-points and uniform convergence do not hold for operators that are not contraction mappings. The simple example of a goal model (no discounting) in Figure 3 illustrates cases where operators (corresponding to policies) can have more than one fixed-point, and a policy p for which $V_1^*$ is a fixed-point may not necessarily be an optimal (stationary) policy (by which we mean $\lim_{k\to\infty} H_p^k\,\mathrm{zero} \ne V_1^*$). Hence, when there is no discounting, some choices of actions in $\arg\max_{a \in \Sigma} [x R_a + V_1^*(x M_a)]$ may not correspond to any optimal policy. We say that action a is optimal at a point x if there is some optimal stationary policy $p^*$ that maps x to a. In the presence of discounting, we have:

$$a \in \arg\max_{\sigma \in \Sigma}\, [x R_\sigma + \beta V^*(x M_\sigma)] \iff a \text{ is optimal at } x. \qquad (9)$$

In the undiscounted case, only the leftward implication of eq. 9 holds: optimality of a at x implies membership in the arg max, but not conversely.
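Lemma B.7 also suggests a concrete finite representation for computing the finite-horizon functions $H^{*k}\,\mathrm{zero}$: store a piecewise linear convex V as a set Γ of vectors with $V(x) = \max_{\gamma \in \Gamma} x\gamma$; then $H^* V(x) = \max_{a, \gamma}\, x(R_a + \beta M_a \gamma)$, so one backup maps Γ to $\{R_a + \beta M_a \gamma : a \in \Sigma, \gamma \in \Gamma\}$. The sketch below is our illustration of this bookkeeping (pruning of dominated vectors is omitted), reusing the toy M, R, and x from the earlier sketch:

    def backup_pwl(Gamma, R, M, beta):
        """One H* backup on V(x) = max over g in Gamma of x @ g (Lemma B.7)."""
        return [R[a] + beta * (M[a] @ g) for a in M for g in Gamma]

    # k backups applied to the zero function give the k-stage optimal value
    # function; |Gamma| grows by a factor |Sigma| per stage without pruning.
    Gamma = [np.zeros(2)]
    for _ in range(3):
        Gamma = backup_pwl(Gamma, R, M, beta=0.95)
    print(max(float(x @ g) for g in Gamma))   # H*^3 zero, evaluated at belief x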