The University of Western Australia

Technical Report 94/6

October 1994

Approximate Discounted Dynamic Programming Is Unreliable*

Matthew A. F. McDonald and Philip Hingston
Department of Computer Science
The University of Western Australia, Crawley, WA, 6009

Abstract

Popular reinforcement learning methods that employ generalising function approximators perform poorly in many domains. We analyse effects of approximation error in domains with sparse rewards, revealing the extent of scaling difficulties. Empirical evidence is presented that suggests when problems are likely to occur and explains some of the widely differing results reported in the literature.

Keywords

Reinforcement learning, dynamic programming, function approximation, induction, problem solving

CR categories I.2.6, I.2.8

* The Robotics and Vision Research Group acknowledges the support received from Digital through their External Research Programme.


1 Introduction

Most domains studied in AI are too large to be searched exhaustively. It is widely believed that reinforcement learning methods must be combined with generalising function approximators in order to scale to such domains. Unfortunately, the many results on the convergence of popular reinforcement learning algorithms rely almost exclusively on the assumptions that all states are visited repeatedly and that estimates of the values of actions or states can be accurately stored. (Bradtke [2] provides the only convergence proof we are aware of for a situation where values are stored using an approximator.) Although several straightforward combinations of generalising function approximators and reinforcement learning algorithms have had remarkable success in complex domains [14,15,1,12], not all such results have been as encouraging [8,4,3,2].

Reinforcement learning methods based on dynamic programming operate by estimating the value of actions. The discounted techniques usually used attach values to actions in such a way that the difference in value between actions available at a state fades exponentially as the distance between a state and a goal state increases. This exponential decay of signal raises serious doubts about the scalability of current reinforcement learning methods in the face of approximator error. Such considerations appear to have motivated the development of several approximate reinforcement learning methods that learn to compose sequences of actions to form high-level actions rather than directly implementing classical dynamic programming techniques [13,6].

This paper presents an analysis of the effects of approximation error on reinforcement learning methods based on discounted dynamic programming. We show that in the presence of function approximator error, the likelihood of learning to receive a reward can diminish exponentially as the number of actions required to reach a reward increases. We present empirical evidence that suggests such exponential decay is likely to be a problem in domains where all available actions have inverses. The conclusion discusses the consequences of these results and possible solutions.

2 Reinforcement Learning

A reinforcement learning agent can be seen as acting in an environment, repeatedly mapping sense data into actions. Once per cycle, the agent receives sensory input and a reinforcement, indicating the immediate desirability of the effects of the last action. It responds with an action. The world responds to the action with new sensory information and a new immediate reinforcement, and the cycle continues. The goal of reinforcement learning is to find a mapping of sense data into actions that maximises the reinforcement received by the agent over time. This section gives a brief summary of Markov decision tasks and the theory of discounted dynamic programming. A popular reinforcement learning algorithm based on dynamic programming, Q-learning [17,18], is also summarised.

2.1 Markov Decision Tasks and Discounted Dynamic Programming

The world in which an agent is acting is taken to be a Markov environment, a finite state automaton coupled with a reinforcement function. A Markov environment is formally modelled as a quadruple (S, A, T, R). S denotes the finite set of world states, and A the set of possible actions an agent may output. T is a (possibly stochastic) transition function mapping S × A into S, and R an immediate reinforcement function (which may also be stochastic) mapping S × A into the reals. A (stationary) policy π is a mapping from S to A indicating which action is to be performed in each state.

A Markov decision task consists of a Markov environment and a criterion for assessing the quality of policies. Because the most obvious measure of a policy's long-term value, the sum of all future reinforcements, is generally infinite, it is usually an inappropriate measure to maximise. The most widely used alternative is the discounted return of a policy, an infinite sum in which near-term reinforcement is weighted more heavily than that occurring far in the future. A parameter γ (0 < γ < 1) specifies the rate at which future reinforcement is discounted. For a policy π, the corresponding state value function V^π measures the expected discounted return given that the initial state is s and policy π is continued indefinitely (in environments where T or R is stochastic, the sum is replaced by its expected value):

  V^π(s) = Σ_{t=0}^∞ γ^t R(s_t, a_t),   where s_0 = s, a_t = π(s_t), s_{t+1} = T(s_t, a_t).
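As a purely supplementary illustration (not part of the original report), a deterministic Markov environment can be represented by dictionaries for T and R, and V^π evaluated by truncating the infinite discounted sum. The type aliases, dictionary representation, and truncation horizon below are assumptions of this sketch.

```python
from typing import Dict, Tuple, Callable

State, Action = int, str

def discounted_return(T: Dict[Tuple[State, Action], State],
                      R: Dict[Tuple[State, Action], float],
                      policy: Callable[[State], Action],
                      s0: State, gamma: float, horizon: int = 1000) -> float:
    """Approximate V^pi(s0) by truncating the infinite discounted sum."""
    total, s, discount = 0.0, s0, 1.0
    for _ in range(horizon):
        a = policy(s)                 # action chosen by the policy in state s
        total += discount * R[(s, a)] # discounted immediate reinforcement
        s = T[(s, a)]                 # deterministic transition
        discount *= gamma
    return total
```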

Optimal policies are those for which the corresponding state value function is maximal at every state. A fundamental result from the theory of dynamic programming is that optimal policies exist under the conditions described. Although more than one optimal policy may exist in a given task, all have the same value function, V^*. We say that a policy π is greedy for a state value function V if

  R(s, π(s)) + γ V(T(s, π(s))) = max_{a∈A} [R(s, a) + γ V(T(s, a))].

Most policies are not greedy for their own value functions. A key fact that forms the basis of most dynamic programming algorithms is that exactly those policies that are greedy for their own value functions are optimal. This allows the optimal state value function to be defined simply as that for which, for all states s,

  V^*(s) = max_{a∈A} [R(s, a) + γ V^*(T(s, a))]

holds. This is one form of the Bellman optimality equation. Dynamic programming methods are based on iteratively solving this equation. Value iteration is a simple classical algorithm for finding the optimal state value function V^* for a given task. Given a state value function V, a backup operator B defines a state value function as follows:

  BV(s) = max_{a∈A} [R(s, a) + γ V(T(s, a))].

Given an arbitrary initial estimate, V, of the optimal state value function, value iteration proceeds by repeatedly and synchronously updating the values associated with states as follows:

  V(s) ← BV(s).

Successive estimates calculated by the algorithm converge to the optimal value function [10]. Any policy that is greedy for this value function is optimal.
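The synchronous backup can be written directly for a small, dictionary-based deterministic environment. The following sketch is an illustration added here; the stopping tolerance and the assumption that every action is applicable in every state are choices of the sketch, not details from the report.

```python
def value_iteration(states, actions, T, R, gamma, tol=1e-6):
    # A hedged sketch of synchronous value iteration; T and R are assumed to be
    # dictionaries keyed by (state, action), as in the previous sketch.
    V = {s: 0.0 for s in states}                      # arbitrary initial estimate
    while True:
        BV = {s: max(R[(s, a)] + gamma * V[T[(s, a)]] for a in actions)
              for s in states}                        # apply the backup operator B
        if max(abs(BV[s] - V[s]) for s in states) < tol:
            return BV                                 # near-fixed point of B
        V = BV
```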

2.2 Q-Learning

In order to apply the value iteration algorithm to a given task, knowledge of T and R is required: to apply the backup operator, and to calculate a policy that is greedy for V^* once V^* has been computed. In reinforcement learning problems, complete knowledge of T and R is generally unavailable, as the agent will usually not have completely explored an environment before learning begins. Although some reinforcement learning methods are based on first modelling reinforcement and transition functions and then applying value iteration to these models, a more popular alternative, Q-learning, obviates the need for these models by employing a value function that measures the value of state-action pairs directly. A state-action value function Q is a function from S × A into the real numbers. For a policy π, the corresponding value function Q^π measures the discounted return of the policy, given that the initial state is s, action a is taken, and policy π is followed indefinitely thereafter:

  Q^π(s, a) = Σ_{t=0}^∞ γ^t R(s_t, a_t),   where s_0 = s, a_0 = a, a_t = π(s_t) for t ≠ 0, and s_{t+1} = T(s_t, a_t).

The optimal state-action value function Q^* is defined as Q^π for any optimal policy π. A policy π is greedy for a state-action value function Q if

  Q(s, π(s)) = max_{a∈A} Q(s, a).

A backup operator B is defined over state-action value functions [18,17]:

  BQ(s, a) = R(s, a) + γ max_{a'∈A} Q(T(s, a), a'),


and an algorithm analogous to value iteration allows Q^* to be calculated, beginning from an arbitrary initial estimate Q, by repeatedly applying the backup operator:

  Q(s, a) ← BQ(s, a).

Once Q^* is known, a policy that is greedy for Q^* can easily be computed, without knowledge of T or R. Equally importantly, an asynchronous Monte Carlo version of the algorithm that operates using only samples of the behaviour of the reinforcement and transition functions is possible. The asynchronous version of the algorithm, Q-learning, begins with an initial estimate, Q, of Q^* and improves this estimate as the results of individual actions become apparent. If a Q-learning agent is in state s and performs action a, resulting in a transition to state s' and immediate reinforcement r, Q is updated as follows:

  Q(s, a) ← α (r + γ max_{a'∈A} Q(s', a')) + (1 − α) Q(s, a),

where α is the update factor. Providing actions are repeated indefinitely in each state and the magnitude of the update factor α diminishes towards zero, this update rule results in Q converging to Q^* and any policy that is greedy for Q will be optimal.
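For concreteness, a tabular version of this update can be sketched as below. The env.reset()/env.step() interface, the epsilon-greedy exploration scheme, and the constant learning rate are assumptions of the sketch and are not prescribed by the report.

```python
import random

def q_learning(env, states, actions, gamma=0.9, alpha=0.1,
               episodes=1000, epsilon=0.1):
    # A hedged sketch of the tabular Q-learning update described above.
    Q = {(s, a): 0.0 for s in states for a in actions}   # initial estimate of Q*
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection (an assumed exploration scheme)
            a = (random.choice(actions) if random.random() < epsilon
                 else max(actions, key=lambda b: Q[(s, b)]))
            s_next, r, done = env.step(a)                # sample of T and R
            target = r + gamma * max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] = alpha * target + (1 - alpha) * Q[(s, a)]
            s = s_next
    return Q
```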

3 Effects of Value Function Inaccuracies

If a value function Q is stored using a function approximator, attempts to move the value of Q(s, a) towards its desired value BQ(s, a) may fail. Any discrepancy between the current value of a state-action pair, Q(s, a), and the value supplied by the backup operator, BQ(s, a), indicates that Q is not optimal and that policies that are greedy for Q may be sub-optimal. This section examines the effect of discrepancies between Q(s, a) and BQ(s, a). In Subsection 3.1 we show that, in general environments with sparse rewards, the approximator precision required to guarantee rewards will be reached is generally unachievable. Subsection 3.2 shows that without such precision, the likelihood of reaching rewards may decay exponentially as the number of actions required to reach a reward increases.

3.1 Performance Bounds are Inadequate

Recent results by Williams and Baird [19] effectively bound the extent to which approximation error can affect a policy's discounted return. For a given state-action value function, Q, the difference between the discounted return achieved by a policy that is greedy for Q, and the discounted return achieved by an optimal policy, is bounded by a quantity that is proportional to ε, the largest discrepancy between a value of the function and its backed-up value:

  ε = max_{s∈S, a∈A} |Q(s, a) − BQ(s, a)|.

Williams and Baird show that the difference between the discounted return of an optimal policy and the discounted return of any policy π_Q that is greedy for a state-action value function Q is bounded by

  |V^{π_Q}(s) − V^*(s)| ≤ 2ε / (1 − γ).

Although this bound appears promising, because the maximum loss in discounted reinforcement is proportional to the magnitude of the maximum approximation error, the bound is not useful in environments where rewards are sparse. Consider an environment where some subset G of the states is designated as the set of goal states and in which one immediate reinforcement r (reward) is delivered for actions causing a transition to a goal state and another, p (penalty), is delivered to the agent for all other actions:

  R(s, a) = r   if T(s, a) ∈ G,
  R(s, a) = p   otherwise.

We say a policy fails at state s if it never reaches a goal state after beginning in state s. A policy is said to succeed at s if it does not fail at s.


If no policy can reach a goal in fewer than n actions from a given state s, no policy can do better than receive a series of n − 1 penalties followed by an unending series of rewards. The discounted return achieved by an optimal policy from a state s in which n actions are required to reach a goal state is thus bounded by

  V^*(s) ≤ p + γp + γ²p + ... + γ^{n−2}p + γ^{n−1}r + γ^n r + γ^{n+1}r + ... = (1 − γ^{n−1}) p/(1 − γ) + γ^{n−1} r/(1 − γ).

A policy that fails at s, and thereby receives an infinite series of penalties, achieves a discounted return of p + γp + γ²p + γ³p + ..., or p/(1 − γ). The difference between the discounted return achieved by an optimal policy at s and the discounted return achieved by a policy that fails at s is therefore

  γ^{n−1}(r − p) / (1 − γ).

Hence, any policy that achieves a discounted return at s that differs from the discounted return achieved by an optimal policy by less than γ^{n−1}(r − p)/(1 − γ) does not fail at s. Thus Williams and Baird's bound allows us to conclude that no policy that is greedy for a value function Q fails if

  2ε / (1 − γ) < γ^{n−1}(r − p) / (1 − γ).

Unfortunately, the condition is equivalent to

  ε < γ^{n−1}(r − p) / 2,

which implies that as the number of actions required to reach a goal, n, increases, the error tolerance ε decreases exponentially. This precision is not required in all environments. Tolerance to error will vary depending on the topology of an environment. For example, in an environment where all policies succeed, error tolerance is infinite. However, in the general case, guarantees cannot be made without exponentially increasing precision.
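To give a feel for the scale of this requirement (an illustration added here, not a result from the report), the tolerance ε < γ^{n−1}(r − p)/2 can be evaluated for the reward structure used in the experiments of Subsection 3.3 (γ = 0.9, r = 1, p = −1):

```python
# Illustrative only: required error tolerance for gamma = 0.9, r = 1, p = -1.
gamma, r, p = 0.9, 1.0, -1.0
for n in (5, 10, 20, 50, 100):
    tolerance = gamma ** (n - 1) * (r - p) / 2
    print(f"n = {n:3d}: epsilon must be below {tolerance:.2e}")
# n = 20 already demands error below roughly 0.14; n = 100 demands below about 3e-05.
```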

3.2 Exponential Likelihood of Failure

This section examines the behaviour of Q-learning agents under simplifying assumptions about the nature of function approximator error. It shows that in infinitely many environments, if approximator precision cannot guarantee a reward will be reached, the likelihood of learning to achieve a reward decreases exponentially as the number of actions required to reach a reward increases. Following Thrun and Schwartz [16], we assume that a state-action value function Q is stored using a function approximator that assigns values to state-action pairs Q(s, a) which represent a target value (here BQ(s, a)) corrupted by a noise term Y_{sa}:

  Q(s, a) = BQ(s, a) + Y_{sa}.

We assume that the noise terms Y_{sa} are independent random values drawn from a uniform distribution over the interval [−ε, +ε]. As before, we consider an environment where goal states are designated and in which reinforcement r (reward) is delivered for actions causing a transition to a goal state and reinforcement p (penalty) is delivered to the agent for all other actions. Because the magnitude of the error terms is bounded, it is possible to place an upper bound on the values Q(s, a). Q is equal to the optimal value function for a modified task in which the reinforcement function R is replaced by R′, where R′(s, a) = R(s, a) + Y_{sa}.

[Figure 3-1: A section whose probability of being crossed is bounded away from one. A chain of states Start, s_n, s_{n−1}, ..., s_1, s_0: the action left leads from Start to s_n and from each s_i to s_{i−1}, the action right leads from s_n back to Start, and the action goal leads from s_0 to the Goal state.]

The reinforcement that R′ assigns to state-action pairs is bounded (it cannot exceed r + ε). In this modified environment, no policy can do better than to receive reinforcement r + ε indefinitely, achieving a discounted return of (r + ε)/(1 − γ). Thus Q cannot assign a value greater than (r + ε)/(1 − γ) to any state-action pair.

The existence of an upper bound on the values of Q allows sets of states containing a specified start and goal state to be constructed such that the probability of learning to travel from start to goal is bounded away from one. Consider the set of states shown in Figure 3-1. State s_0 allows one action, goal, that leads directly to a goal state. States s_1, ..., s_n and Start allow the action left, and state s_n also allows the action right. Since Q(s_0, goal) is bounded from above, the probability that a policy that is greedy for Q will succeed from the start state can be bounded away from one. Consider a situation in which:

  (1)  Y_s^{left} < 0 for all s ∈ {s_1, ..., s_n};
  (2)  Y_{s_n}^{right} > ε/2;  and
  (3)  Y_{Start}^{left} > ε/2.

Because all noise terms are independent, this situation occurs with probability (1/2)^n · (1/4) · (1/4). Condition (1) implies that

  Q(s_n, left) ≤ p + γp + γ²p + ... + γ^{n−1}p + γ^n (r + ε)/(1 − γ) = p/(1 − γ) + γ^n ((r + ε)/(1 − γ) − p/(1 − γ)).

Conditions (2) and (3) together imply

  Q(Start, left) ≥ ε/2 + p + γ Q(s_n, right),   and
  Q(s_n, right) ≥ ε/2 + p + γ Q(Start, left).

Hence Q(s_n, right) ≥ ε/2 + p + γ(ε/2 + p) + γ² Q(s_n, right), and so

  Q(s_n, right) ≥ (ε/2 + p)/(1 − γ) = p/(1 − γ) + ε/(2(1 − γ)).

Clearly any policy that is greedy for Q fails at the start state if Q(s_n, right) > Q(s_n, left). Since Q(s_n, right) > Q(s_n, left) when ε/(2(1 − γ)) > γ^n ((r + ε)/(1 − γ) − p/(1 − γ)), this can be guaranteed by choosing n, the number of states, to be suitably large. Thus, the probability of learning to traverse the series of states shown in Figure 3-1 can be bounded away from one. We have shown that the probability of failure is at least 1/2^{n+4}.


[Figure 3-2: The five-tile sliding tile puzzle.]

Although this significantly underestimates the true probability of failure, for present purposes it is important only that some such bound exists. Given sections of a potential environment in which the probability of learning to negotiate from start to finish can be bounded away from one, it is possible to construct environments in which the probability of learning to traverse to a specified goal state diminishes exponentially as the number of actions required to reach the goal increases. The simplest such construction is a linear chain of such sections, in which, as the number of actions a required to reach the goal state increases, the number of sections that must all be crossed independently grows as a/n. There are other ways of arranging copies of the environmental section shown so that an agent must pass through each of an increasing number of sections as the distance between its start state and goal state increases. There are also other ways to connect a set of states such that the probability of learning to travel through them can be bounded away from one. This implies that any general lower bound on the probability of learning to reach a reward diminishes exponentially as the number of actions required to achieve a reward increases.

An exponential decay of the likelihood of success will not occur in all environments. For example, in an environment where all available actions result in a move towards a goal, the likelihood of reaching the goal will remain constant as the distance to the goal increases, regardless of approximation error. Although the argument can be extended to cover other assumptions about the nature of approximation error, it will not hold for all models of approximation error. This section has shown that discounted reinforcement learning methods will perform badly in artificially constructed environments. Because problems will not occur in all environments, it remains to be seen whether or not they occur in environments of practical interest. The following section presents empirical results for several environments that show difficulties will occur in practice and implicate a common feature of environments as a major factor.
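The construction of Figure 3-1 can also be checked numerically. The sketch below builds one such section, samples the noise model of this subsection (the noisy Q is taken to be the optimal Q of the task whose rewards are perturbed by independent uniform noise), and tests whether the greedy policy fails from the start state. The absorbing goal action, the function names, and the iteration counts are assumptions of the sketch, not details from the report.

```python
import random

def build_section(n):
    # states: 'start', 's1'..'sn', 's0', 'goal' (absorbing, rewarded forever)
    T = {('start', 'left'): f's{n}', (f's{n}', 'right'): 'start',
         ('s1', 'left'): 's0', ('s0', 'goal'): 'goal', ('goal', 'stay'): 'goal'}
    for i in range(n, 1, -1):
        T[(f's{i}', 'left')] = f's{i-1}'
    R = {(s, a): (1.0 if s2 == 'goal' else -1.0) for (s, a), s2 in T.items()}
    return T, R

def greedy_fails(T, R, gamma=0.9, eps=0.5, sweeps=2000):
    noise = {sa: random.uniform(-eps, eps) for sa in T}
    actions_of = {}
    for (s, a) in T:
        actions_of.setdefault(s, []).append(a)
    Q = {sa: 0.0 for sa in T}
    for _ in range(sweeps):                       # synchronous Q-value iteration
        Q = {(s, a): R[(s, a)] + noise[(s, a)]
             + gamma * max(Q[(T[(s, a)], b)] for b in actions_of[T[(s, a)]])
             for (s, a) in T}                     # fixed point of noisy backup
    s = 'start'                                   # follow the greedy policy
    for _ in range(10 * len(actions_of)):
        a = max(actions_of[s], key=lambda b: Q[(s, b)])
        s = T[(s, a)]
        if s == 'goal':
            return False                          # succeeded
    return True                                   # never reached the goal

# estimate the failure probability for a section with n = 3, for example:
# failure_rate = sum(greedy_fails(*build_section(3)) for _ in range(2000)) / 2000
```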

3.3 Empirical Results

The argument of Subsection 3.2 fails to characterise environments in which success can be expected to decay exponentially as the distance from a state to a goal increases. To see whether or not such decay is a problem in practice, experiments were conducted that measured the effect of approximation error on the performance of a Q-learning agent. The experiments measure the effect of additive, fixed, zero-mean noise drawn from uniform distributions of varying widths, as was assumed in the argument of Subsection 3.2.

Experiments were conducted using three different environments. All environments were deterministic and had a single goal state. In all three environments, a reinforcement of +1 was paid for actions leading directly to the goal, other actions were penalised using a reinforcement of −1, and γ was set to 0.9. These values were found to give reasonable performance across all three domains. In each of the three environments, results were averaged over 100 trials.

The first environment used was the state-space of the five-tile sliding tile puzzle (Figure 3-2). The domain has 360 states. In each state three actions are available: tiles may be moved left, right, or vertically into the empty space. In states where an action is inapplicable, it leaves the state of the environment unchanged.

Figure 3-3 shows results for this environment. Each of the lines in the two graphs shows the performance under noise drawn from a different-width zero-mean uniform distribution. Each plotted line is labelled with the maximum and minimum values of the distribution from which the corresponding noise was drawn. The left graph in the figure plots the likelihood of learning a policy that is successful at a state as a function of its distance from the goal. In the five-puzzle environment, this curve appears to diminish exponentially at large distances and linearly (worse than exponentially) at intermediate distances.
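The report does not spell out every detail of the measurement procedure, but a plausible evaluation along these lines is sketched below: an exact state-action value function is perturbed by fixed, zero-mean uniform noise, and the success rate of the resulting greedy policy is recorded by distance from the goal. Perturbing Q* directly is a simplification of the fixed-point noise model analysed in Subsection 3.2, and the assumption that every action is defined in every state (with self-loops for inapplicable moves) follows the five-puzzle description.

```python
import random
from collections import deque

def success_by_distance(states, actions, T, R, goal, gamma=0.9,
                        width=0.5, trials=100):
    # exact Q* by synchronous Q-value iteration (deterministic environment)
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(500):
        Q = {(s, a): R[(s, a)] + gamma * max(Q[(T[(s, a)], b)] for b in actions)
             for s in states for a in actions}
    # distance of each state to the goal, by reverse breadth-first search
    dist, frontier = {goal: 0}, deque([goal])
    while frontier:
        s = frontier.popleft()
        for s2 in states:
            if s2 not in dist and any(T[(s2, a)] == s for a in actions):
                dist[s2] = dist[s] + 1
                frontier.append(s2)
    # greedy success under noisy Q, grouped by distance
    hits = {d: [0, 0] for d in set(dist.values())}
    for _ in range(trials):
        noisy = {sa: q + random.uniform(-width, width) for sa, q in Q.items()}
        for s0, d in dist.items():
            s, reached = s0, False
            for _ in range(2 * len(states)):
                a = max(actions, key=lambda b: noisy[(s, b)])
                if T[(s, a)] == goal:
                    reached = True
                    break
                s = T[(s, a)]
            hits[d][0] += reached
            hits[d][1] += 1
    return {d: won / total for d, (won, total) in sorted(hits.items())}
```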

[Figure 3-3: Results for the 5-puzzle environment. Left graph: likelihood of success ('% Success') versus distance from the goal (0-20); right graph: the running rate of decay of success ('Decay Rate') versus distance. One curve per noise distribution, from U[0, 0] up to U[-0.7, 0.7].]

In order to show more clearly the way the likelihood of success changes with distance, the right graph plots the likelihood of success at a distance divided by the likelihood of success at states that are one action closer to the goal: the running rate of decay of success. In the case of the five-puzzle, this decay rate shows a general downward trend, indicating that success appears to be decaying faster than exponentially.

It is clear noise does not affect the performance of Q-learning this dramatically in all domains. As the next experiment shows, substantially different results are obtained in random environments. The second environment tested contained 1500 states, one of which was selected as a goal. Three actions were available in each state. Each action caused a transition to a randomly selected state. All states from which it was impossible to reach the goal were discarded, leaving 1473 states in the environment.

[Figure 3-4: Results for the random environment with irreversible actions ('one-way domain'). Left graph: % Success versus distance from the goal (0-8); right graph: the running decay rate of success versus distance. One curve per noise distribution, from U[0, 0] up to U[-1.6, 1.6].]

The results for this random environment, shown in Figure 3-4, differ markedly from those for the sliding-tile puzzle. In the random environment the line plotting likelihood of success versus distance levels off as distance increases, being approximately equal for states between four and five actions away from the goal. The diminishing rate of decay is reflected in the second graph of the figure, which shows that decay tends to vanish as distance increases. Similar results were obtained for a variety of random environments with varying sizes and numbers of actions.

The third environment tested differs from the second in that it contains only actions with inverses. In this environment, if an action available in state a results in a transition to state b, then an action available in state b results in a transition to state a. The random environment with inverse actions has 1500 states, one of which is the goal.


[Figure 3-5: Results for the random environment with reversible actions ('two-way domain'). Left graph: % Success versus distance from the goal (0-12); right graph: the running decay rate of success versus distance. One curve per noise distribution, from U[0, 0] up to U[-1.6, 1.6].]

Each state allowed three actions that caused transitions to randomly selected states. The states these actions led to then had one of their three actions changed to result in a transition back to the original state. All states from which it was impossible to reach the goal were discarded, leaving 1423 states in the environment.

Results for this environment are shown in Figure 3-5. They are similar to the results for the five-puzzle environment. For most of the noise distributions shown, the likelihood of success decays at a rate that increases until an asymptote below one is reached. (The one case where the rate of decay does not continually drop is due to statistical fluctuation, and differs if the experiment is repeated.) These results suggest that environments where actions have inverses (like the five-puzzle state-space) may be more likely to suffer exponentially decaying success than environments where actions have no inverse. Not only does the likelihood of success continually decay in the environment with inverses, the absolute likelihood of success is significantly lower at most distances for all noise levels. Again, similar results were obtained for a variety of random environments with varying numbers of actions and states.
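A sketch of how such random environments might be generated is given below. It mirrors the textual description; the pruning of transitions into discarded states, the choice of which successor action is redirected, and all names are assumptions of the sketch rather than the authors' exact construction.

```python
import random
from collections import deque

def random_environment(n_states=1500, n_actions=3, two_way=False, goal=0):
    actions = list(range(n_actions))
    # each state gets n_actions transitions to randomly selected states
    T = {(s, a): random.randrange(n_states)
         for s in range(n_states) for a in actions}
    if two_way:
        forward = list(T.items())               # the initially drawn transitions
        for (s, a), succ in forward:
            back = random.choice(actions)       # one of the successor's actions
            T[(succ, back)] = s                 # ... is redirected back to s
    # keep only states from which the goal is reachable (reverse BFS)
    reachable, frontier = {goal}, deque([goal])
    while frontier:
        t = frontier.popleft()
        for (s, a), s2 in T.items():
            if s2 == t and s not in reachable:
                reachable.add(s)
                frontier.append(s)
    T = {(s, a): s2 for (s, a), s2 in T.items()
         if s in reachable and s2 in reachable}
    R = {(s, a): (1.0 if s2 == goal else -1.0) for (s, a), s2 in T.items()}
    return T, R, reachable
```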

4 Conclusions

The bounds on performance loss due to approximation error established by Williams and Baird [19] cannot guarantee that discounted reinforcement learning algorithms will achieve goals in general environments unless approximator precision increases exponentially as the length of paths towards a goal increases. We have shown that in the absence of such guarantees, the likelihood of reaching a goal may decay exponentially as the length of paths to a goal increases. Empirical results were also presented that indicate such decay is likely to occur in simple puzzle-world environments, and in environments where actions have inverses. These results suggest that reinforcement learning methods based on discounted dynamic programming may be unlikely to scale to large problem-solving environments when combined with function approximators.

The empirical results of Subsection 3.3 appear to correspond to results already reported in the literature: many successful applications of discounted reinforcement learning to complex domains have been in domains such as Backgammon and Go [14,15,1,12], where the presence of another player ensures that the effect of actions can generally not be immediately reversed, while unsuccessful applications [8,4,3] have often been in domains where most actions have inverses.

These results suggest a number of topics for further research.

- It may be possible to dramatically reduce approximation error in many domains by employing function approximators that can be expected to perform well in a domain of interest [20].

- It is the multiplicative nature of costs under discounted dynamic programming that causes differences between action values to diminish exponentially as distance from goal states increases. This suggests a change to undiscounted methods, in which costs are additive rather than multiplicative.

- Algorithms based on dividing tasks in an environment into a series of smaller tasks and composing these to form macro actions have been demonstrated in less complex environments [13,6]. Although the performance of these approaches in more complex environments like the five-puzzle world has yet to be investigated, they seem promising, as macro actions have proven useful in similar domains outside reinforcement learning [9].

- Coupling discounted reinforcement learning methods such as Q-learning with generalising function approximators represents an attempt to approximate exact solutions to the problem of action assessment. It is probably not feasible to eliminate approximation error for value functions for general problems, since optimal solutions are typically NP-hard to compute [7,5,11]. However, compactly and exactly representing members of a class of value functions that leads to suboptimal solutions for such problems may be feasible, and our current research is directed along these lines.

The results presented in this paper are limited in several ways. The analysis presented shows that the likelihood of learning to reach a goal will decay exponentially in general domains as the length of paths to a goal increases. Provided the rate of decay is sufficiently small, this may be acceptable, and our analysis does not demonstrate that the decay rate will be large. Similarly, although empirical evidence suggests that domains where actions have inverses are especially prone to such problems, this is currently unsupported by analysis. Our analysis assumes that errors are independent of one another, which is probably not the case in practice. However, the way in which approximation errors are related is unclear, making a more realistic analysis impossible. Although we believe our results explain a variety of empirical results, they indicate only one situation in which problems can be expected. Other problems can also be expected [16], and a better understanding of the way reinforcement learning methods can be reliably based on function approximation is the subject of ongoing research.

Bibliography

1. J. A. Boyan. Modular neural networks for learning context-dependent game strategies. Master's thesis, Computer Speech and Language Processing, Cambridge University, 1992.
2. S. J. Bradtke. Reinforcement learning applied to linear quadratic regulation. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing 5, pages 295-302, San Mateo, CA, 1993. Morgan Kaufmann.
3. D. Chapman. Vision, Instruction, and Action. M.I.T. Press, Cambridge, Mass., 1991.
4. D. Chapman and L. P. Kaelbling. Input generalization in delayed reinforcement learning: An algorithm and performance comparisons. In J. Mylopoulos and R. Reiter, editors, Proceedings of the Twelfth International Joint Conference on Artificial Intelligence (IJCAI-91), pages 726-731, San Mateo, CA, 1991. Morgan Kaufmann.
5. S. V. Chenoweth. On the NP-hardness of blocks world. In Proceedings of the Ninth National Conference on Artificial Intelligence (AAAI-91), pages 623-628, 1991.
6. P. Dayan and G. E. Hinton. Feudal reinforcement learning. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing 5, pages 271-278, San Mateo, CA, 1993. Morgan Kaufmann.
7. N. Gupta and D. S. Nau. Complexity results for blocks-world planning. In Proceedings of the Ninth National Conference on Artificial Intelligence (AAAI-91), pages 629-633, 1991.
8. L. P. Kaelbling. Learning in Embedded Systems. PhD thesis, Department of Computer Science, Stanford University, 1990.
9. R. E. Korf. Learning to Solve Problems by Searching for Macro-Operators. Pitman, Boston, 1985.
10. M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley and Sons, 1994.
11. D. Ratner and M. Warmuth. Finding a shortest solution for the N x N extension of the 15-puzzle is intractable. In Proceedings of the Fifth National Conference on Artificial Intelligence (AAAI-86), pages 168-172, 1986.
12. N. N. Schraudolph, P. Dayan, and T. J. Sejnowski. Temporal difference learning of position evaluation in the game of Go. In J. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing 6, San Mateo, CA, 1994. Morgan Kaufmann.
13. S. P. Singh. Transfer of learning by composing solutions of elemental sequential tasks. Machine Learning, 8:323-339, 1992.
14. G. Tesauro. Practical issues in temporal difference learning. Machine Learning, 8:257-277, 1992.
15. G. Tesauro. TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6:215-219, 1994.
16. S. Thrun and A. Schwartz. Issues in using function approximation for reinforcement learning. In M. Mozer, P. Smolensky, D. Touretzky, J. Elman, and A. Weigend, editors, Proceedings of the 1993 Connectionist Models Summer School, Hillsdale, NJ, 1993. Lawrence Erlbaum.
17. C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, University of Cambridge, 1989.
18. C. J. C. H. Watkins and P. Dayan. Technical note: Q-learning. Machine Learning, 8:279-292, 1992.
19. R. J. Williams and L. C. Baird. Tight performance bounds on greedy policies based on imperfect value functions. Technical Report NU-CCS-93-14, Northeastern University College of Computer Science, 1993.
20. R. C. Yee. Abstraction in control learning. Technical Report COINS 92-16, Department of Computer and Information Science, University of Massachusetts, Amherst, MA 01003, 1992. A dissertation proposal.