Journal of Machine Learning Research 9 (2008) 1679-1709

Submitted 6/07; Revised 12/07; Published 8/08

Value Function Based Reinforcement Learning in Changing Markovian Environments

Balázs Csanád Csáji
László Monostori∗

BALAZS.CSAJI@SZTAKI.HU
LASZLO.MONOSTORI@SZTAKI.HU

Computer and Automation Research Institute, Hungarian Academy of Sciences, Kende utca 13–17, Budapest, H–1111, Hungary

Editor: Sridhar Mahadevan

Abstract

The paper investigates the possibility of applying value function based reinforcement learning (RL) methods in cases when the environment may change over time. First, theorems are presented which show that the optimal value function of a discounted Markov decision process (MDP) Lipschitz continuously depends on the immediate-cost function and the transition-probability function. Dependence on the discount factor is also analyzed and shown to be non-Lipschitz. Afterwards, the concept of (ε, δ)-MDPs is introduced, which is a generalization of MDPs and ε-MDPs. In this model the environment may change over time, more precisely, the transition function and the cost function may vary from time to time, but the changes must be bounded in the limit. Then, learning algorithms in changing environments are analyzed. A general relaxed convergence theorem for stochastic iterative algorithms is presented. We also demonstrate the results through three classical RL methods: asynchronous value iteration, Q-learning and temporal difference learning. Finally, some numerical experiments concerning changing environments are presented.

Keywords: Markov decision processes, reinforcement learning, changing environments, (ε, δ)-MDPs, value function bounds, stochastic iterative algorithms

1. Introduction

Stochastic control problems are often modeled by Markov decision processes (MDPs), which constitute a fundamental tool for computational learning theory. The theory of MDPs has grown extensively since Bellman introduced the discrete stochastic variant of the optimal control problem in 1957. These kinds of stochastic optimization problems have great importance in diverse fields, such as engineering, manufacturing, medicine, finance or social sciences. Several solution methods are known, for example, from the field of [neuro-]dynamic programming (NDP) or reinforcement learning (RL), which compute or approximate the optimal control policy of an MDP. These methods have succeeded in solving many different problems, such as transportation and inventory control (Van Roy et al., 1996), channel allocation (Singh and Bertsekas, 1997), robotic control (Kalmár et al., 1998), production scheduling (Csáji and Monostori, 2006), logical games and problems from financial mathematics. Many applications of RL and NDP methods are also considered in the textbooks of Bertsekas and Tsitsiklis (1996), Sutton and Barto (1998) as well as Feinberg and Shwartz (2002).

∗. Also faculty in Mechanical Engineering at the Budapest University of Technology and Economics.

© 2008 Balázs Csanád Csáji and László Monostori.


The dynamics of (Markovian) control problems can often be formulated as follows:

x_{t+1} = f(x_t, a_t, w_t),    (1)

where x_t is the state of the system at time t ∈ N, a_t is a control action and w_t is some disturbance. There is also a cost function g(x_t, a_t) and the aim is to find an optimal control policy that minimizes the [discounted] costs over time (the next section contains the basic definitions). In many applications the calculation of a control policy should be fast and, additionally, environmental changes should also be taken into account. These two criteria work against each other. In most control applications the system uses a model of the environment during the computation of a control policy. The dynamics of (1) can be modeled with an MDP, but what happens when the model is wrong (e.g., if the transition function is incorrect) or the dynamics have changed? The changing of the dynamics could itself be modeled as an MDP, however, including environmental changes as a higher level MDP very likely leads to problems which do not have any practically efficient solution methods. The paper argues that if the model is “close” to the environment, then a “good” policy based on the model cannot be arbitrarily “wrong” from the viewpoint of the environment and, moreover, “slight” changes in the environment result only in “slight” changes in the optimal cost-to-go function. More precisely, the optimal value function of an MDP depends Lipschitz continuously on the cost function and the transition probabilities. Applying this result, the concept of (ε, δ)-MDPs is introduced, in which these functions are allowed to vary over time, as long as the cumulative changes remain bounded in the limit. Afterwards, a general framework for analyzing stochastic iterative algorithms is presented. A novelty of our approach is that we allow the value function update operator to be time-dependent. We then apply that framework to deduce an approximate convergence theorem for time-dependent stochastic iterative algorithms. Later, with the help of this general theorem, we show relaxed convergence properties (more precisely, κ-approximation) for value function based reinforcement learning methods working in (ε, δ)-MDPs. The main contributions of the paper can be summarized as follows:

1. We show that the optimal value function of a discounted MDP depends Lipschitz continuously on the immediate-cost function (Theorem 12). This result was already known for the case of transition-probability functions (Müller, 1996; Kalmár et al., 1998), however, we present an improved bound for this case, as well (Theorem 11). We also present value function bounds (Theorem 13) for the case of changes in the discount factor and demonstrate that this dependence is not Lipschitz continuous.

2. In order to study changing environments, we introduce (ε, δ)-MDPs (Definition 17) that are generalizations of MDPs and ε-MDPs (Kalmár et al., 1998; Szita et al., 2002). In this model the transition function and the cost function may change over time, provided that the accumulated changes remain bounded in the limit. We show (Lemma 18) that potential changes in the discount factor can be incorporated into the immediate-cost function, thus, we do not have to consider discount factor changes.

3. We investigate stochastic iterative algorithms where the value function operator may change over time. A relaxed convergence theorem for this kind of algorithm is presented (Theorem 20). As a corollary, we get an approximation theorem for value function based reinforcement learning methods in (ε, δ)-MDPs (Corollary 21).


4. Furthermore, we illustrate our results through three classical RL algorithms. Relaxed convergence properties in (ε, δ)-MDPs for asynchronous value iteration, Q-learning and temporal difference learning are deduced. Later, we show that our approach could also be applied to investigate approximate dynamic programming methods.

5. We also present numerical experiments which highlight some features of working in varying environments. First, two simple stochastic iterative algorithms, a “well-behaving” and a “pathological” one, are shown. Regarding learning, we illustrate the effects of environmental changes through two problems: scheduling and grid world.

2. Definitions and Preliminaries

Sequential decision making in the presence of uncertainties is often modeled by MDPs (Bertsekas and Tsitsiklis, 1996; Sutton and Barto, 1998; Feinberg and Shwartz, 2002). This section contains the basic definitions, the applied notations and some preliminaries.

Definition 1 By a (finite, discrete-time, stationary, fully observable) Markov decision process (MDP) we mean a stochastic system characterized by a 6-tuple ⟨X, A, A, p, g, α⟩, where the components are as follows: X is a finite set of discrete states and A is a finite set of control actions. Mapping A : X → P(A) is the availability function that renders a set of available actions to each state, where P denotes the power set. The transition function is given by p : X × A → ∆(X), where ∆(X) is the set of all probability distributions over X. Let p(y | x, a) denote the probability of arrival at state y after executing action a ∈ A(x) in state x. The immediate-cost function is defined by g : X × A → R, where g(x, a) is the cost of taking action a in state x. Finally, α ∈ [0, 1) denotes the discount rate.

An interpretation of an MDP, a viewpoint often taken in RL, is that of an agent acting in an uncertain environment. The agent receives information about the state of the environment, x; at each state x the agent is allowed to choose an action a ∈ A(x). After the action is selected, the environment moves to the next state according to the probability distribution p(x, a) and the decision-maker collects its one-step cost, g(x, a). The aim of the agent is to find an optimal behavior (policy), such that applying this strategy minimizes the expected cumulative costs.

Definition 2 A (stationary, Markovian) control policy determines the action to take in each state. A deterministic policy, π : X → A, is simply a function from states to control actions. A randomized policy, π : X → ∆(A), is a function from states to probability distributions over actions. We denote the probability of executing action a in state x by π(x)(a) or, for short, by π(x, a). Unless indicated otherwise, we consider randomized policies.

For any initial probability distribution of the states, x̃_0 ∈ ∆(X), the transition probabilities p together with a control policy π completely determine the progress of the system in a stochastic sense, namely, they define a homogeneous Markov chain on X,

x̃_{t+1} = P(π) x̃_t,

where x̃_t is the state probability distribution vector of the system at time t and P(π) denotes the probability transition matrix induced by control policy π,

[P(π)]_{x,y} = ∑_{a∈A} p(y | x, a) π(x, a).
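To make the notation concrete, the following minimal Python sketch (an illustration, not part of the original paper) stores a small hypothetical finite MDP as plain arrays and computes the induced transition matrix P(π) of a randomized policy according to the formula above; the sizes and random numbers are assumptions chosen only for the example.

```python
import numpy as np

# A small hypothetical MDP with n states and m actions (all actions available).
n, m = 3, 2
rng = np.random.default_rng(0)

p = rng.random((n, m, n))                  # p[x, a, y] = p(y | x, a)
p /= p.sum(axis=2, keepdims=True)          # normalize to probability distributions
g = rng.random((n, m))                     # immediate costs g(x, a)
alpha = 0.9                                # discount rate

pi = np.full((n, m), 1.0 / m)              # randomized policy pi(x, a)

# Induced Markov chain: [P(pi)]_{x,y} = sum_a p(y | x, a) * pi(x, a).
P_pi = np.einsum('xay,xa->xy', p, pi)

x0 = np.full(n, 1.0 / n)                   # initial state distribution
x1 = x0 @ P_pi                             # state distribution after one step
print(P_pi.round(3), x1.round(3))
```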


Definition 3 The value or cost-to-go function of a policy π is a function from states to costs, J^π : X → R. Function J^π(x) gives the expected value of the cumulative (discounted) costs when the system is in state x and it follows policy π thereafter,

J^π(x) = E[ ∑_{t=0}^{N} α^t g(X_t, A_t^π) | X_0 = x ],    (2)

where X_t and A_t^π are random variables, A_t^π is selected according to control policy π and the distribution of X_{t+1} is p(X_t, A_t^π). The horizon of the problem is denoted by N ∈ N ∪ {∞}. Unless indicated otherwise, we will always assume that the horizon is infinite, N = ∞.

Definition 4 We say that π_1 ≤ π_2 if and only if ∀x ∈ X : J^{π_1}(x) ≤ J^{π_2}(x). A control policy is (uniformly) optimal if it is less than or equal to all other control policies.

There always exists at least one optimal policy (Sutton and Barto, 1998). Although there may be many optimal policies, they all share the same unique optimal cost-to-go function, denoted by J^*. This function must satisfy the Bellman optimality equation (Bertsekas and Tsitsiklis, 1996), T J^* = J^*, where T is the Bellman operator, defined for all x ∈ X as

(T J)(x) = min_{a∈A(x)} [ g(x, a) + α ∑_{y∈X} p(y | x, a) J(y) ].

Definition 5 We say that a function f : X → Y, where X, Y are normed spaces, is Lipschitz continuous if there exists a β ≥ 0 such that ∀ x_1, x_2 ∈ X : ‖f(x_1) − f(x_2)‖_Y ≤ β ‖x_1 − x_2‖_X, where ‖·‖_X and ‖·‖_Y denote the norms of X and Y, respectively. The smallest such β is called the Lipschitz constant of f. Henceforth, assume that X = Y. If the Lipschitz constant β < 1, then the function is called a contraction. A mapping is called a pseudo-contraction if there exists an x^* ∈ X and a β ≥ 0 such that ∀ x ∈ X, we have ‖f(x) − x^*‖_X ≤ β ‖x − x^*‖_X.

Naturally, every contraction mapping is also a pseudo-contraction, however, the opposite is not true. The pseudo-contraction condition implies that x^* is the fixed point of function f, namely, f(x^*) = x^*; moreover, x^* is unique, thus, f cannot have other fixed points.

It is known that the Bellman operator is a supremum norm contraction with Lipschitz constant α. In case we consider stochastic shortest path (SSP) problems, which arise if the MDP has an absorbing terminal (goal) state, the Bellman operator becomes a pseudo-contraction in the weighted supremum norm (Bertsekas and Tsitsiklis, 1996).

From a given value function J, it is straightforward to get a policy, for example, by applying a greedy and deterministic policy (w.r.t. J) that always selects actions with minimal costs,

π(x) ∈ arg min_{a∈A(x)} [ g(x, a) + α ∑_{y∈X} p(y | x, a) J(y) ].
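As a small illustration (a sketch, not the paper's implementation), the Bellman operator and the greedy-policy construction above can be written in a few lines of Python, assuming the array conventions of the earlier snippet (p[x, a, y], g[x, a], discount alpha) and that every action is available in every state.

```python
import numpy as np

def bellman_operator(J, p, g, alpha):
    """(T J)(x) = min_a [ g(x, a) + alpha * sum_y p(y | x, a) J(y) ]."""
    q = g + alpha * (p @ J)        # q[x, a]: one-step lookahead costs
    return q.min(axis=1)

def greedy_policy(J, p, g, alpha):
    """Deterministic policy that is greedy with respect to J."""
    q = g + alpha * (p @ J)
    return q.argmin(axis=1)        # index of a cost-minimizing action per state

# Usage with the hypothetical arrays defined earlier:
# J = np.zeros(p.shape[0])
# pi_greedy = greedy_policy(J, p, g, alpha)
```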

Similarly to the definition of J^π, one can define action-value functions of control policies,

Q^π(x, a) = E[ ∑_{t=0}^{N} α^t g(X_t, A_t^π) | X_0 = x, A_0^π = a ],



where the notations are the same as in (2). MDPs have an extensively studied theory and there exist many exact and approximate solution methods, for example, value iteration, policy iteration, the Gauss-Seidel method, Q-learning, Q(λ), SARSA and TD(λ)—temporal difference learning (Bertsekas and Tsitsiklis, 1996; Sutton and Barto, 1998; Feinberg and Shwartz, 2002). Most of these reinforcement learning algorithms work by iteratively approximating the optimal value function and typically consider stationary environments.

If J is “close” to J^*, then the greedy policy with one-stage lookahead based on J will also be “close” to an optimal policy, as was proven by Bertsekas and Tsitsiklis (1996):

Theorem 6 Let M be a discounted MDP and let J be an arbitrary value function. The value function of the greedy policy based on J is denoted by J^π. Then, we have

‖J^π − J^*‖_∞ ≤ (2α / (1 − α)) ‖J − J^*‖_∞,

where ‖·‖_∞ denotes the supremum norm, namely ‖f‖_∞ = sup {|f(x)| : x ∈ domain(f)}. Moreover, there exists an ε > 0 such that if ‖J − J^*‖_∞ < ε then J^* = J^π. Consequently, if we can obtain a good approximation of the optimal value function, then we immediately have a good control policy as well, for example, the greedy policy with respect to our approximate value function. Therefore, the main question for most RL approaches is how a good approximation of the optimal value function can be achieved.

3. Asymptotic Bounds for Generalized Value Iteration

In this section we briefly overview a unified framework for analyzing value function based reinforcement learning algorithms. We will use this approach later when we prove convergence properties in changing environments. The theory presented in this section was developed by Szepesvári and Littman (1999) and was extended by Szita et al. (2002).

3.1 Generalized Value Functions and Approximate Convergence

Throughout the paper we denote the set of value functions by V, which contains, in general, all bounded real-valued functions over an arbitrary set X, for example, X = X in the case of state-value functions, or X = X × A in the case of action-value functions. Note that the set of value functions, V = B(X), where B(X) denotes the set of all bounded real-valued functions over set X, is a normed space, for example, with the supremum norm. Naturally, boundedness constitutes no real restriction when analyzing finite MDPs.

Definition 7 We say that a sequence of random variables, denoted by X_t, κ-approximates random variable X with κ ≥ 0, in a given norm, if we have

P( lim sup_{t→∞} ‖X_t − X‖ ≤ κ ) = 1.    (3)

Hence, the “meaning” of this definition is that the sequence X_t converges almost surely to an environment of X and the radius of this environment is less than or equal to a given constant κ. Naturally, this definition is weaker (more general) than probability one convergence, which arises as a special case when κ = 0. In the paper we will always consider convergence in the supremum norm.


3.2 Relaxed Convergence of Generalized Value Iteration

A general form of value iteration type algorithms can be given as follows: V_{t+1} = H_t(V_t, V_t), where H_t : V × V → V is a random operator (Szepesvári and Littman, 1999). Consider, for example, the SARSA (state-action-reward-state-action) algorithm, which is a model-free policy evaluation method. It aims at finding Q^π for a given policy π and is defined as

Q_{t+1}(x, a) = (1 − γ_t(x, a)) Q_t(x, a) + γ_t(x, a) (g(x, a) + α Q_t(Y, B)),

where γ_t(x, a) denotes the stepsize associated with state x and action a at time t; Y and B are random variables, Y is generated from the pair (x, a) by simulation, that is, according to the distribution p(x, a), and the distribution of B is π(Y). In this case, H_t is defined as

H_t(Q_a, Q_b)(x, a) = (1 − γ_t(x, a)) Q_a(x, a) + γ_t(x, a) (g(x, a) + α Q_b(Y, B)),    (4)

for all x and a. Therefore, the SARSA algorithm takes the form Q_{t+1} = H_t(Q_t, Q_t).

Definition 8 We say that the operator sequence H_t κ-approximates operator H : V → V at V ∈ V if for any initial V_0 ∈ V the sequence V_{t+1} = H_t(V_t, V) κ-approximates HV.

The next theorem (Szita et al., 2002) will be an important tool for proving convergence results for value function based RL algorithms in varying environments.

Theorem 9 Let H be an arbitrary mapping with fixed point V^*, and let H_t κ-approximate H at V^* over set X. Additionally, assume that there exist random functions 0 ≤ F_t(x) ≤ 1 and 0 ≤ G_t(x) ≤ 1 satisfying the four conditions below with probability one:

1. For all V_1, V_2 ∈ V and for all x ∈ X, |H_t(V_1, V^*)(x) − H_t(V_2, V^*)(x)| ≤ G_t(x) |V_1(x) − V_2(x)|.

2. For all V_1, V_2 ∈ V and for all x ∈ X, |H_t(V_1, V^*)(x) − H_t(V_1, V_2)(x)| ≤ F_t(x) ‖V^* − V_2‖_∞.

3. For all k > 0, ∏_{t=k}^{n} G_t(x) converges to zero uniformly in x as n increases.

4. There exists 0 ≤ ξ < 1 such that for all x ∈ X and sufficiently large t, F_t(x) ≤ ξ (1 − G_t(x)).

Then, V_{t+1} = H_t(V_t, V_t) κ′-approximates V^* over X for any V_0 ∈ V, where κ′ = 2κ/(1 − ξ).

Usually, functions F_t and G_t can be interpreted as the ratio of mixing the two arguments of operator H_t. In the case of the SARSA algorithm, described above by (4), X = X × A, G_t(x, a) = 1 − γ_t(x, a) and F_t(x, a) = α γ_t(x, a) would be a suitable choice. One of the most important aspects of this theorem is that it shows how to reduce the problem of approximating V^* with V_t = H_t(V_t, V_t) type operators to the problem of approximating it with a V′_t = H_t(V′_t, V^*) sequence, which is, in many cases, much easier to deal with. This makes, for example, the convergence of Watkins’ Q-learning a consequence of the classical Robbins-Monro theory (Szepesvári and Littman, 1999; Szita et al., 2002).
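To connect Theorem 9 with the SARSA example, here is a minimal sketch (an assumption-laden illustration, not the authors' code) of the random operator H_t of Equation (4) and the induced iteration Q_{t+1} = H_t(Q_t, Q_t); the simulator, the fixed policy pi and the single global stepsize schedule are all simplifications on top of the earlier hypothetical arrays.

```python
import numpy as np

def sarsa_update(Qa, Qb, x, a, p, g, pi, alpha, gamma, rng):
    """One application of H_t from (4) at the pair (x, a): mixes Qa(x, a)
    with a sampled one-step estimate built from Qb."""
    y = rng.choice(p.shape[2], p=p[x, a])   # next state Y ~ p(. | x, a)
    b = rng.choice(pi.shape[1], p=pi[y])    # next action B ~ pi(Y)
    return (1.0 - gamma) * Qa[x, a] + gamma * (g[x, a] + alpha * Qb[y, b])

# Policy evaluation, Q_{t+1} = H_t(Q_t, Q_t), with the earlier hypothetical arrays:
# rng = np.random.default_rng(1)
# Q, x = np.zeros_like(g), 0
# for t in range(10_000):
#     a = rng.choice(pi.shape[1], p=pi[x])
#     gamma_t = 1.0 / (1.0 + t / 10.0)      # a diminishing stepsize (simplified)
#     Q[x, a] = sarsa_update(Q, Q, x, a, p, g, pi, alpha, gamma_t, rng)
#     x = rng.choice(p.shape[2], p=p[x, a])
```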


4. Value Function Bounds for Environmental Changes

In many control problems it is typically not possible to “practise” in the real environment; only a dynamic model is available to the system and this model can be used for predicting how the environment will respond to the control signals (model predictive control). MDP based solutions usually work by simulating the environment with the model: through simulation they produce simulated experience, and by learning from this experience they improve their value functions. Computing an approximately optimal value function is essential because, as we have seen (Theorem 6), close approximations to optimal value functions lead directly to good policies. There are, though, alternative approaches which directly approximate optimal control policies (see Sutton et al., 2000). However, what happens if the model is inaccurate or the environment has changed slightly? In what follows we investigate the effects of environmental changes on the optimal value function. For continuous Markov processes questions like these were already analyzed (Gordienko and Salem, 2000; Favero and Runggaldier, 2002; de Oca et al., 2003), hence, we will focus on finite MDPs.

The theorems of this section have some similarities with two previous results. First, Munos and Moore (2000) studied the dependence of the Bellman operator on the transition-probabilities and the immediate-costs. Later, Kearns and Singh (2002) applied a simulation lemma to deduce polynomial time bounds for achieving near-optimal return in MDPs. This lemma states that if two MDPs differ only in their transition and cost functions and we want to approximate the value function of a fixed policy concerning one of the MDPs in the other MDP, then how close we should choose the transitions and the costs to the original MDP relative to the mixing time or the horizon time.

4.1 Changes in the Transition-Probability Function

First, we will see that the optimal value function of a discounted MDP depends Lipschitz continuously on the transition-probability function. This question was analyzed by Müller (1996) as well, but the presented version of Theorem 10 was proven by Kalmár et al. (1998).

Theorem 10 Assume that two discounted MDPs differ only in their transition functions, denoted by p_1 and p_2. Let the corresponding optimal value functions be J_1^* and J_2^*; then

‖J_1^* − J_2^*‖_∞ ≤ (n α ‖g‖_∞ / (1 − α)^2) ‖p_1 − p_2‖_∞,

recall that n is the size of the state space and α ∈ [0, 1) is the discount rate.

A disadvantage of this theorem is that the estimation heavily depends on the size of the state space, n. However, this bound can be improved if we consider an induced matrix norm for transition-probabilities instead of the supremum norm. The following theorem presents our improved estimation for transition changes. Its proof can be found in the appendix.

Theorem 11 With the assumptions and notations of Theorem 10, we have

‖J_1^* − J_2^*‖_∞ ≤ (α ‖g‖_∞ / (1 − α)^2) ‖p_1 − p_2‖_1,

where ‖·‖_1 is a norm on functions of type f : X × A × X → R, for example, f(x, a, y) = p(y | x, a),

‖f‖_1 = max_{x, a} ∑_{y∈X} |f(x, a, y)|.    (5)


If we consider f as a matrix which has a column for each state-action pair (x, a) ∈ X × A and a row for each state y ∈ X, then the above definition gives us the usual “maximum absolute column sum norm” definition for matrices, which is conventionally denoted by ‖·‖_1. It is easy to see that for all f, we have ‖f‖_1 ≤ n ‖f‖_∞, where n is the size of the state space. Therefore, the estimation of Theorem 11 is at least as good as the estimation of Theorem 10. In order to see that it is a real improvement consider, for example, the case when we choose a particular state-action pair, (x̂, â), and take a p_1 and p_2 that differ only in (x̂, â). For example, p_1(x̂, â) = ⟨1, 0, 0, . . . , 0⟩ and p_2(x̂, â) = ⟨0, 1, 0, . . . , 0⟩, and they are equal for all other (x, a) ≠ (x̂, â). Then, by definition, ‖p_1 − p_2‖_1 = 2, but n ‖p_1 − p_2‖_∞ = n. Consequently, in this case, we have improved the bound of Theorem 10 by a factor of 2/n.

4.2 Changes in the Immediate-Cost Function

The same kind of Lipschitz continuity can be proven in case of changes in the cost function.

Theorem 12 Assume that two discounted MDPs differ only in the immediate-cost functions, g_1 and g_2. Let the corresponding optimal value functions be J_1^* and J_2^*; then

‖J_1^* − J_2^*‖_∞ ≤ (1 / (1 − α)) ‖g_1 − g_2‖_∞.
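The cost-perturbation bound of Theorem 12 is easy to check numerically. The sketch below (an illustration with hypothetical, randomly generated data) solves two MDPs that differ only in their immediate-cost functions by plain value iteration and compares ‖J_1^* − J_2^*‖_∞ with ‖g_1 − g_2‖_∞ / (1 − α).

```python
import numpy as np

def optimal_value(p, g, alpha, iters=2000):
    """Optimal value function via value iteration, J <- T J."""
    J = np.zeros(p.shape[0])
    for _ in range(iters):
        J = (g + alpha * (p @ J)).min(axis=1)
    return J

rng = np.random.default_rng(2)
n, m, alpha = 20, 4, 0.9
p = rng.random((n, m, n))
p /= p.sum(axis=2, keepdims=True)

g1 = rng.random((n, m))
g2 = g1 + rng.uniform(-0.05, 0.05, size=(n, m))    # perturbed cost function

J1, J2 = optimal_value(p, g1, alpha), optimal_value(p, g2, alpha)
lhs = np.abs(J1 - J2).max()
rhs = np.abs(g1 - g2).max() / (1.0 - alpha)
print(f"{lhs:.4f} <= {rhs:.4f}")                    # the bound of Theorem 12
```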

4.3 Changes in the Discount Factor

The following theorem shows that the change of the value function can also be estimated in case there are changes in the discount rate (all proofs can be found in the appendix).

Theorem 13 Assume that two discounted MDPs differ only in the discount factors, denoted by α_1, α_2 ∈ [0, 1). Let the corresponding optimal value functions be J_1^* and J_2^*; then

‖J_1^* − J_2^*‖_∞ ≤ (|α_1 − α_2| / ((1 − α_1)(1 − α_2))) ‖g‖_∞.

The next example demonstrates, however, that this dependence is not Lipschitz continuous. Consider, for example, an MDP that has only one state x and one action a. Taking action a loops back deterministically to state x with cost g(x, a) = 1. Suppose that the MDP has discount factor α_1 = 0, thus J_1^*(x) = 1. Now, if we change the discount rate to α_2 ∈ (0, 1), then |α_1 − α_2| < 1 but ‖J_1^* − J_2^*‖_∞ can be arbitrarily large, since J_2^*(x) → ∞ as α_2 → 1. At the same time, we can notice that if we fix a constant α_0 < 1 and only allow discount factors from the interval [0, α_0], then this dependence becomes Lipschitz continuous, as well.

4.4 Case of Action-Value Functions

Many reinforcement learning algorithms, such as Q-learning, work with action-value functions, which are important, for example, for model-free approaches. Now, we investigate how the previously presented theorems apply to this type of value function. The optimal action-value function, denoted by Q^*, is defined for all state-action pairs (x, a) by

Q^*(x, a) = g(x, a) + α ∑_{y∈X} p(y | x, a) J^*(y),



where J^* is the optimal state-value function. Note that in the case of the optimal action-value function, first, we take a given action (which can have very high cost) and, only after the action is taken, follow an optimal policy. Thus, we can estimate ‖Q^*‖_∞ by ‖Q^*‖_∞ ≤ ‖g‖_∞ + α ‖J^*‖_∞. Nevertheless, the next lemma shows that the same estimations can be derived for environmental changes in the case of action-value functions as in the case of state-value functions.

Lemma 14 Assume that we have two discounted MDPs which differ only in the transition-probability functions or only in the immediate-cost functions or only in the discount factors. Let the corresponding optimal action-value functions be Q_1^* and Q_2^*, respectively. Then, the bounds for ‖J_1^* − J_2^*‖_∞ of Theorems 11, 12 and 13 are also bounds for ‖Q_1^* − Q_2^*‖_∞.

4.5 Further Remarks on Inaccurate Models

In this section we saw that the optimal value function of a discounted MDP depends smoothly on the transition function, the cost function and the discount rate. This dependence is of Lipschitz type in the first two cases and non-Lipschitz for discount rates. If we treat one of the MDPs in the previous theorems as a system which describes the “real” behavior of the environment and the other MDP as our model, then these results show that even if the model is slightly inaccurate or there were changes in the environment, the optimal value function based on the model cannot be arbitrarily wrong from the viewpoint of the environment. These theorems are of special interest because in “real world” problems the transition-probabilities and the immediate-costs are mostly only estimated, for example, by statistical methods from historical data. Later, we will see that changes in the discount rate can be traced back to changes in the cost function (Lemma 18), therefore, it is sufficient to consider transition and cost changes. The following corollary summarizes the results.

Corollary 15 Assume that two discounted MDPs (E and M) differ only in their transition functions and their cost functions. Let the corresponding transition and cost functions be denoted by p_E, p_M and g_E, g_M, respectively. The corresponding optimal value functions are denoted by J_E^* and J_M^*. The value function in E of the deterministic and greedy policy (π) with one-stage lookahead that is based upon J_M^* is denoted by J_E^π. Then,

‖J_E^π − J_E^*‖_∞ ≤ (2α / (1 − α)) ( ‖g_E − g_M‖_∞ / (1 − α) + c α ‖p_E − p_M‖_1 / (1 − α)^2 ),

where c = min{‖g_E‖_∞, ‖g_M‖_∞} and α ∈ [0, 1) is the discount factor.

The proof simply follows from Theorems 6, 11 and 12 and from the triangle inequality. Another interesting question is the effect of environmental changes on the value function of a fixed control policy. However, it is straightforward to prove (Csáji, 2008) that the same estimations as in Theorems 10, 11, 12 and 13 can be derived for ‖J_1^π − J_2^π‖_∞, where π is an arbitrary (stationary, Markovian, randomized) control policy.

Note that the presented theorems are only valid in the case of discounted MDPs. Though a large part of the MDP related research studies the expected total discounted cost optimality criterion, in


some cases discounting is inappropriate and, therefore, there are alternative optimality approaches, as well. A popular alternative approach is to optimize the expected average cost (Feinberg and Shwartz, 2002). In this case the value function is defined as

J^π(x) = lim sup_{N→∞} (1/N) E[ ∑_{t=0}^{N−1} g(X_t, A_t^π) | X_0 = x ],

where the notations are the same as previously, for example, as applied in Equation (2). Regarding the validity of the results of Section 4 concerning MDPs with the expected average cost minimization objective, we can recall that, in the case of finite MDPs, discounted cost offers a good approximation to the other optimality criterion. More precisely, it can be shown that there exists a large enough α0 < 1 such that ∀ α ∈ (α0 , 1) optimal control policies for the discounted cost problem are also optimal for the average cost problem (Feinberg and Shwartz, 2002). These policies are called Blackwell optimal.

5. Learning in Varying Environments

In this section we investigate how value function based learning methods can act in environments which may change over time. However, without restrictions, this approach would be too general to establish convergence results. Therefore, we restrict ourselves to the case when the changes remain bounded over time. In order to precisely define this concept, the idea of (ε, δ)-MDPs is introduced, which is a generalization of classical MDPs and ε-MDPs. First, we recall the definition of ε-MDPs (Kalmár et al., 1998; Szita et al., 2002).

Definition 16 A sequence of MDPs (M_t)_{t=1}^∞ is called an ε-MDP with ε > 0 if the MDPs differ only in their transition-probability functions, denoted by p_t for M_t, and there exists an MDP with transition function p, called the base MDP, such that sup_t ‖p − p_t‖ ≤ ε.

5.1 Varying Environments: (ε, δ)-MDPs

Now, we extend the idea described above. The following definition of (ε, δ)-MDPs generalizes the concept of ε-MDPs in two ways. First, we also allow the cost function to change over time and, additionally, we require the changes to remain bounded by parameters ε and δ only asymptotically, in the limit. A finite number of large deviations is tolerated.

Definition 17 A tuple ⟨X, A, A, {p_t}_{t=1}^∞, {g_t}_{t=1}^∞, α⟩ is an (ε, δ)-MDP with ε, δ ≥ 0, if there exists an MDP ⟨X, A, A, p, g, α⟩, called the base MDP, such that

1. lim sup_{t→∞} ‖p − p_t‖ ≤ ε,

2. lim sup_{t→∞} ‖g − g_t‖ ≤ δ.
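As a concrete (and purely illustrative) way of producing such a sequence, the sketch below generates, around a base MDP stored as in the earlier snippets, transition functions p_t obtained by mixing p with the uniform transition function (so that ‖p − p_t‖_1 ≤ ε in the norm of Equation (5)) and cost functions g_t within δ of g; the construction itself is an assumption, not taken from the paper.

```python
import numpy as np

def perturbed_mdp(p, g, eps, delta, rng):
    """One pair (p_t, g_t) of a sequence satisfying Definition 17."""
    n = p.shape[2]
    eta = rng.uniform(0.0, eps / 2.0)          # mixing weight, at most eps/2
    p_t = (1.0 - eta) * p + eta / n            # still a valid transition function
    g_t = g + rng.uniform(-delta, delta, size=g.shape)
    return p_t, g_t                            # ||p - p_t||_1 <= 2*eta <= eps

# A stream of changing environments around the base MDP (p, g):
# rng = np.random.default_rng(3)
# for t in range(100):
#     p_t, g_t = perturbed_mdp(p, g, eps=0.1, delta=0.05, rng=rng)
```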

The optimal value function of the base MDP and of the current MDP at time t (which has transition function p_t and cost function g_t) are denoted by J^* and J_t^*, respectively.

In order to keep the analysis as simple as possible, we do not allow the discount rate parameter α to change over time; not only because, for example, with Theorem 13 at hand, it would be straightforward to extend the results to the case of changing discount factors, but even more because, as


Lemma 18 demonstrates, the effects of changes in the discount rate can be incorporated into the immediate-cost function, which is allowed to change in (ε, δ)-MDPs.

Lemma 18 Assume that two discounted MDPs, M_1 and M_2, differ only in the discount factors, denoted by α_1 and α_2. Then, there exists an MDP, denoted by M_3, such that it differs only in the immediate-cost function from M_1, thus its discount factor is α_1, and it has the same optimal value function as M_2. The immediate-cost function of M_3 is

ĝ(x, a) = g(x, a) + (α_2 − α_1) ∑_{y∈X} p(y | x, a) J_2^*(y),

where p is the probability-transition function of M_1, M_2 and M_3; g is the immediate-cost function of M_1 and M_2; and J_2^*(y) denotes the optimal cost-to-go function of M_2.

On the other hand, we can notice that changes in the cost function cannot be traced back to changes in the transition function. Consider, for example, an MDP with a constant zero cost function. Then, no matter what the transition-probabilities are, the optimal value function remains zero. However, we may achieve non-zero optimal value function values if we change the immediate-cost function. Therefore, (ε, δ)-MDPs cannot be traced back to ε-MDPs.

Now, we briefly investigate the applicability of (ε, δ)-MDPs and a possible motivation behind them. When we model a “real world” problem as an MDP, we typically take only the major characteristics of the system into account, but there could be many hidden parameters, as well, which may affect the transition-probabilities and the immediate-costs but which are not explicitly included in the model. For example, if we model a production control system as an MDP (Csáji and Monostori, 2006), then the workers’ fatigue, mood or the quality of the materials may affect the durations of the tasks, but these characteristics are usually not included in the model. Additionally, the values of these hidden parameters may change over time. In these cases, we could either try to incorporate as many aspects of the system as possible into the model, which would most likely lead to computationally intractable results, or we could model the system as an (ε, δ)-MDP, which would result in a simplified model and, presumably, in a more tractable system.

5.2 Relaxed Convergence of Stochastic Iterative Algorithms

In this section we present a general relaxed convergence theorem for a large class of stochastic iterative algorithms. Later, we will apply this theorem to investigate the convergence properties of value function based reinforcement learning methods in (ε, δ)-MDPs. Many learning and optimization methods can be written in a general form as a stochastic iterative algorithm (Bertsekas and Tsitsiklis, 1996). More precisely, as

V_{t+1}(x) = (1 − γ_t(x)) V_t(x) + γ_t(x) ((K_t V_t)(x) + W_t(x)),    (6)

where V_t ∈ V, operator K_t : V → V acts on value functions, each γ_t(x) is a random variable which determines the stepsize and W_t(x) is also a random variable, a noise parameter. Regarding reinforcement learning algorithms, for example, (asynchronous) value iteration, Gauss-Seidel methods, Q-learning and TD(λ) (temporal difference learning) can be formulated this way. We will show that under suitable conditions these algorithms work in (ε, δ)-MDPs; more precisely, κ-approximation to the optimal value function of the base MDP will be proven. Now, in order to provide our relaxed convergence result, we introduce assumptions on the noise parameters, on the stepsize parameters and on the value function operators.


Definition 19 We denote the history of the algorithm until time t by F_t, defined as

F_t = {V_0, . . . , V_t, W_0, . . . , W_{t−1}, γ_0, . . . , γ_t}.

The sequence F_0 ⊆ F_1 ⊆ F_2 ⊆ ... can be seen as a filtration, viz., as an increasing sequence of σ-fields. The set F_t represents the information available at each time t.

Assumption 1 There exists a constant C > 0 such that for all states x and times t, we have

E[W_t(x) | F_t] = 0  and  E[W_t^2(x) | F_t] < C < ∞.

Regarding the stepsize parameters, γ_t, we make the “usual” stochastic approximation assumptions. Note that there is a separate stepsize parameter for each possible state.

Assumption 2 For all x and t, 0 ≤ γ_t(x) ≤ 1, and we have with probability one

∑_{t=0}^{∞} γ_t(x) = ∞  and  ∑_{t=0}^{∞} γ_t^2(x) < ∞.

Intuitively, the first requirement guarantees that the stepsizes are able to overcome the effects of finite noises, while the second criterion ensures that they eventually converge.

Assumption 3 For all t, operator K_t : V → V is a supremum norm contraction mapping with Lipschitz constant β_t < 1 and with fixed point V_t^*. Formally, for all V_1, V_2 ∈ V,

‖K_t V_1 − K_t V_2‖_∞ ≤ β_t ‖V_1 − V_2‖_∞.

Let us introduce a common Lipschitz constant β_0 = lim sup_{t→∞} β_t, and assume that β_0 < 1.

Because our aim is to analyze changing environments, each K_t operator can have different fixed points and different Lipschitz constants. However, to prevent the progress of the algorithm from “slowing down” indefinitely, we should require that lim sup_{t→∞} β_t < 1. In the next section, when we apply this theory to the case of (ε, δ)-MDPs, each value function operator can depend on the current MDP at time t and, thus, can have a different fixed point. Now, we present a theorem (its proof can be found in the appendix) that shows how the function sequence generated by iteration (6) can converge to an environment of a function.

Theorem 20 Suppose that Assumptions 1-3 hold and let V_t be the sequence generated by iteration (6). Then, for any V^*, V_0 ∈ V, the sequence V_t κ-approximates function V^* with

κ = 4ρ / (1 − β_0),  where  ρ = lim sup_{t→∞} ‖V_t^* − V^*‖_∞.

This theorem is very general; it is valid even in the case of non-finite MDPs. Notice that V^* can be an arbitrary function but, naturally, the radius of the environment of V^*, to which the sequence V_t almost surely converges, depends on lim sup_{t→∞} ‖V_t^* − V^*‖_∞. If we take a closer look at the proof, we can notice that the theorem is still valid if each K_t is only a pseudo-contraction that, additionally, also attracts points to V^*. Formally, it is enough to assume that for all V ∈ V, we have ‖K_t V − K_t V_t^*‖_∞ ≤ β_t ‖V − V_t^*‖_∞ and ‖K_t V − K_t V^*‖_∞ ≤ β_t ‖V − V^*‖_∞ for a suitable β_t < 1. This remark could be important in case we want to apply Theorem 20 to changing stochastic shortest path (SSP) problems.


5.2.1 A Simple Numerical Example

Consider a one-dimensional stochastic process characterized by the iteration

v_{t+1} = (1 − γ_t) v_t + γ_t (K_t(v_t) + w_t),    (7)

where γ_t is the learning rate and w_t is a noise term. Let us suppose we have n alternating operators k_i with Lipschitz constants b_i < 1 and fixed points v_i^*, where i ∈ {0, . . . , n − 1}, given by k_i(v) = v + (1 − b_i)(v_i^* − v). The current operator at time t is K_t = k_i (thus, V_t^* = v_i^* and β_t = b_i) if i ≡ t (mod n), that is, if i is congruent with t modulo n: if they have the same remainder when they are divided by n. In other words, we apply a round-robin type schedule for the operators. Figure 1 shows that the trajectories remained close to the fixed points. The figure illustrates the case of two (−1 and 1) and six (−3, −2, −1, 1, 2, 3) alternating fixed points.

Figure 1: Trajectories generated by (7) with two (left) and six (right) fixed points.
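A minimal Python sketch of this simulation is given below; the noise level and the stepsize schedule are assumptions chosen for illustration (the exact settings behind Figure 1 are not restated here), while the operators k_i and the round-robin schedule follow the description above.

```python
import numpy as np

def simulate(fixed_points, b=0.9, steps=1500, seed=0):
    """Iteration (7) with round-robin operators k_i(v) = v + (1 - b)(v_i* - v)."""
    rng = np.random.default_rng(seed)
    n, v, path = len(fixed_points), 0.0, []
    for t in range(steps):
        v_star = fixed_points[t % n]          # K_t = k_i with i = t mod n
        k_v = v + (1.0 - b) * (v_star - v)
        gamma_t = 1.0 / (1.0 + t / 50.0)      # an assumed diminishing stepsize
        w_t = rng.normal(0.0, 0.5)            # assumed zero-mean noise
        v = (1.0 - gamma_t) * v + gamma_t * (k_v + w_t)
        path.append(v)
    return path

path = simulate([-1.0, 1.0])                  # the two-fixed-point case of Figure 1
print(min(path[500:]), max(path[500:]))       # the trajectory stays near the fixed points
```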

5.2.2 A Pathological Example

In this example we restrict ourselves to deterministic functions. According to the Banach fixed point theorem, if we have a contraction mapping f over a complete metric space with fixed point v^* = f(v^*), then, for any initial v_0, the sequence v_{t+1} = f(v_t) converges to v^*. It could be thought that this result can easily be generalized to the case of alternating operators. For example, suppose we have n alternating contraction mappings k_i with Lipschitz constants b_i < 1 and fixed points v_i^*, respectively, where i ∈ {0, . . . , n − 1}, and we apply them iteratively starting from an arbitrary v_0, viz., v_{t+1} = K_t(v_t), where K_t = k_i if i ≡ t (mod n). One may think that since each k_i attracts the point towards its fixed point, the sequence v_t converges to the convex hull of the fixed points. However, as the following example demonstrates, this is not the case, since it is possible that the point moves away from the convex hull and, in fact, gets farther and farther after each iteration.

Now, let us consider two one-dimensional functions, k_i : R → R, where i ∈ {a, b}, defined below by Equation (8). It can be easily proven that these functions are contractions with fixed points v_i^*


and Lipschitz constants b_i (in Figure 2, v_a^* = 1, v_b^* = −1 and b_i = 0.9):

k_i(v) = { v + (1 − b_i)(v_i^* − v)                     if sgn(v_i^*) = sgn(v − v_i^*),
           v_i^* + (v_i^* − v) + (1 − b_i)(v − v_i^*)   otherwise,                        (8)

where sgn(·) denotes the signum function.¹ Figure 2 demonstrates that even if the iteration starts from the middle of the convex hull (from the center of mass), v_0 = 0, it gets farther and farther from the fixed points in each step when we apply k_a and k_b after each other.


Figure 2: A deterministic pathological example, generated by the iterative application of (8). The left part demonstrates the first steps, while the two images on the right-hand side show the behavior of the trajectory in the long run.

Nevertheless, the following argument shows that the sequence v_t cannot get arbitrarily far from the fixed points. Let us denote the diameter of the convex hull of the fixed points by ρ. Since this convex hull is a polygon (whose vertices are fixed points), ρ = max_{i,j} ‖v_i^* − v_j^*‖. Furthermore, let β_0 be defined as β_0 = max_i b_i and d_t as d_t = min_i ‖v_i^* − v_t‖. Then, it can be proven that for all t, we have d_{t+1} ≤ β_0 (2ρ + d_t). If we assume that d_{t+1} ≥ d_t, then it follows that d_t ≤ d_{t+1} ≤ β_0 (2ρ + d_t). After rearrangement, we get the following inequality:

d_t ≤ 2 β_0 ρ / (1 − β_0) = φ(β_0, ρ).

Therefore, d_t > φ(β_0, ρ) implies that d_{t+1} < d_t. Consequently, if v_t somehow got farther than φ(β_0, ρ), in the next step it would inevitably be attracted towards the fixed points. It is easy to see that this argument is valid in an arbitrary normed space, as well.

1. sgn(x) = 0 if x = 0, sgn(x) = −1 if x < 0 and sgn(x) = 1 if x > 0.
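The alternation above is easy to reproduce; the following minimal sketch iterates the two maps of Equation (8) with the constants of Figure 2 (v_a^* = 1, v_b^* = −1, b = 0.9) and checks the bound φ(β_0, ρ) derived above.

```python
def k(v, v_star, b=0.9):
    """The contraction of Equation (8) with fixed point v_star and constant b."""
    sgn = lambda z: (z > 0) - (z < 0)
    if sgn(v_star) == sgn(v - v_star):
        return v + (1.0 - b) * (v_star - v)
    return v_star + (v_star - v) + (1.0 - b) * (v - v_star)

b, fixed_points = 0.9, (1.0, -1.0)       # v_a* and v_b* as in Figure 2
rho = 2.0                                # diameter of the convex hull of the fixed points
phi = 2 * b * rho / (1.0 - b)            # the bound phi(beta_0, rho) = 36

v = 0.0                                  # start at the center of the hull
for t in range(100):
    v = k(v, fixed_points[t % 2], b)     # alternate k_a and k_b
    assert min(abs(v - s) for s in fixed_points) <= phi
print(v)                                 # |v| grows to about 19 and then stays there
```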



5.3 Reinforcement Learning in (ε, δ)-MDPs

In the case of finite (ε, δ)-MDPs we can formulate a relaxed convergence theorem for value function based reinforcement learning algorithms, as a corollary of Theorem 20. Suppose that V consists of state-value functions, namely, X = X. Then, we have

lim sup_{t→∞} ‖J^* − J_t^*‖_∞ ≤ d(ε, δ),

where J_t^* is the optimal value function of the MDP at time t and J^* is the optimal value function of the base MDP. In order to calculate d(ε, δ), Theorems 11 (or 10), 12 and the triangle inequality can be applied. Assume, for example, that we use the supremum norm, ‖·‖_∞, for cost functions and ‖·‖_1, defined by Equation (5), for transition functions. Then,

d(ε, δ) = ε α ‖g‖_∞ / (1 − α)^2 + δ / (1 − α),

where g is the cost function of the base MDP. Now, by applying Theorem 20, we have

Corollary 21 Suppose that we have an (ε, δ)-MDP and Assumptions 1-3 hold. Let V_t be the sequence generated by iteration (6). Furthermore, assume that the fixed point of each operator K_t is J_t^*. Then, for any initial V_0 ∈ V, the sequence V_t κ-approximates J^* with

κ = 4 d(ε, δ) / (1 − β_0).
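As a quick numeric illustration of this bound (with hypothetical parameter values, not taken from the paper):

```python
def kappa_bound(eps, delta, g_sup, alpha, beta0):
    """kappa = 4 d(eps, delta) / (1 - beta0), with d(eps, delta) as above."""
    d = eps * alpha * g_sup / (1.0 - alpha) ** 2 + delta / (1.0 - alpha)
    return 4.0 * d / (1.0 - beta0)

# For eps = 0.1, delta = 0.05, ||g||_inf = 1, alpha = beta0 = 0.9:
print(kappa_bound(0.1, 0.05, 1.0, 0.9, 0.9))   # 4 * (9 + 0.5) / 0.1 = 380.0
```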

Notice that as parameters ε and δ go to zero, we recover a classical convergence theorem for this kind of stochastic iterative algorithm (still in a slightly generalized form, since β_t may still change over time). Now, with the help of these results, we will investigate the convergence of some classical reinforcement learning algorithms in (ε, δ)-MDPs.

5.3.1 Asynchronous Value Iteration in (ε, δ)-MDPs

The method of value iteration is one of the simplest reinforcement learning algorithms. In ordinary MDPs it is defined by the iteration J_{t+1} = T J_t, where T is the Bellman operator. It is known that the sequence J_t converges in the supremum norm to J^* for any initial J_0 (Bertsekas and Tsitsiklis, 1996). The asynchronous variant of value iteration arises when the states are updated asynchronously, for example, only one state in each iteration. In the case of (ε, δ)-MDPs a small stepsize variant of asynchronous value iteration can be defined as

J_{t+1}(x) = (1 − γ_t(x)) J_t(x) + γ_t(x) (T_t J_t)(x),

where T_t is the Bellman operator of the current MDP at time t. Since there is no noise term in the iteration, Assumption 1 is trivially satisfied. Assumption 3 follows from the fact that each T_t operator is an α contraction, where α is the discount factor. Therefore, if the stepsizes satisfy Assumption 2 then, by applying Corollary 21, we have that the sequence J_t κ-approximates J^* for any initial value function J_0 with κ = 4 d(ε, δ)/(1 − α).
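A minimal sketch of this small-stepsize asynchronous value iteration is given below; it reuses the hypothetical perturbed_mdp generator from the (ε, δ)-MDP sketch of Section 5.1 and the array conventions of the earlier snippets, updating one uniformly chosen state per step with per-state stepsizes γ_t(x) = 1/(number of visits of x).

```python
import numpy as np

def async_value_iteration(p, g, alpha, eps, delta, steps=50_000, seed=4):
    """J_{t+1}(x) = (1 - gamma_t(x)) J_t(x) + gamma_t(x) (T_t J_t)(x),
    where T_t is the Bellman operator of the current MDP at time t."""
    rng = np.random.default_rng(seed)
    n = p.shape[0]
    J = np.zeros(n)
    visits = np.zeros(n)
    for t in range(steps):
        p_t, g_t = perturbed_mdp(p, g, eps, delta, rng)   # current MDP at time t
        x = rng.integers(n)                               # state updated at this step
        visits[x] += 1
        gamma = 1.0 / visits[x]                           # satisfies Assumption 2
        TtJ_x = (g_t[x] + alpha * (p_t[x] @ J)).min()     # (T_t J)(x)
        J[x] = (1.0 - gamma) * J[x] + gamma * TtJ_x
    return J
```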


5.3.2 Q-Learning in (ε, δ)-MDPs

Watkins’ Q-learning is a very popular off-policy model-free reinforcement learning algorithm (Even-Dar and Mansour, 2003). Its generalized version in ε-MDPs was studied by Szita et al. (2002). The Q-learning algorithm works with action-value functions, therefore, X = X × A, and the one-step Q-learning rule in (ε, δ)-MDPs can be defined as follows:

Q_{t+1}(x, a) = (1 − γ_t(x, a)) Q_t(x, a) + γ_t(x, a) (T̃_t Q_t)(x, a),    (9)

(T̃_t Q_t)(x, a) = g_t(x, a) + α min_{B∈A(Y)} Q_t(Y, B),

where g_t is the immediate-cost function of the current MDP at time t and Y is a random variable generated from the pair (x, a) by simulation, that is, according to the probability distribution p_t(x, a), where p_t is the transition function of the current MDP at time t. Operator T̃_t is randomized, but as was shown by Bertsekas and Tsitsiklis (1996) in their convergence theorem for Q-learning, it can be rewritten in the following form:

(T̃_t Q)(x, a) = (K̃_t Q)(x, a) + W̃_t(x, a),

where W̃_t(x, a) is a noise term with zero mean and finite variance, and K̃_t is defined as

(K̃_t Q)(x, a) = g_t(x, a) + α ∑_{y∈X} p_t(y | x, a) min_{b∈A(y)} Q(y, b).

Let us denote the optimal action-value function of the current MDP at time t and of the base MDP by Q_t^* and Q^*, respectively. By using the fact that J^*(x) = min_a Q^*(x, a), it is easy to see that for all t, Q_t^* is the fixed point of operator K̃_t and, moreover, each K̃_t is an α contraction. Therefore, if the stepsizes satisfy Assumption 2, then the Q_t sequence generated by iteration (9) κ-approximates Q^* for any initial Q_0 with κ = 4 d(ε, δ)/(1 − α). In some situations the immediate costs are randomized; however, even in this case the relaxed convergence of Q-learning follows as long as the random immediate costs have finite expected value and variance, which is required for satisfying Assumption 1.

5.3.3 Temporal Difference Learning in (ε, δ)-MDPs

Temporal difference learning, or TD-learning for short, is a policy evaluation algorithm. It aims at finding the corresponding value function J^π for a given control policy π (Bertsekas and Tsitsiklis, 1996; Sutton and Barto, 1998). It can also be used for approximating the optimal value function, for example, if we apply it together with the policy iteration algorithm. First, we briefly review the off-line first-visit variant of TD(λ) in the case of ordinary MDPs. It can be shown that the value function of a policy π can be rewritten as

J^π(x) = E[ ∑_{m=0}^{∞} (αλ)^m D^π_{α,m} | X_0 = x ] + J^π(x),

where λ ∈ [0, 1) and D^π_{α,m} denotes the “temporal difference” coefficient at time m,

D^π_{α,m} = g(X_m, A^π_m) + α J^π(X_{m+1}) − J^π(X_m),


where X_m, X_{m+1} and A^π_m are random variables, X_{m+1} has distribution p(X_m, A^π_m) and A^π_m is an action-valued random variable selected according to the distribution π(X_m). Based on this observation, we can define a stochastic approximation algorithm as follows. Let us suppose that we have a generative model of the environment, for example, we can perform simulations in it. Each simulation produces a state-action-reward trajectory. We can assume that all simulations eventually end, for example, there is an absorbing termination state or we can stop the simulation after a given number of steps. Note that even in this case we can treat each trajectory as infinitely long, viz., we can define all costs after the termination as zero. The off-line first-visit TD(λ) algorithm updates the value function after each simulation,

J_{t+1}(x_t^k) = J_t(x_t^k) + γ_t(x_t^k) ∑_{m=k}^{∞} (αλ)^{m−k} d_{α,m,t},    (10)

where x_t^k is the state at step k in trajectory t and d_{α,m,t} is the temporal difference coefficient,

d_{α,m,t} = g(x_t^m, a_t^m) + α J_t(x_t^{m+1}) − J_t(x_t^m).

For the case of ordinary MDPs it is known that TD(λ) converges almost surely to J^π for any initial J_0 provided that each state is visited by infinitely many trajectories and the stepsizes satisfy Assumption 2. The proof is based on the observation that iteration (10) can be seen as a Robbins-Monro type stochastic iterative algorithm for finding the fixed point of J^π = H J^π, where H is a contraction mapping with Lipschitz constant α (Bertsekas and Tsitsiklis, 1996). The only difference in the case of (ε, δ)-MDPs is that the environment may change over time and, therefore, operator H becomes time-dependent. However, each H_t is still an α contraction, but they potentially have different fixed points. Therefore, we can apply Theorem 20 to achieve a relaxed convergence result for off-line first-visit TD(λ) in changing environments under the same conditions as in the case of ordinary MDPs. The convergence of the on-line every-visit variant can be proven in the same way as in the case of ordinary MDPs, viz., by showing that the difference between the two variants is of second order in the size of γ_t and hence inconsequential as γ_t diminishes to zero.

5.3.4 Approximate Dynamic Programming

Most RL algorithms in their standard forms, for example, with lookup table representations, are highly intractable in practice. This phenomenon, which was named the “curse of dimensionality” by Bellman, has motivated approximate approaches that result in more tractable methods but often yield suboptimal solutions. These techniques are usually referred to as approximate dynamic programming (ADP). Many ADP methods are combined with simulation, but their key issue is to approximate the value function with a suitable approximation architecture: V ≈ Φ(r), where r is a parameter vector. Direct ADP methods collect samples by using simulation and fit the architecture to the samples. Indirect methods obtain parameter r by using an approximate version of the Bellman equation (Bertsekas, 2007). The power of the approximation architecture is the smallest error that can be achieved, η = inf_r ‖V^* − Φ(r)‖, where V^* is the optimal value function. If η > 0, then no algorithm can provide a result whose distance from V^* is less than η. Hence, the maximum that we can hope for is to converge to an environment of V^* (Bertsekas and Tsitsiklis, 1996). In what follows, we briefly investigate the connection of our results with ADP.


In general, many direct and indirect ADP methods can be formulated as follows:

Φ(r_{t+1}) = Π( (1 − γ_t) Φ(r_t) + γ_t (B_t(Φ(r_t)) + W_t) ),    (11)

where r_t ∈ Θ is an approximation parameter, Θ is the parameter space, for example, Θ ⊆ R^p, and Φ : Θ → F is an approximation architecture, where F ⊆ V is a Hilbert space that can be represented by using Φ with parameters from Θ. Function Π : V → F is a projection mapping; it renders a representation from F to each value function from V. Operator B_t : F → V acts on (approximated) value functions. Finally, γ_t denotes the stepsize and W_t is a noise parameter representing the uncertainties coming from, for example, the simulation. Operator B_t is time-dependent since, for example, if we model an approximate version of optimistic policy iteration, then in each iteration the control policy changes and, therefore, the update operator changes, as well.

We can notice that if Π is a linear operator (see below), Equation (11) is a stochastic iterative algorithm with K_t = Π B_t. Consequently, the algorithm described by Equation (6) is a generalization of many ADP methods, as well. Now, we show that a convergence theorem for ADP methods can also be deduced by using Theorem 20. In order to apply the theorem, we should ensure that each update operator is a contraction. If we assume that every B_t is a contraction, we should require two properties from Π to guarantee that the resulting operators remain contractions. First, Π should be linear. Operator Π is linear if it is additive and homogeneous, more precisely, if ∀V_1, V_2 : Π(V_1 + V_2) = Π(V_1) + Π(V_2) and ∀V : ∀α : Π(αV) = α Π(V), where α is a scalar. This requirement allows the separation of the components. Moreover, Π should be nonexpansive w.r.t. the supremum norm, namely: ∀V_1, V_2 : ‖Π(V_1) − Π(V_2)‖ ≤ ‖V_1 − V_2‖. Then, the update operator of the algorithm, K_t = Π B_t, is guaranteed to be a contraction.

If we assume that V_t^* is the fixed point of K_t, thus (Π B_t) V_t^* = V_t^*, and that β_t is the Lipschitz constant of K_t with lim sup_{t→∞} β_t = β_0 < 1, we can deduce a convergence theorem for ADP methods, as a corollary of Theorem 20. Suppose that Assumptions 1-2 hold, each B_t is a contraction, and Π is linear and supremum norm nonexpansive; then Φ(r_t) κ-approximates V^* for any initial r_0 with κ = 4ρ/(1 − β_0), where ρ = lim sup_{t→∞} ‖V_t^* − V^*‖. In case all of the fixed points are the same, viz., ∀t : V_0^* = V_t^*, then Φ(r_t) converges to V_0^* almost surely; consequently, Φ(r_t) κ-approximates V^* with κ = ‖V_0^* − V^*‖.

Naturally, these results are quite loose, since we did not make strong assumptions on the applied algorithm and on the approximation architecture. They only illustrate that the approach we took, which allows time-dependent update operators and analyzes approximate convergence, could also provide results for ordinary MDPs, for example, in the case of ADP.
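To make this concrete, here is a minimal sketch (an illustration, not the paper's experiments) of iteration (11) in which Φ(r) assigns one parameter to each aggregate of states and Π averages a value function over each aggregate; such a Π is linear and supremum norm nonexpansive, and B_t is taken here to be the Bellman operator of the current MDP, so K_t = Π B_t is a contraction as required. The aggregation scheme and all parameter choices are assumptions.

```python
import numpy as np

def adp_step(r, groups, p_t, g_t, alpha, gamma):
    """One noiseless step of (11): Phi(r_{t+1}) = Pi((1-gamma) Phi(r_t) + gamma B_t Phi(r_t)).
    groups[x] is the aggregate of state x and Phi(r)[x] = r[groups[x]]."""
    V = r[groups]                                   # Phi(r_t) as a full value function
    BV = (g_t + alpha * (p_t @ V)).min(axis=1)      # B_t applied to Phi(r_t)
    target = (1.0 - gamma) * V + gamma * BV
    # Pi: average over each aggregate (linear and sup-norm nonexpansive).
    return np.array([target[groups == i].mean() for i in range(r.size)])

# Usage with the hypothetical arrays and perturbed_mdp from the earlier sketches:
# groups = np.arange(n) // 2                        # two states per aggregate
# r = np.zeros(groups.max() + 1)
# rng = np.random.default_rng(5)
# for t in range(5000):
#     p_t, g_t = perturbed_mdp(p, g, eps=0.05, delta=0.02, rng=rng)
#     r = adp_step(r, groups, p_t, g_t, alpha, gamma=1.0 / (1.0 + t))
```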

6. Experimental Results

In this section we present two numerical experiments. The first one demonstrates the effects of environmental changes during Q-learning based scheduling. The second one presents a parameter analysis concerning the effectiveness of SARSA in (ε, δ)-type grid world domains.

6.1 Environmental Changes During Scheduling

Scheduling is the allocation of resources over time to perform a collection of jobs. Each job consists of a set of tasks, potentially with precedence constraints, to be executed on the resources. The


job-shop scheduling problem (JSP) is one of the basic scheduling problems (Pinedo, 2002). We investigated an extension of JSP, called the flexible job-shop scheduling problem (FJSP), in which some of the resources are interchangeable, that is, there may be tasks that can be executed on several resources. This problem can be formulated as a finite horizon MDP and can be solved by Q-learning based methods (Csáji and Monostori, 2006).


Figure 3: The black curves, κ(t), show the performance measure in case there was a resource breakdown (a) or a new resource availability (b) at time t = 100; the gray curve in (a), κ’(t), demonstrates the case in which the policy is recomputed from scratch.


Figure 4: The black curves, κ(t), show the performance measure during resource control in case there was a new job arrival (a) or a job cancellation (b) at time t = 100.

In order to investigate the effects of environmental changes during scheduling, numerical experiments were initiated and carried out. The aim of scheduling was to minimize the maximum completion time of the tasks, a performance measure called the “makespan”. The adaptive features of the Q-learning based approach were tested by confronting the system with unexpected events, such as: resource breakdown, new resource availability (Figure 3), new job arrival or job cancellation (Figure 4). In Figures 3 and 4 the horizontal axis represents time, while the vertical one the achieved performance measure. The figures were made by averaging one hundred random samples. In these tests a fixed number of 20 resources were used with a few dozen jobs, where each job contained a sequence of tasks. In each case there was an unexpected event at time t = 100. After the change took place, we considered two possibilities: we either restarted the iterative scheduling process from scratch or we continued the learning using the current (obsolete) value function. We experienced that the latter approach is much more efficient. That was one of the reasons why we started to study how the optimal value function of an MDP depends on the dynamics of the system.


Recall that Theorems 10, 11 and 12 measure the amount of the possible change in the value function in case there were changes in the MDP, but since these theorems use the supremum norm, they only provide bounds for worst-case situations. However, the results of our numerical experiments, shown in Figures 3 and 4, are indicative of the phenomenon that in an average case the change is much smaller. Therefore, applying the obsolete value function after a change took place is preferable to restarting the optimization from scratch. The results, black curves, show the case when the obsolete value function approximation was applied after the change took place. The performance which would arise if the system recomputed the whole schedule from scratch is drawn in gray in part (a) of Figure 3.

6.2 Varying Grid World

We also performed numerical experiments on a variant of the classical grid world problem (Sutton and Barto, 1998). The original version of this problem can be briefly described as follows: an agent wanders in a rectangular world starting from a random initial state with the aim of finding the goal state. In each state the agent is allowed to choose from four possible actions: "north", "south", "east" and "west". After an action was selected, the agent moves one step in that direction. There are some mines on the field, as well, that the agent should avoid. An episode ends if the agent finds the goal state or hits a mine. During our experiments, we applied randomly generated 10 × 10 grid worlds (thus, these MDPs had 100 states) with 10 mines. The immediate-cost of taking a (nonterminating) step was 5, the cost of hitting a mine was 100 and the cost of finding the goal state was −100. In order to perform the experiment described by Table 1, we have applied the "RL-Glue" framework,2 which consists of open-source software and aims at being a standard protocol for benchmarking and interconnecting reinforcement learning agents and environments.

We have analyzed an (ε, δ)-type version of grid world, where the problem formed an (ε, δ)-MDP. More precisely, we have investigated the case when, for all times t, the transition-probabilities could vary by at most ε ≥ 0 around the base transition-probability values and the immediate-costs could vary by at most δ ≥ 0 around the base cost values. During our numerical experiments, the environment changed at each time-step. These changes were generated as follows. First, changes concerning the transition-probabilities are described. In our randomized grid worlds the agent was taken to a random surrounding state (no matter what action it chose) with probability η, and this probability changed after each step. The new η was computed according to the uniform distribution, but its possible values were bounded by the values described in the first row of Table 1. Similarly, the immediate-costs of the base MDP (cf. the first paragraph) were perturbed with a uniform random variable that changed at each time-step. Again, its (absolute) value was bounded by δ, which is presented in the first column of the table. The values shown were divided by 100 to achieve the same scale as the transition-probabilities have.

Table 1 was generated using an (optimistic) SARSA algorithm, namely, the current policy was evaluated by SARSA, then the policy was (optimistically) improved, more precisely, the greedy policy with respect to the achieved evaluation was calculated. That policy was also soft, namely, it made random explorations with probability 0.05.
We have generated 1000 random grid worlds for each parameter pair and performed 10 000 episodes in each of these generated worlds.

2. RL-Glue can be found at http://rlai.cs.ualberta.ca/RLBB/top.html.
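The following Python sketch outlines how such a varying grid world and the ε-soft SARSA learner can be simulated. It is our reconstruction, not the code used for Table 1; the learning rate, discount factor, per-episode step cap and the exact way the random teleport is realized are assumptions.

```python
import numpy as np

# Sketch of an (epsilon, delta)-type grid world with a SARSA learner.
# At every step the teleport probability eta is resampled from [0, epsilon]
# and the immediate cost is perturbed by a uniform noise bounded by delta.

rng = np.random.default_rng(4)
SIZE, N_MINES, GAMMA, ALPHA_LR, EXPLORE = 10, 10, 0.95, 0.1, 0.05
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]            # north, south, west, east

def run(epsilon, delta, episodes=1000):
    cells = [(r, c) for r in range(SIZE) for c in range(SIZE)]
    mines = set(map(tuple, rng.permutation(cells)[:N_MINES]))
    goal = next(c for c in map(tuple, rng.permutation(cells)) if c not in mines)
    Q = np.zeros((SIZE, SIZE, 4))

    def policy(s):
        # epsilon-soft greedy policy (costs are minimized)
        return rng.integers(4) if rng.random() < EXPLORE else int(Q[s].argmin())

    def step(s, a):
        eta = rng.uniform(0.0, epsilon)               # varying transition noise
        if rng.random() < eta:                        # random surrounding state,
            a = rng.integers(4)                       # approximated by a random move
        r = (min(max(s[0] + MOVES[a][0], 0), SIZE - 1),
             min(max(s[1] + MOVES[a][1], 0), SIZE - 1))
        noise = rng.uniform(-delta, delta)            # varying cost perturbation
        if r == goal:
            return r, -100.0 + noise, True
        if r in mines:
            return r, 100.0 + noise, True
        return r, 5.0 + noise, False

    total = 0.0
    for _ in range(episodes):
        s = goal
        while s == goal or s in mines:                # random non-terminal start
            s = tuple(rng.integers(SIZE, size=2))
        a, done, steps = policy(s), False, 0
        while not done and steps < 1000:              # step cap: our safeguard
            s2, cost, done = step(s, a)
            a2 = policy(s2)
            target = cost + (0.0 if done else GAMMA * Q[s2][a2])
            Q[s][a] += ALPHA_LR * (target - Q[s][a])  # SARSA update
            total += cost
            s, a, steps = s2, a2, steps + 1
    return total / episodes                           # average cumulative cost

print(run(epsilon=0.3, delta=50.0))
```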



           bounds for the varying probability of arriving at random states ~ ε
 δ/100    0.0    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    1.0
  0.0   -55.5  -48.8  -41.4  -36.7  -26.7  -16.7   -8.5    2.1   14.2   31.7   46.0
  0.1   -54.1  -46.1  -41.2  -34.5  -25.8  -15.8   -6.0    3.7   16.5   32.3   46.3
  0.2   -52.5  -44.8  -40.1  -34.4  -25.3  -15.4   -5.8    4.0   17.6   33.1   48.1
  0.3   -49.7  -42.1  -36.3  -31.3  -23.9  -14.2   -5.3    8.0   18.1   37.2   51.6
  0.4   -47.4  -41.5  -34.7  -30.7  -22.2  -12.2   -2.3    8.8   20.2   38.3   52.0
  0.5   -42.7  -41.0  -34.5  -24.8  -21.1  -10.1   -1.3   11.2   25.7   39.2   52.1
  0.6   -36.1  -36.5  -29.7  -24.0  -16.8   -7.9    1.1   17.0   31.3   43.9   54.1
  0.7   -30.2  -29.3  -29.3  -19.1  -13.4   -6.0    7.4   18.9   26.9   47.2   60.9
  0.8   -23.1  -27.0  -21.4  -18.8  -10.9   -2.6    8.9   22.5   31.3   50.0   64.2
  0.9   -14.1  -19.5  -21.0  -12.4   -7.5    0.7   13.2   23.2   38.9   52.2   68.1
  1.0    -6.8  -10.7  -14.5   -7.1   -5.3    6.6   15.7   26.4   39.8   57.3   68.7

Table 1: The (average) cumulative costs gathered by SARSA in varying grid worlds. Rows: the bound δ on the immediate-cost perturbations (divided by 100); columns: the bound ε on the varying probability of arriving at random states.

The results presented in Table 1 were calculated by averaging the cumulative costs over all episodes and over all generated sample worlds. The parameter analysis shown in Table 1 is indicative of the phenomenon that changes in the transition-probabilities have a much higher impact on the performance. Even large perturbations in the costs were tolerated by SARSA, but large variations in the transition-probabilities caused a considerable decrease in performance. An explanation could be that large changes in the transitions cause the agent to lose control over the events, since it becomes very hard to predict the effects of the actions and, hence, to estimate the expected costs.

7. Conclusion

The theory of MDPs provides a general framework for modeling decision making in stochastic dynamic systems, if we know a function that describes the dynamics or we can simulate it, for example, with a suitable program. In some situations, however, the dynamics of the system may change, too. In theory, this change can be modeled with another (higher level) MDP, as well, but doing so would lead to models which are practically intractable. In the paper we have argued that the optimal value function of a (discounted) MDP Lipschitz continuously depends on the transition-probability function and the immediate-cost function, therefore, small changes in the environment result only in small changes in the optimal value function. This result was already known for the case of transition-probabilities, but we have presented an improved estimation for this case, as well. A bound for changes in the discount factor was also proven, and it was demonstrated that, in general, this dependence is not Lipschitz continuous. Additionally, it was shown that changes in the discount rate can be traced back to changes in the immediate-cost function. The application of the Lipschitz property helps the theoretical treatment of changing environments or inaccurate models, for example, if the transition-probabilities or the costs are only estimated statistically.

In order to theoretically analyze environmental changes, the framework of (ε, δ)-MDPs was introduced as a generalization of classical MDPs and ε-MDPs. In this quasi-stationary model the


transition-probability function and the immediate-cost function may change over time, but the cumulative changes must remain bounded by ε and δ, asymptotically. Afterwards, we have investigated how RL methods could work in this kind of changing environment. We have presented a general theorem that estimates the asymptotic distance of a value function sequence from a fixed value function. This result was applied to deduce a convergence theorem for value function based algorithms that work in (ε, δ)-MDPs.

In order to demonstrate our approach, we have presented some numerical experiments, too. First, two simple iterative processes were shown, a "well-behaving" stochastic process and a "pathological", oscillating deterministic process. Later, the effects of environmental changes on Q-learning based flexible job-shop scheduling were experimentally studied. Finally, we have analyzed how SARSA could work in varying (ε, δ)-type grid world domains.

We can conclude that value function based RL algorithms can work in varying environments, at least if the changes remain bounded in the limit. The asymptotic distance of the generated value function sequence from the optimal value function of the base MDP is bounded for a large class of stochastic iterative algorithms. Moreover, this bound is proportional to the diameter of this set, for example, to the parameters ε and δ in the case of (ε, δ)-MDPs. These results were illustrated through three classical RL methods: asynchronous value iteration, Q-learning and temporal difference learning policy evaluation. We showed, as well, that this approach can be applied to investigate the convergence of ADP methods.

There are many potential further research directions; as a conclusion to the paper, we highlight some of them. First, analyzing the effects of environmental changes on the value function in the case of the expected average cost optimization criterion would be interesting. A promising direction could be to investigate environments with non-bounded changes, for example, when the environment might drift over time. Naturally, this drift should also be sufficiently slow in order to give the learning algorithm the opportunity to track the changes. Another possible direction could be the further analysis of the convergence results in case of applying value function approximation. The classical problem of exploration and exploitation should also be reinvestigated in changing environments. Finally, for practical reasons, it would be important to find finite time bounds for the convergence of stochastic iterative algorithms for (a potentially restricted class of) non-stationary environments.

Acknowledgments

The work was supported by the Hungarian Scientific Research Fund (OTKA), Grant No. T73376, and by the EU-project Coll-Plexity, 12781 (NEST). Balázs Csanád Csáji gratefully acknowledges the scholarship of the Hungarian Academy of Sciences. The authors are also very grateful to Csaba Szepesvári for the helpful comments and discussions.

Appendix A. Proofs

In this appendix the proofs of Theorems 11, 12, 13, 20 and Lemmas 14, 18 can be found.



Theorem 11 Assume that two MDPs differ only in their transition-probability functions, denoted by $p_1$ and $p_2$. Let the corresponding optimal value functions be $J_1^*$ and $J_2^*$, then
\[
\|J_1^* - J_2^*\|_\infty \le \frac{\alpha\,\|g\|_\infty}{(1-\alpha)^2}\,\|p_1 - p_2\|_1,
\]
where $\|\cdot\|_1$ is a norm on $f : X \times A \times X \to \mathbb{R}$ type functions, for example, $f(x,a,y) = p(y \mid x, a)$, defined by
\[
\|f\|_1 = \max_{x,\,a} \sum_{y \in X} |f(x, a, y)|.
\]
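Before the proof, the bound can be illustrated numerically; the following sketch (an illustration of ours, not part of the original proof) builds two small random MDPs differing only in their transition probabilities, computes $J_1^*$ and $J_2^*$ by value iteration, and checks that $\|J_1^* - J_2^*\|_\infty$ indeed stays below $\alpha\|g\|_\infty\|p_1 - p_2\|_1/(1-\alpha)^2$.

```python
import numpy as np

# Hypothetical sanity check of the bound in Theorem 11 on small random MDPs.
rng = np.random.default_rng(1)
n_x, n_a, alpha = 6, 3, 0.8

g = rng.random((n_x, n_a))                    # shared immediate costs

def random_p():
    p = rng.random((n_x, n_a, n_x))
    return p / p.sum(axis=2, keepdims=True)   # row-stochastic transitions

p1 = random_p()
p2 = 0.9 * p1 + 0.1 * random_p()              # a perturbed transition function

def optimal_value(p):
    """Value iteration for the optimal cost-to-go J*."""
    J = np.zeros(n_x)
    for _ in range(10_000):
        J_new = (g + alpha * p @ J).min(axis=1)
        if np.abs(J_new - J).max() < 1e-12:
            return J_new
        J = J_new
    return J

J1, J2 = optimal_value(p1), optimal_value(p2)
lhs = np.abs(J1 - J2).max()
p_dist = np.abs(p1 - p2).sum(axis=2).max()    # ||p1 - p2||_1 as defined above
rhs = alpha * np.abs(g).max() / (1 - alpha) ** 2 * p_dist
print(f"||J1*-J2*||_inf = {lhs:.4f} <= bound = {rhs:.4f}")
```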

Proof First, let us introduce a deterministic Markovian policy. For all states $x \in X$:
\[
\hat{\pi}(x) =
\begin{cases}
\displaystyle \operatorname*{arg\,min}_{a \in A(x)} \Big[ g(x,a) + \alpha \sum_{y \in X} p_1(y \mid x, a)\, J_1^*(y) \Big] & \text{if } J_1^*(x) \le J_2^*(x), \\[2ex]
\displaystyle \operatorname*{arg\,min}_{a \in A(x)} \Big[ g(x,a) + \alpha \sum_{y \in X} p_2(y \mid x, a)\, J_2^*(y) \Big] & \text{if } J_2^*(x) < J_1^*(x).
\end{cases}
\]
If the arg min is ambiguous, then any action that takes the minimum can be selected. Using the Bellman optimality equation in the first step, $\|J_1^* - J_2^*\|_\infty$ can be estimated as follows, for all $x \in X$:
\begin{align*}
|J_1^*(x) - J_2^*(x)| &= \Big| \min_{a \in A(x)} \Big[ g(x,a) + \alpha \sum_{y \in X} p_1(y \mid x,a) J_1^*(y) \Big] - \min_{a \in A(x)} \Big[ g(x,a) + \alpha \sum_{y \in X} p_2(y \mid x,a) J_2^*(y) \Big] \Big| \\
&\le \Big| g(x,\hat{\pi}(x)) + \alpha \sum_{y \in X} p_1(y \mid x,\hat{\pi}(x)) J_1^*(y) - g(x,\hat{\pi}(x)) - \alpha \sum_{y \in X} p_2(y \mid x,\hat{\pi}(x)) J_2^*(y) \Big|,
\end{align*}
where we applied that for all bounded functions $f_1, f_2 : S \to \mathbb{R}$ such that $\min_s f_1(s) \le \min_s f_2(s)$ and $\hat{s} = \arg\min_s f_1(s)$, we have $|\min_s f_1(s) - \min_s f_2(s)| \le |f_1(\hat{s}) - f_2(\hat{s})|$. Then, for all $x \in X$,
\begin{align*}
|J_1^*(x) - J_2^*(x)| &\le \alpha \Big| \sum_{y \in X} \big( p_1(y \mid x,\hat{\pi}(x)) J_1^*(y) - p_2(y \mid x,\hat{\pi}(x)) J_2^*(y) \big) \Big| \\
&= \alpha \Big| \sum_{y \in X} \big( p_1(y \mid x,\hat{\pi}(x)) - p_2(y \mid x,\hat{\pi}(x)) \big) J_1^*(y) + \sum_{y \in X} p_2(y \mid x,\hat{\pi}(x)) \big( J_1^*(y) - J_2^*(y) \big) \Big| \\
&\le \alpha \sum_{y \in X} \big| \big( p_1(y \mid x,\hat{\pi}(x)) - p_2(y \mid x,\hat{\pi}(x)) \big) J_1^*(y) \big| + \alpha \sum_{y \in X} \big| p_2(y \mid x,\hat{\pi}(x)) \big( J_1^*(y) - J_2^*(y) \big) \big|,
\end{align*}
where in the second step we have rewritten $p_1(y \mid x,\hat{\pi}(x)) J_1^*(y) - p_2(y \mid x,\hat{\pi}(x)) J_2^*(y)$ as
\[
p_1(y \mid x,\hat{\pi}(x)) J_1^*(y) - p_2(y \mid x,\hat{\pi}(x)) J_2^*(y)
= \big( p_1(y \mid x,\hat{\pi}(x)) - p_2(y \mid x,\hat{\pi}(x)) \big) J_1^*(y) + p_2(y \mid x,\hat{\pi}(x)) \big( J_1^*(y) - J_2^*(y) \big).
\]
Now, let us recall (a special form of) Hölder's inequality: let $v_1, v_2$ be two vectors and $1 \le q, r \le \infty$ with $1/q + 1/r = 1$. Then, we have $\|v_1 v_2\|_{(1)} \le \|v_1\|_{(q)} \|v_2\|_{(r)}$, where $\|\cdot\|_{(q)}$ denotes the vector norm, for example, $\|v\|_{(q)} = (\sum_i |v_i|^q)^{1/q}$ and $\|v\|_{(\infty)} = \max_i |v_i| = \|v\|_\infty$. Here, we applied the unusual "$(q)$" notation to avoid confusion with the applied matrix norm.

Notice that the first sum of the last estimation can be treated as the $(1)$-norm of $v_1 v_2$, where
\[
v_1(y) = p_1(y \mid x,\hat{\pi}(x)) - p_2(y \mid x,\hat{\pi}(x))
\quad \text{and} \quad
v_2(y) = J_1^*(y),
\]
after which Hölder's inequality can be applied with $q = 1$ and $r = \infty$ to estimate the sum. A similar argument can be repeated in the case of the second sum with
\[
v_1(y) = p_2(y \mid x,\hat{\pi}(x))
\quad \text{and} \quad
v_2(y) = J_1^*(y) - J_2^*(y).
\]
Then, after the two applications of Hölder's inequality, we have for all $x$ that
\[
|J_1^*(x) - J_2^*(x)| \le \alpha \, \| p_1(\,\cdot \mid x,\hat{\pi}(x)) - p_2(\,\cdot \mid x,\hat{\pi}(x)) \|_{(1)} \, \|J_1^*\|_\infty + \alpha \, \| p_2(\,\cdot \mid x,\hat{\pi}(x)) \|_{(1)} \, \|J_1^* - J_2^*\|_\infty.
\]
Since $\|J_1^*\|_\infty \le \|g\|_\infty / (1-\alpha)$, $\| p_2(\,\cdot \mid x,\hat{\pi}(x)) \|_{(1)} = 1$ and we have this estimation for all $x$,
\[
\|J_1^* - J_2^*\|_\infty \le \frac{\alpha \|g\|_\infty}{1-\alpha} \max_{x \in X} \sum_{y \in X} \big| p_1(y \mid x,\hat{\pi}(x)) - p_2(y \mid x,\hat{\pi}(x)) \big| + \alpha \|J_1^* - J_2^*\|_\infty,
\]
which formula can be overestimated, by taking the maximum over all actions, by
\[
\|J_1^* - J_2^*\|_\infty \le \frac{\alpha \|g\|_\infty}{1-\alpha} \|p_1 - p_2\|_1 + \alpha \|J_1^* - J_2^*\|_\infty,
\]
from which the statement of the theorem immediately follows after rearrangement.

Theorem 12 Assume that two discounted MDPs differ only in the immediate-cost functions, denoted by $g_1$ and $g_2$. Let the corresponding optimal value functions be $J_1^*$ and $J_2^*$, then
\[
\|J_1^* - J_2^*\|_\infty \le \frac{1}{1-\alpha} \, \|g_1 - g_2\|_\infty.
\]

Proof First, let us introduce a deterministic Markovian policy. For all states $x \in X$:
\[
\hat{\pi}(x) =
\begin{cases}
\displaystyle \operatorname*{arg\,min}_{a \in A(x)} \Big[ g_1(x,a) + \alpha \sum_{y \in X} p(y \mid x, a)\, J_1^*(y) \Big] & \text{if } J_1^*(x) \le J_2^*(x), \\[2ex]
\displaystyle \operatorname*{arg\,min}_{a \in A(x)} \Big[ g_2(x,a) + \alpha \sum_{y \in X} p(y \mid x, a)\, J_2^*(y) \Big] & \text{if } J_2^*(x) < J_1^*(x).
\end{cases}
\]
If the arg min is ambiguous, then any action that takes the minimum can be selected. Using the Bellman optimality equation in the first step, $\|J_1^* - J_2^*\|_\infty$ can be estimated as follows, for all $x \in X$:
\begin{align*}
|J_1^*(x) - J_2^*(x)| &= \Big| \min_{a \in A(x)} \Big[ g_1(x,a) + \alpha \sum_{y \in X} p(y \mid x,a) J_1^*(y) \Big] - \min_{a \in A(x)} \Big[ g_2(x,a) + \alpha \sum_{y \in X} p(y \mid x,a) J_2^*(y) \Big] \Big| \\
&\le \Big| g_1(x,\hat{\pi}(x)) + \alpha \sum_{y \in X} p(y \mid x,\hat{\pi}(x)) J_1^*(y) - g_2(x,\hat{\pi}(x)) - \alpha \sum_{y \in X} p(y \mid x,\hat{\pi}(x)) J_2^*(y) \Big|,
\end{align*}
where we applied that for all bounded functions $f_1, f_2 : S \to \mathbb{R}$ such that $\min_s f_1(s) \le \min_s f_2(s)$ and $\hat{s} = \arg\min_s f_1(s)$, we have $|\min_s f_1(s) - \min_s f_2(s)| \le |f_1(\hat{s}) - f_2(\hat{s})|$. Then, for all $x \in X$,
\begin{align*}
|J_1^*(x) - J_2^*(x)| &\le |g_1(x,\hat{\pi}(x)) - g_2(x,\hat{\pi}(x))| + \alpha \sum_{y \in X} p(y \mid x,\hat{\pi}(x)) \, |J_1^*(y) - J_2^*(y)| \\
&\le \|g_1 - g_2\|_\infty + \alpha \sum_{y \in X} p(y \mid x,\hat{\pi}(x)) \, \|J_1^* - J_2^*\|_\infty
= \|g_1 - g_2\|_\infty + \alpha \, \|J_1^* - J_2^*\|_\infty.
\end{align*}
It is easy to see that if $|J_1^*(x) - J_2^*(x)| \le \|g_1 - g_2\|_\infty + \alpha \|J_1^* - J_2^*\|_\infty$ holds for all $x \in X$, then
\[
\|J_1^* - J_2^*\|_\infty \le \|g_1 - g_2\|_\infty + \alpha \, \|J_1^* - J_2^*\|_\infty,
\]
from which the statement of the theorem immediately follows after rearrangement.

Theorem 13 Assume that two discounted MDPs differ only in the discount factors, denoted by $\alpha_1, \alpha_2 \in [0,1)$. Let the corresponding optimal value functions be $J_1^*$ and $J_2^*$, then
\[
\|J_1^* - J_2^*\|_\infty \le \frac{|\alpha_1 - \alpha_2|\,\|g\|_\infty}{(1-\alpha_1)(1-\alpha_2)}.
\]

Proof Let $\pi_i^*$ denote a greedy and deterministic policy based on value function $J_i^*$, where $i \in \{1, 2\}$. Naturally, policy $\pi_i^*$ is optimal if the discount rate is $\alpha_i$ (Theorem 6). Then, let us introduce a deterministic Markovian control policy $\hat{\pi}$ defined as
\[
\hat{\pi}(x) =
\begin{cases}
\pi_1^*(x) & \text{if } J_1^*(x) \le J_2^*(x), \\
\pi_2^*(x) & \text{if } J_2^*(x) < J_1^*(x).
\end{cases}
\]
For any state $x$ the difference of the two value functions can be estimated as follows,
\begin{align*}
|J_1^*(x) - J_2^*(x)| &= \Big| \min_{a \in A(x)} \Big[ g(x,a) + \alpha_1 \sum_{y \in X} p(y \mid x,a) J_1^*(y) \Big] - \min_{a \in A(x)} \Big[ g(x,a) + \alpha_2 \sum_{y \in X} p(y \mid x,a) J_2^*(y) \Big] \Big| \\
&\le \Big| g(x,\hat{\pi}(x)) + \alpha_1 \sum_{y \in X} p(y \mid x,\hat{\pi}(x)) J_1^*(y) - g(x,\hat{\pi}(x)) - \alpha_2 \sum_{y \in X} p(y \mid x,\hat{\pi}(x)) J_2^*(y) \Big|,
\end{align*}
where we applied that for all bounded functions $f_1, f_2 : S \to \mathbb{R}$ such that $\min_s f_1(s) \le \min_s f_2(s)$ and $\hat{s} = \arg\min_s f_1(s)$, we have $|\min_s f_1(s) - \min_s f_2(s)| \le |f_1(\hat{s}) - f_2(\hat{s})|$. Then, for all $x \in X$,
\[
|J_1^*(x) - J_2^*(x)| \le \Big| \sum_{y \in X} p(y \mid x,\hat{\pi}(x)) \big( \alpha_1 J_1^*(y) - \alpha_2 J_2^*(y) \big) \Big|
\le \frac{|\alpha_1 - \alpha_2|}{1-\alpha_1} \, \|g\|_\infty + \alpha_2 \, \|J_1^* - J_2^*\|_\infty,
\]
where in the last step we used the following estimation of $|\alpha_1 J_1^*(y) - \alpha_2 J_2^*(y)|$,
\begin{align*}
|\alpha_1 J_1^*(y) - \alpha_2 J_2^*(y)| &= |\alpha_1 J_1^*(y) - \alpha_2 J_1^*(y) + \alpha_2 J_1^*(y) - \alpha_2 J_2^*(y)| \\
&\le |\alpha_1 - \alpha_2| \, |J_1^*(y)| + \alpha_2 \, |J_1^*(y) - J_2^*(y)|
\le \frac{|\alpha_1 - \alpha_2|}{1-\alpha_1} \, \|g\|_\infty + \alpha_2 \, \|J_1^* - J_2^*\|_\infty,
\end{align*}
where we applied the fact that for any state $y$ we have
\[
|J_1^*(y)| \le \sum_{t=0}^{\infty} \alpha_1^t \, \|g\|_\infty = \frac{1}{1-\alpha_1} \, \|g\|_\infty.
\]
Because the estimation of $|J_1^*(x) - J_2^*(x)|$ is valid for all $x$, we have the following result,
\[
\|J_1^* - J_2^*\|_\infty \le \frac{|\alpha_1 - \alpha_2|}{1-\alpha_1} \, \|g\|_\infty + \alpha_2 \, \|J_1^* - J_2^*\|_\infty,
\]
from which the statement of the theorem immediately follows after rearrangement.

Lemma 14 Assume that we have two discounted MDPs which differ only in the transition-probability functions or only in the immediate-cost functions or only in the discount factors. Let the corresponding optimal action-value functions be $Q_1^*$ and $Q_2^*$, respectively. Then, the bounds for $\|J_1^* - J_2^*\|_\infty$ of Theorems 11, 12 and 13 are also bounds for $\|Q_1^* - Q_2^*\|_\infty$.

Proof We will prove the lemma in three parts, depending on the changing components.

Case 1: Assume that the MDPs differ only in the transition functions, denoted by $p_1$ and $p_2$. We will prove the same estimation as in the case of Theorem 11, more precisely, that
\[
\|Q_1^* - Q_2^*\|_\infty \le \frac{\alpha \, \|g\|_\infty}{(1-\alpha)^2} \, \|p_1 - p_2\|_1.
\]

For all state-action pairs $(x, a)$ we can estimate the absolute difference of $Q_1^*$ and $Q_2^*$ as
\begin{align*}
|Q_1^*(x,a) - Q_2^*(x,a)| &= \Big| g(x,a) + \alpha \sum_{y \in X} p_1(y \mid x,a) J_1^*(y) - g(x,a) - \alpha \sum_{y \in X} p_2(y \mid x,a) J_2^*(y) \Big| \\
&\le \alpha \Big| \sum_{y \in X} \big( p_1(y \mid x,a) J_1^*(y) - p_2(y \mid x,a) J_2^*(y) \big) \Big|,
\end{align*}
from which the proof continues in the same way as the proof of Theorem 11.

Case 2: Assume that the MDPs differ only in the immediate-cost functions, denoted by $g_1$ and $g_2$. We will prove the same estimation as in the case of Theorem 12, more precisely,
\[
\|Q_1^* - Q_2^*\|_\infty \le \frac{1}{1-\alpha} \, \|g_1 - g_2\|_\infty.
\]
For all state-action pairs $(x, a)$ we can estimate the absolute difference of $Q_1^*$ and $Q_2^*$ as
\begin{align*}
|Q_1^*(x,a) - Q_2^*(x,a)| &= \Big| g_1(x,a) + \alpha \sum_{y \in X} p(y \mid x,a) J_1^*(y) - g_2(x,a) - \alpha \sum_{y \in X} p(y \mid x,a) J_2^*(y) \Big| \\
&\le \|g_1 - g_2\|_\infty + \alpha \Big| \sum_{y \in X} p(y \mid x,a) \big( J_1^*(y) - J_2^*(y) \big) \Big|
\le \|g_1 - g_2\|_\infty + \alpha \, \|J_1^* - J_2^*\|_\infty.
\end{align*}
The statement immediately follows after we apply Theorem 12 to estimate $\|J_1^* - J_2^*\|_\infty$.

Case 3: Assume that the MDPs differ only in the discount rates, denoted by $\alpha_1$ and $\alpha_2$. We will prove the same estimation as in the case of Theorem 13, more precisely, that
\[
\|Q_1^* - Q_2^*\|_\infty \le \frac{|\alpha_1 - \alpha_2|\,\|g\|_\infty}{(1-\alpha_1)(1-\alpha_2)}.
\]
For all state-action pairs $(x, a)$ we can estimate the absolute difference of $Q_1^*$ and $Q_2^*$ as
\begin{align*}
|Q_1^*(x,a) - Q_2^*(x,a)| &= \Big| g(x,a) + \alpha_1 \sum_{y \in X} p(y \mid x,a) J_1^*(y) - g(x,a) - \alpha_2 \sum_{y \in X} p(y \mid x,a) J_2^*(y) \Big| \\
&\le \Big| \alpha_1 \sum_{y \in X} p(y \mid x,a) J_1^*(y) - \alpha_2 \sum_{y \in X} p(y \mid x,a) J_2^*(y) \Big|
\le \frac{|\alpha_1 - \alpha_2|}{1-\alpha_1} \, \|g\|_\infty + \alpha_2 \, \|J_1^* - J_2^*\|_\infty,
\end{align*}
where in the last step we applied the same estimation as in the proof of Theorem 13. The statement immediately follows after we apply Theorem 13 to estimate $\|J_1^* - J_2^*\|_\infty$.

Lemma 18 Assume that two discounted MDPs, $M_1$ and $M_2$, differ only in the discount factors, denoted by $\alpha_1$ and $\alpha_2$. Then, there exists an MDP, denoted by $M_3$, such that it differs only in the immediate-cost function from $M_1$, thus its discount factor is $\alpha_1$, and it has the same optimal value function as $M_2$. The immediate-cost function of $M_3$ is
\[
\hat{g}(x,a) = g(x,a) + (\alpha_2 - \alpha_1) \sum_{y \in X} p(y \mid x,a) \, J_2^*(y),
\]
where $p$ is the transition-probability function of $M_1$, $M_2$ and $M_3$; $g$ is the immediate-cost function of $M_1$ and $M_2$; and $J_2^*(y)$ denotes the optimal cost-to-go function of $M_2$.

Proof First of all, let us overview some general statements that will be used in the proof. Recall from Bertsekas and Tsitsiklis (1996) that we can treat the solution (the optimal value function) of the infinite horizon problem as the limit of the finite horizon solutions. More precisely, the Bellman optimality equation for the $n$-stage (finite horizon) problem is
\[
J_k^*(x) = \min_{a \in A(x)} \Big[ g(x,a) + \alpha \sum_{y \in X} p(y \mid x,a) \, J_{k-1}^*(y) \Big],
\]
for all $k \in \{1, \dots, n\}$ and $x \in X$. Note that by definition, we have $J_0^*(x) = 0$. Moreover,
\[
\forall x \in X : \quad J^*(x) = J_\infty^*(x) = \lim_{n \to \infty} J_n^*(x).
\]
Also recall that the $n$-stage optimal action-value function is defined as
\[
Q_k^*(x,a) = g(x,a) + \alpha \sum_{y \in X} p(y \mid x,a) \, J_{k-1}^*(y),
\]
for all $x$, $a$ and $k \in \{1, \dots, n\}$. We also have $Q_0^*(x,a) = 0$ and $J_n^*(x) = \min_a Q_n^*(x,a)$.

During the proof we will apply the solutions of suitable finite horizon problems, thus, in order to avoid notational confusion, let us denote the optimal state and action-value functions of $M_2$ and $M_3$ by $J^*, Q^*$ and $\hat{J}^*, \hat{Q}^*$, respectively. The corresponding finite horizon optimal value functions will be denoted by $J_n^*, Q_n^*$ and $\hat{J}_n^*, \hat{Q}_n^*$, respectively, where $n$ is the length of the horizon. We will show that for all states $x$ and actions $a$ we have $Q^*(x,a) = \hat{Q}^*(x,a)$, from which $J^* = \hat{J}^*$ follows. Let us define $\hat{g}_n$ for all $n > 0$ by
\[
\hat{g}_n(x,a) = g(x,a) + (\alpha_2 - \alpha_1) \sum_{y \in X} p(y \mid x,a) \, J_{n-1}^*(y).
\]

We will apply induction on $n$. For the case of $n = 0$ we trivially have $Q_0^* = \hat{Q}_0^*$, since both of them are constant zero functions. Now, assume that $Q_k^* = \hat{Q}_k^*$ for $k \le n$, then
\begin{align*}
\hat{Q}_{n+1}^*(x,a) &= \hat{g}_{n+1}(x,a) + \alpha_1 \sum_{y \in X} p(y \mid x,a) \, \hat{J}_n^*(y) \\
&= g(x,a) + (\alpha_2 - \alpha_1) \sum_{y \in X} p(y \mid x,a) \, J_n^*(y) + \alpha_1 \sum_{y \in X} p(y \mid x,a) \, \hat{J}_n^*(y) \\
&= g(x,a) + \alpha_2 \sum_{y \in X} p(y \mid x,a) \, J_n^*(y) + \alpha_1 \sum_{y \in X} p(y \mid x,a) \, \big( \hat{J}_n^*(y) - J_n^*(y) \big) \\
&= g(x,a) + \alpha_2 \sum_{y \in X} p(y \mid x,a) \, J_n^*(y) + \alpha_1 \sum_{y \in X} p(y \mid x,a) \Big( \min_{b \in A(y)} \hat{Q}_n^*(y,b) - \min_{b \in A(y)} Q_n^*(y,b) \Big) \\
&= g(x,a) + \alpha_2 \sum_{y \in X} p(y \mid x,a) \, J_n^*(y) = Q_{n+1}^*(x,a).
\end{align*}
We have proved that for all $n$: $Q_n^* = \hat{Q}_n^*$. Consequently, $Q^*(x,a) = \lim_{n \to \infty} Q_n^*(x,a) = \lim_{n \to \infty} \hat{Q}_n^*(x,a) = \hat{Q}^*(x,a)$ and, thus, $J^*(x) = \min_a Q^*(x,a) = \min_a \hat{Q}^*(x,a) = \hat{J}^*(x)$. Finally, note that for the case of the infinite horizon problem $\hat{g}(x,a) = \lim_{n \to \infty} \hat{g}_n(x,a)$.
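The construction of Lemma 18 can also be checked numerically. The sketch below (our illustration on an arbitrary small random MDP, not part of the proof) builds $\hat{g}$ from $J_2^*$ and verifies that value iteration with cost $\hat{g}$ and discount $\alpha_1$ reproduces the optimal value function obtained with cost $g$ and discount $\alpha_2$.

```python
import numpy as np

# Numerical check of Lemma 18 on a small random MDP (illustration only).
rng = np.random.default_rng(2)
n_x, n_a = 5, 3
alpha1, alpha2 = 0.7, 0.9

p = rng.random((n_x, n_a, n_x))
p /= p.sum(axis=2, keepdims=True)            # shared transition probabilities
g = rng.random((n_x, n_a))                   # immediate costs of M1 and M2

def optimal_value(cost, alpha):
    """Value iteration for the optimal cost-to-go with the given cost/discount."""
    J = np.zeros(n_x)
    for _ in range(100_000):
        J_new = (cost + alpha * p @ J).min(axis=1)
        if np.abs(J_new - J).max() < 1e-12:
            return J_new
        J = J_new
    return J

J2 = optimal_value(g, alpha2)                        # J2* of M2
g_hat = g + (alpha2 - alpha1) * (p @ J2)             # cost function of M3
J3 = optimal_value(g_hat, alpha1)                    # optimal value of M3

print("max |J3 - J2| =", np.abs(J3 - J2).max())      # ~0 up to numerical error
```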

Theorem 20 Suppose that Assumptions 1-3 hold and let $V_t$ be the sequence generated by
\[
V_{t+1}(x) = (1 - \gamma_t(x)) \, V_t(x) + \gamma_t(x) \big( (K_t V_t)(x) + W_t(x) \big),
\]
then, for any $V^*, V_0 \in \mathcal{V}$, the sequence $V_t$ $\kappa$-approximates function $V^*$ with
\[
\kappa = \frac{4\rho}{1 - \beta_0},
\quad \text{where} \quad
\rho = \limsup_{t \to \infty} \|V_t^* - V^*\|_\infty.
\]
The three main assumptions applied are as follows.

Assumption 1 There exists a constant $C > 0$ such that for all states $x$ and times $t$, we have
\[
\mathbb{E}[\,W_t(x) \mid \mathcal{F}_t\,] = 0
\quad \text{and} \quad
\mathbb{E}[\,W_t^2(x) \mid \mathcal{F}_t\,] < C < \infty.
\]
Assumption 2 For all $x$ and $t$, $0 \le \gamma_t(x) \le 1$, and we have with probability one
\[
\sum_{t=0}^{\infty} \gamma_t(x) = \infty
\quad \text{and} \quad
\sum_{t=0}^{\infty} \gamma_t^2(x) < \infty.
\]
Assumption 3 For all $t$, operator $K_t : \mathcal{V} \to \mathcal{V}$ is a supremum norm contraction mapping with Lipschitz constant $\beta_t < 1$ and with fixed point $V_t^*$. Formally, for all $V_1, V_2 \in \mathcal{V}$,
\[
\|K_t V_1 - K_t V_2\|_\infty \le \beta_t \, \|V_1 - V_2\|_\infty.
\]
Let us introduce a common Lipschitz constant $\beta_0 = \limsup_{t \to \infty} \beta_t$, and assume that $\beta_0 < 1$.

Proof During the proof, our main aim will be to apply Theorem 9, thus, we have to show that the assumptions of the theorem hold. Let us define operator $H_t$ for all $V_a, V_b \in \mathcal{V}$ by
\[
H_t(V_a, V_b)(x) = (1 - \gamma_t(x)) \, V_a(x) + \gamma_t(x) \big( (K_t V_b)(x) + W_t(x) \big).
\]
Applying this definition, first, we will show that $V_{t+1}' = H_t(V_t', V^*)$ $\kappa$-approximates $V^*$ for all $V_0'$. Because $\beta_t < 1$ for all $t$ and $\limsup_{t \to \infty} \beta_t = \beta_0 < 1$, it follows that $\sup_t \beta_t = \tilde{\beta} < 1$ and each $K_t$ is a $\tilde{\beta}$-contraction. We know that $\limsup_{t \to \infty} \|V^* - V_t^*\|_\infty = \rho$, therefore, for all $\delta > 0$, there is an index $t_0$ such that for all $t \ge t_0$, we have $\|V^* - V_t^*\|_\infty \le \rho + \delta$. Using these observations, we can estimate $\|K_t V^*\|_\infty$ for all $t > t_0$, as follows
\begin{align*}
\|K_t V^*\|_\infty &= \|K_t V^* - V^* + V^*\|_\infty \le \|K_t V^* - V^*\|_\infty + \|V^*\|_\infty \\
&\le \|K_t V^* - V_t^* + V_t^* - V^*\|_\infty + \|V^*\|_\infty
\le \|K_t V^* - V_t^*\|_\infty + \|V_t^* - V^*\|_\infty + \|V^*\|_\infty \\
&\le \|K_t V^* - K_t V_t^*\|_\infty + \rho + \delta + \|V^*\|_\infty
\le \tilde{\beta} \, \|V^* - V_t^*\|_\infty + \rho + \delta + \|V^*\|_\infty \\
&\le (1 + \tilde{\beta})\rho + (1 + \tilde{\beta})\delta + \|V^*\|_\infty
\le (1 + \tilde{\beta})\rho + 2\delta + \|V^*\|_\infty.
\end{align*}
If we apply $\delta = (1 - \tilde{\beta})\rho/2$, then for sufficiently large $t$ ($t \ge t_0$) we have $\|K_t V^*\|_\infty \le 2\rho + \|V^*\|_\infty$.

Now, we can upper estimate $V_{t+1}' = H_t(V_t', V^*)$, for all $x \in X$, $V_0' \in \mathcal{V}$ and $t \ge t_0$, by
\begin{align*}
V_{t+1}'(x) = H_t(V_t', V^*)(x) &= (1 - \gamma_t(x)) \, V_t'(x) + \gamma_t(x) \big( (K_t V^*)(x) + W_t(x) \big) \\
&\le (1 - \gamma_t(x)) \, V_t'(x) + \gamma_t(x) \big( \|K_t V^*\|_\infty + W_t(x) \big) \\
&\le (1 - \gamma_t(x)) \, V_t'(x) + \gamma_t(x) \big( \|V^*\|_\infty + 2\rho + W_t(x) \big).
\end{align*}
Let us define a new sequence for all $x \in X$ by
\[
\tilde{V}_{t+1}(x) =
\begin{cases}
(1 - \gamma_t(x)) \, \tilde{V}_t(x) + \gamma_t(x) \big( \|V^*\|_\infty + 2\rho + W_t(x) \big) & \text{if } t \ge t_0, \\
V_t'(x) & \text{if } t < t_0.
\end{cases}
\]
It is easy to see (for example, by induction from $t_0$) that for all times $t$ and states $x$ we have $V_t'(x) \le \tilde{V}_t(x)$ with probability one, more precisely, for almost all $\omega \in \Omega$, where $\omega = \langle \omega_1, \omega_2, \dots \rangle$ drives the noise parameter $W_t(x) = w_t(x, \omega_t)$ in both $V_t'$ and $\tilde{V}_t$. Note that $\Omega$ is the sample space of the underlying probability measure space $\langle \Omega, \mathcal{F}, \mathbb{P} \rangle$. Applying the "Conditional Averaging Lemma" of Szepesvári and Littman (1999), which is a variant of the Robbins-Monro theorem and requires Assumptions 1 and 2, we get that $\tilde{V}_t(x)$ converges with probability one to $2\rho + \|V^*\|_\infty$ for all $\tilde{V}_0 \in \mathcal{V}$ and $x \in X$. Therefore, because $V_t'(x) \le \tilde{V}_t(x)$ for all $x$ and $t$ with probability one, we have that the sequence $V_t'(x)$ $\kappa$-approximates $V^*(x)$ with $\kappa = 2\rho$ for all functions $V_0' \in \mathcal{V}$ and states $x \in X$.

Now, let us turn to Conditions 1-4 of Theorem 9. For all $x$ and $t$, we define functions $F_t(x)$ and $G_t(x)$ as $F_t(x) = \beta_t \gamma_t(x)$ and $G_t(x) = 1 - \gamma_t(x)$. By Assumption 2, functions $F_t(x), G_t(x) \in [0,1]$ for all $x$ and $t$. Condition 1 trivially follows from the definitions of $G_t$ and $H_t$. For the proof of Condition 2, we need Assumption 3, namely that each operator $K_t$ is a contraction with respect to $\beta_t$. Condition 3 is a consequence of Assumption 2 and the definition of $G_t$. Finally, we have Condition 4 for any $\varepsilon > 0$ and sufficiently large $t$ by defining $\xi = \beta_0 + \varepsilon$. Applying Theorem 9 with $\kappa = 2\rho$, we get that $V_t$ $\kappa'$-approximates $V^*$ with $\kappa' = 4\rho/(1 - \beta_0 - \varepsilon)$. In the end, letting $\varepsilon$ go to zero proves our statement.
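As a final, simplified illustration of Theorem 20 (a scalar sketch with assumed constants, not an experiment from the paper), the snippet below lets the fixed points $V_t^*$ of the time-varying contractions wander inside a ball of radius $\rho$ around $V^*$ and observes that the iterate asymptotically stays well within $\kappa = 4\rho/(1-\beta_0)$; the bound is, of course, conservative.

```python
import numpy as np

# Scalar illustration of Theorem 20: time-dependent contractions K_t with
# drifting fixed points V_t*, |V_t* - V*| <= rho, Robbins-Monro stepsizes and
# zero-mean noise; the iterate kappa-approximates V* with kappa = 4*rho/(1-beta0).

rng = np.random.default_rng(3)
beta0, rho, V_star = 0.8, 0.5, 0.0

V, worst = 10.0, 0.0                                   # arbitrary initial value
for t in range(200_000):
    V_t_star = V_star + rho * np.sin(0.01 * t)         # drifting fixed point
    K_V = V_t_star + beta0 * (V - V_t_star)            # contraction around V_t*
    gamma_t = 1.0 / (t + 1)                            # sum = inf, sum of squares < inf
    W = rng.normal(0.0, 0.5)                           # zero-mean, bounded-variance noise
    V = (1 - gamma_t) * V + gamma_t * (K_V + W)
    if t > 100_000:
        worst = max(worst, abs(V - V_star))            # observed asymptotic error

print(f"observed asymptotic error {worst:.3f} <= kappa = {4 * rho / (1 - beta0):.3f}")
```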

References

D. P. Bertsekas. Dynamic Programming and Optimal Control, volume 2. Athena Scientific, Belmont, Massachusetts, 3rd edition, 2007.

D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.

B. Cs. Csáji. Adaptive Resource Control: Machine Learning Approaches to Resource Allocation in Uncertain and Changing Environments. PhD thesis, Faculty of Informatics, Eötvös Loránd University, Budapest, 2008.

B. Cs. Csáji and L. Monostori. Adaptive sampling based large-scale stochastic resource control. In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI-06), July 16-20, Boston, Massachusetts, pages 815-820, 2006.

R. Montes de Oca, A. Sakhanenko, and F. Salem. Estimates for perturbations of general discounted Markov control chains. Applied Mathematics, 30:287-304, 2003.

E. Even-Dar and Y. Mansour. Learning rates for Q-learning. Journal of Machine Learning Research (JMLR), 5:1-25, Dec. 2003.

G. Favero and W. J. Runggaldier. A robustness result for stochastic control. Systems and Control Letters, 46:91-66, 2002.

E. A. Feinberg and A. Shwartz, editors. Handbook of Markov Decision Processes: Methods and Applications. Kluwer Academic Publishers, 2002.

E. Gordienko and F. S. Salem. Estimates of stability of Markov control processes with unbounded cost. Kybernetika, 36:195-210, 2000.

Zs. Kalmár, Cs. Szepesvári, and A. Lőrincz. Module-based reinforcement learning: Experiments with a real robot. Machine Learning, 31:55-85, 1998.

M. Kearns and S. Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49:209-232, 2002.

A. Müller. How does the solution of a Markov decision process depend on the transition probabilities? Technical report, Institute for Economic Theory and Operations Research, University of Karlsruhe, 1996.

R. Munos and A. W. Moore. Rates of convergence for variable resolution schemes in optimal control. In Proceedings of the 17th International Conference on Machine Learning (ICML), pages 647-654. Morgan Kaufmann, San Francisco, CA, 2000.

M. Pinedo. Scheduling: Theory, Algorithms, and Systems. Prentice-Hall, 2002.

S. Singh and D. Bertsekas. Reinforcement learning for dynamic channel allocation in cellular telephone systems. In Advances in Neural Information Processing Systems, volume 9, pages 974-980. The MIT Press, 1997.

R. S. Sutton and A. G. Barto. Reinforcement Learning. The MIT Press, 1998.

R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12:1057-1063, 2000.

Cs. Szepesvári and M. L. Littman. A unified analysis of value-function-based reinforcement learning algorithms. Neural Computation, 11(8):2017-2060, 1999.

I. Szita, B. Takács, and A. Lőrincz. ε-MDPs: Learning in varying environments. Journal of Machine Learning Research (JMLR), 3:145-174, 2002.

B. Van Roy, D. Bertsekas, Y. Lee, and J. Tsitsiklis. A neuro-dynamic programming approach to retailer inventory management. Technical report, Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA, 1996.