Tuning Continual Exploration in Reinforcement Learning∗
(Draft manuscript submitted for publication)

Youssef Achbany, Francois Fouss, Luh Yen, Alain Pirotte & Marco Saerens
Information Systems Research Unit (ISYS), Place des Doyens 1, Université de Louvain, Belgium
{youssef.achbany, francois.fouss, luh.yen, alain.pirotte, marco.saerens}@ucLouvain.be

September 1, 2006
Abstract

This paper presents a model that allows tuning continual exploration in an optimal way by integrating exploration and exploitation in a common framework. It first quantifies the rate of exploration by defining the degree of exploration of a state as the entropy of the probability distribution for choosing an admissible action. Then, the exploration/exploitation tradeoff is stated as a global optimization problem: find the exploration strategy that minimizes the expected cumulated cost, while maintaining fixed degrees of exploration at the nodes. In other words, “exploitation” is maximized for constant “exploration”. This formulation leads to a set of nonlinear updating rules reminiscent of the value-iteration algorithm. Convergence of these rules to a local minimum can be proved for a stationary environment. Interestingly, in the deterministic case, when there is no exploration, these equations reduce to the Bellman equations for finding the shortest path while, when exploration is maximal, a full “blind” exploration is performed. We also show that, if the graph of states is directed and acyclic, the nonlinear equations can easily be solved by performing a single backward pass from the destination state. Stochastic shortest-path problems as well as discounted problems are also examined. The theoretical results are confirmed by simple simulations showing that this exploration strategy outperforms the naive ǫ-greedy and Boltzmann strategies.
1. Introduction

∗ A preliminary version of this work appeared in the proceedings of the International Conference on Artificial Neural Networks (ICANN 2006).

One of the specific challenges of reinforcement learning is the tradeoff between exploration and exploitation. Exploration aims to continually try new ways of solving the
problem, while exploitation aims to capitalize on already well-established solutions. Exploration is especially relevant when the environment is changing, i.e., nonstationary. In this case, good solutions can deteriorate over time and better solutions can appear. Without exploration, the system sends agents only along the up-to-now best path without exploring alternative paths. The system is therefore unaware of the changes and its performance inevitably deteriorates with time. One of the key features of reinforcement learning is that it explicitly addresses the exploration/exploitation issue as well as the online estimation of the probability distributions in an integrated way [22].

This work makes a clear distinction between “preliminary” or “initial exploration”, and “continual online exploration”. The objective of preliminary exploration is to discover relevant goals, or destination states, and to estimate a first optimal policy for exploiting them. On the other hand, continual online exploration aims to continually explore the environment, after the preliminary exploration stage, in order to adjust the policy to changes in the environment. In the case of preliminary exploration, two further distinctions are often made [24, 25, 26, 27]. A first group of strategies uses randomness for exploration and is often referred to as undirected exploration. Control actions are selected with a probability distribution, taking the expected cost into account. The second group, referred to as directed exploration, uses domain-specific knowledge for guiding exploration [24, 25, 26, 27]. Usually, directed exploration provides better results in terms of learning time and cost. On the other hand, “continual online exploration” can be performed by, for instance, re-exploring the environment either periodically or continually [5, 19] using an ǫ-greedy or a Boltzmann exploration strategy. For instance, joint estimation of the exploration strategy and the state-transition probabilities for continual online exploration can be performed within the SARSA framework [17, 20, 22].

This work presents a unified framework integrating exploitation and exploration for undirected, continual, exploration. Exploration is formally defined as the association of a probability distribution to the set of available control actions in each state (choice randomization). The rate of exploration is quantified with the concept of degree of exploration, defined as the (Shannon) entropy [11] of the probability distribution for the set of available actions in a given state. If no exploration is performed, the agents are routed on the best path with probability one – they just exploit the solution. With exploration, the agents continually explore a possibly changing environment to keep current with it. When the entropy is zero in a state, no exploration is performed from this state, while, when the entropy is maximal, a full, blind exploration with equal probability of choosing any action is performed. The online exploration/exploitation issue is then stated as a global optimization problem: learn the exploration strategy that minimizes the expected cumulated cost from the initial state to the goal while maintaining a fixed degree of exploration. In other words, “exploitation” is maximized for constant “exploration”. This problem leads to a set of nonlinear equations defining the optimal solution.
These equations can be solved by iterating them until convergence, which is proved for a stationary environment and a particular initialization strategy. They provide the action policy (the probability distribution of choosing an action in a given state) that minimizes the average cost from the initial state to the destination states, given the degree of exploration in each state. Interestingly, when the degree of exploration is zero in all
states, which corresponds to the deterministic case, the nonlinear equations reduce to the Bellman equations for finding the shortest path from the initial state to the destination states. The main drawback of this method is that it is computationally demanding since it relies on iterative algorithms like the value-iteration algorithm.

For the sake of simplicity, we first concentrate here on the “deterministic shortest-path problem”, as defined for instance in [4], where any chosen control action deterministically drives the agent to a unique successor state. On the other hand, if the actions have uncertain effects, the resulting state is given by a probability distribution and one speaks of “stochastic shortest-path problems”, which are examined later in Section 5. In this case, a probability distribution on the successor states is introduced and it must eventually be estimated by the agents. Moreover, it is shown that, if the graph of states is directed and acyclic, the nonlinear equations can easily be solved by performing a single backward pass from the destination state.

In summary, this paper has three main contributions:
• It introduces a model that integrates exploration and exploitation into a single framework, with the purpose of monitoring exploration. More specifically, we propose to quantify the exploration rate by the entropy associated to the probability distribution of choosing an action in a state. This makes it possible to measure and control the exploration rate in a principled way for continual, online, exploration.
• Within this framework, it is shown that the Boltzmann exploration strategy based on the Q-value is optimal: it corresponds to the best policy in terms of average cost needed to route the agents to the goal states. Both stochastic shortest-path and discounted problems are investigated.
• Simulation results performed on dynamic environments confirm that this exploration strategy indeed performs better than the ǫ-greedy and naive Boltzmann strategies.

Usually, one tackles the exploration/exploitation issue by periodically readjusting the policy for choosing actions and re-exploring up-to-now suboptimal paths. In this case, the agent chooses a control action with a probability that is a function of the immediate cost associated with this action (usually, a softmax action selection; see [13, 22]). However, as shown in this paper, this strategy is sub-optimal because it does not integrate the exploitation issue within the exploration. Stated otherwise, the exploration strategy does not take exploitation into account.

This paper is organized as follows. Section 2 introduces the notations, the standard deterministic shortest-path problem, and how we manage continual exploration. Section 3 describes our procedure for solving the deterministic shortest-path problem with exploration. The particular case where the graph of states is directed and acyclic is treated in Section 4. The stochastic shortest-path problem is discussed in Section 5, while a short discussion of discounted problems is provided in Section 6. Numerical experiments are presented in Sections 7 and 8, and Section 9 concludes.
2. Framework description

2.1. Statement of the problem and notations

During every state transition, a finite cost c(k, u(k)) is incurred when leaving state k ∈ {1, 2, . . . , n} while executing a control action u selected from a set U(k) of actions, or choices, available in state k. In order to simplify the notations, the dependence of the random variable u on the state will be omitted when obvious. The cost can be positive (penalty), negative (reward), or zero, provided that no cycle exists whose total cost is negative. This is a standard requirement in shortest-path problems [7]; indeed, if such a cycle exists, then traversing it an arbitrarily large number of times would result in a path with an arbitrarily small cost, so that a best path could not be defined. In particular, this implies that, if the graph of the states is nondirected, all costs are nonnegative.

The control action u is chosen according to a policy Π that maps every state k to the set U(k) of available actions with a certain probability distribution, π_k(u), with u ∈ U(k). Thus the policy associates to each state k a probability distribution on the set of available actions U(k): Π ≡ {π_k(u), k = 1, 2, . . . , n}. For instance, if the available actions in state k are U(k) = {u_1, u_2, u_3}, the distribution π_k(u) specifies three probabilities π_k(u_1), π_k(u_2), and π_k(u_3). The degree of exploration is quantified as the entropy of this probability distribution (see next section). Randomized choices are common in a variety of fields, for instance decision sciences [16] or game theory, where they are called mixed strategies (see, e.g., [15]). Thus, the problem tackled in this section corresponds to a randomized shortest-path problem. Moreover, we assume that, once the action has been chosen, the next state k' is known deterministically, k' = f_k(u), where f is a one-to-one mapping between (state, action) pairs and resulting states. We assume that different actions lead to different states. This framework corresponds to a deterministic shortest-path problem. A simpler modeling of this problem would do without actions and directly define state-transition probabilities. The more general formalism fits full stochastic problems for which both the choice of actions and the state transitions are governed by probability distributions (see Section 5).

We assume, as in [4], that there is a special cost-free destination or goal state; once the system has reached that state, it remains there at no further cost (discounted problems are studied in Section 6). The goal is to minimize the total expected cost V_π(k_0) (Equation (2.1)) accumulated over a path k_0, k_1, ... in the graph, starting from an initial (or source) state k_0:

$$
V_\pi(k_0) = \mathrm{E}_\pi\left[\sum_{i=0}^{\infty} c(k_i, u_i)\right] \tag{2.1}
$$
The expectation E_π is taken on the policy Π, that is, on all the random choices of action u_i in state k_i. Moreover, we consider a problem structure such that termination is guaranteed, at least under an optimal policy. Thus, the horizon is finite, but its length is random and it depends on the policy. The conditions under which termination holds are equivalent to requiring that the destination state can be reached in a finite number of steps from any potential initial state; for a rigorous treatment, see [2, 4].
2.2. Computation of the total expected cost for a given policy

This problem can be represented as a Markov chain where each original state and each action is a state. The destination state is then an absorbing state, i.e., a state with no outgoing link. In this framework, the problem of computing the expected cost (2.1) from any state k is closely related to the computation of the average first-passage time in the associated Markov chain [12]. The average first-passage time is the average number of steps a random walker starting from the initial state k_0 will take in order to reach the destination state d. It can easily be generalized in order to take the cost of the transitions into account [10]. By first-step analysis (see for instance [23]), we show in Appendix A that, once the policy is fixed, V_π(k) can be computed through the following equations

$$
\begin{cases}
V_\pi(k) = \sum_{i \in U(k)} \pi_k(i)\,[c(k,i) + V_\pi(k_i')], & \text{with } k_i' = f_k(i) \text{ and } k \neq d \\
V_\pi(d) = 0, & \text{where } d \text{ is the destination state}
\end{cases} \tag{2.2}
$$

Thus, in Equation (2.2), k_i' is the state resulting from the application of control action i in state k. Equations (2.2) can be solved by iterating them or by inverting the so-called fundamental matrix [12]; they are analogous to the Bellman equations in Markov decision processes. Notice that the same framework can easily handle multiple destinations by defining one absorbing state for each destination. If the destination states are d_1, d_2, . . . , d_m, we have

$$
\begin{cases}
V_\pi(k) = \sum_{i \in U(k)} \pi_k(i)\,[c(k,i) + V_\pi(k_i')], & \text{where } k_i' = f_k(i) \text{ and } k \neq d_1, d_2, \ldots, d_m \\
V_\pi(d_j) = 0, & \text{for all destination states } d_j,\ j = 1, \ldots, m
\end{cases} \tag{2.3}
$$
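As an illustration, the linear system (2.2) can be solved by simple fixed-point iteration; the toy instance below and its dictionary encoding (admissible actions, costs and successor states per state) are our own assumptions, not from the paper.

```python
# Toy deterministic shortest-path instance (illustrative, not from the paper).
# succ[k][a] : successor state f_k(a);  cost[k][a] : cost c(k, a).
succ = {0: {'a': 1, 'b': 2}, 1: {'a': 3}, 2: {'a': 3}}
cost = {0: {'a': 1.0, 'b': 2.0}, 1: {'a': 1.0}, 2: {'a': 0.5}}
destination = 3

# A fixed (here uniform) policy pi_k(a) over the admissible actions U(k).
policy = {k: {a: 1.0 / len(actions) for a in actions}
          for k, actions in succ.items()}

def expected_cost(policy, succ, cost, destination, n_iter=1000):
    """Iterate Equation (2.2): V(k) = sum_a pi_k(a) [c(k, a) + V(f_k(a))]."""
    V = {k: 0.0 for k in succ}
    V[destination] = 0.0                    # V(d) stays 0 by construction
    for _ in range(n_iter):
        for k in succ:
            V[k] = sum(policy[k][a] * (cost[k][a] + V[succ[k][a]])
                       for a in succ[k])
    return V

print(expected_cost(policy, succ, cost, destination))
# e.g. V(0) = 0.5*(1 + V(1)) + 0.5*(2 + V(2)) = 0.5*2 + 0.5*2.5 = 2.25
```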
We now have to address the following questions:
1. How do we quantify exploration?
2. How do we control the randomized choices, i.e., exploration?
3. How do we compute the optimal policy for a predefined degree of exploration?

2.3. Controlling exploration by defining entropy at each state

The degree of exploration E_k at each state k is defined by

$$
E_k = -\sum_{i \in U(k)} \pi_k(i) \log \pi_k(i) \tag{2.4}
$$
which is simply the entropy of the probability distribution of the control actions in state k [8, 11]. E_k characterizes the uncertainty about the choice at state k. It is equal to zero when there is no uncertainty at all (π_k(i) reduces to a Kronecker delta); it is equal to log(n_k), where n_k is the number of available choices at node k, in the case of maximum uncertainty, π_k(i) = 1/n_k (a uniform distribution). The exploration rate E_k^r = E_k / log(n_k) is the ratio between the actual value of E_k and its maximum value; it takes its values in the interval [0, 1]. Fixing the entropy at a state sets the exploration level out of this state; increasing the entropy increases exploration, up to the maximal value, in which case there is no more exploitation since the next action is chosen completely at random, with a uniform distribution, without taking the costs into account. This way, the agents can easily control their exploration by adjusting the exploration rates.
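As a concrete illustration (not part of the paper), the degree of exploration (2.4) and the exploration rate can be computed directly from a choice distribution; the sketch below does this for a state with three admissible actions.

```python
import math

def degree_of_exploration(probs):
    """Entropy E_k = -sum_i pi_k(i) log pi_k(i) of the choice distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def exploration_rate(probs):
    """Normalized entropy E_k / log(n_k), lying in [0, 1]."""
    n_k = len(probs)
    return degree_of_exploration(probs) / math.log(n_k) if n_k > 1 else 0.0

# A state with three admissible actions, strongly favouring the first one.
pi_k = [0.8, 0.15, 0.05]
print(degree_of_exploration(pi_k))   # about 0.61 nats
print(exploration_rate(pi_k))        # about 0.56, i.e. 56% of full exploration
```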
3. Optimal policy under exploration constraints for deterministic shortest-path problems

3.1. Optimal policy and expected cost

We turn to the determination of the optimal policy under exploration constraints. More precisely, we will seek the policy Π ≡ {π_k(u), k = 1, 2, . . . , n} for which the expected cost V_π(k_0) from the initial state k_0 is minimal while maintaining a given degree of exploration at each state k. The destination state is an absorbing state, i.e., a state with no outgoing link, and computing the expected cost (2.1) from any state k is similar to computing the average first-passage time in the associated Markov chain [12]. The problem is thus to find the transition probabilities leading to the minimal expected cost, V^*(k_0) = min_π V_π(k_0). It can be formulated as a constrained optimization problem. In Appendix B, we derive the optimal probability distribution of control actions in state k, which is a logit distribution:

$$
\pi_k^*(i) = \frac{\exp\left[-\theta_k \left(c(k,i) + V^*(k_i')\right)\right]}{\sum_{j \in U(k)} \exp\left[-\theta_k \left(c(k,j) + V^*(k_j')\right)\right]} \tag{3.1}
$$

where k_i' = f_k(i) is the following state and V^* is the optimal (minimum) expected cost given by

$$
\begin{cases}
V^*(k) = \sum_{i \in U(k)} \pi_k^*(i)\,[c(k,i) + V^*(k_i')], & \text{with } k_i' = f_k(i) \text{ and } k \neq d \\
V^*(d) = 0, & \text{for the destination state } d
\end{cases} \tag{3.2}
$$
The control-action probability distribution (3.1) is often called Boltzmann distributed exploration and will be referred to as Boltzmann exploration in the sequel. In Equation (3.1), θ_k must be chosen in order to satisfy

$$
-\sum_{i \in U(k)} \pi_k(i) \log \pi_k(i) = E_k \tag{3.3}
$$

for each state k and given E_k; θ_k takes its values in [0, ∞]. Of course, if, for some state, the number of possible control actions reduces to one (no choice), no entropy constraint is introduced. Since Equation (3.3) has no analytical solution, θ_k must be computed numerically in terms of E_k. This is in fact quite easy since it can be shown that the function θ_k(E_k) is strictly decreasing, so that a line-search algorithm (such as the bisection method, see [1]) or a simple binary search can efficiently find the θ_k value corresponding to a given E_k value.

Equation (3.1) has a simple, appealing interpretation: choose preferably (with highest probability) the action i leading to the state k_i' of lowest expected cost, including the cost of performing the action, c(k,i) + V^*(k_i'). Thus, the agent is routed preferably to the state which is nearest (on average) to the destination state.

The same necessary optimality conditions can also be expressed in terms of the Q-values coming from the popular Q-learning framework [22, 28, 29]. Indeed, in the deterministic case, the Q-value represents the expected cost from state k when choosing action i, Q(k,i) = c(k,i) + V(k_i'). The relationship between Q and V is thus simply $V(k) = \sum_{i \in U(k)} \pi_k(i)\, Q(k,i)$; we thus easily obtain

$$
\begin{cases}
Q^*(k,i) = c(k,i) + \sum_{j \in U(k_i')} \pi_{k_i'}^*(j)\, Q^*(k_i', j), & \text{with } k_i' = f_k(i) \text{ and } k \neq d \\
Q^*(d,i) = 0, & \text{for the destination state } d
\end{cases} \tag{3.4}
$$

and the π_k^*(i) are given by

$$
\pi_k^*(i) = \frac{\exp\left[-\theta_k\, Q^*(k,i)\right]}{\sum_{j \in U(k)} \exp\left[-\theta_k\, Q^*(k,j)\right]} \tag{3.5}
$$
which corresponds to a Boltzmann exploration involving the Q-value. Thus, a Boltzmann exploration involving the Q-value may be considered as “optimal”, since it provides the best expected performance for fixed degrees of exploration.

3.2. Computation of the optimal policy

Equations (3.1) and (3.2) suggest an iterative procedure, very similar to the well-known value-iteration algorithm, for the computation of both the expected cost and the policy. More concretely, we consider that agents are sent from the initial state and that they choose an action i in each state k with probability distribution π_k(u = i). The agent then performs the chosen action, say action i, and observes the associated cost, c(k,i) (which, in a non-stationary environment, may vary over time), together with the new state, k'. This allows the agent to update the estimates of the cost, of the policy, and of the average cost until destination; these estimates will be denoted by $\hat{c}(k,i)$, $\hat{\pi}_k(i)$ and $\hat{V}(k)$ and are known (shared) by all the agents.

1. Initialization phase:
• Choose an initial policy, $\hat{\pi}_k(i)$, ∀i, k, satisfying the exploration rate constraints (3.3), and
• Compute the corresponding expected cost until destination $\hat{V}(k)$ by using any procedure for solving the set of linear equations (3.2) in which $V^*(k)$, $\pi_k^*(i)$ are substituted by $\hat{V}(k)$, $\hat{\pi}_k(i)$. The $\hat{\pi}_k(i)$ are kept fixed in the initialization phase. Any standard iterative procedure (for instance, a Gauss-Seidel-like algorithm) for computing the expected cost until absorption in a Markov chain could be used (see [12]).
2. Computation of the policy and the expected cost under exploration constraints (a code sketch of this procedure is given after the algorithm). For each visited state k, do until convergence:
• Choose an action i ∈ U(k) with current probability estimate $\hat{\pi}_k(i)$, observe the current cost c(k,i) for performing this action, update its estimate $\hat{c}(k,i)$, and jump to the next state, k_i':

$$
\hat{c}(k,i) \leftarrow c(k,i) \tag{3.6}
$$

• Update the probability distribution for state k as:

$$
\hat{\pi}_k(i) \leftarrow \frac{\exp\left[-\hat{\theta}_k \left(\hat{c}(k,i) + \hat{V}(k_i')\right)\right]}{\sum_{j \in U(k)} \exp\left[-\hat{\theta}_k \left(\hat{c}(k,j) + \hat{V}(k_j')\right)\right]} \tag{3.7}
$$

where k_i' = f_k(i) and $\hat{\theta}_k$ is set in order to respect the predefined degree of entropy for this state (see Equation (3.3)).
• Update the expected cost of state k:

$$
\begin{cases}
\hat{V}(k) \leftarrow \sum_{i \in U(k)} \hat{\pi}_k(i)\,[\hat{c}(k,i) + \hat{V}(k_i')], & \text{with } k_i' = f_k(i) \text{ and } k \neq d \\
\hat{V}(d) \leftarrow 0, & \text{where } d \text{ is the destination state}
\end{cases} \tag{3.8}
$$
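A minimal Python sketch of this procedure is given below, using the simpler initialization discussed at the end of this subsection (uniform policy, zero values) and synchronous sweeps over all states rather than agent-driven visits; the graph encoding and the function names theta_from_entropy and rsp_policy are our own illustrative assumptions. The temperature θ_k is found from the prescribed entropy E_k by bisection, as suggested in Section 3.1.

```python
import math

def theta_from_entropy(values, target_entropy, lo=1e-6, hi=1e3, iters=60):
    """Bisection on theta so that the Boltzmann distribution over `values`
    (probabilities proportional to exp(-theta * value)) has the target entropy.
    Relies on the entropy being decreasing in theta."""
    def entropy(theta):
        w = [math.exp(-theta * (v - min(values))) for v in values]
        z = sum(w)
        return -sum((x / z) * math.log(x / z) for x in w if x > 0.0)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if entropy(mid) > target_entropy:
            lo = mid            # entropy still too high -> increase theta
        else:
            hi = mid
    return 0.5 * (lo + hi)

def rsp_policy(succ, cost, destination, entropy_rate, n_sweeps=200):
    """Iterate updating rules (3.7)-(3.8) over all states (synchronous sweeps)."""
    V = {k: 0.0 for k in succ}
    V[destination] = 0.0
    policy = {k: {a: 1.0 / len(acts) for a in acts} for k, acts in succ.items()}
    for _ in range(n_sweeps):
        for k, acts in succ.items():
            vals = [cost[k][a] + V[succ[k][a]] for a in acts]
            if len(acts) > 1:
                E_k = entropy_rate * math.log(len(acts))   # prescribed entropy
                theta = theta_from_entropy(vals, E_k)
                w = [math.exp(-theta * (v - min(vals))) for v in vals]
                z = sum(w)
                policy[k] = {a: wi / z for a, wi in zip(acts, w)}
            V[k] = sum(policy[k][a] * (cost[k][a] + V[succ[k][a]]) for a in acts)
    return policy, V

# Toy deterministic instance (illustrative): two routes from state 0 to state 3.
succ = {0: {'up': 1, 'down': 2}, 1: {'go': 3}, 2: {'go': 3}}
cost = {0: {'up': 1.0, 'down': 2.0}, 1: {'go': 1.0}, 2: {'go': 2.0}}
policy, V = rsp_policy(succ, cost, destination=3, entropy_rate=0.3)
print(policy[0], V[0])   # the cheaper route 'up' gets the larger probability
```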
The convergence of these updating equations is proved in Appendix C. However, the described procedure is computationally demanding since it relies on iterative procedures like the value-iteration algorithm in Markov decision processes. Thus, the above procedure allows us to optimize the expected cost V(k_0) and to obtain a local minimum of this criterion. It does not guarantee convergence to a global minimum, however. Whether V(k_0) has only one global minimum or many local minima remains an open question. Notice also that, while the initialization phase is necessary in our convergence proof, other, simpler initialization schemes could also be applied. For instance, set initially $\hat{c}(k,i) = 0$, $\hat{\pi}_k(i) = 1/n_k$, $\hat{V}(k) = 0$, where n_k is the number of available actions in state k; then proceed by directly applying updating rules (3.7) and (3.8). While convergence has not been proved in this case, we observed that this updating rule works well in practice; in particular, we did not observe any convergence problem. This rule will be used in our experiments.

3.3. Some limit cases

We will now show that, when the degree of exploration is zero for all states, the nonlinear equations reduce to Bellman’s equations for finding the shortest path from the initial state to the destination state. Indeed, from Equations (3.7)-(3.8), if the parameter $\hat{\theta}_k$ is very large, which corresponds to a near-zero entropy, the probability of choosing the action with the lowest value of $\hat{c}(k,i) + \hat{V}(k_i')$ dominates all the others. In other words, $\hat{\pi}_k(j) \simeq 1$ for the action j corresponding to the lowest average cost (including the action cost), while $\hat{\pi}_k(i) \simeq 0$ for the other alternatives i ≠ j. Equations (3.8) can therefore be rewritten as

$$
\begin{cases}
\hat{V}(k) \leftarrow \min_{i \in U(k)} [\hat{c}(k,i) + \hat{V}(k_i')], & \text{with } k_i' = f_k(i) \text{ and } k \neq d \\
\hat{V}(d) \leftarrow 0, & \text{where } d \text{ is the destination state}
\end{cases} \tag{3.9}
$$

which are Bellman’s updating equations for finding the shortest path to the destination state [3, 4]. In terms of Q-values, the optimality conditions reduce to

$$
\begin{cases}
Q^*(k,i) = c(k,i) + \min_{j \in U(k_i')} Q^*(k_i', j), & \text{with } k_i' = f_k(i) \text{ and } k \neq d \\
Q^*(d,i) = 0, & \text{for the destination state } d
\end{cases} \tag{3.10}
$$
On the other hand, when $\hat{\theta}_k = 0$, the choice probability distribution reduces to $\hat{\pi}_k(i) = 1/n_k$, and the degree of exploration is maximal for all states. In this case, the nonlinear equations reduce to the linear equations that compute the average cost for reaching the destination state from the initial state in a Markov chain with transition probabilities equal to 1/n_k. In other words, we then perform a “blind” random exploration, without taking the costs into account. Any intermediate setting 0 < E_k < log(n_k) leads to an optimal exploration vs. exploitation strategy minimizing the expected cost and favoring short paths to the solution. In the next section, we show that, if the graph of states is directed and acyclic, the nonlinear equations can easily be solved by performing a single backward pass from the destination state.
4. Optimal policy for directed, acyclic graphs

4.1. Definition of a directed acyclic graph (Dial’s procedure)

In the case where the states form a directed acyclic graph, the procedure described in the previous section simplifies greatly. We first describe a procedure, proposed by Dial [9], which simplifies the model in order to obtain an acyclic process. An acyclic process is a process with no cycle, that is, one in which the agent can never return to the same state. Dial [9] proposed the following procedure for deriving an acyclic process from the original one:
• First, compute the minimum cost from the initial state to any other state (by using, for instance, the Bellman-Ford algorithm [7]). The minimum cost to reach any state k from the initial state k_0 will be denoted as d(k_0, k).
• Then, consider only actions leading to a state with a higher minimum cost from the initial state; all the other actions are labeled as “inefficient” and eliminated. Indeed, it is natural to consider that “efficient paths” should lead away from the origin k_0. Thus, all the actions leading to a state transition k → k' for which d(k_0, k') ≤ d(k_0, k) are prohibited; the others are “efficient actions” and are allowed. This results in an acyclic graph since all the states can be ordered by increasing d(k_0, k) and only transitions from a lower d(k_0, k) to a higher d(k_0, k') are allowed.
We consequently relabel the states by increasing d(k_0, k), from 0 (initial state) to n (ties are ordered arbitrarily). The label of the destination state is still denoted by d. Of course, this procedure can be adapted to the problem at hand. For instance, instead of computing the minimum cost from the initial state, one could compute the cost to the destination state, d(k, d), and eliminate actions according to this criterion. Other measures, different from the minimum cost, could be used as well. The main drawback of this method is that the environment (the topology and the costs) must be known beforehand.

4.2. Computation of the optimal policy

In this case, if U(k) now represents the set of “efficient actions” for each state, the algorithm (3.7)-(3.8) reduces to

1. Initialization phase: V^*(d) = 0, for the destination state d.
2. Computation of the policy and the expected cost under exploration constraints: for k = (d − 1) down to the initial state 0, compute

$$
\begin{cases}
\pi_k^*(i) = \dfrac{\exp\left[-\theta_k \left(c(k,i) + V^*(k_i')\right)\right]}{\sum_{j \in U(k)} \exp\left[-\theta_k \left(c(k,j) + V^*(k_j')\right)\right]} \\[2ex]
V^*(k) = \sum_{i \in U(k)} \pi_k^*(i)\,[c(k,i) + V^*(k_i')], \quad \text{with } k \neq d
\end{cases} \tag{4.1}
$$

where k_i' = f_k(i) and θ_k is set in order to respect the prescribed degree of entropy at each state (see Equation (3.3)). The fact that the states have been relabeled by increasing d(k_0, k) and that the efficient actions always lead to a state with a higher label guarantees that the values of the expected cost at the states k_i' are already known when applying formula (4.1).
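The single backward pass can be sketched as follows; the toy DAG, the fixed temperature θ (used here instead of solving (3.3) for each state) and the function name dag_rsp_backward are our own illustrative assumptions.

```python
import math

def dag_rsp_backward(succ, cost, destination, theta=2.0):
    """Single backward pass of Equation (4.1) on a DAG whose states are labeled
    so that every efficient action leads to a higher label.
    For brevity, a fixed temperature theta is used instead of solving (3.3)."""
    states = sorted(succ, reverse=True)     # destination has the highest label here
    V = {destination: 0.0}
    policy = {}
    for k in states:
        acts = list(succ[k])
        vals = [cost[k][a] + V[succ[k][a]] for a in acts]
        w = [math.exp(-theta * (v - min(vals))) for v in vals]
        z = sum(w)
        policy[k] = {a: wi / z for a, wi in zip(acts, w)}
        V[k] = sum(policy[k][a] * v for a, v in zip(acts, vals))
    return policy, V

# Small DAG (illustrative): states 0..3, every edge goes to a higher label.
succ = {0: {'a': 1, 'b': 2}, 1: {'a': 2, 'b': 3}, 2: {'a': 3}}
cost = {0: {'a': 1.0, 'b': 3.0}, 1: {'a': 1.0, 'b': 2.0}, 2: {'a': 1.0}}
print(dag_rsp_backward(succ, cost, destination=3))
```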
5. Optimal policy under exploration constraints for stochastic shortest-path problems

5.1. Optimal policy

We now consider stochastic shortest-path problems for which, once an action has been performed, the transition to the next state is no longer deterministic but stochastic [4]. More precisely, when an agent chooses action i in state k, it jumps to state k' with a probability $P(s' = k' \mid u = i, s = k) = p_{kk'}(i)$ (the transition probabilities). Since there are now two different probability distributions, we redefine our notations as follows:
• the probability of choosing an action i within the state k, π_k(i);
• the probability of jumping to a state s' = k' after having chosen the action i within the state k, p_{kk'}(i).
By first-step analysis (see Appendix C), the recurrence relations for computing the expected cost V_π(k), given the policy Π, are easily found:

$$
\begin{cases}
V_\pi(k) = \sum_{i \in U(k)} \pi_k(i)\left[c(k,i) + \sum_{k'=1}^{n} p_{kk'}(i)\, V_\pi(k')\right] \\
V_\pi(d) = 0, \text{ where } d \text{ is the destination state}
\end{cases} \tag{5.1}
$$

Furthermore, by defining the average cost when having chosen control action i in state k by $\overline{V}_\pi(k,i) = \sum_{k'} p_{kk'}(i)\, V_\pi(k')$, Equation (5.1) can be rewritten as

$$
\begin{cases}
V_\pi(k) = \sum_{i \in U(k)} \pi_k(i)\,[c(k,i) + \overline{V}_\pi(k,i)] \\
V_\pi(d) = 0, \text{ where } d \text{ is the destination state}
\end{cases} \tag{5.2}
$$

When proceeding as in Appendix B, the optimal policy is obtained by substituting $V_\pi(k_i')$ by $\overline{V}^*(k,i)$ in both (3.1) and (3.2):

$$
\pi_k^*(i) = \frac{\exp\left[-\theta_k\left(c(k,i) + \overline{V}^*(k,i)\right)\right]}{\sum_{j \in U(k)} \exp\left[-\theta_k\left(c(k,j) + \overline{V}^*(k,j)\right)\right]} \tag{5.3}
$$
The additional difficulty here, in comparison with a deterministic problem, is that the probability distributions p_{kk'}(i) have to be estimated on-line, together with the costs and the distribution of the randomized control actions [22].

5.2. On-line estimation of the transition probabilities

The transition probabilities p_{kk'}(i) might be unknown and, consequently, should be estimated on-line, together with the costs and the distribution of the randomized control actions [22]. An alternative solution is to directly estimate the average cost $\overline{V}_\pi(k,i) = \sum_{k'} p_{kk'}(i)\, V_\pi(k')$ based on the observation of the value of V_π in the next state k'. There is a large range of potential techniques for solving this issue, depending on the problem at hand (see for example [6]). One could simply use exponential smoothing, leading to

$$
\hat{\overline{V}}(k,i) \leftarrow \alpha \hat{V}(k') + (1 - \alpha)\, \hat{\overline{V}}(k,i) \tag{5.4}
$$

or rely on a stochastic approximation scheme,

$$
\hat{\overline{V}}(k,i) \leftarrow \hat{\overline{V}}(k,i) + \alpha \left[\hat{V}(k') - \hat{\overline{V}}(k,i)\right] \tag{5.5}
$$

which converges for a suitably decreasing sequence of α [21].
This leads to the following updating rules. For each visited state k, do until convergence:
• Choose an action i with current probability estimate $\hat{\pi}_k(i)$, observe the current cost c(k,i) for performing this action, and update its estimate $\hat{c}(k,i)$ by

$$
\hat{c}(k,i) \leftarrow c(k,i) \tag{5.6}
$$

• Perform the action i and observe the value $\hat{V}(k')$ of the next state k'. Update $\hat{\overline{V}}(k,i)$ accordingly (here, we choose the exponential smoothing scheme):

$$
\hat{\overline{V}}(k,i) \leftarrow \alpha \hat{V}(k') + (1 - \alpha)\, \hat{\overline{V}}(k,i) \tag{5.7}
$$

• Update the probability distribution for state k as:

$$
\hat{\pi}_k(i) \leftarrow \frac{\exp\left[-\hat{\theta}_k\left(\hat{c}(k,i) + \hat{\overline{V}}(k,i)\right)\right]}{\sum_{j \in U(k)} \exp\left[-\hat{\theta}_k\left(\hat{c}(k,j) + \hat{\overline{V}}(k,j)\right)\right]} \tag{5.8}
$$

where $\hat{\theta}_k$ is set in order to respect the prescribed degree of entropy (see Equation (3.3)).
• Update the expected cost of state k:

$$
\begin{cases}
\hat{V}(k) \leftarrow \sum_{i \in U(k)} \hat{\pi}_k(i)\,[\hat{c}(k,i) + \hat{\overline{V}}(k,i)] \\
\hat{V}(d) \leftarrow 0, \text{ where } d \text{ is the destination state}
\end{cases} \tag{5.9}
$$
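A hedged sketch of these online updating rules on a small simulated environment is given below; the toy transition probabilities, the fixed temperature (solving (3.3) for θ_k is omitted for brevity) and the variable names are our own assumptions, not the paper's.

```python
import math
import random

# Illustrative stochastic shortest-path instance (not from the paper):
# p[k][a] maps each possible successor state to its transition probability.
p    = {0: {'a': {1: 0.7, 2: 0.3}, 'b': {2: 1.0}}, 1: {'a': {3: 1.0}}, 2: {'a': {3: 1.0}}}
cost = {0: {'a': 1.0, 'b': 2.0}, 1: {'a': 1.0}, 2: {'a': 0.5}}
destination, alpha, theta = 3, 0.1, 2.0   # fixed temperature instead of solving (3.3)

V = {k: 0.0 for k in p}
V[destination] = 0.0
Vbar   = {(k, a): 0.0 for k in p for a in p[k]}          # estimates of the average cost-to-go
policy = {k: {a: 1.0 / len(p[k]) for a in p[k]} for k in p}

for _ in range(20000):                                    # repeated runs to the goal
    k = 0
    while k != destination:
        a  = random.choices(list(policy[k]), weights=list(policy[k].values()))[0]
        k2 = random.choices(list(p[k][a]), weights=list(p[k][a].values()))[0]
        # costs are known constants here, so the estimate update (5.6) is trivial
        # (5.7): exponential smoothing of the average cost-to-go after action a
        Vbar[(k, a)] = alpha * V[k2] + (1 - alpha) * Vbar[(k, a)]
        # (5.8): Boltzmann update of the choice probabilities (fixed theta here)
        vals = {b: cost[k][b] + Vbar[(k, b)] for b in p[k]}
        m = min(vals.values())
        w = {b: math.exp(-theta * (v - m)) for b, v in vals.items()}
        z = sum(w.values())
        policy[k] = {b: w[b] / z for b in w}
        # (5.9): update of the expected cost of state k
        V[k] = sum(policy[k][b] * vals[b] for b in p[k])
        k = k2

print(V[0], policy[0])
```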
5.3. Links with Q-learning and SARSA
We now restate the optimality relations in terms of the Q-values. In the stochastic case, the Q-value represents the expected cost from state k when choosing action i, $Q(k,i) = c(k,i) + \overline{V}(k,i)$. As before, the relationship between Q and V is $V(k) = \sum_{i \in U(k)} \pi_k(i)\, Q(k,i)$ and the optimality conditions are

$$
\begin{cases}
Q^*(k,i) = c(k,i) + \sum_{k'} p_{kk'}(i) \sum_{j \in U(k')} \pi_{k'}^*(j)\, Q^*(k', j), & \text{for } k \neq d \\
Q^*(d,i) = 0, & \text{for the destination state } d
\end{cases} \tag{5.10}
$$

and the π_k^*(i) are provided by

$$
\pi_k^*(i) = \frac{\exp\left[-\theta_k\, Q^*(k,i)\right]}{\sum_{j \in U(k)} \exp\left[-\theta_k\, Q^*(k,j)\right]} \tag{5.11}
$$

which corresponds to a Boltzmann strategy involving the Q-value. Now, if the entropy is close to zero (no exploration), the updating rule reduces to

$$
\begin{cases}
\hat{Q}(k,i) \leftarrow \hat{c}(k,i) + \sum_{k'} p_{kk'}(i) \min_{j \in U(k')} \hat{Q}(k', j), & \text{for } k \neq d \\
\hat{Q}(d,i) \leftarrow 0, & \text{for the destination state } d
\end{cases} \tag{5.12}
$$

which can be approximated by the following stochastic approximation scheme, where k' is the observed next state:

$$
\hat{Q}(k,i) \leftarrow \hat{Q}(k,i) + \alpha\left[\hat{c}(k,i) + \min_{j \in U(k')} \hat{Q}(k', j) - \hat{Q}(k,i)\right], \quad \text{for } k \neq d \tag{5.13}
$$

and this corresponds to the basic Q-learning algorithm [22, 28, 29]. On the other hand, the SARSA algorithm [17, 20, 22] considers both exploration and an unknown environment. In substance, SARSA approximates the Q-learning values, based on the optimality conditions, by averaging out both the uncertainty about the action to choose and the uncertainty about the next state. Thus, from Equation (5.10), the Q-values are updated through

$$
\hat{Q}(k,i) \leftarrow \hat{Q}(k,i) + \alpha\left[\hat{c}(k,i) + \hat{Q}(k', j) - \hat{Q}(k,i)\right], \quad \text{for } k \neq d \tag{5.14}
$$

where k' is the observed next state and j is the action chosen in k', when choosing action i with probability

$$
\hat{\pi}_k(i) = \frac{\exp\left[-\theta_k\, \hat{Q}(k,i)\right]}{\sum_{j \in U(k)} \exp\left[-\theta_k\, \hat{Q}(k,j)\right]} \tag{5.15}
$$

However, since we are controlling the probability distributions of choosing an action (5.11), it is more sensible to rely on updating rule (5.7) instead of (5.14).
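To make the contrast concrete, the two bootstrap rules can be written side by side; this is our own illustrative paraphrase of (5.14) and (5.7), with hypothetical function and variable names.

```python
def sarsa_update(Q, k, a, c, k_next, a_next, alpha):
    """SARSA rule (5.14): bootstrap on the Q-value of the action actually
    chosen in the next state, averaging out action and transition noise."""
    Q[(k, a)] += alpha * (c + Q[(k_next, a_next)] - Q[(k, a)])

def rsp_vbar_update(Vbar, V, k, a, k_next, alpha):
    """RSP rule (5.7): smooth the average cost-to-go after action a, while the
    choice probabilities themselves remain controlled through (5.8)."""
    Vbar[(k, a)] = alpha * V[k_next] + (1 - alpha) * Vbar[(k, a)]
```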
6. Optimal policy under exploration constraints for discounted problems

In this section, instead of Equation (2.1), we consider discounted problems, for which there is a discount factor 0 < α < 1,

$$
V_\pi(k_0) = \mathrm{E}_\pi\left[\sum_{i=0}^{\infty} \alpha^i c(k_i, u_i)\right] \tag{6.1}
$$

The meaning of α is that future costs matter less than the same costs incurred at the present time. In this situation, we will see that there is no need to assume the existence of a destination state d. Indeed, this problem can be converted to a stochastic shortest-path problem for which the analysis of the previous sections holds (see [4]). Let s = 1, . . . , n be the states and consider an associated stochastic shortest-path problem involving, as for the original problem, the states s = 1, . . . , n plus an extra absorbing termination state t, with state transitions and costs obtained as follows: from a state k, when a control action i ∈ U(k) is chosen with probability π_k(i), a cost c(k,i) is incurred and the next state is k_i' = f_k(i) ≠ t with a fixed probability α, or t with probability (1 − α). Once the agent has reached state t, it remains in this state forever with no additional cost. Here, the absorbing state t plays the same role as the destination state d in our previous analysis, and the deterministic shortest-path model can easily be adapted to this framework, leading to the following optimality condition

$$
V^*(k) = \sum_{i \in U(k)} \pi_k^*(i)\,[c(k,i) + \alpha V^*(k_i')], \quad \text{with } k_i' = f_k(i),\ 0 < \alpha < 1 \tag{6.2}
$$

where, as before, π_k^*(i) is given by

$$
\pi_k^*(i) = \frac{\exp\left[-\theta_k \left(c(k,i) + V^*(k_i')\right)\right]}{\sum_{j \in U(k)} \exp\left[-\theta_k \left(c(k,j) + V^*(k_j')\right)\right]} \tag{6.3}
$$
which are the corresponding optimality conditions for deterministic discounted problems. On the other hand, the optimality conditions for stochastic discounted problems are:

$$
V^*(k) = \sum_{i \in U(k)} \pi_k^*(i)\,[c(k,i) + \alpha \overline{V}^*(k,i)] \tag{6.4}
$$

$$
\pi_k^*(i) = \frac{\exp\left[-\theta_k\left(c(k,i) + \overline{V}^*(k,i)\right)\right]}{\sum_{j \in U(k)} \exp\left[-\theta_k\left(c(k,j) + \overline{V}^*(k,j)\right)\right]} \tag{6.5}
$$

with $\overline{V}^*(k,i) = \sum_{k'} p_{kk'}(i)\, V^*(k')$.
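As a small illustration of the optimality conditions (6.2)-(6.3) for the deterministic discounted case, a fixed-point iteration can be sketched as below; the two-state toy problem and the fixed temperature (used instead of solving (3.3) for each state) are our own assumptions, not the paper's.

```python
import math

# Illustrative two-state deterministic discounted problem: each action either
# stays in the current state or moves to the other one; there is no destination.
succ = {0: {'stay': 0, 'move': 1}, 1: {'stay': 1, 'move': 0}}
cost = {0: {'stay': 1.0, 'move': 0.5}, 1: {'stay': 2.0, 'move': 0.5}}
alpha, theta = 0.9, 1.0                      # discount factor and fixed temperature

V = {0: 0.0, 1: 0.0}
for _ in range(500):                         # iterate (6.2)-(6.3) until the values settle
    for k in succ:
        q_disc  = {a: cost[k][a] + alpha * V[succ[k][a]] for a in succ[k]}  # used in (6.2)
        q_boltz = {a: cost[k][a] + V[succ[k][a]] for a in succ[k]}          # exponent of (6.3)
        m = min(q_boltz.values())
        w = {a: math.exp(-theta * (v - m)) for a, v in q_boltz.items()}
        z = sum(w.values())
        pi = {a: w[a] / z for a in w}
        V[k] = sum(pi[a] * q_disc[a] for a in succ[k])

print(V)   # the values stay finite because future costs are discounted by alpha
```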
[Figure 7.1: The graph used in our experiments. The initial (source) state is state 1 while the destination state is state 13. The initial costs are indicated on the edges.]

7. Preliminary experiments

Our experiments were performed on the graph shown in Figure 7.1. It is composed of 14 numbered nodes connected by edges of different weights, representing costs. The algorithm described in Section 3.2 was used in each experiment. It searches an optimal path in a given graph, going from a starting node to a destination node, for a fixed exploration degree. We investigated how the algorithm reacts to changes of the exploration degree, and the impact of these changes on the total expected cost. Two simple experiments were performed, testing the algorithm’s capacity to find the optimal paths in two different settings:
1. a static environment (i.e., an environment where the weights of the edges do not change over time);
2. a dynamic (non-stationary) environment.
In each experiment, the matrices of expected costs and transition probabilities are updated at each time step according to Equations (3.7) and (3.8). In the sequel, one
run will represent the complete routing of an agent from its source to its destination. At the beginning of each simulation, the $\hat{V}(k)$ and $\hat{\pi}_k(i)$ are initialized according to the procedure introduced at the end of Section 3.2. The exploration rate is fixed to a specified value common to all states.

7.1. First experiment

7.1.1. Description

This experiment sends 15,000 agents (15,000 runs) through the network shown in Figure 7.1. The initial and destination nodes are node 1 and node 13, respectively. The costs of the external edges (i.e., the edges on the paths [1,2,6,12,13] and [1,5,9,10,14,13]) are initialized to 1, while the costs of all the other edges are initialized to 2. The goal of this simulation is to observe the paths followed by the routed agents, i.e., the traffic, as a function of various values of the entropy. We simulate the agents’ transfer by assigning the same value of the exploration rate to each node. Each node corresponds to a state and the actions correspond to the choice of the next edge to follow. We repeat the experiment four times with, for each repetition, a fixed value of the entropy corresponding to a percentage of the maximum entropy (i.e., exploration rates of 0%, 30%, 60% and 90%).

7.1.2. Results

The graphs displayed in Table 7.1 show the behaviour of the algorithm when varying the exploration degree. For each tested value, a graph (with the same topology as described above) is used to represent the results: the more an edge is used for routing, the darker and wider it is drawn. The first graph shows the results for an exploration degree of 0%. The algorithm finds the shortest path from node 1 to node 13 (path [1,2,6,12,13]), and no exploration is carried out: [1,2,6,12,13] is the only path used. The second graph shows the followed paths for an exploration degree of 30%. The path [1,2,6,12,13], corresponding to the shortest path, is still the most used, while other edges not belonging to the shortest path are now also investigated. Moreover, we observe that the second shortest path, i.e., path [1,5,9,10,14,13], now also carries significant traffic. Exploitation thus remains a priority while exploration is nevertheless present. The third graph shows the paths used for an exploration rate of 60%. As for the previous setting, the algorithm uses more edges that are not part of the shortest path, but still prefers the shortest path. The last graph shows the result for an exploration rate of 90%. With such a value, the exploitation of the shortest path is no longer the priority; exploration is now clearly favored over exploitation.
[Table 7.1: Traffic through the edges for the four different exploration rates.]

7.2. Second experiment

7.2.1. Description

The second experiment repeats the previous simulation setting while introducing changes in the environment, i.e., abruptly varying the cost of some edges.
[Figure 7.2: Average cost needed to reach the destination node (13) from the initial node (1) in terms of time step, for various exploration rates.]

Except for the change in the environment, the context remains the same as in the first experiment, namely to convey 15,000 agents (15,000 runs) from a source node (1) to a destination node (13) by using the suggested algorithm. Thus, at each time step, an agent is sent from node 1 to reach node 13. At the beginning, the costs of the external edges (i.e., the edges on the paths [1,2,6,12,13] and [1,5,9,10,14,13]) are
initialized to 3, while the costs of all the others (internal edges) are initialized to 6. In this configuration, the shortest path from node 1 to node 13 is the path [1,2,6,12,13], with a total cost of 12. After having sent 7,500 agents (i.e., in the middle of the simulation, at run number 7,501), the cost of all internal edges was set to 1, all other things being equal. This change creates new shortest paths, all of them with a total cost of 4, passing through the internal nodes [3,4,7,8,11].

7.2.2. Results

Figure 7.2 contains five graphs representing the total costs of the transfer of the agents from their source to their destination, in terms of run number. We display the results for five levels of the exploration rate: 0%, 10%, 20%, 30% and 50%. The first graph (at the bottom) shows the total cost for an exploration degree of 0%. We observe that, despite the environment change (which leads to the creation of shorter paths with a total cost of 4), the cost remains equal to 12 throughout the simulation. Once the algorithm has discovered the shortest path, it does not explore anymore. The system misses a path that has become (at run number 7,501) much cheaper (in terms of cost) than the previous optimal path. The four remaining graphs show the results for exploration rates of 10%, 20%, 30% and 50%. Each of these settings is able to find the new optimal path after the change in environment – the total cost drops from about 12 to 4. This is due to the fact that the algorithm carries out exploration and exploitation in parallel. This also explains the slight rise in total cost when increasing the value of the exploration rate.

[Figure 7.3: Average, maximal and minimal costs needed to reach the destination node (13) from the initial node (1) in terms of exploration rate.]

Figure 7.3 shows the evolution of the average cost, the maximal cost as well as the minimal cost (computed on the first 7,500 agents or runs) in terms of the exploration rate. Increasing the entropy induces a growth in the average total cost, which is natural since the rise in entropy causes an increase in exploration, therefore producing a reduction of exploitation. We also notice that the difference between the minimum and the maximum total cost grows with the increase in entropy but remains reasonably
small.

7.3. Third experiment

7.3.1. Description

This experiment simulates a very simple load-balancing problem. Load balancing is used to spread work between processes, computers, disks or other resources. The specification of the load-balancing problem for this experiment is as follows. First, notice that we still use the graph defined previously, with the same costs on the edges (see Figure 7.1). In this experiment, only two destination nodes are defined, for illustration and ease of interpretation, but the approach generalizes to several destination nodes. Each destination node is a special node able to provide some resources (such as food) coveted by the agents crossing the network. The amount of resources present on each destination node is equal and limited, and the agents have to retrieve these resources in some optimal way (i.e., the agents must choose a path of minimum cost while avoiding congestion).

We take the number of remaining resources into consideration in the calculation of the average cost. Indeed, the number of resources of each destination node being limited, we introduce a “warning threshold” indicating that the resources are becoming scarce, which translates into a decrease in reward. For the sake of simplicity, this critical threshold is the same for each destination node and is equal to 10% of the starting resources. When an agent reaches a destination node, its remaining resources are checked. Two cases can arise. In the first case, if the number of remaining resources is higher than or equal to the threshold, the agent receives a positive reward (a negative cost) equal to the value of the threshold. In the second case, if the number of resources is lower than the threshold, the agent perceives a lower reward, equal to the number of remaining resources (the reward is proportional to the number of resources). The introduction of this threshold results in a gradual decrease of the reward of a destination node when the number of its resources falls below the fixed threshold.

The values of the simulation parameters were defined in the following way. The destination nodes chosen for the experiment are nodes 11 and 14, each with 5,000 resources and a threshold equal to 500. 10,000 agents are sent, each agent corresponding to one run (10,000 runs in total), from node 1, with the objective of gathering resources with minimal average cost.

7.3.2. Results

For this experiment, we use five exploration rates, namely 0, 20, 40, 60 and 80%; Table 7.2 shows the resulting graphs for each rate. Each graph reports the resource quantities in terms of run number (agent sent). Given the chosen destination nodes (i.e., nodes 11 and 14) and the topology of the network, the destination with the shortest path is node 14. The first graph is the result of the simulation with an exploration rate of 0%. We observe that, when no exploration is present, the algorithm proposes only the
destination node 14 (i.e., the shortest path). Node 11 is exploited only when the resources of node 14 reach the critical point and the cost of exploiting node 11 becomes smaller than that of exploiting node 14. In turn, node 14 is exploited again only when the resources of node 11 reach the threshold. Notice that, in the case of zero entropy, the two destination nodes are never exploited simultaneously, so that no “load balancing” really occurs. This could generate congestion problems on the shortest path if the resource is not transferred immediately. The other graphs show the results of the simulation for entropies of 20, 40, 60 and 80%. When the exploration rate is raised, the two destination nodes are more and more exploited simultaneously, in proportions increasing with the entropy. As expected, the degree of load balancing is a function of the exploration rate.

[Table 7.2: Evolution of the remaining resources in terms of run number (agent sent), for the four different exploration rates.]
8. Experimental comparisons

8.1. Description

This experiment focuses on comparisons between our method of exploration, which will be referred to as the Randomized Shortest Path (RSP) technique, and two other exploration methods usually used in reinforcement learning problems, namely ǫ-greedy and Boltzmann exploration. The ǫ-greedy exploration strategy selects the currently best (greedy) action – the best according to the current estimate of the total expected cost, $\hat{V}_\pi$ – in any given state with probability (1 − ǫ), and selects another, random, action with a uniform probability ǫ/(m − 1), where m is the number of available actions. Another popular choice for undirected exploration, which will be referred to as
Naive Boltzmann, assigns the probability π_k(i) of taking action u = i in state s = k provided by

$$
\pi_k(i) = \frac{\exp\left[-\alpha V_\pi(k_i')\right]}{\sum_{j \in U(k)} \exp\left[-\alpha V_\pi(k_j')\right]}, \quad \text{with } i \in U(k) \tag{8.1}
$$

where k_i' = f_k(i) is the following state and α is a positive temperature parameter controlling the amount of randomness. Notice the difference between (8.1) and (3.1): in (8.1), the immediate cost is not included. A constant exploration rate (ǫ in the ǫ-greedy exploration and α in the Boltzmann exploration) is often used in practice. Notice that the randomized shortest-path strategy is equivalent to a Boltzmann exploration strategy based on the Q-value Q instead of the expected cost V (see the discussion of Section 3.1). In order to compare the results obtained by the three different methods, the parameters ǫ (ǫ-greedy) and α (Naive Boltzmann) are tuned in order to achieve the same exploration rate as for the RSP method.

To carry out these comparisons, we simulate the same load-balancing problem as described previously. We use the same graph topology as before (see Figure 7.1), with the same weights on the edges (a weight of 1 for the external edges and 2 for the interior edges); node 1 is the source node. We assign an equal number of resources to each destination node, and we introduce a threshold which influences the costs in the same way as before (see Section 7.3). However, compared to the previous simulation, the destination nodes are chosen randomly and change over time. More precisely, we fix node 15 as a permanent destination node and we introduce a second, time-varying, destination node, which is chosen among nodes 7, 8 and 11 with a probability of 1/3 each. The changes of the destination node are performed at periodic intervals (e.g., every 500 runs). Nodes 7, 8 and 11 were chosen as potential destination nodes because they are not on the optimal path (i.e., the shortest path) from the source node to the permanent destination node. This allows us to assess the exploration strategies. As before, agents are sent through the network from the source node until they reach a destination node, and this operation is referred to as a run.

We execute four different simulations for this experiment. Each simulation corresponds to a predefined periodicity of the environment changes, called an epoch; respectively 500, 250, 100 and 50 runs. Therefore, during a given epoch, the destination nodes are kept fixed; they change at the beginning of the next epoch. Thus, a short epoch corresponds to a fast-changing environment while a long epoch corresponds to a slowly changing environment. Each simulation involves sending 5,000 agents through the network (5,000 runs), and the corresponding average cost per run (averaged over the 5,000 runs) is recorded.
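For reference, the two baseline action-selection rules can be sketched as follows; the function names are hypothetical and the next-state value estimates are assumed to be supplied by the learning procedure.

```python
import math

def epsilon_greedy_probs(values, epsilon):
    """epsilon-greedy: probability (1 - epsilon) for the currently best action,
    epsilon / (m - 1) shared uniformly among the m - 1 other actions."""
    m = len(values)
    best = min(range(m), key=lambda i: values[i])      # costs: lower is better
    return [1.0 - epsilon if i == best else epsilon / (m - 1) for i in range(m)]

def naive_boltzmann_probs(next_state_values, alpha):
    """Naive Boltzmann (8.1): weights exp(-alpha * V(k_i')), without the
    immediate cost c(k, i), unlike the RSP rule (3.1)."""
    m0 = min(next_state_values)
    w = [math.exp(-alpha * (v - m0)) for v in next_state_values]
    z = sum(w)
    return [x / z for x in w]

# Example: three actions whose successor states have expected costs 2, 3 and 5.
print(epsilon_greedy_probs([2.0, 3.0, 5.0], epsilon=0.1))   # [0.9, 0.05, 0.05]
print(naive_boltzmann_probs([2.0, 3.0, 5.0], alpha=1.0))    # about [0.70, 0.26, 0.04]
```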
8.2. Results

For illustration, Figure 8.1 presents the results (the cost averaged over 100 runs) of one simulation, for an epoch of 500 runs. We observe that the RSP method obtains better performance than the two other methods. The different oscillations reflect the epochs (i.e., a change of the destination nodes) and the progressive reduction of the number of resources pertaining to the destination nodes.

[Figure 8.1: Results of one simulation when the environment changes every 500 runs (one epoch = 500 runs), for an exploration rate of 5 percent. The total duration of the simulation is 5,000 runs. The average cost is displayed in terms of the percentage of simulation progression.]

Table 8.1 reports the results of the simulations for each method, five exploration rates (ranging from 1% to 30%) and four different epochs. We report the total cost averaged over the 5,000 runs. The 95% confidence interval computed on the 5,000 runs is also displayed. By analyzing these results, we clearly observe that the RSP method outperforms the two other standard exploration methods, ǫ-greedy and Naive Boltzmann.
Environment change every 500 sendings (epoch = 500)

Exploration rate    1%             5%             10%            20%            30%
RSP                 4.31 ± 0.03    4.30 ± 0.03    4.33 ± 0.03    4.43 ± 0.03    4.75 ± 0.03
Naive Boltzmann     4.52 ± 0.03    4.62 ± 0.03    4.66 ± 0.03    4.75 ± 0.03    4.97 ± 0.03
ǫ-greedy            4.61 ± 0.03    4.65 ± 0.03    4.64 ± 0.03    4.74 ± 0.03    5.05 ± 0.03

Environment change every 250 sendings (epoch = 250)

Exploration rate    1%             5%             10%            20%            30%
RSP                 4.36 ± 0.02    4.37 ± 0.02    4.41 ± 0.02    4.56 ± 0.03    4.76 ± 0.03
Naive Boltzmann     4.60 ± 0.02    4.65 ± 0.02    4.70 ± 0.02    4.83 ± 0.02    5.07 ± 0.03
ǫ-greedy            4.64 ± 0.02    4.63 ± 0.02    4.70 ± 0.02    4.86 ± 0.03    5.11 ± 0.03

Environment change every 100 sendings (epoch = 100)

Exploration rate    1%             5%             10%            20%            30%
RSP                 4.27 ± 0.03    4.30 ± 0.03    4.33 ± 0.03    4.51 ± 0.03    4.68 ± 0.03
Naive Boltzmann     4.63 ± 0.02    4.62 ± 0.02    4.66 ± 0.02    4.86 ± 0.03    5.11 ± 0.03
ǫ-greedy            4.61 ± 0.02    4.61 ± 0.02    4.69 ± 0.02    4.86 ± 0.03    5.13 ± 0.03

Environment change every 50 sendings (epoch = 50)

Exploration rate    1%             5%             10%            20%            30%
RSP                 4.40 ± 0.03    4.43 ± 0.03    4.45 ± 0.03    4.60 ± 0.03    4.79 ± 0.04
Naive Boltzmann     4.67 ± 0.03    4.70 ± 0.03    4.75 ± 0.03    4.92 ± 0.03    5.14 ± 0.04
ǫ-greedy            4.68 ± 0.03    4.72 ± 0.03    4.77 ± 0.03    4.91 ± 0.03    5.17 ± 0.04

Table 8.1: Comparison of the three exploration strategies, RSP, ǫ-greedy and Naive Boltzmann, for different exploration rates and different epochs, corresponding to various environment change rates. Entries are the total cost averaged over 5,000 runs, with 95% confidence intervals.
9. Conclusions

We have presented a model integrating continual exploration and exploitation in a common framework. The exploration rate is controlled by the entropy of the choice probability distribution defined on the states of the system. When no exploration is performed (zero entropy at each node), the model reduces to the common value-iteration algorithm computing the minimum-cost policy. On the other hand, when full exploration is performed (maximum entropy at each node), the model reduces to a “blind” exploration, without considering the costs. The main drawback of the present approach is that it is computationally demanding since it relies on iterative procedures such as the value-iteration algorithm. Further work will investigate alternative cost formulations, such as the “average cost per step”. Moreover, since the optimal average cost is obtained by taking the minimum over all the potential policies, this quantity defines a dissimilarity measure between the states of the process. Further work will investigate the properties of this distance, which generalizes the Euclidean commute-time distance between nodes of a graph, as introduced and investigated in [10, 18].
References

[1] M. S. Bazaraa, H. D. Sherali, and C. M. Shetty. Nonlinear programming: Theory and algorithms. John Wiley and Sons, 1993.
[2] D. P. Bertsekas. Neuro-dynamic programming. Athena Scientific, 1996.
[3] D. P. Bertsekas. Network optimization: Continuous and discrete models. Athena Scientific, 1998.
[4] D. P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 2000.
[5] J. A. Boyan and M. L. Littman. Packet routing in dynamically changing networks: A reinforcement learning approach. Advances in Neural Information Processing Systems 6 (NIPS6), pages 671–678, 1994.
[6] R. G. Brown. Smoothing, forecasting and prediction of discrete time series. Prentice-Hall, 1962.
[7] N. Christofides. Graph theory: An algorithmic approach. Academic Press, 1975.
[8] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley and Sons, 1991.
[9] R. Dial. A probabilistic multipath assignment model that obviates path enumeration. Transportation Research, 5:83–111, 1971.
[10] F. Fouss, A. Pirotte, J.-M. Renders, and M. Saerens. Random-walk computation of similarities between nodes of a graph, with application to collaborative recommendation. To appear in the IEEE Transactions on Knowledge and Data Engineering, 2006.
[11] J. N. Kapur and H. K. Kesavan. Entropy optimization principles with applications. Academic Press, 1992.
[12] J. G. Kemeny and J. L. Snell. Finite Markov Chains. Springer-Verlag, 1976.
[13] T. M. Mitchell. Machine learning. McGraw-Hill, 1997.
[14] J. Norris. Markov Chains. Cambridge University Press, 1997.
[15] M. J. Osborne. An Introduction to Game Theory. Oxford University Press, 2004.
[16] H. Raiffa. Decision analysis. Addison-Wesley, 1970.
[17] G. Rummery and M. Niranjan. On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Department, 1994.
[18] M. Saerens, F. Fouss, L. Yen, and P. Dupont. The principal components analysis of a graph, and its relationships to spectral clustering. Proceedings of the 15th European Conference on Machine Learning (ECML 2004), Lecture Notes in Artificial Intelligence, Vol. 3201, Springer-Verlag, Berlin, pages 371–383, 2004.
[19] G. Shani, R. Brafman, and S. Shimony. Adaptation for changing stochastic environments through online POMDP policy learning. In Workshop on Reinforcement Learning in Non-Stationary Environments, ECML 2005, pages 61–70, 2005.
[20] S. Singh and R. Sutton. Reinforcement learning with replacing eligibility traces. Machine Learning, 22:123–158, 1996.
[21] J. C. Spall. Introduction to stochastic search and optimization. Wiley, 2003.
[22] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998.
[23] H. M. Taylor and S. Karlin. An Introduction to Stochastic Modeling, 3rd edition. Academic Press, 1998.
[24] S. Thrun. Efficient exploration in reinforcement learning. Technical report, School of Computer Science, Carnegie Mellon University, 1992.
[25] S. Thrun. The role of exploration in learning control. In D. White and D. Sofge, editors, Handbook for Intelligent Control: Neural, Fuzzy and Adaptive Approaches. Van Nostrand Reinhold, Florence, Kentucky 41022, 1992.
[26] S. Thrun, W. Burgard, and D. Fox. Probabilistic Robotics. MIT Press, 2005.
[27] K. Verbeeck. Coordinated exploration in multi-agent reinforcement learning. PhD thesis, Vrije Universiteit Brussel, Belgium, 2004.
[28] J. C. Watkins. Learning from delayed rewards. PhD thesis, King’s College, Cambridge, UK, 1989.
[29] J. C. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
Appendix: Proof of the main results
A. Appendix: Computation of the expected cost for a deterministic shortest path problem and a fixed policy

The goal is to compute the expected cumulated cost V_\pi(k_0) from state k_0 under policy \Pi:
\[
V_\pi(k_0) = \mathrm{E}_\pi\!\left[ \sum_{t=0}^{\infty} c(s_t, u(s_t)) \,\Big|\, s_0 = k_0 \right] \tag{A.1}
\]
The expectation is taken over the policy, that is, over all the random variables u(k) associated with the states k. We derive the recurrence relation for computing V_\pi(k_0) by first-step analysis (for a rigorous proof, see, for instance, [12]). The cumulated cost c_\pi(k_0) incurred during one particular walk (one realization of the random process), starting from state k_0 and entering a destination state for the first time, is
\[
c_\pi(k_0) = \sum_{t=0}^{T_{k_0}} c(s_t, u(s_t)) \tag{A.2}
\]
where T_{k_0} is the time (number of steps) needed to reach a destination state. Now, without changing the problem, we can assume that the destination state is absorbing, just as if the process stopped once this destination state has been reached. If we further assume that c(d, u) = 0 for every destination state d, then Equation (A.2) can be rewritten as
\[
c_\pi(k_0) = \sum_{t=0}^{\infty} c(s_t, u(s_t))
\]
Furthermore, c_\pi(d) = 0.
Formula (A.1) becomes
\begin{align*}
V_\pi(k_0) &= \mathrm{E}[c_\pi(k_0) \mid s_0 = k_0] = \mathrm{E}_\pi\!\left[ \sum_{t=0}^{\infty} c(s_t, u(s_t)) \,\Big|\, s_0 = k_0 \right] \tag{A.3} \\
&= \sum_{i_0, i_1, \dots} \mathrm{P}(u(s_0)=i_0, u(s_1)=i_1, \dots \mid s_0=k_0) \sum_{t=0}^{\infty} c(s_t, u(s_t)) \tag{A.4} \\
&= \sum_{i_0, i_1, \dots} \mathrm{P}(u(s_0)=i_0, u(s_1)=i_1, \dots \mid s_0=k_0) \left[ c(s_0, u(s_0)) + \sum_{t=1}^{\infty} c(s_t, u(s_t)) \right] \tag{A.5} \\
&= \sum_{i_0} \mathrm{P}(u(s_0)=i_0 \mid s_0=k_0) \sum_{i_1, i_2, \dots} \mathrm{P}(u(s_1)=i_1, u(s_2)=i_2, \dots \mid u(s_0)=i_0, s_0=k_0) \notag \\
&\qquad \times \left[ c(s_0, u(s_0)) + \sum_{t=1}^{\infty} c(s_t, u(s_t)) \right] \tag{A.6} \\
&= \sum_{i_0} \mathrm{P}(u(s_0)=i_0 \mid s_0=k_0) \left\{ c(s_0, u(s_0)) + \sum_{i_1, i_2, \dots} \mathrm{P}(u(s_1)=i_1, u(s_2)=i_2, \dots \mid s_1=k_1) \sum_{t=1}^{\infty} c(s_t, u(s_t)) \right\} \tag{A.7} \\
&= \sum_{i_0} \mathrm{P}(u(s_0)=i_0 \mid s_0=k_0) \left[ c(s_0, u(s_0)) + V_\pi(k_1) \right] \tag{A.8}
\end{align*}
where we used the Markov property (the future behaviour only depends on the knowledge of the last state) and the fact that the sequence of visited states is deterministic, k_1 = f_{k_0}(i_0), k_2 = f_{k_1}(i_1), etc. Therefore, since V_\pi(d) = 0, we can compute the expected cost at state k in terms of the expected cost at each state that can be reached from state k:
\[
\begin{cases}
V_\pi(k) = \sum_{i \in U(k)} \pi_k(i) \left[ c(k,i) + V_\pi(k_i') \right], & \text{with } k_i' = f_k(i) \text{ and } k \neq d \\
V_\pi(d) = 0
\end{cases} \tag{A.9}
\]
These equations are very similar to the relations used to compute average first-passage times in Markov chains [14]. Their interpretation is straightforward: to compute the total expected cost from state k to state d, first pay the cost of moving to one of the reachable states k_i' and then proceed from there.
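For a fixed policy, Equation (A.9) is a linear system in the V_\pi(k), so it can be assembled and solved directly. The following minimal Python sketch does this on a toy deterministic graph; the graph, costs, and policy are illustrative values chosen here, not taken from the paper.

```python
import numpy as np

# Minimal sketch of Equation (A.9): expected cost-to-go under a FIXED policy
# on a small deterministic graph. States 0, 1, 2 are ordinary; state 3 is the
# destination d (absorbing, with zero cost). All values are illustrative.
succ = {0: [1, 2], 1: [3], 2: [1, 3]}            # f_k(i): successor reached by action i
cost = {0: [1.0, 4.0], 1: [2.0], 2: [1.0, 6.0]}  # c(k, i)
policy = {0: [0.7, 0.3], 1: [1.0], 2: [0.5, 0.5]}  # pi_k(i)

states, d = [0, 1, 2], 3
n = len(states)

# Build the linear system V = c_bar + P V restricted to the non-destination
# states, where P aggregates the policy probabilities onto successor states.
P = np.zeros((n, n))
c_bar = np.zeros(n)
for k in states:
    for kp, c, p in zip(succ[k], cost[k], policy[k]):
        c_bar[k] += p * c
        if kp != d:              # V_pi(d) = 0, so the destination drops out
            P[k, kp] += p

V = np.linalg.solve(np.eye(n) - P, c_bar)
print({k: round(V[k], 4) for k in states})
```

The same system could equivalently be solved by fixed-point iteration of (A.9); the direct solve is used here only because the example is small.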
B. Appendix: Determination of the optimal policy

We now turn to the problem of finding the optimal policy under entropy constraints. We therefore consider an agent starting from state k_0 and trying to reach the destination state by choosing actions according to a policy \Pi \equiv \{\pi_1, \pi_2, \pi_3, \dots\}. We already know from (2.2) that the expected cost from any state, V_\pi(k), is given by the following equations:
\[
\begin{cases}
V_\pi(k) = \sum_{i \in U(k)} \pi_k(i) \left[ c(k,i) + V_\pi(k_i') \right], & \text{with } k_i' = f_k(i) \text{ and } k \neq d \\
V_\pi(d) = 0, & \text{where } d \text{ is the destination state}
\end{cases} \tag{B.1}
\]
The goal here is to determine the policy that minimizes this expected cost when starting from state k_0 and under the entropy constraints (2.4). We therefore introduce the following Lagrange function, which takes all the constraints into account:
\begin{align*}
\mathcal{L} = V(k_0) &+ \sum_{k \neq d} \lambda_k \left[ V(k) - \sum_{i \in U(k)} \pi_k(i) \left( c(k,i) + V(k_i') \right) \right] + \lambda_d \left( V(d) - 0 \right) \tag{B.2} \\
&+ \sum_{k \neq d} \mu_k \left[ \sum_{i \in U(k)} \pi_k(i) - 1 \right] \tag{B.3} \\
&+ \sum_{k \neq d} \eta_k \left[ \sum_{i \in U(k)} \pi_k(i) \log \pi_k(i) + E_k \right] \tag{B.4}
\end{align*}
with k_i' = f_k(i). Differentiating this Lagrange function with respect to the choice probabilities, \partial\mathcal{L}/\partial\pi_l(j), and setting the result to zero gives
\[
-\lambda_l \left( c(l,j) + V(k_j') \right) + \mu_l + \eta_l \left( \log \pi_l(j) + 1 \right) = 0 \tag{B.5}
\]
with k_j' = f_l(j). Multiplying Equation (B.5) by \pi_l(j) and summing over j \in U(l) yields
\[
-\lambda_l V(l) + \mu_l + \eta_l \left( 1 - E_l \right) = 0 \tag{B.6}
\]
from which we deduce \mu_l = \lambda_l V(l) + \eta_l (E_l - 1). Replacing \mu_l by this expression in (B.5) provides
\[
-\lambda_l \left( c(l,j) + V(k_j') - V(l) \right) + \eta_l \left( \log \pi_l(j) + E_l \right) = 0 \tag{B.7}
\]
Defining \theta_l = -\lambda_l / \eta_l and extracting \log \pi_l(j) from (B.7) gives
\[
\log \pi_l(j) = -\theta_l \left( c(l,j) + V(k_j') - V(l) \right) - E_l \tag{B.8}
\]
or, equivalently,
\[
\pi_l(j) = \exp\!\left[ -\theta_l \left( c(l,j) + V(k_j') \right) \right] \exp\!\left[ \theta_l V(l) - E_l \right] \tag{B.9}
\]
Summing this last equation over j \in U(l), and remembering that k_j' depends on j (k_j' = f_l(j)), allows us to compute the second factor in the right-hand side of Equation (B.9):
\[
\exp\!\left[ \theta_l V(l) - E_l \right] = \left( \sum_{j \in U(l)} \exp\!\left[ -\theta_l \left( c(l,j) + V(k_j') \right) \right] \right)^{-1} \tag{B.10}
\]
By substituting (B.10) into Equation (B.9), we finally obtain
\[
\pi_l(i) = \frac{\exp\!\left[ -\theta_l \left( c(l,i) + V(k_i') \right) \right]}{\sum_{j \in U(l)} \exp\!\left[ -\theta_l \left( c(l,j) + V(k_j') \right) \right]} \tag{B.11}
\]
Finally, expressing the fact that each probability distribution has a given entropy,
\[
- \sum_{i \in U(l)} \pi_l(i) \log \pi_l(i) = E_l,
\]
allows us to compute the value of \theta_l in terms of E_l. This defines the optimal policy; it can be shown that it is indeed a minimum. Notice that differentiating the Lagrange function with respect to the V(k) provides the dual equations from which the Lagrange multipliers can be computed.
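As a concrete illustration of Equation (B.11), the sketch below computes, for a single state l, the Boltzmann policy over the augmented costs c(l, j) + V(k_j') and tunes \theta_l by bisection so that the entropy of the resulting distribution matches a prescribed E_l. The cost values and all helper names are hypothetical and only serve to make the construction explicit.

```python
import numpy as np

def boltzmann(x, theta):
    """pi_l(j) proportional to exp(-theta * x_j), computed with a stable shift."""
    w = np.exp(-theta * (x - x.min()))
    return w / w.sum()

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def tune_theta(x, E_target, theta_max=1e3, iters=100):
    """Bisection on theta >= 0: the entropy of the Boltzmann distribution
    decreases monotonically from log(n) at theta = 0 towards 0 as theta grows,
    so a simple bisection reaches the prescribed entropy E_target."""
    lo, hi = 0.0, theta_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if entropy(boltzmann(x, mid)) > E_target:
            lo = mid          # still too much exploration: increase theta
        else:
            hi = mid
    return 0.5 * (lo + hi)

x = np.array([3.0, 5.0, 4.0])   # illustrative c(l, j) + V(k'_j) for the actions in U(l)
E_target = 0.5                  # prescribed entropy (degree of exploration)
theta = tune_theta(x, E_target)
pi = boltzmann(x, theta)
print(theta, pi, entropy(pi))   # entropy(pi) should be close to E_target
```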
C. Appendix: Convergence of the iterative updating rule

Before proving the convergence of the iterative updating rules (3.7)-(3.8), let us establish a preliminary lemma that will be useful later. We first show that minimizing the linear form \sum_i p_i x_i with respect to the p_i, subject to an entropy constraint and a sum-to-one constraint, leads to the following functional form:
\[
p_i = \frac{\exp[-\theta x_i]}{\sum_j \exp[-\theta x_j]} \tag{C.1}
\]
where \theta is chosen so as to satisfy -\sum_i p_i \log p_i = E (the entropy constraint). This probability distribution has exactly the same form as (3.7) in the proposed algorithm. The lemma thus shows that using the distribution (C.1) automatically minimizes the linear form \sum_i p_i x_i. To this end, let us introduce the Lagrange function
\[
\mathcal{L} = \sum_i p_i x_i + \lambda \left( \sum_i p_i \log p_i + E \right) + \mu \left( \sum_i p_i - 1 \right) \tag{C.2}
\]
and compute its derivative with respect to p_j, setting \partial\mathcal{L}/\partial p_j = 0:
\[
x_j + \lambda \log p_j + \lambda + \mu = 0 \tag{C.3}
\]
Now, by defining \theta = \lambda^{-1}, Equation (C.3) can be rewritten as
\[
p_i = \exp[-(1 + \theta\mu)] \exp[-\theta x_i] \tag{C.4}
\]
Summing Equation (C.4) over i and using the fact that \sum_j p_j = 1 allows us to compute the first factor in the right-hand side of Equation (C.4):
\[
\exp[-(1 + \theta\mu)] = \left( \sum_j \exp[-\theta x_j] \right)^{-1} \tag{C.5}
\]
Thus, we obtain the expected result from (C.4)-(C.5):
\[
p_i = \frac{\exp[-\theta x_i]}{\sum_j \exp[-\theta x_j]} \tag{C.6}
\]
where \theta is chosen so as to satisfy -\sum_i p_i \log p_i = E (the entropy constraint). This proves our preliminary lemma.
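The lemma can also be checked numerically. The sketch below solves the entropy-constrained minimization of the linear form directly with a general-purpose constrained optimizer and compares the result with the closed form (C.1). The cost vector, the target entropy, and the use of SciPy's SLSQP solver are choices made here for illustration only; the two distributions should agree up to solver tolerance.

```python
import numpy as np
from scipy.optimize import minimize

# Numerical check of the lemma: among distributions p with a prescribed
# entropy E, the Boltzmann form (C.1) minimizes sum_i p_i x_i.
x = np.array([3.0, 5.0, 4.0])   # illustrative cost values
E = 0.7                          # illustrative target entropy (< log 3)

def entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log(p)).sum()

# Direct constrained minimization of the linear form.
res = minimize(
    lambda p: p @ x,
    x0=np.ones_like(x) / len(x),
    bounds=[(1e-9, 1.0)] * len(x),
    constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0},
                 {"type": "eq", "fun": lambda p: entropy(p) - E}],
    method="SLSQP",
)

# Closed form (C.1): bisection on theta so that the entropy matches E.
lo, hi = 0.0, 1e3
for _ in range(100):
    theta = 0.5 * (lo + hi)
    q = np.exp(-theta * (x - x.min())); q /= q.sum()
    lo, hi = (theta, hi) if entropy(q) > E else (lo, theta)

print(np.round(res.x, 4), np.round(q, 4))   # the two distributions should agree
```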
We are now ready to prove the convergence of (3.7)-(3.8) by induction. First, we show that if the \hat{V}(k) have been decreasing (in the sense of less than or equal to) up to the current time step, then the next update also decreases \hat{V}(k). Then, since the first update necessarily decreases \hat{V}(k), all the subsequent updates also decrease the value of \hat{V}(k). Finally, since \hat{V}(k) decreases at each update and cannot be negative, the sequence must converge.

Thus, we first show that if all the \hat{V}(k) have been decreasing up to the current time step t, then the next update also decreases \hat{V}(k). Suppose the agent visits state k at time step t and chooses action i, so that an update of the expected cost is computed. If we denote by \hat{V}_t(k) the estimate of the expected cost of state k at time step t, the update of \hat{V}(k) is given by
\[
\begin{cases}
\hat{\pi}_k^t(i) = \dfrac{\exp\!\left[ -\theta_k \left( c(k,i) + \hat{V}_t(k_i') \right) \right]}{\sum_{j \in U(k)} \exp\!\left[ -\theta_k \left( c(k,j) + \hat{V}_t(k_j') \right) \right]} \\[2ex]
\hat{V}_t(k) = \sum_{i \in U(k)} \hat{\pi}_k^t(i) \left[ c(k,i) + \hat{V}_t(k_i') \right], \quad \text{with } k_i' = f_k(i) \text{ and } k \neq d
\end{cases} \tag{C.7}
\]
All the \hat{\pi}_k^t verify the entropy constraints as well as the sum-to-one constraints at any time t. Now, denote by \tau < t the time of the last visit of the agent to state k, and consequently the time of the last update of \hat{V}(k) before time t. We easily find
\begin{align*}
\hat{V}_t(k) &= \sum_{i \in U(k)} \hat{\pi}_k^t(i) \left[ c(k,i) + \hat{V}_t(k_i') \right] \tag{C.8} \\
&\leq \sum_{i \in U(k)} \hat{\pi}_k^\tau(i) \left[ c(k,i) + \hat{V}_t(k_i') \right] \tag{C.9} \\
&\leq \sum_{i \in U(k)} \hat{\pi}_k^\tau(i) \left[ c(k,i) + \hat{V}_\tau(k_i') \right] \tag{C.10} \\
&= \hat{V}_\tau(k) \tag{C.11}
\end{align*}
The passage from Equation (C.8) to (C.9) follows from the preliminary lemma, while the passage from (C.9) to (C.10) follows from the assumption that the \hat{V}(k) have been decreasing up to the current time step t.
Now, initially, an arbitrary initial policy \hat{\pi}_k(i), \forall i, k, verifying the exploration rate constraints (3.3) is chosen, and the corresponding expected costs until destination, \hat{V}_0(k), are computed by solving the set of linear equations (2.2). Therefore, initially (t = 0),
\[
\hat{V}_0(k) = \sum_{i \in U(k)} \hat{\pi}_k^0(i) \left[ c(k,i) + \hat{V}_0(k_i') \right], \quad \forall k \neq d
\]
Thus, at the beginning of the exploration process (t = 1), the agent starts at node k_0 and updates \hat{V}_1(k_0) according to
\begin{align*}
\hat{V}_1(k_0) &= \sum_{i \in U(k_0)} \hat{\pi}_{k_0}^1(i) \left[ c(k_0,i) + \hat{V}_1(k_i') \right] \tag{C.12} \\
&\leq \sum_{i \in U(k_0)} \hat{\pi}_{k_0}^0(i) \left[ c(k_0,i) + \hat{V}_0(k_i') \right] \tag{C.13}
\end{align*}
since, for every node k_i' \neq k_0, \hat{V}_1(k_i') = \hat{V}_0(k_i') (no update has yet occurred). The induction hypothesis is thus verified: all the \hat{V}(k) have been decreasing up to the current time step t = 1. Consequently, the next update will decrease \hat{V}(k), as will all the subsequent updates. Thus, if we define the decrease of \hat{V}(k) as \Delta\hat{V}_t(k) = \hat{V}_{t-1}(k) - \hat{V}_t(k) \geq 0, then, since \hat{V}(k) cannot become smaller than the cost C of the shortest path, we have \hat{V}_\infty(k) = \hat{V}_0(k) - \sum_{t=1}^{\infty} \Delta\hat{V}_t(k) \geq C. Hence \sum_{t=1}^{\infty} \Delta\hat{V}_t(k) \leq \hat{V}_0(k) - C < \infty, so that \Delta\hat{V}_t(k) \to 0 as t \to \infty.
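The monotone decrease established above can be observed on a toy instance. The sketch below applies the coupled updates of (C.7) synchronously to every state at each sweep (whereas the paper updates only the state currently visited by the agent), starting from the expected costs of a uniform initial policy; the assertion checks that the estimates never increase. The graph, costs, and prescribed entropies E_k are illustrative values, not taken from the paper.

```python
import numpy as np

# Synchronous sweeps of the coupled updates (C.7): recompute the Boltzmann
# policy whose entropy matches the prescribed degree of exploration E_k, then
# recompute the expected costs under that policy. Illustrative instance only.
succ = {0: [1, 2], 1: [3], 2: [1, 3]}
cost = {0: [1.0, 4.0], 1: [2.0], 2: [1.0, 6.0]}
E = {0: 0.3, 1: 0.0, 2: 0.3}      # prescribed entropies (degrees of exploration)
d = 3

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def policy_with_entropy(x, E_target, iters=80):
    """Boltzmann policy over costs x, with theta >= 0 found by bisection so
    that the entropy of the distribution equals E_target."""
    lo, hi = 0.0, 1e3
    for _ in range(iters):
        theta = 0.5 * (lo + hi)
        p = np.exp(-theta * (x - x.min()))
        p = p / p.sum()
        if entropy(p) > E_target:
            lo = theta
        else:
            hi = theta
    return p

# Expected costs of the uniform initial policy (computed as in Appendix A).
V = {0: 5.75, 1: 2.0, 2: 4.5, d: 0.0}

for sweep in range(30):
    new_V = dict(V)
    for k in succ:
        x = np.array([c + V[kp] for c, kp in zip(cost[k], succ[k])])
        pi = policy_with_entropy(x, E[k])     # policy update, as in (C.7)
        new_V[k] = float(pi @ x)              # value update, as in (C.7)
    assert all(new_V[k] <= V[k] + 1e-9 for k in succ), "estimates must not increase"
    V = new_V

print({k: round(V[k], 4) for k in succ})
```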
D. Appendix: Computation of the expected cost for a stochastic shortest path problem and a fixed policy
Let us now compute the quantity of interest, E[c_\pi(k_0) \mid s_0 = k_0], in the case of a stochastic shortest path problem, for k_0 \neq d:
\begin{align*}
V_\pi(k_0) &= \mathrm{E}[c_\pi(k_0) \mid s_0 = k_0] = \mathrm{E}_\pi\!\left[ \sum_{t=0}^{\infty} c(s_t, u(s_t)) \,\Big|\, s_0 = k_0 \right] \tag{D.1} \\
&= \sum_{i_0, i_1, \dots} \sum_{k_1, k_2, \dots} \mathrm{P}(u(s_0)=i_0, u(s_1)=i_1, \dots, s_1=k_1, \dots \mid s_0=k_0) \sum_{t=0}^{\infty} c(s_t, u(s_t)) \tag{D.2} \\
&= \sum_{i_0, i_1, \dots} \sum_{k_1, k_2, \dots} \mathrm{P}(u(s_0)=i_0, u(s_1)=i_1, \dots, s_1=k_1, \dots \mid s_0=k_0) \left[ c(s_0, u(s_0)) + \sum_{t=1}^{\infty} c(s_t, u(s_t)) \right] \tag{D.3} \\
&= \sum_{i_0} \sum_{k_1} \mathrm{P}_u(u(s_0)=i_0 \mid s_0=k_0) \, \mathrm{P}_s(s_1=k_1 \mid s_0=k_0, u(s_0)=i_0) \notag \\
&\qquad \times \sum_{i_1, i_2, \dots} \sum_{k_2, k_3, \dots} \mathrm{P}(u(s_1)=i_1, u(s_2)=i_2, \dots, s_2=k_2, \dots \mid s_1=k_1) \left\{ c(s_0, u(s_0)) + \sum_{t=1}^{\infty} c(s_t, u(s_t)) \right\} \tag{D.4} \\
&= \sum_{i_0, k_1} \mathrm{P}_u(u(s_0)=i_0 \mid s_0=k_0) \, \mathrm{P}_s(s_1=k_1 \mid s_0=k_0, u(s_0)=i_0) \left[ c(s_0, u(s_0)) + V_\pi(k_1) \right] \tag{D.5}
\end{align*}
where we used the Markov property (the future behaviour only depends on the knowledge of the present state). Therefore, since V_\pi(d) = 0, we can compute the expected cost at state k in terms of the expected cost at each state that can be reached from state k:
\[
\begin{cases}
V_\pi(k) = \sum_{i \in U(k)} \sum_{k'} \pi_k(i) \, p_{kk'}(i) \left[ c(k,i) + V_\pi(k') \right], & k \neq d \\
V_\pi(d) = 0
\end{cases} \tag{D.6}
\]
which is the expected result.
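For a fixed policy, Equation (D.6) is again linear in the V_\pi(k), so it can be solved directly once the transition probabilities p_{kk'}(i) are specified. The following sketch does this for a small illustrative example; the states, costs, policy, and transition probabilities are all made up for the illustration.

```python
import numpy as np

# Minimal sketch of Equation (D.6): expected cost-to-go under a fixed policy
# when actions have stochastic outcomes p_kk'(i). Illustrative values only.
n, d = 3, 3                                  # states 0..2 ordinary, state 3 = destination
U = {0: [0, 1], 1: [0], 2: [0, 1]}           # admissible actions per state
cost = {0: [1.0, 4.0], 1: [2.0], 2: [1.0, 6.0]}
policy = {0: [0.7, 0.3], 1: [1.0], 2: [0.5, 0.5]}
# p[k][i] maps a successor state to the probability of landing there
# after choosing action i in state k.
p = {
    0: [{1: 0.9, 2: 0.1}, {2: 1.0}],
    1: [{3: 0.8, 0: 0.2}],
    2: [{1: 0.6, 3: 0.4}, {3: 1.0}],
}

# Assemble V = c_bar + P V over the non-destination states and solve it directly.
P = np.zeros((n, n))
c_bar = np.zeros(n)
for k in range(n):
    for i in U[k]:
        c_bar[k] += policy[k][i] * cost[k][i]
        for kp, prob in p[k][i].items():
            if kp != d:                      # V_pi(d) = 0
                P[k, kp] += policy[k][i] * prob

V = np.linalg.solve(np.eye(n) - P, c_bar)
print(np.round(V, 4))
```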