Iterative Hierarchical Optimization for Misspecified Problems (IHOMP)


Daniel J. Mankowitz 1   danielm@tx.technion.ac.il
Timothy A. Mann 2   timothymann@google.com
Shie Mannor 1   shie@ee.technion.ac.il

arXiv:1602.03348v1 [cs.LG] 10 Feb 2016

Abstract

Reinforcement Learning (RL) aims to learn an optimal policy for a Markov Decision Process (MDP). For complex, high-dimensional MDPs, it may only be feasible to represent the policy with function approximation. If the policy representation used cannot represent good policies, the problem is misspecified and the learned policy may be far from optimal. We introduce IHOMP as an approach for solving misspecified problems. IHOMP iteratively refines a set of specialized policies based on a limited representation. We refer to these policies as policy threads. At the same time, IHOMP stitches these policy threads together in a hierarchical fashion to solve a problem that was otherwise misspecified. We prove that IHOMP enjoys theoretical convergence guarantees and extend IHOMP to exploit Option Interruption (OI), enabling it to learn where policy threads can be reused. Our experiments demonstrate that IHOMP can find near-optimal solutions to otherwise misspecified problems and that OI can further improve the solutions.

1. Introduction

Reinforcement Learning (RL) algorithms can learn near-optimal solutions to well-defined problems. However, real-world problems rarely come in the form of a concrete problem description; a human has to translate the poorly defined target problem into one. A Misspecified Problem (MP) occurs when an optimal solution to the problem description is inadequate in the target problem. Unfortunately, creating a well-defined problem description is a challenging art, and MPs can have serious consequences in many domains, ranging from smart grids (Abiri-Jahromi et al., 2013; Wu et al., 2010) and robotics (Smart & Kaelbling, 2002) to inventory management systems (Mann & Mannor, 2014). In this paper, we introduce a hierarchical approach that mitigates the consequences of problem misspecification.

Most often, RL problems are described as Markov Decision Processes (MDPs; Sutton & Barto, 1998). A solution to an MDP is a function, called a policy, that generates an action when presented with the current state; the solution is any policy that maximizes the long-term sum of rewards. For problems with continuous, high-dimensional state-spaces, explicitly representing the policy is infeasible, so for the remainder of this paper we restrict our discussion to linearly parametrized policy representations (Sutton, 1996; Roy & How, 2013); our results are generalizable and complementary to non-linear parametric policy representations.

Why are problems misspecified? While a problem description can be misspecified for many reasons, one important case is the state representation. It is well established in the machine learning (Levi & Weiss, 2004; Zhou et al., 2009) and RL (Konidaris et al., 2011) literature that “good” features can have a dramatic impact on performance. Finding “good” features to represent the state is a challenging, domain-specific problem that is generally considered outside the scope of RL. Unfortunately, domain experts may not supply useful features, either because they do not fully understand the target problem or because they do not understand the technicalities of reinforcement learning. In addition, we may prefer an MP with a limited state representation for several reasons:

1 Electrical Engineering Department, The Technion - Israel Institute of Technology, Haifa 32000, Israel. 2 DeepMind, London, UK.


Figure 1. (a) An episodic MDP with an S-shaped state-space and goal region G. (i) Flat approach: a single policy fails to solve the entire task, resulting in a misspecified problem. (ii) Hierarchical approach: simple policy representations are combined to solve the task. (b) Learning PTs, denoted by black arrows. The domain is partitioned into five classes (sub-partitions), resulting in the PT set Σ = {σ1, σ2, σ3, σ4, σ5}. In Iteration 1, all PTs except σ5 (which has immediate access to the goal region) are arbitrary. In Iteration 2, σ5 propagates reward back to σ4. This process repeats until useful PTs are learned over the entire state-space.

(1) Regularization: In many problems, we wish to have a limited feature representation to improve generalization and avoid overfitting (Singh et al., 1995; Geramifard et al., 2012). (2) Memory and system constraints: In practice, only a finite number of features can be used due to computational constraints (Roy & How, 2013; Singh et al., 1995); in real-time systems, querying a feature may take too long, and in physical systems the sensor required to measure a desired feature may be prohibitively expensive. (3) Learning on large data: After learning on large amounts of data, augmenting a feature set with new features to improve performance is non-trivial and often inefficient (Geramifard et al., 2012).

How can we mitigate misspecification? We argue that learning a hierarchical policy can mitigate the problems associated with an MP, and we contrast this with a flat policy approach in which a single parameterized policy is used to solve the entire MDP. To illustrate how learning a hierarchical policy can repair MPs, consider the S-shaped domain shown in Figure 1a. To solve the task, the agent must move from the bottom left corner to the goal region denoted by the letter ‘G’ in the top right. The state representation only permits policies that move in a straight line, so the problem is misspecified and is not solvable with a flat policy approach (Figure 1a.i). However, if we break up the state-space, as shown in Figure 1a.ii, and learn one policy for each cell, the problem is solvable. The partial policies shown in Figure 1a.ii are an example of abstract actions, called options (Sutton et al., 1999), macro-actions (Hauskrecht et al., 1998; He et al., 2011), or skills (Konidaris & Barto, 2009). Learning a useful set of options has been a topic of intense research (McGovern & Barto, 2001; Moerman, 2009; Konidaris & Barto, 2009; Brunskill & Li, 2014; Hauskrecht et al., 1998). However, previous approaches have proposed algorithms for learning options with the objective of learning or planning faster. In contrast, our objective is to learn options to repair an MP. Similar to using multiple pieces of thread to stitch up unwanted holes in a garment, we learn and stitch together options to circumvent MPs. To highlight this novel objective, we use the term Policy Threads (PTs) instead of options, while recognizing that PTs are a special case of options.

Proposed Algorithm: We introduce a novel meta-algorithm, Iterative Hierarchical Optimization for Misspecified Problems (IHOMP), that uses an RL algorithm as a “black box” to iteratively learn and stitch together PTs to repair MPs. To force the PTs to specialize, IHOMP uses a partition of the state-space and trains one PT for each sub-partition, also referred to as a class (Figure 1b). Any partitioning scheme can be used, although the partition may impact performance. During an iteration of IHOMP, an RL algorithm is used to update each PT. The PTs may be initialized arbitrarily, but after the first iteration, PTs with access to a goal region or non-zero rewards will learn how to exploit those rewards (e.g., Figure 1b, Iteration 1). On further iterations, the newly acquired PTs propagate reward back to other regions of the state-space. Thus, PTs that previously had no reward signal exploit the rewards of other PTs that have received meaningful reward signals (e.g., Figure 1b, Iterations 2 and 5). Although each PT is only learned over a single partition class, it can be initialized in any state.

Why partitions? If all PTs were trained on all data, the PTs would not specialize, defeating the purpose of learning multiple policies. Partitions are necessary to foster specialization. Natural partitionings arise in many different applications and are often easy to design.


Consider navigation tasks (which we use in this paper for ease of visualization), which are ever-present in robotics (Smart & Kaelbling, 2002) as well as in games such as Minecraft (https://minecraft.net/), where partitions naturally lead an agent from one location to another in the state space. In addition, partitions are well suited to cyclical tasks, that is, tasks with repeatable cycles (for example, a yearly cycle of 12 months); here the state space can easily be partitioned based on time. Examples include inventory management systems (Mann & Mannor, 2014) as well as maintenance scheduling of generation units and transmission lines in smart grids (Abiri-Jahromi et al., 2013; Wu et al., 2010).

Automatically learning partitions: The availability of a pre-defined partitioning of the state space is a strong assumption in some domains. We have developed a relaxation of this assumption that enables partitions to be learned automatically using Regularized Option Interruption (ROI) (Mankowitz et al., 2014; Sutton et al., 1999).

Contributions: Our main contributions are: (1) the introduction of Iterative Hierarchical Optimization for Misspecified Problems (IHOMP), which learns PTs (i.e., options) to repair and solve MPs; (2) theoretical convergence guarantees ensuring that IHOMP converges to a near-optimal solution; (3) Theorem 1, which relates the quality of the policy returned by IHOMP to the quality of the PTs learned by the “black box” RL algorithm; (4) Theorem 2, which proves that Regularized Option Interruption (ROI) can be safely incorporated into IHOMP; (5) experiments demonstrating that, given a misspecified problem, IHOMP can learn and stitch together policy threads to solve it, and that IHOMP-ROI can learn partitions and discover reusable PTs. This divide-and-conquer approach may also enable us to scale up and solve larger MDPs.

2. Background

Let M = ⟨S, A, P, R, γ⟩ be an MDP, where S is a (possibly infinite) set of states, A is a finite set of actions, P is a mapping from state-action pairs to probability distributions over next states, R maps each state-action pair to a reward in [0, 1], and γ ∈ [0, 1) is the discount factor. A policy π(a|s) gives the probability of executing action a ∈ A from state s ∈ S. The value function of a policy π with respect to a state s ∈ S is

\[ V^{\pi}_{M}(s) = \mathbb{E}\!\left[ \sum_{t=1}^{\infty} \gamma^{t-1} R(s_t, a_t) \,\Big|\, s_0 = s \right], \]

where the expectation is taken with respect to the trajectory produced by following policy π. The value function of a policy π can also be written recursively as

\[ V^{\pi}_{M}(s) = \mathbb{E}_{a \sim \pi(\cdot|s)}\!\left[ R(s, a) \right] + \gamma\, \mathbb{E}_{s' \sim P(\cdot|s,\pi)}\!\left[ V^{\pi}_{M}(s') \right], \tag{1} \]

which is known as the Bellman equation. The optimal Bellman equation can be written as

\[ V^{*}_{M}(s) = \max_{a}\left\{ \mathbb{E}\!\left[ R(s, a) \right] + \gamma\, \mathbb{E}_{s' \sim P(\cdot|s,a)}\!\left[ V^{*}_{M}(s') \right] \right\}. \]

Let ε > 0. We say that a policy π is ε-optimal if V^π_M(s) ≥ V^*_M(s) − ε for all s ∈ S. The action-value function of a policy π is defined, for a state s ∈ S and an action a ∈ A, by

\[ Q^{\pi}_{M}(s, a) = \mathbb{E}\!\left[ R(s, a) \right] + \gamma\, \mathbb{E}_{s' \sim P(\cdot|s,a)}\!\left[ V^{\pi}_{M}(s') \right], \]

and the optimal action-value function is denoted by Q^*_M(s, a). Throughout this paper, we drop the dependence on M when it is clear from the context.
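As a concrete illustration of the Bellman equation (1), the following minimal sketch (our own, not from the paper) evaluates V^π on a toy tabular MDP by repeatedly applying the Bellman backup until convergence; the toy MDP, function names and tolerance are assumptions made purely for illustration.

```python
# Illustrative sketch: policy evaluation by iterating the Bellman equation (1).
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.95, tol=1e-8):
    """P: (S, A, S) transition tensor, R: (S, A) rewards in [0, 1],
    pi: (S, A) stochastic policy. Returns V^pi as an (S,) vector."""
    n_states = R.shape[0]
    V = np.zeros(n_states)
    while True:
        # Bellman backup: V(s) = E_{a~pi}[ R(s,a) + gamma * E_{s'~P}[ V(s') ] ]
        V_new = np.einsum("sa,sa->s", pi, R + gamma * np.einsum("sap,p->sa", P, V))
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Toy 2-state, 2-action MDP (all numbers are arbitrary).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])   # P[s, a, s']
R = np.array([[0.0, 0.1],
              [1.0, 0.3]])                 # R[s, a]
pi = np.array([[0.5, 0.5],
               [1.0, 0.0]])                # pi[s, a]
print(policy_evaluation(P, R, pi))
```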

3. Policy Threads

One of the key ideas behind Policy Threads (PTs) is that they may be learned locally, but they are potentially reusable throughout the state-space. We present a formal definition of PTs and of a stitching policy.

Definition 1. A policy thread σ is a pair ⟨π_θ, β⟩, where π_θ is a parametric policy with parameter vector θ and β : S → {0, 1} indicates whether the PT has finished (β(s) = 1) or not (β(s) = 0) in the current state s ∈ S.

Definition 2. Let Σ be a set of m ≥ 1 PTs. A stitching policy µ is a mapping µ : S → [m], where S is the state-space and [m] is the index set over PTs.

A stitching policy selects which PT to execute from the current state by returning the index of one of the PTs. By defining stitching policies to select an index (rather than the PT itself), we can use the same stitching policy even as the set of PTs is adapting. Figure 1b shows an arbitrary partitioning P, consisting of five sub-partitions {P_i | i = 1, ..., 5}, defined over the original MDP's state space. Each P_i is initialized with an arbitrary PT and has a corresponding PT-Inducing MDP (PTI-MDP) M'_i. The PTI-MDP M'_i (see the supplementary material for a full definition) is an episodic MDP that terminates once the agent escapes from P_i; upon termination, the agent receives a reward equal to the value of the state it would have transitioned to in the original MDP. Therefore, we construct a modified MDP, the PTI-MDP, and apply a planning or RL algorithm to solve it.



The resulting solution (policy) is a specialized policy thread. Given a “good” set of PTs, planning can be significantly faster (Sutton et al., 1999; Mann & Mannor, 2014). However, in many domains we may not be given a good set of PTs, so it is necessary to learn a good PT set starting from an unsatisfactory one. In the next section, we introduce an algorithm for dynamically learning and improving PTs using iterative hierarchical optimization.
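To make Definitions 1 and 2 concrete, here is a minimal sketch (our own names and types, not the paper's code) of a policy thread, a partition-induced stitching policy, and the execution loop of the hierarchical policy ⟨µ, Σ⟩; the `step` function standing in for the environment is a hypothetical placeholder.

```python
# Minimal sketch of Definitions 1 and 2; all names are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable, List, Sequence
import numpy as np

State = np.ndarray  # e.g. a continuous feature vector

@dataclass
class PolicyThread:
    """A policy thread sigma = <pi_theta, beta> (Definition 1)."""
    theta: np.ndarray                            # policy parameters
    policy: Callable[[np.ndarray, State], int]   # pi_theta: (theta, s) -> action
    beta: Callable[[State], int]                 # beta(s) in {0, 1}; 1 means the PT has finished

    def act(self, s: State) -> int:
        return self.policy(self.theta, s)

class StitchingPolicy:
    """mu: S -> [m] (Definition 2), here induced by a fixed state-space partition."""
    def __init__(self, partition: Sequence[Callable[[State], bool]]):
        self.partition = partition               # partition[i](s) is True iff s lies in class P_i

    def __call__(self, s: State) -> int:
        # Return the index of the partition class (and hence PT) containing s.
        return next(i for i, contains in enumerate(self.partition) if contains(s))

def run_hierarchy(s: State, mu: StitchingPolicy, threads: List[PolicyThread],
                  step: Callable[[State, int], State], max_steps: int = 1000) -> State:
    """Execute the hierarchical policy <mu, Sigma>: mu picks a PT, which runs until beta(s) = 1."""
    steps = 0
    while steps < max_steps:
        sigma = threads[mu(s)]                   # stitching policy selects a PT index
        while steps < max_steps:
            s = step(s, sigma.act(s))            # one primitive environment transition (assumed)
            steps += 1
            if sigma.beta(s):                    # PT terminated: hand control back to mu
                break
    return s
```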

4. Iterative Hierarchical Optimization for Misspecified Problems (IHOMP)

Iterative Hierarchical Optimization for Misspecified Problems (IHOMP, Algorithm 1) takes the original MDP M, a partition P over the state-space and a number of iterations K ≥ 1, and returns a pair ⟨µ, Σ⟩ containing a stitching policy µ and a set of PTs Σ. The number of PTs m = |P| is equal to the number of classes (sub-partitions) in the partition P (line 1). The stitching policy µ returned by IHOMP is defined (line 2) by

\[ \mu(s) = \arg\max_{i \in [m]} I\{s \in P_i\}, \tag{2} \]

where I{·} is the indicator function returning 1 if its argument is true and 0 otherwise, and P_i denotes the i-th class in the partition P. Thus µ simply returns the index of the PT associated with the partition class containing the current state. On line 3, IHOMP initializes Σ with arbitrary PTs (IHOMP can also be initialized with PTs that we believe might be useful, to speed up learning).

Algorithm 1: Iterative Hierarchical Optimization for Misspecified Problems (IHOMP)
Require: M {MDP}, P {partition of S}, K {iterations}
 1: m ← |P|  {number of partition classes}
 2: µ(s) = arg max_{i∈[m]} I{s ∈ P_i}
 3: Initialize Σ with m PTs.  {One PT per partition class.}
 4: for k = 1, 2, ..., K do  {Do K iterations.}
 5:   for i = 1, 2, ..., m do  {One update per PT.}
 6:     Policy Evaluation:
 7:       Evaluate µ with Σ to obtain V^{⟨µ,Σ⟩}_M
 8:     PT Update:
 9:       Construct PTI-MDP M'_i from M and V^{⟨µ,Σ⟩}_M
10:       Solve M'_i, obtaining policy π_θ
11:       σ'_i ← ⟨π_θ, β_i⟩
12:       Replace σ_i in Σ by σ'_i
13:   end for
14: end for
15: return ⟨µ, Σ⟩

Next (lines 4–14), IHOMP performs K iterations. In each iteration, IHOMP updates the PTs in Σ (lines 5–13). Note that the value of a PT depends on how it is combined with other PTs. If we allowed all PTs to change simultaneously, the PTs could not reliably propagate value off of each other. Therefore, IHOMP updates each PT individually. Multiple iterations are needed so that the PT set can converge (Figure 1b).

The process of updating a PT (lines 6–12) starts by evaluating µ with the current PT set Σ (line 6). Any number of policy evaluation algorithms could be used here, such as TD(λ) with function approximation (Sutton & Barto, 1998) or LSTD (Boyan, 2002), modified to be used with PTs. In our experiments, we used a straightforward variant of LSTD (Sorg & Singh, 2010). Then we use the original MDP M to construct a PTI-MDP M'_i (line 9). Next, IHOMP uses a planning or RL algorithm to approximately solve the PTI-MDP M'_i, returning a parametrized policy π_θ (line 10). Any planning or RL algorithm for regular MDPs could fill this role, provided that it produces a parametrized policy. In our experiments, we used a simple actor-critic policy gradient algorithm, unless otherwise stated. Then a new PT σ'_i = ⟨π_θ, β_i⟩ is created (line 11), where π_θ is the policy derived on line 10 and

\[ \beta_i(s) = \begin{cases} 0 & \text{if } s \in P_i \\ 1 & \text{otherwise.} \end{cases} \]

The definition of β_i means that the PT will terminate only if it leaves the i-th partition class. Finally, we update the PT set Σ by replacing the i-th PT with σ'_i (line 12). It is important to note that in IHOMP, updating a PT is equivalent to solving a PTI-MDP.
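A compact Python sketch of the IHOMP loop may help make Algorithm 1 concrete. It reuses the PolicyThread class sketched in Section 3; `evaluate`, `make_pti_mdp` and `solve` are hypothetical stand-ins for the “black box” policy evaluation and PT-learning routines, and their names and signatures are our own assumptions.

```python
# Hedged sketch of Algorithm 1; the helper routines are placeholders, not the paper's code.
def ihomp(M, partition, K, initial_threads, evaluate, make_pti_mdp, solve):
    """partition[i](s) -> bool tests membership of state s in class P_i."""
    m = len(partition)                                            # line 1: one PT per partition class
    mu = lambda s: next(i for i in range(m) if partition[i](s))   # line 2: stitching policy
    Sigma = list(initial_threads)                                 # line 3: arbitrary (or user-supplied) PTs

    for k in range(K):                                            # line 4: K iterations
        for i in range(m):                                        # line 5: one update per PT
            V = evaluate(M, mu, Sigma)                            # lines 6-7: estimate V^{<mu,Sigma>}
            M_i = make_pti_mdp(M, V, partition[i])                # line 9: PTI-MDP, exit reward = V(s')
            theta, policy_fn = solve(M_i)                         # line 10: learn a specialized policy
            beta_i = lambda s, i=i: 0 if partition[i](s) else 1   # terminate only on leaving P_i
            Sigma[i] = PolicyThread(theta, policy_fn, beta_i)     # lines 11-12: replace sigma_i
    return mu, Sigma                                              # line 15
```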


5. Analysis of IHOMP

We provide the first convergence guarantee for hierarchically combining and iteratively learning PTs in a continuous-state MDP using IHOMP (Lemma 1 and Lemma 2, proven in the supplementary material). We use this guarantee, together with Lemma 2, to prove Theorem 1. This theorem enables us to analyze the quality of the stitching policy returned by IHOMP. It turns out that the quality of the policy depends critically on the quality of the PT learning algorithm. An important parameter for determining the quality of a policy returned by IHOMP is the misspecification error defined below.

Definition 3. Let P be a partition over the target MDP's state-space. The misspecification error is

\[ \eta_P = \max_{i \in [m]} \eta_i, \tag{3} \]

where η_i is the smallest η_i ≥ 0 such that

\[ \left| V^{\pi_\theta}_{M'_i}(s) - V^{*}_{M'_i}(s) \right| \le \eta_i \quad \text{for all } s \in P_i, \]

and π_θ is the policy returned by the PT learning algorithm executed on M'_i.

The misspecification error quantifies the quality of the PTI-MDP solutions returned by our PT learning algorithm. If we used an exact solver to learn PTs, then η_P = 0. However, if we use an approximate solver, then η_P will be non-zero and the quality will depend on the partition P. Generally, using finer-grained partitions will decrease η_P. However, Theorem 1 reveals that adding too many PTs can also negatively impact the returned policy's quality.

Theorem 1. Let ε > 0. If we run IHOMP with partition P for K ≥ log_γ(ε(1 − γ)) iterations, then the algorithm returns a stitching policy ϕ = ⟨µ, Σ⟩ such that

\[ \left\| V^{*}_{M} - V^{\varphi}_{M} \right\|_\infty \le \frac{m \eta_P}{(1 - \gamma)^2} + \varepsilon, \tag{4} \]

where m is the number of partition classes in P.

The proof of Theorem 1 is divided into three parts (a complete proof is given in the supplementary material). The main challenge is that updating one PT can impact the value of other PTs. Our analysis starts by bounding the impact of updating one PT. Note that Σ represents a PT set and Σ_i represents a PT set in which the i-th PT (corresponding to the i-th partition class P_i) has been updated. In the first part, we show that the error between V^*_M, the globally optimal value function, and V^{⟨µ,Σ_i⟩}_M is a contraction when s ∈ P_i and is bounded by ‖V^*_M − V^{⟨µ,Σ⟩}_M‖_∞ + η_P/(1 − γ) otherwise (Lemma 1). In the second part, we apply an inductive argument to show that updating all m PTs results in a γ-contraction over the entire state space (Lemma 2). In the third part, we apply this contraction recursively, which proves Theorem 1. This provides the first theoretical guarantee of convergence to a near-optimal solution when hierarchically combining, and iteratively learning, a set of PTs Σ in a continuous-state MDP.

Theorem 1 tells us that when the misspecification error is small, IHOMP returns a near-optimal policy. The first term on the right-hand side of (4) is the approximation error. This is the loss we pay for the parametrized class of policies that we learn PTs over. Since m represents the number of classes defined by the partition, we now have a formal way of analyzing the effect of the partitioning structure. In addition, complex PTs do not need to be designed by a domain expert; only the partitioning needs to be provided a priori. The second term is the convergence error; it goes to 0 as the number of iterations K increases.

The guarantee provided by Theorem 1 may appear similar to Theorem 1 of Hauskrecht et al. (1998). However, Hauskrecht et al. (1998) derive TEAs only at the beginning of the learning process and do not update them. On the other hand, IHOMP updates its PT set dynamically by propagating value throughout the state space during each iteration. Thus, IHOMP does not require prior knowledge of the optimal value function.

Theorem 1 does not explicitly account for the policy evaluation error, which arises with any approximate policy evaluation technique. However, if the policy evaluation error is bounded by ν > 0, then we can simply replace η_P in (4) with (η_P + ν). Again, smaller policy evaluation error leads to smaller approximation error.
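As a purely illustrative plug-in of numbers into Theorem 1 (the values of γ, ε, m and η_P below are our own assumptions, not taken from the paper), the required number of iterations and the resulting bound can be computed as follows.

```python
# Illustrative numbers for Theorem 1; all values are assumptions for the example.
import math

gamma, eps = 0.9, 0.01      # discount factor and target convergence error
m, eta_P = 4, 1e-3          # number of partition classes and misspecification error

# Iterations required: K >= log_gamma(eps * (1 - gamma))
K = math.ceil(math.log(eps * (1 - gamma)) / math.log(gamma))
# Quality bound from equation (4): ||V*_M - V^phi_M||_inf <= m*eta_P / (1 - gamma)^2 + eps
bound = m * eta_P / (1 - gamma) ** 2 + eps
print(K, bound)             # 66 iterations; bound = 0.41
```

The example makes the trade-off in (4) visible: halving η_P halves the first term, while doubling m doubles it.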

6. Learning Partitions Using Regularized Option Interruption

A strong assumption in this work is that the partitioning is known a priori, before IHOMP is run. However, it may be non-trivial to design a partitioning and, in many cases, the chosen partitioning may be sub-optimal. To relax this assumption, we have incorporated Regularized Option Interruption (ROI) (Mankowitz et al., 2014) into this work, which enables IHOMP to automatically learn a near-optimal partitioning from an initially misspecified one.


Figure 2. Regularized Option Interruption (ROI): (a) The initial misspecified partition pre-defined by the user. (b) The actual optimal partitioning for the task. (c) The partition learned using ROI. If PT 1 is executing, the PT will continue to execute at location y and will terminate at location x. At this point PT 3 will begin to execute.

In order to do so, IHOMP keeps track of the action-value function Q^{⟨µ,Σ⟩}(s, j), which represents the expected value of being in state s ∈ S and executing PT j, given the stitching policy µ and PT set Σ. ROI uses this estimate of the action-value function to enable the agent to choose when to switch between PTs according to the following termination rule:

\[ \beta_j(s, t) = \begin{cases} 1 & \text{if } Q^{\langle\mu,\Sigma\rangle}(s, j) < V^{\langle\mu,\Sigma\rangle}(s) - \rho \\ 0 & \text{otherwise.} \end{cases} \tag{5} \]

Here β_j(s, t) corresponds to the termination probability of the j-th PT, and V^{⟨µ,Σ⟩}(s) = max_{i∈[m]} Q^{⟨µ,Σ⟩}(s, i). This rule is illustrated in Figure 2. A user has designed a partitioning that is initially misspecified (Figure 2a) compared to the optimal partitioning for this domain (Figure 2b). IHOMP needs to use ROI to ‘modify’ the initial partition into the optimal one. By learning the optimal action-value function Q^{*,⟨µ,Σ⟩}(s, j), IHOMP builds a near-optimal partition (Figure 2c) that is implicitly stored within this action-value function. That is, if the agent is executing a PT in partition class 1, and the value of continuing with PT 1, Q^{⟨µ,Σ⟩}(s, 1), is less than V^{⟨µ,Σ⟩}(s) − ρ for some regularization function ρ (see the x location in Figure 2c), then the agent switches to a new PT (β_j(s, t) = 1). Otherwise, it continues executing the current PT (see the y location in Figure 2c).

Using this technique, we define a new algorithm, IHOMP-ROI (IHOMP with Regularized Option Interruption); the full algorithm is given in the supplementary material. The key difference between IHOMP and IHOMP-ROI is that ROI is applied during the policy evaluation step after each of the m PTs has been updated. IHOMP-ROI is able to automatically learn a near-optimal partitioning on the fly between iterations. This means that, given a sub-optimal partitioning, IHOMP-ROI can learn an improved partitioning that is better suited to the problem. We show in Theorem 2 that ROI can be safely incorporated into IHOMP: incorporating ROI can only improve the policy produced by IHOMP. The full proof is given in the supplementary material.

Theorem 2. (IHOMP-ROI Approximate Convergence) Let ε > 0. If we run IHOMP-ROI with any admissible regularization function and initial partitioning P for K ≥ log_γ(ε(1 − γ)) iterations, then the algorithm returns a stitching policy ϕ such that

\[ \left\| V^{*}_{M} - V^{\varphi}_{M} \right\|_\infty \le \left\| V^{*}_{M} - V^{\langle\mu,\Sigma\rangle}_{M} \right\|_\infty \le \frac{m \eta_P}{(1 - \gamma)^2} + \varepsilon, \tag{6} \]

where ⟨µ, Σ⟩ is the policy returned by IHOMP without ROI, m is the number of classes in P, and η_P is the misspecification error.
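Below is a minimal sketch of the interruption test in equation (5), assuming we already have an estimate of Q^{⟨µ,Σ⟩}(s, ·) over the m PTs at the current state; the function and variable names are our own illustration, not the paper's code.

```python
# Hedged sketch of the ROI termination rule (equation (5)).
import numpy as np

def roi_terminate(Q_row, j, rho):
    """Q_row: estimated Q^{<mu,Sigma>}(s, .) over the m PTs at the current state s.
    Returns 1 (interrupt PT j and let the stitching policy re-select) when continuing
    with PT j is worse than the best available PT by more than the regularizer rho."""
    V = np.max(Q_row)                       # V^{<mu,Sigma>}(s) = max_i Q^{<mu,Sigma>}(s, i)
    return 1 if Q_row[j] < V - rho else 0

# Example: at some state, PT 1 is clearly dominated, so it is interrupted.
Q_row = np.array([0.8, 0.2, 0.7])
print(roi_terminate(Q_row, j=1, rho=0.1))   # -> 1 (terminate PT 1)
print(roi_terminate(Q_row, j=0, rho=0.1))   # -> 0 (keep executing PT 0)
```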


Figure 3. The Puddle World domain: (a) Puddle World. (b) The average reward for the IHOMP algorithm generated by the IHOMP stitching policy. This is compared to the solution obtained by a flat policy in the initially misspecified problem as well as an approximately optimal policy derived using Q-learning (applied for a huge number of iterations). (c) The average cost (negative reward) for each grid partition.

7. Experiments and Results

We performed experiments on three well-known RL benchmarks: Mountain Car (MC), Puddle World (PW) (Sutton, 1996) and the Pinball domain (Konidaris & Barto, 2009). The MC results are similar to those for PW and have therefore been moved to the supplementary material. We use two variations of the Pinball domain, namely maze-world (moved to the supplementary material), which we created, and pinball-world, which is one of the standard Pinball benchmark domains. We also perform experiments on a sub-domain of Minecraft, called the Dungeon domain. Minecraft is a video game whose aim is to explore a world as well as imagine, create and build any kind of structure, similar to Lego sets (http://www.lego.com/en-us/). Finally, we created a domain which we term the Two-Room domain to demonstrate learning optimal partitions using IHOMP-ROI.

In each of the experiments, we have defined an MP; that is, the flat policy approach is inadequate and, in some of the tasks, cannot solve the task at all. These experiments are designed to replicate situations where the policy representation is constrained due to overfitting concerns, system constraints or badly designed features. We show that in each case, by applying IHOMP to the misspecified problem, the policy is transformed into PTs and IHOMP stitches these threads together hierarchically to obtain a near-optimal solution. In the case of the Two-Room domain, we show that the agent can learn a near-optimal partitioning given an initially misspecified problem; this also provides a natural framework for discovering reusable skills. Our experiments demonstrate potential to scale up to higher-dimensional domains by hierarchically combining PTs over simple representations (a video of an agent solving the Pinball and Minecraft tasks with the IHOMP-learned policy can be found in the supplementary material).

IHOMP is a meta-algorithm: we must provide an algorithm for Policy Evaluation (PE) and one for Policy Thread Learning (PTL). For the MC, PW and Minecraft domains, we used SMDP-LSTD (Sorg & Singh, 2010) for PE and a modified version of Regular-Gradient Actor-Critic (RG-AC) (Bhatnagar et al., 2009) for PTL (see the supplementary material for details). In the Pinball domains, we used Nearest-Neighbor Function Approximation (NN-FA) for PE and UCB Random Policy Search (UCB-RPS) for PTL. In the Two-Room domain, we used a variation of LSTDQ with Option Interruption for PE and RG-AC for PTL.

In the MC, PW, Minecraft and Two-Room experiments, each PT is represented as a probability distribution over actions, independent of the state (sketched below). We compare performance against the original misspecified problem, i.e., a flat policy with the same representation. Grid-like partitions are generated for each task, and binary-grid features are used to estimate the value function. In the Pinball domains, each PT is represented by 5 polynomial features, one per state dimension plus a bias term, and the value function is represented by a KD-tree containing 1000 state-value pairs uniformly sampled in the domain; the value of a particular state is the value of its nearest neighbor in the KD-tree. These are example representations; in principle, any value function and policy representation that is representative of the domain can be utilized.
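The following is a hedged sketch of what such a state-independent policy representation might look like (the class and its parameterization are our illustration, not the experimental code): a softmax over one logit per action, so the same action distribution is produced in every state.

```python
# Illustrative state-independent (flat/PT) policy: a softmax over one logit per action.
import numpy as np

class StateIndependentPolicy:
    def __init__(self, n_actions, seed=0):
        self.rng = np.random.default_rng(seed)
        self.theta = self.rng.normal(size=n_actions)   # one logit per action

    def probs(self, s=None):
        # The state s is ignored: the policy can only encode "move mostly in one direction".
        z = np.exp(self.theta - np.max(self.theta))
        return z / z.sum()

    def act(self, s):
        return self.rng.choice(len(self.theta), p=self.probs(s))

pi = StateIndependentPolicy(n_actions=4)   # e.g. the four cardinal directions in Puddle World
print(pi.probs())                          # the same distribution for every state
```

This makes the misspecification concrete: on its own, such a policy can only encode a tendency to drift in one direction, which is why a single flat policy of this form cannot solve Puddle World or the Dungeon domain.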
7.1. Puddle World

Puddle World is a continuous 2-dimensional world containing two puddles, as shown in Figure 3a. A successful agent (red ball) should navigate to the goal location (blue square) while avoiding the puddles. The state space is the ⟨x, y⟩ location of the agent.



Figure 4. (a) The Minecraft Dungeon domain. (b) (i) The agent learns a PT to navigate to the exit of the barrier and (ii) a PT to reach the goal from the exit. (c) The average reward from executing IHOMP compared to the flat policy of the originally misspecified problem.

Initially, the agent is provided with a misspecified problem: a flat policy that can only move in a single direction (and thus cannot avoid the puddles). Figure 3b compares this flat policy with IHOMP using a 2 × 2 grid partition (four PTs). The flat policy achieves low average reward. However, IHOMP turns the flat policy into PTs and hierarchically stitches these PTs together, resulting in a richer solution space and a higher average reward, as seen in Figure 3b. This is comparable to the approximately optimal average reward attained by executing Approximate Value Iteration (AVI) for a huge number of iterations. In this experiment, IHOMP is not initiated in the partition class containing the goal state but still achieves near-optimal convergence after only 2 iterations. Figure 3c compares the performance of different partitions, where a 1 × 1 grid represents the flat policy of the initially misspecified problem. The PT learning error η_P is significantly smaller for all partitions finer than 1 × 1, resulting in lower cost. On the other hand, according to Theorem 1, adding more PTs m increases the cost; a trade-off therefore exists between η_P and m. In practice, η_P tends to dominate m. In addition to the trade-off, the importance of the partition design is evident when comparing the cost of the 3 × 3 and 4 × 4 grids: the 3 × 3 partition design is better suited to Puddle World than the 4 × 4 partition, resulting in lower cost.

7.2. Minecraft

Minecraft is a huge open-ended environment; we cannot hope to solve Minecraft using only a flat policy and therefore need PTs. For example, we tested IHOMP on the Dungeon domain, a sub-domain which we built within Minecraft using the BurlapCraft framework (https://github.com/h2r/burlapcraft). In this task, the agent needs to navigate from a pre-defined start location to the golden blocks, where it receives a large reward, as shown in Figure 4a. The state space is the agent's location, and the agent has four actions, moving in each of the four cardinal directions. The agent receives negative rewards while searching for the golden blocks and a large positive reward upon reaching them. Using a flat policy (which can only move the agent in a straight line), the agent faces a misspecified problem, as it cannot get past the barrier separating it from its goal. However, IHOMP with a 2 × 1 grid partition learns two near-optimal PTs and a stitching policy that enable the agent to solve the Dungeon domain and reach the golden blocks.


Figure 5. (a) The pinball-world domain. (b) The average reward of the IHOMP stitching policy in the pinball-world domain, compared with the flat policy and an approximately optimal policy. In this domain, IHOMP converges after a single iteration, as we start IHOMP in the partition containing the goal. (c) The learned value function over the agent's ⟨x, y⟩ position.

To accelerate learning of the PTs, a simulator was developed to mimic the Dungeon domain, and the PT parameters learned in the simulator were used to initialize the Minecraft agent's stitching policy. The Minecraft agent learned two intuitive PTs. The first PT enables the agent to navigate to the exit of the barrier, as shown in Figure 4b(i). The second PT navigates the agent from the exit to the goal (golden blocks), as seen in Figure 4b(ii). The resulting performance is shown in Figure 4c: the agent is able to solve the misspecified problem using IHOMP with only two simple policy threads, achieving a near-optimal stitching policy.

7.3. Pinball

We also tested IHOMP on the Pinball domain (Konidaris & Barto, 2009). The goal in Pinball is to direct an agent (small blue ball) to a goal location (large red region). The Pinball domain provides a sterner test for IHOMP, as the agent needs to circumnavigate uneven obstacles; in addition, collisions with obstacles are non-linear. The state space is the four-tuple ⟨x, y, ẋ, ẏ⟩, where x, y is the 2D location of the agent and ẋ, ẏ are the velocities in each direction. We tested IHOMP on the challenging pinball-world task (Figure 5a) (Konidaris & Barto, 2009). The agent is initially provided with a 5-feature flat policy ⟨1, x, y, ẋ, ẏ⟩. This results in a misspecified problem, as the agent is unable to solve the task using this limited representation, as shown by the average reward in Figure 5b. Using IHOMP with a 4 × 3 × 1 × 1 grid, 12 PTs were learned. IHOMP clearly outperforms the flat policy, as shown in Figure 5b. It is less than optimal but still performs the task adequately (see the value function in Figure 5c). The drop in performance is due to the complicated obstacle setup, the non-linear dynamics and the partition design. Nevertheless, this shows that IHOMP can produce a reasonable solution with a limited representation.

7.4. Learning to Improve the Partition

Providing a ‘good’ PT partitioning a priori is a strong assumption. It may be non-trivial to design the partitioning, especially in continuous, high-dimensional domains, and a sub-optimal partitioning may still mitigate misspecification but will not result in a near-optimal solution.

Figure 6. (a) The Two-Room domain. (b) The Two-Room domain with the sub-optimal partitioning after running IHOMP. (c) The partitioning learned by IHOMP-ROI. (d) The average reward from executing IHOMP-ROI compared to IHOMP without ROI and the flat policy.

To relax this assumption, we have incorporated Regularized Option Interruption into the IHOMP algorithm to produce the IHOMP-ROI algorithm. This algorithm learns the PTs and improves the partition, effectively determining where each PT should be executed in the state space. We tested IHOMP-ROI on the Two-Room domain shown in Figure 6a. The agent (red ball) needs to navigate to the goal region (blue square). The policy parameterization is limited to a distribution over actions (moving in a single direction). This limited representation results in an MP, as the agent is unable to traverse between the two rooms. If we use IHOMP with a sub-optimal partitioning containing two PTs, shown by the red and green cells in Figure 6b, the problem is still misspecified: the agent leaves the red cell and immediately gets trapped behind the wall whilst in the green cell. Using IHOMP-ROI, as shown in Figure 6c, the agent learns both the PTs and a partition such that it can navigate to the goal. The green region in the bottom left corner arises from function approximation error but does not prevent the agent from reaching the goal.

Reusable PTs: Looking carefully at Figure 6c, one notices something unexpected: the partitioning learned for the Two-Room domain includes executing the red PT on the right-hand side of the wall. This is in fact intuitive given the parameterizations learned for each of the PTs. The red PT has a dominant right action, whereas the green PT has a dominant upward action. When the agent is near the wall on the side of the goal, it actually makes more sense to execute the red PT so that the agent moves to the right and reaches the goal. Thus, IHOMP-ROI provides an effective way not only to learn an improved partition, but also to discover reusable PTs.

8. Discussion

We argued that, given a limited policy representation (due to system constraints, badly designed or limited features, regularization, etc.), problems may be misspecified. We introduced IHOMP, an RL planning algorithm that iteratively learns and combines policy threads, a type of parameterized option, in a hierarchical fashion to repair an MP. Using IHOMP, MPs can be repaired and solved without modifying the limited policy representation, by making use of multiple policy threads. We provide theoretical results for IHOMP that directly relate the quality of the final stitching policy to the misspecification error. IHOMP is the first framework that enjoys theoretical convergence guarantees while iteratively learning a set of PTs in a continuous state space. In addition, we developed IHOMP-ROI, which makes use of regularized option interruption (Sutton et al., 1999; Mankowitz et al., 2014) to learn an improved partition for an initially misspecified problem. IHOMP-ROI is also able to discover regions of the state space where PTs should be reused, making a case for automatically discovering reusable policies.

One limitation of IHOMP is that it learns PTs for all partition classes. This may be a problem in high-dimensional state-spaces. However, the problem can be overcome by focusing only on the most important regions of the state-space. One way to identify these regions is by observing an expert's demonstrations (Abbeel & Ng, 2005; Argall et al., 2009). In addition, we could apply self-organizing approaches to facilitate PT reuse (Moerman, 2009). PT reuse can be especially useful for transfer learning: consider a multi-agent environment (Garant et al., 2015) where many of the agents perform similar tasks requiring a similar PT set. PT reuse can also facilitate learning complex multi-agent policies (co-learning) with few samples.

References

Abbeel, Pieter and Ng, Andrew Y. Exploration and apprenticeship learning in reinforcement learning. In Proceedings of the 22nd International Conference on Machine Learning, 2005.
Abiri-Jahromi, Amir, Parvania, Masood, Bouffard, Francois, and Fotuhi-Firuzabad, Mahmud. A two-stage framework for power transformer asset maintenance management - Part I: Models and formulations. IEEE Transactions on Power Systems, 28(2):1395–1403, 2013.
Argall, Brenna D, Chernova, Sonia, Veloso, Manuela, and Browning, Brett. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.
Bhatnagar, Shalabh, Sutton, Richard S, Ghavamzadeh, Mohammad, and Lee, Mark. Natural actor-critic algorithms. Automatica, 45(11):2471–2482, 2009.
Boyan, Justin A. Technical update: Least-squares temporal difference learning. Machine Learning, 49(2-3):233–246, 2002.
Brunskill, Emma and Li, Lihong. PAC-inspired option discovery in lifelong reinforcement learning. JMLR, 1:316–324, 2014.
Garant, Daniel, da Silva, Bruno C., Lesser, Victor, and Zhang, Chongjie. Accelerating multi-agent reinforcement learning with dynamic co-learning. Technical report, 2015.
Geramifard, A, Tellex, S, Wingate, D, Roy, N, and How, JP. A Bayesian approach to finding compact representations for reinforcement learning. In European Workshops on Reinforcement Learning (EWRL), 2012.
Hauskrecht, Milos, Meuleau, Nicolas, Kaelbling, Leslie Pack, Dean, Thomas, and Boutilier, Craig. Hierarchical solution of Markov decision processes using macro-actions. In Proceedings of the 14th Conference on Uncertainty in AI, pp. 220–229, 1998.
He, Ruijie, Brunskill, Emma, and Roy, Nicholas. Efficient planning under uncertainty with macro-actions. Journal of Artificial Intelligence Research, 40:523–570, 2011.
Konidaris, G.D., Osentoski, S., and Thomas, P.S. Value function approximation in reinforcement learning using the Fourier basis. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, pp. 380–385, 2011.
Konidaris, George and Barto, Andrew G. Skill discovery in continuous reinforcement learning domains using skill chaining. In NIPS 22, pp. 1015–1023, 2009.
Levi, Kobi and Weiss, Yair. Learning object detection from a small number of examples: the importance of good features. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pp. II-53. IEEE, 2004.
Mankowitz, Daniel J, Mann, Timothy A, and Mannor, Shie. Time regularized interrupting options. ICML, 2014.
Mann, Timothy A and Mannor, Shie. Scaling up approximate value iteration with options: Better policies with fewer iterations. In Proceedings of the 31st ICML, 2014.
McGovern, Amy and Barto, Andrew G. Automatic discovery of subgoals in reinforcement learning using diverse density. In Proceedings of the 18th ICML, pp. 361–368, 2001.
Moerman, Wilco. Hierarchical reinforcement learning: Assignment of behaviours to subpolicies by self-organization. PhD thesis, Cognitive Artificial Intelligence, Utrecht University, 2009.
Roy, N and How, JP. A tutorial on linear function approximators for dynamic programming and reinforcement learning. 2013.
Singh, Satinder P, Jaakkola, Tommi, and Jordan, Michael I. Reinforcement learning with soft state aggregation. In Advances in Neural Information Processing Systems, pp. 361–368, 1995.
Smart, William D and Kaelbling, Leslie Pack. Effective reinforcement learning for mobile robots. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), volume 4, pp. 3404–3410. IEEE, 2002.
Sorg, Jonathan and Singh, Satinder. Linear options. In Proceedings of the 9th AAMAS, pp. 31–38, 2010.
Sutton, Richard. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems, pp. 1038–1044, 1996.
Sutton, Richard and Barto, Andrew. Reinforcement Learning: An Introduction. MIT Press, 1998.
Sutton, Richard S, Precup, Doina, and Singh, Satinder. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1):181–211, August 1999.
Wu, Lei, Shahidehpour, Mohammad, and Fu, Yong. Security-constrained generation and transmission outage scheduling with uncertainties. IEEE Transactions on Power Systems, 25(3):1674–1685, 2010.
Zhou, Huiyu, Yuan, Yuan, and Shi, Chunmei. Object tracking using SIFT features and mean shift. Computer Vision and Image Understanding, 113(3):345–352, 2009.