An Information-Theoretic Optimality Principle for Deep Reinforcement Learning

Felix Leibfried and Jordi Grau-Moya and Haitham B. Ammar

arXiv:1708.01867v1 [cs.AI] 6 Aug 2017

PROWLER.io, Cambridge, UK {felix,jordi,haitham}@prowler.io

Abstract

In this paper, we address the problem of cumulative reward overestimation in deep reinforcement learning methodologically. We generalise notions from information-theoretic bounded rationality to handle high-dimensional state spaces efficiently. The resultant algorithm encompasses a wide range of learning outcomes that can be demonstrated by tuning a Lagrange multiplier that intrinsically penalises rewards. We show that deep Q-networks fall naturally as a special case from our proposed approach. We further contribute by introducing a novel scheduling scheme for bounded-rational behaviour that ensures sample efficiency and robustness. In experiments on Atari games, we show that our algorithm outperforms various deep reinforcement learning algorithms (e.g., deep and double deep Q-networks) in terms of both gameplay performance and sample complexity.

Introduction

Reinforcement learning (RL) (Sutton and Barto 1998) is a discipline of artificial intelligence that deals with identifying optimal behaviour by maximising cumulative reward. A popular RL algorithm is Q-learning (Watkins 1989), which operates by estimating expected cumulative rewards (i.e. Q-values) as the learner interacts with the environment. Although successful in numerous applications (Bertsekas 2007; Busoniu et al. 2010), standard Q-learning suffers from two drawbacks. First, due to its tabular representation of Q-values, the method is not readily applicable to high-dimensional environments that exhibit large state and/or action spaces. Second, it initially overestimates cumulative reward values (due to the usage of the maximum operator), introducing a bias at early stages of training. This bias has to be unlearnt in the course of further training, leading to decreased sample efficiency. To address the first problem, Q-learning has been extended to high-dimensional environments by using parametric function approximators instead of Q-tables (Busoniu et al. 2010). One particularly appealing class of approximators are deep neural networks, which enable learning complex relationships between high-dimensional input states (e.g. images) and low-level actions autonomously. Building on this idea, the authors of (Mnih et al. 2015) proposed deep Q-networks (DQNs) and showed that DQNs attain state-of-the-art results in large-scale domains, e.g. the Arcade Learning Environment for Atari games (Bellemare et al. 2013). Deep Q-networks, however, fail to address the (second) overestimation problem and, as such, also suffer from being sample-inefficient (see (van Hasselt, Guez, and Silver 2016)).

One way of addressing cumulative reward overestimation is to introduce an intrinsic penalty signal to instantaneous rewards. This is formalised within the context of information-theoretic bounded rationality (Ortega and Braun 2013; Genewein et al. 2015; Ortega et al. 2015), where such a penalty is expressed in terms of a Kullback-Leibler divergence between current and prior policies. Though successful in the tabular setting (e.g., (Azar, Gomez, and Kappen 2012; Rawlik, Toussaint, and Vijayakumar 2012; Fox, Pakman, and Tishby 2016)), none of these methods is readily applicable to high-dimensional environments that require function approximators. Since we are interested in improving the sample complexity of RL in high-dimensional state spaces, we contribute by generalising information-theoretic bounded rationality to phrase a novel optimisation objective for learning Q-values with deep parametric approximators. The resultant algorithm encompasses a wide range of learning outcomes that can be demonstrated by tuning a Lagrange multiplier. Interestingly, we show that DQN falls out naturally as a special case of our proposed approach. We further contribute by introducing a novel scheduling scheme for bounded-rational behaviour based on the temporal evolution of the Bellman error. This ultimately allows us to outperform DQN, as well as double DQN (van Hasselt, Guez, and Silver 2016), by large margins in the Atari domain in terms of both performance and sample complexity. Finally, we show a further performance increase by adopting the dueling architecture presented in (Wang et al. 2016). In short, the contributions of this paper can be summarised as:

1. generalising information-theoretic bounded rationality to handle high-dimensional state-spaces with function approximators;


2. proposing a novel information-theoretically inspired optimisation objective for deep RL;

3. demonstrating bounded-rational solutions for deep RL, including DQNs as a special case; and

4. outperforming DQN and double DQN in the Atari domain.

Reinforcement Learning

In RL, an agent, being in a state s ∈ S, chooses an action a ∈ A by exploiting a behavioural policy π_behave : S × A → [0, 1]. Resulting from this choice is a transition to a successor state s′ ∼ P(s′|s, a), where P : S × A × S → [0, 1] is the unknown state transition model, and a reward r = R(s, a) that quantifies instantaneous performance. After subsequent interactions with the environment, the goal of the agent is to find an optimal policy $\pi^{\star}_{\text{behave}}$ that maximises the total expected return $\mathbb{E}_{\pi_{\text{behave}}, P}\left[\sum_{t=1}^{\infty} \gamma^{t} r_{t}\right]$, with $t$ denoting time and $\gamma \in (0, 1]$ the discount factor.

Clearly, to learn an optimal behavioural policy, the agent has to reason about the long-term consequences of instantaneous actions. Q-learning, a famous RL algorithm, estimates these effects using state-action values (Q-values) to quantify the performance of the policy. In Q-learning, updates are conducted online after each interaction with the environment using

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right), \quad (1)$$

with α > 0 being a learning rate. Intuitively, Equation 1 takes an old value estimate Q(s, a) and corrects it based on new information, i.e. the target r + γ max_{a′} Q(s′, a′).
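As a concrete illustration of Equation 1, the following is a minimal tabular sketch; the table sizes and the sampled transition are hypothetical and not taken from the paper.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning update as in Equation 1 (illustrative sketch)."""
    td_target = r + gamma * np.max(Q[s_next])     # r + gamma * max_a' Q(s', a')
    Q[s, a] += alpha * (td_target - Q[s, a])      # move the old estimate towards the target
    return Q

# Hypothetical usage with a small state/action space.
num_states, num_actions = 10, 4
Q = np.zeros((num_states, num_actions))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=3)
```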

Optimistic Overestimation: Upon careful investigation of Equation 1, one comes to recognise that Q-learning updates introduce a bias into the learning process caused by an overestimation of the optimal cumulative rewards (van Hasselt 2010; Azar et al. 2011; Lee and Powell 2012; Bellemare et al. 2016; Fox, Pakman, and Tishby 2016). Namely, the usage of the maximum operator assumes that current guesses for Q-values reflect optimal cumulative rewards. Of course, this assumption is violated, especially at the beginning of the learning process, where only a relatively small number of updates has been performed. Due to the correlative effect of "bad" estimations between different state-action pairs, these mistakes tend to propagate rapidly through the Q-table and have to be unlearnt in the course of further training. Though such an optimistic bias is eventually unlearnt, the convergence speed (in terms of environmental interactions, i.e. sample complexity) of Q-learning is highly dependent on the quality of the initial Q-values.

The problem of optimistic overestimation only worsens in high-dimensional state spaces, such as in the case of visual inputs in the Arcade Learning Environment. To handle high-dimensional representations, tabular Q-learning is first generalised to use parametric function approximators, e.g. deep neural networks (Mnih et al. 2015), and the weights are then fitted by stochastic gradients to minimise the following:

$$\mathbb{E}_{s, a, r, s'}\left[\left( r + \gamma \max_{a'} Q_{\theta^{-}}(s', a') - Q_{\theta}(s, a) \right)^{2}\right]. \quad (2)$$
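Before detailing Equation 2, the maximum-operator bias itself can be illustrated numerically (the values below are purely illustrative): even if every individual Q-estimate is an unbiased but noisy guess of the true value, the maximum over those guesses is biased upwards.

```python
import numpy as np

rng = np.random.default_rng(0)
true_q = np.zeros(4)  # suppose all four actions have true value 0
# 10,000 independent noisy estimates of the four Q-values (unbiased, unit-variance noise).
noisy_q = true_q + rng.normal(scale=1.0, size=(10_000, 4))

print(noisy_q.mean())               # ~0.0: each individual estimate is unbiased
print(noisy_q.max(axis=1).mean())   # ~1.0: the max over the estimates overestimates the true max (0)
```

This upward bias is precisely what the intrinsic penalty introduced later is meant to counteract.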

Here, the expectation E refers to samples drawn from a replay memory storing state transitions (Lin 1993), and Q_{θ⁻}(s′, a′) denotes a Q-network at an earlier stage of training. Intuitively, the minimisation objective in Equation 2 resembles the one used in the tabular setting: old value estimates are updated based on new information, while introducing the maximum-operator bias. Although deep Q-networks generalise well over a wide range of input states, they are agnostic to the aforementioned overestimation problem (Thrun and Schwartz 1993). Compared with the tabular setting, however, the problem is even more severe due to the lack of any convergence guarantees to optimal Q-values when using parametric approximators, and the impossibility of exploring the whole state-action space. Hence, the number of environmental interactions needed to unlearn the optimistic bias can become prohibitively large.
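A minimal NumPy sketch of the minibatch computation behind Equation 2 follows; the array shapes and variable names are assumptions made for illustration, and the gradient step itself is omitted.

```python
import numpy as np

def dqn_targets(rewards, q_next_target, gamma=0.99):
    """Bootstrapped targets r + gamma * max_a' Q_{theta^-}(s', a') for a minibatch.

    rewards:       shape (batch,)
    q_next_target: shape (batch, num_actions), target-network Q-values at the successor states.
    """
    return rewards + gamma * q_next_target.max(axis=1)

def dqn_loss(q_pred, targets):
    """Squared error of Equation 2; in training, gradients flow only through q_pred."""
    return float(np.mean((targets - q_pred) ** 2))

# Hypothetical minibatch of size 3 with 2 actions.
rewards = np.array([1.0, 0.0, -1.0])
q_next_target = np.array([[0.5, 0.2], [1.0, 1.5], [0.0, 0.1]])
q_pred = np.array([1.4, 1.2, -0.8])
print(dqn_loss(q_pred, dqn_targets(rewards, q_next_target)))
```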

Addressing Overestimation Methodologically

A potential solution to address optimistic overestimation in Q-learning is to "inject" randomness into the system by adding an intrinsic penalty to instantaneous rewards, thus reducing Q-value estimates. Methodological principles for introducing such randomness can be adapted from the field of information-theoretic bounded rationality (Ortega and Braun 2013; Genewein et al. 2015; Ortega et al. 2015), where an agent aims at finding an optimal behavioural policy that maximises a notion of utility (or reward) U(s, a). Contrary to other settings (Mnih et al. 2015), bounded rationality is induced in the system by bounding the deviation of the behavioural policy with respect to a prior policy π_prior(a|s). This can be achieved by constraining the Kullback-Leibler (KL) divergence of the two policies, KL(π_behave‖π_prior), by a positive constant K. Therefore, for a one-step decision-making problem, we can write:

$$\max_{\pi_{\text{behave}}} \sum_{a \in \mathcal{A}} \pi_{\text{behave}}(a|s)\, U(s, a) \quad \text{s.t.} \quad \mathrm{KL}\left(\pi_{\text{behave}} \,\|\, \pi_{\text{prior}}\right) \leq K,$$

$$\text{with} \quad \mathrm{KL}\left(\pi_{\text{behave}} \,\|\, \pi_{\text{prior}}\right) = \sum_{a \in \mathcal{A}} \pi_{\text{behave}}(a|s) \log \frac{\pi_{\text{behave}}(a|s)}{\pi_{\text{prior}}(a|s)}.$$
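For concreteness, the constraint can be evaluated directly for discrete action distributions; the following is an illustrative sketch, and the distributions used are made up.

```python
import numpy as np

def kl_divergence(pi_behave, pi_prior, eps=1e-12):
    """KL(pi_behave || pi_prior) over the actions available in a single state."""
    pi_behave = np.asarray(pi_behave, dtype=float)
    pi_prior = np.asarray(pi_prior, dtype=float)
    return float(np.sum(pi_behave * (np.log(pi_behave + eps) - np.log(pi_prior + eps))))

# A behavioural policy close to a uniform prior incurs a small penalty ...
print(kl_divergence([0.30, 0.30, 0.40], [1/3, 1/3, 1/3]))  # ~0.01 nats
# ... whereas a nearly deterministic one incurs a large penalty.
print(kl_divergence([0.98, 0.01, 0.01], [1/3, 1/3, 1/3]))  # ~0.99 nats
```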

Before solving the above constrained optimisation problem, an unconstrained objective is obtained by introducing a Lagrange multiplier λ > 0:

$$\mathcal{L}(s, \pi_{\text{prior}}, \lambda) = \max_{\pi_{\text{behave}}} \sum_{a \in \mathcal{A}} \pi_{\text{behave}}(a|s)\, U(s, a) - \frac{1}{\lambda} \mathrm{KL}\left(\pi_{\text{behave}} \,\|\, \pi_{\text{prior}}\right), \quad (3)$$

where λ represents a trade-off between utility and closeness to prior information. Since the unconstrained problem is concave with respect to the optimisation argument, one can express the optimal solution in closed form as:

$$\pi^{\star}_{\text{behave}}(a|s) = \frac{\pi_{\text{prior}}(a|s) \exp\left(\lambda U(s, a)\right)}{\sum_{a'} \pi_{\text{prior}}(a'|s) \exp\left(\lambda U(s, a')\right)}. \quad (4)$$
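Equation 4 is a prior-weighted softmax over utilities; a minimal sketch with illustrative utilities and prior follows.

```python
import numpy as np

def bounded_rational_policy(utilities, pi_prior, lam):
    """Closed-form optimum of Equation 4: pi*(a|s) proportional to pi_prior(a|s) * exp(lam * U(s, a))."""
    logits = np.log(pi_prior) + lam * np.asarray(utilities, dtype=float)
    logits -= logits.max()          # shift for numerical stability; does not change the result
    weights = np.exp(logits)
    return weights / weights.sum()

U = np.array([1.0, 2.0, 0.5])
prior = np.ones(3) / 3
print(bounded_rational_policy(U, prior, lam=0.1))   # close to the uniform prior
print(bounded_rational_policy(U, prior, lam=50.0))  # nearly greedy on the highest-utility action
```

The multiplier interpolates between following the prior (small values) and pure utility maximisation (large values), which mirrors the limits discussed below.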

We are not the first to propose such information-theoretic principles within the context of reinforcement learning. In fact, similar principles have received increased attention within policy search and the identification of optimal cumulative reward values in the recent past.

In policy search, information-theoretic principles can be categorised into three classes depending on the choice of the prior policy π_prior(a|s). The first class assumes a prior that remains fixed during the learning process; entropy regularisation under a uniform prior (Williams and Peng 1991; Mnih et al. 2016), for example, is a special case within this class. The second assumes a marginal prior policy that is optimal from an information-theoretic perspective. Such a marginalisation is obtained by averaging the parametrised behavioural policy over all environmental states. The intuition here is to encourage the agent to neglect information in the environment that has little to no impact on utility or reward, see (Leibfried and Braun 2015; Leibfried and Braun 2016; Peng et al. 2017). Finally, the third class considers an adaptive prior (e.g. a policy learnt at an earlier stage) to ensure incremental improvement steps as learning proceeds (Bagnell and Schneider 2003; Peters and Schaal 2008; Peters, Mulling, and Altun 2010; Schulman et al. 2015).

When it comes to the identification of optimal cumulative reward values, information-theoretic principles can be divided into two categories. The first considers a restricted class of Markov decision processes (MDPs), where instantaneous rewards are a sum of two terms: a state-dependent reward signal and a term penalising deviations (expressed using a KL measure) from uncontrolled environmental dynamics. Here, optimal cumulative reward values can be computed efficiently as outlined in (Todorov 2009; Kappen, Gomez, and Opper 2012). The second category comprises discrete MDPs with intrinsic signals penalising deviations of optimal behaviour from prior policies. Optimal cumulative reward values are either computed with generalised value iteration schemes (Tishby and Polani 2011; Rubin, Shamir, and Tishby 2012; Grau-Moya et al. 2016), or in an online fashion (Azar, Gomez, and Kappen 2012; Rawlik, Toussaint, and Vijayakumar 2012; Fox, Pakman, and Tishby 2016) similar to Q-learning.

Closest to our work are the recent approaches in (Haarnoja et al. 2017; Schulman, Abbeel, and Chen 2017). It is worth mentioning that, apart from the discrete and high-dimensional action space setting, we tackle two additional problems not addressed previously. First, we consider a dynamic adaptation for trading off rewards versus intrinsic penalties, as opposed to the static scheme presented in (Haarnoja et al. 2017; Schulman, Abbeel, and Chen 2017). Second, we deploy a robust computational approach that incorporates value-based advantages to ensure bounded exponentiation terms. Our approach also fits into the line of work on how utilising entropy in reinforcement learning connects policy search to the identification of optimal cumulative reward values (Haarnoja et al. 2017; Nachum et al. 2017; O’Donoghue et al. 2017; Schulman, Abbeel, and Chen 2017). The focus here, however, is on deep value-based approaches.

Due to the intrinsic penalty signal, information-theoretic Q-learning algorithms provide a principled way of reducing Q-value estimates and are hence suited for addressing the overestimation problem outlined earlier. Although successful in the tabular setting, these algorithms are not readily applicable to high-dimensional environments, as detailed next.

Bounded Rationality for Deep RL

We aim to reduce optimistic overestimation in deep RL methodologically by leveraging ideas from information-theoretic bounded rationality. Since Q-value overestimations are a source of sample inefficiency, we thereby improve large-scale reinforcement learning, where current techniques are known to exhibit high sample complexity (Mnih et al. 2015). To do so, we introduce an intrinsic penalty signal in line with the methodology put forward in the previous section. Before commencing, however, it is instructive to gather more insight into the range of admissible learners obtained while tuning such a penalty. Plugging the optimal behavioural policy π⋆_behave from Equation (4) back into Equation (3) yields:

$$\mathcal{L}^{\star}(s, \pi_{\text{prior}}, \lambda) = \frac{1}{\lambda} \log \sum_{a \in \mathcal{A}} \pi_{\text{prior}}(a|s) \exp\left(\lambda U(s, a)\right).$$

The Lagrange multiplier λ steers the magnitude of the penalty signal and thus leads to different learning outcomes. If λ is large, little penalisation from the prior policy is introduced. As such, one would expect a learning outcome that mostly considers maximising utility. This is confirmed in the limit as λ → ∞, where:

$$\lim_{\lambda \to \infty} \mathcal{L}^{\star}(s, \pi_{\text{prior}}, \lambda) = \max_{a \in \mathcal{A}} U(s, a).$$

On the other hand, for small λ values, the deviation penalty is significant and hence the prior policy should dominate. This is again confirmed when λ → 0, where we recover the expected utility under π_prior:

$$\lim_{\lambda \to 0} \mathcal{L}^{\star}(s, \pi_{\text{prior}}, \lambda) = \sum_{a \in \mathcal{A}} \pi_{\text{prior}}(a|s)\, U(s, a) = \mathbb{E}_{\pi_{\text{prior}}}\left[U(s, a)\right].$$

Carrying this idea to deep RL, we notice that incorporating a penalty signal in the context of large-scale Q-learning with parameterised function approximators leads to:

$$\mathcal{L}^{\star}_{\theta}(s, \pi_{\text{prior}}, \lambda) = \frac{1}{\lambda} \log \sum_{a \in \mathcal{A}} \pi_{\text{prior}}(a|s) \exp\left(\lambda Q_{\theta}(s, a)\right),$$

where Q_θ(s, a) represents a deep Q-network. We then use this operator to phrase an information-theoretic optimisation objective for deep Q-learning:

$$J_{\lambda}(\theta) = \mathbb{E}_{s, a, r, s'}\left[\left( r + \gamma \mathcal{L}^{\star}_{\theta^{-}}(s', \pi_{\text{prior}}, \lambda) - Q_{\theta}(s, a) \right)^{2}\right], \quad (5)$$

where E_{s,a,r,s′} refers to samples drawn from a replay memory in each iteration of training, and θ⁻ to the parameter values at an earlier stage of learning.
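The operator and the regression target of Equation 5 can be sketched in a few lines; the Q-values, prior and λ values below are illustrative only.

```python
import numpy as np

def free_energy(q_values, pi_prior, lam):
    """L*_theta(s, pi_prior, lam) = (1/lam) * log sum_a pi_prior(a|s) * exp(lam * Q_theta(s, a))."""
    return np.log(np.sum(pi_prior * np.exp(lam * q_values))) / lam

def din_target(reward, q_next_target, pi_prior, lam, gamma=0.99):
    """Regression target r + gamma * L*_{theta^-}(s', pi_prior, lam) from Equation 5."""
    return reward + gamma * free_energy(q_next_target, pi_prior, lam)

# Limiting behaviour for one successor state (uniform prior).
q_next = np.array([1.0, 2.0, 0.5])
prior = np.ones(3) / 3
print(free_energy(q_next, prior, lam=50.0))          # ~2.0  -> max operator (DQN limit)
print(free_energy(q_next, prior, lam=1e-3))          # ~1.17 -> expectation under the prior
print(din_target(1.0, q_next, prior, lam=1.0))       # intermediate target for lam = 1
```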

The above objective leads to a wide variety of learners and can be considered a generalisation of current methods, including deep Q-networks (Mnih et al. 2015). Namely, if λ → ∞, we recover the approach in (Mnih et al. 2015) that poses the problem of optimistic overestimation:

$$J_{\lambda \to \infty}(\theta) = \mathbb{E}_{s, a, r, s'}\left[\left( r + \gamma \max_{a' \in \mathcal{A}} Q_{\theta^{-}}(s', a') - Q_{\theta}(s, a) \right)^{2}\right].$$

On the contrary, if λ → 0, we obtain the following objective:

$$J_{\lambda \to 0}(\theta) = \mathbb{E}_{s, a, r, s'}\left[\left( r + \gamma \sum_{a' \in \mathcal{A}} \pi_{\text{prior}}(a'|s')\, Q_{\theta^{-}}(s', a') - Q_{\theta}(s, a) \right)^{2}\right]. \quad (6)$$

Effectively, Equation 6 estimates future cumulative rewards using the prior policy, as can be seen in the term $\sum_{a' \in \mathcal{A}} \pi_{\text{prior}}(a'|s')\, Q_{\theta^{-}}(s', a')$. From the above two special cases, we recognise that our formulation allows for a variety of learners, where the hyperparameter λ steers outcomes between these two limiting cases. Notice, however, that setting low values for λ instead introduces a pessimistic bias, as detailed for the tabular setting in (Fox, Pakman, and Tishby 2016).

Dynamic & Robust Deep RL Choosing a fixed hyperparameter λ through the course of training is undesirable as the effect of the intrinsic penalty signal remains unchanged. Since overestimations are more severe at the start of the learning process, a dynamic scheduling scheme for λ with small values at the start of training and larger values towards the end is more desirable. Adaptive λ: A suitable candidate for dynamically adapting λ in the course of training is the average squared loss (over replay memory samples) between target values t = r + γL⋆θ − (s, πprior , λ) and predicted values p = Qθ (s, a): Jsquared (t, p) = (t − p)2 The rationale, here, is that λ should be inversely proportional to the average squared loss. If Jsquared (t, p) is high on average, as is the case during early episodes of training, low values of λ are favoured. If Jsquared (t, p) is low on average during further course of training, however, high λ values are more suitable for the learning process. We therefore propose to adapt λ with a running average over the loss between targets and predictions. The running average Javg should emphasize recent history as opposed to samples that lie further in the past since the parameters θ of the Q-value approximator change over time. This is achieved with an exponential window and the online update rule:   1 1 Javg ← 1 − Javg + Et,p [Jsquared (t, p)] , (7) τ τ

with τ being a time constant referring to the exponential window size of the running average, and Et,p [Jsquared (t, p)] is a shorthand notation for Equation (5). Using this running average allows to dynamically assign λ = J1avg at each iteration of the training procedure. The squared loss Jsquared (t, p) has an impeding impact on the stability in deep Q-learning, where the parametric approximator is a deep neural network and parameters are updated with gradients and backpropagation (Werbos 1982). To prevent loss values from growing too large, the squared loss is replaced with an absolute loss if |t − p| > 1 (Mnih et al. 2015). In this work, we follow a similar approach and use the Huber loss JHuber (t, p) (Huber 1964) instead:  1 2 if |t − p| < 1, 2 (t − p) JHuber (t, p) = |t − p| − 21 otherwise. The Huber loss has a positive side effect leading to a more robust adaptation of λ as it uses an absolute loss for large error values instead of a squared one. Furthermore, the squared loss is more sensitive to outliers and might penalise the learning process in an unreasonable fashion in the presence of sparse but large error values. Robust Free Energy Values: The dynamic adaptation of λ encourages to learn unbiased estimates of the optimal cumulative reward values. Presupposing Qθ (s, a) is bounded, L⋆θ (s, πprior , λ) is also bounded in the limits of λ: Eπprior [Qθ (s, a)] ≤ L⋆θ (s, πprior , λ) ≤ max Qθ (s, a). a∈A
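A compact sketch of the adaptive schedule and the Huber loss described above; the initial value of the running average and the example target/prediction batches are assumptions made for illustration.

```python
import numpy as np

def huber_loss(t, p):
    """Huber loss with threshold 1: quadratic for small errors, linear otherwise."""
    err = np.abs(t - p)
    return np.where(err < 1.0, 0.5 * err ** 2, err - 0.5)

class AdaptiveLambda:
    """Exponential running average of the loss (Equation 7) with lambda = 1 / J_avg."""

    def __init__(self, tau=1e5, j_init=1.0):
        self.tau = tau
        self.j_avg = j_init  # the initialisation of J_avg is an assumption

    def update(self, batch_loss):
        self.j_avg = (1.0 - 1.0 / self.tau) * self.j_avg + batch_loss / self.tau
        return 1.0 / self.j_avg  # current value of lambda

# Hypothetical usage: large early losses keep lambda small; shrinking losses let lambda grow.
schedule = AdaptiveLambda(tau=100)
for targets, preds in [(np.ones(32) * 5.0, np.zeros(32)), (np.ones(32), np.ones(32) * 0.9)]:
    lam = schedule.update(float(np.mean(huber_loss(targets, preds))))
print(lam)
```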

Robust Free Energy Values: The dynamic adaptation of λ encourages learning unbiased estimates of the optimal cumulative reward values. Presupposing that Q_θ(s, a) is bounded, L⋆_θ(s, π_prior, λ) is also bounded in the limits of λ:

$$\mathbb{E}_{\pi_{\text{prior}}}\left[Q_{\theta}(s, a)\right] \;\leq\; \mathcal{L}^{\star}_{\theta}(s, \pi_{\text{prior}}, \lambda) \;\leq\; \max_{a \in \mathcal{A}} Q_{\theta}(s, a).$$

In practice, however, this operator is prone to computational instability for large λ due to the exponential term exp(λQ_θ(s, a)). We address this problem by multiplying with the term exp(λV_θ(s)) / exp(λV_θ(s)), where V_θ(s) = max_a Q_θ(s, a), yielding:

$$\mathcal{L}^{\star}_{\theta}(s, \pi_{\text{prior}}, \lambda) = \frac{1}{\lambda} \log \left( \exp\left(\lambda V_{\theta}(s)\right) \sum_{a \in \mathcal{A}} \pi_{\text{prior}}(a|s) \frac{\exp\left(\lambda Q_{\theta}(s, a)\right)}{\exp\left(\lambda V_{\theta}(s)\right)} \right) = V_{\theta}(s) + \frac{1}{\lambda} \log \sum_{a \in \mathcal{A}} \pi_{\text{prior}}(a|s) \exp\left(\lambda \left(Q_{\theta}(s, a) - V_{\theta}(s)\right)\right).$$

The first term represents the ordinary maximum operator as in vanilla deep Q-learning. The second term is a log-partition sum with computationally stable elements due to the non-positive exponents λ(Q_θ(s, a) − V_θ(s)) ≤ 0. As a result, the log-partition sum is non-positive and subtracts a portion from V_θ(s) that reflects how the instantaneous information-theoretic penalty translates into a penalisation of cumulative reward values.
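A small sketch contrasting the naive evaluation of the operator with the stabilised form above; the Q-values and λ settings are illustrative.

```python
import numpy as np

def free_energy_naive(q, pi_prior, lam):
    """Direct evaluation of (1/lam) * log sum_a pi_prior(a) * exp(lam * q(a)); overflows for large lam."""
    return np.log(np.sum(pi_prior * np.exp(lam * q))) / lam

def free_energy_stable(q, pi_prior, lam):
    """Stabilised form: V(s) plus a non-positive log-partition correction."""
    v = q.max()
    return v + np.log(np.sum(pi_prior * np.exp(lam * (q - v)))) / lam

q = np.array([1.0, 2.0, 0.5])
prior = np.ones(3) / 3
print(free_energy_naive(q, prior, lam=10.0))    # fine for moderate lam (~1.89)
print(free_energy_stable(q, prior, lam=10.0))   # identical value
print(free_energy_stable(q, prior, lam=1e4))    # still finite (~2.0); the naive form would overflow
```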

Experiments & Results We hypothesise that addressing the overestimation problem results in improved sample efficiency and overall performance. To this end, we use the Arcade Learning Environment for Atari games (Bellemare et al. 2013) as benchmark to evaluate our method. We compare

against deep Q-networks (Mnih et al. 2015) that are agnostic to overestimations, and to double deep Q-networks (van Hasselt, Guez, and Silver 2016)—an alternative proposed to address the same exact problem we target. Our results demonstrate that our proposed method (titled deep information networks (DIN)) leads to significantly lower Qvalue estimates resulting in improved sample efficiency and game play performance. We also show that these findings remain valid when using the recently proposed dueling architecture (Wang et al. 2016) for network outputs.

Parameter Settings for Reproducibility

We conduct all experiments in Python with TensorFlow (Abadi et al. 2016) and OpenAI gym (Brockman et al. 2016), extending the GitHub project from (Kim 2016). We use a deep convolutional neural network Q_θ(s, a) as a function approximator for Q-values, designed and trained according to the procedure described in (Mnih et al. 2015). Q_θ(s, a) receives as input the current state of the environment s, which is composed of the last four video frames. The number of neurons in the output layer is set to the number of possible actions a; the numerical value of each output neuron corresponds to the expected cumulative reward when taking the relevant action in state s.

We train the network for 5 × 10^7 iterations, where one iteration corresponds to a single interaction with the environment. Environment interactions (s, a, r, s′) are stored in a replay memory consisting of at most 10^6 elements. Every fourth iteration, a minibatch of size 32 is sampled from the replay memory and a single gradient update is conducted with a discount factor γ = 0.99. We use RMSProp (Tieleman and Hinton 2012) as the optimiser with a learning rate of 2.5 × 10^−4, gradient momentum of 0.95, squared gradient momentum of 0.95, and a minimum squared gradient of 0.01. Rewards r are clipped to {−1, 0, +1}. The target network Q_{θ⁻}(s, a) is updated every 10^4 iterations. The time constant τ for dynamically updating the hyperparameter λ is set to 10^5, and the prior policy π_prior is chosen to be uniform in all states.

When the agent interacts with the environment, every fourth frame is skipped and the current action is repeated on the skipped frames. During training, the agent follows an ε-greedy policy where ε is initialised to 1 and linearly annealed over 10^6 iterations until a final value of ε = 0.1. Training and ε-annealing start at 5 × 10^4 iterations. RGB images from the Arcade Learning Environment are preprocessed by taking the pixel-wise maximum with the previous image. After preprocessing, images are transformed to grey scale and down-sampled to 84 × 84 pixels. All our experiments are conducted in duplicate with two different initial random seeds. The random number of NOOP actions at the beginning of each game episode is between 1 and 30.

We compare our approach against deep Q-networks and double deep Q-networks. In addition, we conduct further experiments by replacing all networks' output type with the dueling architecture. The dueling architecture leverages the advantage function A(s, a) = Q(s, a) − max_a Q(s, a) to approximate Q-values. The motivation is to decrease the variance in the parameter updates to further improve sample efficiency and game play performance, which is also confirmed in our experiments.
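The exploration schedule and the frame preprocessing described above can be sketched as follows; the constants are taken from the settings listed above, the grey-scale conversion is a naive channel average, and the final resize to 84 × 84 pixels is omitted since it would require an image library.

```python
import numpy as np

def epsilon(iteration, start=1.0, final=0.1, anneal_iters=int(1e6), warmup=int(5e4)):
    """Linearly annealed epsilon for the epsilon-greedy policy used during training."""
    if iteration < warmup:
        return start
    frac = min(1.0, (iteration - warmup) / anneal_iters)
    return start + frac * (final - start)

def preprocess(frame, previous_frame):
    """Pixel-wise maximum with the previous RGB frame, followed by grey-scale conversion.

    The down-sampling to 84x84 pixels is omitted in this sketch.
    """
    fused = np.maximum(frame, previous_frame)          # removes flickering of game sprites
    grey = fused.astype(np.float32).mean(axis=-1)      # naive grey-scale: channel average
    return grey

print(epsilon(0), epsilon(550_000), epsilon(2_000_000))  # 1.0, ~0.55, 0.1
```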

Q-Values and Game Play Performance

During training, the network parameters are stored every 10^5 iterations and used for offline evaluation. Evaluating a single network offline comprises 100 game play episodes, each of which lasts for at most 4.5 × 10^3 iterations. In evaluation mode, the agent follows an ε-greedy policy with ε = 0.05. In this study, we investigate three games (Asterix, Road Runner and Double Dunk). Figure 1 reports the results from the offline evaluation in Road Runner, illustrating average maximum Q-values and average cumulative episodic rewards as a function of training iterations. Note that average cumulative episodic rewards are smoothed with an exponential window over time, similar to Equation 7 with τ = 10, to provide a clearer view. In Road Runner, our information-theoretic approach leads to significantly lower Q-value estimates when compared to DQN and double DQN for both the vanilla and the dueling architecture (see the left plots in Figure 1). This leads to significant improvements in game play performance, as shown in the right plots of Figure 1. We conduct additional experiments in Asterix and Double Dunk, see Figure 2. In line with the results obtained in Road Runner, our approach leads to significantly lower Q-value estimates when compared to DQN and double DQN. While this results in significantly improved game play performance in Asterix, there is no significant difference in performance in Double Dunk across all approaches (including double DQN) and network architectures.

Sample Efficiency

To quantify sample complexity, we identify the number of training iterations required to attain maximum deep Q-network performance (maximised over iterations and seeds). To this end, we compute the average cumulative episodic reward smoothed with an exponential window (τ = 100). We then identify, for each approach, the number of training iterations at which maximum deep Q-network performance is attained for the first time. The results are shown in Figure 3. (By this definition, sample complexity can only be quantified in case of an improvement over deep Q-networks; therefore, only results for Asterix and Road Runner are depicted.) It can be seen that our approach leads to significant improvements in sample efficiency when compared to DQN and double DQN. For instance, DIN requires only 1.8 × 10^7 training iterations in Road Runner, compared to about 3 × 10^7 for double DQN and 5 × 10^7 for standard DQN. Note that these results hold for both the vanilla and the dueling DQN architecture.
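A sketch of how this sample-complexity measure can be computed from logged evaluation scores; the smoothing constant follows the text, while the array of per-checkpoint scores and the checkpoint spacing are hypothetical inputs.

```python
import numpy as np

def exponential_smooth(scores, tau=100.0):
    """Exponential-window running average of evaluation scores (cf. Equation 7)."""
    avg, out = scores[0], []
    for s in scores:
        avg = (1.0 - 1.0 / tau) * avg + s / tau
        out.append(avg)
    return np.array(out)

def iterations_to_reach(scores, reference, iters_per_checkpoint=int(1e5), tau=100.0):
    """First training iteration at which the smoothed score reaches a reference performance."""
    smoothed = exponential_smooth(np.asarray(scores, dtype=float), tau)
    hits = np.nonzero(smoothed >= reference)[0]
    return None if hits.size == 0 else int((hits[0] + 1) * iters_per_checkpoint)

# Hypothetical usage: 'din_scores' would hold one evaluation score per stored checkpoint.
din_scores = np.linspace(0, 50_000, 500)
print(iterations_to_reach(din_scores, reference=20_000))
```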

Policy Evaluation

We compare the performance of all approaches in terms of the best (non-smoothed) cumulative episodic reward obtained in the course of the entire evaluation (maximised over iterations and random seeds).


Figure 1: Q-values and episodic rewards for Road Runner. The first two plots illustrate Q-value estimates, and the second two plots game play performance in terms of episodic cumulative rewards. The first and the third plot refer to the vanilla DQN architecture, and the second and fourth to the dueling DQN architecture. Each plot shows three pairs of graphs, illustrating the outcomes of two different random seeds, in black for DQN, purple for double DQN (DDQN) and blue for our information-theoretic approach (DIN). Summing up, our approach leads to lower Q-value estimates and better game play performance when compared to the DQN and double DQN baselines for both vanilla and dueling networks.

To ensure comparability between games where absolute reward values may vary substantially, we normalise cumulative episodic rewards (scores) according to:

$$\text{score}_{\text{norm}} = \frac{\text{score} - \text{score}_{\text{random}}}{\text{score}_{\text{human}} - \text{score}_{\text{random}}} \cdot 100\%,$$

where score_random and score_human refer to random and human baseline cumulative episodic rewards respectively, see (Mnih et al. 2015; van Hasselt, Guez, and Silver 2016). The results are summarised in Table 1 for the normal network architecture and in Table 2 for the dueling architecture. DQN refers to deep Q-networks, DDQN to double deep Q-networks and DIN to our approach. Overall best results in Road Runner and Asterix are achieved by our proposed method for both vanilla and dueling networks. In Double Dunk, the best results are achieved by double DQN. Note, however, that the conclusion in Double Dunk should be treated with caution, since optimal results are obtained by maximising over random seeds. Inspecting Figure 2 reveals that game play performance differs considerably over different seeds in Double Dunk, and these differences are most severe for double DQN.

Table 1: Normalised episodic reward for the normal network architecture.

Game          DQN      DDQN     DIN
Road Runner   503.6%   593.2%   643.7%
Asterix       70.4%    73.4%    85.0%
Double Dunk   161.6%   265.8%   165.8%
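The normalisation above is straightforward to compute; a small sketch with made-up baseline scores follows (the actual human and random baselines are published in (Mnih et al. 2015) and not reproduced here).

```python
def normalised_score(score, score_random, score_human):
    """Human-normalised score in percent, as defined above."""
    return 100.0 * (score - score_random) / (score_human - score_random)

# Hypothetical numbers for illustration only.
print(normalised_score(score=40_000, score_random=200, score_human=7_800))  # ~523.7%
```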

Conclusions & Future Work In this paper, we proposed a novel method for reducing sample complexity in deep reinforcement learning. Our technique introduces an intrinsic penalty signal by generalising

Game Road Runner Asterix Double Dunk

DQN 553.6% 45.6% 223.9%

DDQN 624.4% 75.7% 241.6%

DIN 659.0% 104.1% 222.6%

Table 2: Normalised episodic reward for the dueling network architecture. principles from information-theoretic bounded rationality to high-dimensional state spaces. We showed that DQN falls out as a special case of our proposed approach by a specific choice for the Lagrange multiplier steering the intrinsic penalty signal. Finally, in a set of experiments in the Arcade Learning Environment, we demonstrated that our technique indeed outperforms competing approaches in terms of performance and sample efficiency. In this work, we assumed a fixed prior behavior policy which is sufficient to tackle the overestimation problem in Q-learning. The most promising direction of future work is to study adaptive prior policies instead, in line with (Bagnell and Schneider 2003; Peters and Schaal 2008; Peters, Mulling, and Altun 2010; Schulman et al. 2015). This could be used to generalise our framework to support multitask learning where a prior policy is shared among different tasks, and serves as a good starting point to efficiently lean multiple task-specific policies.



Figure 2: Q-values and episodic rewards for Asterix and Double Dunk. The figure is structured in the same way as Figure 1. Our approach leads to lower Q-value estimates in both games when compared to DQN and double DQN for both the vanilla and the dueling network architecture. In Asterix, this results in significantly improved game play performance. In Double Dunk, there seems to be no significant difference in game play performance across approaches and network architectures.


Figure 3: Sample efficiency. The first plot illustrates sample efficiency for networks with the vanilla DQN architecture, and the second plot for dueling networks. Each plot depicts results for Road Runner and Asterix separately. Each bar plot consists of three bars: DQN (black), double DQN (DDQN, purple) and our approach (DIN, blue). The height of each bar refers to the number of training iterations required to attain maximum DQN performance. Our approach is more sample-efficient than DQN as well as double DQN for both the vanilla and the dueling architecture.

References

[Abadi et al. 2016] Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G. S.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Goodfellow, I.; Harp, A.; Irving, G.; Isard, M.; Jia, Y.; Jozefowicz, R.; Kaiser, L.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D.; Olah, C.; Schuster, M.; Shlens, J.; Steiner, B.; Sutskever, I.; Talwar, K.; Tucker, P.; Vanhoucke, V.; Vasudevan, V.; Vinyals, O.; Warden, P.; Wattenberg, M.; Wicke, M.; Yu, Y.; and Zheng, X. 2016. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv 1603.04467.
[Azar et al. 2011] Azar, M. G.; Munos, R.; Ghavamzadeh, M.; and Kappen, H. J. 2011. Speedy Q-learning. In Advances in Neural Information Processing Systems.

[Azar, Gomez, and Kappen 2012] Azar, M. G.; Gomez, V.; and Kappen, H. J. 2012. Dynamic policy programming. Journal of Machine Learning Research 13:3207–3245.
[Bagnell and Schneider 2003] Bagnell, J. A., and Schneider, J. 2003. Covariant policy search. Proceedings of the International Joint Conference on Artificial Intelligence.
[Bellemare et al. 2013] Bellemare, M. G.; Naddaf, Y.; Veness, J.; and Bowling, M. 2013. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research 47:253–279.
[Bellemare et al. 2016] Bellemare, M. G.; Ostrovski, G.; Guez, A.; Thomas, P. S.; and Munos, R. 2016. Increasing the action gap: New operators for reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence.
[Bertsekas 2007] Bertsekas, D. P. 2007. Dynamic Programming and Optimal Control, volume 2. Athena Scientific.
[Brockman et al. 2016] Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. OpenAI gym. arXiv 1606.01540.
[Busoniu et al. 2010] Busoniu, L.; Babuska, R.; De Schutter, B.; and Ernst, D. 2010. Reinforcement Learning and Dynamic Programming using Function Approximators. CRC Press.
[Fox, Pakman, and Tishby 2016] Fox, R.; Pakman, A.; and Tishby, N. 2016. Taming the noise in reinforcement learning via soft updates. In Proceedings of the Conference on Uncertainty in Artificial Intelligence.
[Genewein et al. 2015] Genewein, T.; Leibfried, F.; Grau-Moya, J.; and Braun, D. A. 2015. Bounded rationality, abstraction, and hierarchical decision-making: An information-theoretic optimality principle. Frontiers in Robotics and AI 2(27).
[Grau-Moya et al. 2016] Grau-Moya, J.; Leibfried, F.; Genewein, T.; and Braun, D. A. 2016. Planning with information-processing constraints and model uncertainty in Markov decision processes. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases.
[Haarnoja et al. 2017] Haarnoja, T.; Tang, H.; Abbeel, P.; and Levine, S. 2017. Reinforcement learning with deep energy-based policies. arXiv 1702.08165.
[Huber 1964] Huber, P. J. 1964. Robust estimation of a location parameter. Annals of Statistics 53(1):73–101.
[Kappen, Gomez, and Opper 2012] Kappen, H. J.; Gomez, V.; and Opper, M. 2012. Optimal control as a graphical model inference problem. Machine Learning 87(2):159–182.
[Kim 2016] Kim, T. 2016. Deep reinforcement learning in TensorFlow.
[Lee and Powell 2012] Lee, D., and Powell, W. B. 2012. An intelligent battery controller using bias-corrected Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence.
[Leibfried and Braun 2015] Leibfried, F., and Braun, D. A. 2015. A reward-maximizing spiking neuron as a bounded rational decision maker. Neural Computation 27(8):1686–1720.
[Leibfried and Braun 2016] Leibfried, F., and Braun, D. A. 2016. Bounded rational decision-making in feedforward neural networks. In Proceedings of the Conference on Uncertainty in Artificial Intelligence.
[Lin 1993] Lin, L.-J. 1993. Reinforcement learning for robots using neural networks. Ph.D. Dissertation, Carnegie Mellon University.
[Mnih et al. 2015] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; Petersen, S.; Beattie, C.; Sadik, A.; Antonoglou, I.; King, H.; Kumaran, D.; Wierstra, D.; Legg, S.; and Hassabis, D. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529–533.
[Mnih et al. 2016] Mnih, V.; Puigdomenech Badia, A.; Mirza, M.; Graves, A.; Lillicrap, T. P.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning.
[Nachum et al. 2017] Nachum, O.; Norouzi, M.; Xu, K.; and Schuurmans, D. 2017. Bridging the gap between value and policy based reinforcement learning. arXiv 1702.08892.
[O’Donoghue et al. 2017] O’Donoghue, B.; Munos, R.; Kavukcuoglu, K.; and Mnih, V. 2017. Combining policy gradient and Q-learning. Proceedings of the International Conference on Learning Representations.
[Ortega and Braun 2013] Ortega, P. A., and Braun, D. A. 2013. Thermodynamics as a theory of decision-making with information-processing costs. Proceedings of the Royal Society A 469(2153).
[Ortega et al. 2015] Ortega, P. A.; Braun, D. A.; Dyer, J.; Kim, K.-E.; and Tishby, N. 2015. Information-theoretic bounded rationality. arXiv 1512.06789.
[Peng et al. 2017] Peng, Z.; Genewein, T.; Leibfried, F.; and Braun, D. A. 2017. An information-theoretic on-line update principle for perception-action coupling. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems.
[Peters and Schaal 2008] Peters, J., and Schaal, S. 2008. Reinforcement learning of motor skills with policy gradients. Neural Networks 21:682–697.
[Peters, Mulling, and Altun 2010] Peters, J.; Mulling, K.; and Altun, Y. 2010. Relative entropy policy search. In Proceedings of the AAAI Conference on Artificial Intelligence.
[Rawlik, Toussaint, and Vijayakumar 2012] Rawlik, K.; Toussaint, M.; and Vijayakumar, S. 2012. On stochastic optimal control and reinforcement learning by approximate inference. Proceedings Robotics: Science and Systems.
[Rubin, Shamir, and Tishby 2012] Rubin, J.; Shamir, O.; and Tishby, N. 2012. Trading value and information in MDPs. In Decision Making with Imperfect Decision Makers. Springer, chapter 3.
[Schulman, Abbeel, and Chen 2017] Schulman, J.; Abbeel, P.; and Chen, X. 2017. Equivalence between policy gradients and soft Q-learning. arXiv 1704.06440.
[Schulman et al. 2015] Schulman, J.; Levine, S.; Moritz, P.; Jordan, M.; and Abbeel, P. 2015. Trust region policy optimization. In Proceedings of the International Conference on Machine Learning.
[Sutton and Barto 1998] Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learning: An Introduction. MIT Press.
[Thrun and Schwartz 1993] Thrun, S., and Schwartz, A. 1993. Issues in using function approximation for reinforcement learning. In Proceedings of the Connectionist Models Summer School.
[Tieleman and Hinton 2012] Tieleman, T., and Hinton, G. E. 2012. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. In COURSERA: Neural Networks for Machine Learning.
[Tishby and Polani 2011] Tishby, N., and Polani, D. 2011. Information theory of decisions and actions. In Perception-Action Cycle. Springer, chapter 19.
[Todorov 2009] Todorov, E. 2009. Efficient computation of optimal actions. Proceedings of the National Academy of Sciences of the United States of America 106(28):11478–11483.
[van Hasselt, Guez, and Silver 2016] van Hasselt, H.; Guez, A.; and Silver, D. 2016. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence.

[van Hasselt 2010] van Hasselt, H. 2010. Double Q-learning. In Advances in Neural Information Processing Systems.
[Wang et al. 2016] Wang, Z.; Schaul, T.; Hessel, M.; van Hasselt, H.; Lanctot, M.; and de Freitas, N. 2016. Dueling network architectures for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning.
[Watkins 1989] Watkins, C. J. C. H. 1989. Learning from delayed rewards. Ph.D. Dissertation, University of Cambridge.
[Werbos 1982] Werbos, P. J. 1982. Applications of advances in nonlinear sensitivity analysis. System Modeling and Optimization 762–770.
[Williams and Peng 1991] Williams, R. J., and Peng, J. 1991. Function optimization using connectionist reinforcement learning algorithms. Connection Science 3(3):241–268.
