A Hybridized Bayesian Parametric-Nonparametric Approach to the Pure Exploration Problem
Girish V. Chowdhary Mechanical and Aerospace Engineering Oklahoma State University Stillwater, OK 74074
[email protected]
Allan M. Axelrod Mechanical and Aerospace Engineering Oklahoma State University Stillwater, OK 74074
[email protected]
Abstract

Information-driven approaches to reinforcement learning (RL) and bandit problems largely rely on optimizing an expectation over calculated Kullback-Leibler (KL) divergence values. Although KL divergence may provide bounds on problem domain models, bounds on the expected KL divergence itself are absent from information-driven approaches. We therefore focus our investigation on the pure exploration problem, a key component of RL and bandit problems, where the objective is to efficiently gain knowledge about the problem domain. For this task, we develop an algorithm using a Poisson exposure process Cox Gaussian process (Pep-CGP), a hybridized Bayesian parametric-nonparametric Lévy process, and theoretically derive a bound for the Pep-CGP expectation on KL divergence. Our algorithm, Real-time Adaptive Prediction of Time-varying and Obscure Rewards (RAPTOR), is validated on four real-world datasets, on which it outperforms baseline pure exploration approaches.
1 Introduction
We consider the use of Bayesian nonparametrics in the pure exploration bandit problem, a key component of reinforcement learning and bandit problems where the objective is to maximize the information gain, i.e., the Kullback-Leibler (KL) divergence between posterior and prior beliefs [Hobson, 1969], of each selected bandit arm. This problem is particularly challenging in the real world, which is often partially observable [Modares et al., 2014, Jaulmes et al., 2005, Kurniawati et al., 2008] and time-varying [Cipelletti and Ramos, 2005, Clough and Penzien, 1975, Dimentberg, 1988]. Our challenge is thus, without requiring a priori knowledge of the underlying problem domain statistics, to develop a principled probabilistic bound on our ability to accurately anticipate which subset of bandit arm models have suffered the greatest depreciation in relevance. Hence, we formulate a Poisson exposure process Cox Gaussian process (Pep-CGP), a hybridized Bayesian parametric-nonparametric Lévy process, and theoretically derive an exposure bound on the Pep-CGP expectation on KL divergence. Herein we use "exposure bound" to mean a bound that is a function of the number of samples obtained and the duration of sampling. The exposure bound is key, as it allows for a principled transition from uninformed exploration, where bandit arms are selected irrespective of observed KL divergence, to informed exploration, where bandit arms are selected with respect to the sequential KL divergence, in unstructured environments. In theoretically deriving principled conditions for this uninformed-to-informed exploration, we make informed exploration easier to implement reliably in practice. This is of particular importance since informed exploration approaches such as [Little and Sommer, 2013, Russo and Van Roy, 2014, Mobin et al., 2014, Axelrod and Chowdhary, 2015, Axelrod et al., 2015] require expert-defined preprocessing or uninformed exploration to avoid extended durations of poor informed exploration performance, which is difficult to achieve without a priori knowledge.
2 Pep-Cox Gaussian Process
Definition 2.1 (Axelrod [2015]). The Poisson exposure process (Pep) is defined as
\[
f(z \mid \Lambda(t)) = C_{\Lambda(t)} \frac{(\Lambda(t))^{z} e^{-\Lambda(t)}}{\Gamma(z+1)}, \tag{1}
\]
where Λ(t) = λt is a homogeneous Pep with exposure rate λ, C_{Λ(t)} is the normalizing constant, and Z(τ_i) is termed the i-th Poisson exposure trial. When Λ(t) ≠ λt, the Pep is termed inhomogeneous. We term the Pep a Poisson exposure distribution (Ped) when Λ(t) is constant.

Since Definition 2.1 shows that the Pep is similar in form to the Poisson process, the conjugate prior of the Pep is identical to that of the Poisson process, as stated in Fact 2.2.

Fact 2.2. The gamma distribution is a conjugate prior of the homogeneous Poisson exposure process (Pep), such that
\[
\mathcal{G}\left(\lambda^{*} t \mid \alpha + z, \beta + t, z\right) \propto \mathrm{Pep}\left(z \mid \lambda t\right) \mathcal{G}\left(\lambda t \mid \alpha, \beta\right). \tag{2}
\]

Proof. See Appendix for details.

As we show in Section 3, an exposure bound on the estimation error of the homogeneous (linear) Pep may be used to switch from uninformed to informed exploration. To capitalize on this bound, our approach must retain the homogeneous Pep in some facet. Below, we present the Pep-Cox Gaussian process (Pep-CGP), which initializes as a Pep and then smoothly transitions to a Cox Gaussian process:
\[
\mathbb{E}[Z \mid \Delta t] \leftarrow (1 - \sigma^{2})\, \mathbb{E}_{GP}(\Delta Z \mid \Delta t) + \sigma^{2}\, \mathbb{E}_{Pep}(\Delta Z \mid \Delta t) + Z(\tau_{n}), \tag{3}
\]
where σ² ∈ [0, 1] is the normalized predictive variance of the Gaussian process. Since the GP predictive variance starts at the prior variance, σ² is initially 1 and the model begins as a pure Pep; σ² then decreases as a function of the samples obtained, smoothly shifting weight to the Gaussian process.
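As an illustration of Eq. (3), the following is a minimal Python sketch of the Pep-CGP predictive mean, under the assumption that σ² is computed as the GP predictive variance normalized by the prior variance; all names (pep_cgp_mean, lam_hat, var_prior) are illustrative, not from a released implementation.

```python
import numpy as np

def pep_cgp_mean(dt, z_last, lam_hat, gp_mean, gp_var, var_prior=1.0):
    """Pep-CGP predictive mean of Z after an elapsed time dt, per Eq. (3).

    dt        : time elapsed since the last observation Z(tau_n)
    z_last    : last observed cumulative value Z(tau_n)
    lam_hat   : estimated homogeneous exposure rate, so E_Pep = lam_hat * dt
    gp_mean   : GP predictive mean of the increment Delta Z over dt
    gp_var    : GP predictive variance of that increment
    var_prior : GP prior variance, used to normalize gp_var into [0, 1]
    """
    sigma2 = np.clip(gp_var / var_prior, 0.0, 1.0)  # starts near 1, shrinks with data
    e_pep = lam_hat * dt                            # homogeneous Pep increment mean
    return (1.0 - sigma2) * gp_mean + sigma2 * e_pep + z_last
```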
3 Exposure Bound
We consider a homogeneous Pep with Λ(t, n) = λ∆t∆n, where λ is the exposure per unit time per sample, which provides the following exposure inequality.

Theorem 3.1 (Exposure Inequality). Let ∆Z(τ_1), ..., ∆Z(τ_n) be independent Pep increments so that Pr(∆Z(τ_i)) = p_i. Let $\bar{Z}(\tau_n) = \frac{1}{n\tau_n}\sum_{j=1}^{n} \Delta Z(\tau_j)$ using the homogeneous Pep rate λ, and λ = E[Z̄(τ_n)]. Then
\[
\Pr\left(\left|\bar{Z} - \lambda\right| \ge \lambda^{3/4}\right) \le \frac{1}{n\beta\sqrt{\lambda}}. \tag{4}
\]

Proof. See Appendix for details.

While the exposure inequality in Theorem 3.1 provides an exposure bound for learning a single Pep, it does not yet provide a condition for the transition between uninformed and informed exploration across all bandit arms. Hence, we develop Corollary 3.2.

Corollary 3.2 (Informed Policy Exposure Bound). An uninformed-to-informed exploration algorithm has sufficiently explored the domain when
\[
\tau_{(b,n)} > \tau_{(b,1)} + \frac{1}{n_b c \sqrt{\lambda_b}}, \tag{5}
\]
where 0 < c ≤ 1 and
\[
b = \arg\max_i \frac{1}{n_i \beta_i \sqrt{\lambda_i}}. \tag{6}
\]

The resulting uninformed-to-informed exploration procedure, RAPTOR, is summarized below.

Algorithm 1 RAPTOR
For t = 1, 2, ...
    If τ_{(b,n)} > τ_{(b,1)} + 1/(n_b c √λ_b) (Domain Exposure Bound)
        η^t ← argmax_{η ⊂ {1,...,K}} E[Σ_{i∈η} V_i] subject to Card(η) ≤ κ
    Else
        η^t ← Uninformed Policy (Sequential, Random, etc.)
    End If
End For
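The control flow of Algorithm 1 can be sketched in a few lines of Python. The per-arm bookkeeping (dictionaries with sample counts, sampling times, rate estimates, and predicted divergences) and the random uninformed fallback are our own illustrative assumptions; note that the cardinality-constrained subset argmax reduces to a top-κ selection because the objective is additive over arms.

```python
import numpy as np

def raptor_select(arms, kappa, c=1.0, rng=None):
    """One RAPTOR episode (sketch). Each arm is a dict with assumed keys:
    n (sample count), tau_first/tau_last (first/last sampling times),
    lam (estimated exposure rate), value (predicted KL divergence E[V_i])."""
    rng = rng or np.random.default_rng()

    # Critical arm b = argmax_i 1 / (n_i * beta_i * sqrt(lam_i)), Eq. (6).
    def slack(a):
        beta = max(a["tau_last"] - a["tau_first"], 1e-12)
        return 1.0 / (a["n"] * beta * np.sqrt(a["lam"]))
    b = max(arms, key=slack)

    # Domain Exposure Bound (Corollary 3.2): switch to informed exploration
    # once even the worst-covered arm b has been observed long enough.
    if b["tau_last"] > b["tau_first"] + 1.0 / (b["n"] * c * np.sqrt(b["lam"])):
        # Informed: the cardinality-constrained subset argmax is just the
        # kappa arms with the largest predicted divergence (additive objective).
        return sorted(range(len(arms)), key=lambda i: -arms[i]["value"])[:kappa]
    # Uninformed fallback: e.g., a random size-kappa subset.
    return list(rng.choice(len(arms), size=kappa, replace=False))
```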
4 Results

Figures 1a-1d provide strong evidence for the validity and strength of our hybridized Bayesian parametric-nonparametric method on the pure exploration problem.

[Figure 1 panels, each plotting % Max Divergence Sampled versus Episode: (a) Intel Berkeley Research Lab data set (κ=6, K=52) Bodik et al. [2004]; (b) ERA Daily Interim Temperature Data Set (κ=6, K=50) Berrisford et al. [2009]; (c) Washington Rainfall Data Set (κ=2, K=25) Widmann and Bretherton [2000]; (d) Ireland Wind-Speed Data Set (κ=2, K=12) Haslett and Raftery [1961-1978]; (e) Legend for Figures 1a-1d.]
Figure 1: In the above subfigures, RAPTOR sequentially explores the data set until a domain exposure bound condition is satisfied. Once the bound is satisfied, RAPTOR has probabilistic guarantees on the prediction of Kullback-Leibler divergence, and the performance noticeably improves with respect to the baseline algorithms.
We leverage the parametric and homogeneous Poisson exposure process (Pep) to develop a bound for transitioning from an uninformed exploration policy to an informed exploration policy; this allows an agent, without preprocessing or a priori knowledge, to learn to exploit the task of exploration in real-world datasets. We improve upon the baseline guarantees provided by the homogeneous Pep by using a convex combination of the homogeneous Pep and the Cox Gaussian process to yield the high-fidelity Pep-CGP model. The resultant model is incorporated into the Real-time Adaptive Prediction of Time-varying and Obscure Rewards (RAPTOR) algorithm, which is shown to outperform uninformed and informed baseline exploration methods across four real-world datasets.
5 Conclusion
We use a hybridized Bayesian parametric-nonparametric Poisson exposure process Cox Gaussian process (Pep-CGP) to model the Kullback-Leibler divergence of our Gaussian belief updates, which arise from the data received at each bandit arm in an n-armed multi-play bandit problem, in the pure exploration context. Our approach outperforms a set of baseline algorithms across four real-world datasets. Future work will focus on incorporating non-Gaussian time-varying data-level distributions.
Acknowledgements

This work is sponsored by the Air Force Office of Scientific Research Young Investigator Program Number FA9550-15-1-0146.
References

Arthur Hobson. A new theorem of information theory. Journal of Statistical Physics, 1(3):383–391, 1969.

Hamidreza Modares, Frank L. Lewis, and Mohammad-Bagher Naghibi-Sistani. Integral reinforcement learning and experience replay for adaptive optimal control of partially-unknown constrained-input continuous-time systems. Automatica, 50(1):193–202, 2014.

R. Jaulmes, J. Pineau, and D. Precup. Learning in non-stationary partially observable Markov decision processes. In ECML Workshop on Reinforcement Learning in Non-Stationary Environments, 2005.

Hanna Kurniawati, David Hsu, and Wee Sun Lee. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Robotics: Science and Systems, volume 2008, 2008.

Luca Cipelletti and Laurence Ramos. Slow dynamics in glassy soft matter. Journal of Physics: Condensed Matter, 17(6):R253, 2005.

Ray W. Clough and Joseph Penzien. Dynamics of structures. Technical report, McGraw-Hill Incorporated, 1975.

Mikhail Fedorovich Dimentberg. Statistical dynamics of nonlinear and time-varying systems, volume 5. Research Studies Press, Taunton, 1988.

Daniel Y. Little and Friedrich T. Sommer. Learning and exploration in action-perception loops. Frontiers in Neural Circuits, 7, 2013.

Dan Russo and Benjamin Van Roy. Learning to optimize via information-directed sampling. In Advances in Neural Information Processing Systems, pages 1583–1591, 2014.

Shariq A. Mobin, James A. Arnemann, and Fritz Sommer. Information-based learning by agents in unbounded state spaces. In Advances in Neural Information Processing Systems, pages 3023–3031, 2014.

Allan Axelrod and Girish Chowdhary. Adaptive algorithms for autonomous data-ferrying in nonstationary environments. In AIAA Infotech @ Aerospace, AIAA Science and Technology Forum, 2015.

Allan M. Axelrod, Sertac A. Karaman, and Girish V. Chowdhary. Exploitation by informed exploration between isolated operatives for information-theoretic data harvesting. In Conference on Decision and Control, volume 54, Osaka, JP, 2015. CDC.

Allan Max Axelrod. Learning to exploit time-varying heterogeneity in distributed sensing using the information exposure rate. Master's thesis, Oklahoma State University, 2015.

Peter Bodik, Wei Hong, Carlos Guestrin, Sam Madden, Mark Paskin, and Romain Thibaux. Intel lab data. Technical report, Intel Berkeley Research Lab, Feb 2004. URL http://db.csail.mit.edu/labdata/labdata.html.

Paul Berrisford, DPKF Dee, K. Fielding, M. Fuentes, P. Kallberg, S. Kobayashi, and S. Uppala. The ERA-Interim archive. 2009.

Martin Widmann and Christopher S. Bretherton. Validation of mesoscale precipitation in the NCEP reanalysis using a new gridcell dataset for the northwestern United States. Journal of Climate, 13(11):1936–1950, 2000.

John Haslett and Adrian E. Raftery. Ireland wind data set. Technical report, Trinity College and University of Washington, 1961-1978. URL http://lib.stat.cmu.edu/datasets/wind.desc.
A Synopsis
The concept of uninformed-to-informed exploration is founded on the excessively poor results of informed exploration techniques presented in Section B. The proofs used for analysis and the exposure bound for uninformed-to-informed exploration are presented in Section C.
B Informed Exploration Results
The benefit of uninformed-to-informed exploration is therefore made clearer by considering what would happen if we immediately engaged in informed exploration, which provides a worst-case performance scenario. We must emphasize that Figures 2a-2d are generated using the same analysis techniques as in the main work, and that the key difference is the number of times each bandit arm is selected before informed exploration is attempted.

[Figure 2 panels, each plotting % Max Divergence Sampled versus Episode: (a) Intel Berkeley Temperature Data Set (κ=6, K=52); (b) European Temperature Data Set (κ=6, K=50); (c) Washington Rainfall Data Set (κ=2, K=25); (d) Ireland Wind-Speed Data Set (κ=2, K=12); (e) Legend for Figures 2a-2d.]
Figure 2: In the above subfigures, RAPTOR sequentially explores the data set until each bandit arm has been selected twice, so Kullback-Leibler divergence may be calculated with respect to data-driven beliefs.
C Proofs
First, we briefly prove that the conjugate prior of the homogeneous Poisson exposure process (Pep) is a gamma distribution.

Fact C.1. The gamma distribution is a conjugate prior of the homogeneous Poisson exposure process (Pep), such that
\[
\mathcal{G}\left(\lambda^{*} t \mid \alpha + z, \beta + t, z\right) \propto \mathrm{Pep}\left(z \mid \lambda t\right) \mathcal{G}\left(\lambda t \mid \alpha, \beta\right). \tag{7}
\]

Proof. We apply the Poisson exposure process and gamma distribution to Bayes' theorem as
\[
f(\lambda^{*} t \mid \alpha^{*}, \beta^{*}, z) = C_{\lambda t} \frac{(\lambda t)^{z} e^{-\lambda t}}{\Gamma(z+1)} \cdot \frac{\beta^{\alpha} (\lambda t)^{\alpha - 1} e^{-\beta \lambda t}}{\Gamma(\alpha)}. \tag{8}
\]
Dropping the constant terms, we have
\[
f(\lambda^{*} t \mid \alpha^{*}, \beta^{*}, z) \propto (\lambda t)^{z} e^{-\lambda t} \cdot (\lambda t)^{\alpha - 1} e^{-\beta \lambda t}, \tag{9}
\]
which resolves to
\[
f(\lambda^{*} t \mid \alpha^{*}, \beta^{*}, z) \propto (\lambda t)^{\alpha + z - 1} e^{-\lambda (\beta + t)}. \tag{10}
\]
Therefore, given a Poisson exposure process with parameter λt, the posterior distribution is proportional to the prior gamma distribution with parameters α* = α + z and β* = β + t.
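As a quick illustration of Fact C.1, the conjugate update amounts to two additions under a shape-rate parameterization of the gamma distribution; the sketch below uses illustrative numbers.

```python
def gamma_pep_update(alpha, beta, z, t):
    """Conjugate update of Fact C.1: Gamma(alpha, beta) prior on the exposure
    rate, z observed events over exposure t -> Gamma(alpha + z, beta + t)."""
    return alpha + z, beta + t

# Example: starting from a vague Gamma(1, 1) prior, observing z = 7 events
# over t = 2.0 time units gives a posterior mean rate of 8 / 3.
alpha_post, beta_post = gamma_pep_update(1.0, 1.0, z=7, t=2.0)
print(alpha_post / beta_post)  # 2.666...
```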
Now, considering a Pep on the exposure rate per unit time per sample, we obtain a proof for Theorem 3.1.

Theorem C.2 (Exposure Inequality). Let ∆Z(τ_1), ..., ∆Z(τ_n) be independent Pep increments so that Pr(∆Z(τ_i)) = p_i. Let $\bar{Z}(\tau_n) = \frac{1}{n\tau_n}\sum_{j=1}^{n} \Delta Z(\tau_j)$ using the homogeneous Pep rate λ, and λ = E[Z̄(τ_n)]. Then
\[
\Pr\left(\left|\bar{Z} - \lambda\right| \ge \lambda^{3/4}\right) \le \frac{1}{n\beta\sqrt{\lambda}}. \tag{11}
\]
Proof. From Chebyshev's inequality, we know that
\[
\Pr\left(\left|\bar{Z} - \mu\right| \ge k\right) \le \frac{\mathrm{Var}(\bar{Z})}{k^{2}}. \tag{12}
\]
Inserting the mean and variance of our gamma distribution model of the homogeneous Pep into (12) yields
\[
\Pr\left(\left|\bar{Z} - \lambda\right| \ge k\right) \le \frac{\alpha}{(n\beta)^{2} k^{2}}, \tag{13}
\]
which simplifies to
\[
\Pr\left(\left|\bar{Z} - \lambda\right| \ge k\right) \le \frac{\lambda}{n\beta k^{2}}. \tag{14}
\]
Assigning k = λ^{3/4} resolves the proof:
\[
\Pr\left(\left|\bar{Z} - \lambda\right| \ge \lambda^{3/4}\right) \le \frac{1}{n\beta\sqrt{\lambda}}. \tag{15}
\]
Then Corollary 3.2 follows.

Corollary C.3 (Informed Policy Exposure Bound). An uninformed-to-informed exploration algorithm has sufficiently explored the domain when
\[
\tau_{(b,n)} > \tau_{(b,1)} + \frac{1}{n_b c \sqrt{\lambda_b}}, \tag{16}
\]
where 0 < c ≤ 1 and b = argmax_i 1/(n_i β_i √λ_i).

Proof. Theorem C.2 provides the inequality
\[
\Pr\left(\left|\bar{Z} - \lambda\right| \ge \lambda^{3/4}\right) \le \frac{1}{n\beta\sqrt{\lambda}}, \tag{17}
\]
where β = τ_n − τ_1. Then meaningful guarantees are available at a bandit arm when
\[
\tau_n > \tau_1 + \frac{1}{n c \sqrt{\lambda}}, \tag{18}
\]
where 0 < c ≤ 1. Consequently, meaningful guarantees are available across the entire sensing domain once
\[
\tau_{(b,n)} > \tau_{(b,1)} + \frac{1}{n_b c \sqrt{\lambda_b}}, \tag{19}
\]
where
\[
b = \arg\max_i \frac{1}{n_i \beta_i \sqrt{\lambda_i}}.
\]
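For concreteness, a small numeric sketch of the Corollary C.3 check, with invented arm statistics:

```python
import numpy as np

n    = np.array([5, 3, 8])            # samples per arm (illustrative)
beta = np.array([4.0, 2.0, 6.0])      # spans tau_n - tau_1 per arm
lam  = np.array([0.5, 0.2, 1.0])      # estimated exposure rates
c    = 1.0

b = int(np.argmax(1.0 / (n * beta * np.sqrt(lam))))   # critical arm
slack = 1.0 / (n[b] * c * np.sqrt(lam[b]))            # required extra span
print(f"arm b = {b}: need tau_(b,n) > tau_(b,1) + {slack:.3f}")
```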
Lemma C.4 provides guarantees similar to those of the baseline informed exploration method, Predicted Information Gain (PIG).

Lemma C.4 (Sequential Sampling Inequality, k = λ). Let Z(τ_1), ..., Z(τ_n) be independent Poisson exposure distribution trials so that Pr(Z(τ_i)) = p_i. Let $\bar{Z}(\tau_n) = \frac{1}{n}\sum_{j=1}^{n} \Delta Z(\tau_j)$ and λ = E[Z̄(τ_n)]. Then
\[
\Pr\left(\left|\bar{Z} - \lambda\right| \ge \lambda\right) \le \frac{1}{n\lambda}. \tag{20}
\]

Proof. From Chebyshev's inequality, we know that
\[
\Pr\left(\left|\bar{Z} - \mu\right| \ge k\right) \le \frac{\mathrm{Var}(\bar{Z})}{k^{2}}. \tag{21}
\]
Inserting the mean and variance of our gamma distribution model of the homogeneous Pep into (21) yields
\[
\Pr\left(\left|\bar{Z} - \lambda\right| \ge k\right) \le \frac{\alpha}{n^{2} k^{2}}, \tag{22}
\]
which simplifies to
\[
\Pr\left(\left|\bar{Z} - \lambda\right| \ge k\right) \le \frac{\lambda}{n k^{2}}. \tag{23}
\]
Assigning k = λ resolves the proof:
\[
\Pr\left(\left|\bar{Z} - \lambda\right| \ge \lambda\right) \le \frac{1}{n\lambda}. \tag{24}
\]

Lastly, we use Corollary C.5 for implementing PIG.
Corollary C.5 (Informed Policy Bound for PIG). An uninformed-to-informed exploration algorithm may be used for informed exploration over the entire state space once the agent has gathered enough samples that
\[
n_b > \frac{1}{c \lambda_b}, \tag{25}
\]
where 0 < c ≤ 1 and b = argmax_i 1/(n_i λ_i).

Proof. Lemma C.4 provides the inequality
\[
\Pr\left(\left|\bar{Z} - \lambda\right| \ge \lambda\right) \le \frac{1}{n\lambda}, \tag{26}
\]
so meaningful guarantees are available at a bandit arm when
\[
n > \frac{1}{c\lambda}, \tag{27}
\]
where 0 < c ≤ 1. Consequently, meaningful guarantees are available across the entire sensing domain once
\[
n_b > \frac{1}{c \lambda_b}, \tag{28}
\]
where
\[
b = \arg\max_i \frac{1}{n_i \lambda_i}.
\]
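A minimal sketch of the resulting PIG readiness check (names and numbers are illustrative):

```python
def pig_ready(n_b: int, lam_b: float, c: float = 1.0) -> bool:
    """Corollary C.5 (sketch): informed exploration over the whole domain is
    justified once the critical arm's sample count exceeds 1/(c * lambda_b)."""
    return n_b > 1.0 / (c * lam_b)

print(pig_ready(n_b=12, lam_b=0.1))  # 12 > 10 -> True
```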