On the application of higher order symplectic integrators in Hamiltonian Monte Carlo

arXiv:1608.07048v1 [stat.CO] 25 Aug 2016

Janne Mannseth, Tore S. Kleppe, Hans J. Skaug August 26, 2016

Abstract

We explore the construction of new symplectic numerical integration schemes to be used in Hamiltonian Monte Carlo and study their efficiency. Two integration schemes from Blanes et al. (2014), and a new scheme based on optimal acceptance probability, are considered as alternatives to the commonly used leapfrog method. All integration schemes are tested within the framework of the No-U-Turn sampler (NUTS), both for a logistic regression model and a student t-model. The results show that the leapfrog method is inferior to all the new methods, both in terms of asymptotic expected acceptance probability for a model problem and in terms of effective sample size per computing time for the realistic models.

Keywords: Hamiltonian Monte Carlo; Leapfrog method; Parametrized numerical scheme; Integration scheme; No-U-Turn Sampler; Effective Sample Size

1. Introduction

Hamiltonian Monte Carlo (HMC) (Duane et al., 1987; Neal, 2010) is a Markov chain Monte Carlo (MCMC) algorithm that avoids the random walk behavior of the commonly used Metropolis-Hastings algorithm. For this reason HMC has become popular for sampling from the posterior distribution of Bayesian models (Liu, 2008; Hoffman and Gelman, 2014). HMC is applicable when the parameters have continuous distributions, and requires that the gradient of the log-posterior can be evaluated. An additional practical challenge


with the HMC algorithm is its sensitivity to two user-specified parameters: the so-called step size and the number of steps. The NUTS algorithm (Hoffman and Gelman, 2014) automates the choice of these tuning parameters. There exist extensions of HMC, such as Riemann Manifold HMC and Lagrangian HMC (see e.g. Girolami and Calderhead (2011); Lan et al. (2015)). Here we employ plain HMC because of its widespread use, e.g. via the software package Stan (Carpenter et al., 2015). An essential component of HMC is to evolve the Hamiltonian system derived from the posterior distribution while sampling is taking place. In practice the pair of differential equations that describes this system must be discretized, while retaining the volume-preserving and time-reversible nature of the exact solution (Neal, 2010). The most commonly used parametrized integration scheme for HMC is the leapfrog method. Blanes et al. (2014) have developed a framework for deriving more efficient integration schemes. Such schemes have the potential to perform better than the leapfrog method, particularly in high dimensions. The leapfrog method often requires a small step size to obtain reasonable acceptance probabilities, whereas the new integration schemes have the potential to allow larger step sizes while retaining good acceptance probabilities (Blanes et al., 2014). The purpose of this paper is to compare the numerical integrators of Blanes et al. (2014), and also a new numerical integrator, to the leapfrog method by considering realistic problems and experiments. The effective sample size per computing time is used as the performance measure. The results indicate that the new integrators perform better than the leapfrog method, with negligible additional implementation effort.

The paper is laid out as follows: Section 2 defines HMC, and in Section 3 the leapfrog method and the three other parameterized integration schemes are introduced. An asymptotic expression for the expected acceptance probability is found for each of the numerical schemes based on a Gaussian test problem. Section 4 embeds the integration schemes within the NUTS algorithm, and measures performance on a logistic regression model and a student t-model. Lastly, Section 5 contains some discussion.


2. Hamiltonian Monte Carlo

Denote the target (posterior) density by π(q), where q ∈ Rᵈ. The system is augmented with another random vector p ∈ Rᵈ. In the language of Hamiltonian dynamics, q(t) and p(t) represent respectively the position and momentum of a particle at time t. The Hamiltonian is denoted H(q, p) and can be expressed as (Neal, 2010):

H(q, p) = U(q) + ½ pᵀp,   (2.1)

where U(q) = − log(π(q)). The last part of equation (2.1) represents the negative log density of a N(0, I) distribution, with density f(x) = (1/√(2π))ᵈ exp(−½ xᵀx), where I is the identity matrix. Note that, strictly speaking, only a kernel proportional to the density π(q) is needed, not the density itself. The Hamiltonian governs the evolution of p(t) and q(t) via the following system of differential equations:

dq/dt = ∂H/∂p = p,   dp/dt = −∂H/∂q = −∇U(q),   (2.2)

where ∇U(q) is the gradient of U(q). Let (q, p) denote the current state of the sampler and (q*, p*) a proposal for the next. One iteration of the HMC algorithm involves the following steps. We start by drawing a new p ∼ N(0, I). Next, we solve equation (2.2) using L steps of the integration scheme (leapfrog or a similar time-reversible, volume-preserving method). After L steps we arrive at the proposed position and momentum, q* and p*. The proposal is accepted (Duane et al., 1987; Neal, 2010) with probability

α = min(1, exp(∆)),   (2.3)

where ∆ = H(q, p) − H(q*, p*). At each iteration p* is discarded after α has been computed.
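A minimal Python sketch of one such HMC iteration is given below. It is an illustration only, not the authors' implementation: the names U, grad_U and integrate are placeholders, and integrate stands for L steps of any volume-preserving, time-reversible scheme, such as the leapfrog method of Section 3.1.

```python
import numpy as np

def hmc_step(q, U, grad_U, integrate, eps, L, rng):
    """One HMC iteration: draw momentum, integrate, accept or reject.

    `integrate(q, p, grad_U, eps, L)` must be a volume-preserving,
    time-reversible map, e.g. the leapfrog scheme of Section 3.1."""
    p = rng.standard_normal(q.shape)             # p ~ N(0, I)
    h_current = U(q) + 0.5 * p @ p               # H(q, p) = U(q) + p'p/2
    q_star, p_star = integrate(q, p, grad_U, eps, L)
    h_proposed = U(q_star) + 0.5 * p_star @ p_star
    delta = h_current - h_proposed               # Delta of equation (2.3)
    if np.log(rng.uniform()) < min(0.0, delta):  # accept with prob. min(1, exp(Delta))
        return q_star                            # p* is discarded either way
    return q
```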


3. Symplectic numerical integration schemes

As described above, volume-preserving and time-reversible integrators are essential to HMC. Time is discretized in the differential equations (2.2) by introducing the step size ε. Each parameterized numerical scheme consists of partial steps updating either position or momentum. For each step it is indicated where we are in the process of going from time t to t + ε.

3.1. The leapfrog method

Consider the first integration scheme, known as the leapfrog method (Leimkuhler and Reich, 2004; Neal, 2010), expressed for notational simplicity for t = 0:

p(ε/2) = p(0) − (ε/2)∇U(q(0)),
q(ε) = q(0) + ε p(ε/2),   (3.1)
p(ε) = p(ε/2) − (ε/2)∇U(q(ε)).

Each partial step represents an update in either p or q, jointly resulting in an update for (q, p) at time ε. To obtain a proposal, (3.1) is repeated L times, taking the output of the previous iteration as input to the next. The leapfrog method is widely used in practice, and therefore constitutes a reasonable reference.

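As a concrete illustration, a Python sketch of L leapfrog steps follows; it assumes q and p are NumPy arrays and that grad_U returns ∇U, and it is not taken from the paper.

```python
import numpy as np

def leapfrog(q, p, grad_U, eps, L):
    """Apply L steps of the leapfrog scheme (3.1) to NumPy arrays q and p."""
    for _ in range(L):
        p = p - 0.5 * eps * grad_U(q)   # half step for momentum, p(eps/2)
        q = q + eps * p                 # full step for position,  q(eps)
        p = p - 0.5 * eps * grad_U(q)   # half step for momentum,  p(eps)
    return q, p
```

For a standard Gaussian target (U(q) = ½qᵀq, ∇U(q) = q), this function can be passed directly as the integrate argument of the HMC sketch in Section 2.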
3.2. Two-stage methods

Following the method suggested in Blanes et al. (2014), the next parametrized integration scheme, composed of two leapfrog steps, is as follows:

q(a1ε) = q(0) + a1ε p(0),
p(ε/2) = p(0) − (ε/2)∇U(q(a1ε)),
q((1 − a1)ε) = q(a1ε) + (1 − 2a1)ε p(ε/2),   (3.2)
p(ε) = p(ε/2) − (ε/2)∇U(q((1 − a1)ε)),
q(ε) = q((1 − a1)ε) + a1ε p(ε).

Following Blanes et al. (2014), the value of the parameter is a1 = (3 − √3)/6 (independent of the dimension d). This value is found by minimizing an expression for the energy error H(q, p) − H(q*, p*) with respect to a1. Inserting this a1 before performing any further calculations gives the integration scheme called the "two-stage" method.

The next scheme has the same basis (3.2), but we suggest an alternative criterion for determining the value of the parameter a1. Now, a1 is chosen so that the value of the expected acceptance probability α in (2.3) is maximized. As shown in Appendix A, the optimal value is a1 = (3 − √5)/4, slightly smaller than for the two-stage method. Inserting this a1 into equation (3.2) we get a new numerical scheme, which we name the "new two-stage" method. The new value of a1 is derived under the constraint that proposals are independent of the current state for a standard d-dimensional Gaussian model.
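A Python sketch of the parametrized two-stage step (3.2) is given below; it is an illustrative transcription rather than the authors' code, with the two parameter values entered as module-level constants.

```python
import numpy as np

A1_TWO_STAGE = (3.0 - np.sqrt(3.0)) / 6.0      # energy-error criterion, Blanes et al. (2014)
A1_NEW_TWO_STAGE = (3.0 - np.sqrt(5.0)) / 4.0  # maximises E(alpha), Appendix A

def two_stage(q, p, grad_U, eps, L, a1=A1_TWO_STAGE):
    """Apply L steps of the two-stage scheme (3.2) with parameter a1."""
    for _ in range(L):
        q = q + a1 * eps * p                    # q(a1*eps)
        p = p - 0.5 * eps * grad_U(q)           # p(eps/2)
        q = q + (1.0 - 2.0 * a1) * eps * p      # q((1 - a1)*eps)
        p = p - 0.5 * eps * grad_U(q)           # p(eps)
        q = q + a1 * eps * p                    # q(eps)
    return q, p
```

Calling two_stage(q, p, grad_U, eps, L, a1=A1_NEW_TWO_STAGE) gives the new two-stage method; nothing else changes.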

3.3. The three-stage method

The final integration scheme is called the three-stage method:

q(a1ε) = q(0) + a1ε p(0),
p(b1ε) = p(0) − b1ε ∇U(q(a1ε)),
q(ε/2) = q(a1ε) + (1/2 − a1)ε p(b1ε),
p((1 − b1)ε) = p(b1ε) − (1 − 2b1)ε ∇U(q(ε/2)),   (3.3)
q((1 − a1)ε) = q(ε/2) + (1/2 − a1)ε p((1 − b1)ε),
p(ε) = p((1 − b1)ε) − b1ε ∇U(q((1 − a1)ε)),
q(ε) = q((1 − a1)ε) + a1ε p(ε),

where the values of the parameters a1 and b1 are found in Blanes et al. (2014) to be a1 = 12127897/102017882 and b1 = 4271554/14421423 using the energy error criterion. All the introduced integration schemes are easy to implement as alternatives to the leapfrog method.
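The corresponding Python sketch of (3.3) follows; again an illustration (not the authors' code), with the Blanes et al. (2014) parameter values as defaults.

```python
A1_THREE_STAGE = 12127897.0 / 102017882.0      # Blanes et al. (2014)
B1_THREE_STAGE = 4271554.0 / 14421423.0

def three_stage(q, p, grad_U, eps, L, a1=A1_THREE_STAGE, b1=B1_THREE_STAGE):
    """Apply L steps of the three-stage scheme (3.3)."""
    for _ in range(L):
        q = q + a1 * eps * p                          # q(a1*eps)
        p = p - b1 * eps * grad_U(q)                  # p(b1*eps)
        q = q + (0.5 - a1) * eps * p                  # q(eps/2)
        p = p - (1.0 - 2.0 * b1) * eps * grad_U(q)    # p((1 - b1)*eps)
        q = q + (0.5 - a1) * eps * p                  # q((1 - a1)*eps)
        p = p - b1 * eps * grad_U(q)                  # p(eps)
        q = q + a1 * eps * p                          # q(eps)
    return q, p
```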


3.4. Asymptotic acceptance probabilities

Based on a d-dimensional standard Gaussian test problem, we compute asymptotic (in dimension) expressions for the expected acceptance probability when the integration time εL corresponds to proposals that are independent of the current state. [t]Table 1 near here [/t] Note that E(α) is 1 + O(ε⁴) for the new two-stage method, while it is 1 + O(ε²) for the other three. Hence, the new two-stage method has the highest acceptance rate for sufficiently small ε, owing to how the value of the parameter a1 was found.

Beskos et al. (2013) provide a general result that shows that ε should behave as O(d^(−1/4)). To obtain E(∆) = O(1) for εL = O(1), the results in Table 1 also indicate ε = O(d^(−1/4)) for leapfrog, two-stage and three-stage under the standard Gaussian model problem. If ε decreases faster we will waste computation, and if it decreases slower, the MCMC chain will stagnate and be ineffective. On the other hand, the acceptance probability for the new two-stage method implies ε = O(d^(−1/8)) to obtain E(∆) = O(1) for εL = O(1). This could indicate that the new two-stage numerical scheme is too specialized to the particular model problem used to find it, and will be further explored in Section 4. More details on the calculations in Table 1 are given in Appendix A.
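To make the comparison concrete, the sketch below evaluates the asymptotic E(α) expressions of Table 1 (as reconstructed above) numerically. It assumes SciPy is available and is only an illustration of the formulas, not code from the paper.

```python
import numpy as np
from scipy.stats import norm

def expected_alpha(scheme, eps, d):
    """Evaluate the asymptotic E(alpha) expressions of Table 1."""
    phi_argument = {
        "leapfrog":      eps**2 / 8.0,
        "two-stage":     (2.0 - np.sqrt(3.0)) / 24.0 * eps**2,
        "new two-stage": (5.0 * np.sqrt(5.0) - 11.0) / 64.0 * eps**4,
        "three-stage":   49087755.0 * np.sqrt(2.0) / 19422491137.0 * eps**2,
    }[scheme] * np.sqrt(d)
    return 2.0 - 2.0 * norm.cdf(phi_argument)

# The new two-stage scheme keeps E(alpha) closest to one as d grows with eps fixed,
# reflecting its O(eps^4) leading term.
for scheme in ("leapfrog", "two-stage", "new two-stage", "three-stage"):
    print(scheme, round(expected_alpha(scheme, eps=0.5, d=1000), 4))
```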

3.5. Comparing performance on the model problem

To add further insight into the results provided in Table 1, we compare a weighted measure of efficiency based on E(α) for different numbers of dimensions. Again, we use a d-dimensional standard Gaussian model problem. Let Υ denote the expected number of accepted movements per calculation time, as a measure of efficiency. For the leapfrog method we have Υ = E(α)/L, and for the two-stage methods Υ = E(α)/(2L), where the factor 2 in the denominator reflects the increase in the number of gradient evaluations. Correspondingly, the three-stage method has Υ = E(α)/(3L). The measure is considered for different values of the dimension d, and the maximum Υ within each dimension is the preferable one. For each dimension,

max_ε Υ(ε, d),   (3.4)

where ε is such that εL ≈ π/2 corresponds to independent proposals, is found for all four numerical schemes.

For each of the numerical schemes, Figure 1 shows how the maximum Υ develops as d increases. We see that the leapfrog method starts out the best, but is quickly overtaken by all the other schemes as d increases. It is also seen that the performance of the new two-stage method, which is of a higher order than the other schemes, stands out for high d. Figure 2 shows the ratio between Υ for each of the other numerical schemes and Υ for the leapfrog method. For each ratio, a value greater than one indicates that the suggested method (for the corresponding d) is the better of the two methods considered. Figure 2 thus gives a different illustration of how the new methods perform better than the leapfrog method as d increases. It is clear that the new two-stage method is the most efficient of all methods considered here. [f]Figure 1 near here [/f] [f]Figure 2 near here [/f] Based on these results, the way forward is to determine whether this advantage is relevant in practice. Thus we need to consider some realistic examples.
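A small Python sketch of this comparison follows. It maximizes Υ over ε numerically, using the asymptotic E(α) expressions of Table 1 and treating L = (π/2)/ε as continuous; it assumes SciPy and is illustrative only, not the code used to produce the figures.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

# (number of stages, power of eps, constant) in the Phi-argument of Table 1
SCHEMES = {
    "leapfrog":      (1, 2, 1.0 / 8.0),
    "two-stage":     (2, 2, (2.0 - np.sqrt(3.0)) / 24.0),
    "new two-stage": (2, 4, (5.0 * np.sqrt(5.0) - 11.0) / 64.0),
    "three-stage":   (3, 2, 49087755.0 * np.sqrt(2.0) / 19422491137.0),
}

def max_upsilon(scheme, d):
    """Maximise Upsilon = E(alpha) / (stages * L) over eps, with eps * L = pi/2."""
    stages, power, const = SCHEMES[scheme]
    def neg_upsilon(eps):
        L = (np.pi / 2.0) / eps                      # steps needed for eps*L = pi/2
        e_alpha = 2.0 - 2.0 * norm.cdf(const * eps**power * np.sqrt(d))
        return -e_alpha / (stages * L)
    res = minimize_scalar(neg_upsilon, bounds=(1e-4, 5.0), method="bounded")
    return -res.fun

for d in (10, 10**4, 10**8):
    print(d, {s: round(max_upsilon(s, d), 6) for s in SCHEMES})
```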

4. Simulation studies and realistic applications

Arguably, the d-dimensional standard Gaussian problem (which Table 1 relies on) is too simple for us to claim that the new integrators constitute improvements over leapfrog in practice. This section therefore considers realistic models. In the d-dimensional Gaussian case, we used proposals independent of the current state. In a general case that does not lend itself to analytic solution of (2.2), we resort to the No-U-Turn sampler for choosing appropriate integration times.

4.1. The No-U-Turn Sampler

The No-U-Turn Sampler (NUTS) was introduced by Hoffman and Gelman (2014) as an improvement to the HMC algorithm. One reason for using NUTS here is to avoid the sensitivity to user-specified parameters, because both too low and too high values of the step size ε and the number of steps L can cause problems (Hoffman and Gelman, 2014). The NUTS algorithm detects the point at which the HMC trajectory would start to turn back on itself and revert to random walk behaviour, by allowing no U-turns in the trajectory. This way the choice of L is eliminated. In addition, a method for adapting the step size ε by dual averaging is implemented in NUTS (Hoffman and Gelman, 2014). Using NUTS with adaptive selection of ε eliminates all the user-specified choices from Section 3. For the four numerical schemes, this means that their efficiency can be tested without user-specified parameters. All subsequent computations are carried out in MATLAB.

4.2. Simulation testing using NUTS and logistic regression model

As the first test case, we use the Bayesian logistic regression model and the collection of five data sets from Girolami and Calderhead (2011). The five data sets vary in the number of observations n from 250 to 1000 and in the number of parameters d from 7 to 25. The model layout is identical to Girolami and Calderhead (2011), where the prior is given as β ∼ N(0, 100I). The effective sample size (ESS) is found by following Girolami and Calderhead (2011) and is used in the performance measure ESS/CPU time. All the experiments are repeated ten times, so that the results in Table 2 are means over these ten runs. Table 2 presents the mean CPU times, the mean minimum, median and maximum ESS and the mean step size for all numerical integration schemes and data sets. The final two columns of Table 2 show minimum ESS/CPU time and median ESS/CPU time. The number of samples was 5000 with burn-in set to 1000 for all experiments. [t]Table 2 near here [/t]

In general, Table 2 shows that the higher order numerical schemes obtain higher ESS/CPU time than the leapfrog method. The two-stage method generally performs better than the new two-stage method. Four data sets have ESS values that reach the number of samples (5000), which means that there is little or no autocorrelation between the samples. Table 2 shows that the three-stage method or the two-stage method performs the best within each data set, as they have the largest minimum ESS/CPU time. The leapfrog method has the lowest value of minimum ESS/CPU time for all five data sets. The last column of Table 2, the median ESS/CPU time, confirms the same pattern as for the minimum ESS/CPU time. It is expected that the step size increases in proportion to the number of stages a method has. With regard to the step size, the adapted ε of the two-stage method is more than twice that of the leapfrog method for all data sets except "Ripley"; the same goes for the new two-stage method. Correspondingly, the ε of the three-stage method is more than three times that of the leapfrog method. For the "Ripley" data set this only holds for the new two-stage method, while the other two are very close. With these results we see that the suggested integration schemes outperform the leapfrog method also when considering a more realistic problem than the standard Gaussian test problem.
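For concreteness, a minimal sketch of the log-posterior kernel U(β) and its gradient for this model is given below, with design matrix X, binary responses y, and the N(0, 100I) prior; the exact data handling of Girolami and Calderhead (2011) is not reproduced, and the code is illustrative rather than the authors' MATLAB implementation.

```python
import numpy as np

def make_logistic_target(X, y, prior_var=100.0):
    """Return U(beta) and grad U(beta) for Bayesian logistic regression
    with design matrix X (n x d), binary responses y and a N(0, 100 I) prior."""
    def U(beta):
        eta = X @ beta
        # negative log-likelihood plus negative log-prior kernel
        return np.sum(np.logaddexp(0.0, eta) - y * eta) + beta @ beta / (2.0 * prior_var)
    def grad_U(beta):
        prob = 1.0 / (1.0 + np.exp(-(X @ beta)))   # P(y_i = 1 | beta)
        return X.T @ (prob - y) + beta / prior_var
    return U, grad_U
```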

4.3. Simulation testing using NUTS and student t-model

In addition to the logistic model in Section 4.2, we also consider a multivariate student t distribution. The target density kernel is given as

π(x) ∝ (1 + (1/ν) xᵀΣ⁻¹x)^(−(ν+d)/2),

where the degrees of freedom ν is set to 5 and the dimension d is 2, 10 and 100. Σ⁻¹ is the precision matrix associated with a Gaussian AR(1) model with autocorrelation 0.95 and unit innovation variance. All variations are run ten times and the results in Table 3 are the means taken over these ten runs. [t]Table 3 near here [/t]

From Table 3 we see that in general the two-stage method performs better than the new two-stage method. For each dimension the highest minimum and median ESS is obtained with the two-stage method or the three-stage method. However, because of the time cost of the two-stage methods, the leapfrog method gets a slightly higher value of med ESS/CPU time and min ESS/CPU time. The three-stage method requires less CPU time and gives the highest med ESS/CPU time and min ESS/CPU time of all four numerical integrators for dimensions d = 10 and d = 100. We see that for d = 100 the step sizes of the two-stage methods and the three-stage method are greater than two and three times the step size of the leapfrog method, respectively. For d = 10 this only holds for the three-stage method, and for d = 2 it holds for the two-stage method. Considering that the integration schemes are all easy to implement and work with, these results give reason to explore their use further.
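A Python sketch of this target is given below. The tridiagonal AR(1) precision matrix shown is one standard construction for a stationary AR(1) process with unit innovation variance; the paper does not spell out its form, so both it and the function names are assumptions made for illustration.

```python
import numpy as np

def ar1_precision(d, rho=0.95):
    """Tridiagonal precision matrix of a stationary Gaussian AR(1) process
    with autocorrelation rho and unit innovation variance."""
    Q = np.diag(np.full(d, 1.0 + rho**2))
    Q[0, 0] = Q[-1, -1] = 1.0
    idx = np.arange(d - 1)
    Q[idx, idx + 1] = Q[idx + 1, idx] = -rho
    return Q

def make_student_t_target(d, nu=5.0, rho=0.95):
    """Return U(x) = -log pi(x) (up to a constant) and its gradient for the
    multivariate student t kernel above."""
    Q = ar1_precision(d, rho)
    def U(x):
        return 0.5 * (nu + d) * np.log1p(x @ Q @ x / nu)
    def grad_U(x):
        return (nu + d) * (Q @ x) / (nu + x @ Q @ x)
    return U, grad_U
```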


5. Discussion

This paper makes the contribution of introducing and developing numerical schemes as alternatives to the leapfrog method. For a d-dimensional standard Gaussian model, all three new numerical integration schemes perform better than the original scheme as the dimension d increases. It is seen that a numerical integration scheme of a higher order, like the new two-stage method, stands out for the Gaussian model, especially for larger d. Using these schemes can require more costly computations because of the number of gradient evaluations. However, they are still superior to their precursor in HMC.

For the logistic regression model the new schemes were all shown to be more efficient than the leapfrog method. Even with the increased number of gradient evaluations, the results were positive regarding the efficiency of the suggested numerical integration schemes, for all five data sets considered. For the student t-model the suggested numerical integration schemes all obtain higher minimum and median ESS than the leapfrog method for d = 2 and d = 10. For d = 100, the ESS values are quite close, but only the three-stage method has a higher minimum ESS than the leapfrog method. The leapfrog method has a higher minimum and median ESS/CPU time than the two-stage methods. This can be because of the cost related to the gradient evaluations of the two-stage methods, which impacts the CPU time. Still, the three-stage method is the most efficient of all schemes as d increases. This means that also for the student t-model there can be an efficiency gain from replacing the leapfrog method with other numerical schemes.

These results give reason to explore the development of numerical schemes of the form seen in this paper. By doing this, the performance and efficiency of HMC can be improved, especially as the dimension increases.


Appendix A. Details

The U(q) used when obtaining the results in Section 3 is U(q) = ½qᵀq, and the Hamiltonian then is

H = ½qᵀq + ½pᵀp,

where q, p ∈ Rᵈ. Owing to the fact that the gradients of H with respect to both q and p are linear, H together with a numerical integration scheme gives a 2 × 2 matrix M (one for each scheme) whose elements change with the step size ε, so that

(qᵢ(ε), pᵢ(ε))ᵀ = M (qᵢ(0), pᵢ(0))ᵀ,   i = 1, . . . , d.   (A.1)

Using the leapfrog scheme as an example, the matrix M in (A.1) is

M = ( 1 − ε²/2        ε
      −ε + ε³/4    1 − ε²/2 ).

The chosen integration scheme is applied L times, so that each element of the proposal is obtained as

(qᵢ*, pᵢ*)ᵀ = M^L (qᵢ(0), pᵢ(0))ᵀ,   i = 1, . . . , d.

To obtain proposals that are independent of the current state, we consider combinations of ε and L such that the diagonal elements of M^L are zero (i.e. εL ≈ π/2), and get

(qᵢ*, pᵢ*)ᵀ = ( 0    r_L
               s_L    0 ) (qᵢ(0), pᵢ(0))ᵀ.

For the leapfrog method, r_L and s_L are

r_L = 1 + (1/8)ε² + (3/128)ε⁴ + (5/1024)ε⁶ + O(ε⁸),
s_L = −1 + (1/8)ε² + (1/128)ε⁴ + (1/1024)ε⁶ + O(ε⁸).


Now, recall the acceptance probability α:

α = min(1, exp(∆)),

where ∆ = H(q, p) − H(q*, p*). For dimension d, ∆ simplifies to

∆ = ½ Σ_{i=1}^d [(1 − s_L²)qᵢ² + (1 − r_L²)pᵢ²].

Note that from the expressions for r_L and s_L we have that r_L → 1 and s_L → −1 as ε → 0 (equivalently L → ∞, since εL ≈ π/2), and as a result ∆ → 0. This is as expected, because it indicates that the acceptance probability approaches one. Owing to the fact that qᵢ, pᵢ, i = 1, . . . , d, are iid standard normal, expressions for µ = E(∆) and σ² = Var(∆) can be found for each of the integration schemes. Further, we use the central limit theorem, so that

(∆ − µ)/σ → N(0, 1) as d → ∞,

and this forms the basis for the formulas in Table 1.
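These calculations are easy to check numerically. The sketch below (illustrative, not the authors' code) builds the leapfrog matrix M, verifies that the diagonal of M^L vanishes when εL ≈ π/2, compares r_L with its series expansion, and checks the leapfrog row of Table 1 by simulating ∆ for iid standard normal qᵢ, pᵢ; it assumes NumPy and SciPy.

```python
import numpy as np
from scipy.stats import norm

def leapfrog_matrix(eps):
    """One-step leapfrog matrix M for H = (q'q + p'p)/2, as given above."""
    return np.array([[1.0 - eps**2 / 2.0, eps],
                     [-eps + eps**3 / 4.0, 1.0 - eps**2 / 2.0]])

# Choose eps so that L leapfrog steps rotate by exactly pi/2 (independent proposals):
# cos(theta) = 1 - eps^2/2 with L*theta = pi/2 gives eps = 2*sin(theta/2).
L = 5
theta = np.pi / (2 * L)
eps = 2.0 * np.sin(theta / 2.0)

ML = np.linalg.matrix_power(leapfrog_matrix(eps), L)
r_L, s_L = ML[0, 1], ML[1, 0]
print(np.round(ML, 12))                                            # diagonal is (numerically) zero
print(r_L, 1 + eps**2 / 8 + 3 * eps**4 / 128 + 5 * eps**6 / 1024)  # series expansion of r_L

# Monte Carlo check of E(alpha) against the leapfrog row of Table 1 (approximate).
rng = np.random.default_rng(1)
d, reps = 1000, 5000
q = rng.standard_normal((reps, d))
p = rng.standard_normal((reps, d))
delta = 0.5 * ((1 - s_L**2) * q**2 + (1 - r_L**2) * p**2).sum(axis=1)
print(np.minimum(1.0, np.exp(delta)).mean(),                 # simulated E(alpha)
      2 - 2 * norm.cdf(eps**2 * np.sqrt(d) / 8.0))           # asymptotic formula, Table 1
```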


References

Beskos, A., N. Pillai, G. Roberts, J.-M. Sanz-Serna, A. Stuart, et al. (2013). Optimal tuning of the hybrid Monte Carlo algorithm. Bernoulli 19(5A), 1501–1534.

Blanes, S., F. Casas, and J. Sanz-Serna (2014). Numerical integrators for the hybrid Monte Carlo method. SIAM Journal on Scientific Computing 36(4), A1556–A1580.

Carpenter, B., A. Gelman, M. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. A. Brubaker, J. Guo, P. Li, and A. Riddell (2015). Stan: a probabilistic programming language. Journal of Statistical Software.

Duane, S., A. D. Kennedy, B. J. Pendleton, and D. Roweth (1987). Hybrid Monte Carlo. Physics Letters B 195(2), 216–222.

Girolami, M. and B. Calderhead (2011). Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73(2), 123–214.

Hoffman, M. D. and A. Gelman (2014). The No-U-Turn Sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research 15(1), 1593–1623.

Lan, S., V. Stathopoulos, B. Shahbaba, and M. Girolami (2015). Markov chain Monte Carlo from Lagrangian dynamics. Journal of Computational and Graphical Statistics 24(2), 357–378.

Leimkuhler, B. and S. Reich (2004). Simulating Hamiltonian dynamics, Volume 14. Cambridge University Press.

Liu, J. S. (2008). Monte Carlo strategies in scientific computing. Springer Science & Business Media.

Neal, R. M. (2010). MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo 54, 113–162.


Figure 1: Logarithm of the maximum expected number of accepted movements per calculation time, Υ, as a function of log(d), for the leapfrog, two-stage, new two-stage and three-stage methods.


Figure 2: Ratios of the maximum expected number of accepted movements per calculation time (two-stage/leapfrog, new two-stage/leapfrog and three-stage/leapfrog) as a function of log(d).


Table 1: Asymptotic expected acceptance probability E(α).

Numerical scheme | E(α) | E_New(α)/E_Leapfrog(α)
Leapfrog | 2 − 2Φ((1/8)ε²√d) | 1
Two-stage | 2 − 2Φ(((2 − √3)/24)ε²√d) | ≈ 1 + 0.0908ε²√d + O(ε⁴)
New two-stage | 2 − 2Φ(((5√5 − 11)/64)ε⁴√d) | ≈ 1 + 0.0997ε²√d + O(ε⁴)
Three-stage | 2 − 2Φ((49087755√2/19422491137)ε²√d) | ≈ 1 + 0.0969ε²√d + O(ε⁴)

Table 2: ESS and CPU times by numerical scheme for logistic regression applied to data sets from Girolami and Calderhead (2011). The values in bold face in the original represent the best results within each data set.

Numerical scheme | CPU time (s) | Integrator ε | min ESS | med ESS | max ESS | min ESS/CPU time | med ESS/CPU time

Australian credit data set (d = 15, n = 690)
Leapfrog | 15.45 | 0.0742 | 336 | 5000 | 5000 | 21.74 | 323.62
Two-stage | 14.01 | 0.1911 | 465 | 5000 | 5000 | 33.19 | 356.89
New two-stage | 14.50 | 0.1734 | 402 | 5000 | 5000 | 27.72 | 344.83
Three-stage | 23.29 | 0.2979 | 1376 | 5000 | 5000 | 59.08 | 214.68

German credit data set (d = 25, n = 1000)
Leapfrog | 20.64 | 0.0386 | 2019 | 4755 | 5000 | 97.82 | 230.38
Two-stage | 16.53 | 0.1110 | 2251 | 5000 | 5000 | 136.18 | 302.48
New two-stage | 17.39 | 0.0999 | 2121 | 5000 | 5000 | 121.97 | 287.52
Three-stage | 17.10 | 0.1866 | 4986 | 5000 | 5000 | 291.58 | 292.40

Heart data set (d = 14, n = 270)
Leapfrog | 7.63 | 0.1106 | 3364 | 4914 | 5000 | 440.89 | 644.04
Two-stage | 6.46 | 0.2772 | 4950 | 5000 | 5000 | 766.25 | 733.99
New two-stage | 6.22 | 0.2533 | 4369 | 5000 | 5000 | 702.41 | 803.86
Three-stage | 7.86 | 0.4390 | 5000 | 5000 | 5000 | 636.13 | 636.13

Pima Indian data set (d = 8, n = 532)
Leapfrog | 12.52 | 0.0725 | 3687 | 4666 | 5000 | 294.49 | 372.68
Two-stage | 8.95 | 0.1795 | 5000 | 5000 | 5000 | 558.66 | 558.66
New two-stage | 9.07 | 0.1598 | 4365 | 4990 | 5000 | 481.26 | 550.17
Three-stage | 9.38 | 0.2886 | 4986 | 5000 | 5000 | 531.56 | 508.65

Ripley data set (d = 7, n = 250)
Leapfrog | 18.18 | 0.1263 | 1351 | 1917 | 2793 | 74.31 | 105.45
Two-stage | 17.37 | 0.2488 | 1513 | 2121 | 2967 | 87.10 | 122.12
New two-stage | 18.50 | 0.2661 | 1547 | 2142 | 2787 | 83.62 | 115.78
Three-stage | 16.55 | 0.3787 | 1765 | 2554 | 4144 | 106.65 | 154.32

Table 3: ESS and CPU times when using the four numerical schemes with a d-dimensional student t-model. The values in bold face in the original represent the best results within each dimension.

Numerical scheme | CPU time (s) | Integrator ε | min ESS | med ESS | max ESS | min ESS/CPU time | med ESS/CPU time

Dimension d = 2
Leapfrog | 13.80 | 0.7897 | 859 | 876 | 893 | 62.25 | 63.48
Two-stage | 18.34 | 1.5863 | 962 | 969 | 977 | 52.45 | 52.84
New two-stage | 20.41 | 1.3941 | 963 | 975 | 987 | 47.18 | 47.77
Three-stage | 15.90 | 2.3257 | 955 | 963 | 971 | 60.06 | 60.57

Dimension d = 10
Leapfrog | 48.59 | 0.4027 | 577 | 633 | 747 | 11.87 | 13.03
Two-stage | 59.89 | 0.8027 | 605 | 645 | 748 | 10.10 | 10.77
New two-stage | 69.54 | 0.7027 | 614 | 651 | 749 | 8.83 | 9.36
Three-stage | 54.68 | 1.2399 | 696 | 731 | 845 | 12.72 | 13.37

Dimension d = 100
Leapfrog | 853 | 0.2615 | 1180 | 1552 | 2416 | 1.38 | 1.82
Two-stage | 1021 | 0.6171 | 1156 | 1533 | 2372 | 1.13 | 1.50
New two-stage | 1162 | 0.5647 | 1242 | 1537 | 2353 | 1.07 | 1.32
Three-stage | 699 | 1.0762 | 1228 | 1521 | 2333 | 1.75 | 2.18