
Parallel variational Bayes for large datasets with an application to generalized linear mixed models

Minh-Ngoc Tran, David J. Nott, Anthony Y.C. Kuk and Robert Kohn∗

Abstract

The article develops a hybrid Variational Bayes algorithm that combines the mean-field and stochastic linear regression fixed-form Variational Bayes methods. The new estimation algorithm can be used to approximate any posterior without relying on conjugate priors. We propose a divide and recombine strategy for the analysis of large datasets, which partitions a large dataset into smaller subsets and then combines the variational distributions that have been learnt in parallel on each separate subset using the hybrid Variational Bayes algorithm. We also describe an efficient model selection strategy using cross validation, which is straightforward to implement as a by-product of the parallel run. The proposed method is applied to fitting generalized linear mixed models. The computational efficiency of the parallel and hybrid Variational Bayes algorithm is demonstrated on several simulated and real datasets.

Keywords. Divide and Recombine, Fixed-form Variational Bayes, Improved Variational Bayes approximation, Mean-field Variational Bayes, Parallelization.

1 Introduction

Variational Bayes (VB) methods are increasingly used in machine learning and statistics as a computationally efficient alternative to Markov Chain Monte Carlo (MCMC) simulation for approximating posterior distributions in Bayesian inference. See, for example, Bishop (2006); Ormerod and Wand (2010). VB algorithms can be categorized into two main groups: the mean-field Variational Bayes (MFVB) algorithms (Attias, 1999; Waterhouse et al., 1996; Ghahramani and Beal, 2001) and the fixed-form Variational Bayes (FFVB) algorithms (Honkela et al., 2010; Salimans and Knowles, 2013). Mean-field VB refers to VB algorithms that factorize the VB distribution into factors without assuming any fixed functional forms of the factors, with the forms of the optimal factors automatically determined by using conjugate priors (Ormerod and Wand, 2010). Fixed-form VB refers to VB algorithms that assume a fixed functional form for the VB distribution, with the parameters optimized using some optimization procedure such as stochastic gradient descent search (Honkela et al., 2010; Salimans and Knowles, 2013). The MFVB algorithm provides an efficient and convenient iterative scheme for updating the variational parameters, but in its exact form it requires conjugate priors and therefore rules out some interesting models. In applications of VB a common strategy is to combine mean-field steps where these are convenient with fixed-form steps. The convergence of the whole updating procedure is guaranteed as long as the lower bound is increased after each fixed-form update step; see Section 2. This article develops a VB algorithm that uses this combination strategy and where the stochastic search FFVB method of Salimans and Knowles (2013) (see Section 3) is used within a MFVB procedure for updating variational distribution factors that do not have a conjugate form. Related work by Waterhouse et al. (1996) and Wang and Blei (2013) used the Laplace approximation for updating non-conjugate variational factors, and Knowles and Minka (2011) introduced the non-conjugate variational message passing framework for variational Bayes with approximations in the exponential family when the lower bound can be approximated in some way. Braun and McAuliffe (2010) and Wang and Blei (2013) also consider approximating the lower bound in non-conjugate models using the delta method. Tan and Nott (2013a) extend the stochastic variational inference approach of Hoffman et al. (2013) by combining non-conjugate variational message passing with algorithms from stochastic optimization which work with mini-batches of data, and apply the idea to non-conjugate generalized linear mixed models. We refer to the algorithm we develop as the stochastic fixed-form within mean-field Variational Bayes algorithm, or the hybrid Variational Bayes algorithm. The new algorithm can be used to conveniently and efficiently approximate any posterior without relying on the conjugacy assumption.

∗ Minh-Ngoc Tran is Research Fellow, Australian School of Business, University of New South Wales, Sydney 2052 Australia. David J. Nott is Associate Professor, Department of Statistics and Applied Probability, National University of Singapore, Singapore 117546. Anthony Y.C. Kuk is Professor, Department of Statistics and Applied Probability, National University of Singapore, Singapore 117546. Robert Kohn is Professor, Australian School of Business, University of New South Wales, Sydney 2052 Australia.

The second contribution of this article is to propose a parallel VB procedure that uses a divide and recombine strategy for the analysis of large datasets based on exponential family variational Bayes posterior approximations. We borrow the terminology “divide and recombine” from Guha et al. (2012). The idea is to partition a large dataset into smaller subsets and learn the variational distribution in parallel on each separate subset using the hybrid Variational Bayes algorithm. The resulting variational distributions are then recombined to construct the final approximation of the posterior. The recombination is particularly easy for posterior approximations in the exponential family. The methodology proposed in our article is closely related to the methodology proposed independently in a recent preprint by Broderick et al. (2013). The main difference is that they develop the methodology in an online setting in which the data subsets arrive sequentially in time, while we describe the method in a static setting in which the whole dataset has already been collected. Furthermore, we show how to use the parallel divide and recombine strategy for model selection using cross validation. We also study empirically the effect of the number of data subsets and recommend a good number to use in practice. Other online approaches to variational Bayes have been considered by Sato (2001), Tchumtchoua et al. (2011) and Luts et al. (2013).

Recently, there have been several articles working with a random subset of the full dataset at a time as a way to speed up computations (Maclaurin and Adams, 2014; Quiroz et al., 2014). These methods require access to the full dataset, which must remain accessible throughout, and therefore require frequent communication between the working machines. It is important to note that our parallel VB is memory efficient in the sense that the full dataset need not remain accessible and only minimal communication between the machines is needed. After a one-time partitioning of the full data into small subsets, VB procedures are run on multiple machines without communicating between them, except for the recombine step. This is similar to the parallel MCMC method of Neiswanger et al. (2014), which first partitions the full dataset into subsets and runs separate MCMC chains to sample from the sub-posteriors based on subsets, then constructs an approximation of the full-data posterior using the MCMC chains.

A drawback of mean-field VB is that the dependence between the parameter blocks in the factorization is ignored. As a result, VB often underestimates the variances of the posterior distributions. The third contribution of this article is to propose a simple modification of the original VB approximation that improves its approximation accuracy by accounting for the posterior dependence between (some of) the blocks of the parameters.

As a main application of the proposed methods, we derive a detailed algorithm for fitting generalized linear mixed models (GLMMs). GLMMs are often considered difficult to estimate because of the presence of random effects and lack of conjugate priors. VB schemes for GLMMs were considered previously by Rijmen and Vomlel (2008); Ormerod and Wand (2012) and Tan and Nott (2013b), and shown to have attractive computational and accuracy trade-offs.

The computational efficiency and accuracy of the proposed method are demonstrated on several simulated and real datasets. We show how the method can be used to handle large datasets of tens of millions of observations, and even more, in several minutes.

The rest of the paper is organized as follows. Section 2 first provides the background to VB methods, then presents the hybrid VB algorithm and the method for improving on the VB approximation. Section 3 reviews the fixed-form VB method of Salimans and Knowles (2013) that we use for updating the non-conjugate variational factors within the mean-field VB algorithm. Section 4 presents the parallel implementation idea for handling large datasets. The detailed parallel and hybrid Variational Bayes algorithm for fitting GLMMs is presented in Section 5. Section 6 reports the results of a simulation study and the analysis of real data examples.

2 Some Variational Bayes theory

Suppose we have data $y$, a likelihood $p(y|\theta)$ where $\theta \in \mathbb{R}^d$ is an unknown parameter, and a prior distribution $p(\theta)$ for $\theta$. Variational Bayes (VB) approximates the posterior $p(\theta|y) \propto p(\theta)p(y|\theta)$ by a distribution $q(\theta)$ within some more tractable class, chosen to minimize the Kullback-Leibler divergence

\[
\mathrm{KL}(q\|p) = \int q(\theta) \log \frac{q(\theta)}{p(\theta|y)}\, d\theta. \tag{1}
\]

We have

\[
\log p(y) = \int q(\theta) \log \frac{p(y,\theta)}{q(\theta)}\, d\theta + \int q(\theta) \log \frac{q(\theta)}{p(\theta|y)}\, d\theta = L(q) + \mathrm{KL}(q\|p),
\]

where

\[
L(q) = \int q(\theta) \log \frac{p(y,\theta)}{q(\theta)}\, d\theta. \tag{2}
\]

As $\mathrm{KL}(q\|p) \ge 0$, $\log p(y) \ge L(q)$ for every $q(\theta)$. $L(q)$ is therefore often called the lower bound, and minimizing $\mathrm{KL}(q\|p)$ is equivalent to maximizing $L(q)$.
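The decomposition $\log p(y) = L(q) + \mathrm{KL}(q\|p)$ can be checked numerically on a small discrete example. The sketch below is only an illustration: the three-point parameter space and all probabilities are made-up values, not taken from the paper, and the integrals in (1) and (2) become finite sums.

```python
import numpy as np

# Toy discrete model: theta takes three values; all numbers are illustrative.
prior = np.array([0.5, 0.3, 0.2])      # p(theta)
lik = np.array([0.10, 0.40, 0.70])     # p(y | theta) for the observed y
joint = prior * lik                    # p(y, theta)
evidence = joint.sum()                 # p(y)
posterior = joint / evidence           # p(theta | y)

q = np.array([0.2, 0.3, 0.5])          # an arbitrary variational distribution

lower_bound = np.sum(q * np.log(joint / q))   # L(q) as in (2)
kl = np.sum(q * np.log(q / posterior))        # KL(q || p) as in (1)

print(np.log(evidence), lower_bound + kl)     # the two numbers agree
```

Any other choice of `q` with positive entries summing to one gives the same total, with a smaller `kl` corresponding to a larger `lower_bound`.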

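Often factorized approximations to the posterior are considered in variational Bayes. We explain the idea for a factorization with 2 blocks. Assume that $\theta = (\theta_1,\theta_2)$ and that $q(\theta)$ is factorized as

\[
q(\theta) = q_1(\theta_1)\, q_2(\theta_2). \tag{3}
\]

We further assume that $q_1(\theta_1) = q_{\lambda_1}(\theta_1)$ and $q_2(\theta_2) = q_{\lambda_2}(\theta_2)$ where $\lambda_1$ and $\lambda_2$ are variational parameters that need to be estimated. Then

\[
\begin{aligned}
L(\lambda_1,\lambda_2) = L(q) &= \int\!\!\int q_{\lambda_1}(\theta_1) q_{\lambda_2}(\theta_2) \log p(y,\theta)\, d\theta_1 d\theta_2 - \int q_{\lambda_1}(\theta_1) \log q_{\lambda_1}(\theta_1)\, d\theta_1 + C(\lambda_2) \\
&= \int q_{\lambda_1}(\theta_1) \left( \int q_{\lambda_2}(\theta_2) \log p(y,\theta)\, d\theta_2 \right) d\theta_1 - \int q_{\lambda_1}(\theta_1) \log q_{\lambda_1}(\theta_1)\, d\theta_1 + C(\lambda_2) \\
&= \int q_{\lambda_1}(\theta_1) \log \tilde p_1(y,\theta_1)\, d\theta_1 - \int q_{\lambda_1}(\theta_1) \log q_{\lambda_1}(\theta_1)\, d\theta_1 + C(\lambda_2) \\
&= \int q_{\lambda_1}(\theta_1) \log \frac{\tilde p_1(y,\theta_1)}{q_{\lambda_1}(\theta_1)}\, d\theta_1 + C(\lambda_2),
\end{aligned}
\]

where $C(\lambda_2)$ is a constant depending only on $\lambda_2$ and

\[
\tilde p_1(y,\theta_1) = \exp\left( \int q_{\lambda_2}(\theta_2) \log p(y,\theta)\, d\theta_2 \right) = \exp\left( E_{-\theta_1}(\log p(y,\theta)) \right).
\]

Let

\[
\lambda_1^* = \lambda_1^*(\lambda_2) = \arg\max_{\lambda_1} \left\{ \int q_{\lambda_1}(\theta_1) \log \frac{\tilde p_1(y,\theta_1)}{q_{\lambda_1}(\theta_1)}\, d\theta_1 \right\}; \tag{4}
\]

then

\[
L(\lambda_1^*, \lambda_2) \ge L(\lambda_1, \lambda_2) \quad \text{for all } \lambda_1. \tag{5}
\]

Similarly, let

\[
\lambda_2^* = \lambda_2^*(\lambda_1) = \arg\max_{\lambda_2} \left\{ \int q_{\lambda_2}(\theta_2) \log \frac{\tilde p_2(y,\theta_2)}{q_{\lambda_2}(\theta_2)}\, d\theta_2 \right\}, \tag{6}
\]

with

\[
\tilde p_2(y,\theta_2) = \exp\left( \int q_{\lambda_1}(\theta_1) \log p(y,\theta)\, d\theta_1 \right) = \exp\left( E_{-\theta_2}(\log p(y,\theta)) \right).
\]

Then,

\[
L(\lambda_1, \lambda_2^*) \ge L(\lambda_1, \lambda_2) \quad \text{for all } \lambda_2. \tag{7}
\]

Let $\lambda^{old} = (\lambda_1^{old}, \lambda_2^{old})$ and $\lambda_1^{new} = \lambda_1^*(\lambda_2^{old})$ as in (4) and $\lambda_2^{new} = \lambda_2^*(\lambda_1^{new})$ as in (6). Then,

\[
L(\lambda^{new}) \ge L(\lambda^{old}). \tag{8}
\]

This leads to an iterative scheme for updating $\lambda$ and (8) ensures the improvement of the lower bound over the iterations. Because the lower bound $L(\lambda)$ is bounded from above, the convergence of the iterative scheme is ensured under some mild conditions. The above argument can be easily extended to the general case in which $q(\theta)$ is factorized into $K$ blocks $q(\theta) = q_1(\theta_1)\times\dots\times q_K(\theta_K)$.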

The variational Bayes approximation is now reduced to solving an optimization problem in the form of (4). In many cases, a conjugate prior $p(\theta_1)$ can be selected such that $\tilde p_1(y,\theta_1)$ belongs to a recognizable density family. Then the optimal VB posterior $q_{\lambda_1}(\theta_1)$ that maximizes the integral on the right hand side of (4) is proportional to $\tilde p_1(y,\theta_1)$, i.e.

\[
q_{\lambda_1^*}(\theta_1) \propto \tilde p_1(y,\theta_1) = \exp\left( E_{-\theta_1}(\log p(y,\theta)) \right), \tag{9}
\]

and $\lambda_1^*$ is determined accordingly. In such cases, the resulting iterative procedure is often referred to as the mean-field Variational Bayes (MFVB) algorithm. The MFVB is computationally convenient but it is not easy to apply to some interesting models involving non-conjugate priors.
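As a concrete illustration of the update (9), consider a toy target that is not from the paper: a bivariate normal "posterior" with mean `m` and precision matrix `Lam`. For this target the optimal factors are normal with precisions `Lam[0, 0]` and `Lam[1, 1]`, and only their means change during coordinate ascent (a standard textbook result; see Bishop, 2006). The sketch below, with illustrative names and values, iterates these updates.

```python
import numpy as np

# Illustrative target (not from the paper): N(m, Sigma) with correlation 0.8.
m = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.8], [0.8, 1.0]])
Lam = np.linalg.inv(Sigma)            # precision matrix of the target

# The mean-field factors q1 and q2 are normal with fixed precisions
# Lam[0, 0] and Lam[1, 1]; only their means change across the sweeps.
mu = np.zeros(2)                      # initial variational means
for _ in range(100):
    mu[0] = m[0] - Lam[0, 1] / Lam[0, 0] * (mu[1] - m[1])   # update q1 given q2
    mu[1] = m[1] - Lam[1, 0] / Lam[1, 1] * (mu[0] - m[0])   # update q2 given q1

q_var = 1.0 / np.diag(Lam)            # variances of the two factors
print(mu, q_var)
```

The variational means converge to the target mean, while each factor variance is about 0.36 against a true marginal variance of 1, which is the variance underestimation that motivates the improvement of Section 2.1.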

If $\tilde p_1(y,\theta_1)$ does not belong to a recognizable density family, some optimization technique is needed to solve (4). Note that (4) has exactly the same form as the original VB problem that attempts to maximize $L(q)$ in (2). We can first select a functional form for the variational distribution $q$ and then estimate the unknown parameters accordingly. Such a method is known in the literature as the fixed-form Variational Bayes (FFVB) algorithm. If the variational distribution is assumed to belong to the exponential family with unknown parameters $\lambda$, Salimans and Knowles (2013) propose a stochastic approximation method for solving for $\lambda$. The details of this method are presented in Section 3. We can therefore use a FFVB algorithm within a MFVB procedure to solve (4), and the convergence of the whole procedure is still guaranteed because of (8). Interestingly, this procedure is similar in spirit to the popular Metropolis-Hastings within Gibbs sampling in Markov Chain Monte Carlo simulation, in the sense that the FFVB update step, which is similar to the Metropolis-Hastings sampling step, is carried out whenever a closed form update is not available.

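2.1 Improving the VB approximation

A drawback of VB is the assumption of posterior independence between $\theta_1$ and $\theta_2$. This often causes VB to underestimate the posterior variances of $\theta_1$ and $\theta_2$. Suppose that the full conditional $p(\theta_2|\theta_1,y)$ is a standard distribution in $\theta_2$ or it is straightforward to sample from it. Then, once the VB approximation $q(\theta) = q_1(\theta_1)q_2(\theta_2)$ has been learnt, we suggest replacing $q(\theta)$ by $\tilde q(\theta) = q_1(\theta_1)p(\theta_2|\theta_1,y)$. The improved VB approximation $\tilde q(\theta)$ is better than the original VB approximation $q(\theta)$ in the sense that

\[
\begin{aligned}
\mathrm{KL}(q\|p) - \mathrm{KL}(\tilde q\|p) &= \int q_1(\theta_1)q_2(\theta_2) \log \frac{q_1(\theta_1)q_2(\theta_2)}{p(\theta_1|y)p(\theta_2|\theta_1,y)}\, d\theta_1 d\theta_2 \\
&\quad - \int q_1(\theta_1)p(\theta_2|\theta_1,y) \log \frac{q_1(\theta_1)p(\theta_2|\theta_1,y)}{p(\theta_1|y)p(\theta_2|\theta_1,y)}\, d\theta_1 d\theta_2 \\
&= \int q_1(\theta_1) \int \left\{ \big(q_2(\theta_2) - p(\theta_2|\theta_1,y)\big) \log \frac{q_1(\theta_1)}{p(\theta_1|y)} + q_2(\theta_2) \log \frac{q_2(\theta_2)}{p(\theta_2|\theta_1,y)} \right\} d\theta_2\, d\theta_1 \\
&= \int q_1(\theta_1) \left( \int q_2(\theta_2) \log \frac{q_2(\theta_2)}{p(\theta_2|\theta_1,y)}\, d\theta_2 \right) d\theta_1 \;\ge\; 0.
\end{aligned}
\]

The first term in the braces vanishes after integrating over $\theta_2$ because both $q_2(\theta_2)$ and $p(\theta_2|\theta_1,y)$ integrate to one, and the final expression is an average over $q_1$ of Kullback-Leibler divergences, which is non-negative.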
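The inequality can be verified in closed form on a toy example that is not from the paper: a zero-mean bivariate normal "posterior" with correlation `rho`, for which the mean-field factors have variance `1 - rho**2` and the full conditional of $\theta_2$ given $\theta_1$ is normal. The helper `kl_gauss` and all values below are illustrative assumptions.

```python
import numpy as np

def kl_gauss(mu0, S0, mu1, S1):
    """KL( N(mu0, S0) || N(mu1, S1) ) for multivariate normals."""
    k = len(mu0)
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - k
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

rho = 0.8
mu = np.zeros(2)
Sigma = np.array([[1.0, rho], [rho, 1.0]])   # toy "posterior" p(theta | y)
v = 1.0 - rho**2                             # mean-field variance of each factor

S_q = np.diag([v, v])                        # covariance of q = q1 * q2
# q_tilde = q1(theta1) * p(theta2 | theta1, y), where under the target
# theta2 | theta1 ~ N(rho * theta1, 1 - rho^2).
S_qt = np.array([[v, rho * v],
                 [rho * v, rho**2 * v + (1.0 - rho**2)]])

kl_q = kl_gauss(mu, S_q, mu, Sigma)          # KL(q || p)
kl_qt = kl_gauss(mu, S_qt, mu, Sigma)        # KL(q_tilde || p)
print(kl_q, kl_qt)
```

In this example the remaining divergence of $\tilde q$ equals the divergence between $q_1$ and the true marginal of $\theta_1$, so the only error left is the one in the first block.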

In the general case in which $\theta$ is factorized into $K$ blocks, assume that for some $1 \le k < K$, the full conditionals $p(\theta_{k+1}|\theta_1,\dots,\theta_k,y),\dots,p(\theta_K|\theta_1,\dots,\theta_{K-1},y)$ are standard distributions or it is straightforward to sample from them. Then, once the VB approximation $q(\theta) = q_1(\theta_1)\times\dots\times q_K(\theta_K)$ has been learnt, we can replace $q(\theta)$ by

\[
\tilde q(\theta) = q_1(\theta_1) \times \dots \times q_k(\theta_k)\, p(\theta_{k+1}|\theta_1,\dots,\theta_k,y) \times \dots \times p(\theta_K|\theta_1,\dots,\theta_{K-1},y).
\]

If the full conditionals are standard distributions, then it might be possible to work with the collapsed model in which the blocks $\theta_{k+1},\dots,\theta_K$ are integrated out, i.e. we approximate the posterior of $\theta_1,\dots,\theta_k$ only. Working with the collapsed model will remove the influence of the factorization of the blocks which are not integrated out. It is therefore advantageous to work with the collapsed model if possible. However, in some cases we cannot or do not wish to work with the collapsed model. This is because some of the full conditionals can be easy to sample from but are not of standard forms which allow analytic integrals, and the collapsed model may break the structure that allows the convenient mean-field update.

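3 Fixed-form Variational Bayes method of Salimans and Knowles

Salimans and Knowles (2013) approximate the posterior $p(\theta|y) \propto p(\theta)p(y|\theta)$ by a density, with respect to some base measure which we take as Lebesgue measure for simplicity, which is in the exponential family

\[
q_\lambda(\theta) = \exp\left\{ S(\theta)^T\lambda - Z(\lambda) \right\},
\]

where $\lambda$ is a vector of natural parameters, $S(\theta)$ denotes a vector of sufficient statistics for the given exponential family and $Z(\lambda)$ is a normalization term. The parameter vector $\lambda$ is chosen by minimizing the Kullback-Leibler divergence

\[
\mathrm{KL}(\lambda) = \int \log \frac{q_\lambda(\theta)}{p(\theta|y)}\, q_\lambda(\theta)\, d\theta = \int \left\{ S(\theta)^T\lambda - Z(\lambda) - \log p(\theta|y) \right\} \exp\left\{ S(\theta)^T\lambda - Z(\lambda) \right\} d\theta.
\]

Differentiating with respect to $\lambda$, and using the result that for exponential families

\[
\nabla_\lambda Z(\lambda) = \int S(\theta)\, q_\lambda(\theta)\, d\theta = E_\lambda(S(\theta)), \tag{10}
\]

we have

\[
\nabla_\lambda \mathrm{KL}(\lambda) = \int \left\{ S(\theta) - \nabla_\lambda Z(\lambda) \right\} q_\lambda(\theta)\, d\theta + \int \left\{ S(\theta) - \nabla_\lambda Z(\lambda) \right\} \left\{ S(\theta)^T\lambda - Z(\lambda) - \log p(\theta|y) \right\} q_\lambda(\theta)\, d\theta. \tag{11}
\]

Equation (10) can be obtained by differentiating the normalization condition $\int q_\lambda(\theta)\, d\theta = 1$ with respect to $\lambda$. Using (10), the first term on the right hand side of (11) disappears leaving

\[
\begin{aligned}
\nabla_\lambda \mathrm{KL}(\lambda) &= \int \left\{ S(\theta)S(\theta)^T\lambda - \nabla_\lambda Z(\lambda) S(\theta)^T\lambda - S(\theta)Z(\lambda) + \nabla_\lambda Z(\lambda) Z(\lambda) \right\} q_\lambda(\theta)\, d\theta \\
&\quad - \int \log p(\theta|y) \left\{ S(\theta) - \nabla_\lambda Z(\lambda) \right\} q_\lambda(\theta)\, d\theta \\
&= \mathrm{Cov}_\lambda\big(S(\theta)\big)\lambda - \mathrm{Cov}_\lambda\big(S(\theta), \log p(\theta|y)\big),
\end{aligned}
\]

where the last line is obtained by using (10). Hence $\nabla_\lambda \mathrm{KL}(\lambda) = 0$ if

\[
\lambda = \mathrm{Cov}_\lambda\big(S(\theta)\big)^{-1} \mathrm{Cov}_\lambda\big(S(\theta), \log p(\theta|y)\big). \tag{12}
\]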

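That is, the optimal $\lambda$ is the solution to a fixed point problem. Note that $\log p(\theta|y)$ differs only by a constant not depending on $\theta$ from $\log p(\theta)p(y|\theta)$ so (12) can be rewritten as

\[
\lambda = \mathrm{Cov}_\lambda\big(S(\theta)\big)^{-1} \mathrm{Cov}_\lambda\big(S(\theta), \log p(\theta)p(y|\theta)\big). \tag{13}
\]

This suggests an iterative scheme for minimizing $\mathrm{KL}(\lambda)$, where at iteration $k+1$ the parameters $\lambda^{(k)}$ are updated to

\[
\lambda^{(k+1)} = \mathrm{Cov}_{\lambda^{(k)}}\big(S(\theta)\big)^{-1} \mathrm{Cov}_{\lambda^{(k)}}\big(S(\theta), \log p(\theta)p(y|\theta)\big). \tag{14}
\]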

Salimans and Knowles (2013) observe that this iterative scheme does not necessarily converge. Instead, inspired by stochastic gradient descent algorithms (Robbins and Monro, 1951), they choose to estimate $\mathrm{Cov}_\lambda(S(\theta))$ and $\mathrm{Cov}_\lambda(S(\theta), \log p(\theta)p(y|\theta))$ by a weighted average over iterates in a Monte Carlo approximation to a pre-conditioned gradient descent algorithm which is guaranteed to converge to a local mode if a certain step size parameter in their algorithm is small enough. They argue for Monte Carlo estimation of both $\mathrm{Cov}_{\lambda^{(k)}}(S(\theta))$ and $\mathrm{Cov}_{\lambda^{(k)}}(S(\theta), \log p(\theta)p(y|\theta))$ using the same Monte Carlo samples. This results in the approximation of the right hand side of (14) taking the form of a linear regression of the log target distribution on the sufficient statistics of the approximating family. The number of iterations $N$ for which their algorithm is run is determined in advance, a constant step size of $c = 1/\sqrt{N}$ is chosen for all iterations and averaging is carried out over the last $N/2$ iterations in forming the estimates of the covariance matrices to calculate an estimate of $\lambda$. Theoretical support for these choices in the context of stochastic gradient descent algorithms is given by Nemirovski et al. (2009). See Salimans and Knowles (2013) for further discussion of why stochastic estimation of the covariance matrices rather than averaging over the parameters $\lambda$ in a more conventional stochastic gradient algorithm is beneficial.

Salimans and Knowles (2013) show using properties of the exponential family that

\[
\mathrm{Cov}_\lambda\big(S(\theta)\big) = \nabla_\lambda E_\lambda\big(S(\theta)\big), \tag{15}
\]

and

\[
\mathrm{Cov}_\lambda\big(S(\theta), \log p(\theta)p(y|\theta)\big) = \nabla_\lambda E_\lambda\big(\log p(\theta)p(y|\theta)\big), \tag{16}
\]

and then consider Monte Carlo approximations to the expectations on the right hand side of (15) and of (16) based on a random draw $\theta^* \sim q_\lambda(\theta)$ where $\theta^* = f(\lambda,s)$ and $s$ is some random seed. If $f$ is smooth, the Monte Carlo approximations are smooth functions of $\lambda$, and these approximations can be differentiated in (15) and (16). When the approximating distribution $q_\lambda(\theta)$ is multivariate normal, and working with a direct parameterization in terms of the mean and covariance matrix, results due to Minka (2001) and Opper and Archambeau (2009) are used to evaluate the gradients in (15) and (16) to simplify the approximations while making use of the first and second derivatives of the target posterior (Salimans and Knowles, 2013, Section 4.4 and Appendix C). This results in a highly efficient algorithm.

We are concerned with a certain modification of their algorithm for Gaussian $q_\lambda(\theta)$ but where there is independence between blocks of the parameters. Suppose $\theta$ is decomposed into $K$ blocks $\theta = (\theta_1^T,\dots,\theta_K^T)^T$ and that the variational posterior $q_\lambda(\theta)$ factorizes as

\[
q_\lambda(\theta) = q_{\lambda_1}(\theta_1) \times \dots \times q_{\lambda_K}(\theta_K),
\]

with each factor $q_{\lambda_k}(\theta_k)$, $k = 1,\dots,K$, being multivariate normal, and $\lambda_k$ denotes the natural parameters for the $k$th factor. We write $\mu_k$ and $\Sigma_k$ for the corresponding mean and covariance matrix and write $S_k(\theta_k)$ for the vector of sufficient statistics in the $k$th normal factor. Because of independence, the optimality condition (13) simplifies to

\[
\lambda_k = \mathrm{Cov}_{\lambda_k}\big(S_k(\theta_k)\big)^{-1} \mathrm{Cov}_\lambda\big(S_k(\theta_k), \log p(\theta)p(y|\theta)\big)
\]

and we can use the ideas of Salimans and Knowles (2013) to estimate the covariance matrices on the right hand side of this expression. This results in the following slight modification of their Algorithm 2. In the description below $t_k$, $g_k$, $\bar t_k$, $\bar g_k$ are vectors of the same length as $\theta_k$ and $\Gamma_k$ and $\bar\Gamma_k$ are square matrices with dimension the length of $\theta_k$. We assume below that $N$ is even so that $N/2$ is an integer and set $c = 1/\sqrt{N}$.

Algorithm 1:

• Initialize $\mu_k$, $\Sigma_k$, $k = 1,\dots,K$.

• Initialize $t_k = \mu_k$, $\Gamma_k = \Sigma_k^{-1}$ and $g_k = 0$, $k = 1,\dots,K$.

• Initialize $\bar t_k = 0$, $\bar\Gamma_k = 0$ and $\bar g_k = 0$, $k = 1,\dots,K$.

• For $i = 1,\dots,N$ do

  – Generate a draw $\theta^* = (\theta_1^{*T},\dots,\theta_K^{*T})^T$ from $q_\lambda(\theta)$

  – For $k = 1,\dots,K$ do

    ∗ Set $\Sigma_k = \Gamma_k^{-1}$ and $\mu_k = \Sigma_k g_k + t_k$

    ∗ Calculate the gradient $g_i^{(k)}$ and Hessian $H_i^{(k)}$ of $\log p(\theta)p(y|\theta)$ with respect to $\theta_k$ evaluated at $\theta^*$.

    ∗ Set $g_k = (1-c)g_k + c\,g_i^{(k)}$, $\Gamma_k = (1-c)\Gamma_k - c\,H_i^{(k)}$, $t_k = (1-c)t_k + c\,\theta_k^*$.

    ∗ If $i > N/2$ then set $\bar g_k = \bar g_k + \frac{2}{N} g_i^{(k)}$, $\bar\Gamma_k = \bar\Gamma_k - \frac{2}{N} H_i^{(k)}$, $\bar t_k = \bar t_k + \frac{2}{N}\theta_k^*$.

• Set $\Sigma_k = \bar\Gamma_k^{-1}$, $\mu_k = \Sigma_k \bar g_k + \bar t_k$ for $k = 1,\dots,K$.

When the iteration terminates, $\mu_k$ and $\Sigma_k$ are the estimated mean and covariance matrix of the normal density $q_{\lambda_k}(\theta_k)$.
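The steps of Algorithm 1 can be sketched in a few lines of NumPy for the single-block case $K = 1$. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `ffvb_normal`, the callback `grad_hess` and the quadratic toy target are all hypothetical, and the symmetrization of `Sigma` is an added numerical safeguard that is not part of the algorithm.

```python
import numpy as np

def ffvb_normal(grad_hess, mu0, Sigma0, N=100, seed=0):
    """Single-block (K = 1) sketch of Algorithm 1 with a normal q.

    grad_hess(theta) must return the gradient and Hessian of
    log p(theta) p(y | theta) at theta. N is assumed to be even.
    """
    rng = np.random.default_rng(seed)
    c = 1.0 / np.sqrt(N)                           # constant step size
    mu = np.asarray(mu0, dtype=float)
    Sigma = np.asarray(Sigma0, dtype=float)
    t, Gamma, g = mu.copy(), np.linalg.inv(Sigma), np.zeros_like(mu)
    t_bar, g_bar = np.zeros_like(mu), np.zeros_like(mu)
    Gamma_bar = np.zeros_like(Gamma)
    for i in range(1, N + 1):
        theta = rng.multivariate_normal(mu, Sigma)   # draw from the current q
        Sigma = np.linalg.inv(Gamma)
        Sigma = 0.5 * (Sigma + Sigma.T)              # guard against round-off asymmetry
        mu = Sigma @ g + t
        g_i, H_i = grad_hess(theta)
        g = (1 - c) * g + c * g_i
        Gamma = (1 - c) * Gamma - c * H_i
        t = (1 - c) * t + c * theta
        if i > N // 2:                               # average over the last N/2 iterations
            g_bar += 2.0 / N * g_i
            Gamma_bar -= 2.0 / N * H_i
            t_bar += 2.0 / N * theta
    Sigma = np.linalg.inv(Gamma_bar)
    mu = Sigma @ g_bar + t_bar
    return mu, Sigma

# Toy target: log p(theta) p(y | theta) = -0.5 (theta - m)' P (theta - m) + const.
m = np.array([1.0, -2.0])
P = np.array([[2.0, 0.5], [0.5, 1.0]])
mu_hat, Sigma_hat = ffvb_normal(lambda th: (-P @ (th - m), -P),
                                np.zeros(2), np.eye(2))
print(mu_hat, Sigma_hat)
```

For a quadratic log target the Hessian is constant, so the averaged estimates reproduce the target mean `m` and covariance `inv(P)` up to rounding error whatever the random draws are. For a non-Gaussian target the output depends on the draws and on `N`.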

4 Parallel implementation for large datasets

Suppose the data $y$ are partitioned into $M$ subsets, $y = (y^{(1)\prime},\dots,y^{(M)\prime})'$. Suppose also that we have learnt a variational posterior distribution for each subset, $q_{\lambda^{(j)}}(\theta)$ approximating $p(\theta|y^{(j)})$. We assume that

\[
q_{\lambda^{(j)}}(\theta) = q_{\lambda_1^{(j)}}(\theta_1) \times \dots \times q_{\lambda_K^{(j)}}(\theta_K), \tag{17}
\]

where $\lambda_k^{(j)}$ is the natural parameter for $q_{\lambda_k^{(j)}}(\theta_k)$ which has been assumed to have an exponential family form, $j = 1,\dots,M$ and $k = 1,\dots,K$. We will also assume that

\[
p(y|\theta) = p(y^{(1)}|\theta) \times \dots \times p(y^{(M)}|\theta),
\]

i.e. the blocks $y^{(1)},\dots,y^{(M)}$ are conditionally independent given $\theta$. Then the posterior distribution is

\[
\begin{aligned}
p(\theta|y) &\propto p(\theta)\, p(y^{(1)}|\theta) \times \dots \times p(y^{(M)}|\theta) \\
&= \frac{\left[p(\theta)p(y^{(1)}|\theta)\right] \times \dots \times \left[p(\theta)p(y^{(M)}|\theta)\right]}{p(\theta)^{M-1}} \\
&\propto \frac{p(\theta|y^{(1)}) \times \dots \times p(\theta|y^{(M)})}{p(\theta)^{M-1}}.
\end{aligned}
\]

Hence, given our approximation $q_{\lambda^{(j)}}(\theta)$ of $p(\theta|y^{(j)})$, $p(\theta|y)$ is approximately proportional to

\[
\frac{q_{\lambda^{(1)}}(\theta) \times \dots \times q_{\lambda^{(M)}}(\theta)}{p(\theta)^{M-1}}.
\]

The reasoning used here is the same as that used in the Bayesian committee machine (Tresp, 2000) although Tresp focused more on applications to Gaussian process regression. A similar strategy was independently proposed in a recent preprint by Broderick et al. (2013), who assume that the data subsets $y^{(j)}$ arrive sequentially in time.

Recall that $q_{\lambda^{(j)}}(\theta)$ has the factorization (17). Hence, if the prior also factorizes as

\[
p(\theta) = p_{\lambda_1^{(0)}}(\theta_1) \times \dots \times p_{\lambda_K^{(0)}}(\theta_K),
\]

where $p_{\lambda_k^{(0)}}(\theta_k)$, with natural parameters $\lambda_k^{(0)}$, has the same exponential family form as $q_{\lambda_k^{(j)}}(\theta_k)$, then the marginal posterior for $\theta_k$ is approximately proportional to

\[
\frac{q_{\lambda_k^{(1)}}(\theta_k) \times \dots \times q_{\lambda_k^{(M)}}(\theta_k)}{p_{\lambda_k^{(0)}}^{M-1}(\theta_k)}, \quad k = 1,\dots,K. \tag{18}
\]

This approximation to $p(\theta_k|y)$ is an exponential family distribution of the same form as each of the factors with natural parameter $\sum_{j=1}^M \lambda_k^{(j)} - (M-1)\lambda_k^{(0)}$. Hence we can learn the approximations $q_{\lambda^{(j)}}(\theta)$ independently in parallel for different chunks of the data and then combine these posteriors to get an approximation to the full posterior.

If the factors $q_{\lambda_k^{(j)}}(\theta_k)$ are all normal, with $\lambda_k^{(j)}$ corresponding to mean $\mu_k^{(j)}$ and covariance matrix $\Sigma_k^{(j)}$, and if $p_{\lambda_k^{(0)}}(\theta_k)$ has mean $\mu_k^{(0)}$ and covariance matrix $\Sigma_k^{(0)}$, then the approximation to $p(\theta_k|y)$ is normal, with mean

\[
\left( \sum_{j=1}^M \Sigma_k^{(j)-1} - (M-1)\Sigma_k^{(0)-1} \right)^{-1} \left( \sum_{j=1}^M \Sigma_k^{(j)-1}\mu_k^{(j)} - (M-1)\Sigma_k^{(0)-1}\mu_k^{(0)} \right)
\]

and covariance matrix

\[
\left( \sum_{j=1}^M \Sigma_k^{(j)-1} - (M-1)\Sigma_k^{(0)-1} \right)^{-1}.
\]
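The normal recombination rule translates directly into code. The sketch below is an illustration with hypothetical names (`combine_normals`, `exact_posterior`); it checks the rule on a conjugate toy model, chosen for this example only, in which the subset posteriors are exact and the recombined result must therefore equal the full-data posterior.

```python
import numpy as np

def combine_normals(mus, Sigmas, mu0, Sigma0):
    """Combine M normal sub-posteriors N(mus[j], Sigmas[j]) under a N(mu0, Sigma0) prior."""
    M = len(mus)
    P0 = np.linalg.inv(Sigma0)                      # prior precision
    Ps = [np.linalg.inv(S) for S in Sigmas]         # subset precisions
    prec = sum(Ps) - (M - 1) * P0
    lin = sum(P @ mu for P, mu in zip(Ps, mus)) - (M - 1) * (P0 @ mu0)
    Sigma = np.linalg.inv(prec)
    return Sigma @ lin, Sigma

# Conjugate toy model: y_i ~ N(theta, 1) with prior theta ~ N(0, tau0).
rng = np.random.default_rng(1)
y = rng.normal(2.0, 1.0, size=900)
tau0 = 10.0

def exact_posterior(chunk):
    prec = 1.0 / tau0 + len(chunk)
    return np.array([chunk.sum() / prec]), np.array([[1.0 / prec]])

parts = [exact_posterior(chunk) for chunk in np.array_split(y, 3)]
mu_c, Sigma_c = combine_normals([p[0] for p in parts], [p[1] for p in parts],
                                np.array([0.0]), np.array([[tau0]]))
mu_full, Sigma_full = exact_posterior(y)
print(mu_c, mu_full)
```

With VB sub-posteriors in place of exact ones, the same function returns the approximation described above. The subtraction of the prior precision can in principle leave a matrix that is not positive definite when the subset approximations are poor, so a check on `prec` may be worthwhile in practice.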

A similar way of combining normal approximations of posterior distributions in mixed models has been considered by Huang and Gelman (2005). If $q_{\lambda_k^{(j)}}(\theta_k)$ is Wishart, $W(\nu_k^{(j)}, S_k^{(j)})$, and if $p_{\lambda_k^{(0)}}(\theta_k)$ is Wishart, $W(\nu_k^{(0)}, S_k^{(0)})$, then $p(\theta_k|y)$ is approximated as the Wishart distribution

\[
W\left( \sum_{j=1}^M \nu_k^{(j)} - (M-1)\nu_k^{(0)},\; \left( \sum_{j=1}^M S_k^{(j)-1} - (M-1)S_k^{(0)-1} \right)^{-1} \right).
\]

Note that a random matrix $A$ of size $d\times d$ is said to be distributed as a Wishart distribution $W(\nu,S)$ with degrees of freedom $\nu > d-1$ and scale matrix $S$ if its probability density function is

\[
p(A) = \frac{|A|^{\frac{1}{2}(\nu-d-1)} \exp\left\{ -\frac{1}{2}\mathrm{tr}(S^{-1}A) \right\}}{2^{\frac{1}{2}\nu d}\, \pi^{d(d-1)/4}\, |S|^{\frac{1}{2}\nu} \prod_{i=1}^d \Gamma\left(\frac{1}{2}(\nu+1-i)\right)}.
\]
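Under this parameterization the Wishart recombination only involves the degrees of freedom and the inverse scale matrices. The following sketch uses a hypothetical helper `combine_wisharts` and made-up inputs; the subset factors are constructed as "prior plus data term" so that the expected combined result is known in advance.

```python
import numpy as np

def combine_wisharts(nus, Ss, nu0, S0):
    """Combine M Wishart factors W(nus[j], Ss[j]) under a W(nu0, S0) prior."""
    M = len(nus)
    nu = sum(nus) - (M - 1) * nu0
    S_inv = sum(np.linalg.inv(S) for S in Ss) - (M - 1) * np.linalg.inv(S0)
    return nu, np.linalg.inv(S_inv)

# Illustrative inputs: subset j contributes m_j to the degrees of freedom
# and a positive definite matrix B_j to the inverse scale.
nu0, S0 = 3.0, 1000.0 * np.eye(2)
ms = [40, 50, 60]
Bs = [scale * np.array([[4.0, 1.0], [1.0, 3.0]]) for scale in (1.0, 1.5, 2.0)]
nus = [nu0 + m_j for m_j in ms]
Ss = [np.linalg.inv(np.linalg.inv(S0) + B) for B in Bs]

nu_c, S_c = combine_wisharts(nus, Ss, nu0, S0)
print(nu_c)
```

The prior is counted once in the result: the combined degrees of freedom are `nu0 + sum(ms)` and the combined inverse scale is `inv(S0) + sum(Bs)`.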

4.1 Model selection with cross-validation

The way of combining approximations learnt independently on different subsets of the data makes model choice by cross-validation straightforward to implement. Let one of the subsets $y^{(1)},\dots,y^{(M)}$ be a future dataset $y_F$, and let the rest be used as the training data $y_T$. Let $\mathcal{M}$ be the model that is being considered. A common measure of the performance of the model $\mathcal{M}$ is the log predictive density score (LPDS), defined as (Good, 1952)

\[
\mathrm{LPDS}(y_F|y_T,\mathcal{M}) = \sum_{y\in y_F} \log \int p(y|\theta,\mathcal{M})\, p(\theta|y_T,\mathcal{M})\, d\theta, \tag{19}
\]

where we assume that $p(y|\theta,y_T,\mathcal{M}) = p(y|\theta,\mathcal{M})$, i.e. conditional on $\mathcal{M}$ and $\theta$ the future observations are independent of the observed, and $p(\theta|y_T,\mathcal{M})$ is the posterior of the model parameter $\theta$ conditional on the training data $y_T$. The posterior $p(\theta|y_T,\mathcal{M})$ can be replaced by its VB estimate $q(\theta|y_T,\mathcal{M})$ and the integral in (19) can then be approximated by Monte Carlo samples drawn from $q(\theta|y_T,\mathcal{M})$. A simpler method is to estimate the integral by $p(y|\hat\theta(y_T),\mathcal{M})$ with $\hat\theta(y_T)$ an estimator of the posterior mean of $\theta$ which can be obtained by using the mean of the VB approximation $q(\theta|y_T,\mathcal{M})$. We use this plug-in method in our paper and define the $M$-fold cross-validated LPDS as

\[
\mathrm{LPDS}(\mathcal{M}) = \frac{1}{M} \sum_{j=1}^M \mathrm{LPDS}(y^{(j)}|y\setminus y^{(j)},\mathcal{M}). \tag{20}
\]

Computing (20) is straightforward with parallel implementation and the main advantage is that no extra time is needed to refit the model on each training dataset. From (18), the variational distribution $q(\theta_k|y\setminus y^{(j)},\mathcal{M})$ of the parameter block $\theta_k$ conditional on dataset $y\setminus y^{(j)}$ is proportional to

\[
\frac{q_{\lambda_k^{(1)}}(\theta_k,\mathcal{M}) \times \dots \times q_{\lambda_k^{(j-1)}}(\theta_k,\mathcal{M}) \times q_{\lambda_k^{(j+1)}}(\theta_k,\mathcal{M}) \times \dots \times q_{\lambda_k^{(M)}}(\theta_k,\mathcal{M})}{p_{\lambda_k^{(0)}}^{M-2}(\theta_k,\mathcal{M})}, \quad k = 1,\dots,K,
\]

from which the estimator $\hat\theta(y\setminus y^{(j)})$ is easily computed accordingly. Recall that $q_{\lambda_k^{(j)}}(\theta_k,\mathcal{M})$ is the VB approximation to the marginal posterior of the $k$th block $\theta_k$, based on the $j$th data subset, $j = 1,\dots,M$ and $k = 1,\dots,K$.
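In natural-parameter form, the leave-one-subset-out factor has parameter $\sum_{i\ne j}\lambda_k^{(i)} - (M-2)\lambda_k^{(0)}$, so all $M$ training fits follow from one pass over the stored subset parameters. The sketch below illustrates this together with the plug-in LPDS on a scalar normal-mean toy model, where the natural parameters of $N(\mu,\sigma^2)$ are $(\mu/\sigma^2, -1/(2\sigma^2))$. The helper name, the toy model and all values are assumptions made for this example.

```python
import numpy as np

def leave_one_out_natural(lams, lam0):
    """Natural parameters of q(theta_k | all subsets except j), for every j.

    lams: array of shape (M, p) holding the subset parameters lambda_k^(j);
    lam0: array of shape (p,) holding the prior parameter lambda_k^(0).
    """
    lams = np.asarray(lams, dtype=float)
    M = lams.shape[0]
    return lams.sum(axis=0) - lams - (M - 2) * np.asarray(lam0, dtype=float)

# Toy model: y_i ~ N(theta, 1), theta ~ N(0, tau0), split into M = 4 subsets.
rng = np.random.default_rng(2)
chunks = np.array_split(rng.normal(2.0, 1.0, size=400), 4)
tau0 = 10.0
lam0 = np.array([0.0, -0.5 / tau0])
lams = np.array([[c.sum(), -0.5 * (1.0 / tau0 + len(c))] for c in chunks])

loo = leave_one_out_natural(lams, lam0)
theta_hat = -loo[:, 0] / (2.0 * loo[:, 1])        # plug-in posterior means

# Plug-in M-fold LPDS: average the held-out log densities over the folds.
lpds = np.mean([np.sum(-0.5 * np.log(2.0 * np.pi) - 0.5 * (c - th) ** 2)
                for c, th in zip(chunks, theta_hat)])
print(theta_hat, lpds)
```

Because the subset posteriors are exact in this toy model, each entry of `theta_hat` equals the posterior mean computed from the remaining three subsets, with no refitting.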

5 Application to generalized linear mixed models

Consider a generalized linear mixed model where $y_i = (y_{i1},\dots,y_{in_i})^T$ is the vector of responses for the $i$th subject/panel, $i = 1,\dots,m$. Given random effects $b_i$, the $y_{ij}$ are conditionally independently distributed with the density or probability function

\[
f(y_{ij}|\beta,b_i) = \exp\left\{ \frac{y_{ij}\eta_{ij} - \zeta(\eta_{ij})}{\phi} + c(y_{ij},\phi) \right\},
\]

where $\eta_{ij}$ is a canonical parameter which is monotonically related to the conditional mean $\mu_{ij} = E(y_{ij}|\beta,b_i)$ through a link function $g(\cdot)$, $g(\mu_{ij}) = \eta_{ij}$, $\beta$ is a $p$-vector of fixed effect parameters, $\phi$ is a scale parameter which we assume known (for example, in the binomial and Poisson families $\phi = 1$), and $\zeta(\cdot)$ and $c(\cdot)$ are known functions. Here, for simplicity, we are considering the case of a canonical link function, i.e. $g(\mu_{ij}) = \eta_{ij}$. The vector $\eta_i = (\eta_{i1},\dots,\eta_{in_i})^T$ is modeled as $\eta_i = X_i\beta + Z_i b_i$, where $X_i$ is an $n_i\times p$ design matrix for the fixed effects and $Z_i$ is an $n_i\times u$ design matrix for the random effects (where $u$ is the dimension of $b_i$). Let $b = (b_1',\dots,b_m')'$ and

\[
X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_m \end{pmatrix}, \quad
Z = \begin{pmatrix} Z_1 & 0 & \cdots & 0 \\ 0 & Z_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & Z_m \end{pmatrix}, \quad
\eta = \begin{pmatrix} \eta_1 \\ \eta_2 \\ \vdots \\ \eta_m \end{pmatrix} = X\beta + Zb.
\]

The likelihood is

\[
p(y|\beta,b) = \prod_{i=1}^m \prod_{j=1}^{n_i} f(y_{ij}|\beta,b_i) = \exp\left\{ \frac{1}{\phi}\left( y^T\eta - 1^T\zeta(\eta) \right) + c(y,\phi) \right\},
\]

where $\zeta(\eta)$ is understood componentwise and $c(y,\phi) = \sum_{i,j} c(y_{ij},\phi)$. The random effects $b_i$ are independently distributed as $N(0,Q^{-1})$. Hence $p(b) \sim N(0,Q_b^{-1})$ with $Q_b$ a block diagonal matrix $\mathrm{diag}(Q,\dots,Q)$.

We consider Bayesian inference with a normal prior $N(\mu_\beta^0,\Sigma_\beta^0)$ for $\beta$ and a Wishart prior $W(\nu_0,S_0)$ for $Q$, where $\mu_\beta^0$, $\Sigma_\beta^0$, $\nu_0$ and $S_0$ are known hyperparameters. We set $\mu_\beta^0 = 0$, $\Sigma_\beta^0 = \tau_0 I_p$, $\nu_0 = u+1$ and $S_0 = \tau_0 I_u$ with $\tau_0 = 1000$.

Let $\theta = (\beta,b,Q)$ be the vector of all the unknown parameters and random effects. Assume that the variational posterior is factorized as

\[
q(\theta) = q(\beta,b)\,q(Q) = q(\alpha)\,q(Q) \quad \text{with} \quad \alpha = (\beta^T,b^T)^T, \tag{21}
\]

where $q(\alpha)$ is normal with mean $\mu_\alpha^q$ and covariance matrix $\Sigma_\alpha^q$. The optimal VB approximation $q(Q)$ is a Wishart $W(\nu^q,S^q)$ with $\nu^q$ and $S^q$ given in Algorithm 2.

By combining the VB theory in Section 2 and Algorithm 1 of Section 3, we obtain the following mean-field fixed-form VB algorithm for fitting GLMMs.

365

Algorithm 2

366

1. Initialize ν q ,S q .

367

2. Update µqα and Σqα as follows

368

• Initialize tα = µqα, Γα = Σqα−1 and gα = 0.

369

¯ α = 0 and g¯α = 0. • Initialize t¯α = 0, Γ

370

• For i = 1,...,N do

371

– Generate α∗ = (β ∗T ,b∗T )T ∼ N (µqα ,Σqα ) and compute η ∗ = Xβ ∗ +Zb∗.

372

q q – Set Σqα = Γ−1 α and µα = Σαgα +tα.

373

– Compute the gradient giα =

374

1 ˙ ∗)) − Σ0 −1 (β ∗ − µ0 ) X T (y − ζ(η α β φ 1 T ∗ ˙ )) − Eq(Q) (Qb)b∗ Z (y − ζ(η φ

!

and Hessian       1 1 T ∗ 0 −1 T ∗ ¨ ¨ − φ X diag ζ(η ) X − Σα − φ X diag ζ(η ) Z      Hiα =  ¨ ∗) X ¨ ∗) Z − Eq(Q) (Qb) − φ1 Z T diag ζ(η − φ1 Z T diag ζ(η

375

376

377

– Set gα = (1−c)gα +cgiα, Γα = (1−c)Γα −cHiα , tα = (1−c)tα +cα∗ .

378

¯α = Γ ¯ α − 2 H α , t¯α = ¯tα + 2 α∗ , – If i > N/2 then set g¯α = g¯α + N2 giα , Γ N i N

379

380

381

¯ −1 , µq = Σq g¯α + ¯tα . • Set Σqα = Γ α α α  −1 P q q T q 3. Update ν q = ν0 +m, S q = S0−1 + m (µ µ +Σ ) . bi bi bi i=1 4. Repeat Steps 2-3 until convergence.

In the above algorithm, $E_{q(Q)}(Q_b)=\mathrm{diag}\big(E_{q(Q)}(Q),\dots,E_{q(Q)}(Q)\big)$ with $E_{q(Q)}(Q)=\nu^qS^q$, and $\mu^q_{b_i}$ and $\Sigma^q_{b_i}$ are the mean and covariance of $b_i$ computed from $\mu^q_\alpha$ and $\Sigma^q_\alpha$. The $H_i^\alpha$, and therefore $\bar\Gamma_\alpha$, are block, high-dimensional and sparse matrices whose lower right blocks are block diagonal. Techniques for handling such sparse matrices should be used to reduce the computing time. In our experience, the algorithm often converges very quickly, within a few iterations. A common stopping rule is to stop iterating when the lower bound does not improve any further. However, computing the lower bound in the GLMM context often involves an analytically intractable integral. Alternatively, we can stop iterating if the difference in the variational parameters between two successive iterations is smaller than a small threshold. In our implementation, the algorithm is terminated if $1/d$ times the difference between two successive iterations is smaller than $\epsilon=10^{-5}$ ($d$ is the total number of the parameters). The number of iterations within each fixed-form update $N$ is set to 100 after some experimentation, but this can be varied depending on the computational budget or even adaptively increased as we near convergence.
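The stopping rule described above can be sketched as follows. The text does not specify how the "difference" between successive iterates is measured, so this sketch assumes the sum of absolute differences over all variational parameters; the function name `has_converged` is hypothetical.

```python
def has_converged(old_params, new_params, tol=1e-5):
    """Return True when (1/d) * sum of absolute parameter changes is below tol.

    The use of the absolute (L1) difference is an assumption; the paper only
    says "1/d times the difference between two successive iterations".
    """
    d = len(new_params)
    total_change = sum(abs(n - o) for o, n in zip(old_params, new_params))
    return total_change / d < tol
```

In use, the flattened variational parameters (for example the entries of the mean and covariance of q(α) together with those of q(Q)) would be passed in after each pass through Steps 2-3.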

It is important to note that we treat $\beta$ and $b$ as a single block $\alpha$ to take into account the posterior dependence between the fixed and random effects. An alternative to (21) is to fully factorize $q(\theta)$ as

$$q(\theta)=q(\beta)\prod_{i=1}^m q(b_i)\,q(Q), \tag{22}$$

but this ignores the posterior dependence between $\beta$ and the $b_i$'s and thus results in a poor VB approximation. For a large $m$, there would be numerical problems when treating $\beta$ and $b$ as a single block if we worked with the full dataset, as the dimension of $\Sigma^q_\alpha$ is $(m\cdot u+p)\times(m\cdot u+p)$. It is therefore necessary to work with subsets of the data as developed in Section 4, so that for each subset we only have to work with matrices of size $(m_j\cdot u+p)\times(m_j\cdot u+p)$, with $m_j$ the number of panels in that subset.

Partitioning the data into too many small subsets will lead to a poor approximation in each subset. On the other hand, using too large a subset size $m_j$ is time consuming, as we have to work with high-dimensional matrices, which may cause numerical issues. Our simulation study in Section 6.1 suggests that we should partition the data such that $m_j\approx(200-p)/u$.
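As an illustration of the partitioning guideline, the sketch below computes a subset size from the rule m_j ≈ (200 − p)/u and then splits the panel indices at random. The helper names `subset_size` and `random_partition`, the integer rounding, and the choice to balance subsets by striding through a shuffled index list are all assumptions made for this example.

```python
import math
import random

def subset_size(p, u, target_dim=200):
    """Number of panels per subset so that m_j * u + p is about target_dim."""
    return max(1, (target_dim - p) // u)

def random_partition(m, m_j, seed=0):
    """Randomly split panel indices 0..m-1 into subsets of at most m_j panels."""
    panels = list(range(m))
    random.Random(seed).shuffle(panels)
    n_subsets = math.ceil(m / m_j)
    # Striding through the shuffled list gives near-equal subset sizes.
    return [panels[k::n_subsets] for k in range(n_subsets)]
```

For instance, with p = 2 fixed effects and a random intercept (u = 1), the rule gives 198 panels per subset, so 1000 panels would be split into 6 subsets under this rounding.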

5.1 Improving the VB approximation for GLMMs

With the prior $p(Q)\sim W(\nu_0,S_0)$, it is easy to see that $p(Q|\beta,b,y)=p(Q|b,y)\sim W\big(\nu_0+m,(S_0^{-1}+\sum_{i=1}^m b_ib_i^T)^{-1}\big)$. In order to improve the approximation of the posterior, we can use $\tilde q(\beta,b,Q)=q(\beta,b)\,p(Q|b,y)$. The marginal posterior of $Q$ is then estimated by

$$\tilde q(Q)=\int q(b)\,p(Q|b,y)\,db, \tag{23}$$

where $q(b)\sim N(\mu^q_b,\Sigma^q_b)$ with the mean $\mu^q_b$ and covariance $\Sigma^q_b$ computed from $\mu^q_\alpha$ and $\Sigma^q_\alpha$. In the case of parallel implementation, in which the parameters are learnt separately on $M$ data subsets $y^{(j)}$, $\mu^q_b$ can be approximated by $(\mu^{qT}_{b^{(1)}},\dots,\mu^{qT}_{b^{(M)}})^T$ and $\Sigma^q_b$ by the block diagonal matrix $\mathrm{diag}(\Sigma^q_{b^{(1)}},\dots,\Sigma^q_{b^{(M)}})$. Here, $b^{(j)}$ is the vector of random effects with respect to the data subset $y^{(j)}$ and $q(b^{(j)})\sim N(\mu^q_{b^{(j)}},\Sigma^q_{b^{(j)}})$. This method is not suitable for very large datasets because $\Sigma^q_b$ is a high-dimensional matrix of size $(m\cdot u)\times(m\cdot u)$.
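For a random intercept model (u = 1), draws from the improved approximation (23) can be sketched as below. This is a simplified illustration rather than the authors' code: it assumes the components of q(b) are independent (a diagonal covariance), and it uses the fact that a one-dimensional Wishart W(ν, S) with mean νS is a Gamma distribution with shape ν/2 and scale 2S. The function name `draw_sigma2_improved` and its arguments are hypothetical; the defaults follow the hyperparameters ν0 = u + 1 and S0 = τ0 with τ0 = 1000 stated earlier.

```python
import random

def draw_sigma2_improved(mu_b, var_b, nu0=2.0, s0=1000.0, n_draws=1000, seed=0):
    """Draw sigma^2 = 1/Q from q~(Q) = integral of q(b) p(Q | b, y) db, for u = 1.

    mu_b, var_b: means and variances of the (assumed independent) q(b_i).
    """
    rng = random.Random(seed)
    m = len(mu_b)
    draws = []
    for _ in range(n_draws):
        # Step 1: draw the random effects from q(b).
        b = [rng.gauss(mu, v ** 0.5) for mu, v in zip(mu_b, var_b)]
        # Step 2: draw Q from its conditional W(nu0 + m, (1/S0 + sum b_i^2)^-1).
        scale = 1.0 / (1.0 / s0 + sum(bi * bi for bi in b))
        q = rng.gammavariate((nu0 + m) / 2.0, 2.0 * scale)
        draws.append(1.0 / q)
    return draws
```

A kernel density estimate of the returned draws would then play the role of the improved density for σ2.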

5.2 Model selection for GLMMs

Given the response vector $y$, assume that a GLMM has been specified. Model selection in GLMMs then consists of selecting fixed effect covariates and random effect covariates among a set of potential covariates. Assume that we have fitted a GLMM $\mathcal{M}$ and denote the estimated parameter by $\hat\theta=(\hat\beta,\hat Q)$. The plug-in log predictive density score of a future dataset with response vector $y_F$, fixed effect design matrix $X_F$ and random effect design matrix $Z_F$ is

$$\log p(y_F|\hat\theta,\mathcal{M})=\sum_{y_i\in y_F}\log\int\exp\Big(\frac{1}{\phi}\big(y_i^T\eta_i-1^T\zeta(\eta_i)\big)+c(y_i,\phi)\Big)\,p(b_i|\hat Q)\,db_i,$$

where $\eta_i=X_i\hat\beta+Z_ib_i$ and $X_i\in X_F$, $Z_i\in Z_F$, correspondingly. The integrals above can be estimated by the Laplace method. The $M$-fold cross-validated plug-in LPDS is then computed as in (20). The model having the largest LPDS is selected.
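The Laplace step for one panel can be sketched for the special case of a logistic model with a single random intercept, where φ = 1, ζ(η) = log(1 + exp(η)) and c(y, φ) = 0. This is an illustrative sketch under those assumptions only; the function name `laplace_log_marginal`, the Newton iteration count and the starting value b = 0 are choices made for this example.

```python
import math

def laplace_log_marginal(y, x, beta, q_prec, n_newton=50):
    """Laplace approximation to log of the integral over b of
    prod_j Bernoulli(y_j | eta_j) * N(b; 0, 1/q_prec), with eta_j = x_j'beta + b."""
    fixed = [sum(xk * bk for xk, bk in zip(xj, beta)) for xj in x]

    def derivatives(b):
        probs = [1.0 / (1.0 + math.exp(-(f + b))) for f in fixed]
        grad = sum(yj - pj for yj, pj in zip(y, probs)) - q_prec * b
        hess = -sum(pj * (1.0 - pj) for pj in probs) - q_prec
        return grad, hess

    # Newton iterations for the mode of the (concave) log integrand.
    b = 0.0
    for _ in range(n_newton):
        grad, hess = derivatives(b)
        b -= grad / hess
    _, hess = derivatives(b)
    log_lik = sum(yj * (f + b) - math.log1p(math.exp(f + b)) for yj, f in zip(y, fixed))
    log_prior = 0.5 * math.log(q_prec) - 0.5 * math.log(2.0 * math.pi) - 0.5 * q_prec * b * b
    # Laplace: log-integrand at the mode plus the Gaussian volume correction.
    return log_lik + log_prior + 0.5 * math.log(2.0 * math.pi) - 0.5 * math.log(-hess)
```

Summing this quantity over the panels of a held-out subset gives a plug-in score of the kind defined above; when a panel has no observations the integrand is exactly Gaussian and the approximation returns zero, which is a convenient check.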

Clearly, this model selection strategy can be used for selecting GLMMs themselves as well as the link functions. A drawback of this model selection method is that it is not suitable for cases in which the number of candidate covariates is large, because the total number of candidate models is then huge and searching over the model space is very time consuming.

6 Examples

The proposed parallel VB algorithm is written in Matlab. The examples with small-to-moderate data are run on an Intel Core i7 3.2GHz desktop supported by the Matlab Parallel Toolbox with 4 local processors. The big data example is run on a high performance cluster with 27 machines, each with 12 local processors.

The performance of the parallel VB method is compared to MCMC simulation, where MCMC simulation is feasible, using the running time and cross-validated LPDS as performance measures. Recall that, given the model being estimated, the LPDS measured on a future dataset $y_F$ based on a training dataset $y_T$ is

$$\mathrm{LPDS}(y_F|y_T)=\sum_{y\in y_F}\log\int p(y|\theta)\,p(\theta|y_T)\,d\theta.$$

For MCMC, the integrals on the right side are estimated by the Markov chain samples. We compute a 5-fold cross-validated LPDS for MCMC. For VB, the posterior $p(\theta|y_T)$ is replaced by its VB estimate $q(\theta|y_T)$, and then the integrals are estimated by Monte Carlo samples drawn from $q(\theta|y_T)$. We note that, as discussed in Section 4.1, the cross-validated LPDS for the parallel VB approach is easily computed without refitting the model on each training set $y_T$. The likelihood $p(y|\theta)$ in GLMMs is an integral over the random effects and is estimated by importance sampling. We use common random numbers to reduce variations when computing these integrals by importance sampling.
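The Monte Carlo estimate of the LPDS from posterior (or variational) draws can be sketched as follows. The function name `lpds_monte_carlo` and the callback `log_lik(y, theta)` are hypothetical; in the GLMM setting the callback would itself be an importance-sampling estimate of the log-likelihood, which is not shown here. The average is taken on the log scale with the usual max-shift for numerical stability.

```python
import math

def lpds_monte_carlo(y_future, theta_draws, log_lik):
    """Estimate sum over y of log( (1/S) * sum_s p(y | theta_s) ) from S draws."""
    total = 0.0
    for y in y_future:
        logs = [log_lik(y, theta) for theta in theta_draws]
        top = max(logs)
        # log-mean-exp: shift by the largest term before exponentiating.
        total += top + math.log(sum(math.exp(v - top) for v in logs) / len(logs))
    return total
```

The same routine serves both methods: `theta_draws` would hold Markov chain samples for MCMC, or samples from q(θ|y_T) for VB.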

In this paper, we use the pseudo-marginal MCMC simulation (see, e.g., Andrieu and Roberts, 2009; Flury and Shephard, 2011), which still generates samples from the posterior when the likelihood in the Metropolis-Hastings algorithm is replaced by its unbiased estimator. The likelihood in the GLMM context is a product of $m$ integrals over the random effects. Each integral is estimated unbiasedly using importance sampling, with 10 importance samples in the simulation examples and 100 in the real data examples. We use the Laplace approximation for selecting the importance proposal density. Note that each likelihood estimation is also run in parallel. To handle the positive definiteness constraint on the inverse covariance $Q$, we use the Leonard and Hsu transformation (Leonard and Hsu, 1992) $Q=\exp(\Sigma)$, where $\Sigma$ is an unconstrained symmetric matrix, to reparameterize $Q$ by the lower-triangle elements $\theta_Q$ of $\Sigma$, which is a one-to-one transformation between $Q$ and $\theta_Q$. We then use the adaptive random walk Metropolis-Hastings algorithm in Haario et al. (2001) to sample from the posterior $p(\beta,\theta_Q|y)$. Each MCMC chain consists of 20000 iterates with another 20000 iterates used as burn-in.
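To illustrate why the transformation Q = exp(Σ) removes the positive definiteness constraint, the sketch below evaluates the matrix exponential of a 2×2 symmetric matrix in closed form (the case u = 2). It relies on the identity exp(cI + B) = e^c (cosh(r) I + sinh(r)/r · B) for a traceless symmetric B with B² = r² I. The function name `expm_sym2` and the tuple layout are hypothetical.

```python
import math

def expm_sym2(a, b, d):
    """Matrix exponential of [[a, b], [b, d]]; returns (q11, q12, q22)."""
    mean = 0.5 * (a + d)            # multiple of the identity
    half_diff = 0.5 * (a - d)       # diagonal of the traceless part
    r = math.hypot(half_diff, b)    # its eigenvalues are +r and -r
    ratio = math.sinh(r) / r if r > 0.0 else 1.0
    scale = math.exp(mean)
    q11 = scale * (math.cosh(r) + ratio * half_diff)
    q22 = scale * (math.cosh(r) - ratio * half_diff)
    q12 = scale * ratio * b
    return q11, q12, q22
```

Any real triple (a, b, d), that is, any value of the unconstrained parameter vector, maps to a matrix with positive diagonal and determinant exp(a + d) > 0, so a random walk proposal on the triple never leaves the valid region.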

Alternative MCMC methods for estimating GLMMs such as Gibbs sampling (Zeger and Karim, 1991) can be faster than the MCMC scheme implemented in this paper. However, the pseudo-marginal Metropolis-Hastings sampling scheme with the random effects integrated out can avoid mixing problems that one would have with Gibbs sampling due to the strong dependence between the fixed and random effects. Gibbs sampling and similar MCMC methods for GLMMs are in general not parallelizable and therefore cumbersome in cases of a very large $m$. It should be noted that it is often difficult to compare the running times of different algorithms, as these depend heavily on the programming language being used and on how well the implementations are tuned to the characteristics of the particular example considered. However, we believe the results reported here are indicative of the speed up obtained with our variational Bayes methods.

6.1 Simulations

6.1.1 A simulation study of the parallel implementation

This simulation example studies the effect of the divide and recombine strategy and its parallel implementation. Panel data are generated from the following logistic model with random effects

$$y_{ij}\sim\mathrm{Bernoulli}(\pi_{ij}),\quad \pi_{ij}=\frac{\exp(\eta_{ij})}{1+\exp(\eta_{ij})},\quad \eta_{ij}=x_{ij}\beta+z_{ij}b_i,\quad b_i\sim N(0,\Sigma),\quad i=1,\dots,m,\ j=1,\dots,n_i, \tag{24}$$

with $\beta=(-1.5,2.5)'$, $\Sigma=\sigma^2I_u$, $\sigma^2=1.5$, $n_i=8$, $x_{ij}=(1,j/n_i)'$. We consider two cases: the first case with a random intercept, i.e. $u=1$ and $z_{ij}=1$, and the second case with $u=2$ and $z_{ij}=x_{ij}$.
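Data generation from model (24) in the random intercept case (u = 1, z_ij = 1) can be sketched as follows. This is an illustrative generator, not the authors' Matlab code; the function name `simulate_logistic_panels` and the returned data layout (a list of panels, each a list of (x_ij, y_ij) pairs) are assumptions.

```python
import math
import random

def simulate_logistic_panels(m, beta=(-1.5, 2.5), sigma2=1.5, n_i=8, seed=0):
    """Simulate m panels from the logistic random intercept model (24)."""
    rng = random.Random(seed)
    data = []
    for _ in range(m):
        b_i = rng.gauss(0.0, math.sqrt(sigma2))   # random intercept
        panel = []
        for j in range(1, n_i + 1):
            x_ij = (1.0, j / n_i)
            eta = beta[0] * x_ij[0] + beta[1] * x_ij[1] + b_i
            pi_ij = 1.0 / (1.0 + math.exp(-eta))
            y_ij = 1 if rng.random() < pi_ij else 0
            panel.append((x_ij, y_ij))
        data.append(panel)
    return data
```

Calling `simulate_logistic_panels(1000)` would produce a dataset of the size used in the small-data study.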

[Figure 1: four panels. The first three show the posterior density estimates for β1, β2 and σ2 for M = 1, 5, 10 and 20; the fourth shows CPU time in seconds against the number of pieces M.]

Figure 1: Parallel implementation with the four different values of the number of subsets M.

We first carry out a small-data study on a single desktop. We generate a dataset from (24) with m = 1000 and u = 1 and run the parallel VB method for four different values of the number of subsets: M = 1 (i.e. the data are not partitioned), M = 5, M = 10 and M = 20. Correspondingly, each subset has mj = 1000, 200, 100 and 50 panels, respectively. All the partitions are selected randomly. The first three panels of Figure 1 plot the posterior density estimates for β and σ2 obtained by the four parallel VB runs, which show that the estimates are close to each other, in the sense that differences in the estimates are small relative to the estimated posterior standard deviations. The last panel plots the running times taken, which shows that running the divide and recombine strategy in parallel produces considerable gains in computing time efficiency.

In order to have a more formal comparison of these four parallel VB runs, we generate 50 independent datasets from model (24) and compute the mean squared errors of the estimates of the fixed effects (MSEβ) and the mean squared errors of the estimates of the random effect variance (MSEΣ). Table 1 summarizes these performance measures and the running times averaged over the 50 replications. The results show that the parallel VB run with subset size mj = 200 produces accurate estimates while having a reasonable running time.

M     mj      MSEβ     MSEΣ     Time (second)/replication
1     1000    0.048    0.057    995.9
5     200     0.048    0.055    56.7
10    100     0.050    0.061    36.1
20    50      0.056    0.071    21.6

Table 1: Small data simulation study: The table reports the mean squared errors and the running time averaged over 50 replications for the parallel VB runs with different subset sizes. Data are simulated from model (24) with u = 1 and m = 1000.

To study the performance of the parallel VB algorithm in a big data context, we simulate a large dataset with m = 2.7 million panels from model (24) for both cases u = 1 and u = 2. That is, there are a total of 13.5 million observations, which can be considered as a big dataset in modern statistical applications. We run the parallel VB on a high performance cluster with 27 machines, each with 12 local processors. Table 2 summarizes the performance measures and the running times averaged over 10 replications. We draw three conclusions. First, there is not much difference in the mean squared errors between the different strategies of partitioning the data into subsets. However, the running time increases when the subset size increases. Second, the performance of the parallel VB depends mainly on the product of the subset size mj and the number of random effects u, which determines the size of the matrix Σqα in Algorithm 2. Third, the results suggest that we should partition the data such that mj × u ≈ 200 in order to have a good tradeoff between computing time and accuracy.

All the VB runs in the following examples are run in parallel with data partitioned such that the product of the subset size and the number of random effects is roughly 200.

u     mj × u    MSEβ × 10^4    MSEΣ × 10^5    Time (minute)/replication
1     50        0.018          0.026          7
      100       0.015          0.024          18
      200       0.014          0.024          47
      400       0.014          0.024          158
2     50        0.050          0.910          6
      100       0.035          0.828          11
      200       0.030          0.946          31
      400       0.029          1.134          94

Table 2: Large data simulation study: The table reports the mean squared errors and the running time averaged over 10 replications for the parallel VB runs with different subset sizes. Data are simulated from model (24) with m = 2.7 million.

6.1.2 Model selection

We now study the performance of the model selection procedure discussed in Section 5.2. We generate datasets from the logistic random intercept model (24) and also generate irrelevant covariates xij2 and zij1 randomly from the set {−1,0,1}. We have created a model selection problem in which there are 2 potential covariates for the fixed effects and 1 potential covariate for the random effects. It is reasonable to always include a fixed intercept and a random intercept in a GLMM. So there are a total of 8 candidate models to consider. We consider two values of m, 500 and 1000. Each is used to generate 100 datasets from the true model (24). The performance is measured by the correctly fitted rate (CFR), defined as the proportion of the 100 replications in which the true model is selected. The CFR is 80% for m = 500 and 100% for m = 1000, which shows that the model selection strategy performs well. The running time, averaged over the replications, taken to run the whole model selection procedure is 3.54 and 5.86 minutes for m = 500 and m = 1000, respectively. This running time is spent on fitting the 8 candidate models and computing the cross-validated plug-in LPDS.
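The count of 8 candidate models follows from 2 optional fixed effect covariates and 1 optional random effect covariate, with both intercepts always included (2 × 2 × 2 = 8). A minimal enumeration sketch is given below; the covariate labels `"x1"`, `"x2"`, `"z1"` and the function names are hypothetical, and `lpds` stands for any user-supplied function returning the cross-validated plug-in LPDS of a model.

```python
from itertools import product

def candidate_models(fixed_optional=("x1", "x2"), random_optional=("z1",)):
    """Enumerate all models that always keep the fixed and random intercepts."""
    n_fixed = len(fixed_optional)
    models = []
    for flags in product((False, True), repeat=n_fixed + len(random_optional)):
        fixed = ["intercept"] + [c for c, keep in zip(fixed_optional, flags[:n_fixed]) if keep]
        rand = ["intercept"] + [c for c, keep in zip(random_optional, flags[n_fixed:]) if keep]
        models.append((tuple(fixed), tuple(rand)))
    return models

def select_model(models, lpds):
    """Pick the candidate with the largest LPDS value."""
    return max(models, key=lpds)
```

Because the search is exhaustive, the number of candidates doubles with each additional optional covariate, which is the scalability limitation noted in Section 5.2.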

6.1.3 A comparison to MCMC

This simulation study compares the performance of the proposed parallel and hybrid VB algorithm to MCMC. Datasets are generated from a Poisson mixed model with a random intercept

$$y_{ij}\sim\mathrm{Poisson}(\lambda_{ij}),\quad \lambda_{ij}=\exp(\eta_{ij}),\quad \eta_{ij}=\beta_0+\beta_1x_{ij}+b_i,\quad b_i\sim N(0,\sigma^2),\quad i=1,\dots,m,\ j=1,\dots,n_i. \tag{25}$$

We set $\beta_0=-1.5$, $\beta_1=2.5$, $\sigma^2=0.2$ and $n_i=5$ with $x_{ij}$ generated from the uniform distribution on (0,1).

The performance is measured by (i) mean squared errors of the estimates of the fixed effects (MSEβ) and of the estimates of the variance of the random effect (MSEσ2); (ii) running time in minutes. Table 3 reports the simulation results, averaged over 10 replications, for four different sizes of data m ranging from small data (m = 50) to moderate data (m = 10000). We do not run the MCMC simulation in the cases m = 5000 and m = 10000 because it is very time consuming. In the case m = 10000, it takes approximately 1.1 seconds to run each likelihood estimation in parallel, thus it would take approximately 733 minutes to run one MCMC chain in the setting of this example (40000 iterates × 1.1 seconds ≈ 733 minutes). Table 3 shows that the performance of the VB and MCMC is very similar in terms of mean squared errors. However, the VB approach is much more computationally efficient.

m        Method    MSEβ     MSEσ2    Time (minute)
50       VB        0.155    0.025    0.07
         MCMC      0.155    0.057    18.8
200      VB        0.058    0.016    0.41
         MCMC      0.059    0.016    33.4
5000     VB        0.012    0.007    7.4
         MCMC      -        -        -
10000    VB        0.011    0.004    14.5
         MCMC      -        -        733

Table 3: Simulation example. The table reports the mean squared errors and the running time averaged over 10 replications.

6.2 Drug longitudinal data

The anti-epileptic drug longitudinal dataset (see, e.g., Fitzmaurice et al., 2011, p.346) consists of seizure counts on m = 59 epileptic patients over 5 time-intervals of treatment. The objective is to study the effects of the anti-epileptic drug on the patients. Following Fitzmaurice et al. (2011), we consider a mixed effects Poisson regression model but with a random intercept

$$p(y_{ij}|\beta,b_i)=\mathrm{Poisson}(\exp(\eta_{ij})),\quad \eta_{ij}=c_{ij}+\beta_1+\beta_2\,\mathrm{time}_{ij}+\beta_3\,\mathrm{treatment}_{ij}+\beta_4\,\mathrm{time}_{ij}\times\mathrm{treatment}_{ij}+b_i,$$

where $j=0,1,\dots,4$, $i=1,\dots,59$, $c_{ij}$ is an offset and $b_i\sim N(0,\sigma^2)$. The offset is $c_{ij}=\log(8)$ if $j=0$ and $c_{ij}=\log(2)$ for $j>0$, $\mathrm{time}_{ij}=j$, $\mathrm{treatment}_{ij}=0$ if patient $i$ is in the placebo group and $\mathrm{treatment}_{ij}=1$ if in the treatment group.
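The linear predictor of this model, including the offset, can be written out directly. The sketch below is illustrative only; the function names `linear_predictor` and `expected_count` are hypothetical, and `beta` is assumed to hold (β1, β2, β3, β4) in that order.

```python
import math

def linear_predictor(j, treatment, beta, b_i):
    """eta_ij for interval j (0..4), treatment indicator (0 or 1) and random intercept b_i."""
    offset = math.log(8.0) if j == 0 else math.log(2.0)
    time = float(j)
    b1, b2, b3, b4 = beta
    return offset + b1 + b2 * time + b3 * treatment + b4 * time * treatment + b_i

def expected_count(j, treatment, beta, b_i):
    """Poisson mean exp(eta_ij)."""
    return math.exp(linear_predictor(j, treatment, beta, b_i))
```

With all coefficients and the random intercept set to zero, the expected count reduces to the exponentiated offset, that is, 8 for j = 0 and 2 for j > 0, which shows how the offset rescales the Poisson mean across intervals.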

The running times of the VB (without partitioning the data into subsets, because m is small) and MCMC in this example are 0.14 minutes and 17.7 minutes, respectively. Figure 2 plots the VB estimates (dashed line) and MCMC estimates (solid line) of the marginal posterior densities $p(\beta_i|y)$, $i=1,\dots,4$ and $p(\sigma^2|y)$. All the MCMC density estimates in this paper are carried out using the kernel density estimation method based on the built-in Matlab function ksdensity. The last panel in Figure 2 also plots the improved VB density $\tilde q(\sigma^2)$ given in (23), estimated by kernel density estimation based on the draws of $\sigma^2$ generated from (23). The figure shows that the VB estimates are very close to the MCMC estimates in this example. As shown, the improved VB estimate $\tilde q(\sigma^2)$ is closer to $p(\sigma^2|y)$ than the original VB estimate $q(\sigma^2)$.

We now partition the data into 5 subsets and compute the 5-fold cross-validated LPDS for both MCMC and the parallel VB. The cross-validated LPDS for MCMC, VB and improved VB are −0.5732, −0.5807 and −0.5803 respectively. This shows that the methods have similar predictive performance, with MCMC slightly better.

[Figure 2: five panels showing the marginal posterior density estimates for β1, β2, β3, β4 and σ2; legend: MCMC, VB, improved VB.]

Figure 2: The VB estimates (dashed) and MCMC estimates (solid) of the marginal posterior densities for the anti-epileptic drug data.

6.3 Six city data

The six cities data in Fitzmaurice and Laird (1993) consist of binary responses $y_{ij}$ which indicate the wheezing status (1 if wheezing, 0 if not wheezing) of the $i$th child at time-point $j$, $i=1,\dots,537$ and $j=1,\dots,4$. Covariates are the age of the child at time-point $j$, centered at 9 years, and the maternal smoking status (0 or 1). We consider the following logistic regression model with a random intercept

$$p(y_{ij}|\beta,b_i)=\mathrm{Binomial}(1,p_{ij}),\quad \mathrm{logit}(p_{ij})=\beta_1+\beta_2\,\mathrm{Age}_{ij}+\beta_3\,\mathrm{Smoke}_{ij}+b_i.$$

Figure 3 plots the VB estimates (dashed line) and MCMC estimates (solid line) of the marginal posterior densities $p(\beta_i|y)$, $i=1,\dots,3$ and $p(\sigma^2|y)$. The running times of the VB and MCMC in this example are 0.56 and 52.8 minutes, respectively. The figure shows that the VB estimates of the posterior densities of the $\beta_j$ are again close to the MCMC estimates. The improved VB estimate $\tilde q(\sigma^2)$ overcomes the problem of underestimation of the variance inherent in $q(\sigma^2)$. The VB is about 94 times more computationally efficient than the MCMC implementation.

The cross-validated LPDS for MCMC, VB and improved VB are 132.37, 132.35 and 132.37 respectively. This shows that the methods have very similar predictive performance.

[Figure 3: four panels showing the marginal posterior density estimates for β1, β2, β3 and σ2; legend: MCMC, VB, improved VB.]

Figure 3: The VB estimates (dashed) and MCMC estimates (solid) of the marginal posterior densities for the six city data.

6.4 Skin cancer data

A clinical trial was conducted to test the effectiveness of beta-carotene in preventing non-melanoma skin cancer (Greenberg et al., 1989). Patients were randomly assigned to a control or treatment group and biopsied once a year to ascertain the number of new skin cancers since the last examination. The response $y_{ij}$ is a count of the number of new skin cancers in year $j$ for the $i$th subject. Covariates include age, skin (1 if skin has burns and 0 otherwise), gender, exposure (a count of the number of previous skin cancers), year of follow-up and treatment (1 if the subject is in the treatment group and 0 otherwise). There are m = 1683 subjects with complete covariate information.

Donohue et al. (2011) consider 5 different Poisson mixed models with different inclusion of covariates, whose inclusion status is given in Table 4. Using the model selection strategy described in Section 5.2, we compute the cross-validated LPDS, whose values are shown in Table 4 and suggest that Model 1 should be chosen. By using an AIC-type model selection criterion, Donohue et al. (2011) show that the first three models cannot be distinguished and, on parsimony grounds, they select Model 1.

                       Model 1    Model 2    Model 3    Model 4    Model 5
Fixed intercept        Y          Y          Y          Y          Y
Age                    Y          Y          Y          Y          Y
Skin                   Y          Y          Y          Y          Y
Gender                 Y          Y          Y          Y          Y
Exposure               Y          Y          Y          Y          Y
Year                   N          Y          Y          Y          Y
Year^2                 N          N          Y          N          Y
Random intercept       Y          Y          Y          Y          Y
Random slope (Year)    N          N          N          Y          Y
LPDS                   −277.5     −278.5     −278.1     −1366.6    −1404.6

Table 4: Five different Poisson mixed models for the skin cancer data and their LPDS values, which show that Model 1 is chosen.

For comparison, after selecting Model 1, we also use MCMC to estimate this model, which is

$$p(y_{ij}|\beta,b_i)=\mathrm{Poisson}(\exp(\eta_{ij})),\quad \eta_{ij}=\beta_0+\beta_1\,\mathrm{Age}_i+\beta_2\,\mathrm{Skin}_i+\beta_3\,\mathrm{Gender}_i+\beta_4\,\mathrm{Exposure}_{ij}+b_i,$$

where $b_i\sim N(0,\sigma^2)$, $i=1,\dots,m=1683$, $j=1,\dots,5$.

Figure 4 plots the VB estimates (dashed line) and MCMC estimates (solid line) of the marginal posterior densities $p(\beta_i|y)$, $i=0,1,\dots,4$ and $p(\sigma^2|y)$. The running times of the VB and MCMC are 1.45 and 130 minutes, respectively. The VB and MCMC estimates of the fixed effects $\beta$ are similar. The VB method is about 90 times more computationally efficient than the MCMC. The cross-validated LPDS for MCMC and VB are −0.992 and −0.998, which shows that the methods perform similarly in terms of predictive performance.

[Figure 4: six panels showing the marginal posterior density estimates for β0, β1, β2, β3, β4 and σ2; legend: MCMC, VB, improved VB.]

Figure 4: The VB estimates (dashed) and MCMC estimates (solid) of the marginal posterior densities when fitting Model 1 to the skin cancer data.

7 Conclusion

We have developed a hybrid VB algorithm that uses a flexible and accurate fixed-form VB algorithm within a mean-field VB updating procedure for approximate Bayesian inference, which is similar in spirit to the Metropolis-Hastings within Gibbs sampling method in MCMC simulation. If the variational distribution is factorized into a product and an exponential form is specified for factors that do not have a conjugate form, then the new algorithm can be used to approximate any posterior distribution without relying on conjugate priors. A simple approach for improving on the VB approximation is described. We have also developed a divide and recombine strategy for handling large datasets, and a method for model selection as a by-product. The proposed VB method is applied to fitting GLMMs and is demonstrated by several simulated and real data examples.

Acknowledgment

The research of Minh-Ngoc Tran and Robert Kohn was partially supported by Australian Research Council grant DP0667069. David Nott’s research was supported by a Singapore Ministry of Education Academic Research Fund Tier 2 grant (R-155-000-143-112).

References

Andrieu, C. and Roberts, G. (2009). The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics, 37:697–725.

Attias, H. (1999). Inferring parameters and structure of latent variable models by variational Bayes. In Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence, pages 21–30.

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. New York: Springer.

Braun, M. and McAuliffe, J. (2010). Variational inference for large-scale models of discrete choice. Journal of the American Statistical Association, 105(489):324–335.

Broderick, T., Boyd, N., Wibisono, A., Wilson, A. C., and Jordan, M. I. (2013). Streaming variational Bayes. Technical report, University of California, Berkeley. Available at http://arxiv.org/pdf/1307.6769v1.pdf.

Donohue, M. C., Overholser, R., Xu, R., and Vaida, F. (2011). Conditional Akaike information under generalized linear and proportional hazards mixed models. Biometrika, 98:685–700.

Fitzmaurice, G. M., Laird, N. M., and Ware, J. H. (2011). Applied Longitudinal Analysis. John Wiley & Sons, Ltd, New Jersey, 2nd edition.

Flury, T. and Shephard, N. (2011). Bayesian inference based only on simulated likelihood: Particle filter analysis of dynamic economic models. Econometric Theory, 1:1–24.

Ghahramani, Z. and Beal, M. (2001). Propagation algorithms for variational Bayesian learning. In Leen, T., Dietterich, T., and Tresp, V., editors, Neural Information Processing Systems, volume 13, pages 507–513. MIT Press.

Good, I. J. (1952). Rational decisions. Journal of the Royal Statistical Society B, 14:107–114.

Greenberg, E. R., Baron, J. A., Stevens, M. M., Stukel, T. A., Mandel, J. S., Spencer, S. K., Elias, P. M., Lowe, N., Nierenberg, D. N., G., B., and Vance, J. C. (1989). The skin cancer prevention study: design of a clinical trial of beta-carotene among persons at high risk for nonmelanoma skin cancer. Controlled Clinical Trials, 10:153–166.

Guha, S., Hafen, R., Rounds, J., Xia, J., Li, J., Xi, B., and Cleveland, W. S. (2012). Large complex data: divide and recombine (D&R) with RHIPE. Stat, 1:53–67.

Haario, H., Saksman, E., and Tamminen, J. (2001). An adaptive Metropolis algorithm. Bernoulli, 7:223–242.

Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. (2013). Stochastic variational inference. Journal of Machine Learning Research, 14:1303–1347.

Honkela, A., Raiko, T., Kuusela, M., Tornio, M., and Karhunen, J. (2010). Approximate Riemannian conjugate gradient learning for fixed-form variational Bayes. Journal of Machine Learning Research, 11:3235–3268.

Knowles, D. A. and Minka, T. (2011). Non-conjugate variational message passing for multinomial and binary regression. In Shawe-Taylor, J., Zemel, R., Bartlett, P., Pereira, F., and Weinberger, K., editors, Advances in Neural Information Processing Systems 24, pages 1701–1709. Neural Information Processing Systems.

Leonard, T. and Hsu, J. S. J. (1992). Bayesian inference for a covariance matrix. The Annals of Statistics, 20(4):1669–1696.

Luts, J., Broderick, T., and Wand, M. P. (2013). Real-time semiparametric regression. Journal of Computational and Graphical Statistics. In press.

Maclaurin, D. and Adams, R. P. (2014). Firefly Monte Carlo: Exact MCMC with subsets of data. Technical report, Harvard University.

Minka, T. (2001). A family of algorithms for approximate Bayesian inference. PhD thesis, MIT.

Neiswanger, W., Wang, C., and Xing, E. (2014). Asymptotically exact, embarrassingly parallel MCMC. Technical report, Carnegie Mellon University.

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. (2009). Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19:1574–1609.

Opper, M. and Archambeau, C. (2009). The variational Gaussian approximation revisited. Neural Computation, 21:786–792.

Ormerod, J. T. and Wand, M. P. (2010). Explaining variational approximations. American Statistician, 64:140–153.

Ormerod, J. T. and Wand, M. P. (2012). Gaussian variational approximate inference for generalized linear mixed models. J. Comput. Graph. Stat., 21:2–17.

Quiroz, M., Villani, M., and Kohn, R. (2014). Speeding up MCMC by efficient data subsampling. Technical report, Stockholm University.

Rijmen, F. and Vomlel, J. (2008). Assessing the performance of variational methods for mixed logistic regression models. J. Stat. Comput. Simul., 78:765–779.

Robbins, H. and Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407.

Salimans, T. and Knowles, D. A. (2013). Fixed-form variational posterior approximation through stochastic linear regression. Bayesian Analysis, 8(4):741–908.

Sato, M. (2001). Online model selection based on the variational Bayes. Neural Computation, 13(7):1649–1681.

Tan, L. S. L. and Nott, D. J. (2013a). A stochastic variational framework for fitting and diagnosing generalized linear mixed models. Technical report, National University of Singapore. Available at http://arxiv.org/pdf/1208.4949v2.pdf.

Tan, L. S. L. and Nott, D. J. (2013b). Variational inference for generalized linear mixed models using partially noncentered parametrizations. Statistical Science, 28:168–188.

Tchumtchoua, S., Dunson, D. B., and Morris, J. S. (2011). Online variational Bayes inference for high dimensional correlated data. Technical report, Duke University. Available at http://arxiv.org/pdf/1108.1079/.

Tresp, V. (2000). A Bayesian committee machine. Neural Computation, 12:2719–2741.

Wang, C. and Blei, D. M. (2013). Variational inference in nonconjugate models. Journal of Machine Learning Research, 14:1005–1031.

Waterhouse, S., MacKay, D., and Robinson, T. (1996). Bayesian methods for mixtures of experts. In Touretzky, M. C. M. D. S. and Hasselmo, M. E., editors, Advances in Neural Information Processing Systems, pages 351–357. MIT Press.

Zeger, S. L. and Karim, M. R. (1991). Generalized linear models with random effects: A Gibbs sampling approach. Journal of the American Statistical Association, 86:79–86.