Parallel variational Bayes for large datasets with an application to generalized linear mixed models

Minh-Ngoc Tran, David J. Nott, Anthony Y.C. Kuk and Robert Kohn∗

Abstract

The article develops a hybrid Variational Bayes algorithm that combines the mean-field and stochastic linear regression fixed-form Variational Bayes methods. The new estimation algorithm can be used to approximate any posterior without relying on conjugate priors. We propose a divide and recombine strategy for the analysis of large datasets, which partitions a large dataset into smaller subsets and then combines the variational distributions that have been learnt in parallel on each separate subset using the hybrid Variational Bayes algorithm. We also describe an efficient model selection strategy using cross validation, which is straightforward to implement as a by-product of the parallel run. The proposed method is applied to fitting generalized linear mixed models. The computational efficiency of the parallel and hybrid Variational Bayes algorithm is demonstrated on several simulated and real datasets.

Keywords. Divide and Recombine, Fixed-form Variational Bayes, Improved Variational Bayes approximation, Mean-field Variational Bayes, Parallelization.
1 Introduction

Variational Bayes (VB) methods are increasingly used in machine learning and statistics as a computationally efficient alternative to Markov Chain Monte Carlo (MCMC) simulation for approximating posterior distributions in Bayesian inference. See, for example, Bishop (2006); Ormerod and Wand (2010). VB algorithms can be categorized into two main groups: the mean-field Variational Bayes (MFVB) algorithms (Attias, 1999; Waterhouse et al., 1996; Ghahramani and Beal, 2001) and the fixed-form Variational Bayes (FFVB) algorithms (Honkela et al., 2010; Salimans and Knowles, 2013). Mean-field VB refers to VB algorithms that factorize the VB distribution into factors without assuming any fixed functional forms of the factors, with the forms of the optimal factors automatically determined by using conjugate priors (Ormerod and Wand, 2010). Fixed-form VB refers to VB algorithms that assume a fixed functional form for the VB distribution, with the parameters optimized using some optimization procedure such as stochastic gradient descent search (Honkela et al., 2010; Salimans and Knowles, 2013). The MFVB algorithm provides an efficient and convenient iterative scheme for updating the variational parameters, but in its exact form it requires conjugate priors and therefore rules out some interesting models. In applications of VB a common strategy is to combine mean-field steps where these are convenient with fixed-form steps. The convergence of the whole updating procedure is guaranteed as long as the lower bound is increased after each fixed-form update step; see Section 2. This article develops a VB algorithm that uses this combination strategy, in which the stochastic search FFVB method of Salimans and Knowles (2013) (see Section 3) is used within an MFVB procedure for updating variational distribution factors that do not have a conjugate form. Related work by Waterhouse et al. (1996) and Wang and Blei (2013) used the Laplace approximation for updating non-conjugate variational factors, and Knowles and Minka (2011) introduced the non-conjugate variational message passing framework for variational Bayes with approximations in the exponential family when the lower bound can be approximated in some way. Braun and McAuliffe (2010) and Wang and Blei (2013) also consider approximating the lower bound in non-conjugate models using the delta method. Tan and Nott (2013a) extend the stochastic variational inference approach of Hoffman et al. (2013) by combining non-conjugate variational message passing with algorithms from stochastic optimization which work with mini-batches of data, and apply the idea to non-conjugate generalized linear mixed models. We refer to the algorithm we develop as the stochastic fixed-form within mean-field Variational Bayes algorithm, or the hybrid Variational Bayes algorithm. The new algorithm can be used to conveniently and efficiently approximate any posterior without relying on the conjugacy assumption.

∗ Minh-Ngoc Tran is Research Fellow, Australian School of Business, University of New South Wales, Sydney 2052 Australia ([email protected]). David J. Nott is Associate Professor, Department of Statistics and Applied Probability, National University of Singapore, Singapore 117546 ([email protected]). Anthony Y.C. Kuk is Professor, Department of Statistics and Applied Probability, National University of Singapore, Singapore 117546 ([email protected]). Robert Kohn is Professor, Australian School of Business, University of New South Wales, Sydney 2052 Australia ([email protected]).

The second contribution of this article is to propose a parallel VB procedure that uses a divide and recombine strategy for the analysis of large datasets based on exponential family variational Bayes posterior approximations. We borrow the terminology "divide and recombine" from Guha et al. (2012). The idea is to partition a large dataset into smaller subsets and learn the variational distribution in parallel on each separate subset using the hybrid Variational Bayes algorithm. The resulting variational distributions are then recombined to construct the final approximation of the posterior. The recombination is particularly easy for posterior approximations in the exponential family. The methodology proposed in our article is closely related to the methodology proposed independently in a recent preprint by Broderick et al. (2013). The main difference is that they develop the methodology in an online setting in which the data subsets arrive sequentially in time, while we describe the method in a static setting in which the whole dataset has already been collected. Furthermore, we show how to use the parallel divide and recombine strategy for model selection using cross validation. We also study empirically the effect of the number of data subsets and recommend a good number to use in practice. Other online approaches to variational Bayes have been considered by Sato (2001), Tchumtchoua et al. (2011) and Luts et al. (2013).

Recently, several articles have worked with a random subset of the full dataset at a time as a way to speed up computations (Maclaurin and Adams, 2014; Quiroz et al., 2014). These methods require access to the full dataset, which needs to remain on hold, and therefore require frequent communication between the working machines. It is important to note that our parallel VB is computer-memory efficient in the sense that the full dataset need not remain on hold and only minimal communication between the machines is needed. After a one-time partitioning of the full data into small subsets, VB procedures are run on multiple machines without communicating between them, except for the recombine step. This is similar to the parallel MCMC method of Neiswanger et al. (2014), which first partitions the full dataset into subsets and runs separate MCMC chains to sample from the sub-posteriors based on the subsets, then constructs an approximation of the full-data posterior using the MCMC chains.

A drawback of mean-field VB is that the dependence between the parameter blocks in the factorization is ignored. As a result, VB often underestimates the variances of the posterior distributions. The third contribution of this article is to propose a simple modification of the original VB approximation that improves its approximation accuracy by accounting for the posterior dependence between (some of) the blocks of the parameters.

As a main application of the proposed methods, we derive a detailed algorithm for fitting generalized linear mixed models (GLMMs). GLMMs are often considered difficult to estimate because of the presence of random effects and lack of conjugate priors. VB schemes for GLMMs were considered previously by Rijmen and Vomlel (2008); Ormerod and Wand (2012) and Tan and Nott (2013b), and shown to have attractive computational and accuracy trade-offs.

The computational efficiency and accuracy of the proposed method are demonstrated on several simulated and real datasets. We show how the method can be used to handle large datasets of tens of millions of observations, and even more, in several minutes.

The rest of the paper is organized as follows. Section 2 first provides the background to VB methods, then presents the hybrid VB algorithm and the method for improving on the VB approximation. Section 3 reviews the fixed-form VB method of Salimans and Knowles (2013) that we use for updating the non-conjugate variational factors within the mean-field VB algorithm. Section 4 presents the parallel implementation idea for handling large datasets. The detailed parallel and hybrid Variational Bayes algorithm for fitting GLMMs is presented in Section 5. Section 6 reports the results of a simulation study and the analysis of real data examples.

2 Some Variational Bayes theory

Suppose we have data $y$, a likelihood $p(y|\theta)$ where $\theta \in \mathbb{R}^d$ is an unknown parameter, and a prior distribution $p(\theta)$ for $\theta$. Variational Bayes (VB) approximates the posterior $p(\theta|y) \propto p(\theta)p(y|\theta)$ by a distribution $q(\theta)$ within some more tractable class, chosen to minimize the Kullback-Leibler divergence

$$\mathrm{KL}(q\|p) = \int q(\theta) \log \frac{q(\theta)}{p(\theta|y)}\, d\theta. \tag{1}$$

We have

$$\log p(y) = \int q(\theta) \log \frac{p(y,\theta)}{q(\theta)}\, d\theta + \int q(\theta) \log \frac{q(\theta)}{p(\theta|y)}\, d\theta = L(q) + \mathrm{KL}(q\|p),$$

where

$$L(q) = \int q(\theta) \log \frac{p(y,\theta)}{q(\theta)}\, d\theta. \tag{2}$$

As $\mathrm{KL}(q\|p) \ge 0$, $\log p(y) \ge L(q)$ for every $q(\theta)$. $L(q)$ is therefore often called the lower bound, and minimizing $\mathrm{KL}(q\|p)$ is equivalent to maximizing $L(q)$.
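
The decomposition $\log p(y) = L(q) + \mathrm{KL}(q\|p)$ can be checked numerically on a toy problem. The sketch below is only an illustration under assumptions that are not part of the paper: the parameter takes three values, and the array `joint` holds made-up values of $p(y,\theta)$ for the observed $y$.

```python
import numpy as np

# Hypothetical toy model: theta takes three values and joint[i] = p(y, theta_i)
# for the observed y. The numbers are illustrative only.
joint = np.array([0.10, 0.25, 0.05])
log_evidence = np.log(joint.sum())        # log p(y)
posterior = joint / joint.sum()           # p(theta | y)

def lower_bound(q):
    """L(q) = sum_theta q(theta) log{ p(y, theta) / q(theta) }."""
    return float(np.sum(q * np.log(joint / q)))

def kl(q):
    """KL(q || p(. | y)) = sum_theta q(theta) log{ q(theta) / p(theta | y) }."""
    return float(np.sum(q * np.log(q / posterior)))

q = np.array([0.2, 0.5, 0.3])             # an arbitrary approximating distribution
gap = log_evidence - lower_bound(q)       # equals KL(q || p) by the identity above
```

Setting `q` equal to `posterior` makes the gap vanish, which is the discrete analogue of the statement that maximizing $L(q)$ is equivalent to minimizing the divergence.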
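Often factorized approximations to the posterior are considered in variational Bayes. We explain the idea for a factorization with 2 blocks. Assume that $\theta = (\theta_1, \theta_2)$ and that $q(\theta)$ is factorized as

$$q(\theta) = q_1(\theta_1)\, q_2(\theta_2).$$

We further assume that $q_1(\theta_1) = q_{\lambda_1}(\theta_1)$ and $q_2(\theta_2) = q_{\lambda_2}(\theta_2)$, where $\lambda_1$ and $\lambda_2$ are variational parameters that need to be estimated. Then

$$
\begin{aligned}
L(\lambda_1, \lambda_2) = L(q) &= \int\!\!\int q_{\lambda_1}(\theta_1) q_{\lambda_2}(\theta_2) \log p(y,\theta)\, d\theta_1 d\theta_2 - \int q_{\lambda_1}(\theta_1) \log q_{\lambda_1}(\theta_1)\, d\theta_1 + C(\lambda_2) \\
&= \int q_{\lambda_1}(\theta_1) \left( \int q_{\lambda_2}(\theta_2) \log p(y,\theta)\, d\theta_2 \right) d\theta_1 - \int q_{\lambda_1}(\theta_1) \log q_{\lambda_1}(\theta_1)\, d\theta_1 + C(\lambda_2) \\
&= \int q_{\lambda_1}(\theta_1) \log \tilde p_1(y,\theta_1)\, d\theta_1 - \int q_{\lambda_1}(\theta_1) \log q_{\lambda_1}(\theta_1)\, d\theta_1 + C(\lambda_2) \\
&= \int q_{\lambda_1}(\theta_1) \log \frac{\tilde p_1(y,\theta_1)}{q_{\lambda_1}(\theta_1)}\, d\theta_1 + C(\lambda_2),
\end{aligned}
\tag{3}
$$

where $C(\lambda_2)$ is a constant depending only on $\lambda_2$ and

$$\tilde p_1(y,\theta_1) = \exp\left( \int q_{\lambda_2}(\theta_2) \log p(y,\theta)\, d\theta_2 \right) = \exp\left( E_{-\theta_1}(\log p(y,\theta)) \right).$$

Let

$$\lambda_1^* = \lambda_1^*(\lambda_2) = \arg\max_{\lambda_1} \int q_{\lambda_1}(\theta_1) \log \frac{\tilde p_1(y,\theta_1)}{q_{\lambda_1}(\theta_1)}\, d\theta_1; \tag{4}$$

then

$$L(\lambda_1^*, \lambda_2) \ge L(\lambda_1, \lambda_2) \quad \text{for all } \lambda_1. \tag{5}$$

Similarly, let

$$\lambda_2^* = \lambda_2^*(\lambda_1) = \arg\max_{\lambda_2} \int q_{\lambda_2}(\theta_2) \log \frac{\tilde p_2(y,\theta_2)}{q_{\lambda_2}(\theta_2)}\, d\theta_2, \tag{6}$$

with

$$\tilde p_2(y,\theta_2) = \exp\left( \int q_{\lambda_1}(\theta_1) \log p(y,\theta)\, d\theta_1 \right) = \exp\left( E_{-\theta_2}(\log p(y,\theta)) \right).$$

Then,

$$L(\lambda_1, \lambda_2^*) \ge L(\lambda_1, \lambda_2) \quad \text{for all } \lambda_2. \tag{7}$$

Let $\lambda^{\text{old}} = (\lambda_1^{\text{old}}, \lambda_2^{\text{old}})$, $\lambda_1^{\text{new}} = \lambda_1^*(\lambda_2^{\text{old}})$ as in (4) and $\lambda_2^{\text{new}} = \lambda_2^*(\lambda_1^{\text{new}})$ as in (6). Then,

$$L(\lambda^{\text{new}}) \ge L(\lambda^{\text{old}}). \tag{8}$$

This leads to an iterative scheme for updating $\lambda$, and (8) ensures the improvement of the lower bound over the iterations. Because the lower bound $L(\lambda)$ is bounded from above, the convergence of the iterative scheme is ensured under some mild conditions. The above argument can be easily extended to the general case in which $q(\theta)$ is factorized into $K$ blocks, $q(\theta) = q_1(\theta_1) \times \dots \times q_K(\theta_K)$.

The variational Bayes approximation is now reduced to solving an optimization problem in the form of (4). In many cases, a conjugate prior $p(\theta_1)$ can be selected such that $\tilde p_1(y,\theta_1)$ belongs to a recognizable density family. Then the optimal VB posterior $q_{\lambda_1}(\theta_1)$ that maximizes the integral on the right hand side of (4) is $\tilde p_1(y,\theta_1)$, i.e.

$$q_{\lambda_1^*}(\theta_1) \propto \tilde p_1(y,\theta_1) = \exp\left( E_{-\theta_1}(\log p(y,\theta)) \right), \tag{9}$$

and $\lambda_1^*$ is determined accordingly. In such cases, the resulting iterative procedure is often referred to as the mean-field Variational Bayes (MFVB) algorithm. The MFVB is computationally convenient but it is not easy to apply to some interesting models involving non-conjugate priors.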
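
As an illustration of the conjugate updates in (9), the following sketch applies the two-block scheme to a textbook model that is not taken from this paper: $x_i \sim N(\mu, \tau^{-1})$ with prior $\mu|\tau \sim N(\mu_0, (\lambda_0\tau)^{-1})$ and $\tau \sim \mathrm{Gamma}(a_0, b_0)$, with factorization $q(\mu)q(\tau)$. The function name `mfvb_normal` and the hyperparameter defaults are assumptions made for the example.

```python
import numpy as np

def mfvb_normal(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, iters=50):
    """Mean-field VB for a normal model with unknown mean and precision.

    Returns the parameters of q(mu) = N(mu_n, 1/lam_n) and q(tau) = Gamma(a_n, b_n).
    Illustrative conjugate example, not the GLMM algorithm of the paper.
    """
    x = np.asarray(x, dtype=float)
    n, xbar = x.size, x.mean()
    mu_n = (lam0 * mu0 + n * xbar) / (lam0 + n)   # fixed across iterations
    a_n = a0 + 0.5 * (n + 1)                      # fixed across iterations
    e_tau = a0 / b0                               # initial guess for E_q(tau)
    for _ in range(iters):
        # Update q(mu) given the current E_q(tau).
        lam_n = (lam0 + n) * e_tau
        # Update q(tau) given q(mu): expectations of the quadratic terms under q(mu).
        b_n = b0 + 0.5 * (np.sum((x - mu_n) ** 2) + n / lam_n
                          + lam0 * ((mu_n - mu0) ** 2 + 1.0 / lam_n))
        e_tau = a_n / b_n
    return mu_n, lam_n, a_n, b_n
```

Each pass through the loop is one sweep of the updates (4) and (6), both available in closed form here because the priors are conjugate.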
If $\tilde p_1(y,\theta_1)$ does not belong to a recognizable density family, some optimization technique is needed to solve (4). Note that (4) has exactly the same form as the original VB problem that attempts to maximize $L(q)$ in (2). We can first select a functional form for the variational distribution $q$ and then estimate the unknown parameters accordingly. Such a method is known in the literature as the fixed-form Variational Bayes (FFVB) algorithm. If the variational distribution is assumed to belong to the exponential family with unknown parameters $\lambda$, Salimans and Knowles (2013) propose a stochastic approximation method for solving for $\lambda$. The details of this method are presented in the next section. We can therefore use an FFVB algorithm within an MFVB procedure to solve (4), and the convergence of the whole procedure is still guaranteed because of (8). Interestingly, this procedure is similar in spirit to the popular Metropolis-Hastings within Gibbs sampling in Markov Chain Monte Carlo simulation, in the sense that the FFVB update step, which is similar to the Metropolis-Hastings sampling step, is carried out whenever a closed form update is not available.
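2.1 Improving the VB approximation

A drawback of VB is the assumption of posterior independence between $\theta_1$ and $\theta_2$. This often causes VB to underestimate the posterior variances of $\theta_1$ and $\theta_2$. Suppose that the full conditional $p(\theta_2|\theta_1,y)$ is a standard distribution in $\theta_2$ or that it is straightforward to sample from it. Then, once the VB approximation $q(\theta) = q_1(\theta_1)q_2(\theta_2)$ has been learnt, we suggest replacing $q(\theta)$ by $\tilde q(\theta) = q_1(\theta_1)p(\theta_2|\theta_1,y)$. The improved VB approximation $\tilde q(\theta)$ is better than the original VB approximation $q(\theta)$ in the sense that

$$
\begin{aligned}
\mathrm{KL}(q\|p) - \mathrm{KL}(\tilde q\|p) &= \int q_1(\theta_1)q_2(\theta_2) \log \frac{q_1(\theta_1)q_2(\theta_2)}{p(\theta_1|y)p(\theta_2|\theta_1,y)}\, d\theta_1 d\theta_2 \\
&\quad - \int q_1(\theta_1)p(\theta_2|\theta_1,y) \log \frac{q_1(\theta_1)p(\theta_2|\theta_1,y)}{p(\theta_1|y)p(\theta_2|\theta_1,y)}\, d\theta_1 d\theta_2 \\
&= \int q_1(\theta_1) \int \left\{ \big(q_2(\theta_2) - p(\theta_2|\theta_1,y)\big) \log \frac{q_1(\theta_1)}{p(\theta_1|y)} + q_2(\theta_2) \log \frac{q_2(\theta_2)}{p(\theta_2|\theta_1,y)} \right\} d\theta_2\, d\theta_1 \\
&= \int q_1(\theta_1) \int q_2(\theta_2) \log \frac{q_2(\theta_2)}{p(\theta_2|\theta_1,y)}\, d\theta_2\, d\theta_1 \;\ge\; 0.
\end{aligned}
$$
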
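
The inequality above can be illustrated with a bivariate normal posterior, for which everything is available in closed form. The sketch below is an assumption-laden toy example, not part of the paper: the precision matrix `Lam` is made up, and it relies on the standard result that the mean-field optimum for a Gaussian target has independent factors with variances $1/\Lambda_{kk}$.

```python
import numpy as np

def kl_gauss_same_mean(Sq, Sp):
    """KL( N(m, Sq) || N(m, Sp) ) for two normals sharing the same mean."""
    d = Sq.shape[0]
    return 0.5 * (np.trace(np.linalg.solve(Sp, Sq)) - d
                  + np.log(np.linalg.det(Sp) / np.linalg.det(Sq)))

Lam = np.array([[2.0, 1.2], [1.2, 1.5]])    # illustrative posterior precision matrix
Sp = np.linalg.inv(Lam)                     # true posterior covariance

# Mean-field approximation q1(theta1) q2(theta2): independent, variances 1 / Lam_kk.
Sq = np.diag(1.0 / np.diag(Lam))

# Improved approximation q1(theta1) p(theta2 | theta1, y): theta2 given theta1 has
# regression slope -Lam12 / Lam22 on theta1 and conditional variance 1 / Lam22.
v1 = 1.0 / Lam[0, 0]
slope = -Lam[0, 1] / Lam[1, 1]
S_tilde = np.array([[v1, slope * v1],
                    [slope * v1, slope ** 2 * v1 + 1.0 / Lam[1, 1]]])

kl_mean_field = kl_gauss_same_mean(Sq, Sp)
kl_improved = kl_gauss_same_mean(S_tilde, Sp)
```

In this example the variance of $\theta_2$ under the improved approximation lies between the mean-field variance and the true posterior variance, and `kl_improved` is smaller than `kl_mean_field`, in line with the derivation.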
In the general case in which $\theta$ is factorized into $K$ blocks, assume that for some $1 \le k < K$, the full conditionals $p(\theta_{k+1}|\theta_1,\dots,\theta_k,y), \dots, p(\theta_K|\theta_1,\dots,\theta_{K-1},y)$ are standard distributions or that it is straightforward to sample from them. Then, once the VB approximation $q(\theta) = q_1(\theta_1) \times \dots \times q_K(\theta_K)$ has been learnt, we can replace $q(\theta)$ by

$$\tilde q(\theta) = q_1(\theta_1) \times \dots \times q_k(\theta_k)\, p(\theta_{k+1}|\theta_1,\dots,\theta_k,y) \times \dots \times p(\theta_K|\theta_1,\dots,\theta_{K-1},y).$$

If the full conditionals are standard distributions, then it might be possible to work with the collapsed model in which the blocks $\theta_{k+1},\dots,\theta_K$ are integrated out, i.e. we approximate the posterior of $\theta_1,\dots,\theta_k$ only. Working with the collapsed model will remove the influence of the factorization of the blocks which are not integrated out. It is therefore advantageous to work with the collapsed model if possible. However, in some cases we cannot or do not wish to work with the collapsed model. This is because some of the full conditionals can be easy to sample from but are not of standard forms which allow analytic integrals, and the collapsed model may break the structure that allows the convenient mean-field update.
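3 Fixed-form Variational Bayes method of Salimans and Knowles

Salimans and Knowles (2013) approximate the posterior $p(\theta|y) \propto p(\theta)p(y|\theta)$ by a density, with respect to some base measure which we take as Lebesgue measure for simplicity, which is in the exponential family

$$q_\lambda(\theta) = \exp\left( S(\theta)^T \lambda - Z(\lambda) \right),$$

where $\lambda$ is a vector of natural parameters, $S(\theta)$ denotes a vector of sufficient statistics for the given exponential family and $Z(\lambda)$ is a normalization term. The parameter vector $\lambda$ is chosen by minimizing the Kullback-Leibler divergence

$$\mathrm{KL}(\lambda) = \int \log \frac{p(\theta|y)}{q_\lambda(\theta)}\, q_\lambda(\theta)\, d\theta = \int \left\{ \log p(\theta|y) - S(\theta)^T\lambda + Z(\lambda) \right\} \exp\left( S(\theta)^T\lambda - Z(\lambda) \right) d\theta.$$

Differentiating with respect to $\lambda$, and using the result that for exponential families

$$\nabla_\lambda Z(\lambda) = \int S(\theta) q_\lambda(\theta)\, d\theta = E_\lambda(S(\theta)), \tag{10}$$

we have

$$\nabla_\lambda \mathrm{KL}(\lambda) = \int \left\{ -S(\theta) + \nabla_\lambda Z(\lambda) \right\} q_\lambda(\theta)\, d\theta + \int \left\{ S(\theta) - \nabla_\lambda Z(\lambda) \right\} \left\{ \log p(\theta|y) - S(\theta)^T\lambda + Z(\lambda) \right\} q_\lambda(\theta)\, d\theta. \tag{11}$$

Equation (10) can be obtained by differentiating the normalization condition $\int q_\lambda(\theta)\, d\theta = 1$ with respect to $\lambda$. Using (10), the first term on the right hand side of (11) disappears, leaving

$$
\begin{aligned}
\nabla_\lambda \mathrm{KL}(\lambda) &= \int \log p(\theta|y) \left\{ S(\theta) - \nabla_\lambda Z(\lambda) \right\} q_\lambda(\theta)\, d\theta \\
&\quad - \int \left\{ S(\theta)S(\theta)^T\lambda - \nabla_\lambda Z(\lambda) S(\theta)^T\lambda - S(\theta)Z(\lambda) + \nabla_\lambda Z(\lambda) Z(\lambda) \right\} q_\lambda(\theta)\, d\theta \\
&= \mathrm{Cov}_\lambda\big( S(\theta), \log p(\theta|y) \big) - \mathrm{Cov}_\lambda\big( S(\theta) \big) \lambda,
\end{aligned}
$$

where the last line is obtained by using (10). Hence $\nabla_\lambda \mathrm{KL}(\lambda) = 0$ if

$$\lambda = \mathrm{Cov}_\lambda\big( S(\theta) \big)^{-1} \mathrm{Cov}_\lambda\big( S(\theta), \log p(\theta|y) \big). \tag{12}$$

That is, the optimal $\lambda$ is the solution to a fixed point problem. Note that $\log p(\theta|y)$ differs only by a constant not depending on $\theta$ from $\log p(\theta)p(y|\theta)$, so (12) can be rewritten as

$$\lambda = \mathrm{Cov}_\lambda\big( S(\theta) \big)^{-1} \mathrm{Cov}_\lambda\big( S(\theta), \log p(\theta)p(y|\theta) \big). \tag{13}$$

This suggests an iterative scheme for minimizing $\mathrm{KL}(\lambda)$, where at iteration $k+1$ the parameters $\lambda^{(k)}$ are updated to

$$\lambda^{(k+1)} = \mathrm{Cov}_{\lambda^{(k)}}\big( S(\theta) \big)^{-1} \mathrm{Cov}_{\lambda^{(k)}}\big( S(\theta), \log p(\theta)p(y|\theta) \big). \tag{14}$$
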
Salimans and Knowles (2013) observe that this iterative scheme does not necessarily converge. Instead, inspired by stochastic gradient descent algorithms (Robbins and Monro, 1951), they choose to estimate $\mathrm{Cov}_\lambda(S(\theta))$ and $\mathrm{Cov}_\lambda(S(\theta), \log p(\theta)p(y|\theta))$ by a weighted average over iterates in a Monte Carlo approximation to a pre-conditioned gradient descent algorithm which is guaranteed to converge to a local mode if a certain step size parameter in their algorithm is small enough. They argue for Monte Carlo estimation of both $\mathrm{Cov}_{\lambda^{(k)}}(S(\theta))$ and $\mathrm{Cov}_{\lambda^{(k)}}(S(\theta), \log p(\theta)p(y|\theta))$ using the same Monte Carlo samples. This results in the approximation of the right hand side of (14) taking the form of a linear regression of the log target distribution on the sufficient statistics of the approximating family. The number of iterations $N$ for which their algorithm is run is determined in advance, a constant step size of $c = 1/\sqrt{N}$ is chosen for all iterations and averaging is carried out over the last $N/2$ iterations in forming the estimates of the covariance matrices to calculate an estimate of $\lambda$. Theoretical support for these choices in the context of stochastic gradient descent algorithms is given by Nemirovski et al. (2009). See Salimans and Knowles (2013) for further discussion of why stochastic estimation of the covariance matrices rather than averaging over the parameters $\lambda$ in a more conventional stochastic gradient algorithm is beneficial.
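
The linear regression interpretation of (14) can be seen in a one-dimensional sketch. The example below is an illustration under assumptions of our own choosing: the unnormalized log target is that of a $N(1.5, 0.5^2)$ density, the approximating family is normal with $S(\theta) = (\theta, \theta^2)^T$, and the current $q$ is standard normal. Because this log target is exactly linear in $S(\theta)$, a single update recovers its natural parameters whatever draws are used.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(theta):
    # Unnormalized log density of N(1.5, 0.5^2); in terms of S(theta) = (theta, theta^2)
    # its natural parameters are (mu / s2, -1 / (2 s2)) = (6, -2).
    return 6.0 * theta - 2.0 * theta ** 2

theta = rng.normal(0.0, 1.0, size=200)               # draws from the current q
S = np.column_stack([theta, theta ** 2])             # sufficient statistics
cov_S = np.cov(S, rowvar=False)                      # Monte Carlo estimate of Cov(S)
cov_S_logp = np.array([np.cov(S[:, j], log_target(theta))[0, 1] for j in range(2)])
lam = np.linalg.solve(cov_S, cov_S_logp)             # one update of the form (14)
```

Both covariance estimates use the same draws, as recommended in the text, so `lam` is the slope vector of an ordinary least squares regression of the log target on the sufficient statistics.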
Salimans and Knowles (2013) show using properties of the exponential family that

$$\mathrm{Cov}_\lambda\big( S(\theta) \big) = \nabla_\lambda E_\lambda\big( S(\theta) \big), \tag{15}$$

and

$$\mathrm{Cov}_\lambda\big( S(\theta), \log p(\theta)p(y|\theta) \big) = \nabla_\lambda E_\lambda\big( \log p(\theta)p(y|\theta) \big), \tag{16}$$

and then consider Monte Carlo approximations to the expectations on the right hand side of (15) and of (16) based on a random draw $\theta^* \sim q_\lambda(\theta)$ where $\theta^* = f(\lambda, s)$ and $s$ is some random seed. If $f$ is smooth, the Monte Carlo approximations are smooth functions of $\lambda$, and these approximations can be differentiated in (15) and (16). When the approximating distribution $q_\lambda(\theta)$ is multivariate normal, and working with a direct parameterization in terms of the mean and covariance matrix, results due to Minka (2001) and Opper and Archambeau (2009) are used to evaluate the gradients in (15) and (16) to simplify the approximations while making use of the first and second derivatives of the target posterior (Salimans and Knowles, 2013, Section 4.4 and Appendix C). This results in a highly efficient algorithm.
We are concerned with a certain modification of their algorithm for Gaussian $q_\lambda(\theta)$ but where there is independence between blocks of the parameters. Suppose $\theta$ is decomposed into $K$ blocks $\theta = (\theta_1^T, \dots, \theta_K^T)^T$ and that the variational posterior $q_\lambda(\theta)$ factorizes as

$$q_\lambda(\theta) = q_{\lambda_1}(\theta_1) \times \dots \times q_{\lambda_K}(\theta_K),$$

with each factor $q_{\lambda_k}(\theta_k)$, $k = 1,\dots,K$, being multivariate normal, and $\lambda_k$ denotes the natural parameters for the $k$th factor. We write $\mu_k$ and $\Sigma_k$ for the corresponding mean and covariance matrix and write $S_k(\theta_k)$ for the vector of sufficient statistics in the $k$th normal factor. Because of independence, the optimality condition (13) simplifies to

$$\lambda_k = \mathrm{Cov}_{\lambda_k}\big( S_k(\theta_k) \big)^{-1} \mathrm{Cov}_\lambda\big( S_k(\theta_k), \log p(\theta)p(y|\theta) \big)$$

and we can use the ideas of Salimans and Knowles (2013) to estimate the covariance matrices on the right hand side of this expression. This results in the following slight modification of their Algorithm 2. In the description below $t_k$, $g_k$, $\bar t_k$, $\bar g_k$ are vectors of the same length as $\theta_k$, and $\Gamma_k$ and $\bar\Gamma_k$ are square matrices with dimension the length of $\theta_k$. We assume below that $N$ is even so that $N/2$ is an integer and set $c = 1/\sqrt{N}$.

Algorithm 1:

• Initialize $\mu_k$, $\Sigma_k$, $k = 1,\dots,K$.
• Initialize $t_k = \mu_k$, $\Gamma_k = \Sigma_k^{-1}$ and $g_k = 0$, $k = 1,\dots,K$.
• Initialize $\bar t_k = 0$, $\bar\Gamma_k = 0$ and $\bar g_k = 0$, $k = 1,\dots,K$.
• For $i = 1,\dots,N$ do
  – Generate a draw $\theta^* = (\theta_1^{*T}, \dots, \theta_K^{*T})^T$ from $q_\lambda(\theta)$
  – For $k = 1,\dots,K$ do
    ∗ Set $\Sigma_k = \Gamma_k^{-1}$ and $\mu_k = \Sigma_k g_k + t_k$
    ∗ Calculate the gradient $g_i^{(k)}$ and Hessian $H_i^{(k)}$ of $\log p(\theta)p(y|\theta)$ with respect to $\theta_k$ evaluated at $\theta^*$.
    ∗ Set $g_k = (1-c)g_k + c\,g_i^{(k)}$, $\Gamma_k = (1-c)\Gamma_k - c\,H_i^{(k)}$, $t_k = (1-c)t_k + c\,\theta_k^*$.
    ∗ If $i > N/2$ then set $\bar g_k = \bar g_k + \frac{2}{N} g_i^{(k)}$, $\bar\Gamma_k = \bar\Gamma_k - \frac{2}{N} H_i^{(k)}$, $\bar t_k = \bar t_k + \frac{2}{N}\theta_k^*$.
• Set $\Sigma_k = \bar\Gamma_k^{-1}$, $\mu_k = \Sigma_k \bar g_k + \bar t_k$ for $k = 1,\dots,K$.

When the iteration terminates, $\mu_k$ and $\Sigma_k$ are the estimated mean and covariance matrix of the normal density $q_{\lambda_k}(\theta_k)$.
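
The steps of Algorithm 1 can be sketched in code for the single-block case $K = 1$. This is a minimal illustration rather than the authors' implementation: the function name `fixed_form_gaussian_vb` and the callback `grad_hess`, which returns the gradient and Hessian of the log target at a point, are hypothetical, and the sketch assumes that $\Gamma$ stays positive definite so that a Cholesky factor of $\Sigma$ exists.

```python
import numpy as np

def fixed_form_gaussian_vb(grad_hess, mu_init, Sigma_init, N=100, seed=0):
    """Single-block (K = 1) sketch of Algorithm 1 with N even.

    grad_hess(theta) must return (gradient, Hessian) of log p(theta)p(y|theta).
    """
    rng = np.random.default_rng(seed)
    c = 1.0 / np.sqrt(N)                              # constant step size
    mu = np.asarray(mu_init, dtype=float)
    Sigma = np.asarray(Sigma_init, dtype=float)
    t, Gamma, g = mu.copy(), np.linalg.inv(Sigma), np.zeros_like(mu)
    t_bar, Gamma_bar, g_bar = np.zeros_like(mu), np.zeros_like(Sigma), np.zeros_like(mu)
    for i in range(1, N + 1):
        # Generate a draw theta* from the current normal approximation.
        theta = mu + np.linalg.cholesky(Sigma) @ rng.standard_normal(mu.size)
        # Refresh the mean and covariance from the running statistics.
        Sigma = np.linalg.inv(Gamma)
        mu = Sigma @ g + t
        # Stochastic-approximation updates with step size c.
        g_i, H_i = grad_hess(theta)
        g = (1 - c) * g + c * g_i
        Gamma = (1 - c) * Gamma - c * H_i
        t = (1 - c) * t + c * theta
        if i > N // 2:                                # average over the last N/2 iterations
            g_bar = g_bar + (2.0 / N) * g_i
            Gamma_bar = Gamma_bar - (2.0 / N) * H_i
            t_bar = t_bar + (2.0 / N) * theta
    Sigma = np.linalg.inv(Gamma_bar)
    mu = Sigma @ g_bar + t_bar
    return mu, Sigma
```

For a normal target with mean `m` and covariance `V`, the callback is `lambda th: (-P @ (th - m), -P)` with `P = np.linalg.inv(V)`, and the returned pair reproduces `m` and `V` up to rounding, which gives a convenient sanity check before applying the routine to a non-Gaussian target.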
4 Parallel implementation for large datasets

Suppose the data $y$ are partitioned into $M$ subsets, $y = (y^{(1)\prime}, \dots, y^{(M)\prime})'$. Suppose also that we have learnt a variational posterior distribution for each subset, $q_{\lambda^{(j)}}(\theta)$ approximating $p(\theta|y^{(j)})$. We assume that

$$q_{\lambda^{(j)}}(\theta) = q_{\lambda_1^{(j)}}(\theta_1) \times \dots \times q_{\lambda_K^{(j)}}(\theta_K), \tag{17}$$

where $\lambda_k^{(j)}$ is the natural parameter for $q_{\lambda_k^{(j)}}(\theta_k)$, which has been assumed to have an exponential family form, $j = 1,\dots,M$ and $k = 1,\dots,K$. We will also assume that

$$p(y|\theta) = p(y^{(1)}|\theta) \times \dots \times p(y^{(M)}|\theta),$$

i.e. the blocks $y^{(1)},\dots,y^{(M)}$ are conditionally independent given $\theta$. Then the posterior distribution is

$$
\begin{aligned}
p(\theta|y) &\propto p(\theta)\, p(y^{(1)}|\theta) \times \dots \times p(y^{(M)}|\theta) \\
&= \frac{p(\theta)p(y^{(1)}|\theta) \times \dots \times p(\theta)p(y^{(M)}|\theta)}{p(\theta)^{M-1}} \\
&\propto \frac{p(\theta|y^{(1)}) \times \dots \times p(\theta|y^{(M)})}{p(\theta)^{M-1}}.
\end{aligned}
$$

Hence, given our approximation $q_{\lambda^{(j)}}(\theta)$ of $p(\theta|y^{(j)})$, $p(\theta|y)$ is approximately proportional to

$$\frac{q_{\lambda^{(1)}}(\theta) \times \dots \times q_{\lambda^{(M)}}(\theta)}{p(\theta)^{M-1}}.$$

The reasoning used here is the same as that used in the Bayesian committee machine (Tresp, 2000), although Tresp focused more on applications to Gaussian process regression. A similar strategy was independently proposed in a recent preprint by Broderick et al. (2013), who assume that the data subsets $y^{(j)}$ arrive sequentially in time.
Recall that $q_{\lambda^{(j)}}(\theta)$ has the factorization (17). Hence, if the prior also factorizes as

$$p(\theta) = p_{\lambda_1^{(0)}}(\theta_1) \times \dots \times p_{\lambda_K^{(0)}}(\theta_K),$$

where $p_{\lambda_k^{(0)}}(\theta_k)$, with natural parameters $\lambda_k^{(0)}$, has the same exponential family form as $q_{\lambda_k^{(j)}}(\theta_k)$, then the marginal posterior for $\theta_k$ is approximately proportional to

$$\frac{q_{\lambda_k^{(1)}}(\theta_k) \times \dots \times q_{\lambda_k^{(M)}}(\theta_k)}{p_{\lambda_k^{(0)}}^{M-1}(\theta_k)}, \quad k = 1,\dots,K. \tag{18}$$

This approximation to $p(\theta_k|y)$ is an exponential family distribution of the same form as each of the factors, with natural parameter $\sum_{j=1}^M \lambda_k^{(j)} - (M-1)\lambda_k^{(0)}$. Hence we can learn the approximations $q_{\lambda^{(j)}}(\theta)$ independently in parallel for different chunks of the data and then combine these posteriors to get an approximation to the full posterior.

If the factors $q_{\lambda_k^{(j)}}(\theta_k)$ are all normal, with $\lambda_k^{(j)}$ corresponding to mean $\mu_k^{(j)}$ and covariance matrix $\Sigma_k^{(j)}$, and if $p_{\lambda_k^{(0)}}(\theta_k)$ has mean $\mu_k^{(0)}$ and covariance matrix $\Sigma_k^{(0)}$, then the approximation to $p(\theta_k|y)$ is normal, with mean

$$\left( \sum_{j=1}^M \Sigma_k^{(j)\,-1} - (M-1)\Sigma_k^{(0)\,-1} \right)^{-1} \left( \sum_{j=1}^M \Sigma_k^{(j)\,-1}\mu_k^{(j)} - (M-1)\Sigma_k^{(0)\,-1}\mu_k^{(0)} \right)$$

and covariance matrix

$$\left( \sum_{j=1}^M \Sigma_k^{(j)\,-1} - (M-1)\Sigma_k^{(0)\,-1} \right)^{-1}.$$

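
The normal recombination formulas above translate directly into a few lines of linear algebra. The following is a minimal sketch for one parameter block; the function name `combine_normals` and its argument layout are assumptions made for illustration.

```python
import numpy as np

def combine_normals(mus, Sigmas, mu0, Sigma0):
    """Recombine M normal sub-posterior factors for one block theta_k.

    mus, Sigmas: per-subset means and covariance matrices.
    mu0, Sigma0: mean and covariance matrix of the normal prior factor.
    """
    M = len(mus)
    P0 = np.linalg.inv(Sigma0)
    # Sum of sub-posterior precisions minus (M - 1) copies of the prior precision.
    precision = sum(np.linalg.inv(S) for S in Sigmas) - (M - 1) * P0
    # Same combination applied to the precision-weighted means.
    shift = (sum(np.linalg.solve(S, m) for S, m in zip(Sigmas, mus))
             - (M - 1) * (P0 @ mu0))
    Sigma = np.linalg.inv(precision)
    return Sigma @ shift, Sigma
```

When the sub-posteriors are exact, as in a conjugate normal model with known variance, the recombined distribution coincides with the full-data posterior. With variational sub-posteriors the result is only an approximation, and the combined precision matrix is not guaranteed to be positive definite if the subset approximations are poor.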
A similar way of combining normal approximations of posterior distributions in mixed models has been considered by Huang and Gelman (2005). If $q_{\lambda_k^{(j)}}(\theta_k)$ is Wishart, $W(\nu_k^{(j)}, S_k^{(j)})$, and if $p_{\lambda_k^{(0)}}(\theta_k)$ is Wishart, $W(\nu_k^{(0)}, S_k^{(0)})$, then $p(\theta_k|y)$ is approximated as the Wishart distribution

$$W\left( \sum_{j=1}^M \nu_k^{(j)} - (M-1)\nu_k^{(0)},\ \left( \sum_{j=1}^M S_k^{(j)\,-1} - (M-1) S_k^{(0)\,-1} \right)^{-1} \right).$$

Note that a random matrix $A$ of size $d \times d$ is said to be distributed as a Wishart distribution $W(\nu, S)$ with degrees of freedom $\nu > d-1$ and scale matrix $S$ if its probability density function is

$$p(A) = \frac{|A|^{\frac{1}{2}(\nu-d-1)} \exp\left( -\frac{1}{2}\mathrm{tr}(S^{-1}A) \right)}{2^{\frac{1}{2}\nu d}\, \pi^{d(d-1)/4}\, |S|^{\frac{1}{2}\nu} \prod_{i=1}^d \Gamma\left( \frac{1}{2}(\nu+1-i) \right)}.$$

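
The Wishart recombination rule admits an equally short sketch. The function name `combine_wisharts` is hypothetical; the parameterization follows the density given above, with degrees of freedom $\nu$ and scale matrix $S$.

```python
import numpy as np

def combine_wisharts(nus, Ss, nu0, S0):
    """Recombine M Wishart sub-posterior factors W(nu_j, S_j) with prior W(nu0, S0)."""
    M = len(nus)
    # Degrees of freedom add up, with (M - 1) copies of the prior value removed.
    nu = sum(nus) - (M - 1) * nu0
    # Inverse scale matrices combine in the same way.
    S_inv = sum(np.linalg.inv(S) for S in Ss) - (M - 1) * np.linalg.inv(S0)
    return nu, np.linalg.inv(S_inv)
```

A simple check is the conjugate case in which vectors $b_i \sim N(0, Q^{-1})$ are observed directly and $Q$ has a Wishart prior: the subset posteriors are then exact Wishart distributions and their recombination equals the full-data posterior.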
4.1 Model selection with cross-validation

The way of combining approximations learnt independently on different subsets of the data makes model choice by cross-validation straightforward to implement. Let one of the subsets $y^{(1)},\dots,y^{(M)}$ be a future dataset $y_F$, and let the rest be used as the training data $y_T$. Let $\mathcal{M}$ be the model that is being considered. A common measure of the performance of the model $\mathcal{M}$ is the log predictive density score (LPDS), defined as (Good, 1952)

$$\mathrm{LPDS}(y_F|y_T, \mathcal{M}) = \sum_{y \in y_F} \log \int p(y|\theta, \mathcal{M})\, p(\theta|y_T, \mathcal{M})\, d\theta, \tag{19}$$

where we assume that $p(y|\theta, y_T, \mathcal{M}) = p(y|\theta, \mathcal{M})$, i.e. conditional on $\mathcal{M}$ and $\theta$ the future observations are independent of the observed, and $p(\theta|y_T, \mathcal{M})$ is the posterior of the model parameter $\theta$ conditional on the training data $y_T$. The posterior $p(\theta|y_T, \mathcal{M})$ can be replaced by its VB estimate $q(\theta|y_T, \mathcal{M})$, and the integral in (19) can then be approximated by Monte Carlo samples drawn from $q(\theta|y_T, \mathcal{M})$. A simpler method is to estimate the integral by $p(y|\hat\theta(y_T), \mathcal{M})$, with $\hat\theta(y_T)$ an estimator of the posterior mean of $\theta$ which can be obtained by using the mean of the VB approximation $q(\theta|y_T, \mathcal{M})$. We use this plug-in method in our paper and define the $M$-fold cross-validated LPDS as

$$\mathrm{LPDS}(\mathcal{M}) = \frac{1}{M} \sum_{j=1}^M \mathrm{LPDS}(y^{(j)}|y \setminus y^{(j)}, \mathcal{M}). \tag{20}$$

Computing (20) is straightforward with parallel implementation and the main advantage is that no extra time is needed to refit the model on each training dataset. From (18), the variational distribution $q(\theta_k|y \setminus y^{(j)}, \mathcal{M})$ of the parameter block $\theta_k$ conditional on dataset $y \setminus y^{(j)}$ is proportional to

$$\frac{q_{\lambda_k^{(1)}}(\theta_k, \mathcal{M}) \times \dots \times q_{\lambda_k^{(j-1)}}(\theta_k, \mathcal{M}) \times q_{\lambda_k^{(j+1)}}(\theta_k, \mathcal{M}) \times \dots \times q_{\lambda_k^{(M)}}(\theta_k, \mathcal{M})}{p_{\lambda_k^{(0)}}^{M-2}(\theta_k, \mathcal{M})}, \quad k = 1,\dots,K,$$

from which the estimator $\hat\theta(y \setminus y^{(j)})$ is easily computed accordingly. Recall that $q_{\lambda_k^{(j)}}(\theta_k, \mathcal{M})$ is the VB approximation to the marginal posterior of the $k$th block $\theta_k$, based on the $j$th data subset, $j = 1,\dots,M$ and $k = 1,\dots,K$.
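
The leave-one-subset-out recombination and the plug-in score can be sketched as follows for a scalar parameter with normal factors. All function names and the callback `log_density(y, theta)` are assumptions introduced for the example; the sketch reuses the per-subset approximations and performs no refitting.

```python
import numpy as np

def leave_one_out_normal(j, means, variances, mean0, var0):
    """Recombine all scalar normal factors except subset j (prior raised to the power M - 2)."""
    M = len(means)
    keep = [i for i in range(M) if i != j]
    precision = sum(1.0 / variances[i] for i in keep) - (M - 2) / var0
    mean = (sum(means[i] / variances[i] for i in keep)
            - (M - 2) * mean0 / var0) / precision
    return mean, 1.0 / precision

def plug_in_lpds(y_future, theta_hat, log_density):
    """Plug-in estimate of (19): sum of log p(y | theta_hat) over the held-out subset."""
    return float(sum(log_density(y, theta_hat) for y in y_future))

def cross_validated_lpds(subsets, means, variances, mean0, var0, log_density):
    """M-fold cross-validated LPDS as in (20), using the plug-in method."""
    scores = []
    for j, y_j in enumerate(subsets):
        theta_hat, _ = leave_one_out_normal(j, means, variances, mean0, var0)
        scores.append(plug_in_lpds(y_j, theta_hat, log_density))
    return float(np.mean(scores))
```

Competing models can then be compared by computing `cross_validated_lpds` for each and preferring the model with the larger score.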
5 Application to generalized linear mixed models

Consider a generalized linear mixed model where $y_i = (y_{i1},\dots,y_{in_i})^T$ is the vector of responses for the $i$th subject/panel, $i = 1,\dots,m$. Given random effects $b_i$, the $y_{ij}$ are conditionally independently distributed with the density or probability function

$$f(y_{ij}|\beta, b_i) = \exp\left( \frac{y_{ij}\eta_{ij} - \zeta(\eta_{ij})}{\phi} + c(y_{ij}, \phi) \right),$$

where $\eta_{ij}$ is a canonical parameter which is monotonically related to the conditional mean $\mu_{ij} = E(y_{ij}|\beta, b_i)$ through a link function $g(\cdot)$, $g(\mu_{ij}) = \eta_{ij}$, $\beta$ is a $p$-vector of fixed effect parameters, $\phi$ is a scale parameter which we assume known (for example, in the binomial and Poisson families $\phi = 1$), and $\zeta(\cdot)$ and $c(\cdot)$ are known functions. Here, for simplicity, we are considering the case of a canonical link function, i.e. $g(\mu_{ij}) = \eta_{ij}$. The vector $\eta_i = (\eta_{i1},\dots,\eta_{in_i})^T$ is modeled as $\eta_i = X_i\beta + Z_i b_i$, where $X_i$ is an $n_i \times p$ design matrix for the fixed effects and $Z_i$ is an $n_i \times u$ design matrix for the random effects (where $u$ is the dimension of $b_i$). Let $b = (b_1', \dots, b_m')'$ and

$$
X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_m \end{pmatrix}, \quad
Z = \begin{pmatrix} Z_1 & 0 & \cdots & 0 \\ 0 & Z_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & Z_m \end{pmatrix}, \quad
\eta = \begin{pmatrix} \eta_1 \\ \eta_2 \\ \vdots \\ \eta_m \end{pmatrix} = X\beta + Zb.
$$

The likelihood is

$$p(y|\beta, b) = \prod_{i=1}^m \prod_{j=1}^{n_i} f(y_{ij}|\beta, b_i) = \exp\left( \frac{1}{\phi}\left( y^T\eta - 1^T\zeta(\eta) \right) + c(y, \phi) \right),$$

where $\zeta(\eta)$ is understood componentwise and $c(y,\phi) = \sum_{i,j} c(y_{ij},\phi)$. The random effects $b_i$ are independently distributed as $N(0, Q^{-1})$. Hence $p(b) \sim N(0, Q_b^{-1})$ with $Q_b$ a block diagonal matrix $\mathrm{diag}(Q,\dots,Q)$.

We consider Bayesian inference with a normal prior $N(\mu_\beta^0, \Sigma_\beta^0)$ for $\beta$ and a Wishart prior $W(\nu_0, S_0)$ for $Q$, where $\mu_\beta^0$, $\Sigma_\beta^0$, $\nu_0$ and $S_0$ are known hyperparameters. We set $\mu_\beta^0 = 0$, $\Sigma_\beta^0 = \tau_0 I_p$, $\nu_0 = u+1$ and $S_0 = \tau_0 I_u$ with $\tau_0 = 1000$.

Let $\theta = (\beta, b, Q)$ be the vector of all the unknown parameters and random effects. Assume that the variational posterior is factorized as

$$q(\theta) = q(\beta, b)\, q(Q) = q(\alpha)\, q(Q) \quad \text{with} \quad \alpha = (\beta^T, b^T)^T, \tag{21}$$

where $q(\alpha)$ is normal with mean $\mu_\alpha^q$ and covariance matrix $\Sigma_\alpha^q$. The optimal VB approximation $q(Q)$ is a Wishart $W(\nu^q, S^q)$ with $\nu^q$ and $S^q$ given in Algorithm 2.

By combining the VB theory in Section 2 and Algorithm 1 of Section 3, we obtain the following mean-field fixed-form VB algorithm for fitting GLMMs.
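Algorithm 2

1. Initialize $\nu^q$, $S^q$.
2. Update $\mu_\alpha^q$ and $\Sigma_\alpha^q$ as follows
   • Initialize $t_\alpha = \mu_\alpha^q$, $\Gamma_\alpha = (\Sigma_\alpha^q)^{-1}$ and $g_\alpha = 0$.
   • Initialize $\bar t_\alpha = 0$, $\bar\Gamma_\alpha = 0$ and $\bar g_\alpha = 0$.
   • For $i = 1,\dots,N$ do
     – Generate $\alpha^* = (\beta^{*T}, b^{*T})^T \sim N(\mu_\alpha^q, \Sigma_\alpha^q)$ and compute $\eta^* = X\beta^* + Zb^*$.
     – Set $\Sigma_\alpha^q = \Gamma_\alpha^{-1}$ and $\mu_\alpha^q = \Sigma_\alpha^q g_\alpha + t_\alpha$.
     – Compute the gradient
       $$g_i^\alpha = \begin{pmatrix} \frac{1}{\phi} X^T\big( y - \dot\zeta(\eta^*) \big) - (\Sigma_\beta^0)^{-1}(\beta^* - \mu_\beta^0) \\ \frac{1}{\phi} Z^T\big( y - \dot\zeta(\eta^*) \big) - E_{q(Q)}(Q_b)\, b^* \end{pmatrix}$$
       and Hessian
       $$H_i^\alpha = \begin{pmatrix} -\frac{1}{\phi} X^T \mathrm{diag}\big( \ddot\zeta(\eta^*) \big) X - (\Sigma_\beta^0)^{-1} & -\frac{1}{\phi} X^T \mathrm{diag}\big( \ddot\zeta(\eta^*) \big) Z \\ -\frac{1}{\phi} Z^T \mathrm{diag}\big( \ddot\zeta(\eta^*) \big) X & -\frac{1}{\phi} Z^T \mathrm{diag}\big( \ddot\zeta(\eta^*) \big) Z - E_{q(Q)}(Q_b) \end{pmatrix}.$$
     – Set $g_\alpha = (1-c)g_\alpha + c\,g_i^\alpha$, $\Gamma_\alpha = (1-c)\Gamma_\alpha - c\,H_i^\alpha$, $t_\alpha = (1-c)t_\alpha + c\,\alpha^*$.
     – If $i > N/2$ then set $\bar g_\alpha = \bar g_\alpha + \frac{2}{N} g_i^\alpha$, $\bar\Gamma_\alpha = \bar\Gamma_\alpha - \frac{2}{N} H_i^\alpha$, $\bar t_\alpha = \bar t_\alpha + \frac{2}{N}\alpha^*$.
   • Set $\Sigma_\alpha^q = \bar\Gamma_\alpha^{-1}$, $\mu_\alpha^q = \Sigma_\alpha^q \bar g_\alpha + \bar t_\alpha$.
3. Update $\nu^q = \nu_0 + m$, $S^q = \left( S_0^{-1} + \sum_{i=1}^m \big( \mu_{b_i}^q \mu_{b_i}^{qT} + \Sigma_{b_i}^q \big) \right)^{-1}$.
4. Repeat Steps 2-3 until convergence.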
In the above algorithm, $E_{q(Q)}(Q_b) = \mathrm{diag}(E_{q(Q)}(Q),\dots,E_{q(Q)}(Q))$ with $E_{q(Q)}(Q) = \nu^q S^q$, and $\mu_{b_i}^q$ and $\Sigma_{b_i}^q$ are the mean and covariance of $b_i$ computed from $\mu_\alpha^q$ and $\Sigma_\alpha^q$. The $H_i^\alpha$, and therefore $\bar\Gamma_\alpha$, are block, high-dimensional and sparse matrices whose lower right blocks are block diagonal. Techniques for handling such sparse matrices should be used to reduce the computing time. In our experience, the algorithm often converges very quickly, within a few iterations. A common stopping rule is to stop iterating when the lower bound does not improve any further. However, computing the lower bound in the GLMM context often involves an analytically intractable integral. Alternatively, we can stop iterating if the difference in the variational parameters between two successive iterations is smaller than a small threshold. In our implementation, the algorithm is terminated if $1/d$ times the difference between two successive iterations is smaller than $\epsilon = 10^{-5}$ ($d$ is the total number of the parameters). The number of iterations within each fixed-form update $N$ is set to 100 after some experimentation, but this can be varied depending on the computational budget or even adaptively increased as we near convergence.
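
The gradient, Hessian and Wishart update of Algorithm 2 can be sketched for one concrete family. The code below assumes a Bernoulli response with logit link, so that $\zeta(\eta) = \log(1 + e^\eta)$ and $\phi = 1$; the function names `grad_hess_alpha` and `update_q_Q`, the argument names, and the use of dense matrices are illustrative choices, not the authors' implementation, which would exploit sparsity.

```python
import numpy as np

def grad_hess_alpha(alpha, y, X, Z, Sigma0_beta_inv, mu0_beta, EQb, phi=1.0):
    """Gradient and Hessian with respect to alpha = (beta, b) for a logistic GLMM.

    EQb is E_q(Q_b) = diag(E(Q), ..., E(Q)); zeta(eta) = log(1 + exp(eta)) is assumed.
    """
    p = X.shape[1]
    beta, b = alpha[:p], alpha[p:]
    eta = X @ beta + Z @ b
    zeta_dot = 1.0 / (1.0 + np.exp(-eta))          # first derivative of zeta
    zeta_ddot = zeta_dot * (1.0 - zeta_dot)        # second derivative of zeta
    r = (y - zeta_dot) / phi
    grad = np.concatenate([X.T @ r - Sigma0_beta_inv @ (beta - mu0_beta),
                           Z.T @ r - EQb @ b])
    W = zeta_ddot[:, None] / phi                   # row weights, i.e. diag(zeta_ddot) / phi
    hess = -np.block([[X.T @ (W * X) + Sigma0_beta_inv, X.T @ (W * Z)],
                      [Z.T @ (W * X), Z.T @ (W * Z) + EQb]])
    return grad, hess

def update_q_Q(mu_b, Sigma_b, nu0, S0, u):
    """Step 3 of Algorithm 2: Wishart parameters (nu_q, S_q) from the moments of q(b)."""
    m = mu_b.size // u
    acc = np.linalg.inv(S0)
    for i in range(m):
        s = slice(i * u, (i + 1) * u)
        acc = acc + np.outer(mu_b[s], mu_b[s]) + Sigma_b[s, s]
    return nu0 + m, np.linalg.inv(acc)
```

Comparing `grad_hess_alpha` against finite differences of the corresponding log target on a small simulated dataset is a cheap way to catch sign or indexing errors before running the full algorithm.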
It is important to note that we treat $\beta$ and $b$ as a single block $\alpha$ to take into account the posterior dependence between the fixed and random effects. An alternative to (21) is to fully factorize $q(\theta)$ as
\[
q(\theta) = q(\beta) \prod_{i=1}^m q(b_i)\, q(Q), \qquad (22)
\]
but this ignores the posterior dependence between $\beta$ and the $b_i$'s and thus results in a poor VB approximation. For large $m$, treating $\beta$ and $b$ as a single block would cause numerical problems if we worked with the full dataset, as the dimension of $\Sigma_\alpha^q$ is $(m\cdot u+p)\times(m\cdot u+p)$. It is therefore necessary to work with subsets of the data as developed in Section 4, so that for each subset we only have to work with matrices of size $(m_j\cdot u+p)\times(m_j\cdot u+p)$, where $m_j$ is the number of panels in that subset.
Partitioning the data into too many small subsets will lead to a poor approximation in each subset. On the other hand, using too large a subset size $m_j$ is time consuming because we have to work with high-dimensional matrices, which may also cause numerical issues. Our simulation study in Section 6.1 suggests that we should partition the data such that $m_j \approx (200-p)/u$.
5.1 Improving the VB approximation for GLMMs

With the prior $p(Q) \sim W(\nu_0, S_0)$, it is easy to see that $p(Q|\beta,b,y) = p(Q|b,y) \sim W\big(\nu_0+m, (S_0^{-1} + \sum_{i=1}^m b_i b_i')^{-1}\big)$. In order to improve the approximation of the posterior, we can use $\tilde q(\beta, b, Q) = q(\beta, b)\,p(Q|b, y)$. The marginal posterior of $Q$ is then estimated by
\[
\tilde q(Q) = \int q(b)\, p(Q|b, y)\, db, \qquad (23)
\]
where $q(b) \sim N(\mu_b^q, \Sigma_b^q)$ with the mean $\mu_b^q$ and covariance $\Sigma_b^q$ computed from $\mu_\alpha^q$ and $\Sigma_\alpha^q$. In the case of parallel implementation in which the parameters are learnt separately on $M$ data subsets $y^{(j)}$, $\mu_b^q$ can be approximated by $(\mu_{b^{(1)}}^{qT},\ldots,\mu_{b^{(M)}}^{qT})^T$ and $\Sigma_b^q$ by the block diagonal matrix $\mathrm{diag}(\Sigma_{b^{(1)}}^q,\ldots,\Sigma_{b^{(M)}}^q)$. Here, $b^{(j)}$ is the vector of random effects with respect to the data subset $y^{(j)}$ and $q(b^{(j)}) \sim N(\mu_{b^{(j)}}^q, \Sigma_{b^{(j)}}^q)$. This method is not suitable for the case with large data because $\Sigma_b^q$ is a high dimensional matrix of size $m \times m$.
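As an illustration of how draws from the improved approximation in (23) can be generated, here is a minimal sketch for a random intercept model ($u = 1$), where the Wishart distribution $W(\nu, S)$ reduces to a Gamma distribution with shape $\nu/2$ and scale $2S$. The function name, the inputs, and the assumption that the $b_i$ are independent under $q(b)$ (a diagonal $\Sigma_b^q$) are illustrative choices, not part of the paper.

```python
import random


def draw_improved_q_sigma2(mu_b, var_b, nu0, s0, n_draws=1000, seed=0):
    """Draw sigma^2 = 1/Q from the improved density in (23), scalar b_i."""
    rng = random.Random(seed)
    m = len(mu_b)
    draws = []
    for _ in range(n_draws):
        # b ~ q(b); independence across panels is a simplifying assumption
        b = [rng.gauss(mu, v ** 0.5) for mu, v in zip(mu_b, var_b)]
        scale = 1.0 / (1.0 / s0 + sum(bi * bi for bi in b))
        # One-dimensional Wishart W(nu, S) is Gamma(shape=nu/2, scale=2*S)
        q = rng.gammavariate((nu0 + m) / 2.0, 2.0 * scale)
        draws.append(1.0 / q)
    return draws


# Hypothetical inputs: 50 panels with posterior means 1 and no uncertainty.
sigma2_draws = draw_improved_q_sigma2([1.0] * 50, [0.0] * 50, nu0=1.0, s0=1.0)
```

A kernel density estimate of the returned draws then gives an estimate of the improved density of $\sigma^2$.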
5.2 Model selection for GLMMs

Given the response vector $y$, assume that a GLMM has been specified. Model selection in GLMMs then consists of selecting fixed effect covariates and random effect covariates among a set of potential covariates. Assume that we have fitted a GLMM $\mathcal{M}$ and denote the estimated parameter by $\hat\theta = (\hat\beta, \hat Q)$. The plug-in log predictive density score of a future dataset with response vector $y_F$, fixed effect design matrix $X_F$ and random effect design matrix $Z_F$ is
\[
\log p(y_F|\hat\theta, \mathcal{M}) = \sum_{y_i \in y_F} \log \int \exp\Big( \frac{1}{\phi}\big(y_i^T \eta_i - 1^T \zeta(\eta_i)\big) + c(y_i, \phi) \Big)\, p(b_i|\hat Q)\, db_i,
\]
where $\eta_i = X_i \hat\beta + Z_i b_i$ and $X_i \in X_F$, $Z_i \in Z_F$, correspondingly. The integrals above can be estimated by the Laplace method. The $M$-fold cross-validated plug-in LPDS is then computed as in (20). The model with the largest LPDS is selected.
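The Laplace method mentioned above can be sketched for a scalar random effect as follows. The integrand is written as $\exp(h(b))$, its mode is found by Newton's method, and the integral is approximated by $\exp(h(\hat b))\sqrt{2\pi/(-h''(\hat b))}$. The function name and its arguments are hypothetical; the caller supplies $h$ and its first two derivatives.

```python
import math


def laplace_log_integral(h, dh, d2h, b0=0.0, n_newton=50, tol=1e-10):
    """Laplace approximation to log of the integral of exp(h(b)) over b."""
    b = b0
    for _ in range(n_newton):
        step = dh(b) / d2h(b)  # Newton step towards the mode of h
        b -= step
        if abs(step) < tol:
            break
    return h(b) + 0.5 * math.log(2 * math.pi / -d2h(b))


# For a Gaussian-shaped integrand the approximation is exact:
# the integral of exp(-(b - 1)^2) over the real line is sqrt(pi).
value = laplace_log_integral(lambda b: -(b - 1.0) ** 2,
                             lambda b: -2.0 * (b - 1.0),
                             lambda b: -2.0)
```

In the plug-in LPDS, $h(b_i)$ would be the exponent in the integrand plus $\log p(b_i|\hat Q)$, evaluated once per panel.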
Clearly, this model selection strategy can also be used to select among GLMMs themselves as well as among link functions. A drawback of this model selection method is that it is not suitable when the number of candidate covariates is large, because the total number of candidate models is then huge and searching over the model space is very time consuming.
6 Examples

The proposed parallel VB algorithm is written in Matlab. The examples with small-to-moderate data are run on an Intel Core i7 3.2GHz desktop supported by the Matlab Parallel Toolbox with 4 local processors. The big data example is run on a high performance cluster with 27 machines, each with 12 local processors.
The performance of the parallel VB method is compared to MCMC simulation, whenever MCMC simulation is feasible, using the running time and cross-validated LPDS as performance measures. Recall that, given the model being estimated, the LPDS measured on a future dataset $y_F$ based on a training dataset $y_T$ is
\[
\mathrm{LPDS}(y_F|y_T) = \sum_{y\in y_F} \log \int p(y|\theta)\, p(\theta|y_T)\, d\theta.
\]
For MCMC, the integrals on the right side are estimated using the Markov chain samples. We compute a 5-fold cross-validated LPDS for MCMC. For VB, the posterior $p(\theta|y_T)$ is replaced by its VB estimate $q(\theta|y_T)$, and then the integrals are estimated by Monte Carlo samples drawn from $q(\theta|y_T)$. We note that, as discussed in Section 4.1, the cross-validated LPDS for the parallel VB approach is easily computed without refitting the model on each training set $y_T$. The likelihood $p(y|\theta)$ in GLMMs is an integral over the random effects and is estimated by importance sampling. We use common random numbers to reduce variation when computing these integrals by importance sampling.
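The Monte Carlo estimate of the LPDS can be sketched as below. Each integral is replaced by an average of $p(y|\theta_s)$ over draws $\theta_s$ from the MCMC chain or from $q(\theta|y_T)$, computed on the log scale for numerical stability. The function names are hypothetical, and the toy Poisson density without random effects is an assumption made to keep the example short; in a GLMM, `logpdf` would itself be an importance sampling estimate.

```python
import math


def poisson_logpmf(y, log_rate):
    """Log of the Poisson probability mass function with rate exp(log_rate)."""
    return y * log_rate - math.exp(log_rate) - math.lgamma(y + 1)


def lpds_monte_carlo(y_future, theta_draws, logpdf):
    """Sum over y in y_future of log( (1/S) * sum_s p(y | theta_s) )."""
    total = 0.0
    for y in y_future:
        logs = [logpdf(y, th) for th in theta_draws]
        top = max(logs)  # log-sum-exp trick
        total += top + math.log(sum(math.exp(v - top) for v in logs) / len(logs))
    return total


# Hypothetical draws of a log-rate parameter and three future counts.
score = lpds_monte_carlo([0, 1, 2], [0.1, -0.2, 0.0, 0.05], poisson_logpmf)
```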
In this paper, we use pseudo-marginal MCMC simulation (see, e.g., Andrieu and Roberts, 2009; Flury and Shephard, 2011), which still generates samples from the posterior when the likelihood in the Metropolis-Hastings algorithm is replaced by an unbiased estimator. The likelihood in the GLMM context is a product of $m$ integrals over the random effects. Each integral is estimated unbiasedly using importance sampling, with 10 importance samples in the simulation examples and 100 in the real data examples. We use the Laplace approximation for selecting the importance proposal density. Note that each likelihood estimation is also run in parallel. To handle the positive definiteness constraint on the inverse covariance $Q$, we use the Leonard and Hsu transformation (Leonard and Hsu, 1992) $Q = \exp(\Sigma)$, where $\Sigma$ is an unconstrained symmetric matrix, to reparameterize $Q$ by the lower-triangle elements $\theta_Q$ of $\Sigma$, which is a one-to-one transformation between $Q$ and $\theta_Q$. We then use the adaptive random walk Metropolis-Hastings algorithm in Haario et al. (2001) to sample from the posterior $p(\beta,\theta_Q|y)$. Each MCMC chain consists of 20000 iterates with another 20000 iterates used as burn-in.
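The importance sampling estimate of one of the $m$ integrals can be sketched as follows for a Poisson model with a random intercept. The function names are hypothetical, and the Gaussian proposal is passed in by its mean and standard deviation; in the paper these would come from a Laplace approximation, which is not reproduced here.

```python
import math
import random


def normal_logpdf(x, mean, sd):
    """Log density of N(mean, sd^2) at x."""
    return -0.5 * math.log(2 * math.pi) - math.log(sd) - 0.5 * ((x - mean) / sd) ** 2


def panel_loglik(y, eta_fixed, b):
    """Poisson log-likelihood of one panel given its random intercept b."""
    return sum(yj * (e + b) - math.exp(e + b) - math.lgamma(yj + 1)
               for yj, e in zip(y, eta_fixed))


def is_likelihood(y, eta_fixed, sigma, prop_mean, prop_sd, n_samples, rng):
    """Unbiased estimate of the integral of p(y|b) N(b; 0, sigma^2) over b."""
    total = 0.0
    for _ in range(n_samples):
        b = rng.gauss(prop_mean, prop_sd)
        log_w = (panel_loglik(y, eta_fixed, b) + normal_logpdf(b, 0.0, sigma)
                 - normal_logpdf(b, prop_mean, prop_sd))
        total += math.exp(log_w)
    return total / n_samples


# Re-seeding the generator before each call gives common random numbers.
rng = random.Random(1)
estimate = is_likelihood([1, 0, 2], [0.0, 0.0, 0.0], math.sqrt(0.2), 0.0, 0.6, 100, rng)
```

The full likelihood estimate is the product of such estimates over the $m$ panels, and this product is what replaces the exact likelihood in the Metropolis-Hastings acceptance ratio.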
Alternative MCMC methods for estimating GLMMs such as Gibbs sampling (Zeger and Karim, 1991) can be faster than the MCMC scheme implemented in this paper. However, the pseudo-marginal Metropolis-Hastings sampling scheme with the random effects integrated out can avoid the mixing problems that Gibbs sampling would have due to the strong dependence between the fixed and random effects. Gibbs sampling and similar MCMC methods for GLMMs are in general not parallelizable and are therefore cumbersome when $m$ is very large. It should be noted that it is often difficult to compare running times between different algorithms, since they depend heavily on the programming language used and on how well the implementation is tuned to the characteristics of the particular example considered. However, we believe the results reported here are indicative of the speed-up obtained with our variational Bayes methods.
6.1 Simulations

6.1.1 A simulation study of the parallel implementation

This simulation example studies the effect of the divide and recombine strategy and its parallel implementation. Panel data are generated from the following logistic model with random effects
\[
y_{ij} \sim \mathrm{Bernoulli}(\pi_{ij}), \quad \pi_{ij} = \frac{\exp(\eta_{ij})}{1+\exp(\eta_{ij})}, \quad \eta_{ij} = x_{ij}'\beta + z_{ij}'b_i, \quad b_i \sim N(0, \Sigma), \quad i = 1,\ldots,m,\ j = 1,\ldots,n_i, \qquad (24)
\]
with $\beta = (-1.5, 2.5)'$, $\Sigma = \sigma^2 I_u$, $\sigma^2 = 1.5$, $n_i = 8$, $x_{ij} = (1, j/n_i)'$. We consider two cases: the first case with a random intercept, i.e. $u = 1$ and $z_{ij} = 1$, and the second case with $u = 2$ and $z_{ij} = x_{ij}$.
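For readers who want to reproduce the setup, data from model (24) can be generated along the following lines. This is a minimal sketch with a hypothetical function name; it supports only the two cases considered here ($u = 1$ or $u = 2$) and returns one list of binary responses per panel.

```python
import math
import random


def simulate_logistic_panels(m, u=1, beta=(-1.5, 2.5), sigma2=1.5, n_i=8, seed=0):
    """Simulate m panels of binary responses from the logistic model (24)."""
    rng = random.Random(seed)
    sd = math.sqrt(sigma2)
    panels = []
    for _ in range(m):
        b = [rng.gauss(0.0, sd) for _ in range(u)]  # b_i ~ N(0, sigma2 * I_u)
        ys = []
        for j in range(1, n_i + 1):
            x = (1.0, j / n_i)
            z = (1.0,) if u == 1 else x             # z_ij = 1 or z_ij = x_ij
            eta = (sum(xk * bk for xk, bk in zip(x, beta))
                   + sum(zk * bk for zk, bk in zip(z, b)))
            pi = 1.0 / (1.0 + math.exp(-eta))
            ys.append(1 if rng.random() < pi else 0)
        panels.append(ys)
    return panels


data = simulate_logistic_panels(m=1000, u=1)
```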
Figure 1: Parallel implementation with the four different values of the number of subsets M. [Panels: posterior density estimates of β1, β2 and σ2 for M = 1, 5, 10 and 20; CPU time in seconds against the number of pieces M.]
We first carry out a small-data study on a single desktop. We generate a dataset from (24) with $m = 1000$ and $u = 1$ and run the parallel VB method for four different values of the number of subsets: $M = 1$ (i.e. the data are not partitioned), $M = 5$, $M = 10$ and $M = 20$. Equivalently, each subset has $m_j = 1000$, 200, 100 and 50 panels, respectively. All the partitions are selected randomly. The first three panels of Figure 1 plot the posterior density estimates for $\beta$ and $\sigma^2$ obtained by the four parallel VB runs, which show that the estimates are close to each other, in the sense that differences in the estimates are small relative to the estimated posterior standard deviations. The last panel plots the running times, which shows that running the divide and recombine strategy in parallel produces considerable gains in computing time.
In order to have a more formal comparison of these four parallel VB runs, we generate 50 independent datasets from model (24) and compute the mean squared errors of the estimates of the fixed effects ($\mathrm{MSE}_\beta$) and of the random effect variance ($\mathrm{MSE}_\Sigma$). Table 1 summarizes these performance measures and the running times averaged over the 50 replications. The results show that the parallel VB run with subset size $m_j = 200$ produces accurate estimates while having a reasonable running time.

M     m_j    MSE_β    MSE_Σ    Time (seconds)/replication
1     1000   0.048    0.057    995.9
5     200    0.048    0.055    56.7
10    100    0.050    0.061    36.1
20    50     0.056    0.071    21.6

Table 1: Small data simulation study: The table reports the mean squared errors and the running time averaged over 50 replications for the parallel VB runs with different subset sizes. Data are simulated from model (24) with u = 1 and m = 1000.
To study the performance of the parallel VB algorithm in a big data context, we simulate a large dataset with $m = 2.7$ million panels from model (24) for both cases $u = 1$ and $u = 2$. That is, there are 13.5 million observations in total, which can be considered a big dataset in modern statistical applications. We run the parallel VB on a high performance cluster with 27 machines, each with 12 local processors. Table 2 summarizes the performance measures and the running times averaged over 10 replications. We draw three conclusions. First, there is not much difference in the mean squared errors between the different strategies of partitioning the data into subsets. However, the running time increases when the subset size increases. Second, the performance of the parallel VB depends mainly on the product of the subset size $m_j$ and the number of random effects $u$, which determines the size of the matrix $\Sigma_\alpha^q$ in Algorithm 2. Third, the results suggest that we should partition the data such that $m_j \times u \approx 200$ in order to have a good tradeoff between computing time and accuracy.

All the VB runs in the following examples are run in parallel with data partitioned such that the product of the subset size and the number of random effects is roughly 200.
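The partitioning rule can be sketched as a small helper that picks the number of subsets so that $m_j \times u$ is close to 200 and then assigns panels to subsets at random. The function name and the round-robin assignment after shuffling are assumptions made for illustration.

```python
import math
import random


def partition_panels(m, u, target=200, seed=0):
    """Randomly split panel indices 0..m-1 so that m_j * u is close to target."""
    n_subsets = max(1, math.ceil(m * u / target))
    idx = list(range(m))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices out in turn so subset sizes differ by at most one
    return [idx[k::n_subsets] for k in range(n_subsets)]


subsets = partition_panels(m=1000, u=1)
```

With $m = 1000$ and $u = 1$ this gives $M = 5$ subsets of 200 panels each, which matches the setting singled out in Table 1.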
u    m_j×u    MSE_β ×10^4    MSE_Σ ×10^5    Time (minutes)/replication
1    50       0.018          0.026          7
1    100      0.015          0.024          18
1    200      0.014          0.024          47
1    400      0.014          0.024          158
2    50       0.050          0.910          6
2    100      0.035          0.828          11
2    200      0.030          0.946          31
2    400      0.029          1.134          94

Table 2: Large data simulation study: The table reports the mean squared errors and the running time averaged over 10 replications for the parallel VB runs with different subset sizes. Data are simulated from model (24) with m = 2.7 million.

6.1.2 Model selection

We now study the performance of the model selection procedure discussed in Section 5.2. We generate datasets from the logistic random intercept model (24) and also generate irrelevant covariates $x_{ij2}$ and $z_{ij1}$ randomly from the set $\{-1,0,1\}$. This creates a model selection problem in which there are 2 potential covariates for the fixed effects and 1 potential covariate for the random effects. It is reasonable to always include a fixed intercept and a random intercept in a GLMM, so there are a total of 8 candidate models to consider. We consider two values of $m$, 500 and 1000. Each is used to generate 100 datasets from the true model (24). The performance is measured by the correctly fitted rate (CFR), defined as the proportion of the 100 replications in which the true model is selected. The CFR is 80% for $m = 500$ and 100% for $m = 1000$, which shows that the model selection strategy performs well. The running time, averaged over the replications, taken to run the whole model selection procedure is 3.54 and 5.86 minutes for $m = 500$ and $m = 1000$, respectively. This running time is spent on fitting the 8 candidate models and computing the cross-validated plug-in LPDS.
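The count of 8 candidate models follows from including or excluding each of the 3 optional covariates while always keeping the two intercepts. A minimal sketch of the enumeration, with hypothetical covariate labels, is:

```python
from itertools import product


def candidate_models(fixed_covariates, random_covariates):
    """List all models formed by including or excluding each optional covariate."""
    k = len(fixed_covariates)
    models = []
    for flags in product([False, True], repeat=k + len(random_covariates)):
        fixed = [c for c, f in zip(fixed_covariates, flags[:k]) if f]
        rand = [c for c, f in zip(random_covariates, flags[k:]) if f]
        # The fixed and random intercepts are always included
        models.append((["intercept"] + fixed, ["intercept"] + rand))
    return models


models = candidate_models(["x1", "x2"], ["z1"])
```

Each candidate would then be fitted by the parallel VB algorithm and ranked by its cross-validated plug-in LPDS. The number of candidates doubles with every extra covariate, which is the drawback noted in Section 5.2.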
6.1.3 A comparison to MCMC

This simulation study compares the performance of the proposed parallel and hybrid VB algorithm to MCMC. Datasets are generated from a Poisson mixed model with a random intercept
\[
y_{ij} \sim \mathrm{Poisson}(\lambda_{ij}), \quad \lambda_{ij} = \exp(\eta_{ij}), \quad \eta_{ij} = \beta_0 + \beta_1 x_{ij} + b_i, \quad b_i \sim N(0, \sigma^2), \quad i = 1,\ldots,m,\ j = 1,\ldots,n_i. \qquad (25)
\]
We set $\beta_0 = -1.5$, $\beta_1 = 2.5$, $\sigma^2 = 0.2$ and $n_i = 5$, with $x_{ij}$ generated from the uniform distribution on $(0,1)$.
The performance is measured by (i) mean squared errors of the estimates of the fixed effects ($\mathrm{MSE}_\beta$) and of the estimates of the variance of the random effect ($\mathrm{MSE}_{\sigma^2}$); (ii) running time in minutes. Table 3 reports the simulation results, averaged over 10 replications, for four different sizes of data $m$ ranging from small data ($m = 50$) to moderate data ($m = 10000$). We do not run the MCMC simulation in the cases $m = 5000$ and $m = 10000$ because it is very time consuming. In the case $m = 10000$, it takes approximately 1.1 seconds to run each likelihood estimation in parallel, thus it would take approximately 733 minutes to run one MCMC chain in the setting of this example. Table 3 shows that the performance of VB and MCMC is very similar in terms of mean squared errors. However, the VB approach is much more computationally efficient.

m        Method    MSE_β    MSE_σ2    Time (minutes)
50       VB        0.155    0.025     0.07
50       MCMC      0.155    0.057     18.8
200      VB        0.058    0.016     0.41
200      MCMC      0.059    0.016     33.4
5000     VB        0.012    0.007     7.4
5000     MCMC      -        -         -
10000    VB        0.011    0.004     14.5
10000    MCMC      -        -         733

Table 3: Simulation example. The table reports the mean squared errors and the running time averaged over 10 replications.
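The 733-minute figure for $m = 10000$ is consistent with a short calculation, assuming one likelihood estimate per Metropolis-Hastings iterate and the chain length stated earlier (20000 iterates plus 20000 burn-in iterates). Both assumptions are read off the surrounding text rather than stated explicitly at this point.

```latex
\[
(20000 + 20000)\ \text{iterates} \times 1.1\ \text{s per iterate}
  = 44000\ \text{s} \approx 733\ \text{minutes}.
\]
```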
6.2 Drug longitudinal data

The anti-epileptic drug longitudinal dataset (see, e.g., Fitzmaurice et al., 2011, p.346) consists of seizure counts on $m = 59$ epileptic patients over 5 time-intervals of treatment. The objective is to study the effects of the anti-epileptic drug on the patients. Following Fitzmaurice et al. (2011), we consider a mixed effects Poisson regression model but with a random intercept
\[
p(y_{ij}|\beta,b_i) = \mathrm{Poisson}(\exp(\eta_{ij})), \quad \eta_{ij} = c_{ij} + \beta_1 + \beta_2\,\mathrm{time}_{ij} + \beta_3\,\mathrm{treatment}_{ij} + \beta_4\,\mathrm{time}_{ij}\times\mathrm{treatment}_{ij} + b_i,
\]
where $j = 0,1,\ldots,4$, $i = 1,\ldots,59$, $c_{ij}$ is an offset, and $b_i \sim N(0,\sigma^2)$. The offset is $c_{ij} = \log(8)$ if $j = 0$ and $c_{ij} = \log(2)$ for $j > 0$; $\mathrm{time}_{ij} = j$; $\mathrm{treatment}_{ij} = 0$ if patient $i$ is in the placebo group and $\mathrm{treatment}_{ij} = 1$ if in the treatment group.
The running times of VB (without partitioning the data into subsets, because $m$ is small) and MCMC in this example are 0.14 minutes and 17.7 minutes, respectively. Figure 2 plots the VB estimates (dashed line) and MCMC estimates (solid line) of the marginal posterior densities $p(\beta_i|y)$, $i = 1,\ldots,4$ and $p(\sigma^2|y)$. All the MCMC density estimates in this paper are obtained by kernel density estimation using the built-in Matlab function ksdensity. The last panel in Figure 2 also plots the improved VB density $\tilde q(\sigma^2)$ given in (23), estimated by kernel density estimation based on draws of $\sigma^2$ generated from (23). The figure shows that the VB estimates are very close to the MCMC estimates in this example. As shown, the improved VB estimate $\tilde q(\sigma^2)$ is closer to $p(\sigma^2|y)$ than the original VB estimate $q(\sigma^2)$.

We now partition the data into 5 subsets and compute the 5-fold cross-validated LPDS for both MCMC and the parallel VB. The cross-validated LPDS for MCMC, VB and improved VB are −0.5732, −0.5807 and −0.5803, respectively. This shows that the methods have similar predictive performance, with MCMC slightly better.
Figure 2: The VB estimates (dashed) and MCMC estimates (solid) of the marginal posterior densities for the anti-epileptic drug data. [Panels: β1, β2, β3, β4 and σ2; legend: MCMC, VB, improved VB.]
6.3 Six city data

The six cities data in Fitzmaurice and Laird (1993) consist of binary responses $y_{ij}$ which indicate the wheezing status (1 if wheezing, 0 if not wheezing) of the $i$th child at time-point $j$, $i = 1,\ldots,537$ and $j = 1,\ldots,4$. Covariates are the age of the child at time-point $j$, centered at 9 years, and the maternal smoking status (0 or 1). We consider the following logistic regression model with a random intercept
\[
p(y_{ij}|\beta,b_i) = \mathrm{Binomial}(1,p_{ij}), \quad \mathrm{logit}(p_{ij}) = \beta_1 + \beta_2\,\mathrm{Age}_{ij} + \beta_3\,\mathrm{Smoke}_{ij} + b_i.
\]
Figure 3 plots the VB estimates (dashed line) and MCMC estimates (solid line) of the marginal posterior densities $p(\beta_i|y)$, $i = 1,\ldots,3$ and $p(\sigma^2|y)$. The running times of VB and MCMC in this example are 0.56 and 52.8 minutes, respectively. The figure shows that the VB estimates of the posterior densities of the $\beta_i$ are again close to the MCMC estimates. The improved VB estimate $\tilde q(\sigma^2)$ overcomes the problem of underestimation of the variance inherent in $q(\sigma^2)$. The VB is about 94 times more computationally efficient than the MCMC implementation.

The cross-validated LPDS for MCMC, VB and improved VB are 132.37, 132.35 and 132.37, respectively. This shows that the methods have very similar predictive performance.
Figure 3: The VB estimates (dashed) and MCMC estimates (solid) of the marginal posterior densities for the six city data. [Panels: β1, β2, β3 and σ2; legend: MCMC, VB, improved VB.]
6.4 Skin cancer data

A clinical trial was conducted to test the effectiveness of beta-carotene in preventing non-melanoma skin cancer (Greenberg et al., 1989). Patients were randomly assigned to a control or treatment group and biopsied once a year to ascertain the number of new skin cancers since the last examination. The response $y_{ij}$ is a count of the number of new skin cancers in year $j$ for the $i$th subject. Covariates include age, skin (1 if skin has burns and 0 otherwise), gender, exposure (a count of the number of previous skin cancers), year of follow-up and treatment (1 if the subject is in the treatment group and 0 otherwise). There are $m = 1683$ subjects with complete covariate information.

Donohue et al. (2011) consider 5 different Poisson mixed models that differ in which covariates are included; the inclusion status is given in Table 4. Using the model selection strategy described in Section 5.2, we compute the cross-validated LPDS, whose values are shown in Table 4 and suggest that Model 1 should be chosen. By using an AIC-type model selection criterion, Donohue et al. (2011) show that the first three models cannot be distinguished and, on parsimony grounds, they select Model 1.

                       Model 1    Model 2    Model 3    Model 4    Model 5
Fixed intercept        Y          Y          Y          Y          Y
Age                    Y          Y          Y          Y          Y
Skin                   Y          Y          Y          Y          Y
Gender                 Y          Y          Y          Y          Y
Exposure               Y          Y          Y          Y          Y
Year                   N          Y          Y          Y          Y
Year^2                 N          N          Y          N          Y
Random intercept       Y          Y          Y          Y          Y
Random slope (Year)    N          N          N          Y          Y
LPDS                   −277.5     −278.5     −278.1     −1366.6    −1404.6

Table 4: Five different Poisson mixed models for the skin cancer data and their LPDS values, which show that Model 1 is chosen.

For comparison, after selecting Model 1, we also use MCMC to estimate this model, which is
\[
p(y_{ij}|\beta,b_i) = \mathrm{Poisson}(\exp(\eta_{ij})), \quad \eta_{ij} = \beta_0 + \beta_1\,\mathrm{Age}_i + \beta_2\,\mathrm{Skin}_i + \beta_3\,\mathrm{Gender}_i + \beta_4\,\mathrm{Exposure}_{ij} + b_i,
\]
where $b_i \sim N(0,\sigma^2)$, $i = 1,\ldots,m = 1683$, $j = 1,\ldots,5$.
Figure 4 plots the VB estimates (dashed line) and MCMC estimates (solid line) of the marginal posterior densities $p(\beta_i|y)$, $i = 0,1,\ldots,4$ and $p(\sigma^2|y)$. The running times of VB and MCMC are 1.45 and 130 minutes, respectively. The VB and MCMC estimates of the fixed effects $\beta$ are similar. The VB method is about 90 times more computationally efficient than the MCMC. The cross-validated LPDS for MCMC and VB are −0.992 and −0.998, respectively, which shows that the methods perform similarly in terms of predictive performance.

Figure 4: The VB estimates (dashed) and MCMC estimates (solid) of the marginal posterior densities when fitting Model 1 to the skin cancer data. [Panels: β0, β1, β2, β3, β4 and σ2; legend: MCMC, VB, improved VB.]
7 Conclusion

We have developed a hybrid VB algorithm that uses a flexible and accurate fixed-form VB algorithm within a mean-field VB updating procedure for approximate Bayesian inference, which is similar in spirit to the Metropolis-Hastings within Gibbs sampling method in MCMC simulation. If the variational distribution is factorized into a product and an exponential form is specified for factors that do not have a conjugate form, then the new algorithm can be used to approximate any posterior distribution without relying on conjugate priors. A simple approach for improving the VB approximation is described. We have also developed a divide and recombine strategy for handling large datasets, and a method for model selection as a by-product. The proposed VB method is applied to fitting GLMMs and is demonstrated on several simulated and real data examples.
Acknowledgment

The research of Minh-Ngoc Tran and Robert Kohn was partially supported by Australian Research Council grant DP0667069. David Nott’s research was supported by a Singapore Ministry of Education Academic Research Fund Tier 2 grant (R-155-000-143-112).
References

Andrieu, C. and Roberts, G. (2009). The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics, 37:697–725.
Attias, H. (1999). Inferring parameters and structure of latent variable models by variational Bayes. In Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence, pages 21–30.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. New York: Springer.
Braun, M. and McAuliffe, J. (2010). Variational inference for large-scale models of discrete choice. Journal of the American Statistical Association, 105(489):324–335.
Broderick, T., Boyd, N., Wibisono, A., Wilson, A. C., and Jordan, M. I. (2013). Streaming variational Bayes. Technical report, University of California, Berkeley. Available at http://arxiv.org/pdf/1307.6769v1.pdf.
Donohue, M. C., Overholser, R., Xu, R., and Vaida, F. (2011). Conditional Akaike information under generalized linear and proportional hazards mixed models. Biometrika, 98:685–700.
Fitzmaurice, G. M., Laird, N. M., and Ware, J. H. (2011). Applied Longitudinal Analysis. John Wiley & Sons, Ltd, New Jersey, 2nd edition.
Flury, T. and Shephard, N. (2011). Bayesian inference based only on simulated likelihood: Particle filter analysis of dynamic economic models. Econometric Theory, 1:1–24.
Ghahramani, Z. and Beal, M. (2001). Propagation algorithms for variational Bayesian learning. In Leen, T., Dietterich, T., and Tresp, V., editors, Neural Information Processing Systems, volume 13, pages 507–513. MIT Press.
Good, I. J. (1952). Rational decisions. Journal of the Royal Statistical Society B, 14:107–114.
Greenberg, E. R., Baron, J. A., Stevens, M. M., Stukel, T. A., Mandel, J. S., Spencer, S. K., Elias, P. M., Lowe, N., Nierenberg, D. N., G., B., and Vance, J. C. (1989). The skin cancer prevention study: design of a clinical trial of beta-carotene among persons at high risk for nonmelanoma skin cancer. Controlled Clinical Trials, 10:153–166.
Guha, S., Hafen, R., Rounds, J., Xia, J., Li, J., Xi, B., and Cleveland, W. S. (2012). Large complex data: divide and recombine (D&R) with RHIPE. Stat, 1:53–67.
Haario, H., Saksman, E., and Tamminen, J. (2001). An adaptive Metropolis algorithm. Bernoulli, 7:223–242.
Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. (2013). Stochastic variational inference. Journal of Machine Learning Research, 14:1303–1347.
Honkela, A., Raiko, T., Kuusela, M., Tornio, M., and Karhunen, J. (2010). Approximate Riemannian conjugate gradient learning for fixed-form variational Bayes. Journal of Machine Learning Research, 11:3235–3268.
Knowles, D. A. and Minka, T. (2011). Non-conjugate variational message passing for multinomial and binary regression. In Shawe-Taylor, J., Zemel, R., Bartlett, P., Pereira, F., and Weinberger, K., editors, Advances in Neural Information Processing Systems 24, pages 1701–1709. Neural Information Processing Systems.
Leonard, T. and Hsu, J. S. J. (1992). Bayesian inference for a covariance matrix. The Annals of Statistics, 20(4):1669–1696.
Luts, J., Broderick, T., and Wand, M. P. (2013). Real-time semiparametric regression. Journal of Computational and Graphical Statistics. In press.
Maclaurin, D. and Adams, R. P. (2014). Firefly Monte Carlo: Exact MCMC with subsets of data. Technical report, Harvard University.
Minka, T. (2001). A family of algorithms for approximate Bayesian inference. PhD thesis, MIT.
Neiswanger, W., Wang, C., and Xing, E. (2014). Asymptotically exact, embarrassingly parallel MCMC. Technical report, Carnegie Mellon University.
Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. (2009). Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19:1574–1609.
Opper, M. and Archambeau, C. (2009). The variational Gaussian approximation revisited. Neural Computation, 21:786–792.
Ormerod, J. T. and Wand, M. P. (2010). Explaining variational approximations. American Statistician, 64:140–153.
Ormerod, J. T. and Wand, M. P. (2012). Gaussian variational approximate inference for generalized linear mixed models. J. Comput. Graph. Stat., 21:2–17.
Quiroz, M., Villani, M., and Kohn, R. (2014). Speeding up MCMC by efficient data subsampling. Technical report, Stockholm University.
Rijmen, F. and Vomlel, J. (2008). Assessing the performance of variational methods for mixed logistic regression models. J. Stat. Comput. Simul., 78:765–779.
Robbins, H. and Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407.
Salimans, T. and Knowles, D. A. (2013). Fixed-form variational posterior approximation through stochastic linear regression. Bayesian Analysis, 8(4):741–908.
Sato, M. (2001). Online model selection based on the variational Bayes. Neural Computation, 13(7):1649–1681.
Tan, L. S. L. and Nott, D. J. (2013a). A stochastic variational framework for fitting and diagnosing generalized linear mixed models. Technical report, National University of Singapore. Available at http://arxiv.org/pdf/1208.4949v2.pdf.
Tan, L. S. L. and Nott, D. J. (2013b). Variational inference for generalized linear mixed models using partially noncentered parametrizations. Statistical Science, 28:168–188.
Tchumtchoua, S., Dunson, D. B., and Morris, J. S. (2011). Online variational Bayes inference for high dimensional correlated data. Technical report, Duke University. Available at http://arxiv.org/pdf/1108.1079/.
Tresp, V. (2000). A Bayesian committee machine. Neural Computation, 12:2719–2741.
Wang, C. and Blei, D. M. (2013). Variational inference in nonconjugate models. Journal of Machine Learning Research, 14:1005–1031.
Waterhouse, S., MacKay, D., and Robinson, T. (1996). Bayesian methods for mixtures of experts. In Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E., editors, Advances in Neural Information Processing Systems, pages 351–357. MIT Press.
Zeger, S. L. and Karim, M. R. (1991). Generalized linear models with random effects: A Gibbs sampling approach. Journal of the American Statistical Association, 86:79–86.