b. Unlike previous examples, conditional distribution π(φ|−) is not available in closed-form. Nevertheless, propagated surrogate quantities (SCSS) enable approximate MCMC draws to be sampled from C-DF full conditional π ˜ (φ|−) via MetropolisHastings. Here, steps for approximate MCMC using the C-DF algorithm in the context of a Metropolis-within-Gibbs sampler are: (1) At time t observe yt ; (2) If t ≤ b : Draw S samples for (θ1 , . . . , θt , τ 2 , σ 2 , φ) from the Gibbs full conditionals. In addition, C1t = C2t = C3t = C4t = 0, t ≤ b. (3) If t > b : Repeat S times the following (a) for t−b < s < t, draw sequentially from [θs |θs−1 , θs+1 , −] and finally from [θt |θt−1 , −], noting that θt−b = θˆt−b ; (t−1) (t−1) (t−1) (b) draw τ 2 ∼ IG at , bt , and bt = C4 − 2φC3 + φ2 C2 + 12 (θt−b+1 − φθˆt−b )2 + Pb 2 j=2 (θt+j−b − φθt+j−b−1 ) ; P (t−1) (c) draw σ 2 ∼ IG at , C1 + 12 bj=1 (yt+j−b − θt+j−b )2 ; (t−1)
(d) sample φ ∼ π ˜ (·|C3
(t−1)
, C4
, θ(t−b+1):t ) via a Metropolis-Hastings step. 12
(t−1) (4) Set θˆt−b ← mean(θt−b ), and then update surrogate quantities C1t ← C1 + 12 (yt−b − ˆ θˆ2 (t−1) (t−1) (t−1) θˆt−b )2 , C t ← C + θt−b−1 , C t ← C + θˆt−b−1 θˆt−b , C t = C + t−b . 2
2
2
3
3
2
4
4
2
2
We vary signal-to-noise ratio τ /σ for the data generating process in simulation experiments. A number of high-signal cases were examined, and results are reported for a √ representative case with τ = 2, σ = 0.1. A sufficiently large window size b is necessary to prevent C-DF from suffering due to poor estimation at early time-points, causing propagation of error in the defined surrogate quantities. For the competitor, 100 particles were propagated in time, and hence we fix b = 100 as well. Kernel density estimates for θt using Particle learning (PL; Lopes et al., 2011) and the C-DF algorithm are shown in Figure 3. In the case of C-DF, estimates for all model parameters are found to be concentrated near their true values, whereas for PL, θt at different times are centered correctly, albeit with much higher posterior variance (presumably due to poor estimation of noise parameters τ 2 , σ 2 ). In addition to being 35% faster than PL, average coverage for the latent AR(1) process using the C-DF algorithm is near 80% with credible intervals roughly 10 times narrower than PL (see Table 3). C-DF is substantially more efficient in terms of memory and storage utilization, requiring only 6% of what PL uses for an identical simulation. Finally, C-DF produces accurate estimates for latent parameters τ 2 , φ (in contrast to PL), although their posterior spread appears somewhat over shrunk. This may be remedied by using larger window size b (at the expense of increased runtime) depending on the task at hand. Table 3: Inferential performance for C-DF and Particle Learning (PL). Coverage and length are based on 95% credible intervals for θt averaged all time points and 10 replications. P n over (θˆt − θt0 )2 . We report the time taken to For truth θt0 at time t, we report MSE = T1n Tt=1 run C-DF with 50 Gibbs samples at each time for τ 2 , θ, σ 2 and 500 MH samples for φ. C-DF PL
Avg. coverage θ 0.780.10 10.00
Length 0.330.11 3.360.46
Time (sec) 1138.600.10 1750.580.10
MSE 0.0110.001 0.0960.027
Table 4: Computational and storage requirements for the Dynamic Linear Model using C-DF t and PL. Ci,j , is the i-th CSS corresponding to the j-th particle in PL, i = 1 : 4, j = 1 : N , N = 100 is the number of particles propagated by PL, and G = 500 is the number of Metropolis samples used by both PL and C-DF. Memory used to store and propagate SCSS and CSS for C-DF and PL is reported. Sampling and update complexities are in terms of big-O. C-DF PL
Stats Cit t Ci,j
Data {yi }i>nt−b {yi }i≥1
Sampling S(N + G) flops N G flops
Updating N flops N flops
Memory (bytes) 128 3330
3.3.2 An application to binary response data We consider an application of the C-DF algorithm for the probit model Pr(yi = 1|xi ) = Φ(x0i β), with standard normal distribution function Φ(·). The Albert and Chib (1993) la13
C−DF, T = 3000
−2
−1
0
1
2
4 −2
0
2
4
−6
−5
θ2000
−4
−3
−2
−1
0
1
θ3000 C−DF, T: 1000 PL, T: 1000
80
C−DF, T: 4000 PL, T: 4000
C−DF, T: 4000 PL, T: 4000
20
4
Density
6
60
8
10
C−DF, T: 1000 PL, T: 1000
0
0
2
Density
3
Density
1 −4
θ1000
40
−3
0
1 0 −4
PL, T = 3000
2
4 3
Density
2
4 3 2 0
1
Density
PL, T = 2000 5
C−DF, T = 2000 5
PL, T = 1000
5
C−DF, T = 1000
0
1
2
3
4
0.0
0.2
0.4
0.6
0.8
1.0
φ
τ2
Figure 3: Row #1 (left to right): Kernel density estimates for posterior draws of θt using PL and the C-DF algorithm at t = 1000, 2000, 3000; Row #2 (left to right) plots of model parameters τ 2 and φ, respectively. tent variable augmentation applies Gibbs sampling, with yi = sign(zi ), and zi |β ∼ N(x0i β, 1). Assuming a conditionally conjugate prior β ∼ N(0, Σβ ), the Gibbs sampler alternates between the following steps: (1) conditional on latent variables z = (z1 , z2 , . . . )0 , sample β from its p-variate Gaussian full conditional; and (2) conditional on β and the observed binary response y, impute latent variables from their truncated normal full conditionals. However, imputing latent variables {zi : i ≥ 1} presents an increasing computational bottleneck as the sample size increases, as does recalculating the conditional sufficient statistics (CSS) for β given these latent variables at each iteration. C-DF alleviates this problem by imputing latent scores for the last b training observations and propagating surrogate quantities. “Budget” b is allowed to grow with the predictor dimension, and as a default we set b = p log p. For data shards of size n, define It = {i > nt − b} as an index over the final b observations at time t. Parameter partitions in this setting are dynamic as in Section 3.3.1, with ΘG1 = {β; zi , i ∈ It } and ΘG2 = {zi : i ≤ nt − b}. The C-DF algorithm proceeds as follows: (1) Observe data X t , y t at time t; (2) If t = 1, set β = 0, and draw zi ∼ N(0, 1), i = 1, . . . , n. If t ≤ b/n, C t = 0 and draw S Gibbs samples for (z1 , . . . , znt , β) from the full conditionals. 14
0
ˆ , update sufficient statistic S XX ← S XX + X t X t , and compute (3) If t > b/n, set β ← β t−1 t t−1 XX −1 −1 ΣXX = (S + Σ ) ; t t β (4) For s = 1, . . . , S: (a) draw zi ∼ TN(x0i β), i ∈ It , and define z It and X It respectively (s) as the collection of latent draws and data for moving window b; (b) draw β|z It , C t−1 ∼ (s) (s) N(ΣXX C t , ΣXX ), where C t (z It ) = C t−1 + X 0It z It ; t t ˆ t ← mean(β (s) ). For ui (β) = yi φ(−x0 β)/Φ(yi x0 β), the trailing n latent scores (5) Set β i i ˆ t +ui (β ˆ t ), i ∈ within moving window b are fixed at their expected value, namely zˆi ← x0i β out It = {(nt + b − 1) : (n(t + 1) + b − 1)}; (6) Denote X Itout and z Itout as predictor data and latent scores for the “outgoing” set of ˆ Itout , and set ΘG1 ← data indexed by Itout . Update surrogate CSS C t ← C t−1 + X 0Itout z ˆ Itout . ΘG1 ∪ z We report simulation results for the following examples: (Case 1) (p, b, n, t) = (100, 500, 25, 100): βj,0 ∼ U(−3/4, 3/4) for j ∈ [11, 100]; (Case 2) (p, b, n, t) = (500, 3500, 100, 100): βj,0 ∼ U(−1/3, 1/3) for j ∈ [11, 200]. The first 10 regression coefficients are (3.5, -3.5, -2.0, 2.0, -1.5, 1.5, -1.5, 1.5, -1.0, 1.0), with βj,0 = 0 for j > 200 in case 2. Data are generated as xij ∼ N(0, σ = 0.25), and P(yi = 1) = Φ(x0i β). Table 5 summarizes inferential performance for the regression parameters in each of the simulated cases. In case 2, although coverage for predictors with large coefficients (i.e., β1 and β2 ) is less than the nominal value, the average coverage across all predictors produced by C-DF is 70% despite the high dimension with a significant number of “noisy” predictors. In addition, C-DF has very good mean-square estimation of parameter coefficient in both cases. As a competitor to C-DF, S-MCMC (batch Gibbs), SMC (Chopin, 2002) and S-VB (Broderick et al., 2013) are implemented. Results for SMC with 2000 propagated particles are shown in Table 5. For clarity, Figure 4 displays a number of marginal kernel density estimates at t = 50, 100 for C-DF and S-MCMC. S-MCMC has the worst-case storage requirement and scales linearly in the number of training samples. At each Gibbs iteration, draws for the latent scores is O(nt) for S-MCMC compared to O(b) for C-DF. For both methods, sampling from the full conditional for β is O(p3 ), updating sufficient statistics is O(np2 ), and updating surrogate quantities is O(np). Computational and storage requirements for both methods are summarized in Table 6.
4. C-DF for online compressed regression We consider an application to compressed linear regression where y 1 , . . . , y t ∈ i) πt (Θ2 |θ
i=1 0 b b 1,t−1 ). πt (Θ1 |Θ2,t−1 )π(Θ02 |Θ
The last step follows from the fact that p1 hY
b 1,t−1 , Θ0 , l πt (Θ01i |Θ 1l
p2 i hY i b 1,t−1 , Θ0 , l < i, Θ2l , l > i) < i, Θ1l , l > i) and πt (Θ02i |Θ 2l
i=1
i=1
b 1,t−1 ) and πt (Θ1 |Θ b 2,t−1 ), reare the Gibbs kernels with the stationary distribution πt (Θ2 |Θ spectively. proof of Lemma 6 The following proof builds on Theorem 3.6 from Yang and Dunson (2013). Although most of the proof coincides with their result, we present the entire proof for completeness. Proof Note that, for a fixed ∈ (0, 1) nt , t ≥ 1 are s.t. αtnt < . Using the fact that universal ergodicity condition implies uniform ergodicity, one obtains dT V (Ttnt , πt ) ≤ αtnt < . n
t−1 Let h = Tt−1 · · · T1n1 π0 , then
dT V (Ttnt · · · T1n1 π0 , ft ) = dT V (Ttnt h, ft ) ≤ dT V (Ttnt , ft ) dT V (h, ft ) ≤ αtnt dT V (h, ft ) ≤ (dT V (h, ft−1 ) + dT V (ft , ft−1 )) .
(18)
using the result repeatedly, one obtains dT V (Ttnt · · · T1n1 π0 , ft ) ≤
t X
t+1−l dT V (fl , fl−1 ).
l=1
R.H.S clearly converges to 0 applying condition (ii). The proof is completed by using condition (iii) and the fact that dT V (Ttnt · · · T1n1 π0 , πt ) ≤ dT V (Ttnt · · · T1n1 π0 , ft ) + dT V (ft , πt ). 28
proof of lemma 10 Proof Stationary distribution ft of the C-DF transition kernel Tt is the approximate posterior distribution to πt obtained at time t, and by Lemma 4 is given by b 2,t ) πt (Θ2 |Θ b 1,t ) ft (Θ1 , Θ2 ) = πt (Θ1 |Θ Qt Qt b b b 2,t (D l )π0 (Θ1,t , Θ2 )π0 (Θ1 , Θ2,t ) b 2,t (D l ) l=1 pΘ1 ,Θ l=1 pΘ1 ,Θ . = R Qt Qt ˆ b b 2,t (D l )π0 (Θ1,t , Θ2 )π0 (Θ1 , Θ2,t ) b 2,t (D l ) l=1 pΘ1 ,Θ l=1 pΘ1 ,Θ b 1,t → Θ0 , Θ b 2,t → Θ0 a.s. under Θ0 , there exists Ω0 which has probability By assumption, Θ 1 2 b 1,t (ω) and Θ b 2,t (ω) are in an arbitrarily 1 under the data generating law s.t. for all ω ∈ Ω0 , Θ 0 0 small neighborhood of Θ1 and Θ2 , respectively. Also by assumption, prior π0 is continuous at Θ0 . That is, given > 0 and η > 0, there exists a neighborhood N,η s.t. for all Θ ∈ N,η one has |π0 (Θ1 , Θ2 ) − π0 (Θ01 , Θ02 )| < .
(19)
b 1,t and Θ b 2,t as above, one obtains for all t > t0 and Using (19) and the consistency of Θ ω ∈ Ω0 b 2,t ) − π0 (Θ0 )| < , |π0 (Θ1 , Θ
b 1,t , Θ2 ) − π0 (Θ0 )| < . |π0 (Θ
(20)
Similarly, continuity of pΘ (·) at Θ0 leads to the condition that for all t > t0 , |pΘ1 ,Θ2 (D l ) − pΘ01 ,Θ02 (D l )| < .
(21)
Further, consistency assumptions on ft and πt yield that for all t > t1 and ω ∈ Ω1 ft (N,η |D (t) (ω)) > 1 − η,
πt (N,η |D (t) (ω)) > 1 − η,
where Ω1 has probability 1 under the data generating law. Considering Ω = Ω0 ∩ Ω1 and t2 = max{t1 , t0 } it is evident that Ω has also probability 1 under the true data generating law and all of the above conditions hold for t > t2 and ω ∈ Ω. Simple algebraic manipulations yield ft (Θ|D (t) (ω)) πt (Θ|D (t) (ω)) R Qt ft (N,η |D (t) (ω)) N,η l=1 pΘ (D l )π0 (Θ) = Qt πt (N,η |D (t) (ω)) l=1 pΘ (D l )π0 (Θ) hQ i Qt t b b b 2,t (D l ) π0 (Θ1,t , Θ2 )π0 (Θ1 , Θ2,t ) b 2,t (D l ) l=1 pΘ1 ,Θ l=1 pΘ1 ,Θ hQ i ×R Qt t ˆ 1,t , Θ2 )π0 (Θ1 , Θ b 2,t ) p (D ) p (D ) π0 (Θ b b l l l=1 Θ1 ,Θ2,t l=1 Θ1 ,Θ2,t N,η 29
(22)
Using (20) we have 0
(π0 (Θ ) − )
2
t Y
Z
N,η l=1 t Y
Z ≤
N,η l=1
p Θ 1 ,Θ b 2,t (D l ) pΘ b 1,t ,Θ2 (D l )
b b p Θ 1 ,Θ b 2,t (D l )pΘ b 1,t ,Θ2 (D l )π0 (Θ1,t , Θ2 )π0 (Θ1 , Θ2,t )
2 ≤ π0 (Θ ) + 0
Z
t Y
N,η l=1
p Θ 1 ,Θ b 2,t (D l )pΘ b 1,t ,Θ2 (D l ).
Similarly, t Y
Z
0
(π0 (Θ ) − )
Z pΘ (D l ) ≤ N,η
N,η l=1
t hY
Z i 0 pΘ (D l ) π0 (Θ) ≤ (π0 (Θ ) + )
t Y
pΘ (D l ).
N,η l=1
l=1
Therefore, ft (Θ|D (t) (ω)) πt (Θ|D (t) (ω)) Qt
b 2,t (D l )pΘ b 1,t ,Θ2 (D l ) l=1 pΘ1 ,Θ ≤ (1 − η)−1 R Qt b 2,t (D l )pΘ b 1,t ,Θ2 (D l ) l=1 pΘ1 ,Θ N,η
Qt
R
l=1
N,η
pΘ (D l ) (π0 (Θ0 ) + )3
Qt
l=1 pΘ (D l )
(π0 (Θ0 ) − )3
.
Using similar calculations we have ft (Θ|D (t) (ω)) πt (Θ|D (t) (ω)) Qt
≥ (1 −
b 2,t (D l )pΘ b 1,t ,Θ2 (D l ) l=1 pΘ1 ,Θ η) R Qt ˆ 2,t (D l )pΘ ˆ 1,t ,Θ2 (D l ) l=1 pΘ1 ,Θ N,η
Qt
R
l=1
N,η
Qt
l=1
pΘ (D l ) (π0 (Θ0 ) − )3
pΘ (D l )
(π0 (Θ0 ) + )3
.
Condition (21) now gives us R Qt Qt Qt 3 p ˆ 2,t (D l )pΘ ˆ 1,t ,Θ2 (D l ) 0 (D l ) − ) (p l=1 pΘ (D l ) Θ , Θ l=1 N 1 ,η Θ R ≤ Ql=1 Q Q t t t 3 ˆ 2,t (D l )pΘ ˆ 1,t ,Θ2 (D l ) l=1 (pΘ0 (D l ) + ) l=1 pΘ (D l ) l=1 pΘ1 ,Θ N,η Qt (p 0 (D l ) + )3 ≤ Qtl=1 Θ . 3 0 (D l ) − ) (p Θ l=1 √ Using the condition that limt→∞ tpΘ0 (D (t) ) is bounded away from 0 and ∞ and choosing , η sufficiently small, we have f (Θ|D (t) (ω)) t − 1 < κ for all t > t2 and ω ∈ Ω. πt (Θ|D (t) (ω))
30
Finally, Z
Z
Z
|πt (Θ) − ft (Θ)| ≤
|πt (Θ) − ft (Θ)| +
|πt (Θ) − ft (Θ)| c N,η
N,η
Z |πt (Θ) − ft (Θ)| + 2η
≤ N,η
≤ πt (N,η )κ + 2η < κ + 2η.
B Consistent sequence of estimators b 1 and Θ b 2 for the examples in Section 3.1. We verify the consistency for C-DF estimators Θ Details are presented for the linear regression case, with similar steps following for the Anova bj which and compressed regression examples. Step 3 in Algorithm 1 proposes an estimator θ (j) (t) estimates the mean of the corresponding approximate conditional posterior π ˜j (·|ΘGl , C j ). When the mean of this distribution is available in a closed form, one might use it to create sequence of estimators. Such mean functions are readily available for the examples in Section 3.1, and for ease of exposition, we focus on showing consistency for the sequence of estimators in such cases. For linear regression example in Section 3.1.1, the sequence of C-DF estimators are given by −1 X X t t 0 X X X 0l y l l l b = + I β t σ ˆl2 σ ˆl2 l=1 l=1 Pt b 0 0 b P b0 0 P 2b + tl=1 y 0l y l − 2 tl=1 β l X l yl + 2 l=1 β l X l X l β l σ ˆt = . 2a + nt − 2 Let the true regression model be y t = X t β 0 + t , t ∼ N (0, σ02 I). Then, the maximum a-posteriori estimator for the vector of regression coefficients is given in closed form as −1 X X t t X 0l (X l β 0 + l ) X 0l X l ˆ βt = + I σ ˆl2 σ ˆl2 l=1 l=1 −1 X −1 X X t t t X 0l X l X 0l l X 0l X l + I β + + I . = β0 − 0 2 2 2 σ ˆ σ ˆ σ ˆ l l l l=1 l=1 l=1 Pt 0 1 Assume (i) σ ˆt2 → σ02 a.s. under the data generating law and (ii) that nt l=1 X l X l has bounded eigenvalues for every t. Under the above assumptions, and using the weighted −1 Pt X 0l l Pt X 0l X l SLLN (Adler & Rosalsky, 1991), we obtain +I → 0 a.s. and l=1 σ l=1 σ ˆl2 ˆl2 −1 Pt X 0l X l b → β a.s. under the data generating law. Similarly, +I β → 0 a.s. Then, β 2 l=1
σ ˆl
0
t
0
σ ˆt2
b t → β 0 a.s., one can show assuming β → σ02 a.s. Simultaneous convergence of the model parameters is argued on the basis of established results on alternate optimization (Byrne, 2013). 31
C Accuracy Plots One-way Anova
Linear regression β3
β4
ζ1
ζ2
●
ζ5
ζ8
σ2
1.0
1.00
●
●
0.4
4
6
8
10
●
●
●
●
●
2
4
6
8
10
●
●
2
Time/50
Time/50
●
0.8
●
●
Accuracy
●
●
●
●
0.6
● ●
0.4
0.90 0.85 2
µ
●
●
●
●
0.8
●
Accuracy
●
●
●
●
0.6
0.95
●
●
●
0.80
Accuracy
● ●
τ2
1.0
β1
σ2
4
6
8
10
Time/50
Figure 7: Parameter accuracy plotted over time as defined in (2) for the motivating examples of Section 3.1. Image 1: accuracy for representative regression coefficients βj and σ 2 in the linear regression example. Images 2 and 3: accuracy for representative group means ζj , along with hierarchical parameters µ, τ 2 and σ 2 for the one-way Anova model. D Poisson mixed effect model Consider the Poisson additive mixed effects model where count data y t with predictors X t are modeled as y t ∼ Poisson exp(X t β + Zu) , β ∼ N (0, σβ2 I k ) u ∼ N 0, diag(σ12 I k1 , . . . , σr2 I kr ) , σs2 ∼ IG(0.5, 1/bs ), bs ∼ IG(1/2, 1/a2s ), with I k1 , . . . , I kr respectively denoting identity matrices of order k1 , . . . , kr and s = 1, 2, . . . , r. Here, obstacles to implementing the C-DF algorithm immediately arise as the full conditional distributions have no closed-form and do not admit surrogate quantities to propagate. The global approximation made by the ADF approximation becomes increasingly unreliable and overly restrictive in higher dimensions. A possibly less restrictive approximation may seek to obtain an “optimal” approximation to the posterior subject to ignoring posterior dependence among different parameters. This is given by a variational approximation to the joint posterior over model parameters β, u, σ12 , . . . , σr2 , namely π(β, u, σ12 , . . . , σr2 |D t ) ≈ q1 (β, u) q2 (σ12 , . . . , σr2 ) q3 (b1 , . . . , br ). Closed-form expressions for q1 , q2 , q3 are derived and given by q1 = N (µβ,u , Σβ,u ), q2 =
r Y
IG
ks +1 , µ1/bs 2
+
||µus ||2 +tr(Σus ) 2
, q3 =
r Y
IG(1, µ1/σs2 + a−2 s ).
s=1
s=1
R
Above, µ1/σs2 = (1/σs2 )q(σs2 ), with µ1/bs defined similarly, and µu , Σu represent the mean and covariance specific to u under approximating density q1 . The approximate posterior is 32
completely specified in terms of µβ,u , Σβ,u , and µ1/bs , µ1/σs2 for s = 1, . . . , r, hence the CDF algorithm in this setup is applied directly on these parameters. In particular, one only needs point estimates for these parameters to fully specify the approximate posterior, hence sampling steps (see Algorithm 1) are replaced by fixed-point iteration. With a partition ΘG1 = {µβ,u , Σβ,u }, ΘG2 = {µ1/bs , µ1/σs2 , s = 1, . . . , r} for the parameters of the variational distribution, the C-DF algorithm proceeds as follows: (1) Observe data shard (y t , X t ) at time t. If t = 1, initialize all the parameters using (t−1) (t−1) draws from respective priors. Otherwise set µtβ,u ← µβ,u , Σtβ,u ← Σβ,u , µt1/σs2 ← (t−1)
(t−1)
µ1/σs2 , µt1/bs ← µ1/bs ; (2) The following steps are repeated a maximum of Nfx iterations (or until convergence): (a) Update D ← blockdiag(σβ−2 I k , µ ˆ1/σ12 I tk1 , . . . , µ ˆt1/σr2 I kr ); 0 0 (b) Update wβ,u ← exp ct µβ,u + 21 ct Σβ,u ct , with ct = [X t , Z]; (t−1)
(c) Set H1 = C 1,1
(t−1)
(t−1)
+ ct y t , H2 = C 1,2 (t−1)
(d) Update µβ,u ← µβ,u + Σβ,u
0
+ ct wβ,u , and H3 = C 1,3 + wβ,u ct ct ; −1 H1 − H2 − Dµβ,u , and Σβ,u ← H3 + D .
(3) Set µtβ,u , Σtβ,u at the final values after Nfx iterations; (4) Also repeat the following step a maximum of Nfx iterations (or until convergence): t −1 t 2 (a) Update µ1/bs ← (µ1/σs2 + a−2 s ) , µ1/σs2 ← (ks + 1)/ 2µ1/bs + ||µu || + tr(Σu ) , s = 1, . . . , r. (5) Set µt1/σs2 , µt1/bs at the final values after Nfx iterations; (t−1)
(6) Update surrogate quantities C t1,1 ← C 1,1 0 (t−1) t C 1,3 + wβ,u ct ct .
(t−1)
+ ct y t ; C t1,2 ← C 1,2
t + ct wβ,u ; C t1,3 ←
With the arrival of new data, Nfx fixed-point iterations update parameters estimates for the approximating distributions. Consistent with the definition of SCSS, an update to the parameters for ΘG1 under q1 may involve estimates from the previous time-point, in addition to estimates for parameters ΘG2 and vice versa. This online sampling scheme has excellent empirical performance, both in terms of parametric inference and prediction (see Luts and Wand (2013)). This establishes the broad applicability of propagating surrogate quantities (SCSS) using the C-DF algorithm in a non-conjugate setting using variational approximation methods.
References James H Albert and Siddhartha Chib. Bayesian analysis of binary and polychotomous response data. Journal of the American statistical Association, 88(422):669–679, 1993. Artin Armagan, David B Dunson, and Jaeyong Lee. Generalized double Pareto shrinkage. Statistica Sinica, 23(1):119, 2013. 33
M Sanjeev Arulampalam, Simon Maskell, Neil Gordon, and Tim Clapp. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. Signal Processing, IEEE Transactions on, 50(2):174–188, 2002. Xavier Boyen and Daphne Koller. Tractable inference for complex stochastic processes. In Proceedings of the Fourteenth conference on Uncertainty in artificial intelligence, pages 33–42. Morgan Kaufmann Publishers Inc., 1998. Michael Braun and Jon McAuliffe. Variational inference for large-scale models of discrete choice. Journal of the American Statistical Association, 105(489):324–335, 2010. Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C Wilson, and Michael I Jordan. Streaming variational bayes. In Advances in Neural Information Processing Systems, pages 1727–1735, 2013. Charles L Byrne. Alternating minimization as sequential unconstrained minimization: a survey. Journal of Optimization Theory and Applications, 156(3):554–566, 2013. Carlos Carvalho, Michael S Johannes, Hedibert F Lopes, and Nick Polson. Particle learning and smoothing. Statistical Science, 25(1):88–106, 2010. Nicolas Chopin. A sequential particle filter method for static models. Biometrika, 89(3): 539–552, 2002. Arnaud Doucet, Simon Godsill, and Christophe Andrieu. On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and computing, 10(3):197–208, 2000. Arnaud Doucet, Nando De Freitas, and Neil Gordon. An introduction to sequential Monte Carlo methods. Springer, 2001. Matthew Hoffman, Francis R Bach, and David M Blei. Online learning for latent dirichlet allocation. In advances in neural information processing systems, pages 856–864, 2010. Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013. Changwei Hu, Piyush Rai, and Lawrence Carin. Zero-Truncated Poisson Tensor Factorization for Massive Binary Tensors. a. Changwei Hu, Piyush Rai, Changyou Chen, Matthew Harding, and Lawrence Carin. Scalable Bayesian Non-Negative Tensor Factorization for Massive Count Data. b. T Jaakkola and Michael I Jordan. A variational approach to Bayesian logistic regression models and their extensions. Citeseer, 1997. M. Johannes, N. G. Polson, and S. M. Yae. Particle Learning in Nonlinear Models using Slice Variables, 2010. Anoop Korattikara, Yutian Chen, and Max Welling. Austerity in MCMC land: Cutting the Metropolis-Hastings budget. arXiv preprint arXiv:1304.5299, 2013. 34
Steffen L Lauritzen. Propagation of probabilities, means, and variances in mixed graphical association models. Journal of the American Statistical Association, 87(420):1098–1108, 1992. H. F. Lopes, C. M. Carvalho, M. S. Johannes, and N. Polson. Particle learning for sequential Bayesian computation. Bayesian Statistics 9, 9:317, 2011. Jan Luts and Matt P Wand. Variational inference for count response semiparametric regression. arXiv preprint arXiv:1309.4199, 2013. Dougal Maclaurin and Ryan P Adams. Firefly Monte Carlo: Exact MCMC with subsets of data. arXiv preprint arXiv:1403.5693, 2014. Tyler H McCormick, Adrian E Raftery, David Madigan, and Randall S Burd. Dynamic logistic regression and dynamic model averaging for binary classification. Biometrics, 68 (1):23–30, 2012. Alan Medlar, Dorota Glowacka, Horia Stanescu, Kevin Bryson, and Robert Kleta. SwiftLink: parallel MCMC linkage analysis using multicore CPU and GPU. Bioinformatics, 29(4): 413–419, 2013. Thomas P Minka. Expectation propagation for approximate Bayesian inference. In Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence, pages 362–369. Morgan Kaufmann Publishers Inc., 2001. Thomas P Minka, Rongjing Xiang, and Yuan Alan Qi. Virtual vector machine for Bayesian online classification. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 411–418. AUAI Press, 2009. Stanislav Minsker, Sanvesh Srivastava, Lizhen Lin, and David B Dunson. Robust and scalable Bayes via a median of subset posterior measures. arXiv preprint arXiv:1403.2660, 2014. Manfred Opper. A Bayesian approach to on-line learning. 1998. Trevor Park and George Casella. The Bayesian lasso. Journal of the American Statistical Association, 103(482):681–686, 2008. Matias Quiroz, Mattias Villani, and Robert Kohn. Speeding up MCMC by efficient data subsampling. arXiv preprint arXiv:1404.4178, 2014. Steven L Scott, Alexander W Blocker, Fernando V Bonassi, Hugh A Chipman, Edward I George, and Robert E McCulloch. Bayes and big data: The consensus Monte Carlo algorithm. 2013. Linda SL Tan, David J Nott, et al. A stochastic variational framework for fitting and diagnosing generalized linear mixed models. Bayesian Analysis, 9(4):963–1004, 2014. Yee Whye Teh, Alexandre Thi´ery, and Sebastian Vollmer. Consistency and fluctuations for stochastic gradient Langevin dynamics. arXiv preprint arXiv:1409.0578, 2014. 35
Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011. Mike West and Jeff Harrison. Bayesian forecasting and dynamic models, volume 18. 2007. Yun Yang and David B Dunson. Sequential Markov Chain Monte Carlo. arXiv preprint arXiv:1308.3861, 2013.
36