Bayesian Conditional Density Filtering

Journal of Machine Learning Research (2015)


Shaan Qamar∗

[email protected]

Department of Statistical Science, Duke University, Durham, NC 27708-0251, USA

Rajarshi Guhaniyogi∗

[email protected]

Department of Applied Mathematics & Statistics, University of California, Santa Cruz, CA 95064, USA

David B. Dunson

[email protected]

Department of Statistical Science, Duke University, Durham, NC 27708-0251, USA

Abstract

We propose a Conditional Density Filtering (C-DF) algorithm for efficient online Bayesian inference. C-DF adapts MCMC sampling to the online setting, sampling from approximations to conditional posterior distributions obtained by propagating surrogate conditional sufficient statistics (functions of the data and parameter estimates) as new data arrive. These quantities eliminate the need to store or process the entire dataset simultaneously and offer a number of desirable features, often including reduced memory requirements and runtime and improved mixing, along with state-of-the-art parameter inference and prediction. These improvements are demonstrated through several illustrative examples, including an application to high-dimensional compressed regression. Finally, we show that C-DF samples converge to the target posterior distribution asymptotically as sampling proceeds and more data arrive.

Keywords: Approximate MCMC; Big data; Density filtering; Dimension reduction; Streaming data; Sequential inference; Sequential Monte Carlo; Time series.

∗ These authors contributed equally.

© 2015 Shaan Qamar, Rajarshi Guhaniyogi and David B. Dunson.

1. Introduction

Modern data are increasingly high dimensional, both in the number of observations n and in the number of predictors measured p. Statistical methods increasingly make use of low-dimensional structure assumptions to combat the curse of dimensionality (e.g., sparsity assumptions), and efficient model-fitting tools must evolve quickly to keep pace with the rapidly growing dimension and complexity of the data they are applied to. Bayesian methods provide a natural probabilistic characterization of uncertainty in the parameters and in predictions, but there is a lack of scalable inference algorithms having guarantees on accuracy. Markov chain Monte Carlo (MCMC) methods for Bayesian computation are routinely used due to their ease, generality and convergence guarantees. When the number of observations is truly massive, however, data processing and computational bottlenecks render many MCMC methods infeasible, as they demand (1) that the entire dataset (or available sufficient statistics) be held in memory; and (2) likelihood evaluations for updating model parameters at every sampling iteration, which can be costly.

A number of alternative strategies have been proposed in recent years for scaling MCMC to large datasets. One possibility is to parallelize computation within each MCMC iteration using GPUs or multicore architectures to free bottlenecks in updating unknowns specific to each sample and in calculating likelihoods (Medlar et al., 2013). Another possibility is to rely on Hamiltonian Monte Carlo with stochastic gradient methods used to approximate gradients with subsamples of the data (Welling and Teh, 2011; Teh et al., 2014). Korattikara et al. (2013) instead use sequential hypothesis testing to choose a subsample with which to approximate the acceptance ratio in Metropolis-Hastings. More broadly, data subsampling approaches attempt to mitigate the MCMC computational bottleneck by choosing a small set of points at each iteration for which the likelihood is to be evaluated, and doing so in a way that preserves the validity of the draws as samples from the desired target posterior (Quiroz et al., 2014; Maclaurin and Adams, 2014). Building on ideas reminiscent of sequential importance resampling and particle filtering (Doucet et al., 2000; Arulampalam et al., 2002), these methods are promising and are an active area of research. Yet another approach assigns different data subsets to different machines, runs MCMC in parallel for each subset, and recombines the resulting samples (Scott et al., 2013; Minsker et al., 2014).
However, theoretical guarantees for the latter have not been established in great generality.

Sequential Monte Carlo (SMC) (Chopin, 2002; Arulampalam et al., 2002; Lopes et al., 2011; Doucet et al., 2001) is a popular technique for online Bayesian inference that relies on resampling particles sequentially as new data arrive. Unfortunately, it is difficult to scale SMC to problems involving large n and p due to the need to employ very large numbers of particles to obtain adequate approximations and prevent particle degeneracy. The latter is addressed through rejuvenation steps using all the data (or sufficient statistics), which becomes expensive in an online setting. One could potentially rejuvenate particles only at earlier time points, but this may not protect against degeneracy for models involving many parameters. More recent particle learning (PL) algorithms (Carvalho et al., 2010) reduce degeneracy for the dynamic linear model, with satisfactory density estimates for parameters, but they too require propagating a large number of particles, which significantly adds to the per-iteration computational complexity.

MCMC can be extended to accommodate data collected over time by adapting the transition kernel K_t and drawing a few samples at time t, so that samples converge in distribution to the joint posterior distribution π_t in time (Yang and Dunson, 2013). However, this method requires the full data or available sufficient statistics to be stored, leading to storage and processing bottlenecks as more data arrive in time. Indeed, models with large parameter spaces often have sufficient statistics which are also high dimensional (e.g., linear regression). In addition, MCMC often faces poor mixing, requiring longer runs with additional storage and computation.

In simple conjugate models, such as Gaussian state-space models, efficient updating equations can be obtained using methods related to the Kalman filter. Assumed density filtering

(ADF) was proposed (Lauritzen, 1992; Boyen and Koller, 1998; Opper, 1998) to extend this computational tractability to broader classes of models. ADF approximates the posterior distribution with a simple conjugate family, leading to approximate online posterior tracking. The predominant concern with this approach is the propagation of errors with each additional approximation to the posterior in time. Expectation propagation (EP) (Minka, 2001; Minka et al., 2009) improves on ADF through additional iterative refinements, but the approximation is limited to the class of assumed densities and has no convergence guarantees. Moreover, in fairly standard settings, an arbitrary approximation to the posterior through an assumed density severely underestimates parameter and predictive uncertainties.

Similarly, variational methods developed in the machine learning literature attempt to address various difficulties facing MCMC methods by making additional approximations to the joint posterior distribution over model parameters. These procedures often work well in simple low-dimensional settings, but they fail to adequately capture dependence in the joint posterior, severely underestimate uncertainty in more complex or higher-dimensional settings, and generally come with no accuracy guarantee. Additionally, they often require computing expensive gradients for large datasets. Hoffman et al. (2013) proposed stochastic variational inference (SVI), which uses a stochastic approximation to the full gradient for a subset of parameters, thus circumventing this computational bottleneck. However, SVI requires retrieving, storing or having access to the entire dataset at once, which is often not feasible in very large or streaming data settings. A parallel literature on online variational approximations (Hoffman et al., 2010) focuses primarily on improving batch inferences by feeding in data sequentially.
In addition, these methods require specifying or tuning a 'learning rate,' and generally come with no storage reduction. Broderick et al. (2013) recently proposed a streaming variational Bayes (SVB) algorithm that requires storage only for the most recent batch of a data stream. However, all variational methods (batch or online) rely on a factorized form of the posterior that typically fails to capture dependence in the joint posterior and severely underestimates uncertainty. Recent attempts to design careful online variational approximations (Luts and Wand, 2013), though successful in accurately estimating marginal densities, are limited to specific models, and no theoretical guarantees on accuracy have been established except for stylized cases.

We propose a new class of Conditional Density Filtering (C-DF) algorithms that extend MCMC sampling to streaming data. Sampling proceeds by drawing from conditional posterior distributions, but instead of conditioning on conditional sufficient statistics (CSS) (Carvalho et al., 2010; Johannes et al., 2010), C-DF conditions on surrogate conditional sufficient statistics (SCSS), using sequential point estimates for parameters along with the data observed. This eliminates the need to store the data in time (or to process the entire dataset at once), and leads to an approximation of the conditional distributions that produces samples from the correct target posterior asymptotically. The C-DF algorithm is demonstrated to be highly versatile and efficient across a variety of settings, with SCSS enabling online sampling of parameters, often with dramatic reductions in memory and per-iteration computational requirements. C-DF has also successfully been applied to non-negative tensor factorization models for massive binary and count-valued streaming tensor data, with state-of-the-art performance demonstrated against a wide range of batch and online competitors (Hu et al., a,b).
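The SCSS mechanism can be sketched in code. The toy example below is our own construction (not the authors' implementation): a Gaussian model y ∼ N(µ, σ²) fit to streaming shards, where the running count and sum are exact sufficient statistics, while the residual sum used for σ² is a surrogate built by plugging in the current point estimate µ̂, so no raw data are retained across shards.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sketch of the C-DF recursion (our construction, not the paper's code):
# model y ~ N(mu, sig2) with priors mu ~ N(0, 10), sig2 ~ Inv-Gamma(a0, b0).
# SCSS: the count n_tot and sum_y (exact), plus ssr_hat, a residual sum
# accumulated with the running point estimate mu_hat plugged in.
a0, b0 = 2.0, 2.0
n_tot, sum_y, ssr_hat = 0, 0.0, 0.0
mu_hat, sig2 = 0.0, 1.0

for t in range(200):                    # stream of 200 data shards
    y = rng.normal(1.5, 0.5, size=20)   # true mu = 1.5, sigma = 0.5
    n_tot += y.size                     # update SCSS with the new shard,
    sum_y += y.sum()                    # then discard the raw data
    ssr_hat += ((y - mu_hat) ** 2).sum()
    draws = []
    for _ in range(50):                 # a few approximate Gibbs draws
        prec = n_tot / sig2 + 0.1       # mu | sig2, SCSS (prior var 10)
        mu = rng.normal(sum_y / sig2 / prec, prec ** -0.5)
        # sig2 | SCSS: residuals taken around mu_hat, not the current mu
        sig2 = 1.0 / rng.gamma(a0 + n_tot / 2, 1.0 / (b0 + ssr_hat / 2))
        draws.append(mu)
    mu_hat = float(np.mean(draws))      # point estimate propagated forward

print(round(mu_hat, 2))                 # close to the true mean 1.5
```

For this conjugate toy model exact sufficient statistics happen to exist, but the sketch exercises the C-DF mechanics: each shard touches only fixed-size surrogate statistics, and the point estimate µ̂ is refreshed from the approximate conditional draws and carried forward.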


Section 2 introduces the C-DF algorithm in generality, along with definitions, assumptions, and a description of how to identify the updating quantities of interest. Section 3 demonstrates approximate online inference using C-DF in the context of several illustrative examples. Section 3.1 applies C-DF to linear regression and the one-way Anova model. More complex model settings are considered in Section 3.2, namely extensions of the C-DF algorithm to a dynamic linear model and to binary regression using the high-dimensional probit model. Here, we investigate the performance of C-DF in settings with an increasing parameter space. Along with comparing inferential and predictive performance, we discuss the computational, storage and mixing advantages of our method over state-of-the-art competitors in each case. Section 4 presents a detailed implementation of the C-DF algorithm for high-dimensional compressed regression. We report state-of-the-art inferential and predictive performance across extensive simulation experiments as well as real data studies in Section 5. The C-DF algorithm is also applied to a Poisson mixed effects model to update the parameters of a variational approximation to the posterior distribution in Appendix D. Section 6 presents a finite-sample error bound for approximate MCMC kernels and establishes the asymptotic convergence guarantee for the proposed C-DF algorithm. Section 7 concludes with a discussion of extensions and future work. Proofs and additional figures pertaining to Section 3 appear in Appendices A and C, respectively.

2. Conditional density filtering

Define Θ = (θ_1, θ_2, ..., θ_k) as the collection of unknown parameters in the probability model P(Y | Θ), Y ∈ Y, with θ_j ∈ Ψ_j, and Ψ_j denoting an arbitrary sample space (e.g., a subset of

b. Unlike the previous examples, the conditional distribution π(φ|−) is not available in closed form. Nevertheless, propagated surrogate quantities (SCSS) enable approximate MCMC draws to be sampled from the C-DF full conditional π̃(φ|−) via Metropolis-Hastings. Here, the steps for approximate MCMC using the C-DF algorithm in the context of a Metropolis-within-Gibbs sampler are:

(1) At time t, observe y_t;

(2) If t ≤ b: draw S samples for (θ_1, ..., θ_t, τ², σ², φ) from the Gibbs full conditionals. In addition, C_1^t = C_2^t = C_3^t = C_4^t = 0 for t ≤ b.

(3) If t > b: repeat S times the following:
(a) for t − b < s < t, draw sequentially from [θ_s | θ_{s−1}, θ_{s+1}, −], and finally from [θ_t | θ_{t−1}, −], noting that θ_{t−b} = θ̂_{t−b};
(b) draw τ² ∼ IG(a_t, b_t), with b_t = C_4^{(t−1)} − 2φ C_3^{(t−1)} + φ² C_2^{(t−1)} + ½[(θ_{t−b+1} − φθ̂_{t−b})² + Σ_{j=2}^b (θ_{t+j−b} − φθ_{t+j−b−1})²];
(c) draw σ² ∼ IG(a_t, C_1^{(t−1)} + ½ Σ_{j=1}^b (y_{t+j−b} − θ_{t+j−b})²);
(d) sample φ ∼ π̃(· | C_3^{(t−1)}, C_4^{(t−1)}, θ_{(t−b+1):t}) via a Metropolis-Hastings step.

(4) Set θ̂_{t−b} ← mean(θ_{t−b}), and then update the surrogate quantities C_1^t ← C_1^{(t−1)} + ½(y_{t−b} − θ̂_{t−b})², C_2^t ← C_2^{(t−1)} + θ̂²_{t−b−1}, C_3^t ← C_3^{(t−1)} + θ̂_{t−b−1} θ̂_{t−b}, C_4^t ← C_4^{(t−1)} + θ̂²_{t−b}/2.
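Step (4) is the key constant-time operation: once θ_{t−b} leaves the moving window, only its point estimate is folded into the surrogate statistics. A minimal sketch of that step (our notation; scalar updates only):

```python
# Constant-time SCSS update of step (4): when theta_{t-b} exits the moving
# window, only its point estimate is absorbed into C1..C4 (our sketch).
def update_scss(C, y_out, theta_out, theta_prev):
    """C = (C1, C2, C3, C4); theta_out = theta_hat_{t-b},
    theta_prev = theta_hat_{t-b-1}, y_out = y_{t-b}."""
    C1, C2, C3, C4 = C
    C1 += 0.5 * (y_out - theta_out) ** 2   # enters the sigma^2 update
    C2 += theta_prev ** 2                  # these enter the tau^2 and
    C3 += theta_prev * theta_out           # phi updates
    C4 += 0.5 * theta_out ** 2
    return C1, C2, C3, C4
```

Regardless of how many observations have streamed past, the update touches four scalars, which is what keeps the per-iteration cost and memory of C-DF bounded here.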

We vary the signal-to-noise ratio τ/σ of the data-generating process in simulation experiments. A number of high-signal cases were examined, and results are reported for a representative case with τ = √2, σ = 0.1. A sufficiently large window size b is necessary to prevent C-DF from suffering due to poor estimation at early time points, which would propagate error into the defined surrogate quantities. For the competitor, 100 particles were propagated in time, and hence we fix b = 100 as well. Kernel density estimates for θ_t using particle learning (PL; Lopes et al., 2011) and the C-DF algorithm are shown in Figure 3. In the case of C-DF, estimates for all model parameters are concentrated near their true values, whereas for PL, θ_t at different times is centered correctly, albeit with much higher posterior variance (presumably due to poor estimation of the noise parameters τ², σ²). In addition to being 35% faster than PL, average coverage for the latent AR(1) process using the C-DF algorithm is near 80%, with credible intervals roughly 10 times narrower than PL's (see Table 3). C-DF is substantially more efficient in terms of memory and storage utilization, requiring only 6% of what PL uses for an identical simulation. Finally, C-DF produces accurate estimates for the latent parameters τ², φ (in contrast to PL), although their posterior spread appears somewhat over-shrunk. This may be remedied by using a larger window size b (at the expense of increased runtime), depending on the task at hand.

Table 3: Inferential performance for C-DF and particle learning (PL). Coverage and length are based on 95% credible intervals for θ_t, averaged over all time points and 10 replications. For truth θ_t^0 at time t, we report MSE = (1/T_n) Σ_{t=1}^{T_n} (θ̂_t − θ_t^0)². We report the time taken to run C-DF with 50 Gibbs samples at each time for τ², θ, σ² and 500 MH samples for φ. Variability across replications is given in parentheses.

                  C-DF              PL
Avg. coverage θ   0.78 (0.10)       1.00 (0.00)
Length            0.33 (0.11)       3.36 (0.46)
Time (sec)        1138.60 (0.10)    1750.58 (0.10)
MSE               0.011 (0.001)     0.096 (0.027)

Table 4: Computational and storage requirements for the dynamic linear model using C-DF and PL. C_{i,j}^t is the i-th CSS corresponding to the j-th particle in PL, i = 1:4, j = 1:N; N = 100 is the number of particles propagated by PL, and G = 500 is the number of Metropolis samples used by both PL and C-DF. Memory used to store and propagate SCSS and CSS for C-DF and PL is reported. Sampling and update complexities are in terms of big-O.

                 C-DF               PL
Stats            C_i^t              C_{i,j}^t
Data             {y_i}_{i>nt−b}     {y_i}_{i≥1}
Sampling         S(N + G) flops     NG flops
Updating         N flops            N flops
Memory (bytes)   128                3330

Figure 3: Row #1 (left to right): kernel density estimates for posterior draws of θ_t using PL and the C-DF algorithm at t = 1000, 2000, 3000; Row #2 (left to right): plots of model parameters τ² and φ, respectively.

3.3.2 An application to binary response data

We consider an application of the C-DF algorithm for the probit model Pr(y_i = 1 | x_i) = Φ(x_i'β), with Φ(·) the standard normal distribution function. The Albert and Chib (1993) latent variable augmentation applies Gibbs sampling, with y_i = sign(z_i) and z_i | β ∼ N(x_i'β, 1). Assuming a conditionally conjugate prior β ∼ N(0, Σ_β), the Gibbs sampler alternates between the following steps: (1) conditional on the latent variables z = (z_1, z_2, ...)', sample β from its p-variate Gaussian full conditional; and (2) conditional on β and the observed binary responses y, impute the latent variables from their truncated normal full conditionals. However, imputing latent variables {z_i : i ≥ 1} presents an increasing computational bottleneck as the sample size grows, as does recalculating the conditional sufficient statistics (CSS) for β given these latent variables at each iteration. C-DF alleviates this problem by imputing latent scores only for the last b training observations and propagating surrogate quantities. The "budget" b is allowed to grow with the predictor dimension, and as a default we set b = p log p. For data shards of size n, define I_t = {i > nt − b} as an index over the final b observations at time t. Parameter partitions in this setting are dynamic as in Section 3.3.1, with Θ_{G1} = {β; z_i, i ∈ I_t} and Θ_{G2} = {z_i : i ≤ nt − b}. The C-DF algorithm proceeds as follows:

(1) Observe data X_t, y_t at time t;

(2) If t = 1, set β = 0 and draw z_i ∼ N(0, 1), i = 1, ..., n. If t ≤ b/n, set C_t = 0 and draw S Gibbs samples for (z_1, ..., z_{nt}, β) from the full conditionals.

(3) If t > b/n, set β ← β̂_{t−1}, update the sufficient statistic S_t^{XX} ← S_{t−1}^{XX} + X_t'X_t, and compute Σ_t^{XX} = (S_t^{XX} + Σ_β^{−1})^{−1};

(4) For s = 1, ..., S: (a) draw z_i ∼ TN(x_i'β), i ∈ I_t, and define z_{I_t} and X_{I_t} respectively as the collection of latent draws and data for moving window b; (b) draw β | z_{I_t}^{(s)}, C_{t−1} ∼ N(Σ_t^{XX} C_t^{(s)}, Σ_t^{XX}), where C_t^{(s)}(z_{I_t}) = C_{t−1} + X_{I_t}' z_{I_t}^{(s)};

(5) Set β̂_t ← mean(β^{(s)}). For u_i(β) = y_i φ(−x_i'β)/Φ(y_i x_i'β), the trailing n latent scores within moving window b are fixed at their expected value, namely ẑ_i ← x_i'β̂_t + u_i(β̂_t), i ∈ I_t^{out} = {(nt + b − 1) : (n(t + 1) + b − 1)};

(6) Denote X_{I_t^{out}} and ẑ_{I_t^{out}} as the predictor data and latent scores for the "outgoing" set of data indexed by I_t^{out}. Update the surrogate CSS C_t ← C_{t−1} + X_{I_t^{out}}' ẑ_{I_t^{out}}, and set Θ_{G1} ← Θ_{G1} ∪ ẑ_{I_t^{out}}.

We report simulation results for the following examples: (Case 1) (p, b, n, t) = (100, 500, 25, 100), with β_{j,0} ∼ U(−3/4, 3/4) for j ∈ [11, 100]; (Case 2) (p, b, n, t) = (500, 3500, 100, 100), with β_{j,0} ∼ U(−1/3, 1/3) for j ∈ [11, 200]. The first 10 regression coefficients are (3.5, −3.5, −2.0, 2.0, −1.5, 1.5, −1.5, 1.5, −1.0, 1.0), with β_{j,0} = 0 for j > 200 in case 2. Data are generated as x_{ij} ∼ N(0, σ = 0.25), and P(y_i = 1) = Φ(x_i'β). Table 5 summarizes inferential performance for the regression parameters in each of the simulated cases. In case 2, although coverage for predictors with large coefficients (i.e., β_1 and β_2) is less than the nominal value, the average coverage across all predictors produced by C-DF is 70% despite the high dimension and a significant number of "noisy" predictors. In addition, C-DF yields very good mean-square estimation of the parameter coefficients in both cases. As competitors to C-DF, S-MCMC (batch Gibbs), SMC (Chopin, 2002) and S-VB (Broderick et al., 2013) are implemented. Results for SMC with 2000 propagated particles are shown in Table 5.
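The probit update can be sketched in code. The following is our illustrative implementation (not the authors' code) of a single time step in the spirit of steps (3)-(4), with the truncated-normal imputation done by simple rejection sampling:

```python
import numpy as np

def cdf_probit_step(S_xx, C_prev, X_win, y_win, Sigma_beta_inv, S, rng):
    """One C-DF probit update over the moving window (our sketch of
    steps (3)-(4)); y_win has entries in {-1, +1}."""
    Sigma = np.linalg.inv(S_xx + Sigma_beta_inv)   # Sigma_t^{XX}
    beta = np.zeros(X_win.shape[1])
    draws = []
    for _ in range(S):
        m = X_win @ beta
        z = rng.normal(m)                  # z_i ~ TN(x_i' beta, 1), via
        bad = np.sign(z) != y_win          # rejection: redraw wrong-sign
        while bad.any():                   # entries until signs match
            z[bad] = rng.normal(m[bad])
            bad = np.sign(z) != y_win
        # beta | z, C ~ N(Sigma (C_prev + X' z), Sigma)
        beta = rng.multivariate_normal(Sigma @ (C_prev + X_win.T @ z), Sigma)
        draws.append(beta)
    return np.mean(draws, axis=0)          # beta_hat_t, propagated forward

# One-batch illustration (window = all data so far, C_prev = 0):
rng = np.random.default_rng(1)
beta0 = np.array([1.0, -1.0, 0.5])
X = rng.normal(size=(400, 3))
y = np.sign(X @ beta0 + rng.normal(size=400))
beta_hat = cdf_probit_step(X.T @ X, np.zeros(3), X, y, np.eye(3), 50, rng)
```

Rejection sampling of the truncated normals is adequate for moderate |x_i'β|; a production implementation would use a dedicated truncated-normal sampler, and would also perform the outgoing-score step (5)-(6) to keep the window at size b.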
For clarity, Figure 4 displays a number of marginal kernel density estimates at t = 50, 100 for C-DF and S-MCMC. S-MCMC has the worst-case storage requirement, which scales linearly in the number of training samples. At each Gibbs iteration, drawing the latent scores is O(nt) for S-MCMC compared to O(b) for C-DF. For both methods, sampling from the full conditional for β is O(p³), updating sufficient statistics is O(np²), and updating surrogate quantities is O(np). Computational and storage requirements for both methods are summarized in Table 6.

4. C-DF for online compressed regression We consider an application to compressed linear regression where y 1 , . . . , y t ∈ i) πt (Θ2 |θ

i=1 0 b b 1,t−1 ). πt (Θ1 |Θ2,t−1 )π(Θ02 |Θ

The last step follows from the fact that p1 hY

b 1,t−1 , Θ0 , l πt (Θ01i |Θ 1l

p2 i hY i b 1,t−1 , Θ0 , l < i, Θ2l , l > i) < i, Θ1l , l > i) and πt (Θ02i |Θ 2l

i=1

i=1

b 1,t−1 ) and πt (Θ1 |Θ b 2,t−1 ), reare the Gibbs kernels with the stationary distribution πt (Θ2 |Θ spectively. proof of Lemma 6 The following proof builds on Theorem 3.6 from Yang and Dunson (2013). Although most of the proof coincides with their result, we present the entire proof for completeness. Proof Note that, for a fixed  ∈ (0, 1) nt , t ≥ 1 are s.t. αtnt < . Using the fact that universal ergodicity condition implies uniform ergodicity, one obtains dT V (Ttnt , πt ) ≤ αtnt < . n

t−1 Let h = Tt−1 · · · T1n1 π0 , then

dT V (Ttnt · · · T1n1 π0 , ft ) = dT V (Ttnt h, ft ) ≤ dT V (Ttnt , ft ) dT V (h, ft ) ≤ αtnt dT V (h, ft ) ≤  (dT V (h, ft−1 ) + dT V (ft , ft−1 )) .

(18)

using the result repeatedly, one obtains dT V (Ttnt · · · T1n1 π0 , ft ) ≤

t X

t+1−l dT V (fl , fl−1 ).

l=1

R.H.S clearly converges to 0 applying condition (ii). The proof is completed by using condition (iii) and the fact that dT V (Ttnt · · · T1n1 π0 , πt ) ≤ dT V (Ttnt · · · T1n1 π0 , ft ) + dT V (ft , πt ). 28

Proof of Lemma 10.

Proof. The stationary distribution f_t of the C-DF transition kernel T_t is the approximate posterior to π_t obtained at time t, and by Lemma 4 is given by

f_t(Θ_1, Θ_2) = π_t(Θ_1 | Θ̂_{2,t}) π_t(Θ_2 | Θ̂_{1,t})
= [∏_{l=1}^t p_{Θ_1, Θ̂_{2,t}}(D_l) ∏_{l=1}^t p_{Θ̂_{1,t}, Θ_2}(D_l) π_0(Θ̂_{1,t}, Θ_2) π_0(Θ_1, Θ̂_{2,t})] / ∫ ∏_{l=1}^t p_{Θ_1, Θ̂_{2,t}}(D_l) ∏_{l=1}^t p_{Θ̂_{1,t}, Θ_2}(D_l) π_0(Θ̂_{1,t}, Θ_2) π_0(Θ_1, Θ̂_{2,t}) dΘ.

By assumption, Θ̂_{1,t} → Θ_1^0 and Θ̂_{2,t} → Θ_2^0 a.s. under Θ^0, so there exists Ω_0 of probability 1 under the data-generating law such that for all ω ∈ Ω_0, Θ̂_{1,t}(ω) and Θ̂_{2,t}(ω) lie in an arbitrarily small neighborhood of Θ_1^0 and Θ_2^0, respectively. Also by assumption, the prior π_0 is continuous at Θ^0. That is, given ε > 0 and η > 0, there exists a neighborhood N_{ε,η} such that for all Θ ∈ N_{ε,η},

|π_0(Θ_1, Θ_2) − π_0(Θ_1^0, Θ_2^0)| < ε.   (19)

Using (19) and the consistency of Θ̂_{1,t} and Θ̂_{2,t} as above, one obtains for all t > t_0 and ω ∈ Ω_0

|π_0(Θ_1, Θ̂_{2,t}) − π_0(Θ^0)| < ε,   |π_0(Θ̂_{1,t}, Θ_2) − π_0(Θ^0)| < ε.   (20)

Similarly, continuity of p_Θ(·) at Θ^0 leads to the condition that for all t > t_0,

|p_{Θ_1,Θ_2}(D_l) − p_{Θ_1^0,Θ_2^0}(D_l)| < ε.   (21)

Further, the consistency assumptions on f_t and π_t yield that for all t > t_1 and ω ∈ Ω_1,

f_t(N_{ε,η} | D^(t)(ω)) > 1 − η,   π_t(N_{ε,η} | D^(t)(ω)) > 1 − η,

where Ω_1 has probability 1 under the data-generating law. Considering Ω = Ω_0 ∩ Ω_1 and t_2 = max{t_0, t_1}, it is evident that Ω also has probability 1 under the true data-generating law, and all of the above conditions hold for t > t_2 and ω ∈ Ω. Simple algebraic manipulations yield

f_t(Θ | D^(t)(ω)) / π_t(Θ | D^(t)(ω))
= [f_t(N_{ε,η} | D^(t)(ω)) / π_t(N_{ε,η} | D^(t)(ω))] × [∫_{N_{ε,η}} ∏_{l=1}^t p_Θ(D_l) π_0(Θ) / ∏_{l=1}^t p_Θ(D_l) π_0(Θ)]
  × [∏_{l=1}^t p_{Θ_1,Θ̂_{2,t}}(D_l) p_{Θ̂_{1,t},Θ_2}(D_l) π_0(Θ̂_{1,t}, Θ_2) π_0(Θ_1, Θ̂_{2,t}) / ∫_{N_{ε,η}} ∏_{l=1}^t p_{Θ_1,Θ̂_{2,t}}(D_l) p_{Θ̂_{1,t},Θ_2}(D_l) π_0(Θ̂_{1,t}, Θ_2) π_0(Θ_1, Θ̂_{2,t})].   (22)

Using (20) we have

(π_0(Θ^0) − ε)² ∫_{N_{ε,η}} ∏_{l=1}^t p_{Θ_1,Θ̂_{2,t}}(D_l) p_{Θ̂_{1,t},Θ_2}(D_l)
≤ ∫_{N_{ε,η}} ∏_{l=1}^t p_{Θ_1,Θ̂_{2,t}}(D_l) p_{Θ̂_{1,t},Θ_2}(D_l) π_0(Θ̂_{1,t}, Θ_2) π_0(Θ_1, Θ̂_{2,t})
≤ (π_0(Θ^0) + ε)² ∫_{N_{ε,η}} ∏_{l=1}^t p_{Θ_1,Θ̂_{2,t}}(D_l) p_{Θ̂_{1,t},Θ_2}(D_l).

Similarly,

(π_0(Θ^0) − ε) ∫_{N_{ε,η}} ∏_{l=1}^t p_Θ(D_l) ≤ ∫_{N_{ε,η}} [∏_{l=1}^t p_Θ(D_l)] π_0(Θ) ≤ (π_0(Θ^0) + ε) ∫_{N_{ε,η}} ∏_{l=1}^t p_Θ(D_l).

Therefore,

f_t(Θ | D^(t)(ω)) / π_t(Θ | D^(t)(ω)) ≤ (1 − η)^{−1} [∏_{l=1}^t p_{Θ_1,Θ̂_{2,t}}(D_l) p_{Θ̂_{1,t},Θ_2}(D_l) / ∫_{N_{ε,η}} ∏_{l=1}^t p_{Θ_1,Θ̂_{2,t}}(D_l) p_{Θ̂_{1,t},Θ_2}(D_l)] × [∫_{N_{ε,η}} ∏_{l=1}^t p_Θ(D_l) / ∏_{l=1}^t p_Θ(D_l)] × (π_0(Θ^0) + ε)³ / (π_0(Θ^0) − ε)³,

and by similar calculations,

f_t(Θ | D^(t)(ω)) / π_t(Θ | D^(t)(ω)) ≥ (1 − η) [∏_{l=1}^t p_{Θ_1,Θ̂_{2,t}}(D_l) p_{Θ̂_{1,t},Θ_2}(D_l) / ∫_{N_{ε,η}} ∏_{l=1}^t p_{Θ_1,Θ̂_{2,t}}(D_l) p_{Θ̂_{1,t},Θ_2}(D_l)] × [∫_{N_{ε,η}} ∏_{l=1}^t p_Θ(D_l) / ∏_{l=1}^t p_Θ(D_l)] × (π_0(Θ^0) − ε)³ / (π_0(Θ^0) + ε)³.

Condition (21) now gives

∏_{l=1}^t (p_{Θ^0}(D_l) − ε)³ / ∏_{l=1}^t (p_{Θ^0}(D_l) + ε)³
≤ [∏_{l=1}^t p_{Θ_1,Θ̂_{2,t}}(D_l) p_{Θ̂_{1,t},Θ_2}(D_l) / ∫_{N_{ε,η}} ∏_{l=1}^t p_{Θ_1,Θ̂_{2,t}}(D_l) p_{Θ̂_{1,t},Θ_2}(D_l)] × [∫_{N_{ε,η}} ∏_{l=1}^t p_Θ(D_l) / ∏_{l=1}^t p_Θ(D_l)]
≤ ∏_{l=1}^t (p_{Θ^0}(D_l) + ε)³ / ∏_{l=1}^t (p_{Θ^0}(D_l) − ε)³.

Using the condition that lim_{t→∞} p_{Θ^0}(D^(t))^{1/t} is bounded away from 0 and ∞, and choosing ε, η sufficiently small, we have

| f_t(Θ | D^(t)(ω)) / π_t(Θ | D^(t)(ω)) − 1 | < κ for all t > t_2 and ω ∈ Ω.

Finally,

∫ |π_t(Θ) − f_t(Θ)| ≤ ∫_{N_{ε,η}} |π_t(Θ) − f_t(Θ)| + ∫_{N_{ε,η}^c} |π_t(Θ) − f_t(Θ)|
≤ ∫_{N_{ε,η}} |π_t(Θ) − f_t(Θ)| + 2η
≤ π_t(N_{ε,η}) κ + 2η < κ + 2η.

B Consistent sequence of estimators

We verify the consistency of the C-DF estimators Θ̂_1 and Θ̂_2 for the examples in Section 3.1. Details are presented for the linear regression case, with similar steps following for the Anova and compressed regression examples. Step 3 in Algorithm 1 proposes an estimator θ̂_j which estimates the mean of the corresponding approximate conditional posterior π̃_j^{(t)}(· | Θ_{G_l}^{(j)}, C_j). When the mean of this distribution is available in closed form, one might use it to create the sequence of estimators. Such mean functions are readily available for the examples in Section 3.1, and for ease of exposition we focus on showing consistency for the sequence of estimators in such cases. For the linear regression example in Section 3.1.1, the sequence of C-DF estimators is given by

β̂_t = (Σ_{l=1}^t X_l'X_l/σ̂_l² + I)^{−1} Σ_{l=1}^t X_l'y_l/σ̂_l²,
σ̂_t² = [2b + Σ_{l=1}^t y_l'y_l − 2 Σ_{l=1}^t β̂_l'X_l'y_l + Σ_{l=1}^t β̂_l'X_l'X_l β̂_l] / (2a + nt − 2).

Let the true regression model be y_t = X_t β_0 + ε_t, ε_t ∼ N(0, σ_0² I). Then the maximum a-posteriori estimator for the vector of regression coefficients is given in closed form as

β̂_t = (Σ_{l=1}^t X_l'X_l/σ̂_l² + I)^{−1} Σ_{l=1}^t X_l'(X_l β_0 + ε_l)/σ̂_l²
= β_0 − (Σ_{l=1}^t X_l'X_l/σ̂_l² + I)^{−1} β_0 + (Σ_{l=1}^t X_l'X_l/σ̂_l² + I)^{−1} Σ_{l=1}^t X_l'ε_l/σ̂_l².

Assume (i) σ̂_t² → σ_0² a.s. under the data-generating law, and (ii) that (1/nt) Σ_{l=1}^t X_l'X_l has bounded eigenvalues for every t. Under these assumptions, and using the weighted SLLN (Adler & Rosalsky, 1991), we obtain (Σ_{l=1}^t X_l'X_l/σ̂_l² + I)^{−1} Σ_{l=1}^t X_l'ε_l/σ̂_l² → 0 a.s. and (Σ_{l=1}^t X_l'X_l/σ̂_l² + I)^{−1} β_0 → 0 a.s. Then β̂_t → β_0 a.s. under the data-generating law. Similarly, assuming β̂_t → β_0 a.s., one can show σ̂_t² → σ_0² a.s. Simultaneous convergence of the model parameters is argued on the basis of established results on alternate optimization (Byrne, 2013).

C Accuracy Plots

Figure 7: Parameter accuracy plotted over time, as defined in (2), for the motivating examples of Section 3.1. Image 1: accuracy for representative regression coefficients β_j and σ² in the linear regression example. Images 2 and 3: accuracy for representative group means ζ_j, along with the hierarchical parameters µ, τ² and σ², for the one-way Anova model.

D Poisson mixed effect model

Consider the Poisson additive mixed effects model where count data y_t with predictors X_t are modeled as

y_t ∼ Poisson(exp(X_t β + Z u)),   β ∼ N(0, σ_β² I_k),
u ∼ N(0, diag(σ_1² I_{k_1}, ..., σ_r² I_{k_r})),   σ_s² ∼ IG(0.5, 1/b_s),   b_s ∼ IG(1/2, 1/a_s²),

with I_{k_1}, ..., I_{k_r} respectively denoting identity matrices of order k_1, ..., k_r, and s = 1, 2, ..., r. Here, obstacles to implementing the C-DF algorithm immediately arise, as the full conditional distributions have no closed form and do not admit surrogate quantities to propagate. The global approximation made by ADF becomes increasingly unreliable and overly restrictive in higher dimensions. A possibly less restrictive approach may seek an "optimal" approximation to the posterior subject to ignoring posterior dependence among different parameters. This is given by a variational approximation to the joint posterior over the model parameters β, u, σ_1², ..., σ_r², namely

π(β, u, σ_1², ..., σ_r² | D_t) ≈ q_1(β, u) q_2(σ_1², ..., σ_r²) q_3(b_1, ..., b_r).

Closed-form expressions for q_1, q_2, q_3 are derived and given by

q_1 = N(µ_{β,u}, Σ_{β,u}),
q_2 = ∏_{s=1}^r IG((k_s + 1)/2, µ_{1/b_s} + (||µ_{u_s}||² + tr(Σ_{u_s}))/2),
q_3 = ∏_{s=1}^r IG(1, µ_{1/σ_s²} + a_s^{−2}).

Above, µ_{1/σ_s²} = ∫ (1/σ_s²) q(σ_s²), with µ_{1/b_s} defined similarly, and µ_u, Σ_u represent the mean and covariance specific to u under the approximating density q_1. The approximate posterior is

completely specified in terms of µ_{β,u}, Σ_{β,u}, and µ_{1/b_s}, µ_{1/σ_s²} for s = 1, ..., r; hence the C-DF algorithm in this setup is applied directly to these parameters. In particular, one only needs point estimates of these parameters to fully specify the approximate posterior, hence sampling steps (see Algorithm 1) are replaced by fixed-point iteration. With a partition Θ_{G1} = {µ_{β,u}, Σ_{β,u}}, Θ_{G2} = {µ_{1/b_s}, µ_{1/σ_s²}, s = 1, ..., r} for the parameters of the variational distribution, the C-DF algorithm proceeds as follows:

(1) Observe data shard (y_t, X_t) at time t. If t = 1, initialize all parameters using draws from their respective priors. Otherwise set µ_{β,u}^t ← µ_{β,u}^{(t−1)}, Σ_{β,u}^t ← Σ_{β,u}^{(t−1)}, µ_{1/σ_s²}^t ← µ_{1/σ_s²}^{(t−1)}, µ_{1/b_s}^t ← µ_{1/b_s}^{(t−1)};

(2) Repeat the following a maximum of N_fx iterations (or until convergence):
(a) Update D ← blockdiag(σ_β^{−2} I_k, µ̂_{1/σ_1²}^t I_{k_1}, ..., µ̂_{1/σ_r²}^t I_{k_r});
(b) Update w_{β,u} ← exp(c_t' µ_{β,u} + ½ c_t' Σ_{β,u} c_t), with c_t = [X_t, Z];
(c) Set H_1 = C_{1,1}^{(t−1)} + c_t y_t, H_2 = C_{1,2}^{(t−1)} + c_t w_{β,u}, and H_3 = C_{1,3}^{(t−1)} + w_{β,u} c_t c_t';
(d) Update µ_{β,u} ← µ_{β,u} + Σ_{β,u}(H_1 − H_2 − D µ_{β,u}), and Σ_{β,u} ← (H_3 + D)^{−1}.

(3) Set µ_{β,u}^t, Σ_{β,u}^t at the final values after N_fx iterations;

(4) Also repeat the following step a maximum of N_fx iterations (or until convergence):
(a) Update µ_{1/b_s} ← (µ_{1/σ_s²} + a_s^{−2})^{−1}, µ_{1/σ_s²} ← (k_s + 1)/(2µ_{1/b_s} + ||µ_u||² + tr(Σ_u)), s = 1, ..., r.

(5) Set µ_{1/σ_s²}^t, µ_{1/b_s}^t at the final values after N_fx iterations;

(6) Update the surrogate quantities C_{1,1}^t ← C_{1,1}^{(t−1)} + c_t y_t; C_{1,2}^t ← C_{1,2}^{(t−1)} + c_t w_{β,u}^t; C_{1,3}^t ← C_{1,3}^{(t−1)} + w_{β,u}^t c_t c_t'.

With the arrival of new data, N_fx fixed-point iterations update the parameter estimates for the approximating distributions. Consistent with the definition of SCSS, an update to the parameters for Θ_{G1} under q_1 may involve estimates from the previous time point, in addition to estimates for the parameters Θ_{G2}, and vice versa. This online scheme has excellent empirical performance, both in terms of parametric inference and prediction (see Luts and Wand (2013)). This establishes the broad applicability of propagating surrogate quantities (SCSS) with the C-DF algorithm in a non-conjugate setting using variational approximation methods.
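For a single random-effect block s, the coupled updates in step (4)(a) amount to a scalar fixed-point iteration. A minimal sketch (our own; mu_norm2 and tr_Sigma are hypothetical stand-ins for ||µ_{u_s}||² and tr(Σ_{u_s}) computed under q_1):

```python
# Our sketch of the coupled fixed-point updates of step (4)(a) for one
# random-effect block s, iterating E[1/b_s] and E[1/sigma_s^2] together.
def fixed_point(k_s, a_s, mu_norm2, tr_Sigma, n_iter=50):
    mu_inv_sig2 = 1.0                   # initial guess for E[1/sigma_s^2]
    for _ in range(n_iter):
        mu_inv_b = 1.0 / (mu_inv_sig2 + a_s ** -2)             # E[1/b_s]
        mu_inv_sig2 = (k_s + 1) / (2 * mu_inv_b + mu_norm2 + tr_Sigma)
    return mu_inv_sig2

# e.g. fixed_point(3, 1.0, 2.0, 1.0) converges to 1.0 for these inputs
```

Because each update is a contraction near the solution for typical inputs, a handful of iterations per data shard suffices, which is what makes replacing the sampling steps with N_fx fixed-point sweeps cheap.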

References

James H Albert and Siddhartha Chib. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422):669–679, 1993.

Artin Armagan, David B Dunson, and Jaeyong Lee. Generalized double Pareto shrinkage. Statistica Sinica, 23(1):119, 2013.

M Sanjeev Arulampalam, Simon Maskell, Neil Gordon, and Tim Clapp. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 50(2):174–188, 2002.

Xavier Boyen and Daphne Koller. Tractable inference for complex stochastic processes. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pages 33–42. Morgan Kaufmann Publishers Inc., 1998.

Michael Braun and Jon McAuliffe. Variational inference for large-scale models of discrete choice. Journal of the American Statistical Association, 105(489):324–335, 2010.

Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C Wilson, and Michael I Jordan. Streaming variational Bayes. In Advances in Neural Information Processing Systems, pages 1727–1735, 2013.

Charles L Byrne. Alternating minimization as sequential unconstrained minimization: a survey. Journal of Optimization Theory and Applications, 156(3):554–566, 2013.

Carlos Carvalho, Michael S Johannes, Hedibert F Lopes, and Nick Polson. Particle learning and smoothing. Statistical Science, 25(1):88–106, 2010.

Nicolas Chopin. A sequential particle filter method for static models. Biometrika, 89(3):539–552, 2002.

Arnaud Doucet, Simon Godsill, and Christophe Andrieu. On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10(3):197–208, 2000.

Arnaud Doucet, Nando De Freitas, and Neil Gordon. An introduction to sequential Monte Carlo methods. Springer, 2001.

Matthew Hoffman, Francis R Bach, and David M Blei. Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, pages 856–864, 2010.

Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.

Changwei Hu, Piyush Rai, and Lawrence Carin. Zero-truncated Poisson tensor factorization for massive binary tensors. a.

Changwei Hu, Piyush Rai, Changyou Chen, Matthew Harding, and Lawrence Carin. Scalable Bayesian non-negative tensor factorization for massive count data. b.

T Jaakkola and Michael I Jordan. A variational approach to Bayesian logistic regression models and their extensions. Citeseer, 1997.

M. Johannes, N. G. Polson, and S. M. Yae. Particle learning in nonlinear models using slice variables, 2010.

Anoop Korattikara, Yutian Chen, and Max Welling. Austerity in MCMC land: Cutting the Metropolis-Hastings budget. arXiv preprint arXiv:1304.5299, 2013.

Steffen L Lauritzen. Propagation of probabilities, means, and variances in mixed graphical association models. Journal of the American Statistical Association, 87(420):1098–1108, 1992.

H. F. Lopes, C. M. Carvalho, M. S. Johannes, and N. Polson. Particle learning for sequential Bayesian computation. Bayesian Statistics 9, 9:317, 2011.

Jan Luts and Matt P Wand. Variational inference for count response semiparametric regression. arXiv preprint arXiv:1309.4199, 2013.

Dougal Maclaurin and Ryan P Adams. Firefly Monte Carlo: Exact MCMC with subsets of data. arXiv preprint arXiv:1403.5693, 2014.

Tyler H McCormick, Adrian E Raftery, David Madigan, and Randall S Burd. Dynamic logistic regression and dynamic model averaging for binary classification. Biometrics, 68(1):23–30, 2012.

Alan Medlar, Dorota Glowacka, Horia Stanescu, Kevin Bryson, and Robert Kleta. SwiftLink: parallel MCMC linkage analysis using multicore CPU and GPU. Bioinformatics, 29(4):413–419, 2013.

Thomas P Minka. Expectation propagation for approximate Bayesian inference. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pages 362–369. Morgan Kaufmann Publishers Inc., 2001.

Thomas P Minka, Rongjing Xiang, and Yuan Alan Qi. Virtual vector machine for Bayesian online classification. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 411–418. AUAI Press, 2009.

Stanislav Minsker, Sanvesh Srivastava, Lizhen Lin, and David B Dunson. Robust and scalable Bayes via a median of subset posterior measures. arXiv preprint arXiv:1403.2660, 2014.

Manfred Opper. A Bayesian approach to on-line learning. 1998.

Trevor Park and George Casella. The Bayesian lasso. Journal of the American Statistical Association, 103(482):681–686, 2008.

Matias Quiroz, Mattias Villani, and Robert Kohn. Speeding up MCMC by efficient data subsampling. arXiv preprint arXiv:1404.4178, 2014.

Steven L Scott, Alexander W Blocker, Fernando V Bonassi, Hugh A Chipman, Edward I George, and Robert E McCulloch. Bayes and big data: The consensus Monte Carlo algorithm. 2013.

Linda SL Tan and David J Nott. A stochastic variational framework for fitting and diagnosing generalized linear mixed models. Bayesian Analysis, 9(4):963–1004, 2014.

Yee Whye Teh, Alexandre Thiéry, and Sebastian Vollmer. Consistency and fluctuations for stochastic gradient Langevin dynamics. arXiv preprint arXiv:1409.0578, 2014.

Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.

Mike West and Jeff Harrison. Bayesian Forecasting and Dynamic Models, volume 18. 2007.

Yun Yang and David B Dunson. Sequential Markov chain Monte Carlo. arXiv preprint arXiv:1308.3861, 2013.
