Bayesian Conditional Density Filtering - Semantic Scholar

2 downloads 0 Views 699KB Size Report
Oct 16, 2014 - †Shaan Qamar, Ph.D. student, Department of Statistical Science, 022 Old ... ‡David B. Dunson, Arts & Sciences Distinguished Professor, ...
Bayesian Conditional Density Filtering Rajarshi Guhaniyogi∗, Shaan Qamar† and David B. Dunson‡ October 16, 2014

Abstract We propose a Conditional Density Filtering (C-DF) algorithm for efficient online Bayesian inference. C-DF adapts MCMC sampling to the online setting, sampling from approximations to conditional posterior distributions obtained by propagating surrogate conditional sufficient statistics (a function of data and parameter estimates) as new data arrive. These quantities eliminate the need to store or process the entire dataset simultaneously and offer a number of desirable features. Often, these include a reduction in memory requirements and runtime and improved mixing, along with stateof-the-art parameter inference and prediction. These improvements are demonstrated through several illustrative examples including an application to high dimensional compressed regression. Finally, we show that C-DF samples converge to the target posterior distribution asymptotically as sampling proceeds and more data arrives.

Keywords: Approximate MCMC; Big data; Density filtering; Dimension reduction; Streaming data; Sequential inference; Sequential Monte Carlo; Time series. ∗

Rajarshi Guhaniyogi, Assistant Professor, Department of Applied Math & Stat, SOE2, UC Santa Cruz,

1156 High Street, Santa Cruz, CA 95064 (E-mail: [email protected]). † Shaan Qamar, Ph.D. student, Department of Statistical Science, 022 Old Chemistry Building, Box 90251, Duke University, Durham, NC 27708-0251 (E-mail: [email protected]). ‡ David B. Dunson, Arts & Sciences Distinguished Professor, Department of Statistical Science, 218 Old Chemistry Building, Box 90251, Duke University, Durham, NC 27708-0251 (E-mail: [email protected])

1

1

Introduction

Modern data are increasingly high dimensional, both in the number of observations n and the number of predictors measured p. Statistical methods increasingly make use of lowdimensional structure assumptions to combat the curse of dimensionality (e.g., sparsity assumptions). When the number of observations is truly massive, however, data processing and computational complexity bottlenecks render many learning algorithms infeasible. Bayesian methods provide a natural probabilistic characterization of uncertainty in the parameters and in predictions, but there is a lack of scalable Bayesian inference algorithms having guarantees on accuracy. Here, sampling algorithms remain the gold standard, with the overwhelming emphasis being on Markov chain Monte Carlo (MCMC). State of the art performance is obtained in small to moderate sized problems, but MCMC fails to scale as the data and number of parameters grow (often due to the need to update many parameters with expensive likelihood evaluations). MCMC can be extended to accommodate data collected over time by adapting the transition kernel Kt , and drawing a few samples at time t, so that samples converge in distribution to the joint posterior distribution πt in time (Yang and Dunson, 2013). Sequential MCMC (SMCMC) requires the full data to be stored, leading to storage and processing bottlenecks as more data arrive in time. When available, propagating sufficient statistics can help, although models with large parameters spaces often have sufficient statistics which are also high dimensional (e.g., linear regression). In addition, SMCMC often faces poor mixing requiring longer runs with additional storage and computation. A number of alternative strategies have been proposed for scaling MCMC to large datasets. One possibility is to parallelize computation within each MCMC iteration using GPUs or multicore architectures to free bottlenecks in updating unknowns specific to each sample and in calculating likelihoods (Medlar et al., 2013). Another possibility is to rely on Hamiltonian Monte Carlo with stochastic gradient methods used to approximate gradients with subsamples of the data (Welling and Teh, 2011). Ahn, Korattikara and Welling (2012) proposed a variation on this 2

theme. Korattikara, Chen and Welling (2013) instead use sequential hypothesis testing to choose a subsample to use in approximating the acceptance ratio in Metropolis-Hastings. In simple conjugate models, such as Gaussian state-space models, efficient updating equations can be obtained using methods related to the Kalman filter. Assumed density filtering (ADF) was proposed (Lauritzen, 1992; Boyen et al., 1998; Opper et al., 1999) to extend this computational tractability to broader classes of models. ADF approximates the posterior distribution with a simple conjugate family, leading to approximate online posterior tracking. The predominant concern with this approach is the propagation of errors with each additional approximation to the posterior in time. Expectation-propagation (EP) (Minka et al.,2009; Minka, 2013) improves on ADF through additional iterative refinements, but the approximation is limited to the class of assumed densities and has no convergence guarantees. Moreover in fairly standard settings, arbitrary approximation to the posterior through an assumed density severely underestimates parametric and predictive uncertainties. A parallel literature on online variational approximations (Wang et al., 2011; Hoffman et al., 2010) focus primarily on improving batch inferences by feeding in data sequentially. Broderick et al. (2013) recently proposed a streaming variational Bayes algorithm, while Hoffman et al. (2012) combined stochastic approximation with variational inference for better scaling. However, variational methods rely on a factorized form of the posterior that typically fails to capture dependence in the joint posterior and severely underestimates uncertainty. Recent attempts to design careful online variational approximations (Luts et al., 2014), though successful in accurately estimating marginal densities, are limited to specific models and no theoretical guarantees on accuracy are established except for stylized cases. Sequential Monte Carlo (SMC) (Chopin, 2002; Arulampalam, 2002; Lopes et al., 2010) is a popular technique for online Bayesian inference that relies on propagating and resampling samples in time. Unfortunately, it is difficult to scale SMC to problems involving large n and p due to the need to employ very large numbers of particles to obtain adequate approximations and prevent particle degeneracy. The latter is addressed through rejuvenation steps using

3

all the data (or sufficient statistics), which becomes expensive in an online setting. One could potentially rejuvenate particles only at earlier time points, but this may not protect against degeneracy for models involving many parameters. More recent particle learning (PL) algorithms (Carvalho et al., 2010) reduce degeneracy for the dynamic linear model, with satisfactory density estimates for parameters – but they too require propagating a large number of particles, which significantly adds to the per-iteration computational complexity. We propose a new class of Conditional Density Filtering (C-DF) algorithms that extend MCMC sampling to streaming data. Sampling in C-DF proceeds by drawing from conditional posterior distributions, but instead of conditioning on conditional sufficient statistics (CSS) (Johannes et al., 2010; Carvalho et al., 2010), C-DF conditions on surrogate conditional sufficient statistics (SCSS) using sequential point estimates for parameters along with the data observed. This eliminates the need to store the data in time (process the entire dataset at once), and leads to an approximation of the conditional distributions that produce samples from the correct target posterior asymptotically. The C-DF algorithm is demonstrated to be highly versatile and efficient across a variety of settings, with SCSS enabling online sampling of parameters with dramatic reductions in the memory and per-iteration computational requirements. Section 2 introduces the C-DF algorithm in generality, along with definitions, assumptions, and a description of how to identify updating quantities of interest. Section 3 demonstrates approximate online inference using C-DF in the context of several illustrating examples. Section 3.1 applies C-DF to linear regression and the one-way Anova model. More sophisticated model settings are considered in Sections 3.2, namely extensions of the C-DF algorithm to a dynamic linear model and binary regression using the probit model. Here, we investigate the performance of C-DF in settings with an increasing parameter space. The CDF algorithm is also applied to a Poisson mixed effects model to update the parameters for a variational approximation to the posterior distribution. We discuss various computational, storage and mixing advantages of our method over state-of-the-art competitors in each.

4

Section 4 presents a detailed implementation of the C-DF algorithm for high dimensional compressed regression. We report state-of-the-art inferential and predictive performance across extensive simulation experiments as well as for real data studies in Section 5. Section 6 presents a finite-sample error bound for approximate MCMC kernels and establishes the asymptotic convergence guarantee for the proposed C-DF algorithm. Section 7 concludes with a discussion of extensions and future work. Proofs and additional figures pertaining to Section 3 appear in Appendix A and B, respectively.

2

Conditional density filtering

Define Θ = (θ 1 , θ 2 , . . . , θ k ) as the collection of unknown parameters in probability model P (Y |Θ) and Y ∈ Y, with θ j ∈ Ψj , and Ψj denoting an arbitrary sample space (e.g., a subset of Rp ). Data D t denotes the data observed at time t, while D (t) = {D s , s = 1, . . . , t} defines the collection of data observed through time t. Below, θ −j = Θ\θ j and let θ −j = (θ −j,1 , θ −j,2 ) for each j = 1, . . . , k. Either of θ −j,1 or θ −j,2 is allowed to be a null set.

2.1

Surrogate conditional sufficient statistics (t)

Definition: S j is defined to be a conditional sufficient statistic (CSS) for θ j at time t L

(t)

if θ j ⊥ D (t) | θ −j,1 , S j . Suppose θ j | θ −j , D t = θ j | θ −j,1 , h(D t , θ −j,2 ). Then it is easy to (t)

show that for known functions f, h, S j = f (h(D 1 , θ −j,2 ), . . . , h(D t , θ −j,2 )). This satisfies L

(t)

(t)

θ j | θ −j , D (t) = θ j | θ −j,1 , S j , or equivalently θ j ⊥ D (t) | θ −j,1 , S j . (t)

Because S j depends explicitly on θ −j,2 , its value changes whenever new samples are drawn for this collection of parameters. This necessitates storing entire data D (t) or available sufficient statistics. The following issues inhibit efficient online sampling: (1) sufficient statistics, when they exist, often scale poorly with the size of the parameter-space. This often creates a significant storage overhead when multiple copies of these data structures need to be created on a computer network (e.g., for ensemble gradient-descent based procedures); and (2) in 5

(t)

the context of sampling, updating S j , j = 1, . . . , k, for every iteration at time t causes an immense per-iteration computational bottleneck. To address these challenges, we propose surrogate conditional sufficient statistics (SCSS) as a means to approximate full conditional distribution θ j |θ −j , D (t) at time t. L

Definition: Suppose θ j | θ −j , D t = θ j | θ −j,1 , h(D t , θ −j,2 ). For known functions g, h, define (t)

(t−1)

C j = g(C j

bt )) as the surrogate conditional sufficient statistic for θ j , with θ bt , h(D t , θ −j,2 −j,2 (t)

being a consistent estimator of θ −j,2 at time t. Then θ j |θ −j,1 , C j is the C-DF approximation to θ j |θ −j , D (t) . (t)

If the full conditional for θ j admits a surrogate quantity C j , approximate sampling (t) (t) ˜j ∼ π via C-DF proceeds by drawing θ ˜j (·|θ −j,1 , C j ). Crucially, C j depends only on point

estimates for θ −j,2 which remain fixed while drawing samples from the approximate full conditional distribution. This avoids having to update potentially expensive functionals h(D (t) , θ −j,2 ) (i.e., CSS) at every MCMC iteration. If the approximating full conditionals are conjugate, then sampling proceeds in a Gibbs-like fashion. Otherwise, assuming a proposal (t)

distribution pj (·) for θ j , draws from π ˜j (θ j |θ −j,1 , C j ) may be obtained via a Metropolis step with acceptance probability (t)

0

αj (θ |θ) =

π ˜j (θ 0 |θ −j,1 , C j )/pj (θ 0 ) (t)

.

(1)

π ˜j (θ|θ −j,1 , C j )/pj (θ)

Section 3.1 provides examples of the former, and Section 3.2.1 of the latter. In both cases, draws using the C-DF approximation to the conditional distributions have the correct asymptotic stationary distribution (see Section 6). Another scenario arises when a non-conjugate full conditional fails to admit SCSS (and therefore also SS or CSS) as in the case of the Poisson regression example in Section 3.2.3. A variational approximation to the joint posterior admits SCSS, and we demonstrate using the C-DF algorithm when a full conditional (t)

πj (θ j |−) is replaced by an approximating kernel qj (θ j |θ −j,1 , C j ) at time t. Propagating SCSS associated with the latter allows us to obtain approximate online inference using vari6

ation approximations to the posterior distribution. Here, the limiting distribution for draws (t)

sampled from qj (θ j |θ −j,1 , C j ) are not guaranteed to have the correct asymptotic stationary distribution.

2.2

The C-DF algorithm

Define a fixed grouping of parameters {θ j , j = 1, . . . , p} into sets {Gl , l = 1, . . . , k} satisfying P Gl ∩ Gl0 = ∅, l 6= l0 , and l I(j ∈ Gl ) = 1, j = 1, . . . , k. The C-DF algorithm alternates between obtaining consistent estimates for Θl = {θ j 0 , j 0 6∈ Gl } and sampling from approxi(t) b l ), Θl = Θ\ΘG }. mate full conditionals for a group of parameters {θ lj : j ∈ Gl , C j = fj (Θ l

By default, parameter estimates are obtained by averaging over S draws from the approximating full-conditionals (in cases where closed-form expression for the conditional mean, say, is available then this is used instead). This leads to efficient updating equations for (t−1)

(t)

C j based on C j

and incoming data D t , resulting in scalable parameter inference with

sampling-based approximations to the posterior distribution. Conditional independence assumptions for a model often suggest natural parameter partitions, though alternate partitions may be more suitable for specific tasks. For streaming data, these sets are identified to maximize computational and storage gains. In some applications, one may be interested in conditional inference, updating only estimates for “nuisance” parameter sets. Examples of various partitioning schemes are presented in the context of several illustrating examples in Section 3. A general outline for the C-DF algorithm is given in Algorithm 1.

3

Illustrating examples using C-DF

This section demonstrates approximate online inference using C-DF in the context of several illustrating examples. Section 3.1 applies C-DF to linear regression and the one-way Anova model. The choice of parameter partitions and detailed steps of the C-DF algorithm for

7

Algorithm 1 A sketch of the C-DF algorithm for approximate MCMC (t−1)

Input: (1) Data shard Dt ; (2) IndexPsets {Gl } for parameter partition; and (3) C j every j = 1, . . . , k. Require that l I(j ∈ Gl ) = 1 and Gl ∩ Gl0 = ∅, l 6= l0 . (t) ˜ (1) , . . . , θ ˜ (S) for every j = 1, . . . , k. Output: SCSS C , and approximate samples θ j

j

for

j

function CDF.SAMPLE(D t , C (t−1) , G· ) · for l = 1 : L do for s = 1 : S do for j ∈ Gl do (j) Define ΘGl = {θ j , j ∈ Gl } and ΘGl = ΘGl \θ j (t) (j) ˜ (s) ∼ qj (·|Θ(j) , C (t) )] ˜ (s) ∼ π . [or θ 6: Sample θ ˜j (·|ΘGl , C j ) j j j Gl 7: end for 8: end for

1: 2: 3: 4: 5:

9: 10: 11: 12: 13: 14:

bj ← mean(θ ˜ (1:S) ), For each j ∈ Gl : Set θ j (1) (S) ˜ ,...,θ ˜ as approximate posterior samples For each j ∈ Gl : Store θ j j (t) (t−1) 0 b m )), Θm = Θ\ΘGm For each m 6= l and j ∈ Gm : Update C j 0 ← g(C j 0 , h(D t , Θ end for end function

these simpler models serve to build intuition. Here, the conjugate conditional distributions are parameterized in terms of low-dimensional (or univariate) sufficient statistics, hence there is no significant computational benefit to using C-DF for online inference. Instead, our main focus is assessing the accuracy of the C-DF approximation in these simple examples. More sophisticated model settings are considered in Sections 3.2, namely extensions of the C-DF algorithm to a dynamic linear model and binary regression using the probit model. Here, we investigate the performance of C-DF in settings with an increasing parameter space (e.g., data augmentation using latent variables for the probit regression). The C-DF algorithm is also used in the context of a Poisson mixed effects model to update the parameters for a variational approximation to the posterior distribution. In particular, updates to the variational parameters admit surrogate statistics which are propagated to yield approximate online inference. The examples in Section 3.2 do not satisfy conditions for which we derive asymptotic guarantees for samples drawn from the set of approximating (C-DF) full conditionals (see Section 6). In particular, theoretical accuracy of the C-DF approximation is ensured only

8

for finite parameter settings where conditional distributions admit surrogate statistics and may be sampled from directly or using a Metropolis step (e.g., examples in Sections 3.1, 4.1). Where applicable, the similarity between the full conditional distribution using sequential MCMC (SMCMC), πt (θ j |−), and approximate C-DF full conditional, π ˜t (θ j |−), is measured as 1 Accuracy = 1 − 2

Z |πt (θ j ) − π ˜t (θ j )| dθ j .

(2)

As defined, accuracy ranges between 0 and 1, with larger values implying better agreement between the two distributions. Accuracy plots are all given in Appendix B. Fixing notation, data D t denotes the data observed at time t, while D (t) = {D s , s = 1, . . . , t} defines the collection of data observed through time t. Where appropriate, D t = (X t , y t ) with X t = (xt1 , . . . , xtn )0 and y t = (y1t , . . . , ynt )0 . Shards of a fixed size arrive sequentially over a T = 1000 time horizon, and 500 MCMC iterations are drawn to approximate the corresponding posterior distribution πt at every time point t = 1, . . . , T . We report mean squared estimation accuracy for model parameters of interest, in addition to mean squared prediction error (MSPE), length and coverage of 95% predictive intervals where applicable. Results are averaged over 10 independent replications with associated standard errors appearing as subscripts in Tables. Plots showing kernel density estimates for marginal posterior densities for representative model parameters at various time points are also provided.

3.1 3.1.1

Motivating examples Linear regression

Consider the simple linear regression model for a p-dimensional predictor vector,

y t = X t β + ,

 ∼ N (0, σ 2 I n ).

9

(3)

A standard Bayesian analysis proceeds with conjugate priors β ∼ N (0, I p ), σ 2 ∼ IG(a, b), and having Gibbs full conditionals given by,

σ 2 |β, D (t) ∼ IG(a0 , b0 ),

with

β|σ 2 , D (t) ∼ N (µt , Σt ),

with

a0 = a +

tn , 2

+ β 0 S XX β StY Y − 2β 0 S XY t t 2 XX S S XY Σt = ( t 2 + I p )−1 , µt = Σt t 2 . σ σ b0 = b +

0

t t These distributions are specified in terms of sufficient statistics S XX = S XX t t−1 + X X , 0

0

Y YY + y t y t , enabling online inference using SMCMC. A S XY = S Yt−1 + X t y t , and StY Y = St−1 t

C-DF implementation begins by defining a partition over the model parameters, and here we set ΘG1 = {β} and ΘG2 = {σ 2 }. Approximate sampling via C-DF then proceeds as: (1) Observe data D t at time t. If t = 1 initialize all parameters at some default values (e.g., 2 β = 0, σ 2 = 1 assuming a centered and scaled response); otherwise set σ ˆt2 ← σ ˆt−1 ; 0

0

t t t t σt2 ; (2) Update surrogate CSS for β, C t1,1 = C t−1 σt2 , C t1,2 = C t−1 1,2 + X y /ˆ 1,1 + X X /ˆ

(3) Define C t1 = {C t1,1 , C t1,2 } and draw S samples from the approximate Gibbs full con(t)

ditional β|ˆ σ2, C 1

ˆ ← ˆ t ) (Σ ˆ t = (C t + I p )−1 , µ ˆ t C t ), and set β ˆ t, Σ ˆt = Σ ∼ N (µ t 1,1 1,2

mean(β (1:S) ) (or use the analytical expression for the posterior mean); t−1 t ˆ 0 t0 t ˆ 0 t0 t ˆ (4) Update surrogate CSS for σ 2 , C t2,1 = C t−1 2,1 + β t X X β t , C 2,2 = C 2,2 + β t X y ;

(5) Define C t2 = {C t2,1 , C t2,2 } and draw S samples from the approximate Gibbs full condi ˆ t , C (t) ∼ IG a0 , b + (S Y Y − 2C t + C t )/2 , and set σ ˆt2 ← mean(σ 2(1:S) ) (or tional σ 2 |β 2 t 2,2 2,1 use the analytical expression for the posterior mean). We compared the hold-out predictive performance and inference on model parameters β, σ 2 using SMCMC and C-DF. Data is generated using predictors drawn from U (0, 1), with true parameters β 0 = (1, 0.5, 0.25, −1, 0.75) and σ02 = 25. For both methods, density estimates for parameters are shown at t = 200, 500, with accuracy comparisons validating that approximate draws using the C-DF algorithm converge to the true stationary distribution 10

in time. Excellent parameter MSE and coverage using the C-DF algorithm is reported in Table 1. Avg. coverage β

Length

Time (sec)

1.0 1.0

0.600.01 0.600.01

954.12 119.44.64

C-DF SMCMC

P MSE = pj=1 (βˆt − β0 )2 /p t = 200 t = 400 t = 500 0.270.001 0.150.001 0.060.001 0.120.001 0.080.001 0.040.001

0

1

2

Density

0.4

0.6

0.8

CDF, t: 200 Gibbs, t: 200

0.0

0.2

0.5 0.0 −1

CDF, t: 500 Gibbs, t: 500

1.0

CDF, t: 200 Gibbs, t: 200

1.0

Density

1.0 0.0

0.5

Density

CDF, t: 500 Gibbs, t: 500

2.0

CDF, t: 200 Gibbs, t: 200

1.5

CDF, t: 500 Gibbs, t: 500

1.5

2.0

Table 1: Inferential performance for C-DF and SMCMC for parameters of interest. Coverage and length are based on 95% credible intervals and is averaged over all the βj ’s (j = 1, .., 5) and all time points and over 10 independent replications. We report the time taken to produce 500 MCMC samples with the arrival of each data shard. MSE along with associated standard errors are reported at different time points.

−3

−2

−1

β1

0

1

22

β4

23

24

25

26

27

28

σ2

Figure 1: Kernel density estimates for posterior draws using SMCMC and the C-DF algorithm at t = 200, 500. Shown from left to right are plots of model parameters β1 , β4 , and σ 2 , respectively.

3.1.2

One-way Anova

Consider the one-way Anova model with a fixed number treatment groups, k,

yij = ζi + ij ,

iid

ζi ∼ N (µ, τ 2 ),

ij ∼ N (0, σ 2 ),

(4)

with π(µ) ∝ 1, τ 2 ∼ IG(a, b), and σ 2 ∼ IG(α, β). Sufficient statistics can be propagated yielding an efficient SMCMC sampler. Having observed a new data shard at time t, groupP 2(t) 2(t−1) specific sufficient statistics Sit are updated as Sit = Sit−1 + nj=1 yijt and Si = Si + 11

Pk

i=1

||y ti ||2 , i = 1, . . . , k. At time t, inference proceeds with draws from the following Gibbs

full conditionals: Pk 2(t)  − 2ζi Sit + ntζi2 )  τ 2σ2  nkt 2 i=1 (Si ζi |σ, µ, τ, y ∼ N , , σ |ζ, y ∼ IG α + , β + + σ 2 ntτ 2 + σ 2 2 2 Pk  Pk ζ τ 2   2 (ζ − µ) k i i=1 i µ|ζ, τ ∼ N , , τ 2 |ζ, µ ∼ IG a + , b + i=1 . k k 2 2 (5)  τ 2S t + σ2µ i ntτ 2

Using the hierarchical structure of model (4), C-DF partitions the parameters as ΘG1 = {ζ, σ 2 }, and ΘG2 = {µ, τ 2 }. The modified full conditionals are defined in terms of surrogate quantities, as well as the previously defined group-specific sufficient statistics. Approximate inference using C-DF proceeds as follows: (1) Observe data y t1 , . . . , y tk at time t. If t = 1, set ζi = 0, σ = s¯(vec(y 1 , . . . , y k )), µ = 0, τ = 1. Otherwise, set µ ˆt ← µ ˆt−1 , τˆt ← τˆt−1 ; t−1 t (2) Update surrogate statistic C t1 component-wise as C1i ← C1i + τˆt2 Sit , i = 1, . . . , k;

(3) For s = 1, . . . , S: (a) draw ζi |σ, µ ˆt , τˆt , C t1 , i = 1, . . . , k; (b) draw σ 2 |ζ, y. The C-DF full conditional for group-specific means are given by

ζi |σ, µ ˆt , τˆt , C t1

 t τˆt2 σ 2 C1i + σ2µ ˆt ∼N , , i = 1, . . . , k; ntˆ τt2 + σ 2 ntˆ τt2 + σ 2 

(6)

(1:S) (4) Set ζˆi ← mean(ζi ), i = 1, . . . , k, and σ ˆ 2 ← σ 2(1:S) ; using these values, update surrot−1 t ˆ 2 , C t ← C t−1 + Pk ζˆi ). Let C t = (C t , C t ); + ||ζ|| gate statistic C2,1 ← C2,1 2,2 2,2 2 2,1 2,2 i=1 t (5) For s = 1, . . . , S: sample from approximate full conditionals µ|τ, C t2 ∼ N C2,2 /k, τ 2 /k  t t and τ 2 |µ, C t2 ∼ IG a + k/2, b + (C2,1 − 2µC2,2 + kµ2 )/2 defined in (5);



(6) Finally, set µ ˆt ← mean(µ(1:S) ) and τˆt ← mean(τ (1:S) ). ADF is added as an additional competitor for this example. Here, the joint posterior over θ = (ζ1 , . . . , ζk , log(σ 2 )) is approximated in time assuming a multivariate-normal density 12

π ˜t (θ) ∼ N (µt , Σt ). To begin, we integrate over hyper-parameters µ, τ 2 to obtain marginal prior π(ζ, log(σ 2 )). For t > 1, the approximate posterior at time (t − 1) becomes the prior at t, and parameters µt , Σt are updated in the sense of McCormick et. al (2012). In particular, Σt = (−∇2 `(µt−1 ))−1 and µt = µt−1 + Σt−1 ∇`(µt−1 ), with `(θ) = log(p(y|θ)π(θ)). iid

Simulated data for this example is generated according to model (4) with parameters ζi ∼

N (4, τ 2 = 0.01) and σ 2 = 100. Kernel density estimates for posterior draws using SMCMC and the C-DF algorithm are very similar as depicted in Figure 2 at t = 200, 500. The C-DF approximation results in steadily increasing accuracy along with good parameter inference as shown in Table 2. ADF is extremely sensitive to good calibration for σ 2 , without which estimates for ζ are far from the truth and fail to converge by t = 500. Optimal performance is obtained by using the first data shard and performing the ADF approximation at t = 1 until convergence. These parameters estimates are subsequently propagated as described earlier. ADF produces accurate parameter point-estimates, however, severely underestimates uncertainty (see Table 2). This occurs in-part because ADF makes a global approximation on the joint posterior (restricting the propagation of uncertainty in a very specific way), whereas C-DF makes local approximations on the set of full conditionals. P Avg. coverage ζ Length Time (sec) MSE = kl=1 (ζˆt − ζ0 )2 /k t = 200 t = 400 t = 500 C-DF 0.870.09 0.530.005 85.05.25 0.770.22 0.290.11 0.260.15 SMCMC 0.920.10 0.520.002 119.48.43 0.410.05 0.220.14 0.200.16 0.360.23 0.800.02 0.880.01 0.420.18 0.280.12 0.270.11 ADF Table 2: Inferential performance for C-DF, SMCMC, and ADF for parameter ζ. Coverage is based on 95% credible intervals averaged over all time points, all ζ and over 10 independent replications. We report the time taken to produce 500 MCMC samples with the arrival of each data shard. MSE along with associated standard errors are reported at different time points.

3.2

Advanced data models

The C-DF algorithm introduced in Section 2 modifies the set of full conditional distributions to depend on propagated surrogate quantities which yields an approximate kernel. This 13

3

4

5

6

2.5 1.5

Density

1.0 0.5 0.0 2

3

4

ζ1

5

6

2

3

ζ5

4

5

6

ζ10

Gibbs, t: 500

CDF, t: 500

Gibbs, t: 500

0.6

Density

0.4

3 0

0.0

1

0.2

2

Density

4

0.8

5

6

1.0

CDF, t: 500

CDF, t: 200 Gibbs, t: 200

2.0

2.5 2.0 1.0 0.5 0.0 2

CDF, t: 500 Gibbs, t: 500

3.0

CDF, t: 200 Gibbs, t: 200

1.5

Density

2.0 1.5 0.0

0.5

1.0

Density

CDF, t: 500 Gibbs, t: 500

3.0

CDF, t: 200 Gibbs, t: 200

2.5

3.0

CDF, t: 500 Gibbs, t: 500

0.0

0.2

0.4

0.6

0.8

1.0

96

98

100

102

104

σ2

τ2

Figure 2: Row #1 (left to right): Kernel density estimates for posterior draws of ζ1 , ζ5 , ζ10 using SMCMC and the C-DF algorithm at t = 200, 500; Row #2 (left to right): Kernel density estimates for model parameters τ 2 , and σ 2 at t = 500. enables efficient online MCMC, with guarantees on correctness of approximate C-DF samples as data accrue over time (see Section 6). In settings where the conditional posterior is (a) not in closed form, or (b) corresponds to a model with an increasing number of parameters, theoretical consistency properties of parameters often do not hold. This section introduces extensions of the C-DF algorithm to three models of this type. We present an application to binary regression in Section 3.2.2 where the conditional posterior distribution over the regression coefficients does not assume a closed form. For logistic regression, a variational approximation to the posterior introduces additional variational parameters for each observation to obtain a lower-bound for the likelihood (Jaakkola & Jordan, 2000). Additional non-conjugate message passing approximations have been considered in Braun and McAliffe (2010), and Tan (2014). One may also resort to ADF using

14

a Laplace approximation to the posterior over regression coefficients and propagating associated mean and covariance parameters in time. These techniques, however, are known to severely underestimate parameter uncertainty. Fortunately, data augmentation via the probit model enables conditionally conjugate sampling. Section 3.2.1 presents an application to a dynamic linear model for a continuous response centered on a first-order auto-regressive process. Here, the Kalman filtering provides an online sampling strategy. We discuss how to use the C-DF algorithm to overcome the computational and storage bottlenecks arising in both these examples. Additionally, Sections 3.2.1 and 3.2.2 provide excellent examples of C-DF applied to the conditionally conjugate and non-conjugate settings in increasing parameter spaces respectively. Finally, Section 3.2.3 considers an application to Poisson regression where no augmentation scheme is available and the full conditional distributions do not admit surrogate quantities. Here, we demonstrate how the online strategy of Luts et al. (2014) is a special case of the C-DF algorithm applied to a variational approximation of the posterior distribution.

3.2.1

Dynamic Linear Model

We consider the first-order dynamic linear model (West & Harrison, 1997), namely

yt+1 ∼ N (θt+1 , σ 2 ), θt+1 ∼ N (φθt , τ 2 ).

Here, noisy observations yt , t ≥ 1 are observed and modeled as arising from an underlying stationary AR(1) process with lag parameter φ, |φ| < 1. Default priors σ 2 ∼ IG(a0 , b0 ), τ 2 ∼ IG(c0 , d0 ), φ ∼ U (−1, 1) are chosen to complete the hierarchical model, and assume θ0 ∼ N (0, h0 ), the stationary distribution for the latent process. Forecasting future trajectories of the response is a common goal in this setting, hence good characterization of the distribution

15

over θt , φ, τ 2 is of interest. Here, the full conditional distributions are given by  σ −2 yt+1 + τ −2 φθt 1 θt+1 |yt+1 , θt , τ , σ , φ ∼ N , , −2 σ −2 + τ −2 σ + τ −2  −2  σ yt+1 + τ −2 φ(θs−1 + θs+1 ) 1 2 2 θs |ys , θs−1 , θs+1 , τ , σ , φ ∼ N , 1 ≤ s ≤ t, , −2 σ −2 + (φ2 + 1)τ −2 σ + (φ2 + 1)τ −2 2

2

σ 2 |θ0:t , y t ∼ IG(at , bt ),



at+1 = at + 1/2,

bt+1 = bt + (yt+1 − θt+1 )2 /2,

τ 2 |θ0:t , φ ∼ IG(ct , dt ), ct+1 = ct + 1/2, dt+1 = dt + (θt+1 − φθt )2 /2, ! Pt 2 θ θ τ l l−1 φ|θ0:t , τ 2 ∝ π(φ) N , Pt−1 2 I(|φ| < 1). Ps=1 t 2 θ s=1 l−1 s=1 θl Forward-filtering and backward-sampling using Kalman filtering updates for the latent process enables online posterior computation. However, this strategy is computational intractable as the time horizon grows even for univariate time-series. While propagation of uncertainty for the latent process is of key importance, retrospective sampling of {θs , s < t} is often less meaningful. To extend the use of the C-DF algorithm in this growing parameter setting, we propose sampling of the latent process over a moving time-window of size b. We exploit the dependence structure of the latent process by borrowing strength across several observations to better estimate θt . This eliminates the propagation of errors that might otherwise result from poor estimation at earlier time points. As the resampling window shifts forward, trailing latent parameters are fixed at their most recent point estimates. Parameter partitions for the C-DF algorithm must therefore also evolve dynamically, and are defined at time t as ΘG1 = {θt , . . . , θt−b+1 , τ 2 , σ 2 , φ}, ΘG2 = {θ1 , . . . , θt−b }, t > b. Unlike previous examples, conditional distribution π(φ|−) is not available in closed-form. Nevertheless, propagated surrogate quantities (SCSS) enable approximate MCMC draws to be sampled from C-DF full conditional π ˜ (φ|−) via Metropolis-Hastings. Here, steps for approximate MCMC using the C-DF algorithm in the context of a Metropolis-within-Gibbs sampler are: (1) At time t observe yt ; (2) If t ≤ b, draw S Gibbs samples for (θ1 , . . . , θt , τ 2 , σ 2 , φ) from the full conditionals. In 16

addition, C1t = C2t = C3t = C4t = 0, t ≤ b. (3) If t > b, then for s = 1, . . . , S: (a) draw sequentially from [θt−b+1 |θˆt−b , θt−b+2 , −], · · · , [θt |θt−1 , −];  (t−1) (t−1) (t−1) (b) draw τ 2 ∼ IG at , bt , and bt = C4 − 2φC3 + φ2 C2 + (θt−b+1 − φθˆt−b )2 +   Pb Pb (t−1) 2 2 2 /2; (c) draw σ ∼ IG a , C + (y − θ ) /2 ; (θ − φθ ) t t+j−b t+j−b t+j−b t+j−b−1 1 j=1 j=2 (t−1)

(d) sample φ ∼ π ˜ (·|C2

(t−1)

, C4

, θ(t−b+1):t ) via a Metropolis-Hastings step.

(4) Set θˆt−b ← mean(θt−b ); (t−1)

(5) Update surrogate quantities C1t ← C1 (t−1)

C3t ← C3

(t−1) + (yt−b − θˆt−b )2 /2, C2t ← C2 + θˆt−b−1 /2,

(t−1) 2 + θˆt−b−1 θˆt−b , C4t = C4 + θˆt−b /2.

We vary signal-to-noise ratio τ 2 /σ 2 for the data generating process in simulation experiments. A number of high-signal cases were examined, and results are reported for a √ representative case with τ = 2, σ = 0.1. A sufficiently large window size b is necessary to prevent C-DF from suffering due to poor estimation at early time-points, causing propagation of error in the defined surrogate quantities. For the PL competitor, 100 particles were propagated in time, and hence we fix b = 100 as well. Kernel density estimates for θt using PL and the C-DF algorithm are shown in Figure 5. In the case of C-DF, estimates for all model parameters are found to be concentrated near their true values, whereas for PL, θt at different times are centered correctly, albeit with much higher posterior variance (presumably due to poor estimation of noise parameters τ 2 , σ 2 ). In addition to being 35% faster than PL, average coverage for the latent AR(1) process using the C-DF algorithm is near 80% with credible intervals roughly 10 times narrower than PL (see Table 3). C-DF is substantially more efficient in terms of memory and storage utilization, requiring only 6% of what PL uses for an identical simulation. Finally, C-DF produces accurate estimates for latent parameters τ 2 , φ (in contrast to PL), although their posterior spread appears somewhat over shrunk. This may be remedied by using larger window size b at the expense of increased runtime which may be desired depending on the task at hand.

17

C-DF PL

Avg. coverage θ 0.780.10 10.00

Length 0.330.11 3.360.46

Time (sec) 1138.600.10 1750.580.10

MSE 0.0110.001 0.0960.027

Table 3: Inferential performance for C-DF and Particle Learning. Coverage and length are based on 95% credible intervals for θt averaged over all time and 10 independent PT npoints 1 ˆ replications. For truth θt0 at time t, we report MSE = T n t=1 (θt − θt0 )2 . We report the time taken to run C-DF with 50 Gibbs samples at each time for τ 2 , θ, σ 2 and 500 MH samples for φ.

C-DF PL

Stats Cit t Ci,j

Data {yi }i>nt−b {yi }i≥1

Sample complexity S(N + G) NG

Update complexity N N

Memory (bytes) 128 3330

Table 4: Computational and storage requirements for the Dynamic Linear Model using C-DF t and PL. Ci,j , is the i-th CSS corresponding to the j-th particle in PL, i = 1 : 4, j = 1 : N , N = 100 is the number of particles propagated by PL, and G = 500 is the number of Metropolis samples used by both PL and C-DF. Memory in terms of RAM used to store and propagate SCSS and CSS for C-DF and PL is reported. Sampling and update complexities are in terms of big-O. 3.2.2

An application to binary response data

We consider an application of the C-DF algorithm for the probit model Pr(yi = 1|xi ) = Φ(x0i β), with standard normal distribution function Φ(·). The Albert and Chib (1993) latent variable augmentation applies Gibbs sampling, with yi = sign(zi ), and zi |β ∼ N(x0i β, 1). Assuming a conditionally-conjugate prior, β ∼ N (0, Σβ ), the Gibbs sampler alternates between the following steps: (1) conditional on latent variables z = (z1 , z2 , . . . )0 , sample β from its p-variate Gaussian full conditional; and (2) conditional on β and the observed binary response y, impute latent variables from their truncated normal full conditionals. However, imputing latent variables {zi : i ≥ 1} presents an increasing computational bottleneck as the sample size increases, as does recalculating the conditional sufficient statistics for β given these latent variables at each iteration. C-DF alleviates this problem by imputing latent scores for the last b training observations and propagating surrogate quantities. “Budget” size b is allowed to grow with the predictor dimension, and as a default we set b = pdlog(p)e. For data shards of size n, define It = {i > nt − b} as an index over the final b observations

18

C−DF, T = 3000

−2

−1

0

1

2

4 −2

0

2

4

−6

−5

θ2000

−4

−3

−2

−1

0

1

θ3000 C−DF, T: 1000 PL, T: 1000

80

C−DF, T: 4000 PL, T: 4000

C−DF, T: 4000 PL, T: 4000

20

4

Density

6

60

8

10

C−DF, T: 1000 PL, T: 1000

0

0

2

Density

3

Density

1 −4

θ1000

40

−3

0

1 0 −4

PL, T = 3000

2

4 3

Density

2

4 3 2 0

1

Density

PL, T = 2000 5

C−DF, T = 2000 5

PL, T = 1000

5

C−DF, T = 1000

0

1

2

3

4

0.0

0.2

0.4

0.6

0.8

1.0

φ

τ2

Figure 3: Row #1 (left to right): Kernel density estimates for posterior draws of θt using PL and the C-DF algorithm at t = 1000, 2000, 3000; Row #2 (left to right) plots of model parameters τ 2 and φ, respectively. at time t. Parameter partitions in this setting are dynamic as in Section 3.2.1, with ΘG1 = {β; zi , i ∈ It } and ΘG2 = {zi : i ≤ nt − b}. The C-DF algorithm proceeds as follows: (1) Observe data X t , y t at time t; (2) If t = 1, set β = 0, and draw zi ∼ N (0, 1), i = 1, . . . , n. If t ≤ b/n, C t = 0 and draw S Gibbs samples for (z1 , . . . , znt , β) from the full conditionals. ˆ , update sufficient statistic S XX ← S XX + X t0 X t , and compute (3) If t > b/n, set β ← β t−1 t t−1 −1 ΣXX = (S XX + Σ−1 t t β ) ;

(4) For s = 1, . . . , S: (a) draw zi ∼ T N (x0i β), i ∈ It , and define z It and X It respectively (s)

as the collection of latent draws and data for moving window b; (b) draw β|z It , C t−1 ∼ (s)

(s)

N (ΣXX C t , ΣXX ), where C t (z It ) = C t−1 + X 0It z It ; t t 19

ˆ ← mean(β (s) ). For ui (β) = yi φ(−x0 β)/Φ(yi x0 β), the trailing n latent scores (5) Set β t i i ˆ +ui (β ˆ ), i ∈ within moving window b are fixed at their expected value, namely zˆi ← x0i β t t Itout = {(nt + b − 1) : (n(t + 1) + b − 1)}; (6) Denote X Itout and z Itout as predictor data and latent scores for the “outgoing” set of ˆ Itout , and set ΘG1 ← data indexed by Itout . Update surrogate CSS C t ← C t−1 + X Itout z ˆ Itout . ΘG1 ∪ z We report simulation results for the following examples: (Case 1) (p, b, n, t) = (100, 500, 25, 100): βj,0 ∼ Uniform(−3/4, 3/4) for j ∈ [11, 100]; (Case 2) (p, b, n, t) = (500, 3500, 100, 100): βj,0 ∼ Uniform(−1/3, 1/3) for j ∈ [11, 200]. The first 10 regression coefficients are (3.5, -3.5, -2.0, 2.0, -1.5, 1.5, -1.5, 1.5, -1.0, 1.0), with βj,0 = 0 for j > 200 in case 2. Data are generated as xij ∼ N(0, σ = 0.25), and P(yi = 1) = Φ(x0i β). Table 5 summarizes inferential performance for the regression parameters in each of the simulated cases. In case 2, although coverage for predictors with large coefficients (i.e., β1 and β2 ) is less than the nominal value, the average coverage across all predictors produced by C-DF is 70% despite the high dimension with a significant number of “noisy” predictors. In addition, C-DF has very good mean-square estimation of parameter coefficient in both cases. Figure 4 compares kernel density estimates of marginal posteriors at t = 200, 500 for both cases. C-DF Case 1 (p = 100) SMCMC C-DF Case 2 (p = 500) SMCMC

Avg. coverage β 0.770.08 0.960.01 0.700.03 0.920.02

Length 0.42 0.65 0.23 0.33

Time (sec) 28.92.5 152.510.3 461.529.6 2,196.3170.5

MSE10 0.0250.01 0.0180.01 0.200.05 0.0500.01

MSE 0.040.01 0.020.01 0.0160.003 0.0090.001

Table 5: Inferential performance for C-DF and SMCMC in simulation studies (p, b, n, t) = (100, 500, 25, 100) and (p, b, n, t) = (500, 3500, 100, 100). Coverage and length are based on 95% credible intervals averaged over all predictors and replications. P10overˆ 10 independent 1 2 MSE is reported over all predictors, while MSE10 = 10 ( β − β ) . We report the total j0 j=1 j time taken to produce 500 MCMC samples after the arrival of each data shard.

20

SMCMC scales linearly in the number of training observations and has the worst-case storage requirement. At each Gibbs iteration, draws for the latent scores is O(nt) for SMCMC compared to O(b) for C-DF. For both methods, sampling from the full conditional for β is O(p3 ), updating sufficient statistics is O(np2 ), and updating surrogate quantities is O(np). Computational and storage requirements for both methods are summarized in Table 6. C-DF SMCMC

Stats S XX , C S XX

Data {xi , yi }i>nt−b {xi , yi }i≥1

Sample complexity Sb Snt

Update complexity p3 + np2 p3 + np2

Memory (M-bytes) 20.0 46.0

Table 6: Computational and storage requirements for the latent variable probit model using C-DF and SMCMC. Budget b represents the number of latent scores updated by C-DF when processing data shard at time t. Runtime is quickly dominated by the sampling complexity which scales linearly in time for the augmented Gibbs sampler (SMCMC). Memory is reported for case 2, (p, b, n, t) = (500, 3500, 100, 100). Sampling and update complexities are in terms of big-O. CDF, t: 100 Gibbs, t: 100

2

3

4

5

3 1 0 −3

−2

β1

−1

0

−1

0

β5

2

3

CDF, t: 50 Gibbs, t: 50

CDF, t: 100 Gibbs, t: 100

CDF, t: 50 Gibbs, t: 50

6 5 4 3

Density

1 2

3

4

5

0

0

0

1

2

2

4

Density

4 3 2

Density

5

6

6

7

CDF, t: 100 Gibbs, t: 100

1 β10

8

CDF, t: 50 Gibbs, t: 50

7

CDF, t: 100 Gibbs, t: 100

CDF, t: 50 Gibbs, t: 50

2

Density

3 0

1

2

Density

3 2 0

1

Density

CDF, t: 50 Gibbs, t: 50

4

CDF, t: 100 Gibbs, t: 100

4

CDF, t: 50 Gibbs, t: 50

4

CDF, t: 100 Gibbs, t: 100

−3

−2

β1

−1 β5

0

−1

0

1

2

3

β10

Figure 4: Kernel density estimates for posterior draws using SMCMC and the C-DF algorithm at t = 50, 100. Shown from left to right are plots of model parameters β1 , β5 , and β10 (and top to bottom are case 1 and case 2), respectively.

21

3.2.3

Poisson mixed effect model

Consider the Poisson additive mixed effects model where count data y t with predictors X t are modeled as

 y t ∼ Poisson exp(X t β + Zu) , β ∼ N (0, σβ2 I k )  u ∼ N 0, diag(σ12 I k1 , . . . , σr2 I kr ) , σs2 ∼ IG(0.5, 1/bs ), bs ∼ IG(1/2, 1/a2s ),

with I k1 , . . . , I kr respectively denoting identity matrices of order k1 , . . . , kr and s = 1, 2, . . . , r. Here, obstacles to implementing the C-DF algorithm immediately arise as the full conditional distributions have no closed-form and do not admit surrogate quantities to propagate. The global approximation made by the ADF approximation becomes increasingly unreliable and overly restrictive in higher dimensions. A possibly less restrictive approximation may seek to obtain an “optimal” approximation to the posterior subject to ignoring posterior dependence among different parameters. This is given by a variational approximation to the joint posterior over model parameters β, u, σ12 , . . . , σr2 , namely π(β, u, σ12 , . . . , σr2 |D t ) ≈ q1 (β, u) q2 (σ12 , . . . , σr2 ) q3 (b1 , . . . , br ).

Closed-form expressions for q1 , q2 , q3 are derived and given by

q1 = N (µβ,u , Σβ,u ), q2 =

r Y

IG



ks +1 , µ1/bs 2

+

s=1

||µus ||2 +tr(Σus ) 2



, q3 =

r Y

IG(1, µ1/σs2 + a−2 s ).

s=1

R Above, µ1/σs2 = (1/σs2 )q(σs2 ), with µ1/bs defined similarly, and µu , Σu represent the mean and covariance specific to u under approximating density q1 . The approximate posterior is completely specified in terms of µβ,u , Σβ,u , and µ1/bs , µ1/σs2 for s = 1, . . . , r, hence the CDF algorithm in this setup is applied directly on these parameters. In particular, one only needs point estimates for these parameters to fully specify the approximate posterior, hence sampling steps (see Algorithm 1) are replaced by fixed-point iteration. With a partition 22

ΘG1 = {µβ,u , Σβ,u }, ΘG2 = {µ1/bs , µ1/σs2 , s = 1, . . . , r} for the parameters of the variational distribution, the C-DF algorithm proceeds as follows: (1) Observe data shard (y t , X t ) at time t. If t = 1, initialize all the parameters using (t−1)

(t−1)

draws from respective priors. Otherwise set µtβ,u ← µβ,u , Σtβ,u ← Σβ,u , µt1/σs2 ← (t−1)

(t−1)

µ1/σs2 , µt1/bs ← µ1/bs ; (2) The following steps are repeated a maximum of Nfx iterations (or until convergence): (a) Update D ← blockdiag(σβ−2 I k , µ ˆ1/σ12 I tk1 , . . . , µ ˆt1/σr2 I kr );  0 0 (b) Update wβ,u ← exp ct µβ,u + 21 ct Σβ,u ct , with ct = [X t , Z]; (t−1)

(c) Set H1 = C 1,1

(t−1)

+ ct y t , H2 = C 1,2

(t−1)

+ ct wβ,u , and H3 = C 1,3

0

+ wβ,u ct ct ;

 −1 (t−1) (d) Update µβ,u ← µβ,u + Σβ,u H1 − H2 − Dµβ,u , and Σβ,u ← H3 + D . (3) Set µtβ,u , Σtβ,u at the final values after Nfx iterations; (4) Also repeat the following step a maximum of Nfx iterations (or until convergence):  t −1 t 2 2 ← (ks + 1)/ 2µ1/b + ||µ || + tr(Σ ) , s = (a) Update µ1/bs ← (µ1/σs2 + a−2 ) , µ 1/σ s s u u s 1, . . . , r. (5) Set µt1/σs2 , µt1/bs at the final values after Nfx iterations; (t−1)

(6) Update surrogate quantities C t1,1 ← C 1,1 (t−1)

C 1,3

(t−1)

+ ct y t ; C t1,2 ← C 1,2

t + ct wβ,u ; C t1,3 ←

0

t + wβ,u ct ct .

With the arrival of new data, Nfx fixed-point iterations update parameters estimates for the approximating distributions. Consistent with the definition of SCSS, an update to the parameters for ΘG1 under q1 may involve estimates from the previous time-point, in addition to estimates for parameters ΘG2 and vice versa. This online sampling scheme has excellent empirical performance, both in terms of parametric inference and prediction (see Luts & Wand (2014)). This establishes the broad applicability of propagating surrogate quantities 23

(SCSS) using the C-DF algorithm in a non-conjugate setting using variational approximation methods.

4

C-DF for online compressed regression

We consider an application to compressed linear regression, where y 1 , · · · , y t ∈ Rn are a sequence of n-dimensional response vectors with associated features X 1 , · · · , X t ∈ Rn×p observed over time. Data are modeled according to

y t |Φ, β, σ 2 ∼ N (X t Φ0 β, σ 2 I n ),

(7)

where Φ is an m × p projection matrix, with m  p. A Bayesian analysis proceeds by sampling from posterior β, Φ, σ 2 |D (t) , with the following default prior specification: β ∼ N (0, σ 2 Σβ ); σ 2 ∼ IG(a, b), with Jeffrey’s prior obtained letting a, b → 0; Φ ∼ M N (Φ0 , K, 1m ), centered on a row ortho-normalized random projection matrix, Φ0 with iid

row-specific scaling κi ∼ IG(1/2, 1/2) and K = diag({κi }), i = 1, 2, . . . , m. Of course, Φ and β are only identified up to an orthogonal transformation, since Φ0 β = Φ0 LL0 β for any orthogonal matrix L. Nevertheless, regression coefficients γ = Φ0 β are identifiable, and valid inference is obtained using posterior draws of the associated model parameters.

4.1

Online inference and competitors

Conditional on Φ, the posterior distribution over β, σ 2 factorizes as β | Φ, D (t) ∼ Tn (µt , Σt )

σ 2 | Φ, D (t) ∼ IG(a1,t /2, b1,t /2)

Σt = b1,t /n W −1

a1,t = nt

µt = (b1,t /n)−1 Σt ΦF Xy t

0

(8) 0

0 −1 b1,t = Ftyy − F Xy ΦF Xy , t ΦW t

24

where Tν (·) is the multivariate t-distribution with ν degrees of freedom, with hyper-parameters yy XX 0 defined in terms of sufficient statistics Ftyy = Ft−1 + y 0t y t , F Xy = F Xy = t t−1 + y t X t , F t 0 XX 0 F XX Φ + Σ−1 t−1 + X t X t and W = ΦF t β . Sampling for Φ = [Φ1 , Φ2 , . . . , Φp ] proceeds by

drawing successively from the set of column-specific full conditionals, namely Φj |{Φ}−j , K, β, D (t) ∼ Nm (µΦj ΣΦj ), P −1 t 0 0 −1 2 ΣΦj = βX X β /σ + K js js s=1  P t −1 0 2 + K Φ µΦj = ΣΦj βX z /σ 0j s js js s=1 For column j, z jt = y t −

P

l6=j

κi |Φ ∼ IG(ci /2, di /2) ci = c + p (i,·)

(i,·)

di = d + (Φ(i,·) − Φ0 )0 (Φ(i,·) − Φ0 ). (9)

X lt Φ0l β, with X lt denoting the l-th column of X t , and Φ(i,·)

represents the i-th row of Φ. The SMCMC sampling scheme for model (7) propagates sufficient statistics Ftyy , F Xy and F XX after observing {X t , y t } at time t and draws sequentially t t from full conditional distributions (8) and (9). Due to high autocorrelation between β and Φ, however, this online Gibbs sampler faces poor mixing in the joint parameter space. The C-DF algorithm applied to this setting partitions model parameters into ΘG1 = {β, σ 2 } and ΘG2 = {Φ, K}, and inference proceeds as follows: (1) Observe data D t = {X t , y t } at time t. If t = 1, set β = 0, Φ = Φ0 , σ = s¯(vec(y 1 , . . . , y k )), bt ← Φ b t−1 , κ and here we assume y t is zero-mean. Otherwise, set Φ ˆ i,t ← κ ˆ i,t−1 ; (t)

(t−1)

(2) Update surrogate quantities C 1,1 ← C 1,1

0

(t)

(t−1)

b tX 0 X tΦ b and C ← C +Φ 1,2 1,2 t t

+ ΦX 0t y t .

(t)

yy Using the notation in (8), redefine W = C 1,1 + Σ−1 β and set a1,t = nt, b1,t = Ft − (t)

(t)0

C 1,2 W −1 C 1,2 . b t , C (t) and β|σ 2 , Φ b t , C (t) . (3) Draw S approximate C-DF samples compositionally from σ 2 |Φ 1,· 1,· Here, a single m × m matrix inversion is required (instead of once at every sample using SMCMC); b ← µ , where Σt = b1,t /n W −1 , and µ = (b1,t /n)−1 Σt C (t) (4) Set σ ˆt2 ← b1,t /(a1,t +1) and β t t t 1,2 (using closed-form expressions for the posterior MAP); 25

(t) (t−1) 0 bβ b0 (5) For j = 1, . . . , p : update surrogate quantities C 21,j ← C 21,j + β t t (X jt X jt ) and (t)

(t−1)

b X0 y ; C 22,j ← C 22,j + β t jt t (t) b t, σ (6) Draw S samples from C-DF full conditional Φ, K|β ˆt2 , C 2· by sampling from [Φj |{Φ}−j , K, −],

1 ≤ j ≤ p and [K|Φ, −] defined in (9), now modified in terms the defined surrogate quantities; (t)

(a) compute H jt (Φ−j ) = C 22,j − (t)

with ΣΦj = C 21,j /ˆ σt2 + K −1

P

(t)

C 21,l Φl . Using (9), draw from [Φj |{Φ}−j , K, −]  and µΦj = ΣΦj H jt /ˆ σt2 + K −1 Φ0j ;

l6=j

−1

(b) Sample K|Φ, − by drawing independently from [κi |Φ, −], i = 1, . . . , m. b t and {ˆ (7) Set Φ κi,t } as the sample mean over these S draws. By propagating surrogate quantities instead of the much larger sufficient statistics, the C-DF algorithm dramatically reduces storage requirements (see Section 4.2) and results in stateof-the-art parameter inference along with 20-40% gains in efficiency over SMCMC (see Table 11). Bayesian shrinkage methods such as BL (Park et al., 2008) and GDP (Armagan et al., 2013) were attempted but rendered infeasible by the need to invert a p × p matrix at each MCMC iteration. SMC (Doucet et. al, 2000) suffers from severe particle degeneracy for learning the high-dimensional parameter Φ, and while sufficient statistics for particle rejuvenation are available, inference fairs no better than SMCMC. In addition, the need to store and propagate a large number of high-dimensional particles is completely impractical. Instead, we derive a variational Bayes (VB) approximation to the joint posterior using a GDP shrinkage prior on the coefficients of a standard linear regression model, yi = N (x0i β, σ 2 ) and βj | σ ∼ GDP (ζ =

ση , α). α

The latter is equivalent to hierarchical prior βj | σ, τj ∼ N (0, σ 2 τj ),

with τj ∼ Exp(λ2j /2) and λj ∼ Ga(α, η). For Θ = (β, τ , λ, σ 2 )0 , τ = (τ1 , . . . , τp )0 , and Q λ = (λ1 , . . . , λp )0 , we approximate π(Θ|D) by q(Θ) = qj (θ j ). Given a variational approx  imation of this form, optimal densities qj (θ j ) ∝ exp E−q(θj ) {log π(Θ, D)} are obtained by

26

minimizing the KL distance between π(θ|D) and q(Θ). Here, E−q(θj ) denotes the expectaQ tion over i6=j qi (θ i ). Additional details can be found in Ormerod & Wand (2010), Luts & Ormerod (2013).

4.2

Simulation experiments

The C-DF algorithm provides robust parameter inference and predictive performance across numerous simulation experiments which consider varying degrees of sparsity as well as predictor dimension and correlation (see Tables 8 and 10). Shards of n = 100 observations arrive sequentially over a T = 1000 time horizon, and predictor data are generated as xi ∼ N (0, R), Rjk = ρ|j−k| , for j, k = 1, 2, . . . , p and correlation ρ ∈ (0, 1). The response is generated as yi ∼ N (x0i β, σ 2 = 4), where true coefficient vector β used in each case is specified in Table 7. Case 1* 2* 3* 4* 5 6

ρ 0.1 0.1 0.4 0.4 0.1 0.1

p #βj 6= 0 500 10 1000 10 500 10 1000 10 500 500 500 500

Signal high high high high high low

Table 7: Simulation experiments for supervised compressed regression. “High” signal strength corresponds to generating βj ∼ Uniform(−3, 3), while “low” signal corresponds to setting βj = 0.10 for every nonzero feature. Above, sparse cases are denoted with (*). Table 8 reports predictive MSE for each simulation experiment averaged over 50 simulated datasets. In all cases, the results appear robust to the choice of m, the dimension of the subspace which Φ maps into. Choosing m large can add significantly to the computational overhead of the algorithm, so we suggest setting m = max{10, log(p)} by default. VBGDP yields the lowest MSPE for sparse truths, though C-DF and SMCMC demonstrate competitive performance. Increasing the number of predictors causes MSPE to increase for all methods, as does an increase of correlation between predictors. The C-DF algorithm

27

performs well in all cases, while VB-GDP suffers for dense truths, especially in the lowsignal setting. In addition, C-DF results in excellent parameter estimation and variable selection, including cases with high predictor correlation (see Table 9). Table 10 reports coverage probabilities for 95% predictive intervals for the competing methods. While CDF and SMCMC show proper coverage, VB suffers due to the restrictive assumption of independence between parameters a posteriori.

VB C-DF SMCMC

Case 1* 3.430.005 3.490.010 3.560.020

Case 2* 4.200.006 4.380.010 4.400.020

Case 3* 3.490.005 3.620.007 3.580.020

Case 4* 4.230.006 4.400.020 4.430.020

Case 5 3.520.007 3.810.006 3.680.020

Case 6 8.790.010 3.640.020 3.700.020

Table 8: MSPE comparisons for each simulated experiment in Table 7. Subscripts denote bootstrapped standard errors calculated using independent replications.

VB C-DF SMCMC

Time 100 200 100 200 100 200

Case 1* 0.0090.001 0.0040.001 0.0100.001 0.0040.001 0.0140.003 0.0060.001

Case 2* 0.0190.001 0.0080.001 0.0330.003 0.0110.002 0.0320.004 0.0150.002

Case 3* 0.0180.002 0.0050.001 0.0290.004 0.0110.002 0.0240.003 0.0100.002

Case 4* 0.0240.002 0.0080.001 0.0740.001 0.0270.001 0.0340.009 0.0170.004

Case 5 0.0020.001 0.0010.001 0.0130.001 0.0030.001 0.0040.001 0.0020.000

Case 6 0.700.06 0.730.06 0.0420.003 0.0200.00 0.0630.006 0.0360.002

Table 9: Performance comparison in terms of relative parameter MSE at t = 100, 200. 2 2 b b0β Relative MSE for γ = Φ0 β is computed as ||Φ t t − γ 0 || /||γ 0 || for C-DF and SMCMC, γ 0 2 2 VB is γ at the truth. For VB, we compute ||µVB β − γ 0 || /||γ 0 || , where µβ is the approximate posterior mean for β. Subscripts denote bootstrapped standard errors calculated using independent replications.

VB C-DF SMCMC

Case 1* 0.82 (0.81,0.84) 0.98 (0.98,0.99) 0.96 (0.95,0.96)

Case 2* 0.83 (0.76,0.87) 0.98 (0.97,0.99) 0.93 (0.91,0.96)

Case 3* 0.82 (0.81,0.85) 0.98 (0.98,0.99) 0.95 (0.95,0.96)

Case 4* 0.83 (0.75,0.87) 0.98 (0.97,0.98) 0.93 (0.90,0.96)

Case 5 0.81 (0.76,0.86) 0.97 (0.97,1) 0.96 (0.96,0.96)

Case 6 0.83 (0.80,0.85) 0.99 (0.99,1) 0.95 (0.93,0.97)

Table 10: Predictive coverage for each simulation experiment in Table 7. Empirical 95% confidence intervals of the coverage probabilities over independent replications are also reported. 28

Propagating surrogate quantities using the C-DF algorithm for model (7) results in a dramatic efficiency gain over SMCMC. A measure of this efficiency is the “effective sample size,” namely, the number of samples required for the predictive MSE to drop below a chosen threshold. Table 11 reports effective sample sizes for each simulated experiment in Table 7.

           Effective sample size                                        Memory
           Case 1*   Case 2*   Case 3*   Case 4*   Case 5   Case 6     p = 500    p = 1000
C-DF       2040      2610      1710      3630      5250     210        0.68 MB    1.34 MB
SMCMC      2940      3990      2820      4410      4440     1050       2 MB       8.1 MB

Table 11: Effective sample sizes (number of MCMC samples) required until MSPE ≤ 5 are shown above for each simulation experiment, with results averaged over 10 independent replications. The storage required in terms of propagated quantities is also reported for each competitor.

Finally, storage of the sufficient statistics for model (7) in the predictor dimension, and updating these quantities, is O(np²) + O(np) at each time point. In comparison, updating surrogate quantities for C-DF is O(mpn) + O(m²n). The C-DF algorithm dramatically reduces storage requirements for online inference by reducing the quadratic dependence on p to a linear order.
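To make the preceding complexity comparison concrete, the sketch below contrasts per-shard updates of full linear-regression sufficient statistics (quadratic in p) with updates of surrogate statistics formed in an m-dimensional projected space. It is a minimal illustration only: the Gaussian projection Φ and the generic linear model are assumptions for the example, not the exact compressed-regression specification or SCSS of model (7).

```python
import numpy as np

def update_full_stats(stats, X_shard, y_shard):
    """Full sufficient statistics X'X (p x p) and X'y (p,): O(n p^2) per shard."""
    XtX, Xty = stats
    return XtX + X_shard.T @ X_shard, Xty + X_shard.T @ y_shard

def update_compressed_stats(stats, X_shard, y_shard, Phi):
    """Surrogate statistics Z'Z (m x m) and Z'y (m,) in the projected space.
    Forming Z = X Phi' costs O(n p m); the accumulation itself is O(n m^2)."""
    ZtZ, Zty = stats
    Z = X_shard @ Phi.T
    return ZtZ + Z.T @ Z, Zty + Z.T @ y_shard

n, p, m = 50, 1000, 10
rng = np.random.default_rng(0)
Phi = rng.normal(size=(m, p)) / np.sqrt(m)       # illustrative m x p projection

full = (np.zeros((p, p)), np.zeros(p))           # p x p storage: 8 bytes * p^2 = 8 MB at p = 1000
comp = (np.zeros((m, m)), np.zeros(m))           # m x m storage, independent of p

for t in range(5):                               # stream a few data shards
    X_t = rng.normal(size=(n, p))
    y_t = X_t @ (0.01 * rng.normal(size=p)) + rng.normal(size=n)
    full = update_full_stats(full, X_t, y_t)
    comp = update_compressed_stats(comp, X_t, y_t, Phi)
```

The point of the sketch is only the scaling: the propagated quantities on the compressed side never grow with p beyond the one-time O(npm) projection of each shard.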

5 Real data illustrations

We consider two F16 aircraft control datasets having application to variable selection and prediction. For the “Ailerons” data, the goal is to predict the control action of the ailerons based on attributes related to the status of the airplane (e.g., climb-rate), while in the “Elevators” data, we predict the aircraft’s angle of elevation. Additional information is available at http://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html. In both datasets, the relationship between the response and predictors is anticipated to be sparse. As before, VB-GDP and SMCMC are chosen as competitors; VB-GDP is run on the entire data in batch mode, while the online algorithms feed in data shards sequentially over time. Approximate MCMC for model (7) obtained via the C-DF algorithm delivers good parameter inference and predictive performance, as shown in Table 12. Effective sample size comparisons (not shown) between C-DF and SMCMC on both datasets are consistent with the findings discussed in Section 4.2 and summarized in Table 11.

The Ailerons dataset was partitioned into data shards having n = 50 observations arriving over a T = 143 time horizon. The predictor dimension is p = 40, and the hold-out test data has 6,595 observations. All methods are competitive in terms of predictive MSE, with the online methods stabilizing within the first 10 time points. Predictive intervals have comparable coverage and length for C-DF and SMCMC, whereas the variational Bayes competitor VB-GDP once again suffers in this regard (see Table 12). Finally, VB-GDP results in a sparser solution (18 / 40 predictors selected), whereas 33 / 40 predictors have nonzero MAP estimates using C-DF and SMCMC.

For the Elevators data, 8750 training observations were partitioned into data shards having n = 175 observations with a predictor dimension of p = 14, arriving over a T = 50 time horizon. Again, C-DF and SMCMC have virtually identical performance in terms of MSPE, coverage and length of 95% PIs; however, VB-GDP suffers in terms of predictive MSE and has predictive intervals that are more than twice as long as those of its competitors.

                        Ailerons                     Elevators
Method           VB      C-DF    SMCMC        VB      C-DF    SMCMC
MSPE             0.57    0.58    0.57         0.94    0.89    0.90
Coverage         0.80    0.93    0.94         0.98    0.96    0.96
PI length        2.20    2.86    2.89         9.68    3.74    3.73
#{β̂_j = 0}       22/40   7/40    7/40         4/14    1/14    2/14

Table 12: Predictive performance for real data illustrations after training on the full data. As before, MSPE denotes the mean square prediction error measured on a hold-out dataset. Coverage and length are based on 95% predictive intervals averaged over all time points on a hold-out dataset. #{β̂_j = 0} refers to the number of features identified by the model as unrelated to the response. For the Ailerons (Elevators) data, the hold-out dataset consists of 6597 (7896) observations.


6 Convergence Behavior of Approximate Samplers

This section studies the convergence behavior for a general class of approximate MCMC algorithms, with the C-DF algorithm a special case. We first present a general result on the limiting error of approximation when a kernel with the targeted stationary distribution is approximated by another kernel. This result deals with finite samples, and is not limited to any specific class of approximations. We then show that such approximations improve with increasing sample size, yielding draws from the exact posterior asymptotically. Proofs are provided in the appendix.

6.1 Notation and Framework

Let π_t(·|D^{(t)}) be the posterior distribution having observed data D^{(t)} through time t. A sequence of probability measures π_t(·|D^{(t)}) is defined on a corresponding sequence of measure spaces (H^t, H_t), t ≥ 1. Below, we take H^t = R^p, with H_t = B(R^p) denoting the Borel σ-algebra on R^p for a fixed p-dimensional parameter Θ = (θ_1, ⋯, θ_p). π_t admits density π_t(Θ) with respect to the Lebesgue measure dν(Θ) = dν_1(θ_1) dν_2(θ_2) ⋯ dν_p(θ_p). The transition kernel T_t : R^p × R^p → R^+ defines the parameter updating process at time t, and (1) T_t(x, ·) is a probability measure for all x ∈ R^p; (2) T_t(·, A) is a measurable function w.r.t. the σ-algebra for all ν-measurable A. A function f_t : R^p → R^+ defined over the σ-field H_t is the invariant distribution of T_t if

    f_t(Θ′) = ∫ T_t(Θ, Θ′) f_t(Θ) dν(Θ).          (10)

To denote the transition kernels and stationary distributions for finite sample cases, we omit the subscript t. We study the convergence of a sequence of distributions in total variation norm, namely d_TV(µ_1, µ_2) = sup_A |µ_1(A) − µ_2(A)|.
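For a distribution on a small finite support, the supremum in this definition can be evaluated by brute force over all events, which also illustrates the familiar identity with half the L1 distance between the probability vectors. The following sketch is purely illustrative and is not part of the paper's development.

```python
import numpy as np
from itertools import combinations

def tv_sup(mu1, mu2):
    """d_TV(mu1, mu2) = sup_A |mu1(A) - mu2(A)| by enumerating all events A
    on a small finite support (illustration only)."""
    n = len(mu1)
    best = 0.0
    for k in range(n + 1):
        for A in combinations(range(n), k):
            idx = list(A)
            best = max(best, abs(mu1[idx].sum() - mu2[idx].sum()))
    return best

mu1 = np.array([0.5, 0.3, 0.2])
mu2 = np.array([0.4, 0.2, 0.4])
print(tv_sup(mu1, mu2))                 # 0.2
print(0.5 * np.abs(mu1 - mu2).sum())    # same value, via the L1 identity
```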


6.2 A finite sample error bound for approximate MCMC samplers

We begin by characterizing how error propagates while approximating a kernel by another kernel.

Lemma 6.1 Let K be a kernel approximated by a kernel T s.t.

    sup_Θ ||K(Θ, ·) − T(Θ, ·)||_TV ≤ ρ.

Assume µ_1, µ_2 are the stationary distributions of T and K, respectively. Further assume ||T^{(r)} − µ_1||_TV → 0 and ||K^{(r)} − µ_2||_TV → 0. Then ∃ r_0 s.t.

    sup_Θ ||K^{(r)}(Θ, ·) − T^{(r)}(Θ, ·)||_TV ≤ rρ,   ∀ r ≤ r_0,
    sup_Θ ||K^{(r)}(Θ, ·) − T^{(r)}(Θ, ·)||_TV ≤ min{ρr, 2||µ_1 − µ_2||_TV},   ∀ r > r_0.          (11)

When the sample size is small, with data observed over a finite time horizon or at once, the C-DF kernel remains unchanged once surrogate quantities (SCSS) have been calculated for the final shard of data. According to Lemma 6.1, the approximation error due to C-DF increases initially before stabilizing. With certain additional assumptions, the C-DF kernel can be shown to generate draws from the exact posterior distribution as data accumulates over time (i.e., for streaming data).
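The two regimes in Lemma 6.1 can be visualized numerically on a small finite state space, where kernels are row-stochastic matrices, stationary distributions come from an eigendecomposition, and every total variation distance is exact. The two 3-state kernels below are arbitrary illustrative choices, not quantities from the paper.

```python
import numpy as np

def tv(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.abs(p - q).sum()

def stationary(P):
    """Stationary distribution of a row-stochastic matrix (left eigenvector for eigenvalue 1)."""
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()

K = np.array([[0.60, 0.30, 0.10],
              [0.20, 0.50, 0.30],
              [0.30, 0.30, 0.40]])
T = np.array([[0.55, 0.35, 0.10],
              [0.25, 0.45, 0.30],
              [0.30, 0.35, 0.35]])

rho = max(tv(K[i], T[i]) for i in range(3))   # sup_x ||K(x,.) - T(x,.)||_TV
mu1 = stationary(T)                            # stationary law of the approximating kernel T
mu2 = stationary(K)                            # stationary law of K

Kr, Tr = np.eye(3), np.eye(3)
for r in range(1, 21):
    Kr, Tr = Kr @ K, Tr @ T
    gap = max(tv(Kr[i], Tr[i]) for i in range(3))
    # Lemma 6.1: gap <= r*rho for every r; once both chains have mixed,
    # it also falls below 2 * d_TV(mu1, mu2).
    print(r, round(gap, 4), round(r * rho, 4), round(2 * tv(mu1, mu2), 4))
```

The printed gap grows at first (tracking the rρ envelope) and then stabilizes near the distance between the two stationary laws, mirroring the behavior described above.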

6.3 Convergence results for a general approximation class

We assume a p-dimensional parameter Θ is partitioned into Θ_1 = (θ_{11}, ⋯, θ_{1p_1})′ ∈ R^{p_1} and Θ_2 = (θ_{21}, ⋯, θ_{2p_2})′ ∈ R^{p_2}, where p = p_1 + p_2. The C-DF approximation requires that the approximating kernel for a full conditional distribution admits surrogate quantities as discussed in Section 2. Within this framework and without loss of generality, we assume that θ_{1i} | θ_{1,−i}, Θ̂_2, i = 1, ..., p_1, are conditionally conjugate distributions. Then, at time t the approximating kernel T_t : R^{p_1} × R^{p_2} → R^+ can be of the following forms:

(1) All full conditional distributions θ_{2i} | θ_{2,−i}, Θ̂_1, i = 1, ..., p_2, are conjugate, hence sampling from the approximate transition kernel T proceeds in a Gibbs-like fashion. For two sequences of estimators {Θ̂_{1,t}}_{t=1}^∞, {Θ̂_{2,t}}_{t=1}^∞, the C-DF transition kernel T_t is defined as

    T_t(Θ, Θ′) = [ ∏_{i=1}^{p_1} π_t(θ′_{1i} | Θ̂_{2,t−1}, θ′_{1l}, l < i, θ_{1l}, l > i) ]
                 × [ ∏_{i=1}^{p_2} π_t(θ′_{2i} | Θ̂_{1,t−1}, θ′_{2l}, l < i, θ_{2l}, l > i) ].          (12)

(2) Some (or all) of the conditional distributions θ_{2i} | θ_{2,−i}, Θ̂_1, i = 1, ..., p_2, are non-conjugate (i.e., not available in closed form). The C-DF transition kernel T_t is defined as

    T_t(Θ, Θ′) = [ ∏_{i=1}^{p_1} π_t(θ′_{1i} | Θ̂_{2,t−1}, θ′_{1l}, l < i, θ_{1l}, l > i) ] Q(Θ_2, Θ′_2 | Θ̂_{1,t−1}).          (13)

Here, Θ_2 is updated using a Metropolis-Hastings step having kernel Q(Θ_2, Θ′_2 | Θ̂_1). As before, {Θ̂_{1,t}}_{t=1}^∞, {Θ̂_{2,t}}_{t=1}^∞ are two sequences of estimators. Lemma 6.2 specifies the unique stationary distribution f_t : R^p → R^+ of T_t.

Lemma 6.2 The C-DF approximate kernels T_t in (12) and (13) have unique stationary distribution f_t(Θ) = π_t(Θ_1 | Θ̂_{2,t−1}) π_t(Θ_2 | Θ̂_{1,t−1}), where π_t(Θ_1 | −) and π_t(Θ_2 | −) are the stationary distributions for the respective parameter full conditionals.
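To fix ideas, the sketch below implements a kernel of the form (12) for a toy Gaussian model with two scalar parameter blocks: each block is drawn from its conjugate conditional with the other block fixed at a plug-in estimate, and the conditionals depend on the data only through running surrogate statistics (n, Σy, Σy²). The model, priors, and estimators here are hypothetical illustrations, not the specifications used in the paper's examples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: y_i ~ N(mu, sigma2), mu ~ N(0, 100), sigma2 ~ Inv-Gamma(2, 2).
# SCSS propagated over shards: n_tot, s1 = sum(y), s2 = sum(y^2).
n_tot, s1, s2 = 0.0, 0.0, 0.0
mu_hat, sig2_hat = 0.0, 1.0            # plug-in point estimates of each block

def cdf_sweep(num_draws=200):
    """Gibbs-like draws from the approximate conditionals given the SCSS and
    the current plug-in estimate of the other parameter block."""
    mu_draws, sig2_draws = np.empty(num_draws), np.empty(num_draws)
    for s in range(num_draws):
        # mu | sigma2_hat, SCSS  ~  Normal (conjugate)
        prec = n_tot / sig2_hat + 1.0 / 100.0
        mean = (s1 / sig2_hat) / prec
        mu_draws[s] = rng.normal(mean, np.sqrt(1.0 / prec))
        # sigma2 | mu_hat, SCSS  ~  Inverse-Gamma (conjugate)
        a_post = 2.0 + n_tot / 2.0
        b_post = 2.0 + 0.5 * (s2 - 2.0 * mu_hat * s1 + n_tot * mu_hat ** 2)
        sig2_draws[s] = 1.0 / rng.gamma(a_post, 1.0 / b_post)
    return mu_draws, sig2_draws

for t in range(1, 21):                  # stream T = 20 shards of 25 observations
    y = rng.normal(loc=2.0, scale=1.5, size=25)
    n_tot += y.size; s1 += y.sum(); s2 += (y ** 2).sum()
    mu_s, sig2_s = cdf_sweep()
    mu_hat, sig2_hat = mu_s.mean(), sig2_s.mean()   # update plug-in estimates

print(mu_hat, sig2_hat)                 # should approach 2.0 and 1.5^2 = 2.25
```

Consistent with Lemma 6.2, each sweep targets π_t(µ | σ̂²) π_t(σ² | µ̂) rather than the exact joint posterior; as the plug-in estimates converge with accumulating data, the two distributions coincide in the sense formalized in the results below.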

Remark 1: Illustrations in Section 3.1 fall into the first scenario, where all model parameters have conjugate full conditionals that admit surrogate quantities (SCSS). The dynamic linear model presented in Section 3.2.1 is an example where full conditional distributions admit SCSS, although some require sampling via Metropolis-Hastings. For all other examples presented in Section 3.2, the C-DF algorithm is modified to accommodate situations in which (a) the parameter space is increasing over time; or (b) some (or all) of the full conditional distributions do not admit SCSS. Though asymptotic guarantees on the samples produced are not established by Theorem 6.3 for the latter cases, the C-DF algorithm proves its versatility by producing excellent inferential and predictive performance in all examples considered (see Section 3).

Let π_0 be the initial distribution from which parameters are drawn. Below, we state the main result.

Theorem 6.3 Assume (i) ∃ α_t ∈ (0, 1) s.t. ∀ t, sup_Θ d_TV(T_t(Θ, ·), f_t) ≤ 2α_t; (ii) d_TV(f_t, f_{t−1}) → 0; and (iii) d_TV(f_t, π_t) → 0. Then ∃ {n_t}_{t≥1} s.t. d_TV(T_t^{(n_t)} ⋯ T_1^{(n_1)} π_0, π_t) → 0.

Remark 2: In essence, Theorem 6.3 states that a Markov chain run with approximate kernel T_t for n_t steps at each time t asymptotically yields draws from the true joint posterior distribution.

Remark 3: Condition (i) in Theorem 6.3 is referred to as the universal ergodicity condition (Yang & Dunson, 2013). It is shown in Lemma 3.2 of Yang & Dunson (2013) that the universal ergodicity condition is weaker than the uniform ergodicity condition on the transition kernel T. Condition (ii) ensures that the stationary distribution of the approximating kernel changes slowly as time progresses. Lemma 3.7 in Yang & Dunson (2013) shows that condition (ii) is satisfied for any regular parametric model by applying a simple Bernstein-von Mises theorem. Finally, the stationary distribution of the approximating chain is required to be “close” to the true posterior distribution at later times; this is formalized in condition (iii). Sufficient conditions under which assumption (iii) holds are outlined in Lemma 6.4. Before stating this lemma, we recall the definition of posterior consistency.

Definition: A posterior Π(·|D^{(t)}) is defined to be consistent at Θ_0 if, for every neighborhood U of Θ_0, Π(U | D^{(t)}) → 1 under the data generating law at Θ_0.

Lemma 6.4 Assume that the likelihood function p_Θ(·) is continuous as a function of Θ at Θ_0 = (Θ_{01}, Θ_{02}) and that √t p_{Θ_0}(D^{(t)}) is, in the limit, bounded away from 0 and ∞. Suppose Θ_0 is an interior point of the domain, with the prior distribution π_0(Θ_1, Θ_2) being positive and continuous at Θ_0. Further, assume Θ̂_{1,t} → Θ_{01}, Θ̂_{2,t} → Θ_{02} a.s. under the data generating law at Θ_0, and that f_t and π_t are both consistent at Θ_0. Then

    ∫ |π_t(Θ) − f_t(Θ)| dΘ → 0  as t → ∞,

almost surely under the true data generating model at Θ_0.

Remark 4: Using the simple fact that

    d_TV(π_t, f_t) = 2 sup_A | ∫_A (π_t − f_t) | = 2 ∫_{π_t > f_t} (π_t − f_t)
                   = ∫_{π_t > f_t} (π_t − f_t) + ∫_{f_t > π_t} (f_t − π_t) = ∫ |π_t − f_t|,

it is clear that under the conditions of Lemma 6.4, d_TV(π_t, f_t) → 0 as t → ∞.

Remark 5: Lemma 6.4 says that if the likelihood grows at a certain rate (satisfied under standard regularity conditions) and the estimates Θ̂_{1,t}, Θ̂_{2,t} are consistent estimators of the true parameters, then the stationary distribution of C-DF becomes close to the stationary distribution of Gibbs sampling on shards. This means that condition (iii) in Theorem 6.3 is satisfied. It should also be noted that our primary focus is on static models, where it is quite easy to obtain a consistent sequence of estimates for the involved parameters (as shown in the examples).

7 Conclusion

Conditional density filtering (C-DF) is a novel algorithm designed to scale Bayesian inference to streaming data. C-DF produces approximate MCMC samples by drawing from approximations to the conditional posterior distributions which rely on propagated surrogate conditional sufficient statistics (SCSS). These quantities are computed using sequential point estimates for model parameters along with data shards observed in time. This eliminates the need to process the entire data set simultaneously, with potentially substantial computational gains for massive data sets that are distributed across a computer network.

To date, there have been a limited number of approaches that scale Bayesian inference to data with a very large number of observations. Popular approaches that often work well in simple models include assumed density filtering (ADF), variational Bayes, expectation propagation (EP), integrated nested Laplace approximation (INLA), along with particle filtering and sequential Monte Carlo (SMC). These approaches face substantial difficulty, either computationally or in terms of inferential accuracy, for more complex models (i.e., non-conjugate settings or when the parameter space becomes large) and come with no asymptotic guarantees (with the exception of SMC). In contrast, approximate MCMC using C-DF is shown to have the correct stationary distribution as data accrue over time (see Section 6).

The versatility of the C-DF algorithm is demonstrated through the illustrative examples presented in Sections 3.1, 3.2 and 4.1. In each, approximate sampling via the C-DF algorithm results in good parameter estimation, uncertainty characterization, and predictive performance. In addition, C-DF offers runtime, memory and sampling efficiency improvements compared to various other competitors (see supporting Tables in Sections 3.2.1, 3.2.2, and 4.1).


Appendix A Proofs

Proof of Lemma 6.1. The proof follows by induction. First we prove an identity that will be used in the argument that follows. Letting A ∈ B(R^d),

    K^{(r)}(Θ, A) − T^{(r)}(Θ, A) = ∫ [ K^{(r−1)}(Θ′, A) − T^{(r−1)}(Θ′, A) ] T(Θ, dΘ′)
                                    + ∫ [ K(Θ, dΘ′) − T(Θ, dΘ′) ] K^{(r−1)}(Θ′, A).          (14)

(14) relates the differences between the kernels at the r-th iteration of the Markov chain. Using the fact that the r.h.s. is free of A and the relationship ||ν_1 − ν_2||_TV = sup_{g : R^d → [0,1]} | ∫ g dν_1 − ∫ g dν_2 |, (14) yields

    ||K^{(r)}(Θ, ·) − T^{(r)}(Θ, ·)||_TV ≤ ||K(Θ, ·) − T(Θ, ·)||_TV
                                          + ∫ ||K^{(r−1)}(Θ′, ·) − T^{(r−1)}(Θ′, ·)||_TV T(Θ, dΘ′).          (15)

Suppose (11) holds for (r − 1). Using (15) we find

    sup_Θ ||K^{(r)}(Θ, ·) − T^{(r)}(Θ, ·)||_TV ≤ sup_Θ ||K(Θ, ·) − T(Θ, ·)||_TV
                                                + sup_Θ ∫ ||K^{(r−1)}(Θ′, ·) − T^{(r−1)}(Θ′, ·)||_TV T(Θ, dΘ′)
                                               ≤ sup_Θ ||K(Θ, ·) − T(Θ, ·)||_TV + (r − 1)ρ ≤ ρ + (r − 1)ρ = ρr.          (16)

Also note that ∃ r_0 s.t. for all r > r_0, ||T^{(r)} − µ_1||_TV < (1/2)||µ_1 − µ_2||_TV and ||K^{(r)} − µ_2||_TV < (1/2)||µ_1 − µ_2||_TV. Using the triangle inequality, for all r > r_0,

    ||K^{(r)} − T^{(r)}||_TV ≤ ||K^{(r)} − µ_2||_TV + ||µ_1 − µ_2||_TV + ||T^{(r)} − µ_1||_TV < 2||µ_1 − µ_2||_TV.          (17)

Comparing (16) and (17), the result follows.

Proof of Lemma 6.2. We will show that ∫ T_t(Θ, Θ′) f_t(Θ) dΘ = f_t(Θ′). Note that in case (1),

    L.H.S. = ∫ [ ∏_{i=1}^{p_1} π_t(Θ′_{1i} | Θ̂_{2,t−1}, Θ′_{1l}, l < i, Θ_{1l}, l > i) ]
               [ ∏_{i=1}^{p_2} π_t(Θ′_{2i} | Θ̂_{1,t−1}, Θ′_{2l}, l < i, Θ_{2l}, l > i) ]
               π_t(Θ_1 | Θ̂_{2,t−1}) π_t(Θ_2 | Θ̂_{1,t−1}) dΘ
           = ( ∫ [ ∏_{i=1}^{p_1} π_t(Θ′_{1i} | Θ̂_{2,t−1}, Θ′_{1l}, l < i, Θ_{1l}, l > i) ] π_t(Θ_1 | Θ̂_{2,t−1}) dΘ_1 )
             × ( ∫ [ ∏_{i=1}^{p_2} π_t(Θ′_{2i} | Θ̂_{1,t−1}, Θ′_{2l}, l < i, Θ_{2l}, l > i) ] π_t(Θ_2 | Θ̂_{1,t−1}) dΘ_2 )
           = π_t(Θ′_1 | Θ̂_{2,t−1}) π_t(Θ′_2 | Θ̂_{1,t−1}).

The last step follows from the well-known fact that [ ∏_{i=1}^{p_2} π_t(Θ′_{2i} | Θ̂_{1,t−1}, Θ′_{2l}, l < i, Θ_{2l}, l > i) ] and [ ∏_{i=1}^{p_1} π_t(Θ′_{1i} | Θ̂_{2,t−1}, Θ′_{1l}, l < i, Θ_{1l}, l > i) ] are Gibbs sampling kernels with stationary distributions π_t(Θ_2 | Θ̂_{1,t−1}) and π_t(Θ_1 | Θ̂_{2,t−1}), respectively. The proof for case (2) follows in an identical manner, taking into account that the MH kernel Q(Θ′_2, Θ_2 | Θ̂_{1,t−1}) has π_t(Θ_2 | Θ̂_{1,t−1}) as its stationary distribution.

Proof of Theorem 6.3. The following proof builds on results from Yang & Dunson (2013). Although most of the proof coincides with the proof of Theorem 3.6 in Yang & Dunson (2013), for the sake of completeness we present the entire argument. Fix ε ∈ (0, 1). Choose n_t, t ≥ 1, s.t. α_t^{n_t} < ε. Using the fact that the universal ergodicity condition implies uniform ergodicity, one obtains

    d_TV(T_t^{n_t}, f_t) ≤ α_t^{n_t} < ε.

Let h = T_{t−1}^{n_{t−1}} ⋯ T_1^{n_1} π_0. Then

    d_TV(T_t^{n_t} ⋯ T_1^{n_1} π_0, f_t) = d_TV(T_t^{n_t} h, f_t) ≤ d_TV(T_t^{n_t}, f_t) d_TV(h, f_t)
                                         ≤ α_t^{n_t} d_TV(h, f_t) ≤ ε ( d_TV(h, f_{t−1}) + d_TV(f_t, f_{t−1}) ).          (18)

Using the result repeatedly, one obtains

    d_TV(T_t^{n_t} ⋯ T_1^{n_1} π_0, f_t) ≤ Σ_{l=1}^{t} ε^{t+1−l} d_TV(f_l, f_{l−1}).

The r.h.s. clearly converges to 0 by condition (ii). The proof is completed by using condition (iii) and the fact that

    d_TV(T_t^{n_t} ⋯ T_1^{n_1} π_0, π_t) ≤ d_TV(T_t^{n_t} ⋯ T_1^{n_1} π_0, f_t) + d_TV(f_t, π_t).

Proof of Lemma 6.4. Note that the approximated posterior distribution f_t is given by

    f_t(Θ_1, Θ_2) = π_t(Θ_1 | Θ̂_{2,t}) π_t(Θ_2 | Θ̂_{1,t})
                  = [ ∏_{l=1}^{t} p_{Θ_1, Θ̂_{2,t}}(D_l) ] [ ∏_{l=1}^{t} p_{Θ̂_{1,t}, Θ_2}(D_l) ] π_0(Θ̂_{1,t}, Θ_2) π_0(Θ_1, Θ̂_{2,t})
                    / ∫ [ ∏_{l=1}^{t} p_{Θ_1, Θ̂_{2,t}}(D_l) ] [ ∏_{l=1}^{t} p_{Θ̂_{1,t}, Θ_2}(D_l) ] π_0(Θ̂_{1,t}, Θ_2) π_0(Θ_1, Θ̂_{2,t}) dΘ.

Given that Θ̂_{1,t} → Θ_{01}, Θ̂_{2,t} → Θ_{02} a.s. under Θ_0, ∃ Ω_0 which has probability 1 under the data generating law s.t. ∀ ω ∈ Ω_0, Θ̂_{1,t}(ω) and Θ̂_{2,t}(ω) are in an arbitrarily small neighborhood of Θ_{01} and Θ_{02}, respectively. As π_0 is continuous at Θ_0, given ε > 0 and η > 0, ∃ a neighborhood N_{ε,η} s.t. ∀ Θ ∈ N_{ε,η} one has

    |π_0(Θ_1, Θ_2) − π_0(Θ_{01}, Θ_{02})| < ε.          (19)

Using (19) and consistency of Θ̂_{1,t} and Θ̂_{2,t}, one obtains, for all t > t_0 and ω ∈ Ω_0,

    |π_0(Θ_1, Θ̂_{2,t}) − π_0(Θ_0)| < ε,   |π_0(Θ̂_{1,t}, Θ_2) − π_0(Θ_0)| < ε.          (20)

Continuity of p_Θ(·) at Θ_0 leads to the condition that for all t > t_0,

    |p_{Θ_1, Θ_2}(D_l) − p_{Θ_{01}, Θ_{02}}(D_l)| < ε.          (21)

Also, the consistency assumptions on f_t and π_t yield that for all t > t_1 and ω ∈ Ω_1,

    f_t(N_{ε,η} | D^{(t)}(ω)) > 1 − η,   π_t(N_{ε,η} | D^{(t)}(ω)) > 1 − η,

where Ω_1 has probability 1 under the data generating law. Considering Ω = Ω_0 ∩ Ω_1 and t_2 = max{t_1, t_0}, it is evident that Ω also has probability 1 under the true data generating law and all of the above conditions hold for t > t_2 and ω ∈ Ω. Simple algebraic manipulations yield

    f_t(Θ | D^{(t)}(ω)) / π_t(Θ | D^{(t)}(ω))
      = [ f_t(N_{ε,η} | D^{(t)}(ω)) / π_t(N_{ε,η} | D^{(t)}(ω)) ]
        × [ ∏_{l=1}^{t} p_{Θ_1, Θ̂_{2,t}}(D_l) p_{Θ̂_{1,t}, Θ_2}(D_l) π_0(Θ̂_{1,t}, Θ_2) π_0(Θ_1, Θ̂_{2,t}) ] / [ ∏_{l=1}^{t} p_Θ(D_l) π_0(Θ) ]
        × [ ∫_{N_{ε,η}} ∏_{l=1}^{t} p_Θ(D_l) π_0(Θ) ] / [ ∫_{N_{ε,η}} ∏_{l=1}^{t} p_{Θ_1, Θ̂_{2,t}}(D_l) p_{Θ̂_{1,t}, Θ_2}(D_l) π_0(Θ̂_{1,t}, Θ_2) π_0(Θ_1, Θ̂_{2,t}) ].          (22)

Using (20) we have

    (π_0(Θ_0) − ε)² ∫_{N_{ε,η}} ∏_{l=1}^{t} p_{Θ_1, Θ̂_{2,t}}(D_l) p_{Θ̂_{1,t}, Θ_2}(D_l)
      ≤ ∫_{N_{ε,η}} ∏_{l=1}^{t} [ p_{Θ_1, Θ̂_{2,t}}(D_l) p_{Θ̂_{1,t}, Θ_2}(D_l) ] π_0(Θ̂_{1,t}, Θ_2) π_0(Θ_1, Θ̂_{2,t})
      ≤ (π_0(Θ_0) + ε)² ∫_{N_{ε,η}} ∏_{l=1}^{t} p_{Θ_1, Θ̂_{2,t}}(D_l) p_{Θ̂_{1,t}, Θ_2}(D_l).

Similarly,

    (π_0(Θ_0) − ε) ∫_{N_{ε,η}} ∏_{l=1}^{t} p_Θ(D_l) ≤ ∫_{N_{ε,η}} [ ∏_{l=1}^{t} p_Θ(D_l) ] π_0(Θ) ≤ (π_0(Θ_0) + ε) ∫_{N_{ε,η}} ∏_{l=1}^{t} p_Θ(D_l).

Therefore,

    f_t(Θ | D^{(t)}(ω)) / π_t(Θ | D^{(t)}(ω))
      ≤ (1 − η)^{−1} [ (π_0(Θ_0) + ε)³ / (π_0(Θ_0) − ε)³ ]
        × [ ∏_{l=1}^{t} p_{Θ_1, Θ̂_{2,t}}(D_l) p_{Θ̂_{1,t}, Θ_2}(D_l) / ∏_{l=1}^{t} p_Θ(D_l) ]
        × [ ∫_{N_{ε,η}} ∏_{l=1}^{t} p_Θ(D_l) / ∫_{N_{ε,η}} ∏_{l=1}^{t} p_{Θ_1, Θ̂_{2,t}}(D_l) p_{Θ̂_{1,t}, Θ_2}(D_l) ].

Using similar calculations we have

    f_t(Θ | D^{(t)}(ω)) / π_t(Θ | D^{(t)}(ω))
      ≥ (1 − η) [ (π_0(Θ_0) − ε)³ / (π_0(Θ_0) + ε)³ ]
        × [ ∏_{l=1}^{t} p_{Θ_1, Θ̂_{2,t}}(D_l) p_{Θ̂_{1,t}, Θ_2}(D_l) / ∏_{l=1}^{t} p_Θ(D_l) ]
        × [ ∫_{N_{ε,η}} ∏_{l=1}^{t} p_Θ(D_l) / ∫_{N_{ε,η}} ∏_{l=1}^{t} p_{Θ_1, Θ̂_{2,t}}(D_l) p_{Θ̂_{1,t}, Θ_2}(D_l) ].

(21) now gives us

    ∏_{l=1}^{t} (p_{Θ_0}(D_l) − ε)³ / ∏_{l=1}^{t} (p_{Θ_0}(D_l) + ε)³
      ≤ [ ∏_{l=1}^{t} p_{Θ_1, Θ̂_{2,t}}(D_l) p_{Θ̂_{1,t}, Θ_2}(D_l) / ∏_{l=1}^{t} p_Θ(D_l) ]
        × [ ∫_{N_{ε,η}} ∏_{l=1}^{t} p_Θ(D_l) / ∫_{N_{ε,η}} ∏_{l=1}^{t} p_{Θ_1, Θ̂_{2,t}}(D_l) p_{Θ̂_{1,t}, Θ_2}(D_l) ]
      ≤ ∏_{l=1}^{t} (p_{Θ_0}(D_l) + ε)³ / ∏_{l=1}^{t} (p_{Θ_0}(D_l) − ε)³.

Using the condition that √t p_{Θ_0}(D^{(t)}) as t → ∞ is bounded away from 0 and ∞, and choosing ε, η small, we have

    | f_t(Θ | D^{(t)}(ω)) / π_t(Θ | D^{(t)}(ω)) − 1 | < κ,

for all t > t_2, ω ∈ Ω. Finally,

    ∫ |π_t(Θ) − f_t(Θ)| ≤ ∫_{N_{ε,η}} |π_t(Θ) − f_t(Θ)| + ∫_{N_{ε,η}^c} |π_t(Θ) − f_t(Θ)|
                        ≤ ∫_{N_{ε,η}} |π_t(Θ) − f_t(Θ)| + 2η
                        ≤ π_t(N_{ε,η}) κ + 2η < κ + 2η.

Since κ and η can be chosen arbitrarily small, the claim follows.


B Accuracy Plots

[Figure 5 appears here: parameter accuracy plotted against time index (200–500), with accuracy ranging from 0.70 to 1.00, for the linear regression example (β1, β3, β4, σ²) and the one-way Anova example (ζ1, ζ2, ζ5, ζ10, µ, σ², τ²).]

Figure 5: Parameter accuracy plotted over time as defined in (2) for the motivating examples of Section 3.1. Row #1: accuracy for representative regression coefficients βj and σ² in the linear regression example. Row #2: accuracy for representative group means ζj, along with hierarchical parameters µ, τ² and σ² for the one-way Anova model.

References

Albert, J.H., and Chib, S. (1993), “Bayesian Analysis of Binary and Polychotomous Response Data,” Journal of the American Statistical Association, 88, 669-679.
Armagan, A., Dunson, D.B., and Lee, J. (2013), “Generalized Double Pareto Shrinkage,” Statistica Sinica, 23, 119-143.
Ahn, S., Korattikara, A. & Welling, M. (2012), “Bayesian Posterior Sampling via Stochastic Gradient Fisher Scoring,” International Conference on Machine Learning (ICML).
Arulampalam, M. S., Maskell, S., Gordon, N. & Clapp, T. (2002), “A Tutorial on Particle Filters for Online Nonlinear/Non-Gaussian Bayesian Tracking,” Signal Processing, IEEE Transactions on, 50, 174-188.
Broderick, T., Boyd, N., Wibisono, A., Wilson, A. C. & Jordan, M. I. (2013), “Streaming Variational Bayes,” arXiv:1307.6769.
Boyen, X. & Koller, D. (1998), “Tractable Inference for Complex Stochastic Processes,” Uncertainty in Artificial Intelligence (UAI).
Braun, M., and McAuliffe, J. (2010), “Variational Inference for Large-scale Models of Discrete Choice,” Journal of the American Statistical Association, 105, 324-335.
Chopin, N. (2002), “A Sequential Particle Filter Method for Static Models,” Biometrika, 89, 539-552.
Carvalho, C.M., Johannes, M.S., Lopes, H.F., and Polson, N.G. (2010), “Particle Learning and Smoothing,” Statistical Science, 25, 88-106.
Doucet, A., Godsill, S.J., and Andrieu, C. (2000), “On Sequential Monte Carlo Sampling Methods for Bayesian Filtering,” Statistics and Computing, 10, 197-208.
Guhaniyogi, R. & Dunson, D.B. (2013), “Bayesian Compressed Regression,” arXiv:1303.0642.
Hoffman, M. D., Blei, D. M. & Bach, F. (2010), “Online Learning for Latent Dirichlet Allocation,” Neural Information Processing Systems (NIPS).
Hoffman, M., Blei, D. M., Wang, C. & Paisley, J. (2012), “Stochastic Variational Inference,” arXiv:1206.7051.
Hall, P., Ormerod, J. T. & Wand, M. P. (2011), “Theory of Gaussian Variational Approximation for a Poisson Mixed Model,” Statistica Sinica, 21, 369-389.
Jaakkola, T.S., and Jordan, M.I. (2000), “Bayesian Logistic Regression: a Variational Approach,” Statistics and Computing, 10, 25-37.
Johannes, M., Polson, N. G. & Yae, S. M. (2010), “Particle Learning in Nonlinear Models using Slice Variables,” http://faculty.chicagobooth.edu/nicholas.polson/research/papers/Nonli.pdf.
Korattikara, A., Chen, Y. & Welling, M. (2013), “Austerity in MCMC Land: Cutting the Metropolis-Hastings Budget,” arXiv:1304.5299.
Lauritzen, S. L. (1992), “Propagation of Probabilities, Means and Variances in Mixed Graphical Association Models,” Journal of the American Statistical Association, 87, 1098-1108.
Lopes, H. F., Carvalho, C. M., Johannes, M. S. & Polson, N. (2010), “Particle Learning for Sequential Bayesian Computation,” Bayesian Statistics, 9, 2010.
Luts, J., and Ormerod, J.T. (2013), “Mean field variational Bayesian Inference for Support Vector Machine Classification,” under construction.
Luts, J., and Wand, M.P. (2014), “Variational Inference for Count Response Semiparametric Regression,” http://matt-wand.utsacademics.info/coupap.pdf.
Luts, J., Broderick, T. and Wand, M.P. (2013), “Real-time Semiparametric Regression,” Journal of Computational and Graphical Statistics, 23, 589-615.
Minka, T. P. (2013), “Expectation Propagation for Approximate Bayesian Inference,” http://research.microsoft.com/en-us/um/people/minka/papers/ep/minka-ep-uai.pdf.
Medlar, A., Glowacka, D., Stanescu, H., Bryson, K. & Kleta, R. (2013), “SwiftLink: Parallel MCMC Linkage Analysis using Multicore CPU and GPU,” Bioinformatics, 29, 413-419.
McCormick, T.M., Raftery, A.E., Madigan, D. & Burd, R.S. (2012), “Dynamic Logistic Regression and Dynamic Model Averaging for Binary Classification,” Biometrics, 68, 23-30.
Minka, T. P., Xiang, R. & Qi, Y. (2009), “Virtual Vector Machine for Bayesian Online Classification,” Uncertainty in Artificial Intelligence (UAI).
Opper, M. & Winther, O. (1999), “A Bayesian Approach to On-Line Learning,” On-Line Learning in Neural Networks, Cambridge University Press.
Ormerod, J. T. & Wand, M.P. (2010), “Explaining variational approximations,” The American Statistician, 64, 140-153.
Park, T. & Casella, G. (2008), “The Bayesian Lasso,” Journal of the American Statistical Association, 103, 681-686.
Tan, L.S.L., and Nott, D.J. (2014), “A Stochastic Variational Framework for Fitting and Diagnosing Generalized Linear Mixed Models,” arXiv:1208.4949.
West, M., and Harrison, J. (1997), “Bayesian Forecasting and Dynamic Models,” Springer, 2nd Edition.
Wang, L. & Dunson, D. B. (2011), “Fast Bayesian Inference in Dirichlet Process Mixture Models,” Journal of Computational and Graphical Statistics, 20, 196-216.
Welling, M. & Teh, Y. W. (2011), “Bayesian Learning via Stochastic Gradient Langevin Dynamics,” International Conference on Machine Learning (ICML).
Yang, Y. & Dunson, D. B. (2013), “Sequential Markov Chain Monte Carlo,” arXiv:1308.3861.
