
Bayesian nonparametric regression with varying residual density
Debdeep Pati, David B. Dunson
Department of Statistical Science, Duke University, NC 27708
email: [email protected], [email protected]
November 12, 2009

Abstract We consider the problem of robust Bayesian inference on the mean regression function allowing the residual density to change flexibly with predictors. The proposed class of models is based on a Gaussian process prior for the mean regression function and mixtures of Gaussians for the collection of residual densities indexed by predictors. Initially considering the homoscedastic case, we propose priors for the residual density based on probit stick-breaking (PSB) scale mixtures and symmetrized PSB (sPSB) location-scale mixtures. Both priors restrict the residual density to be symmetric about zero, with the sPSB prior more flexible in allowing multimodal densities. We provide sufficient conditions to ensure strong posterior consistency in estimating the regression function under the sPSB prior, generalizing existing theory focused on Gaussian residual distributions. The PSB and sPSB priors are generalized to allow residual densities to change nonparametrically with predictors through incorporating Gaussian processes in the stick-breaking components. This leads to a robust Bayesian regression procedure that automatically down-weights outliers and influential observations in a locally-adaptive manner. Posterior computation relies on an efficient data augmentation exact block Gibbs sampler. The methods are illustrated using simulated and real data applications.

Keywords: Data augmentation, exact block Gibbs sampler, Gaussian process, nonparametric regression, outliers, symmetrized probit stick breaking process


1. INTRODUCTION

Nonparametric regression offers a more flexible way of modeling the effect of covariates on the response compared to parametric models having restrictive assumptions on the mean function and the residual distribution. Here we consider a fully Bayesian approach. The response y ∈ Y corresponding to a set of covariates x = (x1 , x2 , . . . , xp )′ ∈ X can be expressed as

y = η(x) + ε,     (1)

where η(x) = E(y | x) is the mean regression function under the assumption that the residual density has mean zero, i.e., E(ε | x) = 0 for all x ∈ X . Our focus is on obtaining a robust estimate of η while allowing heavy tails to down-weight influential observations. We propose a class of models that allows the residual density to change nonparametrically with predictors x, with homoscedasticity arising as a special case. There is a substantial literature proposing priors for flexible estimation of the mean function, typically using basis function representations such as splines or wavelets (Denison et al. 2002). Most of this literature assumes a constant residual density, possibly up to a scale factor allowing heteroscedasticity. Yau and Kohn (2003) allow the mean and variance to change with predictors using thin plate splines. In certain applications, this structure may be overly restrictive due to the specific splines used and the normality assumption. Chan et al. (2006) also used splines for heteroscedastic regression, but with locally adaptive estimation of the residual variance and allowance for uncertainty in variable selection. Nott (2006) considered the problem of simultaneous estimation of the mean and variance function by using penalized splines for possibly non-Gaussian data. Due to the lack of conjugacy, these methods rely on involved sampling techniques using Metropolis-Hastings, requiring proposal distributions to be chosen that may not be efficient in all cases. The residual density is assumed to have a known parametric form and heavy-tailed distributions have not been considered. In addition, since basis function selection for multiple predictors is highly computationally demanding, additive assumptions are typically made that rule out interactions. Gaussian process (GP) regression (Adler 1990; Ghoshal and Roy 2006; Neal 1998) is an increasingly popular choice, which avoids the need to explicitly choose the basis functions, while having many appealing computational and theoretical properties. For articles describing some of these


properties, refer to Adler (1990), Cramér and Leadbetter (1967) and van der Vaart and Wellner (1967). A wide variety of functions can arise as the sample paths of the Gaussian process. GP priors can be chosen that have support on the space of all smooth functions while facilitating Bayes computation through conjugacy properties. In particular, the GP realizations at the data points are simply multivariate Gaussian. As shown by Choi and Schervish (2007), GP priors also lead to consistent estimation of the regression function under normality assumptions on the residuals. We extend their result to unknown residual distributions in Section 3. There is a rich literature on Bayesian methods for density estimation using mixture models of the form

yi ∼ f (θi ),   θi ∼ P,   P ∼ Π,     (2)

where f (·) is a parametric density and P is an unknown mixing distribution assigned a prior Π. The most common choice of Π is the Dirichlet process (Ferguson 1973; Ferguson 1974). Lo (1984) showed that Dirichlet process mixtures of normals have dense support on the space of densities with respect to Lebesgue measure, while Escobar and West (1995) developed methods for posterior computation and inference. James et al. (2005) considered a broader class of normalized random measures for Π. In order to combine methods for Bayesian nonparametric regression with methods for Bayesian density estimation, one can potentially use mixture model (2) for the residual density in (1). A number of authors have considered nonparametric priors for the residual distribution in regression. For example, Kottas and Gelfand (2005) proposed mixture models for the error distributions in median regression models. Lavine and Mockus (2005) allow both a regression function for a single predictor and the residual distribution to be unknown subject to a monotonicity constraint. A number of recent papers have focused on generalizing model (2) to the density regression setting in which the entire conditional distribution of y given x changes flexibly with predictors. Refer, for example, to Müller, Erkanli and West (1996), Griffin and Steel (2006), Griffin and Steel (2008), Dunson et al. (2007) and Dunson and Park (2008) among others. Although these approaches are clearly highly flexible, there are several issues that provide motivation for this article. First, to simplify inferences and prior elicitation, it is appealing to separate the mean function η(x) from the residual distribution in the specification, which is not accomplished

by the current density regression methods. In many applications, the main interest is in inference on η or in prediction, and the residual distribution can be considered as a nuisance. Second, we would like to be able to provide a specification with theoretical support. In particular, it would be appealing to show strong posterior consistency in estimating η without requiring restrictive assumptions on η or the residual distribution. Current density regression models lack such theoretical support. In addition, computation for density regression can be quite involved, particularly in cases involving more than a few predictors, and one encounters the curse of dimensionality in that the specifications are almost too flexible. Our goal is to obtain a computationally convenient specification that allows consistent estimation of the regression function, while being flexible in the residual distribution specification to obtain robust estimates. To accomplish this, we propose to place a Gaussian process prior on η and to allow the residual density to be unknown through a probit stick-breaking (PSB) process mixture. The basic PSB process specification was proposed by Chung and Dunson (2009) in developing a density regression approach that allows variable selection. Here, we propose four novel variants of PSB mixtures for the residual distribution. The first uses a scale mixture of Gaussians to obtain a prior with large support on unimodal symmetric distributions. The next is based on a symmetrized location-scale PSB mixture, which is more flexible in avoiding the unimodality constraint, while constraining the residual density to be symmetric and have mean zero. In addition, we show that this prior leads to strong posterior consistency in estimating η under weak conditions. To allow the residual density to change flexibly with predictors, we generalize the above priors through incorporating probit transformations of Gaussian processes in the weights. The last two prior specifications allow changing residual variances and tail heaviness with predictors, leading to a highly robust specification which is shown to have better performance in simulation studies and out of sample prediction. It will be shown in some small sample simulated examples that the heteroscedastic symmetrized location-scale PSB mixture leads to even more robust inference than the heteroscedastic scale PSB mixture without compromising out of sample predictive performance. Section 2 proposes the class of models under consideration. Section 3 shows consistency properties. Section 4 develops efficient posterior computation through an exact block Gibbs sampler.


Section 5 describes measures of influence to study robustness properties of our proposed methods. Section 6 contains simulation study results, Section 7 applies the methods to the Boston housing data and body fat data, and Section 8 discusses the results. Proofs are included in the Appendix.

2. NONPARAMETRIC REGRESSION MODELING

2.1 Data Structure and Model

Consider n observations with the ith observation recorded in response to the covariate xi = (xi1 , xi2 , . . . , xip )′. Let X = (x1 , . . . , xn )′ be the predictor matrix for all n subjects. The regression model can be expressed as

yi = η(xi ) + εi ,   εi ∼ fxi ,   i = 1, . . . , n.     (3)

We assume that the response y ∈ Y is continuous and x ∈ X where X ⊂ Rp is compact. Also, the residuals εi are sampled independently, with fx denoting the residual density specific to predictor value xi = x. We focus initially on the case in which the covariate space X is continuous, with the covariates arising from a fixed, non-random design or consisting of i.i.d. realizations of a random variable. We choose a prior on the regression function η(x) that has support on a large subset of C ∞ (X ), the space of smooth real valued X → R functions. The priors proposed for {fx , x ∈ X } will be chosen to have large support so that heavy-tailed distributions and outliers will automatically be accommodated, with influential observations downweighted in estimating η.

2.2 Prior on the Mean Regression Function

We assume that η ∈ F = {g : X → R is a continuous function}, with η assigned a Gaussian process (GP) prior, η ∼ GP(µ, c), where µ is the mean function and c is the covariance kernel. A Gaussian process is a stochastic process {η(x) : x ∈ X } such that any finite dimensional distribution is multivariate normal, i.e., for any n and x1 , . . . , xn , η(X) := (η(x1 ), . . . , η(xn ))′ ∼ N(µ(X), Ση ), where µ(X) = (µ(x1 ), . . . , µ(xn ))′ and Ση,ij = c(xi , xj ). Naturally the covariance kernel c(·, ·) must satisfy, for each n and x1 , . . . , xn , that the matrix Ση is positive definite. The smoothness of the covariance kernel essentially controls the smoothness of the sample paths of {η(x) : x ∈ X }. For an appropriate choice of c, a Gaussian process has large support in the space of all smooth functions. More precisely, the support of a Gaussian process is the reproducing kernel Hilbert space generated

by the covariance kernel with a shift by the mean function (Ghoshal and Roy 2006). For example, when X ⊂ R, the eigenfunctions of the univariate covariance kernel, c(x, x′) = (1/τ) exp{−κ(x − x′)2}, span C ∞ (X ) if κ is allowed to vary freely over R+. Thus we can see that the Gaussian process prior has a rich class of functions as its support and hence is appealing as a prior on the mean regression function. We follow common practice in choosing the mean function in the GP prior to correspond to a linear regression, µ(X) = Xβ, with β denoting unknown regression coefficients. As a commonly used covariance kernel, we took the Gaussian kernel c(x, x′) = (1/τ) exp{−κ||x − x′||2}, where τ and κ are unknown hyperparameters, with κ controlling the local smoothness of the sample paths of η(x). Smoother sample paths imply more borrowing of information from neighboring x values.
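As a concrete illustration of this covariance specification, the following sketch (our own Python illustration; the design points and the values of τ and κ are arbitrary placeholders, not quantities from the paper) builds the Gaussian covariance matrix at a set of design points and draws one prior realization of η(X).

```python
import numpy as np

def gaussian_kernel(X, tau, kappa):
    """Covariance matrix with entries (1/tau) * exp(-kappa * ||x_i - x_j||^2)."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-kappa * sq_dists) / tau

# Hypothetical design: n points, p covariates (values chosen only for illustration).
rng = np.random.default_rng(0)
n, p = 20, 2
X = rng.uniform(size=(n, p))
beta = np.zeros(p)                 # prior mean mu(X) = X beta
tau, kappa = 1.0, 5.0              # illustrative hyperparameter values

Sigma_eta = gaussian_kernel(X, tau, kappa)
# Small jitter keeps the covariance numerically positive definite.
eta_draw = rng.multivariate_normal(X @ beta, Sigma_eta + 1e-8 * np.eye(n))
```

In the actual model, τ and κ carry the hyperpriors described in Section 4 and are updated within the MCMC rather than fixed.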

2.3 Priors for Residual Distribution

Motivated by the problem of robust estimation of the regression function η, we consider five different types of priors for the residual distributions {fx , x ∈ X } as enumerated below. The first prior corresponds to the t distribution, which is widely used for robust modeling of residual distributions (West 1984; Lange 1989; Fonseca, Ferreira and Migon 2008), while the remaining priors are flexible nonparametric specifications.

1. Heavy tailed parametric error distribution: Following many previous authors, we first consider the case in which the residual distributions follow a homoscedastic Student-t distribution with unknown degrees of freedom. As the Student-t with low degrees of freedom is heavy tailed, outliers are allowed. By placing a hyperprior on the degrees of freedom, νσ ∼ Ga(aν , bν ), with Ga(a, b) denoting the Gamma distribution with mean a/b, one can obtain a data adaptive approach to down-weighting outliers in estimating the mean regression function. However, note that this specification assumes that the same degrees of freedom and tail-heaviness holds for all x ∈ X . Following West (1987), we express the Student-t distribution as a scale mixture of normals for ease in computation. In addition, we allow an unknown scale parameter, letting εi ∼ N(0, σ2/φi ), with φi ∼ Ga(νσ /2, νσ /2).

2. Homoscedastic PSB mixtures of Gaussians: Under the homoscedasticity and symmetric about zero assumption, we propose two nonparametric priors for the residual density fx = f for all x ∈ X . The first prior is a PSB scale mixture of Gaussians which enforces symmetry about

zero and unimodality, and the next is a symmetrized location-scale PSB mixture of Gaussians, which we develop to satisfy the symmetric about zero assumption while allowing multimodality. An advantage of using a PSB process specification instead of a Dirichlet process is that extensions to accommodate predictor-dependent residual distributions become straightforward to develop, with computation for such models being easy to implement using a data augmentation algorithm. Before proposing the priors, we first review the PSB process specification and its relationship to the Dirichlet process. Let (Y, B) be a complete and separable metric space, with Y the sample space and B the Borel σ-algebra of subsets. A probability measure P ∈ P on (Y, B) follows a probit stick-breaking process with base measure P0 if it has a representation of the form

P (·) = Σ∞h=1 πh δθh (·),   θh ∼ P0 ,     (4)

where the atoms {θh }∞h=1 are independent and identically distributed from P0 and the random weights are defined as πh = Φ(αh ) Π_{l<h} {1 − Φ(αl )}, with αh ∼ N(µα , σα2 ). Let Cm = {f : f is a density on R, symmetric about zero and unimodal, such that h(x) = f (√x), x > 0 is a completely monotone function}. A function h(x) on (0, ∞) is completely monotone in x if it is infinitely differentiable and (−1)m dm h(x)/dxm ≥ 0 for all x and for all m ∈ {1, 2, . . .}. Chu (1973) proved that if f is a density on R which is symmetric about zero and unimodal, it can be written as a scale mixture of normals,

f (x) = ∫ σ−1 φ(σ−1 x)g(σ)dσ     (7)

for some density g on R, if and only if h(x) = f (√x), x > 0, is a completely monotone function. This restriction places a smoothness constraint on f (x), but still allows a broad variety of densities.

Definition. Letting f ∼ Π, f0 is in the Kullback-Leibler (KL) support of Π if

Π( f : ∫ f0 (y) log{f0 (y)/f (y)} dy < ε ) > 0   ∀ ε > 0.     (8)

The set of densities f in the Kullback-Leibler support of Π is denoted by KL(Π). Let S̃s denote the subset of Ss corresponding to densities satisfying the following regularity conditions.

1. f is nowhere zero and bounded by M < ∞;
2. |∫_R f (y) log f (y) dy| < ∞;
3. |∫_R f (y) log{f (y)/ψ1 (y)} dy| < ∞, where ψ1 (y) = inf_{t∈[y−1,y+1]} f (t);
4. there exists ψ > 0 such that ∫_R |y|2(1+ψ) f (y) dy < ∞.

> 0 ∀ ε > 0, if η0 is continuous. In order to prove posterior consistency for our proposed model, we rely on a theorem of Amewou-Atisso et al. (2003), which is a modification of the celebrated Schwartz (1965) theorem to accommodate independent but not identically distributed data. In particular, we will prove that their first condition holds under a minimal assumption, which we refer to as A∗. This assumption modifies the assumption A presented by Amewou-Atisso et al. (2003). They focused on consistent estimation of parameters in a linear regression. As we are instead estimating a regression function nonparametrically, we require much more stringent restrictions on the predictors. If the predictors were random variables, identifiability requires that the xi ’s are drawn from a continuous distribution with support on X . In the fixed by design case, we instead need an infill design that adds additional design points as the sample size increases. Assumption A∗ formalizes the infill design restriction.

Assumption A∗ : For each x∗ ∈ X , there exists an ε∗ > 0 depending on x∗ such that for all {ε : ε > 0 and ε ≤ ε∗ }, the covariate values satisfy

lim inf_{n→∞} (1/n) Σni=1 I{xi ∈ Bε (x∗ )} > 0,

where Bε (x∗ ) = {x : dX (x, x∗ ) < ε}, dX being a metric on X . Assumption A∗ ensures that in the limit as the sample size gets very large, there will be subjects with covariates within an arbitrarily small neighborhood of any point in X . This is a minimal condition for identifiability, since we cannot expect to consistently estimate η(x) at x = x∗ without having a non-zero proportion of the observations within an arbitrarily small interval of the point x∗ without strong restrictions on η.

Theorem 3.2 Suppose η ∼ GP(µ, c), with µ as in Section 2.2, c as in Assumption 1∗ and f ∼ Πs , with Πs defined in (6). In addition, assume the data are drawn from the true density f0 (yi − η0 (xi )), with {xi } satisfying Assumption A∗ , f0 ∈ S̃s , η0 ∈ F, and f0 following the additional regularity conditions,

1. ∫_R y4 f0 (y) dy < ∞ and ∫_R f0 (y)| log f0 (y)|2 dy < ∞;
2. ∫_R f0 (y)| log{f0 (y)/ψ1 (y)}|2 dy < ∞, where ψ1 (y) = inf_{t∈[y−1,y+1]} f0 (t).


Let U be a weak neighborhood of f0 and W = U × {η : ||η − η0 || ≤ ∆}, with W ⊂ S̃s × F. Then the posterior probability

Π(W | y1 , . . . , yn , x1 , . . . , xn ) = ∫_W Πni=1 fηi (yi ) dΠs (f ) dπ(η) / ∫_{S̃s ×F} Πni=1 fηi (yi ) dΠs (f ) dπ(η) → 1 a.s.     (14)

Theorem 3.2 ensures weak consistency of the posterior of the residual density and strong consistency of the posterior of the regression function η.

4. POSTERIOR COMPUTATION

We provide details for posterior computation separately for the most important models. We first describe the choice of hyperparameters of the prior on the regression function.

Choice of hyperpriors: We choose the typical conjugate prior for the regression coefficients in the mean of the GP, β ∼ N(β0 , Σ0 ), where β0 = 0 and Σ0 = cI is a common choice corresponding to a ridge regression shrinkage prior. The prior on τ is given by τ ∼ Ga(ντ /2, ντ /2). We let κ ∼ Ga(ακ , βκ ) with small βκ and large ακ . Normalizing the predictors prior to analysis, we find that the data are quite informative about κ under these priors, so as long as the priors are not overly informative, inferences are robust.

4.1 Gaussian process regression with t residuals

Let Y = (y1 , . . . , yn )′, η = (η(x1 ), η(x2 ), . . . , η(xn ))′ and define a matrix T such that Tij = exp(−κ||xi − xj ||2). Hence Ση = (1/τ)T. Assume Ω = diag(1/φi : i = 1, . . . , n) and φ = (φ1 , . . . , φn )′.

Then we have

Y | η ∼ N(η, σ2 Ω),   η | β, τ, κ ∼ N(Xβ, τ−1 T),   β ∼ N(β0 , Σ0 ),
φi ∼ Ga(νσ /2, νσ /2),   νσ ∼ Ga(αν , βν ),   σ−2 ∼ Ga(a, b),
κ ∼ Ga(ακ , βκ ),   τ ∼ Ga(ντ /2, ντ /2).

Next we provide the full conditional distributions needed for Gibbs sampling. Due to conjugacy, η, β, σ−2 , φ and τ have closed form full conditional distributions, while νσ and κ are updated by


using Metropolis-Hastings steps within the Gibbs sampler.

η | Y, β, σ−2 , τ, κ, νσ , φ ∼ N((τ T−1 + σ−2 Ω−1 )−1 (τ T−1 Xβ + σ−2 Ω−1 Y), (τ T−1 + σ−2 Ω−1 )−1 )
β | Y, η, σ−2 , τ, κ, νσ , φ ∼ N((τ X′T−1 X + Σ0−1 )−1 (τ X′T−1 η + Σ0−1 β0 ), (τ X′T−1 X + Σ0−1 )−1 )
σ−2 | Y, η, β, τ, κ, νσ , φ ∼ Ga(n/2 + a, (1/2) Σni=1 φi (yi − ηi )2 + b)
τ | Y, η, β, σ−2 , κ, νσ , φ ∼ Ga((n + ντ )/2, (1/2){(η − Xβ)′T−1 (η − Xβ) + ντ })
φi | Y, η, β, σ−2 , κ, νσ ∼ Ga((νσ + 1)/2, (1/2){σ−2 (yi − ηi )2 + νσ })

4.2 Heteroscedastic PSB mixture of normals

First we need to describe the choice of hyperparameters in this case.

Choice of hyperparameters: We assume κα ∼ Ga(γκ , βκ ) and τα ∼ Ga(να /2, να /2). If the data yi are normalized, we can expect the overall variance to be close to one, so the variance of the residuals, Var(ε) = Σ∞h=1 πh τh−1 , should be less than one. We set ατ = 1 and choose a hyperprior on βτ , βτ ∼ Ga(1, k0 ) with k0 > 1 so that the prior mean of τh is significantly less than one. Different values of k0 are tried out to assess robustness of the posteriors.

For posterior computation, we propose a Markov chain Monte Carlo (MCMC) algorithm, which is a hybrid of data augmentation, the exact block Gibbs sampler of Papaspiliopoulos (2008) and Metropolis-Hastings sampling. Papaspiliopoulos (2008) proposed the exact block Gibbs sampler as an efficient approach to posterior computation in Dirichlet process mixture models, modifying the block Gibbs sampler of Ishwaran and James (2001) to avoid truncation approximations. The exact block Gibbs sampler combines characteristics of the retrospective sampler of Papaspiliopoulos and Roberts (2008) and the slice sampler (Walker 2007). We included the label switching moves introduced by Papaspiliopoulos and Roberts (2008) for better mixing. Introduce γ1 , . . . , γn such that πh (xi ) = P (γi = h), h = 1, 2, . . . , ∞. Then

γi ∼ Σ∞h=1 πh (xi ) δh = Σ∞h=1 1(ui < πh (xi )) δh ,

where ui ∼ U (0, 1). The MCMC steps are given below; a small illustrative sketch of the slice-based truncation used in steps 1 and 2 is given after step 5.

1. Update ui ’s and stick breaking random variables: Generate

ui | − ∼ U (0, πγi (xi )),     (15)

where πh (xi ) = Φ{αh (xi )} Π_{l<h} [1 − Φ{αl (xi )}]. Introduce latent variables Zh (xi ) such that γi = h if and only if (Zh (xi ) > 0, Zl (xi ) < 0 for l < h). Then

Zh (xi ) | − ∼ N(αh (xi ), 1) I_{R+} if h = γi ,   N(αh (xi ), 1) I_{R−} if h < γi .

Let Zh = (Zh (x1 ), . . . , Zh (xn ))′ and αh = (αh (x1 ), . . . , αh (xn ))′. Letting (Σα )ij = e−κα ||xi −xj || , Zh ∼ N(αh , I) and αh ∼ N(0, τα−1 Σα ), so that

αh | − ∼ N((τα Σα−1 + In )−1 Zh , (τα Σα−1 + In )−1 ).

Continue up to h = 1, . . . , h∗ = max{h∗1 , . . . , h∗n }, where h∗i is the minimum integer satisfying Σ_{l=1}^{h∗i} πl (xi ) > 1 − min{u1 , . . . , un }, i = 1, . . . , n. Now

τα | − ∼ Ga((nh∗ + να )/2, (1/2)(Σ_{l=1}^{h∗} αl′ Σα−1 αl + να )),

while κα is updated using a Metropolis-Hastings step.

2. Update allocation to atoms: Update (γ1 , . . . , γn ) | − as multinomial random variables with probabilities

P (γi = h) ∝ N(yi ; η(xi ), τh−1 ) I(ui < πh (xi )),   h = 1, . . . , h∗ .

3. Update component-specific locations and precisions: Letting nl = #{i : γi = l}, l = 1, 2, . . . , h∗ ,

τl | − ∼ Ga(nl /2 + ατ , βτ + (1/2) Σ_{i:γi =l} (yi − η(xi ))2 ),   l = 1, 2, . . . , h∗ ,
βτ | − ∼ Ga(1, Σ_{l=1}^{k∗} τl + k0 ).

4. Update the mean regression function: Letting Λ = diag(τγ1−1 , . . . , τγn−1 ),

η | − ∼ N((τ T−1 + Λ−1 )−1 (τ T−1 Xβ + Λ−1 Y), (τ T−1 + Λ−1 )−1 )
β | − ∼ N((τ X′T−1 X + Σ0−1 )−1 (τ X′T−1 η + Σ0−1 β0 ), (τ X′T−1 X + Σ0−1 )−1 )
τ | − ∼ Ga((n + ντ )/2, (1/2){(η − Xβ)′T−1 (η − Xβ) + ντ }).

5. Update κ in a Metropolis-Hastings step.
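As a rough illustration of the slice-based truncation and allocation updates in steps 1 and 2 above, the following Python sketch (our own illustration; all state variables are placeholders rather than output of the actual sampler) shows how the uniform slice variables determine a finite truncation level h∗ and how the allocations are then resampled from a finite multinomial.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def psb_weights(alpha_col):
    """pi_h = Phi(alpha_h) * prod_{l<h} {1 - Phi(alpha_l)} for one covariate value."""
    phi = norm.cdf(alpha_col)
    stick = np.concatenate(([1.0], np.cumprod(1.0 - phi)[:-1]))
    return phi * stick

# Hypothetical current state of the sampler (placeholders, not values from the paper).
n, H_max = 5, 50                       # observations and a generous upper bound on components
alpha = rng.normal(size=(H_max, n))    # current alpha_h(x_i) values
pi = np.column_stack([psb_weights(alpha[:, i]) for i in range(n)])   # pi_h(x_i)
y = rng.normal(size=n)                 # placeholder data
eta = np.zeros(n)                      # current mean regression values eta(x_i)
tau = np.ones(H_max)                   # current component precisions tau_h
gamma = np.zeros(n, dtype=int)         # current allocations gamma_i

# Step 1: slice variables u_i ~ U(0, pi_{gamma_i}(x_i)).
u = rng.uniform(0.0, pi[gamma, np.arange(n)])

# Truncation: h*_i is the smallest h with sum_{l<=h} pi_l(x_i) > 1 - min(u); h* = max_i h*_i.
cum = np.cumsum(pi, axis=0)
h_star = max(int(np.searchsorted(cum[:, i], 1.0 - u.min()) + 1) for i in range(n))

# Step 2: allocation update, P(gamma_i = h) proportional to N(y_i; eta(x_i), 1/tau_h) I(u_i < pi_h(x_i)).
for i in range(n):
    dens = norm.pdf(y[i], eta[i], np.sqrt(1.0 / tau[:h_star]))
    w = dens * (u[i] < pi[:h_star, i])
    gamma[i] = rng.choice(h_star, p=w / w.sum())
```

The point of the device is that, although the PSB mixture has infinitely many components, only components with index up to h∗ can receive positive allocation probability at the current iteration, so no truncation approximation is introduced.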

4.3 Heteroscedastic sPSB process location-scale mixture

We will need the following changes in the updating steps from the previous case.

1. Update allocation to atoms: Update (γ1 , . . . , γn ) | − as multinomial random variables with probabilities

P (γi = h) ∝ (1/2){N(yi ; η(xi ) + µh , τh−1 ) + N(yi ; η(xi ) − µh , τh−1 )} I(ui < πh (xi )),   h = 1, . . . , h∗ .

3. Update component-specific locations and precisions: Let nl = #{i : γi = l}, l = 1, 2, . . . , h∗ and ml = Σ_{i:γi =l} (yi − ηi ). The atoms of the base measure location are updated from a mixture of normals as

µl | − ∼ pl N((µ0 σ0−2 + τl ml )/(σ0−2 + nl τl ), 1/(σ0−2 + nl τl )) + (1 − pl ) N((µ0 σ0−2 − τl ml )/(σ0−2 + nl τl ), 1/(σ0−2 + nl τl )),

where pl ∝ exp{ (1/2)(µ0 σ0−2 + τl ml )/(σ0−2 + nl τl ) }. Similarly,

τl | − ∼ pl Ga(nl /2 + ατ , βτ + (1/2) Σ_{i:γi =l} {yi − η(xi ) − µl }2 ) + (1 − pl ) Ga(nl /2 + ατ , βτ + (1/2) Σ_{i:γi =l} {yi − η(xi ) + µl }2 ),

where pl ∝ {βτ + (1/2) Σ_{i:γi =l} (yi − η(xi ) − µl )2 }−(nl +ατ )/2 .

4. Update the mean regression function: Let Λ = diag(τγ1−1 , . . . , τγn−1 ), µ∗ = (µγ1 , µγ2 , . . . , µγn )′ and W = (τ T−1 + Λ−1 )−1 . Hence

η | − ∼ p N(η; W{τ T−1 Xβ + Λ−1 (Y − µ∗ )}, W) + (1 − p) N(η; W{τ T−1 Xβ + Λ−1 (Y + µ∗ )}, W),

where p ∝ exp[(1/2){(τ T−1 Xβ + Λ−1 (Y − µ∗ ))′W(τ T−1 Xβ + Λ−1 (Y − µ∗ )) − (Y − µ∗ )′Λ−1 (Y − µ∗ )}].
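The η update in step 4 is a draw from a two-component multivariate normal mixture. A minimal sketch of such a draw is given below (our own illustration; the means, covariance and weight are placeholders standing in for W{τT−1Xβ + Λ−1(Y ∓ µ∗)}, W and the normalized weight p computed from the expression above).

```python
import numpy as np

rng = np.random.default_rng(2)

def draw_from_two_component_mvn(p, mean1, mean2, cov):
    """Draw from p * N(mean1, cov) + (1 - p) * N(mean2, cov)."""
    mean = mean1 if rng.uniform() < p else mean2
    return rng.multivariate_normal(mean, cov)

# Hypothetical Gibbs quantities standing in for the step-4 update (placeholders only).
n = 4
cov = 0.1 * np.eye(n)                       # W
mean1, mean2 = rng.normal(size=n), rng.normal(size=n)
p = 0.6                                      # normalized mixture weight

eta_new = draw_from_two_component_mvn(p, mean1, mean2, cov)
```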

5. MEASURES OF INFLUENCE

There has been limited work on sensitivity of the posterior distribution to perturbations of the data and outliers. Arellano-Vallea et al. (2000) use deletion diagnostics to assess sensitivity, but their methods are computationally expensive in requiring posterior computation with and without data deleted. Weiss (1996) proposed an alternative that perturbs the posterior instead of the likelihood, and only requires samples from the full posterior. Following Weiss (1996), let f (yi | Θ̃, xi ) denote the likelihood of the data yi , and define

δi∗ (Θ̃) = f (yi + δ | Θ̃, xi ) / f (yi | Θ̃, xi ),     (16)

and let pi (Θ̃ | Y) denote a new perturbed posterior,

pi (Θ̃ | Y) = p(Θ̃ | Y) δi∗ (Θ̃) / E(δi∗ (Θ̃) | Y).     (17)

Denote by Li the influence measure, which is a divergence measure between the unperturbed posterior p(Θ̃ | Y) and the perturbed posterior pi (Θ̃ | Y),

Li = (1/2) ∫ |p(Θ̃ | Y) − pi (Θ̃ | Y)| dΘ̃.     (18)

Li is bounded and takes values in [0, 1]. When p(Θ̃ | Y) = pi (Θ̃ | Y), Li = 0, indicating that the perturbation has no influence. On the other hand, if Li = 1, the supports of p(Θ̃ | Y) and pi (Θ̃ | Y) are disjoint, indicating maximum influence. We can define an influence measure as L = (1/n) Σni=1 Li . Clearly L also takes values in [0, 1] with L = 0 ⇒ Li = 0 ∀ i = 1, 2, . . . , n. Also L = 1 ⇒ Li = 1 ∀ i = 1, 2, . . . , n. Weiss (1996) provided a sample version of Li , i = 1, . . . , n. Letting Θ̃1 , . . . , Θ̃M be the posterior samples with B the burn-in,

L̂i = (1/2) (1/(M − B)) Σ_{k=B+1}^{M} | δi∗ (Θ̃k ) / Ê(δi∗ (Θ̃)) − 1 |,     (19)

where Ê{δi∗ (Θ̃)} = (1/(M − B)) Σ_{k=B+1}^{M} δi∗ (Θ̃k ). Our estimated influence measure is L̂ = (1/n) Σni=1 L̂i . We will calculate the influence measure for our proposed methods and compare their sensitivity.
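As a rough illustration of how L̂i in (19) can be computed from MCMC output, the following Python sketch evaluates δi∗(Θ̃k) for each retained draw and averages as above; the Gaussian likelihood, the perturbation size δ and the placeholder draws are our own illustrative assumptions, not choices made in the paper.

```python
import numpy as np
from scipy.stats import norm

def influence_measures(y, eta_draws, sigma_draws, delta=0.1, burn_in=0):
    """Estimate L_hat_i = (1/2) * mean_k | delta_i*(Theta_k) / E_hat(delta_i*) - 1 | for each i.

    y           : (n,) observed responses
    eta_draws   : (M, n) posterior draws of eta(x_i)
    sigma_draws : (M,) posterior draws of the residual scale (Gaussian likelihood assumed here)
    """
    eta = eta_draws[burn_in:]
    sig = sigma_draws[burn_in:, None]
    # delta_i*(Theta_k) = f(y_i + delta | Theta_k, x_i) / f(y_i | Theta_k, x_i)
    log_ratio = norm.logpdf(y + delta, eta, sig) - norm.logpdf(y, eta, sig)
    d = np.exp(log_ratio)                       # shape (M - B, n)
    L_i = 0.5 * np.mean(np.abs(d / d.mean(axis=0) - 1.0), axis=0)
    return L_i, L_i.mean()                      # per-observation L_hat_i and overall L_hat

# Hypothetical usage with placeholder posterior draws.
rng = np.random.default_rng(3)
y = rng.normal(size=10)
eta_draws = rng.normal(size=(500, 10))
sigma_draws = np.abs(rng.normal(1.0, 0.1, size=500))
L_i, L_hat = influence_measures(y, eta_draws, sigma_draws)
```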

6. SIMULATION STUDIES

To assess the performance of our proposed approaches, we consider a number of simulation examples: (i) linear model, homoscedastic error with no outliers, (ii) linear model, homoscedastic error with outliers, (iii) linear model, heteroscedastic errors and outliers, (iv) non-linear model (a), heteroscedastic errors and outliers and (v) non-linear model (b), heteroscedastic errors and outliers. We let the heaviness of the tails and error variance change with x in cases (iii), (iv) and (v). We considered the following methods of assessing the performance, namely, mean squared prediction error (MSPE), coverage of 95% prediction intervals, mean integrated squared error (MISE) in estimating the regression function at the points for which we have data, pointwise coverage of 95% credible intervals for the regression function and the influence measure (L̂) as described in Section 5. We also consider a variety of sample sizes in the simulation, n = 40, 60, 80, and simulate 10 covariates independently from U(0, 1). Let z be a 10-dimensional vector of i.i.d. U(0, 1) random variables

independent of the covariates.

Generation of errors in heteroscedastic case and outliers: Let fxi (εi ) = pxi N(εi ; 0, 1) + qxi N(εi ; 0, 5) where pxi = Φ(xi′z). The outliers are simulated from the model with error distribution fxio (·), which is a mixture of truncated normal distributions as follows. In the heteroscedastic case,

fxio (εi ) = pxi TN(−∞,−3)∪(3,∞) (εi ; 0, 1) + qxi TN(−∞,−3√5)∪(3√5,∞) (εi ; 0, 5),

where TNR (· ; µ, σ2 ) denotes a truncated normal distribution with mean µ and standard deviation σ over the region R. We consider the following five cases.

1. Case (i): yi = 2.3 + 5.7x1i + εi , εi ∼ N(0, 1) with no outliers.
2. Case (ii): yi = 2.3 + 5.7x1i + εi , εi ∼ 0.95N(0, 1) + 0.05N(0, 10).
3. Case (iii): yi = 1.2 + 5.7x1i + 4.7x2i + 0.12x3i − 8.9x4i + 2.4x5i + 3.1x6i + 0.01x7i + εi , εi ∼ fxi , with 5% outliers generated from fxio (εi ).
4. Case (iv): yi = 1.2 + 5.7x1i + 3.4x1i^2 + 4.7x2i + 0.89x2i^2 + 0.12x3i − 8.9x4i x8i + 2.4x5i x9i + 3.1x6i + x6i^2 + 0.01x7i + εi , εi ∼ fxi with 5% outliers generated from fxio (εi ).
5. Case (v): yi = 1.2 + 5.7 sin x1i + 3.4 exp(x2i ) + 4.7 log |x3i | + εi , εi ∼ fxi with 5% outliers generated from fxio (εi ).

For each of the cases and for each sample size n, we took the first n/2 samples as the training set and the next n/2 samples as the test set. The hyperparameters are specified as follows.

1. Heavy tailed parametric error distribution: We described the choice of the hyperparameters in Section 4. We took β0 = 0, Σ0 = 5I2 , αν = 1, βν = 1, a = 0.5, b = 0.5, ατ = 5 and βτ = 1.

2. Heteroscedastic PSB or sPSB process scale mixture on the residual density: β0 = 0, Σ0 = 5I2 , αν = 1, βν = 1, a = 0.5, b = 0.5, ατ = 5, βτ = 1, γκ = 5, βκ = 1, να = 1 and k0 = 10.

We also compare the MSPE of the proposed methods with Lasso (Tibshirani 1996), Bayesian additive regression trees (Chipman et al. 2008), and Treed Gaussian processes (Gramacy and Lee 2007). The MCMC algorithms described in Section 4 are used to obtain samples from the posterior

distribution. The results for model 1 given here are based on 20,000 samples obtained after a burn-in period of 3,000. The results for models 2 and 3 are based on 20,000 samples obtained after a burn-in period of 7,000. Rapid convergence was observed based on diagnostic tests of Geweke (1992) and Raftery and Lewis (1992). In addition, the mixing was very good for model 1. For models 2 and 3, we use the label switching moves of Papaspiliopoulos and Roberts (2008), which lead to adequate mixing. Tables 1, 2 and 3 summarize the performance of all the methods based on 50 replicated datasets. Tables 1, 2 and 3 clearly show that in small samples both of the heteroscedastic methods (2 and 3) have substantially reduced MSPE and MISE relative to the heavy tailed parametric error model in most of the cases, interestingly even in the homoscedastic cases. This may be because a discrete mixture of Gaussians better approximates a single normal than a t-distribution in small samples. Methods 2 and 3 also did a better job than method 1 in allowing uncertainty in estimating the mean regression and predicting the test sample observations. In some cases, the heavy-tailed t residual distribution results in overly conservative predictive and credible intervals. As seen from the value of the influence statistic, the heteroscedastic PSB process mixtures result in more robust inference compared to the parametric error model, the sPSB process mixture of normals being more robust than the symmetric and unimodal version. As the sample size increases, the difference in the predictive performances between the parametric and the nonparametric models is reduced and in some cases the parametric error model performs as well as the nonparametric approaches, which is as expected given the Central Limit Theorem. Table 1 shows that, in the simple linear model with normal homoscedastic errors, all the models perform similarly in terms of mean squared prediction error, though methods 2 and 3 are somewhat better than the rest. Also, in estimating the mean regression function in case (i), methods 2 and 3 performed better than all the other methods. In case (ii) (Table 1), methods 2 and 3 are most robust in terms of estimation and prediction in presence of outliers. In cases (iii) and (iv), when the residual distribution is heteroscedastic, our methods 2 and 3 perform significantly better than the parametric model 1 in both estimation and prediction, since the heteroscedastic PSB mixture is very flexible in modeling the residual distribution. This is quite evident from the MSPE values under cases (iii) and (iv) in Table 2. Lasso did a poor job in estimating the mean


regression function and also in prediction, particularly in cases (iii) and (iv) when the underlying mean function is actually non-linear. Also, BART failed to perform well in estimating the mean function in small samples in these cases. On the other hand, GP based approaches perform quite well in these cases in estimating the regression function, with methods 2 and 3 performing better than the rest. Treed GP performed close to method 1 in estimation and prediction as both the methods are based on GP priors on the mean function and have a parametric error distribution. In not allowing heteroscedastic error variance, BART and Treed GP underestimate uncertainty in prediction, leading to overly narrow predictive intervals. In case (v) (Table 3), where the true model is generated using comparatively few true signals, Lasso and BART performed slightly better in terms of prediction than methods 2 and 3 in small samples. This may be due to the fact that Lasso can pick up the true signals quite efficiently in an overly parsimonious model. However, as the sample size increased, Lasso performed poorly while the GP prior on the mean can accommodate the non-linearity, resulting in substantially better predictive performance.

7. APPLICATIONS

7.1 Boston housing data application

To compare our proposed approaches to alternatives, we applied the methods to a commonly used data set from the literature, the Boston housing data. The response is the median value of the owner-occupied homes (measured in $1,000s) in 506 census tracts in the Boston area, and there are 13 predictors (12 continuous, 1 binary) that might help to explain the variation in the median value across tracts. We predict the median value of the owner-occupied homes, taking the first 253 tracts as the training set and the remaining 253 as the test set. Out of sample predictive performance of our three methods is compared to competitors in Table 4. The parametric model 1, the heteroscedastic PSB process mixture models 2 and 3 and the Lasso perform very closely to each other in terms of prediction and did better than BART and Treed GP. Methods 1 and 2 even perform slightly better than method 3 and Lasso. As in the simulation examples, BART and Treed GP underestimate the uncertainty in prediction. On the other hand, the predictive intervals of the methods 1, 2 and 3 are more conservative and accommodate uncertainty in predicting regions


with outliers quite flexibly. Also, model 3 appears to be more robust compared to models 1 and 2 in terms of the influence measure.

7.2 Body fat data application

With the increasing trend in obesity and concerns about associated adverse health effects, such as heart disease and diabetes, it has become even more important to obtain accurate estimates of body fat percentage. It is well known that body mass index, which is calculated based only on weight and height, can produce a misleading measure of adiposity as it does not take into account muscle mass or variability in frame size. As a gold standard for measuring percentage of body fat, one can rely on under water weighing techniques, and age and body circumference measurements have also been widely used as additional predictors. We consider a commonly-used data set from Statlib (http://lib.stat.cmu.edu/datasets/bodyfat), which contains the following 15 variables: percentage of body fat (%), body density from underwater weighing (gm/cm3 ), age (year), weight (lbs.), height (inches), and ten body circumferences (neck, chest, abdomen, hip, thigh, knee, ankle, biceps, forearm, wrist, all in cm). Percentage of body fat is given from Siri's (1956) equation:

Percentage of body fat = 495/Density − 450.     (20)
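For instance, under equation (20) an illustrative body density of 1.07 gm/cm3 (an arbitrary value, not one from the data set) corresponds to roughly 12.6% body fat:

```python
def siri_body_fat(density):
    """Percentage of body fat from body density (gm/cm^3) via Siri's equation (20)."""
    return 495.0 / density - 450.0

print(siri_body_fat(1.07))   # about 12.6 for an illustrative density of 1.07
```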

We predict the percentage of body fat (%), taking the first 126 observations as the training set and the remaining 126 as the test set. We summarize the predictive performances in Table 4. Table 4 suggests that the nonparametric regression procedures with heteroscedastic residual distributions (methods 2 and 3) perform better than the parametric model 1, BART, Lasso and Treed GP in predicting the percentage of body fat.

8. DISCUSSION

Thus we have developed a novel regression model that can accommodate a large range of nonlinearity in the mean function and at the same time can flexibly deal with outliers and heteroscedasticity. Based on preliminary simulation results, it appears that our method can outperform contemporary nonparametric regression methods, such as BART and treed Gaussian processes, with the performance also better than Lasso in certain linear regression settings. We also provide theoretical support for the proposed methodology when both the mean and the residuals are modeled

nonparametrically. One possible future direction is to relax the symmetry assumption on the residual distribution and introduce a model for median regression based on conditional PSB mixtures allowing possibly asymmetric residual densities constrained to have zero median. Conditional DP mixtures are well known in the literature (Doss 1985; Burr and Doss 2005) and it is certainly interesting to extend our approach via a conditional PSB. In that way we can hope to obtain a more robust estimate of the regression function. It is challenging to extend our theoretical results to conditional PSB and develop a fast algorithm for computation. Another possible theoretical direction is to prove posterior consistency using heteroscedastic mixtures. Currently we only have results for the homoscedastic PSBP mixture. This raises the question of choosing an appropriate norm on the conditional densities. Choi (2005) introduced certain norms on the space of conditional densities by considering the joint distribution on (x, y), which may be appropriate.

9. ACKNOWLEDGEMENT

This research was partially supported by grant number R01 ES017240-01 from the National Institute of Environmental Health Sciences (NIEHS) of the National Institutes of Health (NIH).

10. APPENDIX

Proof of Lemma 2.1 It follows from Chu (1973) that

f ∈ Cm ⇔ f (x) = ∫ σ−1 φ(σ−1 x)g(σ)dσ

for some density g on R+ . Recall from Ongaro and Cattaneo (2004) that a collection of random weights {πh }∞h=1 with Σ∞h=1 πh = 1 a.s. is said to have a full support if for any m ≥ 1, (π1 , . . . , πm ) admits a positive joint density with respect to Lebesgue measure on the simplex {(p1 , . . . , pm ) : Σmi=1 pi ≤ 1}. Ongaro and Cattaneo (2004) showed that if the πh ’s have a full support, the weak support of

P = Σ∞h=1 πh δθh ,   θh ∼ G0 ,


is the set of all probability measures whose support is contained in the support of G0 . Since

(π1 , . . . , πm ) =d (Φ(α1 ), Φ(α2 ){1 − Φ(α1 )}, . . . , Φ(αm ) Π_{i=1}^{m−1} {1 − Φ(αi )}),   αi ∼ N(µα , σα2 ),

the πh ’s have a full support and hence the weak support of P = Σ∞h=1 πh δτh defined in (5) is all probability measures on R+ . It follows that the weak support of the induced prior Πu on Su , denoted by wk(Πu ), is precisely Cm .

Proof of Lemma 2.2 It follows from Tokdar (2006) that if we can show that the weak support of Πs contains all probability measures symmetric about zero and having compact support, then f ∈ S̃s ⇒ f ∈ KL(Πs ). The argument given in Lemma 2.1 shows that the weak support of the PSB prior in (4) is the set of all probability measures on R × R+ . Now we will show that an arbitrary P̃ s is in a weak neighborhood of P s if P̃ is in a weak neighborhood of P . We state a lemma to prove our claim.

Lemma 1 Let P̃n be a sequence of probability measures and P̃ be a fixed probability measure. Then (P̃n ⇒ P̃ ) ⇒ (P̃ns ⇒ P̃ s ), with P̃ns and P̃ s the symmetrised versions of P̃n and P̃ , respectively, where the symmetrizing operation is as defined in (6).

Proof Assume P̃n ⇒ P̃ . We have to show that for any bounded continuous function φ on R × R+ ,

∫ φ(t, τ ) dP̃ns (t, τ ) → ∫ φ(t, τ ) dP̃ s (t, τ )   as n → ∞.

Now, Z

Z Z 1 1 s ˜ ˜ φ(t, τ )dPn (t, τ ) = φ(t, τ )dPn (t, τ ) + φ(t, τ )dP˜n (−t, τ ) 2 2 Z ª 1© = φ(t, τ ) + φ(−t, τ ) dP˜n (t, τ ) 2 © ª Since ψ(t, τ ) = 12 φ(t, τ ) + φ(−t, τ ) is also a bounded continuous function and P˜n ⇒ P˜ , Z

∫ (1/2){φ(t, τ ) + φ(−t, τ )} dP̃n (t, τ ) → ∫ (1/2){φ(t, τ ) + φ(−t, τ )} dP̃ (t, τ ) = ∫ φ(t, τ ) dP̃ s (t, τ )

as n → ∞. This completes the proof of Lemma 1. Lemma 1 in fact shows that the weak support of Πs contains all probability measures symmetric about zero. With an appeal to Tokdar (2006), f ∈ S̃s ⇒ f ∈ KL(Πs ).

Proof of Theorem 3.2 In order to prove the theorem we will show the following.

1. There is an exponentially consistent sequence of tests for H0 : (f, η) = (f0 , η0 ) against H1 : (f, η) ∈ W.

2. For all δ > 0 and for almost every data sequence {yi , xi }∞i=1 ,

(Π × π){(f, η) : Ki (f, η) < δ ∀ i, Σ∞i=1 Vi (f, η)/i2 < ∞} > 0.

Then the result would follow from Theorem 2.1 of Amewou-Atisso et al. (2003). In order to verify 1, we will write the alternative region of the test as a disjoint union of two easily tractable regions. The particular form of W that is of interest to us is W1 ∪ W2 , where for any ∆ > 0,

W1 = U c × {η : ||η − η0 ||∞ ≤ ∆},   W2 = {(f, η) : ||η − η0 ||∞ > ∆}.

We will establish the existence of a consistent sequence of tests for each of these regions by considering the following variants of Proposition 3.1 and Proposition 3.3 of Amewou-Atisso et al. (2003).

Proposition 2 If Assumption A∗ holds, then there exists an exponentially consistent sequence of tests for H0 : (f, η) = (f0 , η0 ) against H1 : (f, η) ∈ W2 .

Proof Since ||η − η0 ||∞ > ∆, there exists an x0 such that |η(x0 ) − η0 (x0 )| > ∆. So by continuity of η and η0 , there exists a δ such that for all x′ ∈ Bδ (x0 ), |η(x′ ) − η0 (x′ )| > δ. Let Kn = #{i : 1 ≤ i ≤ n, |η(xi ) − η0 (xi )| > δ} and Rn = #{i : 1 ≤ i ≤ n, xi ∈ Bδ (x0 )}. If i ∈ Kn , it follows from Amewou-Atisso et al. (2003) that there exists a set Ai and a constant C > 0 depending on δ and f0 such that αi := Pf0 i (Ai ) ≤ 1/2 − C and γi := inf (f,η)∈W2 Pf ηi (Ai ) ≥ 1/2. If i ≤ n and i ∉ Kn , set Ai = R, so that αi = γi = 1. Thus

lim inf_{n→∞} (1/n) Σni=1 (γi − αi ) ≥ lim inf_{n→∞} (1/n) Σ_{i∈Kn} C ≥ C lim inf_{n→∞} #Kn /n ≥ C lim inf_{n→∞} #Rn /n > 0

by Assumption A∗ . The result follows from Lemma 3.1 of Amewou-Atisso et al. (2003) with Φi = IAi .

Proposition 3 There exists an exponentially consistent sequence of tests for H0 : (f, η) = (f0 , η0 ) against H1 : (f, η) ∈ W1 .

Proof Without loss of generality take

U = {f : |∫ Φ(y)f (y)dy − ∫ Φ(y)f0 (y)dy| < ε},

where 0 ≤ Φ ≤ 1 and Φ is uniformly continuous. Given ε > 0, there exists δ > 0 such that |y1 − y2 | < δ ⇒ |Φ(y1 ) − Φ(y2 )| < ε/2. Set Φ̃i (y) = Φ{y − η0 (xi )}. Notice that Ef0i Φ̃i = Ef0 Φ. Now

Efηi Φ̃i = ∫ Φ̃i (y)fηi (y)dy = ∫ Φ(y)f [y − {η(xi ) − η0 (xi )}]dy
≥ ∫ Φ[y − {η(xi ) − η0 (xi )}]f [y − {η(xi ) − η0 (xi )}]dy − ∫ |Φ(y) − Φ[y − {η(xi ) − η0 (xi )}]| f [y − {η(xi ) − η0 (xi )}]dy
≥ ∫ Φ(y)f (y)dy − ε/2
≥ Ef0 Φ + ε/2

for any f ∈ U c . We apply Lemma 3.1 of Amewou-Atisso et al. (2003) to complete the proof.

It remains to verify the second sufficient condition of Theorem 3.2. Under the assumptions, it follows from Lemma 2.2 that f0 ∈ KL(Πs ). We will present an important lemma which is similar to Lemma 5.1 of Tokdar (2006). It guarantees that K(f0 , fθ ) and V (f0 , fθ ) are continuous at θ = 0. First we state and prove some properties of the prior Πs described in (6) which will be used to prove the lemma.

Lemma 4 If Πs is the prior described in (6) and P0 (t, τ ) = N(t; µ0 , σ02 ) × Ga(τ ; ατ , βτ ), with ατ > 0 and βτ > 0, then

∫ τ dP s (t, τ ) < ∞ a.s.,  ∫ t2 dP s (t, τ ) < ∞ a.s.,  ∫ τ t2 dP s (t, τ ) < ∞ a.s.,  −∞ < ∫ (log τ )dP s (t, τ ) < ∞ a.s.     (A.1)

Proof

E{∫ τ dP s (t, τ )} = (1/2) ∫_{τ >0, t∈R} τ N(t; µ0 , σ02 )Ga(τ ; ατ , βτ )dt dτ + (1/2) ∫_{τ >0, t∈R} τ N(t; −µ0 , σ02 )Ga(τ ; ατ , βτ )dt dτ = ∫_{τ >0} τ Ga(τ ; ατ , βτ )dτ < ∞.

The proofs of ∫ t2 dP s (t, τ ) < ∞ a.s. and ∫ τ t2 dP s (t, τ ) < ∞ a.s. are similar. Since ατ > 0, choose an integer m large enough such that ατ > 1/m.

E{∫ (log τ )dP s (t, τ )} = ∫_{τ >0} (log τ )Ga(τ ; ατ , βτ )dτ = C ∫_{τ >0} (log τ )τ^{ατ −1} e^{−βτ τ} dτ = C ∫_{τ >0} (τ^{1/m} log τ )τ^{ατ −1/m−1} e^{−βτ τ} dτ > −∞,

since τ^{1/m} log τ is bounded in [0, 1]. Also ∫_{τ >0} (log τ )τ^{ατ −1} e^{−βτ τ} dτ ≤ ∫_{τ >0} τ · τ^{ατ −1} e^{−βτ τ} dτ < ∞.

Lemma 5 Under the conditions of Theorem 3.2, if f (·) = ∫ N(· ; t, τ−1 )dP s (t, τ ) and fθ (y) = f (y − θ), then

1. limθ→0 ∫ f0 (y) log{f0 (y)/fθ (y)} dy = ∫ f0 (y) log{f0 (y)/f (y)} dy;
2. limθ→0 ∫ f0 (y) {log+ (f0 (y)/fθ (y))}2 dy = ∫ f0 (y) {log+ (f0 (y)/f (y))}2 dy.

Proof Clearly √τ φ{√τ (y − θ − t)} → √τ φ{√τ (y − t)} as θ → 0. Since ∫ √τ φ{√τ (y − θ − t)} dP s (t, τ ) ≤ (1/√(2π)) ∫ √τ dP s (t, τ ) < ∞, by DCT fθ (y) → f (y) as θ → 0. Hence

log{f0 (y)/fθ (y)} → log{f0 (y)/f (y)}  and  {log+ (f0 (y)/fθ (y))}2 → {log+ (f0 (y)/f (y))}2  as θ → 0.

To apply DCT again, we have to bound the function | log fθ (y)| by an integrable function. Now

| log fθ (y)| ≤ log √(2π) + | log ∫ √τ exp{−(τ/2)(y − t − θ)2} dP s (t, τ )|.

Let c = ∫ √τ dP s (t, τ ) < ∞. Then

| log ∫ √τ exp{−(τ/2)(y − t − θ)2} dP s (t, τ )| ≤ | log c| + | log ∫ (√τ/c) exp{−(τ/2)(y − t − θ)2} dP s (t, τ )|.

Now since ∫ √τ exp{−(τ/2)(y − t − θ)2} dP s (t, τ ) ≤ c, | log ∫ (√τ/c) exp{−(τ/2)(y − t − θ)2} dP s (t, τ )| = − log ∫ (√τ/c) exp{−(τ/2)(y − t − θ)2} dP s (t, τ ).

Hence, by Jensen's inequality applied to − log x, we get

− log ∫ (√τ/c) exp{−(τ/2)(y − t − θ)2} dP s (t, τ ) ≤ log c − (1/2) ∫ (log τ )dP s (t, τ ) + (1/2) ∫ τ (y − t − θ)2 dP s (t, τ ).

Now since θ → 0, w.l.o.g. assume |θ| ≤ 1. Hence

∫ τ (y − t − θ)2 dP s (t, τ ) ≤ 4( y2 ∫ τ dP s (t, τ ) + ∫ τ t2 dP s (t, τ ) + 1 )
⇒ | log fθ (y)| ≤ log √(2π) + | log c| + log c − (1/2) ∫ (log τ )dP s (t, τ ) + 2( y2 ∫ τ dP s (t, τ ) + ∫ τ t2 dP s (t, τ ) + 1 ),

which is clearly f0 -integrable according to the assumptions of the lemma and from the properties of Πs proved in Lemma 4. Similarly | log fθ (y)|2 can be bounded by an f0 -integrable function. The conclusion of the lemma follows from a simple application of DCT.

Lemma 2.2 together with assumption (2) of Theorem 3.2 guarantees Π{f : K(f0 , f ) < δ, V (f0 , f ) < ∞} > 0 for all δ > 0. Since (A.1) holds, we may assume Π(U) > 0, where

U = {f : K(f0 , f ) < δ, V (f0 , f ) < ∞, (A.1) holds}.     (A.2)

Now for every f (·) = ∫ N(· ; t, τ−1 )dP s (t, τ ) ∈ U, using Lemma 5, choose δf such that for |θ| < δf , K(f0 , fθ ) < 2K(f0 , f ), V (f0 , fθ ) < 2V (f0 , f ).

Now if ||η − η0 || < δf , |η(xi ) − η0 (xi )| < δf for i = 1, . . . , n. So if f ∈ U and ||η − η0 || < δf , we have

Ki (f, η) = ∫ f0i log(f0i /fηi ) = ∫ f0 log{f0 /f(η−η0 )i } < 2K(f0 , f ),
Vi (f, η) = ∫ f0i {log+ (f0i /fηi )}2 = ∫ f0 {log+ (f0 /f(η−η0 )i )}2 < 2V (f0 , f ).

From (A.2) and Lemma 3.1 we have Π{(f, η) : f ∈ U, ||η − η0 || < δf } > 0. Hence

Π{(f, η) : Ki (f, η) < 2δ ∀ i, Σ∞i=1 Vi (f, η)/i2 < ∞} > 0.

This ensures weak consistency of the posterior of the residual density and strong consistency of the posterior of the regression function η.

REFERENCES Adler, R. J. (1990), “An Introduction to Continuity, Extrema, and Related Topics for General Gaussian Processes,” in IMS Lecture Notes-Monograph Series, Vol. 12, Hayward, CA: Academic Press. Amewou-Atisso, M., Ghoshal, S., Ghosh, J. K., and Ramamoorthi, R. V. (2003), “Posterior consistency for semi-parametric regression problems,” Bernoulli, 9(2), 291–312. Arellano-Vallea, R. B., Galea-Rojasb, M., and Zuazola, P. I. (2000), “Bayesian sensitivity analysis in elliptical linear regression models,” Journal of Statistical Planning and Inference, 86, 175– 199. Burr, D., and Doss, H. (2005), “A Bayesian Semiparametric Model for Random-Effects MetaAnalysis,” Journal of the American Statistical Association, 100, 242–251. Chan, D., Kohn, R., Nott, D., and Kirby, C. (2006), “Locally adaptive semiparametric estimation of the mean and variance functions in regression models,” Journal of Computational and Graphical Statistics, 15, 915–936. Chipman, H. A., George, E. I., and McCulloch, R. E. (2008), “Bayesian Additive Regression Trees,” Journal of the Royal Statistical Society Series B, . Choi, T. (2005), “Posterior Consistency in Nonparametric Regression problems in Gaussian Process Priors,” in Dissertation, Department of Statistics, Carnegie Mellon University: Academic Press. Choi, T., and Schervish, M. J. (2007), “On posterior consistency in nonparametric regression problems,” Journal of Multivariate Analysis, 10, 1969–1987. Chu, K. C. (1973), “Estimation and detection in linear systems with elliptical errors.,” IEEE Trans. Auto. Control, 18, 499–505. Chung, Y., and Dunson, D. B. (2009), “Nonparametric Bayes conditional distribution modeling with variable selection,” Journal of the American Statistical Association, to appear, .


Cram´er, H., and Leadbetter, M. R. (1967), Stationary and Related Stochastic Processes, Sample Function Properties and their Applications, New York: John Wiley & Sons. Denison, D., Holmes, C., Mallick, B., and Smith, A. F. M. (2002), Bayesian methods for nonlinear classification and regression, London: Wiley & Sons. Doss, H. (1985), “Bayesian Nonparametric Estimation of the Median: Part I: Computation of the Estimates,” The Annals of Statistics, 13, 1432–1435. Dunson, D. B., and Park, J.-H. (2008), “Kernel stick-breaking processes,” Biometrika, 95, 307–323. Dunson, D. B., Pillai, N., and Park, J.-H. (2007), “Bayesian density regression,” Journal of the Royal Statistical Society, Series B, 69, 163–183. Escobar, M. D., and West, M. (1995), “Bayesian Density Estimation and Inference Using Mixtures,” Journal of the American Statistical Association, 90, 577–588. Ferguson, T. S. (1973), “A Bayesian Analysis of Some Nonparametric Problems,” The Annals of Statistics, 1, 209–230. Ferguson, T. S. (1974), “Prior Distributions on Spaces of Probability Measures,” The Annals of Statistics, 2, 615–629. Fonseca, T. C. O., Ferreira, M. A. R., and Migon, H. S. (2008), “Objective Bayesian analysis for the Student-t regression model,” Biometrika, 95, 325–333. Geweke, J. (1992), “Evaluating the Accuracy of Sampling-Based Approaches to the Calculation of Posterior Moments,” Bayesian Statistics, 4, 169–193. Ghoshal, S., and Roy, A. (2006), “Posterior consistency of Gaussian process prior in nonparametric binary regression,” The Annals of Statistics, 34, 2413–2429. Gramacy, R. B., and Lee, K. H. H. (2007), “Bayesian treed Gaussian process models with an application to computer modeling,” , . Griffin, J., and Steel, M. F. J. (2006), “Order-Based Dependent Dirichlet Processes,” Journal of the American Statistical Association, Theory and Methods, 101, 179–194. 29

Griffin, J., and Steel, M. F. J. (2008), “Bayesian Nonparametric Modelling with the Dirichlet Process Regression Smoother,” Statistica Sinica, . Ishwaran, H., and James, L. (2001), “Gibbs Sampling Methods for Stick-Breaking Priors,” Journal of the American Statistical Association, 96, 161–173. James, L. F., Lijoi, A., and Pr¨ unster, I. (2005), “Bayesian nonparamentric inference via classes of normalized random measures,” Tech. Rep. ICER WP 2005/5, . Kottas, A., and Gelfand, A. E. (2005), “Bayesian Semiparametric Median Regression Modeling,” Journal of the American Statistical Association, 96, 1458–1468. Lange, K. L. (1989), “Robust statistical modeling using the t-distribution.,” Journal of the American Statistical Association, 84, 881–896. Lavine, M., and Mockus, A. (2005), “A nonparametric Bayes method for isotonic regression,” Journal of Statistical Planning and Inference, 46, 235–248. Lo, A. Y. (1984), “On a class of Bayesian nonparametric estimates. I: Density estimates,” The Annals of Statistics, 12, 351–357. M¨ uller, P., Erkanli, A., and West, M. (1996), “Bayesian curve fitting using multivariate normal mixtures,” Biometrika, 83, 67–79. Neal, R. J. (1998), “Regression and Classification using Gaussian Process Priors,”, Oxford: Oxford University Press, pp. 475–501. Nott, D. (2006), “Semiparametric estimation of mean and variance functions for non-Gaussian data,” Computational Statistics, 21, 603–320. Ongaro, A., and Cattaneo, C. (2004), “Discrete random probability measures:a general framework for nonparametric Bayesian inference,” Statistics & Probability Letters, 67, 33–45. Papaspiliopoulos, O. (2008), “A note on posterior sampling from Dirichlet mixture models,” in Technical Report, Department of Economics, Universitat Pompeu Fabra, Barcelona: Academic Press. 30

Papaspiliopoulos, O., and Roberts, G. . (2008), “Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models,” Biometrika, 95, 169–183. Raftery, A. E., and Lewis, S. (1992), “How Many Iterations in the Gibbs Sampler?,” Bayesian Statistics, 4, 763–773. Sethuraman, J. (1994), “A Constructive Definition of Dirichlet Priors,” Statistica Sinica, 4, 639– 650. Tibshirani, R. (1996), “Regression Shrinkage and Selection via the Lasso,” Journal of the Royal Statistical Society Series B, 58, 267–288. Tokdar, S. T. (2006), “Posterior Consistency of Dirichlet Location-scale Mixture of Normals in Density Estimation and Regression,” Sankhy¯ a, 68, 90–110. van der Vaart, A. W., and Wellner, J. A. (1967), Weak Convergence and Empirical Processes, New York: Springer-Verlag. Walker, S. G. (2007), “Sampling the Dirichlet Mixture Model with Slices,” Communications in StatisticsSimulation and Computation, 36, 45–54. Weiss, R. (1996), “An Approach to Bayesian Sensitivity Analysis,” Journal of the Royal Statistical Society. Series B (Methodological), 58, 739–750. West, M. (1984), “Outlier models and prior distributions in Bayesian linear-regression,” Journal of the Royal Statistical Society Series B, 46, 431–439. West, M. (1987), “On scale mixtures of normal-distributions,” Biometrika, 74, 646–648. Wu, Y., and Ghoshal, S. (2008), “Kullback Leibler property of kernel mixture priors in Bayesian density estimation,” Electronic Journal of Statistics, 2, 298–331. Yau, P., and Kohn, R. (2003), “Estimation and variable selection in nonparametric heteroscedastic regression,” Statistics and Computing, 13, 191–208.


Table 1: Simulation results under homoscedastic residuals (Cases (i) and (ii)) n=40

Case (i)

Case (ii)

MSPE

cov(y)a

MISE

cov(η)b

L

MSPE

cov(y)

MISE

cov(η)

L

Method 1c

0.2997

1

0.0248

1

0.0017

0.6043

1

0.0232

1

0.0027

Method 2d

0.2821

0.9980

0.0141

1

0.0015

0.5983

0.9740

0.0173

1

0.0019

Method 3e

0.2798

1

0.0144

1

0.0015

0.5987

0.9745

0.0169

1

0.0017

Lasso

0.4651

BART

0.3510

Treed GP

0.1934

0.6410

0.1080

0.6866

0.0714

0.7051

0.7845

0.0950

0.3042

0.9134

0.0256

0.6968

0.9365

0.0803

n=60

MSPE

cov(y)

MISE

cov(η)

L

MSPE

cov(y)

MISE

cov(η)

L

Method 1

0.2990

1

0.0246

1

0.0019

0.5776

1

0.0242

1

0.0023

Method 2

0.2769

0.9947

0.0103

1

0.0017

0.5471

0.95

0.0143

0.97

0.0016

Method 3

0.2752

0.9963

0.0104

1

0.0016

0.5541

0.95

0.0141

0.98

0.0016

Lasso

0.4715

BART

0.3314

Treed GP

0.1974

0.6702

0.1194

0.6753

0.0539

0.6725

0.7777

0.1098

0.3000

0.9193

0.0218

0.6880

0.9301

0.1198

n=80

MSPE

cov(y)

MISE

cov(η)

L

MSPE

cov(y)

MISE

cov(η)

L

Method 1

0.2913

1

0.0252

1

0.0021

0.5583

1

0.0172

1

0.0022

Method 2

0.2592

0.9940

0.0086

1

0.0021

0.4989

0.97

0.0050

1

0.0014

Method 3

0.2574

0.9956

0.0069

1

0.002

0.4898

0.98

0.0067

1

0.0010

Lasso

0.4318

BART

0.3128

Treed GP

0.2886

0.1756

0.6569

0.6525

0.0437

0.6509

0.7815

0.1098

0.9301

0.0175

0.6532

0.9224

0.1031

a

0.1150

cov(y) denotes the coverage of the 95% predictive intervals of the test cases cov(η) denotes the coverage of the 95% credible intervals of the mean regression function c GP on mean and t residual distribution d GP on mean and heteroscedastic PSB process scale mixtures as residual distribution e GP on mean and heteroscedastic sPSB process mixtures as residual distribution b


Table 2: Simulation results under heteroscedastic residuals (Cases (iii) and (iv)) n=40

Case (iii)

Case (iv)

MSPE

cov(y)

MISE

cov(η)

L

MSPE

cov(y)

MISE

cov(η)

L

Method 1

0.4833

1

0.3612

1

0.0027

0.4416

1

0.3274

1

0.0029

Method 2

0.2570

0.9990

0.1394

1

0.0025

0.2783

0.9923

0.1583

0.98

0.0023

Method 3

0.2586

0.9990

0.1298

1

0.0025

0.2712

0.9867

0.1501

0.97

0.0017

Lasso

0.3219

BART

0.4639

Treed GP

0.1970

0.3140

0.1863

0.8444

0.3413

0.4103

0.8833

0.2675

0.3320

0.7834

0.1979

0.3548

0.8268

0.2108

n=60

MSPE

cov(y)

MISE

cov(η)

L

MSPE

cov(y)

MISE

cov(η)

L

Method 1

0.2254

1

0.1154

1

0.0023

0.2367

1

0.1067

1

0.0021

Method 2

0.1744

0.9973

0.0572

1

0.0020

0.2178

1

0.0562

0.97

0.0019

Method 3

0.1712

0.9878

0.0567

1

0.0016

0.2099

1

0.0656

0.98

0.0017

Lasso

0.2958

BART

0.3429

Treed GP

0.1830

0.3025

0.1543

0.8546

0.2217

0.3385

0.9122

0.1799

0.2047

0.8349

0.0779

0.2611

0.8867

0.0899

n=80

MSPE

cov(y)

MISE

cov(η)

L

MSPE

cov(y)

MISE

cov(η)

L

Method 1

0.1636

1

0.0454

1

0.0018

0.1855

1

0.0346

1

0.0019

Method 2

0.1509

0.9976

0.0373

0.95

0.0015

0.1653

1

0.0321

0.9952

0.0014

Method 3

0.1578

0.9931

0.0324

1

0.0013

0.1614

1

0.0312

0.9932

0.0010

Lasso

0.2592

BART

0.2284

Treed GP

0.1655

0.1437

0.2798

0.9265

0.1098

0.2491

0.9490

0.1083

0.8876

0.0427

0.2022

0.8923

0.0548


0.1373

Table 3: Simulation results under heteroscedastic residuals (Case (v)) n=40

Case (v) MSPE

cov(y)

MISE

cov(η)

L

Method 1

0.6666

0.9800

0.5856

1

0.0033

Method 2

0.5233

0.9770

0.3980

0.9812

0.0025

Method 3

0.5231

0.9854

0.3745

0.9765

0.0019

Lasso

0.3713

BART

0.4956

0.8980

0.4013

Treed GP

0.7224

0.8123

0.6132

n=60

MSPE

cov(y)

MISE

cov(η)

L

Method 1

0.3828

1

0.2911

0.9985

0.0031

Method 2

0.3745

0.9832

0.2617

0.9840

0.0022

Method 3

0.3767

0.9812

0.2601

0.9867

0.0020

Lasso

0.3532

BART

0.3930

0.9313

0.2668

Treed GP

0.4225

0.9023

0.3217

n=80

MSPE

cov(y)

MISE

cov(η)

L

Method 1

0.3599

0.9901

0.2759

0.9998

0.0029

Method 2

0.3503

0.9762

0.2582

0.9765

0.0022

Method 3

0.3519

0.9712

0.2545

0.9715

0.0019

Lasso

0.4505

BART

0.3594

0.9442

0.2867

Treed GP

0.4489

0.9125

0.3509

0.2871

0.2616

0.2751


Table 4: Boston housing data and body fat data results Boston housing data

body fat data

Methods

MSPE

cov(y)

L

corr(Ytest, Ypred)a

MSPE

cov(y)

L

corr(Ytest, Ypred)

Method 1

0.0012

0.99

0.0034

0.9894

0.0055

1

0.0020

0.9972

Method 2

0.0013

0.99

0.0027

0.9901

0.0031

1

0.0017

0.9984

Method 3

0.0016

0.99

0.0020

0.9863

0.0029

1

0.0017

0.9989

Lasso

0.0015

0.9909

0.0184

BART

0.0024

0.92

0.9836

0.0355

0.95

0.9655

Treed GP

0.0053

0.91

0.9524

0.1526

0.98

0.9250

a

0.9909

corr(Ytest, Ypred) denotes the sample correlation between the test y and the predicted y

