A general approach to heteroscedastic linear regression

David S. Leslie
Department of Mathematics, University of Bristol, University Walk, Bristol, BS8 1TW, United Kingdom

Robert Kohn
Faculty of Business, University of New South Wales, UNSW Sydney 2052, Australia

David J. Nott
School of Mathematics, University of New South Wales, UNSW Sydney 2052, Australia

December 5, 2006

Abstract

Our article presents a general treatment of the linear regression model, in which the error distribution is modelled nonparametrically and the error variances may be heteroscedastic, thus eliminating the need to transform the dependent variable in many data sets. The mean and variance components of the model may be either parametric or nonparametric, with parsimony achieved through variable selection and model averaging. A Bayesian approach is used for inference with priors that are data-based so that estimation can be carried out automatically with minimal input by the user. A Dirichlet process mixture prior is used to model the error distribution nonparametrically; when there are no regressors in the model, the method reduces to Bayesian density estimation, and we show that in this case the estimator compares favourably with a well-regarded plug-in density estimator. We also consider a method for checking the fit of the full model. The methodology is applied to a number of simulated and real examples and is shown to work well.

Keywords: density estimation; Dirichlet process mixture; heteroscedasticity; model checking; nonparametric regression; variable selection.
1 Introduction
The linear regression model is used extensively in statistical applications. The basic model assumes that the covariates enter linearly and the errors are normally distributed with constant variance. Data analysts often overcome the problems of non-normal errors or non-constant variance by transforming the dependent variable, for example by using a monotonic transformation. However, assuming that the regression model has normal errors and constant variance after a monotonic transformation implies that in the original (untransformed) scale all moments of the dependent variable are functions of the mean; see for example Ruppert et al. (2003). Furthermore, theoretical considerations often require that the model is fitted on the original scale rather than the transformed scale.

Our article provides a general approach to the linear regression problem. The error distribution is modelled nonparametrically using a mixture of normal distributions and the variance can be a function of covariates. The model incorporates variable selection in the mean and variance components. This allows us to discover parsimonious representations of the mean and the variance functions when such parsimony exists. In particular, the methodology allows the recovery of a model with constant variance and normal errors when this is the true model. In its most general form the methodology also allows the mean and the variance of the response to be modelled nonparametrically. Our article does so by combining Bayesian density estimation (Escobar and West, 1995) with variable selection techniques and regression splines (Kohn et al., 2001) and semiparametric modelling of heteroscedasticity (Chan et al., 2005). The model in our article is

y_i = x_i'β + exp{z_i'δ/2} ε_i,    (1)
where the ε_i are independent and identically distributed (though not necessarily normal), x_i and z_i are given vectors of regressors for observation y_i, and β and δ are unknown vectors of regression coefficients. We assume throughout this article that both the observations and the regressors have been mean-corrected (i.e. Σ_{i=1}^n y_i = 0, Σ_{i=1}^n x_ij = 0 and Σ_{i=1}^n z_ik = 0 for each j and k), and that no intercept term is included in either x_i or z_i, so that the model is fully identified (see also the comments at the start of Section 3). We specify a general noise distribution by using a Dirichlet process mixture prior for the distribution of ε_i (Ferguson, 1973; Escobar and West, 1995), for which the closure of the support of the implied prior on the distribution of ε_i is the set of all distributions on the real line (Lo, 1984). Indeed, in the absence of any regressors the regression problem becomes a density estimation problem, and our method is very similar to that of Escobar and West (1995). We investigate this density estimation problem, introducing a data-based prior distribution to give a plug-in Bayesian density estimator, and compare the method with the theoretically optimal plug-in kernel density estimator of Sheather and Jones (1991). We find that in the examples considered here, the Bayesian density
estimator consistently performs well in comparison with the Sheather and Jones (1991) estimator. We incorporate this successful density estimation technique into the regression model (1) by placing prior distributions on the regression coefficients β and δ that are independent of the prior for the ε_i. The priors we use are similar to those of Chan et al. (2005), in which an atom at 0 is present for each coefficient, resulting in Bayesian variable selection. Note that if there are no regressors for the variance component the problem reduces to a linear regression with unknown noise distribution, the problem considered by Kuo and Mallick (1997), with median regression forms of the problem considered by several researchers (including Walker and Mallick, 1999; Kottas and Gelfand, 2001; Hanson and Johnson, 2002). However, in applying their model Kuo and Mallick (1997) restrict the generality of their noise distributions by fixing the variance terms σ_i² to be the same for all data points. In this article we will show that removing this restriction results in a regression technique that outperforms the more advanced recent approach of Hanson and Johnson (2002) (in a precise sense to be defined). In addition to the general noise distributions used by Kuo and Mallick (1997), we allow for the variance of the noise to depend on regressors. Our model for heteroscedasticity has been examined extensively in the frequentist literature, in which iterative methods are used to fit the model; a convenient summary is provided by Carroll and Ruppert (1988). The Bayesian approach to the problem is achieved using Markov chain Monte Carlo (MCMC), as in Chan et al. (2005), which uses essentially the same model as that studied here but where the noise terms are assumed to be normally distributed. We demonstrate the efficiency of our full method on several simulated and real data sets, including some which require the introduction of the locally adaptive regression spline approach of Kohn et al. (2001) to model both mean and variance functions non-parametrically.

Using a Dirichlet process prior in a mean regression context has previously been discussed by West et al. (1994) and Mukhopadhyay and Gelfand (1997), among others. However, these papers use the Dirichlet process to mix over regression coefficients. In contrast, we consider traditional linear regression in which all observations are assumed to come from the same model (i.e. only one β for all observations), and use the Dirichlet process mixture prior to give a nonparametric model for the noise terms ε_i. In this sense, our work is closest in flavour to semi-parametric median regression approaches which consider mixtures of zero-median distributions, using either Dirichlet process mixing (Kottas and Gelfand, 2001), Polya trees (Walker and Mallick, 1999), or a mixture of Polya trees (Hanson and Johnson, 2002) to allow non-parametric noise, although these approaches do not allow completely general noise distributions. We believe that retaining the traditional mean regression framework is advantageous for consistency with previous work, and also show that the predictive densities resulting from our conjugate Dirichlet process framework actually outperform those from the more complex approaches above. An additional benefit of our method over these approaches is the use of weakly informative data-based priors,
which means that the method becomes available to scientists who may be unprepared (or unable) to manually specify priors for such a complex model. Most importantly, the full heteroscedastic approach introduced in this article removes the necessity to transform responses so that variances are the same across all sets of responses, in addition to removing the need to transform so that responses are normally distributed. Perhaps the work closest in intention to ours in this respect is the recent article of Kottas and Krnjajic (2005), studying general quantile regression, in which a dependent Dirichlet process prior is placed on the noise terms to allow completely different noise distributions at different parts of the regression space. However, to incorporate this degree of generality it is necessary to have several observations for each value of the regressors, and furthermore the noise is restricted to be unimodal; both restrictions are unnecessary for our model.

Our article is structured as follows. Section 2 briefly reviews the density estimation model of Escobar and West (1995) and specifies our data-based prior parameters, then compares the resulting density estimator with the Sheather and Jones (1991) density estimator. Section 3 completes the specification of the full heteroscedastic linear regression model (1). Section 4 compares estimates with those resulting from reintroducing the assumption of homoscedasticity, or of normally distributed errors, or both, in various simulated and real examples. For the simulated examples, we show that little power is lost by using the general regression model on data generated according to a restricted model, whereas data generated from the full model are fitted poorly by a restricted regression model. In the real data sets, we see that for some examples the assumption of Gaussian errors is upheld by our procedure, whereas in others the assumption is shown to be false. Technical details of the sampling scheme are given in Appendix A.
2 Density estimation with Dirichlet process mixture priors
This section revisits the density estimation model of Escobar and West (1995) in order to introduce the notation necessary for the more complex examples considered later in the paper. Additionally, a data-based prior is developed which results in a plug-in density estimator. We compare the density estimates produced with those resulting from the Sheather and Jones (1991) plug-in kernel density estimator on various simulated data sets, and see that our estimator is significantly slower, but provides more accurate density estimates.

We place a Dirichlet process mixture prior on the distribution of y_i = ε_i, resulting in the hierarchical model

ε_i | µ_i, σ_i² ∼ N(µ_i, σ_i²),
(µ_i, σ_i²) | G ∼ G,
G ∼ DP(αG_0),    (2)
where N(µ, σ²) is a normal distribution with mean µ and variance σ², and DP(αG_0) is a Dirichlet process with parameter αG_0 (Ferguson, 1973). Lo (1984) shows that if the support of G_0 is R × R+ (so that positive density is given to any valid values of µ and σ²) then the closure of the support of this prior on the distribution of ε_i is the set of all distributions on the real line; following Escobar and West (1995), we will therefore define G_0 by taking σ² ∼ IG(a_σ, b_σ) and µ | σ² ∼ N(0, τ²σ²), where IG(a, b) denotes an inverse Gamma distribution with shape parameter a and mean b/(a − 1) for a > 1, which allows us to make use of MacEachern's (1994) sampling for conjugate base distributions. While our use of conjugate distributions may appear to result in a restrictive model, the results of Lo (1984), and experimental evidence presented below, show that the model is sufficiently flexible to capture a wide range of data distributions.

As observed by Richardson and Green (1997), it is necessary to use informative priors in the case of mixture models, and we follow that paper in choosing a weakly informative data-based prior. We take a_σ = 2 so that the prior distribution for y is heavy-tailed but has finite variance. The parameter b_σ specifies the prior for the variance terms σ², and is difficult to fix directly based on simple summaries of the data. Hence we place a weakly informative conjugate hyperprior on b_σ, and specify the range of this hyperprior using the variance of the observed data: we take b_σ ∼ Ga(0.5, 1/Var(y)), where Ga(a, b) denotes a Gamma distribution with shape parameter a and mean a/b, so that E(σ²) = E(b_σ) = Var(y)/2. Note that this means that scaling the data will not change the estimates produced by the method. We also place a hyperprior on τ²; again this is chosen to be diffuse but so that the prior variance of y is finite, and we choose τ² ∼ IG(2, 1) for this purpose. Note that with these choices, if Y is drawn from the prior then

Var(Y) = E(Var(Y | σ², τ²)) + Var(E(Y | σ², τ²)) = E((1 + τ²)σ²) + Var(0) = (1 + E(τ²))E(σ²) = Var(y),

and the prior variance is the same as the observed data variance. We complete the prior specification by taking α ∼ Ga(2, 4). While it is known (Antoniak, 1974) that the choice of parameters for this prior can have a significant effect on the number of components in the resulting mixture of normal distributions, in this work we are interested in the resulting density estimates and not the number of components. We have experimented with changes to all of the prior parameters specified, and observed that the resulting inference is not significantly affected unless the parameters are changed by an order of magnitude. Note that the complete model is nearly identical to that used by Escobar and West (1995), except for the introduction of the data-based hyperprior on b_σ, which means that the method can be applied without the need for the user to specify prior parameters, and can therefore be compared with plug-in kernel density estimators (Sheather and Jones, 1991).
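To make the data-based specification concrete, the following sketch (illustrative Python, not the implementation used for the paper; the function and dictionary keys are ours) sets the hyperparameters from an observed sample and verifies the variance identity above analytically.

```python
import numpy as np

def data_based_prior(y):
    """Data-based hyperparameters for the Dirichlet process mixture prior,
    following the choices described in the text (an illustrative sketch)."""
    var_y = np.var(y)
    return {
        "a_sigma": 2.0,                        # sigma^2 ~ IG(2, b_sigma): heavy tails, finite variance
        "b_sigma_shape": 0.5,                  # b_sigma ~ Ga(0.5, 1/Var(y)), so E(b_sigma) = Var(y)/2
        "b_sigma_rate": 1.0 / var_y,
        "tau2_shape": 2.0, "tau2_scale": 1.0,  # tau^2 ~ IG(2, 1), so E(tau^2) = 1
        "alpha_shape": 2.0, "alpha_rate": 4.0, # alpha ~ Ga(2, 4)
    }

rng = np.random.default_rng(0)
y = 3.0 * rng.standard_normal(200)             # any observed sample
y = y - y.mean()                               # mean-corrected, as assumed throughout
prior = data_based_prior(y)

# analytic check of Var(Y) = (1 + E(tau^2)) E(sigma^2) = Var(y)
e_sigma2 = prior["b_sigma_shape"] / prior["b_sigma_rate"]   # E(sigma^2) = E(b_sigma) = Var(y)/2
e_tau2 = 1.0                                                # mean of IG(2, 1)
print((1 + e_tau2) * e_sigma2, np.var(y))                   # the two numbers agree
```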
Markov chain Monte Carlo (MCMC) sampling for this density estimation model has been well studied (MacEachern, 1994; Escobar and West, 1995). We combine these techniques with the updates of hyperparameters used by Richardson and Green (1997) to give a very simple sampling scheme, full details of which are given in Appendix A. While more sophisticated approaches to sampling for this model are available (Green and Richardson, 2001; Dahl, 2003) we found it unnecessary to apply those techniques in this case. The output of the sampling scheme for the density estimation is the posterior mean density.

We demonstrate the effectiveness of this density estimator with some examples using several data sets of 100 simulated points. In each case we compare the performance with that of the Sheather and Jones (1991) plug-in (SJPI) kernel density estimator by comparing the recovered densities with the true data-generating densities using estimates of both the Kullback–Leibler divergence and the L2 distance, defined respectively as

KL(p̂, p) = ∫ p(x) log( p(x) / p̂(x) ) dx,
L2(p̂, p) = ( ∫ ( p̂(x) − p(x) )² dx )^{1/2},

where both integrals are over the real line.
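In practice these quantities are approximated numerically; a minimal sketch of such an approximation (ours, using the trapezoidal rule on a grid with hypothetical densities) is:

```python
import numpy as np

def trapz(f, x):
    """Trapezoidal rule for integrating the values f over the grid x."""
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(x)))

def kl_divergence(p_hat, p, x):
    """KL(p_hat, p) = int p log(p / p_hat) dx, evaluated on the grid x."""
    mask = p > 0
    ratio = np.where(mask, p / np.maximum(p_hat, 1e-300), 1.0)
    return trapz(np.where(mask, p * np.log(ratio), 0.0), x)

def l2_distance(p_hat, p, x):
    """L2(p_hat, p) = ( int (p_hat - p)^2 dx )^(1/2), evaluated on the grid x."""
    return np.sqrt(trapz((p_hat - p) ** 2, x))

# example: a slightly too wide normal estimate of a standard normal density
x = np.linspace(-10.0, 10.0, 4001)
p = np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)
p_hat = np.exp(-0.5 * (x / 1.1) ** 2) / (1.1 * np.sqrt(2 * np.pi))
print(kl_divergence(p_hat, p, x), l2_distance(p_hat, p, x))
```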
The MCMC burn-in was 1000 iterations, with the following 1000 iterations being used to produce the density estimate; these run lengths were judged sufficient by experimenting with longer run lengths and multiple starting points, none of which affected the outcome of the analyses. See Section 4 for a more formal convergence analysis. The distributions tested were a standard normal distribution, a t distribution with 3 degrees of freedom, a skewed distribution in which y_i = log(η_i/2) where the η_i are sampled from a χ²_2 distribution, and a mixture of two normal distributions with density 0.3φ(x; −7, 0.25) + 0.7φ(x; 0, 25), where φ(x; a, b) is the density of the normal distribution with mean a and variance b evaluated at x. Plots of these densities are shown in Fig. 2. For each set of data, density estimation was carried out using both the Dirichlet process mixture model and the SJPI estimator, and the ratios log(KL_DP / KL_SJPI) and log(L2_DP / L2_SJPI) were calculated. Boxplots of these ratios calculated from 100 simulated data sets for each density are given in Fig. 1, where values less than 0 signify that the Dirichlet process estimate is closer to the true density than the SJPI estimate. For each test distribution the density estimate with median value of KL_DP is shown in Fig. 2 along with the Sheather and Jones estimate for the same data. (The KL divergence of the SJPI estimate for this normally distributed data set was the 61st lowest out of the hundred normally distributed data sets, for the t_3 distribution it was the 57th lowest, for the skewed distribution it was the 27th lowest, and for the bimodal example it was the 75th lowest.) It can be seen that the model developed in this section, which is a simple extension and illustration of the method demonstrated by Escobar and West (1995) using data-based priors, is sufficiently flexible to capture the main features of these four very different distributions, and compares favourably with the well-regarded Sheather and Jones (1991) method.
Figure 1: Boxplots of the log ratios of distance/divergence from the true densities from the Dirichlet process mixture model and the SJPI estimator.
Of course, this should be balanced by the fact that the Dirichlet process method takes approximately 500 times longer to run than the Sheather and Jones method (∼ 40 seconds against ∼ 0.08 seconds in Matlab 7.0 on a 2.4GHz Intel Pentium IV). We mention here that for the test suite of densities introduced by Marron and Wand (1992) our algorithm also compares favourably with the Sheather and Jones method, in terms of distances/divergences from the truth (results not reported).
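For reference, the four test distributions described above can be simulated as follows (our illustrative sketch; the estimators themselves are not reproduced here).

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_test_data(name, n=100):
    """Draw n observations from one of the four test distributions of Section 2."""
    if name == "normal":
        return rng.standard_normal(n)
    if name == "t3":
        return rng.standard_t(df=3, size=n)
    if name == "skew":                       # y = log(eta/2) with eta ~ chi^2_2
        return np.log(rng.chisquare(df=2, size=n) / 2.0)
    if name == "bimodal":                    # 0.3 N(-7, 0.25) + 0.7 N(0, 25)
        heavy = rng.random(n) < 0.3
        return np.where(heavy, rng.normal(-7.0, 0.5, n), rng.normal(0.0, 5.0, n))
    raise ValueError(f"unknown test density: {name}")

data_sets = {name: simulate_test_data(name) for name in ("normal", "t3", "skew", "bimodal")}
```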
3 Heteroscedastic linear regression with a general noise distribution
Since the Dirichlet process mixture prior discussed in Section 2 appears to support a range of densities, we expect that incorporating this prior into the full heteroscedastic regression model (1) will result in a regression that will perform reasonable inference regardless of the distribution of the noise terms ε_i. We place priors on the parameters β and δ that are similar to those of Chan et al. (2005) but that do not depend on the distribution of the noise; therefore, conditional on β and δ, we can update the properties of the noise distribution exactly as in the case of density estimation. Conditional on properties of the noise distribution, updating the regression parameters is analogous to the scheme of Chan et al. (2005) for heteroscedastic linear regression with normally distributed noise.

Recall that we assume that the data y and columns of the design matrix X are mean-corrected. This ensures that an intercept term is not needed, and so the identification problem of Kuo and Mallick's (1997) model disappears.
Figure 2: Density estimates for the normal distribution (top left), t3 distribution (top right), skew distribution (bottom left) and bimodal distribution (bottom right). In each case the true density is a dotted line, the Dirichlet process density estimate is a dashed line and the Sheather–Jones plug-in estimator is a solid line.
Furthermore, although the introduction of the heteroscedastic term exp{z_i'δ/2}ε_i without the constraint E(ε_i) = 0 may appear to introduce nonlinearities into the model, the following argument shows that the estimated density for ε_i will have mean approximately 0, so that this effect is not great. Consider sampling new observations y_i* = x_i'β + exp{z_i'δ/2}ε_i*, where the ε_i* are drawn independently from the same distribution as the ε_i. (Note that µ_i and σ_i² are latent variables for y_i, so that we do not use them for sampling y_i*; instead we generate ε_i* from the full posterior distribution.) Then, defining ȳ* = n^{−1} Σ_{i=1}^n y_i*,

E(ȳ*) = [ (1/n) Σ_{i=1}^n exp{z_i'δ/2} ] E(ε_i*),

remembering that the x_i have been mean-corrected. However, since the y_i* have been drawn from the same model as the mean-corrected y_i, for large samples we must have E(ȳ*) ≈ 0. Furthermore, since the ε_i* are independent and identically distributed from the same distribution as the ε_i, we must have E(ε_i) ≈ 0.
3.1 Prior specification
Taking the full heteroscedastic model (1), we follow Chan et al. (2005) by specifying the prior for δ through a vector of indicator variables K such that K_l = 0 if and only if δ_l = 0. The prior for the number of ones in K is beta-binomial (Kohn et al., 2001), with parameters selected so that P(K_l = 1) = π_K independently for each l with π_K ∼ U[0, 1]. Writing δ_K for the subvector of δ corresponding to K_l = 1, the prior for δ is completed by taking

δ_K | K, c_δ ∼ N(0, c_δ I_{|K|}),   and   c_δ ∼ IG(1, 1),

where I_{|K|} is the identity matrix with |K| = Σ_{l=1}^p K_l columns. Analogously, we specify the prior for β through a vector J of indicator variables, such that J_l = 0 if and only if β_l = 0, and the prior for J is the same as that for K. Writing β_J for the non-zero subvector of β, X_J for the corresponding submatrix of the design matrix X, and D(δ) = exp{diag(Zδ/2)}, note that D(δ)^{−1} y = D(δ)^{−1} X_J β_J + e, where the components of e are independent and identically distributed. We therefore specify the prior for β_J by taking

β_J | J, δ ∼ N(0, c_β (X_J' D(δ)^{−2} X_J)^{−1}).
This prior is very similar to that of Chan et al. (2005), but with one modification: the automatic scaling in their prior for β provided by multiplying the covariance by σ² is no longer available here, since there is no longer a single σ² (due to the Dirichlet process mixture prior for the ε_i). Hence we modify the prior distribution for c_β so that it compensates for the loss of the automatic scaling. We take c_β ∼ IG(1, 2nVar(y)) so that the prior is diffuse but has a mode at nVar(y): the n compensates for the increased precision resulting from (X_J' D(δ)^{−2} X_J)^{−1} as the number of observations n increases, whereas the Var(y) is a surrogate for having σ² present in the prior: see Cripps et al. (2006) for a similar technique of choosing a diffuse prior centred about a sensible value given the data. This completes the prior specification. Details of the MCMC sampling scheme used to generate samples from the posterior distribution of this model are given in Appendix A. Again, the main output of the sampler is density estimates at various sets of regressors. Samples of the regression coefficients β and δ are also provided, to help interpret the model.
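To illustrate the structure of this conditional prior, the sketch below (illustrative Python, assuming mean-corrected design matrices X and Z; the function name and example values are ours) builds the covariance c_β(X_J' D(δ)^{−2} X_J)^{−1} and draws β_J from it.

```python
import numpy as np

def sample_beta_J(X, J, Z, delta, c_beta, rng):
    """Draw beta_J from its conditional prior N(0, c_beta (X_J' D(delta)^{-2} X_J)^{-1}),
    where D(delta) = diag(exp(z_i' delta / 2)) and J is a boolean inclusion vector."""
    X_J = X[:, J]
    d_inv2 = np.exp(-Z @ delta)                        # diagonal of D(delta)^{-2}
    precision = X_J.T @ (d_inv2[:, None] * X_J) / c_beta
    covariance = np.linalg.inv(precision)
    return rng.multivariate_normal(np.zeros(X_J.shape[1]), covariance)

rng = np.random.default_rng(2)
n = 200
X = rng.standard_normal((n, 5)); X -= X.mean(axis=0)   # mean-corrected regressors
Z = rng.standard_normal((n, 3)); Z -= Z.mean(axis=0)
delta = np.array([0.5, 0.0, -0.3])
J = np.array([True, True, False, True, False])
c_beta = n * 1.0          # an illustrative value near the prior mode n Var(y), taking Var(y) = 1
beta_J = sample_beta_J(X, J, Z, delta, c_beta, rng)
```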
4 Examples
This section illustrates the method with several examples. The focus is on providing predictive distributions for a new set of regressors, and accordingly the measure of success is the accuracy of the predictive densities. We start with a simple illustrative example, and progress to more complex simulated and real examples. In each case we use a number of MCMC iterations that was judged to be appropriate by comparison with multiple runs, multiple starting points, and longer runs. For the examples involving real data we also perform a convergence analysis using Gelman-Rubin statistics calculated based on quantiles of the predictive densities at certain values of the regressors.
4.1 Toy illustrative example
Our first example is a simple one, yet shows the power of our method to differentiate between heteroscedasticity and non-normal error distributions. Only one replication is performed, since this is simply an illustrative example. We simulate 100 x values from a standard normal distribution, and consider two sets of data (a simulation sketch is given at the end of this subsection):

1. y_i = x_i + e^{x_i} ε_i with ε_i ∼ N(0, 1), and
2. y_i = x_i + ε_i with ε_i ∼ t_2.

The three sets of assumptions we allow for analysis of this data are:

(a) heteroscedastic, but assuming normality,
Table 1: Posterior means and standard deviations for the toy example. Notation 1(a) means that the data generating process is 1 and the estimation method is (a).

      1(a)           1(b)           1(c)           2(a)           2(b)           2(c)
Ĵ     1(0)           1(0)           1(0)           1(0)           1(0)           1(0)
β̂     0.932(0.001)   0.967(0.144)   0.926(0.016)   1.085(0.026)   0.787(0.081)   0.785(0.083)
K̂     1(0)           N/A            1(0)           1(0)           N/A            0.244(0.430)
δ̂     2.096(0.127)   N/A            2.076(0.134)   0.833(0.130)   N/A            0.054(0.168)
(b) homoscedastic, but allowing ε_i to follow a general distribution,

(c) heteroscedastic and allowing a general noise distribution.

Posterior means of J, β, K and δ are given in Table 1, with predictive densities at x_new = 0 shown in Fig. 3 (qualitatively similar density estimates are obtained at other points). Note that the estimates of the parameters from the full model (i.e. 1(c) and 2(c)) are very similar to those estimated using the correct model (1(a) and 2(b)), with only a little loss of precision due to the use of a more general model. However, we see that when assumptions of the fitting model do not match the form of the data (i.e. cases 1(b) and 2(a)) we either get an incorrect prediction of heteroscedasticity or an incorrect assessment of departure from normality. The anomaly is the estimation of K in the case of homoscedastic data; we have found in general that with datasets of this size the samplers do not perform particularly well in selecting out variance parameters when no heteroscedasticity is present, despite the corresponding value of δ_l being close to zero. This is also the case for models that assume normality, and will be explored elsewhere. The results reported were produced with 2000 iterations of the MCMC sampler, with the first 1000 discarded as burn-in, which took approximately 12 seconds when normality was assumed, and approximately 70 seconds when the Dirichlet process prior was used, in Matlab 7.0 on a 2.4GHz Intel Pentium IV.
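The two toy data sets can be generated as follows (an illustrative sketch; the seed is arbitrary).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x = rng.standard_normal(n)

# data set 1: heteroscedastic with normal errors, y_i = x_i + exp(x_i) * eps_i
y1 = x + np.exp(x) * rng.standard_normal(n)

# data set 2: homoscedastic with t_2 errors, y_i = x_i + eps_i
y2 = x + rng.standard_t(df=2, size=n)
```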
4.2 Comparison with Hanson and Johnson (2002)
We now consider an example used by Hanson and Johnson (2002) to demonstrate the efficiency of their mixture of Polya trees approach, in which x_i1 = i mod 2 and x_i2 ∼ N(0, 1) for i = 1, …, 100, with y_i = −x_i1 + x_i2 + ε_i. We consider the case of ε_i ∼ N(0, 1) as well as the case

ε_i = log(η_i),   η_i ∼ 0.5 N(1, 0.15²) + 0.5 N(3, 0.15²),    (3)
as used in the experiment of Hanson and Johnson (2002). Predictive densities from one simulated dataset are shown in Fig. 4.
Figure 3: Posterior predictive densities for the toy example, for data sets 1. (left) and 2. (right). In each case the predictive densities at xnew = 0 resulting from analysis (a) are shown dotted, from (b) dashed, and from (c) solid. (a) and (c) are coincident in the left plot, with (b) showing heavy-tailed behaviour. (b) and (c) are coincident on the right, capturing the heavy-tailed nature of the distribution.
When the errors are normally distributed, the prediction from the model assuming normality differs slightly from that of the model using the Dirichlet process mixture prior (by chance, the simulated noise terms did indeed exhibit skewness in this case, and that is picked up by the Dirichlet process mixture prior). On the other hand, when the errors are distributed according to (3) the predictions are significantly different, with those using the Dirichlet process mixture prior being significantly better. We re-emphasise the fact that exactly the same code was used for both examples: the data-based priors and flexibility of the model mean that this is a plug-in approach. Also shown in Fig. 4 are the predictions from Hanson and Johnson's method when the optimal parameters suggested in their paper are used. It is seen that in the case of bimodal noise (3) the predictive density is reasonable, but when the same priors are used for the case of normal noise the method results in a particularly bad predictive density (which nevertheless also exhibits the skewness detected by the Dirichlet process mixture prior). It should be noted that for both methods the parameters used for the normal case are the same as those used for the bimodal case; alternative parameters would presumably produce better estimates for Hanson and Johnson's method for the normal noise case, but the plug-in prior developed for the Dirichlet process method means that the user does not need to know the form of the noise in order to choose good parameters.

We formalise the comparison with Hanson and Johnson's method in the same way as we compared the density estimators in Section 2, simulating data with ε_i as in (3), which is the example used by Hanson and Johnson.
Figure 4: Posterior predictive density at x_new = (0, 0)' with normal noise (left) and with bimodal noise (right). For each plot, the true density is a dotted line, the prediction assuming normality is a dot-dashed line, the prediction using the Dirichlet process prior is a dashed line, and the prediction using Hanson and Johnson's mixture of Polya trees approach is a solid line.
Fifty data sets were generated, each consisting of 100 observations, and the distance/divergence of the estimated predictive densities from the true distributions at x_new = (0, 0)' were calculated as before. Boxplots of log(KL_DP / KL_HJ) and log(L2_DP / L2_HJ), given in Fig. 5, show that the Dirichlet process mixture prior results in significantly better results than Hanson and Johnson's method on this problem.
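A sketch of how one such data set may be simulated from (3) is given below (our illustrative code; the function name and seed are ours).

```python
import numpy as np

def simulate_hanson_johnson(n=100, rng=None):
    """One data set from the Hanson and Johnson (2002) example, with the bimodal
    errors of equation (3): eps_i = log(eta_i), eta_i ~ 0.5 N(1, 0.15^2) + 0.5 N(3, 0.15^2)."""
    rng = rng or np.random.default_rng()
    x1 = np.arange(1, n + 1) % 2
    x2 = rng.standard_normal(n)
    second_mode = rng.random(n) < 0.5
    eta = np.where(second_mode, rng.normal(3.0, 0.15, n), rng.normal(1.0, 0.15, n))
    y = -x1 + x2 + np.log(eta)
    return np.column_stack([x1, x2]), y

X, y = simulate_hanson_johnson(rng=np.random.default_rng(4))
```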
4.3 Electricity bills data
In our first example using real data, we consider a data set consisting of 1602 household electricity bill levels from South Australia (Bartels et al., 1996), where the regressors consist of indicator variables for the presence or absence of appliances in the house, as well as some demographic variables, and are described in Appendix B. A classical linear regression results in residuals which appear to be heteroscedastic (Fig. 6). The data (out of context) clearly suggest that a transformation should be applied to the responses. On the other hand, a linear model in the indicator variables corresponds to a response which is additive in terms of the appliances present, and economic theory clearly suggests this to be the correct functional form of the response, at least as far as the mean response is concerned. Bartels et al. (1996) suggest a heteroscedastic generalisation of the original linear model. However, fitting our heteroscedastic model but retaining the assumption of normality results in clearly skewed standardised residuals (Fig. 6). Hence we fit the fully heteroscedastic model (1), using 10000 MCMC iterations with the first 5000 discarded as burn-in, which took approximately 18 minutes for this data in Matlab 7.0 on a 2.4GHz Intel Pentium IV.
Figure 5: Boxplots of the log ratios of KL divergence (left) and L2 distance (right) from the true density for the Dirichlet process mixture model and the Hanson and Johnson (2002) model.
Figure 6: Residuals versus fitted values using classical linear regression in the electricity data (left) and standardised residuals against quantiles of the normal distribution for the electricity data when analysed allowing for heteroscedasticity but assuming normality (right).
To assess whether this was a sufficient number of iterations, we examine the convergence of the predictive distributions estimated at certain values of the regressors. Since we are not interested in any specific predictive distribution in this case, we select the regressors corresponding to the median, upper and lower quantiles of the observed y values, which should result in predictive distributions corresponding to a range of values. Then on each iteration of the MCMC scheme we calculate the median, upper and lower quantile of the predictive distribution for each of the three points, and calculate the nine Gelman–Rubin statistics, as corrected by Brooks and Gelman (1998), based on 5 parallel runs, where the start point for each run was sampled from the (very weakly informative) prior distribution. In each case the value was less than 1.1.

Posterior means for the regression vectors are given in Table 2, and compared against the estimates resulting from a simple linear regression of the data; the estimates are broadly similar, with the posterior means for Ĵ generally selecting the same variables as are deemed significant by the classical procedure, and the estimates of the selected coefficients similar. The exceptions are the variables poolfilt*log(people) and whtgel*dish, which are selected out by the Bayesian procedure but deemed highly significant by the classical method (and indeed the former also by Bartels et al.'s improved method); it is interesting to note that Bartels et al. regarded the large negative estimate of the poolfilt*log(people) coefficient as anomalous, and indeed closer inspection reveals that the classical estimate is heavily influenced by two outliers, which are accommodated by the significant skewness that is estimated using the full model.

Since we are no longer fitting a model with normally distributed errors it can be difficult to assess the fit of the model. The traditional standardised residuals, obtained by subtracting the mean and dividing by the standard deviation, need not be normally distributed, and instead we calculate the normalised residuals Φ^{−1}(U_i), where U_i = F̂_i(y_i), F̂_i is the posterior mean cumulative distribution function given the regressors x_i, and Φ is the standard normal cumulative distribution function. The U_i are easily estimated from the output of the MCMC scheme, by simply estimating the integral

U_i = ∫_{−∞}^{y_i} f̂_i(y) dy,
where f̂_i is the estimated predictive density at regressors x_i; the U_i correspond to the approximate cross-validatory predictive p-values of Marshall and Spiegelhalter (2003). A successfully fitted model should result in these U_i being approximately uniformly distributed on the interval [0, 1], so that the normalised residuals Φ^{−1}(U_i) will be approximately normally distributed. A plot of these values against the fitted values, as well as a predictive density for a randomly selected observation, are given in Fig. 7: it is clearly seen that the heteroscedasticity has been accounted for, despite the fact that the variance structure estimated in Table 2 exhibits no clear structure, and that the predictive distributions are significantly skewed. The third plot in Fig. 7 is a plot of the normalised residuals against quantiles of the normal distribution, and shows that the model appears to have fitted the data well.
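A sketch of this model check is given below (illustrative Python; how the predictive CDF values are obtained from the sampler depends on the implementation, so here they are simply assumed to be available as an array of MCMC draws).

```python
import numpy as np
from scipy.stats import norm

def normalised_residuals(pred_cdf_draws):
    """Normalised residuals Phi^{-1}(U_i), where U_i is the posterior mean predictive CDF
    F_hat_i(y_i); pred_cdf_draws[t, i] holds F_i(y_i) under the t-th retained MCMC draw."""
    U = pred_cdf_draws.mean(axis=0)          # posterior mean CDF at each observation
    U = np.clip(U, 1e-6, 1 - 1e-6)           # guard against exact 0 or 1
    return norm.ppf(U)

# toy check: with a correctly specified N(0, 1) predictive distribution the
# normalised residuals are themselves approximately standard normal
rng = np.random.default_rng(5)
y = rng.standard_normal(50)
draws = np.tile(norm.cdf(y), (200, 1))       # pretend every draw gives the true CDF
residuals = normalised_residuals(draws)
```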
Table 2: Ordinary least squares estimates of coefficients and standard deviations, with bold signifying significance at the 5% level, compared with posterior means from the full model (1), for the electricity data.

Variable name           OLS estimate   Ĵ       β̂         K̂       δ̂
log(rooms)              203(27)        1.000   139.377   0.994   0.664
log(income)             44(8)          1.000   36.509    0.616   0.030
log(people)             70(13)         1.000   83.783    0.997   0.497
mhtgel                  47(14)         1.000   40.807    0.945   0.287
mhtgel*log(rooms)       -55(47)        0.034   -0.564    0.752   0.263
mhtgel*rms10            31(51)         0.049   3.395     0.606   -0.020
mhtgel*log(income)      11(13)         0.019   -0.018    0.469   0.011
mhtgel*inc60            -36(46)        0.023   -0.927    0.757   -0.035
mhtgel*shtgel           18(20)         0.253   9.234     0.685   0.071
sheonly                 9(10)          0.015   0.033     0.841   -0.129
whtgel                  260(24)        1.000   238.012   1.000   0.945
whtgel*log(people)      172(27)        1.000   154.239   0.947   0.336
whtgel*hotwash          32(26)         0.030   0.231     0.682   0.011
whtgel*dish             172(43)        0.029   2.833     0.900   0.570
cookel                  87(11)         1.000   79.626    1.000   0.425
cookel*log(people)      14(17)         0.074   1.540     0.665   -0.188
cookel*mwave            -5(18)         0.029   0.269     0.874   -0.178
poolfilt                98(15)         1.000   63.773    1.000   0.465
poolfilt*log(people)    -193(38)       0.045   -2.435    0.987   -0.727
airrev                  25(10)         0.793   19.905    0.581   0.038
aircond                 34(10)         0.998   32.847    0.977   0.307
mwave                   19(14)         0.100   1.334     0.510   0.019
dish                    31(13)         1.000   55.993    0.529   0.091
dryer                   55(9)          1.000   50.540    0.999   0.309
Figure 7: Normalised residuals versus fitted values using full heteroscedastic Dirichlet process regression (left); the predictive density of a randomly chosen point using this model (middle, dashed line) compared with the same predictive density assessed assuming normally distributed errors (solid line); and a quantile plot of the normalised residuals resulting from the Dirichlet process model (right, dashed line) compared with the normalised residuals resulting from assuming normality (solid line).
4.4 Ethanol data
Hurn et al. (2003) consider fitting the data presented in Fig. 8 using a mixture of regressions. These data have the explanatory variable NOx, the concentration of nitrous oxide in the exhaust, and response E, the richness of the air-ethanol mixture in the engine. Clearly a simple linear regression assuming normality is inadequate for this data. We fit our full model allowing for general noise distributions and heteroscedasticity. We focus on predicting the value of the response at a newly observed value of the regressor, giving this prediction as a density estimate. This avoids the complexities of post-processing involved in the mixture model approach of Hurn et al. (2003).

The results from a run of 2000 MCMC iterations, with the first 1000 discarded as burn-in, are presented in Fig. 8. As with the electricity bills data, this was estimated to be a sufficiently long run through the calculation of the corresponding nine Gelman–Rubin statistics (for the median, upper and lower quantiles of the predictive densities corresponding to the median, upper and lower quantile of the observed values): all of these were less than 1.1. The left plot presents contours of a surface obtained by taking the density estimate at each value of the NOx variable, normalised by dividing by the maximum value of that density. It is clearly seen that the data (shown on the same plot) are well represented by these density estimates. The right hand plot confirms this, with the normalised residuals plotted against quantiles of the standard normal distribution, as with the electricity bills example, showing that these are approximately normal.
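The contour surface in the left panel is simple to construct once predictive densities are available on a grid; a sketch of the normalisation step (ours, with hypothetical inputs) follows.

```python
import numpy as np

def normalised_density_surface(density_grid):
    """Scale each column of predictive densities (one column per NOx value, rows indexing
    the response grid) by its maximum, as used for the contour plot described above."""
    column_max = density_grid.max(axis=0, keepdims=True)
    return density_grid / np.maximum(column_max, 1e-300)

# hypothetical example: a Gaussian predictive density whose mean shifts with NOx
nox = np.linspace(0.5, 4.0, 50)
e_grid = np.linspace(0.4, 1.4, 200)
means = 1.2 - 0.2 * nox
dens = np.exp(-0.5 * ((e_grid[:, None] - means[None, :]) / 0.05) ** 2)
surface = normalised_density_surface(dens)   # every column now has maximum 1
```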
Figure 8: The ethanol data, and the contours showing the shape of the estimated densities for different values of NOx (left) and a quantile plot of the normalised residuals resulting from the heteroscedastic Dirichlet process model, compared with quantiles of the standard normal distribution (right).
4.5 Simulated non-parametric examples
Chan et al. (2005) show how to fit the spatially adaptive regression spline model of Kohn et al. (2001) for both the mean and variance components, using Bayesian variable selection to automatically select knots to avoid over-fitting. This very general model still assumes that the data are distributed normally, which can result in overfitting in the case of heavy-tailed noise distributions. In this section we apply our method to the linear regression problem resulting from Example 4 of Chan et al. (2005), in which 500 predictors are sampled from a U[0, 1] distribution, then observations are constructed by taking y_i = µ(x_i) + σ(x_i)ε_i, where

µ(x) = ( φ(x; 0.2, 0.004) + φ(x; 0.6, 0.1) ) / 4,   and   σ(x) = 2µ(x)/3,

with φ(x; a, b) being the density of the normal distribution with mean a and variance b evaluated at x. We consider this example when ε_i ∼ N(0, 1), and when √3 ε_i ∼ t_3 (the √3 being present so that the variance of the noise is 1). The regression problem was formulated identically to the method of Chan et al. (2005), using 30 knots. We analysed these data sets using heteroscedastic methodology, restricting to the case of normal noise as in Chan et al. (2005), as well as using a Dirichlet process mixture prior for ε_i. The L2 distance of the estimated mean and standard deviation functions from the truth were calculated in each case, and the usual log ratios are boxplotted in Fig. 9, where 50 replicated data sets were used, and the MCMC was run for 1000 iterations with the first 500 discarded (the runs took approximately 81 seconds when normality was assumed, and 216 seconds using the Dirichlet process prior).
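One replicate of this simulated example can be generated as follows (our illustrative sketch; the function name and seed are ours).

```python
import numpy as np

def phi(x, mean, var):
    """Normal density with the (mean, variance) parameterisation used in the text."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def simulate_example4(n=500, noise="normal", rng=None):
    """One replicate of the simulated nonparametric example (Example 4 of Chan et al., 2005)."""
    rng = rng or np.random.default_rng()
    x = rng.uniform(0.0, 1.0, n)
    mu = (phi(x, 0.2, 0.004) + phi(x, 0.6, 0.1)) / 4.0
    sigma = 2.0 * mu / 3.0
    if noise == "normal":
        eps = rng.standard_normal(n)
    elif noise == "t3":
        eps = rng.standard_t(df=3, size=n) / np.sqrt(3.0)   # rescaled so Var(eps) = 1
    else:
        raise ValueError(noise)
    return x, mu + sigma * eps

x, y = simulate_example4(noise="t3", rng=np.random.default_rng(6))
```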
Figure 9: Log ratio of L2 distance of µ̂(x) and σ̂(x) from µ(x) and σ(x), using the Dirichlet process prior, and assuming normality. Left pair of boxplots: normally distributed noise; right pair: t_3 distributed noise.
It is seen that in the case of normal noise the estimation of the mean function is comparable, whereas the estimation of the standard deviation function is slightly better when normality is assumed. On the other hand, when t_3 distributed noise is present, the method involving Dirichlet process mixture priors achieves better results for estimation of the mean, and similar results for estimation of the standard deviation. However, closer inspection of the estimates from all 50 replications of this experiment with t_3 distributed noise, plotted in Fig. 10, shows that the method which assumes normality overfits the variance component in the situation where the true noise is distributed according to a t_3 distribution, whereas the Dirichlet process method appears to slightly underestimate the standard deviation function while producing what looks like a better fit. We could carefully select a distance metric (such as the L2 distance of the first derivative from the truth, or Marron and Tsybakov's (1995) visual error criteria) to cast our results in a better light, but feel that Fig. 10 provides sufficient evidence.
4.6 LIDAR data
The nonparametric method is now demonstrated using the LIDAR (light detection and ranging) dataset studied by Ruppert et al. (2003). This data, plotted in Fig. 11, consists of 221 observations of the ratio of reflected laser-emitted light at different ranges, and is used to detect chemical compounds in the atmosphere. The data require nonparametric fits for both the mean and variance functions, and the Bayesian regression spline approach of the previous section was applied using 30 evenly spaced knots along the range of the data. The MCMC scheme was run for 4000 iterations, with the first 2000 discarded as burn-in, which took 346 seconds when the Dirichlet process mixture prior was used, and 168 seconds when normality was assumed. These run lengths were judged to be adequate using the same criterion as for the electricity bills data and the ethanol data: the Gelman–Rubin statistics for estimating the median, upper and lower quantiles of the predictive distributions at regressors corresponding to the median, upper and lower quantile observed values are all close to one.
Figure 10: Estimated mean and standard deviation functions from 50 replications of the simulated nonparametric example with t3 distributed noise, estimated assuming normality (left) and using the Dirichlet process prior (right).
The fitted means and symmetric 95% prediction intervals from both methods are shown in Fig. 11; it is clearly seen in this plot that both methods report virtually identical results. The normalised residuals, also shown in Fig. 11, were calculated as before; it is clear that normality is an adequate assumption in this example, and that the method using the Dirichlet process mixture prior recovers these normal distributions.
5 Conclusion
Transformations are often used in applied linear regression analysis. However, there are many situations in which transformation of the response is inappropriate. We have shown that the Dirichlet process mixture prior is a flexible and convenient way to generalise the heteroscedastic linear regression method of Chan et al. (2005) to the case where the noise terms are not normally distributed. Furthermore, the use of data-based prior distributions means that this complex Bayesian procedure can be applied with no input from the user; in particular, all of the examples in this paper were analysed using the same code.

Data-based prior parameters were specified for the Dirichlet process mixture prior in density estimation (Escobar and West, 1995). The resulting estimator was shown to compare favourably with the theoretically optimal plug-in density
estimator of Sheather and Jones (1991) in terms of the Kullback–Leibler divergence and L2 distance of the resulting density estimates from the true generating density. When this prior was introduced for the noise terms in a heteroscedastic linear regression, it was shown that virtually no predictive power was lost when the errors are normally distributed, but when the errors are not normally distributed the predictive densities are much more accurate. Furthermore, the predictive densities resulting from the heteroscedastic regression procedure for a simple example of Hanson and Johnson (2002) significantly outperformed the more sophisticated method of that paper.

Assessing how well a model fits the data can be difficult with such a complex model, but in the real data examples we showed that the fit of the model can be assessed by considering the normalised residuals Φ^{−1}(F̂_i(y_i)), which are approximately normally distributed if a well-fitting model has been obtained. When the regression spline approach of Chan et al. (2005) was introduced to model both the mean and variance functions non-parametrically, it was seen that the model can still predict both normal and non-normal errors in the presence of heteroscedasticity, and generally results in a better fit to the data than when the model is restricted by fitting normally distributed noise.

Figure 11: Left: fits for the LIDAR data. The estimated mean and 95% predictive interval assuming normality are given by a solid line, while the same estimates resulting from the Dirichlet process mixture prior are a dashed line (the estimates are virtually coincident). Right: quantile plot of the normalised residuals resulting from assuming normality (solid line) compared with the normalised residuals resulting from the Dirichlet process mixture prior (dashed line).
Acknowledgements

The research by Robert Kohn and David Nott was partially supported by an Australian Research Council Grant on mixture models. The same grant supported David Leslie's postdoctoral work at the University of New South Wales. We thank an anonymous associate editor and anonymous referees for suggestions that led to substantive improvements in the paper. We also thank Tim Hanson for providing computer code.
A Technical details of the sampling scheme
A.1 Density estimation
Consider first the modification of the Escobar and West (1995) density estimation model (2). It is well known that the random distribution G drawn from the Dirichlet process prior can be integrated out. Writing θ_i = (µ_i, σ_i²), the prior on θ can be written

θ_1 ∼ G_0,    θ_i | θ_1, …, θ_{i−1} ∼ ( αG_0 + Σ_{j=1}^{i−1} δ_{θ_j} ) / (α + i − 1),    (4)

where δ_{θ_j} is a degenerate distribution with unit mass at θ_j. It is clear from this representation that the θ_i will be "clustered" into k ≤ n components; following Dahl (2003) we will consider the representation of θ as a partition S = {S_1, …, S_{k(S)}} of {1, …, n}, and a vector θ̃ = {θ̃_1, …, θ̃_{k(S)}} such that i ∈ S_j ⇒ θ_i = θ̃_j. This results in a prior on S of the form

p(S) ∝ ∏_{j=1}^{k(S)} αΓ(n_j),    (5)

where Γ is the gamma function and n_j = |S_j|. The prior specification for θ is completed by noting that each element of θ̃ is drawn independently from G_0.
A.1.1 Updating the partition S
MacEachern (1994) showed that under the hierarchical model (2), it is also possible to integrate out the θ_i = (µ_i, σ_i²) to perform Gibbs sampling solely on the partition S. The likelihood of a partition (given the hyperparameters) is given by

p(y | S, α, τ², b_σ) = ∏_{j=1}^{k(S)} p(y_{S_j} | τ², b_σ),
p(y_{S_j} | τ², b_σ) = ∫∫ f_{N(0, σ²τ²)}(µ) f_{IG(a_σ, b_σ)}(σ²) ∏_{i∈S_j} f_{N(µ, σ²)}(y_i) dµ dσ²,
where f_{N(m, s²)} is the density function of the normal distribution with mean m and variance s², and f_{IG(a_σ, b_σ)} is the density function for the IG(a_σ, b_σ) distribution. Due to the conjugacies in the model, this integral is easily carried out, and we see that

p(S | y, α, τ², b_σ) ∝ p(y | S, α, τ², b_σ) p(S)
                    ∝ ∏_{j=1}^{k(S)} αΓ(n_j) · [ b_σ^{a_σ} Γ(a_σ + n_j/2) ] / [ Γ(a_σ) √(n_j τ² + 1) ] · [ b_σ + (n_j/2)( s_j² + ȳ_j²/(1 + n_j τ²) ) ]^{−(a_σ + n_j/2)},    (6)
where ȳ_j = n_j^{−1} Σ_{i∈S_j} y_i and s_j² = n_j^{−1} Σ_{i∈S_j} y_i² − ȳ_j². On each iteration of the MCMC scheme we therefore take each y_i in turn, and allocate i either to an existing component of the partition or to a new component consisting only of i, with probabilities calculated using (6).
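The following sketch (our illustrative Python, not the authors' Matlab implementation) computes these allocation probabilities for a single observation directly from the conjugate cluster marginal likelihood underlying (6); constants common to all options cancel in the normalisation.

```python
import numpy as np
from scipy.special import gammaln

def cluster_log_marginal(ys, a_sigma, b_sigma, tau2):
    """Log marginal likelihood of the observations in one cluster under the conjugate
    normal / inverse-gamma base measure (the cluster factor appearing in (6))."""
    ys = np.asarray(ys, dtype=float)
    n = ys.size
    ybar, s2 = ys.mean(), ys.var()                     # s2 = n^{-1} sum y_i^2 - ybar^2
    return (a_sigma * np.log(b_sigma) + gammaln(a_sigma + n / 2) - gammaln(a_sigma)
            - 0.5 * np.log(n * tau2 + 1) - 0.5 * n * np.log(2 * np.pi)
            - (a_sigma + n / 2) * np.log(b_sigma + 0.5 * n * (s2 + ybar ** 2 / (1 + n * tau2))))

def allocation_probabilities(y_i, clusters, alpha, a_sigma, b_sigma, tau2):
    """Gibbs probabilities of allocating y_i to each existing cluster (clusters is a list of
    arrays holding the other observations' values) or to a new singleton cluster."""
    log_w = [np.log(len(ys))
             + cluster_log_marginal(np.append(ys, y_i), a_sigma, b_sigma, tau2)
             - cluster_log_marginal(ys, a_sigma, b_sigma, tau2)
             for ys in clusters]
    log_w.append(np.log(alpha) + cluster_log_marginal([y_i], a_sigma, b_sigma, tau2))
    log_w = np.array(log_w)
    w = np.exp(log_w - log_w.max())
    return w / w.sum()

probs = allocation_probabilities(0.3, [np.array([0.1, -0.2]), np.array([2.5])],
                                 alpha=1.0, a_sigma=2.0, b_sigma=0.5, tau2=1.0)
```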
A.1.2 Updating the hyperparameters
Given a partition S, the hyperparameters α, b_σ and τ² can be updated by standard means. West (1992) shows how to sample α from its conditional distribution using an auxiliary variable sampled from a Beta distribution. We shall not repeat the derivation here. Sampling b_σ from a distribution related to (6) is not easy. However, given the partition, it is easily seen that

σ̃_j² | y, S, α, τ², b_σ ∼ IG( a_σ + n_j/2,  b_σ + (n_j/2)( s_j² + ȳ_j²/(n_j τ² + 1) ) ).    (7)

These distributions are easy to sample from, and hence we generate the vector σ̃² as an auxiliary variable before sampling b_σ from a gamma distribution (Richardson and Green, 1997).

Similarly, the easiest way to update τ² is to sample the auxiliary vector µ̃, using the fact that

µ̃_j | y, S, α, τ², b_σ, σ̃_j² ∼ N( τ² n_j ȳ_j / (n_j τ² + 1),  τ² σ̃_j² / (n_j τ² + 1) ),    (8)

then to update τ² from an inverse gamma distribution (Escobar and West, 1995).
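A sketch of these auxiliary draws (ours; cluster membership is assumed to be given as a list of index arrays) is:

```python
import numpy as np

def sample_cluster_params(y, clusters, a_sigma, b_sigma, tau2, rng):
    """Auxiliary draws of (sigma_tilde_j^2, mu_tilde_j) for each cluster using (7) and (8);
    clusters is a list of integer index arrays defining the partition S."""
    sigma2_tilde, mu_tilde = [], []
    for idx in clusters:
        ys = y[idx]
        n_j, ybar, s2 = ys.size, ys.mean(), ys.var()
        shape = a_sigma + n_j / 2
        scale = b_sigma + 0.5 * n_j * (s2 + ybar ** 2 / (n_j * tau2 + 1))
        s2_j = scale / rng.gamma(shape)                        # inverse-gamma draw via 1/Gamma
        m_j = rng.normal(tau2 * n_j * ybar / (n_j * tau2 + 1),
                         np.sqrt(tau2 * s2_j / (n_j * tau2 + 1)))
        sigma2_tilde.append(s2_j)
        mu_tilde.append(m_j)
    return np.array(sigma2_tilde), np.array(mu_tilde)

rng = np.random.default_rng(7)
y = rng.standard_normal(10)
sig2, mu = sample_cluster_params(y, [np.arange(6), np.arange(6, 10)], 2.0, 0.5, 1.0, rng)
```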
A.1.3 Density estimates
The density is estimated using the identity

p(y* | y) = ∫ p(y* | y, S, α, τ², b_σ) p(S, α, τ², b_σ | y) dS dα dτ² db_σ
          ≈ T^{−1} Σ_{t=1}^{T} p(y* | y, S^{(t)}, α^{(t)}, τ²^{(t)}, b_σ^{(t)}),
where S^{(t)}, α^{(t)}, τ²^{(t)}, b_σ^{(t)} are samples from the Gibbs sampler. The quantities p(y* | y, S, α, τ², b_σ) are given by a mixture of (k(S) + 1) t distributions, corresponding to allocating y* either to a current component of the partition (which has probability n_j/(n + α)), or to a new component (which has probability α/(n + α)). Consider the case of y* being allocated to component j. We know in this case that y* | µ̃_j, σ̃_j², … ∼ N(µ̃_j, σ̃_j²), where … signifies conditioning on any other variables present in the sampler. Hence

p(y* | y, S, α, τ², b_σ) = ∫∫ f_{N(µ̃_j, σ̃_j²)}(y*) p(µ̃_j, σ̃_j² | y, S, α, τ², b_σ) dµ̃_j dσ̃_j².

Since the posterior of (µ̃_j, σ̃_j²) is a normal/inverse gamma distribution, as shown in (7) and (8), it is a standard calculation to see that, conditional on being allocated to component j, y* is distributed according to a t distribution with shape 2a_σ + n_j, mean τ² n_j ȳ_j / (n_j τ² + 1) and scale

√[ ( 1 + τ²/(n_j τ² + 1) ) · ( b_σ + (n_j/2)( s_j² + ȳ_j²/(n_j τ² + 1) ) ) / ( a_σ + n_j/2 ) ].

Similarly, if y* comes from a new component, then (µ*, σ*²) come from the normal/inverse gamma base distribution, and hence the distribution of y* is a t distribution with shape 2a_σ, mean 0 and scale √( (1 + τ²) b_σ / a_σ ).
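Putting the pieces together, the predictive density conditional on one retained draw can be evaluated as the mixture of t densities just described. The sketch below is our reading of that calculation (illustrative Python, not the paper's code).

```python
import numpy as np
from scipy.stats import t as student_t

def conditional_predictive_density(y_grid, cluster_values, alpha, a_sigma, b_sigma, tau2):
    """p(y* | y, S, alpha, tau^2, b_sigma) as a mixture of k(S) + 1 t densities;
    cluster_values is a list of arrays holding the observations in each component."""
    n = sum(len(ys) for ys in cluster_values)
    density = np.zeros_like(y_grid, dtype=float)
    for ys in cluster_values:                                # existing components
        n_j, ybar, s2 = len(ys), np.mean(ys), np.var(ys)
        df = 2 * a_sigma + n_j
        loc = tau2 * n_j * ybar / (n_j * tau2 + 1)
        scale = np.sqrt((1 + tau2 / (n_j * tau2 + 1))
                        * (b_sigma + 0.5 * n_j * (s2 + ybar ** 2 / (n_j * tau2 + 1)))
                        / (a_sigma + n_j / 2))
        density += (n_j / (n + alpha)) * student_t.pdf(y_grid, df, loc=loc, scale=scale)
    # the (k(S) + 1)-th component: y* allocated to a brand new cluster
    density += (alpha / (n + alpha)) * student_t.pdf(
        y_grid, 2 * a_sigma, loc=0.0, scale=np.sqrt((1 + tau2) * b_sigma / a_sigma))
    return density

grid = np.linspace(-5, 5, 201)
dens = conditional_predictive_density(grid, [np.array([0.1, -0.2, 0.4])], 1.0, 2.0, 0.5, 1.0)
```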
A.2 Heteroscedastic linear regression
In order to sample from the posterior distribution of the full heteroscedastic linear model (1), we need to find a concise form of the likelihood. In particular, it has been noted (Kohn et al., 2001; Chan et al., 2005) that sampling is much more efficient in this type of model if the regression coefficients β can be integrated out. We will write Σ = diag(σ_i²) for the diagonal matrix of variances resulting from the Dirichlet process mixture, and µ = (µ_i)_{i=1,…,n} for the vector of means. Thus it follows directly from the model (1) and the definition of the Dirichlet process mixture prior that

y | β, δ, µ, Σ ∼ N( D(δ)µ + Xβ, D(δ)² Σ ).    (9)
Given a partition S resulting from the Dirichlet process prior, with k components, we define E_S to be the n × k matrix with each row being a unit vector with a one in the position corresponding to the component to which point y_i is allocated in S. It follows that µ = E_S µ̃, since µ̃ is simply the vector of component means. Recalling that Xβ = X_J β_J, we can therefore rewrite the mean of the distribution (9) as

( D(δ)E_S   X_J ) (µ̃', β_J')'.
Notice that, conditional on δ, σ̃², S, τ² and c_β, the prior tells us that

(µ̃', β_J')' ∼ N(0, P),   where P is block diagonal with blocks τ² diag(σ̃²) and c_β (X_J' D(δ)^{−2} X_J)^{−1}.

It is therefore easy to marginalise out the parameters µ̃ and β_J, resulting in the likelihood equation

p(y | J, δ, Σ, c_β, τ²) = [ |2πΣD(δ)²|^{1/2} |P|^{1/2} |X̃'X̃ + P^{−1}|^{1/2} ]^{−1} × exp{ −½ ( ỹ'ỹ − ỹ'X̃(X̃'X̃ + P^{−1})^{−1}X̃'ỹ ) },    (10)

where

ỹ = D(δ)^{−1} Σ^{−1/2} y   and   X̃ = Σ^{−1/2} ( E_S   D(δ)^{−1} X_J ).
˜ 2 , α, bσ , τ 2 } for the set of variables that We write Θ = {J , cβ , δ, K, cδ , S, σ are carried through from iteration to iteration. Other sampled values can be considered as auxiliary variables. Sampling therefore proceeds as follows: 1. Update the indicator vector J for the mean component, conditional on Θ\ {J }, by proposing new values of subvectors of J from the prior, identically to Chan et al. (2005), but using (10) to form the likelihood ratios required for the Metropolis–Hastings acceptance probabilities. The proposals are analogous to those for K described below. 2. Update K and δ simultaneously, conditional on Θ\{K, δ}, using Metropolis– Hastings steps as described below. 3. Sample βJ from a multivariate normal distribution, conditional on Θ. ˜ 2 }, update the partition S, and sample µ ˜ 4. Conditional on Θ ∪ {βJ } \ {S, σ 2 ˜ , exactly as for the density estimation model, but with y replaced and σ by D(δ)−1 (y − Xβ). Also update α, bσ and τ 2 as before. 5. Conditional on Θ \ {cβ }, update cβ using a multiplicative random walk Z Metropolis–Hastings, by proposing cP β = e cβ where Z ∼ N (0, 1), and accepting the proposal according to a likelihood ratio. 6. Conditional on Θ \ {cδ }, sample cδ from an inverse Gamma distribution. ˜ is only sampled in order to update τ 2 , and β is only sampled so Note that µ that the partition S can be updated using the Gibbs sampling procedure as with density estimation. A.2.1
A.2.1 Updating the variance component
In part 2 of the sampling scheme, the elements of K are updated sequentially in blocks, with the block size chosen at random to be 2, 4 or 6, and the elements of the block selected randomly from the elements of K yet to be updated in the current iteration. Let K_B be one such block; a new value for this block is proposed from the prior, conditional on the value of the parts of K that are not being updated. This step is described in detail by Kohn et al. (2001). We denote the proposed value of the whole vector by K^P. A new value δ^P for the vector of regression coefficients is then proposed, conditional on K^P and the current values of all parameters, including the current value of δ. We note, following Nott and Leonte (2004), that if

(µ̂', β̂')' = (X̃'X̃ + P^{−1})^{−1} X̃'ỹ,

where X̃ and ỹ are as defined above, and

r̂ = y − ( D(δ)E_S   X_J ) (µ̂', β̂')',
ŝ_i = r̂_i²,   i = 1, …, n,

then ŝ can be thought of as a vector of approximate squared residuals, and therefore we have an approximate generalised linear model with design matrix Z_K. If it were the case that K^P = K, then Gamerman (1997) showed how to modify the Fisher scoring procedure to produce effective proposal densities for δ_K. Nott and Leonte (2004) observe that this construction uses the current value of δ_K only through the resulting linear predictors for ŝ, and so the construction can be used to propose a value of δ_{K^P} even when K^P ≠ K, by using the linear predictors Zδ^C, where δ^C denotes the current value of δ. This idea was used by Chan et al. (2005) for the heteroscedastic model that assumes normality, and is again modified to accommodate the Dirichlet process mixture prior. We propose δ^P_{K^P} from a multivariate t distribution with 4 degrees of freedom, and variance and mean parameters

Δ = ( c_δ^{−1} I_{|K^P|} + ½ Z_{K^P}' Z_{K^P} )^{−1},   and
δ̂_{K^P} = Δ · ½ Z_{K^P}' ( Zδ^C + D(δ^C)^{−2} Σ^{−1}( ŝ − D(δ^C)² Σ 1_n ) ),

respectively, where 1_n is a vector of 1's of length n. The proposal density q(δ^P | K^P, δ^C, J, Σ, c_β, τ²) can be explicitly evaluated, as can the density for the reverse move, and since K^P is proposed from the prior distribution the Metropolis–Hastings acceptance probability for the proposal (K^P, δ^P) is simply

max{ 1, [ p(y | δ^P, J, Σ, c_β, τ²) p(δ^P | K^P, c_δ) q(δ^C | K^C, δ^P, J, Σ, c_β, τ²) ] / [ p(y | δ^C, J, Σ, c_β, τ²) p(δ^C | K^C, c_δ) q(δ^P | K^P, δ^C, J, Σ, c_β, τ²) ] }.
A.2.2 Density estimates
We use the sampled β_J value while calculating the density estimates at new values of the regressors. The procedure is then exactly the same as for the density estimation case described above, but the regression terms must be taken into account.
B Electricity bills regressors
The electricity bills data set has 24 regressors, consisting of demographic variables, indicator variables, and interaction terms. The non-interaction terms are described here.

log(rooms):   log of the number of rooms in the house
rms10:        indicator for whether rooms is 10 or more
log(income):  log of the annual pretax household income in Australian dollars
inc60:        indicator for whether the income is greater than 60
log(people):  log of the number of usual residents in the house
mhtgel:       indicator for electric main heating
shtgel:       indicator for secondary electric heating
sheonly:      indicator for electric secondary heating only
whtgel:       indicator for peak electric water heating
hotwash:      indicator for hot water washing machine
dish:         indicator for dishwasher
cookel:       indicator for electric cooking only
mwave:        indicator for microwave
poolfilt:     indicator for pool filter
airrev:       indicator for reverse cycle air conditioning
aircond:      indicator for air conditioning
dryer:        indicator for dryer
References

Antoniak, C. E. (1974). Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics 2, 1152–1174.

Bartels, R., D. G. Fiebig, and M. H. Plumb (1996). Gas or electricity, which is cheaper?: An econometric approach with application to Australian expenditure data. The Energy Journal 17, 33–58.

Brooks, S. P. and A. Gelman (1998). General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics 7, 434–455.

Carroll, R. J. and D. Ruppert (1988). Transformation and Weighting in Regression. Monographs on Statistics and Applied Probability. Chapman and Hall, London.

Chan, D., R. Kohn, D. J. Nott, and C. Kirby (2005). Locally adaptive semiparametric estimation of the mean and variance functions in regression models. Forthcoming in Journal of Computational and Graphical Statistics, available at http://www.caer.unsw.edu.au/DP/CAER0603.pdf.
Cripps, E., R. Kohn, and D. Nott (2006). Bayesian subset selection and model averaging using a centred and dispersed prior for the error variance. Australian & New Zealand Journal of Statistics 48, 237–252.

Dahl, D. B. (2003). An improved merge–split sampler for conjugate Dirichlet process mixture models. Technical Report 1086, Department of Statistics, University of Wisconsin–Madison.

Escobar, M. D. and M. West (1995). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association 90, 577–588.

Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics 1, 209–230.

Gamerman, D. (1997). Sampling from the posterior distribution in generalized linear mixed models. Statistics and Computing 7, 57–68.

Green, P. J. and S. Richardson (2001). Modelling heterogeneity with and without the Dirichlet process. Scandinavian Journal of Statistics 28, 355–375.

Hanson, T. and W. O. Johnson (2002). Modeling regression error with a mixture of Polya trees. Journal of the American Statistical Association 97, 1020–1033.

Hurn, M., A. Justel, and C. P. Robert (2003). Estimating mixtures of regressions. Journal of Computational and Graphical Statistics 12, 55–79.

Kohn, R., M. Smith, and D. Chan (2001). Nonparametric regression using linear combinations of basis functions. Statistics and Computing 11, 313–322.

Kottas, A. and A. E. Gelfand (2001). Bayesian semiparametric median regression modeling. Journal of the American Statistical Association 96, 1458–1468.

Kottas, A. and M. Krnjajic (2005). Bayesian nonparametric modeling in quantile regression. Technical Report 2005-06, UCSC Department of Applied Math and Statistics.

Kuo, L. and B. K. Mallick (1997). Bayesian semiparametric inference for the accelerated failure time model. Canadian Journal of Statistics 25, 457–472.

Lo, A. Y. (1984). On a class of Bayesian nonparametric estimates: I. Density estimates. The Annals of Statistics 12, 351–357.

MacEachern, S. N. (1994). Estimating normal means with a conjugate style Dirichlet process prior. Communications in Statistics: Simulation and Computation 7, 727–741.

Marron, J. S. and A. B. Tsybakov (1995). Visual error criteria for qualitative smoothing. Journal of the American Statistical Association 90, 499–507.

Marron, J. S. and M. P. Wand (1992). Exact mean integrated squared error. Annals of Statistics 20, 712–736.
Marshall, E. C. and D. J. Spiegelhalter (2003). Approximate cross-validatory predictive checks in disease mapping models. Statistics in Medicine 22, 1649–1660.

Mukhopadhyay, S. and A. E. Gelfand (1997). Dirichlet process mixed generalised linear models. Journal of the American Statistical Association 92, 633–639.

Nott, D. J. and D. Leonte (2004). Sampling schemes for Bayesian variable selection in generalized linear models. Journal of Computational and Graphical Statistics 13, 362–382.

Richardson, S. and P. J. Green (1997). On Bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society, B 59, 731–792.

Ruppert, D., M. P. Wand, and R. J. Carroll (2003). Semiparametric Regression. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.

Sheather, S. J. and M. C. Jones (1991). A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society, B 53, 683–690.

Walker, S. G. and B. K. Mallick (1999). Semiparametric accelerated life time model. Biometrics 55, 477–483.

West, M. (1992). Hyperparameter estimation in Dirichlet process mixture models. ISDS Discussion paper 92-A03, Duke University.

West, M., P. Müller, and M. D. Escobar (1994). Hierarchical priors and mixture models, with application in regression and density estimation. In A. Smith and P. Freeman (Eds.), Aspects of Uncertainty: A tribute to D. V. Lindley, pp. 363–386. Wiley, New York.