A Bayesian Approach to Nonparametric Bivariate Regression. Michael Smith and Robert Kohn, Australian Graduate School of Management, University of New South Wales. First version 13th March 1996, revised 12th February 1997.
This paper outlines a general Bayesian approach to estimating a bivariate regression function in a nonparametric manner. It models the function using a bivariate regression spline basis with many terms. Binary indicator variables corresponding to these terms are introduced to explicitly model the uncertainty about whether or not each term provides a significant contribution to the regression. The regression function is estimated using an estimate of its posterior mean, smoothing over the distribution of these binary indicator variables. To make the computations tractable all estimates are obtained using Markov chain Monte Carlo sampling. Extensive simulated comparisons are provided which demonstrate the competitive performance of this approach against other data-driven bivariate surface estimators prominent in the literature. It is then shown how the procedure can be extended to provide a general approach to nonparametric bivariate surface estimation in two difficult regression settings. The first case allows for outlying values in the dependent variable. The second case considers data collected in time order with the errors potentially autocorrelated. Simulated and real data examples illustrate the effectiveness of the methodology in tackling such difficult problems.
Key Words: Bayesian subset selection; Gibbs sampler; Robust regression; Time series; Markov chain Monte Carlo; Surface estimation
This work is part of Michael Smith's PhD thesis. Robert Kohn's work was partially supported by an Australian Research Grant. We would like to thank Grace Wahba and Chong Gu for helping us implement their smoothing spline routines, Jerry Friedman for sending us his MARS program and Clive Loader for help in implementing his LOCFIT code. We would also like to thank Paul Yuan for his help with the computations.
1 Introduction

Suppose that in a regression model the signal is a bivariate function f, so that
y_i = f(x_i, z_i) + e_i,   i = 1, ..., n,     (1.1)
where the dependent variable is y, the independent variables are x and z, and the errors e_i ~ N(0, σ²) are independent. A popular way of estimating f is to assume that its form is known a priori, except for a small number of parameters; for example, assuming f is a linear or quadratic function of x and z. However, this assumption is often unjustified and it is more realistic to only assume that f is a smooth function of x and z. Such an approach is called nonparametric regression. An illustration of such bivariate modeling is given in section 4.3, where the electricity savings data used by Mitchell and Beauchamp (1988) are modeled as a function of pre-program electricity use and heated floor area. Let S = {B_i(x, z), i ∈ I} be a set of linearly independent bivariate functions, which we will call basis functions. This paper discusses how to nonparametrically estimate the surface f by modeling it as a linear combination of these basis functions so that
f(x, z) = Σ_{i∈I} β_i B_i(x, z).     (1.2)
Popular choices for the B_i include B-forms (De Boor, 1987), tensor products of univariate B-splines (De Boor, 1978), and multivariate regression spline bases (Friedman, 1991; Stone, 1994). By replacing f(x, z) in equation (1.1) by its approximation (1.2), the regression becomes a linear model in the coefficients β_i, which can be estimated by least squares. In general, it is difficult to determine which basis functions to use in (1.2). If the basis has too many functions and the regression parameters are calculated using least squares, then the resulting surface estimator may have a high local variance, producing interpolation in the extreme case. Conversely, if too few or poorly chosen functions are in the basis, then the resulting estimator may be locally biased, with whole features of the surface missed. The solution advocated in this paper is to use a basis with many terms and adopt a hierarchical Bayesian model in which terms are allowed to be in or out of the regression. It is this
procedure that makes the estimator nonparametric, rather than a fit of a linear combination of parametric terms. Smith and Kohn (1996) propose a Bayesian approach to univariate nonparametric regression using a univariate regression spline basis containing a large number of elements. They consider two estimators of the regression function. The first estimator is based on the subset of basis functions with the highest posterior probability. The second is an estimator of the posterior mean of the regression curve, averaging over the distribution of all possible subsets of the basis functions. Smith and Kohn (1996) show that this approach works well on both real and simulated data, comparing favorably with a kernel based local linear approach using a direct plug-in estimator for the bandwidth. They also extend their analysis to additive and robust nonparametric regression. This paper extends the approach of Smith and Kohn (1996) to bivariate surface estimation by utilizing a bivariate regression spline basis. It establishes in an empirical manner the following important features of the resulting mixture estimate of the regression surface: (i) That it selects terms in a substantially more efficient manner than MARS (Friedman, 1991) from a similar bivariate regression spline basis. (ii) That, as a bivariate surface estimator, it is competitive with multivariate smoothing splines (Wahba, 1990; Gu, Bates, Chen and Wahba, 1989) and quadratic local regression as implemented by Clive Loader's 'locfit' procedure. (iii) That it can be extended to bivariate nonparametric regression problems which are difficult to solve using other methods. Two such extensions are discussed in the paper: robust bivariate surface estimation and bivariate surface estimation when the errors are autocorrelated. The approach in this paper can be readily extended to higher dimensional problems by using higher dimensional basis functions. In addition, it can be generalized to basis families which do not have the simple structure of multivariate regression splines; a point that is discussed in section 7. Section 2 outlines Bayesian model selection and, in particular, Bayesian subset selection. Section 3 illustrates how such subset selection can be used to nonparametrically estimate a
bivariate surface and presents extensive empirical comparisons with other alternative methods. Section 4 shows how to make these surface estimates robust to outliers. Section 5 shows how to estimate the surface nonparametrically when the errors are autocorrelated. Section 6 discusses how the choice of preset parameters affects the results, and compares the running time of the Bayesian estimators with the times required by competing estimators.
2 Bayesian model and subset selection

2.1 Introduction

This section discusses Bayesian model selection (and hence the special case of Bayesian subset selection) for linear regression models. The procedure forms the basis for the nonparametric estimator of a bivariate surface presented in section 3. Let Γ be a family of linear regression models all having the same dependent variable and such that for γ ∈ Γ

y = X_γ β_γ + e.     (2.1)

Here, y = (y_1, ..., y_n)' is the vector of dependent variables, X_γ is an n × q(γ) matrix of independent variables which is of full column rank, β_γ is a q(γ)-vector of regression parameters and the errors e ~ N(0, σ² I_n). Bayesian model selection for such linear models usually consists of selecting those members of Γ with the highest posterior probability. Subset selection is a special case of this model selection problem in which there exists an n × q matrix of independent variables X, such that Γ is the set of all 2^q possible subsets of the columns of X. In this case we can write the model parameter as γ = (γ_1, ..., γ_q) and set γ_i = 1 if the ith column of X is in the model and γ_i = 0 if it is not. If β = (β_1, ..., β_q) is a coefficient vector corresponding to the full design X, then β_γ consists of all elements β_i such that γ_i = 1. Some approaches to nonparametric regression correspond to such a subset selection problem; for example, the regression spline formulation focused on in this paper and discussed in section 3. However, for others it is more useful to think in terms of model selection where Γ
need not have the special structure outlined above. An example of this latter case is the use of B-splines and B-forms as in Stone, Hansen, Kooperberg and Truong (1996), where dropping one column of X_γ results in changes to several other columns of the design matrix. To tackle model selection, including the special case of subset selection, the following hierarchical prior proposed by Smith and Kohn (1996) will be used to provide a Bayesian analysis. (i) The prior for β_γ | σ², γ is

β_γ | σ², γ ~ N(0, cσ² (X_γ' X_γ)^{-1}).     (2.2)
In (2.2) the variance matrix of β_γ | σ², γ is the variance matrix of the least squares estimator of β_γ scaled by c. Here, c is set so large that this conditional prior of β_γ contains much less information about β_γ than the likelihood p(y | β_γ, σ², γ). In all the empirical work presented in this paper we set c = 100,000 and find that it works well, though there is further discussion of this point in section 6. (ii) The prior p(σ² | γ) ∝ 1/σ², which means that log σ² has a flat prior. This is a commonly used prior; see, for example, Box and Tiao (1973). (iii) The prior for γ is p(γ) = 1/#(Γ), where #(Γ) is the number of elements in Γ. Thus all the models are equally likely a priori. Using this framework, the posterior probability of a model γ is given by
p(γ | y) ∝ p(y | γ) p(γ) ∝ ∫∫ p(y | β_γ, σ², γ) p(β_γ | σ², γ) p(σ² | γ) dβ_γ dσ² ∝ (1 + c)^{-q(γ)/2} S(γ)^{-n/2},     (2.3)

where S(γ) = y'y - c/(1+c) y'X_γ(X_γ'X_γ)^{-1}X_γ'y.
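To make the calculation in (2.3) concrete, the following sketch evaluates the log of the unnormalized posterior model probability for a given γ. It is a minimal illustration written for this rewrite, not the authors' code; the function and argument names are ours, γ is taken to be a boolean vector over the columns of a full design matrix X, and no attempt is made at the efficient updating the authors use.

```python
import numpy as np

def log_model_prob(y, X, gamma, c=1e5):
    """Log of the unnormalised posterior model probability p(gamma | y) from (2.3)."""
    Xg = X[:, gamma]                       # columns of X with gamma_i = 1
    q_gamma = Xg.shape[1]
    if q_gamma == 0:
        S = y @ y                          # empty model: S(gamma) = y'y
    else:
        Xty = Xg.T @ y
        S = y @ y - (c / (1.0 + c)) * (Xty @ np.linalg.solve(Xg.T @ Xg, Xty))
    return -0.5 * q_gamma * np.log(1.0 + c) - 0.5 * len(y) * np.log(S)
```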
2.2 The Gibbs sampler for subset selection

When there are a large number of models it is computationally impractical to consider each one separately and some search strategy is necessary to find the most promising models. When model selection coincides with subset selection a number of deterministic search strategies, such as stepwise regression, have been suggested. The Gibbs sampler (Gelfand and
Smith, 1990) provides a stochastic alternative for subset selection which eventually produces samples from the posterior distribution γ | y. As a search strategy it can be used with any model selection criterion, such as adjusted R² or BIC, though we concentrate on the posterior probability p(γ | y) itself. The following Gibbs sampler is proposed by Smith and Kohn (1996), who show how to implement it efficiently.
Gibbs sampler for subset selection
(i) Choose an initial value γ^[0] = (γ_1^[0], ..., γ_q^[0]) of γ. (ii) Successively generate from p(γ_i | y, γ_{j≠i}), i ∈ {1, ..., q}. It can be shown (Gelfand and Smith, 1990) that the iterates of γ produced by the Gibbs sampler converge to samples from the posterior distribution p(γ | y). Step (ii) is carried out many times and in two stages. The first stage is a warmup period, at the end of which it is assumed that the sampler has converged to the joint distribution p(γ | y). The second stage is a sampling period and the γ_i generated during this period are used for inference. The conditional posterior probability of γ_i can be deduced from (2.3) as
p(γ_i | y, γ_{j≠i}) ∝ (1 + c)^{-q(γ)/2} S(γ)^{-n/2}.     (2.4)

Because γ_i is binary, the conditional probability p(γ_i | y, γ_{j≠i}) is obtained by evaluating (2.4) for γ_i = 0 and γ_i = 1 and normalizing.
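A naive sketch of one pass of step (ii) is given below, using the log_model_prob function from the previous sketch. It is our own illustration under assumed names; the authors' implementation avoids the repeated solves used here by updating the required quantities efficiently (Smith and Kohn, 1996).

```python
import numpy as np

def gibbs_sweep(y, X, gamma, c=1e5, rng=None):
    """One pass of step (ii): draw each gamma_i from p(gamma_i | y, gamma_{j != i})."""
    rng = np.random.default_rng() if rng is None else rng
    gamma = gamma.copy()                           # boolean vector of length q
    for i in range(X.shape[1]):
        logp = np.empty(2)
        for val in (0, 1):                         # evaluate (2.4) at gamma_i = 0 and 1
            gamma[i] = bool(val)
            logp[val] = log_model_prob(y, X, gamma, c)
        p1 = 1.0 / (1.0 + np.exp(logp[0] - logp[1]))   # normalise the two cases
        gamma[i] = rng.random() < p1
    return gamma
```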
2.3 Estimation with subset selection

This section discusses parameter estimation under the assumption of subset uncertainty. It assumes use of the Gibbs sampler for subset selection outlined above and is extended to the more general model selection case in section 7. In most approaches to nonparametric regression using a linear model, a model is first selected and its coefficients are then estimated, usually by least squares. By examining the posterior probabilities p(γ^[j] | y) of the converged sampling sequence γ^[1], ..., γ^[J], estimates of the mode of the model space Γ are obtainable as in Smith and Kohn (1996). However, because the cardinality #(Γ) is often very large, a long sampling period may be required to uncover the model with the very best posterior probability.
Therefore, we focus on the posterior mean E(β | y) = Σ_{γ∈Γ} E(β | y, γ) p(γ | y), which is approximated using the following mixture estimate proposed in Smith and Kohn (1996):

β̂ = (1/J) Σ_{j=1}^J E(β | y, γ^[j]).     (2.5)

Here, the calculations are completed by using the iterates of the converged sampling sequence γ^[1], ..., γ^[J]. The conditional expectation at (2.5) is evaluated exactly at each iteration, as β_{γ^[j]} is conditionally multivariate Student t, while the other elements of β are exactly zero.
Raftery, Madigan and Hoeting (1993) call (2.5) model averaging. When the true function is not exactly a linear combination of a subset of the basis terms being considered, the model averaging estimator seems to outperform the posterior mode estimator; see Smith and Kohn (1996) for examples. It also usually requires far fewer iterations to obtain a good mixture estimate than it takes to search for a mode through the 2^q possible models in Γ. This difference in the number of iterations is more pronounced in the robust case, discussed in section 4, as the multinomial parameter space of the outlier and subset selection processes now has cardinality 2^{n+q}. In our experience, converged sampling runs of less than one thousand iterations are adequate to get an excellent estimate of the posterior mean. Throughout this paper we use three hundred iterations to ensure convergence and a further five hundred for estimation. Finally, we note that mixture estimates are only available in a Bayesian framework and not in frequentist approaches.
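As a further illustration, the mixture estimate (2.5) only needs E(β | y, γ) at each sampled model; under the prior (2.2) the nonzero block of this conditional mean is the shrunken least squares estimate c/(1+c) (X_γ'X_γ)^{-1}X_γ'y. The sketch below averages it over a set of sampled γ vectors. The calculation and names are our own reconstruction, stated as an assumption rather than the authors' implementation.

```python
import numpy as np

def mixture_estimate(y, X, gamma_draws, c=1e5):
    """Mixture estimate (2.5): average E(beta | y, gamma) over the sampled models."""
    beta_hat = np.zeros(X.shape[1])
    for gamma in gamma_draws:                      # iterates from the sampling period
        if not gamma.any():
            continue                               # empty model contributes zeros
        Xg = X[:, gamma]
        # E(beta_gamma | y, gamma) under prior (2.2): shrunken least squares estimate
        bg = (c / (1.0 + c)) * np.linalg.solve(Xg.T @ Xg, Xg.T @ y)
        beta_hat[gamma] += bg                      # remaining elements stay exactly zero
    return beta_hat / len(gamma_draws)
```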
3 Bivariate surface estimation

3.1 Tensor product regression splines

The bivariate surface at (1.2) can be modeled using a tensor product of two univariate functional bases, so that

f(x, z) ∈ span[ {1, b_j^1(x) | j ∈ I_1} ⊗ {1, b_j^2(z) | j ∈ I_2} ].     (3.1)

Here, I_1 and I_2 are indexing sets, {1, b_j^1(x) | j ∈ I_1} and {1, b_j^2(z) | j ∈ I_2} are univariate function bases corresponding to the two independent variables x and z, while the tensor product U ⊗ V := {uv | u ∈ U, v ∈ V}. Writing f as a linear combination of the elements of such a tensor product basis,

f(x, z) = β_1 + Σ_{j∈I_1} β_j^1 b_j^1(x) + Σ_{j∈I_2} β_j^2 b_j^2(z) + Σ_{i∈I_1} Σ_{j∈I_2} β_{i,j}^{1,2} b_i^1(x) b_j^2(z)     (3.2)
        = β_1 + Σ_{j∈I} β_j B_j(x, z).     (3.3)
Here, I is an indexing set, the β_j are the regression parameters and the functions B_j(x, z) are bivariate basis functions. Notice that this bivariate basis can be decomposed into main effects f_1 and f_2 in x and z, along with an interaction part f_12, so that

f_1(x) = Σ_{j∈I_1} β_j^1 b_j^1(x),   f_2(z) = Σ_{j∈I_2} β_j^2 b_j^2(z),   f_12(x, z) = Σ_{i∈I_1} Σ_{j∈I_2} β_{i,j}^{1,2} b_i^1(x) b_j^2(z).
The univariate function bases we use in this paper are the (m-1)-times differentiable regression spline bases, where

{1, b_j^1(x) | j ∈ I_1} = {1, x, x², ..., x^m, (x - k_1^1)_+^m, ..., (x - k_{K_1}^1)_+^m}
{1, b_j^2(z) | j ∈ I_2} = {1, z, z², ..., z^m, (z - k_1^2)_+^m, ..., (z - k_{K_2}^2)_+^m}.

Here, (k_1^1, k_2^1, ..., k_{K_1}^1) and (k_1^2, k_2^2, ..., k_{K_2}^2) are knot sequences for x and z respectively, the function (·)_+^m = (max{·, 0})^m, I_1 = {2, 3, ..., m + K_1 + 1} and I_2 = {2, 3, ..., m + K_2 + 1}. When the independent variable is continuous, we use the cubic regression spline basis (m = 3) with the knots placed so that they follow the variable's observed density. That is, one knot is placed every ath value of the sorted independent variable. Here, a is chosen so that there are enough knots to be reasonably sure that a knot pair (k_i^1, k_j^2) is at, or near, every position required to capture the fluctuations in the surface. Unless the true underlying function is highly oscillatory, the configuration used throughout this paper of eighty-one knot pairs (a nine by nine grid) will usually suffice. When the independent variable is highly discrete, we use linear splines (m = 1) with knots placed at internal values of x and z as in Smith, Sheather and Kohn (1996).
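For illustration, the sketch below builds the cubic truncated power bases with density-following (quantile) knots and assembles the tensor product design of (3.2). It is a simplified sketch under our own function names and uses quantiles as a stand-in for the "every ath sorted value" rule; it is not the authors' code.

```python
import numpy as np

def quantile_knots(x, K):
    """K interior knots that follow the observed density of x."""
    return np.quantile(x, np.arange(1, K + 1) / (K + 1))

def power_basis(x, knots, m=3):
    """Univariate basis {x, ..., x^m, (x - k_1)_+^m, ..., (x - k_K)_+^m}."""
    cols = [x ** p for p in range(1, m + 1)]
    cols += [np.maximum(x - k, 0.0) ** m for k in knots]
    return np.column_stack(cols)

def tensor_product_design(x, z, K1=9, K2=9, m=3):
    """Design matrix for (3.2): intercept, main effects and all pairwise products."""
    B1 = power_basis(x, quantile_knots(x, K1), m)
    B2 = power_basis(z, quantile_knots(z, K2), m)
    inter = np.column_stack([B1[:, i] * B2[:, j]
                             for i in range(B1.shape[1])
                             for j in range(B2.shape[1])])
    return np.column_stack([np.ones_like(x), B1, B2, inter])
```

With m = 3 and K_1 = K_2 = 9 this gives 1 + 12 + 12 + 144 = 169 = 13² columns, matching the count used in the simulations of section 3.5.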
3.2 Design-adaptive regression splines

For many designs in x and z, most knot pairs will fall inside the convex hull of the data, with the few that do not being removed automatically. However, to ensure knot pairs follow the bivariate density of x and z in cases where there are strong relationships between the independent variables, we alter part of the interaction terms in equation (3.2) by replacing each term of the form (x - k_i^1)_+^m (z - k_j^2)_+^m with (x - k_i^1)_+^m (z - k_j^{2,i})_+^m. The other terms, including the main effects, remain unchanged, while the knot pairs (k_i^1, k_j^{2,i}) are now allowed to adapt to the design. For each i, the knots k_1^{2,i}, ..., k_{K_2}^{2,i} are chosen to follow the conditional observed density of z in the locality of x = k_i^1, rather than the marginal density of z. Figure 1 gives an example of how such adaptive placing guards against initial knot placement outside the observed bivariate density of the design. Although the resulting bivariate basis is no longer a tensor product basis as in (3.1) and (3.2), the decomposition into main and interaction effects, and the ability to write the model as a linear combination of bivariate basis functions as in (3.3), still apply. To differentiate between the two bivariate bases we call the first a 'tensor product regression spline basis' and the second a 'design-adaptive regression spline basis'.

---- figure 1 about here ----
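One way to mimic this adaptive placement is to choose, for each x-knot k_i^1, the z-knots as quantiles of the z values whose x falls in (k_i^1, k_{i+1}^1], which is the region of nonzero contribution of the corresponding interaction terms. The sketch below is our own assumption about the details, not the authors' exact rule.

```python
import numpy as np

def design_adaptive_z_knots(x, z, x_knots, K2=9):
    """For each x-knot k_i^1, z-knots k_j^{2,i} following the density of z | x near k_i^1."""
    edges = np.append(x_knots, np.inf)        # interval (k_i^1, k_{i+1}^1] for each i
    probs = np.arange(1, K2 + 1) / (K2 + 1)
    z_knots = []
    for lo, hi in zip(x_knots, edges[1:]):
        local_z = z[(x > lo) & (x <= hi)]
        if local_z.size < 2:                  # fall back to marginal placement
            local_z = z
        z_knots.append(np.quantile(local_z, probs))
    return np.array(z_knots)                  # shape (K1, K2)
```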
3.3 Estimating the surface by Bayesian subset selection

The previous two sub-sections construct a bivariate regression spline basis with a large number of terms. Equation (3.3) expresses the function f as a linear combination of basis functions, making the nonparametric regression problem a subset selection problem in which we expect most of the variables to be redundant. The coefficient vector β is estimated by the mixture estimate β̂ given at equation (2.5) and the estimate of the regression surface is
therefore

f̂(x, z) = β̂_1 + Σ_{i∈I} β̂_i B_i(x, z).

In addition, we can also extract the columns and estimated coefficients corresponding to the main effects and the interaction and write them as f̂_1(x), f̂_2(z) and f̂_12(x, z).
3.4 Discussion of other surface estimators

There are a number of different approaches to nonparametric bivariate regression. A popular method is local polynomial regression, such as that discussed by Cleveland and Grosse (1991) and Ruppert and Wand (1994). To make this approach practical it is crucial to obtain data driven estimates of the smoothing parameters. For example, one procedure that does so is Clive Loader's 'locfit' locally quadratic estimator, which uses an adaptive smoothing parameter. At the time of writing, his program and associated documentation are available on the world wide web at http://cm.bell-labs.com/ms/departments/sia/project/locfit/index.html. A second approach to nonparametric bivariate surface estimation is to use multivariate smoothing splines. This approach was developed by Wahba and her co-workers, who use two distinct types of smoothing splines. The first is called a thin plate spline and involves a single smoothing parameter, while the second type is called a tensor product spline and involves more than one smoothing parameter (Gu et al., 1989). Usually, data-driven estimates of the smoothing parameters are obtained by either generalized cross validation (GCV) or generalized maximum likelihood. Both methods are described in Wahba (1990). A number of other approaches to nonparametric regression consist effectively of basis selection. While the bases used are often multivariate regression splines, similar to those discussed in sections 3.1 and 3.2, both the search procedures and method of function estimation are radically different. They include the use of deterministic procedures, such as stepwise selection (Friedman and Silverman, 1989) and regression trees (Friedman, 1991; Stone, Hansen, Kooperberg and Truong, 1996), to locate a single best model, the coefficients of which are then estimated, usually by least squares. In this paper the search algorithm is stochastic (see section 2.2) and the entire distribution of models (parameterized by γ) is accounted for via the
use of mixture estimates which smooth over the distribution of γ ∈ Γ (see section 2.3).
3.5 Simulation comparisons

Simulations were performed to compare the performance of the proposed nonparametric Bayesian estimator with some other surface estimators in the literature. The following three examples were considered for the model at (1.1). Example 1: x and z were distributed independently normal with a mean of 0.5 and variance of 0.1, while f(x, z) = (1/5) exp(-8x²) + (3/5) exp(-8z²), which is an additive model used in Gu et al. (1989). Example 2: x and z were distributed independently uniform on [0,1], while f(x, z) = x sin(4z) was a nonlinear interaction with very different partial derivatives in the x and z directions. Example 3: x and z were both distributed normally with a mean of 0.5, variance of 0.05 and correlation of 0.5, while f(x, z) = xz was a linear interaction. Perspective plots of all three functions over a typical convex hull from the design in each example are given in figures 2(a)-(c).

---- figure 2 about here ----
Three hundred observations of x, z and y were generated for each of the three examples with σ = (1/4) range(f) in equation (1.1). A design-adaptive regression spline basis was used with K_1 = K_2 = 9, resulting in 13² = 169 total terms from which a significant subset was to be selected. Along with the Bayesian subset selection based procedure (BSS) outlined in this paper, the following estimators were used to fit the generated data: (i) MARS (Friedman, 1991): version 3.6 was used, where the total number of possible terms was restricted to 169, as the program default of 15 appeared purely arbitrary. It should be noted that rarely were more than 15 terms selected anyway. (ii) LFAF: Clive Loader's 'locfit' program with a variable bandwidth. This procedure requires an initial estimate of σ² and, following Clive Loader's advice, we obtained this from a nearest neighbor fit with a fixed global bandwidth. This bandwidth was chosen by experimentation to be a little larger than the minimal bandwidth necessary for the nearest neighbor estimator to exist. This approach to estimating σ² worked well in practice and produced surface estimates using locfit that were similar to
those obtained by locfit using the true value of σ². (iii) TPMSS: a bivariate cubic thin plate spline with a single smoothing parameter, which is estimated by GCV as in Gu et al. (1989). (iv) TENSMSS: a tensor product cubic smoothing spline using five smoothing parameters, which were estimated by GCV as in Gu et al. (1989). (v) ACE: Breiman and Friedman's (1985) additive backfitting routine based on 'super-smoother' as implemented in Splus, with the data transformation turned off. (vi) LSP: a least squares fit of the parametric linear interaction model y_i = β_0 + β_1 x_i + β_2 z_i + β_3 x_i z_i + e_i. One hundred replications of this simulation were carried out for each example and for each of the estimators. The performance of all the estimators was measured for each of the three test functions using an approximation to the integrated squared error given by AISE(f̂) = (1/N) Σ_{i=1}^N (f_i - f̂_i)². Here, {f_i} and {f̂_i} are the true and estimated function values, respectively, over some N points. In this simulation we used the n design locations.

---- figure 3 about here ----
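For reference, the error criterion is straightforward to compute at the design points; the sketch below evaluates it for a simple least squares fit of the linear interaction model (the LSP benchmark) on data in the spirit of Example 3. The data-generation details here (a uniform design, a fixed seed) are ours for illustration only and do not reproduce the paper's design.

```python
import numpy as np

def aise(f_true, f_hat):
    """Approximate integrated squared error over the N evaluation points."""
    return np.mean((np.asarray(f_true) - np.asarray(f_hat)) ** 2)

rng = np.random.default_rng(0)
n = 300
x, z = rng.uniform(size=n), rng.uniform(size=n)    # simplified uniform design
f = x * z                                          # Example 3 style signal
y = f + rng.normal(scale=0.25 * (f.max() - f.min()), size=n)

# LSP benchmark: least squares fit of y = b0 + b1 x + b2 z + b3 x z
D = np.column_stack([np.ones(n), x, z, x * z])
beta, *_ = np.linalg.lstsq(D, y, rcond=None)
print("AISE of LSP fit:", aise(f, D @ beta))
```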
Boxplots of the results for the three examples are given in figures 3(a)-(c). Perspective plots are also given of the surface estimates corresponding to the upper decile, median and lower decile of the AISE for each of the three examples (figures 4, 5 and 6). These have been provided for the BSS and MARS estimators, together with the better performer of LFAF, TPMSS and TENSMSS for each example. The simulation results provide a range of interesting insights into the relative strengths and weaknesses of the various estimators and we wish to stress the following points. (i) The excellent performance of ACE on the additive example is matched by the more general bivariate estimators BSS, TPMSS and TENSMSS. (ii) MARS performs poorly across all examples, revealing a consistent failure to select appropriate terms from the bivariate regression spline basis. Particularly disappointing is its inability to fit the third test function well, as the true function falls exactly within the regression spline basis considered by MARS. (iii) In direct contrast, the BSS estimator performs well in all three examples, selecting appropriate terms from a similar regression spline basis as MARS. It markedly outperforms the LSP estimator
on the third test function, which is very encouraging as the parametric model fitted under LSP is close to the true model. This superior performance can be attributed to the fact that the models γ^[j] that are generated under BSS are often exactly the true model xz, which is in the regression spline basis. (iv) Of the two multivariate smoothing splines, TENSMSS is consistently more variable than TPMSS because more smoothing parameters are estimated. For the second example TENSMSS performs much better than TPMSS because this function has very different curvature in the x and z directions and so requires more than one smoothing parameter. (v) It is important to note that the bivariate regression spline basis used here may not provide good approximations to certain functions outside the polynomial space it spans. Nevertheless, the excellent results for the second test function, which is not in this space, illustrate that it often does so.

---- figures 4, 5 and 6 about here ----
We conclude with two technical points. First, the MARS code normalizes its graphical surface estimates so that min(f̂) = 0. Second, the surfaces are plotted only over the convex hull of x and z as this represents the support of the design.
4 Robust surface estimation

4.1 Robust Bayesian model selection

This section shows how to make Bayesian model selection, and hence bivariate surface estimation, robust to outliers in the observations y. We are unaware of any other robust bivariate nonparametric estimator available in the literature that is fully data-driven. We follow Smith and Kohn (1996) and Smith, Sheather and Kohn (1996) and model the errors as a mixture of two normals by assuming that the errors in (1.1) are independently distributed with e_i ~ N(0, ω_i σ²). Here, ω_i = 1 if the ith observation is not an outlier and ω_i = 100 otherwise. The use of a mixture of normals to model an outlier process in a linear regression is a popular approach (Verdinelli and Wasserman, 1991). To carry out robust Bayesian model
and subset selection, we amend the priors in section 2.1 as follows. (i) Let ω = (ω_1, ..., ω_n) and Ω_ω = diag(ω). The prior for β_γ is adjusted to incorporate the error variance from the outlier process, so that β_γ | σ², γ, ω ~ N(0, cσ² (X_γ' Ω_ω^{-1} X_γ)^{-1}). (ii) The prior p(σ² | γ, ω) ∝ 1/σ² and the prior for γ is the same as in section 2.1. (iii) The ω_i are independent a priori, with the prior probability p(ω_i = 100) set to 0.05 in our numerical work to represent the notion that outliers are rare; we find that this works well in practice. The Gibbs sampler in section 2.2 can be generalized to the robust case to obtain iterates from the joint posterior distribution ω, γ | y as follows.
Gibbs sampler for robust subset selection
(i) Choose initial values γ^[0] and ω^[0] for γ and ω. (ii) Successively generate from p(γ_i | y, γ_{j≠i}, ω), i = 1, ..., q. (iii) Successively generate from p(ω_i | y, γ, ω_{j≠i}), i = 1, ..., n. Step (ii) is carried out as in section 2.2 and ω_i is generated similarly to γ_i; see Smith, Sheather and Kohn (1996) for an efficient algorithm for generating the ω_i. Let (γ^[1], ω^[1]), ..., (γ^[J], ω^[J]) be a sequence of iterates from the posterior distribution of γ and ω. The following mixture estimates of β and p(ω_i = 100),

β̂ = (1/J) Σ_{j=1}^J E(β | y, γ^[j], ω^[j]),   p̂(ω_i = 100) = (1/J) Σ_{j=1}^J p(ω_i = 100 | y, γ^[j], ω^[j]_{k≠i}),

give robust estimates of the regression parameters (and hence the resulting bivariate surface) and the posterior probability of the ith observation being an outlier.
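For concreteness, a naive sketch of step (iii) is given below. Integrating β_γ and σ² out as for (2.3), but under the weighted prior above, gives p(y | γ, ω) ∝ |Ω_ω|^{-1/2} (1+c)^{-q(γ)/2} S(γ, ω)^{-n/2} with S(γ, ω) = y'Ω_ω^{-1}y - c/(1+c) y'Ω_ω^{-1}X_γ(X_γ'Ω_ω^{-1}X_γ)^{-1}X_γ'Ω_ω^{-1}y; this expression is our own derivation and should be read as an assumption. Each ω_i is then drawn from its two-point conditional. The efficient algorithm of Smith, Sheather and Kohn (1996) avoids the repeated solves used here.

```python
import numpy as np

def log_robust_marginal(y, X, gamma, omega, c=1e5):
    """log p(y | gamma, omega) up to a constant, under the weighted prior of section 4.1."""
    w = 1.0 / omega                                # Omega^{-1} is diagonal with entries 1/omega_i
    Xg = X[:, gamma]
    Xw = Xg * w[:, None]                           # Omega^{-1} X_gamma
    S = y @ (w * y) - (c / (1.0 + c)) * (y @ Xw) @ np.linalg.solve(Xg.T @ Xw, Xw.T @ y)
    return (-0.5 * np.sum(np.log(omega))
            - 0.5 * Xg.shape[1] * np.log(1.0 + c)
            - 0.5 * len(y) * np.log(S))

def sample_omega(y, X, gamma, omega, p_out=0.05, c=1e5, rng=None):
    """Step (iii): draw each omega_i from its two-point conditional distribution."""
    rng = np.random.default_rng() if rng is None else rng
    omega = omega.copy()
    for i in range(len(y)):
        logp = {}
        for val, prior in ((1.0, 1.0 - p_out), (100.0, p_out)):
            omega[i] = val
            logp[val] = log_robust_marginal(y, X, gamma, omega, c) + np.log(prior)
        p100 = 1.0 / (1.0 + np.exp(logp[1.0] - logp[100.0]))
        omega[i] = 100.0 if rng.random() < p100 else 1.0
    return omega
```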
4.2 Simulation comparisons

Smith, Sheather and Kohn (1996) present evidence to suggest that modeling errors as a mixture of normals when fitting a linear model is competitive with several frequently used non-Bayesian robust estimators. They also undertake simulations indicating that Bayesian subset selection, combined with a mixture of normals approach to robustness, compares favorably with robust quantile smoothing spline methods (Koenker, Ng and Portnoy, 1994) in
univariate nonparametric regression. Smith and Kohn (1996) compare the Bayesian procedure with local polynomial robust regression (Cleveland, 1979) and again show this approach to robust nonparametric additive regression is competitive. To extend these empirical results, the simulation experiments of the previous section are now repeated, but with twenty of the three hundred observations generated being potential outliers. These observations were restricted to fall within the convex hull of the data and were generated with an error standard deviation equal to (15/4) range(f). The error standard deviation of the rest of the 'clean' observations remained (1/4) range(f). Robust Bayesian subset selection (RBSS) was applied to select terms from the same design-adaptive regression spline basis as was used in section 3.5. The data were also fitted by the same non-Bayesian estimators as used in section 3.5 to see how sensitive they were to outliers. Figures 3(d)-(f) give boxplots of log(AISE) for all the estimators. The non-robust techniques, of course, perform poorly. However, it is interesting to note that the nonparametric procedures appear to perform worse than the parametric model LSP, suggesting that if outliers appear prominent in the data, a parametric model should be considered instead of the non-robust nonparametric estimators. In contrast, the RBSS estimator appears highly successful in down-weighting the outlying observations, even though the method of generating the errors is not the same as the prior for the outliers. This feature is further discussed in Smith and Kohn (1996), where outliers generated from non-normal distributions were also found to be well captured using a simple mixture of two normals. The resulting surface fits appear highly robust; a comparison between the boxplots for RBSS and BSS in figures 3(a)-(f) illustrates that the robust fit to the outlier contaminated data is only slightly less efficient than the non-robust fit to the 'clean' data.

---- figures 7 and 8 about here ----
4.3 The electricity savings data

The robust procedure was applied to a subset of the variables in the electricity savings data used by Mitchell and Beauchamp (1988). The data consist of n = 401 observations of a dependent variable y_i measuring the electricity savings in a weatherization program. There are a number of potential predictor variables and the full dataset was fitted as a robust nonparametric additive model in Smith, Sheather and Kohn (1996). The two independent variables we use in this section are pre-program electricity use (x) and heated floor area (z), which are positively correlated; see figure 7(a). The robust Bayesian bivariate estimator was used to fit a surface, with robust subset selection made from 169 terms corresponding to the 9 knots placed along the domain of each independent variable. A design-adaptive regression spline basis was used and the resulting robust mixture estimate of the surface is given in contour form in figure 7(a) and as a perspective plot in figure 7(b). Figure 8(d) gives the estimated posterior probability that each observation is an outlier, with 6 observations having probability greater than 1/2. These 6 observations also appear on the perspective and contour plots in figures 7(a) and 7(b). The functional decomposition of the surface into f̂_1, f̂_2 and f̂_12, given in figures 8(a)-(c),
indicates that there is a strong main effect in x, but none in z. The interaction surface estimate f̂_12 suggests that there is a significant interaction over a section of the domain of the independent variables and that this bivariate subset of the electricity savings data should not be modeled additively. The overall surface estimate suggests that while there is a strong increase in electricity savings as pre-program electricity usage increases, dwellings with a large heated floor area do not benefit from this as greatly as otherwise expected.
5 Surface estimation for time series data

5.1 Bayesian model selection with autocorrelated errors

If the observations are collected in time order, so that y_t is now collected over the time interval t = 1, ..., n, then the errors are potentially autocorrelated. Such autocorrelation can result in very inefficient estimates of the regression surface if ignored. Smith, Wong and Kohn (1996) show how to handle autocorrelated errors in additive nonparametric regression and this section extends the approach to bivariate surface estimation. To the best of our knowledge, there is no estimator currently in the literature for estimating a bivariate regression surface with autocorrelated errors. We rewrite (1.1) as y_t = f(x_t, z_t) + u_t, where u_t = Σ_{j=1}^s φ_j u_{t-j} + e_t, so that the errors u_t are modeled as a stationary autoregressive process of maximal order s. Here, φ = (φ_1, ..., φ_s)' are the autoregressive parameters and can be reparameterized in terms of the partial autocorrelations ψ = (ψ_1, ..., ψ_s)' (Monahan, 1984; Smith, Wong and Kohn, 1996), which are bounded to enforce stationarity, so that -1 < ψ_i < 1 for i = 1, ..., s. The errors e_t ~ N(0, σ²) are independent and, as in Monahan (1984), var(u) = σ²Λ, with the matrix Λ depending only on ψ. Extending Bayesian model selection to this autocorrelated error case is briefly discussed below and the reader is referred to Smith, Wong and Kohn (1996) for a full discussion. To determine the order of the autoregression we introduce the binary variables δ = (δ_1, ..., δ_s), such that δ_i = 0 if ψ_i is identically zero and δ_i = 1 otherwise. The true order l ≤ s of the autoregression can be determined from the posterior distribution of δ because, by Monahan (1984), φ_{l+1} = ... = φ_s = 0 if and only if ψ_{l+1} = ... = ψ_s = 0. Parameterization of the autocorrelation by ψ allows the adjustment of the conditional prior for β_γ to take account of var(u), so that β_γ | σ², γ, ψ ~ N(0, cσ² (X_γ' Λ^{-1} X_γ)^{-1}). The prior for ψ_i | δ_i = 1 is uniform on (-1, 1), while a descending prior on the elements of δ is used, so that p(δ_1 = 1) = 0.5, p(δ_2 = 1) = 0.4, p(δ_3 = 1) = 0.3, p(δ_4 = 1) = 0.2 and p(δ_i = 1) = 0.1 for i > 4. Using this Bayesian model, Smith, Wong and Kohn (1996) outline a sampling scheme which provides mixture and histogram estimates of E(β | y), E(ψ | y) and E(δ_i | y) = p(δ_i = 1 | y). These, in turn, are used to estimate the regression surface, the order of the autoregression, and the autoregressive coefficients.
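The map from partial autocorrelations back to autoregressive coefficients can be carried out with the standard Durbin-Levinson recursion; the sketch below shows one common form of this mapping under our own notation, and is not claimed to be the exact scheme of Monahan (1984) or of Smith, Wong and Kohn (1996).

```python
import numpy as np

def partials_to_ar(psi):
    """Map partial autocorrelations (each in (-1, 1)) to AR coefficients phi_1, ..., phi_s."""
    phi = np.array([], dtype=float)
    for k, pk in enumerate(psi, start=1):
        new = np.empty(k)
        new[k - 1] = pk                       # phi_k^{(k)} = psi_k
        new[:k - 1] = phi - pk * phi[::-1]    # phi_j^{(k)} = phi_j^{(k-1)} - psi_k phi_{k-j}^{(k-1)}
        phi = new
    return phi

# For an AR(2), psi = (phi_1/(1 - phi_2), phi_2); the true process of section 5.2 inverts back:
psi = np.array([0.2 / (1 - (-0.85)), -0.85])
print(partials_to_ar(psi))                    # approximately [0.2, -0.85]
```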
5.2 Simulated examples

Smith, Wong and Kohn (1996) show how high quality univariate function estimates can be obtained when the errors are autocorrelated. This section demonstrates the effectiveness of the Bayesian approach in the bivariate case. Three hundred observations were simulated from the regression model (1.1) for each of the three functions displayed in figures 2(a)-(c). However, to simplify the examples the designs now had x and z distributed independently uniform on [0,1]. The errors were generated from the second order autoregressive process with parameters φ_1 = 0.2, φ_2 = -0.85, and with σ = (1/4) range(f). Three Bayesian nonparametric surface estimators were fitted to these three data sets. The first estimator fits a second order autoregression to the errors, with δ_1 and δ_2 fixed at 1 and s = 2. That is, the error structure is assumed to be known except for the values of the parameters φ_1, φ_2 and σ². The second estimator fits an autoregression of maximal order s = 8 to the errors and selects the significant partial autocorrelations. The third estimator treats the errors as independent. Figure 9 plots the resulting surface estimates and shows that the estimator that treats the errors as independent, not taking account of the autocorrelation structure, performs very poorly. In contrast, the two estimators in which the autocorrelation was accounted for had a great deal of success in estimating the underlying functions, because for this example var(u_t) is much greater than var(e_t).

---- figure 9 and table 1 about here ----
Table 1 provides the estimates of the autoregressive and autocorrelation parameters for the two estimators which took the autocorrelation structure of the errors into account. Note that the 'full' procedure, which also picks the significant partial autocorrelations, strongly suggests that a second order autoregressive process is appropriate for the errors. This supports the results presented in Smith, Wong and Kohn (1996), which suggest that there is very little loss in function estimate efficiency when choosing the significant partial autocorrelations compared to the case when the correct order of the autoregression is known.
6 Implementation issues

6.1 Choice of priors

All the empirical results presented in the paper use the priors stated in the text. In particular, the single value c = 100,000 is used in the prior for β_γ. This value of c appears highly insensitive to n, σ and the 'shape of the function'. That is, c does not act in the manner of a traditional smoothing parameter in the smoothing spline or local polynomial regression sense, because the same value for c is appropriate for functions with very different amounts of curvature. This is indicated by the previous simulation examples, where the three functions had distinctly different profiles, but the same value of c worked well for all of them; see also Smith and Kohn (1996). We also note that the fit produced is relatively insensitive to incremental changes in c. To demonstrate these points, we performed a simulation experiment using the same specifications as in the simulations in section 3. We fitted the design-adaptive nonparametric estimator (with q = 169) to ten simulated data sets for each possible combination of the following factors: (i) the three test functions and designs; (ii) three noise levels σ = 1/8, 1/4 and 1/2 range(f); (iii) three levels of n = 300, 1200 and 4800; (iv) three levels of c = 50,000, 100,000 and 200,000. The means of log(AISE) over the 10 replicates for each combination of the factors are plotted in figure 10. This indicates that there is little difference in the performance of the estimator when c = 100,000 is perturbed by a factor of 1/2 or 2.

---- figure 10 about here ----
However, the choice of a good value of c appears to be related to q, the number of terms in the full model. In the applications outlined in this paper, these correspond to the number of potential knot sites considered. Smith and Kohn (1996) and Smith, Sheather and Kohn (1996) use Bayesian subset selection for univariate nonparametric regression and additive nonparametric regression. The number of terms in these papers ranged from 20 to 80 and c in the range 10 to 2000 worked well. Here, with a bivariate basis and q = 169, a
larger value of c appears to be required. Nevertheless, for fitting a bivariate function, q = 169 and c = 100,000 appears sufficient for most examples.
6.2 Timing comparisons

Table 2 lists the times required for each of the procedures to fit surfaces when n = 300, 1200 and 4800. The machine used was a low end modern workstation with all the programs compiled similarly. Although these timings are implementation dependent, they illustrate the feasibility of implementing the Bayesian solution relative to the other procedures.

---- table 2 about here ----
Note that both multivariate smoothing spline estimators require O(n³) operations, although the multiplier for the TPMSS estimator is substantially lower than that of TENSMSS. MARS requires O(n) operations and therefore has a much greater potential for modeling larger data sets. The number of operations required for the non-robust Bayesian estimator is independent of n, after a single initial matrix multiplication of (1/2) n q(q-1) operations. As q is fixed with respect to n, it has a great deal of potential in fitting large data sets. The robust Bayesian estimator is slower, because generating the binary variables ω_1, ..., ω_n requires O(n) operations for each iteration of the sampler. The times for the surface estimator with autoregressive errors are not reported, but this estimator is the slowest because generating each ψ_i requires O(n) operations, with a larger multiplier. An Splus compatible package to implement these Bayesian estimators will soon be publicly available from Statlib.
7 Alternative bases and model mixing

The focus of this paper has been on the multivariate regression spline bases outlined in sections 3.1 and 3.2 and subset selection. The two are a natural combination, as removal of an element from a regression spline basis corresponds exactly to the removal of a column
from the associated design matrix. However, other bases can be more complex and do not correspond exactly to subset selection. The sampling schemes in sections 2.2, 4.1 and 5.1 may not then be appropriate, as they rely on the binomial structure of the model parameterization γ ∈ Γ. Nevertheless, the formulation of the priors for general model selection found in sections 2.1, 4.1 and 5.1 does extend to selection from a general family of bases. For example, in the plain model selection case (that is, without an outlier or autoregressive error process) this can be accomplished in the following manner. Let {S_γ, γ ∈ Γ} be a family of bivariate bases with S_γ = {B_{i,γ}(x, z), i ∈ I_γ} the basis for model γ. As at (1.2), for γ ∈ Γ we approximate the surface f by
f(x, z) = Σ_{i∈I_γ} β_{i,γ} B_{i,γ}(x, z),

which we write as f(x, z) = x_γ(x, z)' β_γ, where β_γ = (β_{1,γ}, β_{2,γ}, ..., β_{q(γ),γ})', q(γ) = #(S_γ) and x_γ(x, z) = (B_{1,γ}(x, z), B_{2,γ}(x, z), ..., B_{q(γ),γ}(x, z))'. Thus, the nonparametric regression can be expressed in the linear regression form given at equation (2.1) and the posterior probability of any model p(γ | y) can be calculated as shown at equation (2.3). The posterior mean of f(x, z) can be expressed as
E{f(x, z) | y} = Σ_{γ∈Γ} E{f(x, z) | y, γ} p(γ | y) = Σ_{γ∈Γ} x_γ(x, z)' E{β_γ | γ, y} p(γ | y).     (7.1)
If an appropriate sampling scheme can be constructed to generate a sample γ^[1], ..., γ^[J] from γ | y, then the posterior mean at (7.1) can be estimated using the mixture estimate

f̂(x, z) = (1/J) Σ_{j=1}^J x_{γ^[j]}(x, z)' E{β_{γ^[j]} | γ^[j], y}.
The estimates for the robust and autocorrelated error model selections are similar. Two popular basis families that require special mention are tensor product B-splines (De Boor, 1978) and B-forms (De Boor, 1987). Here, removal of a knot corresponds to the removal of a term in a similar manner as with regression splines, but also requires recalculation of several adjacent basis terms found in the design X_γ. In this case the binomial structure
of γ remains appropriate and therefore the sampling schemes of sections 2.2, 4.1 and 5.1 can still be used. However, the fast implementation details given in Smith and Kohn (1996) no longer apply, as these relied on the fact that removing terms from X did not alter any adjacent basis terms in the resulting design matrix.
References

Box, G., and Tiao, G. (1973), Bayesian Inference in Statistical Analysis, Reading, Mass.: Addison-Wesley.

Breiman, L., and Friedman, J. (1985), "Estimating optimal transformations for multiple regression and correlation," J. Am. Stat. Ass., 80, 580-598.

Cleveland, W. (1979), "Robust locally weighted regression and smoothing scatterplots," J. Am. Stat. Ass., 74, 828-836.

Cleveland, W., and Grosse, E. (1991), "Computational methods for local regression," Statistics and Computing, 1, 47-62.

De Boor, C. (1978), A Practical Guide to Splines, New York: Springer-Verlag.

De Boor, C. (1987), "B-form basics," in Geometric Modeling, ed. G. Farin, Philadelphia: SIAM, pp. 131-148.

Friedman, J. (1991), "Multivariate adaptive regression splines," Ann. Stats., 19, 1-141.

Friedman, J. H., and Silverman, B. W. (1989), "Flexible parsimonious smoothing and additive modeling," Technometrics, 31, 3-21.

Gelfand, A., and Smith, A. (1990), "Sampling based approaches to calculating marginal densities," J. Am. Stat. Ass., 85, 398-409.

Gu, C., Bates, D., Chen, Z., and Wahba, G. (1989), "The computation of GCV functions through Householder tridiagonalization with application to the fitting of interaction spline models," SIAM J. Matrix Analysis and Applications, 10, 457-480.

Koenker, R., Ng, P., and Portnoy, S. (1994), "Quantile smoothing splines," Biometrika, 81, 673-680.

Mitchell, T. J., and Beauchamp, J. J. (1988), "Bayesian variable selection in linear regression," J. Am. Stat. Ass., 83, 1023-1036.

Monahan, J. (1984), "Full Bayesian analysis of ARMA time series models," J. Econometrics, 21, 307-331.

Raftery, A. E., Madigan, D., and Hoeting, J. (1993), "Model selection and accounting for model uncertainty in linear regression models," TR 262, Dept. Stats., Uni. of Washington.

Ruppert, D., and Wand, M. (1994), "Multivariate locally weighted least squares regression," Ann. Stats., 22, 1346-1370.

Smith, M., and Kohn, R. (1996), "Nonparametric regression via Bayesian variable selection," J. Econometrics, 75, 317-344.

Smith, M., Sheather, S. J., and Kohn, R. (1996), "Finite sample performance of robust Bayesian regression," J. Comp. Stats., 11, 269-301.

Smith, M., Wong, C., and Kohn, R. (1996), "Additive nonparametric regression with autocorrelated errors," under review.

Stone, C. J. (1994), "The use of polynomial splines and their tensor products in multivariate function estimation (with discussion)," Ann. Stats., 22, 118-184.

Stone, C., Hansen, M., Kooperberg, C., and Truong, Y. (1996), "Polynomial splines and their tensor products in extended linear modeling," TR 437, Dept. Stats., Uni. of California, Berkeley.

Verdinelli, I., and Wasserman, L. (1991), "Bayesian analysis of outlier problems using the Gibbs sampler," Statistics and Computing, 1, 105-117.

Wahba, G. (1990), Spline Models for Observational Data, Philadelphia: SIAM.
              function 1                 function 2                 function 3
        AR(2)       AR(s)           AR(2)       AR(s)           AR(2)       AR(s)
  i     φ_i    Pr(δ_i=1)   φ_i      φ_i    Pr(δ_i=1)   φ_i      φ_i    Pr(δ_i=1)   φ_i
  1    0.197      1       0.198    0.15       1       0.154    0.165      1       0.169
  2   -0.828      1      -0.828   -0.84       1      -0.852   -0.845      1      -0.861
  3             0.038    -0.001             0.033    -0.002             0.128    -0.008
  4             0.03      0.001             0.021     0.001             0.021     0
  5             0.009     0                 0.047    -0.003             0.022     0
  6             0.009     0                 0.008     0                 0.009     0
  7             0.01      0                 0.009    -0.001             0.009     0
  8             0.008     0                 0.008     0                 0.009     0

Table 1: Estimates of the autoregressive parameters φ_i and the δ_i indicator variables from the simulated data sets.
                 n = 300               n = 1200               n = 4800
             fn 1  fn 2  fn 3      fn 1  fn 2  fn 3       fn 1  fn 2  fn 3
  BSS          18    18    18        29    38    26         91    99    85
  MARS         29    24    18       101    98   101        459   436   433
  TPMSS         5     5     5       147   147   147        NEM   NEM   NEM
  TENSMSS     221   221   221       NEM   NEM   NEM        NEM   NEM   NEM
  LFAF         26    26    17       100    86    56        404   331   134
  RBSS        138   159   133       565   537   497       3290  3213  3029

Table 2: Times (in seconds) required to run the various programs on data with n = 300, 1200 and 4800 for the three different test functions. NEM denotes that there was not enough memory to run the program because the algorithms involved required O(n²) storage locations.
Figure 1: (a) A highly dependent design with knots placed using the tensor product regression spline basis. The lines indicate the positioning of the knots in the x dimension, with the base of each arrow being the location of the knot pairs (k_i^1, k_j^2). The direction of the arrows denotes the direction of the nonzero contribution of the term (x - k_i^1)_+^m (z - k_j^2)_+^m. (b) A similar plot for the design-adaptive regression spline basis. The locations of the knots k_j^{2,i} follow the observed density of z | x ∈ (k_i^1, k_{i+1}^1], as this is the region of nonzero contribution of the terms (x - k_i^1)_+^m (z - k_j^{2,i})_+^m for j = 1, ..., K_2.
Figure 2: (a)-(c) Plots of the three functions over a typical convex hull resulting from the design of the three examples.
Figure 3: (a)-(c) Boxplots of log(AISE) for the various estimators (see the text for their description) fitted to the 100 replicated data sets corresponding to the three examples. From left to right they are BSS, MARS, LFAF, TPMSS, TENSMSS, ACE and LSP. Panels (d)-(f) have the same interpretation, but are for the simulated data containing outliers. From left to right the boxplots are for RBSS, MARS, LFAF, TPMSS, TENSMSS and LSP.
Figure 4: Surface estimates for the function from the first example, given in figure 2(a), for the BSS (upper row), MARS (middle row) and TPMSS (bottom row) estimators. The three columns (from left to right) give the surfaces corresponding to the fits at the upper decile, median and lower decile of the AISE measure plotted in figure 3(a). Note that the MARS estimate has its intercept normalized so that min(f̂) = 0.
Figure 5: Surface estimates for the function from the second example, given in figure 2(b), for the BSS (upper row), MARS (middle row) and TENSMSS (bottom row) estimators. The three columns (from left to right) give the surfaces corresponding to the fits at the upper decile, median and lower decile of the AISE measure plotted in figure 3(b). Note that the MARS estimate has its intercept normalized so that min(f̂) = 0.
Figure 6: Surface estimates for the function from the third example, given in figure 2(c), for the BSS (upper row), MARS (middle row) and TPMSS (bottom row) estimators. The three columns (from left to right) give the surfaces corresponding to the fits at the upper decile, median and lower decile of the AISE measure plotted in figure 3(c). Note that the MARS estimate has its intercept normalized so that min(f̂) = 0.
Figure 7: (a) Contours of the robust estimated surface f̂(x, z). The scatter plot is of x vs z and the circled observations are the design locations of the detected outliers. (b) Perspective plot of f̂(x, z). The six outlying observations are also plotted.
Figure 8: (a) Estimate f̂_1(x) of the main effect of x. (b) Estimate f̂_2(z) of the main effect of z. (c) Contour plot of the estimate f̂_12(x, z) of the interaction effect. (d) Plot of the posterior probability of each observation being an outlier.
Figure 9: Plots of the surface estimates of the simulated data in section 5. The first column gives the estimate from the surface estimator that assumes that the order of the autoregression is known, the second column corresponds to the estimator that estimates the order of the autoregression, and the third column corresponds to the estimator that treats the errors as independent. The rows correspond to each of the three functions.
Figure 10: Plots of the mean log(AISE) using the design-adaptive bivariate regression spline based BSS estimator with q = 169. The columns correspond (from left to right) to test functions 1, 2 and 3. The rows correspond (from top to bottom) to noise levels σ = 1/8, 1/4 and 1/2 times the range of the true underlying test function. In each panel the mean log(AISE) averaged over the ten replications is given for three observation levels (n = 300, 1200 and 4800). The bold line corresponds to c = 50,000, the dotted line to c = 100,000 and the dashed line to c = 200,000.