Nonparametric spline regression with prior information

Biometrika (1993), 80, 1, pp. 75-88. Printed in Great Britain

BY CRAIG F. ANSLEY
Department of Accounting and Finance, University of Auckland, Auckland, New Zealand

ROBERT KOHN AND CHI-MING WONG
Australian Graduate School of Management, University of New South Wales, Kensington, New South Wales, Australia 2033

SUMMARY

By using prior information about the regression curve we propose new nonparametric regression estimates. We incorporate two types of information. First, we suppose that the regression curve is similar in shape to a family of parametric curves characterized as the solution to a linear differential equation. The regression curve is estimated by penalized least squares, with the differential operator defining the smoothness penalty. We discuss in particular growth and decay curves, and take a time transformation to obtain a tractable solution. The second type of prior information is linear equality constraints. We estimate unknown parameters by generalized cross-validation or maximum likelihood, and obtain efficient O(n) algorithms to compute the estimate of the regression curve and the cross-validation and maximum likelihood criterion functions.

Some key words: Bayesian confidence interval; Differential equation; Equality constraints; Filtering; Generalized cross-validation; Maximum likelihood; Penalized least squares; Periodic spline; Spline smoothing; State space model.

1. INTRODUCTION

Suppose that we observe a function with noise. Usually an estimate of the function is needed and often a confidence interval for the function is also required. If we assume a parametric form for the function then unknown parameters can be estimated by linear or nonlinear least squares, giving an estimate of the function together with a confidence interval. Hoerl (1954) presents and discusses a large number of useful parametric functions. Instead of assuming a parametric form for the regression curve, we can estimate it by nonparametric regression. Two popular ways of doing so are kernel regression and polynomial spline regression (Wahba, 1978), with these and other approaches discussed by Eubank (1988). In the present paper we extend polynomial smoothing splines by incorporating prior information about the regression curve when such information is available. By using this extra information we hope to obtain improved estimates of the regression function within the range of the data, in particular at the boundaries, and improved forecasts of the function outside the range of the data. In § 5 we illustrate with examples the improvement possible by using prior information. The first type of information we incorporate is a belief that the regression curve is similar in shape to a parametric curve, for example a sine or exponential curve. In § 2 we consider parametric curves that are solutions to linear differential equations and in § 3 we consider parametric curves that are monotonic but can be solutions to both linear and nonlinear differential equations. Monotonic parametric curves are often used to model growth and decay curves in plants and animals and product diffusion in marketing. The second type of information we deal with is equality constraints on the regression curve and its derivatives, and we apply our results to periodic splines. This is done in § 4. Examples are given in § 5.

In all the cases outlined above the regression curve is estimated by penalized least squares, with the differential equation associated with the prior information appearing in the roughness penalty. We can show that the solution to the penalized least squares problem can be obtained by signal extraction, with the signal generated by a stochastic differential equation. By interpreting the signal as a prior for the regression curve and expressing it in state space form, we show how to estimate the curve and its derivatives, obtain Bayesian confidence intervals for the curve, and estimate unknown parameters by generalized cross-validation or marginal likelihood in O(n) operations, where n is the sample size.

In related work on spline smoothing, Kimeldorf & Wahba (1971), Ramsay & Dalzell (1991) and the discussion of Silverman (1985) also consider roughness penalties defined by differential equations. Ramsay & Dalzell emphasize that many smoothing problems require the estimation of a function rather than just a small number of function values. However, our Bayesian emphasis and computational techniques are quite different from theirs.




2. PRIOR INFORMATION DESCRIBED BY A LINEAR DIFFERENTIAL EQUATION

2.1. General

Suppose we observe the regression function f(t) with noise, so that

$$ y(i) = f(t_i) + e(i) \quad (i = 1, \ldots, n), \qquad (2.1) $$

with e(i) independent N(0, σ²). We assume that t_1 ≤ ... ≤ t_n. Suppose further that we believe that f(t) is similar in shape to the parametric function g(t; θ), which is defined up to m initial conditions by the mth-order linear differential equation L_{t,θ} g(t) = 0, where the differential operator

$$ L_{t,\theta} = d^m/dt^m + a_1(t; \theta)\, d^{m-1}/dt^{m-1} + \ldots + a_m(t; \theta). $$

We assume that the coefficients a_1(t; θ), ..., a_m(t; θ) are functions of t and θ that are m times differentiable in t. The parameter vector θ lies in a subset Θ of Euclidean space. For example, if g(t; θ) = θ_1 + θ_2 t is linear then d²g/dt² = 0; if g(t; θ) = sin(θt) is the sine function with period 2π/θ then the corresponding differential equation is d²g/dt² + θ²g = 0; and if g(t; θ) = e^{θt} is an exponential curve then d²g/dt² − θ²g = 0. To allow departures from the parametric model we propose to estimate f(t) by penalized least squares, where we minimize

$$ \sum_{i=1}^{n} \{y(i) - f(t_i)\}^2 + \lambda^{-1} \int_{t_1}^{t_n} \{L_{t,\theta} f(t)\}^2\, dt, \qquad (2.2) $$

for given θ and λ, over all functions f(t) that have a square integrable mth derivative on the interval (t_1, t_n). We assume that there are at least m distinct knots t_i. Then by Kimeldorf & Wahba (1971) and Kohn & Ansley (1988) the solution f(t; θ, λ) to (2.2) exists, is unique, and is called an L-spline. To simplify notation we will not indicate the dependence of f on θ and λ unless stated otherwise. It is shown by Kimeldorf &




Wahba (1971) and more generally by Kohn & Ansley (1988) that an alternative way to evaluate f is by signal extraction, where we assume that f(t) is generated by the stochastic differential equation

$$ L_{t,\theta} f(t) = \sigma \lambda^{1/2}\, dW(t)/dt, \qquad (2.3) $$

and has diffuse initial conditions

$$ \{f(t_1), f^{(1)}(t_1), \ldots, f^{(m-1)}(t_1)\}^T \sim N(0, kI_m) \qquad (2.4) $$

with k → ∞. The random function W(t) is a zero mean Wiener process with var{W(t)} = t. Let f(t; k) = E{f(t) | y; k}. Then f(t) = lim_{k→∞} f(t; k) equals the penalized least squares estimate f(t; θ, λ).

LEMMA 1 (Kohn & Ansley, 1983). (i) For t < t_1 and t > t_n the smoothing L-spline estimate satisfies L_t f(t) = 0. (ii) Let L*_t be the adjoint operator to L_t. Then in the interval (t_{i−1}, t_i) (i > 1) the smoothing L-spline estimate satisfies L*_t L_t f(t) = 0.
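To make (2.2) concrete, the following sketch (ours, not the paper's O(n) state space algorithm) approximates the penalized least squares minimizer for the special case m = 2 and L_{t,θ} = d²/dt², the cubic smoothing spline, by discretizing f on a fine grid and replacing the integral by a squared second-difference penalty. The grid size, penalty weight and nearest-point incidence matrix are illustrative choices.

```python
import numpy as np

def penalized_ls_grid(t, y, penalty, num_grid=400):
    """Crude minimizer of (2.2) for L = d^2/dt^2: discretize f on a uniform
    grid and minimize sum_i {y(i) - f(t_i)}^2 + penalty * integral (f'')^2 dt,
    with the integral replaced by a second-difference quadrature.  Here
    `penalty` plays the role of lambda^{-1} in (2.2); this dense solve is
    for illustration only, not the paper's O(n) method."""
    grid = np.linspace(t.min(), t.max(), num_grid)
    h = grid[1] - grid[0]
    # Incidence matrix mapping grid values to observation points
    # (nearest grid point; a cruder device than the paper needs).
    idx = np.clip(np.searchsorted(grid, t), 0, num_grid - 1)
    S = np.zeros((len(t), num_grid))
    S[np.arange(len(t)), idx] = 1.0
    # Second-difference operator: row k approximates f'' at an interior node.
    D = np.diff(np.eye(num_grid), n=2, axis=0) / h**2
    # Normal equations of the discretized criterion.
    f_hat = np.linalg.solve(S.T @ S + penalty * h * (D.T @ D), S.T @ y)
    return grid, f_hat

# Noisy sine data like the paper's third example in section 5.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 1.0, 50))
y = np.sin(2 * np.pi * t) + 0.2 * rng.standard_normal(50)
grid, f_hat = penalized_ls_grid(t, y, penalty=1e-4)
```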

Thus if we are dealing with exponential prior information, then f(t) will be an exponential curve for t < t_1 and t > t_n, and if θ < 0 it will have the appropriate decay properties associated with exponentials. By treating the stochastic model (2.3) and (2.4) as a prior for f(t), we can obtain pointwise posterior confidence intervals for f(t) and its derivatives as given by Wahba (1983) and Wecker & Ansley (1983). Using (2.3) as a prior for a deterministic function is somewhat controversial, as are the resulting Bayesian confidence intervals; this is discussed by Wahba (1983) and Silverman (1985). Wahba (1983) and Nychka (1988) give a frequency interpretation of Bayesian confidence intervals.

2.2. State space representation

In order to compute f and its derivatives efficiently, we express the stochastic model (2.1), (2.3) and (2.4) in state space form. Let x(t) = {f(t), f^{(1)}(t), ..., f^{(m−1)}(t)}^T be the m × 1 state vector. Then we can write (2.3) as

$$ dx(t)/dt = A(t)x(t) + \sigma\lambda^{1/2}\, b\, dW(t)/dt, \qquad (2.5) $$

where the m × 1 vector b = (0, ..., 0, 1)^T and A(t) is an m × m matrix defined as follows. Let A_{ij}(t) be the (i, j)th element of A(t). For i = 1, ..., m − 1 the elements A_{i,i+1}(t) = 1 and, for j = 1, ..., m, A_{m,j}(t) = −a_{m−j+1}(t); the rest of the elements of A(t) are zero. We can therefore write (2.1), (2.4) and (2.5) in state space form as

$$ y(i) = (1, 0, \ldots, 0)\, x(t_i) + e(i), \quad x(t_i) = F(t_i; t_{i-1})\, x(t_{i-1}) + u(i), \qquad (2.6) $$
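As an illustration of this construction (our sketch, restricted to constant coefficients a_1, ..., a_m rather than the time-varying a_j(t; θ) the paper allows), the code below builds A and b and then computes the transition matrix F = exp(AΔ) and the covariance of u(i) over a step Δ by the block matrix exponential method of Van Loan (1978), which the paper cites.

```python
import numpy as np
from scipy.linalg import expm

def companion(a):
    """State matrix A for L = d^m/dt^m + a_1 d^{m-1}/dt^{m-1} + ... + a_m:
    A_{i,i+1} = 1 for i = 1, ..., m-1 and A_{m,j} = -a_{m-j+1}, as below (2.5)."""
    m = len(a)
    A = np.zeros((m, m))
    for i in range(m - 1):
        A[i, i + 1] = 1.0
    for j in range(1, m + 1):
        A[m - 1, j - 1] = -a[m - j]      # a[0], ..., a[m-1] hold a_1, ..., a_m
    return A

def transition_and_cov(A, b, scale2, delta):
    """F = exp(A*delta) and Q = var{u(i)} = scale2 * int_0^delta
    exp(As) b b^T exp(A^T s) ds, by Van Loan's (1978) block method;
    scale2 corresponds to sigma^2 * lambda in (2.3)."""
    m = A.shape[0]
    M = np.zeros((2 * m, 2 * m))
    M[:m, :m] = -A
    M[:m, m:] = scale2 * np.outer(b, b)
    M[m:, m:] = A.T
    E = expm(M * delta)
    F = E[m:, m:].T                      # equals exp(A*delta)
    Q = F @ E[:m, m:]                    # Van Loan identity for the integral
    return F, Q

# Sine prior of section 2.1: d^2 g/dt^2 + theta^2 g = 0, so a = (0, theta^2).
theta = 2.0 * np.pi
A = companion([0.0, theta**2])
b = np.array([0.0, 1.0])
F, Q = transition_and_cov(A, b, scale2=1.0, delta=0.01)
```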

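Given the F and Q matrices between consecutive knots, one pass of the Kalman filter evaluates the Gaussian likelihood and the filtered curve estimates in O(n) operations; a fixed-interval smoother pass then gives E{f(t) | y} and the Bayesian confidence intervals, and the cross-validation criterion can be accumulated from the same innovations. The sketch below is generic rather than the paper's algorithm: in particular it approximates the diffuse prior (2.4) by a large finite variance, whereas the paper uses the exact diffuse filtering and smoothing of Ansley & Kohn (1985, 1990).

```python
import numpy as np

def kalman_filter(y, Fs, Qs, sigma2, k_big=1e7):
    """Kalman filter for the state space form (2.6) with scalar observations
    y(i) = x_1(t_i) + e(i).  Approximates the diffuse initial state (2.4)
    by N(0, k_big * I); returns the Gaussian log likelihood and the
    filtered estimates of f(t_i)."""
    m = Fs[0].shape[0]
    h = np.zeros(m); h[0] = 1.0          # picks f(t) out of the state
    x, P = np.zeros(m), k_big * np.eye(m)
    loglik, f_filt = 0.0, []
    for yi, F, Q in zip(y, Fs, Qs):
        x = F @ x                        # predict to the next knot
        P = F @ P @ F.T + Q
        v = yi - h @ x                   # innovation
        s = h @ P @ h + sigma2           # innovation variance
        K = P @ h / s                    # Kalman gain
        x = x + K * v                    # measurement update
        P = P - np.outer(K, h @ P)
        loglik += -0.5 * (np.log(2.0 * np.pi * s) + v**2 / s)
        f_filt.append(x[0])
    return loglik, np.array(f_filt)
```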

and has diffuse initial conditions f_0 = {f(τ_1), f^{(1)}(τ_1)}^T ~ N(0, kI_2) with k → ∞. The Wiener process W(τ) is defined as in § 2. We regard (3.4) as a prior for f(τ). We note that the parametric function g(t; θ) is a particular solution of the signal extraction problem obtained when λ = 0. To compute the posterior mean and variance of f(τ) and its first derivative we write the prior f(τ) in vector form as

LEMMA 2. For t_n ≤ t ≤ t* we have the following. (i) With a and b functionally independent of t, f(t) = a + b g(t). Hence, if dg(t)/dt → 0 as t → t*, then so does df(t)/dt.


(ii) If S_f(τ | n) is bounded for τ > τ_n and dg(t)/dt → 0 as t → t*, then the posterior variance of df(t)/dt tends to zero as t → t*.

Proof. By Kohn & Ansley (1983), f(τ) is a linear function of τ for τ > τ_n, and (i) follows from (3.8) because r(t) is a linear function of g(t). Part (ii) follows from (3.9). □

3.2. Exponential prior

Suppose that we base the prior on the exponential function g(t) = exp(−θt) with θ > 0 and t ≥ 0. Taking t_0 = 0 and t* = ∞ we obtain

$$ r(t; \theta) = 1 - \exp(-\theta t), \qquad dt/d\tau = 1/\{\theta(1 - \tau)\}, $$

so that (3.4) becomes

$$ d^2 f(\tau)/d\tau^2 = \sigma (\lambda/\theta^3)^{1/2} (1 - \tau)^{-3/2}\, dW(\tau)/d\tau. \qquad (3.10) $$
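The transformation can be checked symbolically. The sketch below assumes (our reconstruction; the page introducing (3.1)-(3.4) is not part of this excerpt) that the exponential prior corresponds to the operator L = d²/dt² + θ d/dt, whose null space is spanned by 1 and exp(−θt), and verifies that in the transformed time τ = 1 − exp(−θt) the operator collapses to θ²(1−τ)² d²/dτ², which is the source of the (1 − τ)^{−3/2} factor in (3.10).

```python
import sympy as sp

tau, th = sp.symbols('tau theta', positive=True)
f = sp.Function('f')(tau)

# Under tau = 1 - exp(-theta*t) we have d/dt = theta*(1 - tau)*d/dtau.
def d_dt(expr):
    return th * (1 - tau) * sp.diff(expr, tau)

# Assumed operator for the exponential prior: L = d^2/dt^2 + theta*d/dt,
# whose null space is spanned by 1 and exp(-theta*t).
Lf = d_dt(d_dt(f)) + th * d_dt(f)

# Prints theta**2*(tau - 1)**2*Derivative(f(tau), (tau, 2)): the first
# derivative terms cancel and only a scaled second derivative remains.
print(sp.simplify(Lf))
```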

With some algebraic manipulation we can evaluate the matrices U(τ, τ′) explicitly as

3.3. Logistic prior

If we base the prior on the logistic function g(t) = {1 + θ_2 exp(−θ_1 t)}^{−1} with θ_1 > 0, θ_2 > 0 and t ≥ 0, and take t_0 = 0 and

We can show that the limits w̄ and S_w exist if we have observations at two or more distinct t_i. The next lemma shows that the posterior mean w̄ satisfies the constraints and the posterior variance of H*w is zero.

LEMMA 3. Under the above assumptions, H*w̄ = c and H*S_w H*^T = 0.

The smoothness properties of f(t) = lim_{k→∞} E{f(t) | y, y*; k} depend on the matrix of constraints H* and can be determined as given by Kohn & Ansley (1983).

Proof. We have that E(H*w | y, y*; k) = H*w̄ = c and var(H*w | y, y*; k) = 0, because y* = H*w. □
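Lemma 3 is easy to verify numerically: treating the constraints as noiseless observations y* = H*w and conditioning the unconstrained Gaussian posterior on them forces H*w̄ = c exactly and annihilates the posterior variance of H*w. In the sketch below the dimensions, the unconstrained posterior and the constraint matrix are all invented for illustration.

```python
import numpy as np

def constrain(w_bar, S, H, c):
    """Condition a Gaussian N(w_bar, S) on exact linear constraints H w = c
    (Gaussian conditioning with zero observation noise; H S H^T assumed
    nonsingular)."""
    HS = H @ S
    gain = np.linalg.solve(HS @ H.T, HS).T    # S H^T (H S H^T)^{-1}
    return w_bar + gain @ (c - H @ w_bar), S - gain @ HS

rng = np.random.default_rng(1)
L = rng.standard_normal((5, 5))
S = L @ L.T                                   # unconstrained posterior covariance
w_bar = rng.standard_normal(5)                # unconstrained posterior mean
H = rng.standard_normal((2, 5))               # two equality constraints H w = c
c = rng.standard_normal(2)

w_c, S_c = constrain(w_bar, S, H, c)
print(np.allclose(H @ w_c, c))                # True: mean satisfies constraints
print(np.allclose(H @ S_c @ H.T, 0.0))        # True: var(H w) is zero
```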



The most important step in computing the constrained posterior means and variances of f(t_i) and f^{(1)}(t_i)

Fig. 3. Analysis of data points generated by the curve y(i) = 2 exp{−7 exp(−7t_i)} + e(i) with e(i) ~ N(0, 0.2²) and the parameters estimated by marginal likelihood. The estimates in the interval [0.9, 1.0] are forecasts. Solid lines, curve estimates; dotted lines, 95% Bayesian confidence bands. A linear prior, i.e. cubic splines, was used for (a) and (b), and the logistic prior (3.12) for (c) and (d). (a) Data together with cubic spline function estimate and 95% Bayesian confidence bands. (b) Estimate of the first derivative of the function together with 95% Bayesian confidence bands. (c) Data together with function estimate and 95% Bayesian confidence bands using the logistic prior. (d) Estimate of the first derivative of the function together with 95% Bayesian confidence bands.


same interpretation as Figs 3(a) and (b) respectively but are now obtained for the logistic prior (3.11). The forecast values of the function, that is estimates of the function for t > 0.9, are flatter for the logistic prior than for the linear prior, and the Bayesian confidence intervals are much narrower.

In our third example we generated 85 equally spaced observations on the interval [0, 0.85] using the sine regression y(i) = sin(2πt_i) + e(i) with e(i) ~ N(0, 0.2²). The regression function was estimated using both quintic splines, i.e. a quadratic prior, and quintic periodic splines, and the regression curve was predicted in the interval (0.85, 1]. Figure 4(a) plots the data, the quintic spline estimate and 95% Bayesian confidence intervals for the regression function, and Fig. 4(b) plots the first derivative estimate and its 95%


Fig. 4. Analysis of 85 data points generated by the sine regression curve y = sin(2πt) + e with e ~ N(0, 0.2²) and the parameters estimated by marginal likelihood. The estimates in the interval (0.85, 1.0] are forecasts. Solid lines, curve estimates; dotted lines, 95% Bayesian confidence bands. A quadratic prior, i.e. quintic splines, was used for (a) and (b), and quintic periodic splines for (c) and (d). (a) Data together with quintic spline function estimate and 95% Bayesian confidence bands. (b) Estimate of the first derivative of the function together with 95% Bayesian confidence bands. (c) Data together with function estimate and 95% Bayesian confidence bands using the periodic spline. (d) Estimate of the first derivative of the function together with 95% Bayesian confidence bands.


Bayesian confidence intervals. Figures 4(c) and (d) have the same interpretation as Figs 4(a) and (b) respectively but are now obtained for the quintic periodic spline. The boundary behaviour and the forecast values of the function, that is estimates of the function for t > 0.85, are better behaved for the periodic spline than for the quadratic prior, and the Bayesian confidence intervals are much narrower.

5.2. Computational details

The criterion function L(λ, θ) was expressed as a sum of squares and minimized using subroutine LMDIF in the MINPACK subroutine library (Moré, Garbow & Hillstrom, 1980). Independent Gaussian random variables were generated by IMSL subroutine DRNNOA. All computations were carried out on an IBM RS 6000 computer at the University of New South Wales.
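A note for reimplementation (ours, not the paper's): scipy.optimize.leastsq wraps the same MINPACK lmdif routine, so a criterion written as a vector of square-root terms can be minimized in the same way today. The toy residual below merely stands in for the actual terms of L(λ, θ); the data generation mimics the third example, up to the random seed, which the paper does not report.

```python
import numpy as np
from scipy.optimize import leastsq           # wraps MINPACK's lmdif

rng = np.random.default_rng(2)               # seed unreported in the paper
t = np.linspace(0.0, 0.85, 85)               # 85 equally spaced points
y = np.sin(2.0 * np.pi * t) + rng.normal(0.0, 0.2, 85)

def residuals(theta, t, y):
    # Stand-in for the square-root terms of the criterion L(lambda, theta);
    # here just a toy nonlinear fit of a sine frequency.
    return y - np.sin(theta[0] * t)

theta_hat, _ = leastsq(residuals, x0=[5.0], args=(t, y))
print(theta_hat)                             # close to 2*pi = 6.283...
```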

ACKNOWLEDGEMENTS

We thank the referees for useful suggestions which helped improve the presentation of the paper. The research was partially supported by an Australian Research Council grant.

REFERENCES

ANSLEY, C. F. & KOHN, R. (1985). Estimation, filtering and smoothing in state space models with incompletely specified initial conditions. Ann. Statist. 13, 1286-316.
ANSLEY, C. F. & KOHN, R. (1990). Filtering and smoothing in state space models with partially diffuse initial conditions. J. Time Ser. Anal. 11, 275-93.
COGBURN, R. & DAVIS, H. T. (1974). Periodic splines and spectral estimation. Ann. Statist. 2, 1108-26.
CRAVEN, P. & WAHBA, G. (1979). Smoothing noisy data with spline functions. Numer. Math. 31, 377-403.
DRAPER, N. R. & SMITH, H. (1981). Applied Regression Analysis, 2nd ed. New York: Wiley.
EUBANK, R. L. (1988). Spline Smoothing and Nonparametric Regression. New York: Marcel Dekker.
HOERL, A. E. (1954). Fitting curves to data. In Chemical Business Handbook, Ed. J. H. Perry, pp. 55-77. New York: McGraw-Hill.
KIMELDORF, G. S. & WAHBA, G. (1971). Some results on Tchebycheffian splines. J. Math. Anal. Appl. 33, 82-95.
KOHN, R. & ANSLEY, C. F. (1983). On the smoothness properties of the best linear unbiased estimate of a stochastic process observed with noise. Ann. Statist. 11, 1011-7.
KOHN, R. & ANSLEY, C. F. (1988). The equivalence between Bayesian smoothness priors and optimal smoothing for function estimation. In Bayesian Analysis of Time Series and Dynamic Models, Ed. J. C. Spall, pp. 393-430. New York: Marcel Dekker.
KOHN, R. & ANSLEY, C. F. (1991). The performance of cross-validation and maximum likelihood estimators of spline smoothing parameters. J. Am. Statist. Assoc. 86, 1042-50.
MAHAJAN, V., MULLER, E. & BASS, F. M. (1990). New product diffusion models in marketing: a review and directions for research. J. Marketing 54, 1-26.
MORÉ, J. J., GARBOW, B. S. & HILLSTROM, K. E. (1980). User Guide for Minpack. Argonne, Illinois: Argonne National Laboratory.
MÜLLER, H. G. (1988). Nonparametric Regression Analysis of Longitudinal Data. New York: Springer-Verlag.
NELDER, J. (1961). The fitting of a generalization of the logistic curve. Biometrics 17, 89-110.
NYCHKA, D. (1988). Bayesian confidence intervals for smoothing splines. J. Am. Statist. Assoc. 83, 1134-43.
RAMSAY, J. O. & DALZELL, C. J. (1991). Some tools for functional data analysis (with discussion). J. R. Statist. Soc. B 53, 539-72.
SILVERMAN, B. W. (1985). Some aspects of the spline smoothing approach to non-parametric regression curve fitting (with discussion). J. R. Statist. Soc. B 47, 1-52.
VAN LOAN, C. (1978). Computing integrals involving the matrix exponential. IEEE Trans. Auto. Cont. AC-23, 395-404.
WAHBA, G. (1978). Improper priors, spline smoothing and the problem of guarding against model errors in regression. J. R. Statist. Soc. B 40, 364-72.
WAHBA, G. (1980). Automatic smoothing of the log periodogram. J. Am. Statist. Assoc. 75, 122-32.




WAHBA, G. (1983). Bayesian confidence intervals for the cross-validated smoothing spline. J. R. Statist. Soc. B 45, 133-50.
WAHBA, G. (1985). A comparison of GCV and GML for choosing the smoothing parameter in the generalized spline smoothing problem. Ann. Statist. 13, 1378-402.
WECKER, W. E. & ANSLEY, C. F. (1983). The signal extraction approach to linear regression and spline smoothing. J. Am. Statist. Assoc. 78, 81-9.

[Received June 1991. Revised August 1992]
