An Assessment of Bayesian Inference in Nonparametric Logistic Regression

CORRESPONDING AUTHOR:

Nandini Raghavan*
Department of Statistics, Ohio State University
141 Cockins Hall, 1958 Neil Avenue, Columbus, OH 43210
Phone: (614) 292-0738  Fax: (614) 292-2096  Email: [email protected]

Dennis D. Cox†
Department of Statistics, Rice University
P. O. Box 1892, Houston, Texas 77251
Phone: (713) 527-6007  Fax: (713) 285-5476  Email: [email protected]

July 31, 1996

* Research supported by NSF Grant DMS 9410-94.
† Research supported by NSF Grant DMS 93-17464.


Abstract

A Monte Carlo study is performed to assess the properties of a Bayesian procedure for inference in nonparametric regression with a binary response variable. The log-odds (logit) of the probability of the response is modeled as an integrated Wiener process. This leads to a generalized smoothing spline as the posterior mode. Such priors have been used by many authors for nonparametric regression with Gaussian errors. In the logistic regression setup the posterior is analytically intractable, so Monte Carlo approximation (specifically importance sampling) is used to evaluate posterior functionals. Investigated here are: (i) the accuracy of the Gaussian approximation of the posterior; (ii) the frequentist coverage probabilities of Bayesian credible regions; and (iii) the large sample behavior of the Bayesian procedures.

Key Words and Phrases: Logistic regression, nonparametric regression, Bayesian inference, improper priors, generalized smoothing splines, Monte Carlo simulation, importance sampling.


1 Introduction

Bayesian inference for nonparametric regression has been of interest to many investigators, including Wahba (1978), (1983), Silverman (1985), Nychka (1988) and Gu (1992). This entails specifying a prior on an infinite (or very high) dimensional space of functions. Gaussian processes are convenient for this and work well for the additive Gaussian errors model because the posterior is then Gaussian. In non-Gaussian GLM models (such as logistic regression) the posterior is quite complicated under a Gaussian process prior. One approach to calculations with the posterior in these settings is to use the Gaussian approximation based on a second-order Taylor expansion about the posterior mode (Gu (1992)). The adequacy of such an approximation needs to be assessed. Here we compute posterior expectations, quantiles and credible regions and compare them to those based on the Gaussian approximation. Since the posterior is analytically intractable, we use Monte Carlo importance sampling to estimate these quantities.

It is also of interest to assess the Bayesian procedures from a frequentist perspective, e.g. to compare the frequentist coverage probabilities of Bayesian credible regions with the posterior coverage probabilities. Other studies (Wahba (1983), Nychka (1988), Cox (1989) and Shen (1994)) suggest that the two coverages are approximately equal in the Gaussian additive error model. A further contribution here is the introduction of a method for computing simultaneous credible bands.

In typical parametric settings the posterior concentrates around the truth as the number of observations increases. Assuming the true parameter value is in the support of the prior, the prior becomes approximately constant in a small neighborhood of the truth, so the posterior increasingly resembles the likelihood function, which can be approximated by a Gaussian density with mean equal to the posterior mode (which is approximately the MLE) and variance equal to the negative inverse Hessian of the log likelihood (which is approximately the inverse Fisher information). See Berger (1980), pp. 224-225, and Tierney and Kadane (1986) for such approximations. Further, the frequentist coverage probability of Bayesian credible sets approaches the Bayesian posterior coverage probability. Again, we emphasize that these results apply to the setting of finite dimensional parameter spaces. There are no corresponding results known for the nonparametric setting. Results in Shen (1994) show that in the nonparametric setting the likelihood and prior are always of the same order of magnitude, and this could give rise to a lack of agreement between the Gaussian approximation and the true posterior, and between Bayesian and frequentist coverage probabilities. Our results here indicate that the Gaussian approximation does perform reasonably well as an approximation to the posterior. The approximate equality of the frequentist and posterior coverage probabilities seems to require a larger sample size than in the Gaussian additive errors model. Further difficulties seem to arise which may be unique to logistic regression. The coverage probabilities for the simultaneous credible bands tend to be even more problematic.

In Section 2 we describe the Bayesian nonparametric logistic regression model. Section 3 outlines the Monte Carlo study. Section 4 provides a detailed discussion of the results.

2 The Bayesian Model

Let Y be a binary response, t a predictor, and

p(t) = P[Y = 1 | t] = 1 - P[Y = 0 | t].    (1)

Assuming 0 < p(t) < 1, it is convenient to transform and estimate the logit g(t) = \ln\{p(t)/[1 - p(t)]\} instead. Given n independent observations (y_i, t_i), i = 1, \ldots, n, the density of g = (g(t_1), \ldots, g(t_n))' is

f(y | g) = \prod_{i=1}^{n} \exp[\, y_i g(t_i) - \ln(1 + \exp[g(t_i)]) \,].
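For concreteness, this log-likelihood can be evaluated as in the following minimal sketch; the function name and the use of numpy.logaddexp for numerical stability are choices of this sketch, not part of the paper.

```python
import numpy as np

def bernoulli_loglik(y, g):
    """log f(y|g) = sum_i [ y_i g(t_i) - ln(1 + exp(g(t_i))) ]."""
    y = np.asarray(y, dtype=float)
    g = np.asarray(g, dtype=float)
    # logaddexp(0, g) = log(1 + exp(g)), computed without overflow for large g
    return float(np.sum(y * g - np.logaddexp(0.0, g)))
```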

Assume the predictor t is one dimensional. A prior on g(t) is given by

g(t) = \sum_{j=0}^{m-1} \theta_j t^j + \sigma Z(t),    (2)

where \theta = (\theta_0, \ldots, \theta_{m-1})' \sim N(0, \xi I) with \xi > 0, \sigma > 0 is a scale parameter, and \{Z(t) : t \in [a, b]\} is an (m-1)-fold integrated Wiener process with covariance function

Q(s, t) = ((m-1)!)^{-2} \int_a^b (s - u)_+^{m-1} (t - u)_+^{m-1} \, du,    a \le s, t \le b.

Here x_+^k = (\max\{x, 0\})^k. \theta and Z(t) are assumed to be independent. In order to specify the posterior density for any finite set of values of the independent variable, it suffices to consider the marginal posterior density of g. To see this, note that if \{s_1, \ldots, s_k\} is disjoint from the set of design points \{t_1, \ldots, t_n\}, then the conditional posterior f(g(s_1), \ldots, g(s_k) | g, y) = f(g(s_1), \ldots, g(s_k) | g), and the latter is a Gaussian distribution determined solely from the prior. This relation holds for a proper prior and carries through for an improper prior (\xi \to \infty) as well (Raghavan and Cox (1995)). Restricting our attention to the design points, we have Z = (Z(t_1), \ldots, Z(t_n))' \sim N(0, Q_n), where Q_n = [Q(t_i, t_j)]_{i,j=1,\ldots,n}. Let T be the n \times m matrix with (i, j)-th entry T_{ij} = t_i^{j-1}. The prior distribution of g is then g \sim N(0, \Sigma_\xi), where \Sigma_\xi = \xi TT' + \sigma^2 Q_n. Allowing \xi \to \infty results in a uniform distribution on R^m for \theta and a "partially improper" limiting prior for g which is proportional to

h(g) = \exp[-g' D g / (2\sigma^2)],    (3)

where D = \lim_{\xi \to \infty} \sigma^2 \Sigma_\xi^{-1} is the precision matrix; it has rank (n - m) and null space spanned by the columns of T. This prior can be derived as a limit of proper priors, as shown in Raghavan and Cox (1995) and Wahba (1978). The unnormalized posterior density is

h(g | y) = h(y | g)\, h(g) = \exp\big[ -g' D g/(2\sigma^2) + y' g - 1_n' \log(1_n + \exp[g]) \big].    (4)

The normalized posterior density of g is

f(g | y) = \frac{h(g | y)}{\int_{R^n} h(g | y) \, dg}.    (5)

In general, we denote an unnormalized density by h and a normalized density by f; the arguments indicate the random variable. Note that the posterior density is proper if and only if the denominator in (5) is finite. This occurs if and only if the number of switches (defined below) in y is at least m (Raghavan and Cox (1995)).
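As a concrete illustration of (3)-(4), the sketch below builds Q_n and T for given design points, forms the limiting precision D via the Woodbury identity for \lim_{\xi \to \infty} (\xi TT' + \sigma^2 Q_n)^{-1}, and evaluates the unnormalized log posterior. The function names, the default placement of a slightly below the smallest design point (so that Q_n is nonsingular), and the closed-form expression for the limit are assumptions of this sketch, not the paper's implementation.

```python
import numpy as np
from math import factorial
from scipy.integrate import quad

def wiener_cov(s, t, m=2, a=0.0, b=1.0):
    """Q(s,t): integral over [a,b] of (s-u)_+^(m-1) (t-u)_+^(m-1) du, scaled by ((m-1)!)^(-2)."""
    integrand = lambda u: max(s - u, 0.0) ** (m - 1) * max(t - u, 0.0) ** (m - 1)
    val, _ = quad(integrand, a, b)
    return val / factorial(m - 1) ** 2

def prior_matrices(t, m=2, a=None, b=None):
    """Return Q_n, T, and D = Q_n^{-1} - Q_n^{-1} T (T' Q_n^{-1} T)^{-1} T' Q_n^{-1},
    i.e. the xi -> infinity limit of sigma^2 * Sigma_xi^{-1} (Woodbury identity).
    Assumes the t_i are distinct and strictly greater than a, so Q_n is nonsingular."""
    t = np.asarray(t, dtype=float)
    a = t.min() - 0.05 if a is None else a   # default offset is an assumption of this sketch
    b = t.max() if b is None else b
    Qn = np.array([[wiener_cov(s, u, m, a, b) for u in t] for s in t])
    T = np.vander(t, N=m, increasing=True)   # T_{ij} = t_i^{j-1}
    Qinv = np.linalg.inv(Qn)
    D = Qinv - Qinv @ T @ np.linalg.solve(T.T @ Qinv @ T, T.T @ Qinv)
    return Qn, T, D

def log_h_posterior(g, y, D, sigma2):
    """Unnormalized log posterior from (4): -g'Dg/(2 sigma^2) + y'g - 1' log(1 + exp(g))."""
    g, y = np.asarray(g, dtype=float), np.asarray(y, dtype=float)
    return float(-0.5 * g @ D @ g / sigma2 + y @ g - np.sum(np.logaddexp(0.0, g)))
```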

Definition 1 Assume t_1 \le t_2 \le \ldots \le t_n. A permutation \pi of (1, 2, \ldots, n) is allowable if t_{\pi(1)} \le t_{\pi(2)} \le \cdots \le t_{\pi(n)}. For an allowable permutation \pi, we say a switch occurs at the interval [t_{\pi(j)}, t_{\pi(j+1)}] if y_{\pi(j)} \ne y_{\pi(j+1)}. The number of switches in the data is \min_\pi \sum_{j=1}^{n-1} |y_{\pi(j+1)} - y_{\pi(j)}|, where the minimum is taken over allowable permutations \pi.
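A small sketch of the switch count, under the simplifying assumption that the t_i are distinct (so the sort order is the only allowable permutation); handling ties would additionally require minimizing over orderings within tied groups, which is omitted here.

```python
import numpy as np

def count_switches(t, y):
    """Number of switches in y along the sorted t's (distinct t_i assumed)."""
    order = np.argsort(t)
    y_sorted = np.asarray(y)[order]
    # each adjacent pair of unequal responses contributes one switch
    return int(np.sum(np.abs(np.diff(y_sorted))))

# The posterior (5) is proper if and only if count_switches(t, y) >= m.
```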

There remains the problem of selecting the scale parameter \sigma for the Gaussian prior, or equivalently the smoothing parameter

\lambda = 1/(n\sigma^2).

In all work reported below, an estimate of \lambda is obtained by minimizing a least squares cross-validation (CV) criterion on a large grid of values. Details of the procedure are given in Chang (1991).

The Gaussian approximation to the posterior is given by

f_G(g) \propto \exp[-(g - \hat{g})' H (g - \hat{g})/2],    (6)

where H = \frac{d^2}{dg^2}[-\ln f(g | y)] is the Hessian at the posterior mode \hat{g}. Under the (m-1)-fold integrated Wiener process prior the posterior mode \hat{g} is a natural polynomial spline function of order 2m (O'Sullivan et al. (1986)). It is obtained by an iteratively reweighted least squares (IRLS) algorithm, and the Hessian matrix can be obtained from the last iteration of the IRLS algorithm.

A 95% pointwise posterior credible interval for g(t) at t is given by (g_{.025}(t), g_{.975}(t)), where g_\alpha(t) denotes the \alpha-th quantile of the posterior distribution at t and is defined by the equation

\int_{-\infty}^{g_\alpha(t)} f(g(t) | y) \, dg(t) \Big/ \int_{-\infty}^{\infty} f(g(t) | y) \, dg(t) = \alpha.
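The posterior mode and Hessian can be obtained with a few Newton (IRLS) steps on the penalized log-likelihood in (4). The sketch below works directly with the n x n matrices D and W = diag(p_i(1 - p_i)) rather than the spline-basis implementation described above (O'Sullivan et al. (1986), Chang (1991)); that simplification and the function names are assumptions of this sketch.

```python
import numpy as np
from scipy.special import expit   # expit(x) = 1/(1 + exp(-x))

def posterior_mode(y, D, sigma2, tol=1e-8, max_iter=100):
    """Newton/IRLS iterations maximizing y'g - 1'log(1 + exp(g)) - g'Dg/(2 sigma^2).

    Returns the mode g_hat and the Hessian H = W + D/sigma^2 of -log f(g|y) at g_hat,
    which define the Gaussian approximation (6).
    """
    y = np.asarray(y, dtype=float)
    P = D / sigma2                        # prior (penalty) precision term
    g = np.zeros(len(y))
    for _ in range(max_iter):
        p = expit(g)                      # fitted probabilities at the current iterate
        H = np.diag(p * (1.0 - p)) + P    # Hessian of the negative log posterior
        step = np.linalg.solve(H, y - p - P @ g)   # Newton step
        g = g + step
        if np.max(np.abs(step)) < tol:
            break
    p = expit(g)
    return g, np.diag(p * (1.0 - p)) + P
```

Pointwise credible intervals from the Gaussian approximation then use g_hat and the diagonal of the inverse of the returned Hessian.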

To construct (1 - \alpha)100% simultaneous bands for the entire regression function in an interval T of interest, we need to be able to find bounds L(t) and U(t) (based on some estimator \hat{g}(t)) such that

P[L(t) \le g(t) \le U(t), \; t \in T] = 1 - \alpha.

We approximate this by constructing such bounds on a grid of points t_1, \ldots, t_n. Let s^2(t) = Var[g(t) | y] denote the pointwise posterior variance, and q_{1-\alpha} the (1 - \alpha)-th percentile of the distribution of \max_i \{ |g(t_i) - E[g(t_i) | y]| / s(t_i) \}. The simultaneous bands we propose are then given by E[g(t_i) | y] \pm q_{1-\alpha} s(t_i). The bands have variable width proportional to the pointwise posterior standard deviation. The grid used to construct these bands in our examples is the grid of design points.

Pointwise credible intervals based on the Gaussian approximation are centered at the posterior mode with variances given by the inverse Hessian matrix. Simultaneous credible bands based on the Gaussian approximation to the posterior are constructed similarly to the simultaneous posterior bands described above, but are based on the mode \hat{g} and the covariance matrix of the Gaussian approximation. (A sketch of the band construction from weighted posterior draws appears at the end of this section.)

We use Monte Carlo importance sampling to approximate the posterior. A brief description is given here; for a general discussion see Geweke (1989). An importance sampling density f_I(g) must satisfy

\{ g : h(g | y) > 0 \} \subseteq \{ g : f_I(g) > 0 \}.

Writing the posterior expectation of a function a(g) as

E[a(g) | y] = \int_{R^n} a(g) \{ h(g | y)/f_I(g) \} f_I(g) \, dg \Big/ \int_{R^n} \{ h(g | y)/f_I(g) \} f_I(g) \, dg,    (7)

and given a random sample \{g_i\}_{i=1}^{N} of size N from f_I(g), the expectation above can be estimated by

\sum_{i=1}^{N} a(g_i) \{ h(g_i | y)/f_I(g_i) \} \Big/ \sum_{i=1}^{N} \{ h(g_i | y)/f_I(g_i) \}.    (8)

This estimator will be consistent provided both integrals on the r.h.s. of (7) are finite. To estimate quantiles of the posterior distribution of a(g), let \delta_a(g; a^*) = I(g; \{ g : a(g) \le a^* \}), where I(g; A) = 1 if g \in A and I(g; A) = 0 if g \in A^c. As long as the denominator of

(8) is finite, a consistent estimator of P[a(g) \le a^* \,|\, y] is

\sum_{i=1}^{N} \delta_a(g_i; a^*) \, w(g_i) \Big/ \sum_{i=1}^{N} w(g_i),

where w(g_i) = f(g_i | y)/f_I(g_i); since the normalizing constant cancels in the ratio, the computable weight h(g_i | y)/f_I(g_i) may be used in its place. An estimate of the \alpha-th quantile is given by the value of q_\alpha which solves (approximately) the equation

\sum_{i : a(g_i) \le q_\alpha} w(g_i) = \alpha \sum_{i=1}^{N} w(g_i).    (9)
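The estimators (8) and (9) amount to a weighted average and a weighted quantile of the sampled values. A minimal sketch (function names ours; the weights are computed from the unnormalized h(g|y), which differs from f(g|y)/f_I(g) only by a constant that cancels):

```python
import numpy as np

def importance_weights(log_h, log_f_I):
    """Weights proportional to h(g_i|y)/f_I(g_i), from log densities at the draws.

    Subtracting the maximum only rescales the weights; the common factor cancels
    in the self-normalized ratios below.
    """
    log_w = np.asarray(log_h) - np.asarray(log_f_I)
    return np.exp(log_w - np.max(log_w))

def is_expectation(a_vals, w):
    """Estimate (8) of the posterior expectation of a(g)."""
    a_vals, w = np.asarray(a_vals, float), np.asarray(w, float)
    return float(np.sum(a_vals * w) / np.sum(w))

def is_quantile(a_vals, w, alpha):
    """Estimate (9): smallest q whose cumulative weight reaches alpha * sum(w)."""
    a_vals, w = np.asarray(a_vals, float), np.asarray(w, float)
    order = np.argsort(a_vals)
    cum = np.cumsum(w[order])
    idx = np.searchsorted(cum, alpha * cum[-1])
    return float(a_vals[order][min(idx, len(a_vals) - 1)])
```

For draws G (one row per sample) and weights w, a pointwise 95% credible interval for g(t_i) is then (is_quantile(G[:, i], w, 0.025), is_quantile(G[:, i], w, 0.975)).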

Standard errors for this estimator can be estimated by the "batch" or "block" method (Bratley et al. (1987)).

A crucial aspect of the importance sampling scheme is the choice of an importance sampling density f_I(g). One must ensure that f_I(g) dominates h(g | y) in the tails. The tail behavior of h(g | y) is analyzed in Raghavan and Cox (1995). Using these results, we construct a "tail-dominating" density f_T(g); for a detailed description of f_T(g), see Raghavan and Cox (1995). The importance sampling density is the mixture

f_I(g) = p f_G(g) + (1 - p) f_T(g),    (10)

where f_G(g) is defined in (6) and 0 < p < 1. The importance sampling weights w(g) = f(g | y)/f_I(g) can be shown to be bounded, which is desirable because it guarantees that any functional square integrable under h is also square integrable under f_I. The strategy we implemented in fact uses fixed samples from the two components of the mixture. A two-stage sampling scheme is used wherein pilot samples are drawn from both components of the mixture and used to estimate the optimal allocation between components for the second stage. Once all the samples are generated, we compute optimal mixture probabilities p to minimize the asymptotic variance of the estimator of each posterior functional of interest. See Raghavan and Cox (1996b) for details.
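The sketch below puts the pieces together for the simultaneous band construction described earlier: it draws fixed numbers of samples from the two mixture components and computes the variable-width bands from the weighted draws. Because the tail-dominating component f_T of Raghavan and Cox (1995) is specific to that construction, a heavy-tailed multivariate t centered at the Gaussian approximation stands in for it here; that substitution, the fixed df = 4, and the function names are assumptions of this sketch, not the paper's scheme.

```python
import numpy as np
from scipy import stats

def mixture_draws(N, p, g_hat, H, df=4, seed=None):
    """Draw N points from p*f_G + (1-p)*f_T and return the draws with log f_I.

    f_G is the Gaussian approximation N(g_hat, H^{-1}); a multivariate t with the
    same center and scale stands in for f_T (an assumption of this sketch).
    Fixed sample sizes are used for the two components; assumes 0 < p < 1 and N
    large enough that both components receive draws.
    """
    rng = np.random.default_rng(seed)
    cov = np.linalg.inv(H)
    nG = int(round(p * N))
    f_G = stats.multivariate_normal(mean=g_hat, cov=cov)
    f_T = stats.multivariate_t(loc=g_hat, shape=cov, df=df)
    G = np.vstack([np.atleast_2d(f_G.rvs(size=nG, random_state=rng)),
                   np.atleast_2d(f_T.rvs(size=N - nG, random_state=rng))])
    log_fI = np.logaddexp(np.log(p) + f_G.logpdf(G),
                          np.log(1.0 - p) + f_T.logpdf(G))
    return G, log_fI

def simultaneous_bands(G, w, alpha=0.05):
    """(1-alpha) bands E[g(t_i)|y] +/- q*s(t_i) from weighted draws G (rows) and weights w."""
    w = np.asarray(w, float) / np.sum(w)
    mean = w @ G                                    # pointwise posterior means
    sd = np.sqrt(w @ (G - mean) ** 2)               # pointwise posterior s(t_i)
    m_stat = np.max(np.abs(G - mean) / sd, axis=1)  # max_i |g(t_i) - E[g(t_i)|y]| / s(t_i)
    order = np.argsort(m_stat)
    cum = np.cumsum(w[order])
    idx = min(np.searchsorted(cum, 1.0 - alpha), len(cum) - 1)
    q = m_stat[order][idx]                          # weighted (1-alpha) percentile
    return mean - q * sd, mean + q * sd
```

Given draws G and weights w = importance_weights(...), simultaneous_bands(G, w) returns the lower and upper band values at the design points.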

3 The Monte Carlo Experiment

In our Monte Carlo study, the chosen logit functions were sampled at equally spaced points in their domains for three sample sizes: n = 25, 100, 400. The logit functions we used are listed below (a small data-generating sketch follows the list):

Example 1: f(t) = 60(0.9t)^3 (1.1 - t), 0 \le t \le 1.

Example 2: f(t) = 1.5t - 5, 0 \le t \le 5.

Example 3: f(t) = 3[10^5 t^{11} (1 - t)^6 + 10^3 t^3 (1 - t)^{10}] - 2, 0 \le t \le 1.

Example 4: f(t) = 0 for -0.2 \le t < 0, and f(t) = \sin(2\pi (1 - t)^2) for 0 \le t \le 1.
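For reference, the data-generating step of the experiment can be sketched as follows, using the four logit functions as transcribed above; the exponents in Examples 1, 3 and 4 are reconstructions from the garbled source, and the helper names are ours.

```python
import numpy as np

DOMAINS = {1: (0.0, 1.0), 2: (0.0, 5.0), 3: (0.0, 1.0), 4: (-0.2, 1.0)}

def example_logit(k, t):
    """The four test logit functions of Section 3 (as transcribed above)."""
    t = np.asarray(t, dtype=float)
    if k == 1:
        return 60.0 * (0.9 * t) ** 3 * (1.1 - t)
    if k == 2:
        return 1.5 * t - 5.0
    if k == 3:
        return 3.0 * (1e5 * t**11 * (1 - t)**6 + 1e3 * t**3 * (1 - t)**10) - 2.0
    if k == 4:
        return np.where(t < 0.0, 0.0, np.sin(2.0 * np.pi * (1.0 - t) ** 2))
    raise ValueError("k must be 1, 2, 3 or 4")

def simulate(k, n, seed=None):
    """Binary responses at n equally spaced points in the example's domain."""
    rng = np.random.default_rng(seed)
    t = np.linspace(*DOMAINS[k], n)
    g = example_logit(k, t)
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-g)))   # p(t) = 1/(1 + exp(-g(t)))
    return t, y
```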
