Convex Mixture Regression for Quantitative Risk Assessment arXiv ...

3 downloads 33 Views 1020KB Size Report
Aug 12, 2017 - of the pesticide dichlorodiphenyltrichloroethane (DDT), was measured in maternal serum samples collected during the third trimester of ...
arXiv:1701.02950v2 [stat.ME] 1 Feb 2017

Convex Mixture Regression for Quantitative Risk Assessment Antonio Canale∗, Daniele Durante† and David B. Dunson‡ January 31, 2017

Abstract There is considerable interest in studying how the distribution of an outcome varies with a predictor. We are motivated by environmental applications in which the predictor is the dose of an exposure and the response is a health outcome. A fundamental focus in these studies is inference on dose levels associated with a particular increase in risk relative to a baseline. Current methodologies either dichotomize the continuous response or focus on modeling changes with dose in the expectation of the outcome. These choices may lead to a loss of information and provide a restrictive characterization of the dose–response relation. We instead propose a class of convex mixture regression models that allow the entire distribution of the outcome to be unknown and changing with the dose. To avoid overfitting, we rely on a flexible characterization of the density at the extreme dose levels, and express the conditional density at each intermediate dose as a convex combination of the extremal densities. This representation generalizes popular dose–response models for binary outcomes and facilitates inference on extra risk functions of interest in quantitative risk assessments with continuous outcomes. We develop simple Markov chain Monte Carlo algorithms for implementation, and propose goodness-of-fit assessments. The benefits of our methods are highlighted in simulations and in a study of the impact of dioxin exposure on preterm birth. Keywords: Benchmark dose; Conditional density estimation; Convex interpolation; Density regression; ∗

Dipartimento ESOMAS, Universit` a di Torino e Collegio Carlo Alberto, Torino, Italy [email protected] Dipartimento di Scienze Statistiche, Universit`a degli Studi di Padova, Padova, Italy [email protected] ‡ Department of Statistical Science, Duke University, Durham, NC [email protected]

1

1

Introduction

It is often of substantial interest to study how the distribution of an outcome y ∈ Y varies with a predictor x ∈ X ⊆ R. Focusing on the case in which Y ⊆ R, with the response variable univariate and continuous, we let f (y | x) denote the conditional density of y given x. We are interested in environmental health studies in which x is the dose of a potentially adverse exposure, and y represents a measure of health. There is a wide literature on dose–response modeling, relating dose to the risk of an adverse health outcome, with such models forming the basis of quantitative risk assessment and regulatory guidelines on safe levels of exposure. In quantitative risk assessment, one attempts to estimate the dose level associated with a selected small increase in risk of an adverse health outcome relative to the background risk corresponding to zero dose. In applying such methods to quantitative outcomes, it is standard practice to study the conditional expectation of the response variable within a classical regression framework (e.g. Ritz and Streibig 2005), or to dichotomize the response prior to inference (e.g. West and Kodell 1999). As a motivating application, we consider a study relating dioxin exposure in pregnant women to risk of premature delivery (Longnecker et al. 2001). Data are obtained from a sub-study of the US Collaborative Perinatal Project (CPP), which was a large prospective study of pregnant women and their children. In this sub-study, Dichlorodiphenyldichloroethylene (DDE), which is a persistent metabolite of the pesticide DDT, was measured in a maternal serum sample collected during the third trimester of pregnancy. DDE is lipophilic and is stored in fat tissue, so that women build up a body burden of the chemical, which potentially impacts the health of their pregnancy. In Figure 1 we show how the length of gestation relates to DDE in our data. Using the typical cutoff of ≤ 37 weeks to define a premature delivery, we find a significant trend in risk of premature delivery with DDE dose (p-value < 10−9 ). However, dichotomizing leads to an unavoidable loss of clinically-relevant information. In particular, an infant born at 37 weeks is essentially full term and has little or no adverse consequences attributable to the early delivery, but morbidity and mortality increase dramatically as the length of gestation decreases. To obtain a more complete and clinically relevant characterization of risk, it is necessary to estimate how the whole distribution of gestational age at delivery shifts with DDE. This can be accomplished with a flexible nonparametric density regression, also commonly referred to as conditional density estimation. There is a rich literature on conditional density estimation for univariate continuous outcomes and predictors. From a frequentist perspective, Fan et al. (1996) proposed a double kernel method. This approach is related conceptually to the Bayesian nonparametric approach of M¨ uller et al. (1996), which defines a Dirichlet process (DP, Ferguson 1973, 1974) mixture of Gaussians (Lo 1984; Escobar and West 1995) for the joint distribution of (y, x) to induce an estimate of the conditional density f (y | x). Focusing on the joint distribution has some clear disadvantages in practice due to the need to model the marginal of x, which

2

Gestational age at delivery

45

40

35

30

0

50

100

150

Dichlorodiphenyldichloroethylene (DDE)

Figure 1: Length of gestation (in weeks) and DDE in maternal serum (in µg/l) from the CPP data. The solid line and the grey variability bands are obtained via nonparametric loess smoothing. is effectively a nuisance parameter and can lead to the introduction of unnecessary clusters in the data (Wade et al. 2014). For this reason, we focus instead on a conditional specification which avoids modeling the marginal of x. Consider the Gaussian mixture regression model: f (y | x) =

H X h=1

1 νh (x) φ σh



y − µh (x) σh

 ,

(1)

where νh (x) is a probability weight for the hth component varying with x, and φ{·} is the standard Gaussian density. In the special case in which ν1 (x) = 1 for all x, this model reduces to a regression framework, with E(y | x) = µ1 (x) and independent N(0, σ12 ) residuals. This simplified representation would be inappropriate for our motivating data set, which exhibits clear changes in the shape of f (y | x) with x. However, by mixing together different mean regression models, we can allow f (y | x) to change flexibly with x (Pati et al. 2013; Norets and Pelenis 2014; Norets and Pati 2014). There is a rich nonparametric Bayes literature focused on model (1). For example, De Iorio et al. (2004) used a dependent Dirichlet process mixture, which assumed fixed weights νh (x) = νh for all x ∈ X , while using an ANOVA structure for the conditional mean. Similar approaches

3

have been applied to spatial modelling (Gelfand et al. 2005) and time series data (Caron et al. 2006). There is also a literature on approaches which allow the weights to change with x, such as Dunson and Park (2008) and Chung and Dunson (2009). These types of models have appealing theoretical properties in asymptotic contexts (Pati et al. 2013), but their considerable flexibility comes at a cost in terms of interpretability and parsimony in characterizing functionals of interest. In particular, in our motivating study, it is important to quantify the dose effect via a one-dimensional curve β(x), facilitating quantitative risk assessment. We propose a simple construction, which is applicable well beyond our motivating application. The basic idea is to consider two extremal distributions for the response, observed at the low and high end of the predictor range. In toxicology applications, for example, the lower bound may be zero, meaning no dose exposure, whereas the upper bound can be considered a conservative value — potentially infinite — outside of the range of dose observed in practice. It is important to flexibly model the extremal distributions. For example, when the response is univariate and continuous, a mixture of Gaussians can be used. To include changes with the predictor, we model the conditional density f (y | x) as a convex combination of those associated with the extremal points, with the weight function β(x) ∈ [0, 1] changing non-linearly from zero at the low extreme of X to one at the high extreme. The β(x) function has a simple interpretation as the relative change in the response distribution along the continuum between the two extremes, leading to straightforward quantitative risk assessment. In addition, there is a substantial reduction in effective degrees of freedom of the model relative to unconstrained inference. Our conjecture is that this simple Convex Mixture Regression (CoMiRe) will provide adequate fit in many applications, particularly in the motivating epidemiology and toxicology contexts. The model avoids unnecessary parametric assumptions about the response distribution, but relies for interpretability and dimensionality reduction on a convexity property, which can be easily checked via goodness of fit assessments, and relaxed in various ways — e.g. by allowing the extremal points to not correspond exactly to the extremes of the range of X . It is also straightforward to include covariate adjustments, hierarchical dependence structures, and additional complications within the proposed model. Figure 2 illustrates that a simple implementation of CoMiRe provides an excellent fit to the motivating data. There are other restrictions that can be incorporated in conditional density estimation to include more structure and limit concerns about overfitting. By far the most common is to allow only the mean or selected quantiles of f (y | x) to change with x. Such assumptions are violated in many applications, including the DDE and premature delivery study. Another possibility is to incorporate a shape constraint — for example, restricting f (y | x) to be stochastically ordered in x (Dunson and Peddada 2008; Wang and Dunson 2011). While this assumption may be warranted in certain applications, the convexity property specific to

4

0.25

0.25

0.25

0.20

0.20

0.20

0.15

0.15

0.15

0.10

0.10

0.10

0.05

0.05

0.05

0.00

0.00

0.00 25

30

35

40

45

25

Gestational age at delivery (DDE

Suggest Documents