A Nonparametric Mixture Approach to Case-Control Studies with Errors in Covariables Kathryn Roeder Department of Statistics Yale University Yale Station, Box 2179 New Haven, CT 06520
R. J. Carroll Department of Statistics Texas A&M University College Station, TX 77843
Bruce G. Lindsay Department of Statistics Penn State University The Classroom Building University Park, PA 16802
Abstract

Methods are devised for estimating the parameters of a prospective logistic model in a case-control study with dichotomous response $D$ which depends on a covariate $X$. For a portion of the sample, both the gold standard $X$ and a surrogate covariate $W$ are available; however, for the greater portion of the data only the surrogate covariate $W$ is available. By using a mixture model, the relationship between the true covariate and the response can be modeled appropriately for both types of data. The likelihood depends on the marginal distribution of $X$ and the measurement error density $(W\mid X,D)$. The latter is modeled parametrically based on the validation sample. The marginal distribution of the true covariate is modeled using a nonparametric mixture distribution. In this way we can improve the efficiency and reduce the bias of the parameter estimates. The results are sufficiently general that they provide the first results which allow no validation data, if the error distribution is modeled from independent data. Many of the results also apply to the easier case of prospective sampling.
1 INTRODUCTION

In this article, we examine case-control studies with errors in covariables, using a semiparametric mixture model approach. To study the relationship between disease status and exposure level to a suspected disease-causing agent ($X$), epidemiologists often use retrospective sampling, in which the diseased population (the "cases", with $D=1$) and the disease-free population (the "controls", with $D=0$) are sampled separately to determine their levels of exposure $X$. Viewed from the perspective of the joint population of $(X,D)$, we are taking observations from the conditional distributions of $X\mid D$. Suppose the probability of disease in the source population can be described by the prospective logistic model,
$$\Pr(D=1\mid X=x) = \kappa(x) = \{1+\exp(-\beta_0-\beta_1^t x)\}^{-1},$$
and the marginal density of $X$ in the population is $g(x)$, which will be modeled as an unknown discrete density. Then the population is described by parameters $(\beta_0,\beta_1,g)$. It is well known that standard logistic regression, performed as if $D$ were the dependent variable and the covariates $X$ were fixed, leads to the maximum likelihood estimate of $\beta_1$ for retrospective sampling (Prentice & Pyke 1979). Unless the incidence of disease in the source population ($\pi$) is known, $\beta_0$ must be viewed as a nuisance parameter. In §2, we shed new light on this well-known result by demonstrating that the retrospective model is identifiable up to a specific equivalence class. If $\pi$ is unknown, only $\beta_1$ is identifiable. The remaining parameters, $\beta_0$ and $g$, are linked by $\pi$.
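The identifiability claim can be checked numerically: in a case-control sample drawn with equal case and control fractions, the sampled population is again logistic with the same slope $\beta_1$ and only a shifted intercept. A minimal Python sketch, with an entirely hypothetical discrete population:

```python
import math

# Hypothetical discrete source population (all values are illustrative)
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
g  = [0.3, 0.25, 0.2, 0.15, 0.1]
b0, b1 = -3.0, 0.8                       # prospective logistic parameters

kappa = lambda x: 1 / (1 + math.exp(-b0 - b1 * x))
pi = sum(kappa(x) * w for x, w in zip(xs, g))   # disease incidence Pr(D=1)

# Case-control population: cases and controls sampled in equal proportion.
# Pr(D=1 | X=x) in the *sampled* population, by Bayes' theorem:
def kappa_cc(x):
    p1 = kappa(x) * g[xs.index(x)] / pi              # Pr(X=x | D=1)
    p0 = (1 - kappa(x)) * g[xs.index(x)] / (1 - pi)  # Pr(X=x | D=0)
    return 0.5 * p1 / (0.5 * p1 + 0.5 * p0)

# The sampled population is logistic with the SAME slope b1; only the
# intercept shifts, here by log{(1-pi)/pi}.
for x in xs:
    logit = math.log(kappa_cc(x) / (1 - kappa_cc(x)))
    assert abs(logit - (b0 + math.log((1 - pi) / pi) + b1 * x)) < 1e-12
```

This is exactly why standard logistic regression on retrospective data recovers $\beta_1$ while $\beta_0$ is confounded with the sampling fractions and $\pi$.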
Epidemiologists often use a validation design, in which some of the data are "complete" in that the covariates are measured both directly (without error) and indirectly (with error), while for the remainder of the data, the covariates are only measured indirectly. The latter sample is called the "reduced" or "incomplete" sample. Let $W$ and $X$ denote the covariates measured with and without error, respectively. Carroll, Gail & Lubin (1993) extend the use of the prospective logistic model to account for this errors-in-variables design. Other papers in the area include the work of Armstrong, Howe & Whittemore (1989), Buonaccorsi (1990) and Satten & Kupper (1993). A motivating example which we analyze in §8 concerns the effect of LDL cholesterol ($X$) on the probability of heart disease ($D$). Consider a design in which the case-control sample is split into a group of size $n_C$ and another of size $n_R$, independent of $X$. Suppose total cholesterol ($W$) and LDL are measured for the $n_C$ individuals; $\{X_i, W_i, D_i;\ i=1,\dots,n_C\}$ constitutes the complete data. Because LDL is expensive to measure relative to total cholesterol, only total cholesterol is measured in the remaining $n_R$ individuals; $\{W_j, D_j;\ j=1,\dots,n_R\}$
constitutes the reduced data. Following Carroll et al. (1993), we assume a parametric conditional distribution for $W$, given $X$ and $D$, denoted by $f_{W|X,D}(w|x,d;\alpha)$. We choose a parametric model for the measurement error distribution for two reasons: (i) there is opportunity to test this assumption through the validation data; and (ii) in many problems, other studies will also have assessed this assumption, and the error distribution may be transportable from study to study. On the other hand, $G$, the distribution of $X$ corresponding to density $g$, is left unspecified, largely because this distribution depends on the source population. A natural way to model errors-in-variables data is with a nonparametric mixture model (Kiefer & Wolfowitz 1956). Let $h_j(\xi;\alpha,\beta)$ denote the conditional likelihood of $(W_j, D_j \mid X=\xi)$ and let $\theta = (\alpha,\beta,G)$ denote the parameters of the model. For the reduced data, it is immediately obvious that the joint likelihood of $(W_j,D_j)$ can be written in the form of a mixture model, $L_j(\theta) = \int h_j(\xi;\alpha,\beta)\,dG(\xi)$. The joint likelihood of the complete data $(X_i,W_i,D_i)$ can also be written in the form of a mixture model, $L_i(\theta) = \int h_i(\xi;\alpha,\beta)\,I(x_i=\xi)\,dG(\xi)$, with the use of an indicator function. The terms in the conditional (retrospective) likelihood, however, are not of the form of a mixture model. In §3 we show that if $\hat\theta$ maximizes the joint likelihood, it lies in an equivalence class of parameters that also maximize the retrospective likelihood. Hence maximizing the joint likelihood is equivalent to maximizing the retrospective likelihood. Thus a semiparametric mixture model approach can be taken to obtain m.l. estimates after all. Specifically, the retrospective m.l. estimate of the parameter of interest, $\beta_1$, can be obtained by maximizing the joint likelihood over the parametric $(\alpha,\beta)$ and nonparametric ($G$) components of the likelihood. A confidence interval for $\beta_1$ is obtained from the profile likelihood.
Assuming the likelihood satisfies the regularity conditions specified by Kiefer & Wolfowitz (1956), the profile m.l. estimator, $\hat\beta_1$, is consistent. In certain models with conditional structure, the estimates of the parametric component are known to be asymptotically efficient (Lindsay et al. 1991). Although many simulation studies have shown high efficiency in other models (e.g., Lesperance 1989), a general theory of efficiency has not yet been developed.
Because $X$ was not measured in the reduced sample, the problem can be viewed as a missing data problem. In §4, an EM algorithm to obtain the m.l. estimate of $G$ is presented. Gradient-based algorithms appropriate for semiparametric mixture models are also noted in this section. In §5, the results are extended to models for which the probability of disease depends
on additional covariates. In §6, an extension to a study design without direct validation data is developed. Suppose that no complete data are available in the sample; however, an independent study has been conducted in which both $X$ and $W$ have been measured. Although none of the other measurement error models can handle this situation, data of this form can be analyzed using the semiparametric mixture method. This model requires that the measurement error be nondifferential. When validation data are available, a natural competitor to the mixture method is the pseudolikelihood method proposed by Carroll et al. (1993). They estimate the marginal distribution of $X$ using a weighted average of the empirical distributions of $X\mid D=d$ obtained from the complete data. This estimate is plugged into the likelihood, from which the maximum pseudolikelihood estimates of the remaining parameters are obtained. By modeling the relationship between $W$ and $D$, using a rough-and-ready estimate of the unobserved distribution of $X$, the information about $\beta_1$ contained in the reduced data can be partially recovered. Because the distribution of $W$ depends on $X$, however, there is some additional information in the reduced sample about the distribution of $X$. Consequently, jointly maximizing the full likelihood should yield more information about $\beta_1$ than is available when only the complete data are used to estimate the distribution of $X$.
In §7, a simulation experiment is presented which evaluates the performance of the mixture method for various sample sizes and amounts of measurement error. The mixture method always performs as well as, or better than, the pseudolikelihood method. The amount of improvement depends on the sample size and the amount of error in the measurements. In §8 an epidemiological data set is analyzed and profile likelihoods are presented.
2 IDENTIFIABILITY

Assume that the probability of disease in the source population, for a given level of exposure, can be modeled by the prospective logistic model $\kappa(x)$. Then it is readily determined that
$$f_D(d) = \Big\{\sum_x \kappa(x)g(x)\Big\}^d\,\Big\{\sum_x \bar\kappa(x)g(x)\Big\}^{1-d}, \quad (1)$$
where $\bar\kappa(x) = 1-\kappa(x)$. It follows that
$$\Pr(X=x\mid D=d) = \frac{\kappa(x)^d\,\bar\kappa(x)^{1-d}\,g(x)}{\{\sum_x \kappa(x)g(x)\}^d\,\{\sum_x \bar\kappa(x)g(x)\}^{1-d}}. \quad (2)$$
In order that the $p$-dimensional parameter $\beta_1$ be identifiable from this distribution, it will be assumed that $g(x_j) > 0$ for some set $x_0, x_1, \dots, x_p$ of affinely independent vectors. The following lemma provides useful information for the consideration of identifiability issues and, later, for maximization of the likelihood.
LEMMA 1: Suppose $(\beta_0,\beta_1,g)$ and $(\tilde\beta_0,\tilde\beta_1,\tilde g)$ are two logistic regression models satisfying
$$\Pr(D=1;\beta_0,\beta_1,g) = \pi \qquad\text{and}\qquad \Pr(D=1;\tilde\beta_0,\tilde\beta_1,\tilde g) = \tilde\pi.$$
Then
$$\Pr(X=x\mid D=d;\beta_0,\beta_1,g) = \Pr(X=x\mid D=d;\tilde\beta_0,\tilde\beta_1,\tilde g) \quad (3)$$
if and only if

(i) $\beta_1 = \tilde\beta_1$;

(ii) $\tilde\beta_0 = \beta_0 + \log\{(\tilde\pi\bar\pi)/(\pi\bar{\tilde\pi})\}$;

(iii) $\tilde g(x) = \dfrac{\bar{\tilde\pi}}{\bar\pi}\cdot\dfrac{1+\exp(\tilde\beta_0+\beta_1^t x)}{1+\exp(\beta_0+\beta_1^t x)}\;g(x)$,

where $\bar\pi = 1-\pi$ and $\bar{\tilde\pi} = 1-\tilde\pi$.
PROOF: (i) $\beta_1 = \tilde\beta_1$ because the log odds ratio is proportional to $\beta_1$ for either retrospective model. (ii) Follows from Bayes' Theorem. (iii) When $d=1$, (3) implies
$$\kappa(x)g(x)/\pi = \tilde\kappa(x)\tilde g(x)/\tilde\pi.$$
It follows that
$$\tilde g(x) = g(x)\,\frac{\tilde\pi}{\pi}\,\exp(\beta_0-\tilde\beta_0)\,\frac{1+\exp(\tilde\beta_0+\beta_1^t x)}{1+\exp(\beta_0+\beta_1^t x)},$$
which reduces to (iii) upon substituting (ii). For $d=0$ the same relationship holds. The result follows from noting that $\sum_x \tilde g(x) = 1$. $\Box$
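The equivalence class in Lemma 1 can be verified numerically. The sketch below, on a hypothetical three-point population, applies (ii) and (iii) to move to any target $\tilde\pi$ and checks that $\tilde g$ is a density, that it attains $\Pr(D=1)=\tilde\pi$, and that the retrospective conditionals are unchanged:

```python
import math

# Hypothetical discrete population and logistic parameters
xs = [0.0, 1.0, 2.0]
g  = [0.5, 0.3, 0.2]
b0, b1 = -1.0, 0.6

k = lambda b, x: 1 / (1 + math.exp(-b - b1 * x))     # kappa with intercept b
pi = sum(k(b0, x) * w for x, w in zip(xs, g))        # Pr(D=1) under (b0, g)

pit = 0.30                                           # any target Pr(D=1)
b0t = b0 + math.log(pit * (1 - pi) / (pi * (1 - pit)))       # Lemma 1 (ii)
gt = [((1 - pit) / (1 - pi))
      * (1 + math.exp(b0t + b1 * x)) / (1 + math.exp(b0 + b1 * x)) * w
      for x, w in zip(xs, g)]                        # Lemma 1 (iii)

assert abs(sum(gt) - 1) < 1e-12                      # g-tilde is a density
pit_check = sum(k(b0t, x) * w for x, w in zip(xs, gt))
assert abs(pit_check - pit) < 1e-12                  # attains Pr(D=1) = pi-tilde
for d in (0, 1):                                     # same Pr(X=x | D=d)
    for x, w, wt in zip(xs, g, gt):
        num  = k(b0, x) ** d * (1 - k(b0, x)) ** (1 - d) * w
        numt = k(b0t, x) ** d * (1 - k(b0t, x)) ** (1 - d) * wt
        assert abs(num / (pi if d else 1 - pi)
                   - numt / (pit if d else 1 - pit)) < 1e-12
```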
Thus we see that from the retrospective sample, only the parameter $\beta_1$ is fully identifiable. The marginal distribution of $X$ can only be determined up to an equivalence class of functions. However, if the true population probability of disease, $\pi$, is otherwise known, then the above formulas show that $\beta_0$ and $g$ are identifiable.
3 THE ERRORS-IN-VARIABLES MODEL

We assume that the true covariates and the error-prone measurements are available in a validation study consisting of a random sample of $n_{C1}$ cases and $n_{C0}$ controls. Thus the complete data consist of $n_C = n_{C1} + n_{C0}$ observations, $\{X_i, W_i, D_i;\ i=1,\dots,n_C\}$. In addition, we have $n_R = n_{R1} + n_{R0}$ incomplete or reduced observations, $\{W_j, D_j;\ j=1,\dots,n_R\}$, obtained from a random sample of $n_{R1}$ cases and $n_{R0}$ controls. The number of cases overall is $n_1 = n_{C1} + n_{R1}$ and the number of controls is $n_0 = n_{C0} + n_{R0}$. It is assumed that the data are missing $X$ at random (Little & Rubin 1987).

Since we are assuming that the division of the cases and controls into the reduced and complete subsamples is done conditionally on $D$, but independently of the values of $X$, the conditional distribution of $X=x$ given both $D=d$ and the subsample information is still just the conditional distribution of $X=x$ given $D=d$ that was described in the previous section; see (2): $\Pr(X=x\mid D=d) = f_{X,D}(x,d)/f_D(d)$, where $f_{X,D}(x,d) = \kappa(x)^d\,\bar\kappa(x)^{1-d}\,g(x)$. The marginal distribution of $D$, given in (1), can be re-expressed in the form of a mixture:
$$f_D(d) = \Big\{\int \kappa(x)\,dG(x)\Big\}^d\,\Big\{\int \bar\kappa(x)\,dG(x)\Big\}^{1-d}. \quad (4)$$
Recall from Lemma 1 that there is an equivalence class of choices for $(\beta_0, g)$ for which the same retrospective model is obtained. Now $\pi = f_D(1)$ identifies a particular pair. If it were necessary to evaluate $f_D(d)$ for each subsample separately, then it would be necessary to incorporate a differential shift for each $\beta_0$ and to estimate a different $g$ for each subsample. The following argument shows why this is not necessary.
In the reduced sample the likelihood contribution from an observation $(W,D)$ is $f_{W|D}(w|d) = f_{W,D}(w,d)/f_D(d)$, where
$$f_{W,D}(w,d) = \sum_\xi h(\xi)g(\xi) = \int h(\xi)\,dG(\xi), \quad (5)$$
and
$$h(\xi) = \kappa(\xi)^d\,\bar\kappa(\xi)^{1-d}\,f_{W|X,D}(w\mid\xi,d). \quad (6)$$
In the complete subsample, the likelihood contribution from an observation $(W,X,D)$ is $f_{W,X|D}(w,x|d) = f_{W,X,D}(w,x,d)/f_D(d)$, where
$$f_{W,X,D}(w,x,d) = \sum_\xi h(\xi)\,I(x=\xi)\,g(\xi) = \int h(\xi)\,I\{x=\xi\}\,dG(\xi). \quad (7)$$
Note the device of using an indicator function so that both (5) and (7) are written as integrals $\int \cdot\,dG(\xi)$, where $G$ is the distribution function with discrete density $g(\cdot)$.
Putting these together, we get that the likelihood to maximize is of the form
$$L_C(\alpha,\beta,G) = \frac{L_J(\alpha,\beta,G)}{L_M(\beta,G)},$$
where
$$L_J(\alpha,\beta,G) = \prod_j f_{W,D}(w_j,d_j)\,\prod_i f_{W,X,D}(w_i,x_i,d_i)$$
and
$$L_M(\beta,G) = \{f_D(1)\}^{n_1}\,\{f_D(0)\}^{n_0}.$$
In the above, C, J and M stand for conditional, joint and marginal. Note that as functions of $G$, both $L_J$ and $L_M$ have the product-of-integrals form $\prod_i \int t_i(\xi)\,dG(\xi)$ for some nonnegative functions $t_i(\xi)$. It follows that the conditional likelihood $L_C(\alpha,\beta,G)$ is a ratio of two products-of-integrals. Consequently we cannot apply the theory of nonparametric mixture models directly to the retrospective likelihood. Here we run into some good fortune, as we can simply estimate the parameters from $L_J(\alpha,\beta,G)$, which does have mixture form. The parameter estimates we so obtain will be in an equivalence class of estimators that maximize $L_C(\alpha,\beta,G)$, being that member of the equivalence class that fits the observed fractions of cases and controls. Moreover, the profile likelihood for $\beta_1$ from $L_J$ is the same as from $L_C$.
We state the result in a general manner as it has applications in other problems (e.g., Lindsay et al. 1991). Suppose that for a two-parameter model $(\theta,\eta)$ we have a likelihood decomposition of the form
$$L_C(\theta,\eta) = \frac{L_J(\theta,\eta)}{L_M(\theta,\eta)},$$
where $L_M(\theta,\eta)$ is of multinomial form: $L_M(\theta,\eta) = \prod_{k=1}^K [p_k(\theta,\eta)]^{m_k}$, with $m = m_1 + \dots + m_K$ and $p_1(\theta,\eta) + \dots + p_K(\theta,\eta) = 1$. In this setting we have the following simple theorem.
THEOREM 2: Let $\Omega$ be the set of all $(\theta,\eta)$ in the parameter space such that there exists $\eta^*$ depending on $(\theta,\eta)$ satisfying

(i) $L_C(\theta,\eta) = L_C(\theta,\eta^*)$;
(ii) $p_k(\theta,\eta^*) = m_k/m$.

Suppose that in the set of $(\theta,\eta)$ maximizing $L_C(\theta,\eta)$ there exists $(\hat\theta_C,\hat\eta_C)$ in $\Omega$. Then

(1) $(\hat\theta_C,\hat\eta_C^*)$ maximizes $L_J(\theta,\eta)$.
(2) If $(\hat\theta_J,\hat\eta_J)$ is any maximizer of $L_J(\theta,\eta)$, it satisfies (ii) and also maximizes $L_C(\theta,\eta)$.

Moreover, if the entire parameter space is in $\Omega$, $L_J$ and $L_C$ generate the same profile likelihoods for $\theta$.
PROOF: The key to the proof is that any $(\theta,\eta)$ satisfying (ii) necessarily maximizes $L_M(\theta,\eta)$, as this term is maximized over all possible multinomial probabilities $p_1,\dots,p_K$. Thus given any $(\hat\theta_C,\hat\eta)$ maximizing $L_C(\theta,\eta)$, the corresponding $(\hat\theta_C,\hat\eta^*)$ satisfying (i) and (ii) simultaneously maximizes both terms in the product $L_C(\theta,\eta)\,L_M(\theta,\eta)$. If the entire parameter space is in $\Omega$, then for any $\theta$ there is an $\hat\eta$ that maximizes $L_C(\theta,\eta)$ over $\eta$ and satisfies $p_k(\theta,\hat\eta) = m_k/m$. Consequently the joint and conditional profile likelihoods for $\theta$ are equal. $\Box$

COROLLARY 3: If $(\hat\alpha,\hat\beta,\hat G)$ maximizes $L_J(\alpha,\beta,G)$, then $(\hat\alpha,\hat\beta,\hat G)$ also maximizes $L_C(\alpha,\beta,G)$ and satisfies the equation $\Pr(D=1;\hat\alpha,\hat\beta,\hat G) = n_1/n$. The profile likelihoods for $\beta_1$ from $L_C$ and $L_J$ are identical.
PROOF: From Lemma 1, for any fixed set of parameters $(\alpha,\beta,G)$ we can find a set $(\alpha,\tilde\beta,\tilde G)$ giving the same conditional distributions for $X$ given $D$, but having the prespecified value $\Pr\{D=1;\alpha,\tilde\beta,\tilde G\} = n_1/n$. It follows that the conditional distributions of $W$ given $D$ are also identical for $(\alpha,\beta,G)$ and $(\alpha,\tilde\beta,\tilde G)$; hence the values of $L_C$ are identical, and so the hypothesis of the theorem is met. $\Box$
In addition to extending the result of Prentice & Pyke (1979) to the situation with errors in covariables, we believe that the manner of proof considerably clarifies the kind of structural features necessary for one to have an equivalence between a retrospective and prospective likelihood analysis. One needs sufficiently rich structure that one can fit perfectly the marginal distribution of $D$ without diminishing one's ability to fit the conditional distributions of the covariates given $D$. In particular, it is important to note that if we had modeled the marginal distribution $G$ parametrically, then there no longer need be exact equivalence between prospective and retrospective inferences.
4 ALGORITHMS

In this section we present two algorithms for maximizing the likelihood with respect to the mixing distribution for a fixed value of $(\alpha,\beta)$. Algorithms for estimating $(\alpha,\beta)$ given $G$ are discussed in Carroll et al. (1993). Throughout this section we suppress the dependence of the likelihood on the parametric component of the model and focus on an estimation scheme for the unknown mixture.
4.1 EM Algorithms

Because the reduced data are missing $X$, the problem is suited to the EM algorithm, which estimates the missing data and then maximizes the likelihood, given these estimates. Lindsay (1983) showed that the m.l. estimate of the mixing distribution has a fixed upper bound, $K$, on the number of support points: $K$ equals the number of distinct terms in the likelihood. Hence the missing data can be thought of as a membership indicator variable for the $K$ possible values of the covariate $\xi = (\xi_1,\dots,\xi_K)$. For convenience, we assume throughout that $\xi$ is known. This assumption will be justified shortly. The group membership for the complete data is obvious since $X$ is observed; in other words, for the $i$'th observation, the posterior probability of group membership is one for $\xi = X_i$. Consequently, the set of support points must include all distinct observed $X$'s. The group membership for reduced data can be estimated in the usual way (e.g., Titterington et al. 1985). The EM algorithm, at the $(m+1)$'st step, puts mass
$$g^{(m+1)}(\xi_k) = \{A_{Ck} + A_{Rk}\}/(n_C + n_R)$$
at $\xi_k$, where
$$A_{Rk} = \sum_{j=1}^{n_R} \frac{g^{(m)}(\xi_k)\,h_j(\xi_k)}{L_j(G^{(m)})}, \qquad A_{Ck} = \sum_{i=1}^{n_C} I(x_i = \xi_k).$$
For a fixed $(\alpha,\beta)$, $\lim_{m\to\infty} G^{(m)}$ maximizes the likelihood, provided the support points are known.
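The update can be sketched in a few lines of Python. The logistic pieces, the normal error density, the toy data and the grid below are all hypothetical stand-ins for $\kappa$, $f_{W|X,D}$ and the design; observed $X$'s are snapped to the nearest grid point for simplicity:

```python
import math

# Hypothetical model pieces for the EM update g^(m+1)(xi_k) = (A_Ck + A_Rk)/(n_C + n_R)
b0, b1, sigma = -1.0, 0.8, 0.5
kappa = lambda x: 1 / (1 + math.exp(-b0 - b1 * x))

def h(w, d, x):                      # conditional likelihood term (6), N(x, sigma^2) error
    dens = math.exp(-0.5 * ((w - x) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    return kappa(x) ** d * (1 - kappa(x)) ** (1 - d) * dens

grid = [i * 0.25 for i in range(-8, 9)]               # fixed support xi_1..xi_K
complete = [(0.5, 0.4, 1), (-0.5, -0.3, 0)]           # (x_i, w_i, d_i), toy data
reduced  = [(0.8, 1), (-1.1, 0), (0.2, 0), (1.4, 1)]  # (w_j, d_j), toy data

g = [1 / len(grid)] * len(grid)                       # uniform starting mass
for _ in range(200):
    A = [0.0] * len(grid)
    for x, w, d in complete:                          # A_Ck: indicator term
        A[min(range(len(grid)), key=lambda k: abs(grid[k] - x))] += 1
    for w, d in reduced:                              # A_Rk: posterior weights
        L = sum(g[k] * h(w, d, grid[k]) for k in range(len(grid)))
        for k in range(len(grid)):
            A[k] += g[k] * h(w, d, grid[k]) / L
    g = [a / (len(complete) + len(reduced)) for a in A]

assert abs(sum(g) - 1) < 1e-9                         # g stays a probability mass
```

Note that the complete observations pin a fixed share of mass at their support points at every iteration, while the reduced observations redistribute the remainder.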
REMARK: Although the EM algorithm is typically used to estimate both the location of the support and the mass associated with each point of support, we chose to select a fixed grid of support points and then to estimate the probability associated with each point of support. Selection of $\xi$ can be based on the observed $X$'s, $W$'s and the distributional relationship between $X$ and $W$. We found it sufficient to include each distinct $X$ from the complete data set as well as a grid of points separated by at most $\hat\sigma_{W|X}/5$, where $\sigma_{W|X}$ may depend on $X$. The range of the grid can be determined by estimating $E[X|W]$ for the minimum and maximum values of $W$ observed in the experiment; call the minimum predicted value $x_l$ and the maximum predicted value $x_h$. Define the grid on the interval $[x_l - 2\hat\sigma_{W|X=x_l},\ x_h + 2\hat\sigma_{W|X=x_h}]$. This choice was made in the interest of computational feasibility; the EM algorithm is notoriously slow, especially when both the location and the weights must be estimated.
4.2 Gradient Methods

Lindsay (1983) showed that the optimization problem reduces to one of maximizing a concave function over a convex set. Gradient-based algorithms are ideally suited to maximization problems of this form. The gradient of the loglikelihood at $G$ towards $\delta(\xi)$, a point mass at $\xi$, is
$$D(G;\xi) = \sum_i \left[\frac{I(x_i=\xi)}{g(\xi)} - 1\right] + \sum_j \left[\frac{h_j(\xi)}{L_j(G)} - 1\right]. \quad (8)$$
The algorithms are based on the following characterization due to Lindsay (1983). The mixing distribution $\hat G$ maximizes the likelihood if and only if

(i) $D(\hat G;\xi) \le 0$ for all $\xi$;
(ii) $D(\hat G;\xi) = 0$ on the support of $\hat G$.

This result is the basis of the gradient methods. We present a simple, but somewhat inefficient, algorithm known as the VDM (Wynn 1970, Fedorov 1972). At the $m$'th step let $G^{(m)}$ be the current estimator. At each step, find $\xi_m$ to maximize $D(G^{(m)};\xi)$; then find $\epsilon_m$ to maximize the likelihood evaluated at $(1-\epsilon)G^{(m)} + \epsilon\,\delta(\xi_m)$; set $G^{(m+1)} = (1-\epsilon_m)G^{(m)} + \epsilon_m\,\delta(\xi_m)$. Iterate until condition (i) is virtually satisfied. Other more efficient versions of this sort of algorithm have been developed; see Wu (1978a,b), Boehning (1985) and Lesperance and Kalbfleisch (1992).
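A self-contained VDM sketch, on reduced data only and with hypothetical model values, follows. At each step mass is mixed toward the grid point with the largest gradient; because $\epsilon = 0$ is included in the line search, the loglikelihood can never decrease:

```python
import math

# Hypothetical model pieces (logistic kappa, normal error density, toy data)
b0, b1, sigma = -1.0, 0.8, 0.5
kappa = lambda x: 1 / (1 + math.exp(-b0 - b1 * x))

def h(w, d, x):                        # term (6) with a N(x, sigma^2) error model
    dens = math.exp(-0.5 * ((w - x) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    return kappa(x) ** d * (1 - kappa(x)) ** (1 - d) * dens

reduced = [(0.8, 1), (-1.1, 0), (0.2, 0), (1.4, 1)]   # (w_j, d_j)
grid = [i * 0.25 for i in range(-8, 9)]               # candidate support points

def L(g, w, d):                        # mixture likelihood L_j(G)
    return sum(gk * h(w, d, xk) for gk, xk in zip(g, grid))

def grad(g, k):                        # gradient (8) toward delta(xi_k), reduced part
    return sum(h(w, d, grid[k]) / L(g, w, d) - 1 for w, d in reduced)

def loglik(g):
    return sum(math.log(L(g, w, d)) for w, d in reduced)

g = [1 / len(grid)] * len(grid)
lls = [loglik(g)]
for _ in range(100):
    k = max(range(len(grid)), key=lambda kk: grad(g, kk))
    # crude grid line search over the step size epsilon (0 is allowed)
    eps = max((e / 50 for e in range(51)), key=lambda e: loglik(
        [(1 - e) * gi + (e if i == k else 0.0) for i, gi in enumerate(g)]))
    g = [(1 - eps) * gi + (eps if i == k else 0.0) for i, gi in enumerate(g)]
    lls.append(loglik(g))

assert abs(sum(g) - 1) < 1e-9                           # still a distribution
assert all(b >= a - 1e-9 for a, b in zip(lls, lls[1:]))  # monotone ascent
```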
5 ADDITIONAL COVARIATES MEASURED WITHOUT ERROR

Frequently, additional covariates measured without error will be available. For instance, indicator variables such as sex and smoking status may have been recorded. In this section we extend our methodology to incorporate this extra information. Let $Z$ denote an arbitrary set of covariables measured without error. Model the probability of disease, given all the covariates, as a prospective logistic regression model, where
$$\Pr(D=1\mid X=\xi, Z=z) = \{1+\exp(-\beta_0-\beta_1^t\xi-\beta_2^t z)\}^{-1} \equiv \kappa(\xi,z).$$
Mimicking the development of the likelihood given in (5)-(7), let
$$h_j(\xi) = \kappa^{d_j}(\xi,z_j)\,\bar\kappa^{1-d_j}(\xi,z_j)\,f_{W|X,Z,D}(w_j\mid\xi,z_j,d_j). \quad (9)$$
Let
$$dG_{X,Z}(x,z) = dF_Z(z)\,dG_{X|Z}(x|z) = dG_X(x)\,dF_{Z|X}(z|x)$$
represent the joint distribution of $(X,Z)$ in the case-control population. The contributions to the likelihood from reduced and complete observations are then
$$L_j(\theta) = \sum_z \int h_j(\xi;\alpha,\beta)\,dG_{X,Z}(\xi,z) \quad (10)$$
and
$$L_i(\theta) = \sum_z \int h_i(\xi;\alpha,\beta)\,I(x_i=\xi)\,dG_{X,Z}(\xi,z), \quad (11)$$
respectively.

We now arrive at a curious situation where m.l. procedures for estimating the joint distribution $G_{X,Z}$ will break down. To simplify the issues, consider the case where we observe variables $(W,Z)$, where $W$ is $X$ observed with error, but $Z$ is observed directly. Suppose the distribution of $W$ given $X=x$ is symmetric and unimodal about $x$. If the observed $Z_i$ are all distinct, then by writing $G_{X,Z} = G_Z\,G_{X|Z}$ one can show that the m.l. estimator for the joint distribution $G_{X,Z}$ is $n^{-1}\sum_i \delta(w_i,z_i)$, where $\delta$ is the point mass distribution. (This follows because $w_i$ is the value of $x$ maximizing the error density $f_{W|X}(w_i|x)$.) Thus the m.l. estimator converges to $G_{W,Z}$, not $G_{X,Z}$. Curiously, if the variables are both observed without error, or both with error, m.l. produces appropriate estimates. The problem seems to be that the sharpness of the $Z$ observations prevents the pooling together of information over $i$ that is necessary to obtain good estimates of the conditional distribution of $X$ given $Z$. For example, if $Z$ is discrete with a finite number of values, then this problem does not arise, as then the conditional distribution for each $Z=z$ can be consistently estimated, and will simply be the mixture estimator of $G_X$ that one obtains from the set of $W$'s that have $Z=z$.
With this in mind, we must consider alternative strategies for this case. One possibility that retains the nonparametric flavor of our approach and that has the same mathematical structure would be to model $G_{X,Z}$ as a discrete mixture of continuous densities, such as $\sum_j \pi_j N(\mu_j, h^2 I)$, where $h$ would have to be chosen so as to balance the criteria of providing a flexible family of distributions ($h$ small) and providing sufficient smoothing to alleviate the above problem ($h$ large). Another strategy would be to attempt to model the $Z$ given $X$ distribution. If $f_{Z|X}$ were known, then for the reduced observations (10) becomes
$$L_j = \int h_j(\xi;\alpha,\beta)\,f_{Z|X}(z_j\mid\xi)\,dG_X(\xi)$$
and the problem is of the same form as that solved previously. Furthermore, if $f_{Z|X}$ is known to follow some parametric form depending on $\gamma$, and $W\mid X,D,Z$ depends on $\alpha$, then the results of the previous sections apply with parametric component $(\alpha,\beta,\gamma)$. The drawback of this approach is that if the form of $Z\mid X$ is unknown, the results will depend on parametric modeling assumptions which may not be easy to verify. This problem is especially difficult if $Z$ is multivariate with discrete and continuous components. As mentioned before, if many distinct $z$'s were observed, we cannot simply maximize the likelihood over $G_{X|Z}$ for each $z$; however, if $Z$ were a unidimensional categorical variable, then it would be feasible to obtain a nonparametric m.l. estimator for $G_{X|Z}$ for each distinct value of $z$. The algorithm developed below aims to reduce the problem of general covariates to this simple case.

1. Assume that the distribution of $X$ given $Z$ and $D$ depends only on $D$ and the linear combination $T = \gamma^t Z$. Note that we assume the linear combination itself is not a function of $D$. For example, this assumption is satisfied if $X\mid Z,D$ follows a normal distribution with mean $b_0 + b_1 I(D=1) + \gamma^t z$ and variance $\sigma^2$. There are various nonparametric and semiparametric techniques for estimating $\gamma$, for example sliced inverse regression (Li 1991; Duan & Li 1991).

2. Upon obtaining an estimate $\hat\gamma$, form $\hat T = \hat\gamma^t Z$ and categorize $\hat T$ into a finite number of categories labelled $t_1, t_2, \dots, t_L$.

3. Maximize the likelihood as a function of $(\alpha,\beta)$ and $G(\cdot\mid t_\ell),\ \ell=1,\dots,L$.

Conditioning on $Z$ has the advantage of requiring fewer parametric modeling assumptions; however, the effect (if any) due to estimating $\gamma$ and discretizing $\hat T$ will be difficult to ascertain.
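The preprocessing in steps 1 and 2 can be sketched as follows. Ordinary least squares of $X$ on $Z$ stands in for sliced inverse regression (it is consistent for the direction under the normal model in step 1); the data values, dimensions and the choice $L=3$ are all hypothetical:

```python
import random

# Simulate the normal model of step 1 (illustrative values only)
random.seed(3)
gamma = [1.0, -0.5]
data = []
for _ in range(300):
    z = [random.gauss(0, 1), random.gauss(0, 1)]
    t = sum(g * zi for g, zi in zip(gamma, z))
    x = 0.2 + t + random.gauss(0, 0.3)          # X | Z ~ N(b0 + gamma'z, s^2)
    data.append((x, z))

def ols(data):
    # OLS of X on (1, Z): solve the 3x3 normal equations by elimination
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for x, z in data:
        row = [1.0] + z
        for i in range(3):
            b[i] += row[i] * x
            for j in range(3):
                A[i][j] += row[i] * row[j]
    for i in range(3):                           # forward elimination
        for j in range(i + 1, 3):
            f = A[j][i] / A[i][i]
            for k in range(3):
                A[j][k] -= f * A[i][k]
            b[j] -= f * b[i]
    coef = [0.0] * 3
    for i in (2, 1, 0):                          # back-substitution
        coef[i] = (b[i] - sum(A[i][j] * coef[j] for j in range(i + 1, 3))) / A[i][i]
    return coef

g_hat = ols(data)[1:]                            # step 1: estimate of gamma
t_hat = sorted(sum(g * zi for g, zi in zip(g_hat, z)) for _, z in data)
cuts = [t_hat[100], t_hat[200]]                  # step 2: tertile cutpoints, L = 3
```

Step 3 then fits a separate mixing distribution $G(\cdot\mid t_\ell)$ within each of the three categories, exactly as in the unidimensional categorical case.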
6 NO COMPLETE DATA OBSERVED

Consider a situation in which only $W$, $D$ and possibly $Z$ have been measured. Furthermore, assume that although no validation sample was observed from the case-control population, an independent data set is available in which both $X$ and $W$ have been measured. Provided the measurement error distribution is nondifferential (does not depend on $D$), it may be reasonable to assume that this distribution is transportable. The idea is to use these data to obtain an estimate of the distribution of $W\mid X$ in the case-control population. The assumption of nondifferential error is necessary because otherwise $f_{W|X}$ is a function of the study population. In particular,
$$f_{W|X}(w|x) = f_{W|X,D}(w|x,D=0)\,f_D(0) + f_{W|X,D}(w|x,D=1)\,f_D(1),$$
which depends on the preponderance of diseased individuals in the population unless $f_{W|X,D} = f_{W|X}$. Contrast this scenario to the situation encountered when using methods which require a model for $X\mid W$ among controls (for example, Satten & Kupper 1993). Clearly, by the same argument, the distribution of $X\mid W$ is not transportable unless the effect is null; hence such methods cannot be applied in this situation. Moreover, the pseudolikelihood method proposed by Carroll et al. (1993) is clearly not applicable. On the other hand, the profile likelihood methods presented so far are just as applicable in this situation as they are in the usual validation study, provided the measurement error is nondifferential. If other covariables have been measured without error, then some further restrictions must be imposed. Assuming that the distribution of $W\mid X,D,Z$ does not depend on $(Z,D)$, then as in the previous section,
$$f_{W|Z,D}(w|z,d) = \int f_{W|X,Z,D}(w|x,z,d)\,f_{X|Z,D}(x|z,d)\,dx = \int f_{W|X}(w|x)\,f_{X|Z,D}(x|z,d)\,dx$$
depends on $Z$ only through $\gamma^t Z$. Hence we can use the observed $(W,Z)$ data to estimate $\gamma$, and then implement the approach presented previously. The only other literature that allows for consistent estimation when there is no validation and the error model is estimated independently is the paper by Stefanski & Carroll (1987), who assume normally distributed measurement error. Our results allow for any error distribution.
7 A SIMULATION EXPERIMENT

To examine the performance of our estimator we simulated data with measurement error under two scenarios: with and without a validation study. We chose a lognormal distribution for both $X$ and $W\mid X,D$ because this distribution often arises in practice (e.g., Nero, Schwehr & Nazaroff 1986). Distributions and parameter values were chosen to be similar to, or identical with, those simulated by Carroll et al. (1993). Data were generated prospectively and then a case-control sample was obtained from this simulated source sample. The prospective logistic model was given by $\Pr(D=1\mid X=x) = \kappa(x)$, with $\beta_0 = -3.09$ and $\beta_1 = 0.5$. The true covariate, $X$, was generated as a lognormal random variable so that $\log(X)$ had mean $-\sigma_x^2/2$ and variance $\sigma_x^2$, where $\sigma_x^2 = 1.08$. The surrogate predictor was also lognormal, with $\log(W)$, given $(X,D)$, having mean $\log(x)$ and standard deviation $\sigma$, which varied between 0.25 and 0.75. We repeated each experiment 50 times.

The general form of the measurement error model is $\log(W) = \alpha_0 + \alpha_1\log(X) + \sigma\epsilon$, with $\epsilon \sim N(0,1)$. In addition to the unknown mixing distribution(s), the model has 5 free parameters, $(\beta_0,\beta_1,\alpha_0,\alpha_1,\sigma)$; hence we call this the 5-parameter model. In the following subsection we assume that $(\alpha_0,\alpha_1,\sigma)$ are known. This leaves only two unknowns, $(\beta_0,\beta_1)$; hence we call this the 2-parameter model.
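The generating mechanism above can be sketched directly; the code uses the stated values $\beta_0=-3.09$, $\beta_1=0.5$, $\sigma_x^2=1.08$ and a middle error level $\sigma=0.50$, while the seed and sample sizes are arbitrary choices for illustration:

```python
import math, random

# Section 7 design: lognormal X, prospective logistic disease model, and a
# lognormal surrogate log W | (X, D) ~ N(log X, sigma^2).
random.seed(7)
b0, b1 = -3.09, 0.5
var_x = 1.08                      # variance of log(X)
sigma = 0.50                      # measurement error s.d. (0.25 to 0.75 in the paper)

def draw():
    x = math.exp(random.gauss(-var_x / 2, math.sqrt(var_x)))
    d = 1 if random.random() < 1 / (1 + math.exp(-b0 - b1 * x)) else 0
    w = math.exp(random.gauss(math.log(x), sigma))
    return x, w, d

# Generate prospectively, then retain a case-control sample of 40 cases and
# 40 controls (the half-cases, half-controls n = 80 design); for brevity the
# first 40 of each are kept rather than a random subsample.
source = [draw() for _ in range(50000)]
cases = [r for r in source if r[2] == 1][:40]
controls = [r for r in source if r[2] == 0][:40]
sample = cases + controls
assert len(sample) == 80 and sum(d for *_, d in sample) == 40
```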
7.1 Known Measurement Error

In this subsection we assume that the measurement error distribution is known and examine the performance of the 2-parameter model with a small, medium and large amount of measurement error ($\sigma = 0.25, 0.50, 0.75$). Samples $\{X_i, W_i, D_i;\ i=1,\dots,n\}$ of size 80 and 240 were generated, each consisting of half cases and half controls. First, using $\{X_i, D_i;\ i=1,\dots,n\}$, we estimated $\beta$ using the prospective logistic model (equivalent to $\sigma = 0$). Next, omitting the gold standard variables $X_i$ and using only $\{W_i, D_i;\ i=1,\dots,n\}$, we estimated $(\beta_0,\beta_1,G)$ using the mixture method (the MIX method). Results are presented in Table 1. Notice that as the variance of the measurement error distribution increased, the MSE of the MIX method increased rapidly. In fact, when $\sigma = 1$, the reduced data have almost no information remaining relative to when $\sigma = 0$. (The results for $\sigma = 1$ are not presented because the algorithm failed to converge for numerous runs.) As the sample size increased, the MSE decreased substantially, as would be expected. Of greater interest is the increase in the disparity between the performance with and without measurement error. This is especially apparent for models with larger measurement error.
Next, we simulate data from a validation study. To see the effect of an increasing amount of complete data, first 1/8 and then 1/4 of the data were considered as complete. The performance of the MIX method is compared with the pseudolikelihood (PL) method proposed by Carroll et al. (1993) in Table 2. We found that if a substantial proportion of the data are complete (1/4 or more), then the PL and MIX methods perform similarly. When only 1/8 of the data are complete, the pattern of results is markedly different. The MSE of both methods increased dramatically; however, the MIX method performed much better than the PL method for this scenario. The difference in performance was greatest when the variance of the measurement error was greatest.
7.2 Unknown Measurement Error

In this subsection we assume that the measurement error distribution is known to be lognormal, but the three parameters of the lognormal model are unknown. This puts us in the framework of the 5-parameter model. To compare the MIX method with the PL method we simulated validation data sets with two different sample sizes and three different measurement errors: $n_{C0} = n_{C1} = 30$, $n_{R0} = n_{R1} = 30$; $n_{C0} = n_{C1} = 30$, $n_{R0} = n_{R1} = 90$; and $\sigma = 0.25, 0.50, 0.75$.
14
1
2
did not signi cantly reduce the eciency of the method. Meanwhile it provided a substantial increase in the robustness of the model. Other methods demonstrated a serious bias in the estimates of the when the nondierential error was ignored. We found that the 7-parameter MIX method performed similarly to, but slightly better than, the 7-parameter PL method (not reported). In toto, these results suggest that it is worthwhile to use a richer measurement error model unless there is ample evidence that the measurement error is nondierential. 1
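A minimal simulation sketch of a differential lognormal surrogate model, in which log(W) depends linearly on log(X) with extra terms in the disease indicator D. All parameter values and names below (a0, a1, t1, t2, sigma) are hypothetical choices for illustration, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def surrogate(x, d, a0=0.0, a1=1.0, t1=0.1, t2=0.05, sigma=0.25):
    """Draw W given (X, D) from a differential lognormal error model:
    log(W) = a0 + a1*log(X) + t1*D + t2*D*log(X) + N(0, sigma^2).
    Setting t1 = t2 = 0 recovers a nondifferential 3-parameter model."""
    logw = a0 + a1 * np.log(x) + t1 * d + t2 * d * np.log(x)
    return np.exp(logw + rng.normal(0.0, sigma, size=np.shape(x)))

x = rng.lognormal(mean=0.5, sigma=0.3, size=1000)   # true covariate
d = rng.integers(0, 2, size=1000)                    # case-control status
w = surrogate(x, d)                                  # observed surrogate
```

The two extra parameters are the coefficients on the terms involving d; with them set to zero, W carries the same information about X for cases and controls.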
7.3 Profile Likelihoods

There is currently no standard methodology for computing the variance of a profile maximum likelihood estimate in a semiparametric mixture model. Here we let η denote all nuisance parameters, including the mixing distribution, while β₁ denotes the parameter of interest. For a real-valued parameter of interest, Lesperance (1989) proposed inverting the semiparametric generalized likelihood ratio test,

r(β₁) = 2 log [ sup_{β₁,η} L(β₁, η) / sup_{η} L(β₁, η) ],

to obtain a (1 − a)100% profile confidence interval {β₁ : r(β₁) < χ²₁(a)}. She observed good coverage properties when using this method for a number of standard mixture problems. Based on her extensive simulations, we recommend this approach for finding a confidence interval for β₁.
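To make the inversion concrete, here is a toy sketch in a fully parametric setting (a normal mean, with the variance as the profiled-out nuisance parameter) rather than the semiparametric mixture model itself; the same bisection idea applies once the profile likelihood over the mixing distribution can be evaluated. The data and all numerical choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(0.7, 1.0, size=50)   # illustrative data
n = len(y)
mle = y.mean()

def r(mu):
    """Likelihood-ratio statistic: twice the log ratio of the unrestricted
    maximum to the profile maximum at mu (variance profiled out in
    closed form as the mean squared deviation about mu)."""
    s2_mu = np.mean((y - mu) ** 2)
    s2_hat = np.mean((y - mle) ** 2)
    return n * (np.log(s2_mu) - np.log(s2_hat))

CHI2_95_DF1 = 3.841  # chi-square 0.95 quantile, 1 degree of freedom

def solve(a, b, tol=1e-8):
    """Bisection for the point where r crosses the chi-square cutoff;
    assumes r - cutoff changes sign on [a, b]."""
    fa = r(a) - CHI2_95_DF1
    while b - a > tol:
        m = 0.5 * (a + b)
        fm = r(m) - CHI2_95_DF1
        if (fa < 0) == (fm < 0):
            a, fa = m, fm
        else:
            b = m
    return 0.5 * (a + b)

lower = solve(mle - 5.0, mle)   # left endpoint of the 95% profile CI
upper = solve(mle, mle + 5.0)   # right endpoint
```

The confidence set is the interval of parameter values whose likelihood-ratio statistic falls below the chi-square cutoff, exactly as in the display above.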
8 A Cholesterol Study

In this example we analyze a data set concerning the risk of coronary heart disease (CHD) as a function of blood cholesterol level. These data were extracted from the Lipids Research Clinics study, which was previously discussed by Satten & Kupper (1993). We use a portion of these data involving men aged 60-70 who do not smoke (256 records; four outliers were removed). A subject is recorded as having CHD (D = 1) if he has had a previous heart attack, an abnormal exercise electrocardiogram, a history of angina pectoris, and so forth. The measured covariables are low density lipoprotein (LDL) cholesterol level and total cholesterol (TC) level. Direct measurement of LDL levels is time-consuming and requires costly special equipment. For this reason we are interested in whether TC serves as a useful surrogate for LDL. Note that the measurement error of TC is not the source of error that is of primary interest; rather, it is the unknown quantity of the other components of TC (triglycerides and high density lipoproteins) that leads to the "measurement error". Henceforth CHD, LDL/100 and TC/100 play the roles of D, X and W, respectively.
In this data set both X and W have been recorded for each subject. In the full data set there are 113 cases, of which 47 had LDL levels higher than 160. Among the 143 controls, 43 had elevated LDL levels. See Figure 1 for boxplots of the data. Using X as the predictor, the prospective logistic regression estimate for β₁ was .656 with a standard error of .336. Contrast this with the attenuated estimate (.540) obtained when measurement error was ignored and W was used as the predictor.
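The attenuation effect can be reproduced in a small simulation. The sample size and all parameter values below are hypothetical, chosen only to make the shrinkage visible; this is not a reanalysis of the LRC data.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20000
x = rng.normal(1.5, 0.5, size=n)             # true covariate, e.g. LDL/100
w = x + rng.normal(0.0, 0.5, size=n)         # surrogate with additive error
p = 1.0 / (1.0 + np.exp(-(-2.0 + 1.0 * x)))  # prospective logistic model
d = rng.binomial(1, p)                       # disease indicator

def logit_slope(z, d, iters=25):
    """Slope from a two-parameter logistic regression of d on z,
    fit by Newton-Raphson."""
    X = np.column_stack([np.ones_like(z), z])
    beta = np.zeros(2)
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (d - mu)
        hess = (X * (mu * (1 - mu))[:, None]).T @ X
        beta = beta + np.linalg.solve(hess, grad)
    return beta[1]

b_x = logit_slope(x, d)   # near the true slope of 1.0
b_w = logit_slope(w, d)   # shrunk toward zero
```

Regressing on the error-prone W biases the slope toward zero, which is the same qualitative pattern as the drop from .656 to .540 observed in the data.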
A 5-parameter lognormal measurement error model provided a good fit to the data (Figure 2), with the exception of a slight increase in the variance of W|X for small values of X. The 7-parameter differential measurement error model fit significantly better, but did not change the parameters enough to have a practical impact on the estimation procedure. Consequently, the measurement error was modeled using a nondifferential error model. To illustrate the information present in the reduced data, we analyzed a sample of data with and without the reduced observations. From the 113 cases and 143 controls, 32 cases and 40 controls were randomly selected to serve as complete data. The remaining observations were treated as reduced observations. Using only the complete data we obtained β̂₁ = .943 with a standard error of .62. Figure 3 illustrates the profile likelihood for the 5-parameter model when both complete and reduced data are used: β̂₁ = .765. Notice the dramatic increase in the precision of the estimate when both complete and reduced data are used.
ACKNOWLEDGEMENT We are grateful to D. Bennett for providing the code for the simulations. Roeder's research was supported by the National Science Foundation, as was Lindsay's. Carroll's research was supported by a grant from the National Cancer Institute.
REFERENCES

Armstrong, B., Howe, G. & Whittemore, A. S. (1988). Correcting for measurement error in a nutrition study. Statistics in Medicine, 8, 1151-1163.

Boehning, D. (1985). Numerical estimation of a probability measure. Journal of Statistical Planning and Inference, 11, 57-69.

Buonaccorsi, J. P. (1990). Double sampling for exact values in the normal discriminant model with application to binary regression. Communications in Statistics, Series A, 19, 4569-4586.

Carroll, R. J., Gail, M. H. & Lubin, J. H. (1993). Case-control studies with errors in predictors. Journal of the American Statistical Association, 88, 177-191.

Duan, N. & Li, K. C. (1991). Slicing regression: a link-free regression method. Annals of Statistics, 19, 505-530.

Fedorov, V. V. (1972). Theory of Optimal Experiments. Academic Press, New York.

Kiefer, J. & Wolfowitz, J. (1956). Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. The Annals of Mathematical Statistics, 27, 886-906.

Lesperance, M. L. (1989). Mixture models as applied to models involving many incidental parameters. Unpublished Ph.D. dissertation, Department of Statistics and Actuarial Science, University of Waterloo.

Lesperance, M. L. & Kalbfleisch, J. D. (1992). An algorithm for computing the nonparametric MLE of a mixing distribution. Journal of the American Statistical Association, 87, 120-126.

Li, K. C. (1991). Sliced inverse regression for dimension reduction (with discussion). Journal of the American Statistical Association, 86, 316-342.

Lindsay, B. G. (1983). The geometry of mixture likelihoods, Part I: A general theory. Annals of Statistics, 11, 86-94.

Lindsay, B. G., Clogg, C. C. & Grego, J. (1991). Semiparametric estimation in the Rasch model and related exponential response models, including a simple latent class model for item analysis. Journal of the American Statistical Association, 86, 96-107.

Little, R. J. A. & Rubin, D. B. (1987). Statistical Analysis with Missing Data. John Wiley & Sons, New York.

Nero, A. V., Schwehr, M. B. & Nazaroff, W. W. (1986). Distribution of airborne Radon-222 concentrations in U.S. homes. Science, 234, 992-997.

Prentice, R. L. & Pyke, R. (1979). Logistic disease incidence models and case-control studies. Biometrika, 66, 403-411.

Satten, G. A. & Kupper, L. L. (1993). Inferences about exposure-disease associations using probability-of-exposure information. Journal of the American Statistical Association, 88, 200-208.

Stefanski, L. A. & Carroll, R. J. (1987). Conditional scores and optimal scores for generalized linear measurement-error models. Biometrika, 74, 703-716.

Titterington, D. M., Smith, A. F. M. & Makov, U. E. (1985). Statistical Analysis of Finite Mixture Distributions. John Wiley & Sons, New York.

Wu, C. F. J. (1978a). Some algorithmic aspects of the theory of optimal designs. Annals of Statistics, 6, 1286-1301.

Wu, C. F. J. (1978b). Some iterative procedures for generating nonsingular optimal designs. Communications in Statistics, Part A: Theory and Methods, 7, 1399-1412.

Wynn, H. P. (1970). The sequential generation of D-optimum experimental designs. The Annals of Mathematical Statistics, 41, 1655-1664.
Figure Captions

Figure 1. Boxplots of LDL cholesterol (LDL) and total cholesterol (TC) for individuals with (D = 1) and without (D = 0) heart disease.

Figure 2. Relationship between total cholesterol (TC) and LDL cholesterol (LDL).

Figure 3. Profile likelihood of β₁ for the cholesterol data. The horizontal bar marks the 95% profile confidence interval.
                        σ = 0   σ = .25   σ = .50   σ = .75
n_R0 = n_R1 = 40
          Mean           .53      .54       .57       .59
          MSE            .06      .08       .11       .16
          RMSE          1.00     1.28      1.82      2.94
n_R0 = n_R1 = 120
          Mean           .51      .50       .53       .55
          MSE            .01      .02       .03       .05
          RMSE          1.00     1.28      2.00      3.33

Table 1: Known Measurement Error Model, with No Complete Data, β₁ = .5. Mean and MSE of β̂₁ are calculated based on fifty repetitions of the simulation experiment. RMSE is the ratio of the MSE of the model of interest to the MSE of the model with σ = 0.
                                      σ = .25        σ = .50        σ = .75
n_C0  n_C1  n_R0  n_R1               PL    MIX      PL    MIX      PL    MIX
 30    30    90    90    Mean       .496  .496     .500  .502     .512  .505
                         MSE        .016  .016     .021  .020     .038  .032
                         RMSE       1.00  1.00     1.00  1.05     1.00  1.19
 15    15   105   105    Mean       .512  .501     .568  .515     .611  .522
                         MSE        .040  .017     .089  .033     .143  .038
                         RMSE       1.00  2.35     1.00  2.70     1.00  3.76

Table 2: Known Measurement Error Model, with Some Complete Data, β₁ = .5. Mean and MSE of β̂₁ are calculated based on fifty repetitions of the simulation experiment. RMSE is the ratio of the MSE of the PL estimator to the MSE of the MIX estimator.
                                      σ = .25        σ = .50        σ = .75
n_C0  n_C1  n_R0  n_R1               PL    MIX      PL    MIX      PL    MIX
 30    30    30    30    Mean       .501  .507     .500  .503     .518  .522
                         MSE        .038  .034     .046  .045     .063  .060
                         RMSE       1.00  1.12     1.00  1.15     1.00  1.05
 30    30    90    90    Mean       .481  .478     .471  .468     .493  .489
                         MSE        .025  .022     .034  .027     .067  .048
                         RMSE       1.00  1.14     1.00  1.26     1.00  1.40

Table 3: Unknown Measurement Error Model, with Some Complete Data, β₁ = .5. Mean and MSE of β̂₁ are calculated based on fifty repetitions of the simulation experiment. RMSE is the ratio of the MSE of the PL estimator to the MSE of the MIX estimator.