Linear Regression with Errors in Both Variables:
Errors-in-Variables Models, Instrumental Variables, Factor Analysis
Graham Dunn, Biostatistics Group, School of Community Based Medicine
[email protected]
Linear Regression • First we have another look at regression when there is no error in the exposure measurements (X). • Then we have a look at the effect of nondifferential random error in X.
Linear regression: the effect of a quantitative exposure on a quantitative outcome
Let’s first assume that exposure, X, is measured without error. We are interested in predicting the outcome, Y, from a knowledge of exposure, i.e. the relationship
E(Y) = α + βX
How do we estimate α and β?
Linear Regression
E(Y) = α + βX
Y = α + βX + δ
Var(X) = σX²
Var(Y) = β²σX² + σδ²
Cov(Y,X) = Cov(α,X) + Cov(βX,X) + Cov(δ,X) = Cov(βX,X) = βVar(X) = βσX²
Therefore β = Cov(Y,X)/Var(X), and estimates of both α and σδ² follow (c.f. OLS).
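As a quick illustration, the moment-based estimate can be computed directly and compared with ordinary least squares. A minimal Stata sketch with simulated data; all variable names and parameter values are illustrative:

    * simulate an error-free exposure and an outcome
    clear
    set seed 1234
    set obs 1000
    generate x = rnormal(0, 2)
    generate y = 1 + 0.5*x + rnormal(0, 1)
    * beta = Cov(Y,X)/Var(X), computed from the sample moments
    quietly correlate y x, covariance
    matrix C = r(C)
    display "moment estimate of beta = " C[2,1]/C[2,2]
    * compare with OLS
    regress y x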
Identifiability
• All of the parameters in this model are identifiable.
• That is, we can use first- and second-order moments (means, variances and covariances) to derive unique estimates for all of the model’s parameters.
Errors in Exposure Measurements
We are now interested in E(Y) = α + βX, but have measured, say, W = X + U.
If we estimate β from Cov(Y,W)/Var(W) then our estimate will be biased: under nondifferential error it converges to βσx²/(σx² + σu²), i.e. β attenuated towards zero by the reliability of W.
A More Realistic Model
W = X + U
Y = α + βX + V
where V is a random deviation from the model (often called ‘error’ but not necessarily measurement error), with E(V) = 0.
Assuming Cov(X,V) = Cov(X,U) = Cov(U,V) = 0,
Var(W) = Var(X) + Var(U)
Var(Y) = β²Var(X) + Var(V)
Cov(W,Y) = βVar(X)
Taking expectations over all subjects/items, E(W) = E(X) = μ and E(Y) = α + βμ.
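The attenuation is easy to see by simulation. A hedged sketch (values are illustrative): the true β is 0.5 and the reliability of W is 0.5, so the naive slope should settle near 0.25:

    * W = X + U, Y = alpha + beta*X + V
    clear
    set seed 4321
    set obs 2000
    generate x = rnormal(0, 1)        // true exposure, Var(X) = 1
    generate w = x + rnormal(0, 1)    // Var(U) = 1, so reliability = 0.5
    generate y = 1 + 0.5*x + rnormal(0, 1)
    regress y w    // slope attenuated towards 0.5 x 0.5 = 0.25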
The Parameters of the Realistic Model are Not Identified
There are more parameters to be estimated (six: μ, α, β, σx², σu² and σv²) than there are moments (five: E(W), E(Y), Var(W), Var(Y) and Cov(W,Y)). There is no way to equate observed and expected moments for the above model and obtain unique solutions. The model is under-identified.
What can we do? Collect more data, etc., or make further assumptions.
Alternative Identifying Assumptions
• The precision of W (equivalent to σu²) is known.
• The reliability of W (σx²/σw²) is known.
• The intercept, α, is known (it might be zero, for example), provided μ ≠ 0.
We’ll try each of these three in turn. Also (not straightforward):
• The ratio λ = σu²/σv² is known (Orthogonal or Deming’s regression).
Precision of W known
Var(W) = σx² + σu²
Var(Y) = β²σx² + σv²
Cov(W,Y) = βσx²
We are aiming for β = Cov(W,Y)/σx² = Cov(X,Y)/σx².
Estimates:
σx² = Var(W) − σu²
β = Cov(W,Y)/(Var(W) − σu²)
α = Mean(Y) − β·Mean(W)
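A sketch of this correction in Stata, continuing with the simulated data above, where σu² = 1 is the known error variance by construction (the scalar names are illustrative):

    quietly correlate y w, covariance
    matrix C = r(C)
    scalar sigma2_u = 1                          // assumed-known Var(U)
    scalar beta = C[2,1]/(C[2,2] - sigma2_u)     // Cov(W,Y)/(Var(W) - Var(U))
    quietly summarize w
    scalar wbar = r(mean)
    quietly summarize y
    scalar alpha = r(mean) - beta*wbar
    display "beta = " beta "   alpha = " alpha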
Reliability of W (κ) known (Is this likely? Population dependent)
Var(W) = σx² + σu²
Var(Y) = β²σx² + σv²
Cov(W,Y) = βσx²;  κ = σx²/σw²
We are aiming for β = Cov(W,Y)/σx² = Cov(X,Y)/σx².
Estimates:
σx² = κVar(W)
β = Cov(W,Y)/κVar(W)
α = Mean(Y) − β·Mean(W)
Intercept, α = 0 (Might be known in some circumstances – but not very likely)
Var(W) = σx² + σu²
Var(Y) = β²σx² + σv²
Cov(W,Y) = βσx²
Mean(W) = μ
Mean(Y) = βμ
Estimate: β = Mean(Y)/Mean(W) (hence the earlier requirement that μ ≠ 0).
Deming’s Regression: λ = σu²/σv²
Estimator first derived by Kummell (1879). This method is also called Orthogonal Regression.
How can we make a valid/intelligent assumption concerning λ? The random term V is not just measurement error; it also contains equation error (i.e. departure from model fit).
Deming’s Regression: λ = σu²/σv²
Under multivariate normality, the ML (and moments) estimate of β is given by
β = [λSYY − SWW + √{(λSYY − SWW)² + 4λSWY²}] / (2λSWY)
where SWW, SYY and SWY are the sample variances of W and Y and their covariance.
I will not try to derive this … and you do not need to remember it!
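A sketch of the estimator in Stata, continuing with the simulated data above (there Var(U) = Var(V) = 1, so the assumption λ = 1 is correct by construction):

    scalar lambda = 1    // assumed ratio sigma_u^2 / sigma_v^2
    quietly correlate y w, covariance
    matrix C = r(C)
    scalar Syy = C[1,1]
    scalar Swy = C[2,1]
    scalar Sww = C[2,2]
    scalar beta = (lambda*Syy - Sww + sqrt((lambda*Syy - Sww)^2 + 4*lambda*Swy^2)) / (2*lambda*Swy)
    display "Deming estimate of beta = " beta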
Errors in Variables Regression (Stata command: eivreg)
Example with the reliability of W assumed known to be 0.5:
eivreg y w, r(w 0.5)
Problems: (1) the reliability is usually only estimated (and, then, probably not very precisely) and (2) the reliability is assumed to be transportable (i.e. the populations are comparable).
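Applied to the simulated data above (true reliability 0.5), eivreg undoes the attenuation; a sketch:

    regress y w             // naive estimate, roughly 0.25
    eivreg y w, r(w 0.5)    // corrected estimate, roughly 0.5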
Collecting More Information: Replicates of W
Let’s assume we can collect replicate measurements of W. Assume that the errors in the replicate Ws are uncorrelated (question this assumption in practice). With duplicate Ws:
W1 = X + U1
W2 = X + U2
Y = α + βX + V
Replicate Ws
With the usual assumptions,
Var(W1) = Var(X) + Var(U1)
Var(W2) = Var(X) + Var(U2)
Var(Y) = β²Var(X) + Var(V)
Cov(W1,Y) = βVar(X)
Cov(W2,Y) = βVar(X)
Replicate Ws (contd.)
If Var(U1) = Var(U2) = Var(U), then
Var(W1 − W2) = Var(W1) + Var(W2) − 2Cov(W1,W2) = 2Var(U)
so σu² = Var(W1 − W2)/2.
Replicate Ws (estimates)
Plugging in our estimate of σu² yields
Var(X) = Var(W1) − σu² = Var(W2) − σu²
and four distinct estimators:
β = Cov(W1,Y)/(Var(W1) − σu²)   (1)
β = Cov(W1,Y)/(Var(W2) − σu²)   (2)
β = Cov(W2,Y)/(Var(W2) − σu²)   (3)
β = Cov(W2,Y)/(Var(W1) − σu²)   (4)
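A sketch of these moment calculations, assuming duplicate measurements w1 and w2 and an outcome y are in memory (all names illustrative):

    generate diff = w1 - w2
    quietly summarize diff
    scalar sigma2_u = r(Var)/2                   // Var(W1 - W2)/2
    quietly correlate w1 y, covariance
    matrix C = r(C)
    scalar beta1 = C[2,1]/(C[1,1] - sigma2_u)    // estimator (1)
    display "sigma2_u = " sigma2_u "   beta = " beta1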
Over-identification
We now have four distinct estimators of the regression coefficient, β: we have more information than we need, and β (and the model itself) is said to be over-identified. (σu², on the other hand, is not over-identified.) This is not a problem, but we need a way of combining the four estimates to get a single overall estimate. Typically, we assume multivariate normality and use maximum likelihood (ML).
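In Stata (version 12 or later) the ML combination can be obtained by fitting the model as a structural equation model; a sketch, in which the capitalised X is the latent true exposure and the @1 constraints declare w1 and w2 to be replicates on the scale of X (the two error variances are left free here; they can be constrained equal if required):

    sem (X -> w1@1 w2@1 y)

The coefficient on the path from X to y is the ML estimate of β.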
The benefits of using replicates
In replicating our measurements of X (the Ws) we are able to estimate the measurement error variance (σu²) from within the study itself. It does not depend on any assumption of transportability. But we do assume that the replicates’ errors are uncorrelated.
Collecting More Information: Instrumental Variables
Let’s assume we can collect data on an additional variable, Z. Z should be correlated with W (through X) … but have no correlation with U or V; i.e. the partial correlation of Z and Y given the true exposure X is zero. Z is an instrumental variable or instrument.
Using an Instrumental Variable
W = X + U
Y = α + βX + V
Z = ζ + ηX + θ   (the last not necessarily the correct model)
Assuming Cov(U,V) = 0 (but no longer necessary), Cov(U,θ) = 0, and Cov(V,θ) = 0:
Var(W) = Var(X) + Var(U)
Var(Y) = β²Var(X) + Var(V)
Cov(W,Y) = βVar(X)
Cov(W,Z) = ηVar(X)
Cov(Y,Z) = βηVar(X)
Instrumental Variable (IV) Estimate
Cov(W,Z) = ηVar(X)
Cov(Y,Z) = βηVar(X)
β = Cov(Y,Z)/Cov(W,Z)
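The ratio can be computed directly from the sample covariances; a sketch assuming variables y, w and z are in memory:

    quietly correlate y z, covariance
    matrix A = r(C)
    quietly correlate w z, covariance
    matrix B = r(C)
    display "IV estimate of beta = " A[2,1]/B[2,1]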
Another look at an Instrument
W = X + U   (measurement model)
W = ζ + ηZ + θ   (prediction from Z)*
Y = α + βX + V   (linear model for Y)
Y = α + β(W − U) + V
Y = α + β(ζ + ηZ + θ − U) + V
Y = α + βηZ + β(ζ + θ − U) + V
Y = α + βηZ + ψ
Regression of W on Z: η = Cov(W,Z)/Var(Z)
Regression of Y on Z: ηβ = Cov(Y,Z)/Var(Z)
β = Cov(Y,Z)/Cov(W,Z), as before.
*Z does not need to be a quantitative measure – it could be binary/categorical.
Two Stage Least-Squares (2SLS)
We could regress W on Z and save the fitted values. Then we could regress Y on the fitted values for W. This would give us the instrumental variable or IV estimate (but we wouldn’t trust the standard errors, etc.). In practice, we do not do this but use 2SLS software that correctly calculates standard errors, CIs and p-values.
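The manual two-stage version, for illustration only (the variable name what is hypothetical, and the second-stage standard errors should not be trusted):

    regress w z
    predict what, xb     // fitted values of W
    regress y what       // slope reproduces the IV estimate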
ivregress in Stata 11
ivregress 2sls y (w=z)
Very simple!
An example: Occupational Epidemiology
Occupational epidemiologists are frequently concerned by the possible effects of occupational exposures on risk of illness.
“Typically in occupational epidemiology, measures of cumulative or average exposure over a period of time are needed for each worker. For reasons of practicality, measuring devices often operate only over a short period of time, yet the exposure of a worker can vary greatly within and between days. Hence, mismeasurement of exposure is common and can lead to failure to identify the true effect of exposure on the health outcome of interest. Given that occupational epidemiology aims to prevent morbidity/mortality due to work, methods for overcoming exposure measurement error should be used.”
Batistatou & McNamee, 2008
Carbon Black Exposure and Lung Function
“This paper was motivated by a study investigating the acute effect of exposure to carbon black on respiratory morbidity in the European carbon black manufacturing industry. The study was carried out over three cross sectional phases between 1987 and 1995, among several factories in seven European countries. Here, only data from the third phase (1994-1995) are used. During this period of time, several daily measurements of inhalable dust exposure were taken from each of 990 workers. Since inhalable dust exposure may vary greatly within and between days, considerable measurement error may occur. As a measure of worker’s lung function, FEV1 (Forced Expiratory Volume in 1 second) was determined together with information on age, height and cumulative smoking.” Batistatou & McNamee, 2008
Using Occupational Group as an Instrumental Variable Classify respondents into occupational groups assumed, a priori, to have differing exposures. “The exposure of workers was thought to vary by factory – because of differences in type of equipment, emissions, and ventilation - and type of job. Hence two schemes for grouping workers were considered by the original investigators: by job category and by a combination of factory and job category.” Batistatou & McNamee, 2008
The IV Procedure Use occupational group (binary dummy variables) as an instrument to predict measured exposure. Use group means of the Ws (i.e. the predicted values) to look at the effect of exposure on FEV1. e.g. with three dummies to describe occupational group: ivregress 2sls fev1 (w=d1 d2 d3)
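A sketch of constructing the dummies from a categorical grouping variable (the variable name jobgroup is hypothetical):

    tabulate jobgroup, generate(d)       // creates dummies d1, d2, ...
    ivregress 2sls fev1 (w = d1 d2 d3)

With factor-variable notation, ivregress 2sls fev1 (w = i.jobgroup) does the same job.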
An Example from Neuroimaging
Professor Karl Herholz in the Medical School is interested in the potential of PET (Positron Emission Tomography) scans to monitor and predict the progress of Alzheimer’s Disease (AD – a form of Dementia). PET with the widely available tracer 18F-2-fluoro-2-deoxy-D-glucose (FDG) measures local glucose metabolism as a proxy for neuronal activity at a resting state, without the need for cognitive activation. Impaired activity in AD is evident as reduced FDG uptake.
Calibrating the PET scores
We have PET scores and a measure of cognitive ability (ADAS score) available from all participants in samples from three a priori diagnostic groups:
1. Controls (no cognitive dysfunction)
2. Mild Cognitive Impairment (MCI)
3. Alzheimer’s Disease
Two dummy variables describing these diagnostic groups can be used as instrumental variables, as in
ivregress 2sls pet (adas=d1 d2)
An alternative IV is a 2nd cognitive function score based on the Mini Mental State Examination – but its errors might be correlated with those from the ADAS.
Future analysis – what best predicts transition from MCI to AD?
Biomarkers as Instruments
The advantage of biomarkers is that their measurement errors are unlikely to be correlated with those of subjective assessments based on diaries and questionnaires. The potential disadvantage is cost. But they do not need to be measured on everyone (c.f. the two-phase designs – nested validation studies – discussed next).
Two-phase designs
Let’s consider a large epidemiological survey (N1 = 5000, for example) with an error-prone exposure measurement, W. We’d like to replicate W but do not have the resources. There is no reason why we shouldn’t select a sub-sample of participants (N2 = 200, for example) and replicate W on these. The resulting estimate of σu² is then used in the analysis of the data from the full survey. The second-phase sample (N2) might be selected completely at random or, alternatively, stratified on the basis of the outcome of the first measurement (going for the extremes, for example). The choice of sampling method and of the relative sizes of N1 and N2 is an optimal design problem (beyond the scope of the module).
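A sketch of how the second-phase estimate might feed into the full-survey analysis (hypothetical variable names; w2 is the repeat measurement, non-missing only in the sub-sample):

    generate du = w - w2
    quietly summarize du                         // uses the sub-sample only
    scalar sigma2_u = r(Var)/2
    quietly summarize w                          // full survey
    scalar kappa = (r(Var) - sigma2_u)/r(Var)    // estimated reliability of W
    eivreg y w, r(w `=kappa')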
Factor Analysis (FA)
Let’s assume we have three independent instruments for the measurement of dietary fat intake. These give the following measurements on each of a sample of subjects:
W1 – based on a food diary
W2 – based on a food frequency questionnaire
W3 – a blood or urine biochemical measurement (biomarker)
The unknown fat intake is X.
The FA Model
W1 = α1 + β1X + U1
W2 = α2 + β2X + U2
W3 = α3 + β3X + U3
with the usual assumptions concerning lack of correlation of the Us with X and with each other. The regression coefficients are usually called factor loadings in the FA literature. First, we need to establish/specify a scale of measurement.
FA – Specifying the scale of measurement Two common approaches: 1. Fix the mean and variance of the Xs to some arbitrary value (e.g. 0 and 1, respectively). This is an approach within psychometrics (measurement of depression, for example). Not particularly informative here.
2. Use one of the Ws to set the scale (α1=0 and β1=1, for example). The two approaches are mathematically equivalent. They do not lead to contradictory results!
The FA model we’ll work with
W1 = X + U1
W2 = α2 + β2X + U2
W3 = α3 + β3X + U3
Cov(W1,W2) = β2Var(X)
Cov(W1,W3) = β3Var(X)
Cov(W2,W3) = β2β3Var(X)
It immediately follows that:
β2 = Cov(W2,W3)/Cov(W1,W3), β3 = Cov(W2,W3)/Cov(W1,W2), and so on.
An FA model with three instruments is just-identified
There are unique estimates for each of the model’s parameters. The model fits exactly. If we have four or more instruments, then
W1 = X + U1
W2 = α2 + β2X + U2
W3 = α3 + β3X + U3
W4 = α4 + β4X + U4
etc., and the regression coefficients are now over-identified. Typically, we now use maximum likelihood (ML). ML can cope with missing data: not all measurements need to be available on all subjects.
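In Stata (12 or later) such a model can be fitted by ML with sem; a sketch in which the capitalised X is latent, w1’s loading is fixed at 1 to set the scale, and method(mlmv) retains subjects with incomplete measurements:

    sem (X -> w1@1 w2 w3 w4), method(mlmv)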
Testing Constraints
Returning to the fat intake measurements, for example, we could ask whether there are any biases in the food frequency questionnaire when compared to the diary. This would be equivalent to fitting the model
W1 = X + U1
W2 = X + U2
W3 = α3 + β3X + U3
and asking if the model fits the data. If it does, then we might wish to test whether Var(U1) = Var(U2). Typically, such questions are answered by comparing a sequence of chi-square (log likelihood-ratio) statistics (details beyond the scope of the course).
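A sketch of one such comparison in sem – a likelihood-ratio test of whether the questionnaire’s loading equals the diary’s (model and variable names hypothetical; the intercept constraint is handled analogously):

    sem (X -> w1@1 w2@1 w3)     // constrained: beta2 = 1
    estimates store m_constr
    sem (X -> w1@1 w2 w3)       // unconstrained: beta2 free
    estimates store m_free
    lrtest m_free m_constr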
Combining an FA model with a regression model for the outcome
W1 = X + U1
W2 = α2 + β2X + U2
W3 = α3 + β3X + U3
Y = αy + βyX + V
and use ML.
Note that the combined model is still identified with only two measurement instruments:
W1 = X + U1
W2 = α2 + β2X + U2
Y = αy + βyX + V
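A sketch of the combined model in sem: the outcome is treated as one more indicator of the latent exposure, with its own intercept and loading (βy); variable names are illustrative:

    sem (X -> w1@1 w2 w3 y), method(mlmv)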
Summary of Main Points
Errors-in-variables models are not identified (their parameters do not have unique estimates). To attain identifiability, we need either to
1. make assumptions based on external data (but there is the problem of transportability), or
2. collect more information (replicates, instrumental variables or validating measurements) and adapt the analyses accordingly.