Accepted Manuscript

Title: Mendelian randomization in health research: Using appropriate genetic variants and avoiding biased estimates
Author: Amy E. Taylor, Neil M. Davies, Jennifer J. Ware, Tyler VanderWeele, George Davey Smith, Marcus R. Munafò
PII: S1570-677X(13)00124-X
DOI: http://dx.doi.org/doi:10.1016/j.ehb.2013.12.002
Reference: EHB 453
To appear in: Economics and Human Biology
Received date: 15-5-2013
Revised date: 3-12-2013
Accepted date: 3-12-2013
Please cite this article as: Taylor, A.E., Davies, N.M., Ware, J.J., VanderWeele, T., Smith, G.D., Munafò, M.R., Mendelian randomization in health research: using appropriate genetic variants and avoiding biased estimates, Economics and Human Biology (2013), http://dx.doi.org/10.1016/j.ehb.2013.12.002

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Mendelian randomization in health research: using appropriate genetic variants and avoiding biased estimates

Amy E. Taylor (a), Neil M. Davies (b), Jennifer J. Ware (a, b), Tyler VanderWeele (c), George Davey Smith (b), Marcus R. Munafò (a)

a. MRC Integrative Epidemiology Unit (IEU) at the University of Bristol, UK Centre for Tobacco and Alcohol Research Studies, School of Experimental Psychology, University of Bristol, 12a Priory Road, Bristol, BS8 1TU
b. MRC Integrative Epidemiology Unit (IEU) at the University of Bristol, School of Social and Community Medicine, University of Bristol, Oakfield House, Oakfield Grove, Bristol, UK, BS8 2BN
c. Department of Epidemiology, Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts 02115, USA

Corresponding author: Amy Taylor
School of Experimental Psychology
University of Bristol
12a Priory Road
Bristol
BS8 1TU
UK
Email address: [email protected]
Telephone number: +44 (0)117 3318239
Fax number: +44 (0)117 92 88588
Running head: Biases in Mendelian randomization in health research
Financial support

Amy Taylor, Jennifer Ware and Marcus Munafò are members of the UK Centre for Tobacco and Alcohol Studies, a UKCRC Public Health Research: Centre of Excellence. Funding from British Heart Foundation, Cancer Research UK, Economic and Social Research Council, Medical Research Council, and the National Institute for Health Research, under the auspices of the UK Clinical Research Collaboration, is gratefully acknowledged. This work was supported by the Wellcome Trust (grant number 086684) and the Medical Research Council (grant numbers MR/J01351X/1, G0800612, G0802736, G0600705, MC_UU_12013/1-9). George Davey Smith and Neil Davies are supported by the European Research Council DEVHEALTH grant (269874). Jennifer Ware is supported by a Post-Doctoral Research Fellowship from the Oak Foundation. Tyler VanderWeele is supported by an NIH grant (R01 ES017876).

Acknowledgements

The authors are grateful to Dr Stephanie von Hinke Kessler Scholde and Prof Paul Clarke for their helpful comments on earlier drafts of this article.
Abstract

Mendelian randomization methods, which use genetic variants as instrumental variables for exposures of interest to overcome problems of confounding and reverse causality, are becoming widespread for assessing causal relationships in epidemiological studies. The main purpose of this paper is to demonstrate how results can be biased if researchers select genetic variants on the basis of their association with the exposure in their own dataset, as often happens in candidate gene analyses. This can lead to estimates that indicate apparent "causal" relationships, despite there being no true effect of the exposure. In addition, we discuss the potential bias in estimates of magnitudes of effect from Mendelian randomization analyses when the measured exposure is a poor proxy for the true underlying exposure. We illustrate these points with specific reference to tobacco research.

Key words: smoking, tobacco, Mendelian randomization, causal inference, instrumental variable
1. Introduction

Proving how exposures affect health outcomes can be problematic in observational studies. Even if an exposure and an outcome are associated, the direction of causality can be difficult to ascertain because health outcomes can lead to changes in behaviour which can affect exposures (Munafò & Araya 2010). Mendelian randomization studies may help to shed light on these relationships by using genetic variants, such as single nucleotide polymorphisms (SNPs) (see Table 1 for definition), as instrumental variables for measured lifestyle exposures (Davey Smith & Ebrahim 2003). Mendelian randomization studies can be used for two related purposes: 1) to provide evidence for the existence of causal associations, and 2) to enable accurate estimation of the magnitude of the effect of lifelong exposure to a risk factor on an outcome (Davey Smith & Ebrahim 2004).
As is the case for instrumental variable methods generally, for Mendelian randomization studies to be useful, genetic variants must be robustly associated with the exposure of interest (Davey Smith & Ebrahim 2005; Lawlor et al. 2008b). Despite this, recent Mendelian randomization studies conducted by Wehby et al. have used genetic variants as instruments for smoking heaviness which were not shown to be associated with smoking phenotypes in large genome-wide association studies (Wehby et al. 2011a; Wehby et al. 2011b; Wehby et al. 2012). Whilst the authors acknowledge that these variants have not been consistently associated with smoking phenotypes, they suggest that the variants provide evidence of causal effects of smoking on body weight (Wehby et al. 2012), and of smoking in pregnancy on birthweight (Wehby et al. 2011b) and on risk of orofacial clefts in offspring (Wehby et al. 2011a). In addition, the authors use the genetic variants to estimate the magnitude of effect of smoking heaviness on their outcomes of interest (Wehby et al. 2011a; Wehby et al. 2011b; Wehby et al. 2012). Even if the variants they use are truly associated with smoking behaviour, this is likely to produce incorrect estimates of the effect size of smoking on the outcome.
1.1 Aims

In this paper, we aim: 1) to illustrate, using a data simulation, why inferences based on the results of Mendelian randomization studies using genetic variants selected based on their association in a single sample are likely to be misleading and 2) to demonstrate why estimating the magnitudes of causal effects in cases where the measured exposure is not the same as the underlying exposure captured by the variant is problematic. We discuss these issues with reference to the specific case of tobacco as an exposure, but these principles can be applied more widely to Mendelian randomization and instrumental variable analyses.
1.2 Assumptions of Mendelian Randomization

The principle of Mendelian randomization relies on the basic (but approximate) laws of Mendelian genetics (segregation and independent assortment). If these two laws hold, then at a population level, genetic variants will not be associated with the confounding factors that generally distort conventional observational studies (Davey Smith & Ebrahim 2003; Davey Smith 2011). In addition, genetic variants will not be affected by reverse causality (Davey Smith & Ebrahim 2003). Epidemiological studies increasingly use Mendelian randomization to provide robust evidence of underlying causal mechanisms in a number of areas of health research including cardiovascular disease, cancer and mental health (Casas et al. 2005; Davey Smith et al. 2005; Benn et al. 2011; Scott et al. 2011; Interleukin-6 Receptor Mendelian Randomisation Analysis et al. 2012; Nordestgaard et al. 2012; Voight et al. 2012; Carslake et al. 2013).
For a SNP to be a valid instrumental variable, the following assumptions must hold: 1) the SNP should be reliably associated with the exposure, 2) the SNP should only be associated with the outcome through the exposure of interest (the “exclusion restriction”) and 3) the SNP should be independent of other factors affecting the outcome (confounders) (Angrist et al. 1996; Lawlor et al. 2008b; Wehby et al. 2008; Clarke & Windmeijer 2012). Moreover, to use Mendelian randomization for accurate estimation of effect sizes in mediation analysis using a measured exposure, the measured exposure should accurately capture the true causal exposure (Lawlor et al. 2008a; Pierce & VanderWeele 2012).

2. Use of genetic variants selected in a single sample

2.1 Genetic variants for tobacco research
Large consortium-based genome-wide association studies have found genetic variants robustly associated with smoking behaviours (Thorgeirsson et al. 2008; Furberg et al. 2010; Liu et al. 2010). One genetic variant that has been highlighted by these studies, amongst others, is located in the nicotinic receptor gene cluster CHRNA5-A3-B4 on chromosome 15. Two SNPs within this region, rs16969968 and rs1051730, which are in linkage disequilibrium and can be used interchangeably in studies on Europeans, consistently associate with measures of heaviness of smoking (e.g., cigarettes per day or biomarkers of nicotine exposure) (Freathy et al. 2009; Munafò et al. 2012). Smokers with a single copy of the smoking increasing allele smoke on average one extra cigarette per day compared to those with no copies. The effects of the SNP are additive, so people with two copies of the smoking increasing allele on average smoke two additional cigarettes a day (Ware et al. 2011). The strength and consistency of this association make these variants suitable instruments for use in Mendelian randomization studies.

The second assumption of instrumental variable analysis, that the SNP should only be associated with the outcome through the exposure of interest, is rarely fully testable (Glymour et al. 2012). In Mendelian randomization, this assumption may be violated if the genetic variant has pleiotropic effects, is in linkage disequilibrium with another variant of differing function or if its effects are buffered by canalization (Davey Smith & Ebrahim 2003). However, the biological function of the nicotinic receptor gene cluster and evidence from epidemiological studies suggest that this variant is likely to affect outcomes only through tobacco exposure (for a further discussion of this see Section 3). In addition, if the variant is associated with an outcome in smokers or former smokers but not never smokers, this is a good indication that the association is fully mediated through tobacco exposure (Freathy et al. 2011). The rs1051730 SNP has been used in Mendelian randomization studies to investigate the causal effect of cigarette smoking on body mass index, depression, anxiety and birthweight of offspring (Freathy et al. 2011; Lewis et al. 2011; Bjorngaard et al. 2012; Tyrrell et al. 2012).
Despite the identification of variants in the CHRNA5-A3-B4 gene cluster as suitable instruments, Wehby et al. use other variants (in DRD2, MAOA, DRD4, 5HTT, GABBR2, CYP2D6) as instruments for smoking heaviness in their Mendelian randomization studies (Wehby et al. 2011a; Wehby et al. 2011b; Wehby et al. 2012). The authors justify this approach by emphasizing the plausible biological roles of their chosen variants in smoking behaviour. However, this justification is questionable given that the candidate gene approach for finding functional genetic variants has had limited success, yielding few replicable associations and many false positives (Colhoun et al. 2003; Lawlor et al. 2008b; Sleiman & Grant 2010). If these common variants are truly associated with the exposure, these associations should have been detected in the large genome-wide association studies of smoking behaviour. We calculated that the largest of these studies, conducted by the TAG consortium, which included 74,000 smokers, had 80% power to detect variants explaining as little as 0.05% of the variance in cigarettes per day (Furberg et al. 2010). Genetic variation in the CHRNA5-A3-B4 gene cluster explains about 1% of the variance in cigarettes per day (Munafò et al. 2012).
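A power calculation of this kind can be approximated as in the sketch below. It is illustrative only and is not the calculation the authors ran: it assumes a single-variant test with one degree of freedom, a normal approximation to the test statistic, and a genome-wide significance threshold of 5e-8, which the text does not state. The function name gwas_power is hypothetical.

```python
from statistics import NormalDist


def gwas_power(n, r2, alpha=5e-8):
    """Approximate power of a 1-df association test.

    n     : number of individuals
    r2    : fraction of phenotypic variance explained by the variant
    alpha : two-sided significance threshold (5e-8 is an assumption here)
    """
    norm = NormalDist()
    # Non-centrality parameter of the 1-df chi-square test statistic.
    ncp = n * r2 / (1.0 - r2)
    # Two-sided critical value on the z scale.
    z_crit = -norm.inv_cdf(alpha / 2.0)
    shift = ncp ** 0.5
    # A 1-df non-central chi-square is the square of a N(shift, 1) variable,
    # so the rejection probability is the mass in both tails.
    return (1.0 - norm.cdf(z_crit - shift)) + norm.cdf(-z_crit - shift)


# Sample size and variance fractions quoted in the text.
power_small = gwas_power(74_000, 0.0005)  # variant explaining 0.05%
power_chrna = gwas_power(74_000, 0.01)    # variant explaining about 1%
print(f"power at 0.05% variance explained: {power_small:.2f}")
print(f"power at 1% variance explained:    {power_chrna:.2f}")
```

The result is sensitive to the chosen threshold, so a different alpha will shift the smallest detectable variance fraction.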
2.2 Data simulation

Below, we show why selecting variants based on their association in a single sample can introduce bias into Mendelian randomization studies. We generated continuous exposure (X) and outcome (Y) variables for 10,000 individuals using the following formulae:

X = α1Z + e
Y = β1X + u

where Z is a binary instrument with a frequency of 0.3 and e and u (the error terms) are jointly normally distributed continuous variables with a correlation coefficient (ρ) of 0.6.
To illustrate an example where the association of the SNP and the exposure is well established, and where the observational association is biased but estimates from Mendelian randomization are unbiased, we set α1 = 0.5 and β1 = -0.3. The raw association between X and Y from linear regression was positive (beta coefficient = 0.26, 95% CI: 0.25, 0.28) (see Table 2). However, as we know from the negative value of β1, the true effect of X on Y is negative. Hence the linear regression estimate was biased by confounding through the correlated error terms. In contrast, the estimate of the effect of X on Y from a two-stage least-squares regression, using the instrument Z, was negative and close to the “true” value of β1 (beta coefficient = -0.29, 95% CI: -0.37, -0.21). This demonstrates that in the presence of confounding, when there is a robust relationship between the instrument and the exposure, Mendelian randomization, and more broadly instrumental variable analysis, can give an unbiased estimate.
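The first simulation can be reproduced in outline with the Python sketch below. It is a minimal illustration, not the authors' code: it assumes unit variances for e and u (the text gives only their correlation), uses an arbitrary random seed, and computes the instrumental variable estimate as the Wald ratio cov(Z, Y) / cov(Z, X), which coincides with the two-stage least-squares point estimate when there is a single instrument. The helper names cov and simulate are hypothetical.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / (n - 1)


def simulate(n=10_000, alpha1=0.5, beta1=-0.3, rho=0.6, freq=0.3, seed=2013):
    """Draw (Z, X, Y) from X = alpha1*Z + e and Y = beta1*X + u."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        zi = 1.0 if rng.random() < freq else 0.0  # binary instrument
        e = rng.gauss(0.0, 1.0)                   # unit variance: assumption
        # u is built to have correlation rho with e and unit variance.
        u = rho * e + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        xi = alpha1 * zi + e
        z.append(zi)
        x.append(xi)
        y.append(beta1 * xi + u)
    return z, x, y


z, x, y = simulate()
ols = cov(x, y) / cov(x, x)  # confounded observational slope
iv = cov(z, y) / cov(z, x)   # Wald ratio (2SLS with one instrument)
print(f"OLS slope: {ols:.2f}")
print(f"IV slope:  {iv:.2f}")
```

Under these assumptions the observational slope should come out positive and the instrumental variable slope negative, in line with the pattern reported in Table 2.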
We next expand our simulation to demonstrate how biases can occur if instruments are selected based on their observed associations with the exposure in the sample within which the Mendelian randomization experiment is being carried out. To simulate an example in which there is no effect of the exposure on the outcome, we set β1 = 0, so the outcome and exposure were only correlated (correlation = 0.6) due to the error terms. Thus the association of the exposure and outcome is confounded. This means that if our estimation model (estimator) is correct, then it should find no effect of the exposure on the outcome. If instead we find a relationship between the outcome and the exposure, this suggests that our estimator is biased.
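The null scenario above, combined with the in-sample selection of instruments described next, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' procedure: e and u are given unit variances, each simulated SNP has exactly 30% carriers (rather than a 0.3 probability per individual), the F-statistic comes from an ordinary rather than robust regression, and the two-stage least-squares point estimate is computed as the Wald ratio. The function names simulate_null and scan_random_snps are hypothetical.

```python
import random


def simulate_null(n=10_000, rho=0.6, seed=1):
    """Exposure and outcome with alpha1 = 0 and beta1 = 0: X = e, Y = u."""
    rng = random.Random(seed)
    x, y = [], []
    for _ in range(n):
        e = rng.gauss(0.0, 1.0)
        u = rho * e + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        x.append(e)
        y.append(u)
    return x, y, rng


def scan_random_snps(x, y, rng, n_snps=1000, freq=0.3, top=10):
    """Generate unrelated binary SNPs; keep those most associated with X."""
    n = len(x)
    n1 = round(freq * n)  # carriers per SNP (fixed count: simplification)
    n0 = n - n1
    sum_x, sum_y = sum(x), sum(y)
    mean_x = sum_x / n
    sst_x = sum((v - mean_x) ** 2 for v in x)  # total sum of squares of X
    results = []
    for _ in range(n_snps):
        carriers = rng.sample(range(n), n1)
        sx1 = sum(x[i] for i in carriers)
        sy1 = sum(y[i] for i in carriers)
        dx = sx1 / n1 - (sum_x - sx1) / n0  # SNP-exposure difference in means
        dy = sy1 / n1 - (sum_y - sy1) / n0  # SNP-outcome difference in means
        ssb = n1 * n0 / n * dx * dx         # between-group sum of squares
        f_stat = ssb / ((sst_x - ssb) / (n - 2))
        results.append((f_stat, dx, dy / dx))  # dy/dx is the Wald ratio
    results.sort(reverse=True)  # strongest first-stage association first
    return results[:top]


x, y, rng = simulate_null()
top = scan_random_snps(x, y, rng)
for f_stat, dx, wald in top:
    print(f"F = {f_stat:5.1f}  first stage = {dx:+.3f}  IV estimate = {wald:+.2f}")
```

Because the selected SNPs are associated with the exposure only by chance, their Wald ratios tend to sit near the confounded exposure-outcome association rather than near the true value of zero.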
Next, to simulate the selection of genetic instruments within a sample, we randomly generated 1,000 binary variables (Z) to simulate the SNPs (all had a frequency of 0.3). Since these instruments were randomly generated, there was no underlying effect of the SNPs on the exposure (α1 = 0). We used a binary instrument in a one instrument and one exposure example for simplicity, but these results are generalizable to additive genetic models or Mendelian randomization studies using multiple genetic variants (Pierce et al. 2011; Clarke & Windmeijer 2012). We estimated the association of each SNP with the exposure, X, using robust linear regression. As expected, by chance, roughly 5% of these SNPs were associated with the exposure (using a p-value cut-off of 0.05). We selected the ten instruments most strongly associated with the exposure and ran a two-stage least-squares regression on the outcome using each of these instruments in turn. Table 3 presents the effect sizes and p-values for the association of the instrument with the exposure and the outcome along with the F-statistic (a measure of the strength of the association of instrument and exposure).

Of the ten instruments selected, three had an F-statistic above the commonly used cut-off point of 10, suggesting that the associations of instruments and exposure were strong enough for the instrumental variable estimates to be unbiased (Stock et al. 2002). Using two-stage least-squares regression, five of the instruments showed strong or moderate evidence for associations with the outcome (p values