Mendelian randomization in health research: Using ...

Accepted Manuscript Title: Mendelian randomization in health research: Using appropriate genetic variants and avoiding biased estimates Author: Amy E. Taylor Neil M. Davies Jennifer J. Ware Tyler VanderWeele George Davey Smith Marcus R. Munaf`o PII: DOI: Reference:

S1570-677X(13)00124-X http://dx.doi.org/doi:10.1016/j.ehb.2013.12.002 EHB 453

To appear in:

Economics and Human Biology

Received date: Revised date: Accepted date:

15-5-2013 3-12-2013 3-12-2013

Please cite this article as: Taylor, A.E., Davies, N.M., Ware, J.J., VanderWeele, T., Smith, G.D., Munaf`o, M.R.,Mendelian randomization in health research: using appropriate genetic variants and avoiding biased estimates, Economics and Human Biology (2013), http://dx.doi.org/10.1016/j.ehb.2013.12.002 This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Mendelian randomization in health research: using appropriate genetic variants and avoiding biased estimates Amy E. Taylor a, Neil M. Daviesb, Jennifer J. Warea,b, Tyler VanderWeelec, George

ip t

Davey Smithb, Marcus R. Munafòa a. MRC Integrative Epidemiology Unit (IEU) at the University of Bristol, UK Centre for

Corresponding author: Amy Taylor

M

School of Experimental Psychology

Ac ce p

BS8 1TU

te

Bristol

d

University of Bristol 12a Priory Road

an

us

cr

Tobacco and Alcohol Research Studies, School of Experimental Psychology, University of Bristol, 12a Priory Road, Bristol, BS8 1TU b. MRC Integrative Epidemiology Unit (IEU) at the University of Bristol, School of Social and Community Medicine, University of Bristol, Oakfield House, Oakfield Grove, Bristol, UK, BS8 2BN c. Department of Epidemiology, Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts 02115, USA

UK

Email address: [email protected] Telephone number: + 44 (0)117 3318239 Fax number: +44 (0)117 92 88588

Running head: Biases in Mendelian randomization in health research

1 Page 1 of 28

Financial support Amy Taylor, Jennifer Ware and Marcus Munafò are members of the UK Centre for Tobacco

ip t

and Alcohol Studies, a UKCRC Public Health Research: Centre of Excellence. Funding from British Heart Foundation, Cancer Research UK, Economic and Social Research Council,

cr

Medical Research Council, and the National Institute for Health Research, under the auspices of the UK Clinical Research Collaboration, is gratefully acknowledged. This work was

us

supported by the Wellcome Trust (grant number 086684) and the Medical Research Council (grant numbers MR/J01351X/1, G0800612, G0802736, G0600705, MC_UU_12013/1-9).

an

George Davey Smith and Neil Davies are supported by the European Research Council

M

DEVHEALTH grant (269874). Jennifer Ware is supported by a Post-Doctoral Research Fellowship from the Oak Foundation. Tyler VanderWeele is supported by an NIH grant (R01

Ac ce p

Acknowledgements

te

d

ES017876).

The authors are grateful to Dr Stephanie von Hinke Kessler Scholde and Prof Paul Clarke for their helpful comments on earlier drafts of this article.

2 Page 2 of 28

Abstract Mendelian randomization methods, which use genetic variants as instrumental variables for

ip t

exposures of interest to overcome problems of confounding and reverse causality, are becoming widespread for assessing causal relationships in epidemiological studies. The main

cr

purpose of this paper is to demonstrate how results can be biased if researchers select genetic variants on the basis of their association with the exposure in their own dataset, as often

us

happens in candidate gene analyses. This can lead to estimates that indicate apparent "causal" relationships, despite there being no true effect of the exposure. In addition, we discuss the

an

potential bias in estimates of magnitudes of effect from Mendelian randomization analyses when the measured exposure is a poor proxy for the true underlying exposure. We illustrate

M

these points with specific reference to tobacco research.

te

d

.

Ac ce p

Key words: smoking, tobacco, Mendelian randomization, causal inference, instrumental variable

3 Page 3 of 28

1. Introduction Proving how exposures affect health outcomes can be problematic in observational studies.

ip t

Even if an exposure and an outcome are associated, the direction of causality can be difficult to ascertain because health outcomes can lead to changes in behaviour which can affect

cr

exposures (Munafò & Araya 2010). Mendelian randomization studies may help to shed light

us

on these relationships by using genetic variants, such as single nucleotide polymorphisms (SNPs) (see Table 1 for definition), as instrumental variables for measured lifestyle exposures

an

(Davey Smith & Ebrahim 2003). Mendelian randomization studies can be used for two related purposes: 1) to provide evidence for the existence of causal associations, and 2) to

M

enable accurate estimation of the magnitude of the effect of lifelong exposure to a risk factor

d

on an outcome (Davey Smith & Ebrahim 2004).

te

As is the case for instrumental variable methods generally, for Mendelian randomization studies to be useful genetic variants must be robustly associated with the exposure of interest

Ac ce p

(Davey Smith & Ebrahim 2005; Lawlor et al. 2008b). Despite this, recent Mendelian randomization studies conducted by Wehby et al. have used genetic variants as instruments for smoking heaviness which were not shown to be associated with smoking phenotypes in large genome wide association studies (Wehby et al. 2011a; Wehby et al. 2011b; Wehby et al. 2012). Whilst the authors acknowledge that these variants have not been consistently associated with smoking phenotypes, they suggest that the variants provide evidence of causal effects of smoking on body weight (Wehby et al. 2012) and smoking in pregnancy on birthweight (Wehby et al. 2011b) and risk of orofacial clefts in offspring (Wehby et al. 2011a). In addition, the authors use the genetic variants to estimate the magnitude of effect of smoking heaviness on their outcomes of interest (Wehby et al. 2011a; Wehby et al. 2011b; 4 Page 4 of 28

Wehby et al. 2012). Even if the variants they use are truly associated with smoking behaviour, this is likely to produce incorrect estimates of the effect size of smoking on the outcome.

ip t

1.1 Aims

cr

In this paper, we aim: 1) to illustrate, using a data simulation, why inferences based on the results of Mendelian randomization studies using genetic variants selected based on their

us

association in a single sample are likely to be misleading and 2) to demonstrate why

an

estimating the magnitudes of causal effects in cases where the measured exposure is not the same as the underlying exposure captured by the variant is problematic. We discuss these

M

issues with reference to the specific case of tobacco as an exposure, but these principles can be applied more widely to Mendelian randomization and instrumental variable analyses.

te

d

1.2 Assumptions of Mendelian Randomization The principle of Mendelian randomization relies on the basic (but approximate) laws of

Ac ce p

Mendelian genetics (segregation and independent assortment). If these two laws hold, then at a population level, genetic variants will not be associated with the confounding factors that generally distort conventional observational studies (Davey Smith & Ebrahim 2003; Davey Smith 2011). In addition, genetic variants will not be affected by reverse causality (Davey Smith & Ebrahim 2003). Epidemiological studies increasingly use Mendelian randomization to provide robust evidence of underlying causal mechanisms in a number of areas of health research including cardiovascular disease, cancer and mental health (Casas et al. 2005; Davey Smith et al. 2005; Benn et al. 2011; Scott et al. 2011; Interleukin-6 Receptor Mendelian Randomisation Analysis et al. 2012; Nordestgaard et al. 2012; Voight et al. 2012, Carslake et al. 2013).

5 Page 5 of 28

For a SNP to be a valid instrumental variable, the following assumptions must hold: 1) the SNP should be reliably associated with the exposure, 2) the SNP should only be associated with the outcome through the exposure of interest (the “exclusion restriction”) and 3) the

ip t

SNP should be independent of other factors affecting the outcome (confounders) (Angrist et al. 1996; Lawlor et al. 2008b; Wehby et al. 2008; Clarke & Windmeijer 2012). Moreover, to

cr

use Mendelian randomization for accurate estimation of effect sizes in mediation analysis

using a measured exposure, the measured exposure should accurately capture the true causal

M

2.1 Genetic variants for tobacco research

an

2. Use of genetic variants selected in a single sample

us

exposure (Lawlor et al. 2008a; Pierce & VanderWeele 2012).

Large consortium-based genome wide association studies have found genetic variants

d

robustly associated with smoking behaviours (Thorgeirsson et al. 2008; Furberg et al. 2010;

te

Liu et al. 2010). One genetic variant that has been highlighted by these studies, amongst

Ac ce p

others, is located in the nicotinic receptor gene cluster CHRNA5-A3-B4 on chromosome 15. Two SNPs within this region, rs16969968 and rs1051730, which are in linkage disequilibrium and can be used interchangeably in studies on Europeans, consistently associate with measures of heaviness of smoking (e.g., cigarettes per day or biomarkers of nicotine exposure) (Freathy et al. 2009; Munafò et al. 2012). Smokers with a single copy of the smoking increasing allele smoke on average one extra cigarette per day compared to those with no copies. The effects of the SNP are additive, so people with two copies of the smoking increasing allele on average smoke two additional cigarettes a day (Ware et al. 2011). The strength and consistency of this association make these variants suitable instruments for use in Mendelian randomization studies. The second assumption of instrumental variable analysis, that the SNP should only be associated with the outcome through the exposure of 6 Page 6 of 28

interest, is rarely fully testable (Glymour et al. 2012). In Mendelian randomization, this assumption may be violated if the genetic variant has pleiotropic effects,is in linkage disequilibrium with another variant of differing function or if its effects are buffered by

ip t

canalization (Davey Smith and Ebrahim 2003). However, the biological function of the nicotinic receptor gene cluster and evidence from epidemiological studies suggest that this

cr

variant is likely to affect outcomes only through tobacco exposure (for a further discussion of this see Section 3). In addition, if the variant is associated with an outcome in smokers or

us

former smokers but not never smokers, this is a good indication that the association is fully

an

mediated through tobacco exposure (Freathy et al. 2011). The rs1051730 SNP has been used in Mendelian randomization studies to investigate the causal effect of cigarette smoking on

M

body mass index, depression anxiety and birthweight of offspring (Freathy et al. 2011; Lewis et al. 2011; Bjorngaard et al. 2012; Tyrrell et al. 2012).

d

Despite the identification of variants in the CHRNA5-A3-B4 gene cluster as suitable

te

instruments, Wehby et al. use other variants (in DRD2, MAOA, DRD4, 5HTT, GABBR2,

Ac ce p

CYP2D6) as instruments for smoking heaviness in their Mendelian randomization studies (Wehby et al. 2011a; Wehby et al. 2011b; Wehby et al. 2012). The authors justify this approach by emphasizing the plausible biological roles of their chosen variants in smoking behaviour. However, this justification is questionable given that the candidate gene approach for finding functional genetic variants has had limited success, yielding few replicable associations and many false positives (Colhoun et al. 2003; Sleiman & Grant 2010, Lawlor et al. 2008b). If these common variants are truly associated with the exposure, these associations should have been detected in the large genome wide association studies of smoking behaviour. We calculated that the largest of these studies, conducted by the TAG consortium, which included 74,000 smokers had 80% power to detect variants explaining as little as 0.05% of the variance in cigarettes per day (Furberg et al. 2010). Genetic variation in 7 Page 7 of 28

the CHRNA5-A3-B4 gene cluster explains about 1% of the variance in cigarettes per day (Munafò et al. 2012).

ip t

2.2 Data simulation Below, we show why selecting variants based on their association in a single sample can

cr

introduce bias into Mendelian randomization studies. We generated continuous exposure (X)

us

and outcome (Y) variables for 10,000 individuals using the following formulae:

an

X= α1Z + e Y= β1X + u;

M

where Z is a binary instrument with a frequency of 0.3 and e and u (the error terms) are

te

d

jointly normally distributed continuous variables with a correlation coefficient of (ρ) of 0.6:

Ac ce p

To illustrate an example where the association of the SNP and the exposure is well established, and where the observational association is biased, but estimates from Mendelian randomization are unbiased, we set α1=0.5and β1= -0.3, The raw association between X and Y from linear regression was positive (beta coefficient= 0.26, 95% CI: 0.25, 0.28) (see Table 2). However, as we know from the negative value of β1, the true effect of X on Y is negative. Hence the linear regression estimate was biased and confounded by the error terms. In contrast, the estimate of the effect of X on Y from a twostage least-squares regression, using the instrument Z, was negative and equal to the “true” value of β1 (beta coefficient -0.29, 95% CI: -0.37, -0.21). This demonstrates that in the

8 Page 8 of 28

presence of confounding, when there is a robust relationship between the instrument and the exposure, Mendelian randomization, and more broadly instrumental variable analysis, can give an unbiased estimate.

ip t

We next expand our simulation to demonstrate how biases can occur if instruments are

selected based on their observed associations with the exposure in the sample within which

cr

the Mendelian randomization experiment is being carried out. To simulate an example in

us

which there is no effect of the exposure on the outcome, we set β1=0, so the outcome and exposure were only correlated (correlation=0.6) due to the error terms. Thus the association

an

of the exposure and outcome is confounded. This means that if our estimation model (estimator) is correct, then it should find no effect of the exposure on the outcome. If our

d

it suggests our estimator is biased.

M

estimator is incorrect and we find a relationship between the outcome and the exposure, then

te

Next, to simulate the selection of genetic instruments within a sample, we randomly generated 1,000 binary variables (Z) to simulate the SNPs (all had a frequency of 0.3). Since

Ac ce p

these instruments were randomly generated, there was no underlying effect of the SNPs on the exposure (α1=0). We used a binary instrument in a one instrument and one exposure example for simplicity, but these results are generalizable to additive genetic models or Mendelian randomization studies using multiple genetic variants (Pierce et al. 2011; Clarke & Windmeijer 2012). We estimated the association of each SNP with the exposure, X, using robust linear regression. As expected, by chance, roughly 5% of these SNPs were associated with the exposure (using a p-value cut-off of 0.05). We selected the ten instruments most strongly associated with the exposure and ran a two stage least squares regression on the outcome using each of these instruments in turn. Table 3 presents the effect sizes and p-values for the association of the instrument with the exposure 9 Page 9 of 28

and the outcome along with the F-statistic (a measure of the strength of the association of instrument and exposure). Of the ten instruments selected, three had an F-statistic above the commonly used cut off

ip t

point of 10, suggesting that the associations of instruments and exposure were strong enough for the instrumental variable estimates to be unbiased (Stock et al. 2002). Using two-stage

cr

least-squares regression, five of the instruments showed strong or moderate evidence for

us

associations with the outcome (p values