Next generation modeling in GWAS: comparing different genetic architectures

Evangelina López de Maturana, Noelia Ibáñez-Escriche, Óscar González-Recio, Gaëlle Marenne, Hossein Mehrban, Stephen J. Chanock, et al. Human Genetics ISSN 0340-6717 Hum Genet DOI 10.1007/s00439-014-1461-1


Original Investigation

Evangelina López de Maturana · Noelia Ibáñez‑Escriche · Óscar González‑Recio · Gaëlle Marenne · Hossein Mehrban · Stephen J. Chanock · Michael E. Goddard · Núria Malats

Received: 11 March 2014 / Accepted: 5 June 2014 © Springer-Verlag Berlin Heidelberg 2014

Abstract

The continuous advancement in genotyping technology has not been accompanied by the application of innovative statistical methods, such as multi-marker methods (MMM), to unravel genetic associations with complex traits. Although the performance of MMM has been widely explored in a prediction context, little is known about their behavior in quantitative trait loci (QTL) detection under complex genetic architectures. We shed light on this still open question by applying Bayes A (BA) and Bayesian LASSO (BL) to simulated and real data. Both methods were compared to the single marker regression (SMR). Simulated data were generated in the context of six scenarios differing in effect size, minor allele frequency (MAF) and linkage disequilibrium (LD) between QTLs. These were based on real SNP genotypes in chromosome 21 from the Spanish Bladder Cancer Study. We show how the genetic architecture dramatically affects the behavior of the methods in terms of power, type I error and accuracy of estimates. Markers with high MAF are easier to detect by all methods, especially if they have a large effect on the phenotypic trait. A high LD between QTLs with either large or small effects affects the power of the methods differently: it impairs QTL detection with BA, irrespective of the effect size, but boosts the detection of small effects with BL and SMR. We demonstrate the convenience of applying MMM rather than SMR because of their larger power and smaller type I error. Results from real data when applying MMM suggest novel associations not detected by SMR.

Electronic supplementary material: The online version of this article (doi:10.1007/s00439-014-1461-1) contains supplementary material, which is available to authorized users.

E. López de Maturana (*) · G. Marenne · N. Malats: Genetic and Molecular Epidemiology Group, Spanish National Cancer Research Centre (CNIO), C/ Melchor Fernández Almagro, 3, 28029 Madrid, Spain. e-mail: [email protected]
N. Ibáñez‑Escriche · H. Mehrban: Genética i Millora Animal, Research Institute and Agricultural Technology (IRTA), Avd. Alcalde Rovira Roure 191, 25198 Lleida, Spain
Ó. González‑Recio · M. E. Goddard: Biosciences Research Division, Department of Environment and Primary Industries, Agribio, 5 Ring Road, Bundoora, VIC 3083, Australia

Ó. González‑Recio: Dairy Futures Cooperative Research Centre, Bundoora, VIC 3083, Australia
H. Mehrban: Department of Animal Science, University of Shahrekord, P.O. Box 115, 88186‑34141 Shahrekord, Iran
S. J. Chanock: Division of Cancer Epidemiology and Genetics, Department of Health and Human Services, National Cancer Institute, Bethesda, MD, USA
M. E. Goddard: Department of Food and Agricultural Systems, University of Melbourne, Melbourne, Australia

Introduction

One of the goals in biomedical science is to identify and understand the relationship between phenotypes and genotypes. Advances in high-throughput genotyping and sequencing technologies allow the genotyping of hundreds of thousands of genetic markers in the genome. These resources have motivated human phenotype gene discovery through genome-wide association studies
(GWAS). In less than a decade, GWAS have advanced from testing thousands to several millions of SNPs in increasingly large sample sizes. The most common analytical procedure used in GWAS is the so-called single marker regression (SMR), in which the association between the frequency of each of hundreds of thousands of common variants and a given phenotype is tested one marker at a time. Only SNPs, or variants, that exceed a conservative genome-wide threshold for association (usually p […]

[…] t, where

$$a = \int_{-\infty}^{\hat{\beta}_p} p\left(\beta_p \mid y_{\mathrm{permu}}, \beta_{-p}\right)\, d\beta_p,$$

$\hat{\beta}_p$ is the posterior mean of the SNP p after analyzing the original data, and $p(\beta_p \mid y_{\mathrm{permu}})$ is the posterior distribution of the marker effect given the permuted data ($y_{\mathrm{permu}}$). Four values for t were considered: 0.8, 0.85, 0.9, and 0.95. Although this approach may lack a theoretical justification (Mutshinda and Sillanpaa 2012), a good correspondence between the values of a from the classical permutation test and that proposed by Che and Xu (2010) (Pearson's correlation = 0.98) was obtained for one of the replicates in scenario 3 (results not shown). Bonferroni's adjustment was used in the SMR. Thus, markers were declared as QTLs if p value < 0.05/7329.

Model comparison

The performance of the different models, and of the thresholds considered in the Bayesian methods, was evaluated in each simulated scenario in terms of the type I error (T1E or 1-specificity), statistical power, and the positive predictive value (PPV), each defined as follows:

$$\mathrm{T1E} = \frac{FP}{FP + TN}$$

$$\mathrm{Power} = \frac{TP}{TP + FN}$$

$$\mathrm{PPV} = \frac{TP}{TP + FP},$$

where TP, TN, FP and FN refer to the numbers of true-positive findings (i.e., true detected QTLs), of true-negative findings (with null effect and not declared as QTL), of false-positive findings (with null effect and declared as QTL), and of false-negative findings (with non-null effect and not declared as QTL), respectively. ROC spaces, defined by the false-positive rate or 1-specificity (or T1E) and the true-positive rate or sensitivity, and the prediction results of the confusion matrix corresponding to each method and criterion in each scenario were also calculated. Please see the supplementary Table S1 for a description of the relationships among the criteria used for the comparison of methods. The different statistical approaches were further compared in terms of accuracy through the mean squared error (MSE) of the estimated effect and its regression on the true values.
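The comparison criteria defined above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `detection_metrics` and `accuracy_of_estimates`, the boolean-vector encoding of true and declared QTLs, and the toy numbers are all assumptions made for illustration, and the "regression on the true values" is taken here to mean the slope of a simple linear regression of estimated effects on true effects.

```python
import numpy as np


def detection_metrics(is_qtl, declared):
    """Return T1E, power and PPV from per-marker truth and decisions.

    is_qtl   : True where the marker has a non-null simulated effect
    declared : True where the method declared the marker a QTL
    (Both encodings are illustrative assumptions.)
    """
    is_qtl = np.asarray(is_qtl, dtype=bool)
    declared = np.asarray(declared, dtype=bool)

    tp = int(np.sum(is_qtl & declared))     # true QTL, declared
    fp = int(np.sum(~is_qtl & declared))    # null marker, declared
    tn = int(np.sum(~is_qtl & ~declared))   # null marker, not declared
    fn = int(np.sum(is_qtl & ~declared))    # true QTL, not declared

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den > 0 else float("nan")

    return {
        "T1E": ratio(fp, fp + tn),      # FP / (FP + TN)
        "Power": ratio(tp, tp + fn),    # TP / (TP + FN)
        "PPV": ratio(tp, tp + fp),      # TP / (TP + FP)
    }


def accuracy_of_estimates(true_effects, estimated_effects):
    """Return the MSE of the estimates and the slope of their
    least-squares regression on the true effects."""
    b = np.asarray(true_effects, dtype=float)
    b_hat = np.asarray(estimated_effects, dtype=float)
    mse = float(np.mean((b_hat - b) ** 2))
    slope = float(np.polyfit(b, b_hat, 1)[0])
    return mse, slope


# Toy example: 10 markers, the first two are true QTLs.
truth = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
calls = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]   # TP = 1, FN = 1, FP = 1, TN = 7
print(detection_metrics(truth, calls))
print(accuracy_of_estimates([0.0, 1.0, 2.0], [0.0, 2.0, 4.0]))
```

A slope below 1 in such a regression would indicate that the estimated effects are shrunk relative to the simulated ones, and a slope above 1 that they are inflated.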

Results

We first present the comparison of the general behavior of the considered statistical methods and their performances according to the genetic architectures simulated regarding effect size (large/small), MAF (high/low) and LD (high/LE) between the large- and small-effect QTLs. Next, we report results corresponding to real data analyses.

Simulated data

In general, low standard deviations were obtained for each statistic, showing the robustness of the measurements to be discussed next.

Methods' performance across scenarios

Methods and criteria (hereinafter BAt and BLt, where t = {0.8, 0.85, 0.9, 0.95}, the thresholds used to declare an SNP as significant in Bayesian analyses) were evaluated using some of the criteria described in Table S1. Figure 2 displays the bar plots of the T1E. As expected, the more stringent the criterion, the lower the T1E for the Bayesian methods. The SNPs detected by both BA and BL at each t had the lowest T1E rate (
