Bioinformatics Advance Access published May 12, 2009
Applications Note
LINKDATAGEN – Generating linkage mapping files from Affymetrix SNP chip data
Bahlo, M.1,* and Bromhead, C.J.1
1 Bioinformatics Division, The Walter and Eliza Hall Institute, 1G Royal Pde, Parkville 3052 VIC, Australia
ABSTRACT
Summary: LINKDATAGEN is a Perl tool that generates linkage mapping input files for five different linkage mapping tools using data from all 11 HAPMAP Phase III populations. It provides rudimentary error checks and is easily amended for personal linkage mapping preferences.
Availability and Implementation: LINKDATAGEN is available from http://bioinf.wehi.edu.au/software/linkdatagen/ with accompanying annotation files, a reference manual and a test data set.
Contact: [email protected]
* To whom correspondence should be addressed.
1 INTRODUCTION
The development of high density, high throughput genotyping has been pushed forward at a rapid pace to facilitate genome wide association studies. Affymetrix SNP genotyping microarray platforms have consequently evolved rapidly, from an initial 10K SNP chip to the current 6.0 chip, which contains ~1,000,000 SNPs as well as ~1,000,000 CNV probes. Linkage studies have also benefited from the dense genotyping data through increased mapping power (Evans and Cardon, 2004). A single SNP marker is not very informative for linkage: it can attain a maximum heterozygosity of only 0.5, compared with 0.7-0.8 for a typical microsatellite marker. However, it is the sheer number of these SNP markers, when incorporated in a multipoint linkage study, that allows better inference in pedigrees. In addition, recombination breakpoints are more clearly delineated, negating the need for fine mapping. Genotyping error rates are also lower, and genotype calls are more consistent over time and between genotyping facilities, allowing easier merging of genotyping data.
SNP chip data have also brought new challenges. Whilst association studies rely on linkage disequilibrium (LD), linkage analyses using the Markov assumption in probability calculations assume that linkage equilibrium (LE) holds. In standard exact multipoint linkage calculations through the Lander-Green algorithm, this LE assumption is used in the calculation of the founder genotype probabilities and in the first order Markov model that underpins the algorithm. Violations of this assumption have been shown to lead to inflated LOD scores, erroneous linkage results (Abecasis and Wigginton, 2005) and incorrectly inferred haplotypes (Schaid, et al., 2002). Thus it is important, when contemplating the use of SNP chip data, to select sufficiently distant markers to avoid LD and to use a set of markers that are informative both in the population (high heterozygosity SNPs) and in the family.
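The marker selection idea described above can be illustrated with a minimal Python sketch. It is not the LINKDATAGEN code: the `Snp` record, the function names, and the bin size and minimum gap values used in the usage note are all hypothetical, and the sketch assumes biallelic SNPs on a single chromosome with positions in centimorgans. It computes the expected heterozygosity 2p(1-p) for each SNP, keeps the most heterozygous SNP in each genetic-map bin, and then drops any pick that lies too close to the previous one.

```python
from dataclasses import dataclass


@dataclass
class Snp:
    name: str           # hypothetical marker identifier
    cm: float           # genetic map position in centimorgans
    allele_freq: float  # frequency of one allele in the reference population


def heterozygosity(p: float) -> float:
    """Expected heterozygosity 2p(1-p) of a biallelic marker (maximum 0.5 at p = 0.5)."""
    return 2.0 * p * (1.0 - p)


def select_snps(snps, bin_size_cm, min_gap_cm):
    """Keep the most heterozygous SNP per bin, then enforce a minimum spacing.

    Assumes all SNPs lie on one chromosome. Ties within a bin keep the
    SNP that was seen first.
    """
    best_in_bin = {}
    for snp in snps:
        idx = int(snp.cm // bin_size_cm)
        best = best_in_bin.get(idx)
        if best is None or heterozygosity(snp.allele_freq) > heterozygosity(best.allele_freq):
            best_in_bin[idx] = snp

    chosen = []
    for idx in sorted(best_in_bin):
        snp = best_in_bin[idx]
        # Skip a bin's pick if it sits too close to the previously chosen SNP.
        if not chosen or snp.cm - chosen[-1].cm >= min_gap_cm:
            chosen.append(snp)
    return chosen
```

For example, `select_snps(snps, bin_size_cm=1.0, min_gap_cm=0.5)` would return at most one SNP per 1 cM bin, with neighbouring picks at least 0.5 cM apart; both values are illustrative only.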
2 IMPLEMENTATION
We have implemented a Perl script called LINKDATAGEN that generates input files for a variety of common analysis software tools (PREST (McPeek and Sun, 2000), MERLIN (Abecasis, et al., 2002; Abecasis and Wigginton, 2005), MORGAN (Thompson, 1995; Thompson, 2000) and PLINK (Purcell, et al., 2007)). In addition, LINKDATAGEN performs several error checks, removes all Mendelian errors and generates some simple identity by state (IBS) sharing statistics.
SNPs are selected to satisfy LE conditions. This is done by selecting informative SNPs in bins while preserving minimum genetic map distances. Only MERLIN and PLINK have some ability to handle high density SNP data in LD, using methods such as clustering of markers or marker selection based on LD measures such as r². The suitability of these methods depends on the availability of either HAPMAP data or other large datasets to allow estimation of LD patterns. LINKDATAGEN uses HAPMAP Phase III data (www.hapmap.org.au), which allows the user to choose between any of 11 populations rather than the original 4 from HAPMAP Phase I and II (International HapMap Consortium, 2003; Frazer, et al., 2007; Thorisson, et al., 2005). It thus has an advantage over the similar tools IGG (Li, et al., 2007), ALOHOMORA (Ruschendorf and Nurnberg, 2005) and SNPLINK (Webb, et al., 2005). LINKDATAGEN also supports a broader range of linkage analysis programs than previous tools.
Preliminary quality control tests implemented in LINKDATAGEN include a simple sex test, which compares the observed and expected numbers of heterozygous genotypes on the X chromosome in the available samples using the normal approximation to the test of proportions. Mendelian errors are identified based on the provided pedigree information. All errors are removed prior to the generation of the final files by simply removing the entire SNP. Since error rates are usually very low, this procedure loses little data.
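The Mendelian error step can be sketched in Python as follows. This is an illustration under stated assumptions, not the LINKDATAGEN implementation: it covers only autosomal biallelic SNPs and only child-father-mother trios in which all three genotypes are called. The two-character genotype coding (`"AA"`, `"AB"`, `"BB"`), the data layout and the function names are hypothetical. A SNP is dropped entirely as soon as one trio is inconsistent, mirroring the whole-SNP removal described above.

```python
from itertools import product


def is_mendelian_consistent(child, father, mother):
    """Return True if the child's genotype can arise from the parents' genotypes.

    Genotypes are two-character strings such as "AB". A missing genotype
    (None or "") cannot be checked, so the trio is treated as consistent.
    """
    if not child or not father or not mother:
        return True
    # Every combination of one paternal allele with one maternal allele.
    possible = {"".join(sorted(a + b)) for a, b in product(father, mother)}
    return "".join(sorted(child)) in possible


def snps_with_errors(genotypes, trios):
    """Find SNPs showing at least one Mendelian inconsistency.

    genotypes: {snp_name: {person_id: genotype}}
    trios: list of (child_id, father_id, mother_id) tuples from the pedigree
    """
    bad = set()
    for snp, calls in genotypes.items():
        for child, father, mother in trios:
            if not is_mendelian_consistent(
                calls.get(child), calls.get(father), calls.get(mother)
            ):
                bad.add(snp)
                break
    return bad


def remove_error_snps(genotypes, trios):
    """Drop every SNP that shows a Mendelian error in any trio."""
    bad = snps_with_errors(genotypes, trios)
    return {snp: calls for snp, calls in genotypes.items() if snp not in bad}
```

Checks involving a single genotyped parent and the X chromosome checks mentioned in the text are deliberately left out to keep the sketch short.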
This is an important step since many analysis programs (MORGAN and MERLIN) assume Mendelian error free data. LINKDATAGEN produces a summary of Mendelian errors across genotyped individuals, also indicating which error check was used (parental checks and/or X chromosome checks). This helps to highlight individuals who are likely to be incorrectly assigned in the pedigree (pedigree errors) or whose samples are poorer than others.
LINKDATAGEN generates data for one pedigree at a time. The basic requirements are a genotype call file, a pedigree file, the appropriate annotation file for the chip, and a one-line file that connects the genotype call file columns to the pedigree file IDs. The genotype call file contains all individuals from the pedigree whose genotypes are available. This file can be generated by either R/Oligo (Carvalho, et al., 2007) or apt-probeset-genotype (http://www.affymetrix.com/partners_programs/programs/developer/tools/powertools.affx), which incorporates the genotype calling algorithms BRLMM (50K and 250K chips), BRLMM-P (5.0) and BIRDSEED (6.0). Note that R/Oligo imposes no threshold on genotype quality scores, so Mendelian error detection becomes very important. The genotype call file may include individuals other than those required for the pedigree, but it must contain all of the pedigree members whose genotypes are available. Multiple genotype call files are not acceptable and all individuals must be from the same chip. Affymetrix chips, due to their reliance upon particular restriction enzymes, have very little overlap in their SNP sets between the 50K and 250K datasets.
We provide annotation files, which are based on the annotation files distributed by Affymetrix. These contain allele frequency information derived from all 11 available HAPMAP Phase III populations, the physical and genetic map information provided by Affymetrix, and SNP names in both Affymetrix and rs format. They will be updated periodically, following updates provided by Affymetrix.
A standard application of LINKDATAGEN in a linkage analysis would consist of generating MERLIN, ALLEGRO or MORGAN output files with the default bin size. Once a linkage analysis has been performed, it is useful to generate a full data set file by using the –cp, for “complete”, option in conjunction with setting the bin size to 0 (-bin 0.0) to allow the incorporation of all marker data. By examining the linkage region(s) defined by the multipoint linkage analysis, it is possible to further refine the interval of interest using all of the markers rather than just the linkage subset. This works well in practice, in particular for homozygosity mapping. Three simple IBS sharing statistics generated as part of the –cp commands are useful for this purpose. These are based on the most powerful IBD (identity by descent) sharing statistics (McPeek, 1999) for homozygosity mapping, recessive diseases and dominant diseases. Combining the two approaches takes full advantage of the initial powerful multipoint linkage mapping, which incorporates map distances and allele frequencies, as well as an IBS approach that uses all the marker data to help refine the region. Other approaches use direct IBS or HBS (homozygosity by state) sharing methods (Leibon, et al., 2008; Miyazawa, et al., 2007; Purcell, et al., 2007), ignoring the precise pedigree structure.
LINKDATAGEN accepts data from 50KHind, 50KXba, 250KSty, 250KNsp, 5.0 and 6.0 Affymetrix arrays and has been used successfully in several linkage studies with a variety of these arrays (Amor, et al., 2007; Hildebrand, et al., 2007; Hildebrand, et al., 2006; Hildebrand, et al., 2007; Shearer, et al., 2009; Storey, et al., 2008). LINKDATAGEN is written in Perl and does not require any additional Perl packages, so it is very portable and runs without compilation on any machine that has a basic Perl installation.
We have presented a very flexible command line script. It is well compartmentalised and easy for others to modify for programs of their choice. Users require some familiarity with UNIX commands to run the script. Modifications to the script require basic programming skills. We encourage questions and requests to further extend and refine LINKDATAGEN.
ACKNOWLEDGEMENTS
Funding: MB is funded by an Australian National Health and Medical Research Council (NHMRC) Career Development Award, an NHMRC Program grant and an NHMRC Medical Genomics Bioinformatics and Proteomics program grant.
Associate Editor: Prof. Dmitrij Frishman
©The Author (2009). Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
Downloaded from http://bioinformatics.oxfordjournals.org/ by guest on August 24, 2015
REFERENCES
The International HapMap Consortium (2003) The International HapMap Project, Nature, 426, 789-796.
Abecasis, G.R., Cherny, S.S., Cookson, W.O. and Cardon, L.R. (2002) Merlin--rapid analysis of dense genetic maps using sparse gene flow trees, Nat Genet, 30, 97-101.
Abecasis, G.R. and Wigginton, J.E. (2005) Handling marker-marker linkage disequilibrium: pedigree analysis with clustered markers, Am J Hum Genet, 77, 754-767.
Amor, D.J., Dahl, H.H., Bahlo, M. and Bankier, A. (2007) Keipert syndrome (Nasodigitoacoustic syndrome) is X-linked and maps to Xq22.2-Xq28, Am J Med Genet A, 143A, 2236-2241.
Carvalho, B., Bengtsson, H., Speed, T.P. and Irizarry, R.A. (2007) Exploration, normalization, and genotype calls of high-density oligonucleotide SNP array data, Biostatistics, 8, 485-499.
Evans, D.M. and Cardon, L.R. (2004) Guidelines for genotyping in genomewide linkage studies: single-nucleotide-polymorphism maps versus microsatellite maps, Am J Hum Genet, 75, 687-692.
Frazer, K.A., Ballinger, D.G., Cox, D.R., Hinds, D.A., Stuve, L.L., Gibbs, R.A., Belmont, J.W., Boudreau, A., Hardenbol, P., Leal, S.M., Pasternak, S., Wheeler, D.A., Willis, T.D., Yu, F., Yang, H., Zeng, C., Gao, Y., Hu, H., Hu, W., Li, C., Lin, W., Liu, S., Pan, H., Tang, X., Wang, J., Wang, W., Yu, J., Zhang, B., Zhang, Q., Zhao, H., Zhao, H., Zhou, J., Gabriel, S.B., Barry, R., Blumenstiel, B., Camargo, A., Defelice, M., Faggart, M., Goyette, M., Gupta, S., Moore, J., Nguyen, H., Onofrio, R.C., Parkin, M., Roy, J., Stahl, E., Winchester, E., Ziaugra, L., Altshuler, D., Shen, Y., Yao, Z., Huang, W., Chu, X., He, Y., Jin, L., Liu, Y., Shen, Y., Sun, W., Wang, H., Wang, Y., Wang, Y., Xiong, X., Xu, L., Waye, M.M., Tsui, S.K., Xue, H., Wong, J.T., Galver, L.M., Fan, J.B., Gunderson, K., Murray, S.S., Oliphant, A.R., Chee, M.S., Montpetit, A., Chagnon, F., Ferretti, V., Leboeuf, M., Olivier, J.F., Phillips, M.S., Roumy, S., Sallee, C., Verner, A., Hudson, T.J., Kwok, P.Y., Cai, D., Koboldt, D.C., Miller, R.D., Pawlikowska, L., Taillon-Miller, P., Xiao, M., Tsui, L.C., Mak, W., Song, Y.Q., Tam, P.K., Nakamura, Y., Kawaguchi, T., Kitamoto, T., Morizono, T., Nagashima, A., Ohnishi, Y., Sekine, A., Tanaka, T., Tsunoda, T., Deloukas, P., Bird, C.P., Delgado, M., Dermitzakis, E.T., Gwilliam, R., Hunt, S., Morrison, J., Powell, D., Stranger, B.E., Whittaker, P., Bentley, D.R., Daly, M.J., de Bakker, P.I., Barrett, J., Chretien, Y.R., Maller, J., McCarroll, S., Patterson, N., Pe'er, I., Price, A., Purcell, S., Richter, D.J., Sabeti, P., Saxena, R., Schaffner, S.F., Sham, P.C., Varilly, P., Altshuler, D., Stein, L.D., Krishnan, L., Smith, A.V., Tello-Ruiz, M.K., Thorisson, G.A., Chakravarti, A., Chen, P.E., Cutler, D.J., Kashuk, C.S., Lin, S., Abecasis, G.R., Guan, W., Li, Y., Munro, H.M., Qin, Z.S., Thomas, D.J., McVean, G., Auton, A., Bottolo, L., Cardin, N., Eyheramendy, S., Freeman, C., Marchini, J., Myers, S., Spencer, C., Stephens, M., Donnelly, P., Cardon, L.R., Clarke, G., Evans, D.M., Morris, A.P., Weir, B.S., Tsunoda, T., Mullikin, J.C., Sherry, S.T., Feolo, M., Skol, A., Zhang, H., Zeng, C., Zhao, H., Matsuda, I., Fukushima, Y., Macer, D.R., Suda, E., Rotimi, C.N., Adebamowo, C.A., Ajayi, I., Aniagwu, T., Marshall, P.A., Nkwodimmah, C., Royal, C.D., Leppert, M.F., Dixon, M., Peiffer, A., Qiu, R., Kent, A., Kato, K., Niikawa, N., Adewole, I.F., Knoppers, B.M., Foster, M.W., Clayton, E.W., Watkin, J., Gibbs, R.A., Belmont, J.W., Muzny, D., Nazareth, L., Sodergren, E., Weinstock, G.M., Wheeler, D.A., Yakub, I., Gabriel, S.B., Onofrio, R.C., Richter, D.J., Ziaugra, L., Birren, B.W., Daly, M.J., Altshuler, D., Wilson, R.K., Fulton, L.L., Rogers, J., Burton, J., Carter, N.P., Clee, C.M., Griffiths, M., Jones, M.C., McLay, K., Plumb, R.W., Ross, M.T.,
Sims, S.K., Willey, D.L., Chen, Z., Han, H., Kang, L., Godbout, M., Wallenburg, J.C., L'Archeveque, P., Bellemare, G., Saeki, K., Wang, H., An, D., Fu, H., Li, Q., Wang, Z., Wang, R., Holden, A.L., Brooks, L.D., McEwen, J.E., Guyer, M.S., Wang, V.O., Peterson, J.L., Shi, M., Spiegel, J., Sung, L.M., Zacharia, L.F., Collins, F.S., Kennedy, K., Jamieson, R. and Stewart, J. (2007) A second generation human haplotype map of over 3.1 million SNPs, Nature, 449, 851-861.
Hildebrand, M.S., Coman, D., Yang, T., Gardner, R.J., Rose, E., Smith, R.J., Bahlo, M. and Dahl, H.H. (2007) A novel splice site mutation in EYA4 causes DFNA10 hearing loss, Am J Med Genet A, 143A, 1599-1604.
Hildebrand, M.S., de Silva, M.G., Gardner, R.J., Rose, E., de Graaf, C.A., Bahlo, M. and Dahl, H.H. (2006) Cochlear implants for DFNA17 deafness, Laryngoscope, 116, 2211-2215.
Hildebrand, M.S., de Silva, M.G., Tan, T.Y., Rose, E., Nishimura, C., Tolmachova, T., Hulett, J.M., White, S.M., Silver, J., Bahlo, M., Smith, R.J. and Dahl, H.H. (2007) Molecular characterization of a novel X-linked syndrome involving developmental delay and deafness, Am J Med Genet A, 143A, 2564-2575.
Leibon, G., Rockmore, D.N. and Pollak, M.R. (2008) A SNP streak model for the identification of genetic regions identical-by-descent, Stat Appl Genet Mol Biol, 7, Article16.
Li, M.X., Jiang, L., Ho, S.L., Song, Y.Q. and Sham, P.C. (2007) IGG: A tool to integrate GeneChips for genetic studies, Bioinformatics, 23, 3105-3107.
McPeek, M.S. (1999) Optimal allele-sharing statistics for genetic mapping using affected relatives, Genet Epidemiol, 16, 225-249.
McPeek, M.S. and Sun, L. (2000) Statistical tests for detection of misspecified relationships by use of genome-screen data, Am J Hum Genet, 66, 1076-1094.
Miyazawa, H., Kato, M., Awata, T., Kohda, M., Iwasa, H., Koyama, N., Tanaka, T., Huqun, Kyo, S., Okazaki, Y. and Hagiwara, K. (2007) Homozygosity haplotype allows a genomewide search for the autosomal segments shared among patients, Am J Hum Genet, 80, 1090-1102.
Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M.A., Bender, D., Maller, J., Sklar, P., de Bakker, P.I., Daly, M.J. and Sham, P.C. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses, Am J Hum Genet, 81, 559-575.
Ruschendorf, F. and Nurnberg, P. (2005) ALOHOMORA: a tool for linkage analysis using 10K SNP array data, Bioinformatics, 21, 2123-2125.
Schaid, D.J., McDonnell, S.K., Wang, L., Cunningham, J.M. and Thibodeau, S.N. (2002) Caution on pedigree haplotype inference with software that assumes linkage equilibrium, Am J Hum Genet, 71, 992-995.
Shearer, A.E., Hildebrand, M.S., Bromhead, C.J., Kahrizi, K., Webster, J.A., Azadeh, B., Kimberling, W.J., Anousheh, A., Nazeri, A., Stephan, D., Najmabadi, H., Smith, R.J. and Bahlo, M. (2009) A novel splice site mutation in the RDX gene causes DFNB24 hearing loss in an Iranian family, Am J Med Genet A, 149A, 555-558.
Storey, E., Bahlo, M., Fahey, M.C., Sisson, O., Lueck, C.J. and Gardner, R.M. (2008) A new dominantly-inherited pure cerebellar ataxia, SCA 30, J Neurol Neurosurg Psychiatry.
Thompson, E. (1995) Monte Carlo in Genetic Analysis. University of Washington.
Thompson, E. (2000) Statistical Inference from Genetic Data on Pedigrees. NSF-CBMS Regional Conference Series in Probability and Statistics.
Thorisson, G.A., Smith, A.V., Krishnan, L. and Stein, L.D. (2005) The International HapMap Project Web site, Genome Res, 15, 1592-1593.
Webb, E.L., Sellick, G.S. and Houlston, R.S. (2005) SNPLINK: multipoint linkage analysis of densely distributed SNP data incorporating automated linkage disequilibrium removal, Bioinformatics, 21, 3060-3061.