Short Communication

doi: 10.1111/j.1469-1809.2005.00225.x

Program Report: GENECOUNTING Support Programs

D. Curtis¹, J. Knight² and P. C. Sham²
¹ Department of Adult Psychiatry, Royal London Hospital, Whitechapel, London E1 1BB, UK
² Social, Genetic and Developmental Psychiatry Research Centre, Institute of Psychiatry, De Crespigny Park, London SE5 8AF, UK

Summary
We describe a suite of programs which enhance the usability of GENECOUNTING, a program for estimating haplotype frequencies in unrelated subjects. The programs, called RUNGC, SCANASSOC, COMPGR, SCANGROUP and LDPAIRS, carry out likelihood ratio tests and permutation tests to detect differences in haplotype frequencies between cases and controls, or between predefined groups, and output likely haplotype assignments and tables of linkage disequilibrium statistics between all pairs of markers in a dataset.

Annals of Human Genetics (2006) 70, 277–279
© University College London 2005

Report
Here we describe a suite of programs which enhance the usability of GENECOUNTING. These programs carry out heterogeneity tests to detect differences between haplotype frequencies of cases and controls in association studies. They also facilitate obtaining measures of linkage disequilibrium (LD) between markers and can be used to check for discrepancies between haplotype frequencies in different groups of subjects, for example subjects genotyped on different DNA plates.

GENECOUNTING is a program which uses the expectation-maximisation (EM) algorithm to estimate haplotype frequencies in a sample of unrelated subjects (Zhao et al. 2002). In such samples the phase of multilocus genotypes is unknown and haplotype frequencies need to be estimated. The use of the EM algorithm for this purpose was proposed by Smith (1957). Other approaches to haplotype estimation have been implemented, such as Bayesian approaches based on population genetics models (Stephens & Donnelly, 2003), but these may not necessarily deliver useful advantages (Adkins, 2004). The EM method was implemented in the EH program (Xie & Ott, 1993) and the efficiency of this implementation was improved in the EH+ and FEH programs (Zhao et al. 2000; Zhao & Sham, 2002), which provided faster performance and the ability to deal with more demanding datasets having larger numbers of possible haplotypes.

The most recent of these implementations is GENECOUNTING (Zhao et al. 2002), which allows the estimation of haplotype frequencies even if some subjects are not genotyped at some markers. With multiple multiallelic markers the number of haplotypes which might possibly be formed rapidly reaches many thousands, placing performance-limiting demands on computer time and memory. The EM algorithm in GENECOUNTING restricts attention to the haplotypes which might possibly occur among the genotyped subjects. This typically means that far fewer haplotypes need to be considered, leading to a dramatic increase in efficiency. Although other programs also implement an EM algorithm for haplotype frequency estimation, none can handle both multiallelic markers and missing data. The efficiency of GENECOUNTING means that it can deal with hundreds of haplotypes. This is ample for realistic sample sizes with subject numbers ranging up to, at most, a few thousand. GENECOUNTING accepts as input a set of multilocus genotypes, along with a count of how often each one occurs, and outputs the log likelihood and estimated haplotype frequencies under the null hypothesis of no LD between markers, and under the alternative hypothesis that LD is present between all markers according to maximum likelihood haplotype frequencies. For each observed genotype it also outputs the posterior probability for the genotype to consist of each possible pair of constituent haplotypes, under the assumption that LD is present.

The support programs provided are called RUNGC, SCANASSOC, COMPGR, SCANGROUP and LDPAIRS and aim to enhance the usability of GENECOUNTING in a number of ways. They provide heterogeneity tests to determine whether haplotype frequencies are different in cases and controls, or alternatively whether they differ in a particular group of subjects such as all those genotyped on one plate. Statistical significance is measured by likelihood ratio tests (LRTs) and permutation tests. Probable haplotype assignments for each subject can be output. Tests can be automatically run on subsets of markers, either taking groups of consecutive markers or all possible combinations of a given number of markers. Finally, a table of linkage disequilibrium statistics between all pairs of markers can be produced.

RUNGC carries out heterogeneity tests and outputs haplotype assignments. It accepts an input file containing multilocus genotypes for cases and controls and then uses GENECOUNTING to obtain log likelihoods and estimate haplotype frequencies, assuming LD between all markers, in the cases, the controls and the combined sample. The LRT for heterogeneity of haplotype frequencies between cases and controls is obtained by comparing the log likelihood for the whole dataset to the sum of the log likelihoods for the cases and controls, considered separately, to form a likelihood ratio statistic, LRS = 2(L_CASE + L_CONTROL − L_ALL). This can be used in an LRT by treating the LRS as a chi-squared statistic with degrees of freedom equal to M − 1, where M is the number of haplotypes having non-zero frequency in any of the three samples. (If a haplotype is estimated to have non-zero frequency in one sample and zero frequency in another then it still contributes a degree of freedom to the analysis.) The haplotype frequencies estimated to occur in cases, controls and the combined sample are output alongside each other.
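The heterogeneity LRT described for RUNGC can be illustrated with a minimal Python sketch. It is not the RUNGC source: the function names, the dictionary representation of haplotype frequencies and the series-based chi-squared tail function are all assumptions made for illustration, and the log likelihoods are taken as given (in practice they would come from GENECOUNTING runs on the cases, the controls and the combined sample).

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variate with df degrees of freedom.

    Uses the series expansion of the regularised lower incomplete gamma
    function. Adequate for a sketch; a statistics library would normally be used.
    """
    if df <= 0:
        raise ValueError("degrees of freedom must be positive")
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    for n in range(1, 100000):
        term *= z / (a + n)
        total += term
        if term < total * 1e-16:
            break
    lower = total * math.exp(a * math.log(z) - z - math.lgamma(a))
    return max(0.0, 1.0 - lower)


def heterogeneity_lrt(loglik_case, loglik_control, loglik_all,
                      freqs_case, freqs_control, freqs_all):
    """Return (LRS, degrees of freedom, asymptotic P value).

    The freqs_* arguments are hypothetical dicts mapping a haplotype label
    to its estimated frequency in that sample.
    """
    # LRS = 2(L_CASE + L_CONTROL - L_ALL)
    lrs = 2.0 * (loglik_case + loglik_control - loglik_all)
    # M counts haplotypes with non-zero frequency in any of the three samples.
    nonzero = {h
               for freqs in (freqs_case, freqs_control, freqs_all)
               for h, f in freqs.items() if f > 0}
    df = len(nonzero) - 1
    return lrs, df, chi2_sf(lrs, df)


if __name__ == "__main__":
    # Made-up numbers, purely to show the call.
    cases = {"1-1": 0.6, "1-2": 0.4, "2-1": 0.0}
    controls = {"1-1": 0.5, "1-2": 0.3, "2-1": 0.2}
    combined = {"1-1": 0.55, "1-2": 0.35, "2-1": 0.10}
    print(heterogeneity_lrt(-100.0, -120.0, -223.0, cases, controls, combined))
```

In this made-up example the haplotype "2-1" has zero estimated frequency in cases but still counts towards M, matching the rule stated in the text.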
The asymptotic distribution for the LRT may be unreliable if there are large numbers of possible haplotypes and some are relatively rare, so RUNGC provides the facility to perform a permutation test to obtain an empirical estimate of significance. To do this, case and control labels are permuted at random and the LRS obtained is compared with that obtained from the real dataset. An estimate of significance is then given by P = (r + 1)/(N + 1), where N is the number of permuted datasets tested and r is the number of times a permuted dataset produces an LRS at least as high as that produced by the real dataset (North et al. 2002, 2003).

Optionally, RUNGC can produce a set of possible haplotype assignments for each subject in the dataset. To do this it reads the GENECOUNTING output, which shows every possible haplotype combination for each observed genotype, along with its probability, treating the case and control datasets separately under the assumption that haplotype frequencies may differ between the two groups. For each subject in the dataset it then outputs this list of haplotype combinations and probabilities according to the genotype of that subject. This would allow a laboratory to identify the subjects having a high probability of carrying particular haplotypes, for example those estimated to occur more commonly in cases. These subjects could then be targets for mutation-screening.

SCANASSOC provides the ability to automatically select and test subsets of markers for association in case-control datasets. For a specified number of markers in each subset one can either choose all sets of consecutive markers, forming a sliding window, or all possible combinations of that number of markers selected from the whole dataset. For each subset a heterogeneity test for association is carried out as described above, with a P value assigned based on the asymptotic distribution. Since this P value may be unreliable when large numbers of haplotypes occur, any subsets appearing to produce statistically significant evidence for association can then be input to RUNGC and permutation-testing performed. SCANASSOC does not itself perform correction for the multiple-testing involved in selecting different sets of markers, and any P values derived need to be interpreted in the context of the number of different tests performed.
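The empirical P value used in the RUNGC permutation test, P = (r + 1)/(N + 1), can be sketched in Python as follows. This is an illustration under stated assumptions rather than the RUNGC implementation: `empirical_p_value` and `carrier_difference` are hypothetical names, and the toy statistic stands in for the LRS, which in practice would require re-running GENECOUNTING on each permuted dataset.

```python
import random


def empirical_p_value(labels, statistic, n_permutations=999, seed=0):
    """Permutation P value, P = (r + 1) / (N + 1).

    labels    : sequence of "case"/"control" labels, one per subject
    statistic : callable taking a label sequence and returning a number
                (a stand-in for the LRS in this sketch)
    """
    rng = random.Random(seed)
    observed = statistic(labels)
    shuffled = list(labels)
    r = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # permute case and control labels at random
        if statistic(shuffled) >= observed:
            r += 1  # permuted statistic at least as high as the real one
    return (r + 1) / (n_permutations + 1)


def carrier_difference(labels, carriers=(1, 1, 1, 0, 1, 0, 0, 0)):
    """Toy statistic: absolute difference in carrier rate, cases vs controls."""
    case = [c for c, lab in zip(carriers, labels) if lab == "case"]
    ctrl = [c for c, lab in zip(carriers, labels) if lab == "control"]
    return abs(sum(case) / len(case) - sum(ctrl) / len(ctrl))


if __name__ == "__main__":
    real_labels = ["case"] * 4 + ["control"] * 4
    print(empirical_p_value(real_labels, carrier_difference))
```

Adding 1 to both numerator and denominator means the estimate can never be zero, so the smallest attainable value with N permutations is 1/(N + 1).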
COMPGR functions in a similar way to RUNGC but compares haplotype frequencies between groups of subjects, rather than simply between cases and controls. The typical application would be to compare haplotype frequencies between different plates of DNA samples. Often, if a number of genotyping errors with a marker occur on one plate they will have only a modest effect on marker allele frequency, but may lead to the apparent formation of a number of novel haplotypes which are absent or rare in accurately typed data. If the plate contains only cases or only controls, and if it is not recognised that the effect is restricted to a single plate, then one may observe significant differences between case and control haplotype frequencies as a result of such errors. COMPGR compares each group against the rest, performs heterogeneity tests as described for RUNGC and reports the associated P values. Significant P values might draw the attention of the researcher to possible genotyping problems and these genotypes could be checked and investigated further. Another application might be to compare haplotype frequencies across different research centres or across different ethnic groups.

SCANGROUP performs analogously to SCANASSOC in that it automatically selects combinations of markers to test but then looks for haplotype frequency differences between groups rather than between cases and controls. Any significant differences found can be explored further using COMPGR.

LDPAIRS measures LD between all pairs of markers in the dataset and outputs these in table format. In order to provide a measure of LD between multiallelic markers, Cramér's V (Bishop et al. 1975) is calculated and is output along with its associated P value. The absolute value of D', as measured between the commonest allele at each of the two markers, is also output.

These support programs considerably enhance the usability of GENECOUNTING. They are available as C source and DOS/Windows executables at http://www.mds.qmul.ac.uk/statgen/dcurtis/software.html.
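The two pairwise measures reported by LDPAIRS can be sketched in Python using their standard definitions. The report does not say how LDPAIRS builds its table, so this sketch assumes a table of two-marker haplotype counts (rows are alleles at the first marker, columns are alleles at the second, every allele observed at least once); the function names are hypothetical.

```python
import math


def cramers_v(table):
    """Cramér's V for an r x c table of two-marker haplotype counts."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1  # min(r - 1, c - 1)
    return math.sqrt(chi2 / (n * k))


def abs_d_prime(table):
    """|D'| between the commonest allele at each of the two markers."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    i = row_tot.index(max(row_tot))  # commonest allele at marker 1
    j = col_tot.index(max(col_tot))  # commonest allele at marker 2
    p_a, p_b, p_ab = row_tot[i] / n, col_tot[j] / n, table[i][j] / n
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return abs(d) / d_max if d_max > 0 else 0.0


if __name__ == "__main__":
    complete_ld = [[50, 0], [0, 50]]
    no_ld = [[25, 25], [25, 25]]
    print(cramers_v(complete_ld), abs_d_prime(complete_ld))
    print(cramers_v(no_ld), abs_d_prime(no_ld))
```

Because the counts may be estimated rather than observed, the table entries are allowed to be non-integer here.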

References
Adkins, R. M. (2004) Comparison of the accuracy of methods of computational haplotype inference using a large empirical dataset. BMC Genet 5, 22.
Bishop, Y. M. M., Fienberg, S. E. & Holland, P. W. (1975) Discrete Multivariate Analysis: Theory and Practice, pp. 385–386. Cambridge, Mass: MIT Press.
North, B. V., Curtis, D. & Sham, P. C. (2002) A note on the calculation of empirical P values from Monte Carlo procedures. Am J Hum Genet 71, 439–441.
North, B. V., Curtis, D. & Sham, P. C. (2003) A note on calculation of empirical P values from Monte Carlo procedure. Am J Hum Genet 72, 498–499.
Smith, C. A. (1957) Counting methods in genetical statistics. Ann Hum Genet 21, 254–276.
Stephens, M. & Donnelly, P. (2003) A comparison of Bayesian methods for haplotype reconstruction from population genotype data. Am J Hum Genet 73, 1162–1169.
Xie, X. & Ott, J. (1993) Testing linkage disequilibrium between a disease gene and marker loci. Am J Hum Genet 53, 1107.
Zhao, J. H., Curtis, D. & Sham, P. C. (2000) Model-free analysis and permutation tests for allelic associations. Hum Hered 50, 133–139.
Zhao, J. H. & Sham, P. C. (2002) Faster haplotype frequency estimation using unrelated subjects. Hum Hered 53, 36–41.
Zhao, J. H., Lissarrague, S., Essioux, L. & Sham, P. C. (2002) GENECOUNTING: haplotype analysis with missing genotypes. Bioinformatics 18, 1694–1695.

Received: 25 April 2005 Accepted: 16 June 2005
