
GELSTATS: A Computer Program for Population Genetics Analyses Using VNTR Multilocus Probe Data

BioTechniques 21:1128-1131 (December 1996)

Steven H. Rogstad and Stephan Pelikan
University of Cincinnati, Cincinnati, OH, USA

ABSTRACT

GELSTATS, a computer program for population genetics analyses utilizing genetic markers revealed with variable number tandem repeat (VNTR) multilocus probes, is described and made available (both as C++ source code and as an executable DOS program). The program calculates several population genetics parameters, including: (i) individual and population band numbers; (ii) population bands exhibiting complete linkage (redundant examples of such bands can be removed in subsequent analyses); (iii) similarity (fraction of bands shared) between individuals and average similarity within and between designated groups; (iv) estimated probability that two individuals chosen at random will have identical band profiles; (v) heterozygosity estimates for designated groups; and (vi) Fst estimates. Nonparametric permutation methods are used to assess the significance of differences in both within- and between-group similarity. A jackknife test for heterozygosity differences between groups is also computed. Examples of GELSTATS analyses illustrate some features of the program.

INTRODUCTION

Loci for which alleles differ primarily due to variable number of tandem repeats (VNTR; Reference 14) of a “core” DNA sequence [including minisatellite (8,9), microsatellite (11) and simple sequence repeat (21) loci] are often extremely polymorphic. VNTR multilocus probes simultaneously reveal alleles at several different loci (8,17) and have been used to examine population genetics characteristics of several organisms (e.g., References 1, 3, 5 and 19). The GELSTATS computer program was designed to facilitate such studies by providing selected population genetics parameters based on VNTR multilocus probe data. GELSTATS calculates a wider range of parameters than previous programs [e.g., SIM (1) for analyzing similarity or ThumbPrint (3) for heterozygosity] and also uses, as described in the next section, nonparametric analyses for assessing the statistical significance of differences in similarity between populations. It offers an alternative to the approaches of Lynch (12,13).

GELSTATS analyses depend on several assumptions (detailed in the README file accompanying the program), including: (i) bands with the same migration distance are identical alleles at one locus; and (ii) bands are transmitted in a Mendelian fashion [several studies have now shown that, for the most part, bands are transmitted this way (e.g., References 2, 4, 18 and 23)]. In calculations involving estimated allele frequencies and heterozygosities, it is assumed that populations are in Hardy-Weinberg equilibrium at each locus.

MATERIALS AND METHODS

Traits for multilocus VNTR marker analyses are population bands (probed endonuclease DNA fragments appearing in autoradiographs or chemilumigraphs), and each individual is scored for the presence [1] or absence [0] of each population band. Thus, data sets submitted to GELSTATS include population band profiles for all individuals, each of which is coded as to population membership. The README file includes information on data collection, on American Standard Code for Information Interchange (ASCII) file data input, on creation of an ASCII output file that can be converted for use with statistical and word-processing programs, and on other practical and theoretical details of the program.

Data analyses include the following features. In the initial GELSTATS analysis, a list is provided of polymorphic population bands that are redundant and thus may be completely linked (all and only those individuals having one population band also have the “linked” band). Redundant cases of such putative, completely linked bands can be removed in subsequent analyses to examine how inclusion of such bands affects results (retention would give extra weight to such a locus).
Care must be taken to determine whether reported completely linked bands are in fact due to clonal growth (e.g., in plants, multiple samplings of ramets of one clonally spreading genet in which a unique population band occurs). Vol. 21, No. 6 (1996)
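The redundancy check described above can be illustrated with a short Python sketch. This is not the GELSTATS C++ source: the function names and the in-memory layout (one 0/1 list per individual, one column per population band) are assumptions made here for illustration only. Bands are grouped when exactly the same individuals carry them, and bands present in all or in no individuals are skipped because only polymorphic bands are listed.

```python
from collections import defaultdict


def find_redundant_bands(profiles):
    """Group polymorphic bands that occur in exactly the same individuals.

    profiles: list of equal-length 0/1 lists, one row per individual and
    one column per population band (hypothetical layout, not the GELSTATS
    input file format).
    Returns a list of groups (lists of column indices) with 2+ members.
    """
    n_individuals = len(profiles)
    groups = defaultdict(list)
    for band in range(len(profiles[0])):
        column = tuple(row[band] for row in profiles)
        count = sum(column)
        # Skip bands carried by everyone or by no one: not polymorphic.
        if 0 < count < n_individuals:
            groups[column].append(band)
    return [bands for bands in groups.values() if len(bands) > 1]


def drop_redundant(profiles, groups):
    """Keep the first band of each linked group and drop the others."""
    drop = {band for bands in groups for band in bands[1:]}
    return [[v for j, v in enumerate(row) if j not in drop] for row in profiles]


# Toy data: bands 0 and 2 are carried by exactly the same individuals.
profiles = [
    [1, 0, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
]
linked = find_redundant_bands(profiles)
print(linked)
print(drop_redundant(profiles, linked))
```

A check of this kind cannot tell linkage from clonal resampling, so identical rows (individuals with identical profiles) are worth inspecting before any band is dropped.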

Table 1. GELSTATS Similarity Tests for Humpback Whale Groups (Data from Reference 3)

Permutation Test of Similarity                 “p-value”

Within group G > within group C                0.643
Within group G > between groups G and C        0.001*
Within group C > between groups G and C        0.004*

Within group G > within group A                0.634
Within group G > between groups G and A        0.000*
Within group A > between groups G and A        0.000*

Within group G > within group H                0.559
Within group G > between groups G and H        0.000*
Within group H > between groups G and H        0.000*

Within group C > within group A                0.534
Within group C > between groups C and A        0.553
Within group A > between groups C and A        0.498

Within group C > within group H                0.735
Within group C > between groups C and H        0.497
Within group H > between groups C and H        0.151

Within group A > within group H                0.904
Within group A > between groups A and H        0.537
Within group H > between groups A and H        0.224

G = Gulf of Maine, C = central California, A = southeastern Alaska and H = Hawaii (average similarity within these groups = 0.306, 0.284, 0.290 and 0.315, respectively). An asterisk indicates that the test is significant at the “p-value” ≤0.05 level. See text for further explanation.

The frequency of each population band (number of individuals in which a population band occurs divided by the total number of individuals in the group of interest) in each group and across all individuals is calculated. The number of bands for each individual, mean number of bands per individual ($\bar{x}$) by group and across the entire sample (with s.d. and s.e.) and permutation tests (Reference 6; see below) of whether these mean values differ between groups are given. A table of all possible pair-wise comparisons of similarity is provided [$s = 2N_{XY}/(N_X + N_Y)$, where $N_{XY}$ is the number of bands shared between individuals X and Y, and $N_X$ and $N_Y$ are the number of bands in individuals X and Y, respectively; References 12, 13, 16 and 24]. This similarity matrix can be imported into other statistical packages (see example below). The program computes, for each designated group, for all possible pair-wise comparisons of individuals between groups, and for the whole data set, both the average interindividual similarity ($\bar{s}$) and an estimate of the probability ($\bar{s}^{x}$) that two individuals randomly chosen from a group will have identical banding patterns (9). Allele frequency ($p$) for each population band is estimated (according to References 10 and 20) both within each group and across all individuals. These frequencies are used to estimate the number of loci ($L$) and heterozygosities ($H$) for groups and across all individuals (calculated according to both References 10 and 20).

Permutation tests (6) are conducted by GELSTATS to explore whether groups differ in the following: (i) average number of bands per individual; (ii) average within-group interindividual similarity; and (iii) average within-group interindividual similarity vs. average between-group interindividual similarity. All of these tests take the same form, and more details are given in the README file. In general, for a particular permutation test of whether two groups differ, the value of the statistic being tested is computed for one group (the target group). Then, individuals are drawn, without replacement, at random from both groups to construct a “randomized” group the same size as the target group. For each randomized group, the “statistic of interest” is computed. By default, 5000 randomized groups are so analyzed (this default value can be changed), and the fraction of all the randomized groups that yield values of the statistic more extreme than that observed for the target group is used as a “p-value” for assessing whether the two groups differ in the “statistic of interest”.

For any two populations, the permutation tests for similarity comparisons include three types of tests. First tested is whether similarity within one group differs from that in another group. Note that it is possible for two groups to have exactly the same level of similarity but simultaneously be completely genetically differentiated in population markers (e.g., within each group, all members share all bands, but no bands are shared between the groups). To test the degree of genetic differentiation, two further permutation tests are conducted: one in which the level of similarity within one group is compared to similarity for intergroup comparisons, and another in which similarity for the other group is compared to the intergroup comparisons.
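As a rough illustration of the similarity index and of the general permutation procedure described above, the following Python sketch (not the GELSTATS C++ source) computes s for pairs of 0/1 band profiles, averages it within a group and estimates a one-sided “p-value” from randomized groups drawn from the pooled individuals. The function names, the list-based data layout, the zero returned for two bandless profiles and the choice to count only randomized values strictly greater than the observed one as “more extreme” are all assumptions made for illustration. Only the within-group form of the test is sketched; how the program builds the within-group vs. between-group comparison is left to the README.

```python
import random
from itertools import combinations


def similarity(x, y):
    """Band-sharing similarity s = 2*N_XY / (N_X + N_Y) for two 0/1 profiles."""
    shared = sum(a & b for a, b in zip(x, y))   # N_XY
    total = sum(x) + sum(y)                     # N_X + N_Y
    # Returning 0.0 when neither profile has bands is an assumption.
    return 2 * shared / total if total else 0.0


def mean_within_similarity(group):
    """Average s over all pairs of individuals in one group (needs 2+ members)."""
    pairs = list(combinations(group, 2))
    return sum(similarity(x, y) for x, y in pairs) / len(pairs)


def permutation_p_value(target, other, statistic, n_rand=5000, seed=None):
    """Fraction of randomized groups whose statistic exceeds the target's.

    Each randomized group has the size of the target group and is drawn
    without replacement from the pooled individuals of both groups.
    """
    rng = random.Random(seed)
    observed = statistic(target)
    pooled = list(target) + list(other)
    more_extreme = 0
    for _ in range(n_rand):
        randomized = rng.sample(pooled, len(target))
        if statistic(randomized) > observed:
            more_extreme += 1
    return more_extreme / n_rand


# Toy profiles (invented, not data from the paper).
group_g = [[1, 1, 0, 0, 1], [1, 1, 0, 0, 1], [1, 1, 0, 1, 1], [1, 1, 0, 0, 1]]
group_c = [[0, 1, 1, 0, 0], [1, 0, 1, 1, 0], [0, 0, 1, 0, 1], [1, 1, 0, 1, 0]]
print(mean_within_similarity(group_g), mean_within_similarity(group_c))
print(permutation_p_value(group_g, group_c, mean_within_similarity, seed=1))
```

Passing the statistic as a function mirrors the statement that all of the tests take the same form: the same routine could be given mean band number per individual instead of mean similarity.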
In the latter two tests, the degree to which within-group similarities exceed between-group similarities reflects the degree to which populations are genetically differentiated.

To examine whether the levels of heterozygosity estimated for two populations differ, a jackknife (22) resampling test is conducted in which heterozygosity is calculated for subgroups of each population. Subgroups are of size n-2, where n is the size of the population, randomly drawn from each population without replacement. The n-2 subgroup size is used to increase the number of possible representative subgroup comparisons. Such resampling comparisons between populations are conducted 5000 times (by default; a different number can be specified) with a “p-value” reported that is the fraction of the comparisons where one population has a heterozygosity greater than the other.

The nonparametric tests used by GELSTATS offer advantages over several other statistical approaches for the analysis of VNTR data sets (6). Tests using complex statistics with unknown sampling distributions (for example, H, which estimates heterozygosity) can be performed without requiring large sample sizes. Groups of unequal size or with unequal numbers of bands can be analyzed. Permutation tests avoid the problems of dependency discussed by Lynch (12,13), are frequently more powerful than rank-based, nonparametric tests and generally produce exact probabilities (6).
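The resampling comparison of heterozygosity described above can be sketched in Python along the following lines. This is an illustration under stated assumptions, not the GELSTATS implementation: the function names are hypothetical, and `placeholder_heterozygosity` is a deliberately simple stand-in (band frequency f converted to an allele frequency with the dominant-marker relation p = 1 - sqrt(1 - f), then 2p(1 - p) averaged over bands). It is not the estimator of Reference 10 or Reference 20, and any statistic with the same call signature could be substituted.

```python
import math
import random


def placeholder_heterozygosity(group):
    """Stand-in statistic, NOT the estimators of References 10 or 20."""
    n = len(group)
    values = []
    for band in range(len(group[0])):
        f = sum(row[band] for row in group) / n   # band frequency
        p = 1.0 - math.sqrt(1.0 - f)              # assumed dominant-marker relation
        values.append(2.0 * p * (1.0 - p))
    return sum(values) / len(values)


def resampling_p_value(pop_a, pop_b, statistic, n_rounds=5000, seed=None):
    """Fraction of rounds in which a subgroup of A scores higher than one of B.

    In each round a subgroup of size n-2 is drawn without replacement from
    each population, where n is that population's size (needs n >= 3).
    """
    rng = random.Random(seed)
    a_greater = 0
    for _ in range(n_rounds):
        sub_a = rng.sample(pop_a, len(pop_a) - 2)
        sub_b = rng.sample(pop_b, len(pop_b) - 2)
        if statistic(sub_a) > statistic(sub_b):
            a_greater += 1
    return a_greater / n_rounds


# Toy populations (invented): A is variable, B is monomorphic.
pop_a = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1, 1, 0, 0]]
pop_b = [[1, 1, 0, 0]] * 5
print(resampling_p_value(pop_a, pop_b, placeholder_heterozygosity, seed=1))
```

With five individuals per population, subgroups of size n-2 allow 10 distinct subgroups per population, whereas leaving out one individual at a time would allow only 5, which is consistent with the stated reason for choosing n-2.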

RESULTS AND DISCUSSION

A data set was analyzed with GELSTATS to demonstrate selected features of the program and to show how GELSTATS permutation tests of similarity compare with analyses of the same similarity matrices using the nonmetric multidimensional scaling (MDS) module of SYSTAT (Reference 25; the README file gives instructions for importing similarity matrices calculated by GELSTATS into SYSTAT). The data set includes the combined data from Tables 4–6 in Reference 3 (data from VNTR multilocus probes 33.15, 3′HVR and M13, respectively). This data set consists of band profiles from 20 humpback whales (five from each of four groups, here denoted G = Gulf of Maine, C = central California, A = southeastern Alaska and H = Hawaii) with a total of 168 population bands. An initial GELSTATS analysis of these whale data detected no monomorphic population bands (considering all 20 individuals), although 35 non-monomorphic population bands redundant with other bands (completely linked?) were found, and these redundant bands have been removed from the following analyses of the remaining 133 population bands.

Results of GELSTATS within and between group similarity tests are given in Table 1. Note that although average similarities within groups differ slightly, these values are not significantly different. However, as noted above, groups can have equal levels of within-group similarity and yet be genetically differentiated. An examination of the within-group vs. between-group similarity permutation tests reveals that only those tests involving the G group indicate genetic differentiation has occurred (note tests with asterisks), while differentiation is not detected in any comparisons among the A, C or H groups.

The latter points are supported by SYSTAT MDS analysis of the same GELSTATS similarity matrix. Figure 1 shows the placement of individual whales in the first two dimensions of the MDS analysis (accounting for 81.8% of the variance after 20 iterations with stress stabilizing at 0.15). Note that, in agreement with the permutation tests of similarity, the only group that is clearly differentiated is the G group.

Figure 1. Nonmetric multidimensional scaling analysis of VNTR genetic markers in humpback whale individuals. Data taken from Reference 3. Placement of individuals in the first two dimensions based on analysis of a similarity matrix generated by GELSTATS is shown. Sampled individuals were from four groups: G = Gulf of Maine; C = central California; A = southeastern Alaska; and H = Hawaii. More details are given in the text.

GELSTATS provides two different estimates (10,20) of heterozygosity (h). The bias-corrected h values (10) for the combined data sets with redundant examples of possibly linked bands removed are: Gulf of Maine h = 0.8922; California h = 0.8076; Alaska h = 0.7685; and Hawaii h = 0.7445. Resampling tests indicate that none of these values are significantly different. GELSTATS estimates Fst over the entire data set by two methods: (i) Fst as described by Lynch (References 12 and 13; using similarity values) = 0.096; and (ii) Fst as described by Nei (References 7 and 15; using heterozygosity values) = 0.073.

The above examples demonstrate some of the types of data analyses possible with GELSTATS and show that the permutation tests can be used in conjunction with previously utilized tests (e.g., MDS) to further interpret data. Hopefully, GELSTATS will facilitate analyses of VNTR multilocus probe genetic data at the population level. The program and the supporting README file are available as GELSTATS.ZIP at ftp.uc.edu.

ACKNOWLEDGMENTS

Both authors contributed equally to this effort; author order was determined by coin toss. We thank D. Busemeyer, B. Keane, H. Lim, H. Mills and 3 anonymous reviewers. A portion of this research was supported by funds from NSF grant DEB 9096317 (S.H.R.) and from the University of Cincinnati.

REFERENCES

1. Alberte, R.S., G.K. Suba, G. Procaccini, R.C. Zimmerman and S.R. Fain. 1994. Assessment of genetic diversity of seagrass populations using DNA fingerprinting: implications for population stability and management. Proc. Natl. Acad. Sci. USA 91:1049-1053.
2. Arens, P., P. Odinot, A.W. van Heusden, P. Lindhout and B. Vosman. 1995. GATA- and GACA-repeats are not evenly distributed throughout the tomato genome. Genome 38:84-90.
3. Baker, C.S., D.A. Gilbert, M.T. Weinrich, R. Lambertsen, J. Calambokidis, B. McArdle, G.K. Chambers and S.J. O’Brien. 1993. Population characteristics of DNA fingerprints in humpback whales (Megaptera novaeangliae). J. Hered. 84:281-290.
4. Dow, B.D., M.V. Ashley and H.F. Howe. 1995. Characterization of highly variable (GA/CT)n microsatellites in the bur oak, Quercus macrocarpa. Theor. Appl. Gen. 91:137-141.
5. Gilbert, D.A., N. Lehman, S.J. O’Brien and R.K. Wayne. 1990. Genetic fingerprinting reflects population differentiation in the California Channel Island fox. Nature 344:764-766.
6. Good, P. 1993. Permutation Tests. Springer-Verlag, New York.
7. Hartl, D.L. and A.G. Clark. 1989. Principles of Population Genetics, 2nd ed. Sinauer Associates, Sunderland, MA.
8. Jeffreys, A.J., V. Wilson and S.L. Thein. 1985. Hypervariable ‘minisatellite’ regions in human DNA. Nature 314:67-73.
9. Jeffreys, A.J., V. Wilson and S.L. Thein. 1985. Individual-specific ‘fingerprints’ of human DNA. Nature 316:76-79.
10. Jin, L. and R. Chakraborty. 1993. A bias-corrected estimate of heterozygosity for single-probe multilocus DNA fingerprints. Mol. Bio. Evo. 10:1112-1114.
11. Litt, M. and J.A. Luty. 1989. A hypervariable microsatellite revealed by in vitro amplification of a dinucleotide repeat within the cardiac muscle actin gene. Am. J. Hum. Genet. 44:397-401.
12. Lynch, M. 1990. The similarity index and DNA fingerprinting. Mol. Bio. Evo. 7:478-484.
13. Lynch, M. 1991. Analysis of population genetic structure by DNA fingerprinting. In T. Burke, G. Dolf, A.J. Jeffreys and R. Wolff (Eds.), DNA Fingerprinting: Approaches and Applications. Birkhauser Verlag, Basel.
14. Nakamura, Y., M. Leppert, P. O’Connell, R. Wolff, T. Holm, M. Culver, C. Martin, E. Fujimoto, M. Hoff, E. Kumlin and R. White. 1987. Variable number of tandem repeat (VNTR) markers for human gene mapping. Science 235:1616-1622.
15. Nei, M. 1973. Analysis of gene diversity in subdivided populations. Proc. Natl. Acad. Sci. USA 70:3321-3323.
16. Nei, M. and W-H. Li. 1979. Mathematical model for studying genetic variation in terms of restriction endonucleases. Proc. Natl. Acad. Sci. USA 76:5269-5273.
17. Rogstad, S.H. 1993. Surveying plant genomes for variable number of tandem repeats loci. Methods Enzymol. 224:278-294.
18. Rogstad, S.H. 1994. Inheritance in turnip of variable number tandem repeat genetic markers revealed with synthetic repetitive DNA probes. Theor. Appl. Gen. 89:824-830.
19. Scribner, K.T., J.W. Arntzen and T. Burke. 1994. Comparative analysis of intra- and interpopulation genetic diversity in Bufo bufo, using allozyme, single-locus microsatellite, minisatellite, and multilocus minisatellite data. Mol. Bio. Evo. 11:737-748.
20. Stephens, J.C., D.A. Gilbert, N. Yuhki and S.J. O’Brien. 1992. Estimation of heterozygosity for single-probe multilocus DNA fingerprints. Mol. Bio. Evo. 9:729-743.
21. Tautz, D. and M. Renz. 1984. Simple sequences are ubiquitous repetitive components of eukaryotic genomes. Nucleic Acids Res. 12:4127-4138.
22. Weir, B.S. 1990. Genetic Data Analysis. Sinauer Associates, Inc., Sunderland, MA.
23. Weising, K., H. Nybom, K. Wolff and W. Meyer. 1995. DNA Fingerprinting in Plants and Fungi. CRC Press, Boca Raton.
24. Wetton, J.H., R.E. Carter, D.T. Parkin and D. Walters. 1987. Demographic study of a wild house sparrow population by DNA fingerprinting. Nature 327:147-149.
25. Wilkinson, L. 1990. SYSTAT Manual. SYSTAT Inc., Evanston, IL.

Received 28 May 1996; accepted 21 October 1996.

Address correspondence to:
Steven Rogstad
Biological Sciences ML6
University of Cincinnati
Cincinnati, OH 45221-0006, USA
Internet: [email protected]
