POPSIM: a general population simulation program

7 downloads 0 Views 316KB Size Report
individuals in size to be generated and to be followed over hundreds of generations. ..... evolution of genetic variability in a small population. Comput. Applic.
BIOINFORMATICS

(%  '(  !+   

POPSIM: a general population simulation program ("' &)  "(&+ #'$*  , ' "*#*  ' ,* /*'*!  ," )*,&', ( ##' ' '+,#,-, ( #% ',#+ "*#, #% "((% *%#' *&'. '  ',* (* (%-%* ##' *%#' *&'.                 

Abstract Motivation: The investigation of common disorders of polygenic inheritance using both population-based association designs and non-parametric linkage analysis within families is gaining increasing importance. We have created a program that allows for the flexible simulation of populations as a tool to investigate the properties of population-based mapping approaches. Results: We have created a population simulation program, POPSIM, that (i) creates a virtual representation of every individual, (ii) makes no prior assumptions but the Mendelian rules and (iii) allows populations of several million individuals in size to be generated and to be followed over hundreds of generations. The parameters of the disease model, population structure and population expansion rate can be specified. Flexible sampling options exist that allow samples of families and individuals to be drawn at any given point during the population history. The program may be a useful tool in the study of the influence of genetic drift, recombination and admixture on the generation and maintenance of linkage disequilibrium in populations, as well as the evaluation of stochastic sampling characteristics of families and individuals conditional on a complex genetic phenotype from homogeneous and heterogeneous populations. Availability: The source code as well as Sun and Windows NT4.0 console executables of the program are available under http://www.ukrv.de/ch/medgen/html/benutzer/j.hampe/popsi m.html. Contact: [email protected] Introduction Simulation techniques play an important role in applied genetics and in the development of new statistical methods for linkage and association analysis. Currently available

4To whom correspondence should be addressed at: 1st Department of Medicine, Christian-Albrechts-University Kiel, D-24105 Kiel, Germany

458

programs include SIMULATE (Terwilliger and Ott, 1994), SIMLINK (Boehnke, 1986) and SLINK (Weeks et al., 1990), and the more general GASP (Wilson et al., 1996) programs which simulate the genotypes of families under a defined structure and preset linkage parameters. A simulation of a given pedigree using SLINK is now a common prerequisite before embarking on a genome-wide linkage analysis in disorders where reasonable parametric estimates on the disease model exist. However, for complex genetic disorders like diabetes, arteriosclerosis, bipolar disorder or inflammatory bowel disease, the genotype to phenotype relationship is much more complex (Risch, 1990; Cox et al., 1992; Orholm et al., 1993; Risch and Botstein, 1996). Possible approaches for the ultimate mapping of susceptibility genes responsible for these conditions include non-parametric linkage analysis (using recombination events within the families under investigation) and linkage disequilibrium mapping (association studies—using recombination events over the past generations in the population). In these approaches, population structure and sampling effects may have profound effects on the results (Spielman et al., 1993, 1994; Suarez et al., 1994; Thompson and Neel, 1997). Several analytical models for defined scenarios of population history and structure have been developed for the prediction of allele frequencies and disequilibrium parameters (Kaplan and Weir, 1995; Kaplan et al., 1995), which have been applied successfully in isolated populations (Hästbacka et al., 1992, 1994). Simulation-based statistics have been used to assess the sensitivity and specificity of non-parametric linkage analysis (Almasy et al., 1995; Devlin and Risch, 1995; Davis et al., 1996) which are each directed at a specific family structure. The use of predefined statistical distributions for the prediction of certain population parameters in the mentioned parametric simulation approaches may not be generally applicable. Another approach uses direct simulation, whereby a representation of every individual is generated. The first implementation of this approach by MacCluer (1967; POPGEN) has evolved into a set of Fortran programs that  Oxford University Press

Population simulation with POPSIM

have been used by the ‘Genetic Analysis Workshops’ (MacCluer et al., 1995) and allow—by using different components from these programs—a variety of scenarios to be implemented. LINKERS, SNAPPERS and GRASPERS (Yang et al., 1990; Prasad, 1992; Ackerman et al., 1993; Goay et al., 1995) for the VMS environment allow very detailed simulation scenarios, but are limited to two major loci and two alleles per locus. The underlying simulation concept has been ported to a C++ library ‘SIMEX’ that enables the user to build custom models in C++ which may also be applied to infectious disorders like spread of influenza (Peterson et al., 1993). SMALLPOP (Fournet et al., 1995) is limited to small populations and is directed at animal breeding using external selection criteria. The population-based programs are very resource intensive and often require additional programming for efficient use. We therefore implemented a program under the following aims: (i) to enable the direct simulation of populations of a realistic size (i.e. several million individuals); (ii) making no prior assumptions but the Mendelian rules; (iii) to enable the specification of relevant population parameters; (iv) to deliver the output in a format that is easily processed by analysis programs for both parametric and non-parametric linkage analysis; (v) to include all experimental information into a set-up file that is suitable for batch processing.

System and methods POPSIM was written in ANSI C and uses C standard libraries. System-specific versions for Sun Solaris and a Windows NT4.0 console application are also available. Efforts to minimize the memory requirements of the program have included the use of half-bytes for the storage of genotype information and the immediate replacement of the parents by their children in the population space. The central ‘breeding’ routines use exclusively integer and bit operations, all float point operations have been moved into initialization routines.

Specification of the experiment In order for the program to be as flexible as possible and to be easily used in batch processing, all information on the simulation experiment is specified in a text file (extension *.pop_init). Here, a variety of settings pertaining to the structure of the population, susceptibility locus and marker information, disease model, population propagation and sampling characteristics can be specified (Table 1). These data are read into an ‘experiment’ structure during the experimental set-up routines.

Table 1. Variables specified in the experiment set-up file Population generation Number of population replicates Initial size of the population Number of subpopulations and their ratio in the final population Random seed for population generation Population propagation Number of generations Average number of kids per family and the proportion of the population attempting to reproduce—a method of specifying population growth or shrinkage Random seed for propagation Locus structure and disease model Number of susceptibility loci (all unlinked to each another) Initial allele frequencies of the risk allele for each locus and subpopulation Recombination fraction for the marker loci linked to each susceptibility locus (no interference) Initial association of a founder haplotype with the risk allele of the susceptibility locus Disease model (additive polygenic, classical inheritance, 2 locus penetrance matrix) Weighting factors for each locus, weight of the environmental factor for the additive disease model Cut-off value for affection Sampling and I/O streams For each Checkpoint: (i) number of generations after which sampling is performed; (ii) type of sample to be drawn Sample types: Sib-pairs and nuclear families with the number of affected individuals per family to be specified Output either on standard output or in a dedicated file for each type of sample

459

J.Hampe et al.

Fig. 1. Implementation of a virtual individual: The internal storage format is symbolized in boxes, giving the type of information and the format of storage (0/1: bit; 0..15: half-byte; 0..255: byte) on the top and bottom, respectively. Risk locus information contains the genotype of the susceptibility gene itself and the genotypes of the linked marker loci (up to four in the standard compilation). The thetas between marker and risk locus are specified in the experimental set-up file.

Population implementation The program uses non-overlapping populations. The population is stored in three principal data structures: person, family and population. The central data structure of the program is a person, for which the sex, a random environmental factor and genetic information on risk genes and linked markers are stored. In order to minimize memory requirements, the allele numbers of the genotypes are packed in half-bytes (Figure 1). The family is a temporary dataset of pointers, for which a male and a female are drawn from the population, and children are generated as determined by a random draw from a binomial distribution, the mean of which is given in the experimental set-up (see below). Since the program uses non-overlapping generations, memory requirements can be minimized by immediately replacing the parents with their children. Thus, the next generation grows in the same population space at the expense of the old one (Figure 2).

460

Fig. 2. Implementation of the population. The old and the successive population share the same population space in the main memory. As the simulation of a generation progresses, the parent population is gradually replaced. The bounds are managed by pointer variables (in open arrows). The four steps in the creation and destruction of a family are symbolized, the pseudo code for these operations is given in the text.

Population replicates Two streams of pseudo random numbers are used in the program. One stream provides all random numbers for mating, meiosis, etc., and the second stream is used for population set-up. The latter is reset every time a population replicate is required, giving the opportunity to subject the same population to different mating scenarios repeatedly. After the experimental set-up, the complete experiment structure is put on the stack and reconstituted for each population replicate.

Sampling during simulation Sample information is either directed to the standard output or to files with characteristic extensions (Figure 3). The sampling timetable, i.e. after which generation what type of sample is to be drawn, is given in the experimental set-up file and read into the ‘checkpoints’ structure. No analysis functions are incorporated into the POPSIM program itself. Rather, independent analysis programs like the LINKAGE (Lathrop et al., 1984) package or the companion program CHISQUARE (which calculates χ2 statistics from case–control data) and the TDT (transmission disequilibrium test)

Population simulation with POPSIM

Fig. 3. Flow of information in a simulation experiment using POPSIM. As specified in the experimental set-up file, the flow of sample information is directed into files with specific extensions.

(Spielman et al., 1993, 1994) statistic from triplet data are used for data analysis.

Family creation and meiosis A stream of integer random numbers is used to make all decisions in the mating and meiosis routines, and is designated as ‘random’ in the pseudo code below. In order to avoid floating point operations, the cut-off values were normalized during the experimental set-up routines on the intervals of the equally distributed pseudo random numbers. In the case of the number of offspring to be created for a given family, the mean of a binomial distribution is specified in the experimental set-up file. The density curve of this distribution is translated into values of an integer vector and a random number used as the index for an individual family. We now give the pseudo code of the central mating and meiosis routines. breed (struct *experiment, struct *population, struct *checkpoints) {

for (all generations) { reset_population_pointers; total_sampling(population, check points); // total population sampling happens here while (still families needed) { mother = get_male(population, random); father = get_female(population, random); if (get_number_of_kids(binomial distribution, random))!=0) { make_space_for_kids(population experiment, kidnumber); create_kids(mother, father, population, experiment); ascertain(mother, father, population, checkpoints); // family sampling is performed here destroy_parents(mother, father, population); } } } } } create_kids(struct *population, struct *experiment, kidnumber) { while (still kids to create) { assign_sex(random); assign_environmental_factor (random); assign_paternal_haplotype(random) assign_maternal_haplotype(random) } } } assign_maternal_haplotype(random){ for (all markers) { assign_principal_chromo some(first_bit_of_random); // grandfather or grandmother if (last_15_bits_of_random >= normalized_theta) recombination; else no_recombination; } }

Results and discussion Efficiency of the implementation The direct simulation approach leads to a heavy load on both main memory and processor capacities. The bulk of processor resources are used during the simulation of meiosis. Here, the packaging of genotype information into half-bytes results in a slowdown of ∼50% (data not shown). However, in contrast to the use of integer data types, this reduces (de-

461

J.Hampe et al.

Fig. 4. Decline of initial linkage disequilibrium in an at random mating population of 200 000 individuals. It is evident that the recombination fraction θ determines the rate of this process. The analytical prediction is plotted in parallel to the simulation results. A population with an initial haplotype frequency (1D) of 0.87 and a population allele frequency of marker 1 of 1/16 was chosen. Fifty replicates of this population were generated and subjected to successive generations of random mating. The mean of the observed haplotype frequency (1D) was plotted as a function of time. Note the relative discrepancy of simulation and theoretical prediction in very small thetas (see the text).

pending on the compiler settings) the memory requirements of a marker genotype from 4–8 bytes down to 1 byte. This way, for instance, a three-susceptibility locus simulation with four markers at each locus needs 18 MBytes for 1 million individuals. A further speed-up was achieved by moving all floating point operations into the experimental set-up functions and performing all operations on normalized integer values. The final program version needs ∼50 min for the simulation of 100 generations in a population of size 1 million individuals on a Sun SparcStation 10/30 and ∼10 min on a Sun Ultra 170.

Current limitations and future development In all population simulation programs, a balance between abstraction from the individual biographies and the amount of genetic data and population size has to be struck. Programs such as GRASPERS (Prasad, 1992) and SIMEX allow for a very detailed specification of the individual biographies and are therefore also suitable for the simulation of largely environmental disorders like infectious diseases, but are limited for population size and the number of loci to be handled. POPSIM does not consider the individual biographies in detail, but rather concentrates on problems like the evolution of haplotypes over long periods of time, in large populations,

462

Fig. 5. Distribution of the haplotype frequency 1D in an at random mating population (θ = 1 cM). The same data as in the experiment in Figure 4. Instead of the mean of f1D , the distribution of observations from the 50 experiments (N on the Z-axis) is shown.

Population simulation with POPSIM

and allows for a theoretically unlimited number of susceptibility loci. Specialized simulations like changing population expansion patterns, ongoing admixture, conditional mating, differential environmental factors, interference and the use of linked susceptibility loci are presently only possible by modifying the respective routines in the source code. As a future development, the respective procedures will be made available as options in the experimental set-up files.

Error elimination Errors are an inevitable part of all software from a certain degree of complexity. However, in simulations of large datasets and complex problems, errors are very hard to trace back and may corrupt the integrity of the whole computer experiment. The present version of the program has been used as a tool with gradual enhancement of functionality since May 1996 in our laboratory. Pedigrees and recombinational events were traced by hand in several simulations. Furthermore, the program was used to simulate data under known parametric models, under which the expected results were obtained (comparison with SLINK under recessive and dominant models; data not shown). It was our policy to leave the central breeding routines untouched in the evolution of the program, since they had been extensively debugged.

Application example: decay of linkage disequilibrium in a population We investigated the decay of linkage disequilibrium in a large population, which can be well described by mathematical models (Robbins, 1918; Thompson and Neel, 1978) in order to validate the simulation (Figure 4). Linkage disequilibrium ∆0 between a marker allele 1 of a biallelic marker M (1,2) and a disease locus (D,d), that existed at a certain point in population history, is expected to decline under random mating during consecutive generations (n) at a rate that is dependent on the recombination fraction (θ) between marker M and the disease locus. In a population of indefinite size, this relationship is given by ∆n = (1 – θ) n ∆0, where ∆n was calculated as ∆ = f1D f2d – f1d f2D (f denoting the population frequencies of the respective haplotypes with 1D being the haplotype of interest in this setting) (Devlin and Risch, 1995). The figure shows the mean of 50 replicate simulation experiments in good concordance with theoretical estimates. As shown in Figure 4, the relative discordance of simulation and prediction in very small recombination fractions (0.1 cM) shows that genetic drift rather than recombination determines the haplotype frequency. Here, the simulation offers the possibility to look at the distribution of haplotype frequency 1D in replicate experiments (Figure 5), which are difficult to derive analytically (Neel and Thompson, 1978; Weir and Hill, 1980; Hill and Weir, 1988, 1994).

Possible applications The program allows for a very flexible specification of population simulation experiments. Special options exist that allow the disease model to be fitted empirically as close as possible to the known epidemiological data. Once this model is specified, the program can serve as a tool for the study of genetic drift, recombination and admixture on the generation and maintenance of linkage disequilibrium in populations, as well as the evaluation of stochastic sampling characteristics of families and individuals conditional on a complex genetic phenotype from homogeneous and heterogeneous populations. The program may, therefore, serve both as an instrument in theoretical research and applied genetics in allowing the generation of datasets for power estimations of mapping projects, as well as the evaluation of genetic study designs in complex genetic disorders.

References Ackerman,E. et al. (1993) Simulation of stochastic micropopulation models—I. The SUMMERS simulation shell. Comput. Biol. Med., 23, 177–198. Almasy,L. et al. (1995) Use of sibling risk ratios and components of genetic variance in the characterization of a simulated oligogenic disease. Genet. Epidemiol., 12, 565–570. Boehnke,M. (1986) Estimating the power of a proposed linkage study: a practical computer simulation approach. Am. J. Hum. Genet., 39, 513–527. Cox,N.J. et al. (1992) Mapping diabetes-susceptibility genes. Lessons learned from search for DNA marker for maturity-onset diabetes of the young. Diabetes, 41, 401–407. Davis,S. et al. (1996) Nonparametric simulation-based statistics for detecting linkage in general pedigrees. Am. J. Hum. Genet., 58, 867–880. Devlin,B. and Risch,N. (1995) A comparison of linkage disequilibrium measures for fine-scale mapping. Genomics, 29, 311–322. Fournet,F. et al. (1995) A FORTRAN program to simulate the evolution of genetic variability in a small population. Comput. Applic. Biosci., 11, 469–475. Goay,M.S. et al. (1995) Simulation of stochastic micropopulation models—IV. SNAPPERS: model implementation for genetic traits. Comput. Biol. Med., 25, 519–531. Hästbacka,J. et al. (1992) Linkage disequilibrium mapping in isolated founder populations: diastrophic dysplasia in Finland. Nature Genet., 2, 204–211. Hästbacka,J. et al. (1994) The diastrophic dysplasia gene encodes a novel sulfate transporter: positional cloning by fine-structure linkage disequilibrium mapping. Cell, 78, 1073–1087. Hill,W.G. and Weir,B.S. (1988) Variances and covariances of squared linkage disequilibria in finite populations. Theor. Popul. Biol., 33, 54–78. Hill,W.G. and Weir,B.S. (1994) Maximum-likelihood estimation of gene location by linkage disequilibrium. Am. J. Hum. Genet., 54, 705–714.

463

J.Hampe et al.

Kaplan,N.L. and Weir,B.S. (1995) Are moment bounds on the recombination fraction between a marker and a disease locus too good to be true? Allelic association mapping revisited for simple genetic diseases in the Finnish population. Am. J. Hum. Genet., 57, 1486–1498. Kaplan,N.L. et al. (1995) Likelihood methods for locating disease genes in nonequilibrium populations. Am. J. Hum. Genet., 56, 18–32. Lathrop,G.M. et al. (1984) Strategies for multilocus linkage analysis in humans. Proc. Natl Acad. Sci. USA, 81, 3443–3446. MacCluer,J.W. (1967) Monte Carlo methods in human population genetics: a computer model incorporating age-specific birth and death rates. Am. J. Hum. Genet., 19(Suppl. 19), 303–312. MacCluer,J.W. et al. (1995) Simulation of a common oligogenic disease with quantitative risk factors. GAW9 problem 2: the answers. Genet. Epidemiol., 12, 707–712. Neel,J.V. and Thompson,E.A. (1978) Founder effect and number of private polymorphisms observed in Amerindian tribes. Proc. Natl Acad. Sci. USA, 75, 1904–1908. Orholm,M. et al. (1993) Investigation of inheritance of chronic inflammatory bowel diseases by complex segregation analysis. Br. Med. J., 306, 20–24. Peterson,D. et al. (1993) Simulation of stochastic micropopulation models—II. VESPERS: epidemiological model implementations for spread of viral infections. Comput. Biol. Med., 23, 199–213. Prasad,B.N. (1992) GRASPERS: A stochastic simulation system for generating populations with genetic characteristics. In Health Sciences Informatics. University of Minnesota, p. 26. Risch,N. (1990) Genetic linkage and complex diseases, with special reference to psychiatric disorders. Genet. Epidemiol., 7, 3–16. Risch,N. and Botstein,D. (1996) A manic depressive history. Nature Genet., 12, 351–353.

464

Robbins,R.B. (1918) Some applications of mathematics to breeding problems III. Genetics, 3, 375–389. Spielman,R.S. et al. (1993) Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am. J. Hum. Genet., 52, 506–516. Spielman,R.S. et al. (1994) The transmission/disequilibrium test detects cosegregation and linkage. Am. J. Hum. Genet., 54, 559–560. Suarez,B.K. et al. (1994) Problems in replicating linkage claims in psychiatry. In Gershon,E.S. and Cloninger,C.R. (eds), Genetic Approaches to Mental Disorders. American Psychiatric Press, Washington, DC, pp. 23–46. Terwilliger,J.D. and Ott,J. (1994) Handbock of Human Genetic Linkage. The Johns Hopkins University Press, Baltimore, MD, pp. 245–250. Thompson,E.A. and Neel,J.V. (1978) Probability of founder effect in a tribal population. Proc. Natl Acad. Sci. USA, 75, 1442–1445. Thompson,E.A. and Neel,J.V. (1997) Allelic disequilibrium and allele frequency distribution as a function of social and demographic history. Am. J. Hum. Genet., 60, 197–204. Weeks,D.E. et al. (1990) SLINK: a general simulation program for linkage analysis. Am. J. Hum. Genet., 47(Suppl.), A204. Weir,B.S. and Hill,W.G. (1980) Effect of mating structure on variation in linkage disequilibrium. Genetics, 95, 477–488. Wilson,A.F. et al. (1996) The Genometric Analysis Simulation Program (G. A. S. P.): a software tool for testing and investigating methods in statistical genetics. Am. J. Hum. Genet., 59(Suppl.), A193. Yang,J.J. et al. (1990) LINKERS: a simulation programming system for generating populations with genetic structure. Comput. Biol. Med., 20, 135–144.