Accuracy of a family-based genotype imputation algorithm

Mehdi Sargolzaei (1,2), Jacques P. Chesnais (1,3) and Flavio S. Schenkel (2)

(1) L'Alliance Boviteq, Saint-Hyacinthe, QC, Canada
(2) Centre for Genetic Improvement of Livestock, University of Guelph, Guelph, ON, Canada
(3) Semex Alliance, Guelph, ON, Canada

Open Industry Session - April 2010.

INTRODUCTION

Genomic selection has recently been applied to dairy cattle breeding and is expected to substantially increase genetic gain. To estimate genomic breeding values, dense DNA marker panels are usually required to exploit the linkage disequilibrium (LD) between quantitative trait loci (QTL) and markers (Hayes et al. (2009)). A dense marker map is also a prerequisite for fine mapping, in order to locate QTL precisely (e.g., Meuwissen and Goddard (2000)). Even though high-density genotyping is now feasible for dairy cattle, genotyping thousands of individuals with a high-density panel is still expensive. To bring down genotyping costs, a reference population can be genotyped with a high-density panel while other animals are genotyped with a low-density panel in which markers are evenly spaced. Then, using information from the reference population, genotypes at untyped loci can be inferred for individuals genotyped with the low-density panel (e.g., Habier et al. (2009)).

Despite current advances in genotyping technology, missing data are fairly common in association studies. Missing genotypes are usually not randomly distributed, and the failure rate for some individuals can be 10% or more. Implementing fine mapping and genomic selection in the presence of missing data can be challenging. Genotype imputation is a method for inferring missing genotypes to increase genome coverage, and it is a powerful tool for increasing the power of genome-wide association studies.

Imputation can be used for three main purposes: 1) to infer the missing genotypes of markers that were not successfully called during genotyping, 2) to infer the genotypes of ungenotyped parents that have a sufficiently large number of genotyped progeny and 3) to infer genotypes at untyped loci in individuals genotyped with a low-density panel, using a reference population genotyped with a high-density marker panel. The first case applies when the rate of missing genotypes for some individuals is high. The second case applies when DNA samples for some important ancestors are not available. The third case may be the most important one, and is primarily designed to reduce genotyping costs. The persistency of imputation accuracy over successive generations is of interest in this third case.

Imputation algorithms can be classified as LD-based or family-based. Genetic linkage can effectively help with the imputation of long haplotype segments because of the small number of recombinations in the pedigree of close relatives. Therefore, when individuals are highly related, a modest number of markers in the low-density panel should be enough for accurate imputation (Li et al. (2009)). The objective of this article was to evaluate the accuracy of a family-based imputation algorithm for situations 2 and 3 described above.

MATERIAL AND METHODS

Imputation algorithm. Imputing missing or untyped genotypes is usually done in two steps: 1) reconstruction of haplotypes and 2) propagation of haplotypes to fill in unknown genotypes (Li et al. (2009)). In our case, imputation was done in three steps. In the first step, genotypes that could be inferred with high certainty from parent or progeny information were filled in. In the second step, haplotypes were reconstructed. In the third step, haplotypes of progeny were matched to haplotypes of parents and untyped loci were filled in. The first imputation step is essential when a parent is untyped.
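The third step described above (matching a progeny haplotype to the parental haplotypes and filling in untyped loci) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `fill_from_parent`, the 0/1 allele coding and the use of `None` for untyped loci are assumptions made for illustration, and recombination within the segment is ignored for simplicity.

```python
from typing import List, Optional, Sequence


def fill_from_parent(
    progeny_hap: Sequence[Optional[int]],
    parent_haps: Sequence[Sequence[int]],
) -> List[Optional[int]]:
    """Fill untyped loci (None) of one progeny haplotype from a parent.

    Hypothetical helper: alleles are coded 0/1 and the segment is assumed
    to be inherited without recombination.
    """
    # Loci typed on the low-density panel
    typed = [i for i, a in enumerate(progeny_hap) if a is not None]
    # Parental haplotypes that agree with the progeny at every typed locus
    matches = [
        h for h in parent_haps
        if all(h[i] == progeny_hap[i] for i in typed)
    ]
    imputed = list(progeny_hap)
    for i, allele in enumerate(progeny_hap):
        if allele is None:
            candidates = {h[i] for h in matches}
            # Fill only when the matching haplotypes agree on the allele;
            # otherwise the locus stays "not inferred"
            if len(candidates) == 1:
                imputed[i] = candidates.pop()
    return imputed


parent_haps = [[0, 1, 1, 0, 1, 0], [1, 1, 0, 0, 0, 1]]
low_density = [1, None, None, 0, None, None]
print(fill_from_parent(low_density, parent_haps))  # [1, 1, 0, 0, 0, 1]
```

When the typed loci cannot distinguish the two parental haplotypes, this sketch fills in only the loci at which both haplotypes carry the same allele and leaves the rest missing, which is one simple way to obtain a "not inferred" category.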

Haplotypes were reconstructed using a family-based algorithm as described in Sargolzaei et al. (2008).

Simulation study. In order to evaluate the accuracy of the above family-based imputation algorithm, simulated data sets were generated with the QMSim software (Sargolzaei et al. (2009)). The base population consisted of 100 sires, each mated randomly to 10 dams (large family scenario) or to 3 dams (small family scenario). Each dam produced 2 progeny. To simplify the presentation, 15 discrete generations were carried out. Two chromosomes of length 1 Morgan, with 10,000 SNP located randomly on each chromosome, were simulated. Markers in the base generation were in LD (LD scenario) or in linkage equilibrium (no LD scenario). For the LD scenario, 2,000 discrete historical generations, starting with equal allele frequencies, were simulated. Gametes of 200 sires and 200 dams were randomly paired to produce offspring. To reach mutation-drift equilibrium, a mutation rate of 10e-4 was used. During the last 10 historical generations, the population was gradually expanded to 1,100 animals, of which 1,000 were dams and 100 were sires. For the no LD scenario, alleles in the base population were sampled from a uniform distribution with equal frequencies. Finally, for each chromosome, 1,875 SNP with a minor allele frequency >0.05 in the base generation were randomly chosen. This marker density corresponds to that of the 50k bovine SNP panel (approximately 45k SNP usable per breed).

The first three generations (0 to 2) were considered the reference population, in which animals were genotyped with the high-density panel. In subsequent generations (3 to 15), animals were genotyped with low-density panels of various densities. To create these low-density panels, genotypes for 42, 125, 208, and 417 equidistant SNP per chromosome were kept and genotypes for all other SNP were set to missing. These densities correspond to 1k, 3k, 5k and 10k SNP panels in the bovine.
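The correspondence between per-chromosome and genome-wide panel sizes can be checked with a short sketch. The scaling factor of 24 is derived here from the figures in the text (about 45,000 usable SNP for 1,875 simulated SNP per chromosome) and is not stated by the authors. The helpers `equidistant_indices` and `mask_to_low_density` are hypothetical; they space markers evenly by index, which only approximates equal map distance when SNP positions are random.

```python
from typing import List, Optional, Sequence

USABLE_50K = 45_000      # approximate usable SNP on the 50k panel (from the text)
SNP_PER_CHROM = 1_875    # simulated high-density SNP per chromosome
SCALE = USABLE_50K // SNP_PER_CHROM  # 24, derived ratio (assumption)


def genome_wide_equivalent(n_per_chrom: int) -> int:
    """Scale a per-chromosome SNP count to an approximate genome-wide count."""
    return n_per_chrom * SCALE


def equidistant_indices(n_total: int, n_keep: int) -> List[int]:
    """Indices of n_keep markers spaced evenly (by index) among n_total."""
    step = n_total / n_keep
    return [int(step * k + step / 2) for k in range(n_keep)]


def mask_to_low_density(
    genotypes: Sequence[int], keep: Sequence[int]
) -> List[Optional[int]]:
    """Keep genotypes at the given indices and set all others to missing."""
    keep_set = set(keep)
    return [g if i in keep_set else None for i, g in enumerate(genotypes)]


for n in (42, 125, 208, 417):
    print(n, genome_wide_equivalent(n))
# 42 1008
# 125 3000
# 208 4992
# 417 10008
```

Under this assumption, the four per-chromosome densities scale to roughly 1k, 3k, 5k and 10k SNP genome-wide, in line with the panel sizes named in the text.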
In another scenario, base sires and dams were randomly mated 5, 10, 15 and 20 times to generate large maternal half-sib families (with each dam producing one progeny), and then the whole genome of the dams was set to missing. All scenarios were replicated 100 times.
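To see why the number of progeny matters when a dam is untyped, consider a single locus at which the maternal allele of each progeny is known (for example, after phasing against a genotyped sire). The sketch below is a per-locus simplification and not the authors' method, which works on haplotypes and therefore also uses linkage between loci. The function names and the `min_progeny` threshold are assumptions introduced for illustration.

```python
from typing import Optional, Sequence


def prob_both_alleles_seen(n_progeny: int) -> float:
    """Probability that a heterozygous dam transmits both alleles at least
    once among n progeny, assuming independent Mendelian segregation."""
    if n_progeny < 1:
        return 0.0
    # She transmits the same allele every time with probability 2 * 0.5**n
    return 1.0 - 0.5 ** (n_progeny - 1)


def infer_dam_genotype(
    maternal_alleles: Sequence[Optional[int]], min_progeny: int = 10
) -> Optional[int]:
    """Infer the dam's genotype (count of allele 1) at one locus.

    Returns 1 if both alleles were transmitted, 0 or 2 if only one allele
    was seen in at least min_progeny progeny (hypothetical threshold),
    and None if the genotype cannot be called.
    """
    observed = [a for a in maternal_alleles if a is not None]
    seen = set(observed)
    if seen == {0, 1}:
        return 1  # heterozygous: both alleles observed in progeny
    if len(seen) == 1 and len(observed) >= min_progeny:
        return 2 * observed[0]  # homozygous call, not strictly certain
    return None


print(prob_both_alleles_seen(5))   # 0.9375
print(prob_both_alleles_seen(10))  # 0.998046875
```

With 5 progeny, a heterozygous locus would reveal both alleles about 94% of the time under these assumptions, and with 10 progeny about 99.8% of the time, so larger maternal half-sib families leave fewer loci ambiguous.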

RESULTS AND DISCUSSION

Imputation accuracy and its persistency across generations for the LD scenario, when descendants were genotyped with alternative low-density panels, are presented in Figure 1. The accuracy of imputation was very similar for both family sizes (20 vs 6). The accuracy of imputation is highly dependent on the accuracy of haplotyping. The accuracy of haplotype reconstruction for all generations was higher than 0.99, with a standard error of ~0.00004, for both scenarios, which resulted in almost the same accuracy of imputation. Imputation accuracy was very high for all densities (SE=0.0003) when parents were genotyped with the high-density panel. The drop in accuracy over generations was relatively small for panels with a SNP density of 3k or higher. As the generation number increased, the proportion of SNP not inferred increased, as did its SE.

To maintain an acceptable level of accuracy over generations, a new group of animals should be genotyped with the high-density panel after a certain number of generations. If haplotypes can be reconstructed accurately, the time interval between successive groups of animals that should be genotyped with the high-density panel depends on the density of the low-density panel. However, in a large-scale genotyping strategy, the cost of low-density versus high-density genotyping is an important factor to consider in addition to the accuracy of imputation.

Results for the situation where all markers were in linkage equilibrium in the base generation (no LD) were very similar to those for the LD scenario, but imputation accuracy was slightly lower (results not shown). This might be due to higher marker heterozygosity in the no LD scenario. Apart from this, differences between scenarios were expected to be small since only within-family information is used, and therefore the accuracy of the method should not depend on the level of LD.

Family-based imputation is best suited to livestock with relatively large family sizes. A good example is Holstein cattle, a population dominated by large half-sib families and by a small number of elite bulls. In contrast to unrelated samples, family designs facilitate the detection of genotyping errors, which results in more accurate haplotype reconstruction (Kirk and Cardon (2002)). Another advantage of family-based imputation is its robustness to population stratification, due to the use of within-family information (Dudbridge (2008)).

Figure 2 shows the accuracy of imputation for ungenotyped dams with different family sizes. The accuracy of inferred genotypes was close to 1 for maternal half-sib family sizes of 10 and larger. When the family size was 5, the accuracy was still high (0.97). The haplotypes were perfectly reconstructed because of the large paternal half-sib family size and the absence of genotyping errors. Large paternal half-sib families are common in dairy cattle and the genotyping error rate is low. Therefore, the proposed algorithm could be used to impute the genotypes of elite cows that are no longer alive (no DNA available) but have a large number of offspring from whom haplotypes can be inferred with a high degree of accuracy. Some of these results have been confirmed on real data and will be presented in a subsequent paper.
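The two quantities plotted in Figure 1, the proportion of SNP inferred correctly and the proportion of SNP not inferred, can be computed from simulated data along the following lines. This is a sketch under stated assumptions: the text does not give the denominators, so here the proportion correct is taken over the loci that were actually inferred and the proportion not inferred over all masked loci. The function name `imputation_summary` and the genotype coding are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def imputation_summary(
    true_g: Sequence[int],
    masked_g: Sequence[Optional[int]],
    imputed_g: Sequence[Optional[int]],
) -> Tuple[float, float]:
    """Return (proportion inferred correctly, proportion not inferred).

    Only loci that were masked (None in masked_g) are evaluated. The
    denominators are assumptions, see the accompanying text.
    """
    masked = [i for i, g in enumerate(masked_g) if g is None]
    inferred = [i for i in masked if imputed_g[i] is not None]
    correct = sum(1 for i in inferred if imputed_g[i] == true_g[i])
    prop_correct = correct / len(inferred) if inferred else float("nan")
    prop_not_inferred = (
        (len(masked) - len(inferred)) / len(masked) if masked else float("nan")
    )
    return prop_correct, prop_not_inferred


true_g = [0, 1, 2, 1, 0, 2]
masked_g = [0, None, None, 1, None, None]
imputed_g = [0, 1, 2, 1, None, 1]
prop_correct, prop_not_inferred = imputation_summary(true_g, masked_g, imputed_g)
print(round(prop_correct, 3), prop_not_inferred)  # 0.667 0.25
```

Reporting the two proportions separately keeps a conservative algorithm, one that leaves uncertain loci missing, from being confused with an inaccurate one.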

CONCLUSION

When a reference population has been genotyped with a high-density panel, a family-based algorithm can be used to impute the high-density genotypes of individuals genotyped with a lower-density panel, or even those of ungenotyped individuals with enough genotyped progeny. With the algorithm presented here, the imputation of untyped loci was very accurate if the animal’s parents were genotyped with the high-density panel. For animals genotyped with the low-density panel in subsequent generations, the loss in accuracy was small if the low-density panel had sufficient coverage (3,000 SNP or more). If an animal was not genotyped, but had genotyped mates and a relatively large number of genotyped offspring, its high-density genotype could be imputed accurately. The proposed algorithm is particularly well suited to the population structure usually encountered in dairy cattle.

REFERENCES

Dudbridge, F. (2008). Hum. Hered., 66:87–98.
Habier, D., Fernando, R. L., and Dekkers, J. C. M. (2009). Genetics, 182:343–353.
Hayes, B. J., Bowman, P. J., Chamberlain, A. C. et al. (2009). J. Dairy Sci., 92:433–443.
Kirk, K. M., and Cardon, L. R. (2002). Eur. J. Hum. Genet., 10:616–622.
Li, Y., Willer, C., Sanna, S. et al. (2009). Annu. Rev. Genom. Hum. Genet., 10:387–406.
Meuwissen, T. H. E., and Goddard, M. E. (2000). Genetics, 155:421–430.
Sargolzaei, M., Schenkel, F. S., Jansen, G. B. et al. (2008). J. Dairy Sci., 91:2106–2117.
Sargolzaei, M., and Schenkel, F. S. (2009). Bioinformatics, 25:680–681.

Figure 1: Proportion of SNP inferred correctly (above the dashed line) and proportion of SNP not inferred (below the dashed line) over successive generations when LD was present.

Figure 2: Proportion of SNP inferred correctly for ungenotyped dams.
