
Electrophoresis 1999, 20, 1682–1696

Review

Ranajit Chakraborty1, David N. Stivers1, Bing Su1, Yixi Zhong1, Bruce Budowle2

The utility of short tandem repeat loci beyond human identification: Implications for development of new DNA typing systems

1 Human Genetics Center, School of Public Health, University of Texas, Houston, TX, USA
2 Laboratory Division, Forensic Sciences Research and Training Center, FBI Academy, Quantico, VA, USA

Since the first characterization of the population genetic properties of repeat polymorphisms, the number of short tandem repeat (STR) loci validated for forensic use has now grown to at least 13. Worldwide variations of allele frequencies at these loci have been studied, showing that variations of interpopulation diversity at these loci do not compromise the power of identification of individuals. However, data collected for validation of these loci for forensic use have utility beyond human identification; the origin and past migration history of modern humans can be reconstructed from worldwide variations at these loci. Furthermore, complex forensic cases previously unresolvable can now be investigated with the help of the validated STR loci. Here, we provide the absolute power of the validated set of 13 STR loci for addressing these issues using multilocus genotype data on 1,401 individuals belonging to seven populations (US European-American, US African-American, Jamaican, Italian, Swiss, Chinese and Apache Native-American). Genomic research is discovering new classes of polymorphic loci (such as the single nucleotide polymorphisms, SNPs) and lineage markers (such as the mitochondrial DNA and Y-chromosome markers); our aim, therefore, was to determine how many SNP loci are needed to match the power of this set of 13 STR loci. We conclude that the current set of STR loci is adequate for addressing most problems of human identification (including interpretations of DNA mixtures). However, even if a suitable number of SNPs is used to match the power of the STR loci, SNPs alone cannot resolve more complex cases unless they are supplemented by the validated STR loci.

Keywords: Repeat polymorphism / DNA forensics / Parentage testing / Evolutionary dynamics / Single nucleotide polymorphism / Review

Contents
1 Introduction
2 STR loci – the present battery of 13 loci
3 Evolutionary genetic inference from 13 STR loci
4 Power of the 13 STR loci for forensic and identification purposes
4.1 Match probability in random individuals
4.2 Match probability in relatives
4.3 Paternity exclusion probabilities
4.4 DNA mixture analysis
5 SNP versus STR
6 Discussion and conclusion
7 References
8 Appendix

1 Introduction

Mapping of genetic markers on human chromosomes and developing new techniques of polymorphism detection are some of the basic goals of the Human Genome Project, the largest single biological project undertaken in this century. A major application of these discoveries is the use of DNA technologies in forensic investigation. DNA forensics involves comparisons of DNA profiles of evidence samples with those of one or more known subjects, in order to identify, narrow down, or exclude the source of origin of the constituents of evidence samples [1–3]. Since the pioneering work of Jeffreys [4], the DNA typing systems applied to DNA forensics have evolved considerably. The original, highly discriminating multilocus probes, which detected variation at several variable number of tandem repeat (VNTR) loci simultaneously, were replaced by single-locus restriction fragment length polymorphism (RFLP) analysis of individual VNTR loci, each scored separately. However, typing of RFLP/VNTR loci requires a relatively high amount of large DNA molecules, is not easily automatable, and takes a long time to develop a profile. Polymerase chain reaction (PCR)-based genotyping techniques address all of these issues, first through sequence-specific polymorphic loci (polymarkers), and then through VNTR and short tandem repeat (STR) loci (also called minisatellites and microsatellites, respectively; see [5]). Although these markers have a somewhat lower level of discrimination at the level of individual loci, the PCR-based typing techniques quickly became popular because they are less time-consuming and generally yield more easily interpretable results for forensic identification and for determining relatedness of individuals. Furthermore, the ease of amplifying specific STR loci and the modification of multiplexing conditions for PCR analysis now allow scoring of over a dozen STR loci for many individual samples in a single experiment. This development has made the STR loci popular for current forensic applications [6, 7].

As the trend towards automation and miniaturization of DNA typing methods continues, proponents are already thinking of possible future replacements of STR loci, such as chip-based arrays of single nucleotide polymorphism (SNP) loci and other similar techniques [8]. However, this new technology should be evaluated for efficiency in resolving forensically relevant questions. With population data now available on the battery of 13 STR loci that are being used, it is possible to determine the number of SNP loci needed to equal the power of STR loci. In this context, "power" is to be defined for specific applications.

In this presentation, we consider allele frequency data at the 13 STR loci which constitute a commonly used battery of forensic loci, in several representative world populations (Chinese, US European-American, Swiss, Italian, US Native-American (Apache), US African-American and Jamaican), to address the above issue. First, using allele frequency data, we show that these loci provide a clear and concise picture of evolutionary relationships among the world's major populations. Features of both intra- and interpopulation diversity are correctly predicted by the 13 STR loci. Second, we determine the absolute efficiency of the STR loci for evaluating match probability (for identification purposes), paternity exclusion, and for determining relatedness between individuals. Then, we calculate the number of SNP loci needed to equal the power of the STR loci. Finally, we address some general questions regarding the design of microchip-based genetic typing technology that may eventually equal or exceed the potential of the currently employed STR loci. With present technology, we surmise that in almost all major human populations, the current set of 13 STR loci can provide forensic inferences beyond reasonable doubt, even when reasonable measures of conservativeness are imposed in statistical interpretation of the forensic data.

Correspondence: Prof. R. Chakraborty, Human Genetics Center, University of Texas School of Public Health, PO Box 20334, Houston TX, USA
E-mail: [email protected]
Fax: +713-500-0900

Abbreviation: SNP, single nucleotide polymorphism

© WILEY-VCH Verlag GmbH, 69451 Weinheim, 1999
0173-0835/99/0808-1682 $17.50+.50/0

2 STR loci – the present battery of 13 loci

The abundance of polymorphic short tandem repeat loci throughout the human genome prompted the initial characterization of several such loci through multiplexed PCR-based genotyping methods in the early 1990s [9, 10]. Through revision and validation of these techniques, 13 polymorphic STR loci (CSF1PO, TH01, TPOX, FGA, D3S1358, D5S818, D7S820, D8S1179, D13S317, D16S539, D18S51, D21S11 and vWA) and one gender-determining locus (Amelogenin) can currently be studied in three PCR amplifications, resulting in 14-locus DNA profiles even when only a minute amount of DNA is found in forensic samples [6, 7]. The genetic characteristics of these loci, such as chromosomal locations, repeat motifs, and the range of repeat sizes commonly found in different populations, are available in the product brochures of the commercial kits currently validated for such purposes [6, 7].

Several features are clear from the summary characteristics of these STR loci. First, all but two of the 13 STR loci are located on different chromosomes. The two that are located on the same chromosome (CSF1PO and D5S818) are also far apart (CSF1PO is localized to 5q33.3-34 and D5S818 to 5q23.3-32). This implies that correlated genotypes at these loci are unlikely to be observed, unless the population is highly inbred and isolated or very severely fragmented (i.e., substructured). Second, the genomic location of these loci suggests that they are unlikely to have any major functional significance. This implies that genotypes at these loci cannot be predicted from known physical and/or behavioral attributes of individuals. Third, the allele-size ranges observed at these loci suggest that these loci are expected to be highly polymorphic even in isolated populations.

Note that most of these loci have been found to be polymorphic even in other great apes [11–13], suggesting that these STRs represent evolutionarily relatively ancient polymorphisms in comparison to the genetic differentiation of the modern human populations.

3 Evolutionary genetic inference from 13 STR loci

A considerable amount of worldwide data is now available at these loci which supports the above assertions. In this presentation, we consider allele frequency data from seven major population groups (US European-American, Italian, Swiss, US African-American, Jamaican, Chinese, and Apache Native-Americans) for each of the 13 STR loci. Estimates of allele frequencies are obtained from multilocus genotype data in 117–244 individuals from each population, selected for the study without any prior knowledge of their DNA type. Details of sampling methods and analyses of multilocus genotype data with regard to tests of independence of alleles/genotypes within and across loci will be published elsewhere, although in this paper we will discuss an important forensic consequence of the multilocus genotype data.

Figure 1, panels (i) through (vii), shows a schematic picture of allele frequencies at these loci in the seven populations. While neither the exact numerical values of all allele frequencies nor the exact repeat size of the alleles can be extracted from this figure (they are available on request from the authors), some qualitative inferences are clear. For example, it shows that most alleles are present in all populations, and the loci are extremely polymorphic in each population. In other words, at all 13 loci the interpopulation differences of genetic variation are reflected only in the frequency differences of alleles. Further, since the positions of the alleles in the circular graphs are the same, a simple examination of allele frequency differences between populations shows that across loci, allele frequency variation among populations is by and large random. In other words, this implies that estimates of multilocus DNA profile frequencies will tend to be similar (within the limits of sampling variation of estimated profile frequency, particularly for populations of the same major racial/ethnic affiliation), since frequency differences of individual alleles in any specific DNA profile tend to cancel each other. This was predicted by Chakraborty and Kidd [14] from general population genetic principles even before collection of empirical data such as that shown in Fig. 1. In summary, data depicted in Fig. 1 imply that any multilocus DNA profiles which are rare in one population remain rare in other populations, and likewise the profiles that are relatively common in one population are also common in other populations.

Figure 1. Schematic representation of allele frequencies at 13 STR loci in seven world populations. Quadrants of each circle represent the individual STR loci, within each of which the lengths of the radial lines are the relative allele frequencies. Concentric dotted circles are the scale of allele frequencies (20% and 40%, as shown). Panels are: (i) Jamaicans; (ii) African-Americans; (iii) US European-Americans; (iv) Swiss; (v) Italians; (vi) Chinese and (vii) Apache Native-American.

Table 1 presents some summary statistics of genetic variation over the 13 STR loci in the seven populations. The average sample sizes (114–213 individuals per locus) in each of the seven populations indicate that the databases used in the illustration are of adequate size for drawing evolutionary as well as forensic inference [15–17]. Average heterozygosity (i.e., gene diversity of 71.0–78.8% per locus) and the average number of segregating alleles (8.0–10.1 per locus), the traditional measures of within-population genetic variation, are relatively high in all populations, indicating the general hypervariability of polymorphism at these loci. Since the allelic frequency distributions at these loci are studied with respect to copy number of repeat motifs, another measure of within-population variation, estimated from the variance of allele size [18], also depicts extensive polymorphism at the loci in all world populations. The mean allele size, an indicator of directionality of mutation (e.g., contraction/expansion mutation bias; see [19]), is also similar (a little over 14 repeats per locus) across populations, suggesting that the interpopulation variance at these loci is apparently predominantly dictated by genetic drift and evolutionary separation of populations accompanied by contraction/expansion mutations accumulated over the history of modern human populations.

Table 1. Summary characteristics of intrapopulation genetic variation at 13 STR loci in seven world populations

| Population | Sample size (no. of individuals) | Heterozygosity (%) | No. of alleles | Allele-size variation | Mean allele size |
|---|---|---|---|---|---|
| Jamaican | 213.5 ± 6.1 | 78.6 ± 1.5 | 10.4 ± 1.1 | 2.55 ± 0.42 | 14.4 ± 1.7 |
| African-Americans | 193.5 ± 4.3 | 78.8 ± 1.5 | 10.1 ± 1.2 | 2.53 ± 0.37 | 14.2 ± 1.7 |
| European-Americans | 199.1 ± 1.0 | 78.3 ± 2.0 | 8.9 ± 0.6 | 2.39 ± 0.32 | 14.2 ± 1.7 |
| Swiss | 206.0 ± 0.0 | 78.8 ± 2.0 | 9.1 ± 0.8 | 2.48 ± 0.33 | 14.2 ± 1.7 |
| Italian | 223.0 ± 0.0 | 78.5 ± 1.7 | 9.5 ± 0.8 | 2.48 ± 0.33 | 14.2 ± 1.7 |
| Chinese | 114.2 ± 0.9 | 77.7 ± 2.0 | 8.5 ± 0.9 | 2.55 ± 0.44 | 14.3 ± 1.7 |
| Apache | 198.0 ± 0.0 | 71.0 ± 2.6 | 8.0 ± 0.7 | 2.29 ± 0.46 | 14.2 ± 1.7 |

Allele frequency data in these populations can also be used to examine the effectiveness of these loci in studying evolutionary relationships of populations. Figure 2 summarizes the results of such analyses. Two distance measures are used here. Takezaki and Nei [20] showed that the genetic distance measure, DA (see [21] for its exact estimating equation), can be used to study evolutionary relationships using data on repeat polymorphism. Goldstein et al. [22] suggested an alternative measure, (δμ)², which utilizes allele size information at these loci for drawing evolutionary inference. Neighbor-joining networks based on DA (panel a) and (δμ)² (panel b) are virtually identical; this conforms with current anthropological knowledge that the populations of African ancestry form a cluster, this being the most distant cluster from the remaining populations. These network trees also confirm that the US European-Americans are closest to the Europeans (e.g., the Italians and the Swiss), and that a considerable proportion of the genes of US African-Americans are of African descent. The proximity of the Apache group to the Chinese indicates that the ancestry of this Native American-Indian tribe is of (north-)east Asian origin.

Figure 2. Neighbor-joining unrooted trees of population affinities based on 13 STR loci. Panel (a) uses the distance measure DA (see [21]) and panel (b) uses the distance measure (δμ)² (see [22]). Branches without any assigned numbers have an estimated length of zero.

Data shown in Table 1 also indicate that, in terms of all measures of within-population variation (particularly with respect to the number of alleles and percent heterozygosity per locus), the populations of African ancestry have the highest level of within-population variation, and the Native American Apache population has the lowest variability. While this may be interpreted in terms of the antiquity of modern populations (i.e., modern humans evolved in Africa, and all other human populations are the result of a relatively recent Out-of-Africa migration), the question can be raised whether such differences imply any greater degree of genetic similarity between random individuals in populations with a lower within-population variation. With the multilocus genotype data available, a pairwise comparison of multilocus genotypes of all individuals provides this information, an analysis of which is summarized in Figs. 3 and 4 and Table 2 (see [23] for computational methods). These calculations indicate, first, that in all populations the observed and expected distributions (under the assumption of independence of alleles within and across loci) are virtually identical. In other words, a low level of within-population variation does not necessarily have an implication for allelic independence within or across loci. Second, the mean number of alleles shared between individuals (observed as well as expected) is well below the maximum possible (26 for the 13-locus genotype in this example of data) in all populations, regardless of the level of within-population variation. This, together with the standard deviation of the number of alleles shared between individuals, also has an important implication with regard to forensic inference. For example, given that a 13-locus DNA profile comparison can yield up to 26 shared alleles between random individuals, data presented in Figs. 3 and 4 and Table 2 suggest that it is extremely improbable to find, by chance, a 13-locus profile match between random individuals in any of these populations.

Figure 3. Distributions of observed and expected (under the assumption of allelic independence within and across loci) number of alleles shared in 13-locus genotypes of pairs of individuals within seven world populations. Histograms represent the observed distributions and the line diagrams correspond to the expected distributions (see [23] for computational methods).

Figure 4. Distributions of observed (histograms) and expected (line graphs, under the assumption of allelic independence within and across loci) number of loci exhibiting genotypic identity in pairwise comparisons of 13-locus DNA profiles of individuals in seven world populations.

Table 2. Mean and standard deviation of the number of shared alleles and number of loci exhibiting genotype identity in pairwise comparisons of 13-locus genotypes of individuals of different populations

| Population | Alleles shared, observed (M ± SD) | Alleles shared, expected (M ± SD) | Genotypes shared, observed (M ± SD) | Genotypes shared, expected (M ± SD) |
|---|---|---|---|---|
| Jamaican | 8.38 ± 2.14 | 8.40 ± 2.10 | 1.04 ± 0.98 | 1.05 ± 0.97 |
| African-Americans | 8.40 ± 2.08 | 8.31 ± 2.09 | 1.01 ± 0.96 | 1.02 ± 0.96 |
| European-Americans | 8.42 ± 2.09 | 8.44 ± 2.10 | 1.09 ± 0.99 | 1.10 ± 0.99 |
| Swiss | 8.10 ± 2.13 | 8.33 ± 2.10 | 1.03 ± 0.96 | 1.07 ± 0.97 |
| Italian | 8.23 ± 2.09 | 8.43 ± 2.10 | 1.03 ± 0.96 | 1.07 ± 0.98 |
| Chinese | 8.69 ± 2.11 | 8.63 ± 2.10 | 1.14 ± 1.00 | 1.16 ± 1.01 |
| Apache | 9.76 ± 2.18 | 9.88 ± 2.05 | 1.70 ± 1.23 | 1.71 ± 1.20 |
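The within-population summaries and the DA distance discussed in this section can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names and the example allele frequencies are hypothetical, heterozygosity is taken as one minus the sum of squared allele frequencies (with an optional small-sample correction), and DA is written in its commonly quoted form, one minus the locus-averaged sum of square roots of products of allele frequencies. The exact estimating equations used in the paper are those of the cited references.

```python
from math import sqrt


def summarize_locus(freqs, n_chromosomes=None):
    """Summaries for one locus.

    freqs maps allele size (repeat count) to allele frequency; the
    frequencies are assumed to sum to 1 (hypothetical input format).
    """
    het = 1.0 - sum(p * p for p in freqs.values())
    if n_chromosomes is not None:
        # Optional small-sample correction for n sampled chromosomes.
        het *= n_chromosomes / (n_chromosomes - 1)
    mean_size = sum(size * p for size, p in freqs.items())
    size_variance = sum(p * (size - mean_size) ** 2 for size, p in freqs.items())
    return {
        "heterozygosity": het,
        "n_alleles": sum(1 for p in freqs.values() if p > 0),
        "mean_size": mean_size,
        "size_variance": size_variance,
    }


def da_distance(pop_x, pop_y):
    """DA-style distance between two populations.

    pop_x and pop_y are lists with one allele-frequency dict per locus,
    in the same locus order.
    """
    total = 0.0
    for fx, fy in zip(pop_x, pop_y):
        # Only alleles present in both populations contribute.
        total += sum(sqrt(fx[a] * fy[a]) for a in set(fx) & set(fy))
    return 1.0 - total / len(pop_x)


# Hypothetical single-locus example (not data from the paper).
example = {12: 0.2, 13: 0.3, 14: 0.4, 15: 0.1}
print(summarize_locus(example))
```

Averaging the per-locus summaries over all 13 loci would give quantities of the kind reported in Table 1.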

4 Power of the 13 STR loci for forensic and identification purposes

Although DNA profiling data can be used for forensic analyses in a wide variety of contexts, for the sake of brevity we will illustrate the power of the 13 STR loci (utilizing the allele frequency data described above) by computing: (i) match probability in random individuals and in relatives of identified subjects with respect to whom a match is observed, (ii) exclusion probability in traditional paternity testing situations (i.e., when DNA profiles of mother, child and alleged father are available) as well as in deficient cases (where the maternal genotype data are unavailable), and (iii) DNA mixture analysis involving mixtures of DNA from two unrelated individuals. Match probability and paternity testing computations were evaluated based on averages as well as on the most common genotype expected to be seen in each population, while the mixture analysis results refer to the most common four (and three) alleles (i.e., by assuming that the four and three most common alleles are observed in the mixture) to illustrate the worst-case scenarios for the combined inference obtained from all 13 loci. To ease exposition, all estimating equations are listed in the appendix along with their citations, and the footnotes under the tables identify the specific equations used.

4.1 Match probability in random individuals

Coincidental match probabilities in random individuals, with and without adjustment for population substructure effects, are shown in Table 3 for the combined 13 STR loci. The average match probability for the combined 13 loci is rarer than one in a trillion, even in the population with the most reduced genetic variation (Apache). With adjustments for the population structure effect (according to the suggestion of Li and Chakravarti [24]) it is even rarer (1 in 1.88 trillion, with θ = 0.01). The worst-case scenarios (evidence profiles being heterozygous for the two most common alleles, or homozygous for the most common allele at all loci) also yield match probabilities rarer than the reciprocal of the current census sizes of the respective populations. This is true not only for the point estimates, but also for the upper 95% confidence limits of the estimates; e.g., for Apache Native-Americans the most common heterozygous DNA profile at all 13 STR loci is expected to occur with a frequency not exceeding 1 in 2.65 billion individuals, and the most common homozygous profile at all 13 loci will have an expected frequency of no more than 1 in 8.77 billion individuals. Of course, the current census size of Apache Native-Americans is far smaller than the inverse of these estimates. For the larger populations, increased variation at these loci makes the random match probabilities even more uncommon. For example, in African-Americans, the average match probability is below 1 in 1300 trillion, and the estimated frequency of the most common 13-locus profile in this population is well below 1 in 100 billion.

Table 3. Match probability estimates in random individuals based on 13 STR loci in seven world populations a)

| Estimate | Jamaican | African-American | European-American | Swiss | Italian | Chinese | Apache |
|---|---|---|---|---|---|---|---|
| Average, θ = 0 b) | 9.50 × 10^14 | 1.32 × 10^15 | 8.23 × 10^14 | 1.25 × 10^15 | 7.45 × 10^14 | 4.85 × 10^14 | 1.77 × 10^12 |
| Average, θ = 0.01 c) | 1.04 × 10^15 | 1.46 × 10^15 | 9.12 × 10^14 | 1.39 × 10^15 | 8.22 × 10^14 | 5.45 × 10^14 | 1.88 × 10^12 |
| Most common homozygote (at all loci), θ = 0.0 d) | 1.61 × 10^13 | 2.48 × 10^13 | 7.40 × 10^13 | 1.61 × 10^14 | 1.38 × 10^14 | 2.71 × 10^13 | 2.27 × 10^10 |
| Upper 95% CI e) | 5.18 × 10^12 | 7.36 × 10^12 | 2.15 × 10^13 | 4.67 × 10^13 | 4.29 × 10^13 | 5.10 × 10^12 | 8.77 × 10^9 |
| Most common homozygote (at all loci), θ = 0.01 d) | 1.20 × 10^13 | 1.82 × 10^13 | 5.32 × 10^13 | 1.14 × 10^14 | 9.85 × 10^13 | 2.00 × 10^13 | 1.84 × 10^10 |
| Upper 95% CI e) | 3.91 × 10^12 | 5.52 × 10^12 | 1.58 × 10^13 | 3.38 × 10^13 | 3.13 × 10^13 | 3.85 × 10^12 | 7.21 × 10^9 |
| Most common homozygote, conditional probability (θ = 0.01) f) | 3.82 × 10^12 | 5.65 × 10^12 | 1.52 × 10^13 | 3.10 × 10^13 | 2.75 × 10^13 | 6.10 × 10^12 | 8.26 × 10^9 |
| Upper 95% CI e) | 1.30 × 10^12 | 1.79 × 10^12 | 4.71 × 10^12 | 9.58 × 10^12 | 9.09 × 10^12 | 1.26 × 10^12 | 3.34 × 10^9 |
| Most common heterozygote (at all loci), θ = 0.0 d) | 9.20 × 10^10 | 2.16 × 10^11 | 2.01 × 10^11 | 2.64 × 10^11 | 1.03 × 10^11 | 5.86 × 10^10 | 5.09 × 10^9 |
| Upper 95% CI e) | 4.63 × 10^10 | 1.02 × 10^11 | 9.64 × 10^10 | 1.27 × 10^11 | 5.29 × 10^10 | 2.29 × 10^10 | 2.65 × 10^9 |
| Most common heterozygote, conditional probability (θ = 0.01) f) | 6.52 × 10^10 | 1.48 × 10^11 | 1.37 × 10^11 | 1.79 × 10^11 | 7.27 × 10^10 | 4.22 × 10^10 | 3.85 × 10^9 |
| Upper 95% CI e) | 3.39 × 10^10 | 7.27 × 10^10 | 6.82 × 10^10 | 8.92 × 10^10 | 3.87 × 10^10 | 1.73 × 10^10 | 2.07 × 10^9 |

a) The match probabilities are 1 in N individuals, where N is the entry in the table.
b) The average match probability is calculated using Eq. (4).
c) The conditional average match probability is calculated using Eq. (5).
d) The multilocus profile frequency is calculated using Eqs. (1) and (6).
e) The multilocus confidence interval is calculated using Eqs. (9) and (10).
f) The multilocus conditional profile frequency is calculated using Eqs. (2) and (6).
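To make the notion of an average match probability concrete, the sketch below computes it for one locus as the sum of squared genotype frequencies under Hardy-Weinberg proportions, and multiplies across loci assuming independence. This is an illustrative reading of the calculation, not a transcription of the appendix equation (which is not reproduced in this section); the function names and the uniform allele frequencies in the usage lines are hypothetical.

```python
from itertools import combinations
from math import prod


def locus_average_match_probability(freqs):
    """Chance that two random, unrelated individuals share a genotype at one locus.

    freqs is a sequence of allele frequencies summing to 1. Genotype
    frequencies are assumed to follow Hardy-Weinberg proportions.
    """
    homozygotes = sum(p ** 4 for p in freqs)  # (p_i^2)^2
    heterozygotes = sum((2 * p * q) ** 2 for p, q in combinations(freqs, 2))
    return homozygotes + heterozygotes


def multilocus_average_match_probability(loci):
    """Product over loci, assuming independence across loci."""
    return prod(locus_average_match_probability(f) for f in loci)


def one_in_n(probability):
    """Express a probability as the N of '1 in N'."""
    return 1.0 / probability


# Hypothetical battery: 13 loci, each with 8 equally frequent alleles.
battery = [[1 / 8] * 8 for _ in range(13)]
print(one_in_n(multilocus_average_match_probability(battery)))
```

Real STR loci have uneven allele frequencies, so actual values differ from this toy case; the point is only that multiplying 13 modest per-locus probabilities yields an extremely small combined probability.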


Some authors argue that the population substructure effect should be evaluated with adjustments based on computations of the conditional probability, given that the profile is observed in a suspect [3, 25, 26]. Although this logic is reasonable only when the true contributor of the profile belongs to the same subpopulation as the suspect, with θ = 0.01 there is virtually no qualitative change in the estimates. For example, the upper 95% confidence limit of the most common heterozygous profile (at all 13 loci) in Apaches changes from 1 in 2.65 billion (estimated by the product rule) to 1 in 2.1 billion (estimated using the conditional probability [25, 26], on which the confidence limit is imposed). Estimates presented in Table 3 further show that, irrespective of the method used in the match probability determination, genetically affine population databases (see Fig. 2) yield estimates of random match probability that are well within their respective sampling variation.
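The difference between the product-rule estimate and the conditional estimate can be sketched as follows. The appendix equations are not reproduced in this section, so the formulas below are an assumption: they are the widely used forms from the forensic literature (a substructure-adjusted homozygote frequency, an unadjusted heterozygote frequency, and the conditional match probabilities given that the profile has been seen once). The function names and example frequencies are hypothetical.

```python
def genotype_frequency(p_a, p_b=None, theta=0.0):
    """Product-rule genotype frequency at one locus.

    p_b=None denotes a homozygote for the allele with frequency p_a.
    Only the homozygote frequency is adjusted by theta (assumed form).
    """
    if p_b is None:
        return p_a * p_a + p_a * (1.0 - p_a) * theta
    return 2.0 * p_a * p_b


def conditional_genotype_frequency(p_a, p_b=None, theta=0.01):
    """Genotype probability given the same genotype was already observed
    in the same subpopulation (assumed form)."""
    denom = (1.0 + theta) * (1.0 + 2.0 * theta)
    if p_b is None:
        num = (2.0 * theta + (1.0 - theta) * p_a) * (3.0 * theta + (1.0 - theta) * p_a)
        return num / denom
    num = 2.0 * (theta + (1.0 - theta) * p_a) * (theta + (1.0 - theta) * p_b)
    return num / denom


def profile_frequency(genotypes, theta=0.0, conditional=False):
    """Multiply single-locus values across loci (independence assumed).

    genotypes is a list of (p_a, p_b) pairs, with p_b=None for homozygotes.
    """
    freq = 1.0
    for p_a, p_b in genotypes:
        if conditional:
            freq *= conditional_genotype_frequency(p_a, p_b, theta)
        else:
            freq *= genotype_frequency(p_a, p_b, theta)
    return freq


# Hypothetical 13-locus heterozygous profile with allele frequencies 0.3 and 0.2.
profile = [(0.3, 0.2)] * 13
print(1.0 / profile_frequency(profile))
print(1.0 / profile_frequency(profile, theta=0.01, conditional=True))
```

With theta set to zero, both functions reduce to the plain product rule, which is a quick consistency check on the assumed forms.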

4.2 Match probability in relatives

In court litigation involving DNA forensic evidence, the question is sometimes raised whether the matched profile could also occur in relatives of the suspect. The computational logic for such evaluations, on which confidence limits may be superimposed (see Appendix), is also available [3, 27, 28]. Table 4 presents estimates of match probability in full-sibs and first cousins to illustrate the point that a 13-locus match is extremely unlikely to arise even in such close relatives. For example, the most common 13-locus homozygous profile is expected to occur in a full-sibling with a frequency of no more than 1 in 7000 sibs of an Apache Native-American. In other words, the suspect would need to have 7000 full-sibs before a recurrence of the same specific 13-locus homozygous profile is expected to be observed. For human families, this is clearly a biological absurdity. The numerical illustrations of Table 4 also show that the interpopulation homogeneity of the match probability estimates in relatives is even more striking than that of the random match probability, particularly in genetically affine population databases.
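A minimal sketch of single-locus match probabilities in relatives follows. It assumes the standard textbook formulas for full sibs and for relatives with a kinship coefficient F (F = 1/16 for first cousins), which may differ in presentation from the appendix equation cited in the paper; the function names and allele frequencies are hypothetical.

```python
def full_sib_match(p_a, p_b=None):
    """Probability that a full sib has the same genotype at one locus.

    p_b=None denotes a homozygote (assumed standard formulas).
    """
    if p_b is None:
        return (1.0 + 2.0 * p_a + p_a * p_a) / 4.0
    return (1.0 + p_a + p_b + 2.0 * p_a * p_b) / 4.0


def relative_match(p_a, p_b=None, kinship=1.0 / 16.0):
    """Match probability for a unilineal relative with kinship F.

    The default F = 1/16 corresponds to first cousins.
    """
    if p_b is None:
        return p_a * p_a + 4.0 * p_a * (1.0 - p_a) * kinship
    return 2.0 * p_a * p_b + 2.0 * (p_a + p_b - 4.0 * p_a * p_b) * kinship


def multilocus(match_fn, genotypes):
    """Multiply single-locus probabilities over loci (independence assumed)."""
    result = 1.0
    for p_a, p_b in genotypes:
        result *= match_fn(p_a, p_b)
    return result


# Hypothetical 13-locus homozygous profile, allele frequency 0.3 at each locus.
profile = [(0.3, None)] * 13
print(1.0 / multilocus(full_sib_match, profile))   # "1 in N" for a full sib
print(1.0 / multilocus(relative_match, profile))   # "1 in N" for a first cousin
```

Because each single-locus sib probability cannot fall below 1/4, the full-sib figures are far less extreme than those for unrelated individuals, which is why the full-sib columns of Table 4 are orders of magnitude smaller than the first-cousin columns.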

4.3 Paternity exclusion probabilities In paternity testing, genetic data is used in two steps. First, given the genotypes of the mother and a child, the question can be asked how often a randomly accused male can be excluded as the father of the child, referred to as ªrandom man not excludedº [29]. This can be evaluated at the level of average degree of polymorphism at the loci (averaged over all possible genotype combinations of the mother and the chld), or for any specific genotype combination of the mother and the child. Note that the determination of exclusion probabilities in this first step can be done even without having the DNA type of the alleged father. Second, when the alleged father is not excluded (based on the genotype of the alleged father in conjunction with those of the mother and the child), a likelihood ratio, termed as the ªpaternity indexº (PI) can be computed [30]. It explains the odds of finding the specific mother child alleged-father genotype combination under the hypothesis that the alleged father is the true biological

Table 4. Match probabilities in relatives based on 13 STR loci in seven world populationsa) Population Jamaican

Most common heterozygote Full sib First cousin Estimateb) 95% Clc) Estimateb) 95% Clc)

6.80 ´ 104 African-American 8.24 ´ 104 European-American 7.78 ´ 104 Swiss 8.38 ´ 104 Italian 7.05 ´104 Chinese 5.95 ´ 104 Apache 2.63 ´ 104

Most common homozygote Full sib First cousin Estimateb) 95% Clc) Estimateb) 95% Clc)

5.92 ´ 104

6.33 ´ 1009

4.03 ´ 1009

5.18 ´ 104

4.28 ´ 104

4.70 ´ 1010

2.30 ´ 1010

7.13 ´ 104

1.23 ´ 1010

7.59 ´ 1009

5.62 ´ 104

4.60 ´ 104

6.38 ´ 1010

3.46 ´ 1010

6.76 ´ 104

1.16 ´ 1010

7.23 ´ 1009

6.93 ´ 104

5.69 ´ 104

1.39 ´ 1011

7.51 ´ 1010

7.29 ´ 104

1.47 ´ 1010

9.17 ´ 1009

8.31 ´ 104

6.84 ´ 104

2.45 ´ 1011

1.33 ´ 1011

6.17 ´ 104

7.11 ´ 1009

4.60 ´ 1009

8.37 ´ 104

6.94 ´ 104

2.26 ´ 1011

1.26 ´ 1011

4.94 ´ 104

4.42 ´ 1009

2.41 ´ 1009

5.56 ´ 104

4.19 ´ 104

6.70 ´ 1010

2.98 ´ 1010

2.29 ´ 104

4.84 ´ 1008

3.14 ´ 1008

8.43 ´ 103

7.00 ´ 103

3.11 ´ 1008

1.84 ´ 1008

a) The match probabilities are 1 in N individuals, where N is the entry in the table. b) The multilocus profile frequency is calculated using Eqs. (3) and (6). c) The multilocus confidence interval is calculated using Eqs. (9) and (10).

1690

R. Chakraborty et al.

Electrophoresis 1999, 20, 1682±1696

father as opposed to a "random male not excluded". Data from multiple loci can be combined for computations of statistical strength in both steps, assuming that genotypes at the loci are mutually independent [29, 30]. While most paternity determination cases involve genotype data on the mother, child and alleged father trio, deficient cases in which the genotype of either the mother or the father is unobserved are also often entertained in parentage analysis. The computational logic of deficient cases is similar to the above-mentioned approaches, although the strength of the evidence is obviously reduced. For illustrative purposes of determining the power of the combined battery of 13 STR loci, Table 5 presents the combined exclusion probabilities for the 13 STR loci, averaged over all possible mother-child genotype pairs as well as over all genotypes of the child when genotype data on the mother are not available. In addition, we also depict the worst-case scenarios: the mother and child are both heterozygous for the two most common alleles at each locus (or, in the deficient case, the child is heterozygous for the two most common alleles). Since hypervariability at STR loci is at least partly due to their high mutation rate relative to the standard biochemically detected loci [31, 32], a recent recommendation of the American Association of Blood Banks (see [33]) suggests declaring nonpaternity based on exclusions at two or more loci tested. Data presented in Table 5 also include the combined exclusion probabilities (for the 13 STR loci) based on this recommendation.

Numerical values shown in Table 5 indicate that with the complete data available for mother, child and the alleged father, the average exclusion probability is at least 99.98% in all seven populations when at least one of the 13 loci exhibits paternity exclusion. With the more stringent criterion of exclusions based on two or more loci, the power of exclusion is somewhat reduced, but in all large populations the average probability of exclusion is still greater than 99.97%. In Apache Native-Americans (the population in which the genetic variation at these 13 loci is most reduced, in our example), with two or more loci showing exclusion, an average mother-child genotype pair will still exclude 99.81% of randomly accused males. Exclusion probabilities for the worst-case scenarios (i.e., when the mother and the child are both heterozygous for the two most common alleles, or the child alone is heterozygous in the deficient case) are somewhat less encouraging, as shown in the last two columns of Table 5. In such cases, the exclusion probability based on one or more loci is below 98% for all populations (about 86.2% for Apaches); based on exclusions at two or more loci, this falls to 55.4% for the Apaches. The latter probability is no more than 86% in any of the seven populations. While at face value these results are discouraging with respect to the potential power of the 13 STR loci, recall that these are the worst-case scenarios, whose occurrence in typical casework is exceedingly rare (because genotypic identity at 13 loci for a mother-child pair is quite uncommon, as shown in the earlier section). Further, even with this low probability of exclusion, should an alleged father not be excluded, the PI value will often still be large enough to establish paternity with reasonable certainty.

Table 5. Paternity exclusion probabilities (in %) based on 13 STR loci in seven world populations

| Population | Both mother and child a): at least one locus d) | Both mother and child a): two or more loci e) | Only the child b): at least one locus d) | Only the child b): two or more loci e) | Lowest possible c): at least one locus d) | Lowest possible c): two or more loci e) |
|---|---|---|---|---|---|---|
| Jamaican | 99.99921 | 99.98337 | 99.88083 | 98.78794 | 95.75003 | 79.98800 |
| African-American | 99.99930 | 99.98509 | 99.88426 | 98.82094 | 96.58596 | 82.93588 |
| European-American | 99.99916 | 99.98228 | 99.87231 | 98.71537 | 97.02795 | 84.34972 |
| Swiss | 99.99932 | 99.98533 | 99.89231 | 98.88615 | 97.38811 | 85.72976 |
| Italian | 99.99914 | 99.98212 | 99.87491 | 98.74445 | 96.27960 | 81.70450 |
| Chinese | 99.99862 | 99.97282 | 99.76567 | 97.88548 | 95.60982 | 79.36276 |
| Apache | 99.98669 | 99.80592 | 99.07993 | 93.64176 | 86.20767 | 55.43217 |

a) The exclusion probability for a single locus when the mother's genotype is known is given in Eq. (11).
b) The exclusion probability for a single locus when the mother's genotype is unknown is given in Eq. (12).
c) Let p1 and p2 be the frequencies of the two most common alleles, A1 and A2. If p1 < 2 p2, then the lowest exclusion probability occurs when the child's genotype is A1A2, and the mother's genotype is also A1A2, if it is known. Otherwise, the lowest exclusion probability occurs when the child's genotype is A1A1, and the mother's genotype is also A1A1, if it is known. The probabilities are calculated using Eq. (11) or (12).
d) When multiple loci are tested, the probability of exclusion at one or more loci is given by Eq. (15).
e) When multiple loci are tested, the probability of exclusion at two or more loci is given by Eq. (16).
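The way single-locus exclusion probabilities combine into the "at least one locus" and "two or more loci" columns of Table 5 can be sketched in Python. This is a minimal illustration, not the authors' program: it assumes independent loci, the function names are hypothetical, and the allele frequencies are invented for the example. The average single-locus powers follow the forms of Eqs. (13) and (14) in the Appendix; the two-or-more-loci rule is written here as the complement of "no exclusion" plus "exactly one exclusion", which is our reading of the criterion rather than a transcription of Eq. (16).

```python
from math import prod

def power_sum(freqs, k):
    """a_k: sum of the k-th powers of the allele frequencies at a locus."""
    return sum(p ** k for p in freqs)

def avg_exclusion_trio(freqs):
    """Average exclusion power with mother and child typed (form of Eq. 13)."""
    a2, a3, a4, a5 = (power_sum(freqs, k) for k in (2, 3, 4, 5))
    return 1 - 2 * a2 + a3 + 3 * (a2 * a3 - a5) - 2 * (a2 ** 2 - a4)

def avg_exclusion_child_only(freqs):
    """Average exclusion power with the mother untyped (form of Eq. 14)."""
    a2, a3, a4 = (power_sum(freqs, k) for k in (2, 3, 4))
    return 1 - 4 * (a2 - a3) - 3 * a4 + 2 * a2 ** 2

def excluded_at_least_one(per_locus):
    """P(exclusion at one or more loci), independent loci (form of Eq. 15)."""
    return 1 - prod(1 - p for p in per_locus)

def excluded_at_least_two(per_locus):
    """P(exclusion at two or more loci) = 1 - P(none) - P(exactly one)."""
    none = prod(1 - p for p in per_locus)
    exactly_one = sum(
        p * prod(1 - q for j, q in enumerate(per_locus) if j != i)
        for i, p in enumerate(per_locus)
    )
    return 1 - none - exactly_one

# Hypothetical allele frequencies for three loci (not data from the paper).
loci = [[0.4, 0.3, 0.2, 0.1], [0.5, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]
trio = [avg_exclusion_trio(f) for f in loci]
child_only = [avg_exclusion_child_only(f) for f in loci]
print(excluded_at_least_one(trio), excluded_at_least_two(trio))
print(excluded_at_least_one(child_only), excluded_at_least_two(child_only))
```

Requiring two exclusions can only lower the combined power relative to requiring one, which is the pattern seen when moving across the paired columns of Table 5.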

4.4 DNA mixture analysis

When forensic specimen samples are suspected to contain DNA from more than one individual, two different questions may be asked. First, considering the DNA profile of the mixed sample alone, one might evaluate the chance of excluding a random person as being a (part) contributor of the mixed sample. Although an answer to this question does not fully utilize all information available through DNA testing (e.g., DNA profiles of the known and suspected contributors do not enter into the computation), this exclusion probability is a crude guide to the power of the tests performed. For a mixed sample originating from two contributors, the worst-case scenario (as far as the exclusion probability is concerned) is to observe the four most common alleles in the mixture. The second question arises when one contributor of the mixture is known and, with DNA typing, this known contributor and a suspect are found to explain all of the alleles in the mixture. In such an event, statistical interpretation of the DNA mixture is done through a likelihood ratio [3, 34, 35], which quantifies the relative odds of observing the mixture profile under the condition that it is contributed by the suspect and the known person as opposed to the mixture being from the known contributor and a random unrelated person. The larger the likelihood ratio, the greater is the strength of the DNA mixture evidence. For this likelihood ratio computation, the worst-case scenario (i.e., the smallest likelihood ratio) is that the mixture contains the three most common alleles at each locus and the known contributor is heterozygous for the two relatively less common alleles. Note that with more than two alleles present in the mixture, a contribution from at least two persons is ensured in most cases.

With this rationale, Table 6 presents the exclusion probability (i.e., the probability that a random person is excluded as being a part contributor of the mixture) and likelihood ratio (L) for two types of mixture cases (four and three alleles seen in the mixture), estimated from data on all 13 STR loci. Although these are worst-case scenarios, the exclusion probabilities based on 13 loci are not particularly impressive (21–47% for 4-allele mixtures, and 47–79% for 3-allele mixtures at all loci). In contrast, when the suspect and the known contributor explain all alleles in the mixture, the likelihood ratio values are large (of the order of five billion to 266 billion for 4-allele mixtures and 14 thousand to 4.7 million for 3-allele mixtures). The relatively modest values of exclusion probabilities for the combined battery of 13 STR loci are not necessarily discouraging, since even with the alleles of the known contributor considered, not every non-excluded random person will explain all of the alleles present in the mixture. For example, suppose that the alleles seen in the mixture at a locus are A1, A2, A3, and A4 and the known contributor has the genotype A3A4. A (random) person of genotype A1A1 will not be excluded as a part contributor of this mixed sample, but even with the known contributor considered, they together will not fully explain the mixture profile.

In other words, exclusion probability computation is a crude, conservative evaluation of the power of the battery of markers employed in mixture analysis. This conservativeness can be illustrated with a simple example: to raise the smallest exclusion probability (0.2102 for the 13 loci in Apaches) to the level of 95%, we would need approximately 165 STR loci with population genetic characteristics similar to this battery of 13 loci. In contrast, even in the populations where the genetic variation is reduced (e.g., in Apache Native-Americans), the present set of 13 STR loci can yield a likelihood ratio exceeding ten thousand, too large to be ascribed to chance alone.

Table 6. Mixture exclusion probabilities and likelihood ratios based on 13 STR loci in seven world populations

| Population | 4-allele mixture: exclusion probability | 4-allele mixture: likelihood ratio | 3-allele mixture: exclusion probability | 3-allele mixture: likelihood ratio |
|---|---|---|---|---|
| Jamaican | 0.4612 | 200 billion | 0.7735 | 2.8 million |
| African-American | 0.4247 | 104 billion | 0.7240 | 2.2 million |
| European-American | 0.4738 | 266 billion | 0.7931 | 4.7 million |
| Swiss | 0.4050 | 210 billion | 0.7208 | 1.3 million |
| Italian | 0.4065 | 88 billion | 0.6905 | 763 thousand |
| Chinese | 0.9781 | 59 billion | 0.7043 | 974 thousand |
| Apache | 0.2102 | 5 billion | 0.4678 | 14 thousand |

In summary, the above computations show that the present battery of 13 STR loci provides sufficient statistical strength for most applications of forensic identification (including DNA mixture interpretations) and parentage analysis. The illustrative worldwide data used in these computations show that although the point estimates of the various statistical measures differ somewhat from one population to another, anthropologically affine population groups yield estimates that are within their respective sampling variation. As a consequence, we may surmise that forensic databases described by major anthropological population groupings should be enough to evaluate the statistical strength of DNA evidence based on this set of 13 STR loci.
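The figure of roughly 165 loci quoted in the mixture discussion can be reproduced with a short calculation. The sketch below is illustrative only and rests on assumptions that the text does not spell out: loci are independent, every added locus behaves like the average of the current 13, and the single-locus mixture exclusion probability is taken as one minus the squared sum of the frequencies of the alleles seen in the mixture (Hardy-Weinberg proportions). Function names and the example allele frequencies are hypothetical.

```python
from math import log

def mixture_exclusion(mixture_allele_freqs):
    """Single-locus chance that a random person carries at least one allele
    not seen in the mixture, assuming Hardy-Weinberg proportions."""
    s = sum(mixture_allele_freqs)
    return 1 - s * s

def loci_needed(pe_now, n_now, pe_target):
    """Number of loci needed to reach pe_target if each locus contributes
    the same non-exclusion factor as the average of the current n_now loci."""
    per_locus_log = log(1 - pe_now) / n_now
    return log(1 - pe_target) / per_locus_log

# Hypothetical frequencies of four alleles seen in a mixture at one locus.
print(mixture_exclusion([0.3, 0.25, 0.2, 0.15]))

# Smallest 13-locus exclusion probability in Table 6 (Apache, 4-allele case).
print(round(loci_needed(0.2102, 13, 0.95)))
```

The second call returns a value that rounds to 165, matching the number given in the text.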

5 SNP versus STR

As mentioned in Section 1, technological advances made in the context of the Human Genome Project have largely dictated the shift in the DNA technologies being implemented in DNA forensics. The recent findings of the abundance of SNPs and the ease of automation and miniaturization of detection techniques (see [8, 36]) are already prompting the introduction of microchip-based SNP assays for DNA forensic analyses. Therefore, based on the empirical data presented above, we ask: how many SNP loci would equal the power of the combined 13 STR loci? A rigorous answer to this question obviously depends on the specific details of the population genetic properties of the SNP loci (e.g., their genomic location, allele frequency distribution, and the extent to which they reflect the effect of genetic substructure in the populations), for which data are still lacking. Here we took the simple approach of determining the random match probability for a biallelic locus (for forensic identification) as a function of allele frequency at an SNP locus, and asked how many such loci would yield the combined power of the random match probability offered by the 13 STR loci. Instead of using the exact values of the estimated random match probabilities for each of the seven world populations, we determined the number of SNP loci needed to reach random match probabilities in the range from 1 in a billion to 1 in 1000 trillion (the range of values observed in Table 3 for the 13 STR loci). The summary of these computations is shown in Fig. 5, in which the number of SNP loci needed is plotted as a function of allele frequency, assuming that each SNP locus has the same allele frequency distribution. Since the average match probability at a biallelic locus is symmetrical around the allele frequency of 0.5, only one-half of the allele frequency range is shown in this diagram.

Figure 5. Number of SNP (biallelic) loci needed to equal prescribed levels of average match probability between unrelated individuals as a function of allele frequencies at each SNP locus. The prescribed match probabilities are designated by the side of each graph. Since the average random match probability at a biallelic locus is a symmetric (around 0.5) function of allele frequency (see Section 8), the graph is drawn for the allele frequency range of 0.0–0.5.

These computations show that, on average, we need 25–45 SNP loci to match the power of the 13 STR loci for random match probability determination if each of the SNP loci has an allele frequency distribution of (0.3, 0.7) for the two segregating nucleotides. In practice, however, different SNP sites will have different allele frequencies, and a more asymmetric allele frequency distribution would require more SNP loci to equal the power of the 13 STR loci. For example, with a (0.1, 0.9) allele frequency distribution, 62 SNP loci would be needed to obtain an average match probability of 1 in 10 billion. Although detailed computations are not shown, similar conclusions are also reached with respect to match probability determination in relatives. As Table 4 shows, the match probabilities in relatives are more modest than the random match probability for the 13 STR loci, and so are the match probabilities in relatives for a biallelic SNP locus. As a consequence, qualitatively the number of SNP loci needed does not change appreciably. However, if the same level of rarity of matches is to be demanded for all relatives, excluding a full-sib match would require more SNP loci than exclusion of matches of other kinships.
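The counting argument behind the SNP numbers for unrelated individuals can be sketched as follows. This is a minimal illustration under the same simplifying assumption stated in the text (every SNP locus has the same two allele frequencies, and loci are independent); the function names are hypothetical, and the per-locus average match probability uses the form 2a2² − a4 given in the Appendix.

```python
from math import ceil, log

def avg_match_prob(freqs):
    """Average single-locus match probability between two unrelated
    individuals under Hardy-Weinberg proportions: 2*a2**2 - a4."""
    a2 = sum(p ** 2 for p in freqs)
    a4 = sum(p ** 4 for p in freqs)
    return 2 * a2 ** 2 - a4

def snp_loci_needed(p, target):
    """Smallest number of identical, independent biallelic loci whose
    combined average match probability is at or below target."""
    m = avg_match_prob([p, 1 - p])
    return ceil(log(target) / log(m))

print(snp_loci_needed(0.1, 1e-10))  # (0.1, 0.9) alleles, 1 in 10 billion
print(snp_loci_needed(0.3, 1e-9))   # (0.3, 0.7) alleles, 1 in a billion
```

The first call gives 62, the value quoted in the text for a (0.1, 0.9) distribution; the second gives 25, the lower end of the range quoted for a (0.3, 0.7) distribution.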

Similar computations for paternity exclusions are shown in Fig. 6a for average exclusion power with data on the mother and child, and in Fig. 6b for genotype data on the child alone. To reach an exclusion probability of 99.9% with data on both the mother and the child available, 33–81 SNP loci would be needed, for p = 0.5 and p = 0.1, respectively. In contrast, with this many SNP loci, for data on the child alone, the resultant exclusion probability would be around 80%.

Figure 6. Number of SNP loci needed to equal prescribed levels of average paternity exclusion probability as a function of the allele frequency at each SNP locus. Panel (a) is for paternity testing with genotype data on both mother and child available; panel (b) is for cases where the mother's genotype is unknown. The prescribed combined exclusion probabilities are designated by the side of the line graphs. Because of symmetry (around 0.5), the graphs are drawn for allele frequencies of 0.0–0.5.

In conclusion, the above numerical results of the comparative efficiency analysis of SNP and STR loci suggest that, without population data on SNP loci, a definite prescription regarding the required number of SNP loci cannot be given; to equal the power of the 13 STR loci with regard to genotypic match probability and/or paternity exclusion, however, somewhere in the range of 30–60 SNP loci would be needed, and they must be selected in such a manner that the assumption of independence across loci is met. Note that since SNP loci are biallelic (and hence less mutable than the STR loci), the population substructure effect on SNP loci can be more severe than at the STR loci [37, 38]. Hence, more careful validation studies of SNP loci would be needed before implementing them for forensic and paternity analysis. In addition, the efficiency of SNP loci for interpreting DNA mixture evidence is far more reduced, necessitating a far greater number of them to equal the potential of the present set of 13 STR loci. A detailed account of this will be given elsewhere.

6 Discussion and conclusion

The overview of worldwide data on the extent of polymorphism at 13 STR loci, currently validated for forensic and paternity analysis, indicates that these loci together have adequate power to resolve most forensic and paternity cases. Above and beyond this, the population data collected in this context can address many of the broad questions of human genome diversity studies, such as the evolutionary relationship of populations, the implications of reduced genetic variation in specific populations, and inference of the past demographic history of populations [39]. The availability of commercial kits for genotyping the STR loci [6, 7] offers the opportunity to conduct population genetic analysis by pooling data through interlaboratory comparisons of results. Worldwide allele frequency data at these 13 STR loci also raise some questions that could yield information as to the mechanism of maintenance of genetic variation at these tetranucleotide loci. For example, the abundance of incomplete repeat alleles at some of these loci (e.g., FGA and D21S11) raises the possibility of more than one mutation mechanism (e.g., slippage and insertion/deletion of nucleotides) operating together. Likewise, since allele frequency distributions at these tetranucleotide loci can be ordered by allele size, composite parameters such as the product of population size and mutation rate can be estimated from more than one statistic (e.g., locus heterozygosity and allele size variance; see [39]), the imbalance of which can reveal signatures of past demographic histories of populations. Population substructure, which is also relevant for applications of these loci to forensic and paternity analysis [3, 35], can also be detected from allele size variance as well as heterozygosity, but these two measures may not be equally sensitive. Questions as to which measure is better suited to detect population substructure can be addressed with population data on these 13 STRs. Thus, we surmise that the worldwide data on the STR loci have implications beyond human identification applications as well.

The empirical estimates of match probability (in random individuals as well as in relatives), and their use in paternity exclusion and DNA mixture analysis, illustrate the point that, considered together, the 13 STR loci are quite efficient. Perhaps, by supplementing them with a few additional repeat loci, we will be in a position to address all identification questions with a precision beyond reasonable doubt (particularly in the cases of mixture analysis and paternity testing with data from the mother unavailable). We showed that the number of SNP loci needed to equal such efficiencies is far larger. To attain such power, SNP loci should also be selected with caution: they should first be chosen so that intra- and interlocus independence of alleles at SNP sites is achieved, and all SNP loci must be coamplifiable to provide complete multilocus genotyping without any systematic bias. Thus, we conclude that STR-based DNA forensic analysis is currently on a sound scientific basis and, as SNP technology becomes more widely available, its introduction should be preceded by validation studies before a complete replacement of the STR technology currently available in this field is considered.
This work was partially supported by US Public Health Service research grants GM 41399, GM 53545 and GM 52601 from the US National Institutes of Health, and grants 96-IJ-CX-0023, 98-LB-VX-K019 and 98-LB-0010 from the US National Institute of Justice. The opinions expressed in this presentation are those of the authors and do not constitute any endorsement by the granting agencies.

Received March 25, 1999

7 References

[1] Kirby, L. T., DNA Fingerprinting: An Introduction, WH Freeman, New York 1990.
[2] National Research Council, DNA Technology in Forensic Science, National Academy Press, Washington, DC 1992.
[3] National Research Council, The Evaluation of Forensic DNA Evidence, National Academy Press, Washington, DC 1996.
[4] Jeffreys, A. J., Wilson, V., Thein, S. L., Nature 1985, 316, 76–79.
[5] Tautz, D., in: Pena, S. D. J., Chakraborty, R., Epplen, J. T., Jeffreys, A. J. (Eds.), DNA Fingerprinting: State of the Science, Birkhäuser, Basel 1993, pp. 21–28.
[6] Lins, A. M., Micka, K. A., Sprecher, C. J., Taylor, J. S., Bacher, J. W., Rabbach, D. R., Bever, R. A., Creacy, S. D., Schumm, J. W., J. Forens. Sci. 1998, 43, 1–13.
[7] AmpFlSTR Profiler PCR Amplification Kit, User's Manual, Part number 402945, Rev. A, Perkin-Elmer Applied Biosystems, Norwalk, CT 1997.
[8] Reynolds, J. E., Head, S. R., McIntosh, T. C., Vrolijk, L. P., Boyce-Jacino, M. T., in: Caetano-Anollés, G., Gresshoff, P. M. (Eds.), DNA Markers: Protocols, Applications, and Overviews, Wiley-VCH, New York 1997, pp. 213–224.
[9] Edwards, A., Hammond, H. A., Jin, L., Caskey, C. T., Chakraborty, R., Genomics 1992, 12, 241–253.
[10] Hammond, H. A., Jin, L., Zhong, Y., Caskey, C. T., Chakraborty, R., Amer. J. Hum. Genet. 1994, 55, 175–189.
[11] Ely, J., Deka, R., Chakraborty, R., Ferrell, R. E., Genomics 1992, 14, 692–698.
[12] Deka, R., Shriver, M. D., Yu, L. M., Jin, L., Aston, C. E., Chakraborty, R., Ferrell, R. E., Genomics 1994, 22, 226–230.
[13] Morin, P. A., Moore, J. J., Chakraborty, R., Jin, L., Goodall, J., Woodruff, D. S., Science 1994, 265, 1193–1201.
[14] Chakraborty, R., Kidd, K. K., Science 1991, 254, 1735–1739.
[15] Evett, I. W., Gill, P., Electrophoresis 1991, 12, 226–230.
[16] Chakraborty, R., Hum. Biol. 1992, 64, 141–159.
[17] Harding, H. W. J., J. Forens. Sci. 1998, 43, 248–249.
[18] Kimmel, M., Chakraborty, R., Stivers, D. N., Deka, R., Genetics 1996, 143, 549–555.
[19] Rubinsztein, D. C., Amos, W., Leggo, J., Goodburn, S., Jain, S., Li, S.-H., Margolis, R. L., Ross, C. A., Ferguson-Smith, M. A., Nature Genet. 1995, 10, 337–343.
[20] Takezaki, N., Nei, M., Genetics 1996, 144, 389–399.
[21] Nei, M., Tajima, F., Tateno, Y., J. Mol. Evol. 1983, 19, 153–170.
[22] Goldstein, D. B., Ruiz Linares, A., Cavalli-Sforza, L. L., Feldman, M. W., Proc. Natl. Acad. Sci. USA 1995, 92, 6723–6727.
[23] Chakraborty, R., Jin, L., Hum. Biol. 1993, 65, 875–895.
[24] Li, C. C., Chakravarti, A., Hum. Hered. 1994, 44, 100–109.
[25] Balding, D. J., Nichols, R., Forensic Sci. Int. 1994, 64, 125–140.
[26] Balding, D. J., Nichols, R., in: Weir, B. S. (Ed.), Human Identification: The Use of DNA Markers, Kluwer Acad., Dordrecht 1995, pp. 3–12.
[27] Jin, L., Ph. D. Thesis, University of Texas Graduate School of Biomedical Sciences, Houston, TX 1994.
[28] Weir, B. S., Hill, W. G., J. Forens. Sci. 1995, 33, 218–225.
[29] Salmon, D., in: Walker, R. H. (Ed.), Inclusion Probabilities in Parentage Testing, American Association of Blood Banks, Arlington 1983, pp. 281–296.
[30] Essen-Moller, E., Quensel, C.-E., Dsch. Z. Ges. Gerichtl. Med. 1939, 31, 79–96.
[31] Weber, W., Wong, C., Hum. Mol. Genet. 1993, 2, 1123–1128.
[32] Chakraborty, R., Kimmel, M., Stivers, D. N., Deka, R., Davison, L. J., Proc. Natl. Acad. Sci. USA 1997, 94, 1041–1046.
[33] Chakraborty, R., Stivers, D. N., J. Forens. Sci. 1996, 41, 671–677.
[34] Weir, B. S., Triggs, C. M., Starling, L., Stowell, L. I., Walsh, K. A. J., Buckleton, J., J. Forens. Sci. 1997, 42, 213–222.
[35] Evett, I. W., Weir, B. S., Interpreting DNA Evidence, Sinauer, Sunderland 1998.
[36] Wang, D. G., Fan, J. B., Siao, C. J., Berno, A., Young, P., Sapolsky, R., Ghandour, G., Perkins, N., Winchester, E., Spencer, J., Kruglyak, L., Stein, L., Hsie, L., Topaloglou, T., Hubbell, E., Robinson, E., Mittmann, M., Morris, M. S., Shen, N., Kilburn, D., Rioux, J., Nusbaum, C., Rozen, S., Hudson, T. J., Lander, E. S., Science 1998, 280, 1077–1082.
[37] Chakraborty, R., Jin, L., Hum. Genet. 1992, 88, 267–272.
[38] Jin, L., Chakraborty, R., Heredity 1995, 74, 274–285.
[39] Kimmel, M., Chakraborty, R., King, J. P., Bamshad, M., Watkins, W. S., Jorde, L. B., Genetics 1998, 148, 1921–1930.
[40] Chakraborty, R., Srinivasan, M. R., Daiger, S. P., Amer. J. Hum. Genet. 1993, 52, 60–70.
[41] Li, C. C., Weeks, D. E., Chakravarti, A., Hum. Hered. 1992, 43, 45–52.
[42] Goodman, L. A., J. Amer. Stat. Assoc. 1962, 57, 54–60.

8 Appendix

8.1 Genotype match probabilities

Unconditional match probability for the l-th single locus:

$$p_M^{(l)} = \begin{cases} p_i^2 + p_i(1-p_i)\,\theta & \text{for genotype } A_iA_i \\ 2p_ip_j(1-\theta) & \text{for genotype } A_iA_j,\ i \neq j \end{cases} \qquad (1)$$

where θ is the measure of population substructuring (see [3]).

Conditional match probability: when the sample donor and the suspect are from the same subpopulation, then, conditional on the genotype of the sample, for the l-th single locus,

$$p_M^{(l)} = \begin{cases} \dfrac{[2\theta + (1-\theta)p_i]\,[3\theta + (1-\theta)p_i]}{(1+\theta)(1+2\theta)} & \text{for genotype } A_iA_i \\[2ex] \dfrac{2\,[\theta + (1-\theta)p_i]\,[\theta + (1-\theta)p_j]}{(1+\theta)(1+2\theta)} & \text{for genotype } A_iA_j,\ i \neq j \end{cases} \qquad (2)$$

(see [25, 26]).

Match probability in relatives: the match probability in a relative, given the genotype of a proband (i.e., known subject), is

$$p_M^{(l)} = \begin{cases} \phi_2 + \phi_1 p_i + \phi_0 p_i^2 & \text{for genotype } A_iA_i \\ \phi_2 + \phi_1 (p_i + p_j)/2 + 2\phi_0 p_ip_j & \text{for genotype } A_iA_j,\ i \neq j \end{cases} \qquad (3)$$

where φ0, φ1 and φ2 are the probabilities of having 0, 1 and 2 alleles identical by descent (IBD) in two relatives. For full siblings, (φ0, φ1, φ2) = (1/4, 1/2, 1/4) and for first cousins, (φ0, φ1, φ2) = (3/4, 1/4, 0).

Average match probability: the average profile frequency, which is the probability that two genotypes chosen at random will match, is

$$p_M^{(l)} = 2a_2^2 - a_4 \qquad (4)$$

where $a_k = \sum_i p_i^k$, the sum of the k-th powers of allele frequencies [41].

Average match probability with θ ≠ 0:

$$p_M^{(l)} = (1-\theta)^2\,(2a_2^2 - a_4) + 2\theta(1-\theta)\,a_3 + \theta^2 a_2 \qquad (5)$$

see [24]. Note that when θ = 0 this reduces to Eq. (4).

Combined match probability over multiple loci: for L loci, the combined match probability is

$$P_M = \prod_{l=1}^{L} p_M^{(l)} \qquad (6)$$

where $p_M^{(l)}$ is the match probability as computed in Eqs. (1), (2), (3) or (4) (see [3, 14, 35, 40]).

Variance of match probability estimates, $V[\hat{p}_M^{(l)}]$: as derived in [40].

Confidence interval estimates: assuming that the logarithm of a frequency estimate is approximately distributed as a normal random variable, we can use its variance to calculate confidence limits. Following Chakraborty et al. [40], we estimate the variance of the natural log of the estimates by

$$V[\ln \hat{p}_M^{(l)}] \approx \sigma_M^2 = (\hat{p}_M^{(l)})^{-2}\, V[\hat{p}_M^{(l)}] \qquad (8)$$

For the variance of multilocus profile frequency estimates, the approximation

$$V[\hat{P}_M] \approx \hat{P}_M^2 \left[ \prod_{l=1}^{L} \left( 1 + (\hat{p}_M^{(l)})^{-2}\, V[\hat{p}_M^{(l)}] \right) - 1 \right] \qquad (9)$$

translates to

$$V[\ln \hat{P}_M] \approx \sum_{l=1}^{L} (\hat{p}_M^{(l)})^{-2}\, V[\hat{p}_M^{(l)}] \qquad (10)$$

for an L-locus profile [40, 42]. Consequently (as shown in [40]), the 100α% upper confidence limit is given by

$$\exp\left( \ln \hat{P}_M + Z_{\alpha/2}\, \sigma_M \right)$$

where $Z_{\alpha/2}$ is the 100α/2-th percentile of the standard normal distribution.

8.2 Power of exclusion

Mother and child genotype known: the probability of excluding a randomly chosen man from paternity given a mother-child pair is

Only child's genotype is known (deficient): the deficient case is when only the child's genotype is available. Under this condition, the exclusion probability is

Average power of exclusion: under Hardy-Weinberg equilibrium (HWE), the probability that a random man would be excluded as the father of a child in a randomly chosen mother-child pair is (see [33] for derivation)

$$p_E = 1 - 2a_2 + a_3 + 3(a_2a_3 - a_5) - 2(a_2^2 - a_4) \qquad (13)$$

where $a_k = \sum_i p_i^k$ as above. In the deficient case, this is

$$p_E = 1 - 4(a_2 - a_3) - 3a_4 + 2a_2^2 \qquad (14)$$

Multilocus power of exclusion: the probability that a random man is excluded at least at one locus is

$$P_E = 1 - \prod_{l=1}^{L} \left( 1 - p_E^{(l)} \right) \qquad (15)$$

The probability that a random man is excluded at two loci at least is [33]