Biometrika (1998), 85, 2, pp. 269-287 Printed in Great Britain
Measuring heterogeneity in forensic databases using hierarchical Bayes models

BY KATHRYN ROEDER
Department of Statistics, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, U.S.A.
[email protected]

MICHAEL ESCOBAR
Department of Public Health Sciences, University of Toronto, Ontario, Canada, M5S 1A1
[email protected]

JOSEPH B. KADANE
Department of Statistics, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, U.S.A.
[email protected]

AND IVAN BALAZS
Department of Genetics, Lifecodes Corporation, 550 West Avenue, Stamford, Connecticut 06902, U.S.A.
[email protected]

SUMMARY
As currently defined, DNA fingerprint profiles do not uniquely identify individuals. For criminal cases involving DNA evidence, forensic scientists evaluate the conditional probability that an unknown, but distinct, individual matches the crime sample, given that the defendant matches. Estimates of the conditional probability of observing matching profiles are based on reference populations maintained by forensic testing laboratories. Each of these databases is heterogeneous, being composed of subpopulations of different heritages. This heterogeneity has an impact on the weight of the evidence. A hierarchical Bayes model is formulated that incorporates the key physical characteristics inherent in these data. With the help of Markov chain Monte Carlo sampling, levels of heterogeneity are estimated for three major ethnic groups in the database of Lifecodes Corporation.

Some key words: DNA fingerprint; Gibbs sampling; Hardy-Weinberg equilibrium; Population heterogeneity.
1. INTRODUCTION
Since the mid 1980s, Variable Number Tandem Repeat loci (VNTRs) have been used for forensic evidence in criminal cases. Typically, a number of loci are examined with the collection of observations from these loci forming a DNA profile. Given two profiles that could plausibly be observations from the same individual, one from the crime scene and one from the defendant, the legal system is charged with providing the weight of the evidence.
Two competing hypotheses are considered: ($H_1$) the defendant left the DNA crime sample, and ($H_0$) a person other than the defendant left the crime sample. To establish the weight of the evidence, the courts generally require an estimate of the probability that another, distinct individual's profile matches the crime sample, conditional on knowing that the defendant's profile matches. Naturally, such probabilities depend on the degree of relatedness between the defendant and the potential perpetrator. If the potential perpetrator and the defendant are related, this induces correlations which cause the conditional probability of a match to be greater than the marginal probability of a match. Aside from close relatives, human geneticists typically discretise the relationships into two levels: major ethnic groups and subpopulations within these major ethnic groups (Weir, 1990).

First suppose the potential perpetrator and the defendant are assumed to be unrelated members of the same major ethnic group. While the conditional probability of a match could be calculated in various ways, the method commonly in use assumes independence of the defendant's and the potential perpetrator's profiles, as well as independence of the observations constituting the profile. The resulting multiplication rule seems to provide close agreement with empirically derived counterparts for most post-industrial societies (Brookfield, 1992; Devlin & Risch, 1992; Evett & Gill, 1991; Evett, Scranage & Pinchin, 1993; Weir, 1992), suggesting that the conditional and marginal probabilities of a match agree rather closely in this situation. There are small isolated populations that would be exceptions to this rule.

Next, suppose the defendant and the potential perpetrator are assumed to be unrelated members of the same subpopulation. A priori, one expects that the conditional probability of a match could be measurably greater than the marginal probability of a match in this situation because of correlations of alleles induced by the heterogeneity among the subpopulations (Lewontin & Hartl, 1991; Nichols & Balding, 1991; Hartl & Lewontin, 1993). The present paper quantifies the level of correlation under the subpopulation assumption.

Finally, if the potential perpetrator and the defendant are members of the same family, then the conditional probability of a match is many orders of magnitude greater than the marginal probability of a match. At this level of relatedness, Mendelian segregation probabilities inflate the conditional probability of a match; this probability can be calculated quite easily (Cotterman, 1940). Correlation of alleles as a result of heterogeneity among subpopulations may also have an impact on this calculation, but this effect can be quantified in the same way as when the individual and the defendant were merely from the same subpopulation (Balding & Nichols, 1994).

How can one quantify the odds of $H_1$ given all of these conditional probabilities? This problem has been posed as a modified version of the 'Island Problem' (Dawid, 1994; Balding & Donnelly, 1995; Dawid & Mortera, 1996). Here we assume the simplest and most common situation that arises in the U.S. courts: the defendant was located without the use of a database search, for example by eyewitness identification. In this situation, the posterior odds of $H_1$ can be computed using a weighted sum of the conditional probabilities just discussed, provided the prior odds of $H_1$ are available for each possible perpetrator. The following equation was given by Balding, Donnelly & Nichols (1994):
where $\mathcal{E}_i$ is the event that the $i$th person has the DNA profile which matches the crime sample; $s$ and $C$ represent the labels of, respectively, the defendant and the source of the crime sample. The sum is to be taken over all possible donors of the crime sample. Rather than
calculate (1), the U.S. courts usually examine the conditional probability of a match, $\mathrm{pr}(\mathcal{E}_i \mid \mathcal{E}_s)$, individually for various levels of genetic relatedness, e.g. National Research Council (1996, Ch. 5). Regardless of how the conditional probabilities of a match are presented in court, there is substantial interest in calculating $\mathrm{pr}(\mathcal{E}_i \mid \mathcal{E}_s)$ when $i$ and $s$ are members of the same subpopulation. Computation of $\mathrm{pr}(\mathcal{E}_i \mid \mathcal{E}_s)$ requires knowledge of the heterogeneity, $\theta$, among the allele distributions of subpopulations within a major ethnic group. Specifically $\theta$ quantifies the correlation between alleles for two members of the same subpopulation, e.g. Crow & Denniston (1993).

For loci yielding discrete alleles, i.e. discrete outcomes, there are two standard population genetic methodologies for estimating the variability among the subpopulations when the observations are classified by subpopulation, namely a direct method (Cockerham, 1969, 1973; Weir & Cockerham, 1984) and an indirect method (Nei, 1977). The database that we studied, however, presents several difficulties that make this a challenging problem: the data are continuous as a result of measurement error, and the observations available are not usually classified by subpopulations. Since the subpopulation labels are missing, the direct method is not applicable. However, using Bayesian techniques, we can apply an extension of the indirect method to these data.

In § 3, we present such a method for estimating the heterogeneity parameter $\theta$. This parameter is roughly analogous to Wright's $F_{ST}$ (1951), Nei's (1977, 1986) fixation index, Weir & Cockerham's $\theta$ (1984) and Morton's kinship parameter (1992). The hierarchical Bayes model employed is estimated using Markov chain Monte Carlo sampling. In § 2, we present genetic background material. In § 4, we find the posterior distribution of $\theta$ in the forensic databases of Lifecodes Corporation for several loci.
In § 5, we mix relatively homogeneous populations to induce greater heterogeneity and examine the effects of such heterogeneity on forensic calculations. More importantly, this experiment allows us to validate our model for forensic use. Specifically we compare our estimate of $\mathrm{pr}(\mathcal{E}_i \mid \mathcal{E}_s)$, obtained from a heterogeneous database, to our best estimate of the true value, obtained from the appropriate homogeneous database. Our criterion for success is an estimated conditional probability of a match that tends to be of the same order of magnitude or larger than the true value. Values larger than the truth are deemed acceptable in the legal context because they favour the defendant. In § 6, we discuss the applicability of our estimates to actual court cases.

2. GENETIC BACKGROUND
At each locus, an individual inherits an ordered pair of alleles $(A_1, A_2)$. If these alleles were an independent and identically distributed sample from the allele probability distribution, then the genotype probability distribution would be computed by multiplying the allele probabilities. As a result of the population substructure, i.e. heterogeneity, the situation is somewhat more complicated and various evolutionary models exist, each with a host of modelling assumptions (Wright, 1951; Malecot, 1948). Nevertheless, these competing theories agree that in many situations the probability of observing genotype $(a(i), a(j))$ is approximately described by $2(1-\theta)\gamma(i)\gamma(j)$
(i0, (ii) $n \to \infty$, $N \to \infty$ so that $\max\{p_i\} n^2 \to 0$, the probability of two observations from the same subpopulation becomes negligibly small. Thus the likelihood for $\theta$ is dominated by the product that comes from assuming independence across individuals. We treat this approximation as exact and proceed with this modelling approach with the understanding that our inferences apply only under suitable conditions.

Given $G$, the observations on individuals are independent. Another way to show that our likelihood approximates the full likelihood, then, requires control of the accumulating errors when $\mathrm{pr}\{(A_1, A_2) = (i, j) \mid G\}$ is approximated by the right-hand side of (10). This is a promising line of further research. Finally, yet another approach to our likelihood is to consider asymptotics as $\theta \to 0$. Thus several different approaches potentially offer justification for the likelihood in our calculations.

The advantage of this approach over competing models in the literature is that it does not require the specification of subpopulation membership and yet the likelihood is quite simply computed. Balding & Nichols (1995) and Balding, Greenhalgh & Nichols (1996) develop models that are similar in some regards to this model, but both models require the specification of subpopulation membership. This can be difficult because the concept of a subpopulation, in practice, is poorly defined. Foreman, Smith & Evett (1997) seek to circumvent this difficulty by developing a model similar to ours that estimates the subpopulation membership. However, their likelihood is considerably more challenging to compute. Moreover, it requires that $N$ be specified. This can be quite difficult in practice.

4. RESULTS
The computations were done using Markov chain Monte Carlo methods with data augmentation of an indicator for identity by descent. The details can be found at http://lib.stat.cmu.edu/www/cmu-stats/tr/tr662/tr662.html. To illustrate how the method performs we conducted simulations for two VNTRs with markedly different features: D2S44 has a large number of alleles, each allele occurring with small probability, 0 1 (Devlin, Risch & Roeder, 1991b). We generated 100
samples of size 500 and 1000, using the probability model outlined in Levels 1-4, with $\theta$ equal to 0.02 and 0.005, for each of the loci. The median of the posterior distribution was used as an estimate of $\theta$ for the purpose of summarising the simulations. To determine the sensitivity of the method to the replacement of the allele distribution $\gamma$ by an empirical estimate, we used two methods to estimate $\theta$: we used the same value of $\gamma$ that was used to generate the data, Method T; and we used a histogram estimate of the allele distribution, Method H. Typically Method H yielded a slightly smaller estimate of $\theta$ than Method T; see Table 2. Regardless of the method used, the true value of $\theta$ tended to be well within the high credible region and the median of the posterior distribution of $\theta$ was fairly close to the truth. If the true value of $\theta$ was 0.005, the median tended to be above the truth, while if $\theta = 0.02$, the median tended to be below the truth.

For samples with $n < 200$, not reported, we found that both methods consistently overestimated $\theta$ with the given prior. Experiments with different priors on $\theta$ led to the conclusion that for $n < 200$ the method is sensitive to the prior specification of $\theta$. Samples larger than 200 are required to estimate the small levels of $\theta$ usually encountered in human populations.

Table 2. Median (and standard deviation) of the posterior distribution of $\theta$ obtained in 100 repetitions of a simulation experiment: n, number of pairs of single locus profiles sampled

                    Locus D2S44                 Locus D17S79
   n     θ*     Method T      Method H      Method T      Method H
1000    2.0    1.77 (0.71)   1.77 (0.83)   1.68 (0.68)   1.63 (0.81)
1000    0.5    0.74 (0.39)   0.89 (0.52)   0.68 (0.43)   0.82 (0.49)
 500    2.0    1.81 (0.85)   1.72 (0.93)   1.74 (0.81)   1.63 (0.82)
 500    0.5    0.75 (0.38)   0.96 (0.61)   0.78 (0.40)   0.97 (0.42)

Method T uses the true allele distribution, Method H uses a histogram estimate of the allele distribution; $\theta \sim$ beta (1, 49) a priori. Entries for θ* and Methods T and H should be multiplied by $10^{-2}$.
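To give a sense of the scale implied by the beta (1, 49) prior noted under Table 2, the following minimal sketch evaluates its mean and a few quantiles in closed form, alongside the uniform beta (1, 1) prior. It relies only on the standard fact that a beta (1, b) distribution has distribution function $1 - (1 - x)^b$; the function names are ours and purely illustrative, and nothing here reproduces the Markov chain Monte Carlo computations.

```python
# Closed-form summaries of a beta(1, b) prior on theta.
# For beta(1, b): F(x) = 1 - (1 - x)**b, so quantile(p) = 1 - (1 - p)**(1 / b).
# Function names are illustrative assumptions, not taken from the paper.


def beta1_mean(b: float) -> float:
    """Mean of a beta(1, b) distribution."""
    return 1.0 / (1.0 + b)


def beta1_quantile(p: float, b: float) -> float:
    """Quantile of a beta(1, b) distribution at probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return 1.0 - (1.0 - p) ** (1.0 / b)


if __name__ == "__main__":
    for b in (49.0, 1.0):
        print(f"beta(1, {b:g}): mean = {beta1_mean(b):.4f}")
        for p in (0.25, 0.50, 0.75, 0.95):
            print(f"  {int(p * 100)}% quantile = {beta1_quantile(p, b):.4f}")
```

Under this parameterisation the beta (1, 49) prior has mean 0.02, which coincides with the larger of the two values of $\theta$ used in the simulations.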
From the Lifecodes database, we obtained estimates of $\theta$ for three major ethnic groups: Caucasian, African American and Hispanic; see Fig. 2 and Table 3. Although the four-stage model estimated the posterior distribution for each locus independently, the resulting posterior densities are fairly similar across loci, for each population. Estimates of Caucasian variability are quite small; the average median $\theta$ for the four loci was $\theta_M = 0.0028$, a result which is consistent with other analyses of forensic DNA databases (Morton, Collins & Balazs, 1993). Furthermore, these estimates agree with the estimates of variability among Caucasian subpopulations in general (Chakraborty & Jin, 1992; Morton, 1992). Likewise estimates for African Americans are small ($\theta_M = 0.0076$). Not surprisingly, Hispanic estimates are generally larger ($\theta_M = 0.0107$). Hispanics are known to be a somewhat heterogeneous group of relatively recent origin. In summary, we conclude that none of these populations exhibited substantial heterogeneity; see Table 3.

To examine the effect of mixing differentiated subpopulations, we first use four Amerindian Nations that are believed not to have undergone extensive recent cross-tribal matings. When we combined these Lifecodes databases, i.e. Maya, Navajo, Pima and Tobas, the estimate of heterogeneity was much larger than those found in the major databases: $\theta_M = 0.0215$; see Table 4. Details about these databases appear in Balazs (1993) and Balazs et al. (1992). Weir (1994) obtained roughly comparable values for Amerindian populations using direct estimation methods, different databases and different populations. It is worth noting that combined Amerindian databases are not used for forensic calculations. In the U.S., whenever possible, probability calculations are based on the individual tribal databases.

Table 3. Estimates of percentiles of the posterior distribution of $\theta$ for Lifecodes forensic databases, for Caucasian, African American and Hispanic populations
                                   Caucasian                        African American
Locus                    D2S44  D17S79  D14S13  D18S27     D2S44  D17S79  D14S13  D18S27
Median                   0.368   0.186   0.196   0.418     0.933   0.492   0.705   0.983
25% probability bound    0.095   0.047   0.062   0.116     0.321   0.158   1.160   0.323
75% probability bound    1.490   1.442   0.773   1.626     2.317   1.929   1.160   2.716
90% probability bound    3.489   3.875   2.220   3.925     3.848   3.872   1.160   5.109
95% probability bound    5.114   4.962   3.513   5.494     4.653   1.962   4.545   6.391

                                   Hispanic
Locus                    D2S44  D17S79  D14S13  D18S27
Median                   0.962   0.760   1.610   1.102
25% probability bound    0.349   0.250   0.776   0.403
75% probability bound    2.323   2.904   2.847   2.896
90% probability bound    4.434   5.874   4.244   5.763
95% probability bound    6.011   7.064   5.135   7.498

Sample size (values in the order they appear in the source; their assignment to columns is not recoverable): 3113, 3102, 1952, 1004, 1005, 2287, 402, 405, 296, 284, 705, 580.

Entries should be multiplied by $10^{-2}$.
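The percentile summaries reported in Table 3 are of the kind one would compute from posterior draws of $\theta$. The sketch below illustrates that bookkeeping only: the draws are stand-ins generated from a beta (1, 49) distribution, not output from our sampler, the linear-interpolation percentile rule is an assumption, and all names are hypothetical.

```python
import random
from typing import Dict, List, Sequence

# Percentiles reported per locus, in the row order of Table 3.
LEVELS = (("Median", 0.50), ("25%", 0.25), ("75%", 0.75), ("90%", 0.90), ("95%", 0.95))


def percentile(sorted_draws: Sequence[float], q: float) -> float:
    """Percentile of pre-sorted draws by linear interpolation, q in [0, 1]."""
    if not sorted_draws:
        raise ValueError("no draws supplied")
    pos = q * (len(sorted_draws) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_draws) - 1)
    frac = pos - lo
    return sorted_draws[lo] * (1.0 - frac) + sorted_draws[hi] * frac


def summarise(draws: Sequence[float]) -> Dict[str, float]:
    """Median and probability bounds of a collection of draws of theta."""
    ordered: List[float] = sorted(draws)
    return {label: percentile(ordered, q) for label, q in LEVELS}


# Stand-in draws (an assumption for illustration, NOT real posterior output).
rng = random.Random(0)
stand_in_draws = [rng.betavariate(1, 49) for _ in range(5000)]
summary = summarise(stand_in_draws)

if __name__ == "__main__":
    for label, value in summary.items():
        # Printed on the same scale as the table entries (multiples of 10^-2).
        print(f"{label:>6}: {100 * value:.3f} x 10^-2")
```

With draws from an actual sampler in place of `stand_in_draws`, the same function would yield one column of the table per locus.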
Table 4. Estimates of percentiles of the posterior distribution of $\theta$ for non-forensic databases and mixed databases

                                   Amerindian                       Regional Hispanic
Locus                    D2S44  D17S79  D14S13  D18S27     D2S44  D17S79  D14S13  D18S27
Median                   1.039   3.225   0.964   3.288     1.388   1.326   1.291   2.065
25% probability bound    0.327   0.682   0.388   0.961     0.449   0.582   0.403   0.721
75% probability bound    3.352  10.834   1.829   9.566     3.463   4.401   2.451   4.688
90% probability bound    7.798  16.138   2.687  18.402     5.803   7.536   3.896   7.678
95% probability bound   10.743  17.530   3.201  22.668     7.198   8.568   4.803   9.244

Sample size for the Amerindian and Regional Hispanic columns (values in the order they appear in the source; their assignment to columns is not recoverable): 577, 655, 587, 594, 535, 651, 494, 505.

                          Caucasian + African American              Big mixture
Locus                    D2S44  D17S79  D14S13  D18S27     D2S44  D17S79  D14S13  D18S27
Median                   1.290   0.721   0.842   1.504     2.915   2.385   3.630   4.298
25% probability bound    0.486   0.228   1.334   0.475     1.069   0.522   2.150   1.563
75% probability bound    2.781   2.609   1.334   3.666     5.996   8.344   5.110   9.138
90% probability bound    4.494   5.342   1.334   6.034     9.168  12.438   6.344  14.069
95% probability bound    5.546   6.321   2.205   7.280    10.957  13.580   7.044  16.340
Sample size                600     600     600     600       650     650     455     650

Entries should be multiplied by $10^{-2}$.
The Lifecodes Hispanic database consists primarily of individuals from the New York and Los Angeles metropolitan areas. This database is expected to be more homogeneous than a database consisting of Hispanics sampled from several regions of the country. Lifecodes also possesses samples of Hispanics obtained from four regions, California, Florida, New York and Texas (Balazs, 1993). We analysed the mixed population composed of these four regional samples and found that the heterogeneity was substantially increased: $\theta_M = 0.0150$; see Table 4.

To examine the sensitivity of the procedure to admixture of two relatively homogeneous populations, we mixed observations from the Caucasian and African American databases. These two populations have the smallest estimates of $\theta$ and are the least similar of the major ethnic groups in the U.S. (Devlin & Risch, 1992). We randomly selected 300 individuals from each of these two databases and calculated $\theta$ based on this ethnically mixed population. The averages of the posterior percentiles are presented in Table 4 based on 25 repetitions of this experiment ($\theta_M = 0.0109$). Note that the mixed population has substantially more heterogeneity than either component population, as would be expected.

To examine the magnitude of the variability between distinct populations in the Lifecodes database, a similar experiment was performed in which we sampled 50 individuals from each of 13 populations, namely Caucasian, African American, Texas Hispanic, New York Hispanic, Florida Hispanic, California Hispanic, Cheyenne, Maya, Navajo, Pima, Tobas, Chinese and Australian Aborigine; only 35 were sampled for D14S13 because of insufficient data from one population. Estimates of $\theta$ increased dramatically: $\theta_M = 0.0328$.

Finally, the effect of the beta (1, 49) prior was examined and found to be minor. While unrealistic from a scientific viewpoint, a uniform distribution is a natural reference prior in this setting. Even with this prior, the estimated posteriors support small values for $\theta$; typically the median of the posterior distribution obtained using a beta (1, 1) prior was about 20% larger than the median of the posterior calculated using the beta (1, 49) prior. The following section shows that a difference of this size has no practical impact on the forensic calculation.

5. FORENSIC CALCULATIONS
Suppose a crime has occurred and $(x_1, x_2)$ and $(y_1, y_2)$ denote the observed fragment lengths obtained respectively from the defendant and the sample available as evidence. Lifecodes Corporation defines bins, $b_1$ and $b_2$, about the sample: $b_i = (0.982 y_i, 1.018 y_i)$ for $i = 1, 2$. If the defendant bands fall within $b_1$ and $b_2$ for each locus, a match is declared. For notational convenience we indicate this by writing $x_i \in b_i$. To evaluate the probability of a match the proportions of observations in the reference database falling into these two bins, $p_1$ and $p_2$, are calculated. If we assume independence and no population heterogeneity, the conditional probability of a match among unrelated individuals can be calculated as

$$\mathrm{pr}(x_1 \in b_1, x_2 \in b_2 \mid y_1 \in b_1, y_2 \in b_2) = \begin{cases} 2 p_1 p_2 & \text{if } b_1 \neq b_2, \\ p_1^2 & \text{if } b_1 = b_2, \end{cases} \qquad (11)$$

for each locus: in practice, the probability is also doubled whenever $b_1 = b_2$ to account for coalescence. The model which motivates (11) is known as the Hardy-Weinberg model. The probability of a multilocus match is obtained by multiplying the probabilities across loci.
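The binning rule and the Hardy-Weinberg calculation just described can be sketched as follows. This is a minimal illustration under stated assumptions: the reference database and fragment lengths are made-up numbers, the function names are hypothetical, the bands are assumed to be supplied in corresponding order, and the doubling applied in practice when $b_1 = b_2$ is deliberately not implemented.

```python
from math import prod
from typing import Sequence, Tuple

Bin = Tuple[float, float]


def match_bin(y: float) -> Bin:
    """Bin b_i = (0.982 * y_i, 1.018 * y_i) about an evidence band y_i."""
    return (0.982 * y, 1.018 * y)


def in_bin(x: float, b: Bin) -> bool:
    """True if fragment length x falls strictly inside bin b."""
    return b[0] < x < b[1]


def declares_match(x: Sequence[float], y: Sequence[float]) -> bool:
    """A match is declared if every defendant band lies in the matching bin."""
    return all(in_bin(xi, match_bin(yi)) for xi, yi in zip(x, y))


def bin_proportion(reference: Sequence[float], b: Bin) -> float:
    """Proportion of reference-database observations falling in bin b."""
    if not reference:
        raise ValueError("empty reference database")
    return sum(in_bin(r, b) for r in reference) / len(reference)


def hw_match_probability(p1: float, p2: float, same_bin: bool) -> float:
    """Single-locus match probability under the Hardy-Weinberg model."""
    return p1 ** 2 if same_bin else 2.0 * p1 * p2


def multilocus_match_probability(per_locus: Sequence[float]) -> float:
    """Multiply single-locus match probabilities across loci."""
    return prod(per_locus)


# Made-up numbers for illustration only (not Lifecodes data).
reference = [990.0, 1000.0, 1010.0, 1500.0, 2000.0, 2020.0, 3000.0, 3100.0]
evidence = (1000.0, 2000.0)
defendant = (1005.0, 2010.0)

b1, b2 = match_bin(evidence[0]), match_bin(evidence[1])
p1, p2 = bin_proportion(reference, b1), bin_proportion(reference, b2)
single_locus = hw_match_probability(p1, p2, same_bin=(b1 == b2))

if __name__ == "__main__":
    print("match declared:", declares_match(defendant, evidence))
    print("p1, p2:", p1, p2)
    print("single-locus match probability:", single_locus)
```

In this toy example the two bins are distinct, so the first case of the Hardy-Weinberg rule applies; a multilocus figure would follow by passing the per-locus values to `multilocus_match_probability`.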
Suppose the defendant and the potential perpetrator are assumed to be from the same subpopulation. This model was dubbed the Affinal model by Morton (1992), meaning 'affined' or 'same as'. For a known value of $\theta$ the single locus conditional match probability is calculated as -0)p2}