
Incorporating population structure into forensic Bayesian networks

Amanda B. Hepler (a), A. Philip Dawid (b)

(a) Department of Statistical Science, University College London, Gower Street, London, WC1E 6BT

(b) Department of Pure Mathematics and Mathematical Statistics, University of Cambridge, Wilberforce Road, Cambridge, CB3 0WB

Bayesian networks are gaining popularity as a graphical tool to communicate the complex probabilistic reasoning required in the evaluation of DNA evidence. Incorporating allelic dependencies that result from population structure within these networks is a relatively new endeavour and this study provides some initial thoughts on how to approach the construction of these networks. We introduce object-oriented Bayesian networks designed to model forensic identification cases while accounting for population structure. Exact and approximate methods are explored including a blocking Gibbs approach, via HUGIN’s application programming interface (API), to model the unknown subpopulation frequencies. We explore forensic paternity examples, including complex cases with missing data. Accounting for population structure within the Bayesian network framework is an important step forward, and illustrates the flexibility this technology provides as a formal tool for handling complex forensic calculations.

Keywords: DNA Evidence; Probabilistic Expert Systems; Population Substructure; Paternity Index.

1. Introduction

A Bayesian network (BN) is a graphical and numerical representation which provides an automated way to calculate likelihood ratios in cases where the calculations are quite laborious to perform analytically [1]. A recent advancement in BN design is the application of the object-oriented programming paradigm, resulting in hierarchical representations, particularly well suited for forensic DNA casework [2]. We employ this technology here, making use of the object-oriented modeling capabilities of HUGIN (http://www.hugin.com; HUGIN 6.8 is used here). Accounting for population structure within Bayesian networks has been discussed in both [1] and [3]. While the approach of [1] is similar in spirit to that of [3] (both make use of Balding and Nichols’ method [4]), it is only described for biallelic loci, and is used to compute only the classic suspect-culprit match probabilities. The networks we present are an elegant alternative to those of [3], drawing on object-oriented principles for a significant reduction in complexity.

2. Balding and Nichols’ Method

When the subpopulation of interest has reached evolutionary equilibrium, the allele frequencies follow a Dirichlet distribution with parameters (1 − θ)p_i/θ [5], where p_i denotes the (known) population frequency of allele A_i and θ denotes the (known) inbreeding coefficient [6]. The number of observed A_i alleles is denoted by n_i, whereas n denotes the total number of alleles observed. The probability of observing A_i given n_i alleles of that type have already been observed is

$$P_{n,n_i}(A_i) = \frac{n_i \theta + p_i (1 - \theta)}{1 + (n - 1)\theta}. \tag{1}$$

In practice, the true population frequencies and inbreeding coefficient are rarely known. Relaxing these assumptions is an avenue for future research.

3. Exact Inference for Simple Cases

In paternity cases, there are typically genotype data on three individuals; mother, child, and putative father. A BN for this case was implemented using object-oriented techniques in [2] resulting in the network appearing in Figure 1. Rather than describe the entire network in detail, the reader is referred to Section 4 of [2] where all nodes and network classes are described.

Figure 1. Simple Disputed Paternity Network.
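As a numerical check on Equation 1, the following minimal Python sketch evaluates the formula directly and compares it with the predictive probability implied by the Dirichlet parameters (1 − θ)p_i/θ. The function names and the example values (p_i = 0.1, θ = 0.03) are illustrative assumptions and do not come from the networks described here.

```python
def bn_next_allele_prob(n_i, n, p_i, theta):
    """Equation 1: probability that the next allele sampled is A_i, given that
    n_i of the n alleles already observed are of type A_i."""
    return (n_i * theta + p_i * (1 - theta)) / (1 + (n - 1) * theta)


def dirichlet_predictive(n_i, n, p_i, theta):
    """The same probability written with gamma_i = (1 - theta) * p_i / theta.

    The gamma_i sum to (1 - theta) / theta because the p_i sum to one.
    """
    gamma_i = (1 - theta) * p_i / theta
    gamma_sum = (1 - theta) / theta
    return (gamma_i + n_i) / (gamma_sum + n)


# Illustrative values only (assumed, not taken from a real database).
p_i, theta = 0.1, 0.03

# No alleles observed yet: Equation 1 reduces to the population frequency.
first = bn_next_allele_prob(0, 0, p_i, theta)

# One A_i allele already observed: a second copy becomes more likely.
second = bn_next_allele_prob(1, 1, p_i, theta)

# Probability of an A_i A_i homozygote under sequential sampling.
homozygote = first * second
print(first, second, homozygote)
```

With θ > 0 the second draw is more probable than p_i, which is the allelic dependence that the gene counter modules are meant to capture.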


The new network incorporating population structure appears in Figure 2. The founder module is very similar to the original founder module presented in [2], except that the allele frequencies are now calculated according to Equation 1. Figure 2 also shows two instances of the new gene counter network class, labeled gc1 and gc2. These modules count alleles, supplying n_i for Equation 1. Rather than describe the internal structure of these nodes here, we refer the reader to [7]. Once the network is created, HUGIN can calculate the exact paternity index (PI) for various combinations of evidence, via the posterior probabilities for the hypothesis node tf=pf?.

Figure 2. Simple Disputed Paternity Network with Population Structure.

4. Approximate Inference for Complex Cases

Now consider the more complex situation that can occur when forensic scientists do not have access to the putative father’s DNA. Instead, perhaps they have a sample from a close relative of the putative father, say his brother. The BN for this case should count alleles for all founder nodes (in this case the mother, grandfather, grandmother, and alleged father). Due to memory limitations, HUGIN is unable to handle this additional complexity for more than two alleles. We present two approaches: the use of interval nodes and a blocking Gibbs approach.

4.1. Interval Nodes

As the subpopulation allele frequencies follow a Dirichlet distribution, we could include nodes following this continuous distribution in our network. However, HUGIN only provides discretized continuous nodes, which can follow a limited number of continuous distributions (the Dirichlet is not among them). For the two allele case we can avoid this dilemma by introducing a subpopn freqs node following Beta(γ_1, γ_2), where γ_i = (1 − θ)p_i/θ, as shown in Figure 3. When discretization is performed with care, this network can provide near exact results (see Section 5). When there are more than two alleles, a combination of Gamma nodes could be used. Unfortunately, the size of the conditional probability tables rapidly increases and HUGIN cannot handle the additional complexity. Adding the Dirichlet distribution to HUGIN’s library would solve this problem, making this approach very appealing as it is quick, simple, and accurate.

Figure 3. Brother paternity network with population structure using an interval node.

4.2. Blocking Gibbs Sampler

An alternative approximate method is a blocking Gibbs [8] approach. Conditional on the subpopulation allele frequencies (denoted by p̃_1, ..., p̃_X), the founder nodes appearing in Figure 3 are independent. Thus, a blocking Gibbs sampler can be created by dividing the network into two blocks: one consisting of the node subpopn freqs and the other consisting of all other nodes. Letting G_I denote the genotype of individual I, the sampler proceeds as follows:

1. Set initial values for the vector p̃.
2. Enter evidence G_C, G_B, and G_M.
3. Simulate G_GF, G_GM, G_PF, and G_AF, and obtain allele counts n_i.
4. Store posterior probabilities from tf=pf?.
5. Simulate p̃_i ∼ Dirichlet(γ_i + n_i) ∀ i, and recompile.
6. Repeat steps 2 through 5.

Once this process is complete, burn-in observations can be discarded, and a lag should be employed so that the retained observations are approximately independent. The remaining posterior probability values can then be used to estimate PI. This method was implemented using HUGIN’s C++ API.
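The two-block scheme can be illustrated outside HUGIN on a much smaller problem. The Python sketch below is not the C++ implementation described above. It assumes a biallelic, fully typed trio with G_M = G_PF = G_C = A1A1, a single untyped founder AF standing in for the alternative father, prior odds of one, and a closed-form conditional in place of HUGIN's propagation. All function names, the number of retained values, and the example inputs are hypothetical. Because this toy case also has a closed-form answer from Equation 1, the Gibbs mean can be compared against it.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet by normalising independent Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def exact_posterior(p1, theta):
    """Closed-form P(tf = pf | evidence) for the toy trio, prior odds of 1.

    Under the alternative, Equation 1 with n = n_1 = 4 gives the chance that
    an untyped father from the subpopulation transmits A1.
    """
    alt = (4 * theta + p1 * (1 - theta)) / (1 + 3 * theta)
    return 1.0 / (1.0 + alt)


def blocking_gibbs(p, theta, n_keep=1000, burn_in=5000, lag=5, seed=1):
    """Two-block Gibbs sampler for the toy trio (an assumed simplification).

    Block 1 holds the subpopulation frequencies; block 2 holds the hypothesis
    and the genotype of the untyped alternative father AF.
    """
    rng = random.Random(seed)
    gamma = [(1 - theta) * p_i / theta for p_i in p]
    freqs = list(p)   # step 1: initial subpopulation frequencies
    typed = [4, 0]    # step 2: A1/A2 counts in the typed founders M and PF
    kept = []
    for it in range(burn_in + n_keep * lag):
        # P(tf = pf | evidence, freqs): PF transmits A1 with probability 1,
        # AF with probability freqs[0].
        post = 1.0 / (1.0 + freqs[0])
        # Step 3: simulate the hypothesis, then AF's two alleles.
        pf_is_father = rng.random() < post
        # If AF is the true father he transmitted A1; his other allele, and
        # both alleles otherwise, are free draws from the current frequencies.
        n_free = 2 if pf_is_father else 1
        counts = [typed[0] + (2 - n_free), typed[1]]
        for _ in range(n_free):
            counts[0 if rng.random() < freqs[0] else 1] += 1
        # Step 4: store the posterior after burn-in, thinned by the lag.
        if it >= burn_in and (it - burn_in) % lag == 0:
            kept.append(post)
        # Step 5: conjugate update given all six founder alleles.
        freqs = sample_dirichlet([g + c for g, c in zip(gamma, counts)], rng)
    return kept


values = blocking_gibbs(p=[0.1, 0.9], theta=0.03)
print(sum(values) / len(values), exact_posterior(0.1, 0.03))
```

In the brother network the simulation in step 3 would cover G_GF, G_GM, G_PF, and G_AF jointly, which is where exact propagation within the second block becomes necessary.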


5. Results

Figure 4 compares the two approximate methods for the biallelic missing father case, varying the A1 allele frequency between 0.05 and 0.95, assuming G_M = A1A1, G_B = A1A2, and G_C = A1A1. The Gibbs mean and the upper and lower confidence limits are obtained using 100 simulated posterior probability values, with a lag of five and a burn-in period of 5000. These results show that both the interval estimate and the Gibbs mean typically fall close to the true value. They also illustrate the hazards of ignoring population structure, especially when rare alleles are observed.

Figure 4. Posterior probability of paternity comparison for biallelic missing father example. (Vertical axis: Probability. Horizontal axis: Frequency of A1 Allele. Series: Gibbs Mean, True Value, Ignoring Structure.)

Figure 5 presents the blocking Gibbs results when G_B, G_M, and G_C are simulated using the 11 allele vWA marker (allele frequencies obtained from [2]), assuming an inbreeding coefficient of 0.03. In each simulated case, the true value falls within the confidence limits for the Gibbs mean.

Figure 5. Posterior probability of paternity comparison for simulated cases using marker vWA. (Vertical axis: Probability. Horizontal axis: Simulated Cases. Series: Gibbs Mean, True Value, Ignoring Structure.)

6. Conclusion

Bayesian networks have become a useful tool for DNA evidence evaluation. They allow scientists to point and click their way to solutions to very difficult probability calculations. They also provide a graphical representation of, at times, highly complex forensic scenarios. One way to fully make use of this valuable tool is to provide several “shell” networks that can be used over and over again. Here we provided a few “shells” that will allow scientists to make inferences based on DNA evidence while taking into account population structure.

Acknowledgements

This work was supported by the Leverhulme Trust.

REFERENCES

1. F. Taroni, C. G. G. Aitken, P. Garbolino, A. Biedermann, Bayesian Networks and Probabilistic Inference in Forensic Science, John Wiley and Sons, Chichester, 2006.
2. A. P. Dawid, J. Mortera, P. Vicard, Object-oriented Bayesian networks for complex forensic DNA profiling problems, Forensic Sci. Intl. In press. doi:10.1016/j.forsciint.2006.08.028.
3. A. B. Hepler, Improving forensic identification using Bayesian networks and relatedness estimation: Allowing for population substructure, Ph.D. thesis, Dept of Statistics, North Carolina State University (2005).
4. D. J. Balding, R. A. Nichols, DNA profile match probability calculation - how to allow for population stratification, relatedness, database selection and single bands, Forensic Sci. Intl. 64 (2–3) (1994) 125–140.
5. S. Wright, The genetical structure of populations, Ann. Eugen. 15 (1951) 323–354.
6. I. W. Evett, B. S. Weir, Interpreting DNA Evidence, Sinauer, Sunderland, MA, 1998.
7. A. B. Hepler, B. S. Weir, Bayesian networks for paternity cases with allelic dependencies, In revision (2007).
8. C. J. Jensen, U. Kjærulff, A. Kong, Blocking Gibbs sampling in very large probabilistic expert systems, Int. J. of Human-Computer Studies 42 (1995) 647–666.