Forensic Science International: Genetics Supplement Series 4 (2013) e300–e301
Free open source software for internal validation of forensic STR typing kits
Oskar Hansson a,*, Peter Gill a,b
a Norwegian Institute of Public Health, Department of Forensic Biology, Norway
b University of Oslo, Norway
Article history: Received 25 July 2013; Accepted 1 October 2013

Keywords: Validation; Simulation; PCR; R; STR; strvalidator; pcrsim

Abstract
The validation of new short tandem repeat (STR) systems for forensic purposes is extremely time consuming and expensive. However, if a full understanding of the biological processes were achieved, this would effectively bypass the need to carry out validation by traditional methods, since millions of DNA profiles could realistically be generated in silico at no cost. To achieve this, a PCR simulation tool and a validation toolbox have been built using the ‘R’ programming language. The goal is to provide realistic virtual DNA profiles by simulating the specific steps (extraction, PCR and electrophoresis) of the analytical method in routine use. With accurate simulations it will be possible to create virtual DNA profiles that mimic real casework examples (such as partial mixtures and degraded samples). Ultimately the programme will be used to assist experimental design, e.g. to define the best parameters for analysing a sample. The output could potentially also be used in probabilistic analysis. The validation toolbox aids the implementation of new kits by simplifying the analysis of validation data. It provides functions to explore the characteristics of DNA typing kits according to ENFSI recommendations (e.g. balance and stutter). It facilitates the comparison of simulated and real data, and is therefore an important tool for fine-tuning the parameters used for simulation. Both packages are open source and have easy-to-use graphical user interfaces. Command-line functions remain available for power users.
© 2013 Elsevier Ireland Ltd. All rights reserved.
* Corresponding author. Tel.: +47 2107 7649; fax: +47 2107 7602. E-mail addresses: [email protected], [email protected] (O. Hansson).
1875-1768/$ – see front matter © 2013 Elsevier Ireland Ltd. All rights reserved. http://dx.doi.org/10.1016/j.fsigss.2013.10.153

1. Introduction
The validation of new short tandem repeat (STR) systems for forensic purposes is extremely time consuming and expensive. In an attempt to remedy this, a PCR simulation tool (pcrsim) and a validation toolbox (strvalidator) have been built using the ‘R’ programming language [1]. STR validator has greatly increased the speed of validation in our lab, while PCR sim is at an experimental stage. Both packages are available on CRAN [2] and the source code is hosted at GitHub [3]. They can be used either through a graphical user interface or through command-line functions.

2. STR validator
STR validator is an R package intended for internal validation of forensic STR DNA typing kits. Its graphical user interface makes it easy to analyse data exported from e.g. GeneMapper® software. It provides convenient functions for importing, viewing, editing, and exporting data. After analysis, the results, generated plots, heat maps, and data can be saved in a project for easy access. Currently, analysis modules for stutter, balance, and dropout (including logistic regression [4]) are available. There is also a function for calculating the average peak height ‘H’ [5]. Fig. 1 shows an example plot of logistic regression for estimation of the dropout threshold.

Fig. 1. Example of a plot generated by STR validator: logistic regression and prediction of dropout probability as a function of the height of the present allele. Individual observations are plotted as black dots and the dropout threshold, in this case at a 5% risk, is marked by the red line.

As STR validator develops, it continues to reduce the time and effort needed for analysis of validation data. By allowing easy exploration of the characteristics of DNA typing kits according to ENFSI recommendations (e.g. balance and stutter), the tool facilitates the implementation of probabilistic interpretation of DNA results. It also facilitates the comparison of simulated and real data, and is therefore an important tool for fine-tuning the parameters used for simulation in PCR sim.

Several functions are in the pipeline for implementation in STR validator, for example functions for analysis of GeneMapper® ‘bins’ and ‘panels’ files. With these it will be possible to automatically create new kit definitions (needed by the software), plot the marker ranges of one or several kits, calculate the probability of observing off-ladder alleles, and assess the risk that pull-ups will be called as alleles in other channels. The latter functions can be very useful when evaluating different kits for implementation. There will also be a function for categorising DNA results into full profile, mixture, partial profile, or no result. The categories can be further subdivided by the number of peaks or markers, by whether the result gives a full profile for an alternative specified kit, and by whether peaks are above a defined peak height threshold.
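The dropout analysis illustrated in Fig. 1 can be sketched in a few lines. The example below is written in Python for illustration only and does not use or reproduce the strvalidator API. It assumes a model of the form P(dropout) = logistic(b0 + b1 × height), fits it by Newton-Raphson to hypothetical simulated observations, and solves for the peak height at which the predicted dropout risk is 5%. The data, the parameter values `TRUE_B0` and `TRUE_B1`, and all function names are assumptions made for this sketch.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(heights, dropped, max_iter=50, tol=1e-10):
    """Fit P(dropout) = sigmoid(b0 + b1 * height) by Newton-Raphson."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(heights, dropped):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p            # gradient of the log-likelihood
            g1 += (y - p) * x
            h00 += w               # Fisher information (negative Hessian)
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det <= 0.0:
            break
        # Newton step: solve the 2x2 system (information * delta = gradient)
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) < tol and abs(d1) < tol:
            break
    return b0, b1


def dropout_threshold(b0, b1, risk):
    """Peak height at which the fitted dropout probability equals `risk`."""
    return (math.log(risk / (1.0 - risk)) - b0) / b1


# Hypothetical data: height (RFU) of the surviving allele at a heterozygous
# marker, and whether its partner allele dropped out (1) or not (0).
rng = random.Random(1)
TRUE_B0, TRUE_B1 = 4.0, -0.03  # illustrative values, not from any real kit
heights = [rng.uniform(20, 600) for _ in range(2000)]
dropped = [1 if rng.random() < sigmoid(TRUE_B0 + TRUE_B1 * h) else 0
           for h in heights]

b0, b1 = fit_logistic(heights, dropped)
threshold = dropout_threshold(b0, b1, risk=0.05)
print(f"b0 = {b0:.2f}, b1 = {b1:.4f}, 5% dropout threshold = {threshold:.0f} RFU")
```

In practice the same fit would normally be obtained from an established statistics routine; the hand-written Newton-Raphson loop is used here only to keep the sketch free of dependencies.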
3. PCR sim
The goal of PCR sim is to provide realistic forensic STR DNA profiles by simulating the specific steps of the analytical method (extraction, PCR, and electrophoresis). With accurate simulations it will be possible to create virtual DNA profiles that mimic real casework examples (such as partial mixtures and degraded samples). This would effectively bypass the need to carry out validation by traditional methods. Ultimately the programme will be used to assist experimental design, e.g. to define the best parameters for analysing a sample. The output could potentially also be used in probabilistic analysis.
The programme is based on the model in Refs. [6,7], which uses a series of binomial distributions to simulate the PCR process. PCR sim extends the model by incorporating pre-PCR processes as a series of normal distributions, interlocus balance by modulation of the PCR efficiency, and degradation by adjustment of the template amount. The programme has a user-friendly graphical interface for simulation of entire profiles, coupled with a flexible electropherogram (EPG) generator. The current focus is on calibrating the model, which will likely require extending it to include capillary variation.
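The idea of simulating PCR as a series of binomial draws can be sketched as follows. This is a minimal Python illustration, not the pcrsim implementation: it assumes that each template molecule enters the reaction with a fixed probability (`aliquot_prob`) and that, in every cycle, each molecule present is copied with probability `efficiency`. The function names, the parameter values, and the normal approximation used for large molecule counts are all assumptions made for this sketch.

```python
import math
import random


def binomial(rng, n, p):
    """Draw from Binomial(n, p); uses a normal approximation for large n."""
    if n < 1000:
        return sum(1 for _ in range(n) if rng.random() < p)
    draw = round(rng.gauss(n * p, math.sqrt(n * p * (1.0 - p))))
    return min(n, max(0, draw))


def simulate_allele(rng, copies, aliquot_prob, efficiency, cycles):
    """Return the number of molecules of one allele after PCR."""
    # Sampling step: each template copy reaches the reaction independently.
    molecules = binomial(rng, copies, aliquot_prob)
    # Amplification: in each cycle every molecule is copied with
    # probability `efficiency`, so the count grows by a binomial draw.
    for _ in range(cycles):
        molecules += binomial(rng, molecules, efficiency)
    return molecules


# Hypothetical low-template scenario: 10 copies per allele in the extract.
rng = random.Random(7)
n_alleles = 1000
results = [simulate_allele(rng, copies=10, aliquot_prob=0.3,
                           efficiency=0.85, cycles=28)
           for _ in range(n_alleles)]

# An allele with no template molecule in the reaction can never amplify.
dropout_rate = sum(1 for m in results if m == 0) / n_alleles
print(f"simulated alleles: {n_alleles}, dropout rate: {dropout_rate:.3f}")
```

In a sketch of this kind, the extensions described above would correspond to varying `efficiency` per marker (interlocus balance) and reducing `copies` for degraded samples.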
4. Package dependencies
The graphical user interfaces are programmed using the toolkit-independent API provided by the ‘gWidgets’ package. Specifically, the packages have been developed using the ‘gWidgetsRGtk2’ package, which ports the API to ‘RGtk2’ (GIMP Toolkit). All plots are created using the ‘ggplot2’ package, an implementation of the grammar of graphics in R. One function from the ‘data.table’ package is used in ‘pcrsim’ for performance reasons when working with data frames.
5. Developmental tools
The free and open source integrated development environment for R, ‘RStudio’, was used for development of the packages on a Windows 7 platform. Documentation of the code is updated continuously in the source files, and the final Rd documentation is created using the ‘roxygen2’ package. To minimise the number of critical bugs in major functions, extensive tests are written and run using the ‘testthat’ package.

Role of funding
The work leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 285487 (EUROFORGEN-NoE) [8].

Conflict of interest
None declared.

References
[1] The R Project for Statistical Computing, Wien: Institute for Statistics and Mathematics of WU (Wirtschaftsuniversität Wien). Available from http://www.r-project.org/.
[2] The Comprehensive R Archive Network, Wien: Institute for Statistics and Mathematics of WU (Wirtschaftsuniversität Wien). Available from http://cran.r-project.org/.
[3] strvalidator and pcrsim source code as GitHub repositories. Available from https://github.com/OskarHansson.
[4] P. Gill, R. Puch-Solis, J. Curran, The low-template-DNA (stochastic) threshold – its determination relative to risk analysis for national DNA databases, Forensic Science International: Genetics 3 (2009) 104–111. http://dx.doi.org/10.1016/j.fsigen.2008.11.009.
[5] T. Tvedebrink, P.S. Eriksen, H.S. Mogensen, et al., Evaluating the weight of evidence by using quantitative short tandem repeat data in DNA mixtures, Journal of the Royal Statistical Society: Series C (Applied Statistics) 59 (2010) 855–874. http://dx.doi.org/10.1111/j.1467-9876.2010.00722.x.
[6] P. Gill, J. Curran, K. Elliot, A graphical simulation model of the entire DNA process associated with the analysis of short tandem repeat loci, Nucleic Acids Research 33 (2005) 632–643. http://dx.doi.org/10.1093/nar/gki205.
[7] Forensim web page. Available from http://forensim.r-forge.r-project.org/.
[8] EUROFORGEN-NoE – the European Forensic Genetics Network of Excellence. Available from http://www.euroforgen.eu/.