Automated pollen identification system for forensic geo-historical location applications
Grace M. Hwang, Kim C. Riley, Carol T. Christou, Garry M. Jacyna, Jeffrey P. Woodard, Regina M. Ryan
The MITRE Corporation, McLean, VA, USA
Mark B. Bush, Bryan G. Valencia, Crystal N.H. McMichael
Florida Institute of Technology, Melbourne, FL, USA
Surangi W. Punyasena
University of Illinois at Urbana-Champaign, Urbana, IL, USA
David L. Masters
Department of Homeland Security Science and Technology Directorate, Washington, DC, USA
Abstract: The use of pollen grain analysis for forensic geo-historical location has been explored for several decades, yet it is not widely adopted in the United States. We confirmed a significant improvement in geographic precision, from 2.5 × 10⁷ to 1.2 × 10⁵ km², by simultaneously applying flowering plant data from four different taxa at the genus and species levels. Moreover, when we calculated precision using collected pollen data, we found that co-occurring, pairwise genus-level distinctions based on expert-provided indicator taxa resulted in average precision values of 4° and 4.5° in latitude and longitude, respectively, corresponding to roughly 1.8 × 10⁵ km². We also applied computer vision techniques to identify morphologically similar pollen grains, which resulted in grain-identification error rates of 2.18% and 6.24% at the genus and species levels, respectively, surpassing previously published records. Collectively, our results demonstrate that algorithmic identification of species-specific pollen morphology, founded on established computer vision techniques and combined with species-level pollen distribution, has the potential to revolutionize the scope, accuracy, and precision of forensic geographic attribution.
Keywords: plant taxa; pollen forensics; computer vision; machine learning; Bayesian methods; GBIF; geographic attribution; geo-historical location
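As a rough illustration of how angular precision translates into area, and of why combining several taxa narrows a search region, the sketch below computes the area of a latitude/longitude box on a sphere and intersects several such boxes. It is not the authors' method: the source does not state how its km² figures were derived. The function names, the spherical-Earth radius of 6371 km, and the four taxon ranges are all assumptions made for illustration only.

```python
import math
from typing import List, Optional, Tuple

EARTH_RADIUS_KM = 6371.0  # assumption: spherical Earth with mean radius

# (lat_min, lat_max, lon_min, lon_max), all in degrees
Box = Tuple[float, float, float, float]


def box_area_km2(box: Box) -> float:
    """Area of a latitude/longitude box on a sphere, in km^2."""
    lat_min, lat_max, lon_min, lon_max = box
    dlon = math.radians(lon_max - lon_min)
    band = math.sin(math.radians(lat_max)) - math.sin(math.radians(lat_min))
    return EARTH_RADIUS_KM ** 2 * dlon * band


def intersect(boxes: List[Box]) -> Optional[Box]:
    """Region common to all boxes, or None if they do not overlap.

    Simplification: boxes that wrap across the antimeridian are not handled.
    """
    lat_min = max(b[0] for b in boxes)
    lat_max = min(b[1] for b in boxes)
    lon_min = max(b[2] for b in boxes)
    lon_max = min(b[3] for b in boxes)
    if lat_min >= lat_max or lon_min >= lon_max:
        return None
    return (lat_min, lat_max, lon_min, lon_max)


# A 4-degree by 4.5-degree box centered on the equator.
equator_box: Box = (-2.0, 2.0, -2.25, 2.25)

# Hypothetical bounding boxes for four taxa (illustrative values only).
taxon_ranges: List[Box] = [
    (-20.0, 25.0, -110.0, -35.0),
    (-10.0, 15.0, -90.0, -50.0),
    (-5.0, 12.0, -82.0, -60.0),
    (0.0, 8.0, -80.0, -70.0),
]

if __name__ == "__main__":
    print(f"4 x 4.5 degree box at the equator: {box_area_km2(equator_box):.3g} km^2")
    common = intersect(taxon_ranges)
    if common is not None:
        print(f"Region shared by all four taxa: {box_area_km2(common):.3g} km^2")
```

Under these assumptions, a 4° by 4.5° box at the equator covers about 2.2 × 10⁵ km², and the same angular box shrinks toward higher latitudes, so the abstract's figure of roughly 1.8 × 10⁵ km² is of the same order. Bounding boxes are a coarse stand-in for real plant distributions; actual occurrence data would call for finer-grained range models.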
I. INTRODUCTION
Customs and tariffs are large revenue generators for the United States Department of Homeland Security (DHS) Customs and Border Protection (CBP). Because imports from some countries are taxed at lower rates than those from others, false claims of origin cost the DHS a great deal of revenue [1]. Pollen grain analysis is one approach to verifying the provenance of commodities such as imported honey. Pollen grain analysis is typically outsourced, sometimes at a cost of thousands of dollars per sample. Few samples are submitted for analysis because full geographic attribution is slow and costly, and pollen identification capabilities are limited. Through the Borders and Maritime Security Division (BMD) of the Science and Technology Directorate (S&T), the DHS funds, develops, and transitions tools and technologies that improve the security of our nation's borders and waterways without impeding the flow of commerce and travelers. To address the challenges and the asymmetric threat presented by border intruders, DHS S&T, in collaboration with MITRE, plans to develop a pollen-based geographic attribution database system with involvement from many stakeholders across DHS, law enforcement, defense and civilian agencies, and academia.
The use of pollen grain analysis for forensic geographic attribution has been explored for several decades [2-6]. There are an estimated 352,000 plant species worldwide [7], approximately 80,000 of which are native to the Neotropics [8]. The diversity and geographic distinctiveness of plant species make them naturally suitable for forensic applications. Yet several well-known shortfalls of pollen analysis have precluded its routine use in the United States. First, taxonomic classification at the family or genus level often lacks geographic specificity. Second, the time-to-answer is limited by the availability of palynologists with expertise in the geographic regions of interest. Third, there is no unified global database of pollen types that encompasses both morphological descriptors and geographic attributes. Most databases are ad hoc and either are tailored to morphological descriptors [9-13] or provide geographic information without images [14]; collectively, they lack the metadata that are essential to forensic geo-historical location. In this paper, we present promising results that address each of these shortfalls and outline a strategy for developing a sustainable capability in pollen-based geographic attribution.
Sponsors: The MITRE Corporation, Department of Homeland Security Science and Technology Directorate
II. METHODS
A. Effect of Taxonomic Classification on Geographic Precision
Taxonomic classification is the scientific method that categorizes organisms into related groups, canonically following the kingdom-phylum-class-order-family-genus-species hierarchy. We are interested in the Plant Kingdom, and specifically in pollen classified at the two lowest taxonomic ranks of seed-bearing plants: genus and species. Species that have a common ancestry are placed into a genus. Assumptions about relatedness are based on physical and, more recently, genetic attributes. Each plant is given a two-part, or "binomial," name. The generic name (the genus to which the plant belongs) is listed first, and the specific name (the species) second. Most palynological classification occurs at the family or genus rank, and more rarely at the species rank.
The Global Biodiversity Information Facility (GBIF) is an open internet database of taxonomic information and distribution data for plants and other organisms (http://data.gbif.org). Although this database does not include pollen, it demonstrates the value of plant distribution data at the species level versus the genus level and offers guidance on how our enterprise pollen database should be expanded. One can explore distributions of taxa at any classification level, including species, across countries or other datasets, and can glean guidelines for expected pollen occurrences from the corresponding plant distributions. The GBIF database allows: 1) exploration of taxonomic information (e.g., for plants); 2) exploration of country floras, i.e., finding data on the taxa recorded in a particular country, territory, or island, including records shared by publishers from the GBIF network; 3) exploration of datasets, i.e., finding records from a data publisher, dataset, or data network.
This includes information on the data publishers, datasets, and data networks that share resources through GBIF, with summary information on 10,148 datasets from 426 data publishers. Quality control is an issue with these data; in particular, elevations were not available for all records. As a result, our preliminary analysis does not account for elevation.
We initially analyzed data from 364 modern pollen samples representing collection sites that range from the mountains of Mexico to the Amazon rainforest. This list was organized by genus and collection site, and contained the relative abundances of pollen grains from each taxon. The list of over 500 identified pollen taxa was then reduced to 120 taxa to eliminate grains so rare that they carried little statistical power, and to reduce taxonomic uncertainty. Further pruning of the dataset removed sites with too few (