Harnessing Biomedical Natural Language Processing Tools to Identify Medicinal Plant Knowledge from Historical Texts

Vivekanand Sharma, PhD¹, Wayne Law, PhD²,³, Michael J. Balick, PhD², Indra Neil Sarkar, PhD, MLIS¹

¹ Center for Biomedical Informatics, Brown University, Providence, RI
² New York Botanical Garden, Bronx, NY
³ College of Arts and Sciences, Lynn University, Boca Raton, FL

Abstract

The growing amount of data describing historical medicinal uses of plants, made available through digitization efforts, provides the opportunity to develop systematic approaches for identifying potential plant-based therapies. However, cataloguing plant use information from natural language text is a challenging task for ethnobotanists. To date, there has been only limited adoption of informatics approaches for supporting the identification of ethnobotanical information associated with medicinal uses. This study explored the feasibility of using biomedical terminologies and natural language processing approaches to extract relevant plant-associated therapeutic use information from the historical biodiversity literature collection available from the Biodiversity Heritage Library. The results from this preliminary study suggest that informatics methods have potential utility for identifying medicinal plant knowledge from digitized resources, and they highlight opportunities for improvement.



Introduction

Evidence from archaeological studies demonstrates that knowledge of plant use existed even before writing had evolved [1]. Over the centuries, such knowledge has been transferred both verbally and in the form of written texts or scriptures. One of the oldest records of medicinal plant use, written in cuneiform on clay tablets, dates back to 2600 BC and describes approximately 1000 plants [2]. Although the term “ethnobotany” was coined in 1895 [3], botanists and physicians had long before that time been traveling to diverse destinations to collect and document plant use knowledge. An early example of such documentation is De Materia Medica, written between 50 and 70 C.E. by Pedanius Dioscorides, a Greek physician, pharmacologist, and botanist [4]; it remained the core of the western pharmacopoeia as late as 1865 [5]. His accounts of around 600 plants were based on travels to the Mediterranean.

Since then, a rich repository of data has been compiled from accounts of botanical explorations across the globe aimed at collecting specimens and documenting their cultural use. Botanical explorations reached their peak during the 19th century and, by its latter part, drew the interest of academicians who recognized the importance of systematic examinations of indigenous societies and their knowledge of plant use [1]. Studies of historical use from such texts have aimed to show correspondence between textual and present-day folk traditions as well as to guide the discovery of potential medicines [5]. Due to the impact of modernization, traditional knowledge transferred over generations is being lost. A major goal of ethnobotany is to document and investigate the knowledge about plant use held by indigenous communities. Ample evidence reflects the success of natural products research (aimed at drug discovery) directed by knowledge gained from ethnobotanical explorations [6].

Considering the diminishing corpus of knowledge of traditional culture, it is imperative to document and preserve elements of indigenous knowledge [7]. As a result of one such contemporary effort, the Palau and Pohnpei Primary Health Care Manuals were written based on information gathered from the Republic of Palau [8] and Pohnpei State, Federated States of Micronesia [9], both island groups in Micronesia. Such documentation of traditional botanical knowledge is important both for the conservation of biodiversity and for supporting public health [9]. Cataloguing plant species information and the respective uses of plants in treating ailments is an essential task that remains challenging. The information present in historical texts can complement contemporary ethnobotanical studies [10,11]. However, historical texts need to be in computer-analyzable form to facilitate the identification of relevant knowledge. One supportive initiative is the effort led by the Biodiversity Heritage Library (BHL) [12]. The BHL houses digitized versions of biodiversity literature covering 115,124 titles. The digitization of text describing medicine provides opportunities to carry out analyses over different periods of time [13]. Automated analyses and integration of data from such sources with contemporary biomedical data may lead to the generation of testable hypotheses as well as provide preliminary support for bioprospecting studies.


Extracting relevant information from digital archives can help mobilize information that would otherwise remain sequestered [14]. Use of controlled vocabularies, such as the Medical Subject Headings (MeSH) in the biomedical domain, has been shown to aid efficient retrieval of relevant information [15]. In the biodiversity domain, approaches have focused on highlighting and leveraging the species-centric nature of how knowledge is sought in the discipline. Use of “taxonomically intelligent” approaches has been shown to facilitate retrieval and linking of organism-specific information across resources [16]. There have been significant advances in the biomedical domain with the development of automated methods aimed at gaining insight and generating hypotheses from narrative data. In addition to tools and algorithms for mining text, due consideration has been given to the development of ontologies and terminological resources. Natural Language Processing (NLP) systems have been developed that determine concept equivalence of terms or phrases from unstructured text by mapping them to standardized terminologies [17,18]. NLP systems have also been developed for entity-relation extraction (e.g., BioMedLEE [19] and SemRep [20]). The main approaches in biomedical entity recognition are lexicon-based, rule-based, and machine learning-based methods, with combinations of these approaches resulting in better performance [21]. Potential applications of such methods in the biodiversity domain have been described by Thessen et al. [14]. Availability of annotated corpora is important for training NLP tools to derive key patterns that can be used subsequently. NLP tasks have benefited from the availability of annotated corpora in the biomedical domain [22–24]. However, the lack of similar resources, including annotated corpora, makes retrieval of knowledge from biodiversity texts challenging [13].

This study aimed to test the feasibility of using biomedical terminologies and NLP tools to extract therapeutic information from contemporary and historical ethnobotany texts. The ability of the NLP tool MetaMap [17] to extract mentions of medicinal use and normalize them to the Unified Medical Language System (UMLS) Metathesaurus [25] was evaluated, focusing on plants identified from the ethnobotanical surveys of the Micronesian islands of Palau and Pohnpei. The results from this study reveal the challenges and opportunities regarding the applicability of informatics-based methods for supporting large-scale analysis of historical texts, such as those included in the BHL. The experience gained from this preliminary study motivates the need for developing efficient strategies for reliable extraction of medicinal use information from historical texts.



Methods

The primary goal of this study was to examine the feasibility of applying biomedical text mining tools and methods to extract plant-associated therapeutic indications from historical texts. A reference standard based on manual annotation of the Primary Health Care Manuals (PHCMs) provided the basis for optimizing and then evaluating the approach. Following this evaluation, historical text pertaining to the plants listed in the PHCMs was analyzed for possible therapeutic uses. An overview of the steps of the developed pipeline is depicted in Figure 1.

Annotation of Primary Health Care Manuals. The PHCMs document indigenous knowledge related to plants commonly used for medicinal purposes for selected medical conditions on Palau and Pohnpei. These documents resulted from extensive ethnobotanical surveys carried out as part of the program Biodiversity and Human Health in Micronesia. The health conditions included in these manuals were selected based on frequent diagnoses by traditional healers and on the data collected during the surveys. By cataloguing traditional medicinal knowledge obtained first-hand from field explorations, these manuals serve as a reference for those interested in the use of plants for healing. The Palau and Pohnpei PHCMs each include medicinal use information for more than 80 plant species. The botanical uses described cover individual plant species as well as recipes involving combinations of herbs. This study focused only on descriptions of medicinal use involving individual herbs. A reference standard was developed that consisted of therapeutic annotations associated with individual plants mentioned in the PHCMs. An annotation guideline was used to direct the annotation of the PHCMs, which were organized into individual text files consisting of one plant per document.
Each individual plant document included: (1) Local Name(s), (2) Scientific Name, (3) Family, (4) Description of Herb, (5) Traditional Uses, (6) Pharmacological Properties, and (7) Toxicology. The scope of the annotation process was limited to the sections Traditional Uses and Pharmacological Properties. The annotation ‘therapeutic indication’ was applied to all single- or multi-word expressions that related to the medicinal use of a given plant of interest. In addition to specific medical conditions (such as diseases or symptoms), medically relevant terms (e.g., ‘anti-bacterial’, ‘analgesic’, and ‘styptic’) were also included. If a given plant species was described as active against certain organisms (e.g., microbes or parasitic worms), those were also annotated. The annotation was performed using the Stanford Manual Annotation Tool [26] in batch mode, which resulted in a tagged document that was parsed using a program written in Julia.
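The parsing step described above can be illustrated with a short sketch. The study's parser was written in Julia and is not shown in the text, so the Python below is only an illustration. It assumes that the tagged document marks each annotated span with an inline XML-style tag; the tag name `indication`, the function names, and the sample sentence are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical tag name: the tag set actually produced during annotation
# is not given in the text.
TAG = "indication"
TAG_PATTERN = re.compile(rf"<{TAG}>(.*?)</{TAG}>", re.DOTALL)


def extract_indications(tagged_text: str) -> list[str]:
    """Return the annotated spans, whitespace-normalized and lower-cased."""
    spans = TAG_PATTERN.findall(tagged_text)
    return [" ".join(span.split()).lower() for span in spans]


def build_reference_standard(documents: dict[str, str]) -> dict[str, set[str]]:
    """Map each plant (one tagged document per plant) to its set of indications."""
    reference = defaultdict(set)
    for plant, tagged_text in documents.items():
        reference[plant].update(extract_indications(tagged_text))
    return dict(reference)


# Toy input (invented sentence, not taken from the PHCMs).
docs = {
    "Example plant": (
        "Leaves are used for <indication>skin rash</indication> "
        "and as an <indication>Analgesic</indication>."
    ),
}
reference = build_reference_standard(docs)
```

Keeping one set of indications per plant mirrors the one-plant-per-document organization of the annotated files and makes later comparison against extracted concepts a simple set operation.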

[Figure 1 flowchart. Box labels: Biodiversity Heritage Library; Plant List; OCR Text Related to Plants from PHCM; Scope of Text: Paragraph; Filtered OCR Text Associated with Plants (Palau, Pohnpei); Extraction of UMLS Concepts from Text; BHL Text; MetaMap; Semantic Type Filter; Potential Treatment Indications; Manual Examination to Identify True Indications; Primary Health Care Manuals (PHCM), Raw Text (Palau, Pohnpei); Manual Annotation (Palau, Pohnpei); Stanford Manual Annotation Tool; Reference Standard; Optimization of Semantic Type Filter; Evaluation Using Precision-Recall Metrics.]

Figure 1: Study Overview. Text from the PHCMs (Palau and Pohnpei) was manually annotated to create a reference standard against which MetaMap output was evaluated. A custom filter was designed by optimizing the combination of semantic types to achieve the highest F-score. This custom filter was used to distill the MetaMap output for plant-related texts identified from BHL into potential treatments, which were manually examined.

Extraction and Optimization of Biomedical Concept Identification. The individual plant descriptions from the PHCMs (Palau and Pohnpei) were processed using MetaMap, and the plant-associated mentions of UMLS concepts and their respective semantic types were extracted by a Julia program. The processed output was used to generate and optimize a semantic type filter to identify UMLS concepts of interest. A default filter, consisting of the semantic types that belong to the UMLS Semantic Group ‘Disorders’, was used for comparison [27]. The MetaMap output was compared to the output from manual annotation, and the correctly identified UMLS concepts were grouped by semantic type. A cutoff was defined that retained semantic types accounting for at least 5% of correctly identified concepts. All possible combinations of the semantic types that met the cutoff criterion were analyzed relative to the reference standard by calculating Precision, Recall, and F-score. The combination of semantic types with the highest F-score was selected as the custom filter. The optimized custom filter obtained from the Palau dataset was used to evaluate MetaMap output from the Pohnpei dataset, and vice versa. Finally, the semantic types from the custom filters obtained from both datasets were merged, and all combinations were tested on the combined Palau and Pohnpei dataset. The concepts from the filtered MetaMap output were compared to the manually annotated PHCM reference standard. The best performing (according to F-score) semantic type combination was used as the final custom filter for further use.
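The filter optimization can be sketched as an exhaustive search over semantic type combinations. This is a minimal illustration in Python, not the study's Julia code. It assumes MetaMap output has been reduced to (plant, CUI, semantic type) records and the reference standard to (plant, CUI) pairs. The cutoff is read here as the share of true indications within each semantic type, following the wording in the Results; that reading is an interpretation. All names and the toy data are hypothetical.

```python
from itertools import combinations


def precision_recall_f(predicted: set, reference: set) -> tuple[float, float, float]:
    """Precision, recall, and balanced F-score of predicted vs. reference pairs."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


def types_passing_cutoff(records, reference, cutoff=0.05):
    """Semantic types whose share of true indications is at least `cutoff` (assumed reading)."""
    totals, hits = {}, {}
    for plant, cui, sty in set(records):
        totals[sty] = totals.get(sty, 0) + 1
        if (plant, cui) in reference:
            hits[sty] = hits.get(sty, 0) + 1
    return sorted(s for s in totals if hits.get(s, 0) / totals[s] >= cutoff)


def best_filter(records, reference, cutoff=0.05):
    """Try every non-empty combination of candidate types; keep the highest F-score."""
    candidates = types_passing_cutoff(records, reference, cutoff)
    best = (0.0, 0.0, 0.0, ())
    for size in range(1, len(candidates) + 1):
        for combo in combinations(candidates, size):
            allowed = set(combo)
            predicted = {(p, c) for p, c, s in records if s in allowed}
            precision, recall, f_score = precision_recall_f(predicted, reference)
            if f_score > best[2]:
                best = (precision, recall, f_score, combo)
    return best


# Toy data: (plant, CUI, semantic type); CUIs are placeholders.
records = [
    ("plant A", "C1", "dsyn"), ("plant A", "C2", "sosy"),
    ("plant A", "C3", "dsyn"), ("plant A", "C4", "geoa"),
    ("plant B", "C5", "sosy"), ("plant B", "C6", "geoa"),
    ("plant B", "C7", "phsu"), ("plant B", "C8", "phsu"),
    ("plant B", "C9", "phsu"),
]
reference = {("plant A", "C1"), ("plant A", "C2"), ("plant B", "C5"), ("plant B", "C7")}

precision, recall, f_score, combo = best_filter(records, reference)
# On this toy data the best combination is ('dsyn', 'sosy') with F = 0.75.
```

The search grows as 2^n in the number of candidate types, so it is practical only because the cutoff leaves a few dozen types at most. Applying the combination found on one dataset to the records of the other reproduces the cross-dataset evaluation described above.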


Table 1: Optimum Combination of Semantic Types. After applying a cutoff (see Methods), all combinations of the remaining semantic types were evaluated to identify the optimal filter. A comparison of the default filter (semantic group: Disorder) with the custom filters identified from the optimization step is provided.

| Semantic Type Filter | Dataset | Precision | Recall | F-score |
|---|---|---|---|---|
| acab, antb, bact, dsyn, fngs, inpo, lbtr, patf, sosy, and virs (derived from Palau dataset*) | Palau | 43.52 | 67.12 | 52.80 |
| | Pohnpei | 33.97 | 67.06 | 45.02 |
| | Combined | 37.41 | 66.76 | 47.95 |
| antb, bact, dsyn, fngs, inpo, moft, sosy, and virs (derived from Pohnpei dataset) | Palau | 41.44 | 56.61 | 47.85 |
| | Pohnpei | 35.18 | 64.24 | 45.39 |
| | Combined | 37.37 | 60.77 | 46.28 |
| Semantic Group: Disorder (default semantic type filter) | Palau | 17.92 | 58.31 | 27.41 |
| | Pohnpei | 18.34 | 62.35 | 28.31 |
| | Combined | 18.68 | 61.06 | 28.61 |

*Also the optimum combination for both datasets combined.

Identifying Therapeutic Uses of Plants from Historical Biodiversity Texts. Biodiversity Heritage Library (BHL) data were downloaded on Jan 27, 2017. These data identified digitized titles, as well as metadata about the volumes and the scientific names identified using the Global Names Architecture’s Global Names Recognition and Discovery [28]. The data were loaded into a local MySQL database that was used to query for titles and identifier barcodes associated with plant species. Title and barcode data were retrieved for the combined Palau and Pohnpei plant list from the PHCMs. The titles were filtered to include only those that contained the string “%medicine%” within their subject keywords. Using the barcodes identified for titles related to the PHCM plant list, the OCR text files in XML format were processed, and paragraphs that mentioned plant species names of interest were processed using MetaMap. The MetaMap output was then filtered using the custom semantic type filter previously described. The final filtered MetaMap output was manually examined.
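The retrieval step (subject keyword filter, then paragraph-level selection by plant name) can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: SQLite from the Python standard library stands in for the local MySQL database, the table and column names are hypothetical, and OCR text is treated as plain text with blank-line paragraph breaks rather than the XML files used in the study.

```python
import re
import sqlite3

# In-memory stand-in for the local database; schema and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE title (title_id INTEGER, barcode TEXT, subject TEXT);
    CREATE TABLE name_occurrence (barcode TEXT, scientific_name TEXT);
    """
)
conn.executemany(
    "INSERT INTO title VALUES (?, ?, ?)",
    [(1, "bc001", "Veterinary medicine"), (2, "bc002", "Botany")],
)
conn.executemany(
    "INSERT INTO name_occurrence VALUES (?, ?)",
    [("bc001", "Carica papaya"), ("bc002", "Carica papaya")],
)


def barcodes_for(plant: str) -> list[str]:
    """Barcodes of items that mention the plant and whose subject matches %medicine%."""
    rows = conn.execute(
        """
        SELECT DISTINCT t.barcode
        FROM title AS t
        JOIN name_occurrence AS n ON n.barcode = t.barcode
        WHERE n.scientific_name = ? AND t.subject LIKE '%medicine%'
        ORDER BY t.barcode
        """,
        (plant,),
    )
    return [row[0] for row in rows]


def paragraphs_mentioning(ocr_text: str, plant_names: list[str]):
    """Return (matched names, paragraph) for each paragraph naming a plant of interest."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", ocr_text) if p.strip()]
    selected = []
    for paragraph in paragraphs:
        normalized = " ".join(paragraph.split()).lower()
        matched = [name for name in plant_names if name.lower() in normalized]
        if matched:
            selected.append((matched, paragraph))
    return selected


# Toy OCR text (invented, not taken from a BHL title).
ocr_text = (
    "Carica papaya. The milky juice is given for round worms.\n\n"
    "This paragraph mentions no plant of interest."
)
barcodes = barcodes_for("Carica papaya")
hits = paragraphs_mentioning(ocr_text, ["Carica papaya", "Curcuma longa"])
```

Only the selected paragraphs would then be passed to MetaMap, which keeps the volume of text to process small and ties each extracted concept to a specific plant mention.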
The manual evaluation consisted of going through each identified scope of text and examining whether a plant-specific treatment indication extracted using MetaMap was ‘True’ or ‘False’.

Results

Manual Annotation of PHCMs. From the Palau PHCM, a total of 65 plant species were included in this study, with manual annotation identifying 295 treatment indications. Similarly, from the Pohnpei PHCM, 80 plant species were identified, and annotation of their descriptions resulted in the identification of 427 treatment indications. In total, there were 129 unique plant species across both PHCMs.

Semantic Type Filter and MetaMap Evaluation. A total of 2798 distinct CUIs grouped into 109 semantic types were identified by MetaMap processing of the Palau dataset. The true indications found within the reference standard accounted for 202 distinct CUIs grouped into 43 semantic types. Using the cutoff of 5% or greater true indications within each semantic type, 25 semantic types were selected from the Palau dataset. From the Pohnpei dataset, 3427 distinct CUIs grouped into 112 semantic types were identified. The true indications accounted for 232 distinct CUIs grouped into 44 semantic types. Applying the cutoff resulted in the selection of 22 semantic types for further analysis.

Optimization of semantic type combinations on the MetaMap-processed Palau dataset identified a combination of ten semantic types: acab, antb, bact, dsyn, fngs, inpo, lbtr, patf, sosy, and virs. This combination gave the maximum F-score of 52.80 (Precision: 43.52 and Recall: 67.12). When applied to the MetaMap-processed Pohnpei dataset, this custom semantic type filter resulted in an F-score of 45.02 (Precision: 33.97 and Recall: 67.06). The optimum custom filter obtained from the MetaMap-processed Pohnpei dataset consisted of eight semantic types: antb, bact, dsyn, fngs, inpo, moft, sosy, and virs. This combination was selected because it gave the maximum F-score of 45.39 (Precision: 35.18 and Recall: 64.24). When applied to the MetaMap-processed Palau dataset, this custom filter resulted in an F-score of 47.85 (Precision: 41.44 and Recall: 56.61). Evaluation on the combined (Palau and Pohnpei) dataset yielded a custom filter with an F-score of 47.95 (Precision: 37.41 and Recall: 66.76); this filter was the same as the one derived from semantic type optimization on the Palau dataset. A comparative evaluation using the default filter (semantic types included within the semantic group ‘Disorder’) resulted in F-scores of 27.41 on the Palau dataset, 28.31 on the Pohnpei dataset, and 28.61 on the combined dataset. A summary of Precision, Recall, and F-score obtained from the optimum custom filters and the default filter is presented in Table 1.

Treatments Identified from BHL. For the 129 distinct plant species identified, 72 BHL texts were identified after filtering based on subject keywords, followed by selection of the scope of text containing mentions of plant names (217 text segments). Manual evaluation of the MetaMap output with the custom filter resulted in the identification of 14 distinct treatment indications associated with six plants from 14 different BHL titles. The resulting indications are listed in Table 2 along with their respective BHL barcodes. The true indications identified from BHL accounted for 17.54% of total associations. An additional 24 distinct treatment indications associated with nine plants were identified that belonged to the semantic types ‘phsu’ and ‘fndg’ (Table 3). Of the associations identified with these two semantic types, the true indications described in BHL texts accounted for 11.70%. Table 4 lists the titles, links to BHL texts and DOIs for the barcodes mentioned in Tables 2 and 3. Additional treatments were identified from BHL that add to the list compiled from the ethnobotanical surveys of Palau and Pohnpei (as described in the PHCMs). A comparative list of such treatments is presented in Table 5.

Table 2: Treatments Identified from BHL. The list of plants from the PHCMs was used to locate relevant text in BHL. The identified scope of text was processed and the output was manually examined to identify potential treatments. The respective BHL identifier(s) for the related text(s) are also listed.

| Plant | Indication | BHL identifier |
|---|---|---|
| Barringtonia racemosa | skin diseases | illustrateddicti00gouluoft |
| | diarrhea | illustrateddicti00gouluoft |
| Carica papaya | round worms | Malaypoisonscha00Giml |
| | heart disease | b21443038; illustrateddicti00goul; illustrateddicti00gouluoft |
| | abscess | cu31924001136872; veterinarymateri00wins; veterinarymateri01wins; veterinarymateri02wins; veterinarymateri04wins; veterinarymateri05wins; veterinarymateri06wins; veterinarymaterie8wins |
| | indigestion | veterinarymateri00wins; veterinarymateri01wins; veterinarymateri02wins; veterinarymateri04wins; veterinarymateri05wins; veterinarymateri06wins; |
| Curcuma longa | dropsy | b21297034 |
| | ulcers | b21297034 |
| Musa paradisiaca | epizootic disease | illustrateddicti00gouluoft |
| | foot and mouth disease | illustrateddicti00gouluoft |
| Mangifera indica | catarrh | b21443038; illustrateddicti00goul |
| | skin diseases | b21443038; illustrateddicti00goul |
| Terminalia catappa | diarrhea | illustrateddicti00gouluoft |
| | catarrh | illustrateddicti00gouluoft |

Table 3: Additional Treatments from BHL with Semantic Type ‘phsu’ or ‘fndg’. Although the output from semantic types ‘phsu’ and ‘fndg’ was noisy, there were some treatment associations of relevance to the study. The respective BHL identifiers are also listed.

| Plant | Indication | STY | BHL identifier |
|---|---|---|---|
| Allium cepa | stimulant | phsu | b21443038 |
| | expectorant | phsu | b21443038 |
| | diuretic | phsu | b21443038 |
| Areca catechu | astringent | phsu | b21443038; illustrateddicti00gouluoft |
| | demulcent | phsu | b21443038; illustrateddicti00gouluoft |
| | anthelmintic | phsu | b21443038; illustrateddicti00gouluoft |
| | vermifuge | phsu | illustrateddicti00gouluoft |
| Barringtonia racemosa | diarrhea | fndg | illustrateddicti00gouluoft |
| Carica papaya | abortifacient | phsu | Malaypoisonscha00Giml |
| | vermifuge | phsu | Malaypoisonscha00Giml |
| | freckles | fndg | Malaypoisonscha00Giml |
| | emetic | phsu | b21443038 |
| | anthelmintic | phsu | illustrateddicti00gouluoft |
| Curcuma longa | intermittent fever | fndg | b21297034 |
| | stimulant | phsu | illustrateddicti00goul; illustrateddicti00gouluoft |
| Luffa cylindrica | diuretic | phsu | illustrateddicti00goul; illustrateddicti00gouluoft |
| | purgative | phsu | illustrateddicti00goul; illustrateddicti00gouluoft |
| | emetic | phsu | illustrateddicti00goul; illustrateddicti00gouluoft |
| Mangifera indica | astringent | phsu | b21443038; illustrateddicti00goul |
| | tonic | phsu | b21443038; illustrateddicti00goul |
| | purulent discharges | fndg | b21443038; illustrateddicti00goul |
| Pangium edule | anthelmintic | phsu | Malaypoisonscha00Giml |
| Terminalia catappa | purgative | phsu | illustrateddicti00gouluoft |
| | diarrhea | fndg | illustrateddicti00gouluoft |



Discussion

Whether transferred verbally across generations or documented by early explorers, historical knowledge needs to be preserved to guide future discoveries. The field of pharmacognosy has benefited from historical leads from such sources [10]. Since the need for systematic investigation of indigenous plant use knowledge has been recognized, ethnobotanists have been generating data from indigenous cultures around the globe. The process of ethnobotany includes field surveys, interviews, collection of specimens, documentation, and analysis to elucidate plant use patterns within the context of human civilization. Searching existing records and literature related to a given identified plant species is an essential part of the process. In light of the growing amount of data being accrued from surveys and the digitization of historical texts, cataloguing and indexing plant use knowledge remains an essential task to support the needs of ethnobotanists and medical historians. Scalable automated pipelines for such tasks may ease the analysis of plant use patterns, which in turn may help focus bioprospecting studies. This study sought to test the feasibility of using biomedical NLP tools and terminological resources for automated extraction of plant-related therapeutic indications from ethnobotanical and historical texts.

Table 4: Reference to BHL Texts for the Identified Treatments. References to the text titles for the BHL identifiers listed in Tables 2 and 3 from which the therapeutic indications were identified.

| BHL identifier | Title ID | Title | Publication details | Subject | Title URL |
|---|---|---|---|---|---|
| b21297034 | 106843 | Materia medica and therapeutics vegetable kingdom | London: J. & A. Churchill, 1874. | Medicine | http://www.biodiversitylibrary.org/title/106843 |
| b21443038 | 101532 | An illustrated dictionary of medicine, biology and allied sciences | London: Baillière, Tindall & Cox, 1894. | Medicine | http://www.biodiversitylibrary.org/title/101532 |
| cu31924001136872 | 56302 | Veterinary materia medica and therapeutics | Chicago: American Veterinary Pub. Co., c1919. | Veterinary medicine | http://www.biodiversitylibrary.org/title/56302 |
| illustrateddicti00goul | 31340 | An illustrated dictionary of medicine, biology and allied sciences ... by George M. Gould. 5th ed., with additions and corrections | Philadelphia P. Blakiston's Son 1907 | Medicine | http://www.biodiversitylibrary.org/title/31340 |
| illustrateddicti00gouluoft | 31340 | An illustrated dictionary of medicine, biology and allied sciences ... by George M. Gould. 5th ed., with additions and corrections | Philadelphia P. Blakiston's Son 1907 | Medicine | http://www.biodiversitylibrary.org/title/31340 |
| Malaypoisonscha00Giml | 96551 | Malay poisons and charm cures | London: J. & A. Churchill, 1923. | Folk medicine; Traditional medicine | http://www.biodiversitylibrary.org/title/96551 |
| veterinarymateri00wins | 42778 | Veterinary materia medica and therapeutics, by Kenelm Winslow | New York, W.R. Jenkins, [c1908] | Veterinary medicine | http://www.biodiversitylibrary.org/title/42778 |
| veterinarymateri01wins | 42775 | Veterinary materia medica and therapeutics, by Kenelm Winslow | New York, W.R. Jenkins Co., 1907. | Veterinary medicine | http://www.biodiversitylibrary.org/title/42775 |
| veterinarymateri02wins | 42507 | Veterinary materia medica and therapeutics, by Kenelm Winslow | New York, W.R. Jenkins Co., 1906. | Veterinary medicine | http://www.biodiversitylibrary.org/title/42507 |
| veterinarymateri04wins | 42788 | Veterinary materia medica and therapeutics, by Kenelm Winslow | New York, W. R. Jenkins, 1905. | Veterinary medicine | http://www.biodiversitylibrary.org/title/42788 |
| veterinarymateri05wins | 49567 | Veterinary materia medica and therapeutics | New York, W.R. Jenkins Co. [c1913] | Veterinary medicine | http://www.biodiversitylibrary.org/title/49567 |
| veterinarymateri06wins | 49562 | Veterinary materia medica and therapeutics | New York, William R. Jenkins company [c1916] | Veterinary medicine | http://www.biodiversitylibrary.org/title/49562 |
| veterinarymaterie8wins | 62470 | Veterinary materia medica and therapeutics | Chicago, American Veterinary Publishing Co. [c1919] | Veterinary medicine | http://www.biodiversitylibrary.org/title/62470 |

Table 5: Comparison of Therapeutic Indications from PHCMs and BHL. A BHL versus PHCMs comparative view of the therapeutic indications extracted from texts for the list of plants from PHCMs that were mentioned in BHL text.

| Plant | Therapeutic Indication(s) from BHL | Therapeutic Indication(s) from PHCMs |
|---|---|---|
| Allium cepa | digestion; stimulant; expectorant; diuretic | worms; intestinal worms; GI issues; diarrhea; Trichuris muris nematode |
| Areca catechu | astringent; demulcent; anthelmintic; vermifuge | shaking sickness |
| Barringtonia racemosa | skin diseases; diarrhea | heartburn; skin rash; antinociceptive; analgesic; pain; antibacterial; back pain |
| Carica papaya | abortifacient; vermifuge; round worms; freckles; heart disease; emetic; digestive; abscess; tumors; malignant growth; anthelmintic; indigestion | stings of stonefish and scorpionfish; burning rash from fire coral; antioxidant; anti-inflammatory; local inflammation; genotoxicity; antimicrobials; guinea worm infection; anthelmintic |
| Curcuma longa | dropsy; intermittent fever; ulcers; stimulant | skin rash; black spots; melasma; stretch marks; anti-inflammatory; skin conditions; anti-aging; dry and damaged skin; darkened skin areas; sting of sea urchin; antibacterial; antiviral; antifungal; antispasmodic; hepatoprotective; inflammatory diseases; mangrove sickness |
| Luffa cylindrica | diuretic; purgative; emetic | skin burns; antioxidant; antifungal; Mycosphaerella arachidicola; Fusarium oxysporum; fever; milk production in lactating women |
| Mangifera indica | astringent; tonic; catarrh of nasal passage; purulent discharges from the vagina; skin diseases | syphilis; antiviral; anti-inflammatory; antinociceptive; analgesic; dental hygiene; immune enhancing; hypoglycemic activity; back pain; joint pain; stomachache |
| Musa paradisiaca | epizootic disease; foot and mouth disease | bites; stings; boil; skin burn; burn wound; cuts; wounds; syphilis; stomachache in pregnant women |
| Pangium edule | anthelmintic | hemorrhoids |
| Terminalia catappa | purgative; diarrhea; catarrh | diarrhea; antibacterial; antifungal; antioxidant; hepatoprotective; chemopreventive |

The development of automated pipelines for extracting plant-related medicinal use information from ethnobotanical and historical texts is fraught with challenges. A major challenge is the lack of annotated reference corpora for training and evaluating NLP approaches [13]. A contribution of this study is the manual annotation of the PHCMs (Palau and Pohnpei), which can be used to support the development and evaluation of NLP tools for ethnobotany. The annotation of therapeutic indications from these manuals reveals the range of descriptions used for the therapeutic use of medicinal plants. For the purpose of evaluating and optimizing the pipeline developed in this study, the concepts identified by MetaMap were organized and combined according to the UMLS Semantic Network. The UMLS Semantic Network reduces the more than one million concepts in the UMLS Metathesaurus into 133 semantic types [29] that are further grouped into 15 semantic groups [27]. After applying the cutoff value used in this study, the relevant concepts were distributed over 25 and 22 different semantic types from the Palau and Pohnpei PHCMs, respectively. Such a spread over different semantic type categories reflects the diverse nature of the terms or phrases used to describe medicinal uses of plants within these texts. Owing to this diverse spread, an optimization approach was used in this study to filter noise while still retaining relevant therapeutic information. The efficacy of such a filter is evident from the improvement in F-score over the default filter that only included the semantic group ‘Disorder’ (47.95 versus 28.61 on the combined dataset; see Table 1).

Several interesting medicinal applications were identified from the BHL texts analyzed in this study (Tables 2 and 3). The year of publication for the identified titles ranged from 1874 for the oldest to 1923 for the most recent (see Table 4 for the list of identified BHL titles). A comparative view of treatments from the PHCMs and the associated titles identified from BHL is presented in Table 5. A few medicinal uses were common to both sources: Allium cepa is described in a BHL title as stimulating digestion, and the PHCM mentions its use for gastrointestinal issues; Barringtonia racemosa is described in a BHL title as used for skin diseases, and the PHCM lists its use for skin rashes; and the anthelmintic potential of Carica papaya is reported in both a BHL title and the PHCM. Additional therapeutic use information for related species was also visible in the defined scope of text, such as the use of Terminalia chebula for diarrhea, dysentery, and bilious disorders [30].

The scope of text analyzed was defined as a paragraph containing the mention of a plant name. Although interesting therapeutic indications related to the PHCM plant list were identified, future work is needed to investigate individual texts further and to design custom templates specifying the relevant scope of text (e.g., document, page, section, paragraph, or sentence). Co-reference resolution methods [31,32] may also be beneficial. Additionally, correction of optical character recognition (OCR) output text will be essential for contemporary NLP systems to map concepts correctly. Plant species names can be used as identifiers for integrating information, but there may be challenges involving ambiguity and the use of alternative names (‘synonyms’) [33]. Future work may be more inclusive by using a collection of synonyms and vernacular names associated with a given plant to identify relevant texts. Similarly, there may be challenges in mapping the words or phrases used in historical text to describe different disorders, as well as their respective synonyms and variants, to canonical medical concepts [10]. Future work might therefore aim to include semantic libraries for improving the recall of relevant biomedical concepts, including methods for automated lexicon generation [34,35]. Finally, entity name reconciliation services may be used to better align name strings (e.g., from iPhylo [36]) with collaborative knowledge bases such as Wikidata [37] (or the now deprecated, but still accessible, Freebase [38]). Such services may allow custom knowledge graphs to be built for answering questions of interest [39].

This study highlighted several challenges and opportunities in text mining historical data available in resources like the BHL for the identification of potential therapeutic leads. The results and insights gained in this study serve as a foundation upon which a bridge between historical plant-based knowledge (which includes that described in biodiversity texts) and contemporary biomedical application can be built. In particular, the success of this feasibility study demonstrates the potential for automated cataloguing of medicinal plant-specific information from ethnobotanical and historical data sources such as BHL using existing biomedical NLP resources. It is anticipated that the approach demonstrated here can be used on other relevant sources of historical medicinal plant knowledge, such as the collection of health-related historical material maintained by the National Library of Medicine’s History of Medicine Division [40].



Conclusion The process of cataloguing medicinal plant use information by ethnobotanists has largely relied on manual searching of records from surveys and the biomedical literature. Historical evidence about the medicinal use of plants from archival sources may be a valuable addition to the existing pipeline for cataloguing information. Automated methods can support this process by distilling large amounts of text to extract meaningful information. This preliminary study demonstrated the potential of informatics methods for extracting plant-associated medicinal use information from historical biodiversity literature archived in the Biodiversity Heritage Library. In addition to identifying historical medicinal use leads, the study highlighted several challenges that provide insights into how to adapt biomedical terminologies and approaches.

Acknowledgement This study was funded by grant R01LM011963 from the National Institutes of Health.




References
1. Schultes RE, von Reis S (eds.). Ethnobotany: evolution of a discipline. 1995; Available from: http://agris.fao.org/agrissearch/search.do?recordID=GB9633996
2. Ji H-F, Li X-J, Zhang H-Y. Natural products and drug discovery. Can thousands of years of ancient medical knowledge lead us to new and powerful drug combinations in the fight against cancer and dementia? EMBO Rep. 2009 Mar;10(3):194–200.
3. Harshberger JW. The Purposes of Ethnobotany. Bot Gaz. 1896;21:146–54.
4. De materia medica [Internet]. [cited 2017 Feb 17]. Available from: https://www.nlm.nih.gov/hmd/greek/greek_dioscorides.html
5. De Vos P. European materia medica in historical texts: longevity of a tradition and implications for future use. J Ethnopharmacol. 2010 Oct 28;132(1):28–47.
6. Balick MJ, Cox PA. Plants, People, and Culture: The Science of Ethnobotany. Scientific American Library; 1997. (Library series).
7. Ramirez CR. Ethnobotany and the Loss of Traditional Knowledge in the 21st Century. Ethnobotany Research and Applications. 2007 Dec 31;5(0):245–7.
8. Dahmer SM, Balick MJ, Kitalong AH, Kitalong C, Herrera K, Law W, et al. Palau Primary Health Care Manual: Health Care in Palau, Combining Conventional Treatments and Traditional Uses of Plants for Health and Healing. 1st ed. The New York Botanical Garden, Ministry of Health, Republic of Palau, The Continuum Center for Health and Healing and Belau National Museum. Charleston SC; 2012. 216 p.
9. Lee R, Shere N, Balick MJ, Sohl F, Roberts AS, Herrera K, et al. Pohnpei Primary Health Manual: Health Care in Pohnpei, Micronesia: Traditional Uses of plants for Health and Healing. The New York Botanical Garden, The Nature Conservancy, Continuum Center for Health and Healing and The Conservation Society of Pohnpei, Charleston SC; 2010. 178 p.
10. Buenz EJ, Schnepple DJ, Bauer BA, Elkin PL, Riddle JM, Motley TJ. Techniques: Bioprospecting historical herbal texts by hunting for new leads in old tomes. Trends Pharmacol Sci. 2004 Sep;25(9):494–8.
11. Buenz EJ, Bauer BA, Johnson HE, Tavana G, Beekman EM, Frank KL, et al. Searching historical herbal texts for potential new drugs. BMJ. 2006 Dec 23;333(7582):1314–5.
12. Biodiversity Heritage Library [Internet]. [cited 2017 Feb 17]. Available from: http://www.biodiversitylibrary.org/
13. Thompson P, Batista-Navarro RT, Kontonatsios G, Carter J, Toon E, McNaught J, et al. Text Mining the History of Medicine. PLoS One. 2016 Jan 6;11(1):e0144717.
14. Thessen AE, Cui H, Mozzherin D. Applications of natural language processing in biodiversity science. Adv Bioinformatics. 2012 May 22;2012:391574.
15. Lowe HJ, Barnett GO. Understanding and using the medical subject headings (MeSH) vocabulary to perform literature searches. JAMA. 1994 Apr 13;271(14):1103–8.
16. Sarkar IN. Biodiversity informatics: organizing and linking information across the spectrum of life. Brief Bioinform. 2007 Sep;8(5):347–57.
17. Aronson AR. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proc AMIA Symp. 2001;17–21.
18. Spasic I, Ananiadou S, McNaught J, Kumar A. Text mining and ontologies in biomedicine: making sense of raw text. Brief Bioinform. 2005 Sep;6(3):239–51.
19. Chen L, Friedman C. Extracting phenotypic information from the literature via natural language processing. Stud Health Technol Inform. 2004;107(Pt 2):758–62.
20. Rindflesch TC, Fiszman M. The interaction of domain knowledge and linguistic structure in natural language processing: interpreting hypernymic propositions in biomedical text. J Biomed Inform. 2003 Dec;36(6):462–77.
21. Garten Y, Coulet A, Altman RB. Recent progress in automatically extracting information from the pharmacogenomic literature. Pharmacogenomics. 2010 Oct;11(10):1467–89.
22. Hirschman L, Yeh A, Blaschke C, Valencia A. Overview of BioCreAtIvE: critical assessment of information extraction for biology. BMC Bioinformatics. 2005 May 24;6 Suppl 1:S1.
23. Kim JD, Ohta T, Tateisi Y, Tsujii J. GENIA corpus—a semantically annotated corpus for bio-textmining. Bioinformatics [Internet]. 2003; Available from: http://bioinformatics.oxfordjournals.org/content/19/suppl_1/i180.short
24. Gurulingappa H, Mateen-Rajput A, Toldo L. Extraction of potential adverse drug events from medical case reports. J Biomed Semantics. 2012 Dec 20;3(1):15.
25. Lindberg DA, Humphreys BL, McCray AT. The Unified Medical Language System. Methods Inf Med. 1993;32(4):281–91.
26. Stanford Manual Annotation Tool [Internet]. [cited 2017 Jan 22]. Available from: http://nlp.stanford.edu/software/
27. McCray AT, Burgun A, Bodenreider O. Aggregating UMLS semantic types for reducing conceptual complexity. Stud Health Technol Inform. 2001;84(Pt 1):216–20.
28. Global Names Recognition and Discovery [Internet]. [cited 2017 Feb 17]. Available from: http://gnrd.globalnames.org/
29. McCray AT. An upper-level ontology for the biomedical domain. Comp Funct Genomics. 2003;4(1):80–4.
30. Gould GM. An illustrated dictionary of medicine, biology and allied sciences ... by George M. Gould. 5th ed., with additions and corrections. Philadelphia: P. Blakiston’s Son; 1907. Available from: http://dx.doi.org/10.5962/bhl.title.31340
31. Clark K, Manning CD. Entity-Centric Coreference Resolution with Model Stacking. In: ACL (1). 2015. p. 1405–15.
32. Lee H, Peirsman Y, Chang A, Chambers N, Surdeanu M, Jurafsky D. Stanford’s Multi-pass Sieve Coreference Resolution System at the CoNLL-2011 Shared Task. In: Proceedings of the Fifteenth Conference on Computational Natural Language Learning: Shared Task. Stroudsburg, PA, USA: Association for Computational Linguistics; 2011. p. 28–34. (CONLL Shared Task ’11).
33. Sharma V, Sarkar IN. Leveraging biodiversity knowledge for potential phyto-therapeutic applications. J Am Med Inform Assoc. 2013 Jul;20(4):668–79.
34. Schiffman B, McKeown KR. Experiments in Automated Lexicon Building for Text Searching. In: Proceedings of the 18th Conference on Computational Linguistics - Volume 2. Stroudsburg, PA, USA: Association for Computational Linguistics; 2000. p. 719–25. (COLING ’00).
35. Drouin P. Automated Identification of a Transdisciplinary Scientific Lexicon. Revue française de linguistique appliquée. 2007;12(2):45–64.
36. Page RDM. iPhylo. iPhylo [Internet]. [cited 2017 Jun 27]; Available from: http://iphylo.blogspot.com/
37. Vrandečić D, Krötzsch M. Wikidata: A Free Collaborative Knowledgebase. Commun ACM. 2014 Sep;57(10):78–85.
38. Bollacker K, Evans C, Paritosh P, Sturge T, Taylor J. Freebase: A Collaboratively Created Graph Database for Structuring Human Knowledge. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. New York, NY, USA: ACM; 2008. p. 1247–50. (SIGMOD ’08).
39. Page R. Towards a biodiversity knowledge graph. Res Ideas Outcomes. 2016 Jul 4;2:e8767.
40. NLM’s History of Medicine Division [Internet]. [cited 2017 Feb 20]. Available from: https://www.nlm.nih.gov/hmd/
