Improved chemical text mining of patents using infinite ... - Springer Link

Sayle et al. Journal of Cheminformatics 2011, 3(Suppl 1):O16 http://www.jcheminf.com/content/3/S1/O16

ORAL PRESENTATION

Open Access

Improved chemical text mining of patents using infinite dictionaries, translation and automatic spelling correction Roger A Sayle1*, Plamen Petrov2, Jon Winter3, Sorel Muresan2 From 6th German Conference on Chemoinformatics, GCC 2010 Goslar, Germany. 7-9 November 2010

The text mining of patents and patent applications for chemical structures of interest to medicinal chemists poses a number of unique challenges not encountered in other fields of text analytics. Traditional text mining relies on the co-occurrence of common terms between documents to provide similarity measures that can be used to cluster and rank related documents. The more words shared between two documents, the more similar they are, and the greater the probability that they discuss the same topic. By contrast, in pharmaceutical “composition of matter” patents the novel and unique chemical entities are far more significant than those that can be found elsewhere. Although the text of a pharmaceutical patent may explicitly name thousands of individual compounds, and via generic Markush structures claim an infinite number, the role of these patents is to protect the intellectual property of only one or perhaps two drug candidates. In this work, we present an analysis of the “quality not quantity” of structures extracted by automatic Chemical Named Entity Recognition (CNER) methods both on a small hand-curated benchmark set [1] and a large-scale analysis of a comprehensive database of 12 million patents [2,3]. Our results show the limited value of traditional lexicon/dictionary based approaches in extracting “key” compounds and that the major impediment is not the performance of the name-to-structure software used, but the high rate of OCR errors, typos and lexicographic problems found in patent-office data feeds. To address this problem, novel algorithms for automatic chemical spelling correction have been developed, that take advantage of the grammar used in IUPAC-like * Correspondence: [email protected] 1 NextMove Software, Santa Fe, New Mexico, 87501, USA Full list of author information is available at the end of the article

nomenclature. This forms a preprocessing pass, independent of the name-to-structure software used, and is shown to greatly improve results in our study. Author details 1 NextMove Software, Santa Fe, New Mexico, 87501, USA. 2DECS, AstraZeneca, Molndal, Sweden. 3CIRA, AstraZeneca, Alderley Park, Cheshire, UK. Published: 19 April 2011 References 1. Hattori K, Wakabayashi H, Tamaki K: Predicting Key Example Compounds in Competitor’s Patent Applications using Structural Information Alone. J Chem Inf Model. 2008, 48:135-142. 2. Rhodes J, Boyer S, Kreulen J, Chen Y, Ordonez P: Mining Patents using Molecular Similarity Search. Pac Symp Biocomput. 2007, 12:304-315. 3. Suriyawongkul I, Southan C, Muresan S: The Cinderella of Biological Data Integration: Addressing the Challenges of Entity and Relationship Mining from Patent Sources. In Data Integration in the Life Sciences. Lect Notes Bioinf. Volume 6254. Springer; 2010. 4. Sayle R: Foreign Language Translation of Chemical Nomenclature by Computer. J Chem Inf Model. 2009, 49:519-530. doi:10.1186/1758-2946-3-S1-O16 Cite this article as: Sayle et al.: Improved chemical text mining of patents using infinite dictionaries, translation and automatic spelling correction. Journal of Cheminformatics 2011 3(Suppl 1):O16.

Publish with ChemistryCentral and every scientist can read your work free of charge Open access provides opportunities to our colleagues in other parts of the globe, by allowing anyone to view the content free of charge. W. Jeffery Hurst, The Hershey Company. available free of charge to the entire scientific community peer reviewed and published immediately upon acceptance cited in PubMed and archived on PubMed Central yours you keep the copyright Submit your manuscript here: http://www.chemistrycentral.com/manuscript/

© 2011 Sayle et al; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Improved chemical text mining of patents using infinite ... - Springer Link

Improved chemical text mining of patents using infinite ... - Springer Link

Suggest Documents

Biomedical Text Data Mining: Recent Patents

Decision Support via Text Mining - Springer Link

The Possibility of Using the Google Patents Search ... - Springer Link

Using Text Mining and Link Analysis for Software Mining - CiteSeerX

Using Text Mining and Link Analysis for Software Mining - CiteSeerX

Applications and Challenges of Text Mining with Patents

Using text mining to support text summarization

SOME CLASSES OF WEAKLY INFINITE ... - Springer Link

BioData Mining - Springer Link

BioData Mining - Springer Link

Certain Classes of Infinite Series - Springer Link

BioData Mining - Springer Link

Text Classification Using Data Mining

Mining Text Using Keyword Distributions

Mining Text Using Keyword Distributions

Web Mining - Springer Link

Ergodic properties of infinite systems - Springer Link

Infinite Homography Estimation Using Two Arbitrary ... - Springer Link

chemical composition - Springer Link

Chemical Ribonucleases - Springer Link

IFMBE Proceedings 2512 - Text Mining for Automatic ... - Springer Link

Layout optimization of trusses using improved GA ... - Springer Link

Characterizing User-Generated Text Content Mining: a ... - Springer Link

Mining Causality for Explanation Knowledge from Text - Springer Link