Biomedical Text Data Mining: Recent Patents

Recent Patents on Computer Science 2009, 2, 59-67


Colleen E. Crangle*
Converspeech LLC, 60 Kirby Place, Palo Alto, CA 94301, USA
Received: September 25, 2008; Accepted: October 9, 2008; Revised: October 28, 2008

Abstract: Biomedical information continues to grow beyond the capacity of scientists to capture and use all that is produced. Much of this information is presented in scientific journal articles and expressed in natural language. Biomedical text data mining is concerned with automated methods for analyzing the content of these documents and discovering and extracting the knowledge in them. Numerical data mining has long been used to uncover patterns in numerical data and make predictions based on those patterns. Text data mining builds on the success of numerical data mining but presents additional challenges. This article examines text data mining for biomedical text, paying particular attention to the complexities of natural language that must be taken into account and to the role of biomedical knowledge sources. Using this perspective, recent patents for data mining specific to biomedical text are discussed and expected future patent activity is appraised.

Keywords: Text data mining, biomedical discovery, MeSH, PubMed, Gene Ontology, GO, controlled vocabulary, ontology, entity extraction, exploratory data analysis, natural language processing, patents.

INTRODUCTION

The Human Genome Project, an international research effort to determine the location and sequence of all human genes, has given rise to vast amounts of information. As it now seeks to discover the details of what the genes do and how they work together, information continues to grow beyond the capacity of scientists to capture and use all that is produced. Much of this information is expressed in ordinary language in unstructured text documents, in particular scientific journal articles. These documents may contain figures and tables but primarily they are organized into sentences and paragraphs and other units of ordinary language exposition.

Biomedical text data mining is concerned with automated methods for analyzing the content of these documents and discovering and extracting the knowledge in them. It can be applied in many ways. For example, curators of model-organism databases may want to know about experiments reported in the literature that support molecular-function claims. Laboratory scientists may also seek information in the scientific literature but be interested in the interaction between genes in the development of a disease. Medical billing experts may wish to associate a billing code with a hospital discharge summary.

Numerical data mining has long been used to uncover patterns in numerical data and make predictions based on those patterns: for example, loan data revealing which customers are good credit risks, or student data showing which students are most likely to drop out. Data exploration techniques permit the analyst first to become familiar with the data. That familiarity allows particular attributes of the data to be singled out and their predictive power tested using analytical methods such as clustering, decision trees and regression. Text data mining builds on the success of numerical data mining but presents additional challenges.

*Address correspondence to this author at the Converspeech LLC, 60 Kirby Place, Palo Alto, CA 94301, USA; Tel: 1-650-322-9257; Fax: 1-650-3286138; E-mail: [email protected]

1874-4796/09 $100.00+.00

This article examines text data mining for biomedical text. It takes a natural-language centric view of text data mining, working on the premise that it is natural language in its many complex forms that characterizes biomedical text and recognizing that natural-language text cannot readily be mapped to numerical data. This article also describes the major knowledge sources being created in biomedicine and reveals the role they play in biomedical text data mining. Using this perspective, it discusses recent patents and assesses future patent activity.

BIOMEDICAL TEXT DATA MINING

The most important raw data for text data mining are found in the articles published in the scientific literature. In September 2004, the U.S. National Institutes of Health (NIH) proposed that published articles funded by NIH be made available to the public without charge [1]. In January of 2008, that proposal became law and the NIH informed its grantees that they must begin sending copies of their accepted, peer-reviewed manuscripts to NIH for posting to PubMed Central (PMC). PMC is a free digital archive of the full text of biomedical and life sciences journal articles. It was developed and is managed by the National Center for Biotechnology Information (NCBI) in the National Library of Medicine (NLM). PubMed Central currently provides free and unrestricted access to the full text of 1,500,000 articles in over 450 biomedical and life sciences journals, with more to come.

PubMed, a related service and database, provides access to the title, abstract, and other citation information of over 18 million articles from life sciences and related journals, including those in PMC [2]. The primary bibliographic component of PubMed is known as MEDLINE. PubMed includes links to full-text articles at the publishers’ sites and to other related resources such as GenBank, the genetic sequence data repository. Currently, about 80 million PubMed searches are conducted each month.
© 2009 Bentham Science Publishers Ltd.

Although biomedical text data mining can be performed on any relevant text, the documents that contain the most comprehensive and authoritative data and present the


greatest challenge for data mining are those in the scientific literature. Furthermore, the scientific literature, as maintained by the NLM, provides a categorized database of text data that can be used to mine the information in other documents (clinical records or patent applications, for example). The character and organization of the scientific literature database therefore feature prominently in biomedical text data mining applications and patents.

BIOMEDICAL TEXT DATA: THE BASIC UNITS OF ANALYSIS

Biomedical text data mining differs from numerical data mining in one fundamental way: the nature of the data itself. Text data at its most basic consists of a sequence of characters that make up a document. Substantial preprocessing is needed to convert the sequence of characters in a document into basic units on which further analysis can be done. For natural language, words may be thought of as the basic units of analysis, but it is often several words that carry the meaning in a document. For example, in the following excerpt, the basic units useful for analysis include “gene product,” “functions of a gene product,” “transporting around,” “binding to,” “holding together,” “one activity,” and “more than one activity.”

“The functions of a gene product are the jobs that it does or the "abilities" that it has. These may include transporting things around, binding to things, holding things together and changing one thing into another. This is different from the biological processes the gene product is involved in, which involve more than one activity.”

In the following excerpt, the basic units useful for analysis are “LDLH activity,” “evaluating LDLH activity,” “method for evaluating,” “purified preparations,” “crude preparations,” “both purified and crude preparations,” and so on.

“A method for evaluating LDLH activity is described. This method is well suited for rapidly characterizing both purified and crude preparations of LDLH enzymes.”

Identifying the basic units of meaning in a text is only part of the job. What is also needed is to know when other words and phrases refer to the same thing, and when other words and phrases do not refer to the same thing even when they appear to do so. Take the phrase “molecular function,” for example. In the following pieces the same concept (with varying degrees of similitude) is being referred to: “the function of molecules,” “the function of a molecule,” “the way in which molecules function.” But it is not being referred to in this piece of language: “the interaction between molecules functions as a metaphor for.”

Consider also the specific molecular function identified by “hydrolase activity.” This function can be referred to in scientific text in many different ways:

“…catalysis of the hydrolysis of phosphoric anhydride bonds…”
“…three major 20S proteasome activities (chymotrypsin-like, trypsin-like, and peptidylglutamyl-peptide hydrolase)…”


“…S-adenosyl-L-homocysteine-hydrolase (SAHH) activity…”

“…SAHH activity…”
“…ATP-hydrolase reaction…”

At the same time, this function is not being referred to in the following pieces of text:

“…activity in sucrose hydrolysis…”
“…inhibitory activity against recombinant Plasmodium falciparum SAH hydrolase…”

The recursive and creative character of natural language gives rise to an indefinite number of ways of combining words and embedding phrases. Recognizing when different combinations of words actually refer to the same underlying concept is not straightforward. It is in no way analogous to the task with numerical data of, for example, identifying the loan defaulters or the student with the best grades; it would be something like taking a list or table of undifferentiated numbers and figuring out which numbers identify students, which numbers indicate grades achieved for which classes, and which numbers, if any, indicate the annual income of the student’s parents or the kind of high school the student attended, and so on. Numerical data that is to be mined has a structure already imposed on it and the basic elements being separated, sifted, combined, or split are there available for use.

EXTRACTING NAMED ENTITIES

A large part of identifying the basic units of analysis in text is known as named entity recognition or extraction. This task identifies the specific “entities” referred to in the text data. For biomedical data, these entities may be diseases, organizations, dates, genes, proteins, DNA, RNA, experimental methods, cell lines, and cell types, for example. These are entities that have so-called rigid designators, that is, names. Natural language uses both names and descriptions to refer to entities and though there are typically a fixed number of names for an entity, there is an indefinite number of ways of referring to it by description.
However, the distinction between name and descriptor is more nuanced with biomedical text; even entities as apparently unambiguous as genes or proteins can have multiple names or be referred to using a variety of descriptive phrases [3]. Research on biomedical named entity extraction has been underway for some years. A state-of-the-art open-source tool for automatically identifying genes, proteins, and other entities in text is given in [4]. Typical methods applied in entity extraction are rule-based pattern matching, collocation analysis, regular expression matching, keyword proximity, and normalized controlled-vocabulary matching [5, 6]. An additional task that is necessary for comprehensive named entity extraction is abbreviation resolution. Here, an abbreviation is replaced by its full term so that different occurrences of that term and its abbreviation can be tied together. A database of 2,074,367 abbreviations obtained from analysis of 11,447,996 PubMed citations is described in [7]. It is important to recognize that several layers of natural-language processing underlie entity extraction. These layers


are common to all text data mining tasks, whether in the biomedical domain or not, and will simply be mentioned here without detailed discussion. First there is tokenization, which breaks up the stream of alphanumeric and other characters into pieces meaningful to the task at hand. For text data mining, the most useful kind of token is a word. Hyphens, compound verbs such as “pick up,” quantities expressed using either numbers or words, units of measurement, and so on, all complicate this task. Punctuation marks and quantities (sequences of numerical and other characters) must also be separated out during tokenization. The next layer is morphological analysis or stemming, in which different forms of a word are recognized as such: “activity,” “activation,” “activates,” “activated,” for example, with stem “activ.” The part of the word stripped from it also conveys nuances of meaning that could be crucial, such as the time sequence distinction between “activates” and “activated,” and this information is sometimes captured as well as the stem itself. The next layer of linguistic analysis is part-of-speech tagging, in which a word is identified as a noun, verb, adjective or preposition, for example, a characterization that suggests how the word might function in a sentence. The final level of linguistic analysis is phrase-boundary identification or, more fully, syntactic parsing, where words are grouped together based on familiar grammatical patterns to reveal larger units (noun phrases, verb phrases and prepositional phrases) and the relations between these units. Here punctuation and word order are significant.
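The first two layers described above, tokenization and stemming, can be illustrated with a minimal Python sketch. The regular expression and the suffix list are assumptions chosen only to reproduce the “activ” example in the text; they are not drawn from any tool cited in this article, and a production system would use a far richer rule set.

```python
import re

# Assumed token pattern: a word (optionally with internal hyphens, so that
# "trypsin-like" stays whole) or a single punctuation character.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*|[^\sA-Za-z0-9]")

# Illustrative suffix list only; real stemmers apply many more rules.
SUFFIXES = ("ation", "ates", "ated", "ity")


def tokenize(text):
    """Split text into word tokens and separate punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def toy_stem(word):
    """Lowercase the word and strip the longest matching illustrative suffix."""
    lowered = word.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        # Require a few leading characters so very short words are left alone.
        if lowered.endswith(suffix) and len(lowered) > len(suffix) + 2:
            return lowered[: -len(suffix)]
    return lowered


if __name__ == "__main__":
    print(tokenize("SAHH activity, trypsin-like."))
    print([toy_stem(w) for w in ["activity", "activation", "activates", "activated"]])
```

Keeping the stripped suffix alongside the stem, as the text notes, would be a small extension: `toy_stem` could return a `(stem, suffix)` pair instead of the stem alone.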
More advanced linguistic analysis entails identifying where entities are referred to indirectly using pronouns such as “it” and “they” or noun phrases such as “those genes” or “the most important protein function.” These expressions refer to previously or still-to-be mentioned entities, and when properly resolved allow all references to a given entity to be identified.

RETRIEVING INFORMATION

An area closely related to and overlapping with text data mining is information retrieval, in particular the specialty of question answering. Here it is not a biomedical entity that is sought but a specific piece of biomedical information. Such information can be restated in propositional form; it will be either true or false. An example is information about gene interactions: Do genes X and Y interact? Sometimes, this information can be found in a single paragraph or even a single sentence. Even then, however, it can be extraordinarily difficult to find that specific paragraph or sentence. The relevant document must first be located, a task familiar to all those who use a search engine such as Google. Then the document itself must be searched for the desired information. For example, consider the following two information needs:

  - Describe the procedure for GST fusion protein expression in Sf9 insect cells.
  - Describe the procedure for quality control in microarray experiments.

These needs can be generalized as follows:

• Information describing standard methods or protocols for doing some sort of experiment or procedure.


These examples come from the Genomics Track of the Text Retrieval Conference (TREC), co-sponsored by the National Institute of Standards and Technology (NIST) and U.S. Department of Defense. These annual conferences support large-scale evaluation of text retrieval methodologies for distinct text collections, including a genomics collection [8]. Other general information templates, taken from the same source, are listed below along with sample queries:

• Information describing the role of a specific gene in a specific disease.
  - Provide information about the role of the gene Ribosomal Protein L11 in the disease Cancer.
  - Provide information about the role of the gene DRD4 in the disease alcoholism.

• Information describing the role of a gene in a specific biological process.
  - Provide information on the role of the gene HMG in the process of chromatin restructuring and transcriptional regulation.
  - Provide information on the role of the Insulin receptor gene in the process of signaling tumorigenesis.

• Information describing interactions such as promote, suppress, inhibit, and so on, between two or more genes in the function of an organ or in a disease.
  - Provide information about the genes HMG and HMGB1 in hepatitis.
  - Provide information about the genes MyD88, TRAM and TRIF in autoimmunity.

• Information describing one or more mutations of a given gene and its biological impact or role.
  - Provide information about mutations of Ret in thyroid function.
  - Provide information about mutations of thiopurine S-methyltransferase in metabolism of drugs.

An additional information template familiar in biomedical data mining is this:

• Information describing the molecular function of a gene product and its sub-cellular location.
  - Provide information about the molecular function of the gene product Gon7p.

The assumption underlying information retrieval of this kind is that the information already exists and has been explicitly stated somewhere; you simply have to find the sentence or paragraph that states it.

SYNTHESIZING EXTRACTED INFORMATION

A more advanced form of text data mining arises when the desired information is not located in any one piece of already available text but has to be synthesized from one or more extracted items of information. It is not that an entirely new discovery is being made; it is simply that in its most readily available form the information being sought comes in several pieces. Finding molecular functions that have been established through direct experimentation is an example of this level of information extraction [9]. In this work, co-occurrence patterns are sought for different kinds of terms


located in different sections of an article: experimental protocol terms in the results section, and gene and protein names and terms from a controlled vocabulary of molecular functions in the abstract, introduction or results. Other examples of extracted and synthesized information are protein-protein interactions, interactions between genes and gene products, molecular binding relationships, pathway information, sub-cellular localization, enzyme interactions and protein structures. A noteworthy example supported by industry can be found in [10]. Other research on this topic can be found in proceedings of the annual Pacific Symposium on Biocomputing. See [11], for example.

There is no way to determine up front whether an item of information may be extracted from one sentence in which it is explicitly stated or whether it has to be synthesized from various items of extracted information. There is also no way to predict beforehand when extracted and synthesized information may actually turn up something completely new or unnoticed up to that point. When it does, we are in the realm of true discovery. For example, if an interaction between gene A and gene B is identified and extracted and similarly for gene B and gene C, then a new connection between gene A and gene C may have been found. When several individual gene-gene interactions are extracted from a collection of documents, these can be combined to provide a network of interactions, possibly pointing to a newly hypothesized pathway. For example, data mining has been able to reconstruct part of the MAPK pathway by looking for phosphorylation relations involving RAF1 or MAPK1. The extracted relations showed that there was an indirect relationship via MAP2K1. Part of the pathway could then be reconstructed using the fact that RAF1 phosphorylates MAP2K1, which was found in one set of documents, and the fact that MAP2K1 phosphorylates MAPK1, which was found in other documents [12].
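The chaining of extracted relations described above can be sketched in a few lines of Python. This is a hedged illustration, not the method of [12]: the relation tuples, the document-set labels, and the function names `build_graph` and `find_paths` are all hypothetical, and the two relations are simply the ones named in the text.

```python
from collections import defaultdict

# Hypothetical extraction output: (agent, target, source label).
# The source labels are placeholders, not real document identifiers.
EXTRACTED_PHOSPHORYLATIONS = [
    ("RAF1", "MAP2K1", "document set 1"),
    ("MAP2K1", "MAPK1", "document set 2"),
]


def build_graph(relations):
    """Map each agent to the set of targets it is reported to phosphorylate."""
    graph = defaultdict(set)
    for agent, target, _source in relations:
        graph[agent].add(target)
    return graph


def find_paths(graph, start, goal, path=None):
    """Return every cycle-free chain of relations leading from start to goal."""
    path = (path or []) + [start]
    if start == goal:
        return [path]
    paths = []
    for nxt in sorted(graph.get(start, ())):
        if nxt not in path:  # do not revisit a node already on this chain
            paths.extend(find_paths(graph, nxt, goal, path))
    return paths


if __name__ == "__main__":
    graph = build_graph(EXTRACTED_PHOSPHORYLATIONS)
    for chain in find_paths(graph, "RAF1", "MAPK1"):
        print(" -> ".join(chain))
```

A chain found this way is only a candidate: as the text goes on to note, nothing in the procedure tells the analyst whether the connection is new or already stated elsewhere.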
Here hidden connections have been revealed that may be the bases for new hypotheses, unless the connections were already known and in fact stated elsewhere. There is no method, no algorithm, that can find all and only new discoveries.

EXPLORING TEXT DATA AND DETECTING PATTERNS

In numerical data mining, exploratory data analysis allows the analyst to become familiar with the data and possible data distributions, observe relationships that appear to be part of the data, and refine assumptions about the data. That is, exploratory data analysis allows the data itself to reveal its underlying structure. It relies heavily on graphical techniques such as histograms and scatter plots that facilitate the natural human ability to recognize patterns. Exploratory data analysis plays an especially important role in deciding which data attributes may have predictive power, that is, which features of the data (which pieces of extracted information, in the case of text data mining) are the ones on which further analysis should be performed. Text data has high dimensionality; there are many distinct words, phrases, extracted entities, and items of metadata (category terms used to index a document, for instance) that can be selected for further analysis. Reducing those dimensions is imperative. However, there are as yet relatively few tools specifically designed for exploratory text data analysis, and


existing approaches often focus on the individual words in a document, not on extracted entities or extracted items of information or metadata.

Some exploratory analysis can be done using the presence or absence of words within documents and the frequencies of word occurrences in and across documents. In a commonly used measure known as term frequency-inverse document frequency (tf-idf), a count is taken of the number of times a given term (typically a word) appears in a document. This count is normalized using the number of words in the document so that the significance of a word in a longer document is not overemphasized. This normalized count, known as term frequency, is then used to assess the importance of a word $w_i$ within a particular document $d_j$. It is calculated as follows:

$$\mathrm{tf}(w_i, d_j) = \frac{f(w_i, d_j)}{S(d_j)}$$

where $f(w_i, d_j)$ is the frequency of word $w_i$ in document $d_j$ and $S(d_j)$ is the sum of the frequencies of all words in document $d_j$. Inverse document frequency is used to assess the general importance of the word $w_i$ within the whole document set $D$. It is calculated by dividing the number of documents by the number of documents containing $w_i$, and then taking the logarithm of that quotient:

$$\mathrm{idf}(w_i) = \log \frac{|D|}{|\{d_j : w_i \in d_j\}|}, \quad j = 1, \ldots, |D|$$

It is then possible to define tf-idf as follows:

$$\text{tf-idf}(w_i, d_j) = \mathrm{tf}(w_i, d_j) \cdot \mathrm{idf}(w_i)$$

A high tf-idf score for a word and document indicates that that word occurs frequently in that document but relatively infrequently in the whole set of documents and so can be thought of as a word that usefully discriminates this document from the others. Lower tf-idf scores identify more common words that appear in many documents. In this way a document’s contents can be summarized and the general relations between documents can be explored. In the future, more advanced methods will need to use extracted items of information rather than words. And measures other than mere occurrences and frequencies of occurrence will have to be formulated.
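The tf-idf definitions above translate directly into code. The following Python sketch assumes each document has already been tokenized into a list of words; the three toy documents and the choice to return 0.0 for a word that appears in no document (where the idf quotient is undefined) are assumptions made for illustration.

```python
import math
from collections import Counter


def tf(word, doc_tokens):
    """Term frequency: count of the word divided by the total word count S(d)."""
    return Counter(doc_tokens)[word] / len(doc_tokens)


def idf(word, docs):
    """Inverse document frequency: log(|D| / number of documents containing word)."""
    containing = sum(1 for doc in docs if word in doc)
    if containing == 0:
        # Assumption: the quotient is undefined here, so report zero importance.
        return 0.0
    return math.log(len(docs) / containing)


def tf_idf(word, doc_tokens, docs):
    """Product of term frequency and inverse document frequency."""
    return tf(word, doc_tokens) * idf(word, docs)


if __name__ == "__main__":
    # Toy, already-tokenized documents (illustrative only).
    docs = [
        ["hydrolase", "activity", "hydrolase"],
        ["gene", "activity"],
        ["gene", "product"],
    ]
    for word in ("hydrolase", "activity"):
        print(word, round(tf_idf(word, docs[0], docs), 4))
```

In this toy collection “hydrolase” scores higher than “activity” for the first document, because it is frequent there and absent elsewhere, which matches the discriminating role the text attributes to high tf-idf scores.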
Occurrence patterns over a document from beginning to end can reveal more about the pertinence of that item of data to the document, for example. The goal of uncovering patterns in text data and making predictions based on those patterns is considered by many to be the ultimate goal of text data mining. It is thought that reaching this goal will bring text data mining on to the same footing as numerical data mining. There have been numerous attempts to predict new, potentially meaningful relations between diseases and genes by detecting patterns in disease and gene name occurrences across scientific articles. Better results seem to be possible, as expected, when existing biomedical knowledge is integrated into the pattern detection. For instance, potential relations between diseases and genes have been found using background knowledge about chromosomal locations, information gleaned from resources such as OMIM, a compendium of human genes and genetic phenotypes [13, 14]. It is intriguing to consider what other kinds of text patterns might be suggested if there were an array of exploratory text data analysis tools. It is likely, for instance, that not just co-occurrence patterns will


be shown to be important but occurrences of a given item of information tracked over time within an article or over time from one article to another.

BIOMEDICAL KNOWLEDGE RESOURCES

The NLM Unified Medical Language System (UMLS) project develops and distributes multi-purpose, electronic knowledge sources and associated system development tools. Its purpose is to spur the development of computer systems that behave as if they "understand" the meaning of biomedical language [15]. A biomedical knowledge source is a structured database of focused biomedical knowledge. There are two kinds of knowledge resources that are of primary importance for text data mining: controlled hierarchical vocabularies and ontologies.

A controlled hierarchical vocabulary lists the allowed or preferred ways of referring to biomedical entities, with hierarchical relations holding between the named entities. Synonyms, alternate names, and “official” names are typically identified. Biomedical text data mining is dependent on the use of knowledge sources of one kind or another in part because of the complex nomenclature of the biomedical sciences. The two most important controlled vocabularies currently are MeSH (Medical Subject Headings) [16] and GO (Gene Ontology) [17]. GO consists of three sub-vocabularies that describe gene products in terms of their associated biological processes, cellular components, and molecular functions. MeSH is one of the vocabularies in the UMLS Metathesaurus®, a large, multi-purpose and multi-lingual vocabulary database that contains information about biomedical concepts. The purpose of each of these many vocabularies is to list the preferred name for a concept, to link alternative names and views of the same concept together, and to identify useful relationships between different concepts.
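As a rough illustration of what a controlled hierarchical vocabulary provides, the Python sketch below stores a preferred name, a list of synonyms, and a parent link for each concept. The three entries and their synonyms are toy data invented for this example; they are not actual MeSH or GO records, and `normalize` and `ancestors` are hypothetical helper names.

```python
# Toy vocabulary: preferred name -> parent concept and alternative names.
# Entries and synonyms are illustrative assumptions, not real MeSH/GO content.
VOCABULARY = {
    "molecular function": {"parent": None, "synonyms": []},
    "catalytic activity": {"parent": "molecular function",
                           "synonyms": ["enzyme activity"]},
    "hydrolase activity": {"parent": "catalytic activity",
                           "synonyms": ["hydrolase reaction"]},
}


def build_index(vocabulary):
    """Map every preferred name and synonym (lowercased) to its preferred name."""
    index = {}
    for preferred, entry in vocabulary.items():
        index[preferred.lower()] = preferred
        for synonym in entry["synonyms"]:
            index[synonym.lower()] = preferred
    return index


INDEX = build_index(VOCABULARY)


def normalize(phrase):
    """Return the preferred name for a phrase, or None if it is not listed."""
    return INDEX.get(phrase.strip().lower())


def ancestors(preferred):
    """Walk the parent links from a concept up to the root of the hierarchy."""
    chain = []
    parent = VOCABULARY[preferred]["parent"]
    while parent is not None:
        chain.append(parent)
        parent = VOCABULARY[parent]["parent"]
    return chain


if __name__ == "__main__":
    print(normalize("Hydrolase Reaction"))
    print(ancestors("hydrolase activity"))
```

Exact lookup of this kind links alternative names to one concept, but it would not by itself distinguish the positive and negative “hydrolase” examples given earlier; that still requires the linguistic analysis discussed above.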
In text data mining, a controlled vocabulary can be particularly useful for entity extraction, and for providing words and phrases to look for in text to identify biomedical phrases of interest. MeSH is a hierarchical set of controlled vocabulary items taken from medicine and chemistry and genomics. It now also includes the names of substances with pharmacological action. There are close to 300,000 items in MeSH. PubMed articles are indexed using these vocabulary items. Staff in the NLM read each publication and select the MeSH items relevant to it. When the user enters his or her search items at PubMed, the system translates them into MeSH items that are then added to the search. In this way, articles that are relevant to the user’s search but use different terms for the same concepts can be found. The MeSH annotations also provide a rich set of metadata that describe the articles and provide a map of the entire biomedical field. Entrez is the text-based search and retrieval system of NCBI that translates the user’s query to MeSH terms and provides access to the PubMed database. Entrez also provides access to other important NCBI text repositories, such as OMIM (Online Mendelian Inheritance in Man), a database of human genes and genetic disorders, and OMIA (Online Mendelian Inheritance in Animals), a database of genes, inherited disorders and traits in more than 135 animal species other than human and mouse [2]. An ontology attempts to define the basic biomedical entities in a domain (the human genome, for example, or


experimental protocols, molecular functions, or biological processes) and their relations to each other. Ontologies aim to characterize not the language of a biomedical domain but the biomedical entities themselves. They use limited and well-defined terms for this characterization. The biomedical information in an ontology is in machine-processable form to permit automated inferences to be made; inferences are driven not only by the rules of logic but by a mapping between the named entities extracted from natural-language text and the items in an ontology. The National Center for Biomedical Ontology, a consortium of biologists, clinicians, and computer scientists, is overseeing the creation and dissemination of a range of ontologies, including ontologies for gene regulation, human phenotypes, and microarray experimental conditions, to name just a few [18]. The future of biomedical text data mining may hinge on how well such knowledge resources, which encode large swathes of established biomedical knowledge, are developed, maintained, and integrated with each other and with controlled vocabularies.

RESEARCH IN BIOMEDICAL TEXT DATA MINING

In addition to TREC, there have been two other significant world-wide research efforts focused on full-text data mining in biomedicine, both concerned with the problem of automating human and model-organism database curation. They are the KDD challenge [6] and the BioCreAtIvE challenge [19-21]. These efforts were aimed at tasks such as the detection of gene and protein names in text and the extraction of functional information related to proteins based on the GO classification system. Many research projects separate from the KDD and BioCreAtIvE challenges also concern text data mining methods that support database curation using GO [9, 22-31]. A range of techniques are being experimented with, including text summarization, text clustering and text classification [25, 26, 29].
As a result of these undertakings and others, there are many annotated text repositories available as resources to advance research on biomedical text data mining. For example, there are collections of sentences annotated to indicate the genes and proteins named in those sentences. There are short pieces of text that provide evidence for the assignment of a GO code to a protein along with short pieces of text that appear to, but do not, provide such evidence. There is also an annotated corpus of literature related to the MeSH terms Humans, Blood Cells, and Transcription Factors, one of the many specialized text repositories available to aid further research and testing. The website http://biocreative.sourceforge.net/resources.html contains detailed information on these and other resources plus computational tools specifically developed for the biological domain.

A topic not yet given much attention but one that will become increasingly important is how to represent documents with economy and with utility for data mining. One good example is given by [32]; it adapts methods used for low signal-to-noise-ratio data (such as seasonal climate data) to form concise document representations using MeSH and GO. Along with the many efforts related to ontology creation, various text data mining approaches have been devised around the use of these ontologies. A useful summary is given by [33]. The Annual Pacific Symposium


on Biocomputing regularly addresses topics related to text data mining: for example, assessments of text mining tools, improving information access to the biomedical literature and databases, and leveraging complementary resources and techniques provided by the literature, databases, controlled vocabularies and ontologies. Further information and online access to the conference proceedings can be found at http://psb.stanford.edu/.

A visual summary of the segments of text data mining is given by this indented list:

Named Entity Extraction
    Natural-language analysis
        Tokenization
        Morphological analysis / stemming
        Part-of-speech tagging, identifying nouns, verbs, adjectives, prepositions, and so on
        Phrase-boundary identification / syntactic parsing
        Pronoun resolution
Information retrieval
Information synthesis (discovery)
Pattern detection (discovery)

A recap of the knowledge resources commonly used in biomedical text data mining is given in the following list:

MeSH (and other Metathesaurus vocabularies)
PubMed literature database indexed by MeSH and linked to OMIM and other text repositories
GO, the Gene Ontology
Ontologies for gene regulation, human phenotypes, microarray experimental conditions, and so on
Annotated text repositories coming out of TREC, KDD, BioCreAtIvE and other research projects

BIOMEDICAL TEXT DATA PATENTS

Exploratory Analysis of PubMed Data Using MeSH

The first patent under discussion offers a tool for exploratory analysis of text data. It does this by exploiting a hierarchical categorizing system such as MeSH. Titled “Graphics image generation and data analysis,” this patent was filed in September 2004 and published in August 2007 [34]. It describes how data mining systems must analyze vast amounts of data to obtain insights contained in the data. The results of the analysis must be displayed for the analyst, whether as extractions or summaries or new hypotheses. Text data, however, does not lend itself to summarization the way numeric data does.
Measures such as the mean, median and variance can give an immediate snapshot of some feature of numeric data. Histograms and other graphical displays can give a visual overview of a numeric dataset. But there are as yet no standard analogous measures for text. Biomedical text, however, has behind it the vast amount of structured data in controlled vocabularies and ontologies. Those parts of the vocabulary or ontology that “match” the text data can function as a summarizing or context-setting tool for that text data.

This patent makes specific reference to the text mining system IBM TAKMI, which exploits the terms and structure of a hierarchical category system like MeSH. While MeSH was designed to allow consistent access to articles in the PubMed database even when those articles use terms different from one another, it can be used for more than search. This invention presents a method for displaying the result of a PubMed search not on its own but in the context of the MeSH categorizing metadata. The PubMed articles returned by a search that are all indexed by a given MeSH category will be shown together, along with an overview of the hierarchical structure associated with that MeSH category.

This invention is aimed at the situation in which a biomedical scientist is familiar with the MeSH terms that pertain to his or her particular area of expertise but does not have a clear sense of related MeSH terms and how they link to the familiar category. Once one term (or category) has been selected, its place in the hierarchy of MeSH terms can reveal all the terms under it that refer to more specific aspects of the original biomedical category. In addition, all documents in that top category and all documents in all descendant categories can be identified and selected for further analysis.

Two methods of analysis in particular are proposed in this patent: aggregation and filtering. Aggregation criteria determine which of the descendant categories are kept. Filtering criteria determine which of the descendant categories are dropped. The final display shown to the analyst is a graphical depiction of the aggregated and filtered nodes within the surrounding category hierarchy. The expectation is that this view will allow unexpected groupings of categories or relations between categories to be revealed to the analyst.

Collaboratively Building Biomedical Knowledge Resources

The next patent concerns the creation, use and maintenance of biomedical knowledge resources. Titled “Collaborative curation of knowledge from text and abstracts,” this patent was filed in May 2007 and published in December 2007 [35]. It has as its focus the databases that are being built up around the world to compile a complete picture of the human genome and the genomes of various organisms such as mice, fruit flies and roundworms that function as research models for how the human organism behaves. This information is buried in existing biomedical articles. Each model-organism database has a staff of highly trained specialists who read the articles to extract the information that will be stored in databases in a consistent way using controlled vocabularies such as GO. The volume of articles and the growth in their numbers make it infeasible to collect all this information manually. This invention concerns tools that can aid the creation and maintenance of these organism databases.

The aim of this invention is to facilitate collaborative curation using the web. Connected to a text repository site such as PubMed, it will open a window, when any particular article is accessed, that displays all existing information curated from that article. The user may vote on the relevance or correctness of the knowledge displayed, revise items already there, or add new items of information. Users will typically be registered and will be given certain privileges to vote, revise or add, depending on their status in the community. Automatic information extraction methods will be applied to the articles and these results also shown to users. One powerful aspect of this invention is that the author of a newly published article has an incentive to start the curation for his or her article and/or keep it accurate and complete. That incentive encourages its use. The system described is envisaged as an open access system or a fee-based service with controlled access.

Analysis of Text Data Using the UMLS and Spreading Activation

The next patent offers an analysis method tailored specifically to text. It uses the UMLS to build specialized knowledge resources that can be used in the analysis. Titled “Processing text with domain-specific spreading activation methods,” this patent was filed in January 2008 and published in July 2008 [36]. At the most general level this invention concerns itself with what a document is “about.” Specifically, given a set of concepts and terms, it seeks to determine which concepts and terms most strongly fit that document.

The invention described in this patent first builds biomedical knowledge resources that are tailored to specific domains: pediatrics, radiology, behavioral medicine, psychiatry, rheumatology, pathology, cardiology, and immunology, to name some. These knowledge resources are semantic or associative networks that then supply each document to be mined with a sub-network of terms and concepts relevant to that particular document. The UMLS provides the semantic “glue” that holds these domain-specific networks together. For example, concepts and relations taken from the almost 600,000 items of the Cincinnati Pediatric Unsupervised Corpus can be combined with the concepts and relations of the UMLS to produce a semantic network tailored to pediatrics.
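The idea of a semantic network supplying a document with a sub-network of relevant concepts, via the spreading activation named in the patent's title, can be illustrated with a small sketch. This is not the patented method: the toy graph, its edge weights, the concept names, and the `decay`, `threshold` and `max_steps` parameters are all hypothetical choices made for illustration. Activation starts at concepts found in a document and is passed along weighted links, stopping once a pulse falls below the threshold.

```python
from collections import defaultdict

def spread_activation(graph, seeds, decay=0.5, threshold=0.1, max_steps=10):
    """Spread activation from seed concepts over a weighted graph.

    graph: dict mapping node -> {neighbor: edge weight in (0, 1]}.
    seeds: dict mapping node -> initial activation.
    Returns node -> accumulated activation for every node reached.
    """
    activation = defaultdict(float)
    for node, value in seeds.items():
        activation[node] = value
    frontier = dict(seeds)
    for _ in range(max_steps):
        next_frontier = defaultdict(float)
        for node, value in frontier.items():
            for neighbor, weight in graph.get(node, {}).items():
                pulse = value * weight * decay
                # Stopping rule (an assumption): weak pulses are dropped.
                if pulse >= threshold:
                    next_frontier[neighbor] += pulse
        if not next_frontier:
            break
        for node, pulse in next_frontier.items():
            activation[node] += pulse
        frontier = next_frontier
    return dict(activation)

# Hypothetical pediatrics-flavored network; all weights are invented.
graph = {
    "asthma": {"wheezing": 0.9, "bronchodilator": 0.8},
    "wheezing": {"asthma": 0.9, "bronchiolitis": 0.6},
    "bronchodilator": {"albuterol": 0.9},
    "fracture": {"cast": 0.9},
}

# Suppose the document mentions only "asthma".
activated = spread_activation(graph, {"asthma": 1.0})
```

Here `activated` holds the sub-network reached from "asthma"; the unrelated "fracture" and "cast" nodes are never activated, which is the sense in which the spreading is cut off at a boundary.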
From this network, terms and concepts are selected for each document. The specific method used both to build the tailored knowledge resources and to associate a sub-network with a document is spreading activation [37]. Spreading activation is thought to mimic human memory in that it retrieves relevant information from a network by accessing information already associated with known information. Crucial to spreading activation is a method to stop the spreading at the appropriate point, resulting in a sub-network that best captures the semantics of the text.

Once a semantic sub-network has been associated with a particular document, query terms can be tested against this network to see which has the strongest connection. That term is the one that most accurately describes what the document is about. Standard document retrieval has attempted to match a word or phrase to a document by adding to it related words and phrases [38, 39]. However, the result is often a huge number of documents that then have to be further investigated to identify those that really are relevant and best match the given term or phrase.

This invention is described in terms of its application to clinical notes, such as a discharge summary. Clinical notes have simpler structure and content than scientific articles, with interpretation of the text potentially guided by the patient’s data. An illustrative data mining task for this text is deciding which of several billing codes should be associated with a given discharge summary.

Analysis of Patent Documents Using PubMed and MeSH

This invention applies text data mining to a text source other than scientific articles. It mines collections of patent documents. Titled “System and method for annotating patents with MeSH data,” it was filed in November 2005 and published in 2007 [40]. Its aim is to improve the way biotechnology patents are obtained, enforced or avoided by providing a better way to find existing patents of interest.

Keyword searching suffers from the usual limitations arising from different terms being used for the same or similar concepts. As a result, patents that use terms different from the selected keywords will remain hidden to the searcher. Adding more keywords to catch more patents produces many spurious matches, making it inefficient to find the relevant ones. A similar problem exists when patent classification and sub-classification codes are used to identify related patents. The search typically returns a large number of patents only some of which are truly relevant. The references cited in the patent documents give some indication of the patent’s area within the biomedical field, but on their own they are of little use as the list is seldom comprehensive. These limitations make it difficult to retrieve related patents and gain an understanding of the overall patent coverage in a particular area.

However, literature citations do provide additional data in the MeSH terms that are used to index them for the PubMed database. This invention extracts from patent documents references to scientific articles. Those articles are then looked up in the PubMed database and the MeSH category terms that index those articles are associated with the patent they were taken from.
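The lookup-and-merge step just described can be sketched as follows. This is only an illustration under stated assumptions: the dictionary `PUBMED_MESH` is a hypothetical in-memory stand-in for a real PubMed lookup, the patent and article identifiers are invented, and the term assignments are made up. Counting how many cited articles carry each term is also an added assumption; the source speaks only of associating the terms with the patent.

```python
from collections import Counter

# Hypothetical stand-in for a PubMed lookup: article ID -> MeSH terms.
# The IDs and the term assignments are invented for illustration.
PUBMED_MESH = {
    "article-1": ["Proteasome Endopeptidase Complex", "Ubiquitin"],
    "article-2": ["Ubiquitin", "Neoplasms"],
    "article-3": ["Oligonucleotide Array Sequence Analysis"],
}

# Hypothetical patents with the article references extracted from them.
PATENT_CITATIONS = {
    "patent-A": ["article-1", "article-2"],
    "patent-B": ["article-3", "article-unknown"],
}

def annotate_patent(cited_articles, mesh_index):
    """Merge the MeSH terms of every cited article found in the index.

    Returns a Counter of term -> number of cited articles indexed by it.
    Citations that cannot be found in the index are skipped.
    """
    merged = Counter()
    for article_id in cited_articles:
        merged.update(mesh_index.get(article_id, []))
    return merged

def patents_for_term(term, annotations):
    """Return the sorted IDs of patents annotated with the given term."""
    return sorted(pid for pid, terms in annotations.items() if term in terms)

annotations = {
    pid: annotate_patent(refs, PUBMED_MESH)
    for pid, refs in PATENT_CITATIONS.items()
}
```

With these toy data, `patents_for_term("Neoplasms", annotations)` finds patent-A through the indexing of its cited articles, whether or not the patent text itself uses that word.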
Each biomedical patent thus ends up with a set of merged MeSH terms, providing a uniform characterization of patents across biomedicine. The invention described here does not limit its purview to patent documents and to MeSH metadata, but it is described primarily in terms of these documents and this kind of metadata. What is presented is a tool for better management of intellectual property in the biomedical field.

General Mining of Relationships in Biomedical Text

The final patent to be discussed is concerned with the general task of finding relationships within text data. It specifically addresses relationships other than the pair-wise relationships that have been the subject of early text data mining research, such as protein-protein interactions or the relationship between a molecular function and a sub-cellular localization. Multi-valued relationships, it is hoped, will lead to hypotheses of greater significance. Titled “System, method, and computer program product for data mining and automatically generating hypotheses from data repositories,” this patent was filed in March 2007 and published in September 2007 [41].

This invention includes in its target text resources “biomedical literature databases; medical records databases; chemical literature databases; computer science literature databases; physics literature databases; legal literature databases; psychology literature databases; social science literature databases; news periodical databases; business journal databases; and combinations of such databases.” And it includes in its notion of document “published journal articles; text strings (such as, for example, a physician's comments in a medical record entry); file records (such as a particular medical record); resumes and/or curriculum vitae; a thesis; a numerical string of data; a patent document (including, for example, issued patents, patent applications, and publicly available patent prosecution documents); online journal articles; internet web pages; material safety data sheets; pharmaceutical and/or chemical data sheets; advertisements; reported court case and/or administrative proceedings; news articles; letters; and combinations of such materials.”

This patent specifically distinguishes itself from “retrieval mode” data mining, which it defines as users knowing what knowledge and information they need so that they can provide at least one concept to initiate the discovery process. This patent is addressed to users who “may not know how to express their knowledge and information needs or even may not realize and/or appreciate an existing information need.” The proffered invention refers to “a plurality of phrases and a plurality of concepts stored in the system for extracting relationships” and then “parsing a plurality of documents in a data repository according to the relationship rule.” The result of such analysis is said to be a hypothesis comprising a previously unknown combination of phrases and concepts. This patent application acknowledges the reliance on various biomedical knowledge resources such as “a vocabulary database corresponding to a selected subject area; a predetermined lexicon; a semantic network; semantic database; a metathesaurus; and combinations of such systems.”

An additional component to this invention is a user profile that is specified up front before data mining is underway.
Proposed hypotheses generated by the system are said to be modified in part by such a user profile. This patent appears to have a very broad coverage but offers relatively few details for understanding its purview.

CURRENT & FUTURE DEVELOPMENTS

There are surprisingly few text data mining patents given that the successful sequencing of the human genome was essentially completed in 2003, and given the health importance of this area. There is room for text data mining patents in all areas of current research. Two topics in particular have received little attention and are especially open areas for new patents. They are exploratory data analysis techniques that are specific to text data, and analysis methods that are designed with semantic or text data in mind.

Tools for the exploratory analysis of text data can be built by exploiting the many biomedical knowledge resources that the biomedical community has built, as with the first patent discussed above. Other resources that may be similarly exploited include Ensembl, which annotates known genes and predicts new ones using functional annotations derived from other sources [42].

There are two ways to make progress in the development of analysis methods specifically for text: devise new ways to map words and other text tokens to numbers so that the techniques already established for numerical data mining can be fully exploited; or develop new methods, along the lines of the associative method of spreading activation, for example, that are designed to take semantic relations into account, that is, relations that are present in text data.

Patents aimed at the design, use, and maintenance of biomedical knowledge resources will continue to be important. Ontologies, in particular, are garnering much attention, with a great deal of international collaboration around their development. And while some controlled vocabularies are well established, others are still under development by the communities that use them. A review of 12 projects is given in [43] and the function of ontologies is reviewed in [44].

Finally, there is likely to be a surge in the application of text mining techniques to existing biomedical problems. One recent patent serves as an example. Titled “Systems, methods, and computer program product for analyzing microarray data,” this patent was filed in January 2005 and published in July 2008 [45]. This invention takes a well-established data analysis method, long used in numerical data mining, and applies it to an important existing biomedical problem: the interpretation of microarray data. Computational methods of clustering are currently applied to microarray data in the hope that they will reveal gene clusters that are biologically meaningful, that point the way toward some new understanding of biological processes. However, any of several clustering methods can be applied (hierarchical clustering, k-means clustering, and self-organizing maps, for example), and they can each produce a different result. There is a need, therefore, to compare clustering results to determine which algorithm is most useful for a given set of data.
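To make concrete how two clustering methods can disagree on the same genes, the sketch below compares two hypothetical partitions using the Rand index, one common statistical agreement measure. The Rand index is brought in here purely for illustration and is not taken from the patent; the gene names and cluster labels are invented, and the two label lists merely stand in for the output of, say, a hierarchical and a k-means run.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two clusterings agree.

    A pair is an agreement when both clusterings place the two items
    in the same cluster, or both place them in different clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("clusterings must label the same items")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two items")
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)

# Hypothetical cluster labels for the same six genes from two methods.
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
hierarchical = [0, 0, 0, 1, 1, 1]
kmeans = [0, 0, 1, 1, 2, 2]

score = rand_index(hierarchical, kmeans)
```

A score of 1.0 would mean the two methods grouped the genes identically; lower values indicate disagreement. The score depends only on the cluster labels and says nothing about the biology of the genes involved.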
While there are statistical measures for comparing the results of different clustering methods, such comparisons do not assess the quality of clustering in terms of its biological reasonableness, merely in terms of the apparent goodness of fit. This invention takes the genes represented in a microarray experiment, selects documents from the scientific literature that are about those genes, and then analyzes the words and phrases in the documents. If it can be shown (by standard text classification methods and significance testing) that the text about different genes within a cluster uses the same words much more often than text about randomly selected genes, then it may be assumed that the genes represented by this cluster are functionally related. The clustering method applied to the microarray data that produced these clusters can therefore be assumed to be a good one as there is independent literature support for the notion that the similarity of gene expression among genes in this cluster is physiologically meaningful.

As text data mining matures, more patents will appear that are not about new text analysis methods but are about the application of text data mining to solve important biomedical problems.

CONFLICT OF INTEREST

The authors have no conflicts of interest to declare.

REFERENCES

[1] Zerhouni EA. NIH public access policy. Science 2004; 306: 1895.
[2] Wheeler D, Barrett T, Benson D, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2008; 36: D13-D21.
[3] Seringhaus MR, Cayting PD, Gerstein MB. Uncovering trends in gene naming. Genome Biol 2008; 9: 401.
[4] Settles B. ABNER: an open source tool for automatically tagging genes, proteins, and other entity names in text. Bioinformatics 2005; 21(14): 3191-3192.
[5] Krallinger M, Valencia A, Hirschman L. Linking genes to literature: Text mining, information extraction, and retrieval applications for biology. Genome Biology 2008; 9(Suppl 2): S8.
[6] Yeh AS, Hirschman L, Morgan AA. Evaluation of text data mining for database curation: Lessons learned from the KDD Challenge Cup. Bioinformatics 2003; 19(1): i331-339.
[7] Chang JT, Schultze H, Altman RB. Creating an online dictionary of abbreviations from MEDLINE. J Am Med Inform Assoc 2002; 9(6): 612-20.
[8] Hersch W, Cohen AM, Roberts P, Rekaplli HK. TREC 2006 Genomics Track Overview. TREC, NIST. SP 500-272.
[9] Crangle CE, Cherry JM, Hong EL, Zbyslaw A. Mining experimental evidence of molecular function claims from the literature. Bioinformatics 2007; 23: 3232-3240.
[10] Uramoto N, Matsusawa H, Nagano T, et al. A text-mining system for knowledge discovery from biomedical documents. IBM Syst J 2004; 43(3): 516-533.
[11] Gonzalez G, Uribe JC, Tari L, Brophy C, Baral C. Mining gene-disease relationships from biomedical literature: Weighting protein-protein interactions and connectivity measures. Pac Symp Biocomput 2007: 28-39.
[12] Blaschke C, Hoffmann R, Oliveros JC, Valencia A. Extracting information automatically from the biological literature. Comp Funct Genom 2001; 2: 310-313.
[13] Hirstovski D, Peterlin B, Mitchell JA, Humphrey SM. Improving literature based discovery support by genetic knowledge integration. Stud Health Technol Inform 2003; 95: 68-73.
[14] Chun H-W, Tsuruoka Y, Kim J-D, et al. Extraction of gene-disease relations from medline using domain dictionaries and machine learning. Pac Symp Biocomput 2006; 11: 4-15.
[15] Lindberg DA, Humphreys BL, McCray AT. The unified medical language system. Methods Inf Med 1993; 32(4): 281-291.
[16] Nelson S, Schopen A, Savage M, Schulman J, Arluk N. The MeSH translation maintenance system: Structure, interface design, and implementation. Medinfo 2008; 11: 67-69.
[17] The Gene Ontology Consortium. Gene Ontology: Tool for the unification of biology. Nature Genet 2000; 25: 25-29.
[18] Smith B, Ashburner M, Rosse C, et al. The OBO foundry: Coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol 2007; 25(11): 1251-5.
[19] Blaschke C, Leon EA, Krallinger M, Valencia A. Evaluation of BioCreAtIvE assessment of task 2. BMC Bioinform 2005; 6(Suppl 1): S16.
[20] Hirschman L, Yeh A, Blaschke C, Valencia A. Overview of BioCreAtIvE: Critical assessment of information extraction for biology. BMC Bioinform 2005; 6(Suppl 1): S1.
[21] Camon E, Barrell D, Dimmer E, et al. An evaluation of GO annotation retrieval for BioCreAtIvE and GOA. BMC Bioinformatics 2005; 6, S1.7
[22] Daraselia N, Yuryev A, Egorov S, Mazo I, Ispolatov I. Automatic extraction of gene ontology annotation and its correlation with clusters in protein networks. BMC Bioinformatics 2007; 8: 243.
[23] Fyshe A, Liu Y, Szafron D, Greiner R, Lu P. Improving subcellular localization prediction using text classification and the gene ontology. Bioinformatics 2008. [Epub ahead of print]
[24] Yang J, Cohen AM, Hersh W. Automatic summarization of mouse gene information by clustering and sentence extraction from MEDLINE abstracts. AMIA Annu Symp Proc 2007; 11: 831-835.
[25] Jaeger S, Gaudan S, Leser U, Rebholz-Schuhmann D. Integrating protein-protein interactions and text mining for protein function prediction. BMC Bioinformatics 2008; 9(Suppl 8): S2. [Epub ahead of print]
[26] Yang H, Nenadic G, Keane JA. Identification of transcription factor contexts in literature using machine learning approaches. BMC Bioinformatics 2008; 9(Suppl 3): S11.
[27] Gaudan S, Jimeno Yepes A, Lee V, Rebholz-Schuhmann D. Combining evidence, specificity, and proximity towards the normalization of gene ontology terms in text. EURASIP J Bioinform Syst Biol 2008; 342746.
[28] Cakmak A, Ozsoyoglu G. Discovering gene annotations in biomedical text databases. BMC Bioinformatics 2008; 9: 143.
[29] Groth P, Weiss B, Pohlenz HD, Leser U. Mining phenotypes for gene function prediction. BMC Bioinformatics 2008; 9: 136.
[30] Natarajan J, Ganapathy J. Functional gene clustering via gene annotation sentences, MeSH and GO keywords from biomedical literature. Bioinformation 2007; 2(5): 185-193.
[31] Daraselia N, Yuryev A, Egorov S, Mazo I, Ispolatov I. Automatic extraction of gene ontology annotation and its correlation with clusters in protein networks. BMC Bioinformatics 2007; 8: 243.
[32] Theodosiou T, Angelis L, Vakali A. Non-linear correlation of content and metadata information extracted from biomedical article datasets. J Biomed Inform 2008; 41(1): 202-216.
[33] Spasic I, Ananiadou S, McNaught J, Kumar A. Text mining and ontologies in biomedicine: Making sense of raw text. Brief Bioinform 2005; 6(3): 239-251.
[34] Matsuzawa, H., Nagano, T., Itoh, T., Yamaguchi, Y.: US20070185904 (2007).
[35] Baral, C.: WO2007139999 (2007).
[36] Pestian, J.: WO2008085857 (2008).
[37] Crestani F. Application of spreading activation techniques in information retrieval. Artif Intell Rev 1997; 11(6): 453-482.
[38] Salton G. Automatic Information Organization and Retrieval. New York: McGraw Hill; 1968.
[39] Doyle LB. Information retrieval and processing. Los Angeles: Melville Publishing; 1975.
[40] Angell, R.L., Boyer, S.K., Cooper, J.W., Hennessy, R.A., Kanungo, T., Kreulen, J.T., Martin, D.C., Rhodes, J.J., Spangler, S.W., Weintraub, H.J.R.: US20070112833 (2007).
[41] Raghavan, V.V., Xie, Y., Prestigiacomo, A.: WO2007106858 (2007).
[42] Hubbard TJP, et al. Ensembl 2007. Nucleic Acids Res 2007; 35(Database issue): D610-D617.
[43] Cimino JJ, Zhu X. The practical impact of ontologies on biomedical informatics. Yearb Med Inform 2006: 124-135.
[44] Rubin DL, Shah NH, Noy NF. Biomedical ontologies: A functional perspective. Brief Bioinform 2008; 9(1): 75-90.
[45] Rigney, D.R.: US20087400981 (2008).