New challenges for biological text-mining in the next decade
Hong-Jie Dai¹,², Yen-Ching Chang¹, Richard Tzong-Han Tsai³ and Wen-Lian Hsu¹,²
1 Institute of Information Science, Academia Sinica, 115, Taiwan, R.O.C.
2 Department of Computer Science, National Tsing-Hua University, 300, Taiwan, R.O.C.
3 Department of Computer Science and Engineering, Yuan Ze University, 320, Taiwan, R.O.C.
E-mail:
[email protected]; ro3789@ iis.sinica.edu.tw;
[email protected];
[email protected]
Abstract
The massive flow of scholarly publications from traditional paper journals to online outlets has benefited biologists because of the ease of access it provides. However, due to the sheer volume of available biological literature, researchers are finding it increasingly difficult to locate needed information. As a result, recent biology contests, notably JNLPBA and BioCreAtIvE, have focused on evaluating various methods by which the literature may be navigated. Among these methods, text-mining technology has shown the most promise. With recent advances in text-mining technology and the fact that publishers are now making the full texts of articles available in XML format, text-mining systems (TMSs) can be adapted to accelerate literature curation, maintain the integrity of information, and ensure proper linkage of data to other resources. Even so, several new challenges have emerged in relation to full text analysis, life-science terminology, complex relation extraction, and information fusion. These challenges must be overcome in order for text-mining to be more effective. In this paper, we identify the challenges, discuss how they might be overcome, and consider the resources that may be helpful in achieving that goal.

Keywords
Bioinformatics databases, Mining methods and algorithms, Text mining

1 Introduction

Life-science journal publishing has undergone a digital revolution in the last decade. The massive flow of scholarly publications from traditional paper journals to online outlets has benefited biologists through ease of access, but has also left these scholars adrift in the deluge of biological literature that has been made available. Recent biology contests, such as JNLPBA [1] and BioCreAtIvE [2, 3], have evaluated ways in which the literature may be navigated. Among the methods evaluated, text mining has shown the most promise because it makes biological literature more accessible, and therefore more useful [4, 5].
Text mining involves analyzing a large collection of documents in a manner that reveals specific information, such as the relationships and patterns buried in the collection, which is normally imperceptible to readers. A key text mining task involves linking extracted information to form new facts or new hypotheses that can be explored further by more conventional means of experimentation [6].

In the biomedical domain, several tools [7-9], competitions [10-12] and projects [13, 14] have started to incorporate text mining technology. However, text mining is difficult to implement in many cases because the vital components of scientific communication (journals and databases) are designed to be read by people, not computers. Computers cannot extract information efficiently from unstructured text, which is the format adopted by most journals and databases. Fortunately, some publishers, e.g., the Public Library of Science (PLoS) and BioMed Central, have sought to address this problem by making the full texts of their publications available as downloadable XML files that can be processed easily by computer programs. The FEBS Letters journal is currently experimenting with embedding text-mining systems (TMSs) in the manuscript submission process to construct structured digital abstracts semi-automatically [15]; these are machine-readable XML summaries of pertinent facts in the published articles.

With the recent advances in text-mining technology and the fact that some publishers are now making the full texts of articles available in XML format, TMSs can be applied to full texts rather than just abstracts, and can be used to accelerate literature curation, maintain the integrity of information, and ensure proper linkage of data to relevant resources. However, several new challenges have emerged in relation to full text analysis, life-science terminology, complex relation extraction, and information fusion. First, terms, such as gene names and corresponding database identifiers, are so numerous and varied that even specialists have difficulty understanding them and keeping track of updates and revisions. While state-of-the-art normalization systems developed for BioCreAtIvE II [16] can normalize gene identifiers for humans relatively well, such systems have yet to be developed for inter-species normalization. Second, full-text analysis requires the use of more sophisticated natural language processing (NLP) techniques than current biological information retrieval and extraction tools can handle [17, 18]. For example, in full text analysis, TMSs must extract cross-sentence relations, while most current TMSs can only extract relations within sentences. Third, current TMSs are unable to merge information from disparate sources with different contextual and typographical representations. However, associating works in the literature via pathway information is essential. In the next section, we discuss the above-mentioned challenges in more detail.

Regular Paper (J. Comp. Sci. & Tech.)
This work was supported by the National Science Council under grants NSC 97-2218-E-155-001 and NSC96-2752-E-001-001-PAE, the Research Center for Humanities and Social Sciences, and the Thematic Program of Academia Sinica under grant AS95ASIA02.

2 Full Text Processing
Most biological natural language processing systems have only been applied to abstracts because of the latter's availability and abridged nature. Abstracts are good targets for information extraction (IE) as they summarize the content of articles. However, the full texts of papers contain more information, relevant or not, which should be treated carefully [19].

The preliminary results of applying current state-of-the-art TMSs to full texts showed a promising F-score¹ of 28.85% [22] in the BioCreAtIvE (critical assessment for information extraction in biology) protein-protein interaction (PPI) annotation extraction task [20]. However, they also revealed several issues of concern:
1. Errors resulting from converting PDF or HTML formatted documents to plain text
2. Difficulties in processing tables and figure legends
3. Multiple references to organisms and the resulting inter-species ambiguity in gene/protein normalization
4. Sentence boundary detection errors
5. Difficulties in extracting the associations and handling the coordination of multiple interaction pairs in single sentences
6. Phrases used to describe interactions in legends or titles that do not correspond to grammatically correct sentences in the text
7. Errors in shallow parsing and POS-tagging tools trained on general English text collections when applied to specific expressions and abbreviations found in biomedical texts

The open text mining interface, a project directed by the Nature Publishing Group, helps solve the data format conversion errors and difficulties mentioned in issues 1 and 2 because it provides open access to full text documents published in XML format. Such full-text versions of scientific literature are machine-readable, but many other aspects need to be improved further, as we explain in the following subsections.

1 F-score is the weighted harmonic mean of precision and recall.

2.1 Named Entity Identification in Full Text

2.1.1 Named Entity Recognition

The fundamental task of recognizing biological terms, such as gene and protein names, is the first step towards making full use of the information encoded in biomedical texts. The named entity recognition (NER) task in the biomedical domain has different characteristics from that in the newswire domain, such as the MUC-7 NER task [21]. The unique difficulties of biomedical NER are as follows. First, the number of new gene names is growing continually, and it is hard to recognize all of them because there is so much inconsistency among them [22]. Second, authors do not use standardized names; they prefer to use abbreviations or other forms depending on personal inclination [23]. Because of their limited length, abbreviations/acronyms are often identical to the respective genes' symbols and thus increase the ambiguity of the nomenclature [24]. For instance, 80% of the abbreviations listed in the UMLS have ambiguous versions in MEDLINE [25]. Third, gene names are similar to, or co-occur with, terminology other than gene/protein names, such as the names of cells, tissues or organs [26]. For example, C1R is a cell line, but it is also a gene (SwissProt P00736). TMSs must be able to distinguish between different genes with identical names as well as to determine whether certain gene names refer to completely different biological entities like viruses. For compound names, it is
also necessary to determine where the name begins and ends within a sentence. The task can be particularly difficult when verbs and adjectives are embedded in names [27].

A large number of machine learning algorithms have been developed to deal with the NER problem; for example, the hidden Markov model [28], the support vector machine model [29], the maximum entropy Markov model [30] and the conditional random field model [31]. To capture the diverse characteristics of biomedical entities, several feature sets, including lexicons, orthographic/affix information, and even external resources like the WWW, have been incorporated into different algorithms. It is conceivable that the recognition results derived by these algorithms will be diverse but complementary to each other. One natural idea for improving the performance of biomedical NER is to combine the results of several algorithms. The results of the BioCreAtIvE II gene mention task [22] and those reported by Si et al. [32] show that it is possible to achieve higher recognition accuracy by combining the results of multiple NER algorithms.

False positive gene/protein names found in the full texts of articles pose great challenges for TMSs in such basic tasks as identifying gene and protein names in biomedical texts. Broadening the range of entities beyond genes/proteins to include entities like chemicals and diseases [33, 34] can resolve the problem. Identifying these entities also allows us to consider biologically relevant relations, such as which entities they are derived from, where they are located, which have agency in which processes, or which participate in what processes.

In addition, it may also be possible to use algorithms that can identify acronyms/abbreviations to extract acronyms from text automatically without checking whether they overlap with the gene nomenclature. Although several algorithms have been proposed for this purpose [35, 36], only a few can extract acronyms and disambiguate gene names [37]. We hope that integrating these tools will improve the NER performance.

2.1.2 Inter-species Normalization

Gene normalization (GN) determines the unique identifiers of genes and proteins mentioned in the literature. The concept was inspired by a step in a typical curation pipeline for model organism databases. After an article has been selected for curation, curators list the genes or proteins of interest in this article [16]. Although the concept of GN was inspired by curation, the BioCreAtIvE I/II computer-aided GN task [16, 38] oversimplified curation and performed GN by normalizing genes in abstracts rather than in the full text. In practice, human curators normally work on the full texts and only identify particular kinds of genes of interest. Cohen et al. [39] proposed a computer-aided GN system that, given a document, provides a ranked list of genes that are discussed in the document. The BioCreAtIvE II.5 competition of 2009 [40] included a similar ranking task. Such a ranked list could be used as an aid by human curators.

Computer-aided GN presents several difficult problems that need to be solved in order to reduce the workload of human curators. First, gene and protein names often have several spelling variations or abbreviations. Second, gene products are
often described indirectly via phrases, such as "light chain-3 of microtubule-associated proteins 1A and 1B," instead of by specific names or codes. A number of approaches [41, 42] have been proposed to address these problems in the BioCreAtIvE I/II GN task. The evaluation results provide some insight into how these problems affect our capacity to normalize the genes mentioned in biological abstracts, but GN is not yet practical. The BioCreAtIvE I/II GN task involved normalizing various abstracts and demonstrating how much TMSs' success varied according to the organism discussed in the abstracts. The results showed that the performances were satisfactory for normalizing abstracts that mentioned the genes and proteins of humans (F-score 0.81), mice (F-score 0.79), yeast (F-score 0.92) and flies (F-score 0.82), respectively. However, the task did not address the important issue of inter-species GN, which exists in many published articles.

HemK2 protein, encoded on human chromosome 21, methylates translation termination factor eRF1.
Abstract
The ubiquitous tripeptide Gly-Gly-Gln in class 1 polypeptide release factors triggers polypeptide release on ribosomes. The Gln residue in both bacterial and yeast release factors is N5-methylated, despite their distinct evolutionary origin. Methylation of eRF1 in yeast is performed by the heterodimeric methyltransferase (MTase) Mtq2p/Trm112p, and requires eRF3 and GTP. Homologues of yeast Mtq2p and Trm112p are found in man, annotated as an N6-DNA-methyltransferase and of unknown function. Here we show that the human proteins methylate human and yeast eRF1.eRF3.GTP in vitro, and that the MTase catalytic subunit can complement the growth defect of yeast strains deleted for mtq2.
[PMID: 18539146]
Fig.1. The abstract (PMID 18539146) in PubMed

The extract in Figure 1, taken from an abstract in PubMed, exemplifies the challenges posed by many articles when using computer-aided GN. One name, abbreviation or code may refer to genes in multiple species, each with its own unique ID, or even to multiple genes in the same species or across different species. For example, the abstract (PMID: 18539146) in Figure 1 discusses the methylation of the gene "eRF1". In the UniProt database, the gene's name is listed as a synonym of multiple genes, such as ZFP36L1 (SwissProt Q07352) and ETF1 (SwissProt P62495), even though their functions are different. Moreover, both ZFP36L1 and ETF1 refer to multiple species, namely humans, mice, and rats. We also observe that "eRF1" appears in the title as a human gene and in the third sentence of the abstract as a yeast gene. Finally, the complex "eRF1.eRF3.GTP" in the last sentence is a protein complex and should not be associated with any database identifiers. These few sentences illustrate how much more GN needs to be improved before it can be used in practice.
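
To make the eRF1 ambiguity concrete, the following minimal Python sketch looks up a gene mention in a toy synonym table and narrows the candidates by species. Only the two SwissProt accessions quoted in the text are real. Treating them as the human entries is an assumption of this sketch, the mouse identifiers are placeholders, and the names `ENTRIES`, `SYNONYMS` and `candidates` are hypothetical rather than part of any system discussed here.

```python
# Toy synonym table: (official gene symbol, species, identifier).
# Q07352 and P62495 are the accessions quoted in the text; assigning them to
# "human" is an assumption of this sketch. The mouse identifiers are
# placeholders, not real accessions.
ENTRIES = [
    ("ZFP36L1", "human", "Q07352"),
    ("ZFP36L1", "mouse", "PLACEHOLDER_ZFP36L1_MOUSE"),
    ("ETF1", "human", "P62495"),
    ("ETF1", "mouse", "PLACEHOLDER_ETF1_MOUSE"),
]

# Synonym -> official symbols that list it (per the UniProt example above).
SYNONYMS = {"eRF1": ["ZFP36L1", "ETF1"]}


def candidates(mention, species=None):
    """Return identifiers a mention could map to, optionally filtered by species."""
    symbols = set(SYNONYMS.get(mention, [mention]))
    return [
        ident
        for symbol, sp, ident in ENTRIES
        if symbol in symbols and (species is None or sp == species)
    ]


all_ids = candidates("eRF1")
human_ids = candidates("eRF1", species="human")
print(len(all_ids), "candidates without species;", len(human_ids), "for human")
```

Even when the species is known, more than one candidate remains in this toy table, which suggests that species detection alone would not settle the mention and that further context would be needed.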
2.2 Relation and Fact Extraction from Full Texts

TMSs, like human curators, should work on full text articles. The information provided in the headings, figure legends, and tables of full text articles helps TMSs extract relations and facts, and may help users discover implicit associations between genes and diseases in the future [18]. Seki and Javed [43] conducted a small preliminary experiment and reported that using the full text articles, rather than just their abstracts, to extract gene-disease relations greatly improved the ability of their text-mining system to discover facts and relations. In addition, Cooper and Kershenbaum [44] conducted a detailed study of 65 abstracts and found that some PPIs were only reported in the full texts of the respective papers. The abstracts of some papers did not contain any protein names. Hence, TMSs should analyze the full text articles, not just their abstracts. In the following sections, we consider the challenges that must be addressed.

2.2.1 Relevant versus Irrelevant Information

TMSs need to distinguish between relevant and irrelevant information, but different criteria may have to be applied depending on which section of an article the text-mining system is analyzing or mining. Shah et al. [45] demonstrated that there are substantial differences in the content of different sections of a publication. For example, specific terms, like the names of certain genes, may be mentioned in the titles of articles in a paper's bibliography, but TMSs should disregard such terms. To identify useful terms, TMSs should compare terms mentioned in papers' abstracts, which usually contain a high density of relevant terms (keywords), to terms appearing throughout the full texts of the respective papers.

Moreover, TMSs should also be able to associate useful pieces of information in the legends of figures and tables with the text of the article, but this task is quite challenging. One reason is that figures and images often have multiple sub-figures, so TMSs must be able to identify the sub-figures and match each one with the appropriate sentences or references in the text. Although this task may seem difficult, TMSs that have such a capacity might discover more useful relations or facts than those normally extracted. Some researchers have been successful in combining text-mining and image recognition techniques [46, 47]; however, there is a need for much greater collaboration between researchers in the two fields before TMSs can perform image recognition and mine related text easily.

2.2.2 Relation Extraction

In the biomedical field, researchers are interested in PPIs, gene-gene interactions and protein-disease interactions. The major goal of relation extraction is to discover the relations embedded within sentences, paragraphs, or entire documents. Currently, the most popular relation extraction approaches include rule-based [48, 49], kernel-based [50, 51], and co-occurrence-based [52, 53] methods. Most works focus on identifying the relations between proteins [53-55]. Craven and Kumlien [56] identified the relations between proteins and sub-cellular locations, while Rindflesch et al. [57] extracted the relations between cancer-related genes, drugs and cell lines. Less work has been done on extracting the relations between genes and diseases [58, 59], but the area is now attracting more research efforts.

Among existing methods, employing parsers to analyze syntactic and semantic structures is useful. Yusuke et al. [60] performed a comparative evaluation of state-of-the-art syntactic parsing methods, including dependency parsing, phrase structure parsing and deep parsing, and their contribution to PPI extraction. The study provides researchers with a good reference for choosing appropriate parsers for their work. However, there is no guarantee that the results reported by Yusuke et al. can be generalized to other datasets and tasks.

The results of the BioCreAtIvE II PPI task [20] demonstrate that current TMSs can detect binary relations in abstracts reasonably well [49, 61], but they are not always as effective in extracting significant relations from full-text articles. There are four reasons for this phenomenon.

First, biomedical terms, such as gene names, may have different meanings in full texts depending on the context or the section in which they appear. The same gene name within one section may refer to different species (consider the example shown in Figure 1). Second, the frequent use of synonyms, abbreviations, and acronyms in biomedical texts hinders semantic analysis. For instance, extracting facts from the Results section may require resolving acronyms or synonyms only mentioned in the Introduction. Third, biomedical texts usually contain several compound nouns as well as noun phrases linked by prepositions. Fourth, TMSs have difficulty when one or more proteins involved in an interaction are expressed by more than one sentence, or when they are expressed using anaphora, as shown in the following example:

Human growth hormone (hGH) binds to its receptor (hGHr) in a three-body interaction: one molecule of it and two identical monomers of the receptor form a trimer.

Many papers have addressed relation extraction, summarization, and evaluation issues, but few have focused on co-reference (anaphora) resolution [62], possibly because there are few publicly available datasets for system building and evaluation. Despite the substantial amount of annotation work carried out on co-referencing in molecular biology, few biomedical corpora with co-reference annotations are currently available [63]. Recently, the GENIA corpus was annotated with co-references. Nguyen et al. [64] conducted a pioneering study of the differences between newswire and biomedical co-reference annotated corpora. We look forward to the integration of more sophisticated NLP techniques in this respect.

3 The Future of Text-mining Applications

3.1 User-focused Applications

Text-mining researchers are typically good at analyzing textual content, but they are not as good at building interactive systems that users can adopt easily [33]. To resolve the problem, researchers must design applications with intuitive interfaces that require little or no knowledge of text-mining and NLP technology. The objective is to provide bioinformatics, biological, biomedical, and pharmacological researchers with a high-level view of biological interactions and help them form new hypotheses. The PubMed-EX browser extension [65], shown in Figure 2, is an example of such an effort.

Fig.2. A PubMed abstract annotated with text-mining results by PubMed-EX
PubMed-EX annotates PubMed search results on-site with additional text-mining information, without requiring any extra effort from users, such as learning how to input a specific query. Currently, its processing speed is quite slow, but it does hide the complicated text-mining technology on which it is based. Text-mining researchers should strike a compromise between the accuracy of text-mining results and the overall processing speed. Obviously, full text analysis requires more computational capacity and time than the analysis of abstracts. Users may accept a processing time of 10 minutes per article for off-line processes, such as database curation, but they may not be as patient when it comes to on-line services that provide semantic annotations or relation extraction. Therefore, providing on-the-fly full text processing, while maintaining a satisfactory accuracy level, remains a challenge for text-mining researchers.

Certain types of users, such as content providers and corpus annotators, require interfaces that allow them to change annotations, dredge for information, link resources, and create new information resources to capture new concepts [33]. The research community requires more collaborative annotation and up-to-date knowledge in biological databases, but it does not have the tools that make these procedures easy to implement. We discuss this issue in the next section.

3.2 Integration, Communication and Collaboration

Bioinformatics researchers often need to consult numerous databases and web servers, but many find integrating heterogeneous datasets from disparate databases associated with multiple web servers a daunting task [66]. To integrate biological data from multiple heterogeneous databases, researchers have adopted two major approaches: centralization [67] and decentralization [68]. However, the integration efforts have been piecemeal and have only considered a fraction of bioinformatics data, so complex queries remain challenging. Integrating data from multiple databases and analyzing it via TMSs is difficult. Zhang et al. [66] proposed a Web 2.0 [69] based model that represents a shift in focus from working locally to working in networked settings. Under this new approach, the Web is seen as a social, collaborative, and collective space. The model provides a vision of the future, where annotation will be performed collaboratively and innovative web tools will support such collaboration. Further development of tools like WikiProtein [70] and CBioC [71], which support collaborative annotation, is essential.

3.3 Information Fusion

With the advent of advanced TMSs, researchers may be able to integrate mined information and thereby gain more insight into biological literature. The most critical biological reactions are recorded in "pathways," which include a myriad of cellular or disease events with multiple protein-protein relationships that tend to influence each other directly. However, for a number of reasons, TMSs have trouble fusing mined information to reveal pathways.

First, mapping named entities to nodes in pathways requires highly context dependent properties. Named entities (NEs) may have different meanings in the same context. For example, an NE may be located in the nucleus, in the cytoplasm, or on the cell membrane. It may also refer to a cellular function, in which case it might be phosphorylated or acetylated. Thus, two consecutive sentences may mention a named entity, but the named entity may actually refer to two totally different events.
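
The context dependence of named entities in pathways can be illustrated with a small Python sketch in which a pathway node is keyed not only by the entity name but also by its location and modification state. The class `EntityState`, its fields, and the example entity "ProteinX" are hypothetical and chosen only for illustration; this is not a representation proposed in this paper or used by any pathway database.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EntityState:
    """A named entity plus the context needed to select a pathway node."""
    name: str
    location: Optional[str] = None      # e.g. "nucleus", "cytoplasm", "membrane"
    modification: Optional[str] = None  # e.g. "phosphorylated", "acetylated"


# Two consecutive sentences may mention the same name in different states.
mention_1 = EntityState("ProteinX", location="cytoplasm")
mention_2 = EntityState("ProteinX", location="nucleus", modification="phosphorylated")

# Matching on the name alone merges both mentions into a single node ...
nodes_by_name = {m.name for m in (mention_1, mention_2)}
# ... whereas matching on the full state keeps them as distinct nodes.
nodes_by_state = {mention_1, mention_2}

print(len(nodes_by_name), len(nodes_by_state))
```

Under these assumptions, name-only matching would collapse two different events onto one node, which is the kind of confusion the paragraph above describes.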
Currently, biologists use their domain knowledge to infer information that text mining cannot predict accurately. Oda et al. [72] categorized six
hope that more resources will be used to accelerate progress in the field 4.1
Evaluating text mining via task-based
inference characters, namely, the state of an entity before or after reaction, the function of an entity
challenges
before or after a reaction, the influence of state or
Evaluation via task-based challenges is es-
functional changes of an entity, related reactions,
sential to the biology community [74]. To date,
reverse reactions, and characteristics of reactions.
several biological tasks, including document re-
If annotated corpora incorporated these features,
trieval, NER, and relation extraction, have been
TMSs would be able to infer information with lit-
evaluated. We list the major challenges below:
tle human help [72].
Experimental data even confuse biologists
ticipants to identify papers to be curated for
sometimes. Open databases of pathway references, such as BioCarta2, STKE3, and KEGG [73], ena-
The KDD Cup 2002 task 14 [75] asked par-
Drosophila gene expression.
The TREC Genomics Track5 [76], one of the
ble biologists to predict the next steps of protein
largest and longest-running challenge evalua-
pathways. However, subtle factors cause the re-
tions in biomedicine (from 2003 to 2007),
sults of many experiments to deviate from what is
evaluates systems for information retrieval.
considered consistent for proven pathways. Incon-
The Genic Interaction Extraction6 (GIE) chal-
sistencies do not necessarily mean that the proven
lenge [77], a part of the Learning Language in
pathways are wrong, but they may indicate me-
Logic workshop, evaluates the ability of par-
chanisms or parts of pathways that were previous-
ticipating TMSs to identify protein/gene inte-
ly unobserved. Therefore, pathway prediction
ractions from biological abstracts.
should be independent of experiments. In the fu-
BioCreAtIvE7 [2] is a community-wide effort
ture, TMSs may be able solve many of the argu-
that promotes the development and evaluation
ments or discrepancies that occur in research today
of text-mining and IE systems applied in the
because of their ability to map large amounts of
biological domain. The most recent challenge
data quickly.
(BioCreAtIvE II.5) in March 2009, which also
4
involved the publisher Elsevier/FEBS Letters
Text-mining Resources Text-mining
main-specific
resources,
thesauri,
such
lexicons,
as
and the MINT database, evaluated real-time
do-
text-mining capabilities on full text articles.
terminology
standards, ontologies, and additional evaluations by task-based challenges are very important. We summarize them in the following sections. It is our
3 4 5 6
2
http://cgap.nci.nih.gov/Pathways/BioCarta_Pathways
7
http://stke.sciencemag.org/ http://www.biostat.wisc.edu/~craven/kddcup/ http://ir.ohsu.edu/genomics/ http://genome.jouy.inra.fr/texte/LLLchallenge/ http://www.biocreative.org/
10
J. Comp. Sci. & Tech.
The BioNLP shared task8 is concerned with the recognition of bio-molecular named entities [1] and events [78] that appear in biomedical literature. The 2009 task used a dataset based on the GENIA event corpus [79]. In contrast to BioCreAtIvE II.5, which aims to
lated to molecular interaction and published between 1996 and 2001. A disease corpus14 provided by Jimeno et al. [34] could serve as a benchmark for other disease NER systems. 4.2.2 Relation Extraction Corpora
support the curation of PPI databases, the BioNLP task concerns to support the development of more detailed and structured databases,
e.g., pathway databases [80], and the Gene Ontology Annotation databases [81].
4.2
Text-Mining Corpora
4.2.1 Named Entity Identification Corpora
The GENIA corpus9 [82] contains 2000 abstracts taken from the MEDLINE database and annotated with various levels of linguistic and semantic information. Biological named entities were annotated according to the taxonomy defined in GENIA ontology. Currently, there are 47 biological named entity categories. GENETAG10 [83] is a corpus of 20,000 sentences taken from MEDLINE abstracts annotated with gene/protein names. The dataset of the JNLPBA Bio-NER task11 is annotated with five types of named entities: protein, DNA, RNA, cell line and cell type. The training and test sets of BioCreAtIvE I/II gene mention and normalization tasks12 provide an evaluation standard for the two problems. The Yapex Corpus13 is annotated with protein names mentioned in MEDLINE abstracts re-
14 8
http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/SharedTask/ 9 http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/ 10 ftp://ftp.ncbi.nlm.nih.gov/pub/tanabe/GENETAG.tar.gz 11 http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/ERtask/report.html 12 http://sourceforge.net/projects/biocreative/files/ 13 http://www.sics.se/humle/projects/prothalt/#data - 10 -
15 16 17 18 19 20
The GENIA event corpus [79] is based on the GENIA corpus and is annotated with events mentioned in biomedical abstracts. Binarized BioInfer15 [84] is a corpus annotated with the binary relations between proteins in abstracts. AIMed16 [54] is a corpus constructed by using the query word "human" to obtain abstracts from MEDLINE. In total, 1955 sentences were extracted and annotated with gene/protein names and PPIs. EDGAR17 [57] contains annotations for the interactions of drugs, genes, and cells. The FetchProt18 corpus comprises 190 full text articles, of which 140 describe experimental evidence for tyrosine kinase activity in at least one protein. Its annotation includes specific experiments and results, the proteins involved in the experiments, and related information. The BioText project19 provides two corpora for relation extraction: (1) PPI data [55], which annotate the interaction types between proteins in full texts; and (2) a corpus containing abstracts randomly selected from MEDLINE 2001 for the evaluation of mining disease-treatment relations [85]. The IEPA corpus20 [86] contains 303 PubMed abstracts with annotations for PPIs for each sentence. The Craven group’s IE data sets21 [56] were compiled from MEDLINE abstracts. There are three datasets, which are labeled, respectively, with instances of the following binary relations: (1) subcellular-localization gathered from the Yeast Proteome Database (YPD); (2) disease-association gathered from the Online Mendelian Inheritance in Man database (OMIM); and (3) PPIs from the MIPS Comprehensive Yeast Genome Database (CYGD). The BioCreAtIvE-PPI dataset and DIPPPI corpus22 were derived from the dataset of BioCreAtIvE I task 1A and the Database of Interacting Proteins (DIP), respectively. The BioCreAtIvE-PPI corpus contains 1,000 sentences annotated with PPI information. The PPIs annotated in the DIPPPI corpus are restricted to proteins from yeast. The goal is to find evidence of relations in the text of a paper. Whenever possible, full texts are included in the corpus as well as abstracts. The training and test datasets for the GIE challenge23 [87] contain annotations for gene interactions. Each dataset is decomposed into two subsets. The first subset does not include co/cross-references or ellipsis, but the second subset contains both features.

4.2.3 Part-of-speech, syntactic and semantic annotations

PASBio24 [88] and BioProp25 [89] contain predicate-argument structures (PAS) for event extraction in molecular biology.
The PennBioIE [90] CYP corpus26 contains 1100 PubMed abstracts on the inhibition of cytochrome P450 enzymes. It is annotated with paragraph, sentence boundary, and part-of-speech (POS) information. In addition, 324 of the abstracts are syntactically annotated. Another PennBioIE corpus, the PennBioIE Oncology corpus27, contains similar annotations, but its abstracts relate to cancer, concentrating on molecular genetics.
The GENIA corpus contains annotations for parts-of-speech (POS) [91] and a treebank [92].
The Brown-GENIA Treebank28 [93] contains the syntactic structures of 21 abstracts (215 sentences) taken from the GENIA corpus. There is no overlap with the GENIA treebank (beta version, 500 abstracts).
MedPost [94] is a corpus29 containing 5,700 sentences selected randomly from various thematic subsets of MEDLINE and annotated with POS information.
The PDG Bio-splitter corpus30 contains a small collection of text datasets compiled from PubMed abstracts to develop sentence splitting tools.
The BioText project provides a corpus annotated with the definitions of abbreviations [36] taken from 1,000 randomly selected abstracts by querying MEDLINE with the term "yeast".

14 ftp://ftp.ebi.ac.uk/pub/software/textmining/corpora/diseases
15 http://mars.cs.utu.fi/BioInfer/
16 ftp://ftp.cs.utexas.edu/pub/mooney/bio-data/
17 ftp://ftp.ncbi.nlm.nih.gov/pub/tanabe/EDGAR_GS.txt
18 http://fetchprot.sics.se/#corpus
19 http://biotext.berkeley.edu/data.html
20 http://class.ee.iastate.edu/berleant/s/IEPA.htm
21 http://www.biostat.wisc.edu/~craven/ie/
22 http://www2.informatik.hu-berlin.de/~hakenber/corpora/
23 http://genome.jouy.inra.fr/texte/LLLchallenge/#task1
24 http://research.nii.ac.jp/~collier/projects/PASBio/
25 http://bws.iis.sinica.edu.tw/BioProp/
26 http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T20
27 http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T21
28 http://bllip.cs.brown.edu/resources.shtml#corpora
29 ftp://ftp.ncbi.nlm.nih.gov/pub/lsmith/MedPost/medpost.tar.gz
30 http://www.pdg.cnb.uam.es/martink/LINKS/biosplitter_corpus.htm

4.2.4 Full text corpora
J. Comp. Sci. & Tech.
BioMed Central’s open access full-text corpus31 has released 55003 full text articles to date, including structured XML versions, covered by open access license agreements.
The PPI corpus of BioCreAtIvE II and II.5 [95].
The FlySlip32 corpus [96] is the first corpus of biomedical full-text articles to be annotated with anaphora information.
The molecular interaction maps corpus33 [97] contains passages from full-text articles that describe interactions summarized in a molecular interaction map [98].

31 http://www.biomedcentral.com/info/about/datamining/
32 http://www.wiki.cl.cam.ac.uk/rowiki/NaturalLanguage/FlySlip
33 http://www.it.usyd.edu.au/~tara/mim_corpus/

5 Conclusion

We have considered important research issues related to the exploitation of text mining in the biomedical field, and drawn the following conclusions.

1. The availability of full texts is clearly very important because abstracts usually lack sufficient relevant information. Techniques for mining information from full biomedical texts need to be improved substantially.

2. Text mining has the potential to be used in different applications and to fuse knowledge in the literature and biological databases. However, to realize text mining’s full potential, new methods are needed, such as methods for acronym and co-reference resolution, and the integration of various data sources. If highly complex texts and bio-inference sentences can be processed efficiently and accurately, information fusion would enable biologists to exploit knowledge more effectively.

3. Although text-mining technologies are now quite mature, there are still some important unresolved problems in the field. Fortunately, biomedical text mining is an extremely active research area, and the outlook for continued progress is encouraging. We can foresee that the texts of articles will be systematically mined by computer programs, allowing the interrelation of journal texts and the vast repository of knowledge to be stored semi-automatically in databases. It is expected that text mining tools will be used by every biologist in the future.

Acknowledgements

This research was supported in part by the National Science Council under grants NSC 97-2218-E-155-001 and NSC 96-2752-E-001-001, and by the thematic program of Academia Sinica under grant AS95ASIA02.

References

[1] J.-D. Kim, et al., "Introduction to the bio-entity recognition task at JNLPBA," Proceedings of the International Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA-04), pp. 70-75, 2004.
[2] L. Hirschman, et al., "Overview of BioCreAtIvE: critical assessment of information extraction for biology," BMC Bioinformatics, vol. 6, 2005.
[3] M. Krallinger, et al., "Evaluation of text-mining systems for biology: overview of the Second BioCreative community challenge," Genome Biology, vol. 9, p. S1, 2008.
[4] A. Morgan, et al., "Overview of BioCreative II gene normalization," Genome Biology, vol. 9, p. S3, 2008.
[5] "BioCreAtIvE II.5 (http://www.biocreative.org/events/biocreative-ii5/biocreative-ii5/)."
[6] "The Elsevier Grand Challenge (http://www.elseviergrandchallenge.com/)."
[7] "Elsevier Article 2.0 Contest (http://article20.elsevier.com/contest/home.html)."
[8] "RSC Project Prospect (http://www.projectprospect.org/)."
[9] M. Hearst. (2003). What is text mining. Available: http://people.ischool.berkeley.edu/~hearst/text-mining.html
[10] H.-J. Dai, et al., "BIOSMILE web search: a web application for annotating biomedical entities and relations," Nucl. Acids Res., vol. 36, pp. W390-W398, 2008.
[11] D. Rebholz-Schuhmann, et al., "Text processing through Web services: calling Whatizit," Bioinformatics, vol. 24, p. 296, 2008.
[12] J. M. Fernández, et al., "iHOP web services," Nucl. Acids Res., vol. 35, pp. W21-W26, 2007.
[13] S. Ananiadou, et al., "The National Centre for Text Mining: Aims and Objectives," UKKDD'5, 2007.
[14] M. A. Hearst, "Untangling text data mining," presented at the Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, College Park, Maryland, 1999.
[15] U. Hahn, et al., "Text mining: powering the database revolution," Nature, vol. 448, p. 130, 2007.
[16] M. Seringhaus and M. Gerstein, "Manually structured digital abstracts: A scaffold for automatic text mining," FEBS Letters, vol. 582, p. 1170, 2008.
[17] G. Gonzalez, et al., "Mining gene-disease relationships from biomedical literature: weighting protein-protein interactions and connectivity measures," Proceedings of the Pacific Symposium on Biocomputing, 2007.
[18] R. T.-H. Tsai, et al., "HypertenGene: Extracting key hypertension genes from biomedical literature with position and automatically-generated template features," in 8th InCoB - Seventh International Conference on Bioinformatics 2009, 2009.
[19] A. M. Cohen and W. R. Hersh, "A survey of current work in biomedical text mining," Briefings in Bioinformatics, vol. 6, pp. 57-71, 2005.
[20] M. Krallinger, et al., "Overview of the protein-protein interaction annotation extraction task of BioCreative II," Genome Biology, vol. 9, p. S4, 2008.
[21] N. Chinchor, "MUC-7 named entity task definition," in Proceedings of the 7th Message Understanding Conference, 1997.
[22] L. Smith, et al., "Overview of BioCreative II gene mention recognition," Genome Biology, vol. 9, p. S2, 2008.
[23] U. Leser and J. Hakenberg, "What makes a gene name? Named entity recognition in the biomedical literature," Briefings in Bioinformatics, vol. 6, p. 357, 2005.
[24] R. A. A. Erhardt, et al., "Status of text-mining techniques applied to biomedical text," Drug Discovery Today, vol. 11, pp. 315-325, 2006.
[25] H. Liu, et al., "A study of abbreviations in MEDLINE abstracts," in AMIA Annual Symposium, 2002, p. 464.
[26] L. Tanabe and W. J. Wilbur, "Tagging gene and protein names in full text articles," presented at the Proceedings of the ACL-02 workshop on Natural language processing in the biomedical domain - Volume 3, Philadelphia, Pennsylvania, 2002.
[27] L. Tanabe and W. J. Wilbur, "Tagging gene and protein names in biomedical text," Bioinformatics, vol. 18, pp. 1124-1132, 2002.
[28] S. Zhao, "Named Entity Recognition in Biomedical Texts using an HMM Model," presented at the Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications, 2004.
[29] J. i. Kazama, et al., "Tuning support vector machines for biomedical named entity recognition," presented at the Proceedings of the ACL-02 workshop on Natural language processing in the biomedical domain - Volume 3, Philadelphia, Pennsylvania, 2002.
[30] J. Finkel, et al., "Exploiting context for biomedical entity recognition: from syntax to the web," Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications, pp. 88-91, 2004.
[31] R. T.-H. Tsai, et al., "NERBio: using selected word conjunctions, term normalization, and global patterns to improve biomedical named entity recognition," BMC Bioinformatics, vol. 7, p. S11, 2006.
[32] L. Si, et al., "Boosting performance of bio-entity recognition by combining results from multiple systems," presented at the Proceedings of the 5th international workshop on Bioinformatics, Chicago, Illinois, 2005.
[33] R. Altman, et al., "Text mining for biology - the way forward: opinions from leading scientists," Genome Biology, vol. 9, p. S7, 2008.
[34] A. Jimeno, et al., "Assessment of disease named entity recognition on a corpus of annotated sentences," BMC Bioinformatics, vol. 9, p. S3, 2008.
[35] H. Yu, et al., "Mapping abbreviations to full forms in biomedical articles," Journal of the American Medical Informatics Association, vol. 9, pp. 262-272, 2002.
[36] A. S. Schwartz and M. A. Hearst, "A simple algorithm for identifying abbreviation definitions in biomedical text," in Pac Symp Biocomput., 2003, pp. 451-462.
[37] R. Podowski, et al., "Suregene, a scalable system for automated term disambiguation of gene and protein names," Journal of Bioinformatics and Computational Biology, vol. 3, pp. 743-770, 2005.
[38] L. Hirschman, et al., "Overview of BioCreAtIvE task 1B: normalized gene lists," BMC Bioinformatics, vol. 6, p. S11, 2005.
[39] W. Cohen and E. Minkov, "A graph-search framework for associating gene identifiers with documents," BMC Bioinformatics, vol. 7, p. 440, 2006.
[40] F. Leitner, "Comparative community assessments for applied biomedical text mining: BioCreative II challenge and metaservices," in Intelligent Systems for Molecular Biology (ISMB) and European Conference on Computational Biology (ECCB), Highlights Track, 2009.
[41] K. Fundel, et al., "Exact versus approximate string matching for protein name identification," in Proceedings of the BioCreative Challenge Evaluation Workshop 2004, 2004.
[42] J. Hakenberg, et al., "Me and my friends: gene mention normalization with background knowledge," in Proceedings of the Second BioCreAtIvE Challenge Evaluation Workshop, 2007, pp. 23-25.
[43] K. Seki and J. Mostafa, "Discovering implicit associations between genes and hereditary diseases," in Pac Symp Biocomput., Maui, Hawaii, 2007, pp. 316-327.
[44] J. W. Cooper and A. Kershenbaum, "Discovery of protein-protein interactions using a combination of linguistic, statistical and graphical information," BMC Bioinformatics, vol. 6, p. 143, 2005.
[45] P. K. Shah, et al., "Information extraction from full text scientific articles: Where are the keywords?," BMC Bioinformatics, vol. 4, p. 20, 2003.
[46] H. Shatkay, et al., "Integrating image data into biomedical text categorization," Bioinformatics, vol. 22, pp. e446-453, 2006.
[47] Z. Kou, et al., "A stacked graphical model for associating information from text and images in figures," in Pac Symp Biocomput., Maui, Hawaii, 2007.
[48] J. Saric, et al., "Extraction of regulatory gene/protein networks from Medline," Bioinformatics, vol. 22, pp. 645-650, 2006.
[49] T. Ono, et al., "Automated extraction of information on protein-protein interactions from the biological literature," Bioinformatics, vol. 17, pp. 155-161, 2001.
[50] S. Kim, et al., "Kernel approaches for genic interaction extraction," Bioinformatics, vol. 24, p. 118, 2008.
[51] R. Bunescu and R. Mooney, "Subsequence kernels for relation extraction," Advances in Neural Information Processing Systems, vol. 18, p. 171, 2006.
[52] T. Barnickel, et al., "Large Scale Application of Neural Network Based Semantic Role Labeling for Automated Relation Extraction from Biomedical Texts," 2009.
[53] A. Ramani, et al., "Consolidating the set of known human protein-protein interactions in preparation for large-scale mapping of the human interactome," Genome Biology, vol. 6, p. R40, 2005.
[54] R. Bunescu, et al., "Comparative experiments on learning information extractors for proteins and their interactions," Artificial Intelligence in Medicine, vol. 33, pp. 139-155, 2005.
[55] B. Rosario and M. A. Hearst, "Multi-way relation classification: application to protein-protein interactions," in Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, 2005, pp. 732-739.
[56] M. Craven and J. Kumlien, "Constructing Biological Knowledge Bases by Extracting Information from Text Sources," in Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology, 1999, pp. 77-86.
[57] T. C. Rindflesch, et al., "EDGAR: extraction of drugs, genes and relations from the biomedical literature," in Pac Symp Biocomput., 2000, pp. 515-524.
[58] H.-W. Chun, et al., "Extraction of gene-disease relations from Medline using domain dictionaries and machine learning," Proceedings of the Pacific Symposium on Biocomputing, pp. 4-15, 2006.
[59] R. T.-H. Tsai, et al., "HypertenGene: Extracting key hypertension genes from biomedical literature with position and automatically-generated template features," to appear in BMC Bioinformatics, 2009.
[60] Y. Miyao, et al., "Evaluating Contributions of Natural Language Parsers to Protein-Protein Interaction Extraction," Bioinformatics, Dec. 9, 2008.
[61] L. Wong, "PIES, a protein interaction extraction system," Proceedings of the Pacific Symposium on Biocomputing, vol. 6, pp. 520-531, 2001.
[62] J. Castaño, et al., "Anaphora resolution in biomedical literature," in International Symposium on Reference Resolution, 2002.
[63] J. Pustejovsky, et al., "Medstract: creating large-scale information servers for biomedical libraries," in Proceedings of the ACL-02 workshop on Natural language processing in the biomedical domain, 2002, pp. 85-92.
[64] N. Nguyen, et al., "Challenges in Pronoun Resolution System for Biomedical Text," Proceedings of the Sixth International Language Resources and Evaluation (LREC'08), 2008.
[65] R. T.-H. Tsai, et al., "PubMed-EX: A web browser extension to enhance PubMed search with text mining features," Bioinformatics, [Epub ahead of print], 2009.
[66] Z. Zhang, et al., "Bringing Web 2.0 to bioinformatics," Brief Bioinform, vol. 10, pp. 1-10, 2009.
[67] K. Cheung, et al., "Semantic Web approach to database integration in the life sciences," Semantic Web: Revolutionizing knowledge discovery in the life sciences, pp. 11-30, 2007.
[68] R. Dowell, et al., "The distributed annotation system," BMC Bioinformatics, vol. 2, p. 7, 2001.
[69] T. O'Reilly. (2005). What is Web 2.0: Design Patterns and Business Models for the Next Generation of Software. Available: http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-20.html
[70] B. Mons, et al., "Calling on a million minds for community annotation in WikiProteins," Genome Biology, vol. 9, p. R89, 2008.
[71] C. Baral, et al., "CBioC: beyond a prototype for collaborative annotation of molecular interactions from the literature," presented at the Proceedings of the Computational Systems Bioinformatics Conference, San Diego, California, 2007.
[72] K. Oda, et al., "New challenges for text mining: mapping between text and manually curated pathways," BMC Bioinformatics, vol. 9, p. S5, 2008.
[73] G. Bader, et al., "Pathguide: a pathway resource list," Nucleic Acids Research, vol. 34, pp. D504-506, 2006.
[74] L. Hirschman and C. Blaschke, "Evaluation of Text Mining in Biology," in Text Mining for Biology and Medicine, 2005, pp. 213-245.
[75] A. Yeh, et al., "Background and overview for KDD Cup 2002 task 1: Information extraction from biomedical articles," in ACM SIGKDD Explorations Newsletter, 2002, pp. 87-89.
[76] W. Hersh and E. Voorhees, "TREC genomics special issue overview," Information Retrieval, vol. 12, pp. 1-15, 2009.
[77] J. Hakenberg, et al., "LLL’05 challenge: Genic interaction extraction-identification of language patterns based on alignment and finite state automata," in Proceedings of the ICML05 workshop: Learning Language in Logic (LLL05), 2005, pp. 89-97.
[78] J.-D. Kim, et al., "Overview of BioNLP'09 Shared Task on Event Extraction," in Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task, 2009, pp. 1-9.
[79] J.-D. Kim, et al., "Corpus annotation for mining biomedical events from literature," BMC Bioinformatics, vol. 9, p. 10, 2008.
[80] M. Kanehisa, et al., "KEGG for linking genomes to life and the environment," Nucleic Acids Research, vol. 36, p. D480, 2008.
[81] E. Camon, et al., "The Gene Ontology annotation (GOA) database: sharing knowledge in Uniprot with Gene Ontology," Nucleic Acids Research, vol. 32, pp. D262-266, 2004.
[82] J. D. Kim, et al., "GENIA corpus--a semantically annotated corpus for bio-textmining," Bioinformatics, vol. 19, pp. 180-182, 2003.
[83] L. Tanabe, et al., "GENETAG: a tagged corpus for gene/protein named entity recognition," BMC Bioinformatics, vol. 6, p. S3, 2005.
[84] J. Heimonen, et al., "Complex-to-Pairwise Mapping of Biological Relationships using a Semantic Network Representation," in Third International Symposium on Semantic Mining in Biomedicine (SMBM 2008), 2008, pp. 45-52.
[85] B. Rosario and M. A. Hearst, "Classifying semantic relations in bioscience texts," presented at the Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, Barcelona, Spain, 2004.
[86] D. Berleant, et al. (2003). Corpus properties of protein interaction descriptions in MEDLINE. Available: http://class.ee.iastate.edu/berleant/home/me/cv/papers/corpuspropertiesstart.htm
[87] C. Nedellec, "Learning language in logic-genic interaction extraction challenge," in Proceedings of the ICML05 workshop: Learning Language in Logic (LLL05), 2005, pp. 97-99.
[88] T. Wattarujeekrit, et al., "PASBio: predicate-argument structures for event extraction in molecular biology," BMC Bioinformatics, vol. 5, p. 155, 2004.
[89] W.-C. Chou, et al., "A Semi-Automatic Method for Annotating a Biomedical Proposition Bank," presented at the Proceedings of the ACL Workshop on Frontiers in Linguistically Annotated Corpora, Sydney, Australia, 2006.
[90] S. Kulick, et al., "Integrated annotation for biomedical information extraction," in HLT/NAACL-2004, 2004, pp. 61-68.
[91] Y. Tateisi and J. Tsujii, "Part-of-Speech Annotation of Biology Research Abstracts," in Proceedings of the 4th International Conference on Language Resource and Evaluation (LREC2004), 2004, pp. 1267-1270.
[92] Y. Tateisi, et al., "Syntax Annotation for the GENIA corpus," Proc. IJCNLP 2005, Companion volume, pp. 222-227, 2005.
[93] M. Lease and E. Charniak, "Parsing biomedical literature," presented at the Proceedings of the Second International Joint Conference on Natural Language Processing, 2005.
[94] L. Smith, et al., "MedPost: a part-of-speech tagger for bioMedical text," Bioinformatics, vol. 20, pp. 2320-2321, 2004.
[95] M. Krallinger, et al., "The BioCreative II.5 challenge overview," in Proceedings of the BioCreative II.5 Workshop 2009 on Digital Annotations, Madrid, Spain, 2009.
[96] C. Gasperin, et al., "Annotation of anaphoric relations in biomedical full-text articles using a domain-relevant scheme," presented at the Proceedings of the Discourse Anaphora and Anaphor Resolution Colloquium, Lagos (Algarve), Portugal, 2007.
[97] T. McIntosh and J. Curran, "Challenges for automatically extracting molecular interactions from full-text articles," BMC Bioinformatics, vol. 10, p. 311, 2009.
[98] K. W. Kohn, "Molecular Interaction Map of the Mammalian Cell Cycle Control and DNA Repair Systems," Mol. Biol. Cell, vol. 10, pp. 2703-2734, 1999.

Hong-Jie Dai received his B.S. degree in Computer Science and Information Engineering from Tung Hai University and his M.S. degree in Computer Science and Information Engineering from National Central University in Taiwan in 2003 and 2005, respectively. Since 2005, he has been an assistant researcher at the Academia Sinica. His research interests include bioinformatics, machine learning, text mining, natural language processing and software engineering.

Yen-Ching Chang received her B.S. degree in biochemical science and technology from National Taiwan University and her M.S. degree in biochemistry and molecular biology from National Taiwan University College of Medicine. Since 2008, she has been a research assistant at the Academia Sinica. Her research interests include proteomics and text mining.

Richard Tzong-Han Tsai received the B.S. degree in Computer Science and Information Engineering from National Taiwan University, Taipei, Taiwan, in 1997, the M.S. degree in Computer Science and Information Engineering from National Taiwan University in 1999, and the Ph.D. degree in Computer Science and Information Engineering from National Taiwan University in 2006. He was a postdoctoral fellow at Academia Sinica from 2006 to 2007. He is now an assistant professor in the Department of Computer Science and Engineering, Yuan Ze University, Zhongli, Taiwan. His research areas are natural language processing, cross-language information retrieval, biomedical literature mining, and information services on mobile devices.

Wen-Lian Hsu is a Distinguished Research Fellow in the Institute of Information Science, Academia Sinica. He received a B.S. from the Department of Mathematics, National Taiwan University in 1973 and a Ph.D. in operations research from Cornell University in 1980. He then joined Northwestern University, and was promoted to tenured associate professor in 1986. In 1989, he joined the Institute of Information Science as a research fellow. Earlier in his career, Dr. Hsu focused on theoretical graph algorithms and frequently published papers in top-notch journals, such as JACM and SIAM Journal on Computing. After returning to Taiwan, he started research on automatic conversion of Pinyin to characters. In 1993, he invented a Chinese natural input method which has since attracted two million users and revolutionized the phonetic input for Chinese in Taiwan. Later, he moved into question answering and bioinformatics. He is currently the Director of the International Graduate Program in Bioinformatics in Academia Sinica. Dr. Hsu received many awards including the Distinguished Research Fellow award of the National Science Council, K. T. Li breakthrough award, IEEE Fellow, Academia Sinica Investigator Award, and Teco Technology award. From 2001 to 2002, he was the President of the Artificial Intelligence Society in Taiwan.