Automatic Term Identification and Classification in Biology Texts
Chikashi Nobata, Nigel Collier and Jun-ichi Tsujii
Department of Information Science, Graduate School of Science, University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113, Japan
E-mail: {nova, nigel, [email protected]}

July 27, 1999
Abstract

The rapid growth of collections in online academic databases has meant that there is increasing difficulty for experts who want to access information in a timely and efficient way. We seek here to explore the application of information extraction methods to the identification and classification of terms in biological abstracts from MEDLINE. We explore the use of a statistical method and a decision tree method for classification and term candidate identification, and also a method based on shallow parsing for identification. Experiments are made against a corpus of 100 expert-tagged abstracts, and results indicate that while identifying term boundaries is non-trivial, a high success rate can be obtained in term classification, and that a combination of methods will provide the best solution.
1 Introduction

The rapid growth of collections in online academic databases has meant that there is increasing difficulty for experts who want to access information in a timely and efficient way. We seek here to explore the application of generalisable information extraction methods to the identification and classification of terms in biology texts. The texts we have chosen for our study were selected from MEDLINE [12], an online collection of abstracts from scientific journals maintained by the National Library of Medicine (http://www.nlm.nih.gov/). We are interested in extracting biological information related to the cell signalling pathway [17, 11, 7] from such texts, and we created a test corpus of 100 human-expert-tagged texts for our experiments [15]. Previous methods have also considered the use of MEDLINE articles for this task, but as yet the methods have all had various disadvantages. For example,
Fukuda et al. [9] used hand-encoded domain knowledge to create a tool for extracting protein names from text. Although the results looked promising, there seem to be difficulties with scalability and generalisability. In another study of name identification in MEDLINE, Andrade et al. [1] used a statistical approach based on word distributions in a large corpus for finding protein names. The purpose seems to be a type of key word spotting, and no formal evaluation of the method is given. In our work we have focussed on an information extraction approach, and we turn now to the Message Understanding Conference (MUC), where the identification and classification of terms is developed as one of the subtasks, called "named entity". At the MUC-7 conference [8], several named entity tagging systems used learning algorithms. For example, New York University's "MENE" [6], [5] uses a statistical maximum entropy method, and BBN's "Nymble/IdentiFinder" [4], [14] uses hidden Markov models. Recently, the performance of such systems which use learning algorithms has become comparable to that of hand-coded systems. Language Technology Group's system [13] is a hybrid of statistical and non-statistical modules and surpasses the above two systems, but this integration has portability disadvantages. The named entity task itself is intended to be largely domain independent and has been performed automatically with high accuracy in previous conferences. As illustrated above, systems for this task are divided broadly into two types: systems with hand-coded knowledge and those with a learning method. Systems of the former type are easy to customize, but require laborious manual work to create knowledge such as linguistic patterns whenever we adapt them to a new domain or a new definition of named entities. Our group is constructing an ontology for the genome domain, and our definition of named entities depends on it. Therefore, the second type of system is more suitable for our purpose.
We now present and discuss the learning methods we used in our experiments, based on statistics, decision trees, and shallow parsing.
2 Classification 1: Supervised learning with word lists

The purpose of terminology classification is to map a concept, represented as a string of words, to a class. In the approach described below, we calculate similarity between a string and a class by the distribution of words in pre-classified word lists from databases such as SwissProt [2, 20] and GenBank [10]. The use of large gazettes and word lists has a long tradition in information extraction. The obvious disadvantage is that such word lists can rarely ever be assumed to be complete, otherwise information extraction would be a trivial task. What we seek here is to explore an extension of the word list method by assuming that the words in the lists are indicative of the terms in that class, without making a closed-set assumption. This method of internal evidence can
easily be extended by letting the word list input be the result of word clustering, or by adding external sources of information such as cooccurrence information. The word list method we describe, which incorporates this classification model, can be seen in Figure 1, which shows a pre-processing stage to identify and expand abbreviations in the text. More formally, we can assume that there exists a set of m term classes, C = {c_0, c_1, ..., c_m}, and a vocabulary of words V = {w_0, w_1, ..., w_n}. The first task then is to estimate the probability of each word event occurring in each of the classes. By using the naive Bayes assumption that the probability of each word which occurs in a text is independent of the word's context and its position in the text, we develop a simple statistical model outlined below.
Figure 1: Named entity system schema for Classification 1

For a set of pre-classified term lists for each class, L = {l_0, l_1, ..., l_m}, consisting of words which are members of V, we define N(l_y, w_x) to be the count of the number of times word w_x appears in the list l_y. We then find the sum over each class list as,
    n_y = sum_{i=0}^{n} N(l_y, w_i)                (1)
We then estimate predictor conditional probabilities for each word and class by the relative frequency score,

    t(c_y | w_x) = N(l_y, w_x) / n_y                (2)
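Equations (1) and (2) amount to computing relative word frequencies per class list. As a minimal sketch in Python (the helper name `train_word_scores` and the toy term lists are illustrative stand-ins for the actual database-derived lists):

```python
from collections import Counter

def train_word_scores(class_lists):
    """Estimate t(c_y | w_x) = N(l_y, w_x) / n_y for each class list.

    class_lists maps a class name to its pre-classified term list
    (in the paper, words drawn from SwissProt/GenBank entries).
    """
    scores = {}
    for cls, words in class_lists.items():
        counts = Counter(words)                 # N(l_y, w_x)
        n_y = sum(counts.values())              # Equation (1)
        scores[cls] = {w: c / n_y for w, c in counts.items()}  # Equation (2)
    return scores

# toy lists, purely illustrative
lists = {
    "PROTEIN": ["kinase", "receptor", "kinase"],
    "DNA": ["promoter", "gene"],
}
t = train_word_scores(lists)
```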
Class    | Examples
PROTEIN  | PROTEIN, ANTIGEN, SYNTHASE, PRECURSOR, LIGAND
DNA      | GENE, PROMOTER, MOTIF, ENHANCER, SITE, REGION
SOURCE   | CELL, LINE, NUCLEUS
RNA      | RNA, mRNA
Table 1: Examples from the head noun lists

The key assumption then is that many of the words in the vocabulary will be good classifiers, as they appear with a higher relative frequency in some class lists than in others. Given naive Bayes and a further assumption that there exists a one-to-one correspondence between strings and a correct class, a simple model for estimating class probabilities from arbitrary strings of words S = {w_s0, w_s1, ..., w_sk}, where each w_sx is in V, is

    Pr(c_y | S) = sum_{i=0}^{k} t(c_y | w_si)  /  sum_{j=0}^{m} sum_{i=0}^{k} t(c_j | w_si)                (3)
This denotes the sum of probabilities for words in S that belong to class y over the sum of probabilities that words in S belong to all classes. This model relies entirely on evidence which is internal to the string and does not consider surrounding context in any way. In our current experiments we are exploring this internal model as a baseline, and we expect to expand it in the future to include external evidence from local contextual clues. Since there is an assumption that each string must belong to one class, we reserve a special class c_0 for strings which are either not technical terms or do not belong to any of the other classes. The term list l_0 for c_0 is intended to model a background distribution of words and is derived from a large general collection of MEDLINE abstracts, taken from all of the abstracts for 1990; the index size is approximately 148000 words. Statistical classification works on the assumption that the focus phrase which we want to classify consists of words which have a high probability of occurring in one of the pre-classified word lists. Taken together, the class probabilities for each of the words in the candidate noun phrase can be combined to yield a score which should be highest for the correct class. Since the lists consist of noun phrases from database entries, we assume that many nouns will be repeated within different entry fields, yielding statistical information about the importance of these words to the word class. In early experiments we found that head nouns in noun phrases provided significant clues about the class. Therefore, in at least one respect, the naive Bayes assumption was too limiting, and we decided to give greater weight to the head noun. We incorporated a heuristic which classifies a noun phrase to a pre-determined class if its head can be found in a lookup list of 35 words. The lookup lists were made by hand from an inspection of documents. Examples can be seen in Table 1 for each class we are interested in.
When creating word lists from database entries we found that a high proportion of technical words appeared with a low frequency (i.e. less than 5), which is normally considered to be unreliable for statistical purposes. Rather than exclude such a large part of our knowledge base, we decided at this stage to include this data in the classifier and to use it in combination with, hopefully, more reliable evidence from other words. The index sizes of the word lists used in our experiments are as follows: Proteins (22956 words), DNA (74172 words), RNA (same as DNA), SOURCE (812 words).
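To make Equation (3) and the head-noun heuristic concrete, the following is a hedged sketch of the scoring step; the function name, the toy score table `t` and the head-noun sets are hypothetical illustrations, not the actual implementation:

```python
def classify_string(words, t, head_noun_lists=None):
    """Score a candidate term S = (w_s0, ..., w_sk) with Equation (3),
    after an optional head-noun override (the 35-word lookup lists)."""
    if head_noun_lists:
        head = words[-1].upper()
        for cls, heads in head_noun_lists.items():
            if head in heads:
                return cls, 1.0        # heuristic: trust the head noun
    # Equation (3): per-class sum of t(c|w), normalised over all classes
    sums = {cls: sum(scores.get(w, 0.0) for w in words)
            for cls, scores in t.items()}
    total = sum(sums.values())
    if total == 0.0:
        return "c0", 0.0               # background / unknown class
    best = max(sums, key=sums.get)
    return best, sums[best] / total

# toy score tables and head-noun lists (hypothetical values)
t = {"PROTEIN": {"tyrosine": 0.4, "kinase": 0.6},
     "DNA": {"promoter": 0.7, "gene": 0.3}}
heads = {"DNA": {"GENE", "PROMOTER"}}
```

With these toy tables, "tyrosine kinase" is scored as PROTEIN on internal evidence alone, while "c-fos gene" is routed to DNA by the head-noun override.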
3 Classification 2: Decision trees

The system was originally created for Japanese documents and performed the classification and identification tasks simultaneously. It has two phases: one for creating the decision tree from training data and the other for generating the tagged text based on the decision tree. C4.5 [16] is used for creating the decision tree. This classification system is basically the same as [18], [19], but has been adapted for performing information extraction in this domain. There are three kinds of feature sets in the decision tree:
Part-of-speech information: There are 45 part-of-speech categories, whose definitions are based on the Penn Treebank categories. We use a tagger based on Adwait Ratnaparkhi's method.
Character type information: Character types and some combinations of them are recognized, such as upper case, lower case, capitalization, numerical expressions, and symbols.
Word lists specific to the domain: Word lists from databases such as SwissProt and GenBank are used, as in the statistical model.
The other customization we have made is to separate classification and identification so that we can evaluate the performance of these two tasks separately. For the classification task, the training texts are in the same format as for the original named entity task. However, the system regards a tagged phrase as one word and does not learn the information needed to identify term boundaries. In the testing phase, the system assumes input texts in which noun phrases are already identified.
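The character-type features above can be illustrated as follows. This is a sketch: the exact feature inventory of the system is not specified in detail, so the flags below are plausible assumptions rather than the actual feature set.

```python
import re

def char_type_features(word):
    """Plausible character-type flags of the kind listed above."""
    return {
        "all_upper": word.isupper(),
        "all_lower": word.islower(),
        "init_cap": word[:1].isupper() and word[1:].islower(),
        "has_digit": any(ch.isdigit() for ch in word),
        "is_number": bool(re.fullmatch(r"\d+(\.\d+)?", word)),
        "has_symbol": bool(re.search(r"[^A-Za-z0-9]", word)),
    }
```

Such feature dictionaries, together with part-of-speech tags and word-list membership flags, would form the attribute vectors handed to C4.5.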
4 Identification 1: Shallow parsing

Before we can undertake the classification task we need to perform identification. This involves selecting candidate terms in the text to be given to the classifier.
The first method we investigated for identification involves the use of a shallow parser to find noun phrase boundaries. This is done in two stages: first we parse the text with the EngCG shallow parser [21], and then we chunk the text to find noun phrases which are candidate terms for classification. The advantage of using a shallow parser is that we can overcome the problem of a dependence on internal evidence, represented by the knowledge in word lists, i.e. we do not need to know every word in every class for it to be classified correctly. For robust processing of terms which contain new words, it seems that we should balance our system's dependence on internal and external evidence.
5 Identification 2: Decision trees

The second method of identification is a model which uses a decision tree technique. This system for identification is almost identical to that for classification. The differences are that the training texts are processed to replace class tags with general open and close term tags which show term boundaries, and that the system processes sentences word by word.
6 Identification 3: Statistical identification

The third method we attempted for identifying terms was a simple statistical model similar to the one used for classification and shown in Equation 3. The model calculates a probability score for each word being a member of a known class and of the unknown class c_0. If the probability of being a member of any known class is greater than that of being in the unknown class, then the word belongs to a noun phrase. The model makes the basic assumption of a closed vocabulary within the training term lists, but imposes no restriction on word ordering or cooccurrence. Since most terms in the tagged corpus contained no function words, we made the working assumption that we did not need to use a stop list. Additionally, we used a set of terminals, such as punctuation marks, which mark the end boundary of term candidates.
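The word-by-word decision just described can be sketched as follows; the function and the toy score tables are hypothetical stand-ins, assuming scores of the form estimated for Equation (2):

```python
def identify_terms(words, t, terminals={".", ",", ";", ":", "(", ")"}):
    """Sketch of Identification 3: a word stays inside a term candidate
    when its best score under any known class beats its background-class
    (c0) score; terminals always close the current candidate."""
    spans, current = [], []
    for w in words:
        known = max((scores.get(w, 0.0)
                     for cls, scores in t.items() if cls != "c0"),
                    default=0.0)
        background = t.get("c0", {}).get(w, 0.0)
        if w not in terminals and known > background:
            current.append(w)
        elif current:
            spans.append(current)
            current = []
    if current:
        spans.append(current)
    return spans

# toy score tables (hypothetical values)
t = {"PROTEIN": {"tyrosine": 0.5, "kinase": 0.5},
     "c0": {"the": 0.9, "activates": 0.4}}
```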
7 Test collection

Our test collection is made from 100 MEDLINE abstracts tagged by a human expert in biology. Details of the corpus, tagging scheme and ontology which are now being developed can be found in [15], but are summarised here for completeness. The tag set we use is derived from the GENIA ontology which is currently being developed as part of our project. The part of the ontology relevant to the classification task can be seen in Figure 2, in which the entities which we want to classify, i.e. Source, DNA, RNA and Protein, are underlined. In the test corpus of 100 articles there were 1712 SOURCE, 730 DNA, 60 RNA and 4248 PROTEIN tagged entities. Since the corpus is created by an expert, we use it as a judgement set and aim to make our automatic classification tool as good as the expert.
Figure 2: Part of the term classification ontology

The biological roles of these entities are not of relevance here, so we will not say more about them. Instead we will simply say that they are of importance for biologists working in many areas, such as the study of the human genome and in particular the cell signalling pathway, i.e. the way in which chemical messages are passed into and out of cells.
8 Experiments

We use "F-scores" for the evaluation of our experiments. The "F-score" is a measurement combining "Recall" and "Precision", defined in Equation 4. "Recall" is the percentage of the correct answers among the answers in the key provided by a human. "Precision" is the percentage of the correct answers among the answers proposed by the system.

    F-score = (2 x Precision x Recall) / (Precision + Recall)                (4)
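Equation (4) in code form, for reference:

```python
def f_score(precision, recall):
    """F-score of Equation (4): harmonic mean of precision and recall
    (both expressed as percentages, as in the result tables)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, a system with 80% precision and 60% recall scores about 68.6.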
Class    | class. only | I1   | I2   | I3
SOURCE   | 69.9        | 37.9 | 53.4 | 37.1
PROTEIN  | 70.3        | 31.9 | 53.8 | 33.6
DNA      | 83.8        | 42.3 | 50.0 | 47.4
RNA      |  8.2        |  2.7 |  4.1 |  6.6
All      | 65.8        | 37.4 | 58.5 | 40.1

Table 2: F-scores for C1 on 100 abstracts with different term identification methods

Class    | class. only | I1          | I2          | I3
SOURCE   | 77.19-82.30 | 15.87-24.77 | 46.33-55.14 | 22.99-29.74
PROTEIN  | 83.43-87.53 | 33.17-40.81 | 63.37-72.10 | 42.87-47.95
DNA      | 17.86-44.59 |  2.40- 4.20 |  4.95-14.81 |  7.29-19.05
RNA      |  0.00- 0.00 |  0.00-16.67 |  0.00- 0.00 |  0.00- 0.00
All      | 87.72-90.10 | 28.93-31.31 | 56.98-66.24 | 37.85-42.22

Table 3: F-scores for C2 on 100 abstracts with different term identification methods

Both systems were tested on the same test set of 100 tagged MEDLINE articles. In the experiments we report results for the two classification methods used with the three identification methods. As a control, we also show results from classifying terms given their bounds in the test corpus, i.e. we assume that the boundaries of terms have been found successfully and then attempt to use the internal words to classify those phrases. This is shown as method `class. only' in Tables 2 and 3. We evaluate the performance of the decision tree based system with 5-fold cross validation: the 100 texts are separated into 5 subsets, one of which is used as the test set while the other 80 texts are put together to form a training set. This evaluation is repeated 5 times, and in Table 3 we show the lowest and highest values among these 5 evaluation runs.
9 Discussion

From the results we can clearly see that both methods C1 and C2 have success at classifying terms as belonging to one of the target classes. We also see that C1 is comparatively better at classifying DNA and RNA, whereas C2 is better at classifying SOURCE and PROTEIN. This is possibly because of the lack of training data for the decision tree for the DNA and RNA classes in the training corpus, whereas the statistical method relies entirely on word lists. Performance drops substantially for both models when we attempt classification together with identification. This also highlights the difficulty of the tagging task
Key:      the <T> cell antigen receptor </T>
Response: <T> the T-cell antigen receptor </T>

Key:      <T> human </T> <T> T lymphocytes </T>
Response: <T> human T lymphocytes </T>

Key:      <T> Janus protein tyrosine kinases </T> ( <T> JAKs </T> )
Response: <T> Janus protein tyrosine kinases (JAKs) </T>

Key:      two <T> STAT-like protein complexes </T>
Response: <T> two STAT-like protein complexes </T>

Key:      <T>Rel</T>-<T>NF-Kappa B</T>
Response: <T>Rel-NF-Kappa B</T>

Key:      <T>HeLa</T> cells
Response: <T>HeLa cells</T>
Table 4: Examples of mis-identification of terms

for the human annotator. Some examples are shown in Table 4. Here the expert-tagged term is shown as the key, and the output of the term identifier is shown as the response. The open term tag is shown as <T> and the close term tag as </T>. This is a simplification of the knowledge in the tagged corpus, which contains information about each term's identification number and reference (although these are not at present used in the classification task). The first problem we see is that terminological identification is not simply a matter of noun phrase identification, so that in general the methods which learn from examples (I2 and I3) outperform the shallow parser I1. For example, we should remove count nouns, articles etc., as is the case in the first and fourth examples. Secondly, the expert who tags the text has some implicit model of the domain which allows him/her to divide noun phrase terms according to some intuition about the structure of terms. The second example in Table 4 shows such a case. Such examples account for a high proportion of identification errors and indicate that there may be no single correct way of marking up terms in a text. In the case of protein names, we observed many occurrences of a base family protein name like `Rel' which has a qualifier related to its subtype, in this case `NF-Kappa B', where `NF' is an abbreviation of `nuclear factor', a class indicator. In theory there could be other subtypes, so that the invariant part of the term as shown in the key is actually only `Rel'. This is indicative of the systematic and compositional nature of terms in this domain; however, without high-level knowledge this segmentation is going to be very difficult to achieve automatically. Perhaps a better method for tagging, then, is to choose the longest term that belongs to a single class. In our ongoing work creating a tagged domain corpus we are using this experience to make tagging more systematic.
For terms belonging to RNA, we found that since the terms are named after the DNA from which the RNA is derived, the RNA names are only distinguishable from DNA using the clue words in the head noun list. Therefore C2, which does not have this knowledge, performs quite poorly for RNA classification.
10 Conclusion

We have presented experimental results for two classification and three term candidate identification methods, and shown that decision trees and statistical classifiers based on word lists have different strengths for different types of term class. In the future we will see whether we can get the best performance by combining the methods. Pre-classified term lists are a useful and reliable resource, as they are created by experts and can be derived from existing database entries. If such lists do not exist for some domain then we can easily incorporate the results of the many term clustering algorithms (e.g. see [3]) into the above algorithm, although we can expect that this will inevitably degrade performance. The method obviously has its limits, though, and we will be extending the statistical algorithm to incorporate evidence from local term context to overcome the closed-lexicon assumption.
References

[1] M. Andrade and A. Valencia. Automatic extraction of keywords from scientific text: application to the knowledge domain of protein families. BioInformatics, 4(7), 1998.

[2] A. Bairoch and R. Apweiler. The SWISS-PROT protein sequence data bank and its new supplement TrEMBL. Nucleic Acids Research, 25:31-36, 1997.

[3] L.D. Baker and A.K. McCallum. Distributional clustering of words for text classification. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, 1998.

[4] Daniel Bikel, Scott Miller, Richard Schwartz, and Ralph Weischedel. Nymble: a high-performance learning name-finder. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pages 194-201, Washington, DC, USA, April 1997. Association for Computational Linguistics.

[5] Andrew Borthwick, John Sterling, Eugene Agichtein, and Ralph Grishman. Exploiting diverse knowledge sources via maximum entropy in named entity recognition. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 152-160, Montreal, Canada, August 1998.

[6] Andrew Borthwick, John Sterling, Eugene Agichtein, and Ralph Grishman. NYU: Description of the MENE named entity system as used in MUC-7. In Proceedings of the Seventh Message Understanding Conference (MUC-7), Fairfax, Virginia, USA, May 1998.

[7] N. Collier, H.S. Park, N. Ogata, Y. Tateishi, C. Nobata, T. Ohta, T. Sekimizu, H. Imai, and J. Tsujii. The GENIA project: corpus-based knowledge acquisition and information extraction from genome research papers. In Proceedings of the Annual Meeting of the European Chapter of the Association for Computational Linguistics (EACL'99), June 1999.

[8] DARPA. Proceedings of the Seventh Message Understanding Conference (MUC-7), Fairfax, VA, USA, May 1998.

[9] K. Fukuda, T. Tsunoda, A. Tamura, and T. Takagi. Toward information extraction: Identifying protein names from biological papers. In Proc. of the Pacific Symposium on Biocomputing '98 (PSB'98), Jan. 1998.

[10] GENBANK. Genome sequence database. ftp://ncbi.nlm.nih.gov/genbank/.

[11] T. Hishiki, N. Collier, C. Nobata, T. Ohta, N. Ogata, T. Sekimizu, R. Steiner, H. Park, and J. Tsujii. Developing NLP tools for genome informatics: An information extraction perspective. In Genome Informatics. Universal Academy Press, Inc., 1998.

[12] MEDLINE. The PubMed database. http://www.ncbi.nlm.nih.gov/pubmed/.

[13] Andrei Mikheev and Claire Grover. LTG description of the NE recognition system used for MUC-7. In Proceedings of the Seventh Message Understanding Conference (MUC-7), Fairfax, Virginia, USA, May 1998.

[14] Scott Miller, Michael Crystal, Heidi Fox, Lance Ramshaw, Richard Schwartz, Rebecca Stone, Ralph Weischedel, and the Annotation Group. Algorithms that learn to extract information. BBN: Description of the SIFT system as used for MUC-7. In Proceedings of the Seventh Message Understanding Conference (MUC-7), Fairfax, Virginia, USA, May 1998.

[15] Y. Ohta, Y. Tateishi, N. Collier, C. Nobata, and J. Tsujii. Building an annotated corpus from biological papers. In 59th Annual National Convention of the IPSJ Zenkokutaikai (in Japanese), Iwate Prefectural University, (to appear) 28-30 September 1999.

[16] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, Inc., San Mateo, California, 1993.

[17] T. Sekimizu, H. Park, and J. Tsujii. Identifying the interaction between genes and gene products based on frequently seen verbs in MEDLINE abstracts. In Genome Informatics. Universal Academy Press, Inc., 1998.

[18] Satoshi Sekine. NYU: Description of the Japanese NE system used for MET-2. In Proceedings of the Seventh Message Understanding Conference (MUC-7), Fairfax, Virginia, USA, May 1998.

[19] Satoshi Sekine, Ralph Grishman, and Hiroyuki Shinnou. A decision tree method for finding and classifying names in Japanese texts. In Proceedings of the Sixth Workshop on Very Large Corpora, Montreal, Canada, August 1998.

[20] SwissProt. Annotated protein sequence database and supplement TrEMBL. http://www.expasy.ch/sprot/sprot-top.html.

[21] A. Voutilainen. Designing a (finite-state) parsing grammar. In E. Roche and Y. Schabes, editors, Finite-State Language Processing. A Bradford Book, The MIT Press, 1996.