Named Entity Recognition using Maximum Entropy Models on Biologists’ Literature

Jaesoo Lim, Hyunchul Jang, Hyun-Sook Lee, Soo-Jun Park, Seon-Hee Park
Bioinformatics Research Team, Electronics and Telecommunications Research Institute, 161, Gajeongdong, Daejeon, 305-350, Korea
E-mail: {jslim, janghc, lhs63473, psj, shp}@etri.re.kr

Keywords: named entity recognition, maximum entropy models, text mining

1 Introduction

With the explosion of online biomedical texts, it has become difficult to find exact information manually. Named entity recognition is the first step toward further text mining tasks such as information extraction and knowledge discovery. In this paper, we present our statistical named entity recognition method. Several statistical approaches have been tried before, including Hidden Markov Models [1] and Support Vector Machines [2]. In our study, we use Maximum Entropy models with a rich contextual feature set.

2 Methods

The Maximum Entropy approach is one of the most popular machine learning methods for natural language processing. We apply it to classify tokens into semantic categories based on rich contextual features, using OpenNLP’s maxent [3] package, which implements the Generalized Iterative Scaling (GIS) training algorithm. Rather than modeling entity boundaries directly, we classify each token into one of 40 semantic categories, including the category ‘O’ for tokens outside any entity, and then derive the entity boundaries from the predicted category sequence. This reverse approach errs only when two entities of the same category are adjacent; in return, it copes better with data sparseness and keeps the classification problem simpler.
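The boundary-recovery step can be made concrete with a short sketch (ours, not the authors’ code, and in Python rather than the Java maxent package): consecutive tokens sharing a predicted non-‘O’ category are merged into one entity, which also exhibits the failure mode just described, since two adjacent entities of the same category collapse into one.

def categories_to_entities(tokens, categories):
    # Recover entity spans from per-token semantic categories.
    # Consecutive tokens with the same non-'O' category are merged,
    # so adjacent same-category entities are (incorrectly) joined --
    # the only error this reverse approach introduces.
    entities = []
    start = None
    for i, cat in enumerate(categories + ["O"]):  # sentinel flushes the last entity
        if start is not None and cat != categories[start]:
            entities.append((start, i, categories[start], " ".join(tokens[start:i])))
            start = None
        if cat != "O" and start is None:
            start = i
    return entities

tokens = ["amyloid", "beta", "protein", "in", "alzheimer", "brain"]
cats = ["PROTEIN", "PROTEIN", "PROTEIN", "O", "DISEASE", "O"]
print(categories_to_entities(tokens, cats))
# [(0, 3, 'PROTEIN', 'amyloid beta protein'), (4, 5, 'DISEASE', 'alzheimer')]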

2.1 Contextual features

Feature                 -2  -1   0  +1  +2
Part-of-Speech           O   O   O   O   O
Previously Predicted     O   O
Word                     O   O   O   O   O
Prefix                   O   O   O   O   O
Suffix                   O   O   O   O   O
Word Shape               O   O   O   O   O

Table 1: The contextual feature set. An ‘O’ marks the positions, relative to the current token, at which each feature is used.

For our Maximum Entropy modeling, six contextual features are selected; Table 1 shows the feature set. In the table, position 0 denotes the word token currently being classified, -2 denotes the token two positions before it, and +2 the token two positions after it. For the part-of-speech features, we use Brill’s transformation-based part-of-speech tagger [4], trained on the GENIA [5] corpus; it yields approximately 98% accuracy on its training corpus. We also use the two previously predicted semantic categories. The remaining four features are lexical: words, prefixes, suffixes, and word shapes. These lexical features are collected from about 5,500 MEDLINE abstracts retrieved, like our corpus, with the query “(alzheimer’s disease) AND (amyloid beta protein)”. From them we build a closed vocabulary of items that appear at least five times in the abstracts. This closed vocabulary not only reduces computational complexity and the feature space, but also limits overfitting to the training corpus.
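To make the lexical features concrete, here is a minimal sketch of plausible implementations; the shape scheme, affix length, and function names are our assumptions, as the paper does not specify them. It shows a word-shape mapping, prefix/suffix extraction, and the frequency cutoff that builds the closed vocabulary.

from collections import Counter
import re

def word_shape(token):
    # Assumed scheme: uppercase runs -> 'A', lowercase runs -> 'a',
    # digit runs -> '0', other characters kept as-is.
    shape = re.sub(r"[A-Z]+", "A", token)
    shape = re.sub(r"[a-z]+", "a", shape)
    shape = re.sub(r"[0-9]+", "0", shape)
    return shape

def affixes(token, n=3):
    # Prefix and suffix features of (up to) n characters.
    return token[:n], token[-n:]

def closed_vocabulary(tokens, min_count=5):
    # Keep only tokens seen at least min_count times; rarer tokens fall
    # back to a shared unknown symbol, shrinking the feature space and
    # limiting overfitting, as described above.
    counts = Counter(tokens)
    return {t for t, c in counts.items() if c >= min_count}

print(word_shape("IL-2"))   # 'A-0'
print(affixes("amyloid"))   # ('amy', 'oid')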

2.2 Corpus

To estimate model parameters and to evaluate our system, we built a domain-specific corpus from MEDLINE online abstracts. We collected about 5,500 abstracts returned by the query “(alzheimer’s disease) AND (amyloid beta protein)” and manually annotated 3,000 of them. The 39 semantic categories were selected from the UMLS [6] semantic types.

3 Results

Our system achieved an average f-score of 0.5945 in 10-fold cross validation, with recall of 0.5751 and precision of 0.6153. We obtained higher f-scores on partial matches: 0.6301 on left-boundary matches and 0.6612 on right-boundary matches. Post-processing steps such as gazetteer lookup, parenthesis and symbol handling, and abbreviation processing may therefore help correct these boundary errors.
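For reference, the sketch below shows our reading of the three matching criteria (exact, left-boundary, right-boundary); the scoring function and the toy spans are illustrative assumptions, not the authors’ evaluation script.

def prf(predicted, gold, match):
    # predicted / gold: sets of (start, end, category) spans.
    tp_p = sum(1 for p in predicted if any(match(p, g) for g in gold))
    tp_g = sum(1 for g in gold if any(match(p, g) for p in predicted))
    precision = tp_p / len(predicted) if predicted else 0.0
    recall = tp_g / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

exact = lambda p, g: p == g
left  = lambda p, g: p[0] == g[0] and p[2] == g[2]   # left boundary and category agree
right = lambda p, g: p[1] == g[1] and p[2] == g[2]   # right boundary and category agree

pred = {(0, 3, "PROTEIN"), (5, 6, "DISEASE")}   # right boundary of first span is off by one
gold = {(0, 2, "PROTEIN"), (5, 6, "DISEASE")}
for name, m in [("exact", exact), ("left", left), ("right", right)]:
    print(name, prf(pred, gold, m))   # left-boundary scoring forgives the off-by-one error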

References

[1] Collier, N., Nobata, C. and Tsujii, J. 2000. Extracting the Names of Genes and Gene Products with a Hidden Markov Model. In: Proceedings of COLING 2000, pp. 201-207.
[2] Kazama, J. et al. 2002. Tuning Support Vector Machines for Biomedical Named Entity Recognition. In: Proceedings of the ACL-02 Workshop on Natural Language Processing in the Biomedical Domain, pp. 1-8.
[3] OpenNLP maxent package. http://maxent.sourceforge.net/
[4] Brill, E. 1995. Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging. Computational Linguistics, 21(4), pp. 543-565.
[5] GENIA corpus. http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/
[6] Unified Medical Language System (UMLS). https://www.nlm.nih.gov/research/umls/