Named Entity Recognition using Maximum Entropy Models on Biologists’ Literature

Jaesoo Lim, Hyunchul Jang, Hyun-Sook Lee, Soo-Jun Park, Seon-Hee Park
Bioinformatics Research Team, Electronics and Telecommunications Research Institute, 161, Gajeongdong, Daejeon, 305-350, Korea
E-mail: {jslim, janghc, lhs63473, psj, shp}@etri.re.kr

Keywords: named entity recognition, maximum entropy models, text mining

1 Introduction

With the explosion of online biomedical texts, it has become difficult to find exact information manually. Named entity recognition is the first step toward further text mining tasks such as information extraction and knowledge discovery. In this paper, we present our statistical named entity recognition method. Several statistical approaches have been tried before, including Hidden Markov Models [1] and Support Vector Machines [2]. In our study, we use Maximum Entropy models with a rich contextual feature set.

2 Methods

The Maximum Entropy approach is one of the most popular machine learning methods for natural language processing. We apply it to classify tokens into semantic categories based on rich contextual features, using OpenNLP’s maxent [3] package, which implements the Generalized Iterative Scaling (GIS) training algorithm. Rather than modeling entity boundaries directly, we classify each token into one of 40 semantic categories, including the category ‘O’ for tokens outside any entity, and then derive the entity boundaries from the predicted category sequence. This reverse approach errs only when two entities of the same category are adjacent; in return, it copes better with data sparseness and keeps the classification problem simpler.
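The boundary-recovery step can be made concrete with a short sketch (ours, not the authors’ code, and in Python rather than the Java maxent package): consecutive tokens sharing a predicted non-‘O’ category are merged into one entity, which also exhibits the failure mode just described, since two adjacent entities of the same category collapse into one.

def categories_to_entities(tokens, categories):
    # Recover entity spans from per-token semantic categories.
    # Consecutive tokens with the same non-'O' category are merged,
    # so adjacent same-category entities are (incorrectly) joined --
    # the only error this reverse approach introduces.
    entities = []
    start = None
    for i, cat in enumerate(categories + ["O"]):  # sentinel flushes the last entity
        if start is not None and cat != categories[start]:
            entities.append((start, i, categories[start], " ".join(tokens[start:i])))
            start = None
        if cat != "O" and start is None:
            start = i
    return entities

tokens = ["amyloid", "beta", "protein", "in", "alzheimer", "brain"]
cats = ["PROTEIN", "PROTEIN", "PROTEIN", "O", "DISEASE", "O"]
print(categories_to_entities(tokens, cats))
# [(0, 3, 'PROTEIN', 'amyloid beta protein'), (4, 5, 'DISEASE', 'alzheimer')]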

2.1 Contextual features

Feature                 -2  -1   0  +1  +2
Part-of-Speech           O   O   O   O   O
Previously Predicted     O   O
Word                     O   O   O   O   O
Prefix                   O   O   O   O   O
Suffix                   O   O   O   O   O
Word Shape               O   O   O   O   O

Table 1: The contextual feature set. An ‘O’ marks the positions, relative to the current token, at which each feature is used.

For our Maximum Entropy modeling, six contextual features are selected; Table 1 shows the feature set. In the table, position 0 denotes the word token currently being classified, -2 denotes the token two positions before it, and +2 the token two positions after it. For the part-of-speech features, we use Brill’s transformation-based part-of-speech tagger [4], trained on the GENIA [5] corpus; it yields approximately 98% accuracy on its training corpus. We also use the two previously predicted semantic categories. The remaining four features are lexical: words, prefixes, suffixes, and word shapes. These lexical features are collected from about 5,500 MEDLINE abstracts retrieved, like our corpus, with the query “(alzheimer’s disease) AND (amyloid beta protein)”. From them we build a closed vocabulary of items that appear at least five times in the abstracts. This closed vocabulary not only reduces computational complexity and the feature space, but also limits overfitting to the training corpus.
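To make the lexical features concrete, here is a minimal sketch of plausible implementations; the shape scheme, affix length, and function names are our assumptions, as the paper does not specify them. It shows a word-shape mapping, prefix/suffix extraction, and the frequency cutoff that builds the closed vocabulary.

from collections import Counter
import re

def word_shape(token):
    # Assumed scheme: uppercase runs -> 'A', lowercase runs -> 'a',
    # digit runs -> '0', other characters kept as-is.
    shape = re.sub(r"[A-Z]+", "A", token)
    shape = re.sub(r"[a-z]+", "a", shape)
    shape = re.sub(r"[0-9]+", "0", shape)
    return shape

def affixes(token, n=3):
    # Prefix and suffix features of (up to) n characters.
    return token[:n], token[-n:]

def closed_vocabulary(tokens, min_count=5):
    # Keep only tokens seen at least min_count times; rarer tokens fall
    # back to a shared unknown symbol, shrinking the feature space and
    # limiting overfitting, as described above.
    counts = Counter(tokens)
    return {t for t, c in counts.items() if c >= min_count}

print(word_shape("IL-2"))   # 'A-0'
print(affixes("amyloid"))   # ('amy', 'oid')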

2.2 Corpus

To estimate model parameters and to evaluate our system, we built a domain-specific corpus from MEDLINE online abstracts. We collected about 5,500 abstracts returned by the query “(alzheimer’s disease) AND (amyloid beta protein)” and manually annotated 3,000 of them. The 39 semantic categories were selected from the UMLS [6] semantic types.

3 Results

Our system achieved an average f-score of 0.5945 in 10-fold cross validation, with recall of 0.5751 and precision of 0.6153. We obtained higher f-scores on partial matches: 0.6301 on left-boundary matches and 0.6612 on right-boundary matches. Post-processing steps such as gazetteer lookup, parenthesis and symbol handling, and abbreviation processing may therefore help correct these boundary errors.
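For reference, the sketch below shows our reading of the three matching criteria (exact, left-boundary, right-boundary); the scoring function and the toy spans are illustrative assumptions, not the authors’ evaluation script.

def prf(predicted, gold, match):
    # predicted / gold: sets of (start, end, category) spans.
    tp_p = sum(1 for p in predicted if any(match(p, g) for g in gold))
    tp_g = sum(1 for g in gold if any(match(p, g) for p in predicted))
    precision = tp_p / len(predicted) if predicted else 0.0
    recall = tp_g / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

exact = lambda p, g: p == g
left  = lambda p, g: p[0] == g[0] and p[2] == g[2]   # left boundary and category agree
right = lambda p, g: p[1] == g[1] and p[2] == g[2]   # right boundary and category agree

pred = {(0, 3, "PROTEIN"), (5, 6, "DISEASE")}   # right boundary of first span is off by one
gold = {(0, 2, "PROTEIN"), (5, 6, "DISEASE")}
for name, m in [("exact", exact), ("left", left), ("right", right)]:
    print(name, prf(pred, gold, m))   # left-boundary scoring forgives the off-by-one error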

References

[1] Collier, N., Nobata, C. and Tsujii, J. 2000. Extracting the Names of Genes and Gene Products with a Hidden Markov Model. In: Proceedings of COLING 2000, pp. 201-207.
[2] Kazama, J. et al. 2002. Tuning Support Vector Machines for Biomedical Named Entity Recognition. In: Proceedings of the ACL-02 Workshop on Natural Language Processing in the Biomedical Domain, pp. 1-8.
[3] OpenNLP maxent package. http://maxent.sourceforge.net/
[4] Brill, E. 1995. Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging. Computational Linguistics, 21(4), pp. 543-565.
[5] GENIA corpus. http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/
[6] Unified Medical Language System (UMLS). https://www.nlm.nih.gov/research/umls/