Fine-Grained Named Entity Recognition using Conditional Random Fields for Question Answering
Changki Lee, Yi-Gyu Hwang, Hyo-Jung Oh, Soojong Lim, Jeong Heo, Chung-Hee Lee, Hyeon-Jin Kim, Ji-Hyun Wang, Myung-Gil Jang
Electronics and Telecommunications Research Institute (ETRI), 161 Gajeong-dong, Yuseong-gu, Daejeon, 305-350, Korea
{leeck,yghwang,ohj,isj,jeonghur,forever,jini,jhwang,mgjang}@etri.re.kr
Abstract. In many QA systems, fine-grained named entities are extracted by a coarse-grained named entity recognizer combined with a fine-grained named entity dictionary. In this paper, we describe fine-grained Named Entity Recognition using Conditional Random Fields (CRFs) for question answering. We use CRFs to detect the boundaries of named entities and Maximum Entropy (ME) to classify named entity classes. With the proposed approach, we achieve 83.2% precision, 74.5% recall, and 78.6% F1 on 147 fine-grained named entity types. Moreover, we reduce the training time to 27% of that of a baseline model without loss of performance. In question answering, the QA system with passage retrieval and Answer Index Units (AIU) achieved about a 26% improvement over QA with passage retrieval alone. The results demonstrate that our approach is effective for QA.
Keywords: Fine-Grained Named Entity Recognition, Conditional Random Fields, Question Answering.
1 Introduction
A Question-Answering (QA) system is an information retrieval system that finds an answer, rather than a list of documents, in response to a user's question. Passage extraction is the method most commonly used by QA systems. In passage extraction, the sentences or passages regarded as most relevant to the question are extracted, and answers are then retrieved from them using lexico-syntactic information or NLP techniques (especially named entity recognition). Recent advances have brought some systems (such as BBN's IdentiFinder) to within 90% accuracy when classifying named entities into broad categories, such as person, organization, and location. These coarse categorizations are useful in many areas of natural language research, but more finely grained classification has additional advantages in QA [1, 2]. In many QA systems, fine-grained named entities are extracted by a coarse-grained named entity recognizer combined with a fine-grained named entity dictionary [6].
While much research has gone into the coarse categorization of named entities, we are not aware of much previous work using machine learning algorithms to perform more fine-grained classification. Fleischman and Hovy describe a method for automatically classifying person instances into eight finer-grained subcategories [1]. They use a supervised learning method that considers local context features and global semantic features. However, their training data is highly skewed, because they use a simple bootstrapping method to generate the training data automatically. In addition, they treat only person and location, whereas we treat all kinds of named entity types. Mann explores the idea of a fine-grained proper noun ontology and its use in question answering [2]. He builds a proper noun ontology from unrestricted text, and this automatically constructed ontology is then used on a question answering task. The disadvantage of this method is that its coverage is small, because it relies on simple textual co-occurrence patterns. In this paper, we describe fine-grained Named Entity Recognition (NER) using Conditional Random Fields (CRFs) and its use in question answering.
2 Fine-grained NER using Conditional Random Fields
We define 147 fine-grained named entity types, chosen to reflect the asking points of users so that answer candidates can be found for a question answering system. The types are organized under 15 top-level nodes, and each top-level node consists of 2 or 4 layers. The top-level types are: person, study field, theory, artifacts, organization, location, civilization, date, time, quantity, event, animal, plant, material, and term. In many machine learning-based named entity recognizers, each name class N is subdivided into 2 sub-classes, B-N and I-N [3, 4]. Hence, there are 295 classes in total (147 name classes × 2 sub-classes + 1 not-a-name class). In this setting CRFs cannot be applied directly, because CRFs with this many classes are inefficient to train. To solve this problem, we break the NE task down into two parts: boundary detection using CRFs and NE classification using Maximum Entropy (ME). Let $x = \langle x_1, \ldots, x_T \rangle$ be an observed input sequence, such as the sequence of words of the sentences in a document. Let $Y$ be a set of FSM states, each of which is associated with a name class and a boundary (i.e., B or I). Let $y = \langle y_1, \ldots, y_T \rangle$ be a sequence of states (the values on the T output nodes). We define the conditional probability of a state sequence y given an input sequence x as follows:
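\[
P(y \mid x) = P(c, b \mid x) = P(b \mid x)\, P(c \mid b, x) \approx P(b \mid x) \prod_{i=1}^{T} P(c_i \mid b_i, x_i) \tag{1}
\]
where $b = \langle b_1, \ldots, b_T \rangle$ is a set of boundary states and $c = \langle c_1, \ldots, c_T \rangle$ is a set of name class states.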
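The class count behind this decomposition can be checked with a minimal sketch. The helper names (`one_stage_labels`, `split_label`), the label string `"O"` for the not-a-name class, and the placeholder class names are assumptions made for illustration; only the B-N/I-N scheme and the counts come from the text.

```python
def one_stage_labels(name_classes):
    """Label set when each name class N is split into B-N and I-N,
    plus a single not-a-name label (written "O" here by assumption)."""
    labels = ["O"]
    for name in name_classes:
        labels.append("B-" + name)
        labels.append("I-" + name)
    return labels


def split_label(label):
    """Split a one-stage label into (boundary state, name class)."""
    if label == "O":
        return "O", None
    boundary, name = label.split("-", 1)
    return boundary, name


# Placeholder names standing in for the 147 types listed in the appendix.
name_classes = ["N%03d" % i for i in range(147)]
labels = one_stage_labels(name_classes)
print(len(labels))  # 295 = 147 * 2 + 1

# The boundary detector only has to separate three states.
boundary_states = sorted({split_label(label)[0] for label in labels})
print(boundary_states)  # ['B', 'I', 'O']
```

In the two-stage setup, the sequence model works over the 3 boundary states, and the 147 name classes are handled by a separate classifier applied to each detected entity.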
CRFs define the conditional probability of a boundary state sequence b given an input sequence x as follows:
\[
P_{CRF}(b \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(b_{t-1}, b_t, x, t) \right) \tag{2}
\]
where $f_k(b_{t-1}, b_t, x, t)$ is an arbitrary feature function over its arguments, $\lambda_k$ is a learned weight for each feature function, and $Z(x)$ is the normalization factor over all boundary state sequences. Higher $\lambda$ weights make their corresponding FSM transitions more likely. We calculate the conditional probability of a state $c_i$ given a named entity $(b_i, x_i)$ extracted by the boundary detector as follows:
\[
P_{ME}(c_i \mid b_i, x_i) = \frac{1}{Z(b_i, x_i)} \exp\left( \sum_{k} \lambda_k f_k(c_i, b_i, x_i) \right) \tag{3}
\]
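Equations (2) and (3) can be evaluated directly on a toy input. The sketch below is illustrative only: the feature functions, weights, start state, and the label `"O"` are all assumptions, and Z(x) is computed by enumerating every boundary sequence, which is feasible only for very short inputs (practical CRF implementations use the forward algorithm instead).

```python
import itertools
import math

BOUNDARY_STATES = ("B", "I", "O")  # "O" (not-a-name) is an assumed label name


def crf_score(b_seq, x, weights, feature_fns, start="O"):
    """Exponent of Eq. (2): sum over t and k of lambda_k * f_k(b_{t-1}, b_t, x, t).
    The state before the first token is assumed to be `start`."""
    total, prev = 0.0, start
    for t, cur in enumerate(b_seq):
        total += sum(lam * f(prev, cur, x, t) for lam, f in zip(weights, feature_fns))
        prev = cur
    return total


def crf_prob(b_seq, x, weights, feature_fns):
    """P_CRF(b | x) of Eq. (2), with Z(x) computed by brute-force enumeration."""
    z = sum(math.exp(crf_score(cand, x, weights, feature_fns))
            for cand in itertools.product(BOUNDARY_STATES, repeat=len(x)))
    return math.exp(crf_score(b_seq, x, weights, feature_fns)) / z


def me_probs(classes, b_i, x_i, weights, feature_fns):
    """P_ME(c_i | b_i, x_i) of Eq. (3) for every candidate class."""
    scores = [sum(lam * f(c, b_i, x_i) for lam, f in zip(weights, feature_fns))
              for c in classes]
    z = sum(math.exp(s) for s in scores)
    return {c: math.exp(s) / z for c, s in zip(classes, scores)}


# Hypothetical boundary features: capitalised tokens tend to start a name,
# and an I state directly after O is penalised.
crf_feats = [
    lambda prev, cur, x, t: 1.0 if cur == "B" and x[t][0].isupper() else 0.0,
    lambda prev, cur, x, t: 1.0 if cur == "I" and prev == "O" else 0.0,
]
crf_weights = [1.5, -5.0]
x = ["Nightingale", "was", "born"]
p_boundary = crf_prob(("B", "O", "O"), x, crf_weights, crf_feats)

# Hypothetical class features applied to one detected entity string.
me_feats = [
    lambda c, b, x_i: 1.0 if c == "DT_YEAR" and x_i.isdigit() else 0.0,
    lambda c, b, x_i: 1.0 if c == "PS_NAME" and x_i[0].isupper() else 0.0,
]
me_weights = [2.0, 1.0]
class_probs = me_probs(["DT_YEAR", "PS_NAME", "CV_PRIZE"], "B", "1912",
                       me_weights, me_feats)
best_class = max(class_probs, key=class_probs.get)
print(p_boundary, best_class)
```

Under Equation (1), the score of a full labeling is the product of the boundary probability and the per-entity class probabilities, so the two models can be trained and applied separately.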
We use the following features:
• Word feature: orthographical features of the (-2,-1,0,1,2) words.
• Suffix: suffixes which are contained in the current word among the entries in the suffix dictionary.
• POS tag: part-of-speech tag of the (-2,-1,0,1,2) words.
• NE dictionary: NE dictionary features.
• 15 character-level regular expressions, such as [A-Z]*, [0-9]*, [A-Za-z0-9]*, …
The primary advantage of Conditional Random Fields (CRFs) over hidden Markov models (HMMs) is their conditional nature, resulting in the relaxation of the independence assumptions required by HMMs. Additionally, CRFs avoid the label bias problem, a weakness exhibited by maximum entropy Markov models (MEMMs). CRFs outperform both MEMMs and HMMs on a number of real-world sequence labeling tasks [3, 4].
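The feature templates listed above could be turned into per-token feature strings roughly as follows. This is a sketch under several assumptions: the function name `token_features`, the feature string format, the use of the surface word as the word feature, suffix matching via `endswith`, and the English example with hypothetical POS tags are all illustrative. Only three of the 15 regular expressions are given in the text, so only those are used.

```python
import re

# The three character-level patterns quoted in the text; the remaining 12 are not listed.
CHAR_PATTERNS = {
    "UPPER": re.compile(r"[A-Z]*"),
    "DIGIT": re.compile(r"[0-9]*"),
    "ALNUM": re.compile(r"[A-Za-z0-9]*"),
}


def token_features(words, pos_tags, t, suffix_dict, ne_dict):
    """Collect feature strings for the token at position t."""
    feats = []
    # Word and POS features over the (-2,-1,0,1,2) window.
    for offset in (-2, -1, 0, 1, 2):
        i = t + offset
        if 0 <= i < len(words):
            feats.append("w[%d]=%s" % (offset, words[i]))
            feats.append("pos[%d]=%s" % (offset, pos_tags[i]))
    word = words[t]
    # Suffix dictionary entries that the current word ends with (assumed matching rule).
    for suffix in sorted(suffix_dict):
        if word.endswith(suffix):
            feats.append("suffix=" + suffix)
    # NE dictionary lookup on the current word.
    if word in ne_dict:
        feats.append("ne_dict=" + ne_dict[word])
    # Character-level regular expressions matched against the whole word.
    for name, pattern in CHAR_PATTERNS.items():
        if pattern.fullmatch(word):
            feats.append("re=" + name)
    return feats


words = ["Red", "Cross", "in", "1912"]
pos_tags = ["NNP", "NNP", "IN", "CD"]  # hypothetical tags
feats = token_features(words, pos_tags, 3,
                       suffix_dict={"ed", "ing"},
                       ne_dict={"Cross": "OGG_SOCIETY"})
print(feats)
```

Each feature string would then correspond to one or more binary feature functions $f_k$ in the CRF and ME models.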
3 Experiment
3.1 Fine-grained NER
The experiments for fine-grained named entity recognition (147 classes) were performed on our Korean fine-grained named entity data set. The data set consists of 6,000 documents tagged by human annotators. We used 5,500 documents as the training set and 500 documents as the test set. We trained the models with L-BFGS using our C++ implementation of CRFs and ME, with a Gaussian prior of 1 for ME and 10 for CRFs, and ran 500 training iterations.
Table 1. Sub-task of fine-grained NER (147 classes).
Sub-task             Method (# Class)    Training Time (second/iter.)   Pre. (%)   Rec. (%)   F1 (%)
Boundary detection   ME (3 classes)      6.23                           87.4       80.6       83.9
Boundary detection   CRFs (3 classes)    24.14                          89.5       79.7       84.3
NE classification    ME (147 classes)    18.36                          83.7       84.2       84.0
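As a consistency check on Table 1, F1 can be recomputed from the reported precision and recall using the standard harmonic-mean definition. Because the inputs are already rounded to one decimal place, the recomputed values are expected to agree with the reported ones only to within about 0.1.

```python
def f1(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (row, precision, recall, reported F1) copied from Table 1.
table1 = [
    ("Boundary detection, ME", 87.4, 80.6, 83.9),
    ("Boundary detection, CRFs", 89.5, 79.7, 84.3),
    ("NE classification, ME", 83.7, 84.2, 84.0),
]

for name, p, r, reported in table1:
    print("%s: recomputed F1 = %.2f, reported = %.1f" % (name, f1(p, r), reported))
```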
Table 1 shows the performance of boundary detection and NE classification using CRFs and ME. In boundary detection, we obtained 83.9% F1 using ME and 84.3% F1 using CRFs. In NE classification, we achieved 84.0% F1 using ME, assuming that boundary detection is perfect.
Table 2. Performance of fine-grained NER (147 classes).
Task               Method (# Class)                      Training Time (second/iter.)   Pre. (%)   Rec. (%)   F1 (%)
NER-1 (baseline)   1 stage: ME (295 classes)             154.97                         82.5       73.5       77.7
NER-2              1 stage: CRFs (295 classes)           12248.33                       -          -          -
NER-3              2 stages: ME+ME (3+147 classes)       24.59                          82.8       73.3       77.8
NER-4 (proposed)   2 stages: CRFs+ME (3+147 classes)     42.50                          83.2       74.5       78.6
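The per-iteration training times in Tables 1 and 2 can be compared directly. The short sketch below (variable names are illustrative) shows that the two-stage times equal the sum of the corresponding sub-task times from Table 1, and computes each model's training time relative to the NER-1 baseline.

```python
# Seconds per iteration, copied from Table 1 (sub-tasks) and Table 2 (full models).
subtask = {"boundary_me": 6.23, "boundary_crf": 24.14, "classify_me": 18.36}
model = {"NER-1": 154.97, "NER-2": 12248.33, "NER-3": 24.59, "NER-4": 42.50}

# Two-stage models: boundary detection time plus classification time.
ner3_sum = subtask["boundary_me"] + subtask["classify_me"]
ner4_sum = subtask["boundary_crf"] + subtask["classify_me"]
print("NER-3: %.2f s, NER-4: %.2f s" % (ner3_sum, ner4_sum))

# Training time relative to the one-stage ME baseline.
baseline = model["NER-1"]
relative = {task: sec / baseline for task, sec in model.items()}
for task, ratio in relative.items():
    print("%s: %.1f%% of baseline" % (task, 100 * ratio))
```

The NER-4 ratio rounds to 27%, and the one-stage CRF (NER-2) is roughly two orders of magnitude slower per iteration than the baseline.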
Table 2 shows the performance of fine-grained named entity recognition using CRFs and ME. Our proposed model NER-4, which is broken down into two parts (boundary detection using CRFs and NE classification using ME), obtained 78.6% F1. The ME-based baseline model NER-1, which is not broken down into two parts, obtained 77.7% F1. We could not train NER-2 to completion because its training is too slow. Table 2 also shows the training time (seconds per iteration) of each model. The baseline model NER-1 takes 155.0 seconds per iteration in training, whereas our proposed model NER-4 takes 42.5 seconds per iteration. We therefore reduced the training time to 27% of the baseline without loss of performance.
3.2 QA with fine-grained NER
We performed another experiment with a question answering system to show the effect of fine-grained named entity recognition. We used the ETRI QA Test Set, which consists of 402 question-answer pairs drawn from an encyclopedia [5]. Our encyclopedia currently consists of 163,535 entries, 13 main categories, and 41 sub-categories in Korean. For each question, the performance score is computed as the reciprocal answer rank (RAR) of the first correct answer. Our fine-grained NER engine annotates each sentence with the 147 answer types, and the indexer then generates Answer Index Unit (AIU) structures from the answer candidates (the words annotated as fine-grained NEs) and the content words found within the same sentence. We append distance information between answer candidates and content words to the AIU structures [5]. Figure 1 contains an example of the data structure passed from the indexing module.
[Example sentence]
The Nightingale Award, the top honor in international nursing, was established at the International Red Cross in 1912 and is presented every two years (title: Nightingale)
[Fine-grained NER]
<The Nightingale Award:CV_PRIZE>, the top honor in international nursing, was established at <the International Red Cross:OGG_SOCIETY> in <1912:DT_YEAR> and is presented every <two years:DT_DURATION> (title: <Nightingale:PS_NAME>)
[Answer candidates]
The Nightingale Award:CV_PRIZE
the International Red Cross:OGG_SOCIETY
1912:DT_YEAR
two years:DT_DURATION
Nightingale:PS_NAME
[Sentence-based AIU structures]
(The Nightingale Award:CV_PRIZE, the International Red Cross:OGG_SOCIETY, distance info.), … , (two years:DT_DURATION, Nightingale:PS_NAME, distance info.),
(The Nightingale Award:CV_PRIZE, top, distance info.), (The Nightingale Award:CV_PRIZE, honor, distance info.), …
Fig. 1. Example of index processing
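The pairing shown in Fig. 1 can be sketched as below. This is not the indexer described in [5]: the function name `build_aiu`, the tuple layout, and the choice of absolute token offset as the "distance info." are assumptions made for illustration, since the text does not specify how distance is encoded.

```python
from itertools import combinations


def build_aiu(candidates, content_words):
    """Build sentence-level AIU-like tuples.

    candidates:    list of (token_index, text, tag) for fine-grained NE annotations
    content_words: list of (token_index, word)
    Distance is taken to be the absolute token offset (an assumption).
    """
    units = []
    # Answer candidate paired with another answer candidate.
    for (i, a, tag_a), (j, b, tag_b) in combinations(candidates, 2):
        units.append(("%s:%s" % (a, tag_a), "%s:%s" % (b, tag_b), abs(i - j)))
    # Answer candidate paired with a content word from the same sentence.
    for i, a, tag_a in candidates:
        for j, word in content_words:
            units.append(("%s:%s" % (a, tag_a), word, abs(i - j)))
    return units


# Token indices follow a whitespace split of the example sentence in Fig. 1.
candidates = [
    (0, "The Nightingale Award", "CV_PRIZE"),
    (12, "the International Red Cross", "OGG_SOCIETY"),
    (17, "1912", "DT_YEAR"),
]
content_words = [(4, "top"), (5, "honor")]

units = build_aiu(candidates, content_words)
for unit in units:
    print(unit)
```

Storing the pairs with their distances at indexing time means that answer processing can look up typed candidates near question terms without re-running NER on retrieved passages.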
The answer processing module searches the index DB for relevant answer candidates using question analysis, calculates their similarities, and then extracts answers [5].
Table 3. Performance of Question Answering.
Task                                                   MRAR
QA with passage retrieval                              0.525
QA with passage retrieval and AIU (fine-grained NE)    0.662
Table 3 shows the performance of the question answering system with and without Answer Index Units (i.e., fine-grained named entity information), measured as the mean reciprocal answer rank (MRAR) over all questions. The QA system with passage retrieval and AIU achieved about a 26% improvement over QA with passage retrieval alone.
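The evaluation measure and the reported improvement can be illustrated with a small sketch. The function names and the toy ranked lists are hypothetical, and scoring a question as 0 when no correct answer is returned is an assumption (the usual convention for reciprocal-rank measures); only the two MRAR values come from Table 3.

```python
def reciprocal_answer_rank(ranked_answers, gold_answers):
    """RAR: 1 / rank of the first correct answer, or 0.0 if none is returned (assumed)."""
    for rank, answer in enumerate(ranked_answers, start=1):
        if answer in gold_answers:
            return 1.0 / rank
    return 0.0


def mean_rar(runs):
    """MRAR: mean of RAR over (ranked_answers, gold_answers) pairs."""
    return sum(reciprocal_answer_rank(r, g) for r, g in runs) / len(runs)


# Toy example: correct at rank 1, correct at rank 2, and no correct answer.
toy_runs = [
    (["1912", "1910"], {"1912"}),
    (["Geneva", "Nightingale"], {"Nightingale"}),
    (["two years"], {"every year"}),
]
toy_mrar = mean_rar(toy_runs)
print(toy_mrar)  # (1 + 0.5 + 0) / 3 = 0.5

# Relative improvement between the two rows of Table 3.
mrar_passage, mrar_aiu = 0.525, 0.662
improvement = (mrar_aiu - mrar_passage) / mrar_passage
print("%.1f%%" % (100 * improvement))
```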
4 Conclusion
In this paper, we described fine-grained Named Entity Recognition using Conditional Random Fields (CRFs) for question answering. We used CRFs to detect the boundaries of named entities and Maximum Entropy (ME) to classify named entity classes. With the proposed approach, we achieved 83.2% precision, 74.5% recall, and 78.6% F1 on 147 fine-grained named entity types. Moreover, we reduced the training time to 27% of that of a baseline model without loss of performance. In question answering, the QA system with passage retrieval and AIU (fine-grained NE) achieved about a 26% improvement over QA with passage retrieval alone. The results demonstrate that our approach is effective for QA.
References
[1] M. Fleischman and E. Hovy. Fine grained classification of named entities. COLING, 2002.
[2] G. Mann. Fine-Grained Proper Noun Ontologies for Question Answering. SemaNet'02: Building and Using Semantic Networks, 2002.
[3] A. McCallum and W. Li. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. CoNLL, 2003.
[4] F. Sha and F. Pereira. Shallow Parsing with Conditional Random Fields. HLT-NAACL, 2003.
[5] H. Kim, J. Wang, C. Lee, C. Lee, and M. Jang. A LF based Answer Indexing Method for Encyclopedia Question Answering System. AIRS, 2004.
[6] K. Han, H. Chung, S. Kim, Y. Song, J. Lee, and H. Rim. Korea University Question Answering System at TREC 2004. TREC, 2004.
Appendix: Fine-Grained Named Entity Tag Set (147 classes)
1. Person: PS_NAME, PS_MYTH
2. Study Field: FD_OTHERS, FD_SCIENCE, FD_SOCIAL_SCIENCE, FD_MEDICINE, FD_ART, FD_PHILOSOPHY
3. Theory: TR_OTHERS, TR_SCIENCE, TR_TECHNOLOGY, TR_SOCIAL_SCIENCE, TR_ART, TR_PHILOSOPHY, TR_MEDICINE
4. Artifacts: AF_CULTURAL_ASSET, AF_BUILDING, AF_MUSICAL_INSTRUMENT, AF_ROAD, AF_WEAPON, AF_TRANSPORT, AF_WORKS, AFW_RELIGION, AFW_PHILOSOPHY, AFW_GEOGRAPHY, AFW_MEDICAL_SCIENCE, AFW_ART, AFWA_DANCE, AFWA_MOVIE, AFWA_LITERATURE, AFWA_ART_CRAFT, AFWA_THEATRICALS, AFWA_MUSIC
5. Organization: OG_OTHERS, OGG_ECONOMY, OGG_EDUCATION, OGG_MILITARY, OGG_MEDIA, OGG_SPORTS, OGG_ART, OGG_SOCIETY, OGG_MEDICINE, OGG_RELIGION, OGG_SCIENCE, OGG_BUSINESS, OGG_LIBRARY, OGG_LAW, OGG_POLITICS
6. Location: LC_OTHERS, LCP_COUNTRY, LCP_PROVINCE, LCP_COUNTY, LCP_CITY, LCP_CAPITALCITY, LCG_RIVER, LCG_OCEAN, LCG_BAY, LCG_MOUNTAIN, LCG_ISLAND, LCG_TOPOGRAPHY, LCG_CONTINENT, LC_TOUR, LC_SPACE
7. Civilization: CV_NAME, CV_TRIBE, CV_SPORTS, CV_SPORTS_INST, CV_POLICY, CV_TAX, CV_FUNDS, CV_LANGUAGE, CV_BUILDING_TYPE, CV_FOOD, CV_DRINK, CV_CLOTHING, CV_POSITION, CV_RELATION, CV_OCCUPATION, CV_CURRENCY, CV_PRIZE, CV_LAW, CVL_RIGHT, CVL_CRIME, CVL_PENALTY
8. Date: DT_OTHERS, DT_DURATION, DT_DAY, DT_MONTH, DT_YEAR, DT_SEASON, DT_GEOAGE, DT_DYNASTY
9. Time: TI_OTHERS, TI_DURATION, TI_HOUR, TI_MINUTE, TI_SECOND
10. Quantity: QT_OTHERS, QT_AGE, QT_SIZE, QT_LENGTH, QT_COUNT, QT_MAN_COUNT, QT_WEIGHT, QT_PERCENTAGE, QT_SPEED, QT_TEMPERATURE, QT_VOLUME, QT_ORDER, QT_PRICE, QT_PHONE
11. Event: EV_OTHERS, EV_ACTIVITY, EV_WAR_REVOLUTION, EV_SPORTS, EV_FESTIVAL
12. Animal: AM_OTHERS, AM_INSECT, AM_BIRD, AM_FISH, AM_MAMMALIA, AM_AMPHIBIA, AM_REPTILIA, AM_TYPE, AM_PART
13. Plant: PT_OTHERS, PT_FRUIT, PT_FLOWER, PT_TREE, PT_GRASS, PT_TYPE, PT_PART
14. Material: MT_ELEMENT, MT_METAL, MT_ROCK, MT_CHEMICAL, MTC_LIQUID, MTC_GAS
15. Term: TM_COLOR, TM_DIRECTION, TM_CLIMATE, TM_SHAPE, TM_CELL_TISSUE, TMM_DISEASE, TMM_DRUG, TMI_HW, TMI_SW