RESEARCH PAPER
International Journal of Recent Trends in Engineering, Vol. 1, No. 1, May 2009

Maximum Entropy Approach for Named Entity Recognition in Bengali and Hindi

Mohammad Hasanuzzaman1, Asif Ekbal2 and Sivaji Bandyopadhyay3
1 West Bengal Industrial Development Corporation, Kolkata, India
2,3 Department of Computer Science and Engineering, Jadavpur University, Kolkata-700032, India
Email: 1 [email protected], 2 [email protected], 3 [email protected]

Abstract— This paper reports on the development of a Named Entity Recognition (NER) system for two leading Indian languages, namely Bengali and Hindi, using the Maximum Entropy (ME) framework. We have used the annotated corpora obtained from the IJCNLP-08 NER Shared Task on South and South East Asian Languages (NERSSEAL, http://ltrc.iiit.ac.in/ner-ssea-08), tagged with a fine-grained Named Entity (NE) tagset of twelve tags (http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=3). An appropriate tag conversion routine has been developed to convert these corpora to forms tagged with four NE tags, namely Person name, Location name, Organization name and Miscellaneous name. The system makes use of contextual information of the words along with a variety of orthographic word-level features that are helpful in predicting the four NE classes. In this work, we have considered language independent features that are applicable to both languages as well as language specific features of Bengali and Hindi. Evaluation results show that the use of linguistic features can improve the performance of the system. Evaluation results of the 10-fold cross validation tests yield overall average recall, precision and f-score values of 88.01%, 82.63% and 85.22%, respectively, for Bengali and 86.4%, 79.23% and 82.66%, respectively, for Hindi.

Index Terms— Named Entity; Named Entity Recognition; Maximum Entropy Model; Bengali; Hindi.

I. INTRODUCTION

Named Entity Recognition (NER) has important applications in almost all Natural Language Processing (NLP) areas, including Information Retrieval, Information Extraction, Machine Translation, Question Answering and Automatic Summarization. The objective of NER is to identify and classify every word/term in a document into some predefined categories like person name, location name, organization name, miscellaneous name (date, time, number, percentage, monetary expressions etc.) and "none-of-the-above". The challenge in detecting NEs is that such expressions are hard to analyze with traditional NLP techniques because they belong to an open class of expressions, i.e., there is an infinite variety and new expressions are constantly being invented. Nowadays, machine learning (ML) approaches are popularly used to solve the NER problem, as they are easily trainable, adaptable to various domains and languages, and much cheaper to maintain than rule-based approaches. Probabilistic machine learning methods have become the state of the art for NER [1] [2] and for field extraction [3]. As a ME model, MENE [4] makes use of diverse knowledge sources. ME conditional models such as ME Markov Models [3] and Conditional Random Fields (CRFs) [5] have been reported to outperform generative HMM models on several information extraction tasks.

The existing works in the area of NER are mostly in non-Indian languages; only a few works involving Indian languages can be found in [6], [7], [8] and [9]. In this paper, we report a ME based NER system for the Indian languages, particularly Bengali and Hindi. Bengali is the seventh most spoken language in the world, second in India and the national language of Bangladesh. Hindi is the third most spoken language in the world and the national language of India. We have developed an automatic tag conversion routine to convert the fine-grained NE tagset, defined as part of the IJCNLP-08 NER shared task for SSEAL, to four NE tags: Person name, Location name, Organization name and Miscellaneous name. The system makes use of contextual information of the words along with a variety of orthographic word-level features that are helpful in predicting the various NE classes. We have considered both language independent features that are applicable to both languages and language dependent features that are specifically applicable to Bengali and Hindi. It has been observed from the evaluation results that the use of language specific features improves the performance of the system.

II. NAMED ENTITY RECOGNITION IN INDIAN LANGUAGES

NER in Indian languages is more difficult and challenging because, unlike English, there is no concept of capitalization in Indian languages. Another problem is that most of the NEs in Indian languages also appear in the dictionary with other valid meanings (for example, the word komol may be the name of a person as well as the name of a flower, i.e., lotus). Applying stochastic models to the NER problem requires a large amount of annotated corpus in order to achieve reasonable performance.

A. Named Entity Tagset

We have used the IJCNLP-08 NER shared task data, tagged with the twelve NE tags (available at http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=3), for NER in two different Indian languages.

The tagset consists of more tags than the four tags of the CoNLL 2003 shared task on NER. The underlying reason for adopting this finer NE tagset was to use the NER system in various NLP applications, particularly in machine translation. A tag conversion routine has been developed to convert the fine-grained NE tagset of twelve tags into a coarse-grained NE tagset of four tags, as shown in Table 1. In order to properly denote the boundaries of NEs, the four NE tags are further divided into the following forms: B-XXX (beginning of a multiword NE), I-XXX (internal word of a multiword NE consisting of more than two words) and E-XXX (end of a multiword NE), where XXX is one of PER, LOC, ORG or MISC. For example, the name sachin ramesh tendulkar is tagged as sachin/B-PER ramesh/I-PER tendulkar/E-PER. A single word NE is tagged as PER (Person name), LOC (Location name), ORG (Organization name) or MISC (Miscellaneous name). In the output, these sixteen NE tags are directly replaced with the four major NE tags.

TABLE 1
TAG CONVERSION TABLE

IJCNLP-08 NER shared task tag                      Tagset used          Meaning
NEP                                                Person name          Single word/multiword person name
NEL                                                Location name        Single word/multiword location name
NEO                                                Organization name    Single word/multiword organization name
NEA, NEN, NEM, NETI, NED, NEB, NETP, NETE, NETO    Miscellaneous name   Single word/multiword miscellaneous name
NNE                                                -                    Other than named entities
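The conversion described above can be pictured with a short routine. The sketch below is ours, not the authors' code: the tag mapping follows Table 1, but the function names and the exact handling of the B-/I-/E- prefixes are illustrative assumptions.

```python
# Illustrative sketch of the fine-to-coarse tag conversion (not the authors' code).
# The mapping follows Table 1; helper names are our own.

FINE_TO_COARSE = {
    "NEP": "PER", "NEL": "LOC", "NEO": "ORG",
    "NEA": "MISC", "NEN": "MISC", "NEM": "MISC", "NETI": "MISC",
    "NED": "MISC", "NEB": "MISC", "NETP": "MISC", "NETE": "MISC", "NETO": "MISC",
}

def convert_entity(tokens, fine_tag):
    """Tag one NE (a list of tokens sharing a fine-grained tag) with coarse tags."""
    coarse = FINE_TO_COARSE.get(fine_tag)
    if coarse is None:                      # NNE / not a named entity
        return [(tok, "NNE") for tok in tokens]
    if len(tokens) == 1:                    # single-word NE: plain PER/LOC/ORG/MISC
        return [(tokens[0], coarse)]
    tags = (["B-" + coarse] +               # boundary-aware tags for multiword NEs
            ["I-" + coarse] * (len(tokens) - 2) +
            ["E-" + coarse])
    return list(zip(tokens, tags))

def collapse_boundaries(tagged):
    """Replace the sixteen boundary tags with the four major tags in the output."""
    return [(tok, tag.split("-")[-1]) for tok, tag in tagged]

print(convert_entity(["sachin", "ramesh", "tendulkar"], "NEP"))
# [('sachin', 'B-PER'), ('ramesh', 'I-PER'), ('tendulkar', 'E-PER')]
```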

B. Maximum Entropy Framework for Named Entity Recognition

The ME framework estimates probabilities based on the principle of making as few assumptions as possible, other than the constraints imposed. Such constraints are derived from the training data and express relationships between features and outcomes. The probability distribution that satisfies this property is the one with the highest entropy; it is unique, agrees with the maximum likelihood distribution, and has the exponential form:

P(t \mid h) = \frac{1}{Z(h)} \exp\Big( \sum_{j=1}^{n} \lambda_j f_j(h, t) \Big)    (1)

where t is the NE tag, h is the context (or history), f_j(h, t) are the features with associated weights \lambda_j, and Z(h) is a normalization function.

The problem of NER can be formally stated as follows. Given a sequence of words w_1, ..., w_n, we want to find the corresponding sequence of NE tags t_1, ..., t_n, drawn from a set of tags T, which satisfies:

P(t_1, ..., t_n \mid w_1, ..., w_n) = \prod_{i=1}^{n} P(t_i \mid h_i)    (2)

where h_i is the context for the word w_i.
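To make equation (1) concrete, the sketch below computes P(t | h) as a log-linear model over binary feature functions. It is a minimal illustration under our own assumptions (hand-picked features and weights); the actual system learns its weights from the IJCNLP-08 training data.

```python
# Minimal sketch of the exponential-form model in Eq. (1); the features and
# weights here are invented for illustration, not taken from the trained system.
import math

def features(history, tag):
    """Binary-valued feature functions f_j(h, t)."""
    word = history["word"]
    return {
        ("word=sachin", "PER"): 1.0 if (word == "sachin" and tag == "PER") else 0.0,
        ("first_word", "NNE"): 1.0 if (history["is_first"] and tag == "NNE") else 0.0,
    }

def p_tag_given_history(history, tags, weights):
    """P(t | h) = exp(sum_j lambda_j * f_j(h, t)) / Z(h)."""
    scores = {}
    for t in tags:
        scores[t] = math.exp(sum(weights.get(k, 0.0) * v
                                 for k, v in features(history, t).items()))
    z = sum(scores.values())                 # normalization term Z(h)
    return {t: s / z for t, s in scores.items()}

weights = {("word=sachin", "PER"): 2.0}      # assumed weight for the demo
h = {"word": "sachin", "is_first": True}
print(p_tag_given_history(h, ["PER", "LOC", "ORG", "MISC", "NNE"], weights))
```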

During testing, it is possible that the classifier produces a sequence of inadmissible classes (e.g., B-PER followed by LOC). To eliminate such sequences, we define a transition probability between word classes, P(c_i \mid c_{i-1}), which is equal to 1 if the sequence is admissible and 0 otherwise. The probability of the classes c_1, ..., c_n assigned to the words in a sentence s in a document D is then defined as:

P(c_1, ..., c_n \mid s, D) = \prod_{i=1}^{n} P(c_i \mid s, D) \times P(c_i \mid c_{i-1})    (3)

where P(c_i \mid s, D) is determined by the maximum entropy classifier. The beam search algorithm is then used to select the sequence of word classes with the highest probability. The features are binary valued functions which associate a NE tag with various elements of the context. For example,

f_j(h, t) = \begin{cases} 1 & \text{if } word(h) = \text{sachin and } t = \text{PER} \\ 0 & \text{otherwise} \end{cases}    (4)

The general-purpose Limited-Memory BFGS optimization method [10] has been used for the estimation of the ME parameters. We have used the C++ based ME package (http://homepages.inf.ed.ac.uk/s0450736/software/maxent/maxent-20061005.tar.bz2).
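Equation (3) and the admissibility constraint can be combined in a small decoder. The following sketch is our illustration, not the authors' implementation: it assumes per-word class probabilities like those produced above, a hard 0/1 transition rule that forbids sequences such as B-PER followed by LOC, and an arbitrary beam width.

```python
# Illustrative beam-search decoder for Eq. (3); the transition rules and the
# beam width are our assumptions, not details taken from the paper.

def admissible(prev_tag, tag):
    """Hard 0/1 transition P(c_i | c_{i-1}): an open multiword NE (B-/I-) must be
    continued by I-/E- of the same type; I-/E- cannot start a new entity."""
    if prev_tag[:2] in ("B-", "I-"):
        return tag in ("I-" + prev_tag[2:], "E-" + prev_tag[2:])
    return not tag.startswith(("I-", "E-"))

def beam_search(word_probs, beam_width=3):
    """word_probs: one dict per word mapping tag -> P(tag | s, D)."""
    beam = [([], 1.0)]                        # (partial tag sequence, probability)
    for probs in word_probs:
        candidates = []
        for seq, p in beam:
            prev = seq[-1] if seq else "NNE"  # sentinel for the sentence start
            for tag, q in probs.items():
                if not admissible(prev, tag):
                    continue                  # drop inadmissible class sequences
                candidates.append((seq + [tag], p * q))
        beam = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_width]
    return beam[0][0] if beam else []

word_probs = [{"B-PER": 0.7, "LOC": 0.3},
              {"E-PER": 0.6, "LOC": 0.4}]
print(beam_search(word_probs))               # ['B-PER', 'E-PER']
```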

III. NAMED ENTITY FEATURES IN INDIAN LANGUAGES

Feature selection plays a crucial role in the ME framework. Experiments have been carried out in order to find the most suitable features for NER in Bengali and Hindi. In addition, various gazetteer lists have been developed for use in the NER tasks, particularly for Bengali and Hindi. We have considered different combinations from the following set while searching for the best feature set for NER:

F = { w_{i-m}, ..., w_{i-1}, w_i, w_{i+1}, ..., w_{i+n}, |prefix| ≤ n, |suffix| ≤ n, dynamic NE tag(s), POS tag(s), First word, Length of the word, Digit information, Rare word feature, Gazetteer lists }

A. Language Independent Features

• Context word feature: Preceding and following words of a particular word.
• Word suffix/prefix: Various word suffixes and prefixes can be used as features in two different ways. The first is to use a fixed length (say, n) word suffix/prefix of the current and/or the surrounding word(s) as features. If the length of the corresponding word is less than or equal to n-1, then the feature values are not defined and are denoted by ND. The feature value is also not defined (ND) if the token itself is a punctuation symbol or contains any special symbol or digit. In addition to the fixed length suffixes, we have also used variable length suffixes as binary valued features.
• NE information: The NE tag(s) of the previous word(s) are used as features. This is the only dynamic feature in the experiment.
• First word: If the current token is the first word of a sentence, then the feature 'FirstWord' is set to 1; otherwise, it is set to 0.
• Length of the word: If the length of the current word is greater than three, then the feature 'Length' is set to 1; otherwise, it is set to 0. This is based on the observation that very short words are rarely NEs.
• Digit features: Several binary valued digit features have been considered depending upon the presence and/or the number of digits in a token, the combination of digits and punctuation symbols, and the combination of digits and symbols. The corresponding feature is set to 1 if the condition holds; otherwise, it is set to 0. These binary valued features are helpful in recognizing miscellaneous NEs, such as time expressions, monetary expressions, date expressions, measurements and numerical quantities.
• Frequent word list: The frequency of each word in the training corpus has been calculated, and for each language a cut-off frequency has been chosen; the words occurring at least that often form a frequent word list. A binary valued feature 'RareWord' is set to 1 for words that are not in this list; otherwise, it is set to 0. The intuition behind this feature is that the most frequently occurring words are rarely NEs.
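A compact way to picture these orthographic features is a per-token extractor. The sketch below is illustrative only: the feature names, the length threshold and the prefix/suffix length n mirror the description above, but the exact encoding used by the authors is not specified at this level of detail.

```python
# Sketch of the language-independent, word-level features described above.
# Feature names and the exact encoding are our assumptions.
import re

def word_features(tokens, i, n=3, frequent_words=frozenset()):
    w = tokens[i]
    feats = {
        "FirstWord": int(i == 0),
        "Length": int(len(w) > 3),                  # very short words are rarely NEs
        "ContainsDigit": int(any(c.isdigit() for c in w)),
        "DigitPunct": int(bool(re.fullmatch(r"[\d.,/:%-]+", w))
                          and any(c.isdigit() for c in w)),
        "RareWord": int(w not in frequent_words),
    }
    # fixed-length prefix/suffix; not defined (ND) for short or non-alphabetic tokens
    if len(w) > n - 1 and w.isalpha():
        feats["Prefix"] = w[:n]
        feats["Suffix"] = w[-n:]
    else:
        feats["Prefix"] = feats["Suffix"] = "ND"
    # context words around the current token
    feats["PrevWord"] = tokens[i - 1] if i > 0 else "<s>"
    feats["NextWord"] = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return feats

print(word_features(["sachin", "ramesh", "tendulkar"], 0))
```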

B. Language Dependent Features for Bengali

Language specific NE features have been identified from a Bengali news corpus [11]. The set of language dependent features is given as follows:
• Word suffix: Variable length suffixes of words are helpful in identifying NEs. This suffix information can be used as a binary valued feature: variable length suffixes of a word are matched against predefined lists of useful suffixes for the different classes of NEs. Lists of suffixes that are particularly helpful in detecting person names (e.g., -babu, -da, -di etc.) and location names (e.g., -land, -pur, -lia etc.) have been prepared. If the current word contains any suffix of this type, then a binary valued feature 'NESuffix' is set to 1; otherwise, it is set to 0.
• Part of Speech (POS) Information: POS information of the current and/or the surrounding word(s) can be used as features. Here, we have used a CRF-based POS tagger [12], which was originally developed with the help of 26 different POS tags defined for the Indian languages (http://shiva.iiit.ac.in/SPSAL2007/iiit_tagset_guidelines.pdf). For NER, we have considered a coarse-grained POS tagger that has only three POS tags, namely Nominal, PREP (postpositions) and Other.
• Gazetteer lists for Bengali: Various gazetteer lists have been developed either manually or semi-automatically from the Bengali news corpus [11]. These lists have been used as binary valued features of the ME model. No list includes ambiguous entries, i.e., entries that can appear in more than one gazetteer list. If the current token is in a particular list, then the corresponding feature is set to 1 for the current and/or the surrounding word(s); otherwise, it is set to 0. The gazetteers, along with their numbers of entries, are: Organization clue words (e.g., kong, limited etc.): 94, Person prefixes (e.g., sriman, sreemati etc.): 245, Middle names: 1,491, Surnames: 5,288, Common location words (e.g., sarani, road etc.): 547, Action verbs (e.g., balen, ballen etc.): 241, Function words: 743, Designation words (e.g., neta, sangsad etc.): 947, First names: 72,206, Location names: 7,870, Organization names: 2,225, Month names (English and Bengali calendars): 24, Weekdays (English and Bengali calendars): 14, Measurement clue words: 52.
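Gazetteer and suffix cues of this kind reduce to simple set-membership tests. The sketch below shows one possible encoding; the list names come from the description above, but their contents here are toy examples and the feature layout is our own.

```python
# Toy sketch of gazetteer- and suffix-based binary features for Bengali;
# the real lists are far larger (e.g., 72,206 first names) and not reproduced here.

PERSON_SUFFIXES = ("-babu", "-da", "-di")
LOCATION_SUFFIXES = ("-land", "-pur", "-lia")
GAZETTEERS = {
    "PersonPrefix": {"sriman", "sreemati"},
    "Designation": {"neta", "sangsad"},
    "CommonLocation": {"sarani", "road"},
}

def bengali_features(word):
    feats = {}
    # 'NESuffix'-style cues: variable length suffixes matched against the lists
    feats["PersonSuffix"] = int(any(word.endswith(s.lstrip("-")) for s in PERSON_SUFFIXES))
    feats["LocationSuffix"] = int(any(word.endswith(s.lstrip("-")) for s in LOCATION_SUFFIXES))
    # one binary feature per gazetteer list
    for name, entries in GAZETTEERS.items():
        feats[name] = int(word in entries)
    return feats

print(bengali_features("rampur"))   # {'PersonSuffix': 0, 'LocationSuffix': 1, ...}
```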

C. Language Dependent Features for Hindi

Language specific features for Hindi include the POS information of the words and various binary valued features created from gazetteers. The gazetteer lists include first names, middle names, last names, month names, weekdays, function words and measurement clue words. Person names have been collected from the Election Commission of India data (http://www.eci.gov.in/DevForum/Fullname.asp).
• POS information: We have used the CRF-based POS tagger [12], which was originally developed for Bengali. The POS tagger has been trained with the Hindi data obtained from the NLPAI_Contest06 (http://www.ltrc.iiit.net/nlpai_contest06) and SPSAL2007 (http://shiva.iiit.ac.in/SPSAL2007) competitions. POS information has been used in the same way as for Bengali.
• Gazetteer lists: First names (162,881 entries), Middle names (450 entries), Surnames (3,573 entries), Month names (24 entries), Weekdays (14 entries), Function words (653 entries) and Measurement clue words (52 entries).

TABLE 2
STATISTICS OF THE TRAINING AND DEVELOPMENT SETS

Language    Number of tokens in the training set    Number of tokens in the development set
Bengali     102,467                                 20K
Hindi       452,974                                 50K

IV. EXPERIMENTAL RESULTS

The ME based NER system has been trained with the Bengali, Hindi, Telugu, Oriya and Urdu data obtained from the IJCNLP-08 NER shared task for SSEAL. An automatic tag conversion routine has been developed to convert these twelve-tag NE annotated corpora to forms tagged with the four NE tags. A subset of each training set has been set aside as a development set in order to identify the best set of features for NER in each of the languages, and the ME model has been trained with the remaining data. Statistics of the datasets are presented in Table 2. We define the baseline model as the one where the NE tag probabilities depend only on the current word:

P(t_1, t_2, \ldots, t_n \mid w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(t_i, w_i)
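A minimal sketch of this baseline follows, under the assumption that it simply memorizes the most frequent tag seen for each training word; the paper does not spell out how unseen words are handled, so the default tag below is our assumption.

```python
# Most-frequent-tag baseline: each test word receives the NE tag it carried
# most often in the training data. Handling of unseen words is our assumption.
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_baseline(words, model, default="NNE"):
    return [(w, model.get(w, default)) for w in words]

model = train_baseline([[("sachin", "PER"), ("khelen", "NNE")]])
print(tag_baseline(["sachin", "khelen", "kolkata"], model))
```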

In this model, each word in the test data is assigned the NE tag that occurred most frequently for that word in the training data. A number of experiments were conducted with different combinations from the set F in order to identify the best-suited set of features for NER in each of the languages. From our empirical analysis, we found that the following combination gives the best results for Bengali and Hindi: F = [ w_{i-1}, w_i, w_{i+1}, |prefix| ... of the previous word ].

We also conducted experiments considering the dynamic NE information of the previous two words and observed lower f-scores in both languages. Finally, the system has demonstrated f-score values of 75.98% and 77.91% for Bengali and Hindi, respectively. One possible reason behind the better performance for Hindi might be its larger training set size compared to Bengali.
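The reported figures are standard recall, precision and f-score values; for reference, a small helper computing them from entity counts (our own utility, not part of the paper's toolchain) is shown below.

```python
# Standard NER evaluation metrics; this helper is ours, for reference only.
def prf(num_correct, num_predicted, num_gold):
    precision = num_correct / num_predicted if num_predicted else 0.0
    recall = num_correct / num_gold if num_gold else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score

print(prf(num_correct=80, num_predicted=95, num_gold=90))
```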

TABLE 3
RESULTS ON THE DEVELOPMENT SET FOR BENGALI