Lang Resources & Evaluation (2008) 42:173–182 DOI 10.1007/s10579-008-9064-x
A web-based Bengali news corpus for named entity recognition

Asif Ekbal · Sivaji Bandyopadhyay
Published online: 22 February 2008
© Springer Science+Business Media B.V. 2008
Abstract The rapid development of language resources and tools using machine learning techniques for less computerized languages requires an appropriately tagged corpus. A tagged Bengali news corpus has been developed from the web archive of a widely read Bengali newspaper. A web crawler retrieves the web pages in Hyper Text Markup Language (HTML) format from the news archive. At present, the corpus contains approximately 34 million wordforms. Named Entity Recognition (NER) systems based on pattern-based shallow parsing, with and without linguistic knowledge, have been developed using a part of this corpus. The NER system that uses linguistic knowledge performs better, yielding highest F-Score values of 75.40%, 72.30%, 71.37%, and 70.13% for person, location, organization, and miscellaneous names, respectively.

Keywords Web as corpus · News corpus · Web-based tagged Bengali news corpus · Named entity · Named entity recognition
A. Ekbal · S. Bandyopadhyay
Department of Computer Science and Engineering, Jadavpur University, Kolkata 700032, India
A. Ekbal e-mail: [email protected]; [email protected]
S. Bandyopadhyay e-mail: [email protected]; [email protected]

1 Introduction

The mode of language technology work has changed dramatically over the last few years, with the web being used as a data source in a wide range of research activities. The web is anarchic, and its use is not in the familiar territory of computational linguistics. The web walked into the ACL meetings starting in 1999. The use of the web as a corpus for teaching and research on language has been proposed a number of times (Rundell 2000; Fletcher 2001; Robb 2003; Fletcher 2004). There has been a special issue of the Computational Linguistics journal on the web as corpus (Kilgarriff and Grefenstette 2003).
Several studies have used different methods to mine web data. The goals of the WaCky project (http://www.wacky.sslmit.unibo.it) include the development of tools that allow linguists to crawl a section of the web, process the data, index them and search them. Baroni and Bernardini (2004) built a corpus by iteratively searching Google for a small set of seed terms. Rayson et al. (2006) proposed a technique to facilitate the use of the annotated web as corpus by alleviating the annotation bottleneck for corpus data drawn from the web. Boleda et al. (2006) presented CUCWeb, a 166-million-word corpus for Catalan, built by crawling the web.

There is a long history of creating standards for western language resources. The human language technology (HLT) community in Europe has been particularly zealous about standardization, making a series of attempts such as EAGLES (http://www.ilc.cnr.it/Eagles96/home.html), PAROLE/SIMPLE (Lenci et al. 2000), ISLE/MILE (Calzolari et al. 2003; Bertagna et al. 2004) and, more recently, multilingual lexical database generation from parallel texts in 20 European languages (Giguet and Luquet 2006). On the other hand, in spite of their great linguistic and cultural diversity, Asian language resources have received much less attention than their western counterparts. A new project (Tokunaga et al. 2006) has started to create a common standard for Asian language resources. It extends an existing description framework, MILE (Bertagna et al. 2004), to describe several lexical entries of Japanese, Chinese and Thai. India is a multilingual country with a lot of cultural diversity. Bharati et al. (2001) report on efforts to create lexical resources such as transfer lexicons and grammars from English to several Indian languages and dependency tree banks of annotated corpora for several Indian languages. However, no corpus development work from the web has been reported for India as yet.

Newspapers are a huge source of readily available documents. In the present work, the corpus has been developed from the web archive of a very well known and widely read Bengali newspaper. Bengali is the fifth most popular language in the world, second in India and the national language of Bangladesh. Various types of news (International, National, State, Sports, Business etc.) are collected in the corpus, so a variety of linguistic features of Bengali are covered. A code conversion routine has been written that converts the proprietary codes used in the newspaper into the standard Indian Script Code for Information Interchange (ISCII) form, which can then be processed for various tasks. A separate code conversion routine has been developed for converting ISCII codes to UTF-8 codes, so the Bengali news corpus is also available in UTF-8.

The problem of correct identification of named entities (NEs) is specifically addressed and benchmarked by the developers of information extraction systems such as GATE (Cunningham 2002). The NOMEN algorithm (Yangarber et al. 2002) is used for learning generalized names in text; it uses a novel form of bootstrapping to grow sets of textual instances and their contextual patterns. The framework of Okanohara et al. (2006) can handle the named entity recognition (NER) task with long NEs and many labels, which increase the computational cost. In Indian languages, however, no published work in the area of NER is available. In the present work, two different models of the NER system have been developed, one (Model A) without using linguistic knowledge and the other (Model B) using linguistic knowledge.

The development of the tagged Bengali news corpus is described in Sect. 2. Section 3 deals with the use of the corpus in developing the Bengali NER systems, i.e., the models. The evaluation results of the NER systems are presented in Sect. 4. Finally, Sect. 5 concludes the paper.
2 Development of the tagged Bengali news corpus from the web

The development of the tagged Bengali news corpus is described in terms of language resource acquisition using a web crawler, language resource creation that includes HTML file cleaning and code conversion, and language resource annotation that involves defining a tagset and the subsequent tagging of the news corpus.
2.1 Language resource acquisition

A web crawler has been developed to retrieve the web pages in Hyper Text Markup Language (HTML) format from the news archive of a leading Bengali newspaper within a range of dates provided as input. The crawler generates the Uniform Resource Locator (URL) address for the index (first) page of any particular date. The index page contains the actual news page links as well as links to some other pages (e.g., Advertisement, TV schedule, Tender, Comics and Weather) that do not contribute to the corpus generation. The HTML files that contain news documents are identified, and the rest of the HTML files are not considered further.
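A minimal sketch of such a crawler is given below. The archive URL layout, the date format and the link-filtering keywords are hypothetical placeholders, not the actual structure of the newspaper's archive.

# Sketch of a date-driven archive crawler (hypothetical URL layout and filters).
import datetime
import re
import urllib.parse
import urllib.request

ARCHIVE_INDEX = "http://example-newspaper.com/archive/{d:%Y/%m/%d}/index.html"  # placeholder
SKIP_KEYWORDS = ("advertisement", "tvschedule", "tender", "comics", "weather")  # placeholder

def crawl(start: datetime.date, end: datetime.date):
    """Yield (date, url, raw_html) for every candidate news page in the date range."""
    day = start
    while day <= end:
        index_url = ARCHIVE_INDEX.format(d=day)
        try:
            index_html = urllib.request.urlopen(index_url).read().decode("utf-8", "ignore")
        except OSError:
            day += datetime.timedelta(days=1)
            continue
        # Follow only the links that look like news pages; drop the non-news sections.
        for link in re.findall(r'href="([^"]+\.html?)"', index_html):
            if any(key in link.lower() for key in SKIP_KEYWORDS):
                continue
            page_url = urllib.parse.urljoin(index_url, link)
            yield day, page_url, urllib.request.urlopen(page_url).read()
        day += datetime.timedelta(days=1)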
2.2 Language resource creation

The HTML files that contain news documents are identified by the web crawler and require cleaning to extract the Bengali text to be stored in the corpus along with relevant details. Each HTML file is scanned from the beginning to look for tags like <font face="BENGALI_FONT_NAME">, where BENGALI_FONT_NAME is the name of one of the Bengali font faces as defined in the news archive. The Bengali text enclosed within the font tags is retrieved and stored in the database after appropriate tagging. Pictures, captions and tables may exist anywhere within the actual news. Tables are an integral part of the news item; the pictures, their captions and other HTML tags that are not relevant to our text processing tasks are discarded during file cleaning. The Bengali news corpus has been developed in both ISCII and UTF-8 codes. Currently, the tagged news corpus contains 108,305 news documents covering about five years (2001–2005) of news. Some statistics about the tagged news corpus are presented in Table 1.
Table 1 Corpus statistics

Total no. of news documents in the corpus          108,305
Total no. of sentences in the corpus               2,822,737
Avg. no. of sentences in a document                27
Total no. of wordforms in the corpus               33,836,736
Avg. no. of wordforms in a document                313
Total no. of distinct wordforms in the corpus      467,858
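The cleaning and code-conversion step of Sect. 2.2 can be approximated as in the sketch below. The font-face names are placeholders, the assumed markup is <font face="...">, and the ISCII-to-Unicode offset rule is a simplification of the full table-driven conversion actually required.

# Sketch of HTML cleaning: keep only the text enclosed within Bengali <font> tags.
from html.parser import HTMLParser

BENGALI_FONTS = {"BengaliFontA", "BengaliFontB"}      # placeholder font-face names

class BengaliTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.font_stack = []      # one flag per open <font> tag: is its face Bengali?
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "font":
            self.font_stack.append(dict(attrs).get("face") in BENGALI_FONTS)

    def handle_endtag(self, tag):
        if tag == "font" and self.font_stack:
            self.font_stack.pop()

    def handle_data(self, data):
        if any(self.font_stack):                      # inside a Bengali font region
            self.chunks.append(data.strip())

def extract_bengali_text(html: str) -> str:
    parser = BengaliTextExtractor()
    parser.feed(html)
    return " ".join(chunk for chunk in parser.chunks if chunk)

# ISCII (IS 13194) Bengali codes 0xA1-0xFA map to the Unicode Bengali block
# (starting at U+0981) by a near-constant offset; a production converter needs
# an explicit exception table, so this one-liner is only an approximation.
def iscii_to_unicode(byte_string: bytes) -> str:
    return "".join(chr(b + 0x0981 - 0xA1) if 0xA1 <= b <= 0xFA else chr(b) for b in byte_string)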
2.3 Language resource annotation

The Bengali news corpus collected from the web is annotated using a tagset that includes the type and subtype of the news, title, date, reporter or agency name, news location and the body of the news.

2.3.1 Tagset in a news corpus

A news corpus, whether in Bengali or in any other language, has different parts such as title, date, reporter, location and body. To identify these parts in a news corpus, the tagset described in Table 2 has been defined.

2.3.2 Tagged corpus development

A news document is stored in the corpus in XML format using the tagset mentioned in Table 2. In the HTML news file, the date is stored first and is divided into three parts. The first one is the date according to the Bengali calendar, the second one is the day in Bengali and the last one is the date according to the English calendar. Both the Bengali and English dates are stored in the form "day month year". A sequence of four Bengali digits separates the Bengali date from the Bengali day, while the English date starts with one or two digits in Bengali font. The Bengali date, day, and English date can thus be distinguished by checking the appearance of the numerals, and they are tagged as <bd>, <day>, and <ed>, respectively. For example, 25 sraban 1412 budhbar 10 august 2005 is tagged as shown in Table 3.
Table 2 News corpus tagset

Tag        Definition
header     Header of the news document
title      Headline of the news document
t1         First headline of the title
t2         Second headline of the title
date       Date of the news document
bd         Bengali date
day        Day
ed         English date
reporter   Reporter name
agency     Agency providing news
location   The news location
body       Body of the news document
p          Paragraph
table      Information in tabular form
tc         Table column
tr         Table row
Table 3 Example of a tagged date pattern

Original date pattern    Tagged date pattern
25 sraban 1412           <bd>25 sraban 1412</bd>
budhbar                  <day>budhbar</day>
10 august 2005           <ed>10 august 2005</ed>
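The date-line splitting described above can be sketched with a single regular expression, as below. The Bengali digit range U+09E6–U+09EF is standard Unicode; the tokenization and the exact shape of the date line are simplifying assumptions.

# Sketch of the date-line splitter: Bengali date / Bengali day / English-calendar date.
import re

BENGALI_DIGITS = "\u09e6-\u09ef"          # Bengali digits 0-9

def split_date_line(line: str):
    """Return (bengali_date, bengali_day, english_date) or None if the line does not match."""
    m = re.match(
        rf"^(?P<bd>.*?[{BENGALI_DIGITS}]{{4}})\s+"      # Bengali date ends with a four-digit year
        rf"(?P<day>\S+)\s+"                             # day name in Bengali
        rf"(?P<ed>[{BENGALI_DIGITS}]{{1,2}}\s+\S+\s+[{BENGALI_DIGITS}]{{4}})$",  # English-calendar date
        line.strip(),
    )
    return (m.group("bd"), m.group("day"), m.group("ed")) if m else None

# For the line transliterated in Table 3 as "25 sraban 1412 budhbar 10 august 2005"
# (written in Bengali script in the corpus), this would return
# ("25 sraban 1412", "budhbar", "10 august 2005").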
Next comes the title of the news. In this corpus, the title is kept first. The title is stored in between paragraph tags (<p> ... </p>) in the HTML file. The reporter name comes after the title in the HTML file, but confusion arises when a second title (sub-heading) appears. To solve this problem, the following heuristic is used: if the length of the part in Bengali is greater than 25 characters or three words, it is treated as a second title; otherwise it is considered a reporter name. The reporter name and the location name are stored in the same paragraph in the HTML files but are separated by a special character. All the parts, i.e., title, date, reporter, agency and location, are separately tagged under the general tag <header>. In the header, two attribute fields called type and subtype are kept. The type field is assigned the class of the news. The news items have been classified by geographic domain (International, National, State, District, Metro [Kolkata]) as well as by topic domain (Politics, Sports, Business). The type of the news item can be selected from the HTML page using a specific tag. In the news archive, the news from the districts is classified further; this classification is noted under subtype and the type attribute is recorded as District. The news body that comes next is divided into a number of paragraphs. The entire news body is stored in the corpus under the tag <body>, and each paragraph of the news body is stored using the tag <p>. The following is the structure of a tagged news document:

<header type="..." subtype="...">
  <title> <t1> title name </t1> </title>
  <date> <bd> bengali date </bd> <day> day in bengali </day> <ed> english date </ed> </date>
  <reporter> reporter name </reporter>
  <agency> agency name </agency>
  <location> location name </location>
</header>
<body>
  <p> first paragraph of news document </p>
  .....
  <p> last paragraph of the news document </p>
</body>

Titles in news documents with multiple headlines are tagged as follows:

<title> <t1> 1st title </t1> <t2> 2nd title </t2> </title>
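The second-title versus reporter-name heuristic and the assembly of the <header> element can be sketched as follows. The 25-character and three-word thresholds come from the description above; the nesting of <bd>, <day> and <ed> under <date> follows the tagset of Table 2, while the attribute handling and function signatures are illustrative assumptions.

# Sketch of second-title vs. reporter-name classification and <header> assembly.
import xml.etree.ElementTree as ET

def classify_post_title_paragraph(text: str) -> str:
    """Return 't2' (second headline) or 'reporter' for the paragraph following the title."""
    text = text.strip()
    if len(text) > 25 or len(text.split()) > 3:
        return "t2"            # long enough to be treated as a second headline
    return "reporter"          # short: treated as the reporter name

def build_header(t1, t2, bd, day, ed, reporter, agency, location, news_type, subtype=""):
    """Assemble a <header> element with type/subtype attributes (sketch)."""
    attrs = {"type": news_type}
    if subtype:
        attrs["subtype"] = subtype
    header = ET.Element("header", attrs)
    title = ET.SubElement(header, "title")
    ET.SubElement(title, "t1").text = t1
    if t2:
        ET.SubElement(title, "t2").text = t2
    date = ET.SubElement(header, "date")
    for tag, value in (("bd", bd), ("day", day), ("ed", ed)):
        ET.SubElement(date, tag).text = value
    for tag, value in (("reporter", reporter), ("agency", agency), ("location", location)):
        if value:
            ET.SubElement(header, tag).text = value
    return header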
3 Use of language resources

The Bengali news corpus developed in this work has been used to develop the Bengali NER systems. NE identification in Indian languages in general, and in Bengali in particular, is difficult and challenging; unlike English, there is no concept of capitalization in Bengali. A semi-supervised learning system, based on pattern-directed shallow parsing, has been used to identify named entities in Bengali from the tagged Bengali news corpus. The reporter, location, agency, and different date tags of the tagged Bengali news corpus help to identify person, location, organization, and miscellaneous names, respectively. The words automatically extracted from the reporter, location and agency tags of the tagged news corpus are treated as the initial seed data and put into the appropriate seed lists. In addition to these extracted words, the most frequently occurring person, location and organization names have been collected from the different domains of the newspaper and kept in the corresponding seed lists. At present, the person, location, and organization seed lists contain 253, 215, and 146 entries, respectively. Date expressions have some fixed patterns in Bengali, so there is no need to put them in a separate seed list; there is also no seed list for miscellaneous names. Two different NER systems, one using only lexical contextual patterns (the NER system without linguistic features, i.e., Model A) and the other using linguistic features along with the same set of lexical contextual patterns (the NER system with linguistic features, i.e., Model B), have been developed. The performance of the two systems has been compared using the standard Recall, Precision and F-Score evaluation parameters.

3.1 Tagging with seed list and clue words

The tagger places the left and right tags around each occurrence of the named entities of the seed lists in the corpus. For both models, A and B, the training corpus is tagged with the help of the different seed lists. In the case of Model B, the corpus is also tagged with the help of different internal and external evidences that help to identify different NEs. It uses clue words such as surnames, middle names, prefixes, and suffixes for person names. A list of common words is kept that often helps to determine the presence of person names in the text. It also considers the different affixes that may occur with location names, as well as several clue words that are helpful in detecting organization names. The tagging algorithm further uses a list of words that may appear as part of a named entity as well as the common words. The linguistic clue words are kept in order to tag more NEs during the training of the system; as a result, more potential patterns are generated in the lexical pattern generation phase.
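A minimal sketch of the seed-list tagger follows. The tag names, the whole-token matching strategy and the transliterated seed entries are illustrative assumptions; Model B's clue-word lists are not shown here.

# Sketch of seed-list tagging: wrap every seed occurrence in <type> ... </type> tags.
import re

seed_lists = {
    "person":       ["Rabindranath Thakur"],        # transliterated placeholder entries
    "location":     ["Kolkata"],
    "organization": ["Jadavpur Vishwavidyalay"],
}

def tag_with_seeds(text: str, seeds: dict) -> str:
    """Insert left/right NE tags around each seed-list entry found in the text."""
    for ne_type, entries in seeds.items():
        for entry in sorted(entries, key=len, reverse=True):       # longest entries first
            pattern = r"(?<!\S)" + re.escape(entry) + r"(?!\S)"    # whole-token match
            text = re.sub(pattern, f"<{ne_type}>{entry}</{ne_type}>", text)
    return text

In Model B, the same routine would additionally consult the clue-word lists (surnames, prefixes and suffixes, location affixes, organization keywords) before placing a tag.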
3.2 Lexical seed patterns generation from the training corpus

For each tag T inserted in the training corpus, the algorithm generates a lexical pattern p using a context window of maximum width 4 (excluding the tagged NE) around the left and the right tags, e.g.,

p = [ l-2 l-1 <T> ... </T> l+1 l+2 ],

where the l±i are the context of p. Any of the l±i may be a punctuation symbol; in such cases, the width of the lexical pattern will vary. Various pattern examples are shown in Table 4. These lexical patterns are generalized by replacing the tagged NE <T> ... </T> by the tag itself. These different types of patterns form the set of potential seed patterns, denoted by P. All these patterns, derived from the different tags of the tagged training corpus, are stored in a Seed Pattern Table, which has four fields: pattern id (identifies any particular pattern), pattern (the generalized lexical pattern), pattern type (person name/location name/organization name) and relative frequency (the number of times the pattern appears in the entire training corpus relative to the total number of patterns generated).

3.3 Generation of new patterns through bootstrapping

Every pattern p in the set P is matched against the entire training corpus. At a place where the context of p matches, the system predicts that one boundary of a name occurs in the text. The system considers all possible noun, verb and adjective inflections during matching. During pattern checking, the maximum length of a named entity is considered to be six words. Each named entity so obtained in the training corpus is manually checked for correctness. The training corpus is further tagged with these newly acquired named entities to identify further lexical patterns, and the bootstrapping is applied to the training corpus until no new patterns can be generated. The patterns are added to the pattern set P with the type and relative frequency fields set properly, if they are not already in the pattern set P with the same type. Any particular pattern in the set of potential patterns P may occur many times, but with different types and with equal or different relative frequency values. For each pattern of the set P, the frequencies of its occurrences as person, location and organization names are calculated. For candidate pattern acquisition, a particular threshold value of relative frequency is chosen. If the relative frequency of a particular pattern (along with its type) is less than this threshold, then the pattern (only for that type) is discarded; otherwise it is added to the Seed Pattern Table. The same procedure is followed for all other patterns. These patterns form the set of accepted patterns, denoted by Accept Pattern. A particular pattern may appear more than once with different types in this set, so, while testing the NER models, some identified NEs may be assigned more than one NE category. Model A cannot deal with this NE-classification disambiguation problem; the different linguistic features used in Model B during tagging have been used to deal with it.
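The pattern extraction of Sect. 3.2 and the relative-frequency filter of Sect. 3.3 can be sketched as follows. The tag syntax, the whitespace tokenization and the dictionary standing in for the Seed Pattern Table are simplifications; punctuation-shortened patterns and inflection handling are omitted.

# Sketch of lexical seed-pattern generation and the relative-frequency filter.
import re
from collections import Counter

TAG_RE = re.compile(r"<(person|location|organization)>(.+?)</\1>")

def extract_patterns(tagged_sentence: str):
    """Yield (generalized_pattern, ne_type) for every tagged NE in a sentence."""
    for m in TAG_RE.finditer(tagged_sentence):
        ne_type = m.group(1)
        left = tagged_sentence[: m.start()].split()[-2:]    # up to two left-context tokens
        right = tagged_sentence[m.end():].split()[:2]       # up to two right-context tokens
        yield " ".join(left + [f"<{ne_type}> ... </{ne_type}>"] + right), ne_type

def accept_patterns(tagged_sentences, threshold: float):
    """Return the patterns whose relative frequency reaches the threshold, per type."""
    counts = Counter()
    for sentence in tagged_sentences:
        counts.update(extract_patterns(sentence))
    total = sum(counts.values()) or 1
    return {(pattern, ne_type): count / total
            for (pattern, ne_type), count in counts.items()
            if count / total >= threshold}

The bootstrapping loop of Sect. 3.3 would then apply the accepted patterns to untagged text, tag the newly found (and manually verified) names, and re-run both functions until no new pattern is produced.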
Table 4 Lexical pattern examples

Lexical pattern: l-2 l-1 person name l+1 l+2
Remarks: None of l-2, l-1, l+1 or l+2 is a punctuation symbol

Lexical pattern: l-1 person name l+1 l+2
Remarks: (i) l-2 is a punctuation symbol, or (ii) l-1 is the start of a sentence

Lexical pattern: person name l+1 l+2
Remarks: (i) the person name appears at the beginning of a sentence, or (ii) l-1 is a punctuation symbol

Lexical pattern: l-2 l-1 person name l+1
Remarks: (i) l+2 is a punctuation symbol, or (ii) l+1 appears at the end of a sentence

Lexical pattern: l-2 l-1 person name
Remarks: (i) the person name appears at the end of a sentence, or (ii) l+1 is a punctuation symbol

Lexical pattern: l-1 person name l+1
Remarks: (i) l-1 is the beginning and l+1 is the end of a sentence, or (ii) l-1 is the beginning and l+2 is a punctuation symbol, or (iii) l-2 is a punctuation symbol and l+1 is the end of a sentence, or (iv) l-2 and l+2 are punctuation symbols

Lexical pattern: l-1 person name
Remarks: (i) l-2, l+1 and l+2 are punctuation symbols, or (ii) l-2 is a punctuation symbol and the person name appears at the end of a sentence, or (iii) l-1 is the beginning and the person name appears at the end of that sentence, or (iv) l-1 is the beginning and l+1, l+2 are punctuation symbols

Lexical pattern: person name l+1
Remarks: (i) l-2, l-1 and l+2 are punctuation symbols, or (ii) l-2 and l-1 are punctuation symbols and l+1 appears at the end of a sentence, or (iii) the person name appears at the beginning of a sentence and l+2 is a punctuation symbol, or (iv) the person name appears at the beginning of a sentence and l+1 appears at the end of a sentence
4 Evaluation results

To evaluate the NER models, the set of accepted patterns is applied on a test set. The process of pattern matching can be considered as a shallow parsing process.

4.1 Training and test set

A semi-supervised learning method has been followed to develop the NER systems. The systems have been trained on a tagged Bengali news corpus; some statistics of the training corpus are given in Table 5. A manually tagged test set (gold standard test set) has been used to evaluate the Bengali NER systems. The test set consists of approximately 5,000 sentences.
4.2 Experimental results

The Bengali NER systems have been evaluated in terms of the Recall, Precision and F-Score parameters. The actual numbers of the different types of NEs present in the test set are known in advance and are noted.
Table 5 Statistics of the training set

Total number of news documents                      1,819
Total number of sentences in the corpus             44,432
Average number of sentences in a document           25
Total number of wordforms in the corpus             541,171
Average number of wordforms in a document           298
Total number of distinct wordforms in the corpus    45,626
Table 6 Result for gold standard test set

NE category    Model A                     Model B
               R       P       FS          R       P       FS
PN             70.10   74.80   72.37       74.90   75.90   75.40
LOC            66.90   69.70   68.30       70.80   73.70   72.30
ORG            65.50   67.70   66.58       69.90   72.90   71.37
MISC           54.10   99.40   70.07       54.20   99.30   70.13
Each pattern of the Accept Pattern set is matched against the test set according to the pattern matching process described in Sect. 3.3, and the identified NEs are stored in the appropriate NE category tables according to their types. A particular identified NE may appear in more than one NE category table. Model A cannot deal with such situations and always assigns the most probable type to the NE; as a result, its precision values suffer. On the other hand, the different linguistic features, used as the clue words in Sect. 3.1 for the identification of different types of NEs, are used in Model B to assign the actual categories (NE types) to the identified NEs. Once the actual category of a particular NE is determined, it is removed from the other NE category tables. Some person, location, organization and miscellaneous names that appear in fixed places of the newspaper can be identified from the appropriate tags in the test set. Recall (R), Precision (P) and F-Score (FS) are computed for each individual NE category, i.e., for person names (PN), location names (LOC), organization names (ORG) and miscellaneous names (MISC). The performance of the systems on the gold standard test set is presented in Table 6. It is observed that the NER system with linguistic features, i.e., Model B, outperforms the NER system without linguistic features, i.e., Model A, in terms of Recall, Precision and F-Score. Linguistic knowledge plays the key role in enhancing the performance of Model B compared to Model A.
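The Recall, Precision and F-Score values in Table 6 follow the usual definitions; the sketch below computes them per NE category, with exact boundary matching reduced to a set intersection for simplicity.

# Sketch of per-category Recall / Precision / F-Score computation (values in percent).
def prf(gold: set, predicted: set):
    """gold and predicted hold (entity, start, end) tuples for one NE category."""
    correct = len(gold & predicted)
    recall = 100.0 * correct / len(gold) if gold else 0.0
    precision = 100.0 * correct / len(predicted) if predicted else 0.0
    f_score = 2 * recall * precision / (recall + precision) if (recall + precision) else 0.0
    return recall, precision, f_score

# Hypothetical example (not the paper's data):
# prf({("kolkata", 5, 12)}, {("kolkata", 5, 12), ("dhaka", 40, 45)})  ->  (100.0, 50.0, ~66.7)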
5 Conclusion

The tagged Bengali news corpus developed in this work can be used as a large data source in various natural language processing research areas. The corpus has been developed from a particular newspaper, and a similar corpus development methodology, with some minor variations, can be used to develop corpora from other newspapers available on the web in India and Bangladesh. Building NER systems for Bengali using statistical techniques such as Hidden Markov Models (HMM), Maximum Entropy Models (MEM), Conditional Random Fields (CRF) and Support Vector Machines (SVM), and analyzing the performance of these systems, are other interesting future tasks.
References

Baroni, M., & Bernardini, S. (2004). BootCaT: Bootstrapping corpora and terms from the web. In Proceedings of LREC 2004, Lisbon, pp. 1313–1316.
Bertagna, F., Lenci, A., Monachini, M., & Calzolari, N. (2004). Content interoperability of lexical resources: Open issues and "MILE" perspectives. In Proceedings of LREC 2004, pp. 131–134.
Bharati, A., Sharma, D. M., Chaitanya, V., Kulkarni, A. P., & Sangal, R. (2001). LERIL: Collaborative effort for creating lexical resources. In Proceedings of the 6th NLP Pacific Rim Symposium Post-Conference Workshop, Japan.
Boleda, G., Bott, S., Meza, R., Castillo, C., Badia, T., & Lopez, V. (2006). CUCWeb: A Catalan corpus built from the web. In Proceedings of the Second International Workshop on Web as Corpus, Trento, Italy, pp. 19–26.
Calzolari, N., Bertagna, F., Lenci, A., & Monachini, M. (2003). Standards and best practice for multilingual computational lexicons, MILE (the multilingual ISLE lexical entry). ISLE Deliverable D2.2 & 3.2.
Cunningham, H. (2002). GATE, a general architecture for text engineering. Computers and the Humanities, 36, 223–254.
Fletcher, W. H. (2001). Concordancing the web with KWiCFinder. In Proceedings of the Third North American Symposium on Corpus Linguistics and Language Teaching, Boston, MA, 23–25 March 2001.
Fletcher, W. H. (2004). Making the web more useful as a source for linguists' corpora. In U. Connor & T. A. Upton (Eds.), Applied corpus linguistics: A multidimensional perspective (pp. 191–205). Amsterdam: Rodopi.
Giguet, E., & Luquet, P. (2006). Multilingual lexical database generation from parallel texts in 20 European languages with endogenous resources. In Proceedings of COLING/ACL 2006, Sydney, pp. 271–278.
Kilgarriff, A., & Grefenstette, G. (2003). Introduction to the special issue on the web as corpus. Computational Linguistics, 29(3), 333–347.
Lenci, A., Bel, N., Busa, F., Calzolari, N., Gola, E., Monachini, M., Ogonowsky, A., Peters, I., Peters, W., Ruimy, N., Villegas, M., & Zampolli, A. (2000). SIMPLE: A general framework for the development of multilingual lexicons. International Journal of Lexicography, Special Issue on Dictionaries, Thesauri and Lexical-Semantic Relations, XIII(4), 249–263.
Okanohara, D., Miyao, Y., Tsuruoka, Y., & Tsujii, J. (2006). Improving the scalability of semi-Markov conditional random fields for named entity recognition. In Proceedings of COLING/ACL 2006, Sydney, pp. 465–472.
Rayson, P., Walkerdine, J., Fletcher, W. H., & Kilgarriff, A. (2006). Annotated web as corpus. In Proceedings of the Second International Workshop on Web as Corpus, Trento, Italy, pp. 27–33.
Robb, T. (2003). Google as a corpus tool? ETJ Journal, 4(1), Spring.
Rundell, M. (2000). The biggest corpus of all. Humanising Language Teaching, 2(3).
Tokunaga, T., Sornlertlamvanich, V., Charoenporn, T., Calzolari, N., Monachini, M., Soria, C., Huang, C., YingJu, X., Hao, Y., Prevot, L., & Shirai, K. (2006). Infrastructure for standardization of Asian language resources. In Proceedings of COLING/ACL 2006, Sydney, pp. 827–834.
Yangarber, R., Lin, W., & Grishman, R. (2002). Unsupervised learning of generalized names. In Proceedings of COLING 2002.