Named entity recognition (NER) - CENG METU

Rule-based Named Entity Recognition from Turkish Texts

Dilek Küçük (1) and Adnan Yazıcı (2)

(1) TÜBİTAK - Uzay Institute, Ankara, Turkey
(2) Dept. of Computer Engineering, METU, Ankara, Turkey

Outline

- Introduction
- Rule-based Named Entity Recognizer for Turkish
- Evaluation
- Discussion
- Conclusion

30.06.2009


Introduction [1]

- Information extraction (IE) is the determination of important entities, relations, and events in free natural language texts (Grishman, 2003).
- Named entity recognition (NER) is one of the main IE tasks.
  - NER is the recognition of information units such as person, organization, and location names, along with numeric and temporal expressions (Nadeau and Sekine, 2007).

Introduction [2]

- IE research on Turkish is relatively rare. Existing work includes:
  - Language-independent IE system (Cucerzan and Yarowsky, 1999)
  - Statistical name tagger for Turkish (Tür et al., 2003)
  - Person name tagger for financial news texts (Bayraktar and Taşkaya-Temizel, 2008)
  - Person mention extractor and a string-matching-based coreference resolver (Küçük and Yazıcı, 2008)

Introduction [3]

- Although rule-based systems for IE usually achieve high accuracies, they are criticized for their performance degradation when they are ported to domains distinct from the domain for which the rules have been created (Grishman, 2003).
- In this paper, we report the evaluation results of a rule-based NER system for Turkish on different text genres, including news articles, children's stories, and historical texts.

Rule-based Named Entity Recognizer for Turkish [1]

- The rule-based NER system for Turkish employs a set of lexical resources and a set of pattern bases.
  - It was mainly proposed for the domain of news texts in Turkish.
- Capitalization and punctuation clues are not utilized during the NER procedure, so that the system stays robust to noisy input in which these clues are missing.
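Since capitalization clues are not used, one way to make lexicon lookups insensitive to case is to lowercase the input before matching. The sketch below is only an illustration and not the system's actual code: the function name `turkish_lower` is hypothetical, and it treats the dotted and dotless i explicitly because Python's built-in `str.lower()` does not apply Turkish casing rules.

```python
def turkish_lower(text: str) -> str:
    """Lowercase text using Turkish casing for the letters I and İ."""
    # Map the two Turkish-specific capitals first: 'İ' -> 'i' and 'I' -> 'ı'.
    # str.lower() alone would turn 'I' into 'i' and 'İ' into 'i' plus a
    # combining dot, so neither would match a lowercase lexicon entry.
    return text.replace("İ", "i").replace("I", "ı").lower()


print(turkish_lower("İSTANBUL"))  # istanbul
print(turkish_lower("IŞIK"))      # ışık
```

With this normalization, "ANKARA", "Ankara", and "ankara" all map to the same lookup key.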

Rule-based Named Entity Recognizer for Turkish [2]

- The lexical resources include:
  - a dictionary of person names in Turkish comprising about 8300 entries,
  - a list of well-known political people,
  - a list of well-known locations (the names of cities and towns) in Turkey as well as in the world,
  - a list of well-known organizations in Turkey and those in the world.
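A lookup against lexical resources of this kind can be sketched as a longest-match search over the token sequence. This is a minimal illustration under stated assumptions, not the system's implementation: the lexicon entries, the `LEXICONS` layout, the `MAX_ENTRY_TOKENS` limit, and the function name are all hypothetical, and the input is assumed to be lowercased already.

```python
# Toy stand-ins for the lexical resources; the entries are illustrative only.
LEXICONS = {
    "PERSON": {"ahmet", "ayşe", "mustafa kemal"},
    "LOCATION": {"ankara", "istanbul", "new york"},
    "ORGANIZATION": {"birleşmiş milletler"},
}
MAX_ENTRY_TOKENS = 3  # longest entry (in tokens) that we try to match (assumption)


def lookup_entities(text):
    """Return (phrase, label) pairs found by longest-match lexicon lookup."""
    tokens = text.split()
    found = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate first so "new york" wins over "new".
        for n in range(min(MAX_ENTRY_TOKENS, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            label = next(
                (lab for lab, lex in LEXICONS.items() if candidate in lex), None
            )
            if label is not None:
                found.append((candidate, label))
                i += n  # skip past the matched phrase
                matched = True
                break
        if not matched:
            i += 1
    return found


print(lookup_entities("ahmet dün ankara ile new york arasında uçtu"))
```

Inflected forms such as "ankarada" are not matched by a plain lookup like this one.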

Rule-based Named Entity Recognizer for Turkish [3]

- Pattern bases encompass several patterns for the extraction of location/organization names as well as that of the numeric/temporal expressions.
- The system makes use of a simple morphological analyzer to validate candidates.
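The two ideas above, patterns for numeric/temporal expressions and morphological validation of candidates, can be illustrated with a small sketch. The slides do not list the actual patterns or describe the analyzer, so everything here is an assumption: the regular expressions, the short suffix list (locative and ablative endings only), and all function names are hypothetical, and the input is assumed to be lowercased.

```python
import re

MONTHS = ("ocak|şubat|mart|nisan|mayıs|haziran|temmuz|"
          "ağustos|eylül|ekim|kasım|aralık")
# Hypothetical temporal pattern: day, month name, optional four-digit year.
DATE_PATTERN = re.compile(rf"\b\d{{1,2}} (?:{MONTHS})(?: \d{{4}})?\b")
# Hypothetical numeric pattern: "yüzde 78,7" or "% 78,7" (comma as decimal mark).
PERCENT_PATTERN = re.compile(r"\byüzde \d+(?:,\d+)?|%\s?\d+(?:,\d+)?")

# Toy suffix list (locative and ablative case endings); a real analyzer covers far more.
CASE_SUFFIXES = ("dan", "den", "tan", "ten", "da", "de", "ta", "te")


def find_expressions(text):
    """Return (matched text, tag) pairs in order of appearance."""
    hits = [(m.start(), m.group(0), "TIMEX") for m in DATE_PATTERN.finditer(text)]
    hits += [(m.start(), m.group(0), "NUMEX") for m in PERCENT_PATTERN.finditer(text)]
    return [(phrase, tag) for _, phrase, tag in sorted(hits)]


def validate_candidate(token, lexicon):
    """Return the lexicon entry behind a possibly inflected token, or None."""
    if token in lexicon:
        return token
    # Try longer suffixes first so "dan" is stripped before "da".
    for suffix in sorted(CASE_SUFFIXES, key=len, reverse=True):
        stem = token[:-len(suffix)]
        if token.endswith(suffix) and stem in lexicon:
            return stem
    return None


print(find_expressions("oran 30 haziran 2009 tarihinde yüzde 78,7 oldu"))
print(validate_candidate("ankarada", {"ankara", "istanbul"}))
```

Stripping suffixes only when the remaining stem is a known entry keeps ordinary words such as "masada" from being accepted as names.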

Evaluation [1]

- The system tags its output with Message Understanding Conference (MUC) style named entity tags:
  - ENAMEX, TIMEX, and NUMEX
- An annotation tool is implemented to create the answer set using the same tags.
- Evaluation is performed by comparing the system output with the answer set.
- Three data sets are compiled:
  - A news article set of 10 articles (Say et al., 2003)
  - A children's stories set comprising two stories (Ilgaz, 2003a-b)
  - A historical text set from the first three chapters of a historical book (Tanpınar, 2007).
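MUC-style output wraps each recognized span in an ENAMEX, TIMEX, or NUMEX element with a TYPE attribute. The sketch below shows one way to produce such markup from character offsets. The helper names and the example sentence are hypothetical, and the slides do not show how the system itself serializes its output.

```python
def span(text, phrase):
    """Return (start, end) character offsets of the first occurrence of phrase."""
    start = text.index(phrase)
    return start, start + len(phrase)


def tag_muc(text, entities):
    """Insert MUC-style tags around non-overlapping (start, end, element, type) spans."""
    out = []
    pos = 0
    for start, end, element, etype in sorted(entities):
        out.append(text[pos:start])  # untouched text before the entity
        out.append(f'<{element} TYPE="{etype}">{text[start:end]}</{element}>')
        pos = end
    out.append(text[pos:])  # remainder after the last entity
    return "".join(out)


sentence = "Ahmet 30 Haziran 2009 tarihinde Ankara'ya gitti"
entities = [
    (*span(sentence, "Ahmet"), "ENAMEX", "PERSON"),
    (*span(sentence, "30 Haziran 2009"), "TIMEX", "DATE"),
    (*span(sentence, "Ankara"), "ENAMEX", "LOCATION"),
]
print(tag_muc(sentence, entities))
```

Because the annotation tool uses the same tags, the answer set and the system output can be compared span by span.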

Evaluation [2]

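Comparing the system output with the answer set is commonly summarized as precision, recall, and f-measure. The sketch below assumes exact-match scoring, where a system entity counts as correct only if its span and type both match an answer entity, and a balanced f-measure. The slides do not state the matching criterion, so both choices are assumptions, and the spans in the example are made up.

```python
def score(answer, system):
    """Return (precision, recall, f_measure) under exact span-and-type matching."""
    answer, system = set(answer), set(system)
    correct = len(answer & system)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(answer) if answer else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Entities as (start, end, type); the values are invented for illustration.
answer_set = {(0, 5, "PERSON"), (6, 21, "DATE"), (32, 38, "LOCATION")}
system_out = {(0, 5, "PERSON"), (6, 21, "DATE"),
              (32, 38, "ORGANIZATION"), (40, 45, "PERSON")}
p, r, f = score(answer_set, system_out)
print(f"precision={p:.3f} recall={r:.3f} f-measure={f:.3f}")
```

In this toy case two of the four system entities are correct and two of the three answer entities are found.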

Evaluation [3]


Discussion [1]

- The evaluation results are in line with the finding that rule-based NER systems suffer considerable performance degradation when they are ported to other domains.
  - The system has been designed for news articles in Turkish and achieves an f-measure of 78.7 % on news articles.
  - The f-measure drops to 69.3 % for children's stories and, more dramatically, to 55.3 % for historical texts.

Discussion [2]

- The most frequent errors of the system on news articles are due to the following reasons:
  - Some common nouns in Turkish are homonymous with proper person names in our person name list, such as ‘Savaş’ (meaning ‘war’ as a common noun) and ‘Barış’ (meaning ‘peace’).
  - The rules for locations and organizations as exemplified in (1) and (2) put no constraints on X (except for being the preceding named entity or preceding token).

Discussion [3]

- The cases that decrease the performance on children's stories include the following:
  - The stories contain foreign person names that do not exist in our person name list.
  - A considerable proportion of the location names turn out to be village names, which are not included in our list of location names.

Discussion [4]

- The main sources of performance degradation on historical texts:
  - Almost all of the organization names, such as ‘Osmanlı İmparatorluğu’ (the Ottoman Empire), ‘Selçuk’ (Seljuk), and ‘Roma İmparatorluğu’ (the Roman Empire),
    - are not included in the lexical resource, and
    - do not conform to existing patterns.
  - The performance during person name recognition is also poor due to the Arabic or Persian origins of a good proportion of the person names.

Conclusion [1]

- Named entity recognition is an important information extraction task.
  - It has been widely studied for languages including English, Spanish, and Chinese.
- We present the evaluation results of a rule-based named entity recognizer for Turkish on different text types: newspaper articles, children's stories, and historical texts.
  - It does not utilize capitalization and punctuation clues.
  - The system was originally proposed for the domain of news articles, yet it encompasses generic resources and rules in addition to the domain-specific ones.

Conclusion [2]

- The evaluation results demonstrate that the performance of the system is promising, with an f-measure of 78.7 % for news articles.
  - A decrease of about 9.4 percentage points in f-measure (from 78.7 % to 69.3 %) is observed for children's stories.
  - More dramatically, a decrease of 23.4 percentage points in f-measure (from 78.7 % to 55.3 %) is obtained for the historical text input.

Conclusion [3]

- As future work, the original recognizer could be enhanced with additional resources and rules, benefiting from the error analyses provided in this paper.
- Moreover, other genres of texts such as technical documents or email messages can be used for evaluation.

References

1. Özkan Bayraktar and Tuğba Taşkaya-Temizel. Person name extraction from Turkish financial news text using local grammar based approach. In Proceedings of the International Symposium on Computer and Information Sciences, 2008.
2. Silviu Cucerzan and David Yarowsky. Language independent named entity recognition combining morphological and contextual evidence. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.
3. Rıfat Ilgaz. Bacaksız Kamyon Sürücüsü. Çınar Publications, 2003.
4. Rıfat Ilgaz. Bacaksız Tatil Köyünde. Çınar Publications, 2003.
5. Ralph Grishman. Information extraction. In Ruslan Mitkov, editor, The Oxford Handbook of Computational Linguistics. Oxford University Press, 2003.
6. Dilek Küçük and Adnan Yazıcı. Identification of coreferential chains in video texts for semantic annotation of news videos. In Proceedings of the International Symposium on Computer and Information Sciences, 2008.
7. Bilge Say, Deniz Zeyrek, Kemal Oflazer, and Umut Özge. Development of a corpus and a treebank for present-day written Turkish. In Proceedings of the 11th International Conference of Turkish Linguistics (ICTL), 2002.
8. Ahmet Hamdi Tanpınar. Beş Şehir. Dergah Publications, 2007.
9. Gökhan Tür, Dilek Hakkani-Tür, and Kemal Oflazer. A statistical information extraction system for Turkish. Natural Language Engineering, 9(2):181-210, 2003.

Thank You
