CONTRIBUTI SCIENTIFICI

POS TAGGING WITH A NAMED ENTITY TAGGER

MASSIMILIANO CIARAMITA · JORDI ATSERIAS

ABSTRACT In this paper, we describe the two “Yahoo” Part of Speech (PoS) tagging systems. Both implement a Hidden Markov Model (HMM), based on a regularized perceptron classifier. The main feature of our approach is the choice of a single model, in terms of features and classifier, for both the PoS and named entity tasks of the Evalita shared task.

Keywords: PoS tagging, HMMs, Perceptrons.

1 Introduction

One of the main goals of our research is to investigate the application of natural language processing methods to emerging information retrieval tasks which are challenging for traditional approaches (e.g., see [5]). Since we expect to carry out several processing steps such as PoS tagging, named-entity detection, semantic role labeling, parsing, etc. on extremely large datasets, efficiency is a priority. It is interesting to note that a large fraction of computation time is usually spent in the feature extraction step. For this reason it is convenient to devise a sequential pipeline of incremental processing steps in which a basic feature extraction step is performed once, and each stage then feeds the next with an additional “layer” of features; e.g., the PoS tagger adds its output to the basic feature set and passes the enriched data on to the NER tagger, which can in turn add information for a syntactic parser (e.g., see [4]). For the Evalita shared task we focused on this aspect and devised a simple and efficient system, which we applied to both the PoS and NER tasks (for the NER system see [3]).
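The incremental layering just described can be sketched as follows; this is a toy illustration of the pipeline idea, with hypothetical stage names, not code from the actual system:

```python
from typing import Callable, Dict, List

Token = Dict[str, str]                       # feature name -> value for one token
Stage = Callable[[List[Token]], List[str]]   # returns one label per token

def run_pipeline(sentence: List[Token], stages: Dict[str, Stage]) -> List[Token]:
    """Each stage reads the features accumulated so far and appends its
    output as a new feature layer for the next stage (e.g. PoS, then NER)."""
    for layer_name, stage in stages.items():
        labels = stage(sentence)
        for token, label in zip(sentence, labels):
            token[layer_name] = label        # basic features extracted once; layers added on top
    return sentence

# Toy stages, purely for illustration:
toy_pos = lambda sent: ["NOUN" for _ in sent]
toy_ner = lambda sent: ["O" if t["pos"] == "NOUN" else "B-MISC" for t in sent]

sent = [{"word": "yahoo"}, {"word": "research"}]
out = run_pipeline(sent, {"pos": toy_pos, "ner": toy_ner})
```

The point of this design is that expensive basic feature extraction happens once, while each downstream stage pays only for its own additional layer.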


The most important feature of our system is the use of exactly the same features for both PoS and NER, thus conforming to the pipelined approach sketched above. Since we use the same feature set defined for English, and do not use any external resources or language-specific adaptation, we also evaluate in practice the portability of this approach to different languages. Experimental results show that this approach is competitive and extremely efficient^1.

2 HMM tagger and features

Our tagger (see [2] for more details) is a Hidden Markov Model trained with the sequence perceptron introduced in [1]. Label-to-label dependencies are limited to the previous tag (first-order HMM). Models have a single adjustable parameter, the number of training iterations, chosen on the training data by cross-validation. Models are regularized by averaging [1]; we also added a constant feature to each token to further reduce overfitting. In addition, the following features are extracted for each token wi:

1. Sentence-relative position: rp ∈ {begin, mid, final};
2. Word: wi, wi−1 and wi+1, in lower-cased form;
3. Prf/Suf: initial/final character bigrams/trigrams of wi;
4. Shape: each character r of wi is replaced as follows:
   (a) wi,r → x if wi,r ∈ {a, .., z};
   (b) wi,r → X if wi,r ∈ {A, .., Z};
   (c) wi,r → d if wi,r ∈ {0, .., 9};
   (d) wi,r → wi,r otherwise;
   (e) sequences of the same symbol y are then replaced with y*.
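To make features 3 and 4 concrete, here is a minimal Python sketch of the affix and shape extraction. The function names are ours, not those of the sst-light implementation, and we read rule (e) as collapsing runs of identical symbols:

```python
import re

def shape(word: str) -> str:
    """Shape feature (4): a..z -> 'x', A..Z -> 'X', 0..9 -> 'd', other
    characters unchanged; runs of the same symbol y collapse to 'y*'."""
    chars = []
    for c in word:
        if "a" <= c <= "z":
            chars.append("x")
        elif "A" <= c <= "Z":
            chars.append("X")
        elif "0" <= c <= "9":
            chars.append("d")
        else:
            chars.append(c)
    # collapse repeated symbols, e.g. "Xxxxxxx-dddd" -> "Xx*-d*"
    return re.sub(r"(.)\1+", r"\1*", "".join(chars))

def affixes(word: str) -> dict:
    """Prf/Suf feature (3): initial/final character bigrams and trigrams."""
    feats = {}
    if len(word) >= 2:
        feats["pref2"], feats["suf2"] = word[:2], word[-2:]
    if len(word) >= 3:
        feats["pref3"], feats["suf3"] = word[:3], word[-3:]
    return feats

# e.g. shape("Evalita-2007") -> "Xx*-d*"
```

Shape features of this kind let the tagger generalize over capitalization and digit patterns, which is especially useful for unknown tokens.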

3 System results and discussion

To evaluate the portability of our tagger, first developed for English, we did not use any external resources, nor the provided multi-word expression and abbreviation lists.

^1 The tagger, called “sst-light”, is available from http://sourceforge.net/projects/supersensetag/

System              Task  Tagset   T    #features  Train  Test  Sent/s  Acc    UTAcc
Yahoo Ciaramita s1  POS   Eagles   30   370K       3.5m   2.6s  247     96.78  87.78
Yahoo Ciaramita s2  POS   Eagles   43   1.6M       30.2m  8.9s  72      95.27  81.83
Yahoo Ciaramita s1  POS   Distrib  30   64K        1.1m   1.2s  535     96.61  88.24
Yahoo Ciaramita s2  POS   Distrib  42   1.5M       14.1m  6.1s  105     95.11  84.16

Table 1: Results, in terms of accuracy (also for unknown tokens), of the systems on the two tagsets. The table also reports the number of features for each model, the number of training iterations, the time it took to train, and the tagging speed on the evaluation set, which consists of 643 sentences (17,313 tokens).

In addition to the base tagger we trained a second one using bigrams of features, i.e., a “second-order” model (see [4]). Although the second model performed worse than the simpler base model in cross-validation, we report its results for completeness. A possible explanation for this fact might be that the token features outweigh the contribution of the label-to-label features; thus, a separate weighting scheme might be necessary in the second-order model, which is in any case significantly less efficient than the first-order model. Table 1 summarizes the systems' results, together with additional information concerning the models and their efficiency^2. The simpler linear models outperform the second-order models, while being faster and smaller. Our base tagger outperforms all pre-existing reference systems: MXPOST (95.15 on Distrib, 96.14 on Eagles), Brill's tagger (94.13/94.39), MBT (95.02/95.48), and TnT on Distrib (95.96), although TnT performs slightly better on Eagles (96.82 against our 96.78).

REFERENCES

[1] M. Collins. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. In Proceedings of EMNLP-02, 2002.

[2] M. Ciaramita and Y. Altun. Broad-Coverage Sense Disambiguation and Information Extraction with a Supersense Sequence Tagger. In Proceedings of EMNLP-06, 2006.

[3] M. Ciaramita and J. Atserias. Named Entity Tagging with a PoS Tagger. In Proceedings of Evalita-07, 2007.

[4] M. Ciaramita and G. Attardi. Dependency Parsing with Second-Order Feature Maps and Annotated Semantic Information. In Proceedings of IWPT-07, pp. 133–143, 2007.

[5] B. Hagedorn, M. Ciaramita and J. Atserias. World Knowledge in Broad-Coverage Information Filtering. In Proceedings of ACM SIGIR-07 (Poster), 2007.

CONTACTS

MASSIMILIANO CIARAMITA, JORDI ATSERIAS
Yahoo! Research
Ocata 1, Barcelona 08003
Catalunya, Spain
Email: {massi | jordi}@yahoo-inc.com

MASSIMILIANO CIARAMITA is a researcher at Yahoo! Research Barcelona. He received degrees from the Università di Roma “La Sapienza” (BA) and Brown University (Sc.M., Ph.D.). He held a research fellowship from the Italian National Research Council (CNR) in Rome between 2004 and 2006. He has worked on several topics in computational linguistics, such as knowledge acquisition, semantic tagging, information extraction, and parsing; currently his research concerns the application of natural language processing and machine learning to a range of information retrieval problems.

JORDI ATSERIAS is a research engineer at Yahoo! Research Barcelona. He obtained his BS in Computer Science in 1994 from the Facultat d'Informàtica de Barcelona and his Ph.D. from Euskal Herriko Unibertsitatea (University of the Basque Country) in 2006. He previously worked in the TALP research group (Universitat Politècnica de Catalunya) and was involved in several European and Spanish projects related to NLP technologies. His research in NLP has focused mainly on parsing, word sense disambiguation, and semantic role labeling.

^2 All experiments were carried out on a laptop with a Pentium M 1.83 GHz CPU.

Anno IV, N° 2, Giugno 2007

