Named-Entity Chunking for Norwegian Text using Support Vector Machines Bjarte Johansen Department of Information Science and Media Studies University of Bergen Postbox 7802, N-5007 Bergen, Norway
[email protected]
October 16, 2015
Abstract Named-Entity Chunking is part of the Named-Entity Recognition (NER) process and is the task of identifying which parts of a text are names. This task is usually done as an implicit part of the recognizer, but because previous attempts at NER for Norwegian text focus only on the recognition, this research represents an attempt to develop an explicit chunker. The research shows that if we only focus on demarcating names and not on discovering their type as well, we are able to accurately (>95% F1 -score) find the names in Norwegian text using Support Vector Machines.
1
Introduction
Named-Entity Chunking (NEC) is the task of demarcating which segments of a text are parts of a named entity and which are not. A named entity is a specific person, place, event, etc. Chunking is different from the process of Named-Entity Recognition (NER) in that we are not interested in the type of the entity, but only finding the entities themselves. Maurits Escher was a Dutch artist . Table 1: An example sentence. To understand better what the problem is we can look at the example in table 1. In the example, we see Maurits Escher. When we are chunking the sentence we are trying to discover that the terms Maurits and Escher are part of the same named entity. We also see the term Dutch, and even though it refers to a nation, we do not consider it by itself an entity. Artist also appears, which on the other hand is an entity, but it is a general category and does not refer to a single individual (or thing) and is therefore not a named entity. This paper was presented at the NIK-2015 conference; see http://www.nik.no/.
The reason for developing an explicit chunker instead of as part of a NER tool is that previous research on recognition for Norwegian focused on only categorizing the names in pre-chunked data (Haaland, 2008; Nøklestad, 2009). The exception is Jónsdóttir (2003), who do chunking as a part of the recognizer. In future research, we are also interested in collecting entities from newspaper articles, but are not necessarily interested in the type of the entities. It is important to do this type of research for languages like Norwegian as it shows that these types of methods used in this research generalizes over different languages given the right feature selection and training material. This paper proposes a solution to the problem of named entity chunking in the context of Norwegian text. It does that by: 1. Discussing and identifying some characteristics of Norwegian that are important for named-entities in section 3. 2. Running experiments to show that Support Vector Machines (SVM) are able to, given the features that we have defined, provide state of the art performance on this task. See section 4. 3. Comparing our solution to what others have done before in section 5. 4. Concluding on the results and discussing any future work in section 6.
2
Related work
Jónsdóttir (2003) did some early work on chunking and recognition for Norwegian. They used a ruled-based approach through the use of constraint grammar. The approach did provide good recall scores (>90%) for NER, but the precision did not reach satisfactory results (95% F1 -score) find names in Norwegian text and that if we are interested in finding just the names in a text and not their type, it is better to implement an explicit chunker.
6
Future work
As mentioned in section 2, there are several research papers describing approaches to NER, but they need the named entities in the text to be pre-tagged and will therefore not work with untagged data. A future path for this research would be to test the NEC developed in this project with these approaches and see if we are able to do streaming NER on live data. Though we were able to identify some characteristics of Norwegian text that applies to the tasks of Named-Entity Chunking and Recognition, we have not been able to find a good way to include what we learned in the feature vector used to train the chunker we developed in this paper. We have seen that polysemy and the many capitalization rules of Norwegian can affect the accuracy of our research. In the future we should try to find ways to use these characteristics to improve the accuracy of the chunker.
References Guri Bordal. Substantiv. https://snl.no/substantiv, July 2015. Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm. Åsne Haaland. A Maximum Entropy Approach to Proper Name Classification for Norwegian. PhD thesis, University of Oslo, March 2008. Thorsten Joachims. Text categorization with support vector machines: Learning with many relevant features. Springer, 1998. Andra Björk Jónsdóttir. ARNER, what kind of name is that? - An automatic Rule-based Named Entity Recognizer for Norwegian. PhD thesis, University of Oslo, May 2003. Taku Kudo and Yuji Matsumoto. Chunking with support vector machines. In Proceedings of the second meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies, pages 1–8. Association for Computational Linguistics, 2001. Christopher D Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to information retrieval, volume 1. Cambridge university press Cambridge, 2008. Anders Nøklestad. A machine learning approach to anaphora resolution including named entity recognition, pp attachment disambiguation, and animacy detection. PhD thesis, University of Oslo, June 2009. Språkrådet. Orddeling og særskriving. http://www.sprakradet.no/Vi-og-vart/ hva-skjer/Aktuelt-ord/Orddeling-og-sarskriving/, 2009. Ole Tange. Gnu parallel - the command-line power tool. ;login: The USENIX Magazine, 36(1):42–47, Feb 2011. URL http://www.gnu.org/s/parallel. Tekstlaboratoriet and Uni Computing. The oslo-bergen tagger. tekstlab.uio.no/obt-ny/english/index.html, 2014.
http://www.
Erik F. Tjong Kim Sang and Sabine Buchholz. Introduction to the conll-2000 shared task: Chunking. In Proceedings of the 2Nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning - Volume 7, ConLL ’00, pages 127–132, Stroudsburg, PA, USA, 2000. Association for Computational Linguistics. doi: 10.3115/1117601.1117631. URL http://dx.doi.org/10.3115/1117601.1117631. Jean-Philippe Vert, Koji Tsuda, and Bernhard Schölkopf. A primer on kernel methods. Kernel Methods in Computational Biology, pages 35–70, 2004. Finn-Erik Vinje. Skriveregler. Aschehaug, 7 edition, 1998. Gjennomgått av Norsk språkråd og anbefalt for offentlig bruk av Kulturdepartementet. GuoDong Zhou and Jian Su. Named entity recognition using an hmm-based chunk tagger. In proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 473–480. Association for Computational Linguistics, 2002.