Liner2 — a Generic Framework for Named Entity Recognition

Michał Marcińczuk, Jan Kocoń, Marcin Oleksy

Institute of Informatics, Wrocław University of Science and Technology, Wybrzeże Wyspiańskiego 27, Wrocław, Poland

April 4, 2017

The 6th Workshop on Balto-Slavic Natural Language Processing, Valencia, Spain. The work was funded by the Polish Ministry of Science and Higher Education (CLARIN ERIC, 2016–2018).

Introduction »

Introduction

Scope: Polish only; named entity recognition and normalization (lemmatization).

Goal: to test how much effort is needed to adapt existing NER models for Polish to the new requirements (the BSNLP NER Shared Task) and what level of performance can be obtained.


NER models for Polish »

NE in KPWr

1. KPWr (Corpus of Wrocław University of Technology), https://clarin-pl.eu/dspace/handle/11321/270,
2. 1349 short documents (ca. 200 words each, 15 text genres) under a Creative Commons license, annotated with named entities,
3. more than 82 categories of named entities organized in a 3-level hierarchy.


NER models for Polish »

NE hierarchy in KPWr (https://clarin-pl.eu/dspace/handle/11321/294)

1. event – names of events organized by humans [6 subtypes],
2. facility – names of buildings and stationary constructions (e.g. monuments) developed by humans [9 subtypes],
3. living – names of people and other living beings [6 subtypes],
4. location – names of geographical (e.g. mountains, rivers) and geopolitical entities (e.g. countries, cities) [31 subtypes],
5. organization – names of organizations, institutions and organized groups of people [10 subtypes],
6. product – names of artifacts created or manufactured by humans (products of mass production, arts, books, newspapers, etc.) [26 subtypes],
7. adjective – adjectival forms of proper names [3 subtypes],
8. numerical – numerical identifiers which refer to entities [5 subtypes],
9. other – names which do not fit into the previous categories [10 subtypes].


NER models for Polish »

Distribution of top NE categories

[Pie chart] living (12k), location (6.3k), organization (4.6k), product (2.6k), adjective (1.5k), facility (1.3k), other (0.8k), event (0.7k), numex (0.1k).

Liner2 »

Liner2 overview (https://clarin-pl.eu/dspace/handle/11321/231)

- implemented in Java,
- a set of modules for sequence labelling (statistical, rule- and dictionary-based),
- the statistical model uses Conditional Random Fields (the CRF++ library),
- a rich set of features — 56 basic features (orthographic, morphological, lexicon-based and wordnet-based) and several complex features (see the sketch below),
- dictionaries for NER — NELexicon (2.3M names obtained from different sources on the Internet, including Wikipedia; https://clarin-pl.eu/dspace/handle/11321/247) and a dictionary of named entity triggers, PNET (http://zil.ipipan.waw.pl/PNET),
- processes tokenized texts.
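To make the feature list above concrete, here is a minimal Java sketch of the kind of orthographic features a CRF-based NER tagger can compute per token; the feature names and the logic are illustrative assumptions, not Liner2's actual feature set.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: these feature names and rules are assumptions,
// not the 56 features actually implemented in Liner2.
public class OrthographicFeatures {

    // Computes a few simple orthographic features for a single token.
    static List<String> extract(String token) {
        List<String> features = new ArrayList<>();
        if (!token.isEmpty() && Character.isUpperCase(token.charAt(0))) {
            features.add("starts-with-uppercase");
        }
        if (!token.isEmpty() && token.chars().allMatch(Character::isUpperCase)) {
            features.add("all-uppercase");
        }
        if (token.chars().anyMatch(Character::isDigit)) {
            features.add("contains-digit");
        }
        if (token.contains("-")) {
            features.add("contains-hyphen");
        }
        // Word shape: map uppercase letters to 'A', lowercase to 'a', digits to '0'.
        String shape = token.replaceAll("\\p{Lu}", "A")
                            .replaceAll("\\p{Ll}", "a")
                            .replaceAll("[0-9]", "0");
        features.add("shape=" + shape);
        return features;
    }

    public static void main(String[] args) {
        System.out.println(extract("Wrocław")); // [starts-with-uppercase, shape=Aaaaaaa]
        System.out.println(extract("2017"));    // [contains-digit, shape=0000]
    }
}
```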


Liner2 »

Liner2 applications

- NER — https://clarin-pl.eu/dspace/handle/11321/263,
- TIMEX — https://clarin-pl.eu/dspace/handle/11321/302,
- Event — https://clarin-pl.eu/dspace/handle/11321/301.

Task               P [%]   R [%]   F [%]
NER boundaries     86.04   83.02   84.50
NER top9           73.73   69.00   71.30
NER n82            67.65   58.83   62.93
TIMEX boundaries   86.68   81.01   83.75
TIMEX 4class       84.97   76.67   80.61
Event mentions     80.88   77.82   79.32

Table: Precision (P), recall (R) and F-measure (F) for various tasks obtained with Liner2.


Liner2 »

Liner2 applications

- NER — https://clarin-pl.eu/dspace/handle/11321/263,
- TIMEX — https://clarin-pl.eu/dspace/handle/11321/302,
- Event — https://clarin-pl.eu/dspace/handle/11321/301.

Task               P [%]   R [%]   F [%]
NER boundaries     86.04   83.02   84.50
BSNLP NER          ?       ?       ?
NER top9           73.73   69.00   71.30
NER n82            67.65   58.83   62.93
TIMEX boundaries   86.68   81.01   83.75
TIMEX 4class       84.97   76.67   80.61
Event mentions     80.88   77.82   79.32

Table: Precision (P), recall (R) and F-measure (F) for various tasks obtained with Liner2.
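For reference, F in the tables above is the balanced F-measure, i.e. the harmonic mean of precision and recall. The tiny Java sketch below (the class and method names are ours, purely illustrative) reproduces the NER boundaries row.

```java
// Balanced F-measure: the harmonic mean of precision and recall.
public class FMeasure {

    static double f1(double precision, double recall) {
        return 2.0 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // NER boundaries row: P = 86.04, R = 83.02 -> F = 84.50
        System.out.printf("%.2f%n", f1(86.04, 83.02));
    }
}
```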


BSNLP NER Shared Task »

Differences compared to the KPWr NER

1. NE mention boundaries: in the KPWr corpus nested names are annotated as a sequence of disjoint atomic names, e.g. [Motorola] [Moto X]; likewise, names of facilities together with their location are annotated separately, e.g. [Citi Handlowy] w [Poznaniu],
2. NE categorization (next slide).


BSNLP NER Shared Task »

NE category mapping

KPWr category    BSNLP category   model
nam_loc          LOC              top9
nam_fac          LOC              top9
nam_liv          PER              top9
nam_org_nation   PER              n82
nam_org          ORG              top9
nam_eve          MISC             top9
nam_pro          MISC             top9
nam_adj          ignored          top9
nam_num          ignored          top9
nam_oth          ignored          top9

Table: Mapping from KPWr categories of named entities to BSNLP categories.
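The mapping above can be encoded as a simple lookup table; the Java sketch below is a direct transcription of it (the class name and the "ignored" fallback are our illustrative choices, not part of Liner2).

```java
import java.util.Map;

// A direct encoding of the KPWr -> BSNLP mapping table above.
public class CategoryMapping {

    static final Map<String, String> KPWR_TO_BSNLP = Map.of(
            "nam_loc", "LOC",
            "nam_fac", "LOC",
            "nam_liv", "PER",
            "nam_org_nation", "PER",
            "nam_org", "ORG",
            "nam_eve", "MISC",
            "nam_pro", "MISC");
            // nam_adj, nam_num and nam_oth have no BSNLP category (ignored).

    public static void main(String[] args) {
        System.out.println(KPWR_TO_BSNLP.getOrDefault("nam_fac", "ignored")); // LOC
        System.out.println(KPWR_TO_BSNLP.getOrDefault("nam_adj", "ignored")); // ignored
    }
}
```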


BSNLP NER Shared Task »

Official evaluation

Task                       P [%]   R [%]   F [%]
Names matching
  Relaxed partial          66.24   63.27   64.72
  Relaxed exact            65.40   62.78   64.07
  Strict                   71.10   58.81   66.61
Normalization              75.50   44.44   55.95
Coreference
  Document level            7.90   42.71   12.01
  Language level            3.70    8.00    5.05
  Cross-language level       n/a     n/a     n/a

Table: Results obtained by Liner2 in the BSNLP NER Shared Task.


Error analysis »

Test set post-evaluation

- the test set was annotated separately by two annotators according to the BSNLP NER Shared Task guidelines; the annotators had taken part in the annotation of the KPWr corpus in the past,
- the annotations were then reconciled to create a gold standard,
- the annotation and agreement verification were done using the Inforex system (https://clarin-pl.eu/dspace/handle/11321/13).


Error analysis »

Inforex — annotation


Error analysis »

Inforex — agreement


Error analysis »

Inforex — verification


Error analysis »

Agreement on the test set

Agreement was calculated using the Positive Specific Agreement (PSA) measure on the level of NE mentions.

Names matching (strict)                 PSA
NE boundaries                           97%
NE boundaries and categories            94%
NE boundaries, categories and lemmas    93%

Table: Inter-annotator agreement on the test set.
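As background (the slide does not spell this out), Positive Specific Agreement is usually computed as PSA = 2a / (2a + b + c), where a is the number of mentions annotated identically by both annotators and b, c are the mentions marked by only one of them; this is equivalent to the F1-score obtained when one annotator's annotations are treated as gold. A minimal Java sketch with hypothetical counts:

```java
// Positive Specific Agreement: 2a / (2a + b + c), where
//   a = mentions annotated identically by both annotators,
//   b = mentions annotated only by annotator A,
//   c = mentions annotated only by annotator B.
// The counts below are hypothetical and unrelated to the table above.
public class Psa {

    static double psa(int both, int onlyA, int onlyB) {
        return 2.0 * both / (2.0 * both + onlyA + onlyB);
    }

    public static void main(String[] args) {
        System.out.printf("%.2f%n", psa(970, 30, 30)); // 0.97
    }
}
```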


Error analysis »

Official vs our evaluation

                          Evaluation   P [%]            R [%]            F [%]
Names matching (strict)   Official     71.10            58.81            66.61
                          Our          83.39 (+12.29)   70.19 (+11.38)   76.22 (+9.61)
Normalization             Official     75.50            44.44            55.95
                          Our          71.57 (-3.93)    60.24 (+15.80)   65.42 (+9.47)

Table: Comparison of evaluation results.


Summary »

Conclusions

- we obtained lower results for named entity recognition and lemmatization than we expected,
- there is a discrepancy between the official evaluation and ours,
- our understanding of the guidelines differs from what was expected; for instance, we did not annotate incomplete named entities like "Komisja" (Eng. Commission) which refers to "Komisja Europejska" (Eng. European Commission); there may be other differences,
- revision of the gold standard annotation of the sets and further tuning are required.


The end »

Thank you for your attention.

