Research Internship Report
Distant supervision for relation extraction from web pages
Supervisors : Xavier Tannier, Brigitte Grau
Author : Mahmut CAVDAR
Institution : LIMSI - CNRS
Secretariat : 01 69 15 81 58, email : [email protected]
Contents

1 Introduction
2 Related Work
3 Problem
4 Preliminary
  4.1 Raw Data Processing
  4.2 Named Entity Recognition
  4.3 Features
5 System
  5.1 Deepdive
  5.2 CoreNLP
  5.3 ddlib
6 Experiments
  6.1 Corpus
  6.2 Result
7 Conclusion and perspectives
Bibliography
.1 Appendix CoreNLP
List of Figures

1.1 Example of automatically extracted information from a news item
1.2 Example of relation extraction task
5.1 Deepdive's Architecture
Special Terms

NER Named Entity Recognition
POS Parts of Speech
Ssplit Sentence Split
Abstract

This report presents a distant supervision relation extraction system for French.

Keywords : Natural Language Processing, Machine Learning, Distant Supervision, Relation Extraction, Text Mining.
Chapter 1

Introduction

Humanity's growing interest in living in a cyber world pushes companies to invest more in their IT infrastructures in order to ensure the necessary storage capacity and data transmission speed. We produce more and more data each year : according to Cisco, data center storage installed capacity will reach 1.8 ZB by 2020, up from 382 EB in 2015, nearly a five-fold growth [1]. Over the past few years in particular, advances in computing technology (e.g. storage capacity, computational capability, transmission rate) have made it possible to implement ideas that were not feasible before. Like other fields of computer science, natural language processing has benefited from these advances to offer new applications.
Natural language processing has moved with great success from classical methods, through statistics, to machine-learning approaches (linear and non-linear models such as neural networks). New ideas required a new approach, and the new approach in turn gave rise to new ideas. Recently, fields such as machine translation, speech recognition and chatbots have attracted much attention in natural language processing. Information extraction is one of them, since society now produces huge volumes of content that is generally unstructured.
Information extraction is an area which aims to extract factual information from free text ; in other words, to identify a predefined set of concepts (e.g., database records) in domain-dependent texts. Figure 1.1 shows a piece of news about a terrorist attack and the structured information extracted from it.

“Gunmen kill at least 28 Coptic Christians in central Egypt.”
⇓
Type : Attack
Location : central Egypt
DeadCount : 28
Weapon : Gun
Victim : Coptic Christians

Figure 1.1 – Example of automatically extracted information from a news item
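The structured record of Figure 1.1 can be thought of as a database row. The following sketch (purely illustrative, not output of any system described here) shows the record as a Python dictionary and how such records become queryable once extracted :

```python
# The structured record from Figure 1.1, expressed as a dictionary.
# Field names follow the figure; values are from the example sentence.
event = {
    "Type": "Attack",
    "Location": "central Egypt",
    "DeadCount": 28,
    "Weapon": "Gun",
    "Victim": "Coptic Christians",
}

# Once text is turned into records, it supports database-style queries,
# e.g. "all attacks with more than 10 casualties".
attacks = [e for e in [event] if e["Type"] == "Attack" and e["DeadCount"] > 10]
```

This is exactly what free text alone does not support, which motivates the extraction task.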
The information extraction task, more than three decades old [5], attempts to find information of any kind in texts and to associate relations between the pieces found. In the late 1980s the first Message Understanding Conference was organized, with the aim of promoting and evaluating research in information extraction [2] ; until 1998, six more conferences were held. The Automated Content Extraction program then started in 1999, aiming to develop the capability to extract meaning from multimedia sources. Relation extraction (RE) is one task type of information extraction, like Named Entity Recognition, Co-reference Resolution and Event Extraction. The aim of these tasks is to create a representation of the information that is machine-understandable. In particular, RE is the task of detecting and classifying semantic relationships between entities in a given text. Figure 1.2 shows an example of a relation extraction task.
“In Your Honor est le cinquième album studio du groupe américain de rock alternatif Foo Fighters, sorti le 13 juin 2005 sur le label RCA Records.”
⇓
Sujet          Relation          Objet
Foo Fighters   enregistré        In Your Honor
In Your Honor  date publication  13 juin 2005
RCA Records    publié            In Your Honor

Figure 1.2 – Example of relation extraction task
Knowledge Base Population is an evaluation track of the Text Analysis Conference, a workshop organised by the National Institute of Standards and Technology which by 2017 had been running for nine years. The ACE and TAC-KBP series continue their primary goal of promoting research in relation extraction that discovers semantic relationships between entities in a large corpus. The ACE Relation Extraction Task (2008), the Unified Medical Language System and the Freebase datasets have different numbers of relations : 17, 54 and thousands, respectively. [This research internship was carried out in a team...]
The rest of the paper is organised in the following way : Sect. 2 describes related work, Sect. 3 presents the problem definition and notation, Sect. 4 provides preliminaries for the field of relation extraction, Sect. 5 details the Deepdive framework architecture and the pipeline, and Sect. 6 presents the experiments and related results. The paper finishes with Sect. 7, which summarises the material presented.
Chapter 2

Related Work

Earlier RE methods were based on symbolic approaches, such as hand-built patterns aimed at the automatic acquisition of hyponyms [12]. Modern relation extraction methods can be examined under three headings : supervised, distantly supervised and unsupervised approaches.
The first relation extraction models based on machine learning were implemented with a supervised approach. [6] and [7] train different types of SVM kernels on syntactic and semantic features for classifying different types of entity relations. [8] proposes a novel hybrid kernel that combines various features such as dependency patterns, trigger words, negative cues, walk features and regular expression patterns. Depending on the volume of the data used, the classical method suffers for two reasons : labeling training data is time-consuming and hard to repeat for a new application. At that point, unsupervised, bootstrapping and distant supervision approaches offer reasonable solutions.
Unsupervised relation extraction methods extract a large set of relational tuples without requiring hand-labeled corpora [9] [10] [11] ; users do not specify their desired type of relation or information. In the bootstrapping approach, the user initially provides a small number of positive examples (seeds). These examples are used iteratively to generate new extraction patterns and new positive examples extracted from the corpus. In distant supervision, instead of user-provided seed instances, a database of facts is used to automatically generate training examples. The distant supervision approach is very efficient in terms of scale (a very large number of relations, e.g., web KBP), but these methods generally generate false positive examples and have to deal with this uncertainty, e.g. by generating negative evidence.
There are different distant supervision methods for the relation extraction problem. In multi-instance multi-label learning, a problem object is represented by a set of instances and associated with multiple labels [4].
The multi-instance multi-label relation extraction model assumes that each relation mention of an entity pair has one of the pre-specified relation labels or an additional NIL label, and the model allows the pair to have multiple mentions. A latent variable z represents the actual relation label of a mention from the knowledge base, and each classifier yj decides whether relation j holds for the given entity tuple, using the output of the z classifier as input. Another method is based on joint learning of named entity classification and relation extraction with imitation learning and distant supervision [13][14]. The system works with entity pairs identified using web-based and part-of-speech-based heuristics ; in this way, possible mistakes in the entity classification process do not affect the relation extraction task.
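The core labeling idea of distant supervision can be sketched in a few lines. The knowledge base, sentence and entity pairs below are toy data for illustration only ; a real system would also have to cope with the false positives discussed above :

```python
# Minimal sketch of distant-supervision labeling: a sentence in which a
# pair of entities co-occurs is labeled with the relation the knowledge
# base records for that pair; co-occurring pairs with no KB fact yield
# negative (NIL) evidence.

# Toy knowledge base of (subject, object) -> relation facts.
kb = {("Foo Fighters", "In Your Honor"): "recorded"}

def label_sentence(sentence, entity_pairs):
    """Return ((e1, e2), label) training examples for one sentence."""
    examples = []
    for e1, e2 in entity_pairs:
        if e1 not in sentence or e2 not in sentence:
            continue  # the pair must co-occur in this sentence
        relation = kb.get((e1, e2))
        # known fact -> positive example; otherwise negative evidence
        examples.append(((e1, e2), relation if relation else "NIL"))
    return examples

sent = "In Your Honor is the fifth album by Foo Fighters, released on RCA Records."
pairs = [("Foo Fighters", "In Your Honor"), ("RCA Records", "Foo Fighters")]
examples = label_sentence(sent, pairs)
```

The first pair matches a KB fact and becomes a positive example ; the second co-occurs but matches no fact, so it becomes NIL evidence.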
Chapter 3

Problem

In this paper we focus on distant supervision for relation extraction between two entities. Let R denote the space of relations, E the set of entities and X the set of documents. The relation extraction task is defined as a function f : X → E × E × R, learned from a given data set {r1(e1, e2), r2(e3, e4), ..., rn(ek, em)} where rn ∈ R and ek, em ∈ E. We define our task as a function that takes as input a document collection X, a set of entity mentions extracted from X (distant supervision data from a knowledge base), a set of requested relation labels l and an extraction model, and outputs the set of relations r extracted from X. For example, suppose we want to extract that NATO is to join the anti-Islamic State coalition. We define a relation r(e1, e2), where r is the relation name, e.g., join in our example, and e1 and e2 are two entities, e.g., NATO and coalition in our example.
Chapter 4

Preliminary

To handle the relation extraction task, we rely on prior knowledge and experience.
4.1 Raw Data Processing

In some projects, the RE system uses raw HTML data without preprocessing ; HTML attributes such as tag, id and class can then be included in the feature vector used to train the system. The Deepdive framework, however, works on large plain-text collections. Our raw data was gathered from different web pages ; the HTML pages were cleaned with Boilerpipe and segmented with NLTK.
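The cleaning-then-segmenting step can be sketched with standard-library tools only. This is a rough stand-in for Boilerpipe and NLTK, written to show the shape of the pipeline rather than to match their quality :

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Very rough stand-in for Boilerpipe: keep text, drop tags/scripts."""
    def __init__(self):
        super().__init__()
        self._skip = False
        self.chunks = []
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True
    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False
    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def clean_and_segment(html):
    """HTML page -> list of plain-text sentences."""
    parser = TextExtractor()
    parser.feed(html)
    text = " ".join(parser.chunks)
    # crude punctuation-based splitter standing in for NLTK's tokenizer
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

html = "<html><body><script>x=1;</script><p>First sentence. Second one.</p></body></html>"
sentences = clean_and_segment(html)
```

The real pipeline benefits from Boilerpipe's boilerplate detection (navigation menus, ads) and NLTK's trained sentence tokenizer, which this sketch does not attempt.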
4.2 Named Entity Recognition

The term Named Entity was coined at the sixth Message Understanding Conference. Named entity recognition is an important sub-task of relation extraction, as it is of other active research areas such as Question Answering, Machine Translation, Video Annotation, Semantic Web Search, etc. Identifying relations between entities first requires recognizing the entities. Named entity recognition has two steps : locating entities and classifying them. To build a reliable NER system, we face the diversity of languages, domains and entity types covered in the literature.
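The two steps, locating and classifying, can be illustrated with a tiny gazetteer-based sketch. The gazetteer entries are made up for illustration ; a real system such as CoreNLP's statistical tagger learns both steps from annotated data :

```python
# Illustrative two-step NER: locate each known mention in the text,
# then attach its class from a small hand-made gazetteer.
GAZETTEER = {
    "Foo Fighters": "ORGANIZATION",
    "Egypt": "LOCATION",
}

def recognize_entities(text):
    """Step 1: locate entity mentions; step 2: classify each one."""
    entities = []
    for mention, label in GAZETTEER.items():
        start = text.find(mention)                    # locating
        if start != -1:
            entities.append((mention, label, start))  # classifying
    return sorted(entities, key=lambda e: e[2])

ents = recognize_entities("Foo Fighters played a concert in Egypt.")
```

Gazetteer lookup cannot handle unseen names or ambiguity, which is why the diversity of languages, domains and entity types mentioned above makes statistical NER necessary.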
4.3 Features

In this section, some related features are described.

4.3.1 POS

Part-of-speech tagging is useful for a wide range of NLP tasks, including relation extraction.

4.3.1.1 Lemma
Chapter 5

System

This chapter presents the system.
5.1 Deepdive

The Deepdive framework, proposed by Niu et al. [3], extracts relations from a large number of web pages. Deepdive's architecture consists of three phases : feature extraction, probabilistic engineering, and statistical inference and learning. Deepdive obtains linguistic features using tools such as a named-entity recognizer and a dependency-path finder. These features are then used to discover correlations between linguistic patterns and the relations defined by the user. A statistical model is trained using a Markov logic program augmented with additional domain knowledge, and the knowledge base is populated with entities and relationships.
Figure 5.1 – Deepdive’s Architecture
One of the most important design ideas of DeepDive is to make KBC systems easier to debug and improve. DeepDive uses a set of labeled data to produce a calibration plot that summarizes the overall quality of the results ; from this plot the user can decide on the next step to improve the system and efficiently handle uncertainty in the prediction histogram. Another key challenge of the DeepDive approach is scalability, i.e., how to process terabytes of unstructured data efficiently (web-scale relation extraction). Extracting linguistic features such as dependency paths is highly CPU-intensive, and the statistical learning and inference phase also needs to scale. These problems are solved with the Condor infrastructure and the BISMARCK system, respectively. Condor, a high-throughput batch computing system, runs on hundreds of workstations and shared cluster machines ; Bismarck integrates many machine learning techniques into an RDBMS. DeepDive demonstrates that a promising approach is to integrate various large data sources and best-of-breed algorithms via statistical learning and inference. The quality of a DeepDive system also depends heavily on its features, rules and pre-processing pipelines.
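The calibration idea can be sketched as follows : held-out predictions are binned by predicted probability, and each bin's mean probability is compared with its empirical accuracy (a well-calibrated system has the two close together). The binning below is a generic sketch of that idea, with made-up data, not DeepDive's own implementation :

```python
# Bin held-out predictions by predicted probability and compare each
# bin's mean predicted probability with its observed accuracy.
def calibration_bins(predictions, n_bins=10):
    """predictions: list of (predicted_probability, true_label) pairs."""
    bins = [[] for _ in range(n_bins)]
    for prob, label in predictions:
        idx = min(int(prob * n_bins), n_bins - 1)
        bins[idx].append((prob, label))
    rows = []
    for b in bins:
        if b:
            mean_prob = sum(p for p, _ in b) / len(b)
            accuracy = sum(1 for _, y in b if y) / len(b)
            rows.append((mean_prob, accuracy, len(b)))
    return rows

# Toy held-out predictions: (probability assigned by the model, gold label).
preds = [(0.95, True), (0.9, True), (0.85, False), (0.15, False), (0.1, True)]
rows = calibration_bins(preds)
```

A large gap between mean probability and accuracy in some bin tells the user which confidence region of the model needs better features or rules.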
5.2 CoreNLP
Stanford CoreNLP provides a set of natural language processing tools. It can produce the segmented and tokenized form of texts and their parts of speech, detect proper names, normalize dates, times and numeric quantities, mark up syntactic dependencies, indicate which noun phrases refer to the same entities, indicate sentiment, etc. CoreNLP has a different annotator for each function. Annotations are basically maps from keys to pieces of the annotation, such as the parse, the part-of-speech tags or the named entity tags. Unfortunately, some annotators are not supported for all languages (see Appendix .1 for details). For our CoreNLP pipeline the following annotators are used : Tokenize, Ssplit, POS, NER, Lemma, Depparse and RegexNER. To train the NER model, three different datasets were tested : Quaero, Europeana and WikiNER. For the RegexNER annotator, two text files were prepared from the related DBpedia datasets ; they supported the named entity recognition phase with about 100,000 person names. To make RegexNER overwrite an existing entity assignment, we give it permission in a third tab-separated column, which contains a comma-separated list of all the entity types it may overwrite.
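The configuration described above can be sketched as a CoreNLP properties line plus a RegexNER mapping file. The mapping entry below is a hypothetical example (the name and the overwritable types are illustrative, not from our DBpedia files) ; the column layout follows CoreNLP's tab-separated mapping convention :

```
# CoreNLP pipeline properties: the annotators listed in the text
annotators = tokenize,ssplit,pos,lemma,ner,depparse,regexner

# RegexNER mapping file sketch (columns are tab-separated).
# Column 1: pattern, column 2: NER type to assign, column 3: existing
# types this rule is allowed to overwrite.
Jean Dupont	PERSON	ORGANIZATION,LOCATION
```

Without the third column, RegexNER only fills in tokens the statistical NER left untagged, which is why the overwrite permission matters for the DBpedia person lists.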
5.3 ddlib

ddlib, a generic features library, is a utility library included with DeepDive. "Generic features" denotes a set of application- and domain-independent features that can be used for different kinds of mention and relation extraction applications. For knowledge base construction applications, one of the most time-consuming operations is feature engineering, and it usually has to be done from scratch each time ; the goal of the generic features library is to let users build their application without KBC expertise. There are various "classes" of generic features. For a mention or relation they include : the part-of-speech tags of the words composing the mention (POS SEQ) ; their named entity recognition tags (NER SEQ) ; their lemmas (LEMMA SEQ) ; the words themselves (WORD SEQ) ; their lengths (LENGTH) ; a feature denoting whether the first word of the mention starts with a capital letter (STARTS WITH CAPITAL) ; features denoting whether the mention appears in a user-specified dictionary (IN DICT) ; the lemmas and NER tags in a window of size up to 3 around the mentions composing the relation ; the shortest dependency paths between the mentions and keywords from user-specified dictionaries that appear in the sentence ; etc.
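A few of these feature classes can be sketched for a single mention. The function below is illustrative and is not ddlib's actual API ; feature names follow the list above, and the tagged sentence is toy input :

```python
# Sketch of ddlib-style "generic features" for one candidate mention,
# computed from pre-tokenized and pre-tagged input.
def mention_features(words, pos_tags, ner_tags, span, dictionary):
    """span = (start, end) token indices of the mention, end exclusive."""
    start, end = span
    mention = words[start:end]
    feats = [
        "WORD_SEQ_[" + "_".join(mention) + "]",
        "POS_SEQ_[" + "_".join(pos_tags[start:end]) + "]",
        "NER_SEQ_[" + "_".join(ner_tags[start:end]) + "]",
        "LENGTH_[%d]" % len(mention),
    ]
    if mention[0][0].isupper():
        feats.append("STARTS_WITH_CAPITAL")
    if " ".join(mention) in dictionary:
        feats.append("IN_DICT")
    return feats

words = ["Foo", "Fighters", "released", "an", "album"]
pos = ["NNP", "NNP", "VBD", "DT", "NN"]
ner = ["ORG", "ORG", "O", "O", "O"]
feats = mention_features(words, pos, ner, (0, 2), {"Foo Fighters"})
```

Because none of these features mention a specific relation or domain, the same extractor code can be reused across KBC applications, which is the point of the library.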
Chapter 6

Experiments

6.1 Corpus

6.2 Result
Chapter 7

Conclusion and perspectives
Bibliography

[1] Cisco White Paper. Cisco Global Cloud Index : Forecast and Methodology, 2015–2020. 2016.
[2] Ralph Grishman and Beth Sundheim. Message Understanding Conference-6 : A Brief History. In Proceedings of the 16th Conference on Computational Linguistics, Volume 1, 1996.
[3] Feng Niu, Ce Zhang, Christopher Ré and Jude W. Shavlik. DeepDive : Web-scale knowledge-base construction using statistical learning and inference. In VLDS, pages 25–28, 2012.
[4] Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati and Christopher D. Manning. Multi-instance multi-label learning for relation extraction. In EMNLP, 2012.
[5] Peggy M. Andersen, Philip J. Hayes, Alison K. Huettner, Linda M. Schmandt, Irene B. Nirenburg and Steven P. Weinstein. Automatic extraction of facts from press releases to generate news stories. In Proceedings of the Third Conference on Applied Natural Language Processing, pages 170–177. Association for Computational Linguistics, 1992.
[6] Shubin Zhao and Ralph Grishman. Extracting relations with integrated information using kernel methods. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 419–426, 2005.
[7] GuoDong Zhou, Jian Su, Jie Zhang and Min Zhang. Exploring various knowledge in relation extraction. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, 2005.
[8] Md. Faisal Mahbub Chowdhury and Alberto Lavelli. Combining tree structures, flat features and patterns for biomedical relation extraction. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, 2012.
[9] Michele Banko, Michael J. Cafarella, Stephen Soderland, Matthew Broadhead and Oren Etzioni. Open information extraction from the web. In IJCAI-07, pages 2670–2676, 2007.
[10] Yusuke Shinyama and Satoshi Sekine. Preemptive information extraction using unrestricted relation discovery. In HLT-NAACL-06, pages 304–311, New York, NY, 2006.
[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A. Popescu, T. Shaked, S. Soderland, D. Weld and A. Yates. Unsupervised named-entity extraction from the web : An experimental study. Artificial Intelligence, 165(1) :91–134, 2005.
[12] Marti A. Hearst. Automatic acquisition of hyponyms from large text corpora. In Proceedings of COLING-92, 1992.
[13] Andreas Vlachos and Stephen Clark. Application-driven relation extraction with limited distant supervision. Association for Computational Linguistics and Dublin City University, Dublin, Ireland, 2014.
[14] Isabelle Augenstein, Andreas Vlachos and Diana Maynard. Extracting relations between non-standard entities using distant supervision and imitation learning. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015.
Appendix

.1 Appendix CoreNLP

Annotator support by language in CoreNLP :

Annotator              AR   ZH   EN   FR   DE   ES
Tokenize / Segment     ✓    ✓    ✓    ✓    ✓    ✓
Sentence Split         ✓    ✓    ✓    ✓    ✓    ✓
Part of Speech         ✓    ✓    ✓    ✓    ✓    ✓
Lemma                            ✓
Named Entities              ✓    ✓         ✓    ✓
Constituency Parsing   ✓    ✓    ✓    ✓    ✓    ✓
Dependency Parsing          ✓    ✓    ✓    ✓
Sentiment Analysis               ✓
Mention Detection           ✓    ✓
Coreference                 ✓    ✓
Open IE                          ✓