A review of relation extraction systems using distant supervision

Supervisor : Xavier Tannier

Author : Mahmut CAVDAR

Institution : LIMSI - CNRS

Secretariat : 01 69 15 81 58 email : [email protected]

Contents

1 Introduction
2 Problem
3 Methods
  3.1 DeepDive
  3.2 Multi-instance multi-label relation extraction
4 Conclusion and perspectives
Bibliography

Chapter 1

Introduction

Over the past few years in particular, advances in computing technology (e.g. storage capacity, computational capability, transmission rate) have made it possible to implement ideas that were not feasible before. Natural language processing, like other fields of computer science, has benefited from these improvements and now offers new applications. The field has moved successfully from classical methods, through statistical methods, to machine-learning approaches (linear or non-linear models such as neural networks). New ideas required new approaches, and new approaches in turn gave rise to further ideas. Recently, areas such as machine translation, speech recognition and chatbots have emerged within natural language processing. Information extraction is another of them, driven by the huge volumes of mostly unstructured content that society now produces.

Information extraction aims to extract factual information from free text. In other words, it identifies a predefined set of concepts (e.g., database records) in a domain represented by a corpus of texts. Figure 1.1 shows a piece of news about a terrorist attack and the structured information extracted from it.

“Gunmen kill at least 28 Coptic Christians in central Egypt.”
⇓
Type : Attack
Location : central Egypt
DeadCount : 28
Weapon : Gun
Victim : Coptic Christians

Figure 1.1 – Example of automatically extracted information from a news item

Most information extraction models are based on supervised learning. As the volume of data grows, this classical approach suffers for two reasons: labeling training data is time-consuming, and the labeling effort is hard to repeat for a new application. Unsupervised, bootstrapping and distant supervision approaches offer reasonable solutions at this point. Briefly, unsupervised relation extraction extracts a large set of relational tuples without requiring hand-labeled corpora; users do not specify the type of relation or information they want. In the bootstrapping approach, the user initially provides a small number of positive examples (seeds). These examples are used iteratively to generate new extraction patterns, which in turn extract new positive examples from the corpus. In distant supervision, a database of facts replaces the user's seed instances and is used to generate training examples automatically. The distant supervision approach scales very well (to a very large number of relations, e.g., web KBP).

There are different distant supervision methods for the relation extraction problem. Chapter 2 presents the problem definition and notation. Chapter 3 discusses the different approaches. Finally, I conclude the paper in the last chapter.
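The distant supervision idea described in this chapter can be illustrated with a minimal sketch. It assumes a toy knowledge base and plain substring matching for entity mentions; the names `KB`, `SENTENCES` and `distant_label` are hypothetical and are not taken from any of the reviewed systems, which use much richer entity linking.

```python
from typing import List, Tuple

Fact = Tuple[str, str, str]           # (relation, entity1, entity2)
Example = Tuple[str, str, str, str]   # (sentence, entity1, entity2, relation)

# Hypothetical toy knowledge base used in place of hand-labeled seeds.
KB: List[Fact] = [
    ("CapitalOf", "Paris", "France"),
    ("BornIn", "Marie Curie", "Warsaw"),
]

# Hypothetical unlabeled corpus.
SENTENCES: List[str] = [
    "Paris is the capital of France.",
    "Marie Curie was born in Warsaw.",
    "Marie Curie moved to Paris in 1891.",
]


def distant_label(sentences: List[str], kb: List[Fact]) -> List[Example]:
    """Label a sentence with a relation whenever it mentions both entities of a fact.

    Assumption: an entity is "mentioned" if its name occurs as a substring.
    """
    examples: List[Example] = []
    for sentence in sentences:
        for relation, e1, e2 in kb:
            if e1 in sentence and e2 in sentence:
                examples.append((sentence, e1, e2, relation))
    return examples


TRAINING = distant_label(SENTENCES, KB)
for example in TRAINING:
    print(example)
```

The third sentence yields no training example because no single fact has both of its entities in that sentence. The same matching rule also labels sentences that mention both entities without expressing the relation, which is the source of noise that the methods in Chapter 3 try to handle.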


Chapter 2

Problem

In this paper we focus on distant supervision for relation extraction between two entities. Let $R$ denote the space of relations, $E$ the set of entities and $X$ the set of words. The relation extraction task is defined as $f : X \to E \times E \times R$, from a given data set $\{r_1(e_1, e_2), r_2(e_3, e_4), \dots, r_n(e_k, e_m)\}$ where $r_n \in R$ and $e_k, e_m \in E$. For example, we want to extract that NATO is to join the anti-Islamic State coalition. We define a relation $r(e_1, e_2)$, where $r$ is the relation name (join in our example) and $e_1$ and $e_2$ are two entities (NATO and coalition in our example).
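As a rough illustration of this notation, the sketch below represents a relation r(e1, e2) as a small record and implements a deliberately naive stand-in for f. The `Relation` type, the `extract` function and the rule "a known relation word between two known entities" are assumptions made only for this example; they are not the method of any system reviewed here.

```python
from typing import NamedTuple, Optional, Sequence, Set


class Relation(NamedTuple):
    name: str  # r, an element of R
    e1: str    # first entity, an element of E
    e2: str    # second entity, an element of E


def extract(words: Sequence[str], entities: Set[str], relations: Set[str]) -> Optional[Relation]:
    """Toy f: return the first relation word found between two consecutive known entities."""
    positions = [i for i, word in enumerate(words) if word in entities]
    for start, end in zip(positions, positions[1:]):
        for word in words[start + 1:end]:
            if word in relations:
                return Relation(word, words[start], words[end])
    return None


words = "NATO is to join the anti-Islamic State coalition".split()
result = extract(words, entities={"NATO", "coalition"}, relations={"join"})
print(result)
```

On the example sentence this returns the triple join(NATO, coalition). A learned extractor replaces the hand-written rule with a model trained on the data set of known relations.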


Chapter 3

Methods

I examined two distant supervision approaches to relation extraction: one based on the DeepDive framework, and the other based on the multi-instance multi-label method.

3.1 DeepDive

The DeepDive framework, proposed by Niu et al. [2], extracts relations from a large number of web pages. DeepDive's architecture consists of three phases: feature extraction, probabilistic engineering, and statistical inference and learning. DeepDive obtains linguistic features using tools such as a named-entity recognizer and a dependency-path finder. These features are then used to discover correlations between linguistic patterns and the relations defined by the user. A statistical model is trained using a Markov logic program, which is enriched with additional domain knowledge, and the knowledge base is populated with entities and relationships.

Figure 3.1 – A picture of a gull.

One of the most important design ideas of DeepDive is to make KBC (knowledge-base construction) systems easier to debug and improve. DeepDive uses a set of labeled data to produce a calibration plot, which summarizes the overall quality of the results. From this plot, the user can get an idea of the next step to take to improve the system and of how to handle uncertainty in the histogram of predictions.

Another key challenge for the DeepDive approach is scalability, i.e., how to process terabytes of unstructured data efficiently (web-scale relation extraction). Extracting linguistic features such as dependency paths is CPU-intensive, and the statistical learning and inference phase also needs to scale. These problems are addressed with the Condor infrastructure and the Bismarck system, respectively. Condor, a high-throughput batch computing system, runs on hundreds of workstations and shared cluster machines. Bismarck integrates many machine learning techniques into an RDBMS.

The quality of a DeepDive system depends heavily on its features, rules and pre-processing pipelines. (The preprocessing stage is therefore an important part.) (As also noted in the original paper, working with a distributed relational database is important because of the heavy database queries.)
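The calibration idea can be sketched as follows: predictions are grouped into probability buckets, and the fraction of correct predictions in each bucket is compared with the bucket's probability range. This is a minimal sketch under the assumption of ten equal-width buckets and a held-out labeled set; `calibration_buckets` and the sample data are hypothetical and are not DeepDive's actual code or output.

```python
from typing import List, Tuple


def calibration_buckets(
    predictions: List[Tuple[float, bool]], n_buckets: int = 10
) -> List[Tuple[float, float, int, float]]:
    """Return (lower, upper, count, accuracy) for every non-empty probability bucket.

    Each prediction is (predicted probability, whether the extraction is correct).
    """
    counts = [0] * n_buckets
    correct = [0] * n_buckets
    for probability, is_correct in predictions:
        # Clamp so that probability 1.0 falls into the last bucket.
        index = min(int(probability * n_buckets), n_buckets - 1)
        counts[index] += 1
        correct[index] += int(is_correct)
    rows = []
    for i in range(n_buckets):
        if counts[i]:
            rows.append((i / n_buckets, (i + 1) / n_buckets, counts[i], correct[i] / counts[i]))
    return rows


# Hypothetical labeled predictions, for illustration only.
sample = [(0.95, True), (0.92, True), (0.97, False), (0.15, False), (0.12, True)]
rows = calibration_buckets(sample)
for lower, upper, count, accuracy in rows:
    print(f"[{lower:.1f}, {upper:.1f}): n={count}, accuracy={accuracy:.2f}")
```

In a well-calibrated system the accuracy of each bucket is close to the probabilities in that bucket. Buckets where the two diverge point to features or rules worth inspecting, and the per-bucket counts form the histogram of predictions.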


3.2 Multi-instance multi-label relation extraction

Contrary to traditional methods, in multi-instance multi-label learning an object is represented by a set of instances and associated with multiple labels [4]. To clarify, take the example of a movie. A movie can be assigned to several classes depending on the purpose, e.g., drama, romance, or directed by Robert Zemeckis (multi-label). On the other hand, a real-world label prediction model needs several instances per object. For example, multiple sections can be extracted from a movie, so the movie can be represented by a set of instances and receive its (multiple) labels from these instances.

DB = (BornIn(Barack Obama, United States), EmployedBy(Barack Obama, United States))

Each sentence mentioning the entity pair is shown with the relation it expresses (– if none):

EmployedBy : Barack Obama is the 44th and current President of the United States.
BornIn : Obama was born in the United States just as he has always said.
EmployedBy : United States President Barack Obama meets with Chinese Vice President Xi Jinping today.
– : Obama ran for the United States Senate in 2004.

Aligning a database of facts with text introduces challenges. The same entity pair may sometimes carry a different label in different sentences. The multi-instance multi-label relation extraction method, proposed by Surdeanu et al. for the Stanford KBP system, addresses this problem with a novel graphical model [3].
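To make the multi-instance multi-label representation concrete, the Obama example can be grouped into one "bag" per entity pair: all sentences mentioning the pair (multi-instance) together with all relations the database holds for it (multi-label). This is only a sketch of the input representation, not of the graphical model of Surdeanu et al.; `build_bags`, the `ALIASES` table and substring matching are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

DB: List[Tuple[str, str, str]] = [
    ("BornIn", "Barack Obama", "United States"),
    ("EmployedBy", "Barack Obama", "United States"),
]

SENTENCES: List[str] = [
    "Barack Obama is the 44th and current President of the United States.",
    "Obama was born in the United States just as he has always said.",
    "United States President Barack Obama meets with Chinese Vice President Xi Jinping today.",
    "Obama ran for the United States Senate in 2004.",
]

# Hypothetical alias table: surface forms accepted as a mention of each entity.
ALIASES: Dict[str, Tuple[str, ...]] = {
    "Barack Obama": ("Obama",),
    "United States": ("United States",),
}


def build_bags(db, sentences, aliases):
    """Group sentences and database labels by entity pair (one bag per pair)."""
    labels: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for relation, e1, e2 in db:
        labels[(e1, e2)].add(relation)
    bags = {}
    for (e1, e2), relations in labels.items():
        mentions = [
            s for s in sentences
            if any(a in s for a in aliases[e1]) and any(a in s for a in aliases[e2])
        ]
        bags[(e1, e2)] = {"labels": relations, "mentions": mentions}
    return bags


BAGS = build_bags(DB, SENTENCES, ALIASES)
bag = BAGS[("Barack Obama", "United States")]
print(sorted(bag["labels"]), len(bag["mentions"]))
```

The bag keeps four mentions and two labels without committing any single sentence to a label. Deciding which mention expresses which relation, and that the fourth sentence expresses neither, is left to the model.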

Chapter 4

Conclusion and perspectives

Let us touch on multiple relations and time-dependent (dynamic) relations.


Bibliography

[1] Gabor Angeli, Sonal Gupta, Melvin Jose, Christopher D. Manning, Christopher Ré, Julie Tibshirani, Jean Y. Wu, Sen Wu, and Ce Zhang. Stanford's 2014 slot filling systems. In TAC, 2014.
[2] Feng Niu, Ce Zhang, Christopher Ré, and Jude W. Shavlik. DeepDive: Web-scale knowledge-base construction using statistical learning and inference. In VLDS, pages 25–28, 2012.
[3] Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning. Multi-instance multi-label learning for relation extraction. In EMNLP, 2012.
[4] Zhi-Hua Zhou, Min-Ling Zhang, Sheng-Jun Huang, and Yu-Feng Li. Multi-instance multi-label learning. Artificial Intelligence, 176(1), 2012.
