Entity Extraction within Plain-Text Collections
WISE 2013 Challenge T1: Entity Linking Track

Carolina Abreu⋆, Flávio Costa, Laécio Santos, Lucas Monteiro, Luiz Peres, Patrícia Lustosa, and Li Weigang

University of Brasilia, Brasilia, Brazil
[email protected], {capregueira,laecio,lucasbmonteiro}@gmail.com, [email protected], [email protected], [email protected]

Abstract. The increasing availability of electronic texts, such as free-content encyclopedias on the internet, has unveiled a vast amount of interesting and important knowledge in Web 2.0. Nevertheless, identifying relations within a myriad of information is still a challenge. For large corpora of data, it is impractical to manually label each text in order to define relations and extract information. The WISE 2013 conference proposed a challenge (T1 Track) in which teams must label entities within plain texts based on a given set of entities. The Wikilinks dataset comprises 40 million mentions of 3 million entities. This paper describes a straightforward two-fold unsupervised strategy to extract and tag entities, aiming to achieve accurate results in the identification of proper nouns and concrete concepts, regardless of the domain. The proposed solution is based on a pipeline of text processing modules that includes a lexical parser. In order to validate the proposed solution, we statistically evaluate the results using various measurements in the case study supplied by the Challenge.

Keywords: Information extraction, Text analysis, Entity extraction, Wikilinks

1 Introduction

The 14th edition of the International Conference on Web Information Systems Engineering (WISE 2013) aims to promote discussion of recent advances in Web technologies, methodologies and applications. In order to encourage research and development in this area, the conference has presented two main challenges, which may be entered separately. The first track (T1) is the Entity Linking Track, which aims to label entities in plain texts based on the Wikilinks dataset. The second track (T2) is the Weibo Prediction Track, which aims to estimate users' age range in a Sina Weibo microblogging service dataset. This paper is focused on the Entity Linking Track (T1), a challenge to create an automatic system that identifies entities within simple texts and relates them to the respective URLs of Wikipedia's set of entities provided by WISE.⋆ Attendees need to automatically detect proper nouns, such as "Shanghai" or "Hillary Clinton", and concepts in general, especially concrete concepts such as "Chinese characters" or "talk-show hosts". Participants are also encouraged to label each concept as concretely as possible.

⋆ This research has been partially supported by the CAPES and FINEP grants.

The increasing availability of electronic texts, such as free-content encyclopedias on the internet, has unveiled a vast amount of interesting and important knowledge in Web 2.0. With the rise of the Semantic Web, the need for built-in mechanisms for defining relationships between data becomes even more relevant [1]. Nevertheless, identifying relations within a myriad of information is still a challenge. For large corpora of data, it is impractical to manually label each text in order to define relations and extract information.

Automatic extraction of information from textual corpora is a well-known problem that has long been of interest in Information Extraction (IE) research. Research in the IE field aims to identify structured relations from unstructured sources such as documents or web pages, and has shown promise for large-scale knowledge acquisition, mostly because of the large amount of online information and recent advances in machine learning and Natural Language Processing (NLP) [2]. IE has traditionally focused on satisfying pre-specified requests from small homogeneous corpora. Shifting to a new domain relies on extensive human involvement in the form of new hand-crafted extraction rules or new hand-tagged training examples [3]. A broader approach, Open Information Extraction (OIE), is the task of extracting assertions from massive corpora without requiring a pre-specified vocabulary or any human input [4].

Named entity recognition (NER) is one of the first steps in many applications of IE and other applications of NLP. A NER system allows the identification of proper nouns in unstructured text. NER problems usually involve names, locations, dates and monetary amounts, but they can be expanded to cover concrete concept identification as well [5]. After entity recognition, entity matching is the task of identifying to which entity from a knowledge base a given mention in free text refers [6].

Although it may seem like a fairly simple task, the research of Larkey et al. [7] demonstrates the importance of the proper-name component in language tasks involving searching, tracking, retrieving, or extracting information. Crestan and de Loupy [8] also argue that named entity recognition and extraction helps end users browse large document collections more quickly and efficiently. This is particularly true for Wikipedia, one of the largest online repositories, with millions of articles available. One of its most important attributes is the large number of links embedded in the body of each article. Wikipedia contributors are encouraged to perform annotations by hand, connecting the most important terms to other pages and thereby providing users a quick way of accessing additional information [9].


Unsupervised algorithms are being used to address these questions, but the key issue in relation extraction remains balancing the trade-off between high precision, recall, and scalability [10]. An increasing amount of research claims to have successfully addressed the scalability and precision factors [11], at a level of accuracy comparable to that of an expert manually labeling the information. This paper presents a straightforward two-fold unsupervised strategy to extract and tag entities, aiming to achieve accurate results in the identification of proper nouns and concrete concepts, regardless of the domain. The proposed solution is based on a pipeline of text processing modules that includes a lexical parser. This general solution may be applied to any text corpus when an entity set is known. Even though there is no explicit performance requirement in this track, the large number of entities and plain texts to be processed in a short period is a constraint in itself.

Section 2 reports related work describing IE studies. In Section 3 we analyze the Track 1 dataset files of the Challenge. After that, we provide a formal description of the data in Section 4 and the problem definition and approach of the solution in Section 5. Section 6 reports on our experimental results and evaluation, and Section 7 closes the paper with conclusions and a discussion of future work.

2 Related Work

Etzioni et al. (2004) [12] proposed one of the best-known IE systems, KnowItAll: a system that extracts large collections of facts from the web in an autonomous, domain-independent, and scalable manner. A more complete approach, the OIE systems, was proposed in order to perform unsupervised extraction with much higher precision and recall. The concept of OIE was introduced with TextRunner [2], a system that uses a Naive Bayes model with unlexicalized POS and NP-chunk features as input and analyses the text between noun phrases to obtain relationships. WOE [13] is an improved version of TextRunner that uses heuristic correspondences between values of attributes in Wikipedia infoboxes and sentences in order to build training data. The most recent generation of OIE systems managed to substantially improve both precision and recall compared to previous extractors: the ReVerb [11] and R2A2 [4] systems. ReVerb's design is based on simple rules that identify verbs expressing relationships in English; it implements a novel relation phrase identifier based on generic syntactic and lexical constraints. R2A2 has the same structure as ReVerb, but adds an argument identifier, ArgLearner, to better extract the arguments of these relation phrases [11].

All of these frameworks apply some level of NER in their solutions. However, there are problems where NER is an end in itself, addressed by both supervised and unsupervised methods; see [14] for a survey. Supervised methods usually apply machine learning such as Naive Bayes, decision trees, or rule induction, using syntactic patterns. Unsupervised methods use a broader range of techniques and were found to achieve accuracy figures comparable to those obtained by state-of-the-art supervised methods [15]. Given an input document, the Wikify! system [9] is an unsupervised method that applies word sense disambiguation to automatically extract the most important words and phrases in the document and to identify, for each such keyword, the appropriate link to a Wikipedia article. It does so by measuring the lexical overlap between the Wikipedia page of each candidate disambiguation and the context of the ambiguous mention, and by training a Naive Bayes classifier for each ambiguous mention, using the hyperlink information found in Wikipedia as ground truth. This is a local approach, which disambiguates each mention within a document separately. Recent work has tended to focus on more sophisticated global approaches to the problem, in which all mentions in a document are disambiguated simultaneously to arrive at a coherent set of disambiguations [16]. Ratinov et al. (2011) [17] proposed GLOW, a named entity linking system that annotates unstructured text with Wikipedia links, utilizing the Wikipedia link graph to estimate coherence. GLOW treats the problem as an optimization problem with local and global variants, and the result is the best trade-off between local and global approaches.

Similar work has been created for commercial purposes, but it has raised considerable controversy. The "Instant Lookup" feature of the Trillian instant messaging client (https://www.trillian.im/) scans each of the user's messages in real time to determine which words or phrases are defined in Wikipedia. The Microsoft Smart Tags found in later versions of Microsoft Word are a feature in which the application recognizes certain words or types of data and converts them into hyperlinks. Google AutoLink, a feature of the third version of the Google Toolbar, would detect words from specific categories and link them to commercial websites. The use and coverage of these systems is still limited and not very comprehensive.

Based on the study of previous work, our objective is to explore how precise a local approach for entity identification and disambiguation can be. We believe that a simple, straightforward approach is capable of producing results competitive with state-of-the-art systems in terms of standard IE evaluation measures. Although we acknowledge the importance of more refined approaches, our focus is to provide an automatic solution that is scalable, fast and, foremost, does not require previous data training and is not restricted to a specific domain.

3 Data Analysis

In Track 1 of the WISE 2013 Challenge, the dataset referred to as Wikilinks contains over 40 million mentions labeled as referring to 3 million entities. It was gathered by Singh et al. (2012) [6] with an automated method based on finding a large collection of hyperlinks to English Wikipedia pages from a web crawl, and annotating the corresponding anchor text (in context) as mentions of the entity or concept described by the Wikipedia page. According to the authors, this data set is dramatically larger and more varied than previously available data: a scale that challenges most existing approaches. Since a certain percentage of Wikipedia URLs in the current Wikilinks dataset are in the wrong format, the WISE 2013 Challenge committee provided a revised version of the dataset.

The Wikilinks dataset consists of many lines, where each line represents an entity and its mentions. An entity can be followed by one or multiple mentions, separated by two hash marks '##'. Each line starts with the entity URL (obtained by removing the Wikipedia prefix "http://en.wikipedia.org/wiki/") and is followed by the corresponding mentions extracted from a large collection of web pages that refer to Wikipedia.

In the Wikilinks entity file, there are 2,860,422 entities and 5,804,339 mentions. The average number of mentions per entity is two. The maximum number of mentions for a single entity is 431. If we count the entities that have one single mention where this mention is exactly the same as the name of the entity, we find 1,626,190 entities (approximately 30%). After a tokenization of the entity file, we found 8,208,979 tokens (a unit of text that is either a word or something else, such as a number or a punctuation mark). The average number of tokens per entity is two. The maximum number of tokens in a single entity is 72. The total number of tokens within the mentions is 21,751,384. The average number of tokens in a mention is three. The maximum number of tokens within a single mention is 588.

There is also a quality issue in some of the entities that are URL encoded. The WISE committee has transformed some of them, but there is still a large number of URLs where the URL encoding replaced unsafe ASCII characters with a '?', which must be considered in the matching algorithm for labelling accuracy purposes. Another challenge to be faced in the entity linking track is the highly heterogeneous types of entities and their mentions. One could use the pure entity list alone for labeling texts, except that there are many ambiguous entities for which disambiguation can only be performed when the context (in the plain text and in the mentions) is analyzed.
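To make the line format concrete, the following minimal Python sketch parses one line of the entity file under the format described above; the function name and the sample line are illustrative, not part of the challenge kit.

def parse_wikilinks_line(line):
    """Split one Wikilinks line into (entity, [mentions]); fields are separated by '##'."""
    fields = line.rstrip("\n").split("##")
    entity, mentions = fields[0], fields[1:]
    return entity, mentions

entity, mentions = parse_wikilinks_line("Hillary_Clinton##Hillary##Clinton##Hillary Clinton")
print(entity)    # Hillary_Clinton
print(mentions)  # ['Hillary', 'Clinton', 'Hillary Clinton']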

4 Formal Description of Data

In order to supply a theoretical analysis of the data, we use a formalization similar to the one used in [18] and later in [17] for the disambiguation to Wikipedia (D2W) problem. Those authors present a broader approach to disambiguation, whereas we present a simpler and more restricted one. Fig. 1 shows an overview of the terms tagged in a sample text provided by the WISE 2013 Challenge and their relations with the respective Wikilinks entities and mentions.

Consider we are given a file f, containing a list of entities and their mentions, where each line represents a Wikipedia URL. Consider also that we are given a plain text document d with a set of named entities N = {n_1, ..., n_N}. We use the term named entity (NE) to denote the occurrence of a proper name or a concrete concept inside a plain text.

[Figure 1: a sample document d with named entities such as n_1 = Hillary, n_2 = Rudy and n_3 = local politician, linked through the score function φ(n, e) to Wikilinks entities such as e_1 = Hillary Clinton, e_2 = Hillary Duff, e_3 = Rudy Giuliani, e_4 = Rudy (film) and e_5 = local politician, each with its associated mentions.]

Fig. 1. Named entity recognition and matching overview.

Our purpose is to produce a mapping from the set of named entities to the set of Wikipedia entities (URLs) W = {e_1, ..., e_|W|}. It is possible for a NE to correspond to an entity that is not listed within f; therefore, a null entity is added to the set W. Each Wikipedia entity has a set of mentions M = {m_1, ..., m_N} associated with it in the file f. According to [17], matching the named entities with the Wikipedia entities may be expressed as the problem of finding a many-to-one matching on a bipartite graph, with NEs forming one partition and Wikipedia entities the other (Fig. 1). We denote the output matching as an N-tuple Γ = (e_1, ..., e_N), where e_i is the most precise match, or the disambiguation, for the NE n_i. A local NE approach matches each named entity n_i independently of the others. The match is defined by some parameters. Let φ(n_i, e_j) be a score function reflecting the likelihood that the entity e_j ∈ W is the most precise match for n_i ∈ N. Previous research considers that identifying all named entities based on a local approach can be expressed as an optimization problem, such as the one presented in the following:

Γ*_local = argmax_Γ Σ_{i=1}^{N} φ(n_i, e_i) .     (1)

Local approaches define a φ function to establish which candidate is the most accurate match, by assigning higher scores to entities with content similar to that of the input document. Global approaches work in a more complex manner, by matching the entire set of NEs simultaneously, aiming to improve the coherence among the linked entities [17]. We intend to show that even a simpler approach, such as local matching, provides sound results that may be competitive with state-of-the-art global approaches. By keeping it simple, it is possible to reach very interesting results with limited resources.
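To make the local formulation of Eq. (1) concrete, the sketch below disambiguates each NE independently by taking the highest-scoring candidate; candidates and phi are placeholders for a candidate lookup and for the score function of Section 5.4, and the null entity is represented by None.

def link_locally(named_entities, candidates, phi):
    """Return the tuple Gamma = (e_1, ..., e_N) of best matches, one per named entity."""
    gamma = []
    for ne in named_entities:
        cands = candidates(ne)  # entities similar to the NE; may be empty
        best = max(cands, key=lambda e: phi(ne, e), default=None)  # None stands for the null entity
        gamma.append(best)
    return tuple(gamma)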


5 Problem Definition and Approach

The entity linking method we propose here for entity recognition and extraction follows the basic framework of any IE system, as summarized by [19]. The author argues that although there are various applications, all research converges to a standard architecture with respect to the main functions performed by the solution.

[Figure 2: Wikilinks Entity File → Pre-processing → Database Tables.]

Fig. 2. Steps that comprise the entity file processing.

The solution may be described in five main steps:

1. The entity file and the text corpus are pre-processed, to load the database and to remove formatting issues, respectively (see Fig. 2).
2. Each plain text of the corpus is tokenized and grammatically annotated with its POS tags.
3. Each sentence is analysed to detect proper nouns and concrete concepts, and the NEs are extracted using a set of rules.
4. The solution identifies which URL from the Wikilinks entity file each NE in the plain text refers to.
5. All the chosen NEs and their respective entities are compiled in a results file.

A more detailed description of these steps is reported in the following subsections and summarized in the overview of the architecture in Fig. 3.

5.1 Pre-processing

In the first step, the Wikilinks entity file is processed and, as a result, database tables are created. All the entities and their respective mentions are organized in two separate tables, containing references to the original file. Each entity is processed to create an auxiliary table containing the tokens of the core words of the entity, which will be used as one of the matching parameters. An automatic tagger tool grammatically annotates each mention, and each token is assigned its most common part-of-speech (POS) according to the Penn Treebank tag set. The POS result is persisted in the database and will be used in the sentence analysis and extraction steps. Some statistics are calculated in the pre-processing step: the number of mentions for each entity and the number of times a core-word token is listed within the entity's mentions. These statistics will also be used as input for the score function in the matching step.
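As an illustration of this step, the sketch below loads the entity file into relational tables; SQLite stands in for the PostgreSQL database actually used, the table and column names are our own, and the POS tagging of the mentions is omitted for brevity.

import sqlite3

def load_entity_file(path, db_path="wikilinks.db"):
    # Create the entity, mention and core-word tables and fill them from the Wikilinks file.
    con = sqlite3.connect(db_path)
    cur = con.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS entity (id INTEGER PRIMARY KEY, url TEXT, n_mentions INTEGER)")
    cur.execute("CREATE TABLE IF NOT EXISTS mention (entity_id INTEGER, text TEXT)")
    cur.execute("CREATE TABLE IF NOT EXISTS core_word (entity_id INTEGER, token TEXT)")
    with open(path, encoding="utf-8") as f:
        for eid, line in enumerate(f):
            fields = line.rstrip("\n").split("##")
            url, mentions = fields[0], fields[1:]
            cur.execute("INSERT INTO entity VALUES (?, ?, ?)", (eid, url, len(mentions)))
            cur.executemany("INSERT INTO mention VALUES (?, ?)", [(eid, m) for m in mentions])
            # Core words: tokens of the entity name itself (underscores separate words).
            cur.executemany("INSERT INTO core_word VALUES (?, ?)", [(eid, tok.lower()) for tok in url.split("_")])
    con.commit()
    con.close()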

[Figure 3: the system pipeline, whose internal boxes are Preprocessing, Tokenization, Sentence Analysis, Template Generation, Matching and Extraction, taking the Text Corpus and the Wikilinks entities as input and producing the Results. The example sentence "(...) Rudy from his job as hands-on city manager" is tagged as "Rudy/NNP from/IN his/PRP job/NN as/IN hands-on/JJ city/NN manager/NN", yielding "Rudy" (proper noun), "job" (noun) and "city manager" (concrete concept), which are matched to the entities Rudy_Giuliani and city_manager.]

Fig. 3. Overview of the system architecture. Each internal box represents a process. The external elements are an example of the solution applied to a single sentence.

For each plain text of the testing corpus, some pre-processing is also required to avoid low-level formatting issues. Since the quality and the source of the corpus cannot be predicted, there may be formatting variations that the solution cannot deal with, and these need to be filtered out. In particular, the solution converts the first word of each sentence to lowercase. Uppercase letters found in the middle of a sentence are left unaltered.

5.2 Tokenization

The pre-processed texts are divided into tokens, which then go through automatic grammatical tagging to assign each token its appropriate part-of-speech. The tokenization process is a common first step in IE. The Penn Treebank tag set [20] was adopted for its specificity; e.g., it distinguishes between common and proper nouns. The automatic identification of proper nouns is an advantage, since it is one of the tasks of the challenge. The tokenization is applied to the complete plain text. Previous work has shown that it is possible to avoid the cost of parsing complete documents, as in [21]; however, for better accuracy we chose to tag the entire texts. It is important to note that no matter how good the tagging tool is, there will be ambiguous tagging in a non-specific domain model. Since the plain text may come from any domain, the parsing tool cannot be trained to adjust to a specific vocabulary. There will be unknown words, and the "wrong" tags will affect the precision of any solution.
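To make this step concrete, the following sketch tokenizes and tags a sentence with NLTK, which also produces Penn Treebank tags; it is only a stand-in for the TreeTagger-based parser used in our pipeline, and the exact tags may differ slightly between taggers.

import nltk  # requires the 'punkt' and 'averaged_perceptron_tagger' models

def tag_text(text):
    """Return a list of (token, Penn Treebank tag) pairs for the whole text."""
    tokens = nltk.word_tokenize(text)
    return nltk.pos_tag(tokens)

print(tag_text("Rudy from his job as hands-on city manager"))
# e.g. [('Rudy', 'NNP'), ('from', 'IN'), ('his', 'PRP$'), ('job', 'NN'), ...]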


5.3 Sentence Analysis and Entity Extraction

The Sentence Analysis step aims to identify proper-noun groups and concrete concepts based on parsing. It is not the purpose of this research to build a complete, detailed parse tree for each sentence. Instead, the system only needs to perform a partial parsing: constructing as much structure as the WISE 2013 Challenge requires. Unlike traditional parsers, a partial parser looks for fragments of text that can be reliably recognized [19]. We analyzed the grammatical structure of proper nouns and concrete concepts and proposed a pair of regular expressions (regex) to guide the solution in finding these terms within the plain text. This technique identifies these fragments deterministically, based on purely local syntactic cues; for this reason, its coverage is limited. We defined a group of regular expressions based on the word level of the English Penn Treebank POS tag set. The majority of occurrences of proper nouns and concrete concepts are restricted to 5 tags: /JJ adjective, /NN noun singular, /NNS noun plural, /NNP proper noun singular and /NNPS proper noun plural. The patterns are shown below:

Regular expressions to identify proper nouns and concrete concepts:

REGEX #1 (Proper Nouns):      p+n?    where p : [/NNP, /NNPS] and n : [/NN, /NNS]
REGEX #2 (Concrete Concepts): a?n+    where a : [/JJ] and n : [/NN, /NNS]

Quantification: (?) zero or one of the preceding element; (+) one or more of the preceding element.
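For illustration, the sketch below applies the two patterns to a POS-tagged sentence by encoding each tag as one character and running the regexes over the resulting string; the handling of overlaps between the two rules is simplified, and the function names are our own.

import re

TAG_CODES = {"NNP": "P", "NNPS": "P", "NN": "N", "NNS": "N", "JJ": "J"}
PROPER_NOUN = re.compile(r"P+N?")  # REGEX #1: p+n?
CONCRETE = re.compile(r"J?N+")     # REGEX #2: a?n+

def extract_nes(tagged):
    """tagged: list of (token, tag) pairs; returns (phrase, kind) candidates."""
    code = "".join(TAG_CODES.get(tag, ".") for _, tag in tagged)
    nes = []
    for pattern, kind in ((PROPER_NOUN, "proper noun"), (CONCRETE, "concrete concept")):
        for m in pattern.finditer(code):
            phrase = " ".join(tok for tok, _ in tagged[m.start():m.end()])
            nes.append((phrase, kind))
    return nes

tagged = [("Rudy", "NNP"), ("from", "IN"), ("his", "PRP"), ("job", "NN"),
          ("as", "IN"), ("hands-on", "JJ"), ("city", "NN"), ("manager", "NN")]
print(extract_nes(tagged))
# [('Rudy', 'proper noun'), ('job', 'concrete concept'), ('hands-on city manager', 'concrete concept')]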

The result of each sentence analysis is a set of NEs in the plain text that correspond either to a proper noun or to a concrete concept. A NE, as defined by the regular expressions, can be composed of one or more words. Entity extraction is the action of identifying which sets of terms in the sentence may be of potential relevance.

5.4 Entity Matching

The goal of entity matching is to determine which Wikilinks entity a NE refers to. To accomplish that goal, we defined a score function φ(n_i, e_j) to reflect the likelihood that the entity e_j ∈ W is the most precise match for n_i ∈ N. Most systems use manually generated heuristics to determine when two phrases describe the same entity, but generating good heuristics that cover all types of reference resolution is still a challenge.

After a NE is extracted, the algorithm searches the database for all entities that are similar to the NE and lists them. These entity candidates may or may not refer to the NE. The next step is to calculate the score function φ(n_i, e_j). We calculate four parameters, each varying from 0 to 1, where 1 is a perfect match for that parameter. We estimate the weight of each parameter and choose the entity with the highest overall score to be linked to the NE. The four parameters of our solution are:

A : Match within the core words.
B : Match with the phrase context.
C : Number of mentions of the entity.
D : Match with the text context.

The final score function is defined by the equation

φ(n_i, e_j) = αA + βB + γC + δD .     (2)

The parameter C is independent of the word received and gives a sense of the size of the entity. All the constant weights (α, β, γ, δ) were calibrated by specialists.
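A minimal sketch of the weighted score of Eq. (2) is shown below; the weight values and the parameter functions are illustrative placeholders, not the calibrated values used in our experiments.

WEIGHTS = {"A": 0.4, "B": 0.25, "C": 0.1, "D": 0.25}  # illustrative alpha, beta, gamma, delta

def phi(ne, entity, params):
    """Weighted sum of Eq. (2); params maps 'A'..'D' to functions (ne, entity) -> value in [0, 1]."""
    return sum(weight * params[name](ne, entity) for name, weight in WEIGHTS.items())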

5.5 Design of Solution

The objective of the present work is to achieve a high level of precision in labelling entities within plain texts. To achieve this goal, we developed a solution as illustrated in Fig. 3. The algorithms were implemented in Java, C# and Python. The processed data was loaded into a PostgreSQL database. The parser used by the algorithms was an extension of the TreeTagger toolkit (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/). The texts were processed on a Dell PowerEdge R710 with 16 x quad-core Intel Xeon E5630 2.53 GHz, 12 MB cache and 141 GB RAM.

6 Experiments and Evaluation

The test corpus provided by the WISE 2013 Challenge contains 8,824 files to be analyzed, with a total size of 19.2 MB of text. The smallest file is 573 bytes, whereas the biggest is 3,182 bytes. The number of lines of text within the corpus is 522,600. The average number of lines per text is 58. The minimum number of lines in a text is 6. The maximum number of lines in a text is 250.

Evaluation in Information Extraction makes frequent use of the notions of precision and recall. Precision is defined as the proportion of selected items that the system got right. Recall is defined as the proportion of the target items that the system selected [22]. A single measure that trades off precision against recall is the F measure, the weighted harmonic mean of precision and recall.

We randomly chose a representative sample of 12 texts of the corpus (approximately 0.14%). A set of 3 specialists analysed the samples and manually labeled the proper nouns and concrete concepts that they considered relevant entities in each of the texts, in a process similar to the one executed by the WISE 2013 Challenge Committee.



Table 1. Evaluation measures: comparison between the specialists' analysis and the automatic solution.

Text  Topic                No.    Proper Noun           Concrete Concept      Precision  Recall
ID                         Lines  Specialist  System    Specialist  System
 524  Health                42        6          4          9          5        90.0%    60.0%
 873  Culture & Education   44        8          8          5          4        66.7%    61.5%
1359  Politics              39       11         10          8          5        93.3%    73.7%
2520  Politics              38       31         29          5          5        70.6%    66.7%
2560  Politics              36       22         15          2          2        53.0%    71.0%
3235  Politics              42       22         28          4          3        62.0%    80.0%
6586  Fishing              159        5          6          8          2        87.5%    53.8%
6778  Pets                 150       13          8          8          6        78.6%    52.4%
7571  Auto                 121        6          2          5          3       100.0%    45.5%
7636  Child Care           112        1          1         11          9       100.0%    83.3%
7963  Crimes               107        4          4          5          4        87.5%    77.8%
8337  Music                157        5          6          5          2        75.0%    60.0%

Table 1 presents the results comparing the specialists' manual labelling and the solution's labelling in terms of precision and recall. The evaluation of the corpus texts was made based on the WISE 2013 guidelines, in which all proper nouns should be identified in the results. In addition, the evaluation ignored common concepts (usually common nouns) and focused on the concrete concepts. If the specialist identified a concrete concept, a label pointing only to a general concept was treated as a negative label.

The results show an average precision of 80.4% and an average recall of 65.5%. The precision measure (or positive predictive value) indicates that a high fraction of the retrieved instances of proper nouns and concrete concepts were relevant. The recall measure (or sensitivity) was a little lower, but shows that the solution retrieved a large number of the relevant instances. In some of the texts, the automatic solution identified and linked concrete concepts that were not predicted by the specialists. This was not considered a mistake and therefore did not affect the evaluation measures. In a plain analysis, Table 1 shows that our algorithm returned a large share of the relevant results (recall) and returned substantially more relevant than irrelevant results (precision).
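For reference, a minimal sketch of the standard precision, recall and F-measure computation over one text is shown below; representing the labels as sets is our own simplification of the evaluation actually performed, and the example label sets are hypothetical.

def precision_recall_f1(predicted, relevant):
    """predicted: labels produced by the system; relevant: labels given by the specialists."""
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

print(precision_recall_f1({"Rudy_Giuliani", "city_manager", "Senate"},
                          {"Rudy_Giuliani", "city_manager", "Hillary_Clinton", "Senate"}))
# (1.0, 0.75, 0.857...)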

7 Conclusions

In the WISE 2013 Challenge T1: Entity Linking Track, we were asked to identify and link entities within plain texts to the Wikilinks dataset. In this paper, a concise formalization of the problem and an overview of the architecture and the steps developed in the solution were presented. We successfully achieved the basic requirement of identifying the proper nouns and concrete concepts in 8,824 plain texts, with competitive precision and recall evaluation measures. The architecture is very scalable, and we assumed that basic semantic relations can be inferred by matching given lexical-syntactic patterns. The contributions of this paper can be noted as: a) the development of an IE system to automatically label English texts with Wikipedia entities with reasonable precision; and b) an algorithm implementing a local disambiguation approach with no training data requirements or restricted domain limitations.

During the development of the IE system, we overcame various difficulties, such as the non-specific domain limitation, the ambiguity around concrete concepts and proper nouns, and the lack of training data. As future work, we intend to apply the proposed solution to a specific domain, such as the energy industry, to evaluate whether, with a controlled vocabulary, it is possible to achieve even higher precision and recall measures. Even though the proposed solution is used for entity extraction within plain-text collections, it also has great potential for application in industry, for example to extract keywords from daily operation schedules and system logs in order to avoid possible accidents involving important equipment. This opens a new line of research in both Natural Language Processing and Data Mining in the energy industry.

References

1. Ruiz-Casado, M., Alfonseca, E., Okumura, M., Castells, P.: Information Extraction and Semantic Annotation of Wikipedia. In: Proceedings of the 2008 Conference on Ontology Learning and Population: Bridging the Gap between Text and Knowledge, pp. 145-169 (2008)
2. Banko, M., Etzioni, O., Soderland, S., Weld, D. S.: Open information extraction from the web. Communications of the ACM, 51(12), pp. 68-74 (2008)
3. Banko, M., Etzioni, O.: Strategies for Lifelong Knowledge Extraction from the Web. In: K-CAP '07, Whistler, British Columbia, Canada, October 28-31 (2007)
4. Fader, A., Soderland, S., Etzioni, O.: Identifying Relations for Open Information Extraction. In: Proceedings of the Workshop on Unsupervised Learning in NLP, pp. 1535-1545 (2011)
5. Shaalan, K., Raza, H.: NERA: Named Entity Recognition for Arabic. J. Am. Soc. Inf. Sci. Technol., 60(8), pp. 1652-1663. John Wiley & Sons, Inc., New York (2009)
6. Singh, S., Subramanya, A., Pereira, F., McCallum, A.: Wikilinks: A Large-scale Cross-Document Coreference Corpus Labeled via Links to Wikipedia. CMPSCI Technical Report UM-CS-2012-015, University of Massachusetts Amherst (2012)
7. Larkey, L., Abdul Jaleel, N., Connell, M.: What's in a name? Proper names in Arabic cross-language information retrieval. CIIR Technical Report No. IR-278 (2003)
8. Crestan, E., de Loupy, C.: Browsing help for faster document retrieval. In: Proceedings of the 20th International Conference on Computational Linguistics - COLING '04, pp. 576-es. Association for Computational Linguistics, Morristown (2004)
9. Mihalcea, R., Csomai, A.: Wikify! Linking documents to encyclopedic knowledge. In: Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management - CIKM '07, pp. 233-242. New York (2007)


10. Exner, P., Nugues, P.: Entity Extraction: From Unstructured Text to DBpedia RDF Triples. In: 1st International Workshop on Web of Linked Entities 2012, pp. 58-69. Boston, USA, November 11 (2012)
11. Etzioni, O., Fader, A., Christensen, J., Soderland, S.: Open Information Extraction: The Second Generation. In: Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, Vol. 1, pp. 3-10. AAAI Press, Barcelona (2011)
12. Etzioni, O., Cafarella, M., Downey, D., Kok, S., Popescu, A. M., Shaked, T., Soderland, S., et al.: Web-scale information extraction in KnowItAll. In: Proceedings of the 13th International Conference on World Wide Web, pp. 100-110. ACM Press (2004)
13. Wu, F., Weld, D. S.: Open Information Extraction using Wikipedia. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 118-127 (2010)
14. Ratinov, L., Roth, D.: Design challenges and misconceptions in named entity recognition. In: Proceedings of the Thirteenth Conference on Computational Natural Language Learning - CoNLL '09, pp. 147-155 (2009)
15. Mihalcea, R., Tarau, P.: TextRank: Bringing order into texts. In: Proceedings of EMNLP, 4(4), pp. 404-411 (2004)
16. Milne, D., Witten, I. H.: An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In: Proceedings of the First AAAI Workshop on Wikipedia and Artificial Intelligence - WIKIAI '08, pp. 25-30. AAAI Press (2008)
17. Ratinov, L., Roth, D., Downey, D., Anderson, M.: Local and Global Algorithms for Disambiguation to Wikipedia. In: Proceedings of ACL-HLT, pp. 1375-1384 (2011)
18. Bunescu, R., Pasca, M.: Using Encyclopedic Knowledge for Named Entity Disambiguation. In: Proceedings of EACL, Vol. 6, pp. 9-16. ACL (2006)
19. Cardie, C.: Empirical Methods in Information Extraction. AI Magazine, 18(4), pp. 65-79 (1997)
20. Marcus, M. P., Santorini, B., Marcinkiewicz, M. A.: Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19, pp. 313-330 (1994)
21. Jain, A., Pennacchiotti, M.: Domain-independent entity extraction from web search query logs. In: Proceedings of the 20th International Conference Companion on World Wide Web - WWW '11, pp. 63-64. ACM Press, New York (2011)
22. Zhu, M.: Recall, Precision and Average Precision. Working Paper 2004-09 (2004)
