Entity Disambiguation and Linking over Queries using Encyclopedic Knowledge
Truc-Vien T. Nguyen
Massimo Poesio
CIMeC, University of Trento Corso Bettini 31, Rovereto (TN), 38068, Italy
CIMeC, University of Trento Corso Bettini 31, Rovereto (TN), 38068, Italy
[email protected]
[email protected]
ABSTRACT The literature contains a large amount of work on entity recognition and semantic disambiguation in text, but very little on their behaviour over noisy text data. In this paper, we present an approach for recognizing and disambiguating entities in text based on the high coverage and rich structure of an online encyclopedia. This work was carried out on a collection of query logs from the Bridgeman Art Library. Since queries are noisy, unstructured text, standard natural language processing and computational techniques run into problems, and we need to contend with the impact of noise and the demands it places on query analysis. To cope with the noisy input, we use a machine learning method with statistical measures derived from Wikipedia, a huge electronic text collection from the Internet which is itself noisy. Our approach is unsupervised and does not need any manual annotation by human experts. We show that data collected from Wikipedia can be used statistically to achieve good performance on entity recognition and semantic disambiguation over noisy unstructured text. Also, as no language-specific tool is needed, the method can be applied to other languages in a similar manner with little adaptation.
Categories and Subject Descriptors I.2.7 [Artificial Intelligence]: Natural Language Processing—Text analysis
General Terms Algorithms, Experimentation, Languages
Keywords entity recognition, semantic disambiguation, Wikipedia
1. INTRODUCTION
In recent years, noisy and unstructured text has become more and more ubiquitous on the web: electronic text from the Internet, contact centers and mobile phones, and small, noisy text snippets created by social networks or query logs. As this data presents spelling errors, abbreviations, and non-standard words, traditional natural language processing cannot be applied successfully. In this paper, we are interested in the recognition and semantic disambiguation of entities in query logs. To contend with noisy input that carries poor context information, we employ structured knowledge extracted from a large encyclopedic collection from the web, which is also noisy. The ability to identify entities (such as people, locations and organizations) is an important task in itself and a valuable preprocessing step for many types of language analysis, including question answering, machine translation, and information retrieval. Its goal is the detection of entities and mentions in text and their labelling with one of several categories (e.g., PERSON or LOCATION). An entity (such as George W. Bush) can be referred to by multiple surface forms (e.g., "George Bush" and "Bush"), and a surface form can refer to multiple entities (e.g., two U.S. presidents or the pianist Alan Bush). Our aim here is to build a large-scale, multilingual system that can recognize and disambiguate entities from a user query in a general domain. This makes it possible to understand users' behaviour by analysing language-based information from transaction logs. As the query input is taken from the Bridgeman Art Library, queries are short, informally written texts. Thus, they suffer from spelling mistakes and grammatical errors, and usually do not form a complete sentence. Also, since these texts can be typed in multiple languages, we target a method that performs well in a multilingual environment. Our approach takes advantage of the human knowledge created through Wikipedia, a large-scale, collaborative web-based resource built with the collective efforts of millions of contributors. Our work differs from that of previous researchers in that we focus primarily on noisy unstructured text in the form of query logs as opposed to other content types. We show that using a large-scale resource from the web helps to extract important entities in queries, semantically disambiguate them, and link them to an external knowledge source.
http://www.bridgemanart.com/
We propose a framework in which dictionaries and structures are extracted from the online encyclopedia and used to derive statistical measures, which are then fed to a machine learner for disambiguation. Ours is an unsupervised approach, in which the system aids the process of knowledge assimilation for knowledge-base building and also performs the analysis, without requiring any human effort. To the best of our knowledge, this is the first attempt to adapt and apply disambiguation to Wikipedia to query texts, which are more difficult than the pure natural language content types addressed in previous work. The structure of the paper is as follows. In Section 2 we discuss previous work on using Wikipedia for entity disambiguation and linking. In Section 3 we present our framework and the knowledge and statistical measures extracted from Wikipedia. In Section 4 we present the adaptation of traditional D2W to the Bridgeman query logs, the experimental setting used to evaluate these methods, and the dataset. The results we obtained and future work are discussed in Section 5.
2. RELATED WORK
2.1 Motivation
As analyzed in [17], noise in text mainly comes from two types of sources. The first type arises during a conversion process, when a textual representation of information is produced from some other form, such as handwritten documents or spontaneous speech. In the second type, noise is introduced when the text is produced directly in digital form, such as online chat, emails or web pages.
In this work, we focus on the latter, where the target data is the textual representation of query logs typed by users of the Bridgeman Art Library. We employ supervised machine learning where the training data is also the textual representation of web pages derived from Wikipedia. Note that although supervised machine learning is used, it does not require any manual annotation by human labelers. Instead, we exploit the natural annotations provided by the link structure created in Wikipedia. Wikipedia is an online encyclopedia created through the collaborative effort of millions of contributors. It has grown to be one of the largest online repositories, a multilingual resource with millions of articles available for a large number of languages. Concretely, official Wikipedias have been created for more than 200 languages with varying levels of coverage; the number of entries varies from a few pages to several million articles per language. Recently, Wikipedia has been shown to be a valuable resource for many types of language analysis, including measuring semantic similarity between texts [7], text classification [8], named entity recognition [4], relation extraction [13, 14, 15], and coreference resolution [6].
2.2 Disambiguation to Wikipedia
Entity disambiguation refers to the detection and association of text segments with entities defined in an external repository. Disambiguation to Wikipedia (D2W) refers to the task of detecting and linking expressions in text to their referent Wikipedia pages. In the last decade, substantial attention has been given to the use of encyclopedic knowledge for disambiguation. Using encyclopedic knowledge for entity disambiguation was pioneered by [1]. Subsequent work has intensively exploited Wikipedia link structures and made use of natural annotations as the source for learning [10, 3, 12, 16]. Figure 1 shows an example of D2W. Given the text "Michael Jordan won the inaugural award and a total of four across his career.", a D2W system is expected to detect the text segment "Michael Jordan" and link it to the correct Wikipedia page Michael Jordan, rather than to the other Michael Jordans who are footballers, mycologists or politicians. One shortcoming of previous studies on D2W is that they have mainly focused on pure natural language text. For instance, [1] focused on linking only named entities in Wikipedia articles. The authors first propose some heuristics to decide whether a Wikipedia title corresponds to a named entity, then exploit the redirect pages and disambiguation pages to collect other ambiguous names to build the dataset. Overlapping categories between pages, together with traditional bag-of-words and TF-IDF measures, are finally employed in a kernel function for name disambiguation. This work, however, targets only named entities and uses only a small subset of Wikipedia articles (roughly half a million).
http://en.wikipedia.org
Figure 1: Disambiguation to Wikipedia
[3] presents a method for named entity disambiguation based on Wikipedia. They first extract resources and context for each entity from the whole Wikipedia collection, then use a named entity recognizer in combination with other heuristics to identify named entity boundaries in the articles. Finally, they employ a vector space model which includes context and categories for each entity in the disambiguation process. The approach works very well, with high disambiguation accuracy. However, the use of many heuristics and of a named entity recognizer reveals a weakness of the method: it is difficult to adapt to other languages as well as to other content types such as noisy text.
We use the title Michael Jordan to refer to the full address http://en.wikipedia.org/wiki/Michael_Jordan.
[12] proposes a general approach to D2W. They first process the whole Wikipedia collection and collect the sets of incoming and outgoing links for each page. They employ a statistical method for detecting links by gathering all n-grams in the document and retaining those whose probability exceeds a threshold. For entity disambiguation they use machine learning with a few features, such as the commonness of a surface form, its relatedness to the surrounding context, and the balance of these two features. However, they never tried to adapt this method to other kinds of text or to other languages.
[16] combines local and global approaches to the D2W task, using a set of local features in combination with global features. Based on traditional bag-of-words and TF-IDF measures and on semantic relatedness, they implement a global approach to D2W. However, their system makes use of many language-specific tools, such as named entity recognition, chunking and part-of-speech tagging. Thus, it is very difficult to apply the method to noisy text or to adapt it to other languages.
Previous approaches to D2W differ with respect to the following aspects: (1) the corpora they address; (2) the type of text expression they target to link; (3) the way they define and use the disambiguation context for each entity. For instance, some methods focus on linking only named entities, such as [1, 3]. The method of [3] defines the disambiguation context by using heuristics such as entities mentioned in the first paragraph and those whose corresponding pages refer back to the target entity. [12] utilize entities with unambiguous names as local context and also use them to compute semantic relatedness. A different method is observed in [16], where they first train a local disambiguation system and then use its prediction score as the disambiguation context. An example of disambiguation context is shown in figure 2. The disambiguation context of the surface form human may include other unambiguous surface forms (each linked to only one target article), a subset of which is communication, linguistics, dialects, Natural languages, spoken, signed, etc.
Figure 2: Disambiguation context
3. RESOURCE AND SYSTEM OVERVIEW
3.1 Bridgeman Art Library
For this work, we used a corpus of query logs provided by the Bridgeman Art Library (BAL). The Bridgeman Art Library contains a large repository of images coming from 8,000 collections and representing more than 29,000 artists. From six months of query logs, we sampled 1,000 queries containing around 1,556 tokens. Queries are typed by users of the art library, using Bridgeman query language constructions. Each query is a text snippet containing the name of an artist, a painter, a painting, or an art movement. These texts often present spelling errors, capitalization mistakes, abbreviations, and non-standard words. The length of each query varies from 1 to 10 words; on average, there are 3 words per query. Some examples of the queries are shown in table 1.
Table 1: Examples of queries
calling of st. matthew, friedrich and dresden, piazzetta giambattista, xir 182931, chariot fire, gls 219820, banquet france, charles the bold, herbert james draper lamia, rembrandt crosses, order of malta, man with scythe, buddha tang guimet, lady%27s maid, cagnes-sur-mer, segovia cathedral, san francesco assisi italy, tour eiffel, guy fawkes before king james, napoleon%27s retreat, ruins wwi, v-j day, napoleon crossing the alps, princes in the tower, jean-etienne liotard, corinium museum cirencester gloucestershire
3.2 Structures in Wikipedia
In this section, we describe the categorization of articles in the Wikipedia collection.
Category Pages. A category page defines a category in Wikipedia. For example, the article Category:Impressionist painters lists painters of the Impressionist style. As it contains subcategories, one can use it to build a taxonomy between categories or to generalize a specific category to its parent level.
Disambiguation Pages. A disambiguation page addresses the problem of polysemy, the tendency of a surface form to relate to multiple entities. For example, the surface form tree might refer to a woody plant, a hierarchical data structure in a graphical form, or a computer science concept. The correct sense depends on the context in which the surface form appears; consider the relatedness of tree to algorithm, and of tree to plant.
List-of Pages. A "list of" page, as its name indicates, lists all articles related to a specific domain. It is useful when we want to build domain-dependent applications. For example, one can use the article Lists of painters to obtain information about all popular painters and build an entity disambiguator.
Redirect Pages. A redirect page exists for each alternative name used to refer to an entity in Wikipedia. The name is turned into a title whose article contains only a redirect link to the actual article for that entity. For example, Da Vinci is the surname of Leonardo da Vinci. It is therefore an alternative name for the artist, and consequently the article Da Vinci is just a pointer to the article Leonardo da Vinci.
Relevant Pages. A relevant page is what remains after scanning the whole Wikipedia collection and excluding category pages, disambiguation pages, list-of pages, redirect pages, and pages used to provide help, define templates, or define Wikipedia itself.
Irrelevant Pages. Irrelevant pages are those used to define terms or templates, or to provide help (for example, Template:Periodic table and Help:Editing).
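To make this categorization concrete, the sketch below classifies a page by its title and redirect flag. This is not the authors' code; the namespace prefixes and the "(disambiguation)" suffix are heuristics we assume from English Wikipedia naming conventions (in practice, disambiguation pages are usually detected via templates).

def classify_page(title: str, is_redirect: bool) -> str:
    """Heuristic page-type classifier based on title conventions (assumption)."""
    if is_redirect:
        return "redirect"
    if title.startswith(("Category:", "Template:", "Help:", "Wikipedia:", "File:")):
        # the namespace prefix identifies category, template, help, project and file pages
        return title.split(":", 1)[0].lower()
    if title.startswith(("List of ", "Lists of ")):
        return "list"
    if title.endswith("(disambiguation)"):
        return "disambiguation"
    return "relevant"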
3.3 Parsing the Wikipedia dump
When parsing the Wikipedia dump, we perform a structural analysis of its content. We use the pages-articles.xml file, which contains the current version of all article pages, templates, and other pages. The parser takes the Wikipedia dump file as input and analyzes the content enclosed in the various XML tags. We use the Wikipedia dump of July 2011. In the first parsing pass, the parser builds the set of redirection pairs (i.e., articles that are pointers to other, actual articles), the list-of pages, the disambiguation pages, and the set of titles of all articles in that Wikipedia edition.
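As an illustration of the first pass, a minimal streaming parse of pages-articles.xml might look as follows; this is a sketch, not the authors' implementation, and the MediaWiki export namespace URI is an assumption that should be taken from the root element of the dump actually used.

import xml.etree.ElementTree as ET

# Assumed MediaWiki export namespace; check the <mediawiki> root tag of the dump.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def first_pass(dump_path):
    titles, redirects = set(), {}
    for _, page in ET.iterparse(dump_path, events=("end",)):
        if page.tag != NS + "page":
            continue
        title = page.findtext(NS + "title")
        redirect = page.find(NS + "redirect")
        if redirect is not None:
            redirects[title] = redirect.get("title")  # redirection pair
        else:
            titles.add(title)
        page.clear()  # keep memory bounded on a multi-gigabyte dump
    return titles, redirects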
In the second pass, the system scans all Wikipedia articles and constructs the list of links (i.e., the surface form and target article of each link, and the number of times a surface form is linked to a given target article) and the lists of incoming and outgoing links of each article. Note that we use the set of redirection pairs so that everything related to links refers to the actual article and not to the redirected title. For example, as in the redirect example above, if the title Da Vinci appears in a link, we change it to Leonardo da Vinci. This makes the statistical measures derived from links more accurate. The third pass focuses on individual pages. It scans all Wikipedia articles and constructs, for each article: its ID, title, set of links (with surface form and target article), set of categories, and set of templates. Note that in all three passes we exclude category pages, disambiguation pages, file pages, help pages, list-of pages, and pages referring to templates and to Wikipedia itself. As a result, we keep only the most relevant pages with textual content.
Table 2: Statistics of the English Wikipedia
Type            Number
Redirect        4,466,270
List of         138,614
Disambiguation  177,483
Relevant        4,208,917
Total           11,459,639
3.4 Extracting dictionaries and structures
The set of Wikipedia titles and surface forms is preprocessed by case folding. After scanning all Wikipedia articles, we construct a set of titles, a set of surface forms (i.e., words or phrases used to link to Wikipedia articles), a set of files (i.e., Wikipedia articles without textual content, for example File:The Marie Louise's diadem.JPG), and a set of links (with the surface form and target article of each link). We use the term surface form to denote the occurrence of a mention inside a Wikipedia article and the term target article to denote the Wikipedia article that the surface form links to. Table 3 shows the dictionaries we extracted, with their sizes. Table 4 depicts links, with the corresponding surface forms and target articles, derived from the Wikipedia paragraph in figure 2. We use the dictionaries of titles, surface forms, and files to match against the textual content and detect entity/mention boundaries, whereas the set of links is used to compute the statistical measures for our learning framework.
Table 3: Dictionaries derived from Wikipedia
Dictionary      Size
Titles          3,742,663
Surface forms   8,829,624
Files           745,724
Links           10,871,741
Table 4: Links derived from Wikipedia
Surface form       Target article
human              Human
communication      Communication
linguistics        Linguistics
dialects           Dialects
Natural languages  Natural language
spoken             Speech
signed             Sign language
stimuli            Stimulus (physiology)
graphic writing    Written language
braille            Braille
whistling          Whistled language
cognitive          Cognitive
http://dumps.wikimedia.org/
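A sketch of how the link dictionaries could be accumulated in the second pass is shown below, assuming the wikitext of each kept article and the redirects map from the first pass are available; the [[target|surface]] regex is a simplification of MediaWiki link syntax and is our assumption, not the authors' parser.

import re
from collections import Counter

# Simplified [[Target|surface form]] link pattern (ignores nested markup).
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:\|([^\[\]]+))?\]\]")

link_counts = Counter()     # (surface form, target article) -> frequency
surface_counts = Counter()  # surface form -> number of times it appears as a link

def collect_links(wikitext, redirects):
    for m in WIKILINK.finditer(wikitext):
        target = m.group(1).strip()
        surface = (m.group(2) or target).strip().lower()
        target = redirects.get(target, target)  # e.g. map "Da Vinci" to "Leonardo da Vinci"
        link_counts[(surface, target)] += 1
        surface_counts[surface] += 1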
3.5 Statistical Measures
Following [12], we develop a machine learning approach to disambiguation based on the links available in Wikipedia articles. For every surface form in an article, a Wikipedian has manually selected the correct destination to represent the intended sense of the anchor text. This provides millions of manually defined ground-truth examples (see table 3) to learn from. From the dictionaries and structures extracted from Wikipedia, we derive statistical measures. We now describe the primitives for our statistical measures and learning framework.
- s: a surface form (anchor text)
- t: a target article (a link destination)
- W: the entire Wikipedia
- t_in: pages that link to t ("incoming links")
- t_out: pages that t links to ("outgoing links")
- count(s_link): number of pages in which s appears as a link
- count(s): total number of pages in which s appears
- p(t|s): number of times s appears as a link to t
- p(s): number of times s appears as a link
- |W|: total number of pages in Wikipedia
Keyphraseness. Keyphraseness is the probability that a phrase is a potential candidate to link to a Wikipedia article. Similarly to [9], to identify important words and phrases in a document we first extract all word n-grams. For each n-gram, we compute its probability of being a potential candidate.
Keyphraseness(s) = count(s_link) / count(s)
Commonness. Commonness is the probability that a phrase s links to a specific Wikipedia article t.
Commonness(s, t) = p(t|s) / p(s)
Relatedness. How is academic related to conference? What about music and composer? Text similarity has long been used in natural language processing applications. To measure the similarity between two terms, we consider each term as a representative Wikipedia article. For instance, the term exhibition is represented by the Wikipedia page http://en.wikipedia.org/wiki/Exhibition. We use the Normalized Google Distance [2], where the similarity judgement is based on term co-occurrence on web pages. The method was employed in [11] to compute relatedness between Wikipedia articles; in this setting, pages that link to both terms suggest relatedness.
Relatedness(a, b) = (log(max(|A|, |B|)) − log(|A ∩ B|)) / (log(|W|) − log(min(|A|, |B|)))
where a and b are the two articles of interest, A and B are the sets of all articles that link to a and b, respectively, and W is the entire Wikipedia.
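The three measures can be computed directly from the counts collected in Section 3.4. The sketch below assumes dictionaries surface_counts (occurrences of s as a link), page_counts (pages in which s appears at all), link_counts ((s, t) pair counts) and in_links (sets of pages linking to each article); the guard for an empty intersection is our addition, since the formula above is undefined in that case.

import math

def keyphraseness(s, surface_counts, page_counts):
    # fraction of the pages containing s in which s appears as a link
    return surface_counts.get(s, 0) / max(page_counts.get(s, 0), 1)

def commonness(s, t, link_counts, surface_counts):
    # fraction of the links with surface form s that point to target t
    return link_counts.get((s, t), 0) / max(surface_counts.get(s, 0), 1)

def relatedness(a, b, in_links, num_pages):
    A, B = in_links.get(a, set()), in_links.get(b, set())
    common = len(A & B)
    if common == 0 or not A or not B:
        return 0.0  # assumption: no shared in-links, treat the pair as unrelated
    return (math.log(max(len(A), len(B))) - math.log(common)) / (
        math.log(num_pages) - math.log(min(len(A), len(B))))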
4. DISAMBIGUATION METHOD
We follow a two-phase approach to disambiguation: first, recognize mentions (i.e., surface forms) in a document together with their potential candidates; then, generate features for each candidate and apply a machine learning approach.
4.1 Disambiguation Candidate Selection
The first step is to extract all mentions that can refer to Wikipedia articles and to construct, for each recognized mention, a set of disambiguation candidates. Following previous work [16, 12], we use Wikipedia hyperlinks to perform these steps. To identify important words and phrases in a document, we first extract all word n-grams. For each n-gram, we
use its keyphraseness, the probability of being a potential candidate, to recognize n-grams that are candidates to link to a Wikipedia article. We retain the n-grams whose keyphraseness exceeds a certain threshold. Preliminary experiments on a validation set showed that the best performance for mention detection is achieved with keyphraseness = 0.01. The next step is to identify potential candidate links. For that we employ commonness, the probability of an n-gram linking to a specific Wikipedia article. The result is a set of possible mappings (surface form, link_i). For computational efficiency, we keep the top 10 candidates with the highest commonness.
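The following sketch illustrates the candidate selection step, reusing the measure functions above; the 0.01 keyphraseness threshold and the top-10 cutoff are the values reported in the text, while the linear scan over link_counts is only for clarity (a surface-form index would be used in practice).

KEYPHRASENESS_MIN = 0.01
TOP_K = 10

def ngrams(tokens, max_n=10):
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def select_candidates(query, surface_counts, page_counts, link_counts):
    mentions = {}
    for s in ngrams(query.lower().split()):
        if keyphraseness(s, surface_counts, page_counts) < KEYPHRASENESS_MIN:
            continue
        # rank this surface form's possible targets by commonness
        targets = [t for (sf, t) in link_counts if sf == s]
        targets.sort(key=lambda t: commonness(s, t, link_counts, surface_counts),
                     reverse=True)
        if targets:
            mentions[s] = targets[:TOP_K]
    return mentions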
4.2 Learning Method and Features
As illustrated, disambiguation deals with the problem that the same surface form may refer to one or more entities. The links (surface form/target article) derived from Wikipedia provide a dataset of disambiguated occurrences of entities. Table 5 shows positive and negative examples in the dataset, created for six candidate senses of the surface form human. Given a surface form and its potential candidates, the entity disambiguation problem can be cast as a ranking problem. Assuming that an appropriate scoring function score(s, t_i) is available, the link corresponding to surface form s is defined to be the one with the highest score:
t̂ = arg max_{t_i} score(s, t_i)
Table 5: Disambiguation dataset
δ   Text                             Target article
1   Language is the human capacity   Human
0   Language is the human capacity   Human taxonomy
0   Language is the human capacity   Homo sapiens
0   Language is the human capacity   Human evolution
0   Language is the human capacity   Human behavior
0   Language is the human capacity   Humans (novel)
Baseline. As a baseline we use commonness, the fraction of times the title t is the target page for the surface form s. This single feature is a very reliable indicator of the correct disambiguation [16], and we use it as a baseline in our experiments.
Combining all features. As features for the disambiguation ranker, we use three kinds of features: the commonness of each candidate link, its average relatedness to the surrounding context, and the balance between these two measures. The relatedness of each candidate sense is the weighted average of its relatedness to each context article:
score(s, t) = ( Σ_{c ∈ C} relatedness(t, c) × commonness(s, t) ) / |C|
where C is the set of context articles. The target with the highest score is chosen as the final answer. We follow a global approach where the disambiguation context is obtained by taking every n-gram that has only one candidate and is thus unambiguous. To balance commonness and relatedness, we need to take into account both how frequent a surface form is and how good the context is. Thus the final feature is given by the sum of the weights previously assigned to each unambiguous surface form, as proposed by [12]. The data and experiments in this paper are based on a version of Wikipedia released in July 2011. For learning we use Liblinear [5] as our disambiguation classifier.
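Putting the pieces together, a minimal sketch of the global scoring and the final argmax could look as follows; C is the set of target articles of the unambiguous n-grams, and in the full system the three feature values would be passed to the Liblinear classifier rather than combined by hand as done here.

def score(s, t, C, link_counts, surface_counts, in_links, num_pages):
    com = commonness(s, t, link_counts, surface_counts)
    if not C:
        return com  # fall back to the commonness baseline when no context exists
    avg_rel = sum(relatedness(t, c, in_links, num_pages) for c in C) / len(C)
    return avg_rel * com

def disambiguate(s, candidates, C, link_counts, surface_counts, in_links, num_pages):
    # t_hat = argmax over candidate targets t_i of score(s, t_i)
    return max(candidates,
               key=lambda t: score(s, t, C, link_counts, surface_counts,
                                   in_links, num_pages))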
5. RESULTS AND DISCUSSION
We evaluate our approach using the dataset of noisy text we sampled from the Bridgeman Art Library, and standard datasets provided in previous work. The first dataset is a collection of 1,000 queries typed by users of the art library using Bridgeman query language constructions. Annotators were asked to link the first five nominal mentions of each co-reference chain to Wikipedia, if possible. The second dataset, from [12], is a subset of the AQUAINT corpus of newswire text annotated to mimic the hyperlink structure of Wikipedia. The third dataset, from [16], constructed using Wikipedia itself, is a sample of 10,000 paragraphs from Wikipedia pages; mentions in this dataset correspond to existing hyperlinks in the Wikipedia text. The results are shown in tables 6 and 7. We evaluate our whole system against other baselines using the previously employed "bag of titles" (BOT) evaluation [12, 16]. In BOT, we compare the set of titles output for a document with the gold set of titles for that document (ignoring duplicates), and compute standard precision, recall, and F-measure. Table 6 shows the results of D2W on query texts, which are unstructured and noisy. In the first five rows, we report the baseline results according to the number of disambiguation candidates generated. The fraction of recognized mentions and the system performance increase until about five candidates per mention are generated. The last row of table 6 shows the results of the machine learning framework when all features are used. Meanwhile, the results achieved on the standard datasets show that ours is competitive with state-of-the-art systems.
Table 6: Results with 1000 queries
Number of candidates   Recognized mention(s)   F-measure
Candidate 1            853                     64.77
Candidate 2            1022                    71.59
Candidate 3            1092                    75.42
Candidate 4            1134                    77.18
Candidate 5            1157                    78.32
All features           n/a                     69.32
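For reference, a minimal sketch of the BOT evaluation is given below; micro-averaging over documents is our assumption, since the exact averaging scheme is not spelled out in the text.

def bot_scores(predicted_sets, gold_sets):
    tp = fp = fn = 0
    for predicted, gold in zip(predicted_sets, gold_sets):
        predicted, gold = set(predicted), set(gold)  # duplicates are ignored
        tp += len(predicted & gold)
        fp += len(predicted - gold)
        fn += len(gold - predicted)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f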
The results show that a statistical approach is useful for entity recognition and disambiguation on query logs, and that exploiting a large-scale web resource helps. Note that we achieve competitive performance on the standard datasets released in previous work [16, 12]. We distinguish our method from previous work in that we adapt a model learnt from one noisy text collection to another. Our approach does not make use of any language-specific tool, such as a named entity recognizer or a part-of-speech tagger, and can be applied to other languages in the same manner with very little adaptation.
http://www.csie.ntu.edu.tw/ cjlin/liblinear/
Table 7: Results with other datasets
System               AQUAINT   Wikipedia
Ours                 86.16     84.37
Milne-Witten:2008b   83.61     80.31
Ratinov-Roth:2011    84.52     90.20
6. REFERENCES
[1] R. Bunescu and M. Pasca. Using encyclopedic knowledge for named entity disambiguation. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, pages 9–16, Trento, Italy, 2006.
[2] R. L. Cilibrasi and P. M. B. Vitanyi. The Google similarity distance. IEEE Transactions on Knowledge and Data Engineering, 19(3):370–383, Mar. 2007.
[3] S. Cucerzan. Large-scale named entity disambiguation based on Wikipedia data. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 708–716, Prague, Czech Republic, June 2007. Association for Computational Linguistics.
[4] W. Dakka and S. Cucerzan. Augmenting Wikipedia with named entity tags. In Proceedings of IJCNLP, 2008.
[5] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9, 2008.
[6] T. Finin, Z. Syed, J. Mayfield, P. McNamee, and C. D. Piatko. Using Wikitology for cross-document entity coreference resolution. In AAAI Spring Symposium: Learning by Reading and Learning to Read, pages 29–35. AAAI Press, 2009.
[7] E. Gabrilovich and S. Markovitch. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI'07, pages 1606–1611, San Francisco, CA, USA, 2007. Morgan Kaufmann Publishers Inc.
[8] E. Gabrilovich and S. Markovitch. Harnessing the expertise of 70,000 human editors: Knowledge-based feature generation for text categorization. Journal of Machine Learning Research, 8:2297–2345, Dec. 2007.
[9] O. Medelyan, I. H. Witten, and D. Milne. Topic indexing with Wikipedia. In Proceedings of the First AAAI Workshop on Wikipedia and Artificial Intelligence, 2008.
[10] R. Mihalcea and A. Csomai. Wikify!: Linking documents to encyclopedic knowledge. In Proceedings of the 16th ACM Conference on Information and Knowledge Management, pages 233–242, New York, NY, USA, 2007. ACM.
[11] D. Milne and I. H. Witten. An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In Proceedings of the 22nd Conference on Artificial Intelligence, 2008.
[12] D. Milne and I. H. Witten. Learning to link with Wikipedia. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, pages 509–518, New York, NY, USA, 2008. ACM.
[13] M. Mintz, S. Bills, R. Snow, and D. Jurafsky. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 1003–1011, Suntec, Singapore, August 2009. Association for Computational Linguistics.
[14] T. V. T. Nguyen and A. Moschitti. End-to-end relation extraction using distant supervision from external semantic repositories. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 277–282, Portland, Oregon, USA, June 2011. Association for Computational Linguistics.
[15] T. V. T. Nguyen and A. Moschitti. Joint distant and direct supervision for relation extraction. In Proceedings of the 5th International Joint Conference on Natural Language Processing, Chiang Mai, Thailand, November 2011.
[16] L. Ratinov, D. Roth, D. Downey, and M. Anderson. Local and global algorithms for disambiguation to Wikipedia. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1375–1384, Portland, Oregon, USA, June 2011. Association for Computational Linguistics.
[17] L. V. Subramaniam, S. Roy, T. A. Faruquie, and S. Negi. A survey of types of text noise and techniques to handle noisy text. In Proceedings of the Third Workshop on Analytics for Noisy Unstructured Text Data, AND '09, pages 115–122, New York, NY, USA, 2009. ACM.