2011 22nd International Workshop on Database and Expert Systems Applications
Towards tailored semantic annotation systems from Wikipedia

Shahad Kudama, Rafael Berlanga Llavori, Lisette García-Moya, Victoria Nebot, María José Aramburu Cabo
Universitat Jaume I, Castellón, Spain
[email protected] [email protected] [email protected] [email protected] [email protected]
Abstract—The annotation of texts in natural language links some terms of the text to an external information source that gives us more detailed information about them. Most of the approaches in this field take any text and annotate it by trying to find out the context of each term, as there are terms that have different meanings depending on the topic treated. In this article, we propose a variant of this process that annotates a text knowing its context in advance. The external source of information used is Wikipedia, from which we extract and use a fragment that embraces all the terms related to the context known beforehand.

Keywords-annotation; external source; Wikipedia; context; tailored; fragment

I. INTRODUCTION

Semantic annotation is the process of assigning concepts described in some knowledge resource to text-rich data. This process is especially useful for the Semantic Web, where RDF triplets are usually identified from text corpora and thesauri. In this context, semantic annotation goes beyond Named Entity Recognition (NER), as it aims at detecting both the type of the involved entities and the concept that best fits those entities. Semantic annotation is being successfully applied to the domain of Life Sciences [?], where a great variety of knowledge resources are publicly available, and the specificity of their vocabularies makes the annotation process easier than in general-purpose domains. However, many domain applications do not have such rich knowledge resources available, mainly because they have not been created or they are not public. For these domains, it seems a promising choice to use an open knowledge resource such as Wikipedia (http://en.wikipedia.org). Wikipedia has been shown to be a good knowledge resource for Natural Language Processing tasks [?], mainly for disambiguation [?] and semantic relatedness [?]. However, little work exists on using Wikipedia for the semantic annotation of free text [?]. The main issues to be addressed for this purpose are to handle the enormous ambiguity of Wikipedia entries and to ensure that the generated annotations really convey the intended semantics.

In this paper, our main aim is to build tailored lexicons that can be used by semantic annotation systems. Our main hypothesis is that if the target corpus to be tagged refers to a specific domain, there must exist a sub-vocabulary within Wikipedia that covers the majority of the concepts underlying such a corpus. This topic has been studied before. For example, in [?] terms related to the agronomic field are extracted manually from subsets of Wikipedia and then compared to the terms obtained from a large thesaurus, showing that the former option is promising. Here, we propose an automatic method to extract a Wikipedia subset that embraces all the terms related to a specific field.

A. Motivating Scenario

In our scenario, we will use semantic annotations in an Opinion Analysis System where we want to identify those product features that are subject to sentiment analysis, which will feed a corporate data warehouse [?]. In this paper, we use the digital cameras application domain. Current sentiment analysis processes identify product features by extracting noun phrases that are associated with opinion words [?]. However, this approach does not take into account the semantics of these noun phrases at all, and often they do not represent any product feature. In this context, the use of a semantic tagger can help us decide whether an identified feature has the proper semantics. In Table I we can see the output of two annotation systems: the Wikimachine (http://thewikimachine.fbk.eu/) and the Concept Retrieval System (CRS) [9]. Underlined words correspond to the Wikipedia concepts that best match them, and the concepts that are wrongly associated are marked in bold. For example, ensure is linked to a concept from the nutrition domain, pinch to one from plasma physics, flash cards to a learning method instead of a storage system, and so on. Clearly, when the annotation system does not take into account the target application domain, it is very likely to select a wrong concept. This is especially true for Wikipedia, as it contains a comprehensive set of concepts coming from a great variety of domains. One of the key points in the successful application of semantic taggers in Life Sciences is indeed the specificity of the used vocabularies with respect to the target corpus. This is the main idea we want to explore in this paper for any application domain, such as digital cameras.
Table I. Comparison of two semantic tagger outputs

The Wikimachine: I recently purchased the canon powershot g2 and am extremely satisfied with the purchase. Ensure you get a larger flash, 128 or 256, some are selling with the larger flash, 32mb will do in a pinch but you’ll quickly want a larger flash card as with any of the 4mp cameras. Bottom line, well made camera, easy to use, very flexible and powerful features to include the ability to use external flash and lense/filters choices.

CR System (with tailored lexicon): I recently purchased the canon powershot g2 and am extremely satisfied with the purchase. Ensure you get a larger flash, 128 or 256, some are selling with the larger flash, 32mb will do in a pinch but you’ll quickly want a larger flash card as with any of the 4mp cameras. Bottom line, well made camera, easy to use, very flexible and powerful features to include the ability to use external flash and lense/filters choices.
From the above discussion, we can formulate the tailoring problem as follows: find an optimal set of categories from Wikipedia such that it covers as many relevant categories of the application domain as possible, while minimizing the number of non-relevant categories.
II. THE WIKIPEDIA TAILORING PROBLEM

In our approach, the Wikipedia Knowledge Resource (KR) is represented as a directed graph whose nodes can be of two types: categories and pages. This graph only accounts for the relationships given by the categorization of pages, so we have two kinds of edges: those relating a page to a category, and those relating a category to a super-category. Pages and categories are labelled with a set of synonyms, which will be used for building the lexicon. A lexicon is just an inventory of lexical forms associated with each KR concept.

The resulting graph consists of an informal taxonomy of concepts defined by the community in a concurrent way. As a result, the graph can contain cycles and a great diversity of criteria in the way categories are related. This makes it difficult to manage this graph for domain-specific purposes, so we remove all cycles from it, obtaining a directed acyclic graph (DAG), and denote with c ⪯ c′ that c is a descendant of c′ in the DAG. We assume that the ⪯ relation is transitive, as it usually expresses the is-a semantics of taxonomies.

In this paper, we aim at extracting tailored graph fragments [?], always guided by the user requirements. The main idea is to obtain the subgraph that best covers the annotations we want to generate while rejecting those annotations that the user considers wrong. We want to perform this extraction with minimal interaction from the user, who just has to select some good and bad annotations from an initial automatically annotated corpus. For example, the user can select the categories Digital cameras and Photography equipment from the annotations "canon powershot" and "filter/lense" in Table I. By selecting the Wikipedia entries having these categories, we can build a lexicon of 240 concepts. This lexicon is too small to capture the main concepts of the digital cameras domain. One way to enlarge it consists of selecting some common ancestors of these two categories, Canc, and then retrieving all the Wikipedia entries having some category c such that c ⪯ c′ for some c′ ∈ Canc. By selecting the ancestor category Optics, we can retrieve 17355 concepts from Wikipedia. Unfortunately, this category also introduces a lot of noise into the lexicon, including subcategories such as Television stubs and Films. As a result, when tagging with such a lexicon, many wrong annotations can be produced.
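To make this graph model concrete, the following minimal sketch shows one possible in-memory representation of the category/page DAG and of the descendant relation ⪯. The names (WikiGraph, add_edge, is_descendant) are illustrative assumptions of this sketch only; the system described here instead adapts the interval labeling scheme of [10].

from collections import defaultdict

class WikiGraph:
    """Illustrative in-memory sketch of the Wikipedia KR graph (not the actual system)."""

    def __init__(self):
        self.parents = defaultdict(set)   # node -> direct categories (page->category, category->super-category)
        self.children = defaultdict(set)  # category -> direct sub-nodes
        self.synonyms = defaultdict(set)  # node -> lexical forms used to build the lexicon

    def add_edge(self, node, category):
        # Skip edges that would close a cycle, so that the graph stays a DAG.
        if self.is_descendant(category, node):
            return
        self.parents[node].add(category)
        self.children[category].add(node)

    def is_descendant(self, c, c0):
        """c ⪯ c0: c is (transitively) a descendant of c0 in the DAG."""
        stack, seen = [c], set()
        while stack:
            n = stack.pop()
            if n == c0:
                return True
            if n in seen:
                continue
            seen.add(n)
            stack.extend(self.parents[n])
        return False

With this representation, checking c ⪯ c′ amounts to testing reachability through the categorization edges.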
III. GRAPH FRAGMENTS

The tailoring problem previously described clearly relies on the capacity of extracting useful subgraphs from the Wikipedia category graph. These subgraphs are guided by the user requirements (the relevant and non-relevant annotations to be regarded), which are expressed over a set of categories or signature (Sig). The user must also indicate which kind of fragment she wants to generate, namely: the upper fragment UF(Sig), which just regards the ancestors of the nodes in Sig, and the lower fragment LF(φ(Sig)), which just regards the descendants retrieved with the logical expression φ(Sig) over the nodes in Sig. In [10], we proposed an efficient indexing scheme and query language for generating UF and LF from ontological resources. We have adapted these techniques to index the Wikipedia DAG and to extract both upper and lower fragments. Basically, the expressions φ have the following syntax:

C ::= C ⊓ D | C ⊔ D | ¬C

where C ⊓ D denotes the common descendants of C and D, C ⊔ D denotes the union of the descendants of both nodes, and ¬C denotes the nodes that are not descendants of C.

Now, the tailoring process can be summarized as follows:
1) Given a reference corpus T, annotate it with the whole Wikipedia. Extract the categories associated to these annotations.
2) Select the initial sets of relevant categories (Sig+) and non-relevant categories (Sig−).
3) Build an upper fragment from the categories selected in the previous step, UF(Sig+ ∪ Sig−).
4) Select a set of ancestors that best covers and discriminates Sig+ (Anc+ ⊆ UF(Sig+)).
5) Select a set of ancestors that best covers and discriminates Sig− (Anc− ⊆ UF(Sig−)).
6) Extract the lower fragment that considers the relevant ancestors while rejecting the non-relevant ones, LF(⊔Anc+ ⊓ ¬(⊔Anc−)).
7) Filter the lexicon of Wikipedia by selecting those pages that have some category in the previous lower fragment.
8) Tag the corpus T again with the filtered lexicon, and repeat from the second step if necessary.

The final lexicon includes the codes of the Wikipedia categories and pages extracted in the fragment and, for each of them, the first sentence of the corresponding Wikipedia article, which provides its definition and potential synonyms.
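As a rough illustration of the fragment expressions and of steps 3) to 7), the sketch below computes an upper fragment, the descendants of a category, a lower fragment of the form LF(⊔Anc+ ⊓ ¬(⊔Anc−)) and the filtered lexicon over the WikiGraph sketch given above. The helper names (upper_fragment, lower_fragment, filter_lexicon) are assumptions of this sketch; the actual extraction relies on the indexing scheme and query language of [10].

def upper_fragment(g, sig):
    """UF(Sig): the nodes of Sig together with all of their ancestors."""
    result, stack = set(), list(sig)
    while stack:
        n = stack.pop()
        if n in result:
            continue
        result.add(n)
        stack.extend(g.parents[n])
    return result

def descendants(g, c):
    """All nodes n such that n ⪯ c in the DAG (including c itself)."""
    result, stack = set(), [c]
    while stack:
        n = stack.pop()
        if n in result:
            continue
        result.add(n)
        stack.extend(g.children[n])
    return result

def lower_fragment(g, pos_ancestors, neg_ancestors):
    """LF(⊔Anc+ ⊓ ¬(⊔Anc−)): descendants of some relevant ancestor that are
    not descendants of any non-relevant ancestor (steps 4 to 6)."""
    pos = set().union(*(descendants(g, c) for c in pos_ancestors))
    neg = set().union(*(descendants(g, c) for c in neg_ancestors)) if neg_ancestors else set()
    return pos - neg

def filter_lexicon(g, pages, fragment):
    """Step 7: keep only the pages having some category in the lower fragment."""
    return {p: g.synonyms[p] for p in pages if g.parents[p] & fragment}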
IV. MEASURING COVERAGE AND DISCRIMINATION

The selection of good ancestors from the input signatures is achieved by ranking the concepts retrieved in the upper fragment UF(Sig+ ∪ Sig−) with respect to a series of measures. The first measure is the coverage of a concept n w.r.t. the signatures Sig+ and Sig−, respectively. Thus, we define the following sets:

coveredSig+(n) = {nj | nj ∈ Sig+ ∧ nj ⪯ n}
coveredSig−(n) = {nj | nj ∈ Sig− ∧ nj ⪯ n}
coveredSig(n) = coveredSig+(n) ∪ coveredSig−(n)

We measure the coverage as follows:

coverage+(n) = |coveredSig+(n)|
coverage−(n) = |coveredSig−(n)|
coverage(n) = coverage+(n) + coverage−(n)

From these measures, we can calculate the discrimination of each node with the entropy. The higher the entropy, the lower the discrimination. Discrimination measures the quality of a node, namely that it covers nodes from either the positive or the negative signature, but not from both:

entropy(n) = −P(Sig+|n) · log(P(Sig+|n)) − P(Sig−|n) · log(P(Sig−|n))

These probabilities are estimated as follows:

P(Sig+|n) = coverage+(n)/coverage(n)
P(Sig−|n) = coverage−(n)/coverage(n)

Another interesting measure for inspecting the nodes of UF(Sig+ ∪ Sig−) is the share, which is defined as follows:

share(n) = ∏_{ni : n ⪯ ni} 1/|successors(ni)|

Here successors(n) is the set of direct successors of n in the fragment. The idea behind the share is to measure the number of partitions produced from a root node down to the node n. The smaller the share, the denser the subgraph above the node. This measure promotes the nodes that are located in denser parts of the DAG. For example, it penalizes system-maintenance categories like Disambiguation categories or Categories requiring diffusion, which do not participate in true taxonomies.

The final measure for ranking concepts is the tuple:

score(n) = ⟨−entropy(n), coverage(n), share(n)⟩

In other words, those nodes that best discriminate, best cover and participate in denser parts of the graph will be promoted.

Ancestors selection: In order to facilitate the selection of ancestors by the users, we construct a fragment with the best scored nodes that produces a full coverage of the input signatures. Algorithm 1 shows how the fragment reduction process is performed. Basically, it consists of selecting the best scored nodes until the signature is covered. The last step consists of connecting all the selected nodes by applying the transitivity of the ⪯ relationship.

Algorithm 1 Fragment reduction
Require: the upper fragment UF(Sig+ ∪ Sig−) and the signatures Sig+ and Sig−.
Ensure: NewTax, an upper fragment with the best scored ancestors.
  Let Lrank be the list of concepts in UF(Sig+ ∪ Sig−) ordered by score(n) (highest to lowest);
  Let Fragment = ∅ be the node set of the fragment to be built;
  repeat
    pop a node n from Lrank and add it to Fragment
  until ⋃_{n ∈ Fragment} coveredSig(n) == Sig+ ∪ Sig−
  Let NewTax be the reconstructed graph for the signature Fragment
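For clarity, the sketch below spells out these measures and Algorithm 1 over the illustrative WikiGraph introduced in Section II. The function names and the use of a base-10 logarithm are assumptions of the sketch, and the final reconnection of the selected nodes through the transitivity of ⪯ is omitted.

import math

def covered_sig(g, n, sig):
    """coveredSig(n) restricted to a signature: the signature nodes nj with nj ⪯ n."""
    return {nj for nj in sig if g.is_descendant(nj, n)}

def entropy(cov_pos, cov_neg):
    """Discrimination: 0 when n covers only one signature, larger when it mixes both."""
    total = cov_pos + cov_neg
    h = 0.0
    for c in (cov_pos, cov_neg):
        if c:
            p = c / total
            h -= p * math.log10(p)  # base-10 logarithm is an assumption of this sketch
    return h

def share(g, n, uf):
    """share(n): product of 1/|successors(ni)| over the ancestors ni of n in the fragment."""
    s = 1.0
    for ni in uf:
        if ni != n and g.is_descendant(n, ni):
            succ = g.children[ni] & uf   # direct successors of ni inside the fragment
            if succ:
                s *= 1.0 / len(succ)
    return s

def score(g, n, uf, sig_pos, sig_neg):
    """score(n) = (-entropy(n), coverage(n), share(n)), compared lexicographically."""
    cov_pos = len(covered_sig(g, n, sig_pos))
    cov_neg = len(covered_sig(g, n, sig_neg))
    return (-entropy(cov_pos, cov_neg), cov_pos + cov_neg, share(g, n, uf))

def fragment_reduction(g, uf, sig_pos, sig_neg):
    """Algorithm 1: pick the best scored nodes of UF(Sig+ ∪ Sig−) until the whole
    signature is covered (reconnection of the selected nodes is omitted here)."""
    target = set(sig_pos) | set(sig_neg)
    ranked = sorted(uf, key=lambda n: score(g, n, uf, sig_pos, sig_neg), reverse=True)
    fragment, covered = [], set()
    for n in ranked:
        if covered >= target:
            break
        fragment.append(n)
        covered |= covered_sig(g, n, target)
    return fragment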
V. EXPERIMENTS

Preliminary results of our method have been obtained for the digital cameras application domain. After selecting a dataset of customer reviews collected from Amazon.com and CNET.com (available at http://www.cs.uic.edu/~liub/FBS/CustomerReviewData.zip), we have annotated them with the whole Wikipedia (i.e., taking as lexicon all the entries of Wikipedia), which consists of 5 million strings. Clearly, many annotations are either spurious or wrong, mainly because they are out of context. Fortunately, this dataset provides a manual annotation of the product features that are subject to opinion (e.g., flash card and lens cap). This allows us to easily select good annotations and categories from the annotated features (i.e., Sig+). For the set of negative categories Sig−, we selected the categories involved in the most frequent annotations that are not included in Sig+. Table II shows the result of annotating one review with the whole Wikipedia together with the features identified through sentiment analysis. Negative categories are usually related to films and songs whose titles can contain very frequent words like "recently" and "quickly". Finally, for this application, we selected 43 categories for Sig+ and 22 for Sig−.

Table II. Wikipedia annotations versus extracted features (wrong annotations are shown in bold face)

Annotations with the whole Wikipedia: I recently purchased the canon powershot g2 and am extremely satisfied with the purchase. Ensure you get a larger flash, 128 or 256, some are selling with the larger flash, 32mb will do in a pinch but you’ll quickly want a larger flash card as with any of the 4mp cameras. Bottom line, well made camera, easy to use, very flexible and powerful features to include the ability to use external flash and lense/filters choices.

Features extracted with sentiment analysis: I recently purchased the canon powershot g2 and am extremely satisfied with the purchase. Ensure you get a larger flash, 128 or 256, some are selling with the larger flash, 32mb will do in a pinch but you’ll quickly want a larger flash card as with any of the 4mp cameras. Bottom line, well made camera, easy to use, very flexible and powerful features to include the ability to use external flash and lense/filters choices.

Table III summarizes the selection of the best ranked ancestors for Sig+ and Sig−. The first part of the table shows the categories with the best entropy and positive coverage. However, the coverage of these categories is only 16 out of 43, so we need to select further ancestors to cover Sig+. The second part of the table regards the best ranked categories w.r.t. entropy and negative coverage. The only positive element covered by these nodes is Canon powershot cameras, which is one of the products regarded in the dataset. Finally, we show the categories with maximum coverage over both Sig+ and Sig−. Notice that all these nodes present a high entropy.

Table III. Ranking of categories

Category                 cov+   cov−   Entropy
Photography equipment      3      0     0.0
Photography                6      0     0.0
Photography companies      1      0     0.0
Cameras                    2      0     0.0
Storage media              3      0     0.0
Signal Processing          7      1     0.169
Creative works             1*    14     0.106
Genres                     1*    10     0.132
Animals                   39     22     0.283
Science                   39     22     0.283
Optics                    23     19     0.299

Taking into account these results, we decided to build a lower fragment with the following expression:

Canon powershot cameras ⊔ ((Optics ⊔ Signal processing ⊔ Storage media) ⊓ (¬Creative works ⊔ ¬Genres))

Notice that we included the Optics category in order to have a high coverage of Sig+. This category is complemented with other categories to achieve full coverage of Sig+. Over this set, we removed the subcategories of the nodes with the best negative coverage. As these nodes reject one interesting category, we include it separately (first operand). The resulting lower fragment contains 4467 categories, which produce a lexicon of 71959 entries. An example of the resulting tagged corpus is shown in Table I. Almost all the features identified with sentiment analysis are tagged with this lexicon (around 782 out of 1000 in the whole corpus). An interesting outcome of this semantic annotation is that many features can now be semantically grouped. As an example, the word flash is ambiguous in this context, as it can denote either the flash memory or the flash device (indeed, it can also denote many other things like films, places and people; however, as expected, these concepts are not included in the final fragment). With the obtained annotations, we can group the features compact flash card, flash card and flashcard into the same group, as they are annotated with the same concept. On the other hand, the features internal flash, external flash, flash unit and flash photo can be grouped into a different group, as they are associated with the device concept. In general, the results of these experiments are encouraging, although a deeper evaluation of the identified features through Wikipedia needs to be carried out.
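As a reading aid only, the expression above can be evaluated with the descendants helper sketched in Section III; all_nodes stands for the set of all category nodes and ¬C is read as its complement. This is a hedged sketch under those assumptions, not the actual evaluation mechanism, which relies on the fragment indexing of [10].

def experiment_fragment(g, all_nodes):
    """Evaluates: Canon powershot cameras ⊔ ((Optics ⊔ Signal processing ⊔ Storage media)
    ⊓ (¬Creative works ⊔ ¬Genres)), with ¬C taken as the complement w.r.t. all_nodes."""
    d = lambda c: descendants(g, c)       # reuses the sketch from Section III
    neg = lambda c: all_nodes - d(c)
    return d("Canon powershot cameras") | (
        (d("Optics") | d("Signal processing") | d("Storage media"))
        & (neg("Creative works") | neg("Genres")))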
VI. RELATED WORK

Some similar works have been performed in this field, although all of them are related to the Word Sense Disambiguation (WSD) problem. Tonelli and Giuliano [11] carried out the task of linking FrameNet and Wikipedia using a word sense disambiguation system that, for a given pair (frame, lexical unit) (F, l), finds the Wikipedia page that best expresses the meaning of l, that is, the automatic enrichment of the FrameNet database for English.

In [?], the task of automatically cross-referencing documents with Wikipedia is proposed, divided into two subtasks: link disambiguation and link detection. An original approach to disambiguating terms is implemented, using disambiguation to inform detection. Measures like the commonness of a sense (the number of times it is used as a destination) and its relatedness to the surrounding context (the weighted average of its relatedness to each context article) are used to disambiguate terms.

The work in [?] defines a new methodology for supporting a natural language processing task with semantic information available on the Web in the form of logical theories. They mapped the terms in the text to concepts in Wikipedia and to other knowledge resources linked to Wikipedia, like DBpedia, Freebase and YAGO.

As we can see, these works deal with similar tasks, but not with the situation presented in this article. We propose a new way to link terms to Wikipedia entries, assuming a predefined context, so the problem is transferred from WSD to the extraction of the Wikipedia entries best related to the context and the linking of the terms to the corresponding entries.
VII. CONCLUSION

The annotation of texts in natural language adds information about interesting terms by linking them to an external information source. Most of the research in this field consists of annotating texts by trying to find out the context of each term, as there are terms that have different meanings depending on the topic treated. In this paper we have proposed and implemented a solution for the annotation problem starting from the premise that we know the context of the text being annotated. This step towards tailored semantic annotation systems from Wikipedia is an interesting option to reduce ambiguity when annotating free text.

The results obtained are good, but there is still room for improvement, mainly related to the search for the best nodes of the category graph that cover a proportional part of the given positive and negative nodes. The automatic selection of the initial sets of relevant and non-relevant categories would also be an interesting improvement. This could be done by assigning weights to categories based on IDF measures over articles whose links provide the article context. Another important improvement would be the automatic search for the best combination of categories needed to build the expression of unions, intersections and negations between them. This problem amounts to searching for the minimum number of nodes whose descendants cover the positive categories and do not cover the negative ones, always trying to minimize the residues produced (categories that need to be added or removed after the expression is built). Adding the references in a Wikipedia article to the definition of the term would help the task of identifying false positives and enlarging the positive and negative signatures. Finally, providing measures to determine the quality of the generated lexicon will be necessary, because they could help us to detect false positives generated in the extracted fragment. A possible test would be to annotate Wikipedia articles of the particular domain after generating new categories using the extracted graph fragment.

ACKNOWLEDGMENTS

This work has been partially funded by the Spanish Research Program project TIN2008-01825/TIN and the Fundació Caixa Castelló project P1.1B2008-43.

REFERENCES

[1] L. Hirschman, A. Yeh, C. Blaschke, and A. Valencia, "Overview of BioCreAtIvE: critical assessment of information extraction for biology," BMC Bioinformatics, vol. 6, Suppl. 1, 2005.
[2] T. Zesch and I. Gurevych, "Analysis of the Wikipedia Category Graph for NLP Applications," in Proceedings of the TextGraphs-2 Workshop (NAACL-HLT), 2007.
[3] S. Cucerzan, "Large-scale named entity disambiguation based on Wikipedia data," in Proceedings of the 2007 Joint Conference on EMNLP and CoNLL, 2007, pp. 708–716.
[4] E. Gabrilovich and S. Markovitch, "Computing semantic relatedness using Wikipedia-based explicit semantic analysis," in Proceedings of the 20th International Joint Conference on Artificial Intelligence, 2007, pp. 1606–1611.
[5] D. Milne and I. H. Witten, "Learning to link with Wikipedia," in CIKM '08: Proceedings of the 17th ACM Conference on Information and Knowledge Management, ACM, 2008, pp. 509–518.
[6] D. Milne, O. Medelyan, and I. H. Witten, "Mining Domain-Specific Thesauri from Wikipedia: A Case Study," in WI '06: Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence, 2006, pp. 442–448.
[7] L. G. Moya, S. Kudama, M. J. Aramburu, and R. Berlanga, "Integrating web feed opinions into a corporate data warehouse," in Proceedings of the 2nd International Workshop on Business intelligencE and the WEB, ACM, 2011, pp. 20–27.
[8] B. Pang and L. Lee, "Opinion mining and sentiment analysis," Foundations and Trends in Information Retrieval, vol. 2, pp. 1–135, 2008.
[9] R. Berlanga, V. Nebot, and E. Jimenez, "Semantic annotation of biomedical texts through concept retrieval," Procesamiento del Lenguaje Natural, no. 45, pp. 247–251, 2010.
[10] V. Nebot and R. Berlanga, "Efficient retrieval of ontology fragments using an interval labeling scheme," Information Sciences, vol. 179, no. 24, pp. 4151–4173, 2009.
[11] S. Tonelli and C. Giuliano, "Wikipedia as frame information repository," in EMNLP '09: Proceedings of the 2009 Conference on Empirical Methods in NLP, 2009, pp. 276–285.
[12] C. Giuliano, A. M. Gliozzo, and C. Strapparava, "Kernel methods for minimally supervised WSD," Computational Linguistics, vol. 35, no. 4, pp. 513–528, 2009.