Harnessing Linked Knowledge Sources for Topic Classification in Social Media

Amparo E. Cano
Knowledge Media Institute, The Open University, UK
[email protected]

Andrea Varga
Organisations, Information and Knowledge Group (OAK), The University of Sheffield, UK
[email protected]

Matthew Rowe
School of Computing and Communications, Lancaster University, UK
[email protected]

Fabio Ciravegna
OAK, The University of Sheffield, UK
[email protected]

Yulan He
School of Engineering and Applied Science, Aston University, UK
[email protected]

ABSTRACT
Topic classification (TC) of short text messages offers an effective and fast way to reveal events happening around the world, ranging from those related to Disaster (e.g. Hurricane Sandy) to those related to Violence (e.g. the Egyptian revolution). Previous approaches to TC have mostly focused on exploiting individual knowledge sources (KSs) (e.g. DBpedia or Freebase) without considering the graph structures that surround concepts present in KSs when detecting the topics of Tweets. In this paper we introduce a novel approach for harnessing such graph structures from multiple linked KSs, by: (i) building a conceptual representation of the KSs, (ii) leveraging contextual information about concepts by exploiting semantic concept graphs, and (iii) providing a principled way of combining KSs. Experiments evaluating our TC classifier in the context of Violence Detection (VD) and Emergency Response (ER) show promising results that significantly outperform various baseline models, including an approach using a single KS without linked data and an approach using only Tweets.

Keywords
linked knowledge sources, violence detection, emergency response, named entities, semantic concept graphs

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. 24th ACM Conference on Hypertext and Social Media, 1–3 May 2013, Paris, France. Copyright 2013 ACM.

1. INTRODUCTION
In recent years, social media have continued to grow in popularity and have become a powerful platform for people to unite under common interests. Twitter in particular has proven to be a faster channel of communication than traditional media, as seen during the Egyptian revolution and the 2011 Japan earthquake. Therefore the real-time identification of the topics discussed in these channels could aid in different scenarios, e.g. violence detection and emergency response situations. However, this classification task poses several challenges, including: high topical diversity; irregular and ill-formed words; and, more importantly, the sparsity of Tweets' content, along with the evolving jargon which emerges as different events are discussed.

Recent research ([12, 9]) has proposed alleviating the sparsity of microposts by leveraging existing social knowledge sources. In particular, a large body of work, discussed in the related work section, has addressed the task of topic classification of Tweets. However, the majority of these approaches employ only lexical features (e.g. bag of words (BoW) or bag of entities (BoE)) extracted solely from a Tweet's content. Other approaches classify Tweets into topics by enhancing a Tweet's feature set with features obtained from a single Knowledge Source (KS). Nevertheless, to our knowledge, none of the existing approaches has leveraged the graph structures surrounding concepts present in a KS for the topical classification of Tweets.

In this paper we therefore propose a generic and unified framework for TC of Tweets using multiple linked KSs, and evaluate it in the violence detection (VD) and emergency response (ER) domains. In contrast to existing approaches, rather than focusing on lexical features derived from microposts, we propose a KS-based contextual enrichment of features. This enrichment is based on a technique we developed for deriving semantic meta-graphs from different KSs. Our approach leverages the entities appearing in a Tweet (e.g. Person, Location, Organisation) by exploiting additional contextual information about these entities' resources present in different KSs. From this information we derive semantic features which enhance the simple lexical feature representation of a Tweet.
In previous work ([16]) we have shown that the performance of a topic classifier differs depending on the choice of KS, and argued that different KSs may complement each other. Therefore in this work we investigate the benefit of combining and integrating the evidence of words and concepts from individual and linked KSs by following Linked Data principles. This approach results in the merging of additional semantic graphs derived from different knowledge spaces for the topic classification of Tweets.
The main research questions which we investigate are the following: i) Do semantic meta-graphs built from KSs contain useful semantic features about entities for the topic classification (TC) of Tweets? To what extent do these semantic features help the violence detection (VD) and emergency response (ER) TC tasks?; and ii) Which KS data and KS taxonomies (i.e. DBpedia and Yago, or Freebase) provide more useful information for TC of Tweets? The main contributions of this paper are as follows: i) we propose and evaluate a novel set of semantic meta-graph features about entities for the TC of Tweets; ii) we investigate different strategies for building topic classifiers for Tweets, based on a sole knowledge source and on combined linked knowledge sources, and show the superiority of the latter approach; iii) we propose a unified framework for harnessing the information and knowledge from multiple linked KSs for TC, showing its superiority over previous work using a sole KS and over Tweet-only classification; and iv) we compare the results of using different ontologies (DBpedia, Yago and Freebase) for deriving semantic features for TC of Tweets within the VD and ER tasks, and show that the combined mapped ontology provides the most accurate results.
2. MOTIVATION
Social knowledge sources constitute some of the largest repositories built in a collaborative manner, providing an up-to-date channel of information and knowledge on a large number of topics. The relevance of these KSs to Twitter is apparent from their Social Web characteristics: i) they are constantly edited by Web users; ii) they are created collaboratively; and iii) they cover a large number of topics. In this work we investigate the use of two KSs, namely DBpedia and Freebase. DBpedia1 is a KS derived from Wikipedia2. In DBpedia [2], each resource is harvested from a Wikipedia article and semantically structured according to the DBpedia3 (dbpedia) and YAGO24 (yago) ontologies, with links to external knowledge sources such as Freebase, OpenCyc5 and UMBEL6. The latest DBpedia dump, DBpedia 3.8, classifies 2.35 million resources into 359 distinct dbpedia ontological classes, which form a subsumption hierarchy and are described by 1,820 different properties. Conversely, the yago ontology [8] is a much larger and more fine-grained ontology, containing 447 million facts about 9.8 million entities which are classified into 365,372 classes. In contrast, Freebase7 (freebase) is a large online knowledge base which users can edit in a similar manner to Wikipedia. In Freebase [3], resources are harvested from multiple sources such as Wikipedia, ChefMoz, NNDB and MusicBrainz8, along with data contributed individually by users. These resources are semantically structured into Freebase's own ontologies, which consist of 1,450 classes and more than 7,000 unique properties. Overall, these ontologies (i.e. dbpedia, yago, freebase) enable a broad coverage of entities in the world, and allow entities to bear multiple overlapping types. One of the main advantages of exploiting these KSs is that each particular topic (e.g. http://dbpedia.org/page/Category:Violence) is associated with a large number of resources, allowing one to build a broad representation of that topic. In addition, each resource is related to different ontological classes or concepts which provide additional contextual information for that resource, enabling the exploitation of the various semantic structures of these resources. The use of this structured knowledge enables the contextual enrichment of a Tweet's entities by providing information that can help to disambiguate the role of a given entity in a particular context. Consider the Tweets in Figure 1: although the entity Obama has different roles such as president, Nobel laureate and husband, the role of this entity is defined by the contextual information provided in the content of each Tweet. Section 4 introduces our approach for leveraging this semantic contextual information by introducing the concept of semantic meta-graphs.

1 DBpedia, http://dbpedia.org
2 Wikipedia, http://wikipedia.org
3 http://wiki.dbpedia.org/Ontology
4 http://www.mpi-inf.mpg.de/yago-naga/yago/
5 OpenCyc, http://sw.opencyc.org/
6 UMBEL, http://www.umbel.org/
7 Freebase, http://freebase.com
8 Freebase Datasources, http://sources.freebaseapps.com/
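Looking up the ontological classes associated with a resource is a standard SPARQL type query against a KS endpoint. The sketch below only builds the query string (the function name and structure are illustrative, not part of the paper's pipeline); executing it would require a SPARQL client pointed at an endpoint such as DBpedia's.

```python
def build_type_query(resource_uri: str) -> str:
    """Build a SPARQL query retrieving the rdf:type classes of a resource.

    Illustrative helper: `a` is SPARQL shorthand for rdf:type, so the
    result set would contain classes such as dbpedia:Person for the
    resource http://dbpedia.org/resource/Barack_Obama.
    """
    return (
        "SELECT DISTINCT ?class WHERE { "
        f"<{resource_uri}> a ?class . }}"
    )

query = build_type_query("http://dbpedia.org/resource/Barack_Obama")
```

The same query shape works against Freebase-style RDF dumps, which is what makes the multi-KS enrichment described above uniform across sources.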
3. RELATED WORK
Previous research on exploiting KSs for TC of Tweets can be divided into two main strands: approaches that use local metadata and approaches that exploit the link structure of the KSs. In the first case, Genc et al. [6] proposed a latent semantic topic modelling approach which mapped each Tweet to the most similar Wikipedia articles based on lexical features extracted from the Tweets' content only. Song et al. [15] mapped a Tweet's terms to the most likely resources in the Probase KS; these resources were used as additional features in a clustering algorithm which outperformed the simple BoW approach. Munoz et al. [11] proposed an unsupervised vector space model for detecting topics in Tweets in Spanish. They used syntactic features derived from PoS (part-of-speech) tagging, extracting entities using the Sem4Tags tagger ([5]) and assigning a DBpedia URI to those entities by considering the words appearing in the context of the entity inside the Tweets. Vitale et al. [17] proposed a clustering-based approach which augmented the BoW features with BoE features extracted using the Tagme system, which enriches a short text with Wikipedia links by pruning n-grams unrelated to the input text, showing significant improvement over the BoW features. Recently, we [16] studied the similarity between KSs and Twitter using both BoW and BoE, showing that the DBpedia and Freebase KSs contain complementary information for TC of Tweets, with the lexical features achieving the best performance. Turning to the approaches exploiting the linked structure of KSs, Michelson et al. [9] proposed an approach for discovering Twitter users' topics of interest by first extracting and disambiguating the entities mentioned in a Tweet; a sub-tree of the Wikipedia category graph containing the disambiguated entity is then retrieved and the most likely topic is assigned. Milne et al. [10] also assigned resources to Tweets.
In their approach they make use of Wikipedia as a knowledge source and consider each Wikipedia article as a concept; their task is then to assign relevant Wikipedia article links to a Tweet. They propose a machine learning approach which makes use of Wikipedia n-gram and Wikipedia link-based features. Xu et al. [18] proposed a clustering-based approach which linked terms inside Tweets to Wikipedia articles, leveraging Wikipedia's linking history and the terms' textual context to disambiguate the terms' meaning. Despite the success of existing approaches, the vast majority still exploits a single KS when detecting topics in Tweets; however, recent studies indicate that KSs contain complementary information ([16]). Furthermore, although existing approaches ([11, 16]) consider entities' metadata when detecting topics in Tweets, the information is constrained by the NER service used (e.g. OpenCalais9 or Tagme10), which often returns generic entity types ([14]), ignoring more fine-grained semantic information described in external KSs. In contrast to previous work, we present an approach which exploits the semantic structure of multiple linked KSs to gauge a more fine-grained role of an entity in a specific topic, by proposing the use of semantic meta-graphs.

Figure 1: Tweets exposing different contexts involving the same entity

Figure 2: Deriving a semantic meta-graph from multiple KSs
4. FRAMEWORK FOR TOPIC CLASSIFICATION OF MICROPOSTS
The proposed approach for building a topic classifier for Tweets consists of four main stages, depicted in Figure 3: i) dataset collection; ii) dataset enrichment (of both the Tweets and the KS-derived data); iii) semantic feature derivation; and iv) building a topic classifier based on features derived from crossed sources. In the first stage, data collection, data from both Twitter and the KSs is retrieved. The Twitter dataset comprises a set of topically annotated Tweets. Conversely, the KS dataset is built from a set of articles relevant to a given topic, extracted from multiple KSs. This study considers two KSs, namely DBpedia (DB) and Freebase (FB), which are applied both independently and merged. We therefore consider three scenarios for the use of these KS datasets: i) DB - from DBpedia only; ii) FB - from Freebase only; and iii) DB-FB - from both DBpedia and Freebase (see Section 5). The second stage, dataset enrichment, performs two main steps: (i) entity extraction - relying on the OpenCalais and Zemanta11 services for named entity recognition; and (ii) semantic mapping - where the obtained named entities are mapped to their KS resource counterparts, if they exist12. The third stage, semantic feature derivation, leverages the semantic information about the extracted entities within the different KSs. This stage comprises two steps: (i) semantic meta-graph construction and (ii) semantic feature augmentation, which are discussed in the following subsections.
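The stages above can be sketched as a small pipeline. All function names here are placeholders shown only to make the data flow concrete; in particular, the toy `extract_entities` stands in for the OpenCalais and Zemanta NER services the paper actually uses.

```python
def extract_entities(text):
    # Placeholder NER: the paper relies on OpenCalais and Zemanta here.
    # We naively treat capitalised tokens as entity candidates.
    return [tok for tok in text.split() if tok[:1].isupper()]

def map_to_ks(entities, ks_index):
    # Semantic mapping: link each entity to its KS resource, if one exists.
    return {e: ks_index[e] for e in entities if e in ks_index}

def enrich_tweet(tweet, ks_index):
    # Stage ii (enrichment): entity extraction followed by semantic mapping.
    entities = extract_entities(tweet)
    resources = map_to_ks(entities, ks_index)
    # Stages iii-iv would derive semantic features from these resources
    # and feed them into the cross-source topic classifier.
    return {"entities": entities, "resources": resources}

toy_index = {"Obama": "dbpedia:Barack_Obama"}  # hypothetical KS lookup table
result = enrich_tweet("Obama visits Cairo", toy_index)
```

Entities without a KS counterpart simply drop out of the mapping, matching the "if they exist" condition of the semantic mapping step.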
9 OpenCalais, http://www.opencalais.com
10 Tagme, http://tagme.di.unipi.it/
11 Zemanta, http://zemanta.com
12 Following this process, the percentage of entities without a dereferenced URI is 35% in DBpedia, 40% in Freebase, and 36% in Twitter.

4.1 Deriving Semantic Meta-Graphs
Figure 3: Architecture of cross-source TC using semantic features.

The semantic mapping of a named entity into a KS resource incorporates a rich semantic representation. Figure 2 presents an extract of the semantic properties and classes for the entity "Barack Obama". In this work, rather than focusing on the instances associated with a resource, we focus on each triple's semantic structure at a meta-level, for which we introduce the following definition.

Definition 1 (Resource Meta-Graph). A resource meta-graph is a tuple G := (R, P, C, Y) where:
• R, P, C are finite sets whose elements are resources, properties, and classes, respectively;
• Y is the ternary relation Y ⊆ R × P × C representing a hypergraph with ternary edges.

The hypergraph of a resource meta-graph Y is defined as a tripartite graph H(Y) = ⟨V, D⟩ where the vertices are V = R ∪ P ∪ C, and the edges are D = {{r, p, c} | (r, p, c) ∈ Y}.

A resource meta-graph provides information regarding the set of ontologies and properties used in the semantic definition of a given resource. The meta-graph of a given entity e can be represented as G(e) = (R, P, C, Y′), the aggregation of all resources, properties and classes related to this entity. This definition serves as a formal representation of the triples related to an entity, and enables building upon subgraphs of this graph. For example, we introduce two further notations: R(c) = {e1, ..., en}, the set of all entity resources whose rdf:type is class c; and R′(c) = {e1, ..., em}, the set of entity resources whose types are specialisations of c's parent type (i.e. resources whose rdf:type are siblings of c). This process results in a semantic concept graph which associates each entity with the corresponding semantic ontological classes (or concepts) involving this entity.
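Definition 1 can be sketched directly as a set of (r, p, c) triples. The class below and its helper names are illustrative; in particular, R′(c) is computed from a caller-supplied rdfs:subClassOf parent map, an assumption since the paper does not fix a data structure for the class hierarchy.

```python
class ResourceMetaGraph:
    """Sketch of Definition 1: G = (R, P, C, Y), with Y ⊆ R × P × C."""

    def __init__(self, triples):
        self.Y = set(triples)                  # ternary edges (r, p, c)
        self.R = {r for r, _, _ in self.Y}     # resources
        self.P = {p for _, p, _ in self.Y}     # properties
        self.C = {c for _, _, c in self.Y}     # classes

    def R_of(self, c):
        """R(c): entity resources whose rdf:type is class c."""
        return {r for r, p, c2 in self.Y if p == "rdf:type" and c2 == c}

    def R_prime(self, c, parent_of):
        """R'(c): resources typed with siblings of c (same parent class)."""
        siblings = {c2 for c2, par in parent_of.items()
                    if par == parent_of.get(c) and c2 != c}
        return {r for r, p, c2 in self.Y
                if p == "rdf:type" and c2 in siblings}

# Hypothetical triples aggregated for two entities:
g = ResourceMetaGraph([
    ("dbpedia:Barack_Obama", "rdf:type", "yago:PresidentsOfTheUnitedStates"),
    ("dbpedia:Barack_Obama", "dcterms:subject", "dbpedia:Category:Violence"),
    ("dbpedia:Angela_Merkel", "rdf:type", "yago:ChancellorsOfGermany"),
])
```

With a parent map placing both yago classes under a common parent, `R_prime` returns the resources typed with the sibling class, mirroring the sibling-based notation R′(c) above.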
In light of the proposed three KS scenarios we construct three different semantic meta-graphs: (i) one from DB using the dbpedia and yago ontologies; (ii) one from FB using the freebase ontology; and (iii) one from DB-FB using the joint ontologies. For the joint scenario we use the concepts from the dbpedia ontology together with the classes obtained after mapping the yago and freebase ontologies13. Once a semantic meta-graph has been constructed for a given entity, two main features can be derived from it, namely the class and property features. These features provide additional contextual information regarding an entity and are described as follows:

C: Semantic class features: This feature set consists of all the classes appearing in the semantic meta-graph of a given entity, capturing fine-grained information about the entity. For example, for Barack Obama these features would include yago:PresidentsOfTheUnitedStates, freebase:/book/author, yago:LivingPeople, and dbpedia:Person. Our main intuition is that the relevance of an entity to a given topic can be inferred from the entity's class types. For example, the class yago:PresidentsOfTheUnitedStates could be considered more relevant to the topic "Violence" than the class yago:Singer.

P: Semantic class-property features: This feature set captures all the properties appearing in the semantic meta-graph of a given entity. Our intuition is that, given a context, certain properties of an entity can be more indicative of the entity's relevance to a topic than others. For example, given the role of Tahrir Square in the Egyptian revolution, properties such as dcterms:subject could be more topically informative than geo:geometry. The relevance of a property to a given topic can be derived from the semantic structure of a KS graph using the approach proposed in Subsection 4.2.

13 The mapping of Freebase entity classes to the most likely yago classes was done by a combined element- and instance-based technique (www.l3s.de/~demidova/students/master_oelze.pdf) and is available at http://iqp.l3s.uni-hannover.de/yagof.html
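Extracting the C and P feature sets from an entity's meta-graph then amounts to collecting, over its triples, the class and property positions respectively. A minimal sketch with hypothetical hard-coded triples (note this simplified flat-triple view also puts dcterms:subject targets into the class set, since C is simply the third component of each triple in Definition 1):

```python
def class_features(triples):
    """C: all classes appearing in an entity's semantic meta-graph."""
    return {c for _, _, c in triples}

def property_features(triples):
    """P: all properties appearing in an entity's semantic meta-graph."""
    return {p for _, p, _ in triples}

# Hypothetical meta-graph triples for one entity:
obama_triples = [
    ("dbpedia:Barack_Obama", "rdf:type", "yago:PresidentsOfTheUnitedStates"),
    ("dbpedia:Barack_Obama", "rdf:type", "dbpedia:Person"),
    ("dbpedia:Barack_Obama", "dcterms:subject", "dbpedia:Category:Violence"),
]
```

Both sets then feed the document's feature space, weighted by the strategies of Subsection 4.2.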
4.2 Weighting Semantic Features

In order to capture the relative importance of each feature in a semantic meta-graph we propose two different weighting strategies, based on the generality and specificity of a feature in a given semantic meta-graph.

W-Freq: Semantic Feature Frequency: A lightweight approach for weighting the ontological class and property features enhancing the feature space of a document (i.e. a KS article or a Tweet) x is to consider all the semantic meta-graphs extracted from the entity resources appearing in this document. We define the frequency of a semantic feature f in a given document x, with Laplace smoothing, as follows:

$$\mathrm{SFF}_x(f) = \frac{N_x(f) + 1}{|F| + \sum_{f' \in F} N_x(f')} \qquad (1)$$
where N_x(f) is the number of times feature f appears in all the semantic meta-graphs associated with document x, and F is the semantic features' vocabulary. This weighting function captures the relative importance of a document's semantic features, while the smoothed normalisation keeps scores comparable across documents and prevents bias towards longer documents.
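Equation (1) can be sketched as follows; the counting of feature occurrences over a document's meta-graphs is simplified to a flat list of features, and all names are illustrative.

```python
from collections import Counter

def sff(document_features, vocabulary):
    """W-Freq: Laplace-smoothed semantic feature frequency, Eq. (1).

    SFF_x(f) = (N_x(f) + 1) / (|F| + sum over f' in F of N_x(f'))
    """
    counts = Counter(document_features)
    total = sum(counts.values())       # sum of N_x(f') over the vocabulary
    denom = len(vocabulary) + total    # |F| + total, the smoothed normaliser
    return {f: (counts[f] + 1) / denom for f in vocabulary}

# Hypothetical features observed in one document's meta-graphs:
vocab = {"yago:LivingPeople", "dbpedia:Person", "yago:Singer"}
weights = sff(["yago:LivingPeople", "yago:LivingPeople", "dbpedia:Person"],
              vocab)
```

With counts (2, 1, 0) over a vocabulary of size 3, the denominator is 3 + 3 = 6, so even the unseen feature yago:Singer receives a non-zero weight of 1/6, which is exactly the point of the Laplace smoothing.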
While the W-Freq (semantic feature frequency) weighting function depends on the occurrences of features in a particular document, more general weighting information can be derived from a KS's semantic structure to characterise a semantic meta-graph. To derive a weighted semantic meta-graph we propose the following W-SG weighting strategy. W-SG: Class-Property Co-Occurrence Frequency: The rationale behind this novel weighting strategy is to model the relative importance of a property p (e.g. dbpedia-owl:ground) to a given class c (e.g. yago:MiddleEasternCountries), together with the generality of the property in a KS's graph. We propose to compute how specific and how general a property is to a given class, based on a set of semantically related resources derived from the KS's graph. Taking into account the notations introduced in Subsection 4.1, given the semantic meta-graph of an entity e (i.e. G(e)), we derive the relative importance of a property p ∈ G(e) to a given class c ∈ G(e) in a KS graph G_ks by first defining the specificity of p to c as follows:
case the generality of property dbpedia-owl:ground given the class yago:MiddleEasternCountries for the DB graph is computed as: generality(dbpedia-owl:ground, yago:MiddleEasternCountries) = |{ <yago:MiddleEasternCountries rdf:subClassOf ?parent>,