Continuous Semantics: Dynamically Following Events

Pavan Kapanipathi, Christopher Thomas, Pablo N. Mendes, and Amit Sheth

Kno.e.sis Center, Wright State University, Dayton, Ohio
{pavan,topher,pablo,amit}@knoesis.org
http://www.knoesis.org/
Abstract. Monitoring evolving events through microposts has prompted academic interest due to the participation of a large number of observers, their perspectives and the real-time nature of the medium. However, these advantages are also a deterrent, since the information consumer has to track a plethora of sources in a timely manner to follow an evolving event. To alleviate this issue, we present an iterative Semantic Web approach for tracking rapidly evolving events using microposts. Our evolution cycle starts with microposts being streamed using simple event descriptors. The filtered microposts are then used to extract temporally relevant event descriptors. The extracted event descriptors are used to create event models by identifying a relevant Wikipedia subgraph. Event models are, in turn, used to filter microposts, which leads back to the first step of the cycle. To demonstrate the effectiveness of our approach, we present a discussion analyzing our model based on the January 2011 protests in Egypt.

Keywords: continuous semantics, domain model evolution, streaming, annotation, RDF, SPARQL
1 Introduction
Recent years have seen a significant change in the dissemination of news and information. Observations of unfolding events are increasingly shared in real time through ubiquitously accessible microblogging platforms [6]. However, the information being shared is growing exponentially. Twitter alone generates more than 100 million microposts a day. This avalanche of data makes it difficult to seek out specific information, especially in real time. Event-specific information is often only temporarily interesting and gets stale quickly. To achieve the highest information gain, it is important that select content finds its way to the user quickly. This kind of information tracking proved its importance in the recent Egypt protests¹, where Twitter and other social networking sites were used as major platforms for protesters to organize gatherings and to stay updated with major changes in the event.

This paper presents a Semantic Web approach to support dynamic event tracking on Twitter. In order to follow events, users of Twitter rely on two simple mechanisms:
– Social network based, whereby a user can subscribe to another user's feed, a process called "following".
– Content based, using techniques such as keyword search and tagging (hashtags).

Following allows users to focus on specific users or news agencies. However, subscribing only to a set of known bloggers may lead to a loss of critical information from unknown bloggers who are directly or indirectly affected by the event. On the other hand, using tags to keep up with an event neglects the fact that events tend to split into finer-grained sub-events with their own tags and keywords, so users need to be aware of these new tags. The protests in Egypt were tagged first with #Egypt and #Jan25, but later with #Tahrir, #Jan28, #EgyptRevolt, etc. Keyword searches do not require attention to tags, but users must intuitively anticipate the unfolding event and guess appropriate keywords. Even if successful, the cognitive load imposed on the user makes this an unattractive approach.

In this work we offer a solution to the problem of event following based on the dynamic creation of semantic event models. The user needs to specify an area of interest only once, at which point an event model is automatically created. As the event unfolds, microposts are analyzed and, based on new developments, an updated model is created that subsequently filters microposts for the next iteration in the cycle (Fig. 1). This work thus presents an early realization of Continuous Semantics [6].

For the steps in this cycle, we leverage our previous work on automatic domain model creation [7] and Linked Open Social Signals [4]. Semantic annotation with Linked Data concepts (e.g. DBpedia [1]) enables users to filter streaming microposts using SPARQL queries, offering significantly more expressiveness than keywords. In addition to the annotation, DBpedia provides facts to enable richer filtering based on attributes and relationships. Twarql is an implementation of this architecture that has been demonstrated in a brand tracking scenario [3], where information about the competitors of a product was drawn automatically from DBpedia.

This paper is organized as follows. Section 2 presents the architecture of the application. Section 3 discusses the effectiveness of the approach when applied to a real-world scenario. Section 4 presents conclusions and future work.

¹ http://www.geekosystem.com/egyptian-protests-social-media/
2 Continuous Semantics for Following Events through Microposts
This work illustrates the vision of Continuous Semantics for real-time social data. Figure 1 shows the cycle of Continuous Semantics for following a dynamically evolving event on Twitter. Each component in the cycle represents a module in our architecture and is described in detail in the following subsections.
Fig. 1. Pipeline for event descriptions using Continuous Semantics.
We summarize the information flow in this cycle as follows. Event information enters the cycle as streaming microposts from Twitter. Twarql [4] filters microposts matching user-defined constraints (expressed, e.g., as a SPARQL query), creating a corpus of relevant microposts (Section 2.1). Keyphrase extraction techniques are then used to select prominent relevant terms in order to keep the focus on the unfolding event. The selected keyphrases are fed into Doozer [7] for the automatic creation of an event model, a domain model that specifically describes the event in focus (Section 2.2). The model is then translated into a Twarql filter. The last step in the cycle is to update the micropost filter in Twarql to reflect the model created by Doozer.
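The control flow of this cycle can be illustrated with a minimal Python sketch. It is not the Twarql or Doozer implementation: the three helper functions are toy stand-ins, the RELATED dictionary is a hypothetical substitute for the Wikipedia subgraph, and retaining the seed terms in every iteration is an assumption made only to keep the example stable.

```python
from collections import Counter

# Hypothetical toy knowledge graph standing in for the Wikipedia subgraph
# that a model builder would explore; the entries are illustrative only.
RELATED = {
    "egypt": {"cairo", "mubarak"},
    "cairo": {"tahrir", "heliopolis"},
}

def filter_posts(posts, terms):
    """Stand-in for the micropost filter: keep posts mentioning any model term."""
    return [p for p in posts if terms & set(p.lower().split())]

def extract_descriptors(posts, k=2):
    """Stand-in for keyphrase extraction: the k most frequent words."""
    counts = Counter(w for p in posts for w in p.lower().split())
    return [w for w, _ in counts.most_common(k)]

def build_model(descriptors):
    """Stand-in for model creation: expand descriptors with related concepts."""
    model = set(descriptors)
    for d in descriptors:
        model |= RELATED.get(d, set())
    return model

def follow_event(daily_posts, seed_terms):
    """Run one filter -> extract -> model iteration per day.

    Returns, for each day, the sorted filter terms used and the posts matched.
    """
    terms = set(seed_terms)
    history = []
    for posts in daily_posts:
        matched = filter_posts(posts, terms)
        history.append((sorted(terms), matched))
        if matched:
            # Assumption: seed terms are retained so the filter never empties.
            terms = build_model(extract_descriptors(matched)) | set(seed_terms)
    return history

day1 = ["egypt cairo protest", "egypt cairo march", "football tonight"]
day2 = ["march to heliopolis palace", "tahrir square is full", "football again"]
history = follow_event([day1, day2], {"egypt"})
# The day-2 posts mention neither "egypt" nor "cairo"; they match only
# because the model built from day 1 added "heliopolis" and "tahrir".
```

The point of the sketch is the feedback loop: the filter used on a given day is derived from what the previous day's filter let through, so the set of tracked terms can follow the event as its vocabulary shifts.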
2.1 Micropost Annotation & Keyphrase Extraction
Twarql takes in a stream of microposts and performs a series of information extraction steps that generate annotations at the end of the pipeline. Microposts, as they arrive, are individually sent through an extraction pipeline that extracts content-dependent metadata such as URLs, hashtags and named entities. URLs that use shortening services (e.g., bit.ly, tinyurl.com) are resolved to the corresponding original URLs by following their HTTP redirects. Similarly, community-provided definitions for hashtags are obtained from the tagal.us API so that their descriptions can also be added as annotations to microposts. A simple named entity extraction algorithm identifies named entities (resources) from DBpedia that are mentioned in the text of microposts. Content-independent metadata such as username, date of creation, location, etc. are also captured from the information provided by Twitter. All the content-dependent and content-independent metadata are then serialized into RDF using standard vocabularies such as FOAF², MOAT³, SIOC⁴ and OPO⁵ [5].

The focused delivery of real-time information is enabled in Twarql by a filtering mechanism using SPARQL queries that are either provided manually by the user or obtained by translating event models into SPARQL.

For extracting keyphrases from the set of filtered microposts, we implemented simple n-gram extractors that quickly extract prominent phrases. The extractors ignore English stopwords, count the number of occurrences of each n-gram and store the counts in a frequency map.

2.2 Event Model Evolution

Since this work represents an event with a model, adapting the delivery of posts to changes in the event requires updating the model. As events unfold, new subtopics become prominent and the event model should adapt correspondingly. The Doozer system [7] automatically creates focused domain models by carving out of the broad Wikipedia article and category graph the subgraph that best describes a user's interest. This process follows an "expand and reduce" paradigm that allows the algorithm to first explore and exploit the Wikipedia concept space to find potentially relevant information before reducing the concepts that were initially deemed interesting to those that are closest to the actual domain of interest.

The top ten one-grams and two-grams produced in the keyphrase extraction module are selected as prime event descriptors for each day and then fed to Doozer to create event models. These models attempt to redefine the event, updated with the latest happenings, and are then used to filter timely, relevant microposts.

In this work we focus on the overall architecture for the realization of the Continuous Semantics vision. The best way to automatically detect a topic shift is out of the scope of this paper. See Leskovec et al. [2] for previous work on this subject.
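The keyphrase extraction of Section 2.1 and the selection of the top one-grams and two-grams can be sketched as follows. This is a minimal illustration, not the extractor used in the system: the stopword list and the tokenizer are assumptions, and dropping stopwords before forming n-grams is only one possible reading of "ignore English stopwords".

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list; the actual list is not specified.
STOPWORDS = {"a", "an", "and", "the", "of", "on", "in", "to", "is", "for", "at", "with"}

def tokenize(text):
    """Lowercase a micropost and keep word-like tokens (hashtags keep their '#')."""
    return re.findall(r"#?\w+", text.lower())

def ngram_counts(posts, n):
    """Build a frequency map of n-grams, dropping stopwords before forming n-grams."""
    counts = Counter()
    for post in posts:
        tokens = [t for t in tokenize(post) if t not in STOPWORDS]
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

def top_descriptors(posts, k=10):
    """Return the k most frequent one-grams and two-grams, keyed by n."""
    return {n: [g for g, _ in ngram_counts(posts, n).most_common(k)] for n in (1, 2)}

posts = [
    "Million man march planned in Cairo",
    "Join the million man march on Tuesday",
    "Protesters call for a million man march",
]
descriptors = top_descriptors(posts)
# "million man" and "man march" each occur three times, so both lead the
# two-gram list; every other two-gram occurs once.
```

A frequency map of this kind is cheap to maintain over a stream, which is consistent with the stated goal of extracting prominent phrases quickly; the resulting lists are the event descriptors that would be handed to the model builder.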
3 Discussion
In this section, we discuss the application of our realization of Continuous Semantics to following events in the Egypt protests scenario. The first step was to analyze how the extracted keyphrases corresponded temporally to the unfolding event. Examples of top-occurring n-grams are "million man" and "man march" on 30th and 31st Jan, which corresponded to protesters organizing a million man march on 1st Feb 2011. The entity "Wael Ghonim" occurred most frequently on 31st Jan; he became a prominent figure during the Egypt protests, and news about him on 31st Jan reported that he was missing. By analyzing news articles⁶, we could observe that these keyphrases corresponded to important happenings in the context of the Egypt protests.

² foaf-project.org
³ moat-project.org
⁴ sioc-project.org
⁵ online-presence.net
⁶ Million Man March - http://www.abc.net.au/news/stories/2011/02/01/3127292.htm

Event models were created on a daily basis using the most frequent n-grams of the previous day. We observed that, as expected, the event models contained information that was a direct consequence of the n-grams fed in. Interestingly, they also contained previously unseen related instances, some of which were mentioned in the news later during the protests. For instance, the entity "Cairo suburb" was one of the most frequent n-grams on 31st Jan and hence was used to create the event model for the next day (1st Feb). As a result, "Heliopolis" was added to the event model, presumably because it is a suburb of Cairo. Serendipitously, the presidential palace at Heliopolis was the destination of a million people's march on 1st Feb. Similarly, entities such as Suzanne Mubarak (wife of Hosni Mubarak) and Abd al-Hamid Kishk were present in the event model generated on 28th Jan. These entities came to light later during the protests: Abd al-Hamid Kishk was mentioned in the news regarding the Egypt protests on 7th Feb, whereas the entity had been present in the model since 28th Jan.

Even though we were able to extract timely event descriptors, in a few cases we were not able to follow the sub-event completely. For example, we extracted keyphrases such as "google launches", "Speak2tweet" and "Twitter" when Google and Twitter launched the speak2tweet service for Egyptians who were offline. However, these n-grams had no impact on our model. One reason is the lack of coverage of our source knowledge base: Speak to Tweet is not present on Wikipedia, and our model does not capture the related entities Google and Twitter since Wikipedia contained no strong connection between these entities and the Egypt protests.

Fig. 2. Empirical evaluation of filtered tweets.

To evaluate the impact of the models on event tracking, i.e. filtering relevant tweets, we empirically evaluated a subset of the filtered tweets for each day (see Fig. 2). Overall, out of 308,258 tweets collected about the protests in Egypt between Jan. 29 and Feb. 7, 253,627 were classified as event-related and 54,631 were classified negatively. The analysis shows that most of the positively matched tweets were indeed event-related and usually of higher information content than those that were filtered out, as can be seen in the consistently high precision. Spam was almost completely filtered out. However, as the recall curve shows, many of the negatively classified tweets were still event-related, so recall suffered.
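For reference, the two measures discussed above can be computed as in the following sketch. Only the three tweet totals come from the text; the hand-labelled sample counts are hypothetical and serve purely to show the arithmetic, and the sketch makes no claim about how the daily samples behind Fig. 2 were drawn or scored.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall

# Totals reported in the text.
collected, positive, negative = 308258, 253627, 54631
assert positive + negative == collected

# Fraction of collected tweets that the event models let through (about 0.82).
positive_share = positive / collected

# HYPOTHETICAL labelled sample, not taken from the paper:
# 95 accepted tweets judged event-related, 5 accepted tweets judged unrelated,
# 40 rejected tweets judged event-related after all.
p, r = precision_recall(true_pos=95, false_pos=5, false_neg=40)
```

With counts of this shape, precision stays high because few accepted tweets are off-topic, while recall drops because a sizeable number of relevant tweets sit among the rejected ones, which is the pattern described for Fig. 2.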
4 Conclusion
We presented a dynamic process for following microposts related to an evolving event. We introduced semantic event models as a way to use background knowledge to describe an event in real time. The semantic event models are used to selectively stream microposts, which in turn are used to evolve the focus of an event. The evaluation showed that the system was able to deliver higher quality microposts than a keyword query usually produces. For future work we plan to create more comprehensive evaluations and to extend the information extraction and event modeling capabilities of the system.
5 Acknowledgements
Many thanks to Ashutosh Jadav for helping us in data gathering and Michael Cooney for helping with the implementation.
References

1. C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia - A crystallization point for the Web of Data. Web Semantics: Science, Services and Agents on the World Wide Web, (2009).
2. J. Leskovec, L. Backstrom, and J. Kleinberg. Meme-tracking and the dynamics of the news cycle. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 497–506. ACM, (2009).
3. P. Mendes, A. Passant, and P. Kapanipathi. Twarql: tapping into the wisdom of the crowd. In Triplification Challenge 2010 at 6th International Conference on Semantic Systems (I-SEMANTICS), (2010).
4. P. Mendes, A. Passant, P. Kapanipathi, and A. Sheth. Linked open social signals. In Web Intelligence and Intelligent Agent Technology (WI-IAT'10), IEEE/WIC/ACM International Conference on, (2010).
5. A. Passant, P. Laublet, J. G. Breslin, and S. Decker. A URI is Worth a Thousand Tags: From Tagging to Linked Data with MOAT. International Journal on Semantic Web and Information Systems (IJSWIS), 5(3):71–94, (2009).
6. A. P. Sheth, C. J. Thomas, and P. Mehra. Continuous Semantics to Analyze Real-Time Data. Internet Computing, IEEE, 14(6):84–89, (2010).
7. C. J. Thomas, P. Mehra, R. Brooks, and A. P. Sheth. Growing Fields of Interest: Using an Expand and Reduce Strategy for Domain Model Extraction. Web Intelligence and Intelligent Agent Technology, IEEE/WIC/ACM International Conference on, 1:496–502, (2008).