© IEEE. Pre-print of conference paper in the proceedings of the IEEE Region 10 Symposium (IEEE TENSYMP 2015).
Ontology based Approach for Event Detection in Twitter Datastreams
R. Kaushik1, Apoorva Chandra S2, Dilip Mallya3, J.N.V.K. Chaitanya4 and Sowmya Kamath S5
Department of Information Technology, National Institute of Technology Karnataka, Surathkal, Srinivas Nagar P.O., Mangalore - 575025, India
[email protected] [email protected] [email protected]
Abstract—In this paper, we present a system that attempts to interpret relations in social media data based on an automatically constructed, dataset-specific ontology. Twitter data pertaining to real-world events, such as product launches and the buzz they generate among Twitter users, was used for developing a prototype of the system. First, Twitter data is filtered using certain tag-words, and an ontology is built from the extracted entities. Then, each entity's Wikipedia data is collected and processed semantically to retrieve inherent relations and properties. The system uses these results to discover related entities and the relationships between them. We present experimental results showing how the system was able to effectively construct the ontology and discover inherent relationships between entities belonging to two different datasets.
Keywords: Social Media Analysis, Ontology, Event Detection, Semantics, Knowledge Discovery
I. INTRODUCTION

Big data analytics is currently a highly popular paradigm that enables companies to analyze various forms of data to discover hidden trends and knowledge about their business. Product-oriented companies, which are highly dependent on consumer opinion and interest, rely on extensive marketing and advertising campaigns to capture users' interest. The same companies also conduct market research via traditional channels, like online surveys and trend analysis, to gain insight into the psyche of the consumer. Modern social media provides a normally free but valuable source of information, as it opens a direct line of information on how their products are perceived by potential consumers. Widely adopted social media platforms such as Twitter and Facebook provide customer-relevant data based on the interactions made by users; this data is of great social significance and is highly dependent on both the users' sentiments and on peer consensus. Taking micro-blogging as an example, Twitter has over 280 million active users, and over 500 million tweets are sent per day [1]. Automating the analysis of this data through intelligent, semantic-based information access methods is therefore increasingly needed [2]. This area of research merges methods from several fields, in addition to
semantic web technologies: for example, NLP (Natural Language Processing), big data analytics, data mining, machine learning, personalization, and information retrieval [3]. The problem of making sense of the social data generated by users on micro-blogging websites like Twitter is compounded by the fact that the posts are primarily opinion-based and are meant to be read by other humans. Users may use emoticons, short forms (e.g., c u l8r), acronyms (e.g., LOL, BTW, YGTI) and irony/sarcasm in their tweets, making it quite hard to understand the true meaning of a statement that is only 140 characters in length. Performing this analysis manually is impossible due to the sheer volume and velocity of the data generated by users. It is also quite a challenging task to build systems that can automatically discover useful bits of information from this highly volatile data. Techniques that use semantics and natural language processing may help overcome some of these problems. One such solution, presented in this paper, uses an ontology to understand the relations between the various terms used in these tweets, automating the task of making sense of social media data for analysis purposes. The ontology is constructed from the tweet dataset, and the relations between the terms are mapped using various tools and services. The result can then be used to provide statistics and usable information about the dataset under consideration. The rest of the paper is organized as follows: in Section II, we discuss related work in the literature, along with some relevant ontologies that currently exist for social media representation. Section III discusses the proposed methodology and the various components of the proposed system, followed by details of the system development process in Section IV.
In Section V, we present a discussion of the experimental results and their analysis, followed by the conclusion, future work and references.

II. RELATED WORK

Several researchers have concentrated on the issue of learning semantic relationships between entities in Twitter. Celik et al. [4] discuss the process of inferring relationships between the various entities present in a tweet, such as the timestamp,
unique id, # tags and @ tags. Methods for inferring such relationships include using already existing ontologies, crawling web documents, using a bag-of-words approach based on the term frequency and inverse document frequency of the terms, and using term co-occurrence based strategies. Iwanaga et al. [5] presented an ontology-based system for crisis management during crucial time periods, such as during an earthquake; the ontology in that scenario is required to ensure the proper evacuation of the victims, and Twitter data was used for this purpose. Hamasaki et al. [6] presented a system for ontology extraction using a social network built around a conference, where the input for building the ontology was the set of research papers published at the conference. Zavitsanos et al. [7] proposed a gold-standard evaluation of ontology learning methods, which we use for the evaluation of the ontology that we have built. Several well-known social media ontologies [8] are currently available that were created to model different kinds of social media, user profiles, sharing, tagging, liking, and other common user behavior in social media. Friend-of-a-Friend (FOAF) [9] is an ontology/vocabulary for describing people; to model social media sites, the Semantically Interlinked Online Communities (SIOC) [10] ontology can be used. Bottari [11] is an ontology developed specifically to model relationships in Twitter, especially linking tweets, locations, and user sentiment (positive, negative and neutral). DLPO (LivePost Ontology) can be used for interlinking social media, social networks, and online sharing practices. For modeling tag semantics, the Meaning-Of-A-Tag (MOAT) ontology can be used, allowing users to define the semantic meaning of a tag.
III. PROPOSED SYSTEM

Figure 1 presents the major processes carried out in the proposed methodology. We explain each of them in detail here.

Fig. 1: Proposed System Pipeline

A. Data Collection and Cleaning

Initially, a dataset for a particular time period corresponding to the real-world event is collected. We used a tool called Sysomos [12], which provides the functionality to schedule an automated script for collecting tweets for whatever event we want to analyze. In our case, we collected tweet data corresponding to the launch of competing products from companies like Apple, Xiaomi and Motorola. Several preprocessing techniques were applied to extract the required data from the tweets.

B. Entity Extraction

The volume of tweets over time is recorded in a data structure for plotting and for generating other statistics after the analysis is complete. We count the number of tweets over time at a chosen granularity, which can be hourly, daily or yearly depending on the duration of the event being monitored. Then, the timestamp is removed from the tweets, and entities like hashtags, @ tags and named entities are extracted using CMU's Tweet Parser [13]. This is an open-source, Java-based tool that recognizes the required entities in tweets using a part-of-speech tagging process. Once the entities are extracted, they are fed as input to the ontology building module.

C. Build Ontology Module

An ontology is a graph or network of nodes connected by edges, where the edges are labeled with the names of the relationships between the entities. These relationships are inferred using Wikipedia, DBpedia and web documents on the extracted entities. The ontology is built (shown in Fig. 2) using Python's extensive natural language libraries and is saved in an XML-based format. This forms the basis of the search engine mechanism for automatically inferring relationships between entities within different datasets.

Fig. 2: Building the Ontology

D. Search Engine Module

A basic search engine is built using the ontology generated earlier. The user can enter a query, which is processed to retrieve documents regarding the topic on which the ontology was developed. The results of the search engine depend upon the type of the dataset and will concern contemporary events rather than existing web documents.

E. Visualization Module

The results of the search engine are fed to the visualization module for generation of the ontology subtree used for the related document retrieval, the tweet buzz regarding the keywords mentioned in the query (trends), and the matching documents (among other analysis results).
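To illustrate the ontology representation described in the Build Ontology Module, the following sketch (ours, not the authors' exact implementation; the entity and relation names are hypothetical examples) stores labeled relationship edges and serializes them in an XML-based format using only Python's standard library:

```python
# Minimal sketch of an ontology as a labeled graph serialized to XML.
# Relation triples here are illustrative, not taken from the actual datasets.
import xml.etree.ElementTree as ET

def build_ontology_xml(triples):
    """Serialize (subject, relation, object) triples as an XML document string."""
    root = ET.Element("ontology")
    for subj, rel, obj in triples:
        edge = ET.SubElement(root, "relation", name=rel)
        ET.SubElement(edge, "subject").text = subj
        ET.SubElement(edge, "object").text = obj
    return ET.tostring(root, encoding="unicode")

xml_doc = build_ontology_xml([
    ("Xiaomi", "manufactured_in", "China"),
    ("Xiaomi", "uses", "Snapdragon"),
])
```

An XML serialization like this keeps the graph both human-readable and easy to reload for the search engine module.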
IV. SYSTEM DEVELOPMENT

The main intention of the proposed system is to assist companies in inferring consumer opinion on their own or competitors' brands through the tweet buzz generated during a product launch. We used Twitter data (tweets) containing particular keywords. These were extracted from Twitter using a tool called Sysomos [12], which, in addition to the tweets, also gave the URL, associated timestamp, location, sentiment and the user who tweeted. This data about a particular keyword could also be filtered by other specifications such as location, time, demographics and language. The tweets are stored in a single CSV file with the features considered. An average of 40,000 tweets was collected for each of the events considered, namely the iPhone6, Xiaomi and Moto products. As the first step of cleaning, we decided to consider only English tweets. To eliminate non-English tweets, we analyzed the stopwords found in each tweet and assigned the tweet a ratio for each language indicating its chance of belonging to that language; the tweets for which English scored maximum were selected. Almost one out of four tweets was non-English and was thus eliminated. If too many hashtags or @ tags are used in a tweet, that tweet can be considered spammy; such spammy tweets are automatically eliminated by the above-mentioned stopword detection technique. The remaining English tweets were collected along with their timestamps and stored. We used the NLTK library [14] in Python to filter out the non-English tweets and also to remove the spammy tweets. The set of tweets and their associated timestamps is then returned as output to the entity extraction module. Now that the tweets are available with the associated timestamps, the data must be organized according to the requirements of the research to be conducted. The tweets are sorted according to timestamp and a daily tweet count is computed.
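The stopword-ratio filter described above can be sketched as follows. This is an illustrative reconstruction: in practice the per-language stopword lists would come from NLTK's stopwords corpus, while the tiny hard-coded sets here are placeholders for demonstration only.

```python
# Sketch of language filtering by stopword ratio. The stopword sets below
# are small illustrative placeholders (NLTK's stopwords corpus would supply
# the full lists in a real pipeline).
STOPWORDS = {
    "english": {"the", "is", "a", "to", "and", "of", "in"},
    "spanish": {"el", "es", "una", "y", "de", "en", "la"},
}

def language_scores(tweet):
    """Ratio of each language's stopwords among the tweet's tokens."""
    tokens = tweet.lower().split()
    return {
        lang: sum(t in words for t in tokens) / max(len(tokens), 1)
        for lang, words in STOPWORDS.items()
    }

def is_english(tweet):
    """Keep a tweet only if English scores highest and matches at least once."""
    scores = language_scores(tweet)
    return scores["english"] >= max(scores.values()) and scores["english"] > 0
```

A spammy tweet packed with hashtags and @ tags contains few stopwords of any language, so it scores near zero everywhere and is dropped by the same check.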
The granularity level was maintained at days and not less, because a finer granularity would result in the appearance of many peaks, which would be difficult for the marketer to analyze, and many of them may not correspond to interesting events. The next requirement was to build a search engine that can help the marketer navigate the tweet buzz on topics being discussed on Twitter. Entities, or keywords, were extracted from the tweets using the CMU Tweet Parser. The parser tokenizes tweets and returns the part-of-speech of each token in the dataset, along with a confidence value for each tag. A separate parser was required for this purpose because a POS tagger that works on formal sentences cannot be used on tweets, where many local abbreviations are used due to the limited character length (at most 140 characters per tweet). After POS tagging, the entities that are required for further analysis, like named entities, hashtags and nouns, are extracted, as these convey the maximum information about the topic. A frequency distribution is also drawn for all such extracted
entities. Out of this frequency distribution, the top 100 keyword occurrences are chosen as potential entities on which the ontology can be built; the frequency of occurrence of these entities is also stored. However, an ontology is only complete if the relationships that exist between the extracted entities can be discovered. As the social chatter on Twitter is primarily made up of opinions and cannot be treated as a standard source of information, a reliable external information source is required to establish relationships between the entities. We used Wikipedia as this reliable external source. The Python package that we used to retrieve documents from Wikipedia also gives suggestions if the terms searched for do not correspond to any valid Wiki document. The retrieved documents are then stored in a directory that is accessed by the ontology building module while constructing the ontology.

V. RESULTS AND ANALYSIS

We considered two tweet datasets pertaining to the launches of popular products, the Motorola Moto series of mobile phones and the Apple iPhone. Their statistics are as follows:
1) Moto dataset: 40,000 tweets over the period from Sep 1st 2014 to Sep 9th 2014
2) iPhone6 dataset: 35,000 tweets over the period from Sep 6th 2014 to Sep 12th 2014
Fig. 3: Tweet volume vs. time for the Motorola Dataset

Spikes were detected in the tweet volume versus time graph (see figure 3) from Sep 2nd to 5th, that is, during the time when Motorola released the next generation of the Moto X and Moto G smartphones and the Moto 360, its smartwatch, a wearable Android device. This is an indication that the system was able to detect events, in this case the release of a product. The ontology generated after the tweet analysis performed for the Xiaomi dataset is presented in figure 4. It depicts the entities extracted from the Xiaomi dataset, like "Android", "Qualcomm", "Snapdragon" etc. Certain entities that may be considered useless also occurred in the final ontology; eliminating these will require more advanced techniques and will be our future focus. The inherent relationships between these entities were also successfully captured, for example,
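A simple way to flag such spikes (our illustrative sketch; the paper does not specify the exact detection rule) is to mark the days whose tweet volume exceeds the overall mean by some factor:

```python
# Illustrative spike detector over daily tweet counts: a day is a spike
# if its volume exceeds factor * mean volume. The volumes below are
# made-up numbers shaped like the Motorola launch-week pattern.
def detect_spikes(daily_counts, factor=1.5):
    """Return indices of days whose volume exceeds factor * mean volume."""
    mean = sum(daily_counts) / len(daily_counts)
    return [i for i, c in enumerate(daily_counts) if c > factor * mean]

volumes = [3000, 3200, 9500, 8800, 9100, 3100, 2900, 3000, 3300]
spike_days = detect_spikes(volumes)  # -> [2, 3, 4]
```

With day-level granularity, a run of consecutive flagged days like this corresponds to a single real-world event such as a product launch.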
Fig. 4: Ontology extracted from the Xiaomi Dataset
"Xiaomi - manufactured - China", and the facts that Xiaomi phones are Android phones and use Qualcomm Snapdragon processors. The fact that Xiaomi phones compete with smartphones manufactured by Apple and Samsung was also captured in the generated ontology. Some other interesting observations are listed below:
1) Of all the entities returned by the CMU tweet parser, the nouns, hashtags and named entities were the most useful. Verbs, though potentially useful for establishing relationships between entities, can instead be obtained from the Wikipedia and web documents used to infer relationships.
2) Hashtags generally form a smaller proportion of the entities after parsing the tweets, so some kind of boost may be given to hashtags, as the information gain from trending hashtags can be very large.

VI. CONCLUSION

In this paper, we presented an ontology-based system for event detection and trend analysis on Twitter data. It is intended to automatically detect trends in social media regarding certain real-world events, like the launch of new products or advertising campaigns, to help companies analyze the effectiveness of their marketing approach or the mood of the consumer through their tweets on Twitter. Currently, the available ontologies are mostly based on web documents or direct information sources, as a result of which they cannot be contemporary to the events/topics they relate to. In contrast, we were successful in building an ontology from social media, so that the ontology is contemporary to currently occurring events, which is a major advantage for building a search engine whose search results change dynamically with time.
REFERENCES
[1] K. Weil (VP of Product for Revenue, Twitter Inc.), "Measuring tweets," Twitter Official Blog, February 22, 2010.
[2] A. Ritter et al., "Unsupervised modeling of twitter conversations," 2010.
[3] P. Mika, "Flink: Semantic web technology for the extraction and analysis of social networks," Web Semantics: Science, Services and Agents on the World Wide Web, vol. 3, no. 2, pp. 211–223, 2005.
[4] I. Celik et al., "Learning semantic relationships between entities in twitter," in Web Engineering, pp. 167–181, Springer, 2011.
[5] I. Iwanaga et al., "Building an earthquake evacuation ontology from twitter," in Granular Computing (GrC), 2011 IEEE International Conference on, pp. 306–311, IEEE, 2011.
[6] M. Hamasaki et al., "Ontology extraction using social network," in International Workshop on Semantic Web for Collaborative Knowledge Acquisition, 2007.
[7] E. Zavitsanos et al., "Gold standard evaluation of ontology learning methods through ontology transformation and alignment," Knowledge and Data Engineering, IEEE Transactions on, vol. 23, no. 11, pp. 1635–1648, 2011.
[8] P. Mika, "Ontologies are us: A unified model of social networks and semantics," in The Semantic Web–ISWC 2005, pp. 522–536, Springer, 2005.
[9] D. Brickley and L. Miller, "The friend of a friend (foaf) project," 1999.
[10] A. Passant et al., "The sioc project: semantically-interlinked online communities, from humans to machines," in Coordination, Organizations, Institutions and Norms in Agent Systems V, pp. 179–194, Springer, 2010.
[11] I. Celino et al., "Towards bottari: using stream reasoning to make sense of location-based micro-posts," in The Semantic Web: ESWC 2011 Workshops, pp. 80–87, Springer, 2012.
[12] Sysomos.com, "Sysomos heartbeat."
[13] K. Gimpel et al., "Part-of-speech tagging for twitter: Annotation, features, and experiments," in 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 42–47, Association for Computational Linguistics, 2011.
[14] S. Bird, "Nltk: the natural language toolkit," in Proceedings of the COLING/ACL on Interactive Presentation Sessions, pp. 69–72, Association for Computational Linguistics, 2006.
[15] O. Ozdikis et al., "Semantic expansion of hashtags for enhanced event detection in twitter," in International Workshop on Online Social Systems, 2012.
[16] O. Owoputi et al., "Improved part-of-speech tagging for online conversational text with word clusters," in HLT-NAACL, pp. 380–390, 2013.
[17] L. Derczynski et al., "Twitter part-of-speech tagging for all: Overcoming sparse and noisy data," in RANLP, pp. 198–206, 2013.