A Graph Based Clustering Technique for Tweet Summarization

Soumi Dutta (1), Sujata Ghatak (1), Moumita Roy (1), Saptarshi Ghosh (2) and Asit Kumar Das (2)

(1) Computer Science & Engineering, Institute of Engineering & Management, Kolkata 700091, India
(2) Computer Science & Technology, Indian Institute of Engineering Science and Technology, Shibpur, Howrah 711103, India
Abstract— Twitter is a very popular online social networking site, where hundreds of millions of tweets are posted every day by millions of users. Twitter is now considered one of the fastest and most popular communication mediums, and is frequently used to keep track of recent events or news-stories. Whereas tweets related to a particular event / news-story can easily be found using keyword matching, many of the tweets are likely to contain semantically identical information. If a user wants to keep track of an event / news-story, it is difficult for the user to read all the tweets containing identical or redundant information. Hence, it is desirable to have good techniques to summarize a large number of tweets. In this work, we propose a graph-based approach for summarizing tweets, where a graph is first constructed considering the similarity among tweets, and community detection techniques are then used on the graph to cluster similar tweets. Finally, a representative tweet is chosen from each cluster to be included in the summary. The similarity among tweets is measured using various features, including features based on WordNet synsets which help capture the semantic similarity among tweets. The proposed approach achieves better performance than SumBasic, an existing summarization technique.

Keywords -- Twitter, tweet summarization, WordNet, graph clustering, Online Social Network.
I. INTRODUCTION

The Twitter microblogging site has become a popular medium for gathering real-time information on popular news-stories and events. At present, there is a lot of research on efficiently utilizing the vast amounts of user-generated information posted on Twitter, for a variety of purposes such as trending topic analysis [2], user sentiment analysis [3], summarization [14], spam detection [1], and so on.

Hundreds of millions of tweets are posted on Twitter every day. Especially during any important event, such as a sports tournament or a natural disaster, tweets related to the event can be generated at the rate of several thousand per second. For a user who wants to keep track of such an event, it is relatively easy to find tweets relevant to the event by techniques such as keyword-matching based search. However, most existing search systems, including the Twitter search
system, simply return all relevant tweets (e.g., those which contain a given keyword) in reverse chronological order. It is practically impossible for any human being to read all the relevant tweets in real-time. Moreover, it has been observed that many of the tweets contain very similar information, and this redundancy quickly results in information overload for the user.

In this scenario, one possible approach to deal with the information overload problem is to summarize the tweet stream, so that an overview of the news-story can be obtained just by going through the summary. For example, in marketing, the summary can help in judging the overall success of a newly launched product. Again, at the time of a natural disaster, the summary can be useful for quickly understanding the situation, so that time-critical relief efforts can be organized.

In this work, we develop a methodology for summarization of tweets. Specifically, our approach takes a set of tweets as input, and selects a (small) subset of the tweets as the summary of the entire set. Note that there has already been a lot of research on summarization of text documents (as described in Section II). However, summarization techniques proposed for longer text documents may not perform well on tweets, which are much shorter (at most 140 characters in length) and often informally written, e.g., using abbreviations, colloquial language, and so on.

Our methodology first constructs a graph to represent the similarity between different pairs of tweets. To measure the similarity between tweets, we consider not only the presence of common terms (such as common hashtags and URLs), but also the semantic similarity between tweets. The WordNet tool [13] (discussed in Section III) is used to capture the semantic similarity among tweets which may use different terms to express the same information. Then, community detection techniques are used on the tweet similarity graph to cluster similar tweets, and a representative tweet is selected from each cluster (of similar tweets) to be included in the summary.
Experiments over a set of tweets related to a particular event show that the proposed approach achieves better summarization performance than an existing summarization technique, SumBasic [12].

The rest of the paper is organized as follows. Section II briefly discusses prior work on text summarization, while Section III gives some background information on WordNet and on community detection techniques for graphs (both of which are utilized in the proposed summarization methodology). The proposed methodology is detailed in Section IV, and its performance is described in Section V. Section VI concludes the paper.

II. RELATED WORK

There has been a lot of research on automatic text summarization [15-23], and several algorithms have been proposed for summarization of single or multiple text documents. In the 1950s, Luhn proposed a method to automatically create abstracts of technical articles using surface features like word frequency [7]. The TOPIC system [11] built a hierarchical text graph from an input document and then summarized the text based on its topics. The SCISOR system [10] proposed a conceptual summarization technique that chose parts of a concept graph created from Dow Jones newswire stories. Kupiec et al. [6] proposed a supervised statistical method that used a collection of 188 scientific documents, along with their corresponding human-created summaries, as a training set to learn the probability that a document sentence will be chosen for the human summaries.

Other notable algorithms for text summarization include SumBasic [12] and the centroid algorithm of Radev et al. [9]. Vanderwende et al. [12] proposed the SumFocus algorithm, where a set of rules or constraints on topic changes is enforced during summary generation. The centroid algorithm evaluates the centrality of a sentence in relation to the overall topic of the document cluster, for single-document summarization. The TextRank algorithm [4] is a graph-based approach that builds an adjacency matrix over the sentences of a document and finds the most highly ranked sentences, based on keywords, using the PageRank algorithm [8]. Among publicly available programs, MEAD [9] is a flexible platform for multi-document and multi-lingual summarization.

The approaches stated above have been proposed for summarization of large text documents. It is not clear how well they would perform when the documents are short (e.g., tweets, which are at most 140 characters in length) and informally written. The goal of the present work is to develop a summarization approach specifically for tweets.
III. BACKGROUND STUDIES

The tweet summarization algorithm proposed in the present work first constructs a tweet similarity graph based on the similarity between different tweets, and then detects communities in the graph to identify groups of similar tweets. This section briefly discusses WordNet [13], which is used to measure the similarity between tweets, and community detection algorithms for graphs.

A. WordNet

WordNet [13] is a lexical database for the English language, widely used to identify word synonyms. In the WordNet dictionary, nouns, verbs, adjectives and adverbs are grouped into sets of synonyms called synsets, each of which expresses a distinct concept. For instance, the words 'car' and 'automobile' are grouped together in a common synset; similarly, the words 'shut' and 'close' are synonymous, and are contained in the same synset. Since different documents (tweets) can use different words to express similar meaning, we use WordNet to find the similarity between different tweets.

In the WordNet database, words are linked together by their semantic relationships; it is like a supercharged dictionary/thesaurus with a graph structure. Synsets form relations with other synsets, creating a hierarchy of concepts ranging from very general (e.g., "entity", "state") to moderately abstract (e.g., "animal") to very specific (e.g., "plankton"). Some relevant terminology: hypernyms are synsets that are more general, and hyponyms are synsets that are more specific. For instance, "plankton" is a hyponym of "animal". WordNet thus provides a hierarchical structure of related words, as shown in Figure 1.

Since synsets are organized as a graph, we can measure the similarity between two synsets based on the shortest path between them. This is called the path similarity, and it is equal to 1 / (shortest_path_distance(synset1, synset2) + 1). It ranges from 0.0 (least similar) to 1.0 (identical). For instance, let us compare the path similarities between "octopus" and "nautilus" (another cephalopod, i.e., a very similar animal), "shrimp" (a non-cephalopod, i.e., a slightly different type of animal), and "pearl" (a mineral). The results are as expected, with "octopus" most similar to another cephalopod, less similar to a non-cephalopod, and most dissimilar to a non-living thing.

octopus.path_similarity(octopus) # 1.0
octopus.path_similarity(nautilus) # 0.33
octopus.path_similarity(shrimp) # 0.11
octopus.path_similarity(pearl) # 0.07

In this work, we use features such as the WordNet path similarity to enhance the proposed method.
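The snippet above is schematic. As a minimal runnable sketch, assuming NLTK's WordNet interface (the paper does not name a specific library, and the exact sense numbers such as 'octopus.n.02' depend on the installed WordNet version), the comparison can be reproduced as follows:

# Hedged sketch: WordNet path similarity via NLTK (assumed library).
# Requires: import nltk; nltk.download('wordnet')
from nltk.corpus import wordnet as wn

octopus = wn.synset('octopus.n.02')    # the cephalopod sense (assumed sense number)
nautilus = wn.synset('nautilus.n.02')  # chambered nautilus, another cephalopod
shrimp = wn.synset('shrimp.n.03')      # the crustacean sense
pearl = wn.synset('pearl.n.01')

# path_similarity = 1 / (shortest_path_distance + 1), ranging over (0, 1]
print(octopus.path_similarity(octopus))   # 1.0
print(octopus.path_similarity(nautilus))  # ~0.33
print(octopus.path_similarity(shrimp))    # ~0.11
print(octopus.path_similarity(pearl))     # ~0.07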
Fig. 1 Illustration of semantic relationships between words in WordNet
B. Community Detection in Graphs

Community detection algorithms for graphs / networks aim to find communities based on the network structure, e.g., groups of nodes that are densely connected. Similar types of nodes in a network form a community. The edges connecting nodes within a community are called intra-community edges, while the edges connecting nodes in different communities are called inter-community edges. Figure 2 shows the intra-community and inter-community edges between different communities.

In the present work, we construct a graph where the nodes are tweets and the edges represent the similarity among the tweets. We then detect communities in the tweet similarity graph using the InfoMap [5] algorithm, which has been found to detect meaningful communities in different types of graphs. Once communities in the graph are detected, a representative from each community is selected to be included in the summary of the whole set.

Fig. 2 Community structure in a graph, showing intra-community edges and inter-community edges
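As an illustration of what community detection produces, the following minimal sketch detects communities in a small weighted graph. It uses the Infomap implementation bundled with python-igraph as a stand-in for the mapequation.org code [5] used in the paper (an assumption made here purely for brevity):

# Sketch: Infomap community detection on a toy weighted graph (python-igraph).
import igraph as ig

# Two dense groups of three nodes each, joined by a single weak edge.
edges = [(0, 1), (1, 2), (0, 2),   # intra-community edges of community A
         (3, 4), (4, 5), (3, 5),   # intra-community edges of community B
         (2, 3)]                   # inter-community edge
weights = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.1]

g = ig.Graph(edges=edges)
clusters = g.community_infomap(edge_weights=weights)
print(clusters.membership)  # e.g. [0, 0, 0, 1, 1, 1]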
IV. PROPOSED METHODOLOGY

Given a set of tweets, the proposed methodology identifies a (small) subset of the tweets as a summary of the entire set of tweets. This section describes our dataset and the proposed methodology.

A. Data Set

An experimental dataset of 2921 tweets was crawled using the Twitter API. Specifically, the tweets were extracted by keyword matching from the 1% random sample of tweets that is made publicly available by Twitter. We chose tweets related to a specific event – the floods in the Uttaranchal region of India in 2013. Keywords like "Uttarakhand" AND "flood" were used to select tweets relevant to the event. Some sample tweets from the dataset are shown in Table I.

TABLE I
SAMPLE TWEETS FROM THE UTTARAKHAND FLOOD DATASET

Tweet ID  Tweet
1         15,000 tourists stranded due to landslide in Uttarakhand
2         Its raining heavily luvly weather in UTTARAKHAND.
3         8 perish as rains lash Uttarakhand Char Dhamyatra suspended http://t.co/cH6pcRrD41
4         cc: @MrsGandhi look at this Anti-hindu rains "@timesofindia: Rains bring ChardhamYatra to a halt in Uttarakhand http://t.co/JU7dXy6eHS".
5         8 perish as rains lash Uttarakhand, Char Dhamyatra suspended.

B. Similarity between tweets

We attempt to measure similarity between tweets in two ways – (i) term-level similarity, and (ii) semantic similarity. The term-level similarity between a pair of tweets is measured based on the common terms in the tweets. Specifically, we consider the following similarity measures.

Similar URL Count: For a given pair of tweets, this is the number of common URLs. For example, if the first tweet contains the URL www.google.com and the second tweet also contains www.google.com, then the count is 1; otherwise it remains 0.

Similar Hashtag Count: This is the number of common hashtags in the given pair of tweets. For example, if both tweets contain #news, the hashtag similarity count is 1.
Similar Username Count: Similarly, this is the number of usernames common to a pair of tweets. For example, if @mohit is present in two different tweets, the username similarity count is incremented.

Cosine Similarity: Given two vectors of attributes, A and B, the cosine similarity, cos(θ), is represented as

cos(θ) = (A · B) / (||A|| ||B||)
While computing the cosine similarity between two strings, we consider the vectors to be made up of the counts of the distinct words in the strings. For instance, consider the three strings s1 = 'This is a foo bar sentence.', s2 = 'This sentence is similar to a foo bar sentence.', and s3 = 'Hello world.' Then, cosine_similarity(s1, s2) = 0.862 and cosine_similarity(s2, s3) = 0.174.
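As a minimal sketch of this computation (the paper does not specify the tokenization; whole-word counts are assumed here, which reproduces the first value above):

# Sketch: cosine similarity over word-count vectors (tokenization assumed).
import math
import re
from collections import Counter

def cosine_similarity(s1, s2):
    v1 = Counter(re.findall(r"\w+", s1.lower()))
    v2 = Counter(re.findall(r"\w+", s2.lower()))
    dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

s1 = 'This is a foo bar sentence.'
s2 = 'This sentence is similar to a foo bar sentence.'
print(round(cosine_similarity(s1, s2), 3))  # 0.862

Note that the exact values depend on how the strings are tokenized and vectorized.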
Levenshtein Distance: The Levenshtein distance between a pair of strings is the minimum number of edit operations needed to transform one string into the other, where the edit operations considered are insertion, deletion, or substitution of a single character. For instance, given the two strings 'helloWorld' and 'halloworld', the Levenshtein distance is 2.

Note that tweets frequently contain symbols, abbreviations, etc., which interfere with the procedure for detecting similarity among tweets. So we apply a preprocessing step to filter out such tokens before applying the cosine similarity and Levenshtein distance measures. Standard English stopwords, punctuation, whitespace, special characters (e.g., $, @, !), and so on are removed from the tweets. After computing the similarity counts based on URLs, hashtags and usernames (as described above), the URLs, hashtags and usernames are also removed. We also case-fold all tweets to lower case.

Semantic similarity: Along with the term-level similarity measures stated above, we also attempt to measure the semantic similarity between two tweets. To identify semantically similar terms, we use the WordNet [13] tool described in the previous section. Specifically, two terms are considered semantically similar if they are included within a common WordNet synset, and we consider two tweets to be semantically similar if they contain more than two semantically similar words. Finally, the similarity score between two tweets is computed by simple summation of the term-level and semantic similarities listed in Table II; a sketch of this computation is given after the table.
TABLE II
FEATURE SET FOR MEASURING TWEET SIMILARITY

Feature Category     Feature Set
Term-level Features  Levenshtein distance; Cosine similarity; Frequency of common hashtags; Frequency of common URLs; Frequency of common usernames
Semantic Features    WordNet Synset Similarity; WordNet Path Similarity
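The following sketch puts the preprocessing and the score combination together. The paper states only that the features of Table II are summed; the normalisation of the Levenshtein distance into a similarity, 1 / (1 + distance), and the unweighted summation are assumptions made here:

# Sketch: preprocessing and combined tweet-similarity score (combination assumed).
# Requires: nltk.download('stopwords'); nltk.download('wordnet')
import re
import string
from nltk import edit_distance            # Levenshtein distance
from nltk.corpus import stopwords, wordnet as wn

STOPWORDS = set(stopwords.words('english'))

def preprocess(tweet):
    # URLs, hashtags and usernames are counted separately first, then removed.
    tweet = re.sub(r'(https?://\S+|#\w+|@\w+)', ' ', tweet.lower())
    tweet = tweet.translate(str.maketrans('', '', string.punctuation))
    return [w for w in tweet.split() if w not in STOPWORDS]

def share_synset(w1, w2):
    # Two terms are semantically similar if they share a WordNet synset.
    return bool(set(wn.synsets(w1)) & set(wn.synsets(w2)))

def tweet_similarity(t1, t2, common_urls=0, common_hashtags=0, common_usernames=0):
    w1, w2 = preprocess(t1), preprocess(t2)
    c1, c2 = ' '.join(w1), ' '.join(w2)
    lev_sim = 1.0 / (1.0 + edit_distance(c1, c2))     # assumed normalisation
    cos_sim = cosine_similarity(c1, c2)               # from the earlier sketch
    n_sem = sum(1 for a in set(w1) for b in set(w2) if a != b and share_synset(a, b))
    return lev_sim + cos_sim + common_urls + common_hashtags + common_usernames + n_sem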
C. Graph Generation

Next, a weighted graph is generated based on the similarity scores among the tweets; we refer to this graph as the tweet similarity graph. A graph is an ordered pair G = (V, E) comprising a set V of vertices or nodes together with a set E of edges. Every tweet is considered a node, and between every pair of nodes a weighted edge is added, where the weight denotes the similarity between the two tweets (computed as described above). For instance, considering the five tweets given in Table I, the tweet similarity graph is shown in Figure 3.
Fig. 3 Total tweet similarity graph, considering all edges
Note that the tweet similarity graph contains an edge between each pair of nodes. Hence, if the number of tweets (nodes) is N, the number of edges will be N(N-1)/2. Based on the tweet similarity graph, three different types of summary are generated, as described below.

Total Summary: We apply the InfoMap community detection algorithm [5] on the tweet similarity graph. InfoMap optimizes the map equation, which exploits the information-theoretic duality between the problem of compressing data and the problem of detecting and extracting significant patterns or structures within those data. Each community or module identified by InfoMap is a set of distinct tweets which are expected to be similar to one another (which is why they were clustered together). Each module also has a representative node (tweet), and these cluster representatives are taken to form the summary. We refer to this summary as the Total Summary.

Total Degree Summary: In the tweet similarity graph, a node having a higher weighted degree can be considered more similar to several other tweets; hence, high-degree nodes are good candidates for inclusion in the summary. Therefore, in this summary, instead of the module representative, we include the node with the highest weighted degree in each module (as identified by InfoMap).
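A minimal sketch of these module-level selection rules, again using python-igraph's Infomap as a stand-in for the mapequation.org code (the latter exposes its own module representatives, which igraph does not, so the Total Summary representative is approximated here by the first node of each module):

# Sketch: generating the three summaries from a tweet similarity graph.
import igraph as ig

def generate_summaries(tweets, similarity):
    n = len(tweets)
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)]
    weights = [similarity(tweets[i], tweets[j]) for i, j in edges]
    g = ig.Graph(n=n, edges=edges)
    g.es['weight'] = weights
    modules = g.community_infomap(edge_weights='weight')

    total, degree, length = [], [], []
    for module in modules:  # each module is a list of node (tweet) indices
        total.append(tweets[module[0]])                    # approximated representative
        strengths = g.strength(module, weights='weight')   # weighted degree
        degree.append(tweets[module[strengths.index(max(strengths))]])
        length.append(max((tweets[i] for i in module), key=len))
    return total, degree, length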
Total Length Summary: It can be argued that longer tweets contain more information. So, instead of the module representative, we include the tweet with the maximum length in each module (identified by InfoMap) in the summary.

Threshold Graph: Recall that the tweet similarity graph contains an edge between each pair of nodes. For another summary generation approach, we consider a thresholded version of the tweet similarity graph. A threshold is computed as the mean weight of all the edges in the graph, and edges whose weights are less than the threshold are removed. Thus the thresholded graph retains only the relatively high-weight edges, and contains far fewer edges than the total graph. For instance, the thresholded version of the total graph in Figure 3 is shown in Figure 4.

Fig. 4 Threshold graph, considering only edges above the threshold weight
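A sketch of this thresholding step, continuing the earlier igraph-based sketches (the mean is taken over all edge weights, as described above):

# Sketch: keep only edges whose weight is at least the mean edge weight.
import statistics

def threshold_graph(g):
    mean_weight = statistics.fmean(g.es['weight'])
    keep = [e.index for e in g.es if e['weight'] >= mean_weight]
    return g.subgraph_edges(keep, delete_vertices=False)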
We then use the approaches described above on the thresholded graph to generate another three types of summaries, referred to as the Threshold Summary, Threshold Degree Summary, and Threshold Length Summary.

V. RESULTS AND DISCUSSION

As described in the previous section, we have six different types of summaries. We compare these summaries with those generated by a standard algorithm, SumBasic [12]. The SumBasic algorithm selects, with higher probability, words that occur more frequently across the whole document than words that occur less frequently. It is motivated by the observation that words occurring frequently in a document occur with higher probability in human-generated summaries than words occurring less frequently.

To compare the performance of the different algorithms (the six proposed algorithms and SumBasic), we obtain a set of human-generated summaries using human volunteers from the home institutions of the authors. Then, we use the standard metrics precision (P), recall (R), and F-measure (F) to compare the performance of the algorithms. We briefly describe how the metrics are computed from the summary generated by a particular algorithm and the summary generated by the human volunteers. Let correct be the number of tweets extracted by both the algorithm and the volunteers, wrong be the number of tweets extracted by the algorithm but not by the volunteers, and missed be the number of tweets extracted by the human volunteers but not by the algorithm. Then, the precision of the algorithm is P = correct / (correct + wrong); in other words, P captures what fraction of the tweets chosen by the algorithm was also chosen by the human volunteers. The recall of the algorithm is R = correct / (correct + missed); thus R captures what fraction of the tweets chosen by the human volunteers could be identified by the algorithm. Finally, the F-measure is the harmonic mean of precision and recall: F = 2PR / (P + R). The F-measure is especially important, since it summarizes both precision and recall.

Table III shows the precision, recall, and F-measure values achieved by the different algorithms on the dataset of tweets described earlier. It is seen that all the proposed approaches perform better than SumBasic.
TABLE III
PERFORMANCE OF SUMMARIZATION ALGORITHMS, AVERAGED OVER ALL SIX LABELLED DATASETS

Method                    Precision (P)  Recall (R)  F-Measure (F)
SumBasic [12]             0.657          0.800       0.722
Total Summary             0.818          0.818       0.818
Total Degree Summary      0.833          0.827       0.821
Total Length Summary      0.826          0.815       0.830
Threshold Summary         0.674          0.930       0.779
Threshold Degree Summary  0.673          0.925       0.781
Threshold Length Summary  0.675          0.929       0.782
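As a minimal sketch, the metrics above can be computed from the two summaries represented as sets of tweet identifiers (the function name is illustrative):

# Sketch: precision, recall and F-measure of an algorithmic summary against
# a human (gold) summary, each given as a set of tweet ids.
def evaluate(algo, gold):
    correct = len(algo & gold)   # extracted by both
    wrong = len(algo - gold)     # extracted by the algorithm only
    missed = len(gold - algo)    # extracted by the volunteers only
    p = correct / (correct + wrong) if algo else 0.0
    r = correct / (correct + missed) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

print(evaluate({1, 2, 3, 4}, {2, 3, 4, 5}))  # (0.75, 0.75, 0.75)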
VI. CONCLUSION

This work proposed a simple and effective methodology for tweet summarization. A tweet similarity graph is first constructed, a standard graph clustering algorithm is then used to group similar tweets (nodes), and a representative tweet from each cluster is included in the summary.
We would like to note that the length of the summary (the number of tweets in the summary) generated by any of the proposed algorithms is not fixed, since it depends on the number of clusters identified by the community detection algorithm. As future work, we would like to develop approaches which can produce a summary of a specified length.

ACKNOWLEDGMENT

The authors would like to thank the human volunteers who helped in collecting the manual summaries. We also thank the anonymous reviewers whose suggestions helped to improve the paper.

REFERENCES
[1] F. Ahmed, J. Erman, Z. Ge, A. X. Liu, J. Wang, and H. Yan, "Detecting and localizing end-to-end performance degradation for cellular data services," in Proc. 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, Portland, OR, USA, June 15-19, 2015, pp. 459-460.
[2] S. Asur and B. A. Huberman, "Predicting the future with social media," in Proc. 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT '10), vol. 01, pp. 492-499, Washington, DC, USA, 2010. IEEE Computer Society.
[3] A. Bermingham, M. Conway, L. McInerney, N. O'Hare, and A. F. Smeaton, "Combining social network analysis and sentiment analysis to explore the potential for online radicalisation," in Proc. 2009 International Conference on Advances in Social Network Analysis and Mining (ASONAM '09), pp. 231-236, Washington, DC, USA, 2009. IEEE Computer Society.
[4] S. Brin and L. Page, "The anatomy of a large-scale hypertextual web search engine," Comput. Netw. ISDN Syst., vol. 30, no. 1-7, pp. 107-117, April 1998.
[5] "Infomap - community detection," http://www.mapequation.org/code.html.
[6] J. Kupiec, J. Pedersen, and F. Chen, "A trainable document summarizer," in Proc. 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '95), pp. 68-73, New York, NY, USA, 1995. ACM.
[7] H. P. Luhn, "The automatic creation of literature abstracts," IBM J. Res. Dev., vol. 2, no. 2, pp. 159-165, April 1958.
[8] L. Page, S. Brin, R. Motwani, and T. Winograd, "The PageRank citation ranking: Bringing order to the web," 1999.
[9] D. R. Radev, T. Allison, S. Blair-Goldensohn, J. Blitzer, A. Celebi, S. Dimitrov, E. Drabek, A. Hakim, W. Lam, D. Liu, J. Otterbacher, H. Qi, H. Saggion, S. Teufel, M. Topper, A. Winkel, and Z. Zhang, "MEAD - a platform for multidocument multilingual text summarization," in Proc. Fourth International Conference on Language Resources and Evaluation (LREC 2004), May 26-28, 2004, Lisbon, Portugal, 2004.
[10] L. F. Rau, P. S. Jacobs, and U. Zernik, "Information extraction and text summarization using linguistic knowledge acquisition," Information Processing & Management, vol. 25, no. 4, pp. 419-428, 1989.
[11] U. Reimer and U. Hahn, "Text condensation as knowledge base abstraction," in Proc. Fourth Conference on Artificial Intelligence Applications, pp. 338-344, March 1988.
[12] L. Vanderwende, H. Suzuki, C. Brockett, and A. Nenkova, "Beyond SumBasic: Task-focused summarization with sentence simplification and lexical expansion," Inf. Process. Manage., vol. 43, no. 6, pp. 1606-1618, November 2007.
[13] "WordNet - a lexical database for English," http://wordnet.princeton.edu/.
[14] W. Xu, R. Grishman, A. Meyers, and A. Ritter, "A preliminary study of tweet summarization using information extraction."
[15] M. Hassel, "Resource Lean and Portable Automatic Text Summarization," PhD thesis, KTH, Numerical Analysis and Computer Science, NADA, 2007.
[16] K. Sparck Jones, "Automatic summarising: The state of the art," Inf. Process. Manage., vol. 43, no. 6, pp. 1449-1481, November 2007.
[17] R. Barzilay and M. Elhadad, "Using lexical chains for text summarization," in Proc. ACL/EACL 1997 Workshop on Intelligent Scalable Text Summarization, pp. 10-17, 1997.
[18] I. Mani, "Advances in Automatic Text Summarization," MIT Press, Cambridge, MA, USA, 1999.
[19] M. Hassel, "Exploitation of named entities in automatic text summarization for Swedish," in Proc. NODALIDA '03 - 14th Nordic Conference on Computational Linguistics, May 30-31, 2003.
[20] M. Hassel, "Evaluation of Automatic Text Summarization - A Practical Implementation," Licentiate thesis, Department of Numerical Analysis and Computer Science, Royal Institute of Technology, Stockholm, Sweden, May 2004.
[21] I. Mani and M. T. Maybury, "Automatic summarization," in Association for Computational Linguistics, 39th Annual Meeting and 10th Conference of the European Chapter, Companion Volume to the Proceedings of the Conference: Proceedings of the Student Research Workshop and Tutorial Abstracts, July 9-11, 2001, Toulouse, France, p. 5, 2001.
[22] C. Nobata, S. Sekine, H. Isahara, and R. Grishman, "Summarization system integrated with named entity tagging and IE pattern discovery," in Proc. Third International Conference on Language Resources and Evaluation (LREC 2002), May 29-31, 2002, Las Palmas, Canary Islands, Spain, 2002.
[23] H. Dalianis and E. Åström, "SweNam - a Swedish named entity recognizer: its construction, training and evaluation," 2001.
[24] G. Salton, "Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer," Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1989.