A Tweet Summarization Method Based on a Keyword Graph

Tae-Yeon Kim, Jaekwang Kim, Jaedong Lee
Sunkyunkwan Univ. 2066, Sebu-ro, Jangan-gu, Suwon, Gyeonggi-do, Republic of Korea +82-31-290-7987

Jee-Hyong Lee
Sunkyunkwan Univ. 2066, Sebu-ro, Jangan-gu, Suwon, Gyeonggi-do, Republic of Korea +82-31-290-7154
ABSTRACT
Micro blogs such as Twitter carry a huge number of posts and can therefore be an important information source in various domains. However, the information density of each post (tweet) is low because tweets are very short. Because of the huge volume and the low information density, it is hard to obtain useful information from Twitter, such as trends in public opinion. Considering these characteristics of tweets, we propose a novel tweet summarization method. The proposed method first finds strongly related groups of words using keyword graphs, in which the frequent words are the vertices and their co-occurrences are the edges. We use maximal k-clique clustering to find the strongly related groups of words, and summarize the tweets that include the words of each group. Experiments confirm that the proposed method is effective for summarizing tweets and is superior to the existing method.

Categories and Subject Descriptors
E.1 [Data Structures]: Graphs and networks; H.2.8 [Database Applications]: Data mining; H.3.3 [Information Search and Retrieval]: Clustering, Information filtering, Retrieval models, Search process, Selection process.

General Terms
Algorithms, Measurement, Design, Experimentation, Theory, Verification.

Keywords
Twitter, Tweet summarization, Co-occurring graph, K-clique.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. IMCOM (ICUIMC)’14, January 9–11, 2014, Siem Reap, Cambodia. Copyright 2014 ACM 978-1-4503-2644-5…$15.00.

1. INTRODUCTION
Twitter is not only a social network service on which people freely write about their daily lives, but also a popular micro blogging platform. With the spread of ubiquitous devices, one can post anything at any time and from anywhere. Moreover, since a post does not need to be long, micro bloggers can post instantly and frequently about whatever interests them. As a result, the number of posts in micro blogs is huge compared with that of traditional blogs. Statistics from February 2010 report that fifty million tweets were posted per day [1].

In contrast to the volume, the information density of micro blog posts is quite low. Traditional blogs deal with specific subjects and often contain high-quality content with long, detailed information, whereas micro blogs contain short, improvised texts, which leads to low-quality content. For this reason, it is hard to obtain useful information, such as trends in public opinion, from micro blogs. Although many people write their opinions on micro blogs, it is hard to catch the major opinions because of the huge volume and low information density of the posts. To extract useful information from micro blogs, a medium of massive posts with relatively low information density, a new method is needed that considers these unique characteristics.

Here, we focus on Twitter, one of the micro blog systems. Tweets are the short messages posted on Twitter. The timeline is the list of posts arranged in order of posting time. Twitter has a distinctive function called ‘follow’: if you follow a person, that person's posts appear in your timeline. Usually, people follow their friends, celebrities they are interested in, or other people who post useful and interesting information. ‘Retweet’ means forwarding a tweet in your timeline to the users who follow you. If users find a tweet useful and want to share it with their followers, they retweet it. When retweeting, users can easily modify the original tweet as they want.
They can add words to or delete words from the original tweet. Thus, it is hard to identify tweets with the same content by simple comparison between tweets. The search function that Twitter provides is very simple: if you search for a word, every tweet that includes the word is simply listed in order of time. Since most tweets do not carry much information, it takes a long time and much effort to find the tweets that include the information you want. If the search results were summarized and categorized, users could obtain useful information and people's opinions more easily and with less effort.

Usually, a summary of documents can be built by extracting thesis statements or important sentences from paragraphs. However, a tweet usually consists of one or two sentences, and users whose opinions on a topic are similar often produce very similar sentences by retweeting. So, we should consider these characteristics when summarizing tweets. In this light, our goal is close to that of document clustering approaches: we try to cluster tweets with similar content and choose some major ones. There has been much research on clustering search results [2][3]. Yet the existing methods are not appropriate for short documents with low information density such as tweets, so their clustering results are poor [4].

Keyword extraction is one of the important issues in extracting useful information from documents. Y. Ohsawa et al. presented an algorithm for extracting keywords that represent the asserted main point of a document without relying on external knowledge such as natural language processing or a document corpus [5]. G.K. Palshikar proposed a method for extracting keywords from a single document using centrality measures [6]. He represented the document as an undirected graph whose vertices are the words in the document and whose edges are labeled with a dissimilarity measure between two words, derived from the frequency of their co-occurrence in the document. The central vertices of the graph were the keyword candidates. However, such keyword extraction methods cannot be directly applied to tweets because of the characteristics of tweets. The two key problems in extracting useful information from tweets are as follows:

• A huge number of tweets

• The short length of tweets

In essence, these two problems are inherent limitations of tweets. Thus, an efficient method for tweet summarization is needed. In this paper, we suggest an approach that extracts groups of tweets with similar content based on a graph of the keywords that frequently appear in the tweets of a search result. First, we obtain a set of keywords that frequently appear in the tweets, and then count the co-occurrences of every pair of words in the keyword set. Based on the keywords and the co-occurrences, we build a keyword co-occurrence graph and then cluster the graph into several densely connected sub-graphs. The words in such a sub-graph frequently co-occur with each other; that is, they are strongly related to each other. So, we group the tweets that include all the words of each sub-graph into a cluster. It is very probable that tweets sharing many strongly related words convey similar content.

To evaluate the proposed approach, an expert selected queries from the frequently appearing words in the “10x10” online service, which provides 100 important words, hourly and daily, collected from authoritative news articles worldwide. With these queries, we collected, through Spinn3r¹, tweets posted over the three months starting March 21, 2011. We compare the accuracy of the proposed approach with that of the existing methods. Our key contributions are as follows:

• We proposed a tweet grouping method based on co-occurrences using graphs. We analyzed a set of keywords which frequently appeared in tweets, and then counted the co-occurrences of any two words in the keyword set.

• We adapted the k-clique clustering algorithm to tweet clustering and proposed a tweet summarization method based on tweet merging.

• We showed experimentally that the proposed method was superior to other existing methods. The best performance was obtained with upper bounds of the keyword threshold around 15% to 30%.

This paper is organized as follows. Section 2 briefly explains the existing research and its limitations. Section 3 proposes an approach to summarize tweets. Section 4 presents the experimental results, and Section 5 concludes with directions for future work.

¹ http://spinn3r.com

2. RELATED WORK
In this section, we explain some representative result clustering methods and the research related to tweets: event detection and tweet summarization. Since our tweet summarization method is applied to the tweets that include a given query keyword, it can be considered a result clustering approach. Event detection discovers unknown events, such as accidents, that are currently being talked about in tweets, and tweet summarization extracts representative data from a set of tweets by clustering or other algorithms. Research on tweets, however, is still at an early stage.

2.1 Result Clustering
As explained above, our tweet summarization approach is closer to result clustering than to document summarization. Thus, it is more appropriate to compare it with result clustering methods, which cluster the documents in a search result. Clustering in general is a technique for understanding the composition of data by grouping similar items. A Web search finds relevant documents and presents them as a ranked list. In addition, there is research on helping users find information within the search result more easily, and result clustering is one such technique: it clusters mutually relevant documents in the search result and presents the clusters to the user. Its purpose is to help users find the information they want without going through all the search results.

For result clustering, algorithms such as Scatter/Gather, NMF (Non-negative Matrix Factorization), and STC (Suffix Tree Clustering) are commonly used [2][7][3]. Scatter/Gather, presented in 1992, clusters documents by repeating two steps: ‘Scatter’, which partitions the documents into clusters, and ‘Gather’, which gathers the selected clusters into a smaller collection. Its weakness is that the computational cost may increase because of the repeated steps.

NMF originated in computer vision and was found to be very effective for document clustering. It factorizes the document-word matrix into a document-characteristic matrix and a characteristic-word matrix, and sorts the documents by the characteristics. The number of clusters must be given in advance, and each document is assigned exclusively to one cluster. It is hard to extract appropriate names for the clusters because the method works on matrices.

STC creates a tree by gathering documents that share similar word sequences, taking sentences into account. Figure 1 shows “cat ate cheese”, “mouse ate cheese too”, and “cat ate mouse too” in a suffix tree. It merges the repeated parts of the sentences into nodes, and the original sentences can be restored by following the paths from the root to the terminal nodes. The strength of the suffix tree is that it is easy to label each cluster because it finds sequences of overlapping words. The weakness is that too many clusters can be generated and one document may often be included in several clusters.

Figure 1. The suffix tree of the strings “cat ate cheese”, “mouse ate cheese too” and “cat ate mouse too” [8]

STC is one of the more recently proposed result clustering approaches. Unlike k-means, the most popular clustering method, which needs the number of clusters to be given before clustering, STC creates clusters without a given number of clusters [8]. For this reason, we compare the proposed method with STC, a representative result clustering algorithm.

2.2 Event Detection and Tweet Clustering / Summarization
Research on tweets has started only recently, and there is a relatively small amount of it. Here, we focus on event detection, which figures out unknown events currently being talked about in tweets, and tweet summarization/clustering, which summarizes or clusters a set of tweets into groups of similar content.

Event detection is used for identifying events that have occurred in the real world. By analyzing tweets posted by various users, we can detect unknown events such as accidents. Chakrabarti et al. suggested an algorithm that finds specific events in Twitter using hidden Markov models [9]. They evaluated the algorithm by detecting American football games in tweets that cover various topics such as music, entertainment, movies, IT information and news, as well as American football games. They found that the algorithm can separate football games that are apart in time from other games, but that it is very hard to distinguish events that occur concurrently. In addition, the algorithm assumes periodic, repetitive events such as American football games.

Tweet summarization or clustering is the technique that clusters or summarizes the tweets related to a specific topic. Through this technique, users can get an overview of the opinions of Twitter users on a certain topic, or easily find what they are looking for. The phrase reinforcement algorithm was suggested for summarizing tweets [10]. The main goal of this algorithm is to find the most frequently overlapping parts among tweets. The algorithm builds a weighted graph for each tweet, consisting of nodes (words) and edges (links between words). However, this method represents only the most frequently appearing words as a sequence, so it does not analyze and summarize the variety of opinions.

3. TWEET SUMMARIZATION USING KEYWORD GRAPH
In this section, we propose an approach to summarize the tweets retrieved with a query. As mentioned, even when we search tweets with a certain word, it is hard to look over the whole search result and catch the major opinions, because a huge number of tweets include the word and the information density of each tweet is low. The proposed approach summarizes the tweets retrieved with a keyword to help users understand them.

3.1 Overview
The existing clustering methods are not suitable for tweet clustering because they are designed for general documents and do not consider the characteristics of tweets mentioned in Section 1. Our method aims to reflect those characteristics.

If similar tweets appear frequently, the content is mentioned by many users, and such tweets are important because many people are interested in them. If we find and group the tweets with similar content mentioned by many users, we can present a summary of the tweets on a topic. When Elizabeth Taylor passed away, for instance, Twitter was filled with tweets about condolences, celebrities who were close to her, and the funeral. In this case, grouping the tweets that include “Elizabeth Taylor” into condolences, celebrities who were close to her, and the funeral would greatly help in understanding the public's interest in her.

Figure 2 shows an overview of the proposed method. It clusters tweets according to their content in four steps: keyword selection in tweets, keyword graph generation, graph clustering, and tweet clustering and merging. We describe each step in detail in the following subsections.

Figure 2. Overview of the proposed method.

3.2 Keyword Selection
It is important to choose meaningful and important keywords from the tweets retrieved with a query. Intuitively, better keyword selection leads to a summary that better reflects the various interests. TF-IDF (term frequency-inverse document frequency) is a popular index for choosing important words from documents. However, since the length of a tweet is limited to 140 characters, most words appear at most once in each tweet, which means that the TF (term frequency) and the DF (document frequency) are almost equal. So, we simply use the term frequency (TF) to choose important words. We regard the set of tweets that include a given query as one document, and consider the words that frequently appear in the tweets to be important.

However, words that are included in most of the tweets may not be useful for clustering tweets by content. For example, the query itself appears in every tweet and is the most frequent word. Words that are semantically or contextually very close to the query also appear in most of the tweets. Another very frequent keyword is “RT”, which indicates that a tweet is a retweet. To choose words that help cluster tweets by content, we have to consider the term frequencies carefully: we need words that appear frequently, but not too frequently. When sorted by frequency, the distribution of tokenized keywords follows Zipf’s law, as shown in Figure 3 [11].

Figure 3. Term frequency and unique keywords in tweets about “Elizabeth Taylor”.

Figure 3 is the word-frequency graph of 5,000 tweets including “Elizabeth Taylor”, sampled between March 21, 2011 and April 19, 2011. It plots the relationship between the term frequency and the number of unique keywords. The most frequent words appear in more than 3,500 tweets, but most words appear in less than 1% of the tweets. Keywords that are too frequent are strongly related to the query word, “Elizabeth Taylor”, so they may not give us any further information. On the other hand, keywords that are too sparse may convey information too minor to be summarized. If we select proper keywords by eliminating the less important ones, we can cluster more efficiently and summarize tweets more informatively. We tested various thresholds for keyword selection, such as keeping the words whose frequency is between 1% and 25%. The experimental results are shown in Section 4.

3.3 Making Graph with Co-occurrence
We generate a graph G(V, E), where V is the set of nodes consisting of the selected keywords and E is the set of weighted edges between nodes. The weight between nodes u and v, $W_{u,v}$, is defined in Equation (1):

$$W_{u,v} = \sqrt{\frac{n_{u,v}}{\max\{\, n_{i,j} \mid i \in V,\ j \in V \,\}}} \qquad (1)$$

Here $n_{u,v}$ is the number of co-occurrences of words u and v in tweets, so $W_{u,v}$ is the square root of $n_{u,v}$ normalized by the maximum co-occurrence. Although any scheme for assigning weights to edges is applicable, we use this normalized co-occurrence value. The edge weights also follow a power-law distribution. As in the keyword selection, we want to keep only the meaningful edges.

An example of a keyword graph is shown in Figure 4. It is built from 10,000 tweets including ‘Elizabeth Taylor’. The words whose frequency is between 1% and 25% are chosen, and co-occurring words are connected with edges. If we remove the less important edges, which have low edge weights, we obtain a graph like the one shown in Figure 4. The frequently co-occurring words form densely connected sub-graphs. In other words, the words in a densely connected sub-graph are strongly related to each other, which means that there are very probably many tweets that include those words. So, if we group the tweets that include such words, we may obtain one of the major tweet groups.

Figure 4. Keyword graph of “Elizabeth Taylor”.

So, we set a threshold to eliminate the low-weight edges. Since the edge weights follow a power-law distribution, the result is not sensitive to the threshold. Based on the edge weights and the threshold, we identify strongly related words, which are likely to appear in almost the same tweets.
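The keyword selection of Section 3.2 and the graph construction of Section 3.3 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `select_keywords` and `build_keyword_graph`, the whitespace tokenization, and the choice to count a word once per tweet are all assumptions made for the example. The edge weight follows Equation (1), read here as the square root of the co-occurrence count divided by the maximum co-occurrence count.

```python
import math
from collections import Counter
from itertools import combinations


def select_keywords(tweets, lower=0.01, upper=0.25):
    """Keep words whose share of tweets lies in [lower, upper] (hypothetical helper)."""
    n_tweets = len(tweets)
    # Assumption: a word is counted once per tweet, so TF and DF coincide.
    counts = Counter(word for tweet in tweets for word in set(tweet.split()))
    return {w for w, c in counts.items() if lower <= c / n_tweets <= upper}


def build_keyword_graph(tweets, keywords, edge_threshold=0.1):
    """Return {frozenset({u, v}): weight}, keeping edges above the threshold."""
    cooc = Counter()
    for tweet in tweets:
        present = sorted(set(tweet.split()) & keywords)
        cooc.update(combinations(present, 2))  # each unordered pair once per tweet
    if not cooc:
        return {}
    max_n = max(cooc.values())
    graph = {}
    for (u, v), n_uv in cooc.items():
        weight = math.sqrt(n_uv / max_n)  # Equation (1)
        if weight > edge_threshold:
            graph[frozenset((u, v))] = weight
    return graph


# Toy data; the thresholds are widened because the sample is tiny.
tweets = [
    "taylor funeral private",
    "taylor funeral private",
    "taylor condolence",
    "taylor condolence rare",
]
keywords = select_keywords(tweets, lower=0.3, upper=0.9)
graph = build_keyword_graph(tweets, keywords)
```

In this toy sample the query-like word "taylor" is dropped for being too frequent and "rare" for being too sparse, which mirrors the motivation for using both a lower and an upper bound.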

3.4 Maximal K-clique Clustering
Although there are various methods for graph clustering, we use maximal k-clique clustering. A k-clique is a sub-graph of k nodes in which every node is connected to all the others. A k-clique in a keyword graph represents a set of words that are tightly coupled by co-occurrence in tweets. Thus, the tweets that contain all the words of a clique are likely to be similar to each other. Figure 5 (a) and (b) are examples of a 3-clique and a 4-clique, respectively, and (c) is an example of a graph that contains 3-cliques and 4-cliques. The Bron-Kerbosch algorithm [12] is known as the fastest algorithm in practice for finding all maximal cliques, and it has a worst-case time complexity of O(3^(n/3)). This complexity is high, but it may not be a serious problem because the number of nodes in a keyword graph tends to converge to a constant as the number of tweets increases. We present the experimental results on this in Section 4.
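A minimal sketch of the clique step, together with the clustering and merging steps that Section 3.5 describes, might look like the following. It uses the pivoting variant of Bron-Kerbosch on an adjacency dictionary. The function names, the minimum clique size, the use of Jaccard similarity for comparing tweet clusters, and the 0.5 merge threshold are assumptions for illustration; the paper does not specify the similarity measure. The toy data reproduces the four-node example of Figure 6.

```python
def maximal_cliques(adj):
    """Enumerate maximal cliques with Bron-Kerbosch (pivoting variant).

    adj maps each node to the set of its neighbours (no self-loops).
    """
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))  # r cannot be extended: maximal
            return
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def tweet_clusters(tweets, cliques, min_size=3):
    """For each clique, collect indices of tweets containing all its words."""
    clusters = []
    for clique in cliques:
        if len(clique) < min_size:
            continue
        members = {i for i, t in enumerate(tweets) if clique <= set(t.split())}
        if members:
            clusters.append(members)
    return clusters


def merge_clusters(clusters, threshold=0.5):
    """Repeatedly merge the first pair whose Jaccard similarity reaches the threshold."""
    clusters = [set(c) for c in clusters]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i], clusters[j]
                if len(a & b) / len(a | b) >= threshold:  # assumed similarity
                    clusters[i] = a | b
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break
    return clusters


# Keyword graph of Figure 6 (a): edges A-B, A-C, B-C, A-D, C-D.
adj = {"A": {"B", "C", "D"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"A", "C"}}
# Tweets a..e as indices 0..4: a, b, c, d contain A, B, C; b, c, d, e contain A, C, D.
tweets = ["A B C", "A B C D", "A B C D", "A B C D", "A C D"]

cliques = maximal_cliques(adj)
clusters = tweet_clusters(tweets, cliques)
merged = merge_clusters(clusters)
```

With these inputs the two maximal cliques yield two tweet clusters that share three of their five tweets, so under the assumed threshold they collapse into a single cluster.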

Figure 5. Examples of k-clique: (a) 3-Clique, (b) 4-Clique, (c) k-Clique (colored) in a graph.

3.5 Tweet Clustering and Merging
Once we have found the maximal k-cliques of a keyword graph, we build tweet clusters based on the cliques in two steps: clustering and merging. First, we extract the tweets that contain all the words of each maximal clique. The set of tweets for a maximal clique is called a tweet cluster. Second, we merge similar tweet clusters into one cluster. Tweet clusters A and B are defined to be similar if many of the tweets in A are also included in B and vice versa, which means that the content of cluster A is similar to that of B. In the merging step, we use an iterative approach: we choose two tweet clusters and merge them if they are similar enough, and repeat this step until no similar tweet clusters remain.

An example is illustrated in Figure 6. The keyword graph in Figure 6 (a) has four nodes: A, B, C and D. The graph has two maximal cliques, one consisting of A, B and C, and the other consisting of A, C and D, as shown in Figure 6 (b). Let us assume that four tweets, a, b, c and d, include all the words of the first clique (A, B and C), and that four tweets, b, c, d and e, include all the words of the second clique, as shown in Figure 6 (c). Then tweets a, b, c and d form one cluster and tweets b, c, d and e form another. Since the two clusters share most of their tweets, we can merge them into one cluster. In general, if the similarity between two clusters is higher than a given threshold, we merge them into one cluster.

Figure 6. Tweet clusters with cliques; (a) Original keyword graph, (b) Max clique, (c) Tweets.

4. EXPERIMENT
4.1 Dataset

For the experiment, we use Twitter datasets from Spinn3r. The data was collected over approximately three months from March 21, 2011. More than 1 billion tweets were collected in total, about 10 million tweets per day on average. We excluded the tweets that contained non-English words. Since it was impossible to consider all incidents or events for summarizing tweets, we chose some queries for our experiment. The ‘10x10’ site² was a good source of worldwide issues. It provides the top 100 important keywords collected from authoritative news sites such as ABC, BBC, CNN, Guardian, and Reuters on an hourly or daily basis. From the 100 keywords provided daily, we selected 8 keywords as queries and collected the tweets including them. The considerations in the query selection were as follows:

1) Keywords should have attracted public attention in English-speaking countries.
2) Keywords should have been mentioned for more than two days, to secure a sufficient number of tweets.
3) Keywords from various subjects should be chosen.

With these criteria, we selected 8 events and the corresponding keywords (queries), as shown in Table 1.

Table 1. Selected events and corresponding keywords.

Field | Event | Keywords
Society | Japan, Fukushima nuclear power plant leaks radiation | Nuclear
Politics | The U.S. government closed because of complaints about the budget decision in Congress | Government Shutdown
Celebrity | Hollywood actress, Elizabeth Taylor died | Elizabeth Taylor
Celebrity | Prince William and Kate Middleton's wedding | Prince William
IT | White iPhone 4 launch | White iPhone
IT | iPad 2 sales | iPad2
Exercise | London 2012 Olympics | Olympic
Science | The Large Hadron Collider (LHC) of the Conseil Européen pour la Recherche Nucléaire (CERN) | LHC

² http://tenbyten.org

4.2 Result
We compared our method with the STC algorithm embedded in the Carrot 2 clustering engine [13]. There are various criteria for measuring clustering performance, such as purity, normalized mutual information (NMI) and the F-measure [14]. We used the F-measure defined in Equation (2), where P is the precision, R is the recall, and the parameter β is a weight that determines the relative importance of the precision and the recall. For the gold standard of clustering results, experts read the tweets and clustered them into groups of similar content. We randomly selected 500 tweets from the dataset for each query and clustered them using expert knowledge.

$$F_\beta = \frac{(\beta^2 + 1) \times P \times R}{\beta^2 \times P + R} \qquad (2)$$

4.2.1 Analysis of keyword graphs
To build a keyword graph, we have to choose important (frequent) keywords in the tweets. For the keyword selection, we tested various frequency thresholds with various numbers of tweets. If the number of selected keywords were sensitive to the thresholds, our approach might not be feasible. However, it is not very sensitive to the thresholds because the word frequencies follow Zipf’s law. Figure 7 shows examples of the numbers of unique keywords with various thresholds. The x-axis of each graph is the number of tweets and the y-axis is the number of keywords. We vary the number of tweets from 1 thousand to 50 thousand, and test combinations of two upper thresholds and four lower thresholds. Figure 7 (a) shows the total number of unique keywords that appear in the sampled tweets. As the figure shows, the number of keywords increases as the sample size increases. However, the numbers of unique keywords filtered with thresholds, shown in Figure 7 (b), stay almost constant. For example, the number of unique keywords whose frequency is between 0.5% and 25% is almost 350 regardless of the number of tweets. This shows that frequency-based thresholds can be effective for extracting important keywords from tweets.

The graphs show another characteristic of the thresholds: the lower bound of the threshold has a decisive effect on the number of keywords, while the upper bound has relatively little. As the lower bound decreases, the number of keywords increases. To retain as many unique keywords as possible, but not too many, we choose 1% as the lower bound of the keyword threshold. Even though the upper bound has less effect on the number of unique keywords, it may affect the summarization result considerably. Words that appear frequently may be strongly related to the query and thus strongly affect the summarization. The higher the upper bound is, the more keywords strongly related to the query we have. However, keywords that are too strongly correlated with the query may not be helpful for summarization. So, we check the performance of our method with various upper bound thresholds in Section 4.2.2.

Figure 7. The trend of the number of unique keywords: (a) # of keywords by sampling size of tweets, (b) # of keywords by sampling size of tweets with variation of k.

With the keywords filtered by the thresholds, we build a keyword graph. The selected words are the nodes, and we connect two words if their co-occurrence frequency is higher than a threshold. We also test various co-occurrence thresholds: 0.1, 0.2, 0.3 and 0.4. Figure 8 shows an example of the number of edges with different thresholds. The experiments are performed with the tweets including ‘Elizabeth Taylor’. The x-axis of the graph is the sample size of tweets, and the y-axis is the number of co-occurrences, which are the edges of the keyword graph. We can see that the total number of edges, shown by the lines tagged with ‘Edges’, grows in a log-like manner as the number of unique keywords does. However, the numbers of edges filtered with each threshold are almost constant regardless of the number of tweets. From this, it can be said that a threshold scheme for choosing keywords and edges is feasible.

tweets into several clusters, so the summarization result of both methods were poor.

Table 2. F-measure comparison with the STC algorithm; F-measure (precision/recall). Proposed method Keywords

STC k=50%

Elizabeth Taylor

k=30%

k=20%

k=15%

k=10%

k=5%

0.22 0.26 0.26 0.26 0.26 0.18 0.18 (0.30/0.18) (1.00/0.15) (1.00/0.15) (1.00/0.15) (1.00/0.15) (1.00/0.10) (1.00/0.10)

Government 0.03 0.45 0.49 0.49 0.49 0.49 0.42 Shutdown (0.17/0.02) (0.50/0.41) (1.00/0.38) (1.00/0.38) (1.00/0.38) (1.00/0.38) (0.67/0.34) iPad2

0.12 0.72 0.72 0.72 0.72 0.76 0.70 (0.09/0.24) (0.75/0.71) (0.83/0.69) (0.83/0.69) (0.83/0.69) (0.83/0.71) (0.83/0.64)

LHC

0.19 0.20 0.17 0.17 0.58 0.48 0.36 (0.14/0.37) (0.36/0.19) (0.39/0.12) (0.39/0.12) (0.76/0.59) (0.63/0.51) (0.63/0.36)

Nuclear

0.11 0.66 0.81 0.81 0.81 0.76 0.76 (0.09/0.23) (0.71/0.63) (0.99/0.72) (0.99/0.72) (0.99/0.72) (0.99/0.66) (0.99/0.66)

Olympic

0.09 0.81 0.81 0.81 0.81 0.80 0.80 (0.08/0.14) (0.84/0.85) (0.84/0.85) (0.84/0.85) (0.84/0.85) (1.00/0.70) (1.00/0.70)

Prince William

0.10 0.53 0.57 0.57 0.61 0.50 0.50 (0.06/0.41) (0.51/0.58) (0.71/0.48) (0.71/0.48) (0.55/0.80) (0.50/0.51) (0.50/0.51)

White iPhone

0.13 0.32 0.32 0.30 0.42 0.40 0.40 (0.09/0.28) (0.730.35) (0.73/0.35) (0.83/0.27) (0.93/0.35) (1.00/0.29) (1.00/0.29)

Figure 8. The trend of the co-occurrences with various edge thresholds.

With the threshold scheme, we obtain an approximately constant number of keywords (nodes) and co-occurrences (edges). This means that the size of the keyword graph, and hence the time our method spends on it, stays approximately constant regardless of the number of tweets. We set the edge threshold to 0.1 in order to obtain enough edges.

4.2.2 Performance comparison When compared against the gold standards (produced by experts), the proposed method achieved a higher F-measure than the STC algorithm, as shown in Table 2. In the table, each number is an F-measure, and the numbers in parentheses are the precision and the recall in the form (precision/recall). We tested the STC algorithm and the proposed method on the eight queries listed in Table 1, which were selected from various subjects. We set the co-occurrence threshold to 0.1 and the lower bound of the keyword threshold to 1%. Even though varying the upper bound threshold has almost no effect on the number of keywords, it can affect the summarization, so we tested various upper bound thresholds. The proposed method outperforms the STC algorithm in terms of the F-measure, the precision, and the recall. Because the STC algorithm assumes that the target documents are rich in information, it does not properly exclude meaningless information from tweets, whereas the proposed algorithm eliminates the less informative words with the keyword threshold. In the table, k is the upper bound of the keyword threshold. The best performance was obtained with upper bounds of around 15% to 30%. The number of keywords differs little across upper bound thresholds, but the summarization performance does vary. In the case of ‘Elizabeth Taylor’, the F-measure of the STC algorithm is almost the same as that of the proposed algorithm. In these tweets, many people may have expressed similar opinions about the events related to the query. It was hard even for experts to group

5. CONCLUSION Because of the huge number of tweets and their low information density, it is hard to obtain useful information from Twitter. To address this problem, we proposed a tweet summarization method based on a graph of the keywords in tweets. With our method, we could eliminate a large number of less important words from tweets and find sets of strongly related words. Based on these strongly related words, we could produce summaries of tweets with high information density. Our experiments confirmed that the proposed method is effective for summarizing tweets and is superior to the existing method.
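As a rough illustration of the keyword filtering and word grouping summarized above, the following Python sketch builds a keyword graph and lists its maximal cliques. It is a minimal sketch under stated assumptions, not the implementation used in the experiments: the function names are hypothetical, tweets are tokenized by whitespace, the co-occurrence score is assumed to be a Jaccard ratio (this section does not define how the 0.1 threshold is normalized), and a basic Bron-Kerbosch recursion stands in for the clique-finding step. The default thresholds follow the values reported in the experiments (a 1% lower bound, an upper bound within the 15% to 30% range, and an edge threshold of 0.1).

```python
from collections import Counter
from itertools import combinations


def build_keyword_graph(tweets, lower=0.01, upper=0.30, edge_threshold=0.1):
    """Return an adjacency dict over the words kept by the frequency band."""
    # Assumption: whitespace tokenization; each tweet becomes a set of words.
    docs = [set(t.lower().split()) for t in tweets]
    n = len(docs)
    df = Counter(w for d in docs for w in d)  # number of tweets containing w

    # Keep words whose share of tweets lies within [lower, upper].
    keywords = {w for w, c in df.items() if lower <= c / n <= upper}

    # Count how many tweets contain both keywords of each pair.
    pair_counts = Counter()
    for d in docs:
        for a, b in combinations(sorted(d & keywords), 2):
            pair_counts[(a, b)] += 1

    graph = {w: set() for w in keywords}
    for (a, b), c in pair_counts.items():
        # Assumed score: Jaccard ratio of the two words' tweet sets.
        score = c / (df[a] + df[b] - c)
        if score >= edge_threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def maximal_cliques(graph):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            if r:  # skip the empty clique of an empty graph
                cliques.append(r)
            return
        for v in list(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(graph), set())
    return cliques


if __name__ == "__main__":
    sample = [
        "taylor died today",
        "taylor died sad",
        "taylor movie classic",
        "movie classic great",
    ]
    # Wide bounds only because the toy corpus has four tweets.
    g = build_keyword_graph(sample, lower=0.3, upper=0.6)
    for group in maximal_cliques(g):
        print(sorted(group))
```

In the toy corpus, the word "taylor" exceeds the upper bound and the words that appear only once fall below the lower bound, so neither becomes a vertex. Isolated keywords are reported as single-word cliques; a caller interested only in groups of related words could discard cliques below a chosen size.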

6. ACKNOWLEDGMENTS This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education, Science and Technology (No. 20130458-000), and by the IT R&D program of MKE/KEIT (10041244, Smart TV 2.0 Software Platform).

7. REFERENCES [1] The Telegraph, “Twitter users send 50 million tweets per day,” http://www.telegraph.co.uk/technology/twitter/7297541/Twit ter-users-send-50-million-tweets-per-day.html, 23 Feb. 2010. [2] Cutting, D., Karget, D., Pederson, J., and Tukey, J. 1992. Scatter/Gather: A cluster-based approach to browsing large document collections. In Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Copenhagen, Denmark, Jun. 21-24, 1992). SIGIR `92. ACM, New York, NY, 318-329. DOI=http://dl.acm.org/citation.cfm?doid=133160.133214. [3] Zamir, O. and Etzioni, O. 1998. Web document clustering: A feasibility demonstration. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval (Melbourne, Australia, Aug. 24-28,

1998). SIGIR ’98. ACM, New York, NY, 46-54. DOI= http://dl.acm.org/citation.cfm?doid=290941.290956.

Conference on Weblogs and Social Media (San Francisco, USA, Aug. 07-11, 2011). AAAI `11. DOI=10.1.1.206.4594.

[4] Carrot 2 clustering engine, "http://project.carrot2.org".

[10] Sharifi, B., Hutton, M.A. and Kalita, J.K. 2010. Experiments in microblog summarization. In Proceedings of the Second International Conference on Social Computing (Amsterdam, Netherlands, Sep. 3-5, 2012). SocialCom/PASSAT `12. IEEE, 49-56. Aug. DOI=10.1109/SocialCom.2010.17.

[5] Ohsawa, Y., Benson, N.E., Yachida, M. 1998. KeyGraph: automatic indexing by co-occurrence graph based on building construction metaphor. In Proceedings of the IEEE International Forum on Research and Technology Advances in Digital Libraries (Santa Barbara, CA, USA, Apr. 22-24, 1998). 12-18. DOI= 10.1109/ADL.1998.670375. [6] Palshikar, G.K. 2007. Keyword extraction from a single document using centrality measures. Lecture Note in Computer Science. 4815, (Dec., 2007), 503-510. DOI= 10.1007/978-3-540-77046-6_62. [7] Xu, W., Liu, X., and Gong, Y. 2003. Document clustering based on non-negative matrix factorization. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Toronto, Canada, Jul. 28-Aug. 01, 2003). SIGIR `03. ACM, New York, NY, 267-273. DOI= 10.1145/860435.860485. [8] Andrews, N.O. and Fox, E.A. 2007. Recent developments in document clustering. Technical Report TR-07-35, Computer science, Virginia Tech. [9] Chakrabarti, D. and Punera, K. 2011. Event summarization using Tweets. In Proceedings of the Fifth International AAAI

[11] Newman, M.E.J. 2005. Power laws, Pareto distributions and Zipf’s law. Contemporary Physics. 46, 5 (May, 2006), 1-28. DOI= 10.1016/j.cities.2012.03.001. [12] Johnston, H.C. 1976. Cliques of a graph-variations on the Bron-Kerbosch algorithm. International Journal of Parallel Programming. 5, 3 (Sep., 1976), 209-238. DOI= 10.1007/BF00991836. [13] Rangrej, A., Kulkarni, S. and Tendulkar, A.V. 2011. Comparative study of clustering techniques for short text documents. In Proceedings of the 20th International World Wide Web Conference (Hyderabad, India, Mar. 28-Apr. 01, 2001). WWW `11, ACM New York, NY, USA, 111-112. DOI= 10.1145/1963192.1963249. [14] Manning, C.D., Raghavan, P. and Schütze, H. 2008. Introduction to information retrieval, Cambridge University Press.