A Genetic Algorithm based tweet clustering Technique

2017 International Conference on Computer Communication and Informatics (ICCCI -2017), Jan. 05 – 07, 2017, Coimbatore, INDIA

Soumi Dutta1, Sujata Ghatak2
Department of Computer Science & Engineering, Institute of Engineering & Management, Kolkata-700091, India

Abstract: The Twitter microblogging site is one of the most popular online social media on today’s Web. During an important event, such as a disaster, hundreds of thousands of tweets are posted rapidly on Twitter. The information is posted too fast for anyone to make sense of it, so it needs to be organized before it can be used effectively. It has been observed that many of the tweets posted during an event are very similar to each other; hence clustering or grouping similar tweets is an effective way to reduce the information load. However, clustering of tweets is challenging because of the small size and noisy nature of tweets. In this work, we propose a novel clustering approach for tweets, which combines two different approaches: a traditional clustering approach, K-means, and an evolutionary approach, Genetic Algorithms. We conduct experiments on a dataset of real tweets collected during a recent disaster event, and show that the proposed methodology performs better than several existing clustering techniques.

Keywords: Genetic Algorithm, K-means clustering, social media network.

I. INTRODUCTION

The Twitter microblogging platform (https://twitter.com/) is one of the most popular sites on today’s Web, and is used by millions of users every day to post short messages (called ‘tweets’) on various topics and events. The primary strength of Twitter is its real-time updates on events happening ‘now’, which are used in several applications such as analyzing public opinion on different issues and real-time event detection. Millions of tweets are posted every day on Twitter, and the number is even higher during an important event, such as a natural disaster like a flood or an earthquake. A user who wants to follow the event by browsing the tweet stream would find it very difficult to go through so many tweets. In such a scenario, an effective way to reduce the information load on the user is to cluster similar tweets, so that the user needs to see only a few tweets from each cluster.

Though document clustering is a well-established area of research [1], clustering of tweets is difficult due to the very small size of tweets (at most 140 characters) and their noisy nature: because of the size restriction, tweets often contain abbreviations, colloquial language, and so on. Hence, standard data mining / natural language processing techniques do not usually perform well for tweets [2]. In this work, we propose a novel clustering algorithm for tweets, which combines a traditional clustering algorithm (K-means) with an evolutionary algorithm (Genetic Algorithms).

978-1-4673-8855-9/17/$31.00 ©2017 IEEE

Saptarshi Ghosh3, Asit K. Das4
Department of Computer Science & Technology, Indian Institute of Engineering Science and Technology, Shibpur, Howrah – 711103, India

Though both these algorithms have been used for clustering in many prior works (as described in the next section), to our knowledge, no prior work has attempted to combine these two techniques. We perform experiments on a dataset of tweets posted during a recent disaster event, the Uttarakhand floods in 2013. We compare the performance of the proposed methodology with several baseline clustering techniques, such as the classical K-means, hierarchical clustering, density based clustering, and graph clustering algorithms. The proposed methodology performs better than the baseline approaches, according to several standard metrics for evaluating clustering [3].

The rest of the paper is organized as follows. Section II discusses related work, and the proposed methodology is detailed in Section III. Section IV describes the dataset collected, and the evaluation of the proposed methodology and the baseline algorithms on this dataset. The paper is concluded in Section V.

II. RELATED WORK

A. Clustering of tweets

There have been some prior works on clustering of tweets. Cheong and Lee [4] attempted to detect intra-topic user and message clusters in Twitter, using an unsupervised self-organizing feature map (SOM) as a machine learning based clustering tool. Kang et al. proposed an affinity propagation algorithm for clustering similar tweets [5]. Dutta et al. [6] used a graph-based community detection algorithm for clustering tweets, and later used the clustering output for summarization. Yang and Leskovec [7] proposed a method for clustering hashtags using their temporal patterns of propagation. Recently, Rangrej et al. [8] conducted a comparative study of three clustering algorithms (K-means, affinity propagation, and a singular value decomposition based algorithm) and compared their performance in clustering short text documents.

B. Genetic Algorithms

A genetic algorithm (GA) is a search technique used in computing to find exact or approximate solutions to optimization and search problems. GAs are a particular class of evolutionary algorithms that use techniques inspired by evolutionary biology such as inheritance, mutation, selection, and crossover (also called recombination). The evolution usually starts from a population of randomly generated individuals and happens in generations. In each generation, the fitness of every individual in the population is evaluated, and multiple individuals are selected from the current population (based on their fitness) and modified to form a new population. The new population is used in the next iteration of the algorithm. The algorithm terminates when either a maximum number of generations has been produced, or a satisfactory fitness level has been reached for the population.

The steps of a basic Genetic Algorithm (GA) are outlined as follows:
Step 1: Start with a large population of randomly generated attempted solutions to a problem.
Step 2: Repeatedly do the following:
  i) Evaluate each of the attempted solutions.
  ii) (Probabilistically) keep a subset of the best solutions.
  iii) Use these solutions to generate a new population.
Step 3: Quit when a satisfactory solution is found (or you run out of time).

Genetic algorithms have been used for clustering in many prior works [9], [10], including clustering data obtained from online social networks. For instance, Hajeer et al. [11] proposed graph based clustering of online social network communities using genetic algorithms. Adel et al. [12] proposed a cellular genetic algorithm based clustering method to categorize a Twitter dataset.

As stated above, various methodologies have been tried for clustering documents [1], including traditional methods like K-means as well as evolutionary approaches like genetic algorithms. However, to our knowledge, no prior work has attempted to combine these two types of approaches. In this work, a clustering framework is proposed which combines genetic algorithms with traditional K-means clustering, and it is shown that the proposed algorithm performs better than several standard clustering algorithms like K-means, hierarchical clustering algorithms, and graph-based clustering algorithms.

III. PROPOSED METHODOLOGY

This section details our proposed methodology for clustering a given set of tweets. The methodology is briefly outlined in Algorithm 1, and each step is detailed below. To explain the different steps of the methodology, we use a toy dataset of 5 tweets shown in Table I. The tweets are taken from a dataset of tweets posted during the floods in the Uttarakhand state of India in 2013, which is described later in Section IV.

Pre-processing: As seen in Table I, tweets often contain non-textual characters like smileys, @usernames, exclamation / question marks, etc., which act as noise and degrade clustering / classification tasks. So we pre-processed the tweets to filter out such characters. Additionally, a standard set of English stop words is removed, and the tweets are case-folded to lower case. Also, for hashtags like ‘#uttarakhand’, we ignore the ‘#’ symbol; in other words, we consider the two terms ‘#uttarakhand’ and ‘uttarakhand’ to be the same.
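The pre-processing step above could be sketched roughly as follows. This is only an illustration: the function name `preprocess`, the regular expressions, and the tiny `STOP_WORDS` set are assumptions made here, since the paper does not name the stop-word list or the exact filtering rules it used.

```python
import re

# Hypothetical, deliberately tiny stop-word list; the paper only says that
# "a standard set" of English stop words is removed.
STOP_WORDS = {"a", "an", "the", "in", "of", "for", "my", "will", "be", "are",
              "is", "was", "it", "to", "on", "at", "and", "via", "this", "that"}

def preprocess(tweet):
    """Return the set of distinct terms left after cleaning one tweet."""
    text = tweet.lower()                                # case-fold to lower case
    text = re.sub(r"https?://\S+|\[url\]", " ", text)   # drop URLs and [url] placeholders
    text = re.sub(r"@\w+", " ", text)                   # drop @usernames
    text = text.replace("#", "")                        # '#uttarakhand' -> 'uttarakhand'
    text = re.sub(r"[^a-z0-9\s]", " ", text)            # drop punctuation and smileys
    return {tok for tok in text.split() if tok not in STOP_WORDS}
```

Returning a set rather than a list matches the later binary document-term matrix, which records only the presence or absence of a term in a tweet.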


TABLE I. A TOY DATASET OF 5 TWEETS, USED TO ILLUSTRATE THE GENETIC ALGORITHM BASED CLUSTERING METHODOLOGY.

| Tweet-id | Text |
| T1 | Uttarakhand: Nature’s fury June 2013: [url] via @youtube |
| T2 | Uttarakhand monsoon rains ‘kill 10’: At least 10 people are killed in landslides and flooding [url] |
| T3 | Thousands of pilgrims reported stuck in the hilly regions of Uttarakhand. They were on their way to various Hindu shrines. #Monsoon #India |
| T4 | Avg rainfall in uttarakhand was 61mm. This year it already received 154mm. That’s 190% of avg rainfall @abpnewstv |
| T5 | praying for my family nd muluk ppl in #Uttarakhand.. hope everything will be fine soon :) GOD SAVE ALL |

Forming Document-Term Matrix: After the pre-processing step, each tweet is effectively a set of distinct terms. We computed the set of all distinct terms in the entire set of tweets. Let the number of tweets be N and the number of distinct terms in the entire set of tweets be M, and let the distinct terms be denoted as t1, t2, ..., tM. Then a two-dimensional N x M matrix is formed, where the rows represent individual tweets and the columns represent distinct terms. The entries in the matrix represent the presence or absence of particular terms in the tweets; specifically, the (i, j)-th entry in the matrix is 1 if the i-th tweet contains tj, and 0 otherwise. Table II shows the corresponding matrix for the set of tweets in Table I. Here, N = 5 and M = 13. Note that the above matrix could also have been weighted, where the (i, j)-th entry would be the frequency of tj in the i-th tweet. However, since tweets are very small (at most 140 characters), words are very rarely repeated in the same tweet. Hence, we only consider the presence or absence of words.

Genetic algorithm based clustering: As stated earlier, our proposed methodology combines an evolutionary approach (Genetic Algorithm) with a traditional clustering approach (K-means). We now describe the methodology. An initial population for the GA is formed randomly, consisting of 100 chromosomes ci, for all i = 1, 2, 3, ..., 100. An initial population matrix is thus generated, containing 100 rows (each row corresponds to a chromosome) and M columns (each column corresponds to a distinct term in the vocabulary, as described above). The population matrix is initially filled up randomly, i.e., each entry is randomly assigned either 1 or 0. If the (i, j)-th entry in the matrix is 1, it implies that the chromosome ci contains the term tj as a feature. Table III shows a sample initial population matrix, where each row represents a chromosome and each column represents a distinct term.
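As a minimal sketch of the two constructions just described, the binary document-term matrix and the random initial population could be built as below. The helper names `document_term_matrix` and `initial_population`, the sorted column order, and the three toy term sets are assumptions made for illustration, not details taken from the paper.

```python
import random

def document_term_matrix(term_sets):
    """Build the binary N x M matrix: entry (i, j) is 1 iff tweet i contains term j."""
    vocab = sorted(set().union(*term_sets))            # the M distinct terms
    matrix = [[1 if t in terms else 0 for t in vocab] for terms in term_sets]
    return vocab, matrix

def initial_population(num_terms, size=100, seed=None):
    """Random 0/1 population matrix: one row per chromosome, one column per term."""
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(num_terms)] for _ in range(size)]

# Toy usage with three already pre-processed tweets (illustrative term sets).
tweets = [{"uttarakhand", "nature", "fury"},
          {"uttarakhand", "monsoon", "rain", "landslide"},
          {"uttarakhand", "family", "hope"}]
vocab, dt = document_term_matrix(tweets)
population = initial_population(len(vocab), size=100, seed=0)
```

A 1 in row i, column j of `population` means that chromosome i keeps term `vocab[j]` as a feature.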
The next step is to estimate the fitness of each chromosome, which is done as follows. For each chromosome ci, i ∈ {1, 2, ..., 100}, a separate two-dimensional matrix is created. This matrix has N rows corresponding to the N tweets, and only those columns from the document-term matrix for which the corresponding entry in the row for ci in the initial population matrix is 1. In other words, the matrix for chromosome ci is a projection of the document-term matrix, containing only those terms (columns) tj for which the (i, j)-th entry in the initial population matrix is 1.

TABLE II. DOCUMENT-TERM MATRIX FOR THE TOY DATASET IN TABLE I

| Tweet | uttarakhand | nature | fury | monsoon | rain | landslide | pilgrims | stuck | region | hindu | shrines | family | hope |
| T1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T2 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T3 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
| T4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T5 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
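The projection of the document-term matrix onto the terms selected by one chromosome can be illustrated on the Table II matrix. The function name `project` and the example chromosome `c` are hypothetical; the chromosome is simply chosen here to select four of the thirteen terms.

```python
def project(dt_matrix, chromosome):
    """Keep every row of dt_matrix but only the columns j where chromosome[j] == 1."""
    keep = [j for j, bit in enumerate(chromosome) if bit == 1]
    return [[row[j] for j in keep] for row in dt_matrix]

# Document-term matrix of Table II (rows T1..T5, 13 term columns).
TERMS = ["uttarakhand", "nature", "fury", "monsoon", "rain", "landslide",
         "pilgrims", "stuck", "region", "hindu", "shrines", "family", "hope"]
DT = [
    [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # T1
    [1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],  # T2
    [1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0],  # T3
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # T4
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1],  # T5
]

# Hypothetical chromosome selecting uttarakhand, fury, monsoon and hope.
c = [1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
projected = project(DT, c)
selected_terms = [t for t, bit in zip(TERMS, c) if bit == 1]
```

Each chromosome therefore defines its own reduced feature space, and K-means is run on the rows of the corresponding projected matrix.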

TABLE III. SAMPLE INITIAL POPULATION MATRIX.

| Chromosome | G1 | G2 | G3 | G4 | G5 | .. | .. | GM |
| c1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
| c2 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
| c3 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
| ... | | | | | | | | |
| c100 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 |

Table IV shows the matrix for the chromosome c1 in Table III, which contains as columns only those terms for which the row corresponding to c1 in the initial population matrix in Table III contains 1.

Algorithm 1: Tweet clustering algorithm
Input: N tweets; desired number of clusters, K
Output: Tweets clustered into K clusters
  Pre-process tweets by removing special characters, URLs, stop words.
  Identify all distinct terms in the set of tweets. Let M be the number of distinct terms.
  Generate an N x M document-term matrix DT, where each row represents a tweet and each column represents a distinct term.
  Generate the initial population for the GA, containing 100 chromosomes ci, i = 1, 2, 3, ..., 100.
  Generate an initial population matrix P having one row for each ci and M columns. Each cell is randomly filled with 0 or 1.
  while the GA has not converged and the maximum number of iterations has not been reached do
    for all ci, i = 1, 2, 3, ..., 100 do
      Create a matrix having N rows (each corresponding to a tweet) and those columns j of DT for which P[i, j] = 1.
      Cluster the rows of the matrix using K-means.
      Evaluate the fitness of ci by evaluating the quality of the clustering.
    end for
    Apply the Genetic Algorithm steps: selection (Roulette Wheel method) and crossover.
    From the newly generated child chromosomes and their parents, the best-fit solutions are selected for the new population.
  end while
  Output the clustering corresponding to the fittest chromosome of the final population as the final clustering of the tweets.

We get 100 such matrices, corresponding to the 100 chromosomes. The K-means algorithm is applied on each of these matrices to cluster the rows. The value of K (number of clusters) is chosen based on some estimate of how many clusters the original dataset might contain; this choice is discussed in the next section. Let Ui = {X1, X2, X3, X4, ..., XK} be the clustering of the matrix corresponding to the chromosome ci into K clusters.

To measure the goodness of the clustering Ui, we use the Davies-Bouldin index (DBIndex) [3], which is defined below:

$$DB(U_i) = \frac{1}{K} \sum_{p=1}^{K} \max_{q \neq p} \left\{ \frac{\Delta(X_p) + \Delta(X_q)}{\delta(X_p, X_q)} \right\}$$

where K is the number of clusters, $\Delta(X_p)$ is the complete diameter distance, which is the distance between the most remote objects belonging to the same cluster $X_p$, and $\delta(X_p, X_q)$ is the average linkage distance, i.e., the average distance between all the objects belonging to two different clusters $X_p$ and $X_q$. Smaller values of DBIndex represent better quality clusters that are compact and well-separated.
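A small sketch of the DBIndex computation described above, using the complete diameter within a cluster and the average linkage between clusters. Euclidean distance is an assumption made here (the text does not name the distance), and the function names are illustrative. The sketch assumes at least two non-empty clusters whose points do not all coincide.

```python
import math
from itertools import combinations

def diameter(cluster):
    """Complete diameter: largest distance between two points of the same cluster."""
    return max((math.dist(a, b) for a, b in combinations(cluster, 2)), default=0.0)

def average_linkage(c1, c2):
    """Average distance over all pairs with one point in each cluster."""
    return sum(math.dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def db_index(clusters):
    """Davies-Bouldin style index; smaller means more compact, better separated."""
    k = len(clusters)
    total = 0.0
    for p in range(k):
        # Worst (largest) ratio of within-cluster spread to between-cluster distance.
        total += max(
            (diameter(clusters[p]) + diameter(clusters[q]))
            / average_linkage(clusters[p], clusters[q])
            for q in range(k) if q != p
        )
    return total / k
```

For example, moving two equally tight clusters further apart increases the average linkage distance and so lowers the index, which is the behaviour the fitness evaluation relies on.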

TABLE IV. MATRIX CORRESPONDING TO THE CHROMOSOME c1 IN TABLE III. THIS MATRIX IS A PROJECTION OF THE DOCUMENT-TERM MATRIX SHOWN IN TABLE II, CONTAINING ALL THE ROWS AND ONLY THOSE COLUMNS WHICH CONTAIN 1 IN THE ROW CORRESPONDING TO c1 IN THE INITIAL POPULATION MATRIX IN TABLE III.

| Tweet | uttarakhand | fury | monsoon | hope |
| T1 | 1 | 1 | 0 | 0 |
| T2 | 1 | 0 | 1 | 0 |
| T3 | 1 | 0 | 1 | 0 |
| T4 | 1 | 0 | 0 | 0 |
| T5 | 1 | 0 | 0 | 1 |
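One generation of the selection and crossover steps listed in Algorithm 1 might look like the sketch below. Several details are assumptions made for illustration: the `fitness` argument is a stand-in for the clustering-based score (lower is taken to be better), roulette-wheel weights are set to the inverse of the score, crossover is single-point, and the population size is assumed to be even with chromosomes of length at least 2. The paper does not specify these choices.

```python
import random

def roulette_select(population, scores, rng):
    """Pick one chromosome with probability proportional to 1 / score (lower score = fitter)."""
    weights = [1.0 / (s + 1e-9) for s in scores]
    return rng.choices(population, weights=weights, k=1)[0]

def crossover(a, b, rng):
    """Single-point crossover producing two children (assumed crossover type)."""
    point = rng.randint(1, len(a) - 1)
    return a[:point] + b[point:], b[:point] + a[point:]

def next_generation(population, fitness, rng, elite_frac=0.2):
    """One GA generation; `fitness` maps a chromosome to a score where lower is better."""
    scores = [fitness(c) for c in population]
    ranked = [c for _, c in sorted(zip(scores, population), key=lambda t: t[0])]
    n = len(population)
    parents = ranked[: int(elite_frac * n)]              # keep the fittest fraction
    while len(parents) < n:                              # fill the rest by roulette wheel
        parents.append(roulette_select(population, scores, rng))
    rng.shuffle(parents)
    new_population = []
    for a, b in zip(parents[0::2], parents[1::2]):
        family = [a, b, *crossover(a, b, rng)]           # two parents + two children
        family.sort(key=fitness)
        new_population.extend(family[:2])                # keep the two with lowest score
    return new_population
```

In the full method each call to `fitness` would involve projecting the document-term matrix and running K-means, so a practical implementation would likely cache scores rather than recompute them during the family sort.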

The fitness of the chromosome ci is estimated as a weighted combination of the DBIndex of its clustering Ui and the length of the chromosome (the number of features it selects), where w ∈ [0, 1] is a weighting factor that decides the relative importance of the DBIndex and the length of the chromosome. In general, we prefer chromosomes having fewer features, so that the clustering can be quicker.

The GA is applied on the population matrix following the basic steps of selection and crossover. For each selection, the 20% fittest chromosomes are selected from the population (using the fitness function described above). The remaining 80% of the chromosomes can be selected using any of the selection methods; in this work, Roulette-wheel selection is used. In the next step, the subsequent population is generated using crossover. For each pair of parent chromosomes, crossover results in two new child chromosomes or solutions. From these four chromosomes (two parents and two children), the two having the lowest fitness values are selected as the best chromosomes. The chromosomes selected in this way form the new population, which is used in the next step of the GA.

In the next iteration of the GA, the K-means clustering algorithm is applied on the new population matrix (as in the previous iteration), and the fitness value of each chromosome in the newly generated population is evaluated using the objective function. The GA is iterated either up to a fixed number of iterations, or until it converges. After the GA ends, the fittest chromosome in the current population is considered, and the K-means clustering corresponding to this chromosome is output as the clustering of the tweets.

IV. EXPERIMENTS AND RESULTS

This section describes our experiments using a dataset of tweets, and compares the proposed methodology with some baseline clustering techniques.

A. Data Set

As stated earlier, clustering of tweets is mostly necessary when tweets are posted rapidly during a major event. Hence, we focused on tweets posted during a specific event: the devastating landslides and floods in the Uttarakhand state of India in June 2013. Tweets related to the Uttarakhand floods were collected through the Twitter API, using the keyword ‘Uttarakhand’. Specifically, the tweets were extracted from the 1% random sample of tweets that is made publicly available by Twitter.

Generating gold standard clusters: We considered a set of distinct tweets from the Uttarakhand dataset for this study. Note that very similar or duplicate tweets are posted by many users on Twitter; hence we ignored such duplicate tweets and considered only distinct tweets.

TABLE V. EXAMPLES OF TWEETS RELATED TO THE UTTARAKHAND FLOOD EVENT. TWEETS CONTAINING FIVE DIFFERENT TYPES OF INFORMATION WERE IDENTIFIED BY HUMAN VOLUNTEERS.

| Type | Example tweets (extract from tweet text) |
| Casualties | Uttarakhand monsoon rains kill 10: At least 10 people are killed in landslides and flooding [url] |
| Weather info | Avg rainfall in uttarakhand was 61mm. This year it already received 154mm. That’s 190% of avg rainfall @abpnewstv |
| Helpline info | RT @AdityaRajKaul: Uttarakhand Flash Flood Helpline numbers are: 0135-2710335, 2710233 |
| Sentiment | praying for my family nd muluk ppl in #Uttarakhand.. hope everything will be fine soon :) GOD SAVE ALL |
| News | RT @BJPRajnathSingh: I appeal to all BJP workers in Uttarakhand to provide every possible help and relief to flood affected people. |

The selected tweets were inspected by two human volunteers, who were asked first to identify the different types of information contained in the tweets, and then to cluster / group the tweets according to the type of information. The human volunteers identified 5 different types of information in the tweets: (1) tweets informing about casualties or loss / damage to resources, (2) tweets giving weather related updates, (3) information about helplines, (4) sentiments of the people and prayers for the affected population, and (5) news about government / political steps regarding the disaster situation. Table V gives examples of tweets of each type. Thus, the gold standard (generated by human volunteers) groups the tweets into 5 clusters. We next used the proposed clustering methodology to group the tweets automatically, and checked how well the automatic clustering matches the gold standard clustering.

B. Baseline Clustering Methods for Comparison

We used a number of standard clustering methods (baselines) on the same tweet dataset, to compare against the performance of the proposed method. Specifically, the following clustering methods were used as baselines: (1) Hierarchical Clustering, (2) Density Based Clustering, (3) K-means Clustering and (4) the InfoMap community detection algorithm. Hierarchical Clustering follows either a top-down or a bottom-up approach, whereas Density Based Clustering identifies dense regions in the data space as clusters, separated by regions of lower object density. The well-known Weka toolkit was used for the first three clustering methodologies. The fourth methodology, InfoMap, is actually a methodology for detecting communities in graphs, where communities are groups of densely connected nodes in the graph. To use this methodology, we created a graph where the nodes are individual tweets, and two nodes are connected by an edge if the textual similarity between the two tweets is higher than a threshold value. The textual similarity between two tweets was computed based on the presence of common words in the tweet text, after ignoring a standard set of stop words.

C. Metrics for evaluating clustering

We used a set of standard metrics [3] to evaluate the quality of the clustering produced by the different methodologies.
The following metrics were used:
(1) Calinski-Harabasz index (CH), which evaluates cluster validity based on the average between-cluster and within-cluster sum of squares.
(2) I-index (I), which measures separation based on the maximum distance between cluster centers, and measures compactness based on the sum of distances between objects and their cluster center.
(3) Dunn index (D), which uses the minimum pairwise distance between objects in different clusters as the inter-cluster separation, and the maximum diameter among all clusters to estimate the intra-cluster compactness.
(4) Silhouette index (S), which validates the clustering performance based on the pairwise difference of between-cluster and within-cluster distances.
(5) Xie-Beni index (XB), which defines the inter-cluster separation as the minimum square distance between cluster centers, and the intra-cluster compactness as the mean square distance between each data object and its cluster center.
(6) The DB index (DB), which has been defined earlier.

Smaller values of the DB index and the Xie-Beni index indicate better clustering performance, while higher values of the Dunn index, Silhouette index, Calinski-Harabasz index and I-index indicate better clustering performance. The reader is referred to [3] for a detailed description of these metrics.

D. Parameter selection for proposed methodology

To use the proposed methodology, three parameters / factors need to be decided: (1) the value of the weight factor w, (2) the number of clusters that will be computed, and (3) the stopping criterion of the genetic algorithm, i.e., whether to stop after a certain number of iterations, or to continue till the genetic algorithm converges. Note that the third factor (stopping criterion) is practically important for clustering tweets during an event. Ideally, the genetic algorithm should be continued until convergence; however, since tweets need to be clustered quickly during an ongoing event, it might be more practically useful to stop after a certain number of iterations.

To decide suitable values for the three factors stated above, we executed various instances of the proposed algorithm using different choices for the factors, and then compared the clustering performance using the metrics described earlier. Specifically, we considered the following choices:
• Weight factor: we tried w = 0.1, 0.2, 0.3, ..., 0.9 and obtained the best results for low values of w. Hence we report results only for w = 0.1, 0.2, 0.3.
• Number of clusters: since the gold standard (prepared by human volunteers) identified 5 clusters, we used three instances of the proposed methodology, considering 4, 5, and 6 clusters.
• Stopping criterion: we considered two instances of the proposed methodology, one where the process was continued till convergence, and the other where the process was continued till 100 iterations of the genetic algorithm.

Table VI shows the clustering results of the proposed methodology for the various choices, where the best performance according to each metric is highlighted in boldface. It can be seen that the best clustering is obtained considering 5 or 6 clusters, weight factor w = 0.1, and when the process is run till 100 iterations of the genetic algorithm. Hence we use the corresponding choices (which give the best performance of the proposed methodology) while comparing the proposed methodology with the baseline approaches.

TABLE VI. CLUSTERING RESULTS OF THE PROPOSED METHODOLOGY FOR VARIOUS PARAMETER CHOICES

Output of proposed methodology up to 100 iterations

| No. of Clusters | Weight Factor (w) | Silhouette | Calinski-Harabasz | DB | I-Index | Xie-Beni | Dunn |
| 4 | 0.1 | 0.8461 | 65.356 | 0.242 | 1.579 | 0.051 | 2.25 |
| 4 | 0.2 | 0.9527 | 8.268 | 0.068 | 0.649 | 0.120 | 6.123 |
| 4 | 0.3 | 0.9105 | 11.582 | 0.111 | 0.7993 | 0.2687 | 2.005 |
| 5 | 0.1 | 0.942 | **70.001** | 0.312 | **2.698** | 0.07114 | 2.125 |
| 5 | 0.2 | 0.76212 | 5.196 | 0.235 | 0.5558 | 0.358 | 3.312 |
| 5 | 0.3 | 0.567 | 0.19 | 0.399 | 0.476 | 61.05 | 0.87 |
| 6 | 0.1 | **0.989** | 35.692 | **0.050** | 1.245 | **0.045** | **10.102** |
| 6 | 0.2 | 0.978 | 14.125 | 0.062 | 0.6011 | 0.1112 | 7.003 |
| 6 | 0.3 | 0.8564 | 11.546 | 0.258 | 1.495 | 0.3532 | 2.289 |

Output of proposed methodology until the genetic algorithm converges

| No. of Clusters | Weight Factor (w) | Silhouette | Calinski-Harabasz | DB | I-Index | Xie-Beni | Dunn |
| 4 | 0.1 | 0.176 | 0.649 | 0.269 | 0.301 | 554.145 | 3.298 |
| 4 | 0.2 | 0.223 | 0.398 | 0.275 | 0.1025 | 1068.0 | 1.78 |
| 4 | 0.3 | 0.7811 | 7.145 | 0.316 | 1.211 | 0.505 | 1.832 |
| 5 | 0.1 | 0.811 | 5.11 | 0.199 | 1.201 | 0.801 | 1.789 |
| 5 | 0.2 | 0.903 | 11.89 | 0.199 | 1.504 | 0.445 | 1.779 |
| 5 | 0.3 | 0.021 | 2.7568 | 1.201 | 0.375 | 1471 | 1.213 |
| 6 | 0.1 | 0.809 | 11.598 | 0.312 | 1.576 | 0.411 | 1.759 |
| 6 | 0.2 | 0.877 | 5.014 | 0.175 | 0.8001 | 0.578 | 1.0514 |
| 6 | 0.3 | 0.489 | 0.988 | 0.239 | 0.379 | 263.87 | 1.75 |

E. Comparison of proposed methodology with baselines

Finally, we compare the performance of the proposed methodology with that of the baseline approaches described earlier. As observed above, we use the proposed methodology with the choices for which the best results are obtained (w = 0.1, 5 or 6 clusters computed, genetic algorithm run till 100 iterations). For the baseline algorithms K-means, density based clustering, and hierarchical clustering, the number of clusters needs to be specified a priori; hence we execute each baseline algorithm considering 4, 5 and 6 clusters separately (as we did for the proposed methodology). Unlike these algorithms, the Infomap graph clustering algorithm does not need the number of clusters as input.

Table VII compares the performance of the proposed methodology with that of the baseline approaches, according to the various metrics described earlier. The best performance according to each metric is highlighted in boldface.

TABLE VII. COMPARING THE PERFORMANCE OF THE PROPOSED METHODOLOGY WITH THAT OF SEVERAL BASELINE CLUSTERING METHODS

| Clustering Methods | Silhouette | Calinski-Harabasz | DB | I-Index | Xie-Beni | Dunn |
| Proposed (#Cluster=6, w = 0.1) | **0.989** | 35.692 | **0.050** | 1.245 | **0.045** | **10.102** |
| Proposed (#Cluster=5, w = 0.1) | 0.942 | **70.001** | 0.312 | 2.698 | 0.07114 | 2.125 |
| K-Means (#Cluster=4) | 0.633 | 4.809 | 0.665 | 4.414 | 2.997 | 1.602 |
| K-Means (#Cluster=5) | 0.498 | 3.988 | 0.688 | 3.675 | 52.415 | 0.981 |
| K-Means (#Cluster=6) | 0.512 | 3.810 | 0.788 | 3.403 | 5.36 | 1.102 |
| Density Based (#Cluster=4) | 0.612 | 4.228 | 0.789 | 4.014 | 5.198 | 2.148 |
| Density Based (#Cluster=5) | 0.799 | 5.875 | 0.787 | 3.989 | 1.212 | 1.933 |
| Density Based (#Cluster=6) | 0.401 | 3.405 | 0.89 | 3.271 | 9.136 | 0.998 |
| Hierarchical (#Cluster=4) | 0.801 | 4.569 | 0.516 | **4.697** | 0.454 | 1.495 |
| Hierarchical (#Cluster=5) | 0.812 | 5.017 | 0.517 | 4.590 | 0.345 | 1.312 |
| Hierarchical (#Cluster=6) | 0.874 | 4.998 | 0.478 | 3.789 | 0.305 | 1.301 |
| Infomap (#communities identified = 16) | 0.515 | 0.847 | 0.912 | 0.968 | 101.8 | 0.498 |

The proposed methodology performs better than all the baseline approaches according to all the metrics except the I-index (for which hierarchical clustering achieves the best performance). These results indicate the superior performance of the proposed genetic algorithm based approach for clustering tweets.

V. CONCLUSION

This work proposed a clustering methodology for tweets, which combines an evolutionary approach (Genetic Algorithm) with a classical approach (K-means). The experimental results show that the proposed approach performs better than several standard clustering approaches. As future work, we would like to improve the clustering algorithm so that a suitable number of clusters can be detected automatically. Also, we want to explore various applications of clustering, such as summarization of the tweet stream.

ACKNOWLEDGMENT

The authors would like to thank the human volunteers who helped in collecting the manual summaries. We also thank the anonymous reviewers whose suggestions helped to improve the paper.


REFERENCES

[1] C. C. Aggarwal and C. Zhai, A Survey of Text Clustering Algorithms. Boston, MA: Springer US, 2012, pp. 77–128.
[2] P. Bhattacharya, M. B. Zafar, N. Ganguly, S. Ghosh, and K. P. Gummadi, “Inferring user interests in the twitter social network,” on Recommender Systems, 2014, pp. 357–360.
[3] Y. Liu, Z. Li, H. Xiong, X. Gao, and J. Wu, “Understanding of internal clustering validation measures,” on Data Mining, 2010, pp. 911–916.
[4] M. Cheong and V. Lee, “A study on detecting patterns in twitter intra-topic user and message clustering,” on Pattern Recognition (ICPR), Aug 2010, pp. 3125–3128.
[5] D. Dueck, “Affinity propagation: Clustering data by passing messages,” 2009.
[6] S. Dutta, S. Ghatak, M. Roy, S. Ghosh, and A. Das, “A graph based clustering technique for tweet summarization,” on Reliability, Infocom Technologies and Optimization (ICRITO), Sept 2015, pp. 1–6.
[7] J. Yang and J. Leskovec, “Patterns of temporal variation in online media,” on Web Search and Data Mining, ser. WSDM ’11. ACM, 2011, pp. 177–186.
[8] A. Rangrej, S. Kulkarni, and A. V. Tendulkar, “Comparative study of clustering techniques for short text documents,” on World Wide Web Companion, ser. WWW ’11, 2011, pp. 111–112.
[9] R. H. Sheikh, M. M. Raghuwanshi, and A. N. Jaiswal, “Genetic algorithm based clustering: A survey,” on Emerging Trends in Engineering and Technology, 2008, pp. 314–319.
[10] U. Maulik and S. Bandyopadhyay, “Genetic algorithm-based clustering technique,” Pattern Recognition, vol. 33, no. 9, pp. 1455–1465, 2000.
[11] M. H. Hajeer, A. Singh, D. Dasgupta, and S. Sanyal, “Clustering online social network communities using genetic algorithms,” CoRR, vol. abs/1312.2237, 2013.
[12] E. E. Amr Adel and A. Badr, “Clustering tweets using cellular genetic algorithm,” Journal of Computer Science, vol. 10, 2014.
