Arab Academy for Science and Technology & Maritime Transport College of Computing & Information Technology
CELLULAR GENETIC ALGORITHM APPLICATION FOR CLUSTERING TWEETS IN TWITTER ANALYSIS A Thesis Submitted to the College of Computing & Information Technology in Partial Fulfillment of the Requirements for the award of degree of
MASTER of Science in Information Systems
Submitted By
Amr Adel AbdelRahim ElSayed Egypt
Supervised by
Prof. Dr. Amr Ahmed Badr
Professor of Computer Sciences, College of Computer Sciences and Information Systems, Cairo University
Dr. Essameldean Fawzy ElFakharany
PhD, College of Management and Technology, Arab Academy for Science and Technology and Maritime Transport
May 2014
DECLARATION
I certify that all the material in this thesis that is not my own work has been identified, and that no material is included for which a degree has been previously conferred on me. The contents of this thesis reflect my own personal views, and are not necessarily endorsed by the University.
Signature: ………………..
Date: …………………..
DEDICATED
To My Parents
With all love
ACKNOWLEDGMENTS
First and foremost, I would like to thank God, who gave me the strength and power to accomplish this thesis. I would like to take this opportunity to express my sincere gratitude and profound admiration to my thesis supervisors, Professor Dr. Amr Badr and Dr. Essameldean ElFakharany, for their continuous support, patience, motivation, and immense knowledge. I would also like to extend my sincere gratitude and appreciation to everyone who helped me to complete this thesis. Last but not least, I would like to thank my parents for their love and continuous support.
ABSTRACT
Social media has become an essential part of the daily online experience, enabling information sharing and communication between online users. The unprecedented volume of user-generated content produced by social media needs to be analyzed properly. Twitter has emerged in recent years as an extremely popular micro-blogging social media platform: the number of registered Twitter accounts reached 645,750,000 active accounts by 2013, generating an average of 58 million tweets per day (according to the statistical analysis website statisticbrain.com). As the popularity of Twitter continues to increase rapidly, it is necessary to analyze the huge amount of data that Twitter users generate. Twitter is an essential source of real-time information across a wide variety of interests, including sports events, advertising, political campaigns, mass emergencies, crisis events, and health care.

A popular method of tweet analysis is clustering. Because most posted Twitter messages are textual in nature, this study focuses on clustering tweets based on the similarity of their textual content. Moreover, since English is the most popular language on Twitter (34% of all tweets are in English, according to a report by the British-American news website and technology and social media blog "Mashable"), the study focuses on clustering Twitter messages written in English. A scraping-based technique was employed to gather data from Twitter, using Hootsuite as a social network aggregator. The Twitter messages were collected over three days, from the 26th to the 28th of June 2013, based on a set of keywords that describe diverse real-world topics, in order to cover wide areas of interest. Experimental studies were performed over three datasets of different sizes.

Genetic Algorithms belong to the class of evolutionary computational algorithms, which are population-based optimization techniques designed for finding globally optimal solutions from a pool of feasible solutions (individuals). Genetic Algorithms are probabilistic search methods whose mechanisms are analogous to the natural process of biological evolution and are used to discover solutions to problems.

A subclass of Genetic Algorithms, the Cellular Genetic Algorithm (cGA), was used in the study to cluster tweets. Based on the literature review, this study is one of the earliest attempts at tweet clustering using cGA, which can improve clustering performance in comparison with traditional clustering algorithms, or those clustering algorithms that require a priori knowledge of the number of clusters, such as K-means. The results obtained by cGA are compared with those obtained by a conventional Genetic Algorithm, the Generational Genetic Algorithm (genGA). The comparison is made according to four parameters: the average fitness value, the average execution time, the number of generated clusters, and the number of generations. The obtained results indicate a better overall performance of cGA in comparison to genGA.

Keywords: Clustering, Cellular Genetic Algorithm, cGA, Twitter, Tweet Similarity.
PUBLISHED WORK
Adel, A., E. ElFakharany and A. Badr, 2014. Clustering tweets using cellular genetic algorithm. J. Comput. Sci., 10: 1269-1280.
Table of Contents
DECLARATION
DEDICATED
ACKNOWLEDGMENTS
ABSTRACT
PUBLISHED WORK
Table of Contents
List of Tables
List of Figures
List of Abbreviations
CHAPTER ONE INTRODUCTION
  1.1 Problem and Motivation
  1.2 Research objectives and Contribution
    1.2.1 Research Objectives
    1.2.2 Thesis Contribution
  1.3 Research Methodology
  1.4 Thesis Organization
CHAPTER TWO BACKGROUND
  2.1 Social Media
  2.2 Twitter
  2.3 Clustering
    2.3.1 Hierarchical Clustering Algorithms
      2.3.1.1 Agglomerative (Bottom-up) Algorithms
      2.3.1.2 Divisive (top-down) Algorithms
    2.3.2 Partitional Clustering Algorithms
  2.4 Algorithm
    2.4.1 Genetic Algorithms
    2.4.2 Cellular Genetic Algorithms
CHAPTER THREE LITERATURE REVIEW
  3.1 Historical Background
    3.1.1 Social Media
    3.1.2 Twitter
    3.1.3 Document and Tweet Clustering
    3.1.4 Genetic Algorithms GAs
    3.1.5 Cellular Genetic Algorithms cGAs
  3.2 Related Work
    3.2.1 Social Media
    3.2.2 Twitter
    3.2.3 Document Clustering
    3.2.4 Tweet Clustering
    3.2.5 Cellular Genetic Algorithms cGAs
  3.3 Summary and Conclusion
CHAPTER FOUR DATA AND ALGORITHM
  4.1 Conceptual Framework
  4.2 Data Collection
  4.3 Data Description and Preparation
    4.3.1 Data Description
    4.3.2 Data Preprocessing
    4.3.3 TF-IDF representation
    4.3.4 Experimental setup
  4.4 Algorithm
    4.4.1 Cellular Genetic Algorithm
    4.4.2 Chromosome Representation
    4.4.3 Initial Population
    4.4.4 Fitness Function
    4.4.5 Parent Selection
    4.4.6 Recombination (Crossover)
    4.4.7 Mutation
    4.4.8 Replacement Policy
    4.4.9 Stopping Criterion
  4.5 Summary and Conclusion
CHAPTER FIVE EXPERIMENTAL RESULTS AND DISCUSSION
  5.1 Introduction
  5.2 Accuracy on test dataset
  5.3 Results for 1,000 tweets dataset
    5.3.1 Average Fitness
    5.3.2 Average Execution Time
    5.3.3 Number of Clusters
  5.4 Results for 5,000 tweets dataset
    5.4.1 Average Fitness
    5.4.2 Average Execution Time
    5.4.3 Number of Clusters
  5.5 Results for 30,000 tweets dataset
    5.5.1 Fitness
    5.5.2 Execution Time
    5.5.3 Number of Clusters
  5.6 Number of Generations
  5.7 cGA performance in all datasets
    5.7.1 Average Fitness
    5.7.2 Average Execution Time
  5.8 genGA performance in all datasets
    5.8.1 Average Fitness
    5.8.2 Average Execution Time
  5.9 Research Limitations
  5.10 Results Discussion
    5.10.1 Discussion of results obtained by both algorithms
      5.10.1.1 Average Fitness
      5.10.1.2 Execution Time
      5.10.1.3 Number of Clusters
      5.10.1.4 Number of Generations
    5.10.2 Discussion of results obtained by every algorithm
      5.10.2.1 Cellular Genetic Algorithm
  5.11 Conclusion
CHAPTER SIX CONCLUSION AND FUTURE WORK
  6.1 Conclusion
  6.2 Future Work
References
List of Tables
Table 3.1 Social Media Related Work Summary
Table 3.2 Twitter Related Work Summary
Table 3.3 Document Clustering Related Work Summary
Table 3.4 Tweet Clustering Related Work Summary
Table 3.5 Cellular Genetic Algorithm Related Work Summary
Table 4.1 Type of Data gathered using Hootsuite
Table 4.2 Pseudo-code of Cellular Genetic Algorithm
Table 4.3 Parameterization of the algorithm
Table 5.1 Tweet Distribution in the test set
Table 5.2 Fitness values by both algorithms in all datasets
Table 5.3 Execution times by both algorithms in all datasets
Table 5.4 Number of clusters by both algorithms in all datasets
Table 5.5 Number of generations of both algorithms in all datasets
Table 5.6 cGA performance in all datasets
Table 5.7 genGA performance in all datasets
List of Figures
Figure 1.1 Top 10 languages on Twitter, 2013
Figure 2.1 Attributes of Big Data
Figure 2.2 Characteristics of Social media data
Figure 2.3 Percentage of online adults using social networks by year
Figure 2.4 Percentage of internet users who use more than one social network
Figure 2.5 Example of a Twitter Home Page
Figure 2.6 Twitter input and output methods
Figure 2.7 Twitter Alexa traffic rank on the 20th of March 2014
Figure 2.8 Twitter users' statistics 2013
Figure 2.9 Twitter users demographic analysis 2013
Figure 2.10 Frequency of social media usage
Figure 2.11 Top 5 countries using Twitter 2013
Figure 2.12 Top 10 countries with the highest Twitter penetration
Figure 3.1 Timeline of the launch dates of major Social Network Sites
Figure 4.1 Conceptual framework
Figure 4.2 User interactions with social networks by social network aggregator
Figure 4.3 Hootsuite dashboard
Figure 4.4 Hootsuite Alexa traffic rank on the 20th of March 2014
Figure 4.5 Sample tweets from the dataset
Figure 4.6 Simple Genetic Algorithm
Figure 4.7 Topology of Cellular Genetic Algorithm
Figure 4.8 Chromosome representation
Figure 4.9 L5 Neighborhood
Figure 4.10 Reproductive cycle mechanism in cGA
Figure 5.1 Accuracy achieved by both algorithms
Figure 5.2 Average fitness of genGA and cGA (1,000 tweets)
Figure 5.3 Average execution time of genGA and cGA (1,000 tweets)
Figure 5.4 Number of clusters generated by genGA and cGA (1,000 tweets)
Figure 5.5 Average fitness of genGA and cGA (5,000 tweets)
Figure 5.6 Average execution time of genGA and cGA (5,000 tweets)
Figure 5.7 Number of clusters generated by genGA and cGA (5,000 tweets)
Figure 5.8 Fitness value of genGA and cGA (30,000 tweets)
Figure 5.9 Execution time of genGA and cGA (30,000 tweets)
Figure 5.10 Number of clusters generated by genGA and cGA (30,000 tweets)
Figure 5.11 Number of generations produced by each algorithm
Figure 5.12 cGA fitness for all sets
Figure 5.13 cGA execution time for all sets
Figure 5.14 genGA fitness for all sets
Figure 5.15 genGA execution time for all sets
List of Abbreviations
API: Application Programming Interface
cGA: Cellular Genetic Algorithm
CGM: Consumer-Generated Media
cMOGA: Cellular Multi-Objective Genetic Algorithm
CTC: Core-Topic-based Clustering
DDE: Discrete Differential Evolution
DF: Document Frequency
DJIA: Dow Jones Industrial Average
DOI: Digital Object Identifier
DPX: Distance Preserving Crossover
e-WOM: Electronic Word Of Mouth
EACS: Energy-Aware Communications Scheduler
EAs: Evolutionary Algorithms
ESA: Explicit Semantic Analysis
GA: Genetic Algorithm
genGA: Generational Genetic Algorithm
IDF: Inverse Document Frequency
ILI: Influenza-Like Illnesses
MANETs: Metropolitan Mobile Ad Hoc Networks
MOPs: Multi-Objective Continuous Optimization Problems
Ms: Milliseconds
NLP: Natural Language Processing
Pc: Crossover probability
Pm: Mutation probability
RT: Retweet
SMEs: Small and Medium-sized Enterprises
SMS: Short Message Service
SNEFT: Social Network Enabled Flu Trends
TF: Term Frequency
UGC: User Generated Content
VRP: Vehicle Routing Problem
CHAPTER ONE INTRODUCTION
Social media platforms such as Facebook, Twitter, YouTube, Hi5, and Orkut have become an essential part of the daily online experience. They enable information sharing and communication between online users. The data produced by social media is huge in volume, mainly user-generated, varied, and spreading at an unprecedented rate. Twitter is one of the most important and popular social media platforms. It enables its users to share ideas and coordinate activities through short status messages called tweets, which cannot exceed 140 characters in length. This limit forces Twitter users to be as concise as possible.

This chapter presents an overview of the thesis and is organized as follows. Section 1.1 describes the scope of the problem and the motivation for tweet clustering. Section 1.2 states the research objectives and the main contribution of the thesis. Section 1.3 briefly discusses the methodology followed by the researcher. Finally, Section 1.4 describes how the rest of the thesis is organized.
1.1 Problem and Motivation
As the popularity of Twitter continues to increase rapidly, it is necessary to properly analyze the huge amount of data produced by Twitter so that it can be used efficiently. Twitter is an essential source of real-time information across a wide variety of interests, including sports events, advertising, political campaigns (e.g. presidential elections), mass emergencies, crisis events, and even health care [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. One of the most popular methods of tweet analysis is clustering. Twitter provides a massive quantity of short text in the form of tweets, where each tweet represents a single document [11]. Tweets can be an extremely useful data source for researchers on a wide variety of topics. Textual content in Twitter primarily consists of tweet text, URLs, and hashtags within tweets [12], and tweets are considered to be "short texts".
The problem of clustering tweets based on their similarity is motivated by an essential observation: tweet content similarity can be used as one of the similarity measures between users, since it helps to determine whether users have similar interests, and shared interests are a good indication of similarity between users [13]. Since the majority of user-generated messages on micro-blogging websites are textual [14], the main focus of this thesis is the clustering of tweets based on their textual content similarity. English is the original language of Twitter, and tweets written in English are the most common, followed by Japanese and Spanish [15, 16], as displayed in Figure 1.1. Therefore, the focus is on tweets written in English.
Figure 1.1 Top 10 languages on Twitter, 2013 [17]

Clustering tweets is a complex problem. The very short length of tweets (at most 140 characters) is one difficulty. According to Karandikar, "Such a short piece of text provides very few contextual clues for applying machine learning techniques" [18]. Most clustering methods perform weakly on this type of data because tweets are written with few constraints. The writing style of tweets is informal and can be full of jargon, colloquialisms, spelling mistakes, domain-specific content, acronyms, and out-of-vocabulary words, with poor grammatical structure. The words are sparse, which makes tweets difficult to handle. In general, the data in Twitter does not have a well-defined structure [19, 16, 11].
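The notion of textual similarity between tweets, and the sparsity problem described above, can be illustrated with a minimal Python sketch. It assumes a TF-IDF weighting (the thesis lists a TF-IDF representation among its data preparation steps) combined with cosine similarity; the whitespace tokenizer, the exact weighting formula, the function names, and the sample tweets are illustrative assumptions and not the thesis's implementation.

```python
import math
from collections import Counter


def tokenize(tweet):
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    return tweet.lower().split()


def tfidf_vectors(tweets):
    # Each tweet is treated as one document, as described in the text.
    docs = [Counter(tokenize(t)) for t in tweets]
    n = len(docs)
    # Document frequency: number of tweets that contain each term.
    df = Counter(term for doc in docs for term in doc)
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append({
            term: (count / total) * math.log(n / df[term])
            for term, count in doc.items()
        })
    return vectors


def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical sample tweets, invented for illustration only.
tweets = [
    "election results tonight",
    "watching election results",
    "great football match today",
]
vecs = tfidf_vectors(tweets)
sim_related = cosine(vecs[0], vecs[1])    # tweets sharing two terms
sim_unrelated = cosine(vecs[0], vecs[2])  # tweets sharing no terms
```

Because tweets are so short, two tweets on the same topic often share only one or two terms, so pairwise similarities tend to be low or exactly zero; this is the sparsity that makes tweet clustering harder than clustering longer documents.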
1.2 Research objectives and Contribution
1.2.1 Research Objectives
This thesis aims to:
1. Explore the clustering of tweets based on their textual similarity
2. Apply Cellular Genetic Algorithms (cGAs) to cluster tweets
3. Apply conventional Generational Genetic Algorithms (genGAs) to cluster tweets
4. Compare the results obtained by cGAs with those obtained by genGAs
1.2.2 Thesis Contribution
In addition to achieving the research objectives, the main contribution of this thesis can be summarized as follows: this study is one of the first attempts to apply cGA to the clustering of tweets, which can improve clustering performance in comparison with traditional clustering algorithms, or those clustering algorithms that require a priori knowledge of the number of clusters, such as K-means. This study therefore contributes a new approach to tweet clustering.
1.3 Research Methodology
The researcher employed a scraping-based technique to gather data from Twitter, using Hootsuite as a social network aggregator. The Twitter messages were collected based on a set of keywords that describe specific real-world topics. Tweets were collected over three days, from the 26th to the 28th of June 2013. The set of pre-defined keywords comprises eight categories that are intended to be diverse in order to cover different and wide areas of interest. After pre-processing the data, both cellular and generational genetic algorithms were applied to three datasets composed of 1,000, 5,000, and 30,000 tweets respectively. The obtained results are compared according to the fitness value, execution time, number of clusters, and number of generations.
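The structural difference between the two algorithms being compared can be sketched briefly. In a generational GA, any two individuals in the population may be selected to mate, whereas in a cellular GA each individual occupies a cell of a grid and interacts only with a small neighbourhood around it. The Python sketch below assumes a toroidal two-dimensional grid and the L5 neighbourhood named in the List of Figures (the cell plus its north, south, west, and east neighbours). The function names, the 3x3 grid, the fitness values, and the rule of choosing the fittest neighbour as a mate are hypothetical and shown only for illustration; the parent selection actually used is described in Chapter 4.

```python
def l5_neighborhood(row, col, rows, cols):
    # The cell itself plus its north, south, west and east neighbours.
    # Modulo arithmetic wraps around the edges (toroidal grid, assumption).
    return [
        (row, col),
        ((row - 1) % rows, col),
        ((row + 1) % rows, col),
        (row, (col - 1) % cols),
        (row, (col + 1) % cols),
    ]


def best_neighbor(grid_fitness, row, col):
    # Hypothetical selection rule: pick the fittest of the four neighbours
    # (excluding the cell itself) as the mate for the individual at (row, col).
    rows, cols = len(grid_fitness), len(grid_fitness[0])
    candidates = [
        cell
        for cell in l5_neighborhood(row, col, rows, cols)
        if cell != (row, col)
    ]
    return max(candidates, key=lambda rc: grid_fitness[rc[0]][rc[1]])


# Invented fitness values for a 3x3 population grid (higher is better here).
grid_fitness = [
    [0.1, 0.9, 0.2],
    [0.4, 0.5, 0.3],
    [0.8, 0.6, 0.7],
]
corner_cells = l5_neighborhood(0, 0, 3, 3)
corner_mate = best_neighbor(grid_fitness, 0, 0)
```

Restricting mating to a neighbourhood means good solutions spread gradually across the grid rather than taking over the whole population at once, which is the usual argument for a cGA preserving diversity longer than a panmictic GA.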
1.4 Thesis Organization
The remainder of the thesis is structured as follows. Chapter 2 provides background on social media, Twitter, clustering and its categories, as well as the Genetic Algorithm (GA) and the Cellular Genetic Algorithm (cGA). Chapter 3 provides a historical background and reviews previous research related to Twitter, document and tweet clustering, and the applications of cGAs. Chapter 4 presents the methodology and conceptual framework of this thesis. Chapter 5 presents the experiments and their results, together with an interpretation of the results. Chapter 6 presents the final conclusion and summary of this thesis, in addition to the future work that can build on the obtained results in light of the limitations of the work done in this thesis.
CHAPTER TWO BACKGROUND
This chapter provides background on the main issues discussed in this thesis. The chapter is organized as follows. First, Section 2.1 outlines the concept of social networks, followed by Section 2.2, which provides background on Twitter. Then, Section 2.3 describes the concepts of clustering, document clustering, and tweet clustering. Finally, Section 2.4 briefly discusses the Genetic Algorithm (GA) and one of its subclasses, the Cellular Genetic Algorithm (cGA).
2.1 Social Media
Ellison defines social network sites as "Web-based services that allow individuals to (1) construct a public or semi-public profile within a bounded system, (2) articulate a list of other users with whom they share a connection, and (3) view and traverse their list of connections and those made by others within the system. The nature and nomenclature of these connections may vary from site to site" [20]. A more detailed definition is provided by Java, who describes social media as "An umbrella term that defines the various activities that integrate technology, social interaction, and the construction of words, pictures, videos and audio. This interaction, and the manner in which information is presented, depends on the varied perspectives and “building” of shared meaning, as people share their stories, and understandings" [21].

Social network sites have implemented a wide variety of technical features; their backbone is mainly composed of visible profiles that show a list of friends who are also users of the system. Each user has a unique profile. Profile pages act as starting points from which users can explore these social networks. They can look for other people, or find individuals who have common interests. Some of the most popular terms in the world of social media include Friends, Contacts, Fans, Links, and Groups.

Social media is an extremely powerful example of Web 2.0 tools. It has radically changed the way people find information, share knowledge, and interact with other people either within or outside the social networks. The shared information includes different media formats ranging from ordinary text to photos, music, videos, and PDF documents. Social media sites are centered on user-generated content and information sharing. User Generated Content (UGC), also known as consumer-generated media (CGM), as opposed to professionally edited text, was defined by the Interactive Advertising Bureau in April 2008 as "Any material created and uploaded to the Internet by non-media professionals". This represents a dramatic shift on the web from one-way communication, where users can only consume the information provided to them, to a conversation-style interaction in which users participate heavily: they can share information, obtain and add to information posted by other users (dynamic content), and spread information over their social network.

Social media data is considered to be one type of "Big Data", with its three defining attributes: Volume, Velocity, and Variety. "Volume" refers to the data size, "Velocity" refers to the frequency of data production, and "Variety" refers to the data sources [22, 23, 24]. The attributes of big data and the characteristics of social media data are displayed in Figure 2.1 and Figure 2.2 respectively.
Figure 2.1 Attributes of Big Data [24]
Figure 2.2 Characteristics of Social media data [24]
The huge amount of “user generated content” (with embedded metadata in the form of links, images, and videos) produced on a daily basis in social media is a significant source of contextual information which can be used to gain abundant inferences [21, 23, 1, 25]. This enormous amount of user-generated content created on social media results in vast, noisy, distributed, unstructured, and dynamic social media data [26]. The popularity of social networks has increased tremendously in recent years as a result of the wide spread of internet-enabled devices such as personal computers, mobile devices and internet tablets. Online social networks have become an essential part of the global online daily experience [27, 28]. According to the Pew Research Center report of social media updates in 2013, about 73% of online adults now use a social networking site, as displayed in Figure 2.3.
Figure 2.3 Percentage of online adults using social networks by year [29]
About 42% of online adults now use more than one social networking site while 36% use only one (the remaining 22% did not use any of the five specific sites: Facebook, LinkedIn, Pinterest, Twitter, and Instagram). This is demonstrated in Figure 2.4.
Figure 2.4 Percentage of internet users who use more than one social network [29]
2.2 Twitter
Twitter is one of the most important social media platforms. It can be used to share ideas and coordinate activities, similar to instant messaging [15]. An example of a Twitter profile home page is displayed in Figure 2.5.
Figure 2.5 Example of a Twitter Home Page
Twitter users generate status messages called “tweets”. Tweets are status updates and musings, often of a personal nature, that cannot exceed 140 characters, including text, emoticons, links, or any combination of these. Tweets are often associated with particular events, specific topics of interest, or individual judgments, reactions, and points of view. Tweets are broadcast to a global audience [14, 18, 5, 25]. Tweets can be posted from various sources including the Twitter website, Twitter mobile applications, and several third-party applications and websites. Figure 2.6 displays the different input and output methods in Twitter.
Figure 2.6 Twitter input and output methods [30]
Twitter users also have control over the privacy features. They can choose to make their tweets public (visible to anyone) or private (visible only to users who get permission from the user). If a user’s profile is left public, his/her updates appear in a “public timeline” of recent updates [31]. Commonly, users read Twitter messages by viewing a main page showing a stream of the latest messages from people they follow [32]. In this work, the researcher considers only messages posted publicly on Twitter. Twitter allows users to reply to tweets of other users by clicking on the reply button on their tweet [13]. Every user is identified by a user name of up to 15 alphanumeric characters and underscores, preceded by the “@” symbol [33, 34]. Every tweet contains other information in addition to the message, including its timestamp, the Twitter user who posted it, whether it was part of a conversation or a retweet, and the number of people who retweeted it [34]. The limited length of tweets means that tweets do not necessarily include well-developed thoughts; instead they are short and concise, yet complete enough for users to understand the ideas they deliver. It forces users to express their opinion in a few sentences [16]. Social interaction between Twitter users takes place principally in three ways:
1. The "follow" relationship, where users can follow other users by subscribing to their tweets. The follower gets all the status updates of the user that he/she follows. Followers are displayed in chronological order; the most recently selected follower is displayed first. Unlike other social networking sites, the relationship of following and being followed does not require reciprocity. Twitter supports one-way connection rather than two-way connection. In other words, a user can follow any other user without approval from the followed user, and the user being followed does not need to follow back.
2. "Mention", another form of connection that can be defined between two users. A mention is the act of referring to other user(s) in a tweet by addressing them directly.
3. "Retweet" or RT, in which individuals can rebroadcast content generated by other users, thus raising its visibility. This is similar to forwarding an email message to other users, in this case the followers. Retweeting has an important role in the propagation of information on Twitter [33].
The "hashtag" is a distinctive concept on Twitter (note that hashtags are also supported by other social media websites such as Facebook, Google+, Instagram, YouTube, LinkedIn, etc.) that enables users to identify significant keywords in their tweets by adding the prefix ‘#’ before a keyword (without a space). Hashtags are used on Twitter to set trending topics, indicate the intended audience of a tweet, begin chat rooms, and categorize tweets by topic or type. Hashtags allow users to emphasize what they regard as the important keyword(s) in their tweet. A hashtag preceding a topic enables Twitter users to search for that topic and retrieve a list of recent tweets about it.
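The conventions described above (mentions, retweets, and hashtags) can be illustrated with a minimal sketch. The regular expressions below are assumptions derived only from the description in this section (user names of up to 15 alphanumeric characters or underscores after "@", and a keyword directly after "#"); the helper name `parse_tweet` and the "RT @" retweet convention are hypothetical and are not taken from Twitter's own tooling.

```python
import re

# Assumed patterns, based only on the description in this section:
# a user name is up to 15 alphanumeric characters or underscores after "@",
# and a hashtag is a "#" followed directly by a keyword (no space).
MENTION_RE = re.compile(r"@(\w{1,15})", re.ASCII)
HASHTAG_RE = re.compile(r"#(\w+)")


def parse_tweet(text):
    """Return the mentions, hashtags and a naive retweet flag for one tweet."""
    return {
        "mentions": MENTION_RE.findall(text),
        # Hashtags are lowercased so that "#Twitter" and "#twitter" match.
        "hashtags": [tag.lower() for tag in HASHTAG_RE.findall(text)],
        # Hypothetical convention: a tweet starting with "RT @" is a retweet.
        "is_retweet": text.startswith("RT @"),
    }


example = parse_tweet("RT @alice_01: Clustering tweets with a #cGA is fun #Twitter")
print(example)
```

Features extracted this way (shared hashtags or mentions) are one possible input to a tweet similarity measure; whether and how they are used is a design choice outside this sketch.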
In addition, Twitter offers a search portal (https://twitter.com/search-home) so that users can constantly monitor or search for tweets by means of keywords, hashtags, or user names, but this service is restricted to only 40 search keywords. Twitter also provides API (Application Programming Interface) functions to acquire user-specific information. Such information can be used to construct a network of friends [35]. Moreover, Twitter provides clickable “trending topic” terms that initiate searches for widespread keywords. Finally, Twitter has a location function. If users
are posting tweets from a mobile device, they have the ability to turn on their location, and their latitude and longitude will be captured with the tweet. Twitter location information available from mobile Twitter applications can save where the user was when he/she posted the tweet. The user generally has the option to turn location services on or off [33]. The popularity of Twitter continues to grow in a rapid manner. The song of Ben Walker reveals how popular Twitter is.
"You're no one if you're not on Twitter
And if you aren't there already you've missed it
If you haven't been bookmarked, retweeted and blogged
You might as well not have existed"
Figure 2.7 demonstrates the traffic rank of Twitter according to Alexa.
Figure 2.7 Twitter Alexa traffic rank on the 20th of March 2014 [36]
As displayed in Figure 2.8, the number of active Twitter users reached about 645,750,000, with 135,000 new users per day. Twitter has grown rapidly from handling 5,000 tweets per day in 2007 to 50 million tweets per day in 2010, and now handles an average of 58 million tweets per day.
Figure 2.8 Twitter users' statistics 2013 [37]
According to the Pew Research Center report of social media updates in 2013, about 18% of online adults currently use Twitter. The majority of this percentage is among young adults, as displayed in Figure 2.9.
Figure 2.9 Twitter users demographic analysis 2013 [29]
According to the same report, about 46% of Twitter users use it daily, with 29% checking in several times per day. However, 32% of Twitter users say that they check in less than once per week. This is displayed in Figure 2.10.
Figure 2.10 Frequency of social media usage [29]
Considering the geographical distribution of users, 24% of Twitter's active users are in the United States, followed by Japan and Indonesia with 9.3% and 6.5% of Twitter's active users respectively. The highest Twitter penetration level, however, exists in Saudi Arabia, at 33%, followed by Indonesia, Spain, Venezuela, Argentina, UK, Netherlands, United States, Japan, and Colombia (Figure 2.11 and Figure 2.12).
Figure 2.11 2013 Top 5 countries using Twitter [38]
Figure 2.12 2013 Top 10 countries with the highest Twitter penetration [39]
2.3 Clustering
In its traditional definition, clustering or cluster analysis is an unsupervised data mining technique that partitions data into collections or subsets of similar objects called clusters. It is used to describe (rather than predict) data by organizing it into groups that are meaningful or useful. Clustering is a common technique for statistical data analysis. Document clustering can be defined as "Automatic grouping of text documents into clusters so that documents within a cluster have high similarity in comparison to one another, but are dissimilar to documents in other clusters". "Unsupervised" means that no human expert has allocated documents to classes. Through document clustering, researchers can discover a wealth of valuable and hidden knowledge, such as the hotspots of a discipline and the cross-relationships between disciplines. This sort of knowledge can not only help scholars to master the knowledge of a particular discipline, but also supports their decision making [40, 41, 42, 43]. Similarly, tweets can be grouped into clusters such that tweets in one cluster tend to be similar to each other, but dissimilar to those in other clusters, i.e. minimum inter-cluster similarity and maximum intra-cluster similarity [33, 44].
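The clustering objective stated above (high similarity within a cluster, low similarity between clusters) can be made concrete with a small sketch. The Jaccard word-overlap measure, the function names, and the sample tweets are illustrative assumptions only; they are not the similarity measure or data used in this thesis.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two tweets, treated as sets of lowercase words."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def mean_intra_similarity(cluster):
    """Average similarity over all pairs of tweets inside one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs) if pairs else 1.0


def mean_inter_similarity(cluster_a, cluster_b):
    """Average similarity over all pairs drawn from two different clusters."""
    pairs = [(a, b) for a in cluster_a for b in cluster_b]
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy clusters: one about sports, one about weather.
sports = ["great football match tonight", "football match was great"]
weather = ["heavy rain expected tomorrow", "rain and wind tomorrow"]

intra = mean_intra_similarity(sports)
inter = mean_inter_similarity(sports, weather)
print(intra, inter)
```

A good clustering in the sense described above keeps the first quantity high and the second low.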
Two broad families of clustering algorithms exist: Hierarchical and Partitional.
2.3.1 Hierarchical Clustering Algorithms
These algorithms are subdivided into Agglomerative and Divisive clustering algorithms.
2.3.1.1 Agglomerative (Bottom-up) Algorithms
These algorithms start with each individual document considered as a separate cluster, each of size one. At each level, the smaller clusters are merged to form a bigger cluster. The process continues this way and terminates when all clusters are merged into one cluster containing all the documents.
2.3.1.2 Divisive (Top-down) Algorithms
These algorithms work in an opposite manner to Agglomerative algorithms. They start with the whole set of documents. The set is then broken down to create consecutive smaller clusters. The process is repeated recursively until individual documents are reached. Agglomerative algorithms are more frequent in information retrieval than the divisive algorithms.
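The bottom-up procedure can be sketched as follows. This is a minimal illustration under stated assumptions: the "documents" are plain numbers, the distance between clusters is single linkage (closest pair of members), and the `target_clusters` parameter is a hypothetical stopping rule added so the merge process can be halted before everything collapses into one cluster.

```python
def agglomerative(points, target_clusters=1):
    """Bottom-up clustering of 1-D points with single-linkage distance."""
    clusters = [[p] for p in points]          # every point starts as its own cluster
    merges = []                               # history of merged clusters, in order
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        merges.append(sorted(merged))
        # Replace the two merged clusters by their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merges


clusters, merges = agglomerative([1.0, 1.5, 8.0, 8.25], target_clusters=2)
print(clusters)
print(merges)
```

Running with `target_clusters=1` reproduces the full merge history down to a single cluster; a divisive algorithm would traverse the same kind of hierarchy in the opposite direction.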
2.3.2 Partitional Clustering Algorithms
These algorithms depend upon defining the number of clusters in advance, and they typically determine all clusters at once. The k-means clustering algorithm is an example of this category [126, 1].
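As a rough illustration of the partitional approach, the following sketch implements a bare-bones k-means on 2-D points. It is a simplification under stated assumptions: the first k points seed the centroids (real implementations usually choose seeds randomly), the number of iterations is fixed, and the sample data are invented for the example.

```python
def kmeans(points, k, iterations=10):
    """Minimal k-means on 2-D points; the first k points seed the centroids."""
    centroids = [points[i] for i in range(k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for idx, (x, y) in enumerate(points):
            assignment[idx] = min(
                range(k),
                key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2,
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return assignment, centroids


data = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (1.0, 0.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans(data, k=2)
print(labels)
print(centers)
```

Note that k is an input: the algorithm cannot discover the number of clusters on its own, which is the sense in which clusters are "defined in advance".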
2.4 Algorithm
This section provides just a brief introduction to Genetic Algorithms and one of their subclasses: the Cellular Genetic Algorithm (cGA). A more detailed discussion is provided in Chapter 4.
2.4.1 Genetic Algorithms
Genetic Algorithms (GAs) belong to the class of evolutionary computation algorithms. Evolutionary algorithms (EAs) are population-based optimization techniques designed for finding globally optimal solutions from a pool of feasible solutions (individuals). They are loosely based on biological processes such as natural selection and the genetic inheritance of good traits. Genetic Algorithms are probabilistic search methods whose mechanisms are analogous to the natural process of biological evolution and are used to discover solutions to problems. A population of individuals that encode tentative solutions to a specific problem is maintained. Some individuals are better than others. Those better individuals have a higher probability to survive, learn, and spread their genetic material (survival of the fittest). New individuals are generated by combining members of the population; these new individuals replace existing ones according to a defined policy. Implementation of a genetic algorithm starts with a population of (often randomly generated) individuals. These individuals are then evaluated according to an objective function. The individuals that represent a better solution to the problem are given more chances to reproduce than those individuals that represent worse solutions. The quality of a solution is measured against the current population. GAs have proven to be effective at solving large search and optimization problems due to their self-adaptation and self-organization capabilities [45, 40, 46, 47, 48, 49, 43].
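The generic loop described above (random initial population, evaluation, selection of better individuals, recombination, replacement) might look like the sketch below. Everything specific in it is an assumption chosen for brevity: the toy "OneMax" objective, binary tournament selection, one-point crossover, bit-flip mutation, elitism, and all parameter values. It is not the algorithm configuration used in this thesis.

```python
import random


def fitness(individual):
    """Toy objective (OneMax): count the 1-bits; a stand-in for a real objective."""
    return sum(individual)


def tournament(population, rng, size=2):
    """Pick the fitter of `size` randomly chosen individuals."""
    return max(rng.sample(population, size), key=fitness)


def run_ga(pop_size=20, length=16, generations=40, mutation_rate=0.05, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    initial_best = max(map(fitness, population))
    for _ in range(generations):
        elite = max(population, key=fitness)          # elitism: keep the best as-is
        offspring = [elite[:]]
        while len(offspring) < pop_size:
            p1, p2 = tournament(population, rng), tournament(population, rng)
            cut = rng.randint(1, length - 1)          # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation with a small per-gene probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            offspring.append(child)
        population = offspring                        # generational replacement
    return initial_best, max(population, key=fitness)


initial_best, best = run_ga()
print(initial_best, fitness(best))
```

Because the best individual is copied unchanged into each new generation, the best fitness in this sketch can never decrease; other replacement policies trade that guarantee for more diversity.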
2.4.2 Cellular Genetic Algorithms
Cellular Genetic Algorithms (cGAs, also called fine-grained genetic algorithms) are a subclass of Genetic Algorithms in which the population is structured in a specified decentralized topology, so that individuals may only interact with their neighbors. The individuals of the population are usually arranged in a two-dimensional toroidal grid. Each cell of the grid contains one individual (solution). At each generation of the algorithm, individuals are modified by three stochastic genetic operators: selection, crossover and mutation. These operators are only performed within the neighborhood of each individual. Alba and Dorronsoro stated that "Such a kind of structured algorithms is specially well suited for complex problems". Cellular Genetic Algorithms have advantages over other Genetic Algorithms, such as a high diversity level that can be preserved for much longer than in centralized algorithms, due to the existence of small overlapping neighborhoods. Such a structure ensures that solutions diffuse slowly and smoothly through the population. This enhances exploration (diversification), while exploitation (intensification) is maintained within each neighborhood by the genetic operations [48, 50, 51, 52, 53, 54, 55].
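The difference from a panmictic GA is that selection and recombination only see a cell's neighbors. The sketch below illustrates this on a toroidal grid. The von Neumann (north, south, west, east) neighborhood, the synchronous update, the replace-if-not-worse policy, the toy OneMax objective, and all parameter values are assumptions made for the example; the configuration actually used in this thesis is described in Chapter 4.

```python
import random


def fitness(individual):
    """Toy objective (OneMax), used only to keep the sketch self-contained."""
    return sum(individual)


def neighbors(row, col, rows, cols):
    """Von Neumann neighborhood (N, S, W, E) on a toroidal grid: edges wrap around."""
    return [((row - 1) % rows, col), ((row + 1) % rows, col),
            (row, (col - 1) % cols), (row, (col + 1) % cols)]


def run_cga(rows=5, cols=5, length=16, generations=30, mutation_rate=0.05, seed=0):
    rng = random.Random(seed)
    grid = [[[rng.randint(0, 1) for _ in range(length)] for _ in range(cols)]
            for _ in range(rows)]
    for _ in range(generations):
        new_grid = [[None] * cols for _ in range(rows)]
        for r in range(rows):
            for c in range(cols):
                current = grid[r][c]
                # Selection is restricted to the cell's own neighborhood.
                local = [grid[nr][nc] for nr, nc in neighbors(r, c, rows, cols)]
                mate = max(rng.sample(local, 2), key=fitness)
                cut = rng.randint(1, length - 1)       # one-point crossover
                child = current[:cut] + mate[cut:]
                child = [1 - g if rng.random() < mutation_rate else g for g in child]
                # Replace-if-not-worse policy: the cell never loses fitness.
                new_grid[r][c] = child if fitness(child) >= fitness(current) else current
        grid = new_grid                                # synchronous update
    return grid


grid = run_cga()
best = max((ind for row in grid for ind in row), key=fitness)
print(fitness(best))
```

Since neighborhoods overlap, a good solution found in one cell can only spread one neighborhood per generation, which is the slow diffusion effect described above.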
CHAPTER THREE LITERATURE REVIEW
This chapter presents a historical background and examines the related work to the main contributions of the thesis. The basic outline of this chapter is as follows. Section 3.1 presents a historical background about Social media, Twitter, Document and Tweet clustering, Genetic Algorithms GAs and Cellular Genetic Algorithms cGAs. Section 3.2 reviews the related work of these areas. Finally, Section 3.3 draws the summary and conclusion of the chapter.
3.1 Historical Background
This section provides a historical background covering the following points:
1. Social Media in general
2. Twitter
3. Document and Tweet clustering
4. Genetic Algorithms GAs
5. Cellular Genetic Algorithms cGAs
3.1.1 Social Media
The first identifiable social network site was SixDegrees.com that was launched in 1997 and began in 1998. It allowed users to create profiles and list their friends. Unfortunately, it failed to become a viable business and closed in 2000. In 1999, three social network sites, AsianAvenue, BlackPlanet, and MiGente, were initiated. In 2000, the Swedish web community LunarStorm transformed itself to become a social network site. In 2001, the Korean virtual world site Cyworld added social network service features. In the same year, Ryze.com was launched. In 2002, Fotolog, Friendster, and Skyblog were launched. Starting from 2003, a lot of new social network sites were launched. These sites include Couchsurfing, MySpace, LinkedIn, Last.FM, Tribe.net, and Hi5. The phenomenon continued to grow in the year 2004, including the introduction of Flickr, Dodgeball, Orkut, Catster, aSmallWorld, and Hyves. YouTube, Facebook, Yahoo 360, Bebo, Ning, and Renren, the Chinese twin of Facebook, were launched in 2005. The year 2006 witnessed the launch of QQ, Windows Live Spaces, and Twitter. Figure 3.1 displays the chronological order of the launch time of many social network sites [20].
Figure 3.1 Timeline of the launch dates of major Social Network Sites
3.1.2 Twitter
In contrast to the current enormous growth in Twitter's popularity (to the degree of playing a critical role in the social revolutions in Arab countries such as Tunisia, Egypt, and Yemen), Twitter began as a backup project for an unsuccessful project. The unsuccessful project was a podcasting platform that was soon surpassed by Apple iTunes. Therefore, the founders turned towards developing an SMS (Short Message Service) service for a limited number of users, who could use it to express their current status, feelings, or deeds. In March 2006, the first prototype of Twitter, called "twttr", was initiated, and Twitter was formally launched in July of the same year. The initial founders of Twitter are Jack Dorsey (1976-…), Evan Williams (1972-…), Biz Stone (1974-…) and Noah Glass. Twitter is considered to be a microblogging website (microblogging can be defined as "A form of blogging that allows users to send brief text updates or micro media such as photographs or audio clips in order to describe their current status") [33, 56, 16, 9].
3.1.3 Document and Tweet Clustering
Along with the evolution of the World Wide Web in recent years, major search engines such as Google and Yahoo utilize clustering techniques to automatically categorize the retrieved web documents hierarchically into coherent clusters of meaningful categories, where the documents in a single cluster are about the same topic. With the huge growth in the popularity and usage of social network sites such as Twitter and the continuing increase in the amount of textual documents generated by users, document clustering techniques are also employed to categorize Twitter messages (tweets) into related topics [57, 58, 59, 60].
3.1.4 Genetic Algorithms GAs
John Henry Holland (1929-….), the American scientist, Professor of psychology, and Professor of electrical engineering and computer science at the University of Michigan, Ann Arbor, was the first person to use the term "Genetic Algorithm". Also, Hans Joachim Bremermann (1926–1996) in Berkeley, California, and Alex Fraser in Sydney, Australia, were among the pioneers who studied Genetic Algorithms. Holland invented Genetic Algorithms in the 1960s. In 1975, his book "Adaptation in Natural and Artificial Systems" introduced Genetic Algorithms as an abstraction of biological evolution. This book is considered to be the theoretical framework on which all subsequent work concerning Genetic Algorithms was based. In the same year, Kenneth DeJong finished his doctoral dissertation, which included for the first time an exhaustive treatment of the abilities of Genetic Algorithms in the field of optimization [46, 61, 62, 63, 64, 65, 66].
3.1.5 Cellular Genetic Algorithms cGAs
The first model of Cellular Genetic Algorithms (cGAs) was put forward by Robertson in 1987 [67]. In the subsequent year, Mühlenbein et al. proposed what is considered to be the first hybrid Cellular Genetic Algorithm, which was designed for solving the Travelling Salesman Problem (TSP) [68]. After these two initial efforts, other cGAs appeared within a few years. One of those is the algorithm
called "Pollination Plants" which David Goldberg introduced in the year 1989 [69]. In the same year, the "Fine Grained" genetic algorithm was introduced by Manderick and Spiessens [70]. In the year 1991, the same researchers presented a "Massively Parallel" genetic algorithm [71]. Also in 1991, Frank Hoffmeister introduced the "Parallel Individual" algorithm [72]. In 1997, Back, Fogel and Michalewicz presented the "Diffusion Model" [73]. However, the term "cellular Genetic Algorithm" was first used only in the year 1993, when Whitley presented it in a work that applies a cellular automaton model to a genetic algorithm [74]. A comprehensive treatment of Cellular Genetic Algorithms was given by Enrique Alba and Bernabé Dorronsoro in their important book "Cellular Genetic Algorithms", published in the year 2008, which explored how to extend the use of cGAs to various domains [46]. This book was an important asset for the researcher in this dissertation.
3.2 Related Work
This section is divided into five subsections. Subsection 3.2.1 reviews the previous related research in the field of analyzing the social content of social media. Subsection 3.2.2 reviews the research concerning Twitter and its usages in various domains. Subsection 3.2.3 reviews the prior work in the field of document clustering. Subsection 3.2.4 reviews previous research about clustering of tweets. Finally, Subsection 3.2.5 discusses some of the various applications of Cellular Genetic Algorithms (cGAs). After each subsection, a table summarizing the work related to this subsection is provided. Each table consists of the author(s) and year of publication, the addressed problem(s), and the contribution(s).
3.2.1 Social Media
Generally, research in the area of social networks has witnessed a dramatic increase in recent years. The rapidly increasing popularity of social networking websites has raised the awareness and availability of this type of data. This set
of research focuses mainly on understanding the nature, analysis, data collection, and organization of social media and its data. Even before the great explosion in the popularity of social media, the question of how to search a social network was raised. Adamic and Adar simulated "small world" experiments on two datasets representing two different scenarios: a dataset extracted from a network of actual email contacts within an organization, and a second dataset extracted from a student social networking website [75]. A framework for effective analysis of the content and structure of social media data as well as discovering communities in social networks (in order to understand how online communication and collaboration takes place in social applications) was introduced by Java in his doctoral dissertation [21]. Maia et al. presented a methodology for identification and characterization of user behaviors in social networks for the sake of improving business and resource management in social networks. They gathered data from YouTube and clustered users of similar behavioral patterns using a clustering algorithm [76]. Benevenuto et al. analyzed user workloads in four popular social networks: Orkut, MySpace, Hi5, and LinkedIn. Their analysis is founded on two datasets: (1) a clickstream dataset gathered over a twelve-day period using a Brazilian social network aggregator and (2) an Orkut social network dataset. The results obtained by their study revealed how frequently and for how long people connect to social networks, in addition to the types and sequences of activities that users conduct on social networks [28]. In his doctoral dissertation, Zhou investigated the relationship between user-generated social media content and social actions using a broad range of applications such as ranking, discovering communities, information retrieval, and recommendation of documents [77]. Cormode et al.
proposed a manifesto for modeling and measurement in social media. This manifesto includes the important features that can be used to construct models for three of the most widely used social networks: Twitter, Facebook and YouTube. In addition, the manifesto discussed significant considerations for the collection, sampling, validation, and sharing of social media data [78]. Gozzo and D’Agata studied the connection between social networks and political participation. They presented the different shapes of politically pertinent linkages compared against the major socio-demographic dimensions. Their data sample was extracted from the electoral register of a town
near Sicily in Italy [79]. The problem of mining antagonistic communities (communities of people with contradictory opinions) from social networks was investigated by Kuan in his Master's dissertation [80]. Wang addressed the challenges associated with social media data analysis, explained a dynamic perspective for analyzing social media data, and showed its strategic importance for the detection, minimization and prevention of the troublemaking influences of internet-based social media. This study also discussed how social media data can be used in the military field of national defence and security [24]. Yun et al. surveyed the definition of friends in online social networks and analyzed the private communication interactions between Korean users of the two social networks Me2day and Twitter. They gathered interactions of 32,200 accounts of Me2day from January to October 2009 and 890 users of Twitter on the 12th of December 2009 [81]. Agrawal et al. introduced an integrated methodology to study information diffusion, opinion dynamics, and information trends in online social networks. They focused on three major problems: (1) querying and analysis of online social network datasets; (2) modeling and analysis of social networks; and (3) analysis of social media and social interactions in the existing media environment [82]. In his doctoral dissertation, Becker provided a methodology for organizing social media documents. This methodology helps to identify and characterize a huge set of events which occur in social media through the usage of their relevant social media documents, in order to improve the browsing and quality of search for event content. His analysis focused on Twitter, exploiting tweets in New York City [32]. Gloor et al.
studied how social network data analysis can help to reveal personality characteristics and identify honest signals of creativity in individuals by analyzing communications in a student email network [83]. Malik and Malik discussed the challenges associated with analyzing, and developing appropriate tools for analyzing, the massive amount of data produced by large-scale social networks, in addition to the privacy issues related to social networks [84]. Stieglitz and Kaufhold presented the first steps to explore data collection from different social networks and described the architecture of a software prototype for full text analysis in social networks for the purpose of analyzing communication about individuals and events in social networks. The first application of this prototype was in the sector of political communication [85]. Takaffoli et al. proposed a framework and a community
matching algorithm for observing community changeovers and evolutions in social media through time. They evaluated their proposed framework over two social network datasets: (1) the Enron email dataset, which provides email messages between the employees working in the Enron Corporation, and (2) the DBLP (Data Base systems and Logic Programming) co-authorship dataset, which contains a computer science co-authorship network [86]. Derczynski et al. viewed social media data as a constant stream of data points, each point containing text associated with spatial and temporal contexts. They identified challenges specific to each of the temporal, spatial, and spatio-temporal contexts, with the aim of enabling context-aware querying and analysis, especially involving longitudinal analyses on social media archives, spatial keyword search, local intent search, and spatio-temporal intent search. At the end of their study, they discussed the emerging applications and further opportunities for investigation for each context [87]. Another work that focused on user behavior in social networks is that conducted by Jiang et al., who studied the latent (passive) interactions of users (such as viewing profiles) in the Chinese social network Renren [88]. A Master's dissertation by Santiago applied statistical analysis and data mining techniques over millions of user-generated documents and events in social media in order to discover and demonstrate the relationship between social media data and terrorist events in different countries of the world [89]. The software prototype developed by Stieglitz and Kaufhold was used later on by Stieglitz and Xuan to construct a methodological framework for social media analytics. This framework is utilized to summarize the most important issues in the political context from the point of view of political institutions and different methodologies from other scientific disciplines [90].
Table 3.1 summarizes the related work concerning social media.
Table 3.1 Social Media Related Work Summary
Author(s) and Publish Year | Addressed Problem(s) | Contribution(s)
Adamic and Adar (2005) [75] | How contributors in a small world experiment can find short paths in a social network using only local information about their immediate contacts? | Comparing online social network structure to the structure of an e-mail network
Java (2008) [21] | How to analyze the structure and content of social media data effectively to realize the nature of online communication and collaboration in social networks? | Frameworks for analyzing social media content and structure, and community detection
Maia et al. (2008) [76] | How to best classify user behaviors in online social networks? | Methodology for characterizing and identifying user behaviors in online social networks.
Zhou (2008) [77] | How to improve social network analysis by analyzing networks as well as social content and social actions among users | Introducing probabilistic content models for user generated social documents and annotations, and investigating the connection between social content and social actions
Cormode et al. (2010) [78] | How to model social networks and identify its important features? | A platform for developing models and measures for social media data
Gozzo and D'Agata (2010) [79] | What is the role played by social networks in encouraging political participation? | Displaying the different connections between political and social participation in social networks
Kuan (2010) [80] | How to identify and define the properties of different types of antagonistic communities on social networks? | Algorithms for mining direct and indirect antagonistic groups on social networks
Wang (2010) [24] | What are the impacts of using social media in the fields of national security and defense and how to overcome the undesirable threats? | Explaining a dynamic perspective for social media data analysis and its strategic importance for detection, minimization and prevention of the troublemaking influences of internet-based social media on national defence and security
Yun et al. (2010) [81] | What are the influencing factors on the strength of user interactions in online social networks? | Defining online friends and providing an analysis of user interactions
Agrawal et al. (2011) [82] | How a sole news item or idea spreads throughout a social network? | An integrated approach to study information diffusion in online networks
Becker (2011) [32] | How to identify and characterize different events in social media? | Methodology for organizing social media documents which aids to identify and characterize a huge set of events that occur in social media
Gloor et al. (2011) [83] | How to recognize signals of creativity in individuals through the use of social network analysis? | Initial results on forecasting individual creativity based on interpersonal interaction patterns
Malik and Malik (2011) [84] | What are the challenges associated with analyzing large scale social networks? | A snapshot of challenges associating large scale social networks analysis
Stieglitz and Kaufhold (2011) [85] | How to analyze large amounts of text generated in social media on a short time scale? | Describing the architecture of a software prototype for full text analysis in social networks for the purpose of analyzing communication about individuals and events in social networks
Takaffoli et al. (2011) [86] | How to detect evolution and structural changes of communities in social networks? | Framework and community matching algorithm to observe community changeovers and evolutions in social media through time.
Derczynski et al. (2013) [87] | What are the temporal, spatial, and spatio-temporal challenges associated with analyzing social media? | Recognition of the spatial, temporal, and spatio-temporal challenges associated with analyzing social media
Jiang et al. (2013) [88] | How to obtain a deep understanding of visible and latent user interactions over online social networks? | Exhaustive study of Chinese social network Renren and constructing latent interaction graphs to compare against visible interactions
Santiago (2013) [89] | Is there an obvious association between terrorist events and data on social media? | Proving that there is a significant relationship between terrorist events and social media data
Stieglitz and Xuan (2013) [90] | How can political institutions and other scientific disciplines exploit the potentials of political discussions in social media sufficiently? | Methodological framework for social media analytics in political context
3.2.2 Twitter
Recent research has started to concentrate on content-related issues of online social media, particularly Twitter. Twitter analysis is a broad field of research that has attracted researchers from many disciplines in recent years, because analyzing Twitter information can yield valuable knowledge. The vast amount of information extracted from Twitter has been utilized in many applications, including measurement of political opinion, prediction of stock market prices, measurement of national sentiment, and health care.
The first set of research concentrates on a general understanding of the nature of Twitter. One of the pioneering works on Twitter was performed by Java et al., who focused on studying usage and communities. They monitored the Twitter public timeline and analyzed posts from distinct users for two months in order to understand how and why people tweet. They identified four major types of user intentions: daily chatter, conversations, sharing information, and reporting news. The study categorized the roles played by Twitter users into three main groups: information source, friends, and information seeker [31]. A definition of a user's friend on Twitter was provided by Huberman et al., who studied social interactions within Twitter and defined a user's friend as "a person whom the user has directed at least two posts to". They reached three conclusions. First, Twitter users have a very small number of actual friends in comparison to the number of followers and followees they declare. Second, users who have a large number of actual friends tend to post more updates than users who have a smaller number of actual friends. Third, users with many followers or followees (fewer actual friends) post updates less frequently than users with few followers or followees [91]. Krishnamurthy et al. presented an exhaustive characterization of Twitter. They exploited three datasets including about 100,000 Twitter users. The datasets were collected from the Twitter public timeline using two methodologies based on Twitter API functions. Similar to the study by Java et al., the purpose of this study was to identify different classes of Twitter users and their behaviors [30].
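The friend definition quoted from Huberman et al. can be made concrete with a short sketch. The function name `actual_friends`, the input format (one `(sender, recipient)` pair per directed post), and the toy data are assumptions introduced here for illustration only; they are not taken from the cited study.

```python
from collections import Counter


def actual_friends(directed_posts, min_posts=2):
    """Return {user: set of friends}, where a friend is someone the user
    has directed at least `min_posts` posts to (threshold of two, as in
    the quoted definition)."""
    # Count how many posts each sender directed at each recipient.
    counts = Counter(directed_posts)
    friends = {}
    for (sender, recipient), n in counts.items():
        if n >= min_posts and sender != recipient:
            friends.setdefault(sender, set()).add(recipient)
    return friends


# Hypothetical toy data: each pair is one post from sender directed at recipient.
posts = [
    ("alice", "bob"), ("alice", "bob"), ("alice", "carol"),
    ("bob", "alice"), ("carol", "alice"), ("carol", "alice"),
]
friends = actual_friends(posts)
```

In this toy data alice has directed two posts to bob and only one to carol, so only bob counts as her friend; bob has directed a single post to alice and therefore has no friends under the definition.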
Honeycutt and Herring studied conversation and collaboration between users via Twitter by analyzing conversations on Twitter, paying attention to the functions and uses of the @ sign. They collected tweets from Twitter's public timeline in four one-hour samples taken at four-hour intervals on January 11, 2008. They found that short, dynamic conversations are the most common, along with some longer conversations between multiple participants [15]. Zhao and Rosson conducted an exploratory study to gain a deeper understanding of how and why ordinary people use Twitter. They explored how the characteristics of micro-blogging behaviors enable informal communication, and studied the role and influence of micro-blogging on informal communication at work. To achieve these objectives, they interviewed eleven Twitter participants working in a large information technology company [92]. Kwak et al. conducted a quantitative study that crawled the entire Twitter sphere through the Twitter API in order to study Twitter's topological characteristics and to understand information diffusion on Twitter and its power as an information-sharing medium [8]. Naaman et al. examined the characteristics of social activity and patterns of communication on Twitter. They coded tweets manually and analyzed their content in order to compute the percentage of each of the types of user intentions identified by Java et al. Their study revealed that information sharing (22%), opinions or complaints (approximately 25%), random thoughts (approximately 25%), and personal status (approximately 40%) make up the vast majority of tweets. They concluded that the majority of Twitter users focus on themselves, while a minority is concerned with sharing information [93]. Weerkamp et al. studied how people use Twitter in various languages and how this usage differs from one language to another. These differences are reflected in the usage of four specific features of Twitter: hashtags, links, mentions, and conversations. Their study was based on tweets written in eight different languages, with a dataset of 1,000 tweets constructed for each language [94]. Goel et al. investigated the problem of determining similar users on Twitter. They focused on "production similarity", where two users are defined to be similar if they generate similar content. They proposed a machine-learning based framework built upon Hadoop to discover similar accounts with high quality for hundreds of millions of Twitter users every day [95].
Another set of research has paid attention to the analysis of Twitter for medical purposes.
Wegrzyn-Wolska et al. followed the tweets discussing the "Escherichia coli" epidemic in three languages: English, French, and Polish [96]. In his doctoral dissertation, Yoon conducted an observational study of Twitter messages relevant to healthy behaviors such as physical activities [97]. Many researchers have studied the use of Twitter analysis for tracking, predicting, and preventing influenza trends. Chew and Eysenbach developed an open-source infoveillance system called Infovigil to gather tweets. They then analyzed the content of tweets containing the keywords or hashtags "H1N1", "swineflu", and "swine flu" during the 2009 H1N1 outbreak [98]. Culotta investigated several models for analyzing influenza-related messages posted on Twitter in order to estimate the rates of influenza-like illnesses (ILI) in a population, detect keywords that are associated with influenza rates, and combine the detected keywords to predict national influenza rates and outbreaks. The investigation was carried out over a dataset of 574,643 Twitter messages gathered over a 10-week period from the public timeline of Twitter [10]. Lampos and Cristianini reported on a monitoring tool to measure the prevalence of H1N1 in the United Kingdom. They analyzed the textual content of the stream of Twitter data in the United Kingdom for six months during the H1N1 flu pandemic [99]. Similarly, Achrekar et al. presented the Social Network Enabled Flu Trends (SNEFT) framework, which monitors messages posted on Twitter that mention flu indicators in order to track and forecast the emergence and spread of an influenza epidemic in the real world. The tweets were gathered over one year. Their results showed that the use of Twitter data can enhance the precision of ILI prediction models, so that Twitter data provides an accurate and timely assessment of ILI activity [100]. Paul and Dredze mined Twitter for public health data, automatically extracted for a wide variety of illnesses rather than for a limited group of applications covering a small number of illnesses [101]. Signorini et al. investigated the use of the information embedded in Twitter to (1) track rapidly evolving public concern about H1N1 or swine flu, and (2) track and estimate actual disease activity in real time. They collected and stored a large sample of public tweets that matched a set of pre-specified search terms in order to monitor influenza-related traffic within the United States [102].
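Several of the influenza studies above start by selecting tweets that contain pre-specified keywords or hashtags. A minimal sketch of that filtering step follows, assuming tweets are available as `(date, text)` pairs and using naive case-insensitive substring matching. The function names and sample tweets are hypothetical, and the cited systems may match and count tweets quite differently.

```python
from collections import Counter

# Keywords reported for the Chew and Eysenbach study, lower-cased.
KEYWORDS = ("h1n1", "swineflu", "swine flu")


def mentions_flu(text, keywords=KEYWORDS):
    """True if the tweet text contains any keyword (case-insensitive substring)."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


def daily_counts(tweets, keywords=KEYWORDS):
    """Count matching tweets per day; `tweets` is an iterable of (date, text)."""
    counts = Counter()
    for day, text in tweets:
        if mentions_flu(text, keywords):
            counts[day] += 1
    return dict(counts)


# Hypothetical sample data, not taken from any cited dataset.
sample = [
    ("2009-05-01", "Worried about #H1N1 at school"),
    ("2009-05-01", "Great weather today"),
    ("2009-05-01", "swine flu cases reported nearby"),
    ("2009-05-02", "Got my #SwineFlu shot"),
]
counts = daily_counts(sample)
```

A daily count series of this kind is the sort of signal that such studies then compare against official surveillance figures; substring matching is deliberately simple here and would over-match in practice.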
Another group of research analyzed the usage of Twitter in the field of politics. Diakopoulos and Shamma analyzed and characterized the sentiments of people (aggregated from Twitter posts) concerning the debate between Barack Obama and John McCain before the 2008 U.S. Presidential elections [103]. Tumasjan et al. conducted a study with three objectives: first, to inspect whether Twitter can enhance online political discussion by monitoring how people use microblogging to exchange information on political subjects; second, to assess whether Twitter messages reflect the existing offline political sentiment in a meaningful manner; and third, to investigate whether Twitter activity can be used to forecast the popularity of political parties or alliances in the offline world. The study relied on tweets published on the public timeline of Twitter in the few weeks before the federal elections of the German national parliament, which were held on the 27th of September, 2009. The results confirmed that Twitter can serve as a vehicle for political deliberation [4]. Zhou et al. studied the role played by Twitter in information dissemination by studying the tweets posted during the 2009 post-election demonstrations in Iran. The tweets were gathered from the public timeline using the Twitter API [104]. Conover et al. explored several approaches to forecasting the political alignment of Twitter users, distinguishing between users who belong to the left wing and those who belong to the right wing, so that political campaigns can take advantage of this alignment when constructing their political strategies. The study was based upon a dataset derived from the tweets of 1,000 Twitter users concerning the 2010 U.S. midterm elections. The dataset was extracted using the Twitter "garden hose" streaming API. Similar to Tumasjan et al., Conover et al. concluded that Twitter is effective in predicting the political alignment of individuals [5]. Bravo-Marquez et al. conducted an experimental exploration of opinion time series extracted from Twitter messages relevant to the 2008 U.S. Presidential elections and analyzed the resulting time series. Contrary to the conclusions obtained by Tumasjan et al. and Conover et al., Bravo-Marquez et al. concluded that the opinion time series extracted from Twitter cannot act as a dependable predictive model for elections [6].
Wegrzyn-Wolska and Bougueroua described a system that surveyed the trends of the French 2012 Presidential Elections from discussions on Twitter. This system automatically gathered, assessed, and rated tweets in order to appraise the trends and detect trend changes in electoral behavior [7]. Nooralahzadeh et al. applied Natural Language Processing (NLP) and data mining techniques to compare the prevailing sentiments towards candidates in the 2012 Presidential elections in the USA and France, focusing primarily on time series analysis, in addition to word cloud and hashtag analysis. Twitter datasets concerning both the French and American Presidential elections were obtained using the Twitter API [3].
The next set of research deals with the usage of Twitter in the prediction of stock markets. Bollen et al. investigated whether collective national mood states, as derived from large-scale collections of daily Twitter messages, are correlated with, or can be used to predict, the value of the Dow Jones Industrial Average (DJIA) over time. They exploited two tools to measure variations in public mood from the tweets posted over a period of about ten months. The results revealed that public mood variations can be tracked through text processing of Twitter posts [105]. Similarly, X. Zhang et al. described an early work that attempted to forecast stock market indicators such as the Dow Jones, NASDAQ, and S&P 500 through the analysis of positive and negative moods in Twitter feeds. They worked on a dataset composed of tweets gathered over six months. They concluded that checking the emotional outbursts of Twitter users can give an indication of stock market behavior on the following day [106].
The use of Twitter for marketing purposes has been discussed in another set of research. Jansen et al. reported the results of research that investigated microblogging as a form of electronic word of mouth (e-WOM) for sharing consumer opinions, comments, and sentiments regarding brand names. They inspected Twitter posts mentioning brand names (gathered over thirteen weeks), specifically those expressing opinions or emotions towards a brand. The results supported the usefulness of Twitter as a marketing tool [107]. Bulearca and Bulearca presented a qualitative exploratory study on the perceptions, uses, benefits, and limitations of Twitter as a form of electronic word-of-mouth marketing by small and medium-sized enterprises (SMEs). Data were collected through semi-structured online interviews. Similar to Jansen et al., the findings support the view that Twitter is a vital platform for companies to listen to their customers' opinions [108].
The final set of research in this section emphasizes the usage of Twitter for event detection, relying on the real-time nature of Twitter. Sakaki et al. presented an investigation of the real-time interaction of Twitter users during catastrophic events such as earthquakes, and proposed an algorithm that can monitor tweets and detect a specific event. They constructed an earthquake reporting system in Japan that can detect earthquakes with high probability by monitoring tweets, and that warns registered users by sending them e-mails [9]. Becker et al. developed a system for automatically identifying and presenting Twitter content relevant to prearranged events, using a combination of simple rules and advanced query building strategies [109]. Jackoway et al. studied how to identify and discover current and future live news events, along with information regarding the event location, by extracting related and consistent information from Twitter posts and determining which Twitter users post reliable tweets [110]. Similar to Sakaki et al., Earle et al. evaluated the speed of Twitter users' reaction to the earthquake of Morgan Hill, California in March 2009. They also presented and assessed a procedure for detecting earthquakes based on data extracted from Twitter [111]. Crooks et al. studied the performance of Twitter as a geographically distributed sensor system for detecting earthquakes, taking the 2011 earthquake near the town of Mineral in Virginia, USA as a case study [112]. Table 3.2 summarizes the related work concerning Twitter.
Table 3.2 Twitter Related Work Summary
Author(s) and Publish Year | Addressed Problem(s) | Contribution(s)
Java et al. (2007) [31] | What are the major behaviors and intentions of micro-bloggers? What are the roles they play? | Studying the topological and geographical properties of Twitter and analysis of user intentions at the community level
Huberman et al. (2008) [91] | What are the different types of social interaction within Twitter? | A detailed analysis of Twitter and social interactions of Twitter users
Krishnamurthy et al. (2008) [30] | How to characterize Twitter and analyze its users and geographical distribution? | An exhaustive characterization of Twitter
Honeycutt and Herring (2009) [15] | How far does Twitter endorse user-to-user interactions? Why do people use Twitter? What are the required alterations to make Twitter more functioning as a collaboration tool? | Analysis of the conversational and collaborative features of Twitter, particularly the functions and usages of the @ sign
Zhao and Rosson (2009) [92] | How and why ordinary people use Twitter? How do the features of micro-blogging behaviors enable informal communication? What are its potential influences on informal communication at work? | Providing an analysis and understanding of the role played by Twitter as an instrument of informal communication in the work environment
Kwak et al. (2010) [8] | How do people interact on Twitter? Who are the people of the most influence? What are the trending topics? How does information diffusion take place via retweet? | Quantitative study on the entire Twitter sphere, its topological characteristics, and information diffusion on it
Naaman et al. (2010) [93] | What are the characteristics of social activity and patterns of communication on Twitter? | Classifying Twitter user intentions (showing the percentage of each intention) and identifying the behavioral patterns of Twitter users
Weerkamp et al. (2011) [94] | Does the language difference affect the usage of Twitter? | Emphasizing the difference in using Twitter features due to the language difference
Goel et al. (2013) [95] | How to find out similar users on Twitter? | A machine-learning based framework to discover similar Twitter accounts
Wegrzyn-Wolska et al. (2011) [96] | What are the problems and challenges associated with the use of Social Network Analysis and Text Mining methods for applications in E-health and medical purposes? | Presenting the directions where social network analysis can be used for medical purposes
Yoon (2011) [97] | What can be learned about physical activities from Twitter messages? | Advancing the methodological breadth of mining social media for the health-related purposes
Chew and Eysenbach (2010) [98] | Is it feasible to use Twitter to measure public opinion towards a specific topic? | Illustrating that social media and particularly Twitter can be used to conduct studies about public health
Culotta (2010) [10] | Is it possible to detect influenza outbreaks through analyzing Twitter content? | Investigation of several models for analyzing tweets to predict rates of influenza-like illnesses in a population
Lampos and Cristianini (2010) [99] | How to track the spread of flu pandemic through Twitter analysis? | A method that analyzes hundreds of thousands of Twitter messages daily to measure the prevalence of a disease in a population
Achrekar et al. (2011) [100] | How can Twitter data be used to forecast flu trends? | Framework for monitoring Twitter posts relevant to flu indicators, in order to track and forecast the emergence and spread of influenza epidemic in the real world
Paul and Dredze (2011) [101] | Is there a public health signal that can be detected within the chatter of Twitter? | Analysis of Twitter data for a broad variety of diseases instead of a limited group of applications on a few number of illnesses
Signorini et al. (2011) [102] | How can embedded information in Twitter be used to measure public sentiment concerning H1N1 and measure actual activity of the disease? | Using embedded information in Twitter to measure public interest concerning H1N1 and measuring actual activity of the disease
Diakopoulos and Shamma (2010) [103] | How do Twitter users react to a political media event? | An analytical methodology to analyze the sentiments of Twitter users posting messages concerning a televised political debate
Tumasjan et al. (2010) [4] | Can Twitter be used as a medium for political debate? Can Twitter messages reflect offline political emotions? | Confirming that Twitter can be used as a platform for political discussion
Zhou et al. (2010) [104] | How does a message spread widely among the Twitter users? Are the resulting cascade dynamics different due to the unique Twitter features? What role does message content play in its popularity? | Emphasizing the role of Twitter as a medium for information propagation, and displaying the structure and mechanism of information propagation on Twitter
Conover et al. (2011) [5] | How to forecast the political alignment of Twitter users based upon their tweet content and structure of their political communication? | Proving how effectively we can use Twitter in predicting the political alignment of its users and discriminating between left wing and right wing users
Bravo-Marquez et al. (2012) [6] | Is a time series suitable for reliable prediction? | Opinion time series analysis of messages extracted from Twitter relevant to the 2008 U.S. Presidential elections
Wegrzyn-Wolska and Bougueroua (2012) [7] | What are the problems and challenges associated with the use of Social Network Analysis and Text Mining methods for applications in politics? | A system which surveyed the trends of the French 2012 Presidential Elections from discussions on Twitter
Nooralahzadeh et al. (2013) [3] | What is the nature of elections? What is the impact of social media on elections? | Comparison of the prevailing sentiments before and after the 2012 US and France Presidential elections through time series analysis
Bollen et al. (2011) [20] | How public mood obtained from Twitter posts influence the value of a stock market indicator over time? | Analyzing the textual content of Twitter feeds every day using two mood tracking tools
X. Zhang et al. (2011) [106] | Can the behavior of stock market indicators be predicted by analyzing the sentiments reflected in Twitter posts? | Concluding that checking of the sentimental outbreaks of Twitter users can give an indication of the stock market behavior in the next day
Jansen et al. (2009) [107] | Is Twitter suitable for use as a form of electronic word-of-mouth marketing? | Emphasizing the usefulness of Twitter as a marketing tool
Bulearca and Bulearca (2010) [108] | Should Twitter be considered by Small and Medium-sized Enterprises (SMEs) in their marketing strategies? | Qualitative investigative study on the perceptions, uses, benefits and limitations of Twitter as a form of electronic word-of-mouth marketing by small and medium-sized enterprises (SMEs)
Sakaki et al. (2010) [9] | Can the event occurrence in real-time be detected by monitoring tweets? | An earthquake reporting system in Japan that can detect earthquakes by monitoring tweets
Becker et al. (2011) [109] | How to identify Twitter posts related to an event? | A system for automatically identifying and presenting Twitter content relevant to prearranged events
Jackoway et al. (2011) [110] | How to identify and discover current and future live news events along with information regarding the event location? Which Twitter users post reliable tweets? | Development of a method for determining which Twitter users post reliable information and which Twitter posts are of interest
Earle et al. (2012) [111] | Can Twitter be used to detect earthquakes? How fast can Twitter-based systems detect earthquakes? How these systems can be used during real-time earthquake response? | An automatic earthquake detection algorithm that depends upon Twitter data
Crooks et al. (2013) [112] | What are the spatial and temporal characteristics of the Twitter feed activity in response to an earthquake? | Analysis of the spatial and temporal characteristics of the Twitter feed activity in response to an earthquake
3.2.3 Document Clustering
Document clustering is considered one of the most important research topics in information retrieval and text mining; its objective is to satisfy human interests in searching for and understanding information. The resulting clusters can be used to clarify the features of the underlying data, and therefore act as a basis for other data mining and analysis techniques [113]. The idea behind document clustering algorithms is to gather documents into groups according to their similarity. This enables users to locate documents of interest much more easily, and to obtain an overview of the set of retrieved documents [114].
A large number of research studies have been performed on document clustering. The first set of research is concerned with reviewing the concept of document clustering and its well-known techniques. Steinbach et al. presented a technical report that compared, through an experimental study, two of the most common approaches used in document clustering: agglomerative hierarchical clustering and the K-means algorithm [59]. Zeng et al. studied the issue of organizing (online) web search results into clusters. They re-formalized the search result clustering problem from an unsupervised clustering problem to a supervised learning problem [115]. Zhu et al. presented an algorithm for clustering documents into one ultimate cluster built upon frequent co-occurring word sets [116]. Huang compared and analyzed the effectiveness of similarity measures in partitional clustering for text document datasets. She used the standard K-means algorithm in her experiments over seven text document datasets, applying five similarity measures [60]. Jajoo tried a different document clustering approach. First, he used a standard clustering approach to cluster the words of the documents in order to reduce data noise and improve time efficiency. Then, he used the word clusters (including the frequent co-occurring words) to cluster the documents in a way similar to the approach of Zhu et al. [42]. Sathiyakumari et al. surveyed different document clustering approaches and algorithms in text mining [58]. Similarly, a comprehensive investigation of the problem of document clustering was conducted by Aggarwal and Zhai, who also studied its major challenges and discussed the main approaches used for document clustering and their comparative advantages [117]. Koteeswaran et al. provided a review of implementation techniques and recent research on clustering and outlier analysis [118]. Anastasiu et al. prepared a technical report in February 2013. This report introduced general-purpose document clustering, described its challenges, and paid attention to the most recent developments concerning the next frontier in the field of document clustering: long and short documents [119].
Another set of research has considered the use of Genetic Algorithms (GAs) for clustering and document clustering. One of the pioneering studies in this field was conducted by Jones et al. in 1995. Their experiments exploited three document test collections, involving documents, queries, and their associated relevance judgments. They compared the effectiveness of clusters resulting from a GA-based clustering technique with that of network-based clustering. They concluded that Genetic Algorithms are not of practical use in document clustering. It should be noted, however, that this study was conducted in 1995; later research reached a very different conclusion [120]. Maulik and Bandyopadhyay introduced a clustering technique based on a Genetic Algorithm. The algorithm was examined over a wide variety of artificial and real-life data sets, and the results were compared against those obtained by the K-means clustering algorithm. The results showed a significant superiority of the GA-based clustering algorithm over the K-means algorithm [121]. Casillas et al. introduced in 2003 a genetic algorithm that clusters documents into an unknown number of clusters. Their experiments were carried out over a collection of 14,000 news items gathered from a Spanish newspaper [122]. Premalatha and Natarajan introduced a method for clustering a set of documents based upon a Genetic Algorithm using a simultaneous mutation operator and a ranked mutation rate. Experimental results on a number of common datasets were compared with those of a simple GA and K-means. The results demonstrated that the proposed algorithm statistically outperforms the simple Genetic Algorithm as well as the K-means algorithm [40]. Similar to the previous two studies, Jian-Xiang et al. developed an algorithm based on a Genetic Algorithm to cluster documents, and showed that the GA can produce better outcomes than those produced by K-means [43]. Verma et al. exploited a modified genetic algorithm for clustering documents. The modification lies in the creation of the initial population: instead of being generated randomly, the initial population is created by measuring the similarity between documents according to their sum of squared distances from the previously selected documents. The modified algorithm was compared with the K-means algorithm and shown to outperform it [123]. Another modification of the genetic algorithm for document clustering was presented by Meena et al., who combined the features of the Genetic Algorithm (GA) with those of Discrete Differential Evolution (DDE) for text document clustering. The purpose of this combination is to decrease the number of iterations required by the GA to find the optimum solution. The experiments were implemented using fifty documents from the Reuters-21578 database [113]. Usharani and Iyakutti proposed a method based upon a genetic algorithm for finding the similarity between web documents according to cosine similarity [47]. Table 3.3 summarizes the related work concerning document clustering.
Table 3.3 Document Clustering Related Work Summary
Author(s)
and Addressed Problem(s)
Contribution(s)
Publish Year
Steinbach et al. Comparison of common An experimental study of some (2000) [59]
document
clustering popular document
techniques
Zeng et al. (2004) How [115]
to
clustering approaches
organize
web Re-formalizing the search result
search results using other clustering techniques inadequate
than
problem
from
an
the unsupervised clustering problem
traditional to a supervised learning problem
clustering techniques?
Zhu et al. (2006) High dimensionality of the A document clustering algorithm [116]
clustered
data,
huge depending
upon
database, and absence of occurring word sets intuitive cluster description
47
frequent
co-
Table 3.3 Continue
Huang (2008) [60] How to precisely define Investigation of the effectiveness similarity between objects of
similarity
in order to achieve accurate Partitional clustering?
Jajoo (2008) [42]
How
measures
clustering
for
in text
document datasets
to
improve
the Introduction
of
a
clustering
accuracy and efficiency of algorithm where the data noise is document clustering with minimized by first clustering the large number of documents words of documents followed by added daily from different clustering the documents on the sources?
Sathiyakumari
basis of their word clusters
et How to provide a complete A comprehensive overview of
al. (2011) [58]
evaluation
of
various different
techniques
used
in
clustering approaches in document clustering text mining?
Aggarwal
and How to
Zhai (2012) [117]
obtain
a full A comprehensive investigation of
understanding
of
the document clustering, its major
document
clustering challenges, the main approaches
problem, along with its for document clustering and their main
challenges
and comparative advantages
techniques?
Koteeswaran et al. How to provide a complete An assessment of data clustering (2012) [118]
evaluation
of
clustering
and
various and outlier analysis technique outlier
analysis approaches?
48
Table 3.3 Continue
Anastasiu et al. (2013) [119] | What is the concept of document clustering? What are its challenges and recent advances? | Introducing the general purpose document clustering, describing its challenges, and the recent developments concerning the next frontier in document clustering
Jones et al. (1995) [120] | Can Genetic Algorithms (GAs) be used for document clustering? | Introduction of document clustering technique based on Genetic Algorithm
Maulik and Bandyopadhyay (2000) [121] | How to get a clustering methodology which is as simple as K-means and avoids its drawbacks? | Introduction of clustering technique based on Genetic Algorithm
Casillas et al. (2003) [122] | How to handle the problem of clustering a set of documents without knowing in advance the suitable number of clusters? | A genetic algorithm for document clustering
Premalatha and Natarajan (2009) [40] | How to provide the best grouping of documents into K number of clusters? | Genetic Algorithm for document clustering using dynamic mutation operator and adaptive mutation rate
Jian-Xiang et al. (2009) [43] | How to apply Genetic Algorithm for document clustering? | Genetic Algorithm for document clustering
Verma et al. (2010) [123] | How to improve the performance of Genetic Algorithm in document clustering? | Genetic Algorithm with squared distance optimization for document clustering
Meena et al. (2012) [113] | How to improve the performance of Genetic Algorithm in document clustering? | Genetic Algorithm with Discrete Differential Evolution optimization for document clustering
Usharani and Iyakutti (2013) [47] | How to increase the relevance of retrieved web documents? | A Genetic Algorithm based on cosine similarity for relevant document retrieval
3.2.4 Tweet Clustering
One way of analyzing Twitter is the cluster analysis of tweets. In addition to the general understanding of Twitter and its uses, other researchers are interested in the cluster analysis of tweets and Twitter users. Khot applied the k-means clustering technique to collections consisting of a huge number of documents. He selected eight specific news Twitter feeds and concluded that when the documents' content is very short (as in the case of tweets), it is more appropriate to cluster the words instead of the documents. Therefore, he proposed a method for tweet clustering that clusters the words using word co-occurrence as a similarity measure (similar to the approach of Zhu et al. mentioned in the previous section) [124]. Conover et al. used network clustering algorithms to obtain information concerning the individuals with whom a Twitter user communicates. This information was further used to cluster Twitter users into right-wing and left-wing clusters according to their political beliefs and stances [5]. In his doctoral dissertation, Yoon examined the usage of clustering to detect, summarize, and categorize the content of tweets [97]. Mosley and Roosevelt described how data mining and text analytics can be applied to social media. They used a specific example related to an insurance company, creating an archive of public Twitter posts that include a hashtag of the company name. Mosley and Roosevelt applied Ward's minimum-variance clustering method to 116 keyword indicators extracted from the archive based on their similarity [33].

Another set of research focuses on clustering users and discovering communities in Twitter. Goyal et al. presented a method to cluster Twitter users depending upon social connections in addition to content and link similarity. Their data was divided into three types: (1) geo-tagged tweets collected from five cities, (2) tweets about specific topics of interest, and (3) tweets from a specific group of users. They analyzed and compared the performance of two standard clustering algorithms for clustering users in Twitter [13]. Moreover, in his Master's dissertation, Kewalramani clustered the users of Twitter into communities, and considered Twitter users to be similar according to the similarity of the content they generate, in addition to link similarity and metadata similarity. Kewalramani evaluated the quality of clustering using different similarity measures on different types of datasets [1]. Y. Zhang et al. calculated the similarity between Twitter users depending upon their interests. This similarity is further used as a measure to recognize communities in Twitter. The study was conducted over 45,772 Twitter users with at least 100 tweets (gathered using the Twitter API) and 20 friends [12]. Similarly, Yamashita et al. proposed a method to analyze and cluster groups of Twitter users depending upon mutual interests or shared attributes among followers [125].

Another group of research discusses the use of cluster analysis to discover and recommend topics. Phelan et al.
proposed a recommender system based on Twitter to recommend current and topical news from a collection of RSS feeds [2]. Similarly, Sankaranarayanan et al. investigated the use of Twitter to build a news processing system that works exclusively with tweets posted on Twitter. This system automatically obtains breaking news, identifies current news topics, and groups news tweets into clusters, such that each cluster consists of tweets relating to a particular topic [35]. Bernstein et al. developed an interactive topic browser for discovering topics from short Twitter status updates, powered by linguistic syntactic transformation and callouts to a search engine, using a technique called "TweeTopic". The dataset was composed of 100 random tweets. After being browsed, the tweets are clustered into topics mentioned either implicitly or explicitly [126]. Karandikar addressed the problem of determining which topic model is the most appropriate for clustering tweets, judged by its clustering performance. He used R, a system for statistical analysis and graphics, to apply k-means clustering to the tweets (written in English and aggregated into four datasets) based upon their topic vectors. He also clustered users on Twitter according to the content they generate [18]. On the same track, O'Connor et al. presented a topic extraction system called "TweetMotif" that clusters Twitter messages (gathered from the Twitter API) by frequent significant terms [127]. Rangrej et al. gathered tweets using TweetMotif in order to compare the performance of three different document clustering techniques (K-means, an SVD-based method, and a graph-based approach) on short text data collected from Twitter. They performed their experiments on a dataset of 611 handpicked tweets representing different topics from Twitter [11]. Rosa et al. presented a study on automatically clustering and categorizing tweets into six different pre-defined topics through the use of hashtags as approximate indicators of tweet topics, encouraged by the approaches adopted by news aggregating systems such as Google News. Their clustering technique was appraised using a dataset including more than one million Twitter messages gathered using the Twitter API over two weeks [128]. Kim et al. proposed a clustering method called Core-Topic-based Clustering (CTC) to extract meaningful topics from tweets and cluster tweets according to the topics. They exploited the retweet ratio (RT ratio) as a weight to evaluate the score of clusters.
Experiments were performed over a dataset consisting of tweets (gathered over a period of one month) about four popular TV programs. The obtained results were compared with, and shown to be better than, those obtained by the K-means algorithm [129]. Similarly, Rafea and Mostafa presented their experience in extracting Arabic hot topics from Twitter. Experiments were performed over 110 tweets collected over four days [130].

The next set of research considers enriching the terms of short documents using Wikipedia in order to improve the clustering of short text items such as tweets. One of the early studies in this field was conducted by Banerjee et al., who proposed an approach to increase the accuracy of clustering short text items by using Wikipedia as an additional knowledge source to enrich the short text representation with extra features (titles of selected Wikipedia articles). Results indicate that, for most clustering algorithms, the accuracy of clustering improved substantially with the enriched representation [131]. Another work that exploits Wikipedia was presented by Gabrilovich and Markovitch, who proposed a method called Explicit Semantic Analysis (ESA), which applied Wikipedia concepts on a collection of fifty documents to determine closeness between natural language texts [132]. A similar work was performed by Chen et al., who attempted to minimize the impurity of tweet clusters by using Wikipedia. First, they expanded the feature and training sets using Wikipedia search expansion for each tweet in order to overcome the limited length of tweets. Second, they used a classifier to reduce the impurity of clusters [133]. Perez-Tellez et al. introduced and compared different methods built upon k-means clustering in order to differentiate between Twitter messages that are related to a specific company and those that are not. Their approach categorizes Twitter messages that include a possible company entity into two clusters: the first cluster corresponds to the Twitter messages that refer to the specific company, while the second cluster corresponds to tweets that refer to another topic. Terms forming the Twitter messages were enriched using Wikipedia. Experiments were carried out on Twitter messages relating to twenty companies, written in English, with only the true and false Twitter messages taken into consideration. The purpose of the experiments was to confirm whether enrichment using Wikipedia leads to an improvement in the clustering of company tweets [19]. Table 3.4 summarizes the related work concerning tweet clustering.
Table 3.4 Tweet Clustering Related Work Summary
Author(s) and Publish Year | Addressed Problem(s) | Contribution(s)
Khot (2010) [124] | How to increase the speed of tweet clustering to be as real time as possible? | A method for tweet clustering that clusters the words using word co-occurrence as a similarity measure
Conover et al. (2011) [5] | How to forecast the political alignment of Twitter users based upon their tweet content and structure of their political communication? | Proving how effectively we can use Twitter in predicting the political alignment of its users and discriminating between left wing and right wing users
Yoon (2011) [97] | What can be learned about physical activities from Twitter messages? | Advancing the methodological breadth of mining social media for health-related purposes
Mosley and Roosevelt (2012) [33] | How to apply data mining and text analysis techniques to social media? | Application of data mining and text analysis techniques to a specific example of insurance company Twitter posts
Goyal et al. (2011) [13] | How to understand different data mining and analysis techniques on Twitter? | A method to cluster Twitter users depending upon social connections in addition to content and link similarity
Kewalramani (2011) [1] | How to identify communities in Twitter? | A methodology to formulate similarity between any two Twitter users on the basis of their generated content similarity, link similarity and metadata similarity
Y. Zhang et al. (2012) [12] | How to recognize communities on Twitter based on users' interests? | Similarity calculation between Twitter users based on their interests. This similarity is further used as a measure to recognize communities in Twitter
Yamashita et al. (2013) [125] | How to analyze and cluster groups of Twitter users depending on mutual interests or shared attributes among followers? | A method to analyze and cluster groups of Twitter users based upon a commonality or a shared attribute
Phelan et al. (2009) [2] | Can Twitter be used as a recommender system for news? | A recommender system depending on Twitter to rank and recommend current and topical news from a collection of RSS feeds
Sankaranarayanan et al. (2009) [35] | How can Twitter be used to automatically extract breaking news from Twitter posts? | A news processing system that works exclusively with Twitter posts
Bernstein et al. (2010) [126] | How to obtain topics from Twitter? | An interactive topic browser for discovering topics from Twitter posts
Karandikar (2010) [18] | Which topic model is the most appropriate for clustering tweets and Twitter users? | Determining the most appropriate topic model to cluster tweets and Twitter users based on their status updates
O'Connor et al. (2010) [127] | How to organize and search through millions of Twitter posts? | A topic extraction system that clusters Twitter messages by frequent significant terms
Rangrej et al. (2011) [11] | How to handle the sparseness of words in the condensed short text documents gathered from the web? | Comparative study of the performance of different clustering approaches for short text documents using datasets gathered from Twitter
Rosa et al. (2011) [128] | How to detect the topics discussed in Twitter messages automatically? | Automatic clustering and categorizing of tweets into pre-defined topics through the use of hashtags as approximate indicators of tweet topics
Kim et al. (2012) [129] | How to extract meaningful topics from tweets? | A clustering method to extract meaningful topics from tweets
Rafea and Mostafa (2013) [130] | How to extract Arabic hot topics and recognize the sentiment of Arab users towards these topics from tweets? | Development and application of an approach for assessing a key-phrase extraction algorithm to identify the sentiment topic in a cluster
Banerjee et al. (2007) [131] | How to solve the problem of information overload in famous news or blog feeds? | A method to improve the accuracy of clustering short text items using Wikipedia as an additional knowledge source
Gabrilovich and Markovitch (2007) [132] | How to compute semantic relatedness of natural language texts? | A method applying Wikipedia concepts to determine closeness of natural language texts
Chen et al. (2010) [133] | How to overcome the inaccurate categorization of tweets due to their limited length? | Introduction of the Wikipedia search expansion to reduce the impurity of tweet clusters
Perez-Tellez et al. (2010) [19] | How to differentiate between Twitter messages that are related to a specific company and those that are not? | Proposal and comparison of different methods of tweet representation depending upon term expansion and their influence on clustering company tweets
3.2.5 Cellular Genetic Algorithms (cGAs)
Cellular Genetic Algorithms (cGAs) have been used in various domains to solve different types of problems. A group of research is concerned with the utilization of cGAs in different applications. In the field of transportation, Alba and Dorronsoro introduced a cellular genetic algorithm for solving the well-known Vehicle Routing Problem (VRP) [48]. In the field of communication networks, Alba et al. studied the usage of a cellular multi-objective genetic algorithm (cMOGA) to solve the problem of optimally tuning a particular broadcasting strategy for Metropolitan Mobile Ad Hoc Networks (MANETs) [134]. Nebro et al. introduced a multi-objective cellular genetic algorithm called MOCell for solving multi-objective continuous optimization problems (MOPs). MOCell uses an external archive to store the non-dominated solutions found during the execution of the algorithm, together with a feedback mechanism in which solutions from this archive randomly replace current individuals in the population after each iteration [135]. Guzek et al. proposed a cellular genetic algorithm (called Energy-Aware Communications Scheduler, EACS) for scheduling precedence-constrained applications and optimizing the energy consumption during inter-processor communications in modern parallel and distributed systems through the usage of task clustering techniques [50]. Khezri and Hazrati developed a cellular genetic algorithm for solving the problem of sensor placement in distributed sensor networks for target location, under the restriction of complete coverage of the sensor network at minimum cost [136]. In the field of electric power, Yugui established a power forecast model that combines a cellular genetic algorithm with a BP neural network in order to forecast the mid-long term demand for electric energy in the urban areas of the Chinese city Nanchang [137]. Table 3.5 summarizes the related work concerning cellular genetic algorithms.
Table 3.5 Cellular Genetic Algorithm Related Work Summary
Author(s) and Publish Year | Addressed Problem(s) | Contribution(s)
Alba and Dorronsoro (2004) [48] | How to apply Cellular Genetic Algorithms to solve the Vehicle Routing Problem? | A cellular genetic algorithm for solving the Vehicle Routing Problem
Alba et al. (2007) [134] | How to optimize broadcasting of MANETs networks using Cellular Genetic Algorithms? | Application of a cellular multi-objective genetic algorithm to solve the optimum broadcasting problem
Nebro et al. (2009) [135] | How to use Cellular Genetic Algorithms to solve the multi-objective continuous optimization problems? | A multi-objective cellular genetic algorithm for solving multi-objective continuous optimization problems
Guzek et al. (2010) [50] | How to reduce time for exchanging data in inter-processor communications? How to reduce energy dissipation due to data transfer between processing elements? | A cellular genetic algorithm for scheduling precedence-constrained applications and optimizing the energy consumption during the inter-processor communications
Khezri and Hazrati (2013) [136] | How to use Cellular Genetic Algorithms to solve the problem of sensor placement in distributed sensor networks? | A cellular genetic algorithm for solving the problem of sensor placement in distributed sensor networks
Yugui (2013) [137] | How to forecast electric demand using Cellular Genetic Algorithms? | A power forecast model combining cellular genetic algorithm with BP neural network to forecast the mid-long term demand for electric energy in the urban areas of the Chinese city Nanchang
3.3 Summary and Conclusion
The chapter started by providing a historical overview of social media, and of Twitter in particular. It also provided a historical background on the clustering of documents and tweets, Genetic Algorithms, and one of their subclasses: Cellular Genetic Algorithms. The second section of the chapter presented a detailed review of the literature concerning the previously mentioned topics. To the best of the researcher's knowledge, Cellular Genetic Algorithms (cGAs) have not previously been used for clustering tweets in the literature. The main contribution of this thesis is the application of cGAs to tweet clustering. This thesis is considered to be one of the first attempts to do so.
CHAPTER FOUR DATA AND ALGORITHM
This chapter includes the detailed steps of the work done in this study. The basic outline of this chapter is as follows. Section 4.1 presents the conceptual framework for the study. Section 4.2 compares the different approaches to gather data from Twitter and describes the selected data collection technique and tool. Section 4.3 describes the different aspects of data description, data preparation, data representation, and the environment in which the experiments were performed. Section 4.4 includes a detailed description of the simple Genetic Algorithm and the Cellular Genetic Algorithm. Finally, section 4.5 draws the summary and conclusion of the chapter.
4.1 Conceptual Framework
The conceptual framework for the study is presented in Figure 4.1.
Figure 4.1 Conceptual framework
The steps of data collection, data preprocessing, and algorithm application are discussed in detail in the following sections of this chapter. The results and their comparison and analysis are presented and discussed in the following chapters.
4.2 Data Collection
In this section, the common methodologies for gathering data from social media are explained, ending with the selected methodology and a description of the utilized tool. For this kind of research, there is no standard dataset available for testing. The publicly available datasets, such as the famous "Netflix Prize" dataset, the Datamob datasets, the Enron Email dataset, or the dataset provided by Stanford University, might not contain all of the data required to answer a particular question. Social networks do not provide complete and precise data directly to researchers because of privacy concerns and the fierce competition between the various social networking sites. The most common practice is for researchers to collect their own datasets from different real-world systems. Three well-known techniques are exploited by researchers to gather social media data:
1- API driven approach: In this approach, the Application Programming Interface (API) provided by the social network is exploited to query the entities, their characteristics, and the relationships between entities. Unfortunately, the Twitter API only permits users to make a limited number of calls in a given hour. Moreover, it returns only the most recently added user friends, and it makes only a sample of tweets available.
2- Scraping based approach: In this approach, the researcher accesses the social network directly through the usage of a web client. This approach is harder than the API driven approach (as the scraper has to cope with the frequent redesigns of the social network) and is subject to bandwidth limitations. However, it is not limited to a specific number of calls like the API driven approach.
3- Passive network measurement approach: In this approach, the researcher tracks the social network traffic and examines the requests to and from this particular social network. This approach can provide a real view of the studied network. However, it is hindered by privacy issues. Moreover, because a social network can be accessed in multiple ways, it is difficult to keep track of all accesses [138, 78, 139].
The research conducted in this study exploited public data available on the timeline of the Twitter social network. Data for this study were gathered utilizing the scraping based technique, where Twitter was directly accessed through a web client. The web client is a social network aggregator that pulls content from multiple social networking sites into a single location, so that users can access their social network accounts through a single interface without having to sign in to each site separately. Users who have accounts on more than one social networking site can thus manage their profiles in a much simpler manner [28]. The manner in which users exploit and interact with social network aggregators is displayed in Figure 4.2.
Figure 4.2 User interactions with social networks by social network aggregator [28]
Hootsuite.com is the social network aggregator that was selected by the researcher to gather data from Twitter. Hootsuite.com is a web site that enables its users to track and archive Twitter messages. To track Twitter messages relevant to a particular topic or to a particular user, users can access this website and create an archive. The created archive will track and archive such Twitter messages. In addition to Twitter, Hootsuite enables its users to archive data on various social networks according to well-defined search criteria. Archives created by others can be retrieved only if the archive owner grants explicit approval. Social networks that can be managed using Hootsuite include Twitter, Facebook, LinkedIn, Google+, Foursquare, WordPress, and Mixi. Moreover, further services such as Tumblr, YouTube, Flickr, MailChimp, SocialFlow, InboxQ, and Constant Contact can be added to the Hootsuite dashboard using a feature known as the "Hootsuite App Directory" [140]. An example of the Hootsuite dashboard is displayed in Figure 4.3.
Figure 4.3 Hootsuite dashboard
According to Alexa traffic ranks, Hootsuite occupied global rank number 143 on the 20th of March 2014, as displayed in Figure 4.4.
Figure 4.4 Hootsuite Alexa traffic rank on the 20th of March 2014 [141]
4.3 Data Description and Preparation
4.3.1 Data Description
For the purpose of this study, tweets were collected based on a set of keywords that describe specific real-world topics. Tweets were collected over a three-day period from the 26th to the 28th of June 2013. The set of pre-defined keywords comprises eight categories that are intended to be diverse in order to cover different and wide areas of interest: Cinema, Egypt, Film, Hollywood, Iran, Juventus, Messi and Sport. Eight archives with the previously mentioned keywords were established in Hootsuite.com. The type of data gathered using Hootsuite is displayed in Table 4.1.
Table 4.1 Type of Data gathered using Hootsuite
The username of the tweet sender
The tweet content
The date and time of tweet posting (according to GMT)
Twitter identification number of the tweet
Geographic coordinates of the user determining his/her location
A sample of the tweets included in the datasets is displayed in Figure 4.5.
Figure 4.5 Sample tweets from the dataset
The gathered tweets are clustered according to their similarity. Similarity measures in Twitter include:
1. User Connections: The most commonly used similarity measure, which depends upon the following relationship between Twitter users and upon user mentions.
2. Description Content Similarity: Measures the similarity between the descriptions provided by Twitter users on their profile pages.
3. Tweet Content Similarity: Goyal et al. stated that tweet similarity between two users is defined as "the cosine similarity between the documents formed by combining the tweets of a user into one".
4. Hashtag Similarity: Defined by Goyal et al. as "the cosine similarity between the collections of hashtags of the different users". This measure is based on the number of common hashtags between users and the importance of these hashtags [13, 12].
Because the majority of Twitter messages are textual, this study focuses on clustering tweets based on their textual content similarity. Cosine similarity is one of the most commonly used similarity measures in data analysis because of its ease of use and fast calculation. It can be utilized to compare words in documents and to normalize the comparison between documents of different word counts, as well as to compare vectors of profile attributes [142]. For non-negative term weights, cosine similarity takes non-negative values ranging from zero to one [60]. Cosine similarity has been widely used by researchers. Examples include Lee et al., who used cosine similarity to detect topics in biomedical text [143]; Sankaranarayanan et al., who determined topic clustering of Twitter messages based upon cosine similarity [35]; Java, who exploited text-based cosine similarity for evaluating the relatedness between clustered blog feeds [21]; Sayyadi et al., who utilized cosine similarity to cluster documents around topics based on the co-occurrence of keywords in documents [144]; Perez-Tellez et al., who utilized cosine similarity to compute the similarity between tweets before clustering them into two clusters, one representing tweets relevant to a specific company and another representing irrelevant tweets [19]; Becker, who utilized cosine similarity to identify and characterize different events in social media [32]; Goel et al., who defined two users on Twitter to be similar by computing the cosine similarity of their sets of followers [95]; and Usharani and Iyakutti, who proposed a method based upon a genetic algorithm for finding the similarity between web documents according to cosine similarity [47].
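The cosine similarity measure discussed above can be illustrated with a minimal Python sketch. It is not the thesis implementation (which was written in Java); the function name `cosine_similarity`, the use of raw word counts as weights, and the three example tweets are assumptions made purely for illustration.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical tweets, represented by raw word counts for simplicity.
tweet_a = Counter("messi scores again for barcelona".split())
tweet_b = Counter("messi scores twice".split())
tweet_c = Counter("new film opens in hollywood".split())

sim_ab = cosine_similarity(tweet_a, tweet_b)  # two shared words
sim_ac = cosine_similarity(tweet_a, tweet_c)  # no shared words
```

Because the weights are non-negative, the result always lies between zero (no shared terms) and one (identical direction), which matches the range stated above.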
4.3.2 Data Preprocessing
The preprocessing of data involved several steps. The first step was the elimination of tweets that:
- Are not in English
- Have very few words (fewer than three)
- Consist of just a URL
- Are duplicates of other tweets
- Are retweets
The non-English tweets were not taken into consideration because, as previously mentioned in chapter 1, English is the most commonly used language on Twitter. In addition, all stop words, punctuation marks, and symbols were removed. The removed characters include quotation marks, parentheses, punctuation marks, and stray symbols. However, the signs that are significant for Twitter (such as @ and #) were kept.
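The elimination and cleaning steps above can be sketched in Python as follows. This is a rough illustration under several assumptions that the text does not specify: the stop-word list is a tiny hypothetical one, ASCII-only text stands in for real language detection, retweets are recognized by the "RT @" prefix, URLs are stripped and text is lowercased during cleaning, and the three-word threshold is applied before stop-word removal.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list (the list actually used is not stated).
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "to", "and", "for"}
URL_RE = re.compile(r"https?://\S+")
# Delete all ASCII punctuation except the Twitter-specific '@' and '#'.
PUNCT_TABLE = str.maketrans("", "", string.punctuation.replace("@", "").replace("#", ""))

def is_retweet(text):
    return text.startswith("RT @")  # assumed retweet convention

def looks_english(text):
    return text.isascii()  # crude stand-in for language detection

def clean(text):
    """Strip URLs and punctuation (keeping @ and #), lowercase, drop stop words."""
    text = URL_RE.sub(" ", text).lower().translate(PUNCT_TABLE)
    return [tok for tok in text.split() if tok not in STOP_WORDS]

def preprocess(tweets):
    seen, kept = set(), []
    for raw in tweets:
        text = raw.strip()
        if is_retweet(text) or not looks_english(text):
            continue
        if not URL_RE.sub("", text).strip():  # the tweet is just a URL
            continue
        if len(text.split()) < 3:             # fewer than three words
            continue
        if text.lower() in seen:              # duplicate tweet
            continue
        seen.add(text.lower())
        kept.append(clean(text))
    return kept

tweets = [
    "RT @fan: Messi is the best!",
    "Messi scores again for Barcelona!",
    "Messi scores again for Barcelona!",
    "http://t.co/abc",
    "Nice goal",
    "Película nueva en el cine hoy",
    "Watching #Juventus with @friend tonight",
]
cleaned = preprocess(tweets)
```

Only the second and last tweets survive the filters; the hashtag and mention in the last tweet keep their # and @ signs after cleaning.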
4.3.3 TF-IDF Representation
The researcher used a tweet representation based on Term Frequency - Inverse Document Frequency (TF-IDF). TF-IDF is a popular vector representation that is commonly used in Natural Language Processing (NLP). It measures the statistical weight of terms in a given document corpus (reflecting the importance of a word across the corpus), depending upon word co-occurrence or word repetition. The TF-IDF score for a term with respect to a tweet corpus is measured in terms of two individual components: term frequency (TF) and inverse document frequency (IDF). The importance of each term is directly proportional to the number of times this term appears in a document (term frequency). The TF is calculated as the normalized frequency of the term. The term frequency is then weighted by the inverse document frequency (IDF), which discounts the weight of terms that occur frequently across all documents in the corpus.
The product TF·IDF measures the extent to which a term occurs frequently in a specific document without occurring in the other documents forming the document corpus:

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log\left(\frac{N}{df(t)}\right)$$

where tf(t, d) is the term frequency of term t in document d, N is the total number of documents in the corpus, and df(t) represents the number of documents in which the term t appears (document frequency). Generally, TF measures the relative importance of a term in a particular document. On the other hand, IDF is a measure of the general importance of the term across the whole corpus of documents: it measures the rareness of a word across all of the documents. The greater the IDF value, the rarer (and consequently the more discriminative) the word is across the corpus of documents. In the Twitter domain, the documents are the tweets. This means that terms with a high frequency within the tweet (high term frequency) and a low frequency over all tweets (low document frequency, and hence high inverse document frequency) have a high TF-IDF value. The vector representation depends upon the textual content of the Twitter messages. Similarity between Twitter messages is measured using the cosine similarity metric.
TF-IDF struggles with Twitter messages because the limited length of a tweet (140 characters) is often not sufficient to indicate important terms. Sometimes, tweets resemble search queries. Twitter users usually tend to remove redundant words from a Twitter message for the sake of saving space, which results in very low term frequencies (1 or 2). In some tweets, no terms are repeated, so the TF-IDF weight is determined almost entirely by the IDF term. However, TF-IDF remains one of the most powerful and useful methods for representing different types of documents, including tweets [5, 16, 19, 60, 117, 126, 143].
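As a worked illustration of the TF-IDF weighting described above, the following Python sketch builds one sparse vector per tweet. It assumes the common form in which the weight is the term frequency multiplied by log(N / df(t)), with N the number of tweets, and it assumes that the term frequency is normalized by the tweet length; the exact normalization used in the study is not stated, and the name `tfidf_vectors` and the example tweets are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # df[t] = number of documents in which term t appears at least once
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            # assumed normalization: relative frequency within the tweet
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

docs = [
    ["messi", "scores", "again"],
    ["messi", "injured"],
    ["new", "film", "premiere"],
]
vectors = tfidf_vectors(docs)
```

In this toy corpus "messi" appears in two of the three tweets, so it receives a lower weight than "scores", which appears in only one. Since no term is repeated within a tweet, the ranking of terms inside each tweet is driven entirely by the IDF component, which is the short-text effect noted above.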
4.3.4 Experimental Setup
The experiments in this study were implemented over three datasets of 1,000, 5,000, and 30,000 tweets respectively. The experiments included running each of the cellular and generational genetic algorithms for 40 independent runs over the 1,000-tweet dataset, 50 independent runs over the 5,000-tweet dataset, and a single run over the 30,000-tweet dataset. Both algorithms were executed using Java on a single 1.90 GHz PC with 8 GB of memory, running the Windows 7 Professional operating system. The fitness value, execution time, number of generations, and number of generated clusters were recorded for each run. Then the average fitness value and average execution time (in milliseconds, ms) were calculated. Finally, the values of the cellular genetic algorithm were compared to those of the generational genetic algorithm in order to select the algorithm that achieves the best fitness (i.e., the highest quality of clustering) in the least time. Moreover, the performance of each algorithm was evaluated over the three datasets according to fitness value and execution time.
4.4 Algorithm

Traditional clustering algorithms were not selected by the researcher because such algorithms explore just a small subset of the potential clusterings. Thus, the solution found is not guaranteed to be optimal (the search might get stuck in a local optimum). Moreover, traditional clustering approaches that require a priori knowledge of the number of clusters, such as K-means, are not suitable for handling the large volume of data produced by Twitter and other social media sites [32]. Premalatha and Natarajan (2009) used a genetic algorithm with a ranked mutation operator and demonstrated that it outperforms traditional algorithms such as K-means [40].

The use of Evolutionary Algorithms (EAs) to handle complicated problems has grown considerably in recent years. They imitate biological processes in nature. These algorithms are population-based, which means that they act on a group of prospective solutions (a population of individuals) by iteratively applying operators to these individuals in order to find the best solutions. The majority of these algorithms maintain a single population and apply the operators to it as a whole [134]. These steps are repeated until a stopping condition (for example, a maximum number of evaluations) is met.

The balance (tradeoff) between exploration (diversification) of new solutions and exploitation (intensification) in the search space is an important criterion for evaluating the performance of a genetic algorithm, and adjusting this tradeoff can improve the overall performance of the algorithm. This tradeoff is represented by the “selection pressure”, which is defined as “A measure of the diffusion speed of the good solutions through the population” [46]. Reeves (1993) formulated the selection pressure in the following equation [145]:
$$\phi = \frac{\Pr[\text{selecting the fittest string}]}{\Pr[\text{selecting an average string}]}$$

Equation 1: Selection Pressure

A higher selection pressure favors exploitation: the algorithm tends to converge rapidly to a good-enough solution and is therefore liable to get stuck in a local optimum. Conversely, a lower selection pressure favors exploration: the algorithm searches the space more widely for an optimal solution, at the cost of slower convergence.
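To make Equation 1 concrete, the sketch below evaluates the ratio under fitness-proportionate (roulette-wheel) selection, where the probability of selecting an individual is its fitness divided by the total fitness. This selection scheme is an assumption chosen only because the ratio is easy to compute by hand; it is not the scheme used in this study, and the function name `selection_pressure` is hypothetical.

```python
def selection_pressure(fitnesses):
    """Selection pressure (Equation 1) under fitness-proportionate selection.

    Assumption: P(select i) = f_i / sum(f). An "average string" is a
    hypothetical individual whose fitness equals the population mean.
    """
    total = sum(fitnesses)
    p_fittest = max(fitnesses) / total
    p_average = (total / len(fitnesses)) / total
    return p_fittest / p_average

# The fittest individual (6) is twice as fit as the average (3).
pressure = selection_pressure([1, 2, 3, 6])
# A uniform population exerts no pressure: the ratio is 1.
flat = selection_pressure([5, 5, 5, 5])
```

Under this assumption the ratio reduces to the best fitness divided by the mean fitness, so a population dominated by one strong individual has a high selection pressure.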
4.4.1 Cellular Genetic Algorithm

Genetic Algorithms (GAs) imitate the biological process of natural selection and evolution. A population of individuals (with a chromosome-like structure) that represent candidate solutions to a given problem is maintained. The initial population is usually generated at random. New individuals are then created through reproduction of the population's individuals by applying particular genetic operators: the recombination (crossover) operator and the mutation operator. The reproductive cycle gives the better individuals a higher chance to survive and reproduce (survival of the fittest). The fitness of individuals is evaluated using the so-called fitness function or objective function. Individuals are selected for reproduction according to their assigned fitness values, so that the individuals with the greatest fitness in a generation are more likely to be selected than those with lower fitness values. The newly generated individuals substitute their predecessors according to a pre-defined replacement policy. This reproductive cycle continues until a particular termination condition is met. A simplified summary of the mechanism of a simple genetic algorithm is displayed in Figure 4.6.

Figure 4.6 Simple Genetic Algorithm [40]

Cellular Genetic Algorithms (cGAs) are a subclass of Genetic Algorithms in which the population is structured in a decentralized manner and the concept of a small neighborhood is strictly applied, so that an individual can recombine only with the individuals that belong to its neighborhood, as displayed in Figure 4.7.
Figure 4.7 Topology of Cellular Genetic Algorithm [134]
Alba and Dorronsoro stated that “Such a kind of structured algorithms is specially well suited for complex problems” [48]. The existence of small overlapped neighborhoods in Cellular Genetic Algorithms helps to preserve a high level of diversity for much longer than in centralized algorithms [55]. A behavioral comparison of two different cGAs against two traditional genetic algorithms, on a large benchmark composed of problems with many different features, revealed that the behavior of the cGA is more robust, since it obtains smaller standard deviations than the traditional algorithms. In addition, the cGA ran faster (shorter elapsed time) than the traditional genetic algorithms.

In this study, the results obtained by the Cellular Genetic Algorithm are compared against those of the Generational Genetic Algorithm. Generational Genetic Algorithms (genGAs) are unstructured genetic algorithms in which any individual can interact with any other individual in the population. The offspring are placed in a temporary population, which replaces the existing population once the number of offspring equals the size of this temporary population [46].
4.4.2 Chromosome Representation

The population is structured as a two-dimensional grid with a neighborhood defined over it. Every chromosome in the population represents a candidate solution to the problem and is composed of a sequence of genes. A chromosome is represented as an array of integers whose length equals the number of tweets; each entry in the array holds the cluster number assigned to the corresponding tweet. The chromosome representation is shown in Figure 4.8. The neighborhood used is L5 (shown in Figure 4.9).
Figure 4.8 Chromosome representation
Figure 4.9 L5 Neighborhood [46]
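The L5 neighborhood of Figure 4.9 can be sketched as follows: a cell together with its north, south, west, and east neighbors. The sketch assumes that the grid wraps around at the edges (a toroidal grid), which is a common convention for cellular GAs but is not stated explicitly in the text; the function name `l5_neighborhood` is hypothetical.

```python
def l5_neighborhood(row, col, rows=20, cols=20):
    """Return the cell itself plus its N, S, W, E neighbors.

    Assumption: the grid is toroidal, so indices wrap around at the edges.
    """
    return [
        (row, col),
        ((row - 1) % rows, col),  # north
        ((row + 1) % rows, col),  # south
        (row, (col - 1) % cols),  # west
        (row, (col + 1) % cols),  # east
    ]

# A corner cell still has five cells in its neighborhood because of wrap-around.
corner = l5_neighborhood(0, 0)
```

Because neighboring cells share members of their neighborhoods, good solutions can spread gradually across the whole grid even though each individual interacts with only four others.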
4.4.3 Initial Population

The population is composed of 400 individuals arranged in a 20 × 20 grid. The initial population of candidate solutions is generated at random from the search space, and a fitness value is assigned to each individual.
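A minimal sketch of the chromosome encoding and the random initial population is given below. The label range (`max_clusters`), the toy chromosome length, and the helper names `random_chromosome` and `decode` are assumptions made for illustration; the text does not specify how the initial cluster numbers are drawn.

```python
import random
from collections import defaultdict

def random_chromosome(n_tweets, max_clusters, rng):
    """One gene per tweet; each gene is a randomly chosen cluster number."""
    return [rng.randrange(max_clusters) for _ in range(n_tweets)]

def decode(chromosome):
    """Group tweet indices by the cluster number stored in their gene."""
    clusters = defaultdict(list)
    for tweet_index, label in enumerate(chromosome):
        clusters[label].append(tweet_index)
    return dict(clusters)

rng = random.Random(42)
# 20 x 20 grid of individuals; 10 tweets and 4 labels are toy values.
population = [[random_chromosome(10, 4, rng) for _ in range(20)]
              for _ in range(20)]
example = decode([0, 2, 0, 1])  # tweets 0 and 2 share cluster 0
```

Since the number of distinct labels that actually appear in a chromosome can vary, the number of clusters is an outcome of the search rather than a fixed input.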
4.4.4 Fitness Function

The fitness function is used to evaluate the quality of a solution (a candidate clustering). A higher fitness value indicates a higher-quality solution. The fitness function used here is a function of cosine similarity. Usharani and Iyakutti stated that “cosine similarity is a measure of similarity that can be used to compare documents with respect to a given vector of query words. This is quantified as the cosine of angle between vectors”. The function is as follows [47]:

$$\cos(A, B) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} (A_i)^2} \times \sqrt{\sum_{i=1}^{n} (B_i)^2}}$$

Equation 2: Fitness Function

The function should be maximized. Here, A and B represent tweet vectors, A_i represents the weight of term i in vector A, B_i represents the weight of term i in vector B, n represents the number of terms (the dimension of the tweet vectors), and × denotes multiplication. In this study, tweets are compared with the other tweets in the corpus, instead of comparing documents to queries. For non-negative term weights such as TF-IDF, the value of cosine similarity ranges between 0 and 1.
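Equation 2 can be sketched for sparse tweet vectors stored as {term: weight} dictionaries. The dictionary representation, the function name `cosine_similarity`, and the convention of returning 0 for an all-zero vector are assumptions for illustration. How the pairwise similarities are aggregated into the fitness of a whole chromosome is not detailed in this section, so the sketch covers only the pairwise measure.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors ({term: weight})."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: an empty or all-zero vector matches nothing
    return dot / (norm_a * norm_b)

same = cosine_similarity({"goal": 2.0, "madrid": 1.0}, {"goal": 2.0, "madrid": 1.0})
disjoint = cosine_similarity({"goal": 1.0}, {"film": 1.0})
partial = cosine_similarity({"x": 1.0, "y": 1.0}, {"x": 1.0})
```

Two tweets with identical weighted terms score 1, tweets with no terms in common score 0, and partially overlapping tweets fall in between.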
4.4.5 Parent Selection

The objective of the selection operator is to enhance the quality of the population by giving higher-quality individuals (those with the highest fitness values) a greater chance to survive and reproduce in the following generations than lower-quality individuals (those with the lowest fitness values). The quality of an individual is evaluated using the fitness function [49]. Here, the first parent is selected using the dissimilarity tournament selection operator, while the second parent is chosen by the linear rank selection operator. The dissimilarity tournament selection operator does not depend on the relative fitness of the neighboring individuals. Instead, it takes into consideration the difference between the respective solutions: two neighbors are chosen at random, and the one that is more dissimilar to the current individual is selected. In linear rank selection, on the other hand, all individuals in the neighborhood are sorted by fitness value from best to worst, and an individual with a higher rank in this list has a greater probability of being chosen as a parent [46].
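The two selection operators described above can be sketched as follows. The sketch makes two assumptions that are not specified in the text: dissimilarity is measured as the Hamming distance between chromosomes, and linear ranking uses the weights n, n-1, ..., 1 from best to worst. The function names are hypothetical.

```python
import random

def hamming(a, b):
    """Number of positions at which two chromosomes differ (assumed measure)."""
    return sum(x != y for x, y in zip(a, b))

def dissimilarity_tournament(current, neighbors, rng):
    """Pick two random neighbors; return the one more unlike `current`."""
    first, second = rng.sample(neighbors, 2)
    return max((first, second), key=lambda cand: hamming(current, cand))

def linear_rank_selection(neighbors, fitness, rng):
    """Sort neighbors best to worst; selection weight falls linearly with rank."""
    ranked = sorted(neighbors, key=fitness, reverse=True)
    weights = list(range(len(ranked), 0, -1))  # assumed weights: n, n-1, ..., 1
    return rng.choices(ranked, weights=weights, k=1)[0]

rng = random.Random(0)
current = [0, 0, 0, 0]
first_parent = dissimilarity_tournament(current, [[0, 0, 0, 1], [1, 1, 1, 1]], rng)
second_parent = linear_rank_selection([[1, 1, 1], [1, 1, 0], [0, 0, 0]], sum, rng)
```

Choosing the first parent by dissimilarity favors diversity, while choosing the second by rank keeps some pressure toward fitter neighbors.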
4.4.6 Recombination (Crossover)

The recombination step combines two or more portions of the parent individuals to create new offspring. The generated offspring are not identical to their parents, but contain combined building blocks from both of the selected parents. A recombination (crossover) operator with a pre-specified crossover probability (Pc) is applied to the individuals. Here, the applied operator is the Distance Preserving Crossover (DPX) operator, described as a "two-point crossover", with Pc = 1.0. The objective of this operator is to produce offspring that have an equal distance to each parent, this distance being the same as the distance between the parents [46, 146].
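For illustration, a generic two-point crossover on integer chromosomes can be sketched as below. This is a textbook two-point crossover under the assumption that a single child is produced; it is not claimed to reproduce the exact DPX operator of the implementation used in this study, and the function name is hypothetical.

```python
import random

def two_point_crossover(parent_a, parent_b, rng):
    """Copy parent_a, but take the segment between two cut points from parent_b."""
    n = len(parent_a)
    # Two distinct cut points in 0..n, so the swapped segment is never empty.
    i, j = sorted(rng.sample(range(n + 1), 2))
    return parent_a[:i] + parent_b[i:j] + parent_a[j:]

child = two_point_crossover([0] * 8, [1] * 8, random.Random(1))
```

With an all-zero and an all-one parent, the child is a run of ones surrounded by zeros, which makes the two cut points directly visible.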
4.4.7 Mutation

While the recombination operator acts on two or more parent individuals, the mutation operator randomly modifies a single individual by altering one or more of its genes. A mutation operator with a pre-specified mutation probability (Pm) is applied to the individuals. Here, the applied operator is the Integer Mutation operator with Pm = 1.0. Integer mutation replaces the integer value of a gene with a new value generated at random [46, 147].
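Integer mutation can be sketched as follows. The per-gene probability `gene_prob` and the label range `n_labels` are hypothetical parameters introduced for the example: the text gives only the operator-level probability Pm and does not state how many genes are altered per mutation.

```python
import random

def integer_mutation(chromosome, n_labels, gene_prob, rng):
    """Return a copy in which each gene is redrawn with probability gene_prob."""
    return [rng.randrange(n_labels) if rng.random() < gene_prob else gene
            for gene in chromosome]

rng = random.Random(3)
original = [0, 1, 2, 3, 0, 1]
unchanged = integer_mutation(original, 4, 0.0, rng)   # no gene is redrawn
scrambled = integer_mutation(original, 4, 1.0, rng)   # every gene is redrawn
```

The operator returns a new list, so the parent chromosome stays intact in the current population.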
4.4.8 Replacement Policy

After the selection, recombination (crossover), and mutation operators have been applied, the fitness values of the new offspring are calculated and the offspring are placed incrementally into a temporary (auxiliary) population. Under the "replace if not worse" policy, an offspring takes the place of the existing individual only if it is not worse than that individual, so that the best solutions are maintained. The auxiliary population then replaces the current population.
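The replacement rule can be sketched in a few lines, assuming that fitness is maximized and that ties are resolved in favor of the offspring; the function name is hypothetical.

```python
def replace_if_not_worse(current, offspring, fitness):
    """Keep the offspring unless it is strictly worse than the current individual."""
    return offspring if fitness(offspring) >= fitness(current) else current

# Toy fitness: the sum of the genes.
kept = replace_if_not_worse([1, 1, 1], [0, 0, 1], sum)   # offspring is worse
swapped = replace_if_not_worse([1, 1, 1], [1, 2, 0], sum)  # tie: offspring wins
```

Accepting ties lets the search drift across plateaus of equal fitness without ever lowering the fitness stored at a grid position.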
4.4.9 Stopping Criterion

The reproductive cycle is repeated until the stopping condition is fulfilled. Here, termination occurs when the maximum number of fitness function evaluations (15,000,000 evaluations) is reached. The mechanism of the reproductive cycle of the Cellular Genetic Algorithm (cGA) is displayed in Figure 4.10.

Figure 4.10 Reproductive cycle mechanism in cGA [46]

The pseudo-code of the algorithm is given in Table 4.2.

Table 4.2 Pseudo-code of Cellular Genetic Algorithm [46]

    proc Evolve(cga)
        GenerateInitialPopulation(cga.pop);
        Evaluation(cga.pop);
        while !StopCondition() do
            for individual ← 1 to cga.popSize do
                neighbors ← CalculateNeighborhood(cga, position(individual));
                parents ← Selection(neighbors);
                offspring ← Recombination(cga.Pc, parents);
                offspring ← Mutation(cga.Pm, offspring);
                Evaluation(offspring);
                Replacement(position(individual), auxiliary_pop, offspring);
            end for
            cga.pop ← auxiliary_pop;
        end while
    end proc Evolve

To select the parameter values of the cGA, several tuning experiments were performed. In these experiments, the value of each parameter was modified in turn while the rest of the parameters were kept constant. The final parameter values, which achieved the best fitness, are listed in Table 4.3.

Table 4.3 Parameterization of the algorithm
| Parameter | Value |
|---|---|
| Population size | 400 individuals (20 × 20) |
| Stopping condition | 15,000,000 fitness evaluations |
| Neighborhood | Linear5 (L5) |
| Parent selection | Dissimilarity tournament + linear rank |
| Recombination operator | DPX |
| Crossover probability | Pc = 1.0 |
| Mutation operator | Integer mutation |
| Mutation probability | Pm = 1.0 |
| Replacement policy | Replace if not worse |
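The pseudo-code of Table 4.2 can be turned into a small runnable sketch. To stay short, it makes several simplifying assumptions that differ from the configuration in Table 4.3: a 5 × 5 toroidal grid, a placeholder fitness (the sum of the genes) instead of the cosine-similarity fitness, two random neighbors as parents, and a single-gene mutation. All names and constants in the block are hypothetical.

```python
import random

ROWS, COLS = 5, 5          # toy grid; the study uses 20 x 20
N_GENES, N_LABELS = 12, 4  # toy chromosome length and label range

def fitness(chromosome):
    # Placeholder objective for the demo, not the cosine-similarity fitness.
    return sum(chromosome)

def neighbors(pop, r, c):
    # L5 neighborhood on a toroidal grid: the cell plus N, S, W, E.
    cells = [(r, c), (r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [pop[i % ROWS][j % COLS] for i, j in cells]

def evolve(max_evaluations, rng):
    pop = [[[rng.randrange(N_LABELS) for _ in range(N_GENES)]
            for _ in range(COLS)] for _ in range(ROWS)]
    evaluations = ROWS * COLS  # initial evaluation of every individual
    history = [max(fitness(ind) for row in pop for ind in row)]
    while evaluations < max_evaluations:
        aux = [row[:] for row in pop]  # auxiliary population (synchronous update)
        for r in range(ROWS):
            for c in range(COLS):
                hood = neighbors(pop, r, c)
                # Selection, simplified here to two random neighbors.
                mother, father = rng.sample(hood, 2)
                # Recombination: generic two-point crossover (builds a new list).
                i, j = sorted(rng.sample(range(N_GENES + 1), 2))
                child = mother[:i] + father[i:j] + mother[j:]
                # Mutation: redraw one randomly chosen gene.
                child[rng.randrange(N_GENES)] = rng.randrange(N_LABELS)
                evaluations += 1
                # Replacement: keep the offspring only if it is not worse.
                if fitness(child) >= fitness(pop[r][c]):
                    aux[r][c] = child
        pop = aux
        history.append(max(fitness(ind) for row in pop for ind in row))
    return pop, history

population, history = evolve(max_evaluations=2_000, rng=random.Random(7))
```

Because a cell is overwritten only by an offspring that is not worse, the best fitness recorded in `history` never decreases from one generation to the next. With an evaluation budget as the stopping condition, the number of generations follows directly from the budget divided by the population size.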
4.5 Summary and Conclusion

This chapter presented the conceptual framework and explained in detail the steps of the work done in this study. First, the different approaches to gathering data from Twitter were described and compared. Then, the selected data gathering methodology and the working mechanism of the selected data collection tool (Hootsuite) were described. Next, the researcher described the collected dataset and the selected similarity measure (cosine similarity), illustrated by examples of other work that used the cosine similarity measure. The dataset preparation, preprocessing, and representation, as well as the setup of the performed experiments, were then described. Finally, a detailed description of the procedures of the simple Genetic Algorithm and the Cellular Genetic Algorithm was presented.
CHAPTER FIVE EXPERIMENTAL RESULTS AND DISCUSSION
5.1 Introduction

This chapter presents the research findings generated by the tweet clustering experiments, using both the Cellular Genetic Algorithm and the Generational Genetic Algorithm. It also presents the research limitations, a discussion of the results, and finally a conclusion. As mentioned at the end of section 4.3 of chapter 4, the experiments for this study were performed over three datasets of 1,000, 5,000, and 30,000 tweets respectively. Each of the cellular and generational genetic algorithms was run 40 independent times on the 1,000 tweets dataset, 50 independent times on the 5,000 tweets dataset, and once on the 30,000 tweets dataset. Both algorithms were executed using Java on a single 1.90 GHz PC with 8 GB of memory under the Windows 7 operating system. For all three datasets, four types of results were generated for each of the two algorithms, corresponding to four parameters:

1. The fitness value
2. The execution time
3. The number of clusters (taking into consideration that the number of clusters is not predefined)
4. The number of generations for each run

All the results for these four parameters were recorded, and the average fitness value and average execution time (in milliseconds, ms) were then calculated. Finally, the values obtained using the cellular genetic algorithm were compared to those of the generational genetic algorithm in order to select the algorithm that achieves the best fitness (i.e., the highest clustering quality) in the least time. Moreover, the performance of each algorithm was evaluated over the three datasets according to fitness value and execution time.

The remainder of the chapter is organized as follows. Section 5.2 measures the accuracy of the two algorithms against a test set. Section 5.3 presents the results for the dataset of 1,000 tweets. Section 5.4 presents the results for the dataset of 5,000 tweets. Section 5.5 presents the results for the dataset of 30,000 tweets. Section 5.6 compares the number of generations produced by both algorithms. Sections 5.7 and 5.8 evaluate the performance of cGA and genGA over the three datasets according to fitness value and execution time. The limitations of the study are presented in section 5.9. A discussion of the obtained results is presented in section 5.10. Finally, the conclusion of the chapter is presented in section 5.11.
5.2 Accuracy on test dataset

The accuracy of the clustering obtained by the two algorithms was compared against a baseline approach on a test dataset using the following equation:

Equation 3: Clustering Accuracy

The test dataset is composed of 60 tweets equally distributed over three topics: Sport, Politics, and Cinema. The researcher went through those tweets manually and determined their relevance to the three topics. The distribution of tweets in the test set is displayed in Table 5.1.

Table 5.1: Tweet distribution in the test set

| Topic | Number of tweets |
|---|---|
| Sport | 20 |
| Politics | 20 |
| Cinema | 20 |

cGA obtained an accuracy of 95.01%, compared to 94.166% for genGA (a difference of 0.844 percentage points), as displayed in Figure 5.1.
Figure 5.1 Accuracy achieved by both algorithms [bar chart: accuracy (%) and % difference for genGA and cGA]
5.3 Results for 1,000 tweets dataset

The results presented in this section were obtained by running the Cellular and Generational genetic algorithms for 40 independent runs over a dataset of 1,000 tweets.

5.3.1 Average Fitness

A comparison of the average fitness values obtained by the Cellular and Generational genetic algorithms over the 1,000 tweets dataset is displayed in Figure 5.2. The average fitness value of the cellular genetic algorithm is 72.68859389, while that of the generational genetic algorithm is 72.68254524. The average fitness obtained by cGA is about 0.01% higher than that obtained by genGA.

Figure 5.2 Average fitness of genGA and cGA (1,000 tweets) [bar chart: fitness and % difference for genGA and cGA]

5.3.2 Average Execution Time

A comparison of the average execution times of the Cellular and Generational genetic algorithms over the 1,000 tweets dataset is displayed in Figure 5.3. The average execution time of the cellular genetic algorithm is 2389352.2 ms, while that of the generational genetic algorithm is 2847968.75 ms. The average execution time of cGA is thus shorter than that of genGA; the two values differ by 17.51% when the difference is taken relative to their mean.

Figure 5.3 Average execution time of genGA and cGA (1,000 tweets) [bar chart: time (ms) and % difference for cGA and genGA]

5.3.3 Number of Clusters

A comparison of the average number of clusters generated by the Cellular and Generational genetic algorithms over the 1,000 tweets dataset is displayed in Figure 5.4. The average number of clusters generated by the Cellular Genetic Algorithm is approximately 312, while the average number generated by the Generational Genetic Algorithm is approximately 316.

Figure 5.4 Number of clusters generated by genGA and cGA (1,000 tweets) [bar chart: number of clusters for cGA and genGA]
5.4 Results for 5,000 tweets dataset

The results presented in this section were obtained by running the Cellular and Generational genetic algorithms for 50 independent runs over a dataset of 5,000 tweets.

5.4.1 Average Fitness

A comparison of the average fitness values obtained by the Cellular and Generational genetic algorithms over the 5,000 tweets dataset is displayed in Figure 5.5. The average fitness value of the cellular genetic algorithm is 315.9893146, while that of the generational genetic algorithm is 318.2736139. The average fitness obtained by cGA is 0.72% lower than that obtained by genGA.

Figure 5.5 Average fitness of genGA and cGA (5,000 tweets) [bar chart: fitness and % difference for cGA and genGA]

5.4.2 Average Execution Time

A comparison of the average execution times of the Cellular and Generational genetic algorithms over the 5,000 tweets dataset is displayed in Figure 5.6. The average execution time of the cellular genetic algorithm is 19739837.3 ms, while that of the generational genetic algorithm is 21356261.49 ms. The average execution time of cGA is thus shorter than that of genGA; the two values differ by 7.87% when the difference is taken relative to their mean.

Figure 5.6 Average execution time of genGA and cGA (5,000 tweets) [bar chart: time (ms) and % difference for cGA and genGA]

5.4.3 Number of Clusters

A comparison of the average number of clusters generated by the Cellular and Generational genetic algorithms over the 5,000 tweets dataset is displayed in Figure 5.7. The average number of clusters generated by the Cellular Genetic Algorithm is approximately 1865, while the average number generated by the Generational Genetic Algorithm is approximately 1821.

Figure 5.7 Number of clusters generated by genGA and cGA (5,000 tweets) [bar chart: number of clusters for cGA and genGA]
5.5 Results for 30,000 tweets dataset

The results presented in this section were obtained by running the Cellular and Generational genetic algorithms (a single run each) over a dataset of 30,000 tweets.

5.5.1 Fitness

A comparison of the fitness values obtained by the Cellular and Generational genetic algorithms over the 30,000 tweets dataset is displayed in Figure 5.8. The fitness value of the cellular genetic algorithm is 8077.4931640625, while that of the generational genetic algorithm is 8115.17919921875. The fitness obtained by cGA is 0.47% lower than that obtained by genGA.

Figure 5.8 Fitness value of genGA and cGA (30,000 tweets) [bar chart: fitness and % difference for cGA and genGA]

5.5.2 Execution Time

A comparison of the execution times of the Cellular and Generational genetic algorithms over the 30,000 tweets dataset is displayed in Figure 5.9. The execution time of the cellular genetic algorithm is 217341048 ms, while that of the generational genetic algorithm is 247159030 ms. The execution time of cGA is thus shorter than that of genGA; the two values differ by 12.84% when the difference is taken relative to their mean.

Figure 5.9 Execution time of genGA and cGA (30,000 tweets) [bar chart: time (ms) and % difference for cGA and genGA]

5.5.3 Number of Clusters

A comparison of the number of clusters generated by the Cellular and Generational genetic algorithms over the 30,000 tweets dataset is displayed in Figure 5.10. The Cellular Genetic Algorithm generated 7777 clusters, while the Generational Genetic Algorithm generated 7845 clusters.

Figure 5.10 Number of clusters generated by genGA and cGA (30,000 tweets) [bar chart: number of clusters for cGA and genGA]
5.6 Number of Generations

A comparison of the number of generations produced per run by the Cellular and Generational genetic algorithms is displayed in Figure 5.11. The Cellular Genetic Algorithm produces 37500 generations per run, while the Generational Genetic Algorithm produces 30061 generations per run.

Figure 5.11 Number of generations produced by each algorithm [bar chart: number of generations for genGA and cGA]
5.7 cGA performance in all datasets

This section presents an evaluation of the performance of the Cellular Genetic Algorithm over the three datasets. The evaluation is performed in terms of fitness value and execution time.

5.7.1 Average Fitness

A comparison of the average fitness values obtained by the Cellular Genetic Algorithm over all datasets is displayed in Figure 5.12. As the size of the dataset increases, the fitness value obtained increases.

Figure 5.12 cGA fitness for all sets [bar chart: fitness for the 1,000, 5,000, and 30,000 tweets datasets]

5.7.2 Average Execution Time

A comparison of the average execution times of the Cellular Genetic Algorithm over all datasets is displayed in Figure 5.13. As the size of the dataset increases, the time required for execution increases.

Figure 5.13 cGA execution time for all sets [bar chart: execution time (ms) for the 1,000, 5,000, and 30,000 tweets datasets]
5.8 genGA performance in all datasets

This section presents an evaluation of the performance of the Generational Genetic Algorithm over the three datasets. The evaluation is performed in terms of fitness value and execution time.

5.8.1 Average Fitness

A comparison of the average fitness values obtained by the Generational Genetic Algorithm over all datasets is displayed in Figure 5.14. As with cGA, the fitness value obtained increases as the size of the dataset increases.

Figure 5.14 genGA fitness for all sets [bar chart: fitness for the 1,000, 5,000, and 30,000 tweets datasets]

5.8.2 Average Execution Time

A comparison of the average execution times of the Generational Genetic Algorithm over all datasets is displayed in Figure 5.15. As with cGA, the time required for execution increases as the size of the dataset increases.

Figure 5.15 genGA execution time for all sets [bar chart: execution time (ms) for the 1,000, 5,000, and 30,000 tweets datasets]
5.9 Research Limitations

The first limitation of this study concerns the selected data gathering approach. As mentioned in chapter 3, data collection through the scraping-based approach is essentially more difficult than API-based methods, and may still be subject to bandwidth limitations imposed by the site (many social network sites dynamically recognize and block scraping attempts). The scraper also has to cope with redesigns of the social network site, which may occur more frequently and with less notice than changes to the API. In addition, Hootsuite.com does not guarantee that all Twitter messages meeting the criteria of the established archives will be captured, so some tweets may be missing from the gathered corpus. While this may introduce a bias, the concepts for analyzing the tweets remain valid. In spite of this limitation, the scraping-based approach was selected by the researcher because it is not limited to a specific number of calls, as the API-driven approach is. At the same time, it is simpler than the passive network measurement approach.

Another limitation is that the study focuses only on clustering similar Twitter messages, without attempting to analyze the semantics of those messages.

Furthermore, one of the primary limitations of this study is the use of a single language for data gathering (only English-language tweets were gathered).

Finally, the problem has a high computational complexity: conducting several runs of both algorithms and handling the huge number of term-tweet relationships (on the order of millions) requires high processing power (large memory and CPU resources) and a long execution time. Consequently, the size of the largest dataset was limited to 30,000 Twitter messages, which is considered insufficient, and only a single run of each of the Cellular and Generational genetic algorithms was conducted over that dataset.
5.10 Results Discussion

A number of observations can be made from the results presented above. This section is divided into two subsections. Section 5.10.1 compares the results obtained by both algorithms according to the four parameters: the average fitness, the execution time, the number of clusters, and the number of generations produced. Section 5.10.2 discusses the results obtained by each algorithm over all datasets in terms of fitness value and execution time.

5.10.1 Discussion of results obtained by both algorithms

This section discusses the results obtained by cGA and genGA over the three datasets in terms of average fitness, execution time, number of clusters, and number of generations.
5.10.1.1 Average Fitness

Table 5.2 displays the fitness values obtained by both algorithms on the three datasets.

Table 5.2 Fitness values by both algorithms in all datasets

| Dataset size | cGA | genGA |
|---|---|---|
| 1,000 tweets | 72.68859389 | 72.68254524 |
| 5,000 tweets | 315.9893146 | 318.2736139 |
| 30,000 tweets | 8077.4931640625 | 8115.17919921875 |

The difference between the average fitness values of the solutions generated by the two algorithms is slight. It is in favor of the Cellular Genetic Algorithm on the 1,000 tweets dataset, and in favor of the Generational Genetic Algorithm on the 5,000 and 30,000 tweets datasets. The use of small overlapped neighborhoods in cGA maintains population diversity: it enhances exploration of the search space through the relatively smooth spread of the best solutions across the entire population, while exploitation occurs within each neighborhood through the genetic operations. In other words, cGA provides a good tradeoff between exploration and exploitation, and therefore avoids getting stuck in local optima [46].
5.10.1.2 Execution Time

Table 5.3 displays the execution time consumed by both algorithms on the three datasets.

Table 5.3 Execution times by both algorithms in all datasets

| Dataset size | cGA | genGA |
|---|---|---|
| 1,000 tweets | 2389352.2 ms | 2847968.75 ms |
| 5,000 tweets | 19739837.3 ms | 21356261.49 ms |
| 30,000 tweets | 217341048 ms | 247159030 ms |

The Cellular Genetic Algorithm requires a remarkably shorter execution time on all three datasets. This can be attributed to the decentralized structure of the population. The population in cGA is structured into neighborhoods, while it is unstructured in the Generational Genetic Algorithm. This means that an individual in the Generational Genetic Algorithm has to search through the whole population, while a cGA individual interacts only with its nearby neighbors. In other words, the decentralized population structure improves the execution time of the Cellular Genetic Algorithm in comparison to the equivalent Generational Genetic Algorithm [46].
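The percentage differences reported in sections 5.3 to 5.5 can be reproduced from the values in Tables 5.2 and 5.3. The reported figures are consistent with a symmetric percent difference, that is, the absolute difference divided by the mean of the two values. This is an inference from the numbers rather than a formula stated in the text, and the function name `percent_difference` is hypothetical.

```python
def percent_difference(a, b):
    """Absolute difference relative to the mean of the two values, in percent."""
    return abs(a - b) / ((a + b) / 2) * 100

# (cGA, genGA) execution times in ms, taken from Table 5.3
times_ms = {
    "1,000 tweets": (2389352.2, 2847968.75),
    "5,000 tweets": (19739837.3, 21356261.49),
    "30,000 tweets": (217341048, 247159030),
}
# (cGA, genGA) fitness values, taken from Table 5.2
fitness_values = {
    "1,000 tweets": (72.68859389, 72.68254524),
    "5,000 tweets": (315.9893146, 318.2736139),
    "30,000 tweets": (8077.4931640625, 8115.17919921875),
}

time_diffs = {k: round(percent_difference(*v), 2) for k, v in times_ms.items()}
fitness_diffs = {k: round(percent_difference(*v), 2) for k, v in fitness_values.items()}
```

Measured against the genGA time alone instead of the mean, the reductions would be somewhat smaller, so the choice of denominator matters when quoting the speed-up.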
5.10.1.3 Number of Clusters

Table 5.4 displays the number of clusters generated by both algorithms on the three datasets.

Table 5.4 Number of clusters by both algorithms in all datasets

| Dataset size | cGA | genGA |
|---|---|---|
| 1,000 tweets | Approximately 312 | Approximately 316 |
| 5,000 tweets | Approximately 1865 | Approximately 1821 |
| 30,000 tweets | 7777 | 7845 |

Note that the number of clusters is not constant: unlike approaches such as K-means, the algorithms are not given the number of clusters in advance, so it emerges from the search.
5.10.1.4 Number of Generations

Table 5.5 displays the number of generations produced by both algorithms on the three datasets.

Table 5.5 Number of generations of both algorithms in all datasets

| Dataset size | cGA | genGA |
|---|---|---|
| 1,000 tweets | 37500 | 30061 |
| 5,000 tweets | 37500 | 30061 |
| 30,000 tweets | 37500 | 30061 |

The Cellular Genetic Algorithm produces a larger number of generations than the Generational Genetic Algorithm. The cGA figure corresponds to the evaluation budget in Table 4.3: 15,000,000 fitness evaluations divided by 400 individuals per generation gives 37,500 generations. In terms of generations, the Generational Genetic Algorithm is therefore more efficient than cGA (it requires fewer generations to find its solution). The reason is that cGA favors exploration and thus induces a lower selection pressure, so more generations are needed.
5.10.2 Discussion of results obtained by every algorithm

This section discusses the results obtained by each algorithm individually over the three datasets in terms of average fitness and execution time.

5.10.2.1 Cellular Genetic Algorithm

Table 5.6 displays the fitness values and execution times obtained by the cellular genetic algorithm on the three datasets.

Table 5.6 cGA performance in all datasets

| Dataset size | Fitness value | Execution time |
|---|---|---|
| 1,000 tweets | 72.68859389 | 2389352.2 ms |
| 5,000 tweets | 315.9893146 | 19739837.3 ms |
| 30,000 tweets | 8077.4931640625 | 217341048 ms |

As the size of the dataset increases, the complexity of the problem increases. Therefore, both the obtained fitness value and the time required for execution increase.

Table 5.7 displays the fitness values and execution times obtained by the generational genetic algorithm on the three datasets.

Table 5.7 genGA performance in all datasets

| Dataset size | Fitness value | Execution time |
|---|---|---|
| 1,000 tweets | 72.68254524 | 2847968.75 ms |
| 5,000 tweets | 318.2736139 | 21356261.49 ms |
| 30,000 tweets | 8115.17919921875 | 247159030 ms |

As with cGA, the complexity of the problem increases as the size of the dataset increases. Therefore, both the obtained fitness value and the time required for execution increase.
5.11 Conclusion
Based on the obtained results, the Cellular Genetic Algorithm (cGA) was selected over the Generational Genetic Algorithm (genGA). Despite the slight advantage of genGA in terms of average fitness and number of generations, cGA runs considerably faster and with slightly higher accuracy. Therefore, cGA was selected to perform tweet clustering.
CHAPTER SIX CONCLUSION AND FUTURE WORK
Section 6.1 of this chapter presents the final conclusion of the study together with a summary of the work performed. Section 6.2 outlines the future work and enhancements that can be pursued, based upon the obtained results and in light of the study's limitations.
6.1 Conclusion
This study approached the problem of clustering tweets based upon their similarity. The problem is important because tweet content similarity can be used as a similarity measure between Twitter users, which helps to recognize whether those users share similar interests and attributes.
Some traditional clustering algorithms, such as K-means, require a priori knowledge of the number of clusters (i.e. the number of clusters must be known in advance), which is not the case in Twitter. Other traditional clustering algorithms can get stuck in local optima because they explore only a small subset of the potential clusterings. Moreover, the nature of Twitter data results in weak performance of most clustering methods, owing to the overall freedom with which tweets are written.
The researcher applied two subclasses of genetic algorithm to cluster tweets based upon their textual content similarity:
1. The Cellular Genetic Algorithm (cGA)
2. A conventional genetic algorithm: the Generational Genetic Algorithm (genGA)
The accuracy of the clustering obtained by the two algorithms was compared against a baseline approach on a test dataset composed of 60 tweets equally distributed over three distinct topics. The test dataset results demonstrate slightly better accuracy in favor of cGA (0.844% difference).
Then, the researcher compared the clustering results obtained by cGA with those obtained by genGA according to four parameters:
1. The fitness value
2. The execution time
3. The number of clusters
4. The number of generations for each run
The experiments were run on three datasets: one of 1,000 tweets, one of 5,000 tweets, and one of 30,000 tweets. The gathered data represent a variety of real-world topics. The data was collected with a "Scraping-based approach" using a web client called Hootsuite.
The average fitness of the two algorithms is nearly equal over the three datasets. In the 1,000 tweets dataset, cGA achieved a higher fitness by just 0.01%. In the 5,000 tweets dataset, genGA achieved a higher fitness by just 0.72%. In the 30,000 tweets dataset, genGA achieved a higher fitness by just 0.47%.
On the other hand, cGA is much faster than genGA on all datasets. In the 1,000 tweets dataset, cGA was faster by 17.51%. In the 5,000 tweets dataset, cGA was faster by 7.87%. In the 30,000 tweets dataset, cGA was faster by 12.84%.
Regarding the number of generations, genGA is considered more efficient, as it needs 30,061 generations to reach its solution while cGA needs 37,500 generations.
Despite the slight advantage of genGA in terms of average fitness and number of generations, cGA runs considerably faster and with slightly higher accuracy. Therefore, cGA was selected to perform tweet clustering.
Besides achieving the desired research objectives, this study is considered to be one of the first attempts to exploit cGA in the clustering of tweets, which can improve clustering performance in comparison with traditional clustering algorithms, or with those clustering algorithms that require a priori knowledge of the number of clusters, such as K-means. Therefore, this study contributes a new approach for tweet clustering.
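The text does not state how the percentage differences quoted in this section were derived. They appear to be consistent with a symmetric percentage difference, i.e. the absolute difference divided by the mean of the two values; this is an inference from the values in Tables 5.6 and 5.7, not a formula given in the study. The Python sketch below recomputes the figures under that assumption; the function and variable names are illustrative.

```python
def percent_difference(a, b):
    """Symmetric percentage difference: |a - b| relative to the mean of a and b."""
    return abs(a - b) / ((a + b) / 2) * 100


# (cGA, genGA) pairs copied from Table 5.6 and Table 5.7.
fitness = {
    1_000: (72.68859389, 72.68254524),
    5_000: (315.9893146, 318.2736139),
    30_000: (8077.4931640625, 8115.17919921875),
}
exec_time_ms = {
    1_000: (2389352.2, 2847968.75),
    5_000: (19739837.3, 21356261.49),
    30_000: (217341048.0, 247159030.0),
}

for n in fitness:
    f = percent_difference(*fitness[n])
    t = percent_difference(*exec_time_ms[n])
    print(f"{n:>6,d} tweets: fitness differs by {f:.2f}%, execution time by {t:.2f}%")
```

Under this assumption the sketch yields 0.01%, 0.72% and 0.47% for fitness and 17.51%, 7.87% and 12.84% for execution time, matching the values reported above. Because the mean of the two values is used as the denominator, the figures do not depend on which algorithm is taken as the reference.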
6.2 Future Work
This work still has room for improvement in order to overcome the limitations mentioned in section 4.6 of chapter four. For future work, the researcher plans to:
Cluster tweets with multiple runs over a larger dataset
Use parallel computing to minimize the time required for execution, given the high computational complexity of the problem
Perform semantic analysis over the collected Twitter messages
Include more tweet languages in the Twitter dataset (not just the English language)
Use cGA for clustering over other social media
103
References 1.
Kewalramani, M.N., 2011. Community Detection in Twitter. MSc Thesis. University of Maryland, Baltimore County2
2.
Phelan, O., K. McCarthy, and B. Smyth. Using twitter to recommend realtime topical news, 2009. Proceedings of the third ACM conference on Recommender systems, Oct. 22-25, New York, NY, USA. ACM. DOI: 10.1145/1639714.1639794
3.
Nooralahzadeh, F., V. Arunachalam, and C.G. Chiru, 2013.2012 Presidential Elections on Twitter--An Analysis of How the US and French Election were Reflected in Tweets. Proceedings of the 2013 19th International Conference on Control Systems and Computer Science (CSCS 2013), May 29-31, Bucharest, Romania. pp: 240-246. IEEE2
4.
Tumasjan, A., 2010. Predicting elections with Twitter: What 140 characters reveal about political sentiment. Proceedings of the 4th International AAAI Conference on Weblogs and Social Media (WSM’ 10), May 23-26, Washington DC, USA. pp: 178-185. DOI: 10.1002/asi.21149
5.
Conover, M.D., B. Goncalves, J. Ratkiewicz, A. Flammini and F. Menczer, 2011. Predicting the political alignment of Twitter users. Proceedings of the IEEE 3rd International Conference on Privacy, Security, Risk and Trust, Oct. 9-11, IEEE Xplore Press, Boston, MA. pp: 192-199. DOI: 10.1109/PASSAT/SocialCom.2011.34
6.
Bravo-Marquez, F., D. Gayo-Avello, M. Mendoza, and B. Poblete.Opinion Dynamics of Elections in Twitter, 2012. Proceedings of 2012 Eighth Latin American Web Congress (LA-WEB), OCT. 25-27, Cartagena, Colombia. pp: 32-39, IEEE2
7.
Wegrzyn-Wolska, K., and L. Bougueroua. Tweets mining for French Presidential Election, 2012. Proceedings of 2012 Fourth International Conference on Computational Aspects of Social Networks (CASoN 2012), Nov. 21-23, pp: 138-143, IEEE2
8.
Kwak, H., C. Lee, H. Park, and S. Moon. What is Twitter, a social network or a news media? , 2010. Proceedings of the 19th international conference on World Wide Web (www 2010), Apr. 26-30, ACM New York, NY, USA. DOI: 10.1145/1772690.1772751 104
9.
Sakaki, T., M. Okazaki and Y. Matsuo, 2010. Earthquake shakes Twitter users: real-time event detection by social sensors. Proceedings of the 19th International Conference on World Wide Web (www 2010), Apr. 26-30, ACM New York, NY, USA. pp: 851-860. DOI: 10.1145/1772690.1772777
10.
Culotta, A. Towards detecting influenza epidemics by analyzing Twitter messages, 2010. Proceedings of the first workshop on social media analytics (SOMA'2010), Jul 25, ACM, Washington, DC, USA. Pp: 115-122. DOI: 10.1145/1964858.1964874
11.
Rangrej, A., S. Kulkarni and A.V. Tendulkar, 2011. Comparative study of clustering techniques for short text documents. Proceedings of the 20th International Conference Companion on World Wide Web, Mar. 28-Apr. 01, ACM New York, NY, USA. pp: 111-112. DOI: 10.1145/1963192.1963249
12.
Zhang, Y., Y. Wu and Q. Yang, 2012. Community discovery in Twitter based on user interests. J. Comput. Inform. Syst., 8: 991-1,0002
13.
Goyal, P., 2011. Semester project report semester project report data mining and analysis on Twitter data mining and analysis on Twitter 2
14.
Liang, P.W. and B.R. Dai, 2013. Opinion mining on social media data. Proceedings of the IEEE 14th International Conference on Mobile Data Management, Jun. 3-6, IEEE Xplore Press, Milan, pp: 91-96. DOI: 10.1109/MDM.2013.73
15.
Honey, C. and S.C. Herring, 2009. Beyond microblogging: Conversation and collaboration via Twitter. Proceedings of the 42nd Hawaii International Conference on System Sciences, Jan. 5-8, IEEE Xplore Press, Big Island, HI., pp: 1-10. DOI: 10.1109/HICSS.2009.89 2
16.
De Groot, R., 2012. Data mining for tweet sentiment classification. MSc Thesis, Utrecht University, Utrecht, Netherlands2
17.
http://mashable.com/2013/12/17/twitter-popular-languages/, Accessed on January 4, 20142
18.
Karandikar, A., 2010. Clustering short status messages: A topic model based approach. MSc Thesis, University of Maryland, Baltimore County 2
19.
Perez-Tellez, F., D. Pinto, J. Cardiff and P. Rosso, 2010. On the difficulty of clustering company tweets. Proceedings of the 2nd International Workshop on Search and Mining User-Generated Contents, Oct. 26-30, ACM New York, NY, USA. pp: 95-102. DOI: 10.1145/1871985.1872001 105
20.
Bollen, J., H. Mao, and X. Zeng, 2011. Twitter mood predicts the stock market. Journal of Computational Science 2.1: 1-82
21.
Java, A., 2008. Mining social media communities and content. PhD Thesis, University of Maryland, Baltimore County2
22.
Bicen, H., and N. Cavus, 2010. The most preferred social network sites by students. Procedia-Social and Behavioral Sciences 2.2: 5864-5869. DOI: 10.1016/j.sbspro.2010.03.958
23.
IAB Platform Status Report: User Generated Content, Social Media, and Advertising — An Overview, April 2008. Available on: http://www.iab.net/media/file/2008_ugc_platform.pdf , Accessed on January 25, 20142
24.
Wang, Z., 2010. Social media data analysis: A dynamic perspective. NATO Research and Technology Organisation, Ottawa, Ontario K1A 0K2, Canada 2
25.
Khabiri, E., 2013. Ranking, labeling, and summarizing short text in social media: PhD Thesis, Texas A&M University2
26.
Gundecha, P., and H. Liu, 2012. Mining Social Media: A Brief Introduction. Tutorials in Operations Research 1.4. DOI: 10.1287/educ.1120.0105
27.
Aggarwal, C.C., Social network data analytics, 2011. ISBN 978-1-44198461-6 e-ISBN 978-1-4419-8462-3, Springer DOI: 10.1007/978-1-44198462-3.
28.
Benevenuto, F., T. Rodrigues, M. Cha, and V. Almeida, 2009. Characterizing user behavior in online social networks. Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference. ACM. DOI: 10.1145/1644893.1644900. pp: 49-62
29.
Social Media Update, Pew Research Center http://pewinternet.org/~/media//Files/Reports/2013/Social%20Networking%2 02013_PDF.pdf, Accessed on January 16, 20142
30.
Krishnamurthy, B., P. Gill, and M. Arlitt. A few chirps about twitter, 2008. Proceedings of the first workshop on online social networks. Seattle, WA, USA — Aug. 17 - 22, ACM. DOI: 10.1145/1150402.1150476
31.
Java, A., X. Song, T. Finin and B. Tseng, 2007. Why we Twitter: Understanding microblogging usage and communities. Proceedings of the 9th Workshop on Web Mining and Social Network Analysis, Aug. 12-15, ACM New York, NY, USA, pp: 56-65. DOI: 10.1145/1348549.1348556 106
32.
Becker, H., 2011. Identification and characterization of events in social media. PhD Thesis, Columbia University 2
33.
Mosley, J. and C. Roosevelt, 2012. Social media analytics: Data mining applied to insurance Twitter posts. Casualty Actuarial Society E-Forum 2
34.
Moseley, N., 2013. Using Word and Phrase Abbreviation Patterns to Extract Age from Twitter Micro texts. MSc Thesis, Rochester Institute of Technology, New York, NY, USA2
35.
Sankaranarayanan, J., H. Samet, B.E. Teitler, M.D. Lieberman and J. Sperling, 2009. Twitterstand: News in tweets. Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Nov. 04-06, ACM New York, NY, USA, pp: 42-51. DOI: 10.1145/1653771.1653781
36.
http://www.alexa.com/siteinfo/twitter.com, Accessed on March 20, 20142
37.
http://www.statisticbrain.com/twitter-statistics/, Accessed on January 5, 20142
38.
http://www.statista.com/chartoftheday/twitter/, Accessed on January 4, 20142
39.
http://www.statista.com/topics/737/twitter/chart/1629/twitter-penetration/, Accessed on January 21, 20142
40.
Premalatha, K. and A.M. Natarajan, 2009. Genetic algorithm for document clustering with simultaneous and ranked mutation. Modern Applied Sci., 3: 75-82. 1
41.
Begelman, G., P. Keller, and F. Smadja, 2006. Automated tag clustering: Improving search and exploration in the tag space. Collaborative Web Tagging Workshop at WWW, May 23-26, Edinburgh, Scotland. DOI: 10.1.1.120.5736
42.
Jajoo, P., 2008. Document clustering. MTech Thesis. Indian Institute of Technology Kharagpur, Kharagpur, West Bengal, India2
43.
Jian-Xiang, W., L. Huai, S. Yue-Hong, and S. Xin-Ning, 2009. Application of Genetic Algorithm in Document Clustering. Information Technology and Computer Science. Proceedings of the International Conference on Information Technology and Computer Science (ITCS 2009), Jul 25-26, Kiev, Ukraine. Vol. 1. IEEE, 2009. DOI: 10.1109/ITCS.2009.269
44.
Tan, P.N., M. Steinbach, and V. Kumar, 2006. Introduction to data mining. Library of Congress 2
107
45.
Amitava, D. and S. Bandyopadhyay, 2010. Subjectivity detection using genetic algorithm. Proceedings of the 1st Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA10), Lisbon, Portugal, August2
46.
Alba, E. and B. Dorronsoro, 2008. Cellular Genetic Algorithms Electronic Resource. 1st Edn., Springer, New York, ISBN-10: 0387776109.
47.
Usharani, J. and K. Iyakutti, 2013. A genetic algorithm based on cosine similarity for relevant document retrieval. Int. J. Eng. Res. Technol2
48.
Alba, E., and B. Dorronsoro, 2004. Solving the vehicle routing problem by using cellular genetic algorithms. Proceedings of the Evolutionary Computation in Combinatorial Optimization. Springer Berlin Heidelberg, Apr. 5-7, Springer Berlin Heidelberg, Coimbra, Portugal, pp: 11-20. DOI: 10.1007/978-3-540-24652-7_2
49.
Khaliessizadeh, S.M., 2006. Genetic mining: Using genetic algorithm for topic based on concept distribution. Proceedings of the World Academy of Science, Engineering and Technology, (SET’ 06)
50.
Guzek, M., J. E. Pecero, B. Dorronsoro, P. Bouvry, and S. U. Khan, 2010. A cellular genetic algorithm for scheduling applications and energy-aware communication optimization. Proceedings of 2010 International Conference on High Performance Computing and Simulation (HPCS 2010), Jun 28-Jul 2, Caen, France. pp: 241-248. on IEEE. DOI:10.1109/HPCS.2010.55471242
51.
Rudolph, G., and J. Sprave, 1995. A cellular genetic algorithm with selfadjusting acceptance threshold. Genetic Algorithms in Engineering Systems: Innovations and Applications. GALESIA. First International Conference on (Conf. Publ. No. 414), Sep 12-14. IET2
52.
De Felice, M., S. Meloni, and S. Panzieri, 2012. Influence of Topological Features on Spatially-Structured Evolutionary Algorithms Dynamics. arXiv preprint arXiv: 1202.0678c12
53.
Simoncini, D., P. Collard, S. Verel, and M. Clergue, 2007. On the influence of selection operators on performances in cellular genetic algorithms. Proceedings of Evolutionary Computation, (CEC) 2007, IEEE Congress on. IEEE2
108
54.
Alba, E., and B. Dorronsoro, 2005. The exploration/exploitation tradeoff in dynamic cellular genetic algorithms. Evolutionary Computation, IEEE Transactions on 9.2: 126-1422
55.
Morales-Reyes, A., A. Al-Naqi, A.T. Erdogan and T. Arslan, 2009. Towards 3D architectures: A comparative study on cellular GAs dimensionality. Proceedings of Adaptive Hardware and Systems, Jul. 29-Aug. 1, IEEE Xplore Press, San Francisco, CA. pp: 223-229. DOI: 10.1109/AHS.2009.29
56.
http://www.policymic.com/articles/10642/twitter-revolution-how-the-arabspring-was-helped-by-social-media, Accessed on January 31, 2014
57.
Xiao, Y, 2010. A Survey of Document Clustering Techniques & Comparison of LDA and moVMF2
58.
Sathiyakumari, K., V. Preamsudha, and G. Manimekalai, 2011. A Survey on Various Approaches in Document Clustering. International Journal Computer Technology 2.5: 1534-15392
59.
Steinbach, M., G. Karypis, and V. Kumar, 2000. A comparison of document clustering techniques. KDD workshop on text mining. Vol. 400. No. 12
60.
Huang, A., 2008. Similarity measures for text document clustering. Proceedings of the sixth New Zealand computer science research student conference (NZCSRSC2008), Apr 14-17. Christchurch, New Zealand. pp: 49-562
61.
http://en.wikipedia.org/wiki/John_Henry_Holland, Accessed on February 12,2014
62.
http://en.wikipedia.org/wiki/Hans-Joachim_Bremermann, Accessed on March 15, 2014
63.
https://www.google.com.eg/#q=alex+fraser, Accessed on March 15, 2014
64.
Melanie, M., 1999. An introduction to genetic algorithms. Cambridge, Massachusetts London, England, Fifth printing 3. Melanie, M., (Ed.), ISBN: 0−262−13316−4 (HB), 0−262−63185−7 (PB)2
65.
Holland, J.H., 1975. Adaptation in natural and artificial systems: An introductory analysis with applications to biology, control, and artificial intelligence. U Michigan Press (Second edition: MIT Press, 1992)
66.
De Jong, K.A., 1975. Analysis of the behavior of a class of genetic adaptive systems. Ph.D. thesis, University of Michigan, Ann Arbor2
109
67.
Robertson, G.G., 1987. Parallel implementation of genetic algorithms in a classifier system. Genetic algorithms and their applications: proceedings of the second International Conference on Genetic Algorithms: July 28-31, 1987 at the Massachusetts Institute of Technology, Cambridge, MA. Hillsdale, NJ: L. Erlhaum Associates, 19872
68.
Mühlenbein, H., M.G. Schleuter, and O. Krämer, 1988. Evolution algorithms in combinatorial optimization. Parallel Computing 7.1: 65-852
69.
Goldberg, D.E., 1989. Genetic algorithms in search, optimization, and machine learning. Addison−Wesley. ISBN: 0-201-15767-52
70.
Manderick, B., and P. Spiessens, 1989. Fine-grained parallel genetic algorithms. Proceedings of the third international conference on Genetic algorithms, Fairfax, Virginia, USA. Morgan Kaufmann Publishers Inc. pp: 428-4332
71.
Spiessens, P., and B. Manderick, 1991. A massively parallel genetic algorithm: Implementation and first analysis. Proceedings of the Fourth International Conference on Genetic Algorithms, San Diego, California, USA. pp: 279-2872
72.
Hoffmeister, F, 1991. Scalable parallelism by evolutionary algorithms. Parallel Computing and Mathematical Optimization. Springer Berlin Heidelberg. pp: 177-1982
73.
Back, T., D.B. Fogel, and Z. Michalewicz, 1997. Handbook of evolutionary computation. Oxford University Press. ISBN:0750303921
74.
Whitley, L. D., 1993. Cellular genetic algorithms. Proceedings of the 5th International Conference on Genetic Algorithms, Urbana-Champaign, IL, USA. Morgan Kaufmann Publishers Inc2
75.
Adamic, L., and E. Adar., 2005. How to search a social network. Social Networks 27.3: 187-2032
76.
Maia, M., J. Almeida, and V. Almeida, 2008. Identifying user behavior in online social networks. Proceedings of the 1st workshop on Social network systems (SNS 2008), Glasgow, Scotland, UK, April 1, 2008. ACM
77.
Zhou, D., 2008. Mining Social Documents and Networks. PhD Thesis, the Pennsylvania State University2
78.
Cormode, G., B. Krishnamurthy, and W. Willinger, 2010. A manifesto for modeling and measurement in social media. First Monday 15.92 110
79.
Gozzo, S., and R. D’Agata, 2010. Social networks and political participation in a Sicilian community context. Procedia-Social and Behavioral Sciences 4: 49-582
80.
Kuan, Z., 2010. Mining Antagonistic Communities from Social Networks. MSc Thesis, Singapore Management University2
81.
Yun, S., H. Do, and H.G. Kim, 2010. Analysis of user interactions in online social networks. Yun S. Proceedings of the 19th international Conference on World Wide Web (www 2010), Apr. 26-30, ACM New York, NY, USA. Vol. 22
82.
Agrawal, D., B. Bamieh, C. Budak, A. El Abbadi, A. Flanagin, and S. Patterson, 2011. Data-driven modeling and analysis of online social networks. Web-Age Information Management. Springer Berlin Heidelberg, 3-172
83.
Gloor, P.A., K. Fischbach, H. Fuehres, C. Lassenius, T. Niinimäki, D. O. Olguin, S. Pentland, A. Piri, and J. Putzke, 2011. Towards “Honest Signals” of Creativity–Identifying Personality Characteristics through Microscopic Social Network Analysis. Procedia-Social and Behavioral Sciences 26: 1661792
84.
Malik, H., and A.S. Malik, 2011. Towards identifying the challenges associated with emerging large scale social networks. Procedia Computer Science 5: 458-465. DOI: 10.1016/j.sbspro.2011.10.573
85.
Stieglitz, S., and C. Kaufhold, 2011. Automatic full text analysis in public social media–adoption of a software prototype to investigate political communication. Procedia Computer Science 5: 776-781. DOI: 10.1016/j.procs.2011.07.104
86.
Takaffoli, M., F. Sangi, J. Fagnan, and O. R. Za¨ıane, 2011. Community evolution mining in dynamic social networks. Procedia-Social and Behavioral Sciences 22: 49-58. DOI: 10.1016/j.sbspro.2011.07.0552
87.
Derczynski, L.R.A, B. Yang, and C.S. Jensen, 2013. Towards context-aware search and analysis on social media data. Proceedings of the 16th International Conference on Extending Database Technology (EDBT/ICDT 2013), Mar 18-22, Genoa, Italy. ACM2
111
88.
Jiang, J., C. Wilson, X. Wang, P. Huang, W. Sha, Y. Dai, and B. Y. Zhao, 2013. Understanding latent interactions in online social networks. ACM Transactions on the Web (TWEB 2013) 7.4: 182
89.
Santiago, N.G, 2013. Data mining social media networks for terrorist events indicators. MSc Thesis. Polytechnic University of Puerto Rico2
90.
Stieglitz, S., and L.D. Xuan, 2013. Social media and political communication: a social media analytics framework. Social Network Analysis and Mining 3.4: 1277-1291. DOI 10.1007/s13278-012-0079-3
91.
Huberman, B. A., D.M. Romero, and F. Wu, 2008. Social networks that matter: Twitter under the microscope. arXiv preprint arXiv: 0812.10452
92.
Zhao, D., and M.B. Rosson, 2009. How and why people Twitter: the role that micro-blogging plays in informal communication at work. Proceedings of the ACM 2009 international conference on Supporting group work (Group '09), Sanibel Island, FL, USA, May 10 - 13. ACM. pp: 189-1922
93.
Naaman, M., J. Boase, and C.H. Lai, 2010. Is it really about me? : message content in social awareness streams. Proceedings of the 2010 ACM conference on Computer supported cooperative work (CSCW 2010), Feb 610, Savannah, Georgia, USA. ACM2
94.
Weerkamp, W., S. Carter, and M. Tsagkias, 2011. How people use twitter in different languages: 1-22
95.
Goel, A., A. Sharma, D. Wang, and Z. Yin, 2013. Discovering Similar Users on Twitter. Proceedings of the Workshop on Mining and Learning with Graphs (MLG-2013), Aug 11, Chicago, Illinois, USA2
96.
Wegrzyn-Wolska, K., L. Bougueroua, and G. Dziczkowski, 2011. Social media analysis for e-health and medical purposes. Proceedings of 2011 International Conference on Computational Aspects of Social Networks (CASoN 2011), Oct 19-21, Salamanca, Spain. IEEE. pp: 278-2832
97.
Yoon, S. Application of social network analysis and text mining to characterize network structures and contents of microblogging messages: An observational study of physical activity-related tweets, 2011. PhD Thesis. Columbia University2
98.
Chew C, Eysenbach G, 2010. Pandemics in the Age of Twitter: Content Analysis of Tweets during the 2009 H1N1 Outbreak. PLoS ONE 5(11): e14118. DOI: 10.1371/journal.pone.00141182 112
99.
Lampos, V., and N. Cristianini, 2010. Tracking the flu pandemic by monitoring the social web. Proceedings of the 2nd International Workshop on Cognitive Information Processing (CIP 2010), Jun 14-16, Elba Island (Tuscany), Italy. DOI, 10.1109/CIP.2010.56040882
100.
Achrekar, H., A. Gandhe, R. Lazarus, S. Yu, and B. Liu, 2011. Predicting flu trends using twitter data. Proceedings of 2011 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), Apr 10-15, Shanghai, China. IEEE. pp: 702-707. DOI:10.1109/infcomw.2011.59289032
101.
Paul, M. J., and M. Dredze, 2011. You Are What You Tweet: Analyzing Twitter for Public Health. Proceedings of Fifth International AAAI Conference on Weblogs and Social Media (WSM'2011), Jul 17-21, Barcelona, Spain. pp: 265-272. DOI: 10.1.1.224.99742
102.
Signorini A., A.M. Segre, and P.M. Polgreen, 2011. The Use of Twitter to Track Levels of Disease Activity and Public Concern in the U.S. during the Influenza A H1N1 pandemic. PloS one 6.5 (2011): e194672
103.
Diakopoulos, N.A., and D.A. Shamma, 2010. Characterizing debate performance via aggregated twitter sentiment. Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI 2010), Apr 1015, Atlanta, Georgia, USA. ACM. pp: 1195-1198
104.
Zhou, Z., R. Bandari, J. Kong, H. Qian, and V. Roychowdhury, 2010. Information resonance on Twitter: watching Iran. Proceedings of the First Workshop on Social Media Analytics, (SOMA'2010), Jul 25, Washington, DC, USA. ACM. pp: 123-131. DOI: 10.1145/1964858.19648752
105.
Bollen, J., H. Mao, and X. Zeng, 2011. Twitter mood predicts the stock market. Journal of Computational Science 2.1: 1-82
106.
Zhang, X., H. Fuehres, and P.A. Gloor, 2011. Predicting stock market indicators through twitter “I hope it is not as bad as I fear”. Procedia-Social and Behavioral Sciences 26: 55-62. DOI: 10.1016/j.sbspro.2011.10.562
107.
Jansen, B.J., M. Zhang, K. Sobel, and A. Chowdury. Twitter power: Tweets as electronic word of mouth, 2009. Journal of the American society for information science and technology 60.11: 2169-2188. DOI: 10.1002/asi.211492
113
108.
Bulearca, M., and S. Bulearca, 2010. Twitter: a viable marketing tool for SMEs. Global Business and Management Research: An International Journal 2.4: 296-3092
109.
Becker, H., F. Chen, D. Iter, M. Naaman, and L. Gravano, 2011. Automatic Identification and Presentation of Twitter Content for Planned Events. Proceedings of Fifth International AAAI Conference on Weblogs and Social Media (WSM'2011), Jul 17-21, Barcelona, Spain. pp: 655-656. DOI: 10.1.1.225.15552
110.
Jackoway, A., H. Samet, and J. Sankaranarayanan, 2011. Identification of live news events using Twitter. Proceedings of the 3rd ACM SIGSPATIAL International Workshop on Location-Based Social Networks (LBSN 2011), Nov 1, 2011, Chicago, Illinois, USA. pp: 25-32. DOI: 10.1145/2063212.20632242
111.
Earle, P.S., D.C. Bowden, and M. Guy, 2012. Twitter earthquake detection: earthquake monitoring in a social world." Annals of Geophysics 54.6. DOI: 10.4401/ag-5364. pp: 708-7152
112.
Crooks, A., A. Croitoru, A. Stefanidis, and J. Radzikowski, 2013. # Earthquake: Twitter as a distributed sensor system. Transactions in GIS 17.1: 124-1472
113.
Meena, Y.K., Shashank and V.P. Singh, 2012. Text Documents Clustering using Genetic Algorithm and Discrete Differential Evolution. International Journal of Computer Applications 43(1):16-19, April. Published by Foundation of Computer Science, New York, USA. DOI: 10.5120/6067-8221
114.
Zamir, O., and O. Etzioni, 1999. Grouper: a dynamic clustering interface to Web search results. Computer Networks 31.11: 1361-1374. DOI: 10.1.1.31.8216
115.
Zeng, H., Q. He, Z. Chen, W. Ma, J. Ma, 2004. Learning to cluster web search results. Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, Jul 25-29, Sheffield, UK. DOI: 10.1145/1008992.10090302
116.
Zhu, Y.H., G. Dai, B. C. M. Fung, and D. Mu, 2006. Document Clustering Method Based on Frequent Co-occurring Words. Proceedings of the 20th Pacific Asia Conference on Language, Informatics, and Computation (PACLIC 2006), Nov 1-3, Wuhan, China. pp: 442-445 114
117.
Aggarwal, C.C., and C.X. Zhai, 2012. A survey of text clustering algorithms. Mining text data. Springer US. pp: 77-1282
118.
Koteeswaran, S., P. Visu and J. Janet, 2012. A review on clustering and outlier analysis techniques in data mining. Am. J. Applied Sci., 9: 254258.DOI: 10.3844/ajassp.2012.254.258
119.
Anastasiu, D.C., A. Tagarelli, and G. Karypis, 2013. Document Clustering: The Next Frontier. Technical Report. University of Minnesota. DOI: 10.1.1.401.8428
120.
Jones, G., A. M. Robertson, C. Santimetvirul, and P. Willett, 1995. Nonhierarchic document clustering using a genetic algorithm. Information Research, 1(1). Available at: http://InformationR.net/ir/1-1/paper1.html2
121.
Maulik, U., and S. Bandyopadhyay, 2000. Genetic algorithm-based clustering technique. Pattern recognition 33.9: 1455-1465. DOI: 10.1.1.19.1878
122.
Casillas, A., M.T. Gonzalez De Lena and R. Martinez, 2003. Document clustering into an unknown number of clusters using a genetic algorithm. Proceedings of the 6th International Conference on Text Speech and Dialogue, Sept. 8-12, Springer Berlin Heidelberg, Czech Republic, pp: 43-49. DOI: 10.1007/978-3-540-39398-6_7
123.
Verma, H., E. Kandpal, B. Pandey, and J. Dhar, 2010. A Novel Document Clustering Algorithm Using Squared Distance Optimization Through Genetic Algorithms. International Journal on Computer Science & Engineering 2.5. pp: 1875-18792
124.
Khot, T., 2010. Clustering Twitter feeds using word co-occurrence. University of Wisconsin. Available on: http://www.learningace.com/doc/1912964/627012c3a31a544c6f20fc8aa2482 935/cs784
125.
Yamashita, T., H. Sato, S. Oyama, and M. Kurihara, 2013. Classification of Twitter Users Based on Following Relations. Proceedings of the International Multi Conference of Engineers and Computer Scientists (imecs'2013), Mar 13-15, Hong Kong. Vol. 1. DOI: 10.1145/1772690.17727512
126.
Bernstein, M.S., B. Suh, L. Hong, J. Chen, S. Kairam, and E. H. Chi, 2010. Eddi: interactive topic-based browsing of social status streams. Proceedings of the 23rd annual ACM symposium on User interface software and
115
technology (UIST'2010), Oct 3-6, New York, NY, USA. DOI: 10.1145/1866029.18660772 127.
O’Connor, B., M. Krieger and D. Ahn, 2010. Tweet motif: Exploratory search and topic summarization for Twitter. Proceedings of the 4th International AAAI Conference on Weblogs and Social Media (WSM’ 10), May 23-26, Washington DC, USA. pp: 384-385. DOI: 10.1.1.365
128.
Rosa, K.D., R. Shah, B. Lin, A. Gershman, and R. Frederking, 2011. Topical clustering of tweets. Proceedings of the 34th Annual ACM SIGIR Conference, Jul 24-28, Beijing, China2
129.
Kim, S., S. Jeon, J. Kim and P. Young-Ho, 2012. Finding core topics: Topic extraction with clustering on tweet. Proceedings of the 2nd International Conference on Cloud and Green Computing, Nov. 1-3, IEEE Xplore Press, Xiangtan, pp: 777-782. DOI: 10.1109/CGC.2012.120
130.
Rafea, A., and N. A. Mostafa, 2013. Topic extraction in social media. Proceedings of the 2013 International Conference on Collaboration Technologies and Systems (CTS 2013), May 20-24, IEEE San Diego, California, USA., pp: 94-98. DOI: 10.1109/CTS.2013.6567212
131.
Banerjee, S., K. Ramanathan and A. Gupta, 2007. Clustering short texts using Wikipedia. Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Jul. 2327, ACM New York, NY, USA., pp: 787-788. DOI: 10.1145/1277741.1277909
132.
Gabrilovich, E., and S. Markovitch, 2007. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. Proceedings of the 20th IJCAI, Jan. 9-12, Vol.7, Hyderabad, India. pp: 1606-1611. DOI: 10.1.1.76.9790
133.
Chen, Q., T. Shipper, and L. Khan, 2010. Tweets mining using WIKIPEDIA and impurity cluster measurement. Proceedings of IEEE International Conference on Intelligence and Security Informatics (ISI'2010), May 23-26. pp: 141-143. DOI: 10.1109/ISI.2010.5484758
134.
Alba, E., B. Dorronsoro, F. Luna, A.J. Nebro and P. Bouvry, 2007. A cellular multi-objective genetic algorithm for optimal broadcasting strategy in metropolitan MANETs. Comput. Commun. , 30: 685-697. DOI: 10.1109/IPDPS.2005.4 116
135.
Nebro, A. J., J. J. Durillo, F. Luna, B. Dorronsoro, and E. Alba, 2009. Mocell: A cellular genetic algorithm for multiobjective optimization. International Journal of Intelligent Systems 24.7: 726-746. DOI: 10.1002/int.203582
136.
Khezri, S., and A.Hazrati, 2013. Sensor Placement in WSN Using the Cellular Genetic Algorithm. J. Basic. Appl. Sci. Res., 3(2s): 745-7502
137.
Yugui, C., 2013. Electric Energy Demand Forecast of Nanchang based on Cellular Genetic Algorithm and BP Neural Network. TELKOMNIKA Indonesian Journal of Electrical Engineering 11.7. pp:. 3821- 3825
138.
Lu, C., 2011. Exploiting social tagging network for web mining and search. PhD Thesis, Drexel University, Philadelphia, USA2
139.
https://dev.twitter.com/docs/rate-limiting/1, Accessed on March 5, 2014
140.
https://hootsuite.com/features/social-networks,Accessed on March 7, 2014
141.
http://www.alexa.com/siteinfo/hootsuite.com, Accessed on March 20, 2013
142.
Schenk, C.B., 2004. Finding Event-Specific Influencers in Dynamic Social Networks. MSc Thesis. The Pennsylvania State University2
143.
Lee, M., W. Wang, and H. Yu, 2006. Exploring supervised and unsupervised methods to detect topics in biomedical text. BMC bioinformatics7.1: 140. DOI: 10.1186/1471-2105-7-140
144.
Sayyadi, H., M. Hurst, and A. Maykov, 2009. Event Detection and Tracking in Social Streams. Proceedings of 3rd International AAAI Conference on Weblogs and Social Media (ICWSM'09), May 17-20, San Jose, California, USA. pp: 311-314. DOI: 10.1.1.187.7972
145.
Reeves, C.R., 1993. Genetic Algorithms. In: Modern Heuristic Techniques for Combinatorial Problems, In: Blackwell Scientific Publications, Reeves, C.R., (Ed.), Oxford, ISBN-10: 04702207912
146.
Misevičius, A. and B. Kilda, 2005. Comparison of crossover operators for the quadratic assignment problem. Inform. Technol. Control, 34: 109-119. DOI: 10.1.1.132.568
147.
Hugosson, J., E. Hemberg, A. Brabazon and M. O'Neill, 2007. An investigation of the mutation operator using different representations in Grammatical Evolution. Proceedings of the 2nd International Symposium Advances in Artificial Intelligence and Applications, Oct. 15-17, Wisla, Poland, pp: 409-4192 117
ABSTRACT

Social networking sites have become an essential part of the daily activity of Internet users, since they allow users to exchange information and communicate with one another. There is now a pressing need to analyze, in a sound manner, the enormous and unprecedented volume of content contributed by the users of these sites. In recent years Twitter has emerged as one of the most popular of them: in 2013 the number of active registered Twitter accounts reached 545,051,111, producing an average of 55 million tweets per day, according to a report by the statistics website Statisticbrain.com. As Twitter's popularity continues to grow rapidly, it has become essential to analyze the huge amount of data that its users generate. Twitter is a primary source of real-time information on a wide variety of topics, such as sporting events, advertisements, political campaigns, large-scale emergencies, crises, healthcare, etc.

Clustering is one of the most widely used methods for analyzing tweets. Because most tweets are textual by nature, this study focuses on clustering tweets according to the similarity of their text content. Statistics indicate that English is the most widely used language on Twitter (34% of all tweets are written in English, according to a report by Mashable, the British-American news website that covers technology and social media). Accordingly, this study clusters tweets written in English.

A scraping-based technique was used to collect the data required for this study. The technique gathers data from Twitter by means of Hootsuite, an application used to collect and analyze data across various social networking sites. Using this application, a set of tweets was collected over 3 days, from 26 to 28 June 2013, based on a set of topic headings describing specific and varied subjects, in order to study multiple groups of Twitter users' interests. The experiments were carried out on three datasets of different sizes using genetic algorithms.

Genetic Algorithms belong to the family of Evolutionary Algorithms, which are techniques designed to find optimal solutions among a set of candidate solutions (individuals). Genetic algorithms are probabilistic search methods that mimic the mechanisms of natural biological evolution in order to discover solutions to problems. A subclass of genetic algorithms, the cellular genetic algorithm, was used in this study to cluster a set of tweets. This study is one of the first attempts to cluster tweets using the cellular genetic algorithm (cGA), which improves the clustering results compared with those of both traditional clustering algorithms and clustering algorithms that require prior knowledge of the number of clusters, such as K-means.

The results obtained by the cGA were compared with those obtained by the traditional genetic algorithm, the generational genetic algorithm (genGA). The comparison was made according to four criteria: mean fitness value, mean execution time, number of resulting clusters, and number of generations. The results indicate that the cGA generally outperforms the genGA.

Keywords: Clustering, Cellular Genetic Algorithm, cGA, Twitter, Tweet Similarity.
Arab Academy for Science and Technology & Maritime Transport College of Computing & Information Technology
CELLULAR GENETIC ALGORITHM APPLICATION FOR CLUSTERING TWEETS IN TWITTER ANALYSIS
A Thesis Submitted in Partial Fulfillment of the Requirements for the degree of Master of Science in Information Systems, College of Computing & Information Technology (Cairo) Submitted By Amr Adel AbdelRahim ElSayed
Supervised by Prof. Dr. Amr Ahmed Badr, Professor of Computer Sciences, Faculty of Computers and Information, Cairo University
Dr. Essameldean Fawzy ElFakharany, Lecturer of Information Systems, College of Management and Technology, Arab Academy for Science and Technology and Maritime Transport
May 2014