Arab Academy for Science and Technology & Maritime Transport
College of Computing & Information Technology

CELLULAR GENETIC ALGORITHM APPLICATION FOR CLUSTERING TWEETS IN TWITTER ANALYSIS

A Thesis Submitted to the College of Computing & Information Technology in Partial Fulfillment of the Requirements for the award of degree of

MASTER of Science in Information Systems

Submitted By

Amr Adel AbdelRahim ElSayed
Egypt

Supervised by

Prof. Dr. Amr Ahmed Badr
Professor of Computer Sciences, College of Computer Sciences and Information Systems, Cairo University

Dr. Essameldean Fawzy ElFakharany
PhD, College of Management and Technology, Arab Academy for Science and Technology and Maritime Transport

May 2014

DECLARATION

I certify that all the material in this thesis that is not my own work has been identified, and that no material is included for which a degree has been previously conferred on me. The contents of this thesis reflect my own personal views, and are not necessarily endorsed by the University.

Signature: ……………….. Date: …………………..

DEDICATED To My Parents With all love


ACKNOWLEDGMENTS

First and foremost, I would like to thank God, who gave me the strength and power for the accomplishment of this thesis. I would like to take this opportunity to express my deep sense of sincere gratitude and profound feeling of admiration to my thesis supervisors, Professor Dr. Amr Badr and Dr. Essameldean ElFakharany, for their continuous support, patience, motivation, and immense knowledge. I would like to extend my sincere gratitude and appreciation to all the people who helped me to complete this thesis. Last but not least, I would like to thank my parents for their love and continuous support.

ABSTRACT

Social media has become an essential part of the daily online experience. It enables information sharing and communication between online users. The unprecedented amount of user-generated content produced by social media needs to be analyzed properly. Twitter has emerged in recent years as an extremely popular micro-blogging social media platform, with the number of registered Twitter accounts reaching 645,750,000 active accounts by 2013 and generating an average of 58 million tweets per day (according to the statistical analysis website statisticbrain.com). As the popularity of Twitter continues to increase rapidly, it is necessary to analyze the huge amount of data that Twitter users generate. Twitter is an essential source of real-time information across a wide variety of interests, including sports events, advertising, political campaigns, mass emergencies, crisis events, and health care. A popular method of tweet analysis is clustering. Because most posted Twitter messages are textual in nature, this study focuses on clustering tweets based on their textual content similarity. Moreover, since English is the most popular language on Twitter (34% of all tweets are in English, according to a report by the British-American news website and technology and social media blog "Mashable"), the study focuses on clustering Twitter messages written in English. A scraping-based technique was employed to gather data from Twitter, using Hootsuite as a social network aggregator. The Twitter messages were collected over a 3-day period, from the 26th to the 28th of June 2013, based on a set of keywords that describe diverse real-world topics, in order to cover wide areas of interest. Experimental studies were performed on three datasets of different sizes.

Genetic Algorithms belong to the class of evolutionary computational algorithms, which are population-based optimization techniques designed for finding globally optimal solutions from a pool of feasible solutions (individuals). Genetic Algorithms are probabilistic search methods whose mechanisms are analogous to the natural process of biological evolution.

A subclass of Genetic Algorithms, the Cellular Genetic Algorithm (cGA), was used in this study to cluster tweets. Based on the literature review, this study is one of the earliest attempts at tweet clustering with a cGA, which can improve clustering performance in comparison with traditional clustering algorithms, or with clustering algorithms that require a priori knowledge of the number of clusters, such as K-means. The results obtained by cGA are compared with those obtained by a conventional Genetic Algorithm, the Generational Genetic Algorithm (genGA). The comparison is made according to four parameters: the average fitness value, the average execution time, the number of generated clusters, and the number of generations. The results indicate a better overall performance of cGA in comparison to genGA.

Keywords: Clustering, Cellular Genetic Algorithm, cGA, Twitter, Tweet Similarity.

PUBLISHED WORK

Adel, A., E. ElFakharany and A. Badr, 2014. Clustering tweets using cellular genetic algorithm. J. Comput. Sci., 10: 1269-1280.

Table of Contents

DECLARATION
DEDICATED
ACKNOWLEDGMENTS
ABSTRACT
PUBLISHED WORK
Table of Contents
List of Tables
List of Figures
List of Abbreviations

CHAPTER ONE INTRODUCTION
1.1 Problem and Motivation
1.2 Research objectives and Contribution
  1.2.1 Research Objectives
  1.2.2 Thesis Contribution
1.3 Research Methodology
1.4 Thesis Organization

CHAPTER TWO BACKGROUND
2.1 Social Media
2.2 Twitter
2.3 Clustering
  2.3.1 Hierarchical Clustering Algorithms
    2.3.1.1 Agglomerative (Bottom-up) Algorithms
    2.3.1.2 Divisive (top-down) Algorithms
  2.3.2 Partitional Clustering Algorithms
2.4 Algorithm
  2.4.1 Genetic Algorithms
  2.4.2 Cellular Genetic Algorithms

CHAPTER THREE LITERATURE REVIEW
3.1 Historical Background
  3.1.1 Social Media
  3.1.2 Twitter
  3.1.3 Document and Tweet Clustering
  3.1.4 Genetic Algorithms GAs
  3.1.5 Cellular Genetic Algorithms cGAs
3.2 Related Work
  3.2.1 Social Media
  3.2.2 Twitter
  3.2.3 Document Clustering
  3.2.4 Tweet Clustering
  3.2.5 Cellular Genetic Algorithms cGAs
3.3 Summary and Conclusion

CHAPTER FOUR DATA AND ALGORITHM
4.1 Conceptual Framework
4.2 Data Collection
4.3 Data Description and Preparation
  4.3.1 Data Description
  4.3.2 Data Preprocessing
  4.3.3 TF-IDF representation
  4.3.4 Experimental setup
4.4 Algorithm
  4.4.1 Cellular Genetic Algorithm
  4.4.2 Chromosome Representation
  4.4.3 Initial Population
  4.4.4 Fitness Function
  4.4.5 Parent Selection
  4.4.6 Recombination (Crossover)
  4.4.7 Mutation
  4.4.8 Replacement Policy
  4.4.9 Stopping Criterion
4.5 Summary and Conclusion

CHAPTER FIVE EXPERIMENTAL RESULTS AND DISCUSSION
5.1 Introduction
5.2 Accuracy on test dataset
5.3 Results for 1,000 tweets dataset
  5.3.1 Average Fitness
  5.3.2 Average Execution Time
  5.3.3 Number of Clusters
5.4 Results for 5,000 tweets dataset
  5.4.1 Average Fitness
  5.4.2 Average Execution Time
  5.4.3 Number of Clusters
5.5 Results for 30,000 tweets dataset
  5.5.1 Fitness
  5.5.2 Execution Time
  5.5.3 Number of Clusters
5.6 Number of Generations
5.7 cGA performance in all datasets
  5.7.1 Average Fitness
  5.7.2 Average Execution Time
5.8 genGA performance in all datasets
  5.8.1 Average Fitness
  5.8.2 Average Execution Time
5.9 Research Limitations
5.10 Results Discussion
  5.10.1 Discussion of results obtained by both algorithms
    5.10.1.1 Average Fitness
    5.10.1.2 Execution Time
    5.10.1.3 Number of Clusters
    5.10.1.4 Number of Generations
  5.10.2 Discussion of results obtained by every algorithm
    5.10.2.1 Cellular Genetic Algorithm
5.11 Conclusion

CHAPTER SIX CONCLUSION AND FUTURE WORK
6.1 Conclusion
6.2 Future Work

References

List of Tables

Table 3.1 Social Media Related Work Summary
Table 3.2 Twitter Related Work Summary
Table 3.3 Document Clustering Related Work Summary
Table 3.4 Tweet Clustering Related Work Summary
Table 3.5 Cellular Genetic Algorithm Related Work Summary
Table 4.1 Type of Data gathered using Hootsuite
Table 4.2 Pseudo-code of Cellular Genetic Algorithm
Table 4.3 Parameterization of the algorithm
Table 5.1 Tweet Distribution in the test set
Table 5.2 Fitness values by both algorithms in all datasets
Table 5.3 Execution times by both algorithms in all datasets
Table 5.4 Number of clusters by both algorithms in all datasets
Table 5.5 Number of generations of both algorithms in all datasets
Table 5.6 cGA performance in all datasets
Table 5.7 genGA performance in all datasets

List of Figures

Figure 1.1 Top 10 languages on Twitter-2013
Figure 2.1 Attributes of Big Data
Figure 2.2 Characteristics of Social media data
Figure 2.3 Percentage of online adults using social networks by year
Figure 2.4 Percentage of internet users who use more than one social network
Figure 2.5 Example of a Twitter Home Page
Figure 2.6 Twitter input and output methods
Figure 2.7 Twitter Alexa traffic rank on the 20th of March 2014
Figure 2.8 Twitter users' statistics 2013
Figure 2.9 Twitter users demographic analysis 2013
Figure 2.10 Frequency of social media usage
Figure 2.11 Top 5 countries using Twitter 2013
Figure 2.12 Top 10 countries with the highest Twitter penetration
Figure 3.1 Timeline of the launch dates of major Social Network Sites
Figure 4.1 Conceptual framework
Figure 4.2 User interactions with social networks by social network aggregator
Figure 4.3 Hootsuite dashboard
Figure 4.4 Hootsuite Alexa traffic rank on the 20th of March 2014
Figure 4.5 Sample tweets from the dataset
Figure 4.6 Simple Genetic Algorithm
Figure 4.7 Topology of Cellular Genetic Algorithm
Figure 4.8 Chromosome representation
Figure 4.9 L5 Neighborhood
Figure 4.10 Reproductive cycle mechanism in cGA
Figure 5.1 Accuracy achieved by both algorithms
Figure 5.2 Average fitness of genGA and cGA (1,000 tweets)
Figure 5.3 Average execution time of genGA and cGA (1,000 tweets)
Figure 5.4 Number of clusters generated by genGA and cGA (1,000 tweets)
Figure 5.5 Average fitness of genGA and cGA (5,000 tweets)
Figure 5.6 Average execution time of genGA and cGA (5,000 tweets)
Figure 5.7 Number of clusters generated by genGA and cGA (5,000 tweets)
Figure 5.8 Fitness value of genGA and cGA (30,000 tweets)
Figure 5.9 Execution time of genGA and cGA (30,000 tweets)
Figure 5.10 Number of clusters generated by genGA and cGA (30,000 tweets)
Figure 5.11 Number of generations produced by each algorithm
Figure 5.12 cGA fitness for all sets
Figure 5.13 cGA execution time for all sets
Figure 5.14 genGA fitness for all sets
Figure 5.15 genGA execution time for all sets

List of Abbreviations

API: Application Programming Interface
cGA: Cellular Genetic Algorithm
CGM: Consumer-Generated Media
cMOGA: Cellular Multi-Objective Genetic Algorithm
CTC: Core-Topic-based Clustering
DDE: Discrete Differential Evolution
DF: Document Frequency
DJIA: Dow Jones Industrial Average
DOI: Digital Object Identifier
DPX: Distance Preserving Crossover
e-WOM: Electronic Word Of Mouth
EACS: Energy-Aware Communications Scheduler
EAs: Evolutionary Algorithms
ESA: Explicit Semantic Analysis
GA: Genetic Algorithm
genGA: Generational Genetic Algorithm
IDF: Inverse Document Frequency
ILI: Influenza-Like Illnesses
MANETs: Metropolitan Mobile Ad Hoc Networks
MOPs: Multi-Objective Continuous Optimization Problems
Ms: Milliseconds
NLP: Natural Language Processing
Pc: Crossover probability
Pm: Mutation probability
RT: Retweet
SMEs: Small and Medium-sized Enterprises
SMS: Short Message Service
SNEFT: Social Network Enabled Flu Trends
TF: Term Frequency
UGC: User Generated Content
VRP: Vehicle Routing Problem

CHAPTER ONE INTRODUCTION


Social media platforms such as Facebook, Twitter, YouTube, Hi5, and Orkut have become an essential part of the daily online experience. They enable information sharing and communication between online users. The data produced by social media is huge in volume, mainly user-generated, varied, and spreading at an unprecedented rate. Twitter is one of the most important and popular social media platforms. It enables its users to share ideas and coordinate activities through short status messages, called tweets, that cannot exceed 140 characters in length. This limit forces Twitter users to be as concise as possible.

This chapter presents an overview of the thesis. It is organized as follows. Section 1.1 describes the scope of the problem and the motivation for tweet clustering. Section 1.2 states the research objectives and the main contribution of the thesis. Section 1.3 briefly discusses the methodology followed by the researcher. Finally, Section 1.4 explains how the rest of the thesis is organized.

1.1 Problem and Motivation

As the popularity of Twitter continues to increase rapidly, the huge amount of data it produces must be analyzed properly so that it can be used efficiently. Twitter is an essential source of real-time information across a wide variety of interests, including sports events, advertising, political campaigns (e.g. presidential elections), mass emergencies, crisis events, and even health care [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. One of the most popular methods of tweet analysis is clustering. Twitter provides a massive quantity of short text in the form of tweets, where each tweet represents a single document [11]. Tweets can be an extremely useful data source for researchers on a wide variety of topics. Textual content in Twitter primarily consists of tweet text, URLs, and hashtags within tweets [12], and tweets are considered to be "short texts".


The formulation of the problem of clustering tweets based on their similarity is motivated by an essential observation: tweet content similarity can be used as one of the similarity measures between users, since it helps to determine whether users have similar interests. Similar tweet content is thus an indication of similarity between users [13]. Since the majority of user-generated messages on micro-blogging websites are textual [14], the main focus of this thesis is the clustering of tweets based on their textual content similarity. English is the original language of Twitter, and tweets written in English are the most common, followed by Japanese and Spanish [15, 16], as displayed in Figure 1.1. Therefore, the focus is on tweets written in English.
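As a hedged illustration of what "textual content similarity" between tweets can mean in practice, the following Python sketch weights the words of each tweet with TF-IDF (the representation named in Section 4.3.3 of the Table of Contents) and compares two tweets by cosine similarity. The cosine measure, the whitespace tokenizer, and the function names `tokenize`, `tfidf_vectors`, and `cosine_similarity` are assumptions made for illustration only; they are not taken from the implementation used in this thesis.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical minimal tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def tfidf_vectors(tweets):
    """Return one sparse TF-IDF vector (term -> weight dict) per tweet."""
    docs = [Counter(tokenize(t)) for t in tweets]
    n_docs = len(docs)

    # Document frequency: in how many tweets does each term occur?
    df = Counter()
    for doc in docs:
        df.update(doc.keys())

    vectors = []
    for doc in docs:
        total_terms = sum(doc.values())
        vectors.append({
            term: (count / total_terms) * math.log(n_docs / df[term])
            for term, count in doc.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

With this sketch, two tweets that share topical words (for example, both mentioning "world cup final") receive a higher score than two tweets with no words in common, which score 0.0. A word that occurs in every tweet gets weight zero and so does not contribute to the similarity.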

Figure 1.1 Top 10 languages on Twitter-2013 [17]

Clustering of tweets is a complex problem. The very short length of tweets (at most 140 characters) is one source of difficulty. According to Karandikar, "Such a short piece of text provides very few contextual clues for applying machine learning techniques" [18]. Most clustering methods perform weakly on this type of data because of the freedom with which tweets are written. The writing style of tweets is informal and can be full of jargon, colloquialisms, spelling mistakes, domain-specific content, acronyms, and out-of-vocabulary words, with poor grammatical structure. The words are sparse, which makes tweets difficult to handle. In general, Twitter data does not have a well-defined structure [19, 16, 11].

1.2 Research objectives and Contribution

1.2.1 Research Objectives

This thesis aims to:
1. Explore the clustering of tweets based on their textual similarity
2. Apply Cellular Genetic Algorithms (cGAs) to cluster tweets
3. Apply conventional Generational Genetic Algorithms (genGAs) to cluster tweets
4. Compare the results obtained by cGAs with those obtained by genGAs

1.2.2 Thesis Contribution

In addition to achieving the research objectives, the main contribution of this thesis can be summarized as follows. This study is one of the first attempts to apply cGA to the clustering of tweets, which can improve clustering performance in comparison with traditional clustering algorithms, or with clustering algorithms that require a priori knowledge of the number of clusters, such as K-means. This study therefore contributes a new approach to tweet clustering.

1.3 Research Methodology

The researcher employed a scraping-based technique to gather data from Twitter, using Hootsuite as a social network aggregator. The Twitter messages were collected based on a set of keywords that describe specific real-world topics. Tweets were collected over a 3-day period, from the 26th to the 28th of June 2013. The set of pre-defined keywords comprises eight varied categories that are intended to be diverse in order to cover different and wide areas of interest. After data pre-processing, both cellular and generational genetic algorithms were applied to three datasets composed of 1,000, 5,000, and 30,000 tweets respectively. The obtained results are compared according to the fitness value, execution time, number of clusters, and number of generations.
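To give an intuition for how a cellular GA differs from a generational one, the sketch below shows the neighborhood idea commonly associated with cGAs: individuals are placed on a grid and interact only with nearby cells, rather than with the whole population. It assumes a toroidal (wrap-around) grid, an L5 neighborhood (the cell plus its north, south, east, and west neighbors; the name appears in the List of Figures), and that higher fitness is better. The function names `l5_neighborhood` and `fittest_neighbor` are hypothetical, and the code is not the implementation used in this thesis.

```python
def l5_neighborhood(row, col, rows, cols):
    """Return the cell itself followed by its N, S, E, W neighbors.

    Assumption: the grid wraps around at the edges (toroidal topology).
    """
    return [
        (row, col),                # the cell itself
        ((row - 1) % rows, col),   # north
        ((row + 1) % rows, col),   # south
        (row, (col + 1) % cols),   # east
        (row, (col - 1) % cols),   # west
    ]


def fittest_neighbor(fitness, row, col):
    """Pick the neighbor (excluding the cell itself) with the highest fitness.

    `fitness` is a 2-D list holding one fitness value per grid cell.
    Assumption: larger values are better.
    """
    rows, cols = len(fitness), len(fitness[0])
    neighbors = l5_neighborhood(row, col, rows, cols)[1:]
    return max(neighbors, key=lambda rc: fitness[rc[0]][rc[1]])
```

In a generational GA, by contrast, parents may be drawn from anywhere in the population. Restricting mating to a small neighborhood, as sketched here, is what gives the cellular variant its name.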

1.4 Thesis Organization

The remainder of the thesis is structured as follows. Chapter 2 provides background on social media, Twitter, clustering and its categories, as well as the Genetic Algorithm (GA) and the Cellular Genetic Algorithm (cGA). Chapter 3 provides a historical background and reviews previous research related to Twitter, document and tweet clustering, and the applications of cGAs. Chapter 4 presents the methodology and conceptual framework of this thesis. Chapter 5 presents the experiments and interprets their results. Chapter 6 gives the final conclusion and summary of this thesis, together with future work suggested by the obtained results and by the limitations of the work.


CHAPTER TWO BACKGROUND


This chapter provides background on the main topics discussed in this thesis. It is organized as follows. Section 2.1 outlines the concept of social networks, followed by Section 2.2, which provides background on Twitter. Section 2.3 then describes the concepts of clustering, document clustering, and tweet clustering. Finally, Section 2.4 briefly discusses the Genetic Algorithm (GA) and one of its subclasses, the Cellular Genetic Algorithm (cGA).

2.1 Social Media

Ellison defines social network sites as "Web-based services that allow individuals to (1) construct a public or semi-public profile within a bounded system, (2) articulate a list of other users with whom they share a connection, and (3) view and traverse their list of connections and those made by others within the system. The nature and nomenclature of these connections may vary from site to site" [20]. A more detailed definition is provided by Java, who describes social media as "An umbrella term that defines the various activities that integrate technology, social interaction, and the construction of words, pictures, videos and audio. This interaction, and the manner in which information is presented, depends on the varied perspectives and "building" of shared meaning, as people share their stories, and understandings" [21].

Social network sites have implemented a wide variety of technical features; their backbone is mainly composed of visible profiles that show a list of friends who are also users of the system. Each user has a unique profile. Profile pages act as starting points from which users can explore these social networks. They can look for other people, or find individuals who have common interests. Some of the most popular terms in the world of social media include Friends, Contacts, Fans, Links, and Groups.

Social media is an extremely powerful example of Web 2.0 tools. It has radically changed the way people find information, share knowledge, and interact with other people, either within or outside social networks. The shared information includes different media formats ranging from ordinary text to photos, music, videos, and PDF documents. Social media sites are centered on user-generated content and information sharing. User Generated Content (UGC), also known as consumer-generated media (CGM), as opposed to professionally edited text, was defined by the Interactive Advertising Bureau in April 2008 as "Any material created and uploaded to the Internet by non-media professionals". This represents a dramatic shift on the web from one-way communication, where users can only consume information as provided to them, to conversation-style interaction, where users participate heavily: they can share information, obtain and add to information posted by other users (dynamic content), and spread information over their social network.

Social media data is considered to be one type of "Big Data", with its three defining attributes: Volume, Velocity, and Variety. "Volume" refers to the data size, "Velocity" refers to the frequency of data production, and "Variety" refers to the data sources [22, 23, 24]. The attributes of big data and the characteristics of social media data are displayed in Figure 2.1 and Figure 2.2 respectively.

Figure 2.1 Attributes of Big Data [24]

Figure 2.2 Characteristics of Social media data [24]

The huge amount of “user generated content” (with embedded metadata in the form of links, images, and videos) produced on a daily basis in social media is a significant source of contextual information which can be used to gain abundant inferences [21, 23, 1, 25]. This enormous amount of user-generated content results in vast, noisy, distributed, unstructured, and dynamic social media data [26]. The popularity of social networks has increased tremendously in recent years as a result of the widespread availability of internet-enabled devices such as personal computers, mobile devices and internet tablets. Online social networks have become an essential part of the global online daily experience [27, 28]. According to the Pew Research Center report of social media updates in 2013, about 73% of online adults now use a social networking site, as displayed in Figure 2.3.

Figure 2.3 Percentage of online adults using social networks by year [29]

About 42% of online adults now use more than one social networking site while 36% use only one (the remaining 22% did not use any of the five specific sites: Facebook, LinkedIn, Pinterest, Twitter, and Instagram). This is demonstrated in Figure 2.4.

Figure 2.4 Percentage of internet users who use more than one social network [29]


2.2 Twitter

Twitter is one of the most important social media platforms. It can be used to share ideas and coordinate activities, similar to instant messaging [15]. An example of a Twitter profile home page is displayed in Figure 2.5.

Figure 2.5 Example of a Twitter Home Page

Twitter users generate status messages called “tweets”. Tweets are status updates and musings, often of a personal nature, that cannot exceed 140 characters, whether they consist of text, emoticons, links, or a combination of these. Tweets are often associated with particular events or specific topics of interest, or with individual judgments, reactions, and points of view. Tweets are broadcast to a global audience [14, 18, 5, 25]. Tweets can be posted from various sources, including the Twitter website, Twitter mobile applications, and several third-party applications and websites. Figure 2.6 displays the different input and output methods in Twitter.

Figure 2.6 Twitter input and output methods [30]

Twitter users also have control over privacy features. They can choose to make their tweets public (visible to anyone) or private (visible only to users who have been granted permission). If a user’s profile is left public, his/her updates appear in a “public timeline” of recent updates [31]. Commonly, users read Twitter messages on a main page showing a stream of the latest messages from the people they follow [32]. In this work, the researcher considers only messages posted publicly on Twitter. Twitter allows users to reply to the tweets of other users by clicking the reply button on the tweet [13]. Every user is identified by a user name of up to 15 alphanumeric characters and underscores, preceded by the “@” symbol [33, 34]. Every tweet carries other information in addition to the message, including its timestamp, the Twitter user who posted it, whether it was part of a conversation or a retweet, and the number of people who retweeted it [34]. The limited length of tweets means that they do not necessarily include well-developed thoughts; instead they are short and concise, yet complete enough for users to understand the ideas they deliver. The limit forces users to express their opinions in a few sentences [16].

Social interaction between Twitter users takes place principally in three ways:

1. The "follow" relationship, where users can follow other users by subscribing to their tweets. The follower gets all the status updates of the user that he/she follows. Followers are displayed in reverse chronological order; the most recently selected follower is displayed first. Unlike other social networking sites, the relationship of following and being followed does not require reciprocity. Twitter supports one-way connection rather than two-way connection. In other words, a user can follow any other user without approval from the followed user, and the user being followed does not need to follow back.
2. "Mention", another form of connection that can be defined between two users. A mention is the act of referring to other user(s) in a tweet by addressing them directly.
3. "Retweet" or RT, in which individuals can rebroadcast content generated by other users, thus raising its visibility. This is similar to forwarding an email message to other users, in this case the followers. Retweeting has an important role in the propagation of information on Twitter [33].

"Hashtag" is a concept associated with Twitter (Note: hashtags are also supported by other social media websites such as Facebook, Google+, Instagram, YouTube, LinkedIn, etc.) that enables users to identify significant keywords in their tweets by adding the prefix ‘#’ before a keyword (without a space) in a tweet. Hashtags are used on Twitter to set trending topics, indicate the intended audience of a tweet, begin chat rooms, and categorize tweets by topic or type. Hashtags allow users to emphasize what they consider to be the important keyword(s) in their tweet. A hashtag preceding a topic enables Twitter users to search for that topic and retrieve a list of recent tweets about it.
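The conventions described above (user names of up to 15 alphanumeric characters or underscores preceded by “@”, keywords prefixed by ‘#’, and retweets marked with RT) can be illustrated with a minimal Python sketch. The function name `parse_tweet`, the regular expressions, and the assumption that a retweet starts with the literal prefix "RT @user" are illustrative choices, not part of the method described in this thesis.

```python
import re

# Assumption: user names are 1-15 ASCII letters, digits or underscores after '@'.
MENTION_RE = re.compile(r"(?<!\w)@(\w{1,15})(?!\w)", re.ASCII)
# Assumption: a hashtag is '#' immediately followed by word characters (no space).
HASHTAG_RE = re.compile(r"(?<!\w)#(\w+)", re.ASCII)
# Assumption: a retweet is recognised by the textual prefix "RT @user".
RETWEET_RE = re.compile(r"^RT\s+@\w{1,15}", re.ASCII)


def parse_tweet(text):
    """Extract mentions, hashtags and a retweet flag from the text of one tweet."""
    return {
        "mentions": MENTION_RE.findall(text),
        "hashtags": HASHTAG_RE.findall(text),
        "is_retweet": RETWEET_RE.match(text) is not None,
    }


if __name__ == "__main__":
    print(parse_tweet("RT @alice: loving #cGA and #Twitter with @bob_99"))
```

The look-behind `(?<!\w)` keeps e-mail addresses such as `a@b.com` from being counted as mentions, and the look-ahead `(?!\w)` rejects names longer than 15 characters instead of truncating them.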
In addition, Twitter offers a search portal (https://twitter.com/search-home) so that users can constantly monitor or search for tweets by keyword, hashtag, or user name, but this service is restricted to only 40 search keywords. Twitter also has API (Application Programming Interface) functions to acquire user-specific information. Such information can be used to construct a network of friends [35]. Moreover, Twitter provides clickable “trending topic” terms that initiate searches for widespread keywords. Finally, Twitter has a location function. If users are posting tweets from a mobile device, they can turn on their location, and their latitude and longitude will be captured with the tweet. The location information available from mobile Twitter applications records where the user was when he/she posted the tweet. The user generally has the option to turn location services on or off [33].

The popularity of Twitter continues to grow rapidly. The song of Ben Walker reveals how popular Twitter is:

"You're no one if you're not on Twitter
And if you aren't there already you've missed it
If you haven't been bookmarked, retweeted and blogged
You might as well not have existed"

Figure 2.7 demonstrates the traffic rank of Twitter according to Alexa.

Figure 2.7 Twitter Alexa traffic rank on the 20th of March 2014 [36]

As displayed in Figure 2.8, the number of active Twitter users reached about 645,750,000, with 135,000 new users per day. Twitter has grown rapidly from handling 5,000 tweets per day in 2007 to 50 million tweets per day in 2010, and now handles an average of 58 million tweets per day.

Figure 2.8 Twitter users' statistics 2013 [37]

According to the Pew Research Center report of social media updates in 2013, about 18% of online adults currently use Twitter. The majority of this percentage is among young adults, as displayed in Figure 2.9.

Figure 2.9 Twitter users demographic analysis 2013 [29]

According to the same report, about 46% of Twitter users use it daily, with 29% checking in several times per day. However, 32% of Twitter users say that they check in less than once per week. This is displayed in Figure 2.10.

Figure 2.10 Frequency of social media usage [29]

Considering the geographical distribution of users, 24% of Twitter's active users are in the United States, followed by Japan and Indonesia with 9.3% and 6.5% of Twitter's active users, respectively. The highest Twitter penetration level exists in Saudi Arabia, at 33%, followed by Indonesia, Spain, Venezuela, Argentina, UK, Netherlands, United States, Japan, and Colombia (Figure 2.11 and Figure 2.12).

Figure 2.11 2013 Top 5 countries using Twitter [38]

Figure 2.12 2013 Top 10 countries with the highest Twitter penetration [39]

2.3 Clustering

In its traditional definition, clustering or cluster analysis is an unsupervised data mining technique that partitions data into collections or subsets of similar objects called clusters. It is used to describe (rather than predict) data by organizing it into groups that are meaningful or useful. Clustering is a common technique for statistical data analysis. Document clustering can be defined as "Automatic grouping of text documents into clusters so that documents within a cluster have high similarity in comparison to one another, but are dissimilar to documents in other clusters". "Unsupervised" means that no human expert has allocated documents to classes. Through document clustering, researchers can discover a wealth of valuable and hidden knowledge, such as the hotspots of a discipline and the cross-relationships between disciplines. This sort of knowledge not only helps scholars to master the knowledge of a particular discipline, but also supports their decision making [40, 41, 42, 43]. Similarly, tweets can be grouped into clusters such that tweets in one cluster tend to be similar to each other but dissimilar to those in other clusters, i.e. minimum inter-cluster and maximum intra-cluster similarity [33, 44].
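To make the notion of "similar" tweets concrete, the following minimal Python sketch computes a cosine similarity between the word-count vectors of two short texts. The choice of cosine similarity, the whitespace tokenizer, and the function names are assumptions made purely for illustration; this section does not state which similarity measure the thesis uses.

```python
import math
from collections import Counter


def tokenize(text):
    """Hypothetical tokenizer: lower-case the text and split on whitespace."""
    return text.lower().split()


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts."""
    vec_a, vec_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    # Only words present in both texts contribute to the dot product.
    dot = sum(vec_a[w] * vec_b[w] for w in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine_similarity("twitter is down again", "is twitter down"))
    print(cosine_similarity("twitter is down again", "lovely weather today"))
```

Under this measure, a good clustering would place pairs with values near 1 in the same cluster and pairs with values near 0 in different clusters.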

Two broad families of clustering algorithms exist: Hierarchical and Partitional.

2.3.1 Hierarchical Clustering Algorithms

These algorithms are subdivided into Agglomerative and Divisive clustering algorithms.

2.3.1.1 Agglomerative (Bottom-up) Algorithms

These algorithms start with each individual document considered as a separate cluster, each of size one. At each level, the smaller clusters are merged to form a bigger cluster. The process continues this way and terminates when all clusters are merged into one cluster containing all the documents.
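The bottom-up process can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure used in this thesis: the single-linkage rule (distance between the closest members of two clusters), the function name `agglomerative`, and the numeric toy data standing in for documents are all assumptions.

```python
def agglomerative(items, distance):
    """Merge the two closest clusters repeatedly; return every level of the hierarchy."""
    clusters = [[item] for item in items]       # start: one cluster per item
    levels = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage (assumption): distance of the closest pair of members.
                d = min(distance(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        levels.append([list(c) for c in clusters])
    return levels


if __name__ == "__main__":
    for level in agglomerative([1, 2, 10, 11, 50], lambda a, b: abs(a - b)):
        print(level)
```

Each printed level has one cluster fewer than the previous one, ending with a single cluster that contains all items, as described above.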

2.3.1.2 Divisive (Top-down) Algorithms

These algorithms work in the opposite manner to Agglomerative algorithms. They start with the whole set of documents. The set is then broken down to create successively smaller clusters. The process is repeated recursively until individual documents are reached. Agglomerative algorithms are used more frequently in information retrieval than divisive algorithms.

2.3.2 Partitional Clustering Algorithms

These algorithms depend upon defining all clusters in advance; that is, they typically determine all clusters at once. The k-means clustering algorithm is an example of this category [126, 1].
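As an illustration of the partitional approach, the sketch below implements the standard k-means loop (assign each point to its nearest centroid, then recompute the centroids) in plain Python. The number of clusters k must be supplied in advance. Initializing the centroids with the first k points is an assumption made to keep the example deterministic; practical implementations usually choose the initial centroids at random.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, k, iterations=100):
    centroids = [tuple(p) for p in points[:k]]   # assumption: first k points as seeds
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [mean(cl) if cl else centroids[i] for i, cl in enumerate(clusters)]
        if new_centroids == centroids:
            break                                 # converged: assignments are stable
        centroids = new_centroids
    return centroids, clusters


if __name__ == "__main__":
    print(kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], k=2))
```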


2.4 Algorithm

This section provides a brief introduction to Genetic Algorithms and one of their subclasses, the cellular genetic algorithm (cGA). A more detailed discussion is provided in Chapter 4.

2.4.1 Genetic Algorithms

Genetic Algorithms (GAs) belong to the class of evolutionary computational algorithms. Evolutionary algorithms (EAs) are population-based optimization techniques designed for finding globally optimal solutions from a pool of feasible solutions (individuals). They are loosely based on biological processes such as natural selection and the genetic inheritance of good traits. Genetic Algorithms are probabilistic search methods whose mechanisms are analogous to the natural process of biological evolution and are used to discover solutions to problems. A population of individuals that encode tentative solutions to a specific problem is maintained. Some individuals are better than others. Those better individuals have a higher probability to survive, learn, and spread their genetic material (survival of the fittest). New individuals are generated by combining members of the population; these new individuals replace existing ones according to a defined policy.

Implementation of a genetic algorithm starts with a population of (often randomly generated) individuals. These individuals are then evaluated according to an objective function. The individuals that represent a better solution to the problem are given more chances to reproduce than those individuals that represent worse solutions. The quality of a solution is measured against the current population. GAs have proven to be effective at solving large search and optimization problems due to their self-adaptation and self-organization capabilities [45, 40, 46, 47, 48, 49, 43].
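The generational loop described above (random initial population, evaluation, preferential reproduction of better individuals, replacement) can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm configuration used in this thesis: the toy objective (count of 1-bits, often called OneMax), binary tournament selection, one-point crossover, bit-flip mutation, the elitist replacement policy, and all parameter values are hypothetical choices.

```python
import random


def fitness(individual):
    """Toy objective (assumption): number of 1-bits in the bit string."""
    return sum(individual)


def tournament(population, rng, k=2):
    """Pick k random individuals and return the fittest of them."""
    return max(rng.sample(population, k), key=fitness)


def genetic_algorithm(n_bits=20, pop_size=30, generations=50, p_mut=0.05, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best_per_generation = [max(map(fitness, population))]
    for _ in range(generations):
        # Replacement policy (assumption): keep the best individual unchanged (elitism).
        children = [max(population, key=fitness)[:]]
        while len(children) < pop_size:
            parent_a, parent_b = tournament(population, rng), tournament(population, rng)
            cut = rng.randrange(1, n_bits)                    # one-point crossover
            child = parent_a[:cut] + parent_b[cut:]
            child = [1 - g if rng.random() < p_mut else g for g in child]  # bit-flip mutation
            children.append(child)
        population = children
        best_per_generation.append(max(map(fitness, population)))
    return max(population, key=fitness), best_per_generation


if __name__ == "__main__":
    best, history = genetic_algorithm()
    print(fitness(best), history[:5])
```

Because the best individual is always copied into the next generation, the best fitness recorded per generation can never decrease.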

2.4.2 Cellular Genetic Algorithms

Cellular Genetic Algorithms (cGAs), also called fine-grained genetic algorithms, are a subclass of Genetic Algorithms in which the population is structured in a specified decentralized topology, so that individuals may only interact with their neighbors. The individuals of the population are usually arranged in a two-dimensional toroidal grid. Each cell of the grid contains one individual (solution). At each generation of the algorithm, individuals are modified by three stochastic genetic operators: selection, crossover and mutation. These operators are only performed within the neighborhood of each individual. Alba and Dorronsoro stated that "Such a kind of structured algorithms is specially well suited for complex problems". Cellular Genetic Algorithms have advantages over other Genetic Algorithms, such as a high level of diversity that can be preserved for much longer than in centralized algorithms, owing to the existence of small overlapping neighborhoods. Such a structure ensures that solutions diffuse slowly and smoothly through the population. This enhances exploration (diversification), while exploitation (intensification) is maintained within each neighborhood by the genetic operations [48, 50, 51, 52, 53, 54, 55].
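The neighborhood-restricted operators can be sketched on a small toroidal grid as follows. This is a simplified illustration under stated assumptions, not the cGA developed in this thesis: the von Neumann neighborhood (north, south, west, east), the synchronous update, the replace-if-not-worse policy, the toy objective (count of 1-bits), and all parameter values are hypothetical choices.

```python
import random


def fitness(individual):
    """Toy objective (assumption): number of 1-bits in the bit string."""
    return sum(individual)


def neighbors(r, c, rows, cols):
    """Von Neumann neighborhood on a torus: indices wrap around the grid edges."""
    return [((r - 1) % rows, c), ((r + 1) % rows, c), (r, (c - 1) % cols), (r, (c + 1) % cols)]


def cellular_ga(rows=6, cols=6, n_bits=16, generations=40, p_mut=0.05, seed=0):
    rng = random.Random(seed)
    grid = [[[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(cols)]
            for _ in range(rows)]
    best_history = [max(fitness(ind) for row in grid for ind in row)]
    for _ in range(generations):
        new_grid = [[None] * cols for _ in range(rows)]    # synchronous update
        for r in range(rows):
            for c in range(cols):
                current = grid[r][c]
                hood = [grid[i][j] for i, j in neighbors(r, c, rows, cols)]
                # Selection is local: the mate comes only from the neighborhood.
                mate = max(rng.sample(hood, 2), key=fitness)
                cut = rng.randrange(1, n_bits)              # one-point crossover
                child = current[:cut] + mate[cut:]
                child = [1 - g if rng.random() < p_mut else g for g in child]
                # Replacement (assumption): the child takes the cell if it is not worse.
                new_grid[r][c] = child if fitness(child) >= fitness(current) else current
        grid = new_grid
        best_history.append(max(fitness(ind) for row in grid for ind in row))
    return grid, best_history


if __name__ == "__main__":
    final_grid, history = cellular_ga()
    print(history[0], history[-1])
```

Because a good solution can only spread to adjacent cells in each generation, it takes several generations to cross the grid, which is the slow diffusion that helps preserve diversity.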


CHAPTER THREE LITERATURE REVIEW


This chapter presents a historical background and examines the related work to the main contributions of the thesis. The basic outline of this chapter is as follows. Section 3.1 presents a historical background about Social media, Twitter, Document and Tweet clustering, Genetic Algorithms GAs and Cellular Genetic Algorithms cGAs. Section 3.2 reviews the related work of these areas. Finally, Section 3.3 draws the summary and conclusion of the chapter.

3.1 Historical Background

This section provides a historical background covering the following points:
1. Social Media in general
2. Twitter
3. Document and Tweet clustering
4. Genetic Algorithms GAs
5. Cellular Genetic Algorithms cGAs

3.1.1 Social Media

The first identifiable social network site was SixDegrees.com that was launched in 1997 and began in 1998. It allowed users to create profiles and list their friends. Unfortunately, it failed to become a viable business and closed in 2000. In 1999, three social network sites, AsianAvenue, BlackPlanet, and MiGente, were initiated. In 2000, the Swedish web community LunarStorm transformed itself to become a social network site. In 2001, the Korean virtual world site Cyworld added social network service features. In the same year, Ryze.com was launched. In 2002, Fotolog, Friendster, and Skyblog were launched. Starting from 2003, many new social network sites were launched. These sites include Couchsurfing, MySpace, LinkedIn, Last.FM, Tribe.net, and Hi5. The phenomenon continued to grow in 2004, with the introduction of Flickr, Dodgeball, Orkut, Catster, aSmallWorld, and Hyves. YouTube, Facebook, Yahoo 360, Bebo, Ning, and Renren (the Chinese twin of Facebook) were launched in 2005. The year 2006 witnessed the launch of QQ, Windows Live Spaces, and Twitter. Figure 3.1 displays the chronological order of the launch time of many social network sites [20].


Figure 3.1 Timeline of the launch dates of major Social Network Sites

3.1.2 Twitter

In contrast to the current enormous growth in Twitter's popularity (to the degree of playing a critical role in the social revolutions in Arab countries such as Tunisia, Egypt, and Yemen), Twitter began as a backup project for an unsuccessful project. The unsuccessful project was a podcasting platform that was soon surpassed by Apple iTunes. Therefore, the founders turned towards developing an SMS (Short Message Service) service for a limited number of users, who could use it to express their current status, feelings, or deeds. In March 2006, the first prototype of Twitter, called "twttr", was initiated. Twitter was then formally launched in July of the same year. The initial founders of Twitter are Jack Dorsey (1976-…), Evan Williams (1972-…), Biz Stone (1974-…) and Noah Glass. Twitter is considered to be a microblogging website. Micro-blogging can be defined as "A form of blogging that allows users to send brief text updates or micro media such as photographs or audio clips in order to describe their current status" [33, 56, 16, 9].


3.1.3 Document and Tweet Clustering

Along with the evolution of the World Wide Web in recent years, major search engines such as Google and Yahoo utilize clustering techniques to automatically categorize the retrieved web documents hierarchically into coherent clusters of meaningful categories, where the documents in a single cluster are about the same topic. With the huge growth in the popularity and usage of social network sites such as Twitter and the continuing increase in the amount of textual documents generated by their users, document clustering techniques are also employed to categorize Twitter messages (tweets) into related topics [57, 58, 59, 60].

3.1.4 Genetic Algorithms GAs

John Henry Holland (1929-….), the American scientist, Professor of psychology, and Professor of electrical engineering and computer science at the University of Michigan, Ann Arbor, was the first person to use the term "Genetic Algorithm". Hans Joachim Bremermann (1926–1996) in Berkeley, California, and Alex Fraser in Sydney, Australia, were also among the pioneers who studied Genetic Algorithms. Holland invented Genetic Algorithms (GAs) in the 1960s. In 1975, his book "Adaptation in Natural and Artificial Systems" introduced Genetic Algorithms as an abstraction of biological evolution. This book is considered to be the theoretical framework on which all subsequent work on Genetic Algorithms was based. In the same year, Kenneth DeJong finished his doctoral dissertation, which included for the first time an exhaustive treatment of the abilities of Genetic Algorithms in the field of optimization [46, 61, 62, 63, 64, 65, 66].

3.1.5 Cellular Genetic Algorithms cGAs

The first model of Cellular Genetic Algorithms (cGAs) was put forward by Robertson in 1987 [67]. In the subsequent year, Mühlenbein et al. proposed what is considered to be the first hybrid Cellular Genetic Algorithm, which was designed for solving the Travelling Salesman Problem (TSP) [68]. After these two initial efforts, other cGAs appeared within a few years. One of those is the algorithm called "Pollination Plants", which David Goldberg introduced in 1989 [69]. In the same year, the "Fine Grained" genetic algorithm was introduced by Manderick and Spiessens [70]. In 1991, the same researchers presented a "Massively Parallel" genetic algorithm [71]. Also in 1991, Frank Hoffmeister introduced the "Parallel Individual" algorithm [72]. In 1997, Bäck, Fogel and Michalewicz presented the "Diffusion Model" [73]. However, the term "cellular Genetic Algorithm" was first used only in 1993, when Whitley introduced it in a work that applied the cellular automaton model to a genetic algorithm [74]. A detailed treatment of Cellular Genetic Algorithms was given by Enrique Alba and Bernabé Dorronsoro in their important book "Cellular Genetic Algorithms", published in 2008, which explored how to extend the use of cGAs to various domains [46]. This book was an important asset for the researcher in this dissertation.

3.2 Related Work

This section is divided into five subsections. Subsection 3.2.1 reviews the previous related research in the field of analyzing the content of social media. Subsection 3.2.2 reviews the research concerning Twitter and its uses in various domains. Subsection 3.2.3 reviews the prior work in the field of document clustering. Subsection 3.2.4 reviews previous research on the clustering of tweets. Finally, Subsection 3.2.5 discusses some of the various applications of Cellular Genetic Algorithms (cGAs). After each subsection, a table summarizing the work related to this subsection is provided. Each table lists the author(s) and year of publication, the addressed problem(s), and the contribution(s).

3.2.1 Social Media

Generally, research in the area of social networks has witnessed a dramatic increase in recent years. The rapidly increasing popularity of social networking websites has raised awareness of this type of data and increased its availability. This set of research focuses mainly on understanding the nature, analysis, data collection, and organization of social media and its data.

Even before the great explosion in the popularity of social media, the question of how to search a social network was raised. Adamic and Adar simulated "small world" experiments on two datasets representing two different scenarios: a dataset extracted from a network of actual email contacts within an organization, and a second dataset extracted from a student social networking website [75]. A framework for effective analysis of the content and structure of social media data, as well as for discovering communities in social networks (in order to understand how online communication and collaboration take place in social applications), was introduced by Java in his doctoral dissertation [21]. Maia et al. presented a methodology for the identification and characterization of user behaviors in social networks for the sake of improving business and resource management in social networks. They gathered data from YouTube and grouped users with similar behavioral patterns using a clustering algorithm [76]. Benevenuto et al. analyzed user workloads in four popular social networks: Orkut, MySpace, Hi5, and LinkedIn. Their analysis is founded on two datasets: (1) a clickstream dataset gathered over a twelve-day period using a Brazilian social network aggregator and (2) an Orkut social network dataset. The results of their study revealed how frequently and for how long people connect to social networks, in addition to the types and sequences of activities that users conduct on social networks [28]. In his doctoral dissertation, Zhou investigated the relationship between user-generated social media content and social actions using a broad range of applications such as ranking, discovering communities, information retrieval, and recommendation of documents [77].

Cormode et al. proposed a manifesto for modeling and measurement in social media. This manifesto includes the important features that can be used to construct models for three of the most widely used social networks: Twitter, Facebook and YouTube. In addition, the manifesto discussed significant considerations for the collection, sampling, validation, and sharing of social media data [78]. Gozzo and D’Agata studied the connection between social networks and political participation. They presented the different shapes of politically pertinent linkages compared against the major socio-demographic dimensions. Their data sample was extracted from the electoral register of a town near Sicily in Italy [79]. The problem of mining antagonistic communities (communities of people with contradictory opinions) from social networks was investigated by Kuan in his Master's dissertation [80]. Wang addressed the challenges associated with social media data analysis, explained a dynamic perspective for analyzing social media data, and showed its strategic importance for the detection, minimization and prevention of the troublemaking influences of internet-based social media. This study also discussed how social media data can be used in the military field of national defence and security [24]. Yun et al. surveyed the definition of friends in online social networks and analyzed the private communication interactions between Korean users of the two social networks Me2day and Twitter. They gathered interactions of 32,200 Me2day accounts from January to October 2009 and of 890 Twitter users on the 12th of December 2009 [81].

Agrawal et al. introduced an integrated methodology to study information diffusion, opinion dynamics, and information trends in online social networks. They focused on three major problems: (1) querying and analysis of online social network datasets; (2) modeling and analysis of social networks; and (3) analysis of social media and social interactions in the existing media environment [82]. In his doctoral dissertation, Becker provided a methodology for organizing social media documents. This methodology helps to identify and characterize a huge set of events that occur in social media through their relevant social media documents, in order to improve browsing and the quality of search for event content. His analysis focused on Twitter, exploiting tweets in New York City [32]. Gloor et al. studied how the analysis of social network data can provide valuable insights into recognizing personality characteristics and identifying honest signals of creativity in individuals, by analyzing communications in a student email network [83]. Malik and Malik discussed the challenges associated with the analysis of, and the development of appropriate tools for, the massive amount of data produced by large-scale social networks, in addition to the privacy issues related to social networks [84]. Stieglitz and Kaufhold presented the first steps in exploring data collection from different social networks and described the architecture of a software prototype for full-text analysis in social networks, for the purpose of analyzing communication about individuals and events in social networks. The first application of this prototype was in the sector of political communication [85]. Takaffoli et al. proposed a framework and a community matching algorithm for observing community changeovers and evolutions in social media through time. They evaluated their proposed framework on two social network datasets: (1) the Enron email dataset, which contains email messages between the employees of the Enron Corporation, and (2) the DBLP (Data Base systems and Logic Programming) co-authorship dataset, which contains a computer science co-authorship network [86].

Derczynski et al. viewed social media data as a constant stream of data points, each point containing text associated with spatial and temporal contexts. They identified challenges specific to each of the temporal, spatial, and spatio-temporal contexts, with a view to context-aware querying and analysis, especially longitudinal analyses on social media archives, spatial keyword search, local intent search, and spatio-temporal intent search. At the end of their study, they discussed the emerging applications and further opportunities for investigation for each context [87]. Another work that focused on user behavior in social networks is that conducted by Jiang et al., who studied the latent (passive) interactions of users (such as viewing profiles) in the Chinese social network Renren [88]. A Master's dissertation by Santiago applied statistical analysis and data mining techniques to millions of user-generated documents and events in social media in order to discover and demonstrate the relationship between social media data and terrorist events in different countries of the world [89]. The software prototype developed by Stieglitz and Kaufhold was later used by Stieglitz and Xuan to construct a methodological framework for social media analytics. This framework is utilized to summarize the most important issues in the political context from the point of view of political institutions, together with different methodologies from other scientific disciplines [90].

Table 3.1 summarizes the related work concerning social media.

28

Table 3.1 Social Media Related Work Summary

Author(s)

Addressed Problem(s)

Contribution(s)

and Publish Year

Adamic and How contributors in a small Comparing online social network Adar (2005) world experiment can find structure to the structure of an e-mail [75]

short

paths

in

a

social network

network using only local information

about

their

immediate contacts?

Java

(2008) How to analyze the structure Frameworks for analyzing social

[21]

and content of social media media content and structure, and data effectively to realize the community detection nature

of

online

communication

and

collaboration

in

social

networks?

Maia et al. How to best classify user Methodology for characterizing and (2008) [76]

behaviors in online social identifying user behaviors in online networks?

Zhou (2008) How [77]

to

network

social networks.

improve analysis

social Introducing

probabilistic

content

by models for user generated social

analyzing networks as well documents

and

annotations,

and

as social content and social investigating the connection between actions among users

social content and social actions

29

Table 3.1 Continue

Cormode al.

et How

to

model

social A platform for developing models

(2010) networks and identify its and measures for social media data

[78]

important features?

Gozzo

and What is the role played by Displaying the different connections

D'Agata

social

networks

(2010) [79]

encouraging

in between

political

and

social

political participation in social networks

participation?

Kuan (2010) How to identify and define Algorithms for mining direct and [80]

the properties of different indirect antagonistic groups on social types

of

communities

antagonistic networks on

social

networks?

Wang (2010) What are the impacts of Explaining a dynamic perspective for [24]

using social media in the social media data analysis and its fields of national security strategic importance for detection, and defense and how to minimization and prevention of the overcome

the

undesirable troublemaking influences of internet-

threats?

based social media on national defence and security

Yun

et

al. What are the influencing Defining online friends and providing

(2010) [81]

factors on the strength of an analysis of user interactions user interactions in online social networks?

30

Table 3.1 Continue

Agrawal et al. (2011) [82] | How does a single news item or idea spread throughout a social network? | An integrated approach to study information diffusion in online networks

Becker (2011) [32] | How to identify and characterize different events in social media? | Methodology for organizing social media documents which aids to identify and characterize a huge set of events that occur in social media

Gloor et al. (2011) [83] | How to recognize signals of creativity in individuals through the use of social network analysis? | Initial results on individual creativity forecasting based on interpersonal interaction patterns

Malik and Malik (2011) [84] | What are the challenges associated with analyzing large scale social networks? | A snapshot of challenges associated with large scale social networks analysis

Stieglitz and Kaufhold (2011) [85] | How to analyze large amounts of text generated in social media on a short time scale? | Describing the architecture of a software prototype for full text analysis in social networks for the purpose of analyzing communication about individuals and events in social networks

Takaffoli et al. (2011) [86] | How to detect evolution and structural changes of communities in social networks? | Framework and community matching algorithm to observe community changeovers and evolutions in social media through time


Derczynski et al. (2013) [87] | What are the temporal, spatial, and spatio-temporal challenges associated with analyzing social media? | Recognition of the spatial, temporal, and spatio-temporal challenges associated with analyzing social media

Jiang et al. (2013) [88] | How to obtain a deep understanding of visible and latent user interactions over online social networks? | Exhaustive study of Chinese social network Renren and constructing latent interaction graphs to compare against visible interactions

Santiago (2013) [89] | Is there an obvious association between terrorist events and data on social media? | Proving that there is a significant relationship between terrorist events and social media data

Stieglitz and Xuan (2013) [90] | How can political institutions and other scientific disciplines exploit the potentials of political discussions in social media sufficiently? | Methodological framework for social media analytics in political context

3.2.2 Twitter

Recent research has begun to concentrate on content-related issues of online social media, particularly Twitter. Twitter analysis is a broad field that has attracted researchers from many disciplines in recent years, because analyzing Twitter information can yield valuable knowledge. The vast amount of information extracted from Twitter has been used in a wide range of applications, including measuring political opinion, predicting stock market prices, gauging national sentiment, and health care.

The first set of research concentrates on a general understanding of the nature of Twitter. One of the pioneering works on Twitter is that of Java et al., who focused on studying usage and communities. They monitored the Twitter public timeline and analyzed posts from distinct users for two months in order to understand how and why people tweet. They identified four major types of user intentions: daily chatter, conversations, sharing information, and reporting news. The study categorized the roles played by Twitter users into three main groups: information source, friends, and information seeker [31].

Huberman et al. studied social interactions within Twitter and defined a user's friend as "a person whom the user has directed at least two posts to". They reached three conclusions. First, Twitter users have very few actual friends compared with the number of followers and followees they declare. Second, users with a large number of actual friends tend to post more updates than users with fewer actual friends. Third, users with many followers or followees (but fewer actual friends) post updates less frequently than users with few followers or followees [91].

Krishnamurthy et al. presented an exhaustive characterization of Twitter. They used three datasets covering about 100,000 Twitter users, collected from the Twitter public timeline using two methodologies based on Twitter API functions. Like the study by Java et al., this study aimed to identify different classes of Twitter users and their behaviors [30].
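The friend definition of Huberman et al. quoted above can be illustrated with a short sketch. This is a minimal Python example and not the authors' implementation: the function name `actual_friends`, the `(sender, recipient)` input format, and the toy data are assumptions made here for illustration. Only the threshold of two directed posts comes from the definition in the text.

```python
from collections import Counter


def actual_friends(directed_posts, min_directed=2):
    """Map each sender to the set of users who received at least
    `min_directed` directed posts from that sender.

    `directed_posts` is an iterable of (sender, recipient) pairs; this
    input format is an assumption made for the sketch.
    """
    counts = Counter(directed_posts)
    friends = {}
    for (sender, recipient), n in counts.items():
        if n >= min_directed:
            friends.setdefault(sender, set()).add(recipient)
    return friends


# Hypothetical toy data: each pair is one post directed from sender to recipient.
posts = [
    ("alice", "bob"),
    ("alice", "bob"),
    ("alice", "carol"),
    ("bob", "alice"),
]
friends = actual_friends(posts)
print(friends)
```

In this toy data only bob counts as a friend of alice, because carol received a single directed post. Counting followers or followees would require separate data and is deliberately left out of the sketch.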
Honeycutt and Herring studied conversation and collaboration between Twitter users by analyzing conversations on Twitter, paying attention to the functions and uses of the @ sign. They collected tweets from Twitter's public timeline in four one-hour samples taken at four-hour intervals on January 11, 2008. They found that short, dynamic conversations are the most common, along with some longer conversations among multiple participants [15]. Zhao and Rosson conducted an exploratory study to gain a deeper understanding of how and why ordinary people use Twitter. They explored how the characteristics of micro-blogging behaviors enable informal communication, and studied the role and influence of micro-blogging on informal communication at work. To this end, they interviewed eleven Twitter users working in a large information technology company [92].

Kwak et al. conducted a quantitative study that crawled the entire Twitter sphere through the Twitter API in order to study Twitter's topological characteristics and to understand information diffusion on Twitter and its power as an information-sharing medium [8]. Naaman et al. examined the characteristics of social activity and patterns of communication on Twitter. They coded tweets manually and analyzed their content in order to compute the percentage of each of the four types of user intentions identified by Java et al. Their study revealed that information sharing (22%), opinions or complaints (approximately 25%), random thoughts (approximately 25%), and personal status (approximately 40%) account for the vast majority of tweets. They concluded that most Twitter users focus on themselves, while a minority is concerned with sharing information [93].

Weerkamp et al. studied how people use Twitter in various languages and how usage differs from one language to another. These differences are reflected in the usage of four specific Twitter features: hashtags, links, mentions, and conversations. Their study was based on tweets written in eight different languages, with a dataset of 1,000 tweets constructed for each language [94]. Goel et al. investigated the problem of determining similar users on Twitter. They focused on "production similarity", where two users are defined to be similar if they generate similar content. They proposed a machine-learning framework built on Hadoop to discover high-quality similar accounts for hundreds of millions of Twitter users every day [95].

Another set of research paid attention to the analysis of Twitter for medical purposes.
Wegrzyn-Wolska et al. followed tweets discussing the "Escherichia coli" epidemic in three languages: English, French, and Polish [96]. In a doctoral dissertation, Yoon conducted an observational study of Twitter messages relevant to healthy behaviors such as physical activity [97].

Many researchers have studied the use of Twitter analysis for tracking, predicting, and preventing influenza trends. Chew and Eysenbach developed an open-source infoveillance system called Infovigil to gather tweets. They then analyzed the content of tweets containing the keywords or hashtags "H1N1", "swineflu", and "swine flu" during the 2009 H1N1 outbreak [98]. Culotta investigated several models for analyzing influenza-related messages posted on Twitter in order to estimate the rates of influenza-like illnesses (ILI) in a population, detect keywords that correlate with influenza rates, and combine the detected keywords to predict national influenza rates and outbreaks. The investigation used a dataset of 574,643 Twitter messages gathered over a 10-week period from the public timeline of Twitter [10]. Lampos and Cristianini reported on a monitoring tool for measuring the prevalence of H1N1 in the United Kingdom. They analyzed the textual content of the Twitter stream in the United Kingdom for six months during the H1N1 flu pandemic [99]. Similarly, Achrekar et al. presented the Social Network Enabled Flu Trends (SNEFT) framework, which monitors messages posted on Twitter that mention flu indicators in order to track and forecast the emergence and spread of influenza epidemics in the real world. The tweets were gathered over one year. Their results indicated that Twitter data can improve the precision of ILI prediction models and can thus provide an accurate, timely assessment of ILI activity [100]. Paul and Dredze examined Twitter for a wide variety of public health data, extracted automatically and covering many illnesses, rather than a limited group of applications on a small number of illnesses [101]. Signorini et al. investigated the use of information embedded in Twitter to (1) track rapidly evolving public concerns about H1N1 or swine flu, and (2) track and estimate actual disease activity in real time. They collected and stored a large sample of public tweets that matched a set of pre-specified search terms in order to monitor influenza-related traffic within the United States [102].
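Several of the studies above select tweets by matching keywords or hashtags. The following is a minimal Python sketch of such a filter, assuming tweets are available as plain strings. It is not the pipeline of any cited study: the helper names `build_matcher` and `filter_tweets` and the sample tweets are hypothetical, and only the three search terms are taken from the description of Chew and Eysenbach's analysis.

```python
import re

# Search terms quoted in the text for the 2009 H1N1 analysis.
KEYWORDS = ["H1N1", "swineflu", "swine flu"]


def build_matcher(keywords):
    """Compile a case-insensitive pattern that matches any keyword as a
    whole token, optionally written as a hashtag."""
    alternatives = "|".join(re.escape(k) for k in keywords)
    return re.compile(r"(?<!\w)#?(?:" + alternatives + r")(?!\w)", re.IGNORECASE)


def filter_tweets(tweets, keywords=KEYWORDS):
    """Return the tweets that mention at least one keyword."""
    matcher = build_matcher(keywords)
    return [t for t in tweets if matcher.search(t)]


# Hypothetical sample tweets, for illustration only.
sample = [
    "Got my #H1N1 vaccine today",
    "Worried about swine flu at school",
    "Lovely weather this morning",
    "#swineflu cases rising?",
]
matched = filter_tweets(sample)
print(matched)
```

The word-boundary checks keep a term embedded in a longer token from matching. A real collection pipeline would also have to handle retweets, spelling variants, and non-English text, which this sketch ignores.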
Another group of research analyzed the use of Twitter in the field of politics. Diakopoulos and Shamma analyzed and characterized people's sentiments (aggregated from Twitter posts) concerning the debate between Barack Obama and John McCain before the 2008 U.S. Presidential elections [103]. Tumasjan et al. conducted a study with three objectives: first, to examine whether Twitter can enhance online political discussion by monitoring how people use microblogging to exchange information on political subjects; second, to assess whether Twitter messages meaningfully mirror existing offline political sentiment; and third, to investigate whether Twitter activity can be used to forecast the popularity of political parties or alliances in the offline world. The study relied on tweets published on the public timeline of Twitter in the weeks before the federal election of the German national parliament, held on September 27, 2009. The results confirmed that Twitter can serve as a vehicle for political deliberation [4]. Zhou et al. studied the role of Twitter in information dissemination by examining the tweets posted during the 2009 post-election demonstrations in Iran. The tweets were gathered from the public timeline using the Twitter API [104]. Conover et al. explored several approaches to forecasting the political alignment of Twitter users, distinguishing between users who belong to the left wing and those who belong to the right wing, so that political campaigns can take advantage of this alignment when constructing their strategies. The study was based on a dataset derived from the tweets of 1,000 Twitter users concerning the 2010 U.S. midterm elections, extracted using the Twitter 'garden hose' streaming API. In line with the positive findings of Tumasjan et al., Conover et al. concluded that Twitter is effective in predicting the political alignment of individuals [5]. Bravo-Marquez et al. conducted an experimental exploration of opinion time series extracted from Twitter messages relevant to the 2008 U.S. Presidential elections and analyzed the resulting findings. In contrast to the conclusions of Tumasjan et al. and Conover et al., Bravo-Marquez et al. concluded that opinion time series extracted from Twitter cannot act as a dependable predictive model for elections [6].
Wegrzyn-Wolska and Bougueroua described a system that surveyed the trends of the 2012 French Presidential elections from discussions on Twitter. The system automatically gathered, assessed, and rated tweets in order to appraise trends and detect trend changes in electoral behavior [7]. Nooralahzadeh et al. applied Natural Language Processing (NLP) and data mining techniques to compare the prevailing sentiments towards candidates in the 2012 Presidential elections in the USA and France, focusing primarily on time series analysis, complemented by word cloud and hashtag analysis. The Twitter datasets for both the French and the American Presidential elections were obtained using the Twitter API [3].


The next set of research deals with the use of Twitter in predicting stock markets. Bollen et al. investigated whether collective national mood states, derived from large-scale collections of daily Twitter messages, are correlated with, or can be used to predict, the value of the Dow Jones Industrial Average (DJIA) over time. They used two tools to measure changes in public mood from tweets posted over a period of about ten months. The results revealed that public mood variations can be tracked through text processing of Twitter posts [105]. Similarly, X. Zhang et al. described early work that attempted to forecast stock market indicators such as the Dow Jones, NASDAQ, and S&P 500 by analyzing positive and negative moods in Twitter feeds. They worked on a dataset of tweets gathered over six months and concluded that monitoring emotional outbursts of Twitter users can give an indication of stock market behavior on the following day [106].

The use of Twitter for marketing purposes has been discussed in another set of research. Jansen et al. reported the results of research that investigated micro-blogging as a form of electronic word of mouth (e-WOM) for sharing consumer opinions, comments, and sentiments regarding brand names. They inspected Twitter posts mentioning brand names (gathered over thirteen weeks), specifically those expressing opinions or emotions towards a brand. The results supported the usefulness of Twitter as a marketing tool [107]. Bulearca and Bulearca presented a qualitative exploratory study of the perceptions, uses, benefits, and limitations of Twitter as a form of electronic word-of-mouth marketing by small and medium-sized enterprises (SMEs). Data were collected through semi-structured online interviews. In agreement with Jansen et al., the findings support the view that Twitter is a vital platform for companies to listen to their customers' opinions [108].
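The stock-market studies above ask whether mood measured on one day is related to market behavior on the next day. One simple way to express that question is a lagged correlation, sketched below in Python. This is plain Pearson correlation with a one-day shift and is not claimed to be the method of Bollen et al. or X. Zhang et al.; the function names `pearson` and `lagged_correlation` and the toy series are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def lagged_correlation(mood, index_change, lag=1):
    """Correlate the mood score on day t with the index change on day t + lag."""
    if len(mood) != len(index_change):
        raise ValueError("series must cover the same days")
    if lag < 0 or lag >= len(mood) - 1:
        raise ValueError("lag must leave at least two overlapping days")
    return pearson(mood[: len(mood) - lag], index_change[lag:])


# Hypothetical toy series (one value per day), for illustration only.
daily_mood = [1.0, 2.0, 3.0, 4.0, 5.0]
daily_index_change = [0.0, 2.0, 4.0, 6.0, 8.0]
r = lagged_correlation(daily_mood, daily_index_change, lag=1)
print(round(r, 3))
```

A coefficient near 1 or -1 on real data would only indicate association at that lag; it would not by itself establish that mood predicts the market.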
The final set of research in this section emphasizes the use of Twitter for event detection, building on the real-time nature of Twitter. Sakaki et al. investigated the real-time interaction of Twitter users during catastrophic events such as earthquakes, and proposed an algorithm that monitors tweets and detects a target event. They constructed an earthquake reporting system in Japan that detects earthquakes with high probability by monitoring tweets and warns registered users by e-mail [9]. Becker et al. developed a system for automatically identifying and presenting Twitter content relevant to prearranged events, using a combination of simple rules and advanced query-building strategies [109]. Jackoway et al. studied how to identify and discover current and future live news events, along with information about the event location, by extracting related and consistent information from Twitter posts and determining which Twitter users post reliable tweets [110]. Similar to Sakaki et al., Earle et al. evaluated the speed of Twitter users' reaction to the earthquake of Morgan Hill, California, in March 2009. They also presented and assessed a procedure for detecting earthquakes based on data extracted from Twitter [111]. Crooks et al. studied the performance of Twitter as a sensor system for the geographic detection of earthquakes, using the 2011 earthquake near the town of Mineral, Virginia, USA, as a case study [112]. Table 3.2 summarizes the related work concerning Twitter.

Table 3.2 Twitter Related Work Summary

Author(s) and Publish Year | Addressed Problem(s) | Contribution(s)

Java et al. (2007) [31] | What are the major behaviors and intentions of micro-bloggers? What are the roles they play? | Studying the topological and geographical properties of Twitter and analysis of user intentions at the community level

Huberman et al. (2008) [91] | What are the different types of social interaction within Twitter? | A detailed analysis of Twitter and social interactions of Twitter users

Krishnamurthy et al. (2008) [30] | How to characterize Twitter and analyze its users and geographical distribution? | An exhaustive characterization of Twitter


Honeycutt and Herring (2009) [15] | How far does Twitter endorse user-to-user interactions? Why do people use Twitter? What are the required alterations to make Twitter more functional as a collaboration tool? | Analysis of the conversational and collaborative features of Twitter, particularly the functions and usages of the @ sign

Zhao and Rosson (2009) [92] | How and why do ordinary people use Twitter? How do the features of micro-blogging behaviors enable informal communication? What are its potential influences on the informal communication at work? | Providing an analysis and understanding of the role played by Twitter as an instrument of informal communication in the work environment

Kwak et al. (2010) [8] | How do people interact on Twitter? Who are the people of the most influence? What are the trending topics? How does information diffusion take place via retweet? | Quantitative study on the entire Twitter sphere, its topological characteristics, and information diffusion on it

Naaman et al. (2010) [93] | What are the characteristics of social activity and patterns of communication on Twitter? | Classifying Twitter user intentions (showing the percentage of each intention) and identifying the behavioral patterns of Twitter users


Weerkamp et al. (2011) [94] | Does the language difference affect the usage of Twitter? | Emphasizing the difference in using Twitter features due to the language difference

Goel et al. (2013) [95] | How to find out similar users on Twitter? | A machine-learning based framework to discover similar Twitter accounts

Wegrzyn-Wolska et al. (2011) [96] | What are the problems and challenges associated with the use of Social Network Analysis and Text Mining methods for applications in e-health and medical purposes? | Presenting the directions where social network analysis can be used for medical purposes

Yoon (2011) [97] | What can be learned about physical activities from Twitter messages? | Advancing the methodological breadth of mining social media for health-related purposes

Chew and Eysenbach (2010) [98] | Is it feasible to use Twitter to measure public opinion towards a specific topic? | Illustrating that social media and particularly Twitter can be used to conduct studies about public health

Culotta (2010) [10] | Is it possible to detect influenza outbreaks through analyzing Twitter content? | Investigation of several models for analyzing tweets to predict rates of influenza-like illnesses in a population


Lampos and Cristianini (2010) [99] | How to track the spread of flu pandemic through Twitter analysis? | A method that analyzes hundreds of thousands of Twitter messages daily to measure the prevalence of a disease in a population

Achrekar et al. (2011) [100] | How can Twitter data be used to forecast flu trends? | Framework for monitoring Twitter posts relevant to flu indicators, in order to track and forecast the emergence and spread of influenza epidemic in the real world

Paul and Dredze (2011) [101] | Is there a public health signal that can be detected within the broad chatter of Twitter? | Analysis of Twitter data for a variety of diseases instead of a limited group of applications on a small number of illnesses

Signorini et al. (2011) [102] | How can embedded information in Twitter be used to measure public sentiment concerning H1N1 and measure actual activity of the disease? | Using embedded information in Twitter to measure public interest concerning H1N1 and measuring actual activity of the disease

Diakopoulos and Shamma (2010) [103] | How do Twitter users react to a political media event? | An analytical methodology to analyze the sentiments of users posting Twitter messages concerning a televised political debate


Tumasjan et al. (2010) [4] | Can Twitter be used as a medium for political debate? Can Twitter messages reflect offline political emotions? | Confirming that Twitter can be used as a platform for political discussion

Zhou et al. (2010) [104] | How does a message spread widely among the Twitter users? Are the resulting cascade dynamics different due to the unique Twitter features? What role does message content play in its popularity? | Emphasizing the role of Twitter as a medium for information propagation, and displaying the structure and mechanism of information propagation on Twitter

Conover et al. (2011) [5] | How to forecast the political alignment of Twitter users based upon their tweet content and structure of their political communication? | Proving how effectively we can use Twitter in predicting the political alignment of its users and discriminating between left wing and right wing users

Bravo-Marquez et al. (2012) [6] | Is a time series suitable for reliable prediction? | Opinion time series analysis of messages extracted from Twitter relevant to the 2008 U.S. Presidential elections

Wegrzyn-Wolska and Bougueroua (2012) [7] | What are the problems and challenges associated with the use of Social Network Analysis and Text Mining methods for applications in politics? | A system which surveyed the trends of the French 2012 Presidential Elections from discussions on Twitter


Nooralahzadeh et al. (2013) [3] | What is the nature of elections? What is the impact of social media on elections? | Comparison of the prevailing sentiments before and after the 2012 US and France Presidential elections through time series analysis

Bollen et al. (2011) [20] | How does public mood obtained from Twitter posts influence the value of a stock market indicator over time? | Analyzing the textual content of Twitter feeds every day using two mood tracking tools

X. Zhang et al. (2011) [106] | Can the behavior of stock market indicators be predicted by analyzing the sentiments reflected in Twitter posts? | Concluding that checking of the sentimental outbreaks of Twitter users can give an indication of the stock market behavior in the next day

Jansen et al. (2009) [107] | Is Twitter suitable for use as a form of electronic word-of-mouth marketing? | Emphasizing the usefulness of Twitter as a marketing tool

Bulearca and Bulearca (2010) [108] | Should Twitter be considered by Small and Medium-sized Enterprises (SMEs) in their marketing strategies? | Qualitative investigative study on the perceptions, uses, benefits and limitations of Twitter as a form of electronic word-of-mouth marketing by small and medium-sized enterprises (SMEs)


Sakaki et al. (2010) [9] | Can the event occurrence in real-time be detected by monitoring tweets? | An earthquake reporting system in Japan that can detect earthquakes by monitoring tweets

Becker et al. (2011) [109] | How to identify Twitter posts related to an event? | A system for automatically identifying and presenting Twitter content relevant to prearranged events

Jackoway et al. (2011) [110] | How to identify and discover current and future live news events along with information regarding the event location? Which Twitter users post reliable tweets? | Development of a method for determining which Twitter users post reliable information and which Twitter posts are of interest

Earle et al. (2012) [111] | Can Twitter be used to detect earthquakes? How fast can Twitter-based systems detect earthquakes? How can these systems be used during real-time earthquake response? | An automatic earthquake detection algorithm that depends upon Twitter data

Crooks et al. (2013) [112] | What are the spatial and temporal characteristics of the Twitter feed activity in response to an earthquake? | Analysis of the spatial and temporal characteristics of the Twitter feed activity in response to an earthquake

3.2.3 Document Clustering

Document clustering is one of the most important research topics in information retrieval and text mining; its objective is to serve human needs in searching for and understanding information. The resulting clusters can be used to clarify the features of the underlying data, and therefore act as a basis for other data mining and analysis techniques [113]. The idea behind document clustering algorithms is to gather documents into groups according to their similarity. This enables users to locate documents of interest much more easily and to obtain an overview of the set of retrieved documents [114].

A large number of research studies have been performed on document clustering. The first set of research is concerned with reviewing the concept of document clustering and its well-known techniques. Steinbach et al. presented a technical report that experimentally compared two of the most common approaches to document clustering: agglomerative hierarchical clustering and the K-means algorithm [59]. Zeng et al. studied the issue of organizing (online) web search results into clusters. They re-formalized the search result clustering problem from an unsupervised clustering problem into a supervised learning problem [115]. Zhu et al. presented an algorithm that clusters documents into one ultimate cluster built upon frequent co-occurring word sets [116]. Huang compared and analyzed the effectiveness of similarity measures in partitional clustering for text document datasets. She applied the standard K-means algorithm with five similarity measures in experiments over seven text document datasets [60]. Jajoo tried a different document clustering approach. First, he used a standard clustering approach to cluster the words of the documents in order to reduce data noise and improve time efficiency.
Then, he used the word clusters (containing the frequent co-occurring words) to cluster the documents, in a way similar to that of Zhu et al. [42]. Sathiyakumari et al. surveyed different document clustering approaches and algorithms in text mining [58]. Similarly, Aggarwal and Zhai conducted a comprehensive investigation of the document clustering problem, studied its major challenges, and discussed the main approaches used for document clustering and their comparative advantages [117]. Koteeswaran et al. provided a review of implementation techniques and recent research on clustering and outlier analysis [118]. Anastasiu et al. prepared a technical report in February 2013. This report introduced general-purpose document clustering, described its challenges, and paid attention to the most recent developments concerning the next frontier in the field: long and short documents [119].

Another set of research has considered the use of Genetic Algorithms (GAs) for clustering and document clustering. One of the pioneering studies in this field was conducted by Jones et al. in 1995. Their experiments used three document test collections comprising documents, queries, and their associated relevance judgments. They compared the effectiveness of clusters produced by a GA-based clustering technique with that of network-based clustering, and concluded that Genetic Algorithms are not of practical use in document clustering [120]. It should be noted, however, that this study dates from 1995 and that later research reached a markedly different conclusion. Maulik and Bandyopadhyay, for example, introduced a clustering technique based on a Genetic Algorithm. The algorithm was examined over a wide variety of artificial and real-life data sets, and the results were compared against those obtained by the k-means clustering algorithm. The results showed a significant superiority of the GA-based clustering algorithm over the k-means algorithm [121]. In 2003, Casillas et al. introduced a genetic algorithm that clusters documents into an unknown number of clusters. Their experiments were carried out over a collection of 14,000 news items gathered from a Spanish newspaper [122]. Premalatha and Natarajan introduced a method for clustering a set of documents based on a Genetic Algorithm using a simultaneous mutation operator and a ranked mutation rate.
Experimental results were obtained on a number of common datasets and compared with a simple GA and k-means. They demonstrated that the proposed algorithm statistically outperforms both the simple Genetic Algorithm and the K-means algorithm [40]. Like the two preceding studies, Jian-Xiang et al. developed a GA-based algorithm to cluster documents and showed that a GA can produce better outcomes than K-means [43]. Verma et al. used a modified genetic algorithm for clustering documents. The modification lies in the creation of the initial population: instead of being generated randomly, the initial population is created by measuring the similarity between documents based on their sum of squared distances from the previously selected documents. The modified algorithm was compared with, and shown to outperform, the K-means algorithm [123]. Another modification of the genetic algorithm for document clustering was presented by Meena et al., who combined features of the Genetic Algorithm (GA) with features of Discrete Differential Evolution (DDE) for text document clustering. The purpose of this combination is to decrease the number of iterations required by the GA to find the optimum solution. The experiments were implemented using fifty documents from the Reuters-21578 collection [113]. Usharani and Iyakutti proposed a GA-based method for finding the similarity between web documents according to cosine similarity [47]. Table 3.3 summarizes the related work concerning document clustering.

Table 3.3 Document Clustering Related Work Summary

Author(s) and Publish Year | Addressed Problem(s) | Contribution(s)

Steinbach et al. (2000) [59] | Comparison of common document clustering techniques | An experimental study of some popular document clustering approaches

Zeng et al. (2004) [115] | How to organize web search results using other techniques than the inadequate traditional clustering techniques? | Re-formalizing the search result clustering problem from an unsupervised clustering problem to a supervised learning problem

Zhu et al. (2006) [116] | High dimensionality of the clustered data, huge database, and absence of intuitive cluster description | A document clustering algorithm depending upon frequent co-occurring word sets


Huang (2008) [60] | How to precisely define similarity between objects in order to achieve accurate clustering? | Investigation of the effectiveness of similarity measures in Partitional clustering for text document datasets

Jajoo (2008) [42] | How to improve the accuracy and efficiency of document clustering with large number of documents added daily from different sources? | Introduction of a clustering algorithm where the data noise is minimized by first clustering the words of documents followed by clustering the documents on the basis of their word clusters

Sathiyakumari et al. (2011) [58] | How to provide a complete evaluation of various techniques used in document clustering? | A comprehensive overview of different clustering approaches in text mining

Aggarwal and Zhai (2012) [117] | How to obtain a full understanding of the document clustering problem, along with its main challenges and techniques? | A comprehensive investigation of document clustering, its major challenges, the main approaches for document clustering and their comparative advantages

Koteeswaran et al. (2012) [118] | How to provide a complete evaluation of various clustering and outlier analysis approaches? | An assessment of data clustering and outlier analysis technique

48

Table 3.3 Continue

Anastasiu et al. What is the concept of Introducing the general purpose (2013) [119]

Document

clustering? document clustering, describing

What are its challenges and its challenges, and the recent recent advances?

developments concerning the next frontier in document clustering

Jones et al. (1995) Can Genetic Algorithms Introduction [120]

of

document

GAs be used for document clustering technique based on clustering?

Maulik

Genetic Algorithm

and How to get a clustering Introduction

Bandyopadhyay

methodology

which

is technique

(2000) [121]

simple as K-means and Algorithm

of based

clustering on

Genetic

avoids its drawbacks

Casillas

et

(2003) [122]

al. How to handle the problem A genetic algorithm for document of clustering a set of clustering documents

without

knowing in advance the suitable number of clusters

Premalatha Natarajan [40]

and How to provide the best Genetic Algorithm for document (2009) grouping

of

documents clustering

into K number of clusters

using

dynamic

mutation operator and adaptive mutation rate

Jian-Xiang et al. How to (2009) [43]

apply Genetic Genetic Algorithm for document

Algorithm for document clustering clustering?

49

Table 3.3 Continue

Verma

et

(2010) [123]

al. How

to

improve

performance Algorithm

of in

the Genetic Algorithm with squared

Genetic distance

optimization

for

document document clustering

clustering?

Meena

et

(2012) [113]

al. How

to

improve

performance Algorithm

of in

Genetic Differential document optimization

clustering?

Usharani Iyakutti [47]

and How

to

the Genetic Algorithm with Discrete Evolution for

document

clustering

increase

the A Genetic Algorithm based on

(2013) relevance of retrieved web cosine documents?

similarity for

relevant

document retrieval

3.2.4 Tweet Clustering

One way of analyzing Twitter is the cluster analysis of tweets. Beyond the general understanding of Twitter and its uses, a number of researchers are interested in the cluster analysis of tweets and Twitter users. Khot applied the k-means clustering technique to collections consisting of a huge number of documents. He selected eight specific news Twitter feeds and concluded that when the documents' content is very short (as in the case of tweets), it is more appropriate to cluster the words instead of the documents. Therefore, he proposed a method for tweet clustering that clusters the words using word co-occurrence as a similarity measure (similar to the approach of Zhu et al. mentioned in the previous section) [124]. Conover et al. used network clustering algorithms to obtain information about the individuals with whom a Twitter user communicates. This information was further used to cluster Twitter users into right-wing and left-wing clusters according to their political beliefs and stances [5]. In his doctoral dissertation, Yoon examined the usage of clustering to detect, summarize, and categorize the content of tweets [97]. Mosley and Roosevelt described how data mining and text analytics can be applied to social media, using a specific example related to an insurance company. They created an archive of public Twitter posts that include a hashtag of the company name, and applied Ward's minimum-variance clustering method to 116 keyword indicators extracted from the archive based on their similarity [33].

Another set of research focuses on clustering users and discovering communities in Twitter. Goyal et al. presented a method to cluster Twitter users depending upon social connections in addition to content and link similarity. Their data was divided into three types: (1) geo-tagged tweets collected from five cities, (2) tweets about specific topics of interest, and (3) tweets from a specific group of users. They analyzed and compared the performance of two standard clustering algorithms for clustering users in Twitter [13]. Moreover, in his Master's dissertation, Kewalramani clustered the users of Twitter into communities, considering Twitter users to be similar according to the similarity of the content they generate, in addition to link similarity and metadata similarity. Kewalramani evaluated the quality of clustering using different similarity measures on different types of datasets [1]. Y. Zhang et al. calculated the similarity between Twitter users depending upon their interests. This similarity is further used as a measure to recognize communities in Twitter. The study was conducted over 45,772 Twitter users, each with at least 100 tweets (gathered using the Twitter API) and 20 friends [12]. Similarly, Yamashita et al. proposed a method to analyze and cluster groups of Twitter users depending upon mutual interests or shared attributes among followers [125].

Another group of research discusses the use of cluster analysis to discover and recommend topics. Phelan et al.
proposed a recommender system that depends on Twitter to recommend current and topical news from a collection of RSS feeds [2]. Similarly, Sankaranarayanan et al. investigated the use of Twitter to build a news processing system that works exclusively with tweets. This system automatically obtains breaking news, identifies current news topics, and groups news tweets into clusters, such that each cluster consists of tweets relating to a particular topic [35]. Bernstein et al. developed an interactive topic browser for discovering topics from short Twitter status updates, powered by linguistic syntactic transformation and callouts to a search engine, using a technique called "TweeTopic". The dataset was composed of 100 random tweets. After being browsed, the tweets are clustered into topics mentioned either implicitly or explicitly [126]. Karandikar addressed the problem of determining which topic model is the most appropriate for clustering tweets, judged by its clustering performance. He used R, a system for statistical analysis and graphics, to apply k-means clustering to the tweets (written in English and aggregated into four datasets) based upon their topic vectors. He also clustered users on Twitter according to the content they generate [18]. On the same track, O'Connor et al. presented a topic extraction system called "TweetMotif" that clusters Twitter messages (gathered from the Twitter API) by frequent significant terms [127]. Rangrej et al. gathered tweets using TweetMotif in order to compare the performance of three different document clustering techniques (K-means, an SVD-based method, and a graph-based approach) on short text data collected from Twitter. They performed their experiments on a dataset of 611 handpicked tweets representing different topics from Twitter [11]. Rosa et al. presented a study on automatically clustering and categorizing tweets into six different pre-defined topics through the use of hashtags as approximate indicators of tweet topics, encouraged by the approaches adopted by news aggregating systems such as Google News. Their clustering technique was evaluated using a dataset of more than one million Twitter messages gathered using the Twitter API over two weeks [128]. Kim et al. proposed a clustering method called Core-Topic-based Clustering (CTC) to extract meaningful topics from tweets and cluster tweets according to the topics. They exploited the Retweet ratio (RT ratio) as a weight to evaluate the score of clusters.
Experiments were performed over a dataset consisting of tweets (gathered over a period of one month) about four popular TV programs. The obtained results were compared with those of the K-means algorithm and demonstrated to be better [129]. Similarly, Rafea and Mostafa presented their experience in extracting Arabic hot topics from Twitter. Experiments were performed over 110 tweets collected over four days [130].

The next set of research considers enriching the terms of short documents using Wikipedia, in order to improve the clustering of short text items such as tweets. One of the early studies in this field was conducted by Banerjee et al., who proposed an approach to increase the accuracy of clustering short text items by using Wikipedia as an additional knowledge source to enrich the short text representation with extra features (titles of selected Wikipedia articles). Results indicate that, for most clustering algorithms, the accuracy of clustering improved substantially with the enriched representation [131]. Another work that exploits Wikipedia was presented by Gabrilovich and Markovitch, who proposed a method called Explicit Semantic Analysis (ESA), which applied Wikipedia concepts on a collection of fifty documents to determine closeness between natural language texts [132]. A similar work was performed by Chen et al., who attempted to minimize the impurity of tweet clusters using Wikipedia. First, they expanded the feature and training sets using Wikipedia search expansion for each tweet in order to overcome the limited length of tweets. Second, they used a classifier to reduce the impurity of clusters [133]. Perez-Tellez et al. introduced and compared different methods built upon k-means clustering in order to differentiate between Twitter messages that are related to a specific company and those that are not. Their approach categorizes Twitter messages that include a possible company entity into two clusters: the first cluster corresponds to the Twitter messages that refer to the specific company, while the second cluster corresponds to tweets that refer to another topic. Terms forming the Twitter messages were enriched using Wikipedia. Experiments were carried out on Twitter messages relating to twenty companies, written in English, with only the true and false Twitter messages taken into consideration. The purpose of the experiments is to confirm whether enrichment using Wikipedia leads to an improvement in the clustering of company tweets [19]. Table 3.4 summarizes the related work concerning tweet clustering.


Table 3.4 Tweet Clustering Related Work Summary

| Author(s) and Publish Year | Addressed Problem(s) | Contribution(s) |
|---|---|---|
| Khot (2010) [124] | How to increase the speed of tweet clustering to be as real time as possible? | A method for tweet clustering that clusters the words using word co-occurrence as a similarity measure |
| Conover et al. (2011) [5] | How to forecast the political alignment of Twitter users based upon their tweet content and structure of their political communication? | Proving how effectively we can use Twitter in predicting the political alignment of its users and discriminating between left wing and right wing users |
| Yoon (2011) [97] | What can be learned about physical activities from Twitter messages? | Advancing the methodological breadth of mining social media for the health-related purposes |
| Mosley and Roosevelt (2012) [33] | How to apply data mining and text analysis techniques to social media? | Application of data mining and text analysis techniques to specific example of Insurance company Twitter posts |
| Goyal et al. (2011) [13] | How to understand different data mining and analysis techniques on Twitter? | A method to cluster Twitter users depending upon social connections in addition to content and link similarity |
| Kewalramani (2011) [1] | How to identify communities in Twitter? | A methodology to formulate similarity between any two Twitter users on the basis of their generated content similarity, link similarity and metadata similarity |
| Y. Zhang et al. (2012) [12] | How to recognize communities on Twitter based on users' interests? | Similarity calculation between Twitter users based on their interests. This similarity is further used as a measure to recognize communities in Twitter |
| Yamashita et al. (2013) [125] | How to analyze and cluster groups of Twitter users depending on mutual interests or shared attributes among followers? | A method to analyze and cluster groups of Twitter users based upon a commonality or a shared attribute |
| Phelan et al. (2009) [2] | Can Twitter be used as a recommender system for news? | A recommender system depending on Twitter to rank and recommend current and topical news from a collection of RSS feeds |
| Sankaranarayanan et al. (2009) [35] | How Twitter can be used to automatically extract breaking news from Twitter posts? | A news processing system that works exclusively with Twitter posts |
| Bernstein et al. (2010) [126] | How to obtain topics from Twitter? | An interactive topic browser for discovering topics from Twitter posts |
| Karandikar (2010) [18] | Which topic model is the most appropriate for clustering tweets and Twitter users? | Determining the most appropriate topic model to cluster tweets and Twitter users based on their status updates |
| O'Connor et al. (2010) [127] | How to organize and search through millions of Twitter posts? | A topic extraction system that clusters Twitter messages by frequent significant terms |
| Rangrej et al. (2011) [11] | How to handle the sparseness of words in the condensed short text documents gathered from the web? | Comparative study of the performance of different clustering approaches for short text documents using datasets gathered from Twitter |
| Rosa et al. (2011) [128] | How to detect the topics discussed in Twitter messages automatically? | Automatic clustering and categorizing of tweets into pre-defined topics through the use of hash tags as approximate indicators of tweet topics |
| Kim et al. (2012) [129] | How to extract meaningful topics from tweets? | A clustering method to extract meaningful topics from tweets |
| Rafea and Mostafa (2013) [130] | How to extract Arabic hot topics and recognize the sentiment of Arab users towards these topics from tweets? | Development and application of an approach for assessing Key-phrase extraction algorithm to identify the sentiment topic in a cluster |
| Banerjee et al. (2007) [131] | How to solve the problem of information overload in famous news or blog feeds? | A method to improve the accuracy of clustering short text items using Wikipedia as an additional knowledge source |
| Gabrilovich and Markovitch (2007) [132] | How to compute semantic relatedness of natural language texts? | A method applying Wikipedia concepts to determine closeness of natural language texts |
| Chen et al. (2010) [133] | How to overcome the inaccurate categorization of tweets due to their limited length? | Introduction of the Wikipedia search expansion to reduce the impurity of tweet clusters |
| Perez-Tellez et al. (2010) [19] | How to differentiate between Twitter messages that are related to a specific company and those that are not? | Proposal and comparison of different methods of tweet representation depending upon term expansion and their influence on clustering company tweets |

3.2.5 Cellular Genetic Algorithms (cGAs)

Cellular Genetic Algorithms (cGAs) have been used in various domains to solve different types of problems. One group of research is concerned with the utilization of cGAs in different applications. In the field of transportation, Alba and Dorronsoro introduced a cellular genetic algorithm for solving the well-known Vehicle Routing Problem (VRP) [48]. In the field of communication networks, Alba et al. studied the usage of a cellular multi-objective genetic algorithm (cMOGA) to solve the problem of optimally tuning a particular broadcasting strategy for Metropolitan Mobile Ad Hoc Networks (MANETs) [134]. Nebro et al. introduced a multi-objective cellular genetic algorithm called MOCell for solving multi-objective continuous optimization problems (MOPs). MOCell utilizes an external archive to store the non-dominated solutions found during the execution of the algorithm, in addition to a feedback mechanism in which solutions from this archive randomly substitute the current individuals in the population after each iteration [135]. Guzek et al. proposed a cellular genetic algorithm (called Energy-Aware Communications Scheduler, EACS) for scheduling precedence-constrained applications and optimizing the energy consumption during inter-processor communications in modern parallel and distributed systems through the usage of task clustering techniques [50]. Khezri and Hazrati developed a cellular genetic algorithm for solving the problem of sensor placement in distributed sensor networks for target location, under the restriction of complete coverage of the sensor network at minimum cost [136]. In the field of electric power, Yugui established a power forecast model that combines a cellular genetic algorithm with a BP neural network in order to forecast the mid-to-long-term demand for electric energy in the urban areas of the Chinese city Nanchang [137]. Table 3.5 summarizes the related work concerning cellular genetic algorithms.


Table 3.5 Cellular Genetic Algorithm Related Work Summary

| Author(s) and Publish Year | Addressed Problem(s) | Contribution(s) |
|---|---|---|
| Alba and Dorronsoro (2004) [48] | How to apply Cellular Genetic Algorithms to solve the Vehicle Routing Problem? | A cellular genetic algorithm for solving the Vehicle Routing Problem |
| Alba et al. (2007) [134] | How to optimize broadcasting of MANETs networks using Cellular Genetic Algorithms? | Application of a cellular multi-objective genetic algorithm to solve the optimum broadcasting problem |
| Nebro et al. (2009) [135] | How to use Cellular Genetic Algorithms to solve the multi-objective continuous optimization problems? | A Multi-objective cellular genetic algorithm for solving multi-objective continuous optimization problems |
| Guzek et al. (2010) [50] | How to reduce time for exchanging data in inter-processor communications? How to reduce energy dissipation due to data transfer between processing elements? | A cellular genetic algorithm for scheduling precedence-constrained applications and optimizing the energy consumption during the inter-processor communications |
| Khezri and Hazrati (2013) [136] | How to use Cellular Genetic Algorithms to solve the problem of sensor placement in distributed sensor networks? | A cellular genetic algorithm for solving the problem of sensor placement in distributed sensor networks |
| Yugui (2013) [137] | How to forecast electric demand using Cellular Genetic Algorithms? | A power forecast model combining cellular genetic algorithm with BP neural network to forecast the mid-long term demand for electric energy in the urban areas of the Chinese city Nanchang |

3.3 Summary and Conclusion

The chapter started by providing a historical overview of social media, and of Twitter in particular. It also provided a historical background on the clustering of documents and tweets, on Genetic Algorithms, and on one of their subclasses: Cellular Genetic Algorithms. The second section of the chapter presented a detailed review of the literature concerning these topics. To the best of the researcher's knowledge, Cellular Genetic Algorithms (cGAs) have not previously been used for clustering tweets in the literature. The main contribution of this thesis is the application of cGAs to tweet clustering, and the thesis is considered to be one of the first attempts to do so.


CHAPTER FOUR DATA AND ALGORITHM


This chapter includes the detailed steps of the work done in this study. The basic outline of this chapter is as follows. Section 4.1 presents the conceptual framework for the study. Section 4.2 compares the different approaches to gather data from Twitter and describes the selected data collection technique and tool. Section 4.3 describes the different aspects of data description, data preparation, data representation, and the environment in which the experiments were performed. Section 4.4 includes a detailed description of the simple Genetic Algorithm and the Cellular Genetic Algorithm. Finally, section 4.5 draws the summary and conclusion of the chapter.

4.1 Conceptual Framework

The conceptual framework for the study is presented in Figure 4.1.

Figure 4.1 Conceptual framework

The steps of data collection, data preprocessing, and algorithm application are discussed in detail in the following sections of this chapter. The results, together with their comparison and analysis, are presented and discussed in the following chapters.


4.2 Data Collection

In this section, the common methodologies for gathering data from social media are explained, ending with the selected methodology and a description of the utilized tool. For this kind of research, there is no standard dataset available for testing. The publicly available datasets, such as the famous "Netflix prize" dataset, the Datamob datasets, the Enron Email dataset, or the dataset provided by Stanford University, might not contain all of the data required to answer a particular question. Social networks do not provide complete and precise data directly to researchers because of privacy concerns and the fierce competition between the various social networking sites. The most common practice is for researchers to collect their own datasets from different real-world systems. Three well-known techniques are exploited by researchers to gather social media data:

1. API-driven approach: In this approach, the Application Programming Interface (API) provided by the social network is exploited to query the entities, their characteristics, and the relationships between them. Unfortunately, the Twitter API only permits users to make a limited number of calls in a given hour. Moreover, it returns only the most recently added friends of a user, and it makes only a sample of tweets available.

2. Scraping-based approach: In this approach, the researcher accesses the social network directly through a web client. This approach is harder than the API-driven approach (as the scraper has to cope with the frequent redesigns of the social network) and is subject to bandwidth limitations. However, it is not limited to a specific number of calls like the API-driven approach.

3. Passive network measurement approach: In this approach, the researcher tracks the social network traffic and examines the requests to and from this particular social network. This approach can provide a real view of the studied network. However, it is hindered by privacy issues. Moreover, due to the multiple ways of accessing a social network, it is difficult to keep track of all accesses [78, 138, 139].

The research conducted in this study exploited public data available on the timeline of the Twitter social network. Data for this study were gathered using the scraping-based technique, where Twitter was directly accessed through a web client. The web client is a social network aggregator that pulls content from multiple social networking sites into a single location, so that users can access their social network accounts through a single interface without having to sign in to each site separately. Users who have accounts on more than one social networking site can thus manage their profiles in a much simpler manner [28]. The manner in which users exploit and interact with social network aggregators is displayed in Figure 4.2.

Figure 4.2 User interactions with social networks by social network aggregator [28]

Hootsuite.com is the social network aggregator that was selected by the researcher to gather data from Twitter. Hootsuite.com is a web site that enables its users to track and archive Twitter messages. To track Twitter messages relevant to a particular topic or to a particular user, users can access this website and create an archive, which will then track and archive such Twitter messages. In addition to Twitter, Hootsuite enables its users to archive data on various social networks according to well-defined search criteria. Archives created by others can be retrieved only if the archive owner grants explicit approval. Social networks that can be managed using Hootsuite include Twitter, Facebook, LinkedIn, Google+, Foursquare, WordPress, and Mixi. Moreover, further services such as Tumblr, YouTube, Flickr, MailChimp, SocialFlow, InboxQ, and Constant Contact can be added to the Hootsuite dashboard using a feature known as the "HootSuite App Directory" [140]. An example of the Hootsuite dashboard is displayed in Figure 4.3.

Figure 4.3 Hootsuite dashboard

According to Alexa traffic ranks, Hootsuite occupied global rank number 143 on the 20th of March 2014, as displayed in Figure 4.4.


Figure 4.4 Hootsuite Alexa traffic rank on the 20th of March 2014 [141]

4.3 Data Description and Preparation

4.3.1 Data Description

For the purpose of this study, tweets were collected based on a set of keywords that describe specific real-world topics. Tweets were collected over a three-day period from the 26th of June to the 28th of June 2013. The set of pre-defined keywords comprises eight categories that are intended to be diverse in order to cover different and wide areas of interest: Cinema, Egypt, Film, Hollywood, Iran, Juventus, Messi, and Sport. Eight archives, one for each of these keywords, were established in Hootsuite.com. The type of data gathered using Hootsuite is displayed in Table 4.1.


Table 4.1 Type of Data gathered using Hootsuite

| Type of data |
|---|
| The username of the tweet sender |
| The tweet content |
| The date and time of tweet posting (according to GMT) |
| Twitter identification number of the tweet |
| Geographic coordinates of the user determining his/her location |

A sample of the tweets included in the datasets is displayed in Figure 4.5.

Figure 4.5 Sample tweets from the dataset

The gathered tweets are clustered according to their similarity. Similarity measures in Twitter include:

1. User Connections: The most commonly used similarity measure, which depends upon the following relationship between Twitter users and upon user mentions.

2. Description Content Similarity: Measures the similarity between the descriptions provided by Twitter users on their profile pages.

3. Tweet Content Similarity: Goyal et al. stated that tweet similarity between two users is defined as "the cosine similarity between the documents formed by combining the tweets of a user into one".

4. Hashtag Similarity: Defined by Goyal et al. as "the cosine similarity between the collections of hashtags of the different users". This measure is based on the number of common hashtags between users and the importance of these hashtags [13, 12].

Because the majority of Twitter messages are textual, this study focuses on clustering tweets based on the similarity of their textual content. Cosine similarity is one of the most commonly used similarity measures in data analysis because of its ease of use and fast calculation. It can be utilized to compare words in documents, to normalize the comparison between documents of different word counts, and to compare vectors of profile attributes [142]. For the non-negative term weights used in text representation, cosine similarity takes non-negative values ranging from zero to one [60].

Cosine similarity has been used widely in the literature. Lee et al. used cosine similarity to detect topics in biomedical text [143]; Sankaranarayanan et al. determined topic clustering of Twitter messages based upon cosine similarity [35]; Java exploited text-based cosine similarity for evaluating the relatedness between clustered blog feeds [21]; Sayyadi et al. utilized cosine similarity to cluster documents around topics based on the co-occurrence of keywords in documents [144]; Perez-Tellez et al. utilized cosine similarity to compute similarity between tweets before clustering them into two clusters, one representing tweets relevant to a specific company and another representing irrelevant tweets [19]; Becker utilized cosine similarity to identify and characterize different events in social media [32]; Goel et al. defined two users on Twitter to be similar by computing the cosine similarity of their sets of followers [95]; and Usharani and Iyakutti proposed a method based upon a genetic algorithm for finding similarity between web documents according to cosine similarity [47].
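
The cosine measure discussed above can be illustrated with a minimal Python sketch. It is not the implementation used in this study (which was written in Java); the function names and the three sample tweets are hypothetical, and raw term counts are assumed as vector weights purely for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def term_counts(text):
    """Hypothetical helper: lower-case, whitespace-split term counts."""
    return Counter(text.lower().split())


# Hypothetical tweets, chosen only to show the two extremes.
t1 = term_counts("messi scores again for barcelona")
t2 = term_counts("messi scores twice")
t3 = term_counts("new film opens in hollywood")

sim_related = cosine_similarity(t1, t2)    # two shared terms
sim_unrelated = cosine_similarity(t1, t3)  # no shared terms
```

Because all weights are non-negative, the result stays between zero (no shared terms) and one (identical direction), which matches the range stated above.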

4.3.2 Data Preprocessing

The preprocessing of data involved several steps. The first step was the elimination of tweets that:

• Are not in English
• Have very few words (fewer than three)
• Have just a URL
• Are duplicates of other tweets
• Are re-tweets

The non-English tweets were not taken into consideration because, as previously mentioned in chapter 1, English is the most commonly used language on Twitter. In addition, all stop words, punctuation marks, and symbols were removed, including quotation marks, parentheses, and stray symbols. However, the signs that are significant on Twitter (such as @ and #) were kept.
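
The filtering and cleaning steps above can be sketched as follows. This is only an illustrative Python sketch under stated assumptions, not the code used in the study: the stop-word list is a tiny placeholder, `looks_english` is a crude ASCII-only stand-in for real language detection, and a re-tweet is assumed to start with "RT @". The source does not say how URLs inside retained tweets were handled, so the sketch leaves them to the tokenizer.

```python
import re

# Tiny illustrative stop-word list (assumption; the study's list is not given).
STOP_WORDS = {"a", "an", "the", "is", "in", "of", "to", "and", "for", "on"}
URL_RE = re.compile(r"https?://\S+")
# Keep an optional leading @ or #, drop every other punctuation mark or symbol.
TOKEN_RE = re.compile(r"[@#]?\w+")


def looks_english(text):
    """Crude placeholder: treat pure-ASCII text as English."""
    return text.isascii()


def is_retweet(text):
    """Assumption: re-tweets are marked by a leading 'RT @'."""
    return text.startswith("RT @")


def tokenize(text):
    """Lower-case, strip punctuation/symbols except @ and #, drop stop words."""
    return [t for t in TOKEN_RE.findall(text.lower()) if t not in STOP_WORDS]


def preprocess(tweets):
    seen = set()
    kept = []
    for raw in tweets:
        text = raw.strip()
        if not looks_english(text) or is_retweet(text):
            continue
        if len(text.split()) < 3:                 # fewer than three words
            continue
        if URL_RE.sub("", text).strip() == "":    # nothing but URLs
            continue
        if text in seen:                          # exact duplicate
            continue
        seen.add(text)
        kept.append(tokenize(text))
    return kept


sample_tweets = [
    "Messi scores again for #Barcelona!",
    "Messi scores again for #Barcelona!",
    "RT @fan: Messi scores again for #Barcelona!",
    "http://example.com/x",
    "Go Juventus",
    "Le cinéma français est génial",
    "@user the new film is great",
]
cleaned = preprocess(sample_tweets)
```

Only the first and the last sample tweets survive the filters; the others are dropped as a duplicate, a re-tweet, a URL-only tweet, a two-word tweet, and a non-ASCII (assumed non-English) tweet respectively.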

4.3.3 TF-IDF representation

The researcher used a tweet representation based on Term Frequency - Inverse Document Frequency (TF-IDF). TF-IDF is a popular vector representation that is commonly used in Natural Language Processing (NLP). It measures the statistical weight of terms in a given document corpus (reflecting the importance of a word across the corpus), depending upon how often each word is repeated. The TF-IDF score for a term with respect to a tweet corpus is measured in terms of two individual components: term frequency (TF) and inverse document frequency (IDF). The importance of each term is directly proportional to the number of times this term appears in a document (term frequency). TF is calculated as the normalized frequency of the term within the document. This value is then weighted by the inverse document frequency (IDF), which discounts the weight of terms that occur in many documents of the corpus (terms which occur more frequently in the corpus).


The product TF·IDF measures the extent to which a term occurs frequently in a specific document without occurring in the other documents forming the document corpus:

$$\mathrm{TF\text{-}IDF}(t, d) = \mathrm{TF}(t, d) \times \log\left(\frac{N}{df(t)}\right)$$

where N is the total number of documents in the corpus and df(t) represents the number of documents in which a term t appears (document frequency). Generally, TF measures the relative importance of a term in a particular document. On the other hand, IDF is a measure of the general importance of the term across the whole corpus of documents: it measures the rareness of a word across all of the documents. The greater the IDF value, the rarer (and consequently the more discriminative) the word is across the corpus of documents. In the Twitter domain, the documents are the tweets. This means that terms with a high frequency within the tweet (high term frequency) and a low frequency over all tweets (low document frequency, and therefore high inverse document frequency) have a high TF-IDF value. The vector representation depends upon the textual content of the Twitter messages. Similarity between Twitter messages is measured using the cosine similarity metric.

TF-IDF struggles with Twitter messages because the limited length of a tweet (140 characters) is often not sufficient to indicate important terms. In this respect, tweets are sometimes similar to search queries. Twitter users usually tend to remove redundant words from a Twitter message for the sake of saving space, which results in very low term frequencies (1 or 2). In some tweets, no terms are repeated, so the TF-IDF weight is effectively determined by the IDF term alone. However, TF-IDF remains one of the most powerful and useful methods to represent different types of documents, including tweets [5, 16, 19, 60, 117, 126, 143].
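
A minimal Python sketch of a TF-IDF weighting of this kind is given below. It is an illustration rather than the study's implementation: the function name and the three-tweet corpus are hypothetical, and two choices are assumptions, namely normalizing TF by the number of tokens in the tweet and using the natural logarithm for IDF.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """corpus: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(corpus)
    # Document frequency: number of documents in which each term appears.
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))
    vectors = []
    for tokens in corpus:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical pre-tokenized tweets in which no term is repeated within a tweet.
corpus = [
    ["messi", "scores", "goal"],
    ["messi", "wins", "award"],
    ["film", "wins", "award"],
]
vectors = tf_idf(corpus)
```

In this toy corpus every term occurs once per tweet, so all TF values are equal and the ranking of terms inside a tweet is decided entirely by IDF: "scores" (one tweet) outweighs "messi" (two tweets), which is the behaviour described above for short tweets.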

4.3.4 Experimental setup

The experiments in this study were run over three datasets of 1,000, 5,000, and 30,000 tweets respectively. Each of the cellular and generational genetic algorithms was run for 40 independent runs over the 1,000 tweets dataset, 50 independent runs over the 5,000 tweets dataset, and a single run over the 30,000 tweets dataset. Both algorithms were executed using Java on a single 1.90 GHz PC with 8 GB of memory under the Windows 7 Professional operating system. The fitness value, execution time, number of generations, and number of generated clusters are recorded for each run. Then the average fitness value and execution time (in milliseconds, ms) are calculated. Finally, the values of the cellular genetic algorithm are compared to those of the generational genetic algorithm to select the algorithm that achieves the best fitness, i.e. the highest clustering quality, in the least time. Moreover, the performance of each algorithm is evaluated over the three datasets according to fitness value and execution time.

4.4 Algorithm

Traditional clustering algorithms were not selected by the researcher because they explore only a small subset of the potential clusterings, so the solution found is not guaranteed to be optimal (the search may get stuck in a local minimum). Moreover, traditional clustering approaches that require a priori knowledge of the number of clusters, such as K-means, are not suitable for the large volume of data produced by Twitter and other social media sites [32]. Premalatha and Natarajan (2009) used a genetic algorithm with a ranked mutation operator and demonstrated that it outperforms traditional algorithms such as K-means [40].

Evolutionary Algorithms (EAs) have been used extensively in recent years to handle complicated problems. They imitate biological processes in nature. These algorithms are population-based: they act on a group of prospective solutions (a population of individuals) by iteratively applying operators to these individuals in order to find the best solutions. The majority of these algorithms use a single population and apply the operators to it as a whole [134]. These steps are repeated until a stopping condition (for example, a maximum number of evaluations) is met.

The balance (tradeoff) between exploration (diversification) of new solutions and exploitation (intensification) in the search space is an important criterion for the performance of a genetic algorithm, and adjusting this tradeoff can improve the overall performance of the algorithm. The tradeoff is represented by the "selection pressure", which is defined as "A measure of the diffusion speed of the good solutions through the population" [46]. Reeves (1993) formulated the selection pressure in the following equation [145]:

$$\phi = \frac{\text{Prob.}[\text{selecting fittest string}]}{\text{Prob.}[\text{selecting average string}]}$$

Equation 1: Selection Pressure

Higher exploitation corresponds to a higher selection pressure: the algorithm tends to converge rapidly to a good enough solution, so it is liable to get stuck in a local optimum. On the other hand, higher exploration corresponds to a lower selection pressure: the algorithm tends to explore the search space more thoroughly for an optimal solution.
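As a worked illustration of Equation 1, the short sketch below evaluates the ratio under fitness-proportional (roulette-wheel) selection, where the probability of selecting a string is its fitness divided by the total fitness. This selection scheme is an assumption chosen only because it makes the ratio easy to compute (it reduces to maximum fitness over mean fitness); it is not the selection operator used in this study, and the function name is hypothetical.

```python
def selection_pressure(fitnesses):
    """Equation 1 under fitness-proportional selection (an assumed scheme)."""
    total = sum(fitnesses)
    p_fittest = max(fitnesses) / total
    # An "average string" has the mean fitness of the population
    p_average = (total / len(fitnesses)) / total
    return p_fittest / p_average

print(selection_pressure([1.0, 2.0, 3.0, 6.0]))  # 2.0
print(selection_pressure([2.0, 2.0, 2.0]))       # 1.0
```

A population in which all strings are equally fit gives a ratio of 1 (no pressure towards any individual), while a population dominated by one very fit string gives a larger ratio and therefore faster convergence.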

4.4.1 Cellular Genetic Algorithm

Genetic Algorithms (GAs) imitate the biological process of natural selection and evolution. A population of individuals (of chromosome-like structure) that represent tentative solutions to a given problem is maintained. The initial population is usually generated at random. New individuals are then created from the individuals of the population by applying genetic operators: the recombination (crossover) operator and the mutation operator. The reproductive cycle gives better individuals a higher chance to survive and reproduce (survival of the fittest). The fitness of individuals is evaluated using the so-called fitness function or objective function. Individuals are selected for reproduction according to their fitness values: in each generation, individuals with greater fitness have a higher probability of being selected than those with lower fitness. The newly generated individuals replace their predecessors according to a predefined replacement policy. This reproductive cycle continues until a termination condition is met. A simplified summary of the mechanism of a simple genetic algorithm is displayed in Figure 4.6.

Figure 4.6 Simple Genetic Algorithm [40]

Cellular Genetic Algorithms (cGAs) are a subclass of Genetic Algorithms in which the population is structured in a decentralized manner and the concept of a small neighborhood is strongly applied, so that an individual can recombine only with the individuals in its own neighborhood, as displayed in Figure 4.7.

Figure 4.7 Topology of Cellular Genetic Algorithm [134]


Alba and Dorronsoro stated that “Such a kind of structured algorithms is specially well suited for complex problems” [48]. The small overlapping neighborhoods in Cellular Genetic Algorithms help to preserve a high level of diversity for much longer than centralized algorithms do [55]. A behavioral comparison of two different cGAs against two traditional genetic algorithms, on a large benchmark composed of problems with many different features, revealed that the cGA is more robust, since it obtains smaller standard deviations than the traditional algorithms. In addition, the cGA runs faster (shorter elapsed time) than the traditional genetic algorithms.

The results obtained by the Cellular Genetic Algorithm are compared against those of the Generational Genetic Algorithm. Generational Genetic Algorithms (genGAs) are unstructured genetic algorithms in which any individual can interact with any other individual in the population. The offspring are placed in a temporary population, which replaces the existing population once the number of offspring equals the size of this temporary population [46].

4.4.2 Chromosome Representation

The population is structured as a two-dimensional grid with a neighborhood defined over it. Every chromosome in a generation represents a tentative solution to the problem and is composed of a sequence of genes. A chromosome is represented as an array of integers whose length equals the number of tweets; each entry in the array holds the cluster number assigned to the corresponding tweet. The chromosome representation is shown in Figure 4.8. The neighborhood used is L5 (displayed in Figure 4.9).

Figure 4.8 Chromosome representation


Figure 4.9 L5 Neighborhood [46]
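The representation and the neighborhood can be sketched in a few lines of Python. The sketch assumes that L5 means the cell itself plus its north, south, west, and east neighbors, and that the grid wraps around at the edges (a toroidal grid); the wrap-around and the function names `decode` and `l5_neighborhood` are illustrative assumptions rather than details taken from the implementation used in this study.

```python
from collections import defaultdict

def decode(chromosome):
    """Group tweet indices by the cluster label stored in each gene."""
    clusters = defaultdict(list)
    for tweet_index, label in enumerate(chromosome):
        clusters[label].append(tweet_index)
    return dict(clusters)

def l5_neighborhood(row, col, rows=20, cols=20):
    """Cell itself plus its N, S, W, E neighbors on a wrap-around grid (assumed)."""
    return [
        (row, col),
        ((row - 1) % rows, col),
        ((row + 1) % rows, col),
        (row, (col - 1) % cols),
        (row, (col + 1) % cols),
    ]

# Five tweets: tweets 0 and 2 share cluster 2, tweets 1 and 4 share cluster 0
print(decode([2, 0, 2, 1, 0]))
print(l5_neighborhood(0, 0))
```

Because the chromosome stores one label per tweet, the number of clusters is simply the number of distinct labels and does not have to be fixed in advance.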

4.4.3 Initial Population

The population is composed of 400 individuals (a 20*20 grid). As is usual, the initial population of tentative solutions is randomly generated from the search space, and a fitness value is assigned to each individual.

4.4.4 Fitness Function

The fitness function is used to evaluate the quality of a solution (a clustering). A higher fitness value indicates a higher-quality solution. The fitness function used is a function of cosine similarity. Usharani and Iyakutti stated that “cosine similarity is a measure of similarity that can be used to compare documents with respect to a given vector of query words. This is quantified as the cosine of angle between vectors”. The function is as follows [47]:

$$\text{similarity} = \cos(\theta) = \frac{\sum_{i=1}^{n} A_i \times B_i}{\sqrt{\sum_{i=1}^{n} (A_i)^2} \times \sqrt{\sum_{i=1}^{n} (B_i)^2}}$$

Equation 2: Fitness Function

The function should be maximized, where A and B are tweet vectors, A_i is the weight of term i in vector A, B_i is the weight of term i in vector B, n is the number of terms (the dimension of the vectors), and × denotes multiplication, so that the numerator is the dot product of the two vectors. Here, tweets are compared with other tweets in the corpus rather than documents with queries. Because TF-IDF weights are non-negative, the value of the cosine similarity ranges between 0 and 1.
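A minimal Python sketch of the cosine similarity in Equation 2 follows, assuming that each tweet vector is stored sparsely as a `{term: weight}` dictionary. How the pairwise similarities are aggregated into the fitness of a whole chromosome is not specified in this section, so only the pairwise measure is shown; the function name and the zero-vector convention are assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as {term: weight}."""
    # Only terms present in both vectors contribute to the dot product
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_a * norm_b)

# Hypothetical term weights for two tweets sharing one of two terms each
tweet_a = {"goal": 1.0, "final": 1.0}
tweet_b = {"final": 1.0, "debate": 1.0}
print(cosine_similarity(tweet_a, tweet_b))
```

Two tweets with no term in common score 0, and two tweets with identical weight vectors score 1 (up to floating-point rounding).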

4.4.5 Parent Selection

The objective of the selection operator is to enhance the quality of the population by giving higher-quality individuals (those with the highest fitness values) a greater chance to survive and reproduce in the following generations than lower-quality individuals (those with the lowest fitness values). The quality of an individual is evaluated using the fitness function [49]. Here, the first parent is selected using the dissimilarity tournament selection operator, while the second parent is chosen by the linear rank selection operator. The dissimilarity tournament selection operator does not depend on the relative fitness of the nearby individuals. Instead, it considers the difference between the respective solutions: two neighbors are chosen at random, and the one that is more dissimilar to the current individual is selected. In linear rank selection, on the other hand, all individuals in the neighborhood are sorted by fitness from best to worst, and individuals with a higher rank in this list have a greater probability of being chosen as a parent [46].
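The two selection operators described above can be sketched as follows. The sketch assumes that dissimilarity is measured as the Hamming distance between chromosomes and that linear ranking gives the best of n neighbors a weight of n and the worst a weight of 1; both choices, and all function names, are assumptions made for illustration and may differ from the operators in the library used in this study.

```python
import random

def hamming(a, b):
    """Number of positions at which two chromosomes differ (assumed dissimilarity)."""
    return sum(x != y for x, y in zip(a, b))

def dissimilarity_tournament(current, neighbors, rng):
    """Pick two random neighbors and keep the one more dissimilar to `current`."""
    first, second = rng.sample(neighbors, 2)
    return max((first, second), key=lambda n: hamming(current, n))

def linear_rank_selection(neighbors, fitness, rng):
    """Sort neighbors best-first and draw one with rank-proportional weight."""
    ranked = sorted(neighbors, key=fitness, reverse=True)
    weights = list(range(len(ranked), 0, -1))  # best gets n, worst gets 1
    return rng.choices(ranked, weights=weights, k=1)[0]

rng = random.Random(42)
current = [0, 0, 0, 0]
neighbors = [[0, 0, 0, 1], [1, 1, 1, 1], [0, 0, 1, 1]]
parent_1 = dissimilarity_tournament(current, neighbors, rng)
parent_2 = linear_rank_selection(neighbors, fitness=sum, rng=rng)
```

Using a fitness-blind operator for the first parent and a fitness-driven one for the second is one way to combine diversity with a moderate pressure towards good solutions.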

4.4.6 Recombination (Crossover)

The recombination step combines two or more portions of the parent individuals to create new offspring. The generated offspring are not identical to their parents but include building blocks from both selected parents. A recombination (crossover) operator with a pre-specified crossover probability (Pc) is applied to the individuals. Here, the applied operator is a "two points-crossover": the Distance Preserving Crossover (DPX) operator with Pc = 1.0. The objective of this operator is to produce offspring that are at an equal distance from each parent, this distance being the same as the distance between the parents [46, 146].
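Since the operator is described above as a two-point crossover, the following sketch shows a generic two-point crossover on integer chromosomes. It is only an illustration of cutting at two points and exchanging the middle segment; it does not attempt to reproduce the distance-preserving property, and the function name is hypothetical.

```python
import random

def two_point_crossover(parent_a, parent_b, rng):
    """Copy parent_a and replace the slice between two cut points with parent_b's genes."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    # Two distinct cut positions, so at least one gene comes from parent_b
    lo, hi = sorted(rng.sample(range(len(parent_a) + 1), 2))
    return parent_a[:lo] + parent_b[lo:hi] + parent_a[hi:]

rng = random.Random(1)
child = two_point_crossover([0] * 8, [1] * 8, rng)
print(child)
```

With all-zero and all-one parents, the genes inherited from the second parent show up as one contiguous run of ones in the child.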

4.4.7 Mutation

While the recombination operator acts on two or more parent individuals, the mutation operator randomly modifies a single individual by altering one or more of its genes. A mutation operator with a pre-specified mutation probability (Pm) is applied to the individuals. Here, the applied operator is the Integer Mutation operator with Pm = 1.0. Integer mutation replaces the integer value of a gene with a new value generated at random [46, 147].
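A minimal sketch of integer mutation is given below. It assumes that Pm is the probability of applying the operator to an individual and that, once applied, each gene is redrawn with probability 1/length; the per-gene rate, the label range, and the function name are assumptions, since the section only states that a gene's integer value is replaced by a random one.

```python
import random

def integer_mutation(chromosome, max_label, pm, rng):
    """Return a mutated copy; each selected gene gets a random label in [0, max_label]."""
    child = list(chromosome)
    if rng.random() >= pm:
        return child  # operator not applied to this individual
    gene_rate = 1.0 / len(child)  # assumed per-gene mutation rate
    for i in range(len(child)):
        if rng.random() < gene_rate:
            child[i] = rng.randrange(max_label + 1)
    return child

rng = random.Random(3)
original = [0, 1, 2, 3]
mutant = integer_mutation(original, max_label=3, pm=1.0, rng=rng)
print(original, mutant)
```

With Pm = 1.0 the operator is applied to every offspring, although under the assumed per-gene rate an individual offspring may still come out unchanged.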

4.4.8 Replacement Policy

After the selection, recombination (crossover), and mutation operators have been applied, the offspring are placed incrementally into a temporary population and their fitness values are calculated. The new offspring replace the parent population only if they are not worse than the existing population, in order to maintain the best solutions.

4.4.9 Stopping Criterion

The reproductive cycle is repeated until the stopping condition is fulfilled. Here, termination occurs when the maximum number of fitness function evaluations (15,000,000 evaluations) is reached. The reproductive cycle of the Cellular Genetic Algorithm (cGA) is displayed in Figure 4.10.


Figure 4.10 Reproductive cycle mechanism in cGA [46]

The pseudo-code of the algorithm is given in Table 4.2.

Table 4.2 Pseudo-code of Cellular Genetic Algorithm [46]

    proc Evolve(cga)
        GenerateInitialPopulation(cga.pop);
        Evaluation(cga.pop);
        while !StopCondition() do
            for individual ← 1 to cga.popSize do
                neighbors ← CalculateNeighborhood(cga, position(individual));
                parents ← Selection(neighbors);
                offspring ← Recombination(cga.Pc, parents);
                offspring ← Mutation(cga.Pm, offspring);
                Evaluation(offspring);
                Replacement(position(individual), auxiliary_pop, offspring);
            end for
            cga.pop ← auxiliary_pop;
        end while
    end proc Evolve

To select the parameter values of the cGA, several tuning experiments were performed. In these experiments the value of each parameter was modified in turn while the rest of the parameters were kept constant. The final parameter values, which achieved the best fitness, are listed in Table 4.3.

Table 4.3 Parameterization of the algorithm

Parameter                 Value
Population size           400 individuals (20*20)
Stopping condition        15,000,000 fitness evaluations
Neighborhood              Linear5
Parent selection          Dissimilarity + Linear rank
Recombination operator    DPX
Crossover probability     Pc = 1.0
Mutation operator         Integer mutation
Mutation probability      Pm = 1.0
Replacement policy        Replace if not worse
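The loop in Table 4.2 can be made concrete with a small runnable sketch. To stay short it makes several simplifying assumptions that differ from Table 4.3: a 4*4 grid instead of 20*20, binary tournaments for both parents, a plain two-point crossover, a single-gene mutation, and a toy fitness that counts genes matching a hidden target labeling. The function `evolve` and the toy problem are hypothetical and only mirror the structure of the pseudo-code.

```python
import random

def evolve(fitness, n_genes, n_labels, rows=4, cols=4, max_evals=2000, seed=0):
    rng = random.Random(seed)
    # GenerateInitialPopulation + Evaluation
    pop = [[[rng.randrange(n_labels) for _ in range(n_genes)]
            for _ in range(cols)] for _ in range(rows)]
    fit = [[fitness(ind) for ind in row] for row in pop]
    evals = rows * cols
    while evals < max_evals:  # StopCondition
        aux_pop = [row[:] for row in pop]
        aux_fit = [row[:] for row in fit]
        for r in range(rows):
            for c in range(cols):
                # CalculateNeighborhood: L5 on a wrap-around grid (assumed)
                cells = [(r, c), ((r - 1) % rows, c), ((r + 1) % rows, c),
                         (r, (c - 1) % cols), (r, (c + 1) % cols)]
                # Selection: two binary tournaments (a simplification)
                parents = []
                for _ in range(2):
                    a, b = rng.sample(cells, 2)
                    best = a if fit[a[0]][a[1]] >= fit[b[0]][b[1]] else b
                    parents.append(pop[best[0]][best[1]])
                # Recombination: two-point crossover
                lo, hi = sorted(rng.sample(range(n_genes + 1), 2))
                child = parents[0][:lo] + parents[1][lo:hi] + parents[0][hi:]
                # Mutation: redraw one random gene
                child[rng.randrange(n_genes)] = rng.randrange(n_labels)
                # Evaluation
                child_fit = fitness(child)
                evals += 1
                # Replacement: keep the offspring only if it is not worse
                if child_fit >= fit[r][c]:
                    aux_pop[r][c], aux_fit[r][c] = child, child_fit
        pop, fit = aux_pop, aux_fit  # cga.pop <- auxiliary_pop
    return max((fit[r][c], pop[r][c]) for r in range(rows) for c in range(cols))

# Toy problem: recover a hidden labeling of six items
target = [0, 0, 1, 1, 2, 2]
best_fit, best = evolve(lambda ind: sum(g == t for g, t in zip(ind, target)),
                        n_genes=6, n_labels=3)
print(best_fit, best)
```

Offspring are written to the auxiliary population and the grid is swapped only after every cell has been visited, so all individuals in a generation breed from the same parent population, as in the pseudo-code.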

4.5 Summary and Conclusion

This chapter presented the conceptual framework and explained in detail the steps of the work done in this study. First, the different approaches to gathering data from Twitter were described and compared. Then, the selected data gathering methodology and the working mechanism of the selected data collection tool (Hootsuite) were described. In the next section, the researcher described the collected dataset and the selected similarity measure (cosine similarity), illustrated by some examples of other work that used the cosine similarity measure. The dataset preparation, preprocessing, and representation, and the setup of the experiments, were then described. Finally, a detailed description of the procedures of the simple Genetic Algorithm and the Cellular Genetic Algorithm was presented.


CHAPTER FIVE EXPERIMENTAL RESULTS AND DISCUSSION


5.1 Introduction

This chapter presents the research findings generated by the tweet clustering experiments using both the Cellular Genetic Algorithm and the Generational Genetic Algorithm. It also presents the research limitations, a discussion of the results, and finally a conclusion. As mentioned at the end of section 4.3 of chapter 4, the experiments were performed over three datasets of 1,000, 5,000, and 30,000 tweets respectively. Each of the cellular and generational genetic algorithms was run for 40 independent runs over the 1,000 tweets dataset, 50 independent runs over the 5,000 tweets dataset, and a single run over the 30,000 tweets dataset. Both algorithms were executed using Java on a single 1.90 GHz PC with 8 GB of memory under the Windows 7 operating system.

For each of the three datasets and each of the two algorithms, four types of results were generated:
1. The fitness value
2. The execution time
3. The number of clusters (taking into consideration that the number of clusters is not predefined)
4. The number of generations for each run

All results for these four parameters are recorded, and then the average fitness value and execution time (in milliseconds, ms) are calculated. Finally, the values obtained using the cellular genetic algorithm are compared to those of the generational genetic algorithm to select the algorithm that achieves the best fitness, i.e. the highest clustering quality, in the least time. Moreover, the performance of each algorithm is evaluated over the three datasets according to fitness value and execution time.

The remainder of the chapter is organized as follows. Section 5.2 measures the accuracy of the two algorithms against a test set. Section 5.3 presents the results for the 1,000 tweets dataset. Section 5.4 presents the results for the 5,000 tweets dataset. Section 5.5 presents the results for the 30,000 tweets dataset. Section 5.6 compares the number of generations produced by both algorithms. Sections 5.7 and 5.8 evaluate the performance of cGA and genGA over the three datasets according to fitness value and execution time. The limitations of the study are presented in section 5.9. A discussion of the obtained results is presented in section 5.10. Finally, the conclusion of the chapter is presented in section 5.11.

5.2 Accuracy on test dataset

The accuracy of the clustering obtained by the two algorithms was compared against a baseline approach on a test dataset using the following equation:

Equation 3: Clustering Accuracy

The test dataset is composed of 60 tweets equally distributed over three topics: Sport, Politics, and Cinema. The researcher went manually through those tweets and determined their relevance to the three topics. The distribution of tweets in the test set is displayed in Table 5.1.

Table 5.1: Tweet Distribution in the test set

Topic       Number of tweets
Sport       20
Politics    20
Cinema      20

An accuracy of 95.01% was obtained by cGA compared to 94.166% obtained by genGA (a difference of 0.844 percentage points), as displayed in Figure 5.1.

Figure 5.1 Accuracy achieved by both algorithms [bar chart of genGA and cGA; axes: Accuracy, % difference]

5.3 Results for 1,000 tweets dataset

The results presented in this section were obtained by running the Cellular and Generational genetic algorithms for 40 independent runs over a dataset composed of 1,000 tweets.

5.3.1 Average Fitness

A comparison of the average fitness values obtained by the Cellular and Generational genetic algorithms over the 1,000 tweets dataset is displayed in Figure 5.2. The average fitness value of the cellular genetic algorithm is 72.68859389, while the average fitness value of the generational genetic algorithm is 72.68254524. The average fitness obtained by cGA is 0.01% higher than the average fitness obtained by genGA.

Figure 5.2 Average fitness of genGA and cGA (1,000 tweets) [bar chart; axes: Fitness, % difference]

5.3.2 Average Execution Time

A comparison of the average execution times of the Cellular and Generational genetic algorithms over the 1,000 tweets dataset is displayed in Figure 5.3. The average execution time of the cellular genetic algorithm is 2389352.2 ms, while that of the generational genetic algorithm is 2847968.75 ms. The average execution time of cGA is lower than that of genGA, a difference of 17.51% measured relative to the mean of the two values.

Figure 5.3 Average execution time of genGA and cGA (1,000 tweets) [bar chart; axes: Time, % difference]

5.3.3 Number of Clusters

A comparison of the average number of clusters generated by the Cellular and Generational genetic algorithms over the 1,000 tweets dataset is displayed in Figure 5.4. The average number of clusters generated by the Cellular Genetic Algorithm is approximately 312, while the average number of clusters generated by the Generational Genetic Algorithm is approximately 316.

Figure 5.4 Number of clusters generated by genGA and cGA (1,000 tweets) [bar chart]

5.4 Results for 5,000 tweets dataset

The results presented in this section were obtained by running the Cellular and Generational genetic algorithms for 50 independent runs over a dataset composed of 5,000 tweets.

5.4.1 Average Fitness

A comparison of the average fitness values obtained by the Cellular and Generational genetic algorithms over the 5,000 tweets dataset is displayed in Figure 5.5. The average fitness value of the cellular genetic algorithm is 315.9893146, while the average fitness value of the generational genetic algorithm is 318.2736139. The average fitness obtained by cGA is 0.72% less than the average fitness obtained by genGA.

Figure 5.5 Average fitness of genGA and cGA (5,000 tweets) [bar chart; axes: Fitness, % difference]

5.4.2 Average Execution Time

A comparison of the average execution times of the Cellular and Generational genetic algorithms over the 5,000 tweets dataset is displayed in Figure 5.6. The average execution time of the cellular genetic algorithm is 19739837.3 ms, while that of the generational genetic algorithm is 21356261.49 ms. The average execution time of cGA is lower than that of genGA, a difference of 7.87% measured relative to the mean of the two values.

Figure 5.6 Average execution time of genGA and cGA (5,000 tweets) [bar chart; axes: Time, % difference]

5.4.3 Number of Clusters

A comparison of the number of clusters generated by the Cellular and Generational genetic algorithms over the 5,000 tweets dataset is displayed in Figure 5.7. The average number of clusters generated by the Cellular Genetic Algorithm is approximately 1865, while the average number of clusters generated by the Generational Genetic Algorithm is approximately 1821.

Figure 5.7 Number of clusters generated by genGA and cGA (5,000 tweets) [bar chart]

5.5 Results for 30,000 tweets dataset

The results presented in this section were obtained by running the Cellular and Generational genetic algorithms over a dataset composed of 30,000 tweets.

5.5.1 Fitness

A comparison of the fitness values obtained by the Cellular and Generational genetic algorithms over the 30,000 tweets dataset is displayed in Figure 5.8. The fitness value of the cellular genetic algorithm is 8077.4931640625, while the fitness value of the generational genetic algorithm is 8115.17919921875. The fitness obtained by cGA is 0.47% less than the fitness obtained by genGA.

Figure 5.8 Fitness value of genGA and cGA (30,000 tweets) [bar chart; axes: Fitness, % difference]

5.5.2 Execution Time

A comparison of the execution times of the Cellular and Generational genetic algorithms over the 30,000 tweets dataset is displayed in Figure 5.9. The execution time of the cellular genetic algorithm is 217341048 ms, while that of the generational genetic algorithm is 247159030 ms. The execution time of cGA is lower than that of genGA, a difference of 12.84% measured relative to the mean of the two values.

Figure 5.9 Execution time of genGA and cGA (30,000 tweets) [bar chart; axes: Time, % difference]

5.5.3 Number of Clusters

A comparison of the number of clusters generated by the Cellular and Generational genetic algorithms over the 30,000 tweets dataset is displayed in Figure 5.10. The number of clusters generated by the Cellular Genetic Algorithm is 7777, while the number of clusters generated by the Generational Genetic Algorithm is 7845.

Figure 5.10 Number of clusters generated by genGA and cGA (30,000 tweets) [bar chart]

5.6 Number of Generations

A comparison of the number of generations produced per run by the Cellular and Generational genetic algorithms is displayed in Figure 5.11. The Cellular Genetic Algorithm produces 37,500 generations per run, while the Generational Genetic Algorithm produces 30,061 generations per run. The cGA figure is consistent with the budget of 15,000,000 fitness evaluations divided by the population size of 400 individuals.

Figure 5.11 Number of generations produced by each algorithm [bar chart]

5.7 cGA performance in all datasets

This section presents an evaluation of the performance of the Cellular Genetic Algorithm over the three datasets. The evaluation is performed in terms of the fitness value and execution time.

5.7.1 Average Fitness

A comparison of the average fitness values obtained by the Cellular Genetic Algorithm over all datasets is displayed in Figure 5.12. As the size of the dataset increases, the fitness value obtained increases.

Figure 5.12 cGA fitness for all sets [bar chart: 1,000, 5,000, and 30,000 tweets]

5.7.2 Average Execution Time

A comparison of the average execution times of the Cellular Genetic Algorithm over all datasets is displayed in Figure 5.13. As the size of the dataset increases, the time required for execution increases.

Figure 5.13 cGA execution time for all sets [bar chart: 1,000, 5,000, and 30,000 tweets]

5.8 genGA performance in all datasets

This section presents an evaluation of the performance of the Generational Genetic Algorithm over the three datasets. The evaluation is performed in terms of the fitness value and execution time.

5.8.1 Average Fitness

A comparison of the average fitness values obtained by the Generational Genetic Algorithm over all datasets is displayed in Figure 5.14. Similar to cGA, as the size of the dataset increases, the fitness value obtained increases.

Figure 5.14 genGA fitness for all sets [bar chart: 1,000, 5,000, and 30,000 tweets]

5.8.2 Average Execution Time

A comparison of the average execution times of the Generational Genetic Algorithm over all datasets is displayed in Figure 5.15. Similar to cGA, as the size of the dataset increases, the time required for execution increases.

Figure 5.15 genGA execution time for all sets [bar chart: 1,000, 5,000, and 30,000 tweets]

5.9 Research Limitations

The first limitation of this study concerns the selected data gathering approach. As mentioned in chapter 3, data collection through the scraping-based approach is essentially more difficult than API-based methods, and may still be subject to bandwidth limitations imposed by the site (many social network sites dynamically recognize and block efforts to scrape). The scraper also has to cope with redesigns of the social network site, which may occur more frequently and with less notice than changes to the API. In addition, Hootsuite.com does not guarantee that all the Twitter messages meeting the criteria of the established archives will be captured, so some tweets may not have been included in the gathered tweet corpus. While this may introduce a bias, the concepts for analyzing the tweets remain valid. In spite of this limitation, the scraping-based approach was selected by the researcher because it is not limited to a specific number of calls like the API-driven approach, and at the same time it is simpler than the passive network measurement approach.

Another limitation of this study is that it focuses only on the clustering of similar Twitter messages, without attempting to analyze the semantics of those messages.


Furthermore, one of the primary limitations of this study is the use of a single language for data gathering (only English-language tweets were gathered). In addition, the problem has a high computational complexity: conducting several runs of both algorithms and handling the huge number of term-tweet relationships (in the order of millions) requires high processing power (large memory and CPU resources) and, therefore, a long time. Consequently, the size of the largest dataset was limited to just 30,000 Twitter messages, which is considered insufficient. The long running time also meant that only a single run of each of the Cellular and Generational genetic algorithms was conducted over the 30,000 tweets dataset.

5.10 Results Discussion

From the previously mentioned results, a number of things can be observed. This section is divided into two subsections. Section 5.10.1 compares the results obtained by both algorithms according to the four parameters: the average fitness, the execution time, the number of clusters, and the number of generations produced. Section 5.10.2 discusses the results obtained by each algorithm over all datasets in terms of fitness value and execution time.

5.10.1 Discussion of results obtained by both algorithms

This section discusses the results obtained by cGA and genGA over the three datasets in terms of average fitness, execution time, number of clusters, and number of generations.

5.10.1.1 Average Fitness

Table 5.2 displays the fitness values obtained by both algorithms on the three datasets.

Table 5.2 Fitness values by both algorithms in all datasets

Dataset size      cGA                 genGA
1,000 tweets      72.68859389         72.68254524
5,000 tweets      315.9893146         318.2736139
30,000 tweets     8077.4931640625     8115.17919921875

Concerning the average fitness, the difference between the solutions generated by the two algorithms is slight. This slight difference is in favor of the Cellular Genetic Algorithm on the 1,000 tweets dataset, but on the 5,000 and 30,000 tweets datasets it is in favor of the Generational Genetic Algorithm. The use of small overlapping neighborhoods in cGA maintains population diversity: it enhances exploration of the search space through the relatively smooth spread of the best solutions across the entire population, while exploitation occurs within each neighborhood through the genetic operations. In other words, cGA provides a good tradeoff between exploration and exploitation and is therefore less likely to get stuck in local optima [46].

5.10.1.2 Execution Time

Table 5.3 displays the execution time consumed by both algorithms on the three datasets.

Table 5.3 Execution times by both algorithms in all datasets

Dataset size      cGA               genGA
1,000 tweets      2389352.2 ms      2847968.75 ms
5,000 tweets      19739837.3 ms     21356261.49 ms
30,000 tweets     217341048 ms      247159030 ms

Concerning the average time required for execution, the Cellular Genetic Algorithm requires a remarkably shorter time on all three datasets. This can be attributed to the decentralized structure of the population. The population in cGA is structured into neighborhoods, while it is unstructured in the Generational Genetic Algorithm. This means that an individual in the Generational Genetic Algorithm has to search through the whole population, while an individual in cGA interacts only with its nearby neighbors. In other words, the decentralized population structure improves the execution time of the Cellular Genetic Algorithm in comparison with the equivalent Generational Genetic Algorithm [46].
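The percentages quoted for execution time in Sections 5.3.2, 5.4.2, and 5.5.2 can be reproduced from the values in Table 5.3 when the difference is divided by the mean of the two times (a symmetric percent difference). The following sketch is a check of that arithmetic; the helper name `percent_difference` is hypothetical.

```python
def percent_difference(a, b):
    """Absolute difference divided by the mean of the two values, in percent."""
    return abs(a - b) / ((a + b) / 2) * 100

# (cGA, genGA) execution times in ms, taken from Table 5.3
times_ms = {
    "1,000 tweets": (2389352.2, 2847968.75),
    "5,000 tweets": (19739837.3, 21356261.49),
    "30,000 tweets": (217341048, 247159030),
}
for name, (cga, gen_ga) in times_ms.items():
    print(f"{name}: {percent_difference(cga, gen_ga):.2f}%")
# 1,000 tweets: 17.51%
# 5,000 tweets: 7.87%
# 30,000 tweets: 12.84%
```

Measured against the genGA time alone, the reduction is somewhat smaller (for example, about 16.1% for the 1,000 tweets dataset), so the choice of baseline should be kept in mind when reading these figures.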

5.10.1.3 Number of Clusters

Table 5.4 displays the number of clusters generated by both algorithms on the three datasets.

Table 5.4 Number of clusters by both algorithms in all datasets

Dataset size      cGA                   genGA
1,000 tweets      Approximately 312     Approximately 316
5,000 tweets      Approximately 1865    Approximately 1821
30,000 tweets     7777                  7845

Note that the number of clusters is not constant because, unlike in K-means for example, there is no a priori knowledge about the number of clusters.

5.10.1.4 Number of Generations

Table 5.5 displays the number of generations produced by both algorithms on the three datasets.

Table 5.5 Number of generations of both algorithms in all datasets

Dataset size      cGA       genGA
1,000 tweets      37500     30061
5,000 tweets      37500     30061
30,000 tweets     37500     30061

The Cellular Genetic Algorithm produces a larger number of generations than the Generational Genetic Algorithm. In this respect the Generational Genetic Algorithm is more efficient than cGA, as it requires fewer generations to find the solution. The reason is that cGA favors exploration and thus induces a lower selection pressure, so more generations are needed.

5.10.2 Discussion of results obtained by every algorithm

This section discusses the results obtained by each algorithm individually over the three datasets in terms of average fitness and execution time.

5.10.2.1 Cellular Genetic Algorithm

Table 5.6 displays the fitness values and execution time obtained by cellular genetic algorithm in the three used datasets.

Table 5.6 cGA performance in all datasets

| Dataset size | Fitness value | Execution time |
|---|---|---|
| 1,000 tweets | 72.68859389 | 2389352.2 ms |
| 5,000 tweets | 315.9893146 | 19739837.3 ms |
| 30,000 tweets | 8077.4931640625 | 217341048 ms |

As the size of the dataset increases, the complexity of the problem increases. Therefore, the obtained fitness value and the time required for execution both increase.

Table 5.7 displays the fitness values and execution time obtained by the generational genetic algorithm in the three used datasets.

Table 5.7 genGA performance in all datasets

| Dataset size | Fitness value | Execution time |
|---|---|---|
| 1,000 tweets | 72.68254524 | 2847968.75 ms |
| 5,000 tweets | 318.2736139 | 21356261.49 ms |
| 30,000 tweets | 8115.17919921875 | 247159030 ms |

As with cGA, the complexity of the problem increases with the size of the dataset. Therefore, the obtained fitness value and the time required for execution both increase.
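The relative gaps between the two algorithms can be recomputed directly from Tables 5.6 and 5.7. The Python sketch below uses a symmetric percentage difference (absolute difference divided by the mean of the two values). The thesis does not state which formula it uses, so this choice is an assumption, although it appears consistent with the percentages quoted in the thesis summary (for example 17.51% for the execution time on 1,000 tweets). The function and variable names are hypothetical.

```python
def percent_difference(a: float, b: float) -> float:
    """Symmetric percentage difference: |a - b| relative to the mean of a and b."""
    return abs(a - b) / ((a + b) / 2) * 100


# (cGA, genGA) pairs copied from Tables 5.6 and 5.7
execution_ms = {
    "1,000 tweets": (2389352.2, 2847968.75),
    "5,000 tweets": (19739837.3, 21356261.49),
    "30,000 tweets": (217341048, 247159030),
}
fitness = {
    "1,000 tweets": (72.68859389, 72.68254524),
    "5,000 tweets": (315.9893146, 318.2736139),
    "30,000 tweets": (8077.4931640625, 8115.17919921875),
}

for dataset in execution_ms:
    time_gap = percent_difference(*execution_ms[dataset])
    fitness_gap = percent_difference(*fitness[dataset])
    print(f"{dataset}: execution time gap {time_gap:.2f}%, "
          f"fitness gap {fitness_gap:.2f}%")
```

Dividing by the mean rather than by one algorithm's value keeps the measure independent of which algorithm is treated as the reference.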

5.11 Conclusion

Based on the obtained results, the Cellular Genetic Algorithm (cGA) was selected over the Generational Genetic Algorithm (genGA). Despite the slight advantage of genGA in terms of average fitness and number of generations, cGA runs remarkably faster and with higher accuracy. Therefore, cGA was selected to perform tweet clustering.


CHAPTER SIX CONCLUSION AND FUTURE WORK


This chapter presents the final conclusion of the study, together with a summary of the work performed, in section 6.1. Section 6.2 outlines the future work and enhancements that can be pursued, based on the obtained results and in light of the study limitations.

6.1 Conclusion

This study approached the problem of clustering tweets based upon their similarity. The problem is important for one essential reason: tweet content similarity can be used as a similarity measure between Twitter users, which helps to recognize whether those users share similar interests and attributes.

Some traditional clustering algorithms, such as the K-means algorithm, require a priori knowledge of the number of clusters (i.e. the number of clusters is known in advance), which is not the case in Twitter. Other traditional clustering algorithms can get stuck in local optima because they explore just a small subset of the potential clusterings. Moreover, the type of data in Twitter results in weak performance of most clustering methods due to the overall freedom of writing tweets.

The researcher applied two subclasses of genetic algorithm for clustering tweets based upon their textual content similarity:

1. Cellular Genetic Algorithm (cGA)
2. A conventional genetic algorithm: the Generational Genetic Algorithm (genGA)

The accuracy of the clustering obtained by the two algorithms was compared against a baseline approach on a test dataset composed of 60 tweets equally distributed over three distinct topics. The test dataset results demonstrate a slightly better accuracy in favor of cGA (0.844% difference).

Then, the researcher compared the clustering results obtained by cGA with those obtained by genGA according to four parameters:

1. The fitness value
2. The execution time
3. The number of clusters
4. The number of generations for each run

The experiments used three datasets: one of 1,000 tweets, a second of 5,000 tweets, and a third of 30,000 tweets. The gathered data represent a variety of real-world topics. The data was collected with a scraping-based approach, using a web client called Hootsuite.

The average fitness of the two algorithms is nearly equal over the three datasets. In the 1,000 tweets dataset, cGA achieved a higher fitness by just 0.01%. In the 5,000 tweets dataset, genGA achieved a higher fitness by just 0.72%. In the 30,000 tweets dataset, genGA achieved a higher fitness by just 0.47%.

On the other hand, cGA is much faster than genGA on all datasets. In the 1,000 tweets dataset, cGA was faster by 17.51%. In the 5,000 tweets dataset, cGA was faster by 7.87%. In the 30,000 tweets dataset, cGA was faster by 12.84%.

Regarding the number of generations, genGA is considered more efficient as it needs 30,061 generations to reach the optimal solution while cGA needs 37,500 generations.

Despite the slight advantage of genGA in terms of average fitness and number of generations, cGA runs remarkably faster and with higher accuracy. Therefore, cGA was selected to perform tweet clustering.

Besides achieving the desired research objectives, this study is considered to be one of the first attempts to exploit cGA in clustering of tweets, which can improve the performance of clustering in comparison with the traditional clustering algorithms, or those clustering algorithms that require a priori knowledge of the number of clusters such as K-means. Therefore, this study contributes a new approach for tweet clustering.


6.2 Future Work

This work still has room for improvement in order to overcome the limitations mentioned in section 4.6 of chapter four. For future work, the researcher plans to:

- Cluster tweets with multiple runs over a larger dataset
- Use parallel computing to minimize the time required for execution, given the high computational complexity of the problem
- Perform semantic analysis over the collected Twitter messages
- Include more tweet languages in the Twitter dataset (not just the English language)
- Use cGA for clustering over other social media


References

1. Kewalramani, M.N., 2011. Community Detection in Twitter. MSc Thesis. University of Maryland, Baltimore County.

2. Phelan, O., K. McCarthy, and B. Smyth, 2009. Using twitter to recommend real-time topical news. Proceedings of the third ACM conference on Recommender systems, Oct. 22-25, New York, NY, USA. ACM. DOI: 10.1145/1639714.1639794

3. Nooralahzadeh, F., V. Arunachalam, and C.G. Chiru, 2013. 2012 Presidential Elections on Twitter--An Analysis of How the US and French Election were Reflected in Tweets. Proceedings of the 2013 19th International Conference on Control Systems and Computer Science (CSCS 2013), May 29-31, Bucharest, Romania. pp: 240-246. IEEE.

4. Tumasjan, A., 2010. Predicting elections with Twitter: What 140 characters reveal about political sentiment. Proceedings of the 4th International AAAI Conference on Weblogs and Social Media (WSM’ 10), May 23-26, Washington DC, USA. pp: 178-185. DOI: 10.1002/asi.21149

5. Conover, M.D., B. Goncalves, J. Ratkiewicz, A. Flammini and F. Menczer, 2011. Predicting the political alignment of Twitter users. Proceedings of the IEEE 3rd International Conference on Privacy, Security, Risk and Trust, Oct. 9-11, IEEE Xplore Press, Boston, MA. pp: 192-199. DOI: 10.1109/PASSAT/SocialCom.2011.34

6. Bravo-Marquez, F., D. Gayo-Avello, M. Mendoza, and B. Poblete, 2012. Opinion Dynamics of Elections in Twitter. Proceedings of 2012 Eighth Latin American Web Congress (LA-WEB), Oct. 25-27, Cartagena, Colombia. pp: 32-39. IEEE.

7. Wegrzyn-Wolska, K., and L. Bougueroua, 2012. Tweets mining for French Presidential Election. Proceedings of 2012 Fourth International Conference on Computational Aspects of Social Networks (CASoN 2012), Nov. 21-23. pp: 138-143. IEEE.

8. Kwak, H., C. Lee, H. Park, and S. Moon, 2010. What is Twitter, a social network or a news media? Proceedings of the 19th international conference on World Wide Web (www 2010), Apr. 26-30, ACM New York, NY, USA. DOI: 10.1145/1772690.1772751

9. Sakaki, T., M. Okazaki and Y. Matsuo, 2010. Earthquake shakes Twitter users: real-time event detection by social sensors. Proceedings of the 19th International Conference on World Wide Web (www 2010), Apr. 26-30, ACM New York, NY, USA. pp: 851-860. DOI: 10.1145/1772690.1772777

10. Culotta, A., 2010. Towards detecting influenza epidemics by analyzing Twitter messages. Proceedings of the first workshop on social media analytics (SOMA'2010), Jul 25, ACM, Washington, DC, USA. pp: 115-122. DOI: 10.1145/1964858.1964874

11. Rangrej, A., S. Kulkarni and A.V. Tendulkar, 2011. Comparative study of clustering techniques for short text documents. Proceedings of the 20th International Conference Companion on World Wide Web, Mar. 28-Apr. 01, ACM New York, NY, USA. pp: 111-112. DOI: 10.1145/1963192.1963249

12. Zhang, Y., Y. Wu and Q. Yang, 2012. Community discovery in Twitter based on user interests. J. Comput. Inform. Syst., 8: 991-1000.

13. Goyal, P., 2011. Data mining and analysis on Twitter. Semester project report.

14. Liang, P.W. and B.R. Dai, 2013. Opinion mining on social media data. Proceedings of the IEEE 14th International Conference on Mobile Data Management, Jun. 3-6, IEEE Xplore Press, Milan, pp: 91-96. DOI: 10.1109/MDM.2013.73

15. Honey, C. and S.C. Herring, 2009. Beyond microblogging: Conversation and collaboration via Twitter. Proceedings of the 42nd Hawaii International Conference on System Sciences, Jan. 5-8, IEEE Xplore Press, Big Island, HI., pp: 1-10. DOI: 10.1109/HICSS.2009.89

16. De Groot, R., 2012. Data mining for tweet sentiment classification. MSc Thesis, Utrecht University, Utrecht, Netherlands.

17. http://mashable.com/2013/12/17/twitter-popular-languages/, Accessed on January 4, 2014

18. Karandikar, A., 2010. Clustering short status messages: A topic model based approach. MSc Thesis, University of Maryland, Baltimore County.

19. Perez-Tellez, F., D. Pinto, J. Cardiff and P. Rosso, 2010. On the difficulty of clustering company tweets. Proceedings of the 2nd International Workshop on Search and Mining User-Generated Contents, Oct. 26-30, ACM New York, NY, USA. pp: 95-102. DOI: 10.1145/1871985.1872001

20. Bollen, J., H. Mao, and X. Zeng, 2011. Twitter mood predicts the stock market. Journal of Computational Science 2.1: 1-8.

21. Java, A., 2008. Mining social media communities and content. PhD Thesis, University of Maryland, Baltimore County.

22. Bicen, H., and N. Cavus, 2010. The most preferred social network sites by students. Procedia-Social and Behavioral Sciences 2.2: 5864-5869. DOI: 10.1016/j.sbspro.2010.03.958

23. IAB Platform Status Report: User Generated Content, Social Media, and Advertising - An Overview, April 2008. Available on: http://www.iab.net/media/file/2008_ugc_platform.pdf, Accessed on January 25, 2014

24. Wang, Z., 2010. Social media data analysis: A dynamic perspective. NATO Research and Technology Organisation, Ottawa, Ontario K1A 0K2, Canada.

25. Khabiri, E., 2013. Ranking, labeling, and summarizing short text in social media. PhD Thesis, Texas A&M University.

26. Gundecha, P., and H. Liu, 2012. Mining Social Media: A Brief Introduction. Tutorials in Operations Research 1.4. DOI: 10.1287/educ.1120.0105

27. Aggarwal, C.C., 2011. Social network data analytics. Springer. ISBN 978-1-4419-8461-6, e-ISBN 978-1-4419-8462-3. DOI: 10.1007/978-1-4419-8462-3

28. Benevenuto, F., T. Rodrigues, M. Cha, and V. Almeida, 2009. Characterizing user behavior in online social networks. Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference. ACM. pp: 49-62. DOI: 10.1145/1644893.1644900

29. Social Media Update, Pew Research Center. http://pewinternet.org/~/media//Files/Reports/2013/Social%20Networking%202013_PDF.pdf, Accessed on January 16, 2014

30. Krishnamurthy, B., P. Gill, and M. Arlitt, 2008. A few chirps about twitter. Proceedings of the first workshop on online social networks, Aug. 17-22, Seattle, WA, USA. ACM. DOI: 10.1145/1150402.1150476

31. Java, A., X. Song, T. Finin and B. Tseng, 2007. Why we Twitter: Understanding microblogging usage and communities. Proceedings of the 9th Workshop on Web Mining and Social Network Analysis, Aug. 12-15, ACM New York, NY, USA, pp: 56-65. DOI: 10.1145/1348549.1348556

32. Becker, H., 2011. Identification and characterization of events in social media. PhD Thesis, Columbia University.

33. Mosley, J. and C. Roosevelt, 2012. Social media analytics: Data mining applied to insurance Twitter posts. Casualty Actuarial Society E-Forum.

34. Moseley, N., 2013. Using Word and Phrase Abbreviation Patterns to Extract Age from Twitter Micro texts. MSc Thesis, Rochester Institute of Technology, New York, NY, USA.

35. Sankaranarayanan, J., H. Samet, B.E. Teitler, M.D. Lieberman and J. Sperling, 2009. Twitterstand: News in tweets. Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Nov. 04-06, ACM New York, NY, USA, pp: 42-51. DOI: 10.1145/1653771.1653781

36. http://www.alexa.com/siteinfo/twitter.com, Accessed on March 20, 2014

37. http://www.statisticbrain.com/twitter-statistics/, Accessed on January 5, 2014

38. http://www.statista.com/chartoftheday/twitter/, Accessed on January 4, 2014

39. http://www.statista.com/topics/737/twitter/chart/1629/twitter-penetration/, Accessed on January 21, 2014

40. Premalatha, K. and A.M. Natarajan, 2009. Genetic algorithm for document clustering with simultaneous and ranked mutation. Modern Applied Sci., 3: 75-82.

41. Begelman, G., P. Keller, and F. Smadja, 2006. Automated tag clustering: Improving search and exploration in the tag space. Collaborative Web Tagging Workshop at WWW, May 23-26, Edinburgh, Scotland. DOI: 10.1.1.120.5736

42. Jajoo, P., 2008. Document clustering. MTech Thesis. Indian Institute of Technology Kharagpur, Kharagpur, West Bengal, India.

43. Jian-Xiang, W., L. Huai, S. Yue-Hong, and S. Xin-Ning, 2009. Application of Genetic Algorithm in Document Clustering. Proceedings of the International Conference on Information Technology and Computer Science (ITCS 2009), Jul 25-26, Kiev, Ukraine. Vol. 1. IEEE. DOI: 10.1109/ITCS.2009.269

44. Tan, P.N., M. Steinbach, and V. Kumar, 2006. Introduction to data mining. Library of Congress.

45. Amitava, D. and S. Bandyopadhyay, 2010. Subjectivity detection using genetic algorithm. Proceedings of the 1st Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA10), Lisbon, Portugal, August.

46. Alba, E. and B. Dorronsoro, 2008. Cellular Genetic Algorithms Electronic Resource. 1st Edn., Springer, New York, ISBN-10: 0387776109.

47. Usharani, J. and K. Iyakutti, 2013. A genetic algorithm based on cosine similarity for relevant document retrieval. Int. J. Eng. Res. Technol.

48. Alba, E., and B. Dorronsoro, 2004. Solving the vehicle routing problem by using cellular genetic algorithms. Proceedings of the Evolutionary Computation in Combinatorial Optimization, Apr. 5-7, Springer Berlin Heidelberg, Coimbra, Portugal, pp: 11-20. DOI: 10.1007/978-3-540-24652-7_2

49. Khaliessizadeh, S.M., 2006. Genetic mining: Using genetic algorithm for topic based on concept distribution. Proceedings of the World Academy of Science, Engineering and Technology, (SET’ 06).

50. Guzek, M., J. E. Pecero, B. Dorronsoro, P. Bouvry, and S. U. Khan, 2010. A cellular genetic algorithm for scheduling applications and energy-aware communication optimization. Proceedings of 2010 International Conference on High Performance Computing and Simulation (HPCS 2010), Jun 28-Jul 2, Caen, France. pp: 241-248. IEEE. DOI: 10.1109/HPCS.2010.5547124

51. Rudolph, G., and J. Sprave, 1995. A cellular genetic algorithm with self-adjusting acceptance threshold. Genetic Algorithms in Engineering Systems: Innovations and Applications. GALESIA. First International Conference on (Conf. Publ. No. 414), Sep 12-14. IET.

52. De Felice, M., S. Meloni, and S. Panzieri, 2012. Influence of Topological Features on Spatially-Structured Evolutionary Algorithms Dynamics. arXiv preprint arXiv: 1202.0678c1

53. Simoncini, D., P. Collard, S. Verel, and M. Clergue, 2007. On the influence of selection operators on performances in cellular genetic algorithms. Proceedings of Evolutionary Computation, (CEC) 2007, IEEE Congress on. IEEE.

54. Alba, E., and B. Dorronsoro, 2005. The exploration/exploitation tradeoff in dynamic cellular genetic algorithms. Evolutionary Computation, IEEE Transactions on 9.2: 126-142.

55. Morales-Reyes, A., A. Al-Naqi, A.T. Erdogan and T. Arslan, 2009. Towards 3D architectures: A comparative study on cellular GAs dimensionality. Proceedings of Adaptive Hardware and Systems, Jul. 29-Aug. 1, IEEE Xplore Press, San Francisco, CA. pp: 223-229. DOI: 10.1109/AHS.2009.29

56. http://www.policymic.com/articles/10642/twitter-revolution-how-the-arabspring-was-helped-by-social-media, Accessed on January 31, 2014

57. Xiao, Y., 2010. A Survey of Document Clustering Techniques & Comparison of LDA and moVMF.

58. Sathiyakumari, K., V. Preamsudha, and G. Manimekalai, 2011. A Survey on Various Approaches in Document Clustering. International Journal Computer Technology 2.5: 1534-1539.

59. Steinbach, M., G. Karypis, and V. Kumar, 2000. A comparison of document clustering techniques. KDD workshop on text mining. Vol. 400. No. 1.

60. Huang, A., 2008. Similarity measures for text document clustering. Proceedings of the sixth New Zealand computer science research student conference (NZCSRSC2008), Apr 14-17. Christchurch, New Zealand. pp: 49-56.

61. http://en.wikipedia.org/wiki/John_Henry_Holland, Accessed on February 12, 2014

62. http://en.wikipedia.org/wiki/Hans-Joachim_Bremermann, Accessed on March 15, 2014

63. https://www.google.com.eg/#q=alex+fraser, Accessed on March 15, 2014

64. Melanie, M., 1999. An introduction to genetic algorithms. Cambridge, Massachusetts London, England, Fifth printing 3. Melanie, M., (Ed.), ISBN: 0-262-13316-4 (HB), 0-262-63185-7 (PB).

65. Holland, J.H., 1975. Adaptation in natural and artificial systems: An introductory analysis with applications to biology, control, and artificial intelligence. U Michigan Press (Second edition: MIT Press, 1992).

66. De Jong, K.A., 1975. Analysis of the behavior of a class of genetic adaptive systems. Ph.D. thesis, University of Michigan, Ann Arbor.

67. Robertson, G.G., 1987. Parallel implementation of genetic algorithms in a classifier system. Genetic algorithms and their applications: proceedings of the second International Conference on Genetic Algorithms: July 28-31, 1987 at the Massachusetts Institute of Technology, Cambridge, MA. Hillsdale, NJ: L. Erlbaum Associates, 1987.

68. Mühlenbein, H., M.G. Schleuter, and O. Krämer, 1988. Evolution algorithms in combinatorial optimization. Parallel Computing 7.1: 65-85.

69. Goldberg, D.E., 1989. Genetic algorithms in search, optimization, and machine learning. Addison-Wesley. ISBN: 0-201-15767-5.

70. Manderick, B., and P. Spiessens, 1989. Fine-grained parallel genetic algorithms. Proceedings of the third international conference on Genetic algorithms, Fairfax, Virginia, USA. Morgan Kaufmann Publishers Inc. pp: 428-433.

71. Spiessens, P., and B. Manderick, 1991. A massively parallel genetic algorithm: Implementation and first analysis. Proceedings of the Fourth International Conference on Genetic Algorithms, San Diego, California, USA. pp: 279-287.

72. Hoffmeister, F., 1991. Scalable parallelism by evolutionary algorithms. Parallel Computing and Mathematical Optimization. Springer Berlin Heidelberg. pp: 177-198.

73. Back, T., D.B. Fogel, and Z. Michalewicz, 1997. Handbook of evolutionary computation. Oxford University Press. ISBN: 0750303921

74. Whitley, L. D., 1993. Cellular genetic algorithms. Proceedings of the 5th International Conference on Genetic Algorithms, Urbana-Champaign, IL, USA. Morgan Kaufmann Publishers Inc.

75. Adamic, L., and E. Adar, 2005. How to search a social network. Social Networks 27.3: 187-203.

76. Maia, M., J. Almeida, and V. Almeida, 2008. Identifying user behavior in online social networks. Proceedings of the 1st workshop on Social network systems (SNS 2008), Glasgow, Scotland, UK, April 1, 2008. ACM.

77. Zhou, D., 2008. Mining Social Documents and Networks. PhD Thesis, the Pennsylvania State University.

78. Cormode, G., B. Krishnamurthy, and W. Willinger, 2010. A manifesto for modeling and measurement in social media. First Monday 15.9.

79. Gozzo, S., and R. D’Agata, 2010. Social networks and political participation in a Sicilian community context. Procedia-Social and Behavioral Sciences 4: 49-58.

80. Kuan, Z., 2010. Mining Antagonistic Communities from Social Networks. MSc Thesis, Singapore Management University.

81. Yun, S., H. Do, and H.G. Kim, 2010. Analysis of user interactions in online social networks. Proceedings of the 19th international Conference on World Wide Web (www 2010), Apr. 26-30, ACM New York, NY, USA. Vol. 2.

82. Agrawal, D., B. Bamieh, C. Budak, A. El Abbadi, A. Flanagin, and S. Patterson, 2011. Data-driven modeling and analysis of online social networks. Web-Age Information Management. Springer Berlin Heidelberg. pp: 3-17.

83. Gloor, P.A., K. Fischbach, H. Fuehres, C. Lassenius, T. Niinimäki, D. O. Olguin, S. Pentland, A. Piri, and J. Putzke, 2011. Towards “Honest Signals” of Creativity–Identifying Personality Characteristics through Microscopic Social Network Analysis. Procedia-Social and Behavioral Sciences 26: 166-179.

84. Malik, H., and A.S. Malik, 2011. Towards identifying the challenges associated with emerging large scale social networks. Procedia Computer Science 5: 458-465. DOI: 10.1016/j.sbspro.2011.10.573

85. Stieglitz, S., and C. Kaufhold, 2011. Automatic full text analysis in public social media–adoption of a software prototype to investigate political communication. Procedia Computer Science 5: 776-781. DOI: 10.1016/j.procs.2011.07.104

86. Takaffoli, M., F. Sangi, J. Fagnan, and O. R. Zaïane, 2011. Community evolution mining in dynamic social networks. Procedia-Social and Behavioral Sciences 22: 49-58. DOI: 10.1016/j.sbspro.2011.07.055

87. Derczynski, L.R.A, B. Yang, and C.S. Jensen, 2013. Towards context-aware search and analysis on social media data. Proceedings of the 16th International Conference on Extending Database Technology (EDBT/ICDT 2013), Mar 18-22, Genoa, Italy. ACM.

88. Jiang, J., C. Wilson, X. Wang, P. Huang, W. Sha, Y. Dai, and B. Y. Zhao, 2013. Understanding latent interactions in online social networks. ACM Transactions on the Web (TWEB 2013) 7.4: 18.

89. Santiago, N.G., 2013. Data mining social media networks for terrorist events indicators. MSc Thesis. Polytechnic University of Puerto Rico.

90. Stieglitz, S., and L.D. Xuan, 2013. Social media and political communication: a social media analytics framework. Social Network Analysis and Mining 3.4: 1277-1291. DOI: 10.1007/s13278-012-0079-3

91. Huberman, B. A., D.M. Romero, and F. Wu, 2008. Social networks that matter: Twitter under the microscope. arXiv preprint arXiv: 0812.1045

92. Zhao, D., and M.B. Rosson, 2009. How and why people Twitter: the role that micro-blogging plays in informal communication at work. Proceedings of the ACM 2009 international conference on Supporting group work (Group '09), Sanibel Island, FL, USA, May 10-13. ACM. pp: 189-192.

93. Naaman, M., J. Boase, and C.H. Lai, 2010. Is it really about me?: message content in social awareness streams. Proceedings of the 2010 ACM conference on Computer supported cooperative work (CSCW 2010), Feb 6-10, Savannah, Georgia, USA. ACM.

94. Weerkamp, W., S. Carter, and M. Tsagkias, 2011. How people use twitter in different languages. pp: 1-2.

95. Goel, A., A. Sharma, D. Wang, and Z. Yin, 2013. Discovering Similar Users on Twitter. Proceedings of the Workshop on Mining and Learning with Graphs (MLG-2013), Aug 11, Chicago, Illinois, USA.

96. Wegrzyn-Wolska, K., L. Bougueroua, and G. Dziczkowski, 2011. Social media analysis for e-health and medical purposes. Proceedings of 2011 International Conference on Computational Aspects of Social Networks (CASoN 2011), Oct 19-21, Salamanca, Spain. IEEE. pp: 278-283.

97. Yoon, S., 2011. Application of social network analysis and text mining to characterize network structures and contents of microblogging messages: An observational study of physical activity-related tweets. PhD Thesis. Columbia University.

98. Chew, C., and G. Eysenbach, 2010. Pandemics in the Age of Twitter: Content Analysis of Tweets during the 2009 H1N1 Outbreak. PLoS ONE 5(11): e14118. DOI: 10.1371/journal.pone.0014118

99. Lampos, V., and N. Cristianini, 2010. Tracking the flu pandemic by monitoring the social web. Proceedings of the 2nd International Workshop on Cognitive Information Processing (CIP 2010), Jun 14-16, Elba Island (Tuscany), Italy. DOI: 10.1109/CIP.2010.5604088

100. Achrekar, H., A. Gandhe, R. Lazarus, S. Yu, and B. Liu, 2011. Predicting flu trends using twitter data. Proceedings of 2011 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), Apr 10-15, Shanghai, China. IEEE. pp: 702-707. DOI: 10.1109/infcomw.2011.5928903

101. Paul, M. J., and M. Dredze, 2011. You Are What You Tweet: Analyzing Twitter for Public Health. Proceedings of Fifth International AAAI Conference on Weblogs and Social Media (WSM'2011), Jul 17-21, Barcelona, Spain. pp: 265-272. DOI: 10.1.1.224.9974

102. Signorini, A., A.M. Segre, and P.M. Polgreen, 2011. The Use of Twitter to Track Levels of Disease Activity and Public Concern in the U.S. during the Influenza A H1N1 pandemic. PLoS ONE 6.5: e19467.

103. Diakopoulos, N.A., and D.A. Shamma, 2010. Characterizing debate performance via aggregated twitter sentiment. Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI 2010), Apr 10-15, Atlanta, Georgia, USA. ACM. pp: 1195-1198.

104. Zhou, Z., R. Bandari, J. Kong, H. Qian, and V. Roychowdhury, 2010. Information resonance on Twitter: watching Iran. Proceedings of the First Workshop on Social Media Analytics, (SOMA'2010), Jul 25, Washington, DC, USA. ACM. pp: 123-131. DOI: 10.1145/1964858.1964875

105. Bollen, J., H. Mao, and X. Zeng, 2011. Twitter mood predicts the stock market. Journal of Computational Science 2.1: 1-8.

106. Zhang, X., H. Fuehres, and P.A. Gloor, 2011. Predicting stock market indicators through twitter “I hope it is not as bad as I fear”. Procedia-Social and Behavioral Sciences 26: 55-62. DOI: 10.1016/j.sbspro.2011.10.562

107. Jansen, B.J., M. Zhang, K. Sobel, and A. Chowdury, 2009. Twitter power: Tweets as electronic word of mouth. Journal of the American society for information science and technology 60.11: 2169-2188. DOI: 10.1002/asi.21149

108. Bulearca, M., and S. Bulearca, 2010. Twitter: a viable marketing tool for SMEs. Global Business and Management Research: An International Journal 2.4: 296-309.

109. Becker, H., F. Chen, D. Iter, M. Naaman, and L. Gravano, 2011. Automatic Identification and Presentation of Twitter Content for Planned Events. Proceedings of Fifth International AAAI Conference on Weblogs and Social Media (WSM'2011), Jul 17-21, Barcelona, Spain. pp: 655-656. DOI: 10.1.1.225.1555

110. Jackoway, A., H. Samet, and J. Sankaranarayanan, 2011. Identification of live news events using Twitter. Proceedings of the 3rd ACM SIGSPATIAL International Workshop on Location-Based Social Networks (LBSN 2011), Nov 1, 2011, Chicago, Illinois, USA. pp: 25-32. DOI: 10.1145/2063212.2063224

111. Earle, P.S., D.C. Bowden, and M. Guy, 2012. Twitter earthquake detection: earthquake monitoring in a social world. Annals of Geophysics 54.6: 708-715. DOI: 10.4401/ag-5364

112. Crooks, A., A. Croitoru, A. Stefanidis, and J. Radzikowski, 2013. #Earthquake: Twitter as a distributed sensor system. Transactions in GIS 17.1: 124-147.

113. Meena, Y.K., Shashank and V.P. Singh, 2012. Text Documents Clustering using Genetic Algorithm and Discrete Differential Evolution. International Journal of Computer Applications 43(1): 16-19, April. Published by Foundation of Computer Science, New York, USA. DOI: 10.5120/6067-8221

114. Zamir, O., and O. Etzioni, 1999. Grouper: a dynamic clustering interface to Web search results. Computer Networks 31.11: 1361-1374. DOI: 10.1.1.31.8216

115. Zeng, H., Q. He, Z. Chen, W. Ma, J. Ma, 2004. Learning to cluster web search results. Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, Jul 25-29, Sheffield, UK. DOI: 10.1145/1008992.1009030

116. Zhu, Y.H., G. Dai, B. C. M. Fung, and D. Mu, 2006. Document Clustering Method Based on Frequent Co-occurring Words. Proceedings of the 20th Pacific Asia Conference on Language, Informatics, and Computation (PACLIC 2006), Nov 1-3, Wuhan, China. pp: 442-445.

117. Aggarwal, C.C., and C.X. Zhai, 2012. A survey of text clustering algorithms. Mining text data. Springer US. pp: 77-128.

118. Koteeswaran, S., P. Visu and J. Janet, 2012. A review on clustering and outlier analysis techniques in data mining. Am. J. Applied Sci., 9: 254-258. DOI: 10.3844/ajassp.2012.254.258

119. Anastasiu, D.C., A. Tagarelli, and G. Karypis, 2013. Document Clustering: The Next Frontier. Technical Report. University of Minnesota. DOI: 10.1.1.401.8428

120. Jones, G., A. M. Robertson, C. Santimetvirul, and P. Willett, 1995. Nonhierarchic document clustering using a genetic algorithm. Information Research, 1(1). Available at: http://InformationR.net/ir/1-1/paper1.html

121. Maulik, U., and S. Bandyopadhyay, 2000. Genetic algorithm-based clustering technique. Pattern recognition 33.9: 1455-1465. DOI: 10.1.1.19.1878

122. Casillas, A., M.T. Gonzalez De Lena and R. Martinez, 2003. Document clustering into an unknown number of clusters using a genetic algorithm. Proceedings of the 6th International Conference on Text Speech and Dialogue, Sept. 8-12, Springer Berlin Heidelberg, Czech Republic, pp: 43-49. DOI: 10.1007/978-3-540-39398-6_7

123. Verma, H., E. Kandpal, B. Pandey, and J. Dhar, 2010. A Novel Document Clustering Algorithm Using Squared Distance Optimization Through Genetic Algorithms. International Journal on Computer Science & Engineering 2.5. pp: 1875-1879.

124. Khot, T., 2010. Clustering Twitter feeds using word co-occurrence. University of Wisconsin. Available on: http://www.learningace.com/doc/1912964/627012c3a31a544c6f20fc8aa2482935/cs784

125. Yamashita, T., H. Sato, S. Oyama, and M. Kurihara, 2013. Classification of Twitter Users Based on Following Relations. Proceedings of the International Multi Conference of Engineers and Computer Scientists (imecs'2013), Mar 13-15, Hong Kong. Vol. 1. DOI: 10.1145/1772690.1772751

126. Bernstein, M.S., B. Suh, L. Hong, J. Chen, S. Kairam, and E. H. Chi, 2010. Eddi: interactive topic-based browsing of social status streams. Proceedings of the 23rd annual ACM symposium on User interface software and technology (UIST'2010), Oct 3-6, New York, NY, USA. DOI: 10.1145/1866029.1866077

127. O’Connor, B., M. Krieger and D. Ahn, 2010. Tweet motif: Exploratory search and topic summarization for Twitter. Proceedings of the 4th International AAAI Conference on Weblogs and Social Media (WSM’ 10), May 23-26, Washington DC, USA. pp: 384-385. DOI: 10.1.1.365

128.

Rosa, K.D., R. Shah, B. Lin, A. Gershman, and R. Frederking, 2011. Topical clustering of tweets. Proceedings of the 34th Annual ACM SIGIR Conference, Jul 24-28, Beijing, China2

129.

Kim, S., S. Jeon, J. Kim and P. Young-Ho, 2012. Finding core topics: Topic extraction with clustering on tweet. Proceedings of the 2nd International Conference on Cloud and Green Computing, Nov. 1-3, IEEE Xplore Press, Xiangtan, pp: 777-782. DOI: 10.1109/CGC.2012.120

130.

Rafea, A., and N. A. Mostafa, 2013. Topic extraction in social media. Proceedings of the 2013 International Conference on Collaboration Technologies and Systems (CTS 2013), May 20-24, IEEE San Diego, California, USA., pp: 94-98. DOI: 10.1109/CTS.2013.6567212

131.

Banerjee, S., K. Ramanathan and A. Gupta, 2007. Clustering short texts using Wikipedia. Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Jul. 23-27, ACM New York, NY, USA., pp: 787-788. DOI: 10.1145/1277741.1277909

132.

Gabrilovich, E., and S. Markovitch, 2007. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. Proceedings of the 20th IJCAI, Jan. 9-12, Vol.7, Hyderabad, India. pp: 1606-1611. DOI: 10.1.1.76.9790

133.

Chen, Q., T. Shipper, and L. Khan, 2010. Tweets mining using WIKIPEDIA and impurity cluster measurement. Proceedings of IEEE International Conference on Intelligence and Security Informatics (ISI'2010), May 23-26. pp: 141-143. DOI: 10.1109/ISI.2010.5484758

134.

Alba, E., B. Dorronsoro, F. Luna, A.J. Nebro and P. Bouvry, 2007. A cellular multi-objective genetic algorithm for optimal broadcasting strategy in metropolitan MANETs. Comput. Commun., 30: 685-697. DOI: 10.1109/IPDPS.2005.4

135.

Nebro, A. J., J. J. Durillo, F. Luna, B. Dorronsoro, and E. Alba, 2009. Mocell: A cellular genetic algorithm for multiobjective optimization. International Journal of Intelligent Systems 24.7: 726-746. DOI: 10.1002/int.20358.

136.

Khezri, S., and A. Hazrati, 2013. Sensor Placement in WSN Using the Cellular Genetic Algorithm. J. Basic. Appl. Sci. Res., 3(2s): 745-750.

137.

Yugui, C., 2013. Electric Energy Demand Forecast of Nanchang based on Cellular Genetic Algorithm and BP Neural Network. TELKOMNIKA Indonesian Journal of Electrical Engineering 11.7. pp: 3821-3825.

138.

Lu, C., 2011. Exploiting social tagging network for web mining and search. PhD Thesis, Drexel University, Philadelphia, USA.

139.

https://dev.twitter.com/docs/rate-limiting/1, Accessed on March 5, 2014

140.

https://hootsuite.com/features/social-networks, Accessed on March 7, 2014

141.

http://www.alexa.com/siteinfo/hootsuite.com, Accessed on March 20, 2013

142.

Schenk, C.B., 2004. Finding Event-Specific Influencers in Dynamic Social Networks. MSc Thesis. The Pennsylvania State University.

143.

Lee, M., W. Wang, and H. Yu, 2006. Exploring supervised and unsupervised methods to detect topics in biomedical text. BMC Bioinformatics 7.1: 140. DOI: 10.1186/1471-2105-7-140

144.

Sayyadi, H., M. Hurst, and A. Maykov, 2009. Event Detection and Tracking in Social Streams. Proceedings of 3rd International AAAI Conference on Weblogs and Social Media (ICWSM'09), May 17-20, San Jose, California, USA. pp: 311-314. DOI: 10.1.1.187.7972

145.

Reeves, C.R., 1993. Genetic Algorithms. In: Modern Heuristic Techniques for Combinatorial Problems, Reeves, C.R. (Ed.), Blackwell Scientific Publications, Oxford, ISBN-10: 0470220791.

146.

Misevičius, A. and B. Kilda, 2005. Comparison of crossover operators for the quadratic assignment problem. Inform. Technol. Control, 34: 109-119. DOI: 10.1.1.132.568

147.

Hugosson, J., E. Hemberg, A. Brabazon and M. O'Neill, 2007. An investigation of the mutation operator using different representations in Grammatical Evolution. Proceedings of the 2nd International Symposium Advances in Artificial Intelligence and Applications, Oct. 15-17, Wisla, Poland, pp: 409-419.

ABSTRACT

Social networking sites have become an essential part of the daily activity of Internet users, since they allow users to exchange information and communicate with one another. There is now a pressing need to analyze, in a sound manner, the huge and unprecedented volume of content produced by the users of these sites. Twitter has emerged in recent years as one of the most popular of them: in 2013 the number of registered active Twitter accounts exceeded 500 million, producing more than 50 million tweets per day, according to the statistics website Statisticbrain.com. As Twitter's popularity continues to grow rapidly, it has become essential to analyze the huge amount of data its users produce. Twitter is a primary source of real-time information on a wide variety of topics such as sports events, advertisements, political campaigns, mass emergencies, crises, healthcare, etc.

Clustering is one of the most widely used methods for analyzing tweets. Since most tweets are textual in nature, this study focuses on clustering tweets based on the similarity of their text content. Statistics indicate that English is the most used language on Twitter (34% of all tweets are in English, according to a report by the British-American news website "Mashable", which covers technology and social media). Accordingly, this study clusters tweets written in English.

A scraping-based technique was used to collect the data needed for this study. The technique gathers data from Twitter through "Hootsuite", an application used to collect and analyze data across different social networks. Using this application, a set of tweets was collected over 3 days, from 26 to 28 June 2013, based on a set of topic headings describing specific and varied topics, in order to study multiple groups of Twitter users' interests. The experiments were conducted on three datasets of different sizes using genetic algorithms.

Genetic Algorithms belong to the family of Evolutionary Algorithms, which are techniques designed to find optimal solutions among a set of possible solutions (individuals). Genetic algorithms are probabilistic search methods that mimic the mechanisms of natural biological evolution in order to discover solutions to problems.

A subclass of genetic algorithms, the Cellular Genetic Algorithm, was used in this study to cluster a set of tweets. This study is one of the first attempts to cluster tweets using the cellular genetic algorithm (cGA), which improves the clustering results compared with both traditional clustering algorithms and clustering algorithms that require prior knowledge of the number of clusters, such as K-means. The results obtained by cGA were compared with those obtained by the traditional genetic algorithm, the generational genetic algorithm (genGA). The comparison used four criteria: average fitness value, average execution time, number of resulting clusters, and number of generations. The results indicate that cGA generally outperforms genGA.

Keywords: Clustering, Cellular Genetic Algorithm, cGA, Twitter, Tweet Similarity

Arab Academy for Science and Technology & Maritime Transport College of Computing & Information Technology

CELLULAR GENETIC ALGORITHM APPLICATION FOR CLUSTERING TWEETS IN TWITTER ANALYSIS

A Thesis Submitted to the College of Computing & Information Technology (Cairo) in Partial Fulfillment of the Requirements for the degree of Master of Science in Information Systems

Submitted By Amr Adel AbdelRahim ElSayed

Supervised by Prof. Dr. Amr Ahmed Badr, Professor of Computer Science, Faculty of Computers and Information, Cairo University

Dr. Essameldean Fawzy ElFakharany, Lecturer of Information Systems, College of Management and Technology, Arab Academy for Science and Technology & Maritime Transport

May 2014
