Categorizing Twitter Users on the basis of their interests using Hadoop/Mahout Platform (IEEE Xplore)

Eeti Jain, S K Jain
Department of Computer Engineering
National Institute of Engineering
Kurukshetra, India
[email protected], [email protected]

Abstract— The traditional k-means algorithm has been applied successfully to many problems, but its use is restricted to small datasets. Online services such as Twitter generate large amounts of data that must be handled properly, so a platform is needed that can perform faster data clustering; this need led to the development of Mahout/Hadoop. Mahout is a machine learning library that provides parallel clustering algorithms which run on Hadoop in a distributed manner, and together with Hadoop it proves to be a strong option for clustering. In this work, we categorize Twitter users on the basis of their interest patterns by implementing Mahout over the Hadoop platform and performing experiments with Twitter datasets. We also evaluate the performance of k-means and fuzzy k-means and compare their results to find the better algorithm for this type of dataset.

Keywords— Clustering, Hadoop, Mahout, Parallel, Twitter

I. INTRODUCTION

Data mining [1] is the process of extracting useful information from a plentiful data set and converting it into an understandable form for future use. It extracts information from past data, analyzes the outcome, and presents some output. Data mining has several classes of techniques, such as clustering, classification, and recommendation engines. Clustering is the task of grouping objects so that objects in the same group are more similar in their characteristics than objects in other groups; it is a type of unsupervised learning. In the past, data was small enough to be processed on a single machine, and continuously changing online datasets did not need to be handled. Now, with the rapid growth of the internet, the amount of data is growing exponentially, and it has become hard to process Big Data using traditional methodologies. For document clustering in particular, real datasets are required, and these can be noisy. Social networking sites like Twitter are prime examples of the accumulation of massive amounts of unstructured data; big IT companies are another. Therefore, some mechanism is required to manage Big Data, which led to the development of Hadoop [5]. In 2004, Google published a paper on its MapReduce framework [12]. MapReduce provides a parallel mechanism by which large amounts of data can be processed: queries are divided and distributed across many nodes to be processed in parallel, known as the Map stage, and the results are then combined in the Reduce stage to give the output. This framework performed very well, so Apache introduced an open source project, known as Hadoop, based on Google's MapReduce framework. Mahout is a machine learning library which works on Hadoop and includes various implementations, including clustering and classification.

Clustering is divided into hierarchical and partitional clustering. Hierarchical clustering groups the objects into nested partitions that form a hierarchical structure, while partitional clustering simply divides the data into different groups. The hierarchical approach can provide more understandable results, but the partitional approach is more widely used because of its shorter execution time and simplicity. K-means is a partitional clustering algorithm and a type of hard clustering; there is also a type of soft clustering, known as fuzzy k-means, which produces overlapping clusters. Clustering can be applied to points and objects as well as to text documents. Objects are clustered based on their features, so some form of feature selection is needed. Document clustering helps to organise large numbers of documents efficiently: unorganised data is grouped into meaningful clusters based on text similarity, and useful information is retrieved without any prior knowledge. Esteves and Rong previously clustered Wikipedia articles [6], and the behaviour of k-means on Mahout has also been tested without Wikipedia data [8]. Here we take a dataset of around 30,000 tweets from Twitter and cluster similar users into groups according to the similarity of their tweets. In this paper, experiments are performed to evaluate the performance of k-means and fuzzy k-means and to measure execution time as the convergence threshold is varied. We consider the following points:

1. Comparing results of the Euclidean and Cosine distance measures
2. Comparing results of K-Means and fuzzy K-Means
   a) How much slower/faster is fuzzy k-means in comparison to k-means?
   b) How much does FCM improve the quality of the clustering?
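The Map and Reduce stages described above can be sketched with a minimal in-memory word count. The function names and the single-process grouping loop are illustrative only, standing in for the shuffle and distribution that Hadoop performs across nodes:

```python
from collections import defaultdict

def map_stage(document):
    # Emit intermediate (key, value) pairs: one (word, 1) pair per word.
    return [(word, 1) for word in document.split()]

def reduce_stage(key, values):
    # Combine all values emitted for one key into a final result.
    return (key, sum(values))

documents = ["hadoop stores big data", "mahout runs on hadoop"]

# Shuffle/sort: group intermediate pairs by key
# (in Hadoop, the framework does this between the two stages).
grouped = defaultdict(list)
for doc in documents:
    for key, value in map_stage(doc):
        grouped[key].append(value)

counts = dict(reduce_stage(k, v) for k, v in grouped.items())
print(counts["hadoop"])  # 2
```

Because each map call and each reduce call depends only on its own input, the framework can run them on different nodes in parallel, which is the property the paper relies on.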

II. K-MEANS

K-means is widely used as a clustering algorithm. It divides the data into clusters such that data within a cluster are more similar to one another than to data in different clusters [2]. It is an unsupervised learning algorithm. Euclidean distance is a similarity measure used to measure the distance between centroids and the other points. Document vectors are directional and need to be normalized, so for text data the cosine similarity measure, which clusters according to angles, is preferred over Euclidean distance; spherical k-means is the resulting algorithm [10]. The standard k-means algorithm can be found in [8]. K-means has high complexity when large datasets are to be handled, so running k-means on Hadoop/Mahout is a promising way to cluster large datasets, as these platforms scale well. Zhang, Wu and Li [4] proposed a 2-tier clustering algorithm which parallelizes k-means on the Hadoop platform: the first clustering process takes place in the map stage and the second in the reduce stage. The work of k-means is thus divided into map and reduce stages [2], giving a parallelized k-means clustering algorithm. In the map stage, initial clusters are formed; in the reduce stage, the cluster averages are calculated and the centroids are reassigned; then the map stage is executed again. This iterates until a threshold is reached. K-means does not support overlapping clusters, so a data point can belong to a single group only.

III. FUZZY K-MEANS

Fuzzy k-means was originally introduced by Jim Bezdek in 1981 [9] as an improvement on earlier clustering methods. K-means assigns a point completely to one cluster, as it is a type of hard clustering. Fuzzy k-means is a soft clustering algorithm in which data elements can belong to more than one cluster, with a set of membership levels associated with each element. It produces overlapping clusters, which cannot be produced by k-means, but because it must calculate the membership function, it takes more time to execute. The FCM algorithm attempts to partition a finite collection of n elements X = {x_1, …, x_n} into a collection of c fuzzy clusters with respect to some given criterion. Given a finite set of data, the algorithm returns a list of c cluster centres C = {c_1, …, c_c} and a partition matrix W = [w_{ij}], w_{ij} ∈ [0, 1], i = 1, …, n, j = 1, …, c, where each element w_{ij} gives the degree to which element x_i belongs to cluster c_j. Like the k-means algorithm, FCM aims to minimize an objective function. The standard function is

J_m = \sum_{i=1}^{n} \sum_{j=1}^{c} w_{ij}^{m} \, \lVert x_i - c_j \rVert^{2},

where m > 1 is the fuzziness exponent.

IV. HADOOP/MAHOUT

Apache Hadoop [13] is an open-source software framework for storing and processing large data in a distributed manner. It is a set of libraries written in Java for processing big data across a number of nodes. It is scalable, as any number of nodes can be added to the cluster and the processing can be distributed, which makes Hadoop a very efficient platform for data-intensive applications. Hadoop consists of the HDFS filesystem and the MapReduce engine, and a Hadoop cluster consists of a single master node and many worker nodes. HDFS is based on a master/slave architecture: an HDFS cluster is made up of a single NameNode and many DataNodes that manage storage. The NameNode is responsible for managing all the DataNodes and keeps their metadata. Internally, the NameNode splits a file into a number of blocks which are stored in the DataNodes; replicas of the blocks are also created and distributed among the nodes to provide reliability in case of failure. MapReduce deals with the parallel processing of data across multiple nodes. The MapReduce framework consists of a single JobTracker (master) and a number of TaskTrackers (slaves). A MapReduce job is split into a number of tasks that are executed in parallel by different TaskTrackers on different nodes, and the sorted map output acts as input to the reduce tasks [3]. The two main stages of MapReduce are Map and Reduce: in the Map stage, queries are divided and distributed across different nodes for parallel processing, and the results of the different nodes are combined in the Reduce stage to produce the final output.
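The hard versus soft assignments contrasted in Sections II and III can be made concrete with a small sketch. It computes a k-means style hard label and FCM membership degrees for one point against fixed centres; the toy data, the choice m = 2, and the function names are invented for illustration:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hard_assign(point, centres):
    # k-means: the point belongs entirely to the nearest centre.
    return min(range(len(centres)), key=lambda j: euclidean(point, centres[j]))

def fuzzy_memberships(point, centres, m=2.0):
    # FCM membership update: w_j = 1 / sum_k (d_j / d_k)^(2/(m-1)).
    # Memberships over all clusters sum to 1.
    # (Assumes the point coincides with no centre, else some d is 0.)
    d = [euclidean(point, c) for c in centres]
    exp = 2.0 / (m - 1.0)
    return [1.0 / sum((d[j] / d[k]) ** exp for k in range(len(centres)))
            for j in range(len(centres))]

centres = [(0.0, 0.0), (4.0, 0.0)]
point = (1.0, 0.0)

print(hard_assign(point, centres))  # 0: wholly assigned to the nearest cluster
w = fuzzy_memberships(point, centres)
print(round(sum(w), 6))             # 1.0: degrees spread over all clusters
```

The extra per-point, per-cluster membership computation is what makes fuzzy k-means cost more per iteration than k-means, as the experiments later measure.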


The Apache Software Foundation (ASF) develops Apache Mahout [14], an open source project and a scalable machine-learning library. Mahout contains built-in libraries for solving clustering, categorization, and classification problems. It is a subproject built on the Hadoop platform which uses the distributed MapReduce paradigm, and it proves to be an inexpensive solution for machine learning problems; both the Hadoop and Mahout projects are open source. Mahout together with Hadoop is a promising technology for solving data-intensive problems efficiently, problems which did not arise in the past because there was less data. Various clustering implementations are available in Mahout, such as K-means, fuzzy K-means, Dirichlet, and many others. Some steps need to be followed to prepare input data for Mahout. First, preprocessing is done if the data is not numeric. The data is then converted to a Hadoop-specific format, the sequence file format. Finally, the sequence file is converted to vector form with the seq2sparse command, on which clustering is executed.

A. Preprocessing dataset under Mahout

For clustering, the sample needs to be converted to vectorized form. If the sample is numeric, it can be converted directly to the sequence file format. If the inputs are objects, they must first be converted to vectors with a number of dimensions equal to their features. Here we are considering text files, for which the most common method of vectorizing is the vector space model. A document has many words, and each word is assigned an index in the vector for that document: if the word "bank" is assigned the 39,900th index, it belongs to the 39,900th dimension of that document's vector. The value for each dimension is the frequency of occurrence of the word in the document, which is known as term frequency (TF) weighting. However, for various commonly occurring words such as "a", "an", and "the", the dimension value becomes very large and proper clusters are not formed. Term frequency–inverse document frequency (TF-IDF) weighting is an improvement on simple term-frequency weighting [11]: the dimension value is reduced for words that occur frequently, where document frequency is the number of documents a word occurs in. Different preprocessing methods can accordingly be used, as explained by Carpineto, Osinski, Romano, and Weiss [7].
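The TF-IDF weighting just described can be sketched as follows. The unsmoothed IDF formula and the tiny corpus are illustrative choices for the example, not Mahout's exact seq2sparse behaviour, which has its own weighting options:

```python
import math
from collections import Counter

docs = [
    "the bank raised the rate",
    "the river bank flooded",
    "the rate fell",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents each word occurs in.
df = Counter()
for doc in tokenized:
    for word in set(doc):
        df[word] += 1

def tfidf(doc):
    # TF = raw count in this document; IDF = log(N / DF) downweights
    # words like "the" that appear in every document.
    tf = Counter(doc)
    return {w: tf[w] * math.log(n_docs / df[w]) for w in tf}

weights = tfidf(tokenized[0])
print(weights["the"])   # 0.0 -- "the" occurs in all three documents
```

A word occurring in every document gets IDF log(N/N) = 0, so its dimension contributes nothing to the distance computation, which is exactly the effect the paper wants for stop words.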

Table 1 Comparison of Execution Time for distance measures

Setup   Euclidean Distance Measure   Cosine Distance Measure
1       3.29                         3.9
2       110                          5.3

Table 2 Comparison of Number of Iterations for distance measures

Setup   Euclidean Distance Measure   Cosine Distance Measure
1       2                            2
2       100                          6
V. EXPERIMENT AND PERFORMANCE EVALUATION

A. Dataset

We extracted a Twitter dataset containing tweets of different users by running a program online. The dataset consists of around 30,000 tweets from the Twitter website.

B. Experimental setup

We performed our experiments on a single node with the following configuration:

1) AMD quad-core 1.5 GHz processor
2) 4 GB RAM
3) 500 GB hard disk
4) 1000 Mbps Ethernet connection switch
5) Operating system: Ubuntu 12.04
6) Hadoop version 1.0.3
7) Java version 1.6.0
8) Mahout version 0.8

We experimented with different setups having different convergence thresholds (cd):

Setup 1 (cd = 0.5)
Setup 2 (cd = 0.1)
Setup 3 (cd = 0.01)
Setup 4 (cd = 0.005)

C. Experiment

First, the tweets are converted into separate documents, one per user, with all the tweets of a user concatenated into one document. The username then acts as the key and the tweets as the value in a sequence file. This file is converted into vectors, and k-means is executed using the cosine distance measure and the Euclidean distance measure. Finally, the output is examined with the clusterdumper class [11] and the densities are found.

a) Comparing results of Euclidean and Cosine distance measure
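Why cosine distance suits text better than Euclidean distance, as the evaluation below finds, can be seen with two term-frequency vectors where one document is simply the same text repeated; the vectors are invented for illustration:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # 1 - cos(angle): depends only on direction, not on vector length.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / norm

short_doc = [1, 2, 0]   # term frequencies of a short document
long_doc = [3, 6, 0]    # the same text repeated three times

print(euclidean(short_doc, long_doc))        # ~4.47: looks "far apart"
print(cosine_distance(short_doc, long_doc))  # ~0.0: same direction, same topic
```

Euclidean distance penalizes the length difference between the two documents even though their word distributions are identical, while cosine distance treats them as the same; this is the behaviour the paper observes on its tweet documents.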

b) Comparing results with K-Means and fuzzy K-Means

Table 3 Comparison of Execution Time for clustering algorithms (CD = Convergence Threshold)

Setup   K-Means   Fuzzy K-Means
1       3.9       6.19
2       5.3       6.19
3       17        14.16
4       28.6      44.4

Table 4 Comparison of Number of Iterations for clustering algorithms

Setup   K-Means   Fuzzy K-Means
1       2         2
2       6         2
3       14        5
4       28        15

Table 6 Comparison of IntraClusterDensity for clustering algorithms

Setup   K-Means   Fuzzy K-Means
1       0.73      0.635
2       0.74      0.61
3       0.79      0.65
4       0.72      0.67

[Figure 1 Comparison of Execution Time for clustering algorithms]

[Figure 3 Comparison of InterClusterDensity for clustering algorithms]

Table 5 Comparison of InterClusterDensity for clustering algorithms

Setup   K-Means   Fuzzy K-Means
1       0.92      0.032
2       0.93      0.038
3       0.925     0.027
4       0.93      0.047
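One plausible reading of the density measures reported in Tables 5 and 6 is sketched below: intra-cluster density as the average similarity of points to their own centroid, and inter-cluster density as the average pairwise similarity between centroids. This formulation and the toy clusters are assumptions for illustration, not necessarily the exact evaluator Mahout 0.8 uses:

```python
import math

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def centroid(points):
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def intra_density(clusters):
    # Average similarity of each point to its own centroid:
    # higher means tighter clusters.
    sims = [cosine_sim(p, centroid(c)) for c in clusters for p in c]
    return sum(sims) / len(sims)

def inter_density(clusters):
    # Average pairwise similarity between centroids:
    # lower means the clusters are more distinct.
    cents = [centroid(c) for c in clusters]
    sims = [cosine_sim(cents[i], cents[j])
            for i in range(len(cents)) for j in range(i + 1, len(cents))]
    return sum(sims) / len(sims)

clusters = [[(1.0, 0.1), (0.9, 0.0)], [(0.1, 1.0), (0.0, 0.9)]]
print(intra_density(clusters) > inter_density(clusters))  # True: tight, distinct clusters
```

Under this reading, the low inter-cluster values for fuzzy k-means in Table 5 correspond to well-separated centroids.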

[Figure 2 Comparison of Number of Iterations for clustering algorithms]

[Figure 4 Comparison of IntraClusterDensity for clustering algorithms]

D. Evaluation

a) Comparing results of Euclidean and Cosine distance measure

As seen in Table 2, when the CosineDistanceMeasure is used, clusters emerge within a few iterations, but with the EuclideanDistanceMeasure a very large number of iterations is needed for the final clusters to emerge. We set the maximum number of iterations to 100 and Euclidean distance consumed them all, which means that if we allowed more iterations, it would use more. Table 1 likewise shows that more execution time is taken by the Euclidean distance measure. Euclidean distance gives poor results because it treats two documents of different lengths as different documents regardless of their similarity, so the cosine distance measure is considered better for finding the distance between text documents.

b) How much slower/faster is fuzzy k-means in comparison to k-means?

We experimented with different values of cd, repeatedly decreasing cd and measuring the execution time. Fuzzy k-means mostly takes more time than K-Means, except in one case, as shown in Table 3; this is plotted in Figure 1 to give a clearer picture. More calculation is involved in fuzzy k-means than in k-means. When the number of iterations is compared, however, the clusters of fuzzy k-means converge in fewer iterations even though it takes more execution time, as shown in Table 4. Initially there is little increase in iterations, but afterwards a steep increase is seen, as shown in Figure 2. So fuzzy k-means is slower than k-means in execution.

c) How much does FCM improve the quality of the clustering?

Inter-cluster and intra-cluster density are taken as quality measures for the clusters and are computed for the different cases. Higher inter-cluster density means that there are many points between the clusters that are not included in any cluster, resulting in low-quality clusters; it also means there is less distance between the clusters, i.e. there is no well-defined partitioning between them. Higher intra-cluster density means that the points within a cluster are closer together, so points with low similarity are not clustered together, resulting in high-quality clusters. For fuzzy k-means, the inter-cluster density is much lower than for k-means, as shown in Table 5, which implies that fuzzy k-means gives good cluster quality. Considering intra-cluster density, it is only slightly lower for fuzzy k-means, as shown in Table 6, so its clusters remain nearly as tight as those of k-means. Clusters were computed with each cd, and with every cd better results were obtained for fuzzy k-means.

VI. CONCLUSION AND FUTURE WORK

Mahout together with Hadoop is a promising platform for handling big data and performing clustering because of its inexpensive and scalable characteristics: Mahout has built-in libraries and Hadoop provides parallelism. As we have seen, a change in the convergence threshold produces a big change in execution time, so a proper convergence threshold must be chosen. We experimented on Twitter data to find clusters of similar users. Although fuzzy k-means takes more time in calculation, we can say that it gives better results than k-means, as its inter-cluster density is found to be much lower while its intra-cluster density differs only slightly from that of k-means. Along with this, the cosine distance measure is preferred as the distance measure for clustering text documents. In future work, datasets from different websites can be taken to further this research on clustering. In our experiment, only words are used to calculate user similarity, but similarity can also be inferred from user interaction, which may prove to be a good feature for clustering users who think alike. Other clustering algorithms can be tried to find better results, and Mahout's other data mining algorithms, for classification and recommendation, can also be implemented.

REFERENCES

[1] X. Zhengqiao and Z. Dewei, "Research on Clustering Algorithm for Massive Data Based on Hadoop Platform," International Conference on Computer Science & Service System, Aug. 2012, pp. 43-45.
[2] S. Li and Y. Chang, "Research on K-Means Algorithm Based on Cloud Computing," International Conference on Computer Science and Service System, 11-13 Aug. 2012, pp. 1762-1765.
[3] T. White, Hadoop: The Definitive Guide, O'Reilly Media, Yahoo! Press, June 5, 2009.
[4] J. Zhang, G. Wu, and H. Li, "A 2-Tier Clustering Algorithm with Map-Reduce," Fifth Annual ChinaGrid Conference, 16-18 July 2010, pp. 160-166.
[5] S. Humbetov, "Data-intensive computing with map-reduce and hadoop," International Conference on Application of Information and Communication Technologies, 17-19 Oct. 2012, pp. 1-5.
[6] R. M. Esteves and C. Rong, "Using Mahout for Clustering Wikipedia's Latest Articles: A Comparison between K-means and Fuzzy C-means in the Cloud," 3rd IEEE International Conference on Cloud Computing Technology and Science, 2011, pp. 565-569.
[7] C. Carpineto, S. Osinski, G. Romano, and D. Weiss, "A survey of Web clustering engines," ACM Comput. Surv., vol. 41, no. 3, pp. 1-38, 2009.

[8] R. M. Esteves, R. Pais, and C. Rong, "K-means Clustering in the Cloud -- A Mahout Test," in Proceedings of the 2011 IEEE Workshops of International Conference on Advanced Information Networking and Applications, 2011, pp. 514-519.

[9] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981.
[10] M. Steinbach, G. Karypis, and V. Kumar, "A Comparison of Document Clustering Techniques," University of Minnesota, Technical Report 00-034, 2000.
[11] S. Owen, R. Anil, T. Dunning, and E. Friedman, Mahout in Action, Manning Publications, 2011.
[12] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," 2004.
[13] http://www.apache.hadoop.org/

[14] https://mahout.apache.org/
