Improving Tweet Clustering using Bigrams formed from Word Associations

Khadija Ali Vakeel

Shubhamoy Dey

Indian Institute of Management Prabandh Shikhar, Rau Indore – 453331, India

Indian Institute of Management Prabandh Shikhar, Rau Indore – 453331, India

[email protected]

[email protected]

Due to the growth of electronic commerce (e-commerce) and the convenience associated with it, more and more customers are adopting e-commerce as a medium for transactions. E-commerce companies are among the fastest growing and are showing eye-popping valuations. In India, though the industry is in its infancy, it has grown at a 35 per cent compound annual growth rate (PWC report, 2014). Amazon started its business in India in 2013, and Flipkart, another e-retailing giant, sold a whopping gross merchandise volume of 1 billion USD (PWC report, 2014). Due to high competition and the consequent investment in acquiring customers, Amazon's losses were close to Rs. 321.3 crore, and Snapdeal, another giant in Indian e-commerce, showed losses of Rs. 246.6 crore.

ABSTRACT

In this work we propose an innovative clustering algorithm for Twitter data. In the context of e-commerce, we use the Apriori algorithm to form 2-gram association rules and cluster tweets using self-organizing maps. Since tweets are relatively short, word association becomes all the more important in mining the information they contain. To check whether 2-grams formed from word associations help increase clustering tendency, we use the Hopkins index. Tested on two separate datasets of 200 and 10,000 tweets, each related to the keyword "Amazon", our analysis shows an improvement in clustering tendency on both datasets. This improvement is potentially useful because grouping customers based on their tweets can help businesses detect new trends and identify customers with different sentiments.

Social media provides a platform for customers to share their experience of purchasing online. In this paper, we analyze tweets about Amazon: we crawled 200 tweets and 10,000 tweets from 2014 regarding the Amazon Diwali sale to form two corpora. We then apply word associations together with clustering algorithms to these tweets: taking the context of e-commerce, we use the Apriori algorithm to form 2-gram association rules and finally cluster the tweets using self-organizing maps. To check whether the Apriori-based representation increases clustering tendency, we use the Hopkins index. On both datasets, of 200 and 10,000 tweets crawled with the keyword "Amazon", the analysis shows an improvement in clustering tendency.

CCS Concepts • Information systems➝Information systems applications; • Information systems➝Document representation; • Applied computing➝Document management and text processing;

Keywords Text mining; Association mining; Apriori algorithm; Hopkins index; Self Organizing Maps

1. INTRODUCTION

The process of grouping a set of abstract objects into classes of similar objects is called clustering (Han and Kamber, 2006). A cluster is a collection of objects such that objects in the same group are similar to one another and dissimilar to objects in other groups (Han and Kamber, 2006). Clustering is unsupervised learning: unlike classification, there are no predefined class labels. The K-means algorithm is a top-down partitioning strategy that can make hard or soft assignments to clusters: hard assignment allocates a document to exactly one cluster (0/1), while soft clustering does not make such exclusive assignments (Chakrabarti, 2003). Clustering of tweets in the context of e-commerce is important for analyzing customer characteristics and grouping customers according to sentiments, trends and experience. In this paper, we analyze tweets about the Amazon Diwali Dhamaka sale for trend analysis regarding customers' experience of participating in the sale. Due to the mixed emotions exhibited in an online flash sale, we expect two separate clusters of negative and positive sentiments.
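The difference between hard and soft assignment can be shown with a minimal sketch (illustrative Python, not the paper's R code; the exponential weighting used for the soft case is one common choice and an assumption here):

```python
import math

def hard_assign(doc, centroids):
    """Hard assignment: allocate the document to exactly one cluster (0/1)."""
    dists = [math.dist(doc, c) for c in centroids]
    return dists.index(min(dists))

def soft_assign(doc, centroids, beta=1.0):
    """Soft assignment: a weight for every cluster, summing to 1,
    so no single cluster is forced on the document."""
    sims = [math.exp(-beta * math.dist(doc, c)) for c in centroids]
    total = sum(sims)
    return [s / total for s in sims]
```

A document near one centroid receives that cluster under hard assignment, but under soft assignment it merely receives a larger weight for that cluster.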

Twitter is a microblogging website that allows its users to post their views and opinions in the form of 140-character tweets. These tweets can be used to detect patterns in consumers' attitudes towards specific brands. Hence, analysis of tweets to generate clusters, followed by trend analysis or sentiment analysis, is a field of text mining that has gained recent attention from researchers. Over the years, a substantial bulk of trade has shifted from the traditional marketplace to the online space.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. RACS'15, October 9–12, 2015, Prague, Czech Republic. © 2015 ACM. ISBN 978-1-4503-3738-0/15/10 …$15.00.

2. RELATED WORK

Self-organizing maps (SOM), or Kohonen maps, are algorithms that embed the clusters in a low-dimensional space right from the beginning and proceed in a way that places related clusters close together in that space (Chakrabarti, 2003).

DOI: http://dx.doi.org/10.1145/2811411.2811535


Unigrams carry information, but the association of one word with another can help mine unstated insight from a tweet.

SOMs are unsupervised, competitive artificial neural networks (ANN) based on the winner-takes-all principle (Kohonen, 1995). A SOM consists of an input layer connected by a matrix of weights to an output (Kohonen) layer, and it learns through the weights associated with the neurons in each layer (Sharma & Dey, 2013). SOM training is essentially a four-step algorithm:

The Hopkins index (Hopkins & Skellam, 1954) is a statistic used to measure clustering tendency. It examines whether objects in a data set differ significantly from a uniform distribution in the multidimensional space. It compares the distances wi between the real objects and their nearest neighbors to the distances qi between hypothetical objects, uniformly generated over the data space, and their nearest real neighbors.
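The Hopkins index can be sketched as follows (an illustrative Python implementation, not the paper's R code; it uses the standard form H = Σqᵢ / (Σqᵢ + Σwᵢ), with wᵢ and qᵢ as defined above, so values near 0.5 suggest uniform data and values near 1 suggest clustered data):

```python
import math
import random

def hopkins(points, m=None, seed=0):
    """Hopkins statistic: compare nearest-neighbour distances of real
    points (w_i) with those of uniformly generated points (q_i) drawn
    from the data's bounding box."""
    rng = random.Random(seed)
    m = m or len(points) // 2
    dims = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dims)]
    hi = [max(p[d] for p in points) for d in range(dims)]

    sample = rng.sample(points, m)
    # w_i: distance from a real point to its nearest real neighbour
    w = [min(math.dist(p, q) for q in points if q is not p) for p in sample]
    # q_i: distance from a uniform point to its nearest real point
    uniform = [[rng.uniform(lo[d], hi[d]) for d in range(dims)]
               for _ in range(m)]
    q = [min(math.dist(u, p) for p in points) for u in uniform]
    return sum(q) / (sum(q) + sum(w))
```

For two tight, well-separated clusters, the uniform points land far from the data while real points sit close to each other, driving the statistic toward 1.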

1: Initialize the weights of the neurons in the map.
2: Randomly select a training vector from the corpus.

Using datasets of 200 and 10,000 tweets, we show that our algorithm increases clustering tendency by means of the Apriori algorithm; the resulting representation can then be clustered using self-organizing maps.

3: Determine the Euclidean distance between the input vector and each neuron's weight vector.

3. DATA AND PREPROCESSING

3.1 Data

4: Update the weights according to the learning rate, which controls the amount of learning in each iteration.

SOM is a multi-pass algorithm: the first pass takes little training time, while the second pass fine-tunes the SOM and takes longer. Like soft k-means, SOM is an iterative process in which a representative vector is associated with each cluster. But unlike soft k-means, SOM also lays the clusters out in a low-dimensional space, where the representative vectors are graded according to similarity (Chakrabarti, 2003). This visualization of gradients helps uniquely distinguish one cluster from another (Cheong & Lee, 2010). Advantages of SOM include non-linear projection of the input space and preservation of cluster topology (Sharma & Dey, 2013). In Emergent Self-Organizing Maps (ESOM) the cluster boundaries are 'indistinct', the degree of separation between 'regions' of the map (i.e. clusters) being depicted by 'gradients' (Ultsch et al., 2005). Cheong and Lee (2010) used SOM for timely detection of hidden patterns in Twitter messages with their context-aware content analysis framework. Liu et al. (2011) proposed 2-stage semantic role labeling after clustering of tweets to increase labeling accuracy, leading to a 3.1 per cent increase in F1. Working with news aggregation services such as Google News, Rosa et al. (2011) designed automatic clustering and classification algorithms that account for temporal effects and training-set size, finally finding the most representative tweets of each cluster.
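The four SOM training steps above can be sketched as follows (an illustrative simplification in Python, not the paper's implementation: for brevity it omits the neighbourhood function, so only the winning neuron is updated, which reduces the SOM to winner-takes-all competitive learning):

```python
import math
import random

def train_som(data, rows=2, cols=2, epochs=200, lr=0.5, seed=0):
    """Minimal SOM training loop following the four steps in the text."""
    rng = random.Random(seed)
    dim = len(data[0])
    # 1: initialise the neuron weights
    grid = [[rng.random() for _ in range(dim)] for _ in range(rows * cols)]
    for t in range(epochs):
        rate = lr * (1 - t / epochs)  # decaying learning rate
        # 2: randomly select a training vector
        x = rng.choice(data)
        # 3: find the winner by Euclidean distance
        win = min(range(len(grid)), key=lambda i: math.dist(x, grid[i]))
        # 4: update the winner's weights toward the input
        grid[win] = [w + rate * (xi - w) for w, xi in zip(grid[win], x)]
    return grid
```

A full SOM would also pull the winner's grid neighbours toward the input with a distance-decaying factor, which is what produces the topology-preserving low-dimensional layout described above.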

We collected data from Twitter and formed two datasets: one of nearly 200 tweets and a second of approximately 10,000 tweets. The keyword used to crawl the smaller dataset was "Amazon Diwali Dhamaka"; for the larger dataset we crawled using the keyword "Amazon", since a large corpus was required to validate our results.

Association rule mining on transactions shows, for items X and Y in an itemset, whether the presence of X leads to Y. The Apriori algorithm is used to find itemsets that frequently occur together in transactions, given a minimum support and confidence. It is a data mining algorithm for generating association rules based on the principle that if an itemset is not frequent, none of its supersets can be frequent. For a rule X => Y over a transaction set D, the confidence c is the fraction of transactions in D containing X that also contain Y, and the support is the fraction of transactions in D containing X ∪ Y. It is a three-step algorithm:

Figure 1: Preprocessing steps

1: From frequent itemsets of size k, generate candidate itemsets C(k+1).

3.2 Preprocessing

2: Calculate the Support of each candidate itemset.

We collected tweets from Sep 21 to Feb 5 using the keywords "Amazon Diwali Dhamaka sale". Over 200 tweets were collected and cleaned using the Twitter API. We used R for further analysis of the data. R is open-source data analysis software that, with the help of the tm package, can transform text data into numerical form ready for processing. As raw data is not sufficient for running data mining techniques, we followed the steps below to convert the plain text into a document-term matrix in which each cell records how often a term appears in a document.

3: Add itemsets that meet the minimum support to the set of frequent itemsets.

The algorithm has two distinct functions, join and prune: in the join step, candidates of size k+1 are formed from the frequent itemsets of size k; in the prune step, candidates containing any non-frequent subset are removed, because a subset of a non-frequent itemset cannot be frequent (Wu et al., 2008). Word association in tweets is more important than in standard text such as news articles and other documents because of the short length of 140 characters.
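The join and prune steps for 2-itemsets can be sketched as follows (illustrative Python; the paper's experiments use R, and the function name is an assumption):

```python
from itertools import combinations

def frequent_pairs(transactions, min_support=0.05):
    """One Apriori pass: keep frequent 1-itemsets, join them into
    candidate 2-itemsets, and prune by support (a superset of an
    infrequent itemset can never be frequent)."""
    n = len(transactions)
    # frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in set(t):
            counts[item] = counts.get(item, 0) + 1
    frequent1 = {i for i, c in counts.items() if c / n >= min_support}
    # join step: candidate pairs built only from frequent single items
    candidates = combinations(sorted(frequent1), 2)
    # prune step: keep candidates that meet the minimum support
    pairs = {}
    for a, b in candidates:
        support = sum(1 for t in transactions if a in t and b in t) / n
        if support >= min_support:
            pairs[(a, b)] = support
    return pairs
```

Because candidates are generated only from frequent single items, pairs involving rare words are never even counted, which is the efficiency gain Apriori is known for.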


The increase in clustering tendency is confirmed by the calculated Hopkins index. Finally, we cluster the tweets using self-organizing maps.

Since tweets are limited to 140 characters, word repetition within a tweet is rare. Before converting the plain tweets to a document-term matrix, a few preprocessing steps were required (Vakeel & Dey, 2014). Firstly, we converted all tweets to lower case; secondly, all numbers were removed, since they carry little meaning in our context. Thirdly, all punctuation was removed, along with extra white space. Lastly, we removed frequent English stop-words, as their incremental contribution to the corpus is negligible.
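These preprocessing steps were carried out in R with the tm package; an equivalent sketch in Python (illustrative only, with a deliberately tiny stop-word list) is:

```python
import re

# small sample list; tm's English stop-word list is much longer
STOPWORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "on", "for"}

def preprocess(tweet):
    """Mirror the steps in the text: lower case, drop numbers,
    drop punctuation, collapse white space, remove stop-words."""
    text = tweet.lower()                      # 1: lower case
    text = re.sub(r"\d+", " ", text)          # 2: remove numbers
    text = re.sub(r"[^\w\s]", " ", text)      # 3: remove punctuation
    tokens = text.split()                     # also collapses white space
    return [t for t in tokens if t not in STOPWORDS]  # 4: stop-words
```

For example, `preprocess("Amazon Diwali Dhamaka 2014 is LIVE!")` keeps only the content words of the tweet.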

5. ANALYSIS AND RESULTS

5.1 Small Dataset

To analyze the topics people were talking about, we used a word cloud to depict the importance of each word based on its frequency across all tweets. Figure 2 shows the word cloud for the tweets after conversion to a document-term matrix. The most frequent words, such as 'amazon', 'diwali' and 'dhamaka', are rendered larger; attributes that occur less frequently, such as 'effect' and 'safe', are smaller. All words occurring more than 5 times in the tweets were included in this word cloud. This helps in knowing what people are talking about on Twitter and subsequently analyzing those topics to generate positive emotions about the company. In the case of the Amazon Diwali Dhamaka sale, most people are talking about deals and, moreover, comparing Amazon to Flipkart, another e-commerce giant in India.

After preprocessing, the plain text was converted to a document-term matrix in which each column represents an attribute and each row represents one of the retrieved and cleaned tweets. After this conversion, data mining techniques can be applied to the data. The preprocessing is shown in Figure 1. Similar preprocessing was done for the larger dataset, except that tweets not in English were also removed before preprocessing. This process yielded 10,272 valid tweets for further analysis.
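The conversion to a document-term matrix can be sketched as follows (illustrative Python; the paper performs this step with R's tm package):

```python
from collections import Counter

def document_term_matrix(docs):
    """Each row is a document (tweet), each column an attribute (term);
    each cell holds the term's frequency in that document."""
    vocab = sorted({t for doc in docs for t in doc})
    index = {t: j for j, t in enumerate(vocab)}
    matrix = []
    for doc in docs:
        row = [0] * len(vocab)
        for term, count in Counter(doc).items():
            row[index[term]] = count
        matrix.append(row)
    return vocab, matrix
```

Each input document is a token list (for example, the output of the preprocessing step), and the returned vocabulary gives the column labels of the matrix.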

Similarly, less frequent words like 'flops', 'fail', 'crash' and 'disappointed' show negative emotions of customers towards the Amazon Diwali sale, whereas attributes like 'love' show positive experiences. There are other neutral words, like 'morning', 'discount' and 'products', which do not communicate any specific emotion but are useful in discovering current topics of interest in the Twitter community.

4. ALGORITHM

In this work, instead of using only single-word terms (unigrams), we use n-grams along with the unigrams. To form the bigrams we utilize associations between words. These associations can be discovered, in a way similar to generating association rules, using the Apriori algorithm. The Apriori algorithm generates frequent itemsets; when transformed into additional n-gram attributes, these itemsets would be expected to increase clustering tendency and therefore lead to the formation of better clusters. We propose the following process for clustering tweets, incorporating the word-association-based n-gram generation:

1: Generate corpus C from tweets
2: Convert corpus C to document-term matrix dtm
3: Run Apriori on dtm
4: Generate n-gram association rules
5: Generate new document-term matrix n-dtm based on the Apriori rules
6: Calculate the Hopkins index
7: Run a self-organizing map for clustering

Tweets are public opinions about news, political statements or events. They can be in multiple languages, hence preprocessing is necessary. Once preprocessing is done, the corpus can be formed and converted into a document-term matrix: a matrix in which each column represents an attribute and each row a document, so that each cell gives the number of times an attribute occurs in a document, in this case a tweet. Once the document-term matrix is ready, we can use it to derive frequent itemsets, each formed from attributes of the matrix. Deriving these association rules should also reduce the dimensionality of the matrix, as several attributes are clubbed into one frequent itemset, analogous to a transaction in Apriori. We then regenerate the document-term matrix so that each new column is a frequent itemset. By doing this we increase the clustering tendency.
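Step 5 of the proposed process, regenerating the document-term matrix from the discovered word associations, can be sketched as follows (illustrative Python; turning each frequent 2-itemset into an additional binary column is one plausible reading of the association-based matrix, not necessarily the paper's exact implementation):

```python
def add_itemset_columns(vocab, matrix, itemsets):
    """Append one column per frequent 2-itemset: the cell is 1 when the
    document contains both words of the pair, else 0."""
    new_vocab = vocab + ["_".join(pair) for pair in itemsets]
    col = {t: j for j, t in enumerate(vocab)}
    new_matrix = []
    for row in matrix:
        extra = [1 if row[col[a]] > 0 and row[col[b]] > 0 else 0
                 for a, b in itemsets]
        new_matrix.append(row + extra)
    return new_vocab, new_matrix
```

The augmented matrix can then be fed to the Hopkins index computation before and after augmentation to check whether clustering tendency has increased.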

Figure 2: Word cloud for 200 tweets

To see what kinds of tweets were being posted, we performed cluster analysis on them. Since we did not have a clear idea of how the tweets were distributed or of the degree of emotion attached, we used hierarchical agglomerative clustering, a bottom-up approach in which the algorithm starts with each document (here, each tweet) as an individual bucket and, on the basis of similarity, merges buckets in each iteration. Hence every document starts as its own cluster, and clusters are repeatedly merged to form groups of similar documents. Once the groups are formed, they are visually represented as a dendrogram (Figure 3).


Figure 3: Dendrogram for 200 tweets

As tweets can be fuzzy in nature, soft assignment of a tweet to a cluster is appropriate.

We can cut the dendrogram at the height at which we find the desired number of clusters. The algorithm has quadratic complexity. In clustering the tweets we used Euclidean distance to measure similarity between tweets, with Ward's method as the linkage criterion. Euclidean distance is the straight-line distance between two points in Euclidean space (Equation 1). Ward's method defines the distance between two clusters, A and B, as the increase in the within-cluster sum of squares that results from merging them (Equations 2 and 3).

EuclideanDistance({a,b,c},{x,y,z}) = sqrt((a−x)² + (b−y)² + (c−z)²)    (1)

Δ(A,B) = Σ_{i∈A∪B} ‖x_i − m_{A∪B}‖² − Σ_{i∈A} ‖x_i − m_A‖² − Σ_{i∈B} ‖x_i − m_B‖²    (2)

Δ(A,B) = (n_A · n_B / (n_A + n_B)) ‖m_A − m_B‖²    (3)

Here m_C denotes the centroid of cluster C and n_C its size; Equation 3 is the closed form of the merging cost in Equation 2.

Figure 3 depicts two major clusters; therefore, in the tweets we collected on the Amazon Diwali sale, people are mostly talking about two kinds of experiences. The two clusters contain 89 and 111 tweets respectively. After converting the tweets into a document-term matrix and selecting attributes occurring in more than 5 tweets, we ran SOM; Figure 4 shows the two clusters, represented by gradients. This matches the hard clusters shown in the dendrogram of Figure 3. The yellow dots show the tweets of one cluster and the blue dots those of the other. Next, we ran our proposed algorithm on the dataset of 195 tweets remaining after preprocessing and removal of tweets unfit for analysis. Attributes occurring in more than 5 tweets were selected for the document-term matrix, yielding a total of 44 attributes. Running the Apriori algorithm with support 0.05 and confidence 0.01, we generated 43 rules with an itemset size of 2. The Hopkins index for clustering tendency, which was 0.8 for the purely frequency-based document-term matrix, increased to 0.998738 after the 43 Apriori rules were incorporated. This shows that generating 2-gram rules and converting the frequency-based document-term matrix to an association-based one increases clustering tendency. Finally, clustering was done using self-organizing maps. This was beneficial because, first, SOM shows the relative nearness of the clusters to each other and, second, it performs soft clustering, which does not forcefully allot a data point to a single cluster.
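The Ward merging cost described above (the increase in the sum of squares when two clusters are merged) can be computed directly; a small Python sketch (illustrative, not from the paper):

```python
import math

def centroid(cluster):
    """Mean point of a cluster of equal-length coordinate tuples."""
    dims = len(cluster[0])
    return [sum(p[d] for p in cluster) / len(cluster) for d in range(dims)]

def ward_cost(a, b):
    """Increase in within-cluster sum of squares caused by merging A, B:
    |A||B| / (|A|+|B|) * squared distance between the centroids."""
    na, nb = len(a), len(b)
    return (na * nb) / (na + nb) * math.dist(centroid(a), centroid(b)) ** 2
```

Hierarchical agglomerative clustering with Ward linkage repeatedly merges the pair of clusters with the smallest such cost.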

Figure 4: SOM for 200 tweets, frequency =5

Figure 5: SOM for 200 tweets, 2-gram association rules


Figure 6: SOM for 10,000 tweets, frequency=25

Figure 7: SOM for 10,000 tweets, 2-gram association rules

5.2 Large Dataset

As the data points in the small dataset were few, the clusters were not well established, so we collected a larger corpus of 17,000 tweets in total, crawled on the keyword "Amazon". After removing all non-English tweets, we were left with 10,272 usable tweets. The same process as in the proposed algorithm was followed. The minimum frequency for a term to be included in the document-term matrix was set to 25; because this matrix is sparse, we could not maintain the same ratio as in the small dataset. From 13,941 total attributes, 307 remained after applying the frequency threshold of 25, and we obtained 762 2-gram association rules with support 0.1 and confidence 0.01. The Hopkins index before the association rules, with the frequency threshold of 25, was 0.9999978; with the 2-gram association rules it increased to 0.9999993. Finally, we ran ESOM to cluster the data points.

6. CONCLUSION

Tweets contain a lot of information in the form of text. We used the keywords "Amazon Diwali sale" and "Amazon" to form two corpora of 200 and 10,000 tweets respectively. We then discovered word associations in the tweets and generated bigrams from them. With these bigrams in the document-term matrix, we used self-organizing maps to form clusters of tweets. The bigrams led to the formation of more distinct clusters, dividing the tweets into those with positive and negative sentiments. With the help of the Hopkins index, we demonstrated a measurable increase in clustering tendency when bigrams generated from word associations are used. The contributions of this study are several. Firstly, we demonstrated, using the Hopkins index, that clustering on word associations formed by the Apriori algorithm improves clustering tendency, which is novel and has not been done in the past.


[8] Williams, G. 2014. Hands-On Data Science with R.

Hopkins, B., & Skellam, J. G. 1954. A new method for determining the type of distribution of plant individuals. Annals of Botany, 18(2), 213-227.

Secondly, though the Apriori algorithm for association mining is pervasive in the literature, it has not previously been applied to Twitter data in this way. Word association is especially important when clustering tweets because tweets are short, making it essential to analyze words in pairs or groups. Thirdly, we measure the improvement in clustering tendency using the Hopkins index.

[9] Kohonen, T. 1995. Self-Organizing Maps. Springer-Verlag, Berlin.

7. REFERENCES

[1] Agrawal, R., & Srikant, R. 1994. Fast algorithms for mining association rules. In Proc. 20th Int. Conf. on Very Large Data Bases (VLDB) (Vol. 1215, pp. 487-499).

[10] Liu, X., Li, K., Zhou, M., & Xiong, Z. 2011, July. Collective semantic role labeling for tweets with clustering. In IJCAI (Vol. 11, pp. 1832-1837).

[2] Chakrabarti, S. 2003. Mining the Web: Discovering knowledge from hypertext data. Morgan Kaufmann.

[11] Rosa, K. D., Shah, R., Lin, B., Gershman, A., & Frederking, R. 2011. Topical clustering of tweets. Proceedings of the ACM SIGIR: SWSM.

[3] Cheong, M., & Lee, V. 2010, August. A study on detecting patterns in twitter intra-topic user and message clustering. In Pattern Recognition (ICPR), 2010 20th International Conference on (pp. 3125-3128). IEEE.

[12] Sharma, A., & Dey, S. 2013. Using Self-Organizing Maps for Sentiment Analysis. arXiv preprint arXiv:1309.3946.

[13] Twitter, www.twitter.com

[4] Cluster analysis, http://www.statmethods.net/advstats/cluster.html

[5] Distances between Clustering, Hierarchical Clustering. http://www.stat.cmu.edu/~cshalizi/350/lectures/08/lecture08.pdf

[6] Evolution of e-commerce in India: Creating the bricks behind the clicks. www.pwc.in, accessed Feb 3, 2015.

[14] Wu, X., Kumar, V., Quinlan, J. R., Ghosh, J., Yang, Q., Motoda, H., ... & Steinberg, D. 2008. Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1), 1-37.

[15] Ultsch, A., & Mörchen, F. 2005. ESOM-Maps: tools for clustering, visualization, and classification with Emergent SOM.

[7] Han, J., Kamber, M., & Pei, J. (2006). Data mining, southeast asia edition: Concepts and techniques. Morgan kaufmann.

[16] Vakeel, K., & Dey, S. 2014, October. Impact of News Articles on Stock Prices: An Analysis using Machine Learning. In Proceedings of the 6th IBM Collaborative Academia Research Exchange Conference (I-CARE) on ICARE 2014 (pp. 1-4). ACM.
