Extracting Turkish Tweet Topics Using LDA
Fahriye Gemci and Kadir A. Peker
Melikşah University, Kayseri, Turkey
[email protected], [email protected]

Abstract

Social media is a very popular medium of communication for sharing people's activities, opinions and feelings with others. Twitter has become one of the most popular of these social media services. Finding relevant information in the large space of Twitter is a challenge, and various algorithms have been developed to find related tweets. Extracting tweet topics is one of the techniques that can be used for this purpose. Recently, LDA (Latent Dirichlet Allocation) has been used successfully to analyze the topics of English tweets. However, LDA has not yet been tried on Turkish tweets. Turkish is an agglutinative language, which makes the application of LDA a new challenge compared to LDA on English tweets. Our application uses a series of preprocessing steps, such as stemming, stop word elimination, and cleaning of punctuation and spelling errors, to make the tweet texts suitable for LDA analysis. 66,000 tweets and 4,000 control tweets are crawled with the Twitter4j library [2]. The Zemberek library is used for stemming [3]. The language used in tweets, with its own rules, widespread spelling errors and made-up words, makes text analysis very difficult. Some of these problems are solved by adding new words into the Zemberek library, and some of the problematic words are removed completely. After preprocessing, LDA is performed and 40 topics are extracted. The initial results look promising: significant topics such as football teams, celebrity names and hot news topics are detected. Using these results, automatic recommendation of relevant tweets will be performed.
1. Introduction

Social media has become an important part of daily life; people use it to communicate with each other every day. Twitter is a good instance of social media and is very useful for analyzing patterns of topics and trends [1]. With its large and accessible data set, Twitter is a good source for researchers [12]. Jack Dorsey created Twitter in March 2006, and it was launched in July of that year. Today Twitter has over 500 million users and 340 million tweets per day. Finding relevant data in this large space is a challenge, and various approaches and algorithms have been proposed for finding related tweets. One approach is the automatic detection of tweet topics. For instance, if two users talk about similar topics, then these users can be recommended to each other. Various methods for extracting topics from text documents have been proposed in the literature. Some of these are latent semantic indexing (LSI), probabilistic latent semantic indexing (PLSI) and latent Dirichlet allocation (LDA). LDA differs from PLSI in that it computes the maximum posterior with a uniform Dirichlet prior. Before applying automatic topic detection methods, a
preprocessing step is necessary. This step is critical for obtaining meaningful topics with automatic methods, but it is a challenging problem. Tweets are usually written with a large number of spelling errors, and some words are falsely combined together, so these words are meaningless to our algorithm. Turkish special letters (e.g. ü, ö, ç) are usually replaced with English ones. The Zemberek library is used to find the roots of words, but the roots of some words are unobtainable in Zemberek. So a text file containing the roots of some important words is created manually and added into Zemberek. Stop words are also deleted from the dictionary. After preprocessing, tweet topics are extracted using LDA [6,7]. We test our method on control tweets collected on specific sets of topics. An inspection of the results indicates that the automatically extracted topics are meaningful and useful. In the following section, related work is reviewed. In the third section, our methodology and its components are introduced. In the fourth section, our system design is explained in detail. In the fifth section, our results and future work are discussed.
2. Related Work

In recent years, researchers' interest in social media has increased, because they have realized that social media holds large amounts of data about the people who use it. This development brings a new problem: finding the relevant data in the wide social media space, and new research topics have been born to address it. Recent studies use content similarity, such as [4], where both LDA and an approach based on standard term frequency vectors are tried. Chen et al. [8] build a number of recommender approaches, one of which is topic based. A system similar to ours characterizes users and tweets using LDA [5]. TwitterRank is also used to analyze Twitter [10]. The main point that distinguishes our work from these studies is that we build such a system for Turkish.
3. Methodology and Applications

3.1. Twitter

Twitter is a popular social media service, and its popularity continues to increase (Figure 1). People share their opinions, aims, interests, or in short, their lives with each other on Twitter. Twitter has become a large data source for researchers. The Twitter service has a few basic components and terms. The most important ones are tweet, hashtag, retweet, follower, following, URL, trending topic, and the Twitter API. A "tweet" is a text message restricted to 140 characters, written by Twitter users and shared
through their Twitter accounts. A "hashtag" is a word or phrase marking a topic, determined by users and denoted by the # (pound) character. A "retweet" is the forwarding of another user's tweet. A "follower" is a Twitter user who follows another Twitter user, while "following" is (a list of) the other Twitter users that a given user has chosen to follow. A "trending topic" is a topic that is popular (i.e. has many tweets) at a given time and is rated higher than other topics. The Twitter API is an interface that Twitter provides to users and developers so that they can access most Twitter functions and data from their own applications.
Fig. 1. Twitter logo.

3.2. Topic analysis with LDA

LDA is a generative probabilistic model used in natural language processing. It was first introduced by Blei, Ng and Jordan as a graphical model for automatic topic discovery [14]. It is one of the techniques used for extracting topics from text documents. A topic is a latent concept with a specific probability distribution over words. It is assumed that when a document is on a certain topic, its words are generated from the distribution specified by that topic; furthermore, a word can appear in the output of more than one topic. Each document consists of a mixture of topics, and documents are generally represented with the bag-of-words model. Approximate inference techniques such as expectation propagation are used to find the topics according to the described model, based on the observed words [12]. LDA is similar to LSI in its probabilistic structure; however, the Dirichlet prior is proposed as a better probabilistic model for the topic distributions in text documents [9]. Figure 2 illustrates LDA as a probabilistic graphical model.

Fig. 2. Graphical model of LDA.

In the graphical model of LDA, the number of topics is denoted by K; during topic extraction, the K value is set by the user. M is the number of documents and N is the number of words in a document. The number of all the words in the dictionary is V. α is the parameter of the Dirichlet prior on the per-document topic distributions, and β is the parameter of the Dirichlet prior on the per-topic word distributions. θi is the topic distribution for document i, φk is the word distribution for topic k, zij is the topic of the jth word in document i, and wij is the word variable. The words wij are observable, and the other variables are latent. The three hierarchical levels of LDA as a Bayesian model can be seen in Figure 2.
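With this notation, the generative process that LDA assumes can be written compactly as follows (this is the standard formulation of the model in [14], shown here for reference):

```latex
\begin{align*}
\theta_i &\sim \mathrm{Dirichlet}(\alpha), & i &= 1,\dots,M \\
\varphi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1,\dots,K \\
z_{ij} &\sim \mathrm{Multinomial}(\theta_i), & j &= 1,\dots,N \\
w_{ij} &\sim \mathrm{Multinomial}(\varphi_{z_{ij}})
\end{align*}
```

That is, each document i first draws a topic mixture θi; each word position then draws a topic zij from that mixture and a word wij from the chosen topic's word distribution φ.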
3.3. Gibbs Sampling

Gibbs sampling is a Markov Chain Monte Carlo (MCMC) algorithm [13]. It is a technique for generating random variables from a desired distribution without needing to compute the density itself. Gibbs sampling is particularly well suited to sampling from the Bayesian posterior distribution.
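For concreteness, the collapsed Gibbs sampling update commonly used for LDA (a standard result in the LDA literature, not derived in this paper) samples each word's topic assignment in turn, conditioned on all other assignments. In the notation of Figure 2:

```latex
P(z_{ij} = k \mid \mathbf{z}_{-ij}, \mathbf{w}) \;\propto\;
\frac{n^{(w_{ij})}_{k,-ij} + \beta}{n^{(\cdot)}_{k,-ij} + V\beta}
\left( n^{(k)}_{i,-ij} + \alpha \right)
```

Here n^(w)_{k,-ij} is the number of times word w is assigned to topic k, n^(·)_{k,-ij} is the total number of words assigned to topic k, and n^(k)_{i,-ij} is the number of words in document i assigned to topic k, all counted excluding the current position ij.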
4. System Design

Figure 3 shows how our system works. A number of preprocessing steps are required in the framework. The first step is separating the tweets into files; the second is separating out the user name of each tweet. The following steps separate the hashtags, the retweet markers (RT) and the URLs that appear in tweets, after which user names, hashtags and RT markers are deleted from the tweet body text. Next, words that are not in Zemberek are found and modified, punctuation and numbers are deleted from the tweet texts, and stop words are removed. At this point some of the tweets become null and are eliminated from the data set. Finally, the tweets are converted into numbers by encoding each word by its index in the dictionary. Thus, at the end, the tweet documents are converted to lists of numbers that can be analyzed by the LDA algorithm.
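As an illustration of these cleanup and encoding steps, a minimal sketch in Java is given below; the regular expressions and the tiny stop-word list are our own simplifications for illustration, not the exact rules of the implemented system:

```java
import java.util.*;

public class TweetPreprocessor {
    // Illustrative stop-word list; the real system uses a fuller Turkish list.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("ve", "bir", "bu", "da", "de", "ile"));

    private final Map<String, Integer> dictionary = new HashMap<>();

    // Strips retweet markers, user names, hashtags, URLs, punctuation and numbers.
    public String clean(String tweet) {
        return tweet
                .replaceAll("(?i)\\brt\\b", " ")     // retweet marker
                .replaceAll("@\\S+", " ")            // user names
                .replaceAll("#\\S+", " ")            // hashtags
                .replaceAll("https?://\\S+", " ")    // URLs
                .replaceAll("[\\p{Punct}\\d]+", " ") // punctuation and numbers
                .toLowerCase(new Locale("tr"))       // Turkish-aware lowercasing
                .trim();
    }

    // Encodes a cleaned tweet as dictionary indices; an empty result marks
    // a "null" tweet that the caller should drop from the data set.
    public List<Integer> encode(String cleanedTweet) {
        List<Integer> ids = new ArrayList<>();
        for (String token : cleanedTweet.split("\\s+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) continue;
            ids.add(dictionary.computeIfAbsent(token, w -> dictionary.size()));
        }
        return ids;
    }
}
```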
Fig. 3. Our system framework: tweets collected from Twitter are preprocessed (null tweets are discarded), and the remaining tweets are passed to LDA, where Gibbs sampling of the posterior distribution yields the topics.
5. Experimental Results

5.1. Data Collection

For this work we needed test data, so we started looking for a suitable source. The literature, as well as daily experience, shows that social media is a very large information source. Since its launch in 2006, Twitter has grown to over 500 million users and 340 million tweets per day as of 2012, making it a great knowledge source for researchers. We therefore chose Twitter among the social media services. Twitter provides the Twitter API to reach tweets, and we use it to obtain 66,000 tweets and 4,000 control tweets.
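As an illustration, tweets can be collected with the Twitter4j library [2] roughly as follows. This is a minimal sketch: OAuth credential setup (via a twitter4j.properties file) is omitted, and the query keyword is a placeholder, not one of our actual search terms:

```java
import twitter4j.Query;
import twitter4j.QueryResult;
import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterException;
import twitter4j.TwitterFactory;

public class TweetCrawler {
    public static void main(String[] args) throws TwitterException {
        // Reads OAuth credentials from twitter4j.properties on the classpath.
        Twitter twitter = TwitterFactory.getSingleton();

        Query query = new Query("fenerbahce"); // placeholder keyword
        query.setLang("tr");                   // restrict to Turkish tweets
        query.setCount(100);                   // tweets per page

        QueryResult result = twitter.search(query);
        for (Status status : result.getTweets()) {
            System.out.println(status.getUser().getScreenName()
                    + "\t" + status.getText());
        }
    }
}
```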
5.2. Preprocessing

In this paper, preprocessing covers all of the processing applied to the tweet texts, and it is performed in many steps. The first step is separating the tweets into files; the second is separating out the user name of each tweet. The following steps separate the hashtags, retweet markers and URLs, and then delete the user names, hashtags and retweet markers from the tweet texts. Next, words that are not in Zemberek [3] are found and modified, punctuation and numbers are deleted, and stop words are removed. Stemming is then applied to reduce the number of Turkish words in our dictionary: the root of each word is found, so the number of distinct words is reduced. For example, the stem of the words 'giysi' and 'giymiş' is 'giy'. The last step is converting the tweets into numbers.
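A minimal sketch of the stemming step with Zemberek 2 [3] is shown below. The class and method names follow the Zemberek 2 API as we use it; treat the exact signatures as assumptions when adapting this sketch:

```java
import net.zemberek.erisim.Zemberek;
import net.zemberek.tr.yapi.TurkiyeTurkcesi;
import net.zemberek.yapi.Kelime;

public class Stemmer {
    private final Zemberek zemberek = new Zemberek(new TurkiyeTurkcesi());

    // Returns the root of a Turkish word, or the word itself when Zemberek
    // cannot analyze it (such words are later fixed manually or removed,
    // as described above).
    public String stem(String word) {
        Kelime[] analyses = zemberek.kelimeCozumle(word);
        if (analyses.length == 0) {
            return word; // root unobtainable in Zemberek
        }
        return analyses[0].kok().icerik(); // e.g. "giymiş" -> "giy"
    }
}
```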
5.3. Results

In this experiment, we run the generative modeling approach Latent Dirichlet Allocation (LDA) on our 66,000 tweets to extract 40 topics. We then check our success on the 4,000 control tweets, for which we run LDA with 20 topics. In this process we have to decide on the number of topics, so we try different values. If we choose 10 topics, different subjects mix with each other; for example, a "KPSS topic" and a "sports topic" are both found in the fourth topic, so 10 topics are too few for this clustering. We therefore use 40 topics for the 66,000 tweets and 20 topics for the 4,000 test tweets. We then cluster the resulting per-tweet topic distributions with k-means [5]. In this case we must decide the value of k, and we try different values; the following figures show examples.
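To make the clustering step concrete, here is a minimal sketch of k-means with cosine similarity over the per-tweet topic proportions produced by LDA; the random initialization and fixed iteration count are simplifications, not necessarily the settings used in our experiments:

```java
import java.util.Random;

public class CosineKMeans {
    // Cosine similarity between two topic-distribution vectors.
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-12);
    }

    // Clusters tweets (rows of per-tweet topic proportions) into k clusters.
    static int[] cluster(double[][] points, int k, int iterations) {
        Random rnd = new Random(42);
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) { // naive random initialization
            centroids[c] = points[rnd.nextInt(points.length)].clone();
        }
        int[] assignment = new int[points.length];
        for (int it = 0; it < iterations; it++) {
            // Assign each point to the most cosine-similar centroid.
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                double bestSim = -1;
                for (int c = 0; c < k; c++) {
                    double sim = cosine(points[p], centroids[c]);
                    if (sim > bestSim) { bestSim = sim; best = c; }
                }
                assignment[p] = best;
            }
            // Recompute each centroid as the mean of its assigned points.
            double[][] sums = new double[k][points[0].length];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                counts[assignment[p]]++;
                for (int d = 0; d < points[p].length; d++) {
                    sums[assignment[p]][d] += points[p][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // keep old centroid if empty
                for (int d = 0; d < sums[c].length; d++) {
                    sums[c][d] /= counts[c];
                }
                centroids[c] = sums[c];
            }
        }
        return assignment;
    }
}
```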
Fig. 4. Distribution of words among topics; clustering with 10 topics.
Fig. 5. Clustering with 30 topics.
Fig. 6. k-means clustering with cosine similarity, 10 topics.
Fig. 7. k-means clustering with cosine similarity, 20 topics.
Fig. 8. k-means clustering with cosine similarity, 30 topics.

Figures 4 to 8 show that, in these runs of k-means, different numbers of topics lead to different results; in other words, they show the distribution over topics obtained by k-means. Finding a suitable number of topics is therefore very important for successful clustering. As we increase the number of topics, the distribution among topics becomes more balanced, but if we increase it too far, the distribution among topics deteriorates. So the number of topics should be stopped at an ideal value.
6. Conclusions

Turkish Twitter users use a language of their own that diverges from standard Turkish. Analyzing and understanding the language of Turkish Twitter users is very hard, and it could be a separate research topic for academics in its own right. The language used in tweets, with its own rules, widespread spelling errors and made-up words, makes text analysis very difficult. Tweets are often written with false spellings; for example, characters at the end or beginning of a word are repeated, as in 'gollllllll!!' and 'hadiiiiii', and such words are meaningless to our algorithm. We use the Zemberek library to find the roots of words, but the roots of some words are unobtainable in Zemberek. So we manually create a text file containing the roots of important words, and we add these roots into the Zemberek project.
7. References

[1] Kim, Jeongin, et al. "A Method for Extracting Topics in News Twitter." International Journal of Software Engineering and Its Applications, Vol. 7, No. 2, March 2013.
[2] http://Twitter4j.org/en/index.html, 2013.
[3] https://code.google.com/p/Zemberek/, 2013.
[4] Schaal, Markus, John O'Donovan, and Barry Smyth. "An analysis of topical proximity in the Twitter social graph." Social Informatics. Springer Berlin Heidelberg, 2012. 232-245.
[5] Rosa, Kevin Dela, et al. "Topical clustering of tweets." Proceedings of the SIGIR Workshop on Social Web Search and Mining. 2011.
[6] Pochampally, Ravali, and Vasudeva Varma. "User context as a source of topic retrieval in Twitter." Workshop on Enriching Information Retrieval (with ACM SIGIR). 2011.
[7] Ramage, Daniel, Susan Dumais, and Dan Liebling. "Characterizing microblogs with topic models." International AAAI Conference on Weblogs and Social Media. 2010.
[8] Chen, Jilin, et al. "Short and tweet: experiments on recommending content from information streams." Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2010.
[9] Michelson, Matthew, and Sofus A. Macskassy. "Discovering users' topics of interest on Twitter: a first look." Proceedings of the Fourth Workshop on Analytics for Noisy Unstructured Text Data. ACM, 2010.
[10] Weng, Jianshu, Ee-Peng Lim, Jing Jiang, and Qi He. "TwitterRank: finding topic-sensitive influential Twitterers." Proceedings of the Third ACM International Conference on Web Search and Data Mining. 2010.
[11] Canini, Kevin R., Lei Shi, and Thomas L. Griffiths. "Online inference of topics with latent Dirichlet allocation." Proceedings of the International Conference on Artificial Intelligence and Statistics. 2009.
[12] Java, Akshay, et al. "Why we Twitter: understanding microblogging usage and communities." Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 Workshop on Web Mining and Social Network Analysis. ACM, 2007.
[13] Walsh, Brian. "Markov Chain Monte Carlo and Gibbs Sampling." 2004.
[14] Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet allocation." Journal of Machine Learning Research 3 (2003): 993-1022.