Feature Expansion using Word Embedding for Tweet Topic Classification

Erwin B. Setiawan, Dwi H. Widyantoro, Kridanto Surendro
School of Electrical Engineering and Informatics
Institut Teknologi Bandung, Indonesia
[email protected], [email protected], [email protected]

Abstract— One of the roles of an Online Social Network (OSN) is to serve as a source of information, especially during an emergency. Twitter is an OSN service that enables users to send and read messages of at most 140 characters. Tweets are therefore very short, not always grammatically correct, and full of word variations. The use of word variations increases the likelihood of vocabulary mismatch and makes tweets difficult to understand without some kind of context. In this paper, we use word embeddings based on word2vec to reduce the vocabulary mismatch for tweet topic classification.

Keywords—classification; topic; twitter; word2vec

I. INTRODUCTION

Due to the efficiency, volume, and timeliness of its information, the Online Social Network (OSN) has become an important source of information [1]. According to the Twitter blog, an average of 340 million tweets were generated each day in March 2012.¹ Besides receiving information from the people they follow, users also search for relevant topics (for example, Twitter's search portal serves more than 1.6 billion queries each day). Learning news from Twitter is a strong motivation for reading microblogs [2], particularly for staying up to date during a local emergency [3]. Twitter reached 517 million users in June 2012 (Semiocast 2012) and Facebook reached 955 million in 2012 (US SEC 2012). The number of Twitter users in Indonesia is about 19.5 million, placing it fifth after the United States (about 107 million users), Brazil (33 million), Japan (29 million), and the UK (24 million). According to data from A World of Tweets, which has recorded the total number of tweets worldwide since November 2010,² Indonesia has a high tweet-writing rate. Internationally, Indonesia is in third position with 13.39% of tweets written, behind the United States in first place with 27% and Brazil in second with 24%, while England and the Netherlands hold the fourth and fifth positions with 6% and 4% respectively. Regionally in Asia, Indonesia has the highest tweet activity with 53.97%,

¹ http://blog.twitter.com/2012/03/twitter-turns-six.html

² http://aworldoftweets.frogdesign.com/

followed by Japan with 14.5%, Malaysia 8.96%, South Korea 4.36%, and Turkey 4.08%.

One function of an OSN is as a medium for sharing and finding information online [4]. Each user can act as a source or as a transmitter of information, spreading it either in its entirety or in a modified version. OSNs have become very important during emergencies such as accidents, natural disasters, and terrorism, since they deliver reports faster than conventional media [5].

Only a few researchers have worked on tweet topic classification in the Bahasa Indonesia domain, on topics such as sentiment analysis [6], trending topics [7], and gender, age, and occupation prediction [8].

Tweet messages are limited to 140 characters, so they are very short, not always grammatically correct, and full of word variations; the aim is to exchange as much information as possible using as few characters as possible [9]. This limit leads users to introduce large amounts of "noise" such as emoticons, abbreviations [10], shortened terms, misspelled words [9], and internet slang [11] in order to compress more information. Tweets are thus often shortened and hard to understand. The use of word variations increases the likelihood of vocabulary mismatch and makes tweets difficult to understand without some kind of context [11].

This paper focuses on how to reduce the vocabulary mismatch with word embeddings. We expand features using word2vec, which associates words with points in space; the spatial distance between words then describes the similarity relation between them.

The rest of the paper is organized as follows. Section II discusses related work on topic classification. Section III describes our approach to classifying tweet messages. Section IV presents the experimental results, followed by the conclusion in the last section.

II. RELATED WORK

Castillo et al. (2011) demonstrated that automated classification techniques can be used to detect news topics among topics of conversation and assessed their credibility based on various Twitter features [12]. Canini et al. (2011) used a strategy for automatically ranking the credibility of information sources on Twitter for any given topic.


Fig. 1. Tweet Classification Processes

It has been found that analyzing topic-based content from social networks is extremely useful for identifying relevant topics and credible sources of information to follow [13]. Gupta and Kumaraguru (2012) analyzed fourteen high-impact events in 2011 and found that, on average, 30% of the tweets posted about an event provide situational information about it, while 14% are spam; only 17% of the posted tweets are both credible and provide situational awareness [14].

A number of researchers have worked on feature expansion. Speriosu et al. (2011) used feature expansion based on tweet-tweet relation information: if a tweet x is related to another tweet y, the id of tweet y is added as a feature to x's feature vector [15]. Moran et al. (2016) demonstrated that first story detection can be improved using word embeddings; they expand tweets in three steps: learning word embeddings with word2vec, word filtering, and word similarity computation [16]. Seok et al. (2016) applied word embeddings built with GloVe, word2vec, and CCA to named entity recognition (NER) training [17].

This paper uses word2vec for classifying the topics of Indonesian tweets. To our knowledge, word2vec has not been explored for this task in Bahasa Indonesia.

III. TWEET CLASSIFICATION

The tweet classification process developed in this study consists of several steps: tweet pre-processing, feature extraction and weighting, feature expansion, and tweet classification. Fig. 1 shows a summary of these processes. Each step is described in the following sections.

A. Tweet Pre-Processing
Assuming the input is the text of the original tweet content, pre-processing includes case folding, tokenization, stopword removal, and stemming.

Case folding converts the words or phrases in the tweet text to lowercase (a to z). It helps overcome problems caused by words written with different capitalization.

Tokenization chops the input tweet into the words that constitute it. In principle, it separates each word in the tweet text. This process includes the removal of numbers, punctuation marks, and characters other than letters of the alphabet; these characters are treated as word separators (delimiters) and deleted to prevent noise in further processing.

Stopword removal eliminates non-topical words that are not considered important, such as "dan", "ini", "itu", "adalah", "atau", "yang", and "via". This step helps reduce irrelevant features in the data.

Finally, stemming finds the root of a word by eliminating prefixes, infixes, suffixes, and confixes (combinations of a prefix and a suffix) from derivative words. With stemming, word variations having the same root are treated as the same token (feature). In Information Retrieval, stemming helps improve retrieval performance.
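To make these four steps concrete, the following is a minimal Python sketch of the pipeline. The stopword list and the suffix-stripping stemmer are simplified placeholders of our own; a full Indonesian stemmer (e.g., Sastrawi) would be used in practice.

import re

# Simplified placeholders: a production pipeline would use a complete
# Indonesian stopword list and a proper stemmer (e.g., Sastrawi).
STOPWORDS = {"dan", "ini", "itu", "adalah", "atau", "yang", "via"}
SUFFIXES = ("kan", "an", "i")  # illustrative subset of Indonesian suffixes

def preprocess(tweet):
    text = tweet.lower()                  # case folding
    tokens = re.findall(r"[a-z]+", text)  # tokenization: keep letters only
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    stemmed = []
    for t in tokens:                      # naive suffix stripping (stemming)
        for suffix in SUFFIXES:
            if t.endswith(suffix) and len(t) - len(suffix) >= 4:
                t = t[:-len(suffix)]
                break
        stemmed.append(t)
    return stemmed

print(preprocess("Persatuan Bangsa lebih utama!"))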

B. Tweet Representation
In this paper, tweet text is represented as a Boolean feature vector of fixed length. Each Boolean feature encodes the presence or absence of a word in a particular tweet. This paper assumes that each tweet topic is represented by the m most important words that describe the topic. Therefore, for n topics, the maximum length of the feature vector is m times n (it can be less than mn because more than one topic may share topical words). As an illustration, suppose a feature vector of length five encodes the presence of the words "eat", "national", "I", "religion", and "burger", in that order. A tweet containing "I eat burger" is then represented as {1, 0, 1, 0, 1}.
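A minimal sketch of this representation, using the illustrative five-word vocabulary from the text:

# Boolean tweet representation over a fixed vocabulary.
vocabulary = ["eat", "national", "i", "religion", "burger"]

def to_boolean_vector(tweet_tokens, vocab):
    # 1 if the vocabulary word is present in the tweet, 0 otherwise.
    present = set(tweet_tokens)
    return [1 if word in present else 0 for word in vocab]

print(to_boolean_vector(["i", "eat", "burger"], vocabulary))  # [1, 0, 1, 0, 1]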

Text data such as tweets have many features (e.g., words), and many of them are non-topical (i.e., irrelevant). Feature selection is performed to keep only the features deemed relevant. This paper employs Term Frequency-Inverse Document Frequency (TF-IDF) to weight features. This weighting method is widely used, particularly in the Information Retrieval community; it is also efficient, easy to compute, and accurate [18]. The weight of word w in a tweet T is calculated as follows:

W_wT = tf_wT * log(N / n_w)    (1)

where tf_wT is the frequency of word w in tweet T, N is the number of tweets observed, and n_w is the number of tweets in which the word w appears. Under this scheme, the weight of a word in a tweet is higher if it occurs more often in that tweet (identifying topical words) while appearing in fewer tweets overall (indicating that the word discriminates better with respect to other topics). From all words appearing in tweets of the same topic, we then select the top k words as features based on their TF-IDF values. A tweet representation is based on the k words from all topics.
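Eq. (1) and the per-topic top-k selection can be sketched directly as follows. Tweets are assumed to be pre-processed into token lists; the function names are ours, and since the paper does not specify how scores are aggregated over a topic's tweets, the sketch takes the maximum.

import math

def tf_idf(word, tweet_tokens, all_tweets):
    # Eq. (1): W_wT = tf_wT * log(N / n_w)
    tf = tweet_tokens.count(word)                   # occurrences of w in tweet T
    N = len(all_tweets)                             # number of observed tweets
    n_w = sum(1 for t in all_tweets if word in t)   # tweets containing w
    return tf * math.log(N / n_w) if n_w else 0.0

def top_k_words(topic_tweets, all_tweets, k):
    # Score every word appearing in the topic's tweets; keep the k best.
    scores = {}
    for tweet in topic_tweets:
        for word in set(tweet):
            score = tf_idf(word, tweet, all_tweets)
            scores[word] = max(scores.get(word, 0.0), score)
    return sorted(scores, key=scores.get, reverse=True)[:k]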

C. Feature Expansion
As introduced in the first section, we employ word embeddings to address the problem of vocabulary mismatch. The idea is to identify a word missing from a tweet's representation that can be substituted by a semantically related word. This paper uses word2vec to obtain these related words. Word2vec, developed by Mikolov et al. (2013), is a continuous learning tool for generating word embeddings. It takes a word as input and outputs a set of semantically related words along with their similarity values.

The purpose of word2vec is to group the vectors of similar words together in vector space; that is, it detects similarity mathematically. Word2vec creates vectors that are distributed numerical representations of word features, such as the context of individual words.

Word2vec provides two ways to obtain similar words. The first uses the neighboring words to predict a target word (a method known as continuous bag of words, or CBOW); the second uses a word to predict the neighboring words in a sentence (the so-called skip-gram) [19]. Both are illustrated in Fig. 2.

Fig. 2. Word2vec Model Architecture [19]

The output of word2vec is a list of similar words. Table I shows the words similar to "Golkar"; the Rank-1 to Rank-10 entries express their degree of proximity to "Golkar".

TABLE I. SIMILAR WORDS OF GOLKAR

Word: Golkar
Rank-1  Persatuan     Rank-6   Aziz
Rank-2  Sejahtera     Rank-7   Halid
Rank-3  Azas          Rank-8   Aburizal
Rank-4  Umum          Rank-9   Berisikan
Rank-5  Yasin         Rank-10  Politikus
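As a sketch, training a word2vec model on the expansion corpus and querying the ten nearest words (as in Table I) could be done with the gensim library (gensim 4.x API; the tokenized corpus here is a stand-in for the real one):

from gensim.models import Word2Vec

# `corpus` stands in for the tokenized IndoNews sentences (assumed).
corpus = [["golkar", "partai", "persatuan"],
          ["aburizal", "ketua", "umum", "golkar"]]

# sg=1 selects the skip-gram architecture; sg=0 would select CBOW (Fig. 2).
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)

# The ten most similar words with cosine-similarity scores, as in Table I.
for word, score in model.wv.most_similar("golkar", topn=10):
    print(word, round(score, 3))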

Specifically, we perform feature expansion by switching on a feature with a zero value whenever a word similar to it in the word2vec list appears in the tweet. Given a tweet T, the feature expansion process is as follows (a code sketch of this procedure follows the illustration below):

1. Let fv = {t1, t2, ..., tn} be the feature vector of tweet T.
2. For each ti ∈ fv with ti = 0:
   a. Retrieve the semantically similar words W of feature i from the word2vec list.
   b. If at least one of the words in W appears in the tweet T, assign the corresponding feature the value 1, i.e., ti ← 1.

As an illustration, consider a tweet "... Persatuan Bangsa lebih utama" and suppose "Golkar" is a word whose feature value in the tweet representation is zero. Suppose also that the similar words of "Golkar" returned by word2vec are "Persatuan", "Sejahtera", "Azas", "Umum", "Yasin", "Aziz", "Halid", "Aburizal", "Berisikan", and "Politikus". Because "Persatuan", one of the similar words, appears in the tweet content, the feature value corresponding to "Golkar" in the tweet representation is set to 1.
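A minimal sketch of this procedure; the similarity lookup stands in for the word2vec list:

def expand_features(feature_vector, feature_words, tweet_tokens, similar):
    # feature_vector: Boolean vector of tweet T
    # feature_words:  the word encoded at each position
    # similar(w):     returns the word2vec similarity list for w
    present = set(tweet_tokens)
    expanded = list(feature_vector)
    for i, value in enumerate(expanded):
        if value == 0 and any(w in present for w in similar(feature_words[i])):
            expanded[i] = 1  # t_i <- 1: a similar word occurs in the tweet
    return expanded

# The example from the text: "golkar" is absent from the tweet, but its
# similar word "persatuan" is present, so the feature is switched on.
similar_words = {"golkar": ["persatuan", "sejahtera", "azas", "umum", "yasin",
                            "aziz", "halid", "aburizal", "berisikan", "politikus"]}
tweet = ["persatuan", "bangsa", "lebih", "utama"]
print(expand_features([0], ["golkar"], tweet,
                      lambda w: similar_words.get(w, [])))  # [1]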

D. Classification Algorithm
Three learning algorithms are explored in this paper: Naive Bayes (NB), Support Vector Machine (SVM), and Logistic Regression (Logit). As illustrated in Fig. 1, an algorithm is used during the training phase to create the tweet topic classification model, which is then used to classify the topics of new tweets. Each algorithm is described below.

Naive Bayes (NB)
A Naive Bayes (NB) classification model consists of the probability of each attribute value given a class; new data is classified by finding the class with the maximum probability given the data's attributes [20]. Naive Bayes is easy to construct, requires no complex parameter estimation, and is scalable. The algorithm is also simple, elegant, robust, and highly accurate [21].

Support Vector Machine (SVM)
The idea of the Support Vector Machine (SVM) for classification is to find the optimal hyperplane (a separating line or plane) that divides the data into two classes in the n-dimensional feature space. Under this formulation the optimal hyperplane has no local optima, so the solution is unique [20]. SVM can be implemented easily and is well suited to high-dimensional problems with a limited number of data samples.

Logistic Regression (Logit)
Logistic Regression (Logit) is a probabilistic classification model over real-valued input vectors. The dimensions of the input vector are called features, and no restriction is imposed on correlated features. Logistic regression is used whenever the input must be assigned to one of several classes; its logistic function takes a linear combination of the features. The output is usually binary, but logistic regression can also be applied to multiclass classification problems [20].
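As a sketch, the three classifiers and the 10-fold evaluation used later in Section IV could be set up with scikit-learn as follows; the Boolean feature matrix X and topic labels y are random placeholders here.

import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for the Boolean tweet vectors and topic labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 50))  # 200 tweets, 50 Boolean features
y = rng.integers(0, 2, size=200)        # two topic labels for the sketch

classifiers = {
    "NB": BernoulliNB(),                       # suits Boolean features
    "SVM": LinearSVC(),
    "Logit": LogisticRegression(max_iter=1000),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10)  # 10-fold cross validation
    print(name, round(scores.mean() * 100, 2))  # accuracy in percent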

IV. EXPERIMENTS

The objective of the experiments is to measure the accuracy of word2vec-based feature expansion across different data sources. Classification accuracy is defined as the percentage of correctly classified instances under 10-fold cross validation. The experiments were varied by selecting the top 1, 5, and 10 features for each topic during feature selection.

A. Data
We use a dataset containing 19,401 tweets in Bahasa Indonesia taken from 97 Twitter accounts. For feature expansion, we use two corpora: one built from Indonesian news (IndoNews) and one from Google News. The first corpus contains 1,111 articles (13,846 words) taken from seven mainstream media: Kompas, Tempo, Republika, CNN Indonesia, Liputan 6, Koran Sindo, and Detik.com. Table II depicts the distribution of this corpus. For the second corpus, we use GoogleNews-vectors-negative300.bin.gz³, which was trained on about 100 billion words.

³ http://code.google.com/archive/p/word2vec

TABLE II. DISTRIBUTION OF EXPANSION ARTICLES

No  Article         Quantity
1   Tempo           159
2   Republika       184
3   CNN Indonesia   239
4   Kompas          146
5   Liputan 6       224
6   Koran Sindo     10
7   Detik.com       149
    Total           1111

Table III describes the breakdown of the Twitter data. It consists of 19 topics, and its distribution is rather unbalanced, ranging from 0.2% to 15.3%. Table IV provides sample tweets from the Law, Politics, and Entertainment topics.

TABLE III. TOPIC DISTRIBUTION OF TWITTER DATA

No  Label           Quantity  Percentage
1   Religion        1025      5.28%
2   Business        460       2.37%
3   Culture         235       1.21%
4   Economy         235       1.21%
5   Entertainment   1742      8.98%
6   Law             1557      8.03%
7   Advertisement   485       2.50%
8   Journalistics   2420      12.47%
9   Health          74        0.38%
10  Financial       35        0.18%
11  Motivation      927       4.78%
12  Sport           431       2.22%
13  Government      1935      9.97%
14  Education       466       2.40%
15  Transportation  149       0.77%
16  Politics        2959      15.25%
17  Social          1238      6.38%
18  Technology      1218      6.28%
19  General         1810      9.33%
    Total           19401

TABLE IV. SAMPLE OF TWEETS

Label: Law
Tweet: Yg disoal Saripin cm apakah KPK berwenang sidik #BG, dugaan korupsinya tdk diusik. Bagi saya, BG tetap "tersangka" mestinya #JKW jg demikian

Label: Politics
Tweet: DPR Akan Gelar Paripurna Sahkan Revisi UU Pilkada Hari Ini http://t.co/jcxclL9faO @detikcom

Label: Entertainment
Tweet: Studio Denny JA, MTV dan Mizan bersama Hanung Bramantyo membuat 5 film layar lebar bertema Islam Cinta: http://t.co/BrdHfhBsub

B. Experiment Results
Tables V-VII depict the experiment results using the Naive Bayes, SVM, and Logistic Regression classifiers, respectively. The Baseline column gives the results without feature expansion; the IndoNews and Google News columns give the results with feature expansion using IndoNews and Google News, respectively.

The performance of word2vec with the Naive Bayes classifier is shown in Table V. A decline in accuracy occurs when using the top 5 features. The highest improvement, 0.21%, was achieved using Google News with the top 10 features.

TABLE V. PERFORMANCE OF WORD2VEC ON NB CLASSIFIER

                         Accuracy (%)
#features  Baseline  Baseline + IndoNews  Baseline + GoogleNews
Top 1      52.51     52.51 (+0.00)        52.55 (+0.08)
Top 5      52.74     52.60 (-0.27)        52.71 (-0.06)
Top 10     52.88     52.97 (+0.17)        52.99 (+0.21)

Table VI shows the performance of word2vec with the SVM classifier. The only increase in accuracy, 0.13%, was achieved using Google News with the top 10 features.

TABLE VI. PERFORMANCE OF WORD2VEC ON SVM CLASSIFIER

                         Accuracy (%)
#features  Baseline  Baseline + IndoNews  Baseline + GoogleNews
Top 1      54.46     54.46 (+0.00)        54.44 (-0.04)
Top 5      54.63     54.54 (-0.16)        54.59 (-0.07)
Top 10     54.60     54.54 (-0.11)        54.67 (+0.13)

The performance of word2vec with the Logit classifier is shown in Table VII. Here, all schemes improve over the baseline. The highest improvement, 0.38%, was achieved using Google News with the top 5 features.

TABLE VII. PERFORMANCE OF WORD2VEC ON LOGISTIC REGRESSION CLASSIFIER

                         Accuracy (%)
#features  Baseline  Baseline + IndoNews  Baseline + GoogleNews
Top 1      58.32     58.32 (+0.00)        58.52 (+0.34)
Top 5      58.52     58.57 (+0.09)        58.74 (+0.38)
Top 10     58.76     58.81 (+0.09)        58.86 (+0.17)

The effect of feature size is shown in Fig. 3 and Fig. 4. In both figures, accuracy improves as the number of features grows, and the improvement obtained with the Google News data set is larger than that obtained with IndoNews.

Fig. 3. The Effect of Feature Size (%) on Baseline + IndoNews

Fig. 4. The Effect of Feature Size (%) on Baseline + Google News

V. CONCLUSION

In this paper, we have described our approach to Indonesian tweet topic classification. To alleviate the vocabulary mismatch problem, we apply feature expansion using word embeddings based on word2vec, evaluated with the Google News and IndoNews data sets on the Naive Bayes, SVM, and Logistic Regression classifiers. Contrary to prior findings in which SVM was consistently among the top performers, our experiments show that feature expansion tends to degrade SVM performance. Feature expansion with the Google News data set consistently improves performance with the Logistic Regression classifier, while the Naive Bayes classifier gives mixed results. On all three classifiers, the performance improvement obtained with the Google News data set is larger than that obtained with IndoNews.

REFERENCES
[1] H. Kwak, C. Lee, H. Park, and S. Moon, "What is Twitter, a Social Network or a News Media?," Proc. 19th Int. World Wide Web Conf., pp. 591-600, 2010.
[2] J. Teevan, D. Ramage, and M. R. Morris, "#TwitterSearch: a comparison of microblog search and web search," Proc. Fourth ACM Int. Conf. on Web Search and Data Mining (WSDM '11), p. 35, 2011.
[3] M. R. Morris, S. Counts, A. Roseway, A. Hoff, and J. Schwarz, "Tweeting is Believing? Understanding Microblog Credibility Perceptions," Proc. ACM 2012 Conf. on Computer Supported Cooperative Work (CSCW '12), pp. 441-450, 2012.
[4] M. Naaman, J. Boase, and C. H. Lai, "Is it Really About Me? Message Content in Social Awareness Streams," Proc. ACM 2010 Conf. on Computer Supported Cooperative Work (CSCW '10), 2010.
[5] M. Mendoza, B. Poblete, and C. Castillo, "Twitter Under Crisis: Can we trust what we RT?," Proc. First Workshop on Social Media Analytics, p. 9, 2010.
[6] I. Sunni and D. H. Widyantoro, "Analisis Sentimen dan Ekstraksi Topik Penentu Sentimen pada Opini Terhadap Tokoh Publik," vol. 1, no. 2, pp. 200-206, 2012.
[7] Y. A. Winatmoko and M. L. Khodra, "Automatic Summarization of Tweets in Providing Indonesian Trending Topic Explanation," Procedia Technology, vol. 11, pp. 1027-1033, 2013.
[8] Y. Wibisono, "Penentuan Gender Otomatis Berdasarkan Isi Microblog Memanfaatkan Fitur Sosiolinguistik," vol. 1, no. 1, pp. 2011-2014, 2013.
[9] W. Wu, B. Zhang, and M. Ostendorf, "Automatic Generation of Personalized Annotation Tags for Twitter Users," pp. 689-692, June 2010.
[10] S. Mukherjee, A. Malu, A. R. Balamurali, and P. Bhattacharyya, "TwiSent: A Multistage System for Analyzing Sentiment," Proc. CIKM '12, pp. 2531-2534, 2012.
[11] M. A. Zingla, L. Chiraz, Y. Slimani, and C. Berrut, "Statistical and Semantic Approaches for Tweet Contextualization," Proc. 19th Int. Conf. on Knowledge-Based and Intelligent Information and Engineering Systems, vol. 60, pp. 498-507, 2015.
[12] C. Castillo, M. Mendoza, and B. Poblete, "Information credibility on Twitter," Proc. 20th Int. Conf. on World Wide Web (WWW '11), p. 675, 2011.
[13] K. R. Canini, B. Suh, and P. L. Pirolli, "Finding Credible Information Sources in Social Networks Based on Content and Social Structure," Proc. 2011 IEEE Third Int. Conf. on Privacy, Security, Risk and Trust and IEEE Third Int. Conf. on Social Computing, pp. 1-8, 2011.
[14] A. Gupta and P. Kumaraguru, "Credibility ranking of tweets during high impact events," Proc. 1st Workshop on Privacy and Security in Online Social Media (PSOSM '12), pp. 2-8, 2012.
[15] X. Hu, L. Tang, J. Tang, and H. Liu, "Exploiting social relations for sentiment analysis in microblogging," Proc. Sixth ACM Int. Conf. on Web Search and Data Mining, pp. 537-546, 2013.
[16] S. Moran, R. McCreadie, C. Macdonald, and I. Ounis, "Enhancing First Story Detection using Word Embeddings," 2016.
[17] M. Seok, H. Song, C. Park, J. Kim, and Y. Kim, "Named Entity Recognition using Word Embedding as a Feature," vol. 10, no. 2, pp. 93-104, 2016.
[18] S. Robertson, "Understanding inverse document frequency: on theoretical arguments for IDF," Journal of Documentation, vol. 60, no. 5, pp. 503-520, 2004.
[19] T. Mikolov, G. Corrado, K. Chen, and J. Dean, "Efficient Estimation of Word Representations in Vector Space," Proc. Int. Conf. on Learning Representations (ICLR 2013), pp. 1-12, 2013.
[20] T. M. Mitchell, Machine Learning. McGraw-Hill, 1997.
[21] X. Wu, V. Kumar, J. R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, P. S. Yu, Z.-H. Zhou, M. Steinbach, D. J. Hand, and D. Steinberg, "Top 10 algorithms in data mining," Knowledge and Information Systems, vol. 14, no. 1, 2008.