A Crossed-domain Sentiment Analysis System for the Discovery of Current Careers from Social Networks

Trinh Thi Van Anh
Posts and Telecommunications Institute of Technology, Hanoi, Vietnam
Email: [email protected]

Hoang Xuan Dau
Posts and Telecommunications Institute of Technology, Hanoi, Vietnam
Email: [email protected]

Abstract - In recent years, sentiment analysis of messages from social networks has attracted considerable attention from researchers. However, most existing work focuses on classifying user messages as positive (like) or negative (dislike) with respect to social issues or discussion topics. In addition, it usually handles only English messages from a single data source, or domain. In this paper, we propose a crossed-domain sentiment analysis system for the discovery of current careers from social networks. The proposed system captures the sentiment of career-related messages from two popular social networks, Twitter and Facebook. The experimental results identify the most favored careers, which receive the highest positive sentiment, and the least favored careers, which receive the highest negative sentiment. The performance of the proposed system is promising for crossed-domain sentiment analysis, with a precision of over 85% and a recall of over 90%.

Keywords - sentiment analysis; current careers; Latent Dirichlet Allocation model (LDA); SVM; MaxEnt

I. INTRODUCTION

With the recent rapid growth of social networks such as Facebook and Twitter, hundreds of millions of users have joined them [19]; they discuss current trends daily and share personal ideas on various topics. Social networks have therefore become rich data sources for collecting public opinion on many social issues. This trend is also visible in Vietnam, where the population is young and the number of social network users has doubled in the last 12 months [9].

A number of sentiment analysis approaches have been proposed for messages from Twitter or Facebook [3, 4, 8, 10, 20, 21, 22, 23]. The paper in [20] focused on messages from Twitter, the most popular micro-blogging platform. It described how to automatically collect a corpus for sentiment analysis and opinion mining and, using that corpus, built a classifier that determines positive, negative or neutral sentiment for documents. However, that system only worked with English messages. Ahkter et al. [23] used the Maximum Entropy (MaxEnt) method to classify Facebook status messages in English, with a reported accuracy of 85%. Although these approaches have achieved good results, most of them only work with English messages from a single data source, either Facebook or Twitter.

In this paper, we focus on the crossed-domain sentiment analysis of career-related Vietnamese messages from two popular social networks, Twitter and Facebook. In our approach, we first collect a set of listed careers and their initial words from a job search website. Next, for each career, the initial career words are expanded into a complete set of career keywords using a list of career-related topics generated by the LDA method [1] from news articles published by online Vietnamese local newspapers. Then, messages collected from Twitter and Facebook based on these career keywords are used for the crossed-domain sentiment analysis.

The rest of this paper is organized as follows: Section II presents the architecture of the proposed system and Section III describes the process of extracting career keywords. Section IV presents our approach to crossed-domain sentiment analysis on data from social networks and Section V shows our experiments and results. Finally, Section VI is our conclusion.

II. ARCHITECTURE OF THE PROPOSED SYSTEM

The architecture of the proposed system is presented in Figure 1. The system's processing flow consists of two stages: (1) constructing a list of careers with attached job/career keywords and (2) collecting messages from Twitter and Facebook based on the results of the first stage, and then carrying out the crossed-domain sentiment analysis of the collected messages.

In the first stage, daily published news articles are collected from online local news websites, including VnExpress.net, Dantri.com.vn, Laodong.com.vn, Vietnamnet.vn and Kenh14.vn. News headlines and summaries are then extracted from the articles and clustered by the LDA model [1] to produce a list of K topics and their corresponding keywords. We use only headlines and summaries because they represent the main ideas of the news articles.

Next, a list of T careers with initial keywords is extracted from a job search website, such as Vieclam.24h.com.vn or VietnamWorks.com. To expand the job keywords of each career, the career's initial words are matched against each of the K topics generated by the LDA model. If a match is found, the topic's keywords are merged into the career's job keywords. The result of the first stage is a list of T careers with job keywords, which is used as the input for the second stage.

[Figure 1: LDA topics 1..K (topic words) and a list of T careers with initial words → careers 1..T (job keywords) → Twitter API and Facebook Graph API → careers 1..T (a list of messages) → SVM, MaxEnt classifying the sentiment of messages]
Fig. 1. Proposed system architecture

In the second stage, the system explores public opinion by carrying out sentiment analysis on messages from Twitter and Facebook. First, we use the Twitter API ver. 1.1 [3][13] and the Facebook Graph API [5][6][15] to collect career-related messages for opinion mining. For each career in the list of current careers, messages are collected based on the career's keywords. Second, two machine learning algorithms, SVM [7] and MaxEnt [2], are applied in sequence to estimate the sentiment polarity of the collected messages for each career.

III. EXTRACTING THE CAREER KEYWORDS

A. LDA model (LDA: Latent Dirichlet Allocation)

The LDA model [1] is a generative model originally proposed for topic modelling. In LDA, the data is a collection of documents, and each document is treated as a collection of words. Each document is assumed to be a random mixture of latent topics, and each topic is a distribution over words. These mixture distributions are assumed to be Dirichlet-distributed random variables that must be inferred from the data.

Consider a corpus of M documents denoted by $D = \{d_1, d_2, \dots, d_M\}$, in which each document m consists of $N_m$ words, each word $w_i$ drawn from a vocabulary of terms $\{t_1, t_2, \dots, t_V\}$. The goal of LDA is to find the latent structure of "topics" or "concepts" that captures the meaning of text obscured by "word choice" noise. LDA is a complete generative model that has shown better results than earlier approaches [1].

[Figure 2: plate diagram of the LDA model]
Fig. 2. LDA's graphical model representation

With the graphical model representation shown in Figure 2, the generative process can be interpreted as follows. First, LDA generates a stream of observable words $w_{m,n}$, partitioned into documents $\vec{d}_m$. Then, for each of these documents, a topic proportion $\vartheta_m$ is drawn, from which topic-specific words are emitted. That is, for each word, a topic indicator $z_{m,n}$ is sampled according to the document-specific mixture proportion, and then the corresponding topic-specific term distribution $\vec{\varphi}_{z_{m,n}}$ is used to draw a word. The mixture component $\vec{\varphi}_k$ of topic k is sampled once for the entire corpus. The complete (annotated) generative model is presented in Algorithm 1, and Figure 3 lists the notation involved.

Algorithm 1: Generative model for Latent Dirichlet Allocation

Input: Set of M documents
    K: number of topics
    V: Vietnamese lexicon
    $\alpha$, $\beta$: hyperparameters of the Dirichlet priors
    N: number of iterations
Output: Set of K topics; each topic $k_i$ (i = 1..K) has a list of words
-------------------------------------------------------------
// topic plate
for all topics k ∈ [1, K] do
    Sample mixture components $\vec{\varphi}_k \sim \mathrm{Dir}(\beta)$
end for
// document plate
for all documents m ∈ [1, M] do
    // Vietnamese word segmentation
    Tokenize the document with vnTokenizer [18]
    // POS tagging for the document, stop-word removal
    Preprocess the text of the document
    Sample mixture proportion $\vartheta_m \sim \mathrm{Dir}(\alpha)$
    Sample document length $N_m \sim \mathrm{Poiss}(\xi)$
    // word plate
    for all words n ∈ [1, $N_m$] in document m do
        Sample topic index $z_{m,n} \sim \mathrm{Mult}(\vartheta_m)$
        Sample term for word $w_{m,n} \sim \mathrm{Mult}(\vec{\varphi}_{z_{m,n}})$
    end for
end for

In Algorithm 1, Dir, Poiss and Mult stand for the Dirichlet, Poisson and Multinomial distributions, respectively.
- M: number of documents to generate (constant scalar).
- K: number of topics/mixture components (constant scalar).
- V: number of terms t in the vocabulary (constant scalar).
- $\alpha$: hyperparameter on the mixing proportions (K-vector, or scalar if symmetric).
- $\beta$: hyperparameter on the mixture components (V-vector, or scalar if symmetric).
- $\vartheta_m$: parameter notation for $p(z \mid d = m)$, the topic mixture proportion for document m. One proportion for each document, $\Theta = \{\vartheta_m\}_{m=1}^{M}$ ($M \times K$ matrix).
- $\vec{\varphi}_k$: parameter notation for $p(t \mid z = k)$, the mixture component of topic k. One component for each topic, $\Phi = \{\vec{\varphi}_k\}_{k=1}^{K}$ ($K \times V$ matrix).
- $N_m$: document length (document-specific), here modelled with a Poisson distribution [1] with constant parameter.
- $z_{m,n}$: mixture indicator that chooses the topic for the nth word in document m.
- $w_{m,n}$: term indicator for the nth word in document m.
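The sampling steps of the generative model can be illustrated with a small, standard-library-only sketch. It covers only the three plates (topic, document, word) and leaves out tokenization and preprocessing, which belong to the data pipeline rather than to the model. The function names, the toy vocabulary and the value of `xi` are assumptions made for illustration; this is not the JGibbLDA implementation and it performs no inference.

```python
import math
import random


def sample_dirichlet(alphas, rng):
    """Draw from Dir(alphas) by normalizing independent Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def sample_poisson(lam, rng):
    """Draw from Poiss(lam) with Knuth's multiplication method (small lam only)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_index(probs, rng):
    """Draw one index from a multinomial with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(M, K, vocab, alpha, beta, xi, seed=0):
    """Generate M toy documents from K topics over `vocab` (symmetric priors)."""
    rng = random.Random(seed)
    V = len(vocab)
    # Topic plate: one term distribution phi_k per topic, sampled once.
    phi = [sample_dirichlet([beta] * V, rng) for _ in range(K)]
    docs = []
    # Document plate: one topic proportion theta_m and one length N_m per document.
    for _ in range(M):
        theta = sample_dirichlet([alpha] * K, rng)
        n_words = sample_poisson(xi, rng)
        words = []
        # Word plate: pick a topic z, then a term from that topic's distribution.
        for _ in range(n_words):
            z = sample_index(theta, rng)
            w = sample_index(phi[z], rng)
            words.append(vocab[w])
        docs.append(words)
    return phi, docs


# Hypothetical toy vocabulary; alpha and beta are arbitrary small symmetric priors.
toy_vocab = ["software", "web", "market", "customer", "factory", "machine"]
phi, docs = generate_corpus(M=5, K=3, vocab=toy_vocab, alpha=0.25, beta=0.1, xi=8.0, seed=1)
```

Small values of `alpha` and `beta` make each document concentrate on few topics and each topic on few terms, which is why the sampled documents tend to repeat a handful of words.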

Fig. 3. Notations used in the latent Dirichlet allocation model

In our work, we use JGibbLDA [12], a Java implementation of Latent Dirichlet Allocation (LDA) that uses Gibbs sampling for parameter estimation and inference. The input of the LDA model is a list of news headlines and summaries collected from online local news websites, and its output is a list of K topics and their corresponding keywords.

B. Collecting the career list and initial words

To collect the list of careers and their initial words, we first manually search job search websites, such as Vieclam.24h.com.vn and VietnamWorks.com, and collect a list of careers. Each career is represented by a word or a phrase, such as IT-Software, Customer Service, Marketing or Education/Training. In total, 22 careers are collected.

[Figure 4: a job posting with the fields URL, Career, Location, Career description and Nos. Required]
Fig. 4. Extracting initial words for IT & Technology career

Then, we extract the initial career words for each career from the career description, based on the bag-of-words model. For each career, only the words with the highest probability are collected. For example, words and phrases such as programming, development, game, large scale system, client and server, and web are extracted for the IT & Technology career. We take only the top thirty words of each career as its initial words. Figure 4 shows an example of extracting initial words for the IT & Technology career.

C. Generating career keywords

We generate career keywords for the sentiment analysis by expanding the initial career words with the topics and topic keywords of online news articles, which are generated by the LDA model described in Section III.A. We assume that there are a fixed K topics in M documents (each document consists of a news headline and a summary) with N terms. We collected news headlines and summaries from online local news websites, including VnExpress.net, Dantri.com.vn, Laodong.com.vn, Vietnamnet.vn and Kenh14.vn, over 6 days, from April 22, 2014 to April 27, 2014. Table I shows an example of news headlines used for clustering with the LDA model. For each generated topic, we set a limit of 40 words.

TABLE I. SAMPLE ARTICLES WITH HIGH PROBABILITY IN CAREER TOPICS

Algorithm 2 generates career keywords by expanding the initial career words with the topics and topic keywords created by the LDA model. The algorithm takes the set of K topics and the set of T careers as input and outputs the list of career keywords for each career.

Algorithm 2: Generating the career keywords

Input: Set of K topics $T = \{t_1, t_2, \dots, t_K\}$
    Each $t_j$ (j = 1..K) is a set of topic words $\{w_1, w_2, \dots, w_{40}\}$
    Set of T careers $C = \{c_1, c_2, \dots, c_T\}$
    Each $c_i$ (i = 1..T) is a set of initial words $\{w_1, w_2, \dots, w_{30}\}$
Output: Set of T careers $C' = \{c'_1, c'_2, \dots, c'_T\}$
    Each $c'_i$ (i = 1..T) has $L_i$ (set of career keywords)
-------------------------------------------------------------
for all $c_i$ in C do
    $c'_i \leftarrow c_i$
    $L_i \leftarrow c_i$
    for all $t_j$ in T do
        count ← 0
        for all $w_l$ in $t_j$ do
            if $w_l \in c_i$ then count ← count + 1
        end for
        if count >= threshold then $L_i \leftarrow L_i \cup t_j$
    end for
end for
return list $L = \{L_1, L_2, \dots, L_T\}$
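Algorithm 2 can be sketched in a few lines of Python. The function name, the toy careers and topics, and the threshold value of 2 are assumptions made for illustration; the paper does not state the threshold it used.

```python
def expand_career_keywords(careers, topics, threshold):
    """Merge a topic's words into a career's keywords when the topic shares
    at least `threshold` words with the career's initial words."""
    expanded = {}
    for career, initial_words in careers.items():
        initial = set(initial_words)
        keywords = set(initial)  # L_i starts as the initial words c_i
        for topic_words in topics:
            # Count topic words that also appear among the initial career words.
            overlap = sum(1 for word in topic_words if word in initial)
            if overlap >= threshold:
                keywords.update(topic_words)  # L_i <- L_i union t_j
        expanded[career] = keywords
    return expanded


# Hypothetical toy input (real topics hold up to 40 words, careers up to 30).
toy_careers = {
    "IT-Software": ["programming", "development", "server", "web"],
    "Sales": ["customer", "market"],
}
toy_topics = [
    ["programming", "web", "framework", "release"],
    ["market", "price", "gold"],
    ["football", "league"],
]
result = expand_career_keywords(toy_careers, toy_topics, threshold=2)
```

As in Algorithm 2, the overlap is counted against the initial words only, so keywords merged from one topic never cause further topics to match.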

IV. SENTIMENT ANALYSIS ON SOCIAL NETWORKS

A. Social network dataset for classification

For the crossed-domain sentiment analysis, we collect messages based on the generated career keywords from two popular social networks, Twitter and Facebook.

Twitter is an online social network and micro-blogging service that enables its users to send and read short messages, known as tweets. Each tweet is limited to 140 characters. Twitter offers a special orthography that includes hashtags, user mentions, retweets and emoticons. Two types of application programming interfaces [13] are available for external applications to communicate with Twitter. The REST API provides a simple interface to Twitter functionality, and the Streaming API is a powerful real-time API. However, access to Twitter's public data is limited through both interfaces. The Streaming API provides keyword filtering, which makes it possible to collect only keyword-related messages: a list of terms is passed as a parameter, and only tweets that contain an element of the term list are included in the output stream. The term list is limited to a maximum of 400 terms.

Twitter is used all over the world, and posted messages can be in any language. Sentiment analysis relies on language-specific content, so the language plays an important role. In our work we assume that all collected tweets are Vietnamese (using the language filter, language="vi").

Because of the limited message size and the freedom in composing, the dataset contains incorrectly formatted tweets. To get a cleaner dataset for training, message cleansing techniques are applied as a sequence of filters. The first step is keyword filtering, since only messages in the defined domain are processed.
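A minimal sketch of the keyword-filtering step and of the 400-term limit is given below. It does not call the Twitter API; `batch_terms` and `mentions_any` are hypothetical helpers, and the case-insensitive substring test merely stands in for the API's own matching rules.

```python
def batch_terms(terms, limit=400):
    """Deduplicate terms (keeping order) and split them into batches
    that each respect the per-request term limit."""
    cleaned = (term.strip() for term in terms)
    unique = list(dict.fromkeys(term for term in cleaned if term))
    return [unique[i:i + limit] for i in range(0, len(unique), limit)]


def mentions_any(message, terms):
    """Return True if the message contains at least one term (case-insensitive)."""
    text = message.casefold()
    return any(term.casefold() in text for term in terms)


# Hypothetical usage: 1000 placeholder terms need three batches of at most 400.
batches = batch_terms([f"term{i}" for i in range(1000)])
batch_sizes = [len(batch) for batch in batches]
```

Batching matters here because the combined keyword lists of all careers can easily exceed a single term list.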
The next step is to filter Twitter's special features: "RT", "#"-hashtags, "@user"-mentions, emoticons and URLs. The last step is to filter the overuse of letters or punctuation (dots, question marks and exclamation marks) and stop-words [14].

Facebook is a social network service and website launched in February 2004. Facebook allows users to create profiles for themselves and to upload photographs and videos. Users can view the profiles of other users who have been added as their friends, and they can also exchange text messages. The Graph API [15] is the primary way to put data into and get data out of Facebook. Most API calls must be signed with an access token [15]. The Facebook Query Language, or FQL, enables developers to use an SQL-style interface to query the data exposed by the Graph API. It provides advanced features that are not available in the Graph API. FQL can handle simple math and basic logical operators, as well as ORDER BY and LIMIT clauses. The ORDER BY clause only supports a single table. In our work, we use the stream table, which retrieves all the posts of a specific user that contain tags [5, 6].

B. Sentiment analysis of messages

We implemented two machine learning algorithms, SVM [16] and MaxEnt [17], and apply them in sequence for the sentiment classification of Vietnamese messages. The dataset of Vietnamese messages is collected from Twitter and Facebook and pre-processed as described in Section IV.A. The dataset contains positive, negative, neutral and irrelevant messages, all of which are initially unlabeled. Both SVM and MaxEnt are supervised learning methods, so they require a set of labeled data. We selected a small part of the dataset and labeled it manually. The labeled data is used as the input for both the SVM and MaxEnt models.

We first use the SVM model to classify the collected messages into two classes. The first class contains messages that carry either positive or negative sentiment related to careers. The other class contains messages that carry no sentiment related to careers; this class is named neutral sentiment. Next, we use the MaxEnt model to classify the sentiment-bearing messages selected by the SVM model into two classes, positive and negative.

V. EXPERIMENTS AND RESULTS

In this section, we examine the overall effectiveness of the proposed system. First, we extracted a list of careers and initial career words from the Vieclam.24h.com.vn job website [11]. Then, we captured the topics of news articles from 5 online local news websites: VnExpress.net, Dantri.com.vn, Laodong.com.vn, Vietnamnet.vn and Kenh14.vn. This self-built corpus was collected from April 22, 2014 to April 27, 2014 using jsoup (http://jsoup.org). In total, we collected 40268 news articles for processing. Table I shows an example of the headlines of newspaper articles. Each article headline is used only once, so the number of articles used for feature extraction remained 40268. Stop-words are then removed from the data before clustering.

Next, JGibbLDA [12], a tool for Latent Dirichlet Allocation using Gibbs sampling, is used to cluster topics based on the LDA model. In the hidden topic mining phase, the number of topics, K, was set to 400. The hyperparameters α and β were set to 0.25 and 0.1, respectively. The number of Gibbs sampling iterations was N = 1000, and the model was saved to disk every 100 iterations (savestep = 100). The number of words for each topic was set to 40. After combining the initial career words with the topic keywords, we obtain the list of career keywords for each career. Table II shows the top 10 keywords of each career and Table III shows the number of articles by career.

TABLE II.

CAREERS AND CAREER KEYWORDS

| Career | Top 10 keywords |
|---|---|
| Bán hàng (Sales) | Khách hàng, thị trường, phân phối, bán hàng, chi nhánh, kinh doanh, tiềm năng, hang hóa, cạnh tranh, sản phẩm |
| Công nghệ phần mềm (IT-Software) | Lập trình, phần mềm, phát triển, ứng dụng, nền tảng, web, lập trình viên, dự án, ngôn ngữ lập trình, thiết kế, hệ thống |
| Marketing | Tổ chức, marketing, quảng cáo, thị trường, sự kiện, khảo sát, dữ liệu kinh doanh, tiếp thị, thương hiệu, truyền thông |
| Sản xuất (Production/Process) | Sản xuất, xưởng, máy móc, vận hành, giám sát, theo dõi, công nghệ, chế tạo, qui trình sản xuất, bộ phận |
| Cơ khí (Mechanical) | Lắp đặt, phụ tùng cơ khí, nguyên liệu, sự cố, phân xưởng, vật tư, vận hành và bảo trì, lắp đặt, sửa chữa, dây truyền |

TABLE III.

THE NUMBER OF ARTICLES BY CAREERS

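| Career | Number of articles |
|---|---|
| Bán hàng (Sales) | 512 |
| Công nghệ phần mềm (IT-Software) | 425 |
| Marketing | 419 |
| Sản xuất (Production/Process) | 293 |
| Cơ khí (Mechanical) | 276 |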

The career keywords of all 22 careers form a list of terms, which is passed as a parameter to the Twitter Streaming API and the Facebook Graph API to retrieve the career-related messages. The collected corpus consists of 195674 messages from the period April 2, 2014 to May 22, 2014. After non-Vietnamese messages are filtered out and the remaining messages are pre-processed, the corpus contains 119138 messages. For the sentiment analysis, we use the SVM and MaxEnt models as described in Section IV.B. Table IV shows the distribution of the three classes (positive, negative and neutral) for each career. The distribution shows that, on average, over 50 percent of the messages are neutral.

TABLE IV.

THE NUMBER OF SENTIMENTS BY CAREER

| Career | Positive | Negative | Neutral | # of messages |
|---|---|---|---|---|
| Bán hàng (Sales) | 2241 | 1175 | 5021 | 8437 |
| Công nghệ phần mềm (IT-Software) | 4309 | 1580 | 5765 | 11654 |
| Marketing | 2145 | 1487 | 3457 | 7098 |
| Sản xuất (Production/Process) | 449 | 864 | 1985 | 3298 |
| Cơ khí (Mechanical) | 2932 | 1221 | 5365 | 9518 |
| Dịch vụ khách hàng (Customer Service) | 2515 | 1077 | 4532 | 8124 |
| Kế toán (Accounting) | 1048 | 2011 | 1453 | 4512 |
| Điện/Điện tử (Electrical/Electronics) | 1509 | 909 | 4012 | 6430 |
| Tài chính/Đầu tư (Finance/Investment) | 502 | 834 | 1721 | 3057 |
| Xây dựng (Civil/Construction) | 1961 | 994 | 4026 | 6981 |
| IT-phần cứng/mạng (IT-Hardware/Networking) | 2594 | 1243 | 7034 | 10871 |
| Nhân sự (Human Resources) | 2089 | 1503 | 3737 | 7329 |
| Ngân hàng (Banking) | 498 | 682 | 1841 | 3021 |
| Bán hàng kỹ thuật (Sales Technical) | 832 | 730 | 1711 | 3273 |
| Biên phiên dịch (Interpreter/Translator) | 721 | 231 | 1215 | 2167 |
| Internet/online Media | 813 | 308 | 3981 | 5108 |
| Quảng cáo (Advertising/PR) | 1125 | 449 | 1602 | 3176 |
| Kiểm toán (Auditing) | 298 | 905 | 1562 | 2765 |
| Du lịch (Tourism) | 541 | 965 | 3932 | 5438 |
| Giáo dục/Đào tạo (Education/Training) | 343 | 652 | 1548 | 2543 |
| Xuất nhập khẩu (Export-Import) | 276 | 513 | 1452 | 2231 |
| Dầu khí (Oil/Gas) | 136 | 314 | 1657 | 2107 |
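The neutral share can be checked from the Neutral and # of messages columns of Table IV. The sketch below is only a sanity check on the reported figures; the variable names are illustrative, and the pairs are listed in the table's career order.

```python
# (neutral, total) pairs for the 22 careers, in the order of Table IV.
neutral_and_total = [
    (5021, 8437), (5765, 11654), (3457, 7098), (1985, 3298),
    (5365, 9518), (4532, 8124), (1453, 4512), (4012, 6430),
    (1721, 3057), (4026, 6981), (7034, 10871), (3737, 7329),
    (1841, 3021), (1711, 3273), (1215, 2167), (3981, 5108),
    (1602, 3176), (1562, 2765), (3932, 5438), (1548, 2543),
    (1452, 2231), (1657, 2107),
]

total_messages = sum(total for _, total in neutral_and_total)
total_neutral = sum(neutral for neutral, _ in neutral_and_total)

# Share of neutral messages over the whole corpus.
overall_neutral_share = total_neutral / total_messages

# Unweighted mean of the per-career neutral shares.
mean_career_share = sum(n / t for n, t in neutral_and_total) / len(neutral_and_total)
```

The totals add up to the 119138 messages of the pre-processed corpus, and both ways of averaging give a neutral share above one half, with the overall share a little under 58%.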

Figure 5 shows the number of positive and negative messages by career. Careers such as IT-Software, Mechanical, IT-Hardware/Networking, Customer Service, Sales and Human Resources receive the highest positive sentiment, while the Accounting career receives the highest negative sentiment. As another view of the results, Figure 6 shows the ratio of positive to negative sentiment for each career.

[Figure 5: Positive and Negative message counts for each of the 22 careers]
Fig. 5. The number of positive and negative messages by career

[Figure 6: ratio of positive to negative sentiment (0.00 to 4.00) for each of the 22 careers]
Fig. 6. Ratio of positive to negative sentiment for each career
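The ratio plotted in Figure 6 is simply the positive count divided by the negative count of each career. The sketch below recomputes it for four rows of Table IV; the choice of rows and the names `counts` and `ranked` are illustrative.

```python
# (positive, negative) counts taken from four rows of Table IV.
counts = {
    "Sales": (2241, 1175),
    "IT-Software": (4309, 1580),
    "Accounting": (1048, 2011),
    "Interpreter/Translator": (721, 231),
}

# Ratio above 1 means more positive than negative messages for that career.
ratios = {career: pos / neg for career, (pos, neg) in counts.items()}

# Careers ordered from the most to the least favorable ratio.
ranked = sorted(ratios, key=ratios.get, reverse=True)
```

A career with few messages can still rank high on this ratio, which is why the ratio view complements the absolute counts of Figure 5.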

To validate the classification results of the SVM model, we took a random sample of 1000 messages from the results (about 10% of the results) and labeled them manually. The manual classification is compared with the output of the SVM model to calculate the performance measures TS (True Sentiment), FS (False Sentiment), TN (True Neutral) and FN (False Neutral). TS is the number of messages that are correctly classified as having positive or negative sentiment related to careers. FS is the number of messages that are incorrectly classified as having sentiment (positive or negative) related to careers. TN is the number of messages that are correctly classified as neutral. FN is the number of messages that are incorrectly classified as neutral. Table V shows the performance of the SVM classification.

TABLE V.

PERFORMANCE OF SVM CLASSIFICATION

| The testing dataset | TS | FS | TN | FN | Precision | Recall |
|---|---|---|---|---|---|---|
| 1000 | 550 | 100 | 295 | 55 | 85% | 91% |

Using a similar method, we verify the performance of the MaxEnt model on a random sample of 500 messages from the results (also about 10% of the results), which we labeled manually. The manual classification is compared with the output of the MaxEnt model to calculate the performance measures TP (True Positive), FP (False Positive), TN (True Negative) and FN (False Negative). Table VI shows the performance of the MaxEnt classification.

TABLE VI.

PERFORMANCE OF MAXENT CLASSIFICATION

| The testing dataset | TP | FP | TN | FN | Precision | Recall |
|---|---|---|---|---|---|---|
| 500 | 323 | 51 | 142 | 34 | 86% | 90% |
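The paper does not spell out its formulas, but the reported percentages are consistent with the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN), with TS and FS playing the roles of TP and FP for the SVM stage. The helper below is a sketch under that assumption.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Standard precision and recall from confusion-matrix counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


# Counts from Table V (SVM: TS, FS, FN) and Table VI (MaxEnt: TP, FP, FN).
svm_precision, svm_recall = precision_recall(550, 100, 55)
maxent_precision, maxent_recall = precision_recall(323, 51, 34)

# Rounded to whole percentages, as reported in the tables.
svm_percent = (round(100 * svm_precision), round(100 * svm_recall))
maxent_percent = (round(100 * maxent_precision), round(100 * maxent_recall))
```

True negatives do not enter either measure, so the TN columns of the tables are not needed here.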

VI. CONCLUSION

In this paper, we have presented a crossed-domain sentiment analysis system that captures the sentiment of career-related messages from two popular social networks, Twitter and Facebook. The experimental results show that the most favored careers are IT-Software, Mechanical, IT-Hardware/Networking, Customer Service, Sales and Human Resources, which receive the highest positive sentiment. In contrast, the Accounting career is the least favored, with the highest negative sentiment. The performance of the proposed system is good, with a precision of over 85% and a recall of over 90%. This is a promising result for crossed-domain sentiment analysis.

In the future, we will conduct extensive experiments on larger datasets to verify the performance of the proposed system. Furthermore, we will extend our work on crossed-domain sentiment analysis by adding more domains and by using results from one domain in the training process of other domains, in order to reduce the amount of labelled data required.

REFERENCES

[1] D. M. Blei, A. Y. Ng and M. I. Jordan, Latent Dirichlet Allocation, Journal of Machine Learning Research 3, pp. 993-1022, 2003.
[2] Kamal Nigam, John Lafferty and Andrew McCallum, Using Maximum Entropy for Text Classification, IJCAI-99 Workshop on Machine Learning for Information Filtering, 1999.
[3] Shamanth Kumar, Fred Morstatter and Huan Liu, Twitter Data Analytics, Springer, 2013.
[4] Nick V. Flor, Technology Corner: Automated Data Extraction Using Facebook, Journal of Digital Forensics, Security and Law, Volume 7, Number 2, 2012.
[5] Philihp Busby, Three Different Ways to Import JSON from the Facebook Graph API, SAS Global Forum 2014, Washington, DC, March 2014.
[6] Shashwat Srivastava and Apeksha Singh, Facebook Application Development with Graph API Cookbook, Packt Publishing, Ebook ISBN: 978-1-84969-093-5, November 2011.
[7] S. R. Gunn, Support Vector Machines for Classification and Regression, Technical report, May 1998.
[8] Trịnh Thị Vân Anh and Nguyễn Duy Phương, Phân loại quan điểm các tin nhắn tiếng Việt trên twitter, Tạp chí Khoa học và Công nghệ 51 (4A), 2013.
[9] VOV, http://english.vov.vn/Society/Development/Vietnam-boasts-308million-internet-users/244626.vov, visited June 2014.
[10] Nikolaos Pappas, Georgios Katsimpras and Efstathios Stamatatos, An Agent-Based Focused Crawling Framework for Topic and Genre-Related Web Document Discovery, 24th IEEE International Conference on Tools with Artificial Intelligence, 2012.
[11] Vieclam, http://hn.vieclam.24h.com.vn, visited June 2014.
[12] JGibbLDA, A Java Implementation of Latent Dirichlet Allocation (LDA) using Gibbs Sampling for Parameter Estimation and Inference, http://jgibblda.sourceforge.net/
[13] Twitter, https://dev.twitter.com/docs/api/1.1/overview, visited June 2014.
[14] http://seo4b.com/thuat-ngu-SEO/stop-words-la-gi.html, visited June 2014.
[15] Facebook dev, https://developers.facebook.com/tools/explorer, visited June 2014.
[16] SVM, http://svmlight.joachims.org/, visited June 2014.
[17] MaxEnt, http://homepages.inf.ed.ac.uk/lzhang10/maxent.html, visited June 2014.
[18] VNU, http://mim.hus.vnu.edu.vn/phuonglh/softwares, visited June 2014.
[19] Facebook, http://newsroom.fb.com/company-info/, visited July 2014.
[20] Alexander Pak and Patrick Paroubek, Twitter as a Corpus for Sentiment Analysis and Opinion Mining, LREC, 2010.
[21] Apoorv Agarwal, Boyi Xie, Ilia Vovsha, Owen Rambow and Rebecca Passonneau, Sentiment Analysis of Twitter Data, LSM '11 Proceedings of the Workshop on Languages in Social Media, 2011.
[22] Hassan Saif, Yulan He and Harith Alani, Semantic Sentiment Analysis of Twitter, ISWC, 2012.
[23] Julie Kane Ahkter and Steven Soria, Sentiment Analysis: Facebook Status Messages, Final Project CS224N, Stanford University, USA, 2010.
[24] GibbsLDA++, gibbslda.sourceforge.net, visited July 2014.