Automatic Domain-Specific Sentiment Lexicon Generation with Label Propagation

Yen-Jen Tai and Hung-Yu Kao
Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C.
[email protected], [email protected]

ABSTRACT
Nowadays, the advance of social media has led to the explosive growth of opinion data, and sentiment analysis has therefore attracted a great deal of attention. Current sentiment analysis applications fall into two main approaches, the lexicon-based approach and the machine-learning approach. However, both face the challenge of obtaining large amounts of human-labeled training data and corpora. The lexicon-based approach requires a sentiment lexicon to determine opinion polarity. Many benchmark sentiment lexicons exist, but they cannot cover all domain-specific word meanings. Thus, automatic generation of a domain-specific sentiment lexicon becomes an important task. We propose a framework to generate a sentiment lexicon automatically. First, we determine the semantic similarity between any two words in an entirely unlabeled corpus. We treat the words as nodes and the similarities as weighted edges to construct word graphs. A graph-based semi-supervised label propagation method then assigns polarities to unlabeled words through the proposed propagation process. Experiments conducted on microblog data from Twitter show that our approach outperforms baseline approaches and general-purpose sentiment dictionaries.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Information filtering; J.4 [Social and Behavioral Sciences]: Sociology; I.2.7 [Artificial Intelligence]: Natural Language Processing – Text Analysis

General Terms
Management, Languages

Keywords
Sentiment Analysis, Sentiment Lexicon, Twitter

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. iiWAS2013, 2-4 December, 2013, Vienna, Austria. Copyright 2013 ACM 978-1-4503-2113-6/13/12 …$15.00.

1. INTRODUCTION
Sentiment analysis has attracted more and more attention recently. The task is to predict the sentiment polarity of opinions by analyzing sentiment words and expressions in sentences and documents. Sentiment words carry one of three polarities: positive, negative, or neutral.

The advancement of Web 2.0 technologies has led to the explosive growth of online opinion data, which has become a valuable source for sentiment analysis. Currently, most sentiment analysis applications on social media are based on the lexicon-based approach rather than the machine learning approach, because the latter requires a large amount of human-labeled training data. This brings an urgent need for automatic construction of sentiment lexicons. A comprehensive sentiment lexicon is essential for sentiment analysis. However, no general-purpose sentiment lexicon can cover all domains, since opinion expressions vary significantly among domains. Many domains are characterized by their own sublanguage, such as specific terms and jargon, and representing all sublanguages in a single knowledge base would be nearly impossible. For these reasons, we propose a method that can generate a domain-specific sentiment lexicon automatically. While there has been prior research on automatic sentiment lexicon generation, none of it constructs the lexicon from a social network corpus such as Twitter. Words from social media are informal and diverse, which makes their polarity hard to determine. For example, the word "long" has different usages across tweets, as shown in Figure 1:

Figure 1. A word has different meanings and polarities in different domains

There are three domains: movie, financial, and food. All three tweets use the word "long", yet "long" has an entirely different meaning in each. In the movie domain, "long" refers to the movie's length, implying that the movie is boring and not worth the time. In the financial domain, investors use "long" to denote a buying position. In the food domain, "long" simply describes something's shape. This example shows that a word may have different meanings in different domains. Thus, distinguishing a word's meanings and usages across domains is a hard task, but it is an important one for domain-specific sentiment lexicon generation. Such domain-specific word meanings and usages cannot be acquired intuitively from existing thesauri. Furthermore, the parts of speech of words on social media are

more diverse than in formal sentences. Therefore, finding the semantic relation between any two words in a specific domain and extracting candidate words with their parts of speech are difficult tasks. To generate the domain-specific sentiment lexicon, we exploit the information contained in the corpus rather than relying on a thesaurus alone. For the above reasons, we focus on automatically constructing domain-specific sentiment lexicons from Twitter for the financial, phone, and food domains. There are three approaches to generating sentiment lexicons: manual, thesaurus-based [1-5], and corpus-based [6-10]. The manual approach needs human annotators to label words as positive, negative, or neutral. Thesaurus-based approaches use semantic relations such as synonyms and antonyms to expand the sentiment words from a small set of seed words. Corpus-based approaches use co-occurrence patterns in text or conjunction rules to generate the lexicons. Because the manual method is costly, we combine the thesaurus-based and corpus-based approaches to generate our domain-specific sentiment lexicons; combining the two yields high-quality, domain-specific lexicons. In thesaurus-based approaches, well-known dictionaries such as WordNet provide "strong" semantic relations: even when two words are used in different domains, they keep the same or opposite meanings. For example, the antonym pair "predictable" and "unpredictable" retains its opposite relation regardless of domain. Corpus-based approaches, in turn, help us find domain-dependent and context-specific sentiment words. Therefore, we propose a method that aggregates multiple resources from both thesaurus-based and corpus-based approaches.

2. RELATED WORK
In this section, we review prior work most closely related to ours, covering sentiment analysis and the generation of sentiment lexicons. To show the urgent need for sentiment lexicon generation, we first discuss previous research on sentiment analysis; we then introduce related work on generating sentiment lexicons.

2.1 Sentiment Analysis
In recent years, sentiment analysis, which mines opinions from large-scale subjective information on the Web such as news, blogs, reviews, and tweets, has attracted much attention. Sentiment analysis applications are based on two main approaches, the lexicon-based approach and the machine learning approach. We describe the details in the sections below and introduce some related applications.

2.1.1 Lexicon-based approach
Most current sentiment analysis applications on social media choose a lexicon-based approach because obtaining enough human-labeled training data is a huge challenge. Most recent work relies on a sentiment lexicon to quantify emotion for sentiment analysis. Hu and Liu [4] proposed a method to generate a sentiment lexicon for sentiment classification of customer reviews; we introduce their lexicon-generation method in Section 2.2.1. They mine the product features on which customer reviews express opinions, and then use their "Opinion Lexicon" to judge the sentiment score of each feature in the reviews. Summarizing customer reviews is useful not only for consumers but also for manufacturers. O'Connor et al. [11] use a sentiment dictionary containing positive and negative words and consider the ratio of sentiment words to assign the sentiment of a tweet. They propose a relative sentiment detector (RSD) and, based on it, found that Twitter messages could be a leading indicator of the Index of Consumer Sentiment (ICS), a measure of US consumer confidence. Bollen et al. [12] extracted six dimensions of mood (tension, depression, anger, vigor, fatigue, confusion) using an extended version of the Profile of Mood States (POMS), a well-established psychometric instrument containing 72 terms for judging people's mood. They compared the six dimension scores with major events in 2008 and, surprisingly, found that social, political, cultural, and economic events have a significant effect on the various dimensions of public mood. Bollen et al. [13] also investigated whether measurements of public mood states derived from Twitter are correlated with the value of the Dow Jones Industrial Average (DJIA). They measured mood in six dimensions (Calm, Alert, Sure, Vital, Kind, Happy) and expanded the original 72 terms of the POMS questionnaire to a lexicon of 964 associated terms by analyzing word co-occurrences on Google. Finally, they discovered that the "Calm" dimension has the highest Granger causality relation with the DJIA index, as shown in Figure 5.

2.1.2 Machine learning approach
In the last decade, much work on sentiment analysis focused on news articles. Fung et al. [14] took financial news articles and performed a binary classification into two categories (rise and drop) with an SVM. Another machine learning technique, naïve Bayes, was used in [15, 16] to build language models for predicting stock trends. In recent years, more work has begun to mine opinions on microblogs. Ruiz et al. [17] generate an interaction graph to represent microblogging activity and calculate the correlation between financial time series and graph features. Another work, by Mao et al. [18], compared different data sources, such as surveys, news, tweets, and search engine data, for predicting the DJIA. Machine learning approaches perform well on sentiment analysis. However, the works above share the problem that they must spend considerable effort obtaining training data. Furthermore, it is a challenge to build an appropriate model given the explosive growth of online opinion data.

2.1.3 Applications of sentiment analysis
There are many applications of sentiment analysis, including review summarization, consumer confidence, Food mood, and so on. An important domain of sentiment analysis is financial markets. A sentiment analysis system can use various sources, such as news, blogs, and tweets. One such system is TweetTrader.net, which crawls real-time streaming tweets and analyzes their sentiment to tell users which stocks are bullish or bearish.

2.2 Construction of Sentiment Lexicons
Research on the automatic construction of sentiment lexicons helps people acquire a sentiment lexicon with less effort. This research constructs sentiment lexicons based on various resources; we separate it into thesaurus-based and corpus-based methods.

2.2.1 Thesaurus-based
Hu and Liu [4] proposed a simple and effective method that uses the adjective synonym and antonym sets in WordNet to predict the semantic orientation of adjectives. Based on the synonym and antonym relations, they manually construct a set of common adjectives as a seed list, e.g., positive adjectives: great, fantastic, nice, cool; negative adjectives: bad, dull. If a word is a synonym of a positive seed word, its semantic orientation is positive; if it is an antonym of a positive seed word, its orientation is negative. They released the resulting lexicon, called the "Opinion Lexicon", which contains around 6,800 words. Kim and Hovy [2] built a word sentiment classifier using WordNet and three manually tagged sets of positive, negative, and neutral words. They expanded the manually selected seed words of each sentiment class by collecting synonyms from WordNet as training data. Their insight is that synonyms of positive words tend to have positive polarity. However, not all synonyms of positive seed words are positive, since most words can have synonym relations with the other classes. Therefore, they determine the polarity of a word from probabilities calculated for each class.

Kamps et al. [3] proposed a method to measure semantic orientation from the WordNet synonymy graph, using the adjectives "good" and "bad" as bipolar center words. They calculated the shortest path in the synonymy graph between the target word and the two bipolar center words. The core concept is similar to the work in [7], which determines semantic orientation from the target word's PMI values with "excellent" and "poor".

2.2.2 Corpus-based
Hatzivassiloglou and McKeown [6] first proposed corpus-based word-level sentiment analysis. They extended a set of adjectives by using conjunction rules extracted from a large document corpus. For example, if words are linked by "and" in a sentence, they are more likely to have the same orientation; if words are linked by "but", their orientations are probably opposite. The authors constructed a graph in which words are nodes connected by "same-orientation" or "opposite-orientation" edges. After applying a clustering algorithm, the graph was partitioned into a positive cluster and a negative cluster based on the similarity relation induced by the edges. Turney [7] used semantic orientation (SO) to assign the polarity of words extracted from reviews, observing that a word with positive semantic orientation appears together with positive opinion words more often than with negative ones. He first defines sets of positive and negative seed terms, then measures the pointwise mutual information (PMI) between the target terms and the seed terms. The semantic orientation of a target term is the sum of its PMI weights with the positive seed terms minus that with the negative seed terms; the term is estimated as positive if its SO value is greater than zero, and negative otherwise. Lu et al. [10] propose an optimization framework that provides a unified and principled way to combine different sources of information for learning a context-dependent sentiment lexicon. "Context-dependent" means that a sentiment word depends on an aspect: for example, in a laptop review, "large" is negative for the battery aspect but positive for the screen aspect. One of the main differences among sources is that review data carry overall sentiment ratings, which help fit the sentiment lexicon to the word usages of each aspect. Unlike the above research, we aim to construct a sentiment lexicon without any human effort such as overall ratings. We introduce our method in the next section.

3. METHOD
In this section, we present our proposed framework for generating a domain-specific sentiment lexicon. Figure 2 shows the framework of our method. The study is designed in three stages: candidate word extraction, construction of word graphs, and a graph-based semi-supervised label propagation method. In the first stage, candidate word extraction processes an unlabeled tweet corpus; after some essential NLP processing, we obtain candidate words without polarities. In the second stage, we use multiple resources to construct word graphs: we utilize the semantic relations in WordNet, calculate the semantic similarity of words with SOC-PMI, and use conjunction rules to refine word similarities. In the final stage, we apply semi-supervised label propagation to induce word polarities over the word graphs.

Figure 2. Proposed framework

3.1 Candidate Words Extraction
In this section, we describe the Natural Language Processing (NLP) techniques used in our work, including part-of-speech tagging, lemmatization, and stop words removal.

3.1.1 Part-of-Speech (POS) tagging
In the candidate word extraction step, we select adjectives, verbs, adverbs, and nouns as our candidate parts of speech. To extract the candidate words, we use Ark-TweetNLP [19] as our POS tagger. It performs word segmentation and POS tagging; example results are listed in Table 1, where, for example, "N" stands for common noun and "A" for adjective. The main difference between Ark-TweetNLP and other taggers is that Ark-TweetNLP is built from a tweet corpus: its authors manually annotated tweets and trained the tagger on them. The tagger output contains each word's part of speech and the probability of that part of speech. For correctness, we keep only words with probability higher than 0.5.

Table 1. Examples of tweets and results of POS tagging

Example: $AAPL Taking a long position here, with a stop below support. Double bottom with only 39% bulls

Result: $AAPL(^, 0.8936) Taking(V, 0.9968) a(D, 0.9961) long(A, 0.9966) position(N, 0.9995) here(R, 0.9735) ,(,, 0.9983) with(P, 0.9974) a(D, 0.9921) stop(N, 0.9595) below(P, 0.9624) support(N, 0.8995) .(,, 0.9979) Double(^, 0.3769) bottom(N, 0.4315) with(P, 0.9986) only(A, 0.8088) 39%($, 0.9608) bulls(N, 0.9379)

In the example in Table 1, the word "Double (^, 0.3769)" does not satisfy our candidate threshold, so it is removed from the candidate words list.
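The threshold filter just described can be sketched as follows; the (word, tag, probability) triples mimic the tagger output shown in Table 1, and the function name is illustrative rather than from the paper:

```python
# Keep only tokens whose POS is one of the candidate classes
# (adjective, verb, adverb, common noun in the Ark-TweetNLP tagset)
# and whose tagging probability clears the 0.5 threshold.
CANDIDATE_TAGS = {"A", "V", "R", "N"}

def filter_candidates(tagged_tokens, min_prob=0.5):
    """tagged_tokens: iterable of (word, tag, probability) triples."""
    return [word for (word, tag, prob) in tagged_tokens
            if tag in CANDIDATE_TAGS and prob > min_prob]
```

Applied to the example in Table 1, "Double (^, 0.3769)" is dropped both because of its tag and its low probability, and "bottom (N, 0.4315)" is dropped for falling below the threshold.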

3.1.2 Stop words removal
Stop words are words that need to be filtered out in natural language processing. For stop words removal, we use the stop words list from a Wikipedia external link1. There are 119 words in the stop words list in total, as shown in Table 2.

Table 2. Stop words list
a, able, about, across, after, all, almost, also, am, among, an, and, any, are, as, at, be, because, been, but, by, can, cannot, could, dear, did, do, does, either, else, ever, every, for, from, get, got, had, has, have, he, her, hers, him, his, how, however, i, if, in, into, is, it, its, just, least, let, like, likely, may, me, might, most, must, my, neither, no, nor, not, of, off, often, on, only, or, other, our, own, rather, said, say, says, she, should, since, so, some, than, that, the, their, them, then, there, these, they, this, tis, to, too, twas, us, wants, was, we, were, what, when, where, which, while, who, whom, why, will, with, would, yet, you, your
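The removal itself is a simple set lookup; in this sketch a short excerpt of the 119-word list stands in for the full list in Table 2:

```python
# A few entries from the stop list; the full 119-word list is given
# in Table 2.
STOP_WORDS = {"a", "able", "about", "the", "is", "was", "with", "only", "here"}

def remove_stop_words(words):
    """Drop stop words, matching case-insensitively."""
    return [w for w in words if w.lower() not in STOP_WORDS]
```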

3.1.3 Lemmatization
Stemming is the process of reducing inflected words to their stem, base, or root form. There are several types of stemming algorithms, for example Porter stemming and WordNet-based stemming. Because our goal is to construct a sentiment lexicon, we want the lexicon words to be formal words; therefore, we use WordNet-based stemming. Although we call our method WordNet-based stemming, it is actually lemmatization2. Lemmatization is closely related to stemming; the difference is that a stemmer operates on a single word without knowledge of the context and therefore cannot discriminate between words that have different meanings depending on part of speech. Two instances (from Wikipedia) explain the difference. First, the word "better" has "good" as its lemma; this link is missed by stemming, as it requires a dictionary look-up. In our work, we preserve the comparative degree of a word. Second, the word "meeting" can be either the base form of a noun or a form of the verb "to meet" depending on the context, e.g., "in our last meeting" or "We are meeting again tomorrow". Unlike stemming, lemmatization can in principle select the appropriate lemma depending on the context.

We use WordNet as our dictionary in lemmatization. If we cannot find the lemma of a candidate word, the word is removed from the candidate words list. After part-of-speech tagging, stop words removal, and lemmatization, we finally obtain 11,762 candidate words in the financial domain corpus; the statistics are shown in Table 3.

Table 3. Statistics of final candidate words

POS         # words before processing   # words after processing
Adjective   3551                        2394
Noun        10934                       6441
Adverb      913                         655
Verb        3213                        2272

1 http://en.wikipedia.org/wiki/Stop_words
2 http://en.wikipedia.org/wiki/Lemmatisation
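The lemma look-up-and-drop step can be sketched as follows. The tiny lemma table here is an illustrative stand-in for a real WordNet dictionary (in practice a library such as NLTK's WordNet interface would supply it):

```python
# Stub lemma dictionary; a real implementation would query WordNet.
# Words with no lemma entry are dropped from the candidate list,
# as described above.
LEMMAS = {
    "bulls": "bull", "bull": "bull",
    "rising": "rise", "rise": "rise",
    "bullish": "bullish",
}

def lemmatize_candidates(words):
    """Keep only words whose lemma can be found; return the lemmas."""
    return [LEMMAS[w.lower()] for w in words if w.lower() in LEMMAS]
```

A word like "xoxo", which has no WordNet lemma, would simply be filtered out of the candidate list.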

3.2 Word Graphs Construction
After candidate word extraction, we treat the candidate words as nodes in a graph; the next step is to build the edges between the nodes. In this section we show the methods used to construct the word graphs. The word graphs can be constructed by different means from a variety of resources, such as a corpus, WordNet, and web documents. The core idea is to decide the similarity between each pair of words. A word graph consists of vertices V, edges E, and an edge-weighted similarity matrix W. Every vertex (or node) can be adjacent to the other vertices, and the edge weight w_ij between vertices v_i and v_j stands for the similarity of the two words. In our work, we combine three kinds of resources: WordNet, conjunction rules, and SOC-PMI. The benefit of aggregating multiple resources is that we can extract complementary word relations. We introduce these resources and the construction of their similarity matrices in turn.
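The aggregation idea can be sketched as a weighted sum of the per-resource similarity matrices. The dict-of-dicts representation and the equal default weights are illustrative assumptions, not the paper's exact combination scheme:

```python
# Combine similarity matrices from several resources (WordNet,
# conjunction rules, normalized SOC-PMI) into one word graph.
def aggregate(matrices, weights=None):
    """matrices: list of dict-of-dict similarity matrices over the same words."""
    weights = weights or [1.0] * len(matrices)
    combined = {}
    for W, a in zip(matrices, weights):
        for i, row in W.items():
            for j, s in row.items():
                combined.setdefault(i, {}).setdefault(j, 0.0)
                combined[i][j] += a * s
    return combined
```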

3.2.1 WordNet
WordNet [20] is a lexical database that groups English words into sets of synonyms called synsets. Synsets provide short, general definitions and record the various semantic relations between these synonym sets. The most recent Windows version of WordNet is 2.1, released in March 2005; it contains 155,327 unique strings (words), 117,597 synsets, and 207,016 word-sense pairs. A synset in WordNet is linked to other synsets by means of a small number of "conceptual relations". Additionally, a synset contains a "gloss" giving a brief definition, and most synsets also have example sentences illustrating the usage of their members. A word with several distinct meanings belongs to different synsets; thus, a unique word can have more than one word-sense pair. The main relation among words in WordNet is "synonymy": synonyms are words with the same or similar meanings. For example, "big" and "large" have the same meaning. In

addition to synonymy, another relation, "antonymy", means that two words have opposite semantic meanings, like "open" and "close". Based on these two relations, we set the similarity matrix W entries as defined in Eq. (1):

$$ w_{ij} = \begin{cases} 1 & \text{if } w_i \text{ and } w_j \text{ are synonyms} \\ -1 & \text{if } w_i \text{ and } w_j \text{ are antonyms} \\ 0 & \text{otherwise} \end{cases} \quad (1) $$
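Eq. (1) amounts to a three-way lookup. A sketch with stub relation sets standing in for WordNet (a real implementation would read synsets and antonym links from the WordNet database):

```python
# Stub synonym/antonym relations; real code would query WordNet.
SYNONYMS = {frozenset({"big", "large"})}
ANTONYMS = {frozenset({"open", "close"})}

def wordnet_weight(wi, wj):
    """Similarity matrix entry w_ij as defined in Eq. (1)."""
    pair = frozenset({wi, wj})
    if pair in SYNONYMS:
        return 1
    if pair in ANTONYMS:
        return -1
    return 0
```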

3.2.2 Conjunction rules
In previous research, Hatzivassiloglou and McKeown [6] proposed a corpus-based method to extend a set of adjectives by using conjunction rules extracted from a large document corpus. The key concept is that two words linked by "and" tend to have the same polarity, while two words linked by "but" tend to have opposite polarities. Therefore, we extract conjunction pairs to determine the relation between two words. To extract conjunction pairs from tweets, dependency parsing helps us identify the relation between two words in a tweet: it transforms a sentence into a tree structure in which nodes are words and edges stand for the relationships among the words. In our work, we apply the Stanford Parser [21] as our dependency parser. Given the tweet "@culator $SLW CCI is bullish and rising", we can extract the conjunction pair "conj_and (bullish-4, rising-6)". However, most tweets contain Twitter-specific symbols such as #hashtag, @username, $ticker, and URLs, which make the dependencies the Stanford Parser produces for tweets worse than those for formal sentences. For this reason, we propose a clean-up procedure that removes noise words and restores words to their appropriate form to improve the quality of dependency parsing. We describe the clean-up procedure step by step in Table 4 and give examples in Table 5.

Table 4. Clean-up procedure before dependency parsing

Step   Description                                          Purpose
1      Remove any URL in tweets.                            URL removal
2      Remove RT @username in front of tweets.              Retweet tag removal
3      Replace any $ticker with STOCKTICKER in tweets.      Stock ticker removal
4      Remove @username in front of tweets.                 Reply tag removal
5      Remove #hashtag at the end of tweets.                Hashtag tag removal
6      Remove all prefixes (@/#) in the middle of tweets.   #hashtag/@username normalization

Table 5. Examples of the raw tweets and clean-up tweets

Tweet 1: Happy New Year! I'm very bullish for 2012, but @jaltucher has done a much better job or articulating why than I could http://t.co/FJn849PX
After Clean-up (Steps 1, 6): Happy New Year! I'm very bullish for 2012, but jaltucher has done a much better job or articulating why than I could
Tweet 2: RT @tvanderrank @jamelm6 $LLY CCI is bullish and rising #AAPL
After Clean-up (Steps 2, 3, 4, 5): STOCKTICKER CCI is bullish and rising

After extracting the conjunction pairs, we describe the word relations as Eq. (2). The entry w_ij in the similarity matrix W is the edge weight between w_i and w_j. The frequency of w_i and w_j being linked with "and" is represented as d^and_ij, and d^but_ij is the frequency of w_i and w_j being linked with "but". We adopt a logarithmic transform of the sum of the linkages to avoid the huge edge weights that would otherwise arise among frequent words:

$$ w_{ij} = \begin{cases} \log\left(d^{\mathrm{and}}_{ij} + 1\right) & \text{if } w_i \text{ and } w_j \text{ are linked by "and"} \\ -\log\left(d^{\mathrm{but}}_{ij} + 1\right) & \text{if } w_i \text{ and } w_j \text{ are linked by "but"} \end{cases} \quad (2) $$

Although dependency parsing extracts the conjunction pairs, it does not take negation into consideration. For instance, from the sentence "he is strong and not fat." we would extract the conjunction pair "conj_and (strong, fat)". However, this pair ignores the negation, which would increase the probability of "strong" and "fat" being assigned the same polarity. To handle the negation problem, we define a set of negation words: if a conjunction pair "conj_and (strong, fat)" comes with a dependency "neg (fat, not)" or "neg (fat, never)", the pair is transformed into "conj_but (strong, fat)".

3.2.3 SOC-PMI
There are many resources, such as WordNet and conjunction rules, that can be used to generate a sentiment lexicon. Because we aim to generate a domain-specific lexicon rather than a general-purpose one, we need to add more semantic relations extracted from the domain corpus. In previous research, Turney [7] used semantic orientation (SO) to assign the polarity of words from customer reviews or from a search engine. The core concept of the method is the co-occurrence of two words. It is a simple and powerful way to calculate the semantic similarity of two target words; even so, it is not suitable for short texts like tweets. For instance, consider two tweets in which the words "Car" and "Automobile" never co-occur, but each appears near the word "Market".

In these two tweets, the PMI score of "Car" and "Automobile" is zero. However, we know the two words have the same semantic meaning. Therefore, we choose another corpus-based method to calculate the semantic similarity of two target words, called Second Order Co-occurrence PMI (SOC-PMI) [22]. It uses pointwise mutual information (PMI) to sort the lists of important neighbor words of the two target words, then takes the words that appear in both lists into consideration and aggregates their PMI values from the other list. In the two tweets above, although the target words never occur together, SOC-PMI can still calculate their semantic similarity through their common neighbor "Market". We introduce the algorithm and the function definitions of SOC-PMI in the next paragraphs.
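Before the formal definitions, the second-order idea can be illustrated with a much-simplified toy score. This sketch only sums the positive PMI that both targets share with common neighbors; the β cutoff, the sorting, and the γ exponent of the actual method below are omitted:

```python
import math
from collections import Counter

def cooccurrence(corpus, window=2):
    """Count word frequencies and windowed co-occurrences over tokenized texts."""
    freq, pairs, total = Counter(), Counter(), 0
    for toks in corpus:
        total += len(toks)
        freq.update(toks)
        for k, w in enumerate(toks):
            for j in range(1, window + 1):
                if k + j < len(toks):
                    pairs[frozenset({w, toks[k + j]})] += 1
    return freq, pairs, total

def pmi(a, b, freq, pairs, total):
    """First-order PMI; zero when the pair never co-occurs."""
    c = pairs.get(frozenset({a, b}), 0)
    if c == 0 or a == b:
        return 0.0
    return math.log2(c * total / (freq[a] * freq[b]))

def soc_pmi_toy(s1, s2, corpus, window=2):
    """Simplified second-order score: sum the positive PMI that s1 and s2
    both share with common neighbor words."""
    freq, pairs, total = cooccurrence(corpus, window)
    score = 0.0
    for w in freq:
        p1, p2 = pmi(w, s1, freq, pairs, total), pmi(w, s2, freq, pairs, total)
        if p1 > 0 and p2 > 0:
            score += p1 + p2
    return score
```

On a toy corpus where "car" and "automobile" never co-occur but both neighbor "market", the first-order PMI is zero while the second-order score is positive, which is exactly the behavior motivating SOC-PMI.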

Let s and s be the two target words that we want to calculate { } denotes a the semantic similarity and large corpus of text containing m words. We use { } to denote a set of unique words in corpus . Throughout this section, we will use to denote either 𝑠 or 𝑠 . We set a window size of + 1, which in our experiment. The parameter means how many words before and after the target word . We demonstrate the type frequency function as Eq ( 3 ) which means how many times the unique word 𝑖 appeared in the entire corpus. 𝑓𝑡

|{𝑘:

𝑖

𝑖 }|

𝑘

where 𝑖=1,2…,n

(3)

Eq ( 4 ) is the bigram frequency function, which tells us how many times word 𝑖 appeared with word S in the window. 𝑓𝑏

|{𝑘:

𝑖

𝑎𝑛𝑑

𝑘

𝑘±𝑗

𝑤here 𝑖=1,2,…,n and −

𝑖

(4)

}|

≤𝑗≤

Then we demonstrate pointwise mutual information function (Eq ( 10 )) for those two words having 𝑓 𝑏 𝑖 0, 𝑓 pm

𝑙𝑜𝑔

𝑖

𝑓𝑏 𝑓𝑡

𝑚

𝑖 𝑖

(5)

𝑓𝑡

and is total number of words in corpus . For word 𝑠 , we define a set of words, , sorted in descending order by their PMI values with 𝑠 and taken the o − words having 𝑓 𝑖 𝑡𝑖 >0. For word 𝑠 , we follow the same step above and get the set and . The value for depend on the word 𝑠 and the number of words in the corpus. We define as (log(𝑓 𝑡 𝑠𝑖 ))

𝑖

h

𝑖

1

log 𝑛 𝛿

(6)

where 𝛿 is a constant. The value of 𝛿 depends on the corpus size. The smaller the corpus we use, the smaller the value of 𝛿 we choose. In our experiment, we set δ for 2. Then, we define the − summation function. For word 𝑠 , the − summation function is: 𝛽1

𝑓𝛽 𝑠 h

𝑖

∑ (𝑓 𝑓

𝑖=

𝑖

𝑖

𝑠

𝑖

𝑠 )

(7)

γ

0 and 𝑓

𝑖

𝑖

𝑠

Finally, we demostrate semantic PMI similarity between two words, 𝑠 and 𝑠 , s s

𝑓𝛽 𝑠

3.3 Token Opinion Propagation After construction of words graphs, we use a graph-based semisupervised propagation method to propagate the polarity from seed words to unlabeled words. We aim to generate a sentiment lexicon relying on few seed words. Therefore, less human efforts are our purpose. For this reason, we use the semisupervised propagation instead of the supervised methods.

3.3.1 Propagation Traditional classifiers use labeled data to train the prediction model. However, labeled data are often expensive, timeconsuming to obtain, since they need the efforts of experienced human to annotate. On the contrary, unlabeled data are relatively easy to collect. Semi-supervised learning addresses this problem by using large amount of unlabeled data, together with the labeled data. Since we aim to automatically generate a sentiment lexicon, less efforts of human is our goal. We treat our problem as a graph-based semi-supervised learning. Graph-based semisupervised methods define a graph where the nodes are labeled and unlabeled words, and edges reflect the similarity of words. Let 𝑦 𝑦 be labeled words (tokens), where {𝑦 𝑦 𝑦 } are the class labels. Let {𝑦 be unlabeled words (tokens) where 𝑦 𝑦 { } , where 𝑖 are unlabeled class. Let Our goal is to estimate from and .

0

+

𝑓𝛽 𝑠

function (8)

We use semantic PMI similarity function to determine the semantic similarity of any two words in the corpus. If two words have more common important neighbors, the value of semantic similarity will be higher. Notice that SOC-PMI values don’t have the maximum boundary values which may cause the imbalance when we aggregate all the similarity matrices.


Intuitively, the concept of semi-supervised propagation is that nearby nodes are likely to have the same label. Node labels propagate to all nodes through the edges, so edges with larger weights let labels affect their neighborhoods more strongly. As mentioned in Section 3.2, we construct similarity matrices from three kinds of resources; each matrix entry $w_{ij}$ is the similarity between words $w_i$ and $w_j$. We then define an $(l+u) \times (l+u)$ probabilistic transition matrix $T$ transformed from the similarity matrix $W$, where $l$ is the number of labeled instances and $u$ is the number of unlabeled instances:

We fix the value of $\gamma$ in our experiment; the higher it is, the greater the emphasis on words having high PMI values. We sum all the positive PMI values of the words in both sets. Similarly, for word $s_2$, we follow Eq. (7) to calculate its $\beta$-summation. This function aggregates the positive PMI values of the important neighbors that $s_1$ and $s_2$ have in common.


Therefore, we normalize the SOC-PMI values and scale them into the range [0, 1].

$$T_{ij} = P(j \to i) = \frac{w_{ij}}{\sum_{k=1}^{l+u} w_{kj}} \quad (9)$$
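Eq. (9) is a column normalization of the similarity matrix; a minimal stdlib-only sketch with made-up similarity values:

```python
def transition_matrix(W):
    """Eq. (9): T[i][j] = W[i][j] / sum_k W[k][j], the probability that
    node j propagates its label to node i (each column sums to 1)."""
    n = len(W)
    col_sums = [sum(W[k][j] for k in range(n)) for j in range(n)]
    return [[W[i][j] / col_sums[j] if col_sums[j] else 0.0
             for j in range(n)]
            for i in range(n)]

# hypothetical symmetric similarity matrix for three words
W = [[1.0, 0.8, 0.1],
     [0.8, 1.0, 0.3],
     [0.1, 0.3, 1.0]]
T = transition_matrix(W)
```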

where $T_{ij}$ is the probability of propagating a label from node $j$ to node $i$. We then define an $(l+u) \times C$ class matrix $Y$, whose rows represent the label distributions of the nodes, where $C$ is the number of classes. The first step of the iterative process is to calculate the following equation:

$$Y \leftarrow \alpha T Y + (1 - \alpha) Y \quad (10)$$

where $\alpha$ is a damping factor that prevents the label values from growing without bound; it also decides what fraction of a node's label value comes from its neighbors in each iteration. If $\alpha$ approaches 0, label values are obtained from neighbors slowly; conversely, if $\alpha$ approaches 1, a node's label is decided entirely by its neighbors. We set $\alpha$ to 0.3 in our experiment. In the first iteration, only the nodes connected to seed words obtain label values: the more similar a node is to the seed words, the larger the label value it receives. As the iterations continue, even a node with low closeness to the seed words obtains label values from its neighboring nodes. In the second step, we clamp the class matrix of the labeled data to its initial state. With this constant "push" from the labeled data to the unlabeled data, the unlabeled data gradually obtain label values during the iterative process.


In our experiment, we have similarity matrices from three different resources. We can choose one of them, or aggregate all of them, to serve as the similarity matrix in label propagation. We define the aggregation function in Eq. (11).

Table 7. Example of propagation

$$W_{All} = \lambda_1 \cdot W_{WordNet} + \lambda_2 \cdot W_{Conjunction} + \lambda_3 \cdot W_{SOC\text{-}PMI} \quad (11)$$

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are weights decided by the characteristics of each similarity matrix. We also conduct experiments with various weights to tune the performance. For a better understanding of the propagation process, we demonstrate an example in the next sub-section.
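A sketch of Eq. (11) together with the earlier [0, 1] scaling of the unbounded SOC-PMI values; the matrices and the $\lambda$ weights here are hypothetical placeholders.

```python
def min_max_scale(M):
    """Scale every entry of a matrix into [0, 1], as done for the
    unbounded SOC-PMI similarities before aggregation."""
    vals = [v for row in M for v in row]
    lo, hi = min(vals), max(vals)
    if hi == lo:
        return [[0.0 for _ in row] for row in M]
    return [[(v - lo) / (hi - lo) for v in row] for row in M]

def aggregate(matrices, weights):
    """Eq. (11): element-wise weighted sum of the per-resource matrices."""
    n = len(matrices[0])
    return [[sum(w * M[i][j] for w, M in zip(weights, matrices))
             for j in range(n)]
            for i in range(n)]

# hypothetical 2x2 similarity matrices from the three resources
W_wordnet     = [[1.0, 0.4], [0.4, 1.0]]
W_conjunction = [[1.0, 0.7], [0.7, 1.0]]
W_socpmi      = min_max_scale([[12.5, 3.0], [3.0, 12.5]])
W_all = aggregate([W_wordnet, W_conjunction, W_socpmi],
                  weights=[1.0, 1.0, 1.0])
```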

3.3.2 A walk-through example To make the algorithm easier to follow, we give a walk-through example showing how the propagation evolves. Given a similarity matrix $W$ over a small set of words including "good", "nice", and "bullish", and an initial class matrix $Y$ in which only the seed words carry label values, we transform $W$ into the transition matrix $T$ and show the propagation process step by step in Table 7. In this toy example, we set the convergence threshold to 0.01, and the class matrix converges after fourteen iterations. The number of iterations is sensitive to the threshold and the damping factor: a smaller convergence threshold increases the number of iterations needed to reach a stable state. Finally, "nice" and "bullish" both obtain larger positive label values than negative ones, so we predict these two words as positive.

4. EXPERIMENTS In this section, we present experimental evaluation results to assess the effectiveness of our method. We show that our method can propagate appropriate label values to the unlabeled nodes through the word graphs. Furthermore, we show that our method works more effectively than the baseline methods and general-purpose lexicons.

4.1 Dataset

Table 6. Statistics of the Twitter data set

keyword | Period of time           | # users | # tweets
Bullish | 2012/01/01 to 2012/12/31 | 68,077  | 335,912
Bearish | 2011/01/01 to 2012/12/31 | 28,241  | 312,181

We choose the microblog platform Twitter as our data source and use the TOPSY Search API to fetch financial-domain tweets with two keywords, "bullish" and "bearish". TOPSY, a search engine for Twitter, indexes tweets published on Twitter and analyzes their contents.


With the help of the TOPSY Search API, we can use keywords to fetch related tweets. Our motivation for choosing these keywords is their special meanings: "bullish" means expecting prices to rise, while "bearish" means expecting prices to fall. In total, we crawl around 640 thousand tweets; the statistics are shown in Table 6. To balance the distribution of the two keywords, we fetch bearish tweets over a doubled period of time.

4.2 Ground-truth Dictionaries We evaluate our automatically generated sentiment lexicon against three dictionaries. However, we found that the General Inquirer (GI)3 and the Financial Sentiment dictionary (FS)4 do not correctly tag some words according to their usage in the specific domain. For example, "cool" is tagged as negative in GI. In addition, some words, such as "long", are not included in GI or FS: in options trading, to go "long" is to buy a market position, which is bullish toward the option. Thus, we decided to build another gold-standard dictionary labeled by humans (HL); we labeled the candidate words ourselves.

4.3 Evaluation Metric To evaluate the polarity of our lexicon words, we choose precision, recall, and F-measure as evaluation metrics. The precision metric is:

$$Precision = \frac{\#\ words\ with\ correct\ polarity}{\#\ words\ with\ polarity} \quad (12)$$

4.5 Seed Words We show the seed-word list in Table 8. To construct a domain-specific sentiment lexicon, choosing the seed words is an important step, since the seed words strongly influence the polarities assigned to the remaining words. In the positive class, we select words such as "beneficial", "profitable", and "booming", which have positive meanings with respect to the operating status of companies.

Table 8. Seed words list in each class

Polarity | Words
Positive | bullish, good, optimistic, nice, excellent, positive, correct, superior, lucky, greatest, superb, admirable, awesome, best, premium, booming, profitable, wonderful, beautiful, constructive, gorgeous, upbeat, beneficial, perfect, kind, better, prosperous, happy, precious, vibrant
Negative | bearish, bad, pessimistic, nasty, poor, negative, wrong, inferior, unfortunate, troubling, horrid, abusive, disturbing, egregious, worried, jobless, infuriating, inexorable, indignant, insolvent, inconsolable, ashamed, scary, shoddy, bogus, pernicious, unrealistic, unwilling, horrific, fake

Precision tells us how many of the words assigned a polarity by our system received the correct one.

$$Recall = \frac{\#\ words\ with\ polarity}{\#\ words\ with\ polarity\ in\ the\ ground\ truth} \quad (13)$$

Recall tells us for how many words in the ground-truth dictionaries we can predict a polarity.

$$F\text{-}measure = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} \quad (14)$$
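Under one natural reading of Eqs. (12)-(14) — precision over the predicted words that appear in the ground truth, recall as the covered fraction of the ground truth — the metrics look like this; the word lists are hypothetical.

```python
def evaluate(predicted, gold):
    """predicted and gold both map word -> polarity ('pos' or 'neg')."""
    covered = [w for w in predicted if w in gold]             # evaluable words
    correct = sum(1 for w in covered if predicted[w] == gold[w])
    precision = correct / len(covered) if covered else 0.0    # Eq. (12)
    recall = len(covered) / len(gold) if gold else 0.0        # Eq. (13)
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)                      # Eq. (14)
    return precision, recall, f

predicted = {"good": "pos", "bad": "neg", "cool": "neg"}
gold = {"good": "pos", "bad": "neg", "cool": "pos", "long": "pos"}
p, r, f = evaluate(predicted, gold)
```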

Our goal is to generate a sentiment lexicon with balanced performance on precision and recall; therefore, a higher F-measure stands for better performance.

4.6 Analysis of Converged Threshold

4.4 Baselines We compare our system against the following baseline methods from related work to demonstrate its effectiveness:

4.4.1 Heuristic conjunction rules (CR) In brief, when a target word is connected to a seed word by a conjunction, the two words receive the same polarity. We resolve conflicts in a target word's polarity by summing its linkages to each seed word.
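A minimal sketch of the CR baseline as described above; the conjunction-pair counts are invented for illustration.

```python
from collections import defaultdict

def conjunction_polarity(pairs, seeds):
    """pairs: (word_a, word_b, count) conjunction co-occurrences.
    seeds: word -> +1 (positive) or -1 (negative).
    A target word takes the polarity of the seed class it links to most."""
    score = defaultdict(float)
    for a, b, count in pairs:
        for target, other in ((a, b), (b, a)):
            if target not in seeds and other in seeds:
                score[target] += seeds[other] * count
    # the sum of linkages decides conflicting cases
    return {w: (1 if s > 0 else -1) for w, s in score.items() if s != 0}

pairs = [("profitable", "good", 3),   # e.g. "profitable and good"
         ("profitable", "bad", 1),    # conflicting evidence, outvoted
         ("shoddy", "bad", 2)]
seeds = {"good": 1, "bad": -1}
polarity = conjunction_polarity(pairs, seeds)
```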

4.4.2 Conjunction rules with clustering (CR_HM) [6] Adjectives connected by same-orientation links have a dissimilarity of zero; conversely, different-orientation links have a dissimilarity of one. Conjunction pairs with no connecting links are assigned a neutral dissimilarity of 0.5. A clustering algorithm splits the graph into two subsets, and the cluster with the higher average frequency is classified as positive.

4.4.3 WordNet closeness (WC) [2] Kim and Hovy expanded the seed words by collecting synonyms in WordNet as a training set and calculated the closeness between each target word and each class.

4.4.4 Geodesic distance (GD) [3] Geodesic distance determines the semantic orientation of a target word by the length of the shortest path from it to the bipolar words "good" and "bad" in the WordNet synonym graph.
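The GD baseline reduces to comparing BFS shortest-path lengths to "good" and "bad"; a sketch over a hypothetical synonym adjacency list (the real method walks the WordNet synonym graph):

```python
from collections import deque

def bfs_distance(graph, src, dst):
    """Length of the shortest path from src to dst, or None if unreachable."""
    if src == dst:
        return 0
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, dist = frontier.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor == dst:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, dist + 1))
    return None

def geodesic_polarity(graph, word):
    """Positive if the word is closer to 'good' than to 'bad', negative
    if closer to 'bad', and 0 when undecidable."""
    d_pos = bfs_distance(graph, word, "good")
    d_neg = bfs_distance(graph, word, "bad")
    if d_pos is None or d_neg is None or d_pos == d_neg:
        return 0
    return 1 if d_pos < d_neg else -1

# hypothetical undirected synonym graph (both directions listed)
graph = {
    "superb": ["excellent"], "excellent": ["superb", "good"],
    "good": ["excellent", "fair"], "fair": ["good", "mediocre"],
    "mediocre": ["fair", "bad"], "bad": ["mediocre", "awful"],
    "awful": ["bad"],
}
```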

3 http://www.wjh.harvard.edu/~inquirer/homecat.htm
4 http://www3.nd.edu/~mcdonald/Word_Lists.html

Figure 3. Performance of different thresholds

Since our algorithm is an iterative process, setting the threshold on the difference between iterations is an important issue. In this section, we show the performance under various thresholds. Figure 3 depicts the F-measure for different thresholds. We observe that when the threshold is less than or equal to 0.01, the performance becomes stable. Based on this observation, we set the threshold to 0.01 in the following experiments.

4.7 Influence of the Damping Factor The damping factor in Eq. (10) is closely related to the speed of convergence; it stands for how much label value a node obtains in each iteration. It can also control the situation in which a node obtains opposite label values from its connected nodes while the node states are still highly unstable. Therefore, we test damping factors ranging from 0.1 to 1 and observe the performance. In this experiment, we set the convergence threshold to 0.01 and aggregate all the similarity matrices with equal weights. Under this setting, we found that a damping factor of 0.3 achieves the best performance.

Finally, we aggregate all the similarity matrices with tuned weights, achieving the best performance.

5. CONCLUSION AND FUTURE WORK

Figure 4. Influence of damping factor on precision Figure 4 shows that precision increases significantly with damping factor from range 0.1 to 0.3. We discover the main reason of this phenomenon is because smaller damping factor cause it to converge early. On the contrary, larger damping factor can let the propagation process converges quickly to a stable status. Nevertheless, since the damping factor is large, some of nodes will propagate its label values in unstable statuses. Unstable status means a node polarity is changed quickly in the propagation process. Therefore, this experiment conducts a proper damping factor for our proposed method. In previous researches of generating a sentiment lexicon almost focused on the adjectives. Although our method can apply on any POS, we follow previous researches to evaluate our lexicon on adjectives. We use the human-labeled dictionary as the main ground-truth dictionary. In addition, we demonstrate the performance by the other dictionaries. We use ,where x = similarity matrix from different resources, to denote our method. sets weights after tuning ( 1 1) to each similarity matrix. First, we choose the human-labeled dictionary to evaluate the automatic generated lexicon. Experimental results show in Table 9. Table 9. Performance evaluated by HL dictionary Method

Method              | Precision | Recall | F-measure
Baseline:
  CR                | 0.7561    | 0.1051 | 0.1846
  CR_HM             | 0.5231    | 1      | 0.6700
  WC                | 0.8292    | 0.5438 | 0.6569
  GD                | 0.7977    | 0.6175 | 0.6961
Public dictionary:
  SentiWordNet      | 0.7767    | 0.8540 | 0.8135
  OpinionLexicon    | 0.9750    | 0.6181 | 0.7566
TOP_WordNet         | 0.7784    | 0.8616 | 0.8178
TOP_Conjunction     | 0.6730    | 0.1684 | 0.2695
TOP_SOC-PMI         | 0.5836    | 1      | 0.7371
TOP_All             | 0.7476    | 1      | 0.8556

The performance varies significantly across resources. WordNet and the Conjunction rule yield higher precision, and WordNet achieves the most balanced trade-off between precision and recall. From another perspective, owing to the lack of conjunction pairs, the recall of the Conjunction rule is lower than that of the other methods.

In this paper, we studied the problem of automatically generating a domain-specific sentiment lexicon from an unlabeled tweet corpus. We utilize several useful resources, including WordNet, conjunction rules, and SOC-PMI, to construct word graphs; based on these graphs, label values propagate from the seed words to the unlabeled words. Finally, we compare our automatically generated sentiment lexicon with baseline methods and general-purpose lexicons. The experimental results show that our method is able to extract, or repair, word polarities for the target domain. In future work, to strengthen the reliability of the sentiment lexicon, we could utilize additional information such as stock prices. Furthermore, many approaches exist for deriving word similarities from a corpus, and choosing an approach that works well for any corpus remains a challenge. In addition, with the rapid growth of social media, more and more research focuses on big data, of which sentiment analysis applications are an important part. How to automatically generate sentiment lexicons over streaming data will be a topic worth discussing in the future.

6. ACKNOWLEDGMENTS The authors are supported in part by the National Science Council Project No. NSC 101-2221-E-006-261 Taiwan, R.O.C.

7. REFERENCES
[1] A. Esuli and F. Sebastiani, "SentiWordNet: A publicly available lexical resource for opinion mining," in Proceedings of the 5th Conference on Language Resources and Evaluation (LREC), 2006, pp. 417-422.
[2] S.-M. Kim and E. Hovy, "Identifying and analyzing judgment opinions," in Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), New York, 2006.
[3] J. Kamps, M. Marx, R. J. Mokken, and M. de Rijke, "Using WordNet to measure semantic orientation of adjectives," in Proceedings of the Conference on Language Resources and Evaluation (LREC), 2004, pp. 1115-1118.
[4] M. Hu and B. Liu, "Mining and summarizing customer reviews," in Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA, 2004.
[5] D. Rao and D. Ravichandran, "Semi-supervised polarity lexicon induction," in Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Athens, 2009.
[6] V. Hatzivassiloglou and K. R. McKeown, "Predicting the semantic orientation of adjectives," in Proceedings of the 35th Annual Meeting of the ACL and the 8th Conference of the EACL, 1997, pp. 174-181.
[7] P. D. Turney, "Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, 2002.
[8] P. D. Turney and M. L. Littman, "Measuring praise and criticism: Inference of semantic orientation from association," ACM Transactions on Information Systems, vol. 21, pp. 315-346, 2003.
[9] T. Loughran and B. McDonald, "When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks," Journal of Finance, 2010.
[10] Y. Lu, M. Castellanos, U. Dayal, and C. Zhai, "Automatic construction of a context-aware sentiment lexicon: An optimization approach," in Proceedings of the 20th International Conference on World Wide Web (WWW), Hyderabad, 2011.
[11] B. O'Connor, R. Balasubramanyan, B. R. Routledge, and N. A. Smith, "From tweets to polls: Linking text sentiment to public opinion time series," in Proceedings of the 4th International AAAI Conference on Weblogs and Social Media (ICWSM), 2010.
[12] J. Bollen, A. Pepe, and H. Mao, "Modeling public mood and emotion: Twitter sentiment and socio-economic phenomena," CoRR, 2009.
[13] J. Bollen, H. Mao, and X. Zeng, "Twitter mood predicts the stock market," Journal of Computational Science, vol. 2, pp. 1-8, 2011.
[14] G. P. C. Fung, J. X. Yu, and W. Lam, "News sensitive stock trend prediction," in Proceedings of the 6th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (PAKDD), 2002.
[15] V. Lavrenko, M. Schmill, D. Lawrie, P. Ogilvie, D. Jensen, and J. Allan, "Mining of concurrent text and time series," in Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Workshop on Text Mining, 2000.
[16] G. Gidofalvi, "Using news articles to predict stock price movements," Department of Computer Science and Engineering, University of California, San Diego, 2001.
[17] E. J. Ruiz, V. Hristidis, C. Castillo, and A. Gionis, "Correlating financial time series with micro-blogging activity," in Proceedings of the 5th ACM International Conference on Web Search and Data Mining (WSDM), 2012, pp. 513-522.
[18] H. Mao, S. Counts, and J. Bollen, "Predicting financial markets: Comparing survey, news, Twitter and search engine data," 2011.
[19] K. Gimpel, N. Schneider, B. O'Connor, D. Das, D. Mills, J. Eisenstein, et al., "Part-of-speech tagging for Twitter: Annotation, features, and experiments," in Proceedings of the 49th Annual Meeting of the ACL: Human Language Technologies, Short Papers, Portland, OR, 2011.
[20] G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller, "Introduction to WordNet: An on-line lexical database," International Journal of Lexicography, vol. 3, no. 4, pp. 235-312, 1990.
[21] M.-C. de Marneffe, B. MacCartney, and C. D. Manning, "Generating typed dependency parses from phrase structure parses," in Proceedings of the Conference on Language Resources and Evaluation (LREC), 2006.
[22] A. Islam and D. Inkpen, "Second order co-occurrence PMI for determining the semantic similarity of words," in Proceedings of the International Conference on Language Resources and Evaluation (LREC), 2006, pp. 1033-1038.
