
KAIS manuscript No. (will be inserted by the editor)

Mood Sensing from Social Media Texts and Its Applications

Thin Nguyen · Dinh Phung · Brett Adams · Svetha Venkatesh

Received: May 16, 2012 / Revised: Jan 30, 2013 / Accepted: Feb 08, 2013

Abstract

We present a large-scale mood analysis in social media texts. We organize the paper in three parts: 1) addressing the problem of feature selection and classification of mood in the blogosphere; 2) extracting global mood patterns at different levels of aggregation from a large-scale dataset of approximately 18 million documents; and 3) extracting the mood trajectory for an egocentric user and studying how it can be used to detect subtle emotion signals in a user-centric manner, supporting discovery of hyper-groups of communities based on sentiment information. For mood classification, two feature sets proposed in psychology are used, showing that these features are efficient, do not require a training phase and yield classification results comparable to state-of-the-art, supervised feature-selection schemes; on mood patterns, empirical results for mood organisation in the blogosphere are provided, analogous to the structure of human emotion proposed independently in the psychology literature; and on community structure discovery, a sentiment-based approach can yield useful insights into community formation.

T. Nguyen, D. Phung, S. Venkatesh
School of Information Technology, Deakin University, Geelong Waurn Ponds Campus, Australia. E-mail: {thin.nguyen,dinh.phung,svetha.venkatesh}@deakin.edu.au
B. Adams
Department of Computing, Curtin University, Perth, Australia. E-mail: [email protected]

1 Introduction

So much of the web today is two-way; that is, users are able not only to read commercially owned content but also to respond to it. The facility to comment on what was previously only broadcast media, such as news articles, has led to the creation of an unprecedented amount of associated sentiment-laden text. This adoption of user contribution by the on-line arms of traditional media and industry seems to have been driven by the emergence of social media (YouTube, Flickr, Facebook, Twitter and blogs) through which whole communities contribute media of various types and transact via on-line communication channels. Social media's uniquely 'egocentric'¹ nature means that these communities and the artifacts they produce constitute sentiment-laden corpora in their own right.²

Hence, user-generated content is usually opinionated and/or sentiment-bearing, bringing opportunity, for example, for a company to gain insight into consumer opinions about its products and those of its competitors. Thus the ability to identify opinion sources on the Web and monitor them is a growing research field, termed opinion mining. This line of research mainly focuses on subjectivity detection, sentiment classification, joint topic-sentiment analysis, and opinion summarisation. In this paper, we focus on a popular form of sentiment: mood. Mood is a strong form of sentiment expression, conveying a state of the mind such as being happy, sad or angry. Social media texts are rich in sentiment and this paper discusses various fundamental issues related to mood sensing from these texts and novel applications of this information.

Text-based mood classification and clustering, as a sub-problem of opinion and sentiment mining, have many potential applications, as identified in [28], such as automated recommendation for product websites, as a sub-component of web technology in business and government intelligence, or for the collection of empirical evidence for studies in psychological and behavioural sciences. Specifically, in the blogosphere, mood classification can be used to filter search results, to ascertain the mental health of communities, or to gain detailed insight into patterns of how bloggers behave and relate to one another.

However, text-based mood analysis poses additional challenges beyond standard text categorisation and clustering. The complex cognitive processes of mood formulation make it dependent on the specific social context of the user, their idiosyncratic associations of mood and vocabulary, their syntax and style, which reflect on language usage (for example, the order of linguistic components), and the specific genre of the text. In the case of the weblogs studied in this paper, these challenges are reflected in the diverse styles of expression of the bloggers, the relatively short text length and the use of informal language, such as jargon, abbreviations and non-standard grammar. Feature-selection methods available in machine learning are often computationally expensive, relying on labelled data to learn discriminative features. However, the blogosphere is

vast (reaching almost 130 million users³) and is continuing to grow, making it desirable to construct a feature set that works without requiring supervised feature training to classify mood. To this end, it is necessary to look to the results of studies that intersect Psychology and Linguistics. Doing so reveals two potentially useful systems: (1) the sentiment-bearing lexicon known as ANEW [6] and (2) psycholinguistic features drawn from the LIWC [31]. It is proposed that these systems are suitable for use in mood classification.

Our contribution is three-fold. First, we conduct a comparative study of machine learning-based text feature selection for the specific problem of mood classification, elucidating insights into what can be transferred from a generic text-categorisation problem for mood classification. Second, we formulate a novel use of two psychology-inspired sets of features for mood classification that do not require supervised feature

1 For example, blog text was found to have a higher occurrence of the 1st person singular than conversations [31].
2 For example, one of the main reasons for writing, cited by bloggers, is to speak their minds: www.intac.net/breakdown-of-the-blogosphere/, accessed August 2011.
3 From the state of the blogosphere 2008 at http://technorati.com.


learning and are thus useful for large-scale mood classification, and use them to obtain empirical results for mood organization of a large dataset drawn from the blogosphere, which contains the largest set of mood ground truth available to date. To our knowledge, we are the first to consider the problem of data-driven mood pattern discovery at this scale. Third, we examine the potential for mood to reveal hyper-communities, or inter-community boundaries, not apparent from topical analysis. Our study includes results for community clustering using topical, mood-based, and psycholinguistic features, in a comparative analysis that is the first of its kind.

Our results have significance for all of the application domains noted above. Our comparison of mood estimators can serve as a basis for deciding which feature to use in a particular application setting when trading off performance for speed. The cheap estimators we have experimented with can be used at scale wherever mood is a useful facet of analysis, from monitoring of discussions in participatory democracy to surveilling online forums for malicious intent or recruitment campaigns, and even generic search. Our hyper-community formulation has direct application to any social media community application with a textual component, and could serve as a useful feature in a domain whose lifeblood is product differentiation and rapid innovation.

This paper extends our previous work. For the mood classification problem, in addition to [25], cheap and effective features inspired by psychology research are included. Also, experiments on a range of features are conducted on a larger dataset containing on the order of a million blog posts. For the hyper-community problem, continuing from previous work [26], we introduce a novel view of online communities through the linguistic styles their members express in journal diaries. Furthermore, instead of running experiments on data without Live Journal category annotations, we explore new data with the labels, allowing a more objective clustering measurement in comparison to [26]. Finally, this paper brings these previous works together into a coherent and unified view of sentiment analysis in social media.

The remainder of this paper is structured as follows. The following section provides a discussion of related work. Section 3 examines mood classification in a supervised framework, compares a number of feature selection methods, and introduces cheap features based on psychology-sourced sets. Section 4 presents an unsupervised approach to examine the correlation between real-world use of a mood vocabulary and an efficient feature set introduced in the previous section, using a large dataset of blog posts with associated ground-truth mood. Section 5 focuses on particular online communities, and examines the use of various community signatures, including topic, mood, and psycholinguistic features, for the task of clustering them into hyper-communities.

2 Related Work

2.1 Feature selection for mood classification

For generic text categorization, a wide range of feature selection methods in machine learning has been studied [5, 9]. Most notably, Yang and Pedersen [40] conduct a comparative study on different feature selection schemes, including information gain (IG), mutual information (MI), and the $\chi^2$ statistic (CHI). Their study concludes that IG and CHI are most effective at dimensionality reduction for text categorisation without compromising classification accuracy, while MI has inferior performance by comparison. An alternative to a term-class interactions approach to selecting features is to consider


term statistics. Thresholds for term frequency (TF) or document frequency (DF) are commonly used in feature reduction in data mining. The joint term frequency-inverse document frequency (TF.IDF) scheme, popular for retrieval settings, is also used in text mining and often outperforms TF and DF. Some approaches reduce the feature space for feature selection from the entire vocabulary by using linguistic representations, such as parts of speech (POS), including adjectives, verbs, adverbs, and other subcategorizations. All of these practices will be applied in this study to provide a comparative study of machine learning-based text feature selection for the specific problem of mood classification, elucidating insights into what can be transferred from a generic text categorisation problem for mood classification. In addition, due to the vast scale of social media, we run experiments on feature sets that work without requiring a supervised feature selection stage to classify mood. We extend previous work on mood classification in a supervised framework [25] to include more efficient features based on psychology-sourced sets and provide additional experiments on a larger dataset.

2.2 Mood analysis

Mood analysis can be viewed as a subset of sentiment analysis and opinion mining [28] where subjective information is identified, extracted, or utilised in real-world applications. This sentiment information conveyed in social media data has been used in viral marketing, as in Fan and Chang [8], where the authors introduced a sentiment-oriented approach to place advertisements in a specific blog. Also, sentiment information conveyed in user feedback can be helpful in reviews of products and their features [17]. Furthermore, Feng et al. [10] propose a method to group blogs into sentiment clusters, employing a Chinese sentiment lexicon for sentiment-bearing representation, facilitating public opinion monitoring for governments and business organizations. Mood classification in weblogs has been conducted by Mishne [20], who classifies blog text according to the mood tagged by its author at the time of writing, and by Leshed and Kaye [16], who predict the user's state of mind for an incoming blog post.

Mood estimation has a wide scope of application, for example in business and government intelligence, product research, ethnographic study of the Internet, community or media recommendation, search with a mood facet and pervasive healthcare. However, mood estimation from text poses challenges beyond those encountered by typical text categorisation and clustering. Leaving aside the complexity of the underlying cognitive processes, the manifestation of mood is coloured by a person's idiosyncratic vocabulary and style, with messages often reflecting a social context, including community norms, history and shared understanding. Consequently, text is often short, informal and punctuated by jargon, abbreviations and frequent grammatical errors.
In addition to classification, clustering mood into patterns is also an important task as it provides clues about human emotion structures, with implications for sentiment-aware applications such as sentiment-sensitive text retrieval. The structure of mood organisation has been investigated from a psychological perspective for some time. For example, Russell [32, 33] proposes the circumplex model of affect to represent affect states. Using this model, emotion names can be placed around the perimeter of a circle in two-dimensional space. The dimensions defining the space are pleasantness (or valence) and activation (or arousal). However, the structure of mood formation has not been investigated from a data-driven and computational point of view. This paper aims to discover intrinsic patterns in mood structure using unsupervised learning approaches. Using a large, ground-truthed dataset of approximately 18 million posts introduced in [16], it seeks empirical evidence to answer various questions that have often been posed in psychology. For example, does mood follow a continuum in its transition from 'pleasure' to 'displeasure' or from 'activation' to 'deactivation'? Is 'excited' closer to 'aroused' or 'happy'? Does 'depressed' transit to 'calm' before reaching 'happy'? These are interesting and important research questions that have long been the focus of conjecture in different fields, but have not been extensively investigated from a data-driven perspective. The work of Russell [32], for example, includes only 36 participants, and is thus far smaller in scale than the dataset used here.

Mood clustering has also been explored in [14], in which the authors performed clustering on music genres, artists and usages in terms of the moods used, and in [16], in which the authors grouped blog posts to find mood synonymy. The idiosyncratic nature of mood attributions for the domain of music leads Hu and Downie [14] to cluster, rather than classify, music genres, artists and usages into data-derived 'mood spaces'. Some approaches employ a clustering component for mood classification. For example, Sood and Vasserman [36] use K-means clustering to obtain a much reduced set of mood classes from the 132 predefined moods of the popular blog site Livejournal. Mood classification of blog posts is reduced to a ternary classification among happy, sad and angry, and is achieved using a number of generic text features plus some that are mood specific (for example, emoticons and Internet slang), with an average F-measure of 0.66. Their mood classifier is used as part of a mood-aware search interface.

2.3 Mood and Inference about Communities and Users

The above sections discuss work aimed at classifying, or grouping, texts by manifest mood. If the focus of analysis is shifted from texts to their authors, different questions arise: Can users be characterized by the mood of the messages they author? Can communities be so characterized? Are some more alike than others in terms of the mood of their discussions? For example, Mishne and Glance [21] are not alone in using social media commentary as a kind of sensor of how a product, company, political party, or person is being perceived. They use Nigam and Hurst's polar sentiment classifier [27] to analyze blog posts referring to new movies together with the movie's box office performance. They find some evidence of a positive correlation between use of positive sentiment prior to a movie's release and the movie's subsequent financial success. Cambria et al. [7] take a Semantic Web approach to opinion mining of patients, relatives and healthcare professionals for the purpose of obtaining a crowd validation of the UK National Health Service.

The problem of community structure discovery in complex networks has been investigated in numerous studies. In particular, link structure, which can be explicitly declared as friendship or membership in subscribers' profiles, has been exploited. For example, Kumar et al. [15] use friendship links and other user profile information, annotated with timestamps, to learn the network structure and its evolution using Flickr, Yahoo! 360 and blogosphere data. Backstrom et al. [2] use co-authorship and publication information in the Digital Bibliography and Library Project, and friendship and community membership in Livejournal, to learn group formation in the two networks. A disadvantage of link-based approaches is that they make strong assumptions about


the availability of link structure and the stability of communities, which do not always hold in practice.

Another approach to learning community structure is to use the content created by users. For example, McCallum et al. [18] present the Author-Recipient-Topic model to exploit the content in emails sent among Enron employees. The model can discover relevant topics, predict people's roles and give lower perplexity on previously unseen messages. They also introduce the Group-Topic model to detect groups in streams using relationships between entities and the properties of the relationships [19]. In [35], the Content-Time-Relation model is applied to the Enron email corpus to discover roles, predict the senders, infer the receivers and describe the content of information exchanged between people in the company.

In addition to the content, tags can also be exploited for the task of community detection. For instance, Negoescu et al. [24] use the tags and membership information of Flickr users in a bag-of-words representation to learn Flickr hyper-groups. Berendt and Hanser [3] show that tags could potentially enrich information related to the posts and the bloggers. The authors reveal that tags complement content, reflect differences between annotators and provide additional information. Hayes and Avesani [13] use tag information to find topic-relevant blogs. A joint link-content approach has also been used in studying social networks. An example of this practice is that of Nallapati and Cohen [23], who introduce LinkPLSA-LDA to model the relationships between the citing/linking and the cited/linked documents on specific topics.

In this paper, community detection is performed using topic-based and mood-based features in a comparative analysis, as proposed in [26], that, to our knowledge, is the first of its kind. In addition, this paper introduces a novel representation of communities based on psycholinguistic features. These features offer a wide scope of classification, including topical, linguistic, stylistic and mood categories, and they are cheap to obtain. They are found to be effective in capturing and identifying the style of authors in personal blogs [22].

3 Mood Classification

We begin by framing the problem of mood estimation from text in a supervised classification setting. The results of this approach will serve as a baseline for comparison when we subsequently look for lighter-weight alternatives to computationally expensive feature selection and classifiers. Denote by $B$ the corpus of all blog posts and by $M = \{\text{sad}, \text{happy}, \ldots\}$ the set of all mood categories. In a standard feature-selection setting, each blog post $d \in B$ is also labelled with a mood category $c_d \in M$, and the objective is to extract from $d$ a feature vector $x^{(d)}$ that is as discriminative as possible for $d$ to be classified as $c_d$. For example, if we further denote by $V = \{v_1, \ldots, v_{|V|}\}$ the set of all terms, then the feature vector $x^{(d)} = [\ldots, x_i^{(d)}, \ldots]$ might take a simple counting form, with its $i$-th component $x_i^{(d)}$ representing the number of times the term $v_i$ appears in document $d$, a scheme widely known as the bag-of-words representation.

The problem of generic text document classification has been investigated extensively in the text-mining domain. It is generally agreed among researchers that, despite the strong independence assumption among features conditional on the class, the simple Naive Bayes (NB) classifier remains the state-of-the-art for this task: a fact that is verified by



the experimental results for the problem of mood classification presented in this paper. However, different feature-selection methods are found to influence classification performance greatly [40]. Taking the view that the works of Yang and Pedersen [40] and Sebastiani [34] represent state-of-the-art results in feature selection for generic text-categorisation tasks, we ask whether the findings in these works hold for our mood classification problem. We shall briefly describe commonly used feature-selection schemes, including those in [40, 34].
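To make the baseline setting concrete, the bag-of-words representation and the Naive Bayes classifier discussed above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments (which rely on Weka); the whitespace tokenisation, the add-one smoothing and the names `bag_of_words` and `MultinomialNaiveBayes` are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def bag_of_words(text):
    """Return term counts x_i^(d) for a post (whitespace tokenisation is an assumption)."""
    return Counter(text.lower().split())


class MultinomialNaiveBayes:
    """Illustrative multinomial Naive Bayes with add-one smoothing (an assumption)."""

    def fit(self, posts, moods):
        self.vocab = set()
        self.class_counts = Counter(moods)        # number of posts per mood
        self.term_counts = defaultdict(Counter)   # term counts per mood
        for text, mood in zip(posts, moods):
            counts = bag_of_words(text)
            self.term_counts[mood].update(counts)
            self.vocab.update(counts)
        self.total_terms = {m: sum(c.values()) for m, c in self.term_counts.items()}
        return self

    def predict(self, text):
        counts = bag_of_words(text)
        n_posts = sum(self.class_counts.values())
        best_mood, best_score = None, -math.inf
        for mood in self.class_counts:
            # log prior plus smoothed log likelihood of each known term
            score = math.log(self.class_counts[mood] / n_posts)
            denom = self.total_terms[mood] + len(self.vocab)
            for term, n in counts.items():
                if term in self.vocab:
                    score += n * math.log((self.term_counts[mood][term] + 1) / denom)
            if score > best_score:
                best_mood, best_score = mood, score
        return best_mood
```

Terms unseen during training are ignored at prediction time, one of several reasonable choices for handling an open vocabulary.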

3.1 Feature selection methods

Term-based selection

These are features derived with respect to a term $v$. Two common features are term and document frequencies, where term frequency $TF(v, d)$ represents the number of times the term $v$ appears in document $d$, whereas document frequency $DF(v)$ is the number of blog posts containing the term $v$. It is also well known in text mining that $TF.IDF(v, d)$ weighting can improve discriminative power, where $TF.IDF(v, d) = TF(v, d) \times IDF(v)$ and $IDF(v) = |D| / DF(v)$ is the inverse document frequency, with $|D|$ the number of documents. In this work, a term $v$ is selected if its $DF(v)$ value, or its average value of $TF(v, d)$ or $TF.IDF(v, d)$ across all documents $d$, exceeds a threshold.
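The term statistics defined above can be sketched in a few lines. The code follows the definitions as stated in the text, including the unlogged ratio $|D| / DF(v)$ for IDF (a logarithmic IDF is a common alternative); the function names, the input format (one token list per post) and the thresholds are assumptions for illustration only.

```python
from collections import Counter


def term_statistics(docs):
    """docs: list of token lists. Returns (tf, df, tfidf) following the text's definitions."""
    n_docs = len(docs)
    tf = [Counter(tokens) for tokens in docs]            # TF(v, d) per document
    df = Counter(v for counts in tf for v in counts)     # DF(v): documents containing v
    # TF.IDF(v, d) = TF(v, d) * |D| / DF(v)
    tfidf = [{v: n * (n_docs / df[v]) for v, n in counts.items()} for counts in tf]
    return tf, df, tfidf


def select_by_df(docs, min_df):
    """Keep terms whose document frequency reaches a (hypothetical) threshold."""
    _, df, _ = term_statistics(docs)
    return sorted(v for v, n in df.items() if n >= min_df)


def select_by_average_tfidf(docs, min_avg):
    """Keep terms whose TF.IDF, averaged over all documents, reaches a threshold."""
    _, _, tfidf = term_statistics(docs)
    totals = Counter()
    for weights in tfidf:
        totals.update(weights)
    return sorted(v for v, total in totals.items() if total / len(docs) >= min_avg)
```

Averaging over all documents (rather than only those containing the term) is one reading of "across all documents $d$"; the other reading would divide by $DF(v)$ instead.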

Term-class interaction-based selection

The essence of these methods is to capture the dependence between terms and corresponding class labels during the feature selection process, and so capture the relevance of terms. Three common selection methods in this category are information gain $IG(v)$, mutual information $MI(v, l)$ and the $\chi^2$ statistic $CHI(v, l)$ [40]. $IG(v)$ captures the information gain (measured in bits) when a term $v$ is present or absent; $MI(v, l)$ measures the mutual information between a term $v$ and a class label $l$; and lastly $CHI(v, l)$ measures the dependence between a term and a class label by comparing against a $\chi^2$ distribution with one degree of freedom.
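As a hedged illustration of the three term-class scores, the sketch below computes them from document counts using the standard two-by-two contingency formulation described by Yang and Pedersen [40]. The function names and the representation of posts as token sets are assumptions; the paper does not state how the scores are aggregated over labels, so no aggregation is shown.

```python
import math
from collections import Counter


def contingency(docs, labels, term, label):
    """Counts A: term and label, B: term and other label, C: label without term, D: neither."""
    a = b = c = d = 0
    for tokens, l in zip(docs, labels):
        if term in tokens:
            if l == label:
                a += 1
            else:
                b += 1
        elif l == label:
            c += 1
        else:
            d += 1
    return a, b, c, d


def chi_square(a, b, c, d):
    """CHI(v, l) = N (AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D))."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def mutual_information(a, b, c, d):
    """MI(v, l) estimated as log2(A N / ((A+C)(A+B))), in bits."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")
    return math.log2(a * n / ((a + c) * (a + b)))


def entropy(labels):
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values()) if n else 0.0


def information_gain(docs, labels, term):
    """IG(v): reduction in label entropy (bits) from knowing whether v is present."""
    present = [l for tokens, l in zip(docs, labels) if term in tokens]
    absent = [l for tokens, l in zip(docs, labels) if term not in tokens]
    n = len(labels)
    return entropy(labels) - len(present) / n * entropy(present) - len(absent) / n * entropy(absent)
```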

It has been shown that linguistic components, such as use of adverbs, adjectives or verbs, can be a strong indicator of mood [28]. Therefore we also apply the above feature selection methods to subsets of the raw unigram text classed by part-of-speech. We use the SS-Tagger [39], ported to the Antelope NLP framework,⁴ to pre-process blog post text, and tag verbs, adjectives and adverbs.

Affective Norms for English Words (ANEW) lexicon

A source of text representation apt for inferring mood comes by way of mood-valued lexicons, such as Affective Norms for English Words (ANEW) [6]. ANEW is a set of 1034 sentiment-conveying words created by the National Institute of Mental Health of the United States to serve as a standard for studies in cognition and emotion. Words are rated in terms of normalized scores using a well-known mood model consisting of the triple: valence, arousal, and dominance. The valence values in ANEW range from 1.25 (suicide) to 8.82 (triumphant); arousal ranges between 2.39 (relaxed) and

4 www.proxem.com

Category            Code      Example   %

Linguistic processes
Word count          wc                  227
Words/sentence      wps                 13.3
Dictionary words    dic                 83.4
Words>6 letters     sixltr              12.4
Function words      funct     Since     53.4
Total pronouns      pronoun   Them      17.2
Personal pron.      ppron     I         11.9
1st pers singular   i         Me        7.9
1st pers plural     we        We        0.7
2nd person          you       You       1.6
3rd pers singular   shehe     She       1.2
3rd pers plural     they      They      0.4
Impers. pron.       ipron     It        5.3
Articles            article   The       4.5
Common verbs        verb      Walk      15.8
Auxiliary verbs     auxverb   Am        9.3
Past tense          past      Went      4.0
Present tense       present   Is        9.8
Future tense        future    Will      1.0
Adverbs             adverb    Very      5.8
Prepositions        prep      To        10.6
Conjunctions        conj      And       6.3
Negations           negate    No        1.9
Quantifiers         quant     Few       2.5
Numbers             number    Once      0.7
Swear words         swear     Damn      0.7

Psychological processes
Social              social    Mate      8.4
Family              family    Son       0.4
Friends             friend    Buddy     0.3
Humans              human     Adult     0.8
Affective           affect    Happy     7.3
Positive            posemo    Love      4.6
Negative            negemo    Hurt      2.7
Anxiety             anx       Nervous   0.3
Anger               anger     Hate      1.2
Sadness             sad       Crying    0.5
Cognitive           cogmech   Cause     15.3
Insight             insight   Think     1.9
Causation           cause     Hence     1.3
Discrepancy         discrep   Should    1.5
Tentative           tentat    Maybe     2.5
Certainty           certain   Always    1.3
Inhibition          inhib     Block     0.4
Inclusive           incl      And       4.5
Exclusive           excl      But       2.8
Perceptual          percept   Heard     2.3
See                 see       View      0.9
Hear                hear      Listen    0.6
Feel                feel      Touch     0.7
Biological          bio       Eat       2.6
Body                body      Cheek     0.9
Health              health    Flu       0.6
Sexual              sexual    Horny     0.8
Ingestion           ingest    Dish      0.4
Relativity          relativ   Bend      13.7
Motion              motion    Car       2.1
Space               space     Down      4.9
Time                time      Until     6.3

Personal concerns
Work                work      Job       1.6
Achieve             achieve   Hero      1.2
Leisure             leisure   Chat      1.5
Home                home      Family    0.5
Money               money     Cash      0.5
Religion            relig     God       0.4
Death               death     Bury      0.2

Spoken
Assent              assent    OK        1.2
Nonfluency          nonflu    Hm        0.5
Fillers             filler    Blah      0.1

Table 1: Language groups categorised by LIWC over the corpus of approximately 18 million blog posts studied in this paper. The numbers are the means of the percentages of the features' words used in a blog post, except for wc (the mean of the number of words in a post) and wps (the mean of the number of words per sentence).

8.17 (rage); and dominance ranges between 2.27 (helpless) and 7.88 (leader). The median values for these dimensions are 5.29 (patient), 5.06 (sunrise), and 4.12 (knife), respectively. For each blog post, we construct a feature vector that contains counts for each ANEW word. Because only a fraction of the 1034 ANEW words appears in any given blog post (typically between 5 and 20 words), the resulting feature vectors are sparse.
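A sparse ANEW count vector of the kind described above can be sketched as follows. The word list here is a hypothetical stand-in containing only the nine ANEW words named in the text, not the full 1034-word lexicon, and the tokenisation and the function name `anew_features` are likewise assumptions.

```python
import string
from collections import Counter

# Hypothetical stand-in for the ANEW lexicon: only the words quoted in the text.
ANEW_WORDS = ["suicide", "triumphant", "relaxed", "rage", "helpless",
              "leader", "patient", "sunrise", "knife"]
ANEW_INDEX = {word: i for i, word in enumerate(ANEW_WORDS)}


def anew_features(post):
    """Return a sparse vector {lexicon index: count} of ANEW words in a post."""
    tokens = (t.strip(string.punctuation) for t in post.lower().split())
    counts = Counter(t for t in tokens if t)
    # Only non-zero entries are stored, which keeps the vector sparse.
    return {ANEW_INDEX[w]: n for w, n in counts.items() if w in ANEW_INDEX}
```

Storing only the non-zero entries reflects the observation that a post typically contains a small fraction of the lexicon.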

Psycholinguistic features (LIWC)

Another powerful set of features used in this paper is psycholinguistic, drawn from the LIWC package [31]. The LIWC package assigns English words to one of four high-level categories: linguistic processes, psychological processes, personal concerns and spoken categories, which are further sub-divided into a three-level hierarchy.⁵ The taxonomy ranges across topic (for example, religion and health), mood (for example, positive emotion) and processes not captured by either, such as cognition (for example, causation and discrepancy) and tense. Tausczik and Pennebaker [37] survey the use of the LIWC package, based on the social and psychological meaning of words, across different research areas in sociology and psychology, including status, dominance and social hierarchy, honesty and deception, thinking styles and individual differences.

According to [30], weblogs are particularly suitable for sentiment analysis since they contain a high rate of words in affective processes. For the dataset of approximately 18 million blog posts used in this paper we compute the mean for each of the LIWC groups and present them in Table 1. As can be seen, the percentage of words in affective processes in the corpus (7.3 per cent) is larger than that of any class of text reported in [30], even the emotional writing class (6.02 per cent). This supports the use of sentiment analysis on social media-derived corpora.
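The per-post percentages reported in Table 1 can be illustrated with a small sketch. The dictionary below is a hypothetical miniature built from example words listed in Table 1; the real LIWC dictionary is far larger and is not reproduced here, and the tokenisation and function name are assumptions.

```python
import string

# Hypothetical mini-dictionary: category code -> words, taken from Table 1 examples.
LIWC_LIKE = {
    "posemo": {"love"},
    "negemo": {"hurt"},
    "anx": {"nervous"},
    "anger": {"hate"},
    "sad": {"crying"},
}


def liwc_like_features(post):
    """Return, per category, the percentage of the post's words in that category."""
    tokens = [t.strip(string.punctuation) for t in post.lower().split()]
    tokens = [t for t in tokens if t]
    total = len(tokens)
    return {
        code: (100.0 * sum(t in words for t in tokens) / total if total else 0.0)
        for code, words in LIWC_LIKE.items()
    }
```

Averaging these per-post percentages over a corpus gives figures of the kind shown in the "%" column of Table 1.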

3.2 Mood Classification

We use the IR05 and WSM09 datasets for the task of mood classification. We use data crawled from Livejournal, a weblog hosting site. Livejournal allows people to tag their mood when they are blogging, thus providing an excellent source of ground truth data for sentiment analysis. The host provides a comprehensive set of 132 moods for users to specify their current emotion at the time of blogging. The provided moods range widely across the emotion spectrum, for example, cheerful and grateful for happiness, or discontent and uncomfortable for sadness.

The IR05 dataset

The IR05 dataset, introduced by Mishne [20], contains 815,494 blog posts from Livejournal. This dataset can be considered the first Livejournal corpus created for the purpose of mood classification. Mishne [20] performs emotion analysis on this blog post corpus, using a number of feature sets such as frequency counts, lengths, sentiment orientations, emphasized words and special symbols as input to an SVM classifier. The classification accuracy is modest, being slightly above baseline. Of the total, 535,844 posts are tagged with predefined moods. We disregard the posts annotated with non-predefined moods.

The WSM09 dataset

The WSM09 dataset was provided by Spinn3r (spinn3r.com) as the benchmark dataset for the ICWSM 2009 conference.⁶ It contains 44 million blog posts crawled between August and October 2008. A subset of this dataset was extracted for this paper, consisting of only blog posts from Livejournal, which include the mood ground truth entered by the user when the post was composed. Again, only the moods predefined by

5 http://www.liwc.net/descriptiontable1.php, accessed July 2011.
6 http://www.icwsm.org/2009/data/, retrieved November 2011.

Order  Selection method  Linguistic subset  Accuracy  F-score  Classifier
1      IG                Unigram            0.779     0.772    SVM
2      TF                Unigram            0.764     0.757    SVM
3      DF                Unigram            0.763     0.756    SVM
4      IG                AdjVbAdv           0.752     0.742    SVM
5      TF                AdjVbAdv           0.743     0.732    SVM
6      TF.IDF            AdjVbAdv           0.743     0.732    SVM
7      DF                AdjVbAdv           0.742     0.731    SVM
8      LIWC              -                  0.744     0.730    SVM
9      TF.IDF            Unigram            0.726     0.709    SVM
10     ANEW              -                  0.704     0.681    SVM
11     IG                Verb               0.698     0.678    SVM
12     TF.IDF            Verb               0.696     0.676    SVM
13     DF                Verb               0.696     0.676    SVM
14     TF                Verb               0.694     0.674    SVM
15     TF.IDF            Adjective          0.687     0.658    SVM
16     IG                Adjective          0.685     0.655    SVM
17     TF                Adjective          0.683     0.653    SVM
18     DF                Adjective          0.682     0.653    SVM
19     MI                AdjVbAdv           0.624     0.575    SVM
20     MI                Adjective          0.570     0.570    Naive Bayes
21     TF.IDF            Adverb             0.607     0.569    Naive Bayes
22     IG                Adverb             0.606     0.568    Naive Bayes
23     DF                Adverb             0.606     0.568    Naive Bayes
24     TF                Adverb             0.606     0.568    Naive Bayes
25     MI                Adverb             0.605     0.567    Naive Bayes
26     MI                Verb               0.617     0.560    SVM
27     CHI               AdjVbAdv           0.601     0.555    Naive Bayes
28     CHI               Adjective          0.601     0.540    Naive Bayes
29     CHI               Verb               0.589     0.533    Naive Bayes
30     CHI               Unigram            0.580     0.515    Naive Bayes
31     MI                Unigram            0.522     0.509    Naive Bayes
32     CHI               Adverb             0.561     0.445    Naive Bayes

Table 2: Mood classification results for different feature selection schemes and for different feature subsets in the WSM09 dataset, sorted in descending order of F-score. Both SVM and Naive Bayes classifiers are run on 32 sub-corpora but only the better results are reported.

Livejournal are considered and all others discarded, resulting in approximately 600,000 blog posts.⁷

The WSM09 dataset is used in [36] for a two-step data-mining application. First, all moods are categorised into three classes (happy, sad and angry) using K-means clustering. Blog posts in these three groups are then subjected to a Naive Bayes classifier. The feature set considered in this task consists of unigrams, bigrams, stems, emotion, emoticons and slang. The highest recall, precision and F-measure are 67.1%, 65% and 66.1% respectively. In order to compare our results with those of Sood and Vasserman [36], we restrict classification for this experiment to three popular moods {sad, happy, angry}. The complete set of 132 moods employed by Livejournal will be considered in the following section.

7 Consistent with what is reported in [36].

Order  Selection method  Linguistic subset  Accuracy  F-score  Classifier
1      TF                Unigram            0.771     0.761    SVM
2      DF                Unigram            0.771     0.761    SVM
3      TF                AdjVbAdv           0.752     0.738    SVM
4      DF                AdjVbAdv           0.750     0.736    SVM
5      TF.IDF            AdjVbAdv           0.749     0.734    SVM
6      TF.IDF            Unigram            0.740     0.721    SVM
7      ANEW              -                  0.727     0.699    SVM
8      LIWC              -                  0.732     0.691    SVM
9      TF                Verb               0.709     0.680    SVM
10     DF                Verb               0.707     0.680    SVM
11     TF.IDF            Verb               0.707     0.678    SVM
12     TF                Adjective          0.711     0.673    SVM
13     TF.IDF            Adjective          0.692     0.673    Naive Bayes
14     DF                Adjective          0.692     0.673    Naive Bayes
15     CHI               Unigram            0.669     0.597    Naive Bayes
16     IG                Unigram            0.669     0.597    Naive Bayes
17     MI                Unigram            0.669     0.597    Naive Bayes
18     CHI               Adjective          0.667     0.594    Naive Bayes
19     IG                Adjective          0.667     0.594    Naive Bayes
20     MI                Adjective          0.667     0.594    Naive Bayes
21     CHI               AdjVbAdv           0.657     0.591    Naive Bayes
22     IG                AdjVbAdv           0.657     0.591    Naive Bayes
23     MI                AdjVbAdv           0.657     0.591    Naive Bayes
24     DF                Adverb             0.638     0.586    Naive Bayes
25     TF                Adverb             0.637     0.584    Naive Bayes
26     TF.IDF            Adverb             0.637     0.584    Naive Bayes
27     CHI               Adverb             0.633     0.563    Naive Bayes
28     IG                Adverb             0.633     0.563    Naive Bayes
29     MI                Adverb             0.633     0.563    Naive Bayes
30     CHI               Verb               0.637     0.555    Naive Bayes
31     IG                Verb               0.637     0.555    Naive Bayes
32     MI                Verb               0.637     0.555    Naive Bayes

Table 3: Mood classification results for different feature selection schemes and for different feature subsets in the IR05 dataset, sorted in descending order of F-score. Both SVM and Naive Bayes classifiers are run on 32 sub-corpora but only the better results are reported.

For the classification method we experimented with two popular classifiers implemented in the Weka package [12]: Support Vector Machine (SVM) and Naive Bayes. For each run we use ten-fold cross-validation, repeat 10 runs, and report the average result. To evaluate the results we report two commonly used measures: accuracy and F-score.
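The evaluation protocol above can be sketched in a few lines. The following is a minimal illustration in plain Python rather than the Weka setup used in the experiments: the helper names (`kfold_indices`, `accuracy`, `macro_f_score`, `cross_validate`) are hypothetical, and macro-averaging the per-class F-score is an assumption, since the text does not state how the F-score is averaged across moods.

```python
import random


def kfold_indices(n_items, k=10, seed=0):
    """Shuffle item indices and split them into k nearly equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f_score(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    scores = []
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


def cross_validate(labels, train_and_predict, k=10, seed=0):
    """Average accuracy and F-score over k held-out folds.

    train_and_predict(train_idx, test_idx) is a caller-supplied function
    that returns one predicted label per test index.
    """
    folds = kfold_indices(len(labels), k, seed)
    accs, f_scores = [], []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        y_true = [labels[j] for j in test_idx]
        y_pred = train_and_predict(train_idx, test_idx)
        accs.append(accuracy(y_true, y_pred))
        f_scores.append(macro_f_score(y_true, y_pred))
    return sum(accs) / k, sum(f_scores) / k
```

Repeating the procedure for 10 runs would amount to calling `cross_validate` with ten different `seed` values and averaging the returned pairs.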

Effect of feature selection schemes and linguistic components

Three term weighting-based (TF, DF, TF.IDF) and three term-class interaction-based (IG, MI, CHI) selection methods are used in this experiment. Feature selection is applied using all terms (unigrams), and using subsets of terms performing a particular part-of-speech role (adjectives, verbs, adverbs, or a combination of the three). This leads to 30 parameter combinations: (term-based (3) + term-class interaction (3)) × part of speech (5). In addition, we use the 1,034 ANEW words and 68 categories returned from the LIWC package as separate feature vectors. For comparison, the number of features in all other cases equals 1,034, which is the number of ANEW words. We experiment with all possible combinations of feature selection methods and different linguistic subsets and report the top results in Table 3.

With respect to feature-selection schemes, IG is observed to be the best selection scheme. Other term-class interaction-based methods do not perform well; noticeably, mutual information (MI) does not appear in any of the top ten results. These observations are consistent with results for text categorization in [40]. In contrast to their findings, we find that CHI does not perform well for the mood classification task, absent as it is from either of the top ten result lists. Surprisingly, both TF and DF perform better than TF.IDF in unigram cases; the opposite is true for generic text mining, where TF.IDF is often superior, albeit more computationally expensive. Thus TF or DF are recommended, in conjunction with IG, for good performance in the trade-off with computational cost.

With respect to the effect of linguistic components (which are not tested in [16] or [36]), a combination of adjectives, verbs and adverbs (AdjVbAdv) dominates the top results and gives performance very close to that achieved when using all terms. Using verbs or adjectives alone also produces a good performance. The performance of the selected feature-selection schemes is also acceptable across the two datasets, as can be seen in Tables 2 and 3 for the WSM09 and IR05 datasets respectively. The best results stand at 76.1% F-score for IR05 and 77.2% F-score for WSM09. Although the NB classifier outperforms SVM in a majority of feature-selection schemes for the IR05 dataset, all top results in both datasets are returned from SVM.

Fig. 1: Performances of binary versus counting features (series: Frequency and Presence) for mood classification for the WSM09 dataset, for (a) the Naive Bayes classifier and (b) the SVM classifier. The performance is measured in F-score (%) and sorted in descending order of the difference between the two performances.

Fig. 2: Performances of binary versus counting features (series: Frequency and Presence) for mood classification for the IR05 dataset, for (a) the Naive Bayes classifier and (b) the SVM classifier. The performance is measured in F-score (%) and sorted in descending order of the difference between the two performances.
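To make the experimental grid concrete, the sketch below enumerates the 30 combinations of selection method and linguistic subset and scores terms by information gain. It is an illustration under stated assumptions, not the implementation used for the reported results: documents are reduced to sets of tokens, IG is computed over term presence or absence, and the function names are hypothetical.

```python
import math
from itertools import product

TERM_WEIGHTING = ["TF", "DF", "TF.IDF"]
TERM_CLASS = ["IG", "MI", "CHI"]
POS_SUBSETS = ["Unigram", "Adjective", "Verb", "Adverb", "AdjVbAdv"]

# (3 + 3) selection methods x 5 linguistic subsets = 30 combinations.
COMBINATIONS = list(product(TERM_WEIGHTING + TERM_CLASS, POS_SUBSETS))


def entropy(counts):
    """Shannon entropy (bits) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def information_gain(docs, labels, term):
    """IG of a term: H(class) - H(class | term present or absent).

    docs is a list of token sets, labels the matching mood labels.
    """
    classes = sorted(set(labels))

    def class_counts(subset):
        return [sum(1 for label in subset if label == c) for c in classes]

    present = [l for d, l in zip(docs, labels) if term in d]
    absent = [l for d, l in zip(docs, labels) if term not in d]
    conditional = 0.0
    for subset in (present, absent):
        if subset:
            weight = len(subset) / len(labels)
            conditional += weight * entropy(class_counts(subset))
    return entropy(class_counts(labels)) - conditional


def top_terms_by_ig(docs, labels, vocabulary, n_features=1034):
    """Keep the n_features terms with the highest information gain."""
    ranked = sorted(vocabulary,
                    key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    return ranked[:n_features]
```

The default of 1,034 features mirrors the size of the ANEW lexicon, which the text uses as the common feature count for all selection schemes.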

Performance of LIWC

In comparison with more than 30 combinations of selection methods and feature spaces, though without the need for a supervised feature-selection stage, the result of LIWC features is found to be very encouraging, appearing among the top results across the two datasets. This result reveals differences in the use of the psycholinguistic features among posts tagged with different moods.

Performance of ANEW

The results of classification based on the ANEW lexicon alone are encouraging, appearing in the top ten results for both datasets. Performance is consistent across datasets at approximately 70% F-score.

Term presence vs. term frequency

It is well known in text mining that the bag-of-word counting representation (that is, counting the number of appearances of a term) is an effective feature. However, in sentiment analysis, it has been found that a simple binary representation (that is, using 1 if the term appears in the document and 0 otherwise) is more effective at movie review classification [29]. Further, a binary feature representation can be better compressed, making it suitable for dealing with large datasets. Therefore, we are motivated to investigate whether a binary representation is effective for mood classification. Figures 1 and 2 show the results for the WSM09 and IR05 datasets respectively when the classification was performed on the binary and counting features. For each dataset, the classification F-score is plotted for both types of representation, displaying the top results in an increasing order of performance. The results reveal that binary representation outperforms its counterpart for the four top-performing configurations in both datasets. This result again confirms the superiority of binary representation over counting for mood classification, suggesting that recording the appearance of a term in a document is sufficient and recommended for mood classification tasks.
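The two representations compared above differ only in how a term's occurrences are recorded. A minimal sketch follows; the three-word vocabulary and the sample post are illustrative assumptions, not data from the experiments.

```python
from collections import Counter


def count_features(tokens, vocabulary):
    """Term-frequency vector: how many times each vocabulary term occurs."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def binary_features(tokens, vocabulary):
    """Term-presence vector: 1 if the term occurs in the document, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]


vocabulary = ["happy", "love", "sad"]  # illustrative, not the ANEW list
post = "so happy happy today love my friends".split()
print(count_features(post, vocabulary))   # [2, 1, 0]
print(binary_features(post, vocabulary))  # [1, 1, 0]
```

Because every entry of the presence vector is a single bit, it can be stored far more compactly than the count vector, which is the compression advantage mentioned in the text.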

4 Exploratory Mood Patterns

Much of the existing work aimed at inferring mood from text has been framed as supervised classification, but the computational costs of these approaches are prohibitive. The Livejournal corpus introduced by [16] is a case in point: it contains 18 million posts, many of which are groundtruthed with one of 132 predefined mood labels.⁸ A cloud visualisation of moods tagged in this dataset is shown in Figure 4. Leshed and Kaye [16] perform emotion classification on fifty of the most frequent moods appearing in the corpus. They use the TF.IDF feature-selection method to select only the first 5,000 features. Blog entries are represented in the `bag-of-word' model of information retrieval, and subjected to an SVM classifier. The average accuracy of the system is reported to be 78% [16]. To apply the feature selection schemes presented in section 3.1 to this corpus would be expensive; for example, computing $MI(v, l)$ for each pair (term, mood) has a computational complexity of $O(|\mathcal{M}| \times |\mathcal{V}|)$, where $|\mathcal{M}| = 132$ and $|\mathcal{V}|$ is the number of unique terms, which could be on the order of hundreds of thousands. And this corpus is a fraction of the existing social media corpus, which continues to grow at an accelerating speed.

8 For the full list of predefined moods, visit http://www.livejournal.com

We investigate the possibility of unearthing intrinsic patterns of mood using unsupervised approaches. We take as a promising base for this analysis the ANEW feature vectors used in section 3.2, which gave classification performance comparable with those employing the expensive feature selection schemes. Figure 5 depicts the relationship we hope to exploit between blog posts with groundtruth mood that also use words in the ANEW lexicon in their textual content.

For the Livejournal corpus we make the initial observation that blog posts tagged with moods in the same emotion pattern have similar proportions of use of words in the ANEW lexicon. For example, see Figure 6, which plots a sample of ANEW words having arousal in the range of 7.2 to 8.2 against proportion of ANEW words in blog posts tagged with Livejournal moods of similar sentiment, one of either happy/cheerful or angry/p*ssed off. We can see clearly that the words anger, enraged, and rage are most likely to be found in posts labelled angry or p*ssed off, and least likely in those labelled happy or cheerful. In contrast, the words romantic and surprised are not often used in posts labelled angry or p*ssed off, but are used in posts labelled happy or cheerful. This congruence between the use of words in the ANEW lexicon and the attached mood groundtruth supports the idea of using the ANEW feature to discover the implicit structure of mood in textual social media. Following on from this observation, we next cluster Livejournal moods based on how users who tag blog posts with a given mood use words from the ANEW lexicon.

Fig. 3: Valence and arousal values of 1,034 ANEW words on the affect circle. (Horizontal axis: valence, from displeasure to pleasure; vertical axis: arousal, from deactivation to activation.)

Fig. 4: Visualization of the distribution of 132 predefined mood labels from approximately 18 million LiveJournal blog posts.

Fig. 5: Examples of the use of ANEW words in happy and sad blog posts. The colour reflects the valence and arousal values an ANEW word conveys, as shown in Figure 3. (a) ANEW words used in a post tagged with mood happy; the average valence for this post is 7.75. (b) ANEW words used in a post tagged with mood sad; the average valence for this post is 3.12.

Cluster  Exemplar     Members
1        CHEERFUL     ecstatic, jubilant, giddy, happy, excited, energetic, bouncy, chipper
2        PENSIVE      determined, contemplative, thoughtful
3        REJUVENATED  optimistic, relieved, refreshed, hopeful, peaceful
4        QUIXOTIC     surprised, enthralled, devious, geeky, creative, recumbent, artistic, impressed, amused, complacent, curious, weird
5        CRAZY        horny, giggly, high, flirty, hyper, drunk, naughty, dorky, ditzy, silly
6        MELLOW       pleased, satisfied, relaxed, content, anxious, good, full, calm, okay
7        GRATEFUL     loved, thankful, touched
8        AGGRAVATED   irritated, bitchy, annoyed, frustrated, cynical
9        ANGRY        p*ssed off, infuriated, irate, enraged
10       GLOOMY       jealous, envious, rejected, confused, worried, lonely, guilty, scared, pessimistic, discontent, distressed, indescribable, crushed, depressed, melancholy, numb, morose, sad, sympathetic
11       PRODUCTIVE   accomplished, working, nervous, busy, rushed
12       TIRED        sore, lazy, sleepy, awake, groggy, exhausted, lethargic, drained
13       NAUSEATED    sick
14       MOODY        disappointed, grumpy, cranky, stressed, uncomfortable, crappy
15       THIRSTY      nerdy, mischievous, hungry, dirty, hot, cold, bored, blah
16       EXANIMATE    intimidated, predatory, embarrassed, restless, nostalgic, indifferent, listless, apathetic, blank, shocked

Table 4: Livejournal moods clustered by similarity of ANEW word use.

Fig. 6: ANEW usage proportion in the posts tagged with happy/cheerful and angry/p*ssed off. (Horizontal axis: ANEW words; vertical axis: proportion; one series each for angry, p*ssed off, cheerful and happy.)

Fig. 7: Mood patterns on the affect circle.

Let us denote by $\mathcal{B}$ the corpus of all blog posts, and by $\mathcal{M} = \{\text{sad}, \text{happy}, \ldots\}$ the predefined set of moods ($|\mathcal{M}| = 132$). Each blog post $b \in \mathcal{B}$ in the corpus is labeled with a mood $l_b \in \mathcal{M}$. Denote by $n$ the number of ANEW words ($n = 1034$). Let $x^m = [x^m_1, \ldots, x^m_i, \ldots, x^m_n]$ be the vector representing the usage of ANEW words by mood $m$. Thus, $x^m_i = \sum_{b \in \mathcal{B},\, l_b = m} c_{ib}$, where $c_{ib}$ is the count of the $i$-th ANEW word in blog post $b$ tagged with mood $m$. The usage vector is normalized so that $\sum_{i=1}^{n} x^m_i = 1$ for all $m \in \mathcal{M}$. The vectors $x^m$ will serve as the basis of the similarity measure upon which to cluster.

Clustering is performed using the non-parametric algorithm termed Affinity Propagation (AP) [11]. AP has the desirable properties of automatically discovering the number of clusters, and cluster exemplars (instances that best represent the given cluster). AP simply requires the pairwise similarities between moods, which we compute using Euclidean distance.

After running the AP algorithm, 16 clusters were detected. Table 4 lists the discovered clusters, with exemplar moods in caps, and the remaining members in lower case. The discovered clusters are intuitively coherent. Clusters 1 to 7 typically contain moods of high valence or pleasure; clusters 8 to 16 contain moods of low valence or displeasure. Figure 7 depicts the same clusters as distributed in valence-arousal space. Clusters are plotted at the average (ANEW) arousal and valence of their member moods. For those moods that do not appear in the ANEW lexicon (95 of the predefined 132), arousal and valence are taken from the nearest parent in Livejournal's mood hierarchy that appears in the ANEW lexicon.⁹
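The construction of the mood usage vectors and of the pairwise similarities handed to AP can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and negating the Euclidean distance to obtain a similarity (larger means more similar) is an assumption, since the text only states that Euclidean distance is used.

```python
import math
from collections import Counter, defaultdict


def mood_usage_vectors(posts, lexicon):
    """Build the normalized usage vector x^m for every mood m.

    posts is an iterable of (mood_label, tokens); lexicon is the ordered
    list of lexicon words (the ANEW words in the paper). Each returned
    vector sums to 1, or is all zeros if the mood uses no lexicon word.
    """
    index = {word: i for i, word in enumerate(lexicon)}
    totals = defaultdict(lambda: [0.0] * len(lexicon))
    for mood, tokens in posts:
        vec = totals[mood]  # creates the row even if no lexicon word occurs
        for word, count in Counter(tokens).items():
            if word in index:
                vec[index[word]] += count  # x_i^m += c_ib
    vectors = {}
    for mood, vec in totals.items():
        norm = sum(vec)
        vectors[mood] = [v / norm for v in vec] if norm else vec
    return vectors


def similarity_matrix(vectors):
    """Pairwise similarities as negative Euclidean distance (assumed)."""
    moods = sorted(vectors)
    sims = {}
    for a in moods:
        for b in moods:
            sims[a, b] = -math.dist(vectors[a], vectors[b])
    return moods, sims
```

The resulting similarities could then be passed to any Affinity Propagation implementation that accepts a precomputed similarity matrix.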

The above data-driven affect analysis is, to the authors' knowledge, the first of its kind. The coherence of the discovered implicit mood structures validates the use of ANEW-based features for the purpose of inferring mood in unlabelled text. Such a mood sensor is relatively inexpensive to apply, and has a potential role in many applications including text surveillance, mood-related search facets, characterization and recommendation of media and communities, and ethnographic study, to name a few.

9 http://www.livejournal.com/moodlist.bml accessed July 2011.

Fig. 8: Profiles of six Livejournal communities: Cooking and Bentolunch consider similar topics; as do Bookish and 50bookchallenge; and Pokemon and Pkmncollectors.

5 Mood as an Index for Users and Communities

Having in the previous section obtained a robust and effective sensor for mood from text, we now seek to apply it to finer-grained problems. In particular, we will investigate the use of mood to detect hyper-communities, and to characterize users, discussions, and communities.

5.1 Community Representation

While there has been extensive work on characterizing collections of text by topic, including blog sub-communities [1] and tagged media [24], the same has not been attempted with mood. But mood is often an integral feature of a text, particularly for social media forums; two communities might discuss precisely the same topics, yet within an entirely different atmosphere. For example, where one forum might host conversations about politics in a cerebral, serious-minded, and friendly fashion, another will discuss the same issues adversarially, with zest and tolerance of profanity. Such mood-related distinctions are important for many kinds of analysis and application, not the least of which is community recommendation.

Online communities come in many shapes and sizes, and are affected by many factors, including the demographics of their members, reason for existence, and facilities afforded by the hosting application. The Livejournal blog site referred to in the previous sections includes a community feature. Each community is defined by the scope of topics it aims to host, and comprises among other things members, posts and comments made in response to posts. Figure 8 shows example profiles for six Livejournal communities.

Fig. 9: Topic proportions of 100 communities. (Vertical axis: communities; horizontal axis: topics.)

Category             Communities
Advice-Support       add_a_writer, addme25_and_up, baristas, boys_and_girls, i_am_thankful, iworkatborders, thenicestthings, todayirealized, walmart_employe, weddingplans
Creative-Expression  __quotexwhore, 20sknitters, adayinmylife, aesthetes, amateur_artists, behind_the_lens, charloft, color_theory, naturesbeauty, sew_hip
Entertainment-Music  beatlepics, bjorkish, broadway, just_good_music, news_jpop, patd, relaxmusic, rilokiley, theater_icons, thecure
Fandom               chuunin, house_cameron, miracle______, ncisficfind, patdslashseek, rpattz_kstew, sgagenrefinders, sgastoryfinders, sheldon_penny, time_and_chips
Fashion-Style        beauty101, corsetmakers, curlyhair, dyed_hair, egl, egl_comm_sales, madradstalkers, ourbedrooms, ru_glamour, vintagehair
Food-Travel          bentolunch, davis_square, eurotravel, filmsg, newyorkers, ofmornings, picturing_food, seattle, trashy_eats, world_tourist
Gaming-Technology    computer_help, computerhelp, gamers, htmlhelp, ipod, macintosh, thesims2, webdesign, worldofwarcraft, wow_ladies
Parenting-Pets       altparent, baby_names, breastfeeding, cat_lovers, clucky, dog_lovers, dogsintraining, naturalbirth, note_to_cat, parenting101
Politics-Culture     birls, blackfolk, classics, ftm, nonfluffypagans, ontd_political, poor_skills, prolife, queer_rage, transnews
Television           _we_are_lost, battlestar_blog, calmallamadown, doctorwho, glee_tv, gleeclub, gundam00, lword, theoffice_us, topmodel

Table 5: Communities from ten Livejournal directory categories used in the experiments.

Fig. 10: Above: topic proportions of 10 communities (dog_lovers, dogsintraining, cat_lovers, note_to_cat, computer_help, computerhelp, ipod, macintosh, htmlhelp, webdesign). Below: example topics (9, 16, 20, 23 and 29) and their most likely words, sized by p(word | topic).

Hyper-community detection aims to group communities that are somehow related. We investigate the usefulness of including mood in this clustering task.

Fig. 11: Communities and the mood usage. (a) The mood usage proportions of 100 communities used in hyper-community detection (vertical axis: communities; horizontal axis: moods). (b) An illustration of mood usage proportions in two groups of communities: {computer_help, computerhelp, htmlhelp, ipod, webdesign} and {ncisficfind, sgagenrefinders, sgastoryfinders} (vertical axis: proportion; horizontal axis: moods).

Topic-based Representation of Communities

To represent what community members talk about, we apply Latent Dirichlet Allocation (LDA) [4], a Bayesian probabilistic topic model, to the blog post corpus. All posts for each community are aggregated to form the corpus input to LDA, wherein each post is considered as one document. LDA learns the probabilities p(vocabulary | topic), which are used to describe a topic, and assigns a topic to each word in every document. Each post can then be represented as a mixture of topics using the probability p(topic | document). Intuitively, we expect similar communities to discuss a similar mix of topics, and hence have similar mixtures of p(topic | document) aggregated from their posts.

Formally, supposing we have $J$ communities, denote by $x_j = \{x_{1j}, x_{2j}, \ldots, x_{n_j j}\}$ the set of blog posts in the $j$-th community, where $n_j$ is the total number of posts by this community. Thus the corpus to be modelled consists of $N = \sum_{j=1}^{J} n_j$ documents aggregated from all communities, $D = \cup_{j=1}^{J} x_j$. Finally, if $\theta_{ij}$ denotes the topic mixture for blog post $x_{ij}$, the $j$-th community can be represented by $\theta_j = (1/n_j) \sum_{i=1}^{n_j} \theta_{ij}$. $\theta_j$ is a $K$-dimensional vector, where $K$ is the number of topics used by LDA, and the $k$-th element represents the mixture proportion of topic $k$ for community $j$. Figure 9 contains example topic mixtures for a number of communities. These mixtures are used to perform topic-based community clustering.

The topic distributions are well separated among some groups of communities. As can be seen in Figure 10, {dog_lovers, dogsintraining} could be inferred as a group of communities mainly talking about the character Dog (topic numbered 9); similarly, {cat_lovers, note_to_cat} about cat (topic 20); {macintosh, computer_help, computerhelp, ipod} on computer/ipod (topic 23); and {webdesign, htmlhelp} on web design (topic 29).

LDA requires that the number of topics be specified in advance, which is difficult to determine in real-world applications. To avoid this we explore the use of the hierarchical Dirichlet process (HDP) [38], a hierarchical Bayesian nonparametric topic model. This approach automatically infers the number of latent topics. Again, clustering is performed using AP [11], as the number of clusters is not known in advance, and to obtain cluster exemplars during clustering. Similarity between communities $j$ and $l$ is calculated as the Kullback-Leibler divergence between $\theta_j$ and $\theta_l$ (as each $\theta_j$ is a proper probability mass function over topics).
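The community-level topic mixture and the divergence between two communities can be sketched as below. This is an illustrative sketch, assuming the post-level mixtures $\theta_{ij}$ have already been produced by a topic model; the function names and the small `eps` guard against zero probabilities are assumptions, not details given in the text.

```python
import math


def community_topic_mixture(post_mixtures):
    """theta_j: element-wise mean of the post-level mixtures theta_ij."""
    n_posts = len(post_mixtures)
    n_topics = len(post_mixtures[0])
    return [sum(theta[k] for theta in post_mixtures) / n_posts
            for k in range(n_topics)]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against zero entries (assumed)."""
    return sum(pk * math.log((pk + eps) / (qk + eps))
               for pk, qk in zip(p, q) if pk > 0)


# Two hypothetical communities described over two topics.
theta_dogs = community_topic_mixture([[0.8, 0.2], [0.6, 0.4]])
theta_web = [0.1, 0.9]
divergence = kl_divergence(theta_dogs, theta_web)
```

Note that the Kullback-Leibler divergence is asymmetric and grows as communities become less alike, so a clustering algorithm that expects similarities would need it negated, and possibly symmetrized; the text does not specify which variant is used.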

Mood-based Representation of Communities

Recall that Livejournal offers 132 moods for users to tag their posts. We assume that there exists a difference in tagging moods among communities, supporting the intuition that such communities can be grouped by mood. Let $\mathcal{M} = \{\text{sad}, \text{happy}, \ldots\}$ be the predefined set of moods; $|\mathcal{M}| = 132$ is the total number of moods provided by Livejournal. Using the notation in Section 4, each blog post $x_{ij}$ in the $j$-th community is further tagged with a mood $m_{ij} \in \mathcal{M}$. For each community $j$, a 132-dimensional mood usage vector $m_j$ is constructed whose $k$-th element is the number of times the $k$-th mood in $\mathcal{M}$ was tagged within this community. Each $m_j$ is then normalized to unity. Figure 11a shows the mood proportions for the 100 communities. These proportions are used to perform mood-based community clustering.

Figure 11b shows a plot of the mood usage by eight different communities in Livejournal. It can be seen that the mood usage in one group of communities (computer_help, computerhelp, htmlhelp, ipod, webdesign) is rather well separated from another group (ncisficfind, sgagenrefinders, sgastoryfinders). The first group favors using moods having low valence (such as p*ssed off, worried, and confused) while the second prefers high valence moods (for instance, hopeful), empirically suggesting that it is sensible to study grouping behaviour based on mood.

              Topic (LDA)  Topic (HDP)  Mood  ANEW  LIWC
No. Clusters  20           17           9     15    12
Purity        70%          75%          46%   63%   54%
NMI           62%          68%          43%   59%   51%

Table 6: Cluster purity and NMI based on different community representations.

When mood groundtruth is not available (because a mood feature is not implemented by the social media application, or it is present but not used), mood-based hyper-community detection can be performed using ANEW vectors (as described in Section 3.1). Again, vector similarity is calculated with the Bhattacharyya coefficient, and AP is used to cluster communities.

Psycho-linguistic-based Representation of Communities

As a final point of comparison that bridges pure topical and mood-based representation of communities, we use psycho-linguistic features as classified by the Linguistic Inquiry and Word Count (LIWC) taxonomy [31]. LIWC assigns terms to one of four high-level categories: linguistic processes, psychological processes, personal concerns, and spoken categories, which are further sub-divided into a three-level hierarchy. The taxonomy ranges across topic (e.g., religion and health), mood (e.g., positive emotion), and processes not captured by either, such as cognition (e.g., causation and discrepancy) and tense. In all, 68 LIWC features are used to build a vector to provide a psycho-linguistic representation of each community.¹⁰
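For reference, the Bhattacharyya coefficient mentioned above for comparing normalized usage vectors can be written in a few lines. The sketch assumes both inputs are discrete distributions of equal length; the function name is hypothetical.

```python
import math


def bhattacharyya_coefficient(p, q):
    """BC(p, q) = sum_i sqrt(p_i * q_i) for two discrete distributions.

    Both inputs are assumed to be non-negative and to sum to 1. The result
    is 1.0 for identical distributions and 0.0 when they share no support.
    """
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
```

Unlike a divergence, the coefficient is already a similarity (higher means more alike), so it can be supplied to AP without negation.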

5.2 Hyper-community Detection

For experimentation, we crawled the communities listed in the Livejournal directory (footnote 11). These communities are categorized into 13 groups: Advice-Support, Creative-Expression, Entertainment-Music, Everything Else, Fandom, Fashion-Style, Food-Travel, Gaming-Technology, Parenting-Pets, Politics-Culture, Sports-Fitness, Television, and Thread-based RP. From the 579 communities obtained (consisting of 1,090,408 posts and 10,081,215 comments by 182,197 members), we extracted a subset consisting of the top 100 communities, having the most members, across ten categories, resulting in a dataset of 211,740 posts by 59,496 users. Table 5 lists the communities used for the remaining experiments.

Overall clustering performance for the different community representations (topic, mood, mood-proxy, and psycho-linguistic) is shown in Table 6. We report cluster purity and Normalized Mutual Information (NMI) against the Livejournal community classification. Because this is a topical classification, these metrics are expected to be highest for the topic-based community representation (75% purity and 68% NMI) and lower for pure mood (46% purity and 43% NMI). We are chiefly concerned with new knowledge discovered through the use of mood-related representations, which will be analyzed in more detail below for each type of representation.

10 http://www.liwc.net/descriptiontable1.php accessed July 2011.
11 http://www.livejournal.com/browse/ accessed July 2011.
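The two evaluation measures named above can be sketched as follows. This is an illustrative implementation, not the authors' code: it assumes the common definitions of purity (fraction of items falling in the majority class of their cluster) and of NMI normalised by the geometric mean of the two entropies. The text does not state which NMI normalisation was used, so that choice is an assumption.

```python
import math
from collections import Counter


def purity(clusters, classes):
    """Fraction of items that belong to the majority class of their cluster."""
    joint = Counter(zip(clusters, classes))
    best_per_cluster = {}
    for (cluster, _), count in joint.items():
        best_per_cluster[cluster] = max(best_per_cluster.get(cluster, 0), count)
    return sum(best_per_cluster.values()) / len(clusters)


def nmi(clusters, classes):
    """Mutual information normalised by sqrt(H(clusters) * H(classes))."""
    n = len(clusters)
    joint = Counter(zip(clusters, classes))
    cluster_sizes = Counter(clusters)
    class_sizes = Counter(classes)
    mutual_info = sum(
        (count / n) * math.log(count * n / (cluster_sizes[k] * class_sizes[j]))
        for (k, j), count in joint.items()
    )
    h_clusters = -sum((c / n) * math.log(c / n) for c in cluster_sizes.values())
    h_classes = -sum((c / n) * math.log(c / n) for c in class_sizes.values())
    if h_clusters == 0 or h_classes == 0:
        return 0.0  # degenerate case: a single cluster or a single class
    return mutual_info / math.sqrt(h_clusters * h_classes)
```

Here `clusters` would hold the hyper-community assigned to each community and `classes` its Livejournal category, in the same order.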

No. | Members
I | 20sknitters, corsetmakers, sew_hip
II | addme25_and_up, add_a_writer
III | beauty101, curlyhair, dyed_hair, vintagehair
IV | bjorkish, patd, rilokiley, thecure
V | blackfolk, classics, nonuypagans, ontd_political, prolife, queer_rage, transnews
VI | calmallamadown, battlestar_blog, news_jpop, rpattz_kstew, theoce_us
VII | cat_lovers, note_to_cat
VIII | charloft, _we_are_lost, baby_names, birls, chuunin, doctorwho, gamers, gundam00, house_cameron, i_am_thankful, just_good_music, lword, miracle______, patdslashseek, relaxmusic, sheldon_penny, thesims2, time_and_chips, weddingplans, worldofwarcraft, wow_ladies
IX | color_theory, adayinmylife, aesthetes, amateur_artists, beatlepics, behind_the_lens, lmsg, madradstalkers, naturesbeauty, ourbedrooms, ru_glamour, topmodel, world_tourist
X | dog_lovers, dogsintraining
XI | egl, egl_comm_sales
XII | glee_tv, broadway, gleeclub, theater_icons
XIII | macintosh, computer_help, computerhelp, ipod
XIV | newyorkers, davis_square, eurotravel, poor_skills, seattle
XV | ofmornings, bentolunch, picturing_food, trashy_eats
XVI | parenting101, altparent, breastfeeding, clucky, ftm, naturalbirth
XVII | sgagenrenders, nciscnd, sgastorynders
XVIII | todayirealized, __quotexwhore, boys_and_girls, thenicestthings
XIX | walmart_employe, baristas, iworkatborders
XX | webdesign, htmlhelp

Table 7: LDA-topic-based hyper-communities (exemplar listed first for each).

Fig. 12: LDA-topic-based hyper-communities with Livejournal category; multi-coloured clusters are less pure (best seen in colour). [Chart: x-axis, hyper-communities HC_I to HC_XX; y-axis, 0% to 100%; legend: Advice-Support, Creative-Expression, Entertainment-Music, Fandom, Fashion-Style, Food-Travel, Gaming-Technology, Parenting-Pets, Politics-Culture, Television.]

No. | Members
I | altparent, breastfeeding, clucky, ftm, naturalbirth, parenting101
II | baby_names, news_jpop
III | beatlepics, ru_glamour
IV | classics, nonuypagans
V | corsetmakers, 20sknitters, beauty101, curlyhair, dyed_hair, egl, egl_comm_sales, madradstalkers, sew_hip, vintagehair
VI | davis_square, charloft, eurotravel, newyorkers, poor_skills, seattle
VII | ipod, computer_help, computerhelp, htmlhelp, macintosh, webdesign
VIII | iworkatborders, baristas, walmart_employe
IX | note_to_cat, cat_lovers, dog_lovers, dogsintraining
X | ofmornings, bentolunch, picturing_food, trashy_eats
XI | ourbedrooms, adayinmylife, aesthetes, amateur_artists, behind_the_lens, birls, color_theory, lmsg, naturesbeauty, topmodel, world_tourist
XII | rpattz_kstew, house_cameron, miracle______, sheldon_penny, time_and_chips
XIII | thecure, bjorkish, blackfolk, just_good_music, lword, relaxmusic, rilokiley
XIV | theoce_us, _we_are_lost, battlestar_blog, broadway, calmallamadown, doctorwho, glee_tv, gleeclub, nciscnd, patd, patdslashseek, sgagenrenders, sgastorynders, theater_icons
XV | todayirealized, __quotexwhore, add_a_writer, addme25_and_up, boys_and_girls, chuunin, i_am_thankful, thenicestthings, thesims2, weddingplans
XVI | transnews, ontd_political, prolife, queer_rage
XVII | worldofwarcraft, gamers, gundam00, wow_ladies

Table 8: HDP-topic-based hyper-communities (exemplar listed first for each).

Fig. 13: HDP-topic-based hyper-communities with Livejournal category; multi-coloured clusters are less pure (best seen in colour). [Chart: x-axis, hyper-communities HC_I to HC_XVII; y-axis, 0% to 100%; legend: Advice-Support, Creative-Expression, Entertainment-Music, Fandom, Fashion-Style, Food-Travel, Gaming-Technology, Parenting-Pets, Politics-Culture, Television.]

Topic-based hyper-communities

Using LDA with 50 topics yielded 20 hyper-communities, listed in Table 7. Figure 12 shows community assignments to clusters together with Livejournal category. Clustering appears to have gathered topically similar communities together in a number of cases (e.g., {ofmornings, bentolunch, picturing_food, trashy_eats}), but also elucidated finer distinctions (e.g., {cat_lovers, note_to_cat}, {dog_lovers, dogsintraining}). Hyper-community VIII has the lowest purity. On further inspection, a number of its communities have a significant romance or relationships component. E.g., in addition to those communities with obvious topics, three are about particular fictional relationships: house_cameron, sheldon_penny, and time_and_chips.

Noticeably, the clustering result is better when HDP is employed to learn latent topics in the data (75% purity and 68% NMI) than when LDA is used (70% purity and 62% NMI). Specifically, running HDP on the same dataset, the model discovers 52 topics discussed by bloggers in their online diaries. Based on the similarity of topic preferences among communities, the communities are clustered into 17 hyper-communities, as shown in Table 8. As can be seen from the community assignments in reference to the Livejournal categories in Figure 13, there are more 100%-purity hyper-communities in HDP-topic-based clustering than in LDA-topic-based clustering. These 100%-purity hyper-communities with their top topics are shown in Table 9.

No. | Members | Category | Top-topic words (partial)
IV | classics, nonuypagans | Politics-Culture | pagan, troy, ancient, greek, history, wicca, gods, achilles, roman, paganism, goddess
VII | ipod, computer_help, computerhelp, htmlhelp, macintosh, webdesign | Gaming-Technology | computer, ipod, windows, files, drive, download, website
VIII | iworkatborders, baristas, walmart_employe | Advice-Support | store, starbucks, cafe, department, customer, customers, borders, shift, manager
IX | note_to_cat, cat_lovers, dog_lovers, dogsintraining | Parenting-Pets | dog, kitty, cat, kitten, leash, puppy, training, vet, crate, pet
X | ofmornings, bentolunch, picturing_food, trashy_eats | Food-Travel | lunch, egg, chocolate, cake, breakfast, sauce, rice, chicken, bento, cheese, dinner
XII | rpattz_kstew, house_cameron, miracle______, sheldon_penny, time_and_chips | Fandom | stories, story, spoilers, icons, fic, fics, episode
XVI | transnews, ontd_political, prolife, queer_rage | Politics-Culture | abortion, law, trans, identity, transgender, gender, gay, rights

Table 9: Pure HDP-topic-based clusters and their top topics.

Only one 100%-purity hyper-community is found in both clusterings: {ofmornings, bentolunch, picturing_food, trashy_eats}, those communities in the Food-Travel Livejournal category. It is interesting to see from the table that, while separated in LDA-topic-based clustering, {cat_lovers, note_to_cat} and {dog_lovers, dogsintraining} are in the same hyper-community in HDP-topic-based clustering, and so are {webdesign, htmlhelp} and {macintosh, computer_help, computerhelp, ipod}. In contrast, we also see a finer distinction for the Politics-Culture Livejournal category: while {classics, nonuypagans} mentioned classical topics such as ancient, troy or Greek history, {transnews, ontd_political, prolife, queer_rage} were interested in contemporary issues such as transgender, abortion or gay.

Members | Categories
altparent, boys_and_girls, breastfeeding, cat_lovers, clucky, dog_lovers, dogsintraining, macintosh, naturalbirth, parenting101, todayirealized | Advice-Support, Gaming-Technology, Parenting-Pets
baristas, blackfolk, iworkatborders, nonuypagans, note_to_cat, ontd_political, prolife, queer_rage, thesims2, transnews, walmart_employe, worldofwarcraft | Advice-Support, Gaming-Technology, Parenting-Pets, Politics-Culture
beatlepics, __quotexwhore, addme25_and_up, birls, charloft, egl_comm_sales, gundam00, house_cameron, miracle______, news_jpop, patdslashseek, rpattz_kstew, sheldon_penny, thenicestthings, time_and_chips | Advice-Support, Creative-Expression, Entertainment-Music, Fandom, Fashion-Style, Politics-Culture, Television
behind_the_lens, add_a_writer, aesthetes, amateur_artists, color_theory, lmsg, just_good_music, naturesbeauty, ofmornings, ourbedrooms, relaxmusic | Advice-Support, Creative-Expression, Entertainment-Music, Fashion-Style, Food-Travel
bentolunch, adayinmylife, i_am_thankful, picturing_food, trashy_eats, world_tourist | Advice-Support, Creative-Expression, Food-Travel
broadway, _we_are_lost, baby_names, beauty101, bjorkish, classics, curlyhair, davis_square, doctorwho, egl, eurotravel, ftm, gamers, lword, madradstalkers, newyorkers, poor_skills, rilokiley, seattle, thecure, theoce_us, topmodel, weddingplans, wow_ladies | Advice-Support, Entertainment-Music, Fashion-Style, Food-Travel, Gaming-Technology, Parenting-Pets, Politics-Culture, Television
chuunin, 20sknitters, battlestar_blog, calmallamadown, corsetmakers, dyed_hair, glee_tv, gleeclub, patd, ru_glamour, sew_hip, theater_icons, vintagehair | Creative-Expression, Entertainment-Music, Fandom, Fashion-Style, Television
htmlhelp, computer_help, computerhelp, ipod, webdesign | Gaming-Technology
nciscnd, sgagenrenders, sgastorynders | Fandom

Table 10: Mood-based hyper-communities. [The "Hyper-community Mood Cloud" column, a tag cloud of mood labels per hyper-community, is not reproduced.]

Members | Categories
addme25_and_up, add_a_writer, htmlhelp, nonuypagans | Advice-Support, Gaming-Technology, Politics-Culture
altparent, baby_names, breastfeeding, clucky, ftm, naturalbirth, parenting101, prolife | Parenting-Pets, Politics-Culture
beatlepics, amateur_artists, birls, calmallamadown, chuunin, classics, gundam00, news_jpop, ourbedrooms, patd, theater_icons, thecure, theoce_us | Creative-Expression, Entertainment-Music, Fandom, Fashion-Style, Politics-Culture, Television
behind_the_lens, aesthetes, color_theory, lmsg, naturesbeauty, ru_glamour, world_tourist | Creative-Expression, Fashion-Style, Food-Travel
bentolunch, ofmornings, picturing_food, trashy_eats | Food-Travel
curlyhair, beauty101, dyed_hair, vintagehair | Fashion-Style
dog_lovers, cat_lovers, dogsintraining, note_to_cat | Parenting-Pets
egl, 20sknitters, corsetmakers, egl_comm_sales, madradstalkers, sew_hip, weddingplans | Advice-Support, Creative-Expression, Fashion-Style
lword, _we_are_lost, battlestar_blog, bjorkish, blackfolk, broadway, glee_tv, gleeclub, rilokiley, topmodel | Entertainment-Music, Politics-Culture, Television
macintosh, computer_help, computerhelp, ipod, webdesign | Gaming-Technology
miracle______, charloft, house_cameron, just_good_music, nciscnd, patdslashseek, relaxmusic, rpattz_kstew, sgastorynders, sheldon_penny, thesims2 | Creative-Expression, Entertainment-Music, Fandom, Gaming-Technology
newyorkers, adayinmylife, baristas, davis_square, eurotravel, iworkatborders, ontd_political, poor_skills, seattle, transnews, walmart_employe | Advice-Support, Creative-Expression, Food-Travel, Politics-Culture
time_and_chips, doctorwho | Fandom, Television
todayirealized, __quotexwhore, boys_and_girls, i_am_thankful, queer_rage, sgagenrenders, thenicestthings | Advice-Support, Creative-Expression, Fandom, Politics-Culture
worldofwarcraft, gamers, wow_ladies | Gaming-Technology

Table 11: ANEW-based hyper-communities. [The "Hyper-community ANEW Words" column, a tag cloud of ANEW words per hyper-community, is not reproduced.]

Members | Categories
aesthetes, amateur_artists, behind_the_lens, doctorwho, lmsg, i_am_thankful, wow_ladies | Advice-Support, Creative-Expression, Food-Travel, Gaming-Technology, Television
altparent, baby_names, breastfeeding, clucky, naturalbirth, note_to_cat, parenting101 | Parenting-Pets
blackfolk, __quotexwhore, _we_are_lost, nonuypagans, prolife, queer_rage, sgastorynders, thesims2, todayirealized, transnews, walmart_employe | Advice-Support, Creative-Expression, Fandom, Gaming-Technology, Politics-Culture, Television
calmallamadown, addme25_and_up, baristas, beatlepics, birls, bjorkish, broadway, chuunin, egl, gamers, glee_tv, gleeclub, gundam00, lword, nciscnd, news_jpop, patd, patdslashseek, rilokiley, theater_icons, thecure, theoce_us, topmodel, weddingplans, worldofwarcraft | Advice-Support, Entertainment-Music, Fandom, Fashion-Style, Gaming-Technology, Politics-Culture, Television
color_theory, naturesbeauty, ru_glamour | Creative-Expression, Fashion-Style
dog_lovers, boys_and_girls, cat_lovers, dogsintraining, thenicestthings | Advice-Support, Parenting-Pets
just_good_music, battlestar_blog, charloft, ontd_political, relaxmusic | Creative-Expression, Entertainment-Music, Politics-Culture, Television
macintosh, 20sknitters, add_a_writer, classics, computer_help, computerhelp, corsetmakers, ftm, htmlhelp, ipod, iworkatborders, sew_hip, webdesign | Advice-Support, Creative-Expression, Fashion-Style, Gaming-Technology, Politics-Culture
newyorkers, davis_square, egl_comm_sales, eurotravel, madradstalkers, ourbedrooms, poor_skills, seattle, sgagenrenders, world_tourist | Fandom, Fashion-Style, Food-Travel, Politics-Culture
sheldon_penny, adayinmylife, house_cameron, miracle______, rpattz_kstew, time_and_chips | Creative-Expression, Fandom
trashy_eats, bentolunch, ofmornings, picturing_food | Food-Travel
vintagehair, beauty101, curlyhair, dyed_hair | Fashion-Style

Table 12: Psycho-linguistic-based (LIWC) hyper-communities, showing pie charts of LIWC features used above average per hyper-community; Livejournal categories; and grouped communities. [The pie charts are not reproduced.]

Mood-based hyper-communities

Clustering based on explicit mood labels yielded 9 hyper-communities, which are recorded in Table 10. In contrast to the topic-based clustering, only two hyper-communities have 100% purity with respect to the topical ground truth, one of which is the only group characterized by negative mood, consisting of htmlhelp, computer_help, computerhelp, ipod, and webdesign.

Mood-based clustering reveals distinctions not apparent in the topic-based representation. E.g., the group including behind_the_lens, while having significant overlap with Group IX (see Table 7) in the topic-based clustering, has some illuminating differences: gone are the communities beatlepics, madradstalkers, ru_glamour, topmodel, and world_tourist; replacing them are add_a_writer, just_good_music, and ofmornings. From an appraisal of the content of these communities, we find the distinctions to be nuanced. The topic-based hyper-community is loosely united by pictures and people, whereas the mood-based hyper-community is united by the desire to create and its outcomes, differences that are best explained by prevailing mood and intent. Indeed, these distinctions are captured by the predominant moods of the different hyper-communities: curious, cheerful, or happy vs. calm, accomplished, and creative, respectively.

ANEW-based hyper-communities

Clustering on ANEW features as proxy mood yielded 15 hyper-communities, listed in Table 11. Of these, five consisted of communities with matching Livejournal category (e.g., curlyhair, beauty101, dyed_hair, vintagehair are all classified as Fashion-Style). Two hyper-communities are examples of the sub-category distinctions returned by the topic-based clustering; {macintosh, computer_help, computerhelp, ipod, webdesign} and {worldofwarcraft, gamers, wow_ladies} are both from Livejournal's Gaming-Technology category.

Psycho-linguistic-based hyper-communities

Clustering based on psycho-linguistic features yielded 12 hyper-communities, shown in Table 12. Three hyper-communities contain communities with the same Livejournal category, and appear to have been associated topically. The top three LIWC categories for these hyper-communities are illuminating: for Fashion-Style, feel, body, and percept (i.e., perceptual processes); for Food-Travel, ingest, bio (i.e., biological processes), and percept; and for Parenting-Pets, family, health, and humans (e.g., adult, baby, boy).

Other hyper-communities appear to exhibit a characteristic mixture of topic and style of discussion, which is in part captured by the linguistic processes of LIWC. E.g., one hyper-community aggregates all of the communities in the dataset about fictional relationships (plus one community about documenting a day in one's life). These communities are a kind of meta-genre not easily captured by topical features alone; linguistic features such as post length (i.e., wc) and extensive use of the 3rd person singular (i.e., shehe) appear to help associate these communities.

Last, there is at least one hyper-community for which mood appears to be the discriminating feature, that which includes walmart_employe. This hyper-community has above average use of the LIWC categories sad, swear, death, sexual, health, humans, anger, relig, and they (i.e., 3rd person plural). These communities could be described as some combination of angst-ridden, gritty, adversarial (note the above average use of 3rd person plural), or forthright, and contain much negative emotion and introversion.

Contrast this hyper-community with that which includes the community aesthetes: it discusses similar topics (e.g., health, death, relig) but does so with above average posemo (i.e., positive emotion).

Fig. 14: Aggregated mood distribution for two users, plotted by valence (pleasure) x-axis, and arousal (activation) y-axis. Larger mood labels indicate more frequent occurrence. (a) High mood variance (user from 24_7_posting). (b) Low mood variance (user from devotional365). [Each panel: x-axis from DISPLEASURE to PLEASURE, y-axis from DEACTIVATION to ACTIVATION, with mood labels placed at their valence and arousal coordinates.]
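The notion of a LIWC category being used "above average" by a hyper-community, as in Table 12, can be illustrated with a short sketch. It assumes that "average" means the unweighted mean of a feature over all hyper-communities, which the text does not specify; the group names and feature values in the usage note are invented for illustration.

```python
def above_average_features(group_vectors, feature_names):
    """Map each group to the features whose value exceeds the mean over groups.

    group_vectors: dict of group name -> list of feature values,
    ordered consistently with feature_names.
    """
    n_groups = len(group_vectors)
    means = [
        sum(vec[i] for vec in group_vectors.values()) / n_groups
        for i in range(len(feature_names))
    ]
    return {
        group: [
            name
            for name, value, mean in zip(feature_names, vec, means)
            if value > mean
        ]
        for group, vec in group_vectors.items()
    }
```

For instance, with features `["posemo", "negemo", "anger"]` and two hypothetical groups `{"A": [0.06, 0.01, 0.00], "B": [0.02, 0.05, 0.02]}`, group A would be flagged for posemo only, and group B for negemo and anger.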

5.3 Discussion

It is not surprising that the different community representations lead to hyper-communities that reflect these varying emphases. Topic-based representation is the method of choice for recovering hierarchy within, and associations across, Livejournal's canonical topic categories. Likewise, the results for the mood-based representation indicate an ability to recover non-topical features of a community such as prevailing intent and atmosphere of discussion. However, contrary to expectations, ANEW does not appear to be a well-suited, cheap alternative to mood-based representation for the task of hyper-community detection.

The clustering results for LIWC's psycho-linguistic representation are worthy of follow-up. LIWC offers a wide scope of classification (including as it does topical, linguistic, stylistic, and mood categories) yet is cheap to obtain. Some of the distinctions captured by the hyper-communities arising from LIWC representation are a kind of "topic + atmosphere" that seems relevant to the Web 2.0 denizen, who is faced with a surfeit of choice, and whose decision as to which community they will invest in may turn on the presence of more than one characteristic of the content. Hence psycho-linguistic analysis would make a useful facet of community recommendation (and analysis).

Of course, there are more dimensions than community along which to break down and re-factor the Livejournal corpus; e.g., user and discussion. Figure 14 contains plots of mood aggregated from the posts of two different users. Figure 14a represents a user with high mood variance: this user's moods are distributed across three quadrants of the mood circle, with a few instances in the fourth. Figure 14b represents a user with low mood variance: this user's moods tend to be restricted to positive valence with average arousal. User profiling of this kind is interesting in and of itself, but when correlated with communities offers additional insight into the user and/or community. E.g., we have found users who appear to project different personas conditional on the community of posting, which is captured by valence; valence over time for a particular user can indicate traumatic events; and joint analysis of such events with the mood of responding users potentially offers an even finer-grained tool of analysis.
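The contrast between a high-variance and a low-variance user can be made concrete with a small sketch. It assumes each mood label maps to a (valence, arousal) pair on a 1 to 9 scale and measures spread as the population variance along each axis. The ratings in `TOY_RATINGS` are invented placeholders rather than values from ANEW, and the variance measure is an assumption, since the text does not state how mood variance was quantified.

```python
from statistics import pvariance

# Hypothetical (valence, arousal) ratings on a 1-9 scale; NOT actual ANEW values.
TOY_RATINGS = {
    "happy": (8.0, 6.5),
    "calm": (7.0, 2.5),
    "sad": (2.0, 4.0),
    "aggravated": (2.5, 7.0),
    "content": (7.5, 4.0),
}


def mood_spread(mood_labels):
    """Return (valence variance, arousal variance) over a user's rated moods."""
    points = [TOY_RATINGS[m] for m in mood_labels if m in TOY_RATINGS]
    if not points:
        return (0.0, 0.0)  # no rated moods: report zero spread
    valences = [v for v, _ in points]
    arousals = [a for _, a in points]
    return (pvariance(valences), pvariance(arousals))
```

A user tagging posts mostly with happy, content, and calm would show a small valence variance, while a user alternating between happy, sad, aggravated, and calm would show a much larger one, mirroring the two panels of Figure 14.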

6 Conclusion

In seeking to create tools for analysing content in social media under the impact of users' moods, we have investigated the problem of mood classification in weblogs. While the problem of machine learning-based feature selection for text categorisation has been intensively explored, little work has been done on textual-based mood classification, which is often more challenging. This paper's contribution is a comprehensive comparison of different feature-selection schemes in combination with a range of linguistic subsets as feature spaces across two large datasets, elucidating insights into what can be transferred from a generic text-categorisation problem to mood classification. In addition, a novel use of a set of psychology-inspired features (ANEW) and psycholinguistic features (LIWC) is proposed; these do not require a supervised selection phase and can therefore be applied for mood analysis at a much larger scale. The results support similar findings in previous research, but have also brought to light discoveries particular to the problem of mood classification. The newly proposed feature sets have also performed comparatively well at a fraction of the computational cost of supervised schemes.

Our analysis of global mood structures reveals interpretable and interesting patterns in the organisation, transition and continuum of moods, suggesting valuable empirical evidence about the structure of human emotion. The patterns contain mood synonyms, which can be used interchangeably, for instance, in terms of the sentiment score. The results have additional potential applications, such as mood-sensitive indexing and retrieval.

We have discovered hyper-groups of communities by using topics, sentiment information and psycholinguistic properties of the posts of members. The problem of sentiment-based clustering for community structure discovery is rich with many interesting open aspects to be explored. The meta-communities grouped based on the sentiment information can be a social indicator, having potential applications in, for example, mental health (by targeting support or surveillance to communities with negative mood) or in marketing (by targeting customer communities having the same sentiment on similar topics). On the other hand, the psycholinguistic hyper-groups detected provide insight into the language styles of people in specific categories, whereas topical meta-communities are a good source for users to find suitable communities based on their interests.

References

1. B. Adams, D. Phung, and S. Venkatesh. Discovery of latent subcommunities in a blog's readership. ACM Transactions on the Web, 4(3):1-30, 2010.
2. L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan. Group formation in large social networks: Membership, growth, and evolution. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages 44-54, 2006.
3. B. Berendt and C. Hanser. Tags are not metadata, but 'just more content' - to some people. Proceedings of the International AAAI Conference on Weblogs and Social Media (ICWSM), 2007.
4. D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
5. V. Bolón-Canedo, N. Sánchez-Maroño, and A. Alonso-Betanzos. A review of feature selection methods on synthetic data. Knowledge and Information Systems, pages 1-37, 2012.
6. M.M. Bradley and P.J. Lang. Affective norms for English words (ANEW): Instruction manual and affective ratings. University of Florida, 1999.
7. E. Cambria, A. Hussain, C. Havasi, C. Eckl, and J. Munro. Towards crowd validation of the UK national health service. In Proceedings of the Web Science Conference (WebSci), 2010.
8. T.K. Fan and C.H. Chang. Sentiment-oriented contextual advertising. Knowledge and Information Systems, 23:321-344, 2010.
9. A.K. Farahat, A. Ghodsi, and M.S. Kamel. Efficient greedy feature selection for unsupervised learning. Knowledge and Information Systems, pages 1-26, 2012.
10. S. Feng, D. Wang, G. Yu, W. Gao, and K.F. Wong. Extracting common emotions from blogs based on fine-grained sentiment clustering. Knowledge and Information Systems, 27:281-302, 2011.
11. B.J. Frey and D. Dueck. Clustering by passing messages between data points. Science, 315:972-976, 2007.
12. M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I.H. Witten. The Weka data mining software: An update. ACM SIGKDD Explorations Newsletter, 11(1):10-18, 2009.
13. C. Hayes and P. Avesani. Using tags and clustering to identify topic-relevant blogs. In Proceedings of the International AAAI Conference on Weblogs and Social Media (ICWSM), 2007.
14. X. Hu and J.S. Downie. Exploring mood metadata: Relationships with genre, artist and usage metadata. In Proceedings of the International Conference on Music Information Retrieval, 2007.
15. R. Kumar, J. Novak, and A. Tomkins. Structure and evolution of online social networks. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), page 617, 2006.
16. G. Leshed and J.J. Kaye. Understanding how bloggers feel: Recognizing affect in blog posts. In Proceedings of the ACM Conference on Human Factors in Computing Systems (SIGCHI), page 1024, 2006.
17. C. Long, J. Zhang, M. Huang, X. Zhu, M. Li, and B. Ma. Estimating feature ratings through an effective review selection approach. Knowledge and Information Systems, 2012.
18. A. McCallum, X. Wang, and A. Corrada-Emmanuel. Topic and role discovery in social networks with experiments on Enron and academic email. Journal of Artificial Intelligence Research, 30:249-272, 2007.
19. A. McCallum, X. Wang, and N. Mohanty. Joint group and topic discovery from relations and text. Lecture Notes in Computer Science, 4503:28, 2007.
20. G. Mishne. Experiments with mood classification in blog posts. In Proceedings of ACM Workshop on Stylistic Analysis of Text for Information Access, 2005.
21. G. Mishne and N. Glance. Predicting movie sales from blogger sentiment. In Proceedings of the AAAI Spring Symposium on Computational Approaches to Analysing Weblogs, 2006.
22. H. Mohtasseb and A. Ahmed. Two-layered blogger identification model integrating profile and instance-based methods. Knowledge and Information Systems, 31(1):1-21, 2012.
23. R. Nallapati and W. Cohen. Link-PLSA-LDA: A new unsupervised model for topics and influence of blogs. In Proceedings of the International AAAI Conference on Weblogs and Social Media (ICWSM), 2008.
24. R.A. Negoescu, B. Adams, D. Phung, S. Venkatesh, and D. Gatica-Perez. Flickr hypergroups. In Proceedings of the ACM International Conference on Multimedia, pages 813-816, 2009.
25. T. Nguyen, D. Phung, B. Adams, T. Tran, and S. Venkatesh. Classification and pattern discovery of mood in weblogs. Advances in Knowledge Discovery and Data Mining, pages 283-290, 2010.
26. T. Nguyen, D. Phung, B. Adams, T. Tran, and S. Venkatesh. Hyper-community detection in the blogosphere. In Proc. of ACM Workshop on Social media, in conjunction with ACM Int. Conf on Multimedia (ACM-MM), Firenze, Italy, 2010. ACM.
27. K. Nigam and M. Hurst. Towards a robust metric of opinion. In AAAI Spring Symposium on Exploring Attitude and Affect in Text, pages 598-603, 2004.
28. B. Pang and L. Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1-135, 2008.
29. B. Pang, L. Lee, and S. Vaithyanathan. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing, pages 79-86, 2002.
30. J.W. Pennebaker, C.K. Chung, M. Ireland, A. Gonzales, and R.J. Booth. The development and psychometric properties of LIWC2007. Austin, Texas: LIWC Inc, 2007.
31. J.W. Pennebaker, M.E. Francis, and R.J. Booth. Linguistic inquiry and word count (LIWC) [computer software]. Austin, Texas: LIWC Inc, 2007.
32. J.A. Russell. A circumplex model of affect. Journal of Personality and Social Psychology, 39(6):1161-1178, 1980.
33. J.A. Russell. Core affect and the psychological construction of emotion. Psychological Review, 110(1):145, 2003.
34. F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1-47, 2002.
35. X. Song, C.Y. Lin, B.L. Tseng, and M.T. Sun. Modeling and predicting personal information dissemination behavior. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages 479-488, 2005.
36. S.O. Sood and L. Vasserman. ESSE: Exploring mood on the web. In Proceedings of the International AAAI Conference on Weblogs and Social Media (ICWSM), 2009.
37. Y.R. Tausczik and J.W. Pennebaker. The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology, 29(1):24, 2010.
38. Y. W. Teh and M. I. Jordan. Hierarchical Bayesian nonparametric models with applications. In N. Hjort, C. Holmes, P. Müller, and S. Walker, editors, Bayesian Nonparametrics: Principles and Practice. Cambridge University Press, 2010.
39. Y. Tsuruoka and J. Tsujii. Bidirectional inference with the easiest-first strategy for tagging sequence data. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 467-474, 2005.
40. Y. Yang and J.O. Pedersen. A comparative study on feature selection in text categorization. In Proceedings of the International Conference on Machine Learning (ICML), pages 412-420, 1997.