Understanding Email Writers: Personality Prediction from Email Messages

Jianqiang Shen, Oliver Brdiczka, and Juan Liu

Palo Alto Research Center, 3333 Coyote Hill Road, Palo Alto, CA 94304, USA
{jianqiang.shen, oliver.brdiczka, juan.liu}@parc.com

Abstract. Email is a ubiquitous communication tool and constitutes a significant portion of social interactions. In this paper, we attempt to infer the personality of users from the content of their emails. Such inference can enable valuable applications such as better personalization, recommendation, and targeted advertising. Considering the private and sensitive nature of email content, we propose a privacy-preserving approach to collecting email and personality data. We then frame personality prediction in terms of the well-known Big Five personality model and train predictors on extracted email features. We report the prediction performance of three generative models with different assumptions. Our results show that personality prediction is feasible and that our email feature set can predict personality with reasonable accuracy.

Keywords: Personality, behavior analysis, email, text processing

1 Introduction

Email is one of the most successful computer applications to date. Office workers around the world spend a significant amount of time writing and reading electronic messages; in fact, previous research even characterized email as the natural "habitat" of the modern office worker [7]. The growing importance of email as a communication medium has inspired many attempts to develop tools for presenting, organizing, classifying, and understanding emails (e.g., [2, 5, 6]). In this paper, we extend this line of research by focusing on an as-yet unexplored area: inferring a user's personality from their emails.

Our work is part of a broader research effort to enhance user experience through personalization and adaptation. Modeling user personality based on electronic communications could enable better personalization of user interfaces and content [8], more efficient collaboration (by forming groups of compatible individuals) [16], more precise targeted advertising, or improved learning efficiency through customized teaching materials and styles [17], to name just a few possibilities. Personality modeling can also be important in the workplace. Employers may prefer to match an employee's personality with suitable tasks to optimize performance. On the other hand, in high-stakes work environments, a significant mismatch between personality and task requirements may impose risk.

Our personality predictor is part of a larger system being developed to detect such anomalies and prevent malicious behaviors in corporate networks. Personality profiling enables us to identify individuals with the motivation and capability to damage the work environment, harm co-workers, or commit suicide in the workplace.

Training reliable personality predictors is unexplored territory in research, and one major barrier is data collection. Although email is ubiquitous, public and realistic email corpora are rare due to privacy issues. Acquiring personality profiles can be even more challenging, since personality is often considered private. To our knowledge, no public email set, including Enron [6], has personality information. We thus designed innovative strategies to retrieve predictive information while protecting privacy.

Given a feature vector abstracted from a single email, our goal is to reliably predict the personality of the email's writer. In psychology, the Big Five model is one of the most widely adopted standards [12]. Using this framework, each email message can be associated with a set of personality trait values. We explored three different generative models; as will become apparent, a method with a label-independence assumption works best in our case, which suggests that in our data sets personality traits are relatively distinct and independent from each other.

The contribution of this paper is threefold. First, we show that it is possible to reliably infer an email writer's personality, with results from two large real-world email sets. To the best of our knowledge, this is the first attempt at inferring personality from emails using automatic content analysis. Second, we present our email features and their predictive power in detail to inform future work on personality. Third, we present a pilot exploration of different learning strategies for personality prediction, which could facilitate future studies.

2 Personality

Personality traits are consistent patterns of thoughts, feelings, or actions that distinguish people from one another [12]. A trait is an internal characteristic that corresponds to an extreme position on a behavioral dimension. Personality traits are basic tendencies that remain stable across the life span. Under the Big Five model, the following five factors have been shown to account for most individual differences in personality: Neuroticism, Agreeableness, Conscientiousness, Extraversion, and Openness. An individual's personality affects their behavior [12]: scoring high on a given trait means reacting consistently to the same kinds of situations over time, and it is possible to estimate a stranger's personality from their behavior [13].

Our personality predictor is part of a larger system being developed to detect anomalies and prevent malicious behaviors in corporate networks. Corporations and government agencies are interested in predicting and protecting against insider attacks by trusted employees. We posit that some personalities are more likely than others to react to job and life stress with negative actions. Consequently, we introduce personality traits as static personal variables to consider when assessing an individual's relative interest in carrying out insider attacks.

Research shows that malicious intentions are more related to Extraversion, Agreeableness, Conscientiousness and Neuroticism, and less related to Openness [10]. Intuitively, the definition of each trait matches this finding well. Neuroticism concerns the extent to which individuals are anxious, irritable, depressed, moody and lacking in self-confidence. Agreeableness concerns the extent to which individuals are trusting, non-hostile, compliant and caring; in particular, it is negatively correlated with antisocial personality disorder. Conscientiousness concerns the extent to which individuals are efficient, organized, dutiful, and self-disciplined. Extraversion is related to excitement-seeking, which requires high levels of environmental stimulation to avoid boredom. By contrast, Openness includes facets such as aesthetics and values and is less correlated with criminality. In this paper, we therefore focus on predicting the values of Neuroticism, Agreeableness, Conscientiousness and Extraversion from the content of written emails.

3 Email Data Collection

Although email is ubiquitous, public and realistic email corpora are rare, largely due to privacy issues. Data collection is even more challenging for our problem, since we need two kinds of data from each user: the user's personality profile and their written emails. Both are highly sensitive and need to be anonymized. Our logging tools take two measures to protect privacy: (1) participants are assigned a random unique ID, and only this ID is kept as the identifier in the extracted data set; (2) only high-level aggregated features are extracted, and none of the raw email content is included in the feature set. These steps ensure that personally identifiable information and raw content cannot be reconstructed from our feature data set. We developed two email extraction tools: one is deployed as a software agent installed on participants' PCs to extract emails from Outlook; the other is deployed as a web service to collect emails from a participant's Gmail account.
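To make the logging scheme concrete, here is a minimal Python sketch of these two measures; the field names are illustrative rather than our tools' exact schema:

    import uuid

    def make_participant_id():
        """Measure (1): assign a random, non-reversible identifier."""
        return uuid.uuid4().hex

    def log_email_features(participant_id, features):
        """Measure (2): store only the random ID plus high-level aggregated
        features; raw text and real identities never enter the record."""
        record = {"participant": participant_id}
        record.update(features)
        return record

    # Example: a raw email is reduced to counts before anything is logged.
    record = log_email_features(make_participant_id(),
                                {"word_count": 112, "to_count": 2})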

3.1 Email Anonymization

Email Preprocessing. We need to preserve the structure of email threads in order to construct predictive features later. An email thread is a group of messages related by "reply/forward" links. By capturing email threads, we can further analyze the possible factors that lead to specific email responses. These threading relationships can be reliably recovered by analyzing the subject lines and the RFC-822 header fields of email messages (e.g., Message-ID, In-Reply-To, References). We also need to clean up emails before extraction to avoid being misled by irrelevant material, so we detect and isolate reply lines and signature blocks. Many users delete less important emails, so there may be gaps in an email thread. We therefore rely on a hybrid content-analysis approach for detection: each message is checked both by regular expressions [15] and by pre-trained automatic reply/signature detectors. When a user replies to an email, popular systems such as Outlook and Gmail produce fixed patterns for the reply headers and lines; signatures are typically at the end of the message and contain names and/or addresses. We designed specific regular expressions to capture such patterns. The pre-trained reply/signature detectors [4] use over 30 features (such as email patterns, name patterns, and punctuation) and adopt Conditional Random Fields [14] for inference. To maximize the probability of thoroughly cleaning an email, we treat a line as a reply or signature if either approach predicts so. On a small test set of 100 emails, this approach achieved over 95% accuracy.
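For illustration, a simplified Python sketch of the regular-expression half of this hybrid cleaner; the patterns below are stand-ins for the expressions we actually used, and the CRF-based detector is omitted:

    import re

    # Simplified stand-ins for reply-header and signature patterns; real
    # Outlook/Gmail headers vary by client version and locale.
    REPLY_HEAD = re.compile(
        r"^(>+|On .+ wrote:$|-{2,}\s*Original Message\s*-{2,}|From:\s.+)")
    SIGNATURE = re.compile(
        r"^(--\s*$|best regards\b|sincerely\b|thanks[,!]?\s*$)", re.IGNORECASE)

    def clean_email(body):
        """Drop quoted reply lines and everything after a signature marker."""
        kept = []
        for line in body.splitlines():
            stripped = line.strip()
            if SIGNATURE.match(stripped):
                break              # signature block: stop keeping lines
            if REPLY_HEAD.match(stripped):
                continue           # quoted reply material: skip this line
            kept.append(line)
        return "\n".join(kept)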

Email Features. After performing the pre-processing steps outlined above, our data logger extracts the following high-level features from the raw written text.

Bag-of-word features. The simplest and most prevalent document representation is the bag-of-words space [11]. Instead of collecting every word that appears, we build our features only from words that are most commonly used in everyday life, for three reasons. First, focusing on common words protects an individual's privacy by avoiding the collection of specific, unique, and sensitive words. Second, research on sentiment analysis [18] shows that common words are the most predictive for identifying the viewpoints underlying a text span. Third, focusing on common words helps avoid associating specific, unique words with certain individuals in the training set rather than with certain personalities. We started building the common word list from the top 20,000 most common words in TV and movie scripts [26]. We then retrieved the top 1,000 male first names, 2,000 female first names, and 5,000 surnames from the U.S. Census Bureau survey for the year 2005, and removed those names from the word list. We further checked the list and removed any words related to addresses, names, research areas, and job titles. This left us with a list of 16,623 common words. Each cleaned email was heuristically tokenized into a word list, and we counted the frequency of each word in the common word list.

Meta Features. From each email message, we calculated the TO/CC/BCC counts, the importance flag of the email, counts of different punctuation symbols, counts of words, characters, positive and negative numbers, paragraphs, and attachments, and the month, day of the month, and day of the week of the sent time. We also recorded whether the message is a reply/forward. If it is, we calculated the duration since the writer received the original email, as well as the meta features of the original email.
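A minimal sketch of this restricted bag-of-words extraction and a few meta features; the tokenizer is deliberately crude, and the tiny word list and dict keys stand in for the 16,623-word list and our actual message schema:

    import re
    from collections import Counter

    # Toy stand-in for the curated common-word list (names, addresses,
    # and job titles removed).
    COMMON_WORDS = {"the", "meeting", "thanks", "tomorrow", "please"}

    def bag_of_words(text):
        """Count only common-list words, so rare or identifying terms
        never enter the feature vector."""
        tokens = re.findall(r"[a-z']+", text.lower())
        return Counter(t for t in tokens if t in COMMON_WORDS)

    def meta_features(msg):
        """A few of the meta features described above, from a parsed message."""
        return {
            "to_count": len(msg["to"]),
            "cc_count": len(msg["cc"]),
            "word_count": len(msg["body"].split()),
            "char_count": len(msg["body"]),
            "is_reply": msg["subject"].lower().startswith("re:"),
        }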

Word Statistics. We applied part-of-speech (POS) tagging [22] to each cleaned email. POS tagging reads text and sequentially tags each token with a syntactic label, such as noun or verb, based on its definition as well as its context, i.e., its relationship with adjacent and related words in the phrase, sentence, and paragraph. We recorded the count of words for each POS tag.

We adopted sentiment analysis to generate features from the email content. Sentiment analysis is the task of identifying positive and negative opinions, emotions, and evaluations [27]. We used a sentiment polarity dictionary from the University of Pittsburgh [27] for this purpose. Compared with LIWC [19], this dictionary has much higher coverage and contains detailed human-annotated polarity information for 8,221 distinct words. We scanned each email and counted the number of words in different sentiment categories, including positive/negative/neutral/both words and strongly/weakly subjective words. For a reply email, we also calculated the above features on the original email.

Studies have shown that the usage of pronouns and negations can indicate emotional states [21]. For example, the words "me" and "not" are related to anger, and the word "I" is related to emotional vulnerability. We listed 76 pronouns and 36 negations and counted their frequencies; a sketch of this counting appears at the end of this subsection. We also counted the number of words consisting entirely of lowercase letters and of words consisting entirely of uppercase letters. Finally, to get a sense of the complexity and difficulty of the words used by the writer, we made two calculations: first, we counted the number of letters in each word and computed the histogram of word lengths; second, we assigned each word a difficulty level based on its rank in the common word list and computed the average difficulty level.

Writing Styles. Emails can have a conventional, structured format. Such a message typically starts with a greeting, which helps create a friendly tone, and the choice of using the correspondent's name depends on who is being written to. Emails often end politely with common closings such as "best regards". At the opposite end of the formality spectrum, email writers can also use smileys such as ":-)" or ":-p" to convey emotional content like sarcasm and laughter. There are thus no mandatory formulas for writing emails, and not all social and business emails follow exactly one of the above formats, since emails strike a balance between the conventional format and the writer's own personal style. This writing style can be very informative [1]. We tried to capture such differences in writing style with:
1. greeting patterns: we collected 83 popular opening greetings, e.g., "Hi", "Dear".
2. closing patterns: we collected 68 closing patterns, e.g., "kindly", "sincerely".
3. wish patterns: we collected 28 common wish patterns, such as "have a good day", "looking forward to seeing you".
4. smiley words: we collected 115 popular smileys, such as ":-)", "8-)".

Speech Act Scores. One important use of work-related email is negotiating and delegating shared tasks and subtasks [5, 25]. To better understand the whole message, it is therefore desirable to detect its purpose. The Speech Act taxonomy is typically divided into verbs and nouns, and each email message is represented by one or more verb-noun pairs: for example, an email proposing a meeting would have the labels Propose and Meeting. We used a pre-trained Speech Act predictor [5] to predict these Speech Acts from the email and adopted the prediction confidence of each act as a feature. This predictor achieves, on average, nearly 70% F1 scores over all Speech Acts [5]. Although it is not extremely accurate, it nevertheless provides additional information on the tone and purpose of the email, as shown in the experimental results.
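A sketch of the word-statistics counting described above, assuming the polarity lexicon has been parsed into a word -> (polarity, strength) mapping; the lexicon entries and word sets shown are invented examples:

    from collections import Counter

    # Toy stand-in for the subjectivity lexicon: word -> (polarity, strength).
    LEXICON = {"great": ("positive", "strong"),
               "issue": ("negative", "weak"),
               "fine":  ("positive", "weak")}
    PRONOUNS = {"i", "me", "my", "we", "you"}   # subset of the 76 pronouns
    NEGATIONS = {"not", "never", "no"}          # subset of the 36 negations

    def word_statistics(tokens):
        feats = Counter()
        for t in tokens:
            if t in LEXICON:
                polarity, strength = LEXICON[t]
                feats[polarity + "_words"] += 1
                # yields "strong"/"weak" + "ly_subjective" category counts
                feats[strength + "ly_subjective"] += 1
            feats["pronouns"] += t in PRONOUNS    # bool counts as 0 or 1
            feats["negations"] += t in NEGATIONS
        return feats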

3.2 Email and Behavior: An Illustration

Before presenting our personality prediction models, we first use an example to illustrate how revealing email messages can be about a person's personality and emotional state.

Fig. 1. Trend of the average count of positive sentiment words and superlative adjective words per sent email, from May 2011 to March 2012, with 95% confidence intervals.

We were fortunate to have access to the sent emails of an individual who left his job in March 2012 and kindly donated his email archive. He said that he had started thinking about a career change half a year beforehand, although none of his colleagues noticed. We therefore examined whether his changing emotional state was reflected in the content of his messages prior to his eventual departure. As shown in Figure 1, the count of positive sentiment words this person used decreased significantly over time, suggesting that he became disenchanted with his work. The count of superlative adjectives also decreased significantly, suggesting that he became less excited about work-related matters. In this particular case the email trends are quite revealing. It will be interesting to explore in general how email features correlate with one's personality and emotional states, both in corporate email and in more casual settings.
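For readers who want to reproduce this kind of trend, a sketch of the underlying monthly aggregation; the 95% band uses the usual normal approximation, and the input encoding is our own illustrative choice:

    import math
    from collections import defaultdict

    def monthly_trend(emails):
        """emails: iterable of (month, positive_word_count) pairs, one per
        sent email. Returns month -> (mean, 95% half-width)."""
        by_month = defaultdict(list)
        for month, count in emails:
            by_month[month].append(count)
        trend = {}
        for month, counts in sorted(by_month.items()):
            n = len(counts)
            mean = sum(counts) / n
            var = sum((c - mean) ** 2 for c in counts) / max(n - 1, 1)
            trend[month] = (mean, 1.96 * math.sqrt(var / n))
        return trend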

3.3 Data Collection

We carried out two email data collections. One email set was collected in a research lab, where recruiting was done through email and door-to-door solicitations; we tried to recruit participants who were as diverse as possible. Our software agent was installed on participants' PCs to extract the sent emails in Outlook from the previous 12 months. In the other data collection, participants logged into our web service and let our servlet collect the emails in their Gmail accounts from the previous 12 months. We recruited participants from all over the U.S. Although we recruited by emailing colleagues and posting on Facebook, most (> 90%) of our participants were not affiliated with our company or friends with any of the involved researchers.

Each participant first answered an online personality survey that was compiled by psychologists and consists of 112 questions. Based on the survey results, we calculated the ground-truth levels (low, medium, high) for each participant on the dimensions of Neuroticism, Agreeableness, Conscientiousness and Extraversion. We filtered out participants whose survey answers were inconsistent, or who sent fewer than 5 emails and received fewer than 100 emails. For the Outlook dataset, this left 28 valid users and 49,722 emails; each subject sent an average of 1,776 emails, with a standard deviation of 1,596. For the Gmail dataset, this left 458 valid users and 65,185 emails; each subject sent an average of 142 emails, with a standard deviation of 176. In both datasets, roughly 50% of subjects' trait levels are "medium", and 25% are "low" or "high".
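A sketch of the inclusion criteria and ground-truth binning; the cut-off arguments are hypothetical placeholders, since the actual low/medium/high thresholds came from the survey's scoring key:

    def keep_participant(consistent, sent, received):
        """Drop a participant whose survey answers were inconsistent, or who
        sent fewer than 5 emails and received fewer than 100."""
        return consistent and (sent >= 5 or received >= 100)

    def trait_level(score, low_cut, high_cut):
        """Bin a survey score into the low/medium/high ground-truth labels;
        low_cut and high_cut are hypothetical placeholders."""
        if score < low_cut:
            return "low"
        if score > high_cut:
            return "high"
        return "medium"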

4 Personality Prediction

Given a feature vector abstracted from a single email, our goal is to reliably predict the personality of the email's writer. We need to learn a function $f: X \rightarrow Y$ whose input $X$ is the feature vector and whose output $Y = \langle y_1, \ldots, y_K \rangle$ is the vector of personality trait values. Each element $y_j$ of $Y$ corresponds to a personality trait (e.g., Extraversion), and its value can be "low", "medium" or "high". Given an example $(X, Y)$, the error of $f$ is defined as $E(f, (X, Y)) = \sum_{j=1}^{K} I(y_j \neq f(X)_j)$, where $I(\cdot)$ is the indicator function. That is, $E$ counts the traits whose predicted value differs from the true value. An appropriate $f$ should have a low expected value of $E$.

Traditional single-label classification is concerned with learning from a set of examples that are each associated with a single label $y$ from a set of disjoint labels $L$. In this multi-label classification problem [23], each example is associated with a set of labels $Y = \langle y_1, \ldots, y_K \rangle$. Given a message $X_i$ and a set of personality values $Y_i$, we consider the following models for generating feature $x_{ij}$:

– Joint Model: $\langle y_{i1}, \ldots, y_{iK} \rangle$ act as a single entity and jointly decide whether to select feature $x_{ij}$.
– Sequential Model: first a label $y_{ik}$ is selected from $y_{i1}, \ldots, y_{iK}$; then $y_{ik}$ decides whether to select feature $x_{ij}$.
– Survival Model: each label $y_{ik}$ independently decides whether to select feature $x_{ij}$; only if all labels decide to select $x_{ij}$ does it get selected.

Based on these generative models, we derived different learning algorithms.

Joint Model. The Joint Model assumes that each feature is jointly selected by all labels. It thus needs to treat each distinct combination of label values that exists in the training set as a different class value of a single-label classification task. In such cases, the number of classes may become very large while many classes are associated with very few training examples. To improve computational efficiency and predictive accuracy, we adopt a simple yet effective ensemble learning method [24] for the Joint Model. Our ensemble method initially selects $m$ small labelsets $R_1, \ldots, R_m$ from the powerset of the label set $L$ via random sampling without replacement. It then treats each labelset $R_k$ as a different class value of a single-label classification task and learns $m$ single-label classifiers $f_1, \ldots, f_m$ independently. In this way, it aims to take label correlations into account using single-label classifiers applied to subtasks with a manageable number of labels and an adequate number of examples per label. A sketch of this construction follows.
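As a concrete illustration, a minimal Python sketch of this labelset-ensemble idea in the spirit of [24]: draw small random labelsets, then train one single-label classifier per labelset on its joint value combinations. The label encoding (joining trait values into one class string) and the make_classifier factory are our own illustrative choices, not the exact construction of [24]:

    import random

    def sample_labelsets(traits, size, m, seed=0):
        """Draw m distinct labelsets of the given size from the traits,
        sampling without replacement from the powerset (sketch; assumes m
        does not exceed the number of distinct subsets of that size)."""
        rng = random.Random(seed)
        seen = set()
        while len(seen) < m:
            seen.add(tuple(sorted(rng.sample(traits, size))))
        return [list(s) for s in seen]

    def train_joint_ensemble(X, Y, labelsets, make_classifier):
        """For each labelset, treat the combination of its trait values as a
        single class and train an sklearn-style single-label classifier.
        X: list of feature vectors; Y: list of dicts trait -> value."""
        models = []
        for subset in labelsets:
            y_joint = ["|".join(y[t] for t in subset) for y in Y]
            models.append((subset, make_classifier().fit(X, y_joint)))
        return models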

Sequential Model. The Sequential Model assumes that first a label is selected from the label set, and then this label selects a feature. Labeled Latent Dirichlet Allocation (LDA) [20] is an extension of Naive Bayes that handles exactly this situation. It constrains LDA [3] by defining a one-to-one correspondence between LDA's latent topics and user labels, which allows Labeled LDA to directly learn feature-label correspondences. If each document has only one label, the probability of each document under Labeled LDA equals its probability under the Multinomial Naive Bayes event model. If documents have multiple labels, a traditional one-versus-rest Multinomial Naive Bayes model would train a separate classifier for each label on all documents carrying that label, so each feature instance can contribute a count of 1 to every observed label's feature distribution. By contrast, Labeled LDA assumes that each document is a mixture of underlying topics, so the count mass of a single feature instance must instead be distributed over the document's observed labels.

Survival Model. The Survival Model assumes that each label independently determines whether to use a feature; the feature is selected only if all labels agree. Given a set of label values $Y_i = \langle y_{i1}, \ldots, y_{iK} \rangle$, the probability that $x_{ij}$ gets selected is $P(x_{ij}|Y_i) = \prod_k P(x_{ij}|y_{ik})$. Given a set of $D$ training instances $\{(X_1, Y_1), \ldots, (X_D, Y_D)\}$, under the naive Bayes assumption we search for the parameters that maximize the likelihood

$$\prod_i P(X_i|Y_i) = \prod_i \prod_j P(x_{ij}|Y_i) = \prod_i \prod_j \prod_k P(x_{ij}|y_{ik}) = \prod_k \Big( \prod_i P(X_i|y_{ik}) \Big).$$

This is equivalent to independently searching for the parameters that maximize the likelihood $\prod_i P(X_i|y_{ik})$ for each label $k$. Given a test instance $X$, we want to assign values to the $K$ personality traits that maximize the posterior probability $P(y_1, \ldots, y_K|X)$. Using Bayes' formula and label independence, we get

$$P(Y|X) \propto P(y_1, \ldots, y_K) P(X|y_1, \ldots, y_K) = \prod_k P(y_k) \prod_k P(X|y_k) = \prod_k \big( P(y_k) P(X|y_k) \big) \propto \prod_k P(y_k|X).$$

This is equivalent to independently searching for the optimal value of each label. We can thus treat each label as an independent classification problem: we transform this multi-label problem into one single-label problem per label and independently train $K$ classifiers $f_1, \ldots, f_K$. Each classifier $f_k$ is responsible for predicting the low/medium/high value of the corresponding personality trait.

Using Multiple Messages for Prediction. In this paper, we focus on predicting the personality of the email writer from one single email. Note that prediction accuracy could be improved by making inferences from multiple email messages. Assume we have $n$ messages $X_1, \ldots, X_n$. For each message $X_i$, we can use the predictor to estimate $P_{ik}$, the probability that the writer has personality trait value $y_k$. We then use these estimated probabilities to "vote" for the final prediction: the overall score of personality trait value $y_k$ is $\sum_i P_{ik}$. We rank trait values by their overall scores and predict those with the highest scores. A sketch of this per-trait training and voting scheme follows.
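A minimal Python sketch of the Survival Model's per-trait training and the multi-message voting just described; it assumes an sklearn-style probabilistic base classifier (with fit, predict_proba, and classes_), and the data encoding is our own illustrative choice:

    def train_survival(X, Y, traits, make_classifier):
        """Train one independent classifier per personality trait
        (the label-independence assumption). Y is a list of dicts
        mapping trait name -> "low"/"medium"/"high"."""
        return {t: make_classifier().fit(X, [y[t] for y in Y]) for t in traits}

    def vote_over_messages(models, messages, values=("low", "medium", "high")):
        """Sum each trait value's predicted probability over all messages
        and pick the value with the highest overall score, per trait."""
        prediction = {}
        for trait, clf in models.items():
            scores = dict.fromkeys(values, 0.0)
            for x in messages:
                probs = clf.predict_proba([x])[0]  # aligned with clf.classes_
                for v, p in zip(clf.classes_, probs):
                    scores[v] += p
            prediction[trait] = max(scores, key=scores.get)
        return prediction

For example, passing make_classifier=lambda: MultinomialNB() (from scikit-learn) would fit one Naive Bayes model per trait.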

5 Experimental Results

We evaluated the accuracy of the proposed algorithms on the collected Outlook and Gmail datasets. For the Outlook set, we hold out one subject's data for testing, train on the remaining subjects, and iterate over all subjects. For the Gmail set, all results are based on 10-fold cross-validation [11] with the constraint that all emails from the same subject go either to the training set or to the test set.

Effects of generative models. We first evaluated personality prediction using the three generative models on the Outlook set. We used Naive Bayes as the underlying single-label classifier, and only the bag-of-word features of emails, to simplify the comparison. A sketch of the leave-one-subject-out split follows.
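A small Python sketch of the leave-one-subject-out protocol used for the Outlook set, assuming the data is grouped by writer (an illustrative encoding, not our actual harness), so no subject appears on both sides of a split:

    def leave_one_subject_out(data):
        """data: dict mapping subject_id -> list of (features, labels) pairs.
        Yields (train, test) splits with one subject's emails held out per
        fold."""
        for held_out in sorted(data):
            train = [ex for s, exs in sorted(data.items())
                     if s != held_out for ex in exs]
            yield train, data[held_out]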

Fig. 2. Accuracy of using different strategies, features, and classifiers on the Outlook dataset, per trait (Neuroticism, Agreeableness, Conscientiousness, Extraversion): (a) different generative models (Joint, Sequential, Survival); (b) bag-of-word features with NB, Tree, and SVM classifiers; (c) aggregated features with NB, Tree, and SVM classifiers.

Fig. 3. Accuracy of using all features for the Survival Model, per trait: (a) Outlook dataset using different classifiers (NB, Tree, SVM); (b) Gmail dataset using SVM.

The results are shown in Figure 2(a). It is clear that the Joint Model performs badly: its accuracies are worse than those of the simple Survival Model, and the difference is significant for Agreeableness, Conscientiousness and Extraversion (p