Identifying Personality-based Communities in Social Networks
Eleanna Kafeza1, Andreas Kanavos2, Christos Makris2 and Dickson Chiu3
1. Athens University of Economics and Business, Greece, [email protected]
2. Computer Engineering and Informatics Department, University of Patras, Greece, {kanavos, makri}@ceid.upatras.gr
3. The University of Hong Kong, Hong Kong, [email protected]
Abstract. In this paper we present a novel algorithm for forming communities in a graph that represents social relations as they emerge from the use of services like Twitter. The main idea centers on the careful use of features to characterize the members of a community, and on the hypothesis that well-formed communities are those that exhibit diversity in the features of their members.
1 Introduction
This paper presents a novel methodology for characterizing interesting communities as they arise in social networks, such as those formed on Twitter. The novelty of our approach lies in the fact that we look for emerging communities according to the diversity among the characters of the involved users. Until now, most practices for message transmission have been based on finding the influential users and using them to transmit a message. Moreover, recent work on data flow in social networks deals with the problem of predicting the flow of information. Our approach is different in the sense that we examine ways to "drive" the information within the network. We first look for sub-networks that demonstrate a high degree of information flow and, as a second step, we aim to use these networks to keep information circulating for longer. There is a large body of work from different areas on creating communities from graphs; for a thorough survey, we refer the reader to [5]. In our approach we argue that communities in social media, e.g. Twitter, are more likely to pass information on easily if they are not "biased" with respect to user personality. A balanced community can handle information flow faster and more deeply. Hence, we partition the Twitter graph according to the personality of its users, which is extracted from their behavior.
2 Related Work
Analysis in social networks has a long history, which is related to graph clustering algorithms, web searching algorithms, as well as bibliometrics; for a complete
review of this area one should consult [4], [5], [10], [12], [14] and [17]. The field is related to link analysis on the web, whose cornerstones are the analysis of the significance of web pages in Google using the PageRank citation metric [3] and the HITS algorithm proposed by Kleinberg [9], as well as their numerous variants [11]. PageRank employs a simple metric based on the importance of the incoming links, while HITS uses two metrics that emphasize the dual role of a web page as a hub and as an authority for information. Both metrics have been improved in various forms, and a related review can be found in [11]. Concerning community detection, various algorithms have been proposed in the literature. It should be noted that HITS itself, when its non-principal eigenvectors are explored, can be used to compute communities. The problem of finding communities, as it appears in the literature, is related to graph partitioning. A breakthrough in the area is the algorithm proposed in [6], which identifies the edges lying between communities and removes them successively; after some iterations this procedure leads to the isolation of the communities. The majority of the algorithms proposed in the area are related to spectral partitioning techniques, i.e. techniques that partition objects by using the eigenvectors of matrices formed over the given set [8], [15], [18] and [19]. One should also mention techniques that use modularity, a metric that compares the density of links inside communities with the density of links between communities [5], [13], the most popular being the algorithm proposed in [2]. Besides finding emerging communities, estimating authorities has also attracted attention. In [1], the authors extracted several graph features, such as the users' degree distribution and hub and authority scores, in order to model a user's relative importance.
Other works in this area include Expertise Ranking [7] and [21], which identify authorities using link analysis on the graph induced by the interactions between users. Also of interest is the work presented in [20], which employs Latent Dirichlet Allocation and a variant of the PageRank algorithm that clusters according to topics and finds the authorities of each topic; the proposed metric is called TwitterRank. A method proposed in [16], though similar to TwitterRank, differs in its use of additional features, in its employment of clustering, and in its applicability to real-time scenarios, since it can be easily implemented.
3 A methodology for identifying personality-based communities in Social Networks
In our work we address the problem of identifying networks that can potentially exhibit maximum flow of information regarding a subject matter. As already mentioned, in most cases in the literature authors deal with the problem of finding influential nodes, and several metrics have been developed to address this issue. Our approach is different in two respects. First, we identify influential networks and not influential individuals. Second, we extract the networks related to a specific
subject, and compute the influence based on user personality, as extracted and computed from quantitative metrics retrieved from Social Networks. Social Networks provide metrics that measure different aspects of user behavior. In this paper we use Twitter as a case study, but our approach can be easily extended to any Social Network.
3.1 Basic Twitter Metrics
In this section we examine the basic metrics that we have extracted from Twitter in order to infer users' personality. Primarily, we divide a user's tweets into two categories, direct tweets and indirect tweets:
– Direct tweets (D): tweets that are written by the author. They are created through the option Compose new Tweet, with which a user can potentially start a new conversation.
– Indirect tweets (I1, I2): tweets that originate from another user. They arise in one of two ways: either a user copies or forwards a specific tweet so as to spread it in his network (retweets), or a user comments on another tweet, which may start a conversation (conversations). More specifically, I1 represents the number of retweets of a user in a specific time interval, whereas I2 represents the number of times there actually was a conversation upon a tweet.
Other metrics we look into are:
– Number of followers (F): the number of users that follow a specific user.
– Frequency (FR): how often an author posts tweets. It is computed as the number of times the user tweeted within a set period of time, e.g. half an hour.
– Number of hashtag keywords (HK): hashtag keywords are words starting with the symbol #, with which anyone can put a tweet into a certain thematic category. This metric counts the number of hashtags a user has used in a set of tweets posted within a specific period of time.
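As a rough illustration of how these metrics could be computed, the following Python sketch counts them for one user. It is not the code used in our experiments: the `Tweet` record, its field names, the `basic_metrics` function and the sample tweets are all hypothetical, and FR is assumed here to be the average number of tweets per half-hour window over the observation period.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Tweet:
    """Hypothetical tweet record; the field names are illustrative only."""
    text: str
    kind: str  # "direct", "retweet" or "conversation"


def basic_metrics(tweets, followers, period, window=timedelta(minutes=30)):
    """Return F, D, I1, I2, FR and HK for one user's tweets within `period`."""
    d = sum(t.kind == "direct" for t in tweets)
    i1 = sum(t.kind == "retweet" for t in tweets)
    i2 = sum(t.kind == "conversation" for t in tweets)
    # A hashtag keyword is a word that starts with '#' and has a body.
    hk = sum(
        1
        for t in tweets
        for word in t.text.split()
        if word.startswith("#") and len(word) > 1
    )
    # Assumption: FR is the average number of tweets per window.
    windows = period / window  # timedelta / timedelta gives a float
    if windows <= 0:
        raise ValueError("period must be positive")
    fr = len(tweets) / windows
    return {"F": followers, "D": d, "I1": i1, "I2": i2, "FR": fr, "HK": hk}


# Made-up sample data, for illustration only.
sample = [
    Tweet("New post on #SocialNetworks and #graphs", "direct"),
    Tweet("RT an interesting thread", "retweet"),
    Tweet("@someone I agree #SocialNetworks", "conversation"),
    Tweet("Good morning", "direct"),
]
metrics = basic_metrics(sample, followers=120, period=timedelta(hours=2))
print(metrics)
```

Passing the observation period explicitly keeps the frequency independent of when the first and last tweets happen to fall; other readings of FR are equally possible.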
3.2 Using metrics to extract user personality
Table 1. Twitter basic metrics
Metric  Sense/Meaning
F       Number of followers
D       Number of direct tweets
I1      Number of indirect tweets (retweets)
I2      Number of indirect tweets (conversations)
FR      Frequency of user's tweets
HK      Number of hashtag keywords
Based on the above metrics, we identify users' personality as it appears on Twitter. We have classified users into four basic categories based on their personality as perceived by their peers and as reflected by their behavior. We call them personality traits. A personality trait is a type of behavior exhibited by a Twitter user and can take one of the following values:
1. Popular: when a user is followed by many other users (i.e. has many followers).
2. Energetic: when a user posts tweets frequently. Such a user is energetic and enjoys talking, hence he/she tweets on a regular basis.
3. Conversational: when a user takes part in conversations, either by commenting on other people's posts or by republishing them.
4. Multi-systemic: when a user has a high number of interests and likes to state his opinion on a variety of subjects.
Given the above basic behavioral characteristics that a user can show in any social network, we associate a feature with each one of them so as to have a qualitative insight. The atomic personality trait of a user x is a tuple (F1, F2, F3, F4), where each Fi is defined as follows:
1. Atomic Popular (F1): the number of followers, computed as F.
2. Atomic Energetic (F2): the number of direct tweets divided by the time interval, computed as FR.
3. Atomic Conversational (F3): the number of retweets plus the number of conversations, computed as I1 + I2.
4. Atomic Multi-systemic (F4): the number of hashtags found in a given set of tweets that occur in a specific time interval, computed as HK.
Each Fi, 1 ≤ i ≤ 4, thus holds the degree to which the user's personality is associated with the corresponding personality trait. As a next step, we need to identify the dominant characteristics of each user. To this end, we set a range of values for each metric: if the atomic value Fi of a user lies within the given range, we characterize the user as having the corresponding behavior. For example, let us assume that the user Helen has the atomic personality trait (14, 3, 2, 1) and that the ranges for F1 to F4 are set to (10-14, 1-5, 3-7, 0-4); then Helen is Popular, Energetic as well as Multi-systemic.
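The range-based rule and the Helen example above can be reproduced with a few lines of Python. This is only a sketch: the function names are hypothetical, the metrics are assumed to arrive as a dictionary keyed by the names of Table 1, and the range bounds are assumed to be inclusive.

```python
TRAITS = ("Popular", "Energetic", "Conversational", "Multi-systemic")


def atomic_trait(m):
    """Map the basic metrics to the tuple (F1, F2, F3, F4)."""
    return (m["F"], m["FR"], m["I1"] + m["I2"], m["HK"])


def dominant_traits(atomic, ranges):
    """Return the traits whose atomic value lies inside its range.

    Assumption: both bounds of each range are inclusive.
    """
    return [
        name
        for name, value, (low, high) in zip(TRAITS, atomic, ranges)
        if low <= value <= high
    ]


helen = (14, 3, 2, 1)
ranges = ((10, 14), (1, 5), (3, 7), (0, 4))
print(dominant_traits(helen, ranges))
# ['Popular', 'Energetic', 'Multi-systemic']
```

Helen's third value, 2, falls outside the range 3-7, which is why Conversational is missing from the result.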
Let P = {Popular ($p_1$), Energetic ($p_2$), Conversational ($p_3$), Multi-systemic ($p_4$)} be the set of personality traits and $x_{F_1,F_2,F_3,F_4}$ the atomic personality trait of the user $x$. Moreover, we define $R_{R_1,R_2,R_3,R_4}$, a set of values that determine the dominance of personalities; then $color(x)$ is a tuple $(c_1, c_2, c_3, c_4)$ (the personality tuple) such that $c_i$ has the value $p_i$ if $F_i \le R_i$, for $1 \le i \le 4$.
3.3 The community extraction algorithm
Based on the above we can now derive the personality traits of each Twitter user. We model Twitter as a graph in which each node is a user and there is an edge between two users if there is a relation between them, i.e. one user follows the other or vice versa, and we color each node of the graph according to the user's personality. Based on the above description of personality there are 4 possible personality traits, so each user can have any of the 2^4 − 1 = 15 possible values as his personality tuple, and we associate each node with one of these 15 personality tuples. We use a breadth-first approach to traverse the graph, and for each node we use the definition of color(x) to decide upon the color of the node/user. After having decided upon the color/personality of each user, we define "personality-balanced networks". A personality-balanced network is a network that contains at least one node of every possible personality tuple. Based on observations regarding the flow of information in human networks, we note that balanced networks, which are composed of users with different personality traits, tend to demonstrate higher degrees of information flow. In our approach we create sub-graphs based on the coloring of the nodes. Given the initial Twitter graph, we traverse the graph using BFS until we find nodes from each one of the 15 personality tuples. The algorithm then extracts that sub-graph.
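The traversal described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than our implementation: the graph is assumed to be an adjacency dictionary, the colors are assumed to be precomputed per node, personality tuples are encoded as tuples of four booleans, and the names `ALL_COLORS` and `extract_balanced_community` are hypothetical. The choice of the start node and the treatment of users with no dominant trait are left open.

```python
from collections import deque
from itertools import product

# All non-empty personality tuples over four traits: 2**4 - 1 = 15.
ALL_COLORS = frozenset(c for c in product((False, True), repeat=4) if any(c))


def extract_balanced_community(graph, color, start, required=ALL_COLORS):
    """BFS from `start`, stopping once every required color has been seen.

    graph: dict mapping a node to an iterable of its neighbours
    color: dict mapping a node to its personality tuple
    Returns the set of visited nodes, or None if the part of the graph
    reachable from `start` does not contain every required color.
    """
    visited = {start}
    queue = deque([start])
    missing = set(required) - {color[start]}
    while queue and missing:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour in visited:
                continue
            visited.add(neighbour)
            missing.discard(color[neighbour])
            queue.append(neighbour)
            if not missing:
                break  # balanced: every required color is present
    return visited if not missing else None


# Toy graph with three made-up colors instead of the fifteen tuples.
toy_graph = {1: [2, 3], 2: [1, 4], 3: [1], 4: [2]}
toy_color = {1: "a", 2: "b", 3: "a", 4: "c"}
print(extract_balanced_community(toy_graph, toy_color, 1, {"a", "b", "c"}))
```

The `required` parameter defaults to the 15 tuples but can be narrowed, which makes the sketch easy to try on small graphs such as the toy one above.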
4 Experimental Evaluation and Results
We examined the validity of our approach through experiments. We built the Twitter graph using Twitter4J, colored it according to our methodology, and finally extracted the personality-based community graph. Twitter4J is a Java library for the Twitter API, with which one can easily integrate a Java application with the Twitter service. First, we created a Twitter graph as follows: we issued a query on Twitter on the subject #SocialNetworks and retrieved all the associated information regarding users and tweets for a time interval of 7 days (01/07/2013 − 08/07/2013). We then defined the ranges for the dominance of personality as follows. Initially, we specified the ranges for a user that has all the personality traits as (15%, 35%, 35%, 25%), and we identified the users that satisfy these ranges. As a next step, for the next most similar personality tuples, we increased each particular range by (33%, 25%, 25%, 33%) respectively. For example, suppose that the initial user has (500, 2, 3, 6) and is characterized as Popular, Energetic, Conversational as well as Multi-systemic. Then, in order to be characterized as Popular, Energetic and Conversational, a user has to demonstrate the tuple (750, 3, 4, < 8). We set the ranges in this way in order to incorporate the concept that these personality traits are inter-related. For example, a user with 1000 tweets
is characterized as Popular, but a user with 500 tweets and a number of retweets plus conversations is also characterized as Popular. Our results show that the selected nodes are approximately 3% of the nodes of our graph for the time interval we use. Furthermore, the number of tweets exchanged by the users of the community graph, divided by the total number of tweets for the given time interval, is approximately 10%. Tweets consist of direct tweets, retweets, as well as conversational tweets.
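The two ratios reported above can be written down as a small Python function. The function name and the toy counts below are hypothetical and serve only to show the arithmetic; they are not the data of our experiment.

```python
def community_shares(tweet_counts, community):
    """Return (node share, tweet share) of `community` as fractions.

    tweet_counts: dict mapping every user of the graph to the number of
                  tweets (direct, retweets and conversations) they posted
    community:    iterable of the users in the extracted community
    """
    members = set(community)
    total_tweets = sum(tweet_counts.values())
    if not tweet_counts or total_tweets == 0:
        raise ValueError("the graph must contain users and tweets")
    node_share = len(members) / len(tweet_counts)
    tweet_share = sum(tweet_counts[user] for user in members) / total_tweets
    return node_share, tweet_share


# Made-up toy counts, not the data of the experiment.
counts = {"u1": 6, "u2": 1, "u3": 1, "u4": 2, "u5": 2}
node_share, tweet_share = community_shares(counts, ["u1"])
print(node_share, tweet_share)
```

In this toy case one of five users posts six of the twelve tweets, so the node share is 0.2 and the tweet share is 0.5.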
5 Conclusions and Future Work
We conclude that although the community graph comprised only 3% of the whole graph (number of community nodes divided by the total number of nodes in the graph), this network carried almost 10% of the tweets (direct tweets, retweets and conversations) of the whole Twitter traffic. Hence we can conclude that our assumption that personality-based communities play a dominant role in data traffic has been verified. This is preliminary work. Further work is necessary to identify other personality traits, different clusters of networks and different ranges for dominant personalities. Moreover, this work could be extended to other Social Networks as well.
References
1. E. Agichtein, C. Castillo, D. Donato, A. Gionis, and G. Mishne. Finding high-quality content in social media. WSDM 2008:183-194.
2. V.D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre. Fast unfolding of community hierarchies in large networks. Journal of Statistical Mechanics: Theory and Experiment, P10008. 2008.
3. S. Brin and L. Page. The PageRank Citation Ranking: Bringing Order to the Web. Stanford Digital Library. 1998.
4. P.J. Carrington, J. Scott, and S. Wasserman. Models and Methods in Social Network Analysis. Cambridge University Press. 2005.
5. S. Fortunato. Community detection in graphs. Physics Reports 486, 75-174. 2010.
6. M. Girvan and M.E.J. Newman. Community Structure in Social and Biological Networks. Proceedings of the National Academy of Sciences, Vol. 99, No. 12, pp. 7821-7826. 2002.
7. P. Jurczyk and E. Agichtein. Discovering Authorities in Question Answer Communities by Using Link Analysis. CIKM 2007:919-922.
8. B.W. Kernighan and S. Lin. An Efficient Heuristic Procedure for Partitioning Graphs. The Bell System Technical Journal, Vol. 49, No. 1, pp. 291-307. 1970.
9. J.M. Kleinberg. Authoritative Sources in a Hyperlinked Environment. SODA 1998:668-677.
10. A. Lancichinetti and S. Fortunato. Community detection algorithms: A comparative analysis. Physical Review E 80, 056117. 2009.
11. A.N. Langville and C.D. Meyer. Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press. 2006.
12. J. Leskovec, K.J. Lang, and M.W. Mahoney. Empirical Comparison of Algorithms for Network Community Detection. WWW 2010:631-640.
13. M.E.J. Newman. Fast algorithm for detecting community structure in networks. Physical Review E 69, 066133. 2004.
14. M.E.J. Newman. Networks: An Introduction. Oxford University Press. 2010.
15. A.Y. Ng, M.I. Jordan, and Y. Weiss. On Spectral Clustering: Analysis and an algorithm. NIPS 2001:849-856.
16. A. Pal and S. Counts. Identifying topical authorities in microblogs. WSDM 2011:45-54.
17. J.G. Scott. Social Network Analysis: A Handbook. SAGE Publications Ltd. 2000.
18. J. Shi and J. Malik. Normalized Cuts and Image Segmentation. CVPR 1997:731-737.
19. J. Shi and J. Malik. Normalized Cuts and Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8):888-905. 2000.
20. J. Weng, E.-P. Lim, J. Jiang, and Q. He. TwitterRank: Finding Topic-sensitive Influential Twitterers. WSDM 2010:261-270.
21. J. Zhang, M.S. Ackerman, and L.A. Adamic. Expertise Networks in Online Communities: Structure and Algorithms. WWW 2007:221-230.