Enhancing Link Prediction in Twitter using Semantic User Attributes

2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining

Cherry Ahmed

Abeer ElKorany

Computer Science Department, Faculty of Computers and Information, Cairo University, Cairo, Egypt
{c.ahmed,a.korani}@fci-cu.edu.eg
5 Dr. Ahmed Zewail St., Postal: 12613, Giza, Egypt
+202 3567 9064

Abstract—Studying social networks and the ties connecting people in those networks has attracted many researchers. Social networks like Facebook, Twitter and Flickr require efficient and accurate methods to recommend friends to their users. Several algorithms have been developed to recommend friends or to predict the likelihood of future links. Link prediction algorithms either utilize local features of the network in the neighborhood of the two nodes in question, or use global features like the path structure of the whole network. Newer algorithms tend to combine both in order to achieve the best results, such as FriendTNS, which takes into account the degrees of the nodes and the direct links between them. This paper extends FriendTNS so that it takes the strength of the tie between two users into account. The strength of the tie is represented by the interaction that takes place between the two users. To evaluate the proposed model, it has been applied to a real dataset of 2.974k users on the Twitter social network. The proposed model considers different features of users to represent their connective and social relationships. Experiments show that the proposed model outperforms traditional algorithms when applied individually.

Keywords— link prediction; social networks

I. INTRODUCTION
Social networks have become very popular in recent years, and social network analysis has emerged to study the patterns and relationships between their users. Link prediction is an application of social network analysis that predicts links with a high likelihood of occurring in the near future [1]. A link could mean a friendship relation in a social network, but it could also mean co-authoring a paper in a co-author network, or an interaction in a protein network. Any kind of relation that connects two vertices in a network can be considered a link. Link prediction is used in many areas like information retrieval, bioinformatics, e-commerce, criminal investigations and recommender systems. In social networks, friend recommendation is one of the most important uses of link prediction algorithms. Link prediction is also used as an indication of a network’s growth rate. These algorithms need to be accurate and efficient. There are two types of link prediction algorithms, each utilizing specific features of social networks: global and local features

ASONAM '15, August 25-28, 2015, Paris, France © 2015 ACM. ISBN 978-1-4503-3854-7/15/08 $15.00 DOI: http://dx.doi.org/10.1145/2808797.2810056

[2]. Local features focus mainly on the node structure and neighborhood, while global features use the overall path structure of the network. Some of the algorithms used in social networks that depend on local features are Common Neighbors (CN), Adamic Adar (AA), and Jaccard Coefficient (JC) [3]. Examples of algorithms that depend on global features are Katz and SimRank [3]. Algorithms that depend solely on local features are efficient, but they miss global features of the network that may increase the accuracy of the prediction, like the shortest path between two users. They also ignore recommendations for friends who are not in the neighborhood of the user. Algorithms that depend on global features, on the other hand, do not scale to large networks, where friend recommendations are made on the fly and need to be fast. Some algorithms, like FriendTNS, combine both kinds of features, but they use direct links only for predicting future links. These algorithms perform well in many situations, and part of their popularity is due to their simplicity and to being generic enough to be applied to almost any social network. But they miss important features that differ from one social network to another. Those features can be the frequency of interaction between the users, the content of their posts, or the personal information in the user’s profile. Twitter currently recommends friends based on many criteria, like the user’s email address, phone contacts if uploaded by the user, tweets, and who the user follows. The challenge of friend recommendation in social networks is to choose the features which most strongly affect link formation in order to make highly accurate predictions, and at the same time to perform recommendations efficiently enough to scale to large networks. This paper proposes a framework for link prediction that utilizes the semantics of relationships between users.
Different types of user attributes are extracted and used to measure the tie strength between users. The proposed framework considers both direct and indirect relationships among users in the same social network. Based on FriendTNS [4], two similarity measurement functions are applied. The first one (called basic similarity) concerns connected users, while the second one involves unconnected users. In order to evaluate the accuracy of the proposed model for link prediction, the Twitter2.9k dataset


was used as a case study. Twitter, unlike many social networks, allows users to follow complete strangers, and the relation does not have to be reciprocal. Furthermore, users can retweet, reply to or mention someone who is not on their friend list. Accordingly, the applied similarity measurement techniques use a different weighting scheme for directly connected users and for indirectly connected users. The authors of FriendTNS have also extended their work in [5] to include signed, weighted and directed networks. Since weighted and signed networks are out of the scope of this work, and directed networks can be handled directly by the basic FriendTNS algorithm, we used the basic algorithm in [4]. In the rest of this paper, Section II discusses related work, Section III explains our framework, Section IV evaluates our algorithm against other algorithms, and Section V concludes our work.

II. RELATED WORK
Recent research has focused on link prediction for the purpose of predicting relationships between members of social networks in the near future. There are two main types of approaches for link prediction: Structural Based Approaches (Topological) [6] and Content Based Approaches. Structural Based Approaches use only the topology of the nodes and the overall graph structure to predict future links, while Content Based Approaches use other features like user posts, interests, groups, or topics for prediction. Structural approaches are the most commonly used for prediction, because they are generic and can be applied directly to several different domains.
Structural approaches are divided into local and global methods [3]. Local methods use the local features (neighborhood) of two users to compute the likelihood of forming a link between them, while global methods use the whole path and network structure to compute the likelihood of forming new links. Local methods provide good-enough predictions for large networks because of their efficiency, while global methods give better results in small networks. There are node-neighborhood local methods, which depend on the local features of nodes to get a score indicating the proximity between each pair of nodes. Neighbor-based similarity measures were devised based on Common Neighbors [7], Jaccard Coefficient [8], and Adamic/Adar [9]. All three rest on the same idea: the more common neighbors two nodes have, the more likely they are to form a link. Jaccard Coefficient and Adamic/Adar differ from Common Neighbors in that they do not just count the common neighbors: Jaccard Coefficient normalizes the count by the size of the combined neighborhood of the two nodes, and Adamic/Adar gives more weight to common neighbors with lower degree. Another well-known local similarity measure is Preferential Attachment [1]. It adopts the assumption that the “rich get richer”, so nodes with higher degree have a higher probability of forming new edges. In [10], a visibility metric is used to predict edges in Twitter, such that the probability of forming a follow relation to a certain user increases with the visibility of that user. There are also machine learning local methods like [11], which depends on the idea that users who have tightly coupled neighbors are more likely to form a link. They perform an object-object matching

to predict links, where objects are users and neighbors of users are features associated with the objects. Global approaches have attracted many researchers. They are used to capture network features and path-based features that affect the likelihood of the formation of new edges. Global methods are more suitable for small networks and do not scale to large networks. They can be divided into path-based methods (ensemble of paths [3]), machine learning methods, and probabilistic methods. Examples of path-based algorithms are Katz, Hitting Time, PageRank, and SimRank [3]. Katz [12] calculates the similarity between two nodes by summing over all paths connecting them, giving higher weight to shorter paths. SimRank [13] also computes a global similarity measure, based on the structural context of a network, which says that “two objects are similar if they are related to similar objects”. The algorithm in [2] counts the paths between two users up to a maximum length. A review and evaluation of the relative effectiveness of such techniques can be found in [14]. [15] and [16] are examples of unsupervised machine learning global methods. They use clustering techniques to partition the graph into communities, then use the communities for link prediction. Other global methods use supervised machine learning and model link prediction as a classification task, using classifiers like Decision Trees, Naïve Bayes, Neural Networks, Support Vector Machines (SVM), k-nearest neighbors, Bagging, Boosting and Logistic Regression [1]. However, classification methods suffer from class skewness, which requires a large training dataset to overcome. Calibrating the model to the domain is also complex. There are also global Probabilistic-Based Approaches [6]. Among those approaches are the Bayesian Probabilistic Models [1], which label each edge with the probability that the edge exists in reality.
The drawback of these probabilistic models is that they work on structural features of the network only, and existing edge attributes are not taken into consideration. Local approaches are efficient but fail to capture the global path structure of the network, while global methods capture the network structure but do not scale to large networks. Some research has tried to combine both to get an efficient and accurate algorithm which is scalable and can be applied to large networks. [17] uses a probabilistic model to get a ranked list of recommendations using the local and global structure of the network. FriendTNS [4] also combines local and global structural features by computing the similarity between any two nodes in the network based on their shortest path. It has two similarity matrices: the first is called the “basic matrix” and holds the similarity index for connected nodes, while the second is called the “extended matrix” and holds the similarity between unconnected users, computed using their shortest path. The similarity calculation for connected users depends on the idea that a connection between two users is stronger if they do not have many friends (lower degree), which is the opposite of the intuition behind Preferential Attachment. The extended similarity matrix is


computed using the values in the basic matrix by multiplying the similarity values along the shortest path connecting the users. Other than the structural approaches, which rely only on the topology of the network, there are also Topical Based Approaches, which use the content explicitly posted by the user or embedded in the user’s profile to predict topical similarities between users. Most research that uses topical features also includes structural features, combining the topical information with the link structure. [9], [18] and [19] use both structural and topical features for computing similarity. [20] uses a linear combination of sixteen neighborhood and node similarity indices (both local and global), covering both topical and structural features. The weights of all the features are optimized using an evolutionary algorithm.
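The neighbor-based measures surveyed in this section can be illustrated with a minimal Python sketch. It assumes an undirected graph stored as a dictionary of neighbor sets; the function names and the toy graph are hypothetical and are not taken from the paper or from any of the cited implementations.

```python
import math


def common_neighbors(adj, x, y):
    """Number of neighbors shared by x and y."""
    return len(adj[x] & adj[y])


def jaccard_coefficient(adj, x, y):
    """Shared neighbors divided by the size of the combined neighborhood."""
    union = adj[x] | adj[y]
    return len(adj[x] & adj[y]) / len(union) if union else 0.0


def adamic_adar(adj, x, y):
    """Shared neighbors weighted by 1 / log(degree): low-degree neighbors count more."""
    # Neighbors of degree 1 are skipped to avoid dividing by log(1) = 0.
    return sum(
        1.0 / math.log(len(adj[z]))
        for z in adj[x] & adj[y]
        if len(adj[z]) > 1
    )


def preferential_attachment(adj, x, y):
    """Product of the two degrees ("rich get richer")."""
    return len(adj[x]) * len(adj[y])


# Hypothetical toy graph with edges a-c, a-d, b-c, b-d, b-e, d-e.
toy = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"b", "d"},
}

scores_ab = {
    "CN": common_neighbors(toy, "a", "b"),
    "JC": jaccard_coefficient(toy, "a", "b"),
    "AA": adamic_adar(toy, "a", "b"),
    "PA": preferential_attachment(toy, "a", "b"),
}
```

All four scores need only the neighborhoods of the two nodes, which is why local methods stay cheap on large networks.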

The algorithms which use a combination of local and global features try to avoid the drawbacks of using only local or only global features. They use direct structural methods which depend only on the links connecting users, and some of them also combine topical features with structural features. However, there are also indirect links between users, represented by their interactions. Interactions give a strong indication of the strength of the relationship between connected users and of the probability of future links between unconnected users. Our contribution is modifying FriendTNS to take these indirect links between users into account, which increases the accuracy of prediction.

III. FRAMEWORK OF ENHANCING LINK PREDICTION IN TWITTER
Many link prediction algorithms are simple and popular because they are general enough to be implemented on many social networks. However, most of them only consider direct relationships between users and neglect the effect of indirect relations between unconnected users. Direct relations exist between connected users and are represented by the strength of the tie between them. The proposed framework distinguishes between several features to measure the tie strength, namely connectivity and social features. Connectivity is represented by the node degree (the number of user connections), where the lower the degree, the stronger the tie between two users. Social features rely on the frequency of interaction between two users, such as the number of comments, replies, mentions, etc. Indirect relations can exist between unconnected users if they show the same behavior, like having common friends, being members of the same groups, or mentioning each other. Thus, the proposed framework considers both direct and indirect relations among users in order to measure their similarity. It has been applied to a Twitter dataset. In Twitter each user has a friends list (followees) and a followers list. The friends list of a user x contains the users that user x follows, while the followers list contains the users who follow user x. The direct links that we considered for our prediction are the “follow” relations between two users on Twitter, while the indirect links that were considered are the interactions between two users (retweet, mention and reply). Links in Twitter are asymmetric, because a user x can follow user y without user y having to follow back. Since a user chooses his friends, but not his followers, friends on Twitter tell much more about a user than his followers do, and that is why we considered only the friend links to model users in the proposed framework. The degree of a user is the number of his friends. To capture the connectivity and social features of users in the Twitter social network, we considered the frequency of replies, mentions and retweets. Our framework consists of two main modules: the first is responsible for modeling users in the Twitter social network, and the second measures the similarity between users, which is used to predict links with other users. We used Precision and Recall for accuracy measurement as defined in [4]. Moreover, we used the Average Reciprocal Hit-Rank measure (ARHR) [21].

A. Used Notations
The following table presents the most important notations for our work and their corresponding definitions.

Symbol                 Description
G                      directed and unweighted graph
vi                     node of graph G
ei                     edge of graph G
V                      set of graph nodes
E                      set of graph edges
n=|V|                  number of nodes in graph G
m=|E|                  number of edges in graph G
A                      adjacency list of graph G
S                      basic similarity matrix of graph G
ES                     extended similarity matrix of graph G
sim(vi,vj)             basic similarity between vi and vj
esim(vi,vj)            extended similarity between vi and vj
RT(vi,vj)              number of times vi retweeted vj
MN(vi,vj)              number of times vi mentioned vj
RP(vi,vj)              number of times vi replied to vj
interaction(vi,vj)     interaction value of vi towards vj
deg(vi)                degree of node vi
α                      weight of connection strength
β                      weight of interaction between users

Graph G has a set of nodes V and a set of edges E. Every edge ei is represented by a pair of graph nodes (vi, vj), where vi, vj ∈ V. The graph G is directed, so the order of nodes in an edge is important: (vi, vj) and (vj, vi) denote different edges in G. S and ES are the basic and extended similarity matrices respectively. They both have dimensions n×n.

B. User Modeling
Users in a social network are distinguished through features that characterize them, such as their interests, behavior, activities, etc. Those features are identified either from content published by users or by analysis of their relationships through network links. Content posted by the user is used to identify the user’s interests, while link-based features, in our case represented by the friend list, are used to identify the behavior and degree of trust between users.
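As an illustration of the notation above and of the link-based features used for user modeling, the following minimal sketch builds the adjacency list A from directed follow edges and derives a user's degree and two-hop neighborhood. The helper names and the toy edges are hypothetical, and removing the user and existing friends from the friends-of-friends set is a choice made here for illustration, not something the paper specifies.

```python
def build_adjacency(follow_edges):
    """Build A: each user maps to the set of users he follows (his friends list)."""
    adj = {}
    for follower, followee in follow_edges:
        adj.setdefault(follower, set()).add(followee)
        adj.setdefault(followee, set())  # users who follow nobody still get an entry
    return adj


def degree(adj, v):
    """deg(v): number of friends (out-links) of v."""
    return len(adj[v])


def friends_of_friends(adj, v):
    """Users reachable in exactly two follow steps, excluding v and v's direct friends."""
    reachable = set()
    for friend in adj[v]:
        reachable |= adj[friend]
    return reachable - adj[v] - {v}


# Hypothetical follow edges: (x, y) means "x follows y".
edges = [("x", "y"), ("y", "z"), ("z", "x"), ("x", "w")]
A = build_adjacency(edges)

# The follow relation is asymmetric: x follows y, but y does not follow x.
x_follows_y = "y" in A["x"]
y_follows_x = "x" in A["y"]
```

Storing only out-links matches the decision to model a user by the friends he chose rather than by his followers.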


In this work we focus on three categories of features: connectivity, social activity and trust-based features. The basic idea of the proposed framework is to utilize the strength of the tie between connected users in order to identify the degree of similarity between unconnected users, and these three categories indicate that tie strength. The connectivity feature is measured by considering only the friend list of users (representing the degree of the node). The social feature represents the interaction between users in a social network. For directly connected users, an example of the social feature is the frequency of replies. For unconnected users, a social feature can be represented by mentioning each other. Finally, trust-based features, such as retweets, represent the degree of trust between both connected and unconnected users. The three types of features are represented as follows:

• Connectivity Features:
1- User’s degree: number of friends (out links), deg(vi).
2- Friends List: the neighbors of the user. For each user vi, get the list of direct neighbors (A[vi]).
3- Friends of Friends List: for each friend vj in A[vi], get the list of direct neighbors of vj.

• Social Activity Features:
1- Mentions: user’s mentions of other users, MN(vi,vj).
2- Replies: user’s replies to other users, RP(vi,vj).

• Trust-based Features:
Retweets: user’s retweets of other users, RT(vi,vj).

C. Similarity Measurement
We limited the list of friend predictions per user to those with a maximum path length of 2, except for users who have interactions with the user under consideration. [22] shows that 90% of new friends in Twitter are 2 hops away. The FriendTNS computation uses both local and global features of the network for link prediction. The original algorithm has two similarity matrices: a basic similarity matrix, simTNS in (1), to measure similarity between connected users, and an extended similarity matrix, esimTNS in (4), to measure similarity between unconnected users. The basic similarity matrix of FriendTNS gives higher weight to relations between low-degree users, based on the idea that a user who has 10 friends probably has a stronger relationship with them than a user who has 1000 friends. The extended similarity matrix calculates the similarity between users using the path connecting them. The similarity between two users depends on the similarity between each two consecutive users on that path, using values of the basic similarity matrix. It gives higher weight to shorter paths. FriendTNS gives a zero similarity to users who have no path connecting them. We modified the FriendTNS similarity matrices to account for other features of interaction between users in the network, hence capturing the users’ behavior. We measure the strength of the tie in direct and indirect relationships between users. The direct relationship of two connected users has a strength based on their degrees and interactions. The indirect relationship between two unconnected users depends on the similarity of users along the path connecting them and on their interaction. Unlike FriendTNS, the proposed model does not give a zero similarity to users with no path connecting them, but gives them a similarity value based on the frequency of interaction between them.

• Basic Similarity
The basic similarity matrix S has dimensions n×n and contains the similarity values between each two connected users. It has non-zero values on the diagonal (a node is similar to itself with value 1) and for connected users, where user vi follows user vj (vj is in the friends list of vi). The matrix is not symmetric, because if user vi follows vj, user vj does not have to follow vi back. The similarity of user vi to user vj is given by a modification of the original similarity function in FriendTNS, simMTNS in (3), which integrates the users’ social activity and degree of trust in each other.

So our basic similarity matrix has two main constituents:
1- Connection strength between users: a connection between two users is stronger if the sum of their degrees is low, which makes each user an important part of the other’s network, while their connection is weaker if they both have a large network of friends. This constitutes the first part of the equation, which uses simTNS in (1).
2- Users’ interactions: interactions from user vi towards user vj are represented as interaction(vi,vj) in (2). Interactions can be either trust (RT(vi,vj)) or social activity (MN(vi,vj) and RP(vi,vj)). The interaction of user vi towards user vj is the sum of the retweets, mentions and replies of user vi towards user vj. To normalize interaction values, this sum is divided by the maximum sum of interactions of user vi towards any other user. The product of both components gives the similarity score in (3).

(1)

(2)
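The equation bodies did not survive in this copy, so the following is only a sketch of how the basic similarity could be computed from the prose description. It assumes that simTNS for a connected pair is 1/(deg(vi)+deg(vj)-1), the form usually given for FriendTNS, that interaction(vi,vj) is vi's retweets, mentions and replies towards vj divided by vi's largest such sum towards any user, and that simMTNS is the plain product of the two. The guard against a zero denominator and all names are assumptions made here.

```python
def interaction(counts, vi, vj):
    """Normalized interaction of vi towards vj.

    counts[vi][vj] is a (retweets, mentions, replies) tuple; the sum is divided
    by the largest sum vi has towards any single user.
    """
    totals = {u: sum(triple) for u, triple in counts.get(vi, {}).items()}
    top = max(totals.values(), default=0)
    return totals.get(vj, 0) / top if top else 0.0


def sim_tns(adj, vi, vj):
    """Assumed FriendTNS basic similarity: 1, 0, or 1 / (deg(vi) + deg(vj) - 1)."""
    if vi == vj:
        return 1.0
    if vj not in adj.get(vi, ()):
        return 0.0  # vi does not follow vj
    # Assumption: clamp the denominator at 1, since with out-degrees a followee
    # who follows nobody could otherwise make it zero.
    return 1.0 / max(len(adj[vi]) + len(adj.get(vj, ())) - 1, 1)


def sim_mtns(adj, counts, vi, vj):
    """Assumed modified basic similarity: connection strength times interaction."""
    if vi == vj:
        return 1.0
    return sim_tns(adj, vi, vj) * interaction(counts, vi, vj)


# Hypothetical data: a follows b and c, b follows a, c follows nobody.
adj = {"a": {"b", "c"}, "b": {"a"}, "c": set()}
counts = {"a": {"b": (2, 1, 1), "c": (0, 2, 0)}}

s_ab = sim_mtns(adj, counts, "a", "b")
s_ac = sim_mtns(adj, counts, "a", "c")
```

Under the plain-product assumption, a followed user with no recorded interaction gets a similarity of zero, so the published equation may well contain an additional term that this sketch does not capture.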

• Extended Similarity
The extended similarity matrix ES has dimensions n×n and contains the similarity values between each two unconnected users. Two unconnected users may or may not have a path connecting them. In the extended similarity matrix of the original FriendTNS in (4), users with no path connecting them are not considered for link prediction. For users who have a path connecting them, the shortest path is considered, and when there are multiple shortest paths the one with the highest similarity is chosen. The similarity between those users is computed as the product of the similarities along the path connecting them. For example, if user vi is connected to user vk with similarity 0.2, and user vk is connected to user vj with similarity 0.5, the similarity between users vi and vj, given they are unconnected, would be 0.2*0.5=0.1.

(4)
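The shortest-path product described for the original extended similarity can be sketched as a level-by-level breadth-first search that keeps, for every node reached at the current distance, the largest product of basic similarities seen so far. This is an illustrative reading of the prose, not the authors' implementation; the function name, the `sim` callback and the toy similarity values are assumptions.

```python
def esim_tns(adj, sim, src, dst, max_len=None):
    """Product of basic similarities along the best shortest path from src to dst.

    Returns 0.0 when dst is unreachable (or farther away than max_len hops).
    """
    if src == dst:
        return 1.0
    best = {src: 1.0}   # best product for each node at the current distance
    visited = {src}
    length = 0
    while best:
        length += 1
        if max_len is not None and length > max_len:
            break
        nxt = {}
        for u, product in best.items():
            for w in adj.get(u, ()):
                if w in visited:
                    continue  # already reached by a shorter path
                cand = product * sim(u, w)
                if w not in nxt or cand > nxt[w]:
                    nxt[w] = cand
        if dst in nxt:
            return nxt[dst]
        visited |= nxt.keys()
        best = nxt
    return 0.0


# Hypothetical similarities: the example from the text plus a second two-hop path.
weights = {("i", "k"): 0.2, ("k", "j"): 0.5, ("i", "m"): 0.4, ("m", "j"): 0.5}
graph = {"i": {"k", "m"}, "k": {"j"}, "m": {"j"}, "j": set()}


def edge_sim(u, w):
    return weights[(u, w)]


via_k_only = esim_tns({"i": {"k"}, "k": {"j"}, "j": set()}, edge_sim, "i", "j")
best_of_two = esim_tns(graph, edge_sim, "i", "j")
```

Because every basic similarity is at most 1, each extra hop can only shrink the product, which is how shorter paths end up with higher weight.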

In Twitter, two users can interact through mentions, replies and retweets even though they are unconnected. So the same features of trust and social activity represented by interaction in (2) can also be used in the extended matrix shown in (5).

So the extended similarity matrix has two main constituents:
1- Connection strength between users: the product of simMTNS along the path connecting the two users in question.
2- Users’ interactions: the interactions shown in (2).

Each of the two components was given a weight in esimMTNS. The weight of the first part (П simMTNS) is α, while the weight of interaction is β. We used the values α=1 and β=2, which gave the best results. To normalize esimMTNS, we finally divided by the sum of the weights (α+β).

We also limited the length of the shortest path to 2, based on the finding in [22] that 90% of new friends in Twitter are 2 hops away. On the other hand, we also consider users who are more than 2 hops away for the list of predicted friends: a user vj can be added to the list of friend predictions for vi if vi has interactions (replies and mentions) with vj or some degree of trust (retweets) towards vj. We did not want two users who have no path connecting them but have some common interaction to receive a similarity score based on the frequency of their interactions only. To preserve the idea of FriendTNS that the similarity of two nodes is inversely proportional to their degrees, we divided the interaction frequency by the sum of their degrees. The extended similarity measurement esimMTNS is represented in (5), both for users who do not have a connecting path and for users who are two hops away.

(5)

where vp1 = vi, vpk+1 = vj, and vph (for h = 2 to k) denotes the intermediate nodes on the shortest path connecting vi and vj.

D. Predicting Top-k links for each user
Once the similarity values in the extended matrix are calculated, each row in the matrix holds the similarity values between a user and all other users. To get a list of predicted friends for each user, we first filtered the predictions to users who are not connected to the user, but are either 2 hops away or have some interaction with that user. Then we sorted the predictions associated with that user and chose the top-k users as the most similar ones. The k parameter is given as an input to the algorithm. In our experiments (Section IV), we varied the value of k and determined how the accuracy of all of the tested algorithms was affected. The value of k strongly affects the Precision, Recall and ARHR measures used for measuring the algorithms’ accuracy. Smaller k values give higher Precision and ARHR values, while higher values of k give better Recall values.
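The body of (5) is missing from this copy, so the sketch below only follows the prose of this section: the path product is weighted by α, the interaction value divided by the sum of the two degrees is weighted by β, and the result is divided by α+β. The exact published form may differ. The top-k step filters candidates to unconnected users who are two hops away or have some interaction, then sorts by score. All names, the toy scores and the alphabetical tie-break are assumptions.

```python
def esim_mtns(path_product, interaction_value, deg_i, deg_j, alpha=1.0, beta=2.0):
    """Assumed reading of (5): weighted mix of path product and degree-scaled interaction."""
    degree_sum = deg_i + deg_j
    scaled = interaction_value / degree_sum if degree_sum else 0.0
    return (alpha * path_product + beta * scaled) / (alpha + beta)


def top_k_predictions(user, scores, friends, two_hop, interacted, k):
    """Rank candidates that are not yet friends and are 2 hops away or interacted with."""
    candidates = (set(two_hop) | set(interacted)) - set(friends) - {user}
    # Sort by descending score; ties are broken alphabetically for determinism.
    return sorted(candidates, key=lambda u: (-scores.get(u, 0.0), u))[:k]


# Hypothetical values: path product 0.1, interaction 0.5, degrees 2 and 3.
mixed = esim_mtns(0.1, 0.5, 2, 3)
# No connecting path (product 0), full interaction, degrees 1 and 1.
interaction_only = esim_mtns(0.0, 1.0, 1, 1)

row = {"b": 0.9, "c": 0.4, "d": 0.7, "e": 0.1}
predicted = top_k_predictions(
    "a", row, friends={"b"}, two_hop={"c", "d"}, interacted={"e", "b"}, k=2
)
```

Under these assumptions a pair with no connecting path still gets a non-zero score, but one that shrinks as the two degrees grow, which matches the stated intent of not ranking on interaction frequency alone.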

IV. EVALUATION
A. Dataset
We used a Twitter dataset of 2.974k users that we populated in Jan2015. A set of 13 users was randomly selected and their friend lists were then crawled. Considering only the follow relation (one direction), there are 88,458 friend relations among those users, with an average of 30 friends per user. The data consists of the users, their relations and their tweets over a period of one month. There are a total of 726k tweets. We repopulated the set of edges in May 2015. 5787 more friend relations were added to the original edges in two months duration.

B. Experiment We did our experiments on three training-test sets; for the first experiment, we divided the edges in the Jan2015 dataset into equal training and test sets (50%-50%), with 44,229 edges in each set. For the second experiment we divided the edges in the Jan2015 dataset into 70%-30% for training and testing sets respectively, with 61,920 in the training set and 26,538 in the test set. Finally, for the last experiment we used all of the

1159

2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining

2 5 10 20 30

SimRank

FriendTNS

0.04 0.06 0.06 0.07 0.07

0.09 0.09 0.09 0.09 0.1

Modified FriendTNS 0.5 0.43 0.37 0.33 0.31

Recall

2 5 10 20 30

50/50 - TRAIN/TEST DATA

0.04 0.08 0.12 0.19 0.24

0.03 0.07 0.12 0.19 0.24

0.006 0.02 0.04 0.09 0.14

0.02 0.04 0.08 0.15 0.22

0.07 0.11 0.15 0.18 0.19

ARHR

2 5 10 20 30

Precision

k

TABLE I. Common Adamic Neighbors Adar 0.23 0.21 0.2 0.18 0.17 0.16 0.15 0.14 0.13 0.13

0.18 0.1 0.06 0.04 0.03

0.16 0.09 0.056 0.03 0.024

0.03 0.02 0.016 0.01 0.01

0.07 0.045 0.03 0.02 0.013

0.4 0.27 0.21 0.18 0.17

2 5 10 20 30

Precision

0.05 0.05 0.05 0.05 0.05

0.06 0.12 0.18 0.27 0.34

0.01 0.02 0.04 0.11 0.16

0.02 0.06 0.11 0.185 0.24

0.08 0.12 0.15 0.18 0.2

0.17 0.09 0.05 0.03 0.02

0.14 0.07 0.04 0.024 0.02

0.01 0.01 0.008 0.005 0.004

0.04 0.03 0.016 0.01 0.007

0.31 0.18 0.13 0.1 0.09

TABLE III. Common Neighbors 2 0.015 5 0.015 10 0.015 20 0.014 30 0.013

Precision

2 5 10 20 30

0.02 0.03 0.03 0.03 0.04

Modified FriendTNS 0.39 0.29 0.23 0.19 0.17

0.06 0.13 0.2 0.29 0.36

k

2 5 10 20 30

FriendTNS

Recall

2 5 10 20 30

SimRank

ARHR

2 5 10 20 30

70/30 - TRAIN/TEST DATA

TRAIN DATA: JAN 2015, TEST DATA: MAY 2015 Adamic Modified SimRank FriendTNS Adar FriendTNS 0.013 0.0035 0.006 0.08 0.013 0.004 0.007 0.05 0.014 0.005 0.008 0.04 0.013 0.006 0.007 0.03 0.013 0.006 0.007 0.03

Recall

The results in Table II belong to the 70%-30% train-test sets. The results show that our algorithm has the best Precision and ARHR for all values of k, while common neighbors algorithm has better recall values for most values of k. This is probably because some users follow others on twitter because of the many common friends they have, without any interaction between the two. It could be a weak tie between them, but it exists.

k

TABLE II. Common Adamic Neighbors Adar 0.22 0.18 0.16 0.15 0.14 0.12 0.11 0.1 0.09 0.09

0.02 0.05 0.09 0.16 0.23

0.02 0.04 0.09 0.16 0.23

0.0035 0.01 0.04 0.08 0.12

0.013 0.033 0.06 0.11 0.15

0.07 0.1 0.12 0.17 0.2

ARHR

edges in Jan2015 dataset (88,458 edges) as training set, and the new edges in May2015 dataset (5787 edges) as test set. In twitter, creation date of links is not known. This means that for the first and second experiments (50-50 and 70-30), the links in the train set don’t necessarily mean that they are older than the ones in test set. Only the last experiment guarantees that the train set edges are older than the test set edges. Since only one experiment won’t be sufficient for testing, the first two experiments had to be used. Edges in the training set were used to train the algorithm, while the other part in the test set was used for measuring how accurate the predictions were. We used Precision, Recall and ARHR to measure the accuracy of our modified algorithm against the original FriendTNS, and some of the traditional link prediction algorithms that rely on local and global features like Common Neighbors, Adamic Adar, SimRank. We varied the k parameter that determines the number of predicted links per user, and used the k values 2, 5, 10, 20 and 30. The best accuracy per experiment is written in bold. The results in Table I belong to the 50%-50% train-test sets. Our algorithm gets the best Precision and ARHR for all values of k, and the best Recall for most values of k.

0.01 0.007 0.004 0.002 0.002

0.01 0.006 0.004 0.002 0.002

0.0024 0.002 0.001 0.001 0.0006

0.005 0.003 0.002 0.001 0.001

0.06 0.03 0.02 0.02 0.01

Additionally, if these two users have high degrees, this does not affect Common Neighbors, but it does affect our algorithm, which assigns lower similarity to users with high degrees and places them at the bottom of the list.

In the last experiment (Table III) we used all of the edges in the Jan2015 dataset (88,458) as the training set, and the new edges in the May2015 dataset (5,787) as the test set. This experiment, unlike the first two, represents a real prediction task. In the first two experiments, a static snapshot of the graph was taken and divided into different proportions of training and test data, with the training set simulating the current links and the test set simulating the future links. In the last experiment, the training set contains the actual current links (Jan2015) and the test set contains the actual future links (May2015). This is why this experiment gives the most faithful measure of the accuracy of our algorithm compared to the other algorithms, and it is the closest to their relative accuracy in real prediction situations. Our algorithm outperforms the other algorithms, with a noticeable difference for lower k values. The values of Precision, Recall and ARHR are generally low for all of the algorithms due to the small size of the test set in this experiment.
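The two ways of building training and test sets described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions: edges are directed (follower, followee) pairs, and the names `random_split` and `temporal_split` are hypothetical, not taken from the paper.

```python
import random
from typing import Iterable, Set, Tuple

Edge = Tuple[str, str]  # assumed encoding: directed (follower, followee) pair


def random_split(
    edges: Iterable[Edge], train_fraction: float, seed: int = 0
) -> Tuple[Set[Edge], Set[Edge]]:
    """Split one static snapshot at random, e.g. train_fraction=0.5 or 0.7."""
    shuffled = sorted(set(edges))  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return set(shuffled[:cut]), set(shuffled[cut:])


def temporal_split(
    earlier: Iterable[Edge], later: Iterable[Edge]
) -> Tuple[Set[Edge], Set[Edge]]:
    """Train on every edge of the earlier snapshot; test on edges seen only later."""
    train = set(earlier)
    test = set(later) - train  # only the links created between the snapshots
    return train, test
```

Only `temporal_split` guarantees that every training edge predates every test edge; `random_split` merely simulates that ordering on a single snapshot.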


V. CONCLUSION

In this paper we introduced a modification to the FriendTNS algorithm that improves the accuracy of link prediction. The original algorithm defines a transitive node similarity using local and global features of the network; the features it takes into account depend on the links of the network and the degrees of the nodes. We proposed a modification of the algorithm to predict links in the Twitter social network more accurately by including interaction factors between users, such as retweets, mentions and replies. We compared our modified FriendTNS to the original FriendTNS, Common Neighbors, Adamic Adar and SimRank using the Precision, Recall and ARHR measures on a Twitter dataset of 2.974k users. The results showed that our algorithm performed better in most cases, the exception being the Recall measure on the 70%-30% train-test sets.

Our future work will proceed in three directions. First, we will work on an evolutionary algorithm that adjusts the weight of each feature used in prediction in order to reach the best weight assignment across all features. Second, we will apply the approach to other social networks, taking their interaction features into consideration. Finally, we will consider negative links in the prediction process. Missing links were noticed when taking a snapshot of the links in May2015: some links that existed in Jan2015 were no longer there, so our third direction will take those negative links into account.
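The observation about links that disappeared between the two snapshots can be made concrete with a small Python sketch. It is an illustration only: the function name `signed_links` and the +1/-1 encoding are assumptions, and the paper does not specify how negative links would be represented.

```python
from typing import Dict, Iterable, Tuple

Edge = Tuple[str, str]  # assumed encoding: directed (follower, followee) pair


def signed_links(earlier: Iterable[Edge], later: Iterable[Edge]) -> Dict[Edge, int]:
    """Label edges present in the later snapshot +1 and edges that vanished -1."""
    earlier_set, later_set = set(earlier), set(later)
    signs = {edge: 1 for edge in later_set}
    # Edges that existed earlier but are missing later are the negative links.
    signs.update({edge: -1 for edge in earlier_set - later_set})
    return signs
```

Applied to the Jan2015 and May2015 snapshots, the entries labelled -1 would be the links that were removed between January and May.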


REFERENCES

[1] M. Hasan and M. Zaki, “A survey of link prediction in social networks,” In C. Aggarwal, editor, Social Network Data Analytics. Springer US, ch. 9, pp. 243-275, 2011.
[2] A. Papadimitriou, P. Symeonidis, and Y. Manolopoulos, “Fast and accurate link prediction in social networking systems,” The Journal of Systems and Software, vol. 85, issue 9, pp. 2119-2132, 2012.
[3] D. Liben-Nowell and J. Kleinberg, “The link prediction problem for social networks,” In Proceedings of the twelfth international conference on Information and knowledge management (CIKM ’03), ACM Press, pp. 556–559, 2003.
[4] P. Symeonidis, E. Tiakas, and Y. Manolopoulos, “Transitive Node Similarity for Link Prediction in Social Networks with Positive and Negative Links,” In Proceedings of the 4th ACM conference on Recommender systems (RecSys '10), 2010.
[5] P. Symeonidis and E. Tiakas, “Transitive Node Similarity: Predicting and Recommending Links in Signed Social Networks,” In World Wide Web, pp. 743-776, 2014.
[6] E. Xiang, “A Survey on Link Prediction Models for Networked Data,” Department of Computer Science and Engineering, HKUST, 2008.
[7] M. Newman, “Clustering and preferential attachment in growing networks,” Physical Review E, 64(025102), 2001.
[8] G. Salton and M. McGill, Introduction to Modern Information Retrieval. McGraw-Hill, Inc., New York, NY, USA, 1986.
[9] L. A. Adamic and E. Adar, “Friends and Neighbors on the Web,” In Social Networks Journal, vol. 25, issue 3, pp. 211-230, July 2003.
[10] L. Zhu and K. Lerman, “A Visibility-based Model for Link Prediction in Social Media,” In Proceedings of the ASE/IEEE Conference on Social Computing, 2014.
[11] W. Jang and M. Kwak, “A Network Link Prediction Model Based on Object-Object Match Method,” In Proceedings of the Southern Association for Information Systems Conference, 2014.
[12] L. Katz, “A new status index derived from sociometric analysis,” In Psychometrika, vol. 18, no. 1, pp. 39-43, March 1953.
[13] G. Jeh and J. Widom, “SimRank: a measure of structural-context similarity,” In Proceedings of the 8th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2002.
[14] D. Liben-Nowell and J. Kleinberg, “The Link-Prediction Problem for Social Networks,” Journal of the American Society for Information Science and Technology, pp. 1019–1031, 2007.
[15] F. Li, J. He, G. Huang, Y. Zhang, and Y. Shi, “A Clustering-based Link Prediction Method in Social Networks,” In 14th International Conference on Computational Science (ICCS 2014), pp. 432–442, 2014.
[16] J. Valverde-Rebaza and A. de Andrade Lopes, “Link prediction in complex networks based on cluster information,” In Advances in Artificial Intelligence, SBIA 2012, pp. 92-101, 2012.
[17] D. Yin, L. Hong, and B. D. Davison, “Structural Link Analysis and Prediction in Microblogs,” In Proceedings of the 20th ACM international conference on Information and knowledge management (CIKM’11), 2011.
[18] N. Barbieri, F. Bonchi, and G. Manco, “Who to Follow and Why: Link Prediction with Explanations,” In 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD’14), 2014.
[19] M. Rowe, M. Stankovic, and H. Alani, “Who will follow whom? Exploiting Semantics for Link Prediction in Attention-Information Networks,” In Proceedings of the 11th international conference on The Semantic Web - Volume Part I (ISWC'12), pp. 476-491, 2012.
[20] C. Bliss, M. Frank, C. Danforth, and P. Dodds, “An evolutionary algorithm approach to link prediction in dynamic social networks,” In Journal of Computational Science, pp. 22-26, January 2014.
[21] M. Deshpande and G. Karypis, “Item-based top-n recommendation algorithms,” In ACM Transactions on Information Systems (TOIS), vol. 22, no. 1, pp. 143-177, 2004.
[22] D. Yin, L. Hong, X. Xiong, and B. Davison, “Link formation analysis in microblogs,” In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information (SIGIR '11), 2011.
