Predicting the attributes of social network users using a graph-based machine learning method

Yuxin Ding a,∗, Shengli Yan a, YiBin Zhang a, Wei Dai a, Li Dong b

a Harbin Institute of Technology Shenzhen Graduate School, Shenzhen 518055, China
b State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, China
Article history: Received 5 May 2014; Revised 8 April 2015; Accepted 5 July 2015; Available online xxx.

Keywords: Social network analysis; Data mining; Social network privacy; Semi-supervised learning; Information inference.
Abstract

Attribute information from social network users can be used as a basis for grouping users, sharing content, and recommending friends. However, in practice, not all users provide their attributes. In this paper, we use information from both the graph structure of the social network and the known attributes of users to predict the unknown attributes of users. Considering the topological structure of a social network and the characteristics of users' data, we select a graph-based semi-supervised learning algorithm to predict users' attributes, and we design different strategies for computing the relational weights between users. Experimental results on real-world data from Renren demonstrate that the semi-supervised learning method is more suitable for predicting users' attributes than supervised learning models, and that our strategies for computing the relational weights between users are effective. We also analyze the effect of different social relations on predicting users' attributes. © 2015 Elsevier B.V. All rights reserved.
1. Introduction

Online social networking has become one of the most popular ways for people to interact with friends, grow a business, and obtain information. By some estimates, up to 67% of individuals aged 18–29 use social networking at some level, and about 22% of all people use social network sites in some form [1]. Today, social networks such as Facebook and Twitter are driving new forms of social interaction, dialogue, exchange, and collaboration. Typically, users in a social network post their own attributes on their pages, such as their universities, majors, and hobbies. This profile information can be used as a basis for grouping users, sharing content, and recommending friends. However, in practice, not all users in a social network provide their profiles. Moreover, to protect the privacy of users, some social networks allow users to hide their personal information. One interesting solution is to infer the hidden (unknown) attributes of users from the known information supplied by the social network. In this paper, we use information from both the topological structure of the social network and the known attributes of users to predict the hidden attributes of users.

Inferring the hidden attributes of users in an online social network [2] has been a hot issue in machine learning.
∗ Corresponding author. Tel.: +86 755 2603 2193; fax: +86 755 2603 2461. E-mail address: [email protected] (Y. Ding).
Although users can hide some profile information, their friendship and group information can be obtained directly or indirectly. For example, we can obtain a user's friendship information by searching his/her friend list, which is available in a social network, and his/her friends' attributes can also be easily obtained by following the links to those friends. We can extract a large amount of data from a social network; however, only a small part of the data is assigned attribute values. Fig. 1 shows the percentage of users who hide their attribute information in the Renren social network: nearly 55% of users hide part of their profiles, and nearly 70% hide their hobbies. This means that the labeled data are considerably fewer than the unlabeled data. In general, traditional supervised prediction models must be trained on a large amount of labeled data (training data) to obtain high accuracy. The challenge, therefore, is how to take advantage of the unlabeled data to improve the prediction accuracy.

Compared with supervised learning algorithms, semi-supervised learning algorithms [3–5] can use both labeled and unlabeled data to train a prediction model, and they can achieve better performance, especially when the labeled data are very limited. Therefore, semi-supervised algorithms are more suitable for inferring users' profiles in a social network. In this paper, our goal is to infer users' hidden attributes from their personal information, friendship information, and group information. We choose a graph-based semi-supervised learning algorithm to predict a user's hidden attributes. To improve the prediction accuracy, we propose different strategies for computing the relational weights between users according to the users' data type (labeled or unlabeled).
Fig. 1. The percentage of users who hide their attribute information in the Renren social network.
The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 defines the problem. Section 4 discusses how to use the semi-supervised method to predict users' attributes. The experimental results are presented in Section 5, and the conclusion and future work are discussed in the last section.

2. Related work

To protect their privacy, many social network users provide only partial information or deliberately hide some of their personal information. Can failing to complete, or hiding, personal information protect the privacy of users? The answer is no. La Fond and Neville [6] proposed the principle of social influence, which states that users who are linked are likely to adopt similar attributes, and suggested that the network structure should inform attribute inference. Other studies [7,8] show that users with similar attributes are likely to link to one another, motivating the use of attribute information for link prediction. In our study, we focus mainly on attribute inference.

Rao et al. [9,10] inferred user attributes (e.g., ethnicity and political orientation) using supervised learning methods with features extracted from user names and user-generated text. Strictly speaking, their study belongs to text classification; they did not consider the topological information and social relationships of a social network. Yin et al. [11,12] proposed an attribute-augmented social network model called the social-attribute network (SAN), in which attributes are represented as network nodes; link information and attribute-link information are extracted to infer user attributes, using an unsupervised method (random walk with restart). Gong et al. [13,14] also used SAN models to infer user attributes. They extracted a set of topological features for each positive and negative attribute link, treated the positive attribute links as positive examples and the negative attribute links as negative examples, and trained a binary classifier (a support vector machine, SVM) to infer the missing attributes.

In addition to topological information, social relations can be used to infer user attributes. In [15], the authors discussed how to infer user profiles by using friendship relations. They used a Bayesian network to model the cause-and-effect relationships among people in a social network and analyzed the effect of the prior probability, influence strength, and society openness on the inference accuracy on the online weblog service LiveJournal; however, they did not consider other types of social relations when inferring user attributes. In [16], the authors noted that different social relations, such as student relations and staff relations, suggest different user properties and can therefore be used to infer different attributes; for example, classmate relations can be used to infer users' ages. In [17], the authors used a naive Bayesian classifier to infer users' attributes in online social networks, predicting each user's political affiliation from personal information and friendship information. In [2], the authors noted the importance of group information and predicted users' attributes using publicly available group information and friendship information; their experiments show that a public group contains a large amount of potential information for predicting users' attributes. In [2,15–17], the authors discussed how different types of information available in a social network (personal information, friendship relations, and group relations) can be used to infer a user's private information. In [18], the authors proposed that users with the same attributes are more likely to become friends or to form a dense community, and they employed a community-based method to infer users' attributes in an online social network. In [19], the authors inferred users' demographics from their daily mobile communication patterns, extracting individual, friend, and circle features and using them to infer users' age and gender.

Most of the above-mentioned papers employ supervised learning methods to infer users' attributes. In practice, however, online social network users often supply only a small amount of information. We therefore want a learning algorithm that can use not only the labeled data but also a large amount of unlabeled data to train classifiers. The semi-supervised learning method can learn from both labeled and unlabeled data. In [20], the authors used a semi-supervised learning algorithm to infer users' profiles in a social network. They proposed a general semi-supervised learning framework and built two prediction models, a graph-based model and a co-training model, to infer users' attributes. In contrast to supervised learning algorithms, the semi-supervised learning algorithm predicts the hidden attributes of social network users more accurately.

In this paper, we also use a graph-based semi-supervised learning algorithm to infer users' hidden attributes. We focus on the following problems:

(1) The principle of social influence [6] is the basic assumption behind inferring users' attributes. Researchers have proposed different algorithms to predict the attributes of social network users, but most did not examine whether the data they used to infer attributes actually support this assumption. Moreover, which types of attributes can be inferred more accurately? These questions are the main motivation of our study.

(2) Reference [20] uses a semi-supervised learning algorithm to predict the attributes of social network users. The semi-supervised learning algorithm can learn from both labeled and unlabeled data; however, the potential information supplied by the two types of data is different. How can we use this different potential information to improve the performance of the semi-supervised learning algorithm?
(3) In our work, we use different social relations, such as the group relation and the friendship relation, to calculate the similarity of two users. Different from [20], we try different combinations of social relations to calculate user similarity and evaluate their effect on the accuracy of the semi-supervised learning algorithm.

3. Problem definition

In general, a social network can be represented as a graph: a user is a node, an edge represents the relationship between users, and the weight of an edge measures the similarity of two users. Fig. 2 shows a simple network of four users, each with three attributes: gender, university, and hobby. In Fig. 2, Lily and Jim are friends, and Lily and Alice are not. There are two groups: the football group and the university A group. Note that Lily does not publish her university; our problem is whether we can infer her university from the known information in the graph. For convenience, commonly used notations are described in Table 1. The social network graph is defined as follows.
Fig. 2. A sample of a social network data structure.
Definition 1. Social network: we define an online social network as a graph G = (V, E); each user is represented as a node in G, and each edge represents the social relation between two users. We let V represent the set of nodes and E the set of edges between nodes. For each user node $v_i \in V$, $i \in \{1, 2, \ldots, n\}$, one vector $A_i$ (see Eq. (1)) describes the attributes of user i:

$A_i = (a_i^1, a_i^2, \ldots, a_i^m)$   (1)
In Eq. (1), m is the number of attributes of user i, and $a_i^j$, $j \in \{1, 2, \ldots, m\}$, represents the jth attribute of user node i. $w_{ij}$ is the weight on the edge between $v_i$ and $v_j$; it measures the similarity between user i and user j, and we call it the normalized similarity. $w_{ij}$ is evaluated from three components: the friendship relation, the group relation, and the attribute relation. The details of computing $w_{ij}$ are described in Section 4.4.
Definition 2. Labeled data: we define dataset $D_c = (d_1, d_2, \ldots, d_\ell)$ as the labeled dataset, whose corresponding label set is $L_c = (L_1, L_2, \ldots, L_\ell)$, where $L_i \in L$ (the set of labels, i.e., the possible values of a given attribute), $i \in \{1, 2, \ldots, \ell\}$, and $\ell$ is the number of labeled data. The vector $L_c$ is the class vector, and nclass is the number of classes.

Definition 3. Unlabeled data: we define dataset $D_u = (d_{\ell+1}, d_{\ell+2}, \ldots, d_n)$ as the unlabeled dataset, whose corresponding label set is $L_u = (L_{\ell+1}, L_{\ell+2}, \ldots, L_n)$, where $L_i \in L$ and $i \in \{\ell+1, \ell+2, \ldots, n\}$.
Given the graph G = (V, E) and the labeled dataset $D_c = (d_1, d_2, \ldots, d_\ell)$, our goal is to predict the label vector $\hat{L}_u = (L_{\ell+1}, L_{\ell+2}, \ldots, L_n)$ for the unlabeled dataset $D_u = (d_{\ell+1}, d_{\ell+2}, \ldots, d_n)$. For example, if we choose the label to be the university where a student studies, the label set in Fig. 2 is L = (Uni. A, Uni. B), $D_c = (d_1 = v_1, d_2 = v_2, d_3 = v_4)$, $L_c$ = (Uni. A, Uni. A, Uni. B), and $D_u = (d_4 = v_3)$. Our goal is to predict the university of user $v_3$, i.e., $\hat{L}_u = (L_4)$.

4. The framework for inferring users' attributes

The whole framework for inferring users' attributes is shown in Fig. 3. The framework contains five components: data collection, data extraction, data analysis, data preprocessing, and the algorithms for inferring users' attributes. We describe each part below.

Fig. 3. The framework for inferring user attributes.

4.1. Data collection

We implemented a crawler in Python to download the web pages of a social network, which contain the user attributes, friend lists, and group information of the users. The attribute information is stored in a MySQL database. The social network we chose is Renren, one of the most popular social networks in China. The crawler uses breadth-first search to crawl the web. We randomly selected six social network users from students who were willing to supply their social network data and used them as the source seeds for the crawler. In total, we obtained 19,567 users, 4,500,410 links, and 2778 available public groups.

In Renren, a user's attribute information includes basic information, school information, work information, contact information, and common web page information. We extract 15 attributes for each user: user ID, gender, date of birth, constellation, hometown (province and city), university, year of university enrollment, major, high school, year of high school enrollment, QQ account number, mobile phone number, common web page addresses, and hobbies. For each user, at most 10 hobbies are extracted. Each webpage a user is interested in is assigned a unique ID, and at most 10 web page IDs are stored per user.
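For concreteness, a minimal sketch of such a breadth-first crawler is shown below. It is illustrative only and not the authors' implementation: the fetch_profile and fetch_friend_ids helpers, which would wrap the actual HTTP requests and HTML parsing for Renren, and the max_users cap are our assumptions.

```python
from collections import deque

def bfs_crawl(seed_user_ids, fetch_profile, fetch_friend_ids, max_users=20000):
    """Breadth-first crawl of a social graph starting from seed users.

    fetch_profile(uid) and fetch_friend_ids(uid) are hypothetical helpers
    wrapping the HTTP requests and HTML parsing for the target site.
    """
    visited = set(seed_user_ids)
    queue = deque(seed_user_ids)
    profiles, edges = {}, []
    while queue and len(profiles) < max_users:
        uid = queue.popleft()
        profiles[uid] = fetch_profile(uid)   # attributes, groups, etc.
        for fid in fetch_friend_ids(uid):    # entries of the friend list
            edges.append((uid, fid))
            if fid not in visited:           # enqueue unseen users (BFS)
                visited.add(fid)
                queue.append(fid)
    return profiles, edges
```

The profiles dictionary would then be written to the MySQL database described above.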
Table 1. Description of commonly used notations.

G: social network graph
V: node set of G
E: edge set of G
A: attribute set
a: an attribute
$a_i$: attribute a of user i
$A_i$: attribute vector of user i
$a_i^j$: jth attribute of user i
(i, j): an edge from node (user) i to node (user) j
|·|: number of elements in a set
||·||: L2 norm of a vector
Values(a): value set for attribute a
W: similarity matrix of users
$w_{ij}$: the normalized similarity between user i and user j
Table 2. Affinity values of attributes in the Renren social network.

User profile | Affinity
Gender | 0.966
Age | 1.26
Constellation | 0.998
Province | 3.2
City | 9.94
University | 11.1
Major | 3.24
High school | 37.15
Hobby | 3.4
Group | 6.92
4.2. Data analysis

From Fig. 1, we can see that more than 95% of users publish their gender, whereas only 10% publish their mobile phone numbers. Excluding the gender attribute, the other attributes are each provided by approximately 46% of users on average. This shows that for most attributes, the unlabeled data outnumber the labeled data.

The basic assumption behind attribute prediction is the principle of social influence [6]. Therefore, the first step is to verify that the data we collected satisfy this principle. To this end, we define the affinity of an attribute. The parameter $S_a$ is defined in Eq. (2), where $a_i$ represents attribute a of user i; $S_a$ is the fraction of friend pairs that share the same value of attribute a. The parameter $P_a$ is defined in Eq. (3), where k is the number of values a takes and $T_i$ is the number of users whose attribute a takes the ith value; $P_a$ is the empirical probability that two randomly chosen users share the same value of attribute a.
$S_a = \frac{|\{(i, j) \in E : a_i = a_j\}|}{|E|}$, $a \in A$, $i, j \in \{1, 2, \ldots, n\}$   (2)

$P_a = \frac{\sum_{i=0}^{k} T_i (T_i - 1)}{\sum_{i=0}^{k} T_i \left( \sum_{i=0}^{k} T_i - 1 \right)}$   (3)

The affinity $C_a$ ($C_a \in (0, \infty)$) of attribute a is defined in Eq. (4) as the ratio of $S_a$ to the empirical probability $P_a$. If $C_a$ is greater than one, the links (friendships) are positively correlated with attribute a; otherwise, they are negatively correlated with it. The larger $C_a$ is, the more likely users are to become friends because they share the same value of a. Table 2 shows the affinity values of the various attributes in our data.

$C_a = \frac{S_a}{P_a}$   (4)
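Eqs. (2)–(4) translate directly into code. The sketch below is a minimal illustration under assumed inputs (an edge list and a user-to-value mapping); skipping edges whose endpoints hide the attribute is our assumption, since the paper does not spell out that case.

```python
from collections import Counter

def affinity(edges, attr):
    """Affinity C_a = S_a / P_a of one attribute (Eqs. (2)-(4)).

    edges: list of (i, j) friend pairs; attr: dict mapping user id to the
    value of attribute a (users hiding the attribute are absent).
    """
    usable = [(i, j) for i, j in edges if i in attr and j in attr]
    # S_a: fraction of friend pairs that share the attribute value (Eq. (2))
    s_a = sum(1 for i, j in usable if attr[i] == attr[j]) / len(usable)
    # P_a: empirical probability that a random pair shares the value (Eq. (3))
    counts = Counter(attr.values())          # T_i for each attribute value
    n = sum(counts.values())
    p_a = sum(t * (t - 1) for t in counts.values()) / (n * (n - 1))
    return s_a / p_a                         # C_a (Eq. (4))
```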
In Table 2, the affinity values of most attributes are larger than one, which shows that in a social network the friendship relations between users depend strongly on whether the users share attribute values. The data in Table 2 also show that certain attributes have a much larger affinity than others; for example, the affinity value of "high school" is 37.15, considerably larger than that of "gender". This implies that the probability that two users from the same high school are friends is 37 times the probability that a random pair of users shares an attribute value. Conversely, we can assume that if users are friends, it is very likely that most of them share common attribute values. Based on this assumption, we can employ the relations among users to predict their hidden attribute values. In addition, the higher the affinity value of an attribute, the higher the expected prediction accuracy. We verify these claims experimentally.

4.3. Data preprocessing and feature selection
In our dataset, the attributes have different data types, and some are hard to use directly in a machine learning algorithm, so we transform them. The attribute "hometown" is represented by "province" and "city", which take string values, and it is difficult for machine learning algorithms to obtain useful information from this representation. If we instead use longitude and latitude to represent a place, we can calculate the distance between two places (see the sketch at the end of this subsection) and thus compare the similarity between users. In this paper, we therefore represent places such as "hometown" by longitude and latitude.

The original dataset contains 1793 groups. Some groups contain only a few members, and the main web pages of these groups have not been updated for a long time. In our study, the purpose of introducing group relations is to predict users' hobbies, so we want groups to represent users' interests. We assume that a group representing a common interest of users has many members and that those members are active. Therefore, we ignore groups with few members (fewer than 30 in our experiments) and groups whose web pages have not been updated for at least one month.

In our work, we infer the "university", "major", and "hobby" attributes of social network users. To evaluate the prediction accuracy of the machine learning algorithms objectively, we select users who provide the attribute being inferred; for example, in the dataset for inferring "university", all users provide their universities. When training a learning model, we extract part of the data from each class as training data. To avoid classes with very few training samples, we ignore attribute classes containing few users (fewer than 60 in our experiments).

After preprocessing, the dataset for inferring "university" includes 1002 user nodes, 194,600 edges, and 238 groups. All users come from nine universities, each user belongs to exactly one "university" class, and the number of users per university varies from 80 to 150. The dataset for inferring "hobby" contains 5810 nodes, 1,111,001 edges, and 275 groups; there are 36 distinct hobbies, and the number of users per hobby class is between 140 and 190. When inferring "hobby", we consider only the first hobby selected by a user and ignore the others. The dataset for inferring "major" is the same as that for inferring "university"; there are eight distinct majors, each user belongs to exactly one "major" class, and the number of users per major varies from 90 to 140.

Each user has 15 attributes, but not all of them help infer hidden attributes, so we must choose the proper attributes as features. In our study, predicting the unknown attributes is viewed as a classification problem. For a classification problem, information gain is usually used to analyze the classification ability of an attribute, and attributes with strong classification ability are chosen as features. In our dataset, however, some attributes have many unique values, and information gain is biased toward such attributes. To address this issue, we use the gain ratio to measure the classification ability of attributes [21].
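As an illustration of the hometown representation discussed above, the following haversine sketch computes the great-circle distance between two geocoded places. The geocoding of province and city names to coordinates is assumed to happen elsewhere.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km: mean Earth radius
```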
Table 3. The gain ratio of user attributes for inferring "hobby".

Attribute | Gain ratio
Gender | 0.142
Age | 0.125
Major | 0.123
Hometown | 0.084
University | 0.083
Constellation | 0.061
High school | 0.032
Year of university enrollment | 0.031
Year of high school enrollment | 0.029
Phone number | 0.013
QQ | 0.013
User id | 0.013
Date of birth | 0.019
Table 4. The gain ratio of user attributes for inferring "university".

Attribute | Gain ratio
Gender | 0.031
Age | 0.014
Major | 0.049
Hometown | 0.042
Constellation | 0.010
High school | 0.017
Year of university enrollment | 0.011
Year of high school enrollment | 0.013
Phone number | 0.009
QQ | 0.009
User id | 0.009
Date of birth | 0.014
Table 5. The gain ratio of user attributes for inferring "major".

Attribute | Gain ratio
Gender | 0.043
Age | 0.014
University | 0.051
Hometown | 0.045
Constellation | 0.012
High school | 0.015
Year of university enrollment | 0.014
Year of high school enrollment | 0.013
Phone number | 0.009
QQ | 0.009
User id | 0.009
Date of birth | 0.012
The definitions of information entropy, information gain, split information, and gain ratio are given in formulas (5)–(8) [21].

$\mathrm{Entropy}(S) = \sum_{i=1}^{c} -p_i \log_2 p_i$   (5)

$\mathrm{Gain}(S, a) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(a)} \frac{|S_v|}{|S|} \, \mathrm{Entropy}(S_v)$   (6)

$\mathrm{SplitInformation}(S, a) = -\sum_{i=1}^{c} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$   (7)

$\mathrm{GainRatio}(S, a) = \frac{\mathrm{Gain}(S, a)}{\mathrm{SplitInformation}(S, a)}$   (8)
In formula (5), c represents the number of classes, and $p_i$ is the proportion of S belonging to class i. In formula (6), a is the attribute under consideration, Values(a) is the set of all possible values of a, and $S_v$ is the subset of S for which attribute a has value v. In formula (7), $S_1$ through $S_c$ are the c subsets of examples resulting from partitioning S by the c-valued attribute a. Using these formulas, we can calculate the gain ratio of each attribute; Tables 3–5 show the gain ratio values of the attributes for inferring "hobby", "university", and "major", respectively.

The attributes "hobby" (in the data for inferring "university") and "web page ID" are multi-choice attributes whose gain ratio values are hard to calculate, so they are not listed in the tables; in the experiments, both are nevertheless used to infer attribute values. In addition, the "university", "major", and "hobby" attributes are required attributes when predicting their own values, so they are not listed in the tables either. From each table, we choose the attributes with the higher gain ratio values to predict missing attribute values. Some attributes take only special (unique) values; for example, each user has a unique QQ number, user id, and mobile phone number. These attributes have no classification power, so we do not choose them as features; as Tables 3–5 show, they also have low gain ratio values. The gain ratio values of some other attributes are small and very close to those of the above-mentioned attributes, so we do not choose them as features either. Finally, we choose six attributes (gender, age, constellation, university, hometown, and major) to predict users' hobbies; three attributes (gender, major, and hometown) to predict users' universities; and three attributes (gender, university, and hometown) to predict users' majors.
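Formulas (5)–(8) are easy to implement. Below is a minimal sketch of the gain ratio computation; the input layout (parallel sequences of attribute values and class labels) is our assumption.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(S) of a label sequence (formula (5))."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(attr_values, labels):
    """Gain ratio of one attribute (formulas (6)-(8)).

    attr_values[i] is attribute a of sample i; labels[i] is its class.
    Returns 0.0 when the attribute does not split the data at all.
    """
    n = len(labels)
    subsets = {}
    for v, y in zip(attr_values, labels):    # partition S by value of a
        subsets.setdefault(v, []).append(y)
    gain = entropy(labels) - sum(len(s) / n * entropy(s)
                                 for s in subsets.values())
    split_info = -sum(len(s) / n * log2(len(s) / n)
                      for s in subsets.values())
    return gain / split_info if split_info > 0 else 0.0
```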
4.4. Attribute prediction using the local and global consistency method

Considering the data properties of a social network, we choose a graph-based semi-supervised learning (SSL) method to predict the hidden attributes, for the following reasons:

(1) From Fig. 1, most users do not provide their attribute values, which results in considerably fewer labeled data (training data) than unlabeled data. The semi-supervised learning method can learn from both the labeled and the unlabeled data [22].

(2) The graph-based SSL method is designed for data that have a graph structure. The data of an online social network can be described by a graph; therefore, the graph-based SSL method is suitable for this problem [23,24].

(3) In the graph-based SSL method, information is transferred from the labeled data to the unlabeled data through the weighted edges. The graph-based SSL method assumes that if two data points are very close, they have the same label [25,26]; we call this the smoothness assumption. In the dataset we collected, most attributes have high affinity values, which shows that the data obey this assumption: if two users are friends (very close), it is very likely that they share the same attributes.

We choose the local and global consistency (LGC) algorithm to infer user attributes. The algorithm is a graph-based semi-supervised algorithm that considers both local and global consistency in a graph [27–29]. The algorithm is shown in Algorithm 1 [27,28].
Algorithm 1. Local and global consistency algorithm.

1. For all $v_i, v_j \in V$, $i, j \in \{1, 2, \ldots, n\}$, compute the distance $d(i, j)$.
2. Compute the similarity matrix W: set $w_{ii} \leftarrow 0$ for $i = 1, \ldots, n$; for all $i \neq j$, if $v_i$ and $v_j$ are both labeled ($i, j \in \{1, \ldots, \ell\}$), set $w_{ij} = 1$ if $L_i = L_j$ and $w_{ij} = 0$ if $L_i \neq L_j$; otherwise set $w_{ij} = e^{-d(i,j)^2 / 2\sigma^2}$.
3. Compute the diagonal matrix D with $D_{ii} \leftarrow \sum_{j=1}^{n} w_{ij}$.
4. Compute the matrix $S = D^{-1/2} W D^{-1/2}$.
5. Initialize $\hat{L}^{(0)} \leftarrow (L_1, L_2, \ldots, L_\ell, 0, 0, \ldots, 0)$ and choose $\alpha \in [0, 1)$.
6. Iterate $\hat{L}^{(t+1)} \leftarrow \alpha S \hat{L}^{(t)} + (1 - \alpha) \hat{L}^{(0)}$ until $\hat{L}^{(t)}$ converges to $\hat{L}^{(\infty)}$.
7. Label each $v_i$ with $\arg\max_{L \in L} \hat{L}^{(\infty)}_i$.
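To make the propagation step of Algorithm 1 concrete, here is a minimal NumPy sketch, assuming the similarity matrix W has already been built with the label-aware weights described above. It follows the standard LGC iteration; the guard for isolated nodes and the max_iter cap are our additions.

```python
import numpy as np

def lgc_propagate(W, labels, n_class, alpha=0.3, eps=1e-4, max_iter=1000):
    """Local and global consistency propagation over a precomputed W.

    W: (n, n) symmetric similarity matrix (Algorithm 1);
    labels: length-n array, class index for labeled nodes, -1 otherwise.
    Returns a predicted class index for every node.
    """
    n = W.shape[0]
    d = W.sum(axis=1)
    d[d == 0] = 1.0                          # guard against isolated nodes
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    S = D_inv_sqrt @ W @ D_inv_sqrt          # S = D^{-1/2} W D^{-1/2}
    L0 = np.zeros((n, n_class))
    for i, y in enumerate(labels):
        if y >= 0:
            L0[i, y] = 1.0                   # one-hot initial labels
    L = L0.copy()
    for _ in range(max_iter):
        L_next = alpha * (S @ L) + (1 - alpha) * L0
        if np.abs(L_next - L).max() < eps:   # ||L^(t+1) - L^(t)|| < eps
            L = L_next
            break
        L = L_next
    return L.argmax(axis=1)                  # label with the largest entry
```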
Table 6. Accuracy for inferring users' hobbies.

Labeled data (%) | LGC friend (%) | LGC group (%) | LGC all (%) | KNN (k = 10) (%) | SVM (%) | Decision tree (%)
8 | 57.0 | 46.1 | 58.2 | 43.1 | 41.2 | 42.1
10 | 67.1 | 63.2 | 67.1 | 44.2 | 43.0 | 43.2
15 | 67.5 | 73.1 | 72.0 | 57.2 | 54.1 | 54.8
20 | 71.0 | 72.2 | 74.9 | 61.1 | 62.1 | 61.2
30 | 72.1 | 71.1 | 75.1 | 63.0 | 63.2 | 63.1
The termination condition of Algorithm 1 is "$\hat{L}^{(t)}$ converges to $\hat{L}^{(\infty)}$". The matrix S is a symmetrically normalized matrix, which guarantees that the algorithm converges to a stable value. In practice, we stop the algorithm when $\|\hat{L}^{(t+1)} - \hat{L}^{(t)}\| < \varepsilon$ for a small $\varepsilon$; we set $\varepsilon$ to 0.0001 in our experiments.

In Algorithm 1, d(i, j) represents the distance between two user nodes, which is usually defined as the Euclidean distance [27–29]. Compared with the original semi-supervised learning algorithm [27–29], we use different measures to evaluate d(i, j). In a social network, we usually judge whether two users are similar by determining whether they have similar attributes, whether they are friends, and whether they belong to the same groups. Accordingly, we define three corresponding measures: attribute similarity, group similarity, and friendship similarity. The attribute-based distance between user i and user j is derived from the cosine similarity of the two users' attribute vectors, as shown in formula (9):

$d_{i,j}^a = 1 - \frac{A_i^T A_j}{\|A_i\| \, \|A_j\|}$   (9)

where $A_i^T$ is the transpose of vector $A_i$. The group-based distance between user i and user j is derived from the Jaccard similarity of the two users' groups, as shown in formula (10):

$d_{i,j}^g = 1 - \frac{|G_i \cap G_j|}{|G_i \cup G_j|}$   (10)

where $G_i$ and $G_j$ represent the sets of groups that user i and user j joined, respectively. The friendship distance $d_{i,j}^f$ between user i and user j is defined as the number of hops on the shortest path between them. In our experiments, we try different ways to calculate d(i, j), namely $d(i, j) = d_{i,j}^f$, $d(i, j) = d_{i,j}^g$, and $d(i, j) = d_{i,j}^f + d_{i,j}^g + d_{i,j}^a$.

The parameter α balances the weight between the initial label vector $\hat{L}^{(0)}$ and the derived label vector $\hat{L}^{(t)}$ when predicting $\hat{L}^{(t+1)}$. α > 0.5 means that the label of a user's attribute depends mainly on its neighbor nodes; otherwise, it depends mainly on the user's own initial label.

Reference [20] defined similar measures to calculate the similarity of two users. Different from [20], we take advantage of the labels of the user data to improve the accuracy of the similarity calculation. When calculating $w_{ij}$, user i and user j may have different data types; that is, their data may be labeled or unlabeled. Furthermore, if both are labeled, their labels may be the same or different. This information is useful for improving the accuracy of the similarity calculation. The similarity calculation in Algorithm 1 is interpreted as follows. If the data of both user i and user j are labeled and have the same label, we assign the similarity weight $w_{ij}$ a large value; otherwise, we assign $w_{ij}$ a small value. In Algorithm 1, $w_{ij} = 1$ when user i and user j have the same label, and $w_{ij} = 0$ when they have different labels. If either user's data is unlabeled, the radial basis function [28] is used to compute the similarity: $w_{ij} = e^{-d(i,j)^2 / 2\sigma^2}$. The radial basis function limits $w_{ij}$ to between zero and one, which guarantees that even when d(i, j) varies over a large range, $w_{ij}$ remains relatively stable. The parameter σ adjusts how fast $w_{ij}$ changes with d(i, j). In our work, we set σ as the average value of
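A minimal sketch of the three distance measures follows. It assumes the attribute vectors are numeric NumPy arrays, the group memberships are Python sets, and the friendship graph is a networkx graph; the handling of empty group sets and disconnected pairs is our assumption.

```python
import numpy as np
import networkx as nx

def attribute_distance(a_i, a_j):
    """d^a: one minus the cosine similarity of two attribute vectors (Eq. (9))."""
    return 1.0 - float(a_i @ a_j) / (np.linalg.norm(a_i) * np.linalg.norm(a_j))

def group_distance(g_i, g_j):
    """d^g: one minus the Jaccard similarity of two users' group sets (Eq. (10))."""
    union = g_i | g_j
    if not union:
        return 1.0                # neither user joined any group (assumption)
    return 1.0 - len(g_i & g_j) / len(union)

def friendship_distance(graph, i, j):
    """d^f: number of hops on the shortest path between users i and j."""
    try:
        return nx.shortest_path_length(graph, i, j)
    except nx.NetworkXNoPath:
        return float("inf")       # disconnected pair (assumption)
```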
all d(i, j): if d(i, j) equals σ, then $w_{ij}$ equals $e^{-1/2}$, the average similarity. This choice of σ readily exposes how the similarity of two users differs from the average similarity over all users.

5. Experiments

5.1. Data description

In our experiments, we predict the universities, majors, and hobbies of social network users, using the datasets described in Section 4.3. The labeled data are chosen randomly from each class, with the same number of labeled samples per class; the remaining data are treated as unlabeled. For example, the first entry in the first column of Table 6 is 8%, which means we randomly select approximately 12 (5810 × 0.08/36) users from each "hobby" class as labeled data.

The datasets used for inferring user attributes are extracted from the whole dataset we collected, so we can expect them to support the principle of social influence. In fact, these datasets contain no missing values for the attribute being inferred, so we can compute the affinity value of each such attribute exactly: the affinity values of "hobby", "major", and "university" are 7.6, 12.1, and 18.3, respectively. This shows that the datasets support the assumption. In reality, however, many attributes are not assigned values, so it is hard to estimate the affinity value of an attribute accurately.

In the experiments, we compare the performance of the prediction models using different similarity measures. Furthermore, to evaluate the semi-supervised learning method, we compare it with supervised models: a support vector machine (SVM) [30], the K-nearest neighbor algorithm (KNN) [31], and a decision tree [32]. For KNN, we also use attribute similarity, group similarity, and friendship similarity to calculate the distance between two data samples; for SVM and the decision tree, all 15 attributes are used for classification. In addition, we compare the semi-supervised method with the attribute prediction method of Gong et al. [14], which uses the social-attribute network to infer missing attributes: the authors extracted nine topological features for each attribute link and used a binary SVM as the classifier. When training the SVM, the labeled data from one class are treated as positive examples and the labeled data from the other classes as negative examples; the prediction accuracy is the average of the classification accuracy over classes.

In our experiments, the supervised models come from WEKA, an open-source data mining toolkit. To avoid overfitting, a validation set is typically required for a supervised algorithm, whereas the semi-supervised algorithm does not need one. To compare the two types of algorithms fairly, each algorithm is executed several times and the average accuracy is calculated (see reference [33]).
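A small sketch of the per-class labeled/unlabeled split described above follows; the truncating rounding rule and the input layout are our assumptions (e.g., 8% of 5810 users over 36 hobby classes gives about 12 labeled users per class).

```python
import random

def split_labeled(users_by_class, fraction):
    """Randomly select the same number of labeled users from each class."""
    total = sum(len(us) for us in users_by_class.values())
    per_class = max(1, int(fraction * total / len(users_by_class)))
    labeled, unlabeled = {}, {}
    for c, us in users_by_class.items():
        chosen = set(random.sample(us, per_class))
        labeled[c] = list(chosen)
        unlabeled[c] = [u for u in us if u not in chosen]
    return labeled, unlabeled
```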
Table 7. Accuracy for inferring users' majors.

Labeled data (%) | LGC friend (%) | LGC group (%) | LGC all (%) | KNN (k = 10) (%) | SVM (%) | Decision tree (%)
8 | 62.1 | 51.0 | 64.2 | 47.1 | 46.4 | 46.0
10 | 63.7 | 51.2 | 65.1 | 49.1 | 49.5 | 48.3
15 | 67.9 | 54.7 | 69.2 | 59.3 | 58.7 | 57.1
20 | 72.6 | 59.4 | 73.5 | 63.2 | 63.5 | 62.7
30 | 77.6 | 62.5 | 78.8 | 65.6 | 64.9 | 64.5
Table 8. Accuracy for inferring users' universities.

Labeled data (%) | LGC friend (%) | LGC group (%) | LGC all (%) | LGC [20] friend (%) | KNN (k = 10) (%) | SVM (%) | Decision tree (%)
8 | 77.7 | 41.6 | 74.0 | 71.9 | 55.4 | 53.8 | 53.2
10 | 79.2 | 43.7 | 75.1 | 74.8 | 57.6 | 56.2 | 56.1
15 | 83.0 | 44.1 | 77.7 | 75.7 | 64.1 | 61.1 | 61.8
20 | 83.2 | 44.1 | 79.1 | 76.3 | 65.0 | 64.7 | 64.0
30 | 84.3 | 45.2 | 79.5 | 76.8 | 66.7 | 67.1 | 67.1
We run each algorithm ten times, and each time we randomly select the labeled data. Before running the semi-supervised learning algorithm, we need to set the value of α. Avrachenkov et al. [33,34] analyzed the behavior of semi-supervised learning methods under different values of α and proved that if the number of labeled points in each class is the same, no dominating class attracts all instances as α approaches one. Therefore, in our experiments, the performance of the learning algorithm is not sensitive to α. We set α to 0.3; we also tried other values of α and obtained nearly the same results.

5.2. Inferring users' hobbies

Table 6 gives the performance of the prediction algorithms under different parameter settings. In Table 6, LGC denotes the local and global consistency algorithm, and "friend", "group", and "all" denote the similarity measures used in LGC: "friend" means $d(i, j) = d_{i,j}^f$, "group" means $d(i, j) = d_{i,j}^g$, and "all" means $d(i, j) = d_{i,j}^f + d_{i,j}^g + d_{i,j}^a$. k is the number of nearest neighbors in the KNN algorithm.

From Table 6, we can see that as the number of labeled samples increases, the accuracy of both the semi-supervised and the supervised models improves. When trained on the same labeled data, the semi-supervised models outperform the supervised models; the prediction accuracy of the semi-supervised model is approximately 10% higher on average. This shows that the semi-supervised models can improve prediction accuracy by learning from unlabeled data. The LGC algorithm using the "all" similarity measure has the best performance. We found that users with the same hobby often join the same group or have similar attributes, so the attribute similarity and group similarity between users help improve the prediction accuracy.
5.3. Inferring users' majors

Table 7 gives the performance of the prediction algorithms for inferring the "major" attribute. From Table 7, the LGC algorithm using the "all" similarity measure achieves the best performance; however, it performs nearly the same as the LGC algorithm using the "friend" similarity measure. This shows that, compared with the group relation and attribute relation, the friendship relation is more important for inferring "major". In fact, the affinity
value of "major" is 12.1, a relatively high value, which means that people with the same major become friends easily. Except for the LGC model using the "group" similarity measure, the semi-supervised models outperform the supervised models; the prediction accuracy of the LGC model using the "all" similarity measure is approximately 8% higher than that of the corresponding supervised models.

5.4. Inferring users' universities

Table 8 shows the performance of the learning algorithms in predicting the "university" attribute. As before, we compare the LGC algorithm with the supervised learning models and with the LGC algorithm proposed in [20], where the Euclidean distance is used. From Table 8, the LGC algorithm using the group distance measure performs poorly: members of a group often come from different universities, so it is very hard to use group relations to infer users' universities. The LGC algorithm using the friend distance measure achieves the best performance. As noted in Section 5.1, the affinity value of "university" in this dataset is 18.3, a relatively high value, which means that users from the same university become friends more easily; therefore, friend similarity is the best similarity measure for inferring "university". In Table 8, the accuracy of the proposed LGC algorithm using the "friend" similarity measure is approximately 5% higher than that of the LGC algorithm in [20]. The reason is that Algorithm 1 adjusts the similarity weights according to the label types of the user data, so it can use more information to infer hidden attributes. This confirms that our strategies for computing the similarity weights are effective.

From Tables 6–8, the average accuracy for predicting "university" is higher than that for predicting "hobby" or "major". Compared with "hobby" and "major", "university" has a higher affinity value, so it is easier to infer using friendship relations. Moreover, unlike university information, some hobbies have no standard descriptions: two users may have the same hobby but describe it with different words, which makes the prediction task more difficult.

5.5. Assigning different weights to similarity measures

In the "LGC all" experiments, each similarity measure is assigned the same weight. In this section, we discuss how the prediction accuracy is affected by the weights of the similarity measures. In our experiments, we found that the contribution of each similarity measure depends on the attribute to be predicted. In addition, when adjusting the weight, the similarities between samples in the same class and in different classes change with the same trend, which makes the weight analysis very difficult; it is therefore hard to devise a general strategy for adjusting the weight of each similarity measure.
Fig. 4. Adjusting the weight of friend similarity for attribute prediction.

Table 9. Accuracy of the SAN-based method [14].

Labeled data (%) | Inferring hobby (%) | Inferring major (%) | Inferring university (%)
8 | 56.1 | 64.5 | 78.8
10 | 65.2 | 64.8 | 80.2
15 | 70.2 | 70.5 | 85.6
20 | 72.1 | 74.4 | 85.8
30 | 73.3 | 77.6 | 87.1
In our experiments, we found that if the affinity value of the predicted attribute is large (usually greater than ten), we can improve the prediction accuracy by gradually increasing the weight of $d_{i,j}^f$. If the affinity value of an attribute is greater than ten, users who share the same value of this attribute become friends more easily, and friend similarity becomes the best similarity measure for inferring this attribute; therefore, increasing the weight of friend similarity improves the prediction accuracy.

Fig. 4 shows the experimental results using the weighted combination of similarities to predict "major", "university", and "hobby". In Fig. 4, the horizontal axis is the weight of friend similarity and the vertical axis is the prediction accuracy; the curves "major", "uni." and "hobby" give the prediction accuracy for "major", "university", and "hobby", respectively. In the experiments, we set the weights of group similarity and attribute similarity to 1; in practice, the weights can be normalized to between 0 and 1 by dividing by the sum of the weights. The affinity values of "university" and "major" are both greater than ten, and in Fig. 4 the prediction accuracy for these two attributes improves as the weight of friend similarity increases; once the weight exceeds a certain value, however, the accuracy stops changing. Compared with "major" and "university", the affinity value of "hobby" is relatively small. We obtain the best performance for "hobby" when the weight of friend similarity is about 1.2; as the weight keeps increasing, the accuracy begins to decline, and beyond a weight of about three the prediction accuracy tends to be stable.
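As a small sketch, the weighted "all" distance with the normalization mentioned above can be written as follows; the function signature is ours.

```python
def combined_distance(d_f, d_g, d_a, w_f=1.0, w_g=1.0, w_a=1.0):
    """Weighted combination of friend, group, and attribute distances.

    With equal weights this reduces to the unweighted "LGC all" measure;
    increasing w_f mimics the experiment shown in Fig. 4.
    """
    total = w_f + w_g + w_a              # normalize weights to sum to one
    return (w_f * d_f + w_g * d_g + w_a * d_a) / total
```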
5.6. Comparison with the social-attribute network (SAN) based method [14]

Table 9 shows the prediction accuracy of the SAN-based method. In a social-attribute network, attributes are also represented as nodes, and relations between user nodes and attribute nodes are established. The SAN can be perceived as an extension of the traditional social network graph, so it supplies more topological information for attribute inference; thus, the SAN-based method [14] outperforms the traditional supervised methods shown in Tables 6–8. Although the semi-supervised method does not use this extra topological information, it can learn more information from the unlabeled data; thus, the overall performance of the semi-supervised method and the SAN-based method is nearly the same.

The SAN-based method and the semi-supervised method choose different features for classification, which leads to some differences in their performance. For inferring the "hobby" attribute, the average accuracy of the semi-supervised method (using the "all" similarity measure) is approximately 1.5% higher than that of the SAN-based method. For inferring the "university" attribute, the average accuracy of the SAN-based method is approximately 2% higher than that of the semi-supervised method (using the "friend" similarity measure). In our datasets, the affinity value of "university" is significantly larger than that of "hobby"; compared with the "hobby" attribute, users from the same university are more likely to be linked together, so the topological features of the social network play the main role in classification. In this situation, the SAN-based method performs better.

6. Conclusions

We use a machine learning method to predict user attributes. Considering the characteristics of a social network, we choose the local and global consistency algorithm as the classifier. In our study, we use the affinity of an attribute to evaluate which attributes can be predicted more reliably from social relations. The improved LGC algorithm uses the label types of the user data to adjust the similarity weights and achieves good performance. We also try different similarity measures to calculate the relations between users and compare their performance experimentally. Our future work will consider how to use the SAN to improve the performance of the semi-supervised method.

Acknowledgments

This work was partially supported by the Scientific Research Foundation in Shenzhen (grant no. JCYJ20140627163809422), the Scientific Research Innovation Foundation of Harbin Institute of Technology (project no. HIT.NSRIF2010123), the State Key Laboratory of Computer Architecture, Chinese Academy of Sciences, and the Key Laboratory of Network Oriented Intelligent Computation (Shenzhen).

References

[1] http://socialnetworking.lovetoknow.com/Why_Do_People_Use_Online_Social_Network (accessed 13/04/2014).
[2] E. Zheleva, L. Getoor, To join or not to join: the illusion of privacy in social networks with mixed public and private user profiles, in: Proceedings of the 19th International Conference on World Wide Web, 2009, pp. 531–540.
[3] K. Han-joon, L. Sybghucj, Building concept network-based user profile for personalized web search, in: Proceedings of the 9th IEEE/ACIS International Conference on Computer and Information Science, 2010, pp. 1847–1850.
[4] M. Szummer, T. Jaakkola, Partially labeled classification with Markov random walks, Adv. Neural Inform. Process. Syst. (2001) 1–7.
[5] X. Zhu, J. Lafferty, Z. Ghahramani, Semi-supervised Learning: From Gaussian Fields to Gaussian Processes, Technical Report CMU-CS-03-175, Carnegie Mellon University, 2003, pp. 5–9.
[6] T. La Fond, J. Neville, Randomization tests for distinguishing social influence and homophily effects, in: Proceedings of the World Wide Web Conference (WWW), ACM, New York, NY, 2011, pp. 601–610.
[7] R. Kumar, J. Novak, P. Raghavan, A. Tomkins, Structure and evolution of blogspace, Commun. ACM 47 (12) (2004) 35–39.
[8] M. Kim, J. Leskovec, Modeling social networks with node attributes using the multiplicative attribute graph model, in: Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI), 2011.
[9] D. Rao, M. Paul, C. Fink, D. Yarowsky, T. Oates, G. Coppersmith, Hierarchical Bayesian models for latent attribute detection in social networks, in: Proceedings of the International Conference on Weblogs and Social Media (ICWSM), 2011.
[10] D. Rao, D. Yarowsky, A. Shreevats, M. Gupta, Classifying latent user attributes in Twitter, in: Proceedings of the 2nd International Workshop on Search and Mining User-Generated Contents (SMUC), ACM, New York, NY, 2010, pp. 37–44.
[11] Z. Yin, M. Gupta, T. Weninger, J. Han, LINKREC: a unified framework for link recommendation with user attributes and graph structure, in: Proceedings of the World Wide Web Conference (WWW), 2010, pp. 1211–1212.
[12] Z. Yin, M. Gupta, T. Weninger, J. Han, A unified framework for link recommendation using random walks, in: Proceedings of the International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 2010.
[13] N.Z. Gong, A. Talwalkar, Jointly predicting links and inferring attributes using a social-attribute network (SAN), in: Proceedings of the 6th SNA-KDD Workshop, Beijing, China, 2012, pp. 76–87.
[14] N.Z. Gong, et al., Jointly predicting links and inferring attributes using a social-attribute network, ACM Trans. Intell. Syst. Technol. 5 (2) (2014) 1–20.
[15] J. He, W. Chu, Z. Liu, Inferring privacy information from social networks, Lect. Notes Comput. Sci. 3975 (2006) 154–168.
[16] W. Xu, X. Zhou, Inferring privacy information via social relations, in: Proceedings of the IEEE 24th International Conference on Data Engineering Workshop, 2008, pp. 525–530.
[17] J. Lindamood, R. Heatherly, M. Kantarcioglu, et al., Inferring private information using social network data, in: Proceedings of the World Wide Web Conference (WWW), 2009, pp. 123–130.
[18] A. Mislove, B. Viswanath, K.P. Gummadi, You are who you know: inferring user profiles in online social networks, in: Proceedings of the Conference on WSDM, 2010, pp. 4–6.
[19] Y. Dong, Y. Yang, J. Tang, Y. Yang, N.V. Chawla, Inferring user demographics and social strategies in mobile social networks, in: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014, pp. 15–24.
[20] M. Mo, D. Wang, Exploit of online social networks with semi-supervised learning, in: Proceedings of the IEEE World Congress on Computational Intelligence, 2010, pp. 18–23.
[21] T.M. Mitchell, Machine Learning, McGraw-Hill, 1997.
[22] X. Zhu, Z. Ghahramani, Towards Semi-supervised Classification with Markov Random Fields, Technical Report CMU-CALD-02-106, School of Computer Science, Carnegie Mellon University, 2002, pp. 23–65.
[23] D. Zhou, B. Schölkopf, Semi-supervised learning on directed graphs, in: Proceedings of NIPS, 2005, pp. 15–56.
[24] A. Blum, S. Chawla, Learning from labeled and unlabeled data using graph mincuts, in: Proceedings of the 18th International Conference on Machine Learning, 2001, pp. 192–261.
[25] M. Belkin, I. Matveeva, Regularization and semi-supervised learning on large graphs, in: COLT, 2004.
[26] X. Zhang, W.S. Lee, Hyperparameter learning for graph based semi-supervised learning algorithms, in: Proceedings of Advances in Neural Information Processing Systems (NIPS), 2006.
[27] D. Zhou, B. Schölkopf, T. Hofmann, Semi-supervised learning on directed graphs, in: L.K. Saul, Y. Weiss, L. Bottou (Eds.), Advances in Neural Information Processing Systems, MIT Press, 2005.
[28] D. Zhou, O. Bousquet, T. Lal, Learning with local and global consistency, in: Proceedings of Advances in Neural Information Processing Systems (NIPS), 2004, pp. 1–8.
[29] J. Gui, D.S. Huang, Z.H. You, An improvement on learning with local and global consistency, in: IEEE ICPR, 2008, pp. 1–4.
[30] C.-W. Hsu, A comparison of methods for multiclass support vector machines, IEEE Trans. Neural Netw. 13 (2) (2002) 415–425.
[31] D. Aha, D. Kibler, M. Albert, Instance-based learning algorithms, Mach. Learn. 6 (1) (1991) 37–66.
[32] M. Nunez, The use of background knowledge in decision tree induction, Mach. Learn. 6 (3) (1991) 231–250.
[33] K. Avrachenkov, P. Goncalves, M. Sokol, On the choice of kernel and labelled data in semi-supervised learning methods, Lect. Notes Comput. Sci. 8305 (2013) 56–67.
[34] K. Avrachenkov, P. Goncalves, A. Mishenin, M. Sokol, Generalized optimization framework for graph-based semi-supervised learning, in: Proceedings of the SIAM Conference on Data Mining (SDM 2012), 2012.