Improving Link Prediction in Social Networks with Population-Based Metaheuristic Algorithms

Tarnaz Chamani1*, Alireza Pourebrahimi2 and Babak Shirazi3

1 Mazandaran University of Science and Technology, Babol, Iran
2 Islamic Azad University, Tehran, Iran
3 Mazandaran University of Science and Technology, Babol, Iran

*Corresponding Author's E-mail: [email protected]
Abstract

Link prediction is a new interdisciplinary research direction in social network analysis (SNA) in which existing links are analyzed and future links are predicted among millions of users of a social network. There are various prediction models, including k-nearest neighbor (kNN), fuzzy inference, SVMs, Bayesian models, Markov models, and others. In this paper we use a Bayesian model to predict future links in a Flickr social network dataset that includes more than 35,000 users. We then use population-based metaheuristic algorithms to enhance the accuracy of the Bayesian network classifier through feature selection. We use two standard metrics, AUC and MAP, to quantify the accuracy of the prediction algorithms.

Keywords: Link prediction, Bayesian network, Feature selection, Social network
1. Introduction

Social networks are a popular way to model the interactions among the people in a group or community. They can be visualized as graphs, where a vertex corresponds to a person in some group and an edge represents some form of association between the corresponding persons [1]. The addition of new edges reflects new interactions in the underlying social structure. This dynamic property of social networks makes the study
of these graphs, and the prediction of links in them, a challenging task. Link prediction is a sub-field of social network analysis with applications in various domains such as communication surveillance, information integration, and recommender systems [2]. We consider three types of models for link prediction: first, traditional (non-Bayesian) models, which extract a set of features to train a binary classification model; second, probabilistic approaches, which model the joint probability among the entities in a network with Bayesian graphical models; and finally, the linear algebraic approach, which computes the similarity between the nodes in a network using rank-reduced similarity matrices. In this paper we use the second approach to link prediction, with a simple Bayesian network [2]. Bayesian network classifiers are classification algorithms that assume probability distributions encoded as Bayesian networks as the probabilistic model for the dataset that is the learning objective [3]. A Bayesian network classifier is a supervised learning algorithm; the goal of supervised learning is to build a concise model of the distribution of class labels in terms of predictor features. The resulting classifier is then used to assign class labels to testing instances, where the values of the predictor features are known but the value of the class label is unknown [4]. In this paper we use a Bayesian network algorithm for link prediction, and then apply Particle Swarm Optimization (PSO), a Genetic Algorithm (GA), an Artificial Neural Network (ANN), and the Imperialist Competitive Algorithm (ICA) to search the feature subset; these algorithms have shown high effectiveness and efficiency in solving complex and large feature selection problems.
2. Related works

Liben-Nowell and Kleinberg [1] proposed one of the earliest link prediction models that work explicitly on a social network. The learning paradigm in this setup
typically extracts the similarity between a pair of vertices using various graph-based similarity metrics and uses the ranking of the similarity scores to predict the link between two vertices. They examined various topological features, including graph shortest distance, common neighbors, preferential attachment, Adamic-Adar, Jaccard, SimRank, hitting time, rooted PageRank, and Katz, and concentrated mostly on the performance of these graph-based similarity metrics for the link prediction task. Later, Hasan et al. [6] extended this work in two ways. First, they showed that using external data outside the scope of the graph topology can significantly improve the prediction result. Second, they used various similarity metrics as features in a supervised learning setup, where the link prediction problem is posed as a binary classification task. Since then, the supervised classification approach has been popular in various other works on link prediction [7, 8, 9]. Methods such as decision tree induction, naive Bayes [10], support vector machines, and logistic regression are examples of supervised learning techniques. Doppa et al. proposed a learning algorithm for link prediction based on chance constraints [11]. The prominent features of supervised learning are feature construction and collective classification using a learned model. Once the features are computed for a particular node pair, we obtain a vector of values referred to as a feature vector, which may be correlated with a future possible link between that node pair. We train the learning system with the set of feature vectors computed for the training data; the model is then used to predict future links [12]. On the other hand, the performance of the simple Bayesian classifier is improved with feature selection. Recently, an iterative approach to simple Bayes was presented in [13]. In this paper, we use population-based metaheuristics to improve the accuracy of classification and thereby the accuracy of link prediction.
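To make the feature-construction step concrete, the following is a minimal sketch of how a feature vector for a candidate node pair can be assembled from a few of the topological similarity scores mentioned above. It assumes the networkx library and uses an arbitrary example graph; it illustrates the general supervised setup, not the exact feature set of the cited works.

```python
# Sketch: building a feature vector for a candidate node pair from
# graph-topological similarity scores (assumes the networkx library).
import networkx as nx

def pair_features(G, u, v):
    """Return a feature vector of similarity scores for the pair (u, v)."""
    common = len(list(nx.common_neighbors(G, u, v)))
    # networkx similarity generators yield (u, v, score) triples; take the score.
    jaccard = next(nx.jaccard_coefficient(G, [(u, v)]))[2]
    adamic_adar = next(nx.adamic_adar_index(G, [(u, v)]))[2]
    pref_attach = next(nx.preferential_attachment(G, [(u, v)]))[2]
    return [common, jaccard, adamic_adar, pref_attach]

# Toy example graph (illustration only, not the paper's dataset).
G = nx.karate_club_graph()
print(pair_features(G, 0, 33))
```

Such vectors, labeled by whether the pair actually forms a link in the next time step, become the training instances of the binary classifier.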
3. Link prediction

Link prediction is a graph mining task that aims to predict the occurrence of new edges in a graph after a certain interval of time [1]. This task can be formulated as follows [2]: given an undirected social network G(V, E), where V is the set of nodes and E is the set of links, let |V| denote the number of elements in V and let U denote the universal set of all possible links between node pairs in V. The set of nonexistent links is then U − E. We assume that there are some missing links (or links that will appear in the future) in the set U − E, and the task of link prediction is to find these links. Multiple interactions can be recorded by parallel edges or by using a complex timestamp on an edge [2]. We tested the link prediction performance of novel features in conjunction with a supervised machine learning method by treating the problem as an instance of binary classification [6], as described in the following sections.
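As a minimal illustration of this formulation (assuming networkx and a toy graph of my own), the candidate set U − E can be enumerated directly; on a real social network this set grows roughly as |V|²/2 and is usually sampled instead.

```python
# Sketch: enumerating the candidate link set U - E for a small toy graph
# (assumes networkx; real datasets require sampling of candidate pairs).
import networkx as nx

G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("c", "d")])  # toy example graph

candidates = list(nx.non_edges(G))   # all node pairs with no existing link
print(candidates)                    # [('a', 'c'), ('a', 'd'), ('b', 'd')]
```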
3.1. Validation measures for link prediction

The confusion matrix identifies true positives ("hits"), false positives ("false alarms"), false negatives ("misses"), and true negatives ("correct rejections"), designated TP, FP, FN, and TN, respectively (Figure 1). From the confusion matrix, several measures for assessing a classifier can be computed (Table 1).
Figure 1: Confusion matrix showing the true positives (TP), false positives (FP), false negatives (FN), true negatives (TN), predicted positives (P'), predicted negatives (N'), true-state positives (P) and true-state negatives (N). As shown in the diagram, rows and columns sum as follows: P' = TP + FP, N' = FN + TN, P = TP + FN and N = FP + TN.
The Receiver Operating Characteristic (ROC) curve depicts the true positive rate (TPR) as a function of the false positive rate (FPR). A classification method which randomly assigns true or false to the presence of future links would, on average, have TPR equal to FPR. Successful classifiers have TPR > FPR, and this is often quantified by estimating the area under the curve (AUC) of the ROC. The AUC approximates the probability that a link predictor will assign a higher score to user-user pairs that exhibit a link in the next time step than to user-user pairs that do not. It can be computed using a trapezoidal approximation. In practice, many researchers approximate it as

AUC = (n' + 0.5 n'') / n,

where n is the number of comparisons, n' is the number of times that user-user pairs which receive a new link in the next time step receive a higher score than randomly selected user-user pairs which do not, and n'' is the number of times that the two scores are equal. Thus, the degree to which the AUC exceeds 0.5 indicates how much better our predictions are than random guessing [5, 14].
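The sampling-based estimate above can be computed with a few lines of code. The sketch below assumes hypothetical inputs: `pos_scores` holds predictor scores for pairs that do gain a link in the next time step, and `neg_scores` holds scores for pairs that do not.

```python
# Sketch: sampling-based AUC estimate, AUC = (n' + 0.5 * n'') / n.
import random

def auc_by_sampling(pos_scores, neg_scores, n=10000, seed=0):
    """Estimate the AUC by n random comparisons of positive vs. negative pairs."""
    rng = random.Random(seed)
    higher = ties = 0
    for _ in range(n):
        p = rng.choice(pos_scores)   # score of a pair that receives a new link
        q = rng.choice(neg_scores)   # score of a pair that does not
        if p > q:
            higher += 1
        elif p == q:
            ties += 1
    return (higher + 0.5 * ties) / n

# Hypothetical scores, for illustration only.
print(auc_by_sampling([0.9, 0.7, 0.8], [0.4, 0.6, 0.3]))
```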
Table 1: Metrics for link prediction.

Sensitivity (true positive rate, recall):   TPR = TP / (TP + FN)
Accuracy:                                   ACC = (TP + TN) / (TP + FP + FN + TN)
Specificity (true negative rate):           TNR = TN / (FP + TN)
Positive predictive value (precision):      PPV = TP / (TP + FP)
Mean Average Precision (area under the precision-recall curve): MAP = (1/|Q|) Σ_{q=1}^{|Q|} AveP(q), where AveP(q) = Σ_k P(k) Δr(k) is the average precision of the ranked prediction list for query q.
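For reference, the following sketch computes the confusion-matrix-based metrics of Table 1 directly from the four counts; the input values are placeholders, not the paper's results, and MAP is omitted because it requires ranked prediction lists rather than a single confusion matrix.

```python
# Sketch: Table 1 metrics computed from confusion-matrix counts.
def classification_metrics(tp, fp, fn, tn):
    """Return sensitivity (TPR), specificity (TNR), precision (PPV) and accuracy."""
    sensitivity = tp / (tp + fn)                     # TPR, recall
    specificity = tn / (tn + fp)                     # TNR
    precision   = tp / (tp + fp)                     # PPV
    accuracy    = (tp + tn) / (tp + fp + fn + tn)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "accuracy": accuracy}

print(classification_metrics(tp=50, fp=10, fn=5, tn=35))  # placeholder counts
```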
4. Bayesian networks for classification

Supervised classification is one of the tasks most frequently carried out by so-called intelligent systems. A Bayesian network consists of a structural model and a set of conditional probabilities. The structural model is a directed graph in which nodes represent attributes and arcs represent attribute dependencies. Attribute dependencies are quantified by conditional probabilities for each node given its parents. Bayesian networks are often used for classification problems, in which a learner attempts to construct a classifier from a given set of training examples with class labels [15]. A naive Bayesian classifier assigns to a test instance E = (a_1, a_2, ..., a_n) the class defined in Equation 1:

c(E) = argmax_{c ∈ C} P(c) ∏_{i=1}^{n} P(a_i | c).    (1)
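Below is a minimal, self-contained sketch of the classification rule in Equation 1 for categorical attributes. The toy attribute names, the add-one (Laplace) smoothing, and the tiny dataset are illustrative assumptions of mine, not the paper's actual setup.

```python
# Sketch: naive Bayes classification, c(E) = argmax_c P(c) * prod_i P(a_i | c).
from collections import Counter, defaultdict

def train_naive_bayes(X, y):
    """Collect the counts needed to estimate P(c) and P(a_i | c)."""
    class_counts = Counter(y)
    cond_counts = defaultdict(Counter)   # (class, attribute index) -> value counts
    vocab = defaultdict(set)             # attribute index -> distinct values seen
    for xi, c in zip(X, y):
        for i, a in enumerate(xi):
            cond_counts[(c, i)][a] += 1
            vocab[i].add(a)
    return class_counts, cond_counts, vocab, len(y)

def predict(x, class_counts, cond_counts, vocab, n):
    """Return argmax_c P(c) * prod_i P(a_i | c), with add-one smoothing."""
    best_c, best_p = None, -1.0
    for c, cc in class_counts.items():
        p = cc / n
        for i, a in enumerate(x):
            p *= (cond_counts[(c, i)][a] + 1) / (cc + len(vocab[i]))
        if p > best_p:
            best_c, best_p = c, p
    return best_c

# Invented toy data: attributes are (common-neighbour level, shared group?); label is link / no-link.
X = [("high", "yes"), ("low", "no"), ("high", "no"), ("low", "yes")]
y = ["link", "no-link", "link", "no-link"]
model = train_naive_bayes(X, y)
print(predict(("high", "yes"), *model))   # expected: "link"
```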
It has been proved that learning an optimal Bayesian network is NP-hard [3]. To avoid the intractable complexity of learning Bayesian networks, improving naive Bayes has attracted much attention from researchers. Related work can be broadly divided into four approaches [16, 17]:
1. Feature selection: selecting attribute subsets in which the attributes satisfy the attribute independence assumption.
2. Structure extension: extending the structure of naive Bayes to represent the dependencies among attributes.
3. Local learning: employing the principle of local learning to find a local training data set and using it to build a naive Bayes classifier.
4. Data expansion: expanding the training data and building a naive Bayes classifier on the expanded training data.
4.1. Population-based metaheuristic algorithms for feature selection

Feature selection is applied as a pre-processing step to machine learning. This approach improves the classification performance of naive Bayes by removing redundant and/or irrelevant attributes from the training data set and selecting only those that are most informative for the classification task. It works well under the hypothesis that it can improve the classification accuracy of naive Bayes in domains that include redundant and/or irrelevant attributes without reducing its accuracy in domains that do not. In other words, the improved naive Bayes with feature selection classifies a test instance E using Equation 2 in place of Equation 1:

c(E) = argmax_{c ∈ C} P(c) ∏_{i=1}^{k} P(a_i | c),    (2)

where a_i (i = 1, 2, ..., k) is the value of the selected attribute A_i and k is the number of selected attributes. The basic idea of these improved algorithms is to efficiently select relevant attribute subsets from the training data set [17]. In general, there are two categories of metaheuristic search algorithms: single-solution-based metaheuristics (SBM), which manipulate and transform a single solution during the search, and population-based metaheuristics (PBM), in which a whole population of solutions
is evolved [18]. In contrast to SBM, PBM iteratively improves a population of solutions and works as follows. First, the population is initialized. Then, a new population of solutions is generated. Next, the new population is integrated into the existing one using some selection procedure. The search process terminates when a certain criterion is satisfied. The genetic algorithm [19], particle swarm optimization [20], the Imperialist Competitive Algorithm [21], and neural networks [22] are the PBM algorithms used in this paper for feature selection.
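The sketch below illustrates this general PBM loop for feature selection, using a genetic algorithm wrapped around a naive Bayes classifier. It follows the initialize / generate / select / terminate scheme described above, but the operators, population size, and the use of scikit-learn's GaussianNB are illustrative assumptions, not the exact configuration of this study.

```python
# Sketch: genetic-algorithm feature selection wrapped around naive Bayes
# (assumes scikit-learn and numpy; X is a pair-feature matrix, y the link labels).
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    """Cross-validated accuracy of naive Bayes on the selected feature subset."""
    if not mask.any():
        return 0.0
    return cross_val_score(GaussianNB(), X[:, mask], y, cv=3).mean()

def ga_feature_selection(X, y, pop_size=20, generations=30, p_mut=0.1):
    n_features = X.shape[1]
    population = rng.random((pop_size, n_features)) < 0.5            # initialize
    for _ in range(generations):
        scores = np.array([fitness(ind, X, y) for ind in population])
        parents = population[np.argsort(scores)[-pop_size // 2:]]    # selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n_features)                        # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            child ^= rng.random(n_features) < p_mut                  # bit-flip mutation
            children.append(child)
        population = np.vstack([parents] + children)                 # new population
    scores = np.array([fitness(ind, X, y) for ind in population])
    return population[scores.argmax()]

# Hypothetical usage with invented data (200 pairs, 8 candidate features).
X = rng.random((200, 8)); y = rng.integers(0, 2, 200)
best_mask = ga_feature_selection(X, y)
print("selected features:", np.flatnonzero(best_mask))
```

The other PBM algorithms used in the paper (PSO, ICA, ANN-based selection) would replace only the crossover/mutation step with their own update rules; the fitness wrapper around the naive Bayes classifier stays the same.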
5. Experimental results

We evaluated our approach using a social network dataset obtained from the Flickr site, collected on 3 July 2010. It includes more than 35,000 users, with their joined groups and tags. It also includes the friendship and commentship relations (i.e., who comments on whose photos) among the set of users; the topological properties are shown in Table 2. The joined groups can be treated as class labels in classification tasks, or as ground truth for community detection tasks [23].

Table 2: Topological properties of the Flickr dataset.

Nodes: 80,513
Links: 5,899,882
Network density: 1.8 × 10^-3
Maximum degree: 5,706
Average degree: 146
Clustering coefficient: 0.61
Categories (number of groups): 195
The first step in this study is link prediction with a simple (naive) Bayesian classifier. We used a hold-out validation method to split the data into training and testing sets: 10% of the records were randomly extracted from our experimental data as the test set. This is a common way of pre-processing real data for the selection and training process. At each
stage, 2,000 nodes were randomly selected from the network, and the maximum degree of the selected vertices was 50. The resulting confusion matrix is shown in Table 3 and the prediction results are shown in Table 4.

Table 3: Confusion matrix.

TP = 7816   FP = 174
FN = 63     TN = 438
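As a rough sketch of this evaluation step, the hold-out split, naive Bayes training, and confusion matrix can be produced as follows. The feature matrix X and labels y are hypothetical placeholders here, and the use of scikit-learn's GaussianNB, train_test_split, and metric functions is an assumption of this sketch rather than the paper's exact tooling.

```python
# Sketch: 10% hold-out evaluation of a naive Bayes link predictor
# (assumes scikit-learn; X and y stand in for the real pair features and labels).
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_auc_score

rng = np.random.default_rng(1)
X = rng.random((5000, 6)); y = rng.integers(0, 2, 5000)   # placeholder data

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.10, random_state=1)
clf = GaussianNB().fit(X_tr, y_tr)

tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(X_te)).ravel()
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"TP={tp} FP={fp} FN={fn} TN={tn}, AUC={auc:.3f}")
```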
Table 4: Link prediction with the naive Bayes classifier.

Classifier             | Specificity & precision | Accuracy | Sensitivity & recall | MAP (0-1) | AUC Class1 (0-1) | AUC Class2 (0-1)
Naive Bayes Classifier | NaN                     | 0.2000   | 0.2000               | 0.202220  | 2.20.0           | 2.20.0
Population-based metaheuristic algorithms were then used in the feature selection stage to improve the prediction of the naive Bayes classifier. For this purpose we used the Imperialist (Colonial) Competitive Algorithm, Particle Swarm Optimization, neural networks, and genetic algorithms, all of which belong to the PBM family. The results are shown in Table 5.

Table 5: Comparison of population-based metaheuristic algorithms for improving link prediction.

Algorithm | Specificity & precision | Accuracy | Sensitivity & recall | MAP (0-1) | AUC Class1 (0-1) | AUC Class2 (0-1)
GA        | 0.22220                 | 0.222.   | 0.280.02             | 0.200022  | 2.20.2           | 2.20.2
ICA       | 0.220028                | 0.22..   | 0.220028             | 0.22.     | 2.8222           | 2.8222
ANN       | NaN                     | 0.222    | 0.222                | 0.200228  | 0.00             | 1.00
PSO       | NaN                     | 0.222022 | 0.222022             | 0.202222  | 2.2000           | 2.280.
Table 5 compares the AUC and MAP values obtained by the algorithms; the closer these two metrics are to 1, the more favorable the results. For the dataset used in this paper, the Imperialist (Colonial) Competitive Algorithm performs best. Furthermore, comparing the MAP values in Table 5 with those in Table 4 indicates that the population-based algorithms used in this study increase the accuracy of the predicted links. Although changing the train/test split can change the results, this does not affect the feature selection stage used in this study.
Conclusion

In this paper, we propose an approach for improving naive Bayes classification with PBM for link prediction in social networks. Our method represents the network statically, at a single point in time. Since social networks grow steadily in membership and activity, capturing patterns over a longer, set period of time is necessary for further studies. To avoid the intractable complexity of learning Bayesian networks, improving naive Bayes has attracted much attention from researchers; in this paper we used feature selection to improve the prediction results, as shown in Section 5. Related work for the future can be broadly divided into three other approaches: structure extension, local learning, and data expansion.
References

[1]. Liben-Nowell, David, and Kleinberg, Jon. "The Link Prediction Problem for Social Networks". Journal of the American Society for Information Science and Technology, 58(7):1019–1031, (May 2007).
[2]. Mohammad Al Hasan and Mohammed J. Zaki. "A Survey of Link Prediction in Social Networks". Springer Science+Business Media, LLC, (2011).
[3]. Heckerman, D., Geiger, D., and Chickering, D. "Learning Bayesian networks: The combination of knowledge and statistical data". Machine Learning, Springer, (1995).
[4]. Sotiris B. Kotsiantis, Ioannis D. Zaharakis, and Panayiotis E. Pintelas. "Machine learning: a review of classification and combining techniques". Artificial Intelligence Review, 26(3):159–190, (2006).
[5]. Catherine A. Bliss, Morgan R. Frank, Christopher M. Danforth, and Peter Sheridan Dodds. "An Evolutionary Algorithm Approach to Link Prediction in Dynamic Social Networks". Submitted to Journal of Computational Science, Elsevier, 5 February (2014).
[6]. Hasan, Mohammad A., Chaoji, Vineet, Salem, Saeed, and Zaki, Mohammed. "Link Prediction using Supervised Learning". In Proceedings of the SDM Workshop on Link Analysis, Counterterrorism and Security, (2006).
[7]. Bilgic, Mustafa, Namata, Galileo M., and Getoor, Lise. "Combining collective classification and link prediction". In Proceedings of the Workshop on Mining Graphs and Complex Structures at the ICDM Conference, (2007).
[8]. Wang, Chao, Satuluri, Venu, and Parthasarathy, Srinivasan. "Local Probabilistic Models for Link Prediction". ICDM '07: In Proceedings of the International Conference on Data Mining, (2007).
[9]. Doppa, Janardhan R., Yu, Jun, Tadepalli, Prasad, and Getoor, Lise. "Chance-Constrained Programs for Link Prediction". In Proceedings of the Workshop on Analyzing Networks and Learning with Graphs at the NIPS Conference, (2009).
[10]. Jiang, L., Zhang, H., and Cai, Z. "Discriminatively Improving Naive Bayes by Evolutionary Feature Selection". Romanian Journal of Information Science and Technology, 9(3):163–174, (2006).
[11]. Janardhan Rao Doppa, Jun Yu, Prasad Tadepalli, and Lise Getoor. "Learning algorithms for link prediction based on chance constraints". In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, pages 344–360, (2010).
[12]. Milen Pavlov and Ryutaro Ichise. "Finding experts by link prediction in co-authorship networks". In Proceedings of the 2nd International ISWC+ASWC Workshop on Finding Experts on the Web with Semantics, pages 42–55, (2007).
[13]. C. Ratanamahatana and D. Gunopulos. "Feature Selection for the Naive Bayesian Classifier using Decision Trees". Applied Artificial Intelligence, 17:475–487, (2003).
[14]. Linyuan Lü and Tao Zhou. "Link prediction in complex networks: A survey". Physica A, Elsevier, 390:1150–1170, (2011).
[15]. Sotiris B. Kotsiantis, Ioannis D. Zaharakis, and Panayiotis E. Pintelas. "Machine learning: a review of classification and combining techniques". Artificial Intelligence Review, 26(3):159–190, (2006).
[16]. Pearl, J. "Probabilistic Reasoning in Intelligent Systems". Morgan Kaufmann, San Francisco, CA, (1988).
[17]. Chickering, D.M. "Learning Bayesian networks is NP-Complete". In: Fisher, D., Lenz, H. (eds.), Learning from Data: Artificial Intelligence and Statistics V, pp. 121–130, Springer, Heidelberg, (1996).
[18]. Ahmed Majid Taha, Aida Mustapha, and Soong-Der Chen. "Naive Bayes-Guided Bat Algorithm for Feature Selection". The Scientific World Journal, Hindawi Publishing Corporation, (2013).
[19]. J. Yang and V. Honavar. "Feature subset selection using a genetic algorithm". IEEE Intelligent Systems and Their Applications, vol. 13, no. 2, pp. 44–48, (1998).
[20]. X. Wang, J. Yang, X. Teng, W. Xia, and R. Jensen. "Feature selection based on rough sets and particle swarm optimization". Pattern Recognition Letters, vol. 28, no. 4, pp. 459–471, (2007).
[21]. S. J. Mousavi Rad, F. Akhlaghian Tab, and K. Mollazade. "Application of Imperialist Competitive Algorithm for Feature Selection: A Case Study on Bulk Rice Classification". International Journal of Computer Applications (0975–8887), Volume 40, No. 16, February (2012).
[22]. R. K. Sivagaminathan and S. Ramakrishnan. "A hybrid approach for feature subset selection using neural networks and ant colony optimization". Expert Systems with Applications, vol. 33, no. 1, pp. 49–60, (2007).
[23]. Flickr dataset. http://socialcomputing.asu.edu/datasets/Flickr, 3 July (2010).
Authors
Name: Tarnaz Chamani
Degree: MS student at Mazandaran University of Science and Technology, Babol, Iran
Education: Information Technology Engineering