Finding User Groups in Social Networks Using Ant ...

4 downloads 329 Views 502KB Size Report
colony optimization (ACO) based clustering algorithm for clustering the social network data. The proposed technique takes advantage of Ants cemetery and ...
Available online at www.sciencedirect.com

ScienceDirect Procedia Computer Science 00 (2016) 000–000 www.elsevier.com/locate/procedia

The 3rd International Symposium on Emerging Information, Communication and Networks (EICN 2016)

Finding User Groups in Social Networks Using Ant Cemetery Waseem Shahzada*, Sana Qamberb a

Department of Computer Science, National University of Computer and Emerging Sciences, A.K.Brohi Road, H-11/4, Islamabad, 44000, Pakistan b Department of Computer Science, National University of Computer and Emerging Sciences, A.K.Brohi Road, H-11/4, Islamabad, 44000, Pakistan

Abstract Clustering plays a major role in data mining. It helps in identifying patterns and distribution of data. In this paper, we propose ant colony optimization (ACO) based clustering algorithm for clustering the social network data. The proposed technique takes advantage of Ants cemetery and allows each ant to play a role. In every iteration, the ants produce a cluster for the data, and the pheromone values are updated after every iteration of all the ants. The experiment results show that proposed technique discovered the clusters which reveal the truthfulness of the network. We have also compared the proposed technique with another state of the art clustering algorithm, and experimental result demonstrated that proposed technique find out the better clusters. © 2016 The Authors. Published by Elsevier B.V. Peer-review under responsibility of the Conference Program Chairs. Keywords: Ant colony optimization; ant cemetery; Social media; clustering

1. Introduction Networks are of many types; they include a social network, web network, biological networks and others1. The network is considered as a graph and finding a community in network means to find a subgraph based on connectivity. Functions are used to find dense and sparse connections2. In social networks nodes represent people, and nodes are connected by an edge showing if they are linked or not. Community detection helps in understanding the structure of the community1. It is a clustering problem, and clustering strategies are applied to divide the data

* Corresponding author. Tel.: +92-051-111-128-128; fax: +92-051-410-0619. E-mail address: [email protected] 1877-0509 © 2016 The Authors. Published by Elsevier B.V. Peer-review under responsibility of the Conference Program Chairs.

2

W. Shahzad et al./ Procedia Computer Science 00 (2015) 000–000

instances into some natural groups. In clustering process, instances are grouped in such a way that similar objects are grouped to be in one group1. Aim is to have clusters that are sparsely connected with other clusters but are densely connected within a cluster. Ant Colony Algorithm (ACO) is inspired by the movement of ants3. Marco Dorigo first proposed it in 19924. It is an algorithm used for finding the optimal solution. ACO has been used for social network clustering and has shown good results7. There are many other clustering algorithms, but experiments with real and artificial datasets have proved that ACO is a better clustering algorithm as it has provided better results5. Its chance of getting stuck at local optima is less6. Previously algorithms have been developed but they do not consider multiple attributes. For this purpose we have proposed our technique for the unsupervised dataset and it is based on ant colony optimization technique. The technique is novel and can handle multiple attribute based dataset. It takes advantage of Ants cemetery and allows each ant to play a role. All ants have the opportunity to update pheromone value based on the solution that is developed. The rest of the paper is structured as follows; Section 2 contains details of related work, applications of clustering by using ant colony optimization. Section 3 discusses the details of our proposed technique. Comparison with other algorithms and results are in section 4. Conclusion and future aspect in the last section 5.

2. Related Work Data that seems meaningless and hard to understand is made meaningful and easy to comprehend by clustering. Data is closely analyzed, and patterns are determined. Clustering is very useful in almost every field8, medicine9, library, insurance, earthquake, websites and marketing. A modified technique of ant colony optimization for community detection using modularity Q as the objective function1. It is different from other techniques as is in this ants, do not search for candidate solution but in every step checks if current vertex joins the previous. Thus, the solution provided is checking locally. No pheromone is used in this technique. Weak points are it is a locally optimized solution. The network consists of nodes and edges. They do not consider attribute values. The main algorithm is to take each community as a vertex and initially randomly distribute ants. A methodology that is based on community division2, it allows mixing the link and node attribute information. The main aim of this paper is to find companies that have done more than two projects together. The dataset consists of members of all the projects and information of each company. Strong points are system has been practically implemented and tested on FP6 database. It caters multiple attributes but does not use ACO and is not implemented on a social network. Clustering is performed by similarity3. The objective of this paper is to find maximum subgroups from the structure. The major distinguishing factor from the normal ant colony algorithm is that the path on which the ant moves in the proposed algorithm is a vertex, and the pheromone is deposited on vertex rather than moving by an edge sequence. It is tested on Orkut data set; the member is represented by nodes and if any communication by the message is exchanged it is represented by a vertex. Previously instance based search was used in which a biased search may lead to a suboptimal wrong solution. It does not use multiple attributes. Ant colony optimization, model-based search has been used to find clusters in social network10. Solution is made by using probabilistic models. The methodology includes dividing the graph into two sub graphs. This division process is repeated until no further increase in modularity is achieved. Graph only shows the connectivity, no importance is given to the individual attribute. A new algorithm based on ant colony optimization to detect community in networks11. Artificial ants are used in this algorithm to construct a solution for community detection. College Football graph, vertices show different teams and presence of edge tells that match has been played between the teams. Larger the pheromone on the path, greater is the probability of its selection. Weak points are that it doesn’t consider multiple attributes of the networks. Multiple relations that exist between users are used for clustering of social networks12. A random walk is performed on each user. The drawback is if a graph is disconnected the walk will just rotate around the starting point and will not converge. Thus, the solution becomes biased. Mark13 provides a fast algorithm for detecting community in a network. The main idea is modularity, to check the quality of a community. The drawback is that this algorithm is very costly and is good for only small networks. The community is detected based on multiple attributes14. Similarity-based algorithm is proposed by the authors. The system has been tested on a Chinese social network known as Renren. But the drawback is that it uses a small number of users and doesn’t perform on large datsets. The algorithm does not use ant colony algorithm for finding community in a social network. Algorithm15 is based on fitness perception and pheromone diffusion. The technique uses ant colony algorithm for clustering. In this algorithm, each node is considered as a community. The drawback is the use of random selection as it may lead to wrong results. Random selection is known to be very costly. For its proper

W. Shahzad et al./ Procedia Computer Science 00 (2015) 000–000

3

working, all possibilities and lists must be present and calculated beforehand. So that values could be randomly selected from it. A well-known technique for clustering is K-Mean clustering16. The number of clusters to be created in the algorithm is specified in the initialization step. The k mean clustering method is simple and effective in producing good results. It is very obvious that choosing the cluster center to be the centroid minimizes the total squared distance from each of the clusters points to its center. It is not a suitable for social network clustering as it does not guarantee optimal solution. The related work shows that in the past algorithm using Ant colony algorithm that caters multiple attributes on a social network has not been developed thus our focus is to develop an algorithm considering these aspects. 3. Proposed Technique 3.1. Data Collection and Representation Databases are available online for social network clustering, but they are mostly single attribute based. The online databases for social networking are a few. We have collected our own database from Facebook using scraping. It includes the data of more than 3000 users, their likes and a wide number of categories are extracted. The input dataset is represented in the matrix form with a size equal to N x M where N denotes the number of Users in the data set, and M represents the features/attributes or dimensions of each data item. The dataset is in the form of numerals. 3.2. Reducing Complex network Ants, unlike humans, do not have preplanned routines. Incase ants die they do not have a specific location for the dead ants. They clean their nest and during cleaning they pick the dead ants known as corpses and put them aside. The dead is picked up or not by using picking probability equation 1. A value close to one means that the current surrounding has few corpses thus it needs to be picked and placed in another location where there are higher numbers of corpses. Once its picked now where to drop the corpse is based on dropping probability equation 2. A value close to one means high probability to be dropped in the current location. This means that the current location has high number corpses in the surrounding. In our algorithm, the dead ants that need to be scattered in the subspace are the data set points which are scattered in the subspace. Ant Cemetery process is performed. Picking probability 2 Close to 1 pick  K1   

(1)

Dropping probability 2 Close to 1 drop  f 

(2)

Pp     K1  f 

 Pd    K2  f 

Algorithm Proposed Algorithm 1. 2. 3. 4.

5.

Input Dataset for the network Transformation Adjacency Matrix Initialization Number of ants, max number of iterations T, Ant Cemetery Nodes randomly distributed, Ant cemetery algorithm initialized, Groups formed and result forwarded. Pheromone Calculation

4

W. Shahzad et al./ Procedia Computer Science 00 (2015) 000–000

Pheromone value is calculated and updated. Following formula has been used for calculation.

  .     .  

Pij 



ij

(3)

ij



vk



ij

ij

In equation 3, Pij is the probability of the i th data that belongs to j th cluster.  ij is the amount of pheromone on edge between ij ,  ij is the heuristic value calculated by   1 ij

(4)

X ij

6.

Cluster creation Euclidean distance is calculated and clusters are formed. Equation 5 has been used for distance calculation.

d  p, q  

n

 q i 1

7.

 pi 

2

i

(5)

Objective Function Calculation Intra cluster distance calculated using equation 6. n

d c, k    xi  c p (i )

2

(6)

i 1

8.

9.

Best Solution In every iteration clusters formed and compared with previous solution. Best solution saved based on objective function Iterations run for T number T is the maximum number of iterations taken as 1000.

4. Experimentation Experiments are performed on our datasets, and we have evaluated the results according to the cluster validation process. The dataset used for experimentation is extracted from Facebook. The users are not related; the experiment does not become biased. Clustering is purely done by the similarity of the attributes of the users. The experimental results validate the performance of the proposed approach and show that our approach is more accurate in providing clustering of social website data. The program for proposed algorithm is written in MATLAB (R2011a) and run on an Intel Pentium 2370M with 2.00GB memory. We have compared the performance of our algorithm with some other clustering algorithm ACCFP [15] and KMEAN [16]. Our algorithm has provided better results as compared to the mentioned clustering techniques. 4.1. Discussion Algorithm ACCFP as described in the related work section states to produce good results, but its drawback is the use of random selection as it may lead to wrong results. Random selection is known to be very costly. For its proper working, all possibilities and its lists must be present and calculated beforehand. So that values could be randomly selected from it. The K-Mean algorithm has no guarantee that it will converge in the global minimum; it can also converge to the local minimum depending upon its initial cluster centers. Small changes in the initial choice of cluster centers may result into entirely different solution. Thus, it is not a suitable way for clustering. The reasons ACCFP algorithm and K-Mean algorithm are used for comparison are that ACCFP is one of latest papers, and it states to have a good result. K-Mean is the widely known and used an algorithm for clustering purpose. To find the best K value that is the number of clusters that should be created in K-Means algorithm, the Elbow method is used. Elbow method uses the sum of square. The main concept is to select such value of k that if further K value is increased it doesn’t further give good modeling for the dataset. Fig. 1 shows the value of the sum of the square of

W. Shahzad et al./ Procedia Computer Science 00 (2015) 000–000

5

the y-axis and number of clusters that is the value of K on the x-axis. The figure shows that the sum of square decreases abruptly as the value of K increases. But after the value of K 10 sum of the square doesn’t decrease abruptly. 1.8

Sum of Squares

80

Sum of Square

60 40 Sum of Square

20

Intra cluster distance

1.6

Proposed Algo ACCFP

1.4 1.2 1 0.8 0.6 0.4 0.2

0

0 1 3 5 7 9 11 13 15 17 19 K

Fig. 1 Sum of Square

1 2 3

4 5 6 7

8 9 10

Cluster No.

Fig. 2 Intra Cluster distance

Fig. 2 shows results, ACCFP algorithm, and our proposed algorithm is compared. The results show that the average intra-cluster distance is high when comparing the ACCFP algorithm and our proposed algorithm. As shown in the figure the blue and red line are close but mostly the red line is above the blue line showing that the ACCFP algorithm has higher intra-cluster distance value as compared to proposed algorithm This shows that our proposed algorithm has better results compared to ACCFP. Though our algorithm and ACCFP are quite close, our algorithm has performed better. The reason that our algorithm has performed better that the existing algorithm is that we have incorporated the use of ant cemetery along with Ant colony algorithm. Suitable values of ants and other parameters have been used that makes the proposed algorithm perform better than the existing algorithms. Fig 3 shows the interpretation of the results obtained from the proposed algorithm. The figure illustrates the clusters and the attributes. The X axis shows the attribute and y-axis show the value. Colored bars show the presences of the attribute in the cluster. The first cluster 1, in this cluster it shows that the majority attributes are 4, 5, 6 and 8 as they are the colored bars. A display of total 12 clusters is shown, showing the good performance of the proposed algorithm.

Fig. 3Interpretation of Clusters

6

W. Shahzad et al./ Procedia Computer Science 00 (2015) 000–000

5. Conclusion Discovering hidden patterns from a dataset is part of computational intelligence. The unsupervised dataset is classified in clustering. Clustering plays a major role in data mining. It helps in identifying patterns and distribution of data. Clustering is performed on the unsupervised dataset. It partitions the data by following the trends in the dataset. Recently ant colony algorithm has been providing good results with clustering problems. Clustering using Ant cemetery and Ant Colony Optimization has the advantage over other methods that it has the capability to explore the dataset. It tries to find the global optima and protects itself from converging into local optima. The main aim of this paper is to develop an efficient ACO based clustering algorithm which can cluster the social network dataset by handling multiple attributes. The dataset used is unsupervised. Pheromone update is allowed by all ants. A combination of ant cemetery and ant colony algorithm is used. In the clustering of the social network, the clusters that are formed should be closely related and have maximum similarities in them thus the objective function of intra distance is used to know the amount compactness in the clusters formed. Distance values show that the proposed algorithm has less intra-cluster distance as compared to the state of the art clustering algorithms. The lesser the intracluster distance, the better it is for the algorithm. Thus, the proposed technique finds better clusters and can produce optimal clusters which reveal the truthfulness of the network. As per future work, more efficient heuristic measure can be determined to improve the performance of proposed algorithm further. Some alternative mechanism can be searched for updating pheromone values further to improve the convergence mechanism and quality of the solution. References 1. 2. 3. 4. 5. 6. 7. 8. 9.

10. 11. 12. 13. 14. 15. 16.

D. He, J Liu, D. Liu, D. Jin, and Z. Jia. "Ant colony optimization for community detection in large-scale complex networks."In Natural Computation ICNC, Seventh International Conference IEEE, vol 2, 2011 p. 1151-1155. S. Lozano, J. Duch, and A. Arenas. "Community detection in a large social dataset of european projects." Counterterrorism and Security (SIAM) on Data mining, 2006. M. Al-Fayoumi, S. Banerjee, and P. K. Mahanti. "Analysis of social network using clever ant colony metaphor." World Academy of Science, Engineering and Technology vol 5, 2009, p. 970-974. V. Ganapathy, T. T. Jia Jie, and S. Parasuraman. "Improved Ant Colony Optimization for robot navigation." Mechatronics and its Applications (ISMA), 7th International Symposium on IEEE, 2010, p.1-6. A. P. Engelbrecht, Computational intelligence: an introduction, John Wiley & Sons, 2nd Edition, 2007. T. A. Runkler. "Ant colony optimization of clustering models."International Journal of Intelligent Systems vol 20.12, 2005, p.12331251. M. Halkidi, Y. Batistakis, and M.V, "On Clustering Validation Techniques," Journal of Intelligent Information Systems, vol. 17, 2001, p. 107–145. P. Berkhin, "A survey of clustering data mining techniques." Grouping multidimensional data. Springer Berlin Heidelberg, 2006, p. 25-71. U. Alon,N. Barkai, D. Notterman, K. Gish, S. Ybarra, D. Mack, & A. Levine, Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences, vol 96, no.12, 1999, p. 6745-6750. S. R. Mandala, S. R.T. Kumara, C. R. Rao, and R. Albert. "Clustering social networks using ant colony optimization.", vol 13, 2013, p. 47-65. B. Chen, L. Chen, and Y. Chen. "Detecting community structure in networks based on ant colony optimization." In International Conference on Information & Knowledge Engineering, 2012, p. 247-253. M. Gjoka, C. T. Butts, M. Kurant, and A. Markopoulou. "Multigraph sampling of online social networks." Selected Areas in Communications, IEEE Journal vol 9, 2011, p. 1893-1905. M. E. Newman, "Fast algorithm for detecting community structure in networks." Physical review vol 69.6, 2004. D. Li, L. Ma, Q. Yu, Y. Zhao, and Y. Mao. "A community detection algorithm based on the multi-attribute similarity." In Communications, Circuits and Systems IEEE (ICCCAS), vol. 1, 2013, p. 118-121. X. Song, J.Junzhong, C.Yang, & X. Zhang, “Ant colony clustering based on sampling for community detection”, In Evolutionary Computation (CEC), IEEE, 2014, p. 687-692. M. E. Celebi, H. A. Kingravi, & P. A. Vela, “A comparative study of efficient initialization methods for the k-means clustering algorithm.” Journal of Expert Systems with Applications, Elsevier, vol 40.1, 2013, p. 200-210.

Suggest Documents