Short Random Walks for Community Discovery in Social Networks

Dr. M. Mohamed Sathik, Reader in Computer Science, Sathakathullah Appa college, Thirunelveli, INDIA [email protected]

A. Abdul Rasheed, Assistant Professor in Computer Applications, Valliammai Engineering College, SRM Nagar, Kattankulathur, Chennai, INDIA [email protected]

Abstract – The study of networks is an active area of research because networks can model many complex real-world systems. One interesting property to investigate in a network is its community structure, that is, the division of the network into groups. The study of community structure in networks is closely related to the ideas of graph partitioning in graph theory. Finding an exact solution to a partitioning task is believed to be an NP-hard problem. Partitioning a graph into a number of subgraphs can be thought of as clustering the graph. Random walks are a recent method for discovering communities in large-scale, complex networks. The idea is that short random walks tend to stay in the same community. This paper attempts to discover the communities in a blog dataset using random walks, and presents the results obtained by applying this method to that dataset.

Keywords – Data Mining, Social Networks, Short Random Walks, Graph Partitioning, Community Discovery

Proceedings of the International Conference on Innovative Computing, Information and Communication Technology, Sri Sairam Engineering College, Chennai.

I. INTRODUCTION

Network analysis is an active area of research. In the recent decade, large complex networks representing relationships among sets of entities have been a focus of interest for researchers in many fields, such as social networks, the World Wide Web and telecommunication networks. Web 2.0 has drastically reduced the geographical divide among people: a person in one corner of the globe can easily contact a person in any other corner. Moreover, it lets people express their own views on any comment made on a website, which is otherwise called a blog. A social network interconnects individuals by means of a graph representation. In this structure, a node represents an individual or an organization, and a link (edge) represents the relationship or interaction between them. Social networks play a very prominent role in the dissemination of knowledge today. They are applied in a wide variety of areas, ranging from marketing, epidemiology, sociology, biology and terrorist networks to computer science.

Social Network Analysis (SNA) is a field of research that provides a set of tools and theoretical approaches for holistic exploration of the communication and interaction patterns of social systems. SNA has also been applied to the interaction among people who share their views with each other in a common space, the blog. Social interactions are often modeled with networks. The study of networks has become an important topic with the increasing availability of data about large networks. A social network is a set of people, organizations or other social entities connected by a set of social relationships, such as friendship, co-working or information exchange. Social network analysis is the study of social networks to understand their structure and behavior. In recent years, online social networking has become a very popular Web 2.0 application, which lets users communicate, interact and share on the World Wide Web. Online social networking websites offer different interfaces for online content sharing, such as photo sharing, video sharing and professional networking.

A social network N consists of a collection of nodes (people, organizations, or groups) A, B, C, ... together with a collection of link sets L(A;B), which generalize the idea of a link from A to B. SNA is a methodology for collecting and analyzing relational data. It facilitates analyzing and comparing information flows in an organization as well as among groups and individuals. SNA techniques have been applied to a variety of problems, and they have been successful in uncovering relationships not seen with traditional methods.

Many complex systems in the real world, such as the World Wide Web, can be represented as large networks or graphs. With the increasing availability of data about these kinds of networks, studying them has become a very important research topic. Social networks are graphs of interactions between individuals, groups or organizations. One of the most important research questions in social networks is the “identification of communities”. The definition of a community is itself arguable, and a wide variety of definitions exist. Communities can be defined as collections of individuals who interact unusually frequently. The identification of communities often reveals properties shared by the members, such as occupations, social functions, or common hobbies like dating. These properties include related topics or common viewpoints, which has led to a large amount of research on identifying communities in the web graph. A community is a densely connected subset of nodes that is only sparsely linked to the remaining network. That is, a community is a subset of nodes such that nodes in the same community (intra-community) are more likely to be connected than nodes in different communities (inter-community).
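The definition of a community as a densely connected subset that is only sparsely linked to the rest of the network can be made concrete by counting edges. The following is a minimal Python sketch, not part of the original study: the toy edge list, the node names and the community labels are all invented for illustration, and the helper `count_edge_types` is hypothetical.

```python
from collections import Counter

# Hypothetical toy network: an undirected edge list and a made-up
# assignment of nodes to two communities (not data from the paper).
edges = [
    ("A", "B"), ("A", "C"), ("B", "C"),   # links inside community 0
    ("D", "E"), ("D", "F"), ("E", "F"),   # links inside community 1
    ("C", "D"),                           # single link between the two
]
community = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1}


def count_edge_types(edges, community):
    """Return (intra, inter): edges within a community vs. across communities."""
    counts = Counter(
        "intra" if community[u] == community[v] else "inter"
        for u, v in edges
    )
    return counts["intra"], counts["inter"]


intra, inter = count_edge_types(edges, community)
print(f"intra-community edges: {intra}, inter-community edges: {inter}")
```

A labeling that matches the definition should give many more intra-community edges than inter-community edges, as it does for this toy example.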

Figure 1 represents three communities. It shows the level of interaction among the members within each community (intra-community) and their interaction with members of other communities (inter-community). Community detection reveals interesting characteristics or properties shared by the members. Community inference is one of the important tasks in data mining, and it can be done with clustering techniques.

Figure 1: A group of three communities and the interaction among the members of each group and with the members of other groups

A blog, also called a weblog, is a website comprising blog posts, or content written by bloggers, which are typically organized into categories. Blogs create a context for dialogue between bloggers and readers. Blogs are also used as a tool for learning as well as for knowledge dissemination. Today, information extraction from blogs is called user elicitation. Blogs are massively popular, as shown by the increasing number of blogs, bloggers and blog readers. Most blog platforms provide a personal writing space in which content is easy to publish and share.

A random walk on a graph G = (V, E) is a sequence of random variables X_0, X_1, X_2, ... ∈ V such that X_{t+1} is a uniformly random neighbor of X_t. Given a graph and a starting vertex, we select one of its neighbors at random and move to it, then select a neighbor of that vertex and move to it, and so on. The (random) sequence of vertices selected this way is a “random walk” on the graph. More generally, a random walk (denoted RW) is a mathematical formalization of a trajectory that consists of successive random steps: a sequence of discrete steps of fixed length, where the direction of each step is random and does not depend on the previous steps. A random walk on a directed graph can also be viewed as a Markov chain (MC). Random walks are an abstraction for a range of processes observed in natural complex systems, and their mathematical properties vary greatly with the dimensionality of the space in which the walk is undertaken. The idea used in this paper is that short random walks tend to stay in the same community.
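The random walk defined above, and the claim that short walks tend to stay in the same community, can be illustrated on a small example. The sketch below is only an illustration under stated assumptions: the six-vertex graph (two triangles joined by one edge) is invented, and the helper names `random_walk` and `step_distribution` are hypothetical rather than taken from the paper or from igraph.

```python
import random

# Hypothetical undirected toy graph (adjacency lists): two triangles,
# {0, 1, 2} and {3, 4, 5}, joined by the single edge 2-3.
graph = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}


def random_walk(graph, start, steps, rng):
    """Return the vertices X_0, ..., X_steps visited by one walk."""
    walk = [start]
    for _ in range(steps):
        # X_{t+1} is a uniformly random neighbor of X_t.
        walk.append(rng.choice(graph[walk[-1]]))
    return walk


def step_distribution(graph, start, steps):
    """Exact probability of being at each vertex after `steps` steps."""
    dist = {start: 1.0}
    for _ in range(steps):
        nxt = {}
        for v, p in dist.items():
            share = p / len(graph[v])  # probability is split evenly
            for w in graph[v]:
                nxt[w] = nxt.get(w, 0.0) + share
        dist = nxt
    return dist


rng = random.Random(0)  # fixed seed so the sample walk is repeatable
print(random_walk(graph, start=0, steps=4, rng=rng))

dist = step_distribution(graph, start=0, steps=3)
stay = sum(p for v, p in dist.items() if v in {0, 1, 2})
print(f"P(still in starting triangle after 3 steps) = {stay:.3f}")
```

On this toy graph a three-step walk started at vertex 0 is still inside its own triangle with probability 29/36, whereas a very long walk on the same graph spends only half of its time there, since the long-run visiting frequency of a vertex in a connected, non-bipartite undirected graph is proportional to its degree.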

II. RELATED LITERATURE

Clustering and community detection are important tasks in data mining, and several studies have addressed them. To cluster a set of data, we have to define a feature vector that describes each data point. Several works discuss clustering social networks, where vertices with similar attributes can be clustered together. Different researchers have proposed different methods. [1] proposed a framework and algorithms for identifying communities in social networks that change over time. The authors proposed an optimization-based approach for modeling dynamic community structure. They also described different application areas of social networks and how to define communities, and discussed frameworks proposed by other researchers working in similar areas.

[8] applied fuzzy c-means clustering to identify overlapping structure in complex networks. The authors used spectral clustering methods as the framework for their study, mapping data points into Euclidean space for the clustering process. They analyzed two different datasets and reported the results.

A new approach that applies a genetic algorithm to identify the communities in a social network was tried by [2]. The algorithm uses a fitness function able to identify groups of nodes in the network having dense intra-connections and sparse inter-connections. The approach defines a quality metric for a partitioning of the network into communities, based on the number and topology of the links among the nodes constituting a community, and tries to optimize this quantity by running the genetic algorithm. A spectral clustering algorithm for community discovery is presented by [9]. The authors present an improved spectral clustering method and report communities that do not overlap. [5] explained the statistical properties of community structure, exploring from a novel perspective several questions related to identifying meaningful communities in social and information networks. Applying statistical mechanics to community detection was described by [4]. The authors show how community detection can be interpreted as finding the ground state of an infinite-range spin glass, and how hierarchies and overlap in the community structure can be detected.

The above are a few of the methods that can be applied to identify the communities in a given graph structure; in this work we applied random walks. This method is considered one of the newer methods that can be used to identify communities. The idea of using random walks to detect communities in a complex network is that short random walks tend to stay in the same community. Our results are also shown in this paper.

III. EXPERIMENTAL SETUP AND RESULTS

Network analysis is an active area of research. In the recent decade, large complex networks representing relationships among sets of entities have been a focus of interest for researchers in many fields, such as social networks, the World Wide Web and telecommunication networks. One of the most important research questions in social networks is the “identification of communities”. The study of networks has become an important topic with the increasing availability of data about large networks. A blog is a website comprising blog posts, or content written by bloggers, which are typically organized into categories.

Fig 2: Actual Graph of the dataset
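The blog dataset used in this section is supplied as an edge list for a directed graph, and the study itself worked in R with igraph [3]. Purely as an illustration of that data layout, the following Python sketch builds a directed graph from an edge list; the blog names and edges are invented, and the helpers `build_digraph` and `in_degrees` are hypothetical, not part of the original experiments.

```python
from collections import defaultdict

# Hypothetical edge list: each pair (source, target) means that blog
# `source` links to blog `target`. The blog names are invented.
edge_list = [
    ("blog_a", "blog_b"),
    ("blog_a", "blog_c"),
    ("blog_b", "blog_c"),
    ("blog_c", "blog_a"),
    ("blog_d", "blog_a"),
]


def build_digraph(edge_list):
    """Build out-neighbor sets for a directed graph from (source, target) pairs."""
    successors = defaultdict(set)
    for source, target in edge_list:
        successors[source].add(target)
        successors[target]  # ensure blogs with no outgoing links still appear
    return dict(successors)


def in_degrees(successors):
    """Count incoming links for every vertex."""
    counts = {v: 0 for v in successors}
    for targets in successors.values():
        for t in targets:
            counts[t] += 1
    return counts


digraph = build_digraph(edge_list)
print(len(digraph), "vertices")
print(in_degrees(digraph))
```

Keeping direction matters here because a link from one blog to another does not imply a link back, so in-degree and out-degree can differ for the same blog.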

In this paper, we tried to discover communities using a blog dataset collected by [7]. This dataset contains an edge list for a directed graph representing the pattern of citation among 146 unique blogs related to AIDS patients, and their support networks. Vertices correspond to blogs. A directed edge from one blog to another indicates that the former had a link to the latter on its web page. When there is a response j to a blog post i, there is an edge from i to j. This generates a digraph structure. For implementation, we used the statistical package R, and to identify the communities we used igraph [3].

Fig 3: Communities discovered for the dataset

IV. CONCLUSION

A social network can be modeled as a graph with a set V of vertices and a set E of edges, each connecting two elements of V. Dividing the network into groups, so-called communities, is an interesting property in a typical network. The discovery of communities often reveals interesting properties shared by the members. Community discovery is an important task in data mining. The study of community structure in networks is closely related to the ideas of graph partitioning, and discovering communities is often believed to be an NP-hard problem. When communities are discovered, intra-community members form clusters with frequent interaction, while inter-community members interact less. A few methods have been introduced to discover communities over the past decade. In this paper, we applied short random walks, as an emerging method, to discover the communities. The idea behind using short random walks is that they tend to stay within a community. To discover communities, we took an AIDS blog dataset, consisting of blog posts shared by AIDS patients, represented as a graph. The results are also presented. We can identify some clusters of bloggers who probably interact frequently among the community members, while a few others interact less frequently.

REFERENCES

[1] Chayant Tantipathananandh, Tanya Berger-Wolf, David Kempe, “A Framework for community identification in dynamic social networks”, KDD '07, ACM, 2007

[2] Clara Pizzuti, “Community Detection in Social Networks with Genetic Algorithms”, GECCO’08, ACM, 2008

[3] Csardi G, Nepusz T, “The igraph software package for complex network research”, InterJournal, Complex Systems 1695, 2006. http://igraph.sf.net

[4] Jorg Reichardt, Stefan Bornholdt, “Statistical mechanics of community detection”, Physical Review, E74, 016110, 2006

[5] Jure Leskovec, Kevin J Lang, Anirban Dasgupta, Michael W. Mahoney, “Statistical properties of community structure in large social and information networks”, WWW 2008, ACM, pp 695-704, 2008

[6] Newman M. E. J, “Finding community structure in networks using the eigenvectors of matrices”, Physical Review, E74, 2006

[7] S. Gopal, “The evolving social geography of blogs”, in Societies and Cities in the Age of Instant Access, H. Miller, Ed. Berlin: Springer, pp. 275-294, 2007

[8] Shihua Zhang, Rui-Sheng Wang, Xiang-Sun Zhang, “Identification of overlapping community structure in complex networks using fuzzy c-means clustering”, Physica A374, pp 483-490, 2006

[9] Shuzi Niu, Daling Wang, Shi Feng, Ge Yu, “An Improved Spectral Clustering Algorithm for Community Discovery”, Ninth International Conference on Hybrid Intelligent Systems, IEEE, pp 262-267, 2009

[10] V. Batagelj, A. Mrvar: Pajek – Program for Large Network Analysis. Home page: http://vlado.fmf.uni-lj.si/pub/networks/pajek/