Generic Social Network Data Crawler Using Attributed Graph
Rinta Kridalukmana
Department of Computer System, Faculty of Engineering, Diponegoro University
Semarang, Indonesia
Abstract—The growing number of active users who share information and interact with others on online social networks has produced a large amount of data that can serve as research material for various purposes. Data crawling is therefore critical as the first gateway to the information in a social network. This study aims to develop data crawler software that uses the attributed graph approach to store information from an online social network and, at the same time, uses a simple algorithm for node and edge detection in order to illustrate the interrelation of objects. The results show that information can be obtained more effectively with the developed data crawler and can be visualized with the assistance of the Gephi software.
Keywords—online social network; data crawler; social network analysis; gephi; attributed graph
I. INTRODUCTION
Online social networks (OSNs) such as Facebook and Twitter have recently grown rapidly, both in the number of users and in the purposes for which they are used. This growth is inseparable from their function of facilitating interaction among users, both as individuals and as groups, in obtaining support or responses from other parties. In addition, social network media play an important role in facilitating information sharing among users and thereby support the formation of relations between one user and another. The growing volume of information and relations that users distribute through social media has positioned online social networks as a source of information. This information can be used for various studies, such as research on marketing, individual profiles, community profiles [8] [1], community detection [3] [12] [13], added-value services [14], and other studies.
To access this information, the online social network data must be crawled through the available application programming interfaces (APIs). A common approach is to crawl the network per user: a user is chosen at random and a list of his or her friends is downloaded [7]. One kind of information that can be obtained from an online social network is community information formed from the interrelation of individual users. Building such a profile automatically is helpful for a number of subsequent analytical tasks, such as helping users to visualize and organize social communications, classifying new messages into corresponding focus areas, and prioritizing new messages by the user's activeness or relevance in that area [8].
Although there have been a large number of social network publications, few have been dedicated to the data collection process [11]. Hence, this research was designed to explore how to crawl online social network data. The online social network used in this research is Facebook, and the crawled data were tested using the Gephi software to carry out community detection.
II. RELATED WORK
Several studies on crawling social network data have been carried out, covering both techniques/algorithms and their implementation. Among the algorithmic studies, Chau et al. [15] [10] [9] exploited Breadth-First Search (BFS) algorithms and illustrated the use of parallel crawlers to crawl eBay profiles efficiently. Mislove et al. [20] analyzed the graphs of a number of popular online social networks, including Flickr, YouTube, LiveJournal, and Orkut. Their analysis confirmed the scalability properties of online social networks, such as a power-law degree distribution, a densely connected core, and small average path length. Other researchers developed software, addressing either its design or its implementation: Wong et al. [11] developed a crawler design, and Gjoka et al. [16] offered practical recommendations for crawling social network data. Blenn et al. [7] developed software to crawl a network community-wise and detect communities at the same time. The contribution of this research is software that crawls social network data using a simple approach: it implements the concept of an attributed graph and detects the friendship relations among users (nodes) in the social network data as relations among nodes (edges).
III. LITERATURE
A. Representing Social Network Entities in Attributed Graph
A graph is a way of specifying relationships among a collection of items. It consists of a set of objects, called nodes, with certain pairs of these objects connected by links called edges [18]. Figure 1(a) shows four nodes (A, B, C, and D) connected by edges: A is connected with B; B is connected with C, D, and A; and C is connected with D. Two nodes are called neighbors if they are connected by an edge. A relation with this characteristic is called symmetric. To show an asymmetric relation, for example that A points to B but not vice versa, a directed graph can be used, which consists of a collection of nodes with directed edges whose direction matters [18] (Figure 1(b)). A graph in which the interrelation of nodes does not emphasize direction is accordingly called an undirected graph.
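The distinction between the two graph types can be illustrated with a minimal Python sketch. The undirected edge list follows the description of Figure 1(a) in the text; for the directed case only the single example "A points to B" is used, because the full edge set of Figure 1(b) is not given in the text. The helper name `build_adjacency` is hypothetical.

```python
from collections import defaultdict

def build_adjacency(edges, directed=False):
    """Return a dict mapping each node to the set of nodes it links to."""
    adjacency = defaultdict(set)
    for source, target in edges:
        adjacency[source].add(target)
        adjacency[target]  # register the target even if it has no outgoing edge
        if not directed:
            # A symmetric relation: the edge is stored in both directions.
            adjacency[target].add(source)
    return dict(adjacency)

# Edges of Fig. 1(a) as described in the text: A-B, B-C, B-D, C-D.
undirected = build_adjacency([("A", "B"), ("B", "C"), ("B", "D"), ("C", "D")])

# Asymmetric example from the text: A points to B, but not vice versa.
directed = build_adjacency([("A", "B")], directed=True)

print(sorted(undirected["B"]))   # ['A', 'C', 'D']
print(sorted(directed["B"]))     # []
```

In the undirected case B lists A as a neighbor and A lists B; in the directed case only A lists B.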
Fig. 1. (a) Undirected graph with 4 nodes (b) Directed graph with 4 nodes [18]
Online social network data can be modelled as a graph [10], which can later be developed into an attributed graph [19][17]. Entities are converted to nodes in the graph, and the nodes can have different types, e.g., people, organizations, and events. An interrelation between nodes, for example between a node labelled Bob and a node labelled Fred (Knows(Bob, Fred)), is converted into a directed edge that has the Knows attribute. To describe a bidirectional relation (Knows(a, b) implies Knows(b, a)), an undirected edge between the two nodes can be used (see Figure 2). The next step in forming the attributed graph is to convert the attributes of the entities into node attributes.
Fig. 2. Attributed graph showing the unidirectional Knows relationship between Bob and both John and Fred, and the unidirectional relationship between John and Tom [19]
B. Application Programming Interface (API)
An understanding of application programming interfaces (APIs) is important for grasping the essence of application interfaces. An API is a well-defined mechanism developed to connect to an available resource such as an application server, middleware layer, or database (Figure 3) [5]. An API enables a developer to use the available entities to obtain the values of those entities. For example, accessing information about customers requires an API that mediates access to the customer database. The purposes of making an API are:
1) To provide access to the data and business processes without any user interface
2) To provide a mechanism that enables information to be shared
Application programming interfaces that can be used to access business processes and data can be categorized as follows:
1) Full service interface: provides all services for accessing the business, object, and data services
2) Limited service interface: usually allows access to only one of the levels among the business, object, or data services
3) Controlled service interface: provides a minimum service to logic and data
Online social networks commonly also provide facilities that let other applications, such as e-commerce websites, communicate through an API. This service is defined as a web-based service that enables other parties to construct a user profile from either public or semi-public data within certain system limits [2]; thus, it can be categorized as a limited service interface. The Facebook API covers a broad range of functions and enables application developers to access data related to the information downloaded by the user and the interrelation of the users [6]. Through this API, social network data can be crawled, although the scope of accessible data depends on the policy and privacy rules of each social network service provider. In addition, APIs can also be classified into several categories depending on the data abstraction to be described (see Table 1). From Table 1, it can be seen that Facebook, as the research object of this paper, implements a web service as its API and uses the JSON (JavaScript Object Notation) format for the response returned when data are requested.
TABLE 1 – CATEGORY OF API [2]
Category of API | Example
Operating System | API for MS Windows, API for Apple Mac OS
Programming Language | Java API
Application Service | API for SAP
Service Infrastructure | Amazon web service API
Web Service | Twitter API, Facebook API
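Because the Facebook API responds in JSON, turning a response into attributed nodes mostly amounts to decoding the text and keeping the fields of interest. The following is a minimal sketch only: the payload and its field names (`data`, `id`, `name`, `gender`, `locale`) are assumptions made for illustration, not a documented response, and `decode_nodes` is a hypothetical helper.

```python
import json

# Hypothetical response body; the structure and field names are assumed.
sample_response = """
{
  "data": [
    {"id": "1001", "name": "Bob", "gender": "male", "locale": "en_GB"},
    {"id": "1002", "name": "Fred", "gender": "male", "locale": "id_ID"}
  ]
}
"""

def decode_nodes(response_text):
    """Map each entity ID to a dict holding its remaining attributes."""
    payload = json.loads(response_text)
    nodes = {}
    for entry in payload.get("data", []):
        node_id = entry["id"]
        # Every field other than the ID becomes a node attribute.
        nodes[node_id] = {key: value for key, value in entry.items() if key != "id"}
    return nodes

nodes = decode_nodes(sample_response)
print(nodes["1001"]["name"])  # Bob
```

Keying the result by ID mirrors the text's point that each entity becomes one node whose attributes are taken from the entity.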
C. Social Network Data Crawling Process
As social network data can be modeled as a graph, the crawling process can be naturally divided into crawling nodes and crawling edges [10]:
• Crawling nodes: There is often extra information associated with a node, such as personal information (profile), photos, posts, and a list of friends. The crawler must crawl a node to collect such information, whereas the crawler may become aware of the existence of a node without having crawled it.
• Crawling edges: While we distinguish between crawled and seen nodes, we make no such distinction between crawled and seen edges. First, most OSNs provide few edge attributes, such as when the edge was created and how a node categorizes the edge (friends, classmates, etc.). Second, these attributes often come directly with the list of multiple edges instead of requiring the crawler to ask for a particular edge. Therefore, in many cases there are either no edge attributes to crawl or no way to crawl a specific edge.
This automated process of downloading users typically requires a robot (a software program) to look at the profile page of a user and store the names of all friends, which usually takes between 0.1 and 2 seconds because it involves multiple HTTP requests to a server to iterate through the whole list of friends [7]. An optimistic calculation shows that with one crawling computer, obtaining LinkedIn's database of 120 million users (as of Nov. 2011) would take approximately half a year. The same calculation for Facebook's dataset of 650 million users leads to a crawling time of about 2 years. Massively parallel crawling techniques can decrease those times. Clearly, by the time the last records have been obtained, much of the retrieved information will be outdated [7].
IV. CRAWLER DESIGN AND IMPLEMENTATION
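As a quick check of the crawl-time estimates quoted from [7] at the end of Section III, the arithmetic can be reproduced in a few lines. The sketch assumes that the "optimistic calculation" uses the lower bound of 0.1 seconds per profile and a single sequential crawler; the source does not state the per-profile time it used, so this is an inference, and `crawl_days` is a hypothetical helper.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def crawl_days(user_count, seconds_per_profile):
    """Days needed to visit every profile once with one sequential crawler."""
    return user_count * seconds_per_profile / SECONDS_PER_DAY

# Assumed optimistic cost: 0.1 s per profile (lower bound of the 0.1-2 s range).
linkedin_days = crawl_days(120_000_000, 0.1)
facebook_days = crawl_days(650_000_000, 0.1)

print(round(linkedin_days))             # 139 (days, a little under half a year)
print(round(facebook_days / 365, 1))    # 2.1 (years)
```

Under this assumption the results are consistent with the quoted figures of roughly half a year and about two years.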
A. Crawler Algorithm
In this research, the social network data crawler starts from an initial node, which is a Facebook user. This user is treated as an entity, so the attributes of the user entity are converted into an attributed node. In Facebook, a user is viewed as an object and is connected to other objects, such as other Facebook users, Facebook pages, and photographs. Each object has an identification number (ID). The next step is to look up the Facebook user objects connected to the initial node, which yields a friend list. This friend list is stored in an array. The friend list is then examined element by element to determine whether a friendship relation exists between one node and another node in the list. If a node has a relation with another node, a bidirectional edge is generated and implemented as an undirected edge. Because the list of objects may contain objects that are not Facebook users, the friendship check returns false for such objects, and they are automatically eliminated from the list of nodes. The data crawler algorithm is given in pseudocode in Algorithm 1 below:
Algorithm 1: Data Crawler Algorithm
Create friendlist array
From starting node, crawl Facebook user related to starting node
Decode crawl result
Create node and store to friendlist
If friendlist is not empty then
    For all elements as x in friendlist do
        For all elements as y in friendlist do
            If isFriend(x,y) then
                Create edge(x,y)
            End if
        End for
    End for
End if
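The edge-detection part of Algorithm 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not the author's PHP implementation: `is_friend` is a hypothetical callable standing in for the friendship check against the API, the friend list is supplied in memory rather than crawled, and the sketch visits each unordered pair once (instead of the full nested loop) so that an undirected edge is not created twice. The removal of non-user objects described in the text is omitted.

```python
from itertools import combinations

def build_friend_graph(friend_list, is_friend):
    """Return (nodes, edges) for the friends of a starting node.

    friend_list: list of dicts, each with at least an "id" key.
    is_friend:   hypothetical callable (id_a, id_b) -> bool.
    """
    # Each friend becomes one attributed node, keyed by its ID.
    nodes = {friend["id"]: friend for friend in friend_list}
    edges = []
    # Visit every unordered pair of friends exactly once.
    for x, y in combinations(nodes, 2):
        if is_friend(x, y):
            edges.append((x, y))
    return nodes, edges

# Usage with an in-memory stand-in for the friendship check.
known = {frozenset(("1", "2")), frozenset(("2", "3"))}
friends = [
    {"id": "1", "name": "Bob"},
    {"id": "2", "name": "Fred"},
    {"id": "3", "name": "Tom"},
]
nodes, edges = build_friend_graph(friends, lambda a, b: frozenset((a, b)) in known)
print(edges)  # [('1', '2'), ('2', '3')]
```

Passing the friendship check in as a callable keeps the pairing logic testable without network access.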
B. Architecture
The architecture of the software used in this research is shown in Figure 3 below.
Fig. 3. System Architecture
The input parameter is the information about the initial node that serves as the starting point for crawling, in this case a Facebook user account. The data crawler then calls the Facebook server through the Facebook API to collect information about the initial node, after first passing the authorization process. The information contains the public profile (such as name, gender, email, and locale), the friend list, and the relations among friends. The Facebook API responds in JSON format, which is then processed into a GDF file so that it can be used by the Gephi software for testing purposes.
C. Attributed Graph Data Structure Design
In general, the declaration of an attributed graph covers two definitions that must be determined: the definition of a node and the definition of an edge, both of which are adapted to the GDF file format. In the node data structure, the primitive data type used is VARCHAR, which stores the node identity (the ID of the Facebook user and of the other users who have a friendship with the initial node), name, gender, and location. This information forms the attributes of each node. The basic form of the node definition and edge definition in GDF format is as follows:
nodedef>name VARCHAR, label VARCHAR, sex VARCHAR, locale VARCHAR
edgedef>node1 VARCHAR, node2 VARCHAR
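A minimal sketch of writing such a file is given below. The two header lines are copied from the definitions above; everything else is assumed for illustration: `write_gdf` is a hypothetical helper, the attribute keys `name`, `gender`, and `locale` are illustrative, and the values are assumed to contain no commas or quotes, so no escaping is done.

```python
import io

NODE_HEADER = "nodedef>name VARCHAR, label VARCHAR, sex VARCHAR, locale VARCHAR"
EDGE_HEADER = "edgedef>node1 VARCHAR, node2 VARCHAR"

def write_gdf(nodes, edges, stream):
    """Write nodes (ID -> attribute dict) and edges (ID pairs) as GDF text."""
    stream.write(NODE_HEADER + "\n")
    for node_id, attrs in nodes.items():
        # Column order follows the header: name (ID), label, sex, locale.
        row = [
            node_id,
            attrs.get("name", ""),
            attrs.get("gender", ""),
            attrs.get("locale", ""),
        ]
        stream.write(",".join(row) + "\n")
    stream.write(EDGE_HEADER + "\n")
    for node1, node2 in edges:
        stream.write(f"{node1},{node2}\n")

# Usage: write to an in-memory buffer instead of a file.
buffer = io.StringIO()
write_gdf(
    {"1": {"name": "Bob", "gender": "male", "locale": "en_GB"},
     "2": {"name": "Fred", "gender": "male", "locale": "id_ID"}},
    [("1", "2")],
    buffer,
)
print(buffer.getvalue())
```

Writing to any text stream means the same function works for a file opened with `open(path, "w")`.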
D. Dealing with Facebook Graph API
In this research, the Facebook PHP SDK was used to support the connection with the Facebook Graph API. The classes used from the Facebook PHP SDK include [4]:
• Facebook/FacebookSDKException
• Facebook/FacebookSession
• Facebook/FacebookRequest
• Facebook/GraphObject
• Facebook/GraphUser
• Facebook/FacebookRequestException
• Facebook/FacebookRedirectLoginHelper
• Facebook/Entities/AccessToken
• Facebook/HttpClients/FacebookCurl
• Facebook/HttpClients/FacebookHttpable
• Facebook/HttpClients/FacebookCurlHttpClient
• Facebook/HttpClients/FacebookStreamHttpClient
• Facebook/HttpClients/FacebookStream
• Facebook/FacebookResponse
• Facebook/FacebookAuthorizationException
• Facebook/GraphSessionInfo
E. Visual Online Social Network Graph Result
In this research, the Facebook user ID used as the initial node was the user with ID 1187452***. The attributes of each node were used as the basis for community detection, which was visualized with the Gephi software. The result of visualizing community detection based on the gender attribute is presented in Figure 4, and the result based on the locale attribute in Figure 5.
Fig. 5. Graph of the result of community detection based on locale
In Figure 4, nodes in light blue represent the male community of Facebook users associated with the initial node, while nodes in red represent the female community. In the locale-based community detection graph shown in Figure 5, nodes in red represent Facebook users who chose British English as their locale, green represents the Indonesian locale, and blue represents the United States locale. The name attribute of each node is also attached in both Figures 4 and 5 to display the name of each node.
F. Method Evaluation
To evaluate our method, it is compared with the research of Wong et al. (2014) [11] and Catanese et al. (2011) [9], as shown in Table 2 below.
TABLE 2 – METHOD COMPARISON
Method | Limitation | Crawl Characteristics
Wong et al. | 20 friends/load | Crawling initial node friendlist, accessing public information profile
Catanese et al. | 400 friends/load | Crawling initial node friendlist, accessing public information profile for each friendlist
Our method | 1000 friends/load tested | Crawling initial node friendlist and their profile, check interrelation between nodes to create edge
From Table 2 we can see that our method can crawl up to 1000 friends per load. Because we check the interrelation between nodes in the graph, our method offers more functionality than the others. In the other methods, edges are created only between the initial node and its friends, whereas our method also creates edges for friendships between friends of the initial node. However, this friendship checking can decrease crawling performance.
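The performance cost mentioned above can be made concrete by counting friendship checks. The sketch below assumes one `isFriend` evaluation per pair and says nothing about how long a single check takes; `friendship_checks` is a hypothetical helper. It compares the nested loops of Algorithm 1 as written (every ordered pair, including a node with itself) against visiting each unordered pair once.

```python
from math import comb

def friendship_checks(friend_count, ordered=True):
    """Number of isFriend evaluations for a friend list of the given size."""
    if ordered:
        # Nested loops over the full list: n * n evaluations.
        return friend_count * friend_count
    # Each unordered pair once: n * (n - 1) / 2 evaluations.
    return comb(friend_count, 2)

print(friendship_checks(1000))                 # 1000000
print(friendship_checks(1000, ordered=False))  # 499500
```

For the 1000-friend load in Table 2, the number of checks grows quadratically either way, which is consistent with the observed slowdown.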
Fig. 4. Graph of the result of community detection based on gender
V. CONCLUSION
From the results of this research it can be concluded that crawling social network data is very important for research related to social network analysis. Given the large amount of information to be crawled, an efficient and effective approach is required to lighten the processor's workload. Using the algorithm, we built an attributed graph that can be used directly and that, at the same time, detects the interrelation of nodes as edges. This makes the data crawler we developed more effective. In addition, the attributed graph produced by our data crawler can be used further for research on both community detection and community profiling. However, limitations will often be encountered when accessing more detailed information about the users of an online social network, because such access is governed by the regulations defined by the service provider.
ACKNOWLEDGMENT
I would like to acknowledge the Software Engineering Laboratory in the Department of Computer System, Diponegoro University, for providing the IT infrastructure needed to conduct preliminary tests on online social network application programming interfaces.
REFERENCES
[1] Archana, Ch., Supreethi, K.P. (2014) “Community Profiling for Social Networks”, International Journal of Research in Engineering and Technology, Vol. 3, Issue 3, pp. 124-127.
[2] Arun, K., Nayagam, M.G. (2014) “Building Application with Social Networking API’s”, International Journal Advanced Networking and Applications, Vol. 5, Issue 5, pp. 2070-2075.
[3] Du, N., Wu, B., Pei, X., Wang, B., Xu, L. (2007) “Community Detection in Large-Scale Social Networks”, Joint 9th WEBKDD and 1st SNA-KDD Workshop ’07 (WebKDD/SNA-KDD ’07), August 12, 2007, San Jose, California, USA.
[4] Facebook, Inc. (2014) “Graph API Overview”, Dec. 19, 2014, https://developers.facebook.com/docs/graphapi/overview/?locale=id_ID
[5] Linthicum, D.S. (1999) “Enterprise Application Integration”, Addison Wesley, ISBN 0-201-61583-5.
[6] Tiago, E., Zaidman, A., Gross, H. (2014) “Web API Growing Pains: Stories from Client Developers and Their Code”, Software Maintenance, Reengineering and Reverse Engineering (CSMR-WCRE), 2014 Software Evolution Week, IEEE Conference on, IEEE, 2014.
[7] Blenn, N., Doerr, C., Kester, B.V., Mieghem, P.V. (2012) “Crawling and Detecting Community Structure in Online Social Networks using Local Information”, 11th International IFIP TC 6 Networking Conference, Prague, Czech Republic, May 21-25, 2012, Proceedings, Part I.
[8] Zhou, W., Jin, H., Liu, H. (2012) “Community Discovery and Profiling with Social Messages”, KDD’12, August 12-16, Beijing, China.
[9] Catanese, S.A., Meo, P.D., Ferrara, E., Fiumara, G., Provetti, A. (2011) “Crawling Facebook for Social Network Analysis Purposes”, WIMS’11, May 25-27, 2011, Sogndal, Norway.
[10] Ye, S., Lang, J., Wu, F. (2010) “Crawling Online Social Graphs”, Web Conference (APWEB), 6-8 April 2010, 12th International Asia-Pacific, pp. 236-242, DOI: 10.1109/APWeb.2010.10
[11] Wong, C., Wong, K., Ng, K., Fan, W., Yeung, K. (2014) “Design of a Crawler for Online Social Networks Analysis”, WSEAS Transactions on Communications, Vol. 3, pp. 264-274.
[12] Chen, J., Zaiane, O.R., Goebel, R. (2010) “Detecting Communities in Social Networks using Local Information”, From Sociology to Computing in Social Networks, Part IV, pp. 197-214, Springer-Verlag Wien, DOI: 10.1007/978-3-7091-0294-7_11
[13] Tang, S., Blenn, N., Doerr, C., Mieghem, P.V. (2011) “Digging in the Digg Social News Website”, IEEE Transactions on Multimedia, Vol. 13, No. 5, pp. 1163-1175.
[14] Zigkolis, C., Kompatsiaris, Y., Vakali, A. (2009) “Information Analysis in Mobile Social Networks for Added-Value Services”, The W3C Workshop on the Future of Social Networking, Barcelona, 2009.
[15] Chau, D.H., Pandit, S., Wang, S., Faloutsos, C. (2007) “Parallel Crawling for Online Social Networks”, WWW’07: Proceedings of the 16th International Conference on World Wide Web, 2007, pp. 1283-1284.
[16] Gjoka, M., Kurant, M., Butts, C.T., Markopoulou, A. (2011) “Practical Recommendations on Crawling Online Social Networks”, IEEE Journal on Selected Areas in Communications, Vol. 29, No. 9, October 2011, pp. 1872-1892.
[17] Mislove, A., Viswanath, B., Gummadi, K.P., Druschel, P. (2010) “You Are Who You Know: Inferring User Profiles in Online Social Networks”, WSDM’10, February 4-6, 2010, New York City, New York, USA.
[18] Easley, D., Kleinberg, J. (2010) “Networks, Crowds, and Markets: Reasoning about a Highly Connected World”, Cambridge University Press.
[19] Campbell, W.M., Dagli, C.K., Weinstein, C.J. (2013) “Social Network Analysis with Content and Graphs”, Lincoln Laboratory Journal, Vol. 20, No. 1, 2013.
[20] Mislove, A., Marcon, M., Gummadi, K.P. (2007) “Measurement and Analysis of Online Social Networks”, IMC ’07: Proceedings of the Seventh ACM SIGCOMM Conference on Internet Measurement, pp. 29-42.