Selecting the Best Open Source Tools for Collecting and Visualizing Social Media Content
Muhammad Al-Qurishi, Mabrook Al-Rakhami, Majed Alrubaian
Abdullah Alarifi, Sk. Md. Mizanur Rahman, Atif Alamri
College of Computer and Information Sciences, Research Chair of Pervasive and Mobile Computing, King Saud University, Riyadh, Saudi Arabia
{qurishi, malrakhami, malrubaian.c}@ksu.edu.sa
College of Computer and Information Sciences, Research Chair of Pervasive and Mobile Computing, King Saud University, Riyadh, Saudi Arabia
{aalarifi, mizan, atif}@ksu.edu.sa
Abstract— Social media data and information have become a real concern for many parties, such as countries and organizations, for analysis, strategy, and decision-making purposes. Selecting the best open source tool for collecting and visualizing data from social networks on the web is a big challenge. In this paper, we propose an integrated approach to select the most appropriate tools for collecting and visualizing the information that is available on the social web. The proposed approach has two phases, namely evaluation and selection. The current practice of the evaluation process involves expert users' opinions of a tool, rating its importance based on user satisfaction, which is measured by how efficiently the tool produces the required results. The evaluation phase of the proposed technique combines this current practice with a weight of importance, where the latter is calculated by applying statistical methods. The selection phase is designed based on the well-known PROMETHEE technique. The results are presented as a ranked list of the selected libraries and frameworks/tools for both data collection and data visualization.

Keywords— Online Social Networks; Crawling; Visualizing
I. INTRODUCTION
Online Social Networks (OSNs) have become one of the largest sources of data. This source contains thoughts, discussions, and debates expressed in public social conversation, which can be used not only as an essential component of the decision-making process but also in industrial, academic, and political decisions. Facebook, YouTube, and Twitter are the most popular online social networks, and they are widely adopted by large segments of the population. Facebook announced 1.39 billion monthly active users in December 2014 [1], Twitter has more than 500 million users, and YouTube has more than one billion users [2]. A huge amount of user-generated content is created and shared continuously across these networks. There is a pressing need to collect this information from social networks and to store, analyze, and visualize it so that these parties can benefit from it in several respects. Collecting, storing, processing, analyzing, and visualizing online social network data are challenging tasks. Due to the huge amount of data in OSNs, it is very difficult to accomplish these tasks without an accompanying tool.
In this respect, several commercial as well as open source tools and libraries have been developed. These include:
• Crawling tools such as Storm, Tubekit, and Web-Harvest.
• Storing and processing tools such as the Hadoop Distributed File System (HDFS), Titan, and Neo4j.
• Analysis tools and libraries such as R and the Scalable C++ Machine Learning Library (MLPACK).
• Visualization tools such as Gephi, amCharts, and D3.

Existing research uses some of these tools and libraries to carry out experiments [3]. Most studies mention using a tool to crawl tweets [4], user profiles [5], or YouTube video attributes [6], but no previous work has considered the process of evaluating and selecting the best tool to manipulate OSN content. In this paper, we propose an integrated approach that first helps evaluate the existing candidate tools and libraries and then selects the best among them, i.e., the ones that fulfill the requirements and needs of users who collect, store, analyze, and visualize OSN content. We consider only the collecting and visualizing tools and libraries. We conducted a small-scale survey to evaluate the tools: twelve experts specialized in both social networks and CASE tools participated and helped determine the relative importance (weight) of each of the proposed criteria. The major contributions of this paper are as follows:
• We developed an approach to calculate scores based on the determined weight of importance and the current practice of each tool.
• We reviewed more than 25 open source tools and selected 12, according to their popularity, download rate, and update history.
• To the best of our knowledge, this paper is the first to propose an integrated approach for selecting the appropriate framework or tool for the collection and visualization processes.
The remainder of this paper is organized as follows. Section II describes related work; Section III presents our proposed approach in detail, together with the analysis and results; and finally, concluding remarks are made in Section IV.
II. RELATED WORKS
In this section, we review related work on tools and libraries that help collect data from online social media websites, as well as tools and libraries that help visualize such data.

A. Data collection tools and libraries

Data collection is the process of gathering data in a repository for the purpose of analysis, which leads to useful information or knowledge. Many tools, called "crawlers", can be used to collect data from social media applications. A crawler is a software program that traverses the WWW information space by following hypertext links and retrieving web documents via the standard HTTP protocol [7]. For the purpose of collecting and analyzing massive data to describe the connections between participants in social media networks, special ad-hoc, privacy-compliant crawlers have been developed [4]. The authors in [5] study and evaluate different crawling approaches and describe an application of hidden Markov models (HMMs); the HMM approach is used to indicate the paths leading to web content and its relevant metadata. The study in [6] provides a design of a crawler for OSNs, proposes countermeasures to the technical challenges of collecting data from these networks in practice, and provides visualized graphs of the crawled data.

In addition, social networks supply rich application programming interfaces (APIs). These APIs make it easy for web developers to build applications that collect data from online social networks, subject to some constraints and limitations. Generally, there are two types of APIs: the first is the request-response type, which provides interfaces that allow querying of the site; the second type sends filtered data to the requester as it appears on the site (streaming API). Digitally encoded data may be classified as public, semi-public, or private. Everyone can access public information, whereas private information is accessible only by its owner; semi-public information may be accessed by small groups of individuals who must be authenticated and authorized [7]. A lot of research has focused on analyzing data collected from OSNs using open source tools and libraries [8-12]. Table I shows the candidate tools and libraries for collecting data from social media.
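To make the two API styles concrete, the following minimal Python sketch uses Tweepy, one of the libraries listed in Table I; it assumes a Tweepy 3.x-style client (the library's interface has changed across versions), and the credentials and query terms are placeholders, not values from this study.

```python
import tweepy

# Placeholder credentials; real values come from a Twitter developer account.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

# Request-response (REST) style: query the site for existing content.
for tweet in tweepy.Cursor(api.search, q="open source", lang="en").items(50):
    print(tweet.created_at, tweet.text)

# Streaming style: receive filtered content as it is published.
class PrintListener(tweepy.StreamListener):
    def on_status(self, status):
        print(status.text)

    def on_error(self, status_code):
        return False  # stop on errors such as rate limiting

stream = tweepy.Stream(auth=api.auth, listener=PrintListener())
stream.filter(track=["open source"])  # blocks and prints matching tweets
```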
B. Data visualization tools and libraries

Data visualization is a technique for representing analyzed data in a communicative, clear, and efficient way. Data visualization offers a set of formats for displaying the required results, including dashboards, charts, radars, and gauges, which give users the ability to select whichever format or pattern is suitable and preferred. J. Peffer, S. Potter, and W. Elm created an organizational framework for the selection of visualization tools to assess the degree of decision support provided by tools within the intelligence analysis community. Their framework is designed to give the user the ability to select the visualization search criteria based on the Support Function Model (SFM), probe and inspect multiple visualizations, and refine the search process based on prior search history [13]. Although a large number of visualization tools have been proposed and designed, research on selecting the appropriate tool among them is very rare. This paper contributes a new approach for selecting the best tools for data crawling and visualization; to our knowledge, it is the first to provide an integrated approach to evaluate the tools based on their different features. Table II lists open source libraries and tools used to visualize data collected from social networks.

C. Multi-criteria decision making

Multi-criteria decision making (MCDM) techniques assist decision makers in analyzing a variety of tool selection criteria, evaluating CASE tool alternatives, and making the desired tool selections. Numerous MCDM techniques have been proposed for selection, such as the Analytical Hierarchy Process (AHP) [14] (Saaty, 1989; Saaty, 1990) and the Preference Ranking Organization METHod for Enrichment Evaluations (PROMETHEE). We chose PROMETHEE for the selection process for several reasons. Firstly, PROMETHEE is an appropriate technique when a limited number of alternatives (CASE tools) needs to be evaluated. Secondly, it can deal with qualitative and quantitative criteria at the same time. Thirdly, it handles selection and ranking problems efficiently and needs only a small amount of input from the decision maker. Finally, it has a broad application domain and has been successfully used for evaluation and selection in different application domains [15].

TABLE I. LIST OF OPEN SOURCE TOOLS AND LIBRARIES TO COLLECT DATA FROM SOCIAL MEDIA

Storm: A reliable and available service for efficiently collecting, aggregating, and moving large amounts of streaming data from multiple sources into the Hadoop Distributed File System (HDFS).
Tubekit: A YouTube crawling toolkit that provides the ability to construct a crawler for YouTube based on a set of seed queries.
Web-Harvest: A web crawling and extraction tool that collects desired web pages and extracts useful data from them.
Tweetinvi: A crawling and streaming library that provides ready-to-use classes to access the Twitter Stream API.
Tweepy: A Python library that enables accessing the Twitter API in an easy-to-use way.
Twitter4j: A Java library that enables crawling Twitter data via the API and integrating Java applications with the Twitter service.
TABLE II. LIST OF OPEN SOURCE TOOLS AND LIBRARIES TO VISUALIZE DATA COLLECTED FROM SOCIAL MEDIA

D3: Data-Driven Documents (D3) is a JavaScript data visualization library that enables constructing dynamic and interactive graphical forms.
Gephi: An open source interactive visualization toolkit to create and control all kinds of dynamic, hierarchical, and complex network graphs.
amCharts: A JavaScript-based library that provides easy-to-use charts to handle any data visualization need.
Prefuse: A set of software tools for constructing interactive and animated data visualizations, based on the Java and ActionScript languages.
Kap-lab: A set of visualization tools for creating visual charts and diagrams.
JUNG: A library and tool that provides an extendible language for the modeling, analysis, and visualization of data that can be represented as a graph or network.
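The paper itself does not include code for these tools; purely as an illustration of the kind of force-directed network-graph rendering that Gephi or JUNG produce, the following Python sketch draws a toy follower graph with networkx and matplotlib (the graph data and styling choices are assumptions, not results from this study).

```python
import networkx as nx
import matplotlib.pyplot as plt

# A toy follower graph standing in for crawled OSN data.
g = nx.DiGraph()
g.add_edges_from([
    ("alice", "bob"), ("bob", "carol"),
    ("carol", "alice"), ("dave", "alice"), ("dave", "bob"),
])

# Size nodes by in-degree as a simple importance cue.
sizes = [300 + 600 * g.in_degree(n) for n in g.nodes()]
pos = nx.spring_layout(g)  # force-directed layout, as in Gephi or JUNG
nx.draw_networkx(g, pos, node_size=sizes, node_color="#7fb3d5", arrows=True)
plt.axis("off")
plt.savefig("toy_network.png", dpi=150)
```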
III. PROPOSED APPROACH

The proposed approach to select the best tools and libraries consists of two main processes: an evaluation process and a selection process. The overall procedure is described in Figure 1.

A. The evaluation process

The evaluation process consists of four steps: (1) preparing an evaluation task, (2) identifying the candidate tools and libraries, (3) identifying and selecting evaluation criteria, and finally (4) evaluating the candidate tools and libraries. The first step is the preparation of an evaluation task, which considers user needs, assumptions, and constraints relevant to collecting and visualizing online social network data. The second step is to identify the candidate tools and libraries supporting online social network data crawling and visualizing; based on practical experience and our survey, we selected the tools listed in Tables I and II. The third step is to identify and select the evaluation criteria for the collecting and visualizing tools; Table III illustrates these criteria. The last step is to evaluate the candidate tools for the selection process.
TABLE III. EVALUATION CRITERIA FOR COLLECTING AND VISUALIZING TOOLS

Data collection features:
1. Scalability level of data collection (SL)
2. Request limitations (RL)
3. Usability of the tool configuration and installation (UC)
4. Reliability of collecting data through the tool (RD)
5. The ability to integrate with another system (IN)
6. Data distribution based on the request (DD)
7. Queue manager usage (QM)
8. Slang words considered in the indexes (SW)
9. Multi-language support (ML)
10. Programming language support (PL)
11. Limitation of query execution (LQ)

Data visualization features:
1. The ability of live monitoring (real-time data) (LM)
2. The ability to monitor the health of the logical nodes (HN)
3. 2D chart support (column, area, stack, etc.) (2D)
4. Map chart support (MC)
5. Network graph support (NG)
6. Motion chart support (radar, gauge) (MC)
7. The ability to export the result (EX)
8. The speed of displaying the result (performance) (SD)
9. Programming language support (PL)
10. Mobile version support (MV)
11. Online or daily report support through e-mail (DR)
13. Data navigation and manipulation easiness (DN)
B. The selection process

The selection is based on the results of the evaluation process, in order to select the best tools. With respect to user needs, we made two assessments of each element of both the collecting and visualizing tools: 1) the level of importance of the element; and 2) the current practice of the same element. For both, a five-level Likert scale was used, where the middle level ("level three") represents the average, with two levels above and two levels below. After finalizing the questionnaire and gathering the results, we calculated the differences between the importance and the current practice of each selected element. Equation (1) calculates the average importance weight of element i over a assessments:
W[i] = \frac{1}{a} \sum_{j=1}^{a} w[i, j]    (1)

where a refers to the number of experts who took part in the questionnaire, w refers to the importance weight, and c refers to the current practice. Equation (2) calculates the average current practice of element i:

C[i] = \frac{1}{a} \sum_{j=1}^{a} c[i, j]    (2)

Equation (3) calculates the collective weighted implementation indicator V for N elements:

V = \frac{\sum_{i=1}^{N} W[i] \cdot C[i]}{\sum_{i=1}^{N} W[i] \cdot 5} \times 100    (3)
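A minimal Python sketch of Equations (1)-(3) follows, assuming the expert ratings are arranged as an experts-by-elements matrix; the sample numbers are hypothetical, not the survey responses collected in this study.

```python
import numpy as np

def weighting_indicators(importance, practice):
    """Compute W[i] and C[i] (Eqs. 1-2), the per-element indicator
    v = C[i] / 5 * 100 reported in Table IV, and the collective
    weighted indicator V (Eq. 3).

    importance, practice: arrays of shape (a, N) holding the Likert
    ratings of a experts for N elements."""
    w_avg = importance.mean(axis=0)                 # Eq. (1)
    c_avg = practice.mean(axis=0)                   # Eq. (2)
    v_elem = c_avg / 5.0 * 100                      # per-element indicator
    v_total = (w_avg * c_avg).sum() / (w_avg.sum() * 5.0) * 100  # Eq. (3)
    return w_avg, c_avg, v_elem, v_total

# Hypothetical ratings from 3 experts on 2 elements (illustration only).
imp = np.array([[4, 3], [5, 3], [4, 2]])
cur = np.array([[4, 4], [4, 3], [5, 4]])
print(weighting_indicators(imp, cur))
```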
The last step is to extract the importance and current practice grades and obtain the highest and lowest scores for the proposed elements; for this we applied Equation (3). If the percentage is high, the element is well covered by the current tools; if the percentage is low, the element is important but not considered in current practice. The proposed method therefore provides a convergence indicator between the importance and the current practice, represented by V, as follows.

C. Weighting process results

Figures 2 and 4 illustrate the results obtained for the weights and current practices, while Figures 3 and 5 illustrate the result indicators for the collecting and visualization factors. Table IV shows the assessment results: the average importance weights, the current practice grades, and the result indicators. Regarding data visualization, the "Data navigation and manipulation easiness" element has the highest score for importance and current practice (92%), and the "The ability to monitor the health of the logical nodes" element has the lowest score (56%); the other elements fall between these scores. Data collection has two elements with the highest score, "Reliability of collecting data through the tool" and "Queue manager usage", both with a 92.5% convergence rate between importance and current practice. On the other hand, the lowest data collection score belongs to "Request limitations" at 67.5%, which means the current practice receives more consideration than the importance of this element warrants.

D. Preference Ranking Organisation METHod for Enrichment Evaluations (PROMETHEE)

PROMETHEE is one of the most well-known MCDM techniques for solving ranking and selection problems. It was developed and extended by Brans and Vincke in 1982 and has many versions. We implemented PROMETHEE II because it offers almost all of the features of the other PROMETHEE versions. This technique depends on the weight of each criterion and on the preferences of the decision-maker. The former assumes that the decision-maker has strong skills in assigning the relative importance of each criterion; the latter reflects the score obtained by comparing alternatives.
Figure 1. Evaluation and selection process model for selecting the appropriate open source tools to collect and visualize data
We first determine deviations based on pair-wise comparisons (Step 1), then apply the relevant preference function for each criterion (Step 2), and then calculate the global preference index (Step 3). After that, the positive and negative outranking flows for each alternative and the partial ranking are calculated (Step 4). Finally, the net outranking flow for each alternative is calculated. In Step 4, the positive outranking flow (known as the outgoing flow) φ+(a) describes the degree to which a dominates the other actions in A, where A is the set of actions to be ranked. The negative outranking flow (known as the incoming flow) φ−(a) represents the degree to which a is dominated. Using both flows, the two total preorders (P+, I+) and (P−, I−) are defined such that: a P+ b if φ+(a) > φ+(b); a I+ b if φ+(a) = φ+(b); a P− b if φ−(a) < φ−(b); a I− b if φ−(a) = φ−(b).
Consequently, the partial ranking is concluded as follows:
a outranks b if (a P+ b and a P− b), or (a P+ b and a I− b), or (a I+ b and a P− b);
a is indifferent to b if a I+ b and a I− b;
otherwise, a and b are incomparable.
The complete ranking is obtained by the following conditions:
a outranks b if φ(a) > φ(b);
a is indifferent to b if φ(a) = φ(b).

TABLE IV. ASSESSMENT RESULTS

Data Visualization
No.   w       c      v
1     4.375   4      80%
2     3.5     2.80   56%
3     4       4.40   88%
4     4       3.60   72%
5     3.625   4.20   84%
6     4.375   3.80   76%
7     4.625   4      80%
8     4.25    3.40   68%
9     3.375   3.40   68%
10    2.875   3.40   68%
11    3.75    3.20   64%
12    3.75    3.40   68%
13    4.25    4.60   92%

Data Collection
No.   w       c      v
1     3.875   4.375  87.5%
2     2.625   3.375  67.5%
3     4       3.875  77.5%
4     4.375   4.625  92.5%
5     4       4      80%
6     3.5     4.375  87.5%
7     3.5     4.625  92.5%
8     3.125   3.875  77.5%
9     3.875   4.25   85%
10    3.625   4      80%
11    3.25    4.125  82.5%
E. PROMETHEE result analysis

The corresponding criteria values for each alternative tool are based on the evaluation process as well as on the final responses of the participants; Table IV illustrates the weight and preference values for each criterion. In this study, we used the third form of preference function for all criteria, according to Equation (4). In order to find the outranking degree π(a, b), the weight of each criterion has to be assigned; usually the decision maker determines these weights. We used the weights derived in the weighting process above and assigned them to the PROMETHEE technique. Based on this information, the π(a, b) values are determined using Equation (5). We used the Visual PROMETHEE tool (http://visualpromethee.com/) to calculate these values, and Table V shows the final result. A preference function Pj(a, b) is usually represented by a function p(x), where f(a) and f(b) are the values of a particular criterion j for actions a and b, respectively:

p(x) \to [0, 1], \quad x = f(a) - f(b)    (4)

The preference index π(a, b) is then determined as:

\pi(a, b) = \frac{1}{\sum_{j=1}^{m} W_j} \sum_{j=1}^{m} W_j P_j(a, b)    (5)

This equation derives the preference of alternative a with regard to alternative b over all criteria, where m is the number of criteria, W_j is the weight of criterion j, and P_j(a, b) is the preference function for criterion j. In order to rank the alternatives by a partial preorder, we evaluate the outgoing flow as in Equation (6):

\varphi^{+}(a) = \sum_{x \in A} \pi(a, x)    (6)

where A is the set of all alternatives. The incoming flow is calculated using Equation (7):

\varphi^{-}(a) = \sum_{x \in A} \pi(x, a)    (7)

Finally, the net outranking flow is calculated as:

\varphi(a) = \varphi^{+}(a) - \varphi^{-}(a)    (8)

For OSN content collecting tools, from Fig. 3 we found that Storm is the best, followed by the Twitter4j library in second place and TweetINVI in third place; Tubekit was the last to be selected. The numerical data are also shown in Figure 2.
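To make the ranking computation concrete, below is a minimal Python sketch of PROMETHEE II over Equations (4)-(8). The tool names, criteria scores, and preference threshold are illustrative assumptions, not the values used in this study (which were computed with the Visual PROMETHEE tool).

```python
import numpy as np

def promethee_ii(scores, weights, p=1.0):
    """PROMETHEE II sketch.

    scores: (n_alternatives, n_criteria) matrix, higher is better.
    weights: length n_criteria; normalised as in Eq. (5).
    p: preference threshold of a V-shape (linear) preference function,
       standing in for the paper's 'third form'; its value is assumed.
    Returns the net outranking flows phi (Eq. 8)."""
    scores = np.asarray(scores, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()

    # Step 1: pairwise deviations d = f(a) - f(b) for every criterion.
    d = scores[:, None, :] - scores[None, :, :]
    # Step 2, Eq. (4): map deviations into [0, 1] with a V-shape function.
    pref = np.clip(d / p, 0.0, 1.0)
    # Step 3, Eq. (5): aggregated preference index pi(a, b).
    pi = (pref * w).sum(axis=2)

    # Step 4, Eqs. (6)-(7): outgoing and incoming flows (no normalisation,
    # following the paper's formulas; dividing by n-1 would not change ranks).
    phi_plus = pi.sum(axis=1)
    phi_minus = pi.sum(axis=0)
    return phi_plus - phi_minus            # Eq. (8)

# Hypothetical criteria scores for three collection tools (illustration only).
tools = ["Storm", "Twitter4j", "Tubekit"]
scores = [[4.5, 4.0, 3.5],
          [4.0, 4.5, 3.0],
          [2.5, 3.0, 2.0]]
weights = [3.875, 2.625, 4.0]  # e.g. importance weights W[i]
for tool, phi in sorted(zip(tools, promethee_ii(scores, weights)),
                        key=lambda t: -t[1]):
    print(tool, round(float(phi), 3))
```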
Figure 2. OSN Data Collecting Tool Partial and Complete Ranking Using PROMETHEE

Figure 3. Representation of the Preference Flows

Figure 4. OSN Visualizing Tools Representation of the Preference Flows

Figure 5. OSN Data Visualizing Tool Partial and Complete Ranking
For OSN content visualizing tools, from Fig. 5 we found that Gephi is the best, followed by the JUNG library in second place; the Prefuse library is in last place. Numerically, Figure 4 depicts the flow of the ranking output.
IV. CONCLUSION

Over the past few years, the usage of online social media by individuals, countries, and organizations for analysis, strategy, and decision-making purposes has increased notably. Selecting the best open source social media tool is one of the biggest challenges in data collection and visualization. In this paper, we proposed an integrated approach to select the most appropriate tools for the collection and visualization of information on the social web. The evaluation phase of the proposed technique combines current practice with statistical methods, and the selection phase is based on the well-known PROMETHEE technique with enhancements. The results show that the proposed approach is a promising technique for selecting tools for social media data analysis. As future work, we may expand this work to cover storage and mining/analysis tools for online social networks.
ACKNOWLEDGMENT

The project was supported by King Saud University, Deanship of Scientific Research, Research Chair of Pervasive and Mobile Computing.
REFERENCES

[1] Relations, I. (2014, January 28). Facebook Reports Fourth Quarter and Full Year 2014 Results. Retrieved from investor.fb.com: http://investor.fb.com/releasedetail.cfm?ReleaseID=893395
[2] Statista. (2015, December 1). Global social networks ranked by number of users 2015. Retrieved from www.statista.com: http://www.statista.com/statistics/272014/global-social-networks-ranked-by-number-of-users/
[3] More, J. S.; Lingam, C., "Reality mining based on Social Network Analysis," 2015 International Conference on Communication, Information & Computing Technology (ICCICT), pp. 1-6, 15-17 Jan. 2015. doi: 10.1109/ICCICT.2015.7045752
[4] Saroop, A.; Karnik, A., "Crawlers for social networks & structural analysis of Twitter," 2011 IEEE 5th International Conference on Internet Multimedia Systems Architecture and Application (IMSAA), pp. 1-8, 12-13 Dec. 2011. doi: 10.1109/IMSAA.2011.6156368
[5] Subercaze, J.; Gravier, C.; Laforest, F., "Towards an Expressive and Scalable Twitter's Users Profiles," 2013 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), vol. 1, pp. 101-108, 17-20 Nov. 2013. doi: 10.1109/WI-IAT.2013.15
[6] Chengqi Zhang; Wenqian Shang, "Discovery of video websites based on machine learning," 2012 International Conference on Computer Science and Information Processing (CSIP), pp. 1396-1399, 24-26 Aug. 2012. doi: 10.1109/CSIP.2012.6309124
[7] Cheong, F.-C., 1996. Internet agents: spiders, wanderers, brokers, and bots. Indianapolis, IN, USA: New Riders Publishing.
[8] Prochaska, J. J., Pechmann, C., Kim, R., Leonhardt, J. M., "Twitter = Quitter? An Analysis of Twitter Quit Smoking Social Networks," Tobacco Control, 2012;21(4):447-449. doi: 10.1136/tc.2010.042507
[9] Erik Tjong Kim Sang, Bos, J., 2012. "Predicting the 2011 Dutch Senate Election Results with Twitter," in Farzindar, A., Inkpen, D. (Eds.), Proceedings of the Workshop on Semantic Analysis in Social Media. Stroudsburg, PA, USA: Association for Computational Linguistics, pp. 53-60.
[10] Skoric, M., Poor, N., Achananuparp, P., Lim, E.-P., Jiang, J., 2012. "Tweets and Votes: A Study of the 2011 Singapore General Election," in 2012 45th Hawaii International Conference on System Science (HICSS), pp. 2583-2591.
[11] Doan, S., Ohno-Machado, L., Collier, N., 2012. "Enhancing Twitter Data Analysis with Simple Semantic Filtering: Example in Tracking Influenza-Like Illnesses," in Ohno-Machado, L., Jiang, X. (Eds.), 2012 IEEE Second International Conference on Healthcare Informatics, Imaging and Systems Biology (HISB). California, USA: IEEE Computer Society, pp. 62-71.
[12] Earle, P. S., Bowden, D. C., Guy, M., 2011. "Twitter earthquake detection: earthquake monitoring in a social world," Annals of Geophysics 54(6), 708-715.
[13] Peffer, J., Potter, S., and Elm, W., "Visualizing Visualizations: A Tool for the Selection of Visualization Support for Intelligence Analysis," Proceedings Compendium IEEE InfoVis 5 (2005): 39-40.
[14] Muhammad Al-Qurishi, Mabrook Al-Rakhami, Fattoh Al-Qershi, et al., "A Framework for Cloud-Based Healthcare Services to Monitor Noncommunicable Diseases Patient," International Journal of Distributed Sensor Networks, vol. 2015, Article ID 985629, 11 pages, 2015. doi: 10.1155/2015/985629
[15] Behzadian, M., Kazemadeh, R. B., Albadvi, A., Aghdasi, M. (2010). "PROMETHEE: A comprehensive literature review on methodologies and applications," European Journal of Operational Research 200:198-215. doi: 10.1016/j.ejor.2009.01.021