A Review on the Development and Use of Open Source Data Science Tools
Angelito O. Carbonell
Saint Louis University, Baguio City, Philippines
[email protected]
ABSTRACT
Data science is a rapidly emerging need in every institution. The vastness of data in terms of content, format, and scope requires capable algorithms and tools to collate, process, and present new knowledge. The advancement of these tools requires greater collaboration among developers, data scientists, and the general public. The collaboration and integration of ways to improve these tools can leverage the advantages of open source development. Open source development makes the source code of software available with no monetary obligation on the part of the community. Reviewing the characteristics of these tools presents opportunities to improve them and provides guidance for future plans. These characteristics can be drawn from conducted studies. As time and development progress, and as the range and scope of the data and functions these tools accumulate grow, new characteristics emerge that can be used to evaluate these tools. This study presents new characteristics for evaluating open source data science tools, with the goal of providing an overview of the development and use of these tools. The success of open source data science tools is driven by the continuity of their development, their defensibility in terms of studies conducted, and their interoperability with other tools.
1. INTRODUCTION
Data is increasingly cheap and ubiquitous [1]. Data is processed in every corner of any organization. The impact of data on an organization lies not only in how much data is stored but, more recently, in how these data can influence the progress of tomorrow. Businesses today are accumulating new data at a rate that exceeds their capacity to extract value from it [1]. Data gathered includes all available information on each business and operational transaction. These metadata are used in ways that go beyond knowledge about the transactions themselves and can also include performance metrics and behavioral information. Such information can be used to understand and predict phenomena that may not directly involve the transactions conducted. Data science is the study of the generalizable extraction of knowledge from data [2]. It is the extraction of learning from substantial volumes of information that are unorganized or
unstructured, a continuation of the field of data mining and predictive analytics, also known as knowledge discovery and data mining [3]. Data science was initially seen as an emerging field covering the study of the capture of data, their analysis, metadata, fast retrieval, archiving, exchange, mining to find unexpected knowledge and data relationships, visualization in two or three dimensions including movement, and management [4]. Given the state of data science, data scientists fashion their own tools and even conduct academic-style research. A data scientist's most basic, universal skill is the ability to write code and to communicate in language that all stakeholders understand [5]. Data science is distinct from other data-driven fields. A quantitative analyst can be great at analyzing data but not at subduing a mass of unstructured data and getting it into a form in which it can be analyzed. A data management expert might be great at generating and organizing data in structured form but not at turning unstructured data into structured data, and also not at actually analyzing the data. And while people without strong social skills might thrive in traditional data professions, data scientists must have such skills to be effective [5]. Data science tools are products of data science, with the goal of educating, helping, integrating, and covering the need for more analysis of data. From one analysis output to another, new knowledge is born. From this new knowledge, a new perspective will shape the future of business. In the periodical discussion reports on database research, open source as a research agenda is barely mentioned. The Lowell Report [6] does not mention open source at all, not even the term. The Claremont Report [7] mentions the accelerated adoption of DBMSs and query languages driven by the maturation of open source systems.
Open source development has become very popular in recent times and is increasingly supported by big companies, research institutes, and universities. The major benefit of open source development is that it gets a wide range of support from all over the world for better development of the code, quick bug fixes, and better customization. Institutes and companies get involved in this development because of the wide range of future development possibilities, free availability, and low cost of ownership [8]. The four biggest software companies ordered by their software revenues, Microsoft, IBM, Oracle, and SAP, are all explicitly
increasing their support for open source projects. Microsoft contributed roughly 20,000 lines of code to Linux; IBM is increasingly supporting the Eclipse project; Oracle's MySQL database management system already has a very large market share; and SAP developed its own open database system, as well as actively contributing to the Eclipse development environment with "Strategic Developer" status on the project [8]. Digital information networks have become a globally accessible space for meaningful social interaction and collaboration. The Internet has enabled people to create and innovate together online, regardless of their physical location, geographical boundaries, or cultural backgrounds. These conditions are used to exceptionally good creative advantage by online open-source software development (OSSD) projects and communities. OSSD communities share a collaborative effort and approach to software development, but even in such information-rich and openly collaborative surroundings, implementing design solutions is very challenging. Various design approaches have been attempted with inconsistent success [9]. This study reviews related studies on the comparison and evaluation of data science tools. Characteristics used in those evaluations were synthesized to provide an overview of open source data science tools. This study will provide other researchers with a perspective on how open source data science tools are used and how their development is progressing. Considering open source data science tools as a target focus will also guide future data scientists, developers, and researchers in contributing to others in the field of data science. Collaborative development of tools by a community of enthusiasts will further increase their contribution to the goal of solving pieces of raw-data-to-knowledge puzzles, allowing seamless integration and easier use by both lay and expert users [10].
Studies that evaluated data science tools were reviewed. Listed open source data science tools were collated as objects of review in this study. Despite the many factors that compose the definition of open source [12], this study considered the following in its definition and scope:
a. Availability of the source code
b. Availability of the software for download
c. Not for sale
This study used the above characteristics in its search, review, and evaluation of open source data science tools. Furthermore, the study delimits its scope to data science tools that perform the three processes presented in the study entitled "Challenges in Data Science: A Comprehensive Study on Application and Future Trends" [3], as follows: a) collection and preparation of data (data wrangling and munging); b) data analysis and reflection to interpret the output; and c) dissemination of the results in the form of written reports and/or executable code. This study is delimited to data science tools that comply with the definition of open source. Characteristics used in related studies evaluating data science tools were examined to come up with a simplified list of characteristics for use in this study. The different data science tools are then examined based on the characteristics formulated. The evaluation is tabulated, providing an overview of the development and use of open source data science tools.
2. METHODOLOGY
The study uses a descriptive, theoretical approach to research. Theoretical research is characterized as a non-empirical type of research that provides an overview of innovative research in a particular field within or related to the focus and scope of the study [11]. The search for published literature started from the result pages on Google Scholar for the terms "open source data science tools". The result pages led to research and library websites such as ResearchGate.net, Academia.edu, the Japan Science and Technology Information Aggregator, Springer Link, the Internet Archive, and other freely accessible research/article repositories. Any further references needed were subsequently searched on Google Scholar. Studies published after 2010 were preferred; however, earlier studies were also used as references where they contributed to the formulation and realization of this study.

3. REVIEW AND DISCUSSION
3.1 Open Source Data Science
Data science is the core that drives new research in many areas, from the environment to the social sciences. Among the grand challenges in data science, one typical aspect is how to build the next generation of learning and analytical theories and systems to deeply and completely understand heterogeneous and interdependent large-scale data from multiple relevant resources for real-time decision making [13]. Data science has evolved from being used in the relatively narrow field of statistics and analytics to being a universal presence in all areas of science and industry [3]. The Beckman Report [10] presented the challenges of big data as follows: scalable big/fast data infrastructures; coping with diversity in the data management landscape; end-to-end processing and understanding of data; cloud services; and the roles of people in the data life cycle. A large influence in addressing these challenges is faster, more cohesive, diverse, and community-driven software development. Even large companies use open source development to hasten their approaches to these challenges. Pal et al. [3] presented the principal areas of application and research where data science is currently used and is at the forefront
of innovation: business analytics, prediction, security, computer vision, natural language processing, bioinformatics, science and research, revenue management, and government. The success of open source as a development methodology has been proven over decades. The likes of GNU/Linux, Apache, BSD, MySQL, OpenOffice, and others clearly demonstrate that open source software can be as robust as, or even more robust than, commercial and closed source software. De Moraes and Martinez [14] highlight the relevance of computational intelligence to the development of data science, because the latter makes use of the tools and techniques of the former, mainly for dealing with imprecision, uncertainty, learning, and evolution in posing and solving computational problems. Chen, Williams, and Xu [15] surveyed 12 popular open source data mining systems available on the Internet. The evaluation compared their characteristics, data access functionality, data mining functionality, and usability, including user interfaces and extensibility. Advantages and disadvantages were also examined. In genomics, bioinformatics, and other areas of data science, gaps exist between extant public datasets and the open-source software tools built by the community to analyze similar data types. The purpose of biological data science hackathons is to assemble groups of genomics or bioinformatics professionals and software developers to rapidly prototype software to address these gaps. Proposed topics, as well as suggested tools and approaches, are distributed to participants at the beginning of each hackathon and refined during the event. Software, scripts, and pipelines are developed and published on GitHub, a web service providing publicly available, free-usage tiers for collaborative software development. The code resulting from each hackathon is published online with separate directories or repositories for each team.
In the presentation of Chen, Williams, and Xu [15], small and medium enterprises wishing to adopt business intelligence solutions look for a low-cost approach to experimenting with data mining solutions and to gaining data mining expertise, due to the high cost of commercial software. Open source development is a major consideration. Using open source software generally also means saving on software costs, allowing an enterprise to instead invest in skilling its people. Open source ensures that staff can understand exactly how the algorithms work by examining the source code, if they so desire, and can also fine-tune the algorithms to suit the specific purposes of the enterprise [15]. KDNuggets [16] lists 25 open source/free software suites/platforms for data science. The list is a quick starting point when considering using or implementing one.

3.2 Characteristics of Open Source Data Science Tools
Chen et al. [15] enumerated important features of open source data mining systems as a basis for evaluation, listed as follows with brief descriptions:
- Ability to access various data sources. Data comes from databases, data warehouses, and flat files in different formats. A good system will easily access different data sources.
- Data preprocessing capability. Preprocessing occupies a large proportion of the time in a data mining process, and data preparation is often the key to solving the problem. A good system should perform various data preprocessing tasks easily and efficiently.
- Integration of different techniques. There is no single best technique suitable for all data mining problems. A good data mining system will integrate different techniques, providing easy access to a wide range of techniques for different problems.
- Ability to operate on large datasets. Commercial data mining systems can operate on very large datasets. This is also very important for open source data mining systems, so scalability is a key characteristic.
- Good data and model visualization. Experts and novices alike need to investigate the data and understand the models created.
- Extensibility. With new techniques and algorithms constantly emerging, it is very important for open source data mining systems to provide an architecture that allows incorporation of new methods with little effort. Good extensibility means easy integration of new methods.
- Interoperability with other systems. Open standards mean that systems can interoperate. Interoperability includes data and model exchange. A good system will support canonical standards.
- Active development community. An active development community will make sure that the system is maintained and updated regularly.

Chen et al. [15] used the above features to categorize data mining systems. Twelve commonly used open source data mining systems were investigated with respect to the features mentioned above.
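As a concrete illustration of the first feature, the ability to access various data sources, the following Python sketch (a hypothetical example with invented data, not code from any of the surveyed systems) reads the same records from a CSV flat file and from a relational database, two of the source types mentioned above.

```python
import csv
import io
import sqlite3

# Source 1: a flat file in CSV format (simulated here with io.StringIO).
csv_text = "sepal_length,species\n5.1,setosa\n6.3,virginica\n"
rows_from_csv = list(csv.DictReader(io.StringIO(csv_text)))

# Source 2: the same records stored in a relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE iris (sepal_length REAL, species TEXT)")
conn.executemany("INSERT INTO iris VALUES (?, ?)",
                 [(5.1, "setosa"), (6.3, "virginica")])
rows_from_db = [{"sepal_length": str(r[0]), "species": r[1]}
                for r in conn.execute("SELECT * FROM iris")]

# A system with good data access yields the same logical records
# for the mining step downstream, regardless of the physical source.
assert rows_from_csv == rows_from_db
```

A full data mining system generalizes this idea behind a uniform loading interface, so that algorithms never need to know where the records came from.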
Four aspects were used to describe the identified data mining systems: (a) General characteristics - system features, activity, type of license, programming language used, and the compatible operating systems on which the systems run; (b) Data source aspect - the ability of the systems to handle different sources and different formats; (c) Functionality aspect - the systems' capability to execute the different data mining processes: data preprocessing, classification, prediction, clustering, association rules, evaluation, and visualization; (d) Usability aspect - how easily an open source data mining system can be used in solving real-world business problems. Christa, Madhuri, and Suma [17] reinforced that the successful implementation of data mining techniques requires a careful assessment of the various tools and algorithms available to mining experts; the more tools and algorithms that are defined or made available, the higher the likelihood of successful implementation and usage. In the development of algorithm visualization, Cooper et al. [18] observed that "developers of Algorithm Visualization do not appear
to have used open source best practices". The authors propose using open-source procedures to gain users and contributions for progress in the field of algorithm visualization. Employing a standard in development can guarantee a lasting relationship with the community in terms of continued development, evaluation, and contributions. Ruefenacht et al. [19] evaluated R, SAS®, Weka, and Orange in their relative use and interaction with the USDA Forest Service's standard remote-sensing and geographic information system packages. To integrate the software packages into the existing systems, Python was used. In ranking the packages, the study used the following criteria: usability, critical mass, uniqueness, defensibility, and performance. Python, as an open source programming language, provided interoperability between the in-house systems and the evaluated tools. de Almeida and Pedrosa [20] evaluated RapidMiner, Orange, WEKA, and KNIME in their 2011 study "Open source data mining tools for audit purposes". The identified tools were seen as helpful in audit processes, especially as data analysis tools, to detect data patterns and act as decision support tools. In "Sports Data Mining" [21], the authors narrowed the field down to Weka and RapidMiner as "exceptional open source tools" useful for effective sports data mining. The conclusion was based on their abundant machine learning algorithms, data manipulation options, and visualization techniques. The conference paper presented by Sun et al. [22] evaluated Weka and GeneXProTools for the classification function in authentic emotion recognition. A hybrid classification algorithm was developed and used to compare the tools; the result was characterized by various classifiers.

3.3 Evaluation of Open Source Data Science Tools
The 2007 study of Chen et al. [15] characterizes data mining tools, whose characteristics are similar to those of data science tools.
While data mining tools were specialized data science tools at that time, their functionalities have since improved enough to qualify them as data science tools. A subsequent study by Ruefenacht et al. in 2009 [19] characterized "data mining software packages", an almost identical set of software to data mining tools. Integrating the characteristics of data mining tools presented above with the definition of open source and of data science tools, this study presents the following characteristics to evaluate the development and use of currently available tools: Tri-features, open source, continuity, interoperability, and defensibility.

3.3.1 Tri-features
To evaluate data science tools, this study utilized the three core functions of data science, referred to here as the Tri-features. The Tri-features determine that a tool has the three characteristics of a data science tool: (a) collection and preparation; (b) analysis and reflection; and (c) dissemination or presentation of results.
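The three functionalities can be sketched in miniature. The following Python fragment (an invented example for illustration; the values are arbitrary) passes through all three stages: (a) collection and preparation, (b) analysis and reflection, and (c) dissemination of the result as a brief written report.

```python
import statistics

# (a) Collection and preparation: gather raw records and wrangle them
# into a clean, analyzable form (drop malformed or empty entries).
raw_records = ["12.5", "13.1", "n/a", "11.8", ""]
clean = [float(r) for r in raw_records if r.replace(".", "", 1).isdigit()]

# (b) Analysis and reflection: compute results to interpret the output.
mean = statistics.mean(clean)
spread = statistics.pstdev(clean)

# (c) Dissemination: present the results as a short written report.
report = f"n={len(clean)}, mean={mean:.2f}, stdev={spread:.2f}"
print(report)
```

A tool exhibiting the Tri-features supports all three of these stages; a library covering only one of them would, under this scheme, not qualify on its own.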
To qualify as a data science tool, a tool should possess all three functionalities of data science.

3.3.2 Open source
As a core characteristic of an open source data science tool, the tool should comply with the definition of open source presented above. Where available, the respective open source license is identified.

3.3.3 Continuity
Continuity characterizes a tool's capacity for the community to contribute, to improve it, and to outlive the original developers. Updates to the source code of a tool signify that there exists continuity in its development.

3.3.4 Interoperability
Interoperability conveys that a tool can use output data from other data science tools. This study proposes that tools providing integration, interfaces, and standard formats are more likely to improve.

3.3.5 Defensibility
One of the criteria of Ruefenacht et al. [19] in "Evaluation of Open Source Data Mining Software Packages" is defensibility, characterized as "how often the citation for the particular software package appears in peer-reviewed publications". In this study, defensibility represents the volume or availability of studies that may have used, evaluated, or mentioned the different tools. Publications only by a tool's own developers, publishers, or copyright owners are considered limited and labeled "L". If there are no publications dating from 2010 onward, the tool is labeled "N". Due to the inability to discern some characteristics of the open source data science tools, the label "-" is used; this means that information about that characteristic of the particular tool is non-conclusive, not known, or not publicly available. Table 1 presents the matrix of the characteristics of the enumerated data science tools.

Table 1. Characterized open source data science tools
Tool                       | A | B            | C           | D | E
ADaM                       | Y | N            | N           | Y | -
ADAMS                      | Y | GPLv3        | 12/22/2015  | Y | L
Alteryx Project Edition    | Y | Y            | -           | - | -
AlphaMiner                 | - | N            | N           | Y | -
CMSR Data Miner            | - | N            | -           | Y | -
CRAN Task View             | Y | Y            | Y           | Y | L
Databionic ESOM Tools      | Y | Y            | N           | Y | -
ELKI                       | - | AGPLv3       | Y           | Y | -
Greening                   | N | GPL          | 1/3/2006    | N | -
Gnome Data Mining Tools    | N | GPL          | -           | - | -
jHepWork/Scavis            | Y | "permissive" | (DataMelt)  | Y | L
KEEL                       | Y | GPLv3        | Y           | Y | Y
KNIME                      | Y | GPLv3        | Y           | Y | Y
MJL                        | - | GPL          | 5/22/2002   | Y | L
MiningMart                 | Y | Y            | 10/2/2006   | - | Y
ML-Flex                    | N | GPLv3        | Y           | Y | L
MLC++                      | N | GA           | 12/21/1997  | N | N
MLDB                       | Y | Apachev2     | Y           | Y | Y
OpenNN                     | N | LGPLv3       | 6/16/2016   | Y | L
Orange                     | Y | GPLv3        | Y           | Y | Y
PredictionIO               | Y | Apachev2     | (Incubator) | Y | L
RapidMiner                 | Y | AGPL-3       | Y           | Y | Y
Rattle                     | N | GPL          | Y           | N | L
SpectraFox                 | Y | Y            | Y           | Y | L
TANAGRA                    | N | Y            | 13-12-18    | N | L
Vowpal Wabbit              | N | BSD          | Y           | Y | L
Weka                       | Y | GPL          | Y           | Y | Y
(A = Tri-features; B = Open source/license; C = Continuity; D = Interoperability; E = Defensibility)
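In practice, interoperability between such tools often rests on shared file formats. As a simplified sketch (the attribute names and data are invented; this is not code from any listed tool), the Python function below converts comma-separated output, such as another tool might emit, into Weka's ARFF flat-file format:

```python
import csv
import io

def csv_to_arff(csv_text: str, relation: str) -> str:
    """Convert simple CSV data (numeric columns plus a final class
    column) into ARFF, the flat-file format Weka reads natively."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    classes = sorted({r[-1] for r in data})
    lines = [f"@RELATION {relation}", ""]
    for name in header[:-1]:
        lines.append(f"@ATTRIBUTE {name} NUMERIC")
    # The last column is declared as a nominal class attribute.
    lines.append(f"@ATTRIBUTE {header[-1]} {{{','.join(classes)}}}")
    lines += ["", "@DATA"] + [",".join(r) for r in data]
    return "\n".join(lines)

csv_text = "width,height,label\n5.1,3.5,setosa\n6.2,2.9,virginica\n"
print(csv_to_arff(csv_text, "iris_sample"))
```

A converter of this kind is what allows output from one tool's data-preparation stage to feed another tool's analysis stage, which is the sense of interoperability used in Table 1.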
4. SUMMARY
Most tools that were considered data mining tools can now be considered data science tools. The characteristics used to evaluate data mining tools served as the basis for evaluating data science tools, with reference to the definition and scope of data science. Some of the enumerated tools do not qualify as data science tools because their functions do not fully satisfy the definition of data science. Those that do not have the complete features are libraries, plugins, or add-ons to other software packages. Some of them, however, are considered specifically data mining tools, data visualization tools, or data modeling suites. Some tools were developed to address specific functions within the scope of data science. Some in turn become repositories of individual tools, like Togaware's list: Gnome Data Mining Tools, Rattle, and Greening. Gnome Data Mining Tools is a collection of experimental GUI-based tools; Rattle is a data mining toolkit written in the statistical language R; Greening is a decision tree builder. Most of the tools were developed using open source languages: C/C++, Python, and Java are the most popular development languages for the enumerated data science tools. Alteryx Project Edition and CMSR Data Miner, however, are no longer available as open source tools. SpectraFox [23] is provided as open source; however, it uses the .NET Framework, which is not open source. In terms of continuity, ADaM, AlphaMiner, and Databionic ESOM Tools are no longer in development. jHepWork changed its name to SCAVIS and then to DataMelt. PredictionIO is in the Apache Incubator. The rest are actively in development, with most of them hosting their source code on GitHub - a source code
repository and collaboration network - allowing anyone with a registered account to copy, improve, and/or produce a new line of software. Interoperability is becoming key to improving open source data science tools, driven by the development of newer algorithms, newer data manipulation and storage schemas, and wider collaboration. Tools that are actively in development have their own efforts at providing interaction with other tools. Orange uses various add-ons that are also available in some of the other tools. Vowpal Wabbit has features that can be internally paired so that the algorithm is linear in the cross-product of the subsets. The CRAN Task Views provide different tools that utilize output from other tools. KEEL, KNIME, MLDB, Orange, RapidMiner, and Weka satisfy all the characteristics used. They are also the most popular open source data science tools, both in use and in studies. These tools have a considerable number of studies that use and evaluate them, as well as involvement from the community. In terms of defensibility, these tools are proven in both practice and research. This study is hoped to provide guidance for prospective developers and researchers on which open source data science tool can be considered a subject of interest. A more in-depth evaluation of the tools that satisfy all the characteristics is encouraged. Evaluating them in terms of functionalities will also provide a better understanding that promotes collaborative improvement of the tools.
5. REFERENCES
[1] UC Berkeley School of Information, "What is Data Science?," [Online]. Available: https://datascience.berkeley.edu/about/whatis-data-science/. [Accessed 20 07 2016].
[2] V. Dhar, "Data science and prediction," Communications of the ACM, vol. 56, no. 12, p. 64, 2013.
[3] P. Pal, T. Mukherjee and A. Nath, "Challenges in Data Science: A Comprehensive Study on Application and Future Trends," International Journal of Advance Research in Computer Science and Management Studies, vol. 3, no. 8, August 2015.
[4] F. J. Smith, "Data Science as an Academic Discipline," Data Science Journal, vol. 5, pp. 163-164, 01 2006.
[5] T. H. Davenport and D. J. Patil, "Data Scientist: The Sexiest Job of the 21st Century," Harvard Business Review, 10 2012.
[6] S. Abiteboul, R. Agrawal, P. Bernstein, M. Carey, S. Ceri, B. Croft, D. DeWitt, M. Franklin, H. G. Molina, D. Gawlick, J. Gray, L. Haas and A. Halevy, "The Lowell Database Research Self Assessment," 2003.
[7] R. Agrawal, A. Ailamaki, P. Bernstein, E. Brewer, M. Carey, S. Chaudhuri, A. Doan et al., "The Claremont Report on Database Research".
[8] I. Vardhan, M. Praison, P. Dixit, T. Sofuoglu and V. C. Pawate, Open Source Development, University of Leicester, 2016.
[9] K. Aleksandar, K. Warruntorn and K. Katsuhiko, "Designing With and Within A Community: Heuristic techniques for designing with and within Open Source Software Development," Bulletin of Japanese Society for the Science of Design, vol. 60, no. 5, pp. 5_93-5_102, 2014.
[10] D. Abadi, R. Agrawal, A. Ailamaki and M. Balazinska, "The Beckman Report on Database Research," Irvine, CA, USA, 2013.
[11] "Structure and style of theoretical article," SA Journal of Human Resource Management, [Online]. Available: http://www.sajhrm.co.za/index.php/sajhrm/pages/view/theoretical. [Accessed 26 August 2016].
[12] Open Source Initiative, "The Open Source Definition," 22 03 2007. [Online]. Available: https://opensource.org/osd. [Accessed 20 07 2016].
[13] E. Gaussier and L. Cao, "Conference Report on 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA'2015)," IEEE Computational Intelligence Magazine, vol. 11, no. 1, pp. 13-14, February 2016.
[14] R. M. De Moraes and L. Martinez, "Editorial: Computational Intelligence Applications for Data Science," Knowledge-Based Systems, vol. 87, pp. 1-2, July 2015.
[15] X. Chen, G. Williams and X. Xu, "A Survey of Open Source Data Mining Systems," in Lecture Notes in Computer Science, 2007.
[16] KDNuggets, "Software Suites/Platforms for Analytics, Data Mining, & Data Science," KDNuggets, [Online]. Available: http://www.kdnuggets.com/software/suites.html#free. [Accessed 27 07 2016].
[17] S. Christa, K. L. Madhuri and V. Suma, "A Comparative Analysis of Data Mining Tools in Agent Based Systems," October 2012.
[18] M. L. Cooper, C. A. Shaffer, S. H. Edwards and S. P. Ponce, "Open Source Software and the Algorithm Visualization Community," Science of Computer Programming, vol. 88, pp. 82-91, 2014.
[19] B. Ruefenacht, G. Liknes, A. J. Lister, H. Fisk and D. Wendt, Evaluation of Open Source Data Mining Software Packages, 2009.
[20] N. V. de Almeida and I. Pedrosa, "Open source data mining tools for audit purposes," in OSDOC'11 Workshop on Open Source and Design of Communication, Lisboa, Portugal, 2011.
[21] R. P. Schumaker, O. K. Solieman and H. Chen, "Open Source Data Mining Tools for Sports," in Sports Data Mining, Springer US, 2010, pp. 89-92.
[22] Y. Sun, L. Zhang, Z. Li and Y. Chen, "Evaluating Data Mining Tools for Authentic Emotion Classification," in International Conference on Intelligent Computation Technology and Automation, 2010.
[23] M. Ruby, "SpectraFox: A Free Open-Source Data Management and Analysis Tool for Scanning Probe Microscopy and Spectroscopy," SoftwareX, 2016.