Evaluation and Comparison of Open Source Software Suites for Data Mining and Knowledge Discovery

Abdulrahman H. Altahi∗, J. M. Luna†, M. A. Vallejo‡, S. Ventura§¶

Abstract

The growing interest in the extraction of useful knowledge from data, with the aim of being beneficial for the data owner, is giving rise to multiple data mining tools. The research community is especially aware of the importance of open source data mining software to ensure and ease the dissemination of novel data mining algorithms. The availability of these tools at no cost, together with the chance of better understanding the approaches by examining their source code, provides the research community with a great opportunity to tune and improve the algorithms. Documentation, updating, variety of algorithms, extensibility and interoperability, among others, can be major issues that motivate users to opt for a specific open source data mining tool. The aim of this paper is to evaluate 19 open source data mining tools and to provide the research community with an extensive study based on a wide set of features that any tool should satisfy. The evaluation is carried out by following two methodologies. The first one is based on scores provided by experts to produce a subjective judgment of each tool. The second procedure performs an objective analysis of which features are satisfied by each tool. The ultimate aim of this work is to provide the research community with an extensive study on the different features included in any data mining tool, from both a subjective and an objective point of view. Results reveal that RapidMiner, KNIME and WEKA are the tools that include the highest percentage of these features.



∗ Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia. Email: [email protected]
† Department of Computer Science and Numerical Analysis, University of Cordoba, Campus de Rabanales, edificio "Einstein", 14071 Cordoba, Spain. Email: [email protected]
‡ Technological Surveillance Department, Global Center of Excellence for Development of Cognitive Applications (CADC), Everis NTT DATA, Zaragoza, Spain. Email: [email protected]
§ Department of Computer Science and Numerical Analysis, University of Cordoba, Campus de Rabanales, edificio "Einstein", 14071 Cordoba, Spain. Email: [email protected]
¶ Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia. Email: [email protected]


Introduction

Over the last decade there has been an exponential growth of interest in data gathering in almost any application domain, giving rise to an increasing need to manage and transform tons of facts into useful information 1. Generally, raw data lack significance, and an in-depth analysis is required to extract hidden information from which new knowledge can be derived 2. This rising interest in data analytics 3 has been especially alluring for the research community, resulting in numerous data mining algorithms as well as a variety of open source data mining software 4-6. The availability of data mining software at no cost, together with the chance of better understanding the algorithms by examining their source code, provides the research community with a great opportunity to tune existing algorithms and improve them with new features 7.

One of the best-known open source data mining software suites is WEKA (Waikato Environment for Knowledge Analysis) 5, which includes a collection of machine learning algorithms for different data mining tasks. In the past year, for instance, this software was downloaded by more than 1,100,000 users. However, this is not the only option and, together with WEKA, there exist many other well-known open source software suites for data mining, including KEEL 4, Orange 6, RapidMiner 8 and KNIME 9, among others. Hence, the number of open source alternatives is really extensive 10, which is beneficial not only for researchers in the data mining 11 field but also for enterprises and many other organizations 3. At this point, it is interesting to highlight that this high number of alternatives hampers, at the same time, the choice of the right software 12. Here, an extensive study based on a wide set of features that any tool should satisfy is essential, analysing issues such as documentation, updating, variety of algorithms, visualization, extensibility, interoperability, etc.

In this paper, we present a review of 19 well-known open source data mining tools taken from the KDnuggets list (the KDnuggets website, http://www.kdnuggets.com/, aims at connecting researchers in the field of Knowledge Discovery, covering news in the field as well as including many interesting courses and interviews). The methodology followed to evaluate them consists of two different procedures described in the literature. The first one 13 is based on scores provided by experts to produce a subjective judgment of each tool. A set of experts evaluates each tool according to a collection of questions that are based on performance, functionality, usability, tasks supported, etc. The second procedure performs an objective analysis of which features are satisfied by each tool 12. These features are, among others, the system requirements (multi-platform, thin client, etc.), the type of approaches (classification, regression, clustering, association, etc.), complementary activities (data visualization, data cleaning, etc.) and some user interface issues. It is important to highlight that the main value of this study lies in providing the research community with an extensive analysis of the different features included in any data mining tool, from both a subjective and an objective point of view. To our knowledge, no single tool outperforms the others for every problem; rather, some tools include a more widespread set of features.

The rest of the paper is organized as follows. Section 2 summarises the open source data mining tools used in this analysis. Section 3 presents the methodology carried out to evaluate the data mining tools. Section 4 includes a comparative study considering different features that any data mining tool should satisfy. Finally, some concluding remarks are highlighted in Section 5.

Open Source Data Mining Tools

As previously described, the growing interest in data analytics, and the real need for eliciting useful knowledge from data with the aim of being beneficial for the data owner 3,14, is giving rise to the design of more and more data mining proposals. This interest in data mining models is especially appealing to the research community, and dozens of open source data mining tools have recently been designed by many researchers in the field 10. All of this has given rise to a great opportunity to tune and improve existing algorithms as well as a way of distributing new models.

Any open source software 15 offers its source code under a license in which the copyright holder provides the rights that allow the software to be freely used, modified, and shared with anyone and for any purpose, subject to conditions preserving the provenance and openness of the software. Without a license, the source code is copyrighted by default, so no one has the legal right to use it. On the contrary, if the code is in the public domain, anyone may use the software to do whatever he or she pleases. At this point, it is interesting to summarize some features of a subset of open source licenses (see Table 1) to understand the legal rights associated with each of the data mining software suites studied in this work. Here, linking is analysed as the software interaction allowed to create a derivative work. Distribution of the code to third parties as well as modification of the code by a licensee are also analysed. Finally, it is also analysed whether any modified code can be released under a different license (sublicense).

Table 1: Comparison of a subset of open source software licenses.

License | Latest version | Linking | Distribution | Modification | Sublicense
AGPL | 3.0 | GPLv3 | Copyleft | Copyleft | Copyleft
Apache | 2.0 | Permissive | Permissive | Permissive | Permissive
BSD | 3.0 | Permissive | Permissive | Permissive | Permissive
CC-0 | 1.0 | Public Domain | Public Domain | Public Domain | Public Domain
CC-BY | 4.0 | Permissive | Permissive | Permissive | Permissive
CC-BY-SA | 4.0 | Copyleft | Copyleft | Copyleft | No
Eclipse | 1.0 | Limited | Limited | Limited | Limited
GPL | 3.0 | GPLv3 | Copyleft | Copyleft | Copyleft
LGPL | 3.0 | Restrictions | Copyleft | Copyleft | Copyleft
MIT/X11 | - | Permissive | Permissive | Permissive | Permissive
Mozilla | 2.0 | Permissive | Copyleft | Copyleft | Copyleft
Unlicense | - | Public domain | Public domain | Public domain | Public domain

Once existing licenses have been summarized, the set of 19 open source data mining tools is described, paying particular attention to their licenses as well as additional features such as the programming language, the operating system and the latest update (see Table 2). All these tools were taken from the KDnuggets list of open source data mining software.

• ADaM (Algorithm Development and Mining) 16 is a data mining toolkit developed and copyrighted by the Information Technology and Systems Center (ITSC) at the University of Alabama in Huntsville. This multiplatform system is used to apply data mining technologies 17 to remotely-sensed and other scientific data. The ADaM toolkit provides a suite of tools for each of the basic data mining processes, including classification, clustering, association rule mining and preprocessing. The toolkit is packaged as a series of independent components. Each component can be used either as a standalone executable or via a wrapper using the Python scripting language.


Table 2: General characteristics of the most well-known data mining tools.

Tool | Language | License | Operating system | Latest update
ADaM | Python | Own license | Multiplatform | May 2005
ADAMS | Java | GPLv3 | Multiplatform | December 2015
AlphaMiner | Java | GPLv2 | Multiplatform | March 2013
CMSR | Java | Not specified | Windows | Not specified
D.ESOM | Java | GPL | Multiplatform | February 2006
DataMelt | Java | GPL | Multiplatform | January 2017
ELKI | Java | AGPLv3 | Multiplatform | January 2016
GDataMine | Python | GPL | Linux | January 2006
KEEL | Java | GPLv3 | Multiplatform | May 2016
KNIME | Java | GPLv3 | Multiplatform | January 2017
MiningMart | Java | Own license | Multiplatform | April 2013
ML-Flex | Java | GPLv3 | Multiplatform | March 2016
Orange | Python | GPL | Multiplatform | January 2017
RapidMiner | Java | AGPL | Multiplatform | January 2017
Rattle | R | GPL | Multiplatform | January 2016
SPMF | Java | GPLv3 | Multiplatform | January 2016
Tanagra | C++ | Own license | Windows | December 2013
V. Wabbit | C++ | BSDL | Multiplatform | October 2015
WEKA | Java | GPL | Multiplatform | January 2017

• ADAMS (Advanced Data mining And Machine learning System) 18 is a modular open-source Java framework for developing workflows 19. ADAMS is available for academic research as well as commercial applications. This system, released under GPLv3, includes a novel workflow engine designed for rapid prototyping and maintenance of complex knowledge workflows. The use of workflows is really useful for the end user, since each step of the knowledge discovery process is described by a graphical user interface. Another major feature is that ADAMS is continuously kept up to date.

• AlphaMiner 10 is a general purpose data mining system carefully designed to facilitate the implementation of data mining processes. AlphaMiner, which is released under GPLv2 and implemented in Java, provides the user with a wide range of functionalities to carry out different processes: data access from different data sources, data manipulation, building data mining models, and additional functionalities for statistical analysis.


• Cramer Modelling Segmentation and Rules (CMSR) 10 is a data mining suite used for business analytics. This Java framework is freely available for Windows. Neither its license nor its changelog is specified by the owner, which is a major handicap for users who want to adopt this software. CMSR provides an integrated environment for predictive modeling, segmentation, data visualization, statistical data analysis, and SQL queries.

• Databionic ESOM (D.ESOM) 20 is another interesting tool written in Java. This tool is a suite of programs that perform data mining tasks like clustering, visualization, and classification with emergent self-organizing maps. D.ESOM was fully designed by the Databionics Research Group at the University of Marburg, Germany. A major drawback of this tool, which was released under GPL, is its lack of novel algorithms, since it was last updated in 2006.

• DataMelt 21 is an environment for numeric computation, data analysis and data visualization. This multiplatform tool, which is implemented in Java, can be used with several scripting languages: Jython (Python implemented in Java), Groovy, JRuby and BeanShell. A major feature of DataMelt is its ability to analyse data by applying different data mining techniques by means of a graphical user interface. DataMelt can be used for data visualization, including both 2D and 3D plots. Additionally, it also enables statistical tests to be performed. Finally, it is interesting to highlight that DataMelt also provides ways of solving systems of linear and differential equations as well as regression problems. A major feature of this tool is that it is continuously updated.

• ELKI (Environment for deveLoping KDD-applications supported by Index-structures) 22 is a really up-to-date open source data mining software suite that is written in Java and released under the AGPLv3 license. ELKI is mainly focused on algorithms proposed by the research community, with special emphasis on unsupervised methods for cluster analysis and outlier detection. ELKI aims at providing a large collection of highly parameterizable algorithms, allowing an easy and fair evaluation to be performed. Additionally, it was properly designed to be easily applicable for researchers and students in the knowledge discovery field.

• The gnome data mine tools (GDataMine) 23 is a set of open source data mining programs. This set of tools requires Python and Gnome to be previously installed on the computer. Additionally, the Debian GNU/Linux distribution is highly recommended. GDataMine includes algorithms for the association rule mining task, a Bayes classifier and a decision tree classification algorithm. The major handicap of GDataMine is its lack of updates, since it was last revised in 2006.

• KEEL (Knowledge Extraction based on Evolutionary Learning) 4 is another interesting tool that is continuously updated with new features and algorithms. This open source tool for data mining purposes was written in Java and released under GPLv3. KEEL provides a simple graphical user interface based on dataflows to design experiments with different datasets and computational intelligence algorithms (paying special attention to evolutionary algorithms 2,24). This tool comprises a wide variety of well-known algorithms for knowledge discovery, preprocessing techniques, computational intelligence based learning algorithms, hybrid models, statistical methodologies for contrasting experiments, and so forth.

• KNIME (Konstanz Information Miner) 9 is an open source data analytics tool released under GPLv3 and written in Java. The KNIME analytics platform incorporates hundreds of processing nodes for data I/O, preprocessing and cleaning, modeling, analysis and data mining. Additionally, it includes various interactive views, such as scatter plots and parallel coordinates, among others. KNIME is based on the Eclipse platform and, through its modular API, it is easily extensible. This modularity and extensibility enables KNIME to be employed in commercial production environments as well as in teaching and research prototyping settings.

• MiningMart 25 is a graphical tool for processing and transforming data stored in very large databases. It provides two dual graphical views of the transformations, i.e. a data view and a process view. MiningMart is mainly focused on data preparation for data mining tasks, and it offers an environment to develop, document and share complete data processing chains, from the raw data tables in a relational database to the final data mining application. MiningMart is a multiplatform tool written in Java and released under its own license, which is properly described when downloading the tool.

• ML-Flex 26 is a multiplatform open source software package that is written in Java and released under the GPLv3 license. ML-Flex includes a set of machine learning approaches to be applied to disparate sets of data. This tool enables algorithms implemented in any programming language to be invoked by users.

• Orange 6 was proposed by the Bioinformatics Laboratory of the Faculty of Computer and Information Science at the University of Ljubljana. It is a machine learning and data mining software suite written in Python and released under the GPL license. Orange presents a visual programming front-end for explorative data analysis and visualization, and it may also be used as a Python library. The default installation includes a number of machine learning, preprocessing and data visualization algorithms. As a matter of example, the machine learning algorithms in the default installation are limited to the naive Bayesian classifier, k-nearest neighbors, induction of rules and trees, support vector machines, neural networks, linear/logistic regression, and ensemble methods.

• RapidMiner 8, formerly known as YALE (Yet Another Learning Environment), is a software suite released under AGPL. This tool is frequently updated and it provides an integrated environment for machine learning, data mining, text mining, predictive analytics and business analytics. RapidMiner is traditionally used for business and industrial applications as well as for research, education, training, rapid prototyping and application development. It supports all steps of the data mining process, including results visualization, validation and optimization.

• Rattle (the R Analytical Tool To Learn Easily) 27 is a popular tool with a graphical user interface for data mining tasks. Rattle is based on the use of the R programming language. It also includes statistical and visual summaries of data, different data transformations, a varied set of both unsupervised and supervised models, as well as a visualization of the performance of the executions. Rattle is released under GPL and it is continuously updated. It is of high interest for users who look for novel techniques.

• SPMF (Sequential Pattern Mining Framework) 28 is an open source data mining library written in Java. SPMF is mainly focused on pattern mining tasks. It is distributed under the GPLv3 license and it offers implementations of more than 120 data mining algorithms for association rule mining 29, itemset mining 2, sequential pattern mining, sequential rule mining, sequence prediction, periodic pattern mining, high-utility pattern mining, clustering and classification. SPMF has no dependencies on other libraries, and the source code of each algorithm can be easily integrated into other Java software.

• Tanagra 23 is an open source project released under its own license, which is described in depth when downloading the tool. Tanagra proposes several data mining methods such as exploratory data analysis, statistical learning and machine learning. However, a major drawback is that it was last updated in 2013.

• Vowpal Wabbit (VW) 10 is an open source fast out-of-core learning system library and program developed originally at Yahoo! Research, and currently at Microsoft Research. This tool supports a number of machine learning problems and loss functions as well as optimization algorithms like SGD (Stochastic Gradient Descent), BFGS (a popular algorithm for parameter estimation), conjugate gradient, etc.

• WEKA (Waikato Environment for Knowledge Analysis) 5 is a collection of visualization tools and algorithms for data analysis and predictive modeling, together with a graphical user interface. This multiplatform tool, released under GPL, is considered by the research community as one of the most popular platforms. In the past year, for example, this software was downloaded by more than 1,100,000 users. WEKA supports several standard data mining tasks and, more specifically, data preprocessing, clustering, classification, regression, visualization, and feature selection. Finally, it should be noted that the algorithms provided by WEKA can be easily called from your own Java code.
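As an illustration of that last point, the following minimal sketch shows the kind of call the authors refer to, assuming WEKA's standard Java API (the weka.classifiers and weka.core packages shipped with the WEKA 3 distribution); the dataset path iris.arff is only a placeholder.

```java
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaFromJava {
    public static void main(String[] args) throws Exception {
        // Load an ARFF file (placeholder path) and mark the last attribute as the class.
        Instances data = DataSource.read("iris.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Build a C4.5-style decision tree (J48) with explicit options.
        J48 tree = new J48();
        tree.setOptions(new String[] {"-C", "0.25", "-M", "2"});

        // 10-fold cross-validation and a textual summary of the results.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new java.util.Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```

The same pattern (load an Instances object, configure a learner, evaluate it) applies to the other algorithms bundled with WEKA.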


As previously stated, Table 2 aims at summarizing the major characteristics of all of the open source data mining tools described above. All these tools were analysed according to their general features, including latest update, license, programming language and operating system. In this general analysis there is no need to go into detail about more specific features, since the goal is to provide a general overview of the data mining tools.

Methodology

In this section, the methodology used to analyse the set of open source data mining tools itemized in Section 2 is described. The ultimate aim is to provide the research community with an extensive study based on a wide set of features that any tool should satisfy. The methodology followed in this paper consists of two different procedures described in the literature. The first one is a subjective scoring procedure 13 performed by experts in the field to produce a final judgment of each tool. The second procedure performs an objective analysis 12 of which features, organized into a number of categories, are satisfied by each tool.

Scoring procedure

The first evaluation procedure carried out in this paper follows an interesting framework proposed by Carey et al. 13. This framework was proposed as a scoring procedure based on four main categories (performance, functionality, usability, and support of supplementary activities) to evaluate data mining tools. The methodology was described by the Center for Data Insight (CDI) at Northern Arizona University as a first-hand experience in evaluating many data mining tools. According to CDI 13, a series of questions should be answered by the experts, who assign a specific score to each one. Rather than scoring on some artificially absolute scale, a scoring procedure (values from 1 to 5) is considered by taking a specific tool as the reference. This reference tool receives a score of 3 for each criterion, whereas the rest of the tools are then rated against the reference by applying the following discrete rating scale: (1) much worse than the reference; (2) worse than the reference; (3) same as the reference; (4) better than the reference; (5) much better than the reference.

All these scores are assigned to the questions raised and organized into the following categories:

• Performance of a data mining tool is considered as the ability to handle a variety of data sources in an efficient manner. It is important to highlight that some approaches are inherently more efficient than others, and the hardware configuration also plays a crucial role here. Thus, according to CDI 13, performance is evaluated by considering the following criteria (Table 3 presents the set of questions used to evaluate performance): the variety of platforms on which the software runs; the software architecture (client-server or stand-alone); both the data size and the variety of data sources handled by the tool; the efficiency of the algorithms according to the data size; the ability to interoperate with other tools; and the robustness of the tool.

Table 3: Questions considered for the performance analysis.

Criteria | Description
Platform Variety | Does the software run on a wide variety of computer platforms?
Software Architecture | Does the software use a client-server architecture or a stand-alone architecture?
Data Access | Does the software handle a variety of data sources?
Data Size | How well does the software scale to large data sets?
Efficiency | Does the software produce results in a reasonable amount of time relative to the data size?
Interoperability | Does the tool interface with other KDD support tools easily?
Robustness | Does the tool run consistently without crashing?

• Functionality is measured as the inclusion of a variety of capabilities, techniques, and methodologies for data mining (see Table 4). Software functionality helps to assess how well the tool will adapt to different data mining problem domains. The evaluation of this category is based on the set of questions described in Table 4: the quantity and variety of algorithms and data mining techniques included in the tool; the assistance offered to the user during the mining process; the inclusion of a validation process; the variety of data types handled by the tool; the ability to fine-tune the algorithms; whether the tool allows random sampling of data for predictive modeling; how the results are summarized and detailed; and the ability to export the results for ongoing use.

Table 4: Questions considered for the functionality analysis.

Criteria | Description
Algorithmic Variety | Does the software provide an adequate variety of data mining techniques and algorithms?
Prescribed Methodology | Does the software aid the user by presenting a user-friendly methodology?
Model Validation | Does the tool support model validation?
Data Type Flexibility | Does the software handle a wide variety of data types?
Algorithm Modifiability | Does the user have the ability to modify and fine-tune the modeling algorithms?
Data Sampling | Does the tool allow random sampling of data for predictive modeling?
Reporting | Are the results reported in a variety of ways (summary/detailed)?
Model Exporting | Does the software provide a variety of ways to export the model for ongoing use?

• Usability. The data mining task is a process that usually requires the adjustment of a set of variables in order to obtain more interesting and useful insights. Any tool should provide meaningful diagnostics to help in improving the output. In this regard, a number of criteria (see Table 5 for a better description) are evaluated: how easy the interface is and whether the results are provided in a meaningful way; how users learn to use the tool correctly (learning curve); the suitability of the tool for different users (beginner, intermediate, advanced); the visualization of the data and the modeling results; error reporting; whether the tool maintains a history of the actions taken during the mining process; and the variety of business problems that can be solved by the tool.

Table 5: Description of the usability analysis.

Criteria | Description
User Interface | Is the user interface easy to navigate, presenting results in a meaningful way?
Learning Curve | Is the tool easy to learn? Is the tool easy to use correctly (beginning/advanced users)?
Data Visualization | How well does the tool present the data and the modeling results?
Error Reporting | How meaningful is the error reporting?
Action History | Does the tool maintain a history of actions taken in the mining process? Can the user modify parts of this history and re-execute the script?
Domain Variety | Can the tool be used in a variety of problems? How well does the tool focus on one problem domain?

• Supplementary activities. This fourth category is responsible for quantifying the variety of approaches related to data cleaning, manipulation, transformation, visualization and many others. In this regard, this category determines which of the following criteria are satisfied by the tool (see Table 6 for a further description): data cleaning (modifying spurious values and other data cleaning operations); value substitution; data filtering; binning of continuous data; creation of derived attributes based on the inherent attributes; randomization; deletion of entire records; whether the tool enables blanks to be handled and whether these blanks can be substituted with a variety of values (user-defined, mean, median, etc.); the possibility of changing metadata; and whether the tool allows the results from a mining analysis to be fed back into another analysis for further model building.

Table 6: Questions considered for the auxiliary tasks.

Criteria | Description
Data Cleaning | How well does the software allow the user to modify spurious values in data?
Value Substitution | Does the tool allow global substitution of one data value with another?
Data Filtering | Does the software allow the selection of subsets of data according to some criteria?
Binning | Does the tool allow the binning of continuous data to improve modeling efficiency? Does the tool require continuous data to be binned or is this decision left to user discretion?
Deriving Attributes | Does the software allow the creation of derived attributes based on the inherent attributes?
Randomization | Does the tool allow efficient/effective randomization of data prior to model building?
Record Deletion | Does the tool allow the deletion of entire records that may be incomplete or may bias the results?
Handling Blanks | Does the tool handle blanks? Can they be substituted with a variety of derived values?
Metadata Manipulation | Does the tool present the user with data descriptions, types, categorical codes, etc.?
Result Feedback | Does the tool allow the results from a mining analysis to be fed back into another analysis for further model building?

Characterization procedure

The second evaluation procedure considered in this paper is responsible for characterising, in an objective way, the different software suites by means of a collection of specific features denoted by Giraud-Carrier et al. 12. This methodology describes a template for the characterisation of any data mining software by considering four main groups: system requirements, type of approaches, process-dependent features, and user interface features. This methodology is entirely objective and impartial, since it only checks whether the features are satisfied or not by each tool. Hence, it is different from and complementary to the previous scoring procedure. The use of both methodologies at the same time is useful to analyse each tool from both a subjective and an objective point of view. The four main categories considered by this procedure are described as follows:

• System requirements. The aim of this analysis is to check whether a set of system requirements is satisfied by each tool. These requirements are: whether the tool runs on multiple computing platforms; whether the tool depends on a different computer to fulfill its computational roles or is able to work off-line; and whether the tool is able to work as a client/server system.

• Type of approaches. There exists a high number of data mining algorithms in the literature, each one related to a different task. Existing approaches are therefore categorized into different tasks (classification, regression, clustering, associations, anomaly detection and time series), so the aim is to check whether each tool includes approaches for each category. Both classification and regression are related to predictive modelling. The first one aims at learning a function that associates each data object with one of a finite number of pre-defined classes. The second one tries to learn a function that maps each data object to a real value. Additionally, association is related to descriptive modelling, which aims at discovering groups or categories of data objects that share similarities and help in describing the data space. Finally, whether the tool includes approaches for anomaly detection and time series is also checked.

• Process-dependent features. Since the knowledge discovery process involves a number of complementary activities, it is interesting to check which features are considered by each data mining tool. In this regard, a number of different aspects are evaluated, including the input data formats (flat files, ODBC/JDBC, SAS, XML); the data preprocessing tasks included in each tool (data characterisation, data visualisation, data cleaning, record selection, attribute selection and data transformation); the standard data mining approaches included in each tool (decision trees, rule learning, neural networks, linear/logistic regression, association rule mining, instance-based learning, unsupervised learning, and probabilistic learning); aspects related to the analysis and evaluation of the results (hold-out, cross-validation, lift/gain charts, ROC analysis, summary reports, and model visualisation); features related to saving models (save/reload models, produce executable, PMML/XML export, and comment fields); and, finally, a set of miscellaneous methods (data set size limit, support for parallelisation, expert options, batch processing, etc.).

• User interface features. In this fourth category, the aim is to check whether each software tool satisfies the following four main features: whether the tool offers a graphical user interface; whether it is command-line driven; whether it supports simple visual programming based on selecting and sequencing icons on the screen; and whether it includes any on-line help.
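Conceptually, this second procedure reduces each tool to a checklist of boolean features and the per-category counts reported later in Tables 12-15. The following sketch is only a hypothetical illustration of that representation (the enum and the two sample entries are ours, not part of any of the surveyed tools); the authoritative ticks are those reported in the tables.

```java
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.Map;

public class FeatureChecklist {
    // A small, hypothetical subset of the characterisation features described above.
    enum Feature {
        MULTI_PLATFORM, STANDALONE, CLIENT_SERVER,            // system requirements
        CLASSIFICATION, REGRESSION, CLUSTERING, ASSOCIATIONS, // type of approaches
        GUI, COMMAND_LINE, VISUAL_PROGRAMMING, ONLINE_HELP    // user interface
    }

    public static void main(String[] args) {
        Map<String, EnumSet<Feature>> tools = new LinkedHashMap<>();
        // Illustrative entries based on the descriptions given in this paper.
        tools.put("WEKA", EnumSet.of(Feature.MULTI_PLATFORM, Feature.STANDALONE,
                Feature.CLIENT_SERVER, Feature.CLASSIFICATION, Feature.REGRESSION,
                Feature.CLUSTERING, Feature.ASSOCIATIONS, Feature.GUI, Feature.COMMAND_LINE));
        tools.put("GDataMine", EnumSet.of(Feature.STANDALONE, Feature.CLASSIFICATION,
                Feature.ASSOCIATIONS));

        // The "total satisfied" column of each table is simply the number of ticks per tool.
        tools.forEach((name, features) ->
                System.out.printf("%-10s satisfies %d of %d features%n",
                        name, features.size(), Feature.values().length));
    }
}
```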

Case study

In this case study, the two methodologies described in the previous section are applied to a set of 19 open source data mining tools taken from the KDnuggets list. The KDnuggets website (http://www.kdnuggets.com/) aims at connecting researchers in the field of Knowledge Discovery, covering the news in the field as well as including many interesting courses and interviews.

Scoring procedure

In this first evaluation procedure, each data mining tool is evaluated according to the scoring of four main categories: performance, functionality, usability and supplementary activities. The scoring process has been carried out by a group of 10 researchers in the data mining field from three different universities (University of Cordoba, King Abdulaziz University and Virginia Commonwealth University). Their positions range from post-doc researchers to full professors. In order to avoid average scores that may be biased by extreme evaluations, the median score for each feature is taken as the final score. Additionally, according to CDI 13, a tool should be chosen as the reference to avoid the tendency of evaluating in favour of a famous software suite. All the evaluators were first required to choose the tool they considered the most well-known, and WEKA was taken as the reference in this regard. Finally, in order to perform a fair analysis, the weights considered in this evaluation process are those proposed by the CDI 13 research center and based on its expertise.


The scores obtained for each tool in each group of features (performance, functionality, usability, and support of supplementary activities) are described below. Note that the scores are then modified according to the weights assigned to each feature.

• The first analysis (see Table 7) carried out in the scoring procedure is related to performance. Here, the variety of platforms on which the software can be run accounts for 15% of the total value; the software architecture accounts for 10%; the size of the data and the variety of data sources handled by the tool are weighted by 15% each; the efficiency of the algorithms according to the data size accounts for 10%; the ability to interoperate with other tools is weighted with 15% of the total value; and, finally, the robustness of the tool is weighted by 20%. As a matter of example, if the variety of platforms is scored as 3, then the weighted value will be 3 × 0.15 = 0.45. The results of this analysis are illustrated in Table 7, denoting the score values assigned to each tool for each criterion, together with the overall score (the final weighted score) obtained over all the subcategories. Results show that the four best open source tools for the performance category are the following: RapidMiner, KNIME, WEKA and Rattle. It is noteworthy that both RapidMiner and KNIME are scored better than the reference, so these two open source tools seem to be really promising for data mining tasks when performance is analysed. When analysing the partial scores assigned to each specific criterion, no huge differences are obtained among the four best open source tools for the performance category.

Table 7: Scores assigned to the performance category. The weighted score is the sum of the scores once the weights are applied.

Tool | Platform variety (15%) | Software architecture (10%) | Heterogeneous data (15%) | Data size (15%) | Efficiency (10%) | Interoperability (15%) | Robustness (20%) | Weighted score
ADaM | 3 | 1 | 1 | 3 | 1 | 2 | 1 | 1.75
ADAMS | 3 | 1 | 3 | 3 | 3 | 3 | 3 | 2.80
AlphaMiner | 3 | 1 | 3 | 3 | 3 | 3 | 3 | 2.80
CMSR | 1 | 1 | 1 | 3 | 2 | 3 | 3 | 2.10
D. ESOM | 3 | 1 | 1 | 2 | 3 | 3 | 2 | 2.15
DataMelt | 3 | 3 | 2 | 3 | 3 | 3 | 3 | 2.85
ELKI | 3 | 1 | 1 | 1 | 2 | 3 | 2 | 1.90
GDataMine | 1 | 1 | 2 | 3 | 1 | 2 | 3 | 2.00
KEEL | 3 | 1 | 3 | 3 | 3 | 3 | 2 | 2.60
KNIME | 3 | 3 | 4 | 3 | 4 | 3 | 3 | 3.25
MiningMart | 3 | 1 | 1 | 3 | 2 | 3 | 3 | 2.40
ML-Flex | 3 | 3 | 2 | 3 | 3 | 3 | 3 | 2.85
Orange | 3 | 1 | 2 | 3 | 2 | 3 | 3 | 2.55
RapidMiner | 3 | 3 | 4 | 3 | 4 | 4 | 3 | 3.40
Rattle | 3 | 1 | 4 | 3 | 3 | 3 | 3 | 2.95
SPMF | 3 | 1 | 1 | 2 | 2 | 3 | 2 | 2.05
Tanagra | 1 | 1 | 2 | 3 | 3 | 1 | 3 | 2.05
V. Wabbit | 3 | 1 | 1 | 3 | 2 | 2 | 2 | 2.05
WEKA | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3.00

• Continuing the analysis, we evaluate the functionality category (see Table 8) for each of the 19 open source tools considered in this study. Similarly to the previous category, different features are evaluated and weighted according to their importance. As illustrated in Table 8, there are three open source data mining tools (KEEL, KNIME, and RapidMiner) that obtain a final score higher than the one obtained by the reference (WEKA). Additionally, in this group of features, ML-Flex has obtained the same score as the reference, i.e. an overall score of 3.00. Finally, it is interesting to analyse the partial scores assigned to the aforementioned tools (KEEL, KNIME, RapidMiner, WEKA, ML-Flex). KEEL appears as the tool with the highest number of algorithms for data mining, achieving the maximum score. The scores for the rest of the subcategories are quite similar, with the largest difference appearing when analysing the ability to export the results for ongoing use. For this specific criterion, ML-Flex behaves much worse than WEKA, whereas RapidMiner behaves better than the reference.
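The scores in Tables 7-10 are obtained exactly in this way: the median of the expert ratings is taken per criterion, and the criterion scores are then combined with the weights. The following sketch is a minimal illustration of that arithmetic, not code taken from any of the surveyed tools; the per-expert ratings shown are hypothetical, while the weights and criterion scores in the example are those of the ADaM row of Table 7.

```java
import java.util.Arrays;

public class ScoringSketch {
    // Median of the 1-5 ratings given by the expert panel for one criterion.
    static double median(int[] ratings) {
        int[] sorted = ratings.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    // Weighted score = sum over criteria of (criterion score x criterion weight); weights sum to 1.
    static double weightedScore(double[] scores, double[] weights) {
        double total = 0.0;
        for (int i = 0; i < scores.length; i++) {
            total += scores[i] * weights[i];
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical ratings from the 10 evaluators for one criterion of one tool.
        int[] expertRatings = {3, 3, 2, 3, 4, 3, 3, 1, 3, 3};
        System.out.println("Median rating: " + median(expertRatings)); // 3.0, robust to the outliers 1 and 4

        // Performance weights of Table 7 and the criterion scores of the ADaM row.
        double[] weights = {0.15, 0.10, 0.15, 0.15, 0.10, 0.15, 0.20};
        double[] adam    = {3, 1, 1, 3, 1, 2, 1};
        System.out.printf("Weighted performance score: %.2f%n", weightedScore(adam, weights)); // 1.75
    }
}
```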

Table 8: Results obtained for the functionality category.

Tool | Variety of algorithms (20%) | Help offered (10%) | Validation process (15%) | Variety of data (15%) | Algorithm modifiability (15%) | Data sampling (5%) | Reporting (10%) | Model exporting (10%) | Weighted score
ADaM | 1 | 2 | 2 | 1 | 1 | 3 | 1 | 1 | 1.35
ADAMS | 3 | 3 | 3 | 3 | 3 | 1 | 3 | 3 | 2.90
AlphaMiner | 2 | 3 | 3 | 2 | 3 | 3 | 3 | 1 | 2.45
CMSR | 1 | 2 | 3 | 2 | 3 | 1 | 2 | 3 | 2.15
D. ESOM | 1 | 1 | 2 | 1 | 3 | 1 | 2 | 1 | 1.55
DataMelt | 1 | 1 | 1 | 3 | 3 | 3 | 2 | 2 | 1.90
ELKI | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 1 | 2.80
GDataMine | 1 | 2 | 3 | 2 | 3 | 1 | 2 | 1 | 1.95
KEEL | 5 | 3 | 3 | 3 | 3 | 1 | 3 | 2 | 3.20
KNIME | 4 | 4 | 3 | 3 | 4 | 3 | 4 | 3 | 3.55
MiningMart | 1 | 2 | 1 | 1 | 2 | 1 | 2 | 2 | 1.45
ML-Flex | 4 | 3 | 3 | 3 | 3 | 3 | 3 | 1 | 3.00
Orange | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 2.80
RapidMiner | 3 | 4 | 5 | 3 | 4 | 3 | 4 | 4 | 3.75
Rattle | 2 | 3 | 3 | 3 | 3 | 2 | 3 | 3 | 2.75
SPMF | 2 | 1 | 1 | 1 | 3 | 1 | 2 | 1 | 1.60
Tanagra | 2 | 3 | 3 | 2 | 2 | 3 | 3 | 1 | 2.30
V. Wabbit | 1 | 1 | 1 | 3 | 2 | 1 | 2 | 1 | 1.55
WEKA | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3.00

• In a third analysis (see Table 9), the scoring procedure focuses on the usability aspect, which describes how easily an open source data mining system can be used to solve real world problems. Here, we evaluate several human interaction features, such as whether the graphical user interface is user-friendly and whether the results are provided in a meaningful way; whether the learning curve is appropriate for anyone who wants to use the tool; and the suitability of the tool for different users (beginner, intermediate, advanced). Other aspects, related to error reporting and the history of the actions taken during the mining process, are also analysed. The overall score obtained when all the criteria are considered is shown in Table 9. Results illustrate that the four best open source tools for the usability category are the following: KEEL, KNIME, Orange and RapidMiner. All four of these open source tools seem to be really promising for data mining tasks from the human interaction point of view.

Table 9: Results obtained for the usability category.

Tool | User interface (20%) | Learning curve (15%) | User types (15%) | Data visualization (20%) | Error reporting (15%) | Action history (10%) | Domain variety (5%) | Weighted score
ADaM | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1.15
ADAMS | 3 | 2 | 3 | 3 | 3 | 3 | 3 | 2.85
AlphaMiner | 3 | 4 | 3 | 2 | 3 | 3 | 3 | 2.95
CMSR | 2 | 2 | 2 | 3 | 3 | 1 | 1 | 2.20
D. ESOM | 2 | 3 | 2 | 2 | 3 | 1 | 1 | 2.15
DataMelt | 2 | 3 | 2 | 3 | 2 | 1 | 3 | 2.30
ELKI | 2 | 2 | 2 | 1 | 2 | 1 | 3 | 1.75
GDataMine | 2 | 3 | 2 | 1 | 2 | 2 | 2 | 1.95
KEEL | 3 | 5 | 4 | 4 | 3 | 3 | 4 | 4.10
MiningMart | 3 | 3 | 3 | 3 | 3 | 3 | 2 | 2.95
ML-Flex | 1 | 1 | 2 | 1 | 3 | 2 | 3 | 1.65
Orange | 5 | 5 | 3 | 4 | 3 | 3 | 3 | 3.90
RapidMiner | 5 | 5 | 4 | 4 | 3 | 3 | 3 | 4.05
Rattle | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3.00
SPMF | 4 | 4 | 4 | 1 | 2 | 1 | 3 | 2.75
Tanagra | 2 | 3 | 3 | 2 | 3 | 3 | 2 | 2.55
V. Wabbit | 1 | 1 | 2 | 1 | 1 | 1 | 2 | 1.20
WEKA | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3.00

When analysing the partial scores assigned to each specific subcategory, some remarkable differences are obtained. For example, as for the user interface, KEEL appears as the worst system of these four tools. On the contrary, KNIME, Orange and RapidMiner behave much better than the reference (WEKA). The same behaviour appears when the learning curve is analysed. Finally, when analysing the history of the actions taken during the mining process, KEEL behaves much worse than the rest.

• In the fourth category, the aim is to quantify other supplementary activities, including data cleaning, data manipulation and data transformation, among others. As shown in Table 10, two data mining tools (KNIME and RapidMiner) behave better than the others, since their final score is 3.20, which is higher than the reference value (3.00). Additionally, three tools (ADAMS, Orange and Rattle) obtained the same value as WEKA, which is the reference.


Table 10: Results obtained when analysing other supplementary tasks supported by the data mining tools.

Tool | Data cleaning (20%) | Value substitution (10%) | Data filtering (15%) | Binning (5%) | Attributes deriving (10%) | Randomization (5%) | Record deletion (5%) | Handling blanks (10%) | Metadata manipulation (10%) | Result feedback (10%) | Weighted score
ADaM | 2 | 1 | 1 | 3 | 1 | 2 | 1 | 1 | 1 | 1 | 1.35
ADAMS | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3.00
AlphaMiner | 2 | 3 | 3 | 3 | 4 | 1 | 1 | 3 | 3 | 3 | 2.70
CMSR | 1 | 1 | 3 | 1 | 1 | 1 | 1 | 1 | 4 | 2 | 1.70
D. ESOM | 1 | 2 | 1 | 1 | 2 | 1 | 2 | 1 | 3 | 1 | 1.45
DataMelt | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1.00
ELKI | 2 | 1 | 2 | 1 | 1 | 1 | 1 | 3 | 3 | 1 | 1.75
GDataMine | 1 | 1 | 1 | 3 | 1 | 1 | 1 | 1 | 1 | 1 | 1.10
KEEL | 3 | 2 | 3 | 3 | 1 | 2 | 3 | 2 | 3 | 3 | 2.55
KNIME | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 5 | 3 | 3.20
MiningMart | 3 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 2.90
ML-Flex | 2 | 2 | 2 | 2 | 2 | 3 | 2 | 2 | 4 | 3 | 2.35
Orange | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3.00
RapidMiner | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 5 | 3 | 3.20
Rattle | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3.00
SPMF | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 1 | 1.20
Tanagra | 2 | 1 | 2 | 3 | 1 | 3 | 2 | 1 | 3 | 3 | 2.00
V. Wabbit | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 1 | 1.20
WEKA | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3.00

This demonstrates that many of the open source tools studied in this article are highly promising and include many and varied data mining tasks. The small differences obtained among these tools show that six of them (KNIME, RapidMiner, ADAMS, Orange, Rattle and WEKA) behave similarly.

To summarize this scoring procedure, Figure 1 illustrates the scores obtained by each data mining tool. Analysing the four categories, it seems that RapidMiner, Orange and KNIME outperform the others. To examine this issue, the final score (see Table 11) is analysed. Results reveal that two tools (KNIME and RapidMiner) behave better than the reference, whereas Orange behaves similarly to WEKA (the reference).

Figure 1: Summary of scores obtained in each category by each of the studied tools.

Table 11: Final score obtained when analysing the set of tools according to the four categories.

Tool | Performance | Functionality | Usability | Auxiliary tasks | Score
ADaM | 1.75 | 1.35 | 1.15 | 1.35 | 1.40
ADAMS | 2.80 | 2.90 | 2.85 | 3.00 | 2.89
AlphaMiner | 2.80 | 2.45 | 2.95 | 2.70 | 2.73
CMSR | 2.10 | 2.15 | 2.20 | 1.70 | 2.04
D. ESOM | 2.15 | 1.55 | 2.15 | 1.45 | 1.90
DataMelt | 2.85 | 1.90 | 2.30 | 1.00 | 2.01
ELKI | 1.90 | 2.80 | 1.75 | 1.75 | 2.05
GDataMine | 2.00 | 1.95 | 1.95 | 1.10 | 1.60
KEEL | 2.60 | 3.20 | 3.05 | 2.55 | 2.93
KNIME | 3.25 | 3.55 | 4.10 | 3.20 | 3.53
MiningMart | 2.40 | 1.45 | 2.95 | 2.90 | 2.43
ML-Flex | 2.85 | 3.00 | 1.65 | 2.35 | 2.46
Orange | 2.55 | 2.80 | 3.90 | 3.00 | 2.99
RapidMiner | 3.40 | 3.75 | 4.05 | 3.20 | 3.60
Rattle | 2.95 | 2.75 | 3.00 | 3.00 | 2.93
SPMF | 2.05 | 1.60 | 2.75 | 1.20 | 1.90
Tanagra | 2.05 | 2.30 | 2.55 | 2.00 | 2.23
V. Wabbit | 2.05 | 1.55 | 1.20 | 1.20 | 1.50
WEKA | 3.00 | 3.00 | 3.00 | 3.00 | 3.00

From these three tools, the best one is RapidMiner, which behaves better than the reference in each of the four categories, obtaining a final score of 3.60 out of 5.00. Furthermore, according to the general characteristics described in Table 2, RapidMiner is a really promising tool, written in a well-known programming language, working on multiple operating systems and continuously updated. Nevertheless, according to the study carried out, both RapidMiner and KNIME can be considered similar tools (according to the scores obtained and the general features described in this subjective study). Finally, it is interesting to note that the final score obtained by Orange is very close to that of WEKA. However, when considering both the performance and functionality categories, Orange behaves clearly worse than WEKA. Hence, except for usability, it is possible to assert that WEKA is better than Orange.

Characterization procedure

In this second evaluation procedure, an objective analysis is carried out by checking whether each data mining tool satisfies a set of predefined features. All these features are organized into the following four groups:

• First, the aim is to analyse the system requirements needed by each of the tools studied in this paper (see Table 12). Here, a set of subcategories is analysed, including whether the tool runs on multiple computing platforms; whether the tool is able to work off-line or requires an external computer to fulfill its tasks; and whether the tool may work as a client/server system. The analysis performed in this step (see Table 12) illustrates that none of the tools analysed in this work is able to work as a thin client; all of them work as standalone systems. Four open source data mining tools (KNIME, ML-Flex, RapidMiner and WEKA) satisfy three of the subcategories analysed in this study.

Table 12: System requirements satisfied by each specific data mining tool.

Tool | Multi-platform | Thin client | Standalone | Client/Server | Total satisfied
ADaM | X | - | X | - | 2
ADAMS | X | - | X | - | 2
AlphaMiner | X | - | X | - | 2
CMSR | - | - | X | - | 1
D. ESOM | X | - | X | - | 2
DataMelt | X | - | X | - | 2
ELKI | X | - | X | - | 2
GDataMine | - | - | X | - | 1
KEEL | X | - | X | - | 2
KNIME | X | - | X | X | 3
MiningMart | X | - | X | - | 2
ML-Flex | X | - | X | X | 3
Orange | X | - | X | - | 2
RapidMiner | X | - | X | X | 3
Rattle | X | - | X | - | 2
SPMF | X | - | X | - | 2
Tanagra | - | - | X | - | 1
V. Wabbit | X | - | X | - | 2
WEKA | X | - | X | X | 3

Table 13: Type of approaches included in each specific data mining tool (classification, regression, clustering, associations, anomaly detection and time series), together with the total number of approach types supported by each tool.

• Second, the aim is to analyse the type of approaches included in each specific data mining tool (see Table 13). The first two subcategories (classification and regression) are related to predictive modelling. Another subcategory is related to the discovery of groups (clusters) or categories of data objects that share similarities. A subcategory is also included for approaches that describe significant associations or dependencies among features. Finally, anomaly detection and time series analysis (detecting patterns or trends in time-dependent data) are also considered. The analysis determines that only a pair of tools (KNIME and RapidMiner) include approaches of every type, i.e., classification, regression, clustering, association rule mining, anomaly detection and time series. On the contrary, the tools with the least variety of approaches are GDataMine and Vowpal Wabbit: GDataMine includes approaches for classification and association rule mining, whereas Vowpal Wabbit includes models for classification and regression tasks. As for each specific type of task, classification seems to be the most popular, since all the tools analysed in this article include at least one approach for this very task.

Table 14: Complementary activities available in each data mining tool: input formats (flat file, ODBC/JDBC, SAS, XML); preprocessing (data characterisation, data visualisation, data cleaning, record selection, attribute selection, data transformation); modelling techniques (decision trees, rule learning, neural networks, linear/logistic regression, association learning, instance-based learning, unsupervised learning, probabilistic learning); evaluation (hold-out/independent test, cross-validation, lift/gain charts, ROC analysis, summary reports, model visualisation); model handling (save/reload models, produce executable, PMML/XML export, comment fields); and miscellaneous features (data set size limit, parallelisation, incrementality, expert options, batch processing), together with the total number of activities supported by each tool.

• Third, each data mining tool is analysed to check which complementary activities are available. In this regard, a number of different aspects are evaluated (see Table 14). For instance, the input data formats (flat files, ODBC/JDBC, SAS, XML) allowed by the tools are checked. Data pre-processing and data preparation are also analysed, since they are really important tasks in any data mining process. Here, broadly used preprocessing techniques are considered: data characterisation; data visualisation (data spread); data cleaning, including techniques to remove outliers, single-valued attributes, missing values, etc.; record selection; attribute selection; and data transformation. The analysis continues with the set of available data mining approaches, considering the following standard classes: decision trees, rule learning, neural networks, linear/logistic regression, association rule mining, instance-based learning, unsupervised learning, and probabilistic learning. Additionally, each of these data mining approaches produces results that should be analysed and evaluated; hence, the following standard methods are also checked: hold-out, cross-validation, lift/gain charts, ROC analysis, summary reports, and model visualisation. All the information and models obtained through the data mining process should remain available for future use, so the following methods are also studied: save/reload models, executable models, the ability to export as PMML/XML files, and others. Finally, a set of miscellaneous methods is analysed to determine which tools integrate special features: data set size limit, support for parallelisation, expert options, batch processing, etc. As a result of the analysis of these complementary activities (see Table 14), the best tools are RapidMiner, KNIME, KEEL, WEKA and ADAMS; all of them satisfy at least twenty-five of these complementary activities. On the contrary, the worst data mining tools in this respect are SPMF, Databionics ESOM, GDataMine and Vowpal Wabbit, which satisfy only six or seven of the complementary activities.

• Finally, a series of features related to the user interface are also checked (see Table 15). In this study, we focus on four main features: whether the tool offers a graphical user interface; whether it is command-line driven; whether it supports simple visual programming based on selecting and sequencing icons on the screen; and whether it includes any on-line help. Considering these four features, a set of three data mining tools (ADaM, ML-Flex and Vowpal Wabbit) satisfies a low number of features. As for the rest of the tools, there is not much difference among them.


Table 15: User interface features satisfied by the different data mining tools (graphical user interface, command line, drag & drop, and on-line help), together with the total number of features satisfied by each tool.

Figure 2: Summary of the percentage of features satisfied by each data mining tool according to the set of four categories.

Finally, Figure 2 aims at summarizing which tools satisfy a higher number of features in each category. It illustrates the percentage of features (on a per-unit basis) that each data mining tool satisfies under this second evaluation procedure.

Discussion

The analysis carried out in this article, which is based on two well-known evaluation frameworks, is summarized in Figure 3. Here, the percentage (on a per-unit basis) of features satisfied by each tool is illustrated. Results reveal that KNIME and RapidMiner are the best-rated tools for the scoring procedure (performance, functionality, usability and supplementary tasks that support the data mining process). Additionally, the characterization procedure (system requirements, type of approaches, data mining activities, and user interface) has revealed that both KNIME and RapidMiner are the most interesting tools, together with WEKA and KEEL.

Figure 3: Summary of the results obtained for the two methodologies (scoring and characterization procedures).

Focusing on some specific features satisfied by the tools, it turns out that DataMelt, KNIME and RapidMiner are the only open source data mining tools that support XML as an input file format. This may be an advantage with regard to WEKA or KEEL in situations where the user needs to load XML data. On the contrary, WEKA enables the algorithms to be run from the command line, and this option is not available for either KNIME or RapidMiner. Hence, it is important to highlight that the main value of this study lies in providing the research community with an extensive analysis, from a subjective and an objective point of view, of different open source data mining tools. For that reason, there is no tool that outperforms the others, but rather a set of tools that include a higher number of features for a specific group of characteristics.


Concluding Remarks

The availability of data mining software at no cost, and also the chance of better understanding the algorithms by examining their source code, is providing the research community with a great opportunity to tune existing algorithms and improve them with different features. In this regard, a complete analysis of the features of each data mining tool is required to know and understand existing open source data mining software. This analysis has been carried out by using both a subjective and an objective evaluation methodology.

In this paper, an assessment of the state of the art of 19 open source data mining tools has been carried out. The major aim of this paper is to perform a comparative study, which is useful to provide the user with the features included in each data mining tool. The key point of this comparative study was to provide a general analysis based on a varied set of features that, according to the evaluation frameworks, any good open source data mining tool should satisfy. The performance, functionality, usability and the tasks supported by each tool are analysed according to a well-known methodology. Additionally, the type of approaches included, some process-dependent features, and other features such as the user interface and system requirements are also evaluated. As a result, RapidMiner, KNIME and WEKA appear as the most promising open source data mining tools under the two specific evaluation procedures.

Acknowledgements

This work was supported by the Spanish Ministry of Economy and Competitiveness under the project TIN2014-55252-P, and by FEDER funds.

References

1. P. N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining (Addison Wesley, 2005), ISBN 978-0321321367.
2. S. Ventura and J. M. Luna, Pattern Mining with Evolutionary Algorithms (Springer, 2016), ISBN 978-3-319-33857-6, URL http://dx.doi.org/10.1007/978-3-319-33858-3.
3. M. J. Berry and G. Linoff, Data Mining Techniques: For Marketing, Sales, and Customer Support (John Wiley & Sons, Inc., New York, NY, USA, 2011), ISBN 978-0-470-65093-6.
4. J. Alcalá-Fdez, A. Fernández, J. Luengo, J. Derrac, and S. García, Multiple-Valued Logic and Soft Computing 17, 255 (2011).
5. R. R. Bouckaert, E. Frank, M. A. Hall, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, Journal of Machine Learning Research 11, 2533 (2010), ISSN 1532-4435.
6. J. Demšar, T. Curk, A. Erjavec, C. Gorup, T. Hočevar, M. Milutinovič, M. Možina, M. Polajnar, M. Toplak, A. Starič, et al., Journal of Machine Learning Research 14, 2349 (2013), ISSN 1532-4435.
7. X. Wu and V. Kumar, The Top Ten Algorithms in Data Mining (Chapman & Hall/CRC, 2009), 1st ed., ISBN 1420089641, 9781420089646.
8. I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz, and T. Euler, in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, New York, NY, USA, 2006), KDD '06, pp. 935–940, ISBN 1-59593-339-5, URL http://doi.acm.org/10.1145/1150402.1150531.
9. S. Beisken, T. Meinl, B. Wiswedel, L. F. de Figueiredo, M. Berthold, and C. Steinbeck, BMC Bioinformatics 14 (2013), ISSN 1471-2105, URL http://dx.doi.org/10.1186/1471-2105-14-257.
10. X. Chen, Y. Ye, G. Williams, and X. Xu, in Proceedings of the 2007 International Conference on Emerging Technologies in Knowledge Discovery and Data Mining (Springer-Verlag, Berlin, Heidelberg, 2007), PAKDD'07, pp. 3–14, ISBN 3-540-77016-X, 978-3-540-77016-9, URL http://dl.acm.org/citation.cfm?id=1780582.1780585.
11. S. Sonnenburg, M. L. Braun, C. S. Ong, S. Bengio, L. Bottou, G. Holmes, Y. LeCun, K. R. Müller, F. Pereira, C. E. Rasmussen, et al., Journal of Machine Learning Research 8, 2443 (2007), ISSN 1532-4435.
12. C. Giraud-Carrier and O. Povel, Intelligent Data Analysis 7, 181 (2003), ISSN 1088-467X.
13. B. Carey and C. Marjaniemi, in Proceedings of the Thirty-second Annual Hawaii International Conference on System Sciences (IEEE Computer Society, Washington, DC, USA, 1999), HICSS '99, pp. 6009–6020, ISBN 0-7695-0001-3, URL http://dl.acm.org/citation.cfm?id=874072.876160.
14. M. van Leeuwen, in Interactive Knowledge Discovery and Data Mining in Biomedical Informatics, edited by A. H. Gandomi and C. Alavi, A. H. Ryan (Springer Berlin Heidelberg, 2015), vol. 8401 of Lecture Notes in Computer Science, pp. 169–182, ISBN 978-3-662-43967-8.
15. K. Fogel, Producing Open Source Software: How to Run a Successful Free Software Project (O'Reilly Media, Inc., 2005), ISBN 0596007590.
16. J. Rushing, R. Ramachandran, U. Nair, S. Graves, R. Welch, and H. Lin, Computers and Geosciences 31, 607 (2005), ISSN 0098-3004.
17. U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, eds., Advances in Knowledge Discovery and Data Mining (American Association for Artificial Intelligence, Menlo Park, CA, USA, 1996), ISBN 0-262-56097-6.
18. P. Reutemann and G. Holmes, in Proceedings of the 4th International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications (Sydney, Australia, 2015), BigMine 2015, pp. 5–8.
19. B. Ludäscher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger, M. Jones, E. A. Lee, J. Tao, and Y. Zhao, Concurrency and Computation: Practice & Experience 18, 1039 (2006), ISSN 1532-0626.
20. A. Ultsch and F. Morchen, Tech. Rep., Data Bionics Research Group, University of Marburg (2005).
21. S. V. Chekanov, Scientific Data Analysis Using Jython Scripting and Java (Springer Publishing Company, Incorporated, 2010), 1st ed., ISBN 1849962863, 9781849962865.
22. E. Achtert, H. P. Kriegel, and A. Zimek, in Proceedings of the 20th International Conference on Scientific and Statistical Database Management (Hong Kong, China, July 9-11, 2008), SSDBM 2008, pp. 580–585.
23. R. Mikut and M. Reischl, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1, 431 (2011), ISSN 1942-4795, URL http://dx.doi.org/10.1002/widm.24.
24. A. A. Freitas, Data Mining and Knowledge Discovery with Evolutionary Algorithms (Springer-Verlag Berlin Heidelberg, 2002), ISBN 3-540-43331-7.
25. K. Morik and M. Scholz, Intelligent Technologies for Information Analysis, pp. 47–65 (2004).
26. S. R. Piccolo and L. J. Frey, Journal of Machine Learning Research 13, 555 (2012), ISSN 1532-4435, URL http://dl.acm.org/citation.cfm?id=2503308.2188404.
27. G. J. Williams, The R Journal 1, 45 (2009), URL http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.
28. P. Fournier-Viger, A. Gomariz, T. Gueniche, A. Soltani, C. W. Wu, and V. S. Tseng, Journal of Machine Learning Research 15, 3389 (2014), ISSN 1532-4435.
29. C. Zhang and S. Zhang, Association Rule Mining: Models and Algorithms (Springer Berlin / Heidelberg, 2002), ISBN 3-540-43533-6.
