2015 11th International Conference on Signal-Image Technology & Internet-Based Systems
An evaluation of data processing solutions considering preprocessing and "special" features

Rabiul Islam Jony¹, Nabeel Mohammed¹, Ahsan Habib¹, Sifat Momen¹, Rakibul Islam Rony²
¹ University of Liberal Arts Bangladesh, {rabiul.islam, nabeel.mohammed, ahsan.habib, sifat.momen}@ulab.edu.bd
² Primeasia University, [email protected]
Abstract— Recently we have witnessed an explosion of data in the digital world. To make sense of these data, proper tools are required to carry out extensive data analysis. This is especially true with the advent of Big Data and the way it is slowly becoming a part of everyday life. In this paper we evaluate multiple data processing tools in light of their data pre-processing features and "special features". These "special features" are part of the contribution of this paper, as they were gathered from a literature survey and through interviews of 20 experts from industry and academia. Based on these features, 13 tools were scored, from which we selected four for further hands-on testing. The hands-on testing highlighted the strengths and weaknesses of these tools, giving insight into their suitability for Big Data processing.

Index Terms—Data preprocessing, preprocessing tools, data analysis, data mining, Big Data processing
I. INTRODUCTION

In the year 2000, the Sloan Digital Sky Survey (SDSS) started creating a three-dimensional map of the universe. In the first couple of weeks after its inception, its telescopes in New Mexico collected more data than was available in the entire astronomical repository. A decade later, the archive contains around 140 terabytes of data. Another large synoptic survey telescope, based in Chile, is expected to collect a similar quantity of data every five days by the end of 2016 [1]. Wal-Mart, the retail giant, generates around 2.5 petabytes of data from 1 million customer transactions every hour [2]. Facebook, a social network, stores more than 500 terabytes of new data every day. Search engines like Google process 20 petabytes of data every day [1]. Until 2003, humans had created 5 exabytes of data in total; today that amount of data is created in only two days [3]. The amount of data in the digital world reached 2.72 zettabytes in 2012 and is expected to double every two years [4]. All these examples show the data explosion taking place in this information age. Data is becoming large and complex, and consequently difficult to process with traditional data processing applications. Hence, Big Data techniques need to be incorporated to process such data.

Data analysis is a two-step process, consisting of preprocessing and actual processing. Raw data often contain noise and are inconsistent and incomplete. Therefore, they need to be pre-processed in order to obtain a meaningful data set. After successful pre-processing, the actual processing can begin. In this paper, we provide an evaluation of data mining tools considering their pre-processing features and "special features". The objective of the paper is to evaluate a range of data mining tools with particular attention to their suitability for processing large data sets (i.e. Big Data [5]). Some twenty experts from different telecom operators, vendors and Aalto University, Finland were interviewed for this research to provide insight into Big Data preprocessing, feature requirements and available tools.

978-1-4673-9721-6/15 $31.00 © 2015 IEEE. DOI 10.1109/SITIS.2015.125
The rest of the paper is organized as follows: Section II describes data pre-processing techniques, followed by data pre-processing solutions in Section III. Section IV describes the hands-on testing of the selected tools, and the results are presented in Section V. Finally, the paper is concluded in Section VI with a discussion of the results and future research scope.

II. DATA PREPROCESSING

According to [6], data preprocessing can be up to 80% of the total analysis work; once the data has been joined, cleaned and transformed, the analysis is only 20% of the work. William H. Inmon, also known as the father of data warehousing, has stated that data extraction, cleaning and transformation comprise the majority of the work of building a data warehouse [7]. Raw data is usually noisy, inconsistent and incomplete due to its typically large size and its likely origin from multiple, heterogeneous sources. Low-quality data leads to low-quality results. Real-world data is noisy, where the term noisy refers to incorrect or corrupted values. There are multiple causes of such noise, including technical problems within the tools, human error, transmission errors and possible outlier values.
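One common way to flag such outlier values is the interquartile-range rule. The following is a minimal sketch, using only the Python standard library; the sensor readings are made up for illustration:

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] as likely noise."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

# Two corrupted readings hide among plausible ones
readings = [21.0, 22.5, 21.8, 22.1, 99.9, 21.4, -50.0, 22.0]
print(iqr_outliers(readings))  # [99.9, -50.0]
```

A real preprocessing tool would combine such statistical checks with domain rules (e.g. physically impossible values), but the principle is the same.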
Inconsistency is another common characteristic of raw data. Data inconsistency means that various copies of the data no longer agree with one another; it typically arises from conflicting versions, data redundancy and discrepancies. Incomplete data lacks attribute values or certain attributes of interest, and/or has missing values in the dataset, typically due to attribute unavailability or equipment malfunction.

The target of the preprocessing phase is to increase the quality of the data, i.e. to address these problems and prepare the data for the actual analysis. Data preprocessing has an important impact on the data value chain and typically includes several major tasks and techniques.

A. Major Tasks and Techniques in Data Preprocessing

In [8], data preprocessing has been divided into two parts, namely classical preprocessing and advanced preprocessing. Classical preprocessing can be further divided into data fusion, data cleaning and data structuration phases; the advanced preprocessing part includes only the data summarization phase. In [9], data preprocessing techniques have been divided into three parts, namely data transformation, information gathering and new information generation. This study identifies six major data preprocessing tasks, which together cover the three parts mentioned above.

Figure 1: Major data preprocessing tasks and techniques (typical process flow)

Figure 1 shows a typical process flow of data preprocessing. Usually, the process starts with data import from the source and ends with feeding the preprocessed data to the analysis stage. As Figure 1 shows, the data transformation technique involves three major tasks: data import, data summarization and data cleaning. Information gathering includes data integration and data reduction, while new information generation simply requires data transformation.

III. PREPROCESSING SOLUTIONS

Data preprocessing tools require a number of preprocessing, analytics and performance features. The tools capable of fulfilling these feature requirements are best suited as preprocessing solutions.

A. Feature requirements

As mentioned earlier, twenty experts were interviewed during this research. They first described their current work on Big Data and then answered questions on Big Data preprocessing, feature requirements and suitable tools. Based on the literature review and the interviews, this study classifies the feature requirements into three categories:

1. Data preprocessing features
2. Analytics features
3. Performance and usability features

The data preprocessing features reflect the major data preprocessing tasks and techniques described in Section II. This study identified 27 data preprocessing features, listed in Table 1 below:

1. Handle missing values            15. Reduction
2. Filtering                        16. Name, role and value modification
3. Aggregation                      17. Type conversion
4. Validation                       18. Optimization
5. Correlating                      19. Attribute generation
6. Enrichment                       20. Sorting
7. Data synchronization             21. Rotation
8. Profiling                        22. Set operations
9. Outlier detection and analysis   23. Classification
10. Metadata transformation         24. Regression
11. Sampling                        25. Segmentation
12. Discretization                  26. Manipulation
13. Clustering                      27. PCA
14. Transformation

Table 1: Data preprocessing features

Although a few of the features included in Table 1, such as clustering and regression, are generally considered processing features, they are usually important in the context of Big Data preprocessing.
In this study, the selected tools were evaluated according to their capability to meet these functional requirements and were scored from 0 to 3. For example, Table 2 below shows the criteria of scoring the feature “Missing value analysis”.
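The increasing capability levels for missing-value handling can be made concrete with a small Python sketch. The data and the nearest-neighbour averaging (standing in for "calculate or predict") are illustrative assumptions, not part of any tool tested here:

```python
def impute(values, level):
    """Fill missing (None) entries at increasing capability levels."""
    present = [v for v in values if v is not None]
    if level == 1:                     # level 1: constants such as zero
        return [0.0 if v is None else v for v in values]
    if level == 2:                     # level 2: attribute mean (or median)
        mean = sum(present) / len(present)
        return [mean if v is None else v for v in values]
    # level 3: calculate/predict the value -- here the average of the
    # nearest known neighbours, a minimal stand-in for model-based imputation
    out = list(values)
    for i, v in enumerate(out):
        if v is None:
            left = next(x for x in reversed(out[:i]) if x is not None)
            right = next(x for x in out[i + 1:] if x is not None)
            out[i] = (left + right) / 2
    return out

print(impute([4.0, None, 8.0], 2))  # [4.0, 6.0, 8.0]
```

The sketch assumes interior gaps for level 3; production tools additionally handle missing values at the boundaries and non-numeric attributes.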
Scoring description for "Missing value analysis":
3 — Capable of auto-detecting, imputing, calculating or predicting missing values
2 — Capable of filling missing values with, e.g., the attribute mean or median
1 — Capable of filling missing values with constants such as zero or infinity
0 — No capability to handle missing values

Table 2: Scoring levels for "Missing value analysis"

Tables 3 and 4 below list the analytics features and the performance and usability features, respectively.

Analytics features: database connectivity; access to all data sources; access to all data types; data extraction; multidimensional-format data delivery; NoSQL support; Hadoop extension; advanced preprocessing; parallel preprocessing; modelling; series preprocessing; real-time preprocessing; text analysis; signs analysis; exploration; data visualization; migration; documentation; multi-dimensional plotting; dashboarding; tokenization; stop-word removal; bag-of-words creation; stemming.

Table 3: Analytics feature list

Performance and usability features: error management; error detection; auto fixing; performance; low latency; availability; in-memory preprocessing; smart export and import of data; real-time preprocessing; reliability; accuracy; ELT support; low user-intervention requirement; low memory consumption; ease of use; user-friendly GUI; easy to configure; easy learning; efficiency; process storage; process import/export; minimum coding requirement; self-organizing; flexibility (number of options/features); price (free or low price).

Table 4: Performance and Usability feature list

Finally, the "special features" were selected from the analytics and the performance and usability feature lists, based on how frequently the experts mentioned them in the interviews:

1. Less memory consumption       7. User-friendly GUI
2. Real-time preprocessing       8. Dashboarding
3. Low latency                   9. Multi-dimensional plotting
4. Minimum coding requirement   10. Text analysis
5. Easy to learn                11. Hadoop extension
6. Documentation                12. Access to all data types

Table 5: List of "Special features"

B. Selection of preprocessing tools

At first, based on an Internet study and the interviews, this study listed around fifty (50) available analytics/data mining, ETL (Extract, Transform and Load), Big Data-specific, reporting, and predictive analysis tools capable of fulfilling the preprocessing requirements. However, it was not feasible to test the capability of every tool, so the list was narrowed to thirteen (13) tools based on a qualitative preprocessing ranking.

The qualitative rankings were based on:
• Online reviews
• Popularity (how many analysts are using the tool) [10]
• Preprocessing capabilities
• Analytics, and performance and usability features
• Feature descriptions from the tool manufacturers' websites
• Preliminary testing of available tools

Preliminary testing consisted of simple checks of each tool's preprocessing capability and was not part of the hands-on testing.

These tools were analyzed and scored according to their data preprocessing capability during the preliminary testing. Since presenting all scoring results was not feasible, Figure 2 presents only the features with the highest variance in scores across the tools. The tools were also analyzed based on the availability of the "special features"; Table 6 summarizes this result, where a tool is marked with "x" if it offers that "special feature".
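The variance-based selection of features for Figure 2 can be sketched as follows. The 0-3 score matrix here is hypothetical, purely to show the mechanism:

```python
import statistics

# Hypothetical 0-3 preprocessing scores, one entry per tool
scores = {
    "Aggregation":    [3, 1, 0, 2],
    "Validation":     [2, 2, 2, 2],   # identical scores: no discrimination
    "Discretization": [3, 0, 1, 3],
}

# Rank features by how much the tools' scores differ;
# high-variance features discriminate best between tools
by_variance = sorted(scores,
                     key=lambda f: statistics.pvariance(scores[f]),
                     reverse=True)
print(by_variance[0])  # Discretization
```

Features on which every tool scores the same (like "Validation" above) carry no comparative information and are the first to be dropped from a summary figure.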
Figure 2: Scoring results of selected tools on high-variance data preprocessing features (Aggregation, Validation, Correlation analysis, Outlier detection and analysis, Discretization, Reduction, Rotation, Classification, Regression and PCA, scored 0-3)

Table 6 marks, for each of the thirteen tools (RapidMiner, Pentaho Data Integrator, DataPreparator, Talend Open Studio for Data Integration, CloverETL Community Edition, Orange, Excel, Matlab, TANAGRA, Weka, R, SPSS Statistics and KNIME), which of the twelve "special features" it offers: access to all data types, Hadoop extension, text analysis, multi-dimensional plotting, documentation, dashboarding, user-friendly GUI, minimum coding requirement, low latency, real-time preprocessing, less memory consumption and easy to learn.

Table 6: "Special features" availability in tools
IV. HANDS-ON TESTING OF THE TOOLS

Finally, four tools were selected for the hands-on testing:
i. KNIME
ii. RapidMiner
iii. Orange
iv. IBM SPSS Statistics

These four tools were selected because they all passed the conditions of being easy to learn and requiring minimum coding, as well as having a good spread of the features summarized in Table 6. The tools were tested through four pre-specified preprocessing tasks using six pre-specified datasets. Tool performance was measured according to the criteria described in Section IV(C), and the results are presented in Section V. The hands-on testing was conducted on an Intel Core 2 Duo machine with 4GB RAM running a 64-bit version of Windows 7.

A. Datasets for hands-on testing
For the hands-on testing, six data sets with different value types and sizes were used, summarized in Table 7.

No  Dataset name           File type  Attributes  File size  Value types
1   Survey example         CSV        11          110KB      Numeric
2   Mobile usage behavior  CSV        66          56KB       Numeric, Nominal, Polynomial, Binominal
3   Time series            CSV        132         551KB      Numeric
4   Text data              TXT        NA          6KB        Text
5   Large dataset 1        CSV        400         69000KB    Numeric
6   Large dataset 2        CSV        700         50000KB    Numeric

Table 7: Summary of datasets for hands-on testing
The Survey Example dataset contains the results of surveys conducted for 11 products/services. The Mobile Usage Behavior dataset is an edited version of a handset-based behavioral data set originally called 'OtaSizzle Handset-based Data', collected for research purposes by SizzleLab; the OtaSizzle project used MobiTrack software to collect the data from users' handsets. The time series dataset contains rows of dates, times and service IDs of subscribers' usage. The text data represents unstructured data, and the two large datasets were built manually for the test.

B. Preprocessing tasks for hands-on testing

Four preprocessing tasks were created for the hands-on testing of the tools. The tasks were designed to cover the major tasks and techniques of data preprocessing.

Preprocessing task 1 was performed on datasets 1 and 5. The task was to create 39 new attributes presenting the count of each response given by the participants. For example, the resulting attribute '1' shows how many '1's were chosen by a participant in the corresponding row. This task is categorized as a simple preprocessing task.

Preprocessing task 2 was performed on dataset 2, and included:
• Defining the data types correctly
• Declaring missing values
• Imputing missing values
• Removing unusable variables
• Outlier detection
• Filtering out missing values
• Aggregating two parameters
• Resampling the dataset
• Applying Principal Component Analysis (PCA)

This task is also categorized as a simple preprocessing task.

Preprocessing task 3 was performed on datasets 3 and 6. The task was to aggregate each group of three attributes by summing them into a new attribute; in the result, two of the three attributes are deleted, leaving only the first attribute of the group and the resulting attribute. Since the data sets contain many attributes, a tool passes this task only if it can perform it through automatic iteration or looping. This task is categorized as an advanced preprocessing task.

Preprocessing task 4 was performed on dataset 4. The task was to preprocess the text file: reading it, tokenizing it if possible, filtering out the stop words and building a box (bag) of words.

C. Tools performance evaluation criteria

The hands-on testing results were produced based on six criteria, described below.

Complexity: To perform a preprocessing task successfully, every tool requires the user to select a series of steps. Complexity describes how easy it is to perform the preprocessing task (e.g. ease of learning, time spent and the complexity of building the process). This criterion has three possible outcomes, in order of complexity: complex, moderate and less complex.

Exact intended output: For each preprocessing task, the required output was also defined. Sometimes a tool can perform the preprocessing task but cannot present the output exactly as intended. This criterion has two possible outcomes: achieved and not achieved.

Memory taken: This criterion describes how much system memory the tool consumes while running the process. Some tools require more memory than the system provides to run a big process on large data, and therefore fail to perform the process. This criterion takes numeric values representing the memory taken by the tool in kilobytes (only in cases where the tool was able to complete the task).

Time taken to execute: This criterion measures the time taken by a tool to execute the task. The time taken to build the process or load the data was not considered. Some tools load the data separately and then execute the task, while others load the data and execute the task as a single process; therefore, for a fair comparison, this criterion covers only the execution time of the process. IBM SPSS Statistics and Orange perform the tasks step by step, so this criterion was not applicable to them. The values represent the time taken to execute a task in seconds (s).
User intervention required: This criterion reflects the capability of a tool to perform specific preprocessing tasks automatically, with little user intervention. It has three possible outcomes, in order: much, moderate and less.

Result: This criterion summarizes whether a tool was capable of performing the preprocessing task or not. It has two possible outcomes, passed and failed, bearing their usual meanings.

The next section describes the results of the hands-on testing.
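The automatic iteration demanded by preprocessing task 3 can be sketched in a few lines of Python. The row values are hypothetical, and keeping the first attribute of each group alongside the sum is one reading of the task description:

```python
def aggregate_triples(row):
    """For every consecutive group of three attributes, keep the first
    attribute and append a new attribute holding the group's sum.
    The loop over attribute groups is what task 3 requires a tool
    to automate."""
    out = []
    for i in range(0, len(row), 3):
        out.append(row[i])             # first attribute of the group kept
        out.append(sum(row[i:i + 3]))  # new aggregated attribute
    return out

print(aggregate_triples([1, 2, 3, 10, 20, 30]))  # [1, 6, 10, 60]
```

A tool without looping support would force the user to repeat this by hand for each of the 132 (dataset 3) or 700 (dataset 6) attributes, which is exactly the failure mode task 3 was designed to expose.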
V. RESULTS

Table 8 below presents a summary of the task 1 hands-on testing.
IBM SPSS Statistics was the only tool unable to perform task 4¹.
Task 1                          Dataset 1      Dataset 5 (Large data)

RapidMiner (Version 5)
  Complexity                    Less complex   Moderate
  Exact intended output         Achieved       Achieved
  Memory taken                  253276K        1144864K
  Time taken to execute         0s             420s
  Result                        Passed         Passed
  User intervention required    Much           Much

KNIME (Version 2.8.0)
  Complexity                    Moderate       Less complex
  Exact intended output         Achieved       Achieved
  Memory taken                  167000K        181321K
  Time taken to execute         0s             360s
  Result                        Passed         Passed
  User intervention required    Less           Less

Orange (Version 2.7)
  Complexity                    Complex        Complex
  Exact intended output         Not achieved   NA
  Memory taken                  121424K        NA
  Time taken to execute         NA             NA
  Result                        Passed         Failed
  User intervention required    Much           NA

IBM SPSS Statistics (Version 21)
  Complexity                    Less complex   Less complex
  Exact intended output         Achieved       Achieved
  Memory taken                  167064K        181664K
  Time taken to execute         NA             NA
  Result                        Passed         Passed
  User intervention required    Much           Much

Table 8: Tools hands-on testing results for task 1
                                Task 2         Task 4

RapidMiner (Version 5)
  Complexity                    Less complex   Less complex
  Exact intended output         Achieved       Achieved
  Memory taken                  265156K        783100K
  Time taken to execute         1s             0s
  Result                        Passed         Passed
  User intervention required    Moderate       Less

KNIME (Version 2.8.0)
  Complexity                    Moderate       Less complex
  Exact intended output         Achieved       Achieved
  Memory taken                  176364K        204708K
  Time taken to execute         1s             0s
  Result                        Passed         Passed
  User intervention required    Less           Less

Orange (Version 2.7)
  Complexity                    Moderate       Less complex
  Exact intended output         Achieved       Achieved
  Memory taken                  125552K        141744K
  Time taken to execute         NA             NA
  Result                        Passed         Passed
  User intervention required    Moderate       Less

IBM SPSS Statistics (Version 21)
  Complexity                    Moderate       NA
  Exact intended output         Achieved       NA
  Memory taken                  170108K        NA
  Time taken to execute         NA             NA
  Result                        Passed         Failed
  User intervention required    Much           NA

Table 9: Hands-on testing results for task 2 and task 4
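For reference, the kind of bag-of-words construction required by task 4 can be sketched in plain Python. The whitespace tokenizer and the minimal stop-word list are simplifying assumptions, not the pipeline of any specific tool tested here:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "to"}   # assumed minimal list

def bag_of_words(text):
    """Tokenize, drop stop words, and count the remaining terms."""
    tokens = text.lower().split()               # naive tokenization
    kept = [t for t in tokens if t not in STOP_WORDS]
    return Counter(kept)                        # the "box of words"

print(bag_of_words("the data and the tools of big data"))
```

Real text-preprocessing operators additionally apply stemming and punctuation handling, as listed among the analytics features in Table 3.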
Table 8 shows that all the selected tools were able to complete task 1 on dataset 1. Of all the tools, KNIME required the least amount of system memory to complete the task; RapidMiner's performance was average, and SPSS had the most favorable result. However, Orange failed to perform the task on the large dataset, because it required more system memory than was available.

As mentioned earlier, tasks 1 and 3 were performed on both ordinary and large datasets, which was not the case for tasks 2 and 4. Therefore, the results of tasks 2 and 4 are summarized in a single table. All four tools successfully completed preprocessing task 2. RapidMiner and Orange had very similar performance, with Orange taking the lowest amount of system memory; KNIME again had the best scores overall.
Table 10 below summarizes the performance of the tools on task 3. For preprocessing task 3, RapidMiner was unable to provide the exact intended output, although it was able to complete the task. The task was too complex to complete successfully in Orange: as an advanced-level preprocessing task it required automatic iteration, and Orange does not have such functionality. In IBM SPSS Statistics the task would have required additional Java coding, and it was therefore scored as failed. The results show that KNIME performed the best on task 3. In summary, RapidMiner and KNIME were able to perform all four tasks; however, RapidMiner consumed a large amount of memory while running preprocessing tasks on the large datasets.

¹ IBM SPSS has a separate product for text processing called IBM SPSS Text Analytics.
Task 3                          Dataset 3      Dataset 6 (Large data)

RapidMiner (Version 5)
  Complexity                    Complex        Complex
  Exact intended output         Not achieved   Not achieved
  Memory taken                  783684K        2144346K
  Time taken to execute         1s             1620s
  Result                        Passed         Passed
  User intervention required    Less           Moderate

KNIME (Version 2.8.0)
  Complexity                    Complex        Less complex
  Exact intended output         Achieved       Achieved
  Memory taken                  180008K        182500K
  Time taken to execute         2s             2340s
  Result                        Passed         Passed
  User intervention required    Less           Less

Orange (Version 2.7)
  Complexity                    Complex        Complex
  Exact intended output         NA             NA
  Memory taken                  NA             NA
  Time taken to execute         NA             NA
  Result                        Failed         Failed
  User intervention required    NA             NA

IBM SPSS Statistics (Version 21)
  Complexity                    Complex        Complex
  Exact intended output         NA             NA
  Memory taken                  NA             NA
  Time taken to execute         NA             NA
  Result                        Failed         Failed
  User intervention required    NA             NA
Of all the tools, KNIME and RapidMiner were the most user-friendly (by the criteria presented in this paper) as well as the most versatile in preprocessing tasks. It is our observation that none of the tools, in their current state, is well suited for widespread Big Data preprocessing. From the results presented, it is clear that even for datasets smaller than one hundred megabytes, some of these tools required about three times more memory than the data size; in fact, RapidMiner required almost two gigabytes of memory to process fifty megabytes of data. At these memory usage levels, the tools are not suitable for processing truly Big Data. While some of the tools have Big Data extensions, these extensions are not easy for a non-technical person to set up. If the vendors take into account some of the "special features" identified in this paper in the development of their Big Data extensions, these tools will see broader mainstream adoption.
Table 10: Hands-on testing results for task 3

'Radoop' is an extension for RapidMiner, built for editing and running ETL, data analytics and machine learning processes on Hadoop. For preprocessing on large datasets, KNIME performed the best, for the following reasons:

1. KNIME consumes very little memory while running a big process on large data.
2. KNIME can load large data and perform the preprocessing task in a single process; separate data loading is not required.
3. Robust big data extensions are available for KNIME for distributed frameworks such as Hadoop.
None of the Big Data extensions were evaluated, as the level of technical expertise required to set them up is far beyond that of a normal user.

VI. CONCLUSION

In this study, we have evaluated multiple existing data processing tools. The evaluation was done on the basis of popular preprocessing features, which are expected to be present in such tools, and "special features", which are usability features identified through a literature survey and by interviewing twenty experts from industry and academia.

Fifty such tools were evaluated; however, due to space constraints, this paper presented the results for thirteen tools. The initial evaluation considered preprocessing and "special features"; from it we identified four tools with the best overall performance, which were further evaluated through hands-on testing on six datasets using four tasks.
The preprocessing tasks used for the hands-on testing involved several sub-tasks, selected to exercise the widest possible range of preprocessing functionality. KNIME, RapidMiner and Orange are open-source tools and allow users to create new nodes according to their requirements, extending the tools' functionality. IBM SPSS Statistics also provides an option to further modify analyses using command syntax; however, IBM SPSS Statistics is a commercial tool and therefore might not be a cost-efficient solution for small organizations. Other commercial tools also have rich data preprocessing functionality but were not tested because they were unavailable to us. In future work, we will continue to evaluate such tools on datasets with larger volume, variety and velocity; in particular, we are keen to evaluate the Big Data extensions of these tools and how user-friendly they can be made.
VII. BIBLIOGRAPHY

[1] The Economist, "Data, data everywhere," The Economist, February 2010.
[2] Infosys, "Big Data: Challenges and Opportunities," 2013.
[3] Eric Schmidt. (2010, August) TechCrunch. [Online]. http://techcrunch.com/2010/08/04/schmidt-data/
[4] S. Sagiroglu and D. Sinanc, "Big data: A review," in Collaboration Technologies and Systems (CTS), 2013 International Conference on, San Diego, CA, 2013, pp. 42-47.
[5] Mark A. Beyer and Douglas Laney, "The Importance of 'Big Data': A Definition," Gartner, Analysis Report G00235055, 2012.
[6] Olaf Acker, Adrian Blockus, and Florian Pötscher, "Benefiting from big data: A new approach for the telecom industry," Strategy&, Analysis Report, 2013.
[7] Abdelghani Bellaachia, Data Preprocessing, 1st ed. Washington, USA, 2011.
[8] D. Tanasa and B. Trousse, "Advanced data preprocessing for intersites Web usage mining," IEEE Intelligent Systems, vol. 19, no. 2, pp. 59-65, March 2004.
[9] Fazel Famili, Wei-Min Shen, Richard Weber, and Evangelos Simoudis, "Data Preprocessing and Intelligent Data Analysis," Intelligent Data Analysis, vol. 1, no. 1, pp. 3-23, 1997.
[10] KDnuggets. (2014, May) Analytics, Data Mining, Data Science software/tools used in the past 12 months - Poll. [Online]. http://www.kdnuggets.com/polls/2014/analytics-data-mining-data-science-software-used.html