2015 11th International Conference on Signal-Image Technology & Internet-Based Systems

An evaluation of data processing solutions considering preprocessing and "special" features
Rabiul Islam Jony1, Nabeel Mohammed1, Ahsan Habib1, Sifat Momen1, Rakibul Islam Rony2
1 University of Liberal Arts Bangladesh — {rabiul.islam, nabeel.mohammed, ahsan.habib, sifat.momen}@ulab.edu.bd
2 Primeasia University — [email protected]

Abstract— Recently we have witnessed an explosion of data in the digital world. In order to make sense of data, proper tools are required to carry out extensive data analysis. This is especially true with the advent of Big Data and how it is slowly becoming a part of everyday life. In this paper we evaluate multiple data processing tools in light of their data pre-processing features and "special features". These "special features" are a part of the contribution of this paper, as they have been gathered from a literature survey and through interviews of 20 experts from industry and academia. Based on these features, 13 tools were scored, from which we selected four for further hands-on testing. The hands-on testing highlighted the strengths and weaknesses of these tools, giving an insight into their suitability for Big Data processing.
Index Terms—Data preprocessing, preprocessing tools, data analysis, data mining, Big Data processing

I. INTRODUCTION
In the year 2000, the Sloan Digital Sky Survey (SDSS) started creating a three-dimensional map of the universe. In its first couple of weeks, its telescopes in New Mexico collected more data than was available in the entire astronomical repository. After one decade, the archive now contains around 140 terabytes of data. Another large synoptic survey telescope, based in Chile, is expected to collect a similar quantity of data every five days by the end of 2016 [1]. Wal-Mart, the retail giant, generates around 2.5 petabytes of data from 1 million customer transactions every hour [2]. Facebook, a social network, stores more than 500 terabytes of new data every day. Search engines like Google process 20 petabytes of data every day [1]. Until 2003, 5 exabytes of data had been created by humans; today, this amount of data is created in only two days [3]. The amount of data in the digital world reached 2.72 zettabytes in 2012 and is expected to double every two years [4]. All these examples show the data explosion that is taking place in this information age. Data, in recent times, is getting larger and more complex, and consequently it has become difficult to process such data with traditional data processing applications. Hence, Big Data techniques need to be incorporated to process these data.

Data analysis is a two-step process, namely preprocessing and actual processing. Raw data often contain noise, and are inconsistent and incomplete. Therefore, they need to be pre-processed in order to obtain a meaningful data set. After successful pre-processing of the data, the actual processing can begin. In this paper, we provide an evaluation of data mining tools considering their pre-processing features and their "special features". The objective of the paper is to evaluate a range of data mining tools with particular attention to their suitability for processing large data sets (i.e., Big Data [5]). Twenty experts from different telecom operators, vendors and Aalto University, Finland were interviewed to shed light on Big Data preprocessing, feature requirements and available tools for this research.

The rest of the paper is organized as follows. Section II describes data pre-processing techniques, followed by data pre-processing solutions in Section III. Section IV describes the hands-on testing of the selected tools, and the results are presented in Section V. Finally, the paper is concluded in Section VI with a discussion on the assessment of the results and the scope for future research.

II. DATA PREPROCESSING
According to [6], data preprocessing can be up to 80% of the total analysis work; once the data has been joined, cleaned and transformed, the analysis itself is only 20% of the work. William H. Inmon, also known as the father of data warehousing, has stated that data extraction, cleaning and transformation comprise the majority of the work of building a data warehouse [7]. Raw data is usually noisy, inconsistent and incomplete due to its typically large size and its likely origin from multiple and heterogeneous sources. Low-quality data leads to low-quality results. Real-world data is noisy, where the term noisy refers to incorrect or corrupted values. There are multiple causes of such noise, including technical problems within the tools, human error, transmission errors and outlier values.

Inconsistency is another common characteristic of raw data. Data inconsistency means that various copies of the data no longer agree with each other. Raw data is typically inconsistent due to conflicting versions, data redundancy and discrepancies. Incomplete data means lacking attribute values, lacking certain attributes of interest, and/or missing values in the dataset. Incomplete data typically occurs due to attribute unavailability and equipment malfunction.

The target of the preprocessing phase is to increase the quality of the data, i.e., to address these problems and prepare the data for the actual analysis. Data preprocessing has an important impact on the data value chain and typically includes several major tasks and techniques.

A. Major Tasks and Techniques in Data Preprocessing
In [8], data preprocessing has been divided into two parts, namely classical preprocessing and advanced preprocessing. The classical preprocessing part can be further divided into data fusion, data cleaning and data structuration phases. The advanced preprocessing part includes only the data summarization phase. In [9], data preprocessing techniques have been divided into three parts, namely data transformation, information gathering, and new information generation. This study classifies six major data preprocessing tasks, which also cover the three parts mentioned earlier.

Figure 1: Major data preprocessing tasks and techniques (typical process flow)

Figure 1 shows a typical process flow of data preprocessing. Usually, the process starts with data import from the source and ends with feeding the preprocessed data to the analysis part. As Figure 1 shows, the data transformation technique involves three major tasks: data import, data summarization and data cleaning. Information gathering includes data integration and data reduction. Furthermore, the new information generation technique simply requires data transformation.

III. PREPROCESSING SOLUTIONS
Data preprocessing tools require a number of preprocessing, analytics and performance features. The tools which are capable of fulfilling the feature requirements are best suited as preprocessing solutions.

A. Feature requirements
As mentioned earlier, twenty experts were interviewed during this research. At first, they described their current work on Big Data and then answered questions on Big Data preprocessing, feature requirements and suitable tools. Based on the literature review and the interviews, this study classifies the feature requirements into three categories:
1. Data preprocessing features
2. Analytics features
3. Performance and usability features

The data preprocessing features reflect the major data preprocessing tasks and techniques described in Section II. This study identified 27 data preprocessing features, which are listed in Table 1:
1. Handle missing values; 2. Filtering; 3. Aggregation; 4. Validation; 5. Correlating; 6. Enrichment; 7. Data synchronization; 8. Profiling; 9. Outlier detection and analysis; 10. Metadata transformation; 11. Sampling; 12. Discretization; 13. Clustering; 14. Transformation; 15. Reduction; 16. Name, role and value modification; 17. Type conversion; 18. Optimization; 19. Attribute generation; 20. Sorting; 21. Rotation; 22. Set operations; 23. Classification; 24. Regression; 25. Segmentation; 26. Manipulation; 27. PCA.
Table 1: Data preprocessing features

Although a few features included in Table 1, such as clustering and regression, are generally considered processing rather than preprocessing features, they are usually important in the case of Big Data preprocessing.
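None of the evaluated tools requires programming for these operations, but to make a few of the Table 1 features concrete, the sketch below (ours, not taken from any of the evaluated tools) expresses filtering, aggregation, discretization and a naive form of outlier detection in Python with pandas; the column names, values and thresholds are invented for illustration.

```python
import pandas as pd

# Hypothetical raw data with a numeric usage column and a categorical one.
df = pd.DataFrame({
    "user":  ["a", "a", "b", "b", "c", "c"],
    "usage": [12.0, 15.0, 9.0, 400.0, 11.0, None],
    "plan":  ["basic", "basic", "pro", "pro", "basic", "basic"],
})

# Filtering: keep only rows with a known usage value.
filtered = df[df["usage"].notna()]

# Aggregation: total usage per user.
per_user = filtered.groupby("user", as_index=False)["usage"].sum()

# Discretization: bin usage into three equal-width intervals.
per_user["usage_bin"] = pd.cut(per_user["usage"], bins=3,
                               labels=["low", "medium", "high"])

# Naive outlier detection: flag values more than three standard deviations
# from the mean (used here only for illustration).
mean, std = filtered["usage"].mean(), filtered["usage"].std()
flagged = filtered.assign(outlier=(filtered["usage"] - mean).abs() > 3 * std)

print(per_user)
print(flagged)
```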

In this study, the selected tools were evaluated according to their capability to meet these functional requirements and were scored from 0 to 3. For example, Table 2 below shows the criteria of scoring the feature “Missing value analysis”.


Function: Missing value analysis
Score 3: capability to auto-detect and impute, calculate or predict missing values
Score 2: capability to fill missing values with, e.g., the attribute mean or median
Score 1: capability to fill missing values with constants such as zero or infinity
Score 0: no capability to handle missing values
Table 2: Scoring levels for "Missing value analysis"
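As an illustration of the three non-zero scoring levels, the following sketch shows the same gap handled by a constant fill (level 1), the attribute mean (level 2) and a simple model-based estimate standing in for the "calculate or predict" capability (level 3). It assumes pandas and scikit-learn are available; the column names and values are made up.

```python
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

df = pd.DataFrame({"age": [23.0, 31.0, None, 45.0],
                   "income": [40.0, 52.0, 48.0, None]})

# Level 1: fill missing values with a constant such as zero.
level1 = df.fillna(0)

# Level 2: fill missing values with the attribute mean (median works similarly).
level2 = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(df),
                      columns=df.columns)

# Level 3: estimate missing values from similar rows (a k-NN imputer stands in
# here for the "calculate or predict" capability).
level3 = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df),
                      columns=df.columns)

print(level1, level2, level3, sep="\n\n")
```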

Tables 3 and 4 below list the analytics features and the performance and usability features, respectively.

Analytics features: database connectivity; access to all data sources; access to all data types; data extraction; multidimensional format data delivery; NoSQL support; Hadoop extension; advanced preprocessing; parallel preprocessing; modelling; series preprocessing; real-time preprocessing; text analysis (tokenization, stop-word removal, box-of-words creation, stemming); signs analysis; exploration; data visualization; migration; documentation; multi-dimensional plotting; dashboarding.
Table 3: Analytics feature list

Performance and usability features: ease of use; self-organizing; flexibility (number of options/features); user-friendly GUI; easy to configure; easy learning; efficiency; process storage; process import/export; minimum coding requirement; error management; error detection; availability; in-memory preprocessing; smart export and import of data; real-time preprocessing; reliability; accuracy; ETL support; less user intervention required; less memory consumption; low latency; performance; auto fixing; price (free or low price).
Table 4: Performance and Usability feature list

Finally, the "special features" were selected from the analytics and the performance and usability feature lists based on their frequency of use by the experts in the interviews:
1. Less memory consumption
2. Real-time preprocessing
3. Low latency
4. Minimum coding requirement
5. Easy to learn
6. Documentation
7. User-friendly GUI
8. Dashboarding
9. Multi-dimensional plotting
10. Text analysis
11. Hadoop extension
12. Access to all data types
Table 5: List of "special features"

B. Selection of preprocessing tools
First, based on an Internet study and the interviews, this study listed around fifty (50) available analytics/data mining, ETL (Extract, Transform and Load), Big-Data-specific, reporting and predictive analysis tools which are capable of fulfilling the preprocessing requirements. However, it was challenging to test the capability of every tool and select a few for hands-on testing. Therefore, this study reduced the list to thirteen (13) tools, based on a qualitative preprocessing ranking.

The qualitative rankings were based on:
• Online reviews
• Popularity (how many analysts are using the tool) [10]
• Preprocessing capabilities
• Analytics, and performance and usability features
• Feature descriptions from the tool manufacturers' websites
• Preliminary testing of the available tools
Preliminary testing consisted of simple preprocessing capability tests of the available tools and was not part of the hands-on testing.

These tools were analyzed and scored according to their data preprocessing capability in the preliminary testing. However, presenting all of the scoring results was challenging; therefore, we have selected the features with the highest score variance among the tools to present in Figure 2. The tools were also analyzed based on the availability of the "special features". Table 6 summarizes that result, where a tool is marked with "x" if it has that "special feature".
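To clarify the variance-based selection used for Figure 2, the sketch below computes, for a hypothetical tool-by-feature score matrix on the 0-3 scale, the variance of each feature across tools and keeps the features that actually discriminate between tools; the scores shown are illustrative and are not the study's actual values.

```python
import pandas as pd

# Hypothetical 0-3 scores: rows are tools, columns are preprocessing features.
scores = pd.DataFrame(
    {"Aggregation": [3, 1, 2, 0], "Filtering": [3, 3, 3, 3],
     "Discretization": [2, 0, 3, 1], "Sorting": [2, 2, 2, 2]},
    index=["Tool A", "Tool B", "Tool C", "Tool D"])

# Features whose scores vary widely across tools are the most informative to
# plot; features that every tool scores identically are dropped.
variances = scores.var(axis=0).sort_values(ascending=False)
high_variance = variances[variances > 0].index.tolist()

print(variances)
print("Plotted features:", high_variance)
```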

Figure 2: Scoring results of the selected tools on the high-variance data preprocessing features (aggregation, validation, correlation analysis, outlier detection and analysis, discretization, reduction, rotation, classification, regression and PCA), scored on the 0-3 scale.

Table 6: "Special features" availability in tools. The thirteen tools compared are RapidMiner, Pentaho Data Integrator, Data Preparator, Talend Open Studio for Data Integration, CloverETL Community Edition, Orange, Excel, Matlab, TANAGRA, Weka, R, SPSS Statistics and KNIME; each tool is marked with "x" against each of the twelve "special features" of Table 5 (access to all data types, Hadoop extension, text analysis, multi-dimensional plotting, documentation, dashboarding, user-friendly GUI, minimum coding requirement, low latency, real-time preprocessing, less memory consumption, easy to learn) that it provides.

IV. HANDS-ON TESTING OF THE TOOLS
Finally, four tools were selected for the hands-on testing:
i. KNIME
ii. RapidMiner
iii. Orange
iv. IBM SPSS Statistics
These four tools were selected because they all passed the conditions of being easy to learn and requiring minimum coding, as well as having a good spread of the features summarized in Table 6. The tools were tested through four pre-specified preprocessing tasks using six pre-specified datasets. Tool performance was measured according to the criteria described in Section IV-C, and the results are presented in Section V. The hands-on testing was conducted on an Intel Core 2 Duo machine with 4 GB of RAM running a 64-bit version of Windows 7.

A. Datasets for hands-on testing

For the hands-on testing, six different data sets with different types of values and sizes were used.

No | Dataset name | File type | Number of attributes | File size | Value types
1 | Survey example | CSV | 11 | 110 KB | Numeric
2 | Mobile usage behavior | CSV | 66 | 56 KB | Numeric, Nominal, Polynomial, Binominal
3 | Time series | CSV | 132 | 551 KB | Numeric
4 | Text data | TXT | NA | 6 KB | Text
5 | Large dataset 1 | CSV | 400 | 69000 KB | Numeric
6 | Large dataset 2 | CSV | 700 | 50000 KB | Numeric
Table 7: Summary of datasets for hands-on testing

The Survey Example dataset contains the results of surveys conducted for 11 products/services. The Mobile Usage Behavior dataset is an edited version of a dataset originally called 'OtaSizzle Handset-based Data', a handset behavior data set collected for research purposes by SizzleLab using MobiTrack software. The Time Series dataset contains rows with the date, time and service IDs of subscribers' usage. The Text Data set represents unstructured data, and the two large datasets were built manually for the test.

B. Preprocessing tasks for hands-on testing
Four preprocessing tasks were created for the hands-on testing of the tools. The tasks were designed so that they involve the major tasks and techniques of data preprocessing.
Preprocessing task 1 was performed on datasets 1 and 5. The task was to create 39 new attributes, each presenting the count of one response value per participant. For example, the resulting attribute '1' shows how many '1's were chosen by the participant in the corresponding row. This task is categorized as a simple preprocessing task.
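As a rough illustration of task 1 (the evaluated tools perform it through their GUIs rather than code), the sketch below does the same counting in Python with pandas; the question columns and the range of response values are hypothetical, and the real survey allows 39 distinct responses rather than five.

```python
import pandas as pd

# Hypothetical survey responses: each cell holds the option (1-5 here)
# chosen by a participant for one question.
survey = pd.DataFrame({"q1": [1, 2, 5], "q2": [1, 3, 5], "q3": [2, 2, 1]})

# For every possible response value, add an attribute counting how many
# times that value was chosen in the participant's row (task 1).
for value in range(1, 6):
    survey[f"count_{value}"] = (survey[["q1", "q2", "q3"]] == value).sum(axis=1)

print(survey)
```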

Preprocessing task 2 was performed on dataset 2; the task included the following steps (a Python sketch of several of these steps follows the list):

• Define the data types correctly
• Declare missing values
• Impute missing values
• Remove unusable variables
• Outlier detection
• Filter out missing values
• Aggregation of two parameters
• Resample the dataset
• Apply Principal Component Analysis (PCA)
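The sketch below walks through several of these steps in Python with pandas and scikit-learn; it is only an approximation of what the GUI tools do, and the column names and values are invented.

```python
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical mobile-usage records; column names and values are made up.
df = pd.DataFrame({
    "device_id":    ["d1", "d2", "d3", "d4"],
    "calls":        ["12", "7", None, "340"],
    "sms":          [30, None, 12, 18],
    "traffic_up":   [1.2, 0.4, 2.2, 9.9],
    "traffic_down": [3.1, 1.0, 5.5, 20.0],
})

df["calls"] = pd.to_numeric(df["calls"])                 # define the data type
df["calls"] = df["calls"].fillna(df["calls"].median())   # impute missing values
df = df.drop(columns=["device_id"])                      # remove an unusable variable
df = df[df["sms"].notna()]                               # filter out missing values

# Outlier detection and aggregation of two parameters.
df["outlier"] = (df["calls"] - df["calls"].mean()).abs() > 2 * df["calls"].std()
df["traffic_total"] = df["traffic_up"] + df["traffic_down"]

# Resample the dataset and apply PCA to the numeric columns.
sample = df.sample(frac=0.75, random_state=0)
components = PCA(n_components=2).fit_transform(sample.select_dtypes("number"))
print(df)
print(components)
```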

This task is also categorized as a simple preprocessing task.
Preprocessing task 3 was performed on datasets 3 and 6. The task was to aggregate each group of three attributes by summing them into a new attribute; in the result, two of the three attributes are deleted, leaving only the first attribute of the group and the resulting attribute. The data sets contain large numbers of attributes; therefore, a tool passes this task only if it can perform it through automatic iteration or a loop, and fails otherwise. This task is categorized as an advanced preprocessing task.
Preprocessing task 4 was performed on dataset 4. The task was to preprocess the text file, including reading the text file, tokenizing it if possible, filtering out the stop words and building a box of words.
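A rough Python rendering of tasks 3 and 4 is given below, assuming pandas and scikit-learn; the column names and the two sample sentences are invented, and the real task 3 ran over hundreds of attributes rather than six.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# --- Task 3: loop-based aggregation over every group of three attributes ---
df = pd.DataFrame([[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]],
                  columns=[f"a{i}" for i in range(1, 7)])

for cols in [df.columns[i:i + 3] for i in range(0, len(df.columns), 3)]:
    df[f"{cols[0]}_sum"] = df[cols].sum(axis=1)  # new aggregated attribute
    df = df.drop(columns=cols[1:])               # keep only the first attribute
print(df)

# --- Task 4: tokenize, remove stop words, build a box (bag) of words ---
texts = ["the quick brown fox", "the lazy dog sleeps"]
vectorizer = CountVectorizer(stop_words="english")  # tokenizes and drops stop words
bag = vectorizer.fit_transform(texts)
print(vectorizer.get_feature_names_out())
print(bag.toarray())
```

In the tools themselves, the equivalent of this loop is a looping or iteration node, which is exactly the capability that task 3 tests.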


C. Tools performance evaluation criteria
The hands-on testing results were evaluated on six criteria, which are described below.
Complexity: To successfully perform a preprocessing task, every tool requires a series of steps to be selected by the user. Complexity describes how easy it is (e.g., ease of learning, time spent and complexity of building the process) to perform the preprocessing task. This criterion has three possible outcomes, in order of complexity: complex, moderate and less complex.
Exact intended output: For each preprocessing task, the required output was also defined. Sometimes a tool can perform the preprocessing task but cannot present the output exactly as intended. This criterion has two possible outcomes, namely achieved and not achieved.
Memory taken: This criterion describes how much system memory the tool consumes while running the process. Sometimes a tool requires more memory than the system provides to run a big process on large data, and so fails to perform the process. This criterion takes numeric values representing the memory taken by the tool in kilobytes (only in cases where the tool was able to complete the task).
Time taken to execute: This criterion measures the time taken by a tool to execute the task. The time taken to build the process or load the data was not considered. Some tools require loading the data separately before executing the task, while others load the data and execute the task as a single process; therefore, for a fair comparison, this criterion covers only the execution time of the process. IBM SPSS Statistics and Orange perform the tasks step by step, so this criterion was not applicable to them. The criterion takes numeric values representing the execution time in seconds (s).
User intervention required: This criterion reflects the capability of a tool to perform specific preprocessing tasks automatically, with less user intervention. It has three possible outcomes, in order: much, moderate and less.
Result: This criterion summarizes whether a tool was capable of performing the preprocessing task or not. It has two possible outcomes, passed and failed, bearing their usual meanings.
The next section describes the results of the hands-on testing.

V. RESULT
Table 8 below presents a summary of the hands-on testing for task 1.

Tool (version) | Dataset | Complexity | Exact intended output | Memory taken | Time taken | Result | User intervention
RapidMiner 5 | Dataset 1 | Less complex | Achieved | 253276K | 0 s | Passed | Much
RapidMiner 5 | Dataset 5 (large) | Moderate | Achieved | 1144864K | 420 s | Passed | Much
KNIME 2.8.0 | Dataset 1 | Moderate | Achieved | 167000K | 0 s | Passed | Less
KNIME 2.8.0 | Dataset 5 (large) | Less complex | Achieved | 181321K | 360 s | Passed | Less
Orange 2.7 | Dataset 1 | Complex | Not achieved | 121424K | NA | Passed | Much
Orange 2.7 | Dataset 5 (large) | Complex | NA | NA | NA | Failed | NA
IBM SPSS Statistics 21 | Dataset 1 | Less complex | Achieved | 167064K | NA | Passed | Much
IBM SPSS Statistics 21 | Dataset 5 (large) | Less complex | Achieved | 181664K | NA | Passed | Much
Table 8: Hands-on testing results for task 1

Table 8 shows that all the selected tools were able to complete task 1 on dataset 1. The results show that, of all the tools, KNIME required the least amount of system memory to complete the task; RapidMiner performed averagely and SPSS had the most favorable result. However, Orange failed to perform the task on the large dataset, because it required more system memory than was available.

As mentioned earlier, tasks 1 and 3 were performed on both an ordinary and a large dataset, which was not the case for tasks 2 and 4. Therefore, the results of task 2 and task 4 are summarized in a single table (Table 9). All four tools successfully completed preprocessing task 2. RapidMiner and Orange had very similar performance; however, Orange took the lowest amount of system memory. KNIME again had the best scores overall. SPSS was the only tool unable to perform task 4, unlike the other tools¹.
¹ IBM SPSS has a separate product for text processing called IBM SPSS Text Analytics.

Tool (version) | Task | Complexity | Exact intended output | Memory taken | Time taken | Result | User intervention
RapidMiner 5 | Task 2 | Less complex | Achieved | 265156K | 1 s | Passed | Moderate
RapidMiner 5 | Task 4 | Less complex | Achieved | 783100K | 0 s | Passed | Less
KNIME 2.8.0 | Task 2 | Moderate | Achieved | 176364K | 1 s | Passed | Less
KNIME 2.8.0 | Task 4 | Less complex | Achieved | 204708K | 0 s | Passed | Less
Orange 2.7 | Task 2 | Moderate | Achieved | 125552K | NA | Passed | Moderate
Orange 2.7 | Task 4 | Less complex | Achieved | 141744K | NA | Passed | Less
IBM SPSS Statistics 21 | Task 2 | Moderate | Achieved | 170108K | NA | Passed | Much
IBM SPSS Statistics 21 | Task 4 | NA | NA | NA | NA | Failed | NA
Table 9: Hands-on testing results for task 2 and task 4


Table 10 below summarizes the performance of the tools on task 3. For preprocessing task 3, RapidMiner was unable to provide the exact intended output; however, it was able to complete the task. The task was too complex to complete successfully in Orange: as an advanced-level preprocessing task it required automatic iteration, and Orange does not have such functionality. In IBM SPSS Statistics this task would have required additional Java coding, and therefore it was scored as failed. The results show that KNIME performed the best for task 3. In summary, RapidMiner and KNIME were able to perform all four tasks; however, RapidMiner consumes a large amount of memory while running preprocessing tasks on the large datasets.

Tool (version) | Dataset | Complexity | Exact intended output | Memory taken | Time taken | Result | User intervention
RapidMiner 5 | Dataset 3 | Complex | Not achieved | 783684K | 1 s | Passed | Less
RapidMiner 5 | Dataset 6 (large) | Complex | Not achieved | 2144346K | 1620 s | Passed | Moderate
KNIME 2.8.0 | Dataset 3 | Complex | Achieved | 180008K | 2 s | Passed | Less
KNIME 2.8.0 | Dataset 6 (large) | Less complex | Achieved | 182500K | 2340 s | Passed | Less
Orange 2.7 | Dataset 3 | Complex | NA | NA | NA | Failed | NA
Orange 2.7 | Dataset 6 (large) | Complex | NA | NA | NA | Failed | NA
IBM SPSS Statistics 21 | Dataset 3 | Complex | NA | NA | NA | Failed | NA
IBM SPSS Statistics 21 | Dataset 6 (large) | Complex | NA | NA | NA | Failed | NA
Table 10: Hands-on testing results for task 3

'Radoop' is an extension for RapidMiner, built for editing and running ETL, data analytics and machine learning processes over Hadoop. For preprocessing on the large datasets, KNIME performed the best, for the following reasons:
1. KNIME consumes very little memory while running a big process on large data.
2. KNIME can load large data and perform the preprocessing task in a single process; separate data loading is not required.
3. Robust Big Data extensions are available for KNIME for distributed frameworks such as Hadoop.
None of the Big Data extensions were evaluated, as the level of technical expertise required to set them up is far beyond that of a normal user.

VI. CONCLUSION
In this study, we have evaluated multiple existing data processing tools. The evaluation was done on the basis of popular preprocessing features, which are expected to be present in such tools, and "special features", which are usability features identified through a literature survey and by interviewing twenty experts from industry and academia.

Fifty such tools were evaluated; however, due to space constraints this paper presented the results of thirteen tools. The initial evaluation was done considering preprocessing and "special features". From there we identified four tools which had the best overall performance. These tools were further evaluated using hands-on testing on six datasets using four tasks.

Of all the tools, KNIME and RapidMiner were the most user-friendly (by the criteria presented in this paper) as well as the most versatile in preprocessing tasks. It is our observation that none of the tools in their current state are well suited for widespread Big Data preprocessing. From the results presented it is clear that, even for datasets of less than one hundred megabytes, some of these tools required about three times more memory than the data itself. In fact, RapidMiner required almost two gigabytes of memory to process fifty megabytes of data. At these memory usage levels, the tools are not suitable for processing truly Big Data. While some of the tools have Big Data extensions, these extensions are not easy for a non-technical person to set up. If the vendors take into account some of the "special features" identified in this paper in the development of their Big Data extensions, these tools will see further mainstream adoption.


The preprocessing tasks used for the hands-on testing involved several sub-tasks, which were selected to exercise the maximum amount of preprocessing functionality. KNIME, RapidMiner and Orange are open-source tools and allow users to create new nodes according to their requirements to extend the functionality. IBM SPSS Statistics also provides the option to further modify the analyses using command syntax. However, IBM SPSS Statistics is a commercial tool; therefore, it might not be a cost-efficient solution for small organizations. Other commercial tools also have rich data preprocessing functionality but were not tested because of unavailability. For our future work, we will continue to evaluate such tools on datasets with larger volume, variety and velocity; in particular, we are keen to evaluate the Big Data extensions of these tools and how user-friendly these tools can be.

VII. BIBLIOGRAPHY
[1] The Economist, "Data, data everywhere," The Economist, February 2010.
[2] Infosys, "Big Data: Challenges and Opportunities," 2013.
[3] Eric Schmidt. (2010, August) TechCrunch. [Online]. http://techcrunch.com/2010/08/04/schmidt-data/
[4] S. Sagiroglu and D. Sinanc, "Big data: A review," in Collaboration Technologies and Systems (CTS), 2013 International Conference on, San Diego, CA, 2013, pp. 42-47.
[5] Mark A. Beyer and Douglas Laney, "The Importance of 'Big Data': A Definition," Gartner, Analysis Report G00235055, 2012.
[6] Olaf Acker, Adrian Blockus, and Florian Pötscher, "Benefiting from big data: A new approach for the telecom industry," Strategy&, Analysis Report, 2013.
[7] Abdelghani Bellaachia, Data Preprocessing, 1st ed. Washington, USA, 2011.
[8] D. Tanasa and B. Trousse, "Advanced data preprocessing for intersites Web usage mining," IEEE Intelligent Systems, vol. 19, no. 2, pp. 59-65, March 2004.
[9] Fazel Famili, Wei-Min Shen, Richard Weber, and Evangelos Simoudis, "Data Preprocessing and Intelligent Data Analysis," Intelligent Data Analysis, vol. 1, no. 1, pp. 3-23, 1997.
[10] KDnuggets. (2014, May) Analytics, Data Mining, Data Science software/tools used in the past 12 months - Poll. [Online]. http://www.kdnuggets.com/polls/2014/analytics-data-mining-data-science-software-used.html

