Communication, Management and Information Technology – Sampaio de Alencar (Ed.) © 2017 Taylor & Francis Group, London, ISBN 978-1-138-02972-9
Big data mining: A classification perspective Nojod M. Alotaibi & Manal A. Abdullah Faculty of Computing and Information Technology, King Abdulaziz University (KAU), Saudi Arabia
ABSTRACT: An unprecedented amount of data is being generated and recorded every day. Big data is the term used to describe such data, which is difficult to process, manage, and analyze using traditional databases or data mining algorithms. Mining big data is currently one of the most critical emerging research areas. Big data mining refers to the process of extracting useful knowledge from large datasets or streams of data. Due to the enormity, high dimensionality, heterogeneous, and distributed nature of the data, traditional techniques of data mining may be unsuitable for big data. As a result, there is a critical need to develop effective and efficient big data mining techniques. This paper explores the current use of supervised classification algorithms for big data and compares the algorithms based on their advantages and limitations.
Keywords: Big data, knowledge discovery, data mining, big data mining, supervised classification
1 INTRODUCTION
With the fast development of Internet communication and collaboration, the Internet of Things, and cloud computing, large amounts of data have become available at significant volumes (petabytes or more). Such data comes from a wide variety of sources and formats, including social networking interactions, web pages, click streams, online transactions, emails, videos, audios, images, posts, search queries, health records, science data, sensors, smartphones and their applications, and so on [1]. According to the 2014 IDC ‘Digital Universe Study’ [2], 130 exabytes (EB) of the world’s data were created and stored in 2005; the amount grew to 4.4 zettabytes (ZB) by 2013, is doubling in size every two years, and is projected to reach 44 ZB in 2020 [2]. In 2012, IBM estimated that 2.5 quintillion bytes of data were created daily [3]. This rapid growth in the amount of data gave rise to the big data phenomenon. According to Google Trends, worldwide search interest in “big data” has increased dramatically since 2004 (see Figure 1) [4]. Three characteristics are commonly used to define big data (also called the 3V’s of big data): volume, as data keeps growing; variety, as the types of data are diverse; and velocity, as data continuously arrives very fast into the systems [1]. Due to these characteristics, existing traditional techniques and technologies cannot handle the storage and processing of this data. Therefore, new technologies have been developed to manage the big data phenomenon.
Figure 1. Worldwide interest in “big data” [4].
IDC [5] defines big data technologies as “a new generation of technologies and architectures designed to extract value economically from very large volumes of a wide variety of data by enabling high velocity capture, discovery and analysis”. Mining and discovering meaningful knowledge from big data for decision-making, prediction, and other purposes is extremely challenging due to these characteristics. Knowledge Discovery (KD) is the process of discovering useful knowledge from a collection of data. Major KD application areas include marketing, manufacturing, fraud detection, telecommunications, education, medicine, Internet agents, and many other areas [6, 7]. Data mining is the core step of the KD process, in which algorithms are applied to extract useful patterns from data. Tasks in data mining can be classified into
clustering, classification, summarization, regression, association rule mining, sequence analysis, and dependency modeling. Supervised classification is one of the most common data mining tasks and is concerned with prediction. The aim of classification is to build a classifier based on training data with known class labels in order to predict the class labels of new data [8]. There are various methods for data mining classification tasks, such as decision trees (DT), Support Vector Machines (SVM), genetic algorithms, neural networks, etc. This paper is organized as follows. In Section 2, the authors briefly review big data definitions and related technologies. In Section 3, an overview of KD and data mining is provided. Section 4 presents the concept of supervised classification. Big data mining and the related issues and challenges are described in Section 5. Section 6 explores some current work on big data classification. Finally, the authors give some conclusions in Section 7.
2 BIG DATA
In recent years, big data has become a hot research topic in many areas where storage and processing of massive amounts of data are required. In March 2012, the Obama administration in the United States announced the “Big Data Research and Development Initiative”, with over $200 million in research funding [9]. The goals of this initiative were to develop and improve the technologies needed to collect, store, manage, and analyze big data; to use these technologies to accelerate the pace of knowledge discovery in science and engineering, improve national security, and transform teaching and learning; and to expand the workforce required to develop and use big data technologies [9]. According to McKinsey [10], the term big data refers to datasets whose size is beyond the capability of existing database software tools to capture, store, manage, and analyze within a tolerable amount of time. However, there is no single definition of big data. O’Reilly [11] defines big data as “data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the structures of existing database architectures. To gain value from this data, there must be an alternative way to process it”. As these definitions show, volume is not the only characteristic of big data. In fact, big data has three major characteristics (known as the 3V’s), shown in Figure 2, which were first defined by Doug Laney in 2001 [12].
Figure 2. Three V’s of big data [1].
• Data volume (i.e. the size of data) is the primary attribute of big data. The size of data can reach terabytes (TB, 10^12 B), petabytes (PB, 10^15 B), exabytes (EB, 10^18 B), zettabytes (ZB, 10^21 B), and more. For example, Facebook reached more than 8 billion video views per day in September 2015 [13].
• Variety refers to the fact that big data can come from different data sources in various formats and structures. These sources produce three types of data: structured, semi-structured, and unstructured [14]. Structured data follows a fixed schema; an example is a relational database system. Semi-structured data is a type of structured data that lacks a rigid structure; its structure may change rapidly or unpredictably [15]. Examples include weblogs and social media feeds. Unstructured data cannot be stored in relational tables for analysis and querying; it is commonly estimated to represent 80% of the world’s data. Files or documents such as videos, images, audio, PDFs, and spreadsheets are examples. (A small illustration of the three types appears below.)
• The velocity of data refers to the increasing rate at which data flows into an organization [11].
More recently, two additional V’s have been added to define big data: veracity and value. Veracity (the uncertainty of data) refers to the accuracy, integrity, and quality of the data being collected, while value refers to the worth of the data being extracted [17]. All of these characteristics of big data are challenging, and this is the reason why traditional Database Management Systems (DBMS) cannot be used to process and analyze big data. As a result, new technologies have been developed to meet these challenges. The following subsection discusses some of them.
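As a small illustration of the three data types named above, the snippet below shows the same kind of record as a fixed-schema tuple, as a JSON document whose fields may vary, and as free text. The values are invented for the example and are not from the paper.

```python
import json

# structured: a fixed relational schema, e.g. (id, name, visits)
structured = ("u42", "Ada", 3)

# semi-structured: JSON whose fields may change from record to record
semi_structured = json.loads('{"id": "u42", "name": "Ada", "tags": ["ml", "video"]}')

# unstructured: free text with no schema at all
unstructured = "Ada visited the site three times and watched two videos."

print(structured[1], semi_structured["name"])  # fixed position vs. flexible key
```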
2.1 Big data technologies
Big data is a term used to describe datasets that, due to their large size and complexity, cannot be managed with traditional database systems. In recent years, many technologies have been developed to process these huge volumes of data.
Apache Hadoop [18] is an open source software framework that enables the distributed processing of large data sets across clusters of commodity hardware using simple programming models. Hadoop has two main components: the Hadoop Distributed File System (HDFS) and MapReduce. HDFS is a distributed, scalable file system written in Java for the Hadoop framework. MapReduce is a programming paradigm in which users define two functions, map and reduce, to process large amounts of data in parallel (a toy illustration appears at the end of this subsection). Companies such as Facebook, Yahoo!, Amazon, Baidu, AOL, and IBM use Hadoop on a daily basis. Hadoop has many advantages [19]: it is cost-effective, fault-tolerant, flexible, and scalable. Hadoop also has many related software projects that use the MapReduce and HDFS framework, such as Apache Pig, Apache Hive, Apache Mahout, Apache HBase, and others [18].
Apache Pig [1] was originally developed at Yahoo! in 2006 for processing big data; in 2007 it was moved into the Apache Software Foundation. It allows people using Hadoop to focus on analyzing large data sets and spend less time writing MapReduce programs.
Apache Hive [20] was developed at Facebook in 2009. It is data warehouse software for querying and managing large datasets residing in distributed storage, built on top of Apache Hadoop. Hive defines a simple SQL-like query language, called Hive Query Language (HQL), which enables users familiar with SQL to query the data. Hive is optimized for scalability, extensibility, and fault tolerance.
Apache HBase [21] is a distributed columnar database that supports structured data storage for very large tables.
Jaql [22] was created at IBM Research Labs in 2008 and later released as open source. It is a query language for JavaScript Object Notation (JSON), but it supports more than just JSON, including XML, CSV, flat files, and more.
Storm [23] was created at BackType, a company acquired by Twitter in 2011. It is a free and open source distributed real-time computation system that does for real-time processing what Hadoop does for batch processing. Storm offers scalability, fault tolerance, and distributed computation.
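To make the MapReduce paradigm concrete, here is a toy single-process word count in Python. Real Hadoop jobs run the map, shuffle, and reduce phases across a cluster and are typically written in Java; this sketch only imitates the semantics of the two user-defined functions, and all names in it are ours.

```python
from collections import defaultdict

def map_fn(_, line):
    # map: emit (key, value) pairs, here (word, 1) for a word count
    for word in line.split():
        yield word, 1

def reduce_fn(key, values):
    # reduce: aggregate all values emitted for one key
    yield key, sum(values)

def run_job(records):
    groups = defaultdict(list)            # imitates the shuffle/sort phase
    for i, rec in enumerate(records):
        for k, v in map_fn(i, rec):
            groups[k].append(v)
    out = {}
    for k, vs in groups.items():
        for key, val in reduce_fn(k, vs):
            out[key] = val
    return out

print(run_job(["big data mining", "mining big data"]))
# {'big': 2, 'data': 2, 'mining': 2}
```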
NoSQL [24] (“Not only SQL”) is a term used to designate database management systems that differ from classic RDBMS in some way. These data stores may not require fixed table schemas, usually avoid join operations, do not attempt to provide full ACID (atomicity, consistency, isolation, durability) properties, and typically scale horizontally. There are several types of NoSQL database:
• Key-value stores. Each item in the database is stored as an attribute name (or key) together with its value (see the toy sketch after this list). Examples are Amazon’s Dynamo and Oracle’s BerkeleyDB.
• Document-oriented databases. These are designed for storing, retrieving, and managing document-oriented or semi-structured data. Examples are CouchDB and MongoDB.
• Column stores. These store columns of data together instead of rows. Examples include Cassandra and Apache HBase.
• Graph databases. These use nodes, edges, and properties to represent and store data. Examples are Neo4j and HyperGraphDB.
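The access pattern of a key-value store can be sketched with a toy in-memory class: one key maps to one opaque value, with no schema and no joins. This is purely illustrative and does not reflect the API of any specific product named above.

```python
class KeyValueStore:
    """Toy in-memory key-value store: one key maps to one opaque value."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value          # no schema: the value is any blob

    def get(self, key, default=None):
        return self._data.get(key, default)

store = KeyValueStore()
store.put("user:42", {"name": "Ada", "visits": 3})
print(store.get("user:42"))              # lookup is by key only, never by join
```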
3 DATA MINING
Knowledge Discovery (KD) is the process of extracting useful knowledge from huge volumes of data. It can be defined as the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data [7]. The KD process consists of the following steps, as shown in Figure 3 [7]:
Figure 3. A typical knowledge discovery process [7].
1. Understanding the application domain: learning relevant prior knowledge and the user’s goals.
2. Creating a target data set: selecting a subset of variables and data on which the discovery task will be performed.
3. Data cleaning and preprocessing: basic operations such as removing noise and dealing with missing values.
4. Data reduction and projection: finding useful attributes to represent the data.
5. Choosing the data mining task: clustering, classification, regression, summarization, etc.
6. Choosing the data mining algorithms: selecting appropriate methods to be used for searching for patterns in the data.
7. Data mining: searching for patterns in a particular representational form (such as classification rules or trees, regression, or clustering) using the selected methods.
8. Interpretation: interpreting the mined patterns, possibly returning to any of the previous steps for further iteration if the evaluated patterns are not useful.
9. Using the discovered knowledge: incorporating the knowledge into another system, or documenting it and reporting it to interested parties.
Data mining is the core step of the whole KD process. It consists of applying data analysis and discovery algorithms that produce an enumeration of patterns (models) over the data. It is widely used in fields such as science, engineering, economics, social media, medicine, marketing, and business. Many data mining tools are currently available for free on the Web, such as the Waikato Environment for Knowledge Analysis (WEKA) [25], RapidMiner [26], Orange [27], the Konstanz Information Miner (KNIME) [28], and more. The major tasks of data mining can be classified as follows [3]:
• Clustering: maps a data item into one of several clusters, where clusters are natural groupings of data items based on similarity or probability density models (see the sketch after this list).
• Classification: classifies a data item into one of several predefined categorical classes.
• Regression: maps a data item to a real-valued prediction variable; used in various prediction and modeling applications.
• Summarization: provides a compact description for a subset of data, for example the mean and standard deviation of fields.
• Association rules: describe association relationships among different attributes.
• Sequence analysis: models sequential patterns, such as time-series data; the goal is to model the process generating the sequence or to extract and report deviations and trends over time.
• Dependency modeling: describes significant dependencies between variables.
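As a minimal sketch of the clustering task referenced in the list above, the following snippet groups four points into two natural clusters with k-means. scikit-learn and the toy data are our illustrative choices, not part of the cited taxonomy.

```python
import numpy as np
from sklearn.cluster import KMeans

# four points forming two natural groups
points = np.array([[1.0, 1.1], [0.9, 1.0], [8.0, 8.2], [7.9, 8.1]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
print(labels)  # e.g. [1 1 0 0]: each item is mapped to one of two clusters
```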
4 CLASSIFICATION
Classification is one of the most common data mining tasks: it finds patterns in data and categorizes them into different classes. It is a form of supervised learning, in which a classifier (model) is generated from a set of instances with known labels, called the training set, and is then used to classify new or previously unseen data [8]. The converse is unsupervised learning, which groups data into categories based on some similarity of the input parameters, without known labels. Examples of supervised classification are spam detection, credit card fraud detection, and medical diagnosis.
There are three types of supervised classification: binary, multi-class, and multi-label. In binary classification, each instance may belong to one of two possible class labels. In multi-class classification, more than two class labels are involved and each instance is assigned to exactly one of them. In multi-label classification, there are more than two class labels and each instance may belong to more than one class label at the same time.
Many types of classification algorithms exist for extracting knowledge from data; they can be categorized into logic-based techniques (C4.5, CART, and RIPPER), perceptron-based techniques (artificial neural networks), statistical learning techniques (naive Bayes classifiers and Bayesian networks), and instance-based techniques (k-nearest neighbor) [8].
5 BIG DATA MINING
In the present age, huge amounts of data are produced every moment in fields such as science, the Internet, and physical systems: this is big data. Useful knowledge can be extracted from big data with the help of data mining. However, due to the enormity, high dimensionality, heterogeneity, and distributed nature of the data, traditional data mining techniques may be unsuitable for extracting knowledge from it. Mining big data is an emerging research area, and a plethora of possible future research directions arise from it. The objectives of big data mining techniques go beyond fetching requested information or uncovering hidden relationships and patterns [29]. Compared with the results derived from mining traditional datasets, mining the massive volume of interconnected, heterogeneous big data has the potential to greatly extend our knowledge and insights in the target domain. Begoli and Horey [30] proposed three principles for effective knowledge discovery from big data. First, the architecture should support many analysis methods, such as data mining, statistical analysis, machine learning, and visualization. Second, different storage mechanisms should be used, because all the data cannot fit in a single store, and the data should be stored and processed at all stages of the pipeline. Third, the results should be accessible and easy to understand.
5.1 Issues and challenges of big data mining
There are a number of issues and challenges related to big data mining, as follows [31]:
• Heterogeneity or variety: existing data mining techniques have mostly been used to discover unknown patterns and relationships of interest from structured, homogeneous, and small datasets. Variety, one of the fundamental characteristics of big data, comes from the phenomenon that an essentially unlimited number of different sources generate and contribute to big data. Data from different sources may be interconnected, interrelated, and delicately and inconsistently represented; mining useful information from such data is a great challenge. Heterogeneity in big data also means that structured, semi-structured, and even entirely unstructured data must be accepted and handled concurrently.
• Scalability or volume: the extraordinary volume requires high scalability of data management and mining tools. Cloud computing with parallelism can deal with the volume challenge of big data.
• Speed or velocity: the ability to access and mine big data quickly is essential; a processing/mining task must be completed within a definite period of time, otherwise the results become less valuable or even worthless.
• Accuracy and trust: with big data, the data sources have many different origins, not all of them well known and not all verifiable. As a result, the accuracy and trustworthiness of the source data quickly become a serious concern.
• Privacy crisis: data privacy has always been a challenge. The concern becomes extremely serious with big data mining, which often requires personal information in order to produce relevant and accurate results, as in location-based and personalized services. In addition, big data sources such as social media contain a tremendous amount of highly interconnected personal information; when all the bits of information about a person are dug out and put together, any privacy for that individual instantly disappears.
• Interactiveness: the capability of a data mining system to allow fast and adequate user interaction, such as feedback, interference, or guidance from users. It relates to all the characteristics of big data and can help overcome the challenges that come along with each of them.
6 BIG DATA CLASSIFICATION
Because of the characteristics of big data, traditional data mining algorithms may not be suitable to mine such huge data. As a consequence, there is an urgent need for developing algorithms and techniques capable of mining big data while dealing with its inherent properties. Several studies have attempted to improve traditional classification algorithms to make them work with big data, to parallelize classification algorithms based on MapReduce, or to develop new software tools for mining big data. The approaches to big data classification are summarized in Figure 4.
Figure 4. Big data classification approaches.
6.1 Improving traditional classification algorithms
Niu et al. [32] improved the traditional KNN algorithm and proposed a new algorithm, called Neighbor Filter Classification (NFC), to achieve fast classification in big data. Lui [33] proposed an improved model of the original random forest algorithm for the big data environment; the proposed model has higher classification accuracy. Support Vector Machines (SVMs) are among the most popular techniques for data classification and regression, but their computation and storage requirements increase rapidly with the size of the dataset, making them unsuitable for big data. Many researchers have therefore looked for ways to apply SVM classification to large data sets. Rebentrost et al. [34] presented a quantum-based support vector machine algorithm for big data classification that can achieve exponential speedup over the classical algorithm. Cervantes et al. [35] presented an SVM for classifying large data sets using minimum enclosing ball clustering, with good classification accuracy compared with the classical SVM. Cervantes et al. [36] proposed an SVM classification algorithm based on fuzzy clustering; the approach is scalable to large data sets with high classification accuracy and fast convergence speed.
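The data-reduction idea behind [35] and [36] can be sketched roughly as follows: shrink a large training set by clustering it and train the SVM only on cluster representatives. This is our own simplified illustration under stated assumptions (scikit-learn, plain k-means per class, a synthetic dataset); the papers' specifics, such as minimum enclosing balls, fuzzy memberships, and refinement near the decision boundary, are omitted.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.svm import SVC

def reduce_by_clustering(X, y, k_per_class=50):
    """Cluster each class separately and keep only the cluster centers."""
    Xr, yr = [], []
    for label in np.unique(y):
        Xc = X[y == label]
        centers = KMeans(n_clusters=k_per_class, n_init=10,
                         random_state=0).fit(Xc).cluster_centers_
        Xr.append(centers)
        yr.append(np.full(len(centers), label))
    return np.vstack(Xr), np.concatenate(yr)

X, y = make_classification(n_samples=20000, n_features=20, random_state=1)
Xr, yr = reduce_by_clustering(X, y)        # 20000 points -> 100 representatives
clf = SVC(kernel="rbf").fit(Xr, yr)        # SVM trains on the small reduced set
print("accuracy on full set:", round(clf.score(X, y), 3))
```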
Table 1. Summarization of big data classification algorithms.

[32] KNN. Limitations: the time cost of modeling is unacceptable; sensitive to the parameter K. Modification: Neighbor Filter Classification (NFC). Advantages: reduces the computational cost to O(n); can replace or adjust key input parameters automatically; updates other parameters regularly.

[33] Random forest. Limitation: the accuracy of a random forest gradually reduces over time. Modification: improved random forest. Advantage: higher classification accuracy than the traditional random forest.

[34] SVM. Limitation: high computational complexity (long training time) and extensive memory requirements of the required quadratic programming in large-scale tasks. Modification: quantum least-squares SVM. Advantage: achieves exponential speedup over the classical algorithm, O(log NM) in both the training and classification stages.

[35] SVM. Limitation: as for [34]. Modification: SVM using Minimum Enclosing Ball (MEB) clustering. Advantage: provides good classification accuracy compared with the classic SVM, while the training time is significantly shorter.

[36] SVM. Limitation: as for [34]. Modification: SVM based on fuzzy clustering. Advantage: achieves good performance for large datasets and fast convergence speed.

[37] NBC. Limitation: does not scale up well when the dataset is large. Modification: implementation of NBC on top of the Hadoop MapReduce framework. Advantage: the accuracy of NBC is improved and approaches 82% as the dataset size increases.

[38] SVM. Limitation: the computation and storage requirements increase tremendously for large datasets. Modification: MapReduce-based parallel SVM algorithm. Advantages: works efficiently on large datasets compared to the sequential SVM; the computation time with a multi-node cluster is less than with a single-node cluster for large datasets.

[39] SVM. Limitation: as for [38]. Modification: parallel SVM based on MapReduce (PSMR). Advantage: the training time is reduced significantly.

[40] SVM. Limitation: as for [38]. Modification: ontology enhanced parallel SVM based on MapReduce. Advantage: reduces the training time significantly.

[41] KNN. Limitations: the complexity of KNN is O(n·D), where n is the number of instances and D the number of features; memory consumption problems. Modification: MapReduce-based K-Nearest Neighbor (MR-KNN). Advantage: reduction of computational time compared with the sequential version.

[42] KNN-join. Limitation: needs a lot of time to handle large volumes of data. Modification: parallel MapReduce-based KNN-join. Advantage: achieves higher performance than the serial version.

[43] C4.5. Limitation: building decision trees can be very time consuming when the dataset is extremely big. Modification: parallel C4.5 decision tree classification algorithm based on MapReduce. Advantage: exhibits both time efficiency and scalability.

[44] Back-propagation neural network (BPNN). Limitation: the computation process of an ANN is slow, especially when dealing with large datasets. Modification: MapReduce-based parallel back-propagation neural network (MRBPNN). Advantage: the computation overhead of the neural network can be significantly reduced.
6.2 Classification algorithms based on MapReduce
Liu et al. [37] designed a big data analysis system to classify millions of movie reviews using a Naïve Bayes Classifier (NBC). They implemented the NBC on top of the Hadoop framework with some additional modules; the results show that the accuracy of the NBC improves and approaches 82% as the dataset size increases. Priyadarshini and Agarwal [38] proposed a MapReduce-based parallel SVM algorithm for big data classification, implemented with LibSVM. In the proposed algorithm, the training data is divided into subsets and each subset is trained with an SVM; the support vectors of two SVMs are then combined to form the input of the next SVM, and this process is repeated until only one set of support vectors is left (a single-machine sketch of this cascade scheme appears at the end of this subsection). Xu et al. [39] proposed a parallel SVM based on MapReduce (PSMR) for email classification; the parallel SVM is based on the cascade SVM model. Caruana et al. [40] developed a parallelized SVM based on the MapReduce framework for scalable spam filter training. Their parallel SVM is built on the Sequential Minimal Optimization (SMO) algorithm, and ontology semantics are used to minimize the accuracy degradation caused by distributing the training data among a number of SVM classifiers. Maillo et al. [41] proposed a MapReduce-based K-Nearest Neighbor approach (MR-KNN) for big data classification. Yan et al. [42] proposed a parallel KNN-join algorithm using MapReduce for big data multi-label classification. Dai and Ji [43] suggested a parallel C4.5 decision tree classification algorithm based on MapReduce. Liu et al. [44] proposed a MapReduce-based parallel back-propagation neural network (MRBPNN). In that work, three parallel neural networks are presented to deal with data-intensive scenarios in terms of the volume of classification data, the size of the training data, and the number of neurons in the network; the authors concluded that the computation overhead of the neural network can be significantly reduced by using a number of computers in parallel. These big data classification algorithms, with their advantages and limitations, are summarized in Table 1.
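The cascade scheme used by the parallel SVMs of [38] and [39] can be sketched on a single machine as below: train an SVM on each disjoint partition, keep only its support vectors, then merge support-vector sets pairwise and retrain until one classifier remains. This is a rough illustration under our own assumptions (scikit-learn, a linear kernel, synthetic data, simple pairwise merging), not the authors' MapReduce implementations; all function names are ours.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

def train_subset(X, y):
    """'Map' step: fit an SVM on one partition, return its support vectors."""
    clf = SVC(kernel="linear").fit(X, y)
    return X[clf.support_], y[clf.support_]

def cascade_svm(X, y, n_partitions=8):
    # Train one SVM per disjoint partition, keeping only support vectors.
    parts = [train_subset(Xp, yp)
             for Xp, yp in zip(np.array_split(X, n_partitions),
                               np.array_split(y, n_partitions))]
    # 'Reduce' steps: merge support-vector sets pairwise and retrain
    # until a single set remains.
    while len(parts) > 1:
        merged = []
        for i in range(0, len(parts), 2):
            if i + 1 < len(parts):
                Xm = np.vstack((parts[i][0], parts[i + 1][0]))
                ym = np.concatenate((parts[i][1], parts[i + 1][1]))
                merged.append(train_subset(Xm, ym))
            else:
                merged.append(parts[i])   # odd partition passes through
        parts = merged
    Xf, yf = parts[0]
    return SVC(kernel="linear").fit(Xf, yf)   # final classifier

X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
model = cascade_svm(X, y)
print("training accuracy:", round(model.score(X, y), 3))
```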
6.3 Big data mining tools
Many open source tools are available for big data mining. Some of them are summarized in the following:
NIMBLE [45] is a portable infrastructure that enables rapid development of parallel machine learning and data mining algorithms. It runs on top of the Hadoop framework.
Apache Mahout [46] is an open source project of the Apache Software Foundation (ASF). Mahout is written in Java and provides scalable data mining algorithms. It contains implementations for clustering, categorization, Collaborative Filtering (CF), and evolutionary programming on top of Apache Hadoop.
Big Cloud-Parallel Data Mining (BC-PDM) [47] is a cloud-based data mining platform that provides access to large telecom data and business solutions for telecom operators. BC-PDM is based on the MapReduce implementation of cloud computing. It supports the parallel ETL (extract, transform, and load) process, statistical analysis, data mining, text mining, and social network analysis.
Apache SAMOA (Scalable Advanced Massive Online Analysis) [48] is a platform for mining big data streams. It includes distributed algorithms for common machine learning tasks.
PEGASUS (Peta-Scale Graph Mining System) [49] is a graph mining system for very large graphs, built on top of the Hadoop framework.
GraphLab [50] is a high-level graph-parallel system built without using MapReduce. It is an open source project written in C++.
7 CONCLUSION
Big data has become a hot research topic that attracts extensive attention from academia, industry, and governments around the world. In this paper, we briefly introduce the concept of big data, including its definitions, characteristics, and technologies. The paper also provides an overview of big data mining and discusses the related issues and challenges. Finally, to support big data mining, we briefly survey supervised classification algorithms for big data.
REFERENCES
[1] Zikopoulos P., Eaton C., deRoos D., Deutsch T., and Lapis G., Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, 1st ed., Sit S., Ed. USA: McGraw-Hill Companies, 2012.
[2] Turner V., Reinsel D., Gantz J., and Minton S., “The Digital Universe of Opportunities: Rich Data and The Increasing Value of The Internet of Things,” EMC Corporation, Apr. 2014.
[3] (2013) What is Big Data?: Bringing Big Data to The Enterprise. [Online]. Available: https://www-01.ibm.com/software/data/bigdata/what-is-big-data.html.
[4] (2015) Google Trends. [Online]. Available: http://www.google.com/trends/explore#q=big%20data.
[5] Gantz J. and Reinsel D., “The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East,” EMC Corporation, Dec. 2012.
[6] Singh P., Gosawi G., and Dubey S., “Application of Data Mining,” Binary Journal of Data Mining and Networking, vol. 4, pp. 41–44, 2014.
[7] Fayyad U., Piatetsky-Shapiro G., and Smyth P., “The KDD Process for Extracting Useful Knowledge from Volumes of Data,” Communications of the ACM, vol. 39, no. 11, pp. 27–34, Nov. 1996.
[8] Kotsiantis S., “Supervised Machine Learning: A Review of Classification Techniques,” Informatica, vol. 31, pp. 249–268, July 2007.
[9] (2012) Obama Administration Unveils “Big Data” Initiative: Announces $200 Million in New R&D Investments. [Online]. Available: https://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_press_release.pdf.
[10] Manyika J., Chui M., Brown B., Bughin J., Dobbs R., Roxburgh C., and Byers A., “Big Data: The Next Frontier for Innovation, Competition and Productivity,” McKinsey Global Institute, May 2011.
[11] Dumbill E., Croll A., Steele J., and Loukides M., Planning for Big Data, Beijing: O’Reilly Media, 2012.
[12] Laney D., “3D Data Management: Controlling Data Volume, Velocity and Variety,” META Group Inc., Feb. 2001.
[13] (2015) Facebook Reports Third Quarter 2015 Results. [Online]. Available: http://www.techmeme.com/151104/p24#a151104p24.
[14] Sagiroglu S. and Sinanc D., “Big Data: A Review,” in Proc. of the 2013 International Conference on Collaboration Technologies and Systems (CTS), 2013, pp. 42–47.
[15] Pankowski T., “Querying Semistructured Data Using a Rule-Oriented XML Query Language,” in Proc. of the 15th European Conference on Artificial Intelligence (ECAI), 2002, pp. 302–306.
[16] Beyer M. and Laney D., “The Importance of ‘Big Data’: A Definition,” Gartner, 2012.
[17] Hassanien A., Azar A., Snasel V., Kacprzyk J., and Abawajy J., Big Data in Complex Systems: Challenges and Opportunities, 1st ed., Kacprzyk J., Ed. Springer International Publishing, 2015.
[18] (2014) Apache Hadoop. [Online]. Available: https://hadoop.apache.org/.
[19] Mirajkar N., Bhujbal S., and Deshmukh A., “Perform Wordcount Map-Reduce Job in Single Node Apache Hadoop Cluster and Compress Data Using Lempel-Ziv-Oberhumer (LZO) Algorithm,” International Journal of Computer Science Issues (IJCSI), vol. 10, pp. 719–728, Jan. 2013.
[20] (2014) Apache Hive. [Online]. Available: http://hive.apache.org/.
[21] (2015) Apache HBase. [Online]. Available: http://hbase.apache.org/.
[22] What is Jaql? [Online]. Available: http://www-01.ibm.com/software/data/infosphere/hadoop/jaql/.
[23] (2015) Apache Storm. [Online]. Available: http://storm-project.net/.
[24] (2015) What is NoSQL? [Online]. Available: https://www.mongodb.com/nosql-explained.
[25] (2015) WEKA: The University of Waikato. [Online]. Available: http://www.cs.waikato.ac.nz/ml/weka/.
[26] (2015) RapidMiner. [Online]. Available: https://rapidminer.com/.
[27] (2015) Orange: Data Mining Fruitful and Fun. [Online]. Available: http://orange.biolab.si/.
[28] (2015) KNIME. [Online]. Available: https://www.knime.org/.
[29] Prakash B. and Hanumanthappa M., “Issues and Challenges in the Era of Big Data Mining,” International Journal of Emerging Trends and Technology in Computer Science (IJETTCS), vol. 3, pp. 321–325, 2014.
[30] Begoli E. and Horey J., “Design Principles for Effective Knowledge Discovery From Big Data,” in Proc. of the Joint Working IEEE/IFIP Conference on Software Architecture (WICSA) and European Conference on Software Architecture (ECSA), 2012, pp. 215–218.
[31] Hong B., Meng X., Chen L., Winiwarter W., and Song W., Database Systems for Advanced Applications, 1st ed., Berlin: Springer, 2013.
[32] Niu K., Zhao F., and Zhang S., “A Fast Classification Algorithm for Big Data Based on KNN,” Journal of Applied Science, vol. 13, pp. 2208–2212, 2013.
[33] Lui Y., “Random Forest Algorithm in Big Data Environment,” Computer Modelling & New Technologies, vol. 18, pp. 147–151, 2014.
[34] Rebentrost P., Mohseni M., and Lloyd S., “Quantum Support Vector Machine for Big Data Classification,” Physical Review Letters, vol. 113, pp. 1–5, Sept. 2014.
[35] Cervantes J., Li X., Yu W., and Li K., “Support Vector Machine Classification for Large Data Sets via Minimum Enclosing Ball Clustering,” Neurocomputing, vol. 71, pp. 611–619, 2008.
[36] Cervantes J., Li X., and Yu W., “Support Vector Machine Classification Based on Fuzzy Clustering for Large Data Sets,” in Proc. of the 5th Mexican International Conference on Artificial Intelligence (MICAI), 2006, pp. 572–582.
[37] Liu B., Blasch E., Chen Y., Shen D., and Chen G., “Scalable Sentiment Classification for Big Data Analysis Using Naïve Bayes Classifier,” in Proc. of the 2013 IEEE International Conference on Big Data, 2013, pp. 99–104.
[38] Priyadarshini A. and Agarwal S., “A Map Reduce based Support Vector Machine for Big Data Classification,” International Journal of Database Theory and Application, vol. 8, pp. 77–98, 2015.
[39] Xu K., Wen C., Yuan Q., He X., and Tie J., “A MapReduce based Parallel SVM for Email Classification,” Journal of Networks, vol. 9, pp. 1640–1647, June 2014.
[40] Caruana G., Li M., and Liu Y., “An Ontology Enhanced Parallel SVM for Scalable Spam Filter Training,” Journal of Neurocomputing, vol. 108, pp. 45–57, May 2013.
[41] Maillo J., Triguero I., and Herrera F., “A MapReduce-based k-Nearest Neighbor Approach for Big Data Classification,” in Proc. of the 2015 IEEE Trustcom/BigDataSE/ISPA Conference, 2015, pp. 167–172.
[42] Yan X., Wang Z., Zeng D., Hu C., and Yao H., “Design and Analysis of Parallel MapReduce based KNN-join Algorithm for Big Data Classification,” TELKOMNIKA Indonesian Journal of Electrical Engineering, vol. 12, pp. 7927–7934, Nov. 2014.
[43] Dai W. and Ji W., “A MapReduce Implementation of C4.5 Decision Tree Algorithm,” International Journal of Database Theory and Application, vol. 7, pp. 49–60, 2014.
[44] Liu Y., Yang J., Huang Y., Xu L., Li S., and Qi M., “MapReduce Based Parallel Neural Networks in Enabling Large Scale Machine Learning,” Computational Intelligence and Neuroscience, vol. 2015, pp. 1–13, Aug. 2015.
[45] Ghoting A., Kambadur P., Pednault E., and Kannan R., “NIMBLE: A Toolkit for the Implementation of Parallel Data Mining and Machine Learning Algorithms on MapReduce,” in Proc. of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2011, pp. 334–342.
[46] (2009) Introducing Apache Mahout. [Online]. Available: http://www.ibm.com/developerworks/library/j-mahout/.
[47] Yu L., Zheng J., Wu B., and Wang B., “BC-PDM: Data Mining, Social Network Analysis and Text Mining System Based on Cloud Computing,” in Proc. of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2012, pp. 1496–1499.
[48] Morales G. and Bifet A., “SAMOA: Scalable Advanced Massive Online Analysis,” Journal of Machine Learning Research, vol. 16, pp. 149–153, 2015.
[49] Kang U., Tsourakakis C., and Faloutsos C., “PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations,” in Proc. of the 9th IEEE International Conference on Data Mining, 2009, pp. 229–238.
[50] Low Y., Bickson D., Gonzalez J., Guestrin C., Kyrola A., and Hellerstein J., “Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud,” Journal of VLDB Endowment, vol. 5, pp. 716–727, Apr. 2012.