Big Data infrastructure: A survey

Jaime Salvador1, Zoila Ruiz1, and Jose Garcia-Rodriguez2

1 Universidad Central del Ecuador, Ciudadela Universitaria, Quito, Ecuador. Email: (jsalvador; [email protected])
2 Universidad de Alicante, Ap. 99, 03080, Alicante, Spain. Email: ([email protected])


Abstract. In recent years, the volume of available information has grown faster than ever before, moving from small datasets to huge volumes of data. This growth has forced researchers to look for new alternatives to process and store the data, since traditional techniques are limited by the size and structure of the information. At the same time, the parallel computing power of new processors has gradually increased, from single-processor architectures to architectures with multiple processors, cores and threads. This has enabled machine learning techniques to take advantage of the parallel processing capabilities offered by new architectures on large volumes of data. The present paper reviews works on parallel machine learning approaches applied to large volumes of data, and proposes a classification of them using the underlying hardware infrastructure as criterion.


Keywords: Machine Learning, Big Data, Hadoop, MapReduce, GPU

1 Introduction

Machine Learning tries to imitate human intelligence using machines. Machine learning algorithms (supervised and unsupervised) use data to train a model. Depending on the problem modeled, the data may be of the order of gigabytes; for this reason, a storage system optimized for large volumes of data (Big Data) is indispensable.

In recent years, the term Big Data has become very popular because of the large amount of information available from many different sources. The amount of data is increasing very fast, and the technology to manipulate large volumes of information is key when the purpose is the extraction of relevant information. The complexity of the data demands the creation of new architectures that optimize the computation time and the resources needed to extract valuable knowledge from the data [44].

Traditional techniques are oriented to processing information in clusters. With the evolution of the graphics processing unit (GPU), alternatives appeared to take full advantage of the multiprocessing capacity of this type of architecture. Graphics processors are in transition from processors specialized in graphics acceleration to general-purpose engines for high-performance computing [7]. This type of system is known as a GPGPU (General-Purpose Graphics Processing Unit). In 2008, the Khronos Group introduced the OpenCL (Open Computing Language) standard, a model for parallel programming. Subsequently its main competitor appeared, NVidia CUDA (Compute Unified Device Architecture). CUDA devices increase the performance of a system due to the high degree of parallelism they are able to control [26].

This document reviews platforms, languages and many other features of the most popular machine learning frameworks, and offers a classification for someone who wants to get started in this field. In addition, an exhaustive review of works using machine learning techniques to deal with Big Data is carried out, including their most relevant features.

The remainder of this document is organized as follows. Section 2 describes the techniques of machine learning and the most popular platforms, and then presents an overview of Big Data processing. Section 3 presents a summary table with several classification criteria. Finally, in Section 4 some conclusions and opportunities for future work are presented.

2 Machine Learning for Big Data

This section reviews machine learning concepts and Big Data processing platforms. The first part presents a classification of machine learning techniques; the second summarizes some platforms oriented to the implementation of those algorithms.

2.1 Machine Learning

The main goal of machine learning is to create systems that learn or extract knowledge from data and use that knowledge to predict future situations, states or trends [28]. Machine learning techniques can be grouped into the following categories:
– Classification algorithms. A system learns from a set of sample data (labeled data), from which a set of classification rules, called a model, is built. These rules are used to classify new information [34].
– Clustering algorithms. Groups (called clusters) of instances that share common characteristics are formed without any prior knowledge [34].
– Recommendation algorithms. Patterns of preferences are predicted, and those preferences are used to make recommendations to users [36].
– Dimensionality reduction algorithms. The number of variables (attributes) of a dataset is reduced without affecting the predictive capacity of the model [13].
Scalability is an important aspect to consider in a learning method. This capacity is defined as the ability of a system to adapt to increasing demands in terms of information processing. To support this process on large volumes of data, the platforms incorporate different forms of scaling [43]:


– Horizontal scaling, which involves distributing the process over several nodes (computers).
– Vertical scaling, which involves adding more resources to a single node (computer).
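As a concrete illustration of the clustering category described above, the following is a minimal k-means sketch in pure Python. It is illustrative only: the data points and function are hypothetical and are not tied to any of the libraries reviewed below.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: assign each point to its nearest centroid, then recompute."""
    rnd = random.Random(seed)
    centroids = rnd.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign to the nearest centroid (squared Euclidean distance)
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # recompute each centroid as the mean of its cluster (keep old one if empty)
        centroids = [
            tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

# two well-separated groups of 2D points
points = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
centroids, clusters = kmeans(points, k=2)
```

No prior labels are used: the grouping emerges from the data alone, which is the defining property of the clustering category.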

Machine Learning Libraries. This section presents a description of several libraries that implement some of the algorithms included in the previous classification. These libraries are reviewed from the point of view of integration with new systems, not as end-user tools.
Apache Mahout 3 provides implementations of many of the most popular machine learning algorithms, including clustering, classification and many others. To enable clustering, it uses Apache Hadoop [2].


MLlib 4 is part of the Apache Spark project and implements several machine learning algorithms. Among the groups of algorithms it implements we can find classification, regression, clustering and dimensionality reduction algorithms [29].


FlinkML 5 is the Flink machine learning library. Its goal is to provide scalable machine learning algorithms, and it provides tools that help the design of machine learning systems [2].


Table 1 describes the libraries presented above. For each one, the supported programming languages are detailed:

Table 1. Machine learning tool by language

Tool            Java  Scala  Python  R
Apache Mahout    x
Spark MLlib      x     x      x      x
FlinkML          x     x

As shown above, the most used language is Java, which shows a tendency when implementing systems that integrate machine learning techniques. Table 2 describes each library and its support for scaling when working with large volumes of data:

3 http://mahout.apache.org/
4 http://spark.apache.org/mllib/
5 https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/libs/ml/index.html

Table 2. Machine learning tool scaling

Tool            H+V  Multicore  GPU  Cluster
Apache Mahout    x       x             x
Spark MLlib      x       x       x     x
FlinkML          x       x             x

In the table we can see that the scaling techniques are combined with each other to obtain platforms with better performance.

2.2 Platforms for Big Data processing


Big Data is defined as a large and complex collection of data that is difficult to process with a relational database system. The typical size of the data in this type of problem is of the order of terabytes or petabytes, and it is in constant growth [22]. There are many platforms to work with Big Data; among the most popular we can mention Hadoop and Spark (two open source projects from the Apache Foundation), both built on the MapReduce model.


MapReduce [9] is a programming model oriented to the processing of large volumes of data [15]. Problems addressed with MapReduce should be separable into small tasks that can be processed in parallel [25]. This programming model is adopted by several specific implementations (platforms), such as Apache Hadoop and Spark.
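The model can be illustrated with the canonical word-count example, sketched here in plain Python (no Hadoop involved; the function names are illustrative): a map phase emits (word, 1) pairs, a shuffle groups the pairs by key, and a reduce phase sums each group.

```python
from collections import defaultdict

def map_phase(document):
    # emit one (key, value) pair per word, as a Hadoop mapper would
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # group values by key; in Hadoop this step is done by the framework
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # sum the values of each group, as a reducer would
    return {key: sum(values) for key, values in groups.items()}

documents = ["big data", "big big clusters"]
# in Hadoop, each document would be mapped on a different node in parallel
pairs = [p for doc in documents for p in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
# counts == {"big": 3, "data": 1, "clusters": 1}
```

Because each mapper and each reducer works on an independent piece of the data, the framework can distribute them across the nodes of a cluster, which is precisely the property that makes the model suitable for large volumes of data.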


Apache Hadoop 6 is an open source project sponsored by the Apache Software Foundation. It allows the distributed processing of large volumes of data in a cluster [20]. It consists of three fundamental components [18]: HDFS (the storage component), YARN (the resource planning component) and MapReduce.
Apache Spark 7 is a cluster computing system based on the MapReduce concept. Spark supports interactive computing, and its main objective is to provide high performance while keeping resources and computations in memory [28].
H2O 8 is an open source framework that provides libraries for parallel processing, information analysis and machine learning, along with data processing and evaluation tools [28].
Apache Storm 9 is an open source system for real-time distributed computing. An application created with Storm is designed as a directed acyclic graph (DAG) topology.

6 http://hadoop.apache.org/
7 http://spark.apache.org/
8 http://www.h2o.ai/h2o/
9 http://storm.apache.org/

Apache Flink 10 is an open source platform for batch and stream processing.

Table 3 compares each platform with the type of scaling supported and the storage systems used.

Table 3. Platform scaling

                      Scaling                      Storage
Tool            H+V  Multicore  GPU   Cluster   IM11  Local  DFS12
Apache Hadoop    x       x              x              x      x
Apache Spark     x       x      x       x        x     x      x
H2O              x       x      x13     x        x     x      x
Apache Storm     x       x              x        x     x      x
Apache Flink     x       x              x        x     x      x

As can be seen, all the platforms support all varieties of storage. However, in large implementations, local storage (or a pseudo-cluster) is not an option to consider.

3 Machine Learning for Big Data Review

This section proposes a classification of research works that use Big Data platforms to apply machine learning techniques, based on three criteria: language (programming language supported), scaling (scalability) and storage (supported storage type). The combination of these three factors allows the proper selection of a platform to integrate machine learning techniques with Big Data. Scaling is related to the type of storage: in the case of horizontal scaling, a distributed file system must be used. Table 4 shows the classification based on the above-mentioned criteria.

3.1 Language

The supported programming language is an important aspect when selecting a platform. Taking into account that this work is oriented to platforms that are not end-user tools (tools like Weka 14, RapidMiner 15, etc. are not considered), the following programming languages are considered: Java, Scala, Python and R.

10 https://flink.apache.org/
11 In-memory
12 Distributed File System
13 Through Deep Water (http://www.h2o.ai/deep-water/), in beta state
14 http://www.cs.waikato.ac.nz/ml/weka/
15 https://rapidminer.com/

Table 4. Language-Scaling-Storage (LSS) classification

The table classifies each article 16 by language (Java, Scala, Python, R), scaling (horizontal, vertical, cores, cluster, GPU) and storage (local, DFS, in-memory). The reviewed articles are: Al-Jarrah et al. [1], Aridhi and Mephu [2], Armbrust et al. [3], Bertolucci et al. [4], Borthakur [5], Castillo et al. [6], Catanzaro et al. [7], Crawford et al. [8], Nagina [31], Fan and Bifet [10], Gandomi and Haider [11], Ghemawat et al. [12], Hafez et al. [14], Hashem et al. [15], He et al. [16], Hodge et al. [17], Issa and Figueira [19], Jackson et al. [20], Jain and Bhatnagar [21], Jiang et al. [22], Kacfah Emani et al. [23], Kamishima and Motoyoshi [24], Kiran et al. [25], Kraska et al. [27], Landset et al. [28], Meng et al. [29], Modha and Spangler [30], Naimur Rahman et al. [32], Namiot [33], Norman et al. [35], Pääkkönen [37], Ramírez-Gallego et al. [38], Saecker and Markl [39], Salloum et al. [40], Saraladevi et al. [41], Seminario and Wilson [42], Singh and Reddy [43], Singh and Kaur [44], Walunj and Sadafale [45], and Zaharia et al. [46].

The most common language in this type of implementation is Java. However, over time, software layers have appeared that abstract the access to certain technologies. With these new technologies it is possible to use the platforms described in this document with different programming languages.

16 Some articles address only one of the criteria mentioned above, but in any case all the articles are included in the table.

3.2 Scaling

Scaling focuses on the possibility of including more processing nodes (horizontal scaling) or including parallel processing within the same node (vertical scaling), for example by using graphics cards. This document considers horizontal, vertical, cluster and GPU scaling. As can be seen in Table 4, and in the previous tables, all the platforms scale horizontally by using a DFS that in most cases corresponds to Hadoop. However, platforms that work with data in memory (like Spark) are becoming more popular.

3.3 Storage

Depending on the amount of information and the strategy used to process it, it is possible to decide the type of storage: local, DFS or in-memory. Each type determines the hardware infrastructure to be deployed. For example, in the case of a cluster implementation, it is necessary to have adequate equipment to constitute the cluster nodes. For proofs of concept it is acceptable to use a pseudo-cluster that simulates a cluster environment on a single machine.

4 Conclusions


In this document, a brief review of machine learning techniques was carried out, oriented towards the processing of large volumes of data. Then, some of the platforms that implement several of the described techniques were analyzed. Most of the reviewed articles use the techniques, languages and scaling described in this document.
In general, we can conclude that all these techniques are scalable in one way or another. It is possible to start from a local architecture (or pseudo-cluster) for a proof of concept and progressively scale to more complex architectures like a cluster, evidencing the need for distributed storage (DFS). The programming language should not be a deciding factor when selecting a platform, since nowadays there are interfaces that allow the use of the mentioned platforms from a growing variety of languages.
Future work may perform a similar study considering the time factor, which determines two forms of processing: batch and online processing. Finally, streaming processing, which is a fundamental part of some of the platforms described in this document, could be incorporated into the study.


Acknowledgements. This work has been funded by the Spanish Government TIN2016-76515-R grant for the COMBAHO project, supported with FEDER funds.

References


[1] O. Y. Al-Jarrah, P. D. Yoo, S. Muhaidat, G. K. Karagiannidis, and K. Taha. Efficient Machine Learning for Big Data: A Review. Big Data Research, 2(3):87–93, 2015.
[2] S. Aridhi and E. Mephu. Big Graph Mining: Frameworks and Techniques. Big Data Research, 6:1–10, 2016.
[3] M. Armbrust, A. Ghodsi, M. Zaharia, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, and M. J. Franklin. Spark SQL: Relational Data Processing in Spark. Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data - SIGMOD '15, pages 1383–1394, 2015.
[4] M. Bertolucci, E. Carlini, P. Dazzi, A. Lulli, and L. Ricci. Static and dynamic big data partitioning on Apache Spark. Advances in Parallel Computing, 27:489–498, 2016.
[5] D. Borthakur. HDFS Architecture Guide. Hadoop Apache Project, pages 1–13, 2008.
[6] S. J. L. Castillo, J. R. Fernández, and L. G. Sotos. Algorithms of Machine Learning for K-Clustering. Trends in Practical Applications of Agents and Multiagent Systems, pages 443–452, 2010.
[7] B. Catanzaro and K. Keutzer. Fast Support Vector Machine Training and Classification on Graphics Processors. Machine Learning, pages 104–111, 2008.
[8] M. Crawford, T. M. Khoshgoftaar, J. D. Prusa, A. N. Richter, and H. Al Najada. Survey of review spam detection using machine learning techniques. Journal of Big Data, 2(1):23, 2015.
[9] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Proceedings of the 6th Symposium on Operating Systems Design and Implementation, pages 137–149, 2004.
[10] W. Fan and A. Bifet. Mining Big Data: Current Status, and Forecast to the Future. ACM SIGKDD Explorations Newsletter, 14(2):1–5, 2013.
[11] A. Gandomi and M. Haider. Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35(2):137–144, 2015.
[12] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google File System. 2003.
[13] M. Guller. Big Data Analytics with Spark. 2015. ISBN 9781484209653.
[14] M. M. Hafez, M. E. Shehab, E. El Fakharany, and A. E. F. Abdel Ghfar Hegazy. Effective Selection of Machine Learning Algorithms for Big Data Analytics Using Apache Spark. Proceedings of the International Conference on Intelligent Systems, 1(1), 2016.


[15] I. A. T. Hashem, N. B. Anuar, A. Gani, I. Yaqoob, F. Xia, and S. U. Khan. MapReduce: Review and open challenges. Scientometrics, 109(1):1–34, 2016.
[16] Q. He, N. Li, W. J. Luo, and Z. Z. Shi. A survey of machine learning for big data processing. Moshi Shibie yu Rengong Zhineng / Pattern Recognition and Artificial Intelligence, 27(4):327–336, 2014.
[17] V. J. Hodge, S. O'Keefe, and J. Austin. Hadoop neural network for parallel and distributed feature selection. Neural Networks, 78:24–35, 2016.
[18] A. Holmes. Hadoop in Practice. Manning, second edition, 2015. ISBN 9781617292224.
[19] J. Issa and S. Figueira. Hadoop and memcached: Performance and power characterization and analysis. Journal of Cloud Computing: Advances, Systems and Applications, 1(1):10, 2012.
[20] J. C. Jackson, V. Vijayakumar, M. A. Quadir, and C. Bharathi. Survey on programming models and environments for cluster, cloud, and grid computing that defends big data. Procedia Computer Science, 50:517–523, 2015.
[21] A. Jain and V. Bhatnagar. Crime Data Analysis Using Pig with Hadoop. Physics Procedia, 78:571–578, 2016.
[22] H. Jiang, Y. Chen, Z. Qiao, T. H. Weng, and K. C. Li. Scaling up MapReduce-based Big Data Processing on Multi-GPU systems. Cluster Computing, 18(1):369–383, 2015.
[23] C. Kacfah Emani, N. Cullot, and C. Nicolle. Understandable Big Data: A survey. Computer Science Review, 17:70–81, 2015.
[24] T. Kamishima and F. Motoyoshi. Learning from Cluster Examples. Machine Learning, 53(3):199–233, 2003.
[25] M. Kiran, A. Kumar, and R. P. G. Verification and Validation of MapReduce Program Model for Parallel Support Vector Machine Algorithm on Hadoop Cluster. International Conference on Advanced Computing and Communication Systems (ICACCS), 4(3):317–325, 2013.
[26] D. Kirk and W.-M. W. Hwu. Programming Massively Parallel Processors: A Hands-on Approach. 2010. ISBN 0123814723.
[27] T. Kraska, A. Talwalkar, J. Duchi, R. Griffith, M. Franklin, and M. Jordan. MLbase: A Distributed Machine-learning System. 6th Biennial Conference on Innovative Data Systems Research (CIDR'13), 2013.
[28] S. Landset, T. M. Khoshgoftaar, A. N. Richter, and T. Hasanin. A survey of open source tools for machine learning with big data in the Hadoop ecosystem. Journal of Big Data, 2(1):24, 2015.
[29] X. Meng, J. Bradley, E. Sparks, D. Xin, R. Xin, M. J. Franklin, et al. MLlib: Machine Learning in Apache Spark. Journal of Machine Learning Research, 17:1–7, 2016.
[30] D. S. Modha and W. S. Spangler. Feature Weighting in k-Means Clustering. Machine Learning, 52(3):217–237, 2003.
[31] D. S. D. Nagina. Scheduling Algorithms in Big Data: A Survey. International Journal of Engineering and Computer Science, 5(8):17737–17743, 2016.


[32] M. Naimur Rahman, A. Esmailpour, and J. Zhao. Machine Learning with Big Data: An Efficient Electricity Generation Forecasting System. Big Data Research, 5:9–15, 2016.
[33] D. Namiot. On Big Data Stream Processing. International Journal of Open Information Technologies, 3(8):48–51, 2015.
[34] T. T. T. Nguyen and G. Armitage. A survey of techniques for internet traffic classification using machine learning. IEEE Communications Surveys & Tutorials, 10(4):56–76, 2008.
[35] S. Norman, R. Martin, and B. Franczyk. Evaluating New Approaches of Big Data Analytics Frameworks. Lecture Notes in Business Information Processing, 208:28–37, 2015.
[36] S. Owen, R. Anil, T. Dunning, and E. Friedman. Mahout in Action. Manning, 2012. ISBN 9781935182689.
[37] P. Pääkkönen. Feasibility analysis of AsterixDB and Spark streaming with Cassandra for stream-based processing. Journal of Big Data, 3(1):6, 2016.
[38] S. Ramírez-Gallego, H. Mouriño-Talín, D. Martínez-Rego, V. Bolón-Canedo, J. M. Benitez, A. Alonso-Betanzos, and F. Herrera. Un Framework de Selección de Características basado en la Teoría de la Información para Big Data sobre Apache Spark.
[39] M. Saecker and V. Markl. Big data analytics on modern hardware architectures: A technology survey. Lecture Notes in Business Information Processing, 138:125–149, 2013.
[40] S. Salloum, R. Dautov, X. Chen, P. X. Peng, and J. Z. Huang. Big data analytics on Apache Spark. International Journal of Data Science and Analytics, 1(3):145–164, 2016.
[41] B. Saraladevi, N. Pazhaniraja, P. V. Paul, M. S. S. Basha, and P. Dhavachelvan. Big data and Hadoop - A study in security perspective. Procedia Computer Science, 50:596–601, 2015.
[42] C. E. Seminario and D. C. Wilson. Case study evaluation of Mahout as a recommender platform. CEUR Workshop Proceedings, 910:45–50, 2012.
[43] D. Singh and C. K. Reddy. A survey on platforms for big data analytics. Journal of Big Data, 2(1):8, 2015.
[44] R. Singh and P. J. Kaur. Analyzing performance of Apache Tez and MapReduce with Hadoop multinode cluster on Amazon cloud. Journal of Big Data, 3(1):19, 2016.
[45] S. G. Walunj and K. Sadafale. An Online Recommendation System for E-commerce Based on Apache Mahout Framework. Proceedings of the 2013 Annual Conference on Computers and People Research, pages 153–158, 2013.
[46] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster Computing with Working Sets. HotCloud'10 Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, page 10, 2010.