Researching Apache Hama: A Pure BSP Computing Framework

Kamran Siddique¹, Zahid Akhtar² and Yangwoo Kim¹

¹ Dept. of Information and Communication Engineering, Dongguk University, Republic of Korea
{kamran,ywkim}@dongguk.edu

² Dept. of Mathematics and Computer Science, University of Udine, Italy
[email protected]

Abstract. In recent years, technological advancements have led to a deluge of data from distinct domains, and the development of parallel and distributed computing solutions still has a long way to go. That is why research and development on massive computing frameworks is continuously growing. At this particular stage, highlighting a potential research area along with key insights could be an asset for researchers in the field. Therefore, this paper explores one of the emerging distributed computing frameworks, Apache Hama. It is a Top Level Project under the Apache Software Foundation, based on Bulk Synchronous Parallel (BSP) processing. We present an unbiased and critical interrogation session about Apache Hama and conclude with research directions in order to assist interested researchers.

Keywords: Apache Hama · Bulk Synchronous Parallel · BSP · distributed computing · Hadoop · MapReduce

1 Introduction

Nowadays, one of the largest technological challenges in computing systems research is to provide mechanisms for storage, information retrieval, and manipulation of massive amounts of data. The need for research and development of big data processing frameworks is increasing tremendously. This paper presents our efforts towards writing a first focused paper on one of the emerging open source software frameworks, Apache Hama. It is a Top Level Project under the Apache Software Foundation, but it still has very limited documentation and research resources, which may be intimidating for newcomers in the initial phase of research. Therefore, highlighting a potential research area along with concise and significant insight could be of interest to researchers in the field. Hama is a distributed computing framework based on Bulk Synchronous Parallel (BSP) computing techniques for massive scientific computations, e.g., graph, matrix and network algorithms [1, 2]. Our idea to contribute to this particular area is motivated by the key observation of current trends in large-scale data processing and, at the same time, by observing the search logs about Apache Hama on several research platforms [3, 4]. This paper provides an unbiased and critical interrogation session about Hama, and our work is mainly directed to:

- Investigate research directions in big data processing using Hama;
- Researchers and graduate students intending to explore Hama in a record time;
- Practitioners and users interested in acquainting themselves with a thorough analysis of Hama.

The next section of the paper presents a questions-and-answers session, followed by conclusions and future directions in Sect. 3.

2 Interrogation Session

In this section, we formulate the most significant, focused and unbiased questions and answers about Apache Hama, which serve as the basis for extracting research directions.

2.1 What is Hama?

Hama, previously known as HAdoop MAtrix, is a top-level project of the Apache Software Foundation. It is a distributed computing framework based on the Bulk Synchronous Parallel (BSP) programming model [1, 2]. Hama is designed to run on massive datasets stored in the Hadoop Distributed File System (HDFS) and to solve scientific problems based on graphs, matrices, machine learning and networks. It is written in Java, is deployed on HDFS, and is therefore fully compatible with Hadoop clusters.
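The essence of the BSP model can be summarized in a few lines of code. The following is a minimal, self-contained Python simulation of one superstep; it deliberately does not use Hama's actual Java API, and the `Peer` class and `superstep` function are illustrative stand-ins. Each peer computes locally and sends messages, and a barrier makes those messages visible only at the next superstep.

```python
# Toy simulation of one BSP superstep (illustration only; Hama's real
# entry point is a bsp() method on a Java BSP subclass). Each "peer"
# computes locally, sends messages, and advances past the barrier only
# together with all other peers -- the essence of the BSP model.

class Peer:
    def __init__(self, name):
        self.name = name
        self.inbox = []       # messages delivered at the *next* superstep

def superstep(peers, compute):
    """Run one superstep: local computation + message exchange + barrier."""
    outboxes = {p.name: [] for p in peers}
    for p in peers:
        # compute() may read p.inbox and returns (destination, message) pairs
        for dest, msg in compute(p):
            outboxes[dest].append(msg)
    # Barrier synchronization: messages become visible only after
    # every peer has finished its local computation.
    for p in peers:
        p.inbox = outboxes[p.name]

peers = [Peer("peer0"), Peer("peer1"), Peer("peer2")]

def say_hello(p):
    # send a greeting to every other peer
    return [(q.name, f"hello from {p.name}") for q in peers if q is not p]

superstep(peers, say_hello)
print(peers[0].inbox)   # ['hello from peer1', 'hello from peer2']
```

The same three-phase structure (compute, communicate, synchronize) repeats superstep after superstep in a real Hama job.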

2.2 What is the architecture of Hama and how does it work?

The architecture of Hama is very similar to that of Apache Hadoop, since the underlying distributed file system is the same. It consists of the following three major components [5]:

BSP Master is responsible for scheduling jobs and assigning their tasks to Groom Servers. This master component maintains the Groom Server status and the job progress information, and it disseminates the execution class across Groom Servers. The BSP Master controls the supersteps in a cluster and handles the case where a job fails on a Groom Server. Moreover, it provides a cluster control interface for users.

Groom Server acts as a slave component in the Hama architecture and runs the BSP peer tasks assigned by the BSP Master. Each Groom Server pings the BSP Master to undertake tasks and reports its status to the BSP Master using periodic piggybacks. The BSP peer slots are configurable and should match the number of threads that a node is able to run in parallel. For best performance, and to exploit Apache Hadoop's data locality concept, a Groom Server and a data node should run on the same machine.

Zookeeper, or the synchronization component, is responsible for managing barrier synchronization tasks efficiently. The Zookeeper and the BSP Master are launched in parallel because barrier synchronization is centralized.

Fig. 1 illustrates the architecture and working of Hama. When a user submits a job, the BSP JobClient establishes a communication channel with the BSP Master using the Hadoop RPC framework. In this process, it partitions the input, stores the input splits in HDFS, and then submits a new job to the BSP Master. The BSP JobClient is executed locally by the client and sends status updates about the submitted job and the superstep count. The BSP Master reads all this information and then schedules the jobs. The BSP JobClient keeps itself updated with the job progress and also periodically informs the user about it. Once a task is assigned to a Groom Server, it continues its execution until the last superstep is executed. During this process, the job is not rescheduled on a different Groom Server; in case of failure, the job is marked failed and gets killed. The Zookeeper is responsible for barrier synchronization of the BSP peer tasks.
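The barrier step that Zookeeper coordinates can be sketched with ordinary threads. The snippet below is a toy Python illustration, not Hama or Zookeeper code: `threading.Barrier` stands in for the central barrier, and no peer enters superstep N+1 until every peer has finished superstep N.

```python
# Hypothetical stand-in for Hama's barrier synchronization: Python's
# threading.Barrier plays Zookeeper's role. No BSP peer may enter
# superstep N+1 until all peers have finished superstep N.
import threading

NUM_PEERS = 3
SUPERSTEPS = 2
barrier = threading.Barrier(NUM_PEERS)
log = []
log_lock = threading.Lock()

def bsp_peer(peer_id):
    for step in range(SUPERSTEPS):
        with log_lock:
            log.append((step, peer_id))  # stands in for local computation
        barrier.wait()                   # sync(): block until all peers arrive

threads = [threading.Thread(target=bsp_peer, args=(i,)) for i in range(NUM_PEERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Because of the barrier, every superstep-0 entry precedes every
# superstep-1 entry, regardless of thread scheduling.
print(all(s == 0 for s, _ in log[:NUM_PEERS]))   # True
```

This is exactly the guarantee that makes BSP programs free of data races across supersteps: computation within a superstep may be arbitrarily interleaved, but superstep boundaries are globally ordered.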

Fig. 1. Hama architecture and its working

2.3 What makes Hama different from other big data computing frameworks?

Hama is a pure BSP model, inspired by the work on Google Pregel. Hama is focused more on processing complex computation-intensive tasks than data-intensive tasks, which sets it apart from other frameworks. Despite Hama's architectural similarity to Apache Hadoop and its inspiration from Google's Pregel, it has significant differences. Hama aims to provide a more general-purpose framework than Pregel and Apache Hadoop, supporting massive scientific computations such as matrix, graph, machine learning, business intelligence and network algorithms. It is not restricted to graph processing; it provides a full set of primitives that allows the creation of generic BSP applications. The main difference between the architectures of Hama and Hadoop can be seen in Fig. 2. In Hama, the BSP tasks can communicate with each other, while communication between Map and Reduce tasks is forbidden in Hadoop [6]. Moreover, in the MapReduce model, the only form of communication between tasks is through the persistence of data on disk, because the model forces all Map tasks to finish their execution before any Reduce task starts. Hama, in contrast, provides direct message exchange between BSP tasks, which leads to better efficiency as the overhead of I/O operations is avoided.

Fig. 2. Comparison between the architecture of Apache Hadoop and Apache Hama
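The "communication only through persisted data" constraint of MapReduce can be made concrete with a toy sketch (illustrative Python, not Hadoop's API; the `disk` list is a stand-in for materialized map output). Every map task must finish writing before any reduce work begins, whereas BSP peers exchange messages directly at each barrier.

```python
# Toy contrast with the BSP model (illustration only): in MapReduce,
# reducers read mapper output only after *every* map task has finished
# and persisted its results; "disk" here is just an in-memory list.

def mapreduce_wordcount(chunks):
    disk = []                          # stands in for persisted map output
    for chunk in chunks:               # ALL maps must finish first...
        disk.extend((w, 1) for w in chunk.split())
    counts = {}                        # ...before any reduce may start
    for word, n in disk:
        counts[word] = counts.get(word, 0) + n
    return counts

print(mapreduce_wordcount(["a b a", "b b"]))   # {'a': 2, 'b': 3}
```

In an iterative algorithm, this materialize-then-read cycle repeats every iteration; in BSP it is replaced by in-memory message delivery at each superstep boundary.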

2.4 What are the strengths of Hama?

Following are the strengths of Hama [6]:

- Hama supports diverse massive computational tasks; it is not limited to large-scale graph processing.
- Hama provides BSP primitives instead of graph processing APIs, which enables it to operate at a lower level.
- Hama provides explicit support for message passing.
- It has a simple, small and flexible API.
- As a result of following the BSP model, it does not suffer from conflicts and deadlocks during communication.
- Hama is primarily intended to be used with Java, but Hama Pipes also enables programmers to write programs in C++.
- Hama is not limited to the Hadoop Distributed File System (HDFS); it is flexible enough to be used with any distributed file system.
- Hama supports General-Purpose computing on Graphics Processing Units (GPGPU) acceleration.

2.5 What are the shortcomings of Hama?

The shortcomings of Hama are given below [6, 7]:

- For graph partitioning, Hama does not use any special algorithm, which causes unnecessary communication between the nodes.
- There is a lack of graph manipulation functions in Hama's API.
- The BSP Master is a single point of failure: if it dies, the application stops.

2.6 In which application domains can Hama be the most suitable choice?

Hama is a general-purpose solution for large-scale computing, and it is most suitable for computation-intensive iterative applications. Hama is able to outperform MapReduce frameworks [8] in such application domains because it avoids the per-iteration overhead of the MapReduce approach, such as sorting, shuffling and reducing the vertices. MapReduce incurs this overhead in each iteration, and iterative algorithms commonly run a very large number of iterations. Hama provides a message passing interface, and each superstep in BSP is faster than a full job execution in a MapReduce framework such as Hadoop.
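As an illustration of why iterative algorithms map well onto supersteps, the toy Python loop below propagates the maximum value over a small graph, Pregel-style, with one superstep per iteration; in MapReduce, each of these iterations would cost a full job launch plus a sort-and-shuffle phase. (The code is a simulation, not Hama's API.)

```python
# Pregel-style maximum-value propagation as a sequence of supersteps
# (toy simulation): each vertex repeatedly forwards the largest value
# it has seen to its neighbors until no value changes.

graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}   # adjacency list (a path)
value = {0: 3, 1: 6, 2: 2, 3: 1}                 # initial vertex values

changed = True
supersteps = 0
while changed:                                   # run until convergence
    changed = False
    inbox = {v: [] for v in graph}
    for v, neighbors in graph.items():           # local compute + send
        for n in neighbors:
            inbox[n].append(value[v])
    for v in graph:                              # after the barrier: receive
        best = max(inbox[v] + [value[v]])
        if best != value[v]:
            value[v] = best
            changed = True
    supersteps += 1

print(value)        # every vertex converges to the global maximum, 6
print(supersteps)   # number of supersteps until convergence
```

Each loop iteration touches only in-memory state and messages; the cost of a superstep is the local computation plus one barrier, not a full framework job.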

2.7 Could Hama be applied to Deep Learning frameworks?

Recent advances in deep learning hold the promise of allowing machine learning algorithms to extract discriminative information from big data without labor-intensive feature engineering. Although a few giant companies such as Google and Microsoft have developed distributed deep learning systems, these systems are closed source. However, the latest Apache Hama provides open source distributed training of an Artificial Neural Network (ANN) using its BSP computing engine. In Hama, two kinds of components are involved in the training procedure: i) the master task, which merges the model and sends the model updating information to all groom tasks; ii) the groom tasks, which calculate the weight updates according to the training data. Hama's ANN is presently data-parallel only; research is under way to support both data and model parallelism.
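The master/groom split described above can be sketched as follows. This is a toy Python simulation under assumed names (`groom_update`, `master_merge`) for a one-parameter linear model rather than an ANN; Hama's real trainer is written in Java. Each groom computes a weight update on its own data shard, and the master averages the updates into a new model that would then be broadcast back to the grooms.

```python
# Toy sketch of data-parallel training in the master/groom pattern
# (illustration only; function names are hypothetical, not Hama's API).

def groom_update(weights, shard, lr=0.1):
    """Groom task: gradient step for a 1-D linear model y = w*x on one shard."""
    grad = sum(2 * (weights * x - y) * x for x, y in shard) / len(shard)
    return -lr * grad                      # proposed weight delta

def master_merge(weights, deltas):
    """Master task: average the grooms' updates into one model step."""
    return weights + sum(deltas) / len(deltas)

shards = [[(1.0, 2.0), (2.0, 4.0)],        # each groom holds one data shard;
          [(3.0, 6.0), (4.0, 8.0)]]        # the true relation is y = 2x

w = 0.0
for _ in range(50):                        # one training round per superstep
    deltas = [groom_update(w, s) for s in shards]
    w = master_merge(w, deltas)

print(round(w, 2))   # converges toward 2.0
```

Because only weight deltas cross the network, not the training data itself, this pattern fits BSP's message passing naturally; model parallelism would additionally require partitioning the weights themselves across tasks.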

2.8 Who is using Hama?

Hama is currently being used by both small and large enterprises across a wide range of industries. It is currently used and sponsored by Samsung Electronics [3]. A well-known Chinese search engine named Sogou is also using Hama, with the following specifications [9, 10]:

- Runs a 7,200-core Hama cluster for SiteRank;
- The data set is approximately 400 GB;
- Over 6 billion edges.

3 Conclusions

In this paper, we highlighted and explored Apache Hama as a potential research area in the field of computing. It is evident that research on big data processing using Apache Hama is in its early stage, and there is a need to identify its future directions by providing an in-depth and critical analysis. In order to accomplish this task, we formulated the most significant and focused questions about Hama in Sect. 2 and answered them with the help of the research literature. These questions reveal some of the promising areas of Hama that deserve further exploration and development, such as specialized graph partitioning algorithms, optimization of memory usage, and fault tolerance mechanisms. In our future work, we intend to provide a performance comparison of Apache Hama with other massive computing frameworks. It may further help to forecast Apache Hama's future and to open new doors for interested researchers. Finally, to the best of our knowledge and belief, this kind of up-to-date work about Apache Hama is missing from the current bibliography, and we hope that it will help researchers who want to devote themselves to this particular research area.

Acknowledgment

This research was supported by the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the ITRC (Information Technology Research Center) support program IITP-2015-(H8501-15-1015) supervised by the IITP (Institute for Information & communications Technology Promotion).

References

1. Apache Hama: http://www.apache.org/ [Accessed: November 20, 2015]
2. Illecker, M. "Scientific Computing in the Cloud with Apache Hadoop and Apache Hama on GPU", Master thesis, Faculty of Mathematics, Computer Science and Physics, University of Innsbruck, 2014.
3. Apache Hama blog: https://blogs.apache.org/Hama/ [Accessed: September 06, 2015]
4. Mailing list archives: http://mail-archives.apache.org/mod_mbox/Hama-user/ [Accessed: February 26, 2015]
5. Apache Hama Design Document V0.6. Available: http://people.apache.org/~tjungblut/downloads/hamadocs/ApacheHamaDesign_06.pdf [Accessed: April 10, 2015]
6. Cordeiro, D., Goldman, A., Kraemer, A., Junior, F. P. "Using the BSP model on Clouds". Available: http://ccsl.ime.usp.br/baile/sites/ime.usp.br.baile/files/chapter_0.pdf [Accessed: March 20, 2015]
7. Luo, S., Liu, L., Wang, H., Wu, B., Liu, Y. "Implementation of a Parallel Graph Partitioning Algorithm to Speed up BSP Computing", The 11th International Conference on Fuzzy Systems and Knowledge Discovery, IEEE, China, 19-21 August 2014, pp. 740-744.
8. Ting, I. H., Lin, C. H., Wang, C. S. "Constructing a Cloud Computing Based Social Networks Data Warehousing and Analyzing System", International Conference on Advances in Social Networks Analysis and Mining, IEEE, Kaohsiung, Taiwan, 25-27 July 2011, pp. 735-740.
9. DataSayer: http://datasayer.com/hama.php [Accessed: February 12, 2015]
10. Yoon, E. J. Available: http://hadoop.co.kr/2013/HIS2013_edwardyoon.pdf [Accessed: January 12, 2015]