Hadoop Distributed Computing Clusters for Fault Prediction - IEEE Xplore

Hadoop Distributed Computing Clusters for Fault Prediction

Joey Pinto1, Pooja Jain2, Tapan Kumar3
Indian Institute of Information Technology Kota, India
[email protected]1, (pooja2, tapan3)@iiitkota.ac.in

Abstract—The Hadoop architecture provides one level of fault tolerance by rescheduling jobs from faulty nodes to other nodes in the network. However, this approach is inefficient when a fault occurs after most of a job has already executed. It is therefore necessary to predict a fault at a node early enough that rescheduling the job is not costly in terms of time and efficiency. Predicting these faults provides the time needed to shift the task load onto other node(s) and thus prevents loss of data or computation time. An implementation is built on the MATLAB SVM kernel and Ganglia, with Java as the interfacing language; Ganglia is used for monitoring network and system statistics. The system is trained on statistics from a normal task run and can thus detect deviations from them in real time. The experimental results clearly indicate that it is possible to predict the occurrence of a fault using previously gained knowledge with minimal time delay, so that either the job can be rescheduled or the cluster itself can be scaled up. The reinforcement learning module reduces false positives with each run and makes it possible to implement a truly fault-tolerant cluster.

Index Terms—Hadoop cluster, Big data, SVM, Fault tolerance
I. INTRODUCTION

One of the most important challenges in distributed computing is to ensure that services remain correct and available despite faults [1]. Fault detection aims at identifying faulty components so that they can be isolated and repaired [2][3]. As the need for big data grows, new data collection, transmission, and processing techniques are required [4]. To avoid the complications of high-performance computing over big data, distributed systems should be fault tolerant [5]. Map-Reduce frameworks such as Hadoop have built-in fault-tolerance mechanisms that allow jobs to run to completion even in the presence of certain faults [6]. But these jobs can suffer severe performance penalties when a node crashes: the time taken to reschedule or restart a task can have a serious impact on the time taken to complete the entire job. For example, suppose the total execution time of task 'i' on node 'j' is Tij, and task 'i' encounters a failure at time 't'. If t
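The cost argument above can be made concrete under the simplest assumption: with no checkpointing, a failure at time t < Tij loses all work done so far, and the task reruns from scratch on another node, so completion takes roughly t + Tij plus rescheduling overhead. The helper below is a hypothetical illustration of that arithmetic, not a formula from the paper.

```java
// Illustrative cost model (assumption: no checkpointing, so all work
// done before the failure is lost and the task restarts from scratch).
public class RestartCost {
    // tij: total execution time of task i on node j
    // t: time at which node j fails (t < tij)
    // overhead: time to detect the fault and reschedule the task
    static double completionTime(double tij, double t, double overhead) {
        return t + overhead + tij;  // wasted work + rescheduling + full rerun
    }

    public static void main(String[] args) {
        // A 100-minute task failing at minute 90 with 5 minutes of
        // rescheduling overhead finishes in 195 minutes -- almost double
        // the fault-free time. Predicting the fault early (small t)
        // keeps the penalty small, which is the paper's motivation.
        System.out.println(completionTime(100, 90, 5));  // 195.0
        System.out.println(completionTime(100, 10, 5));  // 115.0
    }
}
```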