World Academy of Research in IT and Engineering, ISSN: 2278-1315, Volume 5, Issue 1, pp. 52-57

Performance Optimization of Heterogeneous Hadoop Clusters Using MapReduce for Big Data

Manish Bhardwaj, Research Scholar, [email protected], Poornima College of Engineering, Jaipur

Dr. Ajay Khuteta, Professor, [email protected], Poornima College of Engineering, Jaipur

Abstract: The key problem arising from the enormous growth of connectivity between devices and systems is that data is being created at such an exponential rate that finding a feasible solution for processing it is becoming more difficult by the day. To provide a platform for such an advanced level of data processing, hardware as well as software enhancements need to be made to keep pace with data of this magnitude. To improve the efficiency of Hadoop clusters in storing and analysing big data, we propose an algorithmic approach that caters to the needs of heterogeneous data stored over Hadoop clusters and improves their performance as well as their efficiency. This paper aims to establish the effectiveness of the new algorithm through comparison, suggestions, and a competitive approach to finding the best solution for improving the big data scenario. The MapReduce technique from Hadoop helps maintain a close watch over unstructured or heterogeneous Hadoop clusters, with insights on the results expected from the algorithm.

Keywords: big data, Hadoop, heterogeneous clusters, MapReduce, throughput, latency.

I. INTRODUCTION

The studies conducted so far demonstrate that roughly 30% of the data produced by the digital world could be useful for various purposes if analysed properly, yet a mere 0.5% of it is actually utilised by existing means, which clearly shows the inadequacy of the big data field given the limited capabilities of existing systems. Current analysis of unstructured data residing on servers or clusters reveals the conflicting influences of data over systems. The MapReduce model can provide high throughput, fairness in job distribution, and low latency, but the speed at which it does so needs to be improved to cope with ever-increasing volumes of unorganised, unstructured data. Other issues to be taken care of are turnaround times, effective clustering techniques, and algorithms that sort data, structure it, and retrieve it with high throughput and low latency. Hadoop, the open-source framework for storing and analysing data on low-cost hardware, has generated a great deal of buzz among developers and organisations. Hadoop handles the storage, analysis, and retrieval of data using the clusters of nodes operating under it, making it possible to screen large data sets that are quite inefficient to handle in relational databases. It is designed so that a single framework can scale to thousands of servers, complemented by fast local computing and storage. The feature that makes it useful in fast-paced development scenarios is that, through MapReduce, it can monitor structured data as well as the unstructured data that dominates the Internet.

Big data comprises data so large and complex that traditional databases are incompetent to store or analyse it. Big data can be understood through four V's: Variety, Velocity, Value, and Volume. It involves predictions that derive value from very large databases; the analysis and processing involved determine how much value a data set can provide. It means working with data that can deliver useful and meaningful content through storing, sharing, searching, visualising, or querying. Business intelligence and statistics are deeply associated with big data, as are fields such as language processing, artificial intelligence, astronomy, genomics, and weather forecasting, along with any other field that holds a huge amount of data.
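For readers new to the model, the canonical word-count job below illustrates the two MapReduce phases referred to above: the mapper tokenizes raw, unstructured text and the reducer aggregates the counts. This is the standard Hadoop teaching example, written against the org.apache.hadoop.mapreduce API; it is not code from the proposed work.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map phase: break each line of unstructured text into tokens,
  // emitting (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts that the shuffle grouped under each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}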

Data is generated in exponential quantities today. To tackle this data, big data involves predictions, learning, and complex calculations and algorithms that provide some value. The coming years will see a huge rise in the need for highly calculative, processing-intensive applications to deal with big data. A number of applications and frameworks already use big data but are still at a loss on many points. There are algorithms that classify data based on their own sets of principles and thereby provide a smooth utility for clustering; the algorithms used in this way provide a means to account for all the data and associate it during MapReduce.

The objectives of this work are: to identify existing studies related to Hadoop, big data, and the MapReduce framework and work with the problems commonly encountered in dealing with large sets of statistical or unstructured data, reviewing them for the dissertation work; to perform a comparative study of the existing algorithms used for clustering data on heterogeneous Hadoop clusters and determine which of them best suits future trends; to design a new algorithm that removes the shortcomings of the existing algorithms and provides better functionality than the previous ones; and to use that algorithm to improve the performance of heterogeneous Hadoop clusters under the MapReduce model for big data and present the results. The proposed work identifies and addresses these challenges to give the community better and more useful versions than the previous ones. With big data, one could improve sales and marketing, discover new medicines, devise layout strategies, and open up a whole new world of tasks. Most of the problems in handling big data arise from unlabelled objects, which are hard to tackle in Hadoop clusters.

II. BACKGROUND WORK

The Internet is expanding at an exponential rate; given the current scenario, scientists estimate that within a few years the size of the data will exceed thousands of petabytes. To cope with such an information burst, or more specifically a data burst, one needs a good system that can not only store and retrieve the data but also query and analyse it. The big data scenario presents challenges in itself, since the data to be stored is typically highly unorganised. The data one needs today is primarily scientific, retail, or business oriented, so in order to turn a profit from the data, or simply to understand its patterns, one has to dive deep and arrive at a meaningful interpretation. There are several platforms on which to store data. Traditionally, relational databases offered a wide variety of options to store and properly analyse data, but as data volumes and data types outgrew earlier trends, it became vitally important to store unlabelled or unstructured data. Unstructured data here means any kind of data that is in raw form: simply a collection of datasets in which one chunk may differ greatly from another.

As Zhuo Liu points out [1], over 33% of digital data could prove valuable if we had proper tools and methods to analyse it, whereas current usage scores a mere 0.5%. These kinds of observations create the need for highly efficient, scalable, and flexible approaches to help analyse big data of gigantic size. MapReduce algorithms have, in a short time, proved their importance as an efficient, scalable, and fault-tolerant data analytics programming model, but the model somewhat lacks query-level semantics in task scheduling, resulting in low utilisation of system resources, prolonged execution times, and low query throughput. MapReduce offers task-level scheduling through Directed Acyclic Graphs (DAGs) and a set of query languages aided with data warehousing capabilities. Larger clusters in Hadoop generally scale heterogeneously. Consider the case of a website that gathers content across continents: it will typically include users on different platforms, with languages and formats suited to each. The website has to accommodate this heterogeneity, and storing such data over clusters is a big task.

The clustering involved will have to set parameters that associate the data with suitable values so that the clusters can be easily identified and used for queries. As mentioned by Zongben Xu et al. [2], cluster configuration strongly affects the parallelization of data in Hadoop. The challenge posed by the heterogeneity of the clusters is to produce and manage computing infrastructures that harness not only existing computing platforms but also low-cost equipment to deliver better results. Computing is continually in need of tools for dealing with structured and unstructured data, and the poor performance of Hadoop's heterogeneous clusters creates further problems in distributed processing. Addressing the performance bottlenecks of such systems, Kamtekar [4] states that the hardware configuration, the logical connectivity, and a proper underlying framework can reduce the problems faced by heterogeneous Hadoop clusters. The clusters can be managed by enforcing a minimum requirement on the number of nodes in a cluster to avoid data-node failure, by replication techniques, by rack maintenance, and so on. As also noted by Jiong Xie et al. [9], there is a clear distinction whereby low-performance nodes are surpassed by high-performance nodes in terms of local data transfers; by distributing the load between the high- and low-performance machines of a heterogeneous cluster, a significant improvement in MapReduce tasks can be obtained.
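To make the load-distribution idea concrete, the sketch below assigns input blocks to nodes in proportion to their computing capacity, in the spirit of Xie et al. [9]. It is an illustration only: the node names and speed ratios are assumptions, and the actual technique operates inside HDFS block placement rather than at the application level.

import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public class ProportionalPlacement {
  // Give each node a share of input blocks proportional to its measured
  // computing capacity, so faster nodes do not sit idle waiting for slower ones.
  public static Map<String, Integer> assignBlocks(Map<String, Double> speedRatio, int totalBlocks) {
    double sum = speedRatio.values().stream().mapToDouble(Double::doubleValue).sum();
    Map<String, Integer> plan = new LinkedHashMap<>();
    int assigned = 0;
    for (Map.Entry<String, Double> e : speedRatio.entrySet()) {
      int share = (int) Math.floor(totalBlocks * e.getValue() / sum);
      plan.put(e.getKey(), share);
      assigned += share;
    }
    // Hand any rounding remainder to the fastest node.
    String fastest = Collections.max(speedRatio.entrySet(), Map.Entry.comparingByValue()).getKey();
    plan.merge(fastest, totalBlocks - assigned, Integer::sum);
    return plan;
  }

  public static void main(String[] args) {
    Map<String, Double> ratios = new LinkedHashMap<>();
    ratios.put("fast-node", 3.3);   // assumed: ~3.3x the speed of the slowest node
    ratios.put("medium-node", 2.0);
    ratios.put("slow-node", 1.0);
    // Prints {fast-node=315, medium-node=190, slow-node=95} for 600 blocks.
    System.out.println(assignBlocks(ratios, 600));
  }
}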

III. METHODOLOGY

The dissertation work focuses on improving the performance of heterogeneous clusters on Hadoop through a set of phases that improve the data input/output queries, improve the routing of the algorithm to the heterogeneous clusters, and then improve query processing so that queries are connected to the right part of execution in as little time as possible without increasing computing costs. The proposed work follows a series of steps or phases to handle these scenarios and improve the efficiency of heterogeneous clusters; each step is explained below.

a. Query Improvisation: When queries are fetched into the parser and semantic analyzer, these components invariably compute the dependencies among the queries. But once a parsed query is sent to Hadoop's MapReduce engine for execution, the dependencies that were calculated for the Hive query are lost in the transition. Once the dependencies are calculated, they can be used for semantics extraction in the HiveQL processor. In the second step, we can use these dependencies, such as logical predicates and input tables, to process the dependencies among different queries so that related queries remain closely attached to each other during the transition. When the semantic gap between Hive and Hadoop is bridged by these intermediary steps, the dependencies can readily be used for clustering similar queries at the query level, and hence, in small steps, the query job execution improves.
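As an illustration of the dependency extraction in step (a), the sketch below pulls input tables out of HiveQL text with a regular expression and treats two queries as related when they read a shared table. It is a deliberately naive stand-in for Hive's real parser and semantic analyzer, and the query strings are invented examples.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QueryDependencies {
  // Crude extraction of table names following FROM or JOIN keywords.
  private static final Pattern FROM_JOIN =
      Pattern.compile("(?i)\\b(?:from|join)\\s+([a-z_][a-z0-9_]*)");

  static Set<String> inputTables(String hiveQl) {
    Set<String> tables = new HashSet<>();
    Matcher m = FROM_JOIN.matcher(hiveQl);
    while (m.find()) tables.add(m.group(1).toLowerCase());
    return tables;
  }

  // Two queries are "dependent" here if they share at least one input table.
  static boolean related(String q1, String q2) {
    Set<String> shared = inputTables(q1);
    shared.retainAll(inputTables(q2));
    return !shared.isEmpty();
  }

  public static void main(String[] args) {
    String a = "SELECT region, COUNT(*) FROM sales GROUP BY region";
    String b = "SELECT * FROM sales JOIN stores ON sales.store_id = stores.id";
    String c = "SELECT name FROM employees";
    System.out.println(related(a, b)); // true  (both read 'sales')
    System.out.println(related(a, c)); // false (no shared input table)
  }
}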

b. Query Scheduling Algorithm: Hadoop uses the First In First Out (FIFO) scheduler for job distribution by default, but there are other schedulers as well, such as the Hadoop Fair Scheduler (HFS) and the Hadoop Capacity Scheduler (HCS), which are customizations for the Hadoop ecosystem. To achieve good fairness among different jobs, HCS and HFS are sufficient. In the proposed methodology, both schedulers are used one after the other to find out which outcome helps maintain order among clusters of different capacities.
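Which scheduler backs job distribution is a cluster-level setting. The sketch below shows the standard YARN property that selects the Fair or Capacity scheduler; in practice this belongs in yarn-site.xml rather than in code (on Hadoop 1.x the equivalent knob was mapred.jobtracker.taskScheduler), so treat these helpers as illustrative.

import org.apache.hadoop.conf.Configuration;

public class SchedulerConfig {
  // Configuration preloaded with the YARN Fair Scheduler.
  public static Configuration withFairScheduler() {
    Configuration conf = new Configuration();
    conf.set("yarn.resourcemanager.scheduler.class",
        "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler");
    return conf;
  }

  // Configuration preloaded with the YARN Capacity Scheduler.
  public static Configuration withCapacityScheduler() {
    Configuration conf = new Configuration();
    conf.set("yarn.resourcemanager.scheduler.class",
        "org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler");
    return conf;
  }
}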

c. Clustering Algorithm: The algorithm proposed in this dissertation work simplifies the need to categorize big data that arrives in a dynamic fashion, so that the clusters of data can be improved effectively. The algorithm used for the proposed work is as follows:

Algorithm Input: Datasets cleansed of garbled values.

Tq = Read_queries from Query.DB();
fq = frequency of an item in the dataset;
Ci = item in cluster i;
ds = similarity matrix from clustering through the K-Means algorithm;
Compute fq.

Output: Clustered data of importance.

Step 1: Read the log transition.
Step 2: Store the transition value in Qi, where i = 1, 2, 3, ..., n and n is equal to the log transition value.
Step 3: Store the query requested by the client in an array via the getQuery() method: IQ = I getQuery(), where I is the number of query requests. The queries are stored in an array to create the cluster, using the method table.put(null, array, parameter1, parameter2, ..., parameterN); to convert the array into an object: Qi = table.getQuery();
Step 4: Merge the pair of most similar queries (qi, qj) that are not the same query. If Qi is not in Ci, then store the frequency of the item and increase its value.
Step 5: Compute the similarity matrix if Qi is in Ci.
Step 6: If Qi is not in Ci, then compute a new cluster: IQ = New(Ci).
Step 7: Go to Step 3.
Step 8: Go to Step 2.

The algorithm finds the frequency of an item that is of importance and associates it with a similarity matrix.
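The sketch below is one possible concrete reading of Steps 1 through 8: queries stream in, and each one is either merged into the most similar existing cluster (raising its frequency) or opens a new cluster. The token-set Jaccard similarity and the 0.5 threshold are assumptions for illustration; the proposed work derives its similarity matrix through K-Means instead.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class QueryClustering {
  static class Cluster {
    final Set<String> tokens;          // representative token set (Ci)
    int frequency = 1;                 // fq: how many queries landed here
    Cluster(Set<String> t) { tokens = t; }
  }

  static Set<String> tokenize(String q) {
    return new HashSet<>(Arrays.asList(q.toLowerCase().split("\\W+")));
  }

  static double jaccard(Set<String> a, Set<String> b) {
    Set<String> inter = new HashSet<>(a); inter.retainAll(b);
    Set<String> union = new HashSet<>(a); union.addAll(b);
    return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
  }

  public static List<Cluster> cluster(List<String> queries, double threshold) {
    List<Cluster> clusters = new ArrayList<>();
    for (String q : queries) {                 // Step 3: read each client query Qi
      Set<String> toks = tokenize(q);
      Cluster best = null;
      double bestSim = 0.0;
      for (Cluster c : clusters) {             // Step 5: similarity against each Ci
        double sim = jaccard(toks, c.tokens);
        if (sim > bestSim) { bestSim = sim; best = c; }
      }
      if (best != null && bestSim >= threshold) {
        best.frequency++;                      // Step 4: Qi matches Ci, bump frequency
        best.tokens.addAll(toks);              // merge the similar pair (qi, qj)
      } else {
        clusters.add(new Cluster(toks));       // Step 6: Qi not in any Ci, new cluster
      }
    }
    return clusters;                           // output: clustered queries of importance
  }

  public static void main(String[] args) {
    List<Cluster> out = cluster(Arrays.asList(
        "select count(*) from sales",
        "select sum(amount) from sales",
        "select name from employees"), 0.5);
    for (Cluster c : out) System.out.println(c.tokens + " freq=" + c.frequency);
  }
}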

IV. CONCLUSION AND RESULTS

The following results were established by conducting the proposed work. The graphs are based on the throughput measure Throughput(Nq), through which the time taken by a query is estimated.
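Reading Throughput(Nq) in its standard sense of queries completed per unit of execution time, which is our assumption since the formula itself is not spelled out here, gives:

\[
\mathrm{Throughput}(N_q) = \frac{N_q}{T_{\mathrm{exec}}}
\]

where N_q is the number of queries in the test batch and T_exec is their total execution time.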

The following figure depicts the efficiency demonstrated by the different schedulers on the basis of their throughput.

[Figure 4.1: Comparison of different Hadoop schedulers (FIFO, HFS, HCS) by throughput, average I/O rate, and I/O rate standard deviation]

The test execution time taken by the different schedulers is shown below:

[Figure 4.2: Test execution time (in sec) for the FIFO, HFS, and HCS schedulers; the recorded values are 581.04, 394.82, and 407.2 seconds]

[Figure 4.3: Comparative study of FIFO, HFS, HCS]

The rise in the amount of digital data poses challenges that need solutions, both to maintain the data and to identify the correct sets of information so as to maximize utilization with cost benefits. The two-level query processing technique can be used to improve the efficiency of such problems, classifying data with the help of MapReduce on big data.

[Figure 4.4: Comparative study of the proposed work]

Generating a new algorithm to solve these problems for commercial as well as non-commercial uses can help the betterment of the community. The proposed algorithm can help improve data placement and the classifying MapReduce algorithm in heterogeneous Hadoop clusters.

V. LIMITATIONS

Since big data and Hadoop clusters need any number of physical storage units, the querying output is partly dependent on the I/O processing time and on the data nodes and the name-node server. The hardware costs, and the processing power required to provide redundancy, are not taken into account in this review paper.

VI. REFERENCES

[1] Zhuo Liu, Efficient Storage Design and Query Scheduling for Improving Big Data Retrieval and Analytics, 2015.
[2] Zongben Xu and Yong Shi, Exploring Big Data Analysis: Fundamental Scientific Problems, Springer, Dec. 2015.
[3] Fernando G. Tinetti, Ignacio Real, Rodrigo Jaramillo, and Damián Barry, Hadoop Scalability and Performance Testing in Heterogeneous Clusters, PDPTA 2015.
[4] Krushnarah Kamtekar, Performance Modeling of Big Data, May 2015.
[5] Fong-Hao Liu, Ya-Ruei Liou, Hsiang-Fu Lo, Ko-Chin Chang, and Wei-Tsong Lee, The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform, November 2014.
[6] T. K. Das and P. Mohan Kumar, BIG Data Analytics: A Framework for Unstructured Data Analysis, IJET, ISSN: 0975-4024, Vol. 5, No. 1, Feb-Mar 2013.
[7] Florica Novăcescu, Big Data in High Performance Scientific Computing, 2013.
[8] B. Thirumala Rao, N. V. Sridevi, V. Krishna Reddy, and L. S. S. Reddy, Performance Issues of Heterogeneous Hadoop Clusters in Cloud Computing, GJCST, Vol. XI, Issue VII, May 2011.
[9] Jiong Xie, Shu Yin, Xiaojun Ruan, Zhiyang Ding, Yun Tian, James Majors, Adam Manzanares, and Xiao Qin, Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters, 2010.