World Academy of Research in IT and Engineering, ISSN No. 02278-1315, Volume 5, Issue 1, pp. 52-57
Performance Optimization of Heterogeneous Hadoop Clusters Using MapReduce for Big Data

Manish Bhardwaj, Research Scholar, [email protected], Poornima College of Engineering, Jaipur
Dr. Ajay Khuteta, Professor, [email protected], Poornima College of Engineering, Jaipur
Abstract: The enormous growth of connectivity between devices and systems is creating data at such an exponential rate that finding a feasible solution for processing it is becoming more difficult by the day. Developing a platform for such an advanced level of data processing therefore requires hardware as well as software enhancements to keep pace with data of this magnitude. To improve the efficiency of Hadoop clusters in storing and analysing big data, we propose an algorithmic approach that caters to the needs of heterogeneous data stored over Hadoop clusters and improves both performance and efficiency. This paper aims to establish the effectiveness of the new algorithm through comparison, suggestions, and a competitive approach to finding the best solution for improving the big data scenario. The MapReduce technique from Hadoop helps in maintaining a close watch over unstructured data on heterogeneous Hadoop clusters, with insights on the results expected from the algorithm.
Keywords: big data, Hadoop, heterogeneous clusters, MapReduce, throughput, latency.
I. INTRODUCTION

The studies conducted so far demonstrate that roughly 30% of the data produced by the digital world could be useful for various purposes if analysed properly, yet a mere 0.5% of it is utilized by existing means; this states clearly the inadequacy of the big data field arising from the limited capabilities of existing systems. Current analysis of unstructured data residing on servers or clusters reveals the conflicting influences of data over systems. The MapReduce model can provide high throughput, fairness in job distribution, or low latency, but the speed at which it does so needs to be improved to cope with ever-increasing, unorganized, unstructured data. Other issues to be taken care of are turnaround times, effective clustering techniques, and algorithms to sort data, structure it, and retrieve it with high throughput and low latency. Hadoop, the open-source framework for storing and analysing data over low-cost hardware, has generated a great deal of buzz among developers and organizations. Hadoop handles the storage, analysis, and retrieval of data using clusters of nodes operating under it, making it possible to screen large data sets that are quite inefficient for relational databases. It is designed in such a manner that a single framework is enough to scale to thousands of servers, complemented with fast local computing and storage. The feature that makes it useful in fast-paced development scenarios is that, using MapReduce, it can equally monitor the structured as well as unstructured data that dominates the Internet.

Big data comprises data so large and complex that traditional databases are incompetent to store or analyse it. Big data can be understood through four V's: Variety, Velocity, Value, and Volume.

Big data consists of predictions to derive value from very large databases. The analysing and processing involved determine how much value a piece of data or a frame can provide. It involves working with data that could deliver useful and meaningful content through storing, sharing, searching, visualizing, or querying. Business intelligence and statistics are deeply associated with it, as are other related fields such as language processing, artificial intelligence, astronomy, genomics, weather conditions, or any field that holds a magnanimous amount of data.
Data is generated in exponential quantities today; to tackle it, big data involves predictions, learning, and complex calculations and algorithms to provide some value. The coming years will see a huge rise in the need for highly calculative, processing-heavy applications to deal with big data. A number of applications and frameworks already use big data but still fall short at many points. There are algorithms that classify data based on their own sets of principles, which lend themselves smoothly to clustering; the algorithms thus used provide a way to account for all the data and associate it during MapReduce. The objectives here are: to identify existing studies related to Hadoop, big data, and the MapReduce framework and work with the problems commonly encountered in dealing with large sets of statistical or unstructured data, reviewing them for the dissertation work; to perform a comparative study of the existing algorithms used for clustering data on heterogeneous Hadoop clusters and determine which of them best suits future trends; to design a new algorithm that removes the shortcomings of the existing algorithms and provides better functionality than the previous ones; and to use that algorithm to improve the performance of heterogeneous Hadoop clusters using the MapReduce model for big data and display the results. The proposed work identifies and addresses these challenges to give the community better, more useful versions than the previous ones. With big data, one could improve sales, marketing, and the discovery of new medicines, devise layout strategies, and open up a whole new world of tasks. Most of the problems in handling big data arise from unlabelled objects, which are hard to tackle in Hadoop clusters.
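To make the MapReduce model concrete, the sketch below shows the canonical word-count job written in Java against Hadoop's standard org.apache.hadoop.mapreduce API. It is a minimal illustration of screening raw, unstructured text, not part of the proposed algorithm; the input and output paths are placeholders supplied on the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: emit (word, 1) for every token in the raw input split.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce phase: sum the counts for each word across all mappers.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The combiner performs local aggregation of each mapper's output before the shuffle, which is one of the simplest throughput levers the model offers.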
II. BACKGROUND WORK

The Internet is expanding at exponential rates; given the current scenario, scientists estimate that in a matter of a few years the size of data will exceed thousands of petabytes. To cope with such an information burst, one needs a good system that can not only store and retrieve the data but also query and analyse it. The big data scenario presents challenges of its own, since the data to be stored is typically highly unorganized. The data one primarily needs today is of scientific, retail, or business importance, so in order to turn some profit from that data, or simply to understand its patterns, one has to dive deep and arrive at a meaningful interpretation. There are several platforms for storing data. Traditionally, relational databases offered a wide variety of options to store and properly analyse data, but as data volumes and the types of data exceeded current trends, it became vitally important to store unlabelled or unstructured data. Unstructured data here means any kind of data that is in raw form and is simply a collection of datasets in which one chunk may differ greatly from another.

As Liu [1] points out, over 33% of digital data could prove valuable if we had proper tools and methods to analyse it, whereas current use scores a mere 0.5%. Observations of this kind drive the need for highly efficient, scalable, and flexible approaches to help analyse big data of gigantic size. MapReduce algorithms have in a short time proved their importance as an efficient, scalable, and fault-tolerant data analytics programming model, but they somewhat lack query-level semantics in task scheduling, resulting in low utilization of system resources, prolonged execution times, and low query throughput. MapReduce offers task-level scheduling through Directed Acyclic Graphs (DAGs) and a set of query languages aided by data warehousing capabilities. Larger Hadoop clusters are generally heterogeneous in how they scale. Consider the case of a website that gathers content across continents: typically, it will serve users on different platforms, with languages and formats suited to each. The website has to accommodate this heterogeneity, and storing such data over clusters is a big task. The clustering involved will have to set parameters that associate the data with suitable values so that clusters can be easily identified and used for queries.
As noted by Xu et al. [2], cluster configuration strongly affects the parallelization of data in Hadoop. The challenge with heterogeneous clusters is to produce and manage computing infrastructures that harness not only existing computing platforms but also low-cost equipment, and still provide better results. Computing is continually in need of tools for dealing with structured and unstructured data, and the lack of performance in Hadoop's heterogeneous clusters creates further problems in distributed processing. Addressing the performance bottlenecks of such systems, Kamtekar [4] states that hardware configuration, logical connectivity, and a proper underlying framework can reduce the problems faced by heterogeneous Hadoop clusters. The clusters can be managed by enforcing a minimum requirement on the number of nodes in a cluster to avoid data node failure, along with replication techniques, rack maintenance, and so on. As also noted by Xie et al. [9], there is a clear distinction whereby low-performance clusters are surpassed by high-performance clusters in terms of local data transfers; by distributing the load between high- and low-performance heterogeneous clusters, there is a significant improvement in MapReduce tasks.
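The load-distribution idea noted from Xie et al. [9] can be pictured with a small sketch. The following is a hypothetical illustration, not Hadoop's actual balancer: it assumes every node has been profiled with a relative computing ratio, and it assigns file blocks in proportion to that ratio so that faster nodes hold more local data and fewer blocks have to be transferred during a MapReduce run.

    public class CapacityAwarePlacement {

      // Assign blocks to nodes in proportion to each node's computing ratio, so
      // faster nodes hold more local data and fewer blocks move at MapReduce time.
      static int[] assignBlocks(int totalBlocks, double[] computingRatio) {
        double sum = 0.0;
        for (double r : computingRatio) sum += r;

        int[] blocksPerNode = new int[computingRatio.length];
        int assigned = 0;
        for (int i = 0; i < computingRatio.length; i++) {
          blocksPerNode[i] = (int) Math.floor(totalBlocks * computingRatio[i] / sum);
          assigned += blocksPerNode[i];
        }

        // Give any rounding remainder to the fastest node.
        int fastest = 0;
        for (int i = 1; i < computingRatio.length; i++) {
          if (computingRatio[i] > computingRatio[fastest]) fastest = i;
        }
        blocksPerNode[fastest] += totalBlocks - assigned;
        return blocksPerNode;
      }

      public static void main(String[] args) {
        // Three heterogeneous nodes; the ratios would come from profiling runs.
        double[] ratios = {3.0, 1.5, 1.0};
        int[] placement = assignBlocks(110, ratios);
        for (int i = 0; i < placement.length; i++) {
          System.out.println("node " + i + " -> " + placement[i] + " blocks");
        }
      }
    }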
III. METHODOLOGY

The dissertation work focuses on improving the performance of heterogeneous clusters on Hadoop through a set of phases that improve the data input/output queries, improve the routing of the algorithm to heterogeneous clusters, and then improve the processing of queries so that they are connected to the right part of execution in as little time as possible without increasing computing costs. The proposed work follows a series of steps or phases, each added to improve the efficiency of the heterogeeneous clusters; they are explained below.

a. Query Improvisation: When queries are fetched into the parser and semantic analyzer, these invariably compute the dependencies among the queries. But once a parsed query is sent on to Hadoop's MapReduce for execution, the dependencies that were calculated for the Hive query are lost in the transition. Once calculated, the dependencies can be used for semantics extraction in the HiveQL processor. In a second step, these dependencies, such as logical predicates and input tables, can be used to tie dependent queries closely to each other during the transition. When the semantic gap between Hive and Hadoop is bridged by these intermediary steps, the dependencies can readily be used for clustering similar queries at the query level, and hence, by minute steps, for improving query job execution. A toy sketch of this grouping follows.
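The sketch below is an illustration of the grouping described in step (a), not the actual Hive code path: it assumes the parser has already reported each query's input tables, and it simply groups queries that share a table so they can be clustered and scheduled together rather than losing their dependencies in transition.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class QueryDependencyGrouper {

      // Group query ids by shared input table, so dependent queries can be
      // clustered together when they are handed over to MapReduce.
      static Map<String, List<String>> groupByInputTable(
          Map<String, List<String>> queryInputs) {
        Map<String, List<String>> tableToQueries = new HashMap<>();
        for (Map.Entry<String, List<String>> entry : queryInputs.entrySet()) {
          for (String table : entry.getValue()) {
            tableToQueries.computeIfAbsent(table, t -> new ArrayList<>())
                          .add(entry.getKey());
          }
        }
        return tableToQueries;
      }

      public static void main(String[] args) {
        // Input tables per query, as a parser/semantic analyzer might report them.
        Map<String, List<String>> queryInputs = new HashMap<>();
        queryInputs.put("q1", List.of("sales", "customers"));
        queryInputs.put("q2", List.of("sales"));
        queryInputs.put("q3", List.of("inventory"));
        // e.g. {sales=[q1, q2], customers=[q1], inventory=[q3]} (map order may vary)
        System.out.println(groupByInputTable(queryInputs));
      }
    }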
b. Query Scheduling Algorithm: Hadoop uses the First-In-First-Out (FIFO) scheduler for job distribution by default, but there are other schedulers as well, such as the Hadoop Fair Scheduler (HFS) and the Hadoop Capacity Scheduler (HCS), which are customizations within the Hadoop ecosystem. To achieve good fairness among different jobs, HCS and HFS are sufficient. In the proposed methodology, both schedulers are used one by one to find the outcome that best helps in maintaining order among clusters of different capacities; the configuration switch involved is shown below.
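For reference, on Hadoop 2.x (YARN) the choice between these schedulers is a single property in yarn-site.xml. The snippet below shows the standard configuration rather than anything specific to the dissertation; it selects the Fair Scheduler, and the CapacityScheduler class can be substituted for the HCS runs.

    <property>
      <name>yarn.resourcemanager.scheduler.class</name>
      <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
      <!-- For HCS, use:
           org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler -->
    </property>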
c. Clustering Algorithm: The algorithm proposed under this dissertation work simplifies the need to categorize big data that arrives in dynamic fashion, so that clusters of data can be formed effectively. The algorithm used for this proposed work is as follows:
Algorithm Input: Datasets cleansed of garbled values.
    Tq = Read_queries from Query.DB();
    fq = frequency of item in dataset;
    Ci = item in cluster i;
    ds = similarity matrix from clustering through the K-Means algorithm;
    Compute fq.

Output: Clustered data of importance.

Step 1: Read the log transition.
Step 2: Store the transition value in Qi, where i = 1, 2, 3, ..., n and n equals the log transition value.
Step 3: Store each query requested by a client in an array via the getQuery() method: IQ = i-th getQuery(), where i is the number of query requests. The queries are stored in an array to create clusters, via table.put(null, array, parameter1, parameter2, ..., parameterN); to convert the array into objects: Qi = table.getQuery();
Step 4: Merge the pair of most similar queries (qi, qj) that do not contain the same queries. If Qi is not in Ci, store the frequency of the item and increase its value.
Step 5: Compute the similarity matrix if Qi is in Ci.
Step 6: If Qi is not in Ci, compute a new cluster: IQ = New(Ci).
Step 7: Go to Step 3.
Step 8: Go to Step 2.

The algorithm finds the frequency of an item that is of importance and associates it with a similarity matrix. A runnable sketch of this loop is given below.
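The following Java sketch is one reading of the loop above, under stated assumptions: getQuery() is replaced by a fixed list of sample queries, the similarity matrix ds is stood in for by a Jaccard token-overlap score, and the merge threshold is arbitrary. It illustrates the intent of Steps 3 to 6, not the dissertation's exact code.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class QueryClusterer {

      static final double THRESHOLD = 0.5;                      // minimum similarity to join a cluster
      final List<List<String>> clusters = new ArrayList<>();    // Ci: queries per cluster
      final Map<String, Integer> frequency = new HashMap<>();   // fq: item frequencies

      // Jaccard similarity over whitespace tokens stands in for the similarity matrix ds.
      static double similarity(String a, String b) {
        Set<String> sa = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
        Set<String> sb = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
        Set<String> union = new HashSet<>(sa);
        union.addAll(sb);
        sa.retainAll(sb);                                       // sa now holds the intersection
        return union.isEmpty() ? 0.0 : (double) sa.size() / union.size();
      }

      // Steps 3-6: place each incoming query in the most similar cluster, or open a new one.
      void add(String query) {
        int best = -1;
        double bestSim = 0.0;
        for (int i = 0; i < clusters.size(); i++) {
          for (String member : clusters.get(i)) {
            double s = similarity(query, member);
            if (s > bestSim) { bestSim = s; best = i; }
          }
        }
        if (best >= 0 && bestSim >= THRESHOLD) {
          clusters.get(best).add(query);                        // merge with the most similar cluster
        } else {
          List<String> fresh = new ArrayList<>();               // Step 6: IQ = New(Ci)
          fresh.add(query);
          clusters.add(fresh);
        }
        for (String token : query.toLowerCase().split("\\s+")) {
          frequency.merge(token, 1, Integer::sum);              // Step 4: update item frequency fq
        }
      }

      public static void main(String[] args) {
        QueryClusterer qc = new QueryClusterer();
        for (String q : new String[] {"select name from customers",
                                      "select name, city from customers",
                                      "select sku from inventory"}) {
          qc.add(q);
        }
        System.out.println(qc.clusters.size() + " clusters: " + qc.clusters);
      }
    }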
IV. CONCLUSION AND RESULTS

The following results were established by the conduct of the proposed work; the graphs are based on a throughput formula, Throughput(Nq), which estimates the time taken by a query. The first graph depicts the efficiency demonstrated by the different schedulers in terms of their throughput.

[Figure 4.1: Comparison of different Hadoop schedulers (throughput, average I/O rate, and I/O-rate standard deviation for HFS, HCS, and FIFO)]

The test execution times taken by the different schedulers are as below:

[Figure 4.2: Test execution time (in seconds) for the different schedulers; the measured values are 581.04, 394.82, and 407.2 s]

[Figure 4.3: Comparative study of FIFO, HFS, HCS]

The rise in the amount of digital data poses challenges that need solutions, both to maintain the data and to identify the correct sets of information so as to maximize utilization with cost benefits. The two-level query processing technique can be used to improve efficiency on such problems and to classify data with the help of MapReduce on big data.
[Figure 4.4: Comparative study of the proposed work]

Generating a new algorithm to solve these problems for commercial as well as non-commercial uses can help the betterment of the community. The proposed algorithm can help improve data placement and the classifying algorithm of MapReduce in heterogeneous Hadoop clusters.

V. LIMITATIONS

As big data and Hadoop clusters need any number of physical storage devices, the querying output is partly dependent on the I/O processing time and on the data nodes together with the name node server. The hardware costs, and the processing power amounting to redundancy, are not taken into account in this paper.

VI. REFERENCES

[1] Zhuo Liu, Efficient Storage Design and Query Scheduling for Improving Big Data Retrieval and Analytics, 2015.
[2] Zongben Xu, Yong Shi, Exploring Big Data Analysis: Fundamental Scientific Problems, Springer, Dec. 2015.
[3] Fernando G. Tinetti, Ignacio Real, Rodrigo Jaramillo, and Damián Barry, Hadoop Scalability and Performance Testing in Heterogeneous Clusters, PDPTA 2015.
[4] Krushnarah Kamtekar, Performance Modeling of Big Data, May 2015.
[5] Fong-Hao Liu, Ya-Ruei Liou, Hsiang-Fu Lo, Ko-Chin Chang, and Wei-Tsong Lee, Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform, November 2014.
[6] T. K. Das, P. Mohan Kumar, BIG Data Analytics: A Framework for Unstructured Data Analysis, IJET, ISSN: 0975-4024, Vol. 5, No. 1, Feb-Mar 2013.
[7] Florica Novăcescu, Big Data in High Performance Scientific Computing, 2013.
[8] B. Thirumala Rao, N. V. Sridevi, V. Krishna Reddy, L. S. S. Reddy, Performance Issues of Heterogeneous Hadoop Clusters in Cloud Computing, GJCST, Vol. XI, Issue VII, May 2011.
[9] Jiong Xie, Shu Yin, Xiaojun Ruan, Zhiyang Ding, Yun Tian, James Majors, Adam Manzanares, and Xiao Qin, Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters.