1 EQT Plaza 625 Liberty Avenue, Suite 1020 Pittsburgh, Pennsylvania -15222, USA. ... 1. INTRODUCTION. Big Data has gained much attention from the last.
DOI: 10.17812/IJRA.4.16(100)2017
International Journal of Research and Applications Oct - Dec © 2017 Transactions 4(16): 596-601
eISSN : 2349 – 0020 pISSN : 2394 – 4544
www.ijraonline.com COMPUTERS
RESEARCH ARTICLE
An Overview of Big Data Challenges, Tools and Techniques Sriramoju Ajay Babu, Namavaram Vijay and Ramesh Gadde 1 Programmer Analyst, Randstad Technologies, 1 EQT Plaza 625 Liberty Avenue, Suite 1020 Pittsburgh, Pennsylvania -15222, USA. 2 Practicepa Ltd,IT Product Manager, 104 Stamford Road, E61LR, UK. 3 Assistant Professor, Ethiopian Institute of Technology, Mekelle University, Ethiopia ABSTRACT Big Data is the large amount of data that cannot be processed by making use of traditional methods of data processing. Due to widespread usage of many computing devices such as smart phones, laptops, wearable computing devices; the data processing over the internet has exceeded more than the modern computers can handle. Due to this high growth rate, the term Big Data is envisaged. However, the fast growth rate of large data will generate numerous challenges, such as data inconsistency and incompleteness, scalability, timeliness, and security. This paper provides a brief introduction to the Big data technology and its importance in the contemporary world. This paper addresses various challenges and issues that need to be emphasized to present the full influence of big data. The tools used in Big data technology are also discussed in detail. This paper also discusses the characteristics of Big data and the platform used in Big Data i.e. Hadoop. Keywords: Big Data, Hadoop, MapReduce. 1.
INTRODUCTION
management) to the double-digit growth rates of
Big Data has gained much attention from the last
real- time intelligence and exploration/discovery
few years in the IT industry. As we can find
of the unstructured worlds. [1].
billions of people are connected to internet worldwide, generating large amount of data at the rapid rate. The generation of this large amount of engenders various challenges. Along with
Big
Data’s
huge
benefits
to
many
organizations, the challenges and issues should also be brought into light. A forecast from International Data Corporation (IDC), the Big
Figure 1: Growth rate of Big Data from 2011-2017[2]
Data technology and services market represents a fast-growing multibillion- dollar worldwide showing that the Big Data technology and
2. BIG DATA OVERVIEW Big Data is a combination of big datasets that
services market will grow at a 26.4% compound
cannot
annual growth rate to $41.5 billion through 2019,
computing techniques. It is not a technique
or about six times the growth rate of the overall
that can be worked on its own or in isolation;
information technology market. Additionally, by
rather it involves many areas of business and
2020 IDC believes that the line of business buyers
technology. The properties of signify Big Data
will help drive analytics beyond its historical
are volume, Variety, Velocity, Variability and
sweet
Complexity as shown in figure 2[3]
opportunity. In fact, the recent IDC forecast is
spot
of
relational
IJRA | Volume 4 | Issue 16
(performance
be
© www.globalsciencepg.org
processed
using
traditional
P a g e | 596
© 2017 Global Science Publishing Group, USA
© Oct – Dec, IJRA Transactions
Figure 4. ETL process [4] Table 1: Various phases in ETL [5] Sr. Phase Description of the Phase No 1. Extraction In this phase relevant
Figure 2. Properties of Big Data
information is extracted. To
Big data involves the data produced by different devices and applications. Some of the sources of Big Data are shown in the figure.
make this phase efficient, only the data source that has been changed since
Enterprise Data
2. Public Data
Transactions
Transfor mation
recent last ETL process is considered. Data is transformed through various phases [10] The phases are 1. Data analysis;
Big Data Sources
2.
Definition
transformation Sensor Data
of
workflow
and mapping rules;
Social Media
3.Verification;
3.
Loading
Figure 3. Some Sources of Big Data
4.Transformation; and 5. Backflow of cleaned data. At the last, after the data is in the required format, it is then loaded into the data
3.
PHASES IN BIG DATA PROCESSING Before processing Big data, it must be recorded from various data generating sources. After recording, it must be filtered and compressed. Only the relevant data should be recorded by means of filters that discard useless information. In order to facilitate this work specialized tools are used such as ETL. ETL tools represent the means in which data actually gets loaded into the warehouse. The figure 3 demonstrates different stages in the process.
IJRA | Volume 4 | Issue 16
warehouse. 4. BIG DATA CHALLENGES Big data due to its various properties like volume, velocity, variety, variability, value and complexity put forward many challenges. The Figure 5 shows various challenges in big data [12]. Figure 6 list some of the challenges in Big data along with its impact and risks involved. [6]
© www.globalsciencepg.org
P a g e | 597
© 2017 Global Science Publishing Group, USA
© Oct – Dec, IJRA Transactions Hadoop Hadoop is an Apache open source framework which is written in java. High volumes of data, in any structure, are processed by Hadoop. Hadoop
allows
distributed
storage
and
distributed processing for very large data sets. The main components of Hadoop are:
1. Hadoop distributed file system (HDFS) 2. MapReduce The architecture of Hadoop is shown in the figure 7. Hadoop has three layers. The two major layers are MapReduce and HDFS. HDFS (Storage layer): - Hadoop has a distributed File System called HDFS, which
Figure 5. Challenges in Big data
stands for Hadoop Distributed File System. It is a File System used to store very large files with streaming data access patterns, running on clusters on commodity hardware. [8] There are two types of nodes in HDFS cluster, namely namenode and datanodes. The name node will maintain the file system tree and the metadata for all the files and directories in the tree. The data node stores and retrieve blocks as per the instructions of clients or the namenode. The data retrieved is reported back to the namenode with lists of blocks that they are storing. Without the namenode it is not possible to access the file. So, it becomes very important to make name node resilient to
Figure 6. Challenges, its impact and risk involved in Big data 5.
TECHNIQUES HANDLING
FOR
failure. [11] MapReduce Processing/Computation layer: - It BIG
DATA
is a programming paradigm which is meant for managing applications on multiple distributed
There are many techniques available for data
servers. In Map Reduce divide and conquer
management. That includes Google Big Table,
method is used to break the large complex data
Simple DB, Not Only SQL (NoSQL), Data
into small units and process them. It will read
Stream Management System (DSMS), Mem
the data from HDFS. However, it can read the
cache DB, and Voldemort [7]. But these
data from other places including mounted local
traditional approaches are only applicable to
file systems, the web, and databases. It divides
traditional data and not Big data as it cannot be
the computations between different servers or
stored on a single machine. The Big Data
nodes. It is also fault-tolerant. If anyone of the
handling techniques and tools include Hadoop,
node fails, Hadoop can understand how to
MapReduce, and Big Table. Out of these,
continue with computation by reassigning the
Hadoop is one of the most widely used
incomplete work to another node and cleaning
technologies.
up after the node that could not complete its
IJRA | Volume 4 | Issue 16
© www.globalsciencepg.org
P a g e | 598
© 2017 Global Science Publishing Group, USA
© Oct – Dec, IJRA Transactions
task. It also knows how to combine the results of the computation in one place. [9]. The other core
8)
components in Hadoop architecture includes Hadoop YARN, it is a framework for job scheduling and cluster resource management. The other component is the cluster which is the
9)
set of host machines (nodes).
10)
11) Figure 7. Hadoop Architecture 6.
CONCLUSION 12)
As there are huge volumes of data that are produced every day, so such large size of data it becomes very challenging to achieve effective processing
using
the
existing
traditional
13)
techniques Big data is data that exceeds the processing capacity of conventional database systems. In this paper fundamental concepts about Big Data are presented. These concepts include Big Data characteristics, challenges and
14)
techniques for handling big data. REFERENCES 1) 2) 3)
https://www.idc.com/prodserv/4Pillars/b igdata www.Wikibon.org
15)
A, Katal, Wazid M, and Goudar R.H. "Big data: Issues, challenges, tools and Good practices." Noida: 2013, pp. 404 – 409, 8-10 Aug. 2013.
4)
Golfarelli, M., & Rizzi, S. (2009). Data warehouse design: modern principles and methodologies. Columbus: McGraw-Hill 5) Almeida, F., and Calistru, C, "The Main Challenges and Issues of Big Data Management", International Journal of Research Studies in Computing, 2(1), 2013, pp. 11-20. 6) https://www.progress.com. 7) M. Chen, S. Mao, and Y. Liu, “Big data: a survey,” Mobile Networks and Applications,
IJRA | Volume 4 | Issue 16
16)
17)
vol. 19, no. 2, pp. 171–209, 2014 Apache Hadoop (2013). HDFS Architecture Guide [Online]. Available: https://hadoop. apache.org/docs/r1.2.1/hdfs_design.ht Amrit pal, Pinki Aggrawal, “A Performance Analysis of Map Reduce Task with Large Number of Files Dataset in Big Data using Hadoop” Forth International Conference on Communication Systems and Network Technologies, 2014. Rahm, E., & Hai Do, H. (2000). Data cleaning: problems and current approaches. Bulletin of the Technical Committee on Data Engineering, 23(4), 3-13.). Apache Hadoop (2013). HDFS Architecture Guide [Online]. Available: https://hadoop. apache.org/docs/r1.2.1/hdfs_design.ht Intel, “Big Data Analaytics,”2012, http://www.intel.com/content/dam/www/pub lic/us/en/documents/reports/data-insights peer- research-report.pdf Shoban Babu Sriramoju, " Review on Big Data and Mining Algorithm" in “International Journal for Research in Applied Science and Engineering Technology”, Volume-5, IssueXI, November 2017, 1238-1243 [ ISSN: 23219653], www.ijraset.com Shoban Babu Sriramoju, “OPPORTUNITIES AND SECURITY IMPLICATIONS OF BIG DATA MINING” in “International Journal of Research in Science and Engineering”, http://ijrise.org/asset/archive/17Dec7.pdf, Vol3, Issue-6, Nov-Dec 2017, 44-58 [ ISSN: 23948299]. Guguloth Vijaya, A. Devaki, Dr. Shoban Babu Sriramoju, “A Framework for Solving Identity Disclosure Problem in Collaborative Data Publishing” in International Journal of Research and Applications (Apr-Jun © 2015 Transactions) 2(6): 292-295. Shoban Babu Sriramoju, "Heat Diffusion Based Search for Experts on World Wide Web" in “International Journal of Science and Research”, https://www.ijsr.net/archive/v6i11/v6i11.php, Volume 6, Issue 11, November 2017, 632 635, #ijsrnet. Dr. Shoban Babu Sriramoju, Prof. Mangesh Ingle, Prof. Ashish Mahalle “Trust and Iterative Filtering Approaches for Secure
© www.globalsciencepg.org
P a g e | 599
© 2017 Global Science Publishing Group, USA
18)
19)
20)
21)
22)
23)
24)
25)
Data Collection in Wireless Sensor Networks” in “International Journal of Research in Science and Engineering” Vol 3, Issue 4, July-August 2017 [ ISSN: 2394-8299]. Dr. Shoban Babu, Prof. Mangesh Ingle, Prof. Ashish Mahalle, “HLA Based solution for Packet Loss Detection in Mobile Ad Hoc Networks” in “International Journal of Research in Science and Engineering” Vol 3, Issue 4, July-August 2017 [ ISSN: 2394-8299]. Shoban Babu Sriramoju, “A Framework for Keyword Based Query and Response System for Web Based Expert Search” in “International Journal of Science and Research” Index Copernicus Value (2015):78.96 [ ISSN: 2319-7064]. Sriramoju Ajay Babu, Dr. S. Shoban Babu, “Improving Quality of Content Based Image Retrieval with Graph Based Ranking” in “International Journal of Research and Applications” Vol 1, Issue 1, Jan-Mar 2014 [ ISSN: 2349-0020]. Dr. Shoban Babu Sriramoju, Ramesh Gadde, “A Ranking Model Framework for Multiple Vertical Search Domains” in “International Journal of Research and Applications” Vol 1, Issue 1, Jan-Mar 2014 [ ISSN: 2349-0020]. Mounika Reddy, Avula Deepak, Ekkati Kalyani Dharavath, Kranthi Gande, Shoban Sriramoju, “Risk Aware Response Answer for Mitigating Painter Routing Attacks” in “International Journal of Information Technology and Management” Vol VI, Issue I, Feb 2014 [ ISSN: 2249-4510]. Mounica Doosetty, Keerthi Kodakandla, Ashok R, Shoban Babu Sriramoju, “Extensive Secure Cloud Storage System Supporting Privacy-Preserving Public Auditing” in “International Journal of Information Technology and Management” Vol VI, Issue I, Feb 2012 [ ISSN: 2249-4510]. Shoban Babu Sriramoju, “An Application for Annotating Web Search Results” in “International Journal of Innovative Research in Computer and Communication Engineering” Vol 2, Issue 3, March 2014 [ ISSN(online): 2320-9801, ISSN(print) : 2320-9798 ]. Shoban Babu Sriramoju, “Multi View Point Measure for Achieving Highest IntraCluster Similarity” in “International Journal of Innovative Research in Computer and
IJRA | Volume 4 | Issue 16
© Oct – Dec, IJRA Transactions
26)
27)
28)
29)
30)
31)
32)
33)
Communication Engineering” Vol 2, Issue 3, March 2014 [ ISSN(online): 2320-9801, ISSN(print): 2320-9798]. Shoban Babu Sriramoju, Madan Kumar Chandran, “UP-Growth Algorithms for Knowledge Discovery from Transactional Databases” in “International Journal of Advanced Research in Computer Science and Software Engineering” Vol 4, Issue 2,360-365, February 2014 [ ISSN: 2277 128X]. Shoban Babu Sriramoju, Azmera Chandu Naik, N. Samba Siva Rao, “Predicting The Misusability Of Data From Malicious Insiders” in “International Journal of Computer Engineering and Applications” Vol V,Issue II,Febrauary 2014 [ ISSN : 2321-3469 ]. Ajay Babu Sriramoju, Shoban Babu Sriramoju, “Analysis in Image Compression Using Bit-Plane Separation Method” in “International Journal of Information Technology and Management” Vol VII, Issue X, November 2014 [ ISSN: 2249-4510]. Shoban Babu Sriramoju, “Mining Big Sources Using Efficient Data Mining Algorithms” in “International Journal of Innovative Research in Computer and Communication Engineering” Vol 2, Issue 1, January 2014 [ ISSN(online): 2320-9801, ISSN(print): 2320-9798]. Mr Ajay Babu Sriramoju, Dr Shoban Babu,“Objective Quality Metric Design for Wireless Image and Video Communication” in “International Journal of Information Technology and Management” Vol VII, Issue 10, November 2014 [ ISSN: 2249-4510]. Ajay Babu Sriramoju, Dr Shoban Babu, “Study of Multiplexing Space and Focal Surfaces and Automultiscopic Displays for Image Processing” in “International Journal of Information Technology and Management” Vol V, Issue I, August 2013 [ ISSN: 2249-4510]. Dr. Shoban Babu Sriramoju, “A Review on Processing Big Data” in “International Journal of Innovative Research in Computer and Communication Engineering” Vol-2, Issue-1, January 2014, [ ISSN(online): 23209801, ISSN(print): 2320-9798]. Shoban Babu Sriramoju, Dr. Atul Kumar, “An Analysis around the study of Distributed Data Mining Method in the Grid Environment: Technique, Algorithms
© www.globalsciencepg.org
P a g e | 600
© 2017 Global Science Publishing Group, USA and Services” in “Journal of Advances in Science and Technology” Vol-IV, Issue NoVII, November 2012 [ ISSN: 2230-9659]. 34) Shoban Babu Sriramoju, Dr. Atul Kumar, “An Analysis on Effective, Precise and Privacy Preserving Data Mining Association Rules with Partitioning on Distributed Databases” in “International Journal of Information Technology and management” Vol-III, Issue-I, August 2012 [ ISSN: 2249-4510]. 35) Shoban Babu Sriramoju, Dr. Atul Kumar, “A Competent Strategy Regarding Relationship of Rule Mining on Distributed Database Algorithm” in “Journal of Advances in Science and Technology” Vol-II, Issue No-II, November 2011 [ ISSN: 2230-9659]. 36) Shoban Babu Sriramoju, Dr. Atul Kumar, “Allocated Greater Order Organization of Rule Mining utilizing Information Produced Through Textual facts” in “International Journal of Information Technology and management” Vol-I, Issue-I, August 2011 [ ISSN: 2249-4510]. 37) Ramesh Gadde, Namavaram Vijay, “A SURVEY ON EVOLUTION OF BIG DATA WITH HADOOP” in “International Journal of Research in Science and Engineering”, Vol-3, Issue-6, Nov-Dec 2017, 92-99 [ ISSN: 2394-8299]. 38) Namavaram Vijay, S Ajay Babu, "Heat Exposure of Big Data Analytics in a Workflow Framework" in “International
IJRA | Volume 4 | Issue 16
© Oct – Dec, IJRA Transactions
39)
40)
Journal of Science and Research”, Volume 6, Issue 11, November 2017, 1578 - 1585, #ijsrnet. Ajay Babu Sriramoju, Namavaram Vijay, Ramesh Gadde, “SKETCHING-BASED HIGH-PERFORMANCE BIG DATA PROCESSING ACCELERATOR” in “International Journal of Research in Science and Engineering”, Vol-3, Issue-6, Nov-Dec 2017, 92-99 [ ISSN: 2394-8299]. Namavaram Vijay, Ajay Babu Sriramoju, Ramesh Gadde, “Two Layered Privacy Architecture for Big Data Framework” in “International Journal of Innovative Research in Computer and Communication Engineering” Vol 5, Issue 10, October 2017 [ ISSN(online): 2320-9801, ISSN(print): 2320-9798]
© www.globalsciencepg.org
P a g e | 601