An Overview of Big Data Challenges, Tools and ... - ijraonline.com

DOI: 10.17812/IJRA.4.16(100)2017

International Journal of Research and Applications Oct - Dec © 2017 Transactions 4(16): 596-601

eISSN : 2349 – 0020 pISSN : 2394 – 4544

www.ijraonline.com COMPUTERS

RESEARCH ARTICLE

An Overview of Big Data Challenges, Tools and Techniques Sriramoju Ajay Babu, Namavaram Vijay and Ramesh Gadde 1 Programmer Analyst, Randstad Technologies, 1 EQT Plaza 625 Liberty Avenue, Suite 1020 Pittsburgh, Pennsylvania -15222, USA. 2 Practicepa Ltd,IT Product Manager, 104 Stamford Road, E61LR, UK. 3 Assistant Professor, Ethiopian Institute of Technology, Mekelle University, Ethiopia ABSTRACT Big Data is the large amount of data that cannot be processed by making use of traditional methods of data processing. Due to widespread usage of many computing devices such as smart phones, laptops, wearable computing devices; the data processing over the internet has exceeded more than the modern computers can handle. Due to this high growth rate, the term Big Data is envisaged. However, the fast growth rate of large data will generate numerous challenges, such as data inconsistency and incompleteness, scalability, timeliness, and security. This paper provides a brief introduction to the Big data technology and its importance in the contemporary world. This paper addresses various challenges and issues that need to be emphasized to present the full influence of big data. The tools used in Big data technology are also discussed in detail. This paper also discusses the characteristics of Big data and the platform used in Big Data i.e. Hadoop. Keywords: Big Data, Hadoop, MapReduce. 1.

INTRODUCTION

management) to the double-digit growth rates of

Big Data has gained much attention from the last

real- time intelligence and exploration/discovery

few years in the IT industry. As we can find

of the unstructured worlds. [1].

billions of people are connected to internet worldwide, generating large amount of data at the rapid rate. The generation of this large amount of engenders various challenges. Along with

Big

Data’s

huge

benefits

to

many

organizations, the challenges and issues should also be brought into light. A forecast from International Data Corporation (IDC), the Big

Figure 1: Growth rate of Big Data from 2011-2017[2]

Data technology and services market represents a fast-growing multibillion- dollar worldwide showing that the Big Data technology and

2. BIG DATA OVERVIEW Big Data is a combination of big datasets that

services market will grow at a 26.4% compound

cannot

annual growth rate to $41.5 billion through 2019,

computing techniques. It is not a technique

or about six times the growth rate of the overall

that can be worked on its own or in isolation;

information technology market. Additionally, by

rather it involves many areas of business and

2020 IDC believes that the line of business buyers

technology. The properties of signify Big Data

will help drive analytics beyond its historical

are volume, Variety, Velocity, Variability and

sweet

Complexity as shown in figure 2[3]

opportunity. In fact, the recent IDC forecast is

spot

of

relational

IJRA | Volume 4 | Issue 16

(performance

be

© www.globalsciencepg.org

processed

using

traditional

P a g e | 596

© 2017 Global Science Publishing Group, USA

© Oct – Dec, IJRA Transactions

Figure 4. ETL process [4] Table 1: Various phases in ETL [5] Sr. Phase Description of the Phase No 1. Extraction In this phase relevant

Figure 2. Properties of Big Data

information is extracted. To

Big data involves the data produced by different devices and applications. Some of the sources of Big Data are shown in the figure.

make this phase efficient, only the data source that has been changed since

Enterprise Data

2. Public Data

Transactions

Transfor mation

recent last ETL process is considered. Data is transformed through various phases [10] The phases are 1. Data analysis;

Big Data Sources

2.

Definition

transformation Sensor Data

of

workflow

and mapping rules;

Social Media

3.Verification;

3.

Loading

Figure 3. Some Sources of Big Data

4.Transformation; and 5. Backflow of cleaned data. At the last, after the data is in the required format, it is then loaded into the data

3.

PHASES IN BIG DATA PROCESSING Before processing Big data, it must be recorded from various data generating sources. After recording, it must be filtered and compressed. Only the relevant data should be recorded by means of filters that discard useless information. In order to facilitate this work specialized tools are used such as ETL. ETL tools represent the means in which data actually gets loaded into the warehouse. The figure 3 demonstrates different stages in the process.


warehouse. 4. BIG DATA CHALLENGES Big data due to its various properties like volume, velocity, variety, variability, value and complexity put forward many challenges. The Figure 5 shows various challenges in big data [12]. Figure 6 list some of the challenges in Big data along with its impact and risks involved. [6]


P a g e | 597


© Oct – Dec, IJRA Transactions Hadoop Hadoop is an Apache open source framework which is written in java. High volumes of data, in any structure, are processed by Hadoop. Hadoop

allows

distributed

storage

and

distributed processing for very large data sets. The main components of Hadoop are:

1. Hadoop distributed file system (HDFS) 2. MapReduce The architecture of Hadoop is shown in the figure 7. Hadoop has three layers. The two major layers are MapReduce and HDFS. HDFS (Storage layer): - Hadoop has a distributed File System called HDFS, which

Figure 5. Challenges in Big data

stands for Hadoop Distributed File System. It is a File System used to store very large files with streaming data access patterns, running on clusters on commodity hardware. [8] There are two types of nodes in HDFS cluster, namely namenode and datanodes. The name node will maintain the file system tree and the metadata for all the files and directories in the tree. The data node stores and retrieve blocks as per the instructions of clients or the namenode. The data retrieved is reported back to the namenode with lists of blocks that they are storing. Without the namenode it is not possible to access the file. So, it becomes very important to make name node resilient to

Figure 6. Challenges, its impact and risk involved in Big data 5.

TECHNIQUES HANDLING

FOR

failure. [11] MapReduce Processing/Computation layer: - It BIG

DATA

is a programming paradigm which is meant for managing applications on multiple distributed

There are many techniques available for data

servers. In Map Reduce divide and conquer

management. That includes Google Big Table,

method is used to break the large complex data

Simple DB, Not Only SQL (NoSQL), Data

into small units and process them. It will read

Stream Management System (DSMS), Mem

the data from HDFS. However, it can read the

cache DB, and Voldemort [7]. But these

data from other places including mounted local

traditional approaches are only applicable to

file systems, the web, and databases. It divides

traditional data and not Big data as it cannot be

the computations between different servers or

stored on a single machine. The Big Data

nodes. It is also fault-tolerant. If anyone of the

handling techniques and tools include Hadoop,

node fails, Hadoop can understand how to

MapReduce, and Big Table. Out of these,

continue with computation by reassigning the

Hadoop is one of the most widely used

incomplete work to another node and cleaning

technologies.

up after the node that could not complete its



P a g e | 598



task. It also knows how to combine the results of the computation in one place. [9]. The other core

8)

components in Hadoop architecture includes Hadoop YARN, it is a framework for job scheduling and cluster resource management. The other component is the cluster which is the

9)

set of host machines (nodes).

10)

11) Figure 7. Hadoop Architecture 6.

CONCLUSION 12)

As there are huge volumes of data that are produced every day, so such large size of data it becomes very challenging to achieve effective processing

using

the

existing

traditional

13)

techniques Big data is data that exceeds the processing capacity of conventional database systems. In this paper fundamental concepts about Big Data are presented. These concepts include Big Data characteristics, challenges and

14)

techniques for handling big data. REFERENCES 1) 2) 3)

https://www.idc.com/prodserv/4Pillars/b igdata www.Wikibon.org

15)

A, Katal, Wazid M, and Goudar R.H. "Big data: Issues, challenges, tools and Good practices." Noida: 2013, pp. 404 – 409, 8-10 Aug. 2013.

4)

Golfarelli, M., & Rizzi, S. (2009). Data warehouse design: modern principles and methodologies. Columbus: McGraw-Hill 5) Almeida, F., and Calistru, C, "The Main Challenges and Issues of Big Data Management", International Journal of Research Studies in Computing, 2(1), 2013, pp. 11-20. 6) https://www.progress.com. 7) M. Chen, S. Mao, and Y. Liu, “Big data: a survey,” Mobile Networks and Applications,


16)

17)

vol. 19, no. 2, pp. 171–209, 2014 Apache Hadoop (2013). HDFS Architecture Guide [Online]. Available: https://hadoop. apache.org/docs/r1.2.1/hdfs_design.ht Amrit pal, Pinki Aggrawal, “A Performance Analysis of Map Reduce Task with Large Number of Files Dataset in Big Data using Hadoop” Forth International Conference on Communication Systems and Network Technologies, 2014. Rahm, E., & Hai Do, H. (2000). Data cleaning: problems and current approaches. Bulletin of the Technical Committee on Data Engineering, 23(4), 3-13.). Apache Hadoop (2013). HDFS Architecture Guide [Online]. Available: https://hadoop. apache.org/docs/r1.2.1/hdfs_design.ht Intel, “Big Data Analaytics,”2012, http://www.intel.com/content/dam/www/pub lic/us/en/documents/reports/data-insights peer- research-report.pdf Shoban Babu Sriramoju, " Review on Big Data and Mining Algorithm" in “International Journal for Research in Applied Science and Engineering Technology”, Volume-5, IssueXI, November 2017, 1238-1243 [ ISSN: 23219653], www.ijraset.com Shoban Babu Sriramoju, “OPPORTUNITIES AND SECURITY IMPLICATIONS OF BIG DATA MINING” in “International Journal of Research in Science and Engineering”, http://ijrise.org/asset/archive/17Dec7.pdf, Vol3, Issue-6, Nov-Dec 2017, 44-58 [ ISSN: 23948299]. Guguloth Vijaya, A. Devaki, Dr. Shoban Babu Sriramoju, “A Framework for Solving Identity Disclosure Problem in Collaborative Data Publishing” in International Journal of Research and Applications (Apr-Jun © 2015 Transactions) 2(6): 292-295. Shoban Babu Sriramoju, "Heat Diffusion Based Search for Experts on World Wide Web" in “International Journal of Science and Research”, https://www.ijsr.net/archive/v6i11/v6i11.php, Volume 6, Issue 11, November 2017, 632 635, #ijsrnet. Dr. Shoban Babu Sriramoju, Prof. Mangesh Ingle, Prof. Ashish Mahalle “Trust and Iterative Filtering Approaches for Secure


P a g e | 599


18)

19)

20)

21)

22)

23)

24)

25)

Data Collection in Wireless Sensor Networks” in “International Journal of Research in Science and Engineering” Vol 3, Issue 4, July-August 2017 [ ISSN: 2394-8299]. Dr. Shoban Babu, Prof. Mangesh Ingle, Prof. Ashish Mahalle, “HLA Based solution for Packet Loss Detection in Mobile Ad Hoc Networks” in “International Journal of Research in Science and Engineering” Vol 3, Issue 4, July-August 2017 [ ISSN: 2394-8299]. Shoban Babu Sriramoju, “A Framework for Keyword Based Query and Response System for Web Based Expert Search” in “International Journal of Science and Research” Index Copernicus Value (2015):78.96 [ ISSN: 2319-7064]. Sriramoju Ajay Babu, Dr. S. Shoban Babu, “Improving Quality of Content Based Image Retrieval with Graph Based Ranking” in “International Journal of Research and Applications” Vol 1, Issue 1, Jan-Mar 2014 [ ISSN: 2349-0020]. Dr. Shoban Babu Sriramoju, Ramesh Gadde, “A Ranking Model Framework for Multiple Vertical Search Domains” in “International Journal of Research and Applications” Vol 1, Issue 1, Jan-Mar 2014 [ ISSN: 2349-0020]. Mounika Reddy, Avula Deepak, Ekkati Kalyani Dharavath, Kranthi Gande, Shoban Sriramoju, “Risk Aware Response Answer for Mitigating Painter Routing Attacks” in “International Journal of Information Technology and Management” Vol VI, Issue I, Feb 2014 [ ISSN: 2249-4510]. Mounica Doosetty, Keerthi Kodakandla, Ashok R, Shoban Babu Sriramoju, “Extensive Secure Cloud Storage System Supporting Privacy-Preserving Public Auditing” in “International Journal of Information Technology and Management” Vol VI, Issue I, Feb 2012 [ ISSN: 2249-4510]. Shoban Babu Sriramoju, “An Application for Annotating Web Search Results” in “International Journal of Innovative Research in Computer and Communication Engineering” Vol 2, Issue 3, March 2014 [ ISSN(online): 2320-9801, ISSN(print) : 2320-9798 ]. Shoban Babu Sriramoju, “Multi View Point Measure for Achieving Highest IntraCluster Similarity” in “International Journal of Innovative Research in Computer and



26)

27)

28)

29)

30)

31)

32)

33)

Communication Engineering” Vol 2, Issue 3, March 2014 [ ISSN(online): 2320-9801, ISSN(print): 2320-9798]. Shoban Babu Sriramoju, Madan Kumar Chandran, “UP-Growth Algorithms for Knowledge Discovery from Transactional Databases” in “International Journal of Advanced Research in Computer Science and Software Engineering” Vol 4, Issue 2,360-365, February 2014 [ ISSN: 2277 128X]. Shoban Babu Sriramoju, Azmera Chandu Naik, N. Samba Siva Rao, “Predicting The Misusability Of Data From Malicious Insiders” in “International Journal of Computer Engineering and Applications” Vol V,Issue II,Febrauary 2014 [ ISSN : 2321-3469 ]. Ajay Babu Sriramoju, Shoban Babu Sriramoju, “Analysis in Image Compression Using Bit-Plane Separation Method” in “International Journal of Information Technology and Management” Vol VII, Issue X, November 2014 [ ISSN: 2249-4510]. Shoban Babu Sriramoju, “Mining Big Sources Using Efficient Data Mining Algorithms” in “International Journal of Innovative Research in Computer and Communication Engineering” Vol 2, Issue 1, January 2014 [ ISSN(online): 2320-9801, ISSN(print): 2320-9798]. Mr Ajay Babu Sriramoju, Dr Shoban Babu,“Objective Quality Metric Design for Wireless Image and Video Communication” in “International Journal of Information Technology and Management” Vol VII, Issue 10, November 2014 [ ISSN: 2249-4510]. Ajay Babu Sriramoju, Dr Shoban Babu, “Study of Multiplexing Space and Focal Surfaces and Automultiscopic Displays for Image Processing” in “International Journal of Information Technology and Management” Vol V, Issue I, August 2013 [ ISSN: 2249-4510]. Dr. Shoban Babu Sriramoju, “A Review on Processing Big Data” in “International Journal of Innovative Research in Computer and Communication Engineering” Vol-2, Issue-1, January 2014, [ ISSN(online): 23209801, ISSN(print): 2320-9798]. Shoban Babu Sriramoju, Dr. Atul Kumar, “An Analysis around the study of Distributed Data Mining Method in the Grid Environment: Technique, Algorithms


P a g e | 600

© 2017 Global Science Publishing Group, USA and Services” in “Journal of Advances in Science and Technology” Vol-IV, Issue NoVII, November 2012 [ ISSN: 2230-9659]. 34) Shoban Babu Sriramoju, Dr. Atul Kumar, “An Analysis on Effective, Precise and Privacy Preserving Data Mining Association Rules with Partitioning on Distributed Databases” in “International Journal of Information Technology and management” Vol-III, Issue-I, August 2012 [ ISSN: 2249-4510]. 35) Shoban Babu Sriramoju, Dr. Atul Kumar, “A Competent Strategy Regarding Relationship of Rule Mining on Distributed Database Algorithm” in “Journal of Advances in Science and Technology” Vol-II, Issue No-II, November 2011 [ ISSN: 2230-9659]. 36) Shoban Babu Sriramoju, Dr. Atul Kumar, “Allocated Greater Order Organization of Rule Mining utilizing Information Produced Through Textual facts” in “International Journal of Information Technology and management” Vol-I, Issue-I, August 2011 [ ISSN: 2249-4510]. 37) Ramesh Gadde, Namavaram Vijay, “A SURVEY ON EVOLUTION OF BIG DATA WITH HADOOP” in “International Journal of Research in Science and Engineering”, Vol-3, Issue-6, Nov-Dec 2017, 92-99 [ ISSN: 2394-8299]. 38) Namavaram Vijay, S Ajay Babu, "Heat Exposure of Big Data Analytics in a Workflow Framework" in “International



39)

40)

Journal of Science and Research”, Volume 6, Issue 11, November 2017, 1578 - 1585, #ijsrnet. Ajay Babu Sriramoju, Namavaram Vijay, Ramesh Gadde, “SKETCHING-BASED HIGH-PERFORMANCE BIG DATA PROCESSING ACCELERATOR” in “International Journal of Research in Science and Engineering”, Vol-3, Issue-6, Nov-Dec 2017, 92-99 [ ISSN: 2394-8299]. Namavaram Vijay, Ajay Babu Sriramoju, Ramesh Gadde, “Two Layered Privacy Architecture for Big Data Framework” in “International Journal of Innovative Research in Computer and Communication Engineering” Vol 5, Issue 10, October 2017 [ ISSN(online): 2320-9801, ISSN(print): 2320-9798]


P a g e | 601