Beginning with Big Data Simplified
Punam Bedi, Vinita Jindal, Anjali Gautam
Department of Computer Science, University of Delhi, Delhi, India
[email protected], [email protected], [email protected]
Abstract— Big Data refers to collections of datasets containing massive amounts of data, in the range of zettabytes and yottabytes. Organizations face difficulties in manipulating and managing this massive data, as existing traditional database and software techniques are unable to process and analyze such voluminous data. Dealing with Big Data requires new tools and techniques that can extract valuable information through analytic processes. Volume, Variety, Velocity, Value, Veracity, Variability and Complexity are attributes associated with Big Data in various works in the literature. In this paper, we briefly describe these existing attributes and also propose adding Viability, Cost and Consistency as new attributes to this set. The paper also discusses existing tools and techniques associated with Big Data. Fleet management is an evolving application of GPS data; it is taken as a case study in this work to illustrate the various attributes of Big Data. Finally, the paper presents the implementation of a sorting problem on the GPS data with varying Hadoop cluster sizes.
Keywords — Big Data; V’s of Big Data; C’s of Big Data; GPS data; Hadoop; Map Reduce.
I. INTRODUCTION
Big Data is currently at the forefront of attention for researchers in all disciplines. It refers to collections of data sets so complex and voluminous that they become difficult to process and analyze using conventional database systems. Big Data can be gathered from various sources such as healthcare data, retail-sector data, sensor data, posts to social media sites, digital pictures and videos, purchase transaction records and cell phone GPS signals. Data collected from these sources can be structured, semi-structured or unstructured in nature. Moreover, Big Data helps businesses target their potential customers and recommend the services they are looking for. These days, almost every company is creating digital representations of its existing data, which results in rapid growth of digital data. Many industries are involved in the creation and digitization of existing data and are becoming apt sources for Big Data. Big Data comes along with many issues and challenges such as storage, scalability, processing, timeliness, privacy and security.
Big Data can be used to analyze business trends, improve the quality of research, combat crime, forecast weather, prevent diseases and so on, by using real-time large datasets. The size of Big Data is constantly growing, doubling roughly every 40 months, and in early 2013 single data sets ranged from a few dozen zettabytes to yottabytes [12]. A key challenge for researchers is dealing with this fast growth rate of Big Data: they must be able to design new techniques and tools for handling the data efficiently and examining it to extract significant meaning for decision making.
Many devices such as smartphones, tablets, laptops and computers generate GPS data, which is voluminous and rapidly growing. This volume of real-time GPS data cannot be handled using traditional methods; instead, parallel computing systems are needed. GPS data is becoming ubiquitous and plays a major role in logistics, both for preventing road accidents and for keeping track of vehicle activities. Thus, in this paper GPS data is taken as a case study to illustrate the concepts of Big Data. GPS data consists of various attributes like GPS Id, registration number, date, time, speed, longitude, latitude, status, etc. The data is recorded periodically and time stamped, and needs to be sorted on date and time for ease of search. Fleet management is an emerging application of GPS data. It deals with managing a fleet of vehicles by knowing the real-time location of all drivers; this information allows management to meet customer needs more efficiently. The data can be sorted on GPS Id, along with date and time, for tracking a vehicle, which helps the management in analysing drivers' behaviour and providing services to customers.

This paper discusses the present state of the art in the new field of Big Data computing, describing various attributes of such data. The rest of the paper is organized as follows: Section 2 discusses the existing and proposed attributes of Big Data, with the concepts further illustrated with the help of a case study on GPS data. Tools and techniques for Big Data are presented in Section 3. Section 4 gives the implementation of the case study of the sorting problem in GPS data using Big Data techniques, and finally Section 5 concludes with a summary of the work.
II. ATTRIBUTES OF BIG DATA
A. Existing Attributes of Big Data
Big Data can be defined as "the tools or techniques for describing the new generation of technologies and architectures that are designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery and/or analysis" [7]. The definition of Big Data was further described in terms of its attributes by Doug Laney as 3 V's, Volume, Variety and Velocity, in 2011 [7], [13]. Later, in 2012, IBM added two more V's, Value and Veracity [5], making 5 V's of Big Data [14], [17]. Further, in 2013, one more V, Variability, was proposed to make 6 V's of Big Data [4], [18]. These 6 V's are listed as Volume, Variety, Velocity, Value, Veracity and Variability and are described as follows:

Volume refers to the amount of data being generated by different data sources contributing to Big Data in real time.

Variety refers to the data coming from varied sources which differ in the type of data being delivered. It ranges from unstructured data like blogs, emails, data from social networking sites and audio and video messages, through semi-structured data like XML files, to structured data such as logs and data from databases and data warehouses, to name a few.

Velocity relates to the fact that the data is being generated at a very fast rate. It includes not only the rate of generation of data but also the time it takes to process the incoming data in the system and the frequency with which the data is delivered.

Value refers to the worth that can be extracted from the data. It is crucial for an organization to analyse and predict the value of its data in order to predict the behaviour of its customers, formulate a policy or make a decision for maximizing profits.

Veracity relates to data certainty and trustworthiness in terms of collection, processing methods, trusted infrastructure and data origin. This ensures that the data used is protected from unauthorized access and modifications throughout its lifecycle. Data veracity depends on the security infrastructure deployed for the Big Data infrastructure [5].

Variability refers to the peaks in the data load and takes into account the data which is rapidly engulfing the system. This causes a wave effect which requires the data model and associations to be altered so that inferences can be drawn from the growing data stock.

Complexity measures the degree of interdependence and the degree of interconnection between the data [10]. It caters to whether making a small change to one or more elements would result in a corresponding change in the other data or not [11].

GPS data is very common nowadays, as smartphones, laptops and other GPS-enabled devices use GPS for locating the current position, navigating routes and so on. The exponential increase in this data makes it voluminous. It is generated in real time through various sources, making the data volatile and diverse. In order to make any routing decision it is crucial to get the exact value of the GPS data. There is significant variability in GPS data due to the differences in the quality and speed of the various devices that generate it. All these factors bring complexity into the data, as a degree of interdependence along with a degree of interconnection exists between the data.

B. Proposed Attributes of Big Data

With the advent of technology, data generation has increased by leaps and bounds. The data in general has varied dimensions. The need of the hour is to process the data efficiently in real time by identifying the pertinent dimensions.
The entire process of analyzing Big Data also requires a cost-effective, simple and consistent approach, which motivated us to propose three new attributes of Big Data: Viability, Cost and Consistency, explained below. The sets of existing and proposed 7 V's and 3 C's of Big Data are shown in Fig. 1 and Fig. 2 respectively.

Viability: The data is multidimensional by nature. Depending on the outcome, not all the dimensions are relevant during processing. Moreover, considering all the dimensions of the data makes the process inefficient in terms of both space and time. Thus, there is a need to analyze massive data sets in real time with efficiency. This requires carefully selecting the dimensions and factors that are most likely to predict the outcomes which matter most to businesses. One needs to uncover the latent and hidden relationships among the dimensions by checking the sustainability of the data and its dimensions, in order to build a cost-effective predictive model quickly. Viability is thus the selection of the relevant dimensions and factors of Big Data that are responsible for predicting the outcomes which matter most to businesses. GPS data, for example, has many attributes, and hence viability plays a major role in selecting the desired attributes for sorting and easy access of the data as per the requirements of the vendor. For example, GPS Id, date, time, latitude and longitude are common attributes for any given problem using GPS data; any other attribute in GPS data is problem specific and can be chosen appropriately, depending upon the problem, to help in generating a cost-effective predictive model.
Fig. 1. 7 V’s of Big Data.
Fig. 2. 3 C’s of Big Data.
Cost: This attribute is complementary to the volume attribute of Big Data. The rate at which the data grows is significant; hence Big Data cannot be stored and processed by traditional databases and requires altogether new technology, which comes at a cost. GPS data needs special equipment and a working internet connection for storing and accessing the data periodically. The database collected through GPS-enabled devices becomes very large, and its processing is beyond the capability of existing DBMSs. Additional tools and techniques are needed to handle this large volume of real data, which brings an additional cost for the desired infrastructure using Big Data technologies.

Consistency: Consistency refers to the data that flows among various sources and is shared by multiple users. Maintaining consistency among multiple duplicates stored at different locations is non-trivial. Two issues that need to be addressed carefully with consistency are the simultaneous write and read problems. These days many GPS-enabled devices such as laptops, smartphones and GPS navigators in vehicles exist, and the formats of the data fetched by these devices may vary. Moreover, different devices may provide the same results, leading to duplication of data. There is thus a need for consistency among the data.

In the next section an overview of the tools and techniques used by Big Data analytics is presented.
III. TOOLS AND TECHNIQUES
As Big Data handles a large amount of data from varied sources, the system must be able to capture huge amounts of data from multiple devices. The storage capacity of devices has increased with time, which has to some extent taken care of the volume characteristic, but the time to read and write data has not shown any comparable improvement. The access time can be reduced by reading simultaneously from multiple disks. Relying heavily on hardware in this way induces a high chance of failure, which can be mitigated by keeping replicas of the data on different devices. This can be handled by providing a file system organized in the form of clusters comprising various nodes or servers [11]. Apart from reading and writing data from multiple disks, combining the data read from multiple devices is a major challenge faced by techniques for processing and analyzing Big Data. This can be achieved by selecting a programming paradigm based on a master-slave architecture, so that each node can be programmed to carry out the desired task of combining data from multiple devices and processing it. It should also facilitate carrying out the computation close to the data. Processing of Big Data is made easier by Hadoop, an Apache project which supports the storage and processing of large data sets across clusters of machines.

Hadoop: Hadoop is an open source project hosted by the Apache Software Foundation. It is designed to work on distributed systems [16]. The main components of Hadoop are a file system called the Hadoop Distributed File
System (HDFS) and a programming paradigm known as Map Reduce, which are discussed in detail below.

HDFS: The main functionality of HDFS lies in storing data in a distributed manner on the nodes. HDFS is organized in the form of clusters, each consisting of several nodes for the storage of data. A file is divided into blocks and these blocks are stored on different nodes. Each block is large, 64 MB by default. Each cluster in HDFS comprises two different types of nodes, namely the Namenode and the Datanodes, which differ from each other in computation and functionality. The Namenode does not store the data itself; instead it stores all the metadata for the directories and files, managing the namespace tree and the mapping of file blocks to Datanodes. Whenever an HDFS client needs to access data residing in a cluster, it sends a request to the Namenode to procure the list of Datanodes where the blocks reside; therefore, without accessing the Namenode, one cannot access a file. The Datanodes, on the other hand, store the contents of files in the form of blocks. Each block residing on a Datanode is replicated to other Datanodes within the same cluster to assure fault tolerance. All connections and communications among the nodes in a cluster are done via TCP-based protocols.
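To make this client interaction concrete, the following is a minimal illustrative sketch (not code from the paper) of writing and reading an HDFS file through Hadoop's Java FileSystem API. The HDFS path and the sample GPS record are assumptions made for the example; the Namenode lookups and Datanode transfers described above happen beneath this API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();             // picks up core-site.xml and hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);                 // client handle to HDFS
        Path file = new Path("/user/hadoop/gps/sample.txt");  // hypothetical HDFS path

        // Write: the file is split into blocks and replicated across Datanodes.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("GPS001,DL01AB1234,2014-01-15,09:30:00,42.5,77.2090,28.6139,active"); // assumed record layout
        }

        // Read: the client obtains the block locations from the Namenode and then
        // streams the contents directly from the Datanodes holding those blocks.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
    }
}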
Fig. 3. Data Flow for Map Reduce.
Fig. 4. Hadoop Architecture.
Map Reduce: Big Data analytics must handle huge volumes of data arriving in the system at high velocity, which implies that the processing techniques should scale accordingly. Hadoop provides a programming paradigm which supports such scalability, known as Map Reduce: a framework for processing problems that can be parallelized across massive datasets using a cluster, i.e. a large number of computers, or a grid. It makes use of a Map Reduce engine comprising mappers and reducers, which carry out two functions, the Map task and the Reduce task. A master node controls and co-ordinates all other worker nodes. The master node takes the data as input and divides it into fixed-size units known as splits. Each input split is assigned to a map task running on a worker node for processing. The mapper receives its input as key-value pairs and processes the given input to generate output in the same key-value form. This output is written to the local disk, not to HDFS. After the mapper finishes working on the data, the result is reported back to the master node. The master node then sorts and combines the output obtained from the different mappers for a particular key and gives the sorted output, again as key-value pairs, as input to the reducer. The reducer works on the given input and produces the final key-value output. The Map Reduce data flow is illustrated in Fig. 3. It is also possible that in the map task a worker node further subdivides the problem and assigns it to one or more other worker nodes. The Map Reduce paradigm is considered fault tolerant because whenever a worker node goes down, this is detected by the master node, which restarts the task by assigning it to some other worker node. The complete Hadoop architecture is depicted in Fig. 4. The next section first records the steps required to establish a Hadoop cluster with multiple nodes and then gives details about the implementation of the case study using the Hadoop framework. Before that, the Map Reduce programming model described above is illustrated with a small sketch.
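The classes below give a minimal illustrative sketch of such a job for sorting GPS records, written against Hadoop's org.apache.hadoop.mapreduce API. This is a reconstruction for exposition, not the authors' implementation, and the comma-separated field layout of a GPS record is an assumption. The mapper emits a composite date-time key, and the framework's shuffle phase delivers the records to the reducer already sorted on that key.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits <"date time", full record> so that records are sorted chronologically.
public class GpsSortMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        // Assumed layout: gpsId,regNo,date,time,speed,longitude,latitude,status
        String[] fields = record.toString().split(",");
        if (fields.length < 4) {
            return;                                   // skip malformed lines
        }
        String sortKey = fields[2] + " " + fields[3]; // composite date-time key
        context.write(new Text(sortKey), record);
    }
}

// Reducer (could equally live in its own source file): with a single reduce task the
// keys arrive globally sorted, so the records are simply written out in that order.
class GpsSortReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text sortKey, Iterable<Text> records, Context context)
            throws IOException, InterruptedException {
        for (Text record : records) {
            context.write(sortKey, record);
        }
    }
}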
IV. EXPERIMENTAL SETUP
In our experiment, we examine the problem of sorting GPS data and analyze its performance on a Hadoop cluster by varying the number of nodes in the cluster and then comparing these clusters in terms of the processing time required to complete a given job. All tests were first conducted on Hadoop running on a single Ubuntu machine in pseudo-distributed mode [16]. Further tests were performed on Hadoop in fully distributed mode with a varied number of nodes in the cluster. For each implementation the processing time of the job is recorded and the results are compared.

A. Cluster Setup
The very first challenge which comes across while running an application on Hadoop is setting up a Hadoop cluster. To set up a Hadoop cluster it is required to install Hadoop on each of the machines in standalone mode [16] or in pseudo-distributed mode. Hadoop installation is supported on various operating systems such as Windows, Ubuntu and Mac. For our experiment we have installed Hadoop on Ubuntu 12.04, and this section focuses on cluster setup on that platform.
The prerequisites for installing Hadoop on Ubuntu are the installation of Java, the creation of a dedicated Hadoop user, the configuration of SSH (Secure Shell) for accessing the shell remotely, and the disabling of IPv6, as it is not required to connect to an IPv6 network.

Single Node Cluster Setup
After installing the prerequisites, Hadoop is set up in pseudo-distributed mode on a single node cluster. The appropriate version of Hadoop can be downloaded from the Hadoop releases [19] provided by Apache using an Apache download mirror. The downloaded version of Hadoop is a zipped file whose contents need to be extracted into a location of your choice; this location acts as HADOOP_HOME. After extraction, the shell's configuration file is updated with HADOOP_HOME and JAVA_HOME, which specify the paths where Hadoop and Java are located respectively. Next, the directory where Hadoop will store its data files is configured and some of Hadoop's configuration files are updated. The conf directory within the extracted Hadoop directory contains the set of configuration files. Of these, hadoop-env.sh is updated with JAVA_HOME. The configuration tag in the core-site.xml file is updated with the directory configured to store the data files within Hadoop. The mapred-site.xml is updated with the host and port number of the Map Reduce Jobtracker. In hdfs-site.xml we specify the replication factor used when a file is created in HDFS; in pseudo-distributed mode on a single node it is set to 1. This completes the configuration of Hadoop. Before starting the daemons it is required to format the Namenode. After formatting the Namenode, the daemons are started on the machine; each daemon executes as a separate JVM in pseudo-distributed mode. The jps command lists all the Hadoop daemons running on the machine. Fig. 5 shows the output of the jps command in pseudo-distributed mode: it lists the running daemons, i.e. Namenode, Datanode, Secondary Namenode, JobTracker and TaskTracker.
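For reference, the properties carried by these XML files can also be set programmatically on a Hadoop Configuration object when a job is submitted from code. The sketch below mirrors a typical pseudo-distributed setup; the localhost addresses and port numbers are common conventions assumed here, not values prescribed in the paper.

import org.apache.hadoop.conf.Configuration;

public class PseudoDistributedConf {
    public static Configuration create() {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://localhost:9000"); // core-site.xml: location of the HDFS Namenode (assumed port)
        conf.set("mapred.job.tracker", "localhost:9001");     // mapred-site.xml: host and port of the Jobtracker (assumed port)
        conf.set("dfs.replication", "1");                     // hdfs-site.xml: replication factor of 1 on a single node
        return conf;
    }
}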
Fig. 5. Jps Command for Single Node Cluster.
Fig. 6. Jps Command on slave in Multi-node Cluster.
Fig. 8. Namenode Daemon UI for 5 node cluster.
Multi-node Cluster Setup
For distributed mode, it is required that the standalone or pseudo-distributed installation works well on each node. A few modifications are required in the configuration files of the shell and of Hadoop to run a multi-node cluster. The hosts file of each node is updated with the IP address and hostname of every node in the cluster. Apart from the hosts file, the slaves file in the conf directory of Hadoop is also updated to list all the nodes in the cluster which will act as slaves. The core-site.xml and mapred-site.xml are updated to reflect the port and hostname of the master node in the cluster, and hdfs-site.xml is updated to set the replication factor to 2 (the default is 3). The Namenode is then formatted and the daemons are started by executing the start-up scripts. In distributed mode, the master node runs the Namenode, Secondary Namenode and Jobtracker, while the Datanode and Tasktracker run on the slaves. Fig. 6 shows the output of the jps command on a slave machine. There are instances where the master node can also act as a slave; in this case the master has all five daemons running.
Fig. 9. Jobtracker Daemon UI for single node cluster.
B. Cluster Configuration
The hardware platform for the single-node cluster is a machine with an Intel Core 2 CPU 6600 operating at 2.40 GHz with 5.3 GB RAM, running Ubuntu 12.04 and Hadoop 0.20.2. The Namenode administration UI is shown in Fig. 7; the number of live nodes in this case is one, and this cluster runs in standalone (single-node) mode. Fully distributed mode is a cluster of machines comprising a master with a varying number of slaves. The hardware platform for the master and each slave is likewise a machine with an Intel Core 2 CPU 6600 operating at 2.40 GHz with 5.3 GB RAM, running Ubuntu 12.04 and Hadoop 0.20.2. In this paper, fully distributed mode is implemented as clusters of three and five nodes.
Fig. 10. Jobtracker Daemon UI for 5 node cluster.
Fig. 7. Namenode Daemon UI for single node cluster.
A single node in the cluster is made to function as both a slave and a master. The Namenode administration UI for a cluster of five nodes is shown in Fig. 8; in this case the number of live nodes is 5, and processing takes place on these five nodes. Each slave node runs a Datanode and a Tasktracker, while the master node runs the Namenode, Secondary Namenode and Jobtracker in addition to a Datanode and Tasktracker.
C. Job Execution
The experiment sorts a 1 GB text file of GPS data, in standalone mode as well as in fully distributed mode with two, three, four and five nodes respectively. The text file is first exported to HDFS prior to job execution. The job is executed with the same number of mappers (sixteen, i.e. one per 64 MB input split of the 1 GB file) and reducers (one) in each case. The Jobtracker UI for standalone mode and for a cluster of five nodes is shown in Fig. 9 and Fig. 10 respectively.
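A minimal driver sketch for this workflow is given below: it copies the input file into HDFS and submits the sorting job with a single reducer, reusing the hypothetical mapper and reducer classes sketched in Section III. The local and HDFS paths are illustrative assumptions, not the authors' actual configuration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GpsSortDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Export the GPS text file to HDFS before the job runs
        // (the programmatic equivalent of "hadoop fs -put").
        FileSystem fs = FileSystem.get(conf);
        Path input = new Path("/user/hadoop/gps/input/gps-records.txt"); // hypothetical paths
        Path output = new Path("/user/hadoop/gps/output");
        fs.copyFromLocalFile(new Path("/data/gps-records.txt"), input);

        Job job = new Job(conf, "gps-sort");
        job.setJarByClass(GpsSortDriver.class);
        job.setMapperClass(GpsSortMapper.class);   // classes from the earlier sketch
        job.setReducerClass(GpsSortReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(1);                  // one reducer gives a single, globally sorted output file
        FileInputFormat.addInputPath(job, input);  // the number of map tasks follows from the 64 MB input splits
        FileOutputFormat.setOutputPath(job, output);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}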
D. Results
The experiment was conducted and the results were tabulated. Table I shows the time taken for the job in each case with respect to the number of nodes in the cluster. In pseudo-distributed mode the job took 16 minutes 20 seconds to complete. In fully distributed mode with 2, 3, 4 and 5 nodes it took 11 minutes 17 seconds, 8 minutes 29 seconds, 4 minutes 35 seconds and 3 minutes 48 seconds respectively. It is inferred from the results that as the number of nodes in the cluster increases, the processing time of the job decreases, due to the distributed processing of the data on the different nodes. Fig. 11 depicts the experimental results graphically.

TABLE I. EXPERIMENTAL RESULTS
Number of nodes in the cluster (N)    Time taken for the processing (min:sec)
N = 1                                 16:20
N = 2                                 11:17
N = 3                                  8:29
N = 4                                  4:35
N = 5                                  3:48
Fig. 11. Processing Time versus Number of Nodes in the Hadoop Cluster.

V. SUMMARY
Big Data is the new leading edge for all businesses. With the growing data, the rate of processing and knowledge generation has also increased, and many organizations are coming up with new ways to extract information and knowledge from this voluminous data. Volume, Variety, Velocity, Veracity, Value, Variability (6 V's) and Complexity are the seven attributes of Big Data discussed in the literature. Viability is proposed as a new V in this paper, and two new C's, Cost and Consistency, have been proposed to more precisely define the properties of Big Data. These attributes were then illustrated in the GPS data domain, defining a new standard of 7 V's and 3 C's of Big Data. The paper also discussed Hadoop and Map Reduce as the major tools and techniques for working with Big Data. Experimental results on a case study of the sorting problem in fleet management on GPS data showed that the processing time of a job decreases as the number of nodes in the cluster increases.
ACKNOWLEDGMENT
The authors duly acknowledge the University of Delhi for supporting the work on this paper under research grant number DRCH/R&D/2013-14/4155.
REFERENCES
[1] Agrawal, D., Bernstein, P., Bertino, E., Davidson, S., Dayal, U., Franklin, M., ... Widom, J. (2012). Challenges and Opportunities with Big Data. A community white paper developed by leading researchers across the United States.
[2] Baker, R. (2013, October 24). Coursera.org: Big Data in Education. Retrieved December 19, 2013, from http://class.coursera.org/bigdataedu-001/
[3] Barské-Erdogan, A. (2013). Big Data Whitepaper, version 2.0.
[4] Demchenko, Y., Grosso, P., de Laat, C., & Membrey, P. (2013). Addressing Big Data Issues in Scientific Data Infrastructure. In 2013 International Conference on Collaboration Technologies and Systems (CTS) (pp. 48-55). San Diego, CA: IEEE. doi:10.1109/CTS.2013.6567203
[5] Demchenko, Y., Worring, M., Los, W., & de Laat, C. (2013, September 16). 2013-09-16-rda-bdaf.pdf. Retrieved December 11, 2013, from http://www.delaat.net/~cees/posters/2013-09-16-rda-bdaf.pdf
[6] Dumbill, E. (2012). Getting Up to Speed with Big Data: What is Big Data. Sebastopol: O'Reilly.
[7] Gantz, J., & Reinsel, D. (2011). The 2011 Digital Universe Study: Extracting Value from Chaos.
[8] Gentile, B. (2012, June 19). big-data-myths. Retrieved November 25, 2013, from http://mashable.com/2012/06/19/big-data-myths/
[9] James, L. (2012, October 2). why-big-data-and-business-intelligence-one-direction. Retrieved November 25, 2013, from http://smartdatacollective.com/yellowfin/75616/why-big-data-and-business-intelligence-one-direction
[10] Kaisler, S., Armour, F., Espinosa, J. A., & Money, W. (2013). Big Data: Issues and Challenges Moving Forward. In 46th Hawaii International Conference on System Sciences (pp. 995-1004). IEEE.
[11] Katal, A., Wazid, M., & Goudar, R. H. (2013). Big Data: Issues, Challenges, Tools and Good Practices. IEEE, 404-409.
[12] Leber, J. (2013). A Business Report on Big Data Gets Personal. MIT Technology Review. Retrieved from www.technologyreview.com
[13] Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big Data: The Next Frontier for Innovation, Competition, and Productivity. McKinsey Global Institute.
[14] Matsudaira, K. (2013, November 20). ACM Learning Center Webcast: Big Data Without Big Database - Extreme In-Memory Caching for Better Performance. Retrieved from learning.acm.org/webinar/
[15] Warden, P. (2012). Big Data Glossary. Sebastopol: O'Reilly Media.
[16] White, T. (2012). Hadoop: The Definitive Guide (3rd ed.). Sebastopol: O'Reilly.
[17] Zikopoulos, P. C., Eaton, C., deRoos, D., Deutsch, T., & Lapis, G. (2012). Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. McGraw-Hill.
[18] Zikopoulos, P., deRoos, D., & Corrigan, K. P. (2013). Harness the Power of Big Data: The IBM Big Data Platform. McGraw-Hill.
[19] Apache Hadoop Releases. http://hadoop.apache.org/releases.html