Various Computing Models in the Hadoop Ecosystem along with the Perspective of Analytics using R and Machine Learning

Uma Pavan Kumar Kethavarapu, PhD Research Scholar, CSE Department, Pondicherry Engineering College. [email protected]

Dr. Lakshma Reddy Bhavanam, Principal, BCC College, Bangalore. [email protected]

Nivetha, M.Tech IT, Pondicherry Engineering College. [email protected]

Abstract: The advent of the mobile revolution and sensor data is causing the generation of huge amounts of data from various levels of users and various categories of applications. The main challenges in this context are the storage logic and the processing logic. The other important requirement is the generation of strategic and valuable information, since estimating user behaviour and user preferences is mandatory. In the current paper an attempt is made to figure out how the Hadoop ecosystem is used to store and process the data. To explain the processing and storage, the usage of MapReduce and HDFS along with Pig and Hive is considered. The other goal is analytics on the populated data, for which R and machine learning are taken as the reference. The ultimate goal of this paper is to study the computing scenarios that exist in the processing of bulk data, along with a mention of analytics.

Keywords: Computing, HDFS, MapReduce, R, Machine Learning, Mahout

I. Introduction

Big data refers to data sets so large and complex that it becomes difficult to store and manage the data using traditional Relational Database Management System (RDBMS) tools. Every day we generate 2.5 quintillion bytes of data, and 90% of the world's collected data has been generated only in the last few years. Data sizes are now measured in terabytes, petabytes, exabytes and zettabytes.

Figure 1: Data Metrics (Source: Skill Speed R & D)

The existing paradigm only allows scale-up of systems, but the requirement is scale-out. Here scale-up refers to improving the capacity of RAM and processor, which has its limitations; scale-out refers to adding additional systems to the existing ones so as to achieve the highest computing capacity. Some of the common big data scenarios are sports, the health care domain, Amazon services and Netflix. In sports, the strike rate and average of a player need to be displayed while he is entering to play, along with the advertisement of the company for which the player is an ambassador. In the health care domain, the prediction of certain diseases based on family history and other parameters gives doctors a better basis for treating patients. The Amazon service recommends products to users based on their past selections, and to do so Amazon has to store huge amounts of data. Similarly, Netflix is a video streaming service that currently has 83 million subscribers all over the world and a net income of 122 million US dollars; the company applies a video recommendation algorithm for its subscribers based on their last 3 video viewings. So big data scenarios are popular and generate revenue by estimating user interests with the help of huge data storage and analytics on that data.

The organization of the paper is as follows. Section II discusses the storage logic and processing logic used to handle huge-data problems and their complexities. Section III describes the Hadoop ecosystem tools Pig and Hive, along with the computing methods used to process and store huge data. Section IV discusses the analytics used to get a meaningful insight into the data, which helps companies serve their customers in a better way; to perform this, the usage of R and some machine learning context is described.

II. Storage and Computing Framework of Hadoop

Hadoop is a framework to store and process big data. It provides distributed storage to handle bulk data, and it offers the MapReduce framework to process that bulk data in a parallel and distributed way. The Hadoop Distributed File System (HDFS) allows the user to store data in a replicated manner on a cluster of commodity hardware, and it provides availability, reliability and fault tolerance for the data. To manage this, Hadoop depends on core-site.xml and hdfs-site.xml. The core-site.xml file contains the temporary storage directory on the system along with the port number on the local host that identifies the storage service. Similarly, hdfs-site.xml specifies the replication factor, i.e. how many times a file is replicated in storage. The processing side of Hadoop is MapReduce, a framework for parallel and distributed processing. The computing style here is that the algorithm or program is moved to the bulk data to process it, which gives Hadoop its computing capacity with less data movement between the systems in the cluster. The mapred-site.xml file contains the MapReduce details, including the port number used to track job execution status. The combination of HDFS and MapReduce is known as Hadoop. HDFS follows a master-slave architecture, with the NameNode, JobTracker and Secondary NameNode on the master and the DataNode and TaskTracker on the slave nodes. Minimal examples of these configuration files are sketched below.
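For illustration only, minimal versions of these configuration files might look like the following; the directory, host name, port numbers and replication value are assumed placeholders, not the values used in this work.

core-site.xml:
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>        <!-- temporary storage directory -->
    <value>/app/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>       <!-- storage (HDFS) service on the local host -->
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>       <!-- how many times each file is replicated -->
    <value>3</value>
  </property>
</configuration>

mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>    <!-- job submission and status service -->
    <value>localhost:9001</value>
  </property>
</configuration>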

Figure 2: Hadoop Distributed File System

Figure 2 describes the management of the data storage and processing logic in a fault-tolerant mode: while storing the data, replication ensures that if one copy of a file is unavailable, HDFS can access another copy. For processing, Hadoop depends on MapReduce; while running a job, if the job cannot finish for some reason, failover and speculative execution recover the failed job. Here we present a case study of wiki page counts, in which we take bulk data from the Wikimedia page counts; the aim of the implementation is to generate the number of clicks that have happened so far for a specific search engine. The page counts are stored yearly and monthly by Wikimedia, and the data sets can be downloaded from the Wikimedia page counts directory. To perform the wiki page counts we have taken the source file pagecounts 20160701-000000 of size 72 MB. Users enter their queries into search engines to get relevant data, and that activity is stored in the form of log files; the log file we have taken is pagecounts 20160701-000000. With the help of MapReduce we obtain the clicks that happened for a particular search engine in the form of page counts. The estimation of the clicks is not a simple task, as the data is captured in unstructured format and the clicks have to be separated from the Wikimedia data. The MapReduce implementation involves a Mapper class with the business logic, a Reducer class that sums up the clicks, and a Driver class that contains the job configuration and the information about input and output along with the mapper and reducer classes. The expected output is the clicks per page. The outcome of mining the wiki page count data with the MapReduce paradigm is shown in Figure 3 below.
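A simplified sketch of such a job is shown here. It mirrors the Mapper/Reducer/Driver structure just described, but the class names and the assumption that every pagecount line has the form "project page count bytes" are illustrative rather than the exact code used in this experiment.

// Illustrative sketch only: sums requests per project prefix from Wikimedia
// pagecount lines assumed to look like "project page count bytes".
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageCountJob {

  public static class PageCountMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Each input line: "project page count bytes" (space separated).
      String[] fields = value.toString().split("\\s+");
      if (fields.length >= 3) {
        try {
          long clicks = Long.parseLong(fields[2]);
          context.write(new Text(fields[0]), new LongWritable(clicks));
        } catch (NumberFormatException e) {
          // Skip malformed records in the unstructured log data.
        }
      }
    }
  }

  public static class PageCountReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable v : values) {
        sum += v.get();      // Sum the clicks for one project key.
      }
      context.write(key, new LongWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    // Driver: job configuration, mapper/reducer classes, input and output paths.
    Job job = Job.getInstance(new Configuration(), "wiki page counts");
    job.setJarByClass(PageCountJob.class);
    job.setMapperClass(PageCountMapper.class);
    job.setReducerClass(PageCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Such a job would be packaged as a jar and submitted with the hadoop jar command, taking an HDFS input directory containing the pagecounts file and an HDFS output directory as arguments.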

16/07/30 02:17:19 INFO mapred.JobClient:  map 100% reduce 100%
16/07/30 02:17:24 INFO mapred.JobClient:    SLOTS_MILLIS_MAPS=411237
16/07/30 02:17:24 INFO mapred.JobClient:    Bytes Written=57687668
16/07/30 02:17:24 INFO mapred.JobClient:  FileSystemCounters
16/07/30 02:17:24 INFO mapred.JobClient:    FILE_BYTES_READ=149709248
16/07/30 02:17:24 INFO mapred.JobClient:    HDFS_BYTES_READ=411076865
16/07/30 02:17:24 INFO mapred.JobClient:    FILE_BYTES_WRITTEN=224737138
16/07/30 02:17:24 INFO mapred.JobClient:    HDFS_BYTES_WRITTEN=57687668
16/07/30 02:17:24 INFO mapred.JobClient:    Bytes Read=411076158
16/07/30 02:17:24 INFO mapred.JobClient:    Map output materialized bytes=74854633
16/07/30 02:17:24 INFO mapred.JobClient:    Map input records=7447982
16/07/30 02:17:24 INFO mapred.JobClient:    Reduce shuffle bytes=74854633
16/07/30 02:17:24 INFO mapred.JobClient:    Spilled Records=6461331
16/07/30 02:17:24 INFO mapred.JobClient:    Map output bytes=70537267
16/07/30 02:17:24 INFO mapred.JobClient:    CPU time spent (ms)=144260
16/07/30 02:17:24 INFO mapred.JobClient:    Total committed heap usage (bytes)=1153089536
16/07/30 02:17:24 INFO mapred.JobClient:    Combine input records=0
16/07/30 02:17:24 INFO mapred.JobClient:    SPLIT_RAW_BYTES=707
16/07/30 02:17:24 INFO mapred.JobClient:    Reduce input records=2153777
16/07/30 02:17:24 INFO mapred.JobClient:    Reduce input groups=2153777
16/07/30 02:17:24 INFO mapred.JobClient:    Combine output records=0
16/07/30 02:17:24 INFO mapred.JobClient:    Physical memory (bytes) snapshot=1345396736
16/07/30 02:17:24 INFO mapred.JobClient:    Reduce output records=2153777
16/07/30 02:17:24 INFO mapred.JobClient:    Virtual memory (bytes) snapshot=3001929728
16/07/30 02:17:24 INFO mapred.JobClient:    Map output records=2153777

Figure 3: MapReduce output for the wiki page counts logic. The output above shows the resources utilized in the execution of the wiki page counts job.

III. Pig and Hive in the Hadoop Ecosystem

The Hadoop ecosystem is a stack of various tools which can be categorized into the following areas:
a. Storage and Processing Logic
b. Data Ingestion Tools
c. NoSQL Databases
d. Security
e. Workflow Scheduling
The ecosystem can be observed in the diagram below (Figure 4).

Figure 4: Hadoop Ecosystem (Source: Google)

HDFS is the storage component, and MapReduce is used for distributed and parallel processing of the data. Sqoop and Flume are used to ingest data from relational and unstructured sources. ZooKeeper is used as a coordination agent, and Oozie is used for creating job scheduling with workflows. The alternative to Java MapReduce is scripting through Pig; for statistics one can use R connectors, and Hive provides an SQL-style query interface. NoSQL databases are emerging in the market to store bulk data with easy and flexible data management commands. In addition, support for machine learning, with algorithms such as classification, clustering, association mining, click-stream analysis and sentiment analysis based on recommendation engines, is achieved through Mahout. With Pig, rapid development is possible; in real-time scenarios Pig is used by many companies as a data flow language to analyse web crawls and clicks. The computing model followed by Pig is a bit different: we can interact with Pig either through the Grunt shell or through PigServer, in local mode or MapReduce mode. The parser transforms Pig Latin into a logical plan, the optimizer selects the best logical plan, and the compiler converts the logical plan into a physical plan, which in turn is converted into a MapReduce plan. In any case, both the storage and the processing of Pig depend on HDFS and MapReduce of the Hadoop framework; a sketch of the PigServer route is given below.
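As an illustration of the PigServer route, the hypothetical Java snippet below embeds a small Pig Latin page-count script; the aliases echo those reported in the statistics that follow, but the exact script, field layout and paths are assumptions rather than the code used in this experiment.

// Sketch of embedding Pig in Java through PigServer, as an alternative to the
// Grunt shell. The Pig Latin lines form an assumed page-count script.
import java.io.IOException;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigPageCounts {
  public static void main(String[] args) throws IOException {
    PigServer pig = new PigServer(ExecType.MAPREDUCE);  // or ExecType.LOCAL
    pig.registerQuery("records = LOAD 'page1' USING PigStorage(' ') "
        + "AS (project:chararray, page:chararray, clicks:long, bytes:long);");
    pig.registerQuery("grp_records = GROUP records BY project;");
    pig.registerQuery("results = FOREACH grp_records GENERATE group, "
        + "SUM(records.clicks) AS total_clicks;");
    pig.registerQuery("sorted_results = ORDER results BY total_clicks DESC;");
    pig.store("sorted_results", "pigopqr");  // triggers compilation into a MapReduce plan
  }
}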

The same wiki page count logic was implemented in Pig in MapReduce mode. MapReduce computing is complex in terms of coding: the developer has to think in map and reduce terms, and functional users who want to implement such logic must know Java and write lengthy programs. The Hadoop ecosystem therefore includes Pig support. The advantage of Pig Latin is that roughly 10 lines of Pig code are equivalent to about 200 lines of Java code; in the wiki page counts scenario the MapReduce code is 70 lines whereas Pig requires just 7 lines, so it is simple for any functional user to code the logic with simple constructs.

INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2016-07-30 05:45:14,656 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:

Job Stats (time in seconds), condensed:
JobId                  Maps  Reduces  Aliases                                  Feature            Outputs
job_201607300209_0002  7     1        fil_records,grp_records,records,results  GROUP_BY,COMBINER
job_201607300209_0003  1     1        sorted_results                           SAMPLER
job_201607300209_0004  1     1        sorted_results                           ORDER_BY           hdfs://localhost:6789/user/hdp/pigopqr

Input(s): Successfully read 7447982 records (411078643 bytes) from: "hdfs://localhost:6789/user/hdp/page1"
Output(s): Successfully stored 2153777 records (57687668 bytes) in: "hdfs://localhost:6789/user/hdp/pigopqr"
Counters: Total records written : 2153777  Total bytes written : 57687668

Figure 5: Wiki page counts implementation output with Pig.

The implementation of wiki page counts by Java MapReduce and by Pig has been observed, and the observations are noted here:
• Java MapReduce requires 70 lines of code, whereas Pig requires 7 lines to achieve the same result.
• MapReduce spent 144,260 milliseconds of CPU time, whereas Pig required 896,000 milliseconds.

IV. Importance of Machine Learning with R Analytics and Mahout

The YouTube video recommendation system internally uses a machine learning algorithm to recommend videos to users according to their previously viewed content. To perform this, a seed set is used, which in this case is the watch history together with the likes, favourites and ratings given by the user. Another use case is biometrics, identifying an individual based on the physical, chemical or behavioural attributes of the person. Machine learning is a class of data-driven algorithms which gives the advantage of uncovering hidden patterns that even the best data scientists may overlook. The computing methodology of machine learning (ML) enables analytic algorithms to learn from fresh feeds of data without constant human intervention and without explicit programming; the system learns from the data supplied by the user. There are supervised and unsupervised categories of ML: in supervised learning the training data set includes both the input and the desired results, whereas in unsupervised learning the model is not provided with the correct results during training. Classification and regression can be seen as supervised learning, and clustering as the unsupervised category; a small supervised example is sketched below.
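As a minimal, library-independent illustration of the supervised case, the following plain Java sketch labels an unseen point with the label of its nearest training example; the data points and labels are invented for illustration only.

// Minimal supervised-learning sketch (1-nearest-neighbour classification):
// the training set carries both inputs and desired labels, and a new point
// is labelled by its closest training example.
public class NearestNeighbour {
  public static void main(String[] args) {
    double[][] trainX = {{1.0, 1.2}, {0.8, 0.9}, {6.0, 6.5}, {6.2, 5.8}};
    String[] trainY = {"low", "low", "high", "high"};   // desired results
    double[] query = {5.9, 6.1};                        // unseen input

    int best = 0;
    double bestDist = Double.MAX_VALUE;
    for (int i = 0; i < trainX.length; i++) {
      double dx = trainX[i][0] - query[0];
      double dy = trainX[i][1] - query[1];
      double d = Math.sqrt(dx * dx + dy * dy);
      if (d < bestDist) { bestDist = d; best = i; }   // keep the nearest neighbour
    }
    System.out.println("predicted label: " + trainY[best]);  // prints "high"
  }
}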

Figure 6: Supervised learning process flow (Source: Eureka R & D)

R is a language, so we can create implementations as per requirements in terms of the types and depth of classification models that can be run. LinkedIn has used R for model training, and Google and Facebook use R as well. While working with R, the typical computing model can be described as follows: first the source data is identified, then comes the stage of imputation, followed by partitioning of the data and conversion of the data into an R-readable format (.csv, Excel, SAS, STATA, etc.), then loading into the target and thereafter analysis of the loaded data. ML implementation is also supported by Mahout, which is based on the Hadoop framework; the main goal of Mahout is to build scalable ML libraries. Algorithms such as clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop using the MapReduce paradigm. Twitter uses Mahout for user-interest modelling, Yahoo Mail uses Mahout's frequent pattern set mining, and credit card fraud detection is based on known examples of behaviour for purchases and other credit card transactions. There are computational distinctions between R and Mahout when performing ML: the size of the data that can be analysed in R is limited, whereas Mahout depends on Hadoop, so commodity hardware can be used to store and process bulk data; in R the ML algorithms are applied through libraries and packages, whereas in Mahout they are applied through MapReduce-based parallel and distributed computing. A small clustering sketch of the kind Mahout scales out is given below.
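To make the unsupervised side concrete, the following self-contained Java sketch runs a few iterations of k-means on invented two-dimensional points. Mahout distributes essentially this assignment/update loop over MapReduce; the code here is only an illustration, not Mahout's implementation.

// A minimal, self-contained k-means sketch in plain Java, showing clustering
// as unsupervised learning: no labels are given, the algorithm groups points.
import java.util.Arrays;

public class TinyKMeans {
  public static void main(String[] args) {
    double[][] points = {{1, 1}, {1.5, 2}, {8, 8}, {8.5, 9}, {9, 8}, {1, 0.5}};
    int k = 2;
    double[][] centroids = {points[0], points[2]};   // simple initialisation
    int[] assignment = new int[points.length];

    for (int iter = 0; iter < 10; iter++) {
      // Assignment step: attach every point to its nearest centroid.
      for (int i = 0; i < points.length; i++) {
        double best = Double.MAX_VALUE;
        for (int c = 0; c < k; c++) {
          double d = dist(points[i], centroids[c]);
          if (d < best) { best = d; assignment[i] = c; }
        }
      }
      // Update step: move each centroid to the mean of its assigned points.
      for (int c = 0; c < k; c++) {
        double[] sum = new double[2];
        int count = 0;
        for (int i = 0; i < points.length; i++) {
          if (assignment[i] == c) { sum[0] += points[i][0]; sum[1] += points[i][1]; count++; }
        }
        if (count > 0) centroids[c] = new double[]{sum[0] / count, sum[1] / count};
      }
    }
    System.out.println("assignments: " + Arrays.toString(assignment));
    System.out.println("centroids: " + Arrays.deepToString(centroids));
  }

  private static double dist(double[] a, double[] b) {
    double dx = a[0] - b[0], dy = a[1] - b[1];
    return Math.sqrt(dx * dx + dy * dy);
  }
}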

V. Conclusion

The overall theme of this article is to describe the computational methods used in various emerging technologies and tools. To explain this, Hadoop, R and Mahout are taken as the platform. In the case of Hadoop, the computation depends on HDFS and MapReduce; along with this, the computing method of Pig is explained, which uses the process of a logical plan and a MapReduce plan, and the execution of the Wikimedia page counts has been described so as to bring out the computing differences between Java MapReduce and the Pig MapReduce plan. The ML algorithms are mostly implemented using R and Mahout; the usage of R packages and libraries with a workflow is compared with the Mahout ML implementation, along with their distinctions.