International Journal of Applied Engineering Research, ISSN 0973-4562 Vol. 9 No. 21, 2014 © Research India Publications http://www.ripublication.com/ijaer.htm

RARS: Resource Aware Recommendation System on Hadoop for Big Data Analytics

K. Latha
Department of CSE, Anna University (BIT Campus), Tiruchirappalli, Tamil Nadu, India - 620 024
[email protected]

Raveathul Farzaana M. Y.
Department of CSE, Anna University (BIT Campus), Tiruchirappalli, Tamil Nadu, India - 620 024
[email protected]

Abstract— Hadoop is a popular paradigm for big data analytics in the cloud. The sheer scale of cloud computing deployments makes task assignment in Hadoop an interesting problem. Unlike the default Hadoop scheduler, which can starve small jobs when large jobs hold resources under static slot allocation, the proposed recommendation system recommends a node on which to run each incoming task quickly by checking the job's arrival time and the node's capacity, and it applies dynamic slot allocation on Hadoop by analyzing data locality and the ability of each node, avoiding idle slots so that resources are utilized according to demand. We demonstrate the improvement of RARS using performance metrics such as execution time, data throughput, network delay, utilization ratio of resources, RMSE rate, goodness of data locality and corpus size of data, and the results are shown.

Keywords— Hadoop; Big Data; Job Scheduling; Slot Allocation; Recommendation System

I. INTRODUCTION

In everyday life, dealing with datasets in the order of petabytes or even exabytes [1], [2] is a reality. Big data processing [3] is attracting more attention today, as many companies process enormous amounts of data to gain worthwhile insight into data patterns and behavior that were not observable in the past. Hence, processing such big datasets efficiently is a bare necessity for many users. At this juncture, Hadoop [4] is a framework for processing big data that has rapidly become the de facto standard in both industry and academia [5], [6]. The main causes of such popularity are the simplicity, scalability, and fault-tolerance properties of Hadoop [7]. Hadoop furnishes a simple, ad-hoc solution to issues like web indexing, data sorting, data searching and counting in data centre applications, and it also provides automatic parallelization and distribution of large-scale computations on clusters consisting of many low-cost, often virtualized, commodity machines in a cloud. Companies handling big data, such as Google, Facebook and Yahoo, have been using Hadoop in several forms. A significant aspect of the Hadoop MapReduce regime is its fault-tolerance property. The common open-source implementation, Hadoop, was created primarily at Yahoo, where it runs jobs that produce hundreds of petabytes or exabytes of data on at least 10,000 cores. Apache Hadoop [8] is also used at Facebook, Amazon, and Last.fm. In addition, researchers at Cornell, Carnegie Mellon, the University of

Maryland and PARC are beginning to utilize Hadoop for seismic simulation, natural language processing, and mining and extracting web data. In the Hadoop MapReduce [9] framework, if a Map or Reduce task fails during its execution, it can be processed again on the same machine or on a different machine in the cluster. If a node fails, Hadoop MapReduce reprocesses its tasks on a different machine in the cluster. Moreover, if a node is available but performing poorly, its task is called a straggler task [10], and Hadoop executes a speculative copy [11] (also called a "backup task") on another machine so that the job completes quickly. A common reason for such speculative tasks [12] is faulty hardware allocation or misconfiguration. This can be rectified if the slot allocation [13] of tasks is done properly, by checking the capacity of a node to perform a particular job before scheduling it [14], [15]. This paper focuses on scheduling jobs in the Hadoop framework efficiently, allocating a sufficient node to process the incoming tasks by analyzing the capacity [16] needed to accomplish each particular task.

The rest of the paper is organized as follows: Section 2 describes related work, Section 3 elaborates our methods, experiments and performance analysis in comparison with the existing Hadoop system, Section 4 discusses our work, and Section 5 presents our conclusions and gives a brief overview of future work.

II. RELATED WORK

A few works related to job scheduling in the Hadoop environment are as follows. H. Herodotou et al. (2011) [17] proposed the Starfish technique, a self-tuning framework that can adjust Hadoop's configuration based on a cost-based model and a sampling technique. In [18], J. Polo et al. (2011) introduced Resource-aware Adaptive Scheduling for improving resource utilization by extending the traditional 'task slot' abstraction of Hadoop to a 'job slot', an execution slot that is bound to a particular job and a particular task type (map or reduce) within that job. In [19], Q. Chen et al. (2013) suggested the LATE scheduler, which speculatively executes the task estimated to finish furthest into the future. In [20], Peter Bodík et al. (2014) propounded deadline-aware scheduling to maximize the total value of completed jobs, where the value of each job depends on its completion time. In [21], Yang Wang and Wei Shi (2014) broached budget-driven scheduling using greedy and knapsack algorithms to control time and budget. In [22], M. Hammoud and M. F. Sakr (2011)


introduced a new strategy, the Locality-Aware Reduce Task Scheduler (LARTS), which attempts to collocate reduce tasks with the maximum required data, computed after recognizing input data network locations and sizes. LARTS includes network locations and sizes in its scheduling decisions to diminish network traffic and improve MapReduce performance, and it employs a relaxation strategy, fragmenting some reduce tasks among several cluster nodes. In [23], [24], F. Ahmad et al. (2012) developed a benchmark suite called "PUMA" which represents a broad range of MapReduce applications exhibiting high/low computation and high/low shuffle volumes; three benchmarks from the Hadoop distribution are also slightly modified to take the number of reduce tasks as input from the user and to generate final job-completion-time statistics. Our project focuses on job scheduling, slot allocation and data locality, scheduling jobs in the Hadoop framework efficiently by allocating a sufficient node for each incoming task after analyzing the capacity needed to perform that particular task.

III. PROPOSED WORK

We improve the performance of Hadoop by optimizing slot allocation, assigning the appropriate node to each incoming job. The recommendation system recommends a node to process a particular job by analyzing the node capacity and the resource need of the job, and it further refines the choice by checking the data locality of the current incoming job. Slot allocation is done dynamically by considering the node capacity and data locality, i.e. where the computing data resides. In the pioneer approach, slots are allocated statically, so map and reduce slots can run only their respective task types; jobs are then assigned to whatever free slots are available, and the computing data must be moved to the assigned slot, which overloads the network and increases cost and processing time. A toy sketch contrasting static and dynamic slot allocation is given below.
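To make the contrast concrete, here is a hedged toy sketch of the difference between static, typed slots and the dynamic allocation assumed in this work; the class and method names are illustrative only and not part of Hadoop or RARS.

```python
# Static allocation: a slot is permanently typed and can only run its own task type,
# so reduce slots sit idle while map tasks queue (and vice versa).
class StaticSlot:
    def __init__(self, slot_type):
        self.slot_type = slot_type          # "map" or "reduce"

    def can_run(self, task_type):
        return task_type == self.slot_type

# Dynamic allocation (as assumed by RARS): any free slot may run either task type,
# so idle slots are absorbed by whichever tasks are currently waiting.
class DynamicSlot:
    def can_run(self, task_type):
        return task_type in ("map", "reduce")
```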

Fig. 1. Proposed Architecture Diagram for Hadoop

A. Job Roster

The job roster uses an earliest-arrival, first-processed strategy: the job that has waited longest is passed in first. The waiting time is calculated by subtracting the submission time of the job from the current time, which maintains fairness among users and jobs in the roster. This paves the way for computing the tasks in an orderly and consistent manner rather than arbitrarily.

B. Recommendation System

Fig. 2. Resource Aware Recommendation System (RARS)

Fig. 2 delineates an automatic recommendation system which takes as input the set of split tasks and recommends a node to execute each incoming small component. To choose a node for task execution, the system first determines the features described in Table I and compares them with the past job history; if the job is not similar to any task in the history, it computes a distance function to find the nearest neighbour and assigns the job's resource need accordingly. Subsequently, it tracks the available free nodes and refines them by checking where the computing data resides (data locality). The capacity of each node is then computed to verify whether it is greater than the job's need: if so, the job is marked as a good job for that node and the slot is suggested; if it is a bad job, the cause is analysed and the other slots are examined to suggest a node to run it. This can be described by the algorithmic steps below, followed by a brief illustrative sketch.
Step 1: Get the input jobs from the user on an earliest-arrival, first-processed basis.
Step 2: Split each job into several tasks.
Step 3: The recommendation system receives the split tasks and analyzes their features.
Step 4: Assign the job's resource need by comparing it with the past job history.
Step 5: Prioritize task trackers by checking whether the node capacity is greater than the job need, and refine the list by data locality.
Step 6: The recommendation system recommends the slot to run the job.
Step 7: After execution, the consolidated result is returned to the user.
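As a complement to these steps, the following is a minimal, illustrative Python sketch of the recommendation flow. It is a hedged sketch, not the RARS implementation: the data model is an assumption, task needs and node capacities are reduced to single numeric scores, `data_nodes` stands in for HDFS block locations, and slots are untyped, reflecting the dynamic slot allocation described later in Section III-C.

```python
from dataclasses import dataclass

@dataclass
class Task:
    job_id: str
    need: float          # resource need assigned from the job history (Step 4)
    data_nodes: set      # names of nodes holding this task's input data

@dataclass
class Node:
    name: str
    capacity: float      # free capacity of the node
    free_slots: int      # untyped slots: may run either map or reduce tasks

def recommend_node(task, nodes):
    """Steps 5-6: prefer nodes that hold the task's data and can satisfy its need."""
    candidates = [n for n in nodes if n.free_slots > 0]
    # Data-locality refinement: local nodes first, then the most spare capacity.
    candidates.sort(key=lambda n: (n.name not in task.data_nodes, -n.capacity))
    for node in candidates:
        if node.capacity >= task.need:   # a "good job" for this node
            return node
    return None                          # no suitable slot right now; the task waits

def schedule(jobs, nodes, split):
    """Steps 1-7: earliest-arrival-first job roster, split jobs, place each task."""
    for job in sorted(jobs, key=lambda j: j["submit_time"]):
        for task in split(job):
            node = recommend_node(task, nodes)
            if node is not None:
                node.free_slots -= 1
                node.capacity -= task.need
                print(f"run task of {task.job_id} on {node.name}")
```

A real scheduler would work against Hadoop's JobTracker/TaskTracker (or YARN) interfaces rather than plain Python objects; the sketch only mirrors the decision order of the steps above.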


TABLE I. HADOOP FEATURE DESCRIPTION

Job Features: Job size, job definition, job submission details, job mean CPU usage
Node Features: Number of processors, processor speed, cache memory available, host server
Link Features: Mirror link details, link capacity, URL link status, file upload and download status
Network Features: Internet connection status, server load status, memory usage, location of computing nodes
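To illustrate how the Table I features might feed the distance-based comparison with past job history mentioned in Section III-B, the sketch below encodes a few job features numerically and returns the resource need of the nearest historical job. The feature encoding, field names and example numbers are assumptions for illustration, not taken from the paper.

```python
import math

# Hypothetical numeric encoding of a few of the Table I job features.
def job_vector(job):
    return [job["size_mb"], job["mean_cpu_usage"], job["num_tasks"]]

def distance(a, b):
    """Plain Euclidean distance between two jobs' feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(job_vector(a), job_vector(b))))

def assign_need_from_history(new_job, history):
    """Return the resource need of the most similar past job (nearest neighbour)."""
    nearest = min(history, key=lambda past: distance(new_job, past))
    return nearest["resource_need"]

# Example with made-up numbers: the new job resembles the first historical job.
history = [
    {"size_mb": 512, "mean_cpu_usage": 0.4, "num_tasks": 8, "resource_need": 2.0},
    {"size_mb": 64,  "mean_cpu_usage": 0.1, "num_tasks": 2, "resource_need": 0.5},
]
print(assign_need_from_history({"size_mb": 600, "mean_cpu_usage": 0.5, "num_tasks": 10}, history))
```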

A map task is composed of five phases (see Fig. 3):
1) A Read phase, which loads the input split from HDFS (the Hadoop Distributed File System) and generates the input key-value pairs.
2) A Map phase, which generates the map-output data by running the user-defined map function.
3) A Collect phase, which partitions and collects the intermediate data into a buffer prior to the spill phase.
4) A Spill phase, which is optional: sorting is done, a combine function may be applied and the data may be compressed before it is moved to the local disk.
5) A Merge phase, which consolidates the spill outputs into a single map-output file over multiple iterations.

C. Slot Allocation

In general, idle slots are utilized for running both map and reduce tasks. That is, the system breaks the inherent assumption of the current MapReduce framework that map tasks can run only on map slots and reduce tasks only on reduce slots; instead, it is modified so that either kind of task can run on either map or reduce slots.

D. Map and Reduce Task

The job is to find the best movie among a set of movies using histogram ratings: it generates a histogram of the ratings rather than of the movies' average ratings. Movie rating data is used to bin the ratings 1-5 into 5 bins; map emits a tuple for each review (a rating together with a count of one), and reduce collects all the tuples for a rating and emits a (rating, count) tuple. A Hadoop Streaming style sketch of this job is given after Table II.

TABLE II. JOB SAMPLE

Input Format: {movie_id: userid1_rating1, userid2_rating2, ...}
Output Format: (rating, count) tuples
Datasets: Movie ratings dataset
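For illustration, the histogram-ratings job of Table II can be written as a pair of Hadoop Streaming scripts. This is a hedged sketch rather than the authors' implementation, and it assumes input lines follow the {movie_id:userid1_rating1, userid2_rating2, ...} format shown above.

```python
#!/usr/bin/env python3
# Two Hadoop Streaming roles sketched in one listing; in practice the mapper and
# reducer would live in separate files, each reading lines from sys.stdin.

def run_mapper(lines):
    """mapper: emit "rating<TAB>1" for every review of every movie."""
    for line in lines:
        record = line.strip().strip("{}")
        if not record or ":" not in record:
            continue
        _movie_id, reviews = record.split(":", 1)
        for review in reviews.split(","):
            review = review.strip()
            if not review:
                continue
            _user, rating = review.rsplit("_", 1)   # "userid1_rating1" -> "rating1"
            print(f"{rating}\t1")

def run_reducer(lines):
    """reducer: sum the counts for each rating (input arrives sorted by key)."""
    current, count = None, 0
    for line in lines:
        rating, value = line.rstrip("\n").split("\t")
        if rating != current:
            if current is not None:
                print(f"{current}\t{count}")        # emit the (rating, count) tuple
            current, count = rating, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    # Local smoke test simulating map -> sort -> reduce on one sample record.
    import io, contextlib
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        run_mapper(["{m1:u1_4, u2_5, u3_4,}"])
    run_reducer(sorted(buf.getvalue().splitlines(keepends=True)))
```

On a cluster, each function would be the body of its own script and would be launched with the Hadoop Streaming jar, e.g. `hadoop jar hadoop-streaming.jar -input ratings -output rating_histogram -mapper mapper.py -reducer reducer.py`; the jar path and options vary by Hadoop version, and the input/output paths here are placeholders.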

A reduce task is composed of four phases:
1) A Shuffle phase, which transfers the intermediate data from the mapper nodes to the reducer nodes; decompression or merging may occur.
2) A Merge phase, which combines sorted fragments from the various map tasks to produce the actual input to the reduce function.
3) A Reduce phase, which invokes the user-defined reduce function to generate the final output data.
4) A Write phase, which moves the final output into HDFS; data compression may occur.

E. Evaluation Criteria

In this section we run the simulation (Fig. 4) and evaluate the results with a number of performance metrics: execution time, data throughput, network delay, utilization ratio of resources, job progress percentage, RMSE (Root Mean Square Error) rate, goodness of data locality and corpus size of data. Each metric is described in the following subsections.

Fig. 3. Map and Reduce Task Execution

Fig. 3 depicts the general phases of the map and reduce task execution framework described above.


Fig. 4. Simulation of Proposed RARS


1) Job Execution Time: The execution time of a given task is the time the system spends executing that task, including the time spent executing run-time or system services on its behalf; it is measured in milliseconds. Which task, if any, is charged for the execution time consumed by interrupt handlers and run-time services on behalf of the system is implementation-defined.

Fig. 5. Performance Evaluation Chart (Corpus Size of Data with Execution Time)

2) Data Throughput: Throughput refers to how much data can be transferred from one location to another, or processed, in a specified amount of time. It is used to measure the performance of hard drives and RAM as well as Internet and network connections; data transfer rates for disk drives and networks are expressed as throughput. Throughput is measured in kbps.

Fig. 6. Performance Evaluation Chart (Data Throughput with Execution Time)

3) Network Latency: Network latency expresses how much time it takes for a packet of data to get from one designated point to another. Latency is sometimes measured by sending a packet that is returned to the sender, in which case the round-trip time is taken as the latency; ideally, data would be transmitted almost instantly between one point and another. Contributors to network latency include propagation, transmission, router and other processing delays, and delays at the destination computer and storage.

Fig. 7. Performance Evaluation Chart (Network Latency with Execution Time)

4) Utilization Ratio of Resources: The percentage of resources actually used relative to the total resources available in the Hadoop cluster, as defined in (1).

Utilization Ratio (%) = (Resources actually used / Total resources available) × 100    (1)

Fig. 8. Performance Evaluation Chart (Execution Time with Utilization Ratio of Resources)

5) Job Progress: Job progress describes what percentage of a job has been completed. Percentage of completion is the accounting method used to evaluate a job in progress, and it is measured as a percentage as defined in (2).

Job Progress (%) = (Work completed / Total work of the job) × 100    (2)


Fig. 9. Performance Evaluation Chart (Job Progress with RMSE Rate)

6) RMSE Rate: The Root Mean Square Error (RMSE) rate in (3) estimates how close the predicted values are to the actual values. The closer the predicted (estimated) values are to the actual values, the smaller the RMSE, with RMSE ≈ 0.0 for an almost perfect prediction.

RMSE = sqrt( (1/n) × Σ (pi − ai)² )    (3)

where pi is the predicted output, ai is the actual output and n is the number of predictions.

7) Goodness of Data Locality: The goodness of data locality is defined as the percentage of tasks that achieve node-level data locality. It ranges from 0 to 100 percent; the better the data locality, the lower the data-movement cost of job execution. A small sketch computing these metrics is given below.
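For concreteness, the sketch below computes the metrics defined in (1)-(3) together with the goodness of data locality. It is a direct reading of those definitions, and all the example numbers are hypothetical.

```python
import math

def utilization_ratio(used, total):
    """Eq. (1): percentage of resources actually used out of those available."""
    return 100.0 * used / total

def job_progress(completed_tasks, total_tasks):
    """Eq. (2): percentage of the job's work that has been completed."""
    return 100.0 * completed_tasks / total_tasks

def rmse(predicted, actual):
    """Eq. (3): root mean square error between predicted and actual values."""
    n = len(predicted)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)

def goodness_of_data_locality(tasks_with_local_data, total_tasks):
    """Percentage of tasks that achieve node-level data locality."""
    return 100.0 * tasks_with_local_data / total_tasks

# Hypothetical numbers for illustration only.
print(utilization_ratio(used=42, total=64))                 # 65.625
print(job_progress(completed_tasks=30, total_tasks=40))     # 75.0
print(rmse([10.0, 12.0, 9.5], [11.0, 12.0, 10.0]))          # ~0.645
print(goodness_of_data_locality(18, 20))                    # 90.0
```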

Fig. 10. Performance Evaluation Chart (Execution Time with Goodness of Data Locality)

IV. DISCUSSION

The performance evaluation charts demonstrate the improvement of the proposed RARS over the existing Hadoop system, simulated with a movie-website dataset taken from the Open Movie Database. As shown in Fig. 11, this data collection includes details such as id, image, year, rating, genre, actors and votes, obtained from [25]. The system can tolerate any data size and is flexible enough to run at any time for any data size; its ability to process data of various sizes is depicted in Fig. 5. As data throughput, i.e. the amount of work done per unit of time, increases, the execution time automatically decreases, as shown in Fig. 6. Fig. 7 describes the growth of network latency with the execution time of the Hadoop system, which increases slowly over time. If a job executes on ten nodes it completes in 10 s, and if it executes on twenty nodes it completes in 5 s; therefore, as shown in Fig. 8, the utilization ratio of Hadoop resources decreases as the execution time increases. Fig. 9 shows the evaluation of the RMSE rate against job progress. Normally the RMSE rate increases with the job-progress percentage, but in our system the RMSE rate peaks at the initial stage, when the difference between the actual and predicted values is high, then gradually decreases and may be nil at job completion. If the data resides on the local disk, the time taken to fetch the data for processing is small and the execution time obviously decreases; but if the computing data resides on a remote machine, it has to be transferred to the node responsible for executing the task, which increases the execution time, as represented in Fig. 10.

Fig. 11. Movie Data Collection

V. CONCLUSION AND FUTURE WORK

The open-source Hadoop platform has now become the ideal big data processing platform, and many researchers pay close attention to it as a typical representative of big data processing frameworks.


We develop a recommendation system on Hadoop which suggests the slots to allocate so that an assigned task executes within the prescribed time. This system improves the performance of Hadoop by enhancing its utilization of resources, and it also alleviates straggler tasks, which delay the overall execution time of the resulting job. It further addresses the problem of data locality while allocating slots, avoiding unnecessary network bandwidth for transmitting the computing data and the associated data transmission cost. In the future, we would like to evaluate the recommendation system in heterogeneous environments as well as in cloud computing environments such as Amazon EC2. Another interesting direction is to predict component failures and to examine different workload characteristics, data distributions and cluster sizes; we are particularly interested in exploring the possibilities of sharing and reusing results among different frameworks. The work will also be enhanced to implement the recommendation system for cloud computing environments with more metrics, such as budget and deadline, on different platforms.

References

[1] Forsyth, "For Big Data Analytics There's No Such Thing as Too Big: The Compelling Economics and Technology of Big Data Computing," Forsyth Communications, March 2012.
[2] A. Thusoo et al., "Hive – A Petabyte Scale Data Warehouse Using Hadoop," in ICDE, pp. 996–1005, 2010.
[3] Shweta Pandey and Vrinda Tokekar, "Prominence of MapReduce in Big Data Processing," in Fourth International Conference on Communication Systems and Network Technologies, IEEE, 2014, pp. 555–560.
[4] Hadoop: http://hadoop.apache.org.
[5] M. Zaharia et al., "Improving MapReduce Performance in Heterogeneous Environments," in OSDI, 2008, pp. 29–42.
[6] J. Dittrich, J.-A. Quiané-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad, "Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing)," PVLDB, vol. 3, no. 1, pp. 519–529, 2010.
[7] E. Gothai and P. Balasubramanie, "A Novel Approach for Partitioning in Hadoop Using Round Robin Technique," Journal of Theoretical and Applied Information Technology, vol. 63, no. 2, ISSN: 1992-8645, May 2014.
[8] Hadoop Powered By: http://wiki.apache.org/hadoop/PoweredBy.
[9] J. Dean and S. Ghemawat, "MapReduce: A Flexible Data Processing Tool," CACM, vol. 53, no. 1, pp. 72–77, 2010.
[10] G. Ananthanarayanan, S. Kandula, A. Greenberg, I. Stoica, Y. Lu, B. Saha, and E. Harris, "Reining in the Outliers in Map-Reduce Clusters using Mantri," in OSDI'10, pp. 1–16, 2010.
[11] F. Ahmad, S. Y. Lee, M. Thotte Shanjiang Tang, Bu-Sung Lee, Bingsheng He, "DynamicMR: A Dynamic Slot Allocation Optimization Framework for MapReduce Clusters," IEEE Transactions on Cloud Computing, vol. 25, no. 5, pp. 520–533, August 2014.
[12] Zhenhua Guo, Geoffrey Fox, Mo Zhou, and Yang Ruan, "Improving Resource Utilization in MapReduce," IEEE International Conference on Cluster Computing, pp. 402–410, September 2012.
[13] S. J. Tang, B. S. Lee, and B. S. He, "Dynamic Slot Allocation Technique for MapReduce Clusters," in IEEE Cluster'13, pp. 1–8, 2013.
[14] Kamal Kc and Kemafor Anyanwu, "Scheduling Hadoop Jobs to Meet Deadlines," IEEE CloudCom, 2010.
[15] Geetha J, N. UdayBhaskar, P. ChennaReddy, and Neha Sniha, "Hadoop Scheduler with Deadline Constraint," International Journal on Cloud Computing: Services and Architecture (IJCCSA), vol. 4, no. 5, October 2014.
[16] Arun Murthy, "Understanding Apache Hadoop's Capacity Scheduler," http://hortonworks.com/blog/understanding-apache-hadoops-capacityscheduler/, July 2012.
[17] H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, and S. Babu, "Starfish: A Self-tuning System for Big Data Analytics," in CIDR'11, pp. 261–272, 2011.
[18] J. Polo, C. Castillo, D. Carrera, et al., "Resource-aware Adaptive Scheduling for MapReduce Clusters," in Middleware'11, pp. 187–207, 2011.
[19] Q. Chen, C. Liu, and Z. Xiao, "Improving MapReduce Performance Using Smart Speculative Execution Strategy," IEEE Transactions on Computers, vol. 63, no. 4, pp. 954–967, 2013.
[20] Peter Bodík, Ishai Menache, and Joseph (Seffi) Naor, "Brief Announcement: Deadline-Aware Scheduling of Big-Data Processing Jobs," SPAA'14, Prague, Czech Republic, June 23–25, 2014.
[21] Yang Wang and Wei Shi, "Budget-Driven Scheduling Algorithms for Batches of MapReduce Jobs in Heterogeneous Clouds," IEEE Transactions on Cloud Computing, vol. 26, no. 4, pp. 1343–1357, June 2014.
[22] M. Hammoud and M. F. Sakr, "Locality-Aware Reduce Task Scheduling for MapReduce," in IEEE CLOUDCOM'11, pp. 570–576, 2011.
[23] F. Ahmad, S. Y. Lee, M. Thottethodi, and T. N. Vijaykumar, "PUMA: Purdue MapReduce Benchmarks Suite," ECE Technical Reports, 2012.
[24] PUMA Datasets: https://sites.google.com/site/farazahmad/pumadatasets.
[25] Open Movie Database (OMDb) of the IMDb movie website: http://www.omdbapi.com/.
